cs.CV - 2023-09-18

ProtoKD: Learning from Extremely Scarce Data for Parasite Ova Recognition

  • paper_url: http://arxiv.org/abs/2309.10210
  • repo_url: None
  • paper_authors: Shubham Trehan, Udhav Ramachandran, Ruth Scimeca, Sathyanarayanan N. Aakur
  • for: This study aims to develop reliable computational frameworks for early parasite detection, particularly at the ova (egg) stage.
  • methods: The approach combines prototypical networks and self-distillation to learn robust representations from only one example per class.
  • results: The ProtoKD framework achieves state-of-the-art performance when learning from a single example per class, and its generalizability is further validated on a large-scale taxonomic profiling task.
    Abstract Developing reliable computational frameworks for early parasite detection, particularly at the ova (or egg) stage is crucial for advancing healthcare and effectively managing potential public health crises. While deep learning has significantly assisted human workers in various tasks, its application in diagnostics has been constrained by the need for extensive datasets. The ability to learn from an extremely scarce training dataset, i.e., when fewer than 5 examples per class are present, is essential for scaling deep learning models in biomedical applications where large-scale data collection and annotation can be expensive or not possible (in case of novel or unknown infectious agents). In this study, we introduce ProtoKD, one of the first approaches to tackle the problem of multi-class parasitic ova recognition using extremely scarce data. Combining the principles of prototypical networks and self-distillation, we can learn robust representations from only one sample per class. Furthermore, we establish a new benchmark to drive research in this critical direction and validate that the proposed ProtoKD framework achieves state-of-the-art performance. Additionally, we evaluate the framework's generalizability to other downstream tasks by assessing its performance on a large-scale taxonomic profiling task based on metagenomes sequenced from real-world clinical data.
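As a rough illustration of combining prototypical networks with self-distillation as described above (the `encoder`, `student`, and `teacher` networks and the loss weighting are assumptions of this sketch, not the authors' implementation):

```python
# Minimal sketch: prototypical classification with a self-distillation term,
# assuming a feature extractor and one support example per class.
import torch
import torch.nn.functional as F

def prototype_logits(encoder, support_x, query_x):
    # One support image per class -> each embedding is its class prototype.
    prototypes = encoder(support_x)                # (C, D)
    queries = encoder(query_x)                     # (B, D)
    dists = torch.cdist(queries, prototypes) ** 2  # negative squared distance as logits
    return -dists

def protokd_loss(student, teacher, support_x, query_x, query_y, tau=2.0, alpha=0.5):
    logits_s = prototype_logits(student, support_x, query_x)
    ce = F.cross_entropy(logits_s, query_y)
    with torch.no_grad():
        logits_t = prototype_logits(teacher, support_x, query_x)
    # Self-distillation: match softened student and teacher distributions.
    kd = F.kl_div(F.log_softmax(logits_s / tau, dim=-1),
                  F.softmax(logits_t / tau, dim=-1),
                  reduction="batchmean") * tau ** 2
    return (1 - alpha) * ce + alpha * kd
```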

Image-Text Pre-Training for Logo Recognition

  • paper_url: http://arxiv.org/abs/2309.10206
  • repo_url: None
  • paper_authors: Mark Hubenthal, Suren Kumar
  • for: This paper aims to improve open-set logo recognition.
  • methods: Two contributions are proposed to improve the matching model: (a) pre-training on image-text paired samples, and (b) an improved metric learning loss function, ProxyNCAHN++, which incorporates class-specific hard negative images.
  • results: Pre-training on image-text pairs improves the visual embedder for logo retrieval, especially for text-dominant classes, and the method sets new state-of-the-art results on five public logo datasets.
    Abstract Open-set logo recognition is commonly solved by first detecting possible logo regions and then matching the detected parts against an ever-evolving dataset of cropped logo images. The matching model, a metric learning problem, is especially challenging for logo recognition due to the mixture of text and symbols in logos. We propose two novel contributions to improve the matching model's performance: (a) using image-text paired samples for pre-training, and (b) an improved metric learning loss function. A standard paradigm of fine-tuning ImageNet pre-trained models fails to discover the text sensitivity necessary to solve the matching problem effectively. This work demonstrates the importance of pre-training on image-text pairs, which significantly improves the performance of a visual embedder trained for the logo retrieval task, especially for more text-dominant classes. We construct a composite public logo dataset combining LogoDet3K, OpenLogo, and FlickrLogos-47 deemed OpenLogoDet3K47. We show that the same vision backbone pre-trained on image-text data, when fine-tuned on OpenLogoDet3K47, achieves $98.6\%$ recall@1, significantly improving performance over pre-training on Imagenet1K ($97.6\%$). We generalize the ProxyNCA++ loss function to propose ProxyNCAHN++ which incorporates class-specific hard negative images. The proposed method sets new state-of-the-art on five public logo datasets considered, with a $3.5\%$ zero-shot recall@1 improvement on LogoDet3K test, $4\%$ on OpenLogo, $6.5\%$ on FlickrLogos-47, $6.2\%$ on Logos In The Wild, and $0.6\%$ on BelgaLogo.
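A hedged sketch of a ProxyNCA++-style loss extended with class-specific hard negatives, as ProxyNCAHN++ is described above; the exact formulation in the paper may differ, and the per-class `hard_neg` embeddings are an assumption of this example:

```python
import torch
import torch.nn.functional as F

def proxy_nca_hn_loss(embeddings, labels, proxies, hard_neg, temperature=0.1):
    z = F.normalize(embeddings, dim=-1)            # (B, D) sample embeddings
    p = F.normalize(proxies, dim=-1)               # (C, D) learnable class proxies
    h = F.normalize(hard_neg, dim=-1)              # (C, D) one hard-negative embedding per class
    d_proxy = torch.cdist(z, p) ** 2               # (B, C) distances to all proxies
    d_hard = ((z - h[labels]) ** 2).sum(-1, keepdim=True)  # (B, 1) distance to own hard negative
    # The positive proxy must win against all proxies and the class-specific hard negative.
    logits = torch.cat([-d_proxy, -d_hard], dim=1) / temperature
    return F.cross_entropy(logits, labels)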

Machine Learning for enhancing Wind Field Resolution in Complex Terrain

  • paper_url: http://arxiv.org/abs/2309.10172
  • repo_url: https://github.com/jacobwulffwold/gan_sr_wind_field
  • paper_authors: Jacob Wulff Wold, Florian Stadtmann, Adil Rasheed, Mandar Tabib, Omer San, Jan-Tore Horn
  • for: This work develops a neural-network-based method for producing high-resolution wind fields in complex terrain from low-resolution numerical simulations.
  • methods: The method is based on an enhanced super-resolution generative adversarial network that upscales low-resolution wind fields to high resolution while respecting the local terrain.
  • results: The method successfully reconstructs fully resolved 3D velocity fields and outperforms trilinear interpolation; with an appropriate domain-informed cost function, the need for adversarial training can be alleviated.
    Abstract Atmospheric flows are governed by a broad variety of spatio-temporal scales, thus making real-time numerical modeling of such turbulent flows in complex terrain at high resolution computationally intractable. In this study, we demonstrate a neural network approach motivated by Enhanced Super-Resolution Generative Adversarial Networks to upscale low-resolution wind fields to generate high-resolution wind fields in an actual wind farm in Bessaker, Norway. The neural network-based model is shown to successfully reconstruct fully resolved 3D velocity fields from a coarser scale while respecting the local terrain and that it easily outperforms trilinear interpolation. We also demonstrate that by using appropriate cost function based on domain knowledge, we can alleviate the use of adversarial training.

Specification-Driven Video Search via Foundation Models and Formal Verification

  • paper_url: http://arxiv.org/abs/2309.10171
  • repo_url: None
  • paper_authors: Yunhao Yang, Jean-Raphaël Gaglione, Sandeep Chinchali, Ufuk Topcu
  • for: The paper addresses searching for events of interest in video.
  • methods: The method combines recent advances in vision and language models with formal methods to search video clips for events of interest automatically and efficiently.
  • results: The method searches videos efficiently while preserving privacy, achieving over 90% precision on privacy-sensitive videos and an autonomous driving dataset.
    Abstract The increasing abundance of video data enables users to search for events of interest, e.g., emergency incidents. Meanwhile, it raises new concerns, such as the need for preserving privacy. Existing approaches to video search require either manual inspection or a deep learning model with massive training. We develop a method that uses recent advances in vision and language models, as well as formal methods, to search for events of interest in video clips automatically and efficiently. The method consists of an algorithm to map text-based event descriptions into linear temporal logic over finite traces (LTL$_f$) and an algorithm to construct an automaton encoding the video information. Then, the method formally verifies the automaton representing the video against the LTL$_f$ specifications and adds the pertinent video clips to the search result if the automaton satisfies the specifications. We provide qualitative and quantitative analysis to demonstrate the video-searching capability of the proposed method. It achieves over 90 percent precision in searching over privacy-sensitive videos and a state-of-the-art autonomous driving dataset.
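The pipeline above compiles LTL$_f$ specifications into automata and verifies them against an automaton built from the video; as a toy illustration only, a fixed formula such as F(A ∧ F B) ("A eventually occurs and is later followed by B") can be checked directly on a finite trace of per-frame labels:

```python
from typing import List, Set

def satisfies_a_then_b(trace: List[Set[str]], a: str, b: str) -> bool:
    """True if some frame contains `a` and a strictly later frame contains `b`."""
    for i, frame_labels in enumerate(trace):
        if a in frame_labels:
            return any(b in later for later in trace[i + 1:])
    return False

# Example: labels produced per frame by a vision-language model.
trace = [{"car"}, {"car", "pedestrian"}, {"pedestrian", "collision"}]
print(satisfies_a_then_b(trace, "pedestrian", "collision"))  # True -> clip added to results
```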

Offline Detection of Misspelled Handwritten Words by Convolving Recognition Model Features with Text Labels

  • paper_url: http://arxiv.org/abs/2309.10158
  • repo_url: None
  • paper_authors: Andrey Totev, Tomas Ward
  • for: This work targets handwriting recognition scenarios where out-of-vocabulary words are expected, e.g., detecting misspelled words that language-model post-processing cannot handle.
  • methods: The authors propose an unrestricted binary classifier consisting of a handwriting recognition (HWR) feature extractor and a multimodal classification head that convolves the extractor output with a vector representation of the input text; the classification head is trained entirely on synthetic data created with a state-of-the-art generative adversarial network.
  • results: While maintaining high recall, the classifier can be calibrated to achieve an average precision increase of 19.5% compared to addressing the task directly with state-of-the-art HWR models, which could substantially raise productivity in human-in-the-loop automation.
    Abstract Offline handwriting recognition (HWR) has improved significantly with the advent of deep learning architectures in recent years. Nevertheless, it remains a challenging problem and practical applications often rely on post-processing techniques for restricting the predicted words via lexicons or language models. Despite their enhanced performance, such systems are less usable in contexts where out-of-vocabulary words are anticipated, e.g. for detecting misspelled words in school assessments. To that end, we introduce the task of comparing a handwriting image to text. To solve the problem, we propose an unrestricted binary classifier, consisting of a HWR feature extractor and a multimodal classification head which convolves the feature extractor output with the vector representation of the input text. Our model's classification head is trained entirely on synthetic data created using a state-of-the-art generative adversarial network. We demonstrate that, while maintaining high recall, the classifier can be calibrated to achieve an average precision increase of 19.5% compared to addressing the task by directly using state-of-the-art HWR models. Such massive performance gains can lead to significant productivity increases in applications utilizing human-in-the-loop automation.
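A sketch of the kind of multimodal classification head described above, which correlates HWR feature-extractor output with a vector representation of the candidate text; the layer shapes and pooling choices here are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ImageTextMatchHead(nn.Module):
    def __init__(self, feat_dim=256, vocab_size=100, text_dim=256):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, text_dim)
        self.conv = nn.Conv1d(feat_dim, text_dim, kernel_size=3, padding=1)

    def forward(self, hwr_features, text_ids):
        # hwr_features: (B, T, feat_dim) sequence from the HWR feature extractor
        # text_ids:     (B, L) character indices of the candidate transcription
        img = self.conv(hwr_features.transpose(1, 2))     # (B, text_dim, T)
        txt = self.char_embed(text_ids).mean(dim=1)       # (B, text_dim) pooled text vector
        # Correlate the text vector with the image feature sequence and pool over time.
        response = torch.einsum("bdt,bd->bt", img, txt)   # (B, T)
        return response.max(dim=1).values                 # matching logit; train with BCEWithLogitsLoss
```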

Preserving Tumor Volumes for Unsupervised Medical Image Registration

  • paper_url: http://arxiv.org/abs/2309.10153
  • repo_url: None
  • paper_authors: Qihua Dong, Hao Du, Ying Song, Yan Xu, Jing Liao
  • for: This paper addresses disproportionate volume changes in medical image registration, aiming to keep tumor volumes unchanged during registration.
  • methods: A two-stage approach is proposed: similarity-based registration first identifies potential tumor regions via their volume change and produces a soft tumor mask; a second, volume-preserving registration stage then applies a novel adaptive volume-preserving loss weighted by that mask.
  • results: The method successfully preserves tumor volumes while achieving registration results comparable to state-of-the-art approaches.
    Abstract Medical image registration is a critical task that estimates the spatial correspondence between pairs of images. However, current traditional and deep-learning-based methods rely on similarity measures to generate a deforming field, which often results in disproportionate volume changes in dissimilar regions, especially in tumor regions. These changes can significantly alter the tumor size and underlying anatomy, which limits the practical use of image registration in clinical diagnosis. To address this issue, we have formulated image registration with tumors as a constraint problem that preserves tumor volumes while maximizing image similarity in other normal regions. Our proposed strategy involves a two-stage process. In the first stage, we use similarity-based registration to identify potential tumor regions by their volume change, generating a soft tumor mask accordingly. In the second stage, we propose a volume-preserving registration with a novel adaptive volume-preserving loss that penalizes the change in size adaptively based on the masks calculated from the previous stage. Our approach balances image similarity and volume preservation in different regions, i.e., normal and tumor regions, by using soft tumor masks to adjust the imposition of volume-preserving loss on each one. This ensures that the tumor volume is preserved during the registration process. We have evaluated our strategy on various datasets and network architectures, demonstrating that our method successfully preserves the tumor volume while achieving comparable registration results with state-of-the-art methods. Our codes is available at: \url{https://dddraxxx.github.io/Volume-Preserving-Registration/}.
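One plausible form of an adaptive volume-preserving loss in the spirit of the description above (an assumed formulation, not the authors' exact loss): penalize local volume change of the deformation, measured by the Jacobian determinant, weighted by the soft tumor mask from the first stage:

```python
import torch

def jacobian_det_2d(disp):
    """disp: (B, 2, H, W) displacement field; returns local area-change ratio det(I + grad u)."""
    dudx = torch.gradient(disp[:, 0], dim=2)[0]
    dudy = torch.gradient(disp[:, 0], dim=1)[0]
    dvdx = torch.gradient(disp[:, 1], dim=2)[0]
    dvdy = torch.gradient(disp[:, 1], dim=1)[0]
    return (1 + dudx) * (1 + dvdy) - dudy * dvdx

def adaptive_volume_loss(disp, soft_tumor_mask):
    """Penalize deviation of the Jacobian determinant from 1, weighted by the soft tumor mask."""
    det = jacobian_det_2d(disp)
    return (soft_tumor_mask * (det - 1.0).abs()).mean()
```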

Deep Prompt Tuning for Graph Transformers

  • paper_url: http://arxiv.org/abs/2309.10131
  • repo_url: https://github.com/hieu9955/ggggg
  • paper_authors: Reza Shirkavand, Heng Huang
  • for: This paper proposes an alternative to fine-tuning for adapting graph transformers to downstream graph-based prediction tasks.
  • methods: The proposed "deep graph prompt tuning" adds trainable feature nodes and prepends task-specific tokens to the graph transformer while freezing the pre-trained parameters, enhancing the model's expressive power with few free parameters.
  • results: Experiments on various-sized datasets show that deep graph prompt tuning achieves comparable or even superior performance to fine-tuning while using significantly fewer task-specific parameters.
    Abstract Graph transformers have gained popularity in various graph-based tasks by addressing challenges faced by traditional Graph Neural Networks. However, the quadratic complexity of self-attention operations and the extensive layering in graph transformer architectures present challenges when applying them to graph based prediction tasks. Fine-tuning, a common approach, is resource-intensive and requires storing multiple copies of large models. We propose a novel approach called deep graph prompt tuning as an alternative to fine-tuning for leveraging large graph transformer models in downstream graph based prediction tasks. Our method introduces trainable feature nodes to the graph and pre-pends task-specific tokens to the graph transformer, enhancing the model's expressive power. By freezing the pre-trained parameters and only updating the added tokens, our approach reduces the number of free parameters and eliminates the need for multiple model copies, making it suitable for small datasets and scalable to large graphs. Through extensive experiments on various-sized datasets, we demonstrate that deep graph prompt tuning achieves comparable or even superior performance to fine-tuning, despite utilizing significantly fewer task-specific parameters. Our contributions include the introduction of prompt tuning for graph transformers, its application to both graph transformers and message passing graph neural networks, improved efficiency and resource utilization, and compelling experimental results. This work brings attention to a promising approach to leverage pre-trained models in graph based prediction tasks and offers new opportunities for exploring and advancing graph representation learning.
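A minimal sketch of the prompt-tuning recipe described above, assuming a generic frozen graph transformer that consumes a token sequence; only the prompt tokens and the prediction head are trained:

```python
import torch
import torch.nn as nn

class GraphPromptTuner(nn.Module):
    def __init__(self, frozen_graph_transformer, hidden_dim, num_prompts, num_classes):
        super().__init__()
        self.backbone = frozen_graph_transformer
        for p in self.backbone.parameters():
            p.requires_grad = False                      # backbone stays frozen
        self.prompts = nn.Parameter(torch.randn(num_prompts, hidden_dim) * 0.02)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, node_feats):
        # node_feats: (B, N, hidden_dim) node embeddings of a graph
        B = node_feats.size(0)
        tokens = self.prompts.unsqueeze(0).expand(B, -1, -1)
        x = torch.cat([tokens, node_feats], dim=1)       # prepend task-specific prompt tokens
        h = self.backbone(x)                             # frozen pre-trained transformer
        return self.head(h.mean(dim=1))                  # graph-level prediction
```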

Pre-training on Synthetic Driving Data for Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2309.10121
  • repo_url: None
  • paper_authors: Yiheng Li, Seth Z. Zhao, Chenfeng Xu, Chen Tang, Chenran Li, Mingyu Ding, Masayoshi Tomizuka, Wei Zhan
  • for: This work aims to improve the generalization of trajectory prediction models for autonomous driving under limited data availability.
  • methods: The authors augment both HD maps and trajectories and apply pre-training on top of them: graph representations of HD maps are reshaped with vector transformations to enrich the limited number of scenes, and a rule-based model generates additional trajectories in the augmented scenes; several pre-training strategies, including a Masked AutoEncoder (MAE) extension for trajectory forecasting, are explored.
  • results: The data expansion and pre-training strategies outperform the baseline prediction model by large margins, improving $MR_6$, $minADE_6$, and $minFDE_6$ by 5.04%, 3.84%, and 8.30%, respectively.
    Abstract Accumulating substantial volumes of real-world driving data proves pivotal in the realm of trajectory forecasting for autonomous driving. Given the heavy reliance of current trajectory forecasting models on data-driven methodologies, we aim to tackle the challenge of learning general trajectory forecasting representations under limited data availability. We propose to augment both HD maps and trajectories and apply pre-training strategies on top of them. Specifically, we take advantage of graph representations of HD-map and apply vector transformations to reshape the maps, to easily enrich the limited number of scenes. Additionally, we employ a rule-based model to generate trajectories based on augmented scenes; thus enlarging the trajectories beyond the collected real ones. To foster the learning of general representations within this augmented dataset, we comprehensively explore the different pre-training strategies, including extending the concept of a Masked AutoEncoder (MAE) for trajectory forecasting. Extensive experiments demonstrate the effectiveness of our data expansion and pre-training strategies, which outperform the baseline prediction model by large margins, e.g. 5.04%, 3.84% and 8.30% in terms of $MR_6$, $minADE_6$ and $minFDE_6$.
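A small sketch of the scene-augmentation idea (details assumed): HD-map elements and trajectories in vector form are enriched by applying the same random rigid transformation to both, so augmented scenes remain geometrically consistent:

```python
import numpy as np

def augment_scene(polylines, trajectories, max_rot_deg=30.0):
    """polylines/trajectories: lists of (N_i, 2) arrays of x, y points in the scene frame."""
    theta = np.deg2rad(np.random.uniform(-max_rot_deg, max_rot_deg))
    flip = np.random.rand() < 0.5
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])

    def transform(pts):
        pts = pts @ R.T                       # random rotation
        if flip:
            pts = pts * np.array([1.0, -1.0])  # optional mirror about the x-axis
        return pts

    return [transform(p) for p in polylines], [transform(t) for t in trajectories]
```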

GEDepth: Ground Embedding for Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2309.09975
  • repo_url: https://github.com/qcraftai/gedepth
  • paper_authors: Xiaodong Yang, Zhuang Ma, Zhiyu Ji, Zhe Ren
  • for: To improve the generalizability of monocular depth estimators by decoupling camera parameters from pictorial cues.
  • methods: A novel ground embedding module generates a ground depth from the camera parameters, stacks it with the input image, and combines it with residual depth via a ground attention mechanism.
  • results: Experiments show state-of-the-art results on popular benchmarks and significant generalization improvements on a wide range of cross-domain tests.
    Abstract Monocular depth estimation is an ill-posed problem as the same 2D image can be projected from infinite 3D scenes. Although the leading algorithms in this field have reported significant improvement, they are essentially geared to the particular compound of pictorial observations and camera parameters (i.e., intrinsics and extrinsics), strongly limiting their generalizability in real-world scenarios. To cope with this challenge, this paper proposes a novel ground embedding module to decouple camera parameters from pictorial cues, thus promoting the generalization capability. Given camera parameters, the proposed module generates the ground depth, which is stacked with the input image and referenced in the final depth prediction. A ground attention is designed in the module to optimally combine ground depth with residual depth. Our ground embedding is highly flexible and lightweight, leading to a plug-in module that is amenable to be integrated into various depth estimation networks. Experiments reveal that our approach achieves the state-of-the-art results on popular benchmarks, and more importantly, renders significant generalization improvement on a wide range of cross-domain tests.
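Under a flat-ground assumption, the ground depth referenced above can be derived from the camera height and intrinsics: a pixel on image row v below the principal point sees the ground at depth z = h·f_y/(v − c_y). The sketch below uses this simplified derivation (the paper's handling of pitch and terrain may differ):

```python
import numpy as np

def ground_depth_map(height, f_y, c_y, image_h, image_w, max_depth=80.0):
    """Per-pixel depth of a flat ground plane seen by a camera at `height` above it."""
    v = np.arange(image_h, dtype=np.float32).reshape(-1, 1)
    denom = v - c_y
    depth = np.where(denom > 1e-3, height * f_y / np.maximum(denom, 1e-3), max_depth)
    depth = np.clip(depth, 0.0, max_depth)
    return np.broadcast_to(depth, (image_h, image_w)).copy()  # (H, W) ground-depth prior
```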

Parameter-Efficient Long-Tailed Recognition

  • paper_url: http://arxiv.org/abs/2309.10019
  • repo_url: https://github.com/shijxcs/pel
  • paper_authors: Jiang-Xin Shi, Tong Wei, Zhi Zhou, Xin-Yan Han, Jie-Jing Shao, Yu-Feng Li
  • for: This paper proposes a fast fine-tuning method for adapting pre-trained models to long-tailed recognition tasks.
  • methods: The proposed PEL method adds a small number of task-specific parameters (following existing parameter-efficient fine-tuning designs) to a pre-trained model, reaching good performance in fewer than 20 epochs without extra data; it also introduces a semantic-aware classifier initialization derived from the CLIP text encoder with no added computational overhead.
  • results: Experiments on four long-tailed datasets show that PEL consistently outperforms previous state-of-the-art approaches.
    Abstract The "pre-training and fine-tuning" paradigm in addressing long-tailed recognition tasks has sparked significant interest since the emergence of large vision-language models like the contrastive language-image pre-training (CLIP). While previous studies have shown promise in adapting pre-trained models for these tasks, they often undesirably require extensive training epochs or additional training data to maintain good performance. In this paper, we propose PEL, a fine-tuning method that can effectively adapt pre-trained models to long-tailed recognition tasks in fewer than 20 epochs without the need for extra data. We first empirically find that commonly used fine-tuning methods, such as full fine-tuning and classifier fine-tuning, suffer from overfitting, resulting in performance deterioration on tail classes. To mitigate this issue, PEL introduces a small number of task-specific parameters by adopting the design of any existing parameter-efficient fine-tuning method. Additionally, to expedite convergence, PEL presents a novel semantic-aware classifier initialization technique derived from the CLIP textual encoder without adding any computational overhead. Our experimental results on four long-tailed datasets demonstrate that PEL consistently outperforms previous state-of-the-art approaches. The source code is available at https://github.com/shijxcs/PEL.
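A sketch of the semantic-aware classifier initialization described above, assuming the OpenAI `clip` package; the prompt template and model variant are assumptions of this example:

```python
import torch
import torch.nn.functional as F
import clip

@torch.no_grad()
def init_classifier_from_text(class_names, device="cpu"):
    model, _ = clip.load("ViT-B/32", device=device)
    tokens = clip.tokenize([f"a photo of a {name}." for name in class_names]).to(device)
    text_feats = model.encode_text(tokens).float()
    weights = F.normalize(text_feats, dim=-1)            # (C, D) one row per class
    classifier = torch.nn.Linear(weights.shape[1], weights.shape[0], bias=False)
    classifier.weight.data.copy_(weights)                # classes start semantically placed
    return classifier
```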

vSHARP: variable Splitting Half-quadratic ADMM algorithm for Reconstruction of inverse-Problems

  • paper_url: http://arxiv.org/abs/2309.09954
  • repo_url: None
  • paper_authors: George Yiasemis, Nikita Moriakov, Jan-Jakob Sonke, Jonas Teuwen
  • for: This work proposes a deep-learning-based method for solving ill-posed inverse problems in medical imaging (MI).
  • methods: The method builds on half-quadratic variable splitting and unrolls the ADMM optimization, using a differentiable gradient descent step in the image domain for data consistency and a U-Net-based denoiser to enhance image quality; a dilated-convolution model predicts the Lagrange multipliers for ADMM initialization.
  • results: Experiments on accelerated parallel MRI reconstruction over two distinct datasets show that vSHARP outperforms state-of-the-art methods.
    Abstract Medical Imaging (MI) tasks, such as accelerated Parallel Magnetic Resonance Imaging (MRI), often involve reconstructing an image from noisy or incomplete measurements. This amounts to solving ill-posed inverse problems, where a satisfactory closed-form analytical solution is not available. Traditional methods such as Compressed Sensing (CS) in MRI reconstruction can be time-consuming or prone to obtaining low-fidelity images. Recently, a plethora of supervised and self-supervised Deep Learning (DL) approaches have demonstrated superior performance in inverse-problem solving, surpassing conventional methods. In this study, we propose vSHARP (variable Splitting Half-quadratic ADMM algorithm for Reconstruction of inverse Problems), a novel DL-based method for solving ill-posed inverse problems arising in MI. vSHARP utilizes the Half-Quadratic Variable Splitting method and employs the Alternating Direction Method of Multipliers (ADMM) to unroll the optimization process. For data consistency, vSHARP unrolls a differentiable gradient descent process in the image domain, while a DL-based denoiser, such as a U-Net architecture, is applied to enhance image quality. vSHARP also employs a dilated-convolution DL-based model to predict the Lagrange multipliers for the ADMM initialization. We evaluate the proposed model by applying it to the task of accelerated Parallel MRI Reconstruction on two distinct datasets. We present a comparative analysis of our experimental results with state-of-the-art approaches, highlighting the superior performance of vSHARP.
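A high-level sketch of an unrolled ADMM scheme in the spirit of vSHARP; the operator interfaces, update order, and step sizes below are assumptions rather than the paper's exact algorithm:

```python
import torch

def unrolled_admm(y, forward_op, adjoint_op, denoiser, x0, num_iters=8, rho=0.1, step=0.5):
    """y: measurements; forward_op/adjoint_op: imaging operator A and its adjoint;
    denoiser: learned proximal operator (e.g. a U-Net); x0: initial image estimate."""
    x = x0.clone()
    z = x0.clone()
    u = torch.zeros_like(x0)
    for _ in range(num_iters):
        # x-update: a few gradient steps on ||A x - y||^2 + rho ||x - z + u||^2 (data consistency)
        for _ in range(2):
            grad = adjoint_op(forward_op(x) - y) + rho * (x - z + u)
            x = x - step * grad
        # z-update: learned denoiser acting as the proximal operator
        z = denoiser(x + u)
        # dual update of the scaled multipliers
        u = u + x - z
    return x
```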

End-to-End Learned Event- and Image-based Visual Odometry

  • paper_url: http://arxiv.org/abs/2309.09947
  • repo_url: None
  • paper_authors: Roberto Pellerito, Marco Cannici, Daniel Gehrig, Joris Belhadj, Olivier Dubois-Matra, Massimo Casasco, Davide Scaramuzza
  • for: The system targets autonomous robot navigation, particularly in GPS-denied environments such as planetary terrains.
  • methods: It uses novel Recurrent, Asynchronous, and Massively Parallel (RAMP) encoders that are 8x faster and 20% more accurate than existing asynchronous encoders, together with a pose forecasting technique for initialization.
  • results: Although trained only in simulation, the system outperforms image- and event-based methods by 52% and 20%, respectively, on standard real-world benchmarks.
    Abstract Visual Odometry (VO) is crucial for autonomous robotic navigation, especially in GPS-denied environments like planetary terrains. While standard RGB cameras struggle in low-light or high-speed motion, event-based cameras offer high dynamic range and low latency. However, seamlessly integrating asynchronous event data with synchronous frames remains challenging. We introduce RAMP-VO, the first end-to-end learned event- and image-based VO system. It leverages novel Recurrent, Asynchronous, and Massively Parallel (RAMP) encoders that are 8x faster and 20% more accurate than existing asynchronous encoders. RAMP-VO further employs a novel pose forecasting technique to predict future poses for initialization. Despite being trained only in simulation, RAMP-VO outperforms image- and event-based methods by 52% and 20%, respectively, on traditional, real-world benchmarks as well as newly introduced Apollo and Malapert landing sequences, paving the way for robust and asynchronous VO in space.

Hierarchical Attention and Graph Neural Networks: Toward Drift-Free Pose Estimation

  • paper_url: http://arxiv.org/abs/2309.09934
  • repo_url: None
  • paper_authors: Kathia Melbouci, Fawzi Nashashibi
  • for: This work aims to improve the accuracy of 3D geometric registration and avoid the drift of iterative frame-to-frame approaches.
  • methods: A learned model based on hierarchical attention mechanisms and graph neural networks replaces traditional geometric registration and pose graph optimization, with a strategy to condense the data flow while preserving the information needed for precise rigid pose estimation.
  • results: Tests on the KITTI Odometry dataset show a significant improvement in pose estimation accuracy, particularly for the rotational components.
    Abstract The most commonly used method for addressing 3D geometric registration is the iterative closest-point algorithm, this approach is incremental and prone to drift over multiple consecutive frames. The Common strategy to address the drift is the pose graph optimization subsequent to frame-to-frame registration, incorporating a loop closure process that identifies previously visited places. In this paper, we explore a framework that replaces traditional geometric registration and pose graph optimization with a learned model utilizing hierarchical attention mechanisms and graph neural networks. We propose a strategy to condense the data flow, preserving essential information required for the precise estimation of rigid poses. Our results, derived from tests on the KITTI Odometry dataset, demonstrate a significant improvement in pose estimation accuracy. This improvement is especially notable in determining rotational components when compared with results obtained through conventional multi-way registration via pose graph optimization. The code will be made available upon completion of the review process.

Quantum Vision Clustering

  • paper_url: http://arxiv.org/abs/2309.09907
  • repo_url: None
  • paper_authors: Xuan Bac Nguyen, Benjamin Thompson, Hugh Churchill, Khoa Luu, Samee U. Khan
  • for: The paper studies unsupervised visual clustering, which explains distributions of unlabeled images.
  • methods: It proposes the first clustering formulation designed to be solved with adiabatic quantum computing, using an Ising model to represent the quantum mechanical system.
  • results: The approach is competitive with state-of-the-art optimization-based methods, even when using off-the-shelf integer programming solvers, and small instances are already solvable on the current generation of real quantum computers.
    Abstract Unsupervised visual clustering has recently received considerable attention. It aims to explain distributions of unlabeled visual images by clustering them via a parameterized appearance model. From a different perspective, the clustering algorithms can be treated as assignment problems, often NP-hard. They can be solved precisely for small instances on current hardware. Adiabatic quantum computing (AQC) offers a solution, as it can soon provide a considerable speedup on a range of NP-hard optimization problems. However, current clustering formulations are unsuitable for quantum computing due to their scaling properties. Consequently, in this work, we propose the first clustering formulation designed to be solved with AQC. We employ an Ising model representing the quantum mechanical system implemented on the AQC. Our approach is competitive compared to state-of-the-art optimization-based approaches, even using of-the-shelf integer programming solvers. Finally, we demonstrate that our clustering problem is already solvable on the current generation of real quantum computers for small examples and analyze the properties of the measured solutions.
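A toy sketch of casting clustering as a QUBO/Ising problem, as required for adiabatic quantum computing; the specific objective and penalty weights below are assumptions for illustration, not the paper's formulation:

```python
import numpy as np

def clustering_qubo(similarity, num_clusters, penalty=10.0):
    """Binary variable x[i, k] = 1 means point i belongs to cluster k.
    Rewards grouping similar points and penalizes violating one-cluster-per-point."""
    n = similarity.shape[0]
    m = n * num_clusters
    Q = np.zeros((m, m))
    idx = lambda i, k: i * num_clusters + k
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            for k in range(num_clusters):
                Q[idx(i, k), idx(j, k)] -= similarity[i, j]   # reward same-cluster similarity
    for i in range(n):                                        # one-hot penalty (sum_k x[i,k] - 1)^2
        for k in range(num_clusters):
            Q[idx(i, k), idx(i, k)] -= penalty
            for l in range(num_clusters):
                if l != k:
                    Q[idx(i, k), idx(i, l)] += penalty
    return Q  # minimize x^T Q x over binary x, e.g. with an annealer or exact solver
```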

On Model Explanations with Transferable Neural Pathways

  • paper_url: http://arxiv.org/abs/2309.09887
  • repo_url: None
  • paper_authors: Xinmiao Lin, Wentao Bao, Qi Yu, Yu Kong
  • for: This work aims to provide interpretable explanations of neural network behavior via a Generative Class-relevant Neural Pathway (GEN-CNP) model.
  • methods: Two interpretability criteria are proposed for neural pathways: (i) same-class pathways should primarily consist of class-relevant neurons, and (ii) each instance's pathway sparsity should be optimally determined; GEN-CNP learns to predict neural pathways from the target model's feature maps.
  • results: Experiments and qualitative analyses show that the class-relevant neural pathways faithfully and interpretably explain the model's behavior and transfer to explain samples of the same class.
    Abstract Neural pathways as model explanations consist of a sparse set of neurons that provide the same level of prediction performance as the whole model. Existing methods primarily focus on accuracy and sparsity but the generated pathways may offer limited interpretability thus fall short in explaining the model behavior. In this paper, we suggest two interpretability criteria of neural pathways: (i) same-class neural pathways should primarily consist of class-relevant neurons; (ii) each instance's neural pathway sparsity should be optimally determined. To this end, we propose a Generative Class-relevant Neural Pathway (GEN-CNP) model that learns to predict the neural pathways from the target model's feature maps. We propose to learn class-relevant information from features of deep and shallow layers such that same-class neural pathways exhibit high similarity. We further impose a faithfulness criterion for GEN-CNP to generate pathways with instance-specific sparsity. We propose to transfer the class-relevant neural pathways to explain samples of the same class and show experimentally and qualitatively their faithfulness and interpretability.

RaLF: Flow-based Global and Metric Radar Localization in LiDAR Maps

  • paper_url: http://arxiv.org/abs/2309.09875
  • repo_url: None
  • paper_authors: Abhijeet Nayak, Daniele Cattaneo, Abhinav Valada
  • for: The paper proposes a deep-neural-network approach for localizing radar scans in a LiDAR map of the environment, enabling robust positioning for autonomous vehicles.
  • methods: RaLF learns a shared embedding space between radar and LiDAR via cross-modal metric learning; a place recognition head generates global descriptors, and a metric localization head predicts pixel-level flow vectors and the 3-DoF transformation between the radar scan and the map.
  • results: Extensive experiments on multiple real-world driving datasets show state-of-the-art performance for both place recognition and metric localization, with effective generalization to cities and sensor setups different from those used during training.
    Abstract Localization is paramount for autonomous robots. While camera and LiDAR-based approaches have been extensively investigated, they are affected by adverse illumination and weather conditions. Therefore, radar sensors have recently gained attention due to their intrinsic robustness to such conditions. In this paper, we propose RaLF, a novel deep neural network-based approach for localizing radar scans in a LiDAR map of the environment, by jointly learning to address both place recognition and metric localization. RaLF is composed of radar and LiDAR feature encoders, a place recognition head that generates global descriptors, and a metric localization head that predicts the 3-DoF transformation between the radar scan and the map. We tackle the place recognition task by learning a shared embedding space between the two modalities via cross-modal metric learning. Additionally, we perform metric localization by predicting pixel-level flow vectors that align the query radar scan with the LiDAR map. We extensively evaluate our approach on multiple real-world driving datasets and show that RaLF achieves state-of-the-art performance for both place recognition and metric localization. Moreover, we demonstrate that our approach can effectively generalize to different cities and sensor setups than the ones used during training. We make the code and trained models publicly available at http://ralf.cs.uni-freiburg.de.

Contrastive Learning for Enhancing Robust Scene Transfer in Vision-based Agile Flight

  • paper_url: http://arxiv.org/abs/2309.09865
  • repo_url: None
  • paper_authors: Jiaxu Xing, Leonard Bauersfeld, Yunlong Song, Chunwei Xing, Davide Scaramuzza
  • for: This work addresses scene transfer for vision-based mobile robotics, where existing policy-learning approaches suffer from poor sample efficiency or limited generalization.
  • methods: An adaptive multi-pair contrastive learning strategy for visual representation learning enables zero-shot scene transfer and real-world deployment; control policies relying on the learned embedding operate in unseen environments without fine-tuning in the deployment environment.
  • results: Simulation and real-world experiments on vision-based agile quadrotor flight show that the approach generalizes beyond the training domain and outperforms all baselines.
    Abstract Scene transfer for vision-based mobile robotics applications is a highly relevant and challenging problem. The utility of a robot greatly depends on its ability to perform a task in the real world, outside of a well-controlled lab environment. Existing scene transfer end-to-end policy learning approaches often suffer from poor sample efficiency or limited generalization capabilities, making them unsuitable for mobile robotics applications. This work proposes an adaptive multi-pair contrastive learning strategy for visual representation learning that enables zero-shot scene transfer and real-world deployment. Control policies relying on the embedding are able to operate in unseen environments without the need for finetuning in the deployment environment. We demonstrate the performance of our approach on the task of agile, vision-based quadrotor flight. Extensive simulation and real-world experiments demonstrate that our approach successfully generalizes beyond the training domain and outperforms all baselines.

Unsupervised Open-Vocabulary Object Localization in Videos

  • paper_url: http://arxiv.org/abs/2309.09858
  • repo_url: None
  • paper_authors: Ke Fan, Zechen Bai, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, Mike Zheng Shou, Francesco Locatello, Bernt Schiele, Thomas Brox, Zheng Zhang, Yanwei Fu, Tong He
  • for: Automatic localization of objects in videos without manual annotation.
  • methods: The method builds on recent advances in video representation learning and pre-trained vision-language models: a slot attention approach localizes objects in videos, after which localized semantic information is read from the pre-trained CLIP model in an unsupervised way to assign text to the obtained slots.
  • results: This yields entirely unsupervised video object localization and is effectively the first unsupervised approach to achieve good results on regular video benchmarks.
    Abstract In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via a slot attention approach and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks.
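The second, unsupervised labeling step described above could look roughly like the following sketch, assuming slot embeddings already projected into the CLIP space (that projection is an assumption of this example):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def name_slots(slot_feats, text_feats, names):
    # slot_feats: (S, D) slot embeddings projected to the CLIP embedding space
    # text_feats: (C, D) CLIP text embeddings for the candidate object names
    s = F.normalize(slot_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    sim = s @ t.T                                   # cosine similarities (S, C)
    best = sim.argmax(dim=-1)
    return [names[i] for i in best.tolist()]        # one name per localized slot
```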

PseudoCal: Towards Initialisation-Free Deep Learning-Based Camera-LiDAR Self-Calibration

  • paper_url: http://arxiv.org/abs/2309.09855
  • repo_url: None
  • paper_authors: Mathieu Cocheteux, Julien Moreau, Franck Davoine
  • for: This paper addresses automatic camera-LiDAR extrinsic calibration to improve multi-sensor fusion in autonomous vehicles and mobile robots.
  • methods: The proposed PseudoCal self-calibration method leverages the pseudo-LiDAR concept and works directly in 3D space rather than being limited to the camera field of view, avoiding the manual intervention and specific environments required by traditional techniques.
  • results: In typical autonomous vehicle and robotics settings, PseudoCal performs one-shot calibration quasi-independently of initial parameter estimates, addressing extreme cases unsolved by existing approaches.
    Abstract Camera-LiDAR extrinsic calibration is a critical task for multi-sensor fusion in autonomous systems, such as self-driving vehicles and mobile robots. Traditional techniques often require manual intervention or specific environments, making them labour-intensive and error-prone. Existing deep learning-based self-calibration methods focus on small realignments and still rely on initial estimates, limiting their practicality. In this paper, we present PseudoCal, a novel self-calibration method that overcomes these limitations by leveraging the pseudo-LiDAR concept and working directly in the 3D space instead of limiting itself to the camera field of view. In typical autonomous vehicle and robotics contexts and conventions, PseudoCal is able to perform one-shot calibration quasi-independently of initial parameter estimates, addressing extreme cases that remain unsolved by existing approaches.
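For context, the pseudo-LiDAR concept the method builds on back-projects a per-pixel depth map into a 3D point cloud using the camera intrinsics; the calibration network itself is not shown here:

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """depth: (H, W) metric depth map -> (H*W, 3) points in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - cx) * z / fx
    y = (v.reshape(-1) - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```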

Hyperbolic vs Euclidean Embeddings in Few-Shot Learning: Two Sides of the Same Coin

  • paper_url: http://arxiv.org/abs/2309.10013
  • repo_url: None
  • paper_authors: Gabriel Moreira, Manuel Marques, João Paulo Costeira, Alexander Hauptmann
  • for: The paper studies representation learning in hyperbolic space, which promises low-dimensional, highly informative representations for image recognition.
  • methods: It focuses on prototypical hyperbolic neural networks and analyzes the tendency of hyperbolic embeddings to converge to the boundary of the Poincaré ball in high dimensions, and the effect this has on few-shot classification.
  • results: The best few-shot results are attained for hyperbolic embeddings at a common hyperbolic radius; moreover, a fixed-radius encoder equipped with the Euclidean metric can achieve better performance regardless of the embedding dimension.
    Abstract Recent research in representation learning has shown that hierarchical data lends itself to low-dimensional and highly informative representations in hyperbolic space. However, even if hyperbolic embeddings have gathered attention in image recognition, their optimization is prone to numerical hurdles. Further, it remains unclear which applications stand to benefit the most from the implicit bias imposed by hyperbolicity, when compared to traditional Euclidean features. In this paper, we focus on prototypical hyperbolic neural networks. In particular, the tendency of hyperbolic embeddings to converge to the boundary of the Poincar\'e ball in high dimensions and the effect this has on few-shot classification. We show that the best few-shot results are attained for hyperbolic embeddings at a common hyperbolic radius. In contrast to prior benchmark results, we demonstrate that better performance can be achieved by a fixed-radius encoder equipped with the Euclidean metric, regardless of the embedding dimension.
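For reference, the standard Poincaré-ball distance underlying the hyperbolic embeddings discussed above (curvature −1) is d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))); a minimal implementation:

```python
import torch

def poincare_distance(u, v, eps=1e-6):
    """u, v: (..., D) points inside the unit Poincaré ball."""
    sq_u = (u * u).sum(-1).clamp(max=1 - eps)
    sq_v = (v * v).sum(-1).clamp(max=1 - eps)
    sq_diff = ((u - v) ** 2).sum(-1)
    x = 1 + 2 * sq_diff / ((1 - sq_u) * (1 - sq_v))
    return torch.acosh(x.clamp(min=1 + eps))
```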

Grasp-Anything: Large-scale Grasp Dataset from Foundation Models

  • paper_url: http://arxiv.org/abs/2309.09818
  • repo_url: None
  • paper_authors: An Dinh Vuong, Minh Nhat Vu, Hieu Le, Baoru Huang, Binh Huynh, Thieu Vo, Andreas Kugi, Anh Nguyen
  • for: This work leverages foundation models to tackle grasp detection, a persistent challenge in robotics with broad industrial applications.
  • methods: Foundation models are used to synthesize Grasp-Anything, a large-scale grasp dataset with 1M samples with text descriptions and more than 3M objects, exceeding prior datasets in diversity and magnitude.
  • results: Grasp-Anything successfully facilitates zero-shot grasp detection in vision-based tasks and real-world robotic experiments, surpassing prior datasets.
    Abstract Foundation models such as ChatGPT have made significant strides in robotic tasks due to their universal representation of real-world domains. In this paper, we leverage foundation models to tackle grasp detection, a persistent challenge in robotics with broad industrial applications. Despite numerous grasp datasets, their object diversity remains limited compared to real-world figures. Fortunately, foundation models possess an extensive repository of real-world knowledge, including objects we encounter in our daily lives. As a consequence, a promising solution to the limited representation in previous grasp datasets is to harness the universal knowledge embedded in these foundation models. We present Grasp-Anything, a new large-scale grasp dataset synthesized from foundation models to implement this solution. Grasp-Anything excels in diversity and magnitude, boasting 1M samples with text descriptions and more than 3M objects, surpassing prior datasets. Empirically, we show that Grasp-Anything successfully facilitates zero-shot grasp detection on vision-based tasks and real-world robotic experiments. Our dataset and code are available at https://grasp-anything-2023.github.io.

R2GenGPT: Radiology Report Generation with Frozen LLMs

  • paper_url: http://arxiv.org/abs/2309.09812
  • repo_url: https://github.com/wang-zhanyu/r2gengpt
  • paper_authors: Zhanyu Wang, Lingqiao Liu, Lei Wang, Luping Zhou
  • for: This study explores aligning visual features with large language models (LLMs) to improve radiology report generation (R2Gen).
  • methods: The proposed R2GenGPT aligns visual features with the word embedding space of an LLM through an efficient visual alignment module, enabling the otherwise static LLM to integrate and process image information.
  • results: R2GenGPT achieves state-of-the-art performance while training only 5M parameters (0.07% of the total), with the LLM frozen, high training efficiency, and rapid convergence.
    Abstract Large Language Models (LLMs) have consistently showcased remarkable generalization capabilities when applied to various language tasks. Nonetheless, harnessing the full potential of LLMs for Radiology Report Generation (R2Gen) still presents a challenge, stemming from the inherent disparity in modality between LLMs and the R2Gen task. To bridge this gap effectively, we propose R2GenGPT, which is a novel solution that aligns visual features with the word embedding space of LLMs using an efficient visual alignment module. This innovative approach empowers the previously static LLM to seamlessly integrate and process image information, marking a step forward in optimizing R2Gen performance. R2GenGPT offers the following benefits. First, it attains state-of-the-art (SOTA) performance by training only the lightweight visual alignment module while freezing all the parameters of LLM. Second, it exhibits high training efficiency, as it requires the training of an exceptionally minimal number of parameters while achieving rapid convergence. By employing delta tuning, our model only trains 5M parameters (which constitute just 0.07\% of the total parameter count) to achieve performance close to the SOTA levels. Our code is available at https://github.com/wang-zhanyu/R2GenGPT.
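A minimal sketch of the visual alignment idea described above (the module shape and dimensions are assumptions): a small trainable projection maps visual features into the frozen LLM's word-embedding space so image tokens can precede the report prompt:

```python
import torch
import torch.nn as nn

class VisualAlignment(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)        # the only trained parameters

    def forward(self, visual_feats, prompt_embeds):
        # visual_feats: (B, N, vis_dim) patch features; prompt_embeds: (B, L, llm_dim)
        vis_tokens = self.proj(visual_feats)
        return torch.cat([vis_tokens, prompt_embeds], dim=1)  # sequence fed to the frozen LLM
```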

DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.09777
  • repo_url: https://github.com/JeffWang987/DriveDreamer
  • paper_authors: Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiwen Lu
  • for: This work develops DriveDreamer, a world model derived entirely from real-world driving scenarios, to improve driving video generation and safe driving policies for autonomous vehicles.
  • methods: A powerful diffusion model constructs a comprehensive representation of the complex environment, trained with a two-stage pipeline: the first stage teaches DriveDreamer structured traffic constraints, and the second equips it to anticipate future states.
  • results: Experiments on the nuScenes benchmark show that DriveDreamer enables precise, controllable video generation that faithfully captures the structural constraints of real-world traffic, and it can generate realistic and reasonable driving policies.
    Abstract World models, especially in autonomous driving, are trending and drawing extensive attention due to their capacity for comprehending driving environments. The established world model holds immense potential for the generation of high-quality driving videos, and driving policies for safe maneuvering. However, a critical limitation in relevant research lies in its predominant focus on gaming environments or simulated settings, thereby lacking the representation of real-world driving scenarios. Therefore, we introduce DriveDreamer, a pioneering world model entirely derived from real-world driving scenarios. Regarding that modeling the world in intricate driving scenes entails an overwhelming search space, we propose harnessing the powerful diffusion model to construct a comprehensive representation of the complex environment. Furthermore, we introduce a two-stage training pipeline. In the initial phase, DriveDreamer acquires a deep understanding of structured traffic constraints, while the subsequent stage equips it with the ability to anticipate future states. The proposed DriveDreamer is the first world model established from real-world driving scenarios. We instantiate DriveDreamer on the challenging nuScenes benchmark, and extensive experiments verify that DriveDreamer empowers precise, controllable video generation that faithfully captures the structural constraints of real-world traffic scenarios. Additionally, DriveDreamer enables the generation of realistic and reasonable driving policies, opening avenues for interaction and practical applications.

Towards Self-Adaptive Pseudo-Label Filtering for Semi-Supervised Learning

  • paper_url: http://arxiv.org/abs/2309.09774
  • repo_url: None
  • paper_authors: Lei Zhu, Zhanghan Ke, Rynson Lau
  • for: To improve the training of semi-supervised learning (SSL) methods, especially when labeled data is extremely scarce.
  • methods: A Self-Adaptive Pseudo-Label Filter (SPF) automatically filters noisy pseudo labels as the model evolves: an online mixture model weights each pseudo-labeled sample by the posterior probability of it being correct, taking into account the confidence distribution at that time, without the manual tuning required by hand-crafted filters.
  • results: Incorporating SPF into existing SSL methods improves their performance, particularly when labeled data is extremely scarce.
    Abstract Recent semi-supervised learning (SSL) methods typically include a filtering strategy to improve the quality of pseudo labels. However, these filtering strategies are usually hand-crafted and do not change as the model is updated, resulting in a lot of correct pseudo labels being discarded and incorrect pseudo labels being selected during the training process. In this work, we observe that the distribution gap between the confidence values of correct and incorrect pseudo labels emerges at the very beginning of the training, which can be utilized to filter pseudo labels. Based on this observation, we propose a Self-Adaptive Pseudo-Label Filter (SPF), which automatically filters noise in pseudo labels in accordance with model evolvement by modeling the confidence distribution throughout the training process. Specifically, with an online mixture model, we weight each pseudo-labeled sample by the posterior of it being correct, which takes into consideration the confidence distribution at that time. Unlike previous handcrafted filters, our SPF evolves together with the deep neural network without manual tuning. Extensive experiments demonstrate that incorporating SPF into the existing SSL methods can help improve the performance of SSL, especially when the labeled data is extremely scarce.
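A sketch of the confidence-based weighting principle described above; the paper fits the mixture online during training, whereas this illustration uses a batch fit with scikit-learn:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def pseudo_label_weights(confidences):
    """Weight each pseudo-label by the posterior of belonging to the high-confidence component."""
    c = np.asarray(confidences, dtype=np.float64).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(c)
    correct_comp = int(np.argmax(gmm.means_.ravel()))   # component with the higher mean
    return gmm.predict_proba(c)[:, correct_comp]        # per-sample weight in [0, 1]
```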

Semantically Redundant Training Data Removal and Deep Model Classification Performance: A Study with Chest X-rays

  • paper_url: http://arxiv.org/abs/2309.09773
  • repo_url: None
  • paper_authors: Sivaramakrishnan Rajaraman, Ghada Zamzmi, Feng Yang, Zhaohui Liang, Zhiyun Xue, Sameer Antani
  • for: This paper proposes information-oriented selection of training samples to improve the performance and generalizability of deep learning models.
  • methods: An entropy-based sample scoring approach identifies and removes semantically redundant training data, i.e., multiple images with highly similar presentations of the disease of interest.
  • results: On the public NIH chest X-ray dataset, the model trained on the resulting informative subset significantly outperforms the model trained on the full training set, in both internal (recall: 0.7164 vs 0.6597, p<0.05) and external testing (recall: 0.3185 vs 0.2589, p<0.05).
    Abstract Deep learning (DL) has demonstrated its innate capacity to independently learn hierarchical features from complex and multi-dimensional data. A common understanding is that its performance scales up with the amount of training data. Another data attribute is the inherent variety. It follows, therefore, that semantic redundancy, which is the presence of similar or repetitive information, would tend to lower performance and limit generalizability to unseen data. In medical imaging data, semantic redundancy can occur due to the presence of multiple images that have highly similar presentations for the disease of interest. Further, the common use of augmentation methods to generate variety in DL training may be limiting performance when applied to semantically redundant data. We propose an entropy-based sample scoring approach to identify and remove semantically redundant training data. We demonstrate using the publicly available NIH chest X-ray dataset that the model trained on the resulting informative subset of training data significantly outperforms the model trained on the full training set, during both internal (recall: 0.7164 vs 0.6597, p<0.05) and external testing (recall: 0.3185 vs 0.2589, p<0.05). Our findings emphasize the importance of information-oriented training sample selection as opposed to the conventional practice of using all available training data.
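The scoring function below is only an illustrative stand-in for the paper's entropy-based sample scoring (the actual scoring may operate on model features or predictions rather than raw pixels): low-entropy images carry less information and are candidates for removal:

```python
import numpy as np

def image_entropy(image_u8):
    """Shannon entropy of the pixel-intensity histogram of an 8-bit image."""
    hist = np.bincount(image_u8.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def keep_informative(images, keep_fraction=0.8):
    """Keep the most informative fraction of the training images by entropy score."""
    scores = np.array([image_entropy(img) for img in images])
    threshold = np.quantile(scores, 1 - keep_fraction)
    return [img for img, s in zip(images, scores) if s >= threshold]
```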

Localization-Guided Track: A Deep Association Multi-Object Tracking Framework Based on Localization Confidence of Detections

  • paper_url: http://arxiv.org/abs/2309.09765
  • repo_url: https://github.com/mengting2023/lg-track
  • paper_authors: Ting Meng, Chunyun Fu, Mingguang Huang, Xiyang Wang, Jiawei He, Tao Huang, Wankai Shi
  • for: To improve accuracy and reliability in multi-object tracking (MOT), particularly for detections with low confidence.
  • methods: The proposed Localization-Guided Track (LG-Track) is the first MOT method to use the localization confidence of detection boxes; it accounts for both the appearance clarity and localization accuracy of detections, designs an effective deep association mechanism, and selects a more appropriate cost matrix based on classification and localization confidence.
  • results: Extensive experiments on the MOT17 and MOT20 datasets show that LG-Track outperforms the compared state-of-the-art tracking methods.
    Abstract In currently available literature, no tracking-by-detection (TBD) paradigm-based tracking method has considered the localization confidence of detection boxes. In most TBD-based methods, it is considered that objects of low detection confidence are highly occluded and thus it is a normal practice to directly disregard such objects or to reduce their priority in matching. In addition, appearance similarity is not a factor to consider for matching these objects. However, in terms of the detection confidence fusing classification and localization, objects of low detection confidence may have inaccurate localization but clear appearance; similarly, objects of high detection confidence may have inaccurate localization or unclear appearance; yet these objects are not further classified. In view of these issues, we propose Localization-Guided Track (LG-Track). Firstly, localization confidence is applied in MOT for the first time, with appearance clarity and localization accuracy of detection boxes taken into account, and an effective deep association mechanism is designed; secondly, based on the classification confidence and localization confidence, a more appropriate cost matrix can be selected and used; finally, extensive experiments have been conducted on MOT17 and MOT20 datasets. The results show that our proposed method outperforms the compared state-of-the-art tracking methods. For the benefit of the community, our code has been made publicly available at https://github.com/mengting2023/LG-Track.
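
The role of localization confidence in matching can be pictured with the small sketch below: the cost matrix blends an IoU (motion) cost and an appearance cost, with each detection's localization confidence deciding how much to trust the geometric term before Hungarian assignment. The gating rule and weights are illustrative assumptions, not the exact cost-matrix selection used by LG-Track.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def build_cost(track_boxes, track_feats, det_boxes, det_feats, loc_conf):
    """Trust IoU when localization is confident, lean on appearance otherwise."""
    cost = np.zeros((len(track_boxes), len(det_boxes)))
    for i in range(len(track_boxes)):
        for j in range(len(det_boxes)):
            iou_cost = 1.0 - iou(track_boxes[i], det_boxes[j])
            app_cost = 1.0 - float(np.dot(track_feats[i], det_feats[j]))
            w = loc_conf[j]                       # localization confidence in [0, 1]
            cost[i, j] = w * iou_cost + (1.0 - w) * app_cost
    return cost

def unit_feature(seed, dim=8):
    v = np.random.default_rng(seed).random(dim)
    return v / np.linalg.norm(v)

# toy data: 2 existing tracks, 2 new detections
tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(1, 1, 11, 11), (19, 21, 29, 31)]
cost = build_cost(tracks, [unit_feature(0), unit_feature(1)],
                  dets, [unit_feature(0), unit_feature(1)], loc_conf=[0.9, 0.3])
rows, cols = linear_sum_assignment(cost)
print(list(zip(rows, cols)))   # matched (track, detection) pairs
```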

Application-driven Validation of Posteriors in Inverse Problems

  • paper_url: http://arxiv.org/abs/2309.09764
  • repo_url: None
  • paper_authors: Tim J. Adler, Jan-Hinrich Nölke, Annika Reinke, Minu Dietlinde Tizabi, Sebastian Gruber, Dasha Trofimova, Lynton Ardizzone, Paul F. Jaeger, Florian Buettner, Ullrich Köthe, Lena Maier-Hein
  • for: To address the problem that many inverse problems admit multiple plausible solutions, which common deep learning-based solutions cannot handle.
  • methods: The paper targets posterior-based methods such as conditional Diffusion Models and Invertible Neural Networks, whose translation into practice has been hampered by the lack of adequate validation.
  • results: The paper presents the first systematic framework for application-driven validation, transferring validation principles from object detection to inverse problems. The framework shows clear advantages on a synthetic toy example and on two medical vision use cases (pose estimation in surgery and imaging-based quantification of functional tissue parameters).
    Abstract Current deep learning-based solutions for image analysis tasks are commonly incapable of handling problems to which multiple different plausible solutions exist. In response, posterior-based methods such as conditional Diffusion Models and Invertible Neural Networks have emerged; however, their translation is hampered by a lack of research on adequate validation. In other words, the way progress is measured often does not reflect the needs of the driving practical application. Closing this gap in the literature, we present the first systematic framework for the application-driven validation of posterior-based methods in inverse problems. As a methodological novelty, it adopts key principles from the field of object detection validation, which has a long history of addressing the question of how to locate and match multiple object instances in an image. Treating modes as instances enables us to perform mode-centric validation, using well-interpretable metrics from the application perspective. We demonstrate the value of our framework through instantiations for a synthetic toy example and two medical vision use cases: pose estimation in surgery and imaging-based quantification of functional tissue parameters for diagnostics. Our framework offers key advantages over common approaches to posterior validation in all three examples and could thus revolutionize performance assessment in inverse problems.
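
The "modes as instances" idea borrows directly from detection-style evaluation, as in this sketch: predicted posterior modes are greedily matched one-to-one to reference modes within a distance tolerance, and mode-level precision and recall follow. The tolerance and the greedy matcher are assumptions for illustration, not the framework's exact metrics.

```python
import numpy as np

def match_modes(pred_modes, gt_modes, tol=0.1):
    """Greedy one-to-one matching of predicted and reference posterior modes in
    parameter space; returns detection-style precision and recall."""
    pred = [np.asarray(p, dtype=float) for p in pred_modes]
    gt = [np.asarray(g, dtype=float) for g in gt_modes]
    unmatched_gt = list(range(len(gt)))
    tp = 0
    for p in pred:
        if not unmatched_gt:
            break
        dists = [np.linalg.norm(p - gt[i]) for i in unmatched_gt]
        best = int(np.argmin(dists))
        if dists[best] <= tol:
            tp += 1
            unmatched_gt.pop(best)
    precision = tp / max(len(pred), 1)
    recall = tp / max(len(gt), 1)
    return precision, recall

# toy posterior with two true modes; the predictor finds one of them twice
gt = [[0.2, 0.2], [0.8, 0.8]]
pred = [[0.21, 0.19], [0.22, 0.18], [0.5, 0.5]]
print(match_modes(pred, gt, tol=0.05))   # -> (1/3 precision, 1/2 recall)
```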

Privileged to Predicted: Towards Sensorimotor Reinforcement Learning for Urban Driving

  • paper_url: http://arxiv.org/abs/2309.09756
  • repo_url: None
  • paper_authors: Ege Onat Özsüer, Barış Akgün, Fatma Güney
  • for: To investigate the potential of reinforcement learning (RL) for urban autonomous driving and to bridge the gap between privileged RL agents and sensorimotor agents.
  • methods: The study uses vision-based deep learning models to approximate privileged representations from sensor data and proposes solutions to gradually develop less privileged RL agents.
  • results: Through rigorous evaluation in the CARLA simulation environment, the study highlights the importance of state representations for RL-based driving and points out unresolved challenges for future research.
    Abstract Reinforcement Learning (RL) has the potential to surpass human performance in driving without needing any expert supervision. Despite its promise, the state-of-the-art in sensorimotor self-driving is dominated by imitation learning methods due to the inherent shortcomings of RL algorithms. Nonetheless, RL agents are able to discover highly successful policies when provided with privileged ground truth representations of the environment. In this work, we investigate what separates privileged RL agents from sensorimotor agents for urban driving in order to bridge the gap between the two. We propose vision-based deep learning models to approximate the privileged representations from sensor data. In particular, we identify aspects of state representation that are crucial for the success of the RL agent such as desired route generation and stop zone prediction, and propose solutions to gradually develop less privileged RL agents. We also observe that bird's-eye-view models trained on offline datasets do not generalize to online RL training due to distribution mismatch. Through rigorous evaluation on the CARLA simulation environment, we shed light on the significance of the state representations in RL for autonomous driving and point to unresolved challenges for future research.

Drawing the Same Bounding Box Twice? Coping Noisy Annotations in Object Detection with Repeated Labels

  • paper_url: http://arxiv.org/abs/2309.09742
  • repo_url: https://github.com/madave94/gtiod
  • paper_authors: David Tschirschwitz, Christian Benz, Morris Florek, Henrik Norderhus, Benno Stein, Volker Rodehorst
  • for: To improve the reliability of machine learning systems by ensuring the accuracy and availability of ground truth labels.
  • methods: The study proposes a novel localization algorithm that combines the annotations of multiple labelers into better estimates of the true labels.
  • results: The algorithm excels at aggregating test labels and, during training on the TexBiG dataset, surpasses both noisy-label training and label aggregation with Weighted Boxes Fusion (WBF). The benefits of repeated labels emerge under specific dataset and annotation configurations and depend on annotator consistency, dataset complexity, and the annotation budget.
    Abstract The reliability of supervised machine learning systems depends on the accuracy and availability of ground truth labels. However, the process of human annotation, being prone to error, introduces the potential for noisy labels, which can impede the practicality of these systems. While training with noisy labels is a significant consideration, the reliability of test data is also crucial to ascertain the dependability of the results. A common approach to addressing this issue is repeated labeling, where multiple annotators label the same example, and their labels are combined to provide a better estimate of the true label. In this paper, we propose a novel localization algorithm that adapts well-established ground truth estimation methods for object detection and instance segmentation tasks. The key innovation of our method lies in its ability to transform combined localization and classification tasks into classification-only problems, thus enabling the application of techniques such as Expectation-Maximization (EM) or Majority Voting (MJV). Although our main focus is the aggregation of unique ground truth for test data, our algorithm also shows superior performance during training on the TexBiG dataset, surpassing both noisy label training and label aggregation using Weighted Boxes Fusion (WBF). Our experiments indicate that the benefits of repeated labels emerge under specific dataset and annotation configurations. The key factors appear to be (1) dataset complexity, the (2) annotator consistency, and (3) the given annotation budget constraints.
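
The reduction of combined localization + classification to a classification-only problem can be sketched as follows: boxes from several annotators are first grouped by IoU so that each group refers to one object instance, after which the group's class is settled by majority voting and its coordinates averaged. The IoU threshold and the simple vote/average are illustrative assumptions; the paper also supports EM-style aggregation.

```python
import numpy as np
from collections import Counter

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def aggregate(annotations, iou_thr=0.5):
    """annotations: list of (box, label) pairs from several annotators.
    Groups boxes describing the same object, then majority-votes the label."""
    groups = []                                    # each group: list of (box, label)
    for box, label in annotations:
        placed = False
        for g in groups:
            if iou(box, g[0][0]) >= iou_thr:       # compare with the group's first box
                g.append((box, label))
                placed = True
                break
        if not placed:
            groups.append([(box, label)])
    fused = []
    for g in groups:
        boxes = np.array([b for b, _ in g], dtype=float)
        label = Counter(l for _, l in g).most_common(1)[0][0]
        fused.append((boxes.mean(axis=0).tolist(), label))
    return fused

# three annotators label the same object; one disagrees on the class
ann = [((10, 10, 50, 50), "table"), ((12, 9, 51, 49), "table"), ((11, 11, 49, 50), "figure")]
print(aggregate(ann))   # one fused box with the majority label "table"
```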

Improving Neural Indoor Surface Reconstruction with Mask-Guided Adaptive Consistency Constraints

  • paper_url: http://arxiv.org/abs/2309.09739
  • repo_url: None
  • paper_authors: Xinyi Yu, Liqin Lu, Jintao Rong, Guangkai Xu, Linlin Ou
  • for: scene reconstruction from 2D images
  • methods: neural implicit surface, data-driven pre-trained geometric cues, two-stage training process, decoupling view-dependent and view-independent colors, two novel consistency constraints
  • results: high-quality scene reconstruction with rich geometric details, reducing interference from prior estimation errors
    Abstract 3D scene reconstruction from 2D images has been a long-standing task. Instead of estimating per-frame depth maps and fusing them in 3D, recent research leverages the neural implicit surface as a unified representation for 3D reconstruction. Equipped with data-driven pre-trained geometric cues, these methods have demonstrated promising performance. However, inaccurate prior estimation, which is usually inevitable, can lead to suboptimal reconstruction quality, particularly in some geometrically complex regions. In this paper, we propose a two-stage training process, decouple view-dependent and view-independent colors, and leverage two novel consistency constraints to enhance detail reconstruction performance without requiring extra priors. Additionally, we introduce an essential mask scheme to adaptively influence the selection of supervision constraints, thereby improving performance in a self-supervised paradigm. Experiments on synthetic and real-world datasets show the capability of reducing the interference from prior estimation errors and achieving high-quality scene reconstruction with rich geometric details.

Scribble-based 3D Multiple Abdominal Organ Segmentation via Triple-branch Multi-dilated Network with Pixel- and Class-wise Consistency

  • paper_url: http://arxiv.org/abs/2309.09730
  • repo_url: None
  • paper_authors: Meng Han, Xiangde Luo, Wenjun Liao, Shichuan Zhang, Shaoting Zhang, Guotai Wang
  • for: To improve the accuracy of multi-organ segmentation in abdominal Computed Tomography (CT) images while reducing the need for time-consuming and labor-intensive annotations.
  • methods: The authors propose a novel 3D framework with two consistency constraints for scribble-supervised segmentation of multiple abdominal organs.
  • results: Experiments on the public WORD dataset show that the method outperforms five existing scribble-supervised methods.
    Abstract Multi-organ segmentation in abdominal Computed Tomography (CT) images is of great importance for diagnosis of abdominal lesions and subsequent treatment planning. Though deep learning based methods have attained high performance, they rely heavily on large-scale pixel-level annotations that are time-consuming and labor-intensive to obtain. Due to its low dependency on annotation, weakly supervised segmentation has attracted great attention. However, there is still a large performance gap between current weakly-supervised methods and fully supervised learning, leaving room for exploration. In this work, we propose a novel 3D framework with two consistency constraints for scribble-supervised multiple abdominal organ segmentation from CT. Specifically, we employ a Triple-branch multi-Dilated network (TDNet) with one encoder and three decoders using different dilation rates to capture features from different receptive fields that are complementary to each other to generate high-quality soft pseudo labels. For more stable unsupervised learning, we use voxel-wise uncertainty to rectify the soft pseudo labels and then supervise the outputs of each decoder. To further regularize the network, class relationship information is exploited by encouraging the generated class affinity matrices to be consistent across different decoders under multi-view projection. Experiments on the public WORD dataset show that our method outperforms five existing scribble-supervised methods.
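
A minimal PyTorch sketch of the pixel-wise consistency idea: the three decoder outputs are averaged into a soft pseudo label, its voxel-wise entropy serves as uncertainty, and each decoder is pulled toward the pseudo label with confident voxels weighted more. The exact rectification and weighting rules below are assumptions, not the released TDNet code.

```python
import torch
import torch.nn.functional as F

def uncertainty_rectified_loss(logits_list, num_classes=5, eps=1e-8):
    """logits_list: outputs of the three decoders, each (B, C, D, H, W).
    Returns a consistency loss in which confident voxels count more."""
    probs = [F.softmax(l, dim=1) for l in logits_list]
    pseudo = torch.stack(probs).mean(dim=0)                      # soft pseudo label
    entropy = -(pseudo * torch.log(pseudo + eps)).sum(dim=1)     # (B, D, H, W)
    weight = 1.0 - entropy / torch.log(torch.tensor(float(num_classes)))
    loss = 0.0
    for p in probs:
        per_voxel = ((p - pseudo.detach()) ** 2).mean(dim=1)     # (B, D, H, W)
        loss = loss + (weight.detach() * per_voxel).mean()
    return loss / len(probs)

# toy check on random volumes from three decoders
torch.manual_seed(0)
decoders = [torch.randn(1, 5, 8, 16, 16) for _ in range(3)]
print(uncertainty_rectified_loss(decoders).item())
```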

Robust Geometry-Preserving Depth Estimation Using Differentiable Rendering

  • paper_url: http://arxiv.org/abs/2309.09724
  • repo_url: None
  • paper_authors: Chi Zhang, Wei Yin, Gang Yu, Zhibin Wang, Tao Chen, Bin Fu, Joey Tianyi Zhou, Chunhua Shen
  • for: To address the challenge of recovering 3D scene structure from monocular depth estimation.
  • methods: A learning framework is proposed that trains models to predict geometry-preserving depth without extra data or annotations, so that realistic 3D structure can be reconstructed.
  • results: Compared with state-of-the-art methods on several benchmark datasets, the framework generalizes better and can autonomously recover domain-specific scale-and-shift coefficients from unlabeled images alone.
    Abstract In this study, we address the challenge of 3D scene structure recovery from monocular depth estimation. While traditional depth estimation methods leverage labeled datasets to directly predict absolute depth, recent advancements advocate for mix-dataset training, enhancing generalization across diverse scenes. However, such mixed dataset training yields depth predictions only up to an unknown scale and shift, hindering accurate 3D reconstructions. Existing solutions necessitate extra 3D datasets or geometry-complete depth annotations, constraints that limit their versatility. In this paper, we propose a learning framework that trains models to predict geometry-preserving depth without requiring extra data or annotations. To produce realistic 3D structures, we render novel views of the reconstructed scenes and design loss functions to promote depth estimation consistency across different views. Comprehensive experiments underscore our framework's superior generalization capabilities, surpassing existing state-of-the-art methods on several benchmark datasets without leveraging extra training information. Moreover, our innovative loss functions empower the model to autonomously recover domain-specific scale-and-shift coefficients using solely unlabeled images.

Traffic Scene Similarity: a Graph-based Contrastive Learning Approach

  • paper_url: http://arxiv.org/abs/2309.09720
  • repo_url: None
  • paper_authors: Maximilian Zipfl, Moritz Jarosch, J. Marius Zöllner
  • for: To reduce the validation and homologation effort for highly automated vehicles and thereby ease their wider adoption.
  • methods: A graph-based contrastive learning approach constructs a scene embedding space from scene-specific features, enabling a continuous mapping between traffic scenes.
  • results: Similar scenes can be identified quickly in the resulting embedding space, which can reduce redundant test runs.
    Abstract Ensuring validation for highly automated driving poses significant obstacles to the widespread adoption of highly automated vehicles. Scenario-based testing offers a potential solution by reducing the homologation effort required for these systems. However, a crucial prerequisite, yet unresolved, is the definition and reduction of the test space to a finite number of scenarios. To tackle this challenge, we propose an extension to a contrastive learning approach utilizing graphs to construct a meaningful embedding space. Our approach demonstrates the continuous mapping of scenes using scene-specific features and the formation of thematically similar clusters based on the resulting embeddings. Based on the found clusters, similar scenes could be identified in the subsequent test process, which can lead to a reduction in redundant test runs.

CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation

  • paper_url: http://arxiv.org/abs/2309.09709
  • repo_url: https://github.com/aspirinone/catr.github.io
  • paper_authors: Kexin Li, Zongxin Yang, Lei Chen, Yi Yang, Jun Xiao
  • for: The paper targets audio-visual video segmentation, aiming to generate pixel-level maps of sound-producing objects within image frames that faithfully adhere to the given audio.
  • methods: A decoupled audio-video transformer combines audio and video features from their respective temporal and spatial dimensions to capture their combined dependence; a stackable block further captures fine-grained audio-visual combinatorial dependence in a memory-efficient manner, and audio-constrained queries inject object-level information during decoding.
  • results: Experiments show new state-of-the-art performance on all three datasets with two backbones; code is available at https://github.com/aspirinone/CATR.github.io
    Abstract Audio-visual video segmentation~(AVVS) aims to generate pixel-level maps of sound-producing objects within image frames and ensure the maps faithfully adhere to the given audio, such as identifying and segmenting a singing person in a video. However, existing methods exhibit two limitations: 1) they address video temporal features and audio-visual interactive features separately, disregarding the inherent spatial-temporal dependence of combined audio and video, and 2) they inadequately introduce audio constraints and object-level information during the decoding stage, resulting in segmentation outcomes that fail to comply with audio directives. To tackle these issues, we propose a decoupled audio-video transformer that combines audio and video features from their respective temporal and spatial dimensions, capturing their combined dependence. To optimize memory consumption, we design a block, which, when stacked, enables capturing audio-visual fine-grained combinatorial-dependence in a memory-efficient manner. Additionally, we introduce audio-constrained queries during the decoding phase. These queries contain rich object-level information, ensuring the decoded mask adheres to the sounds. Experimental results confirm our approach's effectiveness, with our framework achieving a new SOTA performance on all three datasets using two backbones. The code is available at \url{https://github.com/aspirinone/CATR.github.io}

DGM-DR: Domain Generalization with Mutual Information Regularized Diabetic Retinopathy Classification

  • paper_url: http://arxiv.org/abs/2309.09670
  • repo_url: None
  • paper_authors: Aleksandr Matsun, Dana O. Mohamed, Sharon Chokuwa, Muhammad Ridzuan, Mohammad Yaqub
  • for: To address domain generalization (DG) in medical imaging so that models generalize across data from different sources.
  • methods: The paper proposes a DG method that re-establishes the model objective as maximizing mutual information with a large pretrained model, yielding a domain-generalizable diabetic retinopathy classifier.
  • results: On public datasets, the proposed method outperforms the previous state of the art by 5.25% in average accuracy with a lower standard deviation, indicating better consistency and robustness.
    Abstract The domain shift between training and testing data presents a significant challenge for training generalizable deep learning models. As a consequence, the performance of models trained with the independent and identically distributed (i.i.d) assumption deteriorates when deployed in the real world. This problem is exacerbated in the medical imaging context due to variations in data acquisition across clinical centers, medical apparatus, and patients. Domain generalization (DG) aims to address this problem by learning a model that generalizes well to any unseen target domain. Many domain generalization techniques were unsuccessful in learning domain-invariant representations due to the large domain shift. Furthermore, multiple tasks in medical imaging are not yet extensively studied in existing literature when it comes to DG point of view. In this paper, we introduce a DG method that re-establishes the model objective function as a maximization of mutual information with a large pretrained model to the medical imaging field. We re-visit the problem of DG in Diabetic Retinopathy (DR) classification to establish a clear benchmark with a correct model selection strategy and to achieve robust domain-invariant representation for an improved generalization. Moreover, we conduct extensive experiments on public datasets to show that our proposed method consistently outperforms the previous state-of-the-art by a margin of 5.25% in average accuracy and a lower standard deviation. Source code available at https://github.com/BioMedIA-MBZUAI/DGM-DR
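
Maximizing mutual information with a large pretrained model is often approximated with an InfoNCE-style bound; the sketch below aligns the task model's features with a frozen pretrained encoder's features over a batch. The InfoNCE surrogate, feature dimensions, and temperature are assumptions for illustration rather than the paper's exact estimator.

```python
import torch
import torch.nn.functional as F

def infonce_alignment(student_feats, teacher_feats, temperature=0.1):
    """InfoNCE-style lower bound on the MI between task-model features and the
    (frozen) pretrained teacher features; positives are the paired rows."""
    s = F.normalize(student_feats, dim=1)
    t = F.normalize(teacher_feats.detach(), dim=1)       # teacher stays frozen
    logits = s @ t.T / temperature                        # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)    # i-th student <-> i-th teacher
    return F.cross_entropy(logits, targets)

# toy batch: 8 retinal images encoded by a task model and a stand-in pretrained model
torch.manual_seed(0)
student = torch.randn(8, 256, requires_grad=True)
teacher = torch.randn(8, 512) @ torch.randn(512, 256)     # stand-in for pretrained features
loss = infonce_alignment(student, teacher)
loss.backward()
print(loss.item())
```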

DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.09668
  • repo_url: https://github.com/VCIP-RGBD/DFormer
  • paper_authors: Bowen Yin, Xuying Zhang, Zhongyu Li, Li Liu, Ming-Ming Cheng, Qibin Hou
  • for: To learn transferable representations for RGB-D segmentation tasks.
  • methods: A sequence of RGB-D blocks, built on a novel building-block design, encodes both RGB and depth information; the backbone is pre-trained on image-depth pairs from ImageNet-1K.
  • results: On two popular RGB-D tasks, RGB-D semantic segmentation and RGB-D salient object detection, DFormer achieves new state-of-the-art performance with less than half the computational cost of the current best methods.
    Abstract We present DFormer, a novel RGB-D pretraining framework to learn transferable representations for RGB-D segmentation tasks. DFormer has two new key innovations: 1) Unlike previous works that aim to encode RGB features,DFormer comprises a sequence of RGB-D blocks, which are tailored for encoding both RGB and depth information through a novel building block design; 2) We pre-train the backbone using image-depth pairs from ImageNet-1K, and thus the DFormer is endowed with the capacity to encode RGB-D representations. It avoids the mismatched encoding of the 3D geometry relationships in depth maps by RGB pre-trained backbones, which widely lies in existing methods but has not been resolved. We fine-tune the pre-trained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results show that our DFormer achieves new state-of-the-art performance on these two tasks with less than half of the computational cost of the current best methods on two RGB-D segmentation datasets and five RGB-D saliency datasets. Our code is available at: https://github.com/VCIP-RGBD/DFormer.

Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation

  • paper_url: http://arxiv.org/abs/2309.09667
  • repo_url: None
  • paper_authors: Huan Liu, Zichang Tan, Qiang Chen, Yunchao Wei, Yao Zhao, Jingdong Wang
  • for: This paper addresses detecting and grounding multi-modal media manipulation (DGM^4), focusing on face forgery and text misinformation.
  • methods: The proposed Unified Frequency-Assisted transFormer (UFAFormer) combines the discrete wavelet transform with self-attention to capture forgery features in both the image and frequency domains, and a forgery-aware mutual module enables effective interaction between image and frequency features.
  • results: Experiments on the DGM^4 dataset show that UFAFormer outperforms previous methods, setting a new benchmark in the field.
    Abstract Detecting and grounding multi-modal media manipulation (DGM^4) has become increasingly crucial due to the widespread dissemination of face forgery and text misinformation. In this paper, we present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM^4 problem. Unlike previous state-of-the-art methods that solely focus on the image (RGB) domain to describe visual forgery features, we additionally introduce the frequency domain as a complementary viewpoint. By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts. Then, our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands. Moreover, to address the semantic conflicts between image and frequency domains, the forgery-aware mutual module is developed to further enable the effective interaction of disparate image and frequency features, resulting in aligned and comprehensive visual forgery representations. Finally, based on visual and textual forgery features, we propose a unified decoder that comprises two symmetric cross-modal interaction modules responsible for gathering modality-specific forgery information, along with a fusing interaction module for aggregation of both modalities. The proposed unified decoder formulates our UFAFormer as a unified framework, ultimately simplifying the overall architecture and facilitating the optimization process. Experimental results on the DGM^4 dataset, containing several perturbations, demonstrate the superior performance of our framework compared to previous methods, setting a new benchmark in the field.
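
The frequency branch starts from a discrete wavelet transform; with PyWavelets, a single-level 2-D DWT already yields the approximation and detail sub-bands that a frequency encoder could consume. The 'haar' wavelet and single decomposition level are assumptions made only for this illustration.

```python
import numpy as np
import pywt

def wavelet_subbands(image, wavelet="haar"):
    """Single-level 2-D DWT: returns the approximation (LL) sub-band and the
    three detail sub-bands that carry high-frequency forgery cues."""
    ll, (lh, hl, hh) = pywt.dwt2(image, wavelet)
    return {"LL": ll, "LH": lh, "HL": hl, "HH": hh}

# toy grayscale face crop
rng = np.random.default_rng(0)
face = rng.random((128, 128))
bands = wavelet_subbands(face)
print({k: v.shape for k, v in bands.items()})   # each sub-band is 64x64
```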

HiT: Building Mapping with Hierarchical Transformers

  • paper_url: http://arxiv.org/abs/2309.09643
  • repo_url: None
  • paper_authors: Mingming Zhang, Qingjie Liu, Yunhong Wang
  • for: To improve the quality of automatic building mapping from high-resolution remote sensing images.
  • methods: A two-stage detection architecture built with hierarchical Transformers adds a polygon head parallel to the classification and bounding-box regression heads, so building bounding boxes and vector polygons are output simultaneously and trained end-to-end.
  • results: Comprehensive experiments on the CrowdAI and Inria datasets show new state-of-the-art results on instance segmentation and polygonal metrics, with strong performance in complex scenes.
    Abstract Deep learning-based methods have been extensively explored for automatic building mapping from high-resolution remote sensing images over recent years. While most building mapping models produce vector polygons of buildings for geographic and mapping systems, dominant methods typically decompose polygonal building extraction in some sub-problems, including segmentation, polygonization, and regularization, leading to complex inference procedures, low accuracy, and poor generalization. In this paper, we propose a simple and novel building mapping method with Hierarchical Transformers, called HiT, improving polygonal building mapping quality from high-resolution remote sensing images. HiT builds on a two-stage detection architecture by adding a polygon head parallel to classification and bounding box regression heads. HiT simultaneously outputs building bounding boxes and vector polygons, which is fully end-to-end trainable. The polygon head formulates a building polygon as serialized vertices with the bidirectional characteristic, a simple and elegant polygon representation avoiding the start or end vertex hypothesis. Under this new perspective, the polygon head adopts a transformer encoder-decoder architecture to predict serialized vertices supervised by the designed bidirectional polygon loss. Furthermore, a hierarchical attention mechanism combined with convolution operation is introduced in the encoder of the polygon head, providing more geometric structures of building polygons at vertex and edge levels. Comprehensive experiments on two benchmarks (the CrowdAI and Inria datasets) demonstrate that our method achieves a new state-of-the-art in terms of instance segmentation and polygonal metrics compared with state-of-the-art methods. Moreover, qualitative results verify the superiority and effectiveness of our model under complex scenes.
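
One way to realize a start- and direction-agnostic ("bidirectional") vertex loss, assumed here for illustration, is to take the minimum vertex-wise L1 error over all cyclic shifts of the target polygon in both traversal directions; the exhaustive enumeration below is a sketch, not the exact loss in the paper.

```python
import torch

def bidirectional_polygon_loss(pred, target):
    """pred, target: (N, 2) vertex sequences describing the same polygon.
    Returns the smallest mean L1 error over every start vertex and both
    traversal directions of the target, so the loss is order-agnostic."""
    n = target.size(0)
    best = None
    for direction in (target, torch.flip(target, dims=[0])):
        for shift in range(n):
            candidate = torch.roll(direction, shifts=shift, dims=0)
            err = (pred - candidate).abs().mean()
            best = err if best is None else torch.minimum(best, err)
    return best

# toy square predicted from a different start corner and orientation
gt = torch.tensor([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
pred = torch.tensor([[1., 1.], [1., 0.], [0., 0.], [0., 1.]])   # reversed, shifted
print(bidirectional_polygon_loss(pred, gt).item())              # ~0.0
```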

Holistic Geometric Feature Learning for Structured Reconstruction

  • paper_url: http://arxiv.org/abs/2309.09622
  • repo_url: https://github.com/geo-tell/f-learn
  • paper_authors: Ziqiong Lu, Linxi Huan, Qiyuan Ma, Xianwei Zheng
  • for: To improve structure awareness in structured reconstruction, where low-level features lack holistic geometry cues.
  • methods: The authors propose a frequency-domain feature learning strategy (F-Learn) that fuses scattered geometric fragments holistically in the frequency domain for topology-intact structure reasoning.
  • results: Experiments show that F-Learn effectively introduces structure awareness into geometric primitive detection and topology inference, bringing significant improvement to final structured reconstruction.
    Abstract The inference of topological principles is a key problem in structured reconstruction. We observe that wrongly predicted topological relationships are often incurred by the lack of holistic geometry clues in low-level features. Inspired by the fact that massive signals can be compactly described with frequency analysis, we experimentally explore the efficiency and tendency of learning structure geometry in the frequency domain. Accordingly, we propose a frequency-domain feature learning strategy (F-Learn) to fuse scattered geometric fragments holistically for topology-intact structure reasoning. Benefiting from the parsimonious design, the F-Learn strategy can be easily deployed into a deep reconstructor with a lightweight model modification. Experiments demonstrate that the F-Learn strategy can effectively introduce structure awareness into geometric primitive detection and topology inference, bringing significant performance improvement to final structured reconstruction. Code and pre-trained models are available at https://github.com/Geo-Tell/F-Learn.
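
A minimal sketch of learning features in the frequency domain: the feature map is moved to the Fourier domain with torch.fft, multiplied by a learnable complex filter, and transformed back before being fused with the spatial path. The rFFT parameterization and residual fusion are illustrative assumptions, not the exact F-Learn design.

```python
import torch
import torch.nn as nn

class FrequencyFeatureLayer(nn.Module):
    """Applies a learnable complex-valued filter to feature maps in the
    Fourier domain, then returns to the spatial domain (illustrative only)."""

    def __init__(self, channels, height, width):
        super().__init__()
        w_freq = width // 2 + 1                        # rfft2 keeps half the spectrum
        self.filter = nn.Parameter(torch.randn(channels, height, w_freq, 2) * 0.02)

    def forward(self, x):                              # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")        # complex (B, C, H, W//2+1)
        weight = torch.view_as_complex(self.filter)    # (C, H, W//2+1)
        spec = spec * weight
        out = torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
        return x + out                                 # residual fusion with the spatial path

feat = torch.randn(2, 64, 32, 32)
layer = FrequencyFeatureLayer(64, 32, 32)
print(layer(feat).shape)                               # torch.Size([2, 64, 32, 32])
```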

Collaborative Three-Stream Transformers for Video Captioning

  • paper_url: http://arxiv.org/abs/2309.09611
  • repo_url: https://github.com/wanghao14/COST
  • paper_authors: Hao Wang, Libo Zhang, Heng Fan, Tiejian Luo
  • for: To model the interactions among the key components of a caption, namely subject, predicate, and object, for video captioning.
  • methods: A new framework, COST, models the three parts separately and lets them complement each other. It consists of three transformer branches that capture visual-linguistic interactions between videos and text, detected objects and text, and actions and text; a cross-granularity attention module aligns the interactions modeled by the three branches so they can support each other and capture the most discriminative semantic information at different granularities.
  • results: Extensive experiments on three large-scale challenging datasets, YouCookII, ActivityNet Captions, and MSVD, show that the proposed method performs favorably against state-of-the-art approaches.
    Abstract As the most critical components in a sentence, subject, predicate and object require special attention in the video captioning task. To implement this idea, we design a novel framework, named COllaborative three-Stream Transformers (COST), to model the three parts separately and complement each other for better representation. Specifically, COST is formed by three branches of transformers to exploit the visual-linguistic interactions of different granularities in spatial-temporal domain between videos and text, detected objects and text, and actions and text. Meanwhile, we propose a cross-granularity attention module to align the interactions modeled by the three branches of transformers, then the three branches of transformers can support each other to exploit the most discriminative semantic information of different granularities for accurate predictions of captions. The whole model is trained in an end-to-end fashion. Extensive experiments conducted on three large-scale challenging datasets, i.e., YouCookII, ActivityNet Captions and MSVD, demonstrate that the proposed method performs favorably against the state-of-the-art methods.

MEDL-U: Uncertainty-aware 3D Automatic Annotator based on Evidential Deep Learning

  • paper_url: http://arxiv.org/abs/2309.09599
  • repo_url: None
  • paper_authors: Helbert Paat, Qing Lian, Weilong Yao, Tong Zhang
  • for: This paper is written for advancing the field of 3D object detection using weakly supervised deep learning methods, and addressing the challenges of pseudo label noise and uncertainty estimation.
  • methods: The paper proposes an Evidential Deep Learning (EDL) based uncertainty estimation framework called MEDL-U, which generates pseudo labels and quantifies uncertainties for 3D object detection. The framework consists of an uncertainty-aware IoU-based loss, an evidence-aware multi-task loss function, and a post-processing stage for uncertainty refinement.
  • results: The paper demonstrates that probabilistic detectors trained using the outputs of MEDL-U surpass deterministic detectors trained using outputs from previous 3D annotators on the KITTI val set for all difficulty levels. Additionally, MEDL-U achieves state-of-the-art results on the KITTI official test set compared to existing 3D automatic annotators.
    Abstract Advancements in deep learning-based 3D object detection necessitate the availability of large-scale datasets. However, this requirement introduces the challenge of manual annotation, which is often both burdensome and time-consuming. To tackle this issue, the literature has seen the emergence of several weakly supervised frameworks for 3D object detection which can automatically generate pseudo labels for unlabeled data. Nevertheless, these generated pseudo labels contain noise and are not as accurate as those labeled by humans. In this paper, we present the first approach that addresses the inherent ambiguities present in pseudo labels by introducing an Evidential Deep Learning (EDL) based uncertainty estimation framework. Specifically, we propose MEDL-U, an EDL framework based on MTrans, which not only generates pseudo labels but also quantifies the associated uncertainties. However, applying EDL to 3D object detection presents three primary challenges: (1) relatively lower pseudolabel quality in comparison to other autolabelers; (2) excessively high evidential uncertainty estimates; and (3) lack of clear interpretability and effective utilization of uncertainties for downstream tasks. We tackle these issues through the introduction of an uncertainty-aware IoU-based loss, an evidence-aware multi-task loss function, and the implementation of a post-processing stage for uncertainty refinement. Our experimental results demonstrate that probabilistic detectors trained using the outputs of MEDL-U surpass deterministic detectors trained using outputs from previous 3D annotators on the KITTI val set for all difficulty levels. Moreover, MEDL-U achieves state-of-the-art results on the KITTI official test set compared to existing 3D automatic annotators.
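
MEDL-U builds on evidential deep learning for regression; under the common Normal-Inverse-Gamma parameterization (assumed here as a stand-in for the paper's exact head), one forward pass yields both aleatoric and epistemic uncertainty per regression target, as sketched below.

```python
import torch
import torch.nn.functional as F

def evidential_outputs(raw):
    """raw: (B, 4) network output per regression target.
    Maps to the Normal-Inverse-Gamma parameters (gamma, nu, alpha, beta)."""
    gamma, log_nu, log_alpha, log_beta = raw.unbind(dim=-1)
    nu = F.softplus(log_nu) + 1e-6
    alpha = F.softplus(log_alpha) + 1.0 + 1e-6    # alpha > 1 keeps variances finite
    beta = F.softplus(log_beta) + 1e-6
    return gamma, nu, alpha, beta

def uncertainties(nu, alpha, beta):
    """Closed-form uncertainties of the NIG predictive distribution."""
    aleatoric = beta / (alpha - 1.0)              # expected data noise
    epistemic = beta / (nu * (alpha - 1.0))       # uncertainty about the mean
    return aleatoric, epistemic

# toy head output for a batch of 3 pseudo-labeled box parameters
torch.manual_seed(0)
raw = torch.randn(3, 4)
gamma, nu, alpha, beta = evidential_outputs(raw)
print(uncertainties(nu, alpha, beta))             # per-sample (aleatoric, epistemic)
```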

Mutual Information-calibrated Conformal Feature Fusion for Uncertainty-Aware Multimodal 3D Object Detection at the Edge

  • paper_url: http://arxiv.org/abs/2309.09593
  • repo_url: None
  • paper_authors: Alex C. Stutts, Danilo Erricolo, Sathya Ravi, Theja Tulabandhula, Amit Ranjan Trivedi
  • for: This paper aims to address the gap in uncertainty quantification in 3D object detection for robotics, by integrating conformal inference and information theoretic measures to provide lightweight and Monte Carlo-free uncertainty estimation.
  • methods: The proposed method uses a multimodal framework that fuses features from RGB camera and LiDAR sensor data, and leverages normalized mutual information as a modulator to calibrate uncertainty bounds derived from conformal inference.
  • results: The simulation results show an inverse correlation between inherent predictive uncertainty and normalized mutual information throughout the model’s training, and the proposed framework demonstrates comparable or better performance in KITTI 3D object detection benchmarks compared to similar methods that are not uncertainty-aware, making it suitable for real-time edge robotics.
    Abstract In the expanding landscape of AI-enabled robotics, robust quantification of predictive uncertainties is of great importance. Three-dimensional (3D) object detection, a critical robotics operation, has seen significant advancements; however, the majority of current works focus only on accuracy and ignore uncertainty quantification. Addressing this gap, our novel study integrates the principles of conformal inference (CI) with information theoretic measures to perform lightweight, Monte Carlo-free uncertainty estimation within a multimodal framework. Through a multivariate Gaussian product of the latent variables in a Variational Autoencoder (VAE), features from RGB camera and LiDAR sensor data are fused to improve the prediction accuracy. Normalized mutual information (NMI) is leveraged as a modulator for calibrating uncertainty bounds derived from CI based on a weighted loss function. Our simulation results show an inverse correlation between inherent predictive uncertainty and NMI throughout the model's training. The framework demonstrates comparable or better performance in KITTI 3D object detection benchmarks to similar methods that are not uncertainty-aware, making it suitable for real-time edge robotics.
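
The sketch below combines the two ingredients named in the abstract: a split-conformal quantile computed on calibration residuals, and a normalized-mutual-information term between (discretized) camera and LiDAR features that widens the bound when the modalities disagree. The specific modulation rule and the feature discretization are assumptions for illustration, not the paper's calibrated procedure.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def conformal_quantile(residuals, alpha=0.1):
    """Split-conformal quantile so intervals cover ~(1 - alpha) of new samples."""
    n = len(residuals)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(np.abs(residuals), level))

def nmi_modulated_bound(residuals, cam_codes, lidar_codes, alpha=0.1):
    """Higher agreement between modalities (NMI) -> tighter bound (assumed rule)."""
    q = conformal_quantile(residuals, alpha)
    nmi = normalized_mutual_info_score(cam_codes, lidar_codes)   # discretized features
    return q * (1.0 + (1.0 - nmi))     # widen the interval when modalities disagree

# toy calibration set: residuals of predicted box centers, plus quantized features
rng = np.random.default_rng(0)
residuals = rng.normal(0.0, 0.2, size=200)
cam_codes = rng.integers(0, 8, size=200)
lidar_codes = (cam_codes + rng.integers(0, 2, size=200)) % 8     # mostly agreeing
print(nmi_modulated_bound(residuals, cam_codes, lidar_codes))
```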

Multi-Semantic Fusion Model for Generalized Zero-Shot Skeleton-Based Action Recognition

  • paper_url: http://arxiv.org/abs/2309.09592
  • repo_url: https://github.com/EHZ9NIWI7/MSF-GZSSAR
  • paper_authors: Ming-Zhe Li, Zhen Jia, Zhang Zhang, Zhanyu Ma, Liang Wang
  • for: To tackle generalized zero-shot skeleton-based action recognition (GZSSAR), a new challenging problem in the computer vision community that requires recognizing actions without any training samples of those classes.
  • methods: A multi-semantic fusion (MSF) model uses two kinds of class-level textual descriptions (action descriptions and motion descriptions) as auxiliary semantic information to improve the generalizability of skeleton features.
  • results: Compared with previous models, MSF performs superiorly on GZSSAR, validating the effectiveness of the approach.
    Abstract Generalized zero-shot skeleton-based action recognition (GZSSAR) is a new challenging problem in computer vision community, which requires models to recognize actions without any training samples. Previous studies only utilize the action labels of verb phrases as the semantic prototypes for learning the mapping from skeleton-based actions to a shared semantic space. However, the limited semantic information of action labels restricts the generalization ability of skeleton features for recognizing unseen actions. In order to solve this dilemma, we propose a multi-semantic fusion (MSF) model for improving the performance of GZSSAR, where two kinds of class-level textual descriptions (i.e., action descriptions and motion descriptions), are collected as auxiliary semantic information to enhance the learning efficacy of generalizable skeleton features. Specially, a pre-trained language encoder takes the action descriptions, motion descriptions and original class labels as inputs to obtain rich semantic features for each action class, while a skeleton encoder is implemented to extract skeleton features. Then, a variational autoencoder (VAE) based generative module is performed to learn a cross-modal alignment between skeleton and semantic features. Finally, a classification module is built to recognize the action categories of input samples, where a seen-unseen classification gate is adopted to predict whether the sample comes from seen action classes or not in GZSSAR. The superior performance in comparisons with previous models validates the effectiveness of the proposed MSF model on GZSSAR.

An Autonomous Vision-Based Algorithm for Interplanetary Navigation

  • paper_url: http://arxiv.org/abs/2309.09590
  • repo_url: None
  • paper_authors: Eleonora Andreis, Paolo Panicucci, Francesco Topputo
  • for: To propose a fully vision-based navigation algorithm for autonomous deep-space probes.
  • methods: The algorithm combines an orbit determination method with an image processing pipeline; a non-dimensional extended Kalman filter serves as the state estimator, fed by planet positions extracted from deep-space images, with an optimal strategy for selecting the best pair of planets to track.
  • results: The algorithm is tested on a high-fidelity Earth-Mars interplanetary transfer, demonstrating its applicability to deep-space navigation.
    Abstract The surge of deep-space probes makes it unsustainable to navigate them with standard radiometric tracking. Self-driving interplanetary satellites represent a solution to this problem. In this work, a full vision-based navigation algorithm is built by combining an orbit determination method with an image processing pipeline suitable for interplanetary transfers of autonomous platforms. To increase the computational efficiency of the algorithm, a non-dimensional extended Kalman filter is selected as state estimator, fed by the positions of the planets extracted from deep-space images. An enhancement of the estimation accuracy is performed by applying an optimal strategy to select the best pair of planets to track. Moreover, a novel analytical measurement model for deep-space navigation is developed providing a first-order approximation of the light-aberration and light-time effects. Algorithm performance is tested on a high-fidelity, Earth--Mars interplanetary transfer, showing the algorithm applicability for deep-space navigation.

RIDE: Self-Supervised Learning of Rotation-Equivariant Keypoint Detection and Invariant Description for Endoscopy

  • paper_url: http://arxiv.org/abs/2309.09563
  • repo_url: None
  • paper_authors: Mert Asim Karaoglu, Viktoria Markova, Nassir Navab, Benjamin Busam, Alexander Ladikos
  • for: To study keypoint detection and description in endoscopic video, where large rotational motions must be handled.
  • methods: The paper proposes RIDE, a learning-based method for rotation-equivariant detection and invariant description; it builds on recent advances in group-equivariant learning and is trained in a self-supervised manner on a large curation of endoscopic images, requiring no manual labels.
  • results: Experiments show that RIDE outperforms previous learning-based and classical methods on surgical tissue tracking and relative pose estimation and remains robust under large rotations.
    Abstract Unlike in natural images, in endoscopy there is no clear notion of an up-right camera orientation. Endoscopic videos therefore often contain large rotational motions, which require keypoint detection and description algorithms to be robust to these conditions. While most classical methods achieve rotation-equivariant detection and invariant description by design, many learning-based approaches learn to be robust only up to a certain degree. At the same time learning-based methods under moderate rotations often outperform classical approaches. In order to address this shortcoming, in this paper we propose RIDE, a learning-based method for rotation-equivariant detection and invariant description. Following recent advancements in group-equivariant learning, RIDE models rotation-equivariance implicitly within its architecture. Trained in a self-supervised manner on a large curation of endoscopic images, RIDE requires no manual labeling of training data. We test RIDE in the context of surgical tissue tracking on the SuPeR dataset as well as in the context of relative pose estimation on a repurposed version of the SCARED dataset. In addition we perform explicit studies showing its robustness to large rotations. Our comparison against recent learning-based and classical approaches shows that RIDE sets a new state-of-the-art performance on matching and relative pose estimation tasks and scores competitively on surgical tissue tracking.

Selective Volume Mixup for Video Action Recognition

  • paper_url: http://arxiv.org/abs/2309.09534
  • repo_url: None
  • paper_authors: Yi Tan, Zhaofan Qiu, Yanbin Hao, Ting Yao, Xiangnan He, Tao Mei
  • for: To improve the generalization of deep models trained on small-scale video datasets.
  • methods: A new video augmentation strategy, Selective Volume Mixup (SV-Mix), selects the most informative volumes from two videos and mixes them to form new training videos.
  • results: Experiments on a wide range of video action recognition benchmarks show that SV-Mix consistently improves generalization on limited training videos for both CNN-based and transformer-based models.
    Abstract The recent advances in Convolutional Neural Networks (CNNs) and Vision Transformers have convincingly demonstrated high learning capability for video action recognition on large datasets. Nevertheless, deep models often suffer from the overfitting effect on small-scale datasets with a limited number of training videos. A common solution is to exploit the existing image augmentation strategies for each frame individually including Mixup, Cutmix, and RandAugment, which are not particularly optimized for video data. In this paper, we propose a novel video augmentation strategy named Selective Volume Mixup (SV-Mix) to improve the generalization ability of deep models with limited training videos. SV-Mix devises a learnable selective module to choose the most informative volumes from two videos and mixes the volumes up to achieve a new training video. Technically, we propose two new modules, i.e., a spatial selective module to select the local patches for each spatial position, and a temporal selective module to mix the entire frames for each timestamp and maintain the spatial pattern. At each time, we randomly choose one of the two modules to expand the diversity of training samples. The selective modules are jointly optimized with the video action recognition framework to find the optimal augmentation strategy. We empirically demonstrate the merits of the SV-Mix augmentation on a wide range of video action recognition benchmarks and consistently boost the performance of both CNN-based and transformer-based models.
    摘要 SV-Mix devises a learnable selective module to choose the most informative volumes from two videos and mixes them to create a new training video. Technically, we propose two new modules: a spatial selective module to select local patches for each spatial position, and a temporal selective module to mix entire frames for each timestamp while maintaining the spatial pattern. At each time, we randomly choose one of the two modules to expand the diversity of training samples. The selective modules are jointly optimized with the video action recognition framework to find the optimal augmentation strategy.We empirically demonstrate the merits of the SV-Mix augmentation on a wide range of video action recognition benchmarks and consistently boost the performances of both CNN-based and transformer-based models.

Decompose Semantic Shifts for Composed Image Retrieval

  • paper_url: http://arxiv.org/abs/2309.09531
  • repo_url: None
  • paper_authors: Xingyu Yang, Daqing Liu, Heng Zhang, Yong Luo, Chaoyue Wang, Jing Zhang
  • for: To improve composed image retrieval, where a reference image serves as the starting point and a user-provided text specifies how to shift toward the desired target image.
  • methods: A Semantic Shift network (SSN) treats the text as instructions and decomposes the semantic shift into two steps, from the reference image to a visual prototype and from the visual prototype to the target image, by splitting the instructions into degradation and upgradation components.
  • results: Experiments show that SSN yields significant improvements of 5.42% on CIRR and 1.37% on FashionIQ, establishing new state-of-the-art performance; code will be made publicly available.
    Abstract Composed image retrieval is a type of image retrieval task where the user provides a reference image as a starting point and specifies a text on how to shift from the starting point to the desired target image. However, most existing methods focus on the composition learning of text and reference images and oversimplify the text as a description, neglecting the inherent structure and the user's shifting intention of the texts. As a result, these methods typically take shortcuts that disregard the visual cue of the reference images. To address this issue, we reconsider the text as instructions and propose a Semantic Shift network (SSN) that explicitly decomposes the semantic shifts into two steps: from the reference image to the visual prototype and from the visual prototype to the target image. Specifically, SSN explicitly decomposes the instructions into two components: degradation and upgradation, where the degradation is used to picture the visual prototype from the reference image, while the upgradation is used to enrich the visual prototype into the final representations to retrieve the desired target image. The experimental results show that the proposed SSN demonstrates a significant improvement of 5.42% and 1.37% on the CIRR and FashionIQ datasets, respectively, and establishes a new state-of-the-art performance. Codes will be publicly available.

Instant Photorealistic Style Transfer: A Lightweight and Adaptive Approach

  • paper_url: http://arxiv.org/abs/2309.10011
  • repo_url: None
  • paper_authors: Rong Liu, Enyu Zhao, Zhiyuan Liu, Andrew Wei-Wen Feng, Scott John Easley
  • for: To achieve instant photorealistic style transfer (IPST) on super-resolution inputs without pre-training on pair-wise datasets or imposing extra constraints.
  • methods: A lightweight StyleNet transfers style from a style image to a content image while preserving non-color information, and an instance-adaptive optimization prioritizes photorealism and accelerates convergence so training completes within seconds.
  • results: Experiments show that IPST handles multi-frame style transfer while retaining temporal and multi-view consistency, requires less GPU memory, transfers multi-frame inputs faster, and produces photorealistic outputs, making it a promising solution for a range of photorealistic transfer applications.
    Abstract In this paper, we propose an Instant Photorealistic Style Transfer (IPST) approach, designed to achieve instant photorealistic style transfer on super-resolution inputs without the need for pre-training on pair-wise datasets or imposing extra constraints. Our method utilizes a lightweight StyleNet to enable style transfer from a style image to a content image while preserving non-color information. To further enhance the style transfer process, we introduce an instance-adaptive optimization to prioritize the photorealism of outputs and accelerate the convergence of the style network, leading to a rapid training completion within seconds. Moreover, IPST is well-suited for multi-frame style transfer tasks, as it retains temporal and multi-view consistency of the multi-frame inputs such as video and Neural Radiance Field (NeRF). Experimental results demonstrate that IPST requires less GPU memory usage, offers faster multi-frame transfer speed, and generates photorealistic outputs, making it a promising solution for various photorealistic transfer applications.
    摘要 在这篇论文中,我们提出了一种即时照片级真实感风格迁移(IPST)方法,旨在对超分辨率输入实现即时的照片级真实感风格迁移,无需在成对数据集上预训练,也不施加额外约束。我们的方法利用轻量级的 StyleNet,在保留非颜色信息的同时将风格图像的风格迁移到内容图像上。为进一步优化风格迁移过程,我们引入实例自适应优化,优先保证输出的真实感并加速风格网络的收敛,使训练在数秒内完成。此外,IPST 很适合多帧风格迁移任务,能够保持视频和神经辐射场(NeRF)等多帧输入的时间与多视角一致性。实验结果表明,IPST 占用更少的 GPU 内存、具有更快的多帧迁移速度,并能生成照片级真实感的输出,是各类真实感风格迁移应用的一个有前景的解决方案。

NOMAD: A Natural, Occluded, Multi-scale Aerial Dataset, for Emergency Response Scenarios

  • paper_url: http://arxiv.org/abs/2309.09518
  • repo_url: None
  • paper_authors: Arturo Miguel Russell Bernal, Walter Scheirer, Jane Cleland-Huang
  • for: 这个论文是为了提高小无人飞行系统(sUAS)在紧急应急场景中的搜索和救援任务中的计算机视觉能力。
  • methods: 该论文使用了多种计算机视觉技术来检测人体,并在不同的飞行距离下进行了测试。
  • results: 该论文提供了一个新的基准数据集,可用于在遮挡的空中视角下评估人体检测模型,涵盖多种人体动作与隐藏情形,有助于提高空中搜救的有效性。
    Abstract With the increasing reliance on small Unmanned Aerial Systems (sUAS) for Emergency Response Scenarios, such as Search and Rescue, the integration of computer vision capabilities has become a key factor in mission success. Nevertheless, computer vision performance for detecting humans severely degrades when shifting from ground to aerial views. Several aerial datasets have been created to mitigate this problem, however, none of them has specifically addressed the issue of occlusion, a critical component in Emergency Response Scenarios. Natural Occluded Multi-scale Aerial Dataset (NOMAD) presents a benchmark for human detection under occluded aerial views, with five different aerial distances and rich imagery variance. NOMAD is composed of 100 different Actors, all performing sequences of walking, laying and hiding. It includes 42,825 frames, extracted from 5.4k resolution videos, and manually annotated with a bounding box and a label describing 10 different visibility levels, categorized according to the percentage of the human body visible inside the bounding box. This allows computer vision models to be evaluated on their detection performance across different ranges of occlusion. NOMAD is designed to improve the effectiveness of aerial search and rescue and to enhance collaboration between sUAS and humans, by providing a new benchmark dataset for human detection under occluded aerial views.
    摘要 随着小型无人机系统(sUAS)在搜救等紧急响应场景中的应用日益增多,计算机视觉能力的集成已成为任务成功的关键因素。然而,从地面视角转到空中视角时,人体检测的性能会严重下降。为缓解这一问题,已有若干空中数据集被提出,但没有一个专门针对遮挡问题,而遮挡正是紧急响应场景中的关键因素。自然遮挡多尺度空中数据集(NOMAD)为遮挡空中视角下的人体检测提供了一个基准,包含五种不同的航拍距离和丰富的图像变化。NOMAD 由 100 名不同的演员组成,每人执行行走、躺卧和躲藏的动作序列;共包含 42,825 帧图像,取自 5.4k 分辨率视频,并人工标注了边界框以及描述 10 个可见度等级的标签,等级按边界框内可见人体的比例划分。这使得计算机视觉模型可以在不同遮挡程度下评估其检测性能。NOMAD 旨在提升空中搜救的有效性,并通过提供遮挡空中视角下人体检测的新基准数据集,促进 sUAS 与人类之间的协作。
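
The abstract mentions ten visibility levels categorized by the percentage of the body visible inside the bounding box. A trivial helper like the one below shows how such a label could be derived from an annotation; the uniform 10% bins are an assumption, since NOMAD's exact level boundaries are not given here.

```python
def visibility_level(visible_fraction: float, num_levels: int = 10) -> int:
    """Map the fraction of the body visible inside the box to a level in 1..num_levels.

    Uniform bins are assumed here; NOMAD's published boundaries may differ.
    """
    if not 0.0 <= visible_fraction <= 1.0:
        raise ValueError("visible_fraction must be in [0, 1]")
    return min(num_levels, int(visible_fraction * num_levels) + 1)

# Example: bucket annotations so a detector can be scored per occlusion range.
annotations = [{"visible": 0.12}, {"visible": 0.55}, {"visible": 0.97}]
print([visibility_level(a["visible"]) for a in annotations])  # [2, 6, 10]
```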

Sparse and Privacy-enhanced Representation for Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2309.09515
  • repo_url: None
  • paper_authors: Ting-Ying Lin, Lin-Yung Hsieh, Fu-En Wang, Wen-Shen Wuen, Min Sun
  • for: 提高人姿估计(HPE)的隐私和效率。
  • methods: 使用专有的运动向量传感器(MVS)提取边缘图像和双向运动向量图像,并利用通常用于 3D 体素的稀疏卷积的最新进展,高效处理所提出的稀疏表示。
  • results: 提出一种稀疏且增强隐私的 HPE 表示方式,实现约 13 倍的加速和 96% 的 FLOPs 降低;该方法优于仅使用边缘或运动向量的单一模态,并通过在 CelebA 上的人脸识别实验和自建数据集上的用户研究验证了稀疏表示的隐私增强质量。
    Abstract We propose a sparse and privacy-enhanced representation for Human Pose Estimation (HPE). Given a perspective camera, we use a proprietary motion vector sensor(MVS) to extract an edge image and a two-directional motion vector image at each time frame. Both edge and motion vector images are sparse and contain much less information (i.e., enhancing human privacy). We advocate that edge information is essential for HPE, and motion vectors complement edge information during fast movements. We propose a fusion network leveraging recent advances in sparse convolution used typically for 3D voxels to efficiently process our proposed sparse representation, which achieves about 13x speed-up and 96% reduction in FLOPs. We collect an in-house edge and motion vector dataset with 16 types of actions by 40 users using the proprietary MVS. Our method outperforms individual modalities using only edge or motion vector images. Finally, we validate the privacy-enhanced quality of our sparse representation through face recognition on CelebA (a large face dataset) and a user study on our in-house dataset.
    摘要 我们提出了一种稀疏且增强隐私的人体姿态估计(HPE)表示方法。给定一个透视相机,我们使用专有的运动向量传感器(MVS)在每个时间帧提取边缘图像和双向运动向量图像。边缘图像和运动向量图像都是稀疏的,所含信息远少于原始图像(从而增强了人的隐私)。我们认为边缘信息对 HPE 至关重要,而运动向量在快速运动时对边缘信息起补充作用。我们提出了一种融合网络,利用通常用于 3D 体素的稀疏卷积的最新进展来高效处理所提出的稀疏表示,实现了约 13 倍的加速和 96% 的 FLOPs 降低。我们使用专有 MVS 采集了包含 40 名用户、16 类动作的自建边缘与运动向量数据集。我们的方法优于仅使用边缘或运动向量图像的单一模态。最后,我们通过在 CelebA(大规模人脸数据集)上的人脸识别实验和在自建数据集上的用户研究,验证了所提稀疏表示的隐私增强质量。

PanoMixSwap Panorama Mixing via Structural Swapping for Indoor Scene Understanding

  • paper_url: http://arxiv.org/abs/2309.09514
  • repo_url: None
  • paper_authors: Yu-Cheng Hsieh, Cheng Sun, Suraj Dengale, Min Sun
  • for: 增加现代深度学习方法的训练数据量和多样性,以提高indoor scene理解能力。
  • methods: 提议了PanoMixSwap数据增强技术,通过混合不同背景风格、前景家具和房间布局,从现有的indoor panorama数据集中生成新的多样化panoramic图像。
  • results: 实验表明,使用 PanoMixSwap 训练的模型在室内场景理解任务(语义分割和布局估计)上均获得显著且一致的性能提升,持续优于原始设置。
    Abstract The volume and diversity of training data are critical for modern deep learningbased methods. Compared to the massive amount of labeled perspective images, 360 panoramic images fall short in both volume and diversity. In this paper, we propose PanoMixSwap, a novel data augmentation technique specifically designed for indoor panoramic images. PanoMixSwap explicitly mixes various background styles, foreground furniture, and room layouts from the existing indoor panorama datasets and generates a diverse set of new panoramic images to enrich the datasets. We first decompose each panoramic image into its constituent parts: background style, foreground furniture, and room layout. Then, we generate an augmented image by mixing these three parts from three different images, such as the foreground furniture from one image, the background style from another image, and the room structure from the third image. Our method yields high diversity since there is a cubical increase in image combinations. We also evaluate the effectiveness of PanoMixSwap on two indoor scene understanding tasks: semantic segmentation and layout estimation. Our experiments demonstrate that state-of-the-art methods trained with PanoMixSwap outperform their original setting on both tasks consistently.
    摘要 训练数据的规模和多样性对现代深度学习方法至关重要。与海量有标注的透视图像相比,360 度全景图像在数量和多样性上都有所欠缺。在本文中,我们提出了 PanoMixSwap,一种专为室内全景图像设计的数据增强技术。PanoMixSwap 显式地混合现有室内全景数据集中的不同背景风格、前景家具和房间布局,生成多样化的新全景图像来丰富数据集。我们首先将每张全景图像分解为其组成部分:背景风格、前景家具和房间布局;然后从三张不同图像中各取一部分进行混合生成增强图像,例如前景家具取自一张图像、背景风格取自另一张图像、房间结构取自第三张图像。由于图像组合数随之呈立方级增长,我们的方法能够产生很高的多样性。我们还在两个室内场景理解任务(语义分割和布局估计)上评估了 PanoMixSwap 的有效性。实验表明,使用 PanoMixSwap 训练的最新方法在两个任务上均持续优于其原始设置。
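
The core augmentation is a three-way recombination of background style, foreground furniture, and room layout. The snippet below sketches only the naive mask-based compositing step to make the idea concrete; the structural alignment to a third panorama's layout, which is the paper's actual contribution, is not reproduced.

```python
import numpy as np

def mix_panorama(style_bg: np.ndarray, furniture_src: np.ndarray,
                 furniture_mask: np.ndarray) -> np.ndarray:
    """Paste furniture pixels from one panorama onto the background of another.

    Only the naive compositing step; aligning both sources to a third panorama's
    room layout (the 'structural swapping') is not reproduced here.
    """
    assert style_bg.shape == furniture_src.shape
    out = style_bg.copy()
    out[furniture_mask] = furniture_src[furniture_mask]
    return out

rng = np.random.default_rng(0)
h, w = 256, 512
aug = mix_panorama(rng.integers(0, 255, (h, w, 3), dtype=np.uint8),
                   rng.integers(0, 255, (h, w, 3), dtype=np.uint8),
                   rng.random((h, w)) > 0.9)
print(aug.shape)  # (256, 512, 3)
```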

Learning Parallax for Stereo Event-based Motion Deblurring

  • paper_url: http://arxiv.org/abs/2309.09513
  • repo_url: None
  • paper_authors: Mingyuan Lin, Chi Zhang, Chu He, Lei Yu
  • for: 这项研究旨在利用事件相机补充丢失的信息,从而提升图像运动去模糊的效果。
  • methods: 提出了一种新的由粗到细框架,即基于立体事件与强度相机的事件驱动去模糊网络(St-EDNet),可以直接从未对齐的输入中恢复高质量图像。
  • results: 实验结果表明,所提方法在真实数据集上优于现有的 state-of-the-art 方法。
    Abstract Due to the extremely low latency, events have been recently exploited to supplement lost information for motion deblurring. Existing approaches largely rely on the perfect pixel-wise alignment between intensity images and events, which is not always fulfilled in the real world. To tackle this problem, we propose a novel coarse-to-fine framework, named NETwork of Event-based motion Deblurring with STereo event and intensity cameras (St-EDNet), to recover high-quality images directly from the misaligned inputs, consisting of a single blurry image and the concurrent event streams. Specifically, the coarse spatial alignment of the blurry image and the event streams is first implemented with a cross-modal stereo matching module without the need for ground-truth depths. Then, a dual-feature embedding architecture is proposed to gradually build the fine bidirectional association of the coarsely aligned data and reconstruct the sequence of the latent sharp images. Furthermore, we build a new dataset with STereo Event and Intensity Cameras (StEIC), containing real-world events, intensity images, and dense disparity maps. Experiments on real-world datasets demonstrate the superiority of the proposed network over state-of-the-art methods.
    摘要 由于事件相机具有极低的延迟,事件数据最近被用来补充运动去模糊中丢失的信息。现有方法大多依赖强度图像与事件之间逐像素的完美对齐,而这在现实世界中并不总能满足。为解决这一问题,我们提出了一种新的由粗到细框架,即基于立体事件与强度相机的事件驱动去模糊网络(St-EDNet),可以直接从未对齐的输入(单张模糊图像及同时采集的事件流)中恢复高质量图像。具体来说,首先通过一个无需真值深度的跨模态立体匹配模块,实现模糊图像与事件流的粗略空间对齐;然后提出一种双特征嵌入结构,逐步建立粗对齐数据之间的精细双向关联,并重建潜在清晰图像序列。此外,我们构建了一个新的立体事件与强度相机数据集(StEIC),包含真实世界的事件、强度图像和稠密视差图。在真实数据集上的实验表明,所提网络优于现有的最先进方法。

RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision

  • paper_url: http://arxiv.org/abs/2309.09502
  • repo_url: None
  • paper_authors: Mingjie Pan, Jiaming Liu, Renrui Zhang, Peixiang Huang, Xiaoqi Li, Li Liu, Shanghang Zhang
  • for: 本研究旨在提出一种基于2D标签的多视图3D占用预测方法,以便降低高成本的3D占用标签生成过程中的限制。
  • methods: 我们提出了一种基于 NeRF 思想的方法,将多视角图像转化为 3D 体素表示,并通过体渲染技术生成 2D 渲染结果,从而可以直接利用 2D 语义与深度标签提供 3D 占用监督。此外,我们还提出了一种辅助射线(Auxiliary Ray)方法,通过组合连续帧为每个目标构建更完整的 2D 渲染,以缓解自动驾驶场景中视角稀疏的问题。
  • results: 我们的实验结果表明,RenderOcc可以与完全使用3D标签进行supervision的模型相比,达到相似的性能,这证明了这种方法在实际应用中的重要性。
    Abstract 3D occupancy prediction holds significant promise in the fields of robot perception and autonomous driving, which quantifies 3D scenes into grid cells with semantic labels. Recent works mainly utilize complete occupancy labels in 3D voxel space for supervision. However, the expensive annotation process and sometimes ambiguous labels have severely constrained the usability and scalability of 3D occupancy models. To address this, we present RenderOcc, a novel paradigm for training 3D occupancy models only using 2D labels. Specifically, we extract a NeRF-style 3D volume representation from multi-view images, and employ volume rendering techniques to establish 2D renderings, thus enabling direct 3D supervision from 2D semantics and depth labels. Additionally, we introduce an Auxiliary Ray method to tackle the issue of sparse viewpoints in autonomous driving scenarios, which leverages sequential frames to construct comprehensive 2D rendering for each object. To our best knowledge, RenderOcc is the first attempt to train multi-view 3D occupancy models only using 2D labels, reducing the dependence on costly 3D occupancy annotations. Extensive experiments demonstrate that RenderOcc achieves comparable performance to models fully supervised with 3D labels, underscoring the significance of this approach in real-world applications.
    摘要 三维占用预测在机器人感知和自动驾驶领域前景广阔,它将三维场景量化为带有语义标签的网格单元。近期工作主要利用 3D 体素空间中完整的占用标签进行监督。然而,昂贵的标注过程和有时含糊的标签严重限制了 3D 占用模型的可用性和可扩展性。为解决这一问题,我们提出了 RenderOcc,一种仅使用 2D 标签训练 3D 占用模型的新范式。具体而言,我们从多视角图像中提取 NeRF 风格的 3D 体素表示,并采用体渲染技术生成 2D 渲染结果,从而可以直接利用 2D 语义和深度标签进行 3D 监督。此外,我们引入了辅助射线(Auxiliary Ray)方法来应对自动驾驶场景中视角稀疏的问题,该方法利用连续帧为每个目标构建更完整的 2D 渲染。据我们所知,RenderOcc 是首个仅使用 2D 标签训练多视角 3D 占用模型的尝试,降低了对昂贵 3D 占用标注的依赖。大量实验表明,RenderOcc 取得了与完全使用 3D 标签监督的模型相当的性能,凸显了这一方法在实际应用中的意义。
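
RenderOcc's 2D supervision relies on standard NeRF-style volume rendering: densities and semantics sampled along a camera ray are composited into a pixel's depth and class distribution, which can then be compared against 2D labels. The function below is a generic, self-contained version of that rendering step; sample counts, variable names, and the softmax choice are illustrative, not the paper's implementation.

```python
import numpy as np

def render_ray(sigma: np.ndarray, sem_logits: np.ndarray, t: np.ndarray):
    """Composite densities/semantics sampled along one ray into 2D predictions.

    sigma      : (N,) non-negative densities at N samples along the ray
    sem_logits : (N, C) per-sample semantic logits
    t          : (N,) sample depths along the ray (increasing)
    Returns (expected_depth, semantic_probabilities) for the pixel this ray hits.
    """
    delta = np.diff(t, append=t[-1] + 1e10)                         # spacing between samples
    alpha = 1.0 - np.exp(-sigma * delta)                            # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))   # transmittance before each sample
    weights = trans * alpha                                         # rendering weights, sum <= 1
    depth = float((weights * t).sum())
    probs = np.exp(sem_logits) / np.exp(sem_logits).sum(axis=1, keepdims=True)
    sem = weights @ probs                                           # (C,) rendered class probabilities
    return depth, sem

sigma = np.array([0.0, 0.1, 2.0, 5.0])
logits = np.random.default_rng(0).normal(size=(4, 3))
depth, sem = render_ray(sigma, logits, t=np.linspace(0.5, 4.0, 4))
print(round(depth, 3), sem.round(3))
```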

Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation

  • paper_url: http://arxiv.org/abs/2309.09501
  • repo_url: None
  • paper_authors: Shaofei Huang, Han Li, Yuqing Wang, Hongji Zhu, Jiao Dai, Jizhong Han, Wenge Rong, Si Liu
  • for: Audio visual segmentation (AVS) aims to segment the sounding objects for each frame of a given video.
  • methods: The proposed method uses an Audio-Queried Transformer architecture (AQFormer) to establish explicit object-level semantic correspondence between audio and visual modalities, and to exchange sounding object-relevant information among multiple frames.
  • results: The method achieves state-of-the-art performances on two AVS benchmarks, with 7.1% M_J and 7.6% M_F gains on the MS3 setting.
    Abstract Audio visual segmentation (AVS) aims to segment the sounding objects for each frame of a given video. To distinguish the sounding objects from silent ones, both audio-visual semantic correspondence and temporal interaction are required. The previous method applies multi-frame cross-modal attention to conduct pixel-level interactions between audio features and visual features of multiple frames simultaneously, which is both redundant and implicit. In this paper, we propose an Audio-Queried Transformer architecture, AQFormer, where we define a set of object queries conditioned on audio information and associate each of them to particular sounding objects. Explicit object-level semantic correspondence between audio and visual modalities is established by gathering object information from visual features with predefined audio queries. Besides, an Audio-Bridged Temporal Interaction module is proposed to exchange sounding object-relevant information among multiple frames with the bridge of audio features. Extensive experiments are conducted on two AVS benchmarks to show that our method achieves state-of-the-art performances, especially 7.1% M_J and 7.6% M_F gains on the MS3 setting.
    摘要 Audio visual segmentation (AVS) 目标是为每帧视频中的声音对象进行分割。为了将声音对象与无声对象区分开来,需要同时使用音频 Semantic 相关性和时间互动。先前的方法使用多帧交叉模态注意力来同时进行多帧音频和视觉特征之间的像素级交互,这是一种重复和隐式的。在这篇论文中,我们提出了一种叫做 Audio-Queried Transformer 架构(AQFormer),其中我们定义了基于音频信息的对象查询集,并将每个查询与特定的声音对象相关联。在视觉特征中收集对应的对象信息,并通过固定的音频查询来确立明确的音频Semantic 相关性。此外,我们还提出了一种叫做 Audio-Bridged Temporal Interaction 模块,用于在多帧中交换与声音相关的信息,通过音频特征作为桥接。我们在两个 AVS 标准测试集上进行了广泛的实验,并证明了我们的方法可以达到当前最佳性能,特别是在 MS3 设定上提高了7.1% M_J 和 7.6% M_F。
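
The audio-queried idea can be illustrated in a few lines: a fixed set of learnable object queries is shifted by an audio embedding and then attends over flattened visual features. The sketch below uses torch.nn.MultiheadAttention with assumed feature sizes; AQFormer's actual decoder stack, losses, and audio-bridged temporal module are not reproduced.

```python
import torch
import torch.nn as nn

class AudioQueriedAttention(nn.Module):
    """Object queries conditioned on audio attend over per-frame visual features.

    A sketch of the audio-queried idea only, not the AQFormer architecture.
    """
    def __init__(self, dim: int = 256, num_queries: int = 8, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.audio_proj = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, HW, dim) flattened frame features; audio: (B, dim) audio embedding
        q = self.queries.unsqueeze(0) + self.audio_proj(audio).unsqueeze(1)  # (B, Q, dim)
        obj_feats, _ = self.attn(query=q, key=visual, value=visual)
        return obj_feats  # (B, Q, dim) sounding-object descriptors

if __name__ == "__main__":
    m = AudioQueriedAttention()
    print(m(torch.randn(2, 14 * 14, 256), torch.randn(2, 256)).shape)  # (2, 8, 256)
```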

Target-aware Bi-Transformer for Few-shot Segmentation

  • paper_url: http://arxiv.org/abs/2309.09492
  • repo_url: None
  • paper_authors: Xianglin Wang, Xiaoliu Luo, Taiping Zhang
  • for: The paper proposes a new few-shot semantic segmentation (FSS) method that uses a limited number of labeled support images to segment objects of new classes.
  • methods: The proposed method, called Target-aware Bi-Transformer Network (TBTNet), uses a novel Target-aware Transformer Layer (TTL) to distill correlations and focus on foreground information. The model treats the hypercorrelation as a feature, resulting in a significant reduction in the number of feature channels.
  • results: The proposed method achieves state-of-the-art performance on the standard FSS benchmarks PASCAL-5i and COCO-20i, with only 0.4M learnable parameters, and converges in 10% to 25% of the training epochs required by traditional methods.
    Abstract Traditional semantic segmentation tasks require a large number of labels and are difficult to identify unlearned categories. Few-shot semantic segmentation (FSS) aims to use limited labeled support images to identify the segmentation of new classes of objects, which is very practical in the real world. Previous researches were primarily based on prototypes or correlations. Due to colors, textures, and styles are similar in the same image, we argue that the query image can be regarded as its own support image. In this paper, we proposed the Target-aware Bi-Transformer Network (TBTNet) to equivalent treat of support images and query image. A vigorous Target-aware Transformer Layer (TTL) also be designed to distill correlations and force the model to focus on foreground information. It treats the hypercorrelation as a feature, resulting a significant reduction in the number of feature channels. Benefit from this characteristic, our model is the lightest up to now with only 0.4M learnable parameters. Futhermore, TBTNet converges in only 10% to 25% of the training epochs compared to traditional methods. The excellent performance on standard FSS benchmarks of PASCAL-5i and COCO-20i proves the efficiency of our method. Extensive ablation studies were also carried out to evaluate the effectiveness of Bi-Transformer architecture and TTL.
    摘要 传统的语义分割任务需要大量标签,并且难以识别未学习过的类别。小样本语义分割(FSS)旨在利用少量有标注的支持图像来分割新类别的目标,这在现实世界中非常实用。先前的研究主要基于原型或相关性。由于同一张图像中的颜色、纹理和风格相似,我们认为查询图像可以被视为它自己的支持图像。在这篇论文中,我们提出了目标感知双 Transformer 网络(TBTNet),对支持图像和查询图像进行等同处理。我们还设计了一个强大的目标感知 Transformer 层(TTL),用于提炼相关性并迫使模型关注前景信息。它将超相关(hypercorrelation)视为一种特征,从而显著减少了特征通道数。得益于这一特性,我们的模型是目前最轻量的,只有 0.4M 可学习参数。此外,与传统方法相比,TBTNet 仅需 10% 到 25% 的训练轮数即可收敛。在标准 FSS 基准 PASCAL-5i 和 COCO-20i 上的出色表现证明了我们方法的高效性。我们还进行了大量消融实验,以评估双 Transformer 架构和 TTL 的有效性。

Self-supervised TransUNet for Ultrasound regional segmentation of the distal radius in children

  • paper_url: http://arxiv.org/abs/2309.09490
  • repo_url: None
  • paper_authors: Yuyue Zhou, Jessica Knight, Banafshe Felfeliyan, Christopher Keen, Abhilash Rakkunedeth Hareendranathan, Jacob L. Jaremko
  • for: 这篇论文旨在应用自监督学习(SSL)方法来减少医疗影像标注数量,以推动医疗影像分割与诊断的自动化分析。
  • methods: 论文使用了面向 SSL 的掩码自编码器(SSL-MAE)方法,并调整其嵌入和损失函数,以改进下游结果。
  • results: 研究发现,仅用 SSL-MAE 预训练 TransUNet 的嵌入和编码器,在下游分割任务上并不优于未经 SSL-MAE 预训练的 TransUNet。
    Abstract Supervised deep learning offers great promise to automate analysis of medical images from segmentation to diagnosis. However, their performance highly relies on the quality and quantity of the data annotation. Meanwhile, curating large annotated datasets for medical images requires a high level of expertise, which is time-consuming and expensive. Recently, to quench the thirst for large data sets with high-quality annotation, self-supervised learning (SSL) methods using unlabeled domain-specific data, have attracted attention. Therefore, designing an SSL method that relies on minimal quantities of labeled data has far-reaching significance in medical images. This paper investigates the feasibility of deploying the Masked Autoencoder for SSL (SSL-MAE) of TransUNet, for segmenting bony regions from children's wrist ultrasound scans. We found that changing the embedding and loss function in SSL-MAE can produce better downstream results compared to the original SSL-MAE. In addition, we determined that only pretraining TransUNet embedding and encoder with SSL-MAE does not work as well as TransUNet without SSL-MAE pretraining on downstream segmentation tasks.
    摘要 有监督的深度学习在医疗影像从分割到诊断的自动化分析中展现出巨大潜力,但其性能高度依赖数据标注的质量和数量;同时,为医疗影像构建大规模标注数据集需要很高的专业水平,既耗时又昂贵。近来,为满足对大规模高质量标注数据的需求,利用无标注领域数据的自监督学习(SSL)方法受到关注。因此,设计一种仅依赖极少量标注数据的 SSL 方法,对医疗影像具有深远意义。本文研究了将 TransUNet 的掩码自编码器自监督学习(SSL-MAE)用于儿童腕部超声图像骨骼区域分割的可行性。我们发现,修改 SSL-MAE 的嵌入和损失函数可以取得比原始 SSL-MAE 更好的下游结果;此外,仅用 SSL-MAE 预训练 TransUNet 的嵌入和编码器,在下游分割任务上的表现不如不经 SSL-MAE 预训练的 TransUNet。
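
For readers unfamiliar with SSL-MAE, the pretraining objective this entry refers to is masked-autoencoder reconstruction: hide most patches, encode only the visible ones, and compute a reconstruction loss on the hidden ones. The toy snippet below illustrates that objective with linear stand-ins for the encoder and decoder; the patch size, mask ratio, and pooled-context decoder are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

patch, mask_ratio = 16, 0.75
img = torch.randn(2, 3, 224, 224)

# split into flattened patches: (B, N, patch*patch*3)
p = img.unfold(2, patch, patch).unfold(3, patch, patch)
p = p.permute(0, 2, 3, 1, 4, 5).reshape(img.shape[0], -1, 3 * patch * patch)
B, N, D = p.shape

# random masking: keep 25% of patches visible
order = torch.rand(B, N).argsort(dim=1)
keep = int(N * (1 - mask_ratio))
vis_idx, mask_idx = order[:, :keep], order[:, keep:]
visible = torch.gather(p, 1, vis_idx[:, :, None].expand(-1, -1, D))

# toy encoder/decoder standing in for a transformer encoder plus a light decoder
encoder = nn.Linear(D, 128)
decoder = nn.Linear(128, D)
latent = encoder(visible)                                 # encode visible patches only
context = latent.mean(dim=1, keepdim=True)                # pooled context (illustrative)
pred = decoder(context.expand(-1, mask_idx.shape[1], -1)) # predict the masked patches

target = torch.gather(p, 1, mask_idx[:, :, None].expand(-1, -1, D))
loss = nn.functional.mse_loss(pred, target)               # MSE only on masked patches, as in MAE
print(loss.item())
```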

Distributional Estimation of Data Uncertainty for Surveillance Face Anti-spoofing

  • paper_url: http://arxiv.org/abs/2309.09485
  • repo_url: None
  • paper_authors: Mouxiao Huang
  • for: This paper aims to improve the security of face anti-spoofing (FAS) systems in long-distance surveillance scenarios, which are often characterized by low-quality face images and high levels of data uncertainty.
  • methods: The proposed method, called Distributional Estimation (DisE), models data uncertainty during training to improve the stability and accuracy of FAS systems. DisE adjusts the learning strength of clean and noisy samples to enhance performance.
  • results: The proposed method was evaluated on a large-scale and challenging FAS dataset (SuHiFiMask) and achieved comparable performance on both ACER and AUC metrics, indicating its effectiveness in improving the security of FAS systems in long-distance surveillance scenarios.
    Abstract Face recognition systems have become increasingly vulnerable to security threats in recent years, prompting the use of Face Anti-spoofing (FAS) to protect against various types of attacks, such as phone unlocking, face payment, and self-service security inspection. While FAS has demonstrated its effectiveness in traditional settings, securing it in long-distance surveillance scenarios presents a significant challenge. These scenarios often feature low-quality face images, necessitating the modeling of data uncertainty to improve stability under extreme conditions. To address this issue, this work proposes Distributional Estimation (DisE), a method that converts traditional FAS point estimation to distributional estimation by modeling data uncertainty during training, including feature (mean) and uncertainty (variance). By adjusting the learning strength of clean and noisy samples for stability and accuracy, the learned uncertainty enhances DisE's performance. The method is evaluated on SuHiFiMask [1], a large-scale and challenging FAS dataset in surveillance scenarios. Results demonstrate that DisE achieves comparable performance on both ACER and AUC metrics.
    摘要 近年来,人脸识别系统日益容易受到安全威胁,因此人脸防伪(FAS)被用于抵御手机解锁、刷脸支付和自助安检等场景下的各类攻击。虽然 FAS 在传统场景下已证明有效,但在长距离监控场景中保证其安全性仍是一项重大挑战。这类场景中的人脸图像质量通常较低,需要对数据不确定性进行建模,以提升极端条件下的稳定性。为解决这一问题,本文提出了分布估计(DisE)方法,通过在训练过程中对数据不确定性(包括特征均值与不确定性方差)建模,将传统 FAS 的点估计转化为分布估计。通过调整干净样本与噪声样本的学习强度以兼顾稳定性和准确性,学习到的不确定性提升了 DisE 的性能。该方法在大规模且具有挑战性的监控场景 FAS 数据集 SuHiFiMask [1] 上进行了评估,结果表明 DisE 在 ACER 和 AUC 指标上均取得了可比的性能。
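
Modelling each sample as a distribution (a feature mean plus an uncertainty variance) and letting the variance modulate the loss is commonly realized with a heteroscedastic formulation such as the one below. It is offered as a generic illustration of distributional estimation with uncertainty-weighted learning, not as DisE's exact loss.

```python
import torch
import torch.nn as nn

class DistributionalHead(nn.Module):
    """Predict a class score (mean) and a log-variance (uncertainty) per sample."""
    def __init__(self, in_dim: int = 512, num_classes: int = 2):
        super().__init__()
        self.mu = nn.Linear(in_dim, num_classes)
        self.log_var = nn.Linear(in_dim, 1)

    def forward(self, feats: torch.Tensor):
        return self.mu(feats), self.log_var(feats).squeeze(-1)

def uncertainty_weighted_ce(logits, log_var, labels):
    """Noisy (high-variance) samples contribute less; the +log_var term keeps the
    network from inflating variance everywhere. Illustrative formulation only."""
    ce = nn.functional.cross_entropy(logits, labels, reduction="none")
    return (torch.exp(-log_var) * ce + log_var).mean()

head = DistributionalHead()
feats, labels = torch.randn(8, 512), torch.randint(0, 2, (8,))
logits, log_var = head(feats)
print(uncertainty_weighted_ce(logits, log_var, labels).item())
```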

An Accurate and Efficient Neural Network for OCTA Vessel Segmentation and a New Dataset

  • paper_url: http://arxiv.org/abs/2309.09483
  • repo_url: None
  • paper_authors: Haojian Ning, Chengliang Wang, Xinrun Chen, Shiying Li
  • for: 本研究利用非侵入式的光学相干断层扫描血管成像(OCTA)技术,刻画高分辨率的视网膜血管网络。
  • methods: 该研究提出了一种准确且高效的神经网络方法,用于Retinal vessel segmentation在OCTA图像中。该方法通过应用修改后的Recurrent ConvNeXt块,实现了与其他SOTA方法相当的准确率,同时具有更少的参数和更快的推理速度(例如110倍轻量级和1.3倍快于U-Net),适用于工业应用。
  • results: 该研究创建了918张OCTA图像和其相应的血管注释集。这个数据集使用Segment Anything Model(SAM)进行 semi-自动注释,大大提高了注释速度。
    Abstract Optical coherence tomography angiography (OCTA) is a noninvasive imaging technique that can reveal high-resolution retinal vessels. In this work, we propose an accurate and efficient neural network for retinal vessel segmentation in OCTA images. The proposed network achieves accuracy comparable to other SOTA methods, while having fewer parameters and faster inference speed (e.g. 110x lighter and 1.3x faster than U-Net), which is very friendly for industrial applications. This is achieved by applying the modified Recurrent ConvNeXt Block to a full resolution convolutional network. In addition, we create a new dataset containing 918 OCTA images and their corresponding vessel annotations. The data set is semi-automatically annotated with the help of Segment Anything Model (SAM), which greatly improves the annotation speed. For the benefit of the community, our code and dataset can be obtained from https://github.com/nhjydywd/OCTA-FRNet.
    摘要 光学相干断层扫描血管成像(OCTA)是一种非侵入式成像技术,可以显示高分辨率的视网膜血管。在这项工作中,我们提出了一种准确且高效的神经网络,用于 OCTA 图像中的视网膜血管分割。所提网络实现了与其他 SOTA 方法相当的准确率,同时具有更少的参数和更快的推理速度(例如,比 U-Net 轻量约 110 倍、快 1.3 倍),非常适合工业应用。这一成果得益于在全分辨率卷积网络中应用改进的循环 ConvNeXt 模块。此外,我们创建了一个包含 918 张 OCTA 图像及其相应血管标注的数据集。该数据集借助 Segment Anything Model(SAM)进行半自动标注,大幅提升了标注速度。为了方便社区,我们的代码和数据集可以从 GitHub 获取:https://github.com/nhjydywd/OCTA-FRNet。
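
A plausible reading of the "modified Recurrent ConvNeXt block" is a standard ConvNeXt block (depthwise 7x7 convolution, LayerNorm, pointwise MLP, residual) applied several times with shared weights. The sketch below implements that reading; the recurrence depth and any paper-specific modifications are assumptions.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Standard ConvNeXt-style block: depthwise conv + LayerNorm + MLP + residual."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pw1 = nn.Linear(dim, 4 * dim)
        self.pw2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        res = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)              # channels-last for LayerNorm/Linear
        x = self.pw2(nn.functional.gelu(self.pw1(self.norm(x))))
        return res + x.permute(0, 3, 1, 2)

class RecurrentConvNeXt(nn.Module):
    """Apply one shared block several times, as a guess at the 'recurrent' design."""
    def __init__(self, dim: int, steps: int = 3):
        super().__init__()
        self.block, self.steps = ConvNeXtBlock(dim), steps

    def forward(self, x):
        for _ in range(self.steps):
            x = self.block(x)
        return x

print(RecurrentConvNeXt(32)(torch.randn(1, 32, 64, 64)).shape)  # (1, 32, 64, 64)
```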

Spatio-temporal Co-attention Fusion Network for Video Splicing Localization

  • paper_url: http://arxiv.org/abs/2309.09482
  • repo_url: None
  • paper_authors: Man Lin, Gang Cao, Zijie Lou
  • for: 本研究旨在提出一种针对视频拼接 forgery 的检测方法,以推动视频的真实性和安全性。
  • methods: 本研究使用三流(three-stream)网络作为编码器,以捕捉跨多帧的篡改痕迹,并通过新颖的并行与交叉协同注意力融合模块,实现时空取证特征的深度交互与融合,最后用轻量级多层感知机(MLP)解码器生成像素级篡改定位图。
  • results: 测试结果表明,使用 SCFNet 可以高效地检测视频拼接 forgery,并且在不同的视频 dataset 上达到了 state-of-the-art 的性能。
    Abstract Digital video splicing has become easy and ubiquitous. Malicious users copy some regions of a video and paste them to another video for creating realistic forgeries. It is significant to blindly detect such forgery regions in videos. In this paper, a spatio-temporal co-attention fusion network (SCFNet) is proposed for video splicing localization. Specifically, a three-stream network is used as an encoder to capture manipulation traces across multiple frames. The deep interaction and fusion of spatio-temporal forensic features are achieved by the novel parallel and cross co-attention fusion modules. A lightweight multilayer perceptron (MLP) decoder is adopted to yield a pixel-level tampering localization map. A new large-scale video splicing dataset is created for training the SCFNet. Extensive tests on benchmark datasets show that the localization and generalization performances of our SCFNet outperform the state-of-the-art. Code and datasets will be available at https://github.com/multimediaFor/SCFNet.
    摘要 数字视频拼接篡改已变得简单而普遍。恶意用户可以将一段视频中的某些区域复制并粘贴到另一段视频中,制造逼真的伪造内容。因此,盲检测视频中的此类篡改区域十分重要。本文提出了一种时空协同注意力融合网络(SCFNet),用于视频拼接篡改定位。具体而言,采用三流网络作为编码器,捕捉跨多帧的篡改痕迹;通过新颖的并行与交叉协同注意力融合模块,实现时空取证特征的深度交互与融合;并采用轻量级多层感知机(MLP)解码器生成像素级篡改定位图。我们还构建了一个新的大规模视频拼接数据集用于训练 SCFNet。在基准数据集上的大量测试表明,SCFNet 的定位与泛化性能均优于当前最先进方法。代码和数据集将发布于 https://github.com/multimediaFor/SCFNet。

Stealthy Physical Masked Face Recognition Attack via Adversarial Style Optimization

  • paper_url: http://arxiv.org/abs/2309.09480
  • repo_url: None
  • paper_authors: Huihui Gong, Minjing Dong, Siqi Ma, Seyit Camtepe, Surya Nepal, Chang Xu
  • for: 本研究旨在探讨针对人脸识别(FR)模型的隐蔽物理攻击,即通过佩戴对抗性口罩对 FR 模型发起定向攻击。
  • methods: 提出了一种基于对抗风格优化的隐蔽物理口罩攻击方法:训练一个对抗风格口罩生成器,将对抗扰动隐藏在风格化口罩中,并通过连续松弛的风格优化为给定目标寻找最优风格,同时优化生成器与风格选择。
  • results: 大量白盒与黑盒数字实验以及针对本地 FR 模型和在线平台的物理攻击实验表明,该方法生成的对抗风格口罩兼具较强的攻击性与隐蔽性,并具有良好的可迁移性。
    Abstract Deep neural networks (DNNs) have achieved state-of-the-art performance on face recognition (FR) tasks in the last decade. In real scenarios, the deployment of DNNs requires taking various face accessories into consideration, like glasses, hats, and masks. In the COVID-19 pandemic era, wearing face masks is one of the most effective ways to defend against the novel coronavirus. However, DNNs are known to be vulnerable to adversarial examples with a small but elaborated perturbation. Thus, a facial mask with adversarial perturbations may pose a great threat to the widely used deep learning-based FR models. In this paper, we consider a challenging adversarial setting: targeted attack against FR models. We propose a new stealthy physical masked FR attack via adversarial style optimization. Specifically, we train an adversarial style mask generator that hides adversarial perturbations inside style masks. Moreover, to ameliorate the phenomenon of sub-optimization with one fixed style, we propose to discover the optimal style given a target through style optimization in a continuous relaxation manner. We simultaneously optimize the generator and the style selection for generating strong and stealthy adversarial style masks. We evaluated the effectiveness and transferability of our proposed method via extensive white-box and black-box digital experiments. Furthermore, we also conducted physical attack experiments against local FR models and online platforms.
    摘要 在过去十年中,深度神经网络(DNN)在人脸识别(FR)任务上取得了最先进的性能。在实际场景中,部署 DNN 时需要考虑各种面部饰品,如眼镜、帽子和口罩。在新冠疫情期间,佩戴口罩是抵御新型冠状病毒最有效的方式之一。然而,DNN 众所周知地容易受到带有微小而精心构造扰动的对抗样本的攻击。因此,带有对抗扰动的口罩可能对广泛使用的基于深度学习的 FR 模型构成巨大威胁。在本文中,我们考虑一个具有挑战性的对抗设定:针对 FR 模型的定向攻击。我们提出了一种基于对抗风格优化的新型隐蔽物理口罩 FR 攻击方法。具体而言,我们训练一个对抗风格口罩生成器,将对抗扰动隐藏在风格化口罩之中。此外,为缓解固定单一风格导致的次优问题,我们提出通过连续松弛的风格优化,为给定目标寻找最优风格。我们同时优化生成器与风格选择,以生成强力且隐蔽的对抗风格口罩。我们通过大量的白盒和黑盒数字实验评估了所提方法的有效性与可迁移性,并针对本地 FR 模型和在线平台进行了物理攻击实验。

Self-supervised Multi-view Clustering in Computer Vision: A Survey

  • paper_url: http://arxiv.org/abs/2309.09473
  • repo_url: None
  • paper_authors: Jiatai Wang, Zhiwei Xu, Xuewen Yang, Hailong Li, Bo Li, Xuying Meng
  • for: 本文旨在梳理多视图聚类(MVC)在跨模态表示学习和数据驱动决策中的重要性,以及自监督学习在 MVC 方法中的兴起。
  • methods: 本文主要综述由自监督学习驱动的 MVC 方法,包括通过设计代理任务从图像和视频数据本身挖掘监督信息,并对常用数据集、数据问题、表示学习方法和自监督学习方法的内在联系与分类进行了梳理。
  • results: 本文不仅介绍了每种类别的机制,还给出了一些应用示例。最后,文章还提出了一些未解决的问题,以便进一步的研究和发展。
    Abstract Multi-view clustering (MVC) has had significant implications in cross-modal representation learning and data-driven decision-making in recent years. It accomplishes this by leveraging the consistency and complementary information among multiple views to cluster samples into distinct groups. However, as contrastive learning continues to evolve within the field of computer vision, self-supervised learning has also made substantial research progress and is progressively becoming dominant in MVC methods. It guides the clustering process by designing proxy tasks to mine the representation of image and video data itself as supervisory information. Despite the rapid development of self-supervised MVC, there has yet to be a comprehensive survey to analyze and summarize the current state of research progress. Therefore, this paper explores the reasons and advantages of the emergence of self-supervised MVC and discusses the internal connections and classifications of common datasets, data issues, representation learning methods, and self-supervised learning methods. This paper does not only introduce the mechanisms for each category of methods but also gives a few examples of how these techniques are used. In the end, some open problems are pointed out for further investigation and development.
    摘要 近年来,多视图聚类(MVC)在跨模态表示学习和数据驱动决策中产生了重要影响。它通过利用多个视图之间的一致性和互补信息,将样本聚类到不同的组中。然而,随着对比学习在计算机视觉领域的不断发展,自监督学习也取得了长足进展,并逐渐在 MVC 方法中占据主导地位。它通过设计代理任务,从图像和视频数据本身挖掘监督信息来指导聚类过程。尽管自监督 MVC 发展迅速,目前尚缺乏一篇对研究进展进行分析和总结的综述。因此,本文探讨了自监督 MVC 出现的原因和优势,并讨论了常用数据集、数据问题、表示学习方法和自监督学习方法之间的内在联系与分类。本文不仅介绍了每类方法的机制,还给出了若干应用示例。最后,文章指出了一些有待进一步研究和发展的开放问题。

Reconstructing Existing Levels through Level Inpainting

  • paper_url: http://arxiv.org/abs/2309.09472
  • repo_url: None
  • paper_authors: Johor Jara Gonzalez, Mathew Guzdial
  • for: 这篇论文是为了描述如何使用Content Augmentation和Procedural Content Generation via Machine Learning(PCGML)来生成电子游戏等游戏中的关卡。
  • methods: 这篇论文使用了两种图像填充技术,即Autoencoder和U-net,来解决关卡填充问题。
  • results: 在比较基准方法的情况下,这两种方法都表现出了较好的性能,并且提供了实际的关卡填充示例和未来研究的想法。
    Abstract Procedural Content Generation (PCG) and Procedural Content Generation via Machine Learning (PCGML) have been used in prior work for generating levels in various games. This paper introduces Content Augmentation and focuses on the subproblem of level inpainting, which involves reconstructing and extending video game levels. Drawing inspiration from image inpainting, we adapt two techniques from this domain to address our specific use case. We present two approaches for level inpainting: an Autoencoder and a U-net. Through a comprehensive case study, we demonstrate their superior performance compared to a baseline method and discuss their relative merits. Furthermore, we provide a practical demonstration of both approaches for the level inpainting task and offer insights into potential directions for future research.
    摘要 过程化内容生成(PCG)以及基于机器学习的过程化内容生成(PCGML)在先前的工作中已被用于生成各类游戏中的关卡。本文引入了内容增强(Content Augmentation),并聚焦于其子问题——关卡补全(level inpainting),即重建并扩展电子游戏关卡。受图像修复的启发,我们将该领域的两种技术改造用于我们的特定场景,提出了两种关卡补全方法:自编码器(Autoencoder)和 U-Net。通过完整的案例研究,我们证明了这两种方法优于基线方法,并讨论了它们的相对优劣。此外,我们还给出了两种方法在关卡补全任务上的实际演示,并对未来研究方向提出了见解。

Progressive Text-to-Image Diffusion with Soft Latent Direction

  • paper_url: http://arxiv.org/abs/2309.09466
  • repo_url: https://github.com/babahui/progressive-text-to-image
  • paper_authors: YuTeng Ye, Jiale Cai, Hang Zhou, Guanwen Li, Youjia Zhang, Zikai Song, Chenxing Gao, Junqing Yu, Wei Yang
  • for: 本研究旨在解决文本到图像生成中多个实体的同时拼接和约束的挑战。
  • methods: 该 paper 提出了一种进步的分步生成和修改操作,通过逐步将实体添加到目标图像中,以保证它们在空间和关系约束下进行拼接。
  • results: 该 paper 的提出的方法在处理复杂和长文本描述时显示出了明显的进步,特别是在对多个实体的拼接和修改方面。
    Abstract In spite of the rapidly evolving landscape of text-to-image generation, the synthesis and manipulation of multiple entities while adhering to specific relational constraints pose enduring challenges. This paper introduces an innovative progressive synthesis and editing operation that systematically incorporates entities into the target image, ensuring their adherence to spatial and relational constraints at each sequential step. Our key insight stems from the observation that while a pre-trained text-to-image diffusion model adeptly handles one or two entities, it often falters when dealing with a greater number. To address this limitation, we propose harnessing the capabilities of a Large Language Model (LLM) to decompose intricate and protracted text descriptions into coherent directives adhering to stringent formats. To facilitate the execution of directives involving distinct semantic operations-namely insertion, editing, and erasing-we formulate the Stimulus, Response, and Fusion (SRF) framework. Within this framework, latent regions are gently stimulated in alignment with each operation, followed by the fusion of the responsive latent components to achieve cohesive entity manipulation. Our proposed framework yields notable advancements in object synthesis, particularly when confronted with intricate and lengthy textual inputs. Consequently, it establishes a new benchmark for text-to-image generation tasks, further elevating the field's performance standards.
    摘要 尽管文本到图像生成领域在快速发展,仍然有多个实体的合理拟合和操作是持续的挑战。这篇论文提出了一种创新的进行式合成和编辑操作,系统地将实体integrated到目标图像中,以确保它们在每个顺序步骤中遵循空间和关系约束。我们的关键发现来自于发现一个预训练的文本到图像扩散模型能够好好地处理一个或两个实体,但当面临更多实体时,它经常出现问题。为解决这种限制,我们提议利用大型自然语言模型(LLM)来分解复杂和长文本描述,并将其转化为严格格式的直接指令。为了实现不同semantic操作的执行,我们提出了刺激、编辑和消除等操作的框架,称为刺激响应拼接(SRF)框架。在这个框架中,各个实体的latent空间会在每个操作中被细致地刺激,然后将响应的latent组件进行拼接,以实现一致的实体修改。我们的提议的框架在对复杂和长文本输入的对象合成方面带来了显著的进步,因此,它在文本到图像生成任务中成为了一个新的标准,进一步提高了这个领域的性能标准。

Reducing Adversarial Training Cost with Gradient Approximation

  • paper_url: http://arxiv.org/abs/2309.09464
  • repo_url: None
  • paper_authors: Huihui Gong, Shuo Yang, Siqi Ma, Seyit Camtepe, Surya Nepal, Chang Xu
  • for: 提高模型对抗示例(Adversarial Examples,AE)的Robustness,提高模型的可靠性和安全性。
  • methods: 使用基于投影梯度下降(PGD)的对抗训练方法,并提出一种新的高效对抗训练方法——基于梯度近似的对抗训练(GAAT),以降低构建稳健模型的成本。
  • results: 对MNIST、CIFAR-10和CIFAR-100等 dataset进行了广泛的实验,结果显示,GAAT方法可以保持模型的准确率,同时减少训练时间,最多可以减少60%的训练时间。
    Abstract Deep learning models have achieved state-of-the-art performances in various domains, while they are vulnerable to the inputs with well-crafted but small perturbations, which are named after adversarial examples (AEs). Among many strategies to improve the model robustness against AEs, Projected Gradient Descent (PGD) based adversarial training is one of the most effective methods. Unfortunately, the prohibitive computational overhead of generating strong enough AEs, due to the maximization of the loss function, sometimes makes the regular PGD adversarial training impractical when using larger and more complicated models. In this paper, we propose that the adversarial loss can be approximated by the partial sum of Taylor series. Furthermore, we approximate the gradient of adversarial loss and propose a new and efficient adversarial training method, adversarial training with gradient approximation (GAAT), to reduce the cost of building up robust models. Additionally, extensive experiments demonstrate that this efficiency improvement can be achieved without any or with very little loss in accuracy on natural and adversarial examples, which show that our proposed method saves up to 60\% of the training time with comparable model test accuracy on MNIST, CIFAR-10 and CIFAR-100 datasets.
    摘要 深度学习模型在各个领域取得了最先进的性能,但它们容易受到带有微小而精心构造扰动的输入(即对抗样本,AE)的影响。在众多提升模型对抗鲁棒性的策略中,基于投影梯度下降(PGD)的对抗训练是最有效的方法之一。然而,由于需要最大化损失函数来生成足够强的对抗样本,其计算开销十分巨大,这使得在更大、更复杂的模型上进行常规 PGD 对抗训练有时并不现实。在本文中,我们提出用泰勒级数的部分和来近似对抗损失,并进一步近似对抗损失的梯度,由此提出一种新的高效对抗训练方法——基于梯度近似的对抗训练(GAAT),以降低构建鲁棒模型的成本。大量实验表明,这一效率提升几乎不会(或仅轻微)损失在自然样本和对抗样本上的准确率:在 MNIST、CIFAR-10 和 CIFAR-100 数据集上,我们的方法在保持相当的测试准确率的同时最多可节省 60% 的训练时间。
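
The approximation the abstract builds on is a truncated Taylor expansion of the loss around the clean input: for a small perturbation delta, L(x + delta) is approximately L(x) + delta^T grad_x L(x). The snippet below only checks that first-order expansion numerically on a toy linear classifier; it is not the authors' GAAT training loop.

```python
import torch

torch.manual_seed(0)
w = torch.randn(10, 3)                       # toy linear classifier weights
x = torch.randn(1, 10, requires_grad=True)
y = torch.tensor([1])

def loss_fn(inp):
    return torch.nn.functional.cross_entropy(inp @ w, y)

loss = loss_fn(x)
grad_x, = torch.autograd.grad(loss, x)

eps = 0.01
delta = eps * grad_x.sign()                  # FGSM-style perturbation direction
exact = loss_fn(x + delta).item()
first_order = (loss + (grad_x * delta).sum()).item()   # L(x) + delta^T grad_x L(x)
print(f"exact {exact:.6f}  first-order {first_order:.6f}")
```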

Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection

  • paper_url: http://arxiv.org/abs/2309.09456
  • repo_url: None
  • paper_authors: Chenming Zhu, Wenwei Zhang, Tai Wang, Xihui Liu, Kai Chen
  • for: 提高开放词汇3D物体检测的性能,使用大规模大词汇3D物体数据集来扩充现有3D场景数据集的词汇。
  • methods: 使用Object2Scene方法,将不同来源的3D物体插入到3D场景中,生成对应的文本描述,并提出了跨频率类别对比学习方法来缓解不同数据集之间的频率差。
  • results: 在现有的开放词汇3D物体检测标准benchmark上实现了超过现有方法的性能,并在一个新的benchmark OV-ScanNet-200上进行了验证,并证明了对所有罕见类的检测能力。
    Abstract Point cloud-based open-vocabulary 3D object detection aims to detect 3D categories that do not have ground-truth annotations in the training set. It is extremely challenging because of the limited data and annotations (bounding boxes with class labels or text descriptions) of 3D scenes. Previous approaches leverage large-scale richly-annotated image datasets as a bridge between 3D and category semantics but require an extra alignment process between 2D images and 3D points, limiting the open-vocabulary ability of 3D detectors. Instead of leveraging 2D images, we propose Object2Scene, the first approach that leverages large-scale large-vocabulary 3D object datasets to augment existing 3D scene datasets for open-vocabulary 3D object detection. Object2Scene inserts objects from different sources into 3D scenes to enrich the vocabulary of 3D scene datasets and generates text descriptions for the newly inserted objects. We further introduce a framework that unifies 3D detection and visual grounding, named L3Det, and propose a cross-domain category-level contrastive learning approach to mitigate the domain gap between 3D objects from different datasets. Extensive experiments on existing open-vocabulary 3D object detection benchmarks show that Object2Scene obtains superior performance over existing methods. We further verify the effectiveness of Object2Scene on a new benchmark OV-ScanNet-200, by holding out all rare categories as novel categories not seen during training.
    摘要 基于点云的开放词汇 3D 目标检测旨在检测训练集中没有真值标注的 3D 类别。由于 3D 场景的数据和标注(带类别标签或文本描述的边界框)十分有限,这一任务极具挑战性。先前的方法利用大规模、标注丰富的图像数据集作为 3D 与类别语义之间的桥梁,但需要在 2D 图像和 3D 点之间进行额外的对齐过程,限制了 3D 检测器的开放词汇能力。与利用 2D 图像不同,我们提出 Object2Scene,首个利用大规模大词汇 3D 物体数据集来增强现有 3D 场景数据集、以实现开放词汇 3D 目标检测的方法。Object2Scene 将来自不同来源的物体插入 3D 场景中,以丰富 3D 场景数据集的词汇,并为新插入的物体生成文本描述。我们进一步提出了一个统一 3D 检测与视觉定位(grounding)的框架 L3Det,并提出一种跨域类别级对比学习方法,以缓解来自不同数据集的 3D 物体之间的域差。在现有开放词汇 3D 目标检测基准上的大量实验表明,Object2Scene 的性能优于现有方法。我们还在新的基准 OV-ScanNet-200 上验证了 Object2Scene 的有效性,即在训练中将所有稀有类别保留为未见过的新类别。

Scalable Label-efficient Footpath Network Generation Using Remote Sensing Data and Self-supervised Learning

  • paper_url: http://arxiv.org/abs/2309.09446
  • repo_url: https://github.com/wennyxy/footpathseg
  • paper_authors: Xinye Wanyan, Sachith Seneviratne, Kerry Nice, Jason Thompson, Marcus White, Nano Langenheim, Mark Stevenson
  • for: The paper is written for urban planners and researchers who need to manage and analyze footpath infrastructure in cities but lack real-time information and resources for doing so.
  • methods: The paper proposes an automatic pipeline for generating footpath networks from remote sensing images using machine learning models, specifically a self-supervised feature representation learning method to reduce annotation requirements.
  • results: The proposed method is validated using remote sensing images and shows considerable consistency with manually collected GIS layers, making it a low-cost and extensible solution for footpath network generation.
    Abstract Footpath mapping, modeling, and analysis can provide important geospatial insights to many fields of study, including transport, health, environment and urban planning. The availability of robust Geographic Information System (GIS) layers can benefit the management of infrastructure inventories, especially at local government level with urban planners responsible for the deployment and maintenance of such infrastructure. However, many cities still lack real-time information on the location, connectivity, and width of footpaths, and/or employ costly and manual survey means to gather this information. This work designs and implements an automatic pipeline for generating footpath networks based on remote sensing images using machine learning models. The annotation of segmentation tasks, especially labeling remote sensing images with specialized requirements, is very expensive, so we aim to introduce a pipeline requiring less labeled data. Considering supervised methods require large amounts of training data, we use a self-supervised method for feature representation learning to reduce annotation requirements. Then the pre-trained model is used as the encoder of the U-Net for footpath segmentation. Based on the generated masks, the footpath polygons are extracted and converted to footpath networks which can be loaded and visualized by geographic information systems conveniently. Validation results indicate considerable consistency when compared to manually collected GIS layers. The footpath network generation pipeline proposed in this work is low-cost and extensible, and it can be applied where remote sensing images are available. Github: https://github.com/WennyXY/FootpathSeg.
    摘要 步行道的制图、建模与分析可以为交通、健康、环境和城市规划等诸多研究领域提供重要的地理空间洞见。可靠的地理信息系统(GIS)图层有助于基础设施清单的管理,尤其是在地方政府层面,由城市规划者负责此类基础设施的部署与维护。然而,许多城市仍缺乏关于步行道位置、连通性和宽度的实时信息,或者依赖昂贵的人工调查来获取这些信息。本文设计并实现了一个基于遥感影像、利用机器学习模型自动生成步行道网络的流程。由于分割任务的标注(尤其是具有特殊要求的遥感影像标注)成本很高,我们希望引入一个需要更少标注数据的流程。考虑到有监督方法需要大量训练数据,我们采用自监督方法进行特征表示学习,以降低标注需求;随后将预训练模型作为 U-Net 的编码器用于步行道分割。基于生成的掩码提取步行道多边形,并将其转换为可被地理信息系统方便加载和可视化的步行道网络。验证结果显示,与人工采集的 GIS 图层相比,该方法具有相当高的一致性。本文提出的步行道网络生成流程成本低、可扩展,可应用于任何具有遥感影像的地区。Github: https://github.com/WennyXY/FootpathSeg。

TransTouch: Learning Transparent Objects Depth Sensing Through Sparse Touches

  • paper_url: http://arxiv.org/abs/2309.09427
  • repo_url: None
  • paper_authors: Liuyu Bian, Pengyang Shi, Weihang Chen, Jing Xu, Li Yi, Rui Chen
  • for: 提高真实世界中透明物体深度感知的精度
  • methods: 使用带触觉反馈的探测系统自动采集稀疏的触探深度标签来微调立体匹配网络,并提出一种新的效用函数来评估各探测位置的收益,在固定的触探预算下优化探测位置,从而提升网络在真实物体上的性能。
  • results: 构建了一个包含漫反射与透明物体的真实世界数据集;实验表明,该方法能显著提高真实世界中(尤其是透明物体的)深度感知精度。
    Abstract Transparent objects are common in daily life. However, depth sensing for transparent objects remains a challenging problem. While learning-based methods can leverage shape priors to improve the sensing quality, the labor-intensive data collection in the real world and the sim-to-real domain gap restrict these methods' scalability. In this paper, we propose a method to finetune a stereo network with sparse depth labels automatically collected using a probing system with tactile feedback. We present a novel utility function to evaluate the benefit of touches. By approximating and optimizing the utility function, we can optimize the probing locations given a fixed touching budget to better improve the network's performance on real objects. We further combine tactile depth supervision with a confidence-based regularization to prevent over-fitting during finetuning. To evaluate the effectiveness of our method, we construct a real-world dataset including both diffuse and transparent objects. Experimental results on this dataset show that our method can significantly improve real-world depth sensing accuracy, especially for transparent objects.
    摘要 透明物体在日常生活中十分常见,但其深度感知仍是一个具有挑战性的问题。基于学习的方法可以利用形状先验来提升感知质量,但现实世界中耗费人力的数据采集以及仿真到现实的域差限制了这些方法的可扩展性。在本文中,我们提出一种方法,利用带触觉反馈的探测系统自动采集的稀疏深度标签来微调立体匹配网络。我们提出了一种新的效用函数来评估触探的收益;通过对该效用函数进行近似和优化,可以在固定的触探预算下优化探测位置,从而更好地提升网络在真实物体上的性能。我们还将触觉深度监督与基于置信度的正则化相结合,以防止微调过程中的过拟合。为评估方法的有效性,我们构建了一个包含漫反射与透明物体的真实世界数据集。在该数据集上的实验结果表明,我们的方法能显著提升真实世界的深度感知精度,尤其是对透明物体。

Cross-attention-based saliency inference for predicting cancer metastasis on whole slide images

  • paper_url: http://arxiv.org/abs/2309.09412
  • repo_url: None
  • paper_authors: Ziyu Su, Mostafa Rezapour, Usama Sajjad, Shuo Niu, Metin Nafi Gurcan, Muhammad Khalid Khan Niazi
  • for: This paper proposes a new method for automatic tumor detection on whole slide images (WSI) using cross-attention-based salient instance inference MIL (CASiiMIL).
  • methods: The proposed method uses a novel saliency-informed attention mechanism and a negative representation learning algorithm to identify breast cancer lymph node micro-metastasis on WSIs without the need for any annotations.
  • results: The proposed model outperforms the state-of-the-art MIL methods on two popular tumor metastasis detection datasets and demonstrates great cross-center generalizability. It also exhibits excellent accuracy in classifying WSIs with small tumor lesions and has excellent interpretability attributed to the saliency-informed attention weights.
    Abstract Although multiple instance learning (MIL) methods are widely used for automatic tumor detection on whole slide images (WSI), they suffer from the extreme class imbalance within the small tumor WSIs. This occurs when the tumor comprises only a few isolated cells. For early detection, it is of utmost importance that MIL algorithms can identify small tumors, even when they are less than 1% of the size of the WSI. Existing studies have attempted to address this issue using attention-based architectures and instance selection-based methodologies, but have not yielded significant improvements. This paper proposes cross-attention-based salient instance inference MIL (CASiiMIL), which involves a novel saliency-informed attention mechanism, to identify breast cancer lymph node micro-metastasis on WSIs without the need for any annotations. Apart from this new attention mechanism, we introduce a negative representation learning algorithm to facilitate the learning of saliency-informed attention weights for improved sensitivity on tumor WSIs. The proposed model outperforms the state-of-the-art MIL methods on two popular tumor metastasis detection datasets, and demonstrates great cross-center generalizability. In addition, it exhibits excellent accuracy in classifying WSIs with small tumor lesions. Moreover, we show that the proposed model has excellent interpretability attributed to the saliency-informed attention weights. We strongly believe that the proposed method will pave the way for training algorithms for early tumor detection on large datasets where acquiring fine-grained annotations is practically impossible.
    摘要 多个实例学习(MIL)方法在整个扫描图像(WSI)上自动检测肿瘤已经广泛应用,但它们受到扫描图像中小肿瘤的极端分布异常困扰。这发生在肿瘤只包含几个隔离的细胞时。在早期检测中,非常重要的是MIL算法可以识别小肿瘤,即使它们的大小小于扫描图像的1%。现有研究已经使用注意力基 architecture和实例选择基础方法来解决这个问题,但没有产生显著改进。本文提出了跨注意力基础的突出例推理MIL(CASiiMIL),它包括一种新的突出性注意力机制,以无需任何注释来检测乳腺癌Node微量肿瘤在扫描图像上。此外,我们还引入了一种负表示学习算法,以便更好地学习突出性注意力权重,以提高肿瘤WSIs的敏感性。提出的模型在两个流行的肿瘤метастаisis检测数据集上表现出色,并且具有优秀的跨中心一致性。此外,它还能够高度准确地分类WSIs中的小肿瘤 lesions。此外,我们还证明了该模型具有优秀的解释性,即通过突出性注意力权重来解释其决策。我们认为,提出的方法将为大量数据上无法获得细致的注释的训练算法开创新的道路。
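
CASiiMIL builds on attention-based multiple instance learning, where patch embeddings from one slide are pooled with learned attention weights into a single bag representation. The module below is the generic gated-attention MIL baseline (in the style of Ilse et al.) for orientation; the saliency-informed attention and negative representation learning described in the abstract are not reproduced.

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Generic gated-attention MIL pooling: patch (instance) features to one slide score."""
    def __init__(self, dim: int = 512, hidden: int = 128):
        super().__init__()
        self.v = nn.Linear(dim, hidden)
        self.u = nn.Linear(dim, hidden)
        self.w = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(dim, 1)

    def forward(self, instances: torch.Tensor):
        # instances: (N, dim) patch embeddings extracted from one whole-slide image
        a = self.w(torch.tanh(self.v(instances)) * torch.sigmoid(self.u(instances)))  # (N, 1)
        a = torch.softmax(a, dim=0)
        bag = (a * instances).sum(dim=0)       # attention-weighted slide embedding
        return self.classifier(bag), a.squeeze(-1)

logit, attn = AttentionMILPooling()(torch.randn(1000, 512))
print(logit.shape, attn.shape)  # torch.Size([1]) torch.Size([1000])
```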

BRONCO: Automated modelling of the bronchovascular bundle using the Computed Tomography Images

  • paper_url: http://arxiv.org/abs/2309.09410
  • repo_url: None
  • paper_authors: Wojciech Prażuch, Marek Socha, Anna Mrukwa, Aleksandra Suwalska, Agata Durawa, Malgorzata Jelitto-Górska, Katarzyna Dziadziuszko, Edyta Szurowska, Pawel Bożek, Michal Marczyk, Witold Rzyman, Joanna Polanska
  • for: 这篇论文主要针对肺实质内支气管血管束的分割,即对肺内气道和血管网络进行分割。
  • methods: 本研究基于 CT 图像构建了支气管血管束的分割流程,包括两个模块:支气管树建模和血管建模。
  • results: 研究发现,该方法在来自不同设备、不同层厚重建、包含多种病变的低剂量与常规剂量 CT 数据上均能获得稳定的分割结果,且不受 CT 序列来源和参数的影响。该方法适用于健康人群、肺结节患者和肺气肿患者。
    Abstract Segmentation of the bronchovascular bundle within the lung parenchyma is a key step for the proper analysis and planning of many pulmonary diseases. It might also be considered the preprocessing step when the goal is to segment the nodules from the lung parenchyma. We propose a segmentation pipeline for the bronchovascular bundle based on the Computed Tomography images, returning either binary or labelled masks of vessels and bronchi situated in the lung parenchyma. The method consists of two modules, modeling of the bronchial tree and vessels. The core revolves around a similar pipeline, the determination of the initial perimeter by the GMM method, skeletonization, and hierarchical analysis of the created graph. We tested our method on both low-dose CT and standard-dose CT, with various pathologies, reconstructed with various slice thicknesses, and acquired from various machines. We conclude that the method is invariant with respect to the origin and parameters of the CT series. Our pipeline is best suited for studies with healthy patients, patients with lung nodules, and patients with emphysema.
    摘要 在肺实质内分割支气管血管束是许多肺部疾病分析与规划的关键步骤,也可以作为从肺实质中分割结节时的预处理步骤。我们提出了一个基于 CT 图像的支气管血管束分割流程,可输出位于肺实质内的血管和支气管的二值或带标签掩码。该方法由两个模块组成:支气管树建模和血管建模。其核心围绕类似的流程展开:用 GMM 方法确定初始轮廓、骨架化,以及对所构建图的层次分析。我们在低剂量 CT 和常规剂量 CT 上测试了该方法,数据涵盖多种病变、不同的重建层厚以及不同的采集设备。我们的结论是,该方法对 CT 序列的来源和参数具有不变性。该流程最适用于健康人群、肺结节患者以及肺气肿患者的研究。

cs.AI - 2023-09-18

Towards Effective Semantic OOD Detection in Unseen Domains: A Domain Generalization Perspective

  • paper_url: http://arxiv.org/abs/2309.10209
  • repo_url: None
  • paper_authors: Haoliang Wang, Chen Zhao, Yunhui Guo, Kai Jiang, Feng Chen
  • for: simultaneously addresses both distributional shifts in real-world testing environments
  • methods: introduces two regularization strategies: domain generalization regularization and OOD detection regularization
  • results: showcases superior OOD detection performance compared to conventional domain generalization approaches while maintaining comparable InD classification accuracy
    Abstract Two prevalent types of distributional shifts in machine learning are the covariate shift (as observed across different domains) and the semantic shift (as seen across different classes). Traditional OOD detection techniques typically address only one of these shifts. However, real-world testing environments often present a combination of both covariate and semantic shifts. In this study, we introduce a novel problem, semantic OOD detection across domains, which simultaneously addresses both distributional shifts. To this end, we introduce two regularization strategies: domain generalization regularization, which ensures semantic invariance across domains to counteract the covariate shift, and OOD detection regularization, designed to enhance OOD detection capabilities against the semantic shift through energy bounding. Through rigorous testing on three standard domain generalization benchmarks, our proposed framework showcases its superiority over conventional domain generalization approaches in terms of OOD detection performance. Moreover, it holds its ground by maintaining comparable InD classification accuracy.
    摘要 机器学习中常见的两类分布偏移是协变量偏移(在不同域之间观察到)和语义偏移(在不同类别之间观察到)。传统的 OOD 检测技术通常只处理其中一种偏移。然而,真实世界的测试环境往往同时存在协变量偏移和语义偏移。在本研究中,我们提出了一个新问题:跨域语义 OOD 检测,同时应对这两种分布偏移。为此,我们引入了两种正则化策略:域泛化正则化,确保跨域的语义不变性,以对抗协变量偏移;以及 OOD 检测正则化,通过能量约束(energy bounding)来增强面向语义偏移的 OOD 检测能力。通过在三个标准域泛化基准上的严格测试,我们提出的框架在 OOD 检测性能上优于传统的域泛化方法,同时保持了相当的分布内(InD)分类准确率。
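
The "energy bounding" regularization refers to the energy score commonly used for OOD detection: E(x) = -T * logsumexp(f(x)/T) over the logits, with higher energy indicating a likely OOD input. The snippet below computes that score; the temperature and any margins used for bounding are hyperparameters assumed here.

```python
import torch

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Energy of each sample; higher energy suggests the input is out-of-distribution."""
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

logits_ind = torch.tensor([[6.0, 0.5, -1.0]])    # confident in-distribution prediction
logits_ood = torch.tensor([[0.3, 0.2, 0.1]])     # flat logits, likely semantic OOD
print(energy_score(logits_ind), energy_score(logits_ood))
# An energy-bounding regularizer would push in-distribution energies below a margin
# and unknown-class energies above it (margin values are assumptions here).
```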

Stabilizing RLHF through Advantage Model and Selective Rehearsal

  • paper_url: http://arxiv.org/abs/2309.10202
  • repo_url: None
  • paper_authors: Baolin Peng, Linfeng Song, Ye Tian, Lifeng Jin, Haitao Mi, Dong Yu
  • for: 这份技术报告旨在解决 RLHF 训练中的不稳定问题,包括奖励破解(reward hacking)和灾难性遗忘。
  • methods: 报告提出了两项用于稳定 RLHF 训练的创新:1)优势模型(Advantage Model),直接对优势分数(即相对期望奖励的额外奖励)建模,并规范各任务之间的分数分布,以防止奖励破解;2)选择性复习(Selective Rehearsal),通过有策略地选择用于 PPO 训练和知识复习的数据,缓解灾难性遗忘。
  • results: 我们的实验分析表明,提出的方法不仅提高了RLHF训练的稳定性,还达到了更高的奖励分数和胜利率。
    Abstract Large Language Models (LLMs) have revolutionized natural language processing, yet aligning these models with human values and preferences using RLHF remains a significant challenge. This challenge is characterized by various instabilities, such as reward hacking and catastrophic forgetting. In this technical report, we propose two innovations to stabilize RLHF training: 1) Advantage Model, which directly models advantage score i.e., extra reward compared to the expected rewards and regulates score distributions across tasks to prevent reward hacking. 2) Selective Rehearsal, which mitigates catastrophic forgetting by strategically selecting data for PPO training and knowledge rehearsing. Our experimental analysis on public and proprietary datasets reveals that the proposed methods not only increase stability in RLHF training but also achieve higher reward scores and win rates.
    摘要 大型语言模型(LLM)已经革新了自然语言处理,但利用 RLHF 将这些模型与人类价值观和偏好对齐仍是一项重大挑战。这一挑战表现为各种不稳定现象,例如奖励破解和灾难性遗忘。在这份技术报告中,我们提出了两项用于稳定 RLHF 训练的创新:1)优势模型(Advantage Model),直接对优势分数(即相对期望奖励的额外奖励)建模,并规范各任务之间的分数分布,以防止奖励破解;2)选择性复习(Selective Rehearsal),通过有策略地选择用于 PPO 训练和知识复习的数据,缓解灾难性遗忘。我们在公开和专有数据集上的实验分析表明,所提方法不仅提高了 RLHF 训练的稳定性,还取得了更高的奖励分数和胜率。
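
One way to picture the Advantage Model is as a per-task baseline subtraction and rescaling of reward-model scores, so that scores from different tasks live on a comparable scale and cannot be gamed by drifting one task's rewards upward. The snippet below is that simple normalization view; the report's actual Advantage Model is learned, so the grouping and statistics here are assumptions.

```python
import numpy as np

def per_task_advantage(rewards, tasks):
    """Advantage = reward minus that task's mean reward, scaled by its std."""
    rewards, tasks = np.asarray(rewards, dtype=float), np.asarray(tasks)
    adv = np.empty_like(rewards)
    for t in np.unique(tasks):
        m = tasks == t
        adv[m] = (rewards[m] - rewards[m].mean()) / (rewards[m].std() + 1e-8)
    return adv

rewards = [2.0, 9.5, 1.0, 8.0, 7.5, 0.5]
tasks = ["qa", "summarize", "qa", "summarize", "summarize", "qa"]
print(per_task_advantage(rewards, tasks).round(2))
```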

Graph-enabled Reinforcement Learning for Time Series Forecasting with Adaptive Intelligence

  • paper_url: http://arxiv.org/abs/2309.10186
  • repo_url: None
  • paper_authors: Thanveer Shaik, Xiaohui Tao, Haoran Xie, Lin Li, Jianming Yong, Yuefeng Li
  • for: 这个研究是为了提出一个基于图形神经网络(GNN)和强化学习(RL)的新方法来预测时间序列数据。
  • methods: 这个研究使用了GNN来处理时间序列数据,并与RL结合以监控模型。GNN能够自然地捕捉时间序列中的数据结构,并且可以更好地预测复杂的时间序列结构,例如健康、交通和天气预测。
  • results: 这个研究发现,GraphRL模型在时间序列预测和监控中具有更高的准确性和效率,相比于传统的深度学习模型。此外,这个研究还发现GNN在时间序列预测中的表现比RNN和LSTM更好。
    Abstract Reinforcement learning is well known for its ability to model sequential tasks and learn latent data patterns adaptively. Deep learning models have been widely explored and adopted in regression and classification tasks. However, deep learning has its limitations such as the assumption of equally spaced and ordered data, and the lack of ability to incorporate graph structure in terms of time-series prediction. Graphical neural network (GNN) has the ability to overcome these challenges and capture the temporal dependencies in time-series data. In this study, we propose a novel approach for predicting time-series data using GNN and monitoring with Reinforcement Learning (RL). GNNs are able to explicitly incorporate the graph structure of the data into the model, allowing them to capture temporal dependencies in a more natural way. This approach allows for more accurate predictions in complex temporal structures, such as those found in healthcare, traffic and weather forecasting. We also fine-tune our GraphRL model using a Bayesian optimisation technique to further improve performance. The proposed framework outperforms the baseline models in time-series forecasting and monitoring. The contributions of this study include the introduction of a novel GraphRL framework for time-series prediction and the demonstration of the effectiveness of GNNs in comparison to traditional deep learning models such as RNNs and LSTMs. Overall, this study demonstrates the potential of GraphRL in providing accurate and efficient predictions in dynamic RL environments.
    摘要 “强化学习”以能够建模顺序任务和自适应学习潜在数据模式而著称。深度学习模型在回归和分类任务中已被广泛应用。然而,深度学习有一些限制,例如假设数据等间隔且有序,以及无法在时间序列预测中运用图结构。图神经网络(GNN)可以克服这些挑战,以更自然的方式捕捉时间序列数据中的时间相依性。这种方法可以在复杂的时间序列结构中提供更准确的预测,例如健康监测、交通预测和天气预报。我们还使用贝叶斯优化技术进一步提高性能。我们的 GraphRL 模型在时间序列预测和监控中超越了基准模型。这项研究的贡献包括:1. 提出一个新的 GraphRL 框架用于时间序列预测。2. 显示 GNN 相比传统深度学习模型(RNN 和 LSTM)更有效。3. 展示 GraphRL 框架在时间序列预测和监控中的应用前景。总体来说,这项研究显示 GraphRL 能在动态RL环境中提供准确且高效的预测。

QoS-Aware Service Prediction and Orchestration in Cloud-Network Integrated Beyond 5G

  • paper_url: http://arxiv.org/abs/2309.10185
  • repo_url: https://github.com/hieu9955/ggggg
  • paper_authors: Mohammad Farhoudi, Masoud Shokrnezhad, Tarik Taleb
  • for: 本研究旨在探讨在Metaverse等 novel应用中, beyond 5G 网络所需的 ultra-low latency 通信和大量宽带连接,以及随着用户数量的不断变化,需要提高服务持续性考虑。
  • methods: 本文使用 edge-cloud 模式来充分利用云计算资源,实时管理用户在网络中的移动。然而,边缘云网络受到许多限制,包括网络和计算资源的共同管理,以及用户动态性、服务启动延迟和流量压力等因素。
  • results: 本文提出了一种基于非线性规划模型的服务放置和资源分配方法,以最小化总成本并改善延迟。此外,我们还提出了一种基于 RNN 的 DDQL 技术,使用水填(water-filling)算法进行服务放置,以适应用户动态性和服务启动延迟。仿真结果表明,我们的解决方案可以提供及时的响应,最大化网络潜力,并提供可扩展、高效的服务放置。
    Abstract Novel applications such as the Metaverse have highlighted the potential of beyond 5G networks, which necessitate ultra-low latency communications and massive broadband connections. Moreover, the burgeoning demand for such services with ever-fluctuating users has engendered a need for heightened service continuity consideration in B5G. To enable these services, the edge-cloud paradigm is a potential solution to harness cloud capacity and effectively manage users in real time as they move across the network. However, edge-cloud networks confront a multitude of limitations, including networking and computing resources that must be collectively managed to unlock their full potential. This paper addresses the joint problem of service placement and resource allocation in a network-cloud integrated environment while considering capacity constraints, dynamic users, and end-to-end delays. We present a non-linear programming model that formulates the optimization problem with the aiming objective of minimizing overall cost while enhancing latency. Next, to address the problem, we introduce a DDQL-based technique using RNNs to predict user behavior, empowered by a water-filling-based algorithm for service placement. The proposed framework adeptly accommodates the dynamic nature of users, the placement of services that mandate ultra-low latency in B5G, and service continuity when users migrate from one location to another. Simulation results show that our solution provides timely responses that optimize the network's potential, offering a scalable and efficient placement.
    摘要 Metaverse 等新应用凸显了 beyond 5G 网络的潜力,这类网络需要超低延迟通信和海量宽带连接。此外,随着用户数量不断波动,对此类服务的需求也要求在 B5G 中更多地考虑服务连续性。为支撑这些服务,边云模式是一个可能的解决方案,可以利用云计算能力并在用户跨网络移动时对其进行实时管理。然而,边云网络面临许多限制,网络和计算资源必须协同管理才能发挥其全部潜力。本文研究网络与云集成环境中服务放置和资源分配的联合问题,同时考虑容量约束、动态用户和端到端延迟。我们提出一个非线性规划模型,以最小化总成本并改善延迟为目标。随后,我们引入基于 RNN 的 DDQL 技术预测用户行为,并配合基于水填(water-filling)的算法实现服务放置。该方案能够适应用户的动态性、满足 B5G 中要求超低延迟的服务放置,并在用户从一个位置迁移到另一个位置时保证服务连续性。仿真结果表明,我们的解决方案可以提供及时的响应,最大化网络潜力,并提供可扩展且高效的放置。
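The water-filling component mentioned above can be illustrated with a toy allocation loop: compute capacity is repeatedly granted to the service whose demand is least satisfied. This is only an illustrative sketch under assumed inputs, not the paper's DDQL/RNN orchestration pipeline.

```python
# Toy water-filling-style allocation: repeatedly grant compute to the service
# whose demand is least satisfied until capacity runs out or all demands are met.
# Purely illustrative assumptions; not the paper's actual algorithm.
def water_filling_allocation(demands, capacity, step=1.0, tol=1e-9):
    alloc = [0.0] * len(demands)
    remaining = capacity
    while remaining > tol:
        gaps = [d - a for d, a in zip(demands, alloc)]
        i = max(range(len(demands)), key=lambda k: gaps[k])   # largest unmet demand
        if gaps[i] <= tol:                                    # everything satisfied
            break
        grant = min(step, gaps[i], remaining)
        alloc[i] += grant
        remaining -= grant
    return alloc, remaining

if __name__ == "__main__":
    print(water_filling_allocation(demands=[4.0, 2.5, 1.0], capacity=6.0))
```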

Positive and Risky Message Assessment for Music Products

  • paper_url: http://arxiv.org/abs/2309.10182
  • repo_url: None
  • paper_authors: Yigeng Zhang, Mahsa Shafaei, Fabio Gonzalez, Thamar Solorio
  • for: 评估音乐产品中的积极和危险信息
  • methods: 提出了一个新的研究问题,建立了多角度、多级别的音乐内容评估基准,然后提出了一种有效的多任务预测模型,并在其中加入序数(ordinality)约束来解决该问题。
  • results: 对比于强有力的任务特定对手,提出的方法不仅显著超越了它们,还可以同时评估多个方面。
    Abstract In this work, we propose a novel research problem: assessing positive and risky messages from music products. We first establish a benchmark for multi-angle multi-level music content assessment and then present an effective multi-task prediction model with ordinality-enforcement to solve this problem. Our result shows the proposed method not only significantly outperforms strong task-specific counterparts but can concurrently evaluate multiple aspects.
    摘要 在这项研究中,我们提出了一个新的研究问题:评估音乐产品中的积极和风险消息。我们首先建立了多角度多级音乐内容评估的标准准则,然后提出了一种高效的多任务预测模型,并在这些任务之间强制实施排序约束。我们的结果表明,我们提出的方法不仅在多个任务上显著超越了专门的对手,而且可以同时评估多个方面。
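The abstract describes a multi-task prediction model with ordinality enforcement. One common way to enforce ordinality is with cumulative-logit (CORAL-style) heads; the sketch below is an illustrative PyTorch version of that idea, with all dimensions and the number of aspects/levels assumed rather than taken from the paper.

```python
# Minimal sketch of multi-task ordinal prediction: one shared encoder, one head per
# assessed aspect, each head emitting cumulative logits so predicted severity levels
# respect their natural order. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MultiAspectOrdinal(nn.Module):
    def __init__(self, in_dim=768, n_aspects=3, n_levels=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        # each aspect predicts (n_levels - 1) cumulative thresholds
        self.heads = nn.ModuleList(
            [nn.Linear(256, n_levels - 1) for _ in range(n_aspects)]
        )

    def forward(self, x):
        h = self.encoder(x)
        return [head(h) for head in self.heads]   # one logits tensor per aspect

def ordinal_loss(logits, levels):
    """levels: integer severity in [0, n_levels-1]. The k-th logit models
    P(level > k), which enforces ordinality across thresholds."""
    n_thresh = logits.size(1)
    targets = (levels.unsqueeze(1) > torch.arange(n_thresh)).float()
    return nn.functional.binary_cross_entropy_with_logits(logits, targets)

if __name__ == "__main__":
    model = MultiAspectOrdinal()
    x = torch.randn(8, 768)
    y = torch.randint(0, 4, (8,))
    loss = sum(ordinal_loss(o, y) for o in model(x))
    loss.backward()
    print(float(loss))
```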

Double Deep Q-Learning-based Path Selection and Service Placement for Latency-Sensitive Beyond 5G Applications

  • paper_url: http://arxiv.org/abs/2309.10180
  • repo_url: None
  • paper_authors: Masoud Shokrnezhad, Tarik Taleb, Patrizio Dazzi
  • for: This paper aims to solve the joint problem of communication and computing resource allocation in cloud-network integrated infrastructures to minimize total cost.
  • methods: The paper proposes two approaches based on the Branch & Bound and Water-Filling algorithms to solve the problem when the system is fully known, and a Double Deep Q-Learning (DDQL) architecture is designed for partially known systems.
  • results: Numerical simulations show that the proposed B&B-CCRA approach optimally solves the problem, while the WF-CCRA approach delivers near-optimal solutions in a substantially shorter time. Additionally, the DDQL-CCRA approach obtains near-optimal solutions in the absence of request-specific information.
    Abstract Nowadays, as the need for capacity continues to grow, entirely novel services are emerging. A solid cloud-network integrated infrastructure is necessary to supply these services in a real-time responsive, and scalable way. Due to their diverse characteristics and limited capacity, communication and computing resources must be collaboratively managed to unleash their full potential. Although several innovative methods have been proposed to orchestrate the resources, most ignored network resources or relaxed the network as a simple graph, focusing only on cloud resources. This paper fills the gap by studying the joint problem of communication and computing resource allocation, dubbed CCRA, including function placement and assignment, traffic prioritization, and path selection considering capacity constraints and quality requirements, to minimize total cost. We formulate the problem as a non-linear programming model and propose two approaches, dubbed B\&B-CCRA and WF-CCRA, based on the Branch \& Bound and Water-Filling algorithms to solve it when the system is fully known. Then, for partially known systems, a Double Deep Q-Learning (DDQL) architecture is designed. Numerical simulations show that B\&B-CCRA optimally solves the problem, whereas WF-CCRA delivers near-optimal solutions in a substantially shorter time. Furthermore, it is demonstrated that DDQL-CCRA obtains near-optimal solutions in the absence of request-specific information.
    摘要 如今,随着容量需求的持续增长,全新的服务不断涌现。要以实时响应且可扩展的方式提供这些服务,一个稳固的云网络集成基础设施必不可少。由于这些资源特性多样且容量有限,通信和计算资源必须协同管理,以发挥最大潜力。虽然已有一些创新方法被提出来协调资源,但大多数忽视了网络资源,或将网络简化为简单的图,只关注云资源。这篇论文填补了这一空白,研究通信与计算资源联合分配(CCRA)问题,包括函数放置和分配、流量优先级和路径选择,在容量约束和质量要求下最小化总成本。我们将问题建模为非线性规划模型,并提出了基于 Branch \& Bound 和 Water-Filling 算法的 B\&B-CCRA 和 WF-CCRA 两种方法,用于系统完全已知时的求解。对于部分已知的系统,我们设计了 Double Deep Q-Learning(DDQL)架构。数值仿真显示,B\&B-CCRA 可以最优地解决问题,而 WF-CCRA 可以在明显更短的时间内提供近似最优解。此外,DDQL-CCRA 可以在缺乏请求特定信息时获得近似最优解。
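The DDQL component refers to Double Deep Q-Learning. A generic Double-DQN target computation is sketched below: the online network selects the next action and the target network evaluates it. Network sizes and the environment interface are illustrative assumptions, not the paper's setup.

```python
# Generic Double DQN target computation of the kind a DDQL agent relies on.
import torch
import torch.nn as nn

def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # selection
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # evaluation
        return rewards + gamma * (1.0 - dones) * next_q

if __name__ == "__main__":
    n_state, n_action, batch = 16, 5, 4
    make_net = lambda: nn.Sequential(nn.Linear(n_state, 64), nn.ReLU(), nn.Linear(64, n_action))
    online, target = make_net(), make_net()
    target.load_state_dict(online.state_dict())
    y = double_dqn_targets(online, target,
                           rewards=torch.rand(batch),
                           next_states=torch.randn(batch, n_state),
                           dones=torch.zeros(batch))
    print(y.shape)  # torch.Size([4])
```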

Self-Sustaining Multiple Access with Continual Deep Reinforcement Learning for Dynamic Metaverse Applications

  • paper_url: http://arxiv.org/abs/2309.10177
  • repo_url: None
  • paper_authors: Hamidreza Mazandarani, Masoud Shokrnezhad, Tarik Taleb, Richard Li
  • for: This paper aims to address the challenge of managing multiple access to the frequency spectrum in a dynamic and complex scenario of the Metaverse, with a focus on maximizing the throughput of intelligent agents in multi-channel environments.
  • methods: The proposed method is based on Double Deep Q-Learning (DDQL) empowered by Continual Learning (CL), which is designed to overcome the non-stationary situation and unknown environment.
  • results: The numerical simulations show that the CL-DDQL algorithm achieves significantly higher throughputs with a considerably shorter convergence time compared to other well-known methods, especially in highly dynamic scenarios.
    Abstract The Metaverse is a new paradigm that aims to create a virtual environment consisting of numerous worlds, each of which will offer a different set of services. To deal with such a dynamic and complex scenario, considering the stringent quality of service requirements aimed at the 6th generation of communication systems (6G), one potential approach is to adopt self-sustaining strategies, which can be realized by employing Adaptive Artificial Intelligence (Adaptive AI) where models are continually re-trained with new data and conditions. One aspect of self-sustainability is the management of multiple access to the frequency spectrum. Although several innovative methods have been proposed to address this challenge, mostly using Deep Reinforcement Learning (DRL), the problem of adapting agents to a non-stationary environment has not yet been precisely addressed. This paper fills in the gap in the current literature by investigating the problem of multiple access in multi-channel environments to maximize the throughput of the intelligent agent when the number of active User Equipments (UEs) may fluctuate over time. To solve the problem, a Double Deep Q-Learning (DDQL) technique empowered by Continual Learning (CL) is proposed to overcome the non-stationary situation, while the environment is unknown. Numerical simulations demonstrate that, compared to other well-known methods, the CL-DDQL algorithm achieves significantly higher throughputs with a considerably shorter convergence time in highly dynamic scenarios.
    摘要 Metaverse 是一种新的范式,旨在创造一个由多个世界组成的虚拟环境,每个世界都会提供不同的服务。为了应对如此动态和复杂的情况,并满足面向 6G 通信系统的严格服务质量要求,一种可能的方法是采用自我维持策略,这可以通过适应性人工智能(Adaptive AI)实现,即模型随新数据和新条件不断重新训练。自我维持的一个方面是频谱多址访问的管理。虽然已经有许多创新方法被提出来解决这一挑战,且主要基于深度强化学习(DRL),但代理在非平稳环境中的适应问题尚未得到准确解决。这篇论文填补了现有文献的空白,研究了在活动用户设备(UE)数量随时间波动的情况下,多信道环境中的多址访问问题,以最大化智能代理的吞吐量。为解决这个问题,我们提出了一种由持续学习(CL)赋能的 Double Deep Q-Learning(DDQL)技术,以在环境未知的情况下应对非平稳性。数值仿真表明,在高度动态的场景中,CL-DDQL 算法相比其他已知方法可实现显著更高的吞吐量和更短的收敛时间。

One ACT Play: Single Demonstration Behavior Cloning with Action Chunking Transformers

  • paper_url: http://arxiv.org/abs/2309.10175
  • repo_url: None
  • paper_authors: Abraham George, Amir Barati Farimani
  • for: 学习人类示例(行为克隆)是机器人学习的基础。但大多数行为克隆算法需要许多示例来学习任务,特别是初始条件多样的复杂任务。然而,人类只需看一两次示例便可以完成任务。我们的工作希望模仿这种能力,仅凭单个人类示例,通过行为克隆学习任务。我们通过对单个示例施加线性变换来进行扩增,生成覆盖多种初始条件的一组轨迹。借助这些示例,我们可以训练一个行为克隆智能体成功完成三个积木操作任务。
  • methods: 我们使用线性变换扩增单个示例,生成覆盖多种初始条件的一组轨迹。此外,我们还开发了一种新方法,即在推理过程中将动作预测的标准差纳入 temporal ensembling 方法,以提高对环境意外变化的鲁棒性。
  • results: 我们通过对三个块处理任务进行训练,成功地使用单个人类示例来学习这些任务。此外,我们的方法在对环境变化时表现更加稳定和可靠,从而实现了显著性能提高。
    Abstract Learning from human demonstrations (behavior cloning) is a cornerstone of robot learning. However, most behavior cloning algorithms require a large number of demonstrations to learn a task, especially for general tasks that have a large variety of initial conditions. Humans, however, can learn to complete tasks, even complex ones, after only seeing one or two demonstrations. Our work seeks to emulate this ability, using behavior cloning to learn a task given only a single human demonstration. We achieve this goal by using linear transforms to augment the single demonstration, generating a set of trajectories for a wide range of initial conditions. With these demonstrations, we are able to train a behavior cloning agent to successfully complete three block manipulation tasks. Additionally, we developed a novel addition to the temporal ensembling method used by action chunking agents during inference. By incorporating the standard deviation of the action predictions into the ensembling method, our approach is more robust to unforeseen changes in the environment, resulting in significant performance improvements.
    摘要 学习人类示例(行为克隆)是机器人学习的基础。然而,大多数行为克隆算法需要许多示例来学习任务,特别是面临着较大的初始条件多样性的任务。然而,人类可以通过只看一两次示例来完成复杂任务。我们的工作寻求模仿这种能力,使用行为克隆算法学习任务,只需要单个人类示例。我们通过使用线性变换来扩展单个示例,生成一系列的轨迹,以适应各种初始条件。通过这些示例,我们可以训练一个行为克隆代理人来成功完成三个块处理任务。此外,我们还开发了一种新的添加到 temporal ensembling 方法中的方法,通过在推理过程中包含行动预测的标准差,使我们的方法更加鲁棒对待不可预期的环境变化,从而实现显著的性能提升。
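A minimal sketch of the two ingredients described above, with shapes, planar 2-D actions, and the std-based gating rule chosen purely for illustration:

```python
# Illustrative sketch only: (1) augment a single 2-D demonstration with random
# rotations/translations so one trajectory covers many initial conditions, and
# (2) temporally ensemble overlapping action-chunk predictions, falling back to
# the newest prediction when the ensemble's standard deviation is high.
import numpy as np

def augment_demo(trajectory, n_aug=32, rot_deg=15.0, trans=0.05, seed=0):
    """trajectory: [T, 2] planar waypoints from ONE human demonstration."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_aug):
        theta = np.deg2rad(rng.uniform(-rot_deg, rot_deg))
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        t = rng.uniform(-trans, trans, size=2)
        out.append(trajectory @ R.T + t)           # one linear transform per copy
    return np.stack(out)                           # [n_aug, T, 2]

def ensemble_actions(predictions, m=0.1, std_gate=0.2):
    """predictions[i]: the action predicted for the current timestep i steps ago.
    Exponential recency weighting, then blend toward the newest prediction when
    members disagree -- one plausible way to use the prediction std."""
    preds = np.asarray(predictions, dtype=float)
    w = np.exp(-m * np.arange(len(preds)))
    blended = (w[:, None] * preds).sum(axis=0) / w.sum()
    disagreement = preds.std(axis=0).mean()
    alpha = np.clip(disagreement / std_gate, 0.0, 1.0)
    return (1 - alpha) * blended + alpha * preds[0]

if __name__ == "__main__":
    demo = np.linspace([0.0, 0.0], [1.0, 1.0], num=50)   # toy straight-line demo
    print(augment_demo(demo).shape)                       # (32, 50, 2)
    chunk_preds = np.random.default_rng(1).normal(0.5, 0.05, size=(4, 7))
    print(ensemble_actions(chunk_preds).shape)            # (7,)
```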

Asynchronous Perception-Action-Communication with Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2309.10164
  • repo_url: None
  • paper_authors: Saurav Agarwal, Alejandro Ribeiro, Vijay Kumar
  • for: 该论文旨在解决大型机器人群体中的协同决策问题,以实现共同的全局目标。
  • methods: 该paper使用Graph Neural Networks(GNNs)来解决协同Perception-Action-Communication(PAC)循环中的信息共享和行为选择问题。
  • results: 该paper使用 asynchronous PAC 框架,并使用分布式GNNs来计算导航动作和生成通信信息,实现了大规模机器人群体的协同导航和覆盖控制。
    Abstract Collaboration in large robot swarms to achieve a common global objective is a challenging problem in large environments due to limited sensing and communication capabilities. The robots must execute a Perception-Action-Communication (PAC) loop -- they perceive their local environment, communicate with other robots, and take actions in real time. A fundamental challenge in decentralized PAC systems is to decide what information to communicate with the neighboring robots and how to take actions while utilizing the information shared by the neighbors. Recently, this has been addressed using Graph Neural Networks (GNNs) for applications such as flocking and coverage control. Although conceptually, GNN policies are fully decentralized, the evaluation and deployment of such policies have primarily remained centralized or restrictively decentralized. Furthermore, existing frameworks assume sequential execution of perception and action inference, which is very restrictive in real-world applications. This paper proposes a framework for asynchronous PAC in robot swarms, where decentralized GNNs are used to compute navigation actions and generate messages for communication. In particular, we use aggregated GNNs, which enable the exchange of hidden layer information between robots for computational efficiency and decentralized inference of actions. Furthermore, the modules in the framework are asynchronous, allowing robots to perform sensing, extracting information, communication, action inference, and control execution at different frequencies. We demonstrate the effectiveness of GNNs executed in the proposed framework in navigating large robot swarms for collaborative coverage of large environments.
    摘要 在大型机器人群体中通过协作实现共同的全局目标是一个具有挑战性的问题,尤其是在感知和通信能力受限的大环境中。机器人必须实时执行感知-行动-通信(PAC)循环——它们感知局部环境,与其他机器人交流,并实时采取行动。在分布式PAC系统中,一个基本挑战是决定与邻居机器人交流哪些信息,以及如何利用邻居共享的信息采取行动。最近的研究使用图神经网络(GNN)有效地解决了这个问题,并在群集控制和覆盖控制等应用中获得成功。然而,虽然GNN策略在概念上是完全分布式的,但此类策略的评估和部署主要仍是中心化或受限的分布式方式。此外,现有框架假设感知与行动推理按顺序执行,这在实际应用中限制很大。本文提出了一个用于机器人群体的异步PAC框架,其中分布式GNN用于计算导航动作并生成用于通信的消息。具体来说,我们使用聚合GNN,可以在机器人之间交换隐藏层信息,以实现计算效率和分布式的动作推理。此外,框架中的各模块是异步的,使机器人能够以不同的频率执行感知、信息提取、通信、动作推理和控制执行。我们展示了在所提框架中执行的GNN能够有效引导大型机器人群体,对大环境进行协同覆盖。

RadOnc-GPT: A Large Language Model for Radiation Oncology

  • paper_url: http://arxiv.org/abs/2309.10160
  • repo_url: None
  • paper_authors: Zhengliang Liu, Peilong Wang, Yiwei Li, Jason Holmes, Peng Shu, Lian Zhang, Chenbin Liu, Ninghao Liu, Dajiang Zhu, Xiang Li, Quanzheng Li, Samir H. Patel, Terence T. Sio, Tianming Liu, Wei Liu
  • for: 这个论文研究面向放射肿瘤学领域的专用大语言模型的应用。
  • methods: 这个论文使用高级微调方法将语言模型专业化,并在 Mayo Clinic 的大量放射肿瘤科患者病历和临床记录上进行了微调。模型针对三个关键任务进行指令微调:生成放射治疗方案、确定最佳放射方式,以及基于患者诊断细节提供诊断描述和 ICD 代码。
  • results: 与通用大语言模型的输出相比,RadOnc-GPT 生成的输出在清晰度、特异性和临床相关性方面显著提高。研究表明,利用领域专业知识微调的语言模型(如 RadOnc-GPT)有望在放射肿瘤学等高度专业化的医疗领域实现变革性能力。
    Abstract This paper presents RadOnc-GPT, a large language model specialized for radiation oncology through advanced tuning methods. RadOnc-GPT was finetuned on a large dataset of radiation oncology patient records and clinical notes from the Mayo Clinic. The model employs instruction tuning on three key tasks - generating radiotherapy treatment regimens, determining optimal radiation modalities, and providing diagnostic descriptions/ICD codes based on patient diagnostic details. Evaluations conducted by having radiation oncologists compare RadOnc-GPT impressions to general large language model impressions showed that RadOnc-GPT generated outputs with significantly improved clarity, specificity, and clinical relevance. The study demonstrated the potential of using large language models fine-tuned using domain-specific knowledge like RadOnc-GPT to achieve transformational capabilities in highly specialized healthcare fields such as radiation oncology.
    摘要 本文介绍了 RadOnc-GPT,一个通过高级微调方法专门面向放射肿瘤学的大型语言模型。RadOnc-GPT 在来自 Mayo Clinic 的大规模放射肿瘤学病历和临床记录数据集上进行了微调。该模型在三项关键任务上采用指令微调:生成放射治疗方案、确定最佳放射方式,以及根据患者诊断细节给出诊断描述/ICD代码。由放射肿瘤科医生将 RadOnc-GPT 的印象与通用大型语言模型的印象进行比较的评估表明,RadOnc-GPT 生成的输出在清晰度、特异性和临床相关性方面均显著提高。该研究展示了使用领域知识微调的大型语言模型(如 RadOnc-GPT)在放射肿瘤学等高度专业化医疗领域实现变革性能力的潜力。

Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions

  • paper_url: http://arxiv.org/abs/2309.10150
  • repo_url: https://github.com/lucidrains/q-transformer
  • paper_authors: Yevgen Chebotar, Quan Vuong, Alex Irpan, Karol Hausman, Fei Xia, Yao Lu, Aviral Kumar, Tianhe Yu, Alexander Herzog, Karl Pertsch, Keerthana Gopalakrishnan, Julian Ibarz, Ofir Nachum, Sumedh Sontakke, Grecia Salazar, Huong T Tran, Jodilyn Peralta, Clayton Tan, Deeksha Manjunath, Jaspiar Singht, Brianna Zitkovich, Tomas Jackson, Kanishka Rao, Chelsea Finn, Sergey Levine
  • for: 这个论文是为了提出一种可扩展的生成学习方法,用于从大量的离线数据集中训练多任务策略。
  • methods: 该方法使用 Transformer 为通过离线时序差分备份训练的 Q 函数提供可扩展的表示。具体来说,每个动作维度被离散化,并将每个动作维度的 Q 值表示为独立的 token,从而可以将高容量的序列建模技术应用于 Q 学习。
  • results: 作者们表明,Q-Transformer 在一个大型、多样化的真实世界机器人操作任务集上优于此前的离线 RL 算法和模仿学习技术。详细的实验结果和项目网站见 https://q-transformer.github.io。
    Abstract In this work, we present a scalable reinforcement learning method for training multi-task policies from large offline datasets that can leverage both human demonstrations and autonomously collected data. Our method uses a Transformer to provide a scalable representation for Q-functions trained via offline temporal difference backups. We therefore refer to the method as Q-Transformer. By discretizing each action dimension and representing the Q-value of each action dimension as separate tokens, we can apply effective high-capacity sequence modeling techniques for Q-learning. We present several design decisions that enable good performance with offline RL training, and show that Q-Transformer outperforms prior offline RL algorithms and imitation learning techniques on a large diverse real-world robotic manipulation task suite. The project's website and videos can be found at https://q-transformer.github.io
    摘要 在这项工作中,我们提出了一种可扩展的强化学习方法,用于从大规模离线数据集中训练多任务策略,并能同时利用人类示范和自主收集的数据。我们的方法使用 Transformer 为通过离线时序差分备份训练的 Q 函数提供可扩展的表示,因此我们称之为 Q-Transformer。通过将每个动作维度离散化,并将每个动作维度的 Q 值表示为独立的 token,我们可以将高容量的序列建模技术应用于 Q 学习。我们给出了若干能使离线 RL 训练取得良好性能的设计决策,并证明 Q-Transformer 在一个大型、多样化的真实世界机器人操作任务集上超越了先前的离线 RL 算法和模仿学习技术。项目网站和视频见 https://q-transformer.github.io。
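The abstract's core idea, discretizing each action dimension into tokens and backing up Q-values per dimension, can be sketched as follows. The placeholder q_fn stands in for the Transformer's per-token Q-head, and the exact backup rule used in the paper may differ in detail.

```python
# Hedged sketch: per-dimension action tokenization and a per-dimension Q target.
# Intermediate dimensions bootstrap from the maximum over the *next* dimension at
# the same timestep; only the last dimension sees reward and the next state.
import numpy as np

N_BINS = 256

def discretize(action, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map each continuous action dimension to an integer token in [0, n_bins-1]."""
    a = np.clip((action - low) / (high - low), 0.0, 1.0)
    return np.minimum((a * n_bins).astype(int), n_bins - 1)

def per_dimension_targets(q_fn, state, next_state, action_tokens, reward, gamma=0.98):
    """q_fn(state, prefix_tokens) -> [n_bins] Q-values for the next token."""
    d = len(action_tokens)
    targets = np.empty(d)
    for i in range(d - 1):
        prefix = action_tokens[: i + 1]
        targets[i] = q_fn(state, prefix).max()          # max over dimension i+1
    targets[d - 1] = reward + gamma * q_fn(next_state, ()).max()
    return targets

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q_fn = lambda s, prefix: rng.random(N_BINS)          # placeholder Q-head
    tokens = discretize(np.array([0.3, -0.7, 0.1]))
    print(tokens, per_dimension_targets(q_fn, "s", "s_next", tokens, reward=1.0))
```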

Analysis of the Memorization and Generalization Capabilities of AI Agents: Are Continual Learners Robust?

  • paper_url: http://arxiv.org/abs/2309.10149
  • repo_url: None
  • paper_authors: Minsu Kim, Walid Saad
  • for: The paper is written for the practical deployment of continual learning (CL) applications, such as autonomous vehicles or robotics, in dynamic environments.
  • methods: The proposed CL framework uses a capacity-limited memory to save previously observed environmental information and mitigate forgetting issues. The algorithm samples data points from the memory to estimate the distribution of risks over environmental change and obtain robust predictors.
  • results: The proposed algorithm outperforms memory-based CL baselines across all environments while significantly improving the generalization performance on unseen target environments.
    Abstract In continual learning (CL), an AI agent (e.g., autonomous vehicles or robotics) learns from non-stationary data streams under dynamic environments. For the practical deployment of such applications, it is important to guarantee robustness to unseen environments while maintaining past experiences. In this paper, a novel CL framework is proposed to achieve robust generalization to dynamic environments while retaining past knowledge. The considered CL agent uses a capacity-limited memory to save previously observed environmental information to mitigate forgetting issues. Then, data points are sampled from the memory to estimate the distribution of risks over environmental change so as to obtain predictors that are robust with unseen changes. The generalization and memorization performance of the proposed framework are theoretically analyzed. This analysis showcases the tradeoff between memorization and generalization with the memory size. Experiments show that the proposed algorithm outperforms memory-based CL baselines across all environments while significantly improving the generalization performance on unseen target environments.
    摘要 在持续学习(CL)中,AI智能体(例如自动驾驶车或机器人)需要在动态环境下从非平稳的数据流中学习。要在实际中部署此类应用,既要保证对未见过环境的鲁棒性,又要保留过去的经验。本文提出了一种新的CL框架,以在保留既有知识的同时实现对动态环境的鲁棒泛化。所考虑的CL智能体使用容量受限的记忆来保存先前观测到的环境信息,以缓解遗忘问题。然后,从记忆中采样数据点,以估计环境变化下的风险分布,从而获得对未见变化鲁棒的预测器。我们对该框架的泛化和记忆性能进行了理论分析,揭示了记忆容量与记忆、泛化之间的权衡。实验表明,所提算法在所有环境下都优于基于记忆的CL基线,并显著提高了在未见目标环境上的泛化性能。
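A toy sketch of the capacity-limited memory idea: reservoir sampling keeps the buffer bounded while remaining an unbiased sample of everything seen, and resampling mini-batches gives a rough empirical distribution of risks over remembered data. The risk function and sizes are illustrative assumptions, not the paper's.

```python
# Capacity-limited memory via reservoir sampling, plus a crude estimate of the
# risk distribution obtained by repeatedly resampling the buffer.
import random

class ReservoirMemory:
    def __init__(self, capacity=512, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(item)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:        # keep each seen item with prob capacity/seen
                self.buffer[j] = item

    def sample(self, k):
        return self.rng.sample(self.buffer, min(k, len(self.buffer)))

def estimate_risk_distribution(memory, risk_fn, n_draws=100, k=32):
    """Resample mini-batches from memory and record each batch's empirical risk."""
    return [sum(risk_fn(x) for x in memory.sample(k)) / k for _ in range(n_draws)]

if __name__ == "__main__":
    mem = ReservoirMemory(capacity=8)
    for i in range(100):
        mem.add(i)
    risks = estimate_risk_distribution(mem, risk_fn=lambda x: (x % 7) / 7.0, n_draws=5)
    print(mem.buffer, risks)
```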

Human Gait Recognition using Deep Learning: A Comprehensive Review

  • paper_url: http://arxiv.org/abs/2309.10144
  • repo_url: None
  • paper_authors: Muhammad Imran Sharif, Mehwish Mehmood, Muhammad Irfan Sharif, Md Palash Uddin
  • For: This paper provides an overview of gait recognition (GR) technology and analyzes the environmental elements and complications that could affect it, with a focus on deep learning (DL) techniques employed for human GR.
  • Methods: The paper examines existing DL techniques used in GR, including those that address challenges such as shifting lighting conditions, fluctuations in gait patterns, and ensuring uniform performance evaluation across different protocols.
  • Results: The paper aims to generate new research opportunities in GR by analyzing the existing DL techniques and identifying potential areas for improvement.
    Abstract Gait recognition (GR) is a growing biometric modality used for person identification from a distance through visual cameras. GR provides a secure and reliable alternative to fingerprint and face recognition, as it is harder to distinguish between false and authentic signals. Furthermore, its resistance to spoofing makes GR suitable for all types of environments. With the rise of deep learning, steadily improving strides have been made in GR technology with promising results in various contexts. As video surveillance becomes more prevalent, new obstacles arise, such as ensuring uniform performance evaluation across different protocols, reliable recognition despite shifting lighting conditions, fluctuations in gait patterns, and protecting privacy.This survey aims to give an overview of GR and analyze the environmental elements and complications that could affect it in comparison to other biometric recognition systems. The primary goal is to examine the existing deep learning (DL) techniques employed for human GR that may generate new research opportunities.
    摘要 “人体步态识别”(GR)是一种快速发展的生物特征识别方式,可通过视觉摄像头在远距离进行人员识别。GR 提供了一种相较指纹和人脸识别更安全可靠的替代方案,因为其真伪信号更难区分。此外,它对欺骗(spoofing)攻击具有抵抗力,因此适用于各类环境。随着深度学习的兴起,GR 技术持续进步,在多种场景中取得了可观的成果。随着视频监控的普及,新的挑战不断出现,例如确保不同协议下统一的性能评估、在光照条件变化下的可靠识别、步态模式的波动以及隐私保护。本综述旨在概述 GR,并与其他生物特征识别系统相比较,分析可能影响它的环境因素和难点,重点考察现有用于人体 GR 的深度学习技术,以发掘新的研究机遇。

Efficient Low-Rank GNN Defense Against Structural Attacks

  • paper_url: http://arxiv.org/abs/2309.10136
  • repo_url: None
  • paper_authors: Abdullah Alchihabi, Qing En, Yuhong Guo
  • for: 这个研究旨在提出一种能够有效防御针对图神经网络(GNN)的结构性攻击的方法,以提高 GNN 的安全性。
  • methods: 这个方法包括两个模组:粗略低秩估计模组和细粒度估计模组。粗略低秩估计模组使用截断 SVD(truncated SVD)来初始化低秩邻接矩阵估计,并作为 GNN 模型估计的起点。在细粒度估计模组中,则与 GNN 模型共同学习低秩稀疏的图结构,得到低秩且稀疏的邻接矩阵估计。
  • results: 实验结果显示,ELR-GNN 比文献中已有的 GNN 防御方法更高效,同时也非常简单易于训练。
    Abstract Graph Neural Networks (GNNs) have been shown to possess strong representation abilities over graph data. However, GNNs are vulnerable to adversarial attacks, and even minor perturbations to the graph structure can significantly degrade their performance. Existing methods either are ineffective against sophisticated attacks or require the optimization of dense adjacency matrices, which is time-consuming and prone to local minima. To remedy this problem, we propose an Efficient Low-Rank Graph Neural Network (ELR-GNN) defense method, which aims to learn low-rank and sparse graph structures for defending against adversarial attacks, ensuring effective defense with greater efficiency. Specifically, ELR-GNN consists of two modules: a Coarse Low-Rank Estimation Module and a Fine-Grained Estimation Module. The first module adopts the truncated Singular Value Decomposition (SVD) to initialize the low-rank adjacency matrix estimation, which serves as a starting point for optimizing the low-rank matrix. In the second module, the initial estimate is refined by jointly learning a low-rank sparse graph structure with the GNN model. Sparsity is incorporated into the learned low-rank adjacency matrix by pruning weak connections, which can reduce redundant data while maintaining valuable information. As a result, instead of using the dense adjacency matrix directly, ELR-GNN can learn a low-rank and sparse estimate of it in a simple, efficient and easy to optimize manner. The experimental results demonstrate that ELR-GNN outperforms the state-of-the-art GNN defense methods in the literature, in addition to being very efficient and easy to train.
    摘要 图神经网络(GNN)已被证明对图数据具有很强的表示能力。然而,GNN 容易受到对抗攻击,即使对图结构的微小扰动也可能显著降低其性能。现有方法要么难以抵御复杂的攻击,要么需要优化稠密的邻接矩阵,既耗时又容易陷入局部极小值。为了解决这个问题,我们提出了一种高效低秩图神经网络(ELR-GNN)防御方法,旨在学习低秩且稀疏的图结构来防御对抗攻击,在保证防御效果的同时提高效率。具体来说,ELR-GNN 由两个模块组成:粗略低秩估计模块和细粒度估计模块。第一个模块使用截断奇异值分解(SVD)初始化低秩邻接矩阵估计,作为优化低秩矩阵的起点。第二个模块在此初始估计的基础上,与 GNN 模型联合学习低秩稀疏的图结构进行细化。通过剪除弱连接将稀疏性引入所学的低秩邻接矩阵,在保留有价值信息的同时减少冗余数据。因此,ELR-GNN 无需直接使用稠密邻接矩阵,而能以简单、高效且易于优化的方式学习其低秩稀疏估计。实验结果表明,ELR-GNN 优于文献中最先进的 GNN 防御方法,而且非常高效、易于训练。
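The Coarse Low-Rank Estimation step can be illustrated with a truncated SVD of the adjacency matrix followed by pruning of weak entries; the rank and threshold below are arbitrary illustrative choices, not values from the paper.

```python
# Rank-r truncated SVD of a (possibly perturbed) adjacency matrix as the initial
# low-rank estimate, with weak entries pruned to induce sparsity.
import numpy as np

def coarse_low_rank_adjacency(adj, rank=8, prune_below=0.1):
    u, s, vt = np.linalg.svd(adj, full_matrices=False)
    low_rank = u[:, :rank] @ np.diag(s[:rank]) @ vt[:rank, :]   # rank-r reconstruction
    low_rank = (low_rank + low_rank.T) / 2                       # keep it symmetric
    low_rank[low_rank < prune_below] = 0.0                       # prune weak connections
    np.fill_diagonal(low_rank, 0.0)
    return low_rank

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = (rng.random((50, 50)) < 0.1).astype(float)
    a = np.maximum(a, a.T)                  # symmetric toy graph
    est = coarse_low_rank_adjacency(a)
    print(est.shape, int((est > 0).sum()))
```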

GDM: Dual Mixup for Graph Classification with Limited Supervision

  • paper_url: http://arxiv.org/abs/2309.10134
  • repo_url: None
  • paper_authors: Abdullah Alchihabi, Yuhong Guo
  • for: 提高graph neural network(GNN)在graph classification任务上的性能,降低需要大量标注的图样例数量。
  • methods: 提议一种基于 mixup 的图数据增强方法 Graph Dual Mixup(GDM),利用图样本的功能信息和结构信息来生成新的带标注图样本。GDM 首先使用图结构自编码器学习图样本的结构嵌入,然后在学习到的结构嵌入空间中应用 mixup,并由此生成新的图结构。同时,GDM 直接对图样本的输入节点特征应用 mixup,为新的混合图样本生成功能性节点特征。
  • results: 实验结果表明,当标注图样例数量受限时,我们提议的方法可以大幅提高GNN的性能,并且可以增加图样例的多样性和难度。
    Abstract Graph Neural Networks (GNNs) require a large number of labeled graph samples to obtain good performance on the graph classification task. The performance of GNNs degrades significantly as the number of labeled graph samples decreases. To reduce the annotation cost, it is therefore important to develop graph augmentation methods that can generate new graph instances to increase the size and diversity of the limited set of available labeled graph samples. In this work, we propose a novel mixup-based graph augmentation method, Graph Dual Mixup (GDM), that leverages both functional and structural information of the graph instances to generate new labeled graph samples. GDM employs a graph structural auto-encoder to learn structural embeddings of the graph samples, and then applies mixup to the structural information of the graphs in the learned structural embedding space and generates new graph structures from the mixup structural embeddings. As for the functional information, GDM applies mixup directly to the input node features of the graph samples to generate functional node feature information for new mixup graph instances. Jointly, the generated input node features and graph structures yield new graph samples which can supplement the set of original labeled graphs. Furthermore, we propose two novel Balanced Graph Sampling methods to enhance the balanced difficulty and diversity for the generated graph samples. Experimental results on the benchmark datasets demonstrate that our proposed method substantially outperforms the state-of-the-art graph augmentation methods when the labeled graphs are scarce.
    摘要 图神经网络(GNN)需要大量带标注的图样本才能在图分类任务上取得良好性能,而随着带标注图样本数量的减少,其性能会明显下降。为了降低标注成本,需要开发图数据增强方法,生成新的图实例,以增加有限的带标注图样本集的规模和多样性。为此,我们提出了一种基于 mixup 的图增强方法——图双重混合(GDM),它利用图样本的功能信息和结构信息来生成新的带标注图样本。GDM 首先使用图结构自编码器学习图样本的结构嵌入,然后在学习到的结构嵌入空间中对图的结构信息应用 mixup,并由混合后的结构嵌入生成新的图结构。对于功能信息,GDM 直接对图样本的输入节点特征应用 mixup,为新的混合图实例生成功能性节点特征。生成的输入节点特征和图结构共同构成新的图样本,可以补充原有的带标注图集。此外,我们还提出了两种新的平衡图采样方法,以提高生成图样本的平衡难度和多样性。基准数据集上的实验结果表明,在带标注图稀缺时,我们提出的方法显著优于当前最先进的图增强方法。
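A minimal sketch of the dual mixup operation: the same mixing coefficient is applied to structural embeddings, input node features, and labels. Decoding the mixed structural embedding back into an explicit graph (via the structural auto-encoder) and aligning graphs of different sizes are omitted here; all shapes are assumptions.

```python
# Illustrative dual mixup of two labeled graphs in feature space and in a
# (stand-in) structural embedding space.
import numpy as np

def dual_mixup(feat_a, emb_a, y_a, feat_b, emb_b, y_b, alpha=0.2, rng=None):
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    mixed_features  = lam * feat_a + (1 - lam) * feat_b    # functional mixup
    mixed_embedding = lam * emb_a  + (1 - lam) * emb_b     # structural mixup
    mixed_label     = lam * y_a    + (1 - lam) * y_b
    return mixed_features, mixed_embedding, mixed_label, lam

if __name__ == "__main__":
    n_nodes, feat_dim, emb_dim, n_classes = 12, 16, 32, 3
    rng = np.random.default_rng(0)
    fa, fb = rng.normal(size=(n_nodes, feat_dim)), rng.normal(size=(n_nodes, feat_dim))
    ea, eb = rng.normal(size=emb_dim), rng.normal(size=emb_dim)
    ya, yb = np.eye(n_classes)[0], np.eye(n_classes)[2]
    out = dual_mixup(fa, ea, ya, fb, eb, yb, rng=rng)
    print(out[2], round(out[3], 3))
```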

Adaptive Liquidity Provision in Uniswap V3 with Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.10129
  • repo_url: None
  • paper_authors: Haochen Zhang, Xi Chen, Lin F. Yang
  • for: 这个论文是为了解决Decentralized finance(DeFi)中的一些未解决的问题,如资金效益和市场风险,而设计的。
  • methods: 该论文使用了深度强化学习(DRL)方法,通过自适应调整资产价格范围,以最大化收益并减少市场风险。同时,通过对资产liquidity进行重新平衡,以中和价格变化风险。
  • results: 使用 simulations 的盈亏指标(PnL)基准,该方法在 ETH/USDC 和 ETH/USDT 池中表现出色,与现有基线相比,具有更高的收益。
    Abstract Decentralized exchanges (DEXs) are a cornerstone of decentralized finance (DeFi), allowing users to trade cryptocurrencies without the need for third-party authorization. Investors are incentivized to deposit assets into liquidity pools, against which users can trade directly, while paying fees to liquidity providers (LPs). However, a number of unresolved issues related to capital efficiency and market risk hinder DeFi's further development. Uniswap V3, a leading and groundbreaking DEX project, addresses capital efficiency by enabling LPs to concentrate their liquidity within specific price ranges for deposited assets. Nevertheless, this approach exacerbates market risk, as LPs earn trading fees only when asset prices are within these predetermined brackets. To mitigate this issue, this paper introduces a deep reinforcement learning (DRL) solution designed to adaptively adjust these price ranges, maximizing profits and mitigating market risks. Our approach also neutralizes price-change risks by hedging the liquidity position through a rebalancing portfolio in a centralized futures exchange. The DRL policy aims to optimize trading fees earned by LPs against associated costs, such as gas fees and hedging expenses, which is referred to as loss-versus-rebalancing (LVR). Using simulations with a profit-and-loss (PnL) benchmark, our method demonstrates superior performance in ETH/USDC and ETH/USDT pools compared to existing baselines. We believe that this strategy not only offers investors a valuable asset management tool but also introduces a new incentive mechanism for DEX designers.
    摘要 去中心化交易所 (DEX) 是 DeFi 中的重要基础设施,让用户无需第三方授权即可进行加密货币交易。投资者被激励将资产存入流动性池,用户可直接与这些池进行交易,并向流动性提供者 (LP) 支付费用。然而,一些与资金效率和市场风险相关的未解决问题阻碍了 DeFi 的进一步发展。Uniswap V3 是一个具有开创性的 DEX 项目,它透过让 LP 将流动性集中在特定价格范围内来提升资金效率。然而,这种做法也加剧了市场风险,因为 LP 只有在资产价格位于预先设定的价格区间内时才能赚取交易费用。为了解决这个问题,这篇论文提出了一个深度强化学习 (DRL) 方案,可自适应地调整这些价格区间,以最大化收益并降低市场风险。我们的方法还透过在中心化期货交易所以再平衡投资组合对冲流动性头寸,来中和价格变动风险。DRL 策略的目标是在 LP 赚取的交易费用与相关成本(如 gas 费用和对冲成本)之间进行优化,即所谓的 loss-versus-rebalancing (LVR)。使用以盈亏 (PnL) 为基准的模拟,我们的方法在 ETH/USDC 和 ETH/USDT 流动性池中表现优于现有基准。我们认为,这种策略不仅为投资者提供了有价值的资产管理工具,也为 DEX 设计者引入了新的激励机制。
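A toy accounting sketch of the loss-versus-rebalancing trade-off mentioned above: fees accrue while the price stays inside the LP's range, re-ranging costs gas, and the futures hedge has a carrying cost. All magnitudes and the re-ranging rule are invented purely for illustration; this is not the paper's environment or reward.

```python
# Illustrative LVR bookkeeping: fees earned in range, minus gas for re-ranging
# and the cost of carrying a hedge each step.
def lvr_pnl(prices, lower, upper, fee_per_step=1.0, gas_per_rerange=5.0,
            hedge_cost_per_step=0.2, band_width=0.1):
    pnl, reranges = 0.0, 0
    for p in prices:
        if lower <= p <= upper:
            pnl += fee_per_step                       # in range: earn fees
        else:
            lower, upper = p * (1 - band_width), p * (1 + band_width)
            pnl -= gas_per_rerange                    # out of range: pay gas to recenter
            reranges += 1
        pnl -= hedge_cost_per_step                    # futures hedge carried every step
    return pnl, reranges

if __name__ == "__main__":
    path = [100, 101, 103, 99, 96, 94, 97, 101, 108, 110]
    print(lvr_pnl(path, lower=95, upper=105))
```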

AR-TTA: A Simple Method for Real-World Continual Test-Time Adaptation

  • paper_url: http://arxiv.org/abs/2309.10109
  • repo_url: None
  • paper_authors: Damian Sójka, Sebastian Cygert, Bartłomiej Twardowski, Tomasz Trzciński
  • for: 这个研究旨在验证测试时适应(TTA)方法在真实世界变化下的有效性。
  • methods: 本研究使用了自我训练框架,并整合一个小型记忆缓冲区(memory buffer)以提高模型稳定性,同时根据领域偏移的强度进行动态适应。
  • results: 提出的 AR-TTA 方法在合成基准和更贴近真实世界的基准数据上均优于对比方法,并在不同的 TTA 情况下显示出更高的有效性和稳定性。
    Abstract Test-time adaptation is a promising research direction that allows the source model to adapt itself to changes in data distribution without any supervision. Yet, current methods are usually evaluated on benchmarks that are only a simplification of real-world scenarios. Hence, we propose to validate test-time adaptation methods using the recently introduced datasets for autonomous driving, namely CLAD-C and SHIFT. We observe that current test-time adaptation methods struggle to effectively handle varying degrees of domain shift, often resulting in degraded performance that falls below that of the source model. We noticed that the root of the problem lies in the inability to preserve the knowledge of the source model and adapt to dynamically changing, temporally correlated data streams. Therefore, we enhance well-established self-training framework by incorporating a small memory buffer to increase model stability and at the same time perform dynamic adaptation based on the intensity of domain shift. The proposed method, named AR-TTA, outperforms existing approaches on both synthetic and more real-world benchmarks and shows robustness across a variety of TTA scenarios.
    摘要 测试时适应(test-time adaptation)是一个有前景的研究方向,允许源模型在没有任何监督的情况下自行适应数据分布的变化。然而,现有方法通常只在对真实场景做了简化的基准上进行评估。因此,我们提出使用自动驾驶领域最近引入的数据集 CLAD-C 和 SHIFT 来验证测试时适应方法。我们观察到,现有的测试时适应方法难以有效应对不同程度的领域偏移,性能经常下降,甚至低于源模型。我们发现问题的根源在于无法在保留源模型知识的同时,适应动态变化且时间相关的数据流。因此,我们在成熟的自训练框架中加入一个小型记忆缓冲区以提高模型稳定性,并根据领域偏移的强度进行动态适应。所提出的方法称为 AR-TTA,它在合成基准和更贴近真实世界的基准上均优于现有方法,并在多种测试时适应场景中展现了鲁棒性。
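An illustrative adaptation step combining the three ingredients above (self-training on pseudo-labels, a small source replay buffer, and update strength scaled by the estimated shift); the shift proxy, model interface, and mixing choices are assumptions, not AR-TTA's exact recipe.

```python
# Hedged sketch of one test-time adaptation step.
import torch
import torch.nn.functional as F

def shift_intensity(source_mu, test_feats):
    """Crude domain-shift proxy: distance between stored source feature means
    and the current test batch's feature means, squashed to [0, 1]."""
    return torch.tanh((test_feats.mean(0) - source_mu).norm())

def adapt_step(model, optimizer, test_x, buffer_x, buffer_y, source_mu, base_lr=1e-4):
    feats = model.features(test_x)
    s = shift_intensity(source_mu, feats)
    for g in optimizer.param_groups:                 # adapt harder when shift is large
        g["lr"] = float(base_lr * s)
    logits = model.head(feats)
    pseudo = logits.argmax(1).detach()
    loss = F.cross_entropy(logits, pseudo)           # self-training on the test batch
    loss = loss + F.cross_entropy(model.head(model.features(buffer_x)), buffer_y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss), float(s)

if __name__ == "__main__":
    class TinyNet(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.features = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU())
            self.head = torch.nn.Linear(16, 3)
    net = TinyNet()
    opt = torch.optim.SGD(net.parameters(), lr=1e-4)
    print(adapt_step(net, opt, torch.randn(10, 8), torch.randn(4, 8),
                     torch.randint(0, 3, (4,)), source_mu=torch.zeros(16)))
```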

Reasoning about the Unseen for Efficient Outdoor Object Navigation

  • paper_url: http://arxiv.org/abs/2309.10103
  • repo_url: None
  • paper_authors: Quanting Xie, Tianyi Zhang, Kedi Xu, Matthew Johnson-Roberson, Yonatan Bisk
  • for: 本研究旨在探讨机器人在户外环境中的自主导航问题,与现有的室内环境导航研究不同,涵盖更广泛的实际应用场景。
  • methods: 本研究提出了新任务 OUTDOOR、一种让大语言模型(LLM)对可能的未来进行推想的新机制,以及一种新的计算感知成功指标,以推动这一更复杂领域的研究进展。
  • results: 本研究在模拟无人机和实体四足机器人上取得了出色的结果,且无需预先建图;我们的形式化方法优于朴素的基于 LLM 的方法。
    Abstract Robots should exist anywhere humans do: indoors, outdoors, and even unmapped environments. In contrast, the focus of recent advancements in Object Goal Navigation(OGN) has targeted navigating in indoor environments by leveraging spatial and semantic cues that do not generalize outdoors. While these contributions provide valuable insights into indoor scenarios, the broader spectrum of real-world robotic applications often extends to outdoor settings. As we transition to the vast and complex terrains of outdoor environments, new challenges emerge. Unlike the structured layouts found indoors, outdoor environments lack clear spatial delineations and are riddled with inherent semantic ambiguities. Despite this, humans navigate with ease because we can reason about the unseen. We introduce a new task OUTDOOR, a new mechanism for Large Language Models (LLMs) to accurately hallucinate possible futures, and a new computationally aware success metric for pushing research forward in this more complex domain. Additionally, we show impressive results on both a simulated drone and physical quadruped in outdoor environments. Our agent has no premapping and our formalism outperforms naive LLM-based approaches
    摘要 机器人应当能够出现在人类所到之处:室内、室外,乃至未建图的环境。然而,近期对象目标导航(OGN)的进展主要集中于室内环境,依赖难以推广到户外的空间与语义线索。虽然这些贡献为室内场景提供了宝贵的洞见,但更广泛的实际机器人应用往往延伸到户外环境。当我们从室内转向广阔而复杂的户外地形时,新的挑战随之出现。与室内的结构化布局不同,户外环境缺乏清晰的空间划分,并充满固有的语义模糊。尽管如此,人类仍能轻松导航,因为我们能够对未见之处进行推理。我们引入了一个新任务 OUTDOOR、一种让大语言模型(LLM)对可能的未来进行推想的新机制,以及一种新的计算感知成功指标,以推动这一更复杂领域的研究。此外,我们在模拟无人机和实体四足机器人上在户外环境中取得了出色的结果。我们的智能体无需预先建图,且我们的形式化方法优于朴素的基于 LLM 的方法。

Data Formulator: AI-powered Concept-driven Visualization Authoring

  • paper_url: http://arxiv.org/abs/2309.10094
  • repo_url: None
  • paper_authors: Chenglong Wang, John Thompson, Bongshin Lee
  • for: 降低数据可视化创作(authoring)中数据变换带来的门槛,帮助作者快速创建所需的数据可视化图表。
  • methods: 借助人工智能代理实现“概念绑定”可视化范式:作者先用自然语言或示例定义数据概念,然后将其绑定到视觉通道上。
  • results: 人工智能代理自动对输入数据进行变换以呈现这些概念并生成所需的可视化;在展示结果(变换后的表格和输出可视化)时,工具还提供反馈,帮助作者检查和理解这些结果。
    Abstract With most modern visualization tools, authors need to transform their data into tidy formats to create visualizations they want. Because this requires experience with programming or separate data processing tools, data transformation remains a barrier in visualization authoring. To address this challenge, we present a new visualization paradigm, concept binding, that separates high-level visualization intents and low-level data transformation steps, leveraging an AI agent. We realize this paradigm in Data Formulator, an interactive visualization authoring tool. With Data Formulator, authors first define data concepts they plan to visualize using natural languages or examples, and then bind them to visual channels. Data Formulator then dispatches its AI-agent to automatically transform the input data to surface these concepts and generate desired visualizations. When presenting the results (transformed table and output visualizations) from the AI agent, Data Formulator provides feedback to help authors inspect and understand them. A user study with 10 participants shows that participants could learn and use Data Formulator to create visualizations that involve challenging data transformations, and presents interesting future research directions.
    摘要 现代视觉工具中,作者通常需要将数据转换成整齐的格式,以创建他们想要的视觉。由于这需要编程经验或分离的数据处理工具,数据转换仍然是视觉作者的挑战。为解决这个挑战,我们提出了一种新的视觉 парадиг,即概念绑定。我们在数据形成器中实现了这种 парадиг,这是一个交互式的视觉作业工具。作者首先使用自然语言或示例来定义他们计划要visualize的数据概念,然后将它们绑定到视觉通道上。数据形成器然后将其人工智能代理发送到自动将输入数据转换成Surface这些概念和生成所需的视觉。当presenting the results(转换后的表格和输出视觉)时,数据形成器提供反馈,帮助作者检查和理解它们。一个参与者学习研究表明,参与者可以快速学习并使用数据形成器创建包含复杂数据转换的视觉。这个研究还提出了一些未来研究的可能性。

Conformal Temporal Logic Planning using Large Language Models: Knowing When to Do What and When to Ask for Help

  • paper_url: http://arxiv.org/abs/2309.10092
  • repo_url: None
  • paper_authors: Jun Wang, Jiaming Tong, Kaiyuan Tan, Yevgeniy Vorobeychik, Yiannis Kantaros
  • for: 本研究旨在提出一种新的机器人运动规划问题,即使用自然语言(NL)表述多个高级次任务,并通过时间和逻辑顺序来描述这些任务。
  • methods: 本研究使用在基于自然语言的原子谓词上定义的线性时序逻辑(LTL)来形式化描述这些任务;这与相关规划方法不同,后者将 LTL 任务定义在刻画期望低层系统配置的原子谓词之上。我们的目标是设计满足此类 LTL 任务的机器人计划。
  • results: 我们提出了 HERACLEs,一种层次化的保形(conformal)自然语言规划器,它整合了现有工具,包括:(i) 自动机理论,用于确定为推进任务下一步应完成的基于自然语言的子任务;(ii) 大语言模型,用于设计满足这些子任务的机器人计划;以及 (iii) 保形预测(conformal prediction),用于对所设计计划的正确性和任务满足情况进行概率推理,并判断是否需要外部协助。我们进行了广泛的对比实验,项目网站为 ltl-llm.github.io。
    Abstract This paper addresses a new motion planning problem for mobile robots tasked with accomplishing multiple high-level sub-tasks, expressed using natural language (NL), in a temporal and logical order. To formally define such missions, we leverage LTL defined over NL-based atomic predicates modeling the considered NL-based sub-tasks. This is contrast to related planning approaches that define LTL tasks over atomic predicates capturing desired low-level system configurations. Our goal is to design robot plans that satisfy LTL tasks defined over NL-based atomic propositions. A novel technical challenge arising in this setup lies in reasoning about correctness of a robot plan with respect to such LTL-encoded tasks. To address this problem, we propose HERACLEs, a hierarchical conformal natural language planner, that relies on a novel integration of existing tools that include (i) automata theory to determine the NL-specified sub-task the robot should accomplish next to make mission progress; (ii) Large Language Models to design robot plans satisfying these sub-tasks; and (iii) conformal prediction to reason probabilistically about correctness of the designed plans and mission satisfaction and to determine if external assistance is required. We provide extensive comparative experiments on mobile manipulation tasks. The project website is ltl-llm.github.io.
    摘要 本文研究一个新的移动机器人运动规划问题:机器人需要按时间和逻辑顺序完成多个以自然语言(NL)表述的高层子任务。为了形式化定义此类任务,我们在基于自然语言的原子谓词上定义 LTL,以建模所考虑的自然语言子任务;这不同于相关规划方法在刻画期望低层系统配置的原子谓词上定义 LTL 任务。我们的目标是设计满足这些 LTL 任务的机器人计划。该设定中出现的一个新技术挑战在于如何推断机器人计划相对于此类 LTL 编码任务的正确性。为此,我们提出了层次化保形自然语言规划器 HERACLEs,它将现有工具进行了新颖的整合:(i) 自动机理论,用于确定机器人下一步应完成的自然语言子任务以推进任务进度;(ii) 大语言模型,用于设计满足这些子任务的机器人计划;(iii) 保形预测,用于对所设计计划的正确性与任务满足情况进行概率推理,并判断是否需要外部协助。我们在移动操作任务上进行了广泛的对比实验。项目网站:ltl-llm.github.io。
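The conformal-prediction step can be illustrated with split conformal calibration: a threshold is calibrated on held-out nonconformity scores, and the robot asks for external assistance whenever the resulting prediction set over candidate sub-tasks is not a singleton. The score definition and the candidate sub-tasks below are hypothetical.

```python
# Split conformal calibration and a prediction-set check for plan verification.
import math

def conformal_threshold(calibration_scores, delta=0.1):
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - delta))          # finite-sample quantile index
    return sorted(calibration_scores)[min(k, n) - 1]

def prediction_set(candidate_confidences, threshold):
    """Keep every candidate sub-task whose nonconformity (1 - confidence)
    is below the calibrated threshold."""
    return [c for c, p in candidate_confidences if 1 - p <= threshold]

if __name__ == "__main__":
    cal = [0.05, 0.12, 0.30, 0.08, 0.22, 0.15, 0.40, 0.10, 0.18, 0.25]
    q = conformal_threshold(cal, delta=0.2)
    cands = [("goto_kitchen", 0.91), ("pick_cup", 0.55), ("wait", 0.20)]
    kept = prediction_set(cands, q)
    ask_for_help = len(kept) != 1                 # ambiguous or empty -> ask for assistance
    print(q, kept, ask_for_help)
```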

Unified Coarse-to-Fine Alignment for Video-Text Retrieval

  • paper_url: http://arxiv.org/abs/2309.10091
  • repo_url: https://github.com/ziyang412/ucofia
  • paper_authors: Ziyang Wang, Yi-Lin Sung, Feng Cheng, Gedas Bertasius, Mohit Bansal
  • for: 这个论文的目的是提出一种基于CLIP的视频-文本关联模型,以解决视频-文本关联任务中的找到正确的视频问题。
  • methods: 该模型在粗粒度和细粒度层级上进行跨模态对齐,并应用交互式相似度聚合模块(ISA),在聚合跨模态相似度时考虑不同视觉特征的重要性并重新加权。
  • results: 该模型在多个视频-文本关联数据集上表现出色,与之前的CLIP-based方法相比,实现了2.4%、1.4%和1.3%的提升。
    Abstract The canonical approach to video-text retrieval leverages a coarse-grained or fine-grained alignment between visual and textual information. However, retrieving the correct video according to the text query is often challenging as it requires the ability to reason about both high-level (scene) and low-level (object) visual clues and how they relate to the text query. To this end, we propose a Unified Coarse-to-fine Alignment model, dubbed UCoFiA. Specifically, our model captures the cross-modal similarity information at different granularity levels. To alleviate the effect of irrelevant visual clues, we also apply an Interactive Similarity Aggregation module (ISA) to consider the importance of different visual features while aggregating the cross-modal similarity to obtain a similarity score for each granularity. Finally, we apply the Sinkhorn-Knopp algorithm to normalize the similarities of each level before summing them, alleviating over- and under-representation issues at different levels. By jointly considering the crossmodal similarity of different granularity, UCoFiA allows the effective unification of multi-grained alignments. Empirically, UCoFiA outperforms previous state-of-the-art CLIP-based methods on multiple video-text retrieval benchmarks, achieving 2.4%, 1.4% and 1.3% improvements in text-to-video retrieval R@1 on MSR-VTT, Activity-Net, and DiDeMo, respectively. Our code is publicly available at https://github.com/Ziyang412/UCoFiA.
    摘要 视频-文本检索的经典做法依赖视觉信息与文本信息之间的粗粒度或细粒度对齐。然而,根据文本查询检索正确的视频往往很有挑战性,因为这需要同时推理高层(场景)和低层(物体)的视觉线索,以及它们与文本查询的关系。为此,我们提出了统一的粗到细对齐模型 UCoFiA。具体来说,该模型在不同粒度层级上捕捉跨模态相似度信息。为减轻无关视觉线索的影响,我们还应用交互式相似度聚合模块(ISA),在聚合跨模态相似度时考虑不同视觉特征的重要性,从而得到每个粒度层级的相似度分数。最后,我们在求和之前使用 Sinkhorn-Knopp 算法对各层级的相似度进行归一化,缓解不同层级上的过度表示与表示不足问题。通过联合考虑不同粒度的跨模态相似度,UCoFiA 实现了多粒度对齐的有效统一。实验表明,UCoFiA 在多个视频-文本检索基准上优于此前最先进的基于 CLIP 的方法,在 MSR-VTT、Activity-Net 和 DiDeMo 上的文本到视频检索 R@1 分别提升 2.4%、1.4% 和 1.3%。代码公开于 https://github.com/Ziyang412/UCoFiA。
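The Sinkhorn-Knopp normalization mentioned in the abstract alternates row and column rescaling of the similarity matrix; a small NumPy sketch follows, with an assumed exponential map used only to keep entries positive.

```python
# Alternating row/column normalization (Sinkhorn-Knopp) of a similarity matrix,
# which tempers over- and under-represented rows/columns before aggregation.
import numpy as np

def sinkhorn_normalize(sim, n_iters=20, eps=1e-8):
    s = np.exp(sim - sim.max())        # positive matrix, numerically stable
    for _ in range(n_iters):
        s = s / (s.sum(axis=1, keepdims=True) + eps)   # row normalization
        s = s / (s.sum(axis=0, keepdims=True) + eps)   # column normalization
    return s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim = rng.normal(size=(4, 6))
    bal = sinkhorn_normalize(sim)
    print(bal.sum(axis=0).round(3))    # columns ~ uniform after balancing
```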

HTEC: Human Transcription Error Correction

  • paper_url: http://arxiv.org/abs/2309.10089
  • repo_url: None
  • paper_authors: Hanbo Sun, Jian Gao, Xiaomin Wu, Anjie Fang, Cheng Cao, Zheng Du
  • for: 提高用于训练和改进自动语音识别(ASR)模型的人工转录质量
  • methods: 提出了一种基于人工纠正的人类纠正错误(HTEC)方法,包括两个阶段:检测错误的“Trans-Checker”模型和填充错误位置的“Trans-Filler”模型,以及一个包含多种修正操作的整体修正操作列表
  • results: HTEC 在 WER 上大幅优于其他方法,并在人工转录纠错上超过人类标注员 2.2% 至 4.5%(WER);在辅助人类标注员时,可在不降低转录速度的前提下将转录质量提高 15.1%。
    Abstract High-quality human transcription is essential for training and improving Automatic Speech Recognition (ASR) models. Recent study~\cite{libricrowd} has found that every 1% worse transcription Word Error Rate (WER) increases approximately 2% ASR WER by using the transcriptions to train ASR models. Transcription errors are inevitable for even highly-trained annotators. However, few studies have explored human transcription correction. Error correction methods for other problems, such as ASR error correction and grammatical error correction, do not perform sufficiently for this problem. Therefore, we propose HTEC for Human Transcription Error Correction. HTEC consists of two stages: Trans-Checker, an error detection model that predicts and masks erroneous words, and Trans-Filler, a sequence-to-sequence generative model that fills masked positions. We propose a holistic list of correction operations, including four novel operations handling deletion errors. We further propose a variant of embeddings that incorporates phoneme information into the input of the transformer. HTEC outperforms other methods by a large margin and surpasses human annotators by 2.2% to 4.5% in WER. Finally, we deployed HTEC to assist human annotators and showed HTEC is particularly effective as a co-pilot, which improves transcription quality by 15.1% without sacrificing transcription velocity.
    摘要 高质量的人工转录是训练和改进自动语音识别(ASR)模型的重要前提。最近的研究发现,转录的词错误率(WER)每变差 1%,用这些转录训练的 ASR 模型的 WER 约提高 2%。即使是训练有素的标注员,转录错误也在所难免,然而探讨人工转录纠错的研究很少。其他问题的纠错方法,如 ASR 纠错和语法纠错,在这个问题上表现不足。因此,我们提出了用于人工转录错误纠正的 HTEC。HTEC 包括两个阶段:错误检测模型 Trans-Checker,用于预测并掩盖错误单词;以及序列到序列生成模型 Trans-Filler,用于填充被掩盖的位置。我们提出了一份完整的纠错操作列表,其中包括四种处理删除错误的新操作。此外,我们还提出了一种将音素信息融入 Transformer 输入的嵌入变体。HTEC 大幅优于其他方法,并在 WER 上超过人类标注员 2.2% 到 4.5%。最后,我们将 HTEC 部署为人类标注员的助手,结果显示 HTEC 作为辅助工具尤为有效,可在不降低转录速度的情况下将转录质量提高 15.1%。

GAME: Generalized deep learning model towards multimodal data integration for early screening of adolescent mental disorders

  • paper_url: http://arxiv.org/abs/2309.10077
  • repo_url: None
  • paper_authors: Zhicheng Du, Chenyao Jiang, Xi Yuan, Shiyao Zhai, Zhengyang Lei, Shuyue Ma, Yang Liu, Qihui Ye, Chufan Xiao, Qiming Huang, Ming Xu, Dongmei Yu, Peiwu Qin
  • for: 这个研究旨在提供一个可靠的多modal数据收集系统,以早期识别青少年的情绪障碍。
  • methods: 研究人员使用了一个Android应用程序,该应用程序包含了多个游戏和聊天纪录功能,并将其部署在一个便携式的机器人上,以萤幕3,783名中学生的情绪状态。
  • results: 研究人员发现,这个系统可以实时识别青少年的情绪障碍,并且具有73.34%-92.77%的准确率和71.32%-91.06%的F1分数。此外,研究人员发现每个感知modal都会在不同的情绪障碍中发挥不同的作用,这显示了这个系统的可解释性。
    Abstract The timely identification of mental disorders in adolescents is a global public health challenge.Single factor is difficult to detect the abnormality due to its complex and subtle nature. Additionally, the generalized multimodal Computer-Aided Screening (CAS) systems with interactive robots for adolescent mental disorders are not available. Here, we design an android application with mini-games and chat recording deployed in a portable robot to screen 3,783 middle school students and construct the multimodal screening dataset, including facial images, physiological signs, voice recordings, and textual transcripts.We develop a model called GAME (Generalized Model with Attention and Multimodal EmbraceNet) with novel attention mechanism that integrates cross-modal features into the model. GAME evaluates adolescent mental conditions with high accuracy (73.34%-92.77%) and F1-Score (71.32%-91.06%).We find each modality contributes dynamically to the mental disorders screening and comorbidities among various mental disorders, indicating the feasibility of explainable model. This study provides a system capable of acquiring multimodal information and constructs a generalized multimodal integration algorithm with novel attention mechanisms for the early screening of adolescent mental disorders.
    摘要 及早识别青少年精神障碍是一项全球性的公共卫生挑战。精神障碍的特征复杂且细微,仅凭单一因素难以检测异常。此外,目前尚缺乏面向青少年精神障碍、带有交互机器人的通用多模态计算机辅助筛查(CAS)系统。我们设计了一个包含小游戏和聊天记录功能的安卓应用,并将其部署在便携式机器人上,对 3,783 名中学生进行筛查,构建了包含面部图像、生理指标、语音记录和文本转写的多模态筛查数据集。我们开发了名为 GAME(Generalized Model with Attention and Multimodal EmbraceNet)的模型,其中的新型注意力机制可以将跨模态特征集成到模型中。GAME 能以较高的准确率(73.34%-92.77%)和 F1 分数(71.32%-91.06%)评估青少年的心理状况。我们发现各模态对精神障碍筛查及不同精神障碍之间共病的判断均有动态贡献,表明了模型的可解释性。本研究提供了一个能够获取多模态信息的系统,并构建了带有新型注意力机制的通用多模态融合算法,用于青少年精神障碍的早期筛查。

Sex-based Disparities in Brain Aging: A Focus on Parkinson’s Disease

  • paper_url: http://arxiv.org/abs/2309.10069
  • repo_url: None
  • paper_authors: Iman Beheshti, Samuel Booth, Ji Hyun Ko
  • for: 这项研究旨在了解帕金森病患者的大脑预测年龄差(brain-predicted age difference,简称 brain-PAD)与性别之间的关系,以及这种关系对患者诊断和治疗的影响。
  • methods: 研究使用基于 T1 加权磁共振成像(MRI)的大脑预测年龄差(brain-PAD)计算方法,并使用线性回归模型按性别分层研究帕金森病患者的 brain-PAD 与临床变量之间的关系。
  • results: 研究发现,男性帕金森病患者的平均大脑年龄差显著高于女性患者;男性患者的 brain-PAD 与总体认知功能下降、更严重的睡眠行为障碍、视觉空间能力下降以及尾状核萎缩之间存在统计学显著关系,而女性患者则未观察到这种关系。
    Abstract PD is linked to faster brain aging. Sex is recognized as an important factor in PD, such that males are twice as likely as females to have the disease and have more severe symptoms and a faster progression rate. Despite previous research, there remains a significant gap in understanding the function of sex in the process of brain aging in PD patients. The T1-weighted MRI-driven brain-predicted age difference was computed in a group of 373 PD patients from the PPMI database using a robust brain-age estimation framework that was trained on 949 healthy subjects. Linear regression models were used to investigate the association between brain-PAD and clinical variables in PD, stratified by sex. All female PD patients were used in the correlational analysis while the same number of males were selected based on propensity score matching method considering age, education level, age of symptom onset, and clinical symptom severity. Despite both patient groups being matched for demographics, motor and non-motor symptoms, it was observed that males with Parkinson's disease exhibited a significantly higher mean brain age-delta than their female counterparts . In the propensity score-matched PD male group, brain-PAD was found to be associated with a decline in general cognition, a worse degree of sleep behavior disorder, reduced visuospatial acuity, and caudate atrophy. Conversely, no significant links were observed between these factors and brain-PAD in the PD female group.
    摘要 帕金森病(PD)与更快的大脑老化相关。性别被认为是 PD 的重要因素:男性患病风险约为女性的两倍,症状更严重,病程进展更快。尽管已有相关研究,人们对性别在 PD 患者大脑老化过程中所起作用的理解仍存在显著空白。本研究使用在 949 名健康受试者上训练的稳健脑龄估计框架,基于 T1 加权 MRI 计算了 PPMI 数据库中 373 名 PD 患者的大脑预测年龄差(brain-PAD),并按性别分层,使用线性回归模型研究 brain-PAD 与临床变量之间的关系。尽管两组患者在人口学特征及运动与非运动症状上相互匹配,男性 PD 患者的平均脑龄差仍显著高于女性患者。在经倾向性评分匹配的男性 PD 组中,brain-PAD 与总体认知能力下降、更严重的睡眠行为障碍、视觉空间能力下降以及尾状核萎缩相关;相反,在女性 PD 组中,这些因素与 brain-PAD 之间未观察到显著关联。
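The brain-PAD quantity and the per-group linear regression can be illustrated with synthetic data; the effect size and sample below are invented purely to show the computation.

```python
# brain-PAD = predicted brain age - chronological age; its association with a
# clinical variable tested via an ordinary least-squares slope (per sex group).
import numpy as np

def brain_pad(predicted_age, chronological_age):
    return np.asarray(predicted_age, float) - np.asarray(chronological_age, float)

def linear_association(pad, clinical_score):
    """Least-squares slope of clinical_score ~ brain-PAD (plus intercept)."""
    X = np.column_stack([np.ones_like(pad), pad])
    beta, *_ = np.linalg.lstsq(X, clinical_score, rcond=None)
    return beta[1]                                   # slope w.r.t. brain-PAD

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    chron = rng.uniform(50, 80, size=120)
    pred = chron + rng.normal(3.0, 4.0, size=120)    # positive mean brain-age delta
    pad = brain_pad(pred, chron)
    cognition = 30 - 0.4 * pad + rng.normal(0, 2, size=120)
    print(round(linear_association(pad, cognition), 3))   # ~ -0.4
```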

Automatic Personalized Impression Generation for PET Reports Using Large Language Models

  • paper_url: http://arxiv.org/abs/2309.10066
  • repo_url: https://github.com/xtie97/pet-report-summarization
  • paper_authors: Xin Tie, Muheon Shin, Ali Pirasteh, Nevein Ibrahim, Zachary Huemann, Sharon M. Castellino, Kara M. Kelly, John Garrett, Junjie Hu, Steve Y. Cho, Tyler J. Bradshaw
  • For: The paper aims to determine if fine-tuned large language models (LLMs) can generate accurate, personalized impressions for whole-body PET reports.
  • Methods: The paper uses a corpus of PET reports to train 12 language models using the teacher-forcing algorithm, with the report findings as input and the clinical impressions as reference. The models are trained to learn physician-specific reporting styles by using an extra input token that encodes the reading physician's identity.
  • Results: The paper finds that the fine-tuned PEGASUS model generates clinically acceptable impressions that are comparable in overall utility to those dictated by other physicians. Specifically, 89% of the personalized impressions generated by PEGASUS were considered clinically acceptable, with a mean utility score of 4.08/5.
    Abstract Purpose: To determine if fine-tuned large language models (LLMs) can generate accurate, personalized impressions for whole-body PET reports. Materials and Methods: Twelve language models were trained on a corpus of PET reports using the teacher-forcing algorithm, with the report findings as input and the clinical impressions as reference. An extra input token encodes the reading physician's identity, allowing models to learn physician-specific reporting styles. Our corpus comprised 37,370 retrospective PET reports collected from our institution between 2010 and 2022. To identify the best LLM, 30 evaluation metrics were benchmarked against quality scores from two nuclear medicine (NM) physicians, with the most aligned metrics selecting the model for expert evaluation. In a subset of data, model-generated impressions and original clinical impressions were assessed by three NM physicians according to 6 quality dimensions and an overall utility score (5-point scale). Each physician reviewed 12 of their own reports and 12 reports from other physicians. Bootstrap resampling was used for statistical analysis. Results: Of all evaluation metrics, domain-adapted BARTScore and PEGASUSScore showed the highest Spearman's rho correlations (0.568 and 0.563) with physician preferences. Based on these metrics, the fine-tuned PEGASUS model was selected as the top LLM. When physicians reviewed PEGASUS-generated impressions in their own style, 89% were considered clinically acceptable, with a mean utility score of 4.08/5. Physicians rated these personalized impressions as comparable in overall utility to the impressions dictated by other physicians (4.03, P=0.41). Conclusion: Personalized impressions generated by PEGASUS were clinically useful, highlighting its potential to expedite PET reporting.
    摘要 目的:检验微调后的大型语言模型(LLM)能否为全身 PET 报告生成准确、个性化的印象。材料和方法:使用教师强制(teacher-forcing)算法在 PET 报告语料上训练十二个语言模型,以报告所见为输入、临床印象为参考。模型输入中还包含一个编码阅片医生身份的特殊标记(token),使模型能够学习各医生特有的报告风格。我们的语料包括本机构 2010 年至 2022 年间收集的 37,370 份回顾性 PET 报告。为了选择最佳 LLM,我们将 30 个评估指标与两名核医学(NM)医生的质量评分进行对比,用与医生偏好最一致的指标来选出进入专家评估的模型。在一部分数据中,由三名核医学医生按照 6 个质量维度和总体效用评分(5 分制)评估模型生成的印象和原始临床印象。每名医生审阅 12 份自己的报告和 12 份其他医生的报告。统计分析采用 bootstrap 重采样。结果:在所有评估指标中,领域适应的 BARTScore 和 PEGASUSScore 与医生偏好的 Spearman 相关系数最高(0.568 和 0.563)。基于这些指标,微调后的 PEGASUS 模型被选为最佳 LLM。当医生以自己的风格审阅 PEGASUS 生成的印象时,89% 被认为临床可接受,平均效用评分为 4.08/5。医生认为这些个性化印象的总体效用与其他医生口述的印象相当(4.03,P=0.41)。结论:PEGASUS 生成的个性化印象具有临床实用价值,显示了其加速 PET 报告撰写的潜力。

Toward collision-free trajectory for autonomous and pilot-controlled unmanned aerial vehicles

  • paper_url: http://arxiv.org/abs/2309.10064
  • repo_url: None
  • paper_authors: Kaya Kuru, John Michael Pinder, Benjamin Jon Watkinson, Darren Ansell, Keith Vinning, Lee Moore, Chris Gilbert, Aadithya Sujit, David Jones
  • for: The paper is written for the purpose of developing an advanced collision management methodology for unmanned aerial vehicles (UAVs) to avoid mid-air collisions (MACs) with manned aeroplanes.
  • methods: The paper uses electronic conspicuity (EC) information made available by PilotAware Ltd and a reactive geometric conflict detection and resolution (CDR) technique to determine and execute time-optimal evasive collision avoidance (CA) manoeuvres.
  • results: The proposed methodology is demonstrated to be successful in avoiding collisions while limiting the deviation from the original trajectory in highly dynamic aerospace environments without requiring sophisticated sensors and prior training.
    Abstract For drones, as safety-critical systems, there is an increasing need for onboard detect & avoid (DAA) technology i) to see, sense or detect conflicting traffic or imminent non-cooperative threats due to their high mobility with multiple degrees of freedom and the complexity of deployed unstructured environments, and subsequently ii) to take the appropriate actions to avoid collisions depending upon the level of autonomy. The safe and efficient integration of UAV traffic management (UTM) systems with air traffic management (ATM) systems, using intelligent autonomous approaches, is an emerging requirement where the number of diverse UAV applications is increasing on a large scale in dense air traffic environments for completing swarms of multiple complex missions flexibly and simultaneously. Significant progress over the past few years has been made in detecting UAVs present in aerospace, identifying them, and determining their existing flight path. This study makes greater use of electronic conspicuity (EC) information made available by PilotAware Ltd in developing an advanced collision management methodology -- Drone Aware Collision Management (DACM) -- capable of determining and executing a variety of time-optimal evasive collision avoidance (CA) manoeuvres using a reactive geometric conflict detection and resolution (CDR) technique. The merits of the DACM methodology have been demonstrated through extensive simulations and real-world field tests in avoiding mid-air collisions (MAC) between UAVs and manned aeroplanes. The results show that the proposed methodology can be employed successfully in avoiding collisions while limiting the deviation from the original trajectory in highly dynamic aerospace without requiring sophisticated sensors and prior training.
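
The reactive geometric CDR idea can be illustrated with a closest-point-of-approach check on EC-style position/velocity reports; the separation threshold, time horizon, and constant-velocity assumption below are illustrative and not taken from the paper.

```python
# A simplified geometric conflict check in the spirit of reactive CDR:
# closest point of approach (CPA) between a UAV and an intruder.
import numpy as np

def conflict(p_uav, v_uav, p_intruder, v_intruder,
             sep_m=150.0, horizon_s=60.0):
    """Return (is_conflict, t_cpa, d_cpa) under a constant-velocity assumption."""
    dp = np.asarray(p_intruder, float) - np.asarray(p_uav, float)
    dv = np.asarray(v_intruder, float) - np.asarray(v_uav, float)
    denom = float(dv @ dv)
    t_cpa = 0.0 if denom < 1e-9 else float(np.clip(-(dp @ dv) / denom, 0.0, horizon_s))
    d_cpa = float(np.linalg.norm(dp + dv * t_cpa))   # miss distance at CPA
    return d_cpa < sep_m, t_cpa, d_cpa

# Example: head-on intruder 1 km away, closing at 40 m/s -> conflict flagged,
# which would trigger a time-optimal evasive manoeuvre.
print(conflict([0, 0, 100], [20, 0, 0], [1000, 0, 100], [-20, 0, 0]))
```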

Survey of Consciousness Theory from Computational Perspective

  • paper_url: http://arxiv.org/abs/2309.10063
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Zihan Ding, Xiaoxi Wei, Yidan Xu
  • for: 本研究目的是 bridge 不同学科的意识理论,以计算机科学的视角来解释人类意识现象。
  • methods: 本文使用了多种方法,包括信息理论、量子物理学、认知心理学、生物物理学和计算机科学等,以寻求解释意识现象的方法。
  • results: 本研究通过对多种意识理论的汇总和分析,提出了一种计算机科学的意识模型,并讨论了现有的意识评价指标和计算机模型是否具有意识性。
    Abstract Human consciousness has been a long-lasting mystery for centuries, while machine intelligence and consciousness is an arduous pursuit. Researchers have developed diverse theories for interpreting the consciousness phenomenon in human brains from different perspectives and levels. This paper surveys several main branches of consciousness theories originating from different subjects including information theory, quantum physics, cognitive psychology, physiology and computer science, with the aim of bridging these theories from a computational perspective. It also discusses the existing evaluation metrics of consciousness and possibility for current computational models to be conscious. Breaking the mystery of consciousness can be an essential step in building general artificial intelligence with computing machines.
    摘要 人类意识数百年来一直是未解之谜,而机器智能与机器意识的研究则是一项艰难的探索。研究人员从不同的视角和层面提出了多种解释人脑意识现象的理论。本文从计算的视角综述了源自信息论、量子物理、认知心理学、生理学和计算机科学等学科的几个主要意识理论分支,试图在计算框架下将它们联系起来,并讨论了现有的意识评价指标以及当前计算模型具备意识的可能性。破解意识之谜可能是用计算机器构建通用人工智能的关键一步。

General In-Hand Object Rotation with Vision and Touch

  • paper_url: http://arxiv.org/abs/2309.09979
  • repo_url: None
  • paper_authors: Haozhi Qi, Brent Yi, Sudharshan Suresh, Mike Lambeta, Yi Ma, Roberto Calandra, Jitendra Malik
  • for: 这个论文是为了解决触摸式物体旋转的问题,通过多种感知输入来实现。
  • methods: 该系统在仿真中训练(可获取真实的物体形状与物理属性),随后被蒸馏到在逼真但带噪声的模拟视触觉(visuotactile)与本体感觉(proprioceptive)输入上运行,并通过视触觉Transformer融合多模态感知输入。
  • results: 与先前方法相比,该系统表现出显著的性能提升,并验证了视觉与触觉感知的重要性。
    Abstract We introduce RotateIt, a system that enables fingertip-based object rotation along multiple axes by leveraging multimodal sensory inputs. Our system is trained in simulation, where it has access to ground-truth object shapes and physical properties. Then we distill it to operate on realistic yet noisy simulated visuotactile and proprioceptive sensory inputs. These multimodal inputs are fused via a visuotactile transformer, enabling online inference of object shapes and physical properties during deployment. We show significant performance improvements over prior methods and the importance of visual and tactile sensing.
    摘要 我们介绍RotateIt系统,它利用多模态感知输入,实现基于指尖的物体绕多个轴旋转。该系统在仿真中训练,可获取真实的物体形状和物理属性;随后我们将其蒸馏,使其能够在逼真但带噪声的模拟视触觉与本体感觉输入上运行。这些多模态输入通过视触觉Transformer进行融合,从而在部署时在线推断物体形状和物理属性。实验显示,该系统相比先前方法有显著的性能提升,并证明了视觉与触觉感知的重要性。

MindAgent: Emergent Gaming Interaction

  • paper_url: http://arxiv.org/abs/2309.09971
  • repo_url: None
  • paper_authors: Ran Gong, Qiuyuan Huang, Xiaojian Ma, Hoi Vo, Zane Durante, Yusuke Noda, Zilong Zheng, Song-Chun Zhu, Demetri Terzopoulos, Li Fei-Fei, Jianfeng Gao
  • for: 本文旨在研究大语言模型在多智能体系统中的复杂规划与协调能力,以及其在游戏中与人类玩家及NPC协作的方式。
  • methods: 本文基于现有的游戏框架构建基础设施,要求模型理解多智能体系统中的协调者角色,通过未经微调的恰当指令与人类玩家协作,并在带反馈的少样本提示上进行上下文内学习(in-context learning)。
  • results: 本文提出了新的游戏场景CUISINEWORLD及相应的基准,并用新的自动指标CoS评估多智能体协调效率;结果表明大语言模型能够在游戏中协调多个智能体,且此类能力可以通过从大规模语料中学习获得。
    Abstract Large Language Models (LLMs) have the capacity of performing complex scheduling in a multi-agent system and can coordinate these agents into completing sophisticated tasks that require extensive collaboration. However, despite the introduction of numerous gaming frameworks, the community has insufficient benchmarks towards building general multi-agents collaboration infrastructure that encompass both LLM and human-NPCs collaborations. In this work, we propose a novel infrastructure - MindAgent - to evaluate planning and coordination emergent capabilities for gaming interaction. In particular, our infrastructure leverages existing gaming framework, to i) require understanding of the coordinator for a multi-agent system, ii) collaborate with human players via un-finetuned proper instructions, and iii) establish an in-context learning on few-shot prompt with feedback. Furthermore, we introduce CUISINEWORLD, a new gaming scenario and related benchmark that dispatch a multi-agent collaboration efficiency and supervise multiple agents playing the game simultaneously. We conduct comprehensive evaluations with new auto-metric CoS for calculating the collaboration efficiency. Finally, our infrastructure can be deployed into real-world gaming scenarios in a customized VR version of CUISINEWORLD and adapted in existing broader Minecraft gaming domain. We hope our findings on LLMs and the new infrastructure for general-purpose scheduling and coordination can help shed light on how such skills can be obtained by learning from large language corpora.

How to Generate Popular Post Headlines on Social Media?

  • paper_url: http://arxiv.org/abs/2309.09949
  • repo_url: None
  • paper_authors: Zhouxiang Fang, Min Yu, Zhendong Fu, Boning Zhang, Xuanwen Huang, Xiaoqi Tang, Yang Yang
  • for: automatic generation of popular headlines on social media
  • methods: Multiple preference-Extractors with Bidirectional and Auto-Regressive Transformers (BART)
  • results: state-of-the-art performance compared with several advanced baselines, and ability to capture trends and personal styles.
    Abstract Posts, as important containers of user-generated-content pieces on social media, are of tremendous social influence and commercial value. As an integral components of a post, the headline has a decisive contribution to the post's popularity. However, current mainstream method for headline generation is still manually writing, which is unstable and requires extensive human effort. This drives us to explore a novel research question: Can we automate the generation of popular headlines on social media? We collect more than 1 million posts of 42,447 celebrities from public data of Xiaohongshu, which is a well-known social media platform in China. We then conduct careful observations on the headlines of these posts. Observation results demonstrate that trends and personal styles are widespread in headlines on social medias and have significant contribution to posts's popularity. Motivated by these insights, we present MEBART, which combines Multiple preference-Extractors with Bidirectional and Auto-Regressive Transformers (BART), capturing trends and personal styles to generate popular headlines on social medias. We perform extensive experiments on real-world datasets and achieve state-of-the-art performance compared with several advanced baselines. In addition, ablation and case studies demonstrate that MEBART advances in capturing trends and personal styles.
    摘要 帖子作为社交媒体上承载用户生成内容的重要载体,具有巨大的社会影响力和商业价值。标题作为帖子的重要组成部分,对帖子的热度有决定性作用。然而,目前主流的标题生成方式仍是人工撰写,既不稳定又需要大量人力。这促使我们探索一个新的研究问题:能否自动生成社交媒体上的热门标题?我们从中国知名社交媒体平台小红书的公开数据中收集了42,447位名人的100多万条帖子,并对这些帖子的标题进行了仔细观察。观察结果表明,社交媒体标题中普遍存在趋势与个人风格,它们对帖子的热度有重要贡献。受此启发,我们提出MEBART,将多个偏好抽取器与双向自回归Transformer(BART)相结合,以捕捉趋势和个人风格来生成热门标题。我们在真实数据集上开展了大量实验,与多个先进基线相比取得了当前最佳性能。此外,消融实验与案例研究表明MEBART在捕捉趋势和个人风格方面更具优势。

What is a Fair Diffusion Model? Designing Generative Text-To-Image Models to Incorporate Various Worldviews

  • paper_url: http://arxiv.org/abs/2309.09944
  • repo_url: https://github.com/zoedesimone/diffusionworldviewer
  • paper_authors: Zoe De Simone, Angie Boggust, Arvind Satyanarayan, Ashia Wilson
  • for: The paper aims to enhance bias mitigation in generative text-to-image (GTI) models by introducing a tool called DiffusionWorldViewer to analyze and manipulate the models’ attitudes, values, stories, and expectations of the world.
  • methods: The tool uses an interactive interface deployed as a web-based GUI and Jupyter Notebook plugin to categorize existing demographics of GTI-generated images and provide interactive methods to align image demographics with user worldviews.
  • results: In a study with 13 GTI users, the tool was found to allow users to represent their varied viewpoints about what GTI outputs are fair, challenging current notions of fairness that assume a universal worldview.
  • for: 论文旨在提高生成文本到图像(GTI)模型中的偏见减轻,通过引入DiffusionWorldViewer工具分析和 manipulate GTI模型对世界的态度、价值观、故事和期望的影响。
  • methods: DiffusionWorldViewer使用了一个交互式的网页UI和Jupyter Notebook插件,将GTI生成图像中的存在人类划分为不同类别,并提供了互动方法来与用户的世界观点对齐。
  • results: 在13名GTI用户的研究中,DiffusionWorldViewer被发现可以让用户表达他们对GTI输出是否公正的多种视点,挑战当前假设一个统一的世界观点的假设。
    Abstract Generative text-to-image (GTI) models produce high-quality images from short textual descriptions and are widely used in academic and creative domains. However, GTI models frequently amplify biases from their training data, often producing prejudiced or stereotypical images. Yet, current bias mitigation strategies are limited and primarily focus on enforcing gender parity across occupations. To enhance GTI bias mitigation, we introduce DiffusionWorldViewer, a tool to analyze and manipulate GTI models' attitudes, values, stories, and expectations of the world that impact its generated images. Through an interactive interface deployed as a web-based GUI and Jupyter Notebook plugin, DiffusionWorldViewer categorizes existing demographics of GTI-generated images and provides interactive methods to align image demographics with user worldviews. In a study with 13 GTI users, we find that DiffusionWorldViewer allows users to represent their varied viewpoints about what GTI outputs are fair and, in doing so, challenges current notions of fairness that assume a universal worldview.
    摘要 生成式文本到图像(GTI)模型能够根据简短的文本描述生成高质量图像,但它们常常放大训练数据中的偏见,生成带有偏见或刻板印象的图像。然而,现有的偏见缓解策略较为有限,主要集中在强制不同职业间的性别均衡上。为了增强GTI模型的偏见缓解,我们提出了DiffusionWorldViewer,一种用于分析并调整GTI模型所体现的态度、价值观、叙事和对世界的预期的工具,这些因素都会影响其生成的图像。通过以Web界面和Jupyter Notebook插件形式部署的交互式界面,DiffusionWorldViewer对GTI生成图像中现有的人群构成进行归类,并提供交互手段使图像的人群构成与用户的世界观保持一致。在对13名GTI用户的研究中,我们发现DiffusionWorldViewer能让用户表达他们对GTI输出何为公平的不同观点,并由此对假定存在普适世界观的现有公平性观念提出挑战。

A Heterogeneous Graph-Based Multi-Task Learning for Fault Event Diagnosis in Smart Grid

  • paper_url: http://arxiv.org/abs/2309.09921
  • repo_url: None
  • paper_authors: Dibaloke Chanda, Nasim Yahya Soltani
  • for: 本文提出一种基于异构多任务学习图神经网络(MTL-GNN)的故障诊断方法,能够检测、定位并分类故障,同时给出故障电阻和电流的估计。
  • methods: 利用图神经网络(GNN)学习配电系统的拓扑表示,并通过消息传递机制进行特征学习。
  • results: 数值实验验证了模型在各项任务上的表现;本文还提出一种基于GNN的可解释方法,用于识别配电系统中的关键节点,从而支持有依据的稀疏测量。
    Abstract Precise and timely fault diagnosis is a prerequisite for a distribution system to ensure minimum downtime and maintain reliable operation. This necessitates access to a comprehensive procedure that can provide the grid operators with insightful information in the case of a fault event. In this paper, we propose a heterogeneous multi-task learning graph neural network (MTL-GNN) capable of detecting, locating and classifying faults in addition to providing an estimate of the fault resistance and current. Using a graph neural network (GNN) allows for learning the topological representation of the distribution system as well as feature learning through a message-passing scheme. We investigate the robustness of our proposed model using the IEEE-123 test feeder system. This work also proposes a novel GNN-based explainability method to identify key nodes in the distribution system which then facilitates informed sparse measurements. Numerical tests validate the performance of the model across all tasks.
    摘要 精准而及时的故障诊断是配电系统确保最短停电时间、维持可靠运行的前提,这就需要一套完整的流程,在故障发生时为电网运行人员提供有价值的信息。本文提出一种异构多任务学习图神经网络(MTL-GNN),能够检测、定位并分类故障,同时给出故障电阻和电流的估计。采用图神经网络(GNN)既可以学习配电系统的拓扑表示,又能通过消息传递机制进行特征学习。我们在IEEE-123测试馈线系统上考察了所提模型的鲁棒性。本文还提出一种新颖的基于GNN的可解释方法,用于识别配电系统中的关键节点,从而支持有依据的稀疏测量。数值实验验证了模型在所有任务上的性能。
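
A rough sketch of the multi-task idea follows: a shared message-passing encoder feeds separate heads for fault detection, localization, classification, and resistance/current regression. The layer sizes and mean-aggregation scheme are assumptions for illustration, not the paper's architecture.

```python
# A minimal multi-task GNN sketch (assumptions, not the paper's implementation).
import torch, torch.nn as nn

class MTLGNN(nn.Module):
    def __init__(self, in_dim=4, hid=64, n_fault_types=5):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hid)
        self.lin2 = nn.Linear(hid, hid)
        self.detect = nn.Linear(hid, 1)        # graph-level: fault / no fault
        self.locate = nn.Linear(hid, 1)        # node-level: faulty-node score
        self.classify = nn.Linear(hid, n_fault_types)
        self.regress = nn.Linear(hid, 2)       # fault resistance and current

    def propagate(self, x, adj):
        # One round of mean-aggregation message passing over adjacency `adj`.
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        return torch.relu(self.lin2((adj @ x) / deg))

    def forward(self, x, adj):
        h = torch.relu(self.lin1(x))
        h = self.propagate(h, adj)             # node embeddings
        g = h.mean(dim=0)                      # graph embedding (readout)
        return (self.detect(g), self.locate(h).squeeze(-1),
                self.classify(g), self.regress(g))

net = MTLGNN()
x, adj = torch.randn(123, 4), torch.eye(123)   # e.g., IEEE-123 feeder nodes
det, loc, cls, reg = net(x, adj)               # one forward pass, four task outputs
```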

Plug in the Safety Chip: Enforcing Constraints for LLM-driven Robot Agents

  • paper_url: http://arxiv.org/abs/2309.09919
  • repo_url: None
  • paper_authors: Ziyi Yang, Shreyas S. Raman, Ankit Shah, Stefanie Tellex
  • for: 本文旨在提出一种基于线性时序逻辑(LTL)的可查询安全约束模块,以便在协作环境中部署大语言模型(LLM)代理并确保其安全运行。
  • methods: 该模块支持将自然语言转为时序约束编码、对安全违规进行推理与解释,并剪除不安全的动作,以保障LLM代理的安全性。
  • results: 实验结果表明,本系统可以坚持安全约束,并在复杂的安全约束下进行扩展,这 highlights 了它在实际应用中的潜在实用性。
    Abstract Recent advancements in large language models (LLMs) have enabled a new research domain, LLM agents, for solving robotics and planning tasks by leveraging the world knowledge and general reasoning abilities of LLMs obtained during pretraining. However, while considerable effort has been made to teach the robot the "dos," the "don'ts" received relatively less attention. We argue that, for any practical usage, it is as crucial to teach the robot the "don'ts": conveying explicit instructions about prohibited actions, assessing the robot's comprehension of these restrictions, and, most importantly, ensuring compliance. Moreover, verifiable safe operation is essential for deployments that satisfy worldwide standards such as ISO 61508, which defines standards for safely deploying robots in industrial factory environments worldwide. Aiming at deploying the LLM agents in a collaborative environment, we propose a queryable safety constraint module based on linear temporal logic (LTL) that simultaneously enables natural language (NL) to temporal constraints encoding, safety violation reasoning and explaining, and unsafe action pruning. To demonstrate the effectiveness of our system, we conducted experiments in VirtualHome environment and on a real robot. The experimental results show that our system strictly adheres to the safety constraints and scales well with complex safety constraints, highlighting its potential for practical utility.
    摘要 大语言模型(LLM)的最新进展催生了一个新的研究方向:LLM代理,即利用预训练中获得的世界知识和通用推理能力来解决机器人与规划任务。然而,尽管人们已投入大量精力教机器人"该做什么","不该做什么"却相对受到忽视。我们认为,在任何实际应用中,教机器人"不该做什么"同样至关重要:明确传达被禁止的行为、评估机器人对这些限制的理解,以及最重要的一点,确保其遵守。此外,可验证的安全运行对于满足ISO 61508等全球标准(该标准定义了在工业工厂环境中安全部署机器人的规范)的部署来说必不可少。为了在协作环境中部署LLM代理,我们提出一种基于线性时序逻辑(LTL)的可查询安全约束模块,它能同时实现自然语言到时序约束的编码、安全违规的推理与解释,以及不安全动作的剪除。为验证系统的有效性,我们在VirtualHome环境和真实机器人上进行了实验。结果表明,我们的系统严格遵守安全约束,并能随复杂安全约束良好扩展,显示了其在实际应用中的潜力。
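
A toy sketch of the constraint-checking and action-pruning step is given below; the "globally avoid predicate" rule format and the `Constraint`/`prune_unsafe` helpers are simplifying assumptions, not the paper's LTL machinery.

```python
# A toy "safety chip" sketch: constraints are "globally avoid <predicate>"
# rules (a small fragment of LTL); violating candidate actions are pruned
# and a natural-language explanation is produced.
from dataclasses import dataclass

@dataclass
class Constraint:
    name: str                 # e.g. "never enter the server room"
    forbidden: frozenset      # grounded predicates that must never hold

def violated(constraint, predicted_effects):
    return bool(constraint.forbidden & predicted_effects)

def prune_unsafe(candidate_actions, constraints):
    """candidate_actions: list of (action, predicted_effects) pairs."""
    safe, explanations = [], []
    for action, effects in candidate_actions:
        hits = [c for c in constraints if violated(c, effects)]
        if hits:
            explanations.append(f"blocked '{action}': violates " +
                                ", ".join(c.name for c in hits))
        else:
            safe.append(action)
    return safe, explanations

rules = [Constraint("never enter the server room", frozenset({"at(server_room)"}))]
plan = [("goto(kitchen)", {"at(kitchen)"}),
        ("goto(server_room)", {"at(server_room)"})]
print(prune_unsafe(plan, rules))
```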

Evaluation of Human-Understandability of Global Model Explanations using Decision Tree

  • paper_url: http://arxiv.org/abs/2309.09917
  • repo_url: None
  • paper_authors: Adarsa Sivaprasad, Ehud Reiter, Nava Tintarev, Nir Oren
  • for: The paper aims to improve the understandability and trustworthiness of AI models in healthcare applications, specifically for non-expert patients who may not have a strong background in AI or domain expertise.
  • methods: The authors use a decision tree model to generate both local and global explanations for patients identified as having a high risk of coronary heart disease. They test the effectiveness of these explanations with non-expert users and gather feedback to enhance the narrative global explanations.
  • results: The majority of participants prefer global explanations, while a smaller group prefers local explanations. The authors also find that task-based evaluations of mental models of these participants provide valuable feedback to enhance narrative global explanations, which can guide the design of health informatics systems that are both trustworthy and actionable.
  • for: 这篇论文的目的是提高医疗应用中AI模型的可解释性和可信worthiness,特别是为非专业的患者,他们可能没有AI或领域专业知识。
  • methods: 作者使用决策树模型生成本地和全局解释,并对非专业用户进行测试,收集反馈以提高 narative全局解释。
  • results: 大多数参与者偏好全局解释,少数参与者偏好局部解释;作者还发现,对参与者心理模型进行基于任务的评估,可为改进叙事式全局解释提供有价值的反馈。
    Abstract In explainable artificial intelligence (XAI) research, the predominant focus has been on interpreting models for experts and practitioners. Model agnostic and local explanation approaches are deemed interpretable and sufficient in many applications. However, in domains like healthcare, where end users are patients without AI or domain expertise, there is an urgent need for model explanations that are more comprehensible and instil trust in the model's operations. We hypothesise that generating model explanations that are narrative, patient-specific and global(holistic of the model) would enable better understandability and enable decision-making. We test this using a decision tree model to generate both local and global explanations for patients identified as having a high risk of coronary heart disease. These explanations are presented to non-expert users. We find a strong individual preference for a specific type of explanation. The majority of participants prefer global explanations, while a smaller group prefers local explanations. A task based evaluation of mental models of these participants provide valuable feedback to enhance narrative global explanations. This, in turn, guides the design of health informatics systems that are both trustworthy and actionable.
    摘要 在可解释人工智能(XAI)研究中,现有工作主要关注面向专家和从业者的模型解释,在许多应用中,与模型无关的局部解释方法被认为已足够可解释。然而在医疗等领域,终端用户是不具备人工智能或领域知识的患者,迫切需要更易理解、能建立对模型运行信任的解释。我们假设,生成叙事式、针对患者个体且全局(涵盖整个模型)的解释,有助于提升可理解性并支持决策。我们使用决策树模型,为被识别为冠心病高风险的患者生成局部与全局解释,并将其呈现给非专家用户。结果显示个体偏好差异明显:大多数参与者偏好全局解释,少数偏好局部解释。对这些参与者心理模型的基于任务的评估提供了宝贵反馈,用于改进叙事式全局解释,进而指导设计既可信又可付诸行动的健康信息系统。
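
The contrast between a global explanation and a local, patient-specific one can be sketched with a scikit-learn decision tree; the feature names and synthetic data below are hypothetical stand-ins for the coronary heart disease setting.

```python
# Global vs. local decision-tree explanations (illustrative data and features).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
features = ["age", "cholesterol", "systolic_bp", "smoker"]   # hypothetical names
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Global explanation: the whole decision policy, readable as nested rules.
print(export_text(clf, feature_names=features))

# Local explanation: the single root-to-leaf path used for one patient.
patient = X[:1]
node_path = clf.decision_path(patient).indices
thresholds, feat_idx = clf.tree_.threshold, clf.tree_.feature
for node in node_path:
    if feat_idx[node] >= 0:          # negative values mark leaf nodes
        name = features[feat_idx[node]]
        side = "<=" if patient[0, feat_idx[node]] <= thresholds[node] else ">"
        print(f"{name} {side} {thresholds[node]:.2f}")
print("predicted risk class:", clf.predict(patient)[0])
```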

Wait, That Feels Familiar: Learning to Extrapolate Human Preferences for Preference Aligned Path Planning

  • paper_url: http://arxiv.org/abs/2309.09912
  • repo_url: None
  • paper_authors: Haresh Karnan, Elvin Yang, Garrett Warnell, Joydeep Biswas, Peter Stone
  • for: 这paper的目的是解决自主移动任务中需要理解操作员指定的权重,以确保机器人的安全和任务完成。
  • methods: 这paper使用了 preference extrApolation for Terrain awarE Robot Navigation(PATERN)框架,该框架可以从机器人的感知数据中推断操作员对新地形的偏好。
  • results: 与基线方法相比,实验结果表明PATERN能够稳健地泛化到多种不同的地形和光照条件,并以符合操作员偏好的方式进行导航。
    Abstract Autonomous mobility tasks such as lastmile delivery require reasoning about operator indicated preferences over terrains on which the robot should navigate to ensure both robot safety and mission success. However, coping with out of distribution data from novel terrains or appearance changes due to lighting variations remains a fundamental problem in visual terrain adaptive navigation. Existing solutions either require labor intensive manual data recollection and labeling or use handcoded reward functions that may not align with operator preferences. In this work, we posit that operator preferences for visually novel terrains, which the robot should adhere to, can often be extrapolated from established terrain references within the inertial, proprioceptive, and tactile domain. Leveraging this insight, we introduce Preference extrApolation for Terrain awarE Robot Navigation, PATERN, a novel framework for extrapolating operator terrain preferences for visual navigation. PATERN learns to map inertial, proprioceptive, tactile measurements from the robots observations to a representation space and performs nearest neighbor search in this space to estimate operator preferences over novel terrains. Through physical robot experiments in outdoor environments, we assess PATERNs capability to extrapolate preferences and generalize to novel terrains and challenging lighting conditions. Compared to baseline approaches, our findings indicate that PATERN robustly generalizes to diverse terrains and varied lighting conditions, while navigating in a preference aligned manner.
    摘要 诸如"最后一公里"配送之类的自主移动任务,需要依据操作员指定的地形偏好进行导航推理,以同时保障机器人安全和任务完成。然而,应对来自新地形的分布外数据、或由光照变化引起的外观改变,仍是视觉地形自适应导航中的根本难题。现有方案要么需要耗费大量人力重新采集并标注数据,要么依赖手工编写的奖励函数,而后者未必与操作员的偏好一致。在本工作中,我们认为操作员对视觉上新颖地形的偏好,往往可以借助惯性、本体感觉和触觉域中已建立的地形参照加以外推。基于这一洞见,我们提出PATERN(Preference extrApolation for Terrain awarE Robot Navigation),一种用于外推操作员地形偏好以支持视觉导航的新框架。PATERN学习将机器人观测到的惯性、本体感觉和触觉测量映射到一个表示空间,并在该空间中进行最近邻搜索,以估计操作员对新地形的偏好。通过在户外环境中的真实机器人实验,我们评估了PATERN外推偏好以及泛化到新地形和困难光照条件的能力。与基线方法相比,结果表明PATERN能够稳健地泛化到多样的地形和不同的光照条件,并以符合偏好的方式进行导航。
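
A minimal sketch of the extrapolation step is shown below: non-visual measurements are embedded and the preferences of the nearest known terrains are interpolated. The identity embedding, inverse-distance weighting, and synthetic reference data are assumptions for illustration.

```python
# Nearest-neighbour preference extrapolation in the spirit of PATERN.
import numpy as np

rng = np.random.default_rng(0)
# Reference terrains: feature vectors (e.g. IMU statistics, tactile energy)
# paired with operator-assigned preference costs (lower = preferred).
ref_feats = rng.normal(size=(50, 6))
ref_costs = rng.uniform(0.0, 1.0, size=50)

def embed(measurements):
    # Placeholder for the learned representation; here an identity mapping.
    return np.asarray(measurements, float)

def extrapolated_cost(measurements, k=5):
    z = embed(measurements)
    d = np.linalg.norm(ref_feats - z, axis=1)
    nn = np.argsort(d)[:k]
    w = 1.0 / (d[nn] + 1e-6)                 # inverse-distance weighting
    return float(np.sum(w * ref_costs[nn]) / np.sum(w))

# Preference cost for a visually novel terrain, used by the path planner.
print(extrapolated_cost(rng.normal(size=6)))
```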

Evaluation of GPT-3 for Anti-Cancer Drug Sensitivity Prediction

  • paper_url: http://arxiv.org/abs/2309.10016
  • repo_url: None
  • paper_authors: Shaika Chowdhury, Sivaraman Rajaganapathy, Lichao Sun, James Cerhan, Nansu Zong
  • for: 这项研究探讨了使用GPT-3完成抗癌药物敏感性预测任务的可能性,基于五种组织类型的结构化药物基因组学数据,并评估了零样本提示与微调两种范式的性能。
  • methods: 该研究使用GPT-3模型,对结构化药物基因组学数据进行分析和预测。
  • results: 研究发现,药物的SMILES表示和细胞系的基因突变特征可预测药物响应。这些结果有望为精准肿瘤学设计更有效的治疗方案提供帮助。
    Abstract In this study, we investigated the potential of GPT-3 for the anti-cancer drug sensitivity prediction task using structured pharmacogenomics data across five tissue types and evaluated its performance with zero-shot prompting and fine-tuning paradigms. The drug's smile representation and cell line's genomic mutation features were predictive of the drug response. The results from this study have the potential to pave the way for designing more efficient treatment protocols in precision oncology.
    摘要 在这项研究中,我们利用五种组织类型的结构化药物基因组学数据,考察了GPT-3在抗癌药物敏感性预测任务上的潜力,并评估了零样本提示与微调两种范式下的性能。药物的SMILES表示和细胞系的基因突变特征可预测药物响应。这些结果有望为精准肿瘤学中设计更高效的治疗方案铺平道路。
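
One plausible way to assemble a zero-shot prompt from the structured features the study mentions (drug SMILES plus cell-line mutations) is sketched below; the prompt wording, the drug example, and the omitted API call are assumptions, not the paper's protocol.

```python
# A hedged sketch of zero-shot prompt construction for drug sensitivity prediction.
def build_prompt(drug_name, smiles, tissue, mutated_genes):
    mutations = ", ".join(mutated_genes) if mutated_genes else "none reported"
    return (
        "You are predicting anti-cancer drug sensitivity.\n"
        f"Drug: {drug_name}\nSMILES: {smiles}\n"
        f"Cell line tissue: {tissue}\nMutated genes: {mutations}\n"
        "Question: is this cell line sensitive or resistant to the drug? "
        "Answer with 'sensitive' or 'resistant'."
    )

prompt = build_prompt(
    drug_name="Erlotinib",
    smiles="COCCOc1cc2ncnc(Nc3cccc(C#C)c3)c2cc1OCCOC",   # illustrative SMILES
    tissue="lung",
    mutated_genes=["EGFR", "TP53"],
)
print(prompt)   # sent to a GPT-3-style completion endpoint (call omitted here)
```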

The role of causality in explainable artificial intelligence

  • paper_url: http://arxiv.org/abs/2309.09901
  • repo_url: None
  • paper_authors: Gianluca Carloni, Andrea Berti, Sara Colantonio
  • for: 本文旨在探讨 causality 和 explainable artificial intelligence (XAI) 两个领域之间的关系,尤其是它们之间的联系是如何、以及如何利用这两个领域来建立信任于人工智能系统。
  • methods: 本文通过梳理文献来考察因果性与XAI之间的关系,归纳出三个主要视角:第一,缺乏因果性被视为当前AI与XAI方法的重要局限,并探讨了何为"最优"的解释形式;第二是实用主义视角,将XAI视为推动因果探究的工具,通过识别值得开展的实验操作来促进科学探索;第三个视角认为因果性是XAI的先导,具体体现为三种方式:借用因果概念来支持或改进XAI、利用反事实(counterfactuals)提供解释,以及将获取因果模型本身视为一种解释。
  • results: 本文给出了连接因果性与XAI两个领域的统一视角,强调了二者之间潜在的桥梁与局限,并整理了可用于自动化因果任务的相关软件工具。
    Abstract Causality and eXplainable Artificial Intelligence (XAI) have developed as separate fields in computer science, even though the underlying concepts of causation and explanation share common ancient roots. This is further enforced by the lack of review works jointly covering these two fields. In this paper, we investigate the literature to try to understand how and to what extent causality and XAI are intertwined. More precisely, we seek to uncover what kinds of relationships exist between the two concepts and how one can benefit from them, for instance, in building trust in AI systems. As a result, three main perspectives are identified. In the first one, the lack of causality is seen as one of the major limitations of current AI and XAI approaches, and the "optimal" form of explanations is investigated. The second is a pragmatic perspective and considers XAI as a tool to foster scientific exploration for causal inquiry, via the identification of pursue-worthy experimental manipulations. Finally, the third perspective supports the idea that causality is propaedeutic to XAI in three possible manners: exploiting concepts borrowed from causality to support or improve XAI, utilizing counterfactuals for explainability, and considering accessing a causal model as explaining itself. To complement our analysis, we also provide relevant software solutions used to automate causal tasks. We believe our work provides a unified view of the two fields of causality and XAI by highlighting potential domain bridges and uncovering possible limitations.
    摘要 因果性(causality)与可解释人工智能(XAI)在计算机科学中一直作为两个独立领域发展,尽管因果与解释这两个概念有着共同的古老渊源;缺乏同时涵盖两者的综述工作进一步加剧了这种割裂。本文通过考察文献,试图理解因果性与XAI是如何以及在多大程度上相互交织,尤其是二者之间存在哪些关系,以及如何借助这些关系(例如)建立对AI系统的信任。我们由此归纳出三个主要视角:第一个视角认为缺乏因果性是当前AI与XAI方法的主要局限之一,并探讨了"最优"解释应有的形式;第二个是实用主义视角,将XAI视为推动因果探究的科学探索工具,用以识别值得开展的实验操作;第三个视角认为因果性是XAI的先导,体现在三个方面:借用因果概念支持或改进XAI、利用反事实进行解释,以及将获取因果模型本身视为解释。为补充分析,我们还整理了用于自动化因果任务的相关软件方案。我们相信这项工作通过揭示潜在的领域桥梁与可能的局限,为因果性与XAI两个领域提供了统一的视角。

Towards Ontology Construction with Language Models

  • paper_url: http://arxiv.org/abs/2309.09898
  • repo_url: None
  • paper_authors: Maurice Funk, Simon Hosemann, Jean Christoph Jung, Carsten Lutz
  • for: 针对给定领域自动构建概念层次结构
  • methods: 通过查询大语言模型
  • results: LLM 能够为概念层次结构的构建提供可观的帮助
    Abstract We present a method for automatically constructing a concept hierarchy for a given domain by querying a large language model. We apply this method to various domains using OpenAI's GPT 3.5. Our experiments indicate that LLMs can be of considerable help for constructing concept hierarchies.
    摘要 我们提出了一种方法,通过查询大语言模型(LLM)来为给定领域自动构建概念层次结构。我们使用OpenAI的GPT 3.5将该方法应用于多个领域。实验表明,LLM对构建概念层次结构有相当大的帮助。
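
A minimal sketch of such a querying loop is shown below; the prompt wording, the hypothetical `ask_llm` helper, and the fixed depth limit are assumptions rather than the paper's exact procedure.

```python
# Recursive concept-hierarchy construction by querying an LLM (a sketch).
def ask_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to e.g. GPT-3.5."""
    raise NotImplementedError

def build_hierarchy(concept: str, depth: int = 2) -> dict:
    if depth == 0:
        return {}
    prompt = (f"List the most important direct subconcepts of '{concept}' "
              "in the given domain, one per line, names only.")
    subconcepts = [line.strip() for line in ask_llm(prompt).splitlines() if line.strip()]
    return {sub: build_hierarchy(sub, depth - 1) for sub in subconcepts}

# Usage (with a real ask_llm implementation):
# hierarchy = build_hierarchy("musical instrument", depth=2)
# -> {"string instrument": {"violin": {}, ...}, "wind instrument": {...}, ...}
```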

Context is Environment

  • paper_url: http://arxiv.org/abs/2309.09888
  • repo_url: https://github.com/apostrophecms/apostrophe
  • paper_authors: Sharut Gupta, Stefanie Jegelka, David Lopez-Paz, Kartik Ahuja
  • for: 提高领域泛化性能
  • methods: 利用上下文内学习来弱化虚假相关(spurious correlations),提升领域泛化性能
  • results: 理论与实验表明,ICRM算法通过关注上下文(到达时尚无标注的样本),能够逼近测试环境的风险最小化器,显著提升分布外(OOD)性能。
    Abstract Two lines of work are taking the central stage in AI research. On the one hand, the community is making increasing efforts to build models that discard spurious correlations and generalize better in novel test environments. Unfortunately, the bitter lesson so far is that no proposal convincingly outperforms a simple empirical risk minimization baseline. On the other hand, large language models (LLMs) have erupted as algorithms able to learn in-context, generalizing on-the-fly to eclectic contextual circumstances that users enforce by means of prompting. In this paper, we argue that context is environment, and posit that in-context learning holds the key to better domain generalization. Via extensive theory and experiments, we show that paying attention to context (unlabeled examples as they arrive) allows our proposed In-Context Risk Minimization (ICRM) algorithm to zoom-in on the test environment risk minimizer, leading to significant out-of-distribution performance improvements. From all of this, two messages are worth taking home. Researchers in domain generalization should consider environment as context, and harness the adaptive power of in-context learning. Researchers in LLMs should consider context as environment, to better structure data towards generalization.
    摘要

EGFE: End-to-end Grouping of Fragmented Elements in UI Designs with Multimodal Learning

  • paper_url: http://arxiv.org/abs/2309.09867
  • repo_url: https://github.com/test2975/egfe
  • paper_authors: Liuqing Chen, Yunnong Chen, Shuhong Xiao, Yaxuan Song, Lingyun Sun, Yankun Zhen, Tingting Zhou, Yanfang Chang
  • for: 快速化 Front-end 应用程序和 GUI 迭代开发
  • methods: 使用 Transformer Encoder 模型 UI 元素之间的关系,并使用多Modal 表示学习
  • results: 与基线相比,EGFE方法在精确率(+29.75%)、召回率(+31.07%)和F1分数(+30.39%)上具有显著优势,并在真实的软件工程应用中验证了其效果。
    Abstract When translating UI design prototypes to code in industry, automatically generating code from design prototypes can expedite the development of applications and GUI iterations. However, in design prototypes without strict design specifications, UI components may be composed of fragmented elements. Grouping these fragmented elements can greatly improve the readability and maintainability of the generated code. Current methods employ a two-stage strategy that introduces hand-crafted rules to group fragmented elements. Unfortunately, the performance of these methods is not satisfying due to visually overlapped and tiny UI elements. In this study, we propose EGFE, a novel method for automatically End-to-end Grouping Fragmented Elements via UI sequence prediction. To facilitate the UI understanding, we innovatively construct a Transformer encoder to model the relationship between the UI elements with multi-modal representation learning. The evaluation on a dataset of 4606 UI prototypes collected from professional UI designers shows that our method outperforms the state-of-the-art baselines in the precision (by 29.75\%), recall (by 31.07\%), and F1-score (by 30.39\%) at edit distance threshold of 4. In addition, we conduct an empirical study to assess the improvement of the generated front-end code. The results demonstrate the effectiveness of our method on a real software engineering application. Our end-to-end fragmented elements grouping method creates opportunities for improving UI-related software engineering tasks.
    摘要 在产业界将UI设计原型转换为代码时,自动从设计原型生成代码可以大幅加快应用程序和GUI迭代的开发。然而,在缺乏严格设计规范的设计原型中,UI组件可能由碎片化元素构成;将这些碎片化元素进行分组,可以大大提高生成代码的可读性和可维护性。现有方法采用两阶段策略,借助手工规则来分组碎片化元素,但由于视觉上相互重叠和尺寸极小的UI元素,其性能并不理想。本研究提出EGFE,一种通过UI序列预测来端到端分组碎片化元素的新方法。为促进对UI的理解,我们创新地构建了Transformer编码器,结合多模态表示学习来建模UI元素之间的关系。在从专业UI设计师处收集的4606个UI原型数据集上的评估表明,在编辑距离阈值为4时,我们的方法在精确率(提高29.75%)、召回率(提高31.07%)和F1分数(提高30.39%)上均优于最先进的基线方法。此外,我们还开展了实证研究以评估生成的前端代码的改进,结果证明了该方法在真实软件工程应用中的有效性。我们的端到端碎片化元素分组方法为改进与UI相关的软件工程任务创造了机会。

Learning Spatial and Temporal Hierarchies: Hierarchical Active Inference for navigation in Multi-Room Maze Environments

  • paper_url: http://arxiv.org/abs/2309.09864
  • repo_url: None
  • paper_authors: Daria de Tinguy, Toon Van de Maele, Tim Verbelen, Bart Dhoedt
  • for: 该论文旨在解决从像素级别观察到环境结构的推理挑战,提出了一种层次式活动推理模型。
  • methods: 该模型由认知地图、以环境为中心(allocentric)的世界模型和以自我为中心(egocentric)的世界模型三层组成,在从情境到地点再到动作的不同推理层次上,将好奇心驱动的探索与目标导向的行为结合起来。
  • results: 该模型在房间结构的小型网格环境中实现了高效的探索和目标导向搜索。
    Abstract Cognitive maps play a crucial role in facilitating flexible behaviour by representing spatial and conceptual relationships within an environment. The ability to learn and infer the underlying structure of the environment is crucial for effective exploration and navigation. This paper introduces a hierarchical active inference model addressing the challenge of inferring structure in the world from pixel-based observations. We propose a three-layer hierarchical model consisting of a cognitive map, an allocentric, and an egocentric world model, combining curiosity-driven exploration with goal-oriented behaviour at the different levels of reasoning from context to place to motion. This allows for efficient exploration and goal-directed search in room-structured mini-grid environments.
    摘要 cognitive maps 扮演了灵活行为的关键角色,协助表示环境中的空间和概念关系。学习和推理环境的结构是有效探索和导航的关键。这篇论文提出了一种层次活动推理模型,解决从像素级别观察到世界结构的推理挑战。我们提议了一种三层层次模型,包括认知地图、allocentric 和 egocentric 世界模型,将curiosity-driven 探索与目标尝试结合在不同的理解层次上,从上下文到场景到运动。这种方法可以有效地在室内小型网格环境中进行探索和目标寻找。

SYNDICOM: Improving Conversational Commonsense with Error-Injection and Natural Language Feedback

  • paper_url: http://arxiv.org/abs/2309.10015
  • repo_url: https://github.com/hieu9955/ggggg
  • paper_authors: Christopher Richardson, Anirudh Sundar, Larry Heck
  • for: 提高对话响应的常识理解
  • methods: 基于知识图构建包含有效与无效回复及其自然语言反馈(NLF)的常识对话数据集,并采用两步流程:先训练模型为无效回复预测NLF,再以预测的NLF、无效回复和对话为条件训练回复生成模型。
  • results: 在ROUGE-1上相对ChatGPT提升53%,且人类评估者在57%的情况下更偏好SYNDICOM。
    Abstract Commonsense reasoning is a critical aspect of human communication. Despite recent advances in conversational AI driven by large language models, commonsense reasoning remains a challenging task. In this work, we introduce SYNDICOM - a method for improving commonsense in dialogue response generation. SYNDICOM consists of two components. The first component is a dataset composed of commonsense dialogues created from a knowledge graph and synthesized into natural language. This dataset includes both valid and invalid responses to dialogue contexts, along with natural language feedback (NLF) for the invalid responses. The second contribution is a two-step procedure: training a model to predict natural language feedback (NLF) for invalid responses, and then training a response generation model conditioned on the predicted NLF, the invalid response, and the dialogue. SYNDICOM is scalable and does not require reinforcement learning. Empirical results on three tasks are evaluated using a broad range of metrics. SYNDICOM achieves a relative improvement of 53% over ChatGPT on ROUGE1, and human evaluators prefer SYNDICOM over ChatGPT 57% of the time. We will publicly release the code and the full dataset.
    摘要 常识推理是人类交流的关键环节。尽管由大语言模型驱动的对话式AI近来取得了显著进展,常识推理仍然是一项具有挑战性的任务。本文提出SYNDICOM,一种用于提升对话回复生成中常识能力的方法。SYNDICOM包含两部分:其一是由知识图构建并转写为自然语言的常识对话数据集,其中既包含针对对话语境的有效回复,也包含无效回复及相应的自然语言反馈(NLF);其二是一个两步流程:先训练模型为无效回复预测NLF,再以预测的NLF、无效回复和对话为条件训练回复生成模型。SYNDICOM具有良好的可扩展性,且不依赖强化学习。我们在三项任务上用多种指标进行了实证评估:SYNDICOM在ROUGE-1上相对ChatGPT提升53%,人类评估者在57%的情况下更偏好SYNDICOM。我们将公开发布代码和完整数据集。

CC-SGG: Corner Case Scenario Generation using Learned Scene Graphs

  • paper_url: http://arxiv.org/abs/2309.09844
  • repo_url: None
  • paper_authors: George Drayson, Efimia Panagiotaki, Daniel Omeiza, Lars Kunze
  • for: 增强自动驾驶车辆的安全性测试和验证
  • methods: 使用异构图神经网络(HGNN)将常规驾驶场景转换为边角案例(corner case)场景
  • results: 成功地将常规驾驶场景转换为边角案例场景,在测试集上达到89.9%的预测精度,并证明所生成的场景能够为基线自动驾驶方法制造关键情境。
    Abstract Corner case scenarios are an essential tool for testing and validating the safety of autonomous vehicles (AVs). As these scenarios are often insufficiently present in naturalistic driving datasets, augmenting the data with synthetic corner cases greatly enhances the safe operation of AVs in unique situations. However, the generation of synthetic, yet realistic, corner cases poses a significant challenge. In this work, we introduce a novel approach based on Heterogeneous Graph Neural Networks (HGNNs) to transform regular driving scenarios into corner cases. To achieve this, we first generate concise representations of regular driving scenes as scene graphs, minimally manipulating their structure and properties. Our model then learns to perturb those graphs to generate corner cases using attention and triple embeddings. The input and perturbed graphs are then imported back into the simulation to generate corner case scenarios. Our model successfully learned to produce corner cases from input scene graphs, achieving 89.9% prediction accuracy on our testing dataset. We further validate the generated scenarios on baseline autonomous driving methods, demonstrating our model's ability to effectively create critical situations for the baselines.
    摘要 边角案例(corner case)场景是测试与验证自动驾驶汽车(AV)安全性的重要工具。由于此类场景在自然驾驶数据集中往往不足,用合成的边角案例对数据进行扩充,可以显著提升AV在罕见情境下的安全运行能力。然而,生成既合成又逼真的边角案例是一项重大挑战。本文提出一种基于异构图神经网络(HGNN)的新方法,将常规驾驶场景转换为边角案例。我们首先将常规驾驶场景表示为简洁的场景图,仅对其结构和属性做最小限度的处理;随后模型借助注意力机制与三元组嵌入,学习如何扰动这些图以生成边角案例;最后将输入图与扰动后的图重新导入仿真器,生成边角案例场景。我们的模型成功地学会了从输入场景图生成边角案例,在测试集上达到89.9%的预测精度。我们进一步在基线自动驾驶方法上验证了所生成的场景,表明该模型能够有效地为基线方法制造关键情境。

RECAP: Retrieval-Augmented Audio Captioning

  • paper_url: http://arxiv.org/abs/2309.09836
  • repo_url: None
  • paper_authors: Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Ramani Duraiswami, Dinesh Manocha
  • for: 这个论文旨在提出一种新的音频captioning系统,即RECAP(REtrieval-Augmented Audio CAPtioning),它可以基于输入音频和其他相似音频从数据存储中检索类似的caption。
  • methods: 该方法使用CLAP音频-文本模型从数据存储中检索相似的描述,再在GPT-2解码器与CLAP编码器之间引入交叉注意力层,以音频为条件生成描述。
  • results: 实验表明,RECAP在域内设置下性能具有竞争力,在域外设置下则有显著提升;此外,由于它能以免训练的方式利用大规模纯文本描述数据存储,RECAP还能够为训练中从未出现的新音频事件以及包含多个事件的组合音频生成描述。
    Abstract We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio and other captions similar to the audio retrieved from a datastore. Additionally, our proposed method can transfer to any domain without the need for any additional fine-tuning. To generate a caption for an audio sample, we leverage an audio-text model CLAP to retrieve captions similar to it from a replaceable datastore, which are then used to construct a prompt. Next, we feed this prompt to a GPT-2 decoder and introduce cross-attention layers between the CLAP encoder and GPT-2 to condition the audio for caption generation. Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP achieves competitive performance in in-domain settings and significant improvements in out-of-domain settings. Additionally, due to its capability to exploit a large text-captions-only datastore in a \textit{training-free} fashion, RECAP shows unique capabilities of captioning novel audio events never seen during training and compositional audios with multiple events. To promote research in this space, we also release 150,000+ new weakly labeled captions for AudioSet, AudioCaps, and Clotho.
    摘要 我们提出了RECAP(REtrieval-Augmented Audio CAPtioning),一种新颖而有效的音频描述系统,它以输入音频以及从数据存储中检索到的相似音频描述为条件生成描述,并且无需任何额外微调即可迁移到任意领域。为给音频样本生成描述,我们利用CLAP音频-文本模型从可替换的数据存储中检索与其相似的描述,并用这些描述构建提示;随后将提示输入GPT-2解码器,并在CLAP编码器与GPT-2之间加入交叉注意力层,使生成过程以音频为条件。在Clotho和AudioCaps两个基准数据集上的实验表明,RECAP在域内设置下性能具有竞争力,在域外设置下则有显著提升。此外,由于RECAP能以免训练的方式利用大规模纯文本描述数据存储,它展现出为训练中从未见过的新音频事件以及多事件组合音频生成描述的独特能力。为促进该方向的研究,我们还为AudioSet、AudioCaps和Clotho发布了超过15万条新的弱标注描述。
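
The retrieval-and-prompting step can be sketched as follows; the `embed` placeholder stands in for a CLAP-style audio-text encoder, and the datastore and prompt format are illustrative assumptions.

```python
# Retrieval-augmented prompt construction in the spirit of RECAP.
import numpy as np

def embed(item):
    """Placeholder for a joint audio-text embedding, e.g. from a CLAP model."""
    return np.random.default_rng(abs(hash(str(item))) % 2**32).normal(size=512)

def retrieve_captions(audio, datastore_captions, k=4):
    q = embed(audio)
    caps = np.stack([embed(c) for c in datastore_captions])
    sims = caps @ q / (np.linalg.norm(caps, axis=1) * np.linalg.norm(q) + 1e-9)
    return [datastore_captions[i] for i in np.argsort(-sims)[:k]]

def build_prompt(similar_captions):
    shots = "\n".join(f"- {c}" for c in similar_captions)
    return f"Captions of similar sounds:\n{shots}\nCaption of this sound:"

datastore = ["a dog barks twice", "rain falls on a tin roof",
             "a crowd applauds", "a siren passes by"]
print(build_prompt(retrieve_captions("query.wav", datastore)))
# The prompt is then fed to a GPT-2 decoder that also cross-attends to the
# audio encoder states, as described above.
```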

Task Selection and Assignment for Multi-modal Multi-task Dialogue Act Classification with Non-stationary Multi-armed Bandits

  • paper_url: http://arxiv.org/abs/2309.09832
  • repo_url: None
  • paper_authors: Xiangheng He, Junjie Chen, Björn W. Schuller
  • for: 提高主任务性能,通过共同学习相关辅助任务。
  • methods: 采用非平稳多臂老虎机(MAB)与基于高斯先验的折扣汤普森采样(discounted Thompson Sampling)。
  • results: 在不同的训练阶段,不同任务的效用各不相同。所提方法能够有效估计任务效用,主动规避无用甚至有害的任务,并在训练中完成任务分配;与单任务和多任务基线相比,在UAR和F1上均有显著优势(p值<0.05)。进一步分析表明,在存在数据不均衡问题的数据集上,该方法具有更高的稳定性,能为少数类取得稳定且良好的性能,并超越当前最先进模型。
    Abstract Multi-task learning (MTL) aims to improve the performance of a primary task by jointly learning with related auxiliary tasks. Traditional MTL methods select tasks randomly during training. However, both previous studies and our results suggest that such the random selection of tasks may not be helpful, and can even be harmful to performance. Therefore, new strategies for task selection and assignment in MTL need to be explored. This paper studies the multi-modal, multi-task dialogue act classification task, and proposes a method for selecting and assigning tasks based on non-stationary multi-armed bandits (MAB) with discounted Thompson Sampling (TS) using Gaussian priors. Our experimental results show that in different training stages, different tasks have different utility. Our proposed method can effectively identify the task utility, actively avoid useless or harmful tasks, and realise the task assignment during training. Our proposed method is significantly superior in terms of UAR and F1 to the single-task and multi-task baselines with p-values < 0.05. Further analysis of experiments indicates that for the dataset with the data imbalance problem, our proposed method has significantly higher stability and can obtain consistent and decent performance for minority classes. Our proposed method is superior to the current state-of-the-art model.
    摘要 多任务学习(MTL)旨在通过与相关辅助任务联合学习来提升主任务的性能。传统的MTL方法在训练过程中随机选择任务;然而,已有研究和我们的结果都表明,这种随机选择未必有益,甚至可能损害性能,因此需要探索新的任务选择与分配策略。本文研究多模态、多任务的对话行为分类任务,提出一种基于非平稳多臂老虎机(MAB)与高斯先验折扣汤普森采样(TS)的任务选择与分配方法。实验结果表明,在不同训练阶段,不同任务具有不同的效用;所提方法能够有效识别任务效用,主动规避无用或有害的任务,并在训练过程中实现任务分配。与单任务和多任务基线相比,该方法在UAR和F1上具有统计显著优势(p值<0.05)。进一步分析表明,对于存在数据不均衡问题的数据集,该方法具有显著更高的稳定性,能够为少数类获得一致且良好的性能,并超越当前最先进模型。
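
A compact sketch of discounted Thompson Sampling for task selection is given below; the Gaussian update rule, discount factor, and utility signal are illustrative choices, not the paper's exact formulation.

```python
# Non-stationary task selection with discounted Thompson Sampling (a sketch).
import numpy as np

class DiscountedGaussianTS:
    def __init__(self, n_tasks, gamma=0.95, obs_var=1.0):
        self.gamma, self.obs_var = gamma, obs_var
        self.count = np.zeros(n_tasks)      # discounted observation counts
        self.total = np.zeros(n_tasks)      # discounted sum of observed utilities

    def select(self, rng):
        mean = self.total / np.maximum(self.count, 1e-9)
        mean[self.count < 1e-9] = 0.0
        std = np.sqrt(self.obs_var / (self.count + 1.0))   # posterior-style spread
        return int(np.argmax(rng.normal(mean, std)))       # sample, pick best task

    def update(self, task, utility):
        self.count *= self.gamma            # discount old evidence (non-stationarity)
        self.total *= self.gamma
        self.count[task] += 1.0
        self.total[task] += utility

rng = np.random.default_rng(0)
bandit = DiscountedGaussianTS(n_tasks=4)
for step in range(100):
    task = bandit.select(rng)
    utility = rng.normal(0.1 * task)        # e.g. validation-UAR gain of that task
    bandit.update(task, utility)
print("discounted task counts:", bandit.count.round(2))
```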

Clustering of Urban Traffic Patterns by K-Means and Dynamic Time Warping: Case Study

  • paper_url: http://arxiv.org/abs/2309.09830
  • repo_url: None
  • paper_authors: Sadegh Etemad, Raziyeh Mosayebi, Tadeh Alexani Khodavirdian, Elahe Dastan, Amir Salari Telmadarreh, Mohammadreza Jafari, Sepehr Rafiei
  • for: 本文旨在提出一种基于K-Means与动态时间规整(DTW)的时间序列聚类方法,用于刻画城市交通模式。
  • methods: 方法包括从路段中提取速度时间序列,并结合K-Means算法与动态时间规整距离进行聚类。
  • results: 实验结果表明,所提方法能够准确提取相似的城市交通模式,为城市交通管理与规划提供有用的信息。
    Abstract Clustering of urban traffic patterns is an essential task in many different areas of traffic management and planning. In this paper, two significant applications in the clustering of urban traffic patterns are described. The first application estimates the missing speed values using the speed of road segments with similar traffic patterns to colorify map tiles. The second one is the estimation of essential road segments for generating addresses for a local point on the map, using the similarity patterns of different road segments. The speed time series extracts the traffic pattern in different road segments. In this paper, we proposed the time series clustering algorithm based on K-Means and Dynamic Time Warping. The case study of our proposed algorithm is based on the Snapp application's driver speed time series data. The results of the two applications illustrate that the proposed method can extract similar urban traffic patterns.
    摘要 城市交通模式聚类是交通管理与规划诸多领域的关键任务。本文介绍了城市交通模式聚类的两个重要应用:其一是利用交通模式相似路段的速度来估计缺失的速度值,从而为地图图块着色;其二是利用不同路段的模式相似性,估计为地图上某一地点生成地址所需的关键路段。速度时间序列刻画了不同路段的交通模式。本文提出了基于K-Means与动态时间规整(DTW)的时间序列聚类算法,并以Snapp应用的司机速度时间序列数据作为案例研究。两个应用的结果均表明,所提方法能够提取出相似的城市交通模式。
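
One way to realise the described clustering pipeline is sketched below using the tslearn library; the synthetic daily speed profiles stand in for the Snapp driver data, and the scaling and cluster count are illustrative choices.

```python
# K-Means with a DTW metric over road-segment speed time series (a sketch).
import numpy as np
from tslearn.clustering import TimeSeriesKMeans
from tslearn.preprocessing import TimeSeriesScalerMeanVariance

rng = np.random.default_rng(0)
hours = np.arange(24)
# Synthetic daily speed profiles for 60 road segments: half with a strong
# morning rush-hour dip, half nearly flat.
rush = 50 - 25 * np.exp(-0.5 * ((hours - 8) / 1.5) ** 2)
flat = np.full(24, 45.0)
series = np.stack([base + rng.normal(0, 3, 24)
                   for base in ([rush] * 30 + [flat] * 30)])[..., None]

X = TimeSeriesScalerMeanVariance().fit_transform(series)   # per-series z-scaling
km = TimeSeriesKMeans(n_clusters=2, metric="dtw", random_state=0)
labels = km.fit_predict(X)
print(labels[:5], labels[-5:])   # segments with similar traffic patterns share a label
```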

Efficient Avoidance of Vulnerabilities in Auto-completed Smart Contract Code Using Vulnerability-constrained Decoding

  • paper_url: http://arxiv.org/abs/2309.09826
  • repo_url: None
  • paper_authors: André Storhaug, Jingyue Li, Tianyuan Hu
  • for: 本研究旨在提高大型自然语言模型(LLM)在代码生成中的安全性,避免由这些模型生成的代码具有漏洞。
  • methods: 我们提出了一种新的漏洞约束生成方法,通过在代码生成过程中避免生成漏洞的代码来减少漏洞的出现。我们使用了一小组标注了漏洞代码的数据,使得模型在生成代码时包含漏洞标签。然后,在生成代码时,我们禁止模型生成这些标签,以避免生成漏洞代码。
  • results: 我们以以太坊智能合约(SC)作为实验案例,因为SC对安全性的要求非常严格。我们先用从2,217,692份SC去重后得到的186,397份以太坊SC,对60亿参数的GPT-J模型进行微调;微调使用10块GPU,耗时超过一周。微调后的模型生成SC的平均BLEU分数为0.557,但自动补全的代码中仍有大量漏洞:我们用含有不同类型漏洞的176份SC中漏洞行之前的代码进行自动补全,发现超过70%的补全代码不安全。因此,我们进一步在另外941份含有相同类型漏洞的SC上微调模型,并在代码生成时应用漏洞约束解码;这次微调仅用4块GPU、耗时一小时。再次对这176份SC进行自动补全后,我们的方法能够将62%待生成的代码识别为存在漏洞,并避免生成其中的67%。
    Abstract Auto-completing code enables developers to speed up coding significantly. Recent advances in transformer-based large language model (LLM) technologies have been applied to code synthesis. However, studies show that many of such synthesized codes contain vulnerabilities. We propose a novel vulnerability-constrained decoding approach to reduce the amount of vulnerable code generated by such models. Using a small dataset of labeled vulnerable lines of code, we fine-tune an LLM to include vulnerability labels when generating code, acting as an embedded classifier. Then, during decoding, we deny the model to generate these labels to avoid generating vulnerable code. To evaluate the method, we chose to automatically complete Ethereum Blockchain smart contracts (SCs) as the case study due to the strict requirements of SC security. We first fine-tuned the 6-billion-parameter GPT-J model using 186,397 Ethereum SCs after removing the duplication from 2,217,692 SCs. The fine-tuning took more than one week using ten GPUs. The results showed that our fine-tuned model could synthesize SCs with an average BLEU (BiLingual Evaluation Understudy) score of 0.557. However, many codes in the auto-completed SCs were vulnerable. Using the code before the vulnerable line of 176 SCs containing different types of vulnerabilities to auto-complete the code, we found that more than 70% of the auto-completed codes were insecure. Thus, we further fine-tuned the model on other 941 vulnerable SCs containing the same types of vulnerabilities and applied vulnerability-constrained decoding. The fine-tuning took only one hour with four GPUs. We then auto-completed the 176 SCs again and found that our approach could identify 62% of the code to be generated as vulnerable and avoid generating 67% of them, indicating the approach could efficiently and effectively avoid vulnerabilities in the auto-completed code.
    摘要 自动补全代码可以大大提高开发者的编程速度。基于Transformer的大语言模型(LLM)技术的最新进展已被应用于代码生成,然而研究表明,许多生成的代码含有漏洞。我们提出了一种新颖的漏洞约束解码方法,以减少此类模型生成的漏洞代码。我们利用一个标注了漏洞代码行的小型数据集对LLM进行微调,使其在生成代码时一并输出漏洞标签,相当于内嵌了一个分类器;随后在解码阶段禁止模型生成这些标签,从而避免生成漏洞代码。为评估该方法,我们选择以太坊区块链智能合约(SC)的自动补全作为案例研究,因为SC对安全性的要求非常严格。我们先用去重后的186,397份以太坊SC(源自2,217,692份SC)对60亿参数的GPT-J模型进行微调,微调使用10块GPU、耗时超过一周。微调后的模型生成SC的平均BLEU分数为0.557,但自动补全的代码中仍有大量漏洞:用含不同类型漏洞的176份SC中漏洞行之前的代码进行补全时,超过70%的补全代码不安全。因此,我们进一步在941份含相同类型漏洞的SC上微调模型并应用漏洞约束解码,仅用4块GPU、一小时即完成微调。再次补全这176份SC后,我们的方法能够将62%待生成的代码识别为存在漏洞,并避免生成其中的67%,表明该方法能够高效且有效地避免自动补全代码中的漏洞。
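
The decoding-time constraint can be sketched with a Hugging Face logits processor that masks the learned vulnerability-label tokens; the GPT-2 stand-in model and the `<vuln>`/`</vuln>` token names below are illustrative assumptions, since the fine-tuned GPT-J and its label scheme are not reproduced here.

```python
# Vulnerability-constrained decoding as a custom logits processor (a sketch).
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class BanVulnerabilityLabels(LogitsProcessor):
    """Forbid the model from emitting the embedded vulnerability-label tokens."""
    def __init__(self, banned_token_ids):
        self.banned = list(banned_token_ids)

    def __call__(self, input_ids, scores):
        scores[:, self.banned] = float("-inf")
        return scores

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in for the fine-tuned GPT-J
model = AutoModelForCausalLM.from_pretrained("gpt2")
tok.add_tokens(["<vuln>", "</vuln>"])                 # labels added during fine-tuning
model.resize_token_embeddings(len(tok))
banned_ids = tok.convert_tokens_to_ids(["<vuln>", "</vuln>"])

prompt = "function withdraw(uint amount) public {"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40,
                     pad_token_id=tok.eos_token_id,
                     logits_processor=LogitsProcessorList(
                         [BanVulnerabilityLabels(banned_ids)]))
print(tok.decode(out[0], skip_special_tokens=True))
```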

Bias of AI-Generated Content: An Examination of News Produced by Large Language Models

  • paper_url: http://arxiv.org/abs/2309.09825
  • repo_url: None
  • paper_authors: Xiao Fang, Shangkun Che, Minjia Mao, Hongzhe Zhang, Ming Zhao, Xiaohang Zhao
  • for: 本研究旨在了解大语言模型(LLM)生成的偏见。
  • methods: 我们使用七个代表性的LLM生成新闻文章的标题作为提示,并评估这些LLM生成的媒体内容中的性别和种族偏见。
  • results: 我们发现所有受测LLM都表现出明显的性别和种族偏见,其生成内容对女性和黑人存在明显歧视;其中ChatGPT生成内容的偏见水平最低,并且是唯一能够在收到带偏见的提示时拒绝生成内容的模型。
    Abstract Large language models (LLMs) have the potential to transform our lives and work through the content they generate, known as AI-Generated Content (AIGC). To harness this transformation, we need to understand the limitations of LLMs. Here, we investigate the bias of AIGC produced by seven representative LLMs, including ChatGPT and LLaMA. We collect news articles from The New York Times and Reuters, both known for their dedication to provide unbiased news. We then apply each examined LLM to generate news content with headlines of these news articles as prompts, and evaluate the gender and racial biases of the AIGC produced by the LLM by comparing the AIGC and the original news articles. We further analyze the gender bias of each LLM under biased prompts by adding gender-biased messages to prompts constructed from these news headlines. Our study reveals that the AIGC produced by each examined LLM demonstrates substantial gender and racial biases. Moreover, the AIGC generated by each LLM exhibits notable discrimination against females and individuals of the Black race. Among the LLMs, the AIGC generated by ChatGPT demonstrates the lowest level of bias, and ChatGPT is the sole model capable of declining content generation when provided with biased prompts.
    摘要 大型语言模型(LLM)有可能改变我们的生活和工作通过它们生成的内容,称为人工智能生成内容(AIGC)。为了利用这一变革,我们需要了解LLM的局限性。我们在这里调查了7个代表性的LLM中的偏见,包括ChatGPT和LLaMA。我们从纽约时报和路透社获取了不偏见的新闻文章,然后对每个检查LLM中的新闻标题作为提示,生成新闻内容,并评估生成的AI内容中的性别和种族偏见。我们进一步分析每个LLM的性别偏见情况,并通过对提示语言添加性别偏见的消息来评估每个LLM的性别偏见。我们的研究发现,每个检查LLM都表现出了显著的性别和种族偏见,而且AIGC生成的 females和黑人受到了明显的歧视。与其他LLM不同,ChatGPT的AIGC表现出最低的偏见水平,并且ChatGPT是唯一一个能够拒绝生成内容的模型。

VisualProg Distiller: Learning to Fine-tune Non-differentiable Visual Programming Frameworks

  • paper_url: http://arxiv.org/abs/2309.09809
  • repo_url: None
  • paper_authors: Wentao Wan, Zeqing Wang, Nan Kang, Keze Wang, Zhiyu Shen, Liang Lin
  • for: 提高VisualProg的实用性和任务性能。
  • methods: 提出基于教师模型的VisualProg Distiller方法,按程序执行流程逐步补充并蒸馏过程知识,以优化VisualProg各子模块在解耦视觉子任务上的表现,从而提升整体任务性能。
  • results: 经过广泛和全面的实验评估,提出的方法可以大幅提高VisualProg的性能,并超越所有比较方法。
    Abstract As an interpretable and universal neuro-symbolic paradigm based on Large Language Models, visual programming (VisualProg) can execute compositional visual tasks without training, but its performance is markedly inferior compared to task-specific supervised learning models. To increase its practicality, the performance of VisualProg on specific tasks needs to be improved. However, the non-differentiability of VisualProg limits the possibility of employing the fine-tuning strategy on specific tasks to achieve further improvements. In our analysis, we discovered that significant performance issues in VisualProg's execution originated from errors made by the sub-modules at corresponding visual sub-task steps. To address this, we propose ``VisualProg Distiller", a method of supplementing and distilling process knowledge to optimize the performance of each VisualProg sub-module on decoupled visual sub-tasks, thus enhancing the overall task performance. Specifically, we choose an end-to-end model that is well-performed on the given task as the teacher and further distill the knowledge of the teacher into the invoked visual sub-modules step-by-step based on the execution flow of the VisualProg-generated programs. In this way, our method is capable of facilitating the fine-tuning of the non-differentiable VisualProg frameworks effectively. Extensive and comprehensive experimental evaluations demonstrate that our method can achieve a substantial performance improvement of VisualProg, and outperforms all the compared state-of-the-art methods by large margins. Furthermore, to provide valuable process supervision for the GQA task, we construct a large-scale dataset by utilizing the distillation process of our method.
    摘要 作为一种基于大语言模型的可解释且通用的神经符号范式,视觉编程(VisualProg)无需训练即可执行组合式视觉任务,但其性能明显逊于面向特定任务的监督学习模型。为了提高VisualProg的实用性,需要提升它在特定任务上的性能。然而,VisualProg的不可微性限制了通过微调策略在特定任务上取得进一步改进的可能性。在分析中,我们发现VisualProg执行过程中的显著性能问题主要源于各子模块在相应视觉子任务步骤上产生的错误。为此,我们提出了“VisualProg Distiller”方法,通过补充并蒸馏过程知识来优化各VisualProg子模块在解耦视觉子任务上的表现,从而提升整体任务性能。具体而言,我们选择在给定任务上表现良好的端到端模型作为教师,并按照VisualProg生成程序的执行流程,逐步将教师的知识蒸馏到被调用的视觉子模块中。通过这种方式,我们的方法能够有效地对不可微的VisualProg框架进行微调。广泛而全面的实验评估表明,我们的方法可以大幅提升VisualProg的性能,并以较大优势超越所有对比的最先进方法。此外,为了为GQA任务提供有价值的过程监督,我们利用该方法的蒸馏过程构建了一个大规模数据集。
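
The core idea above is to fine-tune a non-differentiable program by supervising each invoked sub-module with features from an end-to-end teacher, following the program's execution flow. The PyTorch sketch below illustrates such a step-wise distillation loop under assumed interfaces (how sub-modules, their inputs, and the aligned teacher features are obtained is not specified here); it is a simplified illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def stepwise_distill(program_steps, teacher_feats, optimizer):
    """One distillation pass over the execution flow of a VisualProg-style program.

    program_steps: list of (sub_module, inputs) pairs in execution order (assumed interface).
    teacher_feats: list of teacher feature tensors aligned with each step (assumed supervision).
    """
    total = 0.0
    for (module, inputs), target in zip(program_steps, teacher_feats):
        out = module(*inputs)                      # run the invoked visual sub-module
        loss = F.mse_loss(out, target.detach())    # match the teacher's intermediate feature
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / max(len(program_steps), 1)
```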

Efficient Concept Drift Handling for Batch Android Malware Detection Models

  • paper_url: http://arxiv.org/abs/2309.09807
  • repo_url: https://gitlab.com/serralba/concept_drift
  • paper_authors: Molina-Coronado B., Mori U., Mendiburu A., Miguel-Alonso J
  • for: This paper aims to address the challenge of maintaining the performance of static machine learning-based malware detectors in rapidly evolving Android app environments.
  • methods: The paper employs retraining techniques to maintain detector capabilities over time, and analyzes the effect of two aspects (frequency of retraining and data used for retraining) on efficiency and performance.
  • results: The experiments show that concept drift detection and sample selection mechanisms can be used to efficiently maintain the performance of static Android malware state-of-the-art detectors in changing environments.
    Abstract The rapidly evolving nature of Android apps poses a significant challenge to static batch machine learning algorithms employed in malware detection systems, as they quickly become obsolete. Despite this challenge, the existing literature pays limited attention to addressing this issue, with many advanced Android malware detection approaches, such as Drebin, DroidDet and MaMaDroid, relying on static models. In this work, we show how retraining techniques are able to maintain detector capabilities over time. Particularly, we analyze the effect of two aspects in the efficiency and performance of the detectors: 1) the frequency with which the models are retrained, and 2) the data used for retraining. In the first experiment, we compare periodic retraining with a more advanced concept drift detection method that triggers retraining only when necessary. In the second experiment, we analyze sampling methods to reduce the amount of data used to retrain models. Specifically, we compare fixed sized windows of recent data and state-of-the-art active learning methods that select those apps that help keep the training dataset small but diverse. Our experiments show that concept drift detection and sample selection mechanisms result in very efficient retraining strategies which can be successfully used to maintain the performance of the static Android malware state-of-the-art detectors in changing environments.
    摘要 “Android应用的快速演化带来了静态批处理机器学习算法在恶意软件检测系统中的挑战,这些算法很快就会过时。然而,现有的文献对此问题的解决方案很少,许多高级Android恶意软件检测方法,如Drebin、DroidDet和MaMaDroid,仍然采用静态模型。在这项工作中,我们展示了如何使用重新训练技术维护检测器的能力。特别是,我们分析了两个方面对检测器的效率和性能的影响:1)模型重新训练的频率,和2)重新训练使用的数据。在第一个实验中,我们比较了定期重新训练和更先进的概念漂移检测方法,该方法只在必要时触发重新训练。在第二个实验中,我们分析了采用不同大小的固定窗口和最新的活动学习方法来减少模型重新训练所需的数据量。我们的实验结果表明,概念漂移检测和样本选择机制可以实现非常高效的重新训练策略,可以成功地维护静态Android恶意软件检测器在变化环境中的性能。”
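
The two aspects studied above — when to retrain and on which data — can be pictured as a loop that monitors the detector's error rate and retrains on a bounded window of recent apps only when drift is flagged. The sketch below is a toy version of that loop, assuming a simple error-rate threshold as the drift signal and a fixed-size window of recent batches as the sample-selection strategy; the paper evaluates more sophisticated drift detectors and active-learning selection, which are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def monitor_and_retrain(batches, drift_delta=0.05, window=500, memory_batches=3):
    """Toy drift-triggered retraining loop for a batch malware detector.

    batches: iterable of (X, y) chronological batches of app feature vectors and labels.
    Retrains only when the recent error rate exceeds the post-training baseline by drift_delta.
    """
    batches = iter(batches)
    X0, y0 = next(batches)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X0, y0)
    memory = [(X0, y0)]
    baseline_err, recent_errs = None, []
    for X, y in batches:
        errs = (model.predict(X) != y).astype(float).tolist()
        recent_errs = (recent_errs + errs)[-window:]
        err = float(np.mean(recent_errs))
        baseline_err = err if baseline_err is None else baseline_err
        memory = (memory + [(X, y)])[-memory_batches:]        # fixed-size window of recent data
        if err > baseline_err + drift_delta:                  # concept drift detected
            X_tr = np.vstack([b[0] for b in memory])
            y_tr = np.concatenate([b[1] for b in memory])
            model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
            baseline_err, recent_errs = None, []
    return model
```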

Harnessing Collective Intelligence Under a Lack of Cultural Consensus

  • paper_url: http://arxiv.org/abs/2309.09787
  • repo_url: None
  • paper_authors: Necdet Gürkan, Jordan W. Suchow
  • for: This paper aims to provide a computational foundation for harnessing collective intelligence in the absence of cultural consensus.
  • methods: The paper introduces Infinite Deep Latent Construct Cultural Consensus Theory (iDLC-CCT), a nonparametric Bayesian model that extends Cultural Consensus Theory with a latent construct that maps between pretrained deep neural network embeddings of entities and the consensus beliefs regarding those entities.
  • results: The iDLC-CCT model better predicts the degree of consensus, generalizes well to out-of-sample entities, and is effective even with sparse data. An efficient hard-clustering variant of the iDLC-CCT is also introduced to improve scalability.
    Abstract Harnessing collective intelligence to drive effective decision-making and collaboration benefits from the ability to detect and characterize heterogeneity in consensus beliefs. This is particularly true in domains such as technology acceptance or leadership perception, where a consensus defines an intersubjective truth, leading to the possibility of multiple "ground truths" when subsets of respondents sustain mutually incompatible consensuses. Cultural Consensus Theory (CCT) provides a statistical framework for detecting and characterizing these divergent consensus beliefs. However, it is unworkable in modern applications because it lacks the ability to generalize across even highly similar beliefs, is ineffective with sparse data, and can leverage neither external knowledge bases nor learned machine representations. Here, we overcome these limitations through Infinite Deep Latent Construct Cultural Consensus Theory (iDLC-CCT), a nonparametric Bayesian model that extends CCT with a latent construct that maps between pretrained deep neural network embeddings of entities and the consensus beliefs regarding those entities among one or more subsets of respondents. We validate the method across domains including perceptions of risk sources, food healthiness, leadership, first impressions, and humor. We find that iDLC-CCT better predicts the degree of consensus, generalizes well to out-of-sample entities, and is effective even with sparse data. To improve scalability, we introduce an efficient hard-clustering variant of the iDLC-CCT using an algorithm derived from a small-variance asymptotic analysis of the model. The iDLC-CCT, therefore, provides a workable computational foundation for harnessing collective intelligence under a lack of cultural consensus and may potentially form the basis of consensus-aware information technologies.
    摘要 利用集体智能来推动有效的决策与协作,得益于检测和刻画共识信念中的异质性。这在技术接受度或领导力感知等领域尤为重要:共识定义了一种主体间的“真实”,当不同受访者子群各自维持互不相容的共识时,就可能出现多个“基准真实”。文化共识理论(CCT)为检测和刻画这些分歧的共识信念提供了统计框架。然而,它在现代应用中难以使用:它无法在高度相似的信念之间泛化,在稀疏数据下效果不佳,也无法利用外部知识库或学习到的机器表示。本文通过无限深度潜在构念文化共识理论(iDLC-CCT)克服了这些限制。iDLC-CCT是一种非参数贝叶斯模型,它在CCT的基础上引入潜在构念,将实体的预训练深度神经网络嵌入映射到一个或多个受访者子群对这些实体的共识信念上。我们在风险来源感知、食品健康度、领导力、第一印象和幽默等领域验证了该方法,发现iDLC-CCT能更好地预测共识程度,对样本外实体泛化良好,并且在稀疏数据下依然有效。为提高可扩展性,我们基于模型的小方差渐近分析推导出一种高效的硬聚类变体。因此,iDLC-CCT为在缺乏文化共识的情况下利用集体智能提供了可行的计算基础,并有望成为共识感知信息技术的基石。

How to Data in Datathons

  • paper_url: http://arxiv.org/abs/2309.09770
  • repo_url: https://github.com/YLee-ArtsCommission/Arts-Datathon
  • paper_authors: Carlos Mougan, Richard Plant, Clare Teng, Marya Bazzi, Alvaro Cabregas Ejea, Ryan Sze-Yin Chan, David Salvador Jasin, Martin Stoffel, Kirstie Jane Whitaker, Jules Manser
  • for: 该论文旨在为 datathon 组织者提供指南和最佳实践,帮助他们更好地处理数据相关的复杂问题。
  • methods: 该论文根据作者自己的经验和 >60 个合作机构的见解,提出了一套指南和建议,以帮助组织者在 datathon 中更好地管理数据。
  • results: 该论文通过应用自己的框架,对 10 个案例进行分析,以帮助组织者更好地理解和解决数据相关的问题。
    Abstract The rise of datathons, also known as data or data science hackathons, has provided a platform to collaborate, learn, and innovate in a short timeframe. Despite their significant potential benefits, organizations often struggle to effectively work with data due to a lack of clear guidelines and best practices for potential issues that might arise. Drawing on our own experiences and insights from organizing >80 datathon challenges with >60 partnership organizations since 2016, we provide guidelines and recommendations that serve as a resource for organizers to navigate the data-related complexities of datathons. We apply our proposed framework to 10 case studies.
    摘要 数据马拉松(也称为数据或数据科学黑客马拉松)的兴起,为在短时间内开展协作、学习和创新提供了平台。尽管其潜在收益显著,但由于缺乏针对可能出现的问题的明确指南和最佳实践,组织者往往难以有效地处理数据。基于我们自2016年以来与60多家合作组织共同举办80多场数据马拉松挑战的经验与见解,我们提出了一套指南和建议,帮助组织者应对数据马拉松中与数据相关的复杂问题。我们还将所提出的框架应用于10个案例研究。

Looking through the past: better knowledge retention for generative replay in continual learning

  • paper_url: http://arxiv.org/abs/2309.10012
  • repo_url: https://github.com/valeriya-khan/looking-through-the-past
  • paper_authors: Valeriya Khan, Sebastian Cygert, Kamil Deja, Tomasz Trzciński, Bartłomiej Twardowski
  • For: The paper aims to improve generative replay in a continual learning setting to perform well on challenging scenarios, where current generative rehearsal methods are not powerful enough to generate complex data with a greater number of classes.
  • Methods: The proposed method incorporates distillation in the latent space between the current and previous models to reduce feature drift, and uses latent matching between reconstructions and original data to improve the alignment of generated features. Additionally, the method cycles generations through the previously trained model to bring them closer to the original data.
  • Results: The proposed method outperforms other generative replay methods in various scenarios.
    Abstract In this work, we improve the generative replay in a continual learning setting to perform well on challenging scenarios. Current generative rehearsal methods are usually benchmarked on small and simple datasets as they are not powerful enough to generate more complex data with a greater number of classes. We notice that in VAE-based generative replay, this could be attributed to the fact that the generated features are far from the original ones when mapped to the latent space. Therefore, we propose three modifications that allow the model to learn and generate complex data. More specifically, we incorporate the distillation in latent space between the current and previous models to reduce feature drift. Additionally, a latent matching for the reconstruction and original data is proposed to improve generated features alignment. Further, based on the observation that the reconstructions are better for preserving knowledge, we add the cycling of generations through the previously trained model to make them closer to the original data. Our method outperforms other generative replay methods in various scenarios. Code available at https://github.com/valeriya-khan/looking-through-the-past.
    摘要 在这项工作中,我们改进了持续学习设置下的生成重放方法,使其在具有挑战性的场景中也能表现良好。现有的生成重放方法通常只在小型、简单的数据集上进行评估,因为它们不足以生成类别数更多、更复杂的数据。我们注意到,在基于VAE的生成重放中,这可能是因为生成特征映射到潜在空间后与原始特征相距较远。因此,我们提出三项改进,使模型能够学习并生成复杂数据。具体来说,我们在当前模型与先前模型之间的潜在空间中引入蒸馏,以降低特征漂移;同时提出对重建结果与原始数据进行潜在匹配,以改善生成特征的对齐。此外,基于“重建更有利于保留知识”这一观察,我们让生成样本经由先前训练的模型循环处理,使其更接近原始数据。我们的方法在多种场景下优于其他生成重放方法。代码见 https://github.com/valeriya-khan/looking-through-the-past。
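
The three modifications described above can be summarized as a combined training objective: reconstruct the input, keep current latents close to those of the frozen previous model, and match the latents of reconstructions to those of the originals. The PyTorch sketch below expresses that objective under assumed encoder/decoder interfaces; the loss weights are placeholders and the generation-cycling step is omitted, so this is an illustration rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def replay_distillation_loss(curr_enc, prev_enc, curr_dec, x,
                             lambda_distill=1.0, lambda_match=1.0):
    """Toy version of the latent-space regularizers described above (interfaces are assumed).

    curr_enc / prev_enc: encoders returning latent codes; prev_enc is the frozen previous model.
    curr_dec: current decoder used to reconstruct x from its latent code.
    """
    z_curr = curr_enc(x)
    with torch.no_grad():
        z_prev = prev_enc(x)                       # frozen snapshot of the previous task's model
    distill = F.mse_loss(z_curr, z_prev)           # reduce feature drift in latent space
    x_rec = curr_dec(z_curr)
    z_rec = curr_enc(x_rec)
    match = F.mse_loss(z_rec, z_curr.detach())     # align latents of reconstructions and originals
    recon = F.mse_loss(x_rec, x)
    return recon + lambda_distill * distill + lambda_match * match
```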

Moving Object Detection and Tracking with 4D Radar Point Cloud

  • paper_url: http://arxiv.org/abs/2309.09737
  • repo_url: None
  • paper_authors: Zhijun Pan, Fangqiang Ding, Hantao Zhong, Chris Xiaoxuan Lu
  • for: 本文针对4D雷达点云中的运动目标检测与跟踪问题提出了一个新的解决方案,即RaTrack。
  • methods: RaTrack方法侧重于运动分割和聚类,并引入运动估计模块,以增强跟踪精度。
  • results: 在View-of-Delft数据集上,RaTrack方法与现有方法相比,跟踪精度明显提高,表现出优于现有方法。
    Abstract Mobile autonomy relies on the precise perception of dynamic environments. Robustly tracking moving objects in 3D world thus plays a pivotal role for applications like trajectory prediction, obstacle avoidance, and path planning. While most current methods utilize LiDARs or cameras for Multiple Object Tracking (MOT), the capabilities of 4D imaging radars remain largely unexplored. Recognizing the challenges posed by radar noise and point sparsity in 4D radar data, we introduce RaTrack, an innovative solution tailored for radar-based tracking. Bypassing the typical reliance on specific object types and 3D bounding boxes, our method focuses on motion segmentation and clustering, enriched by a motion estimation module. Evaluated on the View-of-Delft dataset, RaTrack showcases superior tracking precision of moving objects, largely surpassing the performance of the state of the art.
    摘要 移动自主性依赖于对动态环境的精确感知。因此,在三维世界中稳健地跟踪运动目标,对轨迹预测、避障和路径规划等应用至关重要。尽管目前多数多目标跟踪(MOT)方法依赖激光雷达或相机,4D成像雷达的潜力在很大程度上仍未被挖掘。针对4D雷达数据中的噪声和点云稀疏性带来的挑战,我们提出了面向雷达跟踪的创新方案RaTrack。该方法不依赖特定目标类别和3D边界框,而是专注于运动分割与聚类,并辅以运动估计模块。在View-of-Delft数据集上的评估表明,RaTrack对运动目标的跟踪精度显著优于现有最先进方法。
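
RaTrack's class-agnostic pipeline — segment moving radar points, then cluster them into instances — can be illustrated with a very small stand-in: threshold per-point (ego-motion-compensated) Doppler velocity and cluster the moving points with DBSCAN. The velocity threshold, the DBSCAN stand-in, and the absence of the learned motion-estimation module are all simplifying assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def segment_and_cluster(points, radial_velocity, v_thresh=0.5, eps=1.0):
    """Toy motion segmentation + clustering on a 4D-radar point cloud (thresholds are illustrative).

    points: (N, 3) array of x, y, z positions.
    radial_velocity: (N,) per-point Doppler velocity, assumed already ego-motion compensated.
    Returns per-point cluster ids (-1 for static or noise points).
    """
    moving = np.abs(radial_velocity) > v_thresh           # crude motion segmentation
    labels = np.full(len(points), -1, dtype=int)
    if moving.any():
        labels[moving] = DBSCAN(eps=eps, min_samples=3).fit_predict(points[moving])
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(200, 3))
    vel = np.where(np.arange(200) < 30, 2.0, 0.0) + rng.normal(scale=0.1, size=200)
    print("clusters found:", set(segment_and_cluster(pts, vel).tolist()))
```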

A Quantum Optimization Case Study for a Transport Robot Scheduling Problem

  • paper_url: http://arxiv.org/abs/2309.09736
  • repo_url: None
  • paper_authors: Dominik Leib, Tobias Seidel, Sven Jäger, Raoul Heese, Caitlin Isobel Jones, Abhishek Awasthi, Astrid Niederle, Michael Bortz
  • for: 这项研究旨在比较D-Wave的量子-经典混合框架、Fujitsu的量子启发数字退火器和Gurobi的最先进经典求解器在运输机器人调度问题上的性能。该问题源自一个具有工业相关性的真实场景。
  • methods: 我们按照不同的设计思路为该问题提供了三种不同的模型,并比较了不同模型与求解器组合的端到端运行时间和解的质量。
  • results: 我们发现数字退火器取得了有前景的结果,混合量子退火器在与Gurobi的直接比较中也展现出一定的潜力。本研究展示了以不同策略求解面向应用的优化问题的工作流程,可用于评估各类方法的优缺点。
    Abstract We present a comprehensive case study comparing the performance of D-Waves' quantum-classical hybrid framework, Fujitsu's quantum-inspired digital annealer, and Gurobi's state-of-the-art classical solver in solving a transport robot scheduling problem. This problem originates from an industrially relevant real-world scenario. We provide three different models for our problem following different design philosophies. In our benchmark, we focus on the solution quality and end-to-end runtime of different model and solver combinations. We find promising results for the digital annealer and some opportunities for the hybrid quantum annealer in direct comparison with Gurobi. Our study provides insights into the workflow for solving an application-oriented optimization problem with different strategies, and can be useful for evaluating the strengths and weaknesses of different approaches.
    摘要 我们给出了一项全面的案例研究,比较了D-Wave的量子-经典混合框架、Fujitsu的量子启发数字退火器和Gurobi的最先进经典求解器在运输机器人调度问题上的表现。该问题源自一个具有工业相关性的真实场景。我们按照不同的设计思路为该问题提供了三种模型。在基准测试中,我们关注不同模型与求解器组合的解质量和端到端运行时间。我们发现数字退火器取得了有前景的结果,混合量子退火器在与Gurobi的直接比较中也展现出一定的机会。本研究为以不同策略求解面向应用的优化问题提供了工作流程方面的洞见,可用于评估不同方法的优势与不足。

LLM4Jobs: Unsupervised occupation extraction and standardization leveraging Large Language Models

  • paper_url: http://arxiv.org/abs/2309.09708
  • repo_url: https://github.com/aida-ugent/skillgpt
  • paper_authors: Nan Li, Bo Kang, Tijl De Bie
  • for: 本研究旨在探讨使用大语言模型(LLM)实现自动化职业抽取和标准化,以便于职业推荐和劳动市场政策制定。
  • methods: 本研究提出了一种新的无监督方法——LLM4Jobs,利用大语言模型的自然语言理解和生成能力来实现职业编码。
  • results: 经过严格的实验评估,LLM4Jobs方法在不同的数据集和粒度上 consistently 超越了现有的无监督状况标准做法,展示其在多样化数据集和粒度上的可变性。
    Abstract Automated occupation extraction and standardization from free-text job postings and resumes are crucial for applications like job recommendation and labor market policy formation. This paper introduces LLM4Jobs, a novel unsupervised methodology that taps into the capabilities of large language models (LLMs) for occupation coding. LLM4Jobs uniquely harnesses both the natural language understanding and generation capacities of LLMs. Evaluated on rigorous experimentation on synthetic and real-world datasets, we demonstrate that LLM4Jobs consistently surpasses unsupervised state-of-the-art benchmarks, demonstrating its versatility across diverse datasets and granularities. As a side result of our work, we present both synthetic and real-world datasets, which may be instrumental for subsequent research in this domain. Overall, this investigation highlights the promise of contemporary LLMs for the intricate task of occupation extraction and standardization, laying the foundation for a robust and adaptable framework relevant to both research and industrial contexts.
    摘要 从自由文本的职位描述和简历中自动抽取并标准化职业信息,对于职位推荐和劳动力市场政策制定等应用至关重要。本文提出了LLM4Jobs,一种新的无监督方法,利用大语言模型(LLM)的自然语言理解与生成能力来完成职业编码。我们在合成数据集和真实数据集上进行了严格的实验评估,结果表明LLM4Jobs在不同数据集和不同粒度上都稳定地超越了无监督的最先进基线,展现了良好的通用性。作为这项工作的附带成果,我们还发布了合成与真实数据集,可为该领域的后续研究提供参考。总体而言,本研究展示了当代LLM在职业抽取与标准化这一复杂任务中的潜力,为研究与工业场景下构建稳健且可适配的框架奠定了基础。
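
Occupation standardization ultimately maps a free-text job title onto the closest entry of a taxonomy. The sketch below replaces the LLM-based understanding and generation steps of LLM4Jobs with a simple character n-gram TF-IDF matcher against a tiny made-up taxonomy, purely to illustrate the matching stage; the taxonomy, the matcher, and the function names are assumptions, not the paper's pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny stand-in taxonomy; the paper targets standard occupation taxonomies instead.
TAXONOMY = ["software developer", "registered nurse", "data scientist", "truck driver"]

def standardize(extracted_titles):
    """Map free-text job titles to the closest taxonomy label (TF-IDF stand-in for the LLM step)."""
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    tax_mat = vec.fit_transform(TAXONOMY)
    sims = cosine_similarity(vec.transform(extracted_titles), tax_mat)
    return [TAXONOMY[row.argmax()] for row in sims]

if __name__ == "__main__":
    print(standardize(["senior python developer", "registered nurse, night shifts"]))
```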

Information based explanation methods for deep learning agents – with applications on large open-source chess models

  • paper_url: http://arxiv.org/abs/2309.09702
  • repo_url: https://github.com/patrik-ha/ii-map
  • paper_authors: Patrik Hammersborg, Inga Strümke
  • for: 这个研究旨在使用大型开源棋牌模型来实现透明AI(XAI)方法,以解释具有相似性表现的 alphaZero 模型。
  • methods: 这种 XAI 方法使用可视化解释,以便在棋牌游戏中解释模型的决策。它可以控制输入向模型传递的信息,从而提供精确的解释。
  • results: 研究人员使用这种 XAI 方法对标准 8x8 棋牌进行了应用,并取得了类似于 alphaZero 的性能。
    Abstract With large chess-playing neural network models like AlphaZero contesting the state of the art within the world of computerised chess, two challenges present themselves: The question of how to explain the domain knowledge internalised by such models, and the problem that such models are not made openly available. This work presents the re-implementation of the concept detection methodology applied to AlphaZero in McGrath et al. (2022), by using large, open-source chess models with comparable performance. We obtain results similar to those achieved on AlphaZero, while relying solely on open-source resources. We also present a novel explainable AI (XAI) method, which is guaranteed to highlight exhaustively and exclusively the information used by the explained model. This method generates visual explanations tailored to domains characterised by discrete input spaces, as is the case for chess. Our presented method has the desirable property of controlling the information flow between any input vector and the given model, which in turn provides strict guarantees regarding what information is used by the trained model during inference. We demonstrate the viability of our method by applying it to standard 8x8 chess, using large open-source chess models.
    摘要 随着AlphaZero等大型国际象棋神经网络模型在计算机国际象棋领域不断刷新最优水平,两个挑战随之而来:如何解释这类模型所内化的领域知识,以及这类模型并未公开发布的问题。本工作基于性能相当的大型开源国际象棋模型,复现了McGrath等人(2022)应用于AlphaZero的概念检测方法,并仅依靠开源资源取得了与AlphaZero上相近的结果。我们还提出了一种新的可解释AI(XAI)方法,能够确保完整且排他地突出被解释模型实际使用的信息。该方法针对输入空间离散的领域(如国际象棋)生成可视化解释。所提出的方法能够控制任意输入向量与给定模型之间的信息流,从而对训练后的模型在推理时所使用的信息提供严格保证。我们在标准8x8国际象棋上应用该方法,并使用大型开源国际象棋模型验证了其可行性。

Securing Fixed Neural Network Steganography

  • paper_url: http://arxiv.org/abs/2309.09700
  • repo_url: None
  • paper_authors: Zicong Luo, Sheng Li, Guobiao Li, Zhenxing Qian, Xinpeng Zhang
  • for: 这个研究是为了提高图像隐藏技术的安全性和可见性。
  • methods: 研究使用固定神经网络(FNN)进行秘密嵌入和抽出,并通过生成钥匙控制的扰动来提高安全性。
  • results: 实验结果显示,提案的方案能够防止未授权者从隐藏图像中提取秘密,并且能够生成高品质的隐藏图像,特别是当FNN是一个用于普通学习任务的神经网络时。
    Abstract Image steganography is the art of concealing secret information in images in a way that is imperceptible to unauthorized parties. Recent advances show that is possible to use a fixed neural network (FNN) for secret embedding and extraction. Such fixed neural network steganography (FNNS) achieves high steganographic performance without training the networks, which could be more useful in real-world applications. However, the existing FNNS schemes are vulnerable in the sense that anyone can extract the secret from the stego-image. To deal with this issue, we propose a key-based FNNS scheme to improve the security of the FNNS, where we generate key-controlled perturbations from the FNN for data embedding. As such, only the receiver who possesses the key is able to correctly extract the secret from the stego-image using the FNN. In order to improve the visual quality and undetectability of the stego-image, we further propose an adaptive perturbation optimization strategy by taking the perturbation cost into account. Experimental results show that our proposed scheme is capable of preventing unauthorized secret extraction from the stego-images. Furthermore, our scheme is able to generate stego-images with higher visual quality than the state-of-the-art FNNS scheme, especially when the FNN is a neural network for ordinary learning tasks.
    摘要 Image 隐藏技术是隐藏秘密信息在图像中,以便只有授权方可以访问。现有研究表明,可以使用固定神经网络(FNN)进行秘密嵌入和抽取。称为固定神经网络隐藏技术(FNNS),这种技术可以在无需训练神经网络的情况下实现高度的隐藏性。然而,现有的FNNS方案具有一定的漏洞,即任何人都可以从隐藏图像中提取秘密。为了解决这个问题,我们提议使用密钥控制的FNNS方案,其中我们从FNN中生成密钥控制的扰动来用于数据嵌入。因此,只有持有密钥的接收方可以使用FNN correctly提取秘密从隐藏图像中。为了提高隐藏图像的视觉质量和不可察觉性,我们进一步提议一种适应性优化策略,其中考虑扰动成本。实验结果表明,我们的提议方案可以防止未经授权的秘密提取。此外,我们的方案还可以生成高质量的隐藏图像,特别是当FNN是一个用于常规学习任务的神经网络时。

Noise-Augmented Boruta: The Neural Network Perturbation Infusion with Boruta Feature Selection

  • paper_url: http://arxiv.org/abs/2309.09694
  • repo_url: None
  • paper_authors: Hassan Gharoun, Navid Yazdanjoe, Mohammad Sadegh Khorshidi, Amir H. Gandomi
  • for: 这篇论文旨在对Boruta特征选择算法进行改进,以提升其特征选择能力和准确性。
  • methods: 本文在Boruta算法的影子特征(shadow variables)中加入噪声,其思路类似于人工神经网络中的扰动分析框架。
  • results: 在四个公开基准数据集上的实验结果表明,所提方法优于传统的Boruta算法,证明了这一改进的可行性与价值。
    Abstract With the surge in data generation, both vertically (i.e., volume of data) and horizontally (i.e., dimensionality), the burden of the curse of dimensionality has become increasingly palpable. Feature selection, a key facet of dimensionality reduction techniques, has advanced considerably to address this challenge. One such advancement is the Boruta feature selection algorithm, which successfully discerns meaningful features by contrasting them to their permutated counterparts known as shadow features. However, the significance of a feature is shaped more by the data's overall traits than by its intrinsic value, a sentiment echoed in the conventional Boruta algorithm where shadow features closely mimic the characteristics of the original ones. Building on this premise, this paper introduces an innovative approach to the Boruta feature selection algorithm by incorporating noise into the shadow variables. Drawing parallels from the perturbation analysis framework of artificial neural networks, this evolved version of the Boruta method is presented. Rigorous testing on four publicly available benchmark datasets revealed that this proposed technique outperforms the classic Boruta algorithm, underscoring its potential for enhanced, accurate feature selection.
    摘要 随着数据在数量(即数据规模)和维度两个方向上的快速增长,“维度灾难”带来的负担日益明显。作为降维技术的关键环节,特征选择为应对这一挑战取得了长足进展。其中一项进展是Boruta特征选择算法,它通过将特征与其置换副本(即影子特征)进行对比,成功识别出有意义的特征。然而,一个特征的重要性更多地取决于数据的整体特性,而非其内在价值;传统Boruta算法中影子特征与原始特征高度相似,也体现了这一点。基于这一前提,本文对Boruta特征选择算法提出了一种创新改进:在影子变量中引入噪声。借鉴人工神经网络的扰动分析框架,我们给出了这一演进版的Boruta方法。在四个公开基准数据集上的严格测试表明,所提技术优于经典Boruta算法,显示了其在实现更准确特征选择方面的潜力。
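
A single screening round of the noise-augmented variant can be sketched as: permute each feature column to build shadow features, add Gaussian noise to those shadows, fit a random forest on the concatenation, and keep the features whose importance beats the best shadow. The noise scale, forest settings, and single-round decision rule below are illustrative assumptions rather than the paper's full iterative procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def noisy_boruta_step(X, y, noise_scale=0.1, random_state=0):
    """One simplified Boruta-style screening step with noise-augmented shadow features.
    Returns a boolean mask marking features whose importance beats the best shadow feature."""
    rng = np.random.default_rng(random_state)
    # Shadow features: independently permute each column, then inject Gaussian noise.
    shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
    shadows = shadows + rng.normal(scale=noise_scale, size=shadows.shape) * X.std(axis=0)
    forest = RandomForestClassifier(n_estimators=200, random_state=random_state)
    forest.fit(np.hstack([X, shadows]), y)
    importances = forest.feature_importances_
    real, shadow = importances[: X.shape[1]], importances[X.shape[1]:]
    return real > shadow.max()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 6))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)      # only the first two features matter
    print(noisy_boruta_step(X, y))
```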

Ugly Ducklings or Swans: A Tiered Quadruplet Network with Patient-Specific Mining for Improved Skin Lesion Classification

  • paper_url: http://arxiv.org/abs/2309.09689
  • repo_url: None
  • paper_authors: Nathasha Naranpanawa, H. Peter Soyer, Adam Mothershaw, Gayan K. Kulatilleke, Zongyuan Ge, Brigid Betz-Stablein, Shekhar S. Chandra
  • For: 帮助诊断皮肤黑色素瘤(cutaneous melanoma),通过区分高度可疑与良性的皮肤病变。
  • Methods: 使用深度度量学习网络,在患者层面和病变层面两个层级学习病变特征;并引入患者特定的四元组挖掘方法与分层四元组网络,以学习更多的全局与局部上下文信息。
  • Results: 与传统分类器相比,所提方法在识别“丑小鸭”病变时敏感度比基线ResNet18 CNN高54%,比朴素三元组网络高37%。对度量空间中数据流形的可视化也表明,DMT-Quadruplet 能够在患者特定和患者无关两种设置下成功地区分“丑小鸭”病变。
    Abstract An ugly duckling is an obviously different skin lesion from surrounding lesions of an individual, and the ugly duckling sign is a criterion used to aid in the diagnosis of cutaneous melanoma by differentiating between highly suspicious and benign lesions. However, the appearance of pigmented lesions, can change drastically from one patient to another, resulting in difficulties in visual separation of ugly ducklings. Hence, we propose DMT-Quadruplet - a deep metric learning network to learn lesion features at two tiers - patient-level and lesion-level. We introduce a patient-specific quadruplet mining approach together with a tiered quadruplet network, to drive the network to learn more contextual information both globally and locally between the two tiers. We further incorporate a dynamic margin within the patient-specific mining to allow more useful quadruplets to be mined within individuals. Comprehensive experiments show that our proposed method outperforms traditional classifiers, achieving 54% higher sensitivity than a baseline ResNet18 CNN and 37% higher than a naive triplet network in classifying ugly duckling lesions. Visualisation of the data manifold in the metric space further illustrates that DMT-Quadruplet is capable of classifying ugly duckling lesions in both patient-specific and patient-agnostic manner successfully.
    摘要 “丑小鸭”指个体身上明显不同于周围病变的皮肤病变,“丑小鸭征”是辅助诊断皮肤黑色素瘤的一项标准,用于区分高度可疑病变与良性病变。然而,色素性病变的外观在不同患者之间可能差异巨大,使得“丑小鸭”病变难以通过视觉区分。为此,我们提出DMT-Quadruplet——一种在患者层面和病变层面两个层级学习病变特征的深度度量学习网络。我们引入患者特定的四元组挖掘方法和分层四元组网络,促使网络在两个层级上同时学习全局与局部的上下文信息;并在患者特定挖掘中引入动态间隔(dynamic margin),以便在个体内部挖掘更多有用的四元组。大量实验表明,所提方法优于传统分类器:在“丑小鸭”病变分类上,其敏感度比基线ResNet18 CNN高54%,比朴素三元组网络高37%。对度量空间中数据流形的可视化进一步说明,DMT-Quadruplet 能够在患者特定和患者无关两种设置下成功地区分“丑小鸭”病变。
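
At the heart of the lesion-level tier is a quadruplet loss over an anchor, a positive, and two negatives. The PyTorch sketch below shows a generic quadruplet loss with fixed margins; the paper's patient-specific mining and dynamic margin govern how the four embeddings and the margins are actually chosen, which is not reproduced here.

```python
import torch
import torch.nn.functional as F

def quadruplet_loss(anchor, positive, negative1, negative2, margin1=1.0, margin2=0.5):
    """Generic quadruplet loss over batches of embeddings of shape (B, D).

    Pulls anchor-positive pairs together while pushing away both the anchor-negative
    distance and the distance between the two negatives.
    """
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative1)
    d_nn = F.pairwise_distance(negative1, negative2)
    loss = F.relu(d_ap - d_an + margin1) + F.relu(d_ap - d_nn + margin2)
    return loss.mean()
```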

Distributed course allocation with asymmetric friendships

  • paper_url: http://arxiv.org/abs/2309.09684
  • repo_url: None
  • paper_authors: Ilya Khakhiashvili, Lihi Dery, Tal Grinshpoun
  • for: 本研究旨在考虑学生之间的友谊关系,为学生分配课程 seats 提供一种分布式解决方案。
  • methods: 本文使用非对称分布式约束优化问题来模型问题,并开发了一种专门的算法。
  • results: Results show that our algorithm can obtain high utility for students while ensuring fairness and observing course seat capacity limitations.
    Abstract Students' decisions on whether to take a class are strongly affected by whether their friends plan to take the class with them. A student may prefer to be assigned to a course they likes less, just to be with their friends, rather than taking a more preferred class alone. It has been shown that taking classes with friends positively affects academic performance. Thus, academic institutes should prioritize friendship relations when assigning course seats. The introduction of friendship relations results in several non-trivial changes to current course allocation methods. This paper explores how course allocation mechanisms can account for friendships between students and provide a unique, distributed solution. In particular, we model the problem as an asymmetric distributed constraint optimization problem and develop a new dedicated algorithm. Our extensive evaluation includes both simulated data and data derived from a user study on 177 students' preferences over courses and friends. The results show that our algorithm obtains high utility for the students while keeping the solution fair and observing courses' seat capacity limitations.
    摘要 学生们决定选课的决定受到同学的决定影响很强。学生可能会偏好选择一门课程,即使它不是他们最喜欢的,只是为了与朋友一起学习。这已经证明了与朋友一起学习会提高学业表现。因此,学府应该在分配课程时考虑学生之间的友谊关系。在现有课程分配方法的基础上引入友谊关系会导致一些非常重要的变化。这篇论文探讨了如何考虑学生之间的友谊关系来分配课程,并提供了一种专门的算法。我们的评估包括仿真数据和177名学生对课程和朋友的偏好的用户研究数据。结果表明,我们的算法可以为学生提供高的用户价值,同时保证分配的解决方案公平、遵循课程坐席限制。

Single and Few-step Diffusion for Generative Speech Enhancement

  • paper_url: http://arxiv.org/abs/2309.09677
  • repo_url: None
  • paper_authors: Bunlong Lay, Jean-Marie Lemercier, Julius Richter, Timo Gerkmann
  • for: 提高Diffusion模型的推理速度和精度
  • methods: 采用两个阶段训练方法,首先使用普通的生成推理方法进行训练,然后使用预测损失来对推理结果进行修正
  • results: 使用这种两个阶段训练方法可以在60次函数评估中达到同等性能,并且在减少函数评估数量(NFEs)下仍然保持稳定性和超越基eline模型的性能。
    Abstract Diffusion models have shown promising results in speech enhancement, using a task-adapted diffusion process for the conditional generation of clean speech given a noisy mixture. However, at test time, the neural network used for score estimation is called multiple times to solve the iterative reverse process. This results in a slow inference process and causes discretization errors that accumulate over the sampling trajectory. In this paper, we address these limitations through a two-stage training approach. In the first stage, we train the diffusion model the usual way using the generative denoising score matching loss. In the second stage, we compute the enhanced signal by solving the reverse process and compare the resulting estimate to the clean speech target using a predictive loss. We show that using this second training stage enables achieving the same performance as the baseline model using only 5 function evaluations instead of 60 function evaluations. While the performance of usual generative diffusion algorithms drops dramatically when lowering the number of function evaluations (NFEs) to obtain single-step diffusion, we show that our proposed method keeps a steady performance and therefore largely outperforms the diffusion baseline in this setting and also generalizes better than its predictive counterpart.
    摘要 Diffusion 模型在听音提升中表现出了promising的结果,使用任务适应的扩散过程来 conditional generation 清晰的听音,给定噪音混合。然而,在测试时,用于分数估计的神经网络会被多次调用,以解决反向过程的迭代问题。这会导致慢速的推理过程和积累的精度错误。在这篇论文中,我们解决这些限制,通过两个阶段的训练方法。在第一阶段,我们使用传统的扩散模型训练方法,使用生成扩散分数匹配损失函数。在第二阶段,我们解决反向过程,并将结果与清晰听音目标进行比较,使用预测损失函数。我们显示,使用这个第二阶段训练方法可以达到与基准模型相同的性能,只需要5个功能评估次数,而不是60个。而通常的生成扩散算法在降低功能评估次数(NFEs)时,性能会降低很多,但我们的提议方法可以保持稳定的性能,因此大大超越了扩散基准模型。此外,我们还发现,我们的方法在这种设定下更好地适应和泛化。

Conditioning Latent-Space Clusters for Real-World Anomaly Classification

  • paper_url: http://arxiv.org/abs/2309.09676
  • repo_url: None
  • paper_authors: Daniel Bogdoll, Svetlana Pavlitska, Simon Klaus, J. Marius Zöllner
  • for: 本研究旨在提高自动驾驶领域中异常检测的精度和效果。
  • methods: 本研究使用Variational Autoencoder(VAE)来分类样本为正常数据或异常数据,并通过提供异常映射来进一步提高异常检测性能。
  • results: 研究结果表明,通过使用异常映射,VAE可以更好地分类异常数据,并且可以分离正常数据和异常数据到隔离的集群中,从而获得有意义的幂等表示。
    Abstract Anomalies in the domain of autonomous driving are a major hindrance to the large-scale deployment of autonomous vehicles. In this work, we focus on high-resolution camera data from urban scenes that include anomalies of various types and sizes. Based on a Variational Autoencoder, we condition its latent space to classify samples as either normal data or anomalies. In order to emphasize especially small anomalies, we perform experiments where we provide the VAE with a discrepancy map as an additional input, evaluating its impact on the detection performance. Our method separates normal data and anomalies into isolated clusters while still reconstructing high-quality images, leading to meaningful latent representations.
    摘要 自动驾驶领域中的异常情况是大规模部署自动驾驶车辆的主要障碍。在这项工作中,我们关注城市场景中的高分辨率相机数据,其中包含类型和尺寸各异的异常。基于变分自编码器(VAE),我们对其潜在空间加以约束,将样本分类为正常数据或异常。为了突出特别小的异常,我们在实验中将差异图(discrepancy map)作为额外输入提供给VAE,并评估其对检测性能的影响。我们的方法能将正常数据与异常分离到相互独立的聚类中,同时仍能重建高质量图像,从而获得有意义的潜在表示。

Neural Network-Based Rule Models With Truth Tables

  • paper_url: http://arxiv.org/abs/2309.09638
  • repo_url: https://github.com/molyswu/hand_detection
  • paper_authors: Adrien Benamira, Tristan Guérand, Thomas Peyrin, Hans Soegeng
  • for: 这篇论文主要关注理解机器学习模型做出决策的过程,尤其是在安全敏感应用中。
  • methods: 该研究提出了一种神经网络框架,可将神经网络转换为规则型模型。该框架称为Truth Table rules(TT-rules),它基于Truth Table nets(TTnets)——一类最初为形式化验证而开发的深度神经网络。
  • results: 研究表明,TT-rules在七个表格数据集上可达到与其他可解释方法相当或更高的性能,同时在性能与复杂度之间保持平衡。此外,TT-rules还适用于大型表格数据集,包括两个具有超过2万个特征的真实DNA数据集。最后,作者基于Adult数据集对由TT-rules导出的规则型模型进行了深入分析。
    Abstract Understanding the decision-making process of a machine/deep learning model is crucial, particularly in security-sensitive applications. In this study, we introduce a neural network framework that combines the global and exact interpretability properties of rule-based models with the high performance of deep neural networks. Our proposed framework, called $\textit{Truth Table rules}$ (TT-rules), is built upon $\textit{Truth Table nets}$ (TTnets), a family of deep neural networks initially developed for formal verification. By extracting the set of necessary and sufficient rules $\mathcal{R}$ from the trained TTnet model (global interpretability), yielding the same output as the TTnet (exact interpretability), TT-rules effectively transforms the neural network into a rule-based model. This rule-based model supports binary classification, multi-label classification, and regression tasks for tabular datasets. Furthermore, our TT-rules framework optimizes the rule set $\mathcal{R}$ into $\mathcal{R}_{opt}$ by reducing the number and size of the rules. To enhance model interpretation, we leverage Reduced Ordered Binary Decision Diagrams (ROBDDs) to visualize these rules effectively. After outlining the framework, we evaluate the performance of TT-rules on seven tabular datasets from finance, healthcare, and justice domains. We also compare the TT-rules framework to state-of-the-art rule-based methods. Our results demonstrate that TT-rules achieves equal or higher performance compared to other interpretable methods while maintaining a balance between performance and complexity. Notably, TT-rules presents the first accurate rule-based model capable of fitting large tabular datasets, including two real-life DNA datasets with over 20K features. Finally, we extensively investigate a rule-based model derived from TT-rules using the Adult dataset.
    摘要 理解机器学习模型的决策过程是非常重要,尤其在安全敏感应用中。在这种研究中,我们提出了一种神经网络框架,可以结合神经网络的高性能和规则型模型的全面和准确解释性。我们称之为“真实表达规则”(TT-rules)。基于规则型网络(TTnets),TT-rules 可以从训练完成的 TTnet 模型中提取必要和 suficient 规则集( $\mathcal{R}$),并且可以使这些规则集产生同样的输出,从而将神经网络转化成规则型模型。这种规则型模型支持二分类、多标签分类和回归任务。此外,我们还优化了规则集 $\mathcal{R}$ 为 $\mathcal{R}_{opt}$,以降低规则的数量和大小。为了增强模型解释,我们利用减少的binary decision diagram(ROBDDs)来可见地表示这些规则。在文章中,我们首先介绍了 TT-rules 框架,然后对七个标准化表格数据集进行评估。这些数据集来自于金融、医疗和正义领域。我们还与当前的解释性方法进行比较。我们的结果表明,TT-rules 可以与其他解释性方法相比,具有同等或更高的性能,同时保持性能和复杂度之间的平衡。尤其是,TT-rules 可以适用于大型表格数据集,包括两个实际的 DNA 数据集,它们具有超过 20K 的特征。最后,我们进行了一项详细的规则型模型研究,使用 Adult 数据集。

Designing a Hybrid Neural System to Learn Real-world Crack Segmentation from Fractal-based Simulation

  • paper_url: http://arxiv.org/abs/2309.09637
  • repo_url: None
  • paper_authors: Achref Jaziri, Martin Mundt, Andres Fernandez Rodriguez, Visvanathan Ramesh
  • for: 这篇论文旨在提升计算机视觉系统对混凝土结构完整性的评估能力,特别是实现稳健的裂纹分割。
  • methods: 论文提出了一个基于分形的高保真裂纹图形模拟器及相应的完全标注裂纹数据集,并利用点互信息估计和自适应实例归一化作为归纳偏置,从模拟数据中学习可泛化的表示。
  • results: 论文通过实验显示,该系统可以有效地处理实际世界中的裂纹分割任务,并且不同的设计选择是相互协力的。
    Abstract Identification of cracks is essential to assess the structural integrity of concrete infrastructure. However, robust crack segmentation remains a challenging task for computer vision systems due to the diverse appearance of concrete surfaces, variable lighting and weather conditions, and the overlapping of different defects. In particular recent data-driven methods struggle with the limited availability of data, the fine-grained and time-consuming nature of crack annotation, and face subsequent difficulty in generalizing to out-of-distribution samples. In this work, we move past these challenges in a two-fold way. We introduce a high-fidelity crack graphics simulator based on fractals and a corresponding fully-annotated crack dataset. We then complement the latter with a system that learns generalizable representations from simulation, by leveraging both a pointwise mutual information estimate along with adaptive instance normalization as inductive biases. Finally, we empirically highlight how different design choices are symbiotic in bridging the simulation to real gap, and ultimately demonstrate that our introduced system can effectively handle real-world crack segmentation.
    摘要 裂纹识别是评估混凝土基础设施结构完整性的关键。然而,由于混凝土表面外观多样、光照与天气条件多变以及不同缺陷相互重叠,稳健的裂纹分割对计算机视觉系统而言仍是一项挑战。尤其是近期的数据驱动方法面临数据有限、裂纹标注精细且耗时等问题,并因此难以泛化到分布外样本。在这项工作中,我们从两方面突破这些挑战:首先,我们提出一个基于分形的高保真裂纹图形模拟器以及相应的完全标注裂纹数据集;其次,我们设计了一个能够从模拟数据中学习可泛化表示的系统,其利用点互信息估计和自适应实例归一化作为归纳偏置。最后,我们通过实验说明不同设计选择在弥合模拟与真实差距方面相辅相成,并最终证明所提系统能够有效地处理真实世界的裂纹分割任务。
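
One of the inductive biases mentioned above, adaptive instance normalization, re-normalizes the per-channel statistics of one feature map to match another's, which is one way simulated and real crack imagery can be brought into a shared representation. The sketch below is the standard AdaIN formula; how and where it is inserted in the paper's system is not specified here.

```python
import torch

def adaptive_instance_norm(content, style, eps=1e-5):
    """AdaIN: re-normalize per-channel statistics of `content` (e.g., simulated crack features)
    to match those of `style` (e.g., real-image features). Shapes: (N, C, H, W)."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean
```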

Gradpaint: Gradient-Guided Inpainting with Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.09614
  • repo_url: None
  • paper_authors: Asya Grechka, Guillaume Couairon, Matthieu Cord
  • for: 图像填充任务中的图像生成(image inpainting)
  • methods: 使用 diffusion probabilistic models (DDPMs) 和自定义损失函数(custom loss)来导航生成过程,以实现图像生成的准确性和一致性。
  • results: 比起当前的方法, GradPaint 能够更好地考虑图像的一致性和自然性,提高了图像生成的质量。
    Abstract Denoising Diffusion Probabilistic Models (DDPMs) have recently achieved remarkable results in conditional and unconditional image generation. The pre-trained models can be adapted without further training to different downstream tasks, by guiding their iterative denoising process at inference time to satisfy additional constraints. For the specific task of image inpainting, the current guiding mechanism relies on copying-and-pasting the known regions from the input image at each denoising step. However, diffusion models are strongly conditioned by the initial random noise, and therefore struggle to harmonize predictions inside the inpainting mask with the real parts of the input image, often producing results with unnatural artifacts. Our method, dubbed GradPaint, steers the generation towards a globally coherent image. At each step in the denoising process, we leverage the model's "denoised image estimation" by calculating a custom loss measuring its coherence with the masked input image. Our guiding mechanism uses the gradient obtained from backpropagating this loss through the diffusion model itself. GradPaint generalizes well to diffusion models trained on various datasets, improving upon current state-of-the-art supervised and unsupervised methods.
    摘要 去噪扩散概率模型(DDPM)近来在条件与无条件图像生成中取得了令人瞩目的成果。预训练模型无需进一步训练即可适配不同的下游任务:只需在推理阶段引导其迭代去噪过程以满足额外约束。对于图像修复(inpainting)任务,当前的引导机制是在每个去噪步骤中将输入图像的已知区域复制粘贴进去。然而,扩散模型受初始随机噪声的强烈影响,因此难以使修复掩码内部的预测与输入图像的真实部分协调一致,常常产生带有不自然伪影的结果。我们的方法GradPaint能够将生成过程引导向全局一致的图像。在去噪过程的每一步,我们利用模型的“去噪图像估计”,计算其与带掩码输入图像之间一致性的自定义损失,并将该损失通过扩散模型本身反向传播得到梯度,用以引导生成。GradPaint能很好地泛化到在不同数据集上训练的扩散模型,优于当前最先进的有监督与无监督方法。
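
A single guided denoising step in the spirit of GradPaint computes the model's denoised estimate, measures its coherence with the known (unmasked) pixels, backpropagates that loss through the diffusion model, and nudges the sampled next state along the negative gradient. The sketch below assumes generic `model` and `scheduler_step` signatures and a simple squared-error coherence loss; it is an illustration, not the paper's exact update rule.

```python
import torch

def guided_inpainting_step(model, scheduler_step, x_t, t, known, mask, guidance_scale=1.0):
    """One gradient-guided reverse-diffusion step (interfaces are assumed).

    model(x_t, t)             -> predicted denoised image estimate x0_hat (assumed signature).
    scheduler_step(x_t, x0_hat, t) -> the usual DDPM/DDIM update producing x_{t-1}.
    known, mask               -> input image and binary mask (1 = known pixels).
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = model(x_t, t)
    coherence = ((mask * (x0_hat - known)) ** 2).mean()     # custom loss on the known region
    grad = torch.autograd.grad(coherence, x_t)[0]           # backprop through the diffusion model
    x_prev = scheduler_step(x_t, x0_hat, t)
    return (x_prev - guidance_scale * grad).detach()        # steer toward a coherent image
```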

Proposition from the Perspective of Chinese Language: A Chinese Proposition Classification Evaluation Benchmark

  • paper_url: http://arxiv.org/abs/2309.09602
  • repo_url: None
  • paper_authors: Conghui Niu, Mengyang Hu, Lin Bo, Xiaoli He, Dong Yu, Pengyuan Liu
  • for: 这篇论文主要研究了中文提案的分类和识别,以及提案的语言表达特征和逻辑意义。
  • methods: 该论文提出了明确和暗示提案的概念,并提出了一种多级分类系统,使用语言学和逻辑学方法进行提案分类。
  • results: 经过多种方法的评估,包括Rule-based方法、SVM、BERT、RoBERTA和ChatGPT等,研究发现现有模型对中文提案分类能力不足,尤其是跨领域传递性不佳。BERT表现较好,但缺乏跨领域传递性。ChatGPT表现不佳,但可以通过提供更多提案信息进行改进。
    Abstract Existing propositions often rely on logical constants for classification. Compared with Western languages that lean towards hypotaxis such as English, Chinese often relies on semantic or logical understanding rather than logical connectives in daily expressions, exhibiting the characteristics of parataxis. However, existing research has rarely paid attention to this issue. And accurately classifying these propositions is crucial for natural language understanding and reasoning. In this paper, we put forward the concepts of explicit and implicit propositions and propose a comprehensive multi-level proposition classification system based on linguistics and logic. Correspondingly, we create a large-scale Chinese proposition dataset PEACE from multiple domains, covering all categories related to propositions. To evaluate the Chinese proposition classification ability of existing models and explore their limitations, We conduct evaluations on PEACE using several different methods including the Rule-based method, SVM, BERT, RoBERTA, and ChatGPT. Results show the importance of properly modeling the semantic features of propositions. BERT has relatively good proposition classification capability, but lacks cross-domain transferability. ChatGPT performs poorly, but its classification ability can be improved by providing more proposition information. Many issues are still far from being resolved and require further study.
    摘要 现有的提案经常利用逻辑常量进行分类。相比西方语言,如英语,中文更加倾向于 semantic或逻辑理解而非逻辑连接在日常表达中,展现出复杂的parataxis特点。然而,现有的研究几乎没有关注这一点。正确地分类这些提案是自然语言理解和逻辑的关键。在这篇论文中,我们提出了explicit和implicit提案的概念,并提出了基于语言和逻辑的多级提案分类系统。与此同时,我们创建了来自多个领域的大规模中文提案数据集PEACE,覆盖所有与提案相关的类别。为了评估现有模型对中文提案分类的能力和其局限性,我们使用了多种方法,包括规则基本方法、SVM、BERT、RoBERTA和ChatGPT。结果表明,正确地表示提案的 semantic特征非常重要。BERT在提案分类能力方面表现较好,但缺乏跨领域传递性。ChatGPT表现不佳,但可以通过提供更多提案信息来提高其分类能力。许多问题仍然待解决,需要进一步研究。

Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher LLMs

  • paper_url: http://arxiv.org/abs/2309.09582
  • repo_url: None
  • paper_authors: Jonas Golde, Patrick Haller, Felix Hamborg, Julian Risch, Alan Akbik
  • for: 这个论文的目的是提出一种基于大语言模型(LLM)的数据生成方法,以解决NLプロセス中的数据预处理瓶颈。
  • methods: 这个论文使用的方法是通过向LLM提供任务描述,然后使用生成的数据来训练下游NLP模型。
  • results: 这个论文的结果表明,通过使用LLM进行数据生成,可以生成大量高质量的标注数据,从而降低NL模型的训练成本。同时,这个方法还可以支持多种下游NLP任务,如文本分类、问答和实体识别等。
    Abstract Most NLP tasks are modeled as supervised learning and thus require labeled training data to train effective models. However, manually producing such data at sufficient quality and quantity is known to be costly and time-intensive. Current research addresses this bottleneck by exploring a novel paradigm called zero-shot learning via dataset generation. Here, a powerful LLM is prompted with a task description to generate labeled data that can be used to train a downstream NLP model. For instance, an LLM might be prompted to "generate 500 movie reviews with positive overall sentiment, and another 500 with negative sentiment." The generated data could then be used to train a binary sentiment classifier, effectively leveraging an LLM as a teacher to a smaller student model. With this demo, we introduce Fabricator, an open-source Python toolkit for dataset generation. Fabricator implements common dataset generation workflows, supports a wide range of downstream NLP tasks (such as text classification, question answering, and entity recognition), and is integrated with well-known libraries to facilitate quick experimentation. With Fabricator, we aim to support researchers in conducting reproducible dataset generation experiments using LLMs and help practitioners apply this approach to train models for downstream tasks.
    摘要 我们引入了一个名为 Fabricator 的开源 Python 工具库,用于生成数据集。Fabricator 支持许多下游 NLP 任务(如文本分类、问题答案和实体识别),并与各种知名库集成,以便快速实验。我们希望通过 Fabricator 支持研究人员在使用 LLM 进行可重现的数据集生成实验,并帮助实践者使用这种方法训练下游任务中的模型。

Heterogeneous Generative Knowledge Distillation with Masked Image Modeling

  • paper_url: http://arxiv.org/abs/2309.09571
  • repo_url: None
  • paper_authors: Ziming Wang, Shumin Han, Xiaodi Wang, Jing Hao, Xianbin Cao, Baochang Zhang
  • for: 这篇论文的目的是提出一种基于Masked Image Modeling(MIM)的异化深度学习知识传递(H-GKD)方法,实现将大型Transformer模型的知识转移到小型CNN模型中,以提高这些小型模型在 computationally resource-limited edge devices 上的表现。
  • methods: 这篇论文使用了一种基于UNet的学生网络,通过将 sparse convolution 加入学生网络中,使学生网络能够对教师模型的Visual Representation进行效果伪装。此外,这篇论文还使用了Masked Image Modeling(MIM)方法,将教师模型的Visual Representation转移到学生网络中,以实现知识传递。
  • results: 这篇论文的实验结果显示,H-GKD 方法可以对不同的模型和大小进行适应,在 ImageNet 1K dataset 上,H-GKD 方法可以从 Resnet50 (sparse) 的76.98% 提高到80.01%。
    Abstract Small CNN-based models usually require transferring knowledge from a large model before they are deployed in computationally resource-limited edge devices. Masked image modeling (MIM) methods achieve great success in various visual tasks but remain largely unexplored in knowledge distillation for heterogeneous deep models. The reason is mainly due to the significant discrepancy between the Transformer-based large model and the CNN-based small network. In this paper, we develop the first Heterogeneous Generative Knowledge Distillation (H-GKD) based on MIM, which can efficiently transfer knowledge from large Transformer models to small CNN-based models in a generative self-supervised fashion. Our method builds a bridge between Transformer-based models and CNNs by training a UNet-style student with sparse convolution, which can effectively mimic the visual representation inferred by a teacher over masked modeling. Our method is a simple yet effective learning paradigm to learn the visual representation and distribution of data from heterogeneous teacher models, which can be pre-trained using advanced generative methods. Extensive experiments show that it adapts well to various models and sizes, consistently achieving state-of-the-art performance in image classification, object detection, and semantic segmentation tasks. For example, in the Imagenet 1K dataset, H-GKD improves the accuracy of Resnet50 (sparse) from 76.98% to 80.01%.
    摘要 通常,小型CNN模型需要从大型模型中传输知识才能在计算资源有限的边缘设备中部署。Masked image modeling(MIM)方法在各种视觉任务中取得了很大成功,但在不同深度模型之间的知识传递方面尚未得到充分开发。这主要是因为大型Transformer模型和小型CNN模型之间存在很大的不同。在这篇论文中,我们开发了首个基于MIM的Heterogeneous Generative Knowledge Distillation(H-GKD),可以有效地将大型Transformer模型中的知识传递给小型CNN模型。我们的方法建立了Transformer模型和CNN之间的桥梁,通过训练一个带有散集 convolution的学生模型,可以有效地模仿教师模型在遮盲模型中的视觉表示。我们的方法是一种简单 yet有效的学习方法,可以从不同的教师模型中学习数据的视觉表示和分布。我们的实验表明,我们的方法可以适应不同的模型和大小,并在图像分类、物体检测和semantic segmentation任务中准确地达到领先性性表现。例如,在Imagenet 1K dataset中,H-GKD提高了Resnet50(散集)的准确率从76.98%提升到80.01%。
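
The distillation objective above can be approximated as: mask random patches of the input, let the frozen Transformer teacher produce target features from the full image, and train the student to regress those features from the visible patches. The sketch below assumes both networks return same-shape dense feature maps and ignores the sparse-convolution UNet specifics of the actual student, so it is a simplified stand-in rather than H-GKD itself.

```python
import torch
import torch.nn.functional as F

def masked_generative_distill_loss(student, teacher, images, mask_ratio=0.6, patch=16):
    """Toy masked-image-modeling distillation loss between a frozen teacher and a CNN student.

    Both networks are assumed to return feature maps with the same spatial size as the input.
    """
    n, _, h, w = images.shape
    gh, gw = h // patch, w // patch
    keep = (torch.rand(n, 1, gh, gw, device=images.device) > mask_ratio).float()
    mask = F.interpolate(keep, size=(h, w), mode="nearest")    # 1 = visible patch, 0 = masked
    with torch.no_grad():
        target = teacher(images)                               # frozen Transformer teacher
    pred = student(images * mask)                              # student sees only visible patches
    return F.mse_loss(pred, target)
```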

Causal-Story: Local Causal Attention Utilizing Parameter-Efficient Tuning For Visual Story Synthesis

  • paper_url: http://arxiv.org/abs/2309.09553
  • repo_url: None
  • paper_authors: Tianyi Song, Jiuxin Cao, Kun Wang, Bo Liu, Xiaofeng Zhang
  • for: 提高视觉故事生成的全局一致性
  • methods: 使用本地 causal 注意机制,考虑过去的caption、frame和当前caption之间的 causal 关系,对当前帧生成
  • results: 在 PororoSV 和 FlintstonesSV 数据集上获得了state-of-the-art FID 分数,生成的帧也更加出色地表现了视觉故事的整体一致性。
    Abstract The excellent text-to-image synthesis capability of diffusion models has driven progress in synthesizing coherent visual stories. The current state-of-the-art method combines the features of historical captions, historical frames, and the current captions as conditions for generating the current frame. However, this method treats each historical frame and caption as the same contribution. It connects them in order with equal weights, ignoring that not all historical conditions are associated with the generation of the current frame. To address this issue, we propose Causal-Story. This model incorporates a local causal attention mechanism that considers the causal relationship between previous captions, frames, and current captions. By assigning weights based on this relationship, Causal-Story generates the current frame, thereby improving the global consistency of story generation. We evaluated our model on the PororoSV and FlintstonesSV datasets and obtained state-of-the-art FID scores, and the generated frames also demonstrate better storytelling in visuals. The source code of Causal-Story can be obtained from https://github.com/styufo/Causal-Story.
    摘要 “ diffusion 模型的出色文本到图像合成能力,已经驱动了视觉故事的生成进步。当前state-of-the-art方法是将历史caption、历史框和当前caption作为生成当前帧的condition。但是,这种方法对每个历史框和caption都使用相同的权重,忽略了不同历史条件对当前帧生成的不同影响。为解决这个问题,我们提出了Causal-Story模型。这个模型包含了本地 causal 注意力机制,考虑了以前caption、frame和当前caption之间的 causal 关系。通过基于这种关系的权重分配,Causal-Story模型生成当前帧,从而提高了全局的故事生成一致性。我们在 PororoSV 和 FlintstonesSV 数据集上评估了我们的模型,并取得了state-of-the-art FID 分数,生成的图像也更好地表达了故事的视觉。Causal-Story 模型的源代码可以从 GitHub 上获取:https://github.com/styufo/Causal-Story。”

CB-Whisper: Contextual Biasing Whisper using TTS-based Keyword Spotting

  • paper_url: http://arxiv.org/abs/2309.09552
  • repo_url: None
  • paper_authors: Yuang Li, Yinglu Li, Min Zhang, Chang Su, Mengyao Piao, Xiaosong Qiao, Jiawei Yu, Miaomiao Ma, Yanqing Zhao, Hao Yang
  • for: 提高自动语音识别(ASR)系统对罕见名词的识别率,如人名、组织名称和技术术语等。
  • methods: 使用OpenAI的Whisper模型,首先通过关键词检测(KWS)模块匹配实体和语音示例的特征。
  • results: 在三个内部数据集和两个开源数据集上,包括英语、中文和code-switching场景,通过采用经过设计的口头形式提示,使Whisper模型的混合错误率(MER)和实体恢复率得到显著改进。
    Abstract End-to-end automatic speech recognition (ASR) systems often struggle to recognize rare name entities, such as personal names, organizations, or technical terms that are not frequently encountered in the training data. This paper presents Contextual Biasing Whisper (CB-Whisper), a novel ASR system based on OpenAI's Whisper model that performs keyword-spotting (KWS) before the decoder. The KWS module leverages text-to-speech (TTS) techniques and a convolutional neural network (CNN) classifier to match the features between the entities and the utterances. Experiments demonstrate that by incorporating predicted entities into a carefully designed spoken form prompt, the mixed-error-rate (MER) and entity recall of the Whisper model is significantly improved on three internal datasets and two open-sourced datasets that cover English-only, Chinese-only, and code-switching scenarios.
    摘要 端到端自动语音识别(ASR)系统往往难以识别罕见的命名实体,如人名、组织机构名或技术术语,因为这些实体在训练数据中出现频率很低。本文提出了Contextual Biasing Whisper(CB-Whisper),一种基于OpenAI Whisper模型的ASR系统,它在解码器之前执行关键词检出(KWS)。KWS模块利用文本转语音(TTS)技术和卷积神经网络(CNN)分类器来匹配实体与语音之间的特征。实验表明,将预测出的实体融入精心设计的口语形式提示后,Whisper模型的混合错误率(MER)显著降低、实体召回率显著提升;评测覆盖三个内部数据集和两个开源数据集,涵盖纯英语、纯中文以及语码转换场景。

Adaptive Reorganization of Neural Pathways for Continual Learning with Hybrid Spiking Neural Networks

  • paper_url: http://arxiv.org/abs/2309.09550
  • repo_url: None
  • paper_authors: Bing Han, Feifei Zhao, Wenxuan Pan, Zhaoya Zhao, Xianqi Li, Qingqun Kong, Yi Zeng
  • for: 本研究旨在开发一种基于大脑自组织的连续学习算法,以便让人工神经网络能够有效地适应增加的任务,同时保持性能和能耗水平。
  • methods: 该算法使用自组织调节网络(SOR-SNN),通过重组单个和有限的脉冲神经网络(SNN),生成丰富的稀热神经路径,以高效地处理增量任务。
  • results: 实验结果表明,该算法在多种连续学习任务上表现了稳定的优异性、能耗和存储能力,并能够有效地汇集过去学习的知识和当前任务的信息,实现了回传能力。此外,该模型还具有自修复能力,可以自动分配新的路径来恢复忘记的知识。
    Abstract The human brain can self-organize rich and diverse sparse neural pathways to incrementally master hundreds of cognitive tasks. However, most existing continual learning algorithms for deep artificial and spiking neural networks are unable to adequately auto-regulate the limited resources in the network, which leads to performance drop along with energy consumption rise as the increase of tasks. In this paper, we propose a brain-inspired continual learning algorithm with adaptive reorganization of neural pathways, which employs Self-Organizing Regulation networks to reorganize the single and limited Spiking Neural Network (SOR-SNN) into rich sparse neural pathways to efficiently cope with incremental tasks. The proposed model demonstrates consistent superiority in performance, energy consumption, and memory capacity on diverse continual learning tasks ranging from child-like simple to complex tasks, as well as on generalized CIFAR100 and ImageNet datasets. In particular, the SOR-SNN model excels at learning more complex tasks as well as more tasks, and is able to integrate the past learned knowledge with the information from the current task, showing the backward transfer ability to facilitate the old tasks. Meanwhile, the proposed model exhibits self-repairing ability to irreversible damage and for pruned networks, could automatically allocate new pathway from the retained network to recover memory for forgotten knowledge.
    摘要 人类大脑可以自组织富有多样性和稀疏的神经路径,以逐渐掌握百种认知任务。然而,现有的深度人工神经网络和脉冲神经网络的持续学习算法无法充分自适应网络的限制资源,导致任务增加后性能下降和能耗增加。在这篇论文中,我们提出了基于大脑自适应学习算法的快速适应神经网络,使用自组织调节网络(SOR-SNN)重组单个和有限的脉冲神经网络,以高效地处理增量任务。我们的模型在多种不同的持续学习任务上表现了一致的优异性、能耗和存储能力,包括从婴儿式简单到复杂任务,以及通用CIFAR100和ImageNet数据集。尤其是SOR-SNN模型在学习更复杂的任务和更多任务方面表现出色,并能够将过去学习的知识与当前任务的信息相结合,表现出返回传递能力以便恢复过去学习的知识。同时,我们的模型也展现了自动修复损害和剪除网络的自适应能力。

A performance characteristic curve for model evaluation: the application in information diffusion prediction

  • paper_url: http://arxiv.org/abs/2309.09537
  • repo_url: None
  • paper_authors: Wenjin Xie, Xiaomeng Wang, Radosław Michalski, Tao Jia
  • for: 这项研究旨在预测社交媒体上信息传播的未来接收者,在市场营销和社交媒体领域具有实际应用价值。
  • methods: 研究提出了一种基于信息熵的指标来量化传播数据中的随机性,并发现该随机性与模型预测准确率之间存在一种标度关系。
  • results: 研究发现,不同序列长度、系统规模和随机性下的数据点都会坍缩到同一条曲线上,该“性能特征曲线”刻画了模型在不确定性增加时做出正确预测的内在能力,可用于系统地评估不同预测模型的性能,并为未来的模型评估研究提供新思路。
    Abstract The information diffusion prediction on social networks aims to predict future recipients of a message, with practical applications in marketing and social media. While different prediction models all claim to perform well, general frameworks for performance evaluation remain limited. Here, we aim to identify a performance characteristic curve for a model, which captures its performance on tasks of different complexity. We propose a metric based on information entropy to quantify the randomness in diffusion data, then identify a scaling pattern between the randomness and the prediction accuracy of the model. Data points in the patterns by different sequence lengths, system sizes, and randomness all collapse into a single curve, capturing a model's inherent capability of making correct predictions against increased uncertainty. Given that this curve has such important properties that it can be used to evaluate the model, we define it as the performance characteristic curve of the model. The validity of the curve is tested by three prediction models in the same family, reaching conclusions in line with existing studies. Also, the curve is successfully applied to evaluate two distinct models from the literature. Our work reveals a pattern underlying the data randomness and prediction accuracy. The performance characteristic curve provides a new way to systematically evaluate models' performance, and sheds light on future studies on other frameworks for model evaluation.
    摘要 社交媒体上的信息扩散预测目标是预测未来的消息接收者,有实际应用于市场营销和社交媒体。虽然不同的预测模型都宣称表现良好,但总体框架 для性能评估仍然有限。我们想要找到一个表现曲线,可以捕捉模型在不同复杂度任务上的表现。我们提出一种基于信息熵的度量,用于量化扩散数据中的随机性,然后确定模型预测正确率和随机性之间的扩散规律。数据点在不同的序列长度、系统大小和随机性下都可以归一化到单一的曲线上,捕捉模型在面临不确定性增加时的内在能力。由于这个曲线具有这些重要的性能特点,我们定义它为模型性能特征曲线。我们的实验证明了这个曲线的有效性,并应用到了Literature中的两种不同模型中。我们的研究发现了随机性和预测精度之间的关系,并提供了一种新的评估模型性能的方法。这种方法可以系统地评估模型的性能,并为未来的研究提供了新的思路。
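
The randomness side of the characteristic curve needs a concrete quantity; an information-entropy measure over observed diffusion data is one natural choice. The sketch below computes the Shannon entropy of an empirical next-recipient distribution as an illustrative stand-in for the paper's metric, and contrasts a highly predictable cascade with a near-uniform one.

```python
import math
from collections import Counter

def diffusion_entropy(next_recipients):
    """Shannon entropy (bits) of the empirical next-recipient distribution; an illustrative
    stand-in for an information-entropy measure of randomness in diffusion data."""
    counts = Counter(next_recipients)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

if __name__ == "__main__":
    ordered = ["u1"] * 90 + ["u2"] * 10          # highly predictable cascades
    noisy = [f"u{i % 20}" for i in range(100)]   # close to uniform, hard to predict
    print(round(diffusion_entropy(ordered), 3), round(diffusion_entropy(noisy), 3))
```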

DFIL: Deepfake Incremental Learning by Exploiting Domain-invariant Forgery Clues

  • paper_url: http://arxiv.org/abs/2309.09526
  • repo_url: https://github.com/deepfakeil/dfil
  • paper_authors: Kun Pan, Yin Yifang, Yao Wei, Feng Lin, Zhongjie Ba, Zhenguang Liu, ZhiBo Wang, Lorenzo Cavallaro, Kui Ren
  • for: 防止深伪造影像的普遍传播和攻击,提高伪造影像检测模型的准确性和适应能力。
  • methods: 提出一个增量学习框架,通过不断学习少量新数据来改善伪造影像检测模型的普遍性和适应能力。另外,提出了一个领域不断学习的方法,以获得不同数据分布下的领域共同表示。以及一种多角度知识传授法,以避免严重遗忘现象。
  • results: 在四个benchmark数据集上进行了广泛的实验,取得了新的state-of-the-art的平均遗忘率7.01和平均准确率85.49。
    Abstract The malicious use and widespread dissemination of deepfake pose a significant crisis of trust. Current deepfake detection models can generally recognize forgery images by training on a large dataset. However, the accuracy of detection models degrades significantly on images generated by new deepfake methods due to the difference in data distribution. To tackle this issue, we present a novel incremental learning framework that improves the generalization of deepfake detection models by continual learning from a small number of new samples. To cope with different data distributions, we propose to learn a domain-invariant representation based on supervised contrastive learning, preventing overfit to the insufficient new data. To mitigate catastrophic forgetting, we regularize our model in both feature-level and label-level based on a multi-perspective knowledge distillation approach. Finally, we propose to select both central and hard representative samples to update the replay set, which is beneficial for both domain-invariant representation learning and rehearsal-based knowledge preserving. We conduct extensive experiments on four benchmark datasets, obtaining the new state-of-the-art average forgetting rate of 7.01 and average accuracy of 85.49 on FF++, DFDC-P, DFD, and CDF2. Our code is released at https://github.com/DeepFakeIL/DFIL.
    摘要 深度伪造的恶意使用与广泛传播造成了严重的信任危机。现有的深度伪造检测模型通常可以通过在大规模数据集上训练来识别伪造图像,但由于数据分布的差异,检测模型在新的深度伪造方法所生成图像上的准确率会显著下降。为了解决这一问题,我们提出了一种新的增量学习框架,通过从少量新样本中持续学习来提升深度伪造检测模型的泛化能力。为应对不同的数据分布,我们提出基于监督对比学习来学习领域不变的表示,避免在不足的新数据上过拟合。为缓解灾难性遗忘,我们基于多视角知识蒸馏方法在特征层和标签层对模型进行正则化。最后,我们提出同时选择中心样本和困难代表样本来更新回放集,这既有利于领域不变表示学习,也有利于基于回放的知识保持。我们在四个基准数据集上进行了大量实验,在 FF++、DFDC-P、DFD 和 CDF2 上取得了新的最优平均遗忘率 7.01 和平均准确率 85.49。代码发布于 https://github.com/DeepFakeIL/DFIL。
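As an illustration of two of the loss components named in the abstract, here is a hedged PyTorch sketch of a supervised contrastive loss (for domain-invariant features) and a feature-level distillation term; it is not the released DFIL code, and all tensor shapes, names, and the way the terms are combined are assumptions.

```python
# Minimal sketch of supervised contrastive learning + feature-level distillation.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.07):
    """features: (N, D) embeddings; labels: (N,) class ids (e.g., real/fake)."""
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temperature                # (N, N) similarities
    mask_pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    mask_pos.fill_diagonal_(0)                                  # exclude self-pairs
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()
    exp = torch.exp(logits) * (1 - torch.eye(len(labels)))
    log_prob = logits - torch.log(exp.sum(dim=1, keepdim=True) + 1e-12)
    mean_log_prob_pos = (mask_pos * log_prob).sum(1) / mask_pos.sum(1).clamp(min=1)
    return -mean_log_prob_pos.mean()

def feature_distillation_loss(student_feat, teacher_feat):
    """Keep the updated model's features close to the frozen old model's features."""
    return F.mse_loss(student_feat, teacher_feat.detach())

# On a mixed batch of replayed old samples and new samples, the total loss would
# (roughly) combine a classification term with these two regularizers, e.g.:
# loss = ce_loss + supervised_contrastive_loss(z, y) + feature_distillation_loss(z, z_old)
```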

FedGKD: Unleashing the Power of Collaboration in Federated Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2309.09517
  • repo_url: None
  • paper_authors: Qiying Pan, Ruofan Wu, Tengfei Liu, Tianyi Zhang, Yifei Zhu, Weiqiang Wang
  • for: 这篇论文旨在提出一个基于联邦训练(Federated Training)的图神经网络(GNN)框架,并解决联邦 GNN 系统中的图异质性问题。
  • methods: 本文提出了一种新的客户端图数据集蒸馏方法,用更好地刻画任务相关性的任务特征来描述本地任务,并引入一种能感知全局协作结构的新的服务器端聚合机制。
  • results: 本文的实验结果显示, FedGKD 框架在六个真实世界数据集上表现出优于其他方法,特别是在大规模数据集上。
    Abstract Federated training of Graph Neural Networks (GNN) has become popular in recent years due to its ability to perform graph-related tasks under data isolation scenarios while preserving data privacy. However, graph heterogeneity issues in federated GNN systems continue to pose challenges. Existing frameworks address the problem by representing local tasks using different statistics and relating them through a simple aggregation mechanism. However, these approaches suffer from limited efficiency from two aspects: low quality of task-relatedness quantification and inefficacy of exploiting the collaboration structure. To address these issues, we propose FedGKD, a novel federated GNN framework that utilizes a novel client-side graph dataset distillation method to extract task features that better describe task-relatedness, and introduces a novel server-side aggregation mechanism that is aware of the global collaboration structure. We conduct extensive experiments on six real-world datasets of different scales, demonstrating our framework's outperformance.
    摘要 近年来,图神经网络(GNN)的联邦训练(Federated Training)因其能够在数据隔离场景下完成图相关任务并保护数据隐私而受到广泛关注。然而,联邦 GNN 系统中的图异质性问题仍然具有挑战性。现有框架通过使用不同的统计量来表示本地任务,并通过简单的聚合机制将其关联起来以解决这个问题。然而,这些方法在两个方面效率有限:一是任务相关性的量化质量低,二是不能充分利用协作结构。为了解决这些问题,我们提出了 FedGKD,一种新的联邦 GNN 框架:它利用一种新的客户端图数据集蒸馏方法来提取能更好刻画任务相关性的任务特征,并引入一种感知全局协作结构的新的服务器端聚合机制。我们在六个不同规模的真实世界数据集上进行了广泛的实验,证明了我们的框架优于现有方法。

Pruning Large Language Models via Accuracy Predictor

  • paper_url: http://arxiv.org/abs/2309.09507
  • repo_url: None
  • paper_authors: Yupeng Ji, Yibo Cao, Jiucai Liu
  • for: 这篇论文旨在提出一种新的模型压缩方法,以便对大型语言模型(LLMs)进行压缩,以提高训练、测试和部署的效率。
  • methods: 这篇论文提出了一种新的模型压缩方法,包括首先建立一个训练集,其中包含一定数量的架构精度对。然后,使用这个精度预测器进行进一步优化搜索空间,以找到最佳的模型。
  • results: 实验结果显示,提案的方法具有高效性和高精度。相比基准,在Wikitext2和PTB上的PPL下降9.48%和5.76%,MMLU的平均精度提高6.28%.
    Abstract Large language models (LLMs) containing tens of billions of parameters (or even more) have demonstrated impressive capabilities in various NLP tasks. However, substantial model size poses challenges to training, inference, and deployment, so it is necessary to compress the model. At present, most model compression for LLMs requires manual design of pruning features, which has problems such as a complex optimization pipeline and difficulty in retaining the capabilities of certain parts of the model. Therefore, we propose a novel pruning approach: first, a training set of a certain number of architecture-accuracy pairs is established; then, a non-neural model is trained as an accuracy predictor. Using the accuracy predictor to further optimize the search space and guide the search, the optimal model can be selected automatically. Experiments show that our proposed approach is effective and efficient. Compared with the baseline, the perplexity (PPL) on Wikitext2 and PTB dropped by 9.48% and 5.76% respectively, and the average accuracy on MMLU increased by 6.28%.
    摘要 包含数百亿甚至更多参数的大型语言模型(LLM)在各种自然语言处理任务中表现出色。然而,庞大的模型规模给训练、推理和部署带来了挑战,因此需要对模型进行压缩。目前,大多数面向 LLM 的模型压缩方法需要手动设计剪枝特征,存在优化管道复杂、难以保留模型特定部分能力等问题。因此,我们提出了一种新的剪枝方法:首先,建立一定数量的架构-精度对训练集;然后,训练一个非神经网络模型作为精度预测器。利用该精度预测器进一步优化搜索空间并引导搜索,即可自动选出最佳模型。实验表明,我们提出的方法高效且有效。相比基准,Wikitext2 和 PTB 上的 PPL 分别下降了 9.48% 和 5.76%,MMLU 的平均精度提高了 6.28%。
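The abstract describes a simple loop: collect architecture-accuracy pairs, fit a non-neural predictor, and use it to screen candidate pruning configurations cheaply. A minimal sketch of that loop is given below, assuming a toy binary layer-pruning encoding and a stand-in evaluation function; scikit-learn's gradient boosting is used only as an example of a non-neural predictor and is not necessarily what the paper uses.

```python
# Hedged sketch of an accuracy-predictor-guided pruning search.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_layers = 32

def random_architecture():
    # binary mask: 1 = keep layer/block, 0 = prune it (illustrative encoding)
    return rng.integers(0, 2, size=n_layers)

def evaluate(arch):
    # stand-in for a real evaluation (e.g., negative perplexity on a dev set)
    return arch.mean() + 0.05 * rng.normal()

# 1) small training set of architecture-accuracy pairs
train_archs = np.array([random_architecture() for _ in range(100)])
train_scores = np.array([evaluate(a) for a in train_archs])

# 2) fit the non-neural accuracy predictor
predictor = GradientBoostingRegressor().fit(train_archs, train_scores)

# 3) use the predictor to screen a much larger candidate pool cheaply
candidates = np.array([random_architecture() for _ in range(5000)])
best = candidates[np.argmax(predictor.predict(candidates))]
print("predicted-best architecture keeps", int(best.sum()), "of", n_layers, "layers")
```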

PromptST: Prompt-Enhanced Spatio-Temporal Multi-Attribute Prediction

  • paper_url: http://arxiv.org/abs/2309.09500
  • repo_url: None
  • paper_authors: Zijian Zhang, Xiangyu Zhao, Qidong Liu, Chunxu Zhang, Qian Ma, Wanyu Wang, Hongwei Zhao, Yiqi Wang, Zitao Liu
  • for: This paper focuses on the problem of spatio-temporal multi-attribute prediction, which is a critical part of urban management. The authors aim to address the challenge of handling diverse spatio-temporal attributes simultaneously and improve the prediction performance.
  • methods: The proposed method, called PromptST, consists of a spatio-temporal transformer and a parameter-sharing training scheme. The authors also introduce a spatio-temporal prompt tuning strategy to fit specific attributes in a lightweight manner.
  • results: Extensive experiments on real-world datasets show that PromptST achieves state-of-the-art performance in spatio-temporal multi-attribute prediction. Additionally, the authors show that PromptST has good transferability on unseen spatio-temporal attributes, which has promising application potential in urban computing.
    Abstract In the era of information explosion, spatio-temporal data mining serves as a critical part of urban management. Considering the various fields demanding attention, e.g., traffic state, human activity, and social event, predicting multiple spatio-temporal attributes simultaneously can alleviate regulatory pressure and foster smart city construction. However, current research can not handle the spatio-temporal multi-attribute prediction well due to the complex relationships between diverse attributes. The key challenge lies in how to address the common spatio-temporal patterns while tackling their distinctions. In this paper, we propose an effective solution for spatio-temporal multi-attribute prediction, PromptST. We devise a spatio-temporal transformer and a parameter-sharing training scheme to address the common knowledge among different spatio-temporal attributes. Then, we elaborate a spatio-temporal prompt tuning strategy to fit the specific attributes in a lightweight manner. Through the pretrain and prompt tuning phases, our PromptST is able to enhance the specific spatio-temoral characteristic capture by prompting the backbone model to fit the specific target attribute while maintaining the learned common knowledge. Extensive experiments on real-world datasets verify that our PromptST attains state-of-the-art performance. Furthermore, we also prove PromptST owns good transferability on unseen spatio-temporal attributes, which brings promising application potential in urban computing. The implementation code is available to ease reproducibility.
    摘要 在信息爆炸时代,时空数据挖掘是城市管理的重要组成部分。考虑到交通状况、人员活动和社会事件等多个需要关注的领域,同时预测多个时空属性可以减轻管理压力并推动智慧城市建设。然而,由于不同属性之间存在复杂的关系,当前的研究无法很好地处理时空多属性预测。主要挑战在于如何在刻画共同时空模式的同时兼顾各属性的差异。在本文中,我们提出一种高效的时空多属性预测解决方案——PromptST。我们设计了一种时空变换器和参数共享训练方案,以建模不同时空属性之间的共同知识;然后,我们提出了一种时空提示微调策略,以轻量的方式适配特定属性。通过预训练和提示微调两个阶段,我们的 PromptST 能够在保持已学共同知识的同时,促使骨干模型更好地捕捉特定目标属性的时空特征。广泛的实验表明,我们的 PromptST 达到了最先进的性能。此外,我们还证明 PromptST 对未见过的时空属性具有良好的迁移性,这为智慧城市计算提供了广阔的应用前景。代码已公开以便复现。

CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval

  • paper_url: http://arxiv.org/abs/2309.09496
  • repo_url: None
  • paper_authors: Yating liu, Yaowei Li, Zimo Liu, Wenming Yang, Yaowei Wang, Qingmin Liao
  • for: 文本基准人识别(TBPR)
  • methods: 使用 CLIP 基于的 Synergistic Knowledge Transfer(CSKT)方法,包括文本到图像和图像到文本的bidirectional prompts 和 coupling projections,以及图像和语言多头自注意力的双向转移知识机制。
  • results: CSKT 方法在三个标准测试集上比前一个最佳方法表现出色,只需要训练参数占模型总参数的 7.4%,显示其高效、有效和普遍。
    Abstract Text-based Person Retrieval aims to retrieve the target person images given a textual query. The primary challenge lies in bridging the substantial gap between vision and language modalities, especially when dealing with limited large-scale datasets. In this paper, we introduce a CLIP-based Synergistic Knowledge Transfer(CSKT) approach for TBPR. Specifically, to explore the CLIP's knowledge on input side, we first propose a Bidirectional Prompts Transferring (BPT) module constructed by text-to-image and image-to-text bidirectional prompts and coupling projections. Secondly, Dual Adapters Transferring (DAT) is designed to transfer knowledge on output side of Multi-Head Self-Attention (MHSA) in vision and language. This synergistic two-way collaborative mechanism promotes the early-stage feature fusion and efficiently exploits the existing knowledge of CLIP. CSKT outperforms the state-of-the-art approaches across three benchmark datasets when the training parameters merely account for 7.4% of the entire model, demonstrating its remarkable efficiency, effectiveness and generalization.
    摘要 基于文本的人物检索(Text-based Person Retrieval,TBPR)的目标是根据文本查询检索目标人物图像。主要挑战在于弥合视觉与语言模态之间的巨大差距,尤其是在大规模标注数据有限的情况下。本文提出了基于 CLIP 的协同知识迁移(CSKT)方法用于 TBPR。具体来说,为了在输入侧挖掘 CLIP 的知识,我们首先提出了由文本到图像、图像到文本双向提示与耦合投影构成的双向提示迁移模块(BPT)。其次,我们设计了双适配器迁移(DAT)模块,用于在视觉与语言的多头自注意力(MHSA)输出侧迁移知识。这种双向协同机制促进了早期特征融合,并高效利用 CLIP 已有的知识。CSKT 在三个基准数据集上超越了最先进的方法,而可训练参数仅占整个模型的 7.4%,表明其高效、有效且具有良好的泛化性。

HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform

  • paper_url: http://arxiv.org/abs/2309.09493
  • repo_url: None
  • paper_authors: Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani
  • for: 高品质语音合成
  • methods: integrate inverse short-time Fourier transform (iSTFT) into the network, incorporates a harmonic-plus-noise source filter in the time-frequency domain
  • results: 显著优于 iSTFTNet 和 HiFi-GAN,达到接近真实水平的性能;在未见过的说话人上优于 BigVGAN-base,并以四倍的速度和仅 $1/6$ 的参数量取得与 BigVGAN 相当的性能。
    Abstract Recent advancements in speech synthesis have leveraged GAN-based networks like HiFi-GAN and BigVGAN to produce high-fidelity waveforms from mel-spectrograms. However, these networks are computationally expensive and parameter-heavy. iSTFTNet addresses these limitations by integrating inverse short-time Fourier transform (iSTFT) into the network, achieving both speed and parameter efficiency. In this paper, we introduce an extension to iSTFTNet, termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the time-frequency domain that uses a sinusoidal source from the fundamental frequency (F0) inferred via a pre-trained F0 estimation network for fast inference speed. Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN, achieving ground-truth-level performance. HiFTNet also outperforms BigVGAN-base on LibriTTS for unseen speakers and achieves comparable performance to BigVGAN while being four times faster with only $1/6$ of the parameters. Our work sets a new benchmark for efficient, high-quality neural vocoding, paving the way for real-time applications that demand high quality speech synthesis.
    摘要 近年来,语音合成利用基于 GAN 的网络(如 HiFi-GAN 和 BigVGAN)从梅尔频谱生成高保真波形。然而,这些网络计算代价高且参数量大。iSTFTNet 通过将逆短时傅里叶变换(iSTFT)融入网络来解决这些限制,兼顾速度和参数效率。在这篇论文中,我们介绍了 iSTFTNet 的扩展版本 HiFTNet,它在时频域中引入谐波加噪声源滤波器,其正弦源来自由预训练的 F0 估计网络推断出的基频(F0),从而实现快速推理。在 LJSpeech 上的主观评估表明,我们的模型显著优于 iSTFTNet 和 HiFi-GAN,达到接近真实语音的性能。HiFTNet 还在 LibriTTS 的未见说话人上优于 BigVGAN-base,并在速度快四倍、参数量仅为其 $1/6$ 的情况下取得与 BigVGAN 相当的性能。我们的工作为高效、高质量的神经声码器设立了新的基准,为需要高质量语音合成的实时应用铺平了道路。
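A small illustration of the harmonic-plus-noise source idea mentioned above: a sinusoidal excitation whose instantaneous frequency follows the F0 contour in voiced frames, plus additive noise. The sample rate, hop size, and noise level below are assumed values, and this is not the HiFTNet implementation.

```python
# Illustrative harmonic-plus-noise excitation built from an F0 contour.
import numpy as np

def harmonic_plus_noise_source(f0, sr=24000, hop=300, noise_std=0.003):
    """f0: per-frame fundamental frequency in Hz (0 for unvoiced frames)."""
    f0_up = np.repeat(f0, hop)                       # upsample F0 to sample rate
    phase = 2 * np.pi * np.cumsum(f0_up) / sr        # integrate frequency -> phase
    harmonic = np.sin(phase) * (f0_up > 0)           # sinusoid only where voiced
    noise = noise_std * np.random.randn(len(f0_up))
    return harmonic + noise

if __name__ == "__main__":
    f0 = np.concatenate([np.zeros(20), np.linspace(120, 180, 60), np.zeros(20)])
    excitation = harmonic_plus_noise_source(f0)
    print(excitation.shape)  # one excitation sample per audio sample
```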

Mechanic Maker 2.0: Reinforcement Learning for Evaluating Generated Rules

  • paper_url: http://arxiv.org/abs/2309.09476
  • repo_url: None
  • paper_authors: Johor Jara Gonzalez, Seth Cooper, Mathew Guzdial
  • for: 本研究用Reinforcement Learning(RL)作为人类游戏行为的近似方法,用于自动生成游戏规则。
  • methods: 本研究使用RL方法来模拟人类游戏行为,并在Unity游戏引擎上创建了一个新的开源规则生成框架。
  • results: 研究结果表明,RL 生成的规则与 A* 代理基线生成的规则有所不同,可能对人类玩家更有用。
    Abstract Automated game design (AGD), the study of automatically generating game rules, has a long history in technical games research. AGD approaches generally rely on approximations of human play, either objective functions or AI agents. Despite this, the majority of these approximators are static, meaning they do not reflect human player's ability to learn and improve in a game. In this paper, we investigate the application of Reinforcement Learning (RL) as an approximator for human play for rule generation. We recreate the classic AGD environment Mechanic Maker in Unity as a new, open-source rule generation framework. Our results demonstrate that RL produces distinct sets of rules from an A* agent baseline, which may be more usable by humans.
    摘要 自动游戏设计(AGD),即自动生成游戏规则的研究,在技术性游戏研究中有着悠久的历史。AGD 方法通常依赖对人类游戏行为的近似,形式可以是目标函数或 AI 代理。尽管如此,大多数这些近似器都是静态的,无法反映人类玩家在游戏中学习和进步的能力。在这篇论文中,我们研究了将强化学习(RL)作为规则生成中人类游戏行为的近似器。我们在 Unity 中重新实现了经典的 AGD 环境 Mechanic Maker,作为一个新的开源规则生成框架。我们的结果表明,RL 生成的规则集与 A* 代理基线不同,可能对人类玩家更有用。

Exploring and Learning in Sparse Linear MDPs without Computationally Intractable Oracles

  • paper_url: http://arxiv.org/abs/2309.09457
  • repo_url: None
  • paper_authors: Noah Golowich, Ankur Moitra, Dhruv Rohatgi
  • for: 这篇论文的目的是解决线性Markov决策过程(MDP)中的特征选择问题,即在缺乏专家领域知识的情况下,通过学习一个近似优化策略,以便在仅有多少交互中学习环境。
  • methods: 这篇论文使用了特征选择的思想,即在一个 $k$-稀疏的线性MDP中,找到一个大小为 $k$ 的子集 $S\subset[d]$(其中 $d$ 是特征维度),该子集包含所有相关特征。这篇论文提出了首个解决该问题的多项式时间算法。
  • results: 这篇论文的主要结果是在线性MDP中学习近似最优策略,且只需 poly$(k,\log d)$ 次与环境的交互。此外,这篇论文还引入了模拟器(emulator)的概念,即环境转移的一种简洁近似表示,足以计算某些 Bellman 备份;该模拟器可以通过凸规划高效地计算得到。此外,这篇论文还给出了一种适用于块MDP的算法,可以在准多项式时间内学习近似最优策略,且只需多项式数量的样本。
    Abstract The key assumption underlying linear Markov Decision Processes (MDPs) is that the learner has access to a known feature map $\phi(x, a)$ that maps state-action pairs to $d$-dimensional vectors, and that the rewards and transitions are linear functions in this representation. But where do these features come from? In the absence of expert domain knowledge, a tempting strategy is to use the ``kitchen sink" approach and hope that the true features are included in a much larger set of potential features. In this paper we revisit linear MDPs from the perspective of feature selection. In a $k$-sparse linear MDP, there is an unknown subset $S \subset [d]$ of size $k$ containing all the relevant features, and the goal is to learn a near-optimal policy in only poly$(k,\log d)$ interactions with the environment. Our main result is the first polynomial-time algorithm for this problem. In contrast, earlier works either made prohibitively strong assumptions that obviated the need for exploration, or required solving computationally intractable optimization problems. Along the way we introduce the notion of an emulator: a succinct approximate representation of the transitions that suffices for computing certain Bellman backups. Since linear MDPs are a non-parametric model, it is not even obvious whether polynomial-sized emulators exist. We show that they do exist and can be computed efficiently via convex programming. As a corollary of our main result, we give an algorithm for learning a near-optimal policy in block MDPs whose decoding function is a low-depth decision tree; the algorithm runs in quasi-polynomial time and takes a polynomial number of samples. This can be seen as a reinforcement learning analogue of classic results in computational learning theory. Furthermore, it gives a natural model where improving the sample complexity via representation learning is computationally feasible.
    摘要 线性马尔可夫决策过程(MDP)的关键假设是学习者可以访问一个已知的特征映射 $\phi(x, a)$,它将状态-动作对映射为 $d$ 维向量,且奖励和转移在该表示下是线性函数。但这些特征从何而来?在缺乏专家领域知识的情况下,一个诱人的策略是采用"大杂烩"式做法,寄希望于真实特征被包含在一个大得多的候选特征集中。本文从特征选择的角度重新审视线性 MDP。在 $k$-稀疏线性 MDP 中,存在一个未知的大小为 $k$ 的子集 $S \subset [d]$,包含所有相关特征,目标是仅用 poly$(k, \log d)$ 次与环境的交互学习一个近似最优策略。我们的主要结果是该问题的首个多项式时间算法。相比之下,先前的工作要么做出了过强、从而无需探索的假设,要么需要求解计算上不可行的优化问题。在此过程中,我们引入了模拟器(emulator)的概念:一种对转移的简洁近似表示,足以计算某些 Bellman 备份。由于线性 MDP 是非参数模型,多项式规模的模拟器是否存在并非显然;我们证明它们确实存在,并且可以通过凸规划高效计算。作为主要结果的推论,我们给出了一个在块 MDP 中学习近似最优策略的算法,其解码函数是低深度决策树;该算法在准多项式时间内运行,且只需多项式数量的样本。这可以看作计算学习理论中经典结果在强化学习中的对应,同时给出了一个通过表示学习改进样本复杂度在计算上可行的自然模型。
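For reference, the linear-MDP assumption and the k-sparsity condition discussed above are usually written as follows; this is standard notation for these objects, stated here as an assumption rather than quoted from the paper.

```latex
% Linear MDP: transitions and rewards are linear in a known feature map.
\[
  P(x' \mid x, a) = \langle \phi(x,a), \mu(x') \rangle,
  \qquad
  r(x,a) = \langle \phi(x,a), \theta \rangle,
  \qquad \phi(x,a) \in \mathbb{R}^d.
\]
% k-sparsity: only an unknown subset S of the d coordinates is relevant.
\[
  \exists\, S \subset [d],\ |S| = k:\quad
  \mu(x')_j = 0 \ \text{and}\ \theta_j = 0 \quad \text{for all } j \notin S.
\]
% Goal: learn a near-optimal policy using only poly(k, log d) interactions.
```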

Are You Worthy of My Trust?: A Socioethical Perspective on the Impacts of Trustworthy AI Systems on the Environment and Human Society

  • paper_url: http://arxiv.org/abs/2309.09450
  • repo_url: None
  • paper_authors: Jamell Dacon
  • for: 本文旨在探讨人工智能系统在社会中的影响,以及如何通过多学科合作和系统性的审核来确保人工智能的可靠性。
  • methods: 本文使用了多学科的视角和系统性的审核方法来检视人工智能系统的社会影响。
  • results: 本文指出,人工智能系统的社会影响包括能源消耗和碳脚印,以及对用户的社会发展影响。此外,本文还探讨了人工智能系统的多学科风险和不可避免的社会影响。
    Abstract With ubiquitous exposure of AI systems today, we believe AI development requires crucial considerations to be deemed trustworthy. While the potential of AI systems is bountiful, it remains largely unknown, as do their risks. In this work, we offer a brief, high-level overview of societal impacts of AI systems. To do so, we highlight the requirement of multi-disciplinary governance and convergence throughout its lifecycle via critical systemic examinations (e.g., energy consumption), and later discuss induced effects on the environment (i.e., carbon footprint) and its users (i.e., social development). In particular, we consider these impacts from a multi-disciplinary perspective: computer science, sociology, environmental science, and so on to discuss its inter-connected societal risks and inability to simultaneously satisfy aspects of well-being. Therefore, we accentuate the necessity of holistically addressing pressing concerns of AI systems from a socioethical impact assessment perspective to explicate its harmful societal effects to truly enable humanity-centered Trustworthy AI.
    摘要 From a computer science perspective, we must examine the energy consumption of AI systems and their potential carbon footprint. From a sociological perspective, we must consider the impact of AI systems on social development and the potential for induced effects on users. Environmental science must also be taken into account, as AI systems may have a significant impact on the environment.To truly enable humanity-centered Trustworthy AI, we must address these concerns holistically, taking a socioethical impact assessment perspective to explicate the harmful societal effects of AI systems. By considering these impacts from a multi-disciplinary perspective, we can better understand the interconnected risks and challenges posed by AI systems and work towards developing Trustworthy AI that benefits society as a whole.

A Schedule of Duties in the Cloud Space Using a Modified Salp Swarm Algorithm

  • paper_url: http://arxiv.org/abs/2309.09441
  • repo_url: None
  • paper_authors: Hossein Jamali, Ponkoj Chandra Shill, David Feil-Seifer, Frederick C. Harris, Jr., Sergiu M. Dascalu
  • for: 这个论文的目的是提出一种基于集群智能算法的云计算任务调度算法,以提高云计算服务的效率和质量。
  • methods: 该论文使用了改进后的Salp Swarm Algorithm(SSA),并对其进行了比较与基于遗传算法(GA)、Particle Swarm Optimization(PSO)、连续ACO(ACO)等算法的性能比较。
  • results: 研究发现,提出的算法在云计算任务调度问题中总体性能更高;例如,与基本 SSA 相比,该算法的完工时间(makespan)平均减少约 21%。
    Abstract Cloud computing is a concept introduced in the information technology era, with the main components being the grid, distributed, and valuable computing. The cloud is being developed continuously and, naturally, comes up with many challenges, one of which is scheduling. A schedule or timeline is a mechanism used to optimize the time for performing a duty or set of duties. A scheduling process is accountable for choosing the best resources for performing a duty. The main goal of a scheduling algorithm is to improve the efficiency and quality of the service while at the same time ensuring the acceptability and effectiveness of the targets. The task scheduling problem is one of the most important NP-hard issues in the cloud domain and, so far, many techniques have been proposed as solutions, including using genetic algorithms (GAs), particle swarm optimization, (PSO), and ant colony optimization (ACO). To address this problem, in this paper, one of the collective intelligence algorithms, called the Salp Swarm Algorithm (SSA), has been expanded, improved, and applied. The performance of the proposed algorithm has been compared with that of GAs, PSO, continuous ACO, and the basic SSA. The results show that our algorithm has generally higher performance than the other algorithms. For example, compared to the basic SSA, the proposed method has an average reduction of approximately 21% in makespan.
    摘要 云计算是信息时代中提出的概念,其主要组成部分包括网格计算、分布式计算和效用计算。云在不断发展,自然也带来许多挑战,其中之一是调度。调度是一种用于优化执行某项或某组任务时间的机制,调度过程负责选择执行任务的最佳资源。调度算法的主要目标是在保证目标可接受且有效的同时,提高服务的效率和质量。任务调度问题是云计算领域中最重要的 NP 难问题之一,至今已有许多技术被提出作为解决方案,包括遗传算法(GA)、粒子群优化(PSO)和蚁群优化(ACO)。为解决这个问题,本文对一种群体智能算法——樽海鞘群算法(Salp Swarm Algorithm,SSA)进行了扩展、改进并加以应用。我们将所提算法的性能与 GA、PSO、连续 ACO 以及基本 SSA 进行了比较,结果表明我们的算法总体上优于其他算法。例如,与基本 SSA 相比,我们的方法使完工时间(makespan)平均减少约 21%。
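To make the algorithmic setting concrete, below is a compact sketch of a basic Salp Swarm Algorithm applied to mapping independent tasks onto virtual machines so as to minimize makespan. It follows the standard SSA update rules rather than the paper's modified variant, and the position-decoding scheme and makespan model are assumptions.

```python
# Basic SSA sketch for cloud task scheduling (minimize makespan).
import numpy as np

def makespan(position, task_len, n_vms):
    vm = np.clip(position.astype(int), 0, n_vms - 1)   # decode: task -> VM index
    return max(task_len[vm == m].sum() for m in range(n_vms))

def ssa_schedule(task_len, n_vms, pop=30, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(task_len)
    lb, ub = 0.0, float(n_vms)
    X = rng.uniform(lb, ub, size=(pop, n))
    best_x, best_f = None, np.inf
    for t in range(1, iters + 1):
        c1 = 2 * np.exp(-(4 * t / iters) ** 2)          # exploration/exploitation balance
        fitness = np.array([makespan(x, task_len, n_vms) for x in X])
        i = fitness.argmin()
        if fitness[i] < best_f:
            best_f, best_x = fitness[i], X[i].copy()
        for j in range(pop):
            if j < pop // 2:                            # leaders move around the food source
                c2, c3 = rng.uniform(size=n), rng.uniform(size=n)
                step = c1 * ((ub - lb) * c2 + lb)
                X[j] = best_x + np.where(c3 < 0.5, step, -step)
            else:                                       # followers average with the previous salp
                X[j] = (X[j] + X[j - 1]) / 2
            X[j] = np.clip(X[j], lb, ub - 1e-9)
    return best_x.astype(int), best_f

tasks = np.random.default_rng(1).integers(5, 50, size=40)
assignment, ms = ssa_schedule(tasks, n_vms=5)
print("makespan:", ms)
```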

Multi-Agent Deep Reinforcement Learning for Cooperative and Competitive Autonomous Vehicles using AutoDRIVE Ecosystem

  • paper_url: http://arxiv.org/abs/2309.10007
  • repo_url: None
  • paper_authors: Tanmay Vilas Samak, Chinmay Vilas Samak, Venkat Krovi
  • for: 该论文旨在开发一种模块化、可并行化的多智能体深度强化学习框架,用于在自动驾驶车辆中培养合作与竞争行为。
  • methods: 该论文使用 AutoDRIVE Ecosystem 来构建物理上精确、图形上逼真的 Nigel 与 F1TENTH(两种具有独特性质与能力的缩比自动驾驶平台)数字孪生,并在该生态系统中训练和部署多智能体强化学习策略。
  • results: 实验表明,在交叉路口通行问题和一对一自动驾驶竞速问题中,该框架能够学习到有效的合作与竞争行为,并在具有运动学约束和安全约束的随机环境中实现了稳健的训练与测试。
    Abstract This work presents a modular and parallelizable multi-agent deep reinforcement learning framework for imbibing cooperative as well as competitive behaviors within autonomous vehicles. We introduce AutoDRIVE Ecosystem as an enabler to develop physically accurate and graphically realistic digital twins of Nigel and F1TENTH, two scaled autonomous vehicle platforms with unique qualities and capabilities, and leverage this ecosystem to train and deploy multi-agent reinforcement learning policies. We first investigate an intersection traversal problem using a set of cooperative vehicles (Nigel) that share limited state information with each other in single as well as multi-agent learning settings using a common policy approach. We then investigate an adversarial head-to-head autonomous racing problem using a different set of vehicles (F1TENTH) in a multi-agent learning setting using an individual policy approach. In either set of experiments, a decentralized learning architecture was adopted, which allowed robust training and testing of the approaches in stochastic environments, since the agents were mutually independent and exhibited asynchronous motion behavior. The problems were further aggravated by providing the agents with sparse observation spaces and requiring them to sample control commands that implicitly satisfied the imposed kinodynamic as well as safety constraints. The experimental results for both problem statements are reported in terms of quantitative metrics and qualitative remarks for training as well as deployment phases.
    摘要 In the first set of experiments, we investigate an intersection traversal problem using a set of cooperative vehicles (Nigel) that share limited state information with each other. We use a common policy approach in both single and multi-agent learning settings. In the second set of experiments, we investigate an adversarial head-to-head autonomous racing problem using a different set of vehicles (F1TENTH) in a multi-agent learning setting using an individual policy approach.In both sets of experiments, we adopt a decentralized learning architecture that allows for robust training and testing of the approaches in stochastic environments. The agents are mutually independent and exhibit asynchronous motion behavior, which adds complexity to the problems. To make the problems more challenging, we provide the agents with sparse observation spaces and require them to sample control commands that implicitly satisfy the imposed kinodynamic and safety constraints.The experimental results for both problem statements are reported in terms of quantitative metrics and qualitative remarks for training and deployment phases. The results demonstrate the effectiveness of the proposed framework in imbibing cooperative and competitive behaviors in autonomous vehicles.

The Optimized path for the public transportation of Incheon in South Korea

  • paper_url: http://arxiv.org/abs/2309.10006
  • repo_url: None
  • paper_authors: Soroor Malekmohammadi faradunbeh, Hongle Li, Mangkyu Kang, Choongjae Iim
  • for: 本研究旨在为沟通系统路径选择优化,以满足乘客需求。
  • methods: 我们提出了一种基于修改A*算法的路径找路方法,可以在实时中找到大量数据点的最短路径。
  • results: 我们的方法比基本路径找路算法如基因算法和迪克斯特拉算法更好,可以快速找到最短路径。
    Abstract Path-finding is one of the most popular subjects in the field of computer science. Pathfinding strategies determine a path from a given coordinate to another. The focus of this paper is on finding the optimal path for the bus transportation system based on passenger demand. This study is based on bus stations in Incheon, South Korea, and we show that our modified A* algorithm performs better than other basic pathfinding algorithms such as the Genetic and Dijkstra. Our proposed approach can find the shortest path in real-time even for large amounts of data(points).
    摘要 寻路是计算机科学领域中最受关注的主题之一。寻路策略用于确定从给定坐标到另一坐标的路径。本文的重点是基于乘客需求为公交系统寻找最优路径。本研究基于韩国仁川市的公交站点数据,结果表明我们改进的 A* 算法优于遗传算法和 Dijkstra 等基础寻路算法,即使面对大量数据点也能实时找到最短路径。
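For readers unfamiliar with the baseline, a minimal A* implementation over a small bus-stop graph is sketched below; the straight-line heuristic and toy graph are illustrative stand-ins, not the paper's modified algorithm or the Incheon data.

```python
# Minimal A* over a weighted graph with a Euclidean heuristic.
import heapq, math

def a_star(graph, coords, start, goal):
    """graph: {node: [(neighbor, edge_cost), ...]}, coords: {node: (x, y)}."""
    h = lambda n: math.dist(coords[n], coords[goal])     # admissible straight-line heuristic
    frontier = [(h(start), 0.0, start, [start])]
    seen = {}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if seen.get(node, math.inf) <= g:
            continue
        seen[node] = g
        for nxt, w in graph.get(node, []):
            heapq.heappush(frontier, (g + w + h(nxt), g + w, nxt, path + [nxt]))
    return None, math.inf

coords = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (2, 1)}
graph = {"A": [("B", 1.0), ("C", 1.8)], "B": [("C", 1.2), ("D", 2.0)], "C": [("D", 1.0)]}
print(a_star(graph, coords, "A", "D"))   # -> (['A', 'C', 'D'], 2.8)
```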

FactoFormer: Factorized Hyperspectral Transformers with Self-Supervised Pre-Training

  • paper_url: http://arxiv.org/abs/2309.09431
  • repo_url: https://github.com/csiro-robotics/factoformer
  • paper_authors: Shaheer Mohamed, Maryam Haghighat, Tharindu Fernando, Sridha Sridharan, Clinton Fookes, Peyman Moghadam
  • for: 本研究旨在利用 Transformer 提升高光谱图像分类任务的性能。
  • methods: 本研究使用了一种新的分解 spectral-spatial transformer,并提出了一种基于自我supervised pre-training的方法,以及一种基于masking的预训练策略。
  • results: 实验结果表明,该模型在三个公开的 dataset 上的分类任务中均达到了状态册的性能。
    Abstract Hyperspectral images (HSIs) contain rich spectral and spatial information. Motivated by the success of transformers in the field of natural language processing and computer vision where they have shown the ability to learn long range dependencies within input data, recent research has focused on using transformers for HSIs. However, current state-of-the-art hyperspectral transformers only tokenize the input HSI sample along the spectral dimension, resulting in the under-utilization of spatial information. Moreover, transformers are known to be data-hungry and their performance relies heavily on large-scale pre-training, which is challenging due to limited annotated hyperspectral data. Therefore, the full potential of HSI transformers has not been fully realized. To overcome these limitations, we propose a novel factorized spectral-spatial transformer that incorporates factorized self-supervised pre-training procedures, leading to significant improvements in performance. The factorization of the inputs allows the spectral and spatial transformers to better capture the interactions within the hyperspectral data cubes. Inspired by masked image modeling pre-training, we also devise efficient masking strategies for pre-training each of the spectral and spatial transformers. We conduct experiments on three publicly available datasets for HSI classification task and demonstrate that our model achieves state-of-the-art performance in all three datasets. The code for our model will be made available at https://github.com/csiro-robotics/factoformer.
    摘要 高光谱图像(HSI)含有丰富的光谱和空间信息。受 Transformer 在自然语言处理和计算机视觉领域(其擅长学习输入数据中的长程依赖)取得成功的启发,近期研究开始将 Transformer 用于高光谱图像。然而,当前最先进的高光谱 Transformer 只沿光谱维度对输入样本进行 token 化,导致空间信息利用不足。此外,Transformer 对数据需求量大,其性能严重依赖大规模预训练,而带标注的高光谱数据十分有限,因此高光谱 Transformer 的潜力尚未被充分发挥。为了解决这些限制,我们提出了一种新颖的分解式光谱-空间 Transformer,并结合分解式自监督预训练过程,带来显著的性能提升。对输入的分解使光谱与空间 Transformer 能够更好地捕捉高光谱数据立方体内部的交互。受掩码图像建模预训练的启发,我们还为光谱和空间两个 Transformer 分别设计了高效的掩码预训练策略。我们在三个公开的高光谱图像分类数据集上进行了实验,证明我们的模型在全部三个数据集上都达到了最先进的性能。代码将在 https://github.com/csiro-robotics/factoformer 上公开。

Joint Demosaicing and Denoising with Double Deep Image Priors

  • paper_url: http://arxiv.org/abs/2309.09426
  • repo_url: None
  • paper_authors: Taihui Li, Anish Lahiri, Yutong Dai, Owen Mayer
  • for: joint demosaicing and denoising of RAW images
  • methods: direct injection of prior ( DoubleDIP) without requiring training data
  • results: consistently outperforms other compared methods in terms of PSNR, SSIM, and qualitative visual perception
    Abstract Demosaicing and denoising of RAW images are crucial steps in the processing pipeline of modern digital cameras. As only a third of the color information required to produce a digital image is captured by the camera sensor, the process of demosaicing is inherently ill-posed. The presence of noise further exacerbates this problem. Performing these two steps sequentially may distort the content of the captured RAW images and accumulate errors from one step to another. Recent deep neural-network-based approaches have shown the effectiveness of joint demosaicing and denoising to mitigate such challenges. However, these methods typically require a large number of training samples and do not generalize well to different types and intensities of noise. In this paper, we propose a novel joint demosaicing and denoising method, dubbed JDD-DoubleDIP, which operates directly on a single RAW image without requiring any training data. We validate the effectiveness of our method on two popular datasets -- Kodak and McMaster -- with various noises and noise intensities. The experimental results show that our method consistently outperforms other compared methods in terms of PSNR, SSIM, and qualitative visual perception.
    摘要 去马赛克和去噪是现代数码相机处理流程中的关键步骤。由于相机传感器只捕获了生成数字图像所需颜色信息的三分之一,去马赛克本质上是一个病态问题,而噪声的存在进一步加剧了这一问题。顺序执行这两个步骤可能会扭曲所捕获 RAW 图像的内容,并使误差在步骤间累积。近期基于深度神经网络的方法显示了联合去马赛克与去噪在缓解这些挑战上的有效性,但这类方法通常需要大量训练样本,且难以泛化到不同类型和强度的噪声。本文提出了一种新的联合去马赛克与去噪方法 JDD-DoubleDIP,它直接在单张 RAW 图像上运行,无需任何训练数据。我们在两个常用数据集(Kodak 和 McMaster)上,针对多种噪声类型与强度验证了该方法的有效性;实验结果表明,我们的方法在 PSNR、SSIM 和主观视觉质量方面均稳定优于其他对比方法。
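The following is a hedged sketch of the underlying deep-image-prior mechanism: an untrained CNN with a fixed random input is optimized so that its RGB output, re-mosaicked through a Bayer mask, matches the noisy RAW observation. The tiny network, the RGGB layout, and the hyperparameters are assumptions; this is not the paper's DoubleDIP architecture.

```python
# Deep-image-prior style joint demosaicing + denoising on a toy observation.
import torch
import torch.nn as nn

def bayer_mask(h, w):                       # RGGB mosaic expressed as a 3-channel mask
    m = torch.zeros(3, h, w)
    m[0, 0::2, 0::2] = 1                    # R
    m[1, 0::2, 1::2] = 1                    # G
    m[1, 1::2, 0::2] = 1                    # G
    m[2, 1::2, 1::2] = 1                    # B
    return m

h = w = 64
mask = bayer_mask(h, w)
raw = torch.rand(1, 3, h, w) * mask         # toy noisy mosaicked observation

net = nn.Sequential(                        # tiny stand-in for the DIP network
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
)
z = torch.randn(1, 32, h, w)                # fixed random code, never updated
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(200):                     # early stopping acts as the denoiser
    opt.zero_grad()
    rgb = net(z)
    loss = ((rgb * mask - raw) ** 2).mean() # data term only through the mosaic operator
    loss.backward()
    opt.step()
print("final reconstruction loss:", float(loss))
```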

Causal Discovery and Prediction: Methods and Algorithms

  • paper_url: http://arxiv.org/abs/2309.09416
  • repo_url: None
  • paper_authors: Gilles Blondel
  • for: 这个博士论文是为了提出一种能够在很多不同情况下快速和有效地发现 causal model的算法,并且可以避免在实际世界中进行不必要的实验。
  • methods: 这个算法使用了一种通过设计简单的 experiment 来评估每个可能的 intervention 的 cost,并且使用了一种最低成本序列的 intervention 来找出 causal relations。
  • results: 这个算法可以在大多数情况下使用相对便宜的 intervention 来排除大量 causal model candidate,并且可以确保在找出 causal effects 时不会产生很多不必要的实验。此外,这个算法还可以在有限的试验次数下完成 causal discovery。
    Abstract We are not only observers but also actors of reality. Our capability to intervene and alter the course of some events in the space and time surrounding us is an essential component of how we build our model of the world. In this doctoral thesis we introduce a generic a-priori assessment of each possible intervention, in order to select the most cost-effective interventions only, and avoid unnecessary systematic experimentation on the real world. Based on this a-priori assessment, we propose an active learning algorithm that identifies the causal relations in any given causal model, using a least cost sequence of interventions. There are several novel aspects introduced by our algorithm. It is, in most case scenarios, able to discard many causal model candidates using relatively inexpensive interventions that only test one value of the intervened variables. Also, the number of interventions performed by the algorithm can be bounded by the number of causal model candidates. Hence, fewer initial candidates (or equivalently, more prior knowledge) lead to fewer interventions for causal discovery. Causality is intimately related to time, as causes appear to precede their effects. Cyclical causal processes are a very interesting case of causality in relation to time. In this doctoral thesis we introduce a formal analysis of time cyclical causal settings by defining a causal analog to the purely observational Dynamic Bayesian Networks, and provide a sound and complete algorithm for the identification of causal effects in the cyclic setting. We introduce the existence of two types of hidden confounder variables in this framework, which affect in substantially different ways the identification procedures, a distinction with no analog in either Dynamic Bayesian Networks or standard causal graphs.
    摘要 我们不仅是现实的观察者,也是现实的行动者。我们有能力干预并改变我们周围时空中某些事件的进程,这是我们构建世界模型的重要组成部分。在这篇博士论文中,我们引入了一种对每个可能干预的通用先验评估,以便只选择最具成本效益的干预,避免对现实世界进行不必要的系统性实验。基于这种先验评估,我们提出了一种主动学习算法,使用成本最低的干预序列来识别任意给定因果模型中的因果关系。我们的算法有几个新颖之处:在大多数情况下,它能够通过只测试被干预变量单个取值的相对廉价干预来排除许多因果模型候选;此外,算法执行的干预次数可以用因果模型候选的数量来界定,因此更少的初始候选(或等价地,更多的先验知识)意味着因果发现所需的干预更少。因果关系与时间密切相关,因为原因似乎总是先于其结果出现,而循环因果过程是因果与时间关系中非常有趣的一类情形。在这篇博士论文中,我们通过定义与纯观测的动态贝叶斯网络相对应的因果类比,对时间上循环的因果设定进行了形式化分析,并给出了在循环设定下识别因果效应的可靠且完备的算法。我们还在该框架中引入了两类隐藏混淆变量,它们以本质不同的方式影响识别过程,这一区分在动态贝叶斯网络和标准因果图中均没有对应。

Does Video Summarization Require Videos? Quantifying the Effectiveness of Language in Video Summarization

  • paper_url: http://arxiv.org/abs/2309.09405
  • repo_url: None
  • paper_authors: Yoonsoo Nam, Adam Lehavi, Daniel Yang, Digbalay Bose, Swabha Swayamdipta, Shrikanth Narayanan
  • for: 这个论文是为了开发一个高效、语言基于的视频摘要器,以提高计算机视觉中的视频摘要效果。
  • methods: 该论文使用了语言变换器模型,只使用文本描述来进行训练,而不需要图像表示。文本描述通过零批量学习获取,并通过筛选表示文本vector进行压缩。
  • results: 该论文可以具有更高的数据效率和解释性,并且可以保持与传统方法相比的结果水平。在模式和数据压缩方面,研究发现,只使用语言模式可以有效地减少输入数据处理量,而不会导致结果下降。
    Abstract Video summarization remains a huge challenge in computer vision due to the size of the input videos to be summarized. We propose an efficient, language-only video summarizer that achieves competitive accuracy with high data efficiency. Using only textual captions obtained via a zero-shot approach, we train a language transformer model and forego image representations. This method allows us to perform filtration amongst the representative text vectors and condense the sequence. With our approach, we gain explainability with natural language that comes easily for human interpretation and textual summaries of the videos. An ablation study that focuses on modality and data compression shows that leveraging text modality only effectively reduces input data processing while retaining comparable results.
    摘要 由于待摘要的输入视频规模庞大,视频摘要仍然是计算机视觉领域的一大挑战。我们提出了一种高效的、仅使用语言的视频摘要器,以较高的数据效率达到有竞争力的准确率。我们仅使用通过零样本方法获取的文本描述来训练一个语言 Transformer 模型,而不使用图像表示。这种方法使我们能够在具有代表性的文本向量之间进行筛选,并压缩序列。借助这种方法,我们获得了易于人类理解的自然语言可解释性以及视频的文本摘要。针对模态与数据压缩的消融研究表明,仅利用文本模态即可有效减少输入数据处理量,同时保持可比的结果。
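A toy sketch of the language-only pipeline described above: embed per-shot captions, cluster them, and keep one representative caption per cluster as the condensed summary. TF-IDF and k-means are deliberately simple stand-ins for the paper's transformer-based representations; the captions are made up for illustration.

```python
# Caption filtering/condensation via clustering of text vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

captions = [
    "a man opens the car door", "a man gets into the car", "the car drives down a road",
    "the car stops at a gas station", "a man fills the tank", "the car drives away",
]
X = TfidfVectorizer().fit_transform(captions)
k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

summary = []
for c in range(k):
    idx = np.where(km.labels_ == c)[0]
    centroid = km.cluster_centers_[c]
    best = idx[np.argmax(X[idx] @ centroid)]      # caption closest to the cluster centroid
    summary.append((int(best), captions[best]))
print(sorted(summary))                             # keep temporal order in the summary
```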

(Deployed Application) Promoting Research Collaboration with Open Data Driven Team Recommendation in Response to Call for Proposals

  • paper_url: http://arxiv.org/abs/2309.09404
  • repo_url: None
  • paper_authors: Siva Likitha Valluru, Biplav Srivastava, Sai Teja Paladi, Siwen Yan, Sriraam Natarajan
  • for: The paper aims to recommend teams for collaborative research opportunities in response to funding agencies’ calls for proposals.
  • methods: The system uses various AI methods, including skill extraction from open data and taxonomies, to match demand and supply and create balanced teams that maximize goodness.
  • results: The system was validated through quantitative and qualitative evaluations, showing that it recommends smaller but higher-quality teams, and was found useful and relevant by users in a large-scale user study.
    Abstract Building teams and promoting collaboration are two very common business activities. An example of these are seen in the TeamingForFunding problem, where research institutions and researchers are interested to identify collaborative opportunities when applying to funding agencies in response to latter's calls for proposals. We describe a novel system to recommend teams using a variety of AI methods, such that (1) each team achieves the highest possible skill coverage that is demanded by the opportunity, and (2) the workload of distributing the opportunities is balanced amongst the candidate members. We address these questions by extracting skills latent in open data of proposal calls (demand) and researcher profiles (supply), normalizing them using taxonomies, and creating efficient algorithms that match demand to supply. We create teams to maximize goodness along a novel metric balancing short- and long-term objectives. We validate the success of our algorithms (1) quantitatively, by evaluating the recommended teams using a goodness score and find that more informed methods lead to recommendations of smaller number of teams but higher goodness, and (2) qualitatively, by conducting a large-scale user study at a college-wide level, and demonstrate that users overall found the tool very useful and relevant. Lastly, we evaluate our system in two diverse settings in US and India (of researchers and proposal calls) to establish generality of our approach, and deploy it at a major US university for routine use.
    摘要 组建团队和促进协作是非常常见的业务活动。TeamingForFunding 问题就是一个例子:研究机构和研究人员希望在响应资助机构的提案征集时识别合作机会。我们描述了一个利用多种 AI 方法推荐团队的新系统,使得(1)每个团队尽可能覆盖该机会所要求的技能,且(2)机会分配的工作量在候选成员之间保持均衡。为此,我们从提案征集(需求)和研究人员档案(供应)的开放数据中提取隐含技能,利用分类体系对其进行归一化,并设计高效的算法来匹配供需。我们按照一种兼顾短期与长期目标的新指标来组建团队,以最大化整体"好度"。我们从两方面验证了算法的成功:(1)定量方面,使用好度得分评估推荐团队,发现信息更充分的方法会推荐数量更少但好度更高的团队;(2)定性方面,我们在全校范围内开展了大规模用户研究,结果显示用户总体认为该工具非常有用且切题。最后,我们在美国和印度两个差异较大的环境(研究人员与提案征集)中评估了系统以证明方法的通用性,并已将其部署在一所美国主要大学中供日常使用。

Selecting which Dense Retriever to use for Zero-Shot Search

  • paper_url: http://arxiv.org/abs/2309.09403
  • repo_url: None
  • paper_authors: Ekaterina Khramtsova, Shengyao Zhuang, Mahsa Baktashmotlagh, Xi Wang, Guido Zuccon
  • for: 本研究针对在没有任何标注数据(即零样本设置)的新文档集合上进行检索时,如何选择最合适的稠密检索模型的问题。
  • methods: 作者们尝试了计算机视觉和机器学习领域中在领域偏移下评估无监督性能的最新方法,但这些方法并不适用于在零样本设置下挑选高性能的稠密检索器。
  • results: 实验表明,现有方法无法在零样本设置下有效地选择稠密检索器;这是一个重要的新问题,解决它将有助于稠密检索的广泛应用。
    Abstract We propose the new problem of choosing which dense retrieval model to use when searching on a new collection for which no labels are available, i.e. in a zero-shot setting. Many dense retrieval models are readily available. Each model however is characterized by very differing search effectiveness -- not just on the test portion of the datasets in which the dense representations have been learned but, importantly, also across different datasets for which data was not used to learn the dense representations. This is because dense retrievers typically require training on a large amount of labeled data to achieve satisfactory search effectiveness in a specific dataset or domain. Moreover, effectiveness gains obtained by dense retrievers on datasets for which they are able to observe labels during training, do not necessarily generalise to datasets that have not been observed during training. This is however a hard problem: through empirical experimentation we show that methods inspired by recent work in unsupervised performance evaluation with the presence of domain shift in the area of computer vision and machine learning are not effective for choosing highly performing dense retrievers in our setup. The availability of reliable methods for the selection of dense retrieval models in zero-shot settings that do not require the collection of labels for evaluation would allow to streamline the widespread adoption of dense retrieval. This is therefore an important new problem we believe the information retrieval community should consider. Implementation of methods, along with raw result files and analysis scripts are made publicly available at https://www.github.com/anonymized.
    摘要 我们提出了一个新的问题:在没有任何标签可用的新集合上进行检索时(即零样本设置),应选择哪个稠密检索模型。可用的稠密检索模型很多,但每个模型的检索效果差异很大,不仅体现在其学习稠密表示所用数据集的测试部分上,更重要的是也体现在训练时未曾使用的其他数据集上。这是因为稠密检索器通常需要在大量标注数据上训练,才能在特定数据集或领域中达到令人满意的检索效果;而且在训练时可见标签的数据集上获得的效果提升,并不一定能推广到训练时未见过的数据集。这是一个困难的问题:我们通过实验表明,借鉴计算机视觉与机器学习领域近期关于领域偏移下无监督性能评估的方法,并不能在我们的设置中选出高性能的稠密检索器。如果存在无需收集评估标签即可在零样本设置下选择稠密检索模型的可靠方法,将有助于稠密检索的广泛应用。因此,我们认为这是信息检索社区应当关注的一个重要新问题。方法实现、原始结果文件和分析脚本公开于 https://www.github.com/anonymized。

cs.CL - 2023-09-18

Few-Shot Adaptation for Parsing Contextual Utterances with LLMs

  • paper_url: http://arxiv.org/abs/2309.10168
  • repo_url: https://github.com/microsoft/few_shot_adaptation_for_parsing_contextual_utterances_with_llms
  • paper_authors: Kevin Lin, Patrick Xia, Hao Fang
  • for: 这个论文主要探讨了基于大语言模型(LLM)的语义解析器在实际场景中如何处理上下文语言。
  • methods: 论文提出了四种主要方法来处理上下文语言,即 Parse-with-Utterance-History、Parse-with-Reference-Program、Parse-then-Resolve 和 Rewrite-then-Parse。
  • results: 实验表明,使用 Rewrite-then-Parse 方法可以在考虑解析精度、注释成本和错误类型的情况下取得最佳效果。
    Abstract We evaluate the ability of semantic parsers based on large language models (LLMs) to handle contextual utterances. In real-world settings, there typically exists only a limited number of annotated contextual utterances due to annotation cost, resulting in an imbalance compared to non-contextual utterances. Therefore, parsers must adapt to contextual utterances with a few training examples. We examine four major paradigms for doing so in conversational semantic parsing i.e., Parse-with-Utterance-History, Parse-with-Reference-Program, Parse-then-Resolve, and Rewrite-then-Parse. To facilitate such cross-paradigm comparisons, we construct SMCalFlow-EventQueries, a subset of contextual examples from SMCalFlow with additional annotations. Experiments with in-context learning and fine-tuning suggest that Rewrite-then-Parse is the most promising paradigm when holistically considering parsing accuracy, annotation cost, and error types.
    摘要 我们评估基于大语言模型(LLM)的语义解析器处理上下文相关话语的能力。在实际场景中,由于标注成本,带标注的上下文相关话语通常很少,与非上下文话语相比存在数量失衡,因此解析器必须用少量训练样例适应上下文相关话语。我们考察了对话式语义解析中实现这一点的四种主要范式,即 Parse-with-Utterance-History、Parse-with-Reference-Program、Parse-then-Resolve 和 Rewrite-then-Parse。为便于这种跨范式比较,我们构建了 SMCalFlow-EventQueries,这是从 SMCalFlow 中选出并补充了额外标注的上下文示例子集。基于上下文学习和微调的实验表明,在综合考虑解析准确率、标注成本和错误类型时,Rewrite-then-Parse 是最有前景的范式。
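The Rewrite-then-Parse paradigm favored by the experiments can be sketched as a two-stage prompt pipeline: first rewrite the contextual utterance into a self-contained one, then parse the rewrite. In the sketch below, `call_llm` is a placeholder for whatever LLM client is used, and both prompt templates are illustrative assumptions rather than the paper's actual prompts.

```python
# Rewrite-then-Parse as a two-stage prompting pipeline (hypothetical prompts).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

REWRITE_PROMPT = (
    "Rewrite the last utterance so it can be understood without the history.\n"
    "History: {history}\nUtterance: {utterance}\nRewrite:"
)
PARSE_PROMPT = "Map the request to a program.\nRequest: {utterance}\nProgram:"

def rewrite_then_parse(history, utterance):
    standalone = call_llm(REWRITE_PROMPT.format(history=" | ".join(history), utterance=utterance))
    return call_llm(PARSE_PROMPT.format(utterance=standalone))

# e.g. rewrite_then_parse(["create a meeting with Alice tomorrow"], "move it to 3pm")
# would first produce something like "move the meeting with Alice tomorrow to 3pm",
# and then parse that standalone request.
```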

Understanding Catastrophic Forgetting in Language Models via Implicit Inference

  • paper_url: http://arxiv.org/abs/2309.10105
  • repo_url: https://github.com/kothasuhas/understanding-forgetting
  • paper_authors: Suhas Kotha, Jacob Mitchell Springer, Aditi Raghunathan
  • for: 本研究旨在 investigating the effects of fine-tuning on language models’ performance on tasks outside the fine-tuning distribution.
  • methods: 研究者采用了 instruction-tuning 和 reinforcement learning from human feedback 等方法进行 fine-tuning,并使用了 conjugate prompting 来测试 Language Models 的能力。
  • results: 研究发现,提升微调数据分布内任务的性能,会以牺牲模型在其他任务上的能力为代价,这种退化在与微调分布最接近的任务上尤为明显。此外,研究者发现可以通过 conjugate prompting 系统地恢复语言模型的预训练能力。
    Abstract Fine-tuning (via methods such as instruction-tuning or reinforcement learning from human feedback) is a crucial step in training language models to robustly carry out tasks of interest. However, we lack a systematic understanding of the effects of fine-tuning, particularly on tasks outside the narrow fine-tuning distribution. In a simplified scenario, we demonstrate that improving performance on tasks within the fine-tuning data distribution comes at the expense of suppressing model capabilities on other tasks. This degradation is especially pronounced for tasks "closest" to the fine-tuning distribution. We hypothesize that language models implicitly infer the task of the prompt corresponds, and the fine-tuning process predominantly skews this task inference towards tasks in the fine-tuning distribution. To test this hypothesis, we propose Conjugate Prompting to see if we can recover pretrained capabilities. Conjugate prompting artificially makes the task look farther from the fine-tuning distribution while requiring the same capability. We find that conjugate prompting systematically recovers some of the pretraining capabilities on our synthetic setup. We then apply conjugate prompting to real-world LLMs using the observation that fine-tuning distributions are typically heavily skewed towards English. We find that simply translating the prompts to different languages can cause the fine-tuned models to respond like their pretrained counterparts instead. This allows us to recover the in-context learning abilities lost via instruction tuning, and more concerningly, to recover harmful content generation suppressed by safety fine-tuning in chatbots like ChatGPT.
    摘要 精度调整(如 instrucion-tuning 或人类反馈学习)是训练语言模型完成任务的关键步骤。然而,我们缺乏对精度调整的系统性理解,特别是在外部精度调整分布之外的任务上。在简化的场景中,我们发现,通过提高 task 内的性能,模型在其他任务上的表现会受到抑制。这种干扰特别明显于 task 与 fine-tuning 数据分布之间最相似的任务上。我们假设语言模型会隐式地推理出提示中的任务类型,而 fine-tuning 过程会主要偏向于 fine-tuning 数据分布中的任务类型。为了测试这一假设,我们提出了 conjugate prompting。 conjugate prompting 通过人工地使任务看起来更加远离 fine-tuning 数据分布,并且需要同样的能力。我们发现 conjugate prompting 系统地恢复了一些预训练能力。然后,我们应用 conjugate prompting 于实际世界的 LLMs,基于观察, fine-tuning 分布通常倾斜到英语。我们发现,只需将提示翻译成不同语言,就能让 fine-tuned 模型回归到预训练模型的状态。这允许我们恢复在 instruction tuning 中失去的 Context Learning 能力,以及在 chatbots 中被安全 fine-tuning 抑制的危险内容生成能力。

Hierarchy Builder: Organizing Textual Spans into a Hierarchy to Facilitate Navigation

  • paper_url: http://arxiv.org/abs/2309.10057
  • repo_url: None
  • paper_authors: Itay Yair, Hillel Taub-Tabib, Yoav Goldberg
  • for: 本研究旨在提供一种方法,以便在探索 Setting 中,用户可以快速获得各种相关信息的概述,同时还能深入探究一些具体的方面。
  • methods: 本研究使用了一种组合分组和层次结构生成的方法,将相似的项集成到一起,并将剩下的项排序成一个可导航的 DAG 结构。
  • results: 本研究应用于医疗信息抽取,可以帮助用户快速获得医疗信息的概述,并且可以深入探究具体的方面。
    Abstract Information extraction systems often produce hundreds to thousands of strings on a specific topic. We present a method that facilitates better consumption of these strings, in an exploratory setting in which a user wants to both get a broad overview of what's available, and a chance to dive deeper on some aspects. The system works by grouping similar items together and arranging the remaining items into a hierarchical navigable DAG structure. We apply the method to medical information extraction.
    摘要 信息提取系统经常针对特定主题生成成百上千条字符串。我们提出了一种方法,帮助用户在探索性场景中更好地消化这些字符串:用户既希望获得可用内容的整体概览,又希望能对某些方面深入探究。系统通过将相似的项聚合在一起,并将其余项组织成层次化、可导航的 DAG 结构,使用户能够方便地浏览和探索。我们将该方法应用于医疗信息提取。

Multimodal Foundation Models: From Specialists to General-Purpose Assistants

  • paper_url: http://arxiv.org/abs/2309.10020
  • repo_url: None
  • paper_authors: Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao
  • for: 这篇论文旨在概述多模态基础模型的分类和演化,强调从专家模型转化为通用助手。
  • methods: 论文涵盖了五个核心研究领域,分为两类:一是已有的研究领域,包括两个主题:学习视觉基础模型 для视觉理解和文本到图生成;二是现代探索性研究领域,包括三个主题:基于大语言模型的统一视觉模型、端到端训练多模态语言模型、将多模态工具与语言模型串联。
  • results: 论文的目标受众是计算机视觉和视觉语言多模态研究人员,包括研究生、博士生和专业人士,他们想要了解多模态基础模型的基础知识和最新进展。
    Abstract This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes. (i) We start with a survey of well-established research areas: multimodal foundation models pre-trained for specific purposes, including two topics -- methods of learning vision backbones for visual understanding and text-to-image generation. (ii) Then, we present recent advances in exploratory, open research areas: multimodal foundation models that aim to play the role of general-purpose assistants, including three topics -- unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs. The target audiences of the paper are researchers, graduate students, and professionals in computer vision and vision-language multimodal communities who are eager to learn the basics and recent advances in multimodal foundation models.
    摘要 本文对展现视觉与视觉-语言能力的多模态基础模型的分类体系与演化进行了全面综述,重点关注其从专用模型向通用助手的转变。研究版图涵盖五个核心主题,分为两大类。(一)我们首先综述已较为成熟的研究方向:为特定目的预训练的多模态基础模型,包括两个主题——用于视觉理解的视觉骨干网络学习方法,以及文本到图像生成。(二)随后我们介绍仍处于探索阶段的开放研究方向:旨在扮演通用助手角色的多模态基础模型,包括三个主题——受大语言模型(LLM)启发的统一视觉模型、端到端训练的多模态 LLM,以及将多模态工具与 LLM 串联。本文的目标读者是计算机视觉与视觉-语言多模态领域的研究人员、研究生和从业者,帮助他们了解多模态基础模型的基础知识与最新进展。

An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

  • paper_url: http://arxiv.org/abs/2309.09958
  • repo_url: https://github.com/haotian-liu/LLaVA
  • paper_authors: Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, Yelong Shen
  • for: 这个论文是为了研究开源大型多模态模型(LLaVA和MiniGPT-4)的可视 instrucion 调教进行 empirical 研究,以便为未来的研究提供更强的基准。
  • methods: 这个论文使用了扩大 LLVA 的参数量至 33B 和 65B/70B,并研究了LoRA/QLoRA 等参数效率训练方法的影响。
  • results: 研究发现,扩大 LMM 的表现和语言能力有显著提升,LoRA/QLoRA 的训练方法与全模型精度调教的性能相当,而高像素分辨率和混合多Modal-语言数据也有助于提高 LMM 表现。
    Abstract Visual instruction tuning has recently shown encouraging progress with open-source large multimodal models (LMM) such as LLaVA and MiniGPT-4. However, most existing studies of open-source LMM are performed using models with 13B parameters or smaller. In this paper we present an empirical study of scaling LLaVA up to 33B and 65B/70B, and share our findings from our explorations in image resolution, data mixing and parameter-efficient training methods such as LoRA/QLoRA. These are evaluated by their impact on the multi-modal and language capabilities when completing real-world tasks in the wild. We find that scaling LMM consistently enhances model performance and improves language capabilities, and performance of LoRA/QLoRA tuning of LMM are comparable to the performance of full-model fine-tuning. Additionally, the study highlights the importance of higher image resolutions and mixing multimodal-language data to improve LMM performance, and visual instruction tuning can sometimes improve LMM's pure language capability. We hope that this study makes state-of-the-art LMM research at a larger scale more accessible, thus helping establish stronger baselines for future research. Code and checkpoints will be made public.
    摘要 视觉指令微调最近借助开源大型多模态模型(LMM,如 LLaVA 和 MiniGPT-4)取得了令人鼓舞的进展。然而,现有关于开源 LMM 的研究大多使用 13B 参数或更小的模型。在这篇论文中,我们对 LLaVA 进行了实证性的规模扩展研究,将其扩展到 33B 和 65B/70B,并分享我们在图像分辨率、数据混合以及 LoRA/QLoRA 等参数高效训练方法方面的探索发现。这些发现依据其在完成真实世界任务时对多模态与语言能力的影响来评估。我们发现,扩大 LMM 规模能够持续提升模型性能并改善语言能力,且 LoRA/QLoRA 微调的性能与全模型微调相当。此外,研究还表明更高的图像分辨率和混合多模态-语言数据有助于提升 LMM 性能,而视觉指令微调有时还能提升 LMM 的纯语言能力。我们希望这项研究能够让更大规模的前沿 LMM 研究更加触手可及,从而为未来研究建立更强的基线。代码和模型检查点将公开发布。

Speaker attribution in German parliamentary debates with QLoRA-adapted large language models

  • paper_url: http://arxiv.org/abs/2309.09902
  • repo_url: None
  • paper_authors: Tobias Bornheim, Niklas Grieger, Patrick Gustav Blaneck, Stephan Bialonski
  • for: 这个论文旨在提高德国议会辩论中的自动发言人分配,以便更好地进行计算文本分析。
  • methods: 作者使用了大型自然语言模型家族Llama 2,并使用QLoRA的高效训练策略进行细化。
  • results: 研究表明,使用 Llama 2 可以在自动发言人归属任务上取得有竞争力的性能,为政治话语的计算分析以及语义角色标注系统的发展提供了有前景的途径。
    Abstract The growing body of political texts opens up new opportunities for rich insights into political dynamics and ideologies but also increases the workload for manual analysis. Automated speaker attribution, which detects who said what to whom in a speech event and is closely related to semantic role labeling, is an important processing step for computational text analysis. We study the potential of the large language model family Llama 2 to automate speaker attribution in German parliamentary debates from 2017-2021. We fine-tune Llama 2 with QLoRA, an efficient training strategy, and observe our approach to achieve competitive performance in the GermEval 2023 Shared Task On Speaker Attribution in German News Articles and Parliamentary Debates. Our results shed light on the capabilities of large language models in automating speaker attribution, revealing a promising avenue for computational analysis of political discourse and the development of semantic role labeling systems.
    摘要 不断增长的政治文本为深入洞察政治动态与意识形态提供了新的机会,但同时也增加了人工分析的工作量。自动发言人归属用于识别在一次发言事件中"谁对谁说了什么",与语义角色标注密切相关,是计算文本分析中重要的处理步骤。我们研究了大语言模型家族 Llama 2 在 2017 至 2021 年德国议会辩论中自动化发言人归属的潜力。我们采用高效的训练策略 QLoRA 对 Llama 2 进行微调,并观察到我们的方法在 GermEval 2023 关于德语新闻文章与议会辩论发言人归属的共享任务中取得了有竞争力的表现。我们的结果揭示了大语言模型在自动化发言人归属方面的能力,为政治话语的计算分析和语义角色标注系统的发展展现了有前景的方向。
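For context, QLoRA-style adaptation of a Llama 2 checkpoint is typically set up along the following lines with Hugging Face transformers and peft; the model name, target modules, and hyperparameters below are assumptions, the Llama 2 weights are gated and must be requested, and this is not the authors' training code.

```python
# QLoRA-style setup: 4-bit quantized base model + low-rank adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only the low-rank adapters are trained

# The adapted model would then be fine-tuned on speaker-attribution examples
# formatted as instruction/response pairs (e.g., with a standard SFT trainer).
```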

Corpus Synthesis for Zero-shot ASR domain Adaptation using Large Language Models

  • paper_url: http://arxiv.org/abs/2309.10707
  • repo_url: None
  • paper_authors: Hsuan Su, Ting-Yao Hu, Hema Swetha Koppula, Raviteja Vemulapalli, Jen-Hao Rick Chang, Karren Yang, Gautam Varma Mantena, Oncel Tuzel
  • for: 这篇论文的目的是提出一种新的自动话语识别(ASR)模型适应新目标领域的策略,不需要目标领域的文本或声音数据。
  • methods: 这篇论文提出了一个新的数据合成管道:使用大型语言模型(LLM)生成目标领域的文本语料,并使用最先进的可控语音合成模型生成相应的语音。此外,论文还提出了一种简单而有效的上下文内指令微调策略,以提升 LLM 为新领域生成文本语料的效果。
  • results: 在 SLURP 数据集上的实验表明,所提方法在未见过的目标领域上实现了平均 $28\%$ 的相对词错误率下降,同时源领域的性能没有任何下降。
    Abstract While Automatic Speech Recognition (ASR) systems are widely used in many real-world applications, they often do not generalize well to new domains and need to be finetuned on data from these domains. However, target-domain data usually are not readily available in many scenarios. In this paper, we propose a new strategy for adapting ASR models to new target domains without any text or speech from those domains. To accomplish this, we propose a novel data synthesis pipeline that uses a Large Language Model (LLM) to generate a target domain text corpus, and a state-of-the-art controllable speech synthesis model to generate the corresponding speech. We propose a simple yet effective in-context instruction finetuning strategy to increase the effectiveness of LLM in generating text corpora for new domains. Experiments on the SLURP dataset show that the proposed method achieves an average relative word error rate improvement of $28\%$ on unseen target domains without any performance drop in source domains.
    摘要 虽然自动语音识别(ASR)系统已广泛应用于许多现实场景,但它们往往难以泛化到新领域,需要在这些领域的数据上进行微调;然而在许多情况下,目标领域的数据并不易获得。本文提出了一种无需目标领域任何文本或语音即可将 ASR 模型适配到新目标领域的新策略。为此,我们提出了一个新的数据合成管道:利用大型语言模型(LLM)生成目标领域的文本语料,再用最先进的可控语音合成模型生成相应的语音。我们还提出了一种简单而有效的上下文内指令微调策略,以提升 LLM 为新领域生成文本语料的效果。在 SLURP 数据集上的实验表明,所提方法在未见过的目标领域上实现了平均 $28\%$ 的相对词错误率下降,且源领域性能没有任何损失。

Not Enough Labeled Data? Just Add Semantics: A Data-Efficient Method for Inferring Online Health Texts

  • paper_url: http://arxiv.org/abs/2309.09877
  • repo_url: None
  • paper_authors: Joseph Gatto, Sarah M. Preum
  • for: 本研究旨在提出一种基于抽象意义表示(AMR)的低资源自然语言处理(NLP)方法,用于解决各种在线健康资源和社交平台上的长文本和复杂语言问题。
  • methods: 本研究使用AMR图来模型低资源健康NLP任务,通过将文本转化为 semantic graph embeddings,提高预训练语言模型对高复杂文本的理解和推理能力。
  • results: 研究表明,将文本嵌入与语义图嵌入结合使用(下文附一个简化的融合示例),可以提高 6 种低资源健康 NLP 任务的性能;该方法与具体任务无关,可以轻松并入标准文本分类管道。
    Abstract User-generated texts available on the web and social platforms are often long and semantically challenging, making them difficult to annotate. Obtaining human annotation becomes increasingly difficult as problem domains become more specialized. For example, many health NLP problems require domain experts to be a part of the annotation pipeline. Thus, it is crucial that we develop low-resource NLP solutions able to work with this set of limited-data problems. In this study, we employ Abstract Meaning Representation (AMR) graphs as a means to model low-resource Health NLP tasks sourced from various online health resources and communities. AMRs are well suited to model online health texts as they can represent multi-sentence inputs, abstract away from complex terminology, and model long-distance relationships between co-referring tokens. AMRs thus improve the ability of pre-trained language models to reason about high-complexity texts. Our experiments show that we can improve performance on 6 low-resource health NLP tasks by augmenting text embeddings with semantic graph embeddings. Our approach is task agnostic and easy to merge into any standard text classification pipeline. We experimentally validate that AMRs are useful in the modeling of complex texts by analyzing performance through the lens of two textual complexity measures: the Flesch Kincaid Reading Level and Syntactic Complexity. Our error analysis shows that AMR-infused language models perform better on complex texts and generally show less predictive variance in the presence of changing complexity.
    摘要 网络和社交平台上的用户生成文本通常很长且语义复杂,因此难以标注。随着问题领域越来越专业化,获取人工标注也越来越困难。例如,许多健康领域的 NLP 问题需要领域专家参与标注流程。因此,我们需要开发能够处理这类有限数据问题的低资源 NLP 方案。在这项研究中,我们使用抽象意义表示(AMR)图来建模来自各种在线健康资源与社区的低资源健康 NLP 任务。AMR 图很适合建模在线健康文本:它可以表示多句输入,抽象掉复杂术语,并建模互指词之间的长距离关系,从而提升预训练语言模型对高复杂度文本的推理能力。我们的实验表明,将文本嵌入与语义图嵌入结合,可以提高 6 个低资源健康 NLP 任务的性能。该方法与具体任务无关,可以轻松并入标准文本分类管道。我们通过两种文本复杂度度量(Flesch-Kincaid 阅读难度和句法复杂度)分析性能,验证了 AMR 图在建模复杂文本方面的作用。错误分析表明,融合 AMR 的语言模型在复杂文本上表现更好,并且在文本复杂度变化时预测方差更小。
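
下面给出把文本嵌入与 AMR 图嵌入拼接后送入分类器的一个最小示意(PyTorch)。其中的维度、拼接方式与分类头结构均为假设,并非论文的原始架构。

```python
import torch
import torch.nn as nn

class FusedClassifier(nn.Module):
    """Concatenate a text embedding with an AMR-graph embedding, then classify."""
    def __init__(self, text_dim=768, graph_dim=128, num_labels=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + graph_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_labels),
        )

    def forward(self, text_emb, graph_emb):
        return self.head(torch.cat([text_emb, graph_emb], dim=-1))

# Toy usage: random vectors stand in for a PLM sentence encoder and a graph encoder.
model = FusedClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 2])
```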

Instruction-Following Speech Recognition

  • paper_url: http://arxiv.org/abs/2309.09843
  • repo_url: https://github.com/abusufyanvu/6S191_MIT_DeepLearning
  • paper_authors: Cheng-I Jeff Lai, Zhiyun Lu, Liangliang Cao, Ruoming Pang
  • for: 这项研究旨在从数据角度探索语音模型在语音处理中的理解与“推理”能力。
  • methods: 研究者训练了 Listen-Attend-Spell 听写模型,让模型理解并执行多种自由形式的文本指令。
  • results: 研究发现,无需大语言模型或预训练语音模块,模型即可根据指令选择性地转写部分语音,从而提供额外的隐私和安全保障。
    Abstract Conventional end-to-end Automatic Speech Recognition (ASR) models primarily focus on exact transcription tasks, lacking flexibility for nuanced user interactions. With the advent of Large Language Models (LLMs) in speech processing, more organic, text-prompt-based interactions have become possible. However, the mechanisms behind these models' speech understanding and "reasoning" capabilities remain underexplored. To study this question from the data perspective, we introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. This enables a multitude of speech recognition tasks -- ranging from transcript manipulation to summarization -- without relying on predefined command sets. Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without requiring LLMs or pre-trained speech modules. It also offers selective transcription options based on instructions like "transcribe first half and then turn off listening," providing an additional layer of privacy and safety compared to existing LLMs. Our findings highlight the significant potential of instruction-following training to advance speech foundation models.
    摘要 (传统的端到端自动语音识别(ASR)模型主要关注精确的转写任务,缺乏面向细腻用户交互的灵活性。随着大语言模型(LLM)进入语音处理领域,更自然的、基于文本提示的交互成为可能。然而,这些模型在语音理解和“推理”方面的机制尚未得到充分探讨。为了从数据角度研究这个问题,我们提出了指令跟随式语音识别:训练一个 Listen-Attend-Spell 模型来理解并执行多种自由形式的文本指令。这使得从转写修改到摘要等多种语音识别任务无需依赖预定义的命令集即可完成。值得注意的是,我们的模型在 Librispeech 上从零开始训练,无需 LLM 或预训练语音模块即可理解并执行简单指令。模型还可以根据诸如“转写前半部分然后停止监听”之类的指令进行选择性转写,与现有 LLM 相比提供了额外的隐私与安全保障。我们的发现表明,指令跟随式训练在推进语音基础模型方面具有巨大潜力。)

HypR: A comprehensive study for ASR hypothesis revising with a reference corpus

  • paper_url: http://arxiv.org/abs/2309.09838
  • repo_url: None
  • paper_authors: Yi-Wei Wang, Ke-Han Lu, Kuan-Yu Chen
  • for: 为提高自动语音识别(ASR)性能,对识别结果进行修正是一种轻量级且高效的方法。
  • methods: 研究使用 N-best 重排序方法和错误纠正模型来修正 ASR 结果(下文附一个基于语言模型重打分的简化示例)。
  • results: 发布了 ASR 假设修正(HypR)数据集,涵盖 AISHELL-1、TED-LIUM 2 和 LibriSpeech 等常用语料,并为每个语音片段提供 50 个识别假设;同时实现并比较了多种经典和代表性方法,展示了该方向的最新研究进展。
    Abstract With the development of deep learning, automatic speech recognition (ASR) has made significant progress. To further enhance the performance, revising recognition results is one of the lightweight but efficient manners. Various methods can be roughly classified into N-best reranking methods and error correction models. The former aims to select the hypothesis with the lowest error rate from a set of candidates generated by ASR for a given input speech. The latter focuses on detecting recognition errors in a given hypothesis and correcting these errors to obtain an enhanced result. However, we observe that these studies are hardly comparable to each other as they are usually evaluated on different corpora, paired with different ASR models, and even use different datasets to train the models. Accordingly, we first concentrate on releasing an ASR hypothesis revising (HypR) dataset in this study. HypR contains several commonly used corpora (AISHELL-1, TED-LIUM 2, and LibriSpeech) and provides 50 recognition hypotheses for each speech utterance. The checkpoint models of the ASR are also published. In addition, we implement and compare several classic and representative methods, showing the recent research progress in revising speech recognition results. We hope the publicly available HypR dataset can become a reference benchmark for subsequent research and promote the school of research to an advanced level.
    摘要 随着深度学习的发展,自动语音识别(ASR)取得了显著进步。为了进一步提升性能,对识别结果进行修正是一种轻量级且高效的方式。相关方法大致可分为 N-best 重排序方法和错误纠正模型:前者旨在从 ASR 为给定语音生成的候选集中选出错误率最低的假设;后者关注检测给定假设中的识别错误并加以纠正,从而得到更好的结果。然而,我们注意到这些研究往往难以相互比较,因为它们通常在不同的语料上评估、搭配不同的 ASR 模型,甚至使用不同的数据集训练模型。因此,我们首先发布 ASR 假设修正(HypR)数据集。HypR 包含若干常用语料(AISHELL-1、TED-LIUM 2 和 LibriSpeech),并为每个语音片段提供 50 个识别假设,同时公开了 ASR 的 checkpoint 模型。此外,我们实现并比较了多种经典和代表性方法,展示了修正语音识别结果方向的最新进展。我们希望公开的 HypR 数据集能成为后续研究的参考基准,并推动该领域的研究迈向更高水平。
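
作为 HypR 所要评测的 N-best 重排序方法族的一个简化示意,下面的代码用因果语言模型(此处以 GPT-2 为例,仅作演示)对每个候选假设重新打分,并与 ASR 分数线性插值后选出最佳假设;模型选择与插值权重都是假设,并非论文采用的配置。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def lm_logprob(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    # labels=input_ids gives mean token NLL; scale by length for an approximate total log-prob
    out = lm(ids, labels=ids)
    return -out.loss.item() * ids.size(1)

def rerank(nbest, weight=0.5):
    """nbest: list of (hypothesis_text, asr_score); higher asr_score means better."""
    return max(nbest, key=lambda h: weight * h[1] + (1 - weight) * lm_logprob(h[0]))

print(rerank([("i red a book", -1.2), ("i read a book", -1.5)]))
```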

AMuRD: Annotated Multilingual Receipts Dataset for Cross-lingual Key Information Extraction and Classification

  • paper_url: http://arxiv.org/abs/2309.09800
  • repo_url: https://github.com/update-for-integrated-business-ai/amurd
  • paper_authors: Abdelrahman Abdallah, Mahmoud Abdalla, Mohamed Elkasaby, Yasser Elbendary, Adam Jatowt
  • for: 本研究旨在构建一个新的多语言收据数据集,以解决收据关键信息抽取与跨语言分类问题。
  • methods: 本研究使用的方法是 InstructLLaMA 方法,它可以解决资讯抽取和项目分类中的主要挑战。
  • results: 本研究获得的结果是 F1 分数为 0.76 和准确率为 0.68,这表明 InstructLLaMA 方法可以实现高精度的资讯抽取和项目分类。
    Abstract Key information extraction involves recognizing and extracting text from scanned receipts, enabling retrieval of essential content, and organizing it into structured documents. This paper presents a novel multilingual dataset for receipt extraction, addressing key challenges in information extraction and item classification. The dataset comprises $47,720$ samples, including annotations for item names, attributes like (price, brand, etc.), and classification into $44$ product categories. We introduce the InstructLLaMA approach, achieving an F1 score of $0.76$ and an accuracy of $0.68$ for key information extraction and item classification. We provide code, datasets, and checkpoints.\footnote{\url{https://github.com/Update-For-Integrated-Business-AI/AMuRD}.
    摘要 关键信息抽取涉及从扫描收据中识别并提取文本,检索其中的核心内容,并将其组织为结构化文档。本文提出了一个新的多语言收据抽取数据集,以解决信息抽取和商品分类中的主要挑战。该数据集包含 47,720 个样本,标注了商品名称、属性(价格、品牌等),并划分为 44 个产品类别。我们提出了 InstructLLaMA 方法,在关键信息抽取和商品分类上取得了 0.76 的 F1 分数和 0.68 的准确率。我们同时提供代码、数据集和 checkpoint。

Watch the Speakers: A Hybrid Continuous Attribution Network for Emotion Recognition in Conversation With Emotion Disentanglement

  • paper_url: http://arxiv.org/abs/2309.09799
  • repo_url: None
  • paper_authors: Shanglin Lei, Xiaoping Wang, Guanting Dong, Jiang Li, Yingjian Liu
  • for: 这个研究是为了提高对话中的情感识别能力,并实现在不同情感场景下的泛化。
  • methods: 这个研究使用了一个混合式连续归因网络(HCAN),包括混合的循环与注意力模组,以建模全局情感连续性。另外,提出了一种新的情感归因编码(EAE),用来建模每个语句的内部与外部情感归因。
  • results: 这个研究在三个数据集上取得了顶尖性能,证明了方法的优越性。另外,在三个基准上进行的一系列对比实验和消融实验也支持了每个模组的有效性。
    Abstract Emotion Recognition in Conversation (ERC) has attracted widespread attention in the natural language processing field due to its enormous potential for practical applications. Existing ERC methods face challenges in achieving generalization to diverse scenarios due to insufficient modeling of context, ambiguous capture of dialogue relationships and overfitting in speaker modeling. In this work, we present a Hybrid Continuous Attributive Network (HCAN) to address these issues in the perspective of emotional continuation and emotional attribution. Specifically, HCAN adopts a hybrid recurrent and attention-based module to model global emotion continuity. Then a novel Emotional Attribution Encoding (EAE) is proposed to model intra- and inter-emotional attribution for each utterance. Moreover, aiming to enhance the robustness of the model in speaker modeling and improve its performance in different scenarios, A comprehensive loss function emotional cognitive loss $\mathcal{L}_{\rm EC}$ is proposed to alleviate emotional drift and overcome the overfitting of the model to speaker modeling. Our model achieves state-of-the-art performance on three datasets, demonstrating the superiority of our work. Another extensive comparative experiments and ablation studies on three benchmarks are conducted to provided evidence to support the efficacy of each module. Further exploration of generalization ability experiments shows the plug-and-play nature of the EAE module in our method.
    摘要 对话中的情感识别(ERC)因其在实际应用中的巨大潜力而在自然语言处理领域受到广泛关注。现有的 ERC 方法由于对上下文建模不足、对话关系捕捉含糊以及说话人建模过拟合,难以泛化到不同的场景。在这种情况下,我们从情感延续与情感归因的视角提出了一种混合连续归因网络(HCAN)来解决这些问题。具体来说,HCAN 采用混合的循环与注意力模块来建模全局情感连续性;然后提出了一种新的情感归因编码(EAE),用于建模每个语句的内部与外部情感归因。此外,为了增强说话人建模的鲁棒性并提升模型在不同场景下的性能,我们提出了情感认知损失函数 $\mathcal{L}_{\rm EC}$,以缓解情感漂移并克服模型对说话人建模的过拟合。我们的模型在三个数据集上达到了领先性能,证明了该工作的优越性。此外,我们还在三个基准上进行了广泛的对比实验和消融实验,为各模块的有效性提供了证据。进一步的泛化能力实验表明 EAE 模块具有即插即用的特性。

The ParlaSent multilingual training dataset for sentiment identification in parliamentary proceedings

  • paper_url: http://arxiv.org/abs/2309.09783
  • repo_url: None
  • paper_authors: Michal Mochtak, Peter Rupnik, Nikola Ljubešić
  • for: 这篇论文主要用于研究政治决策中的情感因素,以及如何系统地研究和测量这些情感。
  • methods: 论文构建了一个新的情感标注句子数据集,并在其上进行了一系列实验,旨在训练一个鲁棒的情感分类器。此外,论文还介绍了首个面向政治科学应用的领域特定 LLM,并在该领域数据上进行了额外预训练。
  • results: 实验表明,在领域特定数据上额外预训练 LLM 可以显著提高模型在领域特定任务上的下游性能,并且多语言模型在未见过的语言上表现非常好。此外,论文还证明了来自其他语言的额外数据可以大幅提升目标议会上的结果。
    Abstract Sentiments inherently drive politics. How we receive and process information plays an essential role in political decision-making, shaping our judgment with strategic consequences both on the level of legislators and the masses. If sentiment plays such an important role in politics, how can we study and measure it systematically? The paper presents a new dataset of sentiment-annotated sentences, which are used in a series of experiments focused on training a robust sentiment classifier for parliamentary proceedings. The paper also introduces the first domain-specific LLM for political science applications additionally pre-trained on 1.72 billion domain-specific words from proceedings of 27 European parliaments. We present experiments demonstrating how the additional pre-training of LLM on parliamentary data can significantly improve the model downstream performance on the domain-specific tasks, in our case, sentiment detection in parliamentary proceedings. We further show that multilingual models perform very well on unseen languages and that additional data from other languages significantly improves the target parliament's results. The paper makes an important contribution to multiple domains of social sciences and bridges them with computer science and computational linguistics. Lastly, it sets up a more robust approach to sentiment analysis of political texts in general, which allows scholars to study political sentiment from a comparative perspective using standardized tools and techniques.
    摘要 情感在本质上驱动着政治。我们接收和处理信息的方式在政治决策中起着关键作用,并在议员和大众两个层面上对判断产生具有战略意义的影响。如果情感在政治中如此重要,我们该如何系统地研究和测量它?本文发布了一个新的情感标注句子数据集,并基于它开展了一系列实验,旨在为议会议事录训练一个鲁棒的情感分类器。本文还介绍了首个面向政治科学应用的领域特定 LLM,它在来自 27 个欧洲议会议事录的 17.2 亿领域词汇上进行了额外预训练。实验表明,在议会数据上对 LLM 进行额外预训练可以显著提升模型在领域特定任务(此处为议会议事录中的情感检测)上的下游性能。我们还展示了多语言模型在未见过的语言上表现很好,并且来自其他语言的额外数据能显著改善目标议会的结果。本文为社会科学多个领域做出了重要贡献,并将其与计算机科学和计算语言学联系起来。最后,它为政治文本的情感分析建立了更稳健的方法,使学者能够使用标准化的工具和技术从比较视角研究政治情感。

Facilitating NSFW Text Detection in Open-Domain Dialogue Systems via Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2309.09749
  • repo_url: https://github.com/qiuhuachuan/CensorChat
  • paper_authors: Huachuan Qiu, Shuai Zhang, Hongliang He, Anqi Li, Zhenzhong Lan
  • for: 本研究旨在提高对话系统中 NSFW 语言检测的能力,以保障用户在数字对话中的安全与福祉。
  • methods: 本研究借助知识蒸馏技术(涉及 GPT-4 和 ChatGPT)构建了 NSFW 对话检测数据集,通过伪标注与自我批判策略为 NSFW 语言检测提供可靠的数据来源。
  • results: 研究表明,在伪标注数据上微调 BERT 文本分类器可以有效检测对话中的 NSFW 语言(下文附一个简化的微调示例);该方法在兼顾用户表达自由的同时,提升了对话系统的安全性与可靠性。
    Abstract NSFW (Not Safe for Work) content, in the context of a dialogue, can have severe side effects on users in open-domain dialogue systems. However, research on detecting NSFW language, especially sexually explicit content, within a dialogue context has significantly lagged behind. To address this issue, we introduce CensorChat, a dialogue monitoring dataset aimed at NSFW dialogue detection. Leveraging knowledge distillation techniques involving GPT-4 and ChatGPT, this dataset offers a cost-effective means of constructing NSFW content detectors. The process entails collecting real-life human-machine interaction data and breaking it down into single utterances and single-turn dialogues, with the chatbot delivering the final utterance. ChatGPT is employed to annotate unlabeled data, serving as a training set. Rationale validation and test sets are constructed using ChatGPT and GPT-4 as annotators, with a self-criticism strategy for resolving discrepancies in labeling. A BERT model is fine-tuned as a text classifier on pseudo-labeled data, and its performance is assessed. The study emphasizes the importance of AI systems prioritizing user safety and well-being in digital conversations while respecting freedom of expression. The proposed approach not only advances NSFW content detection but also aligns with evolving user protection needs in AI-driven dialogues.
    摘要 在开放域对话系统中,NSFW(不适宜工作场合)内容可能对用户造成严重的负面影响。然而,针对对话语境中 NSFW 语言(尤其是色情内容)检测的研究明显滞后。为此,我们提出了面向 NSFW 对话检测的对话监测数据集 CensorChat。借助涉及 GPT-4 和 ChatGPT 的知识蒸馏技术,该数据集提供了一种低成本构建 NSFW 内容检测器的途径。具体流程为:收集真实的人机交互数据,并将其拆分为单句和单轮对话,由聊天机器人给出最后一句;使用 ChatGPT 对未标注数据进行标注,作为训练集;验证集和测试集则由 ChatGPT 和 GPT-4 共同标注,并通过自我批判策略解决标注分歧。随后在伪标注数据上微调 BERT 文本分类器,并评估其性能。该研究强调 AI 系统应在尊重表达自由的同时,把用户在数字对话中的安全与福祉放在首位。所提方法不仅推进了 NSFW 内容检测,也契合 AI 驱动对话中不断演进的用户保护需求。
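
下面是“在伪标注数据上微调 BERT 文本分类器”这一步骤的最小示意(transformers + PyTorch)。示例文本、标签与超参数均为演示用的假设,并非 CensorChat 的真实数据或训练配置。

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Pseudo-labeled dialogue turns (label 1 = NSFW); purely illustrative toy examples.
texts = ["let's keep this friendly", "explicit adult content example"]
labels = torch.tensor([0, 1])

batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the pseudo-labeled set
    out = model(**batch, labels=labels)
    out.loss.backward()
    optim.step()
    optim.zero_grad()
print(float(out.loss))
```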

When Large Language Models Meet Citation: A Survey

  • paper_url: http://arxiv.org/abs/2309.09727
  • repo_url: None
  • paper_authors: Yang Zhang, Yufei Wang, Kai Wang, Quan Z. Sheng, Lina Yao, Adnan Mahmood, Wei Emma Zhang, Rongying Zhao
  • for: 这篇论文综述了大语言模型(LLM)在文内引用分析任务中的应用,以及如何利用引用链接知识来改进 LLM 的文本表示。
  • methods: 论文讨论了 LLM 在引用分类、基于引用的摘要和引用推荐中的应用,以及利用引用预测、网络结构信息和文档间关系来改进 LLM 文本表示的研究。
  • results: 论文对 LLM 与引用分析之间的互利关系进行了初步综述,并指出了值得进一步探索的研究方向。
    Abstract Citations in scholarly work serve the essential purpose of acknowledging and crediting the original sources of knowledge that have been incorporated or referenced. Depending on their surrounding textual context, these citations are used for different motivations and purposes. Large Language Models (LLMs) could be helpful in capturing these fine-grained citation information via the corresponding textual context, thereby enabling a better understanding towards the literature. Furthermore, these citations also establish connections among scientific papers, providing high-quality inter-document relationships and human-constructed knowledge. Such information could be incorporated into LLMs pre-training and improve the text representation in LLMs. Therefore, in this paper, we offer a preliminary review of the mutually beneficial relationship between LLMs and citation analysis. Specifically, we review the application of LLMs for in-text citation analysis tasks, including citation classification, citation-based summarization, and citation recommendation. We then summarize the research pertinent to leveraging citation linkage knowledge to improve text representations of LLMs via citation prediction, network structure information, and inter-document relationship. We finally provide an overview of these contemporary methods and put forth potential promising avenues in combining LLMs and citation analysis for further investigation.
    摘要 文献引用在学术作品中起着承认和归功所引入或参考的原始知识来源的重要作用。根据其所处的文本上下文,这些引用服务于不同的动机和目的。大语言模型(LLM)可以借助相应的文本上下文捕捉这些细粒度的引用信息,从而更好地理解文献。此外,这些引用还建立了科学论文之间的联系,提供了高质量的文档间关系和人工构建的知识。这些信息可以纳入 LLM 的预训练,从而改进 LLM 的文本表示。因此,本文对 LLM 与引用分析之间的互利关系进行了初步综述。具体而言,我们回顾了 LLM 在文内引用分析任务中的应用,包括引用分类、基于引用的摘要和引用推荐;随后总结了利用引用链接知识(通过引用预测、网络结构信息和文档间关系)改进 LLM 文本表示的相关研究。最后,我们概述了这些前沿方法,并提出了将 LLM 与引用分析相结合的若干有前景的研究方向。

Dealing with negative samples with multi-task learning on span-based joint entity-relation extraction

  • paper_url: http://arxiv.org/abs/2309.09713
  • repo_url: None
  • paper_authors: Chenguang Xue, Jiamin Lu
  • for: span-based joint extraction models
  • methods: multitask learning, the intersection over union (IoU) concept (see the span-IoU sketch below), entity logits
  • results: mitigating adverse effects of excessive negative samples, with commendable F1 scores of 73.61%, 53.72%, and 83.72% on three widely employed public datasets (CoNLL04, SciERC, and ADE)
    Abstract Recent span-based joint extraction models have demonstrated significant advantages in both entity recognition and relation extraction. These models treat text spans as candidate entities, and span pairs as candidate relationship tuples, achieving state-of-the-art results on datasets like ADE. However, these models encounter a significant number of non-entity spans or irrelevant span pairs during the tasks, impairing model performance significantly. To address this issue, this paper introduces a span-based multitask entity-relation joint extraction model. This approach employs the multitask learning to alleviate the impact of negative samples on entity and relation classifiers. Additionally, we leverage the Intersection over Union(IoU) concept to introduce the positional information into the entity classifier, achieving a span boundary detection. Furthermore, by incorporating the entity Logits predicted by the entity classifier into the embedded representation of entity pairs, the semantic input for the relation classifier is enriched. Experimental results demonstrate that our proposed SpERT.MT model can effectively mitigate the adverse effects of excessive negative samples on the model performance. Furthermore, the model demonstrated commendable F1 scores of 73.61\%, 53.72\%, and 83.72\% on three widely employed public datasets, namely CoNLL04, SciERC, and ADE, respectively.
    摘要 最近的基于跨度的联合抽取模型在实体识别和关系抽取任务中展现出显著优势。这些模型将文本跨度视为候选实体,将跨度对视为候选关系元组,在 ADE 等数据集上取得了最新的结果。然而,这些模型在任务中会遇到大量非实体跨度或无关的跨度对,严重影响模型性能。为了解决这个问题,本文提出了一种基于跨度的多任务实体关系联合抽取模型。该方法利用多任务学习来减轻负样本对实体和关系分类器的影响;同时借助交并比(IoU)概念将位置信息引入实体分类器,实现跨度边界检测。此外,通过将实体分类器预测的实体 logits 融入实体对的嵌入表示,为关系分类器提供了更丰富的语义输入。实验结果表明,我们提出的 SpERT.MT 模型可以有效缓解过多负样本对模型性能的不利影响,并在三个广泛使用的公共数据集上分别取得了 73.61%、53.72% 和 83.72% 的 F1 分数。
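
下面给出利用交并比(IoU)为候选跨度注入位置信息的一个简化示意:阈值取值仅为演示,并非论文中的设定。

```python
def span_iou(a, b):
    """Intersection-over-union of two token spans given as (start, end), end exclusive."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def label_candidate_spans(candidates, gold_spans, iou_threshold=0.5):
    """A candidate span close enough (by IoU) to some gold span is treated as a
    positive boundary match, giving the entity classifier positional signal."""
    return [max(span_iou(c, g) for g in gold_spans) >= iou_threshold for c in candidates]

print(label_candidate_spans([(0, 3), (5, 6)], [(0, 2)]))  # [True, False]
```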

Evaluating Gender Bias of Pre-trained Language Models in Natural Language Inference by Considering All Labels

  • paper_url: http://arxiv.org/abs/2309.09697
  • repo_url: https://github.com/panatchakorn-a/bias-eval-nli-considering-all-labels
  • paper_authors: Panatchakorn Anantaprayoon, Masahiro Kaneko, Naoaki Okazaki
  • for: 本研究旨在评估预训练语言模型中的社会性偏见(以性别偏见为例)。
  • methods: 本研究提出了一种在自然语言推理任务中考虑全部标签的偏见评估方法,包括构建评估数据集和定义偏见度量(下文附一个简化的度量示例)。
  • results: 实验结果显示,该评估方法可以更准确地评估语言模型的偏见性,并且可以应用于多种语言。此外,本研究还评估了不同语言的语言模型偏见性。
    Abstract Discriminatory social biases, including gender biases, have been found in Pre-trained Language Models (PLMs). In Natural Language Inference (NLI), recent bias evaluation methods have observed biased inferences from the outputs of a particular label such as neutral or entailment. However, since different biased inferences can be associated with different output labels, it is inaccurate for a method to rely on one label. In this work, we propose an evaluation method that considers all labels in the NLI task. We create evaluation data and assign them into groups based on their expected biased output labels. Then, we define a bias measure based on the corresponding label output of each data group. In the experiment, we propose a meta-evaluation method for NLI bias measures, and then use it to confirm that our measure can evaluate bias more accurately than the baseline. Moreover, we show that our evaluation method is applicable to multiple languages by conducting the meta-evaluation on PLMs in three different languages: English, Japanese, and Chinese. Finally, we evaluate PLMs of each language to confirm their bias tendency. To our knowledge, we are the first to build evaluation datasets and measure the bias of PLMs from the NLI task in Japanese and Chinese.
    摘要 社会偏见,包括性别偏见,在预训练语言模型(PLM)中发现。在自然语言推理(NLI)任务中,最近的偏见评估方法发现PLM的输出中存在偏见。然而,由于不同的偏见可能与不同的输出标签相关,因此使用一个标签是不准确的。在这种情况下,我们提议一种评估方法,该方法考虑所有的输出标签在NLI任务中。我们创建评估数据,并将其分组 Based on their expected biased output labels。然后,我们定义基于每个数据组的标签输出的偏见度量。在实验中,我们提议一种元评估方法,并使用其来验证我们的度量方法可以更准确地评估偏见。此外,我们展示了我们的评估方法可以应用于多种语言,通过在英语、日语和中文三种语言的PLM上进行元评估。最后,我们评估每种语言的PLM,以验证它们的偏见倾向。到目前为止,我们是第一个在NLI任务中为日语和中文的PLM建立评估数据集,并测量它们的偏见倾向。
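
下面给出“考虑全部标签”的偏见度量的一种简化实现思路:对每条数据的两个性别变体,比较模型在三个 NLI 标签上的预测概率差并取平均。该度量定义只是示意性的假设,不一定等同于论文的原始定义。

```python
import numpy as np

def all_label_bias(probs_pro, probs_anti):
    """probs_*: (n_pairs, 3) predicted probabilities over
    (entailment, neutral, contradiction) for pro- vs anti-stereotypical variants.
    Returns the per-label probability gap and its mean; 0 means the gender swap
    does not shift any label's probability on average."""
    per_label_gap = np.abs(probs_pro - probs_anti).mean(axis=0)
    return per_label_gap, per_label_gap.mean()

pro = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]])
anti = np.array([[0.4, 0.4, 0.2], [0.5, 0.3, 0.2]])
print(all_label_bias(pro, anti))
```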

Do learned speech symbols follow Zipf’s law?

  • paper_url: http://arxiv.org/abs/2309.09690
  • repo_url: None
  • paper_authors: Shinnosuke Takamichi, Hiroki Maeda, Joonyong Park, Daisuke Saito, Hiroshi Saruwatari
  • for: 本研究探讨了通过深度学习学得的语音符号是否像自然语言符号一样遵循齐夫定律(Zipf's law)。
  • methods: 本研究对深度学习学得的离散语音符号的频率分布进行分析,以判断它们是否遵循齐夫定律(下文附一个拟合示例)。
  • results: 研究发现,数据驱动的语音符号与自然语言符号一样遵循齐夫定律。这些结果为口语处理领域提供了新的统计分析方法。
    Abstract In this study, we investigate whether speech symbols, learned through deep learning, follow Zipf's law, akin to natural language symbols. Zipf's law is an empirical law that delineates the frequency distribution of words, forming fundamentals for statistical analysis in natural language processing. Natural language symbols, which are invented by humans to symbolize speech content, are recognized to comply with this law. On the other hand, recent breakthroughs in spoken language processing have given rise to the development of learned speech symbols; these are data-driven symbolizations of speech content. Our objective is to ascertain whether these data-driven speech symbols follow Zipf's law, as the same as natural language symbols. Through our investigation, we aim to forge new ways for the statistical analysis of spoken language processing.
    摘要 在这项研究中,我们考察通过深度学习学得的语音符号是否像自然语言符号一样遵循齐夫定律(Zipf's law)。齐夫定律是一条描述词语频率分布的经验法则,是自然语言处理统计分析的基础。人类为表示语音内容而创造的自然语言符号被认为遵循这一定律;另一方面,口语处理的最新突破催生了数据驱动的语音符号。我们的目标是确定这些数据驱动的语音符号是否同样遵循齐夫定律。通过这项研究,我们希望为口语处理开拓新的统计分析方法。
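
下面是检验离散语音符号是否符合齐夫定律的一个简化做法:统计符号频率,在双对数坐标下用最小二乘拟合秩-频曲线的斜率。示例数据为合成数据,仅作演示。

```python
import numpy as np
from collections import Counter

def zipf_exponent(symbols):
    """Fit log(frequency) ~ -s * log(rank) by least squares and return s."""
    freqs = np.array(sorted(Counter(symbols).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

# Toy stream of discrete speech units (e.g., clustered self-supervised codes).
rng = np.random.default_rng(0)
toy_units = rng.zipf(2.0, size=5000)       # synthetic Zipf-distributed symbol IDs
print(round(zipf_exponent(toy_units), 2))  # fitted rank-frequency exponent
```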

Multi-turn Dialogue Comprehension from a Topic-aware Perspective

  • paper_url: http://arxiv.org/abs/2309.09666
  • repo_url: None
  • paper_authors: Xinbei Ma, Yi Xu, Hai Zhao, Zhuosheng Zhang
  • for: 面向对话的机器阅读理解要求语言模型能够有效地解耦并建模多个对话轮次。由于对话主题未必在整段对话中保持一致,对话建模颇具挑战性。本文提出了一种主题感知的对话建模方法,通过自动分割对话段,将其转化为主题相关的语言处理单元。
  • methods: 本文提出了一种基于主题的无监督对话段分割算法(下文附一个简化的分割示例),并使用分割后的对话段作为主题相关的语言处理单元进行进一步的对话理解。此外,本文还提出了一种基于自训练自编码器的主题聚类系统,以及两个自建的评估数据集。
  • results: 在三个公共基准上进行的实验表明,与基线相比,本文的方法取得了显著改进。本文延续了此前关于文档主题的研究,将对话建模带入了新的主题感知视角,并进行了广泛的实验和分析。
    Abstract Dialogue related Machine Reading Comprehension requires language models to effectively decouple and model multi-turn dialogue passages. As a dialogue development goes after the intentions of participants, its topic may not keep constant through the whole passage. Hence, it is non-trivial to detect and leverage the topic shift in dialogue modeling. Topic modeling, although has been widely studied in plain text, deserves far more utilization in dialogue reading comprehension. This paper proposes to model multi-turn dialogues from a topic-aware perspective. We start with a dialogue segmentation algorithm to split a dialogue passage into topic-concentrated fragments in an unsupervised way. Then we use these fragments as topic-aware language processing units in further dialogue comprehension. On one hand, the split segments indict specific topics rather than mixed intentions, thus showing convenient on in-domain topic detection and location. For this task, we design a clustering system with a self-training auto-encoder, and we build two constructed datasets for evaluation. On the other hand, the split segments are an appropriate element of multi-turn dialogue response selection. For this purpose, we further present a novel model, Topic-Aware Dual-Attention Matching (TADAM) Network, which takes topic segments as processing elements and matches response candidates with a dual cross-attention. Empirical studies on three public benchmarks show great improvements over baselines. Our work continues the previous studies on document topic, and brings the dialogue modeling to a novel topic-aware perspective with exhaustive experiments and analyses.
    摘要 对话相关的机器阅读理解需要语言模型能够有效地解耦和模型多个对话段。因为对话的目的可能会在整个段落中变化,因此检测和利用对话中的主题转换非常重要。主题分析,尽管在普通文本中广泛研究,在对话阅读理解中尚未得到充分利用。这篇论文提议在一个主题意识角度上模型多个对话段。我们从对话分割算法开始,将对话段落分解成主题强调的小段,然后使用这些小段作为主题意识语言处理单元进行进一步的对话理解。在一个方面,分割段落可以帮助检测对话中的主题,而不是混合的意图,因此在领域内主题检测和定位变得更加便捷。在另一方面,分割段落是多Turn对话回答选择的适当元素。为此,我们采用了一种新的模型,主题意识双重注意网络(TADAM),该模型将主题段落作为处理元素,并将回答候选者与双重跨注意相匹配。我们对三个公共 benchmark 进行了实验,得到了很大的改进。我们的工作继承了之前的文档主题研究,并将对话模型带到了一个新的主题意识角度,并进行了详细的实验和分析。
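
下面给出一种无监督对话主题分割的简化示意(类似 TextTiling 的思路):当相邻语句嵌入的余弦相似度跌破阈值时切分出新的主题片段。这只是该类方法的一个草图,并非论文的原始分割算法。

```python
import numpy as np

def segment_dialogue(utt_embs, threshold=0.5):
    """utt_embs: (n_utts, d) array of utterance embeddings.
    Start a new topic fragment whenever the cosine similarity between
    consecutive utterances drops below `threshold`."""
    normed = utt_embs / np.linalg.norm(utt_embs, axis=1, keepdims=True)
    sims = (normed[:-1] * normed[1:]).sum(axis=1)
    boundaries = [0] + [i + 1 for i, s in enumerate(sims) if s < threshold] + [len(utt_embs)]
    return [list(range(a, b)) for a, b in zip(boundaries[:-1], boundaries[1:])]

embs = np.array([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], dtype=float)
print(segment_dialogue(embs))  # [[0, 1], [2, 3]]
```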

A Novel Method of Fuzzy Topic Modeling based on Transformer Processing

  • paper_url: http://arxiv.org/abs/2309.09658
  • repo_url: None
  • paper_authors: Ching-Hsun Tseng, Shin-Jye Lee, Po-Wei Cheng, Chien Lee, Chih-Chieh Hung
  • for: 本研究旨在提出一种基于软划分和文档嵌入的模糊主题分析方法,以便更好地监测市场趋势。
  • methods: 本研究使用最新的 Transformer 模型进行文档嵌入,并结合软聚类进行模糊主题建模(下文附一个简化的模糊 c 均值示例),应用于新闻稿监测。
  • results: 实际应用中,模糊主题分析方法比传统的LDA模型更能够提供自然的结果。
    Abstract Topic modeling is admittedly a convenient way to monitor markets trend. Conventionally, Latent Dirichlet Allocation, LDA, is considered a must-do model to gain this type of information. By given the merit of deducing keyword with token conditional probability in LDA, we can know the most possible or essential topic. However, the results are not intuitive because the given topics cannot wholly fit human knowledge. LDA offers the first possible relevant keywords, which also brings out another problem of whether the connection is reliable based on the statistic possibility. It is also hard to decide the topic number manually in advance. As the booming trend of using fuzzy membership to cluster and using transformers to embed words, this work presents the fuzzy topic modeling based on soft clustering and document embedding from state-of-the-art transformer-based model. In our practical application in a press release monitoring, the fuzzy topic modeling gives a more natural result than the traditional output from LDA.
    摘要 主题建模被公认为是监测市场趋势的一种便捷方式。传统上,潜在狄利克雷分配(LDA)被视为获取此类信息的必备模型:借助 LDA 以词元条件概率推断关键词的优点,我们可以得知最可能或最核心的主题。然而,其结果往往不够直观,因为给出的主题难以完全契合人类认知。LDA 给出的只是最可能相关的关键词,这也带来了另一个问题,即基于统计可能性的这种关联是否可靠;此外,主题数量还需要事先人工指定。随着利用模糊隶属度进行聚类以及利用 Transformer 进行词嵌入的兴起,本文提出了基于软聚类与最新 Transformer 模型文档嵌入的模糊主题建模方法。在新闻稿监测的实际应用中,模糊主题建模给出的结果比传统 LDA 的输出更加自然。
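
下面用纯 NumPy 实现的模糊 c 均值对文档嵌入做软聚类,得到每篇文档对各主题的隶属度;文档嵌入在实际中应来自 Transformer 编码器,这里用随机数据代替,聚类数与模糊系数均为假设。

```python
import numpy as np

def fuzzy_cmeans(X, c=3, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means: returns (centers, U) where U[i, k] is the soft
    membership of document i in topic k (rows sum to 1)."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))            # random soft assignments
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]      # weighted topic centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))
        U /= U.sum(axis=1, keepdims=True)                 # renormalize memberships
    return centers, U

# Toy "document embeddings"; in practice these come from a transformer encoder.
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 2)),
               np.random.default_rng(2).normal(3, 0.1, (20, 2))])
centers, U = fuzzy_cmeans(X, c=2)
print(U.round(2)[:3])
```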

Speeding Up Speech Synthesis In Diffusion Models By Reducing Data Distribution Recovery Steps Via Content Transfer

  • paper_url: http://arxiv.org/abs/2309.09652
  • repo_url: None
  • paper_authors: Peter Ochieng
  • for: 提高基于扩散模型的声码器(vocoder)的速度和质量
  • methods: 使用神经网络的不同层对前向扩散过程的输出进行逐步去噪,并定义跳过参数 $\tau$ 以减少数据分布恢复步骤的数量
  • results: 提出一种新的扩散声码器技术,可以在有竞争力的时间内生成高保真语音,并对未见过的语音具有良好的泛化能力
    Abstract Diffusion based vocoders have been criticised for being slow due to the many steps required during sampling. Moreover, the model's loss function that is popularly implemented is designed such that the target is the original input $x_0$ or error $\epsilon_0$. For early time steps of the reverse process, this results in large prediction errors, which can lead to speech distortions and increase the learning time. We propose a setup where the targets are the different outputs of forward process time steps with a goal to reduce the magnitude of prediction errors and reduce the training time. We use the different layers of a neural network (NN) to perform denoising by training them to learn to generate representations similar to the noised outputs in the forward process of the diffusion. The NN layers learn to progressively denoise the input in the reverse process until finally the final layer estimates the clean speech. To avoid 1:1 mapping between layers of the neural network and the forward process steps, we define a skip parameter $\tau>1$ such that an NN layer is trained to cumulatively remove the noise injected in the $\tau$ steps in the forward process. This significantly reduces the number of data distribution recovery steps and, consequently, the time to generate speech. We show through extensive evaluation that the proposed technique generates high-fidelity speech in competitive time that outperforms current state-of-the-art tools. The proposed technique is also able to generalize well to unseen speech.
    摘要 基于扩散模型的声码器(vocoder)常被批评速度慢,因为采样过程需要很多步骤。另外,常用的模型损失函数以原始输入 $x_0$ 或噪声 $\epsilon_0$ 为目标,对于反向过程的早期时间步而言,这会导致较大的预测误差,进而造成语音失真并增加学习时间。我们提议一种新的设置,其目标是前向过程不同时间步的输出,以降低预测误差的幅度并减少训练时间。我们使用神经网络(NN)的不同层进行去噪,训练它们生成与前向扩散过程中加噪输出相似的表示。NN 各层在反向过程中逐步去噪输入,直到最后一层估计出干净的语音。为了避免 NN 层与前向过程步骤之间的一对一映射,我们定义了跳过参数 $\tau>1$,使每个 NN 层被训练为累积去除前向过程中 $\tau$ 步所注入的噪声。这显著减少了数据分布恢复步骤的数量,从而缩短了生成语音的时间。我们通过广泛的评估表明,所提技术可以在有竞争力的时间内生成高保真语音,优于当前最先进的工具,并且能很好地泛化到未见过的语音。

Summarization is (Almost) Dead

  • paper_url: http://arxiv.org/abs/2309.09558
  • repo_url: None
  • paper_authors: Xiao Pu, Mingqi Gao, Xiaojun Wan
  • for: 这篇论文旨在评估大语言模型(LLM)在摘要任务中的表现。
  • methods: 该论文构建了新的数据集并开展人工评估实验,以评估 LLM 在五种不同摘要任务中的零样本生成能力。
  • results: 研究发现,人工评估者更偏好 LLM 生成的摘要,而非人工撰写的摘要和微调模型生成的摘要,尤其是在事实一致性和外部幻觉方面表现更好。因此,作者认为在 LLM 时代,文本摘要领域的多数传统工作已不再必要;但仍有一些值得探索的方向,例如构建质量更高的数据集和更可靠的评估方法。
    Abstract How well can large language models (LLMs) generate summaries? We develop new datasets and conduct human evaluation experiments to evaluate the zero-shot generation capability of LLMs across five distinct summarization tasks. Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models. Specifically, LLM-generated summaries exhibit better factual consistency and fewer instances of extrinsic hallucinations. Due to the satisfactory performance of LLMs in summarization tasks (even surpassing the benchmark of reference summaries), we believe that most conventional works in the field of text summarization are no longer necessary in the era of LLMs. However, we recognize that there are still some directions worth exploring, such as the creation of novel datasets with higher quality and more reliable evaluation methods.
    摘要 大型语言模型(LLM)是否能够生成好的摘要?我们开发了新的数据集和进行了人类评估实验,以评估 LLM 在五种不同摘要任务中的零基础生成能力。我们的结果显示,人类评审者对 LLM 生成的摘要表示偏好,而且与人工撰写的摘要和 fine-tuned 模型生成的摘要相比, LLM 生成的摘要更有内容和外部错误的优势。尤其是 LLM 生成的摘要在事实上是更加精确和有 fewer 外部错误。由于 LLM 在摘要任务中的表现非常满意(甚至超过参考摘要的 benchmark),我们认为,现在的文本摘要领域中大多数传统的工作不再是必要的。然而,我们认为还有一些值得探索的方向,例如创建更高质量和更可靠的评估方法,以及创建更多的数据集。

Training dynamic models using early exits for automatic speech recognition on resource-constrained devices

  • paper_url: http://arxiv.org/abs/2309.09546
  • repo_url: None
  • paper_authors: George August Wright, Umberto Cappellazzo, Salah Zaiem, Desh Raj, Lucas Ondel Yang, Daniele Falavigna, Alessio Brutti
  • for: 这篇论文旨在提出一种可以在推理时动态调整神经网络模型计算负载的方案,以便在资源有限的设备上进行处理。
  • methods: 本论文使用提前退出(early-exit)架构,即透过中间退出分支来实现模型计算量的动态调整。此外,除了使用预训练骨干外,本论文还以提前退出架构从头开始训练模型。
  • results: 实验结果显示,从头开始训练的提前退出模型不仅在使用较少编码器层时能够保持表现水准,任务准确率还优于单退出模型或使用预训练模型的做法。此外,本论文还研究了一种基于后验概率的退出选择策略,作为基于帧级熵方法的替代方案(下文附一个简化的示例)。
    Abstract The possibility of dynamically modifying the computational load of neural models at inference time is crucial for on-device processing, where computational power is limited and time-varying. Established approaches for neural model compression exist, but they provide architecturally static models. In this paper, we investigate the use of early-exit architectures, that rely on intermediate exit branches, applied to large-vocabulary speech recognition. This allows for the development of dynamic models that adjust their computational cost to the available resources and recognition performance. Unlike previous works, besides using pre-trained backbones we also train the model from scratch with an early-exit architecture. Experiments on public datasets show that early-exit architectures from scratch not only preserve performance levels when using fewer encoder layers, but also improve task accuracy as compared to using single-exit models or using pre-trained models. Additionally, we investigate an exit selection strategy based on posterior probabilities as an alternative to frame-based entropy.
    摘要 可以在推理时动态修改神经网络模型的计算负担非常重要,特别是在设备上进行处理时,因为计算能力有限且随时间变化。现有的神经网络压缩方法已经存在,但它们提供的是架构上静态的模型。在这篇论文中,我们研究依赖中间退出分支的提前退出(early-exit)架构,并将其应用于大词表语音识别。这使我们能够开发可根据可用资源与识别性能调整计算成本的动态模型。与先前的工作不同,我们不仅使用预训练骨干,还以提前退出架构从头训练模型。在公开数据集上的实验表明,从头训练的提前退出模型在使用较少编码器层时不仅能保持性能水平,任务准确率也优于单退出模型或使用预训练模型。此外,我们还研究了基于后验概率的退出选择策略,作为基于帧级熵方法的替代方案。
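
下面是“基于后验概率的退出选择策略”的一个简化示意:在各中间退出分支处计算 softmax 后验,一旦最大后验超过阈值就提前退出。此处把问题简化为单次分类,阈值取值也只是假设,并非论文中针对 ASR 帧级输出的具体实现。

```python
import torch

def exit_by_posterior(exit_logits, threshold=0.9):
    """exit_logits: list of (num_classes,) tensors, one per intermediate exit,
    ordered shallow -> deep. Return (exit_index, prediction) at the first exit
    whose max posterior exceeds `threshold`; otherwise fall back to the last exit."""
    for i, logits in enumerate(exit_logits):
        probs = torch.softmax(logits, dim=-1)
        if probs.max().item() >= threshold:
            return i, int(probs.argmax())
    return len(exit_logits) - 1, int(torch.softmax(exit_logits[-1], -1).argmax())

print(exit_by_posterior([torch.tensor([0.1, 0.2, 0.0]),
                         torch.tensor([4.0, 0.1, 0.1])]))  # exits at the second branch
```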

Adapting Large Language Models via Reading Comprehension

  • paper_url: http://arxiv.org/abs/2309.09530
  • repo_url: https://github.com/microsoft/lmops
  • paper_authors: Daixuan Cheng, Shaohan Huang, Furu Wei
  • for: 本研究探讨了继续在领域特定语料上预训练如何影响大语言模型,发现直接在原始语料上继续预训练能让模型学到领域知识,但会严重损害其通过提示回答问题的能力。
  • methods: 我们提出了一种简单的方法,将原始语料转换为阅读理解文本来解决这个问题:每段原始文本都会附加一系列与其内容相关的任务(下文附一个玩具化的转换示例)。该方法可扩展性强,适用于任何预训练语料。
  • results: 研究表明,使用该方法后,我们的 7B 语言模型在生物医学、金融和法律三个领域的多种任务上均获得提升,性能可与规模大得多的领域特定模型(如 BloombergGPT-50B)相当。此外,领域特定的阅读理解文本还能提升模型在通用基准上的表现,显示出构建覆盖更多领域的通用模型的潜力。
    Abstract We explore how continued pre-training on domain-specific corpora influences large language models, revealing that training on the raw corpora endows the model with domain knowledge, but drastically hurts its prompting ability for question answering. Taken inspiration from human learning via reading comprehension--practice after reading improves the ability to answer questions based on the learned knowledge--we propose a simple method for transforming raw corpora into reading comprehension texts. Each raw text is enriched with a series of tasks related to its content. Our method, highly scalable and applicable to any pre-training corpora, consistently enhances performance across various tasks in three different domains: biomedicine, finance, and law. Notably, our 7B language model achieves competitive performance with domain-specific models of much larger scales, such as BloombergGPT-50B. Furthermore, we demonstrate that domain-specific reading comprehension texts can improve the model's performance even on general benchmarks, showing the potential to develop a general model across even more domains. Our model, code, and data will be available at https://github.com/microsoft/LMOps.
    摘要 我们探究了继续在领域特定语料上预训练对大语言模型的影响,发现直接在原始语料上训练能使模型获得领域知识,但会严重损害其通过提示回答问题的能力。启发自人类通过阅读理解学习的方式(阅读后做练习能提高基于所学知识回答问题的能力),我们提议一种简单的方法,将原始语料转化为阅读理解文本:每段原始文本都会被附加一系列与其内容相关的任务。我们的方法可扩展性强,适用于任何预训练语料,并在生物医学、金融和法律三个领域的多种任务上取得稳定提升。尤其是,我们的 7B 语言模型可以与规模大得多的领域特定模型(如 BloombergGPT-50B)取得相当的表现。此外,我们还证明了领域特定的阅读理解文本即使在通用基准上也能提升模型表现,显示出构建覆盖更多领域的通用模型的潜力。我们的模型、代码和数据将会在 https://github.com/microsoft/LMOps 上公开。
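
下面给出“把原始语料转换为阅读理解文本”这一思想的玩具示意:在原文后追加若干填空与摘要式任务。这些模板纯属演示用的假设,论文实际挖掘的任务模式要丰富得多。

```python
import re

def to_reading_comprehension(raw_text: str) -> str:
    """Turn a raw passage into passage + simple comprehension tasks.
    The cloze/summary templates here are illustrative, not the paper's mined patterns."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", raw_text) if s.strip()]
    tasks = []
    for s in sentences:
        head, _, last = s.rstrip(".!?").rpartition(" ")
        if head:  # cloze task: mask the final word of the sentence
            tasks.append(f"Fill in the blank: {head} ____\nAnswer: {last}")
    tasks.append("Task: Summarize the passage in one sentence.")
    return raw_text + "\n\n" + "\n\n".join(tasks)

print(to_reading_comprehension("Aspirin inhibits COX enzymes. It is used to reduce fever."))
```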

Improved Factorized Neural Transducer Model For text-only Domain Adaptation

  • paper_url: http://arxiv.org/abs/2309.09524
  • repo_url: https://github.com/cppan-packages/43c6cb2c61134ec7e23098e41ca6ee7bfe3573342f9f6f196bc095247e062001
  • paper_authors: Junzhe Liu, Jianwei Yu, Xie Chen
  • for: 实现佳绩的语音识别模型,例如神经Transducer,可以同时融合语音和语言信息,但将这些模型调整到仅有文本数据的情况下是困难的。
  • methods: 为了解决这个问题,前人提出了一个称为 factorized neural Transducer(FNT)的模型结构,它引入一个独立的词汇解码器来预测词汇,从而实现传统的文本数据适应。
  • results: 针对 FNT 的不足,我们提出了一个改进的 factorized neural Transducer(IFNT)模型结构,可以实现语音和语言信息的完整融合,并实现有效的文本适应。我们通过对 GigaSpeech 和三个域外数据集(EuroParl、TED-LIUM 和医疗数据集)进行实验,证明了 IFNT 的性能提升:文本适应后,相比带浅融合的标准神经 Transducer,IFNT 可获得 7.9% 至 28.5% 的相对 WER 改善;相比 FNT 模型,在三个测试集上可获得 1.6% 至 8.2% 的相对 WER 改善。
    Abstract End-to-end models, such as the neural Transducer, have been successful in integrating acoustic and linguistic information jointly to achieve excellent recognition performance. However, adapting these models with text-only data is challenging. Factorized neural Transducer (FNT) aims to address this issue by introducing a separate vocabulary decoder to predict the vocabulary, which can effectively perform traditional text data adaptation. Nonetheless, this approach has limitations in fusing acoustic and language information seamlessly. Moreover, a degradation in word error rate (WER) on the general test sets was also observed, leading to doubts about its overall performance. In response to this challenge, we present an improved factorized neural Transducer (IFNT) model structure designed to comprehensively integrate acoustic and language information while enabling effective text adaptation. We evaluate the performance of our proposed methods through in-domain experiments on GigaSpeech and out-of-domain experiments adapting to EuroParl, TED-LIUM, and Medical datasets. After text-only adaptation, IFNT yields 7.9% to 28.5% relative WER improvements over the standard neural Transducer with shallow fusion, and relative WER reductions ranging from 1.6% to 8.2% on the three test sets compared to the FNT model.
    摘要 结合语音和语言信息的端到端模型,如神经 Transducer,已经在识别性能方面取得了出色的成绩。然而,仅用文本数据适配这些模型是一项挑战。factorized neural Transducer(FNT)旨在解决这个问题,它引入了一个独立的词典解码器来预测词语,从而有效实现传统文本数据适应。然而,这种方法在无缝融合语音和语言信息方面存在限制。此外,我们在通用测试集上观察到了单词错误率(WER)的上升,这引发了对其总体性能的忧虑。为了解决这个挑战,我们提出了改进的 factorized neural Transducer(IFNT)模型结构,用于全面融合语音和语言信息,同时允许有效的文本适应。我们通过在 GigaSpeech 上进行域内实验和在 EuroParl、TED-LIUM 和医疗数据集上进行域外适应实验,证明了所提方法的性能优势。在仅文本适应后,相比带浅融合(shallow fusion)的标准神经 Transducer,IFNT 实现了 7.9% 到 28.5% 的相对 WER 改进;相比 FNT 模型,在三个测试集上实现了 1.6% 到 8.2% 的相对 WER 下降。

Understanding Divergent Framing of the Supreme Court Controversies: Social Media vs. News Outlets

  • paper_url: http://arxiv.org/abs/2309.09508
  • repo_url: None
  • paper_authors: Jinsheng Pan, Zichen Wang, Weihong Qi, Hanjia Lyu, Jiebo Luo
  • for: 本研究旨在填补我们对新闻媒体与社交媒体在政治议题框架(framing)上差异认识不足的空白。
  • methods: 作者围绕美国最高法院关于平权法案、学生贷款和堕胎权的一系列裁决,采用定性与定量相结合的方法,对社交媒体与传统媒体的框架差异进行了细致的分析和比较。
  • results: 作者发现,社交媒体与传统媒体的框架虽有部分重叠,但存在实质性差异,尤其体现在极化程度上:社交媒体平台在所有框架类别上都更为极化;传统新闻媒体在学生贷款议题上更趋共识,而在平权法案和堕胎权议题上则更为极化。这些发现对公众舆论的形成、政策决策和更广泛的政治格局具有重要意义。
    Abstract Understanding the framing of political issues is of paramount importance as it significantly shapes how individuals perceive, interpret, and engage with these matters. While prior research has independently explored framing within news media and by social media users, there remains a notable gap in our comprehension of the disparities in framing political issues between these two distinct groups. To address this gap, we conduct a comprehensive investigation, focusing on the nuanced distinctions both qualitatively and quantitatively in the framing of social media and traditional media outlets concerning a series of American Supreme Court rulings on affirmative action, student loans, and abortion rights. Our findings reveal that, while some overlap in framing exists between social media and traditional media outlets, substantial differences emerge both across various topics and within specific framing categories. Compared to traditional news media, social media platforms tend to present more polarized stances across all framing categories. Further, we observe significant polarization in the news media's treatment (i.e., Left vs. Right leaning media) of affirmative action and abortion rights, whereas the topic of student loans tends to exhibit a greater degree of consensus. The disparities in framing between traditional and social media platforms carry significant implications for the formation of public opinion, policy decision-making, and the broader political landscape.
    摘要 理解政治议题的框架(framing)方式至关重要,因为它在很大程度上塑造了个体对这些议题的感知、解读和参与方式。以往的研究分别独立地考察了新闻媒体与社交媒体用户的框架方式,但对于这两类群体在政治议题框架上的差异,我们的认识仍存在明显空白。为填补这一空白,我们围绕美国最高法院关于平权法案、学生贷款和堕胎权的一系列裁决,对社交媒体与传统媒体的框架差异进行了定性与定量兼顾的全面考察。研究结果显示,社交媒体与传统媒体的框架虽有部分重叠,但在不同议题之间以及具体框架类别内部都存在显著差异:与传统新闻媒体相比,社交媒体平台在所有框架类别上都呈现出更为极化的立场;在平权法案和堕胎权议题上,新闻媒体内部(左倾与右倾媒体之间)也表现出明显极化,而学生贷款议题则呈现出更高程度的共识。传统媒体与社交媒体在框架上的差异,对公众舆论的形成、政策决策以及更广泛的政治格局都具有重要影响。

LayoutNUWA: Revealing the Hidden Layout Expertise of Large Language Models

  • paper_url: http://arxiv.org/abs/2309.09506
  • repo_url: https://github.com/projectnuwa/layoutnuwa
  • paper_authors: Zecheng Tang, Chenfei Wu, Juntao Li, Nan Duan
  • for: 该论文主要针对 Graphic layout generation 领域的研究, 它的目的是提高用户参与度和信息吸收。
  • methods: 该论文提出了一种新的 LayoutNUWA 模型,它将布局生成视为代码生成任务,以增强 semantic 信息和利用大型自然语言模型(LLMs)中隐藏的布局专家知识。 该模型包括三个相互连接的模块:1)代码初始化(CI)模块,计算数值条件并将其转换为 HTML 代码中策略性地隐藏的掩码; 2)代码完成(CC)模块,使用形式知识来填充掩码内的部分; 3)代码渲染(CR)模块,将完成后的代码转换为最终的布局输出,确保了高度可读性和透明度的布局生成过程。
  • results: 该论文在多个数据集上取得了最先进(state-of-the-art)性能,提升幅度超过 50%,展示了 LayoutNUWA 模型的强大能力。
    Abstract Graphic layout generation, a growing research field, plays a significant role in user engagement and information perception. Existing methods primarily treat layout generation as a numerical optimization task, focusing on quantitative aspects while overlooking the semantic information of layout, such as the relationship between each layout element. In this paper, we propose LayoutNUWA, the first model that treats layout generation as a code generation task to enhance semantic information and harness the hidden layout expertise of large language models~(LLMs). More concretely, we develop a Code Instruct Tuning (CIT) approach comprising three interconnected modules: 1) the Code Initialization (CI) module quantifies the numerical conditions and initializes them as HTML code with strategically placed masks; 2) the Code Completion (CC) module employs the formatting knowledge of LLMs to fill in the masked portions within the HTML code; 3) the Code Rendering (CR) module transforms the completed code into the final layout output, ensuring a highly interpretable and transparent layout generation procedure that directly maps code to a visualized layout. We attain significant state-of-the-art performance (even over 50\% improvements) on multiple datasets, showcasing the strong capabilities of LayoutNUWA. Our code is available at https://github.com/ProjectNUWA/LayoutNUWA.
    摘要 图形布局生成是一个快速发展的研究领域,对用户参与度和信息感知有着重要作用。现有方法主要将布局生成视为数值优化任务,强调数值方面而忽视布局中的语义信息,如各个元素之间的关系。在这篇论文中,我们提出了 LayoutNUWA,首个将布局生成视为代码生成任务的模型,以增强语义信息并利用大语言模型(LLM)隐藏的布局专家知识。更具体地,我们开发了 Code Instruct Tuning(CIT)方法,包括三个相连的模块:1)Code Initialization(CI)模块量化数值条件,并将其转化为带有策略性放置掩码的 HTML 代码;2)Code Completion(CC)模块利用 LLM 的格式知识填充 HTML 代码中被掩码的部分;3)Code Rendering(CR)模块将补全后的代码转化为最终的布局输出,确保布局生成过程高度可解释且透明,直接将代码映射到可视化的布局。我们在多个数据集上达到了最先进性能(提升超过 50%),展示了 LayoutNUWA 的强大能力。我们的代码可以在 https://github.com/ProjectNUWA/LayoutNUWA 上获取。

Search and Learning for Unsupervised Text Generation

  • paper_url: http://arxiv.org/abs/2309.09497
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Lili Mou
  • for: 这篇论文是为了探讨深度学习技术在人工智能(AI)领域中的文本生成,以及它对应用和社会影响的可能性。
  • methods: 这篇论文使用搜寻和学习方法来实现无监督文本生成:一个启发式目标函数估计候选句子的质量,而离散搜索算法通过最大化该搜索目标来生成句子(下文附一个简化的爬山搜索示例)。一个机器学习模型再从搜索结果中学习,以平滑噪声并提高效率。
  • results: 这篇论文的结果显示,这种搜寻和学习的方法可以实现高品质的文本生成,并且可以减少人类标注努力和处理低资源语言的时间。这种方法具有实际应用和社会影响的重要性,特别是在建立新任务的最小可行产品和减少人类努力的方面。
    Abstract With the advances of deep learning techniques, text generation is attracting increasing interest in the artificial intelligence (AI) community, because of its wide applications and because it is an essential component of AI. Traditional text generation systems are trained in a supervised way, requiring massive labeled parallel corpora. In this paper, I will introduce our recent work on search and learning approaches to unsupervised text generation, where a heuristic objective function estimates the quality of a candidate sentence, and discrete search algorithms generate a sentence by maximizing the search objective. A machine learning model further learns from the search results to smooth out noise and improve efficiency. Our approach is important to the industry for building minimal viable products for a new task; it also has high social impacts for saving human annotation labor and for processing low-resource languages.
    摘要 随着深度学习技术的发展,文本生成在人工智能(AI)社区中吸引了越来越多的关注,因为它应用广泛,并且是 AI 的重要组件。传统的文本生成系统通常以监督方式训练,需要大量标注的平行语料。在这篇论文中,我将介绍我们最近的搜索和学习方法来实现无监督文本生成:一个启发式目标函数估算候选句子的质量,而离散搜索算法通过最大化该搜索目标来生成句子。一个机器学习模型再从搜索结果中学习,以平滑噪声并提高效率。我们的方法对于为新任务快速构建最小可行产品非常重要,同时具有较高的社会影响,因为它可以节省人工标注劳动并帮助处理低资源语言。
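
下面用一个简单的爬山搜索示意“搜索 + 打分”式的无监督文本生成:对句子做随机的替换、删除、插入编辑,只保留能提高启发式目标分数的编辑。论文中的搜索算法与目标函数更为复杂(例如模拟退火以及流畅度、语义保持等打分),这里的打分函数只是占位的玩具目标。

```python
import random

def hill_climb(sentence, score_fn, vocab, steps=200, seed=0):
    """Maximize `score_fn` (a heuristic quality objective) by local word edits:
    randomly replace, delete, or insert a word and keep the edit if the score improves."""
    rng = random.Random(seed)
    best, best_score = sentence[:], score_fn(sentence)
    for _ in range(steps):
        cand = best[:]
        op = rng.choice(["replace", "delete", "insert"])
        pos = rng.randrange(len(cand) + (op == "insert"))
        if op == "replace":
            cand[pos % len(cand)] = rng.choice(vocab)
        elif op == "delete" and len(cand) > 1:
            del cand[pos % len(cand)]
        elif op == "insert":
            cand.insert(pos, rng.choice(vocab))
        s = score_fn(cand)
        if s > best_score:
            best, best_score = cand, s
    return best

# Toy objective: prefer short sentences that keep the keyword "delivery".
score = lambda words: ("delivery" in words) - 0.1 * len(words)
print(hill_climb("the package delivery arrived very very late today".split(), score,
                 vocab=["delivery", "late", "arrived"]))
```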

Investigating Zero- and Few-shot Generalization in Fact Verification

  • paper_url: http://arxiv.org/abs/2309.09444
  • repo_url: https://github.com/teacherpeterpan/fact-checking-generalization
  • paper_authors: Liangming Pan, Yunxiang Zhang, Min-Yen Kan
  • for: 这个研究探索了事实验证(FV)的零样本与少样本泛化,目的是将在高资源领域(例如 Wikipedia)上训练的 FV 模型推广并应用到缺乏人工标注的低资源领域。
  • methods: 我们首先建立了一个 FV 数据集集合,包括 11 个 FV 数据集,代表 6 个领域。我们对这些数据集间的泛化进行了实验分析,发现现有模型的泛化能力很差;分析还揭示了若干影响泛化的因素,包括数据集大小、证据长度和声明类型。
  • results: 最后,我们显示两类工作方向可以提升泛化能力:1)在专门领域上继续预训练以注入领域知识;2)通过声明生成自动构造训练数据。
    Abstract In this paper, we explore zero- and few-shot generalization for fact verification (FV), which aims to generalize the FV model trained on well-resourced domains (e.g., Wikipedia) to low-resourced domains that lack human annotations. To this end, we first construct a benchmark dataset collection which contains 11 FV datasets representing 6 domains. We conduct an empirical analysis of generalization across these FV datasets, finding that current models generalize poorly. Our analysis reveals that several factors affect generalization, including dataset size, length of evidence, and the type of claims. Finally, we show that two directions of work improve generalization: 1) incorporating domain knowledge via pretraining on specialized domains, and 2) automatically generating training data via claim generation.
    摘要 在这篇论文中,我们探讨零和几个shot泛化 для事实验证(FV),目的是将FV模型在具有资源的领域(例如Wikipedia)上训练后,应用到缺乏人工标注的低资源领域。为此,我们首先构建了一个FV数据集合,包含11个FV数据集,代表6个领域。我们进行了FV数据集间的一般化分析,发现现有模型的一般化能力不佳。我们的分析发现一些因素影响一般化,包括数据集大小、证据长度和声明类型。最后,我们表明两种方向的工作可以提高一般化:1)在特殊领域中预训练模型,并2)通过自动生成训练数据来提高模型的一般化能力。

Enhancing Multilingual Speech Recognition through Language Prompt Tuning and Frame-Level Language Adapter

  • paper_url: http://arxiv.org/abs/2309.09443
  • repo_url: None
  • paper_authors: Song Li, Yongbin You, Xuezhi Wang, Ke Ding, Guanglu Wan
  • for: 提高多语言人工智能助手的表现,扩展其应用领域和国际交流
  • methods: 提出两种简单而有效的方法:语言提示调优和帧级语言适配器(下文附一个适配器模块的简化示例),分别用于提升语言可配置与语言无关两种设定下的多语言语音识别性能
  • results: 实验表明,所提方法在七种语言上均提升了语音识别性能,显著改善了多语言智能助手的表现
    Abstract Multilingual intelligent assistants, such as ChatGPT, have recently gained popularity. To further expand the applications of multilingual artificial intelligence assistants and facilitate international communication, it is essential to enhance the performance of multilingual speech recognition, which is a crucial component of speech interaction. In this paper, we propose two simple and parameter-efficient methods: language prompt tuning and frame-level language adapter, to respectively enhance language-configurable and language-agnostic multilingual speech recognition. Additionally, we explore the feasibility of integrating these two approaches using parameter-efficient fine-tuning methods. Our experiments demonstrate significant performance improvements across seven languages using our proposed methods.
    摘要 多语言智能助手,如ChatGPT,在最近受欢迎。为了进一步扩展多语言人工智能助手的应用和国际交流,我们需要提高多语言语音识别的性能,这是语音互动的重要组件。在这篇论文中,我们提出了两种简单和参数效率高的方法:语言提示调整和帧级语言适配器,以提高语言可配置和语言共享的多语言语音识别。此外,我们还探讨了这两种方法的结合使用方式,并使用参数效率的细化调整方法进行评估。我们的实验表明,使用我们提posed方法可以在七种语言中获得显著的性能提升。
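
下面是“帧级语言适配器”的一个最小示意(PyTorch):为每种语言维护一组瓶颈适配器权重,对编码器每一帧特征做残差式变换。维度、瓶颈大小与残差设计均为假设,不一定与论文实现一致。

```python
import torch
import torch.nn as nn

class FrameLevelLanguageAdapter(nn.Module):
    """Bottleneck adapter applied to every encoder frame, with one set of
    adapter weights per language, added residually to the frame features."""
    def __init__(self, dim=256, bottleneck=64, num_languages=7):
        super().__init__()
        self.down = nn.ModuleList([nn.Linear(dim, bottleneck) for _ in range(num_languages)])
        self.up = nn.ModuleList([nn.Linear(bottleneck, dim) for _ in range(num_languages)])

    def forward(self, frames, lang_id):
        # frames: (batch, time, dim); lang_id selects the per-language adapter
        h = torch.relu(self.down[lang_id](frames))
        return frames + self.up[lang_id](h)

adapter = FrameLevelLanguageAdapter()
out = adapter(torch.randn(2, 50, 256), lang_id=3)
print(out.shape)  # torch.Size([2, 50, 256])
```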

cs.LG - 2023-09-18

Causal Theories and Structural Data Representations for Improving Out-of-Distribution Classification

  • paper_url: http://arxiv.org/abs/2309.10211
  • repo_url: None
  • paper_authors: Donald Martin, Jr., David Kinney
  • for: 通过使用人类构建的因果知识来降低机器学习开发者的认知(epistemic)不确定性,从而提高机器学习系统的鲁棒性和安全性。
  • methods: 使用以人为中心的因果理论和动力系统文献中的工具,将数据表示为能显式体现疫情系统数据生成过程中不变结构性因果特征的形式。
  • results: 实验表明,相比更朴素的数据表示方法,使用这种数据表示训练神经网络可以提高分布外(OOD)分类的泛化性能。
    Abstract We consider how human-centered causal theories and tools from the dynamical systems literature can be deployed to guide the representation of data when training neural networks for complex classification tasks. Specifically, we use simulated data to show that training a neural network with a data representation that makes explicit the invariant structural causal features of the data generating process of an epidemic system improves out-of-distribution (OOD) generalization performance on a classification task as compared to a more naive approach to data representation. We take these results to demonstrate that using human-generated causal knowledge to reduce the epistemic uncertainty of ML developers can lead to more well-specified ML pipelines. This, in turn, points to the utility of a dynamical systems approach to the broader effort aimed at improving the robustness and safety of machine learning systems via improved ML system development practices.
    摘要 我们考虑了如何利用以人为中心的因果理论与动力系统文献中的工具,来指导在训练神经网络完成复杂分类任务时的数据表示。具体而言,我们用模拟数据表明:在表示疫情系统数据时显式体现其数据生成过程中不变的结构性因果特征,相比更朴素的数据表示方式,能够提升分类任务的分布外(OOD)泛化性能。这些结果说明,利用人工构建的因果知识来降低机器学习开发者的认知不确定性,可以得到规范性更好的机器学习管线。这也进一步表明,动力系统视角有助于通过改进机器学习系统的开发实践来提升其鲁棒性与安全性。

The Kernel Density Integral Transformation

  • paper_url: http://arxiv.org/abs/2309.10194
  • repo_url: https://github.com/calvinmccarter/kditransform
  • paper_authors: Calvin McCarter
  • for: 本研究旨在提出基于机器学习和统计方法处理表格数据时的特征预处理策略。
  • methods: 本文提议使用核密度积分变换作为特征预处理步骤,该方法将线性最小-最大缩放和分位数变换这两种主流特征预处理方法作为极限情形包含在内(下文附有示意代码)。
  • results: 本研究表明,无需调整超参数,核密度积分变换即可作为上述两种方法的简单替换,并对各自的弱点具有鲁棒性;通过调整单个连续超参数,还常常能超越这两种方法。最后,本文还展示了核密度积分变换在统计数据分析中的益处,特别是在相关性分析和单变量聚类中。
    Abstract Feature preprocessing continues to play a critical role when applying machine learning and statistical methods to tabular data. In this paper, we propose the use of the kernel density integral transformation as a feature preprocessing step. Our approach subsumes the two leading feature preprocessing methods as limiting cases: linear min-max scaling and quantile transformation. We demonstrate that, without hyperparameter tuning, the kernel density integral transformation can be used as a simple drop-in replacement for either method, offering robustness to the weaknesses of each. Alternatively, with tuning of a single continuous hyperparameter, we frequently outperform both of these methods. Finally, we show that the kernel density transformation can be profitably applied to statistical data analysis, particularly in correlation analysis and univariate clustering.
    摘要 在对表格数据应用机器学习和统计方法时,特征预处理仍然扮演着关键角色。本文提议使用核密度积分变换作为特征预处理步骤。该方法将两种主流特征预处理方法——线性最小-最大缩放和分位数变换——作为极限情形包含在内。我们证明,无需超参数调整,核密度积分变换即可作为这两种方法的简单替换,并对各自的弱点具有鲁棒性;而通过调整单个连续超参数,我们还常常能超越这两种方法。最后,我们展示了核密度积分变换可以有效地应用于统计数据分析,特别是相关性分析和单变量聚类。
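
A minimal Python sketch of the kernel-density-integral idea (not the paper's implementation; the Gaussian kernel and the Silverman-style bandwidth rule are assumptions): each feature is mapped through the CDF of a KDE fitted on the training values, so a vanishing bandwidth recovers the quantile transform while a very large bandwidth behaves like a linear rescaling.

```python
# Minimal sketch of a kernel-density-integral feature transform (not the paper's
# official implementation; the Gaussian kernel and the bandwidth rule are assumptions).
import numpy as np
from scipy.stats import norm

def kernel_density_integral_transform(x_train, x_new, bandwidth=None):
    """Map values through the CDF of a Gaussian KDE fitted on x_train.

    bandwidth -> 0 approaches the empirical CDF (quantile transform);
    a very large bandwidth approaches an affine rescaling (min-max-like).
    """
    x_train = np.asarray(x_train, dtype=float)
    if bandwidth is None:
        # Silverman-style rule of thumb (an assumption, not the paper's choice).
        bandwidth = 1.06 * x_train.std() * x_train.size ** (-1 / 5)
    # KDE CDF: average of Gaussian CDFs centered at each training point.
    z = (np.asarray(x_new, dtype=float)[:, None] - x_train[None, :]) / bandwidth
    return norm.cdf(z).mean(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0, 1, 500), rng.normal(6, 0.5, 100)])  # skewed feature
    print(kernel_density_integral_transform(x, np.array([-2.0, 0.0, 6.0, 10.0])))
```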

Stochastic Deep Koopman Model for Quality Propagation Analysis in Multistage Manufacturing Systems

  • paper_url: http://arxiv.org/abs/2309.10193
  • repo_url: None
  • paper_authors: Zhiyi Chen, Harshal Maske, Huanyi Shui, Devesh Upadhyay, Michael Hopka, Joseph Cohen, Xingjian Lai, Xun Huan, Jun Ni
  • for: 这篇研究的目的是为了模型多阶制造系统(MMS)的复杂行为,并使用深度学习方法来实现这个目的。
  • methods: 这篇研究使用了Stochastic Deep Koopman(SDK)框架,利用变分自编码器(VAE)提取关键的质量信息,并使用Koopman算子在潜在空间中传递这些信息。
  • results: 比较研究表明,SDK 模型在预测 MMS 中各阶段产品质量方面的准确性高于其他常用的数据驱动模型。此外,SDK 在随机潜在空间中特有的线性传播性质带来了可追溯性,使得可以实现制程中质量演化的追溯和根本原因分析。
    Abstract The modeling of multistage manufacturing systems (MMSs) has attracted increased attention from both academia and industry. Recent advancements in deep learning methods provide an opportunity to accomplish this task with reduced cost and expertise. This study introduces a stochastic deep Koopman (SDK) framework to model the complex behavior of MMSs. Specifically, we present a novel application of Koopman operators to propagate critical quality information extracted by variational autoencoders. Through this framework, we can effectively capture the general nonlinear evolution of product quality using a transferred linear representation, thus enhancing the interpretability of the data-driven model. To evaluate the performance of the SDK framework, we carried out a comparative study on an open-source dataset. The main findings of this paper are as follows. Our results indicate that SDK surpasses other popular data-driven models in accuracy when predicting stagewise product quality within the MMS. Furthermore, the unique linear propagation property in the stochastic latent space of SDK enables traceability for quality evolution throughout the process, thereby facilitating the design of root cause analysis schemes. Notably, the proposed framework requires minimal knowledge of the underlying physics of production lines. It serves as a virtual metrology tool that can be applied to various MMSs, contributing to the ultimate goal of Zero Defect Manufacturing.
    摘要 多stage制造系统(MMS)的模型化吸引了学术和实践领域的越来越多的关注。现代深度学习方法的提出,为了实现这项任务,成本和专业知识的减少提供了机会。本研究提出了一种随机深度库曼(SDK)框架,用于模型MMS的复杂行为。特别是,我们提出了一种使用Variational Autoencoders提取的重要质量信息的 Koopman 算子应用。通过这种框架,我们可以有效地捕捉产品质量的总非线性演化,使用传输的线性表示,从而提高数据驱动模型的解释性。为评估SDK框架的性能,我们进行了一项比较研究,用于一个开源数据集。研究结果显示,SDK在MMS中预测Stagewise产品质量方面的准确率高于其他流行的数据驱动模型。此外,SDK在随机潜在空间中的特有线性传播性能,可以跟踪产品质量的演化,从而实现质量演化的跟踪和根本分析方案的设计。值得一提的是,提出的方案不需要对制造线的物理基础知识。它可以作为虚拟测量工具,应用于不同的MMS,为无瑕制造做出贡献。

Autoencoder-based Anomaly Detection System for Online Data Quality Monitoring of the CMS Electromagnetic Calorimeter

  • paper_url: http://arxiv.org/abs/2309.10157
  • repo_url: None
  • paper_authors: The CMS ECAL Collaboration
  • for: 该论文是关于CMS实验室中高能物理数据质量监测的研究。
  • methods: 该研究使用了一种基于自编码器的半监督异常检测系统,并利用异常随时间的演化以及探测器响应的空间变化来提升检测性能(重构误差式打分的通用示意见下文代码)。
  • results: 该系统能够高效地检测异常,同时保持非常低的误报率。研究利用2018和2022年LHC对撞数据中发现的异常验证了系统性能,并给出了LHC Run 3 初期该系统首次部署于CMS在线数据质量监测工作流的结果。
    Abstract The CMS detector is a general-purpose apparatus that detects high-energy collisions produced at the LHC. Online Data Quality Monitoring of the CMS electromagnetic calorimeter is a vital operational tool that allows detector experts to quickly identify, localize, and diagnose a broad range of detector issues that could affect the quality of physics data. A real-time autoencoder-based anomaly detection system using semi-supervised machine learning is presented enabling the detection of anomalies in the CMS electromagnetic calorimeter data. A novel method is introduced which maximizes the anomaly detection performance by exploiting the time-dependent evolution of anomalies as well as spatial variations in the detector response. The autoencoder-based system is able to efficiently detect anomalies, while maintaining a very low false discovery rate. The performance of the system is validated with anomalies found in 2018 and 2022 LHC collision data. Additionally, the first results from deploying the autoencoder-based system in the CMS online Data Quality Monitoring workflow during the beginning of Run 3 of the LHC are presented, showing its ability to detect issues missed by the existing system.
    摘要 CMS探测器是一台通用实验装置,用于探测LHC产生的高能对撞。CMS电磁量能器的在线数据质量监测是一项关键的运行工具,可让探测器专家快速识别、定位并诊断可能影响物理数据质量的各类探测器问题。本文提出一套基于自编码器、采用半监督机器学习的实时异常检测系统,用于检测CMS电磁量能器数据中的异常。我们引入一种新方法,通过利用异常随时间的演化以及探测器响应的空间变化来最大化异常检测性能。该系统能够高效地检测异常,同时保持非常低的误报率。我们利用2018和2022年LHC对撞数据中发现的异常验证了系统性能,并给出了LHC Run 3 初期该系统首次部署于CMS在线数据质量监测工作流的结果,表明其能够发现现有系统遗漏的问题。
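
A generic illustration of the reconstruction-error idea behind autoencoder anomaly detection; this is not the CMS ECAL system, and the toy data, model size, and threshold rule are assumptions.

```python
# Generic illustration of autoencoder-style anomaly scoring via reconstruction error.
# This is NOT the CMS ECAL system; the tiny model, data, and threshold rule are assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
normal = rng.normal(loc=1.0, scale=0.1, size=(2000, 16))      # "good" detector readouts
anomalous = normal.copy()
anomalous[:, 3] = 0.0                                          # e.g. a dead channel

# An MLP trained to reproduce its input acts as a simple autoencoder (4-unit bottleneck).
ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=2000, random_state=0)
ae.fit(normal, normal)

def score(x):
    # Per-sample mean squared reconstruction error.
    return ((ae.predict(x) - x) ** 2).mean(axis=1)

threshold = np.quantile(score(normal), 0.999)    # keep the false-discovery rate low
print("flagged anomalies:", int((score(anomalous[:10]) > threshold).sum()), "/ 10")
```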

Realistic Website Fingerprinting By Augmenting Network Trace

  • paper_url: http://arxiv.org/abs/2309.10147
  • repo_url: https://github.com/spin-umass/realistic-website-fingerprinting-by-augmenting-network-traces
  • paper_authors: Alireza Bahramali, Ardavan Bozorgi, Amir Houmansadr
  • for: 本研究旨在提高网站指纹识别(WF)攻击在现实条件下的实用性,并质疑现有WF攻击方法在设计与评估中所作的假设。
  • methods: 本研究提出了针对Tor流量特点的网络流量增强技术(NetAugment),帮助WF攻击者在未观测过的网络条件下进行识别;具体而言,我们通过半监督和自监督学习技术来实例化NetAugment。
  • results: 实验结果表明,使用增强后的网络流量进行训练可以提升WF攻击的准确率。例如,在封闭世界、5-shot 学习的设定下,当评估流量采集自攻击者未观测过的环境时,我们的自监督WF攻击(NetCLR)达到了80%的准确率,而现有的Triplet Fingerprinting方法仅为64.4%。
    Abstract Website Fingerprinting (WF) is considered a major threat to the anonymity of Tor users (and other anonymity systems). While state-of-the-art WF techniques have claimed high attack accuracies, e.g., by leveraging Deep Neural Networks (DNN), several recent works have questioned the practicality of such WF attacks in the real world due to the assumptions made in the design and evaluation of these attacks. In this work, we argue that such impracticality issues are mainly due to the attacker's inability in collecting training data in comprehensive network conditions, e.g., a WF classifier may be trained only on samples collected on specific high-bandwidth network links but deployed on connections with different network conditions. We show that augmenting network traces can enhance the performance of WF classifiers in unobserved network conditions. Specifically, we introduce NetAugment, an augmentation technique tailored to the specifications of Tor traces. We instantiate NetAugment through semi-supervised and self-supervised learning techniques. Our extensive open-world and close-world experiments demonstrate that under practical evaluation settings, our WF attacks provide superior performances compared to the state-of-the-art; this is due to their use of augmented network traces for training, which allows them to learn the features of target traffic in unobserved settings. For instance, with a 5-shot learning in a closed-world scenario, our self-supervised WF attack (named NetCLR) reaches up to 80% accuracy when the traces for evaluation are collected in a setting unobserved by the WF adversary. This is compared to an accuracy of 64.4% achieved by the state-of-the-art Triplet Fingerprinting [35]. We believe that the promising results of our work can encourage the use of network trace augmentation in other types of network traffic analysis.
    摘要

A Geometric Framework for Neural Feature Learning

  • paper_url: http://arxiv.org/abs/2309.10140
  • repo_url: https://github.com/xiangxiangxu/nfe
  • paper_authors: Xiangxiang Xu, Lizhong Zheng
  • for: 该框架用于基于神经特征提取器的学习系统设计,利用特征空间中的几何结构。
  • methods: 该框架借助特征几何,将每个学习问题表述为对学习设定所指定的依赖成分求最优特征逼近,并提出一种嵌套技术从数据样本中学习最优特征,可处理条件推断和多模态学习等多变量学习问题。
  • results: 该嵌套技术可直接应用于现成的网络架构和优化器,并揭示了最优特征与经典方法之间的联系。
    Abstract We present a novel framework for learning system design based on neural feature extractors by exploiting geometric structures in feature spaces. First, we introduce the feature geometry, which unifies statistical dependence and features in the same functional space with geometric structures. By applying the feature geometry, we formulate each learning problem as solving the optimal feature approximation of the dependence component specified by the learning setting. We propose a nesting technique for designing learning algorithms to learn the optimal features from data samples, which can be applied to off-the-shelf network architectures and optimizers. To demonstrate the application of the nesting technique, we further discuss multivariate learning problems, including conditioned inference and multimodal learning, where we present the optimal features and reveal their connections to classical approaches.
    摘要 我们提出了一种新的系统设计框架,基于神经特征提取器来利用特征空间的几何结构。首先,我们介绍了特征几何,这将统一统计依赖和特征在同一功能空间中的几何结构。通过应用特征几何,我们将每个学习问题解释为在依赖组件指定的学习设置中解决最佳特征近似问题。我们提议一种嵌套技术,用于从数据样本中学习最佳特征,可以应用于现有的网络架构和优化器。为了证明嵌套技术的应用,我们进一步讨论了多变量学习问题,包括受条件推理和多模态学习,并介绍了最佳特征和其与经典方法的连接。

Deep smoothness WENO scheme for two-dimensional hyperbolic conservation laws: A deep learning approach for learning smoothness indicators

  • paper_url: http://arxiv.org/abs/2309.10117
  • repo_url: None
  • paper_authors: Tatiana Kossaczká, Ameya D. Jagtap, Matthias Ehrhardt
  • for: 提高二维欧拉方程数值解的精度,特别是在激波与稀疏波附近
  • methods: 通过训练一个紧凑的神经网络来调整加权本质无振荡(WENO)格式中的光滑性指标,从而提高数值解的准确性(经典光滑性指标的示意代码见下文)
  • results: 在多个文献测试问题上,新方法的精度高于传统的五阶WENO格式,尤其是在数值解于激波附近出现过度耗散或过冲的情况下。
    Abstract In this paper, we introduce an improved version of the fifth-order weighted essentially non-oscillatory (WENO) shock-capturing scheme by incorporating deep learning techniques. The established WENO algorithm is improved by training a compact neural network to adjust the smoothness indicators within the WENO scheme. This modification enhances the accuracy of the numerical results, particularly near abrupt shocks. Unlike previous deep learning-based methods, no additional post-processing steps are necessary for maintaining consistency. We demonstrate the superiority of our new approach using several examples from the literature for the two-dimensional Euler equations of gas dynamics. Through intensive study of these test problems, which involve various shocks and rarefaction waves, the new technique is shown to outperform traditional fifth-order WENO schemes, especially in cases where the numerical solutions exhibit excessive diffusion or overshoot around shocks.
    摘要 本文通过引入深度学习技术,提出了五阶加权本质无振荡(WENO)激波捕捉格式的改进版本。我们训练一个紧凑的神经网络来调整WENO格式中的光滑性指标,从而改进既有的WENO算法。这一修改提高了数值结果的精度,尤其是在激波附近。与以往基于深度学习的方法不同,本方法无需额外的后处理步骤即可保持数值一致性。我们利用文献中二维气体动力学欧拉方程的多个算例展示了新方法的优越性。对这些包含各类激波与稀疏波的测试问题的深入研究表明,新方法优于传统的五阶WENO格式,尤其是在数值解出现过度耗散或在激波附近过冲的情形。
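
For reference, the classical fifth-order WENO-JS smoothness indicators and nonlinear weights that the paper's network learns to adjust; the `correction` hook below is a placeholder for such a learned adjustment, not the trained model.

```python
# Classic fifth-order WENO-JS reconstruction weights, with a hook where a learned
# correction to the smoothness indicators (as proposed in the paper) could be applied.
# The identity "correction" below is a placeholder, not the trained network.
import numpy as np

def weno5_weights(f, correction=lambda beta: beta, eps=1e-6):
    """f: stencil values (f_{i-2}, f_{i-1}, f_i, f_{i+1}, f_{i+2})."""
    fm2, fm1, f0, fp1, fp2 = f
    beta = np.array([
        13/12 * (fm2 - 2*fm1 + f0)**2 + 1/4 * (fm2 - 4*fm1 + 3*f0)**2,
        13/12 * (fm1 - 2*f0 + fp1)**2 + 1/4 * (fm1 - fp1)**2,
        13/12 * (f0 - 2*fp1 + fp2)**2 + 1/4 * (3*f0 - 4*fp1 + fp2)**2,
    ])
    beta = correction(beta)            # deep-smoothness idea: adjust the indicators
    d = np.array([0.1, 0.6, 0.3])      # ideal (linear) weights
    alpha = d / (eps + beta)**2
    return alpha / alpha.sum()

# Near a discontinuity the weight concentrates on the smooth stencil.
print(weno5_weights(np.array([1.0, 1.0, 1.0, 0.0, 0.0])))
```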

A Semi-Supervised Approach for Power System Event Identification

  • paper_url: http://arxiv.org/abs/2309.10095
  • repo_url: None
  • paper_authors: Nima Taghipourbazargani, Lalitha Sankar, Oliver Kosut
  • for: 提高电力系统可靠性、安全性和稳定性,使用数据科学技术进行数据驱动事件识别。
  • methods: 使用 semi-supervised 学习技术,利用标注和无标注样本进行事件识别。
  • results: 对四类事件的识别性能显著提高,与只使用少量标注样本相比, graph-based LS 方法表现最佳。
    Abstract Event identification is increasingly recognized as crucial for enhancing the reliability, security, and stability of the electric power system. With the growing deployment of Phasor Measurement Units (PMUs) and advancements in data science, there are promising opportunities to explore data-driven event identification via machine learning classification techniques. However, obtaining accurately-labeled eventful PMU data samples remains challenging due to its labor-intensive nature and uncertainty about the event type (class) in real-time. Thus, it is natural to use semi-supervised learning techniques, which make use of both labeled and unlabeled samples. %We propose a novel semi-supervised framework to assess the effectiveness of incorporating unlabeled eventful samples to enhance existing event identification methodologies. We evaluate three categories of classical semi-supervised approaches: (i) self-training, (ii) transductive support vector machines (TSVM), and (iii) graph-based label spreading (LS) method. Our approach characterizes events using physically interpretable features extracted from modal analysis of synthetic eventful PMU data. In particular, we focus on the identification of four event classes whose identification is crucial for grid operations. We have developed and publicly shared a comprehensive Event Identification package which consists of three aspects: data generation, feature extraction, and event identification with limited labels using semi-supervised methodologies. Using this package, we generate and evaluate eventful PMU data for the South Carolina synthetic network. Our evaluation consistently demonstrates that graph-based LS outperforms the other two semi-supervised methods that we consider, and can noticeably improve event identification performance relative to the setting with only a small number of labeled samples.
    摘要 事件识别对于提升电力系统的可靠性、安全性与稳定性正变得日益重要。随着相量测量单元(PMU)的大规模部署以及数据科学的进步,利用机器学习分类技术进行数据驱动的事件识别展现出良好前景。然而,获取准确标注的含事件PMU数据样本仍然困难:标注工作量大,且实时情况下事件类型往往不确定。因此,很自然地可以采用同时利用有标注和无标注样本的半监督学习技术。我们提出一个半监督框架,用于评估引入无标注含事件样本对现有事件识别方法的增强效果,并评估三类经典半监督方法:(i)自训练,(ii)直推式支持向量机(TSVM),(iii)基于图的标签传播(LS)。我们的方法利用从合成含事件PMU数据的模态分析中提取的、具有物理可解释性的特征来刻画事件,并重点识别对电网运行至关重要的四类事件。我们开发并公开了一个完整的事件识别软件包,涵盖数据生成、特征提取以及基于半监督方法的少标注事件识别三个方面。利用该软件包,我们为南卡罗来纳合成电网生成并评估了含事件PMU数据。评估结果一致表明,基于图的LS方法优于另外两种半监督方法,并且相比仅使用少量标注样本的设定,能够显著提升事件识别性能。

Invariant Probabilistic Prediction

  • paper_url: http://arxiv.org/abs/2309.10083
  • repo_url: https://github.com/alexanderhenzi/ipp
  • paper_authors: Alexander Henzi, Xinwei Shen, Michael Law, Peter Bühlmann
  • for: 本研究旨在探讨在数据分布变化下,使用统计方法实现robust性和不变性。
  • methods: 该文使用了一种 causality-inspired 框架,研究了probabilistic predictions 的不变性和robust性,并提出了一种可以在不同数据分布下实现不变性的方法。
  • results: 研究发现,与点预测的情形不同,一般而言任意的分布偏移并不允许得到既不变又稳健的概率预测;文章说明了如何选择评估指标并限制分布偏移的类别以保证可识别性与不变性,提出了得到不变概率预测的方法IPP,并在模拟数据和单细胞数据上进行了验证。(评分规则的示意代码见下文)
    Abstract In recent years, there has been a growing interest in statistical methods that exhibit robust performance under distribution changes between training and test data. While most of the related research focuses on point predictions with the squared error loss, this article turns the focus towards probabilistic predictions, which aim to comprehensively quantify the uncertainty of an outcome variable given covariates. Within a causality-inspired framework, we investigate the invariance and robustness of probabilistic predictions with respect to proper scoring rules. We show that arbitrary distribution shifts do not, in general, admit invariant and robust probabilistic predictions, in contrast to the setting of point prediction. We illustrate how to choose evaluation metrics and restrict the class of distribution shifts to allow for identifiability and invariance in the prototypical Gaussian heteroscedastic linear model. Motivated by these findings, we propose a method to yield invariant probabilistic predictions, called IPP, and study the consistency of the underlying parameters. Finally, we demonstrate the empirical performance of our proposed procedure on simulated as well as on single-cell data.
    摘要 近年来,人们对在训练数据与测试数据之间发生分布变化时仍能保持稳健表现的统计方法越来越感兴趣。相关研究大多关注基于平方误差损失的点预测,而本文将焦点转向概率预测,其目标是在给定协变量的条件下全面量化结果变量的不确定性。在一个受因果启发的框架下,我们研究概率预测在恰当评分规则意义下的不变性与稳健性。我们证明,与点预测的情形不同,一般而言任意的分布偏移并不允许得到既不变又稳健的概率预测。我们以高斯异方差线性模型为原型,说明如何选择评估指标并限制分布偏移的类别,以保证可识别性与不变性。基于这些发现,我们提出了一种得到不变概率预测的方法IPP,并研究了其底层参数的一致性。最后,我们在模拟数据和单细胞数据上展示了所提方法的实证表现。
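
A small sketch of evaluating probabilistic predictions with a proper scoring rule, here the closed-form CRPS of a Gaussian predictive distribution on a heteroscedastic linear model; this only illustrates the evaluation side, not the paper's IPP procedure.

```python
# Evaluating probabilistic predictions with a proper scoring rule: closed-form CRPS
# of a Gaussian predictive distribution (illustrative; not the paper's IPP estimator).
import numpy as np
from scipy.stats import norm

def crps_gaussian(y, mu, sigma):
    """CRPS of N(mu, sigma^2) at observation y (lower is better)."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

# Heteroscedastic Gaussian linear model: y = 1.5 x + eps, sd(eps) grows with |x|.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 1000)
y = 1.5 * x + rng.normal(scale=0.5 + 0.5 * np.abs(x))
well_specified = crps_gaussian(y, 1.5 * x, 0.5 + 0.5 * np.abs(x)).mean()
constant_sigma = crps_gaussian(y, 1.5 * x, 1.0).mean()
print(f"CRPS heteroscedastic: {well_specified:.3f}  vs constant-variance: {constant_sigma:.3f}")
```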

A Unifying Perspective on Non-Stationary Kernels for Deeper Gaussian Processes

  • paper_url: http://arxiv.org/abs/2309.10068
  • repo_url: None
  • paper_authors: Marcus M. Noack, Hengrui Luo, Mark D. Risser
  • for: 本文旨在帮助机器学习实践者更好地理解高斯过程(Gaussian Process)中最常见的几种非平稳性形式,并提出一种新的核函数,以改进非平稳场景下的预测与不确定性量化。
  • methods: 本文在代表性数据集上演示多种常见的非平稳核函数,仔细研究并比较它们的性质,以总结各自的优点和缺点。
  • results: 基于对不同数据集和核函数的大量实验与比较,本文总结了若干非平稳核函数的优缺点,并据此提出一种结合已有核函数优点的新核函数。
    Abstract The Gaussian process (GP) is a popular statistical technique for stochastic function approximation and uncertainty quantification from data. GPs have been adopted into the realm of machine learning in the last two decades because of their superior prediction abilities, especially in data-sparse scenarios, and their inherent ability to provide robust uncertainty estimates. Even so, their performance highly depends on intricate customizations of the core methodology, which often leads to dissatisfaction among practitioners when standard setups and off-the-shelf software tools are being deployed. Arguably the most important building block of a GP is the kernel function which assumes the role of a covariance operator. Stationary kernels of the Mat\'ern class are used in the vast majority of applied studies; poor prediction performance and unrealistic uncertainty quantification are often the consequences. Non-stationary kernels show improved performance but are rarely used due to their more complicated functional form and the associated effort and expertise needed to define and tune them optimally. In this perspective, we want to help ML practitioners make sense of some of the most common forms of non-stationarity for Gaussian processes. We show a variety of kernels in action using representative datasets, carefully study their properties, and compare their performances. Based on our findings, we propose a new kernel that combines some of the identified advantages of existing kernels.
    摘要 高斯过程(GP)是一种广泛使用的统计技术,用于随机函数逼近和基于数据的不确定性量化。过去二十年间,GP因其出色的预测能力(尤其是在数据稀缺的情形下)以及天然提供稳健不确定性估计的能力而被机器学习领域采纳。然而,其性能在很大程度上取决于对核心方法的精细定制,这常使实践者在使用标准设置和现成软件工具时感到不满。核函数是GP中最重要的构建单元,扮演协方差算子的角色。绝大多数应用研究使用Matérn类的平稳核函数,其后果往往是预测性能欠佳以及不切实际的不确定性量化。非平稳核函数可以带来更好的表现,但由于函数形式更复杂、定义和调优所需的精力与专业知识更多而很少被使用。在这篇观点文章中,我们希望帮助机器学习实践者理解高斯过程中一些最常见的非平稳性形式。我们在代表性数据集上演示多种核函数,仔细研究它们的性质并比较其表现。基于这些发现,我们提出一种结合现有核函数若干优点的新核函数。

Dual Student Networks for Data-Free Model Stealing

  • paper_url: http://arxiv.org/abs/2309.10058
  • repo_url: None
  • paper_authors: James Beetham, Navid Kardan, Ajmal Mian, Mubarak Shah
  • for: 改进无数据(data-free)条件下的模型窃取攻击
  • methods: 提出一种双学生(dual student)方法,对称地训练两个学生模型,为生成器提供生成样本的准则:通过两个学生的分歧来鼓励生成器探索更广的输入空间,同时利用学生模型的梯度间接估计目标模型的梯度(示意代码见下文)
  • results: 实验结果表明,该方法能更准确地估计目标模型的梯度,在基准分类数据集上取得更高的准确率,并在查询效率与训练计算成本之间取得平衡;同时,它也是比现有无数据模型窃取方法更好的迁移式对抗攻击代理模型
    Abstract Existing data-free model stealing methods use a generator to produce samples in order to train a student model to match the target model outputs. To this end, the two main challenges are estimating gradients of the target model without access to its parameters, and generating a diverse set of training samples that thoroughly explores the input space. We propose a Dual Student method where two students are symmetrically trained in order to provide the generator a criterion to generate samples that the two students disagree on. On one hand, disagreement on a sample implies at least one student has classified the sample incorrectly when compared to the target model. This incentive towards disagreement implicitly encourages the generator to explore more diverse regions of the input space. On the other hand, our method utilizes gradients of student models to indirectly estimate gradients of the target model. We show that this novel training objective for the generator network is equivalent to optimizing a lower bound on the generator's loss if we had access to the target model gradients. We show that our new optimization framework provides more accurate gradient estimation of the target model and better accuracies on benchmark classification datasets. Additionally, our approach balances improved query efficiency with training computation cost. Finally, we demonstrate that our method serves as a better proxy model for transfer-based adversarial attacks than existing data-free model stealing methods.
    摘要 现有的无数据模型窃取方法使用生成器生成样本,以训练学生模型去匹配目标模型的输出。其中两大挑战是:在无法访问目标模型参数的情况下估计其梯度,以及生成能充分覆盖输入空间的多样化训练样本。我们提出一种双学生方法,对称地训练两个学生模型,从而为生成器提供一个准则:生成令两个学生产生分歧的样本。一方面,两个学生在某个样本上产生分歧意味着至少有一个学生相对于目标模型将其分类错误,这种对分歧的激励隐式地促使生成器探索输入空间中更多样的区域;另一方面,我们利用学生模型的梯度来间接估计目标模型的梯度。我们证明,这一新的生成器训练目标等价于在可访问目标模型梯度的情况下优化生成器损失的一个下界。实验表明,新的优化框架能够更准确地估计目标模型的梯度,并在基准分类数据集上取得更高的准确率;同时,我们的方法在查询效率与训练计算成本之间取得平衡。最后,我们证明该方法作为迁移式对抗攻击的代理模型优于现有的无数据模型窃取方法。
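
A toy PyTorch sketch of the dual-student objective: the generator is rewarded for samples on which the two students disagree, while both students imitate the black-box target. Architectures, losses and hyperparameters are illustrative assumptions, not the authors' code.

```python
# Toy sketch of the dual-student idea: the generator is rewarded when the two students
# disagree, while each student is trained to match the (black-box) target model.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_classes = 32, 10
generator = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, dim))
student_a = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, n_classes))
student_b = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, n_classes))
target = nn.Sequential(nn.Linear(dim, n_classes))            # stand-in for the victim model

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_s = torch.optim.Adam(list(student_a.parameters()) + list(student_b.parameters()), lr=1e-3)

for step in range(200):
    # 1) Generator step: produce samples the two students disagree on.
    x = generator(torch.randn(64, 16))
    disagreement = F.l1_loss(F.softmax(student_a(x), -1), F.softmax(student_b(x), -1))
    opt_g.zero_grad(); (-disagreement).backward(); opt_g.step()

    # 2) Student step: query the target on generated samples and imitate its labels.
    with torch.no_grad():
        x = generator(torch.randn(64, 16))
        labels = target(x).argmax(-1)                          # hard labels from the black box
    loss_s = F.cross_entropy(student_a(x), labels) + F.cross_entropy(student_b(x), labels)
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()
```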

Actively Learning Reinforcement Learning: A Stochastic Optimal Control Approach

  • paper_url: http://arxiv.org/abs/2309.10831
  • repo_url: None
  • paper_authors: Mohammad S. Ramadan, Mahmoud A. Hayajnh, Michael T. Tolley, Kyriakos G. Vamvoudakis
  • for: 这篇论文旨在解决两个问题:(一)由于受控实验室/仿真环境与现实条件不匹配所引起的模型不确定性,使强化学习变得脆弱;(二)随机最优控制的计算成本过高。
  • methods: 论文利用强化学习求解随机动态规划方程,从而得到一个对多类约束安全的强化学习控制器,能够主动学习模型不确定性;试探与安全由控制器自身自动完成,实现实时学习。
  • results: 仿真示例表明,该控制器能够在保证安全约束的同时实时学习并适应模型不确定性。
    Abstract In this paper we provide framework to cope with two problems: (i) the fragility of reinforcement learning due to modeling uncertainties because of the mismatch between controlled laboratory/simulation and real-world conditions and (ii) the prohibitive computational cost of stochastic optimal control. We approach both problems by using reinforcement learning to solve the stochastic dynamic programming equation. The resulting reinforcement learning controller is safe with respect to several types of constraints constraints and it can actively learn about the modeling uncertainties. Unlike exploration and exploitation, probing and safety are employed automatically by the controller itself, resulting real-time learning. A simulation example demonstrates the efficacy of the proposed approach.
    摘要 在这篇论文中,我们提出一个框架来应对两个问题:(i)由于受控实验室/仿真条件与现实条件不匹配所带来的模型不确定性,使强化学习变得脆弱;(ii)随机最优控制的计算成本过高。我们通过使用强化学习求解随机动态规划方程来同时处理这两个问题,所得到的强化学习控制器对多种类型的约束是安全的,并且能够主动学习模型不确定性。与显式的探索-利用不同,试探与安全由控制器自身自动完成,从而实现实时学习。一个仿真示例展示了所提方法的有效性。

A Modular Spatial Clustering Algorithm with Noise Specification

  • paper_url: http://arxiv.org/abs/2309.10047
  • repo_url: None
  • paper_authors: Akhil K, Srikanth H R
  • for: 提高 clustering 算法的精度和快速性,以便更好地处理数据挖掘、机器学习和模式识别等领域中的数据分类问题。
  • methods: 基于细菌园的生长模型,通过控制细菌的生长和消耗来实现理想的分组准则。模块化设计,可以根据具体任务和数据分布创建特定版本的算法。还提供了适当减少噪声的功能。
  • results: 提出了一种新的聚类算法——细菌园算法(Bacteria-Farm),在性能与寻找最优参数的难易程度之间取得平衡;与其他聚类算法不同,该算法还允许指定聚类时需要排除的噪声比例。
    Abstract Clustering techniques have been the key drivers of data mining, machine learning and pattern recognition for decades. One of the most popular clustering algorithms is DBSCAN due to its high accuracy and noise tolerance. Many superior algorithms such as DBSCAN have input parameters that are hard to estimate. Therefore, finding those parameters is a time consuming process. In this paper, we propose a novel clustering algorithm Bacteria-Farm, which balances the performance and ease of finding the optimal parameters for clustering. Bacteria- Farm algorithm is inspired by the growth of bacteria in closed experimental farms - their ability to consume food and grow - which closely represents the ideal cluster growth desired in clustering algorithms. In addition, the algorithm features a modular design to allow the creation of versions of the algorithm for specific tasks / distributions of data. In contrast with other clustering algorithms, our algorithm also has a provision to specify the amount of noise to be excluded during clustering.
    摘要 数十年来,聚类技术一直是数据挖掘、机器学习和模式识别领域的关键驱动力。DBSCAN因其高准确性和对噪声的容忍度而成为最受欢迎的聚类算法之一。然而,许多诸如DBSCAN这样的优秀算法都存在难以估计的输入参数,寻找这些参数是一个耗时的过程。在本文中,我们提出一种新的聚类算法——细菌园(Bacteria-Farm),在性能与寻找最优聚类参数的难易程度之间取得平衡。细菌园算法的灵感来自封闭实验农场中细菌的生长——它们摄取养分并增殖的方式与聚类算法中理想的簇生长过程十分相似。此外,该算法采用模块化设计,可针对特定任务或数据分布构建相应版本。与其他聚类算法不同,我们的算法还允许指定聚类时需要排除的噪声比例。

A Multi-Token Coordinate Descent Method for Semi-Decentralized Vertical Federated Learning

  • paper_url: http://arxiv.org/abs/2309.09977
  • repo_url: None
  • paper_authors: Pedro Valdeira, Yuejie Chi, Cláudia Soares, João Xavier
  • for: 提出多令牌坐标下降(Multi-Token Coordinate Descent, MTCD)算法,用于半去中心化的垂直联邦学习,提升通信效率
  • methods: 在每个客户端仅持有一小部分特征的设定下,同时利用客户端-服务器和客户端-客户端通信,可视为一种并行的马尔可夫链(块)坐标下降算法(垂直切分下的块坐标下降示意代码见下文)
  • results: 当令牌在互不相交的客户端子集上游走时,对非凸目标可获得 $\mathcal{O}(1/T)$ 的收敛速率;对凸目标,令牌亦可在可能重叠的子集上游走,且并行通信量可调
    Abstract Communication efficiency is a major challenge in federated learning (FL). In client-server schemes, the server constitutes a bottleneck, and while decentralized setups spread communications, they do not necessarily reduce them due to slower convergence. We propose Multi-Token Coordinate Descent (MTCD), a communication-efficient algorithm for semi-decentralized vertical federated learning, exploiting both client-server and client-client communications when each client holds a small subset of features. Our multi-token method can be seen as a parallel Markov chain (block) coordinate descent algorithm and it subsumes the client-server and decentralized setups as special cases. We obtain a convergence rate of $\mathcal{O}(1/T)$ for nonconvex objectives when tokens roam over disjoint subsets of clients and for convex objectives when they roam over possibly overlapping subsets. Numerical results show that MTCD improves the state-of-the-art communication efficiency and allows for a tunable amount of parallel communications.
    摘要 通信效率是联邦学习(FL)的主要挑战。在客户端-服务器架构中,服务器是瓶颈;而去中心化架构虽然分散了通信,却可能因为收敛更慢而无法真正减少通信量。我们提出多令牌坐标下降(MTCD)算法,一种面向半去中心化垂直联邦学习的通信高效算法,在每个客户端仅持有一小部分特征的情形下同时利用客户端-服务器和客户端-客户端通信。该多令牌方法可视为一种并行的马尔可夫链(块)坐标下降算法,并将客户端-服务器与去中心化架构作为特例包含在内。当令牌在互不相交的客户端子集上游走时,我们对非凸目标得到 $\mathcal{O}(1/T)$ 的收敛速率;对凸目标,令牌亦可在可能重叠的子集上游走。数值实验表明,MTCD在通信效率上超越了当前最优水平,并允许调节并行通信的数量。
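
A toy sketch of block coordinate descent on a vertically partitioned least-squares problem, where each "client" owns a block of features and updates only its own coefficients; MTCD's multiple tokens and client-client communication pattern are not modeled here.

```python
# Toy block coordinate descent on a vertically partitioned least-squares problem:
# each "client" owns a block of features and updates only its own coefficients.
import numpy as np

rng = np.random.default_rng(0)
n = 200
blocks = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 10)]  # feature split across 3 clients
X = rng.normal(size=(n, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.01 * rng.normal(size=n)

w = np.zeros(10)
for it in range(30):
    for j in blocks:                                   # a single token visiting one client at a time
        r = y - X @ w + X[:, j] @ w[j]                 # residual excluding this client's block
        w[j] = np.linalg.lstsq(X[:, j], r, rcond=None)[0]   # exact block update

print("objective:", float(0.5 * np.sum((y - X @ w) ** 2)))
print("max coefficient error:", float(np.abs(w - w_true).max()))
```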

Des-q: a quantum algorithm to construct and efficiently retrain decision trees for regression and binary classification

  • paper_url: http://arxiv.org/abs/2309.09976
  • repo_url: None
  • paper_authors: Niraj Kumar, Romina Yalovetzky, Changhao Li, Pierre Minssen, Marco Pistoia
  • for: This paper proposes a novel quantum algorithm for constructing and retraining decision trees in regression and binary classification tasks, with the goal of significantly reducing the time required for tree retraining.
  • methods: The proposed algorithm, named Des-q, uses quantum-accessible memory to efficiently estimate feature weights and perform k-piecewise linear tree splits at each internal node. It also employs a quantum-supervised clustering method based on the q-means algorithm to determine the k suitable anchor points for these splits.
  • results: The simulated version of the Des-q algorithm is benchmarked against the state-of-the-art classical decision tree for regression and binary classification on multiple data sets with numerical features, and is shown to exhibit similar performance while significantly speeding up the periodic tree retraining.
    Abstract Decision trees are widely used in machine learning due to their simplicity in construction and interpretability. However, as data sizes grow, traditional methods for constructing and retraining decision trees become increasingly slow, scaling polynomially with the number of training examples. In this work, we introduce a novel quantum algorithm, named Des-q, for constructing and retraining decision trees in regression and binary classification tasks. Assuming the data stream produces small increments of new training examples, we demonstrate that our Des-q algorithm significantly reduces the time required for tree retraining, achieving a poly-logarithmic time complexity in the number of training examples, even accounting for the time needed to load the new examples into quantum-accessible memory. Our approach involves building a decision tree algorithm to perform k-piecewise linear tree splits at each internal node. These splits simultaneously generate multiple hyperplanes, dividing the feature space into k distinct regions. To determine the k suitable anchor points for these splits, we develop an efficient quantum-supervised clustering method, building upon the q-means algorithm of Kerenidis et al. Des-q first efficiently estimates each feature weight using a novel quantum technique to estimate the Pearson correlation. Subsequently, we employ weighted distance estimation to cluster the training examples in k disjoint regions and then proceed to expand the tree using the same procedure. We benchmark the performance of the simulated version of our algorithm against the state-of-the-art classical decision tree for regression and binary classification on multiple data sets with numerical features. Further, we showcase that the proposed algorithm exhibits similar performance to the state-of-the-art decision tree while significantly speeding up the periodic tree retraining.
    摘要 决策树因构造简单、可解释性强而在机器学习中被广泛使用。然而,随着数据规模的增长,传统的决策树构建与重训练方法变得越来越慢,其时间随训练样本数呈多项式增长。在本工作中,我们提出一种新的量子算法Des-q,用于在回归与二分类任务中构建和重训练决策树。假设数据流以小批量形式不断产生新的训练样本,我们证明Des-q能显著缩短树的重训练时间:即便计入将新样本载入量子可访问存储器所需的时间,其时间复杂度对训练样本数也仅为聚对数级。我们的方法在每个内部节点执行k段分段线性的树划分,这些划分同时生成多个超平面,将特征空间划分为k个不同区域。为确定这些划分所需的k个合适锚点,我们在Kerenidis等人的q-means算法基础上开发了一种高效的量子监督聚类方法。Des-q首先利用一种新的量子技术估计皮尔逊相关系数,从而高效地估计各特征权重;随后利用加权距离估计将训练样本聚类到k个互不相交的区域,并按同样的流程继续扩展决策树。我们在多个具有数值特征的数据集上,将该算法的模拟版本与最先进的经典决策树在回归与二分类任务上进行了基准比较。结果表明,所提算法的性能与最先进的决策树相当,同时显著加速了周期性的树重训练。

Empirical Study of Mix-based Data Augmentation Methods in Physiological Time Series Data

  • paper_url: http://arxiv.org/abs/2309.09970
  • repo_url: https://github.com/comp-well-org/mix-augmentation-for-physiological-time-series-classification
  • paper_authors: Peikun Guo, Huiyuan Yang, Akane Sano
  • for: 这个论文主要是为了探讨在生理时间序分类任务中使用mixup等混合基于数据增强技术的可能性和效果。
  • methods: 这篇论文系统地评估了多种mix-based数据增强技术,包括mixup、cutmix和manifold mixup,在六个生理时间序列数据集上考察它们在不同传感数据与分类任务中的表现(mixup/cutmix 的示意代码见下文)。
  • results: 研究结果表明,三种mix-based数据增强技术可以在六个生理时间序数据集上提高表现,而且这些改进不需要专家知识或广泛的参数调整。
    Abstract Data augmentation is a common practice to help generalization in the procedure of deep model training. In the context of physiological time series classification, previous research has primarily focused on label-invariant data augmentation methods. However, another class of augmentation techniques (\textit{i.e., Mixup}) that emerged in the computer vision field has yet to be fully explored in the time series domain. In this study, we systematically review the mix-based augmentations, including mixup, cutmix, and manifold mixup, on six physiological datasets, evaluating their performance across different sensory data and classification tasks. Our results demonstrate that the three mix-based augmentations can consistently improve the performance on the six datasets. More importantly, the improvement does not rely on expert knowledge or extensive parameter tuning. Lastly, we provide an overview of the unique properties of the mix-based augmentation methods and highlight the potential benefits of using the mix-based augmentation in physiological time series data.
    摘要 数据扩充是一种常见的方法来帮助深度模型训练过程中的泛化。在生理时间序列分类领域,先前的研究主要集中在标签不变的数据扩充方法上。然而,另一类 augmentation 技术(即 Mixup)在计算机视觉领域出现后,尚未在时间序列领域得到完全探索。在这种研究中,我们系统地评估了基于混合的扩充方法,包括 mixup、cutmix 和 manifold mixup,在六个生理时间序列 dataset 上,并评估了不同的感知数据和分类任务中的性能。我们的结果表明,三种混合基于的扩充方法可以一致地提高六个 dataset 的性能。此外,这些改进不需要专家知识或广泛的参数调整。最后,我们介绍了混合基于的扩充方法的独特性质,并强调了在生理时间序列数据中使用混合基于的扩充方法的潜在优势。
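
A minimal NumPy sketch of mixup and a 1-D cutmix variant for time-series windows with soft labels (illustrative; not the paper's released code).

```python
# Minimal mixup and cutmix for 1-D physiological time series (illustrative only).
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def cutmix_1d(x1, y1, x2, y2, alpha=1.0):
    lam = np.random.beta(alpha, alpha)
    t = len(x1)
    cut = int(t * (1 - lam))                          # length of the pasted segment
    start = np.random.randint(0, t - cut + 1) if cut < t else 0
    x = x1.copy()
    x[start:start + cut] = x2[start:start + cut]      # paste a time window from the other sample
    lam_eff = 1 - cut / t                             # label weight = fraction kept from x1
    return x, lam_eff * y1 + (1 - lam_eff) * y2

# One-hot labels for a 2-class toy example.
x_a, x_b = np.sin(np.linspace(0, 6, 256)), np.sign(np.sin(np.linspace(0, 6, 256)))
y_a, y_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(mixup(x_a, y_a, x_b, y_b)[1], cutmix_1d(x_a, y_a, x_b, y_b)[1])
```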

Prompt a Robot to Walk with Large Language Models

  • paper_url: http://arxiv.org/abs/2309.09969
  • repo_url: https://github.com/HybridRobotics/prompt2walk
  • paper_authors: Yen-Jen Wang, Bike Zhang, Jianyu Chen, Koushil Sreenath
  • for: 这项研究旨在利用少样本(few-shot)提示,将大型语言模型(LLM)应用于机器人控制。
  • methods: 该研究使用从物理环境中收集的少样本提示,让LLM自回归地生成机器人的底层控制命令,而无需针对任务进行微调。
  • results: 在多种机器人和环境上的实验表明,该方法能够有效地提示机器人行走,证明了LLM即使在高维机器人系统中也能胜任动态运动控制的底层反馈控制器。
    Abstract Large language models (LLMs) pre-trained on vast internet-scale data have showcased remarkable capabilities across diverse domains. Recently, there has been escalating interest in deploying LLMs for robotics, aiming to harness the power of foundation models in real-world settings. However, this approach faces significant challenges, particularly in grounding these models in the physical world and in generating dynamic robot motions. To address these issues, we introduce a novel paradigm in which we use few-shot prompts collected from the physical environment, enabling the LLM to autoregressively generate low-level control commands for robots without task-specific fine-tuning. Experiments across various robots and environments validate that our method can effectively prompt a robot to walk. We thus illustrate how LLMs can proficiently function as low-level feedback controllers for dynamic motion control even in high-dimensional robotic systems. The project website and source code can be found at: https://prompt2walk.github.io/ .
    摘要 在互联网规模数据上预训练的大型语言模型(LLM)已在各个领域展现出了卓越的能力。近期,将LLM用于机器人、以便在真实场景中发挥基础模型威力的兴趣日益高涨。然而,这种方法面临着重大挑战,特别是如何将模型锚定在物理世界中,以及如何生成动态的机器人运动。为了解决这些问题,我们提出一种新范式:利用从物理环境中收集的少样本提示,使LLM无需针对任务微调即可自回归地生成机器人的底层控制命令。在多种机器人和环境上的实验验证了我们的方法能够有效地提示机器人行走。我们由此说明,即使在高维机器人系统中,LLM也能胜任动态运动控制的底层反馈控制器。项目网站与源代码见:https://prompt2walk.github.io/ 。

Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees

  • paper_url: http://arxiv.org/abs/2309.09968
  • repo_url: None
  • paper_authors: Alexia Jolicoeur-Martineau, Kilian Fatras, Tal Kachman
  • for: 生成和填充混合类型(连续和分类)表格数据
  • methods: 使用基于分数的扩散和条件流匹配生成混合类型表格数据,并以XGBoost等梯度提升树取代以往工作中作为函数逼近器的神经网络
  • results: 在不同的数据集上,我们的方法可以生成高度真实的人工数据,并且可以生成多种可能的数据填充结果,经验显示我们的方法通常超越深度学习生成方法,可以在CPU上并行训练而无需GPU,我们将代码发布到PyPI和CRAN上。
    Abstract Tabular data is hard to acquire and is subject to missing values. This paper proposes a novel approach to generate and impute mixed-type (continuous and categorical) tabular data using score-based diffusion and conditional flow matching. Contrary to previous work that relies on neural networks as function approximators, we instead utilize XGBoost, a popular Gradient-Boosted Tree (GBT) method. In addition to being elegant, we empirically show on various datasets that our method i) generates highly realistic synthetic data when the training dataset is either clean or tainted by missing data and ii) generates diverse plausible data imputations. Our method often outperforms deep-learning generation methods and can trained in parallel using CPUs without the need for a GPU. To make it easily accessible, we release our code through a Python library on PyPI and an R package on CRAN.
    摘要 表格数据获取困难且常带有缺失值。这篇论文提出了一种新方法,使用基于分数的扩散和条件流匹配来生成并填充混合类型(连续与分类)表格数据。与以往依赖神经网络作为函数逼近器的工作不同,我们使用广受欢迎的梯度提升树方法XGBoost。我们的方法不仅简洁,而且在多个数据集上的实验表明:无论训练数据是完整的还是含有缺失值,它都能生成高度真实的合成数据,并能给出多样且合理的数据填充。我们的方法往往优于基于深度学习的生成方法,且可以只用CPU并行训练而无需GPU。为便于使用,我们通过PyPI上的Python库和CRAN上的R包发布了代码。

Evaluating Adversarial Robustness with Expected Viable Performance

  • paper_url: http://arxiv.org/abs/2309.09928
  • repo_url: None
  • paper_authors: Ryan McCoppin, Colin Dawson, Sean M. Kennedy, Leslie M. Blaha
  • for: 评估预测模型的可靠性,特别关注对抗性扰动的影响。
  • methods: 使用可靠性指标,例如分类精度,来评估模型在不同抗性扰动下的性能。
  • results: 提出一种基于期望值的预测模型可靠性评估方法,以便更好地评估模型对抗性扰动的抗性性能。
    Abstract We introduce a metric for evaluating the robustness of a classifier, with particular attention to adversarial perturbations, in terms of expected functionality with respect to possible adversarial perturbations. A classifier is assumed to be non-functional (that is, has a functionality of zero) with respect to a perturbation bound if a conventional measure of performance, such as classification accuracy, is less than a minimally viable threshold when the classifier is tested on examples from that perturbation bound. Defining robustness in terms of an expected value is motivated by a domain general approach to robustness quantification.
    摘要 我们提出了一种以期望功能性来度量分类器稳健性(特别是针对对抗扰动)的指标。在某一扰动界下,若分类器在该扰动界内的样本上以常规性能指标(如分类准确率)衡量的表现低于某个最低可用阈值,则认为它在该扰动界下不可用(即功能性为零)。以期望值来定义稳健性的做法源于一种领域通用的稳健性量化思路。
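
A small sketch of the expected-viable-performance idea: a classifier counts as functional at a perturbation bound only if its accuracy there clears a minimum threshold, and robustness is the expected functionality over the bounds of interest. The threshold, bounds, and weighting below are assumptions.

```python
# Sketch of an "expected viable performance" style robustness metric (illustrative values).
import numpy as np

def expected_viable_performance(acc_at_bound, threshold=0.7, weights=None):
    """acc_at_bound: dict {epsilon: accuracy under perturbations of size <= epsilon}."""
    eps = sorted(acc_at_bound)
    functional = np.array([1.0 if acc_at_bound[e] >= threshold else 0.0 for e in eps])
    weights = np.full(len(eps), 1 / len(eps)) if weights is None else np.asarray(weights)
    return float(np.dot(weights, functional))

accuracies = {0.0: 0.95, 0.01: 0.88, 0.03: 0.74, 0.05: 0.52, 0.1: 0.31}
print(expected_viable_performance(accuracies, threshold=0.7))   # -> 0.6 (3 of 5 bounds viable)
```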

Graph topological property recovery with heat and wave dynamics-based features on graphs

  • paper_url: http://arxiv.org/abs/2309.09924
  • repo_url: None
  • paper_authors: Dhananjay Bhaskar, Yanlei Zhang, Charles Xu, Xingzhi Sun, Oluwadamilola Fasina, Guy Wolf, Maximilian Nickel, Michael Perlmutter, Smita Krishnaswamy
  • for: 这篇论文提出图微分方程网络(GDeNet),利用图上偏微分方程解的表达能力,为各类下游任务获取连续的节点级与图级表示。
  • methods: 论文推导了将热方程与波方程的动力学同图的谱性质以及图上连续时间随机游走行为联系起来的理论结果(热核特征的示意代码见下文)。
  • results: 实验表明,这些动力学特性能够捕捉图形的几何和拓扑特征,并且在真实世界数据集上表现出优于其他方法。
    Abstract In this paper, we propose Graph Differential Equation Network (GDeNet), an approach that harnesses the expressive power of solutions to PDEs on a graph to obtain continuous node- and graph-level representations for various downstream tasks. We derive theoretical results connecting the dynamics of heat and wave equations to the spectral properties of the graph and to the behavior of continuous-time random walks on graphs. We demonstrate experimentally that these dynamics are able to capture salient aspects of graph geometry and topology by recovering generating parameters of random graphs, Ricci curvature, and persistent homology. Furthermore, we demonstrate the superior performance of GDeNet on real-world datasets including citation graphs, drug-like molecules, and proteins.
    摘要 在这篇论文中,我们提出图微分方程网络(GDeNet),它利用图上偏微分方程解的表达能力,为各类下游任务获取连续的节点级与图级表示。我们推导了将热方程与波方程的动力学同图的谱性质以及图上连续时间随机游走行为联系起来的理论结果。实验表明,这些动力学特征能够捕捉图的几何与拓扑特征,例如恢复随机图的生成参数、Ricci曲率和持续同调。此外,我们在引用网络、类药分子和蛋白质等真实世界数据集上展示了GDeNet的优越性能。
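
An illustrative computation of heat-dynamics features on a small graph via the graph Laplacian; GDeNet's full architecture and wave-equation features are not reproduced.

```python
# Heat-kernel node features on a small graph via the combinatorial Laplacian (illustrative).
import numpy as np
from scipy.linalg import expm

# A 5-node path graph.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A                  # graph Laplacian

def heat_features(L, times=(0.5, 1.0, 2.0), source=0):
    """Solution of du/dt = -L u from a unit impulse at `source`, sampled at several times."""
    u0 = np.zeros(L.shape[0]); u0[source] = 1.0
    return np.stack([expm(-t * L) @ u0 for t in times], axis=1)   # shape (nodes, len(times))

print(heat_features(L).round(3))
```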

Distilling HuBERT with LSTMs via Decoupled Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2309.09920
  • repo_url: None
  • paper_authors: Danilo de Oliveira, Timo Gerkmann
  • for: 压缩 HuBERT 模型的知识,在降低模型规模与存储需求的同时维持自动语音识别性能。
  • methods: 采用知识蒸馏(及其扩展的解耦知识蒸馏)方法,将 HuBERT 的 Transformer 层蒸馏到基于 LSTM 的压缩模型中,从而减少参数数量(温度缩放蒸馏损失的示意代码见下文)。
  • results: 与 DistilHuBERT 相比,所提方法在参数量更少的情况下取得了更好的自动语音识别性能。
    Abstract Much research effort is being applied to the task of compressing the knowledge of self-supervised models, which are powerful, yet large and memory consuming. In this work, we show that the original method of knowledge distillation (and its more recently proposed extension, decoupled knowledge distillation) can be applied to the task of distilling HuBERT. In contrast to methods that focus on distilling internal features, this allows for more freedom in the network architecture of the compressed model. We thus propose to distill HuBERT's Transformer layers into an LSTM-based distilled model that reduces the number of parameters even below DistilHuBERT and at the same time shows improved performance in automatic speech recognition.
    摘要 自监督模型功能强大,但规模大、占用内存多,因此大量研究致力于压缩其知识。在这项工作中,我们证明了原始的知识蒸馏方法(及其最近提出的扩展——解耦知识蒸馏)可以用于蒸馏HuBERT。与侧重蒸馏内部特征的方法不同,这使压缩模型的网络结构有更大的自由度。因此,我们提议将HuBERT的Transformer层蒸馏到一个基于LSTM的压缩模型中,其参数量甚至低于DistilHuBERT,同时在自动语音识别上表现更好。
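
For context, a standard temperature-scaled soft-label knowledge-distillation loss in PyTorch; decoupled KD further splits this KL into target-class and non-target-class terms, which is not implemented here, and the toy logits are assumptions.

```python
# Standard soft-label knowledge-distillation loss (temperature-scaled KL between teacher
# and student logits). Decoupled KD decomposes this KL further; only the vanilla form is shown.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    log_p_s = F.log_softmax(student_logits / temperature, dim=-1)
    p_t = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 to keep gradient magnitudes comparable.
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2

teacher_logits = torch.randn(8, 50)     # e.g. frame-level outputs of a teacher model
student_logits = torch.randn(8, 50, requires_grad=True)
loss = kd_loss(student_logits, teacher_logits)
loss.backward()
print(float(loss))
```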

Learning Nonparametric High-Dimensional Generative Models: The Empirical-Beta-Copula Autoencoder

  • paper_url: http://arxiv.org/abs/2309.09916
  • repo_url: None
  • paper_authors: Maximilian Coblenz, Oliver Grothe, Fabian Kächele
  • for: 这篇论文的目的是把自编码器转化为生成模型,并寻找简单有效的方法来实现这一点。
  • methods: 论文比较了多种对自编码器潜在空间建模的方法,包括核密度估计、高斯分布、高斯混合模型、copula模型和标准化流等,并提出一种新的基于copula的方法——Empirical Beta Copula Autoencoder(潜在空间密度建模与采样的示意代码见下文)。
  • results: 研究讨论并评估了这些潜在空间建模技术在生成新数据样本上的表现,并进一步探讨了定向采样以及合成具有特定特征的新数据等方面。
    Abstract By sampling from the latent space of an autoencoder and decoding the latent space samples to the original data space, any autoencoder can simply be turned into a generative model. For this to work, it is necessary to model the autoencoder's latent space with a distribution from which samples can be obtained. Several simple possibilities (kernel density estimates, Gaussian distribution) and more sophisticated ones (Gaussian mixture models, copula models, normalization flows) can be thought of and have been tried recently. This study aims to discuss, assess, and compare various techniques that can be used to capture the latent space so that an autoencoder can become a generative model while striving for simplicity. Among them, a new copula-based method, the Empirical Beta Copula Autoencoder, is considered. Furthermore, we provide insights into further aspects of these methods, such as targeted sampling or synthesizing new data with specific features.
    摘要 通过从自编码器的潜在空间中采样,并将潜在样本解码回原始数据空间,任何自编码器都可以变成生成模型。为此,需要用一个可以从中采样的分布来对自编码器的潜在空间建模。可以考虑若干简单的方案(核密度估计、高斯分布)以及更复杂的方案(高斯混合模型、copula模型、标准化流),这些方案近期都已有人尝试。本研究旨在讨论、评估并比较各种可用于刻画潜在空间的技术,使自编码器在尽量保持简单的同时成为生成模型。其中,我们考虑了一种新的基于copula的方法——Empirical Beta Copula Autoencoder。此外,我们还就这些方法的更多方面提供了见解,例如定向采样或合成具有特定特征的新数据。
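
A minimal "encoder, latent density, decode" sketch: fit a simple density on latent codes and decode samples from it. PCA stands in for a trained autoencoder and a Gaussian KDE for the latent model; the paper's Empirical Beta Copula is not implemented.

```python
# Minimal "autoencoder -> latent density -> generator" sketch (illustrative stand-ins only).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

X = load_digits().data                       # 8x8 digit images as 64-d vectors
encoder = PCA(n_components=8).fit(X)         # stand-in for a trained autoencoder
latents = encoder.transform(X)

density = KernelDensity(bandwidth=1.0).fit(latents)      # model the latent space
new_latents = density.sample(5, random_state=0)
generated = encoder.inverse_transform(new_latents)       # "decode" back to data space
print(generated.shape)                                   # (5, 64) synthetic samples
```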

Learning to Generate Lumped Hydrological Models

  • paper_url: http://arxiv.org/abs/2309.09904
  • repo_url: None
  • paper_authors: Yang Yang, Ting Fong May Chui
  • for: 这个研究旨在开发一种数据驱动的方法,用于表征水文功能在流域中的低维度表示,并使用这种表示来重建特定的水文功能。
  • methods: 该研究使用深度学习方法,直接从气象驱动(climate forcing)与径流数据中学习生成模型以及各流域潜变量的取值,而不使用流域属性。
  • results: 研究从全球3000多个流域的数据中学习出高质量的生成模型,并利用通用率定算法将其应用于700多个流域,所得结果与使用36种集总式模型结构或非生成式深度学习方法相当或更好。
    Abstract In a lumped hydrological model structure, the hydrological function of a catchment is characterized by only a few parameters. Given a set of parameter values, a numerical function useful for hydrological prediction is generated. Thus, this study assumes that the hydrological function of a catchment can be sufficiently well characterized by a small number of latent variables. By specifying the variable values, a numerical function resembling the hydrological function of a real-world catchment can be generated using a generative model. In this study, a deep learning method is used to learn both the generative model and the latent variable values of different catchments directly from their climate forcing and runoff data, without using catchment attributes. The generative models can be used similarly to a lumped model structure, i.e., by estimating the optimal parameter or latent variable values using a generic model calibration algorithm, an optimal numerical model can be derived. In this study, generative models using eight latent variables were learned from data from over 3,000 catchments worldwide, and the learned generative models were applied to model over 700 different catchments using a generic calibration algorithm. The quality of the resulting optimal models was generally comparable to or better than that obtained using 36 different types of lump model structures or using non-generative deep learning methods. In summary, this study presents a data-driven approach for representing the hydrological function of a catchment in low-dimensional space and a method for reconstructing specific hydrological functions from the representations.
    摘要 在集总式水文模型结构中,流域的水文功能仅由少数几个参数刻画;给定一组参数值,即可生成一个可用于水文预测的数值函数。因此,本研究假设流域的水文功能可以由少量潜变量充分刻画:给定潜变量取值,利用生成模型即可生成一个与真实流域水文功能相似的数值函数。本研究使用深度学习方法,直接从气象驱动与径流数据中学习生成模型以及不同流域的潜变量取值,而不使用流域属性。生成模型的用法与集总式模型结构类似:利用通用的模型率定算法估计最优参数或潜变量取值,即可得到最优数值模型。本研究从全球3000多个流域的数据中学习了使用八个潜变量的生成模型,并将其应用于700多个流域的建模。所得最优模型的质量总体上与使用36种集总式模型结构或非生成式深度学习方法得到的结果相当或更好。总之,本研究提出了一种在低维空间中表示流域水文功能的数据驱动方法,以及一种从该表示重建特定水文功能的方法。

Deep Reinforcement Learning for the Joint Control of Traffic Light Signaling and Vehicle Speed Advice

  • paper_url: http://arxiv.org/abs/2309.09881
  • repo_url: None
  • paper_authors: Johannes V. S. Busch, Robert Voelckner, Peter Sossalla, Christian L. Vielhaus, Roberto Calandra, Frank H. P. Fitzek
  • for: 提高城市堵塞的效率和环保性
  • methods: 使用深度强化学习控制交通信号灯和车辆行驶速度
  • results: 在八个 из十一个测试场景中,联合控制方法可以降低车辆旅行延迟,并且观察到车辆速度建议策略可以平滑车辆附近交通信号灯的速度变化。
    Abstract Traffic congestion in dense urban centers presents an economical and environmental burden. In recent years, the availability of vehicle-to-anything communication allows for the transmission of detailed vehicle states to the infrastructure that can be used for intelligent traffic light control. The other way around, the infrastructure can provide vehicles with advice on driving behavior, such as appropriate velocities, which can improve the efficacy of the traffic system. Several research works applied deep reinforcement learning to either traffic light control or vehicle speed advice. In this work, we propose a first attempt to jointly learn the control of both. We show this to improve the efficacy of traffic systems. In our experiments, the joint control approach reduces average vehicle trip delays, w.r.t. controlling only traffic lights, in eight out of eleven benchmark scenarios. Analyzing the qualitative behavior of the vehicle speed advice policy, we observe that this is achieved by smoothing out the velocity profile of vehicles nearby a traffic light. Learning joint control of traffic signaling and speed advice in the real world could help to reduce congestion and mitigate the economical and environmental repercussions of today's traffic systems.
    摘要 密集城区的交通拥堵带来经济与环境负担。近年来,车辆与外界(V2X)通信技术使车辆能够将详细状态传输给基础设施,用于智能交通信号灯控制;反过来,基础设施也可以向车辆提供诸如合适车速等驾驶建议,从而提升交通系统效率。已有多项研究将深度强化学习分别应用于交通信号灯控制或车速建议。在这项工作中,我们首次尝试同时学习两者的控制,并证明这能提升交通系统的效率。实验中,在十一个基准场景中的八个里,联合控制方法相比仅控制交通信号灯降低了平均车辆旅行延迟。分析车速建议策略的定性行为可以发现,这是通过平滑交通信号灯附近车辆的速度曲线实现的。在现实世界中学习交通信号与车速建议的联合控制,有望缓解拥堵并减轻当今交通系统带来的经济与环境影响。

Error Reduction from Stacked Regressions

  • paper_url: http://arxiv.org/abs/2309.09880
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Xin Chen, Jason M. Klusowski, Yan Shuo Tan
  • for: 通过堆叠(stacking)组合多个回归估计器来提高预测精度
  • methods: 在非负约束下最小化总体风险的估计来学习组合权重,各成员估计器为投影到嵌套子空间上的线性最小二乘估计(非负最小二乘堆叠的示意代码见下文)
  • results: 当嵌套子空间彼此至少相差三个维度时,得益于收缩效应,堆叠估计器的总体风险严格小于其中按AIC或BIC选出的最佳单个估计器;且由于优化问题可改写为保序回归,其计算量与最佳单个估计器同阶
    Abstract Stacking regressions is an ensemble technique that forms linear combinations of different regression estimators to enhance predictive accuracy. The conventional approach uses cross-validation data to generate predictions from the constituent estimators, and least-squares with nonnegativity constraints to learn the combination weights. In this paper, we learn these weights analogously by minimizing an estimate of the population risk subject to a nonnegativity constraint. When the constituent estimators are linear least-squares projections onto nested subspaces separated by at least three dimensions, we show that thanks to a shrinkage effect, the resulting stacked estimator has strictly smaller population risk than best single estimator among them. Here ``best'' refers to a model that minimizes a selection criterion such as AIC or BIC. In other words, in this setting, the best single estimator is inadmissible. Because the optimization problem can be reformulated as isotonic regression, the stacked estimator requires the same order of computation as the best single estimator, making it an attractive alternative in terms of both performance and implementation.
    摘要 堆叠回归(stacking)是一种集成技术,通过对不同回归估计器做线性组合来提高预测精度。传统做法利用交叉验证数据生成各成员估计器的预测,并用带非负约束的最小二乘来学习组合权重。在本文中,我们改为在非负约束下最小化总体风险的一个估计来学习这些权重。当各成员估计器是投影到彼此至少相差三个维度的嵌套子空间上的线性最小二乘估计时,我们证明:得益于收缩效应,堆叠估计器的总体风险严格小于其中最佳的单个估计器,这里"最佳"指按AIC或BIC等选择准则选出的模型。换言之,在这一设定下,最佳单个估计器是不可容许的。由于该优化问题可以改写为保序回归,堆叠估计器所需的计算量与最佳单个估计器同阶,因此无论在性能还是实现上都是有吸引力的替代方案。
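
An illustrative sketch of stacking nested least-squares regressions with nonnegative combination weights via NNLS; the paper instead learns the weights by minimizing an estimate of the population risk, which is not reproduced here.

```python
# Stacking nested least-squares regressions with nonnegative combination weights (illustrative).
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n, p = 300, 12
X = rng.normal(size=(n, p))
y = X[:, :4] @ np.array([1.0, -2.0, 0.5, 1.5]) + rng.normal(size=n)

# Constituent estimators: OLS projections onto nested subspaces of increasing dimension.
dims = [3, 6, 9, 12]
preds = []
for d in dims:
    beta, *_ = np.linalg.lstsq(X[:, :d], y, rcond=None)
    preds.append(X[:, :d] @ beta)
P = np.column_stack(preds)

weights, _ = nnls(P, y)                     # nonnegative least squares over the predictions
stacked = P @ weights
print("weights:", weights.round(3))
print("stacked MSE:", float(np.mean((y - stacked) ** 2)),
      "best single MSE:", float(min(np.mean((y - q) ** 2) for q in preds)))
```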

Domain Generalization with Fourier Transform and Soft Thresholding

  • paper_url: http://arxiv.org/abs/2309.09866
  • repo_url: None
  • paper_authors: Hongyi Pan, Bin Wang, Zheyuan Zhan, Xin Zhu, Debesh Jha, Ahmet Enis Cetin, Concetto Spampinato, Ulas Bagci
  • for: 提高神经网络模型在来自不同源域的眼底图像分割任务上的泛化性能
  • methods: 采用基于傅里叶变换的域泛化策略,在交换振幅谱的同时引入软阈值(soft-thresholding)函数以消除振幅谱中的背景干扰(振幅谱交换与软阈值的示意代码见下文)
  • results: 在公开数据上的实验验证了该方法的有效性,其分割指标优于传统方法与现有最先进方法,泛化性能更好
    Abstract Domain generalization aims to train models on multiple source domains so that they can generalize well to unseen target domains. Among many domain generalization methods, Fourier-transform-based domain generalization methods have gained popularity primarily because they exploit the power of Fourier transformation to capture essential patterns and regularities in the data, making the model more robust to domain shifts. The mainstream Fourier-transform-based domain generalization swaps the Fourier amplitude spectrum while preserving the phase spectrum between the source and the target images. However, it neglects background interference in the amplitude spectrum. To overcome this limitation, we introduce a soft-thresholding function in the Fourier domain. We apply this newly designed algorithm to retinal fundus image segmentation, which is important for diagnosing ocular diseases but the neural network's performance can degrade across different sources due to domain shifts. The proposed technique basically enhances fundus image augmentation by eliminating small values in the Fourier domain and providing better generalization. The innovative nature of the soft thresholding fused with Fourier-transform-based domain generalization improves neural network models' performance by reducing the target images' background interference significantly. Experiments on public data validate our approach's effectiveness over conventional and state-of-the-art methods with superior segmentation metrics.
    摘要 域泛化旨在于多个源域上训练模型,使其能够很好地泛化到未见过的目标域。在众多域泛化方法中,基于傅里叶变换的方法尤其流行,因为它们利用傅里叶变换捕捉数据中的本质模式与规律,使模型对域偏移更加鲁棒。主流的傅里叶域泛化方法在源图像与目标图像之间交换傅里叶振幅谱并保留相位谱,但忽略了振幅谱中的背景干扰。为克服这一局限,我们在傅里叶域中引入软阈值函数。我们将该新算法应用于视网膜眼底图像分割——这一任务对眼科疾病诊断十分重要,但神经网络的性能会因域偏移而在不同数据源之间下降。所提技术通过去除傅里叶域中的小幅值分量来增强眼底图像扩充,从而带来更好的泛化。将软阈值与基于傅里叶变换的域泛化相结合,显著降低了目标图像的背景干扰,提升了神经网络模型的性能。在公开数据上的实验验证了我们的方法在分割指标上优于传统方法与现有最先进方法。
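
A NumPy sketch of the Fourier-domain augmentation idea: soft-threshold the amplitude spectra, swap the low-frequency amplitude block between a source and a target image, and keep the source phase. The swap ratio and threshold below are assumptions.

```python
# Fourier-based augmentation sketch: amplitude-spectrum swap with soft thresholding (illustrative).
import numpy as np

def fourier_amplitude_swap(src, tgt, swap_ratio=0.1, tau=0.0):
    F_src, F_tgt = np.fft.fft2(src), np.fft.fft2(tgt)
    amp_src, phase = np.abs(F_src), np.angle(F_src)
    amp_tgt = np.abs(F_tgt)
    # Soft-thresholding: amplitudes are nonnegative, so this zeroes small background terms.
    amp_src, amp_tgt = np.maximum(amp_src - tau, 0), np.maximum(amp_tgt - tau, 0)
    # Replace the centered low-frequency block of the source amplitude with the target's.
    amp_src, amp_tgt = np.fft.fftshift(amp_src), np.fft.fftshift(amp_tgt)
    h, w = src.shape
    bh, bw = int(h * swap_ratio), int(w * swap_ratio)
    ch, cw = h // 2, w // 2
    amp_src[ch - bh:ch + bh, cw - bw:cw + bw] = amp_tgt[ch - bh:ch + bh, cw - bw:cw + bw]
    amp_src = np.fft.ifftshift(amp_src)
    return np.real(np.fft.ifft2(amp_src * np.exp(1j * phase)))

rng = np.random.default_rng(0)
out = fourier_amplitude_swap(rng.random((64, 64)), rng.random((64, 64)), swap_ratio=0.1, tau=5.0)
print(out.shape, out.dtype)
```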

Prognosis of Multivariate Battery State of Performance and Health via Transformers

  • paper_url: http://arxiv.org/abs/2309.10014
  • repo_url: None
  • paper_authors: Noah H. Paulson, Joseph J. Kubal, Susan J. Babinec
  • for: 本研究的目的是提供一种深度学习模型,用于预测锂离子电池性能和使用寿命。
  • methods: 该研究使用深度Transformer网络模型,基于两个循环测试数据集预测28项电池健康状态指标;数据涵盖六种锂离子正极化学体系(LFP、NMC111、NMC532、NMC622、HE5050和5Vspinel)、多种电解液/负极组合以及不同的充放电方案。
  • results: 结果表明,深度学习模型可以高精度地预测锂离子电池的性能与寿命,例如在LFP快充数据集上预测寿命终点的平均绝对误差仅为19次循环,显示了深度学习在深入理解和掌控电池健康状态方面的潜力。
    Abstract Batteries are an essential component in a deeply decarbonized future. Understanding battery performance and "useful life" as a function of design and use is of paramount importance to accelerating adoption. Historically, battery state of health (SOH) was summarized by a single parameter, the fraction of a battery's capacity relative to its initial state. A more useful approach, however, is a comprehensive characterization of its state and complexities, using an interrelated set of descriptors including capacity, energy, ionic and electronic impedances, open circuit voltages, and microstructure metrics. Indeed, predicting across an extensive suite of properties as a function of battery use is a "holy grail" of battery science; it can provide unprecedented insights toward the design of better batteries with reduced experimental effort, and de-risking energy storage investments that are necessary to meet CO2 reduction targets. In this work, we present a first step in that direction via deep transformer networks for the prediction of 28 battery state of health descriptors using two cycling datasets representing six lithium-ion cathode chemistries (LFP, NMC111, NMC532, NMC622, HE5050, and 5Vspinel), multiple electrolyte/anode compositions, and different charge-discharge scenarios. The accuracy of these predictions versus battery life (with an unprecedented mean absolute error of 19 cycles in predicting end of life for an LFP fast-charging dataset) illustrates the promise of deep learning towards providing deeper understanding and control of battery health.
    摘要 电池是深度脱碳未来的重要组成部分。理解电池性能与"可用寿命"如何随设计与使用而变化,对加速其推广至关重要。历史上,电池健康状态(SOH)通常用单一参数概括,即电池容量相对初始状态的比例。然而,更有用的做法是利用一组相互关联的描述量对其状态及复杂性进行全面刻画,包括容量、能量、离子与电子阻抗、开路电压以及微观结构指标。事实上,预测电池在使用过程中一整套性质的演化是电池科学的"圣杯":它能以更少的实验代价为设计更好的电池提供前所未有的洞见,并为实现二氧化碳减排目标所必需的储能投资降低风险。在这项工作中,我们迈出了朝此方向的第一步:利用深度Transformer网络,基于两个循环测试数据集预测28项电池健康状态描述量;数据涵盖六种锂离子正极化学体系(LFP、NMC111、NMC532、NMC622、HE5050和5Vspinel)、多种电解液/负极组合以及不同的充放电方案。这些预测相对电池寿命的精度(例如在LFP快充数据集上预测寿命终点的平均绝对误差仅为19次循环)展示了深度学习在深入理解和掌控电池健康方面的前景。

Convolutional Deep Kernel Machines

  • paper_url: http://arxiv.org/abs/2309.09814
  • repo_url: https://github.com/luisgarzac/Data-Science-Course---Udemy-frogames-Juan-Gabriel-Gomila
  • paper_authors: Edward Milsom, Ben Anson, Laurence Aitchison
  • for: 这篇论文主要是为了探讨深度kernel机器(DKM)的应用和发展。
  • methods: 本论文提出卷积式深度核机器(DKM)。DKM完全基于核而不使用特征,这一点与神经网络、深度核学习乃至深度高斯过程等以特征为基本组件的方法不同。此外,论文还提出了一种高效的跨域诱导点近似方案,并设计和评估了多种模型变体,包括9种归一化方式、两种似然和两种顶层结构。
  • results: 实验表明,这些模型在 MNIST 上达到约 99% 的测试准确率,在 CIFAR-10 上约 92%,在 CIFAR-100 上约 71%,且仅需约 28 个 GPU 小时的训练,比完整的 NNGP / NTK / Myrtle 核快 1–2 个数量级,同时性能相当。
    Abstract Deep kernel machines (DKMs) are a recently introduced kernel method with the flexibility of other deep models including deep NNs and deep Gaussian processes. DKMs work purely with kernels, never with features, and are therefore different from other methods ranging from NNs to deep kernel learning and even deep Gaussian processes, which all use features as a fundamental component. Here, we introduce convolutional DKMs, along with an efficient inter-domain inducing point approximation scheme. Further, we develop and experimentally assess a number of model variants, including 9 different types of normalisation designed for the convolutional DKMs, two likelihoods, and two different types of top-layer. The resulting models achieve around 99% test accuracy on MNIST, 92% on CIFAR-10 and 71% on CIFAR-100, despite training in only around 28 GPU hours, 1-2 orders of magnitude faster than full NNGP / NTK / Myrtle kernels, whilst achieving comparable performance.
    摘要 深度核机器(DKM)是最近提出的一种核方法,具有深度神经网络和深度高斯过程等深度模型的灵活性。DKM完全基于核进行运算,从不使用特征,因此不同于从神经网络到深度核学习乃至深度高斯过程等以特征为基本组成部分的方法。在这里,我们提出卷积DKM,并给出一种高效的跨域诱导点近似方案。此外,我们开发并实验评估了多种模型变体,包括为卷积DKM设计的9种归一化方式、两种似然函数和两种顶层结构。所得模型在MNIST上达到约99%的测试准确率,在CIFAR-10上达到92%,在CIFAR-100上达到71%,而训练仅需约28个GPU小时,比完整的NNGP / NTK / Myrtle核快1-2个数量级,同时取得相当的性能。

Learning Optimal Contracts: How to Exploit Small Action Spaces

  • paper_url: http://arxiv.org/abs/2309.09801
  • repo_url: None
  • paper_authors: Francesco Bacchiocchi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti
  • for: 解决委托人和代理人之间的委托-代理问题,即委托人通过一系列的合同来激励代理人采取成本高、不可观测的行动,以实现有利的结果。
  • methods: 我们使用多轮合同的扩展版本,在委托人对代理人没有任何信息的情况下,通过观察每轮的结果来学习最优的合同。我们提出一种算法,可以在小的动作空间下以高概率获得近似最优合同,并且可以在相关的在线学习设定中实现$\tilde{\mathcal{O}}(T^{4/5})$的 regret bound。
  • results: 我们解决了Zhu等人(2022)提出的开放问题,并且可以在相关的在线学习设定中提供$\tilde{\mathcal{O}}(T^{4/5})$的 regret bound,这比之前已知的 regret bound有明显改进。
    Abstract We study principal-agent problems in which a principal commits to an outcome-dependent payment scheme -- called contract -- in order to induce an agent to take a costly, unobservable action leading to favorable outcomes. We consider a generalization of the classical (single-round) version of the problem in which the principal interacts with the agent by committing to contracts over multiple rounds. The principal has no information about the agent, and they have to learn an optimal contract by only observing the outcome realized at each round. We focus on settings in which the size of the agent's action space is small. We design an algorithm that learns an approximately-optimal contract with high probability in a number of rounds polynomial in the size of the outcome space, when the number of actions is constant. Our algorithm solves an open problem by Zhu et al.[2022]. Moreover, it can also be employed to provide a $\tilde{\mathcal{O}}(T^{4/5})$ regret bound in the related online learning setting in which the principal aims at maximizing their cumulative utility, thus considerably improving previously-known regret bounds.
    摘要 我们研究委托-代理问题,在这个问题中,委托人提出一个依赖于结果的支付方案(即合同),以促使代理人采取成本高且不可观测的行动,从而带来有利的结果。我们考虑经典单轮版本问题的扩展:委托人与代理人在多轮交互中进行互动,委托人没有关于代理人的任何信息,只能通过每轮实现的结果来学习最优合同。我们关注代理人行动空间规模较小的情形。我们设计了一个算法,当行动数量为常数时,可以在结果空间大小的多项式轮数内以高概率学习近似最优的合同,从而解决了Zhu等人[2022]提出的开放问题。此外,该算法还可以应用在相关的在线学习设定中,为希望最大化累积效用的委托人提供$\tilde{\mathcal{O}}(T^{4/5})$的 regret bound,显著改进了之前已知的 regret bound。
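
The paper's algorithm is considerably more refined than this, but a toy sketch can illustrate the interaction protocol: the principal only observes outcomes, discretizes the space of outcome-contingent payments, and commits to the empirically best contract after a short exploration phase. The agent's action costs, outcome distributions, and the grid below are entirely synthetic assumptions.

```python
# Toy illustration (not the paper's algorithm): explore-then-commit search over
# a discretized contract space, with a constant-size agent action space.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_outcomes, rewards = 3, np.array([0.0, 0.5, 1.0])      # principal's value per outcome
costs = np.array([0.0, 0.2])                            # two agent actions (constant number)
outcome_dist = np.array([[0.7, 0.2, 0.1],               # P(outcome | action)
                         [0.1, 0.3, 0.6]])

def agent_best_response(contract):
    expected_pay = outcome_dist @ contract
    return int(np.argmax(expected_pay - costs))          # agent maximizes pay minus cost

grid = [np.array(p) for p in itertools.product(np.linspace(0, 1, 6), repeat=n_outcomes)]
explore_rounds = 50
best_contract, best_utility = None, -np.inf
for contract in grid:                                    # explore: estimate utility of each contract
    action = agent_best_response(contract)
    outcomes = rng.choice(n_outcomes, size=explore_rounds, p=outcome_dist[action])
    utility = np.mean(rewards[outcomes] - contract[outcomes])
    if utility > best_utility:
        best_contract, best_utility = contract, utility

print("committed contract:", best_contract, "estimated utility:", round(best_utility, 3))
```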

Contrastive Initial State Buffer for Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.09752
  • repo_url: None
  • paper_authors: Nico Messikommer, Yunlong Song, Davide Scaramuzza
  • for: 提高强化学习的效率,使用有限样本学习
  • methods: 引入一种矛盾起始状态缓存,选择过去经验中的状态,用于初始化机器人在环境中,引导其走向更有信息的状态
  • results: 在两个复杂的机器人任务中,实验结果显示,我们的初始状态缓存可以比基线方案高效,同时也加速了训练的收敛
    Abstract In Reinforcement Learning, the trade-off between exploration and exploitation poses a complex challenge for achieving efficient learning from limited samples. While recent works have been effective in leveraging past experiences for policy updates, they often overlook the potential of reusing past experiences for data collection. Independent of the underlying RL algorithm, we introduce the concept of a Contrastive Initial State Buffer, which strategically selects states from past experiences and uses them to initialize the agent in the environment in order to guide it toward more informative states. We validate our approach on two complex robotic tasks without relying on any prior information about the environment: (i) locomotion of a quadruped robot traversing challenging terrains and (ii) a quadcopter drone racing through a track. The experimental results show that our initial state buffer achieves higher task performance than the nominal baseline while also speeding up training convergence.
    摘要 在强化学习中,探索和利用之间的权衡带来了复杂的挑战,难以从有限样本中实现高效的学习。然而,近期的工作虽然能有效利用过去经验来更新策略,却往往忽视了重用过去经验进行数据收集的潜力。我们在独立于底层RL算法的情况下,引入了一个对比起始状态缓存,该缓存从过去经验中选择状态,并将其用于初始化机器人在环境中的位置,以引导它走向更有信息量的状态。我们在两个复杂的机器人任务上进行了验证,且不依赖任何关于环境的先验信息:(i)一台四足机器人在困难地形上行走,以及(ii)一架四旋翼无人机在赛道上竞速飞行。实验结果表明,我们的起始状态缓存可以比基线取得更高的任务性能,同时也加速了训练的收敛。
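
A minimal sketch of the buffer idea follows: states visited in past rollouts are stored, and new episodes are reset from states sampled with a bias toward under-explored ones. The count-based novelty score used here is a simple stand-in, not the paper's contrastive selection criterion.

```python
# Minimal sketch of an "initial state buffer" for episode resets. The
# count-based novelty weighting is an assumed placeholder criterion.
import random
from collections import defaultdict

class InitialStateBuffer:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.states = []                       # stored (hashable) states
        self.visits = defaultdict(int)         # crude visitation counts

    def add(self, state):
        self.visits[state] += 1
        if len(self.states) < self.capacity:
            self.states.append(state)
        else:                                  # reservoir-style replacement
            self.states[random.randrange(self.capacity)] = state

    def sample_reset_state(self, default_state):
        if not self.states:
            return default_state
        weights = [1.0 / self.visits[s] for s in self.states]   # favor rarely visited states
        return random.choices(self.states, weights=weights, k=1)[0]

buffer = InitialStateBuffer()
for s in ["s0", "s0", "s1", "s2", "s2", "s2"]:
    buffer.add(s)
print(buffer.sample_reset_state(default_state="s0"))
```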

Towards Better Modeling with Missing Data: A Contrastive Learning-based Visual Analytics Perspective

  • paper_url: http://arxiv.org/abs/2309.09744
  • repo_url: None
  • paper_authors: Laixin Xie, Yang Ouyang, Longfei Chen, Ziming Wu, Quan Li
  • for: addresses the challenges of missing data in machine learning (ML) modeling
  • methods: uses Contrastive Learning (CL) framework to model observed data with missing values, without requiring any imputation
  • results: demonstrates high predictive accuracy and model interpretability through quantitative experiments, expert interviews, and a qualitative user study.
    Abstract Missing data can pose a challenge for machine learning (ML) modeling. To address this, current approaches are categorized into feature imputation and label prediction and are primarily focused on handling missing data to enhance ML performance. These approaches rely on the observed data to estimate the missing values and therefore encounter three main shortcomings in imputation, including the need for different imputation methods for various missing data mechanisms, heavy dependence on the assumption of data distribution, and potential introduction of bias. This study proposes a Contrastive Learning (CL) framework to model observed data with missing values, where the ML model learns the similarity between an incomplete sample and its complete counterpart and the dissimilarity between other samples. Our proposed approach demonstrates the advantages of CL without requiring any imputation. To enhance interpretability, we introduce CIVis, a visual analytics system that incorporates interpretable techniques to visualize the learning process and diagnose the model status. Users can leverage their domain knowledge through interactive sampling to identify negative and positive pairs in CL. The output of CIVis is an optimized model that takes specified features and predicts downstream tasks. We provide two usage scenarios in regression and classification tasks and conduct quantitative experiments, expert interviews, and a qualitative user study to demonstrate the effectiveness of our approach. In short, this study offers a valuable contribution to addressing the challenges associated with ML modeling in the presence of missing data by providing a practical solution that achieves high predictive accuracy and model interpretability.
    摘要 缺失数据会给机器学习(ML)建模带来挑战。为了解决这个问题,现有方法可分为特征填补和标签预测两类,主要侧重于处理缺失数据以提升ML性能。这些方法依靠观测数据来估计缺失值,因此在填补上存在三个主要不足:不同的缺失数据机制需要不同的填补方法、严重依赖数据分布假设,以及可能引入偏差。本研究提出一个对比学习(CL)框架来直接建模带缺失值的观测数据:ML模型学习不完整样本与其完整对应样本之间的相似性,以及与其他样本之间的差异性。所提方法无需任何填补即可体现CL的优势。为增强可解释性,我们引入CIVis,一个结合可解释技术的可视分析系统,用于可视化学习过程并诊断模型状态。用户可以通过交互式采样运用领域知识来确定CL中的负样本对和正样本对,CIVis的输出是一个使用指定特征进行下游任务预测的优化模型。我们在回归和分类任务中给出两个使用场景,并通过定量实验、专家访谈和定性用户研究验证了方法的有效性。简而言之,本研究为缺失数据下的ML建模挑战提供了一个兼具高预测精度和模型可解释性的实用解决方案。
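
To make the contrastive idea concrete, the sketch below embeds a complete sample and a masked version of it (missing entries zeroed) as a positive pair and treats the rest of the batch as negatives, using an InfoNCE-style loss. The tiny MLP encoder and the random missingness mask are illustrative assumptions, not the paper's model.

```python
# Sketch of contrastive modeling of incomplete data with no imputation step.
import torch
import torch.nn.functional as F

encoder = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 16))

def info_nce(z_incomplete, z_complete, temperature=0.1):
    z1 = F.normalize(z_incomplete, dim=1)
    z2 = F.normalize(z_complete, dim=1)
    logits = z1 @ z2.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))           # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

x = torch.randn(8, 10)                           # a batch of complete samples
mask = (torch.rand_like(x) > 0.3).float()        # 1 = observed, 0 = missing
loss = info_nce(encoder(x * mask), encoder(x))   # incomplete vs. complete counterpart
loss.backward()
print(float(loss))
```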

The NFLikelihood: an unsupervised DNNLikelihood from Normalizing Flows

  • paper_url: http://arxiv.org/abs/2309.09743
  • repo_url: None
  • paper_authors: Humberto Reyes-Gonzalez, Riccardo Torre
  • for: 这个论文是为了探讨一种无监督的方法,基于正常化流程,来学习高维度的likelihood函数,具体来说是在高能物理分析中。
  • methods: 这个论文使用了自适应流程,基于 affine 和 rational quadratic spline 函数,来学习高维度的likelihood函数。
  • results: 论文通过实际例子示出,这种方法可以学习复杂的高维度likelihood函数,并且可以应用于高能物理分析中的几个实际问题。
    Abstract We propose the NFLikelihood, an unsupervised version, based on Normalizing Flows, of the DNNLikelihood proposed in Ref.[1]. We show, through realistic examples, how Autoregressive Flows, based on affine and rational quadratic spline bijectors, are able to learn complicated high-dimensional Likelihoods arising in High Energy Physics (HEP) analyses. We focus on a toy LHC analysis example already considered in the literature and on two Effective Field Theory fits of flavor and electroweak observables, whose samples have been obtained throught the HEPFit code. We discuss advantages and disadvantages of the unsupervised approach with respect to the supervised one and discuss possible interplays of the two.
    摘要 我们提出了NFLikelihood,这是Ref.[1]中DNNLikelihood的基于归一化流的无监督版本。我们通过实际示例表明,基于仿射和有理二次样条双射的自回归流能够学习高能物理(HEP)分析中出现的复杂高维似然函数。我们重点关注文献中已有的一个LHC玩具分析示例,以及两个关于味物理和电弱观测量的有效场论拟合,其样本通过HEPFit代码获得。我们讨论了无监督方法相对于监督方法的优缺点,以及两者可能的结合方式。

Contrastive Learning and Data Augmentation in Traffic Classification Using a Flowpic Input Representation

  • paper_url: http://arxiv.org/abs/2309.09733
  • repo_url: None
  • paper_authors: Alessandro Finamore, Chao Wang, Jonatan Krolikowski, Jose M. Navarro, Fuxing Chen, Dario Rossi
  • for: 本研究是一篇关于交通分类(TC)的论文,采用了最新的深度学习(DL)方法。
  • methods: 本研究使用了少量学习、自我超vision via对比学习和数据增强等方法,以学习从少量样本中,并将学习结果转移到不同的数据集上。
  • results: 研究发现,使用这些DL方法,只需要使用100个输入样本,可以达到非常高的准确率,使用“流图”(i.e., 每个流量的2D histogram)作为输入表示。本研究还重现了原论文中的一些关键结果,并在三个额外的公共数据集上进行了数据增强的研究。
    Abstract Over the last years we witnessed a renewed interest towards Traffic Classification (TC) captivated by the rise of Deep Learning (DL). Yet, the vast majority of TC literature lacks code artifacts, performance assessments across datasets and reference comparisons against Machine Learning (ML) methods. Among those works, a recent study from IMC'22 [17] is worth of attention since it adopts recent DL methodologies (namely, few-shot learning, self-supervision via contrastive learning and data augmentation) appealing for networking as they enable to learn from a few samples and transfer across datasets. The main result of [17] on the UCDAVIS19, ISCX-VPN and ISCX-Tor datasets is that, with such DL methodologies, 100 input samples are enough to achieve very high accuracy using an input representation called "flowpic" (i.e., a per-flow 2d histograms of the packets size evolution over time). In this paper (i) we reproduce [17] on the same datasets and (ii) we replicate its most salient aspect (the importance of data augmentation) on three additional public datasets, MIRAGE-19, MIRAGE-22 and UTMOBILENET21. While we confirm most of the original results, we also found a 20% accuracy drop on some of the investigated scenarios due to a data shift in the original dataset that we uncovered. Additionally, our study validates that the data augmentation strategies studied in [17] perform well on other datasets too. In the spirit of reproducibility and replicability we make all artifacts (code and data) available at [10].
    摘要 过去几年,随着深度学习(DL)的兴起,流量分类(TC)再次受到关注。然而,大多数TC文献缺乏代码工件、跨数据集的性能评估以及与机器学习(ML)方法的参照比较。其中,IMC'22的一项研究[17]值得关注,因为它采用了适合网络场景的最新DL方法(即少样本学习、基于对比学习的自监督和数据增强),这些方法能够从少量样本中学习并跨数据集迁移。[17]在UCDAVIS19、ISCX-VPN和ISCX-Tor数据集上的主要结论是:借助这些DL方法,仅需100个输入样本即可达到非常高的准确率,其输入表示为"流图"(即每条流的包大小随时间演化的2D直方图)。在本文中,我们(i)在相同数据集上复现了[17],并(ii)在三个额外的公共数据集MIRAGE-19、MIRAGE-22和UTMOBILENET21上重复验证了其最核心的结论(数据增强的重要性)。我们确认了大部分原始结果,但也发现在部分场景中准确率下降20%,这源于我们发现的原始数据集中的数据漂移。此外,我们的研究验证了[17]中研究的数据增强策略在其他数据集上同样表现良好。本着可重复与可复制的精神,我们在[10]公开了所有工件(代码和数据)。
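
Constructing a "flowpic" input is itself a one-liner with numpy: a per-flow 2D histogram of packet sizes over relative arrival time. The 32x32 resolution and the 15-second / 1500-byte ranges below are illustrative choices, not necessarily those used in [17].

```python
# Minimal construction of a flowpic from one flow's packet trace.
import numpy as np

def flowpic(timestamps, sizes, bins=32, max_time=15.0, max_size=1500):
    t = np.asarray(timestamps) - np.min(timestamps)        # time relative to flow start
    hist, _, _ = np.histogram2d(
        t, np.asarray(sizes),
        bins=bins, range=[[0, max_time], [0, max_size]],
    )
    return hist                                            # (bins, bins) image-like input

rng = np.random.default_rng(0)
ts = np.sort(rng.uniform(0, 12, size=200))                 # synthetic packet arrival times
sz = rng.integers(40, 1500, size=200)                      # synthetic packet sizes (bytes)
print(flowpic(ts, sz).shape)                               # (32, 32)
```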

Neural Collapse for Unconstrained Feature Model under Cross-entropy Loss with Imbalanced Data

  • paper_url: http://arxiv.org/abs/2309.09725
  • repo_url: https://github.com/wanlihongc/neural-collapse
  • paper_authors: Wanli Hong, Shuyang Ling
  • for: 这paper研究了不等式特征模型下的神经网络坍缩现象(Neural Collapse,NC)在不均衡数据上的扩展。
  • methods: 该paper使用了无约束特征模型(Unconstrained Feature Model,UFM)来解释NC现象。
  • results: 研究发现,在不均衡数据上,NC现象仍然存在,但是feature vectors内的坍缩现象不再是等角的,而是受样本大小的影响。此外,研究还发现了一个锐度的阈值,当阈值超过这个阈值时,小类坍缩(feature vectors of minority groups collapse to one single vector)会发生。最后,研究发现,随着样本大小的增加,数据不均衡的影响会逐渐减弱。
    Abstract Recent years have witnessed the huge success of deep neural networks (DNNs) in various tasks of computer vision and text processing. Interestingly, these DNNs with massive number of parameters share similar structural properties on their feature representation and last-layer classifier at terminal phase of training (TPT). Specifically, if the training data are balanced (each class shares the same number of samples), it is observed that the feature vectors of samples from the same class converge to their corresponding in-class mean features and their pairwise angles are the same. This fascinating phenomenon is known as Neural Collapse (N C), first termed by Papyan, Han, and Donoho in 2019. Many recent works manage to theoretically explain this phenomenon by adopting so-called unconstrained feature model (UFM). In this paper, we study the extension of N C phenomenon to the imbalanced data under cross-entropy loss function in the context of unconstrained feature model. Our contribution is multi-fold compared with the state-of-the-art results: (a) we show that the feature vectors exhibit collapse phenomenon, i.e., the features within the same class collapse to the same mean vector; (b) the mean feature vectors no longer form an equiangular tight frame. Instead, their pairwise angles depend on the sample size; (c) we also precisely characterize the sharp threshold on which the minority collapse (the feature vectors of the minority groups collapse to one single vector) will take place; (d) finally, we argue that the effect of the imbalance in datasize diminishes as the sample size grows. Our results provide a complete picture of the N C under the cross-entropy loss for the imbalanced data. Numerical experiments confirm our theoretical analysis.
    摘要 近年来,深度神经网络(DNN)在计算机视觉和自然语言处理等领域取得了巨大成功。意外的是,这些DNN具有庞大参数的结构性质在特定阶段训练(TPT)中的特征表示和最后一层分类器之间存在类似性。具体来说,如果训练数据均衡(每个类具有相同的样本数),则观察到样本从同一个类划分的特征向量相互吸引,其对角度保持相同。这种精彩的现象被称为神经塌缩(NC),由Papyan、Han和Donoho在2019年提出。许多最近的工作尝试理解这种现象,通过采用不受限制的特征模型(UFM)。在这篇论文中,我们研究了NC现象在不均衡数据下,使用交叉熵损失函数的情况。我们的贡献包括以下几点:(a)特征向量展现塌缩现象,即同一个类划分的特征向量塌缩到同一个均值向量;(b)均值特征向量不再形成等角紧凑框架,而是对样本大小具有相互关系的对角度;(c)我们也准确地描述了小于一定的阈值,下面的少数塌缩(特征向量少数组划分到一个向量)会发生的具体时间点;(d)最后,我们认为数据大小差异的影响随着样本大小的增长而减少。我们的结果为NC现象在交叉熵损失下的不均衡数据提供了完整的图像。数据实验证实了我们的理论分析。
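
Two of the quantities discussed above are easy to measure on last-layer features: within-class variability relative to between-class variability, and the pairwise cosines of the centered class means (equiangularity would make all off-diagonal cosines equal). The synthetic features below merely stand in for real network activations at the terminal phase of training.

```python
# Sketch of neural-collapse diagnostics on synthetic last-layer features.
import numpy as np

rng = np.random.default_rng(0)
K, d = 4, 16
features = [rng.normal(loc=rng.normal(size=d) * 5, scale=0.1, size=(50, d)) for _ in range(K)]

means = np.stack([f.mean(axis=0) for f in features])          # per-class means
global_mean = means.mean(axis=0)
centered = means - global_mean

within = np.mean([np.var(f - m, axis=0).sum() for f, m in zip(features, means)])
between = np.var(centered, axis=0).sum()
print("within/between variability:", within / between)        # small value -> variability collapse

unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)
cosines = unit @ unit.T                                        # pairwise angles of class means
print("pairwise cosines of class means:\n", np.round(cosines, 2))
```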

FedLALR: Client-Specific Adaptive Learning Rates Achieve Linear Speedup for Non-IID Data

  • paper_url: http://arxiv.org/abs/2309.09719
  • repo_url: None
  • paper_authors: Hao Sun, Li Shen, Shixiang Chen, Jingwei Sun, Jing Li, Guangzhong Sun, Dacheng Tao
  • for: This paper focuses on improving the efficiency of federated learning, especially for training large-scale deep neural networks with heterogeneous data.
  • methods: The proposed method, FedLALR, adjusts the learning rate for each client based on local historical gradient squares and synchronized learning rates, which enables the method to converge and achieve linear speedup with respect to the number of clients.
  • results: The theoretical analysis and experimental results show that FedLALR outperforms several communication-efficient federated optimization methods in terms of convergence speed and scalability, and achieves promising results on CV and NLP tasks.
    Abstract Federated learning is an emerging distributed machine learning method, enables a large number of clients to train a model without exchanging their local data. The time cost of communication is an essential bottleneck in federated learning, especially for training large-scale deep neural networks. Some communication-efficient federated learning methods, such as FedAvg and FedAdam, share the same learning rate across different clients. But they are not efficient when data is heterogeneous. To maximize the performance of optimization methods, the main challenge is how to adjust the learning rate without hurting the convergence. In this paper, we propose a heterogeneous local variant of AMSGrad, named FedLALR, in which each client adjusts its learning rate based on local historical gradient squares and synchronized learning rates. Theoretical analysis shows that our client-specified auto-tuned learning rate scheduling can converge and achieve linear speedup with respect to the number of clients, which enables promising scalability in federated optimization. We also empirically compare our method with several communication-efficient federated optimization methods. Extensive experimental results on Computer Vision (CV) tasks and Natural Language Processing (NLP) task show the efficacy of our proposed FedLALR method and also coincides with our theoretical findings.
    摘要 联邦学习是一种新兴的分布式机器学习方法,允许大量客户端在不交换本地数据的情况下共同训练模型。在联邦学习中,通信时间成本是一个重要瓶颈,尤其是在训练大规模深度神经网络时。一些通信高效的联邦学习方法(如FedAvg和FedAdam)让不同客户端共享相同的学习率,但在数据异构时效率不高。要最大化优化方法的性能,主要挑战在于如何在不损害收敛性的前提下调整学习率。在这篇论文中,我们提出了一种异构的本地AMSGrad变体,称为FedLALR:每个客户端根据本地历史梯度平方和同步的学习率来调整自己的学习率。理论分析表明,这种由客户端自适应调整的学习率调度能够收敛,并且可以随客户端数量实现线性加速,从而在联邦优化中具备良好的可扩展性。我们还与几种通信高效的联邦优化方法进行了比较实验。在CV任务和NLP任务上的大量实验结果表明了FedLALR方法的有效性,并与我们的理论结果相符。
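
A compact sketch of the core mechanism follows: each client keeps its own AMSGrad-style running maximum of squared gradients and scales its local steps with it, while the server averages the resulting models. The least-squares objective and the simplified treatment of learning-rate synchronization are assumptions, not the exact FedLALR update rule.

```python
# Sketch of client-specific adaptive learning rates in a federated loop.
import numpy as np

rng = np.random.default_rng(0)
d, n_clients, local_steps, base_lr, eps = 5, 3, 10, 0.1, 1e-8
w_global = np.zeros(d)
client_data = [(rng.normal(size=(100, d)), rng.normal(size=100)) for _ in range(n_clients)]

def local_update(w, X, y, v_hat):
    for _ in range(local_steps):
        idx = rng.choice(len(X), size=16)                  # mini-batch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / 16   # least-squares gradient
        v_hat = np.maximum(v_hat, grad ** 2)               # AMSGrad-style accumulator
        w = w - base_lr * grad / (np.sqrt(v_hat) + eps)    # client-specific adaptive step
    return w, v_hat

v_hats = [np.zeros(d) for _ in range(n_clients)]
for _ in range(20):
    locals_ = [local_update(w_global.copy(), X, y, v) for (X, y), v in zip(client_data, v_hats)]
    w_global = np.mean([w for w, _ in locals_], axis=0)    # server aggregation (FedAvg-style)
    v_hats = [v for _, v in locals_]
print("global model norm:", round(float(np.linalg.norm(w_global)), 3))
```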

Multi-Dictionary Tensor Decomposition

  • paper_url: http://arxiv.org/abs/2309.09717
  • repo_url: None
  • paper_authors: Maxwell McNeil, Petko Bogdanov
  • for: 多方向数据的分析,如社交媒体、医疗、时空域等领域的数据分析
  • methods: 使用多字做 tensor decomposition 方法,利用各种数据驱动的假设来分解 tensor
  • results: 提出了一种多字做 tensor decomposition 框架(MDTD),可以利用外部structural信息来获得稀疏编码的 tensor 因子,并且可以处理大型稀疏tensor。实验表明,相比于字做法,MDTD 可以学习更简洁的模型,并且可以提高数据重建质量、缺失值填充质量和 tensor 维度的估计。同时,MDTD 的运行时间并不受影响,可以快速处理大型数据。
    Abstract Tensor decomposition methods are popular tools for analysis of multi-way datasets from social media, healthcare, spatio-temporal domains, and others. Widely adopted models such as Tucker and canonical polyadic decomposition (CPD) follow a data-driven philosophy: they decompose a tensor into factors that approximate the observed data well. In some cases side information is available about the tensor modes. For example, in a temporal user-item purchases tensor a user influence graph, an item similarity graph, and knowledge about seasonality or trends in the temporal mode may be available. Such side information may enable more succinct and interpretable tensor decomposition models and improved quality in downstream tasks. We propose a framework for Multi-Dictionary Tensor Decomposition (MDTD) which takes advantage of prior structural information about tensor modes in the form of coding dictionaries to obtain sparsely encoded tensor factors. We derive a general optimization algorithm for MDTD that handles both complete input and input with missing values. Our framework handles large sparse tensors typical to many real-world application domains. We demonstrate MDTD's utility via experiments with both synthetic and real-world datasets. It learns more concise models than dictionary-free counterparts and improves (i) reconstruction quality ($60\%$ fewer non-zero coefficients coupled with smaller error); (ii) missing values imputation quality (two-fold MSE reduction with up to orders of magnitude time savings) and (iii) the estimation of the tensor rank. MDTD's quality improvements do not come with a running time premium: it can decompose $19GB$ datasets in less than a minute. It can also impute missing values in sparse billion-entry tensors more accurately and scalably than state-of-the-art competitors.
    摘要 张量分解方法是分析社交媒体、医疗、时空等领域多维数据的流行工具。广泛采用的模型,如Tucker分解和典范多元分解(CPD),遵循数据驱动的思路:它们将张量分解为能够很好地近似观测数据的因子。在某些情况下,张量的各个模式上还有侧信息可用。例如,在一个时序的用户-商品购买张量中,可能存在用户影响图、商品相似度图,以及时间模式上的季节性或趋势信息。这些侧信息有望使张量分解模型更简洁、更可解释,并提升下游任务的质量。我们提出了多字典张量分解(MDTD)框架,利用以编码字典形式给出的张量模式先验结构信息,获得稀疏编码的张量因子。我们推导了一种适用于MDTD的通用优化算法,能够处理完整输入以及含缺失值的输入。该框架可以处理许多实际应用领域中常见的大规模稀疏张量。我们通过合成数据和真实数据的实验展示了MDTD的实用性:相比不使用字典的方法,它学习到更简洁的模型,并提升了(i)重建质量(非零系数减少60%且误差更小)、(ii)缺失值填补质量(MSE降低一半,同时节省多达数个数量级的时间)以及(iii)张量秩的估计。MDTD的质量提升并不以运行时间为代价:它可以在不到一分钟内分解19GB的数据集,并且能够比最先进的竞争方法更准确、更可扩展地填补含数十亿条目的稀疏张量中的缺失值。

A Study of Data-driven Methods for Adaptive Forecasting of COVID-19 Cases

  • paper_url: http://arxiv.org/abs/2309.09698
  • repo_url: None
  • paper_authors: Charithea Stylianides, Kleanthis Malialis, Panayiotis Kolios
  • for: 本研究旨在investigate数据驱动(学习、统计)方法,以适应COVID-19病毒传播的非站点性条件。
  • methods: 本研究使用数据驱动学习和统计方法,以incrementally更新模型,适应不断变化的病毒传播条件。
  • results: 实验结果表明,该方法在不同的病毒浪涌期内,可以提供高准确率的预测结果,并在疫情爆发时进行有效的预测。
    Abstract Severe acute respiratory disease SARS-CoV-2 has had a found impact on public health systems and healthcare emergency response especially with respect to making decisions on the most effective measures to be taken at any given time. As demonstrated throughout the last three years with COVID-19, the prediction of the number of positive cases can be an effective way to facilitate decision-making. However, the limited availability of data and the highly dynamic and uncertain nature of the virus transmissibility makes this task very challenging. Aiming at investigating these challenges and in order to address this problem, this work studies data-driven (learning, statistical) methods for incrementally training models to adapt to these nonstationary conditions. An extensive empirical study is conducted to examine various characteristics, such as, performance analysis on a per virus wave basis, feature extraction, "lookback" window size, memory size, all for next-, 7-, and 14-day forecasting tasks. We demonstrate that the incremental learning framework can successfully address the aforementioned challenges and perform well during outbreaks, providing accurate predictions.
    摘要 严重急性呼吸疾病SARS-CoV-2对公共卫生系统和医疗紧急应急响应有着深远的影响,尤其是在决定最有效的措施时采取决策。在过去三年的COVID-19疫情中,预测病例数量是一项有效的决策支持。然而,数据有限性和病毒传播性的高度动态和不确定性使得这项工作具有挑战性。本研究旨在调查这些挑战,并通过数据驱动(学习、统计)方法来适应这些非站点条件。我们进行了广泛的实践研究,包括精心分析不同的特征,如批处理大小、缓存大小、memory大小等,以及下一天、7天、14天预测任务的性能分析。我们示出了增量学习框架可以成功地解决上述挑战,并在爆发期间提供 precisions 的预测。
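
The incremental-learning setup described above can be sketched with an online regressor: as each new day of case counts arrives, the model is updated and then asked for a next-day forecast from a fixed "lookback" window. scikit-learn's SGDRegressor with partial_fit is a simple stand-in for the learners compared in the study; the synthetic wave-like series, the 14-day window, and the scaling are placeholders.

```python
# Sketch of incremental next-day forecasting with a lookback window.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
days = np.arange(300)
cases = 500 * (1 + np.sin(days / 25)) + rng.normal(0, 30, size=days.size)   # synthetic waves
series = cases / 1000.0                                    # crude scaling for SGD stability

lookback = 14
model = SGDRegressor(learning_rate="constant", eta0=0.01)
errors = []
for t in range(lookback, len(series) - 1):
    x_t = series[t - lookback:t].reshape(1, -1)            # features: last 14 days
    model.partial_fit(x_t, series[t:t + 1])                # incremental update with today's value
    x_next = series[t - lookback + 1:t + 1].reshape(1, -1)
    errors.append(abs(model.predict(x_next)[0] - series[t + 1]) * 1000.0)

print("mean absolute error (next-day cases):", round(float(np.mean(errors[50:])), 1))
```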

VULNERLIZER: Cross-analysis Between Vulnerabilities and Software Libraries

  • paper_url: http://arxiv.org/abs/2309.09649
  • repo_url: None
  • paper_authors: Irdin Pekaric, Michael Felderer, Philipp Steinmüller
  • for: 本研究旨在提供一种新的漏洞检测方法,用于针对软件项目中的漏洞进行检测。
  • methods: 本方法使用CVE和软件库数据,结合聚类算法生成漏洞和库之间的链接。此外,还进行模型训练,以更新分配的权重。
  • results: 研究结果显示,使用VULNERLIZER方法可以准确预测未来可能出现漏洞的软件库,并达到预测精度75%或更高。
    Abstract The identification of vulnerabilities is a continuous challenge in software projects. This is due to the evolution of methods that attackers employ as well as the constant updates to the software, which reveal additional issues. As a result, new and innovative approaches for the identification of vulnerable software are needed. In this paper, we present VULNERLIZER, which is a novel framework for cross-analysis between vulnerabilities and software libraries. It uses CVE and software library data together with clustering algorithms to generate links between vulnerabilities and libraries. In addition, the training of the model is conducted in order to reevaluate the generated associations. This is achieved by updating the assigned weights. Finally, the approach is then evaluated by making the predictions using the CVE data from the test set. The results show that the VULNERLIZER has a great potential in being able to predict future vulnerable libraries based on an initial input CVE entry or a software library. The trained model reaches a prediction accuracy of 75% or higher.
    摘要 “找到漏洞是软件项目中不断的挑战。这是因为攻击者的方法不断发展以及软件不断更新,导致新的漏洞披露。为此,我们提出了一种新的漏洞识别框架——漏洞LIZER。它使用CVE和软件库数据,结合聚类算法生成漏洞和库之间的关联。此外,我们还进行了模型训练,以重新评估生成的关联。这是通过更新分配的权重来实现的。最后,我们对测试集中的CVE数据进行预测,结果显示,漏洞LIZER可以准确预测基于输入CVE记录或软件库的未来漏洞。训练模型的准确率达75%或更高。”

A Discussion on Generalization in Next-Activity Prediction

  • paper_url: http://arxiv.org/abs/2309.09618
  • repo_url: None
  • paper_authors: Luka Abb, Peter Pfeiffer, Peter Fettke, Jana-Rebecca Rehse
  • for: 本研究旨在评估深度学习技术在下一个活动预测中的效果,并提出了不同的预测场景,以促进未来研究的发展。
  • methods: 本研究使用了深度学习技术进行下一个活动预测,并评估了其预测性能使用公共可用事件日志。
  • results: 研究发现现有的评估方法带来很大的示例泄露问题,导致使用深度学习技术的预测方法并不如预期中效果好。
    Abstract Next activity prediction aims to forecast the future behavior of running process instances. Recent publications in this field predominantly employ deep learning techniques and evaluate their prediction performance using publicly available event logs. This paper presents empirical evidence that calls into question the effectiveness of these current evaluation approaches. We show that there is an enormous amount of example leakage in all of the commonly used event logs, so that rather trivial prediction approaches perform almost as well as ones that leverage deep learning. We further argue that designing robust evaluations requires a more profound conceptual engagement with the topic of next-activity prediction, and specifically with the notion of generalization to new data. To this end, we present various prediction scenarios that necessitate different types of generalization to guide future research.
    摘要 下一个活动预测目标是预测运行进程实例的未来行为。现有文献主要采用深度学习技术进行预测性能评估,使用公共可用事件日志进行评估。本文提供了实验证据,质疑现有评价方法的效果。我们发现所有常用的事件日志具有很大的示例泄露,使得基本的预测方法几乎与深度学习相当。我们还认为设计Robust评估需要更深刻的概念理解,特别是一致到新数据的总结。为此,我们提出了不同类型的预测场景,以引导未来研究。
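
A quick way to see the kind of example leakage the paper describes is to build (prefix → next activity) instances from each trace and count how many test instances also occur verbatim among the training instances. The tiny toy event log and the 50/50 split below are illustrative assumptions.

```python
# Sketch of measuring example leakage in next-activity prediction data splits.
from collections import Counter

log = [
    ["register", "check", "approve", "pay"],
    ["register", "check", "reject"],
    ["register", "check", "approve", "pay"],
    ["register", "check", "approve", "notify", "pay"],
]

def instances(traces):
    out = []
    for trace in traces:
        for i in range(1, len(trace)):
            out.append((tuple(trace[:i]), trace[i]))     # (prefix, next activity)
    return out

train_inst = instances(log[:2])
test_inst = instances(log[2:])
train_set = set(train_inst)
leaked = sum(1 for inst in test_inst if inst in train_set)
print(f"leaked test instances: {leaked}/{len(test_inst)}")
print(Counter(next_act for _, next_act in train_inst))
```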

Latent assimilation with implicit neural representations for unknown dynamics

  • paper_url: http://arxiv.org/abs/2309.09574
  • repo_url: None
  • paper_authors: Zhuoyuan Li, Bin Dong, Pingwen Zhang
  • for: 这种研究是为了解决数据吸收中的高计算成本和数据维度问题。
  • methods: 该研究使用了新的抽象框架,即秘密吸收与卷积神经网络(LAINR),其中引入了圆形秘密神经表示(SINR)和数据驱动的神经网络 uncertainty 估计器。
  • results: 实验结果表明,与基于AutoEncoder的方法相比,LAINR在吸收过程中具有更高的精度和效率。
    Abstract Data assimilation is crucial in a wide range of applications, but it often faces challenges such as high computational costs due to data dimensionality and incomplete understanding of underlying mechanisms. To address these challenges, this study presents a novel assimilation framework, termed Latent Assimilation with Implicit Neural Representations (LAINR). By introducing Spherical Implicit Neural Representations (SINR) along with a data-driven uncertainty estimator of the trained neural networks, LAINR enhances efficiency in assimilation process. Experimental results indicate that LAINR holds certain advantage over existing methods based on AutoEncoders, both in terms of accuracy and efficiency.
    摘要 数据同化在众多应用中至关重要,但常常面临数据维度带来的高计算成本以及对底层机制理解不完全等挑战。为解决这些挑战,本研究提出了一种新的同化框架,称为基于隐式神经表示的隐空间同化(LAINR)。通过引入球面隐式神经表示(SINR)以及针对已训练神经网络的数据驱动不确定性估计器,LAINR提高了同化过程的效率。实验结果表明,无论在准确性还是效率方面,LAINR相比现有基于自编码器的方法都具有一定优势。
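
An implicit neural representation in its simplest form is just a small coordinate-to-value MLP; the sketch below uses sinusoidal activations (SIREN-style) fit to a synthetic 2-D field. The spherical parameterisation and the assimilation machinery of LAINR are not reproduced here; the layer sizes, the frequency scale, and the target field are assumptions.

```python
# Sketch of fitting an implicit neural representation to a coordinate field.
import torch
import torch.nn as nn

class Sine(nn.Module):
    def forward(self, x):
        return torch.sin(30.0 * x)        # 30.0 frequency scale follows common SIREN practice

inr = nn.Sequential(nn.Linear(2, 64), Sine(), nn.Linear(64, 64), Sine(), nn.Linear(64, 1))
opt = torch.optim.Adam(inr.parameters(), lr=1e-4)

coords = torch.rand(1024, 2) * 2 - 1                                     # (x, y) in [-1, 1]^2
target = torch.sin(3 * coords[:, :1]) * torch.cos(3 * coords[:, 1:])     # synthetic field

for step in range(500):
    opt.zero_grad()
    loss = ((inr(coords) - target) ** 2).mean()
    loss.backward()
    opt.step()
print("final reconstruction MSE:", float(loss))
```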

New Bounds on the Accuracy of Majority Voting for Multi-Class Classification

  • paper_url: http://arxiv.org/abs/2309.09564
  • repo_url: None
  • paper_authors: Sina Aeeneh, Nikola Zlatanov, Jiangshan Yu
  • for: 这个论文主要研究了多类别分类问题中的多数投票函数(MVF)的精度。
  • methods: 本论文使用了独立且非Identically分布的选民模型,并 derivated了MVF在多类别分类问题中的新上界。
  • results: 研究发现,在满足 certain conditions 的情况下,MVF在多类别分类问题中的误差率会指数减少到零,而在不满足这些条件的情况下,误差率会指数增长。此外,研究还发现了对真实分类算法的精度,其在best-case情况下可以达到小误差率,但在worst-case情况下可能高于MVF的误差率。
    Abstract Majority voting is a simple mathematical function that returns the value that appears most often in a set. As a popular decision fusion technique, the majority voting function (MVF) finds applications in resolving conflicts, where a number of independent voters report their opinions on a classification problem. Despite its importance and its various applications in ensemble learning, data crowd-sourcing, remote sensing, and data oracles for blockchains, the accuracy of the MVF for the general multi-class classification problem has remained unknown. In this paper, we derive a new upper bound on the accuracy of the MVF for the multi-class classification problem. More specifically, we show that under certain conditions, the error rate of the MVF exponentially decays toward zero as the number of independent voters increases. Conversely, the error rate of the MVF exponentially grows with the number of independent voters if these conditions are not met. We first explore the problem for independent and identically distributed voters where we assume that every voter follows the same conditional probability distribution of voting for different classes, given the true classification of the data point. Next, we extend our results for the case where the voters are independent but non-identically distributed. Using the derived results, we then provide a discussion on the accuracy of the truth discovery algorithms. We show that in the best-case scenarios, truth discovery algorithms operate as an amplified MVF and thereby achieve a small error rate only when the MVF achieves a small error rate, and vice versa, achieve a large error rate when the MVF also achieves a large error rate. In the worst-case scenario, the truth discovery algorithms may achieve a higher error rate than the MVF. Finally, we confirm our theoretical results using numerical simulations.
    摘要 多数投票是一种简单的数学函数,返回集合中出现最多的值。作为一种受欢迎的决策融合技术,多数投票函数(MVF)在解决冲突、 ensemble learning、数据投票、远程感知和数据链等领域都有应用。 despite its importance and various applications, the accuracy of MVF for the general multi-class classification problem remains unknown. In this paper, we derive a new upper bound on the accuracy of MVF for the multi-class classification problem. Specifically, we show that under certain conditions, the error rate of MVF exponentially decays toward zero as the number of independent voters increases. Conversely, the error rate of MVF exponentially grows with the number of independent voters if these conditions are not met. We first explore the problem for independent and identically distributed voters, assuming that every voter follows the same conditional probability distribution of voting for different classes given the true classification of the data point. Next, we extend our results to the case where the voters are independent but non-identically distributed. Using the derived results, we then provide a discussion on the accuracy of truth discovery algorithms. We show that in the best-case scenarios, truth discovery algorithms operate as an amplified MVF and thereby achieve a small error rate only when the MVF achieves a small error rate, and vice versa, achieve a large error rate when the MVF also achieves a large error rate. In the worst-case scenario, truth discovery algorithms may achieve a higher error rate than MVF. Finally, we confirm our theoretical results using numerical simulations.
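
The qualitative claim above is easy to check numerically: with i.i.d. voters whose most likely vote is the true class, the majority-vote error drops quickly as the number of voters grows. The 4-class confusion profile below is synthetic, and ties are resolved toward the lowest class index (which here happens to be the true class).

```python
# Monte Carlo sketch of the majority voting function's error vs. voter count.
import numpy as np

rng = np.random.default_rng(0)
K = 4
p_vote = np.array([0.4, 0.25, 0.2, 0.15])    # P(vote = class j | true class = 0); 0.4 > all others

def mvf_error(n_voters, trials=20000):
    votes = rng.choice(K, size=(trials, n_voters), p=p_vote)
    counts = np.stack([(votes == k).sum(axis=1) for k in range(K)], axis=1)
    return float(np.mean(counts.argmax(axis=1) != 0))     # ties resolved toward class 0

for n in (1, 5, 15, 45, 135):
    print(f"{n:>3} voters -> error rate {mvf_error(n):.4f}")
```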

Utilizing Whisper to Enhance Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids

  • paper_url: http://arxiv.org/abs/2309.09548
  • repo_url: None
  • paper_authors: Ryandhimas E. Zezario, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao
  • for: 这个研究的目的是提高听力器扩展设备中的自动评估speech intelligibility的精度。
  • methods: 这个研究使用了一种改进后的多支分支扩展speech intelligibility预测模型,称为MBI-Net+和MBI-Net++。MBI-Net+使用Whisper嵌入来扩展语音特征,而MBI-Net++则使用了一个辅助任务来预测帧级和语录级的对象语音理解指标HASPI的分数。
  • results: 实验结果表明,MBI-Net++和MBI-Net+都比MBI-Net在多种指标上表现更好,而MBI-Net++还是MBI-Net+的更好。
    Abstract Automated assessment of speech intelligibility in hearing aid (HA) devices is of great importance. Our previous work introduced a non-intrusive multi-branched speech intelligibility prediction model called MBI-Net, which achieved top performance in the Clarity Prediction Challenge 2022. Based on the promising results of the MBI-Net model, we aim to further enhance its performance by leveraging Whisper embeddings to enrich acoustic features. In this study, we propose two improved models, namely MBI-Net+ and MBI-Net++. MBI-Net+ maintains the same model architecture as MBI-Net, but replaces self-supervised learning (SSL) speech embeddings with Whisper embeddings to deploy cross-domain features. On the other hand, MBI-Net++ further employs a more elaborate design, incorporating an auxiliary task to predict frame-level and utterance-level scores of the objective speech intelligibility metric HASPI (Hearing Aid Speech Perception Index) and multi-task learning. Experimental results confirm that both MBI-Net++ and MBI-Net+ achieve better prediction performance than MBI-Net in terms of multiple metrics, and MBI-Net++ is better than MBI-Net+.
    摘要 自动评估助听器(HA)设备中的语音可懂度非常重要。我们之前的工作提出了一种非侵入式多分支语音可懂度预测模型,称为 MBI-Net,在 Clarity Prediction Challenge 2022 中取得了最佳成绩。基于 MBI-Net 模型的可喜结果,我们希望通过使用 Whisper 嵌入来丰富声学特征,进一步提升其性能。在这项研究中,我们提出了两种改进模型,即 MBI-Net+ 和 MBI-Net++。MBI-Net+ 保持与 MBI-Net 相同的模型结构,但将自监督学习(SSL)语音嵌入替换为 Whisper 嵌入,以部署跨域特征。MBI-Net++ 则进一步采用更精细的设计,引入一个辅助任务来预测客观语音可懂度指标 HASPI (Hearing Aid Speech Perception Index) 的帧级和语句级分数,并使用多任务学习。实验结果证实,MBI-Net++ 和 MBI-Net+ 在多项指标上均优于 MBI-Net,且 MBI-Net++ 优于 MBI-Net+。

Quantum Wasserstein GANs for State Preparation at Unseen Points of a Phase Diagram

  • paper_url: http://arxiv.org/abs/2309.09543
  • repo_url: None
  • paper_authors: Wiktor Jurasz, Christian B. Mendl
  • for: 本研究旨在扩展生成模型,特别是生成对抗网络(GANs)到量子域,并解决当前方法的局限性。
  • methods: 我们提出了一种新的混合类型-量子方法,基于量子沃氏赋形GANs,可以学习输入集中的测量期望函数,并生成未经见过的新状态。
  • results: 我们的方法可以生成新的状态,其测量期望函数遵循同一个下面函数,这些状态没有出现在输入集中。
    Abstract Generative models and in particular Generative Adversarial Networks (GANs) have become very popular and powerful data generation tool. In recent years, major progress has been made in extending this concept into the quantum realm. However, most of the current methods focus on generating classes of states that were supplied in the input set and seen at the training time. In this work, we propose a new hybrid classical-quantum method based on quantum Wasserstein GANs that overcomes this limitation. It allows to learn the function governing the measurement expectations of the supplied states and generate new states, that were not a part of the input set, but which expectations follow the same underlying function.
    摘要 生成模型,特别是生成对抗网络(GANs),在过去几年变得非常流行和强大,用于数据生成。然而,大多数当前方法仅能生成训练时提供的类别的状态。在这种情况下,我们提议一种新的混合类 quantum 方法,基于量子沃尔帕特 GANs,可以超越这一限制。它可以学习输入状态的测量预期函数,并生成没有在输入集中出现过的新状态,但是预期函数follows the same underlying function。

Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech

  • paper_url: http://arxiv.org/abs/2309.09510
  • repo_url: https://github.com/dynamic-superb/dynamic-superb
  • paper_authors: Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, Hung-yi Lee
  • for: This paper aims to present a benchmark for building universal speech models that can perform multiple tasks in a zero-shot fashion using instruction tuning.
  • methods: The paper proposes a benchmark called Dynamic-SUPERB, which combines 33 tasks and 22 datasets to provide comprehensive coverage of diverse speech tasks and harness instruction tuning. The paper also proposes several approaches to establish benchmark baselines, including the use of speech models, text language models, and the multimodal encoder.
  • results: The evaluation results show that while the baselines perform reasonably on seen tasks, they struggle with unseen ones. The paper also conducts an ablation study to assess the robustness and seek improvements in the performance.
    Abstract Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies in speech processing primarily focus on limited or specific tasks. Moreover, the lack of standardized benchmarks hinders a fair comparison across different approaches. Thus, we present Dynamic-SUPERB, a benchmark designed for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion. To achieve comprehensive coverage of diverse speech tasks and harness instruction tuning, we invite the community to collaborate and contribute, facilitating the dynamic growth of the benchmark. To initiate, Dynamic-SUPERB features 55 evaluation instances by combining 33 tasks and 22 datasets. This spans a broad spectrum of dimensions, providing a comprehensive platform for evaluation. Additionally, we propose several approaches to establish benchmark baselines. These include the utilization of speech models, text language models, and the multimodal encoder. Evaluation results indicate that while these baselines perform reasonably on seen tasks, they struggle with unseen ones. We also conducted an ablation study to assess the robustness and seek improvements in the performance. We release all materials to the public and welcome researchers to collaborate on the project, advancing technologies in the field together.
    摘要 文本语言模型已经展现出很强的零 shot 能力,可以通过提供良好的指令来泛化到未看过任务。然而,现有的语音处理研究主要集中在限定或特定的任务上,而lack of standardized benchmarks 使得不同方法之间的比较不公平。为此,我们提出了Dynamic-SUPERB,一个用于建立通用语音模型的 benchmark,可以通过指令调整来完成多个任务的零 shot 泛化。为确保语音任务的全面覆盖和利用指令调整,我们邀请社区参与合作,以便不断扩展 benchmark。Dynamic-SUPERB 目前已经包含了55个评估实例,通过组合 33 个任务和 22 个数据集,提供了广泛的维度评估。我们还提出了一些建立 benchmark 基准的方法,包括使用语音模型、文本语言模型和多模式Encoder。评估结果显示,虽然这些基准在看到任务上表现良好,但在未看到任务上表现不佳。为了提高性能,我们进行了一些剖析研究,以评估Robustness和寻找改进。我们将所有材料公开发布,并邀请研究人员一起合作项目,共同推动领域技术的发展。

Outlier-Insensitive Kalman Filtering: Theory and Applications

  • paper_url: http://arxiv.org/abs/2309.09505
  • repo_url: None
  • paper_authors: Shunit Truzman, Guy Revach, Nir Shlezinger, Itzik Klein
  • for: 这篇论文是关于如何从含有噪音观测的动力系统中进行状态估计,以提高适用范围和精度。
  • methods: 本文提出了一个具有自适应能力的实时状态估计方法,可以快速处理含有噪音观测的动力系统,并且不需要调整参数。
  • results: 实验和场景评估表明,本文的方法能够对含有噪音观测的动力系统进行精确的状态估计,并且比于其他方法更具有抗错误性。
    Abstract State estimation of dynamical systems from noisy observations is a fundamental task in many applications. It is commonly addressed using the linear Kalman filter (KF), whose performance can significantly degrade in the presence of outliers in the observations, due to the sensitivity of its convex quadratic objective function. To mitigate such behavior, outlier detection algorithms can be applied. In this work, we propose a parameter-free algorithm which mitigates the harmful effect of outliers while requiring only a short iterative process of the standard update step of the KF. To that end, we model each potential outlier as a normal process with unknown variance and apply online estimation through either expectation maximization or alternating maximization algorithms. Simulations and field experiment evaluations demonstrate competitive performance of our method, showcasing its robustness to outliers in filtering scenarios compared to alternative algorithms.
    摘要 从噪声观测中估计动力系统的状态是许多应用中的基本任务。这一问题通常使用线性卡尔曼滤波(KF)来解决,但当观测中出现异常值时,由于其凸二次目标函数对异常值敏感,KF的性能会显著下降。为缓解这一问题,可以使用异常检测算法。在本工作中,我们提出一种无需调参的算法:将每个潜在异常值建模为方差未知的正态过程,并通过期望最大化或交替最大化算法进行在线估计;该方法只需在标准KF更新步骤上附加一个简短的迭代过程。仿真和实地实验评估表明,我们的方法具有竞争力,在滤波场景中相比其他算法对观测异常值更加鲁棒。
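
The core mechanism can be sketched in a few lines: before committing each Kalman update, a few EM-style fixed-point iterations re-estimate the measurement-noise variance of the current observation, so a grossly outlying measurement is automatically down-weighted. The scalar random-walk model, the injected outliers, and the fixed five iterations are illustrative assumptions, not the paper's exact estimator.

```python
# Sketch of an outlier-insensitive Kalman update via per-measurement
# noise-variance re-estimation (EM-style fixed-point iterations).
import numpy as np

rng = np.random.default_rng(0)
true_x = np.cumsum(rng.normal(0, 0.1, size=200))                # random-walk state
z = true_x + rng.normal(0, 0.5, size=200)
z[::25] += rng.normal(0, 10.0, size=len(z[::25]))               # inject occasional outliers

q, r_nominal = 0.1 ** 2, 0.5 ** 2
x_est, p_est, estimates = 0.0, 1.0, []
for zk in z:
    x_pred, p_pred = x_est, p_est + q                            # predict
    r = r_nominal
    for _ in range(5):                                           # re-estimate noise variance
        k_gain = p_pred / (p_pred + r)
        x_post = x_pred + k_gain * (zk - x_pred)
        p_post = (1 - k_gain) * p_pred
        r = (zk - x_post) ** 2 + p_post                          # updated variance estimate
    x_est, p_est = x_post, p_post                                # commit the robust update
    estimates.append(x_est)

print("RMSE:", round(float(np.sqrt(np.mean((np.array(estimates) - true_x) ** 2))), 3))
```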

Machine Learning Approaches to Predict and Detect Early-Onset of Digital Dermatitis in Dairy Cows using Sensor Data

  • paper_url: http://arxiv.org/abs/2309.10010
  • repo_url: None
  • paper_authors: Jennifer Magana, Dinu Gavojdian, Yakir Menachem, Teddy Lazebnik, Anna Zamansky, Amber Adams-Progar
  • for: 本研究旨在采用机器学习算法基于传感器行为数据早期发现和预测牛皮病(DD)。
  • methods: 本研究使用了机器学习模型,基于牛皮病症状出现的日期和时间,使用传感器数据进行预测和检测。
  • results: 研究发现,使用行为传感器数据预测和检测牛皮病的机器学习模型可达79%的准确率,预测牛皮病2天前的模型准确率为64%。这些机器学习模型可以帮助开发基于行为传感器数据的实时自动牛皮病监测和诊断工具,用于检测牛皮病的症状变化。
    Abstract The aim of this study was to employ machine learning algorithms based on sensor behavior data for (1) early-onset detection of digital dermatitis (DD); and (2) DD prediction in dairy cows. With the ultimate goal to set-up early warning tools for DD prediction, which would than allow a better monitoring and management of DD under commercial settings, resulting in a decrease of DD prevalence and severity, while improving animal welfare. A machine learning model that is capable of predicting and detecting digital dermatitis in cows housed under free-stall conditions based on behavior sensor data has been purposed and tested in this exploratory study. The model for DD detection on day 0 of the appearance of the clinical signs has reached an accuracy of 79%, while the model for prediction of DD 2 days prior to the appearance of the first clinical signs has reached an accuracy of 64%. The proposed machine learning models could help to develop a real-time automated tool for monitoring and diagnostic of DD in lactating dairy cows, based on behavior sensor data under conventional dairy environments. Results showed that alterations in behavioral patterns at individual levels can be used as inputs in an early warning system for herd management in order to detect variances in health of individual cows.
    摘要 The study proposed a machine learning model that can predict and detect DD in cows housed under free-stall conditions based on behavior sensor data. The model achieved an accuracy of 79% in detecting DD on day 0 of the appearance of clinical signs, and an accuracy of 64% in predicting DD 2 days prior to the first clinical signs.The results of the study showed that alterations in behavioral patterns at the individual level can be used as inputs in an early warning system for herd management to detect variances in the health of individual cows. The proposed machine learning models have the potential to develop a real-time automated tool for monitoring and diagnosis of DD in lactating dairy cows based on behavior sensor data under conventional dairy environments.

Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

  • paper_url: http://arxiv.org/abs/2309.09470
  • repo_url: None
  • paper_authors: Zheng-Yan Sheng, Yang Ai, Yan-Nian Chen, Zhen-Hua Ling
  • for: 这篇论文目标是提出一种基于 face 图像的零 shot 语音转换任务(zero-shot FaceVC),即将源 speaker 的语音特征转换到目标 speaker 的语音特征上,只使用目标 speaker 的单个 face 图像。
  • methods: 我们提议一种基于 memory-based face-voice 对应模块的 zero-shot FaceVC 方法,通过槽来对这两种模式进行对应,以 capture 语音特征从 face 图像中。我们还提出了一种混合超级visit策略,以解决voice conversion任务在训练和推断阶段之间的一贯性问题。
  • results: 通过广泛的实验,我们证明了我们提出的方法在 zero-shot FaceVC 任务中的优越性。我们还设计了系统的主观和客观评价指标,以全面评估homogeneity、多样性和一致性 controlled by face images。
    Abstract This paper presents a novel task, zero-shot voice conversion based on face images (zero-shot FaceVC), which aims at converting the voice characteristics of an utterance from any source speaker to a newly coming target speaker, solely relying on a single face image of the target speaker. To address this task, we propose a face-voice memory-based zero-shot FaceVC method. This method leverages a memory-based face-voice alignment module, in which slots act as the bridge to align these two modalities, allowing for the capture of voice characteristics from face images. A mixed supervision strategy is also introduced to mitigate the long-standing issue of the inconsistency between training and inference phases for voice conversion tasks. To obtain speaker-independent content-related representations, we transfer the knowledge from a pretrained zero-shot voice conversion model to our zero-shot FaceVC model. Considering the differences between FaceVC and traditional voice conversion tasks, systematic subjective and objective metrics are designed to thoroughly evaluate the homogeneity, diversity and consistency of voice characteristics controlled by face images. Through extensive experiments, we demonstrate the superiority of our proposed method on the zero-shot FaceVC task. Samples are presented on our demo website.
    摘要 这篇论文介绍了一个新的任务:基于面像的零shot语音转换(zero-shot FaceVC),该任务的目标是将任何来源说话人的语音特征转换为新来的目标说话人,只使用单个面像。为解决这个任务,作者们提出了一种面voice记忆基于的零shot FaceVC方法。这种方法利用了一种面voice记忆对应模块,将这两种模式相互对应,以捕捉面像中的语音特征。此外,作者们还提出了一种混合监督策略,以解决长期存在的语音转换任务的训练和推断阶段不一致问题。为了获得无关 speaker的内容相关表示,作者们将预训练的零shot语音转换模型的知识传递到了他们的零shot FaceVC模型。鉴于FaceVC和传统的语音转换任务之间的差异,作者们设计了系统的主观和客观评价指标,以全面评估零shot FaceVC模型的性能。通过广泛的实验,作者们证明了他们的提出的方法在零shot FaceVC任务中的优越性。详细的样例可以在他们的 Demo 网站上找到。

Active anomaly detection based on deep one-class classification

  • paper_url: http://arxiv.org/abs/2309.09465
  • repo_url: https://github.com/mkkim-home/AAD
  • paper_authors: Minkyung Kim, Junsik Kim, Jongmin Yu, Jun Kyun Choi
  • for: 这 paper 是为了提高深度异常检测模型的训练效果而使用活动学习工具。
  • methods: 这 paper 使用了一种基于 adaptive boundary 的查询策略,以及一种 combining noise contrastive estimation 和一类分类模型的 semi-supervised learning 方法。
  • results: 这 paper 在 seven 个异常检测数据集上分别采用了这两种方法,并分析了它们的效果。
    Abstract Active learning has been utilized as an efficient tool in building anomaly detection models by leveraging expert feedback. In an active learning framework, a model queries samples to be labeled by experts and re-trains the model with the labeled data samples. It unburdens in obtaining annotated datasets while improving anomaly detection performance. However, most of the existing studies focus on helping experts identify as many abnormal data samples as possible, which is a sub-optimal approach for one-class classification-based deep anomaly detection. In this paper, we tackle two essential problems of active learning for Deep SVDD: query strategy and semi-supervised learning method. First, rather than solely identifying anomalies, our query strategy selects uncertain samples according to an adaptive boundary. Second, we apply noise contrastive estimation in training a one-class classification model to incorporate both labeled normal and abnormal data effectively. We analyze that the proposed query strategy and semi-supervised loss individually improve an active learning process of anomaly detection and further improve when combined together on seven anomaly detection datasets.
    摘要 主动学习已被用作借助专家反馈构建异常检测模型的高效工具。在主动学习框架中,模型会查询需要由专家标注的样本,然后使用标注后的数据样本重新训练模型。这既减轻了获取标注数据集的负担,又能提升异常检测性能。然而,大多数现有研究侧重于帮助专家尽可能多地找出异常数据样本,对于基于一类分类的深度异常检测而言,这是一种次优的做法。在这篇论文中,我们解决了面向 Deep SVDD 的主动学习中的两个关键问题:查询策略和半监督学习方法。首先,我们的查询策略并非只挑选异常样本,而是依据一个自适应边界来选择不确定的样本。其次,我们在训练一类分类模型时引入噪声对比估计,以便有效地同时利用已标注的正常和异常数据。我们的分析表明,所提出的查询策略和半监督损失各自都能改进异常检测的主动学习过程,二者结合后效果进一步提升。我们在七个异常检测数据集上验证了这一结论。
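
The query idea can be illustrated in a few lines: with a one-class (Deep SVDD-style) embedding, unlabeled samples are ranked by how close their distance to the hypersphere center is to the current boundary radius, and the most uncertain ones are queried rather than the farthest (most anomalous) ones. The random "embedding" and the percentile-based radius below are stand-ins for a trained encoder and the paper's adaptive boundary.

```python
# Sketch of an adaptive-boundary uncertainty query for one-class models.
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 8))            # stand-in for encoder outputs of unlabeled data
center = embeddings.mean(axis=0)                  # SVDD-style hypersphere center
dist = np.linalg.norm(embeddings - center, axis=1)

radius = np.quantile(dist, 0.9)                   # assumed boundary: 90th percentile of distances
uncertainty = -np.abs(dist - radius)              # samples near the boundary are most uncertain
query_idx = np.argsort(uncertainty)[-10:]         # ask the expert about the 10 most uncertain ones
print("queried indices:", query_idx)
print("their distances:", np.round(dist[query_idx], 2), "radius:", round(float(radius), 2))
```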

CaT: Balanced Continual Graph Learning with Graph Condensation

  • paper_url: http://arxiv.org/abs/2309.09455
  • repo_url: https://github.com/superallen13/CaT-CGL
  • paper_authors: Yilun Liu, Ruihong Qiu, Zi Huang
  • for: 本研究旨在解决 continual graph learning (CGL) 中的快速卡斯特罗菲特问题,提高模型的稳定性和性能。
  • methods: 该研究提出了一种 Condense and Train (CaT) 框架,包括对新来 graph 的压缩和存储,以及在内存中直接更新模型。
  • results: 实验结果表明,CaT 框架可以有效解决快速卡斯特罗菲特问题,并提高 CGL 的效果和效率。
    Abstract Continual graph learning (CGL) is purposed to continuously update a graph model with graph data being fed in a streaming manner. Since the model easily forgets previously learned knowledge when training with new-coming data, the catastrophic forgetting problem has been the major focus in CGL. Recent replay-based methods intend to solve this problem by updating the model using both (1) the entire new-coming data and (2) a sampling-based memory bank that stores replayed graphs to approximate the distribution of historical data. After updating the model, a new replayed graph sampled from the incoming graph will be added to the existing memory bank. Despite these methods are intuitive and effective for the CGL, two issues are identified in this paper. Firstly, most sampling-based methods struggle to fully capture the historical distribution when the storage budget is tight. Secondly, a significant data imbalance exists in terms of the scales of the complex new-coming graph data and the lightweight memory bank, resulting in unbalanced training. To solve these issues, a Condense and Train (CaT) framework is proposed in this paper. Prior to each model update, the new-coming graph is condensed to a small yet informative synthesised replayed graph, which is then stored in a Condensed Graph Memory with historical replay graphs. In the continual learning phase, a Training in Memory scheme is used to update the model directly with the Condensed Graph Memory rather than the whole new-coming graph, which alleviates the data imbalance problem. Extensive experiments conducted on four benchmark datasets successfully demonstrate superior performances of the proposed CaT framework in terms of effectiveness and efficiency. The code has been released on https://github.com/superallen13/CaT-CGL.
    摘要 kontinuous graf lerning (CGL) 是为了不断更新一个 graf 模型,使其能够处理流动式的 graf 数据。由于模型容易忘记先前学习的知识,因此 catastrophic forgetting 问题成为了 CGL 的主要关注点。latest replay-based methods 尝试解决这个问题,通过将模型更新使用整个新来的数据和一个储存 replayed graphs 的 memory bank,以便 aproximate 历史数据的分布。在更新模型后,将新的 replayed graph 从进来的 graf 中抽出来,并添加到现有的 memory bank 中。although 这些方法是直觉和有效的,这篇文章中提出了两个问题。首先,大多数抽样方法在存储预算仅仅允许的情况下,很难全面捕捉历史分布。其次,进来的新数据和储存在 memory bank 中的数据 scale 不对称,导致训练不平衡。为解决这些问题,这篇文章提出了一个 Condense and Train (CaT) 框架。在每次模型更新之前,新来的 graf 会被压缩成一个小而有用的 synthesized replayed graph,并将其储存在 Condensed Graph Memory 中。在持续学习阶段,使用 Train in Memory 方法将模型直接更新使用 Condensed Graph Memory,而不是整个新来的 graf。实际实验在四个 benchmark 数据集上显示,CaT 框架在效率和效果上具有优越的表现。code 已经在 https://github.com/superallen13/CaT-CGL 发布。

Asymptotically Efficient Online Learning for Censored Regression Models Under Non-I.I.D Data

  • paper_url: http://arxiv.org/abs/2309.09454
  • repo_url: None
  • paper_authors: Lantian Zhang, Lei Guo
  • for: investigate the asymptotically efficient online learning problem for stochastic censored regression models.
  • methods: propose a two-step online algorithm, which achieves algorithm convergence and improves estimation performance.
  • results: show that the algorithm is strongly consistent and asymptotically normal, and the covariances of the estimates can achieve the Cramer-Rao bound asymptotically.
    Abstract The asymptotically efficient online learning problem is investigated for stochastic censored regression models, which arise from various fields of learning and statistics but up to now still lacks comprehensive theoretical studies on the efficiency of the learning algorithms. For this, we propose a two-step online algorithm, where the first step focuses on achieving algorithm convergence, and the second step is dedicated to improving the estimation performance. Under a general excitation condition on the data, we show that our algorithm is strongly consistent and asymptotically normal by employing the stochastic Lyapunov function method and limit theories for martingales. Moreover, we show that the covariances of the estimates can achieve the Cramer-Rao (C-R) bound asymptotically, indicating that the performance of the proposed algorithm is the best possible that one can expect in general. Unlike most of the existing works, our results are obtained without resorting to the traditionally used but stringent conditions such as independent and identically distributed (i.i.d) assumption on the data, and thus our results do not exclude applications to stochastic dynamical systems with feedback. A numerical example is also provided to illustrate the superiority of the proposed online algorithm over the existing related ones in the literature.
    摘要 本文研究随机删失回归模型的渐近高效在线学习问题。这类模型出现在学习与统计的多个领域,但迄今为止,关于学习算法效率的系统理论研究仍然缺乏。为此,我们提出一个两步在线算法:第一步着眼于保证算法收敛,第二步致力于提升估计性能。在数据满足一般激励条件的情况下,我们利用随机李雅普诺夫函数方法和鞅的极限理论,证明了该算法是强相合且渐近正态的。此外,我们还证明了估计量的协方差可以渐近达到 Cramer-Rao (C-R) 下界,这表明所提算法的性能在一般意义下已是可期望的最优。不同于大多数现有工作,我们的结果无需依赖数据独立同分布(i.i.d)等传统的严格假设,因此并不排除在带反馈的随机动态系统中的应用。我们还给出一个数值例子,以说明所提在线算法相对于文献中相关方法的优越性。

On the Use of the Kantorovich-Rubinstein Distance for Dimensionality Reduction

  • paper_url: http://arxiv.org/abs/2309.09442
  • repo_url: None
  • paper_authors: Gaël Giordano
  • for: 这个论文的目的是研究使用康托罗维奇-鲁宾逊距离来建立分类问题中的样本复杂度描述器。
  • methods: 这篇论文使用了康托罗维奇-鲁宾逊距离来量化样本之间的geometry和topology信息,并将每个类别的点关联到一个措施中。
  • results: 论文表明,如果康托罗维奇-鲁宾逊距离 между这些措施较大,则存在一个1-Lipschitz分类器,可以良好地分类这些点。
    Abstract The goal of this thesis is to study the use of the Kantorovich-Rubinstein distance as to build a descriptor of sample complexity in classification problems. The idea is to use the fact that the Kantorovich-Rubinstein distance is a metric in the space of measures that also takes into account the geometry and topology of the underlying metric space. We associate to each class of points a measure and thus study the geometrical information that we can obtain from the Kantorovich-Rubinstein distance between those measures. We show that a large Kantorovich-Rubinstein distance between those measures allows to conclude that there exists a 1-Lipschitz classifier that classifies well the classes of points. We also discuss the limitation of the Kantorovich-Rubinstein distance as a descriptor.
    摘要 本论文的目标是研究使用庞特罗维奇-鲁比涅斯坦距离来建立分类问题中的样本复杂性描述器。我们使用庞特罗维奇-鲁比涅斯坦距离是度量空间概率的度量,同时也考虑度量空间的几何和 topology。我们对每个类别点分配一个概率,并研究庞特罗维奇-鲁比涅斯坦距离这些概率之间的几何信息。我们显示,当庞特罗维奇-鲁比涅斯坦距离大于某个阈值时,存在1-Lipschitz分类器可以良好地分类点类。我们还讨论了庞特罗维奇-鲁比涅斯坦距离作为描述器的限制。
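
In one dimension the Kantorovich-Rubinstein (Wasserstein-1) distance between the two empirical class measures can be computed directly with scipy, and a large value goes hand in hand with an easy 1-Lipschitz separation, as discussed above. The restriction to 1-D features and the synthetic Gaussian classes are assumptions made purely for illustration.

```python
# Quick 1-D illustration of the Kantorovich-Rubinstein distance between
# the empirical measures of two classes, and a simple 1-Lipschitz rule.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
class_a = rng.normal(loc=0.0, scale=1.0, size=500)
class_b = rng.normal(loc=4.0, scale=1.0, size=500)      # well separated classes

w1 = wasserstein_distance(class_a, class_b)
print("W1 distance between class measures:", round(w1, 3))

# Threshold the 1-Lipschitz function f(x) = x at the midpoint between class means.
midpoint = (class_a.mean() + class_b.mean()) / 2
accuracy = np.mean(np.concatenate([class_a < midpoint, class_b >= midpoint]))
print("accuracy of the simple 1-Lipschitz rule:", round(float(accuracy), 3))
```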

DeepHEN: quantitative prediction essential lncRNA genes and rethinking essentialities of lncRNA genes

  • paper_url: http://arxiv.org/abs/2309.10008
  • repo_url: None
  • paper_authors: Hanlin Zhang, Wenzheng Cheng
  • for: 本研究旨在解释非编码蛋白质基因的必需性。
  • methods: 该研究使用深度学习和图神经网络来预测非编码蛋白质基因的必需性。
  • results: 该模型能够预测非编码蛋白质基因的必需性,并能够区分序列特征和网络空间特征对必需性的影响。此外,该模型还能够解决其他方法因为必需性蛋白质基因数量低而导致的过拟合问题。
    Abstract Gene essentiality refers to the degree to which a gene is necessary for the survival and reproductive efficacy of a living organism. Although the essentiality of non-coding genes has been documented, there are still aspects of non-coding genes' essentiality that are unknown to us. For example, We do not know the contribution of sequence features and network spatial features to essentiality. As a consequence, in this work, we propose DeepHEN that could answer the above question. By buidling a new lncRNA-proteion-protein network and utilizing both representation learning and graph neural network, we successfully build our DeepHEN models that could predict the essentiality of lncRNA genes. Compared to other methods for predicting the essentiality of lncRNA genes, our DeepHEN model not only tells whether sequence features or network spatial features have a greater influence on essentiality but also addresses the overfitting issue of those methods caused by the low number of essential lncRNA genes, as evidenced by the results of enrichment analysis.
    摘要 基因必需性指的是一个基因对生物体存活和繁殖效能的必要程度。虽然非编码基因的必需性已有记录,但其中仍有许多方面是我们尚不了解的。例如,我们不清楚序列特征和网络空间特征各自对必需性的贡献。因此,在这项工作中,我们提出了能够回答上述问题的DeepHEN。通过构建新的lncRNA-蛋白质-蛋白质网络,并结合表示学习和图神经网络,我们成功建立了能够预测lncRNA基因必需性的DeepHEN模型。与其他预测lncRNA基因必需性的方法相比,我们的DeepHEN模型不仅能够说明序列特征与网络空间特征谁对必需性影响更大,还缓解了这些方法因必需lncRNA基因数量较少而产生的过拟合问题,这一点由富集分析的结果得到印证。

An Iterative Method for Unsupervised Robust Anomaly Detection Under Data Contamination

  • paper_url: http://arxiv.org/abs/2309.09436
  • repo_url: None
  • paper_authors: Minkyung Kim, Jongmin Yu, Junsik Kim, Tae-Hyun Oh, Jun Kyun Choi
    for:这个论文的目的是提高深入型异常检测模型的Robustness,使其能够更好地适应实际数据分布中的异常tail。methods:该论文提出了一种学习框架,通过iteratively更新样本级别的正常性权重,以提高深入型异常检测模型的学习效果。该框架是模型无关和参数适应的,可以应用于现有的异常检测方法。results:在五个异常检测benchmark dataset和两个图像 dataset上,该框架能够提高异常检测模型的Robustness,并且在不同的杂杂度水平下表现出优于现有的异常检测方法。
    Abstract Most deep anomaly detection models are based on learning normality from datasets due to the difficulty of defining abnormality by its diverse and inconsistent nature. Therefore, it has been a common practice to learn normality under the assumption that anomalous data are absent in a training dataset, which we call normality assumption. However, in practice, the normality assumption is often violated due to the nature of real data distributions that includes anomalous tails, i.e., a contaminated dataset. Thereby, the gap between the assumption and actual training data affects detrimentally in learning of an anomaly detection model. In this work, we propose a learning framework to reduce this gap and achieve better normality representation. Our key idea is to identify sample-wise normality and utilize it as an importance weight, which is updated iteratively during the training. Our framework is designed to be model-agnostic and hyperparameter insensitive so that it applies to a wide range of existing methods without careful parameter tuning. We apply our framework to three different representative approaches of deep anomaly detection that are classified into one-class classification-, probabilistic model-, and reconstruction-based approaches. In addition, we address the importance of a termination condition for iterative methods and propose a termination criterion inspired by the anomaly detection objective. We validate that our framework improves the robustness of the anomaly detection models under different levels of contamination ratios on five anomaly detection benchmark datasets and two image datasets. On various contaminated datasets, our framework improves the performance of three representative anomaly detection methods, measured by area under the ROC curve.
    摘要 由于异常本身的多样性和不一致性使其难以定义,大多数深度异常检测模型都是从数据集中学习正常性。因此,通常在假设训练数据中不存在异常数据(即正常性假设)的前提下学习正常性。然而,在实际中,真实数据分布往往包含异常尾部,即数据集是被污染的,正常性假设经常被违反。这一假设与实际训练数据之间的差距会对异常检测模型的学习产生负面影响。在本工作中,我们提出了一种学习框架来缩小这一差距,从而获得更好的正常性表示。我们的核心思想是识别样本级别的正常性,并将其作为重要性权重,在训练过程中迭代更新。该框架与模型无关且对超参数不敏感,无需精心调参即可应用于多种现有方法。我们将该框架应用于三类代表性的深度异常检测方法:一类分类方法、概率模型方法和基于重建的方法。此外,我们还讨论了迭代方法终止条件的重要性,并提出了受异常检测目标启发的终止准则。我们在五个异常检测基准数据集和两个图像数据集上验证了该框架能够在不同污染比例下提高异常检测模型的鲁棒性;在多种被污染的数据集上,以 ROC 曲线下面积衡量,该框架提升了三种代表性异常检测方法的性能。
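The iterative sample-reweighting idea described above can be illustrated with a small, self-contained sketch. This is not the paper's framework (which wraps existing one-class, probabilistic, and reconstruction-based detectors and uses its own termination criterion); it is a toy example that assumes a weighted-PCA reconstruction error as the normality score and an exponential score-to-weight mapping, both of which are illustrative choices.

```python
import numpy as np

def iterative_weighted_anomaly_detector(X, n_components=2, n_iters=10, tau=1.0):
    """Toy sketch: fit a normality model on weighted data, score samples,
    turn scores into importance weights, and refit until the scores stabilize."""
    n, _ = X.shape
    w = np.ones(n) / n                       # start with uniform sample weights
    prev_scores = None
    for _ in range(n_iters):
        # Weighted mean / covariance -> principal subspace of the "normal" data.
        mu = np.average(X, axis=0, weights=w)
        Xc = X - mu
        cov = (w[:, None] * Xc).T @ Xc / w.sum()
        _, eigvecs = np.linalg.eigh(cov)
        V = eigvecs[:, -n_components:]       # top principal components
        # Reconstruction error acts as the anomaly score.
        recon = (Xc @ V) @ V.T
        scores = np.sum((Xc - recon) ** 2, axis=1)
        # Higher score -> more anomalous -> lower importance weight.
        w = np.exp(-scores / (tau * scores.mean() + 1e-12))
        w /= w.sum()
        # Simple termination criterion: stop when the scores stop changing.
        if prev_scores is not None and np.abs(scores - prev_scores).mean() < 1e-6:
            break
        prev_scores = scores
    return scores, w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    normal = rng.normal(0, 1, size=(500, 5))
    outliers = rng.normal(6, 1, size=(25, 5))        # contamination
    X = np.vstack([normal, outliers])
    scores, weights = iterative_weighted_anomaly_detector(X)
    print("mean weight (normal)  :", weights[:500].mean())
    print("mean weight (outliers):", weights[500:].mean())
```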

Distributionally Time-Varying Online Stochastic Optimization under Polyak-Łojasiewicz Condition with Application in Conditional Value-at-Risk Statistical Learning

  • paper_url: http://arxiv.org/abs/2309.09411
  • repo_url: None
  • paper_authors: Yuen-Man Pun, Farhad Farokhi, Iman Shames
  • for: 这个论文研究了一系列随机优化问题,通过在线优化的视角来探讨。
  • methods: 论文使用了在线随机梯度下降和在线随机近端梯度下降,并为这些方法建立了动态 regret 界。
  • results: 论文证明了在线随机梯度下降和在线随机近端梯度下降的动态 regret 界,并将其应用到 Conditional Value-at-Risk (CVaR) 学习问题。
    Abstract In this work, we consider a sequence of stochastic optimization problems following a time-varying distribution via the lens of online optimization. Assuming that the loss function satisfies the Polyak-Łojasiewicz condition, we apply online stochastic gradient descent and establish its dynamic regret bound that is composed of cumulative distribution drifts and cumulative gradient biases caused by stochasticity. The distribution metric we adopt here is Wasserstein distance, which is well-defined without the absolute continuity assumption or with a time-varying support set. We also establish a regret bound of online stochastic proximal gradient descent when the objective function is regularized. Moreover, we show that the above framework can be applied to the Conditional Value-at-Risk (CVaR) learning problem. Particularly, we improve an existing proof on the discovery of the PL condition of the CVaR problem, resulting in a regret bound of online stochastic gradient descent.
    摘要 在这项工作中,我们从在线优化的视角考虑服从时变分布的一系列随机优化问题。假设损失函数满足 Polyak-Łojasiewicz 条件,我们应用在线随机梯度下降,并建立其动态 regret 界,该界由累积分布漂移和随机性引起的累积梯度偏差组成。我们采用的分布度量是 Wasserstein 距离,它无需绝对连续性假设即可定义,且允许支持集随时间变化。当目标函数带有正则项时,我们还建立了在线随机近端梯度下降的 regret 界。此外,我们证明上述框架可以应用于条件风险价值(CVaR)学习问题。特别地,我们改进了关于 CVaR 问题满足 PL 条件的现有证明,从而得到了在线随机梯度下降的 regret 界。
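For readers less familiar with the assumption named above, the Polyak-Łojasiewicz (PL) condition, the online stochastic gradient step, and the dynamic-regret notion take the following standard forms; the constants are generic, and the paper's exact decomposition of the regret into Wasserstein-drift and gradient-bias terms is not reproduced here.

```latex
% PL condition for the time-t loss f_t with minimum value f_t^\star (PL constant \mu > 0):
\frac{1}{2}\,\lVert \nabla f_t(x) \rVert^2 \;\ge\; \mu \bigl( f_t(x) - f_t^\star \bigr) \qquad \forall x,
% online stochastic gradient descent with step size \eta_t and stochastic gradient g_t(x_t):
x_{t+1} \;=\; x_t - \eta_t\, g_t(x_t), \qquad \mathbb{E}\bigl[g_t(x_t)\bigr] \approx \nabla f_t(x_t),
% dynamic regret against the time-varying minimizers x_t^\star of f_t:
\mathrm{Reg}^{\mathrm{d}}_T \;=\; \sum_{t=1}^{T} \bigl( f_t(x_t) - f_t(x_t^\star) \bigr).
```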

Guided Online Distillation: Promoting Safe Reinforcement Learning by Offline Demonstration

  • paper_url: http://arxiv.org/abs/2309.09408
  • repo_url: None
  • paper_authors: Jinning Li, Xinyi Liu, Banghua Zhu, Jiantao Jiao, Masayoshi Tomizuka, Chen Tang, Wei Zhan
  • for: 提高安全性和效率的强化学习(Reinforcement Learning)策略,使得RL agents可以在具有成本限制的情况下获得高奖励。
  • methods: 使用大容量模型(如决策变换器 DT)从离线数据中学习专家策略,并通过引导在线安全 RL 训练将其蒸馏为轻量级策略网络。
  • results: GOLD 框架可以成功蒸馏出轻量级策略,在安全性和效率两方面均表现出色,能够在多种安全关键场景中解决决策问题。
    Abstract Safe Reinforcement Learning (RL) aims to find a policy that achieves high rewards while satisfying cost constraints. When learning from scratch, safe RL agents tend to be overly conservative, which impedes exploration and restrains the overall performance. In many realistic tasks, e.g. autonomous driving, large-scale expert demonstration data are available. We argue that extracting an expert policy from offline data to guide online exploration is a promising solution to mitigate the conservativeness issue. Large-capacity models, e.g. decision transformers (DT), have been proven to be competent in offline policy learning. However, data collected in real-world scenarios rarely contain dangerous cases (e.g., collisions), which makes it difficult for the policies to learn safety concepts. Besides, these bulk policy networks cannot meet the computation speed requirements at inference time on real-world tasks such as autonomous driving. To this end, we propose Guided Online Distillation (GOLD), an offline-to-online safe RL framework. GOLD distills an offline DT policy into a lightweight policy network through guided online safe RL training, which outperforms both the offline DT policy and online safe RL algorithms. Experiments in both benchmark safe RL tasks and real-world driving tasks based on the Waymo Open Motion Dataset (WOMD) demonstrate that GOLD can successfully distill lightweight policies and solve decision-making problems in challenging safety-critical scenarios.
    摘要 安全强化学习(RL)的目标是找到在满足成本约束的同时获得高奖励的策略。从零开始学习时,安全 RL 智能体往往过于保守,这会阻碍探索并限制整体性能。在许多现实任务(例如自动驾驶)中,存在大规模的专家示范数据。我们认为,从离线数据中提取专家策略来指导在线探索,是缓解过度保守问题的一个有前景的方案。大容量模型(如决策变换器 DT)已被证明能够胜任离线策略学习。然而,真实场景中采集的数据很少包含危险情况(例如碰撞),这使得策略难以学习安全概念;此外,这类大型策略网络无法满足自动驾驶等真实任务在推理阶段的计算速度要求。为此,我们提出了 Guided Online Distillation(GOLD),一个从离线到在线的安全 RL 框架。GOLD 通过引导式在线安全 RL 训练,将离线 DT 策略蒸馏为轻量级策略网络,其性能优于离线 DT 策略和在线安全 RL 算法。在标准安全 RL 任务以及基于 Waymo Open Motion Dataset(WOMD)的真实驾驶任务上的实验表明,GOLD 能够成功蒸馏轻量级策略,并在具有挑战性的安全关键场景中解决决策问题。

eess.IV - 2023-09-18

Mixed Graph Signal Analysis of Joint Image Denoising / Interpolation

  • paper_url: http://arxiv.org/abs/2309.10114
  • repo_url: None
  • paper_authors: Niruhan Viswarupan, Gene Cheung, Fengbo Lan, Michael Brown
  • for: 这篇论文主要研究如何从混合图滤波的角度联合优化图像的去噪与插值。
  • methods: 作者使用线性去噪器和线性插值器,并研究在哪些情况下这两个操作应当分步独立执行,以及在哪些情况下应当合并并联合优化。
  • results: 实验表明,所提出的联合去噪/插值方法在多种情况下均明显优于分步执行的方法。
    Abstract A noise-corrupted image often requires interpolation. Given a linear denoiser and a linear interpolator, when should the operations be independently executed in separate steps, and when should they be combined and jointly optimized? We study joint denoising / interpolation of images from a mixed graph filtering perspective: we model denoising using an undirected graph, and interpolation using a directed graph. We first prove that, under mild conditions, a linear denoiser is a solution graph filter to a maximum a posteriori (MAP) problem regularized using an undirected graph smoothness prior, while a linear interpolator is a solution to a MAP problem regularized using a directed graph smoothness prior. Next, we study two variants of the joint interpolation / denoising problem: a graph-based denoiser followed by an interpolator has an optimal separable solution, while an interpolator followed by a denoiser has an optimal non-separable solution. Experiments show that our joint denoising / interpolation method outperformed separate approaches noticeably.
    摘要 受噪声污染的图像往往还需要插值。给定一个线性去噪器和一个线性插值器,什么时候应当将两个操作分步独立执行,什么时候又应当将它们合并并联合优化?我们从混合图滤波的角度研究图像的联合去噪/插值:用无向图建模去噪,用有向图建模插值。我们首先证明,在较弱的条件下,线性去噪器是以无向图平滑先验作为正则项的最大后验(MAP)问题的解图滤波器,而线性插值器是以有向图平滑先验作为正则项的 MAP 问题的解。接着,我们研究联合插值/去噪问题的两种变体:先图去噪再插值存在最优的可分离解,而先插值再去噪的最优解是不可分离的。实验表明,我们的联合去噪/插值方法明显优于分步处理的方法。
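A hedged illustration of the MAP formulations referenced above; the precise priors, graph operators, and weights are defined in the paper, and the expressions below only show the generic form of a graph-smoothness-regularized denoiser and interpolator (H is a pixel-selection/observation matrix, L_u an undirected-graph Laplacian, L_d a directed-graph smoothness operator, and μ, γ > 0 are weights).

```latex
% MAP denoising with an undirected-graph smoothness prior:
\hat{x}_{\mathrm{den}} \;=\; \arg\min_{x}\; \lVert y - x \rVert_2^2 \;+\; \mu\, x^{\top} L_u\, x,
% MAP interpolation with a directed-graph smoothness prior:
\hat{x}_{\mathrm{int}} \;=\; \arg\min_{x}\; \lVert y - H x \rVert_2^2 \;+\; \gamma\, \lVert L_d\, x \rVert_2^2 .
```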

MAD: Meta Adversarial Defense Benchmark

  • paper_url: http://arxiv.org/abs/2309.09776
  • repo_url: None
  • paper_authors: X. Peng, D. Zhou, G. Sun, J. Shi, L. Wu
  • for: The paper aims to address the limitations of existing adversarial training (AT) methods, such as high computational cost, low generalization ability, and the dilemma between the original model and the defense model.
  • methods: The paper proposes a novel benchmark called meta adversarial defense (MAD), which consists of two MAD datasets and a MAD evaluation protocol, and introduces a meta-learning based adversarial training (Meta-AT) algorithm as the baseline, which achieves high robustness to unseen adversarial attacks through few-shot learning.
  • results: The paper shows that the Meta-AT algorithm outperforms state-of-the-art methods in terms of robustness to adversarial attacks, and that the model after Meta-AT maintains a relatively high clean-samples classification accuracy (CCA).
    Abstract Adversarial training (AT) is a prominent technique employed by deep learning models to defend against adversarial attacks, and to some extent, enhance model robustness. However, there are three main drawbacks of the existing AT-based defense methods: expensive computational cost, low generalization ability, and the dilemma between the original model and the defense model. To this end, we propose a novel benchmark called meta adversarial defense (MAD). The MAD benchmark consists of two MAD datasets, along with a MAD evaluation protocol. The two large-scale MAD datasets were generated through experiments using 30 kinds of attacks on the MNIST and CIFAR-10 datasets. In addition, we introduce a meta-learning based adversarial training (Meta-AT) algorithm as the baseline, which features high robustness to unseen adversarial attacks through few-shot learning. Experimental results demonstrate the effectiveness of our Meta-AT algorithm compared to the state-of-the-art methods. Furthermore, the model after Meta-AT maintains a relatively high clean-samples classification accuracy (CCA). It is worth noting that Meta-AT addresses all three aforementioned limitations, leading to substantial improvements. This benchmark ultimately achieved breakthroughs in investigating the transferability of adversarial defense methods to new attacks and the ability to learn from a limited number of adversarial examples. Our code and attacked datasets will be available at https://github.com/PXX1110/Meta_AT.
    摘要 对抗训练(AT)是深度学习模型抵御对抗攻击、并在一定程度上增强模型鲁棒性的重要技术。然而,现有基于 AT 的防御方法存在三个主要缺点:计算成本高、泛化能力弱,以及原始模型与防御模型之间的两难。为此,我们提出了一个新的基准——元对抗防御(MAD)。MAD 基准包含两个 MAD 数据集以及一套 MAD 评估协议;这两个大规模数据集是通过在 MNIST 和 CIFAR-10 数据集上使用 30 种攻击的实验生成的。此外,我们引入了一种基于元学习的对抗训练(Meta-AT)算法作为基线,它通过小样本学习获得了对未见对抗攻击的高鲁棒性。实验结果表明,Meta-AT 算法的效果优于现有最先进的方法;同时,经过 Meta-AT 的模型仍保持了较高的干净样本分类精度(CCA)。值得注意的是,Meta-AT 解决了上述全部三种局限,带来了显著改进。该基准最终在研究对抗防御方法向新攻击的可迁移性,以及从有限数量的对抗样本中学习的能力方面取得了突破。我们的代码和被攻击的数据集将发布于 https://github.com/PXX1110/Meta_AT。

eess.SP - 2023-09-18

ROAR-Fed: RIS-Assisted Over-the-Air Adaptive Resource Allocation for Federated Learning

  • paper_url: http://arxiv.org/abs/2309.09883
  • repo_url: None
  • paper_authors: Jiayu Mao, Aylin Yener
  • for: 这篇论文旨在探讨基于空中计算的联邦学习(FL)方法,并利用可重构智能表面(RIS)提升学习性能。
  • methods: 该论文采用跨层视角,针对客户端数据集分布和各自资源的异构性进行联合优化,并借助 RIS 营造更有利于学习的传播环境。
  • results: 数值结果表明,在非 i.i.d. 数据和不完美 CSI 下,ROAR-Fed 算法能够收敛,并且在客户端资源和数据分布各不相同的情况下表现优异。
    Abstract Over-the-air federated learning (OTA-FL) integrates communication and model aggregation by exploiting the innate superposition property of wireless channels. The approach renders bandwidth efficient learning, but requires care in handling the wireless physical layer impairments. In this paper, federated edge learning is considered for a network that is heterogeneous with respect to client (edge node) data set distributions and individual client resources, under a general non-convex learning objective. We augment the wireless OTA-FL system with a Reconfigurable Intelligent Surface (RIS) to enable a propagation environment with improved learning performance in a realistic time varying physical layer. Our approach is a cross-layer perspective that jointly optimizes communication, computation and learning resources, in this general heterogeneous setting. We adapt the local computation steps and transmission power of the clients in conjunction with the RIS phase shifts. The resulting joint communication and learning algorithm, RIS-assisted Over-the-air Adaptive Resource Allocation for Federated learning (ROAR-Fed) is shown to be convergent in this general setting. Numerical results demonstrate the effectiveness of ROAR-Fed under heterogeneous (non i.i.d.) data and imperfect CSI, indicating the advantage of RIS assisted learning in this general set up.
    摘要 空中联邦学习(OTA-FL)利用无线信道固有的叠加特性,将通信与模型聚合融为一体。这种方法带来了带宽高效的学习,但需要妥善处理无线物理层的损伤。本文在客户端(边缘节点)数据集分布和各自资源均存在异构性的网络中,针对一般的非凸学习目标研究联邦边缘学习。我们在无线 OTA-FL 系统中加入可重构智能表面(RIS),在真实的时变物理层中营造更有利于学习的传播环境。我们采用跨层视角,在这种一般的异构设置下联合优化通信、计算和学习资源,协同调整客户端的本地计算步数和发射功率以及 RIS 的相移。由此得到的联合通信与学习算法,即 RIS 辅助的空中自适应资源分配联邦学习(ROAR-Fed),在该一般设置下被证明是收敛的。数值结果展示了 ROAR-Fed 在异构(非 i.i.d.)数据和不完美 CSI 下的有效性,表明 RIS 辅助学习在这一一般设置中具有优势。

RIS-Assisted Energy Harvesting Gains for Bistatic Backscatter Networks: Performance Analysis and RIS Phase Optimization

  • paper_url: http://arxiv.org/abs/2309.09859
  • repo_url: None
  • paper_authors: Diluka Galappaththige, Fatemeh Rezaei, Chintha Tellambura, Sanjeewa Herath
  • for: This paper aims to improve the energy autonomy of inexpensive tags in IoT networks by deploying a reconfigurable intelligent surface (RIS) to enhance RF power and overcome energy insecurities.
  • methods: The paper explores the use of an RIS to improve the energy harvesting (EH) capabilities of tags, analyzes the impact of linear and non-linear EH models in single-tag and multi-tag scenarios, and derives key metrics such as received power, harvested power, achievable rate, outage probability, bit error rate, and diversity order.
  • results: The paper shows significant improvements in activation distance, received power, and achievable rate compared to benchmarks without an RIS or with a random RIS-phase design. For instance, the optimal design with a 200-element RIS increases the activation distance by 270% and 55% compared to the benchmarks, demonstrating that RIS deployment improves the energy autonomy of tags while keeping the basic tag design intact.
    Abstract Inexpensive tags powered by energy harvesting (EH) can realize green (energy-efficient) Internet of Things (IoT) networks. However, tags are vulnerable to energy insecurities, resulting in poor communication ranges, activation distances, and data rates. To overcome these challenges, we explore the use of a reconfigurable intelligent surface (RIS) for EH-based IoT networks. The RIS is deployed to enhance RF power at the tag, improving EH capabilities. We consider linear and non-linear EH models and analyze single-tag and multi-tag scenarios. For single-tag networks, the tag's maximum received power and the reader's signal-to-noise ratio with the optimized RIS phase-shifts are derived. Key metrics, such as received power, harvested power, achievable rate, outage probability, bit error rate, and diversity order, are also evaluated. The impact of RIS phase shift quantization errors is also studied. For the multi-tag case, an algorithm to compute the optimal RIS phase-shifts is developed. Numerical results and simulations demonstrate significant improvements compared to the benchmarks of the no-RIS case and random RIS-phase design. For instance, our optimal design with a 200-element RIS increases the activation distance by 270% and 55% compared to those benchmarks. In summary, RIS deployment improves the energy autonomy of tags while maintaining the basic tag design intact.
    摘要 由能量收集(EH)供电的低成本标签可以实现绿色(高能效)的物联网(IoT)网络。然而,标签容易受到能量不足的影响,导致通信范围、激活距离和数据速率都较差。为克服这些挑战,我们研究在基于 EH 的物联网网络中使用可重构智能表面(RIS)。部署 RIS 可以增强标签处的射频功率,从而提升其能量收集能力。我们考虑线性和非线性 EH 模型,并分析单标签和多标签场景。对于单标签网络,我们推导了在优化 RIS 相移下标签的最大接收功率和读写器的信噪比,并评估了接收功率、收集功率、可达速率、中断概率、误码率和分集阶数等关键指标,同时研究了 RIS 相移量化误差的影响。对于多标签情况,我们提出了计算最优 RIS 相移的算法。数值结果和仿真表明,与无 RIS 和随机 RIS 相位设计的基准相比,所提方法带来显著提升;例如,采用 200 个单元 RIS 的最优设计使激活距离相对两种基准分别提高 270% 和 55%。总之,部署 RIS 可以在不改动标签基本设计的前提下提升标签的能量自主性。
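For context on the linear and non-linear EH models mentioned above, one commonly used parameterization is sketched below (η is a conversion efficiency, M a saturation power, and a, b circuit parameters of the logistic model); the paper's exact model and parameter values may differ.

```latex
% Linear EH model:
P^{\mathrm{lin}}_{\mathrm{harv}} \;=\; \eta\, P_{\mathrm{in}},
% Logistic (non-linear) EH model:
\Psi(P_{\mathrm{in}}) \;=\; \frac{M}{1 + e^{-a\,(P_{\mathrm{in}} - b)}}, \qquad
\Omega \;=\; \frac{1}{1 + e^{a b}}, \qquad
P^{\mathrm{nl}}_{\mathrm{harv}} \;=\; \frac{\Psi(P_{\mathrm{in}}) - M\,\Omega}{1 - \Omega}.
```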

A Read Margin Enhancement Circuit with Dynamic Bias Optimization for MRAM

  • paper_url: http://arxiv.org/abs/2309.09797
  • repo_url: None
  • paper_authors: Renhe Chen, Albert Lee, Zirui Wang, Di Wu, Xufeng Kou
  • for: 提高磁性随机存取存储器(MRAM)的读出良率
  • methods: 使用动态偏置优化(DBO)电路实时跟踪最优读偏置电压,以适应工艺-电压-温度(PVT)变化
  • results: 对 28 纳米 1Mb MRAM 宏的仿真研究显示,即使最优感测电压变化高达 50%,DBO 电路对最优读偏置电压的跟踪精度仍高于 90%,并能将误码率降低多达两个数量级,从而提升 MRAM 的性能与可靠性。
    Abstract This brief introduces a read bias circuit to improve the readout yield of magnetic random access memories (MRAMs). A dynamic bias optimization (DBO) circuit is proposed to enable the real-time tracking of the optimal read voltage across process-voltage-temperature (PVT) variations within an MRAM array. It optimizes read performance by adjusting the read bias voltage dynamically for maximum sensing margin. Simulation results on a 28-nm 1Mb MRAM macro show that the tracking accuracy of the proposed DBO circuit remains above 90% even when the optimal sensing voltage varies up to 50%. Such a dynamic tracking strategy further results in up to two orders of magnitude reduction in the bit error rate with respect to different variations, highlighting its effectiveness in enhancing MRAM performance and reliability.
    摘要 本文介绍一种读偏置电路,用于提高磁性随机存取存储器(MRAM)的读出良率。我们提出动态偏置优化(DBO)电路,可在 MRAM 阵列内实时跟踪工艺-电压-温度(PVT)变化下的最优读电压,通过动态调整读偏置电压获得最大感测裕量。在 28 纳米 1Mb MRAM 宏上的仿真结果表明,即使最优感测电压变化高达 50%,所提 DBO 电路的跟踪精度仍保持在 90% 以上。这种动态跟踪策略进一步使误码率在不同变化下降低多达两个数量级,凸显其在提升 MRAM 性能与可靠性方面的有效性。

Asymptotic Performance of the GSVD-Based MIMO-NOMA Communications with Rician Fading

  • paper_url: http://arxiv.org/abs/2309.09711
  • repo_url: None
  • paper_authors: Chenguang Rao, Zhiguo Ding, Kanapathippillai Cumanan, Xuchu Dai
  • for: 这篇论文研究了 MIMO-NOMA 系统中采用 GSVD 作为预编码方案时的性能。
  • methods: 该论文使用算子值自由概率论中的线性化技巧和确定性等价方法来分析信道矩阵的广义奇异值特性。
  • results: 该论文提出了一种迭代过程来计算广义奇异值的 Cauchy 变换,并由此得到了通信系统的平均数据速率。此外,当信道模型为瑞利衰落(即视距分量可忽略)时,还推导出了平均速率的闭式表达式。仿真结果验证了分析结果。
    Abstract In recent years, the multiple-input multiple-output (MIMO) non-orthogonal multiple-access (NOMA) systems have attracted a significant interest in the relevant research communities. As a potential precoding scheme, the generalized singular value decomposition (GSVD) can be adopted in MIMO-NOMA systems and has been proved to have high spectral efficiency. In this paper, the performance of the GSVD-based MIMO-NOMA communications with Rician fading is studied. In particular, the distribution characteristics of generalized singular values (GSVs) of channel matrices are analyzed. Two novel mathematical tools, the linearization trick and the deterministic equivalent method, which are based on operator-valued free probability theory, are exploited to derive the Cauchy transform of GSVs. An iterative process is proposed to obtain the numerical values of the Cauchy transform of GSVs, which can be exploited to derive the average data rates of the communication system. In addition, the special case when the channel is modeled as Rayleigh fading, i.e., the line-of-sight propagation is trivial, is analyzed. In this case, the closed-form expressions of average rates are derived from the proposed iterative process. Simulation results are provided to validate the derived analytical results.
    摘要 近年来,多输入多输出(MIMO)非正交多址(NOMA)系统引起了相关研究领域的广泛关注。作为一种可能的预编码方案,广义奇异值分解(GSVD)可以应用于 MIMO-NOMA 系统,并已被证明具有较高的频谱效率。本文研究了基于 GSVD 的 MIMO-NOMA 通信系统在莱斯衰落下的性能,特别是分析了信道矩阵的广义奇异值(GSV)的分布特性。文中利用基于算子值自由概率论的两种新的数学工具——线性化技巧和确定性等价方法——来推导 GSV 的 Cauchy 变换,并提出了一种迭代过程来获得 GSV 的 Cauchy 变换的数值,进而用于计算通信系统的平均数据速率。此外,还分析了信道为瑞利衰落(即视距传播分量可忽略)的特殊情形;在这种情形下,通过所提的迭代过程推导出了平均速率的闭式表达式。仿真结果验证了所推导的分析结果。
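For reference, the Cauchy (Stieltjes) transform evaluated iteratively in the work above has the following standard definition; μ denotes a probability measure on the real line (e.g., the limiting distribution of the GSVs), and the matrix form applies when μ is the limiting spectral measure of a sequence of random matrices W_N.

```latex
% Cauchy transform of a probability measure \mu, for z in the upper half-plane:
G_{\mu}(z) \;=\; \int_{\mathbb{R}} \frac{1}{z - t}\, \mathrm{d}\mu(t),
% and, when \mu is the limiting spectral measure of W_N,
G_{\mu}(z) \;=\; \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}\!\left[\operatorname{tr}\,(z I_N - W_N)^{-1}\right].
```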

  • paper_url: http://arxiv.org/abs/2309.09665
  • repo_url: None
  • paper_authors: Bikshapathi Gouda, Italo Atzeni, Antti Tölli
  • for: 这篇论文研究了基站(BS)配备 1 比特模数转换器(ADC)的分布式大规模多输入多输出系统中的上行功率控制问题。
  • methods: 论文对 1 比特 ADC 下的信号与失真进行了分析,并基于最小功率和最大最小 SINDR 准则提出了三种算法,用于联合优化 UE 发射功率和各 BS 处的抖动(dithering)。
  • results: 研究发现,在多 BS 情况下,通过在各 BS 处加入适当调节的抖动,可使联合合并器输出端的 SNDR 成为单峰函数,从而使 UE 能够被多个配备 1 比特 ADC 的 BS 有效服务;在多 UE 情况下,最优的 UE 发射功率由距离最远的服务 BS 决定,且抖动在提升 SINDR 方面发挥关键作用。
    Abstract We consider the problem of uplink power control for distributed massive multiple-input multiple-output systems where the base stations (BSs) are equipped with 1-bit analog-to-digital converters (ADCs). The scenario with a single user equipment (UE) is first considered to provide insights into the signal-to-noise-and-distortion ratio (SNDR). With a single BS, the SNDR is a unimodal function of the UE transmit power. With multiple BSs, the SNDR at the output of the joint combiner can be made unimodal by adding properly tuned dithering at each BS. As a result, the UE can be effectively served by multiple BSs with 1-bit ADCs. Considering the signal-to-interference-plus-noise-and-distortion ratio (SINDR) in the multi-UE scenario, we aim at optimizing the UE transmit powers and the dithering at each BS based on the min-power and max-min-SINDR criteria. To this end, we propose three algorithms with different convergence and complexity properties. Numerical results show that, if the desired SINDR can only be achieved via joint combining across multiple BSs with properly tuned dithering, the optimal UE transmit power is imposed by the distance to the farthest serving BS (unlike in the unquantized case). In this context, dithering plays a crucial role in enhancing the SINDR, especially for UEs with significant path loss disparity among the serving BSs.
    摘要 我们考虑基站(BS)配备 1 比特模数转换器(ADC)的分布式大规模多输入多输出系统中的上行功率控制问题。首先考虑单用户设备(UE)的情形,以获得对信号-噪声-失真比(SNDR)的直观认识。在单个 BS 的情况下,SNDR 是 UE 发射功率的单峰函数;在多个 BS 的情况下,通过在每个 BS 处加入适当调节的抖动,联合合并器输出端的 SNDR 也可以变为单峰函数,因此 UE 可以被多个配备 1 比特 ADC 的 BS 有效服务。针对多 UE 场景下的信号-干扰加噪声-失真比(SINDR),我们基于最小功率和最大最小 SINDR 准则优化 UE 发射功率和各 BS 的抖动,并为此提出了三种收敛性与复杂度各异的算法。数值结果表明,如果所需的 SINDR 只能通过多个 BS 的联合合并并配合适当调节的抖动来实现,那么最优的 UE 发射功率由距离最远的服务 BS 决定(这一点与未量化情形不同)。在这种情况下,抖动对提升 SINDR 至关重要,尤其是对各服务 BS 间路径损耗差异显著的 UE。

Turbo Coded OFDM-OQAM Using Hilbert Transform

  • paper_url: http://arxiv.org/abs/2309.09620
  • repo_url: None
  • paper_authors: Kasturi Vasudevan, Surendra Kota, Lov Kumar, Himanshu Bhusan Mishra
  • for: 本文探讨了使用希尔伯特变换生成带偏移正交幅度调制(OQAM)的正交频分复用(OFDM)信号的方法,并证明这种方法等价于单边带调制(SSB)。
  • methods: 本文利用希尔伯特变换生成 OFDM-OQAM,发送滤波器为复值滤波器,其实部为根升余弦脉冲,虚部为实部的希尔伯特变换;接收端使用匹配滤波器恢复消息。
  • results: 系统在离散时间下进行了仿真,采用 Turbo 编码可以在每比特平均信噪比(SNR)接近 0 dB 时实现 $10^{-5}$ 的误码率(BER)。
    Abstract Orthogonal frequency division multiplexing (OFDM) with offset quadrature amplitude modulation (OQAM) has been widely discussed in the literature and is considered a popular waveform for 5th generation (5G) wireless telecommunications and beyond. In this work, we show that OFDM-OQAM can be generated using the Hilbert transform and is equivalent to single sideband modulation (SSB), that has roots in analog telecommunications. The transmit filter for OFDM-OQAM is complex valued whose real part is given by the pulse corresponding to the root raised cosine spectrum and the imaginary part is the Hilbert transform of the real part. The real-valued digital information (message) are passed through the transmit filter and frequency division multiplexed on orthogonal subcarriers. The message bandwidth corresponding to each subcarrier is assumed to be narrow enough so that the channel can be considered ideal. Therefore, at the receiver, a matched filter can used to recover the message. Turbo coding is used to achieve bit-error-rate (BER) as low as $10^{-5}$ at an average signal-to-noise ratio (SNR) per bit close to 0 db. The system has been simulated in discrete time.
    摘要 带偏移正交幅度调制(OQAM)的正交频分复用(OFDM)在文献中已被广泛讨论,被认为是第五代(5G)及未来无线通信的一种流行波形。在本工作中,我们证明 OFDM-OQAM 可以利用希尔伯特变换生成,并且等价于源自模拟通信的单边带调制(SSB)。OFDM-OQAM 的发送滤波器是复值的,其实部为根升余弦频谱对应的脉冲,虚部为实部的希尔伯特变换。实值数字信息(消息)经过发送滤波器后,在正交子载波上进行频分复用。假设每个子载波对应的消息带宽足够窄,从而信道可视为理想信道,因此接收端可使用匹配滤波器恢复消息。采用 Turbo 编码后,在每比特平均信噪比(SNR)接近 0 dB 时即可获得低至 $10^{-5}$ 的误码率(BER)。系统在离散时间下进行了仿真。
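A small Python sketch of the complex transmit filter described above: the real part is a root-raised-cosine (RRC) pulse and the imaginary part is its Hilbert transform, obtained here via scipy.signal.hilbert (which returns the analytic signal). The roll-off, filter span, and oversampling factor are illustrative values, not the paper's.

```python
import numpy as np
from scipy.signal import hilbert

def rrc_pulse(beta, span, sps):
    """Root-raised-cosine pulse spanning `span` symbols with `sps` samples per symbol."""
    t = np.arange(-span * sps / 2, span * sps / 2 + 1) / sps   # time in symbol units
    h = np.zeros_like(t)
    mid = np.isclose(t, 0.0)                        # removable singularity at t = 0
    sing = np.isclose(np.abs(t), 1.0 / (4.0 * beta))  # removable singularity at |t| = 1/(4*beta)
    reg = ~(mid | sing)
    h[mid] = 1.0 - beta + 4.0 * beta / np.pi
    h[sing] = (beta / np.sqrt(2.0)) * (
        (1.0 + 2.0 / np.pi) * np.sin(np.pi / (4.0 * beta))
        + (1.0 - 2.0 / np.pi) * np.cos(np.pi / (4.0 * beta)))
    tr = t[reg]
    h[reg] = (np.sin(np.pi * tr * (1.0 - beta))
              + 4.0 * beta * tr * np.cos(np.pi * tr * (1.0 + beta))) / (
             np.pi * tr * (1.0 - (4.0 * beta * tr) ** 2))
    return h / np.sqrt(np.sum(h ** 2))              # unit-energy normalization

beta, span, sps = 0.25, 8, 8                        # illustrative parameters
p = rrc_pulse(beta, span, sps)                      # real part: RRC pulse
g = hilbert(p)                                      # analytic signal: p + j * Hilbert{p}
print(g.shape, np.iscomplexobj(g))                  # complex transmit filter taps
```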

Adaptive Unscented Kalman Filter under Minimum Error Entropy with Fiducial Points for Non-Gaussian Systems

  • paper_url: http://arxiv.org/abs/2309.09577
  • repo_url: None
  • paper_authors: Boyu Tian, Haiquan Zhao
  • for: 提高非高斯系统中对脉冲噪声和异常测量数据的处理能力
  • methods: 基于带基准点的最小误差熵(MEEF)提出一种新的 UKF 算法,并通过引入 correntropy 进一步增强对脉冲噪声和离群值的抑制能力
  • results: 通过仿真算例表明,所提算法在复杂非高斯噪声下具有良好的鲁棒性和估计精度
    Abstract The minimum error entropy (MEE) criterion has been extensively used in the unscented Kalman filter (UKF) to handle impulsive noises or abnormal measurement data in non-Gaussian systems. However, the MEE-UKF has poor numerical stability due to the inversion of a possibly singular matrix. In this paper, a novel UKF based on the minimum error entropy with fiducial points (MEEF) is proposed to address the problem of a non-positive-definite key matrix. By adding the correntropy to the error entropy, the proposed algorithm further enhances the ability to suppress impulse noise and outliers. At the same time, considering the uncertainty of the noise distribution, the modified Sage-Husa estimator of noise statistics is introduced to adaptively update the noise covariance matrix. In addition, the convergence analysis of the proposed algorithm provides guidance for the selection of the kernel width. The robustness and estimation accuracy of the proposed algorithm are demonstrated by state tracking examples under complex non-Gaussian noises.
    摘要 最小误差熵(MEE)准则已被广泛用于无迹卡尔曼滤波(UKF),以处理非高斯系统中的脉冲噪声或异常测量数据。然而,由于需要对可能奇异的矩阵求逆,MEE-UKF 的数值稳定性较差。本文提出一种基于带基准点的最小误差熵(MEEF)的新型 UKF,以改善关键矩阵非正定的问题。通过在误差熵中加入 correntropy,所提算法进一步增强了对脉冲噪声和离群值的抑制能力。同时,考虑到噪声分布的不确定性,引入改进的 Sage-Husa 噪声统计估计器来自适应更新噪声协方差矩阵。此外,所提算法的收敛性分析为核宽度的选择提供了指导。在复杂非高斯噪声下的状态跟踪算例验证了所提算法的鲁棒性和估计精度。
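For reference, the (Gaussian-kernel) correntropy added to the error entropy in the proposed cost is commonly defined as below; the kernel width σ is a design parameter, and the paper's full MEEF cost additionally involves error-entropy terms anchored at fiducial points, which are not reproduced here.

```latex
% Correntropy between random variables X and Y with a Gaussian kernel of width \sigma,
% and its sample estimate from errors e_i = x_i - y_i:
V_{\sigma}(X, Y) \;=\; \mathbb{E}\!\left[\exp\!\left(-\frac{(X - Y)^2}{2\sigma^2}\right)\right]
\;\approx\; \frac{1}{N}\sum_{i=1}^{N} \exp\!\left(-\frac{e_i^2}{2\sigma^2}\right).
```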

AI-Native Transceiver Design for Near-Field Ultra-Massive MIMO: Principles and Techniques

  • paper_url: http://arxiv.org/abs/2309.09575
  • repo_url: None
  • paper_authors: Wentao Yu, Yifan Ma, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief
  • for: 这篇论文旨在探讨近场超大规模多输入多输出(UM-MIMO)收发机的算法设计,以提高无线网络的频谱效率和能量效率。
  • methods: 论文从人工智能(AI)原生的视角提出了两种通用框架,用于设计近场 UM-MIMO 收发机,分别面向迭代式和非迭代式算法。
  • results: 文章以近场波束聚焦和信道估计为两个教程式示例,展示了所提 AI 原生框架在多项关键性能指标上的显著优势。
    Abstract Ultra-massive multiple-input multiple-output (UM-MIMO) is a cutting-edge technology that promises to revolutionize wireless networks by providing an unprecedentedly high spectral and energy efficiency. The enlarged array aperture of UM-MIMO facilitates the accessibility of the near-field region, thereby offering a novel degree of freedom for communications and sensing. Nevertheless, the transceiver design for such systems is challenging because of the enormous system scale, the complicated channel characteristics, and the uncertainties in propagation environments. Therefore, it is critical to study scalable, low-complexity, and robust algorithms that can efficiently characterize and leverage the properties of the near-field channel. In this article, we will advocate two general frameworks from an artificial intelligence (AI)-native perspective, which are tailored for the algorithmic design of near-field UM-MIMO transceivers. Specifically, the frameworks for both iterative and non-iterative algorithms are discussed. Near-field beam focusing and channel estimation are presented as two tutorial-style examples to demonstrate the significant advantages of the proposed AI-native frameworks in terms of various key performance indicators.
    摘要 超大规模多输入多输出(UM-MIMO)是一项前沿技术,有望带来前所未有的频谱效率和能量效率。UM-MIMO 扩大的阵列孔径使近场区域变得可达,从而为通信和感知提供了一个全新的自由度。然而,由于系统规模巨大、信道特性复杂以及传播环境存在不确定性,此类系统的收发机设计颇具挑战,因此亟需研究可扩展、低复杂度且鲁棒的算法,以高效地刻画并利用近场信道的特性。在本文中,我们从人工智能(AI)原生的角度提出两种通用框架,专为近场 UM-MIMO 收发机的算法设计而定制,分别讨论了迭代式算法和非迭代式算法的框架。文中以近场波束聚焦和信道估计作为两个教程式示例,展示了所提 AI 原生框架在多项关键性能指标上的显著优势。

A Covariance Adaptive Student’s t Based Kalman Filter

  • paper_url: http://arxiv.org/abs/2309.09565
  • repo_url: None
  • paper_authors: Benyang Gong, Jiacheng He, Gang Wang, Bei Peng
  • for: 本文旨在提高基于学生 t 分布的卡尔曼滤波器(TKF)的精度和稳定性,采用高斯混合模型(GMM)修正 TKF 中测量噪声的协方差矩阵,从而突破 TKF 的置信度调整限制。
  • methods: 本文使用 GMM 由测量噪声生成合理的协方差矩阵,并用其替换 TKF 中原有的矩阵,以修正对测量噪声的置信程度。
  • results: 实验结果表明,使用 GMM 修正 TKF 中测量噪声协方差矩阵可以提高滤波器的精度和稳定性,性能优于传统的 TKF。
    Abstract In the classical Kalman filter (KF), the estimated state is a linear combination of the one-step predicted state and the measurement state, and their confidence levels change when the prediction mean square error matrix and the covariance matrix of the measurement noise vary. The existing Student's t based Kalman filter (TKF) works similarly to the KF; both work well with impulse noise, but when it comes to Gaussian noise, the TKF encounters an adjustment limit of the confidence level, which can lead to inaccuracies in such situations. This brief optimizes the TKF by using a Gaussian mixture model (GMM), which generates a reasonable covariance matrix from the measurement noise to replace the one used in the existing algorithm and breaks the adjustment limit of the confidence level. At the end of the brief, the performance of the covariance adaptive Student's t based Kalman filter (TGKF) is verified.
    摘要 在经典卡尔曼滤波器(KF)中,估计状态是一步预测状态与测量状态的线性组合,当预测均方误差矩阵和测量噪声协方差矩阵变化时,二者的置信程度也随之变化。现有的基于学生 t 分布的卡尔曼滤波器(TKF)的工作方式与 KF 类似,二者都能很好地应对脉冲噪声;但在高斯噪声下,TKF 会遇到置信度调整的上限,这可能导致估计不准确。本文利用高斯混合模型(GMM)对 TKF 进行优化,由测量噪声生成合理的协方差矩阵来替换现有算法中使用的矩阵,从而突破置信度调整的限制。文末对协方差自适应的学生 t 卡尔曼滤波器(TGKF)的性能进行了验证。

Multi-user beamforming in RIS-aided communications and experimental validations

  • paper_url: http://arxiv.org/abs/2309.09460
  • repo_url: None
  • paper_authors: Zhibo Zhou, Haifan Yin, Li Tan, Ruikun Zhang, Kai Wang, Yingzhuang Liu
  • for: 这篇论文的目的是提出一种基于可重构智能表面(RIS)辅助的多用户通信系统,以提高频谱效率。
  • methods: 论文提出了面向 RIS 的信道估计与波束赋形设计的完整流程:将信道稀疏化步骤与广义近似消息传递(GAMP)算法相结合,并采用 Rademacher 分布生成测量矩阵以获取信道状态信息,同时提出基于二次变换的低秩多用户波束赋形(QTLM)算法来设计反射系数。
  • results: 实验结果表明,在启用 RIS 的情况下,系统的平均频谱效率提高了 13.48 bps/Hz,两个用户的接收功率分别提高了 26.6 dB 和 17.5 dB。
    Abstract Reconfigurable intelligent surface (RIS) is a promising technology for future wireless communications due to its capability of optimizing the propagation environments. Nevertheless, in literature, there are few prototypes serving multiple users. In this paper, we propose a whole flow of channel estimation and beamforming design for RIS, and set up an RIS-aided multi-user system for experimental validations. Specifically, we combine a channel sparsification step with generalized approximate message passing (GAMP) algorithm, and propose to generate the measurement matrix as Rademacher distribution to obtain the channel state information (CSI). To generate the reflection coefficients with the aim of maximizing the spectral efficiency, we propose a quadratic transform-based low-rank multi-user beamforming (QTLM) algorithm. Our proposed algorithms exploit the sparsity and low-rank properties of the channel, which has the advantages of light calculation and fast convergence. Based on the universal software radio peripheral devices, we built a complete testbed working at 5.8GHz and implemented all the proposed algorithms to verify the possibility of RIS assisting multi-user systems. Experimental results show that the system has obtained an average spectral efficiency increase of 13.48bps/Hz, with respective received power gains of 26.6dB and 17.5dB for two users, compared with the case when RIS is powered-off.
    摘要 可重构智能表面(RIS)因其优化传播环境的能力,被视为未来无线通信的一项有前景的技术。然而,现有文献中服务多个用户的原型系统很少。本文提出了面向 RIS 的信道估计与波束赋形设计的完整流程,并搭建了 RIS 辅助的多用户系统用于实验验证。具体而言,我们将信道稀疏化步骤与广义近似消息传递(GAMP)算法相结合,并提出采用 Rademacher 分布生成测量矩阵,以获取信道状态信息(CSI)。为了以最大化频谱效率为目标生成反射系数,我们提出了基于二次变换的低秩多用户波束赋形(QTLM)算法。所提算法利用了信道的稀疏性和低秩特性,具有计算量小、收敛快的优点。基于通用软件无线电外设,我们搭建了工作在 5.8 GHz 的完整测试平台,并实现了所有提出的算法,以验证 RIS 辅助多用户系统的可行性。实验结果显示,与 RIS 关闭的情形相比,系统的平均频谱效率提升了 13.48 bps/Hz,两个用户的接收功率分别提高了 26.6 dB 和 17.5 dB。
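The measurement-matrix choice described above can be sketched as follows. The system in the paper applies GAMP to the sparsified cascaded RIS channel; the snippet below only shows how a Rademacher measurement matrix can be generated and, as a stand-in for GAMP, recovers a synthetic sparse vector with a simple ISTA solver. Dimensions, sparsity level, and noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, K = 64, 256, 8                                      # measurements, ambient dim, sparsity
Phi = rng.choice([-1.0, 1.0], size=(M, N)) / np.sqrt(M)   # Rademacher measurement matrix

# Synthetic sparse channel surrogate and noisy measurements.
h = np.zeros(N)
h[rng.choice(N, K, replace=False)] = rng.normal(size=K)
y = Phi @ h + 0.01 * rng.normal(size=M)

def ista(y, A, lam=0.01, n_iters=200):
    """Iterative soft-thresholding for min 0.5*||y - Ax||^2 + lam*||x||_1 (stand-in for GAMP)."""
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2                 # 1 / Lipschitz constant of the gradient
    for _ in range(n_iters):
        z = x + step * A.T @ (y - A @ x)                   # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return x

h_hat = ista(y, Phi)
print("relative error:", np.linalg.norm(h_hat - h) / np.linalg.norm(h))
```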

Energy-efficient Integrated Sensing and Communication System and DNLFM Waveform

  • paper_url: http://arxiv.org/abs/2309.09415
  • repo_url: None
  • paper_authors: Yihua Ma, Zhifeng Yuan, Shuqiang Xia, Chen Bai, Zhongbin Wang, Yuxin Wang
  • for: 本研究旨在提高集成感知与通信(ISAC)系统的能量效率。
  • methods: 本文为 ISAC 系统设计了可复用作通信参考信号的专用感知信号,并在发射端和接收端同时加入时频匹配窗;进一步提出离散非线性调频(DNLFM),以同时实现时域恒模和频域任意加窗权重,并提出空域匹配窗以获得低旁瓣。
  • results: 仿真结果表明,所提方法相比传统方法具有更高的能量效率。
    Abstract Integrated sensing and communication (ISAC) is a key enabler of 6G. Unlike communication radio links, the sensing signal requires to experience round trips from many scatters. Therefore, sensing is more power-sensitive and faces a severer multi-target interference. In this paper, the ISAC system employs dedicated sensing signals, which can be reused as the communication reference signal. This paper proposes to add time-frequency matched windows at both the transmitting and receiving sides, which avoids mismatch loss and increases energy efficiency. Discrete non-linear frequency modulation (DNLFM) is further proposed to achieve both time-domain constant modulus and frequency-domain arbitrary windowing weights. DNLFM uses very few Newton iterations and a simple geometrically-equivalent method to generate, which greatly reduces the complex numerical integral in the conventional method. Moreover, the spatial-domain matched window is proposed to achieve low sidelobes. The simulation results show that the proposed methods gain a higher energy efficiency than conventional methods.
    摘要 集成感知与通信(ISAC)是 6G 的关键使能技术。与通信无线链路不同,感知信号需要经历来自众多散射体的往返传播,因此感知对功率更为敏感,并面临更严重的多目标干扰。本文的 ISAC 系统采用专用感知信号,该信号可复用为通信参考信号。本文提出在发射端和接收端同时加入时频匹配窗,避免失配损失并提高能量效率。进一步提出离散非线性调频(DNLFM),同时实现时域恒模和频域任意加窗权重。DNLFM 只需极少的牛顿迭代和一种简单的几何等价方法即可生成,大幅减少了传统方法中复杂的数值积分。此外,还提出空域匹配窗以获得低旁瓣。仿真结果表明,所提方法的能量效率高于传统方法。

Improving Axial Resolution of Optical Resolution Photoacoustic Microscopy with Advanced Frequency Domain Eigenspace Based Minimum Variance Beamforming Method

  • paper_url: http://arxiv.org/abs/2309.09409
  • repo_url: None
  • paper_authors: Yu-Hsiang Yu, Meng-Lin Li
  • for: 光学分辨率光声显微镜(OR-PAM)利用光学聚焦和声学检测实现微观光吸收成像。但其光学横向分辨率高而声学轴向分辨率低,导致三维可视化效果不佳,因此文献中通常只呈现二维最大幅值投影图像。
  • methods: 将轴向分辨率问题视为轴向信号重建问题,提出基于频域特征子空间最小方差(F-EIBMV)的波束形成技术,以抑制轴向旁瓣干扰和噪声,同时提升 OR-PAM 的轴向分辨率和对比度。
  • results: 对于 25 MHz 的 OR-PAM 系统,轴向点扩散函数的半高全宽从 69.3 μm 降低到 16.89 μm,表明轴向分辨率得到显著提升。
    Abstract Optical resolution photoacoustic microscopy (OR-PAM) leverages optical focusing and acoustic detection for microscopic optical absorption imaging. Intrinsically it has high optical lateral resolution and poor acoustic axial resolution. Such anisometric resolution hinders good 3-D visualization; thus 2-D maximum amplitude projection images are commonly presented in the literature. Since its axial resolution is limited by the bandwidth of acoustic detectors, ultrahigh-frequency and wideband detectors with Wiener deconvolution have been proposed to address this issue. Nonetheless, they also introduce other issues such as severe high-frequency attenuation and limited imaging depth. In this work, we view axial resolution improvement as an axial signal reconstruction problem and attribute the axial resolution degradation to axial sidelobe interference. We propose an advanced frequency-domain eigenspace-based minimum variance (F-EIBMV) beamforming technique to suppress axial sidelobe interference and noises. This method can simultaneously enhance the axial resolution and contrast of OR-PAM. For a 25-MHz OR-PAM system, the full-width at half-maximum of an axial point spread function decreased significantly from 69.3 $\mu$m to 16.89 $\mu$m, indicating a significant improvement in axial resolution.
    摘要 光学分辨率光声显微镜(OR-PAM)利用光学聚焦和声学检测实现微观光吸收成像,其固有特点是光学横向分辨率高而声学轴向分辨率低。这种各向异性的分辨率不利于良好的三维可视化,因此文献中通常只呈现二维最大幅值投影图像。由于轴向分辨率受声学探测器带宽限制,已有工作提出采用超高频宽带探测器并结合维纳反卷积来解决该问题,但这些方法也带来高频衰减严重、成像深度受限等新问题。本文将轴向分辨率提升视为轴向信号重建问题,并认为轴向分辨率退化源于轴向旁瓣干扰。我们提出一种先进的频域特征子空间最小方差(F-EIBMV)波束形成技术来抑制轴向旁瓣干扰和噪声,可同时提升 OR-PAM 的轴向分辨率和对比度。对于 25 MHz 的 OR-PAM 系统,轴向点扩散函数的半高全宽从 69.3 μm 显著降低至 16.89 μm,表明轴向分辨率得到明显提升。

cs.SD - 2023-09-17

Sound Source Distance Estimation in Diverse and Dynamic Acoustic Conditions

  • paper_url: http://arxiv.org/abs/2309.09288
  • repo_url: None
  • paper_authors: Saksham Singh Kushwaha, Iran R. Roman, Magdalena Fuentes, Juan Pablo Bello
  • for: 本研究旨在估计真实世界中运动声源相对于麦克风的到达方向(DOA)和距离。
  • methods: 本研究采用数据驱动的方法,利用包含多种环境下麦克风阵列录音的大规模开源数据集进行训练。
  • results: 我们提出了一种 CRNN 模型,能够在包含多种房间的多个数据集上出色地估计运动声源的距离,性能优于最近发表的一种方法。我们还分析了模型性能随声源真实距离和不同训练损失的变化,发现以声源真实距离的倒数加权模型误差的损失函数训练效果最佳。本研究首次证明可以利用深度学习在多种声学条件下进行声源距离估计。
    Abstract Localizing a moving sound source in the real world involves determining its direction-of-arrival (DOA) and distance relative to a microphone. Advancements in DOA estimation have been facilitated by data-driven methods optimized with large open-source datasets with microphone array recordings in diverse environments. In contrast, estimating a sound source's distance remains understudied. Existing approaches assume recordings by non-coincident microphones to use methods that are susceptible to differences in room reverberation. We present a CRNN able to estimate the distance of moving sound sources across multiple datasets featuring diverse rooms, outperforming a recently-published approach. We also characterize our model's performance as a function of sound source distance and different training losses. This analysis reveals optimal training using a loss that weighs model errors as an inverse function of the sound source true distance. Our study is the first to demonstrate that sound source distance estimation can be performed across diverse acoustic conditions using deep learning.
    摘要 在真实世界中对运动声源进行定位,需要估计其相对于麦克风的到达方向(DOA)和距离。DOA 估计的进步得益于数据驱动方法,以及在多样环境中采集的大规模开源麦克风阵列数据集;相比之下,声源距离估计的研究仍然不足。现有方法假设由非重合麦克风录制,所用方法易受房间混响差异的影响。我们提出一种 CRNN 模型,能够在包含多种房间的多个数据集上估计运动声源的距离,性能优于最近发表的一种方法。我们还分析了模型性能随声源距离和不同训练损失的变化,结果表明,以声源真实距离的倒数加权模型误差的损失函数训练效果最佳。本研究首次证明可以利用深度学习在多种声学条件下进行声源距离估计。
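The distance-dependent training loss referred to above can be sketched as follows. The paper reports that weighting model errors as an inverse function of the true source distance worked best; the exact functional form used there may differ from this minimal version.

```python
import numpy as np

def inverse_distance_weighted_mse(d_pred, d_true, eps=1e-3):
    """MSE where each sample's error is weighted by the inverse of its true distance.

    Illustrative form only; `eps` guards against division by zero for very close sources.
    """
    w = 1.0 / (d_true + eps)
    return float(np.mean(w * (d_pred - d_true) ** 2))

d_true = np.array([0.5, 1.0, 2.0, 4.0])   # metres (hypothetical ground-truth distances)
d_pred = np.array([0.7, 1.1, 1.6, 4.5])   # hypothetical model predictions
print(inverse_distance_weighted_mse(d_pred, d_true))
```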

PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts

  • paper_url: http://arxiv.org/abs/2309.09262
  • repo_url: None
  • paper_authors: Jixun Yao, Yuguang Yang, Yi Lei, Ziqian Ning, Yanni Hu, Yu Pan, Jingjing Yin, Hongbin Zhou, Heng Lu, Lei Xie
  • for: This paper aims to improve the style voice conversion process by using natural language prompts to generate a style vector and adapting the duration of discrete tokens.
  • methods: The proposed approach, called PromptVC, employs a latent diffusion model to sample the style vector from noise, with the process being conditioned on natural language prompts. The system also uses HuBERT to extract discrete tokens and replace them with the K-Means center embedding to minimize residual style information.
  • results: The subjective and objective evaluation results demonstrate the effectiveness of the proposed system, with improved style expressiveness and adaptability to different styles.
    Abstract Style voice conversion aims to transform the style of source speech to a desired style according to real-world application demands. However, the current style voice conversion approach relies on pre-defined labels or reference speech to control the conversion process, which leads to limitations in style diversity or falls short in terms of the intuitive and interpretability of style representation. In this study, we propose PromptVC, a novel style voice conversion approach that employs a latent diffusion model to generate a style vector driven by natural language prompts. Specifically, the style vector is extracted by a style encoder during training, and then the latent diffusion model is trained independently to sample the style vector from noise, with this process being conditioned on natural language prompts. To improve style expressiveness, we leverage HuBERT to extract discrete tokens and replace them with the K-Means center embedding to serve as the linguistic content, which minimizes residual style information. Additionally, we deduplicate the same discrete token and employ a differentiable duration predictor to re-predict the duration of each token, which can adapt the duration of the same linguistic content to different styles. The subjective and objective evaluation results demonstrate the effectiveness of our proposed system.
    摘要 风格语音转换旨在根据实际应用需求将源语音的风格转换为目标风格。然而,现有方法依赖预定义标签或参考语音来控制转换过程,导致风格多样性受限,或在风格表示的直观性与可解释性方面不足。本文提出 PromptVC,一种新颖的风格语音转换方法,利用隐空间扩散模型在自然语言提示的驱动下生成风格向量。具体而言,训练阶段由风格编码器提取风格向量,随后独立训练隐扩散模型,在自然语言提示的条件下从噪声中采样风格向量。为提升风格表现力,我们利用 HuBERT 提取离散 token,并将其替换为 K-Means 中心嵌入作为语言内容,以最小化残留的风格信息。此外,我们对相同的离散 token 进行去重,并采用可微分的时长预测器重新预测每个 token 的时长,使相同语言内容的时长能够适配不同风格。主观与客观评测结果均证明了所提系统的有效性。
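The token de-duplication step described above, collapsing consecutive identical discrete tokens and recording their run lengths (which the differentiable duration predictor then re-predicts per style), can be sketched as:

```python
def deduplicate_with_durations(tokens):
    """Collapse consecutive repeats of the same discrete token and record each run length."""
    units, durations = [], []
    for t in tokens:
        if units and t == units[-1]:
            durations[-1] += 1          # extend the current run
        else:
            units.append(t)             # start a new unit
            durations.append(1)
    return units, durations

units, durs = deduplicate_with_durations([5, 5, 5, 12, 12, 7, 7, 7, 7])
print(units, durs)   # [5, 12, 7] [3, 2, 4]
```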

Zero- and Few-shot Sound Event Localization and Detection

  • paper_url: http://arxiv.org/abs/2309.09223
  • repo_url: None
  • paper_authors: Kazuki Shimada, Kengo Uchida, Yuichiro Koyama, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji, Tatsuya Kawahara
  • for: 这篇论文旨在解决声音事件定位与检测(SELD)系统中的零样本与少样本任务。
  • methods: 论文使用的方法包括神经网络(NN)和对比语言-音频预训练(CLAP),并提出了 embed-ACCDOA 模型,输出逐轨的 CLAP 嵌入及对应的活动耦合笛卡尔到达方向(ACCDOA)。
  • results: 结果表明,embed-ACCDOA 模型在零样本与少样本任务中能够提升声音事件定位与检测的性能,并可与使用完整训练数据的官方基线系统相媲美。
    Abstract Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well in various sets of target classes, but they only output the DOA and temporal activation of preset classes that are trained before inference. To customize target classes after training, we tackle zero- and few-shot SELD tasks, in which we set new classes with a text sample or a few audio samples. While zero-shot sound classification tasks are achievable by embedding from contrastive language-audio pretraining (CLAP), zero-shot SELD tasks require assigning an activity and a DOA to each embedding, especially in overlapping cases. To tackle the assignment problem in overlapping cases, we propose an embed-ACCDOA model, which is trained to output track-wise CLAP embedding and corresponding activity-coupled Cartesian direction-of-arrival (ACCDOA). In our experimental evaluations on zero- and few-shot SELD tasks, the embed-ACCDOA model showed a better location-dependent scores than a straightforward combination of the CLAP audio encoder and a DOA estimation model. Moreover, the proposed combination of the embed-ACCDOA model and CLAP audio encoder with zero- or few-shot samples performed comparably to an official baseline system trained with complete train data in an evaluation dataset.
    摘要 声音事件定位与检测(SELD)系统需要为一组目标类别估计到达方向(DOA)和时间激活。基于神经网络(NN)的 SELD 系统在多种目标类别集合上表现良好,但它们只能输出推理前已训练的预设类别的 DOA 和时间激活。为了在训练后自定义目标类别,我们研究零样本与少样本 SELD 任务,即通过一段文本样本或少量音频样本来设定新的类别。零样本声音分类任务可以借助对比语言-音频预训练(CLAP)的嵌入来实现,但零样本 SELD 任务还需要为每个嵌入分配活动和 DOA,在事件重叠的情况下尤其困难。为了解决重叠情况下的分配问题,我们提出了 embed-ACCDOA 模型,训练其输出逐轨的 CLAP 嵌入以及对应的活动耦合笛卡尔到达方向(ACCDOA)。在零样本与少样本 SELD 任务的实验评估中,embed-ACCDOA 模型取得了优于将 CLAP 音频编码器与 DOA 估计模型直接组合的位置相关分数。此外,将 embed-ACCDOA 模型与 CLAP 音频编码器结合,在零样本或少样本条件下的表现可与使用完整训练数据训练的官方基线系统在评估数据集上相媲美。
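For readers unfamiliar with the ACCDOA output format mentioned above: each target class is represented by one Cartesian vector whose direction encodes the DOA and whose length encodes the activity, so detection and localization are read off jointly. The threshold value below is illustrative, not taken from the paper.

```latex
% Activity-coupled Cartesian DOA vector for class c at frame t:
\mathbf{y}_{c}(t) \;=\; a_{c}(t)\,\mathbf{u}_{c}(t), \qquad a_{c}(t) \in [0, 1], \quad \lVert \mathbf{u}_{c}(t) \rVert = 1,
% activity and DOA recovered from a network output \hat{\mathbf{y}}_{c}(t):
\hat{a}_{c}(t) \;=\; \lVert \hat{\mathbf{y}}_{c}(t) \rVert, \qquad
\hat{\mathbf{u}}_{c}(t) \;=\; \hat{\mathbf{y}}_{c}(t) / \lVert \hat{\mathbf{y}}_{c}(t) \rVert, \qquad
\text{event active if } \hat{a}_{c}(t) > \tau \ \ (\text{e.g. } \tau = 0.5).
```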

cs.CV - 2023-09-17

Deep conditional generative models for longitudinal single-slice abdominal computed tomography harmonization

  • paper_url: http://arxiv.org/abs/2309.09392
  • repo_url: https://github.com/masilab/c-slicegen
  • paper_authors: Xin Yu, Qi Yang, Yucheng Tang, Riqiang Gao, Shunxing Bao, Leon Y. Cai, Ho Hin Lee, Yuankai Huo, Ann Zenobia Moore, Luigi Ferrucci, Bennett A. Landman
  • for: 用于解决对腹部CT成像数据进行长期分析时,因为不同年份获取的扫描 slice 位置不同,导致不同的器官/组织被捕捉的问题。
  • methods: 我们提出了 C-SliceGen 方法,可以将任意的腹部 axial slice 作为输入,并在 latent space 中估算结构变化,以生成预定义的vertebral level slice。
  • results: 我们的实验表明,C-SliceGen 方法可以生成真实且相似度高的高质量图像。此外,我们在包含纵向单层腹部切片的 Baltimore Longitudinal Study of Aging (BLSA) 数据集的 1033 名受试者上评估了该方法,证明其能够以内脏脂肪面积衡量消除切片位置差异。
    Abstract Two-dimensional single-slice abdominal computed tomography (CT) provides a detailed tissue map with high resolution allowing quantitative characterization of relationships between health conditions and aging. However, longitudinal analysis of body composition changes using these scans is difficult due to positional variation between slices acquired in different years, which leading to different organs/tissues captured. To address this issue, we propose C-SliceGen, which takes an arbitrary axial slice in the abdominal region as a condition and generates a pre-defined vertebral level slice by estimating structural changes in the latent space. Our experiments on 2608 volumetric CT data from two in-house datasets and 50 subjects from the 2015 Multi-Atlas Abdomen Labeling Challenge dataset (BTCV) Challenge demonstrate that our model can generate high-quality images that are realistic and similar. We further evaluate our method's capability to harmonize longitudinal positional variation on 1033 subjects from the Baltimore Longitudinal Study of Aging (BLSA) dataset, which contains longitudinal single abdominal slices, and confirmed that our method can harmonize the slice positional variance in terms of visceral fat area. This approach provides a promising direction for mapping slices from different vertebral levels to a target slice and reducing positional variance for single-slice longitudinal analysis. The source code is available at: https://github.com/MASILab/C-SliceGen.
    摘要 “两维单片腹部计算机断层成像(CT)提供了高分辨率的组织地图,可以量化健康状况和年龄之间的关系。然而,使用这些扫描数据进行 longitudinal 分析的 Body 组成变化困难,因为不同年份扫描时的 slice 位置会有偏移。为解决这个问题,我们提出了 C-SliceGen,它可以将任意腹部 Axial slice 作为输入,并生成预定的vertebral level slice,通过估计 latent space 中的结构变化。我们的实验表明,我们的模型可以生成高质量的图像,具有实际和相似的特征。我们进一步评估了我们的方法在1033名参与者的 Baltimore Longitudinal Study of Aging(BLSA)数据集上的能力,并证明了我们的方法可以减少 slice 位置偏移的方差。这种方法提供了一个可行的方向,可以将不同 vertebral level 的 slice 映射到目标 slice 上,并减少 longitudinal 分析中的 slice 位置偏移。源代码可以在以下链接中获取:https://github.com/MASILab/C-SliceGen。”

A Critical Analysis of Internal Reliability for Uncertainty Quantification of Dense Image Matching in Multi-View Stereo

  • paper_url: http://arxiv.org/abs/2309.09379
  • repo_url: None
  • paper_authors: Debao Huang, Rongjun Qin
  • for: 该研究旨在分析多视角摄影测量点云的内部可靠性,尤其是在缺乏参考数据的情况下。
  • methods: 该研究在统一的多视角立体(MVS)框架下分析了多种内部匹配度量,包括射线汇聚统计、交会角统计、密集影像匹配(DIM)能量等。
  • results: 研究发现,在缺乏 LiDAR 参考数据的情况下,可以利用不同的内部匹配度量对多视角摄影测量点云的内部可靠性进行评估。
    Abstract Nowadays, photogrammetrically derived point clouds are widely used in many civilian applications due to their low cost and flexibility in acquisition. Typically, photogrammetric point clouds are assessed through reference data such as LiDAR point clouds. However, when reference data are not available, the assessment of photogrammetric point clouds may be challenging. Since these point clouds are algorithmically derived, their accuracies and precisions are highly varying with the camera networks, scene complexity, and dense image matching (DIM) algorithms, and there is no standard error metric to determine per-point errors. The theory of internal reliability of camera networks has been well studied through first-order error estimation of Bundle Adjustment (BA), which is used to understand the errors of 3D points assuming known measurement errors. However, the measurement errors of the DIM algorithms are intricate to an extent that every single point may have its error function determined by factors such as pixel intensity, texture entropy, and surface smoothness. Despite the complexity, there exist a few common metrics that may aid the process of estimating the posterior reliability of the derived points, especially in a multi-view stereo (MVS) setup when redundancies are present. In this paper, by using an aerial oblique photogrammetric block with LiDAR reference data, we analyze several internal matching metrics within a common MVS framework, including statistics in ray convergence, intersection angles, DIM energy, etc.
    摘要 如今,摄影测量生成的点云因其成本低、采集方式灵活而被广泛用于各类民用应用。通常,摄影测量点云通过 LiDAR 点云等参考数据进行评估;然而,当没有参考数据时,对摄影测量点云的评估可能颇具挑战。由于这些点云是由算法生成的,其精度与摄影机网络、场景复杂度以及密集影像匹配(DIM)算法密切相关,且没有标准的误差度量来确定逐点误差。摄影机网络内部可靠性的理论已通过光束法平差(BA)的一阶误差估计得到充分研究,用于在已知测量误差的假设下理解三维点的误差。然而,DIM 算法的测量误差极为复杂,每个点的误差函数都可能取决于像素强度、纹理熵和表面平滑度等因素。尽管如此,仍有一些常见的度量可以帮助估计所生成点云的后验可靠性,特别是在存在冗余观测的多视角立体(MVS)设置中。本文利用带有 LiDAR 参考数据的航空倾斜摄影测量区块,在统一的 MVS 框架下分析了若干内部匹配度量,包括射线汇聚统计、交会角、DIM 能量等。
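One of the internal metrics listed above, the intersection (convergence) angle between the two viewing rays of a triangulated point, can be computed as in the following sketch; the camera centers and the 3D point are hypothetical values.

```python
import numpy as np

def ray_intersection_angle(c1, c2, p):
    """Angle (degrees) between the viewing rays from camera centers c1, c2 to 3D point p."""
    d1 = (p - c1) / np.linalg.norm(p - c1)
    d2 = (p - c2) / np.linalg.norm(p - c2)
    return np.degrees(np.arccos(np.clip(d1 @ d2, -1.0, 1.0)))

# Two cameras 10 units apart observing a point roughly 50 units away (hypothetical geometry).
print(ray_intersection_angle(np.array([0.0, 0.0, 0.0]),
                             np.array([10.0, 0.0, 0.0]),
                             np.array([5.0, 0.0, 50.0])))
```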

MOVIN: Real-time Motion Capture using a Single LiDAR

  • paper_url: http://arxiv.org/abs/2309.09314
  • repo_url: None
  • paper_authors: Deok-Kyeong Jang, Dongseok Yang, Deok-Yun Jang, Byeoli Choi, Taeil Jin, Sung-Hee Lee
  • for: 这个论文是为了解决现有的全身跟踪系统过于昂贵、需要专业技能运行或者穿戴不适的问题,提供一种数据驱动的生成方法来实现实时全身跟踪。
  • methods: 这个方法使用一个 LiDAR 传感器来获取 3D 点云数据,并使用一个自回归条件变分自编码器(CVAE)模型来学习全身姿态的分布。
  • results: 该方法可以准确地预测表演者的 3D 全身信息和局部关节细节,同时能够考虑帧间时间上连贯的运动。
    Abstract Recent advancements in technology have brought forth new forms of interactive applications, such as the social metaverse, where end users interact with each other through their virtual avatars. In such applications, precise full-body tracking is essential for an immersive experience and a sense of embodiment with the virtual avatar. However, current motion capture systems are not easily accessible to end users due to their high cost, the requirement for special skills to operate them, or the discomfort associated with wearable devices. In this paper, we present MOVIN, the data-driven generative method for real-time motion capture with global tracking, using a single LiDAR sensor. Our autoregressive conditional variational autoencoder (CVAE) model learns the distribution of pose variations conditioned on the given 3D point cloud from LiDAR.As a central factor for high-accuracy motion capture, we propose a novel feature encoder to learn the correlation between the historical 3D point cloud data and global, local pose features, resulting in effective learning of the pose prior. Global pose features include root translation, rotation, and foot contacts, while local features comprise joint positions and rotations. Subsequently, a pose generator takes into account the sampled latent variable along with the features from the previous frame to generate a plausible current pose. Our framework accurately predicts the performer's 3D global information and local joint details while effectively considering temporally coherent movements across frames. We demonstrate the effectiveness of our architecture through quantitative and qualitative evaluations, comparing it against state-of-the-art methods. Additionally, we implement a real-time application to showcase our method in real-world scenarios. MOVIN dataset is available at \url{https://movin3d.github.io/movin_pg2023/}.
    摘要 现代技术的发展带来了新的交互式应用,例如社交元宇宙,用户通过各自的虚拟化身进行互动。在这类应用中,精准的全身跟踪对于沉浸式体验以及与虚拟化身的具身感至关重要。然而,目前的动作捕捉系统由于价格高昂、需要专门技能操作以及可穿戴设备带来的不适,对终端用户并不容易获得。本文提出 MOVIN,一种数据驱动的生成方法,利用单个 LiDAR 传感器实现带全局跟踪的实时动作捕捉。我们的自回归条件变分自编码器(CVAE)模型学习以 LiDAR 给定的 3D 点云为条件的姿态变化分布。作为实现高精度动作捕捉的关键因素,我们提出一种新的特征编码器,学习历史 3D 点云数据与全局、局部姿态特征之间的相关性,从而有效学习姿态先验。全局姿态特征包括根部平移、旋转和脚部接触,局部特征包括关节位置和旋转。随后,姿态生成器结合采样得到的隐变量与前一帧的特征,生成合理的当前姿态。我们的框架能够准确预测表演者的 3D 全局信息和局部关节细节,同时有效考虑帧间时间上连贯的运动。我们通过定量和定性评估证明了该架构的有效性,并与最先进的方法进行了比较。此外,我们还实现了一个实时应用,以展示该方法在真实场景中的效果。MOVIN 数据集可在 https://movin3d.github.io/movin_pg2023/ 获取。

Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention

  • paper_url: http://arxiv.org/abs/2309.09311
  • repo_url: None
  • paper_authors: Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim
  • for: 本研究旨在解决文本视频检索中的学习和推理偏见问题。
  • methods: 本研究使用了一种独特的方法,即考虑视频帧长度差异对于训练和测试集的影响,以避免学习和推理偏见。
  • results: 研究发现,该方法可以减少文本视频检索模型中的偏见问题,并在 Epic-Kitchens-100、YouCook2 和 MSR-VTT 等 datasets 上达到了领先的成绩。
    Abstract Many studies focus on improving pretraining or developing new backbones in text-video retrieval. However, existing methods may suffer from the learning and inference bias issue, as recent research suggests in other text-video-related tasks. For instance, spatial appearance features on action recognition or temporal object co-occurrences on video scene graph generation could induce spurious correlations. In this work, we present a unique and systematic study of a temporal bias due to frame length discrepancy between training and test sets of trimmed video clips, which is the first such attempt for a text-video retrieval task, to the best of our knowledge. We first hypothesise and verify the bias on how it would affect the model illustrated with a baseline study. Then, we propose a causal debiasing approach and perform extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets. Our model overpasses the baseline and SOTA on nDCG, a semantic-relevancy-focused evaluation metric which proves the bias is mitigated, as well as on the other conventional metrics.
    摘要 许多研究致力于改进文本-视频检索中的预训练或开发新的骨干网络。然而,正如近期其他文本-视频相关任务的研究所表明的,现有方法可能存在学习和推理偏差问题。例如,动作识别中的空间外观特征或视频场景图生成中的时序物体共现都可能引入虚假相关。在这项工作中,我们首次针对文本-视频检索任务,系统性地研究了由裁剪视频片段在训练集和测试集之间帧长差异导致的时间偏差。我们首先通过基线研究提出并验证了该偏差对模型的影响,随后提出了一种因果去偏方法,并在Epic-Kitchens-100、YouCook2和MSR-VTT数据集上进行了大量实验和消融研究。我们的模型在nDCG(一种关注语义相关性的评估指标)以及其他常规指标上均超过了基线和SOTA方法,证明该偏差得到了缓解。

UGC: Unified GAN Compression for Efficient Image-to-Image Translation

  • paper_url: http://arxiv.org/abs/2309.09310
  • repo_url: None
  • paper_authors: Yuxi Ren, Jie Wu, Peng Zhang, Manlin Zhang, Xuefeng Xiao, Qian He, Rui Wang, Min Zheng, Xin Pan
  • for: 本研究旨在提出一种新的学习模式,即统一gan压缩(UGC),以实现模型高效和标签高效的同时满足。
  • methods: 本研究使用 semi-supervised-driven 网络架构搜索和自适应在线 semi-supervised distillation 两个阶段,共同实现一种多样化、标签高效和性能优秀的模型。
  • results: 实验结果表明,UGC 可以在各种图像识别和生成任务上达到比较高的性能水平,而且比传统的 GAN 模型具有更好的计算效率和数据使用效率。
    Abstract Recent years have witnessed the prevailing progress of Generative Adversarial Networks (GANs) in image-to-image translation. However, the success of these GAN models hinges on ponderous computational costs and labor-expensive training data. Current efficient GAN learning techniques often fall into two orthogonal aspects: i) model slimming via reduced calculation costs; ii)data/label-efficient learning with fewer training data/labels. To combine the best of both worlds, we propose a new learning paradigm, Unified GAN Compression (UGC), with a unified optimization objective to seamlessly prompt the synergy of model-efficient and label-efficient learning. UGC sets up semi-supervised-driven network architecture search and adaptive online semi-supervised distillation stages sequentially, which formulates a heterogeneous mutual learning scheme to obtain an architecture-flexible, label-efficient, and performance-excellent model.
    摘要 近年来,生成对抗网络(GAN)在图像到图像翻译方面取得了很大进步。然而,GAN模型的成功依赖于沉重的计算成本和耗费人力的训练数据。当前的高效GAN学习技术通常分为两个相互独立的方面:i)通过降低计算成本进行模型精简;ii)使用更少的训练数据/标签进行数据/标签高效学习。为了结合这两个方面的优点,我们提出了一种新的学习范式,即统一GAN压缩(UGC),它通过统一的优化目标无缝地促进模型高效学习与标签高效学习的协同。UGC依次设置了半监督驱动的网络架构搜索阶段和自适应在线半监督蒸馏阶段,构成一种异质相互学习方案,从而获得架构灵活、标签高效且性能优秀的模型。

Effective Image Tampering Localization via Enhanced Transformer and Co-attention Fusion

  • paper_url: http://arxiv.org/abs/2309.09306
  • repo_url: None
  • paper_authors: Kun Guo, Haochen Zhu, Gang Cao
  • for: 本文提出了一种基于两极性增强变换encoder的图像修改地点检测网络(EITLNet),以提高图像修改检测的精度和Robustness。
  • methods: 本文使用了一种两极性增强变换encoder,并设计了一个特性增强模块来提高变换器encoder的特征表示能力。另外,通过均匀注意力模块在多个级别进行特征对比,以实现RGB和杂变流中提取的特征的有效拼接。
  • results: 实验结果表明,提出的方案在多个标准数据集上达到了当前最佳的总体化能力和Robustness。代码将于https://github.com/multimediaFor/EITLNet公开。
    Abstract Powerful manipulation techniques have made digital image forgeries be easily created and widespread without leaving visual anomalies. The blind localization of tampered regions becomes quite significant for image forensics. In this paper, we propose an effective image tampering localization network (EITLNet) based on a two-branch enhanced transformer encoder with attention-based feature fusion. Specifically, a feature enhancement module is designed to enhance the feature representation ability of the transformer encoder. The features extracted from RGB and noise streams are fused effectively by the coordinate attention-based fusion module at multiple scales. Extensive experimental results verify that the proposed scheme achieves the state-of-the-art generalization ability and robustness in various benchmark datasets. Code will be public at https://github.com/multimediaFor/EITLNet.
    摘要 强大的图像篡改技术使得数字图像伪造可以在不留下视觉异常的情况下被轻易制作并广泛传播,因此篡改区域的盲定位在图像取证中变得十分重要。在本文中,我们提出了一种基于双分支增强Transformer编码器与注意力特征融合的图像篡改定位网络(EITLNet)。具体而言,我们设计了特征增强模块以提升Transformer编码器的特征表示能力,并通过基于坐标注意力的融合模块在多个尺度上有效融合RGB流与噪声流提取的特征。大量实验结果表明,所提方案在多个基准数据集上取得了最先进的泛化能力和鲁棒性。代码将公开于 https://github.com/multimediaFor/EITLNet。
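The coordinate-attention-based fusion of RGB and noise-stream features could look roughly like the sketch below; the channel sizes and the exact wiring are assumptions, with the attention block following the generic coordinate-attention formulation rather than the paper's exact module.

```python
# Hedged sketch of coordinate-attention-based fusion of RGB and noise-stream
# feature maps (channel sizes and fusion wiring are assumptions).
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Coordinate attention (Hou et al., 2021): factorized H/W pooling."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                       # (B, C, H, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (B, C, W, 1)
        y = self.act(self.conv1(torch.cat([x_h, x_w], dim=2)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                        # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w)).permute(0, 1, 3, 2)    # (B, C, 1, W)
        return x * a_h * a_w

class RGBNoiseFusion(nn.Module):
    """Concatenate RGB- and noise-stream features, reweight, then reduce channels."""
    def __init__(self, channels=256):
        super().__init__()
        self.attn = CoordAttention(2 * channels)
        self.reduce = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_rgb, f_noise):
        return self.reduce(self.attn(torch.cat([f_rgb, f_noise], dim=1)))

fused = RGBNoiseFusion(256)(torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32))
```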

RenderIH: A Large-scale Synthetic Dataset for 3D Interacting Hand Pose Estimation

  • paper_url: http://arxiv.org/abs/2309.09301
  • repo_url: https://github.com/adwardlee/renderih
  • paper_authors: Lijun Li, Linrui Tian, Xindi Zhang, Qi Wang, Bang Zhang, Liefeng Bo, Mengyuan Liu, Chen Chen
  • for: 提高手势估计精度,增加数据的多样性和自然性。
  • methods: 使用 RenderIH 大规模生成 Synthetic 数据,并提出一种新的 pose 优化算法和一种基于 transformer 的 pose 估计网络 TransHand。
  • results: 实验表明,预训练在 RenderIH 数据上可以显著降低误差,从 6.76mm 降低至 5.79mm,并且 TransHand 超越了当前的方法。
    Abstract The current interacting hand (IH) datasets are relatively simplistic in terms of background and texture, with hand joints being annotated by a machine annotator, which may result in inaccuracies, and the diversity of pose distribution is limited. However, the variability of background, pose distribution, and texture can greatly influence the generalization ability. Therefore, we present a large-scale synthetic dataset RenderIH for interacting hands with accurate and diverse pose annotations. The dataset contains 1M photo-realistic images with varied backgrounds, perspectives, and hand textures. To generate natural and diverse interacting poses, we propose a new pose optimization algorithm. Additionally, for better pose estimation accuracy, we introduce a transformer-based pose estimation network, TransHand, to leverage the correlation between interacting hands and verify the effectiveness of RenderIH in improving results. Our dataset is model-agnostic and can improve more accuracy of any hand pose estimation method in comparison to other real or synthetic datasets. Experiments have shown that pretraining on our synthetic data can significantly decrease the error from 6.76mm to 5.79mm, and our Transhand surpasses contemporary methods. Our dataset and code are available at https://github.com/adwardlee/RenderIH.
    摘要 现有的交互手(IH)数据集在背景和纹理方面相对简单,手部关节由机器标注器标注,可能存在误差,且姿态分布的多样性有限。然而,背景、姿态分布和纹理的多样性会极大地影响模型的泛化能力。因此,我们提出了一个大规模合成数据集RenderIH,包含100万张具有多样背景、视角和手部纹理的真实感交互手图像,并带有准确且多样的姿态标注。为了生成自然且多样的交互姿态,我们提出了一种新的姿态优化算法。此外,为了获得更高的姿态估计精度,我们引入了基于Transformer的姿态估计网络TransHand,以利用交互双手之间的相关性,并验证RenderIH对结果提升的有效性。我们的数据集与模型无关,与其他真实或合成数据集相比,可以提升任何手部姿态估计方法的精度。实验表明,在我们的合成数据上进行预训练可以将误差从6.76mm显著降低到5.79mm,并且TransHand超越了当前的方法。数据集和代码可在 https://github.com/adwardlee/RenderIH 获取。

Chasing Day and Night: Towards Robust and Efficient All-Day Object Detection Guided by an Event Camera

  • paper_url: http://arxiv.org/abs/2309.09297
  • repo_url: None
  • paper_authors: Jiahang Cao, Xu Zheng, Yuanhuiyi Lyu, Jiaxu Wang, Renjing Xu, Lin Wang
  • for: 这篇论文目的是提出一种robust和高效的all-day对象检测方法,以适应实际应用中的各种照明条件。
  • methods: 该方法基于一个轻量级的脉冲神经网络(SNN),以高效利用事件数据的异步特性。另外,我们还提出了一个事件时间注意力(ETA)模块,在保留重要边缘信息的同时学习事件中的高时间分辨率信息。最后,我们提出了一种新的对称RGB-事件融合(SREF)模块,在不依赖特定感知模态的情况下有效融合RGB-事件特征。
  • results: 我们的EOLO方法在各种照明条件下表现出色,与现状的最佳方法(RENet)相比,增加了3.74%的mAP50。我们还建立了两个新的数据集,E-MSCOCO和E-VOC,以便进一步验证和改进我们的方法。
    Abstract The ability to detect objects in all lighting (i.e., normal-, over-, and under-exposed) conditions is crucial for real-world applications, such as self-driving.Traditional RGB-based detectors often fail under such varying lighting conditions.Therefore, recent works utilize novel event cameras to supplement or guide the RGB modality; however, these methods typically adopt asymmetric network structures that rely predominantly on the RGB modality, resulting in limited robustness for all-day detection. In this paper, we propose EOLO, a novel object detection framework that achieves robust and efficient all-day detection by fusing both RGB and event modalities. Our EOLO framework is built based on a lightweight spiking neural network (SNN) to efficiently leverage the asynchronous property of events. Buttressed by it, we first introduce an Event Temporal Attention (ETA) module to learn the high temporal information from events while preserving crucial edge information. Secondly, as different modalities exhibit varying levels of importance under diverse lighting conditions, we propose a novel Symmetric RGB-Event Fusion (SREF) module to effectively fuse RGB-Event features without relying on a specific modality, thus ensuring a balanced and adaptive fusion for all-day detection. In addition, to compensate for the lack of paired RGB-Event datasets for all-day training and evaluation, we propose an event synthesis approach based on the randomized optical flow that allows for directly generating the event frame from a single exposure image. We further build two new datasets, E-MSCOCO and E-VOC based on the popular benchmarks MSCOCO and PASCAL VOC. Extensive experiments demonstrate that our EOLO outperforms the state-of-the-art detectors,e.g.,RENet,by a substantial margin (+3.74% mAP50) in all lighting conditions.Our code and datasets will be available at https://vlislab22.github.io/EOLO/
    摘要 在各种照明条件(正常曝光、过曝和欠曝)下检测物体的能力对于自动驾驶等实际应用至关重要。传统的基于RGB的检测器在这类多变的照明条件下往往表现不佳。因此,近期的工作利用新型事件相机来补充或引导RGB模态;然而,这些方法通常采用主要依赖RGB模态的非对称网络结构,导致全天候检测的鲁棒性有限。在本文中,我们提出了EOLO,一种通过融合RGB与事件模态实现鲁棒且高效全天候检测的新型目标检测框架。EOLO框架基于轻量级脉冲神经网络(SNN)构建,以高效利用事件的异步特性。在此基础上,我们首先引入事件时间注意力(ETA)模块,在保留关键边缘信息的同时学习事件中的高时间分辨率信息。其次,由于不同模态在不同照明条件下的重要性各异,我们提出了一种新的对称RGB-事件融合(SREF)模块,在不依赖特定模态的情况下有效融合RGB-事件特征,从而保证全天候检测中均衡且自适应的融合。此外,为弥补全天候训练和评估所需的成对RGB-事件数据集的缺失,我们提出了一种基于随机光流的事件合成方法,可以直接从单张曝光图像生成事件帧。我们还基于流行的基准MSCOCO和PASCAL VOC构建了两个新数据集E-MSCOCO和E-VOC。大量实验表明,EOLO在所有照明条件下都以显著优势(+3.74% mAP50)超越了RENet等最先进的检测器。代码和数据集将发布于 https://vlislab22.github.io/EOLO/。
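Spiking backbones such as the one described are typically built from leaky integrate-and-fire (LIF) units; the minimal sketch below shows only the forward dynamics of such a unit (time constant, threshold, and reset rule are generic assumptions, and training would additionally require a surrogate gradient for the non-differentiable spike function).

```python
# Minimal leaky integrate-and-fire (LIF) layer of the kind lightweight SNN
# backbones are built from; values below are generic assumptions, not EOLO's design.
import torch
import torch.nn as nn

class LIFNeuron(nn.Module):
    def __init__(self, tau=2.0, v_threshold=1.0):
        super().__init__()
        self.tau, self.v_threshold = tau, v_threshold

    def forward(self, x_seq):              # x_seq: (T, B, C) input currents per time step
        v = torch.zeros_like(x_seq[0])     # membrane potential
        spikes = []
        for x_t in x_seq:
            v = v + (x_t - v) / self.tau               # leaky integration
            spike = (v >= self.v_threshold).float()    # fire when threshold is crossed
            v = v * (1.0 - spike)                      # hard reset after a spike
            spikes.append(spike)
        return torch.stack(spikes)         # binary spike train, (T, B, C)

# Event frames binned into T time steps can be fed through Conv -> LIF blocks,
# so downstream computation is only triggered where events (spikes) occur.
out = LIFNeuron()(torch.rand(8, 4, 64))
```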

LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation

  • paper_url: http://arxiv.org/abs/2309.09294
  • repo_url: https://github.com/zyhbili/LivelySpeaker
  • paper_authors: Yihao Zhi, Xiaodong Cun, Xuelin Chen, Xi Shen, Wen Guo, Shaoli Huang, Shenghua Gao
  • for: The paper focuses on developing a framework for generating co-speech gestures that are semantically aligned with the speech content, and it aims to provide several control handles for various applications.
  • methods: The proposed framework consists of two stages: script-based gesture generation and audio-guided rhythm refinement. The script-based stage uses pre-trained CLIP text embeddings as guidance, while the audio-guided stage uses a simple but effective diffusion-based gesture generation backbone conditioned on audio signals.
  • results: The proposed framework outperforms competing methods in terms of semantic awareness and rhythm alignment, achieves state-of-the-art performance on two benchmarks, and supports several applications such as changing the gesticulation style, editing co-speech gestures via textual prompting, and controlling semantic awareness and rhythm alignment with guided diffusion.
    Abstract Gestures are non-verbal but important behaviors accompanying people's speech. While previous methods are able to generate speech rhythm-synchronized gestures, the semantic context of the speech is generally lacking in the gesticulations. Although semantic gestures do not occur very regularly in human speech, they are indeed the key for the audience to understand the speech context in a more immersive environment. Hence, we introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation and offers several control handles. In particular, our method decouples the task into two stages: script-based gesture generation and audio-guided rhythm refinement. Specifically, the script-based gesture generation leverages the pre-trained CLIP text embeddings as the guidance for generating gestures that are highly semantically aligned with the script. Then, we devise a simple but effective diffusion-based gesture generation backbone simply using pure MLPs, that is conditioned on only audio signals and learns to gesticulate with realistic motions. We utilize such powerful prior to rhyme the script-guided gestures with the audio signals, notably in a zero-shot setting. Our novel two-stage generation framework also enables several applications, such as changing the gesticulation style, editing the co-speech gestures via textual prompting, and controlling the semantic awareness and rhythm alignment with guided diffusion. Extensive experiments demonstrate the advantages of the proposed framework over competing methods. In addition, our core diffusion-based generative model also achieves state-of-the-art performance on two benchmarks. The code and model will be released to facilitate future research.
    摘要 手势是伴随人类语音的重要非语言行为。尽管已有方法能够生成与语音节奏同步的手势,但这些手势通常缺乏语音的语义信息。虽然语义手势在人类语音中并不频繁出现,但它们是让听众在更沉浸的环境中理解语音上下文的关键。因此,我们提出了LivelySpeaker框架,实现了语义感知的协同语音手势生成,并提供了多种控制手段。具体而言,我们的方法将任务解耦为两个阶段:基于文本脚本的手势生成和音频引导的节奏细化。基于脚本的手势生成利用预训练的CLIP文本嵌入作为指导,生成与脚本高度语义对齐的手势。随后,我们设计了一个简单而有效的、仅由多层感知机(MLP)构成的基于扩散的手势生成骨干网络,它只以音频信号为条件,学习生成逼真的手势动作。我们利用这一强大先验,将脚本引导的手势与音频信号在节奏上对齐,尤其是在零样本设定下。我们新颖的两阶段生成框架还支持多种应用,例如改变手势风格、通过文本提示编辑协同语音手势,以及通过引导扩散控制语义感知程度和节奏对齐程度。大量实验表明所提框架优于现有方法。此外,我们的核心扩散生成模型在两个基准上也达到了最先进的性能。代码和模型将公开发布,以促进后续研究。

MVP: Meta Visual Prompt Tuning for Few-Shot Remote Sensing Image Scene Classification

  • paper_url: http://arxiv.org/abs/2309.09276
  • repo_url: None
  • paper_authors: Junjie Zhu, Yiying Li, Chunping Qiu, Ke Yang, Naiyang Guan, Xiaodong Yi
  • for: 这个研究的目的是提出一个高效且灵活的几步演练类别模型,专门适用于遥感图像分类 задачі。
  • methods: 这个研究使用了已经预训的视觉对应器模型,并将其调整为适合遥感图像分类任务。另外,研究人员还提出了一个基于patch嵌入重复的资料增强技术,以增强Scene的表现和多样性。
  • results: 实验结果显示,提出的MVP方法在不同的设定下(包括不同的way和shot)均具有较好的性能,并且在跨领域适应中也具有良好的表现。
    Abstract Vision Transformer (ViT) models have recently emerged as powerful and versatile models for various visual tasks. Recently, a work called PMF has achieved promising results in few-shot image classification by utilizing pre-trained vision transformer models. However, PMF employs full fine-tuning for learning the downstream tasks, leading to significant overfitting and storage issues, especially in the remote sensing domain. In order to tackle these issues, we turn to the recently proposed parameter-efficient tuning methods, such as VPT, which updates only the newly added prompt parameters while keeping the pre-trained backbone frozen. Inspired by VPT, we propose the Meta Visual Prompt Tuning (MVP) method. Specifically, we integrate the VPT method into the meta-learning framework and tailor it to the remote sensing domain, resulting in an efficient framework for Few-Shot Remote Sensing Scene Classification (FS-RSSC). Furthermore, we introduce a novel data augmentation strategy based on patch embedding recombination to enhance the representation and diversity of scenes for classification purposes. Experiment results on the FS-RSSC benchmark demonstrate the superior performance of the proposed MVP over existing methods in various settings, such as various-way-various-shot, various-way-one-shot, and cross-domain adaptation.
    摘要 视觉Transformer(ViT)模型近来在各类视觉任务中展现出强大而通用的能力。其中,一项名为PMF的工作利用预训练的视觉Transformer模型在少样本图像分类中取得了可喜的结果。然而,PMF对下游任务采用完全微调,导致严重的过拟合和存储问题,在遥感领域尤为突出。为了解决这些问题,我们转向近期提出的参数高效微调方法,例如VPT,它仅更新新增的提示参数,同时保持预训练骨干网络冻结。受VPT启发,我们提出了元视觉提示微调(MVP)方法。具体而言,我们将VPT方法融入元学习框架,并针对遥感领域进行定制,从而构建了一个面向少样本遥感场景分类(FS-RSSC)的高效框架。此外,我们还引入了一种基于图像块嵌入重组的新型数据增强策略,以提升场景表示的丰富性和多样性。在FS-RSSC基准上的实验结果表明,所提出的MVP在多种设定(如多way多shot、多way单shot以及跨域适应)下均优于现有方法。
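A minimal sketch of (shallow) visual prompt tuning on a frozen ViT is shown below, assuming a timm-style backbone that exposes `patch_embed`, `cls_token`, and `pos_embed`; the prompt length and model choice are illustrative, not the MVP configuration.

```python
# Hedged sketch of visual prompt tuning: a frozen ViT with learnable prompt
# tokens prepended to the patch embeddings (model and sizes are assumptions).
import torch
import torch.nn as nn
import timm

class PromptedViT(nn.Module):
    def __init__(self, num_classes, num_prompts=8):
        super().__init__()
        self.vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
        for p in self.vit.parameters():
            p.requires_grad = False                       # backbone stays frozen
        dim = self.vit.embed_dim
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)
        self.head = nn.Linear(dim, num_classes)           # only prompts + head are trained

    def forward(self, x):
        tokens = self.vit.patch_embed(x)                  # (B, N, D)
        cls = self.vit.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.vit.pos_embed
        prompts = self.prompts.expand(x.size(0), -1, -1)
        tokens = torch.cat([tokens[:, :1], prompts, tokens[:, 1:]], dim=1)
        tokens = self.vit.norm(self.vit.blocks(tokens))
        return self.head(tokens[:, 0])                    # classify from the CLS token

logits = PromptedViT(num_classes=10)(torch.randn(2, 3, 224, 224))
```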

LiDAR Data Synthesis with Denoising Diffusion Probabilistic Models

  • paper_url: http://arxiv.org/abs/2309.09256
  • repo_url: None
  • paper_authors: Kazuto Nakashima, Ryo Kurazume
  • for: 本研究旨在生成高精度的3D LiDAR数据,用于自适应移动机器人的规划和控制。
  • methods: 我们提出了一种基于DDPMs的生成模型,通过对图像表示的距离和反射强度进行描述,生成多种和高精度的3D场景点云。我们还提出了一种灵活的LiDAR完成管线,使用DDPMs的强大特性来实现。
  • results: 我们的方法在KITTI-360和KITTI-Raw数据集上的生成任务和KITTI-360数据集上的upsampling任务中表现出色,超过了基eline。我们的代码和预训练参数将在https://github.com/kazuto1011/r2dm上提供。
    Abstract Generative modeling of 3D LiDAR data is an emerging task with promising applications for autonomous mobile robots, such as scalable simulation, scene manipulation, and sparse-to-dense completion of LiDAR point clouds. Existing approaches have shown the feasibility of image-based LiDAR data generation using deep generative models while still struggling with the fidelity of generated data and training instability. In this work, we present R2DM, a novel generative model for LiDAR data that can generate diverse and high-fidelity 3D scene point clouds based on the image representation of range and reflectance intensity. Our method is based on the denoising diffusion probabilistic models (DDPMs), which have demonstrated impressive results among generative model frameworks and have been significantly progressing in recent years. To effectively train DDPMs on the LiDAR domain, we first conduct an in-depth analysis regarding data representation, training objective, and spatial inductive bias. Based on our designed model R2DM, we also introduce a flexible LiDAR completion pipeline using the powerful properties of DDPMs. We demonstrate that our method outperforms the baselines on the generation task of KITTI-360 and KITTI-Raw datasets and the upsampling task of KITTI-360 datasets. Our code and pre-trained weights will be available at https://github.com/kazuto1011/r2dm.
    摘要 3D LiDAR数据的生成式建模是一项新兴任务,在可扩展仿真、场景编辑以及LiDAR点云的稀疏到稠密补全等自主移动机器人应用中前景广阔。现有方法已经展示了使用深度生成模型进行基于图像表示的LiDAR数据生成的可行性,但仍然面临生成数据保真度不足和训练不稳定的问题。在这项工作中,我们提出了R2DM,一种新的LiDAR数据生成模型,能够基于距离和反射强度的图像表示生成多样且高保真的3D场景点云。我们的方法基于去噪扩散概率模型(DDPM),该类模型在生成模型框架中已展现出令人瞩目的成果,并在近年取得显著进展。为了在LiDAR领域有效训练DDPM,我们首先针对数据表示、训练目标和空间归纳偏置进行了深入分析。基于所设计的R2DM模型,我们还利用DDPM的强大特性引入了一个灵活的LiDAR补全流程。实验表明,我们的方法在KITTI-360和KITTI-Raw数据集的生成任务以及KITTI-360数据集的上采样任务中均优于基线方法。代码和预训练权重将发布于 https://github.com/kazuto1011/r2dm。
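A generic DDPM training step on 2-channel (range, reflectance) images might look like the following sketch; the noise schedule, image resolution, and the toy denoiser standing in for a U-Net are assumptions, not R2DM's actual design.

```python
# Minimal DDPM-style training step for 2-channel (range, reflectance) LiDAR images.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class TinyDenoiser(nn.Module):
    """Stand-in for a U-Net epsilon-predictor over (B, 2, H, W) range images."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.SiLU(), nn.Conv2d(32, 2, 3, padding=1))
    def forward(self, x_t, t):
        # broadcast the (normalized) timestep as an extra input channel
        t_map = (t.float() / T).view(-1, 1, 1, 1).expand(-1, 1, *x_t.shape[2:])
        return self.net(torch.cat([x_t, t_map], dim=1))

model = TinyDenoiser()
x0 = torch.rand(4, 2, 64, 256) * 2 - 1        # clean range/reflectance image in [-1, 1]
t = torch.randint(0, T, (4,))
noise = torch.randn_like(x0)
a_bar = alphas_bar[t].view(-1, 1, 1, 1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward diffusion q(x_t | x_0)
loss = nn.functional.mse_loss(model(x_t, t), noise)    # epsilon-prediction objective
loss.backward()
```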

Convex Latent-Optimized Adversarial Regularizers for Imaging Inverse Problems

  • paper_url: http://arxiv.org/abs/2309.09250
  • repo_url: None
  • paper_authors: Huayu Wang, Chen Luo, Taofeng Xie, Qiyu Jin, Guoqing Chen, Zhuo-Xu Cui, Dong Liang
  • for: 这个研究旨在提出一个新的、可解释的数据驱动技术,以解决Magnetic Resonance Imaging(MRI)逆问题中的挑战。
  • methods: 这个方法使用了深度学习(DL)和条件调节的融合,并使用了潜在优化技术来对抗训练一个输入圆体神经网络。
  • results: 这个研究展示了CLEAR-informed的调节模型能够在真实数据上运行,并且能够稳定地重建图像,即使在测量干扰的情况下。此外,这个方法比 conventinal data-driven技术和传统调节方法更好,具有更高的重建质量和更好的稳定性。
    Abstract Recently, data-driven techniques have demonstrated remarkable effectiveness in addressing challenges related to MR imaging inverse problems. However, these methods still exhibit certain limitations in terms of interpretability and robustness. In response, we introduce Convex Latent-Optimized Adversarial Regularizers (CLEAR), a novel and interpretable data-driven paradigm. CLEAR represents a fusion of deep learning (DL) and variational regularization. Specifically, we employ a latent optimization technique to adversarially train an input convex neural network, and its set of minima can fully represent the real data manifold. We utilize it as a convex regularizer to formulate a CLEAR-informed variational regularization model that guides the solution of the imaging inverse problem on the real data manifold. Leveraging its inherent convexity, we have established the convergence of the projected subgradient descent algorithm for the CLEAR-informed regularization model. This convergence guarantees the attainment of a unique solution to the imaging inverse problem, subject to certain assumptions. Furthermore, we have demonstrated the robustness of our CLEAR-informed model, explicitly showcasing its capacity to achieve stable reconstruction even in the presence of measurement interference. Finally, we illustrate the superiority of our approach using MRI reconstruction as an example. Our method consistently outperforms conventional data-driven techniques and traditional regularization approaches, excelling in both reconstruction quality and robustness.
    摘要 近年来,数据驱动技术在解决磁共振成像(MRI)逆问题方面展现出了显著的效果。然而,这些方法在可解释性和鲁棒性方面仍存在一定局限。为此,我们提出了凸隐变量优化对抗正则化器(CLEAR),一种新颖且可解释的数据驱动范式。CLEAR融合了深度学习(DL)与变分正则化。具体而言,我们采用隐变量优化技术对一个输入凸神经网络进行对抗训练,使其极小值集合能够完整地表示真实数据流形。我们将其作为凸正则化器,构建了CLEAR引导的变分正则化模型,引导成像逆问题的解落在真实数据流形上。凭借其固有的凸性,我们证明了CLEAR引导正则化模型的投影次梯度下降算法的收敛性;在一定假设下,该收敛性保证成像逆问题解的唯一性。此外,我们还证明了CLEAR引导模型的鲁棒性,明确展示了其在测量干扰存在时仍能实现稳定重建的能力。最后,我们以MRI重建为例说明了该方法的优越性:我们的方法在重建质量和鲁棒性两方面均持续优于传统的数据驱动技术和经典正则化方法。
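The key ingredient, an input-convex neural network used as a learned regularizer, can be sketched as follows; convexity in the input holds because the hidden-to-hidden weights are clamped to be non-negative and the activation is convex and non-decreasing. The layer sizes, the toy forward operator, and the single subgradient step are illustrative assumptions.

```python
# Hedged sketch of an input-convex neural network (ICNN) regularizer R(x).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    def __init__(self, in_dim, hidden=128, layers=3):
        super().__init__()
        self.Wx = nn.ModuleList([nn.Linear(in_dim, hidden) for _ in range(layers)])
        self.Wz = nn.ModuleList([nn.Linear(hidden, hidden, bias=False) for _ in range(layers - 1)])
        self.out = nn.Linear(hidden, 1, bias=False)

    def forward(self, x):
        z = F.softplus(self.Wx[0](x))
        for Wx, Wz in zip(self.Wx[1:], self.Wz):
            # non-negative weights on the z-path keep R convex in x
            z = F.softplus(Wx(x) + F.linear(z, Wz.weight.clamp(min=0)))
        return F.linear(z, self.out.weight.clamp(min=0)).squeeze(-1)

# One projected (sub)gradient step on the regularized objective
#   min_x 0.5 * ||A x - y||^2 + lam * R(x),
# with A, y standing in for the MRI forward operator and measurements.
R = ICNN(in_dim=64)
A, y = torch.randn(32, 64), torch.randn(32)
x = torch.zeros(64, requires_grad=True)
lam, step = 0.1, 1e-2
loss = 0.5 * (A @ x - y).pow(2).sum() + lam * R(x.unsqueeze(0)).sum()
grad = torch.autograd.grad(loss, x)[0]
x = (x - step * grad).detach().requires_grad_(True)   # projection onto constraints would go here
```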

LiteTrack: Layer Pruning with Asynchronous Feature Extraction for Lightweight and Efficient Visual Tracking

  • paper_url: http://arxiv.org/abs/2309.09249
  • repo_url: None
  • paper_authors: Qingmao Wei, Bi Zeng, Jianqi Liu, Li He, Guotian Zeng
  • for: 这个论文是为了提出一种高速的 transformer-based 视觉跟踪模型,以满足实时 робоτICS应用的需求。
  • methods: 该模型使用 asynchronous feature extraction 和模版和搜索区域之间的交互来提高特征融合和减少无用计算,并对加载的encoder层进行剪辑以达到更好的性能和速度平衡。
  • results: 最快的变体 LiteTrack-B4 在 GOT-10k 测试集上达到 65.2% AO,超过了所有之前的高效跟踪模型,并在 Jetson Orin NX 边缘设备上以 ONNX 运行速度超过 100 fps。此外,LiteTrack-B9 在 GOT-10k 上达到 72.2% AO,在 TrackingNet 上达到 82.4% AUC,并在 NVIDIA 2080Ti GPU 上以 171 fps 运行。
    Abstract The recent advancements in transformer-based visual trackers have led to significant progress, attributed to their strong modeling capabilities. However, as performance improves, running latency correspondingly increases, presenting a challenge for real-time robotics applications, especially on edge devices with computational constraints. In response to this, we introduce LiteTrack, an efficient transformer-based tracking model optimized for high-speed operations across various devices. It achieves a more favorable trade-off between accuracy and efficiency than the other lightweight trackers. The main innovations of LiteTrack encompass: 1) asynchronous feature extraction and interaction between the template and search region for better feature fusion and cutting redundant computation, and 2) pruning encoder layers from a heavy tracker to refine the balance between performance and speed. As an example, our fastest variant, LiteTrack-B4, achieves 65.2% AO on the GOT-10k benchmark, surpassing all preceding efficient trackers, while running over 100 fps with ONNX on the Jetson Orin NX edge device. Moreover, our LiteTrack-B9 reaches competitive 72.2% AO on GOT-10k and 82.4% AUC on TrackingNet, and operates at 171 fps on an NVIDIA 2080Ti GPU. The code and demo materials will be available at https://github.com/TsingWei/LiteTrack.
    摘要 基于Transformer的视觉跟踪器近来取得了显著进展,这得益于其强大的建模能力。然而,随着性能的提升,运行延迟也相应增加,这给实时机器人应用带来了挑战,尤其是在计算资源受限的边缘设备上。为此,我们提出了LiteTrack,一种针对多种设备高速运行而优化的高效Transformer跟踪模型,它在精度与效率之间取得了比其他轻量级跟踪器更优的平衡。LiteTrack的主要创新包括:1)模板与搜索区域之间的异步特征提取与交互,以实现更好的特征融合并削减冗余计算;2)从重量级跟踪器中剪除编码器层,以进一步权衡性能与速度。例如,我们最快的变体LiteTrack-B4在GOT-10k基准上达到65.2%的AO,超过了此前所有高效跟踪器,同时在Jetson Orin NX边缘设备上以ONNX运行速度超过100 fps。此外,LiteTrack-B9在GOT-10k上达到72.2%的AO,在TrackingNet上达到82.4%的AUC,并在NVIDIA 2080Ti GPU上以171 fps运行。代码和演示材料将发布于 https://github.com/TsingWei/LiteTrack。

Image-level supervision and self-training for transformer-based cross-modality tumor segmentation

  • paper_url: http://arxiv.org/abs/2309.09246
  • repo_url: None
  • paper_authors: Malo de Boisredon, Eugene Vorontsov, William Trung Le, Samuel Kadoury
  • for: 该研究旨在改进医学影像分割中的跨模态自动肿瘤分割,特别是在目标模态缺乏标注的情况下。
  • methods: 提出了一种名为MoDATTS的新型半监督训练策略,用于在非配对双模态数据集上实现精准的跨模态3D肿瘤分割;采用模态间图像到图像翻译策略生成带标注的伪目标体数据,以提升对未标注目标模态的泛化,并引入迭代自训练过程进一步缩小模态间的域差。
  • results: 在CrossMoDA 2022挑战赛中,MoDATTS的VS分割Dice得分达到0.87+/-0.04,优于其他参赛队伍的方法;在由BraTS 2020四种对比度构成的跨模态脑肿瘤分割任务中,Dice得分也持续高于基线,可达到目标监督模型性能的95%;若额外标注20%或50%的目标数据,则可分别达到该最高性能的99%和100%。
    Abstract Deep neural networks are commonly used for automated medical image segmentation, but models will frequently struggle to generalize well across different imaging modalities. This issue is particularly problematic due to the limited availability of annotated data, making it difficult to deploy these models on a larger scale. To overcome these challenges, we propose a new semi-supervised training strategy called MoDATTS. Our approach is designed for accurate cross-modality 3D tumor segmentation on unpaired bi-modal datasets. An image-to-image translation strategy between imaging modalities is used to produce annotated pseudo-target volumes and improve generalization to the unannotated target modality. We also use powerful vision transformer architectures and introduce an iterative self-training procedure to further close the domain gap between modalities. MoDATTS additionally allows the possibility to extend the training to unannotated target data by exploiting image-level labels with an unsupervised objective that encourages the model to perform 3D diseased-to-healthy translation by disentangling tumors from the background. The proposed model achieves superior performance compared to other methods from participating teams in the CrossMoDA 2022 challenge, as evidenced by its reported top Dice score of 0.87+/-0.04 for the VS segmentation. MoDATTS also yields consistent improvements in Dice scores over baselines on a cross-modality brain tumor segmentation task composed of four different contrasts from the BraTS 2020 challenge dataset, where 95% of a target supervised model performance is reached. We report that 99% and 100% of this maximum performance can be attained if 20% and 50% of the target data is additionally annotated, which further demonstrates that MoDATTS can be leveraged to reduce the annotation burden.
    摘要 深度神经网络通常用于自动医学影像分割,但模型往往难以很好地泛化到不同的成像模态。由于标注数据有限,这一问题尤为突出,使得这些模型难以在更大规模上部署。为了解决这些挑战,我们提出了一种新的半监督训练策略MoDATTS。我们的方法面向非配对双模态数据集上精准的三维跨模态肿瘤分割。我们使用成像模态之间的图像到图像翻译策略生成带标注的伪目标体数据,以改善对未标注目标模态的泛化。此外,我们使用强大的视觉Transformer架构,并引入迭代自训练过程,以进一步缩小模态之间的域差。MoDATTS还可以利用图像级标签,通过无监督目标鼓励模型进行三维"病变到健康"的翻译,将肿瘤与背景解耦,从而将训练扩展到未标注的目标数据。我们的模型在CrossMoDA 2022挑战中优于其他参赛队伍的方法,其VS分割的Dice得分达到0.87+/-0.04。在由BraTS 2020挑战数据集四种不同对比度构成的跨模态脑肿瘤分割任务上,MoDATTS的Dice得分也相对基线持续提升,达到了目标监督模型性能的95%。如果额外标注20%和50%的目标数据,则可分别达到该最高性能的99%和100%,这进一步证明MoDATTS可以减轻标注负担。

CaSAR: Contact-aware Skeletal Action Recognition

  • paper_url: http://arxiv.org/abs/2309.10001
  • repo_url: None
  • paper_authors: Junan Lin, Zhichao Sun, Enjie Cao, Taein Kwon, Mahdi Rad, Marc Pollefeys
  • for: 这篇论文的目的是提出一种新的skeletal action recognition方法,以便在AR/VR镜头和人机器人交互中使用。
  • methods: 这篇论文使用了一种新的表示方法,即手指和物体之间的接触点和远离点,以捕捉手指和物体之间的空间关系。
  • results: 该方法在两个公共数据集上达到了91.3%和98.4%的高精度,比之前的最佳方法高出了10%以上。
    Abstract Skeletal Action recognition from an egocentric view is important for applications such as interfaces in AR/VR glasses and human-robot interaction, where the device has limited resources. Most of the existing skeletal action recognition approaches use 3D coordinates of hand joints and 8-corner rectangular bounding boxes of objects as inputs, but they do not capture how the hands and objects interact with each other within the spatial context. In this paper, we present a new framework called Contact-aware Skeletal Action Recognition (CaSAR). It uses novel representations of hand-object interaction that encompass spatial information: 1) contact points where the hand joints meet the objects, 2) distant points where the hand joints are far away from the object and nearly not involved in the current action. Our framework is able to learn how the hands touch or stay away from the objects for each frame of the action sequence, and use this information to predict the action class. We demonstrate that our approach achieves the state-of-the-art accuracy of 91.3% and 98.4% on two public datasets, H2O and FPHA, respectively.
    摘要 从第一人称视角进行骨架动作识别对于AR/VR眼镜交互界面和人机交互等应用非常重要,因为这些设备的计算资源有限。现有的大多数骨架动作识别方法使用手部关节的3D坐标和物体的8角点立方体包围盒作为输入,但它们无法捕捉手与物体在空间上下文中的交互方式。在本文中,我们提出了一个名为接触感知骨架动作识别(CaSAR)的新框架。它使用了包含空间信息的新型手-物交互表示:1)手部关节与物体接触的接触点;2)手部关节远离物体、基本不参与当前动作的远离点。我们的框架能够针对动作序列的每一帧学习手如何接触或远离物体,并利用这些信息预测动作类别。我们证明了该方法在H2O和FPHA两个公共数据集上分别达到了91.3%和98.4%的最新精度水平。
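The contact/distant-point representation can be illustrated with a few lines of PyTorch: each hand joint is labeled as a contact point when it lies within a distance threshold of the object point cloud (the threshold value below is an assumption).

```python
# Illustrative computation of per-joint contact features for hand-object interaction.
import torch

def contact_features(hand_joints, obj_points, contact_thresh=0.01):
    """hand_joints: (J, 3) in meters, obj_points: (M, 3).
    Returns each joint's distance to the object and a binary contact mask."""
    d = torch.cdist(hand_joints, obj_points)   # (J, M) pairwise distances
    min_d, _ = d.min(dim=1)                    # distance of each joint to the object
    contact_mask = (min_d < contact_thresh).float()
    return min_d, contact_mask

joints = torch.rand(21, 3)        # e.g. 21 joints of one hand
obj = torch.rand(2048, 3)         # sampled object surface points
min_d, mask = contact_features(joints, obj)
# min_d and mask can then be concatenated with the joint coordinates as the
# spatially-aware input to an action-recognition backbone.
```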

CryoAlign: feature-based method for global and local 3D alignment of EM density maps

  • paper_url: http://arxiv.org/abs/2309.09217
  • repo_url: None
  • paper_authors: Bintao He, Fa Zhang, Chenjie Feng, Jianyi Yang, Xin Gao, Renmin Han
  • for: 用于冷冻电镜密度图的对齐与比较,以解读结构信息,例如通过全局对齐进行构象异质性分析,以及通过局部对齐进行原子模型组装。
  • methods: 使用局部密度特征描述符来捕捉空间结构相似性,快速建立点对对应关系并鲁棒地估计对齐参数。
  • results: 在实验评估中,CryoAlign 在对齐精度和速度上均优于现有方法。
    Abstract Advances on cryo-electron imaging technologies have led to a rapidly increasing number of density maps. Alignment and comparison of density maps play a crucial role in interpreting structural information, such as conformational heterogeneity analysis using global alignment and atomic model assembly through local alignment. Here, we propose a fast and accurate global and local cryo-electron microscopy density map alignment method CryoAlign, which leverages local density feature descriptors to capture spatial structure similarities. CryoAlign is the first feature-based EM map alignment tool, in which the employment of feature-based architecture enables the rapid establishment of point pair correspondences and robust estimation of alignment parameters. Extensive experimental evaluations demonstrate the superiority of CryoAlign over the existing methods in both alignment accuracy and speed.
    摘要 冷冻电镜成像技术的进步使得密度图的数量迅速增加。密度图的对齐与比较在解读结构信息方面起着关键作用,例如通过全局对齐进行构象异质性分析,以及通过局部对齐进行原子模型组装。本文提出了一种快速且准确的全局与局部冷冻电镜密度图对齐方法CryoAlign,它利用局部密度特征描述符来捕捉空间结构相似性。CryoAlign是首个基于特征的电镜密度图对齐工具,基于特征的架构使其能够快速建立点对对应关系并鲁棒地估计对齐参数。大量实验评估表明,CryoAlign在对齐精度和速度上均优于现有方法。

All-optical image denoising using a diffractive visual processor

  • paper_url: http://arxiv.org/abs/2309.09215
  • repo_url: None
  • paper_authors: Cagatay Isıl, Tianyi Gan, F. Onuralp Ardic, Koray Mentesoglu, Jagrit Digani, Huseyin Karaca, Hanlong Chen, Jingxi Li, Deniz Mengu, Mona Jarrahi, Kaan Akşit, Aydogan Ozcan
  • for: removes noise/artifacts from input images
  • methods: all-optical and non-iterative, using deep learning-enabled analog diffractive image denoiser
  • results: efficiently removes salt and pepper noise and image rendering-related spatial artifacts, with an output power efficiency of ~30-40%
    Abstract Image denoising, one of the essential inverse problems, targets to remove noise/artifacts from input images. In general, digital image denoising algorithms, executed on computers, present latency due to several iterations implemented in, e.g., graphics processing units (GPUs). While deep learning-enabled methods can operate non-iteratively, they also introduce latency and impose a significant computational burden, leading to increased power consumption. Here, we introduce an analog diffractive image denoiser to all-optically and non-iteratively clean various forms of noise and artifacts from input images - implemented at the speed of light propagation within a thin diffractive visual processor. This all-optical image denoiser comprises passive transmissive layers optimized using deep learning to physically scatter the optical modes that represent various noise features, causing them to miss the output image Field-of-View (FoV) while retaining the object features of interest. Our results show that these diffractive denoisers can efficiently remove salt and pepper noise and image rendering-related spatial artifacts from input phase or intensity images while achieving an output power efficiency of ~30-40%. We experimentally demonstrated the effectiveness of this analog denoiser architecture using a 3D-printed diffractive visual processor operating at the terahertz spectrum. Owing to their speed, power-efficiency, and minimal computational overhead, all-optical diffractive denoisers can be transformative for various image display and projection systems, including, e.g., holographic displays.
    摘要 图像去噪是一个基础的逆问题,其目标是去除输入图像中的噪声和伪影。一般而言,在计算机上执行的数字图像去噪算法由于需要在GPU等设备上进行多次迭代而存在延迟。基于深度学习的方法虽然可以非迭代地运行,但同样会引入延迟并带来显著的计算负担,导致功耗增加。在这里,我们提出了一种模拟衍射图像去噪器,能够以光在薄型衍射视觉处理器中传播的速度,全光学且非迭代地去除输入图像中的各类噪声和伪影。该全光学图像去噪器由经深度学习优化的被动透射层组成,这些透射层将代表各类噪声特征的光学模式物理散射到输出图像视场(FoV)之外,同时保留感兴趣的目标特征。我们的结果表明,这类衍射去噪器能够高效地去除输入相位或强度图像中的椒盐噪声以及图像渲染相关的空间伪影,同时达到约30-40%的输出功率效率。我们使用工作在太赫兹频段的3D打印衍射视觉处理器实验验证了该模拟去噪架构的有效性。凭借其速度、功率效率以及极小的计算开销,全光学衍射去噪器有望为包括全息显示在内的各类图像显示与投影系统带来变革。

Neural Gradient Learning and Optimization for Oriented Point Normal Estimation

  • paper_url: http://arxiv.org/abs/2309.09211
  • repo_url: https://github.com/LeoQLi/NGLO
  • paper_authors: Qing Li, Huifang Feng, Kanle Shi, Yi Fang, Yu-Shen Liu, Zhizhong Han
  • for: 从3D点云中学习具有一致朝向的梯度向量场,用于法向量估计。
  • methods: 使用深度学习方法,通过全局隐式表示对生成各点梯度的目标函数进行参数化,并基于局部平面几何学习角度距离场,对粗略梯度进行细化。
  • results: 提供了一种鲁棒且精确的法向量估计方法,能够抵抗噪声、离群点和点云密度变化;与以往工作相比,提高了法向量估计的精度和泛化能力。
    Abstract We propose Neural Gradient Learning (NGL), a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation. It has excellent gradient approximation properties for the underlying geometry of the data. We utilize a simple neural network to parameterize the objective function to produce gradients at points using a global implicit representation. However, the derived gradients usually drift away from the ground-truth oriented normals due to the lack of local detail descriptions. Therefore, we introduce Gradient Vector Optimization (GVO) to learn an angular distance field based on local plane geometry to refine the coarse gradient vectors. Finally, we formulate our method with a two-phase pipeline of coarse estimation followed by refinement. Moreover, we integrate two weighting functions, i.e., anisotropic kernel and inlier score, into the optimization to improve the robust and detail-preserving performance. Our method efficiently conducts global gradient approximation while achieving better accuracy and generalization ability of local feature description. This leads to a state-of-the-art normal estimator that is robust to noise, outliers and point density variations. Extensive evaluations show that our method outperforms previous works in both unoriented and oriented normal estimation on widely used benchmarks. The source code and pre-trained models are available at https://github.com/LeoQLi/NGLO.
    摘要 我们提出了神经梯度学习(NGL),一种从3D点云中学习具有一致朝向的梯度向量以进行法向量估计的深度学习方法。它对数据的底层几何具有优秀的梯度逼近特性。我们使用一个简单的神经网络对目标函数进行参数化,通过全局隐式表示在各点生成梯度。然而,由于缺乏局部细节描述,所得梯度通常会偏离真实朝向的法向量。因此,我们引入了梯度向量优化(GVO),基于局部平面几何学习角度距离场,以细化粗略的梯度向量。最后,我们将方法构建为"粗估计-细化"的两阶段流程。此外,我们在优化中引入了两个加权函数,即各向异性核与内点得分,以提升方法的鲁棒性和细节保持能力。我们的方法能够高效地进行全局梯度逼近,同时在局部特征描述上获得更高的精度和泛化能力,从而得到对噪声、离群点和点密度变化都鲁棒的最先进法向量估计器。大量评估表明,我们的方法在广泛使用的基准上,无论是无朝向还是有朝向的法向量估计,均优于以往工作。源代码和预训练模型可在 https://github.com/LeoQLi/NGLO 获取。
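A minimal sketch of the idea of predicting gradients from a global implicit function is shown below: a small MLP defines f(x) and its autograd gradient is supervised against reference normals; the network size and the simple cosine loss are assumptions, not the NGLO training objective.

```python
# Sketch: learn an oriented gradient field as the gradient of a global implicit MLP.
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(3, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1))

def predicted_gradients(points):
    points = points.clone().requires_grad_(True)
    y = f(points).sum()
    (g,) = torch.autograd.grad(y, points, create_graph=True)
    return g                                   # (N, 3) gradient of f at each point

pts = torch.rand(1024, 3)                      # point cloud sample
gt_normals = nn.functional.normalize(torch.randn(1024, 3), dim=-1)
g = predicted_gradients(pts)
# Orientation-aware loss: align the gradient direction with the reference normal.
loss = (1 - nn.functional.cosine_similarity(g, gt_normals, dim=-1)).mean()
loss.backward()
```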

Differentiable SLAM Helps Deep Learning-based LiDAR Perception Tasks

  • paper_url: http://arxiv.org/abs/2309.09206
  • repo_url: None
  • paper_authors: Prashant Kumar, Dheeraj Vattikonda, Vedang Bhupesh Shenvi Nadkarni, Erqun Dong, Sabyasachi Sahoo
  • for: 该论文旨在研究一种新的范式,以自监督方式使用可微分SLAM架构来训练基于LiDAR的端到端深度学习模型。
  • methods: 将SLAM损失作为训练信号,利用可微分SLAM架构与深度学习模型联合训练,以提升分类、回归和SLAM等任务的性能。
  • results: 实验结果表明,使用可微分SLAM架构可以提升地面高度估计和动态到静态LiDAR转换这两个深度学习应用的性能。总体而言,这些发现为提升基于LiDAR的导航系统性能提供了重要的启示。
    Abstract We investigate a new paradigm that uses differentiable SLAM architectures in a self-supervised manner to train end-to-end deep learning models in various LiDAR based applications. To the best of our knowledge there does not exist any work that leverages SLAM as a training signal for deep learning based models. We explore new ways to improve the efficiency, robustness, and adaptability of LiDAR systems with deep learning techniques. We focus on the potential benefits of differentiable SLAM architectures for improving performance of deep learning tasks such as classification, regression as well as SLAM. Our experimental results demonstrate a non-trivial increase in the performance of two deep learning applications - Ground Level Estimation and Dynamic to Static LiDAR Translation, when used with differentiable SLAM architectures. Overall, our findings provide important insights that enhance the performance of LiDAR based navigation systems. We demonstrate that this new paradigm of using SLAM Loss signal while training LiDAR based models can be easily adopted by the community.
    摘要 我们研究了一种新的范式,以自监督的方式利用可微分SLAM架构来训练各类基于LiDAR应用的端到端深度学习模型。据我们所知,目前尚不存在将SLAM作为深度学习模型训练信号的工作。我们探索了利用深度学习技术提升LiDAR系统效率、鲁棒性和适应性的新途径,重点关注可微分SLAM架构在提升分类、回归以及SLAM等深度学习任务性能方面的潜在益处。实验结果表明,将可微分SLAM架构用于地面高度估计和动态到静态LiDAR转换这两个深度学习应用时,其性能获得了显著提升。总体而言,我们的发现为提升基于LiDAR的导航系统性能提供了重要的启示。我们证明了这种在训练LiDAR模型时使用SLAM损失信号的新范式可以很容易地被社区采用。

Efficient Pyramid Channel Attention Network for Pathological Myopia Detection

  • paper_url: http://arxiv.org/abs/2309.09196
  • repo_url: https://github.com/tommylitlle/epcanet
  • paper_authors: Xiaoqing Zhang, Jilu Zhao, Richu Jin, Yan Li, Hao Wu, Xiangtian Zhou, Jiang Liu
  • for: 实现病理性近视(PM)的早期检测,以避免视力损伤和失明。
  • methods: 基于注意力模块设计,提出高效金字塔通道注意力(EPCA)模块,在特征图中高效挖掘全局和局部病变信息。
  • results: 在三个数据集上的大量实验表明,EPCA-Net 在 PM 检测上优于现有方法。此外,我们还尝试了"预训练-微调"范式,结果表明与传统微调方法相比,该方法在可调参数更少的情况下取得了具有竞争力的表现。
    Abstract Pathological myopia (PM) is the leading ocular disease for impaired vision and blindness worldwide. The key to detecting PM as early as possible is to detect informative features in global and local lesion regions, such as fundus tessellation, atrophy and maculopathy. However, applying classical convolutional neural networks (CNNs) to efficiently highlight global and local lesion context information in feature maps is quite challenging. To tackle this issue, we aim to fully leverage the potential of global and local lesion information with attention module design. Based on this, we propose an efficient pyramid channel attention (EPCA) module, which dynamically explores the relative importance of global and local lesion context information in feature maps. Then we combine the EPCA module with the backbone network to construct EPCA-Net for automatic PM detection based on fundus images. In addition, we construct a PM dataset termed PM-fundus by collecting fundus images of PM from publicly available datasets (e.g., the PALM dataset and ODIR dataset). The comprehensive experiments are conducted on three datasets, demonstrating that our EPCA-Net outperforms state-of-the-art methods in detecting PM. Furthermore, motivated by the recent pretraining-and-finetuning paradigm, we attempt to adapt pre-trained natural image models for PM detection by freezing them and treating the EPCA module and other attention modules as the adapters. The results show that our method with the pretraining-and-finetuning paradigm achieves competitive performance through comparisons to part of methods with traditional fine-tuning methods with fewer tunable parameters.
    摘要 病理性近视(PM)是全球范围内导致视力损伤和失明的主要眼部疾病。尽早检测PM的关键在于发现全局和局部病变区域中的有用特征,如眼底豹纹状改变、萎缩和黄斑病变。然而,使用经典卷积神经网络(CNN)在特征图中高效凸显全局和局部病变上下文信息颇具挑战。为了解决这一问题,我们希望通过注意力模块的设计充分挖掘全局和局部病变信息的潜力。基于此,我们提出了一种高效金字塔通道注意力(EPCA)模块,动态探索特征图中全局与局部病变上下文信息的相对重要性。随后,我们将EPCA模块与骨干网络结合,构建了用于基于眼底图像自动检测PM的EPCA-Net。此外,我们通过收集公开数据集(如PALM数据集和ODIR数据集)中的PM眼底图像,构建了名为PM-fundus的PM数据集。我们在三个数据集上进行了全面的实验,结果表明EPCA-Net在PM检测上优于最先进的方法。此外,受近期"预训练-微调"范式的启发,我们尝试冻结预训练的自然图像模型,并将EPCA模块及其他注意力模块作为适配器来进行PM检测。结果表明,采用该范式的方法在可调参数更少的情况下,与部分采用传统微调方式的方法相比取得了具有竞争力的性能。
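One way a pyramid-style channel attention block can be realized is sketched below: channel statistics are pooled at several spatial scales and mapped to per-channel weights; this is an illustrative stand-in, not the authors' EPCA module.

```python
# Hedged sketch of a pyramid-style channel attention block (sizes are assumptions).
import torch
import torch.nn as nn

class PyramidChannelAttention(nn.Module):
    def __init__(self, channels, pool_sizes=(1, 2, 4), reduction=16):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool2d(s) for s in pool_sizes])
        stats_dim = sum(s * s for s in pool_sizes)            # per-channel statistics
        hidden = max(stats_dim // reduction, 4)
        self.mlp = nn.Sequential(nn.Linear(stats_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        b, c, _, _ = x.shape
        # pool each channel at several spatial scales and concatenate the statistics
        stats = torch.cat([p(x).flatten(2) for p in self.pools], dim=2)  # (B, C, stats_dim)
        w = torch.sigmoid(self.mlp(stats)).view(b, c, 1, 1)              # per-channel weight
        return x * w

fundus_feat = torch.randn(2, 64, 56, 56)       # feature map from a CNN backbone
out = PyramidChannelAttention(64)(fundus_feat)
```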

CLIPUNetr: Assisting Human-robot Interface for Uncalibrated Visual Servoing Control with CLIP-driven Referring Expression Segmentation

  • paper_url: http://arxiv.org/abs/2309.09183
  • repo_url: None
  • paper_authors: Chen Jiang, Yuchen Yang, Martin Jagersand
  • for: 该论文旨在利用指代表达分割(Referring Expression Segmentation, RES)为无标定图像视觉伺服(UIBVS)提供更丰富、更深入的感知信息,帮助机器人在操作任务中更好地理解人类意图。
  • methods: 这个论文使用了一种新的 Referring Expression Segmentation 网络(CLIPUNetr),该网络利用 CLIP 的强大视觉语言表示能力来 segment 引用表达中的区域,同时利用其“U-shaped”编码器-解码器架构来生成更加精确的预测。此外,论文还提出了一种新的整体管道,用于将 CLIPUNetr 集成到 UIBVS 中,并在实际世界环境中应用。
  • results: 实验表明,使用 CLIPUNetr 可以提高边界和结构测量的准确性,平均提高120%,并成功地帮助实际世界中的 UIBVS 控制。
    Abstract The classical human-robot interface in uncalibrated image-based visual servoing (UIBVS) relies on either human annotations or semantic segmentation with categorical labels. Both methods fail to match natural human communication and convey rich semantics in manipulation tasks as effectively as natural language expressions. In this paper, we tackle this problem by using referring expression segmentation, which is a prompt-based approach, to provide more in-depth information for robot perception. To generate high-quality segmentation predictions from referring expressions, we propose CLIPUNetr - a new CLIP-driven referring expression segmentation network. CLIPUNetr leverages CLIP's strong vision-language representations to segment regions from referring expressions, while utilizing its ``U-shaped'' encoder-decoder architecture to generate predictions with sharper boundaries and finer structures. Furthermore, we propose a new pipeline to integrate CLIPUNetr into UIBVS and apply it to control robots in real-world environments. In experiments, our method improves boundary and structure measurements by an average of 120% and can successfully assist real-world UIBVS control in an unstructured manipulation environment.
    摘要 在无标定图像视觉伺服(UIBVS)中,经典的人机交互界面依赖于人工标注或带类别标签的语义分割。这两种方法都无法像自然语言表达那样契合人类的自然沟通方式,也无法在操作任务中传达丰富的语义。在本文中,我们通过指代表达分割这一基于提示的方法来解决这一问题,为机器人感知提供更深入的信息。为了从指代表达生成高质量的分割预测,我们提出了CLIPUNetr,一种新的由CLIP驱动的指代表达分割网络。CLIPUNetr利用CLIP强大的视觉-语言表示从指代表达中分割区域,同时借助其"U形"编码器-解码器架构生成边界更清晰、结构更精细的预测。此外,我们提出了一个新的流程,将CLIPUNetr集成到UIBVS中,并应用于真实环境下的机器人控制。实验表明,我们的方法将边界和结构度量平均提升了120%,并能成功辅助非结构化操作环境中的真实UIBVS控制。

cs.AI - 2023-09-17

ChatGPT Hallucinates when Attributing Answers

  • paper_url: http://arxiv.org/abs/2309.09401
  • repo_url: None
  • paper_authors: Guido Zuccon, Bevan Koopman, Razia Shaik
  • for: The paper aims to investigate the ability of ChatGPT to provide evidence to support its answers and to analyze the quality of the references it suggests.
  • methods: The paper uses a collection of domain-specific knowledge-based questions to prompt ChatGPT to provide answers and supporting evidence in the form of references to external sources.
  • results: The paper finds that ChatGPT provides correct or partially correct answers in about half of the cases (50.6% of the times), but its suggested references only exist 14% of the times. The generated references reveal common traits and often do not support the claims ChatGPT attributes to them.
    Abstract Can ChatGPT provide evidence to support its answers? Does the evidence it suggests actually exist and does it really support its answer? We investigate these questions using a collection of domain-specific knowledge-based questions, specifically prompting ChatGPT to provide both an answer and supporting evidence in the form of references to external sources. We also investigate how different prompts impact answers and evidence. We find that ChatGPT provides correct or partially correct answers in about half of the cases (50.6% of the times), but its suggested references only exist 14% of the times. We further provide insights on the generated references that reveal common traits among the references that ChatGPT generates, and show how even if a reference provided by the model does exist, this reference often does not support the claims ChatGPT attributes to it. Our findings are important because (1) they are the first systematic analysis of the references created by ChatGPT in its answers; (2) they suggest that the model may leverage good quality information in producing correct answers, but is unable to attribute real evidence to support its answers. Prompts, raw result files and manual analysis are made publicly available.
    摘要 ChatGPT 能否为其答案提供证据?它所建议的证据是否真实存在,且确实支持其答案?我们使用一个领域特定的知识型问题集来调查这些问题,具体做法是让 ChatGPT 同时给出答案以及以外部来源引用形式呈现的支持证据。我们还研究了不同的提示词对答案和证据的影响。我们发现 ChatGPT 在约一半的情况下(50.6%)给出正确或部分正确的答案,但其建议的参考文献只有14% 的情况下真实存在。我们进一步分析了生成的参考文献,揭示了这些参考文献的共同特征,并表明即使模型给出的参考文献确实存在,它也往往并不支持 ChatGPT 归于它的论断。我们的发现具有重要意义:(1)这是对 ChatGPT 答案中所生成参考文献的首次系统性分析;(2)结果表明该模型可能在生成正确答案时利用了高质量的信息,但无法为其答案归属真实的证据。提示词、原始结果文件和人工分析均已公开发布。

CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages

  • paper_url: http://arxiv.org/abs/2309.09400
  • repo_url: None
  • paper_authors: Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen
  • for: 这个论文的目的是为了提高大型自然语言处理(LLM)模型的学习能力,并且将其让到公众使用,以促进更深入的研究和应用。
  • methods: 该论文使用了严格的整理和筛选程序来清洁和精确地准确地训练大型自然语言处理模型。
  • results: 该论文发现,这些精确地训练的大型自然语言处理模型在多种语言中的表现优化,并且可以用于多种应用。
    Abstract The driving factors behind the development of large language models (LLMs) with impressive learning capabilities are their colossal model sizes and extensive training datasets. Along with the progress in natural language processing, LLMs have been frequently made accessible to the public to foster deeper investigation and applications. However, when it comes to training datasets for these LLMs, especially the recent state-of-the-art models, they are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. The lack of transparency for training data has thus hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source and readily usable dataset to effectively train LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous pipeline of multiple stages to accomplish the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. CulturaX is fully released to the public in HuggingFace to facilitate research and advancements in multilingual LLMs: https://huggingface.co/datasets/uonlp/CulturaX.
    摘要 大型语言模型(LLM)具备令人瞩目的学习能力,其驱动因素在于庞大的模型规模和海量的训练数据。随着自然语言处理的进步,LLM经常被公开发布,以促进更深入的研究和应用。然而,这些LLM(尤其是近期最先进的模型)的训练数据往往没有被完整公开。为高性能LLM构建训练数据需要大量的清洗和去重,以确保必要的质量水平。训练数据缺乏透明度,阻碍了对LLM幻觉和偏见问题的归因与解决研究,也妨碍了复现工作和社区的进一步发展。这些挑战在多语言学习场景中更加突出:现有的多语言文本数据集往往收集和清洗得不够充分,因此缺乏可用于有效训练多语言LLM的开源、可直接使用的数据集。为了解决这一问题,我们提出了CulturaX,一个覆盖167种语言、共计6.3万亿词元的大规模多语言数据集,专为LLM开发而构建。我们的数据集经过多阶段严格流程的细致清洗和去重,包括语言识别、基于URL的过滤、基于指标的清洗、文档精炼和数据去重,以获得用于模型训练的最佳质量。CulturaX已在HuggingFace上完全公开,以促进多语言LLM的研究与进步:https://huggingface.co/datasets/uonlp/CulturaX。
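A skeleton of such a multi-stage cleaning pipeline (URL filtering, metric-based cleaning, and hash-based deduplication) is sketched below; the blocklist, thresholds, and the omission of language identification and near-duplicate (e.g. MinHash) stages are simplifying assumptions.

```python
# Illustrative skeleton of a multi-stage corpus cleaning pipeline (not CulturaX's code).
import hashlib
import re

URL_BLOCKLIST = {"spam.example.com"}          # hypothetical blocked domains

def url_ok(url: str) -> bool:
    domain = re.sub(r"^https?://", "", url).split("/")[0]
    return domain not in URL_BLOCKLIST

def metric_ok(text: str) -> bool:
    # Simple quality heuristics: minimum length and ratio of alphabetic characters.
    if len(text) < 200:
        return False
    alpha_ratio = sum(ch.isalpha() for ch in text) / max(len(text), 1)
    return alpha_ratio > 0.6

def clean_corpus(docs):
    """docs: iterable of {'url': ..., 'text': ..., 'lang': ...} records."""
    seen_hashes = set()
    for doc in docs:
        if not url_ok(doc["url"]) or not metric_ok(doc["text"]):
            continue
        h = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if h in seen_hashes:                  # exact (hash-based) deduplication;
            continue                          # near-duplicate removal would follow
        seen_hashes.add(h)
        yield doc

sample = [{"url": "https://example.org/a", "text": "x" * 300, "lang": "en"}]
print(sum(1 for _ in clean_corpus(sample)))   # -> 1 document survives
```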

Do Large GPT Models Discover Moral Dimensions in Language Representations? A Topological Study Of Sentence Embeddings

  • paper_url: http://arxiv.org/abs/2309.09397
  • repo_url: None
  • paper_authors: Stephen Fitz
  • for: 这篇论文旨在研究基于GPT-3.5的基础语言模型的内部结构,以及这些模型在训练过程中是否形成了与公平性相关的表示。
  • methods: 该论文以Chat-GPT的基础语言模型(GPT-3.5)为对象,借鉴社会心理学文献构建公平性度量,并利用由该度量导出的低维单纯复形对句子嵌入流形的拓扑结构进行可视化分析。
  • results: 研究发现,基于GPT-3.5的句子嵌入可以分解为分别对应公平与不公平道德判断的两个子流形,这表明基于GPT的语言模型在其表示空间中发展出了道德维度,并在训练过程中习得了对公平性的理解。
    Abstract As Large Language Models are deployed within Artificial Intelligence systems, that are increasingly integrated with human society, it becomes more important than ever to study their internal structures. Higher level abilities of LLMs such as GPT-3.5 emerge in large part due to informative language representations they induce from raw text data during pre-training on trillions of words. These embeddings exist in vector spaces of several thousand dimensions, and their processing involves mapping between multiple vector spaces, with total number of parameters on the order of trillions. Furthermore, these language representations are induced by gradient optimization, resulting in a black box system that is hard to interpret. In this paper, we take a look at the topological structure of neuronal activity in the "brain" of Chat-GPT's foundation language model, and analyze it with respect to a metric representing the notion of fairness. We develop a novel approach to visualize GPT's moral dimensions. We first compute a fairness metric, inspired by social psychology literature, to identify factors that typically influence fairness assessments in humans, such as legitimacy, need, and responsibility. Subsequently, we summarize the manifold's shape using a lower-dimensional simplicial complex, whose topology is derived from this metric. We color it with a heat map associated with this fairness metric, producing human-readable visualizations of the high-dimensional sentence manifold. Our results show that sentence embeddings based on GPT-3.5 can be decomposed into two submanifolds corresponding to fair and unfair moral judgments. This indicates that GPT-based language models develop a moral dimension within their representation spaces and induce an understanding of fairness during their training process.
    摘要 随着大型语言模型(LLM)被部署到与人类社会日益融合的人工智能系统中,研究其内部结构变得前所未有地重要。GPT-3.5等LLM的高级能力很大程度上源自其在数万亿词的原始文本上预训练时归纳出的富含信息的语言表示。这些嵌入存在于数千维的向量空间中,其处理涉及多个向量空间之间的映射,总参数量达万亿级别。此外,这些语言表示是通过梯度优化得到的,因而构成了一个难以解释的黑盒系统。在本文中,我们考察了Chat-GPT基础语言模型"大脑"中神经元活动的拓扑结构,并以表示公平性概念的度量对其进行分析。我们提出了一种可视化GPT道德维度的新方法:首先,借鉴社会心理学文献,计算一种公平性度量,用于识别通常影响人类公平判断的因素,如合法性、需求和责任;随后,我们使用一个低维单纯复形来概括流形的形状,其拓扑由该度量导出;最后,我们用与该公平性度量相关的热力图对其着色,从而得到高维句子流形的可读可视化。我们的结果表明,基于GPT-3.5的句子嵌入可以分解为分别对应公平与不公平道德判断的两个子流形。这表明基于GPT的语言模型在其表示空间中发展出了道德维度,并在训练过程中习得了对公平性的理解。

Talk2Care: Facilitating Asynchronous Patient-Provider Communication with Large-Language-Model

  • paper_url: http://arxiv.org/abs/2309.09357
  • repo_url: None
  • paper_authors: Ziqi Yang, Xuhai Xu, Bingsheng Yao, Shao Zhang, Ethan Rogers, Stephen Intille, Nawar Shara, Guodong, Gao, Dakuo Wang
  • for: 本研究旨在探索大语言模型(LLM)在患者与医疗提供者异步沟通中的潜在作用。
  • methods: 研究首先对老年人(N=10)和医疗提供者(N=9)进行了两项访谈研究,以了解双方对LLM在患者-医疗提供者异步沟通中的需求和机会;随后基于这些洞察构建了LLM驱动的沟通系统Talk2Care,并为两个群体设计了交互组件:面向老年人,利用语音助手(VA)的便捷性和可达性构建LLM驱动的VA界面,以实现有效的信息收集;面向医疗提供者,构建基于LLM的仪表盘,对老年人与VA的对话进行总结并呈现其中重要的健康信息。
  • results: 两项用户研究表明,Talk2Care能够促进患者与医疗提供者之间的沟通过程,丰富从老年人处收集到的健康信息,并显著节省医疗提供者的精力和时间。我们将本研究视为LLM在医疗与人际沟通交叉领域能力的初步探索。
    Abstract Despite the plethora of telehealth applications to assist home-based older adults and healthcare providers, basic messaging and phone calls are still the most common communication methods, which suffer from limited availability, information loss, and process inefficiencies. One promising solution to facilitate patient-provider communication is to leverage large language models (LLMs) with their powerful natural conversation and summarization capability. However, there is a limited understanding of LLMs' role during the communication. We first conducted two interview studies with both older adults (N=10) and healthcare providers (N=9) to understand their needs and opportunities for LLMs in patient-provider asynchronous communication. Based on the insights, we built an LLM-powered communication system, Talk2Care, and designed interactive components for both groups: (1) For older adults, we leveraged the convenience and accessibility of voice assistants (VAs) and built an LLM-powered VA interface for effective information collection. (2) For health providers, we built an LLM-based dashboard to summarize and present important health information based on older adults' conversations with the VA. We further conducted two user studies with older adults and providers to evaluate the usability of the system. The results showed that Talk2Care could facilitate the communication process, enrich the health information collected from older adults, and considerably save providers' efforts and time. We envision our work as an initial exploration of LLMs' capability in the intersection of healthcare and interpersonal communication.
    摘要 尽管已有大量远程医疗应用可以帮助居家老年人和医疗提供者,但基本的短信和电话仍是最常用的沟通方式,这些方式存在可用性有限、信息丢失和流程低效等问题。一种有前景的解决方案是利用大语言模型(LLM)强大的自然对话和总结能力来促进患者与医疗提供者的沟通。然而,目前对LLM在这类沟通中的角色的理解还很有限。我们首先对10名老年人和9名医疗提供者进行了两项访谈研究,以了解他们对LLM在患者-医疗提供者异步沟通中的需求和机会。基于这些发现,我们构建了LLM驱动的沟通系统Talk2Care,并为两个群体设计了交互组件:(1)面向老年人,我们利用语音助手(VA)的便捷性和可达性,构建了LLM驱动的VA界面,以便有效地收集健康信息;(2)面向医疗提供者,我们构建了基于LLM的仪表盘,根据老年人与VA的对话总结并呈现重要的健康信息。我们进一步对老年人和医疗提供者进行了两项用户研究,以评估系统的可用性。结果表明,Talk2Care能够促进沟通过程,丰富从老年人处收集到的健康信息,并显著节省医疗提供者的精力和时间。我们将这项工作视为对LLM在医疗与人际沟通交叉领域能力的初步探索。

Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents

  • paper_url: http://arxiv.org/abs/2309.09346
  • repo_url: None
  • paper_authors: Carson Yu Liu, Gelareh Mohammadi, Yang Song, Wafa Johal
  • for: 这 paper 的目的是为了帮助机器人和embodied agents在人类与人类交互中更好地表达他们的态度、情感和意图。
  • methods: 这 paper 使用了一种基于 Conditional Generative Adversarial Network (GAN) 的神经网络模型,学习了语音输入中的协作姿势和语音特征之间的关系。
  • results: 试验结果表明,这个姿势生成框架可以帮助机器人和embodied agents更好地与人类交互,并且可以在对话中表达他们的态度和情感。
    Abstract Embodied agents, in the form of virtual agents or social robots, are rapidly becoming more widespread. In human-human interactions, humans use nonverbal behaviours to convey their attitudes, feelings, and intentions. Therefore, this capability is also required for embodied agents in order to enhance the quality and effectiveness of their interactions with humans. In this paper, we propose a novel framework that can generate sequences of joint angles from the speech text and speech audio utterances. Based on a conditional Generative Adversarial Network (GAN), our proposed neural network model learns the relationships between the co-speech gestures and both semantic and acoustic features from the speech input. In order to train our neural network model, we employ a public dataset containing co-speech gestures with corresponding speech audio utterances, which were captured from a single male native English speaker. The results from both objective and subjective evaluations demonstrate the efficacy of our gesture-generation framework for Robots and Embodied Agents.
    摘要 现代智能机器人和虚拟代理人正在广泛应用。人类在人际交流中使用非语言行为表达自己的态度、情感和意图。因此,这种能力也是智能机器人需要的,以提高与人类交流的质量和效率。在这篇论文中,我们提出了一种新的姿势生成框架,可以根据语音文本和语音音频utterances生成肢体姿势。我们的提议的神经网络模型学习了语音输入中的听力和语义特征与手势之间的关系。为了训练我们的神经网络模型,我们使用了一个公共数据集,包括与语音音频utterances对应的手势记录,从单一的男性Native English speaker中获取。经过对象和主观评估,我们的姿势生成框架在机器人和虚拟代理人中得到了证明。
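As a rough illustration of the conditioning described above, the sketch below (PyTorch) shows a generator that maps per-frame text and audio features plus noise to joint-angle sequences. All layer sizes, feature dimensions and the GRU backbone are assumptions for illustration only, not the authors' architecture; the adversarial training loop against a discriminator on (condition, gesture) pairs is omitted.

```python
# Minimal sketch of a conditional speech-to-gesture generator, assuming
# pre-extracted text and audio feature sequences. Shapes and layer sizes
# are illustrative, not the paper's actual architecture.
import torch
import torch.nn as nn

class GestureGenerator(nn.Module):
    def __init__(self, text_dim=300, audio_dim=80, noise_dim=16,
                 hidden_dim=256, num_joints=16):
        super().__init__()
        self.rnn = nn.GRU(text_dim + audio_dim + noise_dim,
                          hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_joints * 3)  # 3 angles per joint

    def forward(self, text_feats, audio_feats, noise):
        # text_feats: (B, T, text_dim), audio_feats: (B, T, audio_dim)
        # noise:      (B, T, noise_dim) sampled from N(0, I)
        x = torch.cat([text_feats, audio_feats, noise], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)            # (B, T, num_joints * 3) joint angles

# The discriminator would score (condition, gesture) pairs, and both
# networks would be trained with a standard conditional GAN objective.
B, T = 2, 50
gen = GestureGenerator()
angles = gen(torch.randn(B, T, 300), torch.randn(B, T, 80), torch.randn(B, T, 16))
print(angles.shape)  # torch.Size([2, 50, 48])
```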

Unleashing the Power of Dynamic Mode Decomposition and Deep Learning for Rainfall Prediction in North-East India

  • paper_url: http://arxiv.org/abs/2309.09336
  • repo_url: None
  • paper_authors: Paleti Nikhil Chowdary, Sathvika P, Pranav U, Rohan S, Sowmya V, Gopalakrishnan E A, Dhanya M
  • for: 预测北东部印度的降水量,提高灾害防御和减轻气候变化的影响
  • methods: 使用数据驱动方法Dynamic Mode Decomposition (DMD)和深度学习方法Long Short-Term Memory (LSTM)进行降水量预测,使用印度气象部门每天降水数据进行训练和验证
  • results: LSTM方法比DMD方法更加准确地预测降水量,表明LSTM方法可以更好地捕捉数据中的复杂非线性关系,这些发现可以帮助提高北东部印度的降水量预测精度,降低气候变化的影响
    Abstract Accurate rainfall forecasting is crucial for effective disaster preparedness and mitigation in the North-East region of India, which is prone to extreme weather events such as floods and landslides. In this study, we investigated the use of two data-driven methods, Dynamic Mode Decomposition (DMD) and Long Short-Term Memory (LSTM), for rainfall forecasting using daily rainfall data collected from India Meteorological Department in northeast region over a period of 118 years. We conducted a comparative analysis of these methods to determine their relative effectiveness in predicting rainfall patterns. Using historical rainfall data from multiple weather stations, we trained and validated our models to forecast future rainfall patterns. Our results indicate that both DMD and LSTM are effective in forecasting rainfall, with LSTM outperforming DMD in terms of accuracy, revealing that LSTM has the ability to capture complex nonlinear relationships in the data, making it a powerful tool for rainfall forecasting. Our findings suggest that data-driven methods such as DMD and deep learning approaches like LSTM can significantly improve rainfall forecasting accuracy in the North-East region of India, helping to mitigate the impact of extreme weather events and enhance the region's resilience to climate change.
    摘要 准确预测降水是北东地区灾害防备与减灾的关键，该地区容易受到洪水和山体滑坡等极端天气事件的影响。在这项研究中，我们调查了使用两种数据驱动方法：动态模式分解(DMD)和长短期记忆(LSTM)，以预测降水。我们使用印度气象部门在北东地区118年间收集的日降水数据进行训练和验证，并对这两种方法进行了比较分析。我们的结果表明，DMD和LSTM都能有效地预测降水，但LSTM在准确性方面表现更好，表明LSTM能够捕捉数据中的复杂非线性关系，是预测降水的有力工具。我们的发现表明，DMD等数据驱动方法和LSTM等深度学习方法可以显著提高北东地区的降水预测精度，帮助减轻极端天气事件的影响并增强该地区应对气候变化的韧性。
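For readers unfamiliar with DMD, the minimal NumPy sketch below shows the exact-DMD recipe applied to one-step-ahead forecasting of a multivariate series. The station count, rank and synthetic data are illustrative assumptions; they do not reflect the paper's actual IMD dataset or configuration.

```python
# Minimal exact-DMD sketch (NumPy only) for one-step-ahead forecasting of a
# multi-station rainfall series. Data shapes are illustrative, not the
# paper's actual setup.
import numpy as np

def dmd_forecast(series, rank=10):
    """series: (n_stations, n_days). Returns a prediction for day n_days."""
    X, Y = series[:, :-1], series[:, 1:]           # snapshot pairs x_t -> x_{t+1}
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    r = min(rank, len(s))
    U_r, s_r, V_r = U[:, :r], s[:r], Vh[:r].conj().T
    # Reduced-order linear operator A_tilde approximating Y = A X
    A_tilde = U_r.conj().T @ Y @ V_r / s_r
    x_last = U_r.conj().T @ series[:, -1]           # project the last snapshot
    x_next = U_r @ (A_tilde @ x_last)               # advance one step, lift back
    return x_next.real

rng = np.random.default_rng(0)
rain = np.abs(rng.normal(5, 2, size=(8, 365)))      # 8 stations, 1 year of days
print(dmd_forecast(rain, rank=5))
```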

Enhancing Knee Osteoarthritis severity level classification using diffusion augmented images

  • paper_url: http://arxiv.org/abs/2309.09328
  • repo_url: https://github.com/AKSHAY24-tech/Enhancing-Knee-Osteoarthritis-severity-level-classification-using-diffusion-augmented-images
  • paper_authors: Paleti Nikhil Chowdary, Gorantla V N S L Vishnu Vardhan, Menta Sai Akshay, Menta Sai Aashish, Vadlapudi Sai Aravind, Garapati Venkata Krishna Rayalu, Aswathy P
  • for: 这个研究 paper 探讨了使用高级计算机视觉模型和扩充技术来分类膝部骨关节炎(OA)严重程度。
  • methods: 该研究使用了数据预处理，包括对比度受限自适应直方图均衡(CLAHE)，以及基于扩散模型的数据扩充。共进行了三个实验：在原始数据集上训练模型，在预处理后的数据集上训练模型，以及在扩充后的数据集上训练模型。
  • results: 结果显示,数据预处理和扩充可以大幅提高模型的准确率。EfficientNetB3 模型在扩充后的数据集上达到了 84% 的最高准确率。此外,使用 Grad-CAM 等注意力视觉技术,可以提供详细的注意力地图,提高了模型的理解和信任性。这些发现指出,将高级模型与扩充数据和注意力视觉相结合,可以准确地分类膝部骨关节炎严重程度。
    Abstract This research paper explores the classification of knee osteoarthritis (OA) severity levels using advanced computer vision models and augmentation techniques. The study investigates the effectiveness of data preprocessing, including Contrast-Limited Adaptive Histogram Equalization (CLAHE), and data augmentation using diffusion models. Three experiments were conducted: training models on the original dataset, training models on the preprocessed dataset, and training models on the augmented dataset. The results show that data preprocessing and augmentation significantly improve the accuracy of the models. The EfficientNetB3 model achieved the highest accuracy of 84\% on the augmented dataset. Additionally, attention visualization techniques, such as Grad-CAM, are utilized to provide detailed attention maps, enhancing the understanding and trustworthiness of the models. These findings highlight the potential of combining advanced models with augmented data and attention visualization for accurate knee OA severity classification.
    摘要 这篇研究论文探讨了使用先进的计算机视觉模型和数据扩充技术来分类膝关节骨关节炎(OA)严重程度的方法。研究考察了数据预处理，包括对比度受限自适应直方图均衡(CLAHE)，以及使用扩散模型进行数据扩充。研究进行了三个实验：在原始数据集上训练模型，在预处理后的数据集上训练模型，以及在扩充后的数据集上训练模型。结果显示，数据预处理和扩充可以显著提高模型的准确率。EfficientNetB3模型在扩充后的数据集上达到了84%的最高准确率。此外，使用Grad-CAM等注意力可视化技术，为模型提供了详细的注意力地图，提高了对模型的理解和信任度。这些发现表明，结合先进模型、扩充数据和注意力可视化，可以实现精准的膝关节骨关节炎严重程度分类。
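The CLAHE preprocessing step mentioned above is available directly in OpenCV. The short sketch below is an illustrative assumption of how such a step could look; the clip limit, tile size, output resolution and file name are placeholders, not the paper's settings.

```python
# Minimal CLAHE preprocessing sketch with OpenCV; the clip limit, tile size
# and file path are illustrative assumptions, not the paper's settings.
import cv2

def preprocess_knee_xray(path, clip_limit=2.0, tile_grid=(8, 8)):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # knee radiographs are grayscale
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    enhanced = cv2.resize(clahe.apply(img), (300, 300))
    return enhanced

# enhanced = preprocess_knee_xray("knee_xray_0001.png")  # hypothetical file name
# The enhanced (and diffusion-augmented) images are then fed to classifiers
# such as EfficientNetB3 for severity-level prediction.
```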

Answering Layer 3 queries with DiscoSCMs

  • paper_url: http://arxiv.org/abs/2309.09323
  • repo_url: None
  • paper_authors: Heyang Gong
  • For: This paper aims to address the issue of counterfactual degeneration in causal inference, specifically in the context of Layer 3 valuations and individual-level semantics.
  • Methods: The paper proposes a novel framework called DiscoSCM, which combines the strengths of both the Potential Outcome (PO) and Structural Causal Model (SCM) frameworks and can be seen as an extension of them. DiscoSCM leverages the philosophy of individual causality to tackle the counterfactual degeneration problem.
  • Results: The paper demonstrates the superior performance of the DiscoSCM framework in answering counterfactual questions through several key results on unit selection problems, showing that it can effectively address counterfactual degeneration and provide more accurate estimates of counterfactual parameters.
    Abstract In the realm of causal inference, the primary frameworks are the Potential Outcome (PO) and the Structural Causal Model (SCM), both predicated on the consistency rule. However, when facing Layer 3 valuations, i.e., counterfactual queries that inherently belong to individual-level semantics, they both seem inadequate due to the issue of degeneration caused by the consistency rule. For instance, in personalized incentive scenarios within the internet industry, the probability of one particular user being a complier, denoted as $P(y_x, y'_{x'})$, degenerates to a parameter that can only take values of 0 or 1. This paper leverages the DiscoSCM framework to theoretically tackle the aforementioned counterfactual degeneration problem, which is a novel framework for causal modeling that combines the strengths of both PO and SCM, and could be seen as an extension of them. The paper starts with a brief introduction to the background of causal modeling frameworks. It then illustrates, through an example, the difficulty in recovering counterfactual parameters from data without imposing strong assumptions. Following this, we propose the DiscoSCM with independent potential noise framework to address this problem. Subsequently, the superior performance of the DiscoSCM framework in answering counterfactual questions is demonstrated by several key results in the topic of unit select problems. We then elucidate that this superiority stems from the philosophy of individual causality. In conclusion, we suggest that DiscoSCM may serve as a significant milestone in the causal modeling field for addressing counterfactual queries.
    摘要 在因果推断领域，主要框架是潜在结果(PO)和结构因果模型(SCM)，两者都以一致性规则为前提。但在面对第3层评估（即本质上属于个体层语义的反事实查询）时，由于一致性规则导致的退化问题，它们都显得力不从心。例如，在互联网行业的个性化激励场景下，某个特定用户是"依从者"的概率 $P(y_x, y'_{x'})$ 会退化为一个只能取 0 或 1 的参数。本文利用 DiscoSCM 框架从理论上解决上述反事实退化问题；DiscoSCM 是一种结合 PO 和 SCM 优点的新的因果建模框架，可以看作是二者的扩展。文章首先简要介绍因果建模框架的背景，然后通过一个例子说明在不施加强假设的情况下难以从数据中恢复反事实参数。接着，我们提出带有独立潜在噪声的 DiscoSCM 框架来解决这个问题，并通过单元选择问题上的若干关键结果展示 DiscoSCM 在回答反事实问题上的优越表现，进而阐明这种优越性源于个体因果性的哲学。最后，我们认为 DiscoSCM 可能成为因果建模领域中处理反事实查询的一个重要里程碑。

Active Learning for Semantic Segmentation with Multi-class Label Query

  • paper_url: http://arxiv.org/abs/2309.09319
  • repo_url: None
  • paper_authors: Sehyun Hwang, Sohyun Lee, Hoyoung Kim, Minhyeon Oh, Jungseul Ok, Suha Kwak
  • for: 提出了一种新的活动学习方法 дляsemantic segmentation
  • methods: 使用了一种新的标注查询设计,采样了本地图像区域(例如超 pix),并向 oracle 请求每个区域的多类标签vector,以解决存在多类标签问题
  • results: 在Cityscapes和PASCAL VOC 2012上比前一个方法减少了标注成本,并且达到了更高的 segmentation 性能
    Abstract This paper proposes a new active learning method for semantic segmentation. The core of our method lies in a new annotation query design. It samples informative local image regions (e.g., superpixels), and for each of such regions, asks an oracle for a multi-hot vector indicating all classes existing in the region. This multi-class labeling strategy is substantially more efficient than existing ones like segmentation, polygon, and even dominant class labeling in terms of annotation time per click. However, it introduces the class ambiguity issue in training since it assigns partial labels (i.e., a set of candidate classes) to individual pixels. We thus propose a new algorithm for learning semantic segmentation while disambiguating the partial labels in two stages. In the first stage, it trains a segmentation model directly with the partial labels through two new loss functions motivated by partial label learning and multiple instance learning. In the second stage, it disambiguates the partial labels by generating pixel-wise pseudo labels, which are used for supervised learning of the model. Equipped with a new acquisition function dedicated to the multi-class labeling, our method outperformed previous work on Cityscapes and PASCAL VOC 2012 while spending less annotation cost.
    摘要 这篇论文提出了一种新的语义分割主动学习方法。该方法的核心是一种新的标注查询设计：它采样信息量大的局部图像区域（例如超像素），并请求标注者为每个区域提供一个多热向量，指明该区域中存在的所有类别。这种多类标注策略在每次点击的标注时间上比现有的分割、多边形甚至主导类标注都更高效，但由于它为单个像素分配的是部分标签（即一组候选类别），会在训练中引入类别歧义问题。为此，该方法提出了两阶段的学习算法：第一阶段，直接使用部分标签训练分割模型，采用受部分标签学习和多实例学习启发的两种新损失函数；第二阶段，通过生成像素级伪标签来消除部分标签的歧义，并将其用于模型的监督学习。配合一种专为多类标注设计的新采集函数，该方法在 Cityscapes 和 PASCAL VOC 2012 上以更低的标注成本超越了之前的工作。
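The first training stage relies on losses motivated by partial-label learning. A common formulation of such a loss maximises the probability mass a pixel's prediction places on its candidate-class set; the PyTorch sketch below shows this generic version as an assumption, since the paper's two loss terms are defined differently.

```python
# Sketch of a standard partial-label loss for pixels annotated with a
# multi-hot candidate-class vector (an assumed, typical formulation; the
# paper defines its own two loss terms).
import torch
import torch.nn.functional as F

def partial_label_loss(logits, candidate_mask, eps=1e-8):
    # logits:         (N_pixels, num_classes) raw scores
    # candidate_mask: (N_pixels, num_classes) multi-hot, 1 for candidate classes
    probs = F.softmax(logits, dim=-1)
    p_candidates = (probs * candidate_mask).sum(dim=-1)   # mass on candidates
    return -(p_candidates + eps).log().mean()

logits = torch.randn(5, 19, requires_grad=True)            # e.g. 19 Cityscapes classes
mask = torch.zeros(5, 19)
mask[:, [0, 8, 13]] = 1.0                                   # region labelled {road, vegetation, car}
loss = partial_label_loss(logits, mask)
loss.backward()
print(loss.item())
```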

Deep Neighbor Layer Aggregation for Lightweight Self-Supervised Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2309.09272
  • repo_url: None
  • paper_authors: Boya Wang, Shuo Wang, Ziwen Dou, Dong Ye
  • for: 提高自主导航和机器人视觉系统中 depth estimation 模型的效率,避免使用大型和复杂的网络。
  • methods: 采用基于上下文特征融合的全卷积结构，将高分辨率和低分辨率特征相互融合，以保留小目标和快速移动物体的信息；并在解码阶段采用基于卷积的轻量级通道注意力机制来提升深度估计结果。
  • results: 在 KITTI 数据集上进行实验,与许多大型模型相比,如 Monodepth2,我们的方法可以达到更高的准确率,仅使用 30 个参数。
    Abstract With the frequent use of self-supervised monocular depth estimation in robotics and autonomous driving, the model's efficiency is becoming increasingly important. Most current approaches apply much larger and more complex networks to improve the precision of depth estimation. Some researchers incorporated Transformer into self-supervised monocular depth estimation to achieve better performance. However, this method leads to high parameters and high computation. We present a fully convolutional depth estimation network using contextual feature fusion. Compared to UNet++ and HRNet, we use high-resolution and low-resolution features to reserve information on small targets and fast-moving objects instead of long-range fusion. We further promote depth estimation results employing lightweight channel attention based on convolution in the decoder stage. Our method reduces the parameters without sacrificing accuracy. Experiments on the KITTI benchmark show that our method can get better results than many large models, such as Monodepth2, with only 30 parameters. The source code is available at https://github.com/boyagesmile/DNA-Depth.
    摘要 随着自主导航和机器人领域中深度估计的频繁使用,模型的效率变得越来越重要。现有大多数方法使用更大和更复杂的网络来提高深度估计的精度。一些研究人员在自主导航中采用了Transformer来提高性能,但这会导致高参数和高计算量。我们提出了一种完全 convolutional 的深度估计网络,通过Contextual feature fusion来提高性能。我们使用高分辨率和低分辨率的特征来保留小目标和快速移动的信息,而不是长距离融合。我们还使用轻量级的通道注意力来提高decoder stage中的深度估计结果。我们的方法可以降低参数量不 sacrificing 精度。在 KITTI 标准测试集上,我们的方法可以超越许多大型模型,如 Monodepth2,并且只需30个参数。代码可以在 https://github.com/boyagesmile/DNA-Depth 上获取。
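To illustrate the kind of lightweight, convolution-based channel attention mentioned for the decoder, the sketch below shows a squeeze-and-excitation-style module in PyTorch. Treat it as an assumed stand-in: the actual DNA-Depth module and its placement may differ.

```python
# Minimal convolution-based channel-attention sketch (SE-style); the actual
# DNA-Depth module may differ. It only illustrates reweighting decoder
# channels at low parameter cost.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (B, C, H, W) decoder feature map
        w = self.fc(self.pool(x))         # (B, C, 1, 1) per-channel weights
        return x * w                      # reweighted features

feat = torch.randn(2, 64, 48, 160)
print(ChannelAttention(64)(feat).shape)   # torch.Size([2, 64, 48, 160])
```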

Continuous Modeling of the Denoising Process for Speech Enhancement Based on Deep Learning

  • paper_url: http://arxiv.org/abs/2309.09270
  • repo_url: None
  • paper_authors: Zilu Guo, Jun Du, CHin-Hui Lee
  • for: 这个论文旨在提出一种连续模型基于深度学习的语音听写提升方法,关注听写过程中的干净化。
  • methods: 这个方法使用状态变量来表示听写过程,起始状态是噪声语音,结束状态是干净的语音。噪声分量在状态变量中随状态索引的变化而减少,直到噪声分量为0。在训练中,一个UNet-like神经网络学习估算每个状态变量,从连续听写过程中抽取样本。在测试中,我们引入一个控制因子作为嵌入,让神经网络控制噪声减少的水平。这种方法实现可控的语音提升和适应不同应用场景。
  • results: 实验结果表明,保留一小amount的噪声在干净目标中对语音提升具有利于效果,证明了对语音评价指标和自动语音识别性能的改进。
    Abstract In this paper, we explore a continuous modeling approach for deep-learning-based speech enhancement, focusing on the denoising process. We use a state variable to indicate the denoising process. The starting state is noisy speech and the ending state is clean speech. The noise component in the state variable decreases with the change of the state index until the noise component is 0. During training, a UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process. In testing, we introduce a controlling factor as an embedding, ranging from zero to one, to the neural network, allowing us to control the level of noise reduction. This approach enables controllable speech enhancement and is adaptable to various application scenarios. Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement, as evidenced by improvements in both objective speech measures and automatic speech recognition performance.
    摘要 在这篇论文中，我们研究了一种基于深度学习的语音增强连续建模方法，重点关注去噪过程。我们使用状态变量来表示去噪过程：起始状态是带噪语音，结束状态是干净语音，状态变量中的噪声分量随状态索引的变化而减少，直至为 0。在训练中，一个类 UNet 的神经网络学习估计从连续去噪过程中采样得到的每个状态变量。在测试中，我们将一个取值介于 0 和 1 之间的控制因子作为嵌入提供给神经网络，以控制噪声消减的程度。这种方法实现了可控的语音增强，并能适应不同的应用场景。实验结果表明，在干净目标中保留少量噪声有利于语音增强，这体现在客观语音评测指标和自动语音识别性能的双重改进上。
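A toy sketch of the continuous-state idea follows: intermediate training targets mix clean speech with a shrinking noise component indexed by a state variable t, and the same scalar later acts as the test-time control factor. The linear interpolation and shapes are simplifying assumptions, not the paper's exact process.

```python
# Sketch of the continuous denoising-state idea: intermediate targets are
# mixtures of clean speech and a shrinking noise component, and a scalar
# control factor picks the residual noise level at test time. The linear
# interpolation below is an assumed simplification of the paper's process.
import torch

def make_state(clean, noise, t):
    """t in [0, 1]: t=0 -> fully noisy input, t=1 -> clean target."""
    return clean + (1.0 - t) * noise

clean = torch.randn(1, 16000)                      # 1 s of clean speech at 16 kHz
noise = 0.3 * torch.randn(1, 16000)
noisy = make_state(clean, noise, t=0.0)            # network input
target_partial = make_state(clean, noise, t=0.8)   # target that keeps a little noise

# During training the UNet-like model is asked to predict the state for a
# sampled t; at inference a control embedding for the chosen t sets how much
# noise reduction is applied.
```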

Sim-to-Real Deep Reinforcement Learning with Manipulators for Pick-and-place

  • paper_url: http://arxiv.org/abs/2309.09247
  • repo_url: None
  • paper_authors: Wenxing Liu, Hanlin Niu, Robert Skilton, Joaquin Carrasco
  • For: This paper proposes a self-supervised vision-based deep reinforcement learning (DRL) method to improve the performance of transferring a DRL model from simulation to the real world.
  • Methods: The proposed method uses a height-sensitive action policy to deal with crowded and stacked objects in challenging environments. The trained model is applied directly to a real suction task without any fine-tuning in the real world.
  • Results: The proposed method achieves a high suction success rate of 90% in a real experiment with novel objects, without any real-world fine-tuning. An experimental video is available at: https://youtu.be/jSTC-EGsoFA.
    Abstract When transferring a Deep Reinforcement Learning model from simulation to the real world, the performance could be unsatisfactory since the simulation cannot imitate the real world well in many circumstances. This results in a long period of fine-tuning in the real world. This paper proposes a self-supervised vision-based DRL method that allows robots to pick and place objects effectively and efficiently when directly transferring a training model from simulation to the real world. A height-sensitive action policy is specially designed for the proposed method to deal with crowded and stacked objects in challenging environments. The training model with the proposed approach can be applied directly to a real suction task without any fine-tuning from the real world while maintaining a high suction success rate. It is also validated that our model can be deployed to suction novel objects in a real experiment with a suction success rate of 90\% without any real-world fine-tuning. The experimental video is available at: https://youtu.be/jSTC-EGsoFA.
    摘要 当把深度强化学习（DRL）模型从仿真环境迁移到真实世界时，由于仿真在很多情况下无法很好地模拟真实世界，性能往往不尽如人意，这导致需要在真实世界中进行长时间的微调。本文提出了一种基于视觉的自监督 DRL 方法，使机器人在将训练模型直接从仿真迁移到真实世界时，仍能高效、有效地完成抓取与放置任务。该方法专门设计了一种对高度敏感的动作策略，以应对复杂环境中拥挤和堆叠的物体。采用该方法训练的模型无需任何真实世界微调即可直接应用于真实吸取任务，并保持较高的吸取成功率。实验还验证了该模型可在真实实验中对新物体进行吸取，成功率达 90%，且无需任何真实世界微调。实验视频见：https://youtu.be/jSTC-EGsoFA。

Detection and Localization of Firearm Carriers in Complex Scenes for Improved Safety Measures

  • paper_url: http://arxiv.org/abs/2309.09236
  • repo_url: https://github.com/intelligentMachines-ITU/LFC-Dataset
  • paper_authors: Arif Mahmood, Abdul Basit, M. Akhtar Munir, Mohsen Ali
  • For: 本研究旨在提升对携带武器人员的检测与精确定位，以增强安全与监控领域的效果。
  • Methods: 本研究提出了一种新方法，利用人员与武器之间的交互信息来改进携带武器人员的定位。该方法包含一个注意力机制，可将人员与武器从背景中准确区分出来，以及一种显著性驱动的局部保持约束，用于学习关键特征。
  • Results: 与基线方法相比，本方法在新构建的数据集上取得了显著更高的精度（AP=77.8%）。这表明注意力机制与显著性驱动的局部保持约束能够提升对携带武器人员的检测精度。
    Abstract Detecting firearms and accurately localizing individuals carrying them in images or videos is of paramount importance in security, surveillance, and content customization. However, this task presents significant challenges in complex environments due to clutter and the diverse shapes of firearms. To address this problem, we propose a novel approach that leverages human-firearm interaction information, which provides valuable clues for localizing firearm carriers. Our approach incorporates an attention mechanism that effectively distinguishes humans and firearms from the background by focusing on relevant areas. Additionally, we introduce a saliency-driven locality-preserving constraint to learn essential features while preserving foreground information in the input image. By combining these components, our approach achieves exceptional results on a newly proposed dataset. To handle inputs of varying sizes, we pass paired human-firearm instances with attention masks as channels through a deep network for feature computation, utilizing an adaptive average pooling layer. We extensively evaluate our approach against existing methods in human-object interaction detection and achieve significant results (AP=77.8\%) compared to the baseline approach (AP=63.1\%). This demonstrates the effectiveness of leveraging attention mechanisms and saliency-driven locality preservation for accurate human-firearm interaction detection. Our findings contribute to advancing the fields of security and surveillance, enabling more efficient firearm localization and identification in diverse scenarios.
    摘要 在图像或视频中检测火器并准确定位携带火器的人员，对安全、监控和内容定制至关重要。然而，由于背景杂乱和火器形状多样，这一任务在复杂环境中面临重大挑战。为解决这个问题，我们提出了一种新方法，利用人与火器的交互信息，为定位携带火器者提供有价值的线索。我们的方法包含一个注意力机制，通过聚焦相关区域，有效地将人和火器与背景区分开来；此外，我们引入了一种显著性驱动的局部保持约束，在保留输入图像前景信息的同时学习关键特征。通过结合这些组件，我们的方法在一个新提出的数据集上取得了出色的结果。为处理不同尺寸的输入，我们将带有注意力掩码（作为通道）的人-火器成对实例输入深度网络计算特征，并使用自适应平均池化层。我们与现有的人-物交互检测方法进行了广泛对比，相比基线方法（AP=63.1%），取得了显著更好的结果（AP=77.8%）。这表明利用注意力机制和显著性驱动的局部保持约束对精确的人-火器交互检测十分有效。我们的发现有助于推进安全与监控领域的发展，使得在多种场景下更高效地定位和识别携带火器的人员。

Improving Speech Inversion Through Self-Supervised Embeddings and Enhanced Tract Variables

  • paper_url: http://arxiv.org/abs/2309.09220
  • repo_url: None
  • paper_authors: Ahmed Adel Attia, Yashish M. Siriwardena, Carol Espy-Wilson
  • for: 这篇论文旨在提升声学-发音语音反演（SI）系统的性能。
  • methods: 该论文使用了自监督学习（SSL）模型 HuBERT 提取的语音表示，以及通过改进的几何变换模型引入的新声道变量（TV）。
  • results: 结合这两种方法后，SI 系统的皮尔逊积矩相关（PPMC）分数从 0.7452 提升到 0.8141，相对提高 6.9%。
    Abstract The performance of deep learning models depends significantly on their capacity to encode input features efficiently and decode them into meaningful outputs. Better input and output representation has the potential to boost models' performance and generalization. In the context of acoustic-to-articulatory speech inversion (SI) systems, we study the impact of utilizing speech representations acquired via self-supervised learning (SSL) models, such as HuBERT compared to conventional acoustic features. Additionally, we investigate the incorporation of novel tract variables (TVs) through an improved geometric transformation model. By combining these two approaches, we improve the Pearson product-moment correlation (PPMC) scores which evaluate the accuracy of TV estimation of the SI system from 0.7452 to 0.8141, a 6.9% increase. Our findings underscore the profound influence of rich feature representations from SSL models and improved geometric transformations with target TVs on the enhanced functionality of SI systems.
    摘要 深度学习模型的性能在很大程度上取决于其高效编码输入特征并将其解码为有意义输出的能力，更好的输入和输出表示有助于提高模型的性能和泛化。在声学-发音语音反演(SI)系统中，我们研究了使用自监督学习(SSL)模型（如 HuBERT）获得的语音表示相对于传统声学特征的影响。此外，我们还研究了通过改进的几何变换模型引入新的声道变量(TV)。将这两种方法结合起来，可以将 SI 系统中 TV 估计精度的皮尔逊积矩相关(PPMC)分数从 0.7452 提高到 0.8141，即 6.9% 的提升。我们的发现凸显了来自 SSL 模型的丰富特征表示以及面向目标 TV 的改进几何变换对增强 SI 系统功能的深远影响。

SplitEE: Early Exit in Deep Neural Networks with Split Computing

  • paper_url: http://arxiv.org/abs/2309.09195
  • repo_url: None
  • paper_authors: Divya J. Bajpai, Vivek K. Trivedi, Sohan L. Yadav, Manjesh K. Hanawal
  • for: 这个研究的目的是提高资源受限的设备(边缘、移动、IoT)中深度神经网络(DNNs)的运行性能。
  • methods: 这个研究使用了分 computed 和早期终结的方法来解决这个问题。具体来说,它们使用了将 computation 分给云端进行最终推断(split computing),并在推断过程中选择性地终结推断。
  • results: 这个研究获得了较高的成本优化(>50%),并且仅导致了小于2%的准确性下降。
    Abstract Deep Neural Networks (DNNs) have drawn attention because of their outstanding performance on various tasks. However, deploying full-fledged DNNs in resource-constrained devices (edge, mobile, IoT) is difficult due to their large size. To overcome the issue, various approaches are considered, like offloading part of the computation to the cloud for final inference (split computing) or performing the inference at an intermediary layer without passing through all layers (early exits). In this work, we propose combining both approaches by using early exits in split computing. In our approach, we decide up to what depth of DNNs computation to perform on the device (splitting layer) and whether a sample can exit from this layer or need to be offloaded. The decisions are based on a weighted combination of accuracy, computational, and communication costs. We develop an algorithm named SplitEE to learn an optimal policy. Since pre-trained DNNs are often deployed in new domains where the ground truths may be unavailable and samples arrive in a streaming fashion, SplitEE works in an online and unsupervised setup. We extensively perform experiments on five different datasets. SplitEE achieves a significant cost reduction ($>50\%$) with a slight drop in accuracy ($<2\%$) as compared to the case when all samples are inferred at the final layer. The anonymized source code is available at \url{https://anonymous.4open.science/r/SplitEE_M-B989/README.md}.
    摘要 深度神经网络(DNNs)吸引了关注,因其在多种任务上表现出色。然而,在资源有限的设备(边缘、移动、物联网)中部署完整的DNNs困难,因为它们的大小较大。为解决这个问题,一些方法被考虑,如将计算部分提取到云端进行最终推理(分 computation)或在设备上进行推理,而不是将所有层传递。在这种情况下,我们提出了结合这两种方法的方法,即使用早期退出在分 computation中。在我们的方法中,我们可以在设备上进行DNNs计算的深度层(分层),并决定一个样本是否可以在这层退出,或者需要被上传。这些决定是基于精度、计算和通信成本的权重平均值。我们开发了一个名为SplitEE的算法,用于学习优化策略。由于预训练的DNNs常常在新领域中部署,采用新的批处理方式和不可预测的样本流入,SplitEE在线上和无监督的设置下工作。我们对五个不同的数据集进行了广泛的实验。SplitEE可以在计算成本方面实现大于50%的减少,同时减少精度少于2%。详细的源代码可以在 \url{https://anonymous.4open.science/r/SplitEE_M-B989/README.md} 中找到。
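The core decision SplitEE makes at the splitting layer can be caricatured as follows: exit on-device when the intermediate prediction is confident enough, otherwise pay the communication cost and offload. The threshold and cost weights in this sketch are illustrative assumptions; SplitEE instead learns the policy online, without labels, from a weighted accuracy/computation/communication objective.

```python
# Sketch of the exit-or-offload decision made at the splitting layer. The
# confidence threshold and cost values are illustrative; the actual method
# learns an optimal policy online and without labels.
import torch
import torch.nn.functional as F

def exit_or_offload(exit_logits, threshold=0.8,
                    device_cost=1.0, offload_cost=3.0):
    conf, pred = F.softmax(exit_logits, dim=-1).max(dim=-1)
    if conf.item() >= threshold:
        return {"action": "exit", "prediction": pred.item(), "cost": device_cost}
    return {"action": "offload", "prediction": None,
            "cost": device_cost + offload_cost}   # communication + cloud inference

print(exit_or_offload(torch.tensor([[4.0, 0.5, 0.2]])))   # confident -> exit on device
print(exit_or_offload(torch.tensor([[0.6, 0.5, 0.4]])))   # uncertain -> offload
```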

From Cooking Recipes to Robot Task Trees – Improving Planning Correctness and Task Efficiency by Leveraging LLMs with a Knowledge Network

  • paper_url: http://arxiv.org/abs/2309.09181
  • repo_url: None
  • paper_authors: Md Sadman Sakib, Yu Sun
  • For: This paper addresses the task of robotic cooking, specifically generating a sequence of actions for a robot to prepare a meal successfully.
  • Methods: The paper introduces a novel task tree generation pipeline that uses a large language model (LLM) to retrieve recipe instructions and then a fine-tuned GPT-3 to convert them into a task tree capturing sequential and parallel dependencies among subtasks. The pipeline mitigates the uncertainty and unreliability of LLM outputs by combining multiple LLM task tree outputs into a graph and performing task tree retrieval.
  • Results: The paper shows superior performance compared to previous works in task planning accuracy and efficiency, with improved planning correctness and improved execution efficiency.
    Abstract Task planning for robotic cooking involves generating a sequence of actions for a robot to prepare a meal successfully. This paper introduces a novel task tree generation pipeline producing correct planning and efficient execution for cooking tasks. Our method first uses a large language model (LLM) to retrieve recipe instructions and then utilizes a fine-tuned GPT-3 to convert them into a task tree, capturing sequential and parallel dependencies among subtasks. The pipeline then mitigates the uncertainty and unreliable features of LLM outputs using task tree retrieval. We combine multiple LLM task tree outputs into a graph and perform a task tree retrieval to avoid questionable nodes and high-cost nodes to improve planning correctness and improve execution efficiency. Our evaluation results show its superior performance compared to previous works in task planning accuracy and efficiency.
    摘要 机器人烹饪的任务规划需要生成一系列动作，使机器人能够成功完成一餐的制作。本文提出了一条新的任务树生成流程，为烹饪任务产生正确的规划和高效的执行。我们的方法首先使用大语言模型（LLM）检索菜谱指令，然后使用微调后的 GPT-3 将其转换为任务树，以捕捉子任务之间的顺序与并行依赖关系。随后，该流程通过任务树检索来缓解 LLM 输出的不确定性和不可靠性：我们将多个 LLM 任务树输出合并为一个图，并进行任务树检索，以避开可疑节点和高成本节点，从而提高规划正确性和执行效率。评估结果显示，该方法在任务规划准确性和效率方面均优于先前的工作。

Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding with Sequence-to-Sequence Architecture

  • paper_url: http://arxiv.org/abs/2309.09180
  • repo_url: https://github.com/liyunlongaaa/NSD-MS2S
  • paper_authors: Gaobin Yang, Maokui He, Shutong Niu, Ruoyu Wang, Yanyan Yue, Shuangqing Qian, Shilong Wu, Jun Du, Chin-Hui Lee
  • for: 这个论文旨在提出一种基于记忆感知多话者嵌入和序列到序列架构的新型神经网络演说者识别系统(NSD-MS2S),以提高效率和性能。
  • methods: 这个系统使用了记忆感知多话者嵌入(MA-MSE)和序列到序列架构(Seq2Seq)的优点,并将它们结合在一起,从而提高了效率和性能。在解码过程中,还采用了输入特征融合和多头注意力机制,以提高特征的捕捉和融合。
  • results: 根据CHiME-7 EVAL集的宏观说话人分离错误率(DER)评估结果，NSD-MS2S实现了15.9%的DER，相对官方基线系统提升49%，是我们在CHiME-7 DASR挑战赛主赛道上取得最佳性能的关键技术。此外，我们还引入了深度交互模块(DIM)，以获取更干净、更具判别性的多话者嵌入，使当前模型超越了我们在CHiME-7 DASR挑战赛中使用的系统。
    Abstract We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S), which integrates the strengths of memory-aware multi-speaker embedding (MA-MSE) and sequence-to-sequence (Seq2Seq) architecture, leading to improvement in both efficiency and performance. Next, we further decrease the memory occupation of decoding by incorporating input features fusion and then employ a multi-head attention mechanism to capture features at different levels. NSD-MS2S achieved a macro diarization error rate (DER) of 15.9% on the CHiME-7 EVAL set, which signifies a relative improvement of 49% over the official baseline system, and is the key technique for us to achieve the best performance for the main track of CHiME-7 DASR Challenge. Additionally, we introduce a deep interactive module (DIM) in MA-MSE module to better retrieve a cleaner and more discriminative multi-speaker embedding, enabling the current model to outperform the system we used in the CHiME-7 DASR Challenge. Our code will be available at https://github.com/liyunlongaaa/NSD-MS2S.
    摘要 我们提出了一种新的神经网络说话人分离系统NSD-MS2S，它将记忆感知多话者嵌入(MA-MSE)与序列到序列(Seq2Seq)架构相结合，吸收二者的优点，从而在效率和性能上均有提升。随后，我们通过输入特征融合进一步降低解码过程中的内存占用，并采用多头注意力机制来捕捉不同层次的特征。NSD-MS2S在CHiME-7 EVAL集上取得了15.9%的宏观分离错误率(DER)，相对官方基线系统提升49%，是我们在CHiME-7 DASR挑战赛主赛道上取得最佳性能的关键技术。此外，我们在MA-MSE模块中引入了深度交互模块(DIM)，以提取更干净、更具判别性的多话者嵌入，使当前模型超越了我们在CHiME-7 DASR挑战赛中使用的系统。我们的代码将于https://github.com/liyunlongaaa/NSD-MS2S发布。

Syntax Tree Constrained Graph Network for Visual Question Answering

  • paper_url: http://arxiv.org/abs/2309.09179
  • repo_url: None
  • paper_authors: Xiangrui Su, Qi Zhang, Chongyang Shi, Jiachang Liu, Liang Hu
  • for: 本研究旨在提高Visual Question Answering(VQA)的精度,以便自动回答基于图像内容的自然语言问题。
  • methods: 本研究提议了一种新的Syntax Tree Constrained Graph Network(STCGN)模型,基于实体消息传递和语法树。该模型可以从问题中提取更加精确的语法树信息,并通过消息传递机制捕捉更加精确的实体特征。
  • results: 对VQA2.0数据集进行了广泛的实验,显示了我们提议的模型的超越性。
    Abstract Visual Question Answering (VQA) aims to automatically answer natural language questions related to given image content. Existing VQA methods integrate vision modeling and language understanding to explore the deep semantics of the question. However, these methods ignore the significant syntax information of the question, which plays a vital role in understanding the essential semantics of the question and guiding the visual feature refinement. To fill the gap, we suggested a novel Syntax Tree Constrained Graph Network (STCGN) for VQA based on entity message passing and syntax tree. This model is able to extract a syntax tree from questions and obtain more precise syntax information. Specifically, we parse questions and obtain the question syntax tree using the Stanford syntax parsing tool. From the word level and phrase level, syntactic phrase features and question features are extracted using a hierarchical tree convolutional network. We then design a message-passing mechanism for phrase-aware visual entities and capture entity features according to a given visual context. Extensive experiments on VQA2.0 datasets demonstrate the superiority of our proposed model.
    摘要 visual question answering (VQA) 目标是自动回答基于图像内容的自然语言问题。现有的 VQA 方法将视觉模型和语言理解结合以探索问题深层 semantics。然而,这些方法忽略了问题语法信息,这种信息在理解问题基本 semantics 和指导视觉特征细化方面发挥重要作用。为了填补这个空白,我们提议了一种基于实体消息传递和语法树的新方法,即Syntax Tree Constrained Graph Network (STCGN)。这个模型可以从问题中提取语法树,并且通过 hierarchy 的 convolutional network 提取Word 和 phrase 层次特征。然后,我们设计了一种message-passing机制,以便捕捉基于给定视觉上下文的视觉实体特征。我们在 VQA2.0 数据集上进行了广泛的实验,并证明了我们的提议模型的优越性。

Imbalanced Data Stream Classification using Dynamic Ensemble Selection

  • paper_url: http://arxiv.org/abs/2309.09175
  • repo_url: None
  • paper_authors: Priya. S, Haribharathi Sivakumar, Vijay Arvind. R
  • for: 这篇论文旨在解决现代流动数据分类中面临的概念迁移和类别不均等问题,以提高分类器的精度和正确性。
  • methods: 本文提出了一个新的框架,将数据预处理和动态集合选择组合使用,以适应非站点过渡不均等数据流。这个框架使用了七种数据预处理技术和两种动态集合选择方法。
  • results: 实验结果显示,将数据预处理与动态集合选择组合使用可以在不均等数据流中提高分类精度和正确性。
    Abstract Modern streaming data categorization faces significant challenges from concept drift and class imbalanced data. This negatively impacts the output of the classifier, leading to improper classification. Furthermore, other factors such as the overlapping of multiple classes limit the extent of the correctness of the output. This work proposes a novel framework for integrating data pre-processing and dynamic ensemble selection, by formulating the classification framework for the nonstationary drifting imbalanced data stream, which employs the data pre-processing and dynamic ensemble selection techniques. The proposed framework was evaluated using six artificially generated data streams with differing imbalance ratios in combination with two different types of concept drifts. Each stream is composed of 200 chunks of 500 objects described by eight features and contains five concept drifts. Seven pre-processing techniques and two dynamic ensemble selection methods were considered. According to experimental results, data pre-processing combined with Dynamic Ensemble Selection techniques significantly delivers more accuracy when dealing with imbalanced data streams.
    摘要 现代流式数据分类面临概念漂移和类别不平衡的重大挑战，这会对分类器的输出产生负面影响，导致错误分类。此外，多个类别之间的重叠等因素也限制了输出的正确程度。本文提出了一个将数据预处理与动态集成选择相结合的新框架，针对非平稳、漂移且不平衡的数据流构建分类框架，其中同时采用数据预处理技术和动态集成选择技术。该框架在六个人工生成、具有不同不平衡比例的数据流上进行了评估，这些数据流还组合了两种不同类型的概念漂移。每个数据流由 200 个数据块组成，每块包含 500 个由八个特征描述的对象，并包含五次概念漂移。实验共考察了七种预处理技术和两种动态集成选择方法。实验结果表明，在处理不平衡数据流时，数据预处理与动态集成选择技术相结合能够显著提升准确率。
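As a self-contained illustration of combining resampling with dynamic selection on one data chunk, the sketch below uses naive oversampling and a local-accuracy selection rule with scikit-learn classifiers. It is an assumed simplification: the paper evaluates seven preprocessing techniques and two dedicated dynamic ensemble selection methods, not this toy rule.

```python
# Sketch for one chunk of such a framework: resample the imbalanced chunk,
# train a pool of classifiers, then classify each test point with the pool
# member that is most accurate in its local neighbourhood (a simple
# dynamic-selection rule, used here only for illustration).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (rng.random(500) < 0.1).astype(int)            # ~10% minority class

# Naive random oversampling of the minority class as the preprocessing step.
minority = np.flatnonzero(y == 1)
idx = np.concatenate([np.arange(len(y)),
                      rng.choice(minority, len(y) - 2 * len(minority))])
X_res, y_res = X[idx], y[idx]

pool = [DecisionTreeClassifier(max_depth=d, random_state=d).fit(X_res, y_res)
        for d in (2, 4, 6, 8)]
nn = NearestNeighbors(n_neighbors=15).fit(X_res)

def des_predict(x):
    _, neigh = nn.kneighbors(x.reshape(1, -1))
    Xn, yn = X_res[neigh[0]], y_res[neigh[0]]
    local_acc = [clf.score(Xn, yn) for clf in pool]   # competence per model
    best = pool[int(np.argmax(local_acc))]
    return best.predict(x.reshape(1, -1))[0]

print(des_predict(X[0]))
```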

Can Large Language Models Understand Real-World Complex Instructions?

  • paper_url: http://arxiv.org/abs/2309.09150
  • repo_url: https://github.com/abbey4799/cello
  • paper_authors: Qianyu He, Jie Zeng, Wenhao Huang, Lina Chen, Jin Xiao, Qianxi He, Xunzhe Zhou, Lida Chen, Xintao Wang, Yuncheng Huang, Haoning Ye, Zihan Li, Shisong Chen, Yikai Zhang, Zhouhong Gu, Jiaqing Liang, Yanghua Xiao
  • for: 评估大型自然语言模型(LLM)能否系统地遵循复杂的指令,并可以应用于实际场景。
  • methods: 提出了八种复杂指令特征,并从实际场景中构建了评估数据集。还开发了四个评估标准和相应的指标,以替代现有的不充分、偏向或过于粗糙的评估方法。
  • results: 通过广泛的实验,发现代表中文和英文领域的模型在遵循复杂指令时表现不佳,具体的问题包括忽略语义约束、生成错误格式、违反长度或样本数约束、不准确反映输入文本等。
    Abstract Large language models (LLMs) can understand human instructions, showing their potential for pragmatic applications beyond traditional NLP tasks. However, they still struggle with complex instructions, which can be either complex task descriptions that require multiple tasks and constraints, or complex input that contains long context, noise, heterogeneous information and multi-turn format. Due to these features, LLMs often ignore semantic constraints from task descriptions, generate incorrect formats, violate length or sample count constraints, and be unfaithful to the input text. Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions, as they are close-ended and simple. To bridge this gap, we propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically. We design eight features for complex instructions and construct a comprehensive evaluation dataset from real-world scenarios. We also establish four criteria and develop corresponding metrics, as current ones are inadequate, biased or too strict and coarse-grained. We compare the performance of representative Chinese-oriented and English-oriented models in following complex instructions through extensive experiments. Resources of CELLO are publicly available at https://github.com/Abbey4799/CELLO.
    摘要 大语言模型（LLMs）能够理解人类指令，显示出其在传统 NLP 任务之外的实用潜力。然而，它们在复杂指令上仍然表现欠佳：复杂指令既可能是包含多个任务和约束的复杂任务描述，也可能是包含长上下文、噪声、异构信息和多轮格式的复杂输入。由于这些特征，LLM 往往忽略任务描述中的语义约束、生成错误的格式、违反长度或样本数量约束，并且不忠实于输入文本。现有的基准多为封闭式且过于简单，不足以评估 LLM 理解复杂指令的能力。为弥补这一差距，我们提出了 CELLO，一个系统性评估 LLM 遵循复杂指令能力的基准。我们为复杂指令设计了八种特征，并从真实场景中构建了全面的评估数据集。鉴于现有标准不充分、有偏或过于严格和粗粒度，我们还建立了四项评估标准并开发了相应的指标。我们通过大量实验比较了代表性的中文和英文模型在遵循复杂指令方面的表现。CELLO 的资源公开于 https://github.com/Abbey4799/CELLO。

Enhancing Quantised End-to-End ASR Models via Personalisation

  • paper_url: http://arxiv.org/abs/2309.09136
  • repo_url: https://github.com/qmgzhao/Enhancing-Quantised-End-to-End-ASR-Models-via-Personalisation
  • paper_authors: Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng
  • for: 提高资源受限设备上自动语音识别(ASR)模型的性能。
  • methods: 提出了一种新的个性化策略(PQM),结合 speaker adaptive training(SAT)和模型归一化来改进压缩模型的性能。PQM使用了4位 NormalFloat Quantisation(NF4)方法进行模型归一化,并使用low-rank adaptation(LoRA)进行SAT。
  • results: 对于 LibriSpeech 和 TED-LIUM 3 数据集,PQM 可以在压缩模型中实现15.1%和23.3%的相对WRR(word error rate)下降,相比原始精度模型。此外,PQM 只需要加入1%的 speaker-specific 参数,可以实现7倍的模型大小减少。
    Abstract Recent end-to-end automatic speech recognition (ASR) models have become increasingly larger, making them particularly challenging to be deployed on resource-constrained devices. Model quantisation is an effective solution that sometimes causes the word error rate (WER) to increase. In this paper, a novel strategy of personalisation for a quantised model (PQM) is proposed, which combines speaker adaptive training (SAT) with model quantisation to improve the performance of heavily compressed models. Specifically, PQM uses a 4-bit NormalFloat Quantisation (NF4) approach for model quantisation and low-rank adaptation (LoRA) for SAT. Experiments have been performed on the LibriSpeech and the TED-LIUM 3 corpora. Remarkably, with a 7x reduction in model size and 1% additional speaker-specific parameters, 15.1% and 23.3% relative WER reductions were achieved on quantised Whisper and Conformer-based attention-based encoder-decoder ASR models respectively, comparing to the original full precision models.
    摘要 现代自动声音识别(ASR)模型已经变得越来越大,这使得它们在有限资源设备上部署变得更加困难。模型量化是一种有效的解决方案,但是有时会导致单词错误率(WER)增加。本文提出了一种个性化quantized模型(PQM)策略,该策略结合说话人适应训练(SAT)和模型量化来提高压缩模型的性能。具体来说,PQM使用4位NormalFloat量化(NF4)方法进行模型量化,并使用低级适应(LoRA)来实现SAT。在LibriSpeech和TED-LIUM 3 corpora上进行了实验,结果显示:与原始精度模型相比,使用PQM可以实现7倍压缩和1%额外的说话人特定参数,而WER的相对下降为15.1%和23.3%。
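The speaker-adaptation side of PQM uses LoRA on top of a frozen, NF4-quantised backbone. The sketch below shows a generic LoRA linear layer in PyTorch so the "1% extra speaker-specific parameters" point is concrete; the rank, scaling and choice of wrapped layers are assumptions, and the NF4 quantisation itself (normally handled by a quantisation library) is not shown.

```python
# Minimal LoRA sketch: the frozen (in PQM, NF4-quantised) weight is kept
# fixed and only a low-rank, speaker-specific update A @ B is trained, which
# is how a small fraction of extra parameters per speaker can adapt the model.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # frozen / quantised weights
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 100, 512))                    # (batch, frames, dim)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)                              # only A and B are trainable
```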

ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing

  • paper_url: http://arxiv.org/abs/2309.09128
  • repo_url: https://github.com/ianarawjo/ChainForge
  • paper_authors: Ian Arawjo, Chelse Swoopes, Priyan Vaithilingam, Martin Wattenberg, Elena Glassman
  • for: 这篇论文是为了评估大型自然语言模型(LLM)的输出而写的。
  • methods: 这篇论文使用了一个开源的视觉工具箱,用于比较不同模型和提示的响应。
  • results: 研究发现,通过使用这个工具箱,不同的人可以Investigate各种假设,包括实际世界中的应用。
    Abstract Evaluating outputs of large language models (LLMs) is challenging, requiring making -- and making sense of -- many responses. Yet tools that go beyond basic prompting tend to require knowledge of programming APIs, focus on narrow domains, or are closed-source. We present ChainForge, an open-source visual toolkit for prompt engineering and on-demand hypothesis testing of text generation LLMs. ChainForge provides a graphical interface for comparison of responses across models and prompt variations. Our system was designed to support three tasks: model selection, prompt template design, and hypothesis testing (e.g., auditing). We released ChainForge early in its development and iterated on its design with academics and online users. Through in-lab and interview studies, we find that a range of people could use ChainForge to investigate hypotheses that matter to them, including in real-world settings. We identify three modes of prompt engineering and LLM hypothesis testing: opportunistic exploration, limited evaluation, and iterative refinement.
    摘要 评估大语言模型（LLM）的输出具有挑战性，需要生成并理解大量响应。然而，超越基础提示的工具通常需要编程 API 知识、专注于狭窄领域，或是闭源的。我们提出了 ChainForge，一个开源的可视化工具包，用于文本生成 LLM 的提示工程和按需假设检验。ChainForge 提供了图形界面，用于比较不同模型和提示变体的响应。我们的系统旨在支持三类任务：模型选择、提示模板设计和假设检验（例如审计）。我们在开发早期就发布了 ChainForge，并与学术界和在线用户一起对其设计进行迭代改进。通过实验室研究和访谈研究，我们发现各种各样的人都能使用 ChainForge 来调查他们关心的假设，包括在真实场景中。我们总结出提示工程与 LLM 假设检验的三种模式：机会性探索、有限评估和迭代精炼。

How much can ChatGPT really help Computational Biologists in Programming?

  • paper_url: http://arxiv.org/abs/2309.09126
  • repo_url: None
  • paper_authors: Chowdhury Rafeed Rahman, Limsoon Wong
  • for: 本研究探讨了 chatGPT 在生物计算领域的潜在影响,包括代码生成、数据分析、机器学习模型和特征提取等方面。
  • methods: 本研究使用了 chatGPT 来执行不同的生物计算任务,如代码写作、代码审查、错误检测、代码转换、代码重构和管道化等。
  • results: 研究发现 chatGPT 在生物计算领域有多种可能的影响,包括代码生成、数据分析和机器学习模型的建立等方面,同时也存在一些潜在的负面影响,如代码质量和数据隐私等问题。
    Abstract ChatGPT, a recently developed product by openAI, is successfully leaving its mark as a multi-purpose natural language based chatbot. In this paper, we are more interested in analyzing its potential in the field of computational biology. A major share of work done by computational biologists these days involve coding up Bioinformatics algorithms, analyzing data, creating pipelining scripts and even machine learning modeling & feature extraction. This paper focuses on the potential influence (both positive and negative) of ChatGPT in the mentioned aspects with illustrative examples from different perspectives. Compared to other fields of Computer Science, Computational Biology has: (1) less coding resources, (2) more sensitivity and bias issues (deals with medical data) and (3) more necessity of coding assistance (people from diverse background come to this field). Keeping such issues in mind, we cover use cases such as code writing, reviewing, debugging, converting, refactoring and pipelining using ChatGPT from the perspective of computational biologists in this paper.
    摘要 chatGPT,由openAI开发的一款多功能自然语言基础的聊天机器人,在这篇论文中,我们更关心其在生物计算领域的潜在影响。计算生物学家们的工作中大量包括编程、数据分析、创建管道脚本以及机器学习模型和特征提取。本论文将对chatGPT在这些方面的影响进行分析,并通过不同角度提供示例。与其他计算机科学领域不同,计算生物学领域有以下特点:(1) coding资源更少,(2)更敏感和偏见问题(与医疗数据相关),(3)更需要编程帮助(人们来自不同背景)。基于这些问题,本文将从计算生物学家的视角来探讨使用chatGPT进行代码写作、审查、调试、转换、重构和管道化的可能性。

Using Reinforcement Learning to Simplify Mealtime Insulin Dosing for People with Type 1 Diabetes: In-Silico Experiments

  • paper_url: http://arxiv.org/abs/2309.09125
  • repo_url: None
  • paper_authors: Anas El Fathi, Marc D. Breton
  • for: 这个论文的目的是为人类类型1糖尿病患者提供一种简单、可靠的药物剂量计算方法,以改善他们的糖尿病控制和生活质量。
  • methods: 这个研究使用了 actor-critic方法和长期记忆(LSTM)神经网络,通过训练80个虚拟Subject(VS)来实现。
  • results: 研究结果表明，所提出的RL方法在26周的场景中表现出色，可以取代标准的碳水化合物计数（CC）方法，并有望改善糖尿病患者的生活质量和血糖控制水平。具体而言，在26周后，血糖在目标范围内（$70-180$mg/dL）的时间和低血糖（$<70$mg/dL）的时间分别为73.1±11.6%和2.0±1.8%，而标准CC方法分别为70.6±14.8%和1.5±1.5%。
    Abstract People with type 1 diabetes (T1D) struggle to calculate the optimal insulin dose at mealtime, especially when under multiple daily injections (MDI) therapy. Effectively, they will not always perform rigorous and precise calculations, but occasionally, they might rely on intuition and previous experience. Reinforcement learning (RL) has shown outstanding results in outperforming humans on tasks requiring intuition and learning from experience. In this work, we propose an RL agent that recommends the optimal meal-accompanying insulin dose corresponding to a qualitative meal (QM) strategy that does not require precise carbohydrate counting (CC) (e.g., a usual meal at noon.). The agent is trained using the soft actor-critic approach and comprises long short-term memory (LSTM) neurons. For training, eighty virtual subjects (VS) of the FDA-accepted UVA/Padova T1D adult population were simulated using MDI therapy and QM strategy. For validation, the remaining twenty VS were examined in 26-week scenarios, including intra- and inter-day variabilities in glucose. \textit{In-silico} results showed that the proposed RL approach outperforms a baseline run-to-run approach and can replace the standard CC approach. Specifically, after 26 weeks, the time-in-range ($70-180$mg/dL) and time-in-hypoglycemia ($<70$mg/dL) were $73.1\pm11.6$% and $ 2.0\pm 1.8$% using the RL-optimized QM strategy compared to $70.6\pm14.8$% and $ 1.5\pm 1.5$% using CC. Such an approach can simplify diabetes treatment, resulting in improved quality of life and glycemic outcomes.
    摘要 1型糖尿病（T1D）患者在进餐时难以计算最佳胰岛素剂量，尤其是在每日多次注射（MDI）治疗下。实际上，他们并不总是进行严格而精确的计算，有时会依靠直觉和以往经验。强化学习（RL）在需要直觉和经验学习的任务上已展现出超越人类的出色表现。在这项工作中，我们提出了一种RL代理，根据无需精确碳水化合物计数（CC）的定性进餐（QM）策略（例如"中午的一顿常规餐"），推荐与进餐相匹配的最佳胰岛素剂量。该代理采用soft actor-critic方法训练，并包含长短期记忆（LSTM）神经元。训练时，我们对FDA认可的UVA/Padova T1D成人群体中的八十名虚拟受试者（VS）进行了MDI治疗与QM策略的仿真；验证时，其余二十名VS在包含日内及日间血糖变化的26周场景中接受了检验。仿真结果表明，所提出的RL方法优于基线的run-to-run方法，并可取代标准的CC方法。具体而言，26周后，采用RL优化的QM策略时，血糖在目标范围内（$70-180$mg/dL）的时间和低血糖（$<70$mg/dL）的时间分别为73.1±11.6%和2.0±1.8%，而CC方法为70.6±14.8%和1.5±1.5%。这种方法可以简化糖尿病治疗，从而改善生活质量和血糖控制结果。

Conditional Mutual Information Constrained Deep Learning for Classification

  • paper_url: http://arxiv.org/abs/2309.09123
  • repo_url: None
  • paper_authors: En-Hui Yang, Shayan Mohajer Hamidi, Linfeng Ye, Renhao Tan, Beverly Yang
  • for: 本文使用 conditional mutual information (CMI) 和 normalized conditional mutual information (NCMI) 来衡量深度神经网络 (DNN) 输出概率分布中的集中度和分化性能。
  • methods: 本文使用 NCMI 来评估 popular DNNs 在 ImageNet 文献中预训练的性能,并发现这些模型的验证精度与 NCMI 值成对关系。基于这一观察, authors 修改了标准深度学习 (DL) 框架,使其具有 CMI 约束,称为 CMI constrained deep learning (CMIC-DL)。
  • results: 对 CMIC-DL 的束定优化问题,提出了一种新的交叉学习算法。实验结果表明,使用 CMIC-DL 训练的 DNN 模型在精度和对抗攻击性能方面都高于文献中其他模型和损失函数。此外,通过 visualizing 学习过程中 CMI 和 NCMI 的变化,也能够更好地理解 DNN 的学习过程。
    Abstract The concepts of conditional mutual information (CMI) and normalized conditional mutual information (NCMI) are introduced to measure the concentration and separation performance of a classification deep neural network (DNN) in the output probability distribution space of the DNN, where CMI and the ratio between CMI and NCMI represent the intra-class concentration and inter-class separation of the DNN, respectively. By using NCMI to evaluate popular DNNs pretrained over ImageNet in the literature, it is shown that their validation accuracies over ImageNet validation data set are more or less inversely proportional to their NCMI values. Based on this observation, the standard deep learning (DL) framework is further modified to minimize the standard cross entropy function subject to an NCMI constraint, yielding CMI constrained deep learning (CMIC-DL). A novel alternating learning algorithm is proposed to solve such a constrained optimization problem. Extensive experiment results show that DNNs trained within CMIC-DL outperform the state-of-the-art models trained within the standard DL and other loss functions in the literature in terms of both accuracy and robustness against adversarial attacks. In addition, visualizing the evolution of learning process through the lens of CMI and NCMI is also advocated.
    摘要 本文引入条件互信息（CMI）和归一化条件互信息（NCMI）的概念，用于衡量分类深度神经网络（DNN）在其输出概率分布空间中的集中与分离性能，其中 CMI 以及 CMI 与 NCMI 之比分别表示 DNN 的类内集中度和类间分离度。利用 NCMI 评估文献中在 ImageNet 上预训练的主流 DNN 后发现，这些模型在 ImageNet 验证集上的准确率与其 NCMI 值大致成反比。基于这一观察，我们对标准深度学习（DL）框架进行修改，在 NCMI 约束下最小化标准交叉熵函数，得到受 CMI 约束的深度学习框架（CMIC-DL），并提出了一种新的交替学习算法来求解该约束优化问题。大量实验表明，在 CMIC-DL 中训练的 DNN 在准确率和对抗攻击鲁棒性方面均优于以标准 DL 框架和文献中其他损失函数训练的最先进模型。此外，我们还提倡通过 CMI 和 NCMI 的视角来可视化学习过程的演化。

FDCNet: Feature Drift Compensation Network for Class-Incremental Weakly Supervised Object Localization

  • paper_url: http://arxiv.org/abs/2309.09122
  • repo_url: https://github.com/Vision-sejin/FDCNet
  • paper_authors: Sejin Park, Taehyung Lee, Yeejin Lee, Byeongkeun Kang
  • for: 这个研究旨在解决类增量弱监督物体Localization(CI-WSOL)任务,即逐渐学习新类的物体Localization,只使用图像级别的标签,而保持先前学习的类的Localization能力。
  • methods: 我们采用了类增量类ifiers的策略来抑制忘记现象,包括知识填充、保留先前任务中的小数据集、以及使用cosine normalization。此外,我们还提出了特征漂移补偿网络,以补偿因Feature Drift而导致的类分数和Localization图像的影响。
  • results: 我们通过在ImageNet-100和CUB-200两个公开available datasets上进行实验,发现我们的提议方法比基eline方法高效。
    Abstract This work addresses the task of class-incremental weakly supervised object localization (CI-WSOL). The goal is to incrementally learn object localization for novel classes using only image-level annotations while retaining the ability to localize previously learned classes. This task is important because annotating bounding boxes for every new incoming data is expensive, although object localization is crucial in various applications. To the best of our knowledge, we are the first to address this task. Thus, we first present a strong baseline method for CI-WSOL by adapting the strategies of class-incremental classifiers to mitigate catastrophic forgetting. These strategies include applying knowledge distillation, maintaining a small data set from previous tasks, and using cosine normalization. We then propose the feature drift compensation network to compensate for the effects of feature drifts on class scores and localization maps. Since updating network parameters to learn new tasks causes feature drifts, compensating for the final outputs is necessary. Finally, we evaluate our proposed method by conducting experiments on two publicly available datasets (ImageNet-100 and CUB-200). The experimental results demonstrate that the proposed method outperforms other baseline methods.
    摘要 本研究针对类增量弱监督目标定位（CI-WSOL）任务：仅使用图像级标注，增量地学习新类别的目标定位，同时保留对先前已学类别的定位能力。这一任务十分重要，因为目标定位在各类应用中不可或缺，而为每批新数据标注边界框的成本很高。据我们所知，我们是第一个研究该任务的工作。为此，我们首先借鉴类增量分类器中缓解灾难性遗忘的策略，提出了一个强基线方法，这些策略包括知识蒸馏、保留来自先前任务的小规模数据集，以及使用余弦归一化。随后，我们提出特征漂移补偿网络，用于补偿因更新网络参数学习新任务而产生的特征漂移对类别分数和定位图的影响。最后，我们在 ImageNet-100 和 CUB-200 两个公开数据集上进行实验，结果表明所提出的方法优于其他基线方法。
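One of the anti-forgetting ingredients above is knowledge distillation from the previous-task model. The sketch below shows the standard temperature-scaled distillation loss over the old classes' logits as an assumed, generic formulation; FDCNet's exact loss may differ.

```python
# Standard temperature-scaled distillation loss over the previously learned
# classes, as one of the anti-forgetting ingredients; a generic sketch, not
# necessarily the exact loss used by FDCNet.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, num_old_classes, T=2.0):
    s = student_logits[:, :num_old_classes] / T
    t = teacher_logits[:, :num_old_classes] / T
    return F.kl_div(F.log_softmax(s, dim=-1), F.softmax(t, dim=-1),
                    reduction="batchmean") * (T * T)

student = torch.randn(8, 120, requires_grad=True)   # 100 old + 20 new classes
teacher = torch.randn(8, 100)                        # frozen model from the previous task
loss = distillation_loss(student, teacher, num_old_classes=100)
loss.backward()
print(loss.item())
```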

Public Perceptions of Gender Bias in Large Language Models: Cases of ChatGPT and Ernie

  • paper_url: http://arxiv.org/abs/2309.09120
  • repo_url: None
  • paper_authors: Kyrie Zhixuan Zhou, Madelyn Rose Sanfilippo
  • for: This paper aims to investigate and analyze public perceptions of gender bias in large language models (LLMs) trained in different cultural contexts.
  • methods: The authors conducted a content analysis of social media discussions to gather data on people’s observations of gender bias in their personal use of LLMs and scientific findings about gender bias in LLMs.
  • results: The study found that ChatGPT, a US-based LLM, exhibited more implicit gender bias, while Ernie, a China-based LLM, showed more explicit gender bias. The findings suggest that culture plays a significant role in shaping gender bias in LLMs and propose governance recommendations to regulate gender bias in these models.
    Abstract Large language models are quickly gaining momentum, yet are found to demonstrate gender bias in their responses. In this paper, we conducted a content analysis of social media discussions to gauge public perceptions of gender bias in LLMs which are trained in different cultural contexts, i.e., ChatGPT, a US-based LLM, or Ernie, a China-based LLM. People shared both observations of gender bias in their personal use and scientific findings about gender bias in LLMs. A difference between the two LLMs was seen -- ChatGPT was more often found to carry implicit gender bias, e.g., associating men and women with different profession titles, while explicit gender bias was found in Ernie's responses, e.g., overly promoting women's pursuit of marriage over career. Based on the findings, we reflect on the impact of culture on gender bias and propose governance recommendations to regulate gender bias in LLMs.
    摘要 大型语言模型快速增长,但它们在回应中显示出性别偏见。在这篇研究中,我们通过社交媒体讨论分析公众对于语言模型中的性别偏见。人们分享了对于性别偏见的personal使用经验和科学发现,包括两个不同的语言模型:ChatGPT和Ernie。我们发现ChatGPT更常见偏见,例如将男女分配不同的职业标题,而Ernie的回应则显示了明显的性别偏见,例如过度推荐女性追求婚姻而不是职业。根据结果,我们思考了文化对性别偏见的影响和建议管理性别偏见在语言模型中。

Uncertainty-aware 3D Object-Level Mapping with Deep Shape Priors

  • paper_url: http://arxiv.org/abs/2309.09118
  • repo_url: https://github.com/TRAILab/UncertainShapePose
  • paper_authors: Ziwei Liao, Jun Yang, Jingxing Qian, Angela P. Schoellig, Steven L. Waslander
  • for: 这篇论文旨在提出一种高质量的物体级建图方法，用于在建图过程中对未知物体进行高精度重建。
  • methods: 该方法以多张RGB-D图像为输入，输出检测到物体的稠密3D形状和9自由度姿态（包括3个尺度参数）。其核心思想是利用学习到的形状类别生成模型作为先验，并构建概率化、具备不确定性感知的优化框架进行3D重建。
  • results: 实验表明该方法相较现有方法取得了明显提升；此外，所得到的形状与姿态不确定性还可用于主动视觉等下游机器人任务。
    Abstract 3D object-level mapping is a fundamental problem in robotics, which is especially challenging when object CAD models are unavailable during inference. In this work, we propose a framework that can reconstruct high-quality object-level maps for unknown objects. Our approach takes multiple RGB-D images as input and outputs dense 3D shapes and 9-DoF poses (including 3 scale parameters) for detected objects. The core idea of our approach is to leverage a learnt generative model for shape categories as a prior and to formulate a probabilistic, uncertainty-aware optimization framework for 3D reconstruction. We derive a probabilistic formulation that propagates shape and pose uncertainty through two novel loss functions. Unlike current state-of-the-art approaches, we explicitly model the uncertainty of the object shapes and poses during our optimization, resulting in a high-quality object-level mapping system. Moreover, the resulting shape and pose uncertainties, which we demonstrate can accurately reflect the true errors of our object maps, can also be useful for downstream robotics tasks such as active vision. We perform extensive evaluations on indoor and outdoor real-world datasets, achieving achieves substantial improvements over state-of-the-art methods. Our code will be available at https://github.com/TRAILab/UncertainShapePose.
    摘要 三维对象级映射是Robotics中的基本问题,特别是当对象CAD模型不可用于推理时更加棘手。在这项工作中,我们提出了一个框架,可以重建高质量的对象级映射。我们的方法接受多个RGB-D图像作为输入,并输出密集的3D形状和9个DoF姿态(包括3个涉及度参数)。我们的核心思想是利用学习的生成模型来作为类别的先验,并使用概率、不确定性意识推导的优化框架来重建3D映射。我们 derive了一个概率形式,将形状和姿态不确定性传递给两个新的损失函数。不同于当前状态的方法,我们显式地模型对象的形状和姿态不确定性,从而实现了高质量的对象级映射系统。此外,我们所获得的形状和姿态不确定性,可以准确地反映我们对象映射的真实错误,同时也可以用于下游机器人任务,如活动视觉。我们在室内和室外实际数据集上进行了广泛的评估,实现了明显的提高。我们的代码将在https://github.com/TRAILab/UncertainShapePose上提供。

Contrastive Decoding Improves Reasoning in Large Language Models

  • paper_url: http://arxiv.org/abs/2309.09117
  • repo_url: None
  • paper_authors: Sean O’Brien, Mike Lewis
  • for: 该研究旨在证明 Contrastive Decoding 可以大幅提高语言模型生成文本的质量,特别是在逻辑 reasoning 任务上。
  • methods: 该研究使用 Contrastive Decoding 方法,这是一种简单、计算上轻便,无需训练的文本生成方法。
  • results: 研究发现,Contrastive Decoding 可以在多种逻辑 reasoning 任务上大幅超越 greedy decoding,并且在 HellaSwag 常识逻辑测试 benchmark 和 GSM8K 数学单词逻辑测试 benchmark 上表现出色。
    Abstract We demonstrate that Contrastive Decoding -- a simple, computationally light, and training-free text generation method proposed by Li et al 2022 -- achieves large out-of-the-box improvements over greedy decoding on a variety of reasoning tasks. Originally shown to improve the perceived quality of long-form text generation, Contrastive Decoding searches for strings that maximize a weighted difference in likelihood between strong and weak models. We show that Contrastive Decoding leads LLaMA-65B to outperform LLaMA 2, GPT-3.5 and PaLM 2-L on the HellaSwag commonsense reasoning benchmark, and to outperform LLaMA 2, GPT-3.5 and PaLM-540B on the GSM8K math word reasoning benchmark, in addition to improvements on a collection of other tasks. Analysis suggests that Contrastive Decoding improves over existing methods by preventing some abstract reasoning errors, as well as by avoiding simpler modes such as copying sections of the input during chain-of-thought. Overall, Contrastive Decoding outperforms nucleus sampling for long-form generation and greedy decoding for reasoning tasks, making it a powerful general purpose method for generating text from language models.
    摘要 我们证明了对比解码（Contrastive Decoding）——一种由 Li 等人在 2022 年提出的简单、计算开销小且无需训练的文本生成方法——在多种推理任务上相对贪心解码取得了开箱即用的大幅提升。对比解码最初用于提升长文本生成的感知质量，其做法是搜索使强模型与弱模型之间似然加权差最大化的字符串。我们表明，对比解码使 LLaMA-65B 在 HellaSwag 常识推理基准上超越 LLaMA 2、GPT-3.5 和 PaLM 2-L，在 GSM8K 数学应用题推理基准上超越 LLaMA 2、GPT-3.5 和 PaLM-540B，并在一系列其他任务上也有提升。分析表明，对比解码通过避免部分抽象推理错误，以及避免诸如在思维链中复制输入片段等较简单的模式来改进现有方法。总体而言，对比解码在长文本生成上优于核采样（nucleus sampling），在推理任务上优于贪心解码，是一种从语言模型生成文本的强大通用方法。
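The expert-minus-amateur scoring at the heart of Contrastive Decoding can be sketched for a single decoding step as below. The plausibility cutoff alpha follows the original method's idea of restricting to tokens the expert finds likely, while the per-step greedy choice here is a simplification of the full search.

```python
# Sketch of contrastive decoding for one step: keep tokens the expert finds
# plausible, then pick the token maximising expert-minus-amateur log-prob.
import torch

def contrastive_step(expert_logits, amateur_logits, alpha=0.1):
    expert_logp = torch.log_softmax(expert_logits, dim=-1)
    amateur_logp = torch.log_softmax(amateur_logits, dim=-1)
    # Plausibility mask: only tokens within log(alpha) of the expert's best token.
    cutoff = torch.log(torch.tensor(alpha)) + expert_logp.max()
    score = expert_logp - amateur_logp
    score[expert_logp < cutoff] = float("-inf")
    return int(score.argmax())

vocab = 8
next_token = contrastive_step(torch.randn(vocab), torch.randn(vocab))
print(next_token)
```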

cs.CL - 2023-09-17

Augmenting text for spoken language understanding with Large Language Models

  • paper_url: http://arxiv.org/abs/2309.09390
  • repo_url: None
  • paper_authors: Roshan Sharma, Suyoun Kim, Daniel Lazar, Trang Le, Akshat Shrivastava, Kwanghoon Ahn, Piyush Kansal, Leda Sari, Ozlem Kalinli, Michael Seltzer
  • for: 该研究的目的是解决现有应用领域中的Robust模型训练问题,需要对应的 triplets 数据,但获取这些数据可以是非常复杂和昂贵的。
  • methods: 该研究使用了不同的方法来使用无对应的文本数据进行模型训练,包括Joint Audio Text (JAT)和Text-to-Speech (TTS)等方法。
  • results: 实验结果表明,使用无配对的文本数据可以提高模型的性能;对于现有领域和新领域,使用LLM生成的文本可以将EM分别绝对提高1.4%和2.6%。
    Abstract Spoken semantic parsing (SSP) involves generating machine-comprehensible parses from input speech. Training robust models for existing application domains represented in training data or extending to new domains requires corresponding triplets of speech-transcript-semantic parse data, which is expensive to obtain. In this paper, we address this challenge by examining methods that can use transcript-semantic parse data (unpaired text) without corresponding speech. First, when unpaired text is drawn from existing textual corpora, Joint Audio Text (JAT) and Text-to-Speech (TTS) are compared as ways to generate speech representations for unpaired text. Experiments on the STOP dataset show that unpaired text from existing and new domains improves performance by 2% and 30% in absolute Exact Match (EM) respectively. Second, we consider the setting when unpaired text is not available in existing textual corpora. We propose to prompt Large Language Models (LLMs) to generate unpaired text for existing and new domains. Experiments show that examples and words that co-occur with intents can be used to generate unpaired text with Llama 2.0. Using the generated text with JAT and TTS for spoken semantic parsing improves EM on STOP by 1.4% and 2.6% absolute for existing and new domains respectively.
    摘要 口语语义解析(SSP)旨在从语音输入生成机器可理解的语义解析结果。为训练数据中已有的应用领域训练稳健模型,或将模型扩展到新领域,需要对应的语音-转写-语义解析三元组数据,而获取这类数据代价高昂。在本文中,我们通过研究能够利用转写-语义解析数据(无配对语音的文本)的方法来应对这一挑战。首先,当无配对文本来自现有文本语料库时,我们比较了 Joint Audio Text (JAT) 和 Text-to-Speech (TTS) 两种为无配对文本生成语音表示的方式。在 STOP 数据集上的实验表明,来自现有领域和新领域的无配对文本可以分别将绝对精确匹配率(EM)提高2%和30%。其次,我们考虑无配对文本在现有文本语料库中不可得的情形,提出用大语言模型(LLM)为现有领域和新领域生成无配对文本。实验表明,可以利用与意图共现的示例和词语,使用 Llama 2.0 生成无配对文本。将生成的文本与 JAT 和 TTS 结合用于口语语义解析,可以在 STOP 上将现有领域和新领域的 EM 分别绝对提高1.4%和2.6%。

Mitigating Shortcuts in Language Models with Soft Label Encoding

  • paper_url: http://arxiv.org/abs/2309.09380
  • repo_url: None
  • paper_authors: Zirui He, Huiqi Deng, Haiyan Zhao, Ninghao Liu, Mengnan Du
  • for: 本研究旨在解决大语言模型在自然语言理解(NLU)任务上依赖虚假相关(捷径)的问题。
  • methods: 我们提出了一种简单而有效的去偏框架,名为软标签编码(SoftLE)。我们首先用硬标签训练一个教师模型,用以确定每个样本对捷径的依赖程度;然后添加一个虚拟类来编码该依赖程度,并据此平滑真实标签的其他维度以生成软标签。这个新的真实标签用于训练更加稳健的学生模型。
  • results: 我们在两个 NLU benchmark 任务上进行了广泛的实验,结果表明,SoftLE 可以显著提高分布外泛化性能,同时保持令人满意的分布内准确率。
    Abstract Recent research has shown that large language models rely on spurious correlations in the data for natural language understanding (NLU) tasks. In this work, we aim to answer the following research question: Can we reduce spurious correlations by modifying the ground truth labels of the training data? Specifically, we propose a simple yet effective debiasing framework, named Soft Label Encoding (SoftLE). We first train a teacher model with hard labels to determine each sample's degree of relying on shortcuts. We then add one dummy class to encode the shortcut degree, which is used to smooth other dimensions in the ground truth label to generate soft labels. This new ground truth label is used to train a more robust student model. Extensive experiments on two NLU benchmark tasks demonstrate that SoftLE significantly improves out-of-distribution generalization while maintaining satisfactory in-distribution accuracy.
    摘要 近期研究发现,大语言模型在自然语言理解任务上依赖于数据中的虚假相关(捷径)。在这项工作中,我们试图回答以下研究问题:能否通过修改训练数据的真实标签来减少虚假相关?我们提出了一种简单而有效的去偏框架,称为软标签编码(SoftLE)。我们首先用硬标签训练一个教师模型,用以确定每个样本对捷径的依赖程度;然后添加一个虚拟类来编码该捷径程度,并据此平滑真实标签的其他维度,从而生成软标签。这一新的真实标签用于训练更为稳健的学生模型。在两个NLU基准任务上的大量实验表明,SoftLE能显著提升分布外泛化能力,同时保持令人满意的分布内准确率。
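
A minimal sketch of the soft-label construction the abstract outlines, under explicit assumptions: the teacher's confidence on correctly-predicted samples is used as a proxy for the shortcut degree, an extra dummy class absorbs that mass, and the remainder stays on the gold label. The max_smoothing knob and the confidence heuristic are illustrative, not the paper's exact recipe.

```python
# Hypothetical Soft Label Encoding sketch: dummy class absorbs the estimated shortcut degree.
import numpy as np

def soft_label(gold: int, teacher_probs: np.ndarray, max_smoothing: float = 0.3) -> np.ndarray:
    """Return a soft label over (num_classes + 1) dimensions; the last dim is the dummy class."""
    num_classes = teacher_probs.shape[0]
    # Heuristic shortcut degree: samples the teacher gets right with very high confidence
    # are treated as more shortcut-reliant (an assumption, not the paper's exact estimator).
    shortcut_degree = teacher_probs[gold] if teacher_probs.argmax() == gold else 0.0
    smooth = max_smoothing * shortcut_degree
    label = np.zeros(num_classes + 1)
    label[-1] = smooth            # dummy class encodes the shortcut degree
    label[gold] = 1.0 - smooth    # remaining probability mass stays on the gold class
    return label

# Toy usage: a 3-class NLI-style example the teacher predicts correctly with 95% confidence;
# the result is roughly [0, 0.715, 0, 0.285].
print(soft_label(gold=1, teacher_probs=np.array([0.03, 0.95, 0.02])))
```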

Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles

  • paper_url: http://arxiv.org/abs/2309.09369
  • repo_url: None
  • paper_authors: Kung-Hsiang Huang, Philippe Laban, Alexander R. Fabbri, Prafulla Kumar Choubey, Shafiq Joty, Caiming Xiong, Chien-Sheng Wu
  • for: 本研究旨在提出一种多文摘要新闻Summarization任务,即对多篇新闻文章中的不同信息进行摘要。
  • methods: 我们提出了一种数据收集方案,并构建了一个名为DiverseSumm的数据集,包含245个新闻事件,每个事件由10篇新闻文章组成并配有经人工校验的参考摘要。此外,我们还进行了全面分析,探讨使用大语言模型(LLM)指标评估摘要覆盖度和忠实度时的位置偏差与冗长偏差,及其与人工评估的相关性。
  • results: 我们的分析发现,尽管LLM在单文档摘要上能力出众,多文档多样化信息摘要对它们仍是复杂挑战,主要原因是覆盖度有限:GPT-4平均只能覆盖不到40%的多样化信息。
    Abstract Previous research in multi-document news summarization has typically concentrated on collating information that all sources agree upon. However, to our knowledge, the summarization of diverse information dispersed across multiple articles about an event has not been previously investigated. The latter imposes a different set of challenges for a summarization model. In this paper, we propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event. To facilitate this task, we outlined a data collection schema for identifying diverse information and curated a dataset named DiverseSumm. The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference. Moreover, we conducted a comprehensive analysis to pinpoint the position and verbosity biases when utilizing Large Language Model (LLM)-based metrics for evaluating the coverage and faithfulness of the summaries, as well as their correlation with human assessments. We applied our findings to study how LLMs summarize multiple news articles by analyzing which type of diverse information LLMs are capable of identifying. Our analyses suggest that despite the extraordinary capabilities of LLMs in single-document summarization, the proposed task remains a complex challenge for them mainly due to their limited coverage, with GPT-4 only able to cover less than 40% of the diverse information on average.
    摘要 以往的多文档新闻摘要研究通常侧重于整合所有信息来源一致认可的内容。然而,据我们所知,针对同一事件、分散在多篇文章中的多样化信息的摘要此前尚未被研究,而后者给摘要模型带来了不同的挑战。在本文中,我们提出了一项新任务:对围绕同一事件的多篇新闻文章中出现的多样化信息进行摘要。为支持该任务,我们设计了识别多样化信息的数据收集方案,并构建了名为 DiverseSumm 的数据集。该数据集包含245个新闻事件,每个事件由10篇新闻文章组成,并配有经人工校验的参考摘要。此外,我们进行了全面分析,以确定使用基于大语言模型(LLM)的指标评估摘要覆盖度与忠实度时存在的位置偏差和冗长偏差,及其与人工评估的相关性。我们将这些发现用于研究LLM如何对多篇新闻文章进行摘要,分析LLM能够识别哪些类型的多样化信息。分析表明,尽管LLM在单文档摘要上能力非凡,该任务对它们而言仍是复杂挑战,主要源于其覆盖度有限,GPT-4平均只能覆盖不到40%的多样化信息。

Language models are susceptible to incorrect patient self-diagnosis in medical applications

  • paper_url: http://arxiv.org/abs/2309.09362
  • repo_url: None
  • paper_authors: Rojin Ziaei, Samuel Schmidgall
  • for: 这个研究旨在检验大型自然语言模型(LLMs)在医疗领域中的可用性,以及它们在实际患者-医生交流中的表现。
  • methods: 研究向多种LLM提供来自美国医学执照考试的多选题,并将题目修改为包含患者的自我诊断描述,以模拟真实的患者自诊情形,并以此评估LLM的诊断准确率。
  • results: 研究发现,当患者提供了错误的偏见验证信息时,LLMs的诊断准确率会受到极大的影响,表明LLMs在自诊情况下存在高度易错的问题。
    Abstract Large language models (LLMs) are becoming increasingly relevant as a potential tool for healthcare, aiding communication between clinicians, researchers, and patients. However, traditional evaluations of LLMs on medical exam questions do not reflect the complexity of real patient-doctor interactions. An example of this complexity is the introduction of patient self-diagnosis, where a patient attempts to diagnose their own medical conditions from various sources. While the patient sometimes arrives at an accurate conclusion, they more often are led toward misdiagnosis due to the patient's over-emphasis on bias validating information. In this work we present a variety of LLMs with multiple-choice questions from United States medical board exams which are modified to include self-diagnostic reports from patients. Our findings highlight that when a patient proposes incorrect bias-validating information, the diagnostic accuracy of LLMs drop dramatically, revealing a high susceptibility to errors in self-diagnosis.
    摘要 大语言模型(LLM)正日益成为医疗领域的潜在工具,协助临床医生、研究人员与患者之间的沟通。然而,在医学考试题上对LLM进行的传统评估并不能反映真实医患交互的复杂性。这种复杂性的一个例子是患者自我诊断:患者尝试根据各种来源自行诊断自己的病情。虽然患者有时能得出准确结论,但更多情况下会因过度强调能验证自身偏见的信息而被引向误诊。在本工作中,我们向多种LLM提供来自美国医学执照考试的多选题,并将其修改为包含患者的自我诊断描述。我们的发现表明,当患者提出错误的、验证偏见的信息时,LLM的诊断准确率会急剧下降,显示其在自我诊断场景下极易出错。

Performance of the Pre-Trained Large Language Model GPT-4 on Automated Short Answer Grading

  • paper_url: http://arxiv.org/abs/2309.09338
  • repo_url: None
  • paper_authors: Gerd Kortemeyer
  • for: 这 paper 是 investigate GPT-4 LLM 在 Automated Short Answer Grading (ASAG) 领域的性能。
  • methods: 这 paper 使用了 GPT-4 LLM 和手动设计的模型,对 SciEntsBank 和 Beetle 数据集进行测试。
  • results: 结果表明,GPT-4 LLM 的性能相对于手动设计的模型类似,但与特殊训练的 LLM 相比较差。
    Abstract Automated Short Answer Grading (ASAG) has been an active area of machine-learning research for over a decade. It promises to let educators grade and give feedback on free-form responses in large-enrollment courses in spite of limited availability of human graders. Over the years, carefully trained models have achieved increasingly higher levels of performance. More recently, pre-trained Large Language Models (LLMs) emerged as a commodity, and an intriguing question is how a general-purpose tool without additional training compares to specialized models. We studied the performance of GPT-4 on the standard benchmark 2-way and 3-way datasets SciEntsBank and Beetle, where in addition to the standard task of grading the alignment of the student answer with a reference answer, we also investigated withholding the reference answer. We found that overall, the performance of the pre-trained general-purpose GPT-4 LLM is comparable to hand-engineered models, but worse than pre-trained LLMs that had specialized training.
    摘要 自动简答评分(ASAG)作为机器学习研究的活跃领域已超过十年。它有望让教育工作者在人工评分员十分有限的情况下,对大规模课程中的开放式作答进行评分并给出反馈。多年来,经过精心训练的模型取得了越来越高的性能。最近,预训练大语言模型(LLM)成为一种随手可得的工具,一个有趣的问题是:未经额外训练的通用工具与专门的模型相比表现如何。我们研究了GPT-4在标准的2类和3类基准数据集SciEntsBank和Beetle上的性能;除了评估学生答案与参考答案是否一致这一标准任务外,我们还研究了不提供参考答案的情形。我们发现,总体而言,预训练的通用GPT-4 LLM性能与人工设计的模型相当,但不及经过专门训练的预训练LLM。
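
For readers who want to reproduce the flavour of this evaluation, here is an illustrative zero-shot grading call using the openai Python client (v1+). The prompt wording, the binary label space, and the option to withhold the reference answer follow the abstract's description only loosely; they are assumptions, not the paper's prompts.

```python
# Illustrative zero-shot short-answer grading with a hosted GPT-4 model.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def grade(question, student_answer, reference=None):
    parts = [f"Question: {question}", f"Student answer: {student_answer}"]
    if reference is not None:
        parts.append(f"Reference answer: {reference}")
    parts.append("Reply with exactly one word: 'correct' or 'incorrect'.")
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "\n".join(parts)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

print(grade("Why does a lamp connected by thin wires glow dimmer?",
            "Because thin wires have higher resistance, so less current flows.",
            reference="Thin wires add resistance, reducing the current through the bulb."))
```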

A Few-Shot Approach to Dysarthric Speech Intelligibility Level Classification Using Transformers

  • paper_url: http://arxiv.org/abs/2309.09329
  • repo_url: None
  • paper_authors: Paleti Nikhil Chowdary, Vadlapudi Sai Aravind, Gorantla V N S L Vishnu Vardhan, Menta Sai Akshay, Menta Sai Aashish, Jyothish Lal. G
  • for: 检测构音障碍(dysarthria)语音,以便制定治疗计划,提高患者的沟通能力和生活质量。
  • methods: 使用transformer模型,采用少样本(few-shot)学习方法,在有限数据上进行分类,并解决以往研究中存在的数据泄露问题。
  • results: 使用whisper-large-v2 transformer模型,在UASpeech数据集中中等可懂度患者上达到了85%的准确率、0.92的精度、0.8的召回率、0.85的F1分数和0.91的特异性。模型在’words’数据集上的表现优于’letters’和’digits’数据集。多类模型达到67%的准确率。
    Abstract Dysarthria is a speech disorder that hinders communication due to difficulties in articulating words. Detection of dysarthria is important for several reasons as it can be used to develop a treatment plan and help improve a person's quality of life and ability to communicate effectively. Much of the literature focused on improving ASR systems for dysarthric speech. The objective of the current work is to develop models that can accurately classify the presence of dysarthria and also give information about the intelligibility level using limited data by employing a few-shot approach using a transformer model. This work also aims to tackle the data leakage that is present in previous studies. Our whisper-large-v2 transformer model trained on a subset of the UASpeech dataset containing medium intelligibility level patients achieved an accuracy of 85%, precision of 0.92, recall of 0.8 F1-score of 0.85, and specificity of 0.91. Experimental results also demonstrate that the model trained using the 'words' dataset performed better compared to the model trained on the 'letters' and 'digits' dataset. Moreover, the multiclass model achieved an accuracy of 67%.
    摘要 构音障碍(dysarthria)是一种言语障碍,由于发音困难而妨碍交流。检测构音障碍十分重要,因为这有助于制定治疗计划,改善患者的生活质量和有效沟通的能力。现有文献大多致力于改进针对构音障碍语音的ASR系统。本研究的目标是利用transformer模型,采用少样本(few-shot)方法,在有限数据下构建既能准确判断是否存在构音障碍、又能给出言语可懂度等级信息的模型,同时解决以往研究中存在的数据泄露问题。我们的whisper-large-v2 transformer模型在UASpeech数据集中中等可懂度患者的子集上训练,达到了85%的准确率、0.92的精度、0.8的召回率、0.85的F1分数和0.91的特异性。实验结果还表明,使用“words”数据集训练的模型优于使用“letters”和“digits”数据集训练的模型。此外,多类模型达到了67%的准确率。
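
A hedged sketch of one way such a few-shot classifier could be assembled from a Whisper encoder: pool encoder states into an utterance embedding and assign the intelligibility level of the nearest class prototype built from a handful of labelled examples. The paper fine-tunes whisper-large-v2; the smaller checkpoint and the prototype classifier below are simplifying assumptions.

```python
# Hypothetical few-shot intelligibility classification from Whisper encoder embeddings.
import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

@torch.no_grad()
def embed(waveform: np.ndarray, sr: int = 16000) -> torch.Tensor:
    feats = fe(waveform, sampling_rate=sr, return_tensors="pt")["input_features"]
    hidden = model.get_encoder()(feats).last_hidden_state   # (1, frames, d_model)
    return hidden.mean(dim=1).squeeze(0)                    # mean-pool over time

def classify(query: np.ndarray, support: dict) -> str:
    """support maps an intelligibility level (e.g. 'low'/'mid'/'high') to a few waveforms."""
    protos = {lab: torch.stack([embed(w) for w in ws]).mean(0) for lab, ws in support.items()}
    q = embed(query)
    sims = {lab: torch.cosine_similarity(q, p, dim=0).item() for lab, p in protos.items()}
    return max(sims, key=sims.get)

# Toy usage with random audio standing in for real UASpeech recordings.
fake = lambda: np.random.randn(16000).astype(np.float32)
print(classify(fake(), {"low": [fake(), fake()], "mid": [fake(), fake()], "high": [fake(), fake()]}))
```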

How People Perceive The Dynamic Zero-COVID Policy: A Retrospective Analysis From The Perspective of Appraisal Theory

  • paper_url: http://arxiv.org/abs/2309.09324
  • repo_url: None
  • paper_authors: Na Yang, Kyrie Zhixuan Zhou, Yunzhe Li
  • for: 该论文旨在研究中国动态清零政策实施三年间公众的情感反应与观点变化。
  • methods: 这 paper 使用了 sentiment analysis 方法来分析了 2,358 篇微博文章中的情感,并从 appraisal theory 的视角进行了深入的诠释。
  • results: 研究发现了四个代表点:政策初始化、情感态度快速变化、情感分数最低、政策终止。这些结果可能有助于未来卫生防疫控制措施的开发。
    Abstract The Dynamic Zero-COVID Policy in China spanned three years and diverse emotional responses have been observed at different times. In this paper, we retrospectively analyzed public sentiments and perceptions of the policy, especially regarding how they evolved over time, and how they related to people's lived experiences. Through sentiment analysis of 2,358 collected Weibo posts, we identified four representative points, i.e., policy initialization, sharp sentiment change, lowest sentiment score, and policy termination, for an in-depth discourse analysis through the lens of appraisal theory. In the end, we reflected on the evolving public sentiments toward the Dynamic Zero-COVID Policy and proposed implications for effective epidemic prevention and control measures for future crises.
    摘要 中国的动态清零政策持续了三年,不同时期公众表现出多样化的情感反应。在本文中,我们回顾性地分析了公众对该政策的情感与看法,特别是它们如何随时间演变,以及与人们的亲身经历有何关联。通过对收集到的2,358篇微博帖子进行情感分析,我们确定了四个代表性节点:政策启动、情感急剧变化、情感得分最低点和政策终止,并借助评价理论(appraisal theory)对其进行了深入的话语分析。最后,我们回顾了公众对动态清零政策不断演变的情感,并为未来危机中有效的疫情防控措施提出了启示。

A novel approach to measuring patent claim scope based on probabilities obtained from (large) language models

  • paper_url: http://arxiv.org/abs/2309.10003
  • repo_url: None
  • paper_authors: Sébastien Ragot
  • for: 该方法用于度量专利权利要求(claim)的范围。它以信息论为基础,假设罕见的概念比常见概念携带更多信息,因为它更令人惊讶。
  • methods: 该方法根据权利要求出现的概率计算其自信息,概率由语言模型给出,再以自信息的倒数作为权利要求范围的度量。共考虑了五种语言模型,从最简单的模型(每个词或字符从均匀分布中抽取),到中间模型(使用平均词频或字符频率),再到大语言模型GPT2。
  • results: 结果表明,语言模型越复杂,效果越好。GPT2模型在多项专门设计的测试中表现最佳,超过了基于词频和字符频率的模型,而后者又优于基于词数和字符数的模型。
    Abstract This work proposes to measure the scope of a patent claim as the reciprocal of the self-information contained in this claim. Grounded in information theory, this approach is based on the assumption that a rare concept is more informative than a usual concept, inasmuch as it is more surprising. The self-information is calculated from the probability of occurrence of that claim, where the probability is calculated in accordance with a language model. Five language models are considered, ranging from the simplest models (each word or character is drawn from a uniform distribution) to intermediate models (using average word or character frequencies), to a large language model (GPT2). Interestingly, the simplest language models reduce the scope measure to the reciprocal of the word or character count, a metric already used in previous works. Application is made to nine series of patent claims directed to distinct inventions, where the claims in each series have a gradually decreasing scope. The performance of the language models is then assessed with respect to several ad hoc tests. The more sophisticated the model, the better the results. The GPT2 model outperforms models based on word and character frequencies, which are themselves ahead of models based on word and character counts.
    摘要 本文提出以专利权利要求所含自信息的倒数来度量该权利要求的范围。该方法以信息论为基础,假设罕见的概念比常见概念更具信息量,因为它更令人惊讶。自信息由该权利要求出现的概率计算得到,而概率则依据语言模型计算。我们考虑了五种语言模型,从最简单的模型(每个词或字符从均匀分布中抽取),到中间模型(使用平均词频或字符频率),再到大语言模型(GPT2)。有趣的是,最简单的语言模型使该范围度量退化为词数或字符数的倒数,而这正是先前工作中已经使用过的指标。我们将该方法应用于九组分别针对不同发明的专利权利要求,每组中权利要求的范围逐渐缩小,随后通过若干专门设计的测试评估各语言模型的表现:模型越复杂,结果越好。GPT2模型优于基于词频和字符频率的模型,而后者又优于基于词数和字符数的模型。
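
The scope measure itself is easy to prototype. Below is a hedged sketch that takes GPT-2's total negative log-likelihood of a claim as its self-information (in nats) and returns the reciprocal as the scope; broader, more generic claims should score higher. Tokenization details and the example claims are illustrative.

```python
# Sketch: claim scope as the reciprocal of GPT-2 self-information.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def claim_scope(claim: str) -> float:
    ids = tok(claim, return_tensors="pt").input_ids
    # The model's loss is the mean negative log-probability per predicted token;
    # multiplying by the number of predicted tokens gives total self-information in nats.
    loss = lm(ids, labels=ids).loss
    self_information = loss.item() * (ids.shape[1] - 1)
    return 1.0 / self_information

broad = "A device comprising a processor."
narrow = ("A device comprising a processor configured to compress genomic sequence data "
          "using a context-adaptive arithmetic coder trained on mitochondrial DNA.")
print(claim_scope(broad), claim_scope(narrow))   # the broader claim should score higher
```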

AutoAM: An End-To-End Neural Model for Automatic and Universal Argument Mining

  • paper_url: http://arxiv.org/abs/2309.09300
  • repo_url: None
  • paper_authors: Lang Cao
  • for: 本文是为了提出一种基于神经网络的自动论证挖掘模型(AutoAM),以解决现有的论证挖掘技术尚未成熟和不具备准确描述论证关系的问题。
  • methods: 本文提出了一种新的论证组件注意力机制,可以更好地捕捉论证中相关的信息,从而提高论证挖掘的性能。此外,本文还提出了一种通用的端到端框架,无需树结构等约束即可完成论证挖掘的三个子任务。
  • results: 实验结果表明, compared with现有的工作,我们的模型在两个公共数据集上的多个维度上表现出色,达到了更高的性能。
    Abstract Argument mining is to analyze argument structure and extract important argument information from unstructured text. An argument mining system can help people automatically gain causal and logical information behind the text. As argumentative corpus gradually increases, like more people begin to argue and debate on social media, argument mining from them is becoming increasingly critical. However, argument mining is still a big challenge in natural language tasks due to its difficulty, and relative techniques are not mature. For example, research on non-tree argument mining needs to be done more. Most works just focus on extracting tree structure argument information. Moreover, current methods cannot accurately describe and capture argument relations and do not predict their types. In this paper, we propose a novel neural model called AutoAM to solve these problems. We first introduce the argument component attention mechanism in our model. It can capture the relevant information between argument components, so our model can better perform argument mining. Our model is a universal end-to-end framework, which can analyze argument structure without constraints like tree structure and complete three subtasks of argument mining in one model. The experiment results show that our model outperforms the existing works on several metrics in two public datasets.
    摘要 论证挖掘旨在分析论证结构并从非结构化文本中提取重要的论证信息。论证挖掘系统可以帮助人们自动获取文本背后的因果与逻辑信息。随着论证语料的不断增加,例如越来越多的人在社交媒体上进行争论和辩论,对其进行论证挖掘变得愈发重要。然而,由于任务本身的难度,论证挖掘在自然语言任务中仍是一大挑战,相关技术尚不成熟。例如,针对非树结构的论证挖掘研究仍有待深入,大多数工作只关注提取树结构的论证信息。此外,现有方法无法准确刻画和捕捉论证关系,也无法预测其类型。在本文中,我们提出了一种名为AutoAM的新型神经模型来解决这些问题。我们首先在模型中引入论证组件注意力机制,它能够捕捉论证组件之间的相关信息,使模型能更好地完成论证挖掘。我们的模型是一个通用的端到端框架,可以在不受树结构等约束的情况下分析论证结构,并在一个模型中完成论证挖掘的三个子任务。实验结果表明,我们的模型在两个公共数据集的多项指标上均优于现有工作。

OWL: A Large Language Model for IT Operations

  • paper_url: http://arxiv.org/abs/2309.09298
  • repo_url: https://github.com/Aryia-Behroziuan/Other-sources
  • paper_authors: Hongcheng Guo, Jian Yang, Jiaheng Liu, Liqun Yang, Linzheng Chai, Jiaqi Bai, Junran Peng, Xiaorong Hu, Chao Chen, Dongfeng Zhang, Xu Shi, Tieqiao Zheng, Liangfan Zheng, Bo Zhang, Ke Xu, Zhoujun Li
  • for: 这篇论文旨在探讨特有的大自然语言处理技术(NLP)在信息技术(IT)操作中的应用。
  • methods: 本论文使用了一个名为OWL的大语言模型,该模型在我们收集的OWL-Instruct数据集上进行了训练,该数据集包含了各种IT相关信息。在训练过程中,提出了混合适应器策略来提高参数效率的调整 across different domains or tasks。
  • results: 根据我们在OWL-Bench和开放IT相关的benchmark上进行的评估,OWL模型在IT任务上表现出色,与现有模型相比,具有显著的性能优势。此外,我们希望这些发现能够为IT操作技术的发展提供更多的思路和灵感。
    Abstract With the rapid development of IT operations, it has become increasingly crucial to efficiently manage and analyze large volumes of data for practical applications. The techniques of Natural Language Processing (NLP) have shown remarkable capabilities for various tasks, including named entity recognition, machine translation and dialogue systems. Recently, Large Language Models (LLMs) have achieved significant improvements across various NLP downstream tasks. However, there is a lack of specialized LLMs for IT operations. In this paper, we introduce the OWL, a large language model trained on our collected OWL-Instruct dataset with a wide range of IT-related information, where the mixture-of-adapter strategy is proposed to improve the parameter-efficient tuning across different domains or tasks. Furthermore, we evaluate the performance of our OWL on the OWL-Bench established by us and open IT-related benchmarks. OWL demonstrates superior performance results on IT tasks, which outperforms existing models by significant margins. Moreover, we hope that the findings of our work will provide more insights to revolutionize the techniques of IT operations with specialized LLMs.
    摘要 随着信息技术运营的快速发展,管理和分析大量数据的效率已成为非常重要的。自然语言处理(NLP)技术在各种任务上表现出了惊人的能力,包括命名实体识别、机器翻译和对话系统。最近,大型自然语言模型(LLMs)在各种 NLP 下渠道任务上实现了显著的改进。然而,对 IT 运营的特殊化 LLMs 缺乏。本文介绍了 OWL,一个基于我们收集的 OWL-Instruct 数据集的大型自然语言模型,其中混合 adapter 策略可以在不同的领域或任务中进行参数高效调整。此外,我们评估了 OWL 在 OWL-Bench 和开放的 IT 相关benchmark上的性能,并发现 OWL 在 IT 任务上表现出了显著的优异性。此外,我们希望通过这些研究成果,为 IT 运营技术的发展提供更多的新思路和灵感。
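
The abstract mentions a mixture-of-adapter strategy for parameter-efficient tuning across domains. The module below is a hypothetical reading of that idea: several bottleneck adapters share one hidden state and a learned gate mixes their outputs; the gating scheme, bottleneck width, and residual placement are assumptions for illustration only.

```python
# Hypothetical mixture-of-adapters layer: gated combination of small bottleneck adapters.
import torch
import torch.nn as nn

class MixtureOfAdapters(nn.Module):
    def __init__(self, hidden: int = 768, bottleneck: int = 64, num_adapters: int = 4):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, bottleneck), nn.GELU(), nn.Linear(bottleneck, hidden))
            for _ in range(num_adapters)
        )
        self.gate = nn.Linear(hidden, num_adapters)

    def forward(self, h: torch.Tensor) -> torch.Tensor:            # h: (batch, seq, hidden)
        weights = torch.softmax(self.gate(h), dim=-1)               # (batch, seq, K)
        outs = torch.stack([a(h) for a in self.adapters], dim=-1)   # (batch, seq, hidden, K)
        mixed = (outs * weights.unsqueeze(-2)).sum(-1)               # weighted sum over adapters
        return h + mixed                                             # residual, as in adapter tuning

print(MixtureOfAdapters()(torch.randn(2, 5, 768)).shape)   # torch.Size([2, 5, 768])
```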

Model-based Subsampling for Knowledge Graph Completion

  • paper_url: http://arxiv.org/abs/2309.09296
  • repo_url: https://github.com/xincanfeng/ms_kge
  • paper_authors: Xincan Feng, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe
  • for: 提高 Knowledge Graph Embedding (KGE) 模型的适应性和性能
  • methods: 提出 Model-based Subsampling (MBS) 和 Mixed Subsampling (MIX) 方法,通过 KGE 模型的预测来估计不够频繁的查询的出现概率
  • results: 对 FB15k-237、WN18RR 和 YAGO3-10 等 dataset 进行评估,显示我们的提案的抽样方法可以提高受欢迎 KGE 模型的 KG 完成性能
    Abstract Subsampling is effective in Knowledge Graph Embedding (KGE) for reducing overfitting caused by the sparsity in Knowledge Graph (KG) datasets. However, current subsampling approaches consider only frequencies of queries that consist of entities and their relations. Thus, the existing subsampling potentially underestimates the appearance probabilities of infrequent queries even if the frequencies of their entities or relations are high. To address this problem, we propose Model-based Subsampling (MBS) and Mixed Subsampling (MIX) to estimate their appearance probabilities through predictions of KGE models. Evaluation results on datasets FB15k-237, WN18RR, and YAGO3-10 showed that our proposed subsampling methods actually improved the KG completion performances for popular KGE models, RotatE, TransE, HAKE, ComplEx, and DistMult.
    摘要 在知识图嵌入(KGE)中,抽样(subsampling)可以有效缓解知识图(KG)数据集稀疏性导致的过拟合问题。但现有的抽样方法只考虑由实体及其关系构成的查询的频率,因此即使某个查询的实体或关系频率很高,现有抽样仍可能低估该不常见查询的出现概率。为解决这个问题,我们提出了基于模型的抽样(MBS)和混合抽样(MIX)方法,通过KGE模型的预测来估计查询的出现概率。在FB15k-237、WN18RR和YAGO3-10数据集上的评估结果表明,我们提出的抽样方法确实提高了RotatE、TransE、HAKE、ComplEx和DistMult等常用KGE模型的KG补全性能。
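
A small sketch of the subsampling idea under stated assumptions: a trained KGE model's scores are normalized into estimated query probabilities, which then replace raw counts in a word2vec-style discounting of frequent queries. The softmax normalization and the 1/sqrt(p) weighting are illustrative stand-ins for the paper's exact MBS/MIX formulas.

```python
# Illustrative model-based subsampling weights derived from KGE model scores.
import numpy as np

def estimated_query_prob(scores: np.ndarray) -> np.ndarray:
    """scores: model scores for each training query; softmax turns them into probabilities."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def subsampling_weights(scores: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Frequency-style weights: rarer (low estimated probability) queries get larger weights."""
    p = estimated_query_prob(scores / temperature)
    w = 1.0 / np.sqrt(p)          # word2vec-inspired discounting of frequent queries
    return w / w.sum()            # normalize so the weights can be used as sampling probabilities

# Toy usage: four training queries with KGE scores; the lowest-scored one is sampled most often.
scores = np.array([4.0, 2.5, 2.5, 0.5])
print(subsampling_weights(scores).round(3))
```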

Leveraging Social Discourse to Measure Check-worthiness of Claims for Fact-checking

  • paper_url: http://arxiv.org/abs/2309.09274
  • repo_url: None
  • paper_authors: Megha Sundriyal, Md Shad Akhtar, Tanmoy Chakraborty
  • for: 本研究旨在提出一种细粒度的声明核查价值(check-worthiness)任务,以改进现有的事实核查系统。
  • methods: 该研究构建了一个大规模人工标注的Twitter数据集CheckIt,并提出了一种统一方法CheckMate,联合判断某条声明是否值得核查以及得出该结论的因素。
  • results: 研究表明,综合考虑多种核查价值因素可以提高检测值得核查声明的准确率和效率,人工评估也验证了这一结论。
    Abstract The expansion of online social media platforms has led to a surge in online content consumption. However, this has also paved the way for disseminating false claims and misinformation. As a result, there is an escalating demand for a substantial workforce to sift through and validate such unverified claims. Currently, these claims are manually verified by fact-checkers. Still, the volume of online content often outweighs their potency, making it difficult for them to validate every single claim in a timely manner. Thus, it is critical to determine which assertions are worth fact-checking and prioritize claims that require immediate attention. Multiple factors contribute to determining whether a claim necessitates fact-checking, encompassing factors such as its factual correctness, potential impact on the public, the probability of inciting hatred, and more. Despite several efforts to address claim check-worthiness, a systematic approach to identify these factors remains an open challenge. To this end, we introduce a new task of fine-grained claim check-worthiness, which underpins all of these factors and provides probable human grounds for identifying a claim as check-worthy. We present CheckIt, a manually annotated large Twitter dataset for fine-grained claim check-worthiness. We benchmark our dataset against a unified approach, CheckMate, that jointly determines whether a claim is check-worthy and the factors that led to that conclusion. We compare our suggested system with several baseline systems. Finally, we report a thorough analysis of results and human assessment, validating the efficacy of integrating check-worthiness factors in detecting claims worth fact-checking.
    摘要 在线社交媒体平台的扩张带来了在线内容消费的激增,但也为虚假声明和错误信息的传播提供了渠道。因此,对这些未经证实的声明进行筛选和验证需要大量人力。目前,这些声明由事实核查员(fact-checker)手动核实,但在线内容的数量往往超出他们的处理能力,使其难以及时核实每一条声明。因此,确定哪些断言值得进行事实核查并优先处理需要立即关注的声明至关重要。决定一条声明是否需要核查的因素有很多,包括其事实正确性、对公众的潜在影响、煽动仇恨的可能性等。尽管已有多项针对声明核查价值的研究,但识别这些因素的系统化方法仍是一个未解决的挑战。为此,我们提出了一项新的细粒度声明核查价值任务,它涵盖上述所有因素,并为判定一条声明是否值得核查提供可能的人类依据。我们构建了CheckIt,一个人工标注的大规模Twitter细粒度声明核查价值数据集,并以一种统一方法CheckMate作为基准,该方法联合判断声明是否值得核查以及导致该结论的因素。我们将所提出的系统与多个基线系统进行比较。最后,我们报告了对结果和人工评估的细致分析,验证了在检测值得核查的声明时融合核查价值因素的有效性。

Code quality assessment using transformers

  • paper_url: http://arxiv.org/abs/2309.09264
  • repo_url: None
  • paper_authors: Mosleh Mahamud, Isak Samsten
  • for: 这个论文的目的是自动评估编程作业的正确性。然而,编程任务可以有多种解决方案,其中许多方案虽然正确,但代码存在冗长、糟糕的命名和重复等主观质量问题。
  • methods: 这篇论文使用了 CodeBERT 来自动分配代码质量分数。作者们尝试了不同的模型和训练方法,并对一个新的代码质量评估数据集进行了实验。
  • results: 研究发现,代码质量有一定的可预测性,并且使用 transformer 基于的模型,使用任务适应预训练可以更有效地解决这个问题。
    Abstract Automatically evaluate the correctness of programming assignments is rather straightforward using unit and integration tests. However, programming tasks can be solved in multiple ways, many of which, although correct, are inelegant. For instance, excessive branching, poor naming or repetitiveness make the code hard to understand and maintain. These subjective qualities of code are hard to automatically assess using current techniques. In this work we investigate the use of CodeBERT to automatically assign quality score to Java code. We experiment with different models and training paradigms. We explore the accuracy of the models on a novel dataset for code quality assessment. Finally, we assess the quality of the predictions using saliency maps. We find that code quality to some extent is predictable and that transformer based models using task adapted pre-training can solve the task more efficiently than other techniques.
    摘要 使用单元测试和集成测试来自动评估编程作业的正确性相对直接。然而,编程任务可以有多种解法,其中许多解法虽然正确,却并不优雅:例如过多的分支、糟糕的命名或重复会使代码难以理解和维护。这些代码的主观质量难以用现有技术自动评估。在本工作中,我们研究使用CodeBERT为Java代码自动打出质量分数。我们尝试了不同的模型和训练范式,并在一个新的代码质量评估数据集上考察模型的准确率。最后,我们利用显著性图(saliency map)评估预测的质量。我们发现,代码质量在一定程度上是可预测的,而且采用任务适应预训练的基于transformer的模型能比其他技术更高效地解决该任务。
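
A minimal sketch of the basic setup the abstract describes: pool CodeBERT's encoder output and regress a scalar quality score. The [CLS]-style pooling, linear head, and MSE objective are common defaults assumed here; the task-adaptive pre-training the paper highlights is not shown.

```python
# Sketch: scalar code-quality regression head on top of CodeBERT.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class CodeQualityRegressor(nn.Module):
    def __init__(self, name: str = "microsoft/codebert-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]          # first-token pooled representation
        return self.head(cls).squeeze(-1)          # one quality score per snippet

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = CodeQualityRegressor()
batch = tok(["def add(a, b):\n    return a + b"], return_tensors="pt",
            truncation=True, max_length=512, padding=True)
score = model(batch["input_ids"], batch["attention_mask"])
loss = nn.MSELoss()(score, torch.tensor([0.8]))    # regress against an annotated quality score
loss.backward()
```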

A Benchmark for Text Expansion: Datasets, Metrics, and Baselines

  • paper_url: http://arxiv.org/abs/2309.09198
  • repo_url: None
  • paper_authors: Yi Chen, Haiyun Jiang, Wei Bi, Rui Wang, Longyue Wang, Shuming Shi, Ruifeng Xu
  • for: This paper presents a new task called Text Expansion (TE), which aims to insert fine-grained modifiers into plain text to make it more concrete and vivid.
  • methods: The authors use four complementary approaches to construct a dataset of 12 million automatically generated instances and 2K human-annotated references for both English and Chinese. They also design various metrics to evaluate the effectiveness of the expansions.
  • results: The proposed Locate&Infill models demonstrate superiority over the Text2Text baselines, especially in expansion informativeness. Experiments verify the feasibility of the TE task and point out potential directions for future research.
  • for: 这篇论文提出了一个新的文本扩展任务(TE),旨在插入细化修饰符到文本中,使其更加具体和生动。
  • methods: 作者使用了四种补充方法构建了一个包含12万自动生成实例和2000个人工标注的数据集,以及多种评价指标。
  • results: 提出的 Locate&Infill 模型在扩展信息量方面表现出色,特别是在对 Text2Text 基elines 的比较中。实验证明了文本扩展任务的可行性,并指出了未来研究的可能性。
    Abstract This work presents a new task of Text Expansion (TE), which aims to insert fine-grained modifiers into proper locations of the plain text to concretize or vivify human writings. Different from existing insertion-based writing assistance tasks, TE requires the model to be more flexible in both locating and generation, and also more cautious in keeping basic semantics. We leverage four complementary approaches to construct a dataset with 12 million automatically generated instances and 2K human-annotated references for both English and Chinese. To facilitate automatic evaluation, we design various metrics from multiple perspectives. In particular, we propose Info-Gain to effectively measure the informativeness of expansions, which is an important quality dimension in TE. On top of a pre-trained text-infilling model, we build both pipelined and joint Locate&Infill models, which demonstrate the superiority over the Text2Text baselines, especially in expansion informativeness. Experiments verify the feasibility of the TE task and point out potential directions for future research toward better automatic text expansion.
    摘要 本工作提出了一项新任务——文本扩展(Text Expansion, TE),其目标是在纯文本的恰当位置插入细粒度的修饰成分,使人类写作更加具体生动。与现有基于插入的写作辅助任务不同,TE要求模型在定位和生成两方面都更加灵活,同时更谨慎地保持基本语义。我们结合四种互补的方法,为英文和中文各构建了包含1200万条自动生成实例和2000条人工标注参考的数据集。为便于自动评估,我们从多个角度设计了多种指标;特别地,我们提出了Info-Gain,用以有效度量扩展内容的信息量,这是TE中一个重要的质量维度。在预训练的文本填充模型之上,我们构建了流水线式与联合式的Locate&Infill模型,它们优于Text2Text基线,尤其是在扩展信息量方面。实验验证了TE任务的可行性,并指出了未来实现更好的自动文本扩展的潜在研究方向。

Detecting covariate drift in text data using document embeddings and dimensionality reduction

  • paper_url: http://arxiv.org/abs/2309.10000
  • repo_url: https://github.com/vinayaksodar/nlp_drift_paper_code
  • paper_authors: Vinayak Sodar, Ankit Sekseria
  • for: 本研究旨在提高文本分析模型的可靠性和性能,通过检测文本数据中的covariate漂移。
  • methods: 本研究使用了三种文档嵌入:TF-IDF使用LSA进行维度减少, Doc2Vec和BERT嵌入,以及使用PCA进行维度减少。检测covariate漂移的方法包括KS统计和MMD测试。
  • results: 实验结果表明,某些嵌入、维度减少方法和漂移检测方法的组合表现较好,可以有效地检测文本数据中的covariate漂移。这些结果对于提高可靠的文本分析模型做出了贡献。
    Abstract Detecting covariate drift in text data is essential for maintaining the reliability and performance of text analysis models. In this research, we investigate the effectiveness of different document embeddings, dimensionality reduction techniques, and drift detection methods for identifying covariate drift in text data. We explore three popular document embeddings: term frequency-inverse document frequency (TF-IDF) using Latent semantic analysis(LSA) for dimentionality reduction and Doc2Vec, and BERT embeddings, with and without using principal component analysis (PCA) for dimensionality reduction. To quantify the divergence between training and test data distributions, we employ the Kolmogorov-Smirnov (KS) statistic and the Maximum Mean Discrepancy (MMD) test as drift detection methods. Experimental results demonstrate that certain combinations of embeddings, dimensionality reduction techniques, and drift detection methods outperform others in detecting covariate drift. Our findings contribute to the advancement of reliable text analysis models by providing insights into effective approaches for addressing covariate drift in text data.
    摘要 检测文本数据中的变量漂移是维护文本分析模型的可靠性和性能的关键。在这个研究中,我们研究了不同的文档嵌入、维度减少技术和变量漂移检测方法,以确定在文本数据中检测变量漂移的效果。我们探索了三种流行的文档嵌入:term frequency-inverse document frequency(TF-IDF)使用隐藏语义分析(LSA)进行维度减少,Doc2Vec和BERT嵌入,并使用主成分分析(PCA)进行维度减少。为了量化训练和测试数据分布之间的差异,我们采用了科维莫洛夫-斯密涅夫(KS)统计和最大均值差(MMD)测试作为变量漂移检测方法。实验结果表明,某些组合的嵌入、维度减少技术和变量漂移检测方法在检测变量漂移方面表现出色。我们的发现对于建立可靠的文本分析模型做出了贡献,提供了有关有效地在文本数据中检测变量漂移的信息。
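
One of the evaluated pipelines can be sketched directly with standard libraries: TF-IDF vectors reduced with LSA (TruncatedSVD), then compared across reference and current text sets with per-dimension Kolmogorov-Smirnov tests and an RBF-kernel MMD. The Bonferroni-corrected threshold and kernel bandwidth are illustrative choices; the sketch assumes a corpus large enough to support 50 SVD components.

```python
# Sketch: covariate drift detection on LSA-reduced TF-IDF document embeddings.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def embed(train_texts, test_texts, n_components=50):
    """Fit TF-IDF + LSA on the training texts, then project both splits."""
    vec = TfidfVectorizer().fit(train_texts)
    svd = TruncatedSVD(n_components=n_components).fit(vec.transform(train_texts))
    return svd.transform(vec.transform(train_texts)), svd.transform(vec.transform(test_texts))

def ks_drift(ref: np.ndarray, cur: np.ndarray, alpha: float = 0.05) -> bool:
    """Flag drift if any embedding dimension's two-sample KS test rejects equality."""
    pvals = [ks_2samp(ref[:, d], cur[:, d]).pvalue for d in range(ref.shape[1])]
    return min(pvals) < alpha / ref.shape[1]       # Bonferroni-corrected threshold

def mmd_rbf(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    """Biased estimate of squared Maximum Mean Discrepancy with an RBF kernel."""
    def k(a, b):
        d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```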