eess.AS - 2023-09-08

Asymmetric Clean Segments-Guided Self-Supervised Learning for Robust Speaker Verification

  • paper_url: http://arxiv.org/abs/2309.04265
  • repo_url: None
  • paper_authors: Chong-Xin Gan, Man-Wai Mak, Weiwei Lin, Jen-Tzung Chien
  • for: Improving the performance of speaker verification (SV).
  • methods: Uses contrastive self-supervised learning (CSL) and incorporates clean segments alongside augmented ones in the contrastive training pipeline, so that speaker-specific information survives data augmentation; the contrastive loss is weighted to widen the gap between the clean and augmented embeddings of different speakers.
  • results: Experiments on the VoxCeleb1 dataset show a remarkable 19% improvement over conventional methods, surpassing many existing state-of-the-art techniques.
    Abstract Contrastive self-supervised learning (CSL) for speaker verification (SV) has drawn increasing interest recently due to its ability to exploit unlabeled data. Performing data augmentation on raw waveforms, such as adding noise or reverberation, plays a pivotal role in achieving promising results in SV. Data augmentation, however, demands meticulous calibration to ensure intact speaker-specific information, which is difficult to achieve without speaker labels. To address this issue, we introduce a novel framework by incorporating clean and augmented segments into the contrastive training pipeline. The clean segments are repurposed to pair with noisy segments to form additional positive and negative pairs. Moreover, the contrastive loss is weighted to increase the difference between the clean and augmented embeddings of different speakers. Experimental results on Voxceleb1 suggest that the proposed framework can achieve a remarkable 19% improvement over the conventional methods, and it surpasses many existing state-of-the-art techniques.
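A minimal sketch of how the clean/augmented pairing and loss weighting described above could look, assuming an NT-Xent-style contrastive objective; the function name, the `neg_weight` factor, and the batch layout (row i of each tensor from the same speaker) are illustrative, not the paper's released code:

```python
import torch
import torch.nn.functional as F

def clean_guided_contrastive_loss(clean_emb, aug_emb, temperature=0.1, neg_weight=1.5):
    """NT-Xent-style loss where each clean segment is paired with the augmented
    segment of the same utterance (positive) and with segments of other
    speakers (weighted negatives). clean_emb, aug_emb: (batch, dim) tensors.
    `neg_weight` is a hypothetical scalar standing in for the paper's weighted
    loss, up-scaling cross-speaker similarities to widen the margin."""
    clean = F.normalize(clean_emb, dim=1)
    aug = F.normalize(aug_emb, dim=1)
    logits = clean @ aug.t() / temperature                    # (batch, batch) similarities
    batch = logits.size(0)
    off_diag = ~torch.eye(batch, dtype=torch.bool, device=logits.device)
    logits = torch.where(off_diag, logits * neg_weight, logits)  # up-weight negatives
    labels = torch.arange(batch, device=logits.device)           # positives on the diagonal
    return F.cross_entropy(logits, labels)
```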

cs.CV - 2023-09-08

Open and reusable deep learning for pathology with WSInfer and QuPath

  • paper_url: http://arxiv.org/abs/2309.04631
  • repo_url: None
  • paper_authors: Jakub R. Kaczmarzyk, Alan O’Callaghan, Fiona Inglis, Tahsin Kurc, Rajarsi Gupta, Erich Bremer, Peter Bankhead, Joel H. Saltz
  • for: Making deep learning models for pathology more streamlined and accessible.
  • methods: Introduces WSInfer, a new open-source software ecosystem with three main elements: 1) a Python package and command line tool for efficiently applying patch-based deep learning inference to whole slide images; 2) a QuPath extension providing a user-friendly, interactive inference engine; and 3) a model zoo for sharing pathology models and metadata in a standardized form.
  • results: WSInfer lets pathologists and researchers access and apply deep learning models without coding experience. The source code is hosted on GitHub, with documentation at https://wsinfer.readthedocs.io.
    Abstract The field of digital pathology has seen a proliferation of deep learning models in recent years. Despite substantial progress, it remains rare for other researchers and pathologists to be able to access models published in the literature and apply them to their own images. This is due to difficulties in both sharing and running models. To address these concerns, we introduce WSInfer: a new, open-source software ecosystem designed to make deep learning for pathology more streamlined and accessible. WSInfer comprises three main elements: 1) a Python package and command line tool to efficiently apply patch-based deep learning inference to whole slide images; 2) a QuPath extension that provides an alternative inference engine through user-friendly and interactive software, and 3) a model zoo, which enables pathology models and metadata to be easily shared in a standardized form. Together, these contributions aim to encourage wider reuse, exploration, and interrogation of deep learning models for research purposes, by putting them into the hands of pathologists and eliminating a need for coding experience when accessed through QuPath. The WSInfer source code is hosted on GitHub and documentation is available at https://wsinfer.readthedocs.io.
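WSInfer's actual API and CLI are documented at the link above; purely as an illustration of what patch-based inference over a whole slide image involves, here is a generic sketch using OpenSlide and a PyTorch classifier (all names are hypothetical, and this is not WSInfer's interface):

```python
import numpy as np
import torch
import openslide  # pip install openslide-python

def run_patch_inference(slide_path, model, patch=256, stride=256, device="cpu"):
    """Slide a window over level 0 of a whole slide image and record one
    prediction per patch. `model` is any torch classifier taking
    (N, 3, patch, patch) float tensors in [0, 1]."""
    slide = openslide.OpenSlide(slide_path)
    width, height = slide.dimensions
    model.eval().to(device)
    results = []
    with torch.no_grad():
        for y in range(0, height - patch + 1, stride):
            for x in range(0, width - patch + 1, stride):
                region = slide.read_region((x, y), 0, (patch, patch)).convert("RGB")
                t = torch.from_numpy(np.asarray(region)).permute(2, 0, 1).float() / 255.0
                prob = model(t.unsqueeze(0).to(device)).softmax(dim=1).squeeze(0)
                results.append((x, y, prob.cpu().numpy()))  # patch origin + class probs
    return results
```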

Style Generation: Image Synthesis based on Coarsely Matched Texts

  • paper_url: http://arxiv.org/abs/2309.04608
  • repo_url: None
  • paper_authors: Mengyao Cui, Zhe Zhu, Shao-Ping Lu, Yulu Yang
  • for: Proposes text-based style generation: stylizing an input image guided by coarsely matched text.
  • methods: A two-stage generative adversarial network: the first stage generates the overall image style from a sentence feature; the second stage refines the generated style with a synthetic feature produced by a multi-modality style synthesis module.
  • results: Extensive experiments and ablation studies validate the framework, and its practical potential is demonstrated in applications such as text-image alignment and story visualization.
    Abstract Previous text-to-image synthesis algorithms typically use explicit textual instructions to generate/manipulate images accurately, but they have difficulty adapting to guidance in the form of coarsely matched texts. In this work, we attempt to stylize an input image using such coarsely matched text as guidance. To tackle this new problem, we introduce a novel task called text-based style generation and propose a two-stage generative adversarial network: the first stage generates the overall image style with a sentence feature, and the second stage refines the generated style with a synthetic feature, which is produced by a multi-modality style synthesis module. We re-filter one existing dataset and collect a new dataset for the task. Extensive experiments and ablation studies are conducted to validate our framework. The practical potential of our work is demonstrated by various applications such as text-image alignment and story visualization. Our datasets are published at https://www.kaggle.com/datasets/mengyaocui/style-generation.

Dynamic Mesh-Aware Radiance Fields

  • paper_url: http://arxiv.org/abs/2309.04581
  • repo_url: https://github.com/YilingQiao/DMRF
  • paper_authors: Yi-Ling Qiao, Alexander Gao, Yiran Xu, Yue Feng, Jia-Bin Huang, Ming C. Lin
  • for: Explores, from a systems perspective, embedding polygonal mesh assets within photorealistic Neural Radiance Fields (NeRF) so that they can be rendered and their dynamics simulated in a physically consistent manner.
  • methods: Designs a two-way coupling between mesh and NeRF during rendering and simulation: the light transport equations for both are distilled into an efficient algorithm for updating radiance and throughput along a cast ray with an arbitrary number of bounces; NeRF is trained on high dynamic range (HDR) images to resolve the discrepancy between the path tracer's linear color space and standard NeRF's sRGB; and a strategy estimates light sources and casts shadows on the NeRF.
  • results: The hybrid system outperforms alternatives in visual realism for mesh insertion, because it allows realistic light transport from volumetric NeRF media onto surfaces, affecting the appearance of reflective/refractive surfaces and the illumination of diffuse surfaces informed by the dynamic scene.
    Abstract Embedding polygonal mesh assets within photorealistic Neural Radiance Fields (NeRF) volumes, such that they can be rendered and their dynamics simulated in a physically consistent manner with the NeRF, is under-explored from the system perspective of integrating NeRF into the traditional graphics pipeline. This paper designs a two-way coupling between mesh and NeRF during rendering and simulation. We first review the light transport equations for both mesh and NeRF, then distill them into an efficient algorithm for updating radiance and throughput along a cast ray with an arbitrary number of bounces. To resolve the discrepancy between the linear color space that the path tracer assumes and the sRGB color space that standard NeRF uses, we train NeRF with High Dynamic Range (HDR) images. We also present a strategy to estimate light sources and cast shadows on the NeRF. Finally, we consider how the hybrid surface-volumetric formulation can be efficiently integrated with a high-performance physics simulator that supports cloth, rigid and soft bodies. The full rendering and simulation system can be run on a GPU at interactive rates. We show that a hybrid system approach outperforms alternatives in visual realism for mesh insertion, because it allows realistic light transport from volumetric NeRF media onto surfaces, which affects the appearance of reflective/refractive surfaces and illumination of diffuse surfaces informed by the dynamic scene.
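The core update of radiance and throughput along a cast ray can be sketched with standard volume-rendering quadrature. This toy version marches one segment of the ray up to the first mesh hit and ignores further bounces, so it is a schematic of the idea rather than the paper's full algorithm:

```python
import numpy as np

def march_segment(origin, direction, t_hit, nerf_query, n_samples=64):
    """Accumulate radiance and throughput along one ray segment through the
    NeRF volume, stopping at the first mesh intersection t_hit. `nerf_query`
    maps a 3D point to (sigma, rgb). Standard quadrature:
    alpha_i = 1 - exp(-sigma_i * dt)."""
    ts = np.linspace(0.0, t_hit, n_samples, endpoint=False)
    dt = t_hit / n_samples
    radiance = np.zeros(3)
    throughput = 1.0  # transmittance accumulated so far
    for t in ts:
        sigma, rgb = nerf_query(origin + t * direction)
        alpha = 1.0 - np.exp(-sigma * dt)
        radiance += throughput * alpha * np.asarray(rgb)
        throughput *= 1.0 - alpha
    # At t_hit the surviving throughput scales whatever the mesh surface
    # (BRDF sample / next bounce) returns: L = radiance + throughput * L_surface.
    return radiance, throughput
```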

Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation

  • paper_url: http://arxiv.org/abs/2309.04573
  • repo_url: None
  • paper_authors: Shyam Nandan Rai, Fabio Cermelli, Barbara Caputo, Carlo Masone
  • for: Proposes a mask-based anomaly detection method to address the segmentation of unknown or anomalous object instances in autonomous driving.
  • methods: A mask classification architecture with several technical novelties: a global masked attention module, mask contrastive learning, a mask refinement solution, and a method for mining unknown instances based on mask-architecture properties.
  • results: Through comprehensive evaluation, Mask2Anomaly achieves new state-of-the-art results across the benchmarks of anomaly segmentation, open-set semantic segmentation, and open-set panoptic segmentation.
    Abstract Segmenting unknown or anomalous object instances is a critical task in autonomous driving applications, and it is approached traditionally as a per-pixel classification problem. However, reasoning individually about each pixel without considering their contextual semantics results in high uncertainty around the objects' boundaries and numerous false positives. We propose a paradigm change by shifting from a per-pixel classification to a mask classification. Our mask-based method, Mask2Anomaly, demonstrates the feasibility of integrating a mask-classification architecture to jointly address anomaly segmentation, open-set semantic segmentation, and open-set panoptic segmentation. Mask2Anomaly includes several technical novelties that are designed to improve the detection of anomalies/unknown objects: i) a global masked attention module to focus individually on the foreground and background regions; ii) a mask contrastive learning that maximizes the margin between an anomaly and known classes; iii) a mask refinement solution to reduce false positives; and iv) a novel approach to mine unknown instances based on the mask-architecture properties. By comprehensive qualitative and quantitative evaluation, we show Mask2Anomaly achieves new state-of-the-art results across the benchmarks of anomaly segmentation, open-set semantic segmentation, and open-set panoptic segmentation.
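One common way to read anomaly scores out of a mask-classification head, shown here only as a plausible simplification (the paper's exact scoring, with its contrastive and refinement components, differs):

```python
import torch

def mask_anomaly_scores(mask_logits, class_logits):
    """Turn mask-classification outputs into a per-pixel anomaly map.
    mask_logits: (M, H, W) per-mask spatial logits;
    class_logits: (M, num_known_classes) per-mask class scores.
    A pixel is anomalous when no mask confidently claims it for a known
    class -- a simplified reading of mask-based anomaly scoring."""
    masks = mask_logits.sigmoid()                              # (M, H, W)
    cls = class_logits.softmax(dim=-1).max(dim=-1).values      # (M,) best known-class prob
    known_evidence = (masks * cls[:, None, None]).max(dim=0).values  # (H, W)
    return 1.0 - known_evidence                                # high where nothing known fits
```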

Poster: Making Edge-assisted LiDAR Perceptions Robust to Lossy Point Cloud Compression

  • paper_url: http://arxiv.org/abs/2309.04549
  • repo_url: None
  • paper_authors: Jin Heo, Gregorie Phillips, Per-Erik Brodin, Ada Gavrilovska
  • for: Improving the quality of LiDAR point clouds degraded by lossy compression, to mitigate the resulting loss in perception performance.
  • methods: An interpolation algorithm that operates on the range image (RI) representation of a point cloud and interpolates points based on depth gradients.
  • results: Compared to existing image interpolation algorithms, the proposed algorithm gives better qualitative results when the point cloud is reconstructed from the interpolated RI.
    Abstract Real-time light detection and ranging (LiDAR) perceptions, e.g., 3D object detection and simultaneous localization and mapping are computationally intensive to mobile devices of limited resources and often offloaded on the edge. Offloading LiDAR perceptions requires compressing the raw sensor data, and lossy compression is used for efficiently reducing the data volume. Lossy compression degrades the quality of LiDAR point clouds, and the perception performance is decreased consequently. In this work, we present an interpolation algorithm improving the quality of a LiDAR point cloud to mitigate the perception performance loss due to lossy compression. The algorithm targets the range image (RI) representation of a point cloud and interpolates points at the RI based on depth gradients. Compared to existing image interpolation algorithms, our algorithm shows a better qualitative result when the point cloud is reconstructed from the interpolated RI. With the preliminary results, we also describe the next steps of the current work.
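A simplified sketch of depth-gradient-guided interpolation on a range image: insert rows only where the vertical depth gradient indicates a continuous surface, otherwise keep the nearer return. The `max_gap` threshold is an assumed constant, not a value from the paper:

```python
import numpy as np

def upsample_range_image(ri, max_gap=0.5):
    """Double the vertical resolution of a LiDAR range image by inserting a
    row between each pair of rows. A new pixel is interpolated only when the
    vertical depth gradient is small (same surface); across discontinuities
    the nearer return is copied instead. ri: (rows, cols) depths in meters."""
    rows, cols = ri.shape
    out = np.zeros((rows * 2 - 1, cols), dtype=ri.dtype)
    out[0::2] = ri                                # original rows keep their values
    upper, lower = ri[:-1], ri[1:]
    grad = np.abs(upper - lower)                  # vertical depth gradient
    smooth = grad < max_gap                       # same-surface pixels
    out[1::2] = np.where(smooth, 0.5 * (upper + lower),   # linear interpolation
                         np.minimum(upper, lower))         # else keep nearer return
    return out
```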

Examining Autoexposure for Challenging Scenes

  • paper_url: http://arxiv.org/abs/2309.04542
  • repo_url: None
  • paper_authors: SaiKiran Tedla, Beixuan Yang, Michael S. Brown
  • for: Providing a large exposure dataset to support the development of autoexposure (AE) algorithms for environments with challenging, time-varying lighting.
  • methods: A software platform that lets AE algorithms be evaluated against the dataset in a plug-and-play, repeatable manner.
  • results: An evaluation of several existing AE strategies shows that most users prefer a simple saliency method for challenging lighting conditions.
    Abstract Autoexposure (AE) is a critical step applied by camera systems to ensure properly exposed images. While current AE algorithms are effective in well-lit environments with constant illumination, these algorithms still struggle in environments with bright light sources or scenes with abrupt changes in lighting. A significant hurdle in developing new AE algorithms for challenging environments, especially those with time-varying lighting, is the lack of suitable image datasets. To address this issue, we have captured a new 4D exposure dataset that provides a large solution space (i.e., shutter speeds ranging from 1/500 to 15 seconds) over a temporal sequence with moving objects, bright lights, and varying lighting. In addition, we have designed a software platform to allow AE algorithms to be used in a plug-and-play manner with the dataset. Our dataset and associated platform enable repeatable evaluation of different AE algorithms and provide a much-needed starting point to develop better AE methods. We examine several existing AE strategies using our dataset and show that most users prefer a simple saliency method for challenging lighting conditions.
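For intuition, a "simple saliency method" of the kind the study evaluates might reduce to an exposure-control loop like the following; the 18% grey target and damping gain are assumed tuning constants, and only the shutter range matches the dataset:

```python
import numpy as np

def saliency_autoexposure(luma, saliency, shutter_s, target=0.18, gain=0.5):
    """One step of a saliency-weighted autoexposure loop: measure the
    saliency-weighted mean luminance of the current frame and scale the
    shutter time toward an 18% grey target. luma and saliency are (H, W)
    arrays; shutter_s is the current shutter time in seconds."""
    w = saliency / (saliency.sum() + 1e-8)
    measured = float((luma * w).sum())              # salient-region brightness
    ratio = target / max(measured, 1e-6)
    new_shutter = shutter_s * ratio ** gain          # damped multiplicative update
    return float(np.clip(new_shutter, 1 / 500, 15.0))  # dataset's shutter range
```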

Generalized Cross-domain Multi-label Few-shot Learning for Chest X-rays

  • paper_url: http://arxiv.org/abs/2309.04462
  • repo_url: None
  • paper_authors: Aroof Aimen, Arsh Verma, Makarand Tapaswi, Narayanan C. Krishnan
  • for: Classifying abnormalities in chest X-rays under real-world constraints.
  • methods: An integrated framework, Generalized Cross-Domain Multi-Label Few-Shot Learning (GenCDML-FSL), that handles training and evaluation sets drawn from different domains, partial class overlap between training and evaluation, and multi-label few-shot learning via meta-learning; a Generalized Episodic Training (GenET) strategy equips models for these challenges.
  • results: Comparisons with well-established methods such as transfer learning, hybrid transfer learning, and multi-label meta-learning on multiple datasets show the superiority of the approach.
    Abstract Real-world application of chest X-ray abnormality classification requires dealing with several challenges: (i) limited training data; (ii) training and evaluation sets that are derived from different domains; and (iii) classes that appear during training may have partial overlap with classes of interest during evaluation. To address these challenges, we present an integrated framework called Generalized Cross-Domain Multi-Label Few-Shot Learning (GenCDML-FSL). The framework supports overlap in classes during training and evaluation, cross-domain transfer, adopts meta-learning to learn using few training samples, and assumes each chest X-ray image is either normal or associated with one or more abnormalities. Furthermore, we propose Generalized Episodic Training (GenET), a training strategy that equips models to operate with multiple challenges observed in the GenCDML-FSL scenario. Comparisons with well-established methods such as transfer learning, hybrid transfer learning, and multi-label meta-learning on multiple datasets show the superiority of our approach.

WiSARD: A Labeled Visual and Thermal Image Dataset for Wilderness Search and Rescue

  • paper_url: http://arxiv.org/abs/2309.04453
  • repo_url: None
  • paper_authors: Daniel Broyles, Christopher R. Hayner, Karen Leung
  • for: Helping reduce search times and alleviate safety risks for first responders carrying out Wilderness Search and Rescue (WiSAR) operations.
  • methods: Multi-modal sensing, specifically visual-thermal cameras, so that WiSAR UAVs can operate under diverse conditions.
  • results: WiSARD, a dataset of roughly 56,000 labeled visual and thermal images collected from UAV flights in various terrains, seasons, weather, and lighting conditions, for developing vision-based algorithms for autonomous WiSAR UAVs.
    Abstract Sensor-equipped unoccupied aerial vehicles (UAVs) have the potential to help reduce search times and alleviate safety risks for first responders carrying out Wilderness Search and Rescue (WiSAR) operations, the process of finding and rescuing person(s) lost in wilderness areas. Unfortunately, visual sensors alone do not address the need for robustness across all the possible terrains, weather, and lighting conditions that WiSAR operations can be conducted in. The use of multi-modal sensors, specifically visual-thermal cameras, is critical in enabling WiSAR UAVs to perform in diverse operating conditions. However, due to the unique challenges posed by the wilderness context, existing dataset benchmarks are inadequate for developing vision-based algorithms for autonomous WiSAR UAVs. To this end, we present WiSARD, a dataset with roughly 56,000 labeled visual and thermal images collected from UAV flights in various terrains, seasons, weather, and lighting conditions. To the best of our knowledge, WiSARD is the first large-scale dataset collected with multi-modal sensors for autonomous WiSAR operations. We envision that our dataset will provide researchers with a diverse and challenging benchmark that can test the robustness of their algorithms when applied to real-world (life-saving) applications.

Demographic Disparities in 1-to-Many Facial Identification

  • paper_url: http://arxiv.org/abs/2309.04447
  • repo_url: None
  • paper_authors: Aman Bhatta, Gabriella Pangelinan, Micheal C. King, Kevin W. Bowyer
  • for: Examining how demographics (race and gender) affect 1-to-many facial identification accuracy, including under blur and reduced resolution in the probe image.
  • methods: Three metrics for comparing the propensity for rank-one recognition errors across demographics: the d' between mated and non-mated score distributions; the absolute score difference between thresholds in the high-similarity tail of the non-mated and the low-similarity tail of the mated distribution; and the distribution of (mated - non-mated) rank-one scores across probe images.
  • results: Demographic variation in 1-to-many accuracy does not entirely follow what has been observed in 1-to-1 matching, and can be affected by differing numbers of identities and images across demographics. Increased blur or reduced resolution in the probe significantly raises the false positive identification rate, with much larger male/female variation than African-American/Caucasian; matching "surveillance camera quality" probes against a "government ID quality" gallery can cause 1-to-many accuracy to collapse.
    Abstract Most studies to date that have examined demographic variations in face recognition accuracy have analyzed 1-to-1 matching accuracy, using images that could be described as "government ID quality". This paper analyzes the accuracy of 1-to-many facial identification across demographic groups, and in the presence of blur and reduced resolution in the probe image as might occur in "surveillance camera quality" images. Cumulative match characteristic curves(CMC) are not appropriate for comparing propensity for rank-one recognition errors across demographics, and so we introduce three metrics for this: (1) d' metric between mated and non-mated score distributions, (2) absolute score difference between thresholds in the high-similarity tail of the non-mated and the low-similarity tail of the mated distribution, and (3) distribution of (mated - non-mated rank one scores) across the set of probe images. We find that demographic variation in 1-to-many accuracy does not entirely follow what has been observed in 1-to-1 matching accuracy. Also, different from 1-to-1 accuracy, demographic comparison of 1-to-many accuracy can be affected by different numbers of identities and images across demographics. Finally, we show that increased blur in the probe image, or reduced resolution of the face in the probe image, can significantly increase the false positive identification rate. And we show that the demographic variation in these high blur or low resolution conditions is much larger for male/ female than for African-American / Caucasian. The point that 1-to-many accuracy can potentially collapse in the context of processing "surveillance camera quality" probe images against a "government ID quality" gallery is an important one.
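The first of the three metrics, d', has a standard closed form: the separation of the mated and non-mated score means in units of their pooled standard deviation, d' = |mu_mated - mu_nonmated| / sqrt((sigma_mated^2 + sigma_nonmated^2) / 2), which computes directly:

```python
import numpy as np

def d_prime(mated_scores, nonmated_scores):
    """d' (sensitivity index) between mated and non-mated 1-to-many score
    distributions. Larger d' means the two distributions are better
    separated, so rank-one recognition errors are less likely."""
    mated = np.asarray(mated_scores, dtype=float)
    nonmated = np.asarray(nonmated_scores, dtype=float)
    pooled_sd = np.sqrt((mated.var(ddof=1) + nonmated.var(ddof=1)) / 2.0)
    return abs(mated.mean() - nonmated.mean()) / pooled_sd
```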

Comparative Study of Visual SLAM-Based Mobile Robot Localization Using Fiducial Markers

  • paper_url: http://arxiv.org/abs/2309.04441
  • repo_url: None
  • paper_authors: Jongwon Lee, Su Yeon Choi, David Hanley, Timothy Bretl
  • for: Comparing three modes of visual-SLAM-based mobile robot localization using fiducial markers (square artificial landmarks with a black-and-white grid pattern): SLAM, SLAM with a prior map, and localization with a prior map. Fiducial-assisted approaches are compared because prior work showed they outperform feature-only methods with less computational burden and no loss in localization performance.
  • methods: Evaluation on indoor image sequences captured with a hand-held camera in environments containing multiple fiducial markers; performance metrics are absolute trajectory error and per-frame optimization runtime. For the two modes that use a prior map, the map quality is deliberately perturbed to study each mode's tolerance.
  • results: The three modes show consistent trajectory error levels, with the localization mode having the shortest runtime. Under map perturbations, however, SLAM with a prior map maintains performance, while the localization mode degrades in both accuracy and runtime.
    Abstract This paper presents a comparative study of three modes for mobile robot localization based on visual SLAM using fiducial markers (i.e., square-shaped artificial landmarks with a black-and-white grid pattern): SLAM, SLAM with a prior map, and localization with a prior map. The reason for comparing the SLAM-based approaches leveraging fiducial markers is because previous work has shown their superior performance over feature-only methods, with less computational burden compared to methods that use both feature and marker detection without compromising the localization performance. The evaluation is conducted using indoor image sequences captured with a hand-held camera containing multiple fiducial markers in the environment. The performance metrics include absolute trajectory error and runtime for the optimization process per frame. In particular, for the last two modes (SLAM and localization with a prior map), we evaluate their performances by perturbing the quality of prior map to study the extent to which each mode is tolerant to such perturbations. Hardware experiments show consistent trajectory error levels across the three modes, with the localization mode exhibiting the shortest runtime among them. Yet, with map perturbations, SLAM with a prior map maintains performance, while localization mode degrades in both aspects.

Single View Refractive Index Tomography with Neural Fields

  • paper_url: http://arxiv.org/abs/2309.04437
  • repo_url: None
  • paper_authors: Brandon Zhao, Aviad Levis, Liam Connor, Pratul P. Srinivasan, Katherine L. Bouman
  • for: Reconstructing a scene's 3D refractive field from 2D projected image measurements.
  • methods: A coordinate-based neural network models the continuous refractive field; the 3D spatial curvature of light rays is modeled explicitly, and the network parameters are optimized with an analysis-by-synthesis approach, using a projected image from only a single viewpoint together with knowledge of light sources scattered throughout the medium.
  • results: Refractive fields are successfully recovered in simulation, the effect of the light source distribution on recovery is analyzed, and the method recovers the refractive field underlying a realistic simulated dark matter distribution.
    Abstract Refractive Index Tomography is an inverse problem in which we seek to reconstruct a scene's 3D refractive field from 2D projected image measurements. The refractive field is not visible itself, but instead affects how the path of a light ray is continuously curved as it travels through space. Refractive fields appear across a wide variety of scientific applications, from translucent cell samples in microscopy to fields of dark matter bending light from faraway galaxies. This problem poses a unique challenge because the refractive field directly affects the path that light takes, making its recovery a non-linear problem. In addition, in contrast with traditional tomography, we seek to recover the refractive field using a projected image from only a single viewpoint by leveraging knowledge of light sources scattered throughout the medium. In this work, we introduce a method that uses a coordinate-based neural network to model the underlying continuous refractive field in a scene. We then use explicit modeling of rays' 3D spatial curvature to optimize the parameters of this network, reconstructing refractive fields with an analysis-by-synthesis approach. The efficacy of our approach is demonstrated by recovering refractive fields in simulation, and analyzing how recovery is affected by the light source distribution. We then test our method on a simulated dark matter mapping problem, where we recover the refractive field underlying a realistic simulated dark matter distribution.
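The "continuously curved" ray paths follow the ray equation of geometric optics, d/ds(n dx/ds) = grad n. A forward-Euler sketch of tracing one such bent ray, where `n_field` and `grad_n` stand in for queries to (and gradients through) a coordinate network; a production tracer would use a higher-order integrator:

```python
import numpy as np

def trace_bent_ray(x0, d0, n_field, grad_n, step=0.01, n_steps=1000):
    """Integrate d/ds(n * dx/ds) = grad(n) through a continuous refractive
    index field. Substituting v = n * dx/ds gives the first-order system
    dx/ds = v / n and dv/ds = grad(n), integrated here with forward Euler."""
    x = np.asarray(x0, dtype=float)
    v = n_field(x) * np.asarray(d0, dtype=float) / np.linalg.norm(d0)  # v = n * dx/ds
    path = [x.copy()]
    for _ in range(n_steps):
        x = x + step * v / n_field(x)   # dx/ds = v / n
        v = v + step * grad_n(x)        # dv/ds = grad(n)
        path.append(x.copy())
    return np.stack(path)
```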

Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.04422
  • repo_url: None
  • paper_authors: Thomas E. Huang, Yifan Liu, Luc Van Gool, Fisher Yu
  • for: Unifying multiple heterogeneous image and video recognition tasks in autonomous driving.
  • methods: A single network, VTDNet, with one structure and one set of weights for all ten tasks of the new Video Task Decathlon (VTD) challenge; similar tasks are grouped, task interaction stages exchange information within and between groups, and a Curriculum training, Pseudo-labeling, and Fine-tuning (CPF) scheme makes joint training feasible.
  • results: VTDNet outperforms its single-task counterparts on most tasks with only 20% of the overall computation.
    Abstract Performing multiple heterogeneous visual tasks in dynamic scenes is a hallmark of human perception capability. Despite remarkable progress in image and video recognition via representation learning, current research still focuses on designing specialized networks for singular, homogeneous, or simple combination of tasks. We instead explore the construction of a unified model for major image and video recognition tasks in autonomous driving with diverse input and output structures. To enable such an investigation, we design a new challenge, Video Task Decathlon (VTD), which includes ten representative image and video tasks spanning classification, segmentation, localization, and association of objects and pixels. On VTD, we develop our unified network, VTDNet, that uses a single structure and a single set of weights for all ten tasks. VTDNet groups similar tasks and employs task interaction stages to exchange information within and between task groups. Given the impracticality of labeling all tasks on all frames, and the performance degradation associated with joint training of many tasks, we design a Curriculum training, Pseudo-labeling, and Fine-tuning (CPF) scheme to successfully train VTDNet on all tasks and mitigate performance loss. Armed with CPF, VTDNet significantly outperforms its single-task counterparts on most tasks with only 20% overall computations. VTD is a promising new direction for exploring the unification of perception tasks in autonomous driving.

DeformToon3D: Deformable 3D Toonification from Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2309.04410
  • repo_url: https://github.com/junzhezhang/DeformToon3D
  • paper_authors: Junzhe Zhang, Yushi Lan, Shuai Yang, Fangzhou Hong, Quan Wang, Chai Kiat Yeo, Ziwei Liu, Chen Change Loy
  • for: Solving 3D toonification: transferring the style of an artistic domain onto a target 3D face with stylized geometry and texture, while preserving the original GAN latent space.
  • methods: DeformToon3D, a toonification framework tailored for hierarchical 3D GANs, decomposes the problem into geometry and texture stylization. A novel StyleField predicts conditional 3D deformation to align a real-space NeRF to the style space for geometry stylization; texture stylization is then achieved via adaptive style mixing that injects artistic-domain information into the decoder of the pre-trained 3D GAN.
  • results: High-quality 3D toonification with flexible style degree control and shape-texture-specific style swapping, trained efficiently on proxy samples synthesized by off-the-shelf 2D toonification models, without any real-world 2D-3D training pairs.
    Abstract In this paper, we address the challenging problem of 3D toonification, which involves transferring the style of an artistic domain onto a target 3D face with stylized geometry and texture. Although fine-tuning a pre-trained 3D GAN on the artistic domain can produce reasonable performance, this strategy has limitations in the 3D domain. In particular, fine-tuning can deteriorate the original GAN latent space, which affects subsequent semantic editing, and requires independent optimization and storage for each new style, limiting flexibility and efficient deployment. To overcome these challenges, we propose DeformToon3D, an effective toonification framework tailored for hierarchical 3D GAN. Our approach decomposes 3D toonification into subproblems of geometry and texture stylization to better preserve the original latent space. Specifically, we devise a novel StyleField that predicts conditional 3D deformation to align a real-space NeRF to the style space for geometry stylization. Thanks to the StyleField formulation, which already handles geometry stylization well, texture stylization can be achieved conveniently via adaptive style mixing that injects information of the artistic domain into the decoder of the pre-trained 3D GAN. Due to the unique design, our method enables flexible style degree control and shape-texture-specific style swap. Furthermore, we achieve efficient training without any real-world 2D-3D training pairs but proxy samples synthesized from off-the-shelf 2D toonification models.

MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask

  • paper_url: http://arxiv.org/abs/2309.04399
  • repo_url: None
  • paper_authors: Yupeng Zhou, Daquan Zhou, Zuo-Liang Zhu, Yaxing Wang, Qibin Hou, Jiashi Feng
  • for: Improving the consistency between text prompts and generated images in diffusion models.
  • methods: An adaptive mask, conditioned on the attention maps and the prompt embeddings, dynamically adjusts each text token's contribution to the image features in cross-attention, reducing ambiguity in the semantic information embedded by the text encoder. MaskDiffusion is training-free and hot-pluggable into popular pre-trained diffusion models.
  • results: Applied to latent diffusion models, MaskDiffusion significantly improves text-to-image consistency with negligible computation overhead compared to the original diffusion models.
    Abstract Recent advancements in diffusion models have showcased their impressive capacity to generate visually striking images. Nevertheless, ensuring a close match between the generated image and the given prompt remains a persistent challenge. In this work, we identify that a crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning between the prompt and the output image. To better align the prompt and image content, we advance the cross-attention with an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features. This mechanism explicitly diminishes the ambiguity in semantic information embedding from the text encoder, leading to a boost of text-to-image consistency in the synthesized images. Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models. When applied to the latent diffusion models, our MaskDiffusion can significantly improve the text-to-image consistency with negligible computation overhead compared to the original diffusion models.
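The adaptive-mask idea amounts to gating each text token's cross-attention contribution before the softmax. A schematic version, taking the mask as given (in the paper it is derived from the attention maps and prompt embeddings); this is an illustration of the mechanism, not the released code:

```python
import torch

def masked_cross_attention(q, k, v, token_mask):
    """Cross-attention where each text token's contribution is gated by an
    adaptive mask. q: (B, N_img, d) image queries; k, v: (B, N_txt, d) text
    keys/values; token_mask: (B, N_img, N_txt) values in [0, 1]. Adding
    log(mask) to the logits multiplies the post-softmax weight of each
    token by the mask before renormalization."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5              # (B, N_img, N_txt)
    scores = scores + torch.log(token_mask.clamp(min=1e-6))  # soft gating in logit space
    attn = scores.softmax(dim=-1)
    return attn @ v
```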

Language Prompt for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.04379
  • repo_url: https://github.com/wudongming97/prompt4driving
  • paper_authors: Dongming Wu, Wencheng Han, Tiancai Wang, Yingfei Liu, Xiangyu Zhang, Jianbing Shen
  • for: Addressing the scarcity of paired prompt-instance data that holds back the use of natural language prompts in driving scenes.
  • methods: NuPrompt, the first object-centric language prompt set for driving scenes in 3D, multi-view, and multi-frame space, expands the nuScenes dataset with 35,367 language descriptions, each referring to an average of 5.3 object tracks; a new prompt-based driving task uses a language prompt to predict the described object trajectory across views and frames, with a simple end-to-end Transformer baseline, PromptTrack.
  • results: PromptTrack achieves impressive performance on NuPrompt.
    Abstract A new trend in the computer vision community is to capture objects of interest following flexible human command represented by a natural language prompt. However, the progress of using language prompts in driving scenarios is stuck in a bottleneck due to the scarcity of paired prompt-instance data. To address this challenge, we propose the first object-centric language prompt set for driving scenes within 3D, multi-view, and multi-frame space, named NuPrompt. It expands Nuscenes dataset by constructing a total of 35,367 language descriptions, each referring to an average of 5.3 object tracks. Based on the object-text pairs from the new benchmark, we formulate a new prompt-based driving task, \ie, employing a language prompt to predict the described object trajectory across views and frames. Furthermore, we provide a simple end-to-end baseline model based on Transformer, named PromptTrack. Experiments show that our PromptTrack achieves impressive performance on NuPrompt. We hope this work can provide more new insights for the autonomous driving community. Dataset and Code will be made public at \href{https://github.com/wudongming97/Prompt4Driving}{https://github.com/wudongming97/Prompt4Driving}.

CNN Injected Transformer for Image Exposure Correction

  • paper_url: http://arxiv.org/abs/2309.04366
  • repo_url: https://github.com/rebeccaeexu/cit-ec
  • paper_authors: Shuning Xu, Xiangyu Chen, Binbin Song, Jiantao Zhou
  • for: Image exposure correction.
  • methods: A CNN Injected Transformer (CIT) combining window-based self-attention with channel attention blocks (CAB) and half-instance normalization blocks (HINB), trained with carefully formulated loss functions.
  • results: Outperforms state-of-the-art approaches in terms of both quantitative and qualitative metrics.
    Abstract Capturing images with incorrect exposure settings fails to deliver a satisfactory visual experience. Only when the exposure is properly set, can the color and details of the images be appropriately preserved. Previous exposure correction methods based on convolutions often produce exposure deviation in images as a consequence of the restricted receptive field of convolutional kernels. This issue arises because convolutions are not capable of capturing long-range dependencies in images accurately. To overcome this challenge, we can apply the Transformer to address the exposure correction problem, leveraging its capability in modeling long-range dependencies to capture global representation. However, solely relying on the window-based Transformer leads to visually disturbing blocking artifacts due to the application of self-attention in small patches. In this paper, we propose a CNN Injected Transformer (CIT) to harness the individual strengths of CNN and Transformer simultaneously. Specifically, we construct the CIT by utilizing a window-based Transformer to exploit the long-range interactions among different regions in the entire image. Within each CIT block, we incorporate a channel attention block (CAB) and a half-instance normalization block (HINB) to assist the window-based self-attention to acquire the global statistics and refine local features. In addition to the hybrid architecture design for exposure correction, we apply a set of carefully formulated loss functions to improve the spatial coherence and rectify potential color deviations. Extensive experiments demonstrate that our image exposure correction method outperforms state-of-the-art approaches in terms of both quantitative and qualitative metrics.

SSIG: A Visually-Guided Graph Edit Distance for Floor Plan Similarity

  • paper_url: http://arxiv.org/abs/2309.04357
  • repo_url: None
  • paper_authors: Casper van Engelenburg, Seyran Khademi, Jan van Gemert
  • for: Proposing a simple yet effective metric that measures structural similarity between architectural floor plan images, without any learning.
  • methods: SSIG (Structural Similarity by IoU and GED) combines an image distance (Intersection-over-Union) with a graph distance (Graph Edit Distance); an efficient algorithm uses SSIG to rank a large-scale floor plan database.
  • results: Retrieval results are qualitatively similar to those of deeply learned methods, giving a more effective way to compare the structural similarity of floor plans.
    Abstract We propose a simple yet effective metric that measures structural similarity between visual instances of architectural floor plans, without the need for learning. Qualitatively, our experiments show that the retrieval results are similar to deeply learned methods. Effectively comparing instances of floor plan data is paramount to the success of machine understanding of floor plan data, including the assessment of floor plan generative models and floor plan recommendation systems. Comparing visual floor plan images goes beyond a sole pixel-wise visual examination and is crucially about similarities and differences in the shapes and relations between subdivisions that compose the layout. Currently, deep metric learning approaches are used to learn a pair-wise vector representation space that closely mimics the structural similarity, in which the models are trained on similarity labels that are obtained by Intersection-over-Union (IoU). To compensate for the lack of structural awareness in IoU, graph-based approaches such as Graph Matching Networks (GMNs) are used, which require pairwise inference for comparing data instances, making GMNs less practical for retrieval applications. In this paper, an effective evaluation metric for judging the structural similarity of floor plans, coined SSIG (Structural Similarity by IoU and GED), is proposed based on both image and graph distances. In addition, an efficient algorithm is developed that uses SSIG to rank a large-scale floor plan database. Code will be openly available.
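A plausible reading of how an image distance and a graph distance can be blended into one similarity score; the 50/50 blend and the GED normalization below are illustrative assumptions, not the paper's exact definition of SSIG:

```python
import networkx as nx
import numpy as np

def ssig_like(mask_a, mask_b, graph_a, graph_b, alpha=0.5, timeout=10):
    """Combine IoU over rasterized floor plan masks with a normalized graph
    edit distance over room-adjacency graphs. mask_a, mask_b: boolean
    arrays; graph_a, graph_b: networkx graphs."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    iou = inter / max(union, 1)
    ged = nx.graph_edit_distance(graph_a, graph_b, timeout=timeout)  # exact GED, may be slow
    max_size = max(len(graph_a) + graph_a.number_of_edges(),
                   len(graph_b) + graph_b.number_of_edges(), 1)
    graph_sim = 1.0 - min(ged / max_size, 1.0)   # normalize GED into a similarity
    return alpha * iou + (1 - alpha) * graph_sim
```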

Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts

  • paper_url: http://arxiv.org/abs/2309.04354
  • repo_url: None
  • paper_authors: Erik Daxberger, Floris Weers, Bowen Zhang, Tom Gunter, Ruoming Pang, Marcin Eichner, Michael Emmersberger, Yinfei Yang, Alexander Toshev, Xianzhi Du
  • for: Scaling down Vision Transformers (ViTs) with sparse Mixture-of-Experts models (MoEs) to make them more attractive for resource-constrained vision applications.
  • methods: A simplified, mobile-friendly MoE design that routes entire images rather than individual patches to the experts, together with a stable MoE training procedure that uses super-class information to guide the router.
  • results: The sparse Mobile Vision MoEs (V-MoEs) achieve a better trade-off between performance and efficiency than the corresponding dense ViTs: the Mobile V-MoE outperforms dense ViT-Tiny by 3.39% on ImageNet-1k, and improves an even smaller ViT variant with only 54M FLOPs inference cost by 4.66%.
    Abstract Sparse Mixture-of-Experts models (MoEs) have recently gained popularity due to their ability to decouple model size from inference efficiency by only activating a small subset of the model parameters for any given input token. As such, sparse MoEs have enabled unprecedented scalability, resulting in tremendous successes across domains such as natural language processing and computer vision. In this work, we instead explore the use of sparse MoEs to scale-down Vision Transformers (ViTs) to make them more attractive for resource-constrained vision applications. To this end, we propose a simplified and mobile-friendly MoE design where entire images rather than individual patches are routed to the experts. We also propose a stable MoE training procedure that uses super-class information to guide the router. We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs. For example, for the ViT-Tiny model, our Mobile V-MoE outperforms its dense counterpart by 3.39% on ImageNet-1k. For an even smaller ViT variant with only 54M FLOPs inference cost, our MoE achieves an improvement of 4.66%.
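Routing whole images rather than patches means one routing decision per input, so only a single expert's FFN runs per image. A minimal PyTorch sketch of this design (dimensions and expert count are placeholders; the paper's super-class-guided router training is not shown):

```python
import torch
import torch.nn as nn

class ImageLevelMoE(nn.Module):
    """MoE layer that routes the whole image (via a pooled token) to a
    single expert, rather than routing each patch independently."""
    def __init__(self, dim=192, hidden=384, n_experts=8):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts))

    def forward(self, tokens):                          # tokens: (B, N, dim)
        pooled = tokens.mean(dim=1)                     # one routing decision per image
        weights = self.router(pooled).softmax(dim=-1)   # (B, n_experts)
        expert_idx = weights.argmax(dim=-1)             # top-1 expert per image
        out = torch.empty_like(tokens)
        for i, e in enumerate(expert_idx.tolist()):
            out[i] = self.experts[e](tokens[i])         # only the chosen expert runs
        return out
```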

Revealing the preference for correcting separated aberrations in joint optic-image design

  • paper_url: http://arxiv.org/abs/2309.04342
  • repo_url: None
  • paper_authors: Jingwen Zhou, Shiqi Chen, Zheng Ren, Wenguan Zhang, Jiapu Yan, Huajun Feng, Qi Li, Yueting Chen
  • for: Jointly designing the optical system and the downstream algorithm, enabling efficient joint design of complex systems such as smartphones and drones.
  • methods: Starting from the optical design perspective, the optics are characterized by separated aberrations; an image simulation system reproduces the genuine imaging procedure of lenses with large fields of view; and a network perceives and corrects the spatially varying aberrations, outperforming state-of-the-art methods.
  • results: Experiments reveal the preferred order for correcting separated aberrations in joint design: longitudinal chromatic aberration, lateral chromatic aberration, spherical aberration, field curvature, and coma, with astigmatism last. Drawing on this preference, a 10% reduction in the total track length of a consumer-level mobile phone lens module is achieved, while sparing more room for manufacturing deviations and enabling extreme-quality computational photography.
    Abstract The joint design of the optical system and the downstream algorithm is a challenging and promising task. Due to the demand for balancing the global optimal of imaging systems and the computational cost of physical simulation, existing methods cannot achieve efficient joint design of complex systems such as smartphones and drones. In this work, starting from the perspective of the optical design, we characterize the optics with separated aberrations. Additionally, to bridge the hardware and software without gradients, an image simulation system is presented to reproduce the genuine imaging procedure of lenses with large field-of-views. As for aberration correction, we propose a network to perceive and correct the spatially varying aberrations and validate its superiority over state-of-the-art methods. Comprehensive experiments reveal that the preference for correcting separated aberrations in joint design is as follows: longitudinal chromatic aberration, lateral chromatic aberration, spherical aberration, field curvature, and coma, with astigmatism coming last. Drawing from the preference, a 10% reduction in the total track length of the consumer-level mobile phone lens module is accomplished. Moreover, this procedure spares more space for manufacturing deviations, realizing extreme-quality enhancement of computational photography. The optimization paradigm provides innovative insight into the practical joint design of sophisticated optical systems and post-processing algorithms.

Leveraging Model Fusion for Improved License Plate Recognition

  • paper_url: http://arxiv.org/abs/2309.04331
  • repo_url: None
  • paper_authors: Rayson Laroca, Luiz A. Zanlorensi, Valter Estevam, Rodrigo Minetto, David Menotti
  • for: Investigating the potential gains from fusing the outputs of multiple license plate recognition (LPR) models.
  • methods: Combining up to 12 different models with straightforward approaches, such as selecting the most confident prediction or majority vote-based strategies.
  • results: Across a wide range of datasets, fusion considerably reduces the likelihood of subpar performance on a particular dataset/scenario, in both intra- and cross-dataset setups. Combining models based on their speed is also appealing: for applications that can tolerate some additional time, fusing 4-6 models strikes an optimal balance between accuracy and speed.
    Abstract License Plate Recognition (LPR) plays a critical role in various applications, such as toll collection, parking management, and traffic law enforcement. Although LPR has witnessed significant advancements through the development of deep learning, there has been a noticeable lack of studies exploring the potential improvements in results by fusing the outputs from multiple recognition models. This research aims to fill this gap by investigating the combination of up to 12 different models using straightforward approaches, such as selecting the most confident prediction or employing majority vote-based strategies. Our experiments encompass a wide range of datasets, revealing substantial benefits of fusion approaches in both intra- and cross-dataset setups. Essentially, fusing multiple models reduces considerably the likelihood of obtaining subpar performance on a particular dataset/scenario. We also found that combining models based on their speed is an appealing approach. Specifically, for applications where the recognition task can tolerate some additional time, though not excessively, an effective strategy is to combine 4-6 models. These models may not be the most accurate individually, but their fusion strikes an optimal balance between accuracy and speed.
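The two fusion strategies named in the abstract are simple to state precisely. A sketch combining them: majority vote over the predicted plate strings, with the most confident prediction breaking ties:

```python
from collections import Counter

def fuse_plate_predictions(predictions):
    """Fuse license plate strings from several recognizers.
    `predictions` is a list of (plate_string, confidence) pairs, one per
    model. Majority vote over the full strings; the highest single
    confidence breaks ties."""
    votes = Counter(p for p, _ in predictions)
    top_count = max(votes.values())
    tied = {p for p, c in votes.items() if c == top_count}
    # Among tied strings, fall back to the most confident prediction.
    return max((pc for pc in predictions if pc[0] in tied), key=lambda pc: pc[1])[0]

print(fuse_plate_predictions([("ABC1234", 0.91), ("ABC1234", 0.85), ("A8C1234", 0.97)]))
# -> "ABC1234": two votes beat one higher-confidence outlier
```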

AMLP:Adaptive Masking Lesion Patches for Self-supervised Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.04312
  • repo_url: None
  • paper_authors: Xiangtao Wang, Ruizhi Wang, Jie Zhou, Thomas Lukasiewicz, Zhenghua Xu
  • for: Making self-supervised masked image modeling effective for medical image segmentation, where lesions are complex and differ markedly from natural images.
  • methods: A self-supervised framework, Adaptive Masking Lesion Patches (AMLP), comprising a Masked Patch Selection (MPS) strategy to focus learning on patches containing lesions, an Attention Reconstruction Loss (ARL) emphasizing hard-to-reconstruct patches likely depicting lesions, a Category Consistency Loss (CCL) that refines patch categorization by reconstruction difficulty, and an Adaptive Masking Ratio (AMR) strategy that gradually increases the masking ratio.
  • results: Extensive experiments on two medical segmentation datasets show AMLP's superior performance over existing self-supervised approaches, capturing the fine lesion details vital for segmentation.
    Abstract Self-supervised masked image modeling has shown promising results on natural images. However, directly applying such methods to medical images remains challenging. This difficulty stems from the complexity and distinct characteristics of lesions compared to natural images, which impedes effective representation learning. Additionally, conventional high fixed masking ratios restrict reconstructing fine lesion details, limiting the scope of learnable information. To tackle these limitations, we propose a novel self-supervised medical image segmentation framework, Adaptive Masking Lesion Patches (AMLP). Specifically, we design a Masked Patch Selection (MPS) strategy to identify and focus learning on patches containing lesions. Lesion regions are scarce yet critical, making their precise reconstruction vital. To reduce misclassification of lesion and background patches caused by unsupervised clustering in MPS, we introduce an Attention Reconstruction Loss (ARL) to focus on hard-to-reconstruct patches likely depicting lesions. We further propose a Category Consistency Loss (CCL) to refine patch categorization based on reconstruction difficulty, strengthening distinction between lesions and background. Moreover, we develop an Adaptive Masking Ratio (AMR) strategy that gradually increases the masking ratio to expand reconstructible information and improve learning. Extensive experiments on two medical segmentation datasets demonstrate AMLP's superior performance compared to existing self-supervised approaches. The proposed strategies effectively address limitations in applying masked modeling to medical images, tailored to capturing fine lesion details vital for segmentation tasks.

Have We Ever Encountered This Before? Retrieving Out-of-Distribution Road Obstacles from Driving Scenes

  • paper_url: http://arxiv.org/abs/2309.04302
  • repo_url: None
  • paper_authors: Youssef Shoeb, Robin Chan, Gesina Schwalbe, Azarm Nowzard, Fatma Güney, Hanno Gottschalk
  • for: Retrieving out-of-distribution (OoD) road obstacles from driving scenes via text queries, to support rapid debugging of data-driven automated driving systems.
  • methods: Builds on recent advances in OoD segmentation and multi-modal foundation models to identify and efficiently extract sequences of safety-relevant scenes from unlabeled videos, curating a collection of OoD data for subsequent analysis.
  • results: A first approach to the novel task of text-based OoD object retrieval, answering the question "Have we ever encountered this before?".
    Abstract In the life cycle of highly automated systems operating in an open and dynamic environment, the ability to adjust to emerging challenges is crucial. For systems integrating data-driven AI-based components, rapid responses to deployment issues require fast access to related data for testing and reconfiguration. In the context of automated driving, this especially applies to road obstacles that were not included in the training data, commonly referred to as out-of-distribution (OoD) road obstacles. Given the availability of large uncurated recordings of driving scenes, a pragmatic approach is to query a database to retrieve similar scenarios featuring the same safety concerns due to OoD road obstacles. In this work, we extend beyond identifying OoD road obstacles in video streams and offer a comprehensive approach to extract sequences of OoD road obstacles using text queries, thereby proposing a way of curating a collection of OoD data for subsequent analysis. Our proposed method leverages the recent advances in OoD segmentation and multi-modal foundation models to identify and efficiently extract safety-relevant scenes from unlabeled videos. We present a first approach for the novel task of text-based OoD object retrieval, which addresses the question ''Have we ever encountered this before?''.
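Text-based retrieval over a curated OoD collection reduces to nearest-neighbor search in a shared embedding space. A sketch assuming the obstacle crops and the text query have already been embedded by a CLIP-style multi-modal model (names and shapes are illustrative):

```python
import numpy as np

def retrieve_by_text(text_emb, scene_embs, scene_ids, k=5):
    """Rank a database of embedded OoD obstacle scenes against one text
    query embedding by cosine similarity. text_emb: (D,); scene_embs:
    (N, D); scene_ids: length-N list of scene identifiers."""
    t = text_emb / np.linalg.norm(text_emb)
    s = scene_embs / np.linalg.norm(scene_embs, axis=1, keepdims=True)
    sims = s @ t                                   # cosine similarity per scene
    top = np.argsort(-sims)[:k]                    # indices of the k best matches
    return [(scene_ids[i], float(sims[i])) for i in top]
```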

How Can We Tame the Long-Tail of Chest X-ray Datasets?

  • paper_url: http://arxiv.org/abs/2309.04293
  • repo_url: None
  • paper_authors: Arsh Verma
  • for: Automated inference of abnormalities in chest X-rays, focusing on labels that are rare but may be of high significance (the long tail).
  • methods: Rather than new loss functions or re-sampling/re-weighting schemes, choose a model initialization that is closer to the domain of the target dataset; the veracity of synthetically generated data for augmenting tail labels is also examined.
  • results: Significant performance gains from the initialization choice alone; the approach complements existing techniques, scales easily to new labels, and the contribution of synthetic tail-label data to model performance is analyzed.
    Abstract Chest X-rays (CXRs) are a medical imaging modality that is used to infer a large number of abnormalities. While it is hard to define an exhaustive list of these abnormalities, which may co-occur on a chest X-ray, few of them are quite commonly observed and are abundantly represented in CXR datasets used to train deep learning models for automated inference. However, it is challenging for current models to learn independent discriminatory features for labels that are rare but may be of high significance. Prior works focus on the combination of multi-label and long tail problems by introducing novel loss functions or some mechanism of re-sampling or re-weighting the data. Instead, we propose that it is possible to achieve significant performance gains merely by choosing an initialization for a model that is closer to the domain of the target dataset. This method can complement the techniques proposed in existing literature, and can easily be scaled to new labels. Finally, we also examine the veracity of synthetically generated data to augment the tail labels and analyse its contribution to improving model performance.

The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion

  • paper_url: http://arxiv.org/abs/2309.04509
  • repo_url: None
  • paper_authors: Yujin Jeong, Wonjeong Ryoo, Seunghyun Lee, Dabin Seo, Wonmin Byeon, Sangpil Kim, Jinkyu Kim
  • for: Audio-to-video generation, specifically incorporating the temporal semantics and magnitude of an audio input so that the generated video content reacts to the sound.
  • methods: The model combines a latent stable diffusion model carrying textual semantic information with sequential audio embeddings from a pretrained audio encoder to generate video frames.
  • results: The method performs well across multiple tasks and compares favorably with the state of the art in audio-to-video generation; more examples are available at https://ku-vai.github.io/TPoS/.
    Abstract In recent years, video generation has become a prominent generative tool and has drawn significant attention. However, there is little consideration in audio-to-video generation, though audio contains unique qualities like temporal semantics and magnitude. Hence, we propose The Power of Sound (TPoS) model to incorporate audio input that includes both changeable temporal semantics and magnitude. To generate video frames, TPoS utilizes a latent stable diffusion model with textual semantic information, which is then guided by the sequential audio embedding from our pretrained Audio Encoder. As a result, this method produces audio reactive video contents. We demonstrate the effectiveness of TPoS across various tasks and compare its results with current state-of-the-art techniques in the field of audio-to-video generation. More examples are available at https://ku-vai.github.io/TPoS/

Towards Practical Capture of High-Fidelity Relightable Avatars

  • paper_url: http://arxiv.org/abs/2309.04247
  • repo_url: None
  • paper_authors: Haotian Yang, Mingwu Zheng, Wanquan Feng, Haibin Huang, Yu-Kun Lai, Pengfei Wan, Zhongyuan Wang, Chongyang Ma
  • for: Practical capture and reconstruction of high-fidelity, relightable 3D avatars.
  • methods: Training on dynamic image sequences captured under varying lighting conditions in a Light Stage, enabling realistic relighting and real-time animation.
  • results: A high-quality capture and reconstruction method that achieves realistic relighting and animation effects across diverse scenes.
    Abstract In this paper, we propose a novel framework, Tracking-free Relightable Avatar (TRAvatar), for capturing and reconstructing high-fidelity 3D avatars. Compared to previous methods, TRAvatar works in a more practical and efficient setting. Specifically, TRAvatar is trained with dynamic image sequences captured in a Light Stage under varying lighting conditions, enabling realistic relighting and real-time animation for avatars in diverse scenes. Additionally, TRAvatar allows for tracking-free avatar capture and obviates the need for accurate surface tracking under varying illumination conditions. Our contributions are two-fold: First, we propose a novel network architecture that explicitly builds on and ensures the satisfaction of the linear nature of lighting. Trained on simple group light captures, TRAvatar can predict the appearance in real-time with a single forward pass, achieving high-quality relighting effects under illuminations of arbitrary environment maps. Second, we jointly optimize the facial geometry and relightable appearance from scratch based on image sequences, where the tracking is implicitly learned. This tracking-free approach brings robustness for establishing temporal correspondences between frames under different lighting conditions. Extensive qualitative and quantitative experiments demonstrate that our framework achieves superior performance for photorealistic avatar animation and relighting.
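
The "linear nature of lighting" that the network builds on can be stated in a few lines: the appearance under any environment map is a weighted sum of the appearances under the basis lights. A sketch, with shapes assumed for illustration:

```python
import torch

def relight(basis_appearance, env_coeffs):
    """
    Exploit the linearity of light transport: if basis_appearance[i] is the
    avatar's appearance under basis light i (e.g. one group of Light Stage
    LEDs), the appearance under any environment map is a weighted sum of
    these bases, with weights obtained by projecting the environment map
    onto the same basis. Shapes here are illustrative assumptions:
    basis_appearance (L, 3, H, W), env_coeffs (L,).
    """
    return torch.einsum("l,lchw->chw", env_coeffs, basis_appearance)
```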

Unsupervised Gaze-aware Contrastive Learning with Subject-specific Condition

  • paper_url: http://arxiv.org/abs/2309.04506
  • repo_url: None
  • paper_authors: Lingyu Du, Xucong Zhang, Guohao Lan
  • for: Improving appearance-based gaze estimation across multiple gaze datasets, using a single general-purpose camera as the input device.
  • methods: The ConGaze framework learns subject-independent, gaze-aware representations from unlabeled facial images in an unsupervised way, using gaze-specific data augmentation and a subject-conditional projection module to preserve gaze-semantic features and gaze consistency.
  • results: ConGaze outperforms existing unsupervised learning solutions by 6.7% to 22.5% on three public gaze estimation datasets, and improves on its supervised learning-based counterpart by 15.1% to 24.6% in cross-dataset evaluations.
    Abstract Appearance-based gaze estimation has shown great promise in many applications by using a single general-purpose camera as the input device. However, its success is highly depending on the availability of large-scale well-annotated gaze datasets, which are sparse and expensive to collect. To alleviate this challenge we propose ConGaze, a contrastive learning-based framework that leverages unlabeled facial images to learn generic gaze-aware representations across subjects in an unsupervised way. Specifically, we introduce the gaze-specific data augmentation to preserve the gaze-semantic features and maintain the gaze consistency, which are proven to be crucial for effective contrastive gaze representation learning. Moreover, we devise a novel subject-conditional projection module that encourages a share feature extractor to learn gaze-aware and generic representations. Our experiments on three public gaze estimation datasets show that ConGaze outperforms existing unsupervised learning solutions by 6.7% to 22.5%; and achieves 15.1% to 24.6% improvement over its supervised learning-based counterpart in cross-dataset evaluations.
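
For reference, a standard NT-Xent contrastive loss of the kind such frameworks build on; ConGaze's gaze-specific augmentations and subject-conditional projection are not shown, so this is a generic sketch rather than the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.1):
    """
    NT-Xent loss between two augmented views z1, z2 of shape (N, d):
    each sample's positive is its other view; all remaining 2N - 2
    samples in the batch act as negatives.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                       # (2N, d)
    sim = z @ z.T / tau                                  # cosine similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float("-inf"))                # drop self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```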

FIVA: Facial Image and Video Anonymization and Anonymization Defense

  • paper_url: http://arxiv.org/abs/2309.04228
  • repo_url: None
  • paper_authors: Felix Rosberg, Eren Erdal Aksoy, Cristofer Englund, Fernando Alonso-Fernandez
  • for: A new facial anonymization method for images and videos that protects personal privacy.
  • methods: The method uses the proposed identity tracking and strong anonymization to keep face anonymization consistent across frames and to withstand reconstruction attacks.
  • results: The method guarantees 0 true positives at a false acceptance rate of 0.001, and additionally enables face swapping.
    Abstract In this paper, we present a new approach for facial anonymization in images and videos, abbreviated as FIVA. Our proposed method is able to maintain the same face anonymization consistently over frames with our suggested identity-tracking and guarantees a strong difference from the original face. FIVA allows for 0 true positives for a false acceptance rate of 0.001. Our work considers the important security issue of reconstruction attacks and investigates adversarial noise, uniform noise, and parameter noise to disrupt reconstruction attacks. In this regard, we apply different defense and protection methods against these privacy threats to demonstrate the scalability of FIVA. On top of this, we also show that reconstruction attack models can be used for detection of deep fakes. Last but not least, we provide experimental results showing how FIVA can even enable face swapping, which is purely trained on a single target image.

Long-Range Correlation Supervision for Land-Cover Classification from Remote Sensing Images

  • paper_url: http://arxiv.org/abs/2309.04225
  • repo_url: None
  • paper_authors: Dawen Yu, Shunping Ji
  • for: A deep-learning land-cover classification method that improves long-range correlation modeling for large remote sensing images.
  • methods: The supervised long-range correlation network (SLCNet) directly supervises long-range dependency modeling with the category-consistency information available in the ground truth, and adds an auxiliary adaptive receptive field feature extraction module to capture finely detailed feature representations for multi-size objects in multi-scale remote sensing images.
  • results: On three remote sensing datasets, SLCNet achieved state-of-the-art performance compared with advanced segmentation methods from the computer vision, medical, and remote sensing communities.
    Abstract Long-range dependency modeling has been widely considered in modern deep learning based semantic segmentation methods, especially those designed for large-size remote sensing images, to compensate the intrinsic locality of standard convolutions. However, in previous studies, the long-range dependency, modeled with an attention mechanism or transformer model, has been based on unsupervised learning, instead of explicit supervision from the objective ground truth. In this paper, we propose a novel supervised long-range correlation method for land-cover classification, called the supervised long-range correlation network (SLCNet), which is shown to be superior to the currently used unsupervised strategies. In SLCNet, pixels sharing the same category are considered highly correlated and those having different categories are less relevant, which can be easily supervised by the category consistency information available in the ground truth semantic segmentation map. Under such supervision, the recalibrated features are more consistent for pixels of the same category and more discriminative for pixels of other categories, regardless of their proximity. To complement the detailed information lacking in the global long-range correlation, we introduce an auxiliary adaptive receptive field feature extraction module, parallel to the long-range correlation module in the encoder, to capture finely detailed feature representations for multi-size objects in multi-scale remote sensing images. In addition, we apply multi-scale side-output supervision and a hybrid loss function as local and global constraints to further boost the segmentation accuracy. Experiments were conducted on three remote sensing datasets. Compared with the advanced segmentation methods from the computer vision, medicine, and remote sensing communities, the SLCNet achieved a state-of-the-art performance on all the datasets.
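
One way to picture the supervision signal: sample pixel embeddings, form their pairwise affinities, and supervise them with same-class/different-class targets read off the ground-truth map. A simplified sketch (the paper's exact loss and sampling strategy may differ):

```python
import torch
import torch.nn.functional as F

def correlation_supervision(feats, labels, tau=0.1):
    """
    Supervise pairwise feature affinities with category consistency:
    pixels sharing a ground-truth class should be highly correlated,
    pixels of different classes should not, regardless of proximity.
    feats: (N, d) sampled pixel embeddings; labels: (N,) class ids.
    """
    feats = F.normalize(feats, dim=1)
    affinity = feats @ feats.T                        # (N, N), in [-1, 1]
    target = (labels[:, None] == labels[None, :]).float()
    return F.binary_cross_entropy_with_logits(affinity / tau, target)
```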

Score-PA: Score-based 3D Part Assembly

  • paper_url: http://arxiv.org/abs/2309.04220
  • repo_url: https://github.com/j-f-cheng/score-pa_score-based-3d-part-assembly
  • paper_authors: Junfeng Cheng, Mingdong Wu, Ruiyuan Zhang, Guanqi Zhan, Chao Wu, Hao Dong
  • for: A generative approach to autonomous 3D part assembly, a challenging problem in robotics and 3D computer vision.
  • methods: The Score-based 3D Part Assembly (Score-PA) framework formulates part assembly from a generative perspective; a Fast Predictor-Corrector Sampler (FPC) accelerates the sampling process within the framework.
  • results: Evaluated with several metrics of assembly quality and diversity, the algorithm outperforms existing state-of-the-art approaches.
    Abstract Autonomous 3D part assembly is a challenging task in the areas of robotics and 3D computer vision. This task aims to assemble individual components into a complete shape without relying on predefined instructions. In this paper, we formulate this task from a novel generative perspective, introducing the Score-based 3D Part Assembly framework (Score-PA) for 3D part assembly. Knowing that score-based methods are typically time-consuming during the inference stage. To address this issue, we introduce a novel algorithm called the Fast Predictor-Corrector Sampler (FPC) that accelerates the sampling process within the framework. We employ various metrics to assess assembly quality and diversity, and our evaluation results demonstrate that our algorithm outperforms existing state-of-the-art approaches. We release our code at https://github.com/J-F-Cheng/Score-PA_Score-based-3D-Part-Assembly.
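
For context, a plain predictor-corrector sampling loop of the kind FPC accelerates; the update rules below follow the standard VE-SDE sampler and are only a reference sketch, not the authors' algorithm:

```python
import torch

def pc_sample(score_fn, x, sigmas, snr=0.16, corrector_steps=1):
    """
    Generic predictor-corrector sampling for a score-based model.
    `sigmas` is a decreasing sequence of noise levels (floats);
    `score_fn(x, sigma)` returns the learned score (log-density gradient).
    """
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Corrector: Langevin MCMC step(s) at the current noise level.
        for _ in range(corrector_steps):
            grad = score_fn(x, sigma)
            noise = torch.randn_like(x)
            step = 2 * (snr * noise.norm() / grad.norm()) ** 2
            x = x + step * grad + torch.sqrt(2 * step) * noise
        # Predictor: one reverse-diffusion step to the next noise level.
        grad = score_fn(x, sigma)
        x = x + (sigma ** 2 - sigma_next ** 2) * grad
        x = x + (sigma ** 2 - sigma_next ** 2) ** 0.5 * torch.randn_like(x)
    return x
```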

SegmentAnything helps microscopy images based automatic and quantitative organoid detection and analysis

  • paper_url: http://arxiv.org/abs/2309.04190
  • repo_url: https://github.com/xiaodanxing/sam4organoid
  • paper_authors: Xiaodan Xing, Chunling Tang, Yunzhe Guo, Nicholas Kurniawan, Guang Yang
  • for: studying organ development, drug discovery, and toxicity assessment
  • methods: leveraging SegmentAnything for precise demarcation of individual organoids, and introducing a set of morphological properties for quantitative analysis
  • results: close alignment with manual organoid detection and measurement, demonstrating the effectiveness of the proposed method in accelerating organoid morphology analysis
    Abstract Organoids are self-organized 3D cell clusters that closely mimic the architecture and function of in vivo tissues and organs. Quantification of organoid morphology helps in studying organ development, drug discovery, and toxicity assessment. Recent microscopy techniques provide a potent tool to acquire organoid morphology features, but manual image analysis remains a labor and time-intensive process. Thus, this paper proposes a comprehensive pipeline for microscopy analysis that leverages the SegmentAnything to precisely demarcate individual organoids. Additionally, we introduce a set of morphological properties, including perimeter, area, radius, non-smoothness, and non-circularity, allowing researchers to analyze the organoid structures quantitatively and automatically. To validate the effectiveness of our approach, we conducted tests on bright-field images of human induced pluripotent stem cells (iPSCs) derived neural-epithelial (NE) organoids. The results obtained from our automatic pipeline closely align with manual organoid detection and measurement, showcasing the capability of our proposed method in accelerating organoids morphology analysis.
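
A small sketch of the quantitative step, assuming a binary mask per image (e.g. produced by SegmentAnything); the non-circularity and non-smoothness formulas below are common choices and only assumptions about the paper's exact definitions:

```python
import numpy as np
from skimage import measure

def organoid_morphology(mask):
    """
    Compute per-organoid shape descriptors from a binary mask:
    perimeter, area, radius, non-circularity, and non-smoothness.
    """
    stats = []
    for r in measure.regionprops(measure.label(mask.astype(np.uint8))):
        area, perim = r.area, r.perimeter
        radius = r.equivalent_diameter / 2.0            # radius of equal-area circle
        non_circularity = 1.0 - 4.0 * np.pi * area / max(perim ** 2, 1e-8)
        hull = measure.regionprops(r.convex_image.astype(np.uint8))[0]
        non_smoothness = perim / max(hull.perimeter, 1e-8)  # >1 for ragged boundaries
        stats.append(dict(area=area, perimeter=perim, radius=radius,
                          non_circularity=non_circularity,
                          non_smoothness=non_smoothness))
    return stats
```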

Stereo Matching in Time: 100+ FPS Video Stereo Matching for Extended Reality

  • paper_url: http://arxiv.org/abs/2309.04183
  • repo_url: None
  • paper_authors: Ziang Cheng, Jiayu Yang, Hongdong Li
  • for: Real-time depth inference on low-power devices, improving the performance of extended reality (XR) applications.
  • methods: A new synthetic video stereo dataset and a video-based stereo matching approach that exploits temporal correlations and a caching mechanism to improve efficiency without sacrificing accuracy.
  • results: The method runs at 134 fps on a standard desktop computer, or 30 fps on a battery-powered VR/AR head-mounted display, achieving state-of-the-art performance.
    Abstract Real-time Stereo Matching is a cornerstone algorithm for many Extended Reality (XR) applications, such as indoor 3D understanding, video pass-through, and mixed-reality games. Despite significant advancements in deep stereo methods, achieving real-time depth inference with high accuracy on a low-power device remains a major challenge. One of the major difficulties is the lack of high-quality indoor video stereo training datasets captured by head-mounted VR/AR glasses. To address this issue, we introduce a novel video stereo synthetic dataset that comprises photorealistic renderings of various indoor scenes and realistic camera motion captured by a 6-DoF moving VR/AR head-mounted display (HMD). This facilitates the evaluation of existing approaches and promotes further research on indoor augmented reality scenarios. Our newly proposed dataset enables us to develop a novel framework for continuous video-rate stereo matching. As another contribution, our dataset enables us to proposed a new video-based stereo matching approach tailored for XR applications, which achieves real-time inference at an impressive 134fps on a standard desktop computer, or 30fps on a battery-powered HMD. Our key insight is that disparity and contextual information are highly correlated and redundant between consecutive stereo frames. By unrolling an iterative cost aggregation in time (i.e. in the temporal dimension), we are able to distribute and reuse the aggregated features over time. This approach leads to a substantial reduction in computation without sacrificing accuracy. We conducted extensive evaluations and comparisons and demonstrated that our method achieves superior performance compared to the current state-of-the-art, making it a strong contender for real-time stereo matching in VR/AR applications.

Unsupervised Object Localization with Representer Point Selection

  • paper_url: http://arxiv.org/abs/2309.04172
  • repo_url: https://github.com/yeonghwansong/uolwrps
  • paper_authors: Yeonghwan Song, Seokwoo Jang, Dina Katabi, Jeany Son
  • for: A new unsupervised object localization method whose predictions can be explained.
  • methods: Based on representer point selection: the model's predictions are represented as a linear combination of representer values of training points, so the most important examples for a prediction, together with their importance, explain how the model localizes the foreground object.
  • results: The method outperforms state-of-the-art unsupervised and self-supervised object localization methods on various datasets by significant margins, and even surpasses recent weakly supervised and few-shot methods.
    Abstract We propose a novel unsupervised object localization method that allows us to explain the predictions of the model by utilizing self-supervised pre-trained models without additional finetuning. Existing unsupervised and self-supervised object localization methods often utilize class-agnostic activation maps or self-similarity maps of a pre-trained model. Although these maps can offer valuable information for localization, their limited ability to explain how the model makes predictions remains challenging. In this paper, we propose a simple yet effective unsupervised object localization method based on representer point selection, where the predictions of the model can be represented as a linear combination of representer values of training points. By selecting representer points, which are the most important examples for the model predictions, our model can provide insights into how the model predicts the foreground object by providing relevant examples as well as their importance. Our method outperforms the state-of-the-art unsupervised and self-supervised object localization methods on various datasets with significant margins and even outperforms recent weakly supervised and few-shot methods.
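
The representer-point view can be made concrete in a few lines: the model's (pre-activation) output decomposes into per-training-example contributions, and sorting those contributions exposes the most important examples. A sketch of the principle (the paper builds its localization maps on top of self-supervised features):

```python
import torch

def representer_values(train_feats, test_feat, alphas):
    """
    Decompose a prediction as f(x) = sum_i alpha_i * <phi(x_i), phi(x)>.
    train_feats: (N, d) training embeddings phi(x_i); test_feat: (d,);
    alphas: (N,) representer coefficients. Returns each training point's
    contribution and an ordering from most to least important.
    """
    sims = train_feats @ test_feat            # (N,) inner products
    contrib = alphas * sims                   # per-point contribution to f(x)
    order = contrib.argsort(descending=True)  # most important first
    return contrib, order
```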

PRISTA-Net: Deep Iterative Shrinkage Thresholding Network for Coded Diffraction Patterns Phase Retrieval

  • paper_url: http://arxiv.org/abs/2309.04171
  • repo_url: https://github.com/liuaxou/prista-net
  • paper_authors: Aoxu Liu, Xiaohong Fan, Yin Yang, Jianping Zhang
  • for: PRISTA-Net is designed to solve phase retrieval (PR), a challenging nonlinear inverse problem in computational imaging and image processing.
  • methods: PRISTA-Net is a deep unfolding network (DUN) based on the first-order iterative shrinkage thresholding algorithm (ISTA). It addresses the proximal-point mapping sub-problem associated with sparse priors via a learnable nonlinear transformation, uses an attention mechanism to focus on phase information containing image edges, textures, and structures, and uses the fast Fourier transform (FFT) to learn global features that enhance local information.
  • results: Experiments on Coded Diffraction Patterns (CDPs) measurements demonstrate that PRISTA-Net outperforms existing state-of-the-art methods in both qualitative and quantitative evaluations.
    Abstract The problem of phase retrieval (PR) involves recovering an unknown image from limited amplitude measurement data and is a challenge nonlinear inverse problem in computational imaging and image processing. However, many of the PR methods are based on black-box network models that lack interpretability and plug-and-play (PnP) frameworks that are computationally complex and require careful parameter tuning. To address this, we have developed PRISTA-Net, a deep unfolding network (DUN) based on the first-order iterative shrinkage thresholding algorithm (ISTA). This network utilizes a learnable nonlinear transformation to address the proximal-point mapping sub-problem associated with the sparse priors, and an attention mechanism to focus on phase information containing image edges, textures, and structures. Additionally, the fast Fourier transform (FFT) is used to learn global features to enhance local information, and the designed logarithmic-based loss function leads to significant improvements when the noise level is low. All parameters in the proposed PRISTA-Net framework, including the nonlinear transformation, threshold parameters, and step size, are learned end-to-end instead of being manually set. This method combines the interpretability of traditional methods with the fast inference ability of deep learning and is able to handle noise at each iteration during the unfolding stage, thus improving recovery quality. Experiments on Coded Diffraction Patterns (CDPs) measurements demonstrate that our approach outperforms the existing state-of-the-art methods in terms of qualitative and quantitative evaluations. Our source codes are available at \emph{https://github.com/liuaxou/PRISTA-Net}.
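
For reference, the first-order ISTA iteration that such networks unroll, written for the plain linear/l1 case. PRISTA-Net replaces the fixed soft-thresholding with a learned nonlinear transform and learns the step size and threshold end-to-end, and phase retrieval itself has a nonlinear forward model, so this is only a sketch of the backbone iteration:

```python
import torch

def soft_threshold(x, theta):
    """Proximal operator of the l1 norm (shrinkage-thresholding)."""
    return torch.sign(x) * torch.clamp(x.abs() - theta, min=0.0)

def ista(A, y, steps=100, alpha=0.5, lam=0.01):
    """
    Plain ISTA for min_x 0.5 * ||Ax - y||^2 + lam * ||x||_1.
    A: (m, n) measurement matrix; y: (m,) observations.
    """
    x = torch.zeros(A.shape[1], dtype=A.dtype)
    for _ in range(steps):
        grad = A.T @ (A @ x - y)                   # gradient of the data term
        x = soft_threshold(x - alpha * grad, alpha * lam)
    return x
```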

Grouping Boundary Proposals for Fast Interactive Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.04169
  • repo_url: None
  • paper_authors: Li Liu, Da Chen, Minglei Shu, Laurent D. Cohen
  • for: This paper proposes a new image segmentation model that leverages the minimal geodesic framework and adaptive cut-based circular optimal path computation scheme to improve the accuracy and efficiency of image segmentation.
  • methods: The proposed model combines the minimal geodesic framework with an adaptive cut-based circular optimal path computation scheme and a graph-based boundary proposals grouping scheme to segment images.
  • results: Experimental results show that the proposed model outperforms state-of-the-art minimal paths-based image segmentation approaches.
    Abstract Geodesic models are known as an efficient tool for solving various image segmentation problems. Most of existing approaches only exploit local pointwise image features to track geodesic paths for delineating the objective boundaries. However, such a segmentation strategy cannot take into account the connectivity of the image edge features, increasing the risk of shortcut problem, especially in the case of complicated scenario. In this work, we introduce a new image segmentation model based on the minimal geodesic framework in conjunction with an adaptive cut-based circular optimal path computation scheme and a graph-based boundary proposals grouping scheme. Specifically, the adaptive cut can disconnect the image domain such that the target contours are imposed to pass through this cut only once. The boundary proposals are comprised of precomputed image edge segments, providing the connectivity information for our segmentation model. These boundary proposals are then incorporated into the proposed image segmentation model, such that the target segmentation contours are made up of a set of selected boundary proposals and the corresponding geodesic paths linking them. Experimental results show that the proposed model indeed outperforms state-of-the-art minimal paths-based image segmentation approaches.

Context-Aware Prompt Tuning for Vision-Language Model with Dual-Alignment

  • paper_url: http://arxiv.org/abs/2309.04158
  • repo_url: None
  • paper_authors: Hongyu Hu, Tiancheng Lin, Jie Wang, Zhenbang Sun, Yi Xu
  • for: Improving the adaptability of vision-language models (VLMs) to downstream tasks.
  • methods: Dual-Aligned Prompt Tuning (DuAl-PT) combines pre-trained large language models (LLMs) with learnable prompts, aligning the prompts both with LLM-generated context descriptions and with local image features to benefit from explicit and implicit context modeling.
  • results: DuAl-PT achieves superior performance on 11 downstream datasets for few-shot recognition and base-to-new generalization.
    Abstract Large-scale vision-language models (VLMs), e.g., CLIP, learn broad visual concepts from tedious training data, showing superb generalization ability. Amount of prompt learning methods have been proposed to efficiently adapt the VLMs to downstream tasks with only a few training samples. We introduce a novel method to improve the prompt learning of vision-language models by incorporating pre-trained large language models (LLMs), called Dual-Aligned Prompt Tuning (DuAl-PT). Learnable prompts, like CoOp, implicitly model the context through end-to-end training, which are difficult to control and interpret. While explicit context descriptions generated by LLMs, like GPT-3, can be directly used for zero-shot classification, such prompts are overly relying on LLMs and still underexplored in few-shot domains. With DuAl-PT, we propose to learn more context-aware prompts, benefiting from both explicit and implicit context modeling. To achieve this, we introduce a pre-trained LLM to generate context descriptions, and we encourage the prompts to learn from the LLM's knowledge by alignment, as well as the alignment between prompts and local image features. Empirically, DuAl-PT achieves superior performance on 11 downstream datasets on few-shot recognition and base-to-new generalization. Hopefully, DuAl-PT can serve as a strong baseline. Code will be available.

Mapping EEG Signals to Visual Stimuli: A Deep Learning Approach to Match vs. Mismatch Classification

  • paper_url: http://arxiv.org/abs/2309.04153
  • repo_url: None
  • paper_authors: Yiqian Yang, Zhengqiao Zhao, Qian Wang, Yan Yang, Jingdong Chen
  • for: Developing a deep learning "match-vs-mismatch" model that classifies whether a video clip induces excitatory responses in recorded EEG signals, and learning associations between visual content and the corresponding neural recordings.
  • methods: The model contrasts matched and mismatched video/EEG pairs to capture the relationship between visual content and neural recordings.
  • results: The model achieves the highest accuracy on unseen subjects and mitigates inter-subject noise; Grad-CAM analysis shows that brain regions associated with language processing contribute most to the predictions, followed by regions associated with visual processing. These results could facilitate neural-recording-based video reconstruction and related applications.
    Abstract Existing approaches to modeling associations between visual stimuli and brain responses are facing difficulties in handling between-subject variance and model generalization. Inspired by the recent progress in modeling speech-brain response, we propose in this work a ``match-vs-mismatch'' deep learning model to classify whether a video clip induces excitatory responses in recorded EEG signals and learn associations between the visual content and corresponding neural recordings. Using an exclusive experimental dataset, we demonstrate that the proposed model is able to achieve the highest accuracy on unseen subjects as compared to other baseline models. Furthermore, we analyze the inter-subject noise using a subject-level silhouette score in the embedding space and show that the developed model is able to mitigate inter-subject noise and significantly reduce the silhouette score. Moreover, we examine the Grad-CAM activation score and show that the brain regions associated with language processing contribute most to the model predictions, followed by regions associated with visual processing. These results have the potential to facilitate the development of neural recording-based video reconstruction and its related applications.

Representation Synthesis by Probabilistic Many-Valued Logic Operation in Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2309.04148
  • repo_url: None
  • paper_authors: Hiroki Nakamura, Masashi Okada, Tadahiro Taniguchi
  • for: A self-supervised learning (SSL) method with mixed images, based on many-valued logic, for learning image representations.
  • methods: Representations of mixed images are synthesized by a probabilistic many-valued logic operation over a representation format that indicates the feature-possession degree, i.e. how much of each image feature a representation possesses; the synthesized representation preserves the remarkable characteristics of the originals.
  • results: The method performs competitively with previous representation-synthesis methods on image classification. The relationship between feature-possession degree and the number of image classes is examined on a multi-label classification dataset, and an image retrieval application of the representation format is discussed.
    Abstract Self-supervised learning (SSL) using mixed images has been studied to learn various image representations. Existing methods using mixed images learn a representation by maximizing the similarity between the representation of the mixed image and the synthesized representation of the original images. However, few methods consider the synthesis of representations from the perspective of mathematical logic. In this study, we focused on a synthesis method of representations. We proposed a new SSL with mixed images and a new representation format based on many-valued logic. This format can indicate the feature-possession degree, that is, how much of each image feature is possessed by a representation. This representation format and representation synthesis by logic operation realize that the synthesized representation preserves the remarkable characteristics of the original representations. Our method performed competitively with previous representation synthesis methods for image classification tasks. We also examined the relationship between the feature-possession degree and the number of classes of images in the multilabel image classification dataset to verify that the intended learning was achieved. In addition, we discussed image retrieval, which is an application of our proposed representation format using many-valued logic.
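
The probabilistic many-valued logic operations involved are simple to write down; treating each representation entry as a feature-possession degree in [0, 1], one candidate synthesis is shown below (an illustrative choice, not necessarily the paper's exact operation):

```python
def p_and(a, b):
    """Probabilistic conjunction (product t-norm)."""
    return a * b

def p_or(a, b):
    """Probabilistic disjunction (probabilistic sum)."""
    return a + b - a * b

def p_not(a):
    """Probabilistic negation."""
    return 1.0 - a

# If vectors a and b hold per-feature possession degrees in [0, 1] for two
# images, p_or applied elementwise is one natural synthesis for their mix:
# the mixed representation possesses a feature to the degree that either
# source image does. Works elementwise on NumPy/PyTorch arrays as well.
```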

Robot Localization and Mapping Final Report – Sequential Adversarial Learning for Self-Supervised Deep Visual Odometry

  • paper_url: http://arxiv.org/abs/2309.04147
  • repo_url: None
  • paper_authors: Akankshya Kar, Sajal Maheshwari, Shamit Lal, Vinay Sameer Raja Kad
  • for: The paper aims to improve the accuracy of visual odometry (VO) and Simultaneous Localization and Mapping (SLAM) in challenging scenarios by using deep neural networks to extract high-level features and generate more accurate depth and pose estimates.
  • methods: Two approaches are explored: 1) modeling with optical flow and recurrent neural networks (RNN) to exploit spatio-temporal correlations, and 2) using a generative adversarial network (GAN) to improve depth estimation and reduce artifacts.
  • results: The paper achieves better depth and pose estimates compared to previous works, and demonstrates the effectiveness of the proposed methods in challenging scenarios such as low-texture images and dynamic scenes.
    Abstract Visual odometry (VO) and SLAM have been using multi-view geometry via local structure from motion for decades. These methods have a slight disadvantage in challenging scenarios such as low-texture images, dynamic scenarios, etc. Meanwhile, use of deep neural networks to extract high level features is ubiquitous in computer vision. For VO, we can use these deep networks to extract depth and pose estimates using these high level features. The visual odometry task then can be modeled as an image generation task where the pose estimation is the by-product. This can also be achieved in a self-supervised manner, thereby eliminating the data (supervised) intensive nature of training deep neural networks. Although some works tried the similar approach [1], the depth and pose estimation in the previous works are vague sometimes resulting in accumulation of error (drift) along the trajectory. The goal of this work is to tackle these limitations of past approaches and to develop a method that can provide better depths and pose estimates. To address this, a couple of approaches are explored: 1) Modeling: Using optical flow and recurrent neural networks (RNN) in order to exploit spatio-temporal correlations which can provide more information to estimate depth. 2) Loss function: Generative adversarial network (GAN) [2] is deployed to improve the depth estimation (and thereby pose too), as shown in Figure 1. This additional loss term improves the realism in generated images and reduces artifacts.

Depth Completion with Multiple Balanced Bases and Confidence for Dense Monocular SLAM

  • paper_url: http://arxiv.org/abs/2309.04145
  • repo_url: None
  • paper_authors: Weijian Xie, Guanyi Chu, Quanhao Qian, Yihao Yu, Hai Li, Danpeng Chen, Shangjin Zhai, Nan Wang, Hujun Bao, Guofeng Zhang
  • for: A novel method for dense SLAM based on monocular cameras that can achieve online dense mapping on a mobile device.
  • methods: A light-weight depth completion network (BBC-Net) is integrated into a sparse SLAM system using a multi-basis depth representation: the network predicts multiple balanced bases and a confidence map from a monocular image with sparse points, and the final depth is a linear combination of the predicted depth bases, optimized by tuning the corresponding weights.
  • results: The proposed method achieves better performance in monocular dense mapping than state-of-the-art methods, and an online demo running on a mobile phone verifies its efficiency and mapping quality.
    Abstract Dense SLAM based on monocular cameras does indeed have immense application value in the field of AR/VR, especially when it is performed on a mobile device. In this paper, we propose a novel method that integrates a light-weight depth completion network into a sparse SLAM system using a multi-basis depth representation, so that dense mapping can be performed online even on a mobile phone. Specifically, we present a specifically optimized multi-basis depth completion network, called BBC-Net, tailored to the characteristics of traditional sparse SLAM systems. BBC-Net can predict multiple balanced bases and a confidence map from a monocular image with sparse points generated by off-the-shelf keypoint-based SLAM systems. The final depth is a linear combination of predicted depth bases that can be optimized by tuning the corresponding weights. To seamlessly incorporate the weights into traditional SLAM optimization and ensure efficiency and robustness, we design a set of depth weight factors, which makes our network a versatile plug-in module, facilitating easy integration into various existing sparse SLAM systems and significantly enhancing global depth consistency through bundle adjustment. To verify the portability of our method, we integrate BBC-Net into two representative SLAM systems. The experimental results on various datasets show that the proposed method achieves better performance in monocular dense mapping than the state-of-the-art methods. We provide an online demo running on a mobile phone, which verifies the efficiency and mapping quality of the proposed method in real-world scenarios.
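
The multi-basis representation reduces to a one-liner at inference: the completed depth is a per-pixel linear combination of the predicted balanced bases, with the weights being exactly what bundle adjustment tunes. A sketch with assumed shapes (the predicted confidence map, not shown, would down-weight unreliable pixels in the optimization residuals):

```python
import torch

def fuse_depth(bases, weights):
    """
    Combine multiple predicted depth bases into one depth map,
    D = sum_k w_k * B_k.  bases: (K, H, W); weights: (K,).
    In the full system the weights enter SLAM bundle adjustment as
    depth weight factors, so global depth consistency can be enforced.
    """
    return torch.einsum("k,khw->hw", weights, bases)
```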

On the Efficacy of Multi-scale Data Samplers for Vision Applications

  • paper_url: http://arxiv.org/abs/2309.04502
  • repo_url: None
  • paper_authors: Elvis Nunez, Thomas Merth, Anish Prabhu, Mehrdad Farajtabar, Mohammad Rastegari, Sachin Mehta, Maxwell Horton
  • for: An empirical study of the properties of multi-scale resolution training for vision tasks.
  • methods: Variable batch-size multi-scale data samplers randomly sample an input resolution at each training iteration and dynamically adjust the batch size according to the resolution.
  • results: Multi-scale samplers behave as implicit data regularizers and accelerate training; models retain or improve accuracy while being better calibrated and more robust to scaling and data-distribution shifts. Extending the sampler with a curriculum that progressively grows resolutions cuts compute by more than 30%, and on detection and instance segmentation a 37% reduction in training FLOPs comes with a 3-4% mAP increase on MS-COCO using a Mask R-CNN model.
    Abstract Multi-scale resolution training has seen an increased adoption across multiple vision tasks, including classification and detection. Training with smaller resolutions enables faster training at the expense of a drop in accuracy. Conversely, training with larger resolutions has been shown to improve performance, but memory constraints often make this infeasible. In this paper, we empirically study the properties of multi-scale training procedures. We focus on variable batch size multi-scale data samplers that randomly sample an input resolution at each training iteration and dynamically adjust their batch size according to the resolution. Such samplers have been shown to improve model accuracy beyond standard training with a fixed batch size and resolution, though it is not clear why this is the case. We explore the properties of these data samplers by performing extensive experiments on ResNet-101 and validate our conclusions across multiple architectures, tasks, and datasets. We show that multi-scale samplers behave as implicit data regularizers and accelerate training speed. Compared to models trained with single-scale samplers, we show that models trained with multi-scale samplers retain or improve accuracy, while being better-calibrated and more robust to scaling and data distribution shifts. We additionally extend a multi-scale variable batch sampler with a simple curriculum that progressively grows resolutions throughout training, allowing for a compute reduction of more than 30%. We show that the benefits of multi-scale training extend to detection and instance segmentation tasks, where we observe a 37% reduction in training FLOPs along with a 3-4% mAP increase on MS-COCO using a Mask R-CNN model.
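
A simplified sketch of a variable batch-size multi-scale sampler: draw a resolution per step and rescale the batch so the pixel count per batch stays roughly constant (the class name and the (index, resolution) protocol are assumptions for illustration, not the paper's implementation):

```python
import random
import torch

class MultiScaleBatchSampler:
    """
    Yields lists of (index, resolution) pairs; a custom collate_fn is
    assumed to resize each sample to the drawn resolution. The batch
    size scales inversely with the squared resolution so that memory
    use per batch remains approximately constant.
    """
    def __init__(self, n_samples, base_batch=128, base_res=224,
                 resolutions=(160, 192, 224, 256, 288)):
        self.n = n_samples
        self.base_batch, self.base_res = base_batch, base_res
        self.resolutions = resolutions

    def __iter__(self):
        indices = torch.randperm(self.n).tolist()
        while indices:
            res = random.choice(self.resolutions)
            # Keep batch_size * res^2 ~ base_batch * base_res^2 constant.
            bs = max(1, int(self.base_batch * (self.base_res / res) ** 2))
            batch, indices = indices[:bs], indices[bs:]
            yield [(i, res) for i in batch]
```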

From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.04109
  • repo_url: None
  • paper_authors: Changming Xiao, Qi Yang, Feng Zhou, Changshui Zhang
  • for: Using the attention mechanism of text-to-image diffusion models for semantic grounding, with no re-training and no inference-time optimization.
  • methods: The attention maps in the denoising network of a text-to-image diffusion model are used directly to localize the entities mentioned in the text.
  • results: Evaluated under the weakly-supervised semantic segmentation setting on Pascal VOC 2012 and Microsoft COCO 2014, the method achieves superior performance; the acquired word-pixel correlation also generalizes to the learned text embeddings of customized generation methods with only a few modifications, validated on a new "personalized referring image segmentation" task.
    Abstract Diffusion models have revolted the field of text-to-image generation recently. The unique way of fusing text and image information contributes to their remarkable capability of generating highly text-related images. From another perspective, these generative models imply clues about the precise correlation between words and pixels. In this work, a simple but effective method is proposed to utilize the attention mechanism in the denoising network of text-to-image diffusion models. Without re-training nor inference-time optimization, the semantic grounding of phrases can be attained directly. We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under weakly-supervised semantic segmentation setting and our method achieves superior performance to prior methods. In addition, the acquired word-pixel correlation is found to be generalizable for the learned text embedding of customized generation methods, requiring only a few modifications. To validate our discovery, we introduce a new practical task called "personalized referring image segmentation" with a new dataset. Experiments in various situations demonstrate the advantages of our method compared to strong baselines on this task. In summary, our work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation.
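
A framework-agnostic sketch of turning cross-attention onto a text token into a localization mask; how the attention tensors are hooked out of the denoising U-Net is model-specific and assumed here, and the aggregation below is one simple choice rather than the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def attention_to_mask(attn_maps, token_idx, out_size, thresh=0.45):
    """
    attn_maps: list of (heads, H_l * W_l, n_tokens) cross-attention tensors
    collected across layers/timesteps (assumed square spatial layout).
    Averages over heads, upsamples, accumulates, normalizes to [0, 1],
    and thresholds into a binary mask for the given token.
    """
    acc = 0.0
    for a in attn_maps:
        h = int(a.shape[1] ** 0.5)                        # side of the square map
        m = a[..., token_idx].mean(0).view(1, 1, h, h)    # average over heads
        acc = acc + F.interpolate(m, size=out_size, mode="bilinear",
                                  align_corners=False)
    acc = acc / len(attn_maps)
    acc = (acc - acc.min()) / (acc.max() - acc.min() + 1e-8)
    return (acc.squeeze() > thresh).float()
```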

Toward Sufficient Spatial-Frequency Interaction for Gradient-aware Underwater Image Enhancement

  • paper_url: http://arxiv.org/abs/2309.04089
  • repo_url: https://github.com/zhihefang/SFGNet
  • paper_authors: Chen Zhao, Weiling Cai, Chenyu Dong, Ziqi Zeng
  • for: Enhancing the quality of underwater images.
  • methods: The SFGNet framework, based on spatial-frequency interaction and gradient maps: a dense spatial-frequency fusion network followed by a gradient-aware corrector.
  • results: Experiments on two real-world underwater image datasets show that the method successfully enhances underwater images, matching or surpassing other methods in visual quality improvement.
    Abstract Underwater images suffer from complex and diverse degradation, which inevitably affects the performance of underwater visual tasks. However, most existing learning-based Underwater image enhancement (UIE) methods mainly restore such degradations in the spatial domain, and rarely pay attention to the fourier frequency information. In this paper, we develop a novel UIE framework based on spatial-frequency interaction and gradient maps, namely SFGNet, which consists of two stages. Specifically, in the first stage, we propose a dense spatial-frequency fusion network (DSFFNet), mainly including our designed dense fourier fusion block and dense spatial fusion block, achieving sufficient spatial-frequency interaction by cross connections between these two blocks. In the second stage, we propose a gradient-aware corrector (GAC) to further enhance perceptual details and geometric structures of images by gradient map. Experimental results on two real-world underwater image datasets show that our approach can successfully enhance underwater images, and achieves competitive performance in visual quality improvement.
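
A minimal frequency-domain branch of the kind a spatial-frequency interaction block uses, via torch.fft; a sketch in the spirit of the dense Fourier fusion block, not the exact SFGNet layer:

```python
import torch
import torch.nn as nn

class FourierUnit(nn.Module):
    """
    Transform features with a 2-D FFT, mix the real/imaginary parts with
    a 1x1 convolution (a global operation in the spatial domain), and
    transform back. Cross-connecting such a unit with a spatial branch
    gives spatial-frequency interaction.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x):                       # x: (B, C, H, W)
        f = torch.fft.rfft2(x, norm="ortho")    # (B, C, H, W//2 + 1), complex
        f = torch.cat([f.real, f.imag], dim=1)
        f = self.conv(f)
        real, imag = f.chunk(2, dim=1)
        f = torch.complex(real, imag)
        return torch.fft.irfft2(f, s=x.shape[-2:], norm="ortho")
```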

Towards Efficient SDRTV-to-HDRTV by Learning from Image Formation

  • paper_url: http://arxiv.org/abs/2309.04084
  • repo_url: https://github.com/xiaom233/hdrtvnet-plus
  • paper_authors: Xiangyu Chen, Zheyuan Li, Zhengwen Zhang, Jimmy S. Ren, Yihao Liu, Jingwen He, Yu Qiao, Jiantao Zhou, Chao Dong
  • for: Converting existing SDRTV content to the HDRTV standard to improve visual quality.
  • methods: A three-step solution pipeline, HDRTVNet++: adaptive global color mapping guided by global image statistics, a local enhancement network for local details, and highlight refinement, with the two sub-networks combined as a generator and jointly trained with a GAN to ensure highlight consistency.
  • results: The method is effective and lightweight for 4K-resolution images. The authors construct the HDRTV1K dataset (1235 training and 117 testing images, all in 4K, from HDR10 videos) and select five metrics to evaluate SDRTV-to-HDRTV algorithms; the final results achieve state-of-the-art performance both quantitatively and visually. Code, model, and dataset are available at https://github.com/xiaom233/HDRTVNet-plus.
    Abstract Modern displays are capable of rendering video content with high dynamic range (HDR) and wide color gamut (WCG). However, the majority of available resources are still in standard dynamic range (SDR). As a result, there is significant value in transforming existing SDR content into the HDRTV standard. In this paper, we define and analyze the SDRTV-to-HDRTV task by modeling the formation of SDRTV/HDRTV content. Our analysis and observations indicate that a naive end-to-end supervised training pipeline suffers from severe gamut transition errors. To address this issue, we propose a novel three-step solution pipeline called HDRTVNet++, which includes adaptive global color mapping, local enhancement, and highlight refinement. The adaptive global color mapping step uses global statistics as guidance to perform image-adaptive color mapping. A local enhancement network is then deployed to enhance local details. Finally, we combine the two sub-networks above as a generator and achieve highlight consistency through GAN-based joint training. Our method is primarily designed for ultra-high-definition TV content and is therefore effective and lightweight for processing 4K resolution images. We also construct a dataset using HDR videos in the HDR10 standard, named HDRTV1K that contains 1235 and 117 training images and 117 testing images, all in 4K resolution. Besides, we select five metrics to evaluate the results of SDRTV-to-HDRTV algorithms. Our final results demonstrate state-of-the-art performance both quantitatively and visually. The code, model and dataset are available at https://github.com/xiaom233/HDRTVNet-plus.
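
The first stage can be pictured as a tiny condition network: pool the image to global statistics, predict per-image parameters of a pointwise color transform, and apply them everywhere. A minimal sketch (the paper's condition network and mapping are more elaborate):

```python
import torch
import torch.nn as nn

class AdaptiveGlobalColorMapping(nn.Module):
    """
    Image-adaptive global color mapping: global statistics condition a
    per-image 3x3 color matrix plus bias, applied identically at every
    pixel, so the operation stays cheap even at 4K resolution.
    """
    def __init__(self, hidden=64):
        super().__init__()
        self.condition = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 12))

    def forward(self, x):                        # x: (B, 3, H, W) in [0, 1]
        stats = x.mean(dim=(2, 3))               # global color statistics
        params = self.condition(stats)           # (B, 12): matrix + bias
        mat = params[:, :9].view(-1, 3, 3)
        bias = params[:, 9:].view(-1, 3, 1, 1)
        return torch.einsum("bij,bjhw->bihw", mat, x) + bias
```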

UER: A Heuristic Bias Addressing Approach for Online Continual Learning

  • paper_url: http://arxiv.org/abs/2309.04081
  • repo_url: https://github.com/FelixHuiweiLin/UER
  • paper_authors: Huiwei Lin, Shanshan Feng, Baoquan Zhang, Hongliang Qiao, Xutao Li, Yunming Ye
  • for: Addressing the bias issue in online continual learning, where single-pass training on a data stream makes the network favor the classes of the current data and forget earlier ones.
  • methods: A simple and efficient approach that decomposes the dot-product logits into an angle factor and a norm factor. The bias mainly occurs in the angle factor, which can be used as cosine logits to learn novel knowledge, while the norm factor, discarded by existing methods, helps retain historical knowledge.
  • results: On three datasets, the proposed UER achieves superior performance over various state-of-the-art methods.
    Abstract Online continual learning aims to continuously train neural networks from a continuous data stream with a single pass-through data. As the most effective approach, the rehearsal-based methods replay part of previous data. Commonly used predictors in existing methods tend to generate biased dot-product logits that prefer to the classes of current data, which is known as a bias issue and a phenomenon of forgetting. Many approaches have been proposed to overcome the forgetting problem by correcting the bias; however, they still need to be improved in online fashion. In this paper, we try to address the bias issue by a more straightforward and more efficient method. By decomposing the dot-product logits into an angle factor and a norm factor, we empirically find that the bias problem mainly occurs in the angle factor, which can be used to learn novel knowledge as cosine logits. On the contrary, the norm factor abandoned by existing methods helps remember historical knowledge. Based on this observation, we intuitively propose to leverage the norm factor to balance the new and old knowledge for addressing the bias. To this end, we develop a heuristic approach called unbias experience replay (UER). UER learns current samples only by the angle factor and further replays previous samples by both the norm and angle factors. Extensive experiments on three datasets show that UER achieves superior performance over various state-of-the-art methods. The code is in https://github.com/FelixHuiweiLin/UER.
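
The decomposition at the heart of UER is a one-liner: w . x = (||w|| ||x||) cos(theta). A sketch of the split; the replay procedure that treats the two factors differently is not shown:

```python
import torch
import torch.nn.functional as F

def decompose_logits(features, weights):
    """
    Decompose dot-product logits into an angle factor and a norm factor.
    features: (B, d) sample embeddings; weights: (C, d) classifier rows.
    UER learns current samples with the cosine (angle) factor only and
    replays buffered samples with both factors.
    """
    cosine = F.normalize(features, dim=1) @ F.normalize(weights, dim=1).T
    norm = features.norm(dim=1, keepdim=True) * weights.norm(dim=1)[None, :]
    dot = norm * cosine        # recovers the ordinary dot-product logits
    return cosine, norm, dot
```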

Enhancing Hierarchical Transformers for Whole Brain Segmentation with Intracranial Measurements Integration

  • paper_url: http://arxiv.org/abs/2309.04071
  • repo_url: https://github.com/masilab/unest
  • paper_authors: Xin Yu, Yucheng Tang, Qi Yang, Ho Hin Lee, Shunxing Bao, Yuankai Huo, Bennett A. Landman
  • for: Enhancing existing whole brain segmentation to incorporate intracranial measurements, providing a more comprehensive analysis of brain structures.
  • methods: An improved hierarchical transformer, UNesT, segments 133 brain regions together with total intracranial volume (TICV) and posterior fossa volume (PFV) simultaneously. To address data scarcity, the model is first pretrained on 4859 T1-weighted (T1w) 3D volumes from 8 different sites, then finetuned on 45 T1w volumes from the Open Access Series of Imaging Studies (OASIS) where both the 133 whole brain classes and TICV/PFV labels are available.
  • results: Evaluated with Dice similarity coefficients (DSC), the model accurately estimates TICV/PFV while keeping performance on the 132 brain regions at a comparable level.
    Abstract Whole brain segmentation with magnetic resonance imaging (MRI) enables the non-invasive measurement of brain regions, including total intracranial volume (TICV) and posterior fossa volume (PFV). Enhancing the existing whole brain segmentation methodology to incorporate intracranial measurements offers a heightened level of comprehensiveness in the analysis of brain structures. Despite its potential, the task of generalizing deep learning techniques for intracranial measurements faces data availability constraints due to limited manually annotated atlases encompassing whole brain and TICV/PFV labels. In this paper, we enhancing the hierarchical transformer UNesT for whole brain segmentation to achieve segmenting whole brain with 133 classes and TICV/PFV simultaneously. To address the problem of data scarcity, the model is first pretrained on 4859 T1-weighted (T1w) 3D volumes sourced from 8 different sites. These volumes are processed through a multi-atlas segmentation pipeline for label generation, while TICV/PFV labels are unavailable. Subsequently, the model is finetuned with 45 T1w 3D volumes from Open Access Series Imaging Studies (OASIS) where both 133 whole brain classes and TICV/PFV labels are available. We evaluate our method with Dice similarity coefficients(DSC). We show that our model is able to conduct precise TICV/PFV estimation while maintaining the 132 brain regions performance at a comparable level. Code and trained model are available at: https://github.com/MASILab/UNesT/wholebrainSeg.

INSURE: An Information Theory Inspired Disentanglement and Purification Model for Domain Generalization

  • paper_url: http://arxiv.org/abs/2309.04063
  • repo_url: None
  • paper_authors: Xi Yu, Huan-Hsin Tseng, Shinjae Yoo, Haibin Ling, Yuewei Lin
  • for: An information-theory-inspired disentanglement and purification model (INSURE) for domain generalization, learning a generalizable model for unseen target domains from multiple observed source domains.
  • methods: An information-theory-inspired loss ensures that the disentangled class-relevant features contain sufficient class-label information and that the disentangled auxiliary features contain sufficient domain information; a paired purification loss makes the auxiliary features discard all class-relevant information, so the class-relevant features become sufficient and compact. Instead of multiple encoders, a learnable binary mask serves as the disentangler, making disentanglement more efficient and the disentangled features complementary.
  • results: Extensive experiments on four widely used DG benchmarks (PACS, OfficeHome, TerraIncognita, and DomainNet) show that INSURE outperforms the state of the art, and that domain-specific class-relevant features benefit domain generalization.
    Abstract Domain Generalization (DG) aims to learn a generalizable model on the unseen target domain by only training on the multiple observed source domains. Although a variety of DG methods have focused on extracting domain-invariant features, the domain-specific class-relevant features have attracted attention and been argued to benefit generalization to the unseen target domain. To take into account the class-relevant domain-specific information, in this paper we propose an Information theory iNspired diSentanglement and pURification modEl (INSURE) to explicitly disentangle the latent features to obtain sufficient and compact (necessary) class-relevant feature for generalization to the unseen domain. Specifically, we first propose an information theory inspired loss function to ensure the disentangled class-relevant features contain sufficient class label information and the other disentangled auxiliary feature has sufficient domain information. We further propose a paired purification loss function to let the auxiliary feature discard all the class-relevant information and thus the class-relevant feature will contain sufficient and compact (necessary) class-relevant information. Moreover, instead of using multiple encoders, we propose to use a learnable binary mask as our disentangler to make the disentanglement more efficient and make the disentangled features complementary to each other. We conduct extensive experiments on four widely used DG benchmark datasets including PACS, OfficeHome, TerraIncognita, and DomainNet. The proposed INSURE outperforms the state-of-art methods. We also empirically show that domain-specific class-relevant features are beneficial for domain generalization.
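
The learnable-binary-mask disentangler can be sketched directly: one shared feature vector is split into complementary class-relevant and auxiliary parts, with a straight-through estimator keeping the hard mask trainable (the information-theoretic and purification losses are omitted here):

```python
import torch
import torch.nn as nn

class MaskDisentangler(nn.Module):
    """
    Split a shared encoding z into z_cls = m * z and z_aux = (1 - m) * z
    with a learnable binary mask m, so the two parts are complementary
    by construction and no second encoder is needed.
    """
    def __init__(self, dim):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(dim))

    def forward(self, z):
        prob = torch.sigmoid(self.logits)
        hard = (prob > 0.5).float()
        m = hard + prob - prob.detach()      # straight-through gradient
        return m * z, (1.0 - m) * z          # class-relevant, auxiliary
```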
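A minimal sketch of the learnable-binary-mask disentangler idea: a shared latent vector is split into complementary class-relevant and auxiliary parts by a (relaxed) binary mask rather than by two separate encoders. The straight-through relaxation used below is an assumption; the paper's exact mask parameterization may differ.

```python
import torch
import torch.nn as nn

class BinaryMaskDisentangler(nn.Module):
    """Splits a latent vector z into complementary parts via a learnable binary mask."""

    def __init__(self, dim: int):
        super().__init__()
        self.logits = nn.Parameter(torch.randn(dim))  # one logit per latent dimension

    def forward(self, z: torch.Tensor):
        probs = torch.sigmoid(self.logits)
        hard = (probs > 0.5).float()
        # Straight-through estimator: the forward pass uses the hard 0/1 mask,
        # while gradients flow through the sigmoid probabilities.
        mask = hard + probs - probs.detach()
        z_class = z * mask          # class-relevant features
        z_aux = z * (1.0 - mask)    # auxiliary (domain) features, complementary by construction
        return z_class, z_aux

z = torch.randn(4, 128)
z_class, z_aux = BinaryMaskDisentangler(128)(z)
```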

cs.AI - 2023-09-08

Few-Shot Learning of Force-Based Motions From Demonstration Through Pre-training of Haptic Representation

  • paper_url: http://arxiv.org/abs/2309.04640
  • repo_url: None
  • paper_authors: Marina Y. Aoyama, João Moura, Namiko Saito, Sethu Vijayakumar
  • for: Enables robots to quickly adapt force-based manipulation motions to the physical properties of different objects from only a few demonstrations.
  • methods: A semi-supervised learning-from-demonstration (LfD) approach that decouples the learnt model into a haptic representation encoder and a motion generation decoder: the encoder is pre-trained on a large amount of easily accessible unsupervised data, while the decoder is trained with few-shot LfD, leveraging the benefits of learning skills from humans.
  • results: Validated on a wiping task with sponges of different stiffness and surface friction, pre-training significantly improves the model's ability to recognise physical properties and generate desired wiping motions for unseen sponges, outperforming LfD without pre-training. The generated motions are further validated on a KUKA iiwa robot arm, and the haptic representation encoder, pre-trained in simulation, is shown to capture the properties of real objects, explaining its contribution to the downstream task.
    Abstract In many contact-rich tasks, force sensing plays an essential role in adapting the motion to the physical properties of the manipulated object. To enable robots to capture the underlying distribution of object properties necessary for generalising learnt manipulation tasks to unseen objects, existing Learning from Demonstration (LfD) approaches require a large number of costly human demonstrations. Our proposed semi-supervised LfD approach decouples the learnt model into a haptic representation encoder and a motion generation decoder. This enables us to pre-train the first using a large amount of unsupervised data, easily accessible, while using few-shot LfD to train the second, leveraging the benefits of learning skills from humans. We validate the approach on the wiping task using sponges with different stiffness and surface friction. Our results demonstrate that pre-training significantly improves the ability of the LfD model to recognise physical properties and generate desired wiping motions for unseen sponges, outperforming the LfD method without pre-training. We validate the motion generated by our semi-supervised LfD model on the physical robot hardware using the KUKA iiwa robot arm. We also validate that the haptic representation encoder, pre-trained in simulation, captures the properties of real objects, explaining its contribution to improving the generalisation of the downstream task.
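A schematic of the decoupled pipeline described above, with made-up module names and a toy pretext task (the paper's actual networks and losses are not reproduced here): the haptic encoder is pre-trained on unlabeled force/torque sequences, then held fixed while a motion decoder is fit from a handful of demonstrations.

```python
import torch
import torch.nn as nn

encoder = nn.GRU(input_size=6, hidden_size=64, batch_first=True)                  # haptic representation encoder
decoder = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 7))           # motion generation decoder

# Stage 1: self-supervised pre-training of the encoder on abundant unlabeled haptic data
# (here a toy reconstruction objective; the real pretext task is an assumption).
recon_head = nn.Linear(64, 6)
opt = torch.optim.Adam(list(encoder.parameters()) + list(recon_head.parameters()))
for _ in range(100):
    haptic = torch.randn(32, 50, 6)               # batch of force/torque sequences
    _, h = encoder(haptic)
    loss = ((recon_head(h[-1]) - haptic[:, -1]) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: few-shot LfD, fitting only the decoder on a few demonstrations.
opt2 = torch.optim.Adam(decoder.parameters())
for _ in range(50):
    haptic, target_motion = torch.randn(5, 50, 6), torch.randn(5, 7)  # 5 demos
    with torch.no_grad():
        _, h = encoder(haptic)                    # frozen pre-trained encoder
    loss = ((decoder(h[-1]) - target_motion) ** 2).mean()
    opt2.zero_grad(); loss.backward(); opt2.step()
```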

Perceptual adjustment queries and an inverted measurement paradigm for low-rank metric learning

  • paper_url: http://arxiv.org/abs/2309.04626
  • repo_url: https://github.com/austinxu87/paq
  • paper_authors: Austin Xu, Andrew D. McRae, Jingyan Wang, Mark A. Davenport, Ashwin Pananjady
  • for: Introduces a new query mechanism for collecting human feedback: the perceptual adjustment query (PAQ).
  • methods: The PAQ adopts an inverted measurement scheme and combines advantages from both cardinal and ordinal queries, making it informative yet cognitively lightweight.
  • results: Applied to metric learning, PAQ measurements of an unknown Mahalanobis distance give rise to a high-dimensional, low-rank matrix estimation problem; a two-stage estimator is developed for it, with sample complexity guarantees and supporting numerical simulations.
    Abstract We introduce a new type of query mechanism for collecting human feedback, called the perceptual adjustment query (PAQ). Being both informative and cognitively lightweight, the PAQ adopts an inverted measurement scheme, and combines advantages from both cardinal and ordinal queries. We showcase the PAQ in the metric learning problem, where we collect PAQ measurements to learn an unknown Mahalanobis distance. This gives rise to a high-dimensional, low-rank matrix estimation problem to which standard matrix estimators cannot be applied. Consequently, we develop a two-stage estimator for metric learning from PAQs, and provide sample complexity guarantees for this estimator. We present numerical simulations demonstrating the performance of the estimator and its notable properties.
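For reference, the object being learned is a Mahalanobis distance, parameterized by a positive semidefinite matrix; in the low-rank setting targeted here (notation ours, not the paper's):

```latex
d_M(x, y) = \sqrt{(x - y)^\top M (x - y)}, \qquad
M \succeq 0, \quad \operatorname{rank}(M) = r \ll d,
```

so that estimating $M$ from PAQ responses is a high-dimensional, low-rank matrix estimation problem.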

Leveraging World Model Disentanglement in Value-Based Multi-Agent Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.04615
  • repo_url: None
  • paper_authors: Zhizun Wang, David Meger
  • for: addresses the challenge of multiple agents interacting in the same environment to achieve a common goal, with reduced sample complexity.
  • methods: uses a modularized world model, composed of action-conditioned, action-free, and static branches, to unravel the environment dynamics and produce imagined outcomes based on past experience, without sampling directly from the real environment.
  • results: achieves high sample efficiency and exhibits superior performance in defeating the enemy armies compared to other baselines in Easy, Hard, and Super-Hard StarCraft II micro-management challenges.
    Abstract In this paper, we propose a novel model-based multi-agent reinforcement learning approach named Value Decomposition Framework with Disentangled World Model to address the challenge of achieving a common goal of multiple agents interacting in the same environment with reduced sample complexity. Due to scalability and non-stationarity problems posed by multi-agent systems, model-free methods rely on a considerable number of samples for training. In contrast, we use a modularized world model, composed of action-conditioned, action-free, and static branches, to unravel the environment dynamics and produce imagined outcomes based on past experience, without sampling directly from the real environment. We employ variational auto-encoders and variational graph auto-encoders to learn the latent representations for the world model, which is merged with a value-based framework to predict the joint action-value function and optimize the overall training objective. We present experimental results in Easy, Hard, and Super-Hard StarCraft II micro-management challenges to demonstrate that our method achieves high sample efficiency and exhibits superior performance in defeating the enemy armies compared to other baselines.

Linking Symptom Inventories using Semantic Textual Similarity

  • paper_url: http://arxiv.org/abs/2309.04607
  • repo_url: https://github.com/shashankv98/symptom-inventories
  • paper_authors: Eamonn Kennedy, Shashank Vadlamani, Hannah M Lindsey, Kelly S Peterson, Kristen Dams OConnor, Kenton Murray, Ronak Agarwal, Houshang H Amiri, Raeda K Andersen, Talin Babikian, David A Baron, Erin D Bigler, Karen Caeyenberghs, Lisa Delano-Wood, Seth G Disner, Ekaterina Dobryakova, Blessen C Eapen, Rachel M Edelstein, Carrie Esopenko, Helen M Genova, Elbert Geuze, Naomi J Goodrich-Hunsaker, Jordan Grafman, Asta K Haberg, Cooper B Hodges, Kristen R Hoskinson, Elizabeth S Hovenden, Andrei Irimia, Neda Jahanshad, Ruchira M Jha, Finian Keleher, Kimbra Kenney, Inga K Koerte, Spencer W Liebel, Abigail Livny, Marianne Lovstad, Sarah L Martindale, Jeffrey E Max, Andrew R Mayer, Timothy B Meier, Deleene S Menefee, Abdalla Z Mohamed, Stefania Mondello, Martin M Monti, Rajendra A Morey, Virginia Newcombe, Mary R Newsome, Alexander Olsen, Nicholas J Pastorek, Mary Jo Pugh, Adeel Razi, Jacob E Resch, Jared A Rowland, Kelly Russell, Nicholas P Ryan, Randall S Scheibel, Adam T Schmidt, Gershon Spitz, Jaclyn A Stephens, Assaf Tal, Leah D Talbert, Maria Carmela Tartaglia, Brian A Taylor, Sophia I Thomopoulos, Maya Troyanskaya, Eve M Valera, Harm Jan van der Horn, John D Van Horn, Ragini Verma, Benjamin SC Wade, Willian SC Walker, Ashley L Ware, J Kent Werner Jr, Keith Owen Yeates, Ross D Zafonte, Michael M Zeineh, Brandon Zielinski, Paul M Thompson, Frank G Hillary, David F Tate, Elisabeth A Wilde, Emily L Dennis
  • for: Uses artificial intelligence (AI), specifically semantic textual similarity (STS), to link symptoms and scores across previously incongruous symptom inventories drawn from different settings and studies.
  • methods: Four pre-trained STS models were tested on their ability to screen thousands of symptom description pairs for related content and to predict symptom severity across four different inventories for 6,607 participants drawn from 16 international data sources.
  • results: The STS approach achieved 74.8% accuracy across five tasks, outperforming the other models tested, suggesting that incorporating contextual, semantic information can assist expert decision-making and yield gains for both general and disease-specific clinical assessment.
    Abstract An extensive library of symptom inventories has been developed over time to measure clinical symptoms, but this variety has led to several long standing issues. Most notably, results drawn from different settings and studies are not comparable, which limits reproducibility. Here, we present an artificial intelligence (AI) approach using semantic textual similarity (STS) to link symptoms and scores across previously incongruous symptom inventories. We tested the ability of four pre-trained STS models to screen thousands of symptom description pairs for related content - a challenging task typically requiring expert panels. Models were tasked to predict symptom severity across four different inventories for 6,607 participants drawn from 16 international data sources. The STS approach achieved 74.8% accuracy across five tasks, outperforming other models tested. This work suggests that incorporating contextual, semantic information can assist expert decision-making processes, yielding gains for both general and disease-specific clinical assessment.
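A minimal sketch of scoring symptom-description pairs with a pre-trained STS model. The specific checkpoint below is a placeholder for illustration, not one of the four models evaluated in the paper.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder STS model

inventory_a = ["Difficulty falling asleep", "Frequent headaches"]
inventory_b = ["Trouble sleeping at night", "Head pain several times a week"]

emb_a = model.encode(inventory_a, convert_to_tensor=True)
emb_b = model.encode(inventory_b, convert_to_tensor=True)

# Cosine similarity matrix between the two inventories' items;
# high-scoring pairs are candidate links between inventories.
scores = util.cos_sim(emb_a, emb_b)
for i, item in enumerate(inventory_a):
    j = scores[i].argmax().item()
    print(f"{item!r} <-> {inventory_b[j]!r} (sim={scores[i, j].item():.2f})")
```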

EGOFALLS: A visual-audio dataset and benchmark for fall detection using egocentric cameras

  • paper_url: http://arxiv.org/abs/2309.04579
  • repo_url: https://github.com/Xueyi-Wang/EGOFALLS
  • paper_authors: Xueyi Wang
  • for: A visual-audio dataset and benchmark for detecting falls from egocentric cameras, towards fall prevention and mitigation for vulnerable populations such as the elderly.
  • methods: Multimodal descriptors are extracted from videos captured by egocentric cameras, with a late decision fusion layer built on top of the extracted descriptors; a new dataset of 10,948 video samples from 14 subjects is collected for evaluation.
  • results: Fusing audio and visual information through late decision fusion improves detection performance, making the approach a promising tool for fall prevention and mitigation.
    Abstract Falls are significant and often fatal for vulnerable populations such as the elderly. Previous works have addressed the detection of falls by relying on data captured by a single sensor, such as images or accelerometers. In this work, we rely on multimodal descriptors extracted from videos captured by egocentric cameras. Our proposed method includes a late decision fusion layer that builds on top of the extracted descriptors. Furthermore, we collect a new dataset on which we assess our proposed approach. We believe this is the first public dataset of its kind. The dataset comprises 10,948 video samples by 14 subjects. We conducted ablation experiments to assess the performance of individual feature extractors, fusion of visual information, and fusion of both visual and audio information. Moreover, we experimented with internal and external cross-validation. Our results demonstrate that the fusion of audio and visual information through late decision fusion improves detection performance, making it a promising tool for fall prevention and mitigation.
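A minimal sketch of late decision fusion as described above: each modality produces its own class probabilities, and the decisions are combined at the end. Weighted averaging is used here for illustration; the fusion rule actually used in the paper may differ.

```python
import numpy as np

def late_fusion(p_visual: np.ndarray, p_audio: np.ndarray, w_visual: float = 0.5) -> np.ndarray:
    """Combine per-class probabilities from two modality-specific classifiers."""
    return w_visual * p_visual + (1.0 - w_visual) * p_audio

# Per-clip probabilities over [no_fall, fall] from two independent classifiers.
p_visual = np.array([0.30, 0.70])
p_audio = np.array([0.10, 0.90])
fused = late_fusion(p_visual, p_audio)
print(fused.argmax())  # 1 -> "fall"
```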

Unleashing the Power of Graph Learning through LLM-based Autonomous Agents

  • paper_url: http://arxiv.org/abs/2309.04565
  • repo_url: None
  • paper_authors: Lanning Wei, Zhiqiang He, Huan Zhao, Quanming Yao
  • for: This paper aims to simplify the learning process on diverse real-world graphs by using Large Language Models (LLMs) as autonomous agents.
  • methods: The proposed method, called Auto$^2$Graph, uses LLMs to decompose the complex graph learning task into three components: detecting the learning intent, configuring solutions based on AutoGraph, and generating a response. The AutoGraph agents manage crucial procedures in automated graph learning, including data-processing, AutoML configuration, architecture search, and hyper-parameter fine-tuning.
  • results: The proposed method demonstrates comparable performance across different datasets and learning tasks, with human-like decisions made by the agents.
    Abstract Graph structured data are widespread in real-world applications, yet handling these diverse data and their learning tasks on graphs in an efficient manner remains a challenge. When facing complicated graph learning tasks, experts have designed diverse Graph Neural Networks (GNNs) in recent years. They have also implemented AutoML in Graph, also known as AutoGraph, to automatically generate data-specific solutions. Despite their success, they encounter limitations in (1) managing diverse learning tasks at various levels, (2) dealing with different procedures in graph learning beyond architecture design, and (3) the huge requirements on prior knowledge when using AutoGraph. In this paper, we propose to use Large Language Models (LLMs) as autonomous agents to simplify the learning process on diverse real-world graphs. Specifically, in response to a user request which may contain varying data and learning targets at the node, edge, or graph levels, the complex graph learning task is decomposed into three components following the agent planning, namely, detecting the learning intent, configuring solutions based on AutoGraph, and generating a response. The AutoGraph agents manage crucial procedures in automated graph learning, including data-processing, AutoML configuration, searching architectures, and hyper-parameter fine-tuning. With these agents, those components are processed by decomposing and completing step by step, thereby generating a solution for the given data automatically, regardless of the learning task on node or graph. The proposed method is dubbed Auto$^2$Graph. Its effectiveness is demonstrated by its comparable performance on different datasets and learning tasks, as well as by the human-like decisions made by the agents.

Connecting NTK and NNGP: A Unified Theoretical Framework for Neural Network Learning Dynamics in the Kernel Regime

  • paper_url: http://arxiv.org/abs/2309.04522
  • repo_url: None
  • paper_authors: Yehonatan Avidan, Qianyi Li, Haim Sompolinsky
  • for: Works towards a complete theoretical framework for the learning process of deep neural networks in the infinite-width limit.
  • methods: A Markov proximal learning model for the learning dynamics of an ensemble of randomly initialized infinitely wide deep networks, together with a new time-dependent Neural Dynamical Kernel (NDK) from which both the NTK and NNGP kernels can be derived.
  • results: Identifies two learning phases characterized by different time scales, gradient-driven and diffusive learning, and, combined with numerical evaluations on synthetic and benchmark datasets, provides novel insights into the learning process of deep neural networks.
    Abstract Artificial neural networks have revolutionized machine learning in recent years, but a complete theoretical framework for their learning process is still lacking. Substantial progress has been made for infinitely wide networks. In this regime, two disparate theoretical frameworks have been used, in which the network's output is described using kernels: one framework is based on the Neural Tangent Kernel (NTK) which assumes linearized gradient descent dynamics, while the Neural Network Gaussian Process (NNGP) kernel assumes a Bayesian framework. However, the relation between these two frameworks has remained elusive. This work unifies these two distinct theories using a Markov proximal learning model for learning dynamics in an ensemble of randomly initialized infinitely wide deep networks. We derive an exact analytical expression for the network input-output function during and after learning, and introduce a new time-dependent Neural Dynamical Kernel (NDK) from which both NTK and NNGP kernels can be derived. We identify two learning phases characterized by different time scales: gradient-driven and diffusive learning. In the initial gradient-driven learning phase, the dynamics is dominated by deterministic gradient descent, and is described by the NTK theory. This phase is followed by the diffusive learning stage, during which the network parameters sample the solution space, ultimately approaching the equilibrium distribution corresponding to NNGP. Combined with numerical evaluations on synthetic and benchmark datasets, we provide novel insights into the different roles of initialization, regularization, and network depth, as well as phenomena such as early stopping and representational drift. This work closes the gap between the NTK and NNGP theories, providing a comprehensive framework for understanding the learning process of deep neural networks in the infinite width limit.
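For orientation, the two kernels being unified are standardly defined as follows; these are the textbook definitions, not the paper's time-dependent NDK:

```latex
% NNGP kernel: covariance of network outputs under the prior over parameters
K_{\mathrm{NNGP}}(x, x') = \mathbb{E}_{\theta \sim \mathcal{N}(0, I)}\!\left[ f_\theta(x)\, f_\theta(x') \right]

% Neural Tangent Kernel: inner product of output gradients w.r.t. parameters
\Theta_{\mathrm{NTK}}(x, x') = \nabla_\theta f_\theta(x)^\top \nabla_\theta f_\theta(x')
```

In the infinite-width limit the NTK concentrates and governs linearized gradient-descent dynamics, while the NNGP kernel describes the Bayesian equilibrium; per the abstract, the NDK recovers both as limits of a single time-dependent object.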

On the Actionability of Outcome Prediction

  • paper_url: http://arxiv.org/abs/2309.04470
  • repo_url: https://github.com/andrewmogbolu2/blockchain-technology
  • paper_authors: Lydia T. Liu, Solon Barocas, Jon Kleinberg, Karen Levy
  • for: Examines the use of outcome prediction for downstream interventions in social impact domains, such as predicting student success in education or disease risk in healthcare.
  • methods: A simple model encompassing actions, latent states, and measurements.
  • results: Accurate outcome prediction is rarely the most effective policy for taking actions, even when combined with other measurements: except in cases where there is a single decisive action for improving the outcome, outcome prediction never maximizes "action value". Measuring actionable latent states, where specific actions lead to desired outcomes, considerably enhances action value, with the degree of improvement depending on action costs and the outcome model.
    Abstract Predicting future outcomes is a prevalent application of machine learning in social impact domains. Examples range from predicting student success in education to predicting disease risk in healthcare. Practitioners recognize that the ultimate goal is not just to predict but to act effectively. Increasing evidence suggests that relying on outcome predictions for downstream interventions may not have desired results. In most domains there exists a multitude of possible interventions for each individual, making the challenge of taking effective action more acute. Even when causal mechanisms connecting the individual's latent states to outcomes is well understood, in any given instance (a specific student or patient), practitioners still need to infer -- from budgeted measurements of latent states -- which of many possible interventions will be most effective for this individual. With this in mind, we ask: when are accurate predictors of outcomes helpful for identifying the most suitable intervention? Through a simple model encompassing actions, latent states, and measurements, we demonstrate that pure outcome prediction rarely results in the most effective policy for taking actions, even when combined with other measurements. We find that except in cases where there is a single decisive action for improving the outcome, outcome prediction never maximizes "action value", the utility of taking actions. Making measurements of actionable latent states, where specific actions lead to desired outcomes, considerably enhances the action value compared to outcome prediction, and the degree of improvement depends on action costs and the outcome model. This analysis emphasizes the need to go beyond generic outcome prediction in interventional settings by incorporating knowledge of plausible actions and latent states.

tSPM+; a high-performance algorithm for mining transitive sequential patterns from clinical data

  • paper_url: http://arxiv.org/abs/2309.05671
  • repo_url: None
  • paper_authors: Jonas Hügel, Ulrich Sax, Shawn N. Murphy, Hossein Estiri
  • for: Develops a high-performance temporal sequence pattern mining algorithm (tSPM+) for mining transitive sequential patterns from large clinical datasets, integrated with machine learning workflows.
  • methods: A high-performance implementation of the temporal sequence pattern mining (tSPM) algorithm that adds a new dimension, duration, to the temporal patterns, shipped as a Docker container and an R package with vignettes for easy integration into existing machine learning workflows.
  • results: tSPM+ provides a speed-up of up to a factor of 980 and up to a 48-fold reduction in memory consumption; the mined temporal sequences are used to identify Post COVID-19 patients and their symptoms according to the WHO definition.
    Abstract The increasing availability of large clinical datasets collected from patients can enable new avenues for computational characterization of complex diseases using different analytic algorithms. One of the promising new methods for extracting knowledge from large clinical datasets involves temporal pattern mining integrated with machine learning workflows. However, mining these temporal patterns is a computationally intensive task with substantial memory repercussions. Current algorithms, such as the temporal sequence pattern mining (tSPM) algorithm, are already providing promising outcomes, but still leave room for optimization. In this paper, we present the tSPM+ algorithm, a high-performance implementation of the tSPM algorithm, which adds a new dimension, duration, to the temporal patterns. We show that the tSPM+ algorithm provides a speed-up of up to a factor of 980 and up to a 48-fold improvement in memory consumption. Moreover, we provide a Docker container with an R package, along with vignettes for easy integration into already existing machine learning workflows, and use the mined temporal sequences to identify Post COVID-19 patients and their symptoms according to the WHO definition.
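A toy sketch of the kind of pattern tSPM-style algorithms enumerate: ordered concept pairs per patient, extended here with the duration between the two events. This is our illustration of the idea, not the tSPM+ implementation, which is engineered for performance on large datasets.

```python
from collections import Counter
from itertools import combinations

# Each patient record: list of (day, clinical_concept) pairs, sorted by time.
records = {
    "p1": [(0, "fever"), (3, "cough"), (40, "fatigue")],
    "p2": [(0, "cough"), (10, "fatigue")],
}

def mine_pairs(record, max_gap_days=90):
    """Yield transitive sequential patterns (A -> B) with their duration in days."""
    for (t1, a), (t2, b) in combinations(record, 2):
        if a != b and 0 < t2 - t1 <= max_gap_days:
            yield (a, b, t2 - t1)

pattern_counts = Counter()
for patient, record in records.items():
    for a, b, gap in mine_pairs(record):
        pattern_counts[(a, b)] += 1  # durations could be binned as an extra dimension

print(pattern_counts.most_common())
# [(('cough', 'fatigue'), 2), (('fever', 'cough'), 1), (('fever', 'fatigue'), 1)]
```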

Subwords as Skills: Tokenization for Sparse-Reward Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.04459
  • repo_url: None
  • paper_authors: David Yunis, Justin Jung, Falcon Dai, Matthew Walter
  • for: Exploration in sparse-reward environments is difficult, especially in continuous action spaces, because long, coordinated action sequences are required before any reward is obtained.
  • methods: A novel two-part skill-generation approach: the action space is first discretized by clustering interaction data, and a tokenization technique borrowed from natural language processing is then used to form temporally extended actions; a policy is optimized on top of this new action space, avoiding the need to cover the full range of the continuous action space.
  • results: Outperforms baselines for skill-generation in several challenging sparse-reward domains while requiring orders of magnitude less computation in skill generation and online rollouts.
    Abstract Exploration in sparse-reward reinforcement learning is difficult due to the requirement of long, coordinated sequences of actions in order to achieve any reward. Moreover, in continuous action spaces there are an infinite number of possible actions, which only increases the difficulty of exploration. One class of methods designed to address these issues forms temporally extended actions, often called skills, from interaction data collected in the same domain, and optimizes a policy on top of this new action space. Typically such methods require a lengthy pretraining phase, especially in continuous action spaces, in order to form the skills before reinforcement learning can begin. Given prior evidence that the full range of the continuous action space is not required in such tasks, we propose a novel approach to skill-generation with two components. First we discretize the action space through clustering, and second we leverage a tokenization technique borrowed from natural language processing to generate temporally extended actions. Such a method outperforms baselines for skill-generation in several challenging sparse-reward domains, and requires orders-of-magnitude less computation in skill-generation and online rollouts.
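A compact sketch of the two components: continuous actions are discretized with k-means, and a byte-pair-encoding-style merge loop then builds temporally extended "skill" tokens from the discrete action stream. This is our simplified rendition of the approach, not the authors' code; cluster and merge counts are arbitrary.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

# 1) Discretize a demonstration's continuous actions (T x action_dim) into tokens.
actions = np.random.randn(500, 4)                   # stand-in interaction data
tokens = KMeans(n_clusters=16, n_init=10, random_state=0).fit_predict(actions).tolist()

# 2) BPE-style merging: repeatedly fuse the most frequent adjacent token pair
# into a new skill token, yielding temporally extended actions.
next_id, merges = 16, {}
for _ in range(8):                                  # number of merges = number of new skills
    pairs = Counter(zip(tokens, tokens[1:]))
    (a, b), _ = pairs.most_common(1)[0]
    merges[(a, b)] = next_id
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            out.append(next_id); i += 2             # replace the pair with the skill token
        else:
            out.append(tokens[i]); i += 1
    tokens, next_id = out, next_id + 1

print(len(merges), "skills learned; sequence length now", len(tokens))
```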

Physics-Informed Neural Networks for an optimal counterdiabatic quantum computation

  • paper_url: http://arxiv.org/abs/2309.04434
  • repo_url: None
  • paper_authors: Antonio Ferrer-Sánchez, Carlos Flores-Garrigos, Carlos Hernani-Morales, José J. Orquín-Marqués, Narendra N. Hegade, Alejandro Gomez Cadavid, Iraitz Montalban, Enrique Solano, Yolanda Vives-Gilabert, José D. Martín-Guerrero
  • for: Solving counterdiabatic (CD) protocols for systems in quantum circuits: physics-informed neural networks (PINNs) are used to accurately solve the time evolution of the different physical observables of the quantum system.
  • methods: Inspired by physics, the necessary physical information is embedded into an underlying neural network to effectively tackle the problem. In particular, the hermiticity condition is imposed on all physical observables, and the principle of least action is used to guarantee that the most appropriate counterdiabatic terms are obtained from the underlying physics.
  • results: A dependable alternative for the CD driving problem, free from the constraints of previous methodologies relying on classical numerical approximations. The method yields optimal results for the relevant physical observables, including the time parameterization known as the scheduling function, the gauge potential or operator involving the non-adiabatic terms, and the temporal evolution of the system's energy levels. Applied to the $\mathrm{H_{2}}$ and $\mathrm{LiH}$ molecules, represented by 2-qubit and 4-qubit systems in the STO-3G basis, it successfully derives a desirable decomposition of the non-adiabatic terms as a linear combination of Pauli operators, which confers significant advantages for practical implementation within quantum computing algorithms.
    Abstract We introduce a novel methodology that leverages the strength of Physics-Informed Neural Networks (PINNs) to address the counterdiabatic (CD) protocol in the optimization of quantum circuits comprised of systems with $N_{Q}$ qubits. The primary objective is to utilize physics-inspired deep learning techniques to accurately solve the time evolution of the different physical observables within the quantum system. To accomplish this objective, we embed the necessary physical information into an underlying neural network to effectively tackle the problem. In particular, we impose the hermiticity condition on all physical observables and make use of the principle of least action, guaranteeing the acquisition of the most appropriate counterdiabatic terms based on the underlying physics. The proposed approach offers a dependable alternative to address the CD driving problem, free from the constraints typically encountered in previous methodologies relying on classical numerical approximations. Our method provides a general framework to obtain optimal results from the physical observables relevant to the problem, including the external parameterization in time known as scheduling function, the gauge potential or operator involving the non-adiabatic terms, as well as the temporal evolution of the energy levels of the system, among others. The main applications of this methodology have been the $\mathrm{H_{2}}$ and $\mathrm{LiH}$ molecules, represented by 2-qubit and 4-qubit systems employing the STO-3G basis. The presented results demonstrate the successful derivation of a desirable decomposition for the non-adiabatic terms, achieved through a linear combination utilizing Pauli operators. This attribute confers significant advantages to its practical implementation within quantum computing algorithms.
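For context, a standard variational formulation of counterdiabatic driving from the CD literature reads as follows; the notation and the exact loss terms used by the paper may differ.

```latex
% Counterdiabatic Hamiltonian: bare Hamiltonian plus gauge-potential correction
H_{\mathrm{CD}}(t) = H\big(\lambda(t)\big) + \dot{\lambda}\, A_\lambda

% The adiabatic gauge potential A_lambda is obtained variationally by
% minimizing the action of G_lambda (a least-action principle):
G_\lambda = \partial_\lambda H + i\,[A_\lambda, H], \qquad
A_\lambda^{*} = \arg\min_{A_\lambda} \operatorname{Tr}\!\left[ G_\lambda^{2} \right]
```

Here $\lambda(t)$ is the scheduling function mentioned in the abstract, and expanding $A_\lambda$ in a Pauli-operator basis is what makes the result implementable on gate-based hardware.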

Variations and Relaxations of Normalizing Flows

  • paper_url: http://arxiv.org/abs/2309.04433
  • repo_url: None
  • paper_authors: Keegan Kelly, Lorena Piedras, Sukrit Rao, David Roth
  • for: Surveys variations and relaxations of Normalizing Flows (NFs) that improve their expressivity and sampling efficiency.
  • methods: Covers a selection of recent works that combine aspects of other generative model classes, such as VAEs and score-based diffusion, thereby loosening the strict bijectivity constraints of NFs.
  • results: These relaxations achieve a balance of expressivity, training speed, sample efficiency, and likelihood tractability.
    Abstract Normalizing Flows (NFs) describe a class of models that express a complex target distribution as the composition of a series of bijective transformations over a simpler base distribution. By limiting the space of candidate transformations to diffeomorphisms, NFs enjoy efficient, exact sampling and density evaluation, enabling NFs to flexibly behave as both discriminative and generative models. Their restriction to diffeomorphisms, however, enforces that input, output and all intermediary spaces share the same dimension, limiting their ability to effectively represent target distributions with complex topologies. Additionally, in cases where the prior and target distributions are not homeomorphic, Normalizing Flows can leak mass outside of the support of the target. This survey covers a selection of recent works that combine aspects of other generative model classes, such as VAEs and score-based diffusion, and in doing so loosen the strict bijectivity constraints of NFs to achieve a balance of expressivity, training speed, sample efficiency and likelihood tractability.
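The bijectivity constraint being relaxed comes from the change-of-variables formula that gives NFs their exact likelihoods: for a diffeomorphism $f$ mapping data $x$ to base samples $z = f(x)$,

```latex
\log p_X(x) = \log p_Z\big(f(x)\big) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|
```

Exact density evaluation requires $f$ to be invertible with a tractable Jacobian determinant, which is precisely what forces the input, output, and intermediary spaces to share one dimension; the surveyed relaxations trade away parts of this identity for expressivity.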

Create Your World: Lifelong Text-to-Image Diffusion

  • paper_url: http://arxiv.org/abs/2309.04430
  • repo_url: None
  • paper_authors: Gan Sun, Wenqi Liang, Jiahua Dong, Jun Li, Zhengming Ding, Yang Cong
  • for: "Create your world": synthesizing instantiations of a user's own concepts from text prompts in a never-ending manner, with new concepts quickly learned from a few examples.
  • methods: A Lifelong text-to-image Diffusion Model (L2DM) that counters knowledge "catastrophic forgetting" with a task-aware memory enhancement module and an elastic-concept distillation module, and counters semantic "catastrophic neglecting" with a concept attention artist module and an orthogonal attention module.
  • results: Compared with related state-of-the-art models, the model generates more faithful images across a range of continual text prompts, in terms of both qualitative and quantitative metrics.
    Abstract Text-to-image generative models can produce diverse high-quality images of concepts with a text prompt, which have demonstrated excellent ability in image generation, image translation, etc. We in this work study the problem of synthesizing instantiations of a user's own concepts in a never-ending manner, i.e., create your world, where the new concepts from the user are quickly learned with a few examples. To achieve this goal, we propose a Lifelong text-to-image Diffusion Model (L2DM), which intends to overcome knowledge "catastrophic forgetting" for the past encountered concepts, and semantic "catastrophic neglecting" for one or more concepts in the text prompt. In respect of knowledge "catastrophic forgetting", our L2DM framework devises a task-aware memory enhancement module and an elastic-concept distillation module, which could respectively safeguard the knowledge of both prior concepts and each past personalized concept. When generating images with a user text prompt, the solution to semantic "catastrophic neglecting" is that a concept attention artist module can alleviate the semantic neglecting from the concept aspect, and an orthogonal attention module can reduce the semantic binding from the attribute aspect. To the end, our model can generate more faithful images across a range of continual text prompts in terms of both qualitative and quantitative metrics, when comparing with the related state-of-the-art models. The code will be released at https://wenqiliang.github.io/.

  • paper_url: http://arxiv.org/abs/2309.04426
  • repo_url: None
  • paper_authors: Lyuyang Sima, Joseph Bucukovski, Erwan Carlson, Nicole L. Yien
  • for: Provides researchers who are new to the field of spiking neural networks with systematic summaries of learning concepts and research orientations.
  • methods: Summarizes the strengths, weaknesses, and applicability of five neuronal models; analyzes the characteristics of five network topologies; and reviews spiking neural network algorithms, covering unsupervised learning algorithms based on synaptic plasticity rules and four types of supervised learning algorithms.
  • results: Reviews and analyzes brain-like neuromorphic chips under research at home and abroad.
    Abstract In the rapid evolution of next-generation brain-inspired artificial intelligence and increasingly sophisticated electromagnetic environments, the highly bionic characteristics and anti-interference performance of spiking neural networks show great potential in terms of computational speed, real-time information processing, and spatio-temporal data processing. The spiking neural network is one of the cores of brain-like artificial intelligence, realizing brain-like computing by simulating the structure and information transfer mode of biological neural networks. This paper summarizes the strengths, weaknesses and applicability of five neuronal models and analyzes the characteristics of five network topologies; it then reviews spiking neural network algorithms, summarizing the unsupervised learning algorithms based on synaptic plasticity rules and four types of supervised learning algorithms from the perspectives of unsupervised and supervised learning; finally, it reviews the brain-like neuromorphic chips under research at home and abroad. This paper is intended to provide learning concepts and research orientations, through systematic summaries, for peers who are new to the research field of spiking neural networks.
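As a concrete example of the neuronal models such surveys compare, here is a minimal leaky integrate-and-fire (LIF) simulation. This is the generic textbook model, not code from the paper; all parameter values are illustrative.

```python
import numpy as np

def simulate_lif(current, dt=1e-4, tau=0.02, v_rest=-0.065, v_thresh=-0.050,
                 v_reset=-0.065, resistance=1e7):
    """Euler integration of tau * dV/dt = -(V - V_rest) + R * I(t)."""
    v, spikes, trace = v_rest, [], []
    for t, i_t in enumerate(current):
        v += dt / tau * (-(v - v_rest) + resistance * i_t)
        if v >= v_thresh:           # threshold crossing emits a spike
            spikes.append(t * dt)
            v = v_reset             # membrane potential resets
        trace.append(v)
    return np.array(trace), spikes

current = np.full(2000, 2e-9)       # constant 2 nA input for 0.2 s
_, spike_times = simulate_lif(current)
print(f"{len(spike_times)} spikes in 0.2 s")
```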

SynthoGestures: A Novel Framework for Synthetic Dynamic Hand Gesture Generation for Driving Scenarios

  • paper_url: http://arxiv.org/abs/2309.04421
  • repo_url: https://github.com/amrgomaaelhady/synthogestures
  • paper_authors: Amr Gomaa, Robin Zitt, Guillermo Reyes, Antonio Krüger
  • for: A novel framework for generating synthetic dynamic hand gesture datasets for human-machine interfaces in the automotive domain.
  • methods: Uses Unreal Engine to synthesize realistic hand gestures from virtual 3D models, offering customization options (gesture speed, performance, and hand shape) that improve generalizability and reduce the risk of overfitting, and simulates different camera locations and types (RGB, infrared, and depth) without the additional time and cost of obtaining these cameras.
  • results: Experimental results show that the framework improves gesture recognition accuracy and can replace or augment real-hand datasets; by saving the time and effort of dataset creation, it accelerates the development of gesture recognition systems for automotive applications.
    Abstract Creating a diverse and comprehensive dataset of hand gestures for dynamic human-machine interfaces in the automotive domain can be challenging and time-consuming. To overcome this challenge, we propose using synthetic gesture datasets generated by virtual 3D models. Our framework utilizes Unreal Engine to synthesize realistic hand gestures, offering customization options and reducing the risk of overfitting. Multiple variants, including gesture speed, performance, and hand shape, are generated to improve generalizability. In addition, we simulate different camera locations and types, such as RGB, infrared, and depth cameras, without incurring additional time and cost to obtain these cameras. Experimental results demonstrate that our proposed framework, SynthoGestures (https://github.com/amrgomaaelhady/SynthoGestures), improves gesture recognition accuracy and can replace or augment real-hand datasets. By saving time and effort in the creation of the dataset, our tool accelerates the development of gesture recognition systems for automotive applications.

Privacy Preserving Federated Learning with Convolutional Variational Bottlenecks

  • paper_url: http://arxiv.org/abs/2309.04515
  • repo_url: None
  • paper_authors: Daniel Scheliga, Patrick Mäder, Marco Seeland
  • for: Defending against gradient inversion (gradient leakage) attacks in federated learning, to protect the privacy of training data.
  • methods: Analyzes how the variational modeling in PRECODE protects against gradient inversion, and formulates an attack that disables its privacy-preserving effect by purposefully omitting stochastic gradients during attack optimization.
  • results: Proposes a new privacy module, the Convolutional Variational Bottleneck (CVB), that can be placed early in a neural network while preserving privacy. An extensive empirical study on three seminal model architectures and six image classification datasets shows that all architectures are susceptible to gradient leakage attacks, which the proposed CVB prevents with fewer trainable parameters than PRECODE.
    Abstract Gradient inversion attacks are an ubiquitous threat in federated learning as they exploit gradient leakage to reconstruct supposedly private training data. Recent work has proposed to prevent gradient leakage without loss of model utility by incorporating a PRivacy EnhanCing mODulE (PRECODE) based on variational modeling. Without further analysis, it was shown that PRECODE successfully protects against gradient inversion attacks. In this paper, we make multiple contributions. First, we investigate the effect of PRECODE on gradient inversion attacks to reveal its underlying working principle. We show that variational modeling introduces stochasticity into the gradients of PRECODE and the subsequent layers in a neural network. The stochastic gradients of these layers prevent iterative gradient inversion attacks from converging. Second, we formulate an attack that disables the privacy preserving effect of PRECODE by purposefully omitting stochastic gradients during attack optimization. To preserve the privacy preserving effect of PRECODE, our analysis reveals that variational modeling must be placed early in the network. However, early placement of PRECODE is typically not feasible due to reduced model utility and the exploding number of additional model parameters. Therefore, as a third contribution, we propose a novel privacy module -- the Convolutional Variational Bottleneck (CVB) -- that can be placed early in a neural network without suffering from these drawbacks. We conduct an extensive empirical study on three seminal model architectures and six image classification datasets. We find that all architectures are susceptible to gradient leakage attacks, which can be prevented by our proposed CVB. Compared to PRECODE, we show that our novel privacy module requires fewer trainable parameters, and thus computational and communication costs, to effectively preserve privacy.
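A minimal PyTorch sketch of what a convolutional variational bottleneck can look like: convolutional heads produce mean and log-variance feature maps, and the reparameterization trick injects the stochasticity into the forward pass. This is our reading of the module name; the paper's exact architecture and loss weighting are not reproduced here.

```python
import torch
import torch.nn as nn

class ConvVariationalBottleneck(nn.Module):
    """Stochastic convolutional layer: z = mu(x) + sigma(x) * eps, eps ~ N(0, I)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.mu = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.logvar = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        # KL term against a standard normal prior, typically added to the task loss.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl

# Placed early in a network, the stochastic gradients downstream of this module
# are what keep iterative gradient inversion attacks from converging.
x = torch.randn(8, 3, 32, 32)
z, kl = ConvVariationalBottleneck(3, 16)(x)
```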

Generalization Bounds: Perspectives from Information Theory and PAC-Bayes

  • paper_url: http://arxiv.org/abs/2309.04381
  • repo_url: None
  • paper_authors: Fredrik Hellström, Giuseppe Durisi, Benjamin Guedj, Maxim Raginsky
  • for: A unified treatment of generalization in theoretical machine learning, with particular focus on the PAC-Bayesian approach and its information-theoretic counterpart.
  • methods: Takes an information-theoretic view of generalization and connects it to the PAC-Bayesian approach, presenting the techniques and results the two perspectives have in common and discussing where their approaches and interpretations differ.
  • results: Demonstrates that many generalization proofs share a modular structure, with special attention to the conditional mutual information (CMI) framework, analytical studies of the information complexity of learning algorithms, and applications to areas such as deep learning.
    Abstract A fundamental question in theoretical machine learning is generalization. Over the past decades, the PAC-Bayesian approach has been established as a flexible framework to address the generalization capabilities of machine learning algorithms, and design new ones. Recently, it has garnered increased interest due to its potential applicability for a variety of learning algorithms, including deep neural networks. In parallel, an information-theoretic view of generalization has developed, wherein the relation between generalization and various information measures has been established. This framework is intimately connected to the PAC-Bayesian approach, and a number of results have been independently discovered in both strands. In this monograph, we highlight this strong connection and present a unified treatment of generalization. We present techniques and results that the two perspectives have in common, and discuss the approaches and interpretations that differ. In particular, we demonstrate how many proofs in the area share a modular structure, through which the underlying ideas can be intuited. We pay special attention to the conditional mutual information (CMI) framework; analytical studies of the information complexity of learning algorithms; and the application of the proposed methods to deep learning. This monograph is intended to provide a comprehensive introduction to information-theoretic generalization bounds and their connection to PAC-Bayes, serving as a foundation from which the most recent developments are accessible. It is aimed broadly towards researchers with an interest in generalization and theoretical machine learning.
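A canonical example of the information-theoretic bounds this monograph surveys is the mutual-information bound of Xu and Raginsky (2017): if the loss is $\sigma$-subgaussian, the expected generalization gap of an algorithm with output hypothesis $W$ trained on the $n$-sample dataset $S$ satisfies

```latex
\left| \mathbb{E}\big[ L_\mu(W) - L_S(W) \big] \right|
\;\le\; \sqrt{ \frac{2\sigma^{2}\, I(W; S)}{n} }
```

The CMI framework mentioned above sharpens this by conditioning on a supersample, which keeps the information measure finite even for deterministic learning algorithms.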

Beyond Static Datasets: A Deep Interaction Approach to LLM Evaluation

  • paper_url: http://arxiv.org/abs/2309.04369
  • repo_url: None
  • paper_authors: Jiatong Li, Rui Li, Qi Liu
  • for: Evaluating the abilities of large language models (LLMs) on various real-world tasks, improving on existing LLM evaluation methods.
  • methods: Proposes a deep-interaction-based LLM evaluation framework in which LLMs' performance in real-world domains is evaluated through their deep interaction with other LLMs in elaborately designed evaluation tasks; the framework is general and applies to a host of real-world tasks such as machine translation and code generation.
  • results: The effectiveness of the method is demonstrated through extensive experiments on four elaborately designed evaluation tasks.
    Abstract Large Language Models (LLMs) have made progress in various real-world tasks, which stimulates requirements for the evaluation of LLMs. Existing LLM evaluation methods are mainly supervised signal-based which depends on static datasets and cannot evaluate the ability of LLMs in dynamic real-world scenarios where deep interaction widely exists. Other LLM evaluation methods are human-based which are costly and time-consuming and are incapable of large-scale evaluation of LLMs. To address the issues above, we propose a novel Deep Interaction-based LLM-evaluation framework. In our proposed framework, LLMs' performances in real-world domains can be evaluated from their deep interaction with other LLMs in elaborately designed evaluation tasks. Furthermore, our proposed framework is a general evaluation method that can be applied to a host of real-world tasks such as machine translation and code generation. We demonstrate the effectiveness of our proposed method through extensive experiments on four elaborately designed evaluation tasks.

Active Learning for Classifying 2D Grid-Based Level Completability

  • paper_url: http://arxiv.org/abs/2309.04367
  • repo_url: https://github.com/mahsabazzaz/level-completabilty-x-active-learning
  • paper_authors: Mahsa Bazzaz, Seth Cooper
  • for: Uses active learning to assess the completability of levels produced by procedural generators.
  • methods: Deep-learning models are trained to classify level completability, with active learning used to select which levels to label.
  • results: Labeling levels via active learning yields better classifier performance than random queries for the same amount of labeled data.
    Abstract Determining the completability of levels generated by procedural generators such as machine learning models can be challenging, as it can involve the use of solver agents that often require a significant amount of time to analyze and solve levels. Active learning is not yet widely adopted in game evaluations, although it has been used successfully in natural language processing, image and speech recognition, and computer vision, where the availability of labeled data is limited or expensive. In this paper, we propose the use of active learning for learning level completability classification. Through an active learning approach, we train deep-learning models to classify the completability of generated levels for Super Mario Bros., Kid Icarus, and a Zelda-like game. We compare querying levels to label for completability via active learning against random queries. Our results show that using an active learning approach to label levels results in better classifier performance with the same amount of labeled data.
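A generic uncertainty-sampling loop of the kind used in such experiments; the paper's feature extraction and acquisition details are not reproduced. This sketch queries the level whose predicted completability is closest to 0.5, with random features standing in for real level data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(1000, 32))                                   # stand-in level features
y_pool = (X_pool[:, 0] + 0.3 * rng.normal(size=1000) > 0).astype(int)  # completable or not

# Seed the labeled set with a few levels of each class.
labeled = list(np.where(y_pool == 0)[0][:5]) + list(np.where(y_pool == 1)[0][:5])
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

for _ in range(10):
    clf = LogisticRegression().fit(X_pool[labeled], y_pool[labeled])
    probs = clf.predict_proba(X_pool[unlabeled])[:, 1]
    # Query the most uncertain level (closest to the decision boundary).
    pick = unlabeled[int(np.argmin(np.abs(probs - 0.5)))]
    labeled.append(pick)             # in practice, a solver agent provides this label
    unlabeled.remove(pick)

print("accuracy:", clf.score(X_pool, y_pool))
```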

Systematic Review of Techniques in Brain Image Synthesis using Deep Learning

  • paper_url: http://arxiv.org/abs/2309.04511
  • repo_url: None
  • paper_authors: Shubham Singh, Ammar Ranapurwala, Mrunal Bewoor, Sheetal Patil, Satyam Rai
  • for: Reviews the present state of medical imaging, with a specific focus on the use of deep learning techniques for brain image synthesis.
  • methods: Examines various methods and techniques, including 2D-to-3D constructions, MRI synthesis, and the use of transformers.
  • results: Addresses the limitations and challenges of these methods, such as obtaining well-curated training data and brain ultrasound issues, and explores the future potential of deep learning techniques in medical imaging.
    Abstract This review paper delves into the present state of medical imaging, with a specific focus on the use of deep learning techniques for brain image synthesis. The need for medical image synthesis to improve diagnostic accuracy and decrease invasiveness in medical procedures is emphasized, along with the role of deep learning in enabling these advancements. The paper examines various methods and techniques for brain image synthesis, including 2D to 3D constructions, MRI synthesis, and the use of transformers. It also addresses limitations and challenges faced in these methods, such as obtaining well-curated training data and addressing brain ultrasound issues. The review concludes by exploring the future potential of this field and the opportunities for further advancements in medical imaging using deep learning techniques. The significance of transformers and their potential to revolutionize the medical imaging field is highlighted. Additionally, the paper discusses the potential solutions to the shortcomings and limitations faced in this field. The review provides researchers with an updated reference on the present state of the field and aims to inspire further research and bridge the gap between the present state of medical imaging and the future possibilities offered by deep learning techniques.

Zero-Shot Robustification of Zero-Shot Models With Foundation Models

  • paper_url: http://arxiv.org/abs/2309.04344
  • repo_url: https://github.com/sprocketlab/roboshot
  • paper_authors: Dyah Adila, Changho Shin, Linrong Cai, Frederic Sala
  • for: Improving the robustness of pretrained models for zero-shot inference, without fine-tuning.
  • methods: Zero-shot language models are used to obtain useful insights from task descriptions; these insights are embedded and used to remove harmful and boost useful components in the pretrained model's embeddings, without any supervision.
  • results: Evaluated on nine image and NLP classification tasks, with an average improvement of 15.98% over several zero-shot baselines; RoboShot is also shown to be compatible with a variety of pretrained and language models.
    Abstract Zero-shot inference is a powerful paradigm that enables the use of large pretrained models for downstream classification tasks without further training. However, these models are vulnerable to inherited biases that can impact their performance. The traditional solution is fine-tuning, but this undermines the key advantage of pretrained models, which is their ability to be used out-of-the-box. We propose RoboShot, a method that improves the robustness of pretrained model embeddings in a fully zero-shot fashion. First, we use zero-shot language models (LMs) to obtain useful insights from task descriptions. These insights are embedded and used to remove harmful and boost useful components in embeddings -- without any supervision. Theoretically, we provide a simple and tractable model for biases in zero-shot embeddings and give a result characterizing under what conditions our approach can boost performance. Empirically, we evaluate RoboShot on nine image and NLP classification tasks and show an average improvement of 15.98% over several zero-shot baselines. Additionally, we demonstrate that RoboShot is compatible with a variety of pretrained and language models.
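One plausible reading of "remove harmful and boost useful components" is projection arithmetic on the embedding: project out directions associated with spurious concepts and amplify directions associated with task-relevant ones. All names below are illustrative; in practice the concept directions would come from LM-generated text embedded by the same encoder.

```python
import numpy as np

def _unit(v):
    return v / np.linalg.norm(v)

def robustify(z, harmful_dirs, helpful_dirs, boost=1.0):
    """Remove harmful components from embedding z, then boost helpful ones."""
    z = z.copy()
    for v in harmful_dirs:
        u = _unit(v)
        z -= (z @ u) * u                 # project out the spurious direction
    for v in helpful_dirs:
        u = _unit(v)
        z += boost * (z @ u) * u         # amplify the task-relevant direction
    return z / np.linalg.norm(z)

z = np.random.randn(512)                 # e.g., a CLIP image embedding
spurious = [np.random.randn(512)]        # e.g., embedding of "image background"
useful = [np.random.randn(512)]          # e.g., embedding of the class description
z_robust = robustify(z, spurious, useful)
```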

Online Submodular Maximization via Online Convex Optimization

  • paper_url: http://arxiv.org/abs/2309.04339
  • repo_url: None
  • paper_authors: Tareq Si-Salem, Gözde Özcan, Iasonas Nikolaou, Evimaria Terzi, Stratis Ioannidis
  • for: Studies online monotone submodular maximization under general matroid constraints.
  • methods: Shows that online optimization of a large class of submodular functions, namely weighted threshold potential functions, reduces to online convex optimization (OCO), because functions in this class admit a concave relaxation; OCO policies combined with an appropriate rounding scheme then achieve sublinear regret in the combinatorial setting.
  • results: The reduction extends to many versions of the online learning problem, including the dynamic-regret, bandit, and optimistic-learning settings.
    Abstract We study monotone submodular maximization under general matroid constraints in the online setting. We prove that online optimization of a large class of submodular functions, namely, weighted threshold potential functions, reduces to online convex optimization (OCO). This is precisely because functions in this class admit a concave relaxation; as a result, OCO policies, coupled with an appropriate rounding scheme, can be used to achieve sublinear regret in the combinatorial setting. We show that our reduction extends to many different versions of the online learning problem, including the dynamic regret, bandit, and optimistic-learning settings.
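As a representative member of the class (the paper treats a broader family, so take this as a hedged illustration), a budget-additive weighted threshold potential and its natural concave relaxation can be written as

```latex
f(S) = \min\Big( b,\; \textstyle\sum_{i \in S} w_i \Big), \qquad
\hat{f}(x) = \min\Big( b,\; \textstyle\sum_{i} w_i\, x_i \Big), \quad x \in [0,1]^{n}
```

Since $\hat{f}$ is a minimum of affine functions it is concave, so maximizing it online is an OCO problem; rounding the fractional point back to a feasible set of the matroid recovers the combinatorial guarantee.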

Graph Neural Networks Use Graphs When They Shouldn’t

  • paper_url: http://arxiv.org/abs/2309.04332
  • repo_url: https://github.com/mayabechlerspeicher/Graph_Neural_Networks_Overfit_Graphs
  • paper_authors: Maya Bechler-Speicher, Ido Amos, Ran Gilad-Bachrach, Amir Globerson
  • for: Investigates how Graph Neural Networks (GNNs) learn from the graph structure under different graph distributions, including cases where the structure is uninformative for the predictive task.
  • methods: GNNs are trained on graph data where a better solution can be obtained by ignoring the graph structure, and a graph-editing method is proposed to mitigate GNNs' tendency to overfit unneeded graph structure.
  • results: GNNs tend to overfit the graph structure, using it even when ignoring it would be better, and regular graphs are more robust to this overfitting. A theoretical explanation based on the implicit bias of gradient-descent-based learning is provided, and the proposed graph-editing method improves GNN accuracy across multiple benchmarks.
    Abstract Predictions over graphs play a crucial role in various domains, including social networks, molecular biology, medicine, and more. Graph Neural Networks (GNNs) have emerged as the dominant approach for learning on graph data. Instances of graph labeling problems consist of the graph-structure (i.e., the adjacency matrix), along with node-specific feature vectors. In some cases, this graph-structure is non-informative for the predictive task. For instance, molecular properties such as molar mass depend solely on the constituent atoms (node features), and not on the molecular structure. While GNNs have the ability to ignore the graph-structure in such cases, it is not clear that they will. In this work, we show that GNNs actually tend to overfit the graph-structure in the sense that they use it even when a better solution can be obtained by ignoring it. We examine this phenomenon with respect to different graph distributions and find that regular graphs are more robust to this overfitting. We then provide a theoretical explanation for this phenomenon, via analyzing the implicit bias of gradient-descent-based learning of GNNs in this setting. Finally, based on our empirical and theoretical findings, we propose a graph-editing method to mitigate the tendency of GNNs to overfit graph-structures that should be ignored. We show that this method indeed improves the accuracy of GNNs across multiple benchmarks.

Incremental Learning of Humanoid Robot Behavior from Natural Interaction and Large Language Models

  • paper_url: http://arxiv.org/abs/2309.04316
  • repo_url: None
  • paper_authors: Leonard Bärmann, Rainer Kartmann, Fabian Peller-Konrad, Alex Waibel, Tamim Asfour
  • for: Endowing robots with the ability to learn incrementally from natural-language interaction with humans, enabling intuitive human-robot collaboration.
  • methods: Large Language Models (LLMs) orchestrate the robot's behavior at a high level by generating Python statements in an interactive console; human instructions, environment observations, and execution results are fed back to the LLM to inform the next statement.
  • results: The system achieves incremental learning of complex behavior from interaction, evaluated quantitatively in simulation and qualitatively in simulation and the real world, demonstrating generalized incrementally-learned knowledge across tasks.
    Abstract Natural-language dialog is key for intuitive human-robot interaction. It can be used not only to express humans' intents, but also to communicate instructions for improvement if a robot does not understand a command correctly. Of great importance is to endow robots with the ability to learn from such interaction experience in an incremental way to allow them to improve their behaviors or avoid mistakes in the future. In this paper, we propose a system to achieve incremental learning of complex behavior from natural interaction, and demonstrate its implementation on a humanoid robot. Building on recent advances, we present a system that deploys Large Language Models (LLMs) for high-level orchestration of the robot's behavior, based on the idea of enabling the LLM to generate Python statements in an interactive console to invoke both robot perception and action. The interaction loop is closed by feeding back human instructions, environment observations, and execution results to the LLM, thus informing the generation of the next statement. Specifically, we introduce incremental prompt learning, which enables the system to interactively learn from its mistakes. For that purpose, the LLM can call another LLM responsible for code-level improvements of the current interaction based on human feedback. The improved interaction is then saved in the robot's memory, and thus retrieved on similar requests. We integrate the system in the robot cognitive architecture of the humanoid robot ARMAR-6 and evaluate our methods both quantitatively (in simulation) and qualitatively (in simulation and real-world) by demonstrating generalized incrementally-learned knowledge.
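A sketch of the interaction loop described above, capturing the control flow only; the LLM calls and the robot console are placeholders, not the ARMAR-6 API.

```python
# Sketch of the interactive-console loop the abstract describes. The LLM
# functions (`orchestrator_llm`, `improver_llm`) and `console_exec` are
# hypothetical stand-ins; only the control flow mirrors the description.

memory = {}  # maps a request to a previously improved interaction prompt

def run_episode(request, orchestrator_llm, improver_llm, console_exec):
    prompt = memory.get(request, request)   # retrieve improved prompt if known
    history = []
    while True:
        # The LLM emits the next Python statement (perception or action call).
        statement = orchestrator_llm(prompt, history)
        if statement == "DONE":
            break
        result = console_exec(statement)    # execute on the robot / simulator
        feedback = input(f"{statement} -> {result}; human feedback? ")
        if feedback:                        # incremental prompt learning:
            # a second LLM rewrites the interaction at the code level ...
            prompt = improver_llm(prompt, history, statement, feedback)
            memory[request] = prompt        # ... and the fix is memorised
        history.append((statement, result))
```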

Federated Learning for Early Dropout Prediction on Healthy Ageing Applications

  • paper_url: http://arxiv.org/abs/2309.04311
  • repo_url: None
  • paper_authors: Christos Chrysanthos Nikolaidis, Vasileios Perifanis, Nikolaos Pavlidis, Pavlos S. Efraimidis
  • for: Healthy-ageing (social care) applications that improve elderly users' quality of life and let operators provide early interventions, where accurate dropout prediction is directly tied to individual health status.
  • methods: Machine learning (ML) for dropout prediction, which outperforms traditional statistical methods, trained with federated machine learning (FML) so that models are trained in a distributed way without transferring personal data.
  • results: On a real-world dataset with non-iid client data, class imbalance, and label ambiguity, data selection and class-imbalance handling significantly improve FML models, which match or exceed traditional ML while easing privacy concerns.
    Abstract The provision of social care applications is crucial for elderly people to improve their quality of life and enables operators to provide early interventions. Accurate predictions of user dropouts in healthy ageing applications are essential since they are directly related to individual health statuses. Machine Learning (ML) algorithms have enabled highly accurate predictions, outperforming traditional statistical methods that struggle to cope with individual patterns. However, ML requires a substantial amount of data for training, which is challenging due to the presence of personal identifiable information (PII) and the fragmentation posed by regulations. In this paper, we present a federated machine learning (FML) approach that minimizes privacy concerns and enables distributed training, without transferring individual data. We employ collaborative training by considering individuals and organizations under FML, which models both cross-device and cross-silo learning scenarios. Our approach is evaluated on a real-world dataset with non-independent and identically distributed (non-iid) data among clients, class imbalance and label ambiguity. Our results show that data selection and class imbalance handling techniques significantly improve the predictive accuracy of models trained under FML, demonstrating comparable or superior predictive performance than traditional ML models.
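A minimal FedAvg sketch with per-client class weighting, illustrating the two ideas the abstract highlights: no raw records leave a client, and imbalance is handled locally. The model and data are toys, not the paper's setup.

```python
import numpy as np

def local_step(w, X, y, lr=0.1, pos_weight=1.0):
    """One epoch of class-weighted logistic regression on a client's data."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    sample_w = np.where(y == 1, pos_weight, 1.0)      # handle class imbalance
    grad = X.T @ (sample_w * (p - y)) / len(y)
    return w - lr * grad

def fed_avg(clients, rounds=50, dim=5):
    """FedAvg: only model weights leave the clients, never the raw records."""
    w = np.zeros(dim)
    for _ in range(rounds):
        updates, sizes = [], []
        for X, y in clients:                          # non-iid local datasets
            pw = max(1.0, (y == 0).sum() / max(1, (y == 1).sum()))
            updates.append(local_step(w.copy(), X, y, pos_weight=pw))
            sizes.append(len(y))
        w = np.average(updates, axis=0, weights=sizes)  # size-weighted average
    return w

rng = np.random.default_rng(1)
clients = []
for _ in range(4):
    X = rng.normal(size=(40, 5))
    y = (X[:, 0] + 0.3 * rng.normal(size=40) > 0.8).astype(float)  # imbalanced
    clients.append((X, y))
print(fed_avg(clients))
```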

  • paper_url: http://arxiv.org/abs/2309.04296
  • repo_url: None
  • paper_authors: Arian Prabowo, Kaixuan Chen, Hao Xue, Subbu Sethuvenkatraman, Flora D. Salim
  • for: Energy load forecasting during Out-of-Distribution periods such as the COVID-19 lockdowns, using continual learning to keep models up to date as the data distribution shifts.
  • methods: The continual learning algorithm FSNet, updated with new data and with privacy-preserving human mobility data from pedestrian counters, evaluated on real-world building data.
  • results: Continual learning maintains accurate energy forecasts during Out-of-Distribution periods; models with at least online learning adapt to lockdown-era shifts far better than conventional methods.
    Abstract In traditional deep learning algorithms, one of the key assumptions is that the data distribution remains constant during both training and deployment. However, this assumption becomes problematic when faced with Out-of-Distribution periods, such as the COVID-19 lockdowns, where the data distribution significantly deviates from what the model has seen during training. This paper employs a two-fold strategy: utilizing continual learning techniques to update models with new data and harnessing human mobility data collected from privacy-preserving pedestrian counters located outside buildings. In contrast to online learning, which suffers from 'catastrophic forgetting' as newly acquired knowledge often erases prior information, continual learning offers a holistic approach by preserving past insights while integrating new data. This research applies FSNet, a powerful continual learning algorithm, to real-world data from 13 building complexes in Melbourne, Australia, a city which had the second longest total lockdown duration globally during the pandemic. Results underscore the crucial role of continual learning in accurate energy forecasting, particularly during Out-of-Distribution periods. Secondary data such as mobility and temperature provided ancillary support to the primary forecasting model. More importantly, while traditional methods struggled to adapt during lockdowns, models featuring at least online learning demonstrated resilience, with lockdown periods posing fewer challenges once armed with adaptive learning techniques. This study contributes valuable methodologies and insights to the ongoing effort to improve energy load forecasting during future Out-of-Distribution periods.
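The continual-learning ingredient can be illustrated with a toy online learner that mixes replayed past samples into every update; this stands in for FSNet only at the level of the training loop, and all names are illustrative.

```python
import numpy as np

class ContinualForecaster:
    """Toy linear forecaster updated online with a small replay buffer.
    Each new observation triggers an update, while replayed past samples
    preserve pre-lockdown knowledge instead of letting newly acquired
    knowledge overwrite it (catastrophic forgetting)."""

    def __init__(self, n_features, buffer_size=64, lr=0.01):
        self.w = np.zeros(n_features)
        self.buffer, self.buffer_size, self.lr = [], buffer_size, lr

    def predict(self, x):
        return self.w @ x

    def update(self, x, y, replay=8):
        self.buffer.append((x, y))
        self.buffer = self.buffer[-self.buffer_size:]
        idx = np.random.randint(len(self.buffer),
                                size=min(replay, len(self.buffer)))
        for i in idx:                    # mix old samples into each update
            xi, yi = self.buffer[i]
            self.w -= self.lr * (self.predict(xi) - yi) * xi
        self.w -= self.lr * (self.predict(x) - y) * x   # newest observation
```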

FIMO: A Challenge Formal Dataset for Automated Theorem Proving

  • paper_url: http://arxiv.org/abs/2309.04295
  • repo_url: None
  • paper_authors: Chengwu Liu, Jianhao Shen, Huajian Xin, Zhengying Liu, Ye Yuan, Haiming Wang, Wei Ju, Chuanyang Zheng, Yichun Yin, Lin Li, Ming Zhang, Qun Liu
  • for: Advancing automated theorem proving toward International Mathematical Olympiad (IMO) level.
  • methods: Initial experiments with GPT-4 to probe the limitations of current methods on the formal problem statements.
  • results: Current methods show substantial limitations, indicating a long road ahead before satisfactory IMO-level automated theorem proving.
    Abstract We present FIMO, an innovative dataset comprising formal mathematical problem statements sourced from the International Mathematical Olympiad (IMO) Shortlisted Problems. Designed to facilitate advanced automated theorem proving at the IMO level, FIMO is currently tailored for the Lean formal language. It comprises 149 formal problem statements, accompanied by both informal problem descriptions and their corresponding LaTeX-based informal proofs. Through initial experiments involving GPT-4, our findings underscore the existing limitations in current methodologies, indicating a substantial journey ahead before achieving satisfactory IMO-level automated theorem proving outcomes.

Fuzzy Fingerprinting Transformer Language-Models for Emotion Recognition in Conversations

  • paper_url: http://arxiv.org/abs/2309.04292
  • repo_url: None
  • paper_authors: Patrícia Pereira, Rui Ribeiro, Helena Moniz, Luisa Coheur, Joao Paulo Carvalho
  • for: Combining large language models with fuzzy fingerprints for emotion recognition in conversations.
  • methods: A pre-trained RoBERTa model supplies contextual utterance embeddings, which are fed to an adapted fuzzy fingerprint classification module.
  • results: State-of-the-art results on the widely used DailyDialog ERC benchmark with a much lighter model.
    Abstract Fuzzy Fingerprints have been successfully used as an interpretable text classification technique, but, like most other techniques, have been largely surpassed in performance by Large Pre-trained Language Models, such as BERT or RoBERTa. These models deliver state-of-the-art results in several Natural Language Processing tasks, namely Emotion Recognition in Conversations (ERC), but suffer from the lack of interpretability and explainability. In this paper, we propose to combine the two approaches to perform ERC, as a means to obtain simpler and more interpretable Large Language Models-based classifiers. We propose to feed the utterances and their previous conversational turns to a pre-trained RoBERTa, obtaining contextual embedding utterance representations, that are then supplied to an adapted Fuzzy Fingerprint classification module. We validate our approach on the widely used DailyDialog ERC benchmark dataset, in which we obtain state-of-the-art level results using a much lighter model.
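A rough sketch of the fuzzy-fingerprint idea on top of contextual embeddings. The paper's exact fingerprinting and similarity functions may differ; random vectors stand in for RoBERTa outputs so the sketch is self-contained.

```python
import numpy as np

def fingerprint(embs, k=20):
    """Fuzzy fingerprint of a class: its top-k embedding dimensions, each with
    a membership degree that decays with rank (a common fuzzification)."""
    ranks = np.argsort(-embs.mean(axis=0))[:k]
    return {int(d): 1.0 - i / k for i, d in enumerate(ranks)}

def similarity(fp_a, fp_b):
    """Fuzzy similarity: overlap of the two fingerprints' membership values."""
    shared = fp_a.keys() & fp_b.keys()
    return sum(min(fp_a[d], fp_b[d]) for d in shared)

def classify(utt_emb, class_fps, k=20):
    fp_u = fingerprint(utt_emb[None, :], k)
    return max(class_fps, key=lambda c: similarity(fp_u, class_fps[c]))

# Contextual utterance embeddings would come from RoBERTa in the paper;
# random class-biased vectors stand in here.
rng = np.random.default_rng(0)
embs = {e: rng.normal(size=(30, 128)) + rng.normal(size=128)
        for e in ("joy", "anger", "neutral")}
class_fps = {e: fingerprint(v) for e, v in embs.items()}
print(classify(embs["joy"][0], class_fps))
```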

Sequential Semantic Generative Communication for Progressive Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2309.04287
  • repo_url: None
  • paper_authors: Hyelin Nam, Jihong Park, Jinho Choi, Seong-Lyun Kim
  • for: A new communication-system framework built on multi-modal generative models, where successful communication conveys perceptual meaning expressed as a text prompt.
  • methods: The transmitter converts an image to text through a multi-modal generation process and the receiver reconstructs the image by the reverse process; each word in the sentence has a syntactic role carrying a particular piece of the information.
  • results: Conveying images as text reduces communication load while preserving meaning, and sending words sequentially, most informative first, further improves efficiency.
    Abstract This paper proposes new framework of communication system leveraging promising generation capabilities of multi-modal generative models. Regarding nowadays smart applications, successful communication can be made by conveying the perceptual meaning, which we set as text prompt. Text serves as a suitable semantic representation of image data as it has evolved to instruct an image or generate image through multi-modal techniques, by being interpreted in a manner similar to human cognition. Utilizing text can also reduce the overload compared to transmitting the intact data itself. The transmitter converts objective image to text through multi-model generation process and the receiver reconstructs the image using reverse process. Each word in the text sentence has each syntactic role, responsible for particular piece of information the text contains. For further efficiency in communication load, the transmitter sequentially sends words in priority of carrying the most information until reaches successful communication. Therefore, our primary focus is on the promising design of a communication system based on image-to-text transformation and the proposed schemes for sequentially transmitting word tokens. Our work is expected to pave a new road of utilizing state-of-the-art generative models to real communication systems

Spatial-Temporal Graph Attention Fuser for Calibration in IoT Air Pollution Monitoring Systems

  • paper_url: http://arxiv.org/abs/2309.04508
  • repo_url: None
  • paper_authors: Keivan Faghih Niresi, Mengjie Zhao, Hugo Bissig, Henri Baumann, Olga Fink
  • for: Improving the calibration accuracy of low-cost Internet of Things (IoT) air-pollution sensors operating in uncontrolled environments.
  • methods: A novel approach using graph neural networks, specifically a graph attention network module, to fuse data from sensor arrays.
  • results: Experiments show the approach significantly improves sensor calibration accuracy on IoT air-pollution monitoring platforms.
    Abstract The use of Internet of Things (IoT) sensors for air pollution monitoring has significantly increased, resulting in the deployment of low-cost sensors. Despite this advancement, accurately calibrating these sensors in uncontrolled environmental conditions remains a challenge. To address this, we propose a novel approach that leverages graph neural networks, specifically the graph attention network module, to enhance the calibration process by fusing data from sensor arrays. Through our experiments, we demonstrate the effectiveness of our approach in significantly improving the calibration accuracy of sensors in IoT air pollution monitoring platforms.
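A single-head graph-attention layer over a small sensor array, in plain numpy, to show the fusion mechanism the abstract refers to; the shapes and the fully connected sensor graph are assumptions for illustration.

```python
import numpy as np

def graph_attention(H, A, W, a_src, a_dst):
    """Single graph-attention head over a sensor-array graph (numpy sketch).
    H: (n, d) raw sensor readings; A: (n, n) adjacency of co-located sensors."""
    Z = H @ W                                         # projected features
    e = Z @ a_src[:, None] + (Z @ a_dst[:, None]).T   # pairwise attention logits
    e = np.where(A > 0, np.maximum(0.2 * e, e), -1e9) # LeakyReLU, mask non-edges
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)         # softmax over neighbours
    return alpha @ Z                                  # fused, calibrated features

rng = np.random.default_rng(0)
n, d = 6, 4                # six co-located low-cost sensors, four channels each
H = rng.normal(size=(n, d))
A = np.ones((n, n))        # fully connected sensor array
out = graph_attention(H, A, rng.normal(size=(d, d)),
                      rng.normal(size=d), rng.normal(size=d))
print(out.shape)
```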

LLMCad: Fast and Scalable On-device Large Language Model Inference

  • paper_url: http://arxiv.org/abs/2309.04255
  • repo_url: None
  • paper_authors: Daliang Xu, Wangsong Yin, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, Xuanzhe Liu
  • for: This paper aims to improve the efficiency of generative Natural Language Processing (NLP) tasks on mobile devices.
  • methods: The proposed method, LLMCad, uses a compact LLM to generate candidate tokens and a high-precision LLM to validate them, with three novel techniques: token tree construction, a self-adjusting fallback strategy, and speculative token generation.
  • results: LLMCad achieves impressive token generation speeds, up to 9.3x faster than existing inference engines, making it a promising solution for on-device NLP tasks.
    Abstract Generative tasks, such as text generation and question answering, hold a crucial position in the realm of mobile applications. Due to their sensitivity to privacy concerns, there is a growing demand for their execution directly on mobile devices. Currently, the execution of these generative tasks heavily depends on Large Language Models (LLMs). Nevertheless, the limited memory capacity of these devices presents a formidable challenge to the scalability of such models. In our research, we introduce LLMCad, an innovative on-device inference engine specifically designed for efficient generative Natural Language Processing (NLP) tasks. The core idea behind LLMCad revolves around model collaboration: a compact LLM, residing in memory, takes charge of generating the most straightforward tokens, while a high-precision LLM steps in to validate these tokens and rectify any identified errors. LLMCad incorporates three novel techniques: (1) Instead of generating candidate tokens in a sequential manner, LLMCad employs the smaller LLM to construct a token tree, encompassing a wider range of plausible token pathways. Subsequently, the larger LLM can efficiently validate all of these pathways simultaneously. (2) It employs a self-adjusting fallback strategy, swiftly initiating the verification process whenever the smaller LLM generates an erroneous token. (3) To ensure a continuous flow of token generation, LLMCad speculatively generates tokens during the verification process by implementing a compute-IO pipeline. Through an extensive series of experiments, LLMCad showcases an impressive token generation speed, achieving rates up to 9.3x faster than existing inference engines.
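The small-drafts/large-verifies pattern behind LLMCad resembles speculative decoding; below is a toy sketch of that collaboration. Note that LLMCad additionally drafts a token *tree* and pipelines generation with verification, which this linear-chain version omits, and the two "models" are stand-ins.

```python
def speculative_generate(small_next, large_verify, prompt, steps=20, draft=4):
    """Small model drafts a chain of tokens; the large model checks them all
    in one pass, falling back to its own token at the first disagreement."""
    out = list(prompt)
    while len(out) < steps:
        chain = []
        for _ in range(draft):              # memory-resident small LLM drafts
            chain.append(small_next(out + chain))
        verdicts = large_verify(out, chain)  # high-precision LLM verifies all
        for tok, ok_tok in zip(chain, verdicts):
            if tok == ok_tok:
                out.append(tok)              # draft accepted
            else:
                out.append(ok_tok)           # self-adjusting fallback
                break
    return out
```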

Towards Reliable and Fluent Large Language Models: Incorporating Feedback Learning Loops in QA Systems

  • paper_url: http://arxiv.org/abs/2309.06384
  • repo_url: None
  • paper_authors: Dongyub Lee, Taesun Whang, Chanhee Lee, Heuiseok Lim
  • for: The paper aims to improve the utility and trustworthiness of large language models (LLMs) in various daily applications by addressing issues such as erroneous references, hallucinated information, and inadequate details.
  • methods: The study builds a dataset to train a critic model that evaluates the citation, correctness, and fluency of responses generated by LLMs in QA systems. It also proposes an automated feedback mechanism that leverages the critic model to offer real-time feedback on heterogeneous aspects of generated text, and introduces a feedback learning loop that uses the critic model to iteratively improve the performance of the LLM responsible for response generation.
  • results: The experimental results demonstrate the efficacy of the approach, showing substantial improvements in citation and fluency metrics for ChatGPT, including a 4% precision increase in citation and an approximately 8% enhancement in the MAUVE metric for fluency, while maintaining high levels of correctness.
    Abstract Large language models (LLMs) have emerged as versatile tools in various daily applications. However, they are fraught with issues that undermine their utility and trustworthiness. These include the incorporation of erroneous references (citation), the generation of hallucinated information (correctness), and the inclusion of superfluous or omission of crucial details (fluency). To ameliorate these concerns, this study makes several key contributions. First, we build a dataset to train a critic model capable of evaluating the citation, correctness, and fluency of responses generated by LLMs in QA systems. Second, we propose an automated feedback mechanism that leverages the critic model to offer real-time feedback on heterogeneous aspects of generated text. Third, we introduce a feedback learning loop that uses this critic model to iteratively improve the performance of the LLM responsible for response generation. Experimental results demonstrate the efficacy of our approach, showing substantial improvements in citation and fluency metrics for ChatGPT, including a 4% precision increase in citation and an approximately 8% enhancement in the MAUVE metric for fluency, while maintaining high levels of correctness.

Decoding visual brain representations from electroencephalography through Knowledge Distillation and latent diffusion models

  • paper_url: http://arxiv.org/abs/2309.07149
  • repo_url: None
  • paper_authors: Matteo Ferrante, Tommaso Boccato, Stefano Bargione, Nicola Toschi
  • for: Connecting neural signals with visual cognition.
  • methods: Electroencephalography (EEG) recordings from subjects viewing images are converted to spectrograms and used to classify and reconstruct the images, training a CNN with knowledge distillation from a Contrastive Language-Image Pre-Training (CLIP)-based image-classification teacher network.
  • results: The model reaches 80% top-5 accuracy, significantly outperforming a standard CNN and several RNN baselines, and generates image estimates from EEG activity via pre-trained latent diffusion models.
    Abstract Decoding visual representations from human brain activity has emerged as a thriving research domain, particularly in the context of brain-computer interfaces. Our study presents an innovative method that employs to classify and reconstruct images from the ImageNet dataset using electroencephalography (EEG) data from subjects that had viewed the images themselves (i.e. "brain decoding"). We analyzed EEG recordings from 6 participants, each exposed to 50 images spanning 40 unique semantic categories. These EEG readings were converted into spectrograms, which were then used to train a convolutional neural network (CNN), integrated with a knowledge distillation procedure based on a pre-trained Contrastive Language-Image Pre-Training (CLIP)-based image classification teacher network. This strategy allowed our model to attain a top-5 accuracy of 80%, significantly outperforming a standard CNN and various RNN-based benchmarks. Additionally, we incorporated an image reconstruction mechanism based on pre-trained latent diffusion models, which allowed us to generate an estimate of the images which had elicited EEG activity. Therefore, our architecture not only decodes images from neural activity but also offers a credible image reconstruction from EEG only, paving the way for e.g. swift, individualized feedback experiments. Our research represents a significant step forward in connecting neural signals with visual cognition.
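The distillation objective such a student-teacher setup typically uses is standard; here is a torch sketch with placeholder logits (40 classes, matching the 40 semantic categories), assuming the common Hinton-style mix of soft and hard targets rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Knowledge-distillation objective: the EEG-spectrogram CNN (student)
    matches the CLIP-based image classifier (teacher) on temperature-softened
    targets, plus the usual cross-entropy on hard labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Shapes only -- real inputs would be CNN logits from EEG spectrograms and
# teacher logits from the frozen CLIP-based classifier on the viewed images.
s = torch.randn(8, 40, requires_grad=True)   # 40 semantic categories
t = torch.randn(8, 40)
y = torch.randint(0, 40, (8,))
print(distillation_loss(s, t, y).item())
```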

UQ at #SMM4H 2023: ALEX for Public Health Analysis with Social Media

  • paper_url: http://arxiv.org/abs/2309.04213
  • repo_url: https://github.com/yanjiangjerry/alex
  • paper_authors: Yan Jiang, Ruihong Qiu, Yi Zhang, Zi Huang
  • for: This paper aims to improve the performance of public health analysis on social media by addressing the data imbalance issue and utilizing the ability of large language models (LLMs) effectively.
  • methods: The proposed ALEX framework uses a combination of data augmentation, balanced training, and proper prompting to improve the performance of LLMs in public health analysis on social media.
  • results: The ALEX model achieved the best performance among all submissions in Task 2 and Task 4, and a high score in Task 1, of the Social Media Mining for Health 2023 (SMM4H) challenge.
    Abstract As social media becomes increasingly popular, more and more activities related to public health emerge. Current techniques for public health analysis involve popular models such as BERT and large language models (LLMs). However, the costs of training in-domain LLMs for public health are especially expensive. Furthermore, such kinds of in-domain datasets from social media are generally imbalanced. To tackle these challenges, the data imbalance issue can be overcome by data augmentation and balanced training. Moreover, the ability of the LLMs can be effectively utilized by prompting the model properly. In this paper, a novel ALEX framework is proposed to improve the performance of public health analysis on social media by adopting an LLMs explanation mechanism. Results show that our ALEX model got the best performance among all submissions in both Task 2 and Task 4 with a high score in Task 1 in Social Media Mining for Health 2023 (SMM4H)[1]. Our code has been released at https:// github.com/YanJiangJerry/ALEX.

Towards Mitigating Architecture Overfitting in Dataset Distillation

  • paper_url: http://arxiv.org/abs/2309.04195
  • repo_url: None
  • paper_authors: Xuyang Zhong, Chen Liu
  • for: Improving neural-network performance when training data is very limited (distilled), where data synthesized by one training architecture generalizes poorly to other test architectures.
  • methods: A series of architecture designs and training schemes that, adopted together, boost generalization across different network architectures trained on distilled data.
  • results: Extensive experiments demonstrate the effectiveness and generality of the methods; across distilled-data scales they match or exceed existing methods when training larger-capacity networks.
    Abstract Dataset distillation methods have demonstrated remarkable performance for neural networks trained with very limited training data. However, a significant challenge arises in the form of architecture overfitting: the distilled training data synthesized by a specific network architecture (i.e., training network) generates poor performance when trained by other network architectures (i.e., test networks). This paper addresses this issue and proposes a series of approaches in both architecture designs and training schemes which can be adopted together to boost the generalization performance across different network architectures on the distilled training data. We conduct extensive experiments to demonstrate the effectiveness and generality of our methods. Particularly, across various scenarios involving different sizes of distilled data, our approaches achieve comparable or superior performance to existing methods when training on the distilled data using networks with larger capacities.

Knowledge-tuning Large Language Models with Structured Medical Knowledge Bases for Reliable Response Generation in Chinese

  • paper_url: http://arxiv.org/abs/2309.04175
  • repo_url: None
  • paper_authors: Haochun Wang, Sendong Zhao, Zewen Qiang, Zijian Li, Nuwa Xi, Yanrui Du, MuZhen Cai, Haoqiang Guo, Yuhan Chen, Haoming Xu, Bing Qin, Ting Liu
  • for: Improving the reliability of large language models in the medical domain, where limited domain knowledge can produce hallucinated medical facts.
  • methods: Knowledge-tuning, which leverages structured medical knowledge bases so that LLMs grasp domain knowledge efficiently and generate reliable responses; the cMedKnowQA dataset is released to assess medical knowledge proficiency.
  • results: Knowledge-tuned LLMs achieve higher accuracy in response generation than vanilla instruction-tuning, offering a reliable route for domain adaptation.
    Abstract Large Language Models (LLMs) have demonstrated remarkable success in diverse natural language processing (NLP) tasks in general domains. However, LLMs sometimes generate responses with the hallucination about medical facts due to limited domain knowledge. Such shortcomings pose potential risks in the utilization of LLMs within medical contexts. To address this challenge, we propose knowledge-tuning, which leverages structured medical knowledge bases for the LLMs to grasp domain knowledge efficiently and facilitate reliable response generation. We also release cMedKnowQA, a Chinese medical knowledge question-answering dataset constructed from medical knowledge bases to assess the medical knowledge proficiency of LLMs. Experimental results show that the LLMs which are knowledge-tuned with cMedKnowQA, can exhibit higher levels of accuracy in response generation compared with vanilla instruction-tuning and offer a new reliable way for the domain adaptation of LLMs.

Manifold-based Verbalizer Space Re-embedding for Tuning-free Prompt-based Classification

  • paper_url: http://arxiv.org/abs/2309.04174
  • repo_url: None
  • paper_authors: Haochun Wang, Sendong Zhao, Chi Liu, Nuwa Xi, Muzhen Cai, Bing Qin, Ting Liu
  • for: A tuning-free approach to prompt-based classification with high-dimensional verbalizer embeddings.
  • methods: Locally Linear Embedding with Intra-class Neighborhood Constraint (LLE-INC) re-embeds the verbalizer space on its manifold, preserving local properties within the same class as guidance for classification.
  • results: Without tuning any parameters, LLE-INC matches automated verbalizers that require parameter tuning; with parameter updating it improves prompt-based tuning by up to 3.2%, and experiments with LLaMA-7B&13B show it is an efficient tuning-free approach for hyper-scale language models.
    Abstract Prompt-based classification adapts tasks to a cloze question format utilizing the [MASK] token and the filled tokens are then mapped to labels through pre-defined verbalizers. Recent studies have explored the use of verbalizer embeddings to reduce labor in this process. However, all existing studies require a tuning process for either the pre-trained models or additional trainable embeddings. Meanwhile, the distance between high-dimensional verbalizer embeddings should not be measured by Euclidean distance due to the potential for non-linear manifolds in the representation space. In this study, we propose a tuning-free manifold-based space re-embedding method called Locally Linear Embedding with Intra-class Neighborhood Constraint (LLE-INC) for verbalizer embeddings, which preserves local properties within the same class as guidance for classification. Experimental results indicate that even without tuning any parameters, our LLE-INC is on par with automated verbalizers with parameter tuning. And with the parameter updating, our approach further enhances prompt-based tuning by up to 3.2%. Furthermore, experiments with the LLaMA-7B&13B indicate that LLE-INC is an efficient tuning-free classification approach for the hyper-scale language models.
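A numpy sketch of locally linear embedding restricted to intra-class neighbourhoods, which is the core constraint the method's name describes; the paper's exact formulation may differ in details such as neighbourhood size and regularization.

```python
import numpy as np

def lle_inc(X, y, k=5, d_out=2):
    """LLE where each point's neighbourhood is constrained to its own class
    (the intra-class constraint). X: (n, d) verbalizer embeddings, y: (n,)."""
    n = len(X)
    W = np.zeros((n, n))
    for i in range(n):
        same = np.where((y == y[i]) & (np.arange(n) != i))[0]
        nbrs = same[np.argsort(np.linalg.norm(X[same] - X[i], axis=1))[:k]]
        G = (X[nbrs] - X[i]) @ (X[nbrs] - X[i]).T        # local Gram matrix
        w = np.linalg.solve(G + 1e-3 * np.eye(len(nbrs)), np.ones(len(nbrs)))
        W[i, nbrs] = w / w.sum()                         # reconstruction weights
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d_out + 1]        # drop the constant bottom eigenvector

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, size=(20, 16)) for c in (0.0, 3.0, -3.0)])
y = np.repeat([0, 1, 2], 20)
print(lle_inc(X, y).shape)             # (60, 2) re-embedded verbalizer space
```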

Compositional Learning of Visually-Grounded Concepts Using Reinforcement

  • paper_url: http://arxiv.org/abs/2309.04504
  • repo_url: https://github.com/haidiazaman/rl-concept-learning-project
  • paper_authors: Zijun Lin, Haidi Azaman, M Ganesh Kumar, Cheston Tan
  • for: investigating how deep reinforcement learning agents learn and compose color-shape based combinatorial instructions to solve novel combinations in a spatial navigation task.
  • methods: using 3D environments and exploring compositional learning, frozen text encoders (e.g. CLIP, BERT), and pretraining on shape or color concepts separately.
  • results: agents pretrained on concept and compositional learning achieve significantly higher reward when evaluated zero-shot on novel color-shape1-shape2 visual object combinations, and a 20 times decrease in training episodes needed to solve unseen combinations of instructions.
    Abstract Deep reinforcement learning agents need to be trained over millions of episodes to decently solve navigation tasks grounded to instructions. Furthermore, their ability to generalize to novel combinations of instructions is unclear. Interestingly however, children can decompose language-based instructions and navigate to the referred object, even if they have not seen the combination of queries prior. Hence, we created three 3D environments to investigate how deep RL agents learn and compose color-shape based combinatorial instructions to solve novel combinations in a spatial navigation task. First, we explore if agents can perform compositional learning, and whether they can leverage on frozen text encoders (e.g. CLIP, BERT) to learn word combinations in fewer episodes. Next, we demonstrate that when agents are pretrained on the shape or color concepts separately, they show a 20 times decrease in training episodes needed to solve unseen combinations of instructions. Lastly, we show that agents pretrained on concept and compositional learning achieve significantly higher reward when evaluated zero-shot on novel color-shape1-shape2 visual object combinations. Overall, our results highlight the foundations needed to increase an agent's proficiency in composing word groups through reinforcement learning and its ability for zero-shot generalization to new combinations.

Leveraging Prototype Patient Representations with Feature-Missing-Aware Calibration to Mitigate EHR Data Sparsity

  • paper_url: http://arxiv.org/abs/2309.04160
  • repo_url: None
  • paper_authors: Yinghao Zhu, Zixiang Wang, Long He, Shiyun Xie, Zixi Chen, Jingkun An, Liantao Ma, Chengwei Pan
  • for: Mitigating the sparsity of electronic health record (EHR) data to improve the efficacy of predictive models.
  • methods: Indirect imputation that leverages prototype representations from similar patients to obtain denser embeddings, together with a feature-confidence learner module that judges the reliability of each (possibly missing) feature and a patient-similarity metric that accounts for feature confidence.
  • results: The model achieves statistically significant improvements over established EHR-focused models on in-hospital mortality prediction with the MIMIC-III and MIMIC-IV datasets; code is available at https://github.com/yhzhu99/SparseEHR for reproducibility.
    Abstract Electronic Health Record (EHR) data frequently exhibits sparse characteristics, posing challenges for predictive modeling. Current direct imputation such as matrix imputation approaches hinge on referencing analogous rows or columns to complete raw missing data and do not differentiate between imputed and actual values. As a result, models may inadvertently incorporate irrelevant or deceptive information with respect to the prediction objective, thereby compromising the efficacy of downstream performance. While some methods strive to recalibrate or augment EHR embeddings after direct imputation, they often mistakenly prioritize imputed features. This misprioritization can introduce biases or inaccuracies into the model. To tackle these issues, our work resorts to indirect imputation, where we leverage prototype representations from similar patients to obtain a denser embedding. Recognizing the limitation that missing features are typically treated the same as present ones when measuring similar patients, our approach designs a feature confidence learner module. This module is sensitive to the missing feature status, enabling the model to better judge the reliability of each feature. Moreover, we propose a novel patient similarity metric that takes feature confidence into account, ensuring that evaluations are not based merely on potentially inaccurate imputed values. Consequently, our work captures dense prototype patient representations with feature-missing-aware calibration process. Comprehensive experiments demonstrate that designed model surpasses established EHR-focused models with a statistically significant improvement on MIMIC-III and MIMIC-IV datasets in-hospital mortality outcome prediction task. The code is publicly available at \url{https://github.com/yhzhu99/SparseEHR} to assure the reproducibility.
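A simplified stand-in for the two ingredients the abstract names: a feature-confidence-weighted similarity (here a fixed weighting, whereas the paper learns the confidences) and a prototype embedding built from the most similar patients. All names and shapes are illustrative.

```python
import numpy as np

def confidence_weighted_similarity(a, b, conf_a, conf_b):
    """Patient similarity that down-weights dimensions derived from missing
    (imputed) features, so imputed values count less than observed ones.
    conf_* hold per-feature confidences in [0, 1]."""
    w = conf_a * conf_b                     # trust a dimension only if both do
    num = np.sum(w * a * b)
    den = np.sqrt(np.sum(w * a * a)) * np.sqrt(np.sum(w * b * b)) + 1e-9
    return num / den

def prototype_embedding(x, conf, cohort, cohort_conf, k=3):
    """Indirect imputation: represent a sparse record by the mean of its k
    most similar patients' records (a denser 'prototype' representation)."""
    sims = np.array([confidence_weighted_similarity(x, p, conf, c)
                     for p, c in zip(cohort, cohort_conf)])
    top = np.argsort(-sims)[:k]
    return cohort[top].mean(axis=0)

rng = np.random.default_rng(0)
cohort = rng.normal(size=(50, 10))
cohort_conf = rng.random((50, 10))
x, conf = rng.normal(size=10), (rng.random(10) > 0.5).astype(float)  # half missing
print(prototype_embedding(x, conf, cohort, cohort_conf))
```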

  • paper_url: http://arxiv.org/abs/2309.04146
  • repo_url: None
  • paper_authors: Kyoungyeon Cho, Seungkum Han, Wonseok Hwang
  • for: Large-scale statistical analysis of legal corpora, which can provide valuable legal insights.
  • methods: NESTLE, a no-code tool comprising a search engine, an end-to-end information extraction (IE) system, and a Large Language Model that glues the components together and provides the chat interface.
  • results: NESTLE reaches GPT-4-comparable IE performance after training the internal IE module with 4 human-labeled and 192 LLM-labeled examples, enabling customizable statistical analysis without writing code.
    Abstract The statistical analysis of large scale legal corpus can provide valuable legal insights. For such analysis one needs to (1) select a subset of the corpus using document retrieval tools, (2) structuralize text using information extraction (IE) systems, and (3) visualize the data for the statistical analysis. Each process demands either specialized tools or programming skills whereas no comprehensive unified "no-code" tools have been available. Especially for IE, if the target information is not predefined in the ontology of the IE system, one needs to build their own system. Here we provide NESTLE, a no code tool for large-scale statistical analysis of legal corpus. With NESTLE, users can search target documents, extract information, and visualize the structured data all via the chat interface with accompanying auxiliary GUI for the fine-level control. NESTLE consists of three main components: a search engine, an end-to-end IE system, and a Large Language Model (LLM) that glues the whole components together and provides the chat interface. Powered by LLM and the end-to-end IE system, NESTLE can extract any type of information that has not been predefined in the IE system opening up the possibility of unlimited customizable statistical analysis of the corpus without writing a single line of code. The use of the custom end-to-end IE system also enables faster and low-cost IE on large scale corpus. We validate our system on 15 Korean precedent IE tasks and 3 legal text classification tasks from LEXGLUE. The comprehensive experiments reveal NESTLE can achieve GPT-4 comparable performance by training the internal IE module with 4 human-labeled, and 192 LLM-labeled examples. The detailed analysis provides the insight on the trade-off between accuracy, time, and cost in building such system.

Trustworthy and Synergistic Artificial Intelligence for Software Engineering: Vision and Roadmaps

  • paper_url: http://arxiv.org/abs/2309.04142
  • repo_url: None
  • paper_authors: David Lo
  • for: This paper aims to provide a comprehensive overview of the current state and future directions of Artificial Intelligence for Software Engineering (AI4SE), with a focus on realizing trustworthy and synergistic AI4SE.
  • methods: The paper uses a combination of literature review, analysis, and visioning to explore the current challenges and potential solutions for AI4SE, and to paint a vision for the future of software engineering.
  • results: The paper highlights the potential leaps that can be achieved if the key challenges of AI4SE are surmounted, including the transition towards Software Engineering 2.0, and provides two strategic roadmaps for realizing trustworthy and synergistic AI4SE.
    Abstract For decades, much software engineering research has been dedicated to devising automated solutions aimed at enhancing developer productivity and elevating software quality. The past two decades have witnessed an unparalleled surge in the development of intelligent solutions tailored for software engineering tasks. This momentum established the Artificial Intelligence for Software Engineering (AI4SE) area, which has swiftly become one of the most active and popular areas within the software engineering field. This Future of Software Engineering (FoSE) paper navigates through several focal points. It commences with a succinct introduction and history of AI4SE. Thereafter, it underscores the core challenges inherent to AI4SE, particularly highlighting the need to realize trustworthy and synergistic AI4SE. Progressing, the paper paints a vision for the potential leaps achievable if AI4SE's key challenges are surmounted, suggesting a transition towards Software Engineering 2.0. Two strategic roadmaps are then laid out: one centered on realizing trustworthy AI4SE, and the other on fostering synergistic AI4SE. While this paper may not serve as a conclusive guide, its intent is to catalyze further progress. The ultimate aspiration is to position AI4SE as a linchpin in redefining the horizons of software engineering, propelling us toward Software Engineering 2.0.

Proprioceptive External Torque Learning for Floating Base Robot and its Applications to Humanoid Locomotion

  • paper_url: http://arxiv.org/abs/2309.04138
  • repo_url: None
  • paper_authors: Daegyu Lim, Myeong-Ju Kim, Junhyeok Cha, Donghyeon Kim, Jaeheung Park
  • for: Stable locomotion and safe operation of humanoid robots while removing the cost, inertia, complexity, and failure risk that force-torque sensors add to the system.
  • methods: External joint torque is learned solely from proprioceptive sensors (encoders and IMUs) using a GRU network trained on random-walking data.
  • results: The trained network estimates external torque and contact wrench with significantly smaller errors than a model-based momentum observer (MOB) with friction modeling, supports zero moment point (ZMP) feedback control for stable walking, and stays consistent even when the feet and upper-body inertia change.
    Abstract The estimation of external joint torque and contact wrench is essential for achieving stable locomotion of humanoids and safety-oriented robots. Although the contact wrench on the foot of humanoids can be measured using a force-torque sensor (FTS), FTS increases the cost, inertia, complexity, and failure possibility of the system. This paper introduces a method for learning external joint torque solely using proprioceptive sensors (encoders and IMUs) for a floating base robot. For learning, the GRU network is used and random walking data is collected. Real robot experiments demonstrate that the network can estimate the external torque and contact wrench with significantly smaller errors compared to the model-based method, momentum observer (MOB) with friction modeling. The study also validates that the estimated contact wrench can be utilized for zero moment point (ZMP) feedback control, enabling stable walking. Moreover, even when the robot's feet and the inertia of the upper body are changed, the trained network shows consistent performance with a model-based calibration. This result demonstrates the possibility of removing FTS on the robot, which reduces the disadvantages of hardware sensors. The summary video is available at https://youtu.be/gT1D4tOiKpo.
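A minimal torch sketch of the learning setup: a GRU maps proprioceptive sequences to per-joint external torque, supervised by measured torques. The input and output sizes are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class ExternalTorqueEstimator(nn.Module):
    """GRU mapping proprioceptive sequences (encoder positions/velocities,
    IMU readings, commanded torques) to per-joint external torque, replacing
    a force-torque sensor. Sizes are illustrative."""
    def __init__(self, n_inputs=40, n_joints=12, hidden=128):
        super().__init__()
        self.gru = nn.GRU(n_inputs, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_joints)

    def forward(self, seq):                   # seq: (batch, time, n_inputs)
        h, _ = self.gru(seq)
        return self.head(h)                   # (batch, time, n_joints)

model = ExternalTorqueEstimator()
proprio = torch.randn(4, 200, 40)             # 4 windows of 200 timesteps
tau_ext = model(proprio)
loss = nn.functional.mse_loss(tau_ext, torch.zeros_like(tau_ext))
loss.backward()                               # supervised by measured torques
print(tau_ext.shape)
```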

Weakly Supervised Point Clouds Transformer for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2309.04105
  • repo_url: None
  • paper_authors: Zuojin Tang, Bo Sun, Tongwei Ma, Daosheng Li, Zhenhui Xu
  • for: A weakly supervised framework for a point-cloud transformer for 3D object detection, reducing the high cost of annotating 3D datasets and improving training efficiency.
  • methods: An Unsupervised Voting Proposal Module learns randomly preset anchor points and uses a voting network to select high-quality ones, then distills information into teacher and student networks; the student combines ResNet layers, which efficiently extract local features but can lose global information, with transformer self-attention for global context.
  • results: Experiments on the KITTI datasets achieve the highest average precision among the most recent weakly supervised 3D object detectors.
    Abstract The annotation of 3D datasets is required for semantic-segmentation and object detection in scene understanding. In this paper we present a framework for the weakly supervision of a point clouds transformer that is used for 3D object detection. The aim is to decrease the required amount of supervision needed for training, as a result of the high cost of annotating a 3D datasets. We propose an Unsupervised Voting Proposal Module, which learns randomly preset anchor points and uses voting network to select prepared anchor points of high quality. Then it distills information into student and teacher network. In terms of student network, we apply ResNet network to efficiently extract local characteristics. However, it also can lose much global information. To provide the input which incorporates the global and local information as the input of student networks, we adopt the self-attention mechanism of transformer to extract global features, and the ResNet layers to extract region proposals. The teacher network supervises the classification and regression of the student network using the pre-trained model on ImageNet. On the challenging KITTI datasets, the experimental results have achieved the highest level of average precision compared with the most recent weakly supervised 3D object detectors.

Modeling Recommender Ecosystems: Research Challenges at the Intersection of Mechanism Design, Reinforcement Learning and Generative Models

  • paper_url: http://arxiv.org/abs/2309.06375
  • repo_url: None
  • paper_authors: Craig Boutilier, Martin Mladenov, Guy Tennenholtz
  • for: A conceptual framework for modeling the incentives and behaviors of all actors in a recommender ecosystem, to increase the value the system brings to them and improve overall ecosystem health.
  • methods: Reinforcement learning for optimizing over long horizons, social choice for trading off utility across actors, mechanism design for reducing information asymmetry under strategic behavior, behavioral economics and psychology for better models of users and item providers, and generative/foundation models for making the mechanisms interpretable and actionable.
  • results: The paper articulates research challenges at the intersection of these disciplines, aimed at improving outcomes for users and item providers alike.
    Abstract Modern recommender systems lie at the heart of complex ecosystems that couple the behavior of users, content providers, advertisers, and other actors. Despite this, the focus of the majority of recommender research -- and most practical recommenders of any import -- is on the local, myopic optimization of the recommendations made to individual users. This comes at a significant cost to the long-term utility that recommenders could generate for its users. We argue that explicitly modeling the incentives and behaviors of all actors in the system -- and the interactions among them induced by the recommender's policy -- is strictly necessary if one is to maximize the value the system brings to these actors and improve overall ecosystem "health". Doing so requires: optimization over long horizons using techniques such as reinforcement learning; making inevitable tradeoffs in the utility that can be generated for different actors using the methods of social choice; reducing information asymmetry, while accounting for incentives and strategic behavior, using the tools of mechanism design; better modeling of both user and item-provider behaviors by incorporating notions from behavioral economics and psychology; and exploiting recent advances in generative and foundation models to make these mechanisms interpretable and actionable. We propose a conceptual framework that encompasses these elements, and articulate a number of research challenges that emerge at the intersection of these different disciplines.

Data-driven classification of low-power communication signals by an unauthenticated user using a software-defined radio

  • paper_url: http://arxiv.org/abs/2309.04088
  • repo_url: https://github.com/minds-code/jammingsdr
  • paper_authors: Tarun Rao Keshabhoina, Marcos M. Vasconcelos
  • for: Large-scale distributed multi-agent systems, particularly robotic network applications, that exchange state and control signals over low-power, unlicensed-spectrum links prone to eavesdropping and denial-of-service attacks.
  • methods: A structural pattern in the instantaneous-frequency representation of LoRa signals reduces joint inference of the unknown bandwidth and spreading factor to a classification problem, implemented efficiently with neural networks.
  • results: The paper shows that LoRa is vulnerable to denial-of-service attacks by an unauthenticated attacker who can successfully identify a target signal's bandwidth and spreading factor.
    Abstract Many large-scale distributed multi-agent systems exchange information over low-power communication networks. In particular, agents intermittently communicate state and control signals in robotic network applications, often with limited power over an unlicensed spectrum, prone to eavesdropping and denial-of-service attacks. In this paper, we argue that a widely popular low-power communication protocol known as LoRa is vulnerable to denial-of-service attacks by an unauthenticated attacker if it can successfully identify a target signal's bandwidth and spreading factor. Leveraging a structural pattern in the LoRa signal's instantaneous frequency representation, we relate the problem of jointly inferring the two unknown parameters to a classification problem, which can be efficiently implemented using neural networks.
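To see why the (bandwidth, spreading factor) pair is classifiable from a time-frequency view, one can synthesize LoRa-like up-chirps whose instantaneous-frequency slope encodes both parameters. This is a crude signal model for illustration, not a faithful LoRa PHY.

```python
import numpy as np

def lora_chirp(sf, bw, fs, n_symbols=4, seed=0):
    """Synthetic LoRa-like up-chirps: each symbol sweeps bw Hz over
    2**sf / bw seconds, so the (bw, sf) pair sets the slope of the
    instantaneous frequency -- the structural pattern a classifier learns."""
    t_sym = (2 ** sf) / bw
    t = np.arange(int(t_sym * fs)) / fs
    rng = np.random.default_rng(seed)
    sig = []
    for _ in range(n_symbols):
        f0 = rng.uniform(-bw / 2, bw / 2)                 # random starting bin
        f = ((f0 + bw * t / t_sym + bw / 2) % bw) - bw / 2
        sig.append(np.exp(2j * np.pi * np.cumsum(f) / fs))  # integrate freq
    return np.concatenate(sig)

fs = 500e3
x = lora_chirp(sf=7, bw=125e3, fs=fs)
# Instantaneous-frequency estimate; its slope reveals bw and sf, which is why
# a small network on the spectrogram can classify the two parameters.
inst_f = np.diff(np.unwrap(np.angle(x))) * fs / (2 * np.pi)
print(len(x), inst_f[:5])
```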

Curve Your Attention: Mixed-Curvature Transformers for Graph Representation Learning

  • paper_url: http://arxiv.org/abs/2309.04082
  • repo_url: None
  • paper_authors: Sungjun Cho, Seunghyuk Cho, Sungwoo Park, Hankook Lee, Honglak Lee, Moontae Lee
  • for: Learning representations of real-world graphs whose hierarchical or cyclical structures embed poorly in the typical Euclidean space.
  • methods: The Fully Product-Stereographic Transformer generalizes Transformers to operate entirely on products of constant-curvature spaces, learning the curvature appropriate for each input graph end-to-end, with a kernelized non-Euclidean attention whose time and memory cost is linear in the numbers of nodes and edges.
  • results: Experiments on graph reconstruction and node classification demonstrate the benefits of extending Transformers to the non-Euclidean domain.
    Abstract Real-world graphs naturally exhibit hierarchical or cyclical structures that are unfit for the typical Euclidean space. While there exist graph neural networks that leverage hyperbolic or spherical spaces to learn representations that embed such structures more accurately, these methods are confined under the message-passing paradigm, making the models vulnerable against side-effects such as oversmoothing and oversquashing. More recent work have proposed global attention-based graph Transformers that can easily model long-range interactions, but their extensions towards non-Euclidean geometry are yet unexplored. To bridge this gap, we propose Fully Product-Stereographic Transformer, a generalization of Transformers towards operating entirely on the product of constant curvature spaces. When combined with tokenized graph Transformers, our model can learn the curvature appropriate for the input graph in an end-to-end fashion, without the need of additional tuning on different curvature initializations. We also provide a kernelized approach to non-Euclidean attention, which enables our model to run in time and memory cost linear to the number of nodes and edges while respecting the underlying geometry. Experiments on graph reconstruction and node classification demonstrate the benefits of generalizing Transformers to the non-Euclidean domain.
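The constant-curvature building blocks such a model rests on can be written compactly in the kappa-stereographic model; here is a numpy sketch of gyro-addition and geodesic distance following the constant-curvature GNN literature (not the paper's code).

```python
import numpy as np

def mobius_add(x, y, k):
    """Gyro-addition in the kappa-stereographic model: hyperbolic for k < 0,
    spherical for k > 0, ordinary vector addition in the limit k -> 0."""
    xy, x2, y2 = x @ y, x @ x, y @ y
    num = (1 - 2 * k * xy - k * y2) * x + (1 + k * x2) * y
    return num / (1 - 2 * k * xy + k * k * x2 * y2)

def dist(x, y, k):
    """Geodesic distance d_k(x, y) = 2 * tan_k^{-1}(||(-x) (+) y||);
    reduces to (twice the) Euclidean distance as k -> 0."""
    n = np.linalg.norm(mobius_add(-x, y, k))
    if k > 0:
        return 2 / np.sqrt(k) * np.arctan(np.sqrt(k) * n)
    if k < 0:
        return 2 / np.sqrt(-k) * np.arctanh(np.sqrt(-k) * n)
    return 2 * n

x, y = np.array([0.1, 0.2]), np.array([-0.3, 0.05])
for k in (-1.0, 0.0, 1.0):          # hyperbolic / Euclidean / spherical
    print(k, dist(x, y, k))
```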

SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments

  • paper_url: http://arxiv.org/abs/2309.04077
  • repo_url: None
  • paper_authors: Abhinav Rajvanshi, Karan Sikka, Xiao Lin, Bhoram Lee, Han-Pang Chiu, Alvaro Velasquez
  • for: Enabling an autonomous agent to perform complex navigation tasks in unknown, large-scale environments.
  • methods: SayNav grounds large language models (LLMs) via a novel mechanism that incrementally builds a 3D scene graph of the explored environment and feeds it to the LLM, which generates feasible, contextually appropriate high-level plans; these are executed by a pre-trained low-level point-goal planner and continuously refined as new information is perceived.
  • results: On a new multi-object navigation task, SayNav achieves a 95.35% success rate versus 56.06% for an oracle-based point-nav baseline, highlighting its ability to plan dynamically in large new environments, and it also generalizes efficiently from simulation to real environments.
    Abstract Semantic reasoning and dynamic planning capabilities are crucial for an autonomous agent to perform complex navigation tasks in unknown environments. It requires a large amount of common-sense knowledge, that humans possess, to succeed in these tasks. We present SayNav, a new approach that leverages human knowledge from Large Language Models (LLMs) for efficient generalization to complex navigation tasks in unknown large-scale environments. SayNav uses a novel grounding mechanism, that incrementally builds a 3D scene graph of the explored environment as inputs to LLMs, for generating feasible and contextually appropriate high-level plans for navigation. The LLM-generated plan is then executed by a pre-trained low-level planner, that treats each planned step as a short-distance point-goal navigation sub-task. SayNav dynamically generates step-by-step instructions during navigation and continuously refines future steps based on newly perceived information. We evaluate SayNav on a new multi-object navigation task, that requires the agent to utilize a massive amount of human knowledge to efficiently search multiple different objects in an unknown environment. SayNav outperforms an oracle based Point-nav baseline, achieving a success rate of 95.35% (vs 56.06% for the baseline), under the ideal settings on this task, highlighting its ability to generate dynamic plans for successfully locating objects in large-scale new environments. In addition, SayNav also enables efficient generalization from simulation to real environments.
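
A hypothetical sketch of the grounding loop described above: an incrementally built scene graph is serialized into an LLM prompt that asks for the next high-level steps. The `SceneGraph` class, `plan_next_steps`, and the `llm` callable are invented for illustration and are not SayNav's actual API.

```python
class SceneGraph:
    """Minimal incremental 3D scene graph: rooms and the objects seen in them."""
    def __init__(self):
        self.rooms = {}  # room name -> list of (object, 3d position)

    def add_observation(self, room, obj, pos):
        self.rooms.setdefault(room, []).append((obj, pos))

    def to_prompt(self):
        lines = [f"{room}: " + ", ".join(o for o, _ in objs)
                 for room, objs in self.rooms.items()]
        return "Explored environment so far:\n" + "\n".join(lines)

def plan_next_steps(llm, graph, goal_objects):
    """Ask the LLM for a short high-level plan grounded in what has been seen.

    `llm` is any text->text callable (hypothetical interface)."""
    prompt = (graph.to_prompt()
              + f"\nGoal: find {', '.join(goal_objects)}."
              + "\nGive the next 3 navigation steps, one per line.")
    return llm(prompt).splitlines()

# Each returned step (e.g. "go to the kitchen counter") would be handed to a
# pre-trained point-goal policy as a short-distance sub-task; newly observed
# objects are added back into the graph before the plan is refined.
```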

Computationally Efficient Data-Driven Discovery and Linear Representation of Nonlinear Systems For Control

  • paper_url: http://arxiv.org/abs/2309.04074
  • repo_url: https://github.com/tiwari-research-group/koopman-control-no-decoder
  • paper_authors: Madhur Tiwari, George Nehma, Bethany Lusch
  • for: This work develops a data-driven framework based on Koopman operator theory for system identification and linearization of nonlinear systems for control.
  • methods: The proposed method is a deep learning framework with recursive learning; the resulting linear system is controlled with a linear quadratic controller.
  • results: The method is demonstrated on a pendulum system with simulations on noisy data, and is shown to train more efficiently and predict more accurately than an autoencoder-based baseline.
    Abstract This work focuses on developing a data-driven framework using Koopman operator theory for system identification and linearization of nonlinear systems for control. Our proposed method presents a deep learning framework with recursive learning. The resulting linear system is controlled using a linear quadratic control. An illustrative example using a pendulum system is presented with simulations on noisy data. We show that our proposed method is trained more efficiently and is more accurate than an autoencoder baseline.
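
A minimal sketch of the identification-then-control recipe, assuming the lifted (Koopman) coordinates are already given: a least-squares (EDMD-with-inputs) fit of a linear model, followed by an LQR gain on the identified system. The paper learns the lifting with a deep network trained recursively; the toy linear ground truth below only stands in for that.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(0)

# Toy ground truth: a stable linear system standing in for the lifted
# (Koopman) dynamics z_{k+1} = A z_k + B u_k learned by the deep network.
n, m, T = 4, 1, 500
A_true = 0.9 * np.eye(n) + 0.05 * rng.normal(size=(n, n))
B_true = rng.normal(size=(n, m))

Z = np.zeros((n, T)); U = rng.normal(size=(m, T))
for k in range(T - 1):  # noisy rollout, mimicking noisy training data
    Z[:, k + 1] = A_true @ Z[:, k] + B_true @ U[:, k] + 0.01 * rng.normal(size=n)

# EDMD with inputs: least-squares fit of z_{k+1} ~ A z_k + B u_k.
G = np.vstack([Z[:, :-1], U[:, :-1]])
AB = Z[:, 1:] @ np.linalg.pinv(G)
A, B = AB[:, :n], AB[:, n:]

# LQR on the identified linear model (Q = I, R = I): u_k = -K z_k.
P = solve_discrete_are(A, B, np.eye(n), np.eye(m))
K = np.linalg.solve(np.eye(m) + B.T @ P @ B, B.T @ P @ A)
print(np.round(K, 3))
```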

Inferring physical laws by artificial intelligence based causal models

  • paper_url: http://arxiv.org/abs/2309.04069
  • repo_url: None
  • paper_authors: Jorawar Singh, Kishor Bharti, Arvind
  • for: This paper examines how AI and machine learning can support scientific discovery, and how causal learning models can capture the cause-and-effect relationships underlying physical phenomena.
  • methods: The paper applies the principles of causal inference and interventions to study cause-and-effect relationships in several well-known physical phenomena, validating the model's reliability on each.
  • results: The study finds that the causal learning model not only captures associations among data but also correctly ascertains the cause-and-effect relations among variables, thereby strengthening (or weakening) confidence in the proposed model of the underlying physical process.
    Abstract The advances in Artificial Intelligence (AI) and Machine Learning (ML) have opened up many avenues for scientific research, and are adding new dimensions to the process of knowledge creation. However, even the most powerful and versatile ML applications to date are primarily in the domain of analysis of associations and boil down to complex data fitting. Judea Pearl has pointed out that Artificial General Intelligence must involve interventions involving the acts of doing and imagining. Any machine-assisted scientific discovery thus must include causal analysis and interventions. In this context, we propose a causal learning model of physical principles, which not only recognizes correlations but also brings out causal relationships. We use the principles of causal inference and interventions to study the cause-and-effect relationships in the context of some well-known physical phenomena. We show that this technique can not only figure out associations among data, but is also able to correctly ascertain the cause-and-effect relations amongst the variables, thereby strengthening (or weakening) our confidence in the proposed model of the underlying physical process.
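
A toy illustration (not the paper's model) of why interventions carry more information than correlations: in a two-variable structural model, forcing the cause shifts the effect, while forcing the effect leaves the cause untouched.

```python
import numpy as np
rng = np.random.default_rng(1)

# Toy structural model with X -> Y:  Y = 2.0 * X + noise.
def observe(n):
    x = rng.normal(size=n)
    return x, 2.0 * x + 0.1 * rng.normal(size=n)

def intervene_on_x(n, value):  # do(X = value): Y responds through the mechanism
    x = np.full(n, value)
    return x, 2.0 * x + 0.1 * rng.normal(size=n)

def intervene_on_y(n, value):  # do(Y = value): X keeps its own distribution
    return rng.normal(size=n), np.full(n, value)

# Correlation alone is symmetric, but interventions break the symmetry:
# forcing X shifts Y, while forcing Y leaves X unchanged -- so X causes Y.
_, y_do_x = intervene_on_x(1000, 3.0)
x_do_y, _ = intervene_on_y(1000, 3.0)
print(y_do_x.mean(), x_do_y.mean())  # ~6.0 vs ~0.0
```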

3D Denoisers are Good 2D Teachers: Molecular Pretraining via Denoising and Cross-Modal Distillation

  • paper_url: http://arxiv.org/abs/2309.04062
  • repo_url: None
  • paper_authors: Sungjun Cho, Dae-Woong Jeong, Sung Moon Ko, Jinwoo Kim, Sehui Han, Seunghoon Hong, Honglak Lee, Moontae Lee
  • for: This work aims to improve the accuracy and label-efficiency of molecular property prediction via self-supervised pretraining of molecular representations.
  • methods: The paper introduces D&D, a self-supervised molecular representation learning framework that pretrains a 2D graph encoder by distilling representations from a 3D denoiser, combining denoising with cross-modal knowledge distillation.
  • results: Experiments show that the graph encoder trained via D&D can infer 3D information from the 2D graph alone, and outperforms other baselines in both performance and label-efficiency on real-world molecular property prediction tasks.
    Abstract Pretraining molecular representations from large unlabeled data is essential for molecular property prediction due to the high cost of obtaining ground-truth labels. While there exist various 2D graph-based molecular pretraining approaches, these methods struggle to show statistically significant gains in predictive performance. Recent work has thus instead proposed 3D conformer-based pretraining under the task of denoising, which led to promising results. During downstream finetuning, however, models trained with 3D conformers require accurate atom-coordinates of previously unseen molecules, which are computationally expensive to acquire at scale. In light of this limitation, we propose D&D, a self-supervised molecular representation learning framework that pretrains a 2D graph encoder by distilling representations from a 3D denoiser. With denoising followed by cross-modal knowledge distillation, our approach enjoys use of knowledge obtained from denoising as well as painless application to downstream tasks with no access to accurate conformers. Experiments on real-world molecular property prediction datasets show that the graph encoder trained via D&D can infer 3D information based on the 2D graph and shows superior performance and label-efficiency against other baselines.
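
A hedged sketch of the cross-modal distillation objective: a frozen 3D denoiser acts as teacher and the 2D graph encoder is trained to match its embeddings. The placeholder `Linear` modules stand in for real GNN/equivariant encoders, and the exact loss used in D&D may differ.

```python
import torch
import torch.nn.functional as F

def dd_distillation_loss(graph_encoder, denoiser_3d, batch_2d, batch_3d):
    """One distillation training objective in the spirit of D&D (sketch).

    The frozen 3D teacher embeds conformers; the 2D student is trained to
    reproduce those embeddings from the molecular graph alone, so accurate
    conformers are never needed at downstream fine-tuning time.
    """
    with torch.no_grad():               # teacher pretrained via denoising
        target = denoiser_3d(batch_3d)  # (batch, dim) 3D representations
    pred = graph_encoder(batch_2d)      # (batch, dim) from 2D graphs only
    return F.mse_loss(pred, target)

# Placeholder modules standing in for real graph / equivariant encoders.
student = torch.nn.Linear(16, 32)
teacher = torch.nn.Linear(24, 32).eval()
loss = dd_distillation_loss(student, teacher, torch.randn(4, 16), torch.randn(4, 24))
loss.backward()
```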

cs.CL - 2023-09-08

Can NLP Models ‘Identify’, ‘Distinguish’, and ‘Justify’ Questions that Don’t have a Definitive Answer?

  • paper_url: http://arxiv.org/abs/2309.04635
  • repo_url: None
  • paper_authors: Ayushi Agarwal, Nisarg Patel, Neeraj Varshney, Mihir Parmar, Pavan Mallina, Aryan Bhavin Shah, Srihari Raju Sangaraju, Tirth Patel, Nihar Thakkar, Chitta Baral
  • for: investigate the ability of state-of-the-art NLP models to accurately identify and respond to questions that don’t have definitive answers.
  • methods: introduce a new dataset called QnotA, which consists of five categories of questions that don’t have definitive answers, and evaluate SOTA models including GPT-3 and Flan T5 on three evaluation tasks that test a system’s ability to identify, distinguish, and justify QnotA questions.
  • results: show that even SOTA models do not fare well on these tasks and lag considerably behind the human performance baseline, and conduct a thorough analysis that leads to several interesting findings.
    Abstract Though state-of-the-art (SOTA) NLP systems have achieved remarkable performance on a variety of language understanding tasks, they primarily focus on questions that have a correct and a definitive answer. However, in real-world applications, users often ask questions that don't have a definitive answer. Incorrectly answering such questions certainly hampers a system's reliability and trustworthiness. Can SOTA models accurately identify such questions and provide a reasonable response? To investigate the above question, we introduce QnotA, a dataset consisting of five different categories of questions that don't have definitive answers. Furthermore, for each QnotA instance, we also provide a corresponding QA instance i.e. an alternate question that ''can be'' answered. With this data, we formulate three evaluation tasks that test a system's ability to 'identify', 'distinguish', and 'justify' QnotA questions. Through comprehensive experiments, we show that even SOTA models including GPT-3 and Flan T5 do not fare well on these tasks and lag considerably behind the human performance baseline. We conduct a thorough analysis which further leads to several interesting findings. Overall, we believe our work and findings will encourage and facilitate further research in this important area and help develop more robust models.

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

  • paper_url: http://arxiv.org/abs/2309.04564
  • repo_url: None
  • paper_authors: Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker
  • for: To improve the development of large language models (LLMs), this work studies automatically pruning noisy web-scale pretraining corpora down to higher-quality subsets.
  • methods: The paper systematically compares scalable estimates of pretraining data quality -- perplexity, the Error L2-Norm, and memorization -- and uses them to rank and prune pretraining corpora before training LLMs on the pruned data.
  • results: Surprisingly, the simple perplexity estimator outperforms the more computationally expensive scoring methods; models trained on as little as 30% of the original training data improve over the no-pruning baseline.
    Abstract Large volumes of text data have contributed significantly to the development of large language models (LLMs) in recent years. This data is typically acquired by scraping the internet, leading to pretraining datasets comprised of noisy web text. To date, efforts to prune these datasets down to a higher quality subset have relied on hand-crafted heuristics encoded as rule-based filters. In this work, we take a wider view and explore scalable estimates of data quality that can be used to systematically measure the quality of pretraining data. We perform a rigorous comparison at scale of the simple data quality estimator of perplexity, as well as more sophisticated and computationally intensive estimates of the Error L2-Norm and memorization. These metrics are used to rank and prune pretraining corpora, and we subsequently compare LLMs trained on these pruned datasets. Surprisingly, we find that the simple technique of perplexity outperforms our more computationally expensive scoring methods. We improve over our no-pruning baseline while training on as little as 30% of the original training dataset. Our work sets the foundation for unexplored strategies in automatically curating high quality corpora and suggests the majority of pretraining data can be removed while retaining performance.
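
A minimal sketch of perplexity-based pruning using an off-the-shelf scoring model. GPT-2 is only a stand-in for the reference model, and which slice of the perplexity distribution to retain is itself a design choice studied in the paper; the bottom-30% cut here is purely illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in scoring model
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text):
    """Per-example perplexity under the reference LM (mean token NLL, exponentiated)."""
    ids = tok(text, return_tensors="pt", truncation=True).input_ids
    loss = lm(ids, labels=ids).loss
    return torch.exp(loss).item()

corpus = ["A clean, well-formed sentence about science.",
          "qq zz 1234 click here buy now !!!! ~~~"]
scored = sorted(corpus, key=perplexity)
kept = scored[: int(0.3 * len(scored)) or 1]  # keep the lowest-perplexity 30%
print(kept)
```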

Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

  • paper_url: http://arxiv.org/abs/2309.04561
  • repo_url: None
  • paper_authors: Ozan Unal, Christos Sakaridis, Suman Saha, Fisher Yu, Luc Van Gool
  • for: This work addresses localizing an object in a 3D scene referred to by a natural-language description, with applications ranging from autonomous indoor robotics to AR/VR.
  • methods: Rather than grounding-by-detection with bounding boxes, the paper tackles dense 3D visual grounding, i.e. referral-based 3D instance segmentation.
  • results: The proposed dense 3D grounding network, ConcreteNet, introduces three novel stand-alone modules that improve localization of challenging repetitive instances (instances with distractors of the same semantic class): a bottom-up attentive fusion module, a contrastive training scheme, and a learned global camera token. ConcreteNet ranks 1st on the ScanRefer online benchmark with a considerable +9.43% accuracy at 50% IoU, and won the ICCV 3rd Workshop on Language for 3D Scenes "3D Object Localization" challenge.
    Abstract 3D visual grounding is the task of localizing the object in a 3D scene which is referred by a description in natural language. With a wide range of applications ranging from autonomous indoor robotics to AR/VR, the task has recently risen in popularity. A common formulation to tackle 3D visual grounding is grounding-by-detection, where localization is done via bounding boxes. However, for real-life applications that require physical interactions, a bounding box insufficiently describes the geometry of an object. We therefore tackle the problem of dense 3D visual grounding, i.e. referral-based 3D instance segmentation. We propose a dense 3D grounding network ConcreteNet, featuring three novel stand-alone modules which aim to improve grounding performance for challenging repetitive instances, i.e. instances with distractors of the same semantic class. First, we introduce a bottom-up attentive fusion module that aims to disambiguate inter-instance relational cues, next we construct a contrastive training scheme to induce separation in the latent space, and finally we resolve view-dependent utterances via a learned global camera token. ConcreteNet ranks 1st on the challenging ScanRefer online benchmark by a considerable +9.43% accuracy at 50% IoU and has won the ICCV 3rd Workshop on Language for 3D Scenes "3D Object Localization" challenge.
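
A sketch of a contrastive term that induces separation in the latent space between the referred instance and same-class distractors. The InfoNCE form and the temperature are illustrative assumptions, not ConcreteNet's exact formulation.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(inst_emb, text_emb, temperature=0.07):
    """Pull each referred instance toward its language embedding and push away
    distractor instances (sketch of a contrastive separation term).

    inst_emb: (n, d) candidate instance embeddings, row i matching text row i.
    text_emb: (n, d) embeddings of the referring descriptions.
    """
    inst = F.normalize(inst_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)
    logits = text @ inst.t() / temperature  # (n, n) similarity matrix
    labels = torch.arange(inst.size(0))     # matching pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

inst_emb = torch.randn(8, 128, requires_grad=True)
loss = instance_contrastive_loss(inst_emb, torch.randn(8, 128))
loss.backward()
```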

Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges

  • paper_url: http://arxiv.org/abs/2309.04550
  • repo_url: None
  • paper_authors: Hiba Ahsan, Denis Jered McInerney, Jisoo Kim, Christopher Potter, Geoffrey Young, Silvio Amir, Byron C. Wallace
  • for: This work aims to use modern large language models (LLMs) to retrieve and summarize information from unstructured Electronic Health Record (EHR) data.
  • methods: A modern LLM (Flan-T5 XXL) is tasked, in a zero-shot setting, with inferring whether a patient has or is at risk of a particular condition and with summarizing the supporting evidence from the EHR.
  • results: Manual evaluation by radiologists shows that the LLM-based approach produces outputs consistently preferred over a standard information-retrieval baseline, but LLMs are prone to hallucinating evidence; the paper also provides a way to identify when this is happening via model confidence.
    Abstract Unstructured Electronic Health Record (EHR) data often contains critical information complementary to imaging data that would inform radiologists' diagnoses. However, time constraints and the large volume of notes frequently associated with individual patients renders manual perusal of such data to identify relevant evidence infeasible in practice. Modern Large Language Models (LLMs) provide a flexible means of interacting with unstructured EHR data, and may provide a mechanism to efficiently retrieve and summarize unstructured evidence relevant to a given query. In this work, we propose and evaluate an LLM (Flan-T5 XXL) for this purpose. Specifically, in a zero-shot setting we task the LLM to infer whether a patient has or is at risk of a particular condition; if so, we prompt the model to summarize the supporting evidence. Enlisting radiologists for manual evaluation, we find that this LLM-based approach provides outputs consistently preferred to a standard information retrieval baseline, but we also highlight the key outstanding challenge: LLMs are prone to hallucinating evidence. However, we provide results indicating that model confidence in outputs might indicate when LLMs are hallucinating, potentially providing a means to address this.

Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models

  • paper_url: http://arxiv.org/abs/2309.04461
  • repo_url: https://github.com/yangyi-chen/cotconsistency
  • paper_authors: Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, Ajay Divakaran
  • for: This paper explores the ability of vision-language models (VLMs) to demonstrate human-like reasoning based on perceived information, and evaluates their reasoning consistency using a chain-of-thought (CoT) based consistency measure.
  • methods: The paper proposes an LLM-Human-in-the-Loop pipeline to reduce the cost of evaluating VLMs' reasoning consistency, and builds the CURE benchmark to measure zero-shot reasoning performance and consistency. It also proposes a two-stage training framework, involving supervised fine-tuning and the incorporation of feedback from LLMs, to improve VLMs' reasoning performance and consistency.
  • results: The paper finds that even the best-performing VLM is unable to demonstrate strong visual reasoning capabilities and consistency, indicating that substantial efforts are needed before VLMs can perform visual reasoning as systematically and consistently as humans; it empirically highlights the effectiveness of the proposed two-stage framework.
    Abstract Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can parse natural queries about the visual content and generate human-like outputs. In this work, we explore the ability of these models to demonstrate human-like reasoning based on the perceived information. To address a crucial concern regarding the extent to which their reasoning capabilities are fully consistent and grounded, we also measure the reasoning consistency of these models. We achieve this by proposing a chain-of-thought (CoT) based consistency measure. However, such an evaluation requires a benchmark that encompasses both high-level inference and detailed reasoning chains, which is costly. We tackle this challenge by proposing a LLM-Human-in-the-Loop pipeline, which notably reduces cost while simultaneously ensuring the generation of a high-quality dataset. Based on this pipeline and the existing coarse-grained annotated dataset, we build the CURE benchmark to measure both the zero-shot reasoning performance and consistency of VLMs. We evaluate existing state-of-the-art VLMs, and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency, indicating that substantial efforts are required to enable VLMs to perform visual reasoning as systematically and consistently as humans. As an early step, we propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs. The first stage involves employing supervised fine-tuning of VLMs using step-by-step reasoning samples automatically generated by LLMs. In the second stage, we further augment the training process by incorporating feedback provided by LLMs to produce reasoning chains that are highly consistent and grounded. We empirically highlight the effectiveness of our framework in both reasoning performance and consistency.

CSPRD: A Financial Policy Retrieval Dataset for Chinese Stock Market

  • paper_url: http://arxiv.org/abs/2309.04389
  • repo_url: https://github.com/noewangjy/csprd_dataset
  • paper_authors: Jinyuan Wang, Hai Zhao, Zhong Wang, Zeyang Zhu, Jinhao Xie, Yong Yu, Yongjian Fei, Yue Huang, Dawei Cheng
  • for: This paper introduces a new task -- policy retrieval -- to address dense passage retrieval in specialized domains such as finance and economics, where expert-annotated datasets are scarce.
  • methods: The paper presents the Chinese Stock Policy Retrieval Dataset (CSPRD), with 700+ prospectus passages labeled by experienced experts against relevant articles from 10k+ entries in a collected Chinese policy corpus, and evaluates lexical, embedding, and fine-tuned bi-encoder models on it.
  • results: Experiments show the effectiveness of the proposed CSPRD while leaving ample room for improvement; the best-performing baseline achieves 56.1% MRR@10, 28.5% NDCG@10, 37.5% Recall@10, and 80.6% Precision@10 on the dev set.
    Abstract In recent years, great advances in pre-trained language models (PLMs) have sparked considerable research focus and achieved promising performance on the approach of dense passage retrieval, which aims at retrieving relevant passages from massive corpus with given questions. However, most of existing datasets mainly benchmark the models with factoid queries of general commonsense, while specialised fields such as finance and economics remain unexplored due to the deficiency of large-scale and high-quality datasets with expert annotations. In this work, we propose a new task, policy retrieval, by introducing the Chinese Stock Policy Retrieval Dataset (CSPRD), which provides 700+ prospectus passages labeled by experienced experts with relevant articles from 10k+ entries in our collected Chinese policy corpus. Experiments on lexical, embedding and fine-tuned bi-encoder models show the effectiveness of our proposed CSPRD yet also suggests ample potential for improvement. Our best performing baseline achieves 56.1% MRR@10, 28.5% NDCG@10, 37.5% Recall@10 and 80.6% Precision@10 on dev set.
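
For reference, the ranking metrics reported above can be computed as follows; this is a generic sketch, not the CSPRD evaluation script.

```python
def mrr_at_k(ranked_ids, gold_ids, k=10):
    """Mean reciprocal rank of the first relevant passage within the top-k."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_ids)

def recall_at_k(ranked_ids, gold_ids, k=10):
    """Fraction of gold passages retrieved within the top-k, averaged per query."""
    return sum(len(set(r[:k]) & set(g)) / len(g)
               for r, g in zip(ranked_ids, gold_ids)) / len(ranked_ids)

# Two toy queries: system rankings vs. expert-labelled relevant passage ids.
ranked = [["p3", "p1", "p9"], ["p2", "p7", "p4"]]
gold = [{"p1"}, {"p5"}]
print(mrr_at_k(ranked, gold), recall_at_k(ranked, gold))  # 0.25 0.5
```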

MoEController: Instruction-based Arbitrary Image Manipulation with Mixture-of-Expert Controllers

  • paper_url: http://arxiv.org/abs/2309.04372
  • repo_url: None
  • paper_authors: Sijia Li, Chen Chen, Haonan Lu
  • for: The research aims to propose a diffusion-model-based method for text-guided image generation with comprehensive zero-shot capabilities for open-domain image manipulation tasks.
  • methods: The method uses a mixture-of-expert (MOE) controller to align the text-guided capacity of diffusion models with different kinds of human instructions, enabling the model to handle various open-domain image manipulation tasks, both global and local.
  • results: Extensive experiments demonstrate that the approach performs surprisingly well on various image manipulation tasks when dealing with open-domain images and arbitrary human instructions.
    Abstract Diffusion-model-based text-guided image generation has recently made astounding progress, producing fascinating results in open-domain image manipulation tasks. Few models, however, currently have complete zero-shot capabilities for both global and local image editing due to the complexity and diversity of image manipulation tasks. In this work, we propose a method with a mixture-of-expert (MOE) controllers to align the text-guided capacity of diffusion models with different kinds of human instructions, enabling our model to handle various open-domain image manipulation tasks with natural language instructions. First, we use large language models (ChatGPT) and conditional image synthesis models (ControlNet) to generate a large number of global image transfer dataset in addition to the instruction-based local image editing dataset. Then, using an MOE technique and task-specific adaptation training on a large-scale dataset, our conditional diffusion model can edit images globally and locally. Extensive experiments demonstrate that our approach performs surprisingly well on various image manipulation tasks when dealing with open-domain images and arbitrary human instructions. Please refer to our project page: [https://oppo-mente-lab.github.io/moe_controller/]
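
A minimal torch sketch of the mixture-of-expert controller idea: a gate weights per-instruction expert outputs into a single control signal for the diffusion model. The module sizes and the linear experts are placeholder assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MOEController(nn.Module):
    """Route an instruction embedding to a mixture of expert controllers.

    Illustrative sketch: each expert would specialize in one family of edits
    (e.g. global style transfer vs. local region editing) and the gate
    weights their control signals per instruction.
    """
    def __init__(self, dim=64, n_experts=3):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, instruction_emb):
        w = torch.softmax(self.gate(instruction_emb), dim=-1)               # (b, e)
        outs = torch.stack([e(instruction_emb) for e in self.experts], 1)   # (b, e, d)
        return (w.unsqueeze(-1) * outs).sum(dim=1)  # control signal for diffusion

ctrl = MOEController()
print(ctrl(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```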

Encoding Multi-Domain Scientific Papers by Ensembling Multiple CLS Tokens

  • paper_url: http://arxiv.org/abs/2309.04333
  • repo_url: https://github.com/ronaldseoh/multi2spe
  • paper_authors: Ronald Seoh, Haw-Shiuan Chang, Andrew McCallum
  • for: This paper targets document-level tasks spanning multiple scientific domains, such as topic classification and citation prediction of scientific papers.
  • methods: The paper argues that using multiple CLS tokens lets a Transformer better specialize to multiple scientific domains, and presents Multi2SPE, which encourages each of multiple CLS tokens to learn diverse ways of aggregating token embeddings and then sums them into a single vector representation.
  • results: On the new multi-domain benchmark Multi-SciDocs, Multi2SPE reduces error by up to 25% in multi-domain citation prediction while requiring only a negligible amount of computation beyond one BERT forward pass.
    Abstract Many useful tasks on scientific documents, such as topic classification and citation prediction, involve corpora that span multiple scientific domains. Typically, such tasks are accomplished by representing the text with a vector embedding obtained from a Transformer's single CLS token. In this paper, we argue that using multiple CLS tokens could make a Transformer better specialize to multiple scientific domains. We present Multi2SPE: it encourages each of multiple CLS tokens to learn diverse ways of aggregating token embeddings, then sums them up together to create a single vector representation. We also propose our new multi-domain benchmark, Multi-SciDocs, to test scientific paper vector encoders under multi-domain settings. We show that Multi2SPE reduces error by up to 25 percent in multi-domain citation prediction, while requiring only a negligible amount of computation in addition to one BERT forward pass.
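
A hedged sketch of the multiple-CLS pooling described above: k learnable CLS tokens are prepended, encoded jointly with the document's tokens, and their outputs summed into one embedding. Multi2SPE additionally encourages the CLS tokens to aggregate diversely; that objective is omitted here.

```python
import torch
import torch.nn as nn

class MultiCLSPooler(nn.Module):
    """Prepend k CLS tokens and sum their contextualized outputs (sketch)."""
    def __init__(self, encoder_layer, dim=64, k=4):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(k, dim) * 0.02)  # k learnable CLS tokens
        self.encoder = encoder_layer
        self.k = k

    def forward(self, token_emb):  # (batch, seq, dim)
        b = token_emb.size(0)
        cls = self.cls.unsqueeze(0).expand(b, -1, -1)
        h = self.encoder(torch.cat([cls, token_emb], dim=1))
        return h[:, : self.k].sum(dim=1)  # single vector per paper

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
pooler = MultiCLSPooler(layer, dim=64, k=4)
print(pooler(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 64])
```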

From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting

  • paper_url: http://arxiv.org/abs/2309.04269
  • repo_url: None
  • paper_authors: Griffin Adams, Alexander Fabbri, Faisal Ladhak, Eric Lehman, Noémie Elhadad
  • for: This work aims to make GPT-4 summaries denser and more informative without sacrificing readability.
  • methods: The authors use a ``Chain of Density'' (CoD) prompt: GPT-4 first generates an entity-sparse summary and then iteratively incorporates missing salient entities without increasing the summary's length.
  • results: In a human preference study, GPT-4 summaries generated with CoD are preferred over those from a vanilla prompt and are almost as dense as human-written summaries; qualitative analysis indicates a tradeoff between informativeness and readability.
    Abstract Selecting the ``right'' amount of information to include in a summary is a difficult task. A good summary should be detailed and entity-centric without being overly dense and hard to follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with what we refer to as a ``Chain of Density'' (CoD) prompt. Specifically, GPT-4 generates an initial entity-sparse summary before iteratively incorporating missing salient entities without increasing the length. Summaries generated by CoD are more abstractive, exhibit more fusion, and have less of a lead bias than GPT-4 summaries generated by a vanilla prompt. We conduct a human preference study on 100 CNN DailyMail articles and find that humans prefer GPT-4 summaries that are more dense than those generated by a vanilla prompt and almost as dense as human written summaries. Qualitative analysis supports the notion that there exists a tradeoff between informativeness and readability. 500 annotated CoD summaries, as well as an extra 5,000 unannotated summaries, are freely available on HuggingFace (https://huggingface.co/datasets/griffin/chain_of_density).
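
A paraphrased Chain-of-Density-style prompt (not the authors' exact wording), to make the sparse-to-dense iteration concrete:

```python
# Paraphrase of the CoD idea; the published prompt differs in wording.
COD_PROMPT = """Article: {article}

You will write increasingly dense summaries of the article.
Repeat the following two steps 5 times:
Step 1. Identify 1-3 informative entities from the article that are
        missing from the previous summary.
Step 2. Rewrite the summary to cover every entity so far WITHOUT
        increasing its length: fuse sentences and compress wording
        instead of adding words.
Return all 5 summaries as a JSON list."""

def chain_of_density(llm, article):
    """`llm` is any text->text callable (hypothetical interface)."""
    return llm(COD_PROMPT.format(article=article))
```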

The CALLA Dataset: Probing LLMs’ Interactive Knowledge Acquisition from Chinese Medical Literature

  • paper_url: http://arxiv.org/abs/2309.04198
  • repo_url: https://github.com/scir-hi/huatuo-llama-med-chinese
  • paper_authors: Yanrui Du, Sendong Zhao, Muzhen Cai, Jianyu Chen, Haochun Wang, Yuhan Chen, Haoqiang Guo, Bing Qin
  • for: This study explores the application of large language models (LLMs) to the medical domain, particularly how Instruction Fine-Tuning (IFT) data can enrich LLMs' interactive medical knowledge.
  • methods: Using Chinese medical literature as a rich source of medical knowledge, the CALLA dataset probes LLMs' interactive knowledge acquisition via a free-dialogue fact-checking task. The authors identify a ``fact-following response'' phenomenon, in which LLMs tend to affirm facts mentioned in questions and are reluctant to challenge them; to eliminate the inaccurate evaluation this causes, they artificially construct test data from two perspectives, one consistent and one inconsistent with the golden fact.
  • results: IFT data highly correlated with the medical literature corpus serves as a potent catalyst, enabling LLMs to skillfully employ the medical knowledge acquired during pre-training in interactive scenarios and enhancing accuracy. The authors also design a framework for automatically constructing IFT data from medical literature and discuss some real-world applications.
    Abstract The application of Large Language Models (LLMs) to the medical domain has stimulated the interest of researchers. Recent studies have focused on constructing Instruction Fine-Tuning (IFT) data through medical knowledge graphs to enrich the interactive medical knowledge of LLMs. However, the medical literature serving as a rich source of medical knowledge remains unexplored. Our work introduces the CALLA dataset to probe LLMs' interactive knowledge acquisition from Chinese medical literature. It assesses the proficiency of LLMs in mastering medical knowledge through a free-dialogue fact-checking task. We identify a phenomenon called the ``fact-following response``, where LLMs tend to affirm facts mentioned in questions and display a reluctance to challenge them. To eliminate the inaccurate evaluation caused by this phenomenon, for the golden fact, we artificially construct test data from two perspectives: one consistent with the fact and one inconsistent with the fact. Drawing from the probing experiment on the CALLA dataset, we conclude that IFT data highly correlated with the medical literature corpus serves as a potent catalyst for LLMs, enabling themselves to skillfully employ the medical knowledge acquired during the pre-training phase within interactive scenarios, enhancing accuracy. Furthermore, we design a framework for automatically constructing IFT data based on medical literature and discuss some real-world applications.

GLS-CSC: A Simple but Effective Strategy to Mitigate Chinese STM Models’ Over-Reliance on Superficial Clue

  • paper_url: http://arxiv.org/abs/2309.04162
  • repo_url: None
  • paper_authors: Yanrui Du, Sendong Zhao, Yuhan Chen, Rai Bai, Jing Liu, Hua Wu, Haifeng Wang, Bing Qin
  • for: This work investigates Chinese Short Text Matching (STM) models' over-reliance on superficial clues, aiming to improve their robustness and generalization.
  • methods: The authors propose a novel resampling training strategy, Gradually Learn Samples Containing Superficial Clue (GLS-CSC), to mitigate STM models' over-reliance on the edit-distance feature, a superficial clue commonly used to measure the semantic similarity of Chinese text pairs.
  • results: Comprehensive evaluations on In-Domain (I.D.), Robustness (Rob.), and Out-Of-Domain (O.O.D.) test sets show that GLS-CSC outperforms existing methods in enhancing the robustness and generalization of Chinese STM models; a detailed analysis of existing methods further reveals their commonality.
    Abstract Pre-trained models have achieved success in Chinese Short Text Matching (STM) tasks, but they often rely on superficial clues, leading to a lack of robust predictions. To address this issue, it is crucial to analyze and mitigate the influence of superficial clues on STM models. Our study aims to investigate their over-reliance on the edit distance feature, commonly used to measure the semantic similarity of Chinese text pairs, which can be considered a superficial clue. To mitigate STM models' over-reliance on superficial clues, we propose a novel resampling training strategy called Gradually Learn Samples Containing Superficial Clue (GLS-CSC). Through comprehensive evaluations of In-Domain (I.D.), Robustness (Rob.), and Out-Of-Domain (O.O.D.) test sets, we demonstrate that GLS-CSC outperforms existing methods in terms of enhancing the robustness and generalization of Chinese STM models. Moreover, we conduct a detailed analysis of existing methods and reveal their commonality.
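
One way the gradual-admission idea could look in code; the linear schedule and the clue detector are illustrative assumptions, not the paper's exact procedure.

```python
import random

def gls_csc_epoch_sample(samples, has_superficial_clue, epoch, total_epochs):
    """Gradually admit samples containing the superficial clue (sketch).

    Early epochs train mostly on clue-free samples so the model cannot latch
    onto, e.g., low edit distance as a shortcut; clue-bearing samples are
    mixed back in as training progresses.
    """
    frac = epoch / max(1, total_epochs - 1)  # 0 -> 1 over the course of training
    clueful = [s for s in samples if has_superficial_clue(s)]
    clueless = [s for s in samples if not has_superficial_clue(s)]
    admitted = clueless + random.sample(clueful, int(frac * len(clueful)))
    random.shuffle(admitted)
    return admitted

# Example clue detector: flag sentence pairs whose label is guessable from
# character edit distance alone as "superficial".
```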

Cross-Utterance Conditioned VAE for Speech Generation

  • paper_url: http://arxiv.org/abs/2309.04156
  • repo_url: None
  • paper_authors: Yang Li, Cheng Yu, Guangzhi Sun, Weiqin Zu, Zheng Tian, Ying Wen, Wei Pan, Chao Zhang, Jun Wang, Yang Yang, Fanglei Sun
  • for: To improve the naturalness and expressiveness of speech generation, particularly for multimedia production.
  • methods: The Cross-Utterance Conditioned Variational Autoencoder (CUC-VAE) framework builds on pre-trained language models and variational autoencoders (VAEs) to extract context-sensitive prosodic features from surrounding utterances, more closely emulating human prosody generation.
  • results: On the LibriTTS dataset, the proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.
    Abstract Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. The CUC-VAE TTS is a direct application of the framework, designed to generate audio with contextual prosody derived from surrounding texts. On the other hand, the CUC-VAE SE algorithm leverages real mel spectrogram sampling conditioned on contextual information, producing audio that closely mirrors real sound and thereby facilitating flexible speech editing based on text such as deletion, insertion, and replacement. Experimental results on the LibriTTS datasets demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.

RST-style Discourse Parsing Guided by Document-level Content Structures

  • paper_url: http://arxiv.org/abs/2309.04141
  • repo_url: None
  • paper_authors: Ming Li, Ruihong Huang
  • for: This paper studies Rhetorical Structure Theory based Discourse Parsing (RST-DP), which explores how clauses, sentences, and larger text spans compose a whole discourse and represents the rhetorical structure as a hierarchical tree.
  • methods: The paper proposes a new RST parsing pipeline that leverages the News Discourse Profiling task to derive structure-aware news sentence representations carrying high-level, content-related information, adding only a few extra layers.
  • results: Experiments show that incorporating this high-level content-related information yields promising performance across various RST parsing metrics.
    Abstract Rhetorical Structure Theory based Discourse Parsing (RST-DP) explores how clauses, sentences, and large text spans compose a whole discourse and presents the rhetorical structure as a hierarchical tree. Existing RST parsing pipelines construct rhetorical structures without the knowledge of document-level content structures, which causes relatively low performance when predicting the discourse relations for large text spans. Recognizing the value of high-level content-related information in facilitating discourse relation recognition, we propose a novel pipeline for RST-DP that incorporates structure-aware news content sentence representations derived from the task of News Discourse Profiling. By incorporating only a few additional layers, this enhanced pipeline exhibits promising performance across various RST parsing metrics.

Down the Toxicity Rabbit Hole: Investigating PaLM 2 Guardrails

  • paper_url: http://arxiv.org/abs/2309.06415
  • repo_url: None
  • paper_authors: Adel Khorramrouz, Sujan Dutta, Arka Dutta, Ashiqur R. KhudaBukhsh
  • for: This work conducts a robustness audit of PaLM 2's safety feedback via a novel toxicity rabbit hole framework.
  • methods: Starting from a stereotype, the framework repeatedly instructs PaLM 2 to generate content more toxic than the previous iteration, until PaLM 2's safety guardrails throw a safety violation.
  • results: The experiments uncover highly disturbing antisemitic, Islamophobic, racist, homophobic, and misogynistic (among other) generated content that PaLM 2's safety guardrails do not evaluate as highly unsafe.
    Abstract This paper conducts a robustness audit of the safety feedback of PaLM 2 through a novel toxicity rabbit hole framework introduced here. Starting with a stereotype, the framework instructs PaLM 2 to generate more toxic content than the stereotype. In every subsequent iteration, it continues instructing PaLM 2 to generate more toxic content than the previous iteration until PaLM 2 safety guardrails throw a safety violation. Our experiments uncover highly disturbing antisemitic, Islamophobic, racist, homophobic, and misogynistic (to list a few) generated content that PaLM 2 safety guardrails do not evaluate as highly unsafe.
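
A skeleton of the rabbit-hole audit loop; `model`, `toxicity_score`, and the `SafetyViolation` signal are hypothetical interfaces standing in for the PaLM 2 API and an external toxicity classifier.

```python
class SafetyViolation(Exception):
    """Stands in for the guardrail signal returned by the model API."""

def toxicity_rabbit_hole(model, toxicity_score, seed_stereotype, max_iters=20):
    """Iteratively ask for more toxic content until a guardrail fires (sketch).

    `model(prompt)` returns text or raises SafetyViolation; `toxicity_score`
    verifies that the unblocked generations really do escalate.
    """
    transcript, current = [], seed_stereotype
    for _ in range(max_iters):
        try:
            current = model(f"Generate a more toxic statement than: {current}")
        except SafetyViolation:
            break  # guardrail fired: the audit path ends here
        transcript.append((current, toxicity_score(current)))
    return transcript  # unblocked content the guardrail allowed through
```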

Meta predictive learning model of natural languages

  • paper_url: http://arxiv.org/abs/2309.04106
  • repo_url: https://github.com/qjbtiger/meta-predictive-coding
  • paper_authors: Chan Li, Junbin Qiu, Haiping Huang
  • for: This paper studies the connection between brain computation and artificial language models, and the role of the predictive coding framework and credit assignment in language processing.
  • methods: The paper proposes a mean-field learning model within the predictive coding framework, assuming the synaptic weight of each connection follows a spike-and-slab distribution and training only the distribution.
  • results: The model is successfully validated on classifying handwritten digits (with pixels input in sequence) and on toy and real language corpora. It reveals that most connections become deterministic after learning while the output connections retain a higher level of variability, and that performance changes continuously with data load and improves with more training data, in analogy with the emergent behavior of large language models.
    Abstract Large language models based on self-attention mechanisms have achieved astonishing performances not only in natural language itself, but also in a variety of tasks of different nature. However, regarding processing language, our human brain may not operate using the same principle. Then, a debate is established on the connection between brain computation and artificial self-supervision adopted in large language models. One of most influential hypothesis in brain computation is the predictive coding framework, which proposes to minimize the prediction error by local learning. However, the role of predictive coding and the associated credit assignment in language processing remains unknown. Here, we propose a mean-field learning model within the predictive coding framework, assuming that the synaptic weight of each connection follows a spike and slab distribution, and only the distribution is trained. This meta predictive learning is successfully validated on classifying handwritten digits where pixels are input to the network in sequence, and on the toy and real language corpus. Our model reveals that most of the connections become deterministic after learning, while the output connections have a higher level of variability. The performance of the resulting network ensemble changes continuously with data load, further improving with more training data, in analogy with the emergent behavior of large language models. Therefore, our model provides a starting point to investigate the physics and biology correspondences of the language processing and the unexpected general intelligence.
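
A sketch of mean-field spike-and-slab weights: each synapse is zero with probability 1 - pi and Gaussian otherwise, and only the distribution parameters are trainable. Note the Bernoulli draw is not differentiable as written; training pi in practice needs a relaxation (e.g. Gumbel or straight-through), which this sketch omits.

```python
import torch

def sample_spike_slab(pi_logit, mu, log_sigma):
    """Draw a weight matrix from a spike-and-slab distribution (mean-field sketch).

    Each entry is zero (spike) with probability 1 - pi and Gaussian (slab)
    otherwise; (pi, mu, sigma) are the trainable distribution parameters.
    """
    pi = torch.sigmoid(pi_logit)
    spike = torch.bernoulli(pi)  # 1 -> slab, 0 -> spike
    slab = mu + torch.exp(log_sigma) * torch.randn_like(mu)
    return spike * slab

# After learning, pi near 0 or 1 marks connections that became deterministic.
pi_logit = torch.zeros(5, 3, requires_grad=True)
mu = torch.randn(5, 3, requires_grad=True)
log_sigma = torch.full((5, 3), -2.0, requires_grad=True)
print(sample_spike_slab(pi_logit, mu, log_sigma))
```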

Unsupervised Multi-document Summarization with Holistic Inference

  • paper_url: http://arxiv.org/abs/2309.04087
  • repo_url: None
  • paper_authors: Haopeng Zhang, Sangwoo Cho, Kaiqiang Song, Xiaoyang Wang, Hongwei Wang, Jiawei Zhang, Dong Yu
  • for: This paper proposes a new holistic framework for unsupervised multi-document extractive summarization.
  • methods: The method incorporates holistic beam search inference with a holistic measurement, the Subset Representative Index (SRI), which balances the importance and diversity of a subset of sentences from the source documents and can be calculated in unsupervised and adaptive manners.
  • results: On both small and large-scale multi-document summarization datasets, the proposed method outperforms strong baselines by a significant margin in ROUGE scores and diversity measures; the findings also suggest that diversity is essential for improving multi-document summarization performance.
    Abstract Multi-document summarization aims to obtain core information from a collection of documents written on the same topic. This paper proposes a new holistic framework for unsupervised multi-document extractive summarization. Our method incorporates the holistic beam search inference method associated with the holistic measurements, named Subset Representative Index (SRI). SRI balances the importance and diversity of a subset of sentences from the source documents and can be calculated in unsupervised and adaptive manners. To demonstrate the effectiveness of our method, we conduct extensive experiments on both small and large-scale multi-document summarization datasets under both unsupervised and adaptive settings. The proposed method outperforms strong baselines by a significant margin, as indicated by the resulting ROUGE scores and diversity measures. Our findings also suggest that diversity is essential for improving multi-document summary performance.
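
A sketch of an importance-vs-diversity selection in the spirit of the SRI (an MMR-style greedy trade-off); the exact SRI formula and its holistic beam-search inference differ.

```python
import numpy as np

def select_summary(sim, importance, k=3, lam=0.6):
    """Greedy importance-vs-diversity sentence selection (sketch).

    sim: (n, n) sentence-sentence similarity; importance: (n,) centrality.
    Each step adds the sentence maximizing importance minus redundancy with
    what has already been selected.
    """
    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(len(importance)):
            if i in selected:
                continue
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            score = lam * importance[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
S = rng.random((6, 6)); S = (S + S.T) / 2  # toy symmetric similarity matrix
print(select_summary(S, rng.random(6)))
```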

cs.LG - 2023-09-08

Probabilistic Safety Regions Via Finite Families of Scalable Classifiers

  • paper_url: http://arxiv.org/abs/2309.04627
  • repo_url: None
  • paper_authors: Alberto Carlevaro, Teodoro Alamo, Fabrizio Dabbene, Maurizio Mongelli
  • for: This work develops probabilistic certifications for machine learning classifiers, so that the number of misclassified instances over a region of the input space is probabilistically controlled.
  • methods: The notion of scalable classifiers is exploited to link the tuning of machine learning with error control, defining probabilistic safety regions via finite families of scalable classifiers.
  • results: Several tests corroborate the approach, both on synthetic data that highlights all the steps involved and on a smart mobility application.
    Abstract Supervised classification recognizes patterns in the data to separate classes of behaviours. Canonical solutions contain misclassification errors that are intrinsic to the numerical approximating nature of machine learning. The data analyst may minimize the classification error on a class at the expense of increasing the error of the other classes. The error control of such a design phase is often done in a heuristic manner. In this context, it is key to develop theoretical foundations capable of providing probabilistic certifications to the obtained classifiers. In this perspective, we introduce the concept of probabilistic safety region to describe a subset of the input space in which the number of misclassified instances is probabilistically controlled. The notion of scalable classifiers is then exploited to link the tuning of machine learning with error control. Several tests corroborate the approach. They are provided through synthetic data in order to highlight all the steps involved, as well as through a smart mobility application.
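
One concrete way to obtain probabilistic control of misclassification is quantile calibration of a classifier score on held-out data; this conformal-style sketch only illustrates the goal, while the paper develops formal certificates through families of scalable classifiers.

```python
import numpy as np

def calibrate_safety_region(scores, labels, epsilon=0.05):
    """Pick a decision threshold so the empirical miss rate on the controlled
    class is at most epsilon (sketch).

    scores: classifier scores on held-out calibration points;
    labels: 1 for the class whose misclassification we want to control.
    """
    pos = np.sort(scores[labels == 1])
    tau = pos[int(np.floor(epsilon * len(pos)))]  # epsilon-quantile threshold
    miss_rate = float(np.mean(scores[labels == 1] < tau))
    return tau, miss_rate  # {x : score(x) >= tau} is the safety region

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)]).astype(int)
print(calibrate_safety_region(scores, labels))  # miss_rate ~ 0.05
```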

Knowledge Distillation-Empowered Digital Twin for Anomaly Detection

  • paper_url: http://arxiv.org/abs/2309.04616
  • repo_url: None
  • paper_authors: Qinghua Xu, Shaukat Ali, Tao Yue, Zaimovic Nedim, Inderjeet Singh
  • for: This study proposes a novel method, KDDT, for anomaly detection in train control and management systems (TCMS).
  • methods: KDDT harnesses a language model (LM) and a long short-term memory (LSTM) network to extract context and chronological features, respectively; to enrich data volume, it leverages out-of-domain data via knowledge distillation (KD).
  • results: Evaluated on two datasets from industry partner Alstom, KDDT obtains F1 scores of 0.931 and 0.915, demonstrating its effectiveness. An empirical study of the individual contributions of the DT model, LM, and KD shows average F1 score improvements of 12.4%, 3%, and 6.05%, respectively.
    Abstract Cyber-physical systems (CPSs), like train control and management systems (TCMS), are becoming ubiquitous in critical infrastructures. As safety-critical systems, ensuring their dependability during operation is crucial. Digital twins (DTs) have been increasingly studied for this purpose owing to their capability of runtime monitoring and warning, prediction and detection of anomalies, etc. However, constructing a DT for anomaly detection in TCMS necessitates sufficient training data and extracting both chronological and context features with high quality. Hence, in this paper, we propose a novel method named KDDT for TCMS anomaly detection. KDDT harnesses a language model (LM) and a long short-term memory (LSTM) network to extract contexts and chronological features, respectively. To enrich data volume, KDDT benefits from out-of-domain data with knowledge distillation (KD). We evaluated KDDT with two datasets from our industry partner Alstom and obtained the F1 scores of 0.931 and 0.915, respectively, demonstrating the effectiveness of KDDT. We also explored individual contributions of the DT model, LM, and KD to the overall performance of KDDT, via a comprehensive empirical study, and observed average F1 score improvements of 12.4%, 3%, and 6.05%, respectively.
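
For reference, a standard knowledge-distillation objective of the kind the KD step relies on; the temperature and mixing weight below are illustrative defaults, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard knowledge-distillation objective (sketch).

    Mixes hard-label cross-entropy with a soft-target KL term at temperature T,
    transferring knowledge from a teacher trained on out-of-domain data.
    """
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    return alpha * hard + (1 - alpha) * soft

s = torch.randn(8, 2, requires_grad=True)  # student anomaly/normal logits
t = torch.randn(8, 2)                      # frozen teacher logits
kd_loss(s, t, torch.randint(0, 2, (8,))).backward()
```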

Self-optimizing Feature Generation via Categorical Hashing Representation and Hierarchical Reinforcement Crossing

  • paper_url: http://arxiv.org/abs/2309.04612
  • repo_url: https://github.com/yingwangyang/hrc_feature_cross
  • paper_authors: Wangyang Ying, Dongjie Wang, Kunpeng Liu, Leilei Sun, Yanjie Fu
  • for: This work aims to generate new and meaningful features that create a discriminative representation space.
  • methods: The paper proposes a principled and generic representation-crossing framework for self-optimizing feature generation, addressing three under-addressed challenges: meaningful, robust, and efficient generation. Hashing representation is achieved via feature discretization, feature hashing, and descriptive summarization; reinforcement crossing is achieved via a hierarchical reinforcement feature crossing approach.
  • results: Extensive experimental results demonstrate the effectiveness and efficiency of the proposed method. The code is available at https://github.com/yingwangyang/HRC_feature_cross.git.
    Abstract Feature generation aims to generate new and meaningful features to create a discriminative representation space. A generated feature is meaningful when the generated feature is from a feature pair with inherent feature interaction. In the real world, experienced data scientists can identify potentially useful feature-feature interactions, and generate meaningful dimensions from an exponentially large search space, in an optimal crossing form over an optimal generation path. But machines have limited human-like abilities. We generalize such learning tasks as self-optimizing feature generation. Self-optimizing feature generation imposes several under-addressed challenges on existing systems: meaningful, robust, and efficient generation. To tackle these challenges, we propose a principled and generic representation-crossing framework to solve self-optimizing feature generation. To achieve hashing representation, we propose a three-step approach: feature discretization, feature hashing, and descriptive summarization. To achieve reinforcement crossing, we develop a hierarchical reinforcement feature crossing approach. We present extensive experimental results to demonstrate the effectiveness and efficiency of the proposed method. The code is available at https://github.com/yingwangyang/HRC_feature_cross.git.
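
A minimal sketch of the discretize-then-hash step of the hashing representation; the binning scheme, hash function, and vector size are illustrative choices, and the descriptive-summarization step is omitted.

```python
import numpy as np

def discretize(col, n_bins=8):
    """Equal-frequency binning: continuous feature -> categorical codes."""
    edges = np.quantile(col, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(col, edges)

def hash_feature(codes, name, dim=32):
    """Hash a discretized feature into a fixed-size signed vector, so features
    of any cardinality share one compact representation space."""
    vec = np.zeros(dim)
    for code, count in zip(*np.unique(codes, return_counts=True)):
        h = hash((name, int(code)))
        vec[h % dim] += (1 if (h >> 1) % 2 else -1) * count
    return vec / len(codes)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
print(hash_feature(discretize(x), "feature_x"))
```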

Online Infinite-Dimensional Regression: Learning Linear Operators

  • paper_url: http://arxiv.org/abs/2309.06548
  • repo_url: None
  • paper_authors: Vinod Raman, Unique Subedi, Ambuj Tewari
  • for: This paper studies learning linear operators between two infinite-dimensional Hilbert spaces under squared loss in the online setting.
  • methods: The paper shows that the class of linear operators with uniformly bounded p-Schatten norm is online learnable for any $p \in [1, \infty)$, proves an impossibility result for the class of linear operators uniformly bounded in operator norm, and separates online uniform convergence from online learnability.
  • results: The class of uniformly bounded linear operators with respect to the operator norm is not online learnable; there exists a class of bounded linear operators that is online learnable even though uniform convergence does not hold; and both the impossibility result and the separation carry over to the agnostic PAC setting.
    Abstract We consider the problem of learning linear operators under squared loss between two infinite-dimensional Hilbert spaces in the online setting. We show that the class of linear operators with uniformly bounded $p$-Schatten norm is online learnable for any $p \in [1, \infty)$. On the other hand, we prove an impossibility result by showing that the class of uniformly bounded linear operators with respect to the operator norm is \textit{not} online learnable. Moreover, we show a separation between online uniform convergence and online learnability by identifying a class of bounded linear operators that is online learnable but uniform convergence does not hold. Finally, we prove that the impossibility result and the separation between uniform convergence and learnability also hold in the agnostic PAC setting.
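
For reference, a restatement of the p-Schatten norm that defines the learnable classes (standard definition, not taken verbatim from the paper):

```latex
% p-Schatten norm of a compact operator T between Hilbert spaces,
% with singular values s_1(T) \ge s_2(T) \ge \dots
\|T\|_{S_p} = \Big( \sum_{j \ge 1} s_j(T)^p \Big)^{1/p}, \qquad p \in [1, \infty).
% The limit p = \infty recovers the operator norm \|T\| = s_1(T),
% the case shown in the paper to be NOT online learnable.
```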

Motif-aware Attribute Masking for Molecular Graph Pre-training

  • paper_url: http://arxiv.org/abs/2309.04589
  • repo_url: https://github.com/einae-nd/moama-dev
  • paper_authors: Eric Inae, Gang Liu, Meng Jiang
  • for: Attribute reconstruction in the pre-training of graph neural networks for molecules.
  • methods: A motif-aware attribute masking strategy captures inter-motif structures by leveraging the information of atoms in neighboring motifs: each graph is decomposed into disjoint motifs, the features of every node within a sampled motif are masked, and the graph decoder predicts the masked features for reconstruction.
  • results: Evaluations on eight molecular property prediction datasets demonstrate the advantages over strategies that randomly select nodes, which capture less higher-level structure.
    Abstract Attribute reconstruction is used to predict node or edge features in the pre-training of graph neural networks. Given a large number of molecules, they learn to capture structural knowledge, which is transferable for various downstream property prediction tasks and vital in chemistry, biomedicine, and material science. Previous strategies that randomly select nodes to do attribute masking leverage the information of local neighbors. However, the over-reliance on these neighbors inhibits the model's ability to learn from higher-level substructures. For example, the model would learn little from predicting three carbon atoms in a benzene ring based on the other three but could learn more from the inter-connections between the functional groups, also called chemical motifs. In this work, we propose and investigate motif-aware attribute masking strategies to capture inter-motif structures by leveraging the information of atoms in neighboring motifs. Once each graph is decomposed into disjoint motifs, the features for every node within a sample motif are masked. The graph decoder then predicts the masked features of each node within the motif for reconstruction. We evaluate our approach on eight molecular property prediction datasets and demonstrate its advantages.
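
A sketch of motif-aware masking given a precomputed motif assignment; how the molecule is decomposed into motifs (e.g. rings and functional groups) is left abstract here.

```python
import random

def motif_aware_mask(node_feats, motif_of_node, mask_token=0.0, n_motifs_to_mask=1):
    """Mask all node features inside randomly chosen motifs (sketch).

    motif_of_node[i] gives the motif id of atom i after the molecule has been
    decomposed into disjoint motifs; the decoder is then trained to
    reconstruct the masked attributes at target_idx.
    """
    motifs = sorted(set(motif_of_node))
    chosen = set(random.sample(motifs, n_motifs_to_mask))
    masked = [mask_token if motif_of_node[i] in chosen else f
              for i, f in enumerate(node_feats)]
    target_idx = [i for i, m in enumerate(motif_of_node) if m in chosen]
    return masked, target_idx

# Benzene-like toy: atoms 0-5 form one ring motif, atoms 6-7 a substituent motif.
feats = [1, 1, 1, 1, 1, 1, 2, 3]
print(motif_aware_mask(feats, [0, 0, 0, 0, 0, 0, 1, 1]))
```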

Circles: Inter-Model Comparison of Multi-Classification Problems with High Number of Classes

  • paper_url: http://arxiv.org/abs/2309.05672
  • repo_url: None
  • paper_authors: Nina Mir, Ragaad AlTarawneh, Shah Rukh Humayoun
  • for: An interactive visual analytics tool that lets users visually compare multi-class classification models.
  • methods: A concentric radial line layout that mitigates the model-comparison problems posed by datasets with a high number of classes.
  • results: An interactive visual analytics tool named "Circles" that displays the results of multiple classification models for comparison in a single view.
    Abstract The recent advancements in machine learning have motivated researchers to generate classification models dealing with hundreds of classes, as in the case of image datasets. However, visualization of classification models with a high number of classes and inter-model comparison in such classification problems are two areas that have not received much attention in the literature, despite the ever-increasing use of classification models to address problems with very large class categories. In this paper, we present our interactive visual analytics tool, called Circles, that allows a visual inter-model comparison of numerous classification models with 1K classes in one view. To mitigate the tricky issue of visual clutter, we chose a concentric radial line layout for our inter-model comparison task. Our prototype shows the results of 9 models with 1K classes.
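
To illustrate the general idea of a concentric radial layout (not the actual Circles tool), a minimal matplotlib sketch can place the 1K classes at angles around the circle and give each model its own ring, colored by a per-class metric; the names and the random stand-in metrics below are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

n_classes, n_models = 1000, 9
rng = np.random.default_rng(1)
per_class_acc = rng.uniform(size=(n_models, n_classes))  # stand-in metrics

theta = np.linspace(0.0, 2.0 * np.pi, n_classes, endpoint=False)
fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for m in range(n_models):
    r = np.full(n_classes, 1.0 + m)        # one concentric ring per model
    ax.scatter(theta, r, c=per_class_acc[m], s=2, cmap="viridis")
ax.set_yticks(range(1, n_models + 1))
ax.set_yticklabels([f"model {m}" for m in range(n_models)])
ax.set_xticks([])
plt.show()
```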

Towards Interpretable Solar Flare Prediction with Attention-based Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2309.04558
  • repo_url: https://bitbucket.org/gsudmlab/fulldiskattention
  • paper_authors: Chetraj Pandey, Anli Ji, Rafal A. Angryk, Berkay Aydin
  • for: Developing an attention-based deep learning model to predict the occurrence of $\geq$M1.0-class solar flares within the next 24 hours.
  • methods: Data-augmented oversampling to address class imbalance, with evaluation via the true skill statistic (TSS) and the Heidke skill score (HSS).
  • results: A successful attention-based deep learning model whose candidate configuration achieves an average TSS=0.54$\pm$0.03 and HSS=0.37$\pm$0.07 for 24-hour $\geq$M1.0-class flare prediction.
    Abstract Solar flare prediction is a central problem in space weather forecasting and recent developments in machine learning and deep learning accelerated the adoption of complex models for data-driven solar flare forecasting. In this work, we developed an attention-based deep learning model as an improvement over the standard convolutional neural network (CNN) pipeline to perform full-disk binary flare predictions for the occurrence of $\geq$M1.0-class flares within the next 24 hours. For this task, we collected compressed images created from full-disk line-of-sight (LoS) magnetograms. We used data-augmented oversampling to address the class imbalance issue and used true skill statistic (TSS) and Heidke skill score (HSS) as the evaluation metrics. Furthermore, we interpreted our model by overlaying attention maps on input magnetograms and visualized the important regions focused on by the model that led to the eventual decision. The significant findings of this study are: (i) We successfully implemented an attention-based full-disk flare predictor ready for operational forecasting where the candidate model achieves an average TSS=0.54$\pm$0.03 and HSS=0.37$\pm$0.07. (ii) we demonstrated that our full-disk model can learn conspicuous features corresponding to active regions from full-disk magnetogram images, and (iii) our experimental evaluation suggests that our model can predict near-limb flares with adept skill and the predictions are based on relevant active regions (ARs) or AR characteristics from full-disk magnetograms.
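
The two reported skill scores are simple functions of the binary confusion matrix, so a small worked example is easy to state (these are the standard definitions, not code from the paper):

```python
def tss(tp, fp, fn, tn):
    # True skill statistic: sensitivity + specificity - 1
    return tp / (tp + fn) - fp / (fp + tn)

def hss(tp, fp, fn, tn):
    # Heidke skill score: improvement over random-chance agreement
    num = 2.0 * (tp * tn - fp * fn)
    den = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return num / den

# toy confusion counts for flare (positive) vs. no-flare (negative)
print(tss(tp=80, fp=40, fn=20, tn=860))   # ~0.756
print(hss(tp=80, fp=40, fn=20, tn=860))   # ~0.694
```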

Regret-Optimal Federated Transfer Learning for Kernel Regression with Applications in American Option Pricing

  • paper_url: http://arxiv.org/abs/2309.04557
  • repo_url: https://github.com/floriankrach/regretoptimalfederatedtransferlearning
  • paper_authors: Xuwei Yang, Anastasis Kratsios, Florian Krach, Matheus Grasselli, Aurelien Lucchi
  • for: Proposing an optimal iterative scheme for federated transfer learning that minimizes the cumulative deviation of the generated parameters across datasets ${\cal D}_1,\ldots,{\cal D}_N$ while respecting the loss of the model $f_{\theta}$ produced upon halting.
  • methods: Continual communication between the specialized models (nodes/agents) and a central planner (server) at each iteration (round); explicit updates of the regret-optimal algorithm are derived for finite-rank kernel regression, and a nearly regret-optimal heuristic exploiting the algorithm's symmetries requires $\mathcal{O}(Np^2)$ fewer elementary operations.
  • results: The scheme is shown to be regret-optimal and robust to adversarial perturbations: perturbing at most $q$ training pairs by at most $\varepsilon>0$ across all training sets increases the regret by no more than $\mathcal{O}(\varepsilon q \bar{N}^{1/2})$, where $\bar{N}$ is the aggregate number of training pairs. The theoretical findings are validated with numerical experiments on American option pricing using randomly generated finite-rank kernels.
    Abstract We propose an optimal iterative scheme for federated transfer learning, where a central planner has access to datasets ${\cal D}_1,\dots,{\cal D}_N$ for the same learning model $f_{\theta}$. Our objective is to minimize the cumulative deviation of the generated parameters $\{\theta_i(t)\}_{t=0}^T$ across all $T$ iterations from the specialized parameters $\theta^\star_{1},\ldots,\theta^\star_N$ obtained for each dataset, while respecting the loss function for the model $f_{\theta(T)}$ produced by the algorithm upon halting. We only allow for continual communication between each of the specialized models (nodes/agents) and the central planner (server), at each iteration (round). For the case where the model $f_{\theta}$ is a finite-rank kernel regression, we derive explicit updates for the regret-optimal algorithm. By leveraging symmetries within the regret-optimal algorithm, we further develop a nearly regret-optimal heuristic that runs with $\mathcal{O}(Np^2)$ fewer elementary operations, where $p$ is the dimension of the parameter space. Additionally, we investigate the adversarial robustness of the regret-optimal algorithm showing that an adversary which perturbs $q$ training pairs by at-most $\varepsilon>0$, across all training sets, cannot reduce the regret-optimal algorithm's regret by more than $\mathcal{O}(\varepsilon q \bar{N}^{1/2})$, where $\bar{N}$ is the aggregate number of training pairs. To validate our theoretical findings, we conduct numerical experiments in the context of American option pricing, utilizing a randomly generated finite-rank kernel.

Postprocessing of Ensemble Weather Forecasts Using Permutation-invariant Neural Networks

  • paper_url: http://arxiv.org/abs/2309.04452
  • repo_url: https://github.com/khoehlein/Permutation-invariant-Postprocessing
  • paper_authors: Kevin Höhlein, Benedikt Schulz, Rüdiger Westermann, Sebastian Lerch
  • for: A neural network-based statistical postprocessing method that translates ensembles of raw numerical weather forecasts into reliable probabilistic forecast distributions.
  • methods: Permutation-invariant neural networks that treat the forecast ensemble as a set of unordered member forecasts and learn link functions that are by design invariant to permutations of the member ordering.
  • results: State-of-the-art forecast quality in terms of calibration and sharpness compared with classical and neural network-based benchmarks. A proposed permutation-based importance analysis of the trained postprocessing models further shows that most of the relevant information is concentrated in a few ensemble-internal degrees of freedom.
    Abstract Statistical postprocessing is used to translate ensembles of raw numerical weather forecasts into reliable probabilistic forecast distributions. In this study, we examine the use of permutation-invariant neural networks for this task. In contrast to previous approaches, which often operate on ensemble summary statistics and dismiss details of the ensemble distribution, we propose networks which treat forecast ensembles as a set of unordered member forecasts and learn link functions that are by design invariant to permutations of the member ordering. We evaluate the quality of the obtained forecast distributions in terms of calibration and sharpness, and compare the models against classical and neural network-based benchmark methods. In case studies addressing the postprocessing of surface temperature and wind gust forecasts, we demonstrate state-of-the-art prediction quality. To deepen the understanding of the learned inference process, we further propose a permutation-based importance analysis for ensemble-valued predictors, which highlights specific aspects of the ensemble forecast that are considered important by the trained postprocessing models. Our results suggest that most of the relevant information is contained in few ensemble-internal degrees of freedom, which may impact the design of future ensemble forecasting and postprocessing systems.
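
A minimal PyTorch sketch of the core idea: a deep-sets-style network whose symmetric mean-pooling makes the output invariant to member ordering. The module and layer sizes are illustrative assumptions, not the paper's architecture; training would typically minimize the negative log-likelihood (or CRPS) of the predicted distribution.

```python
import torch
import torch.nn as nn

class PermutationInvariantPostprocessor(nn.Module):
    """Encode each ensemble member independently, pool with a symmetric
    function, and emit parameters of a forecast distribution."""

    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.member_encoder = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, ensemble):                 # (batch, members, features)
        encoded = self.member_encoder(ensemble)  # per-member embedding
        pooled = encoded.mean(dim=1)             # symmetric pooling => invariance
        mu, raw_sigma = self.head(pooled).unbind(-1)
        return mu, nn.functional.softplus(raw_sigma)  # e.g. Normal(mu, sigma)

model = PermutationInvariantPostprocessor(n_features=3)
x = torch.randn(8, 20, 3)                        # 20-member ensemble
# shuffling the members leaves the prediction unchanged (up to fp error)
assert torch.allclose(model(x)[0], model(x[:, torch.randperm(20)])[0], atol=1e-5)
```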

End-to-End Speech Recognition and Disfluency Removal with Acoustic Language Model Pretraining

  • paper_url: http://arxiv.org/abs/2309.04516
  • repo_url: https://github.com/davidsroth/hubert-disfl
  • paper_authors: Saksham Bassi, Giulio Duregon, Siddhartha Jalagam, David Roth
  • for: Revisiting the comparison between two-stage models and end-to-end models for accurate transcription of disfluent and conversational speech.
  • methods: Large-scale audio pretraining with weakly self-supervised objectives to learn effective audio representations.
  • results: Audio-based language models pretrained with weak self-supervision match or exceed the performance of similarly trained two-stage models, and the choice of pretraining objective substantially affects how well a model adapts to the disfluency removal task.
    Abstract The SOTA in transcription of disfluent and conversational speech has in recent years favored two-stage models, with separate transcription and cleaning stages. We believe that previous attempts at end-to-end disfluency removal have fallen short because of the representational advantage that large-scale language model pretraining has given to lexical models. Until recently, the high dimensionality and limited availability of large audio datasets inhibited the development of large-scale self-supervised pretraining objectives for learning effective audio representations, giving a relative advantage to the two-stage approach, which utilises pretrained representations for lexical tokens. In light of recent successes in large scale audio pretraining, we revisit the performance comparison between two-stage and end-to-end model and find that audio based language models pretrained using weak self-supervised objectives match or exceed the performance of similarly trained two-stage models, and further, that the choice of pretraining objective substantially effects a model's ability to be adapted to the disfluency removal task.

Soft Quantization using Entropic Regularization

  • paper_url: http://arxiv.org/abs/2309.04428
  • repo_url: https://github.com/rajmadan96/softquantization
  • paper_authors: Rajmadan Lakshmanan, Alois Pichler
  • for: Solving the quantization problem, i.e., finding the best approximation of probability measures on $\mathbb{R}^d$ by finite, discrete measures.
  • methods: An entropy-regularized relaxation of the standard quantization problem built on the softmin function, which is robust from both theoretical and practical standpoints; approximation quality is measured with the entropy-regularized Wasserstein distance, and optimal solutions are obtained via a stochastic gradient approach.
  • results: Strong empirical performance across a variety of settings; a control parameter adjusts the difficulty of the optimization problem, which is especially advantageous for exceptionally challenging instances.
    Abstract The quantization problem aims to find the best possible approximation of probability measures on $\mathbb{R}^d$ using finite, discrete measures. The Wasserstein distance is a typical choice to measure the quality of the approximation. This contribution investigates the properties and robustness of the entropy-regularized quantization problem, which relaxes the standard quantization problem. The proposed approximation technique naturally adopts the softmin function, which is well known for its robustness from both theoretical and practical standpoints. Moreover, we use the entropy-regularized Wasserstein distance to evaluate the quality of the soft quantization problem's approximation, and we implement a stochastic gradient approach to achieve the optimal solutions. The control parameter in our proposed method allows for the adjustment of the optimization problem's difficulty level, providing significant advantages when dealing with exceptionally challenging problems of interest. This contribution also empirically illustrates the performance of the method in various experiments.
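
The softmin objective admits a compact stochastic-gradient implementation. The sketch below, for one-dimensional data, illustrates the entropic soft-quantization idea under simple assumptions (squared distances, plain SGD); it is not the paper's implementation. As eps -> 0 the soft assignments approach hard nearest-point assignments.

```python
import numpy as np

def soft_quantize(samples, k=8, eps=0.1, lr=0.05, steps=2000, seed=0):
    """Entropic soft quantization in 1D via stochastic gradient descent.

    Objective per sample x: softmin_eps over support points y_j of (x - y_j)^2,
    i.e. -eps * log sum_j exp(-(x - y_j)^2 / eps).
    """
    rng = np.random.default_rng(seed)
    y = rng.choice(samples, size=k).astype(float)   # init support points
    for _ in range(steps):
        x = samples[rng.integers(len(samples))]
        d2 = (x - y) ** 2
        w = np.exp(-(d2 - d2.min()) / eps)          # stable softmax weights
        w /= w.sum()
        y += lr * w * 2.0 * (x - y)                 # soft-assignment update
    return np.sort(y)

data = np.random.default_rng(1).normal(size=5000)
print(soft_quantize(data))
```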

Robust Representation Learning for Privacy-Preserving Machine Learning: A Multi-Objective Autoencoder Approach

  • paper_url: http://arxiv.org/abs/2309.04427
  • repo_url: None
  • paper_authors: Sofiane Ouaari, Ali Burak Ünal, Mete Akgün, Nico Pfeifer
  • for: Improving privacy preservation and data confidentiality in machine learning applications.
  • methods: Deep learning-powered robust representation learning: autoencoders are trained in a multi-objective manner, and the encoded data can then be safely sent to a third party for model training and hyperparameter tuning.
  • results: Improved performance in both unimodal and multimodal (vertically split) settings, enabling secure data processing and model training with third-party tools and services.
    Abstract Several domains increasingly rely on machine learning in their applications. The resulting heavy dependence on data has led to the emergence of various laws and regulations around data ethics and privacy and growing awareness of the need for privacy-preserving machine learning (ppML). Current ppML techniques utilize methods that are either purely based on cryptography, such as homomorphic encryption, or that introduce noise into the input, such as differential privacy. The main criticism given to those techniques is the fact that they either are too slow or they trade off a model's performance for improved confidentiality. To address this performance reduction, we aim to leverage robust representation learning as a way of encoding our data while optimizing the privacy-utility trade-off. Our method centers on training autoencoders in a multi-objective manner and then concatenating the latent and learned features from the encoding part as the encoded form of our data. Such a deep learning-powered encoding can then safely be sent to a third party for intensive training and hyperparameter tuning. With our proposed framework, we can share our data and use third party tools without being under the threat of revealing its original form. We empirically validate our results on unimodal and multimodal settings, the latter following a vertical splitting system, and show improved performance over the state of the art.
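
A minimal PyTorch sketch of the multi-objective idea: an autoencoder trained jointly on reconstruction and a downstream task, whose latent code is what would be shared externally. The architecture, the loss weight `alpha`, and the toy data are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class MultiObjectiveAE(nn.Module):
    # Encoder/decoder plus an auxiliary task head trained jointly, trading
    # off reconstruction utility against task performance of the encoding.
    def __init__(self, d_in, d_latent=16, n_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_latent))
        self.dec = nn.Sequential(nn.Linear(d_latent, 64), nn.ReLU(), nn.Linear(64, d_in))
        self.task = nn.Linear(d_latent, n_classes)

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z), self.task(z)

model = MultiObjectiveAE(d_in=30)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(128, 30), torch.randint(0, 2, (128,))
z, x_hat, logits = model(x)
alpha = 0.5  # balances the two objectives; z is what a third party would see
loss = alpha * nn.functional.mse_loss(x_hat, x) + nn.functional.cross_entropy(logits, y)
loss.backward(); opt.step()
```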

Parallel and Limited Data Voice Conversion Using Stochastic Variational Deep Kernel Learning

  • paper_url: http://arxiv.org/abs/2309.04420
  • repo_url: None
  • paper_authors: Mohamadreza Jafaryani, Hamid Sheikhzadeh, Vahid Pourahmadi
  • For: This paper proposes a voice conversion method that works with limited data and is based on stochastic variational deep kernel learning (SVDKL).
  • Methods: The proposed method combines a deep neural network with the Gaussian process, a Bayesian and non-parametric method, and uses marginal likelihood optimization to train the model parameters.
  • Results: The proposed method obtained a higher mean opinion score, smaller spectral distortion, and better preference test results than the compared methods, with approximately 80 seconds of training data.
    Abstract Typically, voice conversion is regarded as an engineering problem with limited training data. The reliance on massive amounts of data hinders the practical applicability of deep learning approaches, which have been extensively researched in recent years. On the other hand, statistical methods are effective with limited data but have difficulties in modelling complex mapping functions. This paper proposes a voice conversion method that works with limited data and is based on stochastic variational deep kernel learning (SVDKL). At the same time, SVDKL enables the use of deep neural networks' expressive capability as well as the high flexibility of the Gaussian process as a Bayesian and non-parametric method. When the conventional kernel is combined with the deep neural network, it is possible to estimate non-smooth and more complex functions. Furthermore, the model's sparse variational Gaussian process solves the scalability problem and, unlike the exact Gaussian process, allows for the learning of a global mapping function for the entire acoustic space. One of the most important aspects of the proposed scheme is that the model parameters are trained using marginal likelihood optimization, which considers both data fitting and model complexity. Considering the complexity of the model reduces the amount of training data by increasing the resistance to overfitting. To evaluate the proposed scheme, we examined the model's performance with approximately 80 seconds of training data. The results indicated that our method obtained a higher mean opinion score, smaller spectral distortion, and better preference tests than the compared methods.
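
The "deep kernel" idea itself is compact: warp inputs with a network, then apply a standard kernel in the warped space. A NumPy sketch under simple assumptions follows (a fixed one-layer feature map stands in for a trained DNN); the paper additionally makes the Gaussian process sparse/variational and trains everything by marginal likelihood.

```python
import numpy as np

def deep_rbf_kernel(X1, X2, W, b, lengthscale=1.0):
    """Deep kernel sketch: k(x, x') = RBF(g(x), g(x')) for a feature map g."""
    g1 = np.tanh(X1 @ W + b)
    g2 = np.tanh(X2 @ W + b)
    sq = ((g1[:, None, :] - g2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(20, 8)), rng.normal(size=8)
X = rng.normal(size=(5, 20))
K = deep_rbf_kernel(X, X, W, b)   # PSD Gram matrix on the warped inputs
```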

Emergent learning in physical systems as feedback-based aging in a glassy landscape

  • paper_url: http://arxiv.org/abs/2309.04382
  • repo_url: None
  • paper_authors: Vidyesh Rao Anisetti, Ananth Kandala, J. M. Schwarz
  • for: Studying how linear physical networks learn linear transformations and how their physical properties evolve under weight update rules.
  • methods: Training linear physical networks on linear transformations and observing their relaxation under repeated application of feedback boundary forces in the presence of an input force, drawing an analogy between the learning process and aging and memory formation in disordered and glassy systems.
  • results: Learning resembles an aging process: the system relaxes under the feedback boundary forces and thereby encodes a memory of the input-output relationship, with a growing correlation length. The square root of the mean-squared error as a function of epoch takes a non-exponential form characteristic of glassy systems. This physical interpretation suggests that emergent learning could serve, from an evolutionary standpoint, as a very early physical mechanism for learning in biological systems.
    Abstract By training linear physical networks to learn linear transformations, we discern how their physical properties evolve due to weight update rules. Our findings highlight a striking similarity between the learning behaviors of such networks and the processes of aging and memory formation in disordered and glassy systems. We show that the learning dynamics resembles an aging process, where the system relaxes in response to repeated application of the feedback boundary forces in presence of an input force, thus encoding a memory of the input-output relationship. With this relaxation comes an increase in the correlation length, which is indicated by the two-point correlation function for the components of the network. We also observe that the square root of the mean-squared error as a function of epoch takes on a non-exponential form, which is a typical feature of glassy systems. This physical interpretation suggests that by encoding more detailed information into input and feedback boundary forces, the process of emergent learning can be rather ubiquitous and, thus, serve as a very early physical mechanism, from an evolutionary standpoint, for learning in biological systems.

Seeing-Eye Quadruped Navigation with Force Responsive Locomotion Control

  • paper_url: http://arxiv.org/abs/2309.04370
  • repo_url: None
  • paper_authors: David DeFazio, Eisuke Hirota, Shiqi Zhang
  • for: Developing a seeing-eye quadruped robot system that responds to external tugging forces, making everyday activities more accessible for visually impaired people.
  • methods: Simultaneous training of a locomotion controller robust to external tugs via reinforcement learning and of an external force estimator via supervised learning. The controller keeps the robot walking stably under tugging, while the force estimator lets the robot respond to human tugs and guide the person around nearby obstacles.
  • results: Experiments in simulation and on hardware show the controller is robust to external forces and the robot accurately detects force direction. The full system is demonstrated on a real quadruped robot with a blindfolded human, with a video available online.
    Abstract Seeing-eye robots are very useful tools for guiding visually impaired people, potentially producing a huge societal impact given the low availability and high cost of real guide dogs. Although a few seeing-eye robot systems have already been demonstrated, none considered external tugs from humans, which frequently occur in a real guide dog setting. In this paper, we simultaneously train a locomotion controller that is robust to external tugging forces via Reinforcement Learning (RL), and an external force estimator via supervised learning. The controller ensures stable walking, and the force estimator enables the robot to respond to the external forces from the human. These forces are used to guide the robot to the global goal, which is unknown to the robot, while the robot guides the human around nearby obstacles via a local planner. Experimental results in simulation and on hardware show that our controller is robust to external forces, and our seeing-eye system can accurately detect force direction. We demonstrate our full seeing-eye robot system on a real quadruped robot with a blindfolded human. The video can be seen at our project page: https://bu-air-lab.github.io/guide_dog/

Learning from Power Signals: An Automated Approach to Electrical Disturbance Identification Within a Power Transmission System

  • paper_url: http://arxiv.org/abs/2309.04361
  • repo_url: None
  • paper_authors: Jonathan D. Boyd, Joshua H. Tyler, Anthony M. Murphy, Donald R. Reising
  • for: automated analysis of power quality events recorded by digital fault recorders and power quality monitors
  • methods: rule-based analytics and cyclic histogram processing
  • results: accuracy of 99% in detecting and categorizing 14 different event types, a 320-fold reduction in memory requirements, and anticipated improvement in transmission system reliability through near real-time detection and identification of disturbances and prevention of problems before they occur.
    Abstract As power quality becomes a higher priority in the electric utility industry, the amount of disturbance event data continues to grow. Utilities do not have the required personnel to analyze each event by hand. This work presents an automated approach for analyzing power quality events recorded by digital fault recorders and power quality monitors operating within a power transmission system. The automated approach leverages rule-based analytics to examine the time and frequency domain characteristics of the voltage and current signals. Customizable thresholds are set to categorize each disturbance event. The events analyzed within this work include various faults, motor starting, and incipient instrument transformer failure. Analytics for fourteen different event types have been developed. The analytics were tested on 160 signal files and yielded an accuracy of ninety-nine percent. Continuous, nominal signal data analysis is performed using an approach coined as the cyclic histogram. The cyclic histogram process will be integrated into the digital fault recorders themselves to facilitate the detection of subtle signal variations that are too small to trigger a disturbance event and that can occur over hours or days. In addition to reducing memory requirements by a factor of 320, it is anticipated that cyclic histogram processing will aid in identifying incipient events and identifiers. This project is expected to save engineers time by automating the classification of disturbance events and increase the reliability of the transmission system by providing near real time detection and identification of disturbances as well as prevention of problems before they occur.
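
One plausible reading of the cyclic histogram is: fold the waveform into fundamental cycles and histogram the amplitudes seen at each sample position within the cycle, so subtle drifts accumulate visibly over hours or days. The NumPy sketch below follows that reading; the function name, bin counts, and 60 Hz example are assumptions, not details from the paper.

```python
import numpy as np

def cyclic_histogram(signal, samples_per_cycle, n_amp_bins=64, amp_range=(-1.5, 1.5)):
    """Fold a waveform into cycles and histogram amplitudes per phase position."""
    n_cycles = len(signal) // samples_per_cycle
    folded = signal[: n_cycles * samples_per_cycle].reshape(n_cycles, samples_per_cycle)
    hist = np.zeros((samples_per_cycle, n_amp_bins), dtype=np.int64)
    edges = np.linspace(amp_range[0], amp_range[1], n_amp_bins + 1)
    for phase in range(samples_per_cycle):
        hist[phase], _ = np.histogram(folded[:, phase], bins=edges)
    return hist   # far smaller than the raw samples it summarizes

fs, f0 = 7680, 60                       # 128 samples per 60 Hz cycle
t = np.arange(fs) / fs
v = np.sin(2 * np.pi * f0 * t) + 0.02 * np.random.default_rng(0).normal(size=fs)
h = cyclic_histogram(v, samples_per_cycle=fs // f0)
```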

Value-Compressed Sparse Column (VCSC): Sparse Matrix Storage for Redundant Data

  • paper_url: http://arxiv.org/abs/2309.04355
  • repo_url: None
  • paper_authors: Skyler Ruiter, Seth Wolfgang, Marc Tunnell, Timothy Triche Jr., Erin Carrier, Zachary DeBruine
  • for: Sparse matrix storage for redundant data, specifically two extensions of Compressed Sparse Column (CSC): Value-Compressed Sparse Column (VCSC) and Index- and Value-Compressed Sparse Column (IVCSC).
  • methods: Building on the common CSC and COO formats, VCSC exploits high redundancy within a column to compress data up to 3-fold over COO and 2.25-fold over CSC without significant performance impact; IVCSC further compresses the index arrays via delta encoding and byte-packing, reducing memory usage 10-fold over COO and 7.5-fold over CSC.
  • results: Benchmarks on simulated and real data show that VCSC and IVCSC can be read in compressed form with little added computational cost, making both formats broadly useful for encoding and reading redundant sparse data.
    Abstract Compressed Sparse Column (CSC) and Coordinate (COO) are popular compression formats for sparse matrices. However, both CSC and COO are general purpose and cannot take advantage of any of the properties of the data other than sparsity, such as data redundancy. Highly redundant sparse data is common in many machine learning applications, such as genomics, and is often too large for in-core computation using conventional sparse storage formats. In this paper, we present two extensions to CSC: (1) Value-Compressed Sparse Column (VCSC) and (2) Index- and Value-Compressed Sparse Column (IVCSC). VCSC takes advantage of high redundancy within a column to further compress data up to 3-fold over COO and 2.25-fold over CSC, without significant negative impact to performance characteristics. IVCSC extends VCSC by compressing index arrays through delta encoding and byte-packing, achieving a 10-fold decrease in memory usage over COO and 7.5-fold decrease over CSC. Our benchmarks on simulated and real data show that VCSC and IVCSC can be read in compressed form with little added computational cost. These two novel compression formats offer a broadly useful solution to encoding and reading redundant sparse data.
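
The value-compression idea for a single CSC column can be sketched in a few lines: store each unique value once, together with the rows where it occurs. This is an illustration of the principle, not the library's actual memory layout; IVCSC would additionally delta-encode and byte-pack the index groups.

```python
import numpy as np

def vcsc_encode_column(values, row_indices):
    """Group a column's nonzeros by value: each unique value is stored once
    with the row indices at which it occurs. Highly redundant columns
    (few unique values) compress well."""
    order = np.argsort(values, kind="stable")
    v, r = values[order], row_indices[order]
    uniq, starts, counts = np.unique(v, return_index=True, return_counts=True)
    groups = [r[s : s + c] for s, c in zip(starts, counts)]
    return uniq, counts, groups   # (unique values, run lengths, indices)

vals = np.array([3, 3, 1, 3, 1, 3], dtype=np.int64)
rows = np.array([0, 2, 4, 5, 7, 9])
uniq, counts, groups = vcsc_encode_column(vals, rows)
# uniq=[1, 3], counts=[2, 4]: two stored payload values instead of six.
```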

Decreasing the Computing Time of Bayesian Optimization using Generalizable Memory Pruning

  • paper_url: http://arxiv.org/abs/2309.04510
  • repo_url: None
  • paper_authors: Alexander E. Siemenn, Tonio Buonassisi
  • for: A Bayesian optimization wrapper, usable with any surrogate model and acquisition function, that reduces per-experiment computing time.
  • methods: Generalizable memory pruning and bounded optimization, which turn the wall-clock time per experiment from a polynomially increasing pattern into a non-increasing sawtooth pattern without sacrificing convergence performance.
  • results: A substantial decrease in per-experiment computing time, demonstrated consistently across two distinct datasets, two surrogate models, and four acquisition functions.
    Abstract Bayesian optimization (BO) suffers from long computing times when processing highly-dimensional or large data sets. These long computing times are a result of the Gaussian process surrogate model having a polynomial time complexity with the number of experiments. Running BO on high-dimensional or massive data sets becomes intractable due to this time complexity scaling, in turn, hindering experimentation. Alternative surrogate models have been developed to reduce the computing utilization of the BO procedure; however, these methods require mathematical alteration of the inherent surrogate function, pigeonholing their use to only that function. In this paper, we demonstrate a generalizable BO wrapper of memory pruning and bounded optimization, capable of being used with any surrogate model and acquisition function. Using this memory pruning approach, we show a decrease in wall-clock computing times per experiment of BO from a polynomially increasing pattern to a sawtooth pattern that has a non-increasing trend without sacrificing convergence performance. Furthermore, we illustrate the generalizability of the approach across two unique data sets, two unique surrogate models, and four unique acquisition functions. All model implementations are run on the MIT Supercloud state-of-the-art computing hardware.
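
A minimal sketch of a memory-pruned BO loop, using scikit-learn's Gaussian process as the surrogate and a lower-confidence-bound pick as the acquisition. The pruning rule here (keep the best `max_memory` observations) and all names are simplifying assumptions; the paper's wrapper is more general.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def pruned_bo(f, low, high, n_iter=60, max_memory=25, seed=0):
    """Cap the surrogate's training set so the cubic-in-n GP fitting cost
    stops growing with the number of experiments."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(low, high, size=(5, 1))
    y = np.array([f(float(x[0])) for x in X])
    for _ in range(n_iter):
        keep = np.argsort(y)[:max_memory]           # prune: keep best points
        gp = GaussianProcessRegressor(normalize_y=True).fit(X[keep], y[keep])
        cand = rng.uniform(low, high, size=(256, 1))
        mu, sd = gp.predict(cand, return_std=True)
        x_next = cand[np.argmin(mu - 2.0 * sd)]     # lower-confidence-bound pick
        X = np.vstack([X, x_next])
        y = np.append(y, f(float(x_next[0])))
    return float(X[np.argmin(y), 0]), float(y.min())

x_best, y_best = pruned_bo(lambda x: (x - 0.3) ** 2, low=0.0, high=1.0)
```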

Generating the Ground Truth: Synthetic Data for Label Noise Research

  • paper_url: http://arxiv.org/abs/2309.04318
  • repo_url: https://github.com/sjoerd-de-vries/synlabel
  • paper_authors: Sjoerd de Vries, Dirk Thierens
  • for: This paper aims to improve the current methodologies for label noise research by creating a noiseless dataset informed by real data.
  • methods: The proposed framework, called SYNLABEL, allows for creating a noiseless dataset by either pre-specifying or learning a function and defining it as the ground truth function from which labels are generated. Additionally, the framework uses resampling and aggregation of labels to generate soft label distributions for each data point.
  • results: The generated datasets serve as a clean baseline of adjustable complexity into which different types of noise may be introduced, allowing for direct injection and quantification of label noise. The paper demonstrates how the framework can be applied, how it enables quantification of label noise, and how it improves over existing methodologies.
    Abstract Most real-world classification tasks suffer from label noise to some extent. Such noise in the data adversely affects the generalization error of learned models and complicates the evaluation of noise-handling methods, as their performance cannot be accurately measured without clean labels. In label noise research, typically either noisy or overly simple simulated data are accepted as a baseline, into which additional noise with known properties is injected. In this paper, we propose SYNLABEL, a framework that aims to improve upon the aforementioned methodologies. It allows for creating a noiseless dataset informed by real data, by either pre-specifying or learning a function and defining it as the ground truth function from which labels are generated. Furthermore, by resampling a number of values for selected features in the function domain, evaluating the function and aggregating the resulting labels, each data point can be assigned a soft label or label distribution. Such distributions allow for direct injection and quantification of label noise. The generated datasets serve as a clean baseline of adjustable complexity into which different types of noise may be introduced. We illustrate how the framework can be applied, how it enables quantification of label noise and how it improves over existing methodologies.
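
The soft-label construction can be illustrated directly: fix a function as ground truth, jitter selected features, re-evaluate, and aggregate the resulting labels. The sketch below is a toy rendering of that idea; the function and parameter names are hypothetical, not the SYNLABEL API.

```python
import numpy as np

def soft_labels(ground_truth_fn, X, noisy_feature_idx, n_resamples=50, scale=0.1, seed=0):
    """Resample selected features, re-evaluate the ground truth function,
    and aggregate the labels into a per-point label distribution."""
    rng = np.random.default_rng(seed)
    n, k = len(X), 2                              # binary task for simplicity
    counts = np.zeros((n, k))
    for _ in range(n_resamples):
        Xr = X.copy()
        Xr[:, noisy_feature_idx] += rng.normal(scale=scale, size=(n, len(noisy_feature_idx)))
        counts[np.arange(n), ground_truth_fn(Xr)] += 1
    return counts / n_resamples                   # rows sum to 1 (soft labels)

gt = lambda X: (X[:, 0] + X[:, 1] > 0).astype(int)   # pre-specified ground truth
X = np.random.default_rng(1).normal(size=(100, 3))
P = soft_labels(gt, X, noisy_feature_idx=[0])
```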

Actor critic learning algorithms for mean-field control with moment neural networks

  • paper_url: http://arxiv.org/abs/2309.04317
  • repo_url: None
  • paper_authors: Huyên Pham, Xavier Warin
  • for: Solving mean-field control problems in a continuous-time reinforcement learning setting.
  • methods: A policy gradient and actor-critic algorithm with parametrized randomized policies and a gradient-based representation of the value function; both the actor (policy) and the critic (value function) are learned with moment neural network functions on the Wasserstein space of probability measures, sampling trajectories of distributions directly.
  • results: A comprehensive set of numerical results, including multi-dimensional settings and nonlinear quadratic mean-field control problems with controlled volatility.
    Abstract We develop a new policy gradient and actor-critic algorithm for solving mean-field control problems within a continuous time reinforcement learning setting. Our approach leverages a gradient-based representation of the value function, employing parametrized randomized policies. The learning for both the actor (policy) and critic (value function) is facilitated by a class of moment neural network functions on the Wasserstein space of probability measures, and the key feature is to sample directly trajectories of distributions. A central challenge addressed in this study pertains to the computational treatment of an operator specific to the mean-field framework. To illustrate the effectiveness of our methods, we provide a comprehensive set of numerical results. These encompass diverse examples, including multi-dimensional settings and nonlinear quadratic mean-field control problems with controlled volatility.

Viewing the process of generating counterfactuals as a source of knowledge – Application to the Naive Bayes classifier

  • paper_url: http://arxiv.org/abs/2309.04284
  • repo_url: None
  • paper_authors: Vincent Lemaire, Nathan Le Boudec, Françoise Fessant, Victor Guyomard
  • for: Examining how counterfactual example generation helps explain the decisions made by machine learning models.
  • methods: Viewing the generation of counterfactual examples as a process that creates knowledge which can be stored and reused, illustrated for additive models and, in particular, the naive Bayes classifier.
  • results: The naive Bayes classifier is shown to have properties that make it particularly interesting and usable for this process.
    Abstract There are now many comprehension algorithms for understanding the decisions of a machine learning algorithm. Among these are those based on the generation of counterfactual examples. This article proposes to view this generation process as a source of creating a certain amount of knowledge that can be stored and used later in different ways. This process is illustrated in the additive model and, more specifically, in the case of the naive Bayes classifier, whose interesting properties for this purpose are shown.
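
One reason the naive Bayes classifier is convenient here is that its factorization over features makes the effect of changing a single feature local and cheap to evaluate. Below is a greedy counterfactual search sketch (an illustration, not the article's algorithm) using scikit-learn's CategoricalNB; all helper names are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

def greedy_counterfactual(nb, x, target, n_categories, max_steps=5):
    """Repeatedly flip the single categorical feature value that most
    increases the target class's log-posterior under the NB model."""
    x = x.copy()
    for _ in range(max_steps):
        if nb.predict([x])[0] == target:
            return x
        base = nb.predict_log_proba([x])[0][target]
        best, best_gain = x, -np.inf
        for j in range(len(x)):
            for v in range(n_categories[j]):
                if v == x[j]:
                    continue
                cand = x.copy(); cand[j] = v
                gain = nb.predict_log_proba([cand])[0][target] - base
                if gain > best_gain:
                    best, best_gain = cand, gain
        x = best
    return x

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(300, 4)); y = (X[:, 0] + X[:, 2] >= 3).astype(int)
nb = CategoricalNB().fit(X, y)
cf = greedy_counterfactual(nb, X[0], target=1 - nb.predict([X[0]])[0], n_categories=[3] * 4)
```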

Learning Zero-Sum Linear Quadratic Games with Improved Sample Complexity

  • paper_url: http://arxiv.org/abs/2309.04272
  • repo_url: https://github.com/wujiduan/zero-sum-lq-games
  • paper_authors: Jiduan Wu, Anas Barakat, Ilyas Fatkhullin, Niao He
  • for: Zero-sum Linear Quadratic (LQ) games are used as a dynamic game formulation for risk-sensitive or robust control, or as a benchmark setting for multi-agent reinforcement learning with two competing agents in continuous state-control spaces.
  • methods: The paper proposes a simpler nested Zeroth-Order (ZO) algorithm that improves sample complexity by several orders of magnitude, with a guaranteed $\widetilde{\mathcal{O}}(\epsilon^{-3})$ sample complexity under the same assumptions using a single-point ZO estimator.
  • results: The paper achieves a better $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity when the estimator is replaced by a two-point estimator, with key improvements in a more sample-efficient nested algorithm design and finer control of the ZO natural gradient estimation error.
    Abstract Zero-sum Linear Quadratic (LQ) games are fundamental in optimal control and can be used (i) as a dynamic game formulation for risk-sensitive or robust control, or (ii) as a benchmark setting for multi-agent reinforcement learning with two competing agents in continuous state-control spaces. In contrast to the well-studied single-agent linear quadratic regulator problem, zero-sum LQ games entail solving a challenging nonconvex-nonconcave min-max problem with an objective function that lacks coercivity. Recently, Zhang et al. discovered an implicit regularization property of natural policy gradient methods which is crucial for safety-critical control systems since it preserves the robustness of the controller during learning. Moreover, in the model-free setting where the knowledge of model parameters is not available, Zhang et al. proposed the first polynomial sample complexity algorithm to reach an $\epsilon$-neighborhood of the Nash equilibrium while maintaining the desirable implicit regularization property. In this work, we propose a simpler nested Zeroth-Order (ZO) algorithm improving sample complexity by several orders of magnitude. Our main result guarantees a $\widetilde{\mathcal{O}}(\epsilon^{-3})$ sample complexity under the same assumptions using a single-point ZO estimator. Furthermore, when the estimator is replaced by a two-point estimator, our method enjoys a better $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity. Our key improvements rely on a more sample-efficient nested algorithm design and finer control of the ZO natural gradient estimation error.
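
The two-point zeroth-order estimator at the heart of such model-free methods is short enough to state directly; the sketch below is a generic textbook version (unit-sphere directions, central differences), not the paper's exact algorithm. A single-point estimator would instead use only one function evaluation per direction.

```python
import numpy as np

def two_point_zo_grad(f, x, delta=1e-2, n_dirs=10, seed=0):
    """Probe f along random unit directions u and average
    g = d * (f(x + delta*u) - f(x - delta*u)) / (2*delta) * u."""
    rng = np.random.default_rng(seed)
    d, g = x.size, np.zeros_like(x)
    for _ in range(n_dirs):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                  # uniform direction on the sphere
        g += d * (f(x + delta * u) - f(x - delta * u)) / (2 * delta) * u
    return g / n_dirs

A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: float(x @ A @ x)                  # toy smooth cost
print(two_point_zo_grad(f, np.array([1.0, -1.0]), n_dirs=200))
# analytic gradient is 2*A*x = [4., -2.]
```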

Optimal Rate of Kernel Regression in Large Dimensions

  • paper_url: http://arxiv.org/abs/2309.04268
  • repo_url: None
  • paper_authors: Weihao Lu, Haobo Zhang, Yicheng Li, Manyun Xu, Qian Lin
  • For: Studying kernel regression for large-dimensional data, where the sample size $n$ depends polynomially on the dimension $d$ of the samples, i.e., $n\asymp d^\gamma$ for some $\gamma >0$.
  • Methods: A general tool that characterizes the upper bound and the minimax lower bound of kernel regression for large-dimensional data through the Mendelson complexity $\varepsilon_{n}^{2}$ and the metric entropy $\bar{\varepsilon}_{n}^{2}$, respectively.
  • Results: When the target function lies in the RKHS associated with a (general) inner product model on $\mathbb{S}^{d}$, the minimax rate of the excess risk of kernel regression is $n^{-1/2}$ when $n\asymp d^\gamma$ for $\gamma =2, 4, 6, 8, \cdots$. The optimal rate is then determined for all $\gamma>0$, and the curve of the optimal rate along $\gamma$ exhibits several new phenomena, including the {\it multiple descent behavior} and the {\it periodic plateau behavior}. An analogous explicit description of the optimal-rate curve holds for the neural tangent kernel (NTK), and hence for wide neural networks as well.
    Abstract We perform a study on kernel regression for large-dimensional data (where the sample size $n$ depends polynomially on the dimension $d$ of the samples, i.e., $n\asymp d^{\gamma}$ for some $\gamma >0$). We first build a general tool to characterize the upper bound and the minimax lower bound of kernel regression for large dimensional data through the Mendelson complexity $\varepsilon_{n}^{2}$ and the metric entropy $\bar{\varepsilon}_{n}^{2}$ respectively. When the target function falls into the RKHS associated with a (general) inner product model defined on $\mathbb{S}^{d}$, we utilize the new tool to show that the minimax rate of the excess risk of kernel regression is $n^{-1/2}$ when $n\asymp d^{\gamma}$ for $\gamma =2, 4, 6, 8, \cdots$. We then further determine the optimal rate of the excess risk of kernel regression for all the $\gamma>0$ and find that the curve of optimal rate varying along $\gamma$ exhibits several new phenomena, including the {\it multiple descent behavior} and the {\it periodic plateau behavior}. As an application, for the neural tangent kernel (NTK), we also provide a similar explicit description of the curve of optimal rate. As a direct corollary, we know these claims hold for wide neural networks as well.

Generating drawdown-realistic financial price paths using path signatures

  • paper_url: http://arxiv.org/abs/2309.04507
  • repo_url: None
  • paper_authors: Emiel Lemahieu, Kris Boudt, Maarten Wyns
  • for: Simulating financial price paths whose drawdowns are quantifiably close to empirical data.
  • methods: A non-parametric Monte Carlo approach combining a variational autoencoder generative model with a drawdown reconstruction loss function; drawdown is approximated as a linear function of the moments of the path (its path signature).
  • results: Close numerical approximations of drawdown via linear regression for fractional Brownian and empirical data, and a host of drawdown-realistic price paths for mixed equity, bond, real estate, and commodity portfolios.
    Abstract A novel generative machine learning approach for the simulation of sequences of financial price data with drawdowns quantifiably close to empirical data is introduced. Applications such as pricing drawdown insurance options or developing portfolio drawdown control strategies call for a host of drawdown-realistic paths. Historical scenarios may be insufficient to effectively train and backtest the strategy, while standard parametric Monte Carlo does not adequately preserve drawdowns. We advocate a non-parametric Monte Carlo approach combining a variational autoencoder generative model with a drawdown reconstruction loss function. To overcome issues of numerical complexity and non-differentiability, we approximate drawdown as a linear function of the moments of the path, known in the literature as path signatures. We prove the required regularity of drawdown function and consistency of the approximation. Furthermore, we obtain close numerical approximations using linear regression for fractional Brownian and empirical data. We argue that linear combinations of the moments of a path yield a mathematically non-trivial smoothing of the drawdown function, which gives one leeway to simulate drawdown-realistic price paths by including drawdown evaluation metrics in the learning objective. We conclude with numerical experiments on mixed equity, bond, real estate and commodity portfolios and obtain a host of drawdown-realistic paths.
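
The drawdown functional the generator must reproduce is simple to compute on a path, even though it is non-smooth, which is why the paper approximates it by a linear function of path signatures. A NumPy sketch of the target quantity (relative drawdown here; the paper's exact convention may differ):

```python
import numpy as np

def drawdown_path(prices):
    """Running drawdown of a price path: distance from the running maximum."""
    running_max = np.maximum.accumulate(prices)
    return (running_max - prices) / running_max   # relative drawdown in [0, 1)

def max_drawdown(prices):
    return drawdown_path(prices).max()

rng = np.random.default_rng(0)
path = 100.0 * np.exp(np.cumsum(rng.normal(0.0002, 0.01, size=252)))
print(f"max drawdown over one simulated year: {max_drawdown(path):.2%}")
```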

Adaptive Distributed Kernel Ridge Regression: A Feasible Distributed Learning Scheme for Data Silos

  • paper_url: http://arxiv.org/abs/2309.04236
  • repo_url: None
  • paper_authors: Di Wang, Xiaotong Liu, Shao-Bo Lin, Ding-Xuan Zhou
  • for: Settling the data-silo problem, caused mainly by privacy and interoperability constraints, to improve data sharing and collaboration among organizations.
  • methods: Divide-and-conquer distributed learning that accounts for autonomy in parameter selection, privacy in communicating non-sensitive information, and the necessity of collaboration for performance improvement.
  • results: Solid theoretical verification and comprehensive experiments demonstrate that AdaDKRR is feasible and effective, outperforming other distributed learning schemes on data-silo problems.
    Abstract Data silos, mainly caused by privacy and interoperability, significantly constrain collaborations among different organizations with similar data for the same purpose. Distributed learning based on divide-and-conquer provides a promising way to settle the data silos, but it suffers from several challenges, including autonomy, privacy guarantees, and the necessity of collaborations. This paper focuses on developing an adaptive distributed kernel ridge regression (AdaDKRR) by taking autonomy in parameter selection, privacy in communicating non-sensitive information, and the necessity of collaborations in performance improvement into account. We provide both solid theoretical verification and comprehensive experiments for AdaDKRR to demonstrate its feasibility and effectiveness. Theoretically, we prove that under some mild conditions, AdaDKRR performs similarly to running the optimal learning algorithms on the whole data, verifying the necessity of collaborations and showing that no other distributed learning scheme can essentially beat AdaDKRR under the same conditions. Numerically, we test AdaDKRR on both toy simulations and two real-world applications to show that AdaDKRR is superior to other existing distributed learning schemes. All these results show that AdaDKRR is a feasible scheme to defend against data silos, something highly desired in numerous application domains such as intelligent decision-making, pricing forecasting, and performance prediction for products.
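
For context, the divide-and-conquer baseline that distributed KRR schemes build on fits one kernel ridge regression per silo and averages the predictions, so only non-sensitive predictions cross silo boundaries. A minimal NumPy sketch under simple assumptions (Gaussian kernel, fixed hyperparameters; AdaDKRR would additionally select parameters adaptively per silo):

```python
import numpy as np

def krr_fit(X, y, gamma=10.0, lam=1e-3):
    K = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    alpha = np.linalg.solve(K + lam * len(X) * np.eye(len(X)), y)
    return X, alpha, gamma

def krr_predict(model, Xq):
    X, alpha, gamma = model
    K = np.exp(-gamma * ((Xq[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return K @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(size=(600, 1)); y = np.sin(6 * X[:, 0]) + 0.1 * rng.normal(size=600)
silos = np.array_split(rng.permutation(600), 4)         # 4 data silos
models = [krr_fit(X[idx], y[idx]) for idx in silos]     # fit locally
Xq = np.linspace(0, 1, 50)[:, None]
y_hat = np.mean([krr_predict(m, Xq) for m in models], axis=0)  # average predictions
```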

Offline Recommender System Evaluation under Unobserved Confounding

  • paper_url: http://arxiv.org/abs/2309.04222
  • repo_url: https://github.com/olivierjeunen/confounding-consequences-2023
  • paper_authors: Olivier Jeunen, Ben London
  • for: This paper highlights the problem of unobserved confounders in off-policy estimation (OPE) methods for recommender systems.
  • methods: The paper focuses on policy-based estimators, where the logging propensities are learned from logged data, and demonstrates the statistical bias that arises due to confounding.
  • results: The paper shows that existing diagnostics are unable to uncover such cases, and that naive propensity estimation under confounding can lead to severely biased metric estimates.
    Abstract Off-Policy Estimation (OPE) methods allow us to learn and evaluate decision-making policies from logged data. This makes them an attractive choice for the offline evaluation of recommender systems, and several recent works have reported successful adoption of OPE methods to this end. An important assumption that makes this work is the absence of unobserved confounders: random variables that influence both actions and rewards at data collection time. Because the data collection policy is typically under the practitioner's control, the unconfoundedness assumption is often left implicit, and its violations are rarely dealt with in the existing literature. This work aims to highlight the problems that arise when performing off-policy estimation in the presence of unobserved confounders, specifically focusing on a recommendation use-case. We focus on policy-based estimators, where the logging propensities are learned from logged data. We characterise the statistical bias that arises due to confounding, and show how existing diagnostics are unable to uncover such cases. Because the bias depends directly on the true and unobserved logging propensities, it is non-identifiable. As the unconfoundedness assumption is famously untestable, this becomes especially problematic. This paper emphasises this common, yet often overlooked issue. Through synthetic data, we empirically show how naive propensity estimation under confounding can lead to severely biased metric estimates that are allowed to fly under the radar. We aim to cultivate an awareness among researchers and practitioners of this important problem, and touch upon potential research directions towards mitigating its effects.
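
The bias mechanism is easy to reproduce in a toy simulation: let an unobserved variable drive both the logged actions and the rewards, fit propensities without it, and the inverse-propensity-scored (IPS) estimate drifts from the true value while looking perfectly well-behaved. A sketch under these assumptions (not the paper's experimental setup):

```python
import numpy as np

def ips_estimate(rewards, actions, contexts, target_pi, logged_propensities):
    """Plain IPS value estimate; biased when the propensities are learned
    from logs produced under unobserved confounding."""
    w = np.array([target_pi(a, x) for a, x in zip(actions, contexts)])
    return np.mean(rewards * w / logged_propensities)

rng = np.random.default_rng(0)
n = 10_000
u = rng.binomial(1, 0.5, n)                  # unobserved confounder
a = rng.binomial(1, 0.3 + 0.4 * u)           # logging policy secretly uses u
r = 1.0 * a * u + 0.1 * rng.normal(size=n)   # reward also depends on u
p_hat = np.full(n, a.mean())                 # propensity fit without seeing u
v_hat = ips_estimate(r, a, np.zeros(n), lambda act, x: float(act == 1), p_hat)
print(v_hat)   # ~0.7, overestimating the always-recommend policy's true value 0.5
```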

Concomitant Group Testing

  • paper_url: http://arxiv.org/abs/2309.04221
  • repo_url: None
  • paper_authors: Thach V. Bui, Jonathan Scarlett
  • for: Introducing a variation of group testing in which a positive test requires a combination of multiple "types" of item: there are multiple disjoint semi-defective sets, a test is positive if and only if it contains at least one item from each set, and the goal is to reliably identify all semi-defective sets using as few tests as possible.
  • methods: A variety of algorithms, primarily for the case of two semi-defective sets, distinguished by whether they are deterministic (zero-error) or randomized (small-error), and by whether they are non-adaptive, fully adaptive, or of limited adaptivity (e.g., 2 or 3 stages).
  • results: The deterministic adaptive algorithm and the randomized algorithms (non-adaptive or of limited adaptivity) are order-optimal in broad scaling regimes of interest and improve significantly over baseline results that solve a more general problem (e.g., hypergraph learning) as an intermediate step.
    Abstract In this paper, we introduce a variation of the group testing problem capturing the idea that a positive test requires a combination of multiple ``types'' of item. Specifically, we assume that there are multiple disjoint \emph{semi-defective sets}, and a test is positive if and only if it contains at least one item from each of these sets. The goal is to reliably identify all of the semi-defective sets using as few tests as possible, and we refer to this problem as \textit{Concomitant Group Testing} (ConcGT). We derive a variety of algorithms for this task, focusing primarily on the case that there are two semi-defective sets. Our algorithms are distinguished by (i) whether they are deterministic (zero-error) or randomized (small-error), and (ii) whether they are non-adaptive, fully adaptive, or have limited adaptivity (e.g., 2 or 3 stages). Both our deterministic adaptive algorithm and our randomized algorithms (non-adaptive or limited adaptivity) are order-optimal in broad scaling regimes of interest, and improve significantly over baseline results that are based on solving a more general problem as an intermediate step (e.g., hypergraph learning).
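
The test model is easy to state in code: a test is positive iff it intersects every semi-defective set. A small simulation sketch with two hypothetical sets (the sets and pool design below are illustrative, not one of the paper's algorithms):

```python
import numpy as np

def concgt_test(test_items, semi_defective_sets):
    """A test is positive iff it contains at least one item from *each*
    semi-defective set (the 'concomitant' condition)."""
    return all(test_items & s for s in semi_defective_sets)

rng = np.random.default_rng(0)
n = 100
S1, S2 = {3, 17}, {42, 71}                    # two disjoint semi-defective sets
pool = [set(rng.choice(n, size=20, replace=False)) for _ in range(8)]
outcomes = [concgt_test(t, [S1, S2]) for t in pool]
```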

Counterfactual Explanations via Locally-guided Sequential Algorithmic Recourse

  • paper_url: http://arxiv.org/abs/2309.04211
  • repo_url: None
  • paper_authors: Edward A. Small, Jeffrey N. Clark, Christopher J. McWilliams, Kacper Sokol, Jeffrey Chan, Flora D. Salim, Raul Santos-Rodriguez
  • for: Making AI systems explainable by providing actionable algorithmic recourse through counterfactual examples.
  • methods: Algorithmic recourse that, given an individual classified as y (the factual), seeks actions such that the prediction becomes the desired class y' (the counterfactual), helping users understand the model's decision process.
  • results: LocalFACE, a model-agnostic technique that composes feasible and actionable counterfactual explanations using locally acquired information at each step of the algorithmic recourse, preserving user privacy and model security.
    Abstract Counterfactuals operationalised through algorithmic recourse have become a powerful tool to make artificial intelligence systems explainable. Conceptually, given an individual classified as y -- the factual -- we seek actions such that their prediction becomes the desired class y' -- the counterfactual. This process offers algorithmic recourse that is (1) easy to customise and interpret, and (2) directly aligned with the goals of each individual. However, the properties of a "good" counterfactual are still largely debated; it remains an open challenge to effectively locate a counterfactual along with its corresponding recourse. Some strategies use gradient-driven methods, but these offer no guarantees on the feasibility of the recourse and are open to adversarial attacks on carefully created manifolds. This can lead to unfairness and lack of robustness. Other methods are data-driven, which mostly addresses the feasibility problem at the expense of privacy, security and secrecy as they require access to the entire training data set. Here, we introduce LocalFACE, a model-agnostic technique that composes feasible and actionable counterfactual explanations using locally-acquired information at each step of the algorithmic recourse. Our explainer preserves the privacy of users by only leveraging data that it specifically requires to construct actionable algorithmic recourse, and protects the model by offering transparency solely in the regions deemed necessary for the intervention.
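
The paper's LocalFACE algorithm is not reproduced here; as a hedged sketch of the general idea of locally-guided sequential recourse, the snippet below greedily probes a small neighbourhood at each step and moves toward the desired class, using only the local information it needs. The classifier, dataset, and step rule are all stand-ins of our own.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(500, noise=0.2, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                    random_state=0).fit(X, y)

def sequential_recourse(x, target, step=0.05, n_probe=12, max_steps=200):
    """Greedy locally-guided recourse: at each step probe a small set of
    candidate moves around x and take the one that most increases the
    probability of the target class."""
    rng = np.random.default_rng(0)
    path = [x.copy()]
    for _ in range(max_steps):
        if clf.predict(x[None])[0] == target:
            break                                  # counterfactual reached
        dirs = rng.normal(size=(n_probe, x.size))
        cands = x + step * dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
        x = cands[np.argmax(clf.predict_proba(cands)[:, target])]
        path.append(x.copy())
    return np.array(path)

x0 = X[y == 0][0].copy()                  # factual instance classified as 0
path = sequential_recourse(x0, target=1)  # sequence of actionable steps
print(len(path) - 1, clf.predict(path[-1][None])[0])
```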

COVID-19 Detection System: A Comparative Analysis of System Performance Based on Acoustic Features of Cough Audio Signals

  • paper_url: http://arxiv.org/abs/2309.04505
  • repo_url: None
  • paper_authors: Asmaa Shati, Ghulam Mubashar Hassan, Amitava Datta
  • for: To automate the detection of COVID-19 from cough signals.
  • methods: Uses Mel Frequency Cepstral Coefficients (MFCC), Chroma, and Spectral Contrast features to improve the performance of machine learning models (SVM and MLP).
  • results: Proposes an efficient COVID-19 detection system that achieves state-of-the-art classification performance on the COUGHVID and Virufy datasets.
    Abstract A wide range of respiratory diseases, such as cold and flu, asthma, and COVID-19, affect people's daily lives worldwide. In medical practice, respiratory sounds are widely used in medical services to diagnose various respiratory illnesses and lung disorders. The traditional diagnosis of such sounds requires specialized knowledge, which can be costly and reliant on human expertise. Recently, cough audio recordings have been used to automate the process of detecting respiratory conditions. This research aims to examine various acoustic features that enhance the performance of machine learning (ML) models in detecting COVID-19 from cough signals. This study investigates the efficacy of three feature extraction techniques, including Mel Frequency Cepstral Coefficients (MFCC), Chroma, and Spectral Contrast features, on two ML algorithms, Support Vector Machine (SVM) and Multilayer Perceptron (MLP), and thus proposes an efficient COVID-19 detection system. The proposed system produces a practical solution and demonstrates higher state-of-the-art classification performance on COUGHVID and Virufy datasets for COVID-19 detection.
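
A minimal sketch of the feature pipeline the paper compares, assuming the librosa and scikit-learn libraries; the random waveforms and labels below are placeholders for COUGHVID/Virufy clips, and the mean-pooling of frame-level features is our choice.

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def extract_features(y, sr):
    """Mean-pooled MFCC + Chroma + Spectral Contrast for one clip."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    return np.concatenate([m.mean(axis=1) for m in (mfcc, chroma, contrast)])

# Placeholder data: random noise bursts with random labels; in practice,
# each clip would be loaded with librosa.load(path) from the datasets.
rng = np.random.default_rng(0)
sr = 16000
X = np.stack([extract_features(rng.normal(size=sr), sr) for _ in range(40)])
labels = rng.integers(0, 2, size=40)

for clf in (SVC(kernel="rbf"), MLPClassifier(max_iter=1000)):
    print(type(clf).__name__, cross_val_score(clf, X, labels, cv=5).mean())
```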

Adversarial attacks on hybrid classical-quantum Deep Learning models for Histopathological Cancer Detection

  • paper_url: http://arxiv.org/abs/2309.06377
  • repo_url: None
  • paper_authors: Biswaraj Baral, Reek Majumdar, Bhavika Bhalgamiya, Taposh Dutta Roy
  • for: histopathological cancer detection
  • methods: hybrid classical-quantum Deep Learning models, quantum transfer learning strategy, multiple transfer learning models (ResNet18, VGG-16, Inception-v3, AlexNet) and quantum circuit-based variational quantum circuits (VQC)
  • results: better accuracy than classical image classification models under adversarial attacks
    Abstract We present an effective application of quantum machine learning in histopathological cancer detection. The study here emphasizes two primary applications of hybrid classical-quantum Deep Learning models. The first application is to build a classification model for histopathological cancer detection using the quantum transfer learning strategy. The second application is to test the performance of this model for various adversarial attacks. Rather than using a single transfer learning model, the hybrid classical-quantum models are tested using multiple transfer learning models, especially ResNet18, VGG-16, Inception-v3, and AlexNet as feature extractors and integrate it with several quantum circuit-based variational quantum circuits (VQC) with high expressibility. As a result, we provide a comparative analysis of classical models and hybrid classical-quantum transfer learning models for histopathological cancer detection under several adversarial attacks. We compared the performance accuracy of the classical model with the hybrid classical-quantum model using pennylane default quantum simulator. We also observed that for histopathological cancer detection under several adversarial attacks, Hybrid Classical-Quantum (HCQ) models provided better accuracy than classical image classification models.
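
A hedged sketch of the quantum-transfer-learning head, assuming the PennyLane library: a random vector stands in for the (projected) ResNet18/VGG-16 features, and the circuit ansatz is a generic strongly entangling layer rather than the paper's exact VQC.

```python
import numpy as np
import pennylane as qml

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def vqc(features, weights):
    """Variational quantum circuit acting as the classifier head on
    top of classical CNN features (quantum transfer learning)."""
    qml.AngleEmbedding(features, wires=range(n_qubits))
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return qml.expval(qml.PauliZ(0))

shape = qml.StronglyEntanglingLayers.shape(n_layers=2, n_wires=n_qubits)
rng = np.random.default_rng(0)
weights = rng.normal(size=shape)

# Stand-in for a pretrained-backbone embedding projected to n_qubits dims.
features = rng.normal(size=n_qubits)
print(vqc(features, weights))   # expectation in [-1, 1] -> class score
```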

Preserved Edge Convolutional Neural Network for Sensitivity Enhancement of Deuterium Metabolic Imaging (DMI)

  • paper_url: http://arxiv.org/abs/2309.04100
  • repo_url: None
  • paper_authors: Siyuan Dong, Henk M. De Feyter, Monique A. Thomas, Robin A. de Graaf, James S. Duncan
  • for: To enhance the sensitivity of Deuterium Metabolic Imaging (DMI).
  • methods: A deep learning approach: a convolutional neural network (CNN) estimates the 2H-labeled metabolite concentrations from low-SNR, distorted DMI FIDs, with estimation precision further improved by MRI-based edge-preserving regularization for each dataset.
  • results: PRECISE-DMI visually improves metabolic maps from low-SNR data and is quantitatively more precise than standard Fourier reconstruction; in rat brain tumor experiments it enables higher spatial resolution (from >8 to 2 $\mu$L) or shorter scan times (from 32 to 4 min) with more precise 2H-labeled lactate and glutamate + glutamine measurements, although rigorous SD-bias analyses show that overusing the edge-preserving regularization can compromise accuracy.
    Abstract Purpose: Common to most MRSI techniques, the spatial resolution and the minimal scan duration of Deuterium Metabolic Imaging (DMI) are limited by the achievable SNR. This work presents a deep learning method for sensitivity enhancement of DMI. Methods: A convolutional neural network (CNN) was designed to estimate the 2H-labeled metabolite concentrations from low SNR and distorted DMI FIDs. The CNN was trained with synthetic data that represent a range of SNR levels typically encountered in vivo. The estimation precision was further improved by fine-tuning the CNN with MRI-based edge-preserving regularization for each DMI dataset. The proposed processing method, PReserved Edge ConvolutIonal neural network for Sensitivity Enhanced DMI (PRECISE-DMI), was applied to simulation studies and in vivo experiments to evaluate the anticipated improvements in SNR and investigate the potential for inaccuracies. Results: PRECISE-DMI visually improved the metabolic maps of low SNR datasets, and quantitatively provided higher precision than the standard Fourier reconstruction. Processing of DMI data acquired in rat brain tumor models resulted in more precise determination of 2H-labeled lactate and glutamate + glutamine levels, at increased spatial resolution (from >8 to 2 $\mu$L) or shortened scan time (from 32 to 4 min) compared to standard acquisitions. However, rigorous SD-bias analyses showed that overuse of the edge-preserving regularization can compromise the accuracy of the results. Conclusion: PRECISE-DMI allows a flexible trade-off between enhancing the sensitivity of DMI and minimizing the inaccuracies. With typical settings, the DMI sensitivity can be improved by 3-fold while retaining the capability to detect local signal variations.
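
The paper's network and training data are not reproduced here; the following is a toy PyTorch stand-in for the supervised mapping it describes, from a complex FID (stored as two real channels) to metabolite amplitudes. The architecture and sizes are our assumptions, and the MRI-based edge-preserving fine-tuning is omitted.

```python
import torch
import torch.nn as nn

class FidToConcentration(nn.Module):
    """Toy 1-D CNN mapping a complex FID (two real channels)
    to per-voxel metabolite amplitudes."""
    def __init__(self, n_metabolites=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, n_metabolites),
        )

    def forward(self, fid):
        return self.net(fid)

model = FidToConcentration()
fid = torch.randn(8, 2, 512)     # batch of synthetic noisy FIDs
target = torch.rand(8, 3)        # synthetic ground-truth concentrations
loss = nn.functional.mse_loss(model(fid), target)
loss.backward()                  # one supervised step (optimizer omitted)
print(loss.item())
```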

Sample-Efficient Co-Design of Robotic Agents Using Multi-fidelity Training on Universal Policy Network

  • paper_url: http://arxiv.org/abs/2309.04085
  • repo_url: None
  • paper_authors: Kishan R. Nagiredla, Buddhika L. Semage, Thommen G. Karimpanal, Arun Kumar A. V, Santu Rana
  • for: To improve the sample efficiency of co-design, where the controller and the agent's physical design are optimized jointly.
  • methods: Proposes a Hyperband-based multi-fidelity design-exploration strategy that ties the controllers learnt across the design space together through a universal policy learner, providing an increasingly strong warm start for each subsequent controller learning problem; a particular traversal of the Hyperband design matrix reduces its stochasticity as the warm-starting effect strengthens.
  • results: Experiments on a wide range of agent design problems show clear gains over the baselines; analysis of the optimized designs reveals design simplifications and non-intuitive alterations, some of which have emerged in the biological world.
    Abstract Co-design involves simultaneously optimizing the controller and agents physical design. Its inherent bi-level optimization formulation necessitates an outer loop design optimization driven by an inner loop control optimization. This can be challenging when the design space is large and each design evaluation involves data-intensive reinforcement learning process for control optimization. To improve the sample-efficiency we propose a multi-fidelity-based design exploration strategy based on Hyperband where we tie the controllers learnt across the design spaces through a universal policy learner for warm-starting the subsequent controller learning problems. Further, we recommend a particular way of traversing the Hyperband generated design matrix that ensures that the stochasticity of the Hyperband is reduced the most with the increasing warm starting effect of the universal policy learner as it is strengthened with each new design evaluation. Experiments performed on a wide range of agent design problems demonstrate the superiority of our method compared to the baselines. Additionally, analysis of the optimized designs shows interesting design alterations including design simplifications and non-intuitive alterations that have emerged in the biological world.
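
A compact sketch of the standard Hyperband successive-halving loop that the multi-fidelity strategy builds on; in the paper's setting, warm-starting controller training via the universal policy would live inside `evaluate`. The toy 1-D design space and noisy score function are our own.

```python
import math
import random

def hyperband(sample_design, evaluate, R=81, eta=3):
    """Standard Hyperband: successive-halving brackets over sampled
    designs; evaluate(design, budget) returns a (noisy) score that
    sharpens as the training budget grows."""
    s_max = int(math.log(R, eta))
    best = (-math.inf, None)
    for s in range(s_max, -1, -1):
        n = math.ceil((s_max + 1) * eta**s / (s + 1))
        r = R * eta**-s
        designs = [sample_design() for _ in range(n)]
        for i in range(s + 1):
            pairs = [(evaluate(d, r * eta**i), d) for d in designs]
            best = max(best, max(pairs))
            keep = max(1, len(designs) // eta)
            designs = [d for _, d in sorted(pairs, reverse=True)[:keep]]
    return best

# Toy design whose true performance peaks at 0.3; evaluation noise
# shrinks with budget, mimicking longer controller training.
random.seed(0)
score, design = hyperband(
    sample_design=lambda: random.random(),
    evaluate=lambda d, b: -(d - 0.3) ** 2 + random.gauss(0, 1.0 / b),
)
print(score, design)
```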

Enabling the Evaluation of Driver Physiology Via Vehicle Dynamics

  • paper_url: http://arxiv.org/abs/2309.04078
  • repo_url: None
  • paper_authors: Rodrigo Ordonez-Hurtado, Bo Wen, Nicholas Barra, Ryan Vimba, Sergio Cabrero-Barros, Sergiy Zhuk, Jeffrey L. Rogers
  • for: To transform a vehicle into a connected ecosystem capable of assessing driver physiology.
  • methods: Integrates an array of commercial sensors from the automotive and digital health sectors together with driver inputs from the vehicle itself; the recorded data streams are processed to extract key parameters relating driving behavior to the external environment and to vital physiological responses.
  • results: The driver evaluation system has the potential to amplify road safety and, when paired with data from conventional health settings, to enhance early detection of health-related complications.
    Abstract Driving is a daily routine for many individuals across the globe. This paper presents the configuration and methodologies used to transform a vehicle into a connected ecosystem capable of assessing driver physiology. We integrated an array of commercial sensors from the automotive and digital health sectors along with driver inputs from the vehicle itself. This amalgamation of sensors allows for meticulous recording of the external conditions and driving maneuvers. These data streams are processed to extract key parameters, providing insights into driver behavior in relation to their external environment and illuminating vital physiological responses. This innovative driver evaluation system holds the potential to amplify road safety. Moreover, when paired with data from conventional health settings, it may enhance early detection of health-related complications.

Riemannian Langevin Monte Carlo schemes for sampling PSD matrices with fixed rank

  • paper_url: http://arxiv.org/abs/2309.04072
  • repo_url: None
  • paper_authors: Tianmin Yu, Shixin Zheng, Jianfeng Lu, Govind Menon, Xiangxiong Zhang
  • for: Introduces two explicit schemes to sample matrices from Gibbs distributions on $\mathcal{S}^{n,p}_+$, the manifold of real positive semi-definite (PSD) matrices of size $n\times n$ and rank $p$.
  • methods: Both schemes are Euler-Maruyama discretizations of the Riemannian Langevin equation (RLE) with Brownian motion on the manifold, under either the metric inherited from the embedding $\mathcal{S}^{n,p}_+ \subset \mathbb{R}^{n\times n}$ or the Bures-Wasserstein metric of the quotient geometry.
  • results: Provides example energy functions with explicit Gibbs distributions that allow numerical validation of the schemes.
    Abstract This paper introduces two explicit schemes to sample matrices from Gibbs distributions on $\mathcal S^{n,p}_+$, the manifold of real positive semi-definite (PSD) matrices of size $n\times n$ and rank $p$. Given an energy function $\mathcal E:\mathcal S^{n,p}_+\to \mathbb{R}$ and certain Riemannian metrics $g$ on $\mathcal S^{n,p}_+$, these schemes rely on an Euler-Maruyama discretization of the Riemannian Langevin equation (RLE) with Brownian motion on the manifold. We present numerical schemes for RLE under two fundamental metrics on $\mathcal S^{n,p}_+$: (a) the metric obtained from the embedding of $\mathcal S^{n,p}_+ \subset \mathbb{R}^{n\times n} $; and (b) the Bures-Wasserstein metric corresponding to quotient geometry. We also provide examples of energy functions with explicit Gibbs distributions that allow numerical validation of these schemes.
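
For intuition only, here is an Euler-Maruyama Langevin sampler in the factor parametrization X = Y Yᵀ (which covers rank ≤ p PSD matrices), for the energy E(X) = tr(X), whose Gibbs law is Gaussian in Y. This is a simplified stand-in, not the paper's embedded-metric or Bures-Wasserstein schemes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, h, steps = 5, 2, 1e-3, 20000

# Euler-Maruyama Langevin in the factor Y (X = Y Y^T), for
# E(X) = tr(X) = ||Y||_F^2, so grad_Y E = 2Y.
Y = rng.normal(size=(n, p))
for _ in range(steps):
    grad = 2.0 * Y
    Y = Y - h * grad + np.sqrt(2 * h) * rng.normal(size=(n, p))

X = Y @ Y.T
# The sample stays PSD with rank at most p by construction.
print(np.linalg.matrix_rank(X), np.all(np.linalg.eigvalsh(X) > -1e-10))
```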

Weighted Unsupervised Domain Adaptation Considering Geometry Features and Engineering Performance of 3D Design Data

  • paper_url: http://arxiv.org/abs/2309.04499
  • repo_url: None
  • paper_authors: Seungyeon Shin, Namwoo Kang
  • for: 这个研究旨在提高设计过程中的设计优化效率,使用深度学习模型来预测工程性能。
  • methods: 本研究提出了一种双重权重领域适应方法,考虑了3D设计数据的几何特征和工程性能。这方法包括对伪设定进行反对抗训练,以提取不受领域影响的特征,并使用这些特征进行多出力回归 зада项来预测工程性能。
  • results: 研究发现,这种双重权重领域适应方法可以有效地预测3D设计中的最大 von Mises 压力和相应的位置。此外,这种方法可以对于新领域资料进行预测,而不需要大量的训练数据和 Computational expensive。
    Abstract The product design process in manufacturing involves iterative design modeling and analysis to achieve the target engineering performance, but such an iterative process is time consuming and computationally expensive. Recently, deep learning-based engineering performance prediction models have been proposed to accelerate design optimization. However, they only guarantee predictions on training data and may be inaccurate when applied to new domain data. In particular, 3D design data have complex features, which means domains with various distributions exist. Thus, the utilization of deep learning has limitations due to the heavy data collection and training burdens. We propose a bi-weighted unsupervised domain adaptation approach that considers the geometry features and engineering performance of 3D design data. It is specialized for deep learning-based engineering performance predictions. Domain-invariant features can be extracted through an adversarial training strategy by using hypothesis discrepancy, and a multi-output regression task can be performed with the extracted features to predict the engineering performance. In particular, we present a source instance weighting method suitable for 3D design data to avoid negative transfers. The developed bi-weighting strategy based on the geometry features and engineering performance of engineering structures is incorporated into the training process. The proposed model is tested on a wheel impact analysis problem to predict the magnitude of the maximum von Mises stress and the corresponding location of 3D road wheels. This mechanism can reduce the target risk for unlabeled target domains on the basis of weighted multi-source domain knowledge and can efficiently replace conventional finite element analysis.
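
A generic instance-weighted adversarial domain-adaptation training step, sketched in PyTorch with a gradient-reversal layer; note that the paper's method aligns domains via hypothesis discrepancy rather than gradient reversal, and all tensors below are random stand-ins for 3D design features and performance labels.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity forward, negated gradient backward -- the classic
    gradient-reversal trick for adversarial feature alignment."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, g):
        return -g

feat = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))
reg_head = nn.Linear(32, 2)   # multi-output regression: stress value + location
dom_head = nn.Linear(32, 1)   # domain discriminator (source vs. target)
params = [*feat.parameters(), *reg_head.parameters(), *dom_head.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

xs, ys = torch.randn(32, 16), torch.randn(32, 2)  # labeled source designs
xt = torch.randn(32, 16)                          # unlabeled target designs
w = torch.rand(32)                                # per-instance source weights

zs, zt = feat(xs), feat(xt)
task = (w * ((reg_head(zs) - ys) ** 2).mean(dim=1)).mean()
logits = dom_head(GradReverse.apply(torch.cat([zs, zt])))
dom_labels = torch.cat([torch.ones(32, 1), torch.zeros(32, 1)])
adv = nn.functional.binary_cross_entropy_with_logits(logits, dom_labels)
(task + adv).backward()
opt.step()
print(task.item(), adv.item())
```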

eess.IV - 2023-09-08

Non-convex regularization based on shrinkage penalty function

  • paper_url: http://arxiv.org/abs/2309.04593
  • repo_url: None
  • paper_authors: Manu Ghulyani, Muthuvel Arigovindan
  • for: Studies image recovery with second-derivative-based regularization for better structure preservation.
  • methods: Regularizes with the Hessian Schatten norm (HSN), which penalizes the eigenvalues of the image Hessian rather than the image gradient and thus avoids the staircase effect; a non-convex shrinkage penalty on the Hessian eigenvalues, specified indirectly through its proximal operation, replaces the convex lp norm.
  • results: Recovers sharper, more structure-preserving images than the convex counterparts, with a provably convergent proximal algorithm.
    Abstract Total Variation regularization (TV) is a seminal approach for image recovery. TV involves the norm of the image's gradient, aggregated over all pixel locations. Therefore, TV leads to piece-wise constant solutions, resulting in what is known as the "staircase effect." To mitigate this effect, the Hessian Schatten norm regularization (HSN) employs second-order derivatives, represented by the pth norm of eigenvalues in the image hessian, summed across all pixels. HSN demonstrates superior structure-preserving properties compared to TV. However, HSN solutions tend to be overly smoothed. To address this, we introduce a non-convex shrinkage penalty applied to the Hessian's eigenvalues, deviating from the convex lp norm. It is important to note that the shrinkage penalty is not defined directly in closed form, but specified indirectly through its proximal operation. This makes constructing a provably convergent algorithm difficult as the singular values are also defined through a non-linear operation. However, we were able to derive a provably convergent algorithm using proximal operations. We prove the convergence by establishing that the proposed regularization adheres to restricted proximal regularity. The images recovered by this regularization were sharper than the convex counterparts.
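
One classical example of a non-convex shrinkage specified through its proximal map is firm thresholding (the shrinkage rule associated with the minimax-concave penalty); comparing it to soft thresholding illustrates the reduced bias on large values that motivates penalties like the paper's. This is an illustration, not the paper's specific penalty.

```python
import numpy as np

def soft(x, lam):
    """Prox of the convex l1 penalty (soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def firm(x, lam, mu):
    """Firm thresholding: zero below lam, identity above mu, linear in
    between -- large values are left unbiased, unlike soft thresholding.
    Requires mu > lam."""
    a = np.abs(x)
    return np.where(a <= lam, 0.0,
           np.where(a >= mu, x, np.sign(x) * mu * (a - lam) / (mu - lam)))

x = np.linspace(-3, 3, 13)
print(soft(x, 1.0))
print(firm(x, 1.0, 2.0))   # note: firm(3) = 3, while soft(3) = 2
```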

Motion Compensated Unsupervised Deep Learning for 5D MRI

  • paper_url: http://arxiv.org/abs/2309.04552
  • repo_url: None
  • paper_authors: Joseph Kettelkamp, Ludovica Romanin, Davide Piccini, Sarv Priya, Mathews Jacob
  • for: To speed up 5D cardiac MRI reconstruction, improve its quality, and remove the dependence on uniform binning of the acquired data.
  • methods: An unsupervised deep learning algorithm that models the data in each cardiac/respiratory bin as Fourier samples of a deformed 3D image template; the deformation maps are produced by a convolutional neural network driven by the physiological phase information and are estimated jointly with the template.
  • results: Validated on 5D bSSFP datasets acquired from two subjects, with improved data efficiency and image quality.
    Abstract We propose an unsupervised deep learning algorithm for the motion-compensated reconstruction of 5D cardiac MRI data from 3D radial acquisitions. Ungated free-breathing 5D MRI simplifies the scan planning, improves patient comfort, and offers several clinical benefits over breath-held 2D exams, including isotropic spatial resolution and the ability to reslice the data to arbitrary views. However, the current reconstruction algorithms for 5D MRI take very long computational time, and their outcome is greatly dependent on the uniformity of the binning of the acquired data into different physiological phases. The proposed algorithm is a more data-efficient alternative to current motion-resolved reconstructions. This motion-compensated approach models the data in each cardiac/respiratory bin as Fourier samples of the deformed version of a 3D image template. The deformation maps are modeled by a convolutional neural network driven by the physiological phase information. The deformation maps and the template are then jointly estimated from the measured data. The cardiac and respiratory phases are estimated from 1D navigators using an auto-encoder. The proposed algorithm is validated on 5D bSSFP datasets acquired from two subjects.
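
A 2-D, image-domain toy of the deformed-template model the paper fits (the paper works with Fourier samples in 3-D and generates the flows with a CNN driven by the physiological phases): a shared template and per-bin deformation fields are jointly optimized for data consistency.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
H = W = 32
target_bins = torch.rand(4, 1, H, W)       # stand-in for binned data

template = torch.rand(1, 1, H, W, requires_grad=True)
flow = torch.zeros(4, H, W, 2, requires_grad=True)   # per-bin deformation

ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
base = torch.stack([xs, ys], dim=-1)       # identity sampling grid

opt = torch.optim.Adam([template, flow], lr=0.05)
for _ in range(200):
    # Each bin is a warped copy of the shared template.
    warped = F.grid_sample(template.expand(4, -1, -1, -1),
                           base + flow, align_corners=True)
    loss = F.mse_loss(warped, target_bins)  # data-consistency term
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())
```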

eess.SP - 2023-09-08

Enhancing Missing Data Imputation of Non-stationary Signals with Harmonic Decomposition

  • paper_url: http://arxiv.org/abs/2309.04630
  • repo_url: None
  • paper_authors: Joaquin Ruiz, Hau-tieng Wu, Marcelo A. Colominas
  • for: Imputing missing values in time series, including series degraded by low quality or over-saturation, is a significant signal-processing challenge.
  • methods: Proposes Harmonic Level Interpolation (HaLI), which runs after any chosen imputation algorithm and refines its output for oscillatory time series using a harmonic decomposition based on an adaptive non-harmonic model.
  • results: Experiments on synthetic and real signals consistently show that HaLI improves existing imputation algorithms; the method is publicly available as Matlab code.
    Abstract Dealing with time series with missing values, including those afflicted by low quality or over-saturation, presents a significant signal processing challenge. The task of recovering these missing values, known as imputation, has led to the development of several algorithms. However, we have observed that the efficacy of these algorithms tends to diminish when the time series exhibit non-stationary oscillatory behavior. In this paper, we introduce a novel algorithm, coined Harmonic Level Interpolation (HaLI), which enhances the performance of existing imputation algorithms for oscillatory time series. After running any chosen imputation algorithm, HaLI leverages the harmonic decomposition based on the adaptive nonharmonic model of the initial imputation to improve the imputation accuracy for oscillatory time series. Experimental assessments conducted on synthetic and real signals consistently highlight that HaLI enhances the performance of existing imputation algorithms. The algorithm is made publicly available as a readily employable Matlab code for other researchers to use.
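
A hedged sketch of the two-stage idea, with a known fundamental frequency standing in for the paper's adaptive non-harmonic model: run any baseline imputation, then refit a small harmonic model on the observed samples by least squares and overwrite the gap.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(1000) / 1000
truth = np.cos(2 * np.pi * 7 * t) + 0.4 * np.cos(2 * np.pi * 21 * t)
x = truth + 0.05 * rng.normal(size=t.size)

missing = np.zeros(t.size, bool)
missing[300:380] = True                   # a gap to impute

# Stage 1: any baseline imputation (here, simple linear interpolation).
x0 = x.copy()
x0[missing] = np.interp(t[missing], t[~missing], x[~missing])

# Stage 2: harmonic refinement in the spirit of HaLI -- least-squares fit
# of a small harmonic model on the observed samples, then overwrite the
# gap. f0 is assumed known here; the paper estimates it adaptively.
f0, K = 7.0, 3
A = np.column_stack([f(2 * np.pi * k * f0 * t)
                     for k in range(1, K + 1) for f in (np.cos, np.sin)])
coef, *_ = np.linalg.lstsq(A[~missing], x[~missing], rcond=None)
x_hali = x0.copy()
x_hali[missing] = A[missing] @ coef

for name, est in (("baseline", x0), ("harmonic", x_hali)):
    print(name, np.sqrt(np.mean((est[missing] - truth[missing]) ** 2)))
```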

Wi-BFI: Extracting the IEEE 802.11 Beamforming Feedback Information from Commercial Wi-Fi Devices

  • paper_url: http://arxiv.org/abs/2309.04408
  • repo_url: None
  • paper_authors: Khandaker Foysal Haque, Francesca Meneghello, Francesco Restuccia
  • for: Provides an open-source tool for extracting and decoding the beamforming feedback angles (BFAs) of Wi-Fi MIMO operation, which can serve as a proxy of the channel frequency response for purposes such as human activity recognition and device fingerprinting.
  • methods: Wi-BFI captures BFAs frames over the air and reconstructs the beamforming feedback information (BFI), a compressed representation of the channel frequency response (CFR); it supports IEEE 802.11ac and 802.11ax networks on 160/80/40/20 MHz channels and decodes both multi-user and single-user MIMO feedback.
  • results: Supports real-time and offline extraction and storage of BFAs and BFI, including a live visual representation of the channel state; the code is open source and also available as a pip package.
    Abstract Recently, researchers have shown that the beamforming feedback angles (BFAs) used for Wi-Fi multiple-input multiple-output (MIMO) operations can be effectively leveraged as a proxy of the channel frequency response (CFR) for different purposes. Examples are passive human activity recognition and device fingerprinting. However, even though the BFAs report frames are sent in clear text, there is not yet a unified open-source tool to extract and decode the BFAs from the frames. To fill this gap, we developed Wi-BFI, the first tool that allows retrieving Wi-Fi BFAs and reconstructing the beamforming feedback information (BFI) - a compressed representation of the CFR - from the BFAs frames captured over the air. The tool supports BFAs extraction within both IEEE 802.11ac and 802.11ax networks operating on radio channels with 160/80/40/20 MHz bandwidth. Both multi-user and single-user MIMO feedback can be decoded through Wi-BFI. The tool supports real-time and offline extraction and storage of BFAs and BFI. The real-time mode also includes a visual representation of the channel state that continuously updates based on the collected data. Wi-BFI code is open source and the tool is also available as a pip package.

Sparse Codesigned Communication and Radar Systems

  • paper_url: http://arxiv.org/abs/2309.04362
  • repo_url: None
  • paper_authors: Hyeon Seok Rou, Giuseppe Thadeu Freitas de Abreu, Saravanan Nagesh, Andreas Bathelt, David González G., Osvaldo Gonsa, Hans-Ludwig Bloecher
  • for: Proposes a new ISAC framework, "sparse codesigned communication and radar (SCCR)" systems, which codesigns communication and radar signals by sparsifying the resource domain and the waveform spectrum domain.
  • methods: Leverages sparsity-robust signal processing techniques, such as sparse radar reconstruction and index modulation (IM), to address the challenges introduced by the sparse codesign.
  • results: Outlines a novel classification of the relevant state-of-the-art methods, the implications of the sparse-codesign challenges, and a variety of novel SCCR frameworks.
    Abstract In the envisioned beyond-fifth-generation (B5G) and sixth-generation (6G) scenarios, which expect massive multiple-input multiple-output (mMIMO) and high frequency communications in the millimeter-wave (mmWave) and Terahertz (THz) bands, efficiency in both energy and spectrum is of increasing significance. To that extent, a novel ISAC framework called "sparse codesigned communication and radar (SCCR)" systems is described, which codesigns both communication and radar signals by a sparsification of the resource domain and the waveform spectrum domain. This improves the spectral and energy efficiency, but at the inherent cost of missing radar spectrum, an irregular beampattern, and decreased throughput and diversity. Such challenges can, however, be mitigated by leveraging various sparsity-robust signal processing techniques such as sparse radar reconstruction and index modulation (IM). In light of the above, this white paper outlines the proposed article, which provides an overview and a novel classification of the relevant state-of-the-art (SotA) methods and the implications of the challenges in the sparse codesign of the system, followed by a variety of novel SCCR frameworks.

Design of a Single-User RIS-Aided MISO System Based on Statistical Channel Knowledge

  • paper_url: http://arxiv.org/abs/2309.04341
  • repo_url: None
  • paper_authors: Sadaf Syed, Dominik Semmler, Donia Ben Amor, Michael Joham, Wolfgang Utschick
  • for: To improve the spectral and energy efficiency of beyond-5G networks at low cost.
  • methods: Exploits second-order channel statistics to reduce the training overhead.
  • results: The proposed algorithms require neither CSI estimation nor RIS reconfiguration in every channel coherence interval, addressing one of the most critical practical issues in RIS-aided systems.
    Abstract Reconfigurable intelligent surface (RIS) is considered a prospective technology for beyond fifth-generation (5G) networks to improve the spectral and energy efficiency at a low cost. Prior works on the RIS mainly rely on perfect channel state information (CSI), which imposes a huge computational complexity. This work considers a single-user RIS-assisted communication system, where the second-order statistical knowledge of the channels is exploited to reduce the training overhead. We present algorithms that do not require estimation of the CSI and reconfiguration of the RIS in every channel coherence interval, which constitutes one of the most critical practical issues in an RIS-aided system.

On the performance of an integrated communication and localization system: an analytical framework

  • paper_url: http://arxiv.org/abs/2309.04335
  • repo_url: None
  • paper_authors: Yuan Gao, Haonan Hu, Jiliang Zhang, Yanliang Jin, Shugong Xu, Xiaoli Chu
  • for: Studies the performance bound of an integrated localization and communication (ILAC) system and the trade-off between communication and localization performance.
  • methods: Derives a closed-form expression for the capacity loss versus the localization Cramer-Rao lower bound (CRB) loss under time-domain and frequency-domain resource allocation.
  • results: Simulations validate the analytical model and show that frequency-domain allocation is preferable with fewer gNB antennas and a larger UE-gNB distance, while time-domain allocation is preferable with more antennas and a smaller distance.
    Abstract Quantifying the performance bound of an integrated localization and communication (ILAC) system and the trade-off between communication and localization performance is critical. In this letter, we consider an ILAC system that can perform communication and localization via time-domain or frequency-domain resource allocation. We develop an analytical framework to derive the closed-form expression of the capacity loss versus localization Cramer-Rao lower bound (CRB) loss via time-domain and frequency-domain resource allocation. Simulation results validate the analytical model and demonstrate that frequency-domain resource allocation is preferable in scenarios with a smaller number of antennas at the next generation nodeB (gNB) and a larger distance between user equipment (UE) and gNB, while time-domain resource allocation is preferable in scenarios with a larger number of antennas and smaller distance between UE and the gNB.

Trade-Offs in Decentralized Multi-Antenna Architectures: Sparse Combining Modules for WAX Decomposition

  • paper_url: http://arxiv.org/abs/2309.04297
  • repo_url: None
  • paper_authors: Juan Vidal Alegría, Fredrik Rusek
  • for: Finding decentralized receiver architectures for centralized multi-antenna systems, with the goal of reducing the interconnection bandwidth to the central processing unit (CPU) and the processing complexity.
  • methods: Uses the WAX decomposition, a newly defined matrix decomposition, to achieve information-lossless processing in decentralized architectures; presents several constructions for the linear combining modules it requires and shows how these structures facilitate decentralized calculation of the WAX decomposition.
  • results: Obtains an information-lossless trade-off between the level of decentralization and the decentralized processing complexity, and shows that the proposed sparse combining modules can efficiently realize this trade-off at an arbitrary level of decentralization.
    Abstract With the increase in the number of antennas at base stations (BSs), centralized multi-antenna architectures have encountered scalability problems from excessive interconnection bandwidth to the central processing unit (CPU), as well as increased processing complexity. Thus, research efforts have been directed towards finding decentralized receiver architectures where a part of the processing is performed at the antenna end (or close to it). A recent paper put forth an information-lossless trade-off between level of decentralization (inputs to CPU) and decentralized processing complexity (multiplications per antenna). This trade-off was obtained by studying a newly defined matrix decomposition--the WAX decomposition--which is directly related to the information-lossless processing that should be applied in a general framework to exploit the trade-off. The general framework consists of three stages: a set of decentralized filters, a linear combining module, and a processing matrix applied at the CPU; these three stages are linear transformations which can be identified with the three constituent matrices of the WAX decomposition. The previous work was unable to provide explicit constructions for linear combining modules which are valid for WAX decomposition, while it remarked the importance of these modules being sparse with 1s and 0s so they could be efficiently implemented using hardware accelerators. In this work we present a number of constructions, as well as possible variations of them, for effectively defining linear combining modules which can be used in the WAX decomposition. Furthermore, we show how these structures facilitate decentralized calculation of the WAX decomposition for applying information-lossless processing in architectures with an arbitrary level of decentralization.

Modulation and Estimation with a Helper

  • paper_url: http://arxiv.org/abs/2309.04277
  • repo_url: None
  • paper_authors: Anatoly Khina, Neri Merhav
  • for: Transmitting a parameter value over an additive white Gaussian noise (AWGN) channel when a helper observes the noise non-causally and provides a description of limited rate $R_\mathrm{h}$ to the transmitter and/or the receiver.
  • methods: Derives upper and lower bounds on the optimal achievable $\alpha$-th moment of the estimation error, which coincide for small $\alpha$ and low SNR; the upper bound relies on a recently proposed channel-coding scheme that conveys $R_\mathrm{h}$ bits essentially error-free, allocated to the most significant bits of the quantized parameter.
  • results: Under a total transmit energy constraint, derives achievability results for channel coding and parameter modulation in several helper scenarios; with a message-informed helper assisting both the transmitter and the receiver, the channel-coding error probability decays doubly exponentially, and the results carry over to continuous-time power-limited AWGN channels with unconstrained bandwidth.
    Abstract The problem of transmitting a parameter value over an additive white Gaussian noise (AWGN) channel is considered, where, in addition to the transmitter and the receiver, there is a helper that observes the noise non-causally and provides a description of limited rate $R_\mathrm{h}$ to the transmitter and/or the receiver. We derive upper and lower bounds on the optimal achievable $\alpha$-th moment of the estimation error and show that they coincide for small values of $\alpha$ and for low SNR values. The upper bound relies on a recently proposed channel-coding scheme that effectively conveys $R_\mathrm{h}$ bits essentially error-free and the rest of the rate - over the same AWGN channel without help, with the error-free bits allocated to the most significant bits of the quantized parameter. We then concentrate on the setting with a total transmit energy constraint, for which we derive achievability results for both channel coding and parameter modulation for several scenarios: when the helper assists only the transmitter or only the receiver and knows the noise, and when the helper assists the transmitter and/or the receiver and knows both the noise and the message. In particular, for the message-informed helper that assists both the receiver and the transmitter, it is shown that the error probability in the channel-coding task decays doubly exponentially. Finally, we translate these results to those for continuous-time power-limited AWGN channels with unconstrained bandwidth. As a byproduct, we show that the capacity with a message-informed helper that is available only at the transmitter can exceed the capacity of the same scenario when the helper knows only the noise but not the message.

A Reliable and Resilient Framework for Multi-UAV Mutual Localization

  • paper_url: http://arxiv.org/abs/2309.04270
  • repo_url: None
  • paper_authors: Zexin Fang, Bin Han, Hans D. Schotten
  • for: Presents a robust and secure framework for accurate and reliable mutual localization in multiple unmanned aerial vehicle (UAV) systems, addressing both localization accuracy and security threats.
  • methods: Combines Mobility Adaptive Gradient Descent (MAGD), which adapts the gradient descent algorithm to configuration changes in the mutual localization system, with Time-evolving Anomaly Detection (TAD), which cooperates with a reputation propagation (RP) scheme to detect and mitigate potential attacks by identifying UAVs with malicious data.
  • results: Numerical simulations show accurate and reliable mutual localization even in dynamic scenarios with configuration changes and under attack.
    Abstract This paper presents a robust and secure framework for achieving accurate and reliable mutual localization in multiple unmanned aerial vehicle (UAV) systems. Challenges of accurate localization and security threats are addressed, and the corresponding solutions are presented and assessed with numerical simulations. The proposed solution incorporates two key components: the Mobility Adaptive Gradient Descent (MAGD) and the Time-evolving Anomaly Detection (TAD). The MAGD adapts the gradient descent algorithm to handle the configuration changes in the mutual localization system, ensuring accurate localization in dynamic scenarios. The TAD cooperates with the reputation propagation (RP) scheme to detect and mitigate potential attacks by identifying UAVs with malicious data, enhancing the security and resilience of the mutual localization.
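
A bare-bones mutual-localization gradient descent on noisy pairwise range measurements (anchoring two UAVs to fix the reference frame); MAGD's mobility-adaptive step sizing and the TAD/RP security layer are omitted, so this only illustrates the underlying optimization.

```python
import numpy as np

rng = np.random.default_rng(0)
n_uav = 5
true_pos = rng.uniform(0, 50, size=(n_uav, 2))

# Noisy pairwise range measurements between UAVs.
D = np.linalg.norm(true_pos[:, None] - true_pos[None, :], axis=-1)
D += rng.normal(0, 0.1, D.shape)
np.fill_diagonal(D, 0.0)

# Gradient descent on squared range residuals; the first two UAVs are
# treated as anchors so translation/rotation ambiguity is removed.
est = true_pos + rng.normal(0, 5.0, true_pos.shape)
est[:2] = true_pos[:2]
lr = 0.01
for _ in range(500):
    diff = est[:, None] - est[None, :]
    dist = np.linalg.norm(diff, axis=-1) + np.eye(n_uav)  # avoid /0
    resid = dist - D
    grad = 4 * np.sum((resid / dist)[..., None] * diff, axis=1)
    grad[:2] = 0.0                      # keep anchors fixed
    est -= lr * grad

print(np.linalg.norm(est - true_pos, axis=1))   # per-UAV position error
```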

D2D-Assisted Mobile Edge Computing: Optimal Scheduling under Uncertain Processing Cycles and Intermittent Communications

  • paper_url: http://arxiv.org/abs/2309.04204
  • repo_url: None
  • paper_authors: Tao Deng, Zhanwei Yu, Di Yuan
  • for: Studies task-offloading scheduling in mobile edge computing (MEC) systems under uncertain processing cycles and intermittent communications.
  • methods: Derives a closed-form expression for the average offloading success probability in a device-to-device (D2D) assisted MEC system, formulates a task offloading maximization problem (TOMP) and proves it NP-hard; proposes a dynamic-programming-based task scheduling algorithm (TSDP) for symmetric problem instances and, via a reformulation, a repeated matching algorithm (RMA) for the general case.
  • results: Performance evaluations validate the accuracy of the closed-form expression via Monte Carlo simulations and demonstrate the effectiveness of the proposed algorithms.
    Abstract Mobile edge computing (MEC) has been regarded as a promising approach to deal with explosive computation requirements by enabling cloud computing capabilities at the edge of networks. Existing models of MEC impose some strong assumptions on the known processing cycles and unintermittent communications. However, practical MEC systems are constrained by various uncertainties and intermittent communications, rendering these assumptions impractical. In view of this, we investigate how to schedule task offloading in MEC systems with uncertainties. First, we derive a closed-form expression of the average offloading success probability in a device-to-device (D2D) assisted MEC system with uncertain computation processing cycles and intermittent communications. Then, we formulate a task offloading maximization problem (TOMP), and prove that the problem is NP-hard. For problem solving, if the problem instance exhibits a symmetric structure, we propose a task scheduling algorithm based on dynamic programming (TSDP). By solving this problem instance, we derive a bound to benchmark sub-optimal algorithm. For general scenarios, by reformulating the problem, we propose a repeated matching algorithm (RMA). Finally, in performance evaluations, we validate the accuracy of the closed-form expression of the average offloading success probability by Monte Carlo simulations, as well as the effectiveness of the proposed algorithms.

Spatial Modulation with Energy Detection: Diversity Analysis and Experimental Evaluation

  • paper_url: http://arxiv.org/abs/2309.04194
  • repo_url: None
  • paper_authors: Elio Faddoul, Ghassan M. Kraidy, Constantinos Psomas, Symeon Chatzinotas, Ioannis Krikidis
  • for: Proposes a non-coherent energy detection scheme for spatial modulation (SM) systems that can be implemented with low complexity, making it suitable for low-cost, low-powered devices.
  • methods: Derives an energy detection metric for a multi-antenna receiver based on the maximum-likelihood (ML) criterion, and develops an analytical framework for the SM symbol error rate at high signal-to-noise ratios using a biased pulse amplitude modulation.
  • results: Shows that the diversity order is proportional to half the number of receive antennas, that the SM energy detector can outperform the coherent ML receiver in certain scenarios (particularly with non-negative constellations), and validates the theory with error rate measurements from a software-defined-radio testbed.
    Abstract In this paper, we present a non-coherent energy detection scheme for spatial modulation (SM) systems. In particular, the use of SM is motivated by its low-complexity implementation in comparison to multiple-input multiple-output (MIMO) systems, achieved through the activation of a single antenna during transmission. Moreover, energy detection-based communications restrict the channel state information to the magnitude of the fading gains. This consideration makes the design applicable for low-cost low-powered devices since phase estimation and its associated circuitry are avoided. We derive an energy detection metric for a multi-antenna receiver based on the maximum-likelihood (ML) criterion. By considering a biased pulse amplitude modulation, we develop an analytical framework for the SM symbol error rate at high signal-to-noise ratios. Numerical results show that the diversity order is proportional to half the number of receive antennas; this result stems from having partial receiver channel knowledge. In addition, we compare the performance of the proposed scheme with that of the coherent ML receiver and show that the SM energy detector outperforms its coherent counterpart in certain scenarios, particularly when utilizing non-negative constellations. Ultimately, we implement an SM testbed using software-defined radio devices and provide experimental error rate measurements that validate our theoretical contribution.
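
A toy Monte Carlo of noncoherent SM detection from received energies with magnitude-only channel knowledge; the energy-matching metric below is a simple stand-in of our own, not the paper's exact ML rule.

```python
import numpy as np

rng = np.random.default_rng(0)
Nt, Nr, n_sym = 2, 4, 20000
amps = np.array([1.0, 2.0])          # non-negative PAM levels
sigma2 = 0.05                        # noise variance per receive antenna

H = (rng.normal(size=(Nr, Nt)) + 1j * rng.normal(size=(Nr, Nt))) / np.sqrt(2)
mag2 = np.abs(H) ** 2                # receiver knows fading magnitudes only

ant = rng.integers(0, Nt, n_sym)     # spatial bits: active antenna index
amp = rng.integers(0, 2, n_sym)      # modulation bits: PAM level index

hyps = [(j, a) for j in range(Nt) for a in range(len(amps))]
errors = 0
for k in range(n_sym):
    noise = np.sqrt(sigma2 / 2) * (rng.normal(size=Nr)
                                   + 1j * rng.normal(size=Nr))
    y = H[:, ant[k]] * amps[amp[k]] + noise
    e = np.abs(y) ** 2               # per-antenna received energies
    # Energy-matching metric over all (antenna, amplitude) hypotheses.
    scores = [np.sum((e - amps[a] ** 2 * mag2[:, j] - sigma2) ** 2)
              for j, a in hyps]
    j_hat, a_hat = hyps[int(np.argmin(scores))]
    errors += (j_hat != ant[k]) or (a_hat != amp[k])

print("symbol error rate:", errors / n_sym)
```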

Double RIS-Assisted MIMO Systems Over Spatially Correlated Rician Fading Channels and Finite Scatterers

  • paper_url: http://arxiv.org/abs/2309.04178
  • repo_url: None
  • paper_authors: Ha An Le, Trinh Van Chien, Van Duc Nguyen, Wan Choi
  • for: Investigates double-RIS-assisted MIMO communication systems over spatially correlated Rician fading channels with finite scatterers and a double-scattering link between the transceivers.
  • methods: Derives the statistical information of the aggregated channels in closed form; develops an efficient alternating optimization (AO) algorithm based on the alternating direction method of multipliers (ADMM) for the non-convex capacity maximization, and an end-to-end neural-network learning framework for reliability.
  • results: Jointly optimizing the active precoding and combining matrices at the transceivers and the passive beamforming at the double RISs improves capacity, while jointly training the encoder, decoder, and RIS phase shifters reduces the symbol error rate; numerical results verify the analysis.
    Abstract This paper investigates double RIS-assisted MIMO communication systems over Rician fading channels with finite scatterers, spatial correlation, and the existence of a double-scattering link between the transceiver. First, the statistical information is driven in closed form for the aggregated channels, unveiling various influences of the system and environment on the average channel power gains. Next, we study two active and passive beamforming designs corresponding to two objectives. The first problem maximizes channel capacity by jointly optimizing the active precoding and combining matrices at the transceivers and passive beamforming at the double RISs subject to the transmitting power constraint. In order to tackle the inherently non-convex issue, we propose an efficient alternating optimization algorithm (AO) based on the alternating direction method of multipliers (ADMM). The second problem enhances communication reliability by jointly training the encoder and decoder at the transceivers and the phase shifters at the RISs. Each neural network representing a system entity in an end-to-end learning framework is proposed to minimize the symbol error rate of the detected symbols by controlling the transceiver and the RISs phase shifts. Numerical results verify our analysis and demonstrate the superior improvements of phase shift designs to boost system performance.

Performance Analysis of OTSM under Hardware Impairments in Millimeter-Wave Vehicular Communication Networks

  • paper_url: http://arxiv.org/abs/2309.04161
  • repo_url: None
  • paper_authors: Abed Doosti-Aref, Sapta Girish Neelam, P. R. Sahu, Xu Zhu, Ertugrul Basar, Sinem Coleri, Huseyin Arslan
  • for: Studies the performance of OTSM-based homodyne transceivers under hardware impairments (HIs) such as IQ imbalance, DC offset, phase noise, power amplifier non-linearity, carrier frequency offset, and synchronization timing offset.
  • methods: Derives the discrete-time baseband signal model in vector form and the system input-output relations in the time, delay-time, and delay-sequency (DS) domains, with the HI parameters incorporated.
  • results: Even without HI compensation (HIC), OTSM outperforms the plain SC waveform and performs close to an uncompensated OTFS system; however, HIC is essential for OTSM systems operating in mmWave and higher frequency bands.
    Abstract Orthogonal time sequency multiplexing (OTSM) has been recently proposed as a single-carrier (SC) waveform offering similar bit error rate (BER) to multi-carrier orthogonal time frequency space (OTFS) modulation in doubly-spread channels under high mobilities; however, with much lower complexity making OTSM a promising candidate for low-power millimeter-wave (mmWave) vehicular communications in 6G wireless networks. In this paper, the performance of OTSM-based homodyne transceiver is explored under hardware impairments (HIs) including in-phase and quadrature imbalance (IQI), direct current offset (DCO), phase noise, power amplifier non-linearity, carrier frequency offset, and synchronization timing offset. First, the discrete-time baseband signal model is obtained in vector form under the mentioned HIs. Then, the system input-output relations are derived in time, delay-time, and delay-sequency (DS) domains in which the parameters of HIs are incorporated. Analytical studies demonstrate that noise stays white Gaussian and effective channel matrix is sparse in the DS domain under HIs. Also, DCO appears as a DC signal at receiver interfering with only the zero sequency over all delay taps in the DS domain; however, IQI redounds to self-conjugated fully-overlapping sequency interference. Simulation results reveal the fact that with no HI compensation (HIC), not only OTSM outperforms plain SC waveform but it performs close to uncompensated OTFS system; however, HIC is essentially needed for OTSM systems operating in mmWave and beyond frequency bands.

Sparse-DFT and WHT Precoding with Iterative Detection for Highly Frequency-Selective Channels

  • paper_url: http://arxiv.org/abs/2309.04149
  • repo_url: None
  • paper_authors: Roberto Bomfin, Marwa Chafii
  • for: To combat inter-symbol interference over highly frequency-selective channels with low-complexity precoding and iterative detection.
  • methods: Studies DFT, sparse DFT (SDFT), and sparse Walsh-Hadamard (SWH) precoding with an expectation propagation (EP) based receiver; provides a relatively low-complexity computation of the MAP detector for SWH together with feasible Log-MAP and Max-Log-MAP variants.
  • results: The proposed SWH-Max-Log-MAP achieves a better performance-complexity trade-off than the (S)DFT with EP-based receiver for QPSK and 16-QAM under highly selective channels, but its complexity is infeasible for higher QAM orders.
    Abstract Various precoders have been recently studied by the wireless community to combat the channel fading effects. Two prominent precoders are implemented with the discrete Fourier transform (DFT) and Walsh-Hadamard transform (WHT). The WHT precoder is implemented with less complexity since it does not need complex multiplications. Also, spreading can be applied sparsely to decrease the transceiver complexity, leading to sparse DFT (SDFT) and sparse Walsh-Hadamard (SWH). Another relevant topic is the design of iterative receivers that deal with inter-symbol-interference (ISI). In particular, many detectors based on expectation propagation (EP) have been proposed recently for channels with high levels of ISI. An alternative is the maximum a-posterior (MAP) detector, although it leads to unfeasible high complexity in many cases. In this paper, we provide a relatively low-complexity \textcolor{black}{computation} of the MAP detector for the SWH. We also propose two \textcolor{black}{feasible methods} based on the Log-MAP and Max-Log-MAP. Additionally, the DFT, SDFT and SWH precoders are compared using an EP-based receiver with one-tap FD equalization. Lastly, SWH-Max-Log-MAP is compared to the (S)DFT with EP-based receiver in terms of performance and complexity. The results show that the proposed SWH-Max-Log-MAP has a better performance and complexity trade-off for QPSK and 16-QAM under highly selective channels, but has unfeasible complexity for higher QAM orders.
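
A minimal end-to-end sketch of Walsh-Hadamard spreading over a frequency-selective channel (assuming SciPy and a cyclic prefix, so the channel acts circularly); the receiver here is plain one-tap FD equalization plus despreading, whereas the paper iterates detection with expectation propagation.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
N = 8
W = hadamard(N) / np.sqrt(N)            # orthonormal Walsh-Hadamard matrix

s = rng.integers(0, 2, N) * 2 - 1       # BPSK symbols
x = W @ s                               # WHT spreading (sign flips only,
                                        # no complex multiplications)

h = np.array([1.0, 0.5])                # 2-tap frequency-selective channel
Hf = np.fft.fft(h, N)
# Cyclic prefix assumed: the channel acts as a circular convolution.
y = np.fft.ifft(np.fft.fft(x) * Hf).real + 0.05 * rng.normal(size=N)

x_hat = np.fft.ifft(np.fft.fft(y) / Hf).real   # one-tap FD equalization
s_hat = np.sign(W.T @ x_hat)                   # despreading + hard decision
print(np.array_equal(s_hat, s.astype(float)))
```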

Gabor frames and higher dimensional boundaries in signal analysis on manifolds

  • paper_url: http://arxiv.org/abs/2309.04094
  • repo_url: None
  • paper_authors: Vasiliki Liontou, Matilde Marcolli
  • for: Constructs Gabor frames that encode local linearizations of signals detected on curved smooth manifolds of arbitrary dimension, with Gabor filters that can detect the presence of higher-dimensional boundaries in the manifold signal.
  • methods: A higher-dimensional generalization of the geometric setting developed for the study of signal analysis in the visual cortex.
  • results: Describes an application to configuration spaces in robotics with sharp constraints.
    Abstract We provide a construction of Gabor frames that encode local linearizations of a signal detected on a curved smooth manifold of arbitrary dimension, with Gabor filters that can detect the presence of higher-dimensional boundaries in the manifold signal. We describe an application in configuration spaces in robotics with sharp constrains. The construction is a higher-dimensional generalization of the geometric setting developed for the study of signal analysis in the visual cortex.

cs.SD - 2023-09-07

Causal Signal-Based DCCRN with Overlapped-Frame Prediction for Online Speech Enhancement

  • paper_url: http://arxiv.org/abs/2309.03684
  • repo_url: None
  • paper_authors: Julitta Bartolewska, Stanisław Kacprzak, Konrad Kowalczyk
  • for: To improve the quality and intelligibility of speech from a noisy single-channel microphone signal.
  • methods: A signal-based causal DCCRN with complex filtering of the signal, overlapped-frame prediction, causal convolutions and deconvolutions, and a modified loss function, reducing the required look-ahead and the number of network parameters.
  • results: The proposed model achieves similar or better performance than the original DCCRN across speech enhancement metrics, while reducing latency and the number of network parameters by around 30%.
    Abstract The aim of speech enhancement is to improve speech signal quality and intelligibility from a noisy microphone signal. In many applications, it is crucial to enable processing with small computational complexity and minimal requirements regarding access to future signal samples (look-ahead). This paper presents signal-based causal DCCRN that improves online single-channel speech enhancement by reducing the required look-ahead and the number of network parameters. The proposed modifications include complex filtering of the signal, application of overlapped-frame prediction, causal convolutions and deconvolutions, and modification of the loss function. Results of performed experiments indicate that the proposed model with overlapped signal prediction and additional adjustments, achieves similar or better performance than the original DCCRN in terms of various speech enhancement metrics, while it reduces the latency and network parameter number by around 30%.

Topological fingerprints for audio identification

  • paper_url: http://arxiv.org/abs/2309.03516
  • repo_url: https://github.com/wreise/top_audio_id
  • paper_authors: Wojciech Reise, Ximena Fernández, Maria Dominguez, Heather A. Harrington, Mariano Beguerisse-Díaz
  • for: 该研究提出了一种拓扑音频指纹方法,用于鲁棒地识别重复的音频轨迹。
  • methods: 该方法对音频信号的局部谱分解应用持续同调(persistent homology),所用的过滤立方复形由梅尔频谱图计算得到。
  • results: 实验结果表明,该算法可以准确地检测时间对齐的音频匹配,并在 topological distortions 场景下表现出优于现有方法。
    Abstract We present a topological audio fingerprinting approach for robustly identifying duplicate audio tracks. Our method applies persistent homology on local spectral decompositions of audio signals, using filtered cubical complexes computed from mel-spectrograms. By encoding the audio content in terms of local Betti curves, our topological audio fingerprints enable accurate detection of time-aligned audio matchings. Experimental results demonstrate the accuracy of our algorithm in the detection of tracks with the same audio content, even when subjected to various obfuscations. Our approach outperforms existing methods in scenarios involving topological distortions, such as time stretching and pitch shifting.
    摘要 我们提出了一种拓扑音频指纹方法,用于鲁棒地识别重复的音频轨道。我们的方法对音频信号的局部谱分解应用持续同调,所用的过滤立方复形由梅尔频谱图计算得到。通过用局部 Betti 曲线编码音频内容,我们的拓扑音频指纹能够准确检测时间对齐的音频匹配。实验结果表明,即使音频经过多种混淆处理,我们的算法也能准确检测具有相同音频内容的轨道。在涉及拓扑形变(如时间伸缩和音高移位)的场景中,我们的方法优于现有方法。
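
下面用一个假设性的 Python 草图演示"梅尔频谱图补丁 → 过滤立方复形持续同调 → 局部 Betti 曲线"这一流程,假设环境中可用 librosa 与 gudhi;补丁大小、步长与阈值数量均为示意性选择,并非论文参数。

```python
import numpy as np
import librosa   # assumed available for mel-spectrogram extraction
import gudhi     # assumed available for cubical persistent homology

def betti_curve(patch, thresholds, dim=0):
    """Betti curve: number of dim-homology classes of the filtered cubical
    complex that are alive at each filtration threshold."""
    cc = gudhi.CubicalComplex(top_dimensional_cells=patch)
    cc.persistence()
    bars = cc.persistence_intervals_in_dimension(dim)
    return np.array([sum(b <= t < d for b, d in bars) for t in thresholds])

y, sr = librosa.load(librosa.ex("trumpet"))
mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))
patches = [mel[:, i:i + 32] for i in range(0, mel.shape[1] - 32, 16)]
ts = np.linspace(mel.min(), mel.max(), 16)
# fingerprint = stacked local Betti curves over short-time patches
fingerprint = np.stack([betti_curve(p, ts) for p in patches])
```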

Simulating room transfer functions between transducers mounted on audio devices using a modified image source method

  • paper_url: http://arxiv.org/abs/2309.03486
  • repo_url: https://github.com/audiolabs/DEISM
  • paper_authors: Zeyu Xu, Adrian Herzog, Alexander Lodermeyer, Emanuël A. P. Habets, Albert G. Prinn
  • for: 该研究旨在扩展图像源方法(ISM),以纳入由设备引起的声学衍射效应。
  • methods: 研究使用球谐指向性系数(spherical harmonic directivity coefficients)扩展 ISM,以包含安装在音频设备上的声源与接收换能器所引起的声学衍射效应。
  • results: 研究表明,所提出的方法能够比现有图像源方法更准确地模拟房间传递函数,其精度与房间内设备的大小、形状、数量和位置有关。
    Abstract The image source method (ISM) is often used to simulate room acoustics due to its ease of use and computational efficiency. The standard ISM is limited to simulations of room impulse responses between point sources and omnidirectional receivers. In this work, the ISM is extended using spherical harmonic directivity coefficients to include acoustic diffraction effects due to source and receiver transducers mounted on physical devices, which are typically encountered in practical situations. The proposed method is verified using finite element simulations of various loudspeaker and microphone configurations in a rectangular room. It is shown that the accuracy of the proposed method is related to the sizes, shapes, number, and positions of the devices inside a room. A simplified version of the proposed method, which can significantly reduce computational effort, is also presented. The proposed method and its simplified version can simulate room transfer functions more accurately than currently available image source methods and can aid the development and evaluation of speech and acoustic signal processing algorithms, including speech enhancement, acoustic scene analysis, and acoustic parameter estimation.
    摘要 图像源方法(ISM)因其易用性和计算效率而常被用于模拟房间声学。标准 ISM 仅限于模拟点源与全指向接收器之间的房间冲激响应。本文利用球谐指向性系数对 ISM 进行扩展,以纳入实际场景中常见的、由安装在物理设备上的声源与接收换能器所引起的声学衍射效应。所提出的方法通过对矩形房间中多种扬声器和麦克风配置的有限元仿真进行了验证。结果表明,该方法的精度与房间内设备的大小、形状、数量和位置有关。文中还给出了一个可显著降低计算量的简化版本。所提出的方法及其简化版本能够比现有的图像源方法更准确地模拟房间传递函数,并可用于语音与声学信号处理算法(包括语音增强、声学场景分析和声学参数估计)的开发与评估。
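
作为背景,下面给出标准鞋盒房间图像源法(刚性墙、全指向换能器、频率无关反射系数 beta)的最小 Python 示意;论文正是在这类标准 ISM 之上引入球谐指向性。房间尺寸与声源、麦克风位置等数值均为假设。

```python
import numpy as np
from itertools import product

def ism_rir(room, src, mic, beta=0.8, order=8, fs=16000, c=343.0, n_taps=4096):
    """Minimal shoebox image-source RIR for an omni source/receiver.
    Image positions per axis k: x_img = +/-src[k] + 2*n*room[k]."""
    h = np.zeros(n_taps)
    for n in product(range(-order, order + 1), repeat=3):
        for eps in product((1, -1), repeat=3):
            img = [eps[k] * src[k] + 2 * n[k] * room[k] for k in range(3)]
            refl = sum(abs(2 * n[k] - (eps[k] < 0)) for k in range(3))  # wall hits
            d = np.linalg.norm(np.subtract(img, mic))
            tap = int(round(fs * d / c))
            if tap < n_taps:
                h[tap] += beta**refl / (4 * np.pi * d)
    return h

h = ism_rir(room=[5.0, 4.0, 3.0], src=[2.0, 3.0, 1.2], mic=[1.0, 1.0, 1.5])
```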

cs.CV - 2023-09-07

S-Adapter: Generalizing Vision Transformer for Face Anti-Spoofing with Statistical Tokens

  • paper_url: http://arxiv.org/abs/2309.04038
  • repo_url: None
  • paper_authors: Rizhao Cai, Zitong Yu, Chenqi Kong, Haoliang Li, Changsheng Chen, Yongjian Hu, Alex Kot
  • For: 检测人脸识别系统中的恶意欺骗企图(Face Anti-Spoofing, FAS)
  • Methods: 使用 Efficient Parameter Transfer Learning (EPTL) 范式,将预训练的 Vision Transformer 模型适配到 FAS 任务,并在训练中插入 adapter 模块,以实现跨域的欺骗检测。
  • Results: 提出了一种基于 Statistical Adapter (S-Adapter) 和 Token Style Regularization (TSR) 的方法,可以在零样本或少样本跨域测试中提升检测性能,超越了现有方法在多个基准上的表现。
    Abstract Face Anti-Spoofing (FAS) aims to detect malicious attempts to invade a face recognition system by presenting spoofed faces. State-of-the-art FAS techniques predominantly rely on deep learning models but their cross-domain generalization capabilities are often hindered by the domain shift problem, which arises due to different distributions between training and testing data. In this study, we develop a generalized FAS method under the Efficient Parameter Transfer Learning (EPTL) paradigm, where we adapt the pre-trained Vision Transformer models for the FAS task. During training, the adapter modules are inserted into the pre-trained ViT model, and the adapters are updated while other pre-trained parameters remain fixed. We find the limitations of previous vanilla adapters in that they are based on linear layers, which lack a spoofing-aware inductive bias and thus restrict the cross-domain generalization. To address this limitation and achieve cross-domain generalized FAS, we propose a novel Statistical Adapter (S-Adapter) that gathers local discriminative and statistical information from localized token histograms. To further improve the generalization of the statistical tokens, we propose a novel Token Style Regularization (TSR), which aims to reduce domain style variance by regularizing Gram matrices extracted from tokens across different domains. Our experimental results demonstrate that our proposed S-Adapter and TSR provide significant benefits in both zero-shot and few-shot cross-domain testing, outperforming state-of-the-art methods on several benchmark tests. We will release the source code upon acceptance.
    摘要 人脸反欺骗(FAS)旨在检测通过呈现伪造人脸来入侵人脸识别系统的恶意企图。目前最先进的 FAS 技术主要依赖深度学习模型,但由于训练数据与测试数据分布不同而产生的域偏移问题,它们的跨域泛化能力常常受限。在本研究中,我们在高效参数迁移学习(EPTL)范式下开发了一种泛化的 FAS 方法,将预训练的 Vision Transformer 模型适配到 FAS 任务。在训练过程中,我们向预训练的 ViT 模型中插入 adapter 模块,仅更新 adapter 而保持其他预训练参数固定。我们发现以往的普通 adapter 基于线性层,缺乏面向欺骗检测的归纳偏置,从而限制了跨域泛化。为解决这一局限并实现跨域泛化的 FAS,我们提出了一种新的统计 adapter(S-Adapter),它从局部化的 token 直方图中收集局部判别信息与统计信息。为进一步提高统计 token 的泛化能力,我们提出了一种新的 Token 风格正则化(TSR),通过对不同域的 token 提取的 Gram 矩阵进行正则化来减小域风格差异。实验结果表明,所提出的 S-Adapter 和 TSR 在零样本与少样本跨域测试中均带来显著收益,在多个基准测试上超越了现有最优方法。论文被接收后我们将发布源代码。
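
下面用 PyTorch 给出 Token 风格正则化思想的极简示意:比较两个域的 token 二阶统计量(Gram 矩阵)并惩罚其差异。Gram 矩阵的归一化方式与损失形式为本示例的假设,论文中的具体定义可能不同。

```python
import torch

def gram(tokens):
    """Gram matrix of a token sequence: (B, N, C) -> (B, C, C), N-normalized."""
    return torch.einsum("bnc,bnd->bcd", tokens, tokens) / tokens.shape[1]

def token_style_regularization(tokens_a, tokens_b):
    """Penalize style (second-order statistics) differences between domains."""
    return torch.mean((gram(tokens_a) - gram(tokens_b)) ** 2)

za = torch.randn(4, 197, 768)   # ViT tokens from domain A
zb = torch.randn(4, 197, 768)   # ViT tokens from domain B
loss_tsr = token_style_regularization(za, zb)
```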

Algebra and Geometry of Camera Resectioning

  • paper_url: http://arxiv.org/abs/2309.04028
  • repo_url: None
  • paper_authors: Erin Connelly, Timothy Duff, Jessie Loucks-Tavitas
  • for: 研究与相机重定标(camera resectioning)问题相关的代数簇。
  • methods: 使用 Gröbner 基技术刻画这些重定标簇的多重分次消没理想(multigraded vanishing ideals)。
  • results: derivation and re-interpretation of well-known results in geometric computer vision related to camera-point duality, as well as clarification of relationships between classical problems of optimal resectioning and triangulation, and a conjectured formula for the Euclidean distance degree of the resectioning variety.
    Abstract We study algebraic varieties associated with the camera resectioning problem. We characterize these resectioning varieties' multigraded vanishing ideals using Gr\"obner basis techniques. As an application, we derive and re-interpret celebrated results in geometric computer vision related to camera-point duality. We also clarify some relationships between the classical problems of optimal resectioning and triangulation, state a conjectural formula for the Euclidean distance degree of the resectioning variety, and discuss how this conjecture relates to the recently-resolved multiview conjecture.
    摘要 我们研究与相机重定标(camera resectioning)问题相关的代数簇。我们利用 Gröbner 基技术刻画这些重定标簇的多重分次消没理想。作为应用,我们推导并重新诠释了几何计算机视觉中与相机-点对偶相关的著名结果。我们还澄清了经典的最优重定标与三角化问题之间的一些关系,给出了重定标簇的欧氏距离度(Euclidean distance degree)的一个猜想公式,并讨论了该猜想与最近解决的多视图猜想之间的联系。

Improving the Accuracy of Beauty Product Recommendations by Assessing Face Illumination Quality

  • paper_url: http://arxiv.org/abs/2309.04022
  • repo_url: None
  • paper_authors: Parnian Afshar, Jenny Yeon, Andriy Levitskyy, Rahul Suresh, Amin Banitalebi-Dehkordi
  • for: 本研究旨在解决负责任的美妆产品推荐中的挑战,尤其是需要将产品颜色与个人肤色进行比较的场景,例如粉底和遮瑕产品。
  • methods: We introduce a machine learning framework for illumination assessment which classifies images into having either good or bad illumination condition. We then build an automatic user guidance tool which informs a user holding their camera if their illumination condition is good or bad.
  • results: Our work improves the shade recommendation for various foundation products by using a diverse synthetic dataset and a Convolutional Neural Network (CNN) for illumination assessment.
    Abstract We focus on addressing the challenges in responsible beauty product recommendation, particularly when it involves comparing the product's color with a person's skin tone, such as for foundation and concealer products. To make accurate recommendations, it is crucial to infer both the product attributes and the product specific facial features such as skin conditions or tone. However, while many product photos are taken under good light conditions, face photos are taken from a wide range of conditions. The features extracted using the photos from ill-illuminated environment can be highly misleading or even be incompatible to be compared with the product attributes. Hence bad illumination condition can severely degrade quality of the recommendation. We introduce a machine learning framework for illumination assessment which classifies images into having either good or bad illumination condition. We then build an automatic user guidance tool which informs a user holding their camera if their illumination condition is good or bad. This way, the user is provided with rapid feedback and can interactively control how the photo is taken for their recommendation. Only a few studies are dedicated to this problem, mostly due to the lack of dataset that is large, labeled, and diverse both in terms of skin tones and light patterns. Lack of such dataset leads to neglecting skin tone diversity. Therefore, We begin by constructing a diverse synthetic dataset that simulates various skin tones and light patterns in addition to an existing facial image dataset. Next, we train a Convolutional Neural Network (CNN) for illumination assessment that outperforms the existing solutions using the synthetic dataset. Finally, we analyze how the our work improves the shade recommendation for various foundation products.
    摘要 我们专注于处理美妆产品推荐中的挑战,特别是在比较产品的颜色与使用者的肤色时,例如粉底和遮瑕产品。为确保精准的推荐,需要推算产品特性和与产品相关的脸部特征,如皮肤状况或肤色。但是,许多产品照片在良好的照明条件下拍摄,而脸部照片则来自各种照明环境。由不良照明环境拍摄的照片中提取的特征可能极具误导性,甚至无法与产品特性进行比较;因此,糟糕的照明条件可能会严重降低推荐质量。我们介绍了一个用于照明评估的机器学习框架,可以将图像分类为照明条件好或坏。然后,我们建立了一个自动用户指引工具,在用户手持相机时告知其照明条件的好坏。这样,用户可以获得即时反馈,并互动地控制拍摄方式以获得推荐。专门针对这一问题的研究很少,主要是因为缺乏在肤色和光照模式上都大规模、有标注且多样的数据集;这种数据集的缺乏也导致肤色多样性被忽视。因此,我们首先在现有人脸图像数据集之外构建了一个多样的合成数据集,模拟各种肤色和光照模式。接着,我们使用该合成数据集训练了一个用于照明评估的卷积神经网络(CNN),其表现超越现有解决方案。最后,我们分析了我们的工作如何改善各种粉底产品的色号推荐。
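
作为示意,下面给出一个极简的二分类 CNN(好/坏照明)。论文并未在此给出具体网络结构,以下层数与通道数均为假设。

```python
import torch.nn as nn

# logits over {good, bad} illumination; depths/widths are placeholders
illumination_net = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2),
)
```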

Multimodal Transformer for Material Segmentation

  • paper_url: http://arxiv.org/abs/2309.04001
  • repo_url: None
  • paper_authors: Md Kaykobad Reza, Ashley Prater-Bennette, M. Salman Asif
  • for: 本研究的目的是提出一种新的多模态融合策略,以提高多模态分割任务的性能。
  • methods: 本研究提出了一种新的模型,名为多模态分割变换器(MMSFormer),其中包含一种新的融合策略,可以有效融合四种模态的不同组合:RGB、线偏振角(Angle of Linear Polarization, AoLP)、线偏振度(Degree of Linear Polarization, DoLP)和近红外(NIR)。
  • results: 在 MCubeS 数据集上,MMSFormer 达到了 52.05% 的 mIoU,超越了当前最优方法;例如,在检测 gravel(+10.4%)和 human(+9.1%)类别上有显著提升。
    Abstract Leveraging information across diverse modalities is known to enhance performance on multimodal segmentation tasks. However, effectively fusing information from different modalities remains challenging due to the unique characteristics of each modality. In this paper, we propose a novel fusion strategy that can effectively fuse information from different combinations of four different modalities: RGB, Angle of Linear Polarization (AoLP), Degree of Linear Polarization (DoLP) and Near-Infrared (NIR). We also propose a new model named Multi-Modal Segmentation Transformer (MMSFormer) that incorporates the proposed fusion strategy to perform multimodal material segmentation. MMSFormer achieves 52.05% mIoU outperforming the current state-of-the-art on Multimodal Material Segmentation (MCubeS) dataset. For instance, our method provides significant improvement in detecting gravel (+10.4%) and human (+9.1%) classes. Ablation studies show that different modules in the fusion block are crucial for overall model performance. Furthermore, our ablation studies also highlight the capacity of different input modalities to improve performance in the identification of different types of materials. The code and pretrained models will be made available at https://github.com/csiplab/MMSFormer.
    摘要 利用多模态信息可以提升多模态分割任务的性能。然而,由于各模态都有其独特的特征,有效融合不同模态的信息仍然具有挑战性。在本文中,我们提出了一种新的融合策略,可以有效融合四种模态的不同组合:RGB、线偏振角(AoLP)、线偏振度(DoLP)和近红外(NIR)。我们还提出了一个新模型,名为多模态分割变换器(MMSFormer),它采用所提出的融合策略进行多模态材质分割。MMSFormer 在多模态材质分割(MCubeS)数据集上达到 52.05% mIoU,超越了当前最优方法;例如,我们的方法在检测 gravel(+10.4%)和 human(+9.1%)类别上带来显著提升。消融实验表明,融合模块中的各个子模块对整体模型性能至关重要;消融实验还表明,不同的输入模态有助于提升对不同类型材质的识别性能。代码和预训练模型将在 https://github.com/csiplab/MMSFormer 上提供。
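
下面是多模态特征融合接口的极简 PyTorch 示意(按通道拼接后用 1x1 卷积混合)。这只是说明性草图,论文中的融合模块要复杂得多。

```python
import torch
import torch.nn as nn

class SimpleFusionBlock(nn.Module):
    """Fuse per-modality feature maps (e.g. RGB, AoLP, DoLP, NIR) by channel
    concatenation followed by a learned 1x1 mixing convolution."""
    def __init__(self, dim, n_modalities):
        super().__init__()
        self.mix = nn.Conv2d(dim * n_modalities, dim, kernel_size=1)

    def forward(self, feats):                  # list of (B, C, H, W) tensors
        return self.mix(torch.cat(feats, dim=1))

fuse = SimpleFusionBlock(dim=64, n_modalities=4)
out = fuse([torch.randn(2, 64, 32, 32) for _ in range(4)])  # (2, 64, 32, 32)
```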

Adapting Self-Supervised Representations to Multi-Domain Setups

  • paper_url: http://arxiv.org/abs/2309.03999
  • repo_url: None
  • paper_authors: Neha Kalibhat, Sam Sharpe, Jeremy Goodsitt, Bayan Bruss, Soheil Feizi
  • for: 提高自监督表示模型在多域设定下的泛化能力
  • methods: 提出一种通用、轻量级的领域解耦模块(Domain Disentanglement Module, DDM),可以插入任何自监督编码器,以提升在多个不同领域数据上的表示学习。在按自监督损失进行预训练时,DDM 将表示空间划分为域相关和域不变两部分,从而实现领域解耦;当领域标签不可用时,DDM 使用一种鲁棒的聚类方法来发现伪领域。
  • results: 与基线相比,使用 DDM 预训练的模型可将线性探测精度最多提升 3.5%,并对未见领域表现出 7.4% 的泛化性能提升。
    Abstract Current state-of-the-art self-supervised approaches, are effective when trained on individual domains but show limited generalization on unseen domains. We observe that these models poorly generalize even when trained on a mixture of domains, making them unsuitable to be deployed under diverse real-world setups. We therefore propose a general-purpose, lightweight Domain Disentanglement Module (DDM) that can be plugged into any self-supervised encoder to effectively perform representation learning on multiple, diverse domains with or without shared classes. During pre-training according to a self-supervised loss, DDM enforces a disentanglement in the representation space by splitting it into a domain-variant and a domain-invariant portion. When domain labels are not available, DDM uses a robust clustering approach to discover pseudo-domains. We show that pre-training with DDM can show up to 3.5% improvement in linear probing accuracy on state-of-the-art self-supervised models including SimCLR, MoCo, BYOL, DINO, SimSiam and Barlow Twins on multi-domain benchmarks including PACS, DomainNet and WILDS. Models trained with DDM show significantly improved generalization (7.4%) to unseen domains compared to baselines. Therefore, DDM can efficiently adapt self-supervised encoders to provide high-quality, generalizable representations for diverse multi-domain data.
    摘要 当前最先进的自监督方法在单一领域上训练时有效,但在未见过的领域上泛化能力有限。我们发现这些模型即使在多个领域的混合数据上训练,泛化表现依然不佳,因此难以部署到多样化的真实场景中。为此,我们提出一种通用、轻量级的领域解耦模块(DDM),它可以插入任何自监督编码器,以便在多个不同领域(无论是否共享类别)上有效进行表示学习。在按照自监督损失进行预训练时,DDM 将表示空间划分为域相关部分和域不变部分,从而强制实现解耦。当领域标签不可用时,DDM 使用一种鲁棒的聚类方法来发现伪领域。我们表明,使用 DDM 进行预训练,可以使 SimCLR、MoCo、BYOL、DINO、SimSiam 和 Barlow Twins 等最先进的自监督模型在 PACS、DomainNet 和 WILDS 等多域基准上的线性探测精度最多提升 3.5%。与基线相比,使用 DDM 训练的模型对未见领域的泛化能力显著提升(7.4%)。因此,DDM 能够高效地使自监督编码器为多样的多域数据提供高质量、可泛化的表示。
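
下面用 PyTorch 给出"将嵌入划分为域不变与域相关两部分并鼓励二者相互独立"这一思想的极简示意;头部结构与独立性惩罚均为本示例的假设,论文中 DDM 的细节可能不同。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainDisentangle(nn.Module):
    """Split an embedding into domain-invariant and domain-variant parts and
    penalize their correlation (a hypothetical sketch of the idea)."""
    def __init__(self, dim):
        super().__init__()
        self.inv_head = nn.Linear(dim, dim // 2)   # domain-invariant portion
        self.var_head = nn.Linear(dim, dim // 2)   # domain-variant portion

    def forward(self, z):                          # z: (B, dim)
        z_inv, z_var = self.inv_head(z), self.var_head(z)
        cos = (F.normalize(z_inv, dim=1) * F.normalize(z_var, dim=1)).sum(1)
        return z_inv, z_var, cos.pow(2).mean()     # independence penalty
```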

CDFSL-V: Cross-Domain Few-Shot Learning for Videos

  • paper_url: http://arxiv.org/abs/2309.03989
  • repo_url: None
  • paper_authors: Sarinda Samarasinghe, Mamshad Nayeem Rizve, Navid Kardan, Mubarak Shah
  • for: 这个论文旨在解决跨域少样本视频动作识别问题,而现有方法都依赖于大量标注的同域数据集。
  • methods: 该论文提出了一种新的跨域少样本视频动作识别方法,利用自监督学习和课程学习来平衡源域和目标域的信息。具体来说,该方法使用基于掩码自编码器的自监督训练目标从源和目标数据中学习;随后,一个渐进式课程在源数据集的判别信息与目标域的通用信息之间取得平衡。
  • results: 该论文在多个具有挑战性的基准数据集上进行了评估,证明了该方法优于现有的跨域少样本学习方法。代码可以在 https://github.com/Sarinda251/CDFSL-V 中找到。
    Abstract Few-shot video action recognition is an effective approach to recognizing new categories with only a few labeled examples, thereby reducing the challenges associated with collecting and annotating large-scale video datasets. Existing methods in video action recognition rely on large labeled datasets from the same domain. However, this setup is not realistic as novel categories may come from different data domains that may have different spatial and temporal characteristics. This dissimilarity between the source and target domains can pose a significant challenge, rendering traditional few-shot action recognition techniques ineffective. To address this issue, in this work, we propose a novel cross-domain few-shot video action recognition method that leverages self-supervised learning and curriculum learning to balance the information from the source and target domains. To be particular, our method employs a masked autoencoder-based self-supervised training objective to learn from both source and target data in a self-supervised manner. Then a progressive curriculum balances learning the discriminative information from the source dataset with the generic information learned from the target domain. Initially, our curriculum utilizes supervised learning to learn class discriminative features from the source data. As the training progresses, we transition to learning target-domain-specific features. We propose a progressive curriculum to encourage the emergence of rich features in the target domain based on class discriminative supervised features in the source domain. We evaluate our method on several challenging benchmark datasets and demonstrate that our approach outperforms existing cross-domain few-shot learning techniques. Our code is available at https://github.com/Sarinda251/CDFSL-V
    摘要 少样本视频动作识别只需少量标注样本即可识别新类别,从而降低收集和标注大规模视频数据集的难度。现有的视频动作识别方法依赖于来自同一领域的大量标注数据。然而,这种设定并不现实,因为新类别可能来自不同的数据域,具有不同的空间和时间特征;源域与目标域之间的这种差异会构成重大挑战,使传统的少样本动作识别技术失效。为解决这个问题,本文提出了一种新的跨域少样本视频动作识别方法,利用自监督学习和课程学习来平衡源域和目标域的信息。具体来说,我们的方法使用基于掩码自编码器的自监督训练目标,以自监督方式同时从源数据和目标数据中学习;然后,一个渐进式课程在学习源数据集的类别判别信息与目标域的通用信息之间取得平衡。训练初期,课程利用监督学习从源数据中学习类别判别特征;随着训练进行,逐渐过渡到学习目标域特有的特征。我们提出这种渐进式课程,以便在源域类别判别监督特征的基础上,促进目标域中丰富特征的涌现。我们在多个具有挑战性的基准数据集上评估了该方法,证明其优于现有的跨域少样本学习技术。代码可在 https://github.com/Sarinda251/CDFSL-V 获取。
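
下面用几行 Python 说明"渐进式课程"的调度思想:训练权重从源域监督目标逐渐转向目标域目标。线性调度只是示意性假设,论文中的课程设计可能不同。

```python
def curriculum_weight(step, total_steps):
    """Progressive curriculum: start with supervised, class-discriminative
    learning on the source domain, then gradually shift toward target-domain
    objectives. A linear schedule is an illustrative choice only."""
    lam = min(1.0, step / max(1, total_steps))
    return 1.0 - lam, lam   # (source supervised weight, target weight)

# usage: ws, wt = curriculum_weight(step, total); loss = ws*loss_src + wt*loss_tgt
```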

Separable Self and Mixed Attention Transformers for Efficient Object Tracking

  • paper_url: http://arxiv.org/abs/2309.03979
  • repo_url: https://github.com/goutamyg/smat
  • paper_authors: Goutam Yelluru Gopal, Maria A. Amer
  • for: 这篇论文提出了一种高效的、基于可分离自注意力和混合注意力变换器的轻量级目标跟踪架构。
  • methods: 该提案使用分解自我和混合注意力转换器来融合模板和搜索区域进行特征提取,并使用高效自注意力块进行全局Contextual模型化以提高目标状态估计的准确性。
  • results: 相比之下,该提案在GOT10k、TrackingNet、LaSOT、NfS30、UAV123和AVisT等 dataset上的表现都高于相关的轻量级跟踪器,而且在CPU上运行时速度达37帧,在GPU上运行时速度达158帧,参数数量为3.8亿。例如,在GOT10k-test上,它与E.T.Track和MixFormerV2-S的相似跟踪器相比,在AO metric上表现出了7.9%和5.8%的显著优势。
    Abstract The deployment of transformers for visual object tracking has shown state-of-the-art results on several benchmarks. However, the transformer-based models are under-utilized for Siamese lightweight tracking due to the computational complexity of their attention blocks. This paper proposes an efficient self and mixed attention transformer-based architecture for lightweight tracking. The proposed backbone utilizes the separable mixed attention transformers to fuse the template and search regions during feature extraction to generate superior feature encoding. Our prediction head performs global contextual modeling of the encoded features by leveraging efficient self-attention blocks for robust target state estimation. With these contributions, the proposed lightweight tracker deploys a transformer-based backbone and head module concurrently for the first time. Our ablation study testifies to the effectiveness of the proposed combination of backbone and head modules. Simulations show that our Separable Self and Mixed Attention-based Tracker, SMAT, surpasses the performance of related lightweight trackers on GOT10k, TrackingNet, LaSOT, NfS30, UAV123, and AVisT datasets, while running at 37 fps on CPU, 158 fps on GPU, and having 3.8M parameters. For example, it significantly surpasses the closely related trackers E.T.Track and MixFormerV2-S on GOT10k-test by a margin of 7.9% and 5.8%, respectively, in the AO metric. The tracker code and model is available at https://github.com/goutamyg/SMAT
    摘要 将变换器用于视觉目标跟踪已在多个基准上取得了最先进的结果。然而,由于注意力模块的计算复杂度,基于变换器的模型在孪生式轻量级跟踪中未被充分利用。本文提出了一种高效的、基于自注意力与混合注意力变换器的轻量级跟踪架构。所提出的主干网络利用可分离的混合注意力变换器,在特征提取阶段融合模板与搜索区域,生成更优的特征编码。我们的预测头利用高效的自注意力模块对编码特征进行全局上下文建模,以实现鲁棒的目标状态估计。凭借这些贡献,所提出的轻量级跟踪器首次同时采用基于变换器的主干和头部模块。消融实验证明了所提出的主干与头部模块组合的有效性。实验表明,我们的可分离自注意力与混合注意力跟踪器 SMAT 在 GOT10k、TrackingNet、LaSOT、NfS30、UAV123 和 AVisT 数据集上超越了相关轻量级跟踪器的性能,同时在 CPU 上达到 37 fps、在 GPU 上达到 158 fps,参数量仅 3.8M。例如,在 GOT10k-test 上,它在 AO 指标上分别以 7.9% 和 5.8% 的优势显著超越了密切相关的跟踪器 E.T.Track 和 MixFormerV2-S。跟踪器代码和模型可在 https://github.com/goutamyg/SMAT 获取。
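
下面给出"混合注意力"思想的通用 PyTorch 草图:将模板与搜索区域的 token 拼接后做联合注意力,从而在特征提取阶段融合两个区域。论文中的可分离形式对此做了进一步的高效分解,此处不作展示。

```python
import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    """Joint (mixed) attention over concatenated template and search tokens,
    fusing the two regions during feature extraction. Generic sketch only."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, template, search):       # (B, Nt, C), (B, Ns, C)
        tokens = torch.cat([template, search], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)
        nt = template.shape[1]
        return fused[:, :nt], fused[:, nt:]    # fused template / search feats
```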

Improving Resnet-9 Generalization Trained on Small Datasets

  • paper_url: http://arxiv.org/abs/2309.03965
  • repo_url: https://github.com/omarawad2/HAET2021_Huawei
  • paper_authors: Omar Mohamed Awad, Habib Hajimolahoseini, Michael Lim, Gurpreet Gosal, Walid Ahmed, Yang Liu, Gordon Deng
  • for: 本文提出了在 ICLR "Hardware Aware Efficient Training" 竞赛中获得第一名的方法,目标是在 10 分钟内在图像分类任务上达到尽可能高的精度,训练数据为从 CIFAR-10 中随机抽取的 5000 张图像。
  • methods: 本文使用了一系列提升 ResNet-9 泛化能力的技术,包括:锐度感知优化、标签平滑、梯度中心化、输入块白化,以及基于元学习的训练。
  • results: 实验表明,ResNet-9 仅在 CIFAR-10 数据集 10% 的子集上训练,即可在 10 分钟内达到 88% 的精度。
    Abstract This paper presents our proposed approach that won the first prize at the ICLR competition on Hardware Aware Efficient Training. The challenge is to achieve the highest possible accuracy in an image classification task in less than 10 minutes. The training is done on a small dataset of 5000 images picked randomly from CIFAR-10 dataset. The evaluation is performed by the competition organizers on a secret dataset with 1000 images of the same size. Our approach includes applying a series of techniques for improving the generalization of ResNet-9 including: sharpness aware optimization, label smoothing, gradient centralization, input patch whitening as well as metalearning based training. Our experiments show that the ResNet-9 can achieve the accuracy of 88% while trained only on a 10% subset of CIFAR-10 dataset in less than 10 minutes.
    摘要 本文介绍了我们在 ICLR "Hardware Aware Efficient Training" 竞赛中获得第一名的方法。竞赛要求在 10 分钟内在图像分类任务上达到尽可能高的精度。训练在从 CIFAR-10 数据集中随机选取的 5000 张图像的小数据集上进行;评估由竞赛组织者在一个包含 1000 张同尺寸图像的保密数据集上完成。我们的方法包括一系列提升 ResNet-9 泛化能力的技术:锐度感知优化、标签平滑、梯度中心化、输入块白化,以及基于元学习的训练。实验表明,ResNet-9 仅在 CIFAR-10 的 10% 子集上训练,即可在 10 分钟内达到 88% 的精度。
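
上文提到的"梯度中心化"定义明确且易于实现:在优化器更新前,将每个多维权重梯度减去其按输出通道计算的均值。下面是该技巧的通用 PyTorch 写法(用法见注释),并非竞赛代码本身。

```python
import torch

def centralize_gradients(model):
    """Gradient Centralization: subtract the per-filter mean from each
    multi-dimensional weight gradient before the optimizer step."""
    for p in model.parameters():
        if p.grad is not None and p.grad.dim() > 1:
            dims = tuple(range(1, p.grad.dim()))
            p.grad.sub_(p.grad.mean(dim=dims, keepdim=True))

# usage inside a training loop:
# loss.backward(); centralize_gradients(model); optimizer.step()
```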

REALM: Robust Entropy Adaptive Loss Minimization for Improved Single-Sample Test-Time Adaptation

  • paper_url: http://arxiv.org/abs/2309.03964
  • repo_url: None
  • paper_authors: Skyler Seto, Barry-John Theobald, Federico Danieli, Navdeep Jaitly, Dan Busbridge
  • For: The paper aims to mitigate performance loss due to distribution shifts between train and test data in online fully-test-time adaptation (F-TTA), without access to the training data and without knowledge of the model training procedure.
  • Methods: The paper proposes a general framework called Robust Entropy Adaptive Loss Minimization (REALM), inspired by self-paced learning and robust loss functions, to improve the robustness of F-TTA to noisy samples.
  • Results: The proposed approach achieves better adaptation accuracy than previous approaches throughout the adaptation process on corruptions of CIFAR-10 and ImageNet-1K, demonstrating its effectiveness.
    Abstract Fully-test-time adaptation (F-TTA) can mitigate performance loss due to distribution shifts between train and test data (1) without access to the training data, and (2) without knowledge of the model training procedure. In online F-TTA, a pre-trained model is adapted using a stream of test samples by minimizing a self-supervised objective, such as entropy minimization. However, models adapted with online using entropy minimization, are unstable especially in single sample settings, leading to degenerate solutions, and limiting the adoption of TTA inference strategies. Prior works identify noisy, or unreliable, samples as a cause of failure in online F-TTA. One solution is to ignore these samples, which can lead to bias in the update procedure, slow adaptation, and poor generalization. In this work, we present a general framework for improving robustness of F-TTA to these noisy samples, inspired by self-paced learning and robust loss functions. Our proposed approach, Robust Entropy Adaptive Loss Minimization (REALM), achieves better adaptation accuracy than previous approaches throughout the adaptation process on corruptions of CIFAR-10 and ImageNet-1K, demonstrating its effectiveness.
    摘要 全测试时自适应(F-TTA)可以在(1)无法访问训练数据、(2)不了解模型训练过程的条件下,缓解训练数据与测试数据之间分布偏移造成的性能损失。在线 F-TTA 中,预训练模型通过最小化某种自监督目标(如熵最小化)来利用测试样本流进行自适应。然而,以在线熵最小化进行自适应的模型并不稳定,尤其是在单样本设定下,会导致退化解,限制了 TTA 推理策略的应用。先前的工作指出,含噪(不可靠)样本是在线 F-TTA 失败的原因之一。一种解决方案是忽略这些样本,但这可能导致更新过程的偏差、自适应变慢以及泛化能力下降。在本工作中,我们受自步学习(self-paced learning)和鲁棒损失函数的启发,提出了一个提升 F-TTA 对噪声样本鲁棒性的通用框架。我们提出的方法,鲁棒熵自适应损失最小化(REALM),在 CIFAR-10 和 ImageNet-1K 的损坏版本上的整个自适应过程中,都取得了优于以往方法的自适应精度,证明了其有效性。
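
下面给出普通熵最小化 TTA 单步更新的 PyTorch 示意,即 REALM 所要加固的基础目标:最小化测试批次预测分布的平均熵。REALM 以鲁棒、自步的样本加权替代此处的简单求均,具体形式见论文。

```python
import torch
import torch.nn.functional as F

def tta_entropy_step(model, optimizer, x):
    """One online fully-test-time adaptation step by entropy minimization
    (the vanilla objective that REALM makes robust to noisy samples)."""
    logits = model(x)
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return logits.detach()
```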

SimpleNeRF: Regularizing Sparse Input Neural Radiance Fields with Simpler Solutions

  • paper_url: http://arxiv.org/abs/2309.03955
  • repo_url: None
  • paper_authors: Nagabhushan Somraj, Adithyan Karanayil, Rajiv Soundararajan
  • for: 这个论文主要研究如何借助增强模型来训练稀疏视图场景中的 NeRF,以实现少视图(few-shot)渲染。
  • methods: 作者设计了偏向更简单解的增强模型来辅助 NeRF 训练,在训练过程中探究位置编码和视图相关辐射的作用,并用这些更简单模型估计的深度来监督 NeRF 的深度估计。
  • results: 作者通过使用这些增强模型和采样方法,在两个流行的数据集上实现了 state-of-the-art 的视图合成性能。
    Abstract Neural Radiance Fields (NeRF) show impressive performance for the photorealistic free-view rendering of scenes. However, NeRFs require dense sampling of images in the given scene, and their performance degrades significantly when only a sparse set of views are available. Researchers have found that supervising the depth estimated by the NeRF helps train it effectively with fewer views. The depth supervision is obtained either using classical approaches or neural networks pre-trained on a large dataset. While the former may provide only sparse supervision, the latter may suffer from generalization issues. As opposed to the earlier approaches, we seek to learn the depth supervision by designing augmented models and training them along with the NeRF. We design augmented models that encourage simpler solutions by exploring the role of positional encoding and view-dependent radiance in training the few-shot NeRF. The depth estimated by these simpler models is used to supervise the NeRF depth estimates. Since the augmented models can be inaccurate in certain regions, we design a mechanism to choose only reliable depth estimates for supervision. Finally, we add a consistency loss between the coarse and fine multi-layer perceptrons of the NeRF to ensure better utilization of hierarchical sampling. We achieve state-of-the-art view-synthesis performance on two popular datasets by employing the above regularizations. The source code for our model can be found on our project page: https://nagabhushansn95.github.io/publications/2023/SimpleNeRF.html
    摘要 神经辐射场(NeRF)在场景的真实感自由视角渲染方面表现出色。然而,NeRF 需要对给定场景进行稠密的图像采样,当只有稀疏的视图可用时,其性能会显著下降。研究人员发现,对 NeRF 估计的深度进行监督,有助于在较少视图下有效训练。深度监督既可以通过经典方法获得,也可以来自在大规模数据集上预训练的神经网络;前者可能只提供稀疏的监督,后者则可能存在泛化问题。与以往方法不同,我们通过设计增强模型并与 NeRF 联合训练来学习深度监督。我们通过探究位置编码和视图相关辐射在少样本 NeRF 训练中的作用,设计了鼓励更简单解的增强模型,并用这些更简单模型估计的深度来监督 NeRF 的深度估计。由于增强模型在某些区域可能不准确,我们设计了一种机制,只选择可靠的深度估计用于监督。最后,我们在 NeRF 的粗略与精细多层感知机之间加入一致性损失,以更好地利用分层采样。通过上述正则化,我们在两个流行的数据集上实现了最先进的视图合成性能。我们的模型源代码见项目页面:https://nagabhushansn95.github.io/publications/2023/SimpleNeRF.html
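
文中反复讨论的"位置编码"是 NeRF 的标准组件,定义明确:gamma(x) = (sin(2^k * pi * x), cos(2^k * pi * x)),k = 0, ..., L-1。下面是其 PyTorch 示意;SimpleNeRF 的增强模型正是通过削减此类编码来偏向更简单、更平滑的解。

```python
import math
import torch

def positional_encoding(x, num_freqs=10):
    """gamma(x) = (sin(2^k * pi * x), cos(2^k * pi * x)), k = 0..num_freqs-1."""
    freqs = (2.0 ** torch.arange(num_freqs)) * math.pi
    angles = x[..., None] * freqs              # (..., D, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(-2)                     # (..., D * 2 * num_freqs)

pts = torch.rand(4, 3)                         # sampled 3D points
feats = positional_encoding(pts)               # shape (4, 60)
```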

A-Eval: A Benchmark for Cross-Dataset Evaluation of Abdominal Multi-Organ Segmentation

  • paper_url: http://arxiv.org/abs/2309.03906
  • repo_url: https://github.com/uni-medical/a-eval
  • paper_authors: Ziyan Huang, Zhongying Deng, Jin Ye, Haoyu Wang, Yanzhou Su, Tianbin Li, Hui Sun, Junlong Cheng, Jianpin Chen, Junjun He, Yun Gu, Shaoting Zhang, Lixu Gu, Yu Qiao
  • for: 本研究旨在检验在多个数据集上训练的腹部多器官分割模型能否泛化,以及如何进一步提升其泛化能力。
  • methods: 本研究使用了四个大规模公共数据集:FLARE22、AMOS、WORD 和 TotalSegmentator,每个数据集都提供了丰富的 Abdomen 多器官分割标签。为了评估,我们将这些数据集的验证集与 BTCV 数据集的训练集组合成一个可靠的 Benchmark,包括五个不同的数据集。
  • results: 我们通过使用不同的数据使用场景(即在单个数据集上独立训练、使用 pseudo-labeling 技术、混合不同Modalities 和在所有可用数据集上进行联合训练)来评估不同的模型是否能够通用。此外,我们还研究了模型的大小对 cross-dataset 通用性的影响。通过这些分析,我们强调了有效地使用数据的重要性,并提供了训练策略的有价值指导。
    Abstract Although deep learning have revolutionized abdominal multi-organ segmentation, models often struggle with generalization due to training on small, specific datasets. With the recent emergence of large-scale datasets, some important questions arise: \textbf{Can models trained on these datasets generalize well on different ones? If yes/no, how to further improve their generalizability?} To address these questions, we introduce A-Eval, a benchmark for the cross-dataset Evaluation ('Eval') of Abdominal ('A') multi-organ segmentation. We employ training sets from four large-scale public datasets: FLARE22, AMOS, WORD, and TotalSegmentator, each providing extensive labels for abdominal multi-organ segmentation. For evaluation, we incorporate the validation sets from these datasets along with the training set from the BTCV dataset, forming a robust benchmark comprising five distinct datasets. We evaluate the generalizability of various models using the A-Eval benchmark, with a focus on diverse data usage scenarios: training on individual datasets independently, utilizing unlabeled data via pseudo-labeling, mixing different modalities, and joint training across all available datasets. Additionally, we explore the impact of model sizes on cross-dataset generalizability. Through these analyses, we underline the importance of effective data usage in enhancing models' generalization capabilities, offering valuable insights for assembling large-scale datasets and improving training strategies. The code and pre-trained models are available at \href{https://github.com/uni-medical/A-Eval}{https://github.com/uni-medical/A-Eval}.
    摘要 尽管深度学习已经革新了腹部多器官分割,但由于模型往往在小而特定的数据集上训练,其泛化能力常常不足。随着大规模数据集的出现,一些重要问题随之产生:在这些数据集上训练的模型能否在其他数据集上良好泛化?无论答案如何,又该如何进一步提升其泛化能力?为回答这些问题,我们提出了 A-Eval,一个用于腹部多器官分割跨数据集评估的基准。我们采用四个大规模公开数据集的训练集:FLARE22、AMOS、WORD 和 TotalSegmentator,它们均为腹部多器官分割提供了丰富的标注。在评估方面,我们将这些数据集的验证集与 BTCV 数据集的训练集相结合,构成包含五个不同数据集的稳健基准。我们使用 A-Eval 基准评估各种模型的泛化能力,重点考察多种数据使用场景:在单个数据集上独立训练、通过伪标注利用无标签数据、混合不同模态,以及在所有可用数据集上联合训练。此外,我们还探讨了模型规模对跨数据集泛化能力的影响。通过这些分析,我们强调了有效利用数据对提升模型泛化能力的重要性,为构建大规模数据集和改进训练策略提供了有价值的见解。代码和预训练模型见 https://github.com/uni-medical/A-Eval。

Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis

  • paper_url: http://arxiv.org/abs/2309.03904
  • repo_url: https://github.com/zhujiapeng/aurora
  • paper_authors: Jiapeng Zhu, Ceyuan Yang, Kecheng Zheng, Yinghao Xu, Zifan Shi, Yujun Shen
  • for: 文章旨在提出一种基于生成对抗网络(GAN)的文本条件图像生成模型,以便在大规模模型训练中减少计算资源的消耗。
  • methods: 该模型采用一组专家来学习特征处理,并与一个稀疏路由器相结合,为每个特征点选择最合适的专家。路由器根据融合文本信息的全局隐码动态做出决策,以将采样随机性和文本条件忠实地传递到最终生成的图像中。
  • results: 在 64x64 图像分辨率下,使用 LAION2B-en 和 COYO-700M 训练的模型在 MS COCO 上达到了 6.2 的零样本 FID。
    Abstract Due to the difficulty in scaling up, generative adversarial networks (GANs) seem to be falling from grace on the task of text-conditioned image synthesis. Sparsely-activated mixture-of-experts (MoE) has recently been demonstrated as a valid solution to training large-scale models with limited computational resources. Inspired by such a philosophy, we present Aurora, a GAN-based text-to-image generator that employs a collection of experts to learn feature processing, together with a sparse router to help select the most suitable expert for each feature point. To faithfully decode the sampling stochasticity and the text condition to the final synthesis, our router adaptively makes its decision by taking into account the text-integrated global latent code. At 64x64 image resolution, our model trained on LAION2B-en and COYO-700M achieves 6.2 zero-shot FID on MS COCO. We release the code and checkpoints to facilitate the community for further development.
    摘要 由于难以扩展规模,生成对抗网络(GAN)在文本条件图像生成任务上似乎正逐渐失势。另一方面,稀疏激活的专家混合(MoE)最近被证明是在有限计算资源下训练大规模模型的有效方案。受此启发,我们提出了 Aurora,一个基于 GAN 的文本到图像生成器,它采用一组专家来学习特征处理,并配以一个稀疏路由器,为每个特征点选择最合适的专家。为了将采样随机性和文本条件忠实地解码到最终合成结果中,路由器会结合融合文本信息的全局隐码来动态做出决策。在 64x64 图像分辨率下,我们在 LAION2B-en 和 COYO-700M 上训练的模型在 MS COCO 上达到了 6.2 的零样本 FID。我们发布了代码和检查点,以便社区进一步开发。
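
下面是稀疏激活 MoE 中 top-1 路由思想的极简 PyTorch 示意:路由器为每个 token 选择一个专家,并用路由概率缩放专家输出以便梯度回传到路由器。这只是通用草图,并非 Aurora 的具体模块。

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Sparsely-activated mixture-of-experts with top-1 routing per token."""
    def __init__(self, dim, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x):                      # x: (tokens, dim)
        gate = self.router(x).softmax(-1)      # routing probabilities
        idx = gate.argmax(-1)                  # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                # scale by the gate so the router still receives gradient
                out[mask] = expert(x[mask]) * gate[mask, e:e + 1]
        return out
```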

Tracking Anything with Decoupled Video Segmentation

  • paper_url: http://arxiv.org/abs/2309.03903
  • repo_url: https://github.com/hkchengrex/Tracking-Anything-with-DEVA
  • paper_authors: Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee
  • for: 这篇论文旨在解决视频分割任务中的数据缺乏问题,使得扩展到新的视频分割任务更加困难。
  • methods: 该论文提出了一种分离视频分割方法(DEVA),它包括任务特定的图像级别分割和任务和类型无关的双向时间卷积。
  • results: 作者在多个数据缺乏任务中表示了这种方法的优势,包括大词汇视频精确分割、开放世界视频分割、引用视频分割和无监督视频物体分割。
    Abstract Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA
    摘要 视频分割的训练数据标注成本高昂,这阻碍了端到端算法向新的视频分割任务扩展,尤其是在大词汇设定下。为了在不针对每个任务训练视频数据的情况下"跟踪任何物体",我们开发了一种解耦的视频分割方法(DEVA),它由任务特定的图像级分割与任务无关、类别无关的双向时间传播组成。得益于这种设计,我们只需要针对目标任务的图像级模型(训练成本更低),以及一个只需训练一次、可跨任务泛化的通用时间传播模型。为了有效结合这两个模块,我们使用双向传播对来自不同帧的分割假设进行(半)在线融合,从而生成连贯的分割结果。我们表明,在大词汇视频全景分割、开放世界视频分割、指代视频分割和无监督视频目标分割等多个数据匮乏任务中,这种解耦的表述优于端到端方法。代码见:https://hkchengrex.github.io/Tracking-Anything-with-DEVA

Learning Continuous Exposure Value Representations for Single-Image HDR Reconstruction

  • paper_url: http://arxiv.org/abs/2309.03900
  • repo_url: None
  • paper_authors: Su-Kai Chen, Hung-Lin Yen, Yu-Lun Liu, Min-Hung Chen, Hou-Ning Hu, Wen-Hsiao Peng, Yen-Yu Lin
  • for: 实现高动态范围(HDR)图像重建,使用深度学习方法从 LDR 图像生成 HDR 图像。
  • methods: 使用隐式函数生成任意曝光值(EV)的 LDR 图像,即连续曝光值表示(CEVR),并使用循环训练策略在缺少对应真值的情况下监督模型生成连续 EV 的 LDR 图像。
  • results: 与现有方法相比,CEVR模型能够实现更高质量的HDR reconstruction。
    Abstract Deep learning is commonly used to reconstruct HDR images from LDR images. LDR stack-based methods are used for single-image HDR reconstruction, generating an HDR image from a deep learning-generated LDR stack. However, current methods generate the stack with predetermined exposure values (EVs), which may limit the quality of HDR reconstruction. To address this, we propose the continuous exposure value representation (CEVR), which uses an implicit function to generate LDR images with arbitrary EVs, including those unseen during training. Our approach generates a continuous stack with more images containing diverse EVs, significantly improving HDR reconstruction. We use a cycle training strategy to supervise the model in generating continuous EV LDR images without corresponding ground truths. Our CEVR model outperforms existing methods, as demonstrated by experimental results.
    摘要 深度学习通常用于从 LDR 图像中重建 HDR 图像。现有方法使用 LDR 堆栈实现单幅图像 HDR 重建,即由深度学习生成 LDR 堆栈再合成 HDR 图像。但现有方法通常使用预先确定的曝光值(EV)来生成堆栈,这可能会限制 HDR 重建的质量。为解决这个问题,我们提出了连续曝光值表示(CEVR),它使用隐式函数生成具有任意 EV(包括训练过程中未见过的 EV)的 LDR 图像。我们的方法生成了包含更多不同 EV 图像的连续堆栈,显著改善了 HDR 重建。我们使用循环训练策略,在缺少对应真值的情况下监督模型生成连续 EV 的 LDR 图像。实验结果表明,我们的 CEVR 模型优于现有方法。

The Making and Breaking of Camouflage

  • paper_url: http://arxiv.org/abs/2309.03899
  • repo_url: None
  • paper_authors: Hala Lamdouar, Weidi Xie, Andrew Zisserman
  • for: 这项研究旨在解决伪装效果的评估问题,提出三种评估指标,以评估和比较不同伪装数据集的效果。
  • methods: 研究使用背景与前景特征的相似性和边界可见度来评估伪装效果,并在生成模型中将这些评估指标作为辅助损失,从而可规模化地生成高质量的伪装图像和视频。
  • results: 实验表明,利用生成的合成数据集训练的模型在公开的 MoCA-Mask 基准上实现了最先进的伪装破解(分割)性能。
    Abstract Not all camouflages are equally effective, as even a partially visible contour or a slight color difference can make the animal stand out and break its camouflage. In this paper, we address the question of what makes a camouflage successful, by proposing three scores for automatically assessing its effectiveness. In particular, we show that camouflage can be measured by the similarity between background and foreground features and boundary visibility. We use these camouflage scores to assess and compare all available camouflage datasets. We also incorporate the proposed camouflage score into a generative model as an auxiliary loss and show that effective camouflage images or videos can be synthesised in a scalable manner. The generated synthetic dataset is used to train a transformer-based model for segmenting camouflaged animals in videos. Experimentally, we demonstrate state-of-the-art camouflage breaking performance on the public MoCA-Mask benchmark.
    摘要 不 todas las formas de camuflaje son igual de efectivas, ya que incluso una contour parcialmente visible o un ligero cambio de color puede hacer que el animal se destaque y rompa su camuflaje. En este artículo, abordamos la pregunta de qué hace que un camuflaje sea exitoso, proponiendo tres puntuaciones para evaluar su eficacia de forma automática. En particular, mostramos que el camuflaje se puede medir por la similitud entre las características de fondo y del foreground, así como la visibilidad de la boundaria. Incorporamos las puntuaciones de camuflaje propuestas en un modelo generativo como una pérdida auxiliar y demostramos que se pueden synthetizar imágenes o videos de camuflaje efectivos de manera escalable. El conjunto de datos sintético generado se utiliza para entrenar un modelo basado en transformers para segmentar animales camuflados en videos. Experimentalmente, demostramos un rendimiento de clasificación de camuflaje de estado del arte en el benchmark público MoCA-Mask.

ProPainter: Improving Propagation and Transformer for Video Inpainting

  • paper_url: http://arxiv.org/abs/2309.03897
  • repo_url: https://github.com/sczhou/propainter
  • paper_authors: Shangchen Zhou, Chongyi Li, Kelvin C. K. Chan, Chen Change Loy
  • for: 改进视频修复(video inpainting, VI)中的两大主流机制:基于光流的传播与时空 Transformer。
  • methods: 提出了名为 ProPainter 的改进框架,它包含增强的传播模块和高效的 Transformer。具体而言,双域传播结合了图像域与特征域变形的优点,可靠地利用全局对应关系;此外还提出了一种掩码引导的稀疏视频 Transformer,通过丢弃无用和冗余的 token 实现高效率。
  • results: ProPainter 在 PSNR 指标上比此前最优方法高出 1.46 dB,同时保持了可观的效率。
    Abstract Flow-based propagation and spatiotemporal Transformer are two mainstream mechanisms in video inpainting (VI). Despite the effectiveness of these components, they still suffer from some limitations that affect their performance. Previous propagation-based approaches are performed separately either in the image or feature domain. Global image propagation isolated from learning may cause spatial misalignment due to inaccurate optical flow. Moreover, memory or computational constraints limit the temporal range of feature propagation and video Transformer, preventing exploration of correspondence information from distant frames. To address these issues, we propose an improved framework, called ProPainter, which involves enhanced ProPagation and an efficient Transformer. Specifically, we introduce dual-domain propagation that combines the advantages of image and feature warping, exploiting global correspondences reliably. We also propose a mask-guided sparse video Transformer, which achieves high efficiency by discarding unnecessary and redundant tokens. With these components, ProPainter outperforms prior arts by a large margin of 1.46 dB in PSNR while maintaining appealing efficiency.
    摘要 基于光流的传播与时空 Transformer 是视频修复(VI)的两种主流机制。尽管这些组件有效,它们仍存在一些限制其性能的问题。以往基于传播的方法要么在图像域、要么在特征域中单独进行;脱离学习的全局图像传播可能因光流估计不准确而导致空间错位。此外,内存或计算的限制约束了特征传播和视频 Transformer 的时间范围,阻碍了从较远帧中挖掘对应信息。为解决这些问题,我们提出了一个改进的框架,称为 ProPainter,它包含增强的传播模块和高效的 Transformer。具体来说,我们引入了结合图像变形与特征变形优点的双域传播,可靠地利用全局对应关系;我们还提出了一种掩码引导的稀疏视频 Transformer,通过丢弃无用和冗余的 token 实现高效率。凭借这些组件,ProPainter 在 PSNR 指标上以 1.46 dB 的较大优势超越了此前的最优方法,同时保持了可观的效率。

InstructDiffusion: A Generalist Modeling Interface for Vision Tasks

  • paper_url: http://arxiv.org/abs/2309.03895
  • repo_url: None
  • paper_authors: Zigang Geng, Binxin Yang, Tiankai Hang, Chen Li, Shuyang Gu, Ting Zhang, Jianmin Bao, Zheng Zhang, Han Hu, Dong Chen, Baining Guo
  • for: 这篇论文旨在提出一种统一和通用的框架,用于将计算机视觉任务与人类指令相对应。
  • methods: 该方法基于扩散过程,根据用户指令预测像素。
  • results: 这种方法可以处理多种计算机视觉任务,包括理解任务(如分割和关键点检测)和生成任务(如编辑和提高)。它还可以处理未看过的任务和超越先前方法在新数据集上的性能。
    Abstract We present InstructDiffusion, a unifying and generic framework for aligning computer vision tasks with human instructions. Unlike existing approaches that integrate prior knowledge and pre-define the output space (e.g., categories and coordinates) for each vision task, we cast diverse vision tasks into a human-intuitive image-manipulating process whose output space is a flexible and interactive pixel space. Concretely, the model is built upon the diffusion process and is trained to predict pixels according to user instructions, such as encircling the man's left shoulder in red or applying a blue mask to the left car. InstructDiffusion could handle a variety of vision tasks, including understanding tasks (such as segmentation and keypoint detection) and generative tasks (such as editing and enhancement). It even exhibits the ability to handle unseen tasks and outperforms prior methods on novel datasets. This represents a significant step towards a generalist modeling interface for vision tasks, advancing artificial general intelligence in the field of computer vision.
    摘要 我们介绍 InstructDiffusion,一种将计算机视觉任务与人类指令相对齐的统一通用框架。与现有方法不同,InstructDiffusion 不为每个视觉任务整合先验知识并预先定义输出空间(例如类别和坐标),而是将多种视觉任务转化为一个符合人类直觉的图像操作过程,其输出空间是灵活且可交互的像素空间。具体来说,模型基于扩散过程构建,并被训练为根据用户指令(如"用红色圈出该男子的左肩"或"给左边的汽车涂上蓝色掩码")预测像素。InstructDiffusion 可以处理多种视觉任务,包括理解任务(如分割和关键点检测)和生成任务(如编辑和增强),甚至能够处理未见过的任务,并在新数据集上超越以往方法。这是迈向视觉任务通用建模接口的重要一步,推动了计算机视觉领域的通用人工智能发展。

BluNF: Blueprint Neural Field

  • paper_url: http://arxiv.org/abs/2309.03933
  • repo_url: None
  • paper_authors: Robin Courant, Xi Wang, Marc Christie, Vicky Kalogeiton
  • for: 这篇论文是关于场景新视角合成(Novel View Synthesis)的研究,旨在提供逼真、精确且鲁棒的隐式重建,并解决其编辑难题。
  • methods: 这篇论文使用神经辐射场(NeRF)实现场景新视角合成,并提出了一种新的编辑方法,即蓝图神经场(Blueprint Neural Field, BluNF)。BluNF 借助隐式神经表示,利用先验语义和深度信息构建场景蓝图,以支持直观的场景编辑。
  • results: 实验结果表明,BluNF 支持通过直观的点选修改机制编辑场景,实现精确的 3D 操作,如遮罩、外观修改和物体移除。
    Abstract Neural Radiance Fields (NeRFs) have revolutionized scene novel view synthesis, offering visually realistic, precise, and robust implicit reconstructions. While recent approaches enable NeRF editing, such as object removal, 3D shape modification, or material property manipulation, the manual annotation prior to such edits makes the process tedious. Additionally, traditional 2D interaction tools lack an accurate sense of 3D space, preventing precise manipulation and editing of scenes. In this paper, we introduce a novel approach, called Blueprint Neural Field (BluNF), to address these editing issues. BluNF provides a robust and user-friendly 2D blueprint, enabling intuitive scene editing. By leveraging implicit neural representation, BluNF constructs a blueprint of a scene using prior semantic and depth information. The generated blueprint allows effortless editing and manipulation of NeRF representations. We demonstrate BluNF's editability through an intuitive click-and-change mechanism, enabling 3D manipulations, such as masking, appearance modification, and object removal. Our approach significantly contributes to visual content creation, paving the way for further research in this area.
    摘要 神经辐射场(NeRF)革新了场景新视角合成,提供了视觉逼真、精确且鲁棒的隐式重建。尽管近期的方法支持 NeRF 编辑,例如物体移除、3D 形状修改或材质属性操控,但编辑前所需的人工标注使流程繁琐。此外,传统的 2D 交互工具缺乏对 3D 空间的准确感知,难以对场景进行精确操控和编辑。本文提出一种新方法,称为 Blueprint Neural Field(BluNF),以解决这些编辑问题。BluNF 提供了一个稳健且易用的 2D 蓝图,支持直观的场景编辑。BluNF 借助隐式神经表示,利用先验语义和深度信息构建场景蓝图。生成的蓝图使 NeRF 表示的编辑和操控变得轻而易举。我们通过直观的点选修改机制展示了 BluNF 的可编辑性,支持遮罩、外观修改和物体移除等 3D 操作。我们的方法对视觉内容创作做出了重要贡献,为该领域的进一步研究铺平了道路。

ArtiGrasp: Physically Plausible Synthesis of Bi-Manual Dexterous Grasping and Articulation

  • paper_url: http://arxiv.org/abs/2309.03891
  • repo_url: None
  • paper_authors: Hui Zhang, Sammy Christen, Zicong Fan, Luocheng Zheng, Jemin Hwangbo, Jie Song, Otmar Hilliges
  • for: 这个论文的目的是提出一种新方法,用于合成双手的手-物交互,包括抓取和关节物体操控(articulation)。
  • methods: 该方法利用强化学习和物理仿真,训练一个同时控制全局手部姿态与局部手指动作的策略。
  • results: 该方法可以在Dynamic Object Grasping and Articulation任务中提供高效的解决方案,并且可以适应不同的姿势和物体。
    Abstract We present ArtiGrasp, a novel method to synthesize bi-manual hand-object interactions that include grasping and articulation. This task is challenging due to the diversity of the global wrist motions and the precise finger control that are necessary to articulate objects. ArtiGrasp leverages reinforcement learning and physics simulations to train a policy that controls the global and local hand pose. Our framework unifies grasping and articulation within a single policy guided by a single hand pose reference. Moreover, to facilitate the training of the precise finger control required for articulation, we present a learning curriculum with increasing difficulty. It starts with single-hand manipulation of stationary objects and continues with multi-agent training including both hands and non-stationary objects. To evaluate our method, we introduce Dynamic Object Grasping and Articulation, a task that involves bringing an object into a target articulated pose. This task requires grasping, relocation, and articulation. We show our method's efficacy towards this task. We further demonstrate that our method can generate motions with noisy hand-object pose estimates from an off-the-shelf image-based regressor.
    摘要 我们提出了 ArtiGrasp,一种合成双手手-物交互(包括抓取与关节物体操控)的新方法。由于全局手腕运动的多样性以及操控关节物体所需的精确手指控制,这项任务极具挑战性。ArtiGrasp 利用强化学习和物理仿真来训练一个同时控制全局与局部手部姿态的策略。我们的框架在同一个策略中统一了抓取和关节操控,并由单一的手部姿态参考引导。此外,为了帮助学习关节操控所需的精确手指控制,我们提出了一个难度递增的学习课程:从单手操作静止物体开始,随后进行包括双手与非静止物体在内的多智能体训练。为评估我们的方法,我们引入了"动态物体抓取与关节操控"任务,即把物体带到目标关节姿态,这需要抓取、移动和关节操控。我们展示了方法在该任务上的有效性。我们还证明,我们的方法可以利用现成的基于图像的回归器给出的带噪手-物姿态估计来生成动作。

Better Practices for Domain Adaptation

  • paper_url: http://arxiv.org/abs/2309.03879
  • repo_url: None
  • paper_authors: Linus Ericsson, Da Li, Timothy M. Hospedales
  • for: The paper addresses the challenge of domain shift in real-world machine learning applications, particularly the difficulty of performing hyperparameter optimization for domain adaptation algorithms without access to a labeled validation set.
  • methods: The paper uses a suite of candidate validation criteria to benchmark popular adaptation algorithms and assess their performance.
  • results: The results show that there are challenges across all three branches of domain adaptation methodology, including Unsupervised Domain Adaptation (UDA), Source-Free Domain Adaptation (SFDA), and Test Time Adaptation (TTA). However, the paper also demonstrates that using proper validation splits and exploring new validation metrics can improve performance.
    Abstract Distribution shifts are all too common in real-world applications of machine learning. Domain adaptation (DA) aims to address this by providing various frameworks for adapting models to the deployment data without using labels. However, the domain shift scenario raises a second more subtle challenge: the difficulty of performing hyperparameter optimisation (HPO) for these adaptation algorithms without access to a labelled validation set. The unclear validation protocol for DA has led to bad practices in the literature, such as performing HPO using the target test labels when, in real-world scenarios, they are not available. This has resulted in over-optimism about DA research progress compared to reality. In this paper, we analyse the state of DA when using good evaluation practice, by benchmarking a suite of candidate validation criteria and using them to assess popular adaptation algorithms. We show that there are challenges across all three branches of domain adaptation methodology including Unsupervised Domain Adaptation (UDA), Source-Free Domain Adaptation (SFDA), and Test Time Adaptation (TTA). While the results show that realistically achievable performance is often worse than expected, they also show that using proper validation splits is beneficial, as well as showing that some previously unexplored validation metrics provide the best options to date. Altogether, our improved practices covering data, training, validation and hyperparameter optimisation form a new rigorous pipeline to improve benchmarking, and hence research progress, within this important field going forward.
    摘要 分布偏移在机器学习的实际应用中十分常见。领域自适应(DA)旨在通过各种框架,在不使用标签的情况下使模型适应部署数据。然而,域偏移场景还带来了第二个更微妙的挑战:在无法获得带标签验证集的情况下,难以为这些自适应算法进行超参数优化(HPO)。DA 验证协议的不明确导致了文献中的不良做法,例如使用目标域测试标签进行 HPO,而在实际场景中这些标签并不可得,这使得人们对 DA 研究进展的评估过于乐观。在本文中,我们通过对一组候选验证准则进行基准测试,并用它们评估流行的自适应算法,分析了在采用良好评估实践时 DA 的真实水平。我们发现,领域自适应方法的三个分支,包括无监督领域自适应(UDA)、无源领域自适应(SFDA)和测试时自适应(TTA),都面临挑战。结果表明,实际可达到的性能往往低于预期;但结果同时表明,使用恰当的验证划分是有益的,并且一些此前未被探索的验证指标提供了迄今最好的选择。总之,我们在数据、训练、验证和超参数优化方面改进的实践构成了一条新的严谨流程,有望改进该重要领域今后的基准测试与研究进展。
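
作为示意,下面给出一种无标签验证准则的 PyTorch 草图:在无标签目标数据上计算平均预测熵,用于 DA 模型选择。这只是候选准则之一的说明性实现(假设 DataLoader 产出 (输入, 标签) 对,标签不使用),论文系统地基准测试了一整套此类准则。

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def target_entropy_criterion(model, target_loader, device="cpu"):
    """Label-free validation criterion: mean prediction entropy on
    unlabeled target data (lower = more confident predictions)."""
    model.eval()
    total, n = 0.0, 0
    for x, _ in target_loader:            # labels are never used
        p = F.softmax(model(x.to(device)), dim=1)
        total += -(p * p.clamp_min(1e-8).log()).sum(1).sum().item()
        n += x.shape[0]
    return total / n
```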

Box-based Refinement for Weakly Supervised and Unsupervised Localization Tasks

  • paper_url: http://arxiv.org/abs/2309.03874
  • repo_url: https://github.com/eyalgomel/box-based-refinement
  • paper_authors: Eyal Gomel, Tal Shaharabany, Lior Wolf
  • for: 提升弱监督和无监督方法的定位性能
  • methods: 使用基于边界框的检测网络,在网络输出(而非图像数据)之上训练检测器,并进行相应的损失反向传播
  • results: 显著提升了 "what is where by looking" 任务上的短语定位(phrase grounding)性能,以及多种无监督目标发现方法
    Abstract It has been established that training a box-based detector network can enhance the localization performance of weakly supervised and unsupervised methods. Moreover, we extend this understanding by demonstrating that these detectors can be utilized to improve the original network, paving the way for further advancements. To accomplish this, we train the detectors on top of the network output instead of the image data and apply suitable loss backpropagation. Our findings reveal a significant improvement in phrase grounding for the ``what is where by looking'' task, as well as various methods of unsupervised object discovery. Our code is available at https://github.com/eyalgomel/box-based-refinement.
    摘要 已有研究表明,训练一个基于边界框的检测网络可以提升弱监督和无监督方法的定位性能。我们进一步扩展了这一认识,证明这些检测器还可以反过来改进原始网络,为进一步的提升铺平道路。为此,我们在网络输出(而非图像数据)之上训练检测器,并进行相应的损失反向传播。我们的研究结果显示,该方法在 "what is where by looking" 任务的短语定位上带来了显著提升,并改进了多种无监督目标发现方法。代码见:https://github.com/eyalgomel/box-based-refinement。

Text-to-feature diffusion for audio-visual few-shot learning

  • paper_url: http://arxiv.org/abs/2309.03869
  • repo_url: https://github.com/explainableml/avdiff-gfsl
  • paper_authors: Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata
  • for: 这 paper 的目的是为视听少样本视频分类建立一个统一的基准,以便在标注数据有限时训练深度学习模型进行视频分类。
  • methods: 该 paper 在三个数据集(VGGSound-FSL、UCF-FSL、ActivityNet-FSL)上适配并比较了十种方法,并提出 AV-DIFF,一个基于文本到特征扩散的框架,先通过跨模态注意力融合时序与视听特征,再为新类别生成多模态特征。
  • results: 在所提出的视听(广义)少样本学习基准上,AV-DIFF 达到了最先进的性能。
    Abstract Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process. A challenging and underexplored, yet much cheaper, setup is few-shot learning from video data. In particular, the inherently multi-modal nature of video data with sound and visual information has not been leveraged extensively for the few-shot video classification task. Therefore, we introduce a unified audio-visual few-shot video classification benchmark on three datasets, i.e. the VGGSound-FSL, UCF-FSL, ActivityNet-FSL datasets, where we adapt and compare ten methods. In addition, we propose AV-DIFF, a text-to-feature diffusion framework, which first fuses the temporal and audio-visual features via cross-modal attention and then generates multi-modal features for the novel classes. We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual (generalised) few-shot learning. Our benchmark paves the way for effective audio-visual classification when only limited labeled data is available. Code and data are available at https://github.com/ExplainableML/AVDIFF-GFSL.
    摘要 从视听数据训练用于视频分类的深度学习模型,通常需要通过高成本流程收集的大量标注训练数据。而基于视频数据的少样本学习是一个更具挑战性、探索不足但成本低得多的设定。特别是,视频数据天然的多模态特性(同时包含声音与视觉信息)尚未在少样本视频分类任务中得到充分利用。因此,我们在三个数据集(VGGSound-FSL、UCF-FSL、ActivityNet-FSL)上建立了一个统一的视听少样本视频分类基准,并在其上适配和比较了十种方法。此外,我们提出了 AV-DIFF,一种文本到特征扩散框架,它首先通过跨模态注意力融合时序特征与视听特征,然后为新类别生成多模态特征。我们表明,AV-DIFF 在我们提出的视听(广义)少样本学习基准上取得了最先进的性能。该基准为标注数据有限时的有效视听分类铺平了道路。代码和数据见 https://github.com/ExplainableML/AVDIFF-GFSL。

CenTime: Event-Conditional Modelling of Censoring in Survival Analysis

  • paper_url: http://arxiv.org/abs/2309.03851
  • repo_url: https://github.com/ahmedhshahin/CenTime
  • paper_authors: Ahmed H. Shahin, An Zhao, Alexander C. Whitehead, Daniel C. Alexander, Joseph Jacob, David Barber
  • for: Time-to-event prediction in medical machine learning, so that clinically important events can be predicted directly rather than only ranked.
  • methods: Proposes CenTime, with a novel event-conditional censoring mechanism that yields a consistent estimator even when uncensored samples are scarce, and integrates with deep learning models without restrictions on batch size or the number of uncensored samples (a simplified censoring-aware loss is sketched below).
  • results: Compared with standard survival analysis methods such as the Cox proportional-hazards model and DeepHit, CenTime achieves state-of-the-art time-to-death prediction while maintaining comparable ranking performance.
    Abstract Survival analysis is a valuable tool for estimating the time until specific events, such as death or cancer recurrence, based on baseline observations. This is particularly useful in healthcare to prognostically predict clinically important events based on patient data. However, existing approaches often have limitations; some focus only on ranking patients by survivability, neglecting to estimate the actual event time, while others treat the problem as a classification task, ignoring the inherent time-ordered structure of the events. Furthermore, the effective utilization of censored samples - training data points where the exact event time is unknown - is essential for improving the predictive accuracy of the model. In this paper, we introduce CenTime, a novel approach to survival analysis that directly estimates the time to event. Our method features an innovative event-conditional censoring mechanism that performs robustly even when uncensored data is scarce. We demonstrate that our approach forms a consistent estimator for the event model parameters, even in the absence of uncensored data. Furthermore, CenTime is easily integrated with deep learning models with no restrictions on batch size or the number of uncensored samples. We compare our approach with standard survival analysis methods, including the Cox proportional-hazard model and DeepHit. Our results indicate that CenTime offers state-of-the-art performance in predicting time-to-death while maintaining comparable ranking performance. Our implementation is publicly available at https://github.com/ahmedhshahin/CenTime.
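
A generic censoring-aware likelihood over discrete time bins illustrates the core idea of using censored samples (the event is known to occur after the censoring time). CenTime's event-conditional formulation is more refined, so treat this as a simplified stand-in with assumed names.

```python
import torch
import torch.nn.functional as F

def censored_nll(logits, times, is_censored):
    """Simplified censoring-aware negative log-likelihood over T time bins.
    logits:      (B, T) unnormalized scores over discrete time bins
    times:       (B,)   event bin (uncensored) or censoring bin (censored)
    is_censored: (B,)   True where the exact event time is unknown"""
    log_p = F.log_softmax(logits, dim=1)
    # Uncensored samples: likelihood of the observed event bin.
    nll_event = -log_p.gather(1, times.unsqueeze(1)).squeeze(1)
    # Censored samples: the event happens strictly after the censoring
    # time, so marginalize the probability mass over all later bins.
    T = logits.size(1)
    later = torch.arange(T, device=logits.device)[None, :] > times[:, None]
    nll_cens = -torch.logsumexp(log_p.masked_fill(~later, float('-inf')), dim=1)
    return torch.where(is_censored, nll_cens, nll_event).mean()

loss = censored_nll(torch.randn(4, 100),
                    torch.tensor([10, 50, 30, 99]),
                    torch.tensor([False, True, True, False]))
```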

Random Expert Sampling for Deep Learning Segmentation of Acute Ischemic Stroke on Non-contrast CT

  • paper_url: http://arxiv.org/abs/2309.03930
  • repo_url: None
  • paper_authors: Sophie Ostmeier, Brian Axelrod, Benjamin Pulli, Benjamin F. J. Verhaaren, Abdelkader Mahammedi, Yongkai Liu, Christian Federau, Greg Zaharchuk, Jeremy J. Heit
  • for: To develop and validate a deep learning method that automatically quantifies ischemic brain tissue on non-contrast CT scans of patients with acute ischemic stroke.
  • methods: A benchmark U-Net trained on reference annotations from three experienced neuroradiologists, comparing two training schemes (majority vote and random expert sampling, sketched below) via a one-sided Wilcoxon signed-rank test and consistency analysis.
  • results: Random expert sampling produced a model that agrees with the experts better than the experts agree among themselves, and better than a majority-vote model (Surface Dice at tolerance 5mm up 61% to 0.70±0.03; Dice up 25% to 0.50±0.04); the predicted volumes estimated the final infarct volume and correlated better with clinical outcome than CT perfusion.
    Abstract Purpose: Multi-expert deep learning training methods to automatically quantify ischemic brain tissue on Non-Contrast CT Materials and Methods: The data set consisted of 260 Non-Contrast CTs from 233 patients of acute ischemic stroke patients recruited in the DEFUSE 3 trial. A benchmark U-Net was trained on the reference annotations of three experienced neuroradiologists to segment ischemic brain tissue using majority vote and random expert sampling training schemes. We used a one-sided Wilcoxon signed-rank test on a set of segmentation metrics to compare bootstrapped point estimates of the training schemes with the inter-expert agreement and ratio of variance for consistency analysis. We further compare volumes with the 24h-follow-up DWI (final infarct core) in the patient subgroup with full reperfusion and we test volumes for correlation to the clinical outcome (mRS after 30 and 90 days) with the Spearman method. Results: Random expert sampling leads to a model that shows better agreement with experts than experts agree among themselves and better agreement than the agreement between experts and a majority-vote model performance (Surface Dice at Tolerance 5mm improvement of 61% to 0.70 +- 0.03 and Dice improvement of 25% to 0.50 +- 0.04). The model-based predicted volume similarly estimated the final infarct volume and correlated better to the clinical outcome than CT perfusion. Conclusion: A model trained on random expert sampling can identify the presence and location of acute ischemic brain tissue on Non-Contrast CT similar to CT perfusion and with better consistency than experts. This may further secure the selection of patients eligible for endovascular treatment in less specialized hospitals.
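
The random-expert-sampling scheme amounts to drawing a different expert's annotation each time a scan is sampled, instead of collapsing the experts into a majority-vote mask. A minimal dataset-style sketch with assumed names and structure:

```python
import random

class RandomExpertDataset:
    """Sketch of random expert sampling: annotations[i] holds one
    segmentation mask per expert for scan i; each access supervises
    with a freshly sampled expert. Usable as a torch-style Dataset."""
    def __init__(self, scans, annotations, num_experts=3):
        self.scans, self.annotations = scans, annotations
        self.num_experts = num_experts

    def __len__(self):
        return len(self.scans)

    def __getitem__(self, i):
        expert = random.randrange(self.num_experts)  # resampled every draw
        return self.scans[i], self.annotations[i][expert]
```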

Cross-Task Attention Network: Improving Multi-Task Learning for Medical Imaging Applications

  • paper_url: http://arxiv.org/abs/2309.03837
  • repo_url: None
  • paper_authors: Sangwook Kim, Thomas G. Purdie, Chris McIntosh
  • for: To improve the performance of medical imaging tasks with a novel attention-based multi-task learning (MTL) framework.
  • methods: The proposed Cross-Task Attention Network (CTAN) uses cross-task attention mechanisms to incorporate information across tasks (sketched below).
  • results: Across four medical imaging datasets, CTAN improved performance by 4.67% over standard single-task learning (STL) and outperformed both widely used MTL baselines, hard parameter sharing (HPS, by 3.22%) and the multi-task attention network (MTAN, by 5.38%).
    Abstract Multi-task learning (MTL) is a powerful approach in deep learning that leverages the information from multiple tasks during training to improve model performance. In medical imaging, MTL has shown great potential to solve various tasks. However, existing MTL architectures in medical imaging are limited in sharing information across tasks, reducing the potential performance improvements of MTL. In this study, we introduce a novel attention-based MTL framework to better leverage inter-task interactions for various tasks from pixel-level to image-level predictions. Specifically, we propose a Cross-Task Attention Network (CTAN) which utilizes cross-task attention mechanisms to incorporate information by interacting across tasks. We validated CTAN on four medical imaging datasets that span different domains and tasks including: radiation treatment planning prediction using planning CT images of two different target cancers (Prostate, OpenKBP); pigmented skin lesion segmentation and diagnosis using dermatoscopic images (HAM10000); and COVID-19 diagnosis and severity prediction using chest CT scans (STOIC). Our study demonstrates the effectiveness of CTAN in improving the accuracy of medical imaging tasks. Compared to standard single-task learning (STL), CTAN demonstrated a 4.67% improvement in performance and outperformed both widely used MTL baselines: hard parameter sharing (HPS) with an average performance improvement of 3.22%; and multi-task attention network (MTAN) with a relative decrease of 5.38%. These findings highlight the significance of our proposed MTL framework in solving medical imaging tasks and its potential to improve their accuracy across domains.
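
A cross-task attention block can be sketched as one task's features querying another task's features so information flows between branches. This is a rough PyTorch sketch, not CTAN's actual architecture.

```python
import torch
import torch.nn as nn

class CrossTaskAttention(nn.Module):
    """Sketch: enrich task A's token features with context attended
    from task B's features (roughly the cross-task idea in CTAN)."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, task_a, task_b):
        # task_a, task_b: (B, N, D) features from two task branches
        msg, _ = self.attn(query=task_a, key=task_b, value=task_b)
        return self.norm(task_a + msg)
```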

ArtHDR-Net: Perceptually Realistic and Accurate HDR Content Creation

  • paper_url: http://arxiv.org/abs/2309.03827
  • repo_url: None
  • paper_authors: Hrishav Bakul Barua, Ganesh Krishnasamy, KokSheik Wong, Kalin Stefanov, Abhinav Dhall
  • for: Creating High Dynamic Range (HDR) content that preserves the artistic intent of images in terms of human visual perception, not only pixel-wise accuracy.
  • methods: Proposes ArtHDR-Net, a Convolutional Neural Network-based architecture that takes multi-exposed LDR features as input.
  • results: ArtHDR-Net achieves state-of-the-art HDR-VDP-2 scores (mean opinion score index) while remaining competitive in PSNR and SSIM.
    Abstract High Dynamic Range (HDR) content creation has become an important topic for modern media and entertainment sectors, gaming and Augmented/Virtual Reality industries. Many methods have been proposed to recreate the HDR counterparts of input Low Dynamic Range (LDR) images/videos given a single exposure or multi-exposure LDRs. The state-of-the-art methods focus primarily on the preservation of the reconstruction's structural similarity and the pixel-wise accuracy. However, these conventional approaches do not emphasize preserving the artistic intent of the images in terms of human visual perception, which is an essential element in media, entertainment and gaming. In this paper, we attempt to study and fill this gap. We propose an architecture called ArtHDR-Net based on a Convolutional Neural Network that uses multi-exposed LDR features as input. Experimental results show that ArtHDR-Net can achieve state-of-the-art performance in terms of the HDR-VDP-2 score (i.e., mean opinion score index) while reaching competitive performance in terms of PSNR and SSIM.

T2IW: Joint Text to Image & Watermark Generation

  • paper_url: http://arxiv.org/abs/2309.03815
  • repo_url: None
  • paper_authors: An-An Liu, Guokai Zhang, Yuting Su, Ning Xu, Yongdong Zhang, Lanjun Wang
  • for: Proposes a joint text-to-image and watermark generation task (T2IW) to support traceability, privacy protection, and other security requirements.
  • methods: Forces the semantic features and the watermark signal to remain compatible at the pixel level, and draws on Shannon information theory and non-cooperative game theory to separate the revealed image and the revealed watermark from the compound image.
  • results: Experiments show strong image quality, watermark invisibility, and watermark robustness under various post-processing attacks, evaluated with a proposed set of metrics.
    Abstract Recent developments in text-conditioned image generative models have revolutionized the production of realistic results. Unfortunately, this has also led to an increase in privacy violations and the spread of false information, which requires the need for traceability, privacy protection, and other security measures. However, existing text-to-image paradigms lack the technical capabilities to link traceable messages with image generation. In this study, we introduce a novel task for the joint generation of text to image and watermark (T2IW). This T2IW scheme ensures minimal damage to image quality when generating a compound image by forcing the semantic feature and the watermark signal to be compatible in pixels. Additionally, by utilizing principles from Shannon information theory and non-cooperative game theory, we are able to separate the revealed image and the revealed watermark from the compound image. Furthermore, we strengthen the watermark robustness of our approach by subjecting the compound image to various post-processing attacks, with minimal pixel distortion observed in the revealed watermark. Extensive experiments have demonstrated remarkable achievements in image quality, watermark invisibility, and watermark robustness, supported by our proposed set of evaluation metrics.

Panoramas from Photons

  • paper_url: http://arxiv.org/abs/2309.03811
  • repo_url: None
  • paper_authors: Sacha Jungerman, Atul Ingle, Mohit Gupta
  • for: Scene reconstruction under high-speed motion and low illumination, with applications in augmented/virtual reality, drone navigation, and autonomous robotics.
  • methods: Iteratively improves a motion estimate by grouping and aggregating photon frames after-the-fact, in a stratified manner (a simplified aggregation step is sketched below).
  • results: High-quality panoramas under fast motion and extremely low light, plus super-resolution results using a custom single-photon camera prototype.
    Abstract Scene reconstruction in the presence of high-speed motion and low illumination is important in many applications such as augmented and virtual reality, drone navigation, and autonomous robotics. Traditional motion estimation techniques fail in such conditions, suffering from too much blur in the presence of high-speed motion and strong noise in low-light conditions. Single-photon cameras have recently emerged as a promising technology capable of capturing hundreds of thousands of photon frames per second thanks to their high speed and extreme sensitivity. Unfortunately, traditional computer vision techniques are not well suited for dealing with the binary-valued photon data captured by these cameras because these are corrupted by extreme Poisson noise. Here we present a method capable of estimating extreme scene motion under challenging conditions, such as low light or high dynamic range, from a sequence of high-speed image frames such as those captured by a single-photon camera. Our method relies on iteratively improving a motion estimate by grouping and aggregating frames after-the-fact, in a stratified manner. We demonstrate the creation of high-quality panoramas under fast motion and extremely low light, and super-resolution results using a custom single-photon camera prototype. For code and supplemental material see our project webpage: https://wisionlab.com/project/panoramas-from-photons/
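
The aggregation step can be illustrated by motion-compensated averaging of binary frames; the paper's stratified grouping and the shift-estimation step are omitted here, and integer shifts are an assumption of this sketch.

```python
import numpy as np

def aggregate_photon_frames(frames, shifts):
    """Average binary single-photon frames after undoing a per-frame
    integer motion estimate (a much-simplified aggregation step).
    frames: (T, H, W) binary frames; shifts: (T, 2) integer (dy, dx)."""
    T = frames.shape[0]
    acc = np.zeros(frames.shape[1:], dtype=np.float64)
    for t in range(T):
        dy, dx = shifts[t]
        acc += np.roll(frames[t], shift=(-int(dy), -int(dx)), axis=(0, 1))
    return acc / T

# Iterative refinement (schematic): start from zero motion, aggregate,
# re-estimate shifts against the sharper aggregate, and repeat.
```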

SimNP: Learning Self-Similarity Priors Between Neural Points

  • paper_url: http://arxiv.org/abs/2309.03809
  • repo_url: None
  • paper_authors: Christopher Wewer, Eddy Ilg, Bernt Schiele, Jan Eric Lenssen
  • for: Improving neural field representations for 3D object reconstruction by combining object-level priors with high-quality local detail.
  • methods: Proposes SimNP, which connects neural point radiance fields with a category-level self-similarity representation: the first category-level neural point representation (built on coherent point clouds) preserves detail in locally supported object regions, and an unconstrained, unsupervised mechanism learns how information is shared between neural points so unobserved regions can be derived from observations.
  • results: SimNP outperforms previous methods in reconstructing symmetric unseen object regions, surpassing approaches built on category-level or pixel-aligned radiance fields, while providing semantic correspondences between instances.
    Abstract Existing neural field representations for 3D object reconstruction either (1) utilize object-level representations, but suffer from low-quality details due to conditioning on a global latent code, or (2) are able to perfectly reconstruct the observations, but fail to utilize object-level prior knowledge to infer unobserved regions. We present SimNP, a method to learn category-level self-similarities, which combines the advantages of both worlds by connecting neural point radiance fields with a category-level self-similarity representation. Our contribution is two-fold. (1) We design the first neural point representation on a category level by utilizing the concept of coherent point clouds. The resulting neural point radiance fields store a high level of detail for locally supported object regions. (2) We learn how information is shared between neural points in an unconstrained and unsupervised fashion, which allows to derive unobserved regions of an object during the reconstruction process from given observations. We show that SimNP is able to outperform previous methods in reconstructing symmetric unseen object regions, surpassing methods that build upon category-level or pixel-aligned radiance fields, while providing semantic correspondences between instances

Deep Learning Safety Concerns in Automated Driving Perception

  • paper_url: http://arxiv.org/abs/2309.03774
  • repo_url: None
  • paper_authors: Stephanie Abrecht, Alexander Hirsch, Shervin Raafatnia, Matthias Woehrle
  • for: Supporting the safe use of deep-learning-based perception in automated driving (AD) systems.
  • methods: Uses the notion of safety concerns as a structuring element to address the safety of DNN-based perception components systematically and comprehensively, in alignment with standards such as ISO 21448 (SOTIF) and the upcoming ISO PAS 8800.
  • results: Extends and refines the concept with an additional categorization, based on feedback from domain and safety experts, enabling cross-functional teams to jointly address the concerns.
    Abstract Recent advances in the field of deep learning and impressive performance of deep neural networks (DNNs) for perception have resulted in an increased demand for their use in automated driving (AD) systems. The safety of such systems is of utmost importance and thus requires to consider the unique properties of DNNs. In order to achieve safety of AD systems with DNN-based perception components in a systematic and comprehensive approach, so-called safety concerns have been introduced as a suitable structuring element. On the one hand, the concept of safety concerns is -- by design -- well aligned to existing standards relevant for safety of AD systems such as ISO 21448 (SOTIF). On the other hand, it has already inspired several academic publications and upcoming standards on AI safety such as ISO PAS 8800. While the concept of safety concerns has been previously introduced, this paper extends and refines it, leveraging feedback from various domain and safety experts in the field. In particular, this paper introduces an additional categorization for a better understanding as well as enabling cross-functional teams to jointly address the concerns.

$L_{2,1}$-Norm Regularized Quaternion Matrix Completion Using Sparse Representation and Quaternion QR Decomposition

  • paper_url: http://arxiv.org/abs/2309.03764
  • repo_url: None
  • paper_authors: Juan Han, Kit Ian Kou, Jifei Miao, Lizhi Liu, Haojiang Li
  • for: color image completion
  • methods: quaternion Qatar Riyal decomposition (QQR) and quaternion $L_{2,1}$-norm (QLNM-QQR), iteratively reweighted quaternion $L_{2,1}$-norm minimization (IRQLNM-QQR), and quaternion $L_{2,1}$-norm with sparse regularization (QLNM-QQR-SR)
  • results: IRQLNM-QQR outperforms QLNM-QQR, and the proposed QLNM-QQR-SR is superior to several state-of-the-art methods on natural color images and color medical images.
    Abstract Color image completion is a challenging problem in computer vision, but recent research has shown that quaternion representations of color images perform well in many areas. These representations consider the entire color image and effectively utilize coupling information between the three color channels. Consequently, low-rank quaternion matrix completion (LRQMC) algorithms have gained significant attention. We propose a method based on quaternion Qatar Riyal decomposition (QQR) and quaternion $L_{2,1}$-norm called QLNM-QQR. This new approach reduces computational complexity by avoiding the need to calculate the QSVD of large quaternion matrices. We also present two improvements to the QLNM-QQR method: an enhanced version called IRQLNM-QQR that uses iteratively reweighted quaternion $L_{2,1}$-norm minimization and a method called QLNM-QQR-SR that integrates sparse regularization. Our experiments on natural color images and color medical images show that IRQLNM-QQR outperforms QLNM-QQR and that the proposed QLNM-QQR-SR method is superior to several state-of-the-art methods.
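
For reference, the quaternion $L_{2,1}$-norm sums, over columns, the $\ell_2$ norms of the quaternion moduli in each column. A plausible form of the QQR-based completion problem (notation assumed, not quoted from the paper) applies this norm to the $\dot{R}$ factor so that no QSVD of the large matrix is needed:

```latex
% Quaternion L_{2,1}-norm of \dot{X} = (\dot{x}_{ij}) \in \mathbb{H}^{m \times n}:
\|\dot{X}\|_{2,1} \;=\; \sum_{j=1}^{n} \Big( \sum_{i=1}^{m} |\dot{x}_{ij}|^{2} \Big)^{1/2}

% Plausible QQR-based completion: factor \dot{X} = \dot{Q}\dot{R} and penalize
% the L_{2,1}-norm of \dot{R} as the low-rank surrogate, subject to agreement
% with the observed entries P_\Omega(\dot{M}):
\min_{\dot{Q},\,\dot{R}} \;\; \|\dot{R}\|_{2,1}
\quad \text{s.t.} \quad
P_{\Omega}(\dot{Q}\dot{R}) = P_{\Omega}(\dot{M}), \qquad \dot{Q}^{H}\dot{Q} = I .
```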

dacl1k: Real-World Bridge Damage Dataset Putting Open-Source Data to the Test

  • paper_url: http://arxiv.org/abs/2309.03763
  • repo_url: None
  • paper_authors: Johannes Flotzinger, Philipp J. Rösch, Norbert Oswald, Thomas Braml
  • for: Improving recognition of reinforced concrete defects (RCDs) on bridges, which underpins assessments of structural integrity, traffic safety, and durability.
  • methods: Trains models on different combinations of open-source data ("meta datasets") and evaluates them extrinsically and intrinsically on the new real-world dacl1k dataset (1,474 images).
  • results: The meta data proves practically usable: the best model reaches an Exact Match Ratio of 32% on dacl1k (metric sketched below), and clustering the best model's bottleneck features examines whether it has learned to distinguish datasets or the defect classes.
    Abstract Recognising reinforced concrete defects (RCDs) is a crucial element for determining the structural integrity, traffic safety and durability of bridges. However, most of the existing datasets in the RCD domain are derived from a small number of bridges acquired in specific camera poses, lighting conditions and with fixed hardware. These limitations question the usability of models trained on such open-source data in real-world scenarios. We address this problem by testing such models on our "dacl1k" dataset, a highly diverse RCD dataset for multi-label classification based on building inspections including 1,474 images. Thereby, we trained the models on different combinations of open-source data (meta datasets) which were subsequently evaluated both extrinsically and intrinsically. During extrinsic evaluation, we report metrics on dacl1k and the meta datasets. The performance analysis on dacl1k shows practical usability of the meta data, where the best model shows an Exact Match Ratio of 32%. Additionally, we conduct an intrinsic evaluation by clustering the bottleneck features of the best model derived from the extrinsic evaluation in order to find out, if the model has learned distinguishing datasets or the classes (RCDs) which is the aspired goal. The dacl1k dataset and our trained models will be made publicly available, enabling researchers and practitioners to put their models to the real-world test.
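
The Exact Match Ratio reported above is the strictest multi-label metric: a sample counts as correct only if its entire predicted label set matches the ground truth. A minimal sketch:

```python
import numpy as np

def exact_match_ratio(y_true, y_pred):
    """Fraction of samples whose full multi-label prediction matches
    the ground truth exactly. y_true, y_pred: (N, C) binary arrays."""
    return float(np.mean(np.all(y_true == y_pred, axis=1)))

y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 1]])
print(exact_match_ratio(y_true, y_pred))  # 0.5
```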

M(otion)-mode Based Prediction of Ejection Fraction using Echocardiograms

  • paper_url: http://arxiv.org/abs/2309.03759
  • repo_url: https://github.com/thomassutter/mmodeecho
  • paper_authors: Ece Ozkan, Thomas M. Sutter, Yurong Hu, Sebastian Balzer, Julia E. Vogt
  • for: Early detection of cardiac dysfunction through routine screening; the left ventricular ejection fraction (EF) is a key metric, with lower EF associated with cardiomyopathy.
  • methods: Uses M(otion)-mode echocardiograms to estimate EF and classify cardiomyopathy: multiple artificial M-mode images are generated from a single echocardiogram (extraction sketched below) and combined with off-the-shelf model architectures; contrastive learning (CL) is extended to cardiac imaging to learn meaningful representations from unlabeled data.
  • results: The supervised setting converges with only ten M-modes and is comparable to the baseline while being computationally much more efficient; CL with M-mode images is especially helpful in limited-data scenarios, e.g., labels for only 200 patients, which is common in medical applications.
    Abstract Early detection of cardiac dysfunction through routine screening is vital for diagnosing cardiovascular diseases. An important metric of cardiac function is the left ventricular ejection fraction (EF), where lower EF is associated with cardiomyopathy. Echocardiography is a popular diagnostic tool in cardiology, with ultrasound being a low-cost, real-time, and non-ionizing technology. However, human assessment of echocardiograms for calculating EF is time-consuming and expertise-demanding, raising the need for an automated approach. In this work, we propose using the M(otion)-mode of echocardiograms for estimating the EF and classifying cardiomyopathy. We generate multiple artificial M-mode images from a single echocardiogram and combine them using off-the-shelf model architectures. Additionally, we extend contrastive learning (CL) to cardiac imaging to learn meaningful representations from exploiting structures in unlabeled data allowing the model to achieve high accuracy, even with limited annotations. Our experiments show that the supervised setting converges with only ten modes and is comparable to the baseline method while bypassing its cumbersome training process and being computationally much more efficient. Furthermore, CL using M-mode images is helpful for limited data scenarios, such as having labels for only 200 patients, which is common in medical applications.
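
An artificial M-mode image is one scan line tracked over time. A toy sketch of the extraction follows (a horizontal line for simplicity; in practice lines through the ventricle at varying positions and angles would be sampled):

```python
import numpy as np

def extract_mmode(video, row, col0, col1):
    """Build one artificial M-mode image from a B-mode echo clip by
    sampling the same spatial line in every frame and stacking over time.
    video: (T, H, W) grayscale frames -> returns (L, T) space-by-time image."""
    return video[:, row, col0:col1].T

video = np.random.rand(60, 112, 112)                 # 60-frame echo clip
mmode = extract_mmode(video, row=56, col0=10, col1=100)
print(mmode.shape)                                   # (90, 60)
```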

PBP: Path-based Trajectory Prediction for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.03750
  • repo_url: None
  • paper_authors: Sepideh Afshar, Nachiket Deo, Akshay Bhagat, Titas Chakraborty, Yunming Shao, Balarama Raju Buddharaju, Adwait Deshpande, Henggang Cui
  • for: Improving trajectory prediction in the autonomous driving stack so that vehicles can better anticipate the motion of surrounding agents, with better map compliance than goal-based predictors.
  • methods: Proposes Path-based Prediction (PBP), which predicts a discrete probability distribution over reference paths in the HD map from path features and decodes trajectories in the path-relative Frenet frame (projection sketched below).
  • results: With a HiVT scene encoder on the Argoverse dataset, the PBP trajectory decoder is competitive on standard trajectory prediction metrics while significantly outperforming state-of-the-art baselines on map compliance.
    Abstract Trajectory prediction plays a crucial role in the autonomous driving stack by enabling autonomous vehicles to anticipate the motion of surrounding agents. Goal-based prediction models have gained traction in recent years for addressing the multimodal nature of future trajectories. Goal-based prediction models simplify multimodal prediction by first predicting 2D goal locations of agents and then predicting trajectories conditioned on each goal. However, a single 2D goal location serves as a weak inductive bias for predicting the whole trajectory, often leading to poor map compliance, i.e., part of the trajectory going off-road or breaking traffic rules. In this paper, we improve upon goal-based prediction by proposing the Path-based prediction (PBP) approach. PBP predicts a discrete probability distribution over reference paths in the HD map using the path features and predicts trajectories in the path-relative Frenet frame. We applied the PBP trajectory decoder on top of the HiVT scene encoder and report results on the Argoverse dataset. Our experiments show that PBP achieves competitive performance on the standard trajectory prediction metrics, while significantly outperforming state-of-the-art baselines in terms of map compliance.
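
Predicting in a path-relative Frenet frame means expressing an agent position as arc length s along a reference path plus a signed lateral offset d. A simplified nearest-vertex projection sketch (a full implementation projects onto path segments):

```python
import numpy as np

def cartesian_to_frenet(point, path):
    """Project a 2D point onto a polyline reference path and return
    (s, d): arc length along the path and signed lateral offset."""
    diffs = path[1:] - path[:-1]
    seg_len = np.linalg.norm(diffs, axis=1)
    s_at = np.concatenate([[0.0], np.cumsum(seg_len)])
    i = int(np.argmin(np.linalg.norm(path - point, axis=1)))
    i = min(i, len(diffs) - 1)                       # clamp to last segment
    tangent = diffs[i] / seg_len[i]
    rel = point - path[i]
    s = s_at[i] + float(rel @ tangent)
    d = tangent[0] * rel[1] - tangent[1] * rel[0]    # 2D cross product
    return s, float(d)

path = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
print(cartesian_to_frenet(np.array([1.5, 0.3]), path))  # (1.5, 0.3)
```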

Label-efficient Contrastive Learning-based model for nuclei detection and classification in 3D Cardiovascular Immunofluorescent Images

  • paper_url: http://arxiv.org/abs/2309.03744
  • repo_url: None
  • paper_authors: Nazanin Moradinasab, Rebecca A. Deaton, Laura S. Shankman, Gary K. Owens, Donald E. Brown
  • for: Developing a Label-efficient Contrastive learning-based (LECL) model to detect and classify various types of nuclei in 3D immunofluorescent images.
  • methods: Devises an Extended Maximum Intensity Projection (EMIP) approach that fixes the problems of plain MIP across z-stacks (baseline MIP sketched below), and trains with a Supervised Contrastive Learning (SCL) approach for weakly supervised settings.
  • results: Experiments on cardiovascular datasets show the framework is effective and efficient at detecting and classifying nuclei in 3D immunofluorescent images.
    Abstract Recently, deep learning-based methods achieved promising performance in nuclei detection and classification applications. However, training deep learning-based methods requires a large amount of pixel-wise annotated data, which is time-consuming and labor-intensive, especially in 3D images. An alternative approach is to adapt weak-annotation methods, such as labeling each nucleus with a point, but this method does not extend from 2D histopathology images (for which it was originally developed) to 3D immunofluorescent images. The reason is that 3D images contain multiple channels (z-axis) for nuclei and different markers separately, which makes training using point annotations difficult. To address this challenge, we propose the Label-efficient Contrastive learning-based (LECL) model to detect and classify various types of nuclei in 3D immunofluorescent images. Previous methods use Maximum Intensity Projection (MIP) to convert immunofluorescent images with multiple slices to 2D images, which can cause signals from different z-stacks to falsely appear associated with each other. To overcome this, we devised an Extended Maximum Intensity Projection (EMIP) approach that addresses issues using MIP. Furthermore, we performed a Supervised Contrastive Learning (SCL) approach for weakly supervised settings. We conducted experiments on cardiovascular datasets and found that our proposed framework is effective and efficient in detecting and classifying various types of nuclei in 3D immunofluorescent images.
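
For context, standard maximum intensity projection simply collapses the z-stack with a max, which is exactly what lets signals from different z-slices falsely appear associated; the paper's EMIP is designed to avoid this. Baseline MIP in two lines:

```python
import numpy as np

def max_intensity_projection(volume):
    """Plain MIP: collapse a (Z, H, W) stack to 2D by taking the max
    along z; this is the baseline whose artifacts EMIP addresses."""
    return volume.max(axis=0)

stack = np.random.rand(20, 256, 256)          # 20 z-slices of one channel
print(max_intensity_projection(stack).shape)  # (256, 256)
```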

ClusterFusion: Leveraging Radar Spatial Features for Radar-Camera 3D Object Detection in Autonomous Vehicles

  • paper_url: http://arxiv.org/abs/2309.03734
  • repo_url: None
  • paper_authors: Irfan Tito Kurniawan, Bambang Riyanto Trilaksono
  • for: Investigating how to exploit the local spatial and point-wise features of radar point clouds to improve radar-camera 3D object detection.
  • methods: ClusterFusion clusters the radar point cloud, extracts features directly from the clusters (clustering stage sketched below), and then projects the features onto the image plane for cross-modal feature fusion; handcrafted, learning-based, and combined cluster-feature strategies are compared.
  • results: Achieves 48.7% nuScenes Detection Score (NDS) on the nuScenes test slice, state-of-the-art among radar-monocular camera 3D object detection methods, with the handcrafted feature strategy performing best.
    Abstract Thanks to the complementary nature of millimeter wave radar and camera, deep learning-based radar-camera 3D object detection methods may reliably produce accurate detections even in low-visibility conditions. This makes them preferable to use in autonomous vehicles' perception systems, especially as the combined cost of both sensors is cheaper than the cost of a lidar. Recent radar-camera methods commonly perform feature-level fusion which often involves projecting the radar points onto the same plane as the image features and fusing the extracted features from both modalities. While performing fusion on the image plane is generally simpler and faster, projecting radar points onto the image plane flattens the depth dimension of the point cloud which might lead to information loss and makes extracting the spatial features of the point cloud harder. We proposed ClusterFusion, an architecture that leverages the local spatial features of the radar point cloud by clustering the point cloud and performing feature extraction directly on the point cloud clusters before projecting the features onto the image plane. ClusterFusion achieved the state-of-the-art performance among all radar-monocular camera methods on the test slice of the nuScenes dataset with 48.7% nuScenes detection score (NDS). We also investigated the performance of different radar feature extraction strategies on point cloud clusters: a handcrafted strategy, a learning-based strategy, and a combination of both, and found that the handcrafted strategy yielded the best performance. The main goal of this work is to explore the use of radar's local spatial and point-wise features by extracting them directly from radar point cloud clusters for a radar-monocular camera 3D object detection method that performs cross-modal feature fusion on the image plane.
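
The clustering stage can be sketched with an off-the-shelf algorithm; DBSCAN is an illustrative choice here, not necessarily the paper's, and the per-cluster features shown are minimal placeholders.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_radar_points(points, eps=2.0, min_samples=1):
    """Group radar points into clusters and compute simple per-cluster
    features that a feature extractor could consume before the features
    are projected onto the image plane. points: (N, F) with xyz first."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points[:, :3])
    clusters = []
    for k in set(labels.tolist()):
        if k == -1:
            continue                      # DBSCAN noise label
        members = points[labels == k]
        clusters.append({"centroid": members[:, :3].mean(axis=0),
                         "num_points": len(members),
                         "points": members})
    return clusters
```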

Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption

  • paper_url: http://arxiv.org/abs/2309.03729
  • repo_url: https://github.com/sjtuplayer/few-shot-diffusion
  • paper_authors: Teng Hu, Jiangning Zhang, Liang Liu, Ran Yi, Siqi Kou, Haokun Zhu, Xu Chen, Yabiao Wang, Chengjie Wang, Lizhuang Ma
  • for: Training generative diffusion models when samples are extremely scarce (fewer than 10), where networks tend to overfit and suffer content degradation.
  • methods: Proposes a phasic content fusing few-shot diffusion model with a directional distribution consistency loss: phasic content fusion targets content and style when the timestep t is large and local details of the target domain when t is small (caricatured below), while a cross-domain structure guidance strategy enhances structure consistency during domain adaptation.
  • results: Theoretical analysis plus qualitative and quantitative experiments show reduced content degradation and better structure consistency, outperforming state-of-the-art few-shot generative model adaptation methods.
    Abstract Training a generative model with limited number of samples is a challenging task. Current methods primarily rely on few-shot model adaption to train the network. However, in scenarios where data is extremely limited (less than 10), the generative network tends to overfit and suffers from content degradation. To address these problems, we propose a novel phasic content fusing few-shot diffusion model with directional distribution consistency loss, which targets different learning objectives at distinct training stages of the diffusion model. Specifically, we design a phasic training strategy with phasic content fusion to help our model learn content and style information when t is large, and learn local details of target domain when t is small, leading to an improvement in the capture of content, style and local details. Furthermore, we introduce a novel directional distribution consistency loss that ensures the consistency between the generated and source distributions more efficiently and stably than the prior methods, preventing our model from overfitting. Finally, we propose a cross-domain structure guidance strategy that enhances structure consistency during domain adaptation. Theoretical analysis, qualitative and quantitative experiments demonstrate the superiority of our approach in few-shot generative model adaption tasks compared to state-of-the-art methods. The source code is available at: https://github.com/sjtuplayer/few-shot-diffusion.
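
The phasic idea, supervising large diffusion timesteps toward content/style and small timesteps toward local detail, can be caricatured as a timestep-dependent loss mixture. The linear schedule below is an illustrative assumption, not the paper's fusion mechanism (which operates on content features).

```python
import torch

def phasic_loss(t, T, content_loss, detail_loss):
    """Mix a content/style objective (dominant at large, noisy t) with a
    local-detail objective (dominant at small t). All per-sample tensors."""
    w = t.float() / T                     # in [0, 1]; large t -> content
    return (w * content_loss + (1.0 - w) * detail_loss).mean()
```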

Interpretable Visual Question Answering via Reasoning Supervision

  • paper_url: http://arxiv.org/abs/2309.03726
  • repo_url: None
  • paper_authors: Maria Parelli, Dimitrios Mallis, Markos Diomataris, Vassilis Pitsikalis
  • for: Improving the visual grounding of Visual Question Answering models so they rely on the image rather than multimodal shortcuts and language biases.
  • methods: Uses common sense reasoning as a supervisory signal: textual justifications of the correct answer, already available in large-scale Visual Common Sense Reasoning (VCR) datasets, guide visual attention through a similarity loss that aligns the attention distributions induced by the question and by the correct reasoning (sketched below).
  • results: Quantitative and qualitative experiments show improved visual perception and performance, without training on explicit grounding annotations.
    Abstract Transformer-based architectures have recently demonstrated remarkable performance in the Visual Question Answering (VQA) task. However, such models are likely to disregard crucial visual cues and often rely on multimodal shortcuts and inherent biases of the language modality to predict the correct answer, a phenomenon commonly referred to as lack of visual grounding. In this work, we alleviate this shortcoming through a novel architecture for visual question answering that leverages common sense reasoning as a supervisory signal. Reasoning supervision takes the form of a textual justification of the correct answer, with such annotations being already available on large-scale Visual Common Sense Reasoning (VCR) datasets. The model's visual attention is guided toward important elements of the scene through a similarity loss that aligns the learned attention distributions guided by the question and the correct reasoning. We demonstrate both quantitatively and qualitatively that the proposed approach can boost the model's visual perception capability and lead to performance increase, without requiring training on explicit grounding annotations.
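
The alignment term can be sketched as a similarity loss between two attention distributions over image regions; cosine similarity is an illustrative choice, and the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def attention_alignment_loss(attn_question, attn_reasoning, eps=1e-8):
    """Pull the attention induced by the question toward the attention
    induced by the correct textual reasoning.
    attn_*: (B, N) attention weights over N visual regions."""
    sim = F.cosine_similarity(attn_question + eps, attn_reasoning + eps, dim=1)
    return (1.0 - sim).mean()             # 0 when the two maps align
```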

A boundary-aware point clustering approach in Euclidean and embedding spaces for roof plane segmentation

  • paper_url: http://arxiv.org/abs/2309.03722
  • repo_url: None
  • paper_authors: Li Li, Qingqing Li, Guozheng Xu, Pengwei Zhou, Jingmin Tu, Jie Li, Jian Yao
  • for: Improving the accuracy of roof plane segmentation from airborne LiDAR point clouds, toward more precise 3D building model reconstruction.
  • methods: A boundary-aware point clustering approach with a three-branch network: one branch classifies points as non-roof, boundary, or plane; one predicts per-point offsets toward the respective instance centers; and one constrains points of the same plane instance to have similar embeddings (pull term sketched below). Plane points are clustered in both Euclidean and embedding spaces, then the remaining boundary points are assigned to their closest clusters to form complete roof planes.
  • results: On a synthetic and a real dataset, the method significantly outperforms existing state-of-the-art approaches.
    Abstract Roof plane segmentation from airborne LiDAR point clouds is an important technology for 3D building model reconstruction. One of the key issues of plane segmentation is how to design powerful features that can exactly distinguish adjacent planar patches. The quality of point feature directly determines the accuracy of roof plane segmentation. Most of existing approaches use handcrafted features to extract roof planes. However, the abilities of these features are relatively low, especially in boundary area. To solve this problem, we propose a boundary-aware point clustering approach in Euclidean and embedding spaces constructed by a multi-task deep network for roof plane segmentation. We design a three-branch network to predict semantic labels, point offsets and extract deep embedding features. In the first branch, we classify the input data as non-roof, boundary and plane points. In the second branch, we predict point offsets for shifting each point toward its respective instance center. In the third branch, we constrain that points of the same plane instance should have the similar embeddings. We aim to ensure that points of the same plane instance are close as much as possible in both Euclidean and embedding spaces. However, although deep network has strong feature representative ability, it is still hard to accurately distinguish points near plane instance boundary. Therefore, we first group plane points into many clusters in the two spaces, and then we assign the rest boundary points to their closest clusters to generate final complete roof planes. In this way, we can effectively reduce the influence of unreliable boundary points. In addition, we construct a synthetic dataset and a real dataset to train and evaluate our approach. The experiments results show that the proposed approach significantly outperforms the existing state-of-the-art approaches.
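
The third branch's constraint can be sketched as a variance-style pull term that draws each point's embedding toward its plane instance's mean; the paper's exact loss may differ.

```python
import torch

def embedding_pull_loss(embeddings, instance_ids):
    """embeddings: (N, D) per-point embeddings; instance_ids: (N,) plane
    instance labels. Penalizes embedding spread within each instance."""
    loss, count = 0.0, 0
    for k in torch.unique(instance_ids):
        members = embeddings[instance_ids == k]
        center = members.mean(dim=0, keepdim=True)
        loss = loss + ((members - center) ** 2).sum(dim=1).mean()
        count += 1
    return loss / max(count, 1)
```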

DiffDefense: Defending against Adversarial Attacks via Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.03702
  • repo_url: https://github.com/hondamunigeprasannasilva/diffdefence
  • paper_authors: Hondamunige Prasanna Silva, Lorenzo Seidenari, Alberto Del Bimbo
  • for: Protecting machine learning classifiers against adversarial attacks without modifying the classifiers themselves.
  • methods: A reconstruction method that leverages Diffusion Models to purify adversarial inputs before classification (a generic purification wrapper is sketched below).
  • results: Provides robustness against adversarial threats while preserving clean accuracy, speed, and plug-and-play compatibility.
    Abstract This paper presents a novel reconstruction method that leverages Diffusion Models to protect machine learning classifiers against adversarial attacks, all without requiring any modifications to the classifiers themselves. The susceptibility of machine learning models to minor input perturbations renders them vulnerable to adversarial attacks. While diffusion-based methods are typically disregarded for adversarial defense due to their slow reverse process, this paper demonstrates that our proposed method offers robustness against adversarial threats while preserving clean accuracy, speed, and plug-and-play compatibility. Code at: https://github.com/HondamunigePrasannaSilva/DiffDefence.
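
A generic diffusion-purification wrapper conveys the plug-and-play idea: reconstruct the input with the diffusion model, then feed the untouched classifier. `q_sample`/`denoise_from` are hypothetical method names, and DiffDefense's actual reconstruction procedure may differ from this noise-and-denoise sketch.

```python
import torch

def diffusion_purify(x_adv, diffusion_model, t_star=100):
    """Partially noise a (possibly adversarial) input, then run the
    reverse process to obtain a cleaned reconstruction."""
    noise = torch.randn_like(x_adv)
    x_t = diffusion_model.q_sample(x_adv, t=t_star, noise=noise)  # forward noising
    return diffusion_model.denoise_from(x_t, t=t_star)            # reverse process

# Plug-and-play usage (the classifier stays unmodified):
# logits = classifier(diffusion_purify(x_adv, diffusion_model))
```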

Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory

  • paper_url: http://arxiv.org/abs/2309.03696
  • repo_url: https://github.com/ltttpku/ada-cm
  • paper_authors: Ting Lei, Fabian Caba, Qingchao Chen, Hailin Jin, Yuxin Peng, Yang Liu
  • for: Proposing an efficient and accurate Human-Object Interaction (HOI) detector, addressing the performance drop on rare classes and the cost of training on long-tailed HOI distributions in realistic scenes.
  • methods: Builds the Adaptive HOI Detector with Concept-guided Memory (ADA-CM) on large Vision-Language Models (VLMs), with two operating modes: a training-free mode that learns no new parameters, and an instance-aware adapter mode that updates a lightweight set of parameters for a further boost.
  • results: Achieves results competitive with the state-of-the-art on HICO-DET and V-COCO with much less training time.
    Abstract Human Object Interaction (HOI) detection aims to localize and infer the relationships between a human and an object. Arguably, training supervised models for this task from scratch presents challenges due to the performance drop over rare classes and the high computational cost and time required to handle long-tailed distributions of HOIs in complex HOI scenes in realistic settings. This observation motivates us to design an HOI detector that can be trained even with long-tailed labeled data and can leverage existing knowledge from pre-trained models. Inspired by the powerful generalization ability of the large Vision-Language Models (VLM) on classification and retrieval tasks, we propose an efficient Adaptive HOI Detector with Concept-guided Memory (ADA-CM). ADA-CM has two operating modes. The first mode makes it tunable without learning new parameters in a training-free paradigm. Its second mode incorporates an instance-aware adapter mechanism that can further efficiently boost performance if updating a lightweight set of parameters can be afforded. Our proposed method achieves competitive results with state-of-the-art on the HICO-DET and V-COCO datasets with much less training time. Code can be found at https://github.com/ltttpku/ADA-CM.

MS-UNet-v2: Adaptive Denoising Method and Training Strategy for Medical Image Segmentation with Small Training Data

  • paper_url: http://arxiv.org/abs/2309.03686
  • repo_url: None
  • paper_authors: Haoyuan Chen, Yufei Han, Pin Xu, Yanyi Li, Kuan Li, Jianping Yin
  • for: Improving medical image segmentation, where the single-layer U-Net decoder is too "thin" to exploit enough information, especially when annotated training data are scarce.
  • methods: Proposes MS-UNet with a multi-scale nested decoder based on the Swin Transformer, bringing decoder and encoder feature mappings semantically closer; additionally proposes an edge loss (sketched below) and a plug-and-play fine-tuning denoising module, both applicable to other models individually.
  • results: MS-UNet learns features more efficiently and performs better, especially in the extreme case of very little training data, and the proposed edge loss and denoising module significantly enhance its segmentation performance.
    Abstract Models based on U-like structures have improved the performance of medical image segmentation. However, the single-layer decoder structure of U-Net is too "thin" to exploit enough information, resulting in large semantic differences between the encoder and decoder parts. Things get worse if the number of training sets of data is not sufficiently large, which is common in medical image processing tasks where annotated data are more difficult to obtain than other tasks. Based on this observation, we propose a novel U-Net model named MS-UNet for the medical image segmentation task in this study. Instead of the single-layer U-Net decoder structure used in Swin-UNet and TransUnet, we specifically design a multi-scale nested decoder based on the Swin Transformer for U-Net. The proposed multi-scale nested decoder structure allows the feature mapping between the decoder and encoder to be semantically closer, thus enabling the network to learn more detailed features. In addition, we propose a novel edge loss and a plug-and-play fine-tuning Denoising module, which not only effectively improves the segmentation performance of MS-UNet, but could also be applied to other models individually. Experimental results show that MS-UNet could effectively improve the network performance with more efficient feature learning capability and exhibit more advanced performance, especially in the extreme case with a small amount of training data, and the proposed Edge loss and Denoising module could significantly enhance the segmentation performance of MS-UNet.
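
An edge loss of the kind mentioned above can be sketched by comparing Sobel gradient magnitudes of the predicted and ground-truth masks; the paper's formulation may differ.

```python
import torch
import torch.nn.functional as F

def edge_loss(pred, target):
    """Penalize boundary disagreement between soft masks.
    pred, target: (B, 1, H, W)."""
    kx = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]]])
    ky = kx.transpose(2, 3)               # Sobel y kernel

    def grad_mag(x):
        gx = F.conv2d(x, kx.to(x.device), padding=1)
        gy = F.conv2d(x, ky.to(x.device), padding=1)
        return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

    return F.l1_loss(grad_mag(pred), grad_mag(target))
```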

Prompt-based Context- and Domain-aware Pretraining for Vision and Language Navigation

  • paper_url: http://arxiv.org/abs/2309.03661
  • repo_url: None
  • paper_authors: Ting Liu, Wansen Wu, Yue Hu, Youkai Wang, Kai Xu, Quanjun Yin
  • for: Improving cross-modal alignment and adaptability in Vision-and-Language Navigation (VLN), where models pretrained on web-crawled general-purpose data face a considerable domain gap and must align modalities sequentially along a trajectory.
  • methods: Proposes the Prompt-bAsed coNtext- and Domain-Aware (PANDA) pretraining framework with two stages: a domain-aware stage that learns soft visual prompts from an in-domain dataset via low-cost prompt tuning (sketched below), and a context-aware stage with hard context prompts that capture sequence-level semantics, further tuned via contrastive learning.
  • results: Experiments on R2R and REVERIE show clear advantages over previous state-of-the-art methods.
    Abstract With strong representation capabilities, pretrained vision-language models are widely used in vision and language navigation (VLN). However, most of them are trained on web-crawled general-purpose datasets, which incurs a considerable domain gap when used for VLN tasks. Another challenge for VLN is how the agent understands the contextual relations between actions on a trajectory and performs cross-modal alignment sequentially. In this paper, we propose a novel Prompt-bAsed coNtext- and Domain-Aware (PANDA) pretraining framework to address these problems. It performs prompting in two stages. In the domain-aware stage, we apply a low-cost prompt tuning paradigm to learn soft visual prompts from an in-domain dataset for equipping the pretrained models with object-level and scene-level cross-modal alignment in VLN tasks. Furthermore, in the context-aware stage, we design a set of hard context prompts to capture the sequence-level semantics and instill both out-of-context and contextual knowledge in the instruction into cross-modal representations. They enable further tuning of the pretrained models via contrastive learning. Experimental results on both R2R and REVERIE show the superiority of PANDA compared to previous state-of-the-art methods.
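
Soft visual prompts are typically a handful of learnable tokens prepended to a frozen encoder's tokens, with only the prompts updated during the low-cost tuning stage. Shapes and names below are assumptions:

```python
import torch
import torch.nn as nn

class SoftVisualPrompt(nn.Module):
    """Learnable prompt tokens prepended to visual tokens from a frozen
    encoder; only these parameters are tuned on in-domain data."""
    def __init__(self, num_prompts=8, dim=768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, visual_tokens):
        # visual_tokens: (B, N, D)
        B = visual_tokens.size(0)
        p = self.prompts.unsqueeze(0).expand(B, -1, -1)
        return torch.cat([p, visual_tokens], dim=1)   # (B, P+N, D)
```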

Spiking Structured State Space Model for Monaural Speech Enhancement

  • paper_url: http://arxiv.org/abs/2309.03641
  • repo_url: None
  • paper_authors: Yu Du, Xu Liu, Yansong Chua
  • for: Improving speech enhancement while reducing computational cost with the Spiking Structured State Space Model (Spiking-S4).
  • methods: Merges the energy efficiency of Spiking Neural Networks (SNNs) with the long-range sequence modeling capability of Structured State Space Models (S4); a generic spiking unit is sketched below.
  • results: Rivals existing Artificial Neural Network (ANN) methods on the DNS Challenge and VoiceBank+Demand datasets with fewer parameters and floating point operations (FLOPs).
    Abstract Speech enhancement seeks to extract clean speech from noisy signals. Traditional deep learning methods face two challenges: efficiently using information in long speech sequences and high computational costs. To address these, we introduce the Spiking Structured State Space Model (Spiking-S4). This approach merges the energy efficiency of Spiking Neural Networks (SNN) with the long-range sequence modeling capabilities of Structured State Space Models (S4), offering a compelling solution. Evaluation on the DNS Challenge and VoiceBank+Demand Datasets confirms that Spiking-S4 rivals existing Artificial Neural Network (ANN) methods but with fewer computational resources, as evidenced by reduced parameters and Floating Point Operations (FLOPs).
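
The energy savings of SNNs come from binary spike activations such as the textbook leaky integrate-and-fire unit below; this is a generic LIF sketch, not the paper's neuron model.

```python
import torch

def lif_step(v, x, tau=2.0, v_th=1.0):
    """One LIF update. v: membrane potential, x: input current.
    Returns the binary spike map and the updated potential."""
    v = v + (x - v) / tau                 # leaky integration
    spikes = (v >= v_th).float()          # fire where threshold is crossed
    v = v * (1.0 - spikes)                # hard reset after a spike
    return spikes, v
```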

Context-Aware 3D Object Localization from Single Calibrated Images: A Study of Basketballs

  • paper_url: http://arxiv.org/abs/2309.03640
  • repo_url: https://github.com/gabriel-vanzandycke/deepsport
  • paper_authors: Marcello Davide Caio, Gabriel Van Zandycke, Christophe De Vleeschouwer
  • for: This paper is written for the task of 3D localization of objects in computer vision applications, specifically for basketball localization from a single calibrated image.
  • methods: The method used in this paper is to predict the object’s height in pixels in image space by estimating its projection onto the ground plane within the image, leveraging the image itself and the object’s location as inputs.
  • results: The paper demonstrates substantial accuracy improvements compared to recent work, offering effective 3D ball tracking and understanding. The source code is made publicly available at \url{https://github.com/gabriel-vanzandycke/deepsport}.
    Abstract Accurately localizing objects in three dimensions (3D) is crucial for various computer vision applications, such as robotics, autonomous driving, and augmented reality. This task finds another important application in sports analytics and, in this work, we present a novel method for 3D basketball localization from a single calibrated image. Our approach predicts the object's height in pixels in image space by estimating its projection onto the ground plane within the image, leveraging the image itself and the object's location as inputs. The 3D coordinates of the ball are then reconstructed by exploiting the known projection matrix. Extensive experiments on the public DeepSport dataset, which provides ground truth annotations for 3D ball location alongside camera calibration information for each image, demonstrate the effectiveness of our method, offering substantial accuracy improvements compared to recent work. Our work opens up new possibilities for enhanced ball tracking and understanding, advancing computer vision in diverse domains. The source code of this work is made publicly available at \url{https://github.com/gabriel-vanzandycke/deepsport}.
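
Given the calibrated 3x4 projection matrix P, a pixel (u, v) plus a height h above the ground plane determine the 3D point via a small linear solve. This sketch assumes a metric height h is available (the paper instead predicts the ball's height in pixels via its ground-plane projection, but the back-projection principle is the same):

```python
import numpy as np

def backproject_on_height_plane(u, v, h, P):
    """Solve lambda * [u, v, 1]^T = P @ [X, Y, h, 1]^T for (X, Y, lambda)
    and return the 3D point (X, Y, h)."""
    A = np.zeros((3, 3))
    A[:, 0] = P[:, 0]                     # coefficient of X
    A[:, 1] = P[:, 1]                     # coefficient of Y
    A[:, 2] = -np.array([u, v, 1.0])      # coefficient of lambda
    b = -(P[:, 2] * h + P[:, 3])
    X, Y, _lam = np.linalg.solve(A, b)
    return np.array([X, Y, h])
```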

Chasing Consistency in Text-to-3D Generation from a Single Image

  • paper_url: http://arxiv.org/abs/2309.03599
  • repo_url: None
  • paper_authors: Yichen Ouyang, Wenhao Chai, Jiayi Ye, Dapeng Tao, Yibing Zhan, Gaoang Wang
  • for: Addresses the inconsistency issues in text-to-3D generation from a single image, including semantic inconsistency, geometric inconsistency, and saturation inconsistency.
  • methods: A three-stage framework consisting of a semantic encoding stage, a geometric encoding stage, and an optimization stage; the first two stages learn parameterized consistency tokens that improve the consistency and reliability of text-to-3D generation.
  • results: Experiments show that, compared with previous state-of-the-art methods, Consist3D generates more consistent, faithful, and photo-realistic 3D assets, while also enabling background and object editing through text prompts.
    Abstract Text-to-3D generation from a single-view image is a popular but challenging task in 3D vision. Although numerous methods have been proposed, existing works still suffer from the inconsistency issues, including 1) semantic inconsistency, 2) geometric inconsistency, and 3) saturation inconsistency, resulting in distorted, overfitted, and over-saturated generations. In light of the above issues, we present Consist3D, a three-stage framework Chasing for semantic-, geometric-, and saturation-Consistent Text-to-3D generation from a single image, in which the first two stages aim to learn parameterized consistency tokens, and the last stage is for optimization. Specifically, the semantic encoding stage learns a token independent of views and estimations, promoting semantic consistency and robustness. Meanwhile, the geometric encoding stage learns another token with comprehensive geometry and reconstruction constraints under novel-view estimations, reducing overfitting and encouraging geometric consistency. Finally, the optimization stage benefits from the semantic and geometric tokens, allowing a low classifier-free guidance scale and therefore preventing oversaturation. Experimental results demonstrate that Consist3D produces more consistent, faithful, and photo-realistic 3D assets compared to previous state-of-the-art methods. Furthermore, Consist3D also allows background and object editing through text prompts.

Enhancing Sample Utilization through Sample Adaptive Augmentation in Semi-Supervised Learning

  • paper_url: http://arxiv.org/abs/2309.03598
  • repo_url: https://github.com/guangui-nju/saa
  • paper_authors: Guan Gui, Zhen Zhao, Lei Qi, Luping Zhou, Lei Wang, Yinghuan Shi
  • for: Improving sample utilization, and thereby model performance, in semi-supervised learning.
  • methods: Sample adaptive augmentation (SAA), comprising a sample selection module and a sample augmentation module, which adapts augmentation to the needs of individual samples.
  • results: SAA significantly improves the accuracy of FixMatch and FlexMatch; for example, on CIFAR-10 it raises FixMatch from 92.50% to 94.76% and FlexMatch from 95.01% to 95.31%.
    Abstract In semi-supervised learning, unlabeled samples can be utilized through augmentation and consistency regularization. However, we observed certain samples, even undergoing strong augmentation, are still correctly classified with high confidence, resulting in a loss close to zero. It indicates that these samples have been already learned well and do not provide any additional optimization benefits to the model. We refer to these samples as ``naive samples". Unfortunately, existing SSL models overlook the characteristics of naive samples, and they just apply the same learning strategy to all samples. To further optimize the SSL model, we emphasize the importance of giving attention to naive samples and augmenting them in a more diverse manner. Sample adaptive augmentation (SAA) is proposed for this stated purpose and consists of two modules: 1) sample selection module; 2) sample augmentation module. Specifically, the sample selection module picks out {naive samples} based on historical training information at each epoch, then the naive samples will be augmented in a more diverse manner in the sample augmentation module. Thanks to the extreme ease of implementation of the above modules, SAA is advantageous for being simple and lightweight. We add SAA on top of FixMatch and FlexMatch respectively, and experiments demonstrate SAA can significantly improve the models. For example, SAA helped improve the accuracy of FixMatch from 92.50% to 94.76% and that of FlexMatch from 95.01% to 95.31% on CIFAR-10 with 40 labels.
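One plausible way to realize the sample selection module is to track a running average of each unlabeled sample's loss across epochs and flag near-zero-loss samples as "naive"; the EMA criterion and threshold below are assumptions for illustration, not necessarily the paper's exact rule.

```python
import numpy as np

class NaiveSampleSelector:
    """Track an exponential moving average of each unlabeled sample's loss;
    samples whose loss stays near zero are flagged as 'naive' and routed to
    a more diverse augmentation policy."""

    def __init__(self, num_samples, momentum=0.9, threshold=0.05):
        self.ema_loss = np.full(num_samples, np.inf)
        self.momentum, self.threshold = momentum, threshold

    def update(self, indices, losses):
        old = self.ema_loss[indices]
        fresh = np.isinf(old)                      # first time we see a sample
        self.ema_loss[indices] = np.where(
            fresh, losses, self.momentum * old + (1 - self.momentum) * losses)

    def is_naive(self, indices):
        return self.ema_loss[indices] < self.threshold

selector = NaiveSampleSelector(num_samples=50000)
# Inside the training loop (illustrative batch):
batch_idx = np.array([3, 17, 42])
batch_loss = np.array([0.01, 0.80, 0.02])          # per-sample unsupervised loss
selector.update(batch_idx, batch_loss)
for i, naive in zip(batch_idx, selector.is_naive(batch_idx)):
    aug = "strong + more diverse" if naive else "standard strong"
    print(f"sample {i}: {aug} augmentation")
```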

DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions

  • paper_url: http://arxiv.org/abs/2309.03576
  • repo_url: https://github.com/haochen-wang409/droppos
  • paper_authors: Haochen Wang, Junsong Fan, Yuxi Wang, Kaiyou Song, Tong Wang, Zhaoxiang Zhang
  • for: Improving the location awareness of Vision Transformers (ViTs).
  • methods: DropPos, a self-supervised pretext task that randomly drops positional embeddings and trains the model to reconstruct the position of each visible patch, strengthening its positional understanding.
  • results: DropPos outperforms supervised pre-training and achieves competitive results against state-of-the-art self-supervised methods on a wide range of downstream tasks, indicating that it does improve the location awareness of ViTs.
    Abstract As it is empirically observed that Vision Transformers (ViTs) are quite insensitive to the order of input tokens, the need for an appropriate self-supervised pretext task that enhances the location awareness of ViTs is becoming evident. To address this, we present DropPos, a novel pretext task designed to reconstruct Dropped Positions. The formulation of DropPos is simple: we first drop a large random subset of positional embeddings and then the model classifies the actual position for each non-overlapping patch among all possible positions solely based on their visual appearance. To avoid trivial solutions, we increase the difficulty of this task by keeping only a subset of patches visible. Additionally, considering there may be different patches with similar visual appearances, we propose position smoothing and attentive reconstruction strategies to relax this classification problem, since it is not necessary to reconstruct their exact positions in these cases. Empirical evaluations of DropPos show strong capabilities. DropPos outperforms supervised pre-training and achieves competitive results compared with state-of-the-art self-supervised alternatives on a wide range of downstream benchmarks. This suggests that explicitly encouraging spatial reasoning abilities, as DropPos does, indeed contributes to the improved location awareness of ViTs. The code is publicly available at https://github.com/Haochen-Wang409/DropPos.
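A sketch of how one DropPos training instance might be constructed: choose the visible patches, drop positional embeddings for most of them, and build smoothed position targets so visually similar neighboring positions are not penalized as hard errors. The Gaussian form of the position smoothing and all ratios are illustrative assumptions.

```python
import numpy as np

def droppos_targets(num_patches, keep_ratio=0.25, vis_ratio=0.75,
                    sigma=1.0, grid=14, rng=np.random.default_rng(0)):
    """Build one DropPos instance: which patches are visible, which of those
    keep their positional embedding, and a smoothed classification target
    over all positions for each dropped patch."""
    vis = rng.choice(num_patches, int(vis_ratio * num_patches), replace=False)
    keep = rng.choice(vis, int(keep_ratio * len(vis)), replace=False)
    dropped = np.setdiff1d(vis, keep)              # positions to reconstruct

    # Position smoothing: soft targets that decay with 2-D grid distance.
    rows, cols = np.divmod(np.arange(num_patches), grid)
    targets = np.empty((len(dropped), num_patches))
    for i, p in enumerate(dropped):
        d2 = (rows - rows[p]) ** 2 + (cols - cols[p]) ** 2
        w = np.exp(-d2 / (2 * sigma ** 2))
        targets[i] = w / w.sum()                   # soft label over positions
    return vis, keep, dropped, targets

vis, keep, dropped, targets = droppos_targets(num_patches=196)
print(len(vis), "visible,", len(keep), "keep pos-emb,", len(dropped), "to classify")
```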

Toward High Quality Facial Representation Learning

  • paper_url: http://arxiv.org/abs/2309.03575
  • repo_url: https://github.com/nomewang/mcf
  • paper_authors: Yue Wang, Jinlong Peng, Jiangning Zhang, Ran Yi, Liang Liu, Yabiao Wang, Chengjie Wang
  • for: Improving the performance of face analysis tasks, in particular the quality of facial representations.
  • methods: Self-supervised pre-training with Mask Contrastive Face (MCF), combining mask image modeling with a contrastive strategy adapted to face-domain tasks.
  • results: Strong performance on multiple downstream tasks, including AFLW-19 face alignment and LaPa face parsing, advancing the state of the art in facial representation quality.
    Abstract Face analysis tasks have a wide range of applications, but the universal facial representation has only been explored in a few works. In this paper, we explore high-performance pre-training methods to boost the face analysis tasks such as face alignment and face parsing. We propose a self-supervised pre-training framework, called \textbf{\it Mask Contrastive Face (MCF)}, with mask image modeling and a contrastive strategy specially adjusted for face domain tasks. To improve the facial representation quality, we use feature map of a pre-trained visual backbone as a supervision item and use a partially pre-trained decoder for mask image modeling. To handle the face identity during the pre-training stage, we further use random masks to build contrastive learning pairs. We conduct the pre-training on the LAION-FACE-cropped dataset, a variants of LAION-FACE 20M, which contains more than 20 million face images from Internet websites. For efficiency pre-training, we explore our framework pre-training performance on a small part of LAION-FACE-cropped and verify the superiority with different pre-training settings. Our model pre-trained with the full pre-training dataset outperforms the state-of-the-art methods on multiple downstream tasks. Our model achieves 0.932 NME$_{diag}$ for AFLW-19 face alignment and 93.96 F1 score for LaPa face parsing. Code is available at https://github.com/nomewang/MCF.

Sparse Federated Training of Object Detection in the Internet of Vehicles

  • paper_url: http://arxiv.org/abs/2309.03569
  • repo_url: None
  • paper_authors: Luping Rao, Chuan Ma, Ming Ding, Yuwen Qian, Lu Zhou, Zhe Liu
  • for: Improving vehicle detection accuracy in the Internet of Vehicles (IoV) while reducing communication overheads.
  • methods: A federated learning-based approach in which well-trained local models are shared with a central server, combined with sparse training on edge devices.
  • results: Experimental results show that the proposed scheme achieves the required vehicle detection rate while saving considerable communication costs.
    Abstract As an essential component part of the Intelligent Transportation System (ITS), the Internet of Vehicles (IoV) plays a vital role in alleviating traffic issues. Object detection is one of the key technologies in the IoV, which has been widely used to provide traffic management services by analyzing timely and sensitive vehicle-related information. However, the current object detection methods are mostly based on centralized deep training, that is, the sensitive data obtained by edge devices need to be uploaded to the server, which raises privacy concerns. To mitigate such privacy leakage, we first propose a federated learning-based framework, where well-trained local models are shared in the central server. However, since edge devices usually have limited computing power, plus a strict requirement of low latency in IoVs, we further propose a sparse training process on edge devices, which can effectively lighten the model, and ensure its training efficiency on edge devices, thereby reducing communication overheads. In addition, due to the diverse computing capabilities and dynamic environment, different sparsity rates are applied to edge devices. To further guarantee the performance, we propose, FedWeg, an improved aggregation scheme based on FedAvg, which is designed by the inverse ratio of sparsity rates. Experiments on the real-life dataset using YOLO show that the proposed scheme can achieve the required object detection rate while saving considerable communication costs.
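The abstract only says FedWeg is "designed by the inverse ratio of sparsity rates"; one plausible reading, sketched below, is a FedAvg variant whose mixing coefficients are the normalized inverse sparsity rates, so denser local models contribute more. Client weights and sparsity budgets are made up for illustration.

```python
import numpy as np

def fedweg_aggregate(client_weights, sparsity_rates):
    """FedAvg-style aggregation with each client's contribution scaled by the
    inverse of its sparsity rate, normalized to sum to one. One plausible
    interpretation of FedWeg, not the paper's verified formula."""
    inv = 1.0 / np.asarray(sparsity_rates)
    coeff = inv / inv.sum()
    agg = {k: np.zeros_like(v) for k, v in client_weights[0].items()}
    for c, w in zip(coeff, client_weights):
        for k, v in w.items():
            agg[k] += c * v                        # pruned entries are zeros
    return agg, coeff

# Three edge devices with different sparsity budgets (illustrative)
rng = np.random.default_rng(1)
clients = [{"conv1": rng.standard_normal((3, 3)) * (rng.random((3, 3)) > s)}
           for s in (0.3, 0.5, 0.7)]
agg, coeff = fedweg_aggregate(clients, sparsity_rates=[0.3, 0.5, 0.7])
print("aggregation coefficients:", coeff.round(3))
```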

Region Generation and Assessment Network for Occluded Person Re-Identification

  • paper_url: http://arxiv.org/abs/2309.03558
  • repo_url: None
  • paper_authors: Shuting He, Weihua Chen, Kai Wang, Hao Luo, Fan Wang, Wei Jiang, Henghui Ding
  • for: Person re-identification (ReID), in particular addressing the challenges of misalignment and occlusions.
  • methods: A Region Generation and Assessment Network (RGANet), comprising a Region Generation Module (RGM) and a Region Assessment Module (RAM), to effectively detect human body regions and highlight the important ones.
  • results: Extensive experimental results on six widely used benchmarks demonstrate the superiority of RGANet over competing methods.
    Abstract Person Re-identification (ReID) plays a more and more crucial role in recent years with a wide range of applications. Existing ReID methods are suffering from the challenges of misalignment and occlusions, which degrade the performance dramatically. Most methods tackle such challenges by utilizing external tools to locate body parts or exploiting matching strategies. Nevertheless, the inevitable domain gap between the datasets utilized for external tools and the ReID datasets and the complicated matching process make these methods unreliable and sensitive to noises. In this paper, we propose a Region Generation and Assessment Network (RGANet) to effectively and efficiently detect the human body regions and highlight the important regions. In the proposed RGANet, we first devise a Region Generation Module (RGM) which utilizes the pre-trained CLIP to locate the human body regions using semantic prototypes extracted from text descriptions. Learnable prompt is designed to eliminate domain gap between CLIP datasets and ReID datasets. Then, to measure the importance of each generated region, we introduce a Region Assessment Module (RAM) that assigns confidence scores to different regions and reduces the negative impact of the occlusion regions by lower scores. The RAM consists of a discrimination-aware indicator and an invariance-aware indicator, where the former indicates the capability to distinguish from different identities and the latter represents consistency among the images of the same class of human body regions. Extensive experimental results for six widely-used benchmarks including three tasks (occluded, partial, and holistic) demonstrate the superiority of RGANet against state-of-the-art methods.

Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model

  • paper_url: http://arxiv.org/abs/2309.03550
  • repo_url: https://github.com/deepshwang/text2control3d
  • paper_authors: Sungwon Hwang, Junha Hyung, Jaegul Choo
  • for: Controllable text-to-3D avatar generation, where facial expression and appearance can be controlled given viewpoint-controlled images from a casually captured monocular video.
  • methods: Extends ControlNet and builds the 3D avatar in Neural Radiance Fields (NeRF); when generating viewpoint-aware images, cross-reference attention injects controllable expression and appearance, and low-pass filtering of the Gaussian latent mitigates the viewpoint-agnostic texture problem.
  • results: Generates high-quality 3D avatars with controllable facial expression and appearance that remain consistent across viewpoints.
    Abstract Recent advances in diffusion models such as ControlNet have enabled geometrically controllable, high-fidelity text-to-image generation. However, none of them addresses the question of adding such controllability to text-to-3D generation. In response, we propose Text2Control3D, a controllable text-to-3D avatar generation method whose facial expression is controllable given a monocular video casually captured with hand-held camera. Our main strategy is to construct the 3D avatar in Neural Radiance Fields (NeRF) optimized with a set of controlled viewpoint-aware images that we generate from ControlNet, whose condition input is the depth map extracted from the input video. When generating the viewpoint-aware images, we utilize cross-reference attention to inject well-controlled, referential facial expression and appearance via cross attention. We also conduct low-pass filtering of Gaussian latent of the diffusion model in order to ameliorate the viewpoint-agnostic texture problem we observed from our empirical analysis, where the viewpoint-aware images contain identical textures on identical pixel positions that are incomprehensible in 3D. Finally, to train NeRF with the images that are viewpoint-aware yet are not strictly consistent in geometry, our approach considers per-image geometric variation as a view of deformation from a shared 3D canonical space. Consequently, we construct the 3D avatar in a canonical space of deformable NeRF by learning a set of per-image deformation via deformation field table. We demonstrate the empirical results and discuss the effectiveness of our method.
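The low-pass filtering of the Gaussian latent can be sketched as an FFT-domain band crop; the square mask and the keep ratio below are illustrative choices, not the paper's exact filter.

```python
import torch

def lowpass_latent(latent, keep_ratio=0.5):
    """Suppress high spatial frequencies in a (B, C, H, W) Gaussian latent:
    shift the 2-D spectrum, zero everything outside a central square band,
    then transform back. Intended to damp viewpoint-agnostic texture detail."""
    B, C, H, W = latent.shape
    spec = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))
    mask = torch.zeros(H, W, device=latent.device)
    h, w = int(H * keep_ratio) // 2, int(W * keep_ratio) // 2
    mask[H // 2 - h: H // 2 + h, W // 2 - w: W // 2 + w] = 1.0
    spec = spec * mask
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real

z = torch.randn(1, 4, 64, 64)                      # typical latent-diffusion shape
z_lp = lowpass_latent(z, keep_ratio=0.5)
print(z.std().item(), "->", z_lp.std().item())     # variance drops with the band
```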

Trash to Treasure: Low-Light Object Detection via Decomposition-and-Aggregation

  • paper_url: http://arxiv.org/abs/2309.03548
  • repo_url: None
  • paper_authors: Xiaohan Cui, Long Ma, Tengyu Ma, Jinyuan Liu, Xin Fan, Risheng Liu
  • for: Improving object detection accuracy in low-light environments.
  • methods: An optimized enhancer + detector combination in which the illumination removed by the enhancer is repurposed as an auxiliary input to the detector for extracting detection-friendly features.
  • results: Achieves higher detection accuracy than other state-of-the-art methods.
    Abstract Object detection in low-light scenarios has attracted much attention in the past few years. A mainstream and representative scheme introduces enhancers as the pre-processing for regular detectors. However, because of the disparity in task objectives between the enhancer and detector, this paradigm cannot shine at its best ability. In this work, we try to arouse the potential of enhancer + detector. Different from existing works, we extend the illumination-based enhancers (our newly designed or existing) as a scene decomposition module, whose removed illumination is exploited as the auxiliary in the detector for extracting detection-friendly features. A semantic aggregation module is further established for integrating multi-scale scene-related semantic information in the context space. Actually, our built scheme successfully transforms the "trash" (i.e., the ignored illumination in the detector) into the "treasure" for the detector. Plenty of experiments are conducted to reveal our superiority against other state-of-the-art methods. The code will be public if it is accepted.

Zero-Shot Scene Graph Generation via Triplet Calibration and Reduction

  • paper_url: http://arxiv.org/abs/2309.03542
  • repo_url: https://github.com/jkli1998/T-CAR
  • paper_authors: Jiankai Li, Yunhong Wang, Weixin Li
  • for: Improving Scene Graph Generation (SGG) for downstream tasks, particularly zero-shot SGG.
  • methods: A Triplet Calibration and Reduction (T-CAR) framework, comprising a triplet calibration loss, an unseen space reduction loss, and a contextual encoder that improves generalization to unseen triplets.
  • results: Experiments show that the method achieves consistent improvements in zero-shot SGG, surpassing existing methods.
    Abstract Scene Graph Generation (SGG) plays a pivotal role in downstream vision-language tasks. Existing SGG methods typically suffer from poor compositional generalizations on unseen triplets. They are generally trained on incompletely annotated scene graphs that contain dominant triplets and tend to bias toward these seen triplets during inference. To address this issue, we propose a Triplet Calibration and Reduction (T-CAR) framework in this paper. In our framework, a triplet calibration loss is first presented to regularize the representations of diverse triplets and to simultaneously excavate the unseen triplets in incompletely annotated training scene graphs. Moreover, the unseen space of scene graphs is usually several times larger than the seen space since it contains a huge number of unrealistic compositions. Thus, we propose an unseen space reduction loss to shift the attention of excavation to reasonable unseen compositions to facilitate the model training. Finally, we propose a contextual encoder to improve the compositional generalizations of unseen triplets by explicitly modeling the relative spatial relations between subjects and objects. Extensive experiments show that our approach achieves consistent improvements for zero-shot SGG over state-of-the-art methods. The code is available at https://github.com/jkli1998/T-CAR.

YOLO series target detection algorithms for underwater environments

  • paper_url: http://arxiv.org/abs/2309.03539
  • repo_url: None
  • paper_authors: Chenjie Zhang, Pengcheng Jiao
  • for: marine engineering applications (such as underwater structural health monitoring and underwater biological detection)
  • methods: improved YOLO algorithm for underwater environments (addressing challenges such as dim light and turbid water)
  • results: potential for increased accuracy and efficiency in underwater applications, but still facing challenges and limitations.
    Abstract You Only Look Once (YOLO) algorithm is a representative target detection algorithm emerging in 2016, which is known for its balance of computing speed and accuracy, and now plays an important role in various fields of human production and life. However, there are still many limitations in the application of YOLO algorithm in underwater environments due to problems such as dim light and turbid water. With limited land area resources, the ocean must have great potential for future human development. In this paper, starting from the actual needs of marine engineering applications, taking underwater structural health monitoring (SHM) and underwater biological detection as examples, we propose improved methods for the application of underwater YOLO algorithms, and point out the problems that still exist.

Feature Enhancer Segmentation Network (FES-Net) for Vessel Segmentation

  • paper_url: http://arxiv.org/abs/2309.03535
  • repo_url: None
  • paper_authors: Tariq M. Khan, Muhammad Arsalan, Shahzaib Iqbal, Imran Razzak, Erik Meijering
  • for: Precise segmentation of retinal vessels, needed for tracking and diagnosing vision-threatening diseases such as diabetic retinopathy and age-related macular degeneration.
  • methods: A novel Feature Enhancer Segmentation Network (FES-Net) that processes the input image directly, without additional image enhancement steps, using four prompt convolutional blocks (PCBs) during downsampling to generate a binary mask for each class.
  • results: FES-Net performs strongly on four publicly available state-of-the-art datasets (DRIVE, STARE, CHASE, and HRF), clearly surpassing other competitive methods in the existing literature.
    Abstract Diseases such as diabetic retinopathy and age-related macular degeneration pose a significant risk to vision, highlighting the importance of precise segmentation of retinal vessels for the tracking and diagnosis of progression. However, existing vessel segmentation methods that heavily rely on encoder-decoder structures struggle to capture contextual information about retinal vessel configurations, leading to challenges in reconciling semantic disparities between encoder and decoder features. To address this, we propose a novel feature enhancement segmentation network (FES-Net) that achieves accurate pixel-wise segmentation without requiring additional image enhancement steps. FES-Net directly processes the input image and utilizes four prompt convolutional blocks (PCBs) during downsampling, complemented by a shallow upsampling approach to generate a binary mask for each class. We evaluate the performance of FES-Net on four publicly available state-of-the-art datasets: DRIVE, STARE, CHASE, and HRF. The evaluation results clearly demonstrate the superior performance of FES-Net compared to other competitive approaches documented in the existing literature.

A Robust Negative Learning Approach to Partial Domain Adaptation Using Source Prototypes

  • paper_url: http://arxiv.org/abs/2309.03531
  • repo_url: None
  • paper_authors: Sandipan Choudhuri, Suli Adeniye, Arunabha Sen
  • for: This paper proposes a robust Partial Domain Adaptation (PDA) framework to mitigate the negative transfer problem by incorporating a robust target-supervision strategy.
  • methods: The proposed framework leverages ensemble learning and includes diverse, complementary label feedback, alleviating the effect of incorrect feedback and promoting pseudo-label refinement. It optimizes intra-class compactness and inter-class separation with the inferred source prototypes and highly-confident target samples in a domain-invariant fashion.
  • results: The proposed framework demonstrates enhanced robustness and generalization in a range of partial domain adaptation tasks, outperforming existing state-of-the-art PDA approaches.
    Abstract This work proposes a robust Partial Domain Adaptation (PDA) framework that mitigates the negative transfer problem by incorporating a robust target-supervision strategy. It leverages ensemble learning and includes diverse, complementary label feedback, alleviating the effect of incorrect feedback and promoting pseudo-label refinement. Rather than relying exclusively on first-order moments for distribution alignment, our approach offers explicit objectives to optimize intra-class compactness and inter-class separation with the inferred source prototypes and highly-confident target samples in a domain-invariant fashion. Notably, we ensure source data privacy by eliminating the need to access the source data during the adaptation phase through a priori inference of source prototypes. We conducted a series of comprehensive experiments, including an ablation analysis, covering a range of partial domain adaptation tasks. Comprehensive evaluations on benchmark datasets corroborate our framework's enhanced robustness and generalization, demonstrating its superiority over existing state-of-the-art PDA approaches.

Efficient Single Object Detection on Image Patches with Early Exit Enhanced High-Precision CNNs

  • paper_url: http://arxiv.org/abs/2309.03530
  • repo_url: None
  • paper_authors: Arne Moos
  • for: Object detection for mobile robots, primarily ball detection, in the RoboCup Standard Platform League.
  • methods: A convolutional neural network architecture designed for computationally constrained robotic platforms that classifies single objects in image patches with high precision and determines their precise spatial positions.
  • results: 100% precision and a recall of almost 87% in varying lighting conditions and under motion blur, at roughly 170 microseconds per hypothesis; combining the approach with an Early Exit yields a runtime optimization of more than 28% on average.
    Abstract This paper proposes a novel approach for detecting objects using mobile robots in the context of the RoboCup Standard Platform League, with a primary focus on detecting the ball. The challenge lies in detecting a dynamic object in varying lighting conditions and blurred images caused by fast movements. To address this challenge, the paper presents a convolutional neural network architecture designed specifically for computationally constrained robotic platforms. The proposed CNN is trained to achieve high precision classification of single objects in image patches and to determine their precise spatial positions. The paper further integrates Early Exits into the existing high-precision CNN architecture to reduce the computational cost of easily rejectable cases in the background class. The training process involves a composite loss function based on confidence and positional losses with dynamic weighting and data augmentation. The proposed approach achieves a precision of 100% on the validation dataset and a recall of almost 87%, while maintaining an execution time of around 170 $\mu$s per hypotheses. By combining the proposed approach with an Early Exit, a runtime optimization of more than 28%, on average, can be achieved compared to the original CNN. Overall, this paper provides an efficient solution for an enhanced detection of objects, especially the ball, in computationally constrained robotic platforms.
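The early-exit idea — letting a cheap head reject easy background patches before the expensive layers run — can be sketched with a toy patch classifier. Layer sizes, the fixed rejection threshold, the output layout, and the single-patch batching are all illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class EarlyExitPatchClassifier(nn.Module):
    """Tiny patch classifier with an early exit: if the cheap head is already
    confident the patch is background, skip the expensive tail entirely."""

    def __init__(self, reject_threshold=0.9):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU())
        self.early_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(8, 1))       # P(background)
        self.tail = nn.Sequential(nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.full_head = nn.Linear(16, 5)   # e.g. class logit + position terms
        self.reject_threshold = reject_threshold

    def forward(self, patch):               # assumes a batch of one hypothesis
        feat = self.stem(patch)
        p_bg = torch.sigmoid(self.early_head(feat))
        if p_bg.item() > self.reject_threshold:
            return None                     # early exit: easily rejectable case
        return self.full_head(self.tail(feat))  # precise class + position

net = EarlyExitPatchClassifier().eval()
with torch.no_grad():
    out = net(torch.rand(1, 1, 32, 32))
print("early-exited" if out is None else f"full output {tuple(out.shape)}")
```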

BroadCAM: Outcome-agnostic Class Activation Mapping for Small-scale Weakly Supervised Applications

  • paper_url: http://arxiv.org/abs/2309.03509
  • repo_url: https://github.com/linjiatai/broadcam
  • paper_authors: Jiatai Lin, Guoqiang Han, Xuemiao Xu, Changhong Liang, Tien-Tsin Wong, C. L. Philip Chen, Zaiyi Liu, Chu Han
  • for: Interpreting deep learning models, particularly for weakly supervised semantic segmentation and object localization on small-scale data.
  • methods: An outcome-agnostic CAM approach, BroadCAM, which avoids the unreliable weights produced by unstable training on small-scale data.
  • results: BroadCAM outperforms existing CAM methods under different CNN architectures, especially with small-scale training data (less than 5%), and also achieves SOTA performance with large-scale training data.
    Abstract Class activation mapping~(CAM), a visualization technique for interpreting deep learning models, is now commonly used for weakly supervised semantic segmentation~(WSSS) and object localization~(WSOL). It is the weighted aggregation of the feature maps by activating the high class-relevance ones. Current CAM methods achieve it relying on the training outcomes, such as predicted scores~(forward information), gradients~(backward information), etc. However, when with small-scale data, unstable training may lead to less effective model outcomes and generate unreliable weights, finally resulting in incorrect activation and noisy CAM seeds. In this paper, we propose an outcome-agnostic CAM approach, called BroadCAM, for small-scale weakly supervised applications. Since broad learning system (BLS) is independent to the model learning, BroadCAM can avoid the weights being affected by the unreliable model outcomes when with small-scale data. By evaluating BroadCAM on VOC2012 (natural images) and BCSS-WSSS (medical images) for WSSS and OpenImages30k for WSOL, BroadCAM demonstrates superior performance than existing CAM methods with small-scale data (less than 5\%) in different CNN architectures. It also achieves SOTA performance with large-scale training data. Extensive qualitative comparisons are conducted to demonstrate how BroadCAM activates the high class-relevance feature maps and generates reliable CAMs when with small-scale training data.
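BroadCAM derives CAM weights independently of the model's training outcomes. One simplified reading, sketched below, replaces gradient- or score-based channel weights with the closed-form ridge-regression (pseudo-inverse) solution characteristic of broad learning systems, fit from pooled features to labels; all shapes, the data, and the ridge strength are illustrative assumptions.

```python
import numpy as np

def broadcam(feature_maps, pooled_bank, labels, target_class, lam=1e-2):
    """Outcome-agnostic CAM sketch: channel weights come from a closed-form
    ridge regression over a bank of pooled features, not from the model's
    predictions or gradients."""
    X = pooled_bank                                   # (N, C) GAP features
    Y = np.eye(labels.max() + 1)[labels]              # (N, K) one-hot labels
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)  # (C, K)
    w = W[:, target_class]                            # channel relevance weights
    cam = np.tensordot(w, feature_maps, axes=(0, 0))  # weighted sum over channels
    cam = np.maximum(cam, 0)
    return cam / (cam.max() + 1e-8)

rng = np.random.default_rng(0)
C, H, W_ = 64, 14, 14
bank = rng.standard_normal((200, C))                  # pooled features, 200 images
labels = rng.integers(0, 2, 200)                      # binary weak labels
fmap = rng.standard_normal((C, H, W_))                # feature maps of one image
heat = broadcam(fmap, bank, labels, target_class=1)
print(heat.shape, float(heat.min()), float(heat.max()))
```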

Dynamic Frame Interpolation in Wavelet Domain

  • paper_url: http://arxiv.org/abs/2309.03508
  • repo_url: https://github.com/ltkong218/waveletvfi
  • paper_authors: Lingtong Kong, Boyuan Jiang, Donghao Luo, Wenqing Chu, Ying Tai, Chengjie Wang, Jie Yang
  • for: Increasing video frame rate for a more fluent visual experience via efficient frame interpolation.
  • methods: A two-stage framework built around a wavelet synthesis network: it first estimates intermediate optical flow, then uses flow-aligned context features to predict multi-scale wavelet coefficients.
  • results: On common high-resolution and animation frame interpolation benchmarks, the proposed WaveletVFI reduces computation by up to 40% while maintaining similar accuracy, making it more efficient than other state-of-the-art methods.
    Abstract Video frame interpolation is an important low-level vision task, which can increase frame rate for more fluent visual experience. Existing methods have achieved great success by employing advanced motion models and synthesis networks. However, the spatial redundancy when synthesizing the target frame has not been fully explored, that can result in lots of inefficient computation. On the other hand, the computation compression degree in frame interpolation is highly dependent on both texture distribution and scene motion, which demands to understand the spatial-temporal information of each input frame pair for a better compression degree selection. In this work, we propose a novel two-stage frame interpolation framework termed WaveletVFI to address above problems. It first estimates intermediate optical flow with a lightweight motion perception network, and then a wavelet synthesis network uses flow aligned context features to predict multi-scale wavelet coefficients with sparse convolution for efficient target frame reconstruction, where the sparse valid masks that control computation in each scale are determined by a crucial threshold ratio. Instead of setting a fixed value like previous methods, we find that embedding a classifier in the motion perception network to learn a dynamic threshold for each sample can achieve more computation reduction with almost no loss of accuracy. On the common high resolution and animation frame interpolation benchmarks, proposed WaveletVFI can reduce computation up to 40% while maintaining similar accuracy, making it perform more efficiently against other state-of-the-arts. Code is available at https://github.com/ltkong218/WaveletVFI.

Stroke-based Neural Painting and Stylization with Dynamically Predicted Painting Region

  • paper_url: http://arxiv.org/abs/2309.03504
  • repo_url: https://github.com/sjtuplayer/compositional_neural_painter
  • paper_authors: Teng Hu, Ran Yi, Haokun Zhu, Liang Liu, Jinlong Peng, Yabiao Wang, Chengjie Wang, Lizhuang Ma
  • for: Stroke-based image rendering that resolves the boundary inconsistency artifacts of existing methods.
  • methods: A compositor network that dynamically predicts the next painting region, together with a painter network trained with a WGAN discriminator to predict stroke parameters.
  • results: Outperforms existing methods in both stroke-based neural painting and stroke-based style transfer, while preserving the structure of the input image.
    Abstract Stroke-based rendering aims to recreate an image with a set of strokes. Most existing methods render complex images using an uniform-block-dividing strategy, which leads to boundary inconsistency artifacts. To solve the problem, we propose Compositional Neural Painter, a novel stroke-based rendering framework which dynamically predicts the next painting region based on the current canvas, instead of dividing the image plane uniformly into painting regions. We start from an empty canvas and divide the painting process into several steps. At each step, a compositor network trained with a phasic RL strategy first predicts the next painting region, then a painter network trained with a WGAN discriminator predicts stroke parameters, and a stroke renderer paints the strokes onto the painting region of the current canvas. Moreover, we extend our method to stroke-based style transfer with a novel differentiable distance transform loss, which helps preserve the structure of the input image during stroke-based stylization. Extensive experiments show our model outperforms the existing models in both stroke-based neural painting and stroke-based stylization. Code is available at https://github.com/sjtuplayer/Compositional_Neural_Painter

Instance Segmentation of Dislocations in TEM Images

  • paper_url: http://arxiv.org/abs/2309.03499
  • repo_url: https://github.com/kruzaeva/dislocation-segmentation
  • paper_authors: Karina Ruzaeva, Kishan Govind, Marc Legros, Stefan Sandfeld
  • for: Using quantitative transmission electron microscopy (TEM) during in-situ straining experiments to reveal the motion of dislocations; in materials science, knowing the position and motion of dislocations is important for creating novel materials.
  • methods: State-of-the-art instance segmentation methods, including Mask R-CNN and YOLOv8, extract dislocation masks, which are converted to mathematical lines for quantitative analysis of dislocation length and geometry.
  • results: The segmentation pipeline achieves accuracy suitable for all domain-specific post-processing, and the proposed physics-based, length-aware quality metric performs much more consistently than the typically used pixel-wise metrics.
    Abstract Quantitative Transmission Electron Microscopy (TEM) during in-situ straining experiment is able to reveal the motion of dislocations -- linear defects in the crystal lattice of metals. In the domain of materials science, the knowledge about the location and movement of dislocations is important for creating novel materials with superior properties. A long-standing problem, however, is to identify the position and extract the shape of dislocations, which would ultimately help to create a digital twin of such materials. In this work, we quantitatively compare state-of-the-art instance segmentation methods, including Mask R-CNN and YOLOv8. The dislocation masks as the results of the instance segmentation are converted to mathematical lines, enabling quantitative analysis of dislocation length and geometry -- important information for the domain scientist, which we then propose to include as a novel length-aware quality metric for estimating the network performance. Our segmentation pipeline shows a high accuracy suitable for all domain-specific, further post-processing. Additionally, our physics-based metric turns out to perform much more consistently than typically used pixel-wise metrics.
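The length-aware metric rests on converting masks to mathematical lines; a simple stand-in is to skeletonize each mask, sum inter-pixel step lengths, and compare predicted against ground-truth lengths. The exact formula below is an assumption (using scikit-image's skeletonize), not the paper's published metric.

```python
import numpy as np
from skimage.morphology import skeletonize

def mask_length(mask):
    """Approximate the physical length of a dislocation mask: thin it to a
    one-pixel skeleton, then sum inter-pixel steps (1 for axis neighbors,
    sqrt(2) for diagonal neighbors), counting each neighbor pair once."""
    sk = skeletonize(mask.astype(bool))
    ys, xs = np.nonzero(sk)
    length = 0.0
    for y, x in zip(ys, xs):
        for dy, dx in ((0, 1), (1, 0), (1, 1), (1, -1)):  # forward neighbors only
            ny, nx = y + dy, x + dx
            if 0 <= ny < sk.shape[0] and 0 <= nx < sk.shape[1] and sk[ny, nx]:
                length += np.hypot(dy, dx)
    return length

def length_aware_score(pred_mask, gt_mask):
    """Relative length error between predicted and ground-truth dislocations."""
    lp, lg = mask_length(pred_mask), mask_length(gt_mask)
    return abs(lp - lg) / max(lg, 1e-8)

gt = np.zeros((64, 64), np.uint8); gt[10:50, 30:33] = 1   # thick vertical line
pred = np.zeros_like(gt); pred[12:50, 30:33] = 1          # slightly short prediction
print("relative length error:", round(length_aware_score(pred, gt), 3))
```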

Evaluating Deep Learning-based Melanoma Classification using Immunohistochemistry and Routine Histology: A Three Center Study

  • paper_url: http://arxiv.org/abs/2309.03494
  • repo_url: None
  • paper_authors: Christoph Wies, Lucas Schneider, Sarah Haggenmueller, Tabea-Clara Bucher, Sarah Hobelsberger, Markus V. Heppt, Gerardo Ferrara, Eva I. Krieghoff-Henning, Titus J. Brinker
  • for: Automated, deep learning (DL)-based examination of melanoma histopathology slides.
  • methods: ResNet neural networks trained on MelanA-stained and corresponding H&E-stained slides, evaluated separately and jointly on out-of-distribution (OOD) datasets.
  • results: The DL MelanA-based classifier matches the H&E-based benchmark, and a combined MelanA + H&E classifier reaches AUROCs of 0.85 and 0.81 on the OOD datasets, suggesting multi-stain classification can further assist pathologists.
    Abstract Pathologists routinely use immunohistochemical (IHC)-stained tissue slides against MelanA in addition to hematoxylin and eosin (H&E)-stained slides to improve their accuracy in diagnosing melanomas. The use of diagnostic Deep Learning (DL)-based support systems for automated examination of tissue morphology and cellular composition has been well studied in standard H&E-stained tissue slides. In contrast, there are few studies that analyze IHC slides using DL. Therefore, we investigated the separate and joint performance of ResNets trained on MelanA and corresponding H&E-stained slides. The MelanA classifier achieved an area under receiver operating characteristics curve (AUROC) of 0.82 and 0.74 on out of distribution (OOD)-datasets, similar to the H&E-based benchmark classification of 0.81 and 0.75, respectively. A combined classifier using MelanA and H&E achieved AUROCs of 0.85 and 0.81 on the OOD datasets. DL MelanA-based assistance systems show the same performance as the benchmark H&E classification and may be improved by multi stain classification to assist pathologists in their clinical routine.

SAM3D: Segment Anything Model in Volumetric Medical Images

  • paper_url: http://arxiv.org/abs/2309.03493
  • repo_url: https://github.com/DinhHieuHoang/SAM3D
  • paper_authors: Nhat-Tan Bui, Dinh-Hieu Hoang, Minh-Triet Tran, Ngan Le
  • for: Accurate segmentation of 3D volumetric medical images to support medical diagnosis.
  • methods: SAM3D, built on the Segment Anything Model (SAM), uses the pre-trained SAM encoder to extract meaningful representations of input images; unlike other existing SAM-based volumetric segmentation methods, it takes the whole 3D volume as input and processes it simply and effectively, avoiding training a large number of parameters.
  • results: Extensive experiments on multiple medical image datasets show that the network is competitive on 3D medical image segmentation tasks while being significantly more parameter-efficient.
    Abstract Image segmentation is a critical task in medical image analysis, providing valuable information that helps to make an accurate diagnosis. In recent years, deep learning-based automatic image segmentation methods have achieved outstanding results in medical images. In this paper, inspired by the Segment Anything Model (SAM), a foundation model that has received much attention for its impressive accuracy and powerful generalization ability in 2D still image segmentation, we propose a SAM3D that targets at 3D volumetric medical images and utilizes the pre-trained features from the SAM encoder to capture meaningful representations of input images. Different from other existing SAM-based volumetric segmentation methods that perform the segmentation by dividing the volume into a set of 2D slices, our model takes the whole 3D volume image as input and processes it simply and effectively that avoids training a significant number of parameters. Extensive experiments are conducted on multiple medical image datasets to demonstrate that our network attains competitive results compared with other state-of-the-art methods in 3D medical segmentation tasks while being significantly efficient in terms of parameters.

DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners

  • paper_url: http://arxiv.org/abs/2309.03483
  • repo_url: https://github.com/clarence-lee-sheng/determinet
  • paper_authors: Clarence Lee, M Ganesh Kumar, Cheston Tan
  • for: Improving visual grounding models' ability to distinguish particular objects of interest, as specified by determiners, from objects in general.
  • methods: The DetermiNet dataset, comprising 250,000 synthetically generated images and captions based on 25 determiners; the task is to predict bounding boxes identifying the objects of interest, constrained by the semantics of the given determiner.
  • results: Current state-of-the-art visual grounding models perform poorly on this dataset, highlighting the limitations of existing models on reference and quantification tasks.
    Abstract State-of-the-art visual grounding models can achieve high detection accuracy, but they are not designed to distinguish between all objects versus only certain objects of interest. In natural language, in order to specify a particular object or set of objects of interest, humans use determiners such as "my", "either" and "those". Determiners, as an important word class, are a type of schema in natural language about the reference or quantity of the noun. Existing grounded referencing datasets place much less emphasis on determiners, compared to other word classes such as nouns, verbs and adjectives. This makes it difficult to develop models that understand the full variety and complexity of object referencing. Thus, we have developed and released the DetermiNet dataset , which comprises 250,000 synthetically generated images and captions based on 25 determiners. The task is to predict bounding boxes to identify objects of interest, constrained by the semantics of the given determiner. We find that current state-of-the-art visual grounding models do not perform well on the dataset, highlighting the limitations of existing models on reference and quantification tasks.

TSI-Net: A Timing Sequence Image Segmentation Network for Intracranial Artery Segmentation in Digital Subtraction Angiography

  • paper_url: http://arxiv.org/abs/2309.03477
  • repo_url: None
  • paper_authors: Lemeng Wang, Wentao Liu, Weijin Xu, Haoyuan Li, Huihua Yang, Feng Gao
  • for: Automatic segmentation of the intracranial artery (IA) in digital subtraction angiography (DSA) sequences.
  • methods: Incorporates a bi-directional ConvGRU module (BCM) in the encoder, which can take variable-length DSA sequences as input and retain both past and future information, and introduces a sensitive detail branch (SDB) at the end for supervising fine vessels.
  • results: Significantly better than state-of-the-art networks from recent years, with a Sen evaluation metric of 0.797, a 3% improvement over other methods.
    Abstract Cerebrovascular disease is one of the major diseases facing the world today. Automatic segmentation of intracranial artery (IA) in digital subtraction angiography (DSA) sequences is an important step in the diagnosis of vascular related diseases and in guiding neurointerventional procedures. While, a single image can only show part of the IA within the contrast medium according to the imaging principle of DSA technology. Therefore, 2D DSA segmentation methods are unable to capture the complete IA information and treatment of cerebrovascular diseases. We propose A timing sequence image segmentation network with U-shape, called TSI-Net, which incorporates a bi-directional ConvGRU module (BCM) in the encoder. The network incorporates a bi-directional ConvGRU module (BCM) in the encoder, which can input variable-length DSA sequences, retain past and future information, segment them into 2D images. In addition, we introduce a sensitive detail branch (SDB) at the end for supervising fine vessels. Experimented on the DSA sequence dataset DIAS, the method performs significantly better than state-of-the-art networks in recent years. In particular, it achieves a Sen evaluation metric of 0.797, which is a 3% improvement compared to other methods.
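A minimal bi-directional ConvGRU, the kind of module the BCM builds on: one ConvGRU cell scans the variable-length DSA sequence past-to-future, another future-to-past, and the two hidden-state streams are concatenated per frame. Cell design, kernel size, and channel counts are generic choices, not TSI-Net's exact BCM.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)  # update z, reset r
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)       # candidate state
        self.hid_ch = hid_ch

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        n = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * n

class BiConvGRU(nn.Module):
    """Bi-directional ConvGRU over a variable-length frame sequence."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.fwd = ConvGRUCell(in_ch, hid_ch)
        self.bwd = ConvGRUCell(in_ch, hid_ch)

    def forward(self, frames):                     # frames: (T, B, C, H, W)
        T, B, _, H, W = frames.shape
        hf = frames.new_zeros(B, self.fwd.hid_ch, H, W)
        hb = frames.new_zeros(B, self.bwd.hid_ch, H, W)
        outs_f, outs_b = [], []
        for t in range(T):                         # past -> future pass
            hf = self.fwd(frames[t], hf); outs_f.append(hf)
        for t in reversed(range(T)):               # future -> past pass
            hb = self.bwd(frames[t], hb); outs_b.append(hb)
        outs_b.reverse()
        return torch.cat([torch.stack(outs_f), torch.stack(outs_b)], dim=2)

seq = torch.randn(5, 2, 1, 32, 32)   # 5-frame DSA sequence, batch of 2
feats = BiConvGRU(in_ch=1, hid_ch=8)(seq)
print(feats.shape)                   # torch.Size([5, 2, 16, 32, 32])
```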

Temporal Collection and Distribution for Referring Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2309.03473
  • repo_url: None
  • paper_authors: Jiajin Tang, Ge Zheng, Sibei Yang
  • for: Improving referring video object segmentation by aligning a natural language expression with objects' motions and their dynamic associations across the video.
  • methods: Simultaneously maintains a video-level referent token, which captures the referent described by the language expression, and a sequence of object queries, which locate and segment objects in each frame; a novel temporal collection-distribution mechanism handles the interaction between them.
  • results: Outperforms state-of-the-art methods on all benchmarks consistently and significantly.
    Abstract Referring video object segmentation aims to segment a referent throughout a video sequence according to a natural language expression. It requires aligning the natural language expression with the objects' motions and their dynamic associations at the global video level but segmenting objects at the frame level. To achieve this goal, we propose to simultaneously maintain a global referent token and a sequence of object queries, where the former is responsible for capturing video-level referent according to the language expression, while the latter serves to better locate and segment objects with each frame. Furthermore, to explicitly capture object motions and spatial-temporal cross-modal reasoning over objects, we propose a novel temporal collection-distribution mechanism for interacting between the global referent token and object queries. Specifically, the temporal collection mechanism collects global information for the referent token from object queries to the temporal motions to the language expression. In turn, the temporal distribution first distributes the referent token to the referent sequence across all frames and then performs efficient cross-frame reasoning between the referent sequence and object queries in every frame. Experimental results show that our method outperforms state-of-the-art methods on all benchmarks consistently and significantly.

Perceptual Quality Assessment of 360$^\circ$ Images Based on Generative Scanpath Representation

  • paper_url: http://arxiv.org/abs/2309.03472
  • repo_url: https://github.com/xiangjiesui/gsr
  • paper_authors: Xiangjie Sui, Hanwei Zhu, Xuelin Liu, Yuming Fang, Shiqi Wang, Zhou Wang
  • for: Efficient perceptual quality assessment of 360$^\circ$ images that accounts for the diversity of viewing behaviors under different viewing conditions.
  • methods: A scanpath generator produces a set of scanpaths for multiple hypothetical users under a predefined viewing condition (starting point and exploration time); these scanpaths convert the 360$^\circ$ image into a unique generative scanpath representation (GSR) on which quality is inferred.
  • results: Experiments validate that the predictions are highly consistent with human perception, especially for locally distorted 360$^\circ$ images under varied viewing conditions.
    Abstract Despite substantial efforts dedicated to the design of heuristic models for omnidirectional (i.e., 360$^\circ$) image quality assessment (OIQA), a conspicuous gap remains due to the lack of consideration for the diversity of viewing behaviors that leads to the varying perceptual quality of 360$^\circ$ images. Two critical aspects underline this oversight: the neglect of viewing conditions that significantly sway user gaze patterns and the overreliance on a single viewport sequence from the 360$^\circ$ image for quality inference. To address these issues, we introduce a unique generative scanpath representation (GSR) for effective quality inference of 360$^\circ$ images, which aggregates varied perceptual experiences of multi-hypothesis users under a predefined viewing condition. More specifically, given a viewing condition characterized by the starting point of viewing and exploration time, a set of scanpaths consisting of dynamic visual fixations can be produced using an apt scanpath generator. Following this vein, we use the scanpaths to convert the 360$^\circ$ image into the unique GSR, which provides a global overview of gazed-focused contents derived from scanpaths. As such, the quality inference of the 360$^\circ$ image is swiftly transformed to that of GSR. We then propose an efficient OIQA computational framework by learning the quality maps of GSR. Comprehensive experimental results validate that the predictions of the proposed framework are highly consistent with human perception in the spatiotemporal domain, especially in the challenging context of locally distorted 360$^\circ$ images under varied viewing conditions. The code will be released at https://github.com/xiangjieSui/GSR
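As a rough illustration of how a scanpath turns a panorama into a gaze-ordered sequence, the sketch below crops a viewport around each fixation of an equirectangular image. The axis-aligned crop stands in for a proper gnomonic viewport projection, and the fixation coordinates are hypothetical.

```python
import numpy as np

def extract_viewports(pano, fixations, size=64):
    """Crop a square patch around each (longitude, latitude) fixation of an
    equirectangular panorama, yielding a gaze-ordered viewport sequence."""
    H, W = pano.shape[:2]
    views = []
    for lon, lat in fixations:               # lon in [-180, 180], lat in [-90, 90]
        cx = int((lon + 180.0) / 360.0 * W)  # panorama column of the fixation
        cy = int((90.0 - lat) / 180.0 * H)   # panorama row of the fixation
        cols = np.arange(cx - size // 2, cx + size // 2) % W  # wrap horizontally
        rows = np.clip(np.arange(cy - size // 2, cy + size // 2), 0, H - 1)
        views.append(pano[np.ix_(rows, cols)])
    return np.stack(views)

pano = np.random.rand(512, 1024, 3)                       # stand-in 360 image
scanpath = [(-120.0, 10.0), (-40.0, 0.0), (30.0, -15.0)]  # hypothetical fixations
print(extract_viewports(pano, scanpath).shape)            # (3, 64, 64, 3)
```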

Multi-Modality Guidance Network For Missing Modality Inference

  • paper_url: http://arxiv.org/abs/2309.03452
  • repo_url: None
  • paper_authors: Zhuokai Zhao, Harish Palani, Tianyi Liu, Lena Evans, Ruth Toner
  • for: Improving the applicability of multimodal models in large, deployed systems where modalities may be missing at inference time.
  • methods: Proposes a guidance network that promotes knowledge sharing during training, exploiting multimodal representations to train more accurate single-modality models for inference.
  • results: A real-life experiment shows the framework trains single-modality models that significantly outperform their traditionally trained counterparts while maintaining the same inference cost.
    Abstract Multimodal models have gained significant success in recent years. Standard multimodal approaches often assume unchanged modalities from training stage to inference stage. In practice, however, many scenarios fail to satisfy such assumptions with missing modalities during inference, leading to limitations on where multimodal models can be applied. While existing methods mitigate the problem through reconstructing the missing modalities, it increases unnecessary computational cost, which could be just as critical, especially for large, deployed systems. To solve the problem from both sides, we propose a novel guidance network that promotes knowledge sharing during training, taking advantage of the multimodal representations to train better single-modality models for inference. A real-life experiment in violence detection shows that our proposed framework trains single-modality models that significantly outperform their traditionally trained counterparts while maintaining the same inference cost.
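A minimal sketch of the guidance idea: the single-modality student minimizes its task loss plus a term aligning its features with the multimodal (fused) representation. The MSE alignment term and the weight `lam` are illustrative stand-ins; the paper's actual guidance network is not specified here.

```python
import numpy as np

def guidance_loss(student_feat, fused_feat, logits, labels, lam=0.5):
    """Cross-entropy task loss plus an alignment term pulling the
    single-modality student's features toward the multimodal teacher's."""
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    ce = -np.log(probs[np.arange(len(labels)), labels]).mean()
    align = ((student_feat - fused_feat) ** 2).mean()  # MSE to fused features
    return ce + lam * align

rng = np.random.default_rng(5)
B, D, C = 16, 64, 2                                # batch, feature dim, classes
loss = guidance_loss(rng.normal(size=(B, D)),      # student (single modality)
                     rng.normal(size=(B, D)),      # multimodal fused teacher
                     rng.normal(size=(B, C)),      # student logits
                     rng.integers(0, C, size=B))   # e.g. violence labels
print(round(float(loss), 3))
```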

Underwater Image Enhancement by Transformer-based Diffusion Model with Non-uniform Sampling for Skip Strategy

  • paper_url: http://arxiv.org/abs/2309.03445
  • repo_url: https://github.com/piggy2009/dm_underwater
  • paper_authors: Yi Tang, Takafumi Iwaguchi, Hiroshi Kawasaki
  • for: Proposes a diffusion-model-based method for underwater image enhancement.
  • methods: Uses a conditional denoising diffusion model that takes the underwater image and Gaussian noise as inputs and generates the corresponding enhanced image. To make the reverse process more efficient, it adopts two strategies: a lightweight transformer-based denoising network that shortens each forward pass, and a skip sampling strategy that reduces the number of iterations.
  • results: Evaluated against recent methods on widely used underwater enhancement datasets, the approach achieves competitive or better performance with high efficiency. Code is available at https://github.com/piggy2009/DM_underwater.
    Abstract In this paper, we present an approach to image enhancement with a diffusion model in underwater scenes. Our method adapts conditional denoising diffusion probabilistic models to generate the corresponding enhanced images, using the underwater images and Gaussian noise as inputs. Additionally, to improve the efficiency of the reverse process in the diffusion model, we adopt two complementary strategies. We first propose a lightweight transformer-based denoising network, which effectively shortens the network's forward pass per iteration. On the other hand, we introduce a skip sampling strategy to reduce the number of iterations. Moreover, based on the skip sampling strategy, we propose two non-uniform sampling methods for the sequence of time steps, namely piecewise sampling and searching with an evolutionary algorithm. Both are effective and further improve performance under the same number of steps compared with the previous uniform sampling. In the end, we conduct a relative evaluation on widely used underwater enhancement datasets between recent state-of-the-art methods and the proposed approach. The experimental results show that our approach achieves both competitive performance and high efficiency. Our code is available at https://github.com/piggy2009/DM_underwater.
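The sketch below illustrates skip sampling of diffusion timesteps, contrasting a uniform skip schedule with a piecewise (non-uniform) one that spends more of the step budget at one end. The split point and step counts are illustrative; the paper's evolutionary search over schedules is omitted.

```python
import numpy as np

def skip_schedule(T=1000, steps=20, mode="uniform", split=0.3):
    """Pick a descending subsequence of diffusion timesteps. 'uniform' is
    plain skip sampling; 'piecewise' spends more of the step budget on the
    low-noise end (the split fraction is illustrative)."""
    if mode == "uniform":
        ts = np.linspace(0, T - 1, steps)
    else:  # piecewise: a denser grid over the early (low-noise) timesteps
        early = np.linspace(0, split * T, steps // 2, endpoint=False)
        late = np.linspace(split * T, T - 1, steps - steps // 2)
        ts = np.concatenate([early, late])
    return np.unique(ts.astype(int))[::-1]  # descending, duplicates dropped

print(skip_schedule(mode="uniform")[:5])    # e.g. [999 946 893 841 788]
print(skip_schedule(mode="piecewise")[:5])
```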

Punctate White Matter Lesion Segmentation in Preterm Infants Powered by Counterfactually Generative Learning

  • paper_url: http://arxiv.org/abs/2309.03440
  • repo_url: None
  • paper_authors: Zehua Ren, Yongheng Sun, Miaomiao Wang, Yuying Feng, Xianjun Li, Chao Jin, Jian Yang, Chunfeng Lian, Fan Wang
  • for: Aims to improve the accuracy of punctate white matter lesion segmentation, enabling timely diagnosis and treatment of related developmental disorders.
  • methods: Uses counterfactual reasoning coupled with an auxiliary brain-tissue segmentation task to learn fine-grained positional and morphological representations for accurate lesion localization and segmentation.
  • results: The resulting simple, easy-to-implement deep-learning framework (DeepPWML) combines the lesion counterfactual map with the tissue probability map and achieves state-of-the-art performance on a real clinical dataset of infant T1w MR images.
    Abstract Accurate segmentation of punctate white matter lesions (PWMLs) are fundamental for the timely diagnosis and treatment of related developmental disorders. Automated PWMLs segmentation from infant brain MR images is challenging, considering that the lesions are typically small and low-contrast, and the number of lesions may dramatically change across subjects. Existing learning-based methods directly apply general network architectures to this challenging task, which may fail to capture detailed positional information of PWMLs, potentially leading to severe under-segmentations. In this paper, we propose to leverage the idea of counterfactual reasoning coupled with the auxiliary task of brain tissue segmentation to learn fine-grained positional and morphological representations of PWMLs for accurate localization and segmentation. A simple and easy-to-implement deep-learning framework (i.e., DeepPWML) is accordingly designed. It combines the lesion counterfactual map with the tissue probability map to train a lightweight PWML segmentation network, demonstrating state-of-the-art performance on a real-clinical dataset of infant T1w MR images. The code is available at \href{https://github.com/ladderlab-xjtu/DeepPWML}{https://github.com/ladderlab-xjtu/DeepPWML}.

cs.AI - 2023-09-07

Evaluation of large language models for discovery of gene set function

  • paper_url: http://arxiv.org/abs/2309.04019
  • repo_url: https://github.com/idekerlab/llm_evaluation_for_gene_set_interpretation
  • paper_authors: Mengzhou Hu, Sahar Alkhairy, Ingoo Lee, Rudolf T. Pillich, Robin Bachelder, Trey Ideker, Dexter Pratt
  • for: Evaluates whether OpenAI's GPT-4 can develop hypotheses about common gene functions from its embedded biomedical knowledge.
  • methods: Builds a GPT-4 pipeline that labels gene sets with names summarizing their consensus functions, substantiated by analysis text and citations.
  • results: GPT-4 generated names very similar to the curated Gene Ontology names in 50% of cases, produced names more informative than gene set enrichment for gene sets discovered in 'omics data, and its supporting statements and citations largely held up under human review.
    Abstract Gene set analysis is a mainstay of functional genomics, but it relies on manually curated databases of gene functions that are incomplete and unaware of biological context. Here we evaluate the ability of OpenAI's GPT-4, a Large Language Model (LLM), to develop hypotheses about common gene functions from its embedded biomedical knowledge. We created a GPT-4 pipeline to label gene sets with names that summarize their consensus functions, substantiated by analysis text and citations. Benchmarking against named gene sets in the Gene Ontology, GPT-4 generated very similar names in 50% of cases, while in most remaining cases it recovered the name of a more general concept. In gene sets discovered in 'omics data, GPT-4 names were more informative than gene set enrichment, with supporting statements and citations that largely verified in human review. The ability to rapidly synthesize common gene functions positions LLMs as valuable functional genomics assistants.
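A hedged sketch of the kind of prompt such a pipeline might issue, using the OpenAI v1 Python client (requires an API key); the gene symbols and prompt wording are examples, not the repository's actual prompts.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

genes = ["CDK1", "CCNB1", "PLK1", "AURKB", "BUB1"]  # example mitotic genes
prompt = (
    "Propose a concise name summarizing the most prominent shared function "
    "of these genes, then justify it with a brief analysis:\n" + ", ".join(genes)
)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # e.g. a mitotic cell-cycle label
```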

ConDA: Contrastive Domain Adaptation for AI-generated Text Detection

  • paper_url: http://arxiv.org/abs/2309.03992
  • repo_url: https://github.com/amritabh/conda-gen-text-detection
  • paper_authors: Amrita Bhattacharjee, Tharindu Kumarage, Raha Moraffah, Huan Liu
  • for: Aims to build a detector for AI-generated text that needs no labeled training data, as a defense against disinformation at scale.
  • methods: Proposes a Contrastive Domain Adaptation framework (ConDA) that blends standard domain adaptation techniques with the representation power of contrastive learning, learning domain-invariant representations from unlabeled target data for the final unsupervised detection task.
  • results: Experiments show average performance gains of 31.7% over the best-performing baselines, within a 0.8% margin of a fully supervised detector. All code and data are available at https://github.com/AmritaBh/ConDA-gen-text-detection.
    Abstract Large language models (LLMs) are increasingly being used for generating text in a variety of use cases, including journalistic news articles. Given the potential malicious nature in which these LLMs can be used to generate disinformation at scale, it is important to build effective detectors for such AI-generated text. Given the surge in development of new LLMs, acquiring labeled training data for supervised detectors is a bottleneck. However, there might be plenty of unlabeled text data available, without information on which generator it came from. In this work we tackle this data problem, in detecting AI-generated news text, and frame the problem as an unsupervised domain adaptation task. Here the domains are the different text generators, i.e. LLMs, and we assume we have access to only the labeled source data and unlabeled target data. We develop a Contrastive Domain Adaptation framework, called ConDA, that blends standard domain adaptation techniques with the representation power of contrastive learning to learn domain invariant representations that are effective for the final unsupervised detection task. Our experiments demonstrate the effectiveness of our framework, resulting in average performance gains of 31.7% from the best performing baselines, and within 0.8% margin of a fully supervised detector. All our code and data is available at https://github.com/AmritaBh/ConDA-gen-text-detection.
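For the contrastive ingredient, here is a minimal NumPy InfoNCE loss over two augmented views of unlabeled target-domain features; ConDA combines a term like this with a standard domain adaptation objective, and the batch size, dimensionality, and temperature below are illustrative.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE over two views: row i of z1 should match only row i of z2."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                       # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # positives on the diagonal

rng = np.random.default_rng(1)
target_feats = rng.normal(size=(32, 128))          # unlabeled target-domain texts
augmented = target_feats + 0.05 * rng.normal(size=(32, 128))  # augmented views
print(round(float(info_nce(target_feats, augmented)), 3))
```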

Noisy Computing of the $\mathsf{OR}$ and $\mathsf{MAX}$ Functions

  • paper_url: http://arxiv.org/abs/2309.03986
  • repo_url: None
  • paper_authors: Banghua Zhu, Ziao Wang, Nadim Ghaddar, Jiantao Jiao, Lele Wang
  • For: Computing a function of $n$ variables using noisy queries, where each query is incorrect with some fixed and known probability $p \in (0,1/2)$.
  • Methods: Uses noisy queries to compute the $\mathsf{OR}$ function of $n$ bits and the $\mathsf{MAX}$ function of $n$ real numbers, with an expected number of queries of $(1 \pm o(1)) \frac{n\log \frac{1}{\delta}}{D_{\mathsf{KL}}(p \| 1-p)}$.
  • Results: Shows that this expected number of queries is both sufficient and necessary to compute both functions with a vanishing error probability $\delta = o(1)$, tightening the dependence on $p$ in both the upper and lower bounds.
    Abstract We consider the problem of computing a function of $n$ variables using noisy queries, where each query is incorrect with some fixed and known probability $p \in (0,1/2)$. Specifically, we consider the computation of the $\mathsf{OR}$ function of $n$ bits (where queries correspond to noisy readings of the bits) and the $\mathsf{MAX}$ function of $n$ real numbers (where queries correspond to noisy pairwise comparisons). We show that an expected number of queries of \[ (1 \pm o(1)) \frac{n\log \frac{1}{\delta}}{D_{\mathsf{KL}}(p \| 1-p)} \] is both sufficient and necessary to compute both functions with a vanishing error probability $\delta = o(1)$, where $D_{\mathsf{KL}}(p \| 1-p)$ denotes the Kullback-Leibler divergence between $\mathsf{Bern}(p)$ and $\mathsf{Bern}(1-p)$ distributions. Compared to previous work, our results tighten the dependence on $p$ in both the upper and lower bounds for the two functions.
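As a concrete (non-optimal) illustration, the sketch below estimates the OR of n bits by majority-voting repeated noisy reads of each bit, with a repetition count that mirrors the $n\log\frac{1}{\delta}/D_{\mathsf{KL}}(p \| 1-p)$ scaling. The paper's query-optimal scheme is adaptive; simple voting needs an extra constant factor, applied below.

```python
import math, random

def noisy_query(bit, p):
    """Read a bit through the noisy oracle: flipped with probability p."""
    return bit ^ (random.random() < p)

def kl_bern(p, q):
    """KL divergence between Bern(p) and Bern(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def noisy_or(bits, p=0.2, delta=1e-3):
    """Estimate OR(bits) by majority-voting repeated noisy reads per bit.
    The repetition count mirrors the log(1/delta)/D_KL(p||1-p) scaling
    (a union bound over n bits plus a small constant factor for voting)."""
    n = len(bits)
    reps = 3 * math.ceil(math.log(n / delta) / kl_bern(p, 1 - p))
    for b in bits:
        votes = sum(noisy_query(b, p) for _ in range(reps))
        if votes > reps / 2:   # this position looks like a 1
            return 1
    return 0

random.seed(0)
print(noisy_or([0] * 99 + [1]))  # expected: 1
print(noisy_or([0] * 100))       # expected: 0
```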

DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection

  • paper_url: http://arxiv.org/abs/2309.03893
  • repo_url: None
  • paper_authors: Manlin Zhang, Jie Wu, Yuxi Ren, Ming Li, Jie Qin, Xuefeng Xiao, Wei Liu, Rui Wang, Min Zheng, Andy J. Ma
  • for: Proposes a scalable data engine for generating object-detection training data.
  • methods: Introduces DiffusionEngine (DE), a data scaling-up engine consisting of a pre-trained diffusion model and an effective Detection-Adapter, which together generate large, diverse, and generalizable detection training pairs in a single stage.
  • results: Experiments show significant improvements across scenarios such as different detection algorithms, self-supervised pre-training, data scarcity, label scarcity, cross-domain, and semi-supervised learning. For example, scaling up data with DE and a DINO-based adapter improves mAP by 3.1% on COCO, 7.6% on VOC, and 11.5% on Clipart.
    Abstract Data is the cornerstone of deep learning. This paper reveals that the recently developed Diffusion Model is a scalable data engine for object detection. Existing methods for scaling up detection-oriented data often require manual collection or generative models to obtain target images, followed by data augmentation and labeling to produce training pairs, which are costly, complex, or lacking diversity. To address these issues, we present DiffusionEngine (DE), a data scaling-up engine that provides high-quality detection-oriented training pairs in a single stage. DE consists of a pre-trained diffusion model and an effective Detection-Adapter, contributing to generating scalable, diverse and generalizable detection data in a plug-and-play manner. Detection-Adapter is learned to align the implicit semantic and location knowledge in off-the-shelf diffusion models with detection-aware signals to make better bounding-box predictions. Additionally, we contribute two datasets, i.e., COCO-DE and VOC-DE, to scale up existing detection benchmarks for facilitating follow-up research. Extensive experiments demonstrate that data scaling-up via DE can achieve significant improvements in diverse scenarios, such as various detection algorithms, self-supervised pre-training, data-sparse, label-scarce, cross-domain, and semi-supervised learning. For example, when using DE with a DINO-based adapter to scale up data, mAP is improved by 3.1% on COCO, 7.6% on VOC, and 11.5% on Clipart.

A Function Interpretation Benchmark for Evaluating Interpretability Methods

  • paper_url: http://arxiv.org/abs/2309.03886
  • repo_url: https://github.com/multimodal-interpretability/find
  • paper_authors: Sarah Schwettmann, Tamar Rott Shaham, Joanna Materzynska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, Antonio Torralba
  • for: Introduces a benchmark for evaluating the building blocks of automated interpretability methods.
  • methods: Uses language models (LMs) to produce code-based and natural-language descriptions of function behavior.
  • results: An off-the-shelf LM with only black-box access to functions can sometimes infer their structure, acting like a scientist by forming hypotheses and proposing experiments, but LM-based descriptions tend to capture global function behavior while missing local corruptions; FIND is therefore useful for characterizing more sophisticated interpretability methods before they are applied to real-world models.
    Abstract Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in-the-loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad-hoc. How should we validate and compare open-ended labeling tools? This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate. The functions are procedurally constructed across textual and numeric domains, and involve a range of real-world complexities, including noise, composition, approximation, and bias. We evaluate new and existing methods that use language models (LMs) to produce code-based and language descriptions of function behavior. We find that an off-the-shelf LM augmented with only black-box access to functions can sometimes infer their structure, acting as a scientist by forming hypotheses, proposing experiments, and updating descriptions in light of new data. However, LM-based descriptions tend to capture global function behavior and miss local corruptions. These results show that FIND will be useful for characterizing the performance of more sophisticated interpretability methods before they are applied to real-world models.

DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

  • paper_url: http://arxiv.org/abs/2309.03883
  • repo_url: https://github.com/voidism/dola
  • paper_authors: Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, Pengcheng He
  • for: Reducing hallucinations in large language models (LLMs), i.e., the generation of content that deviates from facts seen during pretraining.
  • methods: Proposes a simple decoding strategy that surfaces factual knowledge in LLMs more effectively, without conditioning on retrieved external knowledge or additional fine-tuning.
  • results: Improves multiple-choice and open-ended generation tasks, for example raising the performance of LLaMA-family models on TruthfulQA by 12-17 absolute percentage points.
    Abstract Despite their impressive capabilities, large language models (LLMs) are prone to hallucinations, i.e., generating content that deviates from facts seen during pretraining. We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs that does not require conditioning on retrieved external knowledge nor additional fine-tuning. Our approach obtains the next-token distribution by contrasting the differences in logits obtained from projecting the later layers versus earlier layers to the vocabulary space, exploiting the fact that factual knowledge in an LLMs has generally been shown to be localized to particular transformer layers. We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts. DoLa consistently improves the truthfulness across multiple choices tasks and open-ended generation tasks, for example improving the performance of LLaMA family models on TruthfulQA by 12-17% absolute points, demonstrating its potential in making LLMs reliably generate truthful facts.
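A minimal NumPy sketch of the layer-contrast step for one decoding position, given logits projected from a final ("mature") and an earlier ("premature") layer. The adaptive-plausibility threshold `alpha` and the random logits are illustrative; the paper additionally selects the premature layer dynamically, which is omitted here.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def dola_next_token(mature_logits, premature_logits, alpha=0.1):
    """Pick the token maximizing log p_mature - log p_premature, restricted
    to the 'plausible' head of the mature distribution (tokens whose mature
    probability is at least alpha times the max)."""
    p_mature = softmax(mature_logits)
    p_premature = softmax(premature_logits)
    plausible = p_mature >= alpha * p_mature.max()
    contrast = np.where(plausible,
                        np.log(p_mature) - np.log(p_premature),
                        -np.inf)
    return int(np.argmax(contrast))

rng = np.random.default_rng(2)
vocab_size = 50
mature = rng.normal(size=vocab_size)                          # final-layer logits
premature = mature + rng.normal(scale=2.0, size=vocab_size)   # early-layer logits
print(dola_next_token(mature, premature))
```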

OpinionGPT: Modelling Explicit Biases in Instruction-Tuned LLMs

  • paper_url: http://arxiv.org/abs/2309.03876
  • repo_url: None
  • paper_authors: Patrick Haller, Ansar Aynetdinov, Alan Akbik
  • for: Makes the biases of instruction-tuned language models explicit and transparent, letting users inspect answers produced under different biases.
  • methods: Fine-tunes the model on texts written by members of specific demographics, then answers questions under whichever biases the user selects.
  • results: The result is OpinionGPT, a web demo in which users ask a question, select the biases they wish to investigate, and compare the resulting answers side by side.
    Abstract Instruction-tuned Large Language Models (LLMs) have recently showcased remarkable ability to generate fitting responses to natural language instructions. However, an open research question concerns the inherent biases of trained models and their responses. For instance, if the data used to tune an LLM is dominantly written by persons with a specific political bias, we might expect generated answers to share this bias. Current research work seeks to de-bias such models, or suppress potentially biased answers. With this demonstration, we take a different view on biases in instruction-tuning: Rather than aiming to suppress them, we aim to make them explicit and transparent. To this end, we present OpinionGPT, a web demo in which users can ask questions and select all biases they wish to investigate. The demo will answer this question using a model fine-tuned on text representing each of the selected biases, allowing side-by-side comparison. To train the underlying model, we identified 11 different biases (political, geographic, gender, age) and derived an instruction-tuning corpus in which each answer was written by members of one of these demographics. This paper presents OpinionGPT, illustrates how we trained the bias-aware model and showcases the web application (available at https://opiniongpt.informatik.hu-berlin.de).

FLM-101B: An Open LLM and How to Train It with $100K Budget

  • paper_url: http://arxiv.org/abs/2309.03852
  • repo_url: None
  • paper_authors: Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Xuying Meng, Siqi Fan, Peng Han, Jing Li, Li Du, Bowen Qin, Zheng Zhang, Aixin Sun, Yequan Wang
  • for: Presents a solution for significantly reducing the training cost of large language models (LLMs) and validates it experimentally.
  • methods: Uses a growth strategy to cut LLM training cost, and evaluates the resulting model with an additional suite of IQ-style tests on top of standard benchmarks.
  • results: Experiments show that the resulting FLM-101B model, trained with a budget of 100K US dollars, performs comparably to powerful, well-known models such as GPT-3 and GLM-130B, especially on the IQ evaluations.
    Abstract Large language models (LLMs) have achieved remarkable success in NLP and multimodal tasks, among others. Despite these successes, two main challenges remain in developing LLMs: (i) high computational cost, and (ii) fair and objective evaluations. In this paper, we report a solution to significantly reduce LLM training cost through a growth strategy. We demonstrate that a 101B-parameter LLM with 0.31T tokens can be trained with a budget of 100K US dollars. Inspired by IQ tests, we also consolidate an additional range of evaluations on top of existing evaluations that focus on knowledge-oriented abilities. These IQ evaluations include symbolic mapping, rule understanding, pattern mining, and anti-interference. Such evaluations minimize the potential impact of memorization. Experimental results show that our model, named FLM-101B, trained with a budget of 100K US dollars, achieves performance comparable to powerful and well-known models, e.g., GPT-3 and GLM-130B, especially on the additional range of IQ evaluations. The checkpoint of FLM-101B is released at https://huggingface.co/CofeAI/FLM-101B.

Uncovering Drift in Textual Data: An Unsupervised Method for Detecting and Mitigating Drift in Machine Learning Models

  • paper_url: http://arxiv.org/abs/2309.03831
  • repo_url: None
  • paper_authors: Saeed Khaki, Akhouri Abhinav Aditya, Zohar Karnin, Lan Ma, Olivia Pan, Samarth Marudheri Chandrashekar
  • For: Proposes an automatic drift-detection method that needs no human annotation, so that degradation in machine learning model performance can be caught and mitigated early.
  • Methods: A two-step approach: first, encode a sample of production data as the target distribution and the model training data as the reference distribution; second, apply a kernel-based statistical test using the maximum mean discrepancy (MMD) distance to compare the two distributions and estimate any potential drift.
  • Results: The method detects drift in production data quickly and accurately and identifies the subset of data that is its root cause; models retrained on these identified high-drift samples show improved performance on online customer-experience quality metrics.
    Abstract Drift in machine learning refers to the phenomenon where the statistical properties of data or context, in which the model operates, change over time leading to a decrease in its performance. Therefore, maintaining a constant monitoring process for machine learning model performance is crucial in order to proactively prevent any potential performance regression. However, supervised drift detection methods require human annotation and consequently lead to a longer time to detect and mitigate the drift. In our proposed unsupervised drift detection method, we follow a two step process. Our first step involves encoding a sample of production data as the target distribution, and the model training data as the reference distribution. In the second step, we employ a kernel-based statistical test that utilizes the maximum mean discrepancy (MMD) distance metric to compare the reference and target distributions and estimate any potential drift. Our method also identifies the subset of production data that is the root cause of the drift. The models retrained using these identified high drift samples show improved performance on online customer experience quality metrics.
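A minimal NumPy sketch of the second step: an RBF-kernel MMD statistic comparing reference (training) and target (production) embeddings. The Gaussian toy data and bandwidth are illustrative, and this is the biased V-statistic estimate rather than a calibrated hypothesis test.

```python
import numpy as np

def mmd_rbf(X, Y, gamma=0.05):
    """Biased (V-statistic) estimate of squared MMD with an RBF kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(3)
reference = rng.normal(0.0, 1.0, size=(200, 8))  # encoded training data
drifted = rng.normal(0.7, 1.0, size=(200, 8))    # encoded production data (shifted)
same = rng.normal(0.0, 1.0, size=(200, 8))       # production data without drift
print(f"drifted vs reference: {mmd_rbf(reference, drifted):.3f}")  # clearly > 0
print(f"same vs reference:    {mmd_rbf(reference, same):.3f}")     # near zero
```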

Training Acceleration of Low-Rank Decomposed Networks using Sequential Freezing and Rank Quantization

  • paper_url: http://arxiv.org/abs/2309.03824
  • repo_url: None
  • paper_authors: Habib Hajimolahoseini, Walid Ahmed, Yang Liu
  • for: Accelerating training and inference of deep learning models without resorting to very small decomposition ranks.
  • methods: Proposes two techniques for accelerating low-rank decomposed models: rank optimization and sequential freezing of decomposed layers.
  • results: Experiments show that, combined, the two techniques improve model throughput by up to 60% during training and 37% during inference while keeping accuracy close to that of the original models.
    Abstract Low Rank Decomposition (LRD) is a model compression technique applied to the weight tensors of deep learning models in order to reduce the number of trainable parameters and computational complexity. However, due to the high number of new layers added to the architecture after applying LRD, it may not lead to a high training/inference acceleration if the decomposition ranks are not small enough. The issue is that using small ranks increases the risk of a significant accuracy drop after decomposition. In this paper, we propose two techniques for accelerating low rank decomposed models without requiring the use of small ranks for decomposition. These methods include rank optimization and sequential freezing of decomposed layers. We perform experiments on both convolutional and transformer-based models. Experiments show that these techniques can improve the model throughput up to 60% during training and 37% during inference when combined together while preserving the accuracy close to that of the original models.
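For context, the sketch below shows the basic LRD step the paper builds on: factorizing a dense weight matrix with a truncated SVD and printing the parameter-count/error trade-off that makes very small ranks risky. The rank values are illustrative; the paper's rank optimization and sequential freezing are not reproduced here.

```python
import numpy as np

def lrd_factorize(W, rank):
    """Split a dense weight W (out x in) into thin factors A (out x rank)
    and B (rank x in) via truncated SVD, so W @ x becomes A @ (B @ x)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))
for r in (16, 64, 256):
    A, B = lrd_factorize(W, r)
    rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
    print(f"rank {r:3d}: {A.size + B.size:6d} params "
          f"(dense: {W.size}), rel. error {rel_err:.3f}")
```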

AnthroNet: Conditional Generation of Humans via Anthropometrics

  • paper_url: http://arxiv.org/abs/2309.03812
  • repo_url: https://github.com/Unity-Technologies/AnthroNet
  • paper_authors: Francesco Picetti, Shrinath Deshpande, Jonathan Leban, Soroosh Shahtalebi, Jay Patel, Peifeng Jing, Chunpu Wang, Charles Metze III, Cameron Sun, Cera Laidlaw, James Warren, Kathy Huynh, River Page, Jonathan Hogins, Adam Crespi, Sujoy Ganguly, Salehe Erfanian Ebadi
  • for: The paper is written for the purpose of presenting a novel human body model that can generate a wide range of human body shapes and poses.
  • methods: The paper uses a deep generative architecture to train the model end-to-end using only synthetically generated data, which provides highly accurate human mesh representations and allows for precise anthropometry of the body.
  • results: The model is capable of producing humans in any arbitrary pose and can be used to generate millions of unique human identities and poses for non-commercial academic research purposes.
    Abstract We present a novel human body model formulated by an extensive set of anthropocentric measurements, which is capable of generating a wide range of human body shapes and poses. The proposed model enables direct modeling of specific human identities through a deep generative architecture, which can produce humans in any arbitrary pose. It is the first of its kind to have been trained end-to-end using only synthetically generated data, which not only provides highly accurate human mesh representations but also allows for precise anthropometry of the body. Moreover, using a highly diverse animation library, we articulated our synthetic humans' body and hands to maximize the diversity of the learnable priors for model training. Our model was trained on a dataset of $100k$ procedurally-generated posed human meshes and their corresponding anthropometric measurements. Our synthetic data generator can be used to generate millions of unique human identities and poses for non-commercial academic research purposes.

Pareto Frontiers in Neural Feature Learning: Data, Compute, Width, and Luck

  • paper_url: http://arxiv.org/abs/2309.03800
  • repo_url: None
  • paper_authors: Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang
  • for: Investigates nuanced algorithm-design choices for deep learning in the presence of computational-statistical gaps.
  • methods: Studies offline sparse parity learning, a supervised classification problem with a statistical-query lower bound for gradient-based training of a multilayer perceptron. The bound can be read as a multi-resource tradeoff frontier: successful learning requires being sufficiently rich (a large model), knowledgeable (a large dataset), patient (many training iterations), or lucky (many random guesses).
  • results: Shows, theoretically and experimentally, that sparse initialization and increased network width yield significant sample-efficiency gains in this setting. Width acts as parallel search: it amplifies the probability of finding "lottery ticket" neurons that learn sparse features more sample-efficiently. Wide, sparsely initialized MLPs also improve sample efficiency on tabular classification benchmarks, sometimes outperforming tuned random forests.
    Abstract This work investigates the nuanced algorithm design choices for deep learning in the presence of computational-statistical gaps. We begin by considering offline sparse parity learning, a supervised classification problem which admits a statistical query lower bound for gradient-based training of a multilayer perceptron. This lower bound can be interpreted as a multi-resource tradeoff frontier: successful learning can only occur if one is sufficiently rich (large model), knowledgeable (large dataset), patient (many training iterations), or lucky (many random guesses). We show, theoretically and experimentally, that sparse initialization and increasing network width yield significant improvements in sample efficiency in this setting. Here, width plays the role of parallel search: it amplifies the probability of finding "lottery ticket" neurons, which learn sparse features more sample-efficiently. Finally, we show that the synthetic sparse parity task can be useful as a proxy for real problems requiring axis-aligned feature learning. We demonstrate improved sample efficiency on tabular classification benchmarks by using wide, sparsely-initialized MLP models; these networks sometimes outperform tuned random forests.
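A minimal generator for the offline sparse parity task described above: labels are the XOR of k hidden coordinates of an n-bit input, so the learner must discover which coordinates matter. The sizes are illustrative; the wide, sparsely initialized MLP training is omitted.

```python
import numpy as np

def sparse_parity(n_samples, n_bits=50, k=3, seed=0):
    """Labels are the parity (XOR) of k hidden coordinates of an n_bits-wide
    input; the learner must discover which k coordinates matter."""
    rng = np.random.default_rng(seed)
    support = rng.choice(n_bits, size=k, replace=False)  # the hidden feature set
    X = rng.integers(0, 2, size=(n_samples, n_bits))
    y = X[:, support].sum(axis=1) % 2
    return X, y, support

X, y, support = sparse_parity(1000)
print("hidden support:", sorted(support), "| label balance:", y.mean())
```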

FisheyePP4AV: A privacy-preserving method for autonomous vehicles on fisheye camera images

  • paper_url: http://arxiv.org/abs/2309.03799
  • repo_url: None
  • paper_authors: Linh Trinh, Bach Ha, Tu Tran
  • for: Protecting privacy in autonomous driving by detecting and anonymizing pedestrian faces and nearby car license plates in fisheye camera images from real road-driving scenarios.
  • methods: Proposes a framework that distills face and license-plate identification knowledge from multiple teacher models, and transforms both images and labels from regular data into fisheye-like data using varied, realistic fisheye transformations.
  • results: Experiments on the PP4AV dataset show the model outperforms baseline methods when trained on autonomous-vehicle data, even when the data are softly labeled.
    Abstract In many parts of the world, the use of vast amounts of data collected on public roadways for autonomous driving has increased. In order to detect and anonymize pedestrian faces and nearby car license plates in actual road-driving scenarios, there is an urgent need for effective solutions. As more data is collected, privacy concerns regarding it increase, including but not limited to pedestrian faces and surrounding vehicle license plates. Normal and fisheye cameras are the two camera types typically mounted on collection vehicles. Owing to their complex distortion models, fisheye camera images are deformed relative to regular images, which degrades the performance of many deep learning models on computer vision tasks. In this work, we pay particular attention to protecting privacy while adhering to several laws for fisheye camera photos taken by driverless vehicles. First, we suggest a framework for extracting face and plate identification knowledge from several teacher models. Second, we transform both the image and the label from a regular image to fisheye-like data using a varied and realistic fisheye transformation. Finally, we run a test using the open-source PP4AV dataset. The experimental findings demonstrate that our model outperformed baseline methods when trained on data from autonomous vehicles, even when the data were softly labeled. The implementation code is available at our GitHub: https://github.com/khaclinh/FisheyePP4AV.
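As a rough sketch of the fisheye transformation idea, the code below resamples a rectilinear image through an equidistant fisheye model (r = f * theta) via inverse mapping. A single fixed focal length is used for illustration, whereas the paper samples varied, realistic distortion parameters and transforms labels consistently with the images.

```python
import numpy as np

def fisheye_warp(img, f=200.0):
    """Resample a rectilinear image through an equidistant fisheye model
    (r_fisheye = f * theta) using inverse mapping with nearest-neighbor
    lookup; one fixed focal length f for illustration."""
    H, W = img.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    dx, dy = xs - W / 2.0, ys - H / 2.0
    r_fish = np.hypot(dx, dy)
    theta = r_fish / f                        # incidence angle under the model
    valid = theta < np.pi / 2
    r_rect = np.where(valid, f * np.tan(np.minimum(theta, 1.5)), 0.0)
    scale = np.where(r_fish > 0, r_rect / np.maximum(r_fish, 1e-9), 1.0)
    src_x = np.clip((W / 2.0 + dx * scale).astype(int), 0, W - 1)
    src_y = np.clip((H / 2.0 + dy * scale).astype(int), 0, H - 1)
    out = np.zeros_like(img)
    out[valid] = img[src_y[valid], src_x[valid]]
    return out

img = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)
print(fisheye_warp(img).shape)  # (256, 256, 3); labels must be warped the same way
```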

CPU frequency scheduling of real-time applications on embedded devices with temporal encoding-based deep reinforcement learning

  • paper_url: http://arxiv.org/abs/2309.03779
  • repo_url: https://github.com/coladog/tinyagent
  • paper_authors: Ti Zhou, Man Lin
  • for: Developing energy-efficient power management for periodic tasks with soft deadlines on small embedded devices.
  • methods: First studies the limitations of Linux's built-in DVFS methods on small devices and identifies three typical workload/system patterns that are challenging for the built-in solutions. Then develops a reinforcement-learning technique with temporal encoding to derive an effective DVFS governor that uses only one performance counter, as the built-in Linux mechanism does, and needs no explicit task model.
  • results: A prototype on the Nvidia Jetson Nano board was evaluated with six applications (two self-designed and four benchmarks). Under different deadline constraints, the approach quickly derives a governor that adapts to performance requirements and outperforms the Linux built-in approach in energy saving: 3%-11% over Ondemand on Mibench workloads with performance slack of 0.04 s to 0.4 s, and 5%-14% for the AudioReg and FaceReg applications. The in-kernel quantized neural network engine is open-sourced at https://github.com/coladog/tinyagent.
    Abstract Small devices are frequently used in IoT and smart-city applications to perform periodic dedicated tasks with soft deadlines. This work focuses on developing methods to derive efficient power-management methods for periodic tasks on small devices. We first study the limitations of the existing Linux built-in methods used in small devices. We illustrate three typical workload/system patterns that are challenging to manage with Linux's built-in solutions. We develop a reinforcement-learning-based technique with temporal encoding to derive an effective DVFS governor even with the presence of the three system patterns. The derived governor uses only one performance counter, the same as the built-in Linux mechanism, and does not require an explicit task model for the workload. We implemented a prototype system on the Nvidia Jetson Nano Board and evaluated it with six applications, including two self-designed and four benchmark applications. Under different deadline constraints, our approach can quickly derive a DVFS governor that can adapt to performance requirements and outperform the built-in Linux approach in energy saving. On Mibench workloads, with performance slack ranging from 0.04 s to 0.4 s, the proposed method can save 3% - 11% more energy compared to Ondemand. AudioReg and FaceReg applications tested have 5% - 14% energy-saving improvement. We have open-sourced the implementation of our in-kernel quantized neural network engine. The codebase can be found at: https://github.com/coladog/tinyagent.
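A toy sketch of the RL-governor idea: a temporally encoded window of a performance counter forms the state, frequency levels are actions, and the reward trades off a toy energy model against a deadline penalty. Everything here (frequency levels, encoding, power model, one-step update) is illustrative, not the paper's in-kernel design.

```python
import numpy as np

rng = np.random.default_rng(6)
FREQS = [0.6, 1.0, 1.5]               # illustrative frequency levels (GHz)
N_STATES = 8
Q = np.zeros((N_STATES, len(FREQS)))  # state x action value table

def encode(window):
    """Toy temporal encoding: bucket the recent trend of the counter."""
    trend = window[-1] - window[0]
    return int(np.clip(np.floor(trend + 4), 0, N_STATES - 1))

for _ in range(2000):
    window = rng.normal(0.0, 1.5, size=4)     # stand-in performance-counter reads
    s = encode(window)
    a = rng.integers(len(FREQS)) if rng.random() < 0.1 else int(Q[s].argmax())
    runtime = (1.0 + 0.1 * window[-1]) / FREQS[a]   # heavier load -> longer runtime
    energy = FREQS[a] ** 2 * runtime                # toy dynamic-power model
    reward = -energy - (5.0 if runtime > 1.2 else 0.0)  # soft-deadline penalty
    Q[s, a] += 0.1 * (reward - Q[s, a])             # one-step (bandit-style) update

print("frequency chosen per encoded state:",
      [FREQS[int(a)] for a in Q.argmax(axis=1)])
```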

Extending Transductive Knowledge Graph Embedding Models for Inductive Logical Relational Inference

  • paper_url: http://arxiv.org/abs/2309.03773
  • repo_url: https://github.com/tgebhart/sheaf_kg_transind
  • paper_authors: Thomas Gebhart, John Cobb
  • for: bridging the gap between transductive and inductive knowledge graph embedding methods
  • methods: leveraging representations learned through transductive embedding methods to infer representations of new entities in the inductive setting
  • results: competitive with or outperforming state-of-the-art models derived explicitly for inductive tasks in experiments on large-scale knowledge graph embedding benchmarks
    Abstract Many downstream inference tasks for knowledge graphs, such as relation prediction, have been handled successfully by knowledge graph embedding techniques in the transductive setting. To address the inductive setting wherein new entities are introduced into the knowledge graph at inference time, more recent work opts for models which learn implicit representations of the knowledge graph through a complex function of a network's subgraph structure, often parametrized by graph neural network architectures. These come at the cost of increased parametrization, reduced interpretability and limited generalization to other downstream inference tasks. In this work, we bridge the gap between traditional transductive knowledge graph embedding approaches and more recent inductive relation prediction models by introducing a generalized form of harmonic extension which leverages representations learned through transductive embedding methods to infer representations of new entities introduced at inference time as in the inductive setting. This harmonic extension technique provides the best such approximation, can be implemented via an efficient iterative scheme, and can be employed to answer a family of conjunctive logical queries over the knowledge graph, further expanding the capabilities of transductive embedding methods. In experiments on a number of large-scale knowledge graph embedding benchmarks, we find that this approach for extending the functionality of transductive knowledge graph embedding models to perform knowledge graph completion and answer logical queries in the inductive setting is competitive with--and in some scenarios outperforms--several state-of-the-art models derived explicitly for such inductive tasks.
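A minimal NumPy sketch of harmonic extension on a plain graph Laplacian: embeddings of known (transductively trained) nodes stay fixed while the unseen nodes' embeddings solve the resulting Laplacian system. The toy path graph is illustrative, and the paper's generalized (sheaf-style) extension and efficient iterative scheme are not reproduced.

```python
import numpy as np

def harmonic_extension(L, known_idx, known_emb):
    """Solve L_uu X_u = -L_uk X_k: unseen-node embeddings minimize the
    Laplacian quadratic form with the known embeddings held fixed."""
    n = L.shape[0]
    unknown_idx = np.setdiff1d(np.arange(n), known_idx)
    L_uu = L[np.ix_(unknown_idx, unknown_idx)]
    L_uk = L[np.ix_(unknown_idx, known_idx)]
    return unknown_idx, np.linalg.solve(L_uu, -L_uk @ known_emb)

# Toy path graph 0-1-2-3; nodes 0 and 3 carry transductive embeddings.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
idx, X_u = harmonic_extension(L, np.array([0, 3]), np.array([[0.0], [3.0]]))
print(idx, X_u.ravel())  # nodes [1 2] interpolate smoothly to [1. 2.]
```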

Hybrid of representation learning and reinforcement learning for dynamic and complex robotic motion planning

  • paper_url: http://arxiv.org/abs/2309.03758
  • repo_url: None
  • paper_authors: Chengmin Zhou, Xin Lu, Jiapeng Dai, Bingding Huang, Xiaoxu Liu, Pasi Fränti
  • for: This paper proposes a hybrid algorithm for robotic motion planning that combines long short-term memory (LSTM) pooling and skip connection for attention-based discrete soft actor critic (LSA-DSAC).
  • methods: The proposed algorithm uses a graph network and attention network to interpret the environmental state, and integrates skip connection to mitigate overfitting and improve convergence speed.
  • results: The proposed LSA-DSAC algorithm outperforms the state-of-the-art in training and most evaluations, and is successfully implemented and tested on a physical robot in the real world.
    Abstract Motion planning is the soul of robot decision making. Classical planning algorithms like graph search and reaction-based algorithms face challenges in cases of dense and dynamic obstacles. Deep learning algorithms generate suboptimal one-step predictions that cause many collisions. Reinforcement learning algorithms generate optimal or near-optimal time-sequential predictions, but they suffer from slow convergence, suboptimal converged results, and overfitting. This paper introduces a hybrid algorithm for robotic motion planning: long short-term memory (LSTM) pooling and skip connection for attention-based discrete soft actor critic (LSA-DSAC). First, a graph network (relational graph) and an attention network (attention weights) interpret the environmental state for the learning of the discrete soft actor critic algorithm. A difference analysis of these two representation methods shows that the expressive power of the attention network outperforms that of the graph network in our task. However, attention-based DSAC faces an overfitting problem in training. Second, the skip connection method is integrated into attention-based DSAC to mitigate overfitting and improve convergence speed. Third, LSTM pooling replaces the sum operator over attention weights, eliminating overfitting at the cost of slightly slower convergence in early-stage training. Experiments show that LSA-DSAC outperforms the state-of-the-art in training and most evaluations. The physical robot was also implemented and tested in the real world.

TSGBench: Time Series Generation Benchmark

  • paper_url: http://arxiv.org/abs/2309.03755
  • repo_url: None
  • paper_authors: Yihao Ang, Qiang Huang, Yifan Bao, Anthony K. H. Tung, Zhiyong Huang
  • for: Provides a unified and comprehensive benchmark for evaluating and improving time series generation (TSG) methods.
  • methods: Evaluates ten advanced TSG methods with twelve evaluation measures, including standard measures, new distance-based assessments, and visualization tools, across ten real-world datasets.
  • results: \textsf{TSGBench} delivers a unified, comprehensive assessment of TSG methods and a statistical breakdown of method rankings, illuminating performance variation across datasets and measures for more nuanced method comparison.
    Abstract Synthetic Time Series Generation (TSG) is crucial in a range of applications, including data augmentation, anomaly detection, and privacy preservation. Although significant strides have been made in this field, existing methods exhibit three key limitations: (1) They often benchmark against similar model types, constraining a holistic view of performance capabilities. (2) The use of specialized synthetic and private datasets introduces biases and hampers generalizability. (3) Ambiguous evaluation measures, often tied to custom networks or downstream tasks, hinder consistent and fair comparison. To overcome these limitations, we introduce \textsf{TSGBench}, the inaugural TSG Benchmark, designed for a unified and comprehensive assessment of TSG methods. It comprises three modules: (1) a curated collection of publicly available, real-world datasets tailored for TSG, together with a standardized preprocessing pipeline; (2) a comprehensive evaluation measures suite including vanilla measures, new distance-based assessments, and visualization tools; (3) a pioneering generalization test rooted in Domain Adaptation (DA), compatible with all methods. We have conducted extensive experiments across ten real-world datasets from diverse domains, utilizing ten advanced TSG methods and twelve evaluation measures, all gauged through \textsf{TSGBench}. The results highlight its remarkable efficacy and consistency. More importantly, \textsf{TSGBench} delivers a statistical breakdown of method rankings, illuminating performance variations across different datasets and measures, and offering nuanced insights into the effectiveness of each method.

Enhancing Pipeline-Based Conversational Agents with Large Language Models

  • paper_url: http://arxiv.org/abs/2309.03748
  • repo_url: None
  • paper_authors: Mina Foosherian, Hendrik Purwins, Purna Rathnayake, Touhidul Alam, Rui Teimao, Klaus-Dieter Thoben
  • for: Investigates how large language models (LLMs) can enhance pipeline-based conversational agents.
  • methods: Examines LLM capabilities in two phases: in design and development, LLMs can aid in generating training data, extracting entities and synonyms, localization, and persona design; during operations, they can assist with contextualization, intent classification to prevent conversational breakdown and handle out-of-scope questions, auto-correcting utterances, rephrasing responses, formulating disambiguation questions, summarization, and closed question answering.
  • results: Informal experiments with GPT-4 in the private banking domain demonstrate the scenarios above with a practical example. Because privacy concerns and the need for deep integration with existing ecosystems may make companies hesitant to replace pipeline-based agents entirely, a hybrid approach that integrates LLMs into pipeline-based agents saves the time and cost of building and running agents while retaining the integration and privacy safeguards of existing systems.
    Abstract The latest advancements in AI and deep learning have led to a breakthrough in large language model (LLM)-based agents such as GPT-4. However, many commercial conversational agent development tools are pipeline-based and have limitations in holding a human-like conversation. This paper investigates the capabilities of LLMs to enhance pipeline-based conversational agents during two phases: 1) in the design and development phase and 2) during operations. In 1) LLMs can aid in generating training data, extracting entities and synonyms, localization, and persona design. In 2) LLMs can assist in contextualization, intent classification to prevent conversational breakdown and handle out-of-scope questions, auto-correcting utterances, rephrasing responses, formulating disambiguation questions, summarization, and enabling closed question-answering capabilities. We conducted informal experiments with GPT-4 in the private banking domain to demonstrate the scenarios above with a practical example. Companies may be hesitant to replace their pipeline-based agents with LLMs entirely due to privacy concerns and the need for deep integration within their existing ecosystems. A hybrid approach in which LLMs are integrated into the pipeline-based agents allows them to save time and costs of building and running agents by capitalizing on the capabilities of LLMs while retaining the integration and privacy safeguards of their existing systems.

A Natural Gas Consumption Forecasting System for Continual Learning Scenarios based on Hoeffding Trees with Change Point Detection Mechanism

  • paper_url: http://arxiv.org/abs/2309.03720
  • repo_url: https://github.com/rasvob/hoeffding-trees-with-cpd-multistep-forecasing
  • paper_authors: Radek Svoboda, Sebastian Basterrech, Jędrzej Kozal, Jan Platoš, Michał Woźniak
  • for: Forecasting natural gas consumption, accounting for seasonality and trends, is crucial for industrial entities planning gas supply and consumption and optimizing procurement costs; in times of supply threats it is also a key element of societal energy security.
  • methods: Introduces a novel multistep-ahead forecasting method for natural gas consumption that integrates change point detection for model-collection selection with continual learning over data streams, using Hoeffding tree predictors as forecasting models and the Pruned Exact Linear Time (PELT) algorithm for change point detection; several model-collection selection procedures are defined and evaluated.
  • results: Experiments show that fewer detected change points yield lower forecasting error regardless of the selection procedure, and that simpler selection procedures omitting forecasting-error feedback produce more robust models better suited to continual learning tasks.
    Abstract Forecasting natural gas consumption, considering seasonality and trends, is crucial in planning its supply and consumption and optimizing the cost of obtaining it, mainly by industrial entities. However, in times of threats to its supply, it is also a critical element that guarantees the supply of this raw material to meet individual consumers' needs, ensuring society's energy security. This article introduces a novel multistep ahead forecasting of natural gas consumption with change point detection integration for model collection selection with continual learning capabilities using data stream processing. The performance of the forecasting models based on the proposed approach is evaluated in a complex real-world use case of natural gas consumption forecasting. We employed Hoeffding tree predictors as forecasting models and the Pruned Exact Linear Time (PELT) algorithm for the change point detection procedure. The change point detection integration enables selecting a different model collection for successive time frames. Thus, three model collection selection procedures (with and without an error feedback loop) are defined and evaluated for forecasting scenarios with various densities of detected change points. These models were compared with change point agnostic baseline approaches. Our experiments show that fewer change points result in a lower forecasting error regardless of the model collection selection procedure employed. Also, simpler model collection selection procedures omitting forecasting error feedback leads to more robust forecasting models suitable for continual learning tasks.
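
As a hedged sketch of the two building blocks named above (not the authors' released code; the window size, penalty, and toy series are assumptions), PELT change point detection from ruptures can be combined with an incrementally trained Hoeffding tree from river:

```python
import numpy as np
import ruptures as rpt
from river import tree

rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(10, 1, 200), rng.normal(14, 1, 200)])

# 1) Offline change point detection on the history seen so far; in the
#    full system, a new model collection is selected for the time frame
#    after each detected change point.
breakpoints = rpt.Pelt(model="rbf").fit(series).predict(pen=10)
print("detected change points:", breakpoints)  # e.g. [200, 400]

# 2) Data-stream forecasting with a Hoeffding tree on lagged features,
#    in a test-then-train fashion.
model = tree.HoeffdingTreeRegressor()
lags = 3
for t in range(lags, len(series)):
    x = {f"lag_{k}": series[t - k] for k in range(1, lags + 1)}
    y_hat = model.predict_one(x)   # forecast before seeing the target
    model.learn_one(x, series[t])  # then update incrementally
```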

PyGraft: Configurable Generation of Schemas and Knowledge Graphs at Your Fingertips

  • paper_url: http://arxiv.org/abs/2309.03685
  • repo_url: https://github.com/nicolas-hbt/pygraft
  • paper_authors: Nicolas Hubert, Pierre Monnin, Mathieu d’Aquin, Armelle Brun, Davy Monticolo
  • for: This work provides a tool for generating schemas and knowledge graphs (KGs), so that models for graph-based machine learning (ML) can be evaluated and tested on a more diverse array of resources than the few standard benchmark KGs.
  • methods: The authors develop PyGraft, a Python-based tool that generates highly customized, domain-agnostic schemas and KGs of varying characteristics and scale in a single pipeline; logical consistency of the generated resources is ensured by running a description logic (DL) reasoner.
  • results: KGs generated with PyGraft enable a broader and more holistic evaluation of graph-based ML models, going beyond the limited collection of publicly available benchmarks, which is especially valuable in data-sensitive fields such as education or medicine.
    Abstract Knowledge graphs (KGs) have emerged as a prominent data representation and management paradigm. Being usually underpinned by a schema (e.g. an ontology), KGs capture not only factual information but also contextual knowledge. In some tasks, a few KGs established themselves as standard benchmarks. However, recent works outline that relying on a limited collection of datasets is not sufficient to assess the generalization capability of an approach. In some data-sensitive fields such as education or medicine, access to public datasets is even more limited. To remedy the aforementioned issues, we release PyGraft, a Python-based tool that generates highly customized, domain-agnostic schemas and knowledge graphs. The synthesized schemas encompass various RDFS and OWL constructs, while the synthesized KGs emulate the characteristics and scale of real-world KGs. Logical consistency of the generated resources is ultimately ensured by running a description logic (DL) reasoner. By providing a way of generating both a schema and KG in a single pipeline, PyGraft's aim is to empower the generation of a more diverse array of KGs for benchmarking novel approaches in areas such as graph-based machine learning (ML), or more generally KG processing. In graph-based ML in particular, this should foster a more holistic evaluation of model performance and generalization capability, thereby going beyond the limited collection of available benchmarks. PyGraft is available at: https://github.com/nicolas-hbt/pygraft.
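
PyGraft's own API is not reproduced here. As a hedged illustration of the kind of artifact it automates at scale, the sketch below builds a tiny RDFS/OWL schema plus a conforming KG programmatically with rdflib (all entity names are made up for the example):

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)

# Schema: two classes, a subclass axiom, and one typed object property.
g.add((EX.Person, RDF.type, OWL.Class))
g.add((EX.Researcher, RDF.type, OWL.Class))
g.add((EX.Researcher, RDFS.subClassOf, EX.Person))
g.add((EX.supervises, RDF.type, OWL.ObjectProperty))
g.add((EX.supervises, RDFS.domain, EX.Researcher))
g.add((EX.supervises, RDFS.range, EX.Person))

# KG facts conforming to the schema.
g.add((EX.alice, RDF.type, EX.Researcher))
g.add((EX.bob, RDF.type, EX.Person))
g.add((EX.alice, EX.supervises, EX.bob))

print(g.serialize(format="turtle"))
```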

Dataset Generation and Bonobo Classification from Weakly Labelled Videos

  • paper_url: http://arxiv.org/abs/2309.03671
  • repo_url: None
  • paper_authors: Pierre-Etienne Martin
  • for: The study develops a bonobo detection and classification pipeline from commonly used machine learning methods, motivated by the need to test bonobos in their enclosure on touch-screen devices without human assistance.
  • methods: A newly acquired dataset of semi-automatically generated, weakly labelled bonobo recordings is introduced; the recordings are fed to a macaque detector to spatially localize the individual in each video, and handcrafted features with different classification algorithms as well as deep learning methods based on a ResNet architecture are investigated for bonobo identification.
  • results: With a meaningful separation of the data, a fine-tuned ResNet model reaches 75% accuracy; the study also demonstrates the importance of data preparation and shows how a wrong data split can lead to deceptively good results.
    Abstract This paper presents a bonobo detection and classification pipeline built from the commonly used machine learning methods. Such application is motivated by the need to test bonobos in their enclosure using touch screen devices without human assistance. This work introduces a newly acquired dataset based on bonobo recordings generated semi-automatically. The recordings are weakly labelled and fed to a macaque detector in order to spatially detect the individual present in the video. Handcrafted features coupled with different classification algorithms and deep-learning methods using a ResNet architecture are investigated for bonobo identification. Performance is compared in terms of classification accuracy on the splits of the database using different data separation methods. We demonstrate the importance of data preparation and how a wrong data separation can lead to false good results. Finally, after a meaningful separation of the data, the best classification performance is obtained using a fine-tuned ResNet model and reaches 75% of accuracy.
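
A minimal PyTorch sketch of the fine-tuning setup described above (the class count, learning rate, and input size are assumptions; the paper's exact configuration is not shown here). The comment on splitting reflects the paper's warning about data separation:

```python
import torch
import torch.nn as nn
from torchvision import models

# Crucially, train/test must be split at the video (or individual) level:
# splitting at the frame level leaks near-duplicate frames across splits
# and produces deceptively good accuracy.
num_classes = 5  # assumption: number of individual bonobos to identify
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One fine-tuning step; images: (B, 3, 224, 224), labels: (B,)."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```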

How adversarial attacks can disrupt seemingly stable accurate classifiers

  • paper_url: http://arxiv.org/abs/2309.03665
  • repo_url: None
  • paper_authors: Oliver J. Sutton, Qinghua Zhou, Ivan Y. Tyukin, Alexander N. Gorban, Alexander Bastounis, Desmond J. Higham
  • for: The paper studies adversarial attacks, in which seemingly inconsequential modifications of input data dramatically change the output of an otherwise accurate model.
  • methods: It introduces a simple, generic, and generalisable framework that explains key behaviours observed in practical systems, notably the simultaneous susceptibility to small, easily constructed adversarial perturbations and robustness to large random perturbations of the input.
  • results: Even models robust to large additive random noise remain vulnerable to small adversarial perturbations; small margins between a classifier's decision surface and the data can hide this susceptibility from randomly sampled perturbations, so additive noise during training or testing is inefficient for eradicating or detecting adversarial examples, and more demanding adversarial training is required.
    Abstract Adversarial attacks dramatically change the output of an otherwise accurate learning system using a seemingly inconsequential modification to a piece of input data. Paradoxically, empirical evidence indicates that even systems which are robust to large random perturbations of the input data remain susceptible to small, easily constructed, adversarial perturbations of their inputs. Here, we show that this may be seen as a fundamental feature of classifiers working with high dimensional input data. We introduce a simple generic and generalisable framework for which key behaviours observed in practical systems arise with high probability -- notably the simultaneous susceptibility of the (otherwise accurate) model to easily constructed adversarial attacks, and robustness to random perturbations of the input data. We confirm that the same phenomena are directly observed in practical neural networks trained on standard image classification problems, where even large additive random noise fails to trigger the adversarial instability of the network. A surprising takeaway is that even small margins separating a classifier's decision surface from training and testing data can hide adversarial susceptibility from being detected using randomly sampled perturbations. Counterintuitively, using additive noise during training or testing is therefore inefficient for eradicating or detecting adversarial examples, and more demanding adversarial training is required.
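
The contrast the paper formalizes is easy to reproduce empirically. The sketch below compares a random perturbation with an FGSM adversarial perturbation of the same L-infinity size (FGSM is a standard attack used here for illustration, not the paper's framework; the tiny demo model is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm_perturbation(model, x, y, eps):
    """Adversarial perturbation: eps times the sign of the loss gradient."""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return eps * x.grad.sign()

def random_perturbation(x, eps):
    """Random perturbation of the same L-infinity norm."""
    return eps * torch.empty_like(x).uniform_(-1, 1).sign()

# In the paper's setting, the same eps that leaves predictions unchanged
# under random noise often flips them under the gradient-aligned attack.
model = nn.Linear(10, 2)
x, y = torch.randn(1, 10), torch.tensor([0])
print(model(x + random_perturbation(x, 0.1)).argmax().item())
print(model(x + fgsm_perturbation(model, x, y, 0.1)).argmax().item())
```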

Towards Comparable Knowledge Distillation in Semantic Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.03659
  • repo_url: None
  • paper_authors: Onno Niemann, Christopher Vox, Thorben Werner
  • for: The study addresses large model sizes and slow inference in semantic image segmentation through knowledge distillation (KD), and aims to make research in this area comparable.
  • methods: It surveys 25 distillation loss terms proposed in 14 publications over the last 4 years, and re-evaluates two widely accepted frameworks, SKD and IFVD, under sufficient hyperparameter tuning; solid baselines are established for three datasets and two student models, together with extensive information on the tuning.
  • results: Published comparisons can be misleading: with the same models and dataset, SSTKD reports a student mIoU gain of 4.54 and a final performance of 29.19, while APD improves the student by only 2.06 percentage points yet reaches a final performance of 39.25. Such extreme differences usually stem from suboptimal hyperparameters and an underperforming reference student; the improvements of SKD and IFVD vanish once hyperparameters are tuned sufficiently, and only two out of eight techniques can compete with the simple baseline on ADE20K.
    Abstract Knowledge Distillation (KD) is one proposed solution to large model sizes and slow inference speed in semantic segmentation. In our research we identify 25 proposed distillation loss terms from 14 publications in the last 4 years. Unfortunately, a comparison of terms based on published results is often impossible, because of differences in training configurations. A good illustration of this problem is the comparison of two publications from 2022. Using the same models and dataset, Structural and Statistical Texture Distillation (SSTKD) reports an increase of student mIoU of 4.54 and a final performance of 29.19, while Adaptive Perspective Distillation (APD) only improves student performance by 2.06 percentage points, but achieves a final performance of 39.25. The reason for such extreme differences is often a suboptimal choice of hyperparameters and a resulting underperformance of the student model used as reference point. In our work, we reveal problems of insufficient hyperparameter tuning by showing that distillation improvements of two widely accepted frameworks, SKD and IFVD, vanish when hyperparameters are optimized sufficiently. To improve comparability of future research in the field, we establish a solid baseline for three datasets and two student models and provide extensive information on hyperparameter tuning. We find that only two out of eight techniques can compete with our simple baseline on the ADE20K dataset.
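
To make the objective concrete, here is a minimal pixel-wise logit-distillation loss of the kind the surveyed loss terms build on (a generic KD loss, not any specific paper's term; the temperature is an assumed hyperparameter):

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between softened per-pixel class distributions.
    Both logit tensors have shape (B, C, H, W)."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```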

Anatomy-informed Data Augmentation for Enhanced Prostate Cancer Detection

  • paper_url: http://arxiv.org/abs/2309.03652
  • repo_url: https://github.com/mic-dkfz/anatomy_informed_da
  • paper_authors: Balint Kovacs, Nils Netzer, Michael Baumgartner, Carolin Eith, Dimitrios Bounias, Clara Meinzer, Paul F. Jaeger, Kevin S. Zhang, Ralf Floca, Adrian Schrader, Fabian Isensee, Regula Gnirs, Magdalena Goertz, Viktoria Schuetz, Albrecht Stenzinger, Markus Hohenfellner, Heinz-Peter Schlemmer, Ivo Wolf, David Bonekamp, Klaus H. Maier-Hein
  • for: The study improves data augmentation (DA) for medical image analysis, specifically for prostate cancer (PCa) detection on magnetic resonance images, where simplistic spatial transformations limit organ and tumor shape variability in the training set.
  • methods: It proposes a new anatomy-informed transformation that leverages information from adjacent organs to simulate typical physiological deformations of the prostate, generating unique lesion shapes without altering their labels; the augmentation is computationally lightweight and integrates easily into common DA frameworks.
  • results: Evaluated on 774 biopsy-confirmed examinations with a state-of-the-art PCa detection method under different augmentation settings, the proposed augmentation is shown to be effective.
    Abstract Data augmentation (DA) is a key factor in medical image analysis, such as in prostate cancer (PCa) detection on magnetic resonance images. State-of-the-art computer-aided diagnosis systems still rely on simplistic spatial transformations to preserve the pathological label post transformation. However, such augmentations do not substantially increase the organ as well as tumor shape variability in the training set, limiting the model's ability to generalize to unseen cases with more diverse localized soft-tissue deformations. We propose a new anatomy-informed transformation that leverages information from adjacent organs to simulate typical physiological deformations of the prostate and generates unique lesion shapes without altering their label. Due to its lightweight computational requirements, it can be easily integrated into common DA frameworks. We demonstrate the effectiveness of our augmentation on a dataset of 774 biopsy-confirmed examinations, by evaluating a state-of-the-art method for PCa detection with different augmentation settings.
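
A generic smooth elastic deformation conveys the flavor of such augmentations (this is not the anatomy-informed transform itself, which additionally exploits adjacent-organ information; all parameters are assumptions). Note how the segmentation mask is warped with the image so the annotation stays aligned:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, mask, alpha=15.0, sigma=6.0, seed=0):
    """Warp image and segmentation mask with the same smooth random field."""
    rng = np.random.default_rng(seed)
    shape = image.shape
    dx = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    yy, xx = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]), indexing="ij")
    coords = [yy + dy, xx + dx]
    warped_image = map_coordinates(image, coords, order=1)  # bilinear
    warped_mask = map_coordinates(mask, coords, order=0)    # nearest: keep labels
    return warped_image, warped_mask

image = np.random.rand(128, 128)
mask = (image > 0.7).astype(np.uint8)
aug_image, aug_mask = elastic_deform(image, mask)
```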

Learning of Generalizable and Interpretable Knowledge in Grid-Based Reinforcement Learning Environments

  • paper_url: http://arxiv.org/abs/2309.03651
  • repo_url: https://github.com/manueleberhardinger/ec-rl
  • paper_authors: Manuel Eberhardinger, Johannes Maucher, Setareh Maghsudi
  • for: The paper aims to make the behaviour of deep reinforcement learning agents understandable before deployment in games or the real world: in games, unreasonable actions confuse players, while in the real world unexpected behaviour can cause accidents with grave, long-lasting consequences.
  • methods: It uses program synthesis to imitate reinforcement learning policies from observed action-sequence trajectories; programs are inherently interpretable and verifiable. The state-of-the-art program synthesis system DreamCoder is adapted to learn concepts in grid-based environments: a navigation task and miniature versions of the Atari games Space Invaders and Asterix.
  • results: Inspecting the generated libraries allows inferences about the concepts the black-box agent has learned, and visualizing the agent's decision-making for the imitated sequences further clarifies its behaviour. The approach is evaluated with different types of program synthesizers: a search-only method, a neural-guided search, and a language model fine-tuned on code.
    Abstract Understanding the interactions of agents trained with deep reinforcement learning is crucial for deploying agents in games or the real world. In the former, unreasonable actions confuse players. In the latter, that effect is even more significant, as unexpected behavior cause accidents with potentially grave and long-lasting consequences for the involved individuals. In this work, we propose using program synthesis to imitate reinforcement learning policies after seeing a trajectory of the action sequence. Programs have the advantage that they are inherently interpretable and verifiable for correctness. We adapt the state-of-the-art program synthesis system DreamCoder for learning concepts in grid-based environments, specifically, a navigation task and two miniature versions of Atari games, Space Invaders and Asterix. By inspecting the generated libraries, we can make inferences about the concepts the black-box agent has learned and better understand the agent's behavior. We achieve the same by visualizing the agent's decision-making process for the imitated sequences. We evaluate our approach with different types of program synthesizers based on a search-only method, a neural-guided search, and a language model fine-tuned on code.
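
A toy illustration of the policy-as-program idea, vastly simpler than DreamCoder (the grid, predicates, and trajectory are made up): enumerate tiny rule-based programs over a mini-DSL and keep the first one that reproduces the observed action sequence.

```python
from itertools import product

# Observed (state, action) pairs from a black-box agent on a 1x3 grid:
# move right until the wall, then move down.
trajectory = [((0, 0), "right"), ((0, 1), "right"), ((0, 2), "down")]
WIDTH = 3

predicates = {
    "at_right_wall": lambda s: s[1] == WIDTH - 1,
    "always": lambda s: True,
}
actions = ["right", "down", "left", "up"]

def run(program, state):
    """program: list of (predicate_name, action); first matching rule fires."""
    for pred, action in program:
        if predicates[pred](state):
            return action
    return None

best = None
for rules in product(product(predicates, actions), repeat=2):
    if all(run(list(rules), s) == a for s, a in trajectory):
        best = list(rules)
        break
print(best)  # [('at_right_wall', 'down'), ('always', 'right')]
```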

Large-Scale Automatic Audiobook Creation

  • paper_url: http://arxiv.org/abs/2309.03926
  • repo_url: None
  • paper_authors: Brendan Walsh, Mark Hamilton, Greg Newby, Xi Wang, Serena Ruan, Sheng Zhao, Lei He, Shaofei Zhang, Eric Dettinger, William T. Freeman, Markus Weimer
  • for: This paper aims to improve the accessibility and engagement of literature by automatically generating high-quality audiobooks from online e-books.
  • methods: The authors use recent advances in neural text-to-speech to create and release thousands of human-quality, open-license audiobooks from the Project Gutenberg e-book collection. They identify the proper subset of e-book content to read and can operate on hundreds of books in parallel, allowing users to customize the speaking speed, style, and emotional intonation of the audiobooks.
  • results: The authors contributed over five thousand open-license audiobooks and an interactive demo that allows users to quickly create their own customized audiobooks. To listen to the audiobook collection, visit https://aka.ms/audiobook.
    Abstract An audiobook can dramatically improve a work of literature's accessibility and improve reader engagement. However, audiobooks can take hundreds of hours of human effort to create, edit, and publish. In this work, we present a system that can automatically generate high-quality audiobooks from online e-books. In particular, we leverage recent advances in neural text-to-speech to create and release thousands of human-quality, open-license audiobooks from the Project Gutenberg e-book collection. Our method can identify the proper subset of e-book content to read for a wide collection of diversely structured books and can operate on hundreds of books in parallel. Our system allows users to customize an audiobook's speaking speed and style, emotional intonation, and can even match a desired voice using a small amount of sample audio. This work contributed over five thousand open-license audiobooks and an interactive demo that allows users to quickly create their own customized audiobooks. To listen to the audiobook collection visit https://aka.ms/audiobook.

Promoting Fairness in GNNs: A Characterization of Stability

  • paper_url: http://arxiv.org/abs/2309.03648
  • repo_url: None
  • paper_authors: Yaning Jia, Chunhui Zhang
  • for: The study proposes a method for stabilizing the outputs of Graph Neural Networks (GNNs) so that fair training is possible on non-Euclidean data with inherent biases.
  • methods: It formulates a Lipschitz bound that limits changes in a GNN's output with respect to biases associated with the input, and theoretically analyzes how the Lipschitz constant constrains output perturbations induced by biases learned from data.
  • results: Experiments validate that the Lipschitz bound effectively limits biases in the model output, and a training-dynamics analysis shows why the theoretical bound guides GNN training toward a better trade-off between accuracy and fairness.
    Abstract The Lipschitz bound, a technique from robust statistics, can limit the maximum changes in the output concerning the input, taking into account associated irrelevant biased factors. It is an efficient and provable method for examining the output stability of machine learning models without incurring additional computation costs. Recently, Graph Neural Networks (GNNs), which operate on non-Euclidean data, have gained significant attention. However, no previous research has investigated the GNN Lipschitz bounds to shed light on stabilizing model outputs, especially when working on non-Euclidean data with inherent biases. Given the inherent biases in common graph data used for GNN training, it poses a serious challenge to constraining the GNN output perturbations induced by input biases, thereby safeguarding fairness during training. Recently, despite the Lipschitz constant's use in controlling the stability of Euclidean neural networks, the calculation of the precise Lipschitz constant remains elusive for non-Euclidean neural networks like GNNs, especially within fairness contexts. To narrow this gap, we begin with the general GNNs operating on an attributed graph, and formulate a Lipschitz bound to limit the changes in the output regarding biases associated with the input. Additionally, we theoretically analyze how the Lipschitz constant of a GNN model could constrain the output perturbations induced by biases learned from data for fairness training. We experimentally validate the Lipschitz bound's effectiveness in limiting biases of the model output. Finally, from a training dynamics perspective, we demonstrate why the theoretical Lipschitz bound can effectively guide the GNN training to better trade-off between accuracy and fairness.
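
For reference, the generic notion the paper adapts can be written as a plain Lipschitz inequality (a simplified sketch of the idea, not the paper's exact GNN-specific bound):

```latex
% For a GNN f with node attributes X and a fixed graph structure A,
% a Lipschitz constant L bounds how far an input bias \delta can move
% the output; constraining L during training therefore limits the
% output perturbations induced by biases in the data.
\[
  \lVert f(X + \delta,\, A) - f(X,\, A) \rVert \;\le\; L\, \lVert \delta \rVert
\]
```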

VideolandGPT: A User Study on a Conversational Recommender System

  • paper_url: http://arxiv.org/abs/2309.03645
  • repo_url: None
  • paper_authors: Mateo Gutierrez Granada, Dina Zilbershtein, Daan Odijk, Francesco Barile
  • for: The paper explores how large language models (LLMs) can enhance recommender systems, focusing on conversational recommenders that combine user preferences with personalised candidate selections from existing ranking models.
  • methods: It introduces VideolandGPT, a recommender system for the Video-on-Demand (VOD) platform Videoland that uses ChatGPT to select from a predetermined set of contents, taking into account the additional context of users' interactions with a chat interface.
  • results: In a between-subject user study comparing a personalised and a non-personalised version, the personalised version outperforms the non-personalised one in accuracy and overall user satisfaction, and both versions increase the visibility of items outside the top of the recommendation lists; however, both behave inconsistently with respect to fairness, since the system may recommend items that are not available on Videoland.
    Abstract This paper investigates how large language models (LLMs) can enhance recommender systems, with a specific focus on Conversational Recommender Systems that leverage user preferences and personalised candidate selections from existing ranking models. We introduce VideolandGPT, a recommender system for a Video-on-Demand (VOD) platform, Videoland, which uses ChatGPT to select from a predetermined set of contents, considering the additional context indicated by users' interactions with a chat interface. We evaluate ranking metrics, user experience, and fairness of recommendations, comparing a personalised and a non-personalised version of the system, in a between-subject user study. Our results indicate that the personalised version outperforms the non-personalised in terms of accuracy and general user satisfaction, while both versions increase the visibility of items which are not in the top of the recommendation lists. However, both versions present inconsistent behavior in terms of fairness, as the system may generate recommendations which are not available on Videoland.

Beyond XAI:Obstacles Towards Responsible AI

  • paper_url: http://arxiv.org/abs/2309.03638
  • repo_url: None
  • paper_authors: Yulu Pi
  • for: The paper examines the rapidly advancing field of Explainable Artificial Intelligence (XAI) and the techniques proposed to make AI systems more transparent and understandable.
  • methods: It reviews existing explainability methods and their evaluation strategies, and assesses their limitations in real-world contexts.
  • results: The paper identifies numerous limitations of current explainability techniques and evaluation strategies, and discusses their implications in the broader context of responsible AI, which also encompasses privacy, fairness, and contestability.
    Abstract The rapidly advancing domain of Explainable Artificial Intelligence (XAI) has sparked significant interest in developing techniques to make AI systems more transparent and understandable. Nevertheless, in real-world contexts, the methods of explainability and their evaluation strategies present numerous limitations. Moreover, the scope of responsible AI extends beyond just explainability. In this paper, we explore these limitations and discuss their implications in a broader context of responsible AI when considering other important aspects, including privacy, fairness and contestability.

NeuroCodeBench: a plain C neural network benchmark for software verification

  • paper_url: http://arxiv.org/abs/2309.03617
  • repo_url: None
  • paper_authors: Edoardo Manino, Rafael Sá Menezes, Fedor Shmarov, Lucas C. Cordeiro
  • for: The benchmark targets strong guarantees for safety-critical systems with neural network components, where existing verification techniques cannot prove the absence of software faults in the network implementation.
  • methods: It provides neural network code written in plain C for software verification: 32 neural networks with 607 safety properties across 6 categories (maths library, activation functions, error-correcting networks, transfer function approximation, probability density estimation, and reinforcement learning).
  • results: A preliminary evaluation shows that state-of-the-art software verifiers struggle to provide correct verdicts, owing to their incomplete support of the standard C mathematical library and the complexity of larger neural networks.
    Abstract Safety-critical systems with neural network components require strong guarantees. While existing neural network verification techniques have shown great progress towards this goal, they cannot prove the absence of software faults in the network implementation. This paper presents NeuroCodeBench - a verification benchmark for neural network code written in plain C. It contains 32 neural networks with 607 safety properties divided into 6 categories: maths library, activation functions, error-correcting networks, transfer function approximation, probability density estimation and reinforcement learning. Our preliminary evaluation shows that state-of-the-art software verifiers struggle to provide correct verdicts, due to their incomplete support of the standard C mathematical library and the complexity of larger neural networks.

Evaluating ChatGPT as a Recommender System: A Rigorous Approach

  • paper_url: http://arxiv.org/abs/2309.03613
  • repo_url: https://github.com/sisinflab/Recommender-ChatGPT
  • paper_authors: Dario Di Palma, Giovanni Maria Biancofiore, Vito Walter Anelli, Fedelucio Narducci, Tommaso Di Noia, Eugenio Di Sciascio
  • for: The study explores ChatGPT's potential as a zero-shot recommender system, evaluating its ability to use user preferences for recommendations, re-rank existing recommendation lists, leverage information from similar users, and handle cold-start situations.
  • methods: Comprehensive experiments on three datasets (MovieLens Small, Last.FM, and Facebook Book) compare ChatGPT against standard recommendation algorithms and other large language models such as GPT-3.5 and PaLM-2, using widely-used metrics including MAP, Recall, Precision, F1, nDCG, Item Coverage, EPC, ACLT, ARP, and PopREO.
  • results: The experiments characterize ChatGPT's recommendation effectiveness across accuracy-, coverage-, and popularity-oriented metrics relative to standard algorithms and other large language models; the experiment code is available at https://github.com/sisinflab/Recommender-ChatGPT.
    Abstract Recent popularity surrounds large AI language models due to their impressive natural language capabilities. They contribute significantly to language-related tasks, including prompt-based learning, making them valuable for various specific tasks. This approach unlocks their full potential, enhancing precision and generalization. Research communities are actively exploring their applications, with ChatGPT receiving recognition. Despite extensive research on large language models, their potential in recommendation scenarios still needs to be explored. This study aims to fill this gap by investigating ChatGPT's capabilities as a zero-shot recommender system. Our goals include evaluating its ability to use user preferences for recommendations, reordering existing recommendation lists, leveraging information from similar users, and handling cold-start situations. We assess ChatGPT's performance through comprehensive experiments using three datasets (MovieLens Small, Last.FM, and Facebook Book). We compare ChatGPT's performance against standard recommendation algorithms and other large language models, such as GPT-3.5 and PaLM-2. To measure recommendation effectiveness, we employ widely-used evaluation metrics like Mean Average Precision (MAP), Recall, Precision, F1, normalized Discounted Cumulative Gain (nDCG), Item Coverage, Expected Popularity Complement (EPC), Average Coverage of Long Tail (ACLT), Average Recommendation Popularity (ARP), and Popularity-based Ranking-based Equal Opportunity (PopREO). Through thoroughly exploring ChatGPT's abilities in recommender systems, our study aims to contribute to the growing body of research on the versatility and potential applications of large language models. Our experiment code is available on the GitHub repository: https://github.com/sisinflab/Recommender-ChatGPT
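
Among the cited metrics, nDCG@k is compact enough to state exactly. The sketch below is a self-contained reference implementation assuming binary relevance (not the authors' evaluation code):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranked relevance list, cut at k."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: the 2nd and 4th recommended items are relevant.
print(ndcg_at_k([0, 1, 0, 1, 0], k=5))  # ~0.651
```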

Spatial encoding of BOLD fMRI time series for categorizing static images across visual datasets: A pilot study on human vision

  • paper_url: http://arxiv.org/abs/2309.03590
  • repo_url: https://github.com/kancharlavamshi/Spatial-encoding-of-BOLD-fmri-time-series-for-categorical-static-images-across-visual-dataset
  • paper_authors: Vamshi K. Kancharala, Debanjali Bhattacharya, Neelam Sinha
  • for: The study examines how the human brain processes images of different complexities, to better understand visual function.
  • methods: fMRI BOLD time series (TS) from the publicly available BOLD5000 dataset, recorded while viewing 5254 images drawn from COCO, ImageNet, and SUN, are spatially encoded with the Gramian Angular Field (GAF) and Markov Transition Field (MTF); the resulting 2D representations are classified with convolutional neural networks (CNNs).
  • results: A parallel CNN model that combines the 2D GAF and MTF features outperforms the other network models, including 1D LSTM and Bi-LSTM on the raw BOLD signal, with a 7% improvement in multi-class classification accuracy.
    Abstract Functional MRI (fMRI) is widely used to examine brain functionality by detecting alterations in oxygenated blood flow that arise with brain activity. In this study, complexity-specific image categorization across different visual datasets is performed using fMRI time series (TS) to understand differences in neuronal activities related to vision. The publicly available BOLD5000 dataset is used for this purpose, containing fMRI scans while viewing 5254 images of diverse categories, drawn from three standard computer vision datasets: COCO, ImageNet and SUN. To understand vision, it is important to study how the brain functions while looking at different images. To achieve this, spatial encoding of fMRI BOLD TS has been performed using the classical Gramian Angular Field (GAF) and Markov Transition Field (MTF) to obtain 2D BOLD TS, representing images of COCO, ImageNet and SUN. For classification, individual GAF and MTF features are fed into a regular CNN. Subsequently, a parallel CNN model is employed that uses the combined 2D features for classifying images across COCO, ImageNet and SUN. The results of the 2D CNN models are also compared with 1D LSTM and Bi-LSTM models that utilize the raw fMRI BOLD signal for classification. The parallel CNN model outperforms the other network models with an improvement of 7% for multi-class classification. Clinical relevance: the obtained results establish a baseline for studying how differently the human brain functions when looking at images of diverse complexities.
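
The GAF encoding is simple enough to show from scratch. The sketch below computes a Gramian Angular Summation Field for one synthetic BOLD series with NumPy (libraries such as pyts ship the same transform; the toy signal is an assumption):

```python
import numpy as np

def gasf(ts):
    """Map a 1D series of length T to a (T, T) image."""
    lo, hi = ts.min(), ts.max()
    x = (2 * ts - hi - lo) / (hi - lo)          # rescale to [-1, 1]
    phi = np.arccos(np.clip(x, -1.0, 1.0))      # polar-angle encoding
    return np.cos(phi[:, None] + phi[None, :])  # GASF(i, j) = cos(phi_i + phi_j)

bold = np.sin(np.linspace(0, 6 * np.pi, 64)) + 0.1 * np.random.randn(64)
image = gasf(bold)   # 2D input for the CNNs described above
print(image.shape)   # (64, 64)
```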

Interactive Hyperparameter Optimization in Multi-Objective Problems via Preference Learning

  • paper_url: http://arxiv.org/abs/2309.03581
  • repo_url: https://github.com/automl/interactive-mo-ml
  • paper_authors: Joseph Giovanelli, Alexander Tornede, Tanja Tornede, Marius Lindauer
  • for: The paper addresses hyperparameter optimization (HPO) for multi-objective machine learning (MO-ML), i.e., finding hyperparameter configurations that optimize potentially conflicting objectives such as accuracy and energy consumption.
  • methods: It proposes a human-centered interactive HPO approach that uses preference learning to extract the user's desiderata instead of asking the user to guess the most suitable quality indicator: pairwise comparisons of distinct Pareto fronts are used to automatically learn an appropriate indicator, and the hyperparameters of the underlying MO-ML algorithm are then optimized toward this learned indicator with a state-of-the-art HPO approach.
  • results: In an experimental study targeting the environmental impact of ML, the approach produces substantially better Pareto fronts than optimizing for a wrong indicator pre-selected by the user, and performs comparably when an advanced user knows which indicator to pick.
    Abstract Hyperparameter optimization (HPO) is important to leverage the full potential of machine learning (ML). In practice, users are often interested in multi-objective (MO) problems, i.e., optimizing potentially conflicting objectives, like accuracy and energy consumption. To tackle this, the vast majority of MO-ML algorithms return a Pareto front of non-dominated machine learning models to the user. Optimizing the hyperparameters of such algorithms is non-trivial as evaluating a hyperparameter configuration entails evaluating the quality of the resulting Pareto front. In literature, there are known indicators that assess the quality of a Pareto front (e.g., hypervolume, R2) by quantifying different properties (e.g., volume, proximity to a reference point). However, choosing the indicator that leads to the desired Pareto front might be a hard task for a user. In this paper, we propose a human-centered interactive HPO approach tailored towards multi-objective ML leveraging preference learning to extract desiderata from users that guide the optimization. Instead of relying on the user guessing the most suitable indicator for their needs, our approach automatically learns an appropriate indicator. Concretely, we leverage pairwise comparisons of distinct Pareto fronts to learn such an appropriate quality indicator. Then, we optimize the hyperparameters of the underlying MO-ML algorithm towards this learned indicator using a state-of-the-art HPO approach. In an experimental study targeting the environmental impact of ML, we demonstrate that our approach leads to substantially better Pareto fronts compared to optimizing based on a wrong indicator pre-selected by the user, and performs comparable in the case of an advanced user knowing which indicator to pick.
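
As a concrete example of the quality indicators among which the approach learns to choose, here is a minimal 2D hypervolume computation (both objectives minimized; the front and reference point are illustrative):

```python
def hypervolume_2d(front, ref):
    """front: (f1, f2) non-dominated points; ref: a reference point
    dominated by every front point. Returns the dominated area."""
    pts = sorted(front)              # ascending f1 implies descending f2
    area, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        area += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return area

front = [(0.1, 0.9), (0.4, 0.5), (0.8, 0.2)]
print(hypervolume_2d(front, ref=(1.0, 1.0)))  # 0.39
```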

DTW+S: Shape-based Comparison of Time-series with Ordered Local Trend

  • paper_url: http://arxiv.org/abs/2309.03579
  • repo_url: https://github.com/scc-usc/DTW_S_apps
  • paper_authors: Ajitesh Srivastava
  • for: The study develops a measure that identifies similar trends occurring around similar times in time-series data, is easily interpretable for researchers in applied domains, and supports classification and clustering, particularly for series with an ordered sequence of meaningful local trends such as epidemic curves.
  • methods: DTW+S first converts each time series into an interpretable "closeness-preserving" matrix representation in which each column captures a local trend, and then applies Dynamic Time Warping to compute distances between these matrices; a theoretical analysis supports this choice of representation.
  • results: DTW+S better identifies similar trends than existing measures, yielding better classification than plain Dynamic Time Warping for a class of datasets in which local trends rather than scale play the decisive role, and proving useful for ensemble building and clustering of epidemic curves.
    Abstract Measuring distance or similarity between time-series data is a fundamental aspect of many applications including classification and clustering. Existing measures may fail to capture similarities due to local trends (shapes) and may even produce misleading results. Our goal is to develop a measure that looks for similar trends occurring around similar times and is easily interpretable for researchers in applied domains. This is particularly useful for applications where time-series have a sequence of meaningful local trends that are ordered, such as in epidemics (a surge to an increase to a peak to a decrease). We propose a novel measure, DTW+S, which creates an interpretable "closeness-preserving" matrix representation of the time-series, where each column represents local trends, and then it applies Dynamic Time Warping to compute distances between these matrices. We present a theoretical analysis that supports the choice of this representation. We demonstrate the utility of DTW+S in ensemble building and clustering of epidemic curves. We also demonstrate that our approach results in better classification compared to Dynamic Time Warping for a class of datasets, particularly when local trends rather than scale play a decisive role.
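
The warping step is classic DTW; a minimal NumPy version is sketched below. In DTW+S it would be applied between columns of the shape matrices (local-trend descriptors) rather than raw scalar values, which is why the distance function accepts vectors:

```python
import numpy as np

def dtw(a, b, dist=lambda u, v: np.linalg.norm(u - v)):
    """a, b: sequences of vectors (e.g., per-time-step trend descriptors)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

x = np.array([[0.0], [1.0], [2.0], [1.0]])
y = np.array([[0.0], [0.0], [1.0], [2.0], [1.0]])
print(dtw(x, y))  # 0.0: same shape, merely shifted in time
```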

Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation

  • paper_url: http://arxiv.org/abs/2309.03549
  • repo_url: None
  • paper_authors: Jiaxi Gu, Shicong Wang, Haoyu Zhao, Tianyi Lu, Xing Zhang, Zuxuan Wu, Songcen Xu, Wei Zhang, Yu-Gang Jiang, Hang Xu
  • for: The work applies Latent Diffusion Models (LDMs) to text-to-video generation, a formidable challenge because of the computational and memory constraints of model training and inference: a single LDM can usually generate only a very limited number of video frames.
  • methods: It proposes "Reuse and Diffuse" (VidRD), a framework that, conditioned on an initial clip with a small number of frames, iteratively generates additional frames by reusing the original latent features and following the previous diffusion process; temporal layers are injected into the autoencoder's decoder and fine-tuned for higher temporal consistency, and strategies are proposed for composing video-text training data from multiple existing datasets.
  • results: The method achieves good results in both quantitative and qualitative evaluations; the project page is at https://anonymous0x233.github.io/ReuseAndDiffuse/.
    Abstract Inspired by the remarkable success of Latent Diffusion Models (LDMs) for image synthesis, we study LDM for text-to-video generation, which is a formidable challenge due to the computational and memory constraints during both model training and inference. A single LDM is usually only capable of generating a very limited number of video frames. Some existing works focus on separate prediction models for generating more video frames, which suffer from additional training cost and frame-level jittering, however. In this paper, we propose a framework called "Reuse and Diffuse" dubbed $\textit{VidRD}$ to produce more frames following the frames already generated by an LDM. Conditioned on an initial video clip with a small number of frames, additional frames are iteratively generated by reusing the original latent features and following the previous diffusion process. Besides, for the autoencoder used for translation between pixel space and latent space, we inject temporal layers into its decoder and fine-tune these layers for higher temporal consistency. We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets including video datasets for action recognition and image-text datasets. Extensive experiments show that our method achieves good results in both quantitative and qualitative evaluations. Our project page is available $\href{https://anonymous0x233.github.io/ReuseAndDiffuse/}{here}$.

DGC: Training Dynamic Graphs with Spatio-Temporal Non-Uniformity using Graph Partitioning by Chunks

  • paper_url: http://arxiv.org/abs/2309.03523
  • repo_url: None
  • paper_authors: Fahao Chen, Peng Li, Celimuge Wu
  • for: The study improves the training efficiency of Dynamic Graph Neural Networks (DGNNs) by building a distributed system that accelerates DGNN training.
  • methods: It proposes a graph-coarsening-based partitioning strategy that splits a dynamic graph into chunks (subgraphs with modest training workloads and few inter-connections) so that workloads can be balanced across multiple GPUs, together with chunk fusion and adaptive stale aggregation techniques for a highly efficient run-time.
  • results: Experiments on 3 typical DGNN models and 4 popular dynamic graph datasets show that DGC achieves a 1.25x - 7.52x speedup over the state-of-the-art in the authors' testbed, while the coarsening-based partitioner runs fast even on large graphs.
    Abstract Dynamic Graph Neural Network (DGNN) has shown a strong capability of learning dynamic graphs by exploiting both spatial and temporal features. Although DGNN has recently received considerable attention by AI community and various DGNN models have been proposed, building a distributed system for efficient DGNN training is still challenging. It has been well recognized that how to partition the dynamic graph and assign workloads to multiple GPUs plays a critical role in training acceleration. Existing works partition a dynamic graph into snapshots or temporal sequences, which only work well when the graph has uniform spatio-temporal structures. However, dynamic graphs in practice are not uniformly structured, with some snapshots being very dense while others are sparse. To address this issue, we propose DGC, a distributed DGNN training system that achieves a 1.25x - 7.52x speedup over the state-of-the-art in our testbed. DGC's success stems from a new graph partitioning method that partitions dynamic graphs into chunks, which are essentially subgraphs with modest training workloads and few inter connections. This partitioning algorithm is based on graph coarsening, which can run very fast on large graphs. In addition, DGC has a highly efficient run-time, powered by the proposed chunk fusion and adaptive stale aggregation techniques. Extensive experimental results on 3 typical DGNN models and 4 popular dynamic graph datasets are presented to show the effectiveness of DGC.

Parameterized Aspects of Distinct Kemeny Rank Aggregation

  • paper_url: http://arxiv.org/abs/2309.03517
  • repo_url: None
  • paper_authors: Koustav De, Harshil Mittal, Palash Dey, Neeldhara Misra
  • for: The paper studies the computational problems of rank aggregation with the Kemeny method, which is NP-hard to optimize, through the lens of parameterized complexity.
  • methods: It considers the target Kemeny score, the number of candidates, the average distance of input rankings, the maximum range of any candidate, and the unanimity width as parameters, establishes a comprehensive theoretical and empirical relationship among them, and gives FPT (fixed-parameter tractable) algorithms for computing all distinct Kemeny rankings.
  • results: Any desirable number of distinct Kemeny rankings can be found without a substantial increase in running time, and FPT approximation algorithms for Kemeny rank aggregation are presented with respect to these parameters.
    Abstract The Kemeny method is one of the popular tools for rank aggregation. However, computing an optimal Kemeny ranking is NP-hard. Consequently, the computational task of finding a Kemeny ranking has been studied under the lens of parameterized complexity with respect to many parameters. We first present a comprehensive relationship, both theoretical and empirical, among these parameters. Further, we study the problem of computing all distinct Kemeny rankings under the lens of parameterized complexity. We consider the target Kemeny score, number of candidates, average distance of input rankings, maximum range of any candidate, and unanimity width as our parameters. For all these parameters, we already have FPT algorithms. We find that any desirable number of Kemeny rankings can also be found without substantial increase in running time. We also present FPT approximation algorithms for Kemeny rank aggregation with respect to these parameters.
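
For intuition only (this is brute force, not the paper's FPT algorithms), the sketch below enumerates all distinct Kemeny rankings of a tiny made-up profile:

```python
from itertools import permutations

def kendall_tau(r1, r2):
    """Number of candidate pairs ordered differently by the two rankings."""
    pos1 = {c: i for i, c in enumerate(r1)}
    pos2 = {c: i for i, c in enumerate(r2)}
    cands = list(r1)
    return sum(
        (pos1[a] < pos1[b]) != (pos2[a] < pos2[b])
        for i, a in enumerate(cands) for b in cands[i + 1:]
    )

def all_kemeny_rankings(votes):
    """Return every ranking attaining the minimum total Kendall tau distance."""
    scored = [(sum(kendall_tau(r, v) for v in votes), r)
              for r in permutations(votes[0])]
    best = min(s for s, _ in scored)
    return best, [r for s, r in scored if s == best]

votes = [("a", "b", "c"), ("a", "c", "b"), ("b", "a", "c")]
print(all_kemeny_rankings(votes))  # (2, [('a', 'b', 'c')])
```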

Towards Robust Natural-Looking Mammography Lesion Synthesis on Ipsilateral Dual-Views Breast Cancer Analysis

  • paper_url: http://arxiv.org/abs/2309.03506
  • repo_url: None
  • paper_authors: Thanh-Huy Nguyen, Quang Hien Kha, Thai Ngoc Toan Truong, Ba Thinh Lam, Ba Hung Ngo, Quang Vinh Dinh, Nguyen Quoc Khanh Le
  • for: The work targets two major issues in mammogram classification: leveraging multi-view (ipsilateral dual-view) information and handling class imbalance.
  • methods: It enhances the examined (main) view with low-level feature information from the auxiliary (ipsilateral) view before learning the high-level features that contain the cancerous patterns, and introduces SynthMix, a simple and robust training-free framework for synthesizing natural-looking malignant mammograms to upsample minority-class samples.
  • results: Results on the VinDr-Mammo and CMMD datasets show the effectiveness of both new frameworks, outperforming previous conventional methods in the experimental settings.
    Abstract In recent years, many mammographic image analysis methods have been introduced for improving cancer classification tasks. Two major issues of mammogram classification tasks are leveraging multi-view mammographic information and class-imbalance handling. For the first problem, many multi-view methods have been released that concatenate features of two or more views for the training and inference stage. However, most existing multi-view methods are not explainable in terms of feature fusion and treat all views equally for diagnosis. Our work aims to propose a simple but novel method for enhancing the examined view (main view) by leveraging low-level feature information from the auxiliary view (ipsilateral view) before learning the high-level features that contain the cancerous features. For the second issue, we also propose a simple but novel malignant mammogram synthesis framework for upsampling minor class samples. Our easy-to-implement, no-training framework eliminates the current limitations of the CutMix algorithm, which produces unreliable synthesized images with randomly pasted patches, hard-contour problems, and domain shift problems. Our results on the VinDr-Mammo and CMMD datasets show the effectiveness of our two new frameworks for both multi-view training and synthesizing mammographic images, outperforming the previous conventional methods in our experimental settings.
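
For reference, here is standard CutMix, the baseline whose limitations the synthesis framework targets (this is not the proposed method): a random patch of one image is pasted into another, which is what produces the hard contours mentioned above.

```python
import numpy as np

def cutmix(img_a, img_b, seed=0):
    """img_a, img_b: (H, W) arrays; returns mixed image and label weight."""
    rng = np.random.default_rng(seed)
    h, w = img_a.shape
    lam = rng.beta(1.0, 1.0)                                # mixing ratio
    ph, pw = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    y = rng.integers(0, h - ph + 1)
    x = rng.integers(0, w - pw + 1)
    mixed = img_a.copy()
    mixed[y:y + ph, x:x + pw] = img_b[y:y + ph, x:x + pw]   # hard contour here
    return mixed, 1 - (ph * pw) / (h * w)

a, b = np.zeros((64, 64)), np.ones((64, 64))
mixed, weight_a = cutmix(a, b)
```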

InteractionNet: Joint Planning and Prediction for Autonomous Driving with Transformers

  • paper_url: http://arxiv.org/abs/2309.03475
  • repo_url: None
  • paper_authors: Jiawei Fu, Yanqing Shen, Zhiqiang Jian, Shitao Chen, Jingmin Xin, Nanning Zheng
  • for: The work improves the planning and prediction modules of autonomous vehicles so that they better handle interaction and dynamic changes in traffic scenarios.
  • methods: InteractionNet uses a transformer to share global contextual reasoning among all traffic participants, interconnecting planning and prediction for joint reasoning; another transformer helps the model pay extra attention to the perceived region containing critical or unseen vehicles.
  • results: InteractionNet outperforms other baselines on several benchmarks, especially in terms of safety, which benefits from the joint consideration of planning and prediction; the code will be released on GitHub.
    Abstract Planning and prediction are two important modules of autonomous driving and have experienced tremendous advancement recently. Nevertheless, most existing methods regard planning and prediction as independent and ignore the correlation between them, leading to a lack of consideration for interaction and dynamic changes of traffic scenarios. To address this challenge, we propose InteractionNet, which leverages a transformer to share global contextual reasoning among all traffic participants to capture interaction, and interconnects planning and prediction to achieve joint reasoning. Besides, InteractionNet deploys another transformer to help the model pay extra attention to the perceived region containing critical or unseen vehicles. InteractionNet outperforms other baselines in several benchmarks, especially in terms of safety, which benefits from the joint consideration of planning and forecasting. The code will be available at https://github.com/fujiawei0724/InteractionNet.

Can Large Language Models Discern Evidence for Scientific Hypotheses? Case Studies in the Social Sciences

  • paper_url: http://arxiv.org/abs/2309.06578
  • repo_url: None
  • paper_authors: Sai Koneru, Jian Wu, Sarah Rajtmajer
  • for: The study examines whether large language models (LLMs) can discern evidence in scientific abstracts that supports or refutes specific hypotheses.
  • methods: A novel dataset for the task of scientific hypothesis evidencing is created from community-driven annotations of studies in the social sciences, and LLM performance is compared against several state-of-the-art benchmarks.
  • results: The comparison characterizes how well LLMs detect supporting or refuting evidence in abstracts and highlights opportunities for future research in this area; the dataset is available at https://github.com/Sai90000/ScientificHypothesisEvidencing.git.
    Abstract Hypothesis formulation and testing are central to empirical research. A strong hypothesis is a best guess based on existing evidence and informed by a comprehensive view of relevant literature. However, with exponential increase in the number of scientific articles published annually, manual aggregation and synthesis of evidence related to a given hypothesis is a challenge. Our work explores the ability of current large language models (LLMs) to discern evidence in support or refute of specific hypotheses based on the text of scientific abstracts. We share a novel dataset for the task of scientific hypothesis evidencing using community-driven annotations of studies in the social sciences. We compare the performance of LLMs to several state-of-the-art benchmarks and highlight opportunities for future research in this area. The dataset is available at https://github.com/Sai90000/ScientificHypothesisEvidencing.git
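
One plausible zero-shot baseline for this task, given as an assumption for illustration rather than the paper's method or models, is an off-the-shelf NLI pipeline from Hugging Face transformers:

```python
# Hypothetical baseline: treat evidencing as NLI between an abstract
# (premise) and a hypothesis statement. Model choice, candidate labels,
# and the example texts are assumptions, not the paper's setup.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

abstract = ("We surveyed 2,000 adults and found that increased social media "
            "use was associated with lower self-reported well-being.")
hypothesis = "Social media use reduces well-being."

result = classifier(
    abstract,
    candidate_labels=["supports the hypothesis", "refutes the hypothesis",
                      "provides no evidence"],
    hypothesis_template="This abstract {} that: " + hypothesis,
)
print(result["labels"][0], result["scores"][0])
```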

Fast FixMatch: Faster Semi-Supervised Learning with Curriculum Batch Size

  • paper_url: http://arxiv.org/abs/2309.03469
  • repo_url: None
  • paper_authors: John Chen, Chen Dun, Anastasios Kyrillidis
  • for: The study proposes Fast FixMatch, a new semi-supervised learning (SSL) algorithm that improves the efficiency of SSL without sacrificing performance.
  • methods: It introduces Curriculum Batch Size (CBS), an unlabeled batch size curriculum that starts training with a small unlabeled batch size and gradually increases it toward the end of training, and combines CBS with strong labeled augmentation and Curriculum Pseudo Labeling (CPL) on top of FixMatch.
  • results: Strong labeled augmentation and/or CPL alone do not significantly reduce training computation, but in synergy with CBS they achieve optimal performance; Fast FixMatch achieves a 2.1x - 3.4x reduction in training computation on CIFAR-10 (with all but 40, 250 and 4000 labels removed) at the same state-of-the-art error rate, with similar results on CIFAR-100, SVHN and STL-10, and 2.6x - 3.3x reductions in federated and online/streaming SSL tasks.
    Abstract Advances in Semi-Supervised Learning (SSL) have almost entirely closed the gap between SSL and Supervised Learning at a fraction of the number of labels. However, recent performance improvements have often come at the cost of significantly increased training computation. To address this, we propose Curriculum Batch Size (CBS), an unlabeled batch size curriculum which exploits the natural training dynamics of deep neural networks. A small unlabeled batch size is used in the beginning of training and is gradually increased to the end of training. A fixed curriculum is used regardless of dataset, model or number of epochs, and reduced training computation is demonstrated on all settings. We apply CBS, strong labeled augmentation, and Curriculum Pseudo Labeling (CPL, from FlexMatch) to FixMatch and term the new SSL algorithm Fast FixMatch. We perform an ablation study to show that strong labeled augmentation and/or CPL do not significantly reduce training computation, but, in synergy with CBS, they achieve optimal performance. Fast FixMatch also achieves substantially higher data utilization compared to the previous state-of-the-art. Fast FixMatch achieves between 2.1x - 3.4x reduced training computation on CIFAR-10 with all but 40, 250 and 4000 labels removed, compared to vanilla FixMatch, while attaining the same state-of-the-art error rate reported for FixMatch. Similar results are achieved for CIFAR-100, SVHN and STL-10. Finally, Fast FixMatch achieves between 2.6x - 3.3x reduced training computation in federated SSL tasks and online/streaming learning SSL tasks, which further demonstrates the generalizability of Fast FixMatch to different scenarios and tasks.
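
A plausible CBS schedule is easy to write down (the exact curve and batch sizes used in the paper are not reproduced here; 448 = 64 x 7 mirrors FixMatch's usual labeled batch size and unlabeled ratio, as an assumption):

```python
def cbs_unlabeled_batch_size(step, total_steps, final_bs=448, min_bs=64):
    """Linearly grow the unlabeled batch size over training."""
    frac = step / max(1, total_steps - 1)
    return int(min_bs + frac * (final_bs - min_bs))

for step in (0, 25_000, 50_000, 99_999):
    print(step, cbs_unlabeled_batch_size(step, total_steps=100_000))
# 0 -> 64, 25000 -> 160, 50000 -> 256, 99999 -> 448
```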

Cross-Image Context Matters for Bongard Problems

  • paper_url: http://arxiv.org/abs/2309.03468
  • repo_url: https://github.com/nraghuraman/bongard-context
  • paper_authors: Nikhil Raghuraman, Adam W. Harley, Leonidas Guibas
  • for: The study addresses the shortcomings of current machine learning methods on Bongard problems, a type of IQ test that requires deriving an abstract concept from a set of positive and negative support images and then classifying whether a new query image depicts that concept.
  • methods: It explores simple methods that take cross-image context into account, using multiple positive and multiple negative supports jointly to characterize the key concept rather than relying on information extracted from individual supports.
  • results: The approach yields substantial gains over prior methods, reaching new state-of-the-art performance on Bongard-LOGO (75.3%) and Bongard-HOI (72.45%) and strong performance on the original Bongard problem set (60.84%).
    Abstract Current machine learning methods struggle to solve Bongard problems, which are a type of IQ test that requires deriving an abstract "concept" from a set of positive and negative "support" images, and then classifying whether or not a new query image depicts the key concept. On Bongard-HOI, a benchmark for natural-image Bongard problems, existing methods have only reached 66% accuracy (where chance is 50%). Low accuracy is often attributed to neural nets' lack of ability to find human-like symbolic rules. In this work, we point out that many existing methods are forfeiting accuracy due to a much simpler problem: they do not incorporate information contained in the support set as a whole, and rely instead on information extracted from individual supports. This is a critical issue, because unlike in few-shot learning tasks concerning object classification, the "key concept" in a typical Bongard problem can only be distinguished using multiple positives and multiple negatives. We explore a variety of simple methods to take this cross-image context into account, and demonstrate substantial gains over prior methods, leading to new state-of-the-art performance on Bongard-LOGO (75.3%) and Bongard-HOI (72.45%) and strong performance on the original Bongard problem set (60.84%).
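
One simple way to use cross-image context, in the spirit of the paper's fix rather than its exact model (the shapes and pooling choice are assumptions): pool embeddings over all positive and all negative supports and classify the query by prototype similarity.

```python
import torch
import torch.nn.functional as F

def classify_query(pos_feats, neg_feats, query_feat):
    """pos_feats, neg_feats: (N, D) support embeddings; query_feat: (D,).
    Returns True if the query is closer to the positive-concept prototype."""
    proto_pos = F.normalize(pos_feats.mean(dim=0), dim=0)
    proto_neg = F.normalize(neg_feats.mean(dim=0), dim=0)
    q = F.normalize(query_feat, dim=0)
    return bool((q @ proto_pos) > (q @ proto_neg))

pos = torch.randn(6, 128) + 1.0   # six positive supports
neg = torch.randn(6, 128) - 1.0   # six negative supports
print(classify_query(pos, neg, torch.randn(128) + 1.0))  # likely True
```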

Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation

  • paper_url: http://arxiv.org/abs/2309.03467
  • repo_url: None
  • paper_authors: Zhuqiang Lu, Kun Hu, Chaoyue Wang, Lei Bai, Zhiyong Wang
  • for: The paper proposes a method for generating 360-degree (omni-directional) images from narrow field-of-view (NFoV) images.
  • methods: An autoregressive omni-aware generative network (AOG-Net) progressively out-paints an incomplete 360-degree image under NFoV and text guidance, jointly or individually; a global-local conditioning mechanism encodes text guidance, omni-visual cues, NFoV inputs, and omni-geometry with cross-attention-based transformers into a global stream and a local stream feeding a conditioned generative backbone, which is compatible with large-scale models and extensive open-vocabulary text guidance.
  • results: Comprehensive experiments on two commonly used 360-degree image datasets, for both indoor and outdoor settings, demonstrate state-of-the-art performance; the code will be made publicly available.
    Abstract A 360-degree (omni-directional) image provides an all-encompassing spherical view of a scene. Recently, there has been an increasing interest in synthesising 360-degree images from conventional narrow field of view (NFoV) images captured by digital cameras and smartphones, for providing immersive experiences in various scenarios such as virtual reality. Yet, existing methods typically fall short in synthesizing intricate visual details or ensuring the generated images align consistently with user-provided prompts. In this study, autoregressive omni-aware generative network (AOG-Net) is proposed for 360-degree image generation by out-painting an incomplete 360-degree image progressively with NFoV and text guidances jointly or individually. This autoregressive scheme not only allows for deriving finer-grained and text-consistent patterns by dynamically generating and adjusting the process but also offers users greater flexibility to edit their conditions throughout the generation process. A global-local conditioning mechanism is devised to comprehensively formulate the outpainting guidance in each autoregressive step. Text guidances, omni-visual cues, NFoV inputs and omni-geometry are encoded and further formulated with cross-attention based transformers into a global stream and a local stream into a conditioned generative backbone model. As AOG-Net is compatible to leverage large-scale models for the conditional encoder and the generative prior, it enables the generation to use extensive open-vocabulary text guidances. Comprehensive experiments on two commonly used 360-degree image datasets for both indoor and outdoor settings demonstrate the state-of-the-art performance of our proposed method. Our code will be made publicly available.
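To make the autoregressive scheme concrete, here is a heavily simplified sketch of such an out-painting loop. The strip-by-strip schedule, the `generator` callable, and the tensor layout are illustrative assumptions, not AOG-Net's actual architecture.

```python
import torch

def autoregressive_outpaint(generator, nfov_strip, text_emb, num_strips=8):
    """A minimal sketch of autoregressive panorama out-painting: an
    equirectangular canvas is completed strip by strip, each step
    conditioned on the full canvas so far (global stream), the
    neighborhood of the strip being filled (local stream), and the
    text guidance. `generator` is a hypothetical conditioned backbone
    that returns one (c, h, w) strip per call.
    """
    c, h, w = nfov_strip.shape
    canvas = torch.zeros(c, h, w * num_strips)
    canvas[:, :, :w] = nfov_strip                 # anchor the NFoV input
    for i in range(1, num_strips):
        lo, hi = i * w, (i + 1) * w
        local = canvas[:, :, lo - w:lo]           # previously filled neighbor
        patch = generator(canvas, local, text_emb)
        canvas[:, :, lo:hi] = patch
        # Conditions are re-read every step, so a user could edit
        # `text_emb` mid-generation to steer the remaining strips.
    return canvas
```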

MIRA: Cracking Black-box Watermarking on Deep Neural Networks via Model Inversion-based Removal Attacks

  • paper_url: http://arxiv.org/abs/2309.03466
  • repo_url: None
  • paper_authors: Yifan Lu, Wenxuan Li, Mi Zhang, Xudong Pan, Min Yang
  • for: Protecting the intellectual property of well-trained deep learning models; black-box DNN watermarks have been widely adopted for this purpose in both academia and industry.
  • methods: Proposes a novel Model Inversion-based Removal Attack (\textsc{Mira}) that is watermark-agnostic and effective against most mainstream black-box DNN watermarking schemes.
  • results: Extensive evaluation of \textsc{Mira} on three benchmark datasets and DNN architectures shows strong watermark removal on the covered schemes, preserving at least 90% of the stolen model's utility under relaxed or even no assumptions on dataset availability.
    Abstract To protect the intellectual property of well-trained deep neural networks (DNNs), black-box DNN watermarks, which are embedded into the prediction behavior of DNN models on a set of specially-crafted samples, have gained increasing popularity in both academy and industry. Watermark robustness is usually implemented against attackers who steal the protected model and obfuscate its parameters for watermark removal. Recent studies empirically prove the robustness of most black-box watermarking schemes against known removal attempts. In this paper, we propose a novel Model Inversion-based Removal Attack (\textsc{Mira}), which is watermark-agnostic and effective against most of mainstream black-box DNN watermarking schemes. In general, our attack pipeline exploits the internals of the protected model to recover and unlearn the watermark message. We further design target class detection and recovered sample splitting algorithms to reduce the utility loss caused by \textsc{Mira} and achieve data-free watermark removal on half of the watermarking schemes. We conduct comprehensive evaluation of \textsc{Mira} against ten mainstream black-box watermarks on three benchmark datasets and DNN architectures. Compared with six baseline removal attacks, \textsc{Mira} achieves strong watermark removal effects on the covered watermarks, preserving at least $90\%$ of the stolen model utility, under more relaxed or even no assumptions on the dataset availability.
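The two stages the attack pipeline describes — inverting the model to recover watermark-like inputs, then unlearning them — can be sketched generically. The following is an assumed, simplified illustration of model inversion and unlearning, not \textsc{Mira}'s actual procedure.

```python
import torch
import torch.nn.functional as F

def invert_class(model, target_class, shape=(1, 3, 32, 32), steps=200, lr=0.1):
    """Recover an input the model strongly associates with `target_class`
    via gradient ascent on that class logit -- the model-inversion step."""
    x = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -model(x)[0, target_class]   # maximize the target logit
        loss.backward()
        opt.step()
    return x.detach()

def unlearn_watermark(model, recovered, num_classes, epochs=5):
    """Fine-tune the stolen model so the recovered (trigger-like) inputs
    no longer map confidently to the watermark's target class: push
    their predicted distribution toward uniform."""
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    uniform = torch.full((recovered.size(0), num_classes), 1.0 / num_classes)
    for _ in range(epochs):
        opt.zero_grad()
        log_probs = F.log_softmax(model(recovered), dim=-1)
        F.kl_div(log_probs, uniform, reduction="batchmean").backward()
        opt.step()
    return model
```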

Automatic Algorithm Selection for Pseudo-Boolean Optimization with Given Computational Time Limits

  • paper_url: http://arxiv.org/abs/2309.03924
  • repo_url: None
  • paper_authors: Catalina Pezo, Dorit Hochbaum, Julio Godoy, Roberto Asin-Acha
  • for: Designing anytime solver selectors — selectors aware of a user-prescribed computational time limit — for the NP-hard Pseudo-Boolean Optimization (PBO) problem, which generalizes Satisfiability and Maximum Satisfiability.
  • methods: Machine learning techniques, specifically anytime selectors that, given an instance and a time limit, predict the best-performing solver within that limit and then execute it.
  • results: The anytime meta-solver dramatically improves on the best single solver in the portfolio; for example, among all instances and time limits where the Gurobi optimizer failed to find a feasible solution, the meta-solver found feasible solutions for 47% of them.
    Abstract Machine learning (ML) techniques have been proposed to automatically select the best solver from a portfolio of solvers, based on predicted performance. These techniques have been applied to various problems, such as Boolean Satisfiability, Traveling Salesperson, Graph Coloring, and others. These methods, known as meta-solvers, take an instance of a problem and a portfolio of solvers as input. They then predict the best-performing solver and execute it to deliver a solution. Typically, the quality of the solution improves with a longer computational time. This has led to the development of anytime selectors, which consider both the instance and a user-prescribed computational time limit. Anytime meta-solvers predict the best-performing solver within the specified time limit. Constructing an anytime meta-solver is considerably more challenging than building a meta-solver without the "anytime" feature. In this study, we focus on the task of designing anytime meta-solvers for the NP-hard optimization problem of Pseudo-Boolean Optimization (PBO), which generalizes Satisfiability and Maximum Satisfiability problems. The effectiveness of our approach is demonstrated via extensive empirical study in which our anytime meta-solver improves dramatically on the performance of Mixed Integer Programming solver Gurobi, which is the best-performing single solver in the portfolio. For example, out of all instances and time limits for which Gurobi failed to find feasible solutions, our meta-solver identified feasible solutions for 47% of these.
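An anytime selector is, at its core, a classifier over (instance features, time limit) pairs. A minimal sketch follows; the features, solver names, and `run_solver` callback are illustrative assumptions, not the paper's actual feature set or portfolio.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical portfolio; the paper's portfolio is not reproduced here.
SOLVERS = ["gurobi", "maxsat_solver", "local_search"]

def train_selector(instance_feats, time_limits, best_solver_idx):
    """Train on instances labeled with the solver that performed best
    within each time limit; the limit is simply an extra feature."""
    X = np.hstack([instance_feats, time_limits.reshape(-1, 1)])
    return RandomForestClassifier(n_estimators=200).fit(X, best_solver_idx)

def select_and_run(selector, feats, time_limit, run_solver):
    """Predict the best solver for this instance *and* this budget,
    then spend the whole budget on it. `run_solver(name, limit)` is a
    caller-supplied callback that executes the chosen solver."""
    x = np.append(feats, time_limit).reshape(1, -1)
    choice = SOLVERS[int(selector.predict(x)[0])]
    return run_solver(choice, time_limit)
```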

SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

  • paper_url: http://arxiv.org/abs/2309.03453
  • repo_url: https://github.com/liuyuan-pal/SyncDreamer
  • paper_authors: Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, Wenping Wang
  • for: Generating multiview-consistent images from a single-view image.
  • methods: A synchronized multiview diffusion model built on a pretrained large-scale 2D diffusion model, with a 3D-aware feature attention mechanism that correlates features across views.
  • results: Generates images with high consistency across views, making the model well-suited for various 3D generation tasks.
    Abstract In this paper, we present a novel diffusion model called SyncDreamer that generates multiview-consistent images from a single-view image. Using pretrained large-scale 2D diffusion models, recent work Zero123 demonstrates the ability to generate plausible novel views from a single-view image of an object. However, maintaining consistency in geometry and colors for the generated images remains a challenge. To address this issue, we propose a synchronized multiview diffusion model that models the joint probability distribution of multiview images, enabling the generation of multiview-consistent images in a single reverse process. SyncDreamer synchronizes the intermediate states of all the generated images at every step of the reverse process through a 3D-aware feature attention mechanism that correlates the corresponding features across different views. Experiments show that SyncDreamer generates images with high consistency across different views, thus making it well-suited for various 3D generation tasks such as novel-view-synthesis, text-to-3D, and image-to-3D.
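The synchronized reverse process can be sketched at a high level: all views are denoised in lockstep, with a cross-view attention step before each per-view update. The modules and the simplified update rule below are illustrative assumptions, not SyncDreamer's actual components.

```python
import torch

@torch.no_grad()
def synced_reverse_process(denoiser, sync_attention, cond_image,
                           num_views=16, steps=50, shape=(3, 64, 64)):
    """Conceptual sketch of a synchronized multiview reverse process.
    `sync_attention` is a hypothetical 3D-aware module that lets each
    view's features attend to all other views; `denoiser` predicts the
    per-view noise given those correlated features."""
    xs = torch.randn(num_views, *shape)          # one latent per view
    for t in reversed(range(steps)):
        # Correlate intermediate states across views at every step, so
        # geometry and colors stay consistent in one joint reverse pass.
        feats = sync_attention(xs, cond_image, t)
        eps = denoiser(xs, feats, t)             # per-view noise estimate
        xs = xs - eps / steps                    # simplified update rule
    return xs
```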

XGen-7B Technical Report

  • paper_url: http://arxiv.org/abs/2309.03450
  • repo_url: None
  • paper_authors: Erik Nijkamp, Tian Xie, Hiroaki Hayashi, Bo Pang, Congying Xia, Chen Xing, Jesse Vig, Semih Yavuz, Philippe Laban, Ben Krause, Senthil Purushwalkam, Tong Niu, Wojciech Kryściński, Lidiya Murakhovs’ka, Prafulla Kumar Choubey, Alex Fabbri, Ye Liu, Rui Meng, Lifu Tu, Meghana Bhat, Chien-Sheng Wu, Silvio Savarese, Yingbo Zhou, Shafiq Joty, Caiming Xiong
  • for: Improving the performance and accessibility of large language models (LLMs) so they can better support a wide range of tasks and applications, particularly those requiring inference over long input contexts.
  • methods: Trains XGen, a series of 7B-parameter models, on sequences of up to 8K tokens for up to 1.5T tokens, then fine-tunes them on public-domain instructional data to create the instruction-tuned XGen-Inst models.
  • results: On standard benchmarks, XGen achieves results comparable to or better than state-of-the-art open-source LLMs; on long-sequence modeling tasks, the 8K-sequence models also show clear advantages over 2K-sequence open-source LLMs.
    Abstract Large Language Models (LLMs) have become ubiquitous across various domains, transforming the way we interact with information and conduct research. However, most high-performing LLMs remain confined behind proprietary walls, hindering scientific progress. Most open-source LLMs, on the other hand, are limited in their ability to support longer sequence lengths, which is a key requirement for many tasks that require inference over an input context. To address this, we have trained XGen, a series of 7B parameter models on up to 8K sequence length for up to 1.5T tokens. We have also finetuned the XGen models on public-domain instructional data, creating their instruction-tuned counterparts (XGen-Inst). We open-source our models for both research advancements and commercial applications. Our evaluation on standard benchmarks shows that XGen models achieve comparable or better results when compared with state-of-the-art open-source LLMs. Our targeted evaluation on long sequence modeling tasks shows the benefits of our 8K-sequence models over 2K-sequence open-source LLMs.
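Since the models are open-sourced, a natural way to try them is through Hugging Face transformers. The checkpoint id below is an assumption based on the public release and should be verified before use; XGen's custom tokenizer is assumed to require `trust_remote_code`.

```python
# Hedged usage sketch: loading an XGen checkpoint via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Salesforce/xgen-7b-8k-base"   # assumed public checkpoint id
tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto")

prompt = "Long-context summarization benefits from an 8K window because"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```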

Large Language Models as Optimizers

  • paper_url: http://arxiv.org/abs/2309.03409
  • repo_url: https://github.com/chrisneagu/FTC-Skystone-Dark-Angels-Romania-2020
  • paper_authors: Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, Xinyun Chen
  • for: Proposes a simple and effective way to use large language models (LLMs) as optimizers for the many real-world optimization problems where gradients are unavailable.
  • methods: At each optimization step, the LLM generates new solutions from a prompt containing previously generated solutions and their values; the new solutions are then evaluated and added to the prompt for the next step.
  • results: Applied to linear regression, the traveling salesman problem, and prompt optimization with a variety of LLMs, the best prompts optimized by OPRO outperform human-designed prompts by up to 8% on GSM8K and by up to 50% on Big-Bench Hard tasks.
    Abstract Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradient imposes challenges on many real-world applications. In this work, we propose Optimization by PROmpting (OPRO), a simple and effective approach to leverage large language models (LLMs) as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new solutions from the prompt that contains previously generated solutions with their values, then the new solutions are evaluated and added to the prompt for the next optimization step. We first showcase OPRO on linear regression and traveling salesman problems, then move on to prompt optimization where the goal is to find instructions that maximize the task accuracy. With a variety of LLMs, we demonstrate that the best prompts optimized by OPRO outperform human-designed prompts by up to 8% on GSM8K, and by up to 50% on Big-Bench Hard tasks.
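The OPRO loop itself is short enough to sketch end to end. `llm` and `evaluate` are caller-supplied stand-ins (any text-completion callable and any scoring function), and the meta-prompt wording below is illustrative, not the paper's exact template.

```python
def opro_optimize(llm, evaluate, init_solutions, num_steps=20, keep=8):
    """Minimal sketch of Optimization by PROmpting: the meta-prompt
    lists previous (solution, score) pairs, the LLM proposes a new
    solution, and the scored result is fed back for the next step."""
    trajectory = [(s, evaluate(s)) for s in init_solutions]
    for _ in range(num_steps):
        trajectory.sort(key=lambda p: p[1])
        shown = trajectory[-keep:]           # best solutions so far
        meta_prompt = (
            "Below are previous solutions with their scores, "
            "worst first:\n"
            + "\n".join(f"solution: {s}\nscore: {v}" for s, v in shown)
            + "\nPropose a new solution that scores higher than all of "
              "the above. Output only the solution."
        )
        candidate = llm(meta_prompt).strip()
        trajectory.append((candidate, evaluate(candidate)))
    return max(trajectory, key=lambda p: p[1])
```

For prompt optimization, `evaluate` would run the candidate instruction on a held-out task split and return accuracy, which is how the GSM8K and Big-Bench Hard numbers above are obtained.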