cs.CV - 2023-08-18

Revisiting Skin Tone Fairness in Dermatological Lesion Classification

  • paper_url: http://arxiv.org/abs/2308.09640
  • repo_url: https://github.com/tkalbl/revisitingskintonefairness
  • paper_authors: Thorsten Kalb, Kaisar Kushibar, Celia Cintas, Karim Lekadir, Oliver Diaz, Richard Osuala
  • for: To assess the fairness of skin lesion classification algorithms, since skin diseases can manifest differently across skin tones.
  • methods: Uses the Individual Typology Angle (ITA) to estimate skin tone and performs a fairness analysis of skin lesion classification algorithms.
  • results: A comparison of four ITA-based skin tone classification approaches on the ISIC18 dataset reveals high disagreement among them, demonstrating the risks of ITA-based skin tone estimation. The study also finds that the lack of diversity in ISIC18 limits its use as a testbed for fairness analysis.
    Abstract Addressing fairness in lesion classification from dermatological images is crucial due to variations in how skin diseases manifest across skin tones. However, the absence of skin tone labels in public datasets hinders building a fair classifier. To date, such skin tone labels have been estimated prior to fairness analysis in independent studies using the Individual Typology Angle (ITA). Briefly, ITA calculates an angle based on pixels extracted from skin images taking into account the lightness and yellow-blue tints. These angles are then categorised into skin tones that are subsequently used to analyse fairness in skin cancer classification. In this work, we review and compare four ITA-based approaches of skin tone classification on the ISIC18 dataset, a common benchmark for assessing skin cancer classification fairness in the literature. Our analyses reveal a high disagreement among previously published studies demonstrating the risks of ITA-based skin tone estimation methods. Moreover, we investigate the causes of such large discrepancy among these approaches and find that the lack of diversity in the ISIC18 dataset limits its use as a testbed for fairness analysis. Finally, we recommend further research on robust ITA estimation and diverse dataset acquisition with skin tone annotation to facilitate conclusive fairness assessments of artificial intelligence tools in dermatology. Our code is available at https://github.com/tkalbl/RevisitingSkinToneFairness.
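
As a rough illustration of the ITA computation the abstract describes, the sketch below derives the angle from CIELAB lightness (L*) and yellow-blue (b*) values of sampled healthy-skin pixels and bins it into tone categories. The thresholds are the commonly cited ITA bins, not necessarily those of the four compared approaches, and the pixel values are hypothetical.

```python
import numpy as np

def ita_degrees(L, b):
    """Individual Typology Angle from CIELAB lightness L* and
    yellow-blue component b*, in degrees: arctan((L*-50)/b*)."""
    return np.degrees(np.arctan2(L - 50.0, b))

def ita_category(ita):
    """Map an ITA value to a skin tone group. Thresholds are the commonly
    cited ones; the four approaches compared in the paper differ in pixel
    selection and binning, which is exactly what the study examines."""
    bins = [(55, "very light"), (41, "light"), (28, "intermediate"),
            (10, "tan"), (-30, "brown")]
    for threshold, label in bins:
        if ita > threshold:
            return label
    return "dark"

# Example: median ITA over hypothetical healthy-skin pixel samples.
L_px = np.array([65.0, 70.2, 68.1])   # hypothetical L* samples
b_px = np.array([14.0, 16.5, 15.2])   # hypothetical b* samples
ita = np.median(ita_degrees(L_px, b_px))
print(ita, ita_category(ita))
```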

GeoDTR+: Toward generic cross-view geolocalization via geometric disentanglement

  • paper_url: http://arxiv.org/abs/2308.09624
  • repo_url: None
  • paper_authors: Xiaohan Zhang, Xingyu Li, Waqas Sultani, Chen Chen, Safwan Wshah
  • for: To propose a cross-view geo-localization method with improved geometric layout extraction and contrastive hard-sample training, improving accuracy and robustness.
  • methods: Uses an enhanced Geometric Layout Extractor (GLE) module together with Contrastive Hard Samples Generation (CHSG) to strengthen model training.
  • results: Experiments show new state-of-the-art cross-area performance by a large margin ($16.44\%$, $22.71\%$, and $17.02\%$ without polar transformation) while keeping same-area performance comparable to existing SOTA.
    Abstract Cross-View Geo-Localization (CVGL) estimates the location of a ground image by matching it to a geo-tagged aerial image in a database. Recent works achieve outstanding progress on CVGL benchmarks. However, existing methods still suffer from poor performance in cross-area evaluation, in which the training and testing data are captured from completely distinct areas. We attribute this deficiency to the lack of ability to extract the geometric layout of visual features and models' overfitting to low-level details. Our preliminary work introduced a Geometric Layout Extractor (GLE) to capture the geometric layout from input features. However, the previous GLE does not fully exploit information in the input feature. In this work, we propose GeoDTR+ with an enhanced GLE module that better models the correlations among visual features. To fully explore the LS techniques from our preliminary work, we further propose Contrastive Hard Samples Generation (CHSG) to facilitate model training. Extensive experiments show that GeoDTR+ achieves state-of-the-art (SOTA) results in cross-area evaluation on CVUSA, CVACT, and VIGOR by a large margin ($16.44\%$, $22.71\%$, and $17.02\%$ without polar transformation) while keeping the same-area performance comparable to existing SOTA. Moreover, we provide detailed analyses of GeoDTR+.

Is context all you need? Scaling Neural Sign Language Translation to Large Domains of Discourse

  • paper_url: http://arxiv.org/abs/2308.09622
  • repo_url: None
  • paper_authors: Ozge Mercanoglu Sincan, Necati Cihan Camgoz, Richard Bowden
  • For: The paper aims to improve the accuracy of sign language translation (SLT) by incorporating contextual information into the translation process.
  • Methods: The proposed method uses a multi-modal transformer architecture that combines low-level video features, recognized sign glosses, and contextual information from previous sign sequences to generate spoken language translations.
  • Results: The proposed approach significantly improves SLT performance compared to baseline methods, nearly doubling their BLEU-4 scores. The results are evaluated on two datasets: BOBSL and SRF.
    Abstract Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos, both of which have different grammar and word/gloss order. From a Neural Machine Translation (NMT) perspective, the straightforward way of training translation models is to use sign language phrase-spoken language sentence pairs. However, human interpreters heavily rely on the context to understand the conveyed information, especially for sign language interpretation, where the vocabulary size may be significantly smaller than their spoken language equivalent. Taking direct inspiration from how humans translate, we propose a novel multi-modal transformer architecture that tackles the translation task in a context-aware manner, as a human would. We use the context from previous sequences and confident predictions to disambiguate weaker visual cues. To achieve this we use complementary transformer encoders, namely: (1) A Video Encoder, that captures the low-level video features at the frame-level, (2) A Spotting Encoder, that models the recognized sign glosses in the video, and (3) A Context Encoder, which captures the context of the preceding sign sequences. We combine the information coming from these encoders in a final transformer decoder to generate spoken language translations. We evaluate our approach on the recently published large-scale BOBSL dataset, which contains ~1.2M sequences, and on the SRF dataset, which was part of the WMT-SLT 2022 challenge. We report significant improvements on state-of-the-art translation performance using contextual information, nearly doubling the reported BLEU-4 scores of baseline approaches.
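
A minimal sketch of the three-encoder design the abstract describes, assuming pre-extracted features for each stream; the layer counts, dimensions, and the simple concatenation of encoder outputs into a single decoder memory are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ContextAwareSLT(nn.Module):
    """Sketch of the three-encoder design from the abstract: video,
    spotting (gloss) and context encoders feed a single decoder.
    Dimensions and layer counts are illustrative, not the paper's."""
    def __init__(self, d_model=512, vocab_size=20000):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.video_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.spot_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.ctx_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, gloss_feats, context_feats, tgt_emb):
        # Encode each stream, then concatenate along time as decoder memory.
        # Causal/padding masks are omitted for brevity.
        memory = torch.cat([self.video_enc(video_feats),
                            self.spot_enc(gloss_feats),
                            self.ctx_enc(context_feats)], dim=1)
        return self.out(self.decoder(tgt_emb, memory))

model = ContextAwareSLT()
y = model(torch.randn(2, 64, 512), torch.randn(2, 10, 512),
          torch.randn(2, 20, 512), torch.randn(2, 15, 512))
print(y.shape)  # torch.Size([2, 15, 20000])
```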

LaRS: A Diverse Panoptic Maritime Obstacle Detection Dataset and Benchmark

  • paper_url: http://arxiv.org/abs/2308.09618
  • repo_url: None
  • paper_authors: Lojze Žust, Janez Perš, Matej Kristan
  • for: To provide a diverse maritime obstacle detection dataset that can drive further progress in the field.
  • methods: Introduces LaRS, a new maritime panoptic obstacle detection benchmark featuring scenes from lakes, rivers, and seas.
  • results: An evaluation of 27 semantic and panoptic segmentation methods reveals several performance insights and future research directions; an online evaluation server is provided for objective comparison of maritime obstacle detection methods.
    Abstract The progress in maritime obstacle detection is hindered by the lack of a diverse dataset that adequately captures the complexity of general maritime environments. We present the first maritime panoptic obstacle detection benchmark LaRS, featuring scenes from Lakes, Rivers and Seas. Our major contribution is the new dataset, which boasts the largest diversity in recording locations, scene types, obstacle classes, and acquisition conditions among the related datasets. LaRS is composed of over 4000 per-pixel labeled key frames with nine preceding frames to allow utilization of the temporal texture, amounting to over 40k frames. Each key frame is annotated with 8 thing, 3 stuff classes and 19 global scene attributes. We report the results of 27 semantic and panoptic segmentation methods, along with several performance insights and future research directions. To enable objective evaluation, we have implemented an online evaluation server. The LaRS dataset, evaluation toolkit and benchmark are publicly available at: https://lojzezust.github.io/lars-dataset

Far3D: Expanding the Horizon for Surround-view 3D Object Detection

  • paper_url: http://arxiv.org/abs/2308.09616
  • repo_url: None
  • paper_authors: Xiaohui Jiang, Shuailin Li, Yingfei Liu, Shihao Wang, Fan Jia, Tiancai Wang, Lijin Han, Xiangyu Zhang
  • for: To extend the perception range of 3D object detection, especially long-range detection, while keeping deployment cost low.
  • methods: Proposes a novel sparse query-based framework, Far3D, which generates 3D adaptive queries from high-quality 2D object priors, introduces a perspective-aware aggregation module to capture features across different views and scales, and adds a range-modulated 3D denoising approach to counter query error propagation and stabilize convergence in long-range tasks.
  • results: Achieves state-of-the-art performance on the challenging Argoverse 2 dataset, covering a wide range of 150 meters and surpassing several LiDAR-based methods; Far3D also outperforms previous methods on the nuScenes dataset. Code will be released soon.
    Abstract Recently 3D object detection from surround-view images has made notable advancements with its low deployment cost. However, most works have primarily focused on close perception range while leaving long-range detection less explored. Expanding existing methods directly to cover long distances poses challenges such as heavy computation costs and unstable convergence. To address these limitations, this paper proposes a novel sparse query-based framework, dubbed Far3D. By utilizing high-quality 2D object priors, we generate 3D adaptive queries that complement the 3D global queries. To efficiently capture discriminative features across different views and scales for long-range objects, we introduce a perspective-aware aggregation module. Additionally, we propose a range-modulated 3D denoising approach to address query error propagation and mitigate convergence issues in long-range tasks. Significantly, Far3D demonstrates SoTA performance on the challenging Argoverse 2 dataset, covering a wide range of 150 meters, surpassing several LiDAR-based approaches. Meanwhile, Far3D exhibits superior performance compared to previous methods on the nuScenes dataset. The code will be available soon.

Language-guided Human Motion Synthesis with Atomic Actions

  • paper_url: http://arxiv.org/abs/2308.09611
  • repo_url: https://github.com/yhzhai/atom
  • paper_authors: Yuanhao Zhai, Mingzhen Huang, Tianyu Luan, Lu Dong, Ifeoma Nwogu, Siwei Lyu, David Doermann, Junsong Yuan
  • for: To propose a language-guided human motion synthesis technique that copes with the inherent complexity and diversity of human behaviors.
  • methods: Decomposes complex human motions into atomic actions and employs a curriculum learning strategy to learn atomic action composition: learned atomic actions are assembled into novel motions, giving better adaptability to new actions. Training uses masked motion modeling with a gradually increasing mask ratio to facilitate atomic action assembly.
  • results: Extensive experiments on text-to-motion and action-to-motion synthesis tasks demonstrate ATOM's effectiveness: it generates plausible and coherent text-guided human motion sequences and adapts better to novel actions.
    Abstract Language-guided human motion synthesis has been a challenging task due to the inherent complexity and diversity of human behaviors. Previous methods face limitations in generalization to novel actions, often resulting in unrealistic or incoherent motion sequences. In this paper, we propose ATOM (ATomic mOtion Modeling) to mitigate this problem, by decomposing actions into atomic actions, and employing a curriculum learning strategy to learn atomic action composition. First, we disentangle complex human motions into a set of atomic actions during learning, and then assemble novel actions using the learned atomic actions, which offers better adaptability to new actions. Moreover, we introduce a curriculum learning training strategy that leverages masked motion modeling with a gradual increase in the mask ratio, and thus facilitates atomic action assembly. This approach mitigates the overfitting problem commonly encountered in previous methods while enforcing the model to learn better motion representations. We demonstrate the effectiveness of ATOM through extensive experiments, including text-to-motion and action-to-motion synthesis tasks. We further illustrate its superiority in synthesizing plausible and coherent text-guided human motion sequences.
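
The curriculum the abstract mentions, a mask ratio that grows over training, is easy to picture in code. The sketch below is a minimal version with a linear schedule and zero-masking of frames; the start/end ratios and the masking style are assumptions, not ATOM's exact recipe.

```python
import numpy as np

def mask_ratio(step, total_steps, start=0.15, end=0.7):
    """Curriculum schedule: the fraction of motion frames to mask grows
    linearly over training ('gradual increase in the mask ratio').
    The start/end values here are illustrative."""
    t = min(step / max(total_steps, 1), 1.0)
    return start + t * (end - start)

def mask_motion(tokens, ratio, rng):
    """Randomly mask a `ratio` fraction of frames in a (T, D) motion clip."""
    T = tokens.shape[0]
    idx = rng.choice(T, size=int(round(ratio * T)), replace=False)
    masked = tokens.copy()
    masked[idx] = 0.0   # simple zero mask; a learned mask token also works
    return masked, idx

rng = np.random.default_rng(0)
clip = rng.normal(size=(60, 24))   # hypothetical 60-frame motion clip
m, idx = mask_motion(clip, mask_ratio(5000, 20000), rng)
print(mask_ratio(5000, 20000), len(idx))
```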

On the Effectiveness of LayerNorm Tuning for Continual Learning in Vision Transformers

  • paper_url: http://arxiv.org/abs/2308.09610
  • repo_url: https://github.com/tdemin16/continual-layernorm-tuning
  • paper_authors: Thomas De Min, Massimiliano Mancini, Karteek Alahari, Xavier Alameda-Pineda, Elisa Ricci
  • for: To reduce the computational cost of continual learning methods while maintaining competitive performance.
  • methods: Tunes the scale and bias parameters of LayerNorm for each continual learning task, selecting them at inference time based on the similarity between task-specific keys and the output of the pre-trained model.
  • results: Experiments on ImageNet-R and CIFAR-100 show the method matches or exceeds the state of the art at a lower computational cost.
    Abstract State-of-the-art rehearsal-free continual learning methods exploit the peculiarities of Vision Transformers to learn task-specific prompts, drastically reducing catastrophic forgetting. However, there is a tradeoff between the number of learned parameters and the performance, making such models computationally expensive. In this work, we aim to reduce this cost while maintaining competitive performance. We achieve this by revisiting and extending a simple transfer learning idea: learning task-specific normalization layers. Specifically, we tune the scale and bias parameters of LayerNorm for each continual learning task, selecting them at inference time based on the similarity between task-specific keys and the output of the pre-trained model. To make the classifier robust to incorrect selection of parameters during inference, we introduce a two-stage training procedure, where we first optimize the task-specific parameters and then train the classifier with the same selection procedure of the inference time. Experiments on ImageNet-R and CIFAR-100 show that our method achieves results that are either superior or on par with the state of the art while being computationally cheaper.
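
A minimal sketch of the core mechanism: one (scale, bias) pair of LayerNorm parameters per task, plus learned task keys used at inference to pick the parameters by similarity with a pooled backbone feature. Shapes, the cosine-similarity choice, and the pooling are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskLayerNorm(nn.Module):
    """Per-task LayerNorm scale/bias, chosen at inference by similarity
    between learned task keys and the pre-trained model's output."""
    def __init__(self, dim, num_tasks):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_tasks, dim))
        self.beta = nn.Parameter(torch.zeros(num_tasks, dim))
        self.keys = nn.Parameter(torch.randn(num_tasks, dim))

    def select_task(self, pooled_feat):
        # Cosine similarity between a pooled feature and each task key.
        sim = F.cosine_similarity(pooled_feat.unsqueeze(1),
                                  self.keys.unsqueeze(0), dim=-1)
        return sim.argmax(dim=1)   # (B,) predicted task index per sample

    def forward(self, x, task_id):
        x = F.layer_norm(x, x.shape[-1:])
        return x * self.gamma[task_id].unsqueeze(1) + self.beta[task_id].unsqueeze(1)

ln = TaskLayerNorm(dim=768, num_tasks=10)
tokens = torch.randn(4, 197, 768)          # hypothetical ViT token sequence
task = ln.select_task(tokens.mean(dim=1))  # pooled feature -> task choice
print(ln(tokens, task).shape)              # torch.Size([4, 197, 768])
```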

Language-Guided Diffusion Model for Visual Grounding

  • paper_url: http://arxiv.org/abs/2308.09599
  • repo_url: None
  • paper_authors: Sijia Chen, Baochun Li
  • for: solves the problem of visual grounding, a cross-modal alignment task, in a generative way.
  • methods: uses a language-guided diffusion framework called LG-DVG, which trains the model to progressively reason queried object boxes by denoising a set of noisy boxes with the language guide.
  • results: achieves superior performance on five widely used datasets, validating the effectiveness of the proposed framework.
    Abstract Visual grounding (VG) tasks involve explicit cross-modal alignment, as semantically corresponding image regions are to be located for the language phrases provided. Existing approaches complete such visual-text reasoning in a single-step manner. Their performance causes high demands on large-scale anchors and over-designed multi-modal fusion modules based on human priors, leading to complicated frameworks that may be difficult to train and overfit to specific scenarios. Even worse, such once-for-all reasoning mechanisms are incapable of refining boxes continuously to enhance query-region matching. In contrast, in this paper, we formulate an iterative reasoning process by denoising diffusion modeling. Specifically, we propose a language-guided diffusion framework for visual grounding, LG-DVG, which trains the model to progressively reason queried object boxes by denoising a set of noisy boxes with the language guide. To achieve this, LG-DVG gradually perturbs query-aligned ground truth boxes to noisy ones and reverses this process step by step, conditional on query semantics. Extensive experiments for our proposed framework on five widely used datasets validate the superior performance of solving visual grounding, a cross-modal alignment task, in a generative way. The source codes are available at https://github.com/iQua/vgbase/tree/DiffusionVG.
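
The forward (noising) half of the box-denoising idea can be sketched as below: ground-truth boxes are perturbed toward Gaussian noise under a diffusion schedule, and the model is trained to reverse the process conditioned on the query text. The cosine schedule, the signal-scaling constant, and the box parameterization are assumptions borrowed from common box-diffusion practice, not LG-DVG's exact settings.

```python
import torch

def noise_boxes(gt_boxes, t, T=1000, scale=2.0):
    """Forward diffusion step over boxes: perturb query-aligned ground-truth
    boxes with a cosine schedule. Boxes are (cx, cy, w, h) in [0, 1];
    the constants are illustrative."""
    s = 0.008
    f = lambda u: torch.cos((u / T + s) / (1 + s) * torch.pi / 2) ** 2
    alpha_bar = f(torch.as_tensor(float(t))) / f(torch.zeros(()))  # signal level
    x0 = (gt_boxes * 2 - 1) * scale        # map boxes into the diffusion range
    eps = torch.randn_like(x0)
    xt = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * eps
    # The denoiser learns to undo eps step by step, guided by query semantics.
    return xt, eps

gt = torch.tensor([[0.5, 0.4, 0.2, 0.3]])  # one hypothetical box
xt, eps = noise_boxes(gt, t=400)
print(xt)
```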

Investigation of Architectures and Receptive Fields for Appearance-based Gaze Estimation

  • paper_url: http://arxiv.org/abs/2308.09593
  • repo_url: https://github.com/yunhanwang1105/GazeTech
  • paper_authors: Yunhan Wang, Xiangwei Shi, Shalini De Mello, Hyung Jin Chang, Xucong Zhang
  • for: To investigate architectures and receptive fields for appearance-based gaze estimation, which has attracted great attention from the computer vision and human-computer interaction communities over the past decade.
  • methods: Reviews the mechanisms proposed in prior work, including soft attention, hard attention, two-eye asymmetry, feature disentanglement, rotation consistency, and contrastive learning; most methods take a single face or multiple regions as input, whereas this paper focuses on the basic architecture itself.
  • results: Tuning a few simple parameters of a ResNet architecture outperforms most existing state-of-the-art methods on three popular datasets. Extensive experiments show that the stride number, input image resolution, and multi-region architecture are critical for gaze estimation performance, with their effectiveness depending on the quality of the input face image. With a ResNet-50 backbone, the method achieves state-of-the-art gaze errors of 3.64 degrees on ETH-XGaze, 4.50 on MPIIFaceGaze, and 9.13 on Gaze360.
    Abstract With the rapid development of deep learning technology in the past decade, appearance-based gaze estimation has attracted great attention from both computer vision and human-computer interaction research communities. Fascinating methods were proposed with variant mechanisms including soft attention, hard attention, two-eye asymmetry, feature disentanglement, rotation consistency, and contrastive learning. Most of these methods take the single-face or multi-region as input, yet the basic architecture of gaze estimation has not been fully explored. In this paper, we reveal the fact that tuning a few simple parameters of a ResNet architecture can outperform most of the existing state-of-the-art methods for the gaze estimation task on three popular datasets. With our extensive experiments, we conclude that the stride number, input image resolution, and multi-region architecture are critical for the gaze estimation performance while their effectiveness dependent on the quality of the input face image. We obtain the state-of-the-art performances on three datasets with 3.64 on ETH-XGaze, 4.50 on MPIIFaceGaze, and 9.13 on Gaze360 degrees gaze estimation error by taking ResNet-50 as the backbone.
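
The "few simple parameters" knobs are easy to demonstrate. The sketch below reduces the stride of a torchvision ResNet-50's last stage (enlarging the final feature map) and feeds a higher-resolution face crop; the 2-D (pitch, yaw) regression head and the specific values are assumptions, not the paper's exact configuration.

```python
import torch
from torchvision.models import resnet50

def gaze_resnet(stride_last_stage=1):
    """ResNet-50 gaze regressor sketch: stride and input resolution are the
    knobs the paper identifies as critical. Head and values are illustrative."""
    net = resnet50(weights=None)
    if stride_last_stage == 1:
        # Reduce the downsampling of layer4 from stride 2 to stride 1,
        # doubling the spatial size of the final feature map.
        net.layer4[0].conv2.stride = (1, 1)
        net.layer4[0].downsample[0].stride = (1, 1)
    net.fc = torch.nn.Linear(net.fc.in_features, 2)  # (pitch, yaw)
    return net

model = gaze_resnet()
face = torch.randn(1, 3, 448, 448)  # higher input resolution than the usual 224
print(model(face).shape)            # torch.Size([1, 2])
```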

StableVideo: Text-driven Consistency-aware Diffusion Video Editing

  • paper_url: http://arxiv.org/abs/2308.09592
  • repo_url: https://github.com/rese1f/stablevideo
  • paper_authors: Wenhao Chai, Xun Guo, Gaoang Wang, Yan Lu
  • for: To address the problem that diffusion-based methods struggle to edit existing objects in a video while preserving their appearance over time.
  • methods: Introduces temporal dependency into text-driven diffusion models to keep the appearance of edited objects consistent; specifically, a novel inter-frame propagation mechanism, built on layered representations, propagates appearance information from one frame to the next.
  • results: Experiments show the method achieves higher editing quality and consistency than existing video editing approaches.
    Abstract Diffusion-based methods can generate realistic images and videos, but they struggle to edit existing objects in a video while preserving their appearance over time. This prevents diffusion models from being applied to natural video editing in practical scenarios. In this paper, we tackle this problem by introducing temporal dependency to existing text-driven diffusion models, which allows them to generate consistent appearance for the edited objects. Specifically, we develop a novel inter-frame propagation mechanism for diffusion video editing, which leverages the concept of layered representations to propagate the appearance information from one frame to the next. We then build up a text-driven video editing framework based on this mechanism, namely StableVideo, which can achieve consistency-aware video editing. Extensive experiments demonstrate the strong editing capability of our approach. Compared with state-of-the-art video editing methods, our approach shows superior qualitative and quantitative results. Our code is available at https://github.com/rese1f/StableVideo.

O^2-Recon: Completing 3D Reconstruction of Occluded Objects in the Scene with a Pre-trained 2D Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.09591
  • repo_url: None
  • paper_authors: Yubin Hu, Sheng Ye, Wang Zhao, Matthieu Lin, Yuze He, Yu-Hui Wen, Ying He, Yong-Jin Liu
  • for: To propose a new method for completing the 3D reconstruction of occluded objects in RGB-D videos.
  • methods: Uses a pre-trained 2D diffusion model to in-paint the hidden regions of 2D images, then uses the in-painted images to optimize a neural implicit surface representation for the 3D reconstruction of each instance.
  • results: Experiments on ScanNet scenes show state-of-the-art accuracy and completeness in object-level reconstruction from scene-level RGB-D videos.
    Abstract Occlusion is a common issue in 3D reconstruction from RGB-D videos, often blocking the complete reconstruction of objects and presenting an ongoing problem. In this paper, we propose a novel framework, empowered by a 2D diffusion-based in-painting model, to reconstruct complete surfaces for the hidden parts of objects. Specifically, we utilize a pre-trained diffusion model to fill in the hidden areas of 2D images. Then we use these in-painted images to optimize a neural implicit surface representation for each instance for 3D reconstruction. Since creating the in-painting masks needed for this process is tricky, we adopt a human-in-the-loop strategy that involves very little human engagement to generate high-quality masks. Moreover, some parts of objects can be totally hidden because the videos are usually shot from limited perspectives. To ensure recovering these invisible areas, we develop a cascaded network architecture for predicting signed distance field, making use of different frequency bands of positional encoding and maintaining overall smoothness. Besides the commonly used rendering loss, Eikonal loss, and silhouette loss, we adopt a CLIP-based semantic consistency loss to guide the surface from unseen camera angles. Experiments on ScanNet scenes show that our proposed framework achieves state-of-the-art accuracy and completeness in object-level reconstruction from scene-level RGB-D videos.

Deep Equilibrium Object Detection

  • paper_url: http://arxiv.org/abs/2308.09564
  • repo_url: https://github.com/MCG-NJU/DEQDet
  • paper_authors: Shuai Wang, Yao Teng, Limin Wang
  • for: Proposes a new query-based object detector (DEQDet) built around a deep equilibrium decoder.
  • methods: The deep equilibrium (DEQ) decoder models query-vector refinement as fixed-point solving, producing stable, meaningful representations from which simple FFN heads directly predict object locations and categories.
  • results: Compared to its baseline (AdaMixer), DEQDet converges faster, consumes less memory, and achieves better results: with a ResNet-50 backbone and 300 queries it reaches $49.5$ mAP and $33.0$ AP$_s$ on MS COCO.
    Abstract Query-based object detectors directly decode image features into object instances with a set of learnable queries. These query vectors are progressively refined to stable meaningful representations through a sequence of decoder layers, and then used to directly predict object locations and categories with simple FFN heads. In this paper, we present a new query-based object detector (DEQDet) by designing a deep equilibrium decoder. Our DEQ decoder models the query vector refinement as the fixed point solving of an implicit layer and is equivalent to applying infinite steps of refinement. To be more specific to object decoding, we use a two-step unrolled equilibrium equation to explicitly capture the query vector refinement. Accordingly, we are able to incorporate refinement awareness into the DEQ training with the inexact gradient back-propagation (RAG). In addition, to stabilize the training of our DEQDet and improve its generalization ability, we devise the deep supervision scheme on the optimization path of DEQ with refinement-aware perturbation (RAP). Our experiments demonstrate DEQDet converges faster, consumes less memory, and achieves better results than the baseline counterpart (AdaMixer). In particular, our DEQDet with ResNet50 backbone and 300 queries achieves the $49.5$ mAP and $33.0$ AP$_s$ on the MS COCO benchmark under $2\times$ training scheme (24 epochs).
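
The "infinite refinement" view is the standard DEQ recipe: iterate the refinement map to a fixed point without tracking gradients, then take one differentiable step at the equilibrium (an inexact gradient, in the spirit of the paper's RAG). The sketch below shows this generic pattern with a toy contraction mapping; it is not the paper's two-step unrolled solver.

```python
import torch

def fixed_point(refine, q0, x, n_iters=20, tol=1e-4):
    """Generic DEQ-style solver: iterate q <- refine(q, x) until the query
    embedding stops moving, i.e. 'infinite' refinement steps."""
    q = q0
    with torch.no_grad():                        # solve without autograd history
        for _ in range(n_iters):
            q_next = refine(q, x)
            if (q_next - q).norm() < tol * (q.norm() + 1e-8):
                q = q_next
                break
            q = q_next
    # One differentiable step at the equilibrium yields an inexact gradient.
    return refine(q, x)

refine = lambda q, x: 0.5 * q + torch.tanh(x)    # toy contraction mapping
q_star = fixed_point(refine, torch.zeros(4, 8), torch.randn(4, 8))
print(q_star.shape)   # torch.Size([4, 8]); fixed point is 2 * tanh(x)
```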

Decoupled conditional contrastive learning with variable metadata for prostate lesion detection

  • paper_url: http://arxiv.org/abs/2308.09542
  • repo_url: https://github.com/camilleruppli/decoupled_ccl
  • paper_authors: Camille Ruppli, Pietro Gori, Roberto Ardon, Isabelle Bloch
  • For: Early detection of prostate cancer is crucial for efficient treatment.
  • Methods: Uses multi-parametric Magnetic Resonance Images (mp-MRI) for lesion detection and leverages the Prostate Imaging Reporting and Data System (PI-RADS), which standardizes the interpretation of prostate MRI by scoring lesion malignancy; a new contrastive loss exploits weak metadata with multiple annotators per sample by defining metadata confidence.
  • Results: The proposed contrastive loss, which combines metadata of varying confidence with unannotated data, improves lesion detection AUC by 3% on the public PI-CAI challenge dataset.
    Abstract Early diagnosis of prostate cancer is crucial for efficient treatment. Multi-parametric Magnetic Resonance Images (mp-MRI) are widely used for lesion detection. The Prostate Imaging Reporting and Data System (PI-RADS) has standardized interpretation of prostate MRI by defining a score for lesion malignancy. PI-RADS data is readily available from radiology reports but is subject to high inter-reports variability. We propose a new contrastive loss function that leverages weak metadata with multiple annotators per sample and takes advantage of inter-reports variability by defining metadata confidence. By combining metadata of varying confidence with unannotated data into a single conditional contrastive loss function, we report a 3% AUC increase on lesion detection on the public PI-CAI challenge dataset. Code is available at: https://github.com/camilleruppli/decoupled_ccl
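
One way to picture "metadata confidence" in a contrastive loss: weight each positive pair by the product of the two samples' annotation confidences (e.g., annotator agreement on PI-RADS), so uncertain labels pull representations together less strongly. This is a sketch of that idea in a supervised-contrastive form, not the paper's exact decoupled conditional loss.

```python
import torch
import torch.nn.functional as F

def confidence_contrastive(z, labels, conf, tau=0.1):
    """Confidence-weighted supervised contrastive loss sketch.
    z: (B, D) embeddings; labels: (B,) weak labels; conf: (B,) in [0, 1]."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau
    B = z.size(0)
    eye = torch.eye(B, dtype=torch.bool)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    w = conf.unsqueeze(0) * conf.unsqueeze(1)          # pairwise confidence
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, -1e9), dim=1, keepdim=True)
    per_anchor = (w * pos * (-log_prob)).sum(1) / (w * pos).sum(1).clamp_min(1e-8)
    return per_anchor[pos.any(1)].mean()

z = torch.randn(8, 128, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 0, 1])
conf = torch.tensor([1.0, 0.5, 1.0, 1.0, 0.8, 0.8, 0.3, 1.0])
print(confidence_contrastive(z, labels, conf))
```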

Uncertainty-based quality assurance of carotid artery wall segmentation in black-blood MRI

  • paper_url: http://arxiv.org/abs/2308.09538
  • repo_url: https://github.com/miagrouput/carotid-segmentation
  • paper_authors: Elina Thibeau-Sutre, Dieuwertje Alblas, Sophie Buurman, Christoph Brune, Jelmer M. Wolterink
  • for: To enable applying deep learning models to large-scale data sets with automatic quality assurance.
  • methods: Fully automatic segmentation of the carotid artery wall in black-blood MRI, with the uncertainty of model predictions estimated using Monte Carlo dropout or test-time data augmentation.
  • results: Including uncertainty measurements did not degrade segmentation quality; uncertainty metrics serve as a good proxy for segmentation quality and can detect low-quality segmentations at the participant level. Such an automatic quality assurance tool may enable applying the model to large-scale data sets.
    Abstract The application of deep learning models to large-scale data sets requires means for automatic quality assurance. We have previously developed a fully automatic algorithm for carotid artery wall segmentation in black-blood MRI that we aim to apply to large-scale data sets. This method identifies nested artery walls in 3D patches centered on the carotid artery. In this study, we investigate to what extent the uncertainty in the model predictions for the contour location can serve as a surrogate for error detection and, consequently, automatic quality assurance. We express the quality of automatic segmentations using the Dice similarity coefficient. The uncertainty in the model's prediction is estimated using either Monte Carlo dropout or test-time data augmentation. We found that (1) including uncertainty measurements did not degrade the quality of the segmentations, (2) uncertainty metrics provide a good proxy of the quality of our contours if the center found during the first step is enclosed in the lumen of the carotid artery and (3) they could be used to detect low-quality segmentations at the participant level. This automatic quality assurance tool might enable the application of our model in large-scale data sets.
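
A minimal sketch of the Monte Carlo dropout half of the uncertainty estimation: keep dropout active at test time, run several stochastic forward passes, and use the per-pixel variance as an uncertainty map for quality control. The toy network and the sigmoid head are assumptions for illustration.

```python
import torch
import torch.nn as nn

def mc_dropout_uncertainty(model, x, n_samples=20):
    """Monte Carlo dropout: eval mode everywhere except dropout layers,
    then aggregate several stochastic predictions."""
    model.eval()
    for m in model.modules():           # re-enable dropout only
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        preds = torch.stack([torch.sigmoid(model(x)) for _ in range(n_samples)])
    return preds.mean(0), preds.var(0)  # segmentation estimate, uncertainty map

# Toy stand-in for a segmentation network.
net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                    nn.Dropout(0.5), nn.Conv2d(8, 1, 3, padding=1))
mean, var = mc_dropout_uncertainty(net, torch.randn(1, 1, 64, 64))
print(mean.shape, var.mean())  # a high mean variance flags low-quality cases
```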

Small Object Detection via Coarse-to-fine Proposal Generation and Imitation Learning

  • paper_url: http://arxiv.org/abs/2308.09534
  • repo_url: https://github.com/shaunyuan22/cfinet
  • paper_authors: Xiang Yuan, Gong Cheng, Kebing Yan, Qinghua Zeng, Junwei Han
  • for: To improve the accuracy of small object detection.
  • methods: Uses a Coarse-to-fine RPN (CRPN) and Feature Imitation (FI) learning.
  • results: Achieves leading performance on large-scale small object detection benchmarks.
    Abstract The past few years have witnessed the immense success of object detection, while current excellent detectors struggle on tackling size-limited instances. Concretely, the well-known challenge of low overlaps between the priors and object regions leads to a constrained sample pool for optimization, and the paucity of discriminative information further aggravates the recognition. To alleviate the aforementioned issues, we propose CFINet, a two-stage framework tailored for small object detection based on the Coarse-to-fine pipeline and Feature Imitation learning. Firstly, we introduce Coarse-to-fine RPN (CRPN) to ensure sufficient and high-quality proposals for small objects through the dynamic anchor selection strategy and cascade regression. Then, we equip the conventional detection head with a Feature Imitation (FI) branch to facilitate the region representations of size-limited instances that perplex the model in an imitation manner. Moreover, an auxiliary imitation loss following supervised contrastive learning paradigm is devised to optimize this branch. When integrated with Faster RCNN, CFINet achieves state-of-the-art performance on the large-scale small object detection benchmarks, SODA-D and SODA-A, underscoring its superiority over baseline detector and other mainstream detection approaches.

Improving 3D Pose Estimation for Sign Language

  • paper_url: http://arxiv.org/abs/2308.09525
  • repo_url: None
  • paper_authors: Maksym Ivashechkin, Oscar Mendez, Richard Bowden
  • for: To propose a fast and valid 3D human pose reconstruction method based on forward kinematics and neural networks.
  • methods: Given 2D keypoint detections in the image, neural networks predict joint rotations and bone lengths; these predictions are combined with skeletal constraints through a forward kinematics (FK) layer implemented as a network layer in PyTorch, yielding accurate 3D poses.
  • results: Quantitative and qualitative evaluation shows the method is significantly more accurate than MediaPipe in both per-joint positional error and visual appearance, and it generalizes across datasets. The PyTorch implementation runs at 100-200 milliseconds per image (including CNN detection) on CPU only.
    Abstract This work addresses 3D human pose reconstruction in single images. We present a method that combines Forward Kinematics (FK) with neural networks to ensure a fast and valid prediction of 3D pose. Pose is represented as a hierarchical tree/graph with nodes corresponding to human joints that model their physical limits. Given a 2D detection of keypoints in the image, we lift the skeleton to 3D using neural networks to predict both the joint rotations and bone lengths. These predictions are then combined with skeletal constraints using an FK layer implemented as a network layer in PyTorch. The result is a fast and accurate approach to the estimation of 3D skeletal pose. Through quantitative and qualitative evaluation, we demonstrate the method is significantly more accurate than MediaPipe in terms of both per joint positional error and visual appearance. Furthermore, we demonstrate generalization over different datasets. The implementation in PyTorch runs at between 100-200 milliseconds per image (including CNN detection) using CPU only.
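
A minimal FK layer sketch in the spirit of the abstract: accumulate per-joint rotations down a kinematic tree and offset each joint from its parent by the predicted bone length. Fixing the bone rest direction to +y and the 4-joint chain are simplifying assumptions; the paper's skeleton and constraints are richer.

```python
import torch

def forward_kinematics(rotations, bone_lengths, parents):
    """FK sketch: rotations (B, J, 3, 3), bone_lengths (B, J), and a parent
    index per joint (root has parent -1) produce 3D joint positions (B, J, 3)."""
    B, J = bone_lengths.shape
    rest_dir = torch.tensor([0.0, 1.0, 0.0])   # assumed bone rest direction
    pos = torch.zeros(B, J, 3)
    glob = [None] * J
    for j in range(J):
        p = parents[j]
        if p < 0:
            glob[j] = rotations[:, j]                  # root orientation
            continue
        glob[j] = glob[p] @ rotations[:, j]            # accumulate rotation
        offset = bone_lengths[:, j, None] * rest_dir   # (B, 3) bone vector
        pos[:, j] = pos[:, p] + (glob[p] @ offset.unsqueeze(-1)).squeeze(-1)
    return pos

parents = [-1, 0, 1, 2]                                # a 4-joint chain
R = torch.eye(3).expand(2, 4, 3, 3).contiguous()       # identity rotations
lengths = torch.ones(2, 4)
print(forward_kinematics(R, lengths, parents))         # joints stack along +y
```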

Denoising Diffusion for 3D Hand Pose Estimation from Images

  • paper_url: http://arxiv.org/abs/2308.09523
  • repo_url: None
  • paper_authors: Maksym Ivashechkin, Oscar Mendez, Richard Bowden
  • for: To address the problem of 3D hand pose estimation from monocular images or sequences.
  • methods: Proposes a novel end-to-end framework that uses diffusion models, which capture the data distribution well, and enforces kinematic constraints so that the generated poses are valid.
  • results: Provides state-of-the-art performance when lifting a 2D single-hand image to 3D; when sequence data is available, a Transformer module over a temporal window of consecutive frames refines the results, overcoming jittering and further increasing accuracy.
    Abstract Hand pose estimation from a single image has many applications. However, approaches to full 3D body pose estimation are typically trained on day-to-day activities or actions. As such, detailed hand-to-hand interactions are poorly represented, especially during motion. We see this in the failure cases of techniques such as OpenPose or MediaPipe. However, accurate hand pose estimation is crucial for many applications where the global body motion is less important than accurate hand pose estimation. This paper addresses the problem of 3D hand pose estimation from monocular images or sequences. We present a novel end-to-end framework for 3D hand regression that employs diffusion models that have shown excellent ability to capture the distribution of data for generative purposes. Moreover, we enforce kinematic constraints to ensure realistic poses are generated by incorporating an explicit forward kinematic layer as part of the network. The proposed model provides state-of-the-art performance when lifting a 2D single-hand image to 3D. However, when sequence data is available, we add a Transformer module over a temporal window of consecutive frames to refine the results, overcoming jittering and further increasing accuracy. The method is quantitatively and qualitatively evaluated showing state-of-the-art robustness, generalization, and accuracy on several different datasets.

Leveraging Intrinsic Properties for Non-Rigid Garment Alignment

  • paper_url: http://arxiv.org/abs/2308.09519
  • repo_url: None
  • paper_authors: Siyou Lin, Boyao Zhou, Zerong Zheng, Hongwen Zhang, Yebin Liu
  • for: This paper targets the problem of aligning real-world 3D garment data, which is beneficial for applications such as texture learning, physical parameter estimation, and generative modeling of garments.
  • methods: The proposed method leverages intrinsic manifold properties and uses two neural deformation fields, one in 3D space and another in intrinsic space, to achieve coarse-to-fine alignment. The coarse stage performs a 3D fitting, and the refinement stage uses a second neural deformation field for higher accuracy.
  • results: The method achieves accurate wrinkle-level and texture-level alignment and works well for difficult garment types such as long coats. The project page with more information and results is available at https://jsnln.github.io/iccv2023_intrinsic/index.html.
    Abstract We address the problem of aligning real-world 3D data of garments, which benefits many applications such as texture learning, physical parameter estimation, generative modeling of garments, etc. Existing extrinsic methods typically perform non-rigid iterative closest point and struggle to align details due to incorrect closest matches and rigidity constraints. While intrinsic methods based on functional maps can produce high-quality correspondences, they work under isometric assumptions and become unreliable for garment deformations which are highly non-isometric. To achieve wrinkle-level as well as texture-level alignment, we present a novel coarse-to-fine two-stage method that leverages intrinsic manifold properties with two neural deformation fields, in the 3D space and the intrinsic space, respectively. The coarse stage performs a 3D fitting, where we leverage intrinsic manifold properties to define a manifold deformation field. The coarse fitting then induces a functional map that produces an alignment of intrinsic embeddings. We further refine the intrinsic alignment with a second neural deformation field for higher accuracy. We evaluate our method with our captured garment dataset, GarmCap. The method achieves accurate wrinkle-level and texture-level alignment and works for difficult garment types such as long coats. Our project page is https://jsnln.github.io/iccv2023_intrinsic/index.html.

Learnt Contrastive Concept Embeddings for Sign Recognition

  • paper_url: http://arxiv.org/abs/2308.09515
  • repo_url: None
  • paper_authors: Ryan Wong, Necati Cihan Camgoz, Richard Bowden
  • for: To provide a weakly supervised, contrastive sign embedding method that bridges sign language and spoken language.
  • methods: Proposes a learning framework that derives LCC (Learnt Contrastive Concept) embeddings for sign language, training a vocabulary of embeddings from the linguistic labels for sign video; a conceptual similarity loss exploits word embeddings from NLP methods to create sign embeddings with better sign-to-spoken-language correspondence.
  • results: Achieves state-of-the-art keypoint-based sign recognition performance on the WLASL and BOBSL datasets.
    Abstract In natural language processing (NLP) of spoken languages, word embeddings have been shown to be a useful method to encode the meaning of words. Sign languages are visual languages, which require sign embeddings to capture the visual and linguistic semantics of sign. Unlike many common approaches to Sign Recognition, we focus on explicitly creating sign embeddings that bridge the gap between sign language and spoken language. We propose a learning framework to derive LCC (Learnt Contrastive Concept) embeddings for sign language, a weakly supervised contrastive approach to learning sign embeddings. We train a vocabulary of embeddings that are based on the linguistic labels for sign video. Additionally, we develop a conceptual similarity loss which is able to utilise word embeddings from NLP methods to create sign embeddings that have better sign language to spoken language correspondence. These learnt representations allow the model to automatically localise the sign in time. Our approach achieves state-of-the-art keypoint-based sign recognition performance on the WLASL and BOBSL datasets.

ResQ: Residual Quantization for Video Perception

  • paper_url: http://arxiv.org/abs/2308.09511
  • repo_url: None
  • paper_authors: Davide Abati, Haitam Ben Yahia, Markus Nagel, Amirhossein Habibian
  • for: To accelerate video perception, such as semantic segmentation and human pose estimation, by exploiting cross-frame redundancies.
  • methods: Observes that residuals, the differences in network activations between two neighboring frames, are highly quantizable, and proposes a Residual Quantization scheme for video networks based on this property.
  • results: Outperforms standard quantization and existing efficient video perception models in the accuracy-versus-bit-width trade-off.
    Abstract This paper accelerates video perception, such as semantic segmentation and human pose estimation, by levering cross-frame redundancies. Unlike the existing approaches, which avoid redundant computations by warping the past features using optical-flow or by performing sparse convolutions on frame differences, we approach the problem from a new perspective: low-bit quantization. We observe that residuals, as the difference in network activations between two neighboring frames, exhibit properties that make them highly quantizable. Based on this observation, we propose a novel quantization scheme for video networks coined as Residual Quantization. ResQ extends the standard, frame-by-frame, quantization scheme by incorporating temporal dependencies that lead to better performance in terms of accuracy vs. bit-width. Furthermore, we extend our model to dynamically adjust the bit-width proportional to the amount of changes in the video. We demonstrate the superiority of our model, against the standard quantization and existing efficient video perception models, using various architectures on semantic segmentation and human pose estimation benchmarks.
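
The core observation, inter-frame residuals are small and therefore cheap to quantize, can be sketched for a linear layer, where `layer(prev + residual) = layer(prev) + layer(residual)` holds exactly. The bit-width, the uniform fake-quantizer, and the caching scheme are illustrative assumptions, not ResQ's exact design.

```python
import torch

def quantize(x, num_bits):
    """Uniform symmetric fake-quantization to `num_bits` bits."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp_min(1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax, qmax) * scale

def residual_forward(layer, x_t, cache, num_bits=4):
    """Quantize the *change* between consecutive frames (far more
    compressible than the activations themselves) and update the cached
    full-precision output. `cache` holds (prev_input, prev_output)."""
    x_prev, y_prev = cache
    r = quantize(x_t - x_prev, num_bits)   # low-bit inter-frame residual
    y_t = y_prev + layer(r)                # exact only for linear layers
    return y_t, (x_prev + r, y_t)

layer = torch.nn.Conv2d(3, 8, 3, padding=1, bias=False)  # linear op
x0 = torch.randn(1, 3, 32, 32)
cache = (x0, layer(x0))                    # key frame at full precision
x1 = x0 + 0.05 * torch.randn_like(x0)      # small inter-frame change
y1, cache = residual_forward(layer, x1, cache)
print((y1 - layer(x1)).abs().max())        # small residual-quantization error
```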

Video-Instrument Synergistic Network for Referring Video Instrument Segmentation in Robotic Surgery

  • paper_url: http://arxiv.org/abs/2308.09475
  • repo_url: None
  • paper_authors: Hongqiu Wang, Lei Zhu, Guang Yang, Yike Guo, Shichen Zhang, Bo Xu, Yueming Jin
  • for: To improve the quality of robot-assisted surgery by segmenting the surgical instruments referred to by a text description, facilitating surgical navigation and education.
  • methods: Introduces the new task of Referring Surgical Video Instrument Segmentation (RSVIS), which automatically identifies and segments instruments from a given language expression; a Video-Instrument Synergistic Network (VIS-Net) learns both video-level and instrument-level knowledge, and a Graph-based Relation-aware Module (GRM) models correlations between multi-modal information (textual descriptions and video frames) to extract instrument-level information.
  • results: Validated on two newly produced RSVIS datasets; experiments show VIS-Net significantly outperforms existing state-of-the-art referring segmentation methods.
    Abstract Robot-assisted surgery has made significant progress, with instrument segmentation being a critical factor in surgical intervention quality. It serves as the building block to facilitate surgical robot navigation and surgical education for the next generation of operating intelligence. Although existing methods have achieved accurate instrument segmentation results, they simultaneously generate segmentation masks for all instruments, without the capability to specify a target object and allow an interactive experience. This work explores a new task of Referring Surgical Video Instrument Segmentation (RSVIS), which aims to automatically identify and segment the corresponding surgical instruments based on the given language expression. To achieve this, we devise a novel Video-Instrument Synergistic Network (VIS-Net) to learn both video-level and instrument-level knowledge to boost performance, while previous work only used video-level information. Meanwhile, we design a Graph-based Relation-aware Module (GRM) to model the correlation between multi-modal information (i.e., textual description and video frame) to facilitate the extraction of instrument-level information. We are also the first to produce two RSVIS datasets to promote related research. Our method is verified on these datasets, and experimental results exhibit that the VIS-Net can significantly outperform existing state-of-the-art referring segmentation methods. Our code and our datasets will be released upon the publication of this work.

Quantitative Susceptibility Mapping through Model-based Deep Image Prior (MoDIP)

  • paper_url: http://arxiv.org/abs/2308.09467
  • repo_url: None
  • paper_authors: Zhuang Xiong, Yang Gao, Yin Liu, Amir Fazlollahi, Peter Nestor, Feng Liu, Hongfu Sun
  • for: To solve dipole inversion in Quantitative Susceptibility Mapping (QSM) under scan parameters that vary across objects.
  • methods: Proposes MoDIP (Model-based Deep Image Prior), a novel training-free, model-based unsupervised method comprising a small untrained network and a Data Fidelity Optimization (DFO) module.
  • results: Generalizes excellently across scan parameters, is robust on pathological brain QSM with over 32% accuracy improvement over supervised deep learning and traditional iterative methods, and is 33% more computationally efficient and 4 times faster than conventional DIP-based approaches, enabling 3D high-resolution reconstruction in under 4.5 minutes.
    Abstract The data-driven approach of supervised learning methods has limited applicability in solving dipole inversion in Quantitative Susceptibility Mapping (QSM) with varying scan parameters across different objects. To address this generalization issue in supervised QSM methods, we propose a novel training-free model-based unsupervised method called MoDIP (Model-based Deep Image Prior). MoDIP comprises a small, untrained network and a Data Fidelity Optimization (DFO) module. The network converges to an interim state, acting as an implicit prior for image regularization, while the optimization process enforces the physical model of QSM dipole inversion. Experimental results demonstrate MoDIP's excellent generalizability in solving QSM dipole inversion across different scan parameters. It exhibits robustness against pathological brain QSM, achieving over 32% accuracy improvement than supervised deep learning and traditional iterative methods. It is also 33% more computationally efficient and runs 4 times faster than conventional DIP-based approaches, enabling 3D high-resolution image reconstruction in under 4.5 minutes.

Data augmentation and explainability for bias discovery and mitigation in deep learning

  • paper_url: http://arxiv.org/abs/2308.09464
  • repo_url: None
  • paper_authors: Agnieszka Mikołajczyk-Bareła
  • for: To study bias in deep neural networks and propose methods that reduce its influence on model performance.
  • methods: Explainable AI for bias discovery, plus three mitigation approaches: Style Transfer Data Augmentation, Targeted Data Augmentations, and Attribution Feedback.
  • results: These methods reduce the influence of bias on model performance and improve model accuracy.
    Abstract This dissertation explores the impact of bias in deep neural networks and presents methods for reducing its influence on model performance. The first part begins by categorizing and describing potential sources of bias and errors in data and models, with a particular focus on bias in machine learning pipelines. The next chapter outlines a taxonomy and methods of Explainable AI as a way to justify predictions and control and improve the model. Then, as an example of a laborious manual data inspection and bias discovery process, a skin lesion dataset is manually examined. A Global Explanation for the Bias Identification method is proposed as an alternative semi-automatic approach to manual data exploration for discovering potential biases in data. Relevant numerical methods and metrics are discussed for assessing the effects of the identified biases on the model. Whereas identifying errors and bias is critical, improving the model and reducing the number of flaws in the future is an absolute priority. Hence, the second part of the thesis focuses on mitigating the influence of bias on ML models. Three approaches are proposed and discussed: Style Transfer Data Augmentation, Targeted Data Augmentations, and Attribution Feedback. Style Transfer Data Augmentation aims to address shape and texture bias by merging a style of a malignant lesion with a conflicting shape of a benign one. Targeted Data Augmentations randomly insert possible biases into all images in the dataset during the training, as a way to make the process random and, thus, destroy spurious correlations. Lastly, Attribution Feedback is used to fine-tune the model to improve its accuracy by eliminating obvious mistakes and teaching it to ignore insignificant input parts via an attribution loss. The goal of these approaches is to reduce the influence of bias on machine learning models, rather than eliminate it entirely.
    摘要 本学位论文研究了偏见对深度神经网络的影响,并提出了减少其对模型性能影响的方法。第一部分首先对数据和模型中潜在的偏见与误差来源进行了分类和描述,重点关注机器学习流程中的偏见。随后一章梳理了可解释人工智能(Explainable AI)的分类体系和方法,以此来论证预测结果并控制、改进模型。接着,作为费时费力的人工数据检查与偏见发现过程的一个示例,对一个皮肤病变数据集进行了人工检查,并提出了一种用于偏见识别的全局解释方法,作为人工数据探索之外的半自动化替代方案,用于发现数据中的潜在偏见;同时讨论了用于评估已识别偏见对模型影响的相关数值方法和度量。识别错误和偏见固然关键,但改进模型并减少未来的缺陷才是首要任务。因此,论文第二部分聚焦于减轻偏见对机器学习模型的影响,提出并讨论了三种方法:Style Transfer Data Augmentation、Targeted Data Augmentations 和 Attribution Feedback。风格迁移数据增强通过将恶性病变的风格与良性病变的相冲突的形状相融合,来应对形状和纹理偏见;定向数据增强在训练期间向数据集中的所有图像随机插入可能的偏见,使该过程随机化,从而破坏虚假相关;归因反馈则通过归因损失对模型进行微调,消除明显错误并教会模型忽略无关紧要的输入部分,以提高准确率。这些方法的目标是减少偏见对机器学习模型的影响,而非将其完全消除。
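The "Targeted Data Augmentations" idea above — inserting a suspected bias artifact into training images at random so its presence no longer correlates with any class — can be illustrated with a minimal sketch. The artifact type (a dark ruler-like bar, a common artifact in skin-lesion photos), its geometry, and the function name are illustrative assumptions, not the dissertation's exact implementation.

```python
# Sketch of a targeted data augmentation: paste a suspected bias artifact
# at a random position so that it appears independently of the class label.
import random
import numpy as np

def insert_random_bias(img: np.ndarray) -> np.ndarray:
    # img: HxWx3 uint8 image; draw a dark horizontal bar at a random row.
    out = img.copy()
    h = out.shape[0]
    bar_h = max(1, h // 20)              # assumed bar thickness
    y = random.randint(0, h - bar_h)     # random position breaks correlation
    out[y:y + bar_h, :, :] = 0
    return out
```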

Accelerated Bayesian imaging by relaxed proximal-point Langevin sampling

  • paper_url: http://arxiv.org/abs/2308.09460
  • repo_url: None
  • paper_authors: Teresa Klatzer, Paul Dobson, Yoann Altmann, Marcelo Pereyra, Jesús María Sanz-Serna, Konstantinos C. Zygalakis
  • for: 这篇论文提出了一种新的加速 proximal Markov chain Monte Carlo 方法,用于在具有凸几何结构的成像反问题中进行 Bayesian 推断。
  • methods: 该方法采用一种随机松弛近端点(Stochastic Relaxed Proximal-Point)迭代,该迭代有两种互补的解释:对于光滑或经 Moreau-Yosida 平滑正则化的模型,它等价于针对目标后验分布的过阻尼 Langevin 扩散的隐式中点离散;对于非光滑模型,它等价于针对后验分布的 Moreau-Yosida 近似的 Langevin 扩散的 Leimkuhler-Matthews 离散。
  • results: 该方法对 Gaussian 目标渐近无偏,并对任意 $\kappa$-强对数凹目标以加速方式收敛(约需 $\sqrt{\kappa}$ 次迭代),同时比基于 Euler-Maruyama 离散的常规 unadjusted Langevin 策略具有显著更低的偏差。在带 Gaussian 和 Poisson 噪声的图像去卷积等一系列实验中,该方法均展现了更快的收敛速度和更好的精度。
    Abstract This paper presents a new accelerated proximal Markov chain Monte Carlo methodology to perform Bayesian inference in imaging inverse problems with an underlying convex geometry. The proposed strategy takes the form of a stochastic relaxed proximal-point iteration that admits two complementary interpretations. For models that are smooth or regularised by Moreau-Yosida smoothing, the algorithm is equivalent to an implicit midpoint discretisation of an overdamped Langevin diffusion targeting the posterior distribution of interest. This discretisation is asymptotically unbiased for Gaussian targets and shown to converge in an accelerated manner for any target that is $\kappa$-strongly log-concave (i.e., requiring in the order of $\sqrt{\kappa}$ iterations to converge, similarly to accelerated optimisation schemes), comparing favorably to [M. Pereyra, L. Vargas Mieles, K.C. Zygalakis, SIAM J. Imaging Sciences, 13, 2 (2020), pp. 905-935] which is only provably accelerated for Gaussian targets and has bias. For models that are not smooth, the algorithm is equivalent to a Leimkuhler-Matthews discretisation of a Langevin diffusion targeting a Moreau-Yosida approximation of the posterior distribution of interest, and hence achieves a significantly lower bias than conventional unadjusted Langevin strategies based on the Euler-Maruyama discretisation. For targets that are $\kappa$-strongly log-concave, the provided non-asymptotic convergence analysis also identifies the optimal time step which maximizes the convergence speed. The proposed methodology is demonstrated through a range of experiments related to image deconvolution with Gaussian and Poisson noise, with assumption-driven and data-driven convex priors.
    摘要 本文提出了一种新的加速 proximal Markov chain Monte Carlo 方法,用于在具有凸几何结构的成像反问题中进行 Bayesian 推断。所提策略采取随机松弛近端点迭代的形式,并有两种互补的解释。对于光滑或经 Moreau-Yosida 平滑正则化的模型,该算法等价于针对目标后验分布的过阻尼 Langevin 扩散的隐式中点离散。该离散对 Gaussian 目标渐近无偏,并被证明对任意 $\kappa$-强对数凹目标以加速方式收敛(即与加速优化方法类似,仅需约 $\sqrt{\kappa}$ 次迭代),优于 [M. Pereyra, L. Vargas Mieles, K.C. Zygalakis, SIAM J. Imaging Sciences, 13, 2 (2020), pp. 905-935] 中仅对 Gaussian 目标可证加速且存在偏差的方法。对于非光滑模型,该算法等价于针对目标后验分布的 Moreau-Yosida 近似的 Langevin 扩散的 Leimkuhler-Matthews 离散,因而比基于 Euler-Maruyama 离散的常规 unadjusted Langevin 策略具有显著更低的偏差。对于 $\kappa$-强对数凹的目标,所给出的非渐近收敛分析还确定了使收敛速度最大化的最优时间步长。本文通过一系列与带 Gaussian 和 Poisson 噪声的图像去卷积相关的实验(采用假设驱动与数据驱动的凸先验)验证了所提方法。
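For readers unfamiliar with the discretisation named in the abstract, the LaTeX sketch below states the overdamped Langevin diffusion and its implicit midpoint scheme; the step size $\delta$ and noise notation $\xi_k$ are our own conventions, not the paper's.

```latex
\[
  dX_t = -\nabla f(X_t)\,dt + \sqrt{2}\,dW_t,
  \qquad \pi(x) \propto e^{-f(x)}.
\]
\[
  X_{k+1} = X_k - \delta\,\nabla f\!\left(\tfrac{X_k + X_{k+1}}{2}\right)
            + \sqrt{2\delta}\,\xi_k,
  \qquad \xi_k \sim \mathcal{N}(0, I).
\]
% Writing M_k = (X_k + X_{k+1})/2, the implicit step can be resolved as
%   M_k = prox_{(\delta/2) f}( X_k + \sqrt{\delta/2}\,\xi_k ),
% followed by X_{k+1} = 2 M_k - X_k, which is how a proximal-point form
% arises for convex f.
```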

Transformer-based Detection of Microorganisms on High-Resolution Petri Dish Images

  • paper_url: http://arxiv.org/abs/2308.09436
  • repo_url: None
  • paper_authors: Nikolas Ebert, Didier Stricker, Oliver Wasenmüller
  • for: 该论文主要面向医疗或制药过程中的连续卫生监测。
  • methods: 该论文提出了一种新的 transformer 变体,即高效全局自注意力机制,以解决当前自动化检测技术面临的主要挑战。
  • results: 在公开可用的 AGAR 数据集上,该网络的准确率超过了当前最先进的方法。此外,通过在 COCO 和 LIVECell 数据集上的进一步实验,我们证明了该方法与具体任务无关的性能。
    Abstract Many medical or pharmaceutical processes have strict guidelines regarding continuous hygiene monitoring. This often involves the labor-intensive task of manually counting microorganisms in Petri dishes by trained personnel. Automation attempts often struggle due to major challenges: significant scaling differences, low separation, low contrast, etc. To address these challenges, we introduce AttnPAFPN, a high-resolution detection pipeline that leverages a novel transformer variation, the efficient-global self-attention mechanism. Our streamlined approach can be easily integrated in almost any multi-scale object detection pipeline. In a comprehensive evaluation on the publicly available AGAR dataset, we demonstrate the superior accuracy of our network over the current state-of-the-art. In order to demonstrate the task-independent performance of our approach, we perform further experiments on COCO and LIVECell datasets.
    摘要 许多医疗或制药过程对连续卫生监测有严格的规定,这通常涉及由受过培训的人员人工计数培养皿中微生物的劳动密集型任务。自动化尝试往往面临重大挑战,如显著的尺度差异、低分离度、低对比度等。为解决这些挑战,我们提出了 AttnPAFPN,一种高分辨率检测流水线,它利用一种新型 transformer 变体,即高效全局自注意力机制。我们的精简方法几乎可以集成到任何多尺度目标检测流水线中。在公开可用的 AGAR 数据集上的全面评估中,我们的网络展现出超越当前最先进方法的精度。为了证明该方法与具体任务无关,我们还在 COCO 和 LIVECell 数据集上进行了进一步实验。

Can ultrasound confidence maps predict sonographers’ labeling variability?

  • paper_url: http://arxiv.org/abs/2308.09433
  • repo_url: None
  • paper_authors: Vanessa Gonzalez Duque, Leonhard Zirus, Yordanka Velikova, Nassir Navab, Diana Mateus
  • for: 这篇论文提出了一种新的深度学习分割方法,使分割网络能够考虑超声医师标注中的注意力与不确定性。
  • methods: 该方法利用超声图像的置信度图(confidence map, CM),帮助深度学习分割网络生成与专家标注变异性相近、更加可信的预测结果;CM 可直接由图像预先计算,计算开销极小。
  • results: 研究结果表明,使用超声 CM 可以提高 Dice 分数、改善 Hausdorff 距离和平均表面距离,并减少孤立像素预测。此外,研究还发现,超声 CM 能更好地惩罚真实标注中的不确定区域,从而改善有问题的插值。
    Abstract Measuring cross-sectional areas in ultrasound images is a standard tool to evaluate disease progress or treatment response. Often addressed today with supervised deep-learning segmentation approaches, existing solutions highly depend upon the quality of experts' annotations. However, the annotation quality in ultrasound is anisotropic and position-variant due to the inherent physical imaging principles, including attenuation, shadows, and missing boundaries, commonly exacerbated with depth. This work proposes a novel approach that guides ultrasound segmentation networks to account for sonographers' uncertainties and generate predictions with variability similar to the experts. We claim that realistic variability can reduce overconfident predictions and improve physicians' acceptance of deep-learning cross-sectional segmentation solutions. Our method provides CM's certainty for each pixel for minimal computational overhead as it can be precalculated directly from the image. We show that there is a correlation between low values in the confidence maps and expert's label uncertainty. Therefore, we propose to give the confidence maps as additional information to the networks. We study the effect of the proposed use of ultrasound CMs in combination with four state-of-the-art neural networks and in two configurations: as a second input channel and as part of the loss. We evaluate our method on 3D ultrasound datasets of the thyroid and lower limb muscles. Our results show ultrasound CMs increase the Dice score, improve the Hausdorff and Average Surface Distances, and decrease the number of isolated pixel predictions. Furthermore, our findings suggest that ultrasound CMs improve the penalization of uncertain areas in the ground truth data, thereby improving problematic interpolations. Our code and example data will be made public at https://github.com/IFL-CAMP/Confidence-segmentation.
    摘要 测量超声图像中的横截面面积是评估疾病进展或治疗反应的标准工具。如今这一问题通常采用有监督的深度学习分割方法来解决,而现有方案高度依赖专家标注的质量。然而,由于超声固有的物理成像原理(包括衰减、声影和边界缺失,且通常随深度加剧),超声标注质量呈各向异性且随位置变化。本文提出了一种新方法,引导超声分割网络考虑超声医师的不确定性,并生成与专家相近的、带有变异性的预测。我们认为,真实的变异性可以减少过于自信的预测,提高医生对深度学习横截面分割方案的接受度。我们的方法可为每个像素提供 CM 置信度,且计算开销极小,因为它可以直接由图像预先计算。我们发现置信度图中的低值与专家标注的不确定性存在相关性,因此提议将置信度图作为附加信息提供给网络。我们研究了将超声 CM 与四种最先进神经网络结合使用的效果,并采用两种配置:作为第二个输入通道,以及作为损失的一部分。我们在甲状腺和下肢肌肉的 3D 超声数据集上评估了该方法。结果显示,超声 CM 可以提高 Dice 分数、改善 Hausdorff 距离和平均表面距离,并减少孤立像素预测。此外,我们的发现表明,超声 CM 能更好地惩罚真实标注数据中的不确定区域,从而改善有问题的插值。我们的代码和示例数据将在 https://github.com/IFL-CAMP/Confidence-segmentation 公开。
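A minimal sketch of the two configurations described above — the confidence map (CM) as a second input channel, and as a pixel-wise weight inside the loss. Tensor shapes, the weighting scheme, and function names are illustrative assumptions, not the authors' exact design.

```python
# Sketch: using an ultrasound confidence map (CM) with a segmentation net.
# Assumed shapes: image and cm are (B, 1, H, W); logits are (B, C, H, W);
# target is (B, H, W) with integer class indices.
import torch
import torch.nn.functional as F

def forward_with_cm(net, image, cm):
    # Configuration 1: CM concatenated as a second input channel.
    return net(torch.cat([image, cm], dim=1))

def cm_weighted_ce(logits, target, cm, eps=1e-6):
    # Configuration 2: CM inside the loss -- down-weight pixels where the
    # ultrasound signal (and hence the ground truth) is unreliable.
    ce = F.cross_entropy(logits, target, reduction="none")  # (B, H, W)
    w = cm.squeeze(1).clamp(min=eps)
    return (w * ce).sum() / w.sum()
```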

Self-Supervised Single-Image Deconvolution with Siamese Neural Networks

  • paper_url: http://arxiv.org/abs/2308.09426
  • repo_url: None
  • paper_authors: Mikhail Papkov, Kaupo Palo, Leopold Parts
  • for: 这篇论文旨在提出一种基于深度学习的自监督单图像去卷积方法,以提升 3D 显微图像的清晰度。
  • methods: 该方法采用自监督盲点(blind-spot)神经网络,将已知的点扩散函数纳入端到端训练,直接从数据中学习噪声特性而无需手动设置参数;并利用快速傅里叶变换卷积加速训练,同时引入孪生不变性损失(Siamese invariance loss)。
  • results: 实验结果表明,改进后的框架能够高效地完成 3D 显微图像去卷积任务,并超越此前采用已知点扩散函数的最先进去卷积方法。
    Abstract Inverse problems in image reconstruction are fundamentally complicated by unknown noise properties. Classical iterative deconvolution approaches amplify noise and require careful parameter selection for an optimal trade-off between sharpness and grain. Deep learning methods allow for flexible parametrization of the noise and learning its properties directly from the data. Recently, self-supervised blind-spot neural networks were successfully adopted for image deconvolution by including a known point-spread function in the end-to-end training. However, their practical application has been limited to 2D images in the biomedical domain because it implies large kernels that are poorly optimized. We tackle this problem with Fast Fourier Transform convolutions that provide training speed-up in 3D microscopy deconvolution tasks. Further, we propose to adopt a Siamese invariance loss for deconvolution and empirically identify its optimal position in the neural network between blind-spot and full image branches. The experimental results show that our improved framework outperforms the previous state-of-the-art deconvolution methods with a known point spread function.
    摘要 图像重建中的反问题因噪声特性未知而变得尤为复杂。经典的迭代去卷积方法会放大噪声,且需要仔细选择参数,以在锐度和颗粒感之间取得最佳平衡。深度学习方法可以灵活地参数化噪声,并直接从数据中学习其特性。最近,自监督盲点神经网络通过在端到端训练中引入已知的点扩散函数,被成功用于图像去卷积。然而,其实际应用一直局限于生物医学领域的 2D 图像,因为该方法意味着难以优化的大卷积核。我们使用快速傅里叶变换卷积来解决这一问题,从而加速 3D 显微去卷积任务的训练。此外,我们提议在去卷积中采用孪生不变性损失,并通过实验确定了其在神经网络中盲点分支与完整图像分支之间的最佳位置。实验结果表明,我们改进的框架优于此前采用已知点扩散函数的最先进去卷积方法。
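The speed-up from Fast Fourier Transform convolutions mentioned above comes from replacing a large spatial-domain kernel (such as a 3D PSF) with an element-wise product in frequency space. A minimal sketch, with shapes and the circular-boundary choice as illustrative assumptions:

```python
# Sketch: 3D convolution with a large known PSF via FFT (circular boundary).
# For a volume with N voxels and a kernel with K voxels, this costs
# O(N log N) instead of O(N * K) for direct convolution.
import torch

def fft_convolve3d(volume: torch.Tensor, psf: torch.Tensor) -> torch.Tensor:
    # volume: (D, H, W); psf: (d, h, w) with d <= D, etc. Zero-pad the PSF
    # to the volume size and roll it so its center sits at the origin.
    kernel = torch.zeros_like(volume)
    d, h, w = psf.shape
    kernel[:d, :h, :w] = psf
    kernel = torch.roll(kernel, shifts=(-(d // 2), -(h // 2), -(w // 2)),
                        dims=(0, 1, 2))
    out = torch.fft.ifftn(torch.fft.fftn(volume) * torch.fft.fftn(kernel))
    return out.real
```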

MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection

  • paper_url: http://arxiv.org/abs/2308.09421
  • repo_url: https://github.com/cskkxjk/mononerd
  • paper_authors: Junkai Xu, Liang Peng, Haoran Cheng, Hao Li, Wei Qian, Ke Li, Wenxiao Wang, Deng Cai
  • for: 提升单目 3D 检测器的性能,使其能够更好地检测远距离和被遮挡的物体。
  • methods: 使用有符号距离函数(SDF)对场景建模以产生稠密的 3D 表示,将这些表示视作神经辐射场(NeRF),并使用体渲染恢复 RGB 图像和深度图。
  • results: 在 KITTI-3D 基准和 Waymo Open Dataset 上进行了广泛的实验,验证了 MonoNeRD 的有效性。
    Abstract In the field of monocular 3D detection, it is common practice to utilize scene geometric clues to enhance the detector's performance. However, many existing works adopt these clues explicitly such as estimating a depth map and back-projecting it into 3D space. This explicit methodology induces sparsity in 3D representations due to the increased dimensionality from 2D to 3D, and leads to substantial information loss, especially for distant and occluded objects. To alleviate this issue, we propose MonoNeRD, a novel detection framework that can infer dense 3D geometry and occupancy. Specifically, we model scenes with Signed Distance Functions (SDF), facilitating the production of dense 3D representations. We treat these representations as Neural Radiance Fields (NeRF) and then employ volume rendering to recover RGB images and depth maps. To the best of our knowledge, this work is the first to introduce volume rendering for M3D, and demonstrates the potential of implicit reconstruction for image-based 3D perception. Extensive experiments conducted on the KITTI-3D benchmark and Waymo Open Dataset demonstrate the effectiveness of MonoNeRD. Codes are available at https://github.com/cskkxjk/MonoNeRD.
    摘要 在单目 3D 检测领域,通常会利用场景几何线索来提升检测器的性能。然而,许多现有工作显式地使用这些线索,例如估计深度图并将其反投影到 3D 空间。这种显式方法会由于从 2D 到 3D 的维度提升而导致 3D 表示的稀疏性,并造成大量信息损失,对远距离和被遮挡的物体尤为严重。为了解决这一问题,我们提出了 MonoNeRD,一种能够推断稠密 3D 几何和占据的全新检测框架。具体来说,我们使用有符号距离函数(SDF)对场景建模,以便产生稠密的 3D 表示;我们将这些表示视作神经辐射场(NeRF),然后使用体渲染恢复 RGB 图像和深度图。据我们所知,这是首个将体渲染引入单目 3D 检测(M3D)的工作,并展示了隐式重建在基于图像的 3D 感知中的潜力。在 KITTI-3D 基准和 Waymo Open Dataset 上进行的大量实验验证了 MonoNeRD 的有效性。代码可以在 https://github.com/cskkxjk/MonoNeRD 中找到。
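For reference, the standard volume rendering quadrature that NeRF-style pipelines (including the one above) use to turn per-sample densities $\sigma_i$ and colors $\mathbf{c}_i$ along a ray into a pixel value; the discretisation below is the usual NeRF formulation, not anything specific to MonoNeRD.

```latex
\[
  \hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i\,\bigl(1 - e^{-\sigma_i \delta_i}\bigr)\,\mathbf{c}_i,
  \qquad
  T_i = \exp\!\Bigl(-\sum_{j<i} \sigma_j \delta_j\Bigr),
\]
% where \delta_i is the distance between adjacent samples along the ray;
% replacing \mathbf{c}_i with the sample depth yields the rendered depth map.
```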

Metadata Improves Segmentation Through Multitasking Elicitation

  • paper_url: http://arxiv.org/abs/2308.09411
  • repo_url: None
  • paper_authors: Iaroslav Plutenko, Mikhail Papkov, Kaupo Palo, Leopold Parts, Dmytro Fishman
  • for: 这篇论文旨在利用元数据(metadata)提升深度学习语义分割的性能。
  • methods: 论文通过通道调制机制将元数据作为卷积网络的附加输入,以改善分割结果。
  • results: 论文表明,使用元数据可以提高分割结果,而且这种方法实现成本低,可以作为轻量级附加组件与现有流行模型结合使用。
    Abstract Metainformation is a common companion to biomedical images. However, this potentially powerful additional source of signal from image acquisition has had limited use in deep learning methods, for semantic segmentation in particular. Here, we incorporate metadata by employing a channel modulation mechanism in convolutional networks and study its effect on semantic segmentation tasks. We demonstrate that metadata as additional input to a convolutional network can improve segmentation results while being inexpensive in implementation as a nimble add-on to popular models. We hypothesize that this benefit of metadata can be attributed to facilitating multitask switching. This aspect of metadata-driven systems is explored and discussed in detail.
    摘要 元信息是生物医学图像的常见伴随数据。然而,这一来自图像采集过程、潜在强大的附加信号源在深度学习方法(尤其是语义分割)中的应用仍然有限。在本文中,我们通过在卷积网络中引入通道调制机制来纳入元数据,并研究其对语义分割任务的影响。我们证明,将元数据作为卷积网络的附加输入可以改善分割结果,同时实现成本低廉,可作为流行模型的轻量级附加组件。我们推测元数据带来的这一收益可归因于其促进了多任务切换,并对元数据驱动系统的这一特性进行了详细的探讨和讨论。
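The "channel modulation mechanism" can be illustrated with a FiLM-style sketch: a small MLP maps the metadata vector to per-channel scale and shift factors that modulate a convolutional feature map. The architecture below is an illustrative assumption of one common way to implement such modulation, not necessarily the authors' exact design.

```python
# Sketch: metadata-conditioned channel modulation (FiLM-style).
import torch
import torch.nn as nn

class ChannelModulation(nn.Module):
    def __init__(self, meta_dim: int, channels: int):
        super().__init__()
        # MLP maps metadata to per-channel scale (gamma) and shift (beta).
        self.mlp = nn.Sequential(
            nn.Linear(meta_dim, channels * 2), nn.ReLU(),
            nn.Linear(channels * 2, channels * 2))

    def forward(self, feat: torch.Tensor, meta: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) feature map; meta: (B, meta_dim) metadata vector.
        gamma, beta = self.mlp(meta).chunk(2, dim=1)
        return feat * (1 + gamma[:, :, None, None]) + beta[:, :, None, None]
```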

Generalizable Decision Boundaries: Dualistic Meta-Learning for Open Set Domain Generalization

  • paper_url: http://arxiv.org/abs/2308.09391
  • repo_url: https://github.com/zzwdx/medic
  • paper_authors: Xiran Wang, Jian Zhang, Lei Qi, Yinghuan Shi
  • for: 本研究旨在解决源领域与目标领域之间的领域偏移问题,尤其是当源领域和目标领域包含不同类别时的开集场景。
  • methods: 一种直观做法是使用多个一对多(one-vs-all)分类器来为每个类别定义决策边界,并将离群样本拒绝为未知类。然而,正负样本之间往往存在显著的类别不平衡,导致决策边界偏向正样本,从而使目标领域中的已知类样本被误分类。本研究提出了一种基于元学习的框架,即联合领域-类别匹配的二元元学习(MEDIC),它同时考虑跨领域与跨类别划分的梯度匹配,以找到对所有任务都均衡且可泛化的决策边界。
  • results: 实验结果表明,MEDIC 不仅在开集场景下超越了先前的方法,同时也保持了具有竞争力的闭集泛化能力。代码可在 https://github.com/zzwdx/MEDIC 获取。
    Abstract Domain generalization (DG) is proposed to deal with the issue of domain shift, which occurs when statistical differences exist between source and target domains. However, most current methods do not account for a common realistic scenario where the source and target domains have different classes. To overcome this deficiency, open set domain generalization (OSDG) then emerges as a more practical setting to recognize unseen classes in unseen domains. An intuitive approach is to use multiple one-vs-all classifiers to define decision boundaries for each class and reject the outliers as unknown. However, the significant class imbalance between positive and negative samples often causes the boundaries biased towards positive ones, resulting in misclassification for known samples in the unseen target domain. In this paper, we propose a novel meta-learning-based framework called dualistic MEta-learning with joint DomaIn-Class matching (MEDIC), which considers gradient matching towards inter-domain and inter-class splits simultaneously to find a generalizable boundary balanced for all tasks. Experimental results demonstrate that MEDIC not only outperforms previous methods in open set scenarios, but also maintains competitive close set generalization ability at the same time. Our code is available at https://github.com/zzwdx/MEDIC.
    摘要 领域泛化(DG)旨在解决领域偏移问题,即源领域与目标领域之间存在统计差异的情形。然而,现有的大多数方法没有考虑一种常见的现实场景:源领域和目标领域包含不同的类别。为弥补这一不足,开集领域泛化(OSDG)作为一种更贴近实际的设定应运而生,用于在未见过的领域中识别未见过的类别。一种直观的方法是使用多个一对多分类器为每个类别定义决策边界,并将离群样本拒绝为未知类。然而,正负样本之间显著的类别不平衡常常导致决策边界偏向正样本,从而使未见目标领域中的已知类样本被误分类。在本文中,我们提出了一种新的基于元学习的框架,即联合领域-类别匹配的二元元学习(MEDIC)。该框架同时考虑面向跨领域与跨类别划分的梯度匹配,以找到对所有任务都均衡且可泛化的决策边界。实验结果表明,MEDIC 不仅在开集场景中优于先前的方法,同时还保持了具有竞争力的闭集泛化能力。我们的代码可以在 https://github.com/zzwdx/MEDIC 上找到。
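The one-vs-all formulation that the paper builds on can be sketched as follows: one binary head per known class, with a sample rejected as "unknown" when no head is sufficiently confident. The threshold value and the `-1` convention below are illustrative assumptions.

```python
# Sketch: open-set prediction with K one-vs-all binary heads.
import torch

def open_set_predict(logits: torch.Tensor, threshold: float = 0.5):
    # logits: (B, K) raw scores, one per known class (one-vs-all heads).
    probs = torch.sigmoid(logits)
    conf, pred = probs.max(dim=1)
    pred[conf < threshold] = -1  # -1 denotes rejection as "unknown"
    return pred
```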

Diffusion Models for Image Restoration and Enhancement – A Comprehensive Survey

  • paper_url: http://arxiv.org/abs/2308.09388
  • repo_url: https://github.com/lixinustc/awesome-diffusion-model-for-image-processing
  • paper_authors: Xin Li, Yulin Ren, Xin Jin, Cuiling Lan, Xingrui Wang, Wenjun Zeng, Xinchao Wang, Zhibo Chen
  • for: 本文旨在为图像修复(IR)领域提供一份总结,尤其是在使用扩散模型(Diffusion Model)进行图像修复方面。
  • methods: 本文总结了最新的基于扩散模型的图像修复方法,涵盖学习范式、条件策略、框架设计、建模策略等方面,并对现有方法进行了评估。
  • results: 本文对现有基于扩散模型的图像修复方法进行了全面的总结,并提出了五个有潜力且具有挑战性的未来研究方向,包括采样效率、模型压缩、退化模拟与估计、退化不变学习以及框架设计。
    Abstract Image restoration (IR) has been an indispensable and challenging task in the low-level vision field, which strives to improve the subjective quality of images distorted by various forms of degradation. Recently, the diffusion model has achieved significant advancements in the visual generation of AIGC, thereby raising an intuitive question, "whether diffusion model can boost image restoration". To answer this, some pioneering studies attempt to integrate diffusion models into the image restoration task, resulting in superior performances than previous GAN-based methods. Despite that, a comprehensive and enlightening survey on diffusion model-based image restoration remains scarce. In this paper, we are the first to present a comprehensive review of recent diffusion model-based methods on image restoration, encompassing the learning paradigm, conditional strategy, framework design, modeling strategy, and evaluation. Concretely, we first introduce the background of the diffusion model briefly and then present two prevalent workflows that exploit diffusion models in image restoration. Subsequently, we classify and emphasize the innovative designs using diffusion models for both IR and blind/real-world IR, intending to inspire future development. To evaluate existing methods thoroughly, we summarize the commonly-used dataset, implementation details, and evaluation metrics. Additionally, we present the objective comparison for open-sourced methods across three tasks, including image super-resolution, deblurring, and inpainting. Ultimately, informed by the limitations in existing works, we propose five potential and challenging directions for the future research of diffusion model-based IR, including sampling efficiency, model compression, distortion simulation and estimation, distortion invariant learning, and framework design.
    摘要 图像修复(IR)一直是底层视觉领域中不可或缺且具有挑战性的任务,旨在改善受各种退化影响的图像的主观质量。最近,扩散模型在 AIGC 的视觉生成方面取得了显著进展,由此引出一个直观的问题:"扩散模型能否提升图像修复?"为回答这一问题,一些开创性研究尝试将扩散模型引入图像修复任务,取得了优于以往基于 GAN 的方法的性能。尽管如此,关于基于扩散模型的图像修复,全面而有启发性的综述仍然匮乏。在本文中,我们首次对近期基于扩散模型的图像修复方法进行了全面回顾,涵盖学习范式、条件策略、框架设计、建模策略和评估方法。具体而言,我们首先简要介绍扩散模型的背景,然后介绍两种在图像修复中利用扩散模型的主流工作流程。随后,我们对扩散模型在图像修复及盲/真实场景图像修复中的创新设计进行了分类和重点阐述,以期启发未来的发展。为了全面评估现有方法,我们总结了常用的数据集、实现细节和评估指标,并对开源方法在图像超分辨率、去模糊和修补三项任务上进行了客观比较。最后,基于现有工作的局限性,我们提出了基于扩散模型的图像修复未来研究的五个有潜力且具有挑战性的方向,包括采样效率、模型压缩、退化模拟与估计、退化不变学习以及框架设计。

DReg-NeRF: Deep Registration for Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2308.09386
  • repo_url: https://github.com/aibluefisher/dreg-nerf
  • paper_authors: Yu Chen, Gim Hee Lee
  • for: 本研究的目的是在以物体为中心的场景下解决多个 NeRF 之间的配准问题,且无需人工标注关键点。
  • methods: 我们提出了基于 transformer 架构的 DReg-NeRF 方法:首先从 NeRF 的占据网格中提取特征,然后使用自注意力和交叉注意力层学习成对 NeRF 块之间的关系。与现有的点云配准方法不同,我们的方法不需要任何人工标注,解耦的对应关系由表面场监督,无需任何真实重叠标签。
  • results: 在测试集上,与最先进的点云配准方法相比,我们的方法以较大优势胜出,平均 $\text{RPE}$ 为 9.67 度,平均 $\text{RTE}$ 为 0.038。
    Abstract Although Neural Radiance Fields (NeRF) is popular in the computer vision community recently, registering multiple NeRFs has yet to gain much attention. Unlike the existing work, NeRF2NeRF, which is based on traditional optimization methods and needs human annotated keypoints, we propose DReg-NeRF to solve the NeRF registration problem on object-centric scenes without human intervention. After training NeRF models, our DReg-NeRF first extracts features from the occupancy grid in NeRF. Subsequently, our DReg-NeRF utilizes a transformer architecture with self-attention and cross-attention layers to learn the relations between pairwise NeRF blocks. In contrast to state-of-the-art (SOTA) point cloud registration methods, the decoupled correspondences are supervised by surface fields without any ground truth overlapping labels. We construct a novel view synthesis dataset with 1,700+ 3D objects obtained from Objaverse to train our network. When evaluated on the test set, our proposed method beats the SOTA point cloud registration methods by a large margin, with a mean $\text{RPE}=9.67^{\circ}$ and a mean $\text{RTE}=0.038$. Our code is available at https://github.com/AIBluefisher/DReg-NeRF.
    摘要 尽管神经辐射场(NeRF)最近在计算机视觉领域广受欢迎,但多个 NeRF 之间的配准问题尚未得到足够关注。与基于传统优化方法、且需要人工标注关键点的现有工作 NeRF2NeRF 不同,我们提出了 DReg-NeRF,用于在以物体为中心的场景下解决 NeRF 配准问题,且无需人工干预。在训练好 NeRF 模型后,我们的 DReg-NeRF 首先从 NeRF 的占据网格中提取特征,随后利用带有自注意力和交叉注意力层的 transformer 架构学习成对 NeRF 块之间的关系。与最先进的点云配准方法不同,解耦的对应关系由表面场监督,无需任何真实重叠标签。我们利用从 Objaverse 获得的 1700 多个 3D 物体构建了一个新的视图合成数据集来训练网络。在测试集上评估时,我们提出的方法以较大优势超越最先进的点云配准方法,平均 $\text{RPE}=9.67^{\circ}$,平均 $\text{RTE}=0.038$。代码可在 https://github.com/AIBluefisher/DReg-NeRF 获取。

Label-Free Event-based Object Recognition via Joint Learning with Image Reconstruction from Events

  • paper_url: http://arxiv.org/abs/2308.09383
  • repo_url: None
  • paper_authors: Hoonhee Cho, Hyeonseong Kim, Yujeong Chae, Kuk-Jin Yoon
  • for: 本研究旨在在没有类别标签和成对图像的情况下实现基于事件的物体识别。
  • methods: 我们提出了一种以互补方式联合进行物体识别与图像重建的框架:先从事件重建图像,再通过对比语言-图像预训练(CLIP)进行物体识别;并利用类别引导的吸引损失和类别无关的排斥损失,将预测类别的文本特征与重建图像的视觉特征桥接起来。我们还引入了可靠的数据采样策略和局部-全局重建一致性,以促进两个任务的联合学习。
  • results: 大量实验表明我们的方法在预测精度和重建质量方面均具有明显优势,并可扩展到零样本物体识别。我们的项目代码可在 \url{https://github.com/Chohoonhee/Ev-LaFOR} 获取。
    Abstract Recognizing objects from sparse and noisy events becomes extremely difficult when paired images and category labels do not exist. In this paper, we study label-free event-based object recognition where category labels and paired images are not available. To this end, we propose a joint formulation of object recognition and image reconstruction in a complementary manner. Our method first reconstructs images from events and performs object recognition through Contrastive Language-Image Pre-training (CLIP), enabling better recognition through a rich context of images. Since the category information is essential in reconstructing images, we propose category-guided attraction loss and category-agnostic repulsion loss to bridge the textual features of predicted categories and the visual features of reconstructed images using CLIP. Moreover, we introduce a reliable data sampling strategy and local-global reconstruction consistency to boost joint learning of two tasks. To enhance the accuracy of prediction and quality of reconstruction, we also propose a prototype-based approach using unpaired images. Extensive experiments demonstrate the superiority of our method and its extensibility for zero-shot object recognition. Our project code is available at \url{https://github.com/Chohoonhee/Ev-LaFOR}.
    摘要 当不存在成对图像和类别标签时,从稀疏且带噪的事件中识别物体变得极为困难。在本文中,我们研究无标签的基于事件的物体识别,其中类别标签和成对图像均不可用。为此,我们提出了一种以互补方式联合进行物体识别与图像重建的框架。我们的方法首先从事件重建图像,然后通过对比语言-图像预训练(CLIP)进行物体识别,借助丰富的图像上下文获得更好的识别效果。由于类别信息对图像重建至关重要,我们提出了类别引导的吸引损失和类别无关的排斥损失,利用 CLIP 将预测类别的文本特征与重建图像的视觉特征桥接起来。此外,我们引入了可靠的数据采样策略和局部-全局重建一致性,以促进两个任务的联合学习。为了进一步提高预测精度和重建质量,我们还提出了一种利用非成对图像的原型方法。大量实验证明了我们方法的优越性及其在零样本物体识别上的可扩展性。我们的项目代码可在 \url{https://github.com/Chohoonhee/Ev-LaFOR} 获取。

Image Processing and Machine Learning for Hyperspectral Unmixing: An Overview and the HySUPP Python Package

  • paper_url: http://arxiv.org/abs/2308.09375
  • repo_url: https://github.com/behnoodrasti/hysupp
  • paper_authors: Behnood Rasti, Alexandre Zouaoui, Julien Mairal, Jocelyn Chanussot
  • for: 本文对先进与传统的高光谱解混方法进行了综述和批判性比较,并对监督、半监督和盲(无监督)三类线性解混方法的性能进行了对比。
  • methods: 本文梳理了图像处理和机器学习在解混中的应用,涵盖监督、半监督和盲线性解混方法,并在三个仿真数据集和两个真实数据集上比较了各类方法的性能。
  • results: 实验结果揭示了不同解混类别在不同解混场景下的各自优势;同时提供了一个开源的 Python 软件包,可在 https://github.com/BehnoodRasti/HySUPP 获取,以复现结果。
    Abstract Spectral pixels are often a mixture of the pure spectra of the materials, called endmembers, due to the low spatial resolution of hyperspectral sensors, double scattering, and intimate mixtures of materials in the scenes. Unmixing estimates the fractional abundances of the endmembers within the pixel. Depending on the prior knowledge of endmembers, linear unmixing can be divided into three main groups: supervised, semi-supervised, and unsupervised (blind) linear unmixing. Advances in Image processing and machine learning substantially affected unmixing. This paper provides an overview of advanced and conventional unmixing approaches. Additionally, we draw a critical comparison between advanced and conventional techniques from the three categories. We compare the performance of the unmixing techniques on three simulated and two real datasets. The experimental results reveal the advantages of different unmixing categories for different unmixing scenarios. Moreover, we provide an open-source Python-based package available at https://github.com/BehnoodRasti/HySUPP to reproduce the results.
    摘要 由于高光谱传感器空间分辨率较低、存在二次散射以及场景中材料的紧密混合,光谱像素往往是多种材料纯光谱(称为端元)的混合。解混旨在估计像素内各端元的丰度比例。根据对端元的先验知识,线性解混可分为三大类:监督、半监督和无监督(盲)线性解混。图像处理和机器学习的进展对解混产生了深远影响。本文对先进与传统的解混方法进行了综述,并对上述三类中的先进与传统技术进行了批判性比较。我们在三个仿真数据集和两个真实数据集上比较了各解混技术的性能。实验结果揭示了不同解混类别在不同解混场景下的各自优势。此外,我们提供了一个开源的 Python 软件包,可在 https://github.com/BehnoodRasti/HySUPP 获取,以复现结果。
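The linear mixing model underlying the survey writes each pixel spectrum as y = E a + n, where the columns of E are endmember spectra and the abundances a are non-negative and sum to one. A minimal supervised-unmixing sketch using the standard sum-to-one augmentation trick (the function name and the delta weight are illustrative assumptions):

```python
# Sketch: fully-constrained least squares (FCLS) abundance estimation.
import numpy as np
from scipy.optimize import nnls

def fcls_abundances(y: np.ndarray, E: np.ndarray, delta: float = 1e3):
    # y: (bands,) pixel spectrum; E: (bands, endmembers) endmember matrix.
    # Append a row of ones (scaled by delta) to softly enforce sum-to-one,
    # while nnls enforces non-negativity of the abundances.
    E_aug = np.vstack([E, delta * np.ones((1, E.shape[1]))])
    y_aug = np.append(y, delta)
    a, _ = nnls(E_aug, y_aug)
    return a
```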

Single Frame Semantic Segmentation Using Multi-Modal Spherical Images

  • paper_url: http://arxiv.org/abs/2308.09369
  • repo_url: https://github.com/sguttikon/SFSS-MMSI
  • paper_authors: Suresh Guttikonda, Jason Rambach
  • for: 这篇论文旨在探讨多模态融合对全向(全景)场景理解的潜力。
  • methods: 该论文提出了一种基于 transformer 的跨模态融合架构,并利用畸变感知模块来应对等距柱状投影带来的极端物体形变和全景畸变。
  • results: 该方法在三个室内全景视图数据集上进行了广泛测试,取得了当前最佳的 mIoU 性能:在 Stanford2D3DS(RGB-HHA)上为 60.60%,在 Structured3D(RGB-D-N)上为 71.97%,在 Matterport3D(RGB-D)上为 35.92%。
    Abstract In recent years, the research community has shown a lot of interest to panoramic images that offer a 360-degree directional perspective. Multiple data modalities can be fed, and complimentary characteristics can be utilized for more robust and rich scene interpretation based on semantic segmentation, to fully realize the potential. Existing research, however, mostly concentrated on pinhole RGB-X semantic segmentation. In this study, we propose a transformer-based cross-modal fusion architecture to bridge the gap between multi-modal fusion and omnidirectional scene perception. We employ distortion-aware modules to address extreme object deformations and panorama distortions that result from equirectangular representation. Additionally, we conduct cross-modal interactions for feature rectification and information exchange before merging the features in order to communicate long-range contexts for bi-modal and tri-modal feature streams. In thorough tests using combinations of four different modality types in three indoor panoramic-view datasets, our technique achieved state-of-the-art mIoU performance: 60.60% on Stanford2D3DS (RGB-HHA), 71.97% Structured3D (RGB-D-N), and 35.92% Matterport3D (RGB-D). We plan to release all codes and trained models soon.
    摘要 近年来,研究社区对提供 360 度方向视角的全景图像表现出了浓厚兴趣。通过输入多种数据模态并利用其互补特性,可以基于语义分割获得更鲁棒、更丰富的场景理解,从而充分发挥其潜力。然而,现有研究主要集中在针孔相机的 RGB-X 语义分割上。在本研究中,我们提出了一种基于 transformer 的跨模态融合架构,以弥合多模态融合与全向场景感知之间的差距。我们采用畸变感知模块来应对等距柱状表示带来的极端物体形变和全景畸变。此外,在融合特征之前,我们进行跨模态交互以完成特征校正和信息交换,从而为双模态和三模态特征流传递长程上下文。在三个室内全景视图数据集上对四种不同模态类型的组合进行的全面测试中,我们的技术取得了当前最佳的 mIoU 性能:在 Stanford2D3DS(RGB-HHA)上为 60.60%,在 Structured3D(RGB-D-N)上为 71.97%,在 Matterport3D(RGB-D)上为 35.92%。我们计划不久后公开所有代码和训练好的模型。

Overlap Bias Matching is Necessary for Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2308.09364
  • repo_url: None
  • paper_authors: Pengcheng Shi, Jie Zhang, Haozhe Cheng, Junyang Wang, Yiyang Zhou, Chenlin Zhao, Jihua Zhu
  • for: 本研究旨在提出一种无监督的点云配准方法,以解决实际点云配准中重叠率较低的问题。
  • methods: 该方法基于重叠偏置匹配网络(OBMNet),其包含两个核心组件:重叠采样模块和偏置预测模块,分别用于捕捉点云重叠区域的分布以及预测点云公共结构的偏置系数。随后,我们将 OBMM 与邻域图匹配模块结合,通过精确融合邻域内各点的匹配得分来鲁棒地挖掘对应关系,从而消除单点特征的歧义。
  • results: 实验结果表明,我们的方法在各类数据集上均显著优于当前最先进的配准方法。
    Abstract Point cloud registration is a fundamental problem in many domains. Practically, the overlap between point clouds to be registered may be relatively small. Most unsupervised methods lack effective initial evaluation of overlap, leading to suboptimal registration accuracy. To address this issue, we propose an unsupervised network Overlap Bias Matching Network (OBMNet) for partial point cloud registration. Specifically, we propose a plug-and-play Overlap Bias Matching Module (OBMM) comprising two integral components, overlap sampling module and bias prediction module. These two components are utilized to capture the distribution of overlapping regions and predict bias coefficients of point cloud common structures, respectively. Then, we integrate OBMM with the neighbor map matching module to robustly identify correspondences by precisely merging matching scores of points within the neighborhood, which addresses the ambiguities in single-point features. OBMNet can maintain efficacy even in pair-wise registration scenarios with low overlap ratios. Experimental results on extensive datasets demonstrate that our approach's performance achieves a significant improvement compared to the state-of-the-art registration approach.
    摘要 点云注册是许多领域的基本问题。实际上,注册点云的重叠部分可能很小。大多数无监督方法缺乏有效的初始评估重叠,导致注册精度下降。为解决这个问题,我们提议一种无监督网络Overlap Bias Matching Network(OBMNet),用于部分点云注册。具体来说,我们提出了一个插件式的Overlap Bias Matching Module(OBMM),包括两个基本组成部分:重叠采样模块和偏好预测模块。这两个组成部分用于捕捉重叠区域的分布和预测点云共同结构的偏好系数,分别。然后,我们将OBMM与邻居地图匹配模块集成,以robustly确定对应关系,解决单点特征之间的混淆。OBMNet可以在对点云注册方式进行对比时保持效率。实验结果表明,我们的方法在评估中达到了与当前注册方法相比较显著的提升。

Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models

  • paper_url: http://arxiv.org/abs/2308.09363
  • repo_url: https://github.com/mlvlab/ovqa
  • paper_authors: Dohwan Ko, Ji Soo Lee, Miso Choi, Jaewon Chu, Jihwan Park, Hyunwoo J. Kim
  • for: 这篇论文旨在提出一个新的基准(benchmark),用于评估视频问答模型的泛化能力。
  • methods: 论文提出了一种基于 GNN 的软 verbalizer,通过聚合相似词的信息来提升模型对罕见和未见答案的预测能力;同时对现有的(闭词表)开放式 VideoQA 模型进行改造,使其进一步考虑罕见和未见答案。
  • results: 实验结果表明,基于 GNN 的软 verbalizer 能进一步提升模型的泛化能力,尤其是对罕见和未见答案;改造后的开放式 VideoQA 基线模型的表现也得到了提升。
    Abstract Video Question Answering (VideoQA) is a challenging task that entails complex multi-modal reasoning. In contrast to multiple-choice VideoQA which aims to predict the answer given several options, the goal of open-ended VideoQA is to answer questions without restricting candidate answers. However, the majority of previous VideoQA models formulate open-ended VideoQA as a classification task to classify the video-question pairs into a fixed answer set, i.e., closed-vocabulary, which contains only frequent answers (e.g., top-1000 answers). This leads the model to be biased toward only frequent answers and fail to generalize on out-of-vocabulary answers. We hence propose a new benchmark, Open-vocabulary Video Question Answering (OVQA), to measure the generalizability of VideoQA models by considering rare and unseen answers. In addition, in order to improve the model's generalization power, we introduce a novel GNN-based soft verbalizer that enhances the prediction on rare and unseen answers by aggregating the information from their similar words. For evaluation, we introduce new baselines by modifying the existing (closed-vocabulary) open-ended VideoQA models and improve their performances by further taking into account rare and unseen answers. Our ablation studies and qualitative analyses demonstrate that our GNN-based soft verbalizer further improves the model performance, especially on rare and unseen answers. We hope that our benchmark OVQA can serve as a guide for evaluating the generalizability of VideoQA models and inspire future research. Code is available at https://github.com/mlvlab/OVQA.
    摘要 视频问答(VideoQA)是一项涉及复杂多模态推理的挑战性任务。与在给定若干选项中预测答案的多选 VideoQA 不同,开放式 VideoQA 的目标是在不限制候选答案的情况下回答问题。然而,此前的大多数 VideoQA 模型将开放式 VideoQA 表述为分类任务,即将视频-问题对划分到一个固定的答案集合(即闭词表,其中仅包含高频答案,例如前 1000 个答案)。这导致模型偏向高频答案,无法泛化到词表之外的答案。因此,我们提出了一个新的基准——开放词表视频问答(OVQA),通过考虑罕见和未见答案来衡量 VideoQA 模型的泛化能力。此外,为提升模型的泛化能力,我们引入了一种新颖的基于 GNN 的软 verbalizer,通过聚合相似词的信息来增强对罕见和未见答案的预测。在评估方面,我们通过改造现有的(闭词表)开放式 VideoQA 模型引入了新的基线,并通过进一步考虑罕见和未见答案提升了它们的性能。我们的消融研究和定性分析表明,基于 GNN 的软 verbalizer 能进一步提升模型性能,尤其是对罕见和未见答案。我们希望 OVQA 基准能够作为评估 VideoQA 模型泛化能力的指南,并启发未来的研究。代码可在 https://github.com/mlvlab/OVQA 获取。

Multi-scale Target-Aware Framework for Constrained Image Splicing Detection and Localization

  • paper_url: http://arxiv.org/abs/2308.09357
  • repo_url: None
  • paper_authors: Yuxuan Tan, Yuanman Li, Limin Zeng, Jiaxiong Ye, Wei wang, Xia Li
  • for: 研究受约束的图像拼接检测与定位(CISDL)问题,即检测两张可疑图像之间的拼接操作,并在两张图像上定位拼接区域。
  • methods: 我们提出了一种多尺度目标感知框架,将特征提取与相关性匹配耦合在统一的流水线中联合处理,并设计了一种目标感知注意力机制,使模型能够联合学习特征并进行相关性匹配。
  • results: 我们的方法在多个基准数据集上的检测性能优于现有方法,并且对尺度变换具有鲁棒性。
    Abstract Constrained image splicing detection and localization (CISDL) is a fundamental task of multimedia forensics, which detects splicing operation between two suspected images and localizes the spliced region on both images. Recent works regard it as a deep matching problem and have made significant progress. However, existing frameworks typically perform feature extraction and correlation matching as separate processes, which may hinder the model's ability to learn discriminative features for matching and can be susceptible to interference from ambiguous background pixels. In this work, we propose a multi-scale target-aware framework to couple feature extraction and correlation matching in a unified pipeline. In contrast to previous methods, we design a target-aware attention mechanism that jointly learns features and performs correlation matching between the probe and donor images. Our approach can effectively promote the collaborative learning of related patches, and perform mutual promotion of feature learning and correlation matching. Additionally, in order to handle scale transformations, we introduce a multi-scale projection method, which can be readily integrated into our target-aware framework that enables the attention process to be conducted between tokens containing information of varying scales. Our experiments demonstrate that our model, which uses a unified pipeline, outperforms state-of-the-art methods on several benchmark datasets and is robust against scale transformations.
    摘要 受约束的图像拼接检测与定位(CISDL)是多媒体取证的一项基础任务,旨在检测两张可疑图像之间的拼接操作,并在两张图像上定位拼接区域。近期的工作将其视为深度匹配问题,并已取得显著进展。然而,现有框架通常将特征提取与相关性匹配作为两个独立的过程来执行,这可能会妨碍模型学习用于匹配的判别性特征,并容易受到含糊背景像素的干扰。在本工作中,我们提出了一种多尺度目标感知框架,将特征提取与相关性匹配耦合在统一的流水线中。与以往方法不同,我们设计了一种目标感知注意力机制,联合学习特征并在探测图像与供体图像之间进行相关性匹配。我们的方法能够有效促进相关图像块之间的协同学习,实现特征学习与相关性匹配的相互促进。此外,为了处理尺度变换,我们引入了一种多尺度投影方法,它可以方便地集成到我们的目标感知框架中,使注意力过程能够在包含不同尺度信息的 token 之间进行。实验表明,采用统一流水线的模型在多个基准数据集上优于最先进的方法,并对尺度变换具有鲁棒性。

Boosting Few-shot Action Recognition with Graph-guided Hybrid Matching

  • paper_url: http://arxiv.org/abs/2308.09346
  • repo_url: https://github.com/jiazheng-xing/gghm
  • paper_authors: Jiazheng Xing, Mengmeng Wang, Yudi Ruan, Bofan Chen, Yaowei Guo, Boyu Mu, Guang Dai, Jingdong Wang, Yong Liu
  • for: 本文提出了一种新的图引导混合匹配框架(GgHM),用于解决小样本动作识别问题。
  • methods: 本文在类原型构建过程中借助图神经网络的引导来学习面向任务的特征,并显式优化类内与类间特征相关性;随后提出一种混合匹配策略,将帧级匹配与元组级匹配相结合,以便对具有多样风格的视频进行分类;最后还提出了一个可学习的稠密时序建模模块,以增强视频特征的时序表示,为匹配过程奠定更坚实的基础。
  • results: GgHM 在多个小样本数据集上持续优于其他具有挑战性的基线方法,验证了我们方法的有效性。
    Abstract Class prototype construction and matching are core aspects of few-shot action recognition. Previous methods mainly focus on designing spatiotemporal relation modeling modules or complex temporal alignment algorithms. Despite the promising results, they ignored the value of class prototype construction and matching, leading to unsatisfactory performance in recognizing similar categories in every task. In this paper, we propose GgHM, a new framework with Graph-guided Hybrid Matching. Concretely, we learn task-oriented features by the guidance of a graph neural network during class prototype construction, optimizing the intra- and inter-class feature correlation explicitly. Next, we design a hybrid matching strategy, combining frame-level and tuple-level matching to classify videos with multivariate styles. We additionally propose a learnable dense temporal modeling module to enhance the video feature temporal representation to build a more solid foundation for the matching process. GgHM shows consistent improvements over other challenging baselines on several few-shot datasets, demonstrating the effectiveness of our method. The code will be publicly available at https://github.com/jiazheng-xing/GgHM.
    摘要 类原型构建与匹配是小样本动作识别的核心环节。以往的方法主要专注于设计时空关系建模模块或复杂的时间对齐算法。尽管取得了可喜的结果,它们忽视了类原型构建与匹配的价值,导致在每个任务中识别相似类别时表现欠佳。在本文中,我们提出了 GgHM,一个带有图引导混合匹配的新框架。具体而言,我们在类原型构建过程中借助图神经网络的引导学习面向任务的特征,显式地优化类内与类间特征相关性。接着,我们设计了一种混合匹配策略,将帧级匹配与元组级匹配相结合,以便对具有多样风格的视频进行分类。此外,我们还提出了一个可学习的稠密时序建模模块,以增强视频特征的时序表示,为匹配过程构建更坚实的基础。GgHM 在多个小样本数据集上相比其他具有挑战性的基线方法均取得了稳定的提升,证明了我们方法的有效性。代码将在 https://github.com/jiazheng-xing/GgHM 上公开。

Denoising diffusion-based MR to CT image translation enables whole spine vertebral segmentation in 2D and 3D without manual annotations

  • paper_url: http://arxiv.org/abs/2308.09345
  • repo_url: https://github.com/robert-graf/readable-conditional-denoising-diffusion
  • paper_authors: Robert Graf, Joachim Schmitt, Sarah Schlaeger, Hendrik Kristian Möller, Vasiliki Sideri-Lampretsa, Anjany Sekuboyina, Sandro Manuel Krieg, Benedikt Wiestler, Bjoern Menze, Daniel Rueckert, Jan Stefan Kirschke
  • for: This paper aims to develop and evaluate methods for translating spinal MR images to CT images, with a focus on accurately delineating posterior spine structures.
  • methods: The study uses a combination of landmark-based registration and image-to-image translation techniques, including paired and unpaired methods such as Pix2Pix, DDIM, and SynDiff. The authors evaluate the performance of these methods using PSNR and Dice scores.
  • results: The study finds that paired methods and SynDiff exhibit similar translation performance and Dice scores on paired data, while DDIM image mode achieves the highest image quality. The 3D translation methods outperform the 2D approach, providing anatomically accurate segmentations with improved Dice scores and avoiding underprediction of small structures like the spinous process.
    Abstract Background: Automated segmentation of spinal MR images plays a vital role both scientifically and clinically. However, accurately delineating posterior spine structures presents challenges. Methods: This retrospective study, approved by the ethical committee, involved translating T1w and T2w MR image series into CT images in a total of n=263 pairs of CT/MR series. Landmark-based registration was performed to align image pairs. We compared 2D paired (Pix2Pix, denoising diffusion implicit models (DDIM) image mode, DDIM noise mode) and unpaired (contrastive unpaired translation, SynDiff) image-to-image translation using "peak signal to noise ratio" (PSNR) as quality measure. A publicly available segmentation network segmented the synthesized CT datasets, and Dice scores were evaluated on in-house test sets and the "MRSpineSeg Challenge" volumes. The 2D findings were extended to 3D Pix2Pix and DDIM. Results: 2D paired methods and SynDiff exhibited similar translation performance and Dice scores on paired data. DDIM image mode achieved the highest image quality. SynDiff, Pix2Pix, and DDIM image mode demonstrated similar Dice scores (0.77). For craniocaudal axis rotations, at least two landmarks per vertebra were required for registration. The 3D translation outperformed the 2D approach, resulting in improved Dice scores (0.80) and anatomically accurate segmentations in a higher resolution than the original MR image. Conclusion: Two landmarks per vertebra registration enabled paired image-to-image translation from MR to CT and outperformed all unpaired approaches. The 3D techniques provided anatomically correct segmentations, avoiding underprediction of small structures like the spinous process.
    摘要 背景:脊柱 MR 图像的自动分割在科研和临床上都至关重要,然而准确勾画脊柱后部结构存在挑战。方法:这项经伦理委员会批准的回顾性研究共将 n=263 对 CT/MR 序列中的 T1w 和 T2w MR 图像序列转换为 CT 图像,并采用基于标志点的配准来对齐图像对。我们以"峰值信噪比"(PSNR)为质量指标,比较了 2D 成对方法(Pix2Pix、去噪扩散隐式模型(DDIM)图像模式、DDIM 噪声模式)与非成对方法(contrastive unpaired translation、SynDiff)的图像到图像翻译效果。随后用一个公开可用的分割网络对合成的 CT 数据进行分割,并在内部测试集和"MRSpineSeg Challenge"数据上评估 Dice 分数。2D 的结论进一步被扩展到 3D Pix2Pix 和 DDIM。结果:2D 成对方法与 SynDiff 在成对数据上表现出相近的翻译性能和 Dice 分数;DDIM 图像模式取得了最高的图像质量;SynDiff、Pix2Pix 和 DDIM 图像模式的 Dice 分数相近(0.77)。对于头尾轴旋转,每个椎骨至少需要两个标志点才能完成配准。3D 翻译优于 2D 方法,Dice 分数提升至 0.80,并能以高于原始 MR 图像的分辨率获得解剖学上准确的分割。结论:每个椎骨两个标志点的配准使得从 MR 到 CT 的成对图像翻译成为可能,并优于所有非成对方法;3D 技术可提供解剖学上正确的分割,避免了对棘突等小结构的预测不足。
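Since PSNR is the quality measure used throughout the study, a minimal reference implementation may be helpful; how the data range is supplied below is an assumption.

```python
# Sketch: peak signal-to-noise ratio between a reference and a test image.
import numpy as np

def psnr(reference: np.ndarray, test: np.ndarray, data_range: float) -> float:
    # data_range: maximum possible intensity span, e.g. 255 for 8-bit images.
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((data_range ** 2) / mse)
```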

LSCD: A Large-Scale Screen Content Dataset for Video Compression

  • paper_url: http://arxiv.org/abs/2308.09332
  • repo_url: None
  • paper_authors: Yuhao Cheng, Siru Zhang, Yiqiang Yan, Rong Chen, Yun Zhang
  • for: 提供一个大规模屏幕内容数据集(LSCD),用于促进屏幕内容视频压缩领域的研究。
  • methods: 收集并后处理 714 条源序列以构建数据集,对其特性进行分析,并提供包含传统编解码器与基于学习的方法性能的基准。
  • results: 提供了大规模屏幕内容数据集(LSCD)及相应分析,帮助研究人员更好地理解屏幕内容视频的特点,促进基于学习的压缩算法的发展。
    Abstract Multimedia compression allows us to watch videos, see pictures and hear sounds within a limited bandwidth, which helps the flourish of the internet. During the past decades, multimedia compression has achieved great success using hand-craft features and systems. With the development of artificial intelligence and video compression, there emerges a lot of research work related to using the neural network on the video compression task to get rid of the complicated system. Not only producing the advanced algorithms, but researchers also spread the compression to different content, such as User Generated Content(UGC). With the rapid development of mobile devices, screen content videos become an important part of multimedia data. In contrast, we find community lacks a large-scale dataset for screen content video compression, which impedes the fast development of the corresponding learning-based algorithms. In order to fulfill this blank and accelerate the research of this special type of videos, we propose the Large-scale Screen Content Dataset(LSCD), which contains 714 source sequences. Meanwhile, we provide the analysis of the proposed dataset to show some features of screen content videos, which will help researchers have a better understanding of how to explore new algorithms. Besides collecting and post-processing the data to organize the dataset, we also provide a benchmark containing the performance of both traditional codec and learning-based methods.
    摘要 多媒体压缩使我们能够在有限带宽内观看视频、查看图片和收听声音,助力了互联网的繁荣。过去几十年间,多媒体压缩依靠手工设计的特征和系统取得了巨大成功。随着人工智能和视频压缩技术的发展,涌现出大量利用神经网络完成视频压缩任务、以摆脱复杂系统的研究工作。研究者不仅提出了先进的算法,还将压缩拓展到不同的内容类型,例如用户生成内容(UGC)。随着移动设备的快速发展,屏幕内容视频已成为多媒体数据的重要组成部分。相比之下,我们发现学界缺乏一个面向屏幕内容视频压缩的大规模数据集,这阻碍了相应的基于学习的算法的快速发展。为填补这一空白并加速对这类特殊视频的研究,我们提出了包含 714 条源序列的大规模屏幕内容数据集(LSCD)。同时,我们对所提数据集进行了分析,展示屏幕内容视频的一些特性,帮助研究人员更好地理解如何探索新算法。除了收集和后处理数据以构建数据集外,我们还提供了一个包含传统编解码器与基于学习的方法性能的基准。

SAMedOCT: Adapting Segment Anything Model (SAM) for Retinal OCT

  • paper_url: http://arxiv.org/abs/2308.09331
  • repo_url: None
  • paper_authors: Botond Fazekas, José Morano, Dmitrii Lachinov, Guilherme Aresta, Hrvoje Bogunović
  • for: 这篇论文主要是为了评估 Segment Anything Model(SAM)在 RETOUCH 挑战赛的大规模公开数据集上的应用。
  • methods: 这篇论文对 SAM 及其适配版本在视网膜 OCT 图像分割上进行了评估,并与当前领先的视网膜积液分割方法进行了比较。
  • results: 研究发现,适配后的 SAM 在视网膜 OCT 图像分割中表现出色,但在某些情况下仍落后于当前领先的方法。这些结果表明 SAM 在视网膜 OCT 图像分析中具有适应性和鲁棒性,可以作为该领域的一种有价值的工具。
    Abstract The Segment Anything Model (SAM) has gained significant attention in the field of image segmentation due to its impressive capabilities and prompt-based interface. While SAM has already been extensively evaluated in various domains, its adaptation to retinal OCT scans remains unexplored. To bridge this research gap, we conduct a comprehensive evaluation of SAM and its adaptations on a large-scale public dataset of OCTs from RETOUCH challenge. Our evaluation covers diverse retinal diseases, fluid compartments, and device vendors, comparing SAM against state-of-the-art retinal fluid segmentation methods. Through our analysis, we showcase adapted SAM's efficacy as a powerful segmentation model in retinal OCT scans, although still lagging behind established methods in some circumstances. The findings highlight SAM's adaptability and robustness, showcasing its utility as a valuable tool in retinal OCT image analysis and paving the way for further advancements in this domain.
    摘要 Segment Anything Model(SAM)凭借其出色的能力和基于提示的交互界面,在图像分割领域备受关注。尽管 SAM 已在多个领域得到了广泛评估,但其在视网膜 OCT 扫描上的适配仍未被探索。为弥补这一研究空白,我们在 RETOUCH 挑战赛的大规模公开 OCT 数据集上,对 SAM 及其适配版本进行了全面评估。评估涵盖多种视网膜疾病、积液腔室和设备厂商,并将 SAM 与最先进的视网膜积液分割方法进行比较。通过分析,我们展示了适配后的 SAM 作为视网膜 OCT 扫描中强大分割模型的有效性,尽管在某些情形下仍落后于已有方法。这些发现凸显了 SAM 的适应性和鲁棒性,表明其可作为视网膜 OCT 图像分析中的一种有价值工具,并为该领域的进一步发展铺平道路。

Unlimited Knowledge Distillation for Action Recognition in the Dark

  • paper_url: http://arxiv.org/abs/2308.09327
  • repo_url: None
  • paper_authors: Ruibing Jin, Guosheng Lin, Min Wu, Jie Lin, Zhengguo Li, Xiaoli Li, Zhenghua Chen
  • for: 丰富动作识别网络学到的知识,以提升黑暗(低照度)视频中的动作识别性能。
  • methods: 提出无限知识蒸馏(UKD)技术,无需大量 GPU 内存即可有效汇集来自多个教师模型的知识,因此用于蒸馏的教师模型数量不受限制。
  • results: 在 ARID 数据集上进行了广泛的实验,经 UKD 蒸馏的单流网络甚至超过了双流网络的表现。
    Abstract Dark videos often lose essential information, which causes the knowledge learned by networks is not enough to accurately recognize actions. Existing knowledge assembling methods require massive GPU memory to distill the knowledge from multiple teacher models into a student model. In action recognition, this drawback becomes serious due to much computation required by video process. Constrained by limited computation source, these approaches are infeasible. To address this issue, we propose an unlimited knowledge distillation (UKD) in this paper. Compared with existing knowledge assembling methods, our UKD can effectively assemble different knowledge without introducing high GPU memory consumption. Thus, the number of teaching models for distillation is unlimited. With our UKD, the network's learned knowledge can be remarkably enriched. Our experiments show that the single stream network distilled with our UKD even surpasses a two-stream network. Extensive experiments are conducted on the ARID dataset.
    摘要 黑暗视频往往丢失关键信息,导致网络学到的知识不足以准确识别动作。现有的知识汇集方法需要海量 GPU 内存,才能将多个教师模型中的知识蒸馏到一个学生模型中。在动作识别任务中,由于视频处理需要大量计算,这一缺点尤为严重;在计算资源受限的情况下,这些方法并不可行。为解决这一问题,我们在本文中提出了无限知识蒸馏(UKD)。与现有的知识汇集方法相比,我们的 UKD 能够在不引入高 GPU 内存消耗的情况下有效汇集不同的知识,因此用于蒸馏的教师模型数量不受限制。借助 UKD,网络学到的知识可以得到显著丰富。实验表明,经 UKD 蒸馏的单流网络甚至超越了双流网络。我们在 ARID 数据集上进行了广泛的实验。
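For context, the standard logit-distillation loss that multi-teacher knowledge assembling builds on is sketched below; the digest does not disclose UKD's specific memory-saving mechanism, so this shows only the common baseline formulation (the temperature T and weighting alpha are conventional assumptions).

```python
# Sketch: standard knowledge-distillation loss (Hinton-style soft targets).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 4.0, alpha: float = 0.7):
    # Soft targets: KL between temperature-scaled teacher/student outputs.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)  # ground-truth supervision
    return alpha * soft + (1 - alpha) * hard
```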

Retro-FPN: Retrospective Feature Pyramid Network for Point Cloud Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.09314
  • repo_url: https://github.com/allenxiangx/retro-fpn
  • paper_authors: Peng Xiang, Xin Wen, Yu-Shen Liu, Hui Zhang, Yi Fang, Zhizhong Han
  • for: 提升点云语义分割的精度,解决以往方法中的信息损失和区域特征歧义问题。
  • methods: 提出 Retro-FPN,将逐点特征预测建模为一个显式的、回溯式的精炼过程,贯穿所有金字塔层级,为每个点显式提取语义特征。其关键创新是一个 Retro-Transformer,用于从前一层级概括语义上下文,并据此精炼当前阶段的特征。
  • results: 与最先进的骨干网络相比,Retro-FPN 能显著提升性能。在多个常用基准上的大量实验验证了 Retro-FPN 的有效性。源代码可在 https://github.com/AllenXiangX/Retro-FPN 获取。
    Abstract Learning per-point semantic features from the hierarchical feature pyramid is essential for point cloud semantic segmentation. However, most previous methods suffered from ambiguous region features or failed to refine per-point features effectively, which leads to information loss and ambiguous semantic identification. To resolve this, we propose Retro-FPN to model the per-point feature prediction as an explicit and retrospective refining process, which goes through all the pyramid layers to extract semantic features explicitly for each point. Its key novelty is a retro-transformer for summarizing semantic contexts from the previous layer and accordingly refining the features in the current stage. In this way, the categorization of each point is conditioned on its local semantic pattern. Specifically, the retro-transformer consists of a local cross-attention block and a semantic gate unit. The cross-attention serves to summarize the semantic pattern retrospectively from the previous layer. And the gate unit carefully incorporates the summarized contexts and refines the current semantic features. Retro-FPN is a pluggable neural network that applies to hierarchical decoders. By integrating Retro-FPN with three representative backbones, including both point-based and voxel-based methods, we show that Retro-FPN can significantly improve performance over state-of-the-art backbones. Comprehensive experiments on widely used benchmarks can justify the effectiveness of our design. The source is available at https://github.com/AllenXiangX/Retro-FPN
    摘要 从层级特征金字塔中学习逐点语义特征对点云语义分割至关重要。然而,以往的大多数方法要么受到区域特征歧义的困扰,要么无法有效精炼逐点特征,从而导致信息损失和语义识别歧义。为解决这一问题,我们提出了 Retro-FPN,将逐点特征预测建模为一个显式的、回溯式的精炼过程,贯穿所有金字塔层级,为每个点显式提取语义特征。其关键创新是一个 Retro-Transformer,用于从前一层级概括语义上下文,并据此精炼当前阶段的特征,从而使每个点的分类以其局部语义模式为条件。具体来说,Retro-Transformer 由一个局部交叉注意力块和一个语义门控单元组成:交叉注意力用于回溯式地概括前一层级的语义模式,门控单元则审慎地融合概括得到的上下文并精炼当前的语义特征。Retro-FPN 是一个可插拔的神经网络,适用于层级解码器。通过将 Retro-FPN 与三种代表性骨干(包括基于点和基于体素的方法)相结合,我们表明 Retro-FPN 能够显著超越最先进的骨干网络。在常用基准上的全面实验验证了我们设计的有效性。源代码可在 https://github.com/AllenXiangX/Retro-FPN 获取。
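A minimal sketch of the retrospective refinement idea: current-level point features attend, via cross-attention, to the context summarized from the previous pyramid level, and a learned gate blends the result back in. The layer sizes and the gating form are illustrative assumptions, not the paper's exact Retro-Transformer.

```python
# Sketch: refine current-level point features with cross-attention to the
# previous pyramid level, followed by a simple learned gate.
import torch
import torch.nn as nn

class RetroRefine(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, cur: torch.Tensor, prev: torch.Tensor) -> torch.Tensor:
        # cur: (B, N, C) current-level features; prev: (B, M, C) previous level.
        ctx, _ = self.attn(query=cur, key=prev, value=prev)
        g = self.gate(torch.cat([cur, ctx], dim=-1))  # per-feature gate in (0,1)
        return cur + g * ctx
```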

Rethinking Image Forgery Detection via Contrastive Learning and Unsupervised Clustering

  • paper_url: http://arxiv.org/abs/2308.09307
  • repo_url: https://github.com/highwaywu/focal
  • paper_authors: Haiwei Wu, Yiming Chen, Jiantao Zhou
  • for: 本研究旨在提升图像伪造检测的精度和效果,提出了一种名为 FOCAL(FOrensic ContrAstive cLustering)的新方法。该方法基于对比学习和无监督聚类,能够准确检测并定位图像中的伪造区域。
  • methods: FOCAL 方法包含三个主要部分:1)以逐图像的方式使用像素级对比学习来监督高层取证特征的提取;2)使用在线(而非预训练)的无监督聚类算法,将学到的特征划分为伪造和原始两类;3)通过简单的特征级拼接进一步提升检测性能,且无需重新训练。
  • results: 实验结果表明,FOCAL 方法在六个公开测试数据集上以 IoU 衡量大幅超越最先进的竞争算法:Coverage 上 +24.3%、Columbia 上 +18.6%、FF++ 上 +17.5%、MISD 上 +14.2%、CASIA 上 +13.5%、NIST 上 +10.3%。
    Abstract Image forgery detection aims to detect and locate forged regions in an image. Most existing forgery detection algorithms formulate classification problems to classify pixels into forged or pristine. However, the definition of forged and pristine pixels is only relative within one single image, e.g., a forged region in image A is actually a pristine one in its source image B (splicing forgery). Such a relative definition has been severely overlooked by existing methods, which unnecessarily mix forged (pristine) regions across different images into the same category. To resolve this dilemma, we propose the FOrensic ContrAstive cLustering (FOCAL) method, a novel, simple yet very effective paradigm based on contrastive learning and unsupervised clustering for the image forgery detection. Specifically, FOCAL 1) utilizes pixel-level contrastive learning to supervise the high-level forensic feature extraction in an image-by-image manner, explicitly reflecting the above relative definition; 2) employs an on-the-fly unsupervised clustering algorithm (instead of a trained one) to cluster the learned features into forged/pristine categories, further suppressing the cross-image influence from training data; and 3) allows to further boost the detection performance via simple feature-level concatenation without the need of retraining. Extensive experimental results over six public testing datasets demonstrate that our proposed FOCAL significantly outperforms the state-of-the-art competing algorithms by big margins: +24.3% on Coverage, +18.6% on Columbia, +17.5% on FF++, +14.2% on MISD, +13.5% on CASIA and +10.3% on NIST in terms of IoU. The paradigm of FOCAL could bring fresh insights and serve as a novel benchmark for the image forgery detection task. The code is available at https://github.com/HighwayWu/FOCAL.
    摘要 图像伪造检测旨在检测并定位图像中的伪造区域。现有的大多数伪造检测算法将其表述为分类问题,把像素划分为伪造或原始两类。然而,伪造与原始像素的定义仅在单张图像内部才是相对的,例如图像 A 中的伪造区域在其来源图像 B 中其实是原始区域(拼接伪造)。这种相对定义被现有方法严重忽视,它们不必要地将来自不同图像的伪造(原始)区域混入同一类别。为解决这一困境,我们提出了 FOCAL(FOrensic ContrAstive cLustering)方法,一种基于对比学习和无监督聚类的、新颖而简单却非常有效的图像伪造检测范式。具体来说,FOCAL:1)以逐图像的方式使用像素级对比学习来监督高层取证特征的提取,显式体现上述相对定义;2)采用在线的无监督聚类算法(而非训练得到的聚类器)将学到的特征划分为伪造/原始两类,进一步抑制来自训练数据的跨图像影响;3)可通过简单的特征级拼接进一步提升检测性能,且无需重新训练。在六个公开测试数据集上的大量实验结果表明,我们提出的 FOCAL 以 IoU 衡量大幅超越最先进的竞争算法:Coverage 上 +24.3%、Columbia 上 +18.6%、FF++ 上 +17.5%、MISD 上 +14.2%、CASIA 上 +13.5%、NIST 上 +10.3%。FOCAL 范式有望带来新的见解,并可作为图像伪造检测任务的新基准。代码可在 https://github.com/HighwayWu/FOCAL 获取。
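The inference-time idea of step 2 — clustering per-pixel forensic features into two groups on the fly rather than thresholding a trained classifier — can be sketched as below. The use of k-means and the "minority cluster is forged" heuristic are illustrative assumptions; the paper's actual on-the-fly clustering algorithm may differ.

```python
# Sketch: label pixels by clustering their forensic features into two groups.
import numpy as np
from sklearn.cluster import KMeans

def cluster_forgery_mask(features: np.ndarray) -> np.ndarray:
    # features: (H, W, C) per-pixel feature map from the forensic extractor.
    h, w, c = features.shape
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(
        features.reshape(-1, c))
    # Heuristic assumption: the forged region is usually the minority cluster.
    forged = 1 if (labels == 1).sum() < (labels == 0).sum() else 0
    return (labels == forged).reshape(h, w).astype(np.uint8)
```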

DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability

  • paper_url: http://arxiv.org/abs/2308.09306
  • repo_url: None
  • paper_authors: Runhui Huang, Jianhua Han, Guansong Lu, Xiaodan Liang, Yihan Zeng, Wei Zhang, Hang Xu
  • for: 本文旨在探讨对生成与判别进行联合建模的可能性,以同时提升图像生成与图像-文本判别任务的性能。
  • methods: 提出名为DiffDis的方法,在扩散过程下将跨模态的生成式预训练与判别式预训练(如CLIP、ALIGN和FILIP所采用的范式)统一到单一框架中,并设计双流网络架构将带噪文本嵌入与不同尺度的潜在图像知识相融合。
  • results: 实验表明,DiffDis在图像生成与图像-文本判别任务上均优于单任务模型,例如在12个数据集上零样本分类平均精度提升1.65%,零样本图像生成的FID提升2.42。
    Abstract Recently, large-scale diffusion models, e.g., Stable diffusion and DallE2, have shown remarkable results on image synthesis. On the other hand, large-scale cross-modal pre-trained models (e.g., CLIP, ALIGN, and FILIP) are competent for various downstream tasks by learning to align vision and language embeddings. In this paper, we explore the possibility of jointly modeling generation and discrimination. Specifically, we propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process. DiffDis first formulates the image-text discriminative problem as a generative diffusion process of the text embedding from the text encoder conditioned on the image. Then, we propose a novel dual-stream network architecture, which fuses the noisy text embedding with the knowledge of latent images from different scales for image-text discriminative learning. Moreover, the generative and discriminative tasks can efficiently share the image-branch network structure in the multi-modality model. Benefiting from diffusion-based unified training, DiffDis achieves both better generation ability and cross-modal semantic alignment in one architecture. Experimental results show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks, e.g., 1.65% improvement on average accuracy of zero-shot classification over 12 datasets and 2.42 improvement on FID of zero-shot image synthesis.
    摘要 近期,大规模扩散模型(如Stable diffusion和DallE2)在图像生成方面取得了令人瞩目的成果。另一方面,大规模跨模态预训练模型(如CLIP、ALIGN和FILIP)通过学习对齐视觉与语言嵌入,能够胜任多种下游任务。本文探讨了对生成与判别进行联合建模的可能性。具体而言,我们提出DiffDis,在扩散过程下将跨模态的生成式与判别式预训练统一到同一框架中。DiffDis首先将图像-文本判别问题表述为:以图像为条件,对文本编码器输出的文本嵌入进行生成式扩散的过程。随后,我们提出一种新颖的双流网络架构,将带噪的文本嵌入与来自不同尺度的潜在图像知识相融合,用于图像-文本判别学习。此外,生成与判别任务可以在多模态模型中高效共享图像分支的网络结构。得益于基于扩散的统一训练,DiffDis在同一架构中同时获得了更强的生成能力与跨模态语义对齐能力。实验结果表明,DiffDis在图像生成与图像-文本判别任务上均优于单任务模型,例如在12个数据集上零样本分类平均精度提升1.65%,零样本图像生成的FID提升2.42。

Human Part-wise 3D Motion Context Learning for Sign Language Recognition

  • paper_url: http://arxiv.org/abs/2308.09305
  • repo_url: None
  • paper_authors: Taeryung Lee, Yeonguk Oh, Kyoung Mu Lee
  • for: 通过学习人体部位级的动作上下文,并联合利用2D与3D姿态,提升手语识别性能。
  • methods: 提出基于人体部位级动作上下文学习的框架P3D:以部位级编码Transformer(PET)与全身编码Transformer(WET)交替组合来学习部位级动作上下文,并对2D与3D姿态进行集成(pose ensemble)。
  • results: 在WLASL数据集上取得了超越以往最先进方法的性能。
    Abstract In this paper, we propose P3D, the human part-wise motion context learning framework for sign language recognition. Our main contributions lie in two dimensions: learning the part-wise motion context and employing the pose ensemble to utilize 2D and 3D pose jointly. First, our empirical observation implies that part-wise context encoding benefits the performance of sign language recognition. While previous methods of sign language recognition learned motion context from the sequence of the entire pose, we argue that such methods cannot exploit part-specific motion context. In order to utilize part-wise motion context, we propose the alternating combination of a part-wise encoding Transformer (PET) and a whole-body encoding Transformer (WET). PET encodes the motion contexts from a part sequence, while WET merges them into a unified context. By learning part-wise motion context, our P3D achieves superior performance on WLASL compared to previous state-of-the-art methods. Second, our framework is the first to ensemble 2D and 3D poses for sign language recognition. Since the 3D pose holds rich motion context and depth information to distinguish the words, our P3D outperformed the previous state-of-the-art methods employing a pose ensemble.
    摘要 本文提出P3D,一种用于手语识别的人体部位级动作上下文学习框架。我们的主要贡献有两个方面:学习部位级动作上下文,以及利用姿态集成来联合使用2D与3D姿态。首先,我们的实验观察表明,部位级上下文编码有利于手语识别性能。以往的手语识别方法从完整姿态序列中学习动作上下文,我们认为这类方法无法利用部位特定的动作上下文。为此,我们提出将部位级编码Transformer(PET)与全身编码Transformer(WET)交替组合:PET从部位序列中编码动作上下文,WET再将其融合为统一的上下文。通过学习部位级动作上下文,我们的P3D在WLASL上取得了优于以往最先进方法的性能。其次,我们的框架首次将2D与3D姿态集成用于手语识别。由于3D姿态蕴含丰富的动作上下文与深度信息,有助于区分词语,采用姿态集成的P3D超越了以往采用该思路的最先进方法。
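A minimal PyTorch sketch of the alternating part-wise (PET) and whole-body (WET) encoding described above. The three-way joint split, token layout, and layer sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class PETWETBlock(nn.Module):
    """One alternation of part-wise (PET) and whole-body (WET) encoding."""
    def __init__(self, dim=256, heads=4,
                 parts=(range(0, 21), range(21, 42), range(42, 54))):
        super().__init__()
        self.parts = [list(p) for p in parts]  # e.g. left hand, right hand, body
        self.pet = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.wet = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, J, D) per-frame, per-joint pose embeddings
        B, T, J, D = x.shape
        out = x.clone()
        for idx in self.parts:                 # PET: motion context within a part
            tok = x[:, :, idx, :].reshape(B, T * len(idx), D)
            out[:, :, idx, :] = self.pet(tok).reshape(B, T, len(idx), D)
        tok = out.reshape(B, T * J, D)         # WET: merge into a unified context
        return self.wet(tok).reshape(B, T, J, D)

# usage: y = PETWETBlock()(torch.randn(2, 16, 54, 256))
```

A stack of such blocks alternates part-wise and whole-body context, which is the PET/WET interleaving the abstract describes.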

Online Class Incremental Learning on Stochastic Blurry Task Boundary via Mask and Visual Prompt Tuning

  • paper_url: http://arxiv.org/abs/2308.09303
  • repo_url: https://github.com/moonjunyyy/si-blurry
  • paper_authors: Jun-Yeong Moon, Keon-Hee Park, Jung Uk Kim, Gyeong-Moon Park
  • for: 应对现实场景中持续学习的挑战:输入数据量与任务数量以统计方式不断变化。
  • methods: 提出新的随机增量模糊任务边界场景(Si-Blurry),并提出Mask and Visual Prompt tuning(MVP)方法来解决任务间/任务内遗忘与类别不均衡问题。MVP包含新颖的实例级logit掩码与对比式视觉提示调优损失,以及新的基于梯度相似度的focal损失与自适应特征缩放。
  • results: 大量实验表明,在具有挑战性的Si-Blurry场景下,所提出的MVP显著优于现有最先进方法。
    Abstract Continual learning aims to learn a model from a continuous stream of data, but it mainly assumes a fixed number of data and tasks with clear task boundaries. However, in real-world scenarios, the number of input data and tasks is constantly changing in a statistical way, not a static way. Although recently introduced incremental learning scenarios having blurry task boundaries somewhat address the above issues, they still do not fully reflect the statistical properties of real-world situations because of the fixed ratio of disjoint and blurry samples. In this paper, we propose a new Stochastic incremental Blurry task boundary scenario, called Si-Blurry, which reflects the stochastic properties of the real-world. We find that there are two major challenges in the Si-Blurry scenario: (1) inter- and intra-task forgettings and (2) class imbalance problem. To alleviate them, we introduce Mask and Visual Prompt tuning (MVP). In MVP, to address the inter- and intra-task forgetting issues, we propose a novel instance-wise logit masking and contrastive visual prompt tuning loss. Both of them help our model discern the classes to be learned in the current batch. It results in consolidating the previous knowledge. In addition, to alleviate the class imbalance problem, we introduce a new gradient similarity-based focal loss and adaptive feature scaling to ease overfitting to the major classes and underfitting to the minor classes. Extensive experiments show that our proposed MVP significantly outperforms the existing state-of-the-art methods in our challenging Si-Blurry scenario.
    摘要 持续学习旨在从连续的数据流中学习模型,但它通常假设数据量与任务数固定,且任务边界清晰。然而在现实场景中,输入数据与任务的数量是以统计方式而非静态方式不断变化的。虽然近期提出的具有模糊任务边界的增量学习场景在一定程度上缓解了上述问题,但由于不相交样本与模糊样本的比例固定,它们仍未能充分反映现实情况的统计特性。本文提出一种新的随机增量模糊任务边界场景,称为Si-Blurry,以反映现实世界的随机特性。我们发现Si-Blurry场景存在两大挑战:(1)任务间与任务内的遗忘;(2)类别不均衡问题。为缓解这些问题,我们提出Mask and Visual Prompt tuning(MVP)。针对任务间与任务内遗忘,MVP提出了新颖的实例级logit掩码与对比式视觉提示调优损失,二者帮助模型辨别当前批次中需要学习的类别,从而巩固既有知识。此外,针对类别不均衡问题,我们引入新的基于梯度相似度的focal损失与自适应特征缩放,以缓解对多数类的过拟合与对少数类的欠拟合。大量实验表明,在具有挑战性的Si-Blurry场景下,我们提出的MVP显著优于现有最先进方法。
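The instance-wise logit masking idea, letting only the classes relevant to the current batch compete in the softmax so that absent classes receive no noisy negative gradient, can be illustrated with a simplified batch-level variant (the paper's masking is per-instance and prompt-based, so this is an approximation):

```python
import torch
import torch.nn.functional as F

def masked_ce_loss(logits, targets):
    """Simplified logit masking: only classes present in the current (blurry)
    batch take part in the softmax; all other class logits are suppressed,
    so previously learned classes receive no spurious negative gradient."""
    mask = torch.full_like(logits, float('-inf'))
    seen = targets.unique()
    mask[:, seen] = 0.0                      # keep logits of classes in this batch
    return F.cross_entropy(logits + mask, targets)

# usage: loss = masked_ce_loss(model(x), y)
```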

Inferior Alveolar Nerve Segmentation in CBCT images using Connectivity-Based Selective Re-training

  • paper_url: http://arxiv.org/abs/2308.09298
  • repo_url: https://github.com/garynico517/ssl-ian-retraining
  • paper_authors: Yusheng Liu, Rui Xin, Tao Yang, Lisheng Wang
  • for: 提升下牙槽神经(IAN)管的自动分割能力,以便在牙科与颌面外科手术中避免对神经造成不可逆损伤。
  • methods: 将混合监督问题转化为半监督问题,并提出基于IAN连通性的选择性重训练框架,以规避稀疏标注带来的负面影响。
  • results: 在ToothFairy验证集上进行定量评估,取得0.7956的Dice相似系数(DSC)与4.4905的95% Hausdorff距离(HD95),并在竞赛中夺冠。
    Abstract Inferior Alveolar Nerve (IAN) canal detection in CBCT is an important step in many dental and maxillofacial surgery applications to prevent irreversible damage to the nerve during the procedure.The ToothFairy2023 Challenge aims to establish a 3D maxillofacial dataset consisting of all sparse labels and partial dense labels, and improve the ability of automatic IAN segmentation. In this work, in order to avoid the negative impact brought by sparse labeling, we transform the mixed supervised problem into a semi-supervised problem. Inspired by self-training via pseudo labeling, we propose a selective re-training framework based on IAN connectivity. Our method is quantitatively evaluated on the ToothFairy verification cases, achieving the dice similarity coefficient (DSC) of 0.7956, and 95\% hausdorff distance (HD95) of 4.4905, and wining the champion in the competition. Code is available at https://github.com/GaryNico517/SSL-IAN-Retraining.
    摘要 在CBCT中检测下牙槽神经(IAN)管是许多牙科与颌面外科手术应用中的重要步骤,可避免术中对神经造成不可逆损伤。ToothFairy2023挑战赛旨在建立一个包含全部稀疏标注与部分密集标注的3D颌面数据集,并提升IAN自动分割能力。在本工作中,为避免稀疏标注带来的负面影响,我们将混合监督问题转化为半监督问题。受伪标签自训练的启发,我们提出一种基于IAN连通性的选择性重训练框架。我们的方法在ToothFairy验证病例上进行了定量评估,取得0.7956的Dice相似系数(DSC)与4.4905的95% Hausdorff距离(HD95),并在竞赛中夺冠。代码见 https://github.com/GaryNico517/SSL-IAN-Retraining。
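A hypothetical sketch of connectivity-based selection of pseudo labels: a predicted IAN mask is trusted for re-training only if its foreground mass concentrates in a single connected component, as an intact nerve canal should. The purity threshold is an assumed hyperparameter, not the paper's rule.

```python
import numpy as np
from scipy import ndimage

def select_pseudo_label(pred_mask, purity_thresh=0.95):
    """pred_mask: (D, H, W) binary prediction on a weakly labeled case.
    Returns a cleaned mask to re-train on, or None if the prediction is
    too fragmented to be a plausible nerve canal."""
    labeled, n = ndimage.label(pred_mask)              # 3D connected components
    if n == 0:
        return None
    sizes = ndimage.sum(pred_mask, labeled, index=range(1, n + 1))
    if sizes.max() / sizes.sum() < purity_thresh:
        return None                                    # fragmented -> reject
    keep = labeled == (int(np.argmax(sizes)) + 1)      # keep only the main canal
    return keep.astype(np.uint8)
```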

NAPA-VQ: Neighborhood Aware Prototype Augmentation with Vector Quantization for Continual Learning

  • paper_url: http://arxiv.org/abs/2308.09297
  • repo_url: https://github.com/tamasham/napa-vq
  • paper_authors: Tamasha Malepathirana, Damith Senanayake, Saman Halgamuge
  • for: 缓解深度神经网络的灾难性遗忘问题,即在获取新知识时不丢失既有知识。
  • methods: 面向非样本保留的类增量学习(NECIL),不借助旧类样本学习新类;借鉴Neural Gas学习特征空间的拓扑关系,识别最易混淆的邻近类别,据此强化邻近类别间的分离,并生成旧类代表性原型,以获得更具判别力的新旧类决策边界。
  • results: 与现有最先进NECIL方法相比,NAPA-VQ在CIFAR-100、TinyImageNet和ImageNet-Subset上分别平均提升5%、2%和4%的准确率,并分别降低10%、3%和9%的遗忘。
    Abstract Catastrophic forgetting; the loss of old knowledge upon acquiring new knowledge, is a pitfall faced by deep neural networks in real-world applications. Many prevailing solutions to this problem rely on storing exemplars (previously encountered data), which may not be feasible in applications with memory limitations or privacy constraints. Therefore, the recent focus has been on Non-Exemplar based Class Incremental Learning (NECIL) where a model incrementally learns about new classes without using any past exemplars. However, due to the lack of old data, NECIL methods struggle to discriminate between old and new classes causing their feature representations to overlap. We propose NAPA-VQ: Neighborhood Aware Prototype Augmentation with Vector Quantization, a framework that reduces this class overlap in NECIL. We draw inspiration from Neural Gas to learn the topological relationships in the feature space, identifying the neighboring classes that are most likely to get confused with each other. This neighborhood information is utilized to enforce strong separation between the neighboring classes as well as to generate old class representative prototypes that can better aid in obtaining a discriminative decision boundary between old and new classes. Our comprehensive experiments on CIFAR-100, TinyImageNet, and ImageNet-Subset demonstrate that NAPA-VQ outperforms the State-of-the-art NECIL methods by an average improvement of 5%, 2%, and 4% in accuracy and 10%, 3%, and 9% in forgetting respectively. Our code can be found in https://github.com/TamashaM/NAPA-VQ.git.
    摘要 灾难性遗忘,即在获取新知识时丢失旧知识,是深度神经网络在实际应用中面临的一大陷阱。许多主流解决方案依赖于存储以往样本(exemplars),但在存在内存限制或隐私约束的应用中可能不可行。因此,近期的关注点转向非样本保留的类增量学习(NECIL),即模型在不使用任何旧样本的情况下增量地学习新类。然而,由于缺乏旧数据,NECIL方法难以区分新旧类别,导致其特征表示相互重叠。我们提出NAPA-VQ:基于向量量化的邻域感知原型增强框架,以减少NECIL中的类别重叠。我们借鉴Neural Gas来学习特征空间中的拓扑关系,识别最可能相互混淆的邻近类别,并利用这一邻域信息强化邻近类别之间的分离,同时生成旧类代表性原型,以帮助获得新旧类之间更具判别力的决策边界。在CIFAR-100、TinyImageNet和ImageNet-Subset上的全面实验表明,NAPA-VQ相比最先进的NECIL方法分别平均提升5%、2%和4%的准确率,并分别降低10%、3%和9%的遗忘。代码见 https://github.com/TamashaM/NAPA-VQ.git。

Self-Calibrated Cross Attention Network for Few-Shot Segmentation

  • paper_url: http://arxiv.org/abs/2308.09294
  • repo_url: https://github.com/sam1224/sccan
  • paper_authors: Qianxiong Xu, Wenting Zhao, Guosheng Lin, Cheng Long
  • for: This paper focuses on improving few-shot segmentation (FSS) by effectively utilizing support samples.
  • methods: The proposed method uses a self-calibrated cross attention (SCCA) block, which splits the query and support features into patches, aligns each query patch with its most similar support patch, and fuses the query background features with matched background features from the query's own patches.
  • results: The proposed method achieves state-of-the-art performance on PASCAL-5^i and COCO-20^i under the 5-shot setting, with an mIoU score 5.6% better than previous state-of-the-art methods on COCO-20^i.
    Abstract The key to the success of few-shot segmentation (FSS) lies in how to effectively utilize support samples. Most solutions compress support foreground (FG) features into prototypes, but lose some spatial details. Instead, others use cross attention to fuse query features with uncompressed support FG. Query FG could be fused with support FG, however, query background (BG) cannot find matched BG features in support FG, yet inevitably integrates dissimilar features. Besides, as both query FG and BG are combined with support FG, they get entangled, thereby leading to ineffective segmentation. To cope with these issues, we design a self-calibrated cross attention (SCCA) block. For efficient patch-based attention, query and support features are firstly split into patches. Then, we design a patch alignment module to align each query patch with its most similar support patch for better cross attention. Specifically, SCCA takes a query patch as Q, and groups the patches from the same query image and the aligned patches from the support image as K&V. In this way, the query BG features are fused with matched BG features (from query patches), and thus the aforementioned issues will be mitigated. Moreover, when calculating SCCA, we design a scaled-cosine mechanism to better utilize the support features for similarity calculation. Extensive experiments conducted on PASCAL-5^i and COCO-20^i demonstrate the superiority of our model, e.g., the mIoU score under 5-shot setting on COCO-20^i is 5.6%+ better than previous state-of-the-arts. The code is available at https://github.com/Sam1224/SCCAN.
    摘要 少样本分割(few-shot segmentation, FSS)成功的关键在于如何有效利用支持样本。多数方案将支持前景(FG)特征压缩为原型,但会丢失部分空间细节;另一些方案则利用交叉注意力,将查询特征与未压缩的支持前景相融合。查询前景可以与支持前景融合,但查询背景(BG)在支持前景中找不到匹配的背景特征,不可避免地融入不相似的特征。此外,由于查询前景与背景都与支持前景相结合,二者相互纠缠,导致分割效果不佳。为解决这些问题,我们设计了自校准交叉注意力(SCCA)模块。为实现高效的patch级注意力,查询与支持特征首先被切分为patch;随后我们设计了patch对齐模块,将每个查询patch与其最相似的支持patch对齐,以获得更好的交叉注意力。具体地,SCCA以查询patch作为Q,并将来自同一查询图像的patch与对齐后的支持patch共同作为K和V。这样,查询背景特征便能与(来自查询patch的)匹配背景特征相融合,从而缓解上述问题。此外,在计算SCCA时,我们设计了缩放余弦(scaled-cosine)机制,以更好地利用支持特征进行相似度计算。在PASCAL-5^i与COCO-20^i上的大量实验表明了我们模型的优越性,例如在5-shot设定下,COCO-20^i上的mIoU比以往最先进方法高出5.6%。代码见 https://github.com/Sam1224/SCCAN。
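Two pieces of the design above can be sketched compactly: patch alignment by cosine similarity, and a scaled-cosine attention in which normalized Q/K similarities are multiplied by a learnable scale. The projection layout and initial scale below are assumptions, not the paper's exact modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def align_patches(query_patches, support_patches):
    """Match each query patch to its most similar support patch (by cosine)."""
    q = F.normalize(query_patches, dim=-1)             # (B, Nq, D)
    s = F.normalize(support_patches, dim=-1)           # (B, Ns, D)
    idx = (q @ s.transpose(1, 2)).argmax(-1)           # (B, Nq) best support patch
    idx = idx.unsqueeze(-1).expand(-1, -1, s.size(-1))
    return torch.gather(support_patches, 1, idx)       # (B, Nq, D)

class ScaledCosineAttention(nn.Module):
    """Attention scores are cosine similarities times a learnable scale."""
    def __init__(self, dim, init_scale=10.0):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.scale = nn.Parameter(torch.tensor(init_scale))

    def forward(self, query_tokens, kv_tokens):
        q = F.normalize(self.q(query_tokens), dim=-1)  # (B, Nq, D)
        k = F.normalize(self.k(kv_tokens), dim=-1)     # (B, Nk, D)
        attn = ((q @ k.transpose(1, 2)) * self.scale).softmax(dim=-1)
        return attn @ self.v(kv_tokens)
```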

RFDforFin: Robust Deep Forgery Detection for GAN-generated Fingerprint Images

  • paper_url: http://arxiv.org/abs/2308.09285
  • repo_url: None
  • paper_authors: Hui Miao, Yuanfang Guo, Yunhong Wang
  • for: 防止GAN生成的指纹图像被恶意滥用,以维护公共安全。
  • methods: 结合指纹特有的脊线特征与GAN生成图像的生成伪影来进行深度伪造检测。
  • results: 提出了首个专门针对指纹图像的深度伪造检测方法,兼具低复杂度与高有效性。
    Abstract With the rapid development of the image generation technologies, the malicious abuses of the GAN-generated fingerprint images poses a significant threat to the public safety in certain circumstances. Although the existing universal deep forgery detection approach can be applied to detect the fake fingerprint images, they are easily attacked and have poor robustness. Meanwhile, there is no specifically designed deep forgery detection method for fingerprint images. In this paper, we propose the first deep forgery detection approach for fingerprint images, which combines unique ridge features of fingerprint and generation artifacts of the GAN-generated images, to the best of our knowledge. Specifically, we firstly construct a ridge stream, which exploits the grayscale variations along the ridges to extract unique fingerprint-specific features. Then, we construct a generation artifact stream, in which the FFT-based spectrums of the input fingerprint images are exploited, to extract more robust generation artifact features. At last, the unique ridge features and generation artifact features are fused for binary classification (\textit{i.e.}, real or fake). Comprehensive experiments demonstrate that our proposed approach is effective and robust with low complexities.
    摘要 随着图像生成技术的快速发展,对GAN生成指纹图像的恶意滥用在某些情况下会对公共安全构成重大威胁。尽管现有的通用深度伪造检测方法可用于检测伪造指纹图像,但它们容易受到攻击且鲁棒性较差;同时,目前尚无专门针对指纹图像设计的深度伪造检测方法。据我们所知,本文提出了首个面向指纹图像的深度伪造检测方法,它结合了指纹特有的脊线特征与GAN生成图像的生成伪影。具体而言,我们首先构建脊线流,利用沿脊线的灰度变化提取指纹特有的特征;随后构建生成伪影流,利用输入指纹图像的基于FFT的频谱,提取更加鲁棒的生成伪影特征;最后,将脊线特征与生成伪影特征融合,进行二分类(即真实或伪造)。全面的实验表明,所提方法有效、鲁棒且复杂度低。
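A rough illustration of the two streams using classical features: a radially averaged FFT magnitude spectrum as a generation-artifact descriptor, and simple grayscale statistics along ridge pixels. The paper's streams are learned, so treat this purely as a conceptual stand-in; a small classifier would consume the concatenated features.

```python
import numpy as np

def generation_artifact_feature(img, out_dim=64):
    """Radially averaged log-magnitude FFT spectrum of a grayscale fingerprint."""
    f = np.fft.fftshift(np.fft.fft2(img.astype(np.float32)))
    mag = np.log1p(np.abs(f))
    h, w = mag.shape
    yy, xx = np.indices(mag.shape)
    r = np.hypot(yy - h / 2, xx - w / 2).astype(int)          # radius per pixel
    radial = (np.bincount(r.ravel(), weights=mag.ravel())
              / np.bincount(r.ravel()).clip(1))               # mean per radius
    grid = np.linspace(0, len(radial) - 1, out_dim)
    return np.interp(grid, np.arange(len(radial)), radial)

def ridge_feature(img, ridge_mask):
    """Ridge-stream statistics: grayscale values and gradients on ridge pixels."""
    vals = img[ridge_mask > 0].astype(np.float32)
    gy, gx = np.gradient(img.astype(np.float32))
    grad = np.hypot(gy, gx)[ridge_mask > 0]
    return np.array([vals.mean(), vals.std(), grad.mean(), grad.std()])
```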

Diverse Cotraining Makes Strong Semi-Supervised Segmentor

  • paper_url: http://arxiv.org/abs/2308.09281
  • repo_url: None
  • paper_authors: Yijiang Li, Xinjiang Wang, Lihe Yang, Litong Feng, Wayne Zhang, Ying Gao
  • for: 重新审视深度共训练的工作机制,发现现有紧耦合的共训练模型违背了“多个相容且条件独立视图”这一核心假设。
  • methods: 从输入域、不同数据增强与网络架构等维度系统性地增加共训练的多样性,以对抗网络同质化。
  • results: 与现有最先进方法相比,多样化共训练在不同评估协议下均取得大幅提升,例如在Pascal上仅用92、183和366张标注图像即分别取得76.2%、77.7%和80.2%的最佳mIoU。
    Abstract Deep co-training has been introduced to semi-supervised segmentation and achieves impressive results, yet few studies have explored the working mechanism behind it. In this work, we revisit the core assumption that supports co-training: multiple compatible and conditionally independent views. By theoretically deriving the generalization upper bound, we prove the prediction similarity between two models negatively impacts the model's generalization ability. However, most current co-training models are tightly coupled together and violate this assumption. Such coupling leads to the homogenization of networks and confirmation bias which consequently limits the performance. To this end, we explore different dimensions of co-training and systematically increase the diversity from the aspects of input domains, different augmentations and model architectures to counteract homogenization. Our Diverse Co-training outperforms the state-of-the-art (SOTA) methods by a large margin across different evaluation protocols on the Pascal and Cityscapes. For example. we achieve the best mIoU of 76.2%, 77.7% and 80.2% on Pascal with only 92, 183 and 366 labeled images, surpassing the previous best results by more than 5%.
    摘要 深度共训练已被引入半监督分割并取得了令人瞩目的成果,但鲜有研究探讨其背后的工作机制。本文重新审视支撑共训练的核心假设:多个相容且条件独立的视图。通过理论推导泛化上界,我们证明两个模型间的预测相似性会负面影响模型的泛化能力。然而,当前多数共训练模型紧密耦合,违背了这一假设。这种耦合导致网络同质化与确认偏差,进而限制了性能。为此,我们考察共训练的不同维度,从输入域、不同数据增强与模型架构等方面系统性地增加多样性,以对抗同质化。我们的多样化共训练(Diverse Co-training)在Pascal与Cityscapes的不同评估协议下大幅超越最先进方法,例如在Pascal上仅用92、183和366张标注图像即取得76.2%、77.7%和80.2%的最佳mIoU,比以往最佳结果高出5%以上。
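A minimal sketch of diversified co-training on unlabeled images: two segmentors with different architectures see differently augmented views and learn from each other's confident pseudo labels. The augmentations must be label-preserving (photometric) for the targets to align spatially; all names and the confidence threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cotraining_step(model_a, model_b, unlabeled, aug_a, aug_b, conf=0.95):
    """One unsupervised step: each model is supervised by the OTHER model's
    confident predictions on a differently augmented view."""
    xa, xb = aug_a(unlabeled), aug_b(unlabeled)      # two photometric views
    logits_a, logits_b = model_a(xa), model_b(xb)    # (B, C, H, W)
    conf_a, ya = logits_a.softmax(1).max(1)          # A's pseudo labels (B, H, W)
    conf_b, yb = logits_b.softmax(1).max(1)
    loss_a = (F.cross_entropy(logits_a, yb, reduction='none')
              * (conf_b > conf).float()).mean()      # A learns from B
    loss_b = (F.cross_entropy(logits_b, ya, reduction='none')
              * (conf_a > conf).float()).mean()      # B learns from A
    return loss_a + loss_b
```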

DiffLLE: Diffusion-guided Domain Calibration for Unsupervised Low-light Image Enhancement

  • paper_url: http://arxiv.org/abs/2308.09279
  • repo_url: None
  • paper_authors: Shuzhou Yang, Xuanyu Zhang, Yinhuai Wang, Jiwen Yu, Yuhan Wang, Jian Zhang
  • for: 提出一种鲁棒且有效的无监督低光照图像增强方法DiffLLE(Diffusion-guided Domain Calibration)。
  • methods: 在一个简单的无监督增强算法之上,引入两个基于扩散模型的零样本即插即用模块:扩散引导退化校准(DDC)模块与细粒度目标域蒸馏(FTD)模块。
  • results: 在多项实验中表现出色,甚至超过了一些有监督方法。
    Abstract Existing unsupervised low-light image enhancement methods lack enough effectiveness and generalization in practical applications. We suppose this is because of the absence of explicit supervision and the inherent gap between real-world scenarios and the training data domain. In this paper, we develop Diffusion-based domain calibration to realize more robust and effective unsupervised Low-Light Enhancement, called DiffLLE. Since the diffusion model performs impressive denoising capability and has been trained on massive clean images, we adopt it to bridge the gap between the real low-light domain and training degradation domain, while providing efficient priors of real-world content for unsupervised models. Specifically, we adopt a naive unsupervised enhancement algorithm to realize preliminary restoration and design two zero-shot plug-and-play modules based on diffusion model to improve generalization and effectiveness. The Diffusion-guided Degradation Calibration (DDC) module narrows the gap between real-world and training low-light degradation through diffusion-based domain calibration and a lightness enhancement curve, which makes the enhancement model perform robustly even in sophisticated wild degradation. Due to the limited enhancement effect of the unsupervised model, we further develop the Fine-grained Target domain Distillation (FTD) module to find a more visual-friendly solution space. It exploits the priors of the pre-trained diffusion model to generate pseudo-references, which shrinks the preliminary restored results from a coarse normal-light domain to a finer high-quality clean field, addressing the lack of strong explicit supervision for unsupervised methods. Benefiting from these, our approach even outperforms some supervised methods by using only a simple unsupervised baseline. Extensive experiments demonstrate the superior effectiveness of the proposed DiffLLE.
    摘要 现有的无监督低光照图像增强方法在实际应用中有效性与泛化性不足。我们认为其原因在于缺乏显式监督,以及真实场景与训练数据域之间的固有差距。本文提出基于扩散的域校准方法DiffLLE,以实现更鲁棒、更有效的无监督低光照增强。由于扩散模型具备出色的去噪能力且已在海量干净图像上训练,我们利用它来弥合真实低光域与训练退化域之间的差距,同时为无监督模型提供高效的真实世界内容先验。具体而言,我们采用一个朴素的无监督增强算法实现初步复原,并设计了两个基于扩散模型的零样本即插即用模块以提升泛化性与有效性。扩散引导退化校准(DDC)模块通过基于扩散的域校准与亮度增强曲线,缩小真实低光退化与训练退化之间的差距,使增强模型即便面对复杂的真实退化也能稳健工作。鉴于无监督模型的增强效果有限,我们进一步设计了细粒度目标域蒸馏(FTD)模块以寻找更符合视觉感受的解空间:它利用预训练扩散模型的先验生成伪参考,将初步复原结果从粗糙的正常光域收缩到更精细的高质量干净域,弥补了无监督方法缺乏强显式监督的不足。得益于此,我们的方法仅使用一个简单的无监督基线,甚至超越了一些有监督方法。大量实验验证了DiffLLE的优越性。

MATLABER: Material-Aware Text-to-3D via LAtent BRDF auto-EncodeR

  • paper_url: http://arxiv.org/abs/2308.09278
  • repo_url: https://github.com/SheldonTsui/Matlaber
  • paper_authors: Xudong Xu, Zhaoyang Lyu, Xingang Pan, Bo Dai
  • for: 通过新颖的潜在BRDF自编码器(LAtent BRDF auto-EncodeR, MATLABER)提升文本生成3D中的材质质量。
  • methods: 使用大规模真实世界BRDF数据训练该自编码器,并保证其潜在空间的平滑性,使其隐式地充当材质的自然分布;在文本生成3D的外观建模中,由材质网络预测潜在BRDF嵌入而非原始BRDF参数。
  • results: 在生成材质的真实感与一致性方面优于现有方法;高质量材质还自然地支持重光照、材质编辑等多种下游任务。
    Abstract Based on powerful text-to-image diffusion models, text-to-3D generation has made significant progress in generating compelling geometry and appearance. However, existing methods still struggle to recover high-fidelity object materials, either only considering Lambertian reflectance, or failing to disentangle BRDF materials from the environment lights. In this work, we propose Material-Aware Text-to-3D via LAtent BRDF auto-EncodeR (\textbf{MATLABER}) that leverages a novel latent BRDF auto-encoder for material generation. We train this auto-encoder with large-scale real-world BRDF collections and ensure the smoothness of its latent space, which implicitly acts as a natural distribution of materials. During appearance modeling in text-to-3D generation, the latent BRDF embeddings, rather than BRDF parameters, are predicted via a material network. Through exhaustive experiments, our approach demonstrates the superiority over existing ones in generating realistic and coherent object materials. Moreover, high-quality materials naturally enable multiple downstream tasks such as relighting and material editing. Code and model will be publicly available at \url{https://sheldontsui.github.io/projects/Matlaber}.
    摘要 基于强大的文本生成图像扩散模型,文本生成3D在几何与外观生成方面已取得显著进展。然而,现有方法仍难以恢复高保真的物体材质:要么只考虑朗伯(Lambertian)反射,要么无法将BRDF材质与环境光照解耦。本文提出基于潜在BRDF自编码器的材质感知文本生成3D方法(MATLABER),利用新颖的潜在BRDF自编码器进行材质生成。我们使用大规模真实世界BRDF数据训练该自编码器,并保证其潜在空间的平滑性,使其隐式地成为材质的自然分布。在文本生成3D的外观建模过程中,材质网络预测的是潜在BRDF嵌入而非BRDF参数。详尽的实验表明,我们的方法在生成真实且一致的物体材质方面优于现有方法;此外,高质量材质自然地支持重光照与材质编辑等多种下游任务。代码与模型将公开于 https://sheldontsui.github.io/projects/Matlaber。
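A minimal latent BRDF auto-encoder in the spirit described above: BRDF parameter vectors are compressed to a smooth low-dimensional code, which a material network would then predict instead of raw BRDF values. The 7-D parameterization (diffuse RGB, roughness, specular RGB) and layer widths are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BRDFAutoEncoder(nn.Module):
    """Compress per-point BRDF parameters into a smooth latent material code."""
    def __init__(self, brdf_dim=7, latent_dim=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(brdf_dim, 32), nn.ReLU(),
                                 nn.Linear(32, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                 nn.Linear(32, brdf_dim), nn.Sigmoid())

    def forward(self, brdf):
        z = self.enc(brdf)          # latent BRDF embedding
        return self.dec(z), z

# train with reconstruction loss plus a smoothness/KL-style penalty on z,
# so the latent space behaves like a natural distribution of materials
```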

Progression-Guided Temporal Action Detection in Videos

  • paper_url: http://arxiv.org/abs/2308.09268
  • repo_url: https://github.com/makecent/apn
  • paper_authors: Chongkai Lu, Man-Wai Mak, Ruimin Li, Zheru Chi, Hong Fu
  • for: 提出名为Action Progression Network(APN)的新框架,用于视频中的时间动作检测(TAD)。
  • methods: 将完整的动作过程量化为101个有序阶段(0%、1%、…、100%)以编码动作的时间结构,训练神经网络识别这些动作进度,并通过在视频中检测完整的动作过程来确定动作边界。
  • results: APN取得有竞争力的性能,在检测长时间动作方面显著超越同类方法;在IoU阈值0.5下,于THUMOS14数据集取得58.3%的mAP,于DFMAD70数据集取得98.9%的mAP。
    Abstract We present a novel framework, Action Progression Network (APN), for temporal action detection (TAD) in videos. The framework locates actions in videos by detecting the action evolution process. To encode the action evolution, we quantify a complete action process into 101 ordered stages (0\%, 1\%, ..., 100\%), referred to as action progressions. We then train a neural network to recognize the action progressions. The framework detects action boundaries by detecting complete action processes in the videos, e.g., a video segment with detected action progressions closely follow the sequence 0\%, 1\%, ..., 100\%. The framework offers three major advantages: (1) Our neural networks are trained end-to-end, contrasting conventional methods that optimize modules separately; (2) The APN is trained using action frames exclusively, enabling models to be trained on action classification datasets and robust to videos with temporal background styles differing from those in training; (3) Our framework effectively avoids detecting incomplete actions and excels in detecting long-lasting actions due to the fine-grained and explicit encoding of the temporal structure of actions. Leveraging these advantages, the APN achieves competitive performance and significantly surpasses its counterparts in detecting long-lasting actions. With an IoU threshold of 0.5, the APN achieves a mean Average Precision (mAP) of 58.3\% on the THUMOS14 dataset and 98.9\% mAP on the DFMAD70 dataset.
    摘要 我们提出了一个新框架,即Action Progression Network(APN),用于视频中的时间动作检测(TAD)。该框架通过检测动作的演进过程来定位动作。为编码动作演进,我们将完整的动作过程量化为101个有序阶段(0%、1%、…、100%),称为动作进度,并训练神经网络来识别这些动作进度。框架通过在视频中检测完整的动作过程来确定动作边界,例如某视频片段中检测到的动作进度紧密遵循0%、1%、…、100%的序列。该框架具有三大优势:(1)神经网络端到端训练,而传统方法往往分别优化各个模块;(2)APN仅使用动作帧进行训练,因此可在动作分类数据集上训练,并对与训练时时间背景风格不同的视频具有鲁棒性;(3)得益于对动作时间结构细粒度且显式的编码,框架能有效避免检测不完整的动作,并尤其擅长检测长时间动作。凭借这些优势,APN取得了有竞争力的性能,并在检测长时间动作方面显著超越同类方法。在IoU阈值为0.5时,APN在THUMOS14数据集上取得58.3%的mAP,在DFMAD70数据集上取得98.9%的mAP。
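The decoding step, turning per-frame progression predictions into detections by finding runs that rise from near 0% to near 100%, can be sketched as a simple scan. The thresholds and the monotonicity tolerance below are assumed, not the paper's exact rule.

```python
import numpy as np

def detect_complete_actions(prog, start_max=5, end_min=95, min_len=8):
    """prog: (T,) predicted action progression per frame, in [0, 100].
    Returns (start, end) frame pairs covering complete action processes."""
    detections, t, T = [], 0, len(prog)
    while t < T:
        if prog[t] <= start_max:                             # candidate start (~0%)
            e = t
            while e + 1 < T and prog[e + 1] >= prog[e] - 10: # roughly increasing
                e += 1
                if prog[e] >= end_min:                       # reached ~100%: complete
                    if e - t + 1 >= min_len:
                        detections.append((t, e))
                    break
            t = e + 1
        else:
            t += 1
    return detections

# usage: detect_complete_actions(np.linspace(0, 100, 32)) -> [(0, 30)]
```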

SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos

  • paper_url: http://arxiv.org/abs/2308.09244
  • repo_url: https://github.com/mcg-nju/sparsebev
  • paper_authors: Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, Limin Wang
  • for: This paper focuses on developing a fully sparse 3D object detector, SparseBEV, to mitigate the performance gap between sparse and dense detectors in camera-based 3D object detection.
  • methods: SparseBEV uses a query-based paradigm without explicit dense BEV feature construction, and includes three key designs: scale-adaptive self attention, adaptive spatio-temporal sampling, and adaptive mixing.
  • results: On the test split of nuScenes, SparseBEV achieves the state-of-the-art performance of 67.5 NDS. On the val split, SparseBEV achieves 55.8 NDS while maintaining a real-time inference speed of 23.5 FPS.
    Abstract Camera-based 3D object detection in BEV (Bird's Eye View) space has drawn great attention over the past few years. Dense detectors typically follow a two-stage pipeline by first constructing a dense BEV feature and then performing object detection in BEV space, which suffers from complex view transformations and high computation cost. On the other side, sparse detectors follow a query-based paradigm without explicit dense BEV feature construction, but achieve worse performance than the dense counterparts. In this paper, we find that the key to mitigate this performance gap is the adaptability of the detector in both BEV and image space. To achieve this goal, we propose SparseBEV, a fully sparse 3D object detector that outperforms the dense counterparts. SparseBEV contains three key designs, which are (1) scale-adaptive self attention to aggregate features with adaptive receptive field in BEV space, (2) adaptive spatio-temporal sampling to generate sampling locations under the guidance of queries, and (3) adaptive mixing to decode the sampled features with dynamic weights from the queries. On the test split of nuScenes, SparseBEV achieves the state-of-the-art performance of 67.5 NDS. On the val split, SparseBEV achieves 55.8 NDS while maintaining a real-time inference speed of 23.5 FPS. Code is available at https://github.com/MCG-NJU/SparseBEV.
    摘要 近年来,基于相机的鸟瞰(BEV)空间3D目标检测受到广泛关注。密集检测器通常采用两阶段流程:先构建密集的BEV特征,再在BEV空间中进行目标检测,这一流程受制于复杂的视角变换与高计算成本。另一方面,稀疏检测器遵循基于查询的范式,无需显式构建密集BEV特征,但性能不及密集检测器。本文发现,弥合这一性能差距的关键在于检测器在BEV与图像空间中的自适应能力。为此,我们提出SparseBEV,一种性能超越密集检测器的全稀疏3D目标检测器。SparseBEV包含三项关键设计:(1)尺度自适应自注意力,在BEV空间以自适应感受野聚合特征;(2)自适应时空采样,在查询的引导下生成采样位置;(3)自适应混合,利用来自查询的动态权重对采样特征进行解码。在nuScenes测试集上,SparseBEV取得了67.5 NDS的最先进性能;在验证集上,SparseBEV取得55.8 NDS,同时保持23.5 FPS的实时推理速度。代码见 https://github.com/MCG-NJU/SparseBEV。
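One schematic reading of scale-adaptive self attention: standard attention logits among BEV queries are penalized by pairwise BEV distance, with a per-query predicted scale controlling the effective receptive field. The exact parameterization below (single head, sigmoid-bounded scale) is an assumption, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ScaleAdaptiveSelfAttention(nn.Module):
    """Attention among BEV queries; distant queries are down-weighted by a
    per-query receptive-field scale tau, so each query adapts how far it looks."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.tau = nn.Linear(dim, 1)              # per-query scale predictor

    def forward(self, x, centers):
        # x: (B, N, D) query features; centers: (B, N, 2) BEV positions
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        dist = torch.cdist(centers, centers)      # (B, N, N) pairwise BEV distance
        tau = self.tau(x).sigmoid() * 5.0         # (B, N, 1), assumed range
        attn = (q @ k.transpose(1, 2)) / x.size(-1) ** 0.5 - tau * dist
        return attn.softmax(-1) @ v
```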

ASAG: Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Anchor Generation

  • paper_url: http://arxiv.org/abs/2308.09242
  • repo_url: https://github.com/isee-laboratory/asag
  • paper_authors: Shenghao Fu, Junkai Yan, Yipeng Gao, Xiaohua Xie, Wei-Shi Zheng
  • for: 缩小稀疏检测器与密集初始化检测器之间的性能差距,同时保持快速推理。
  • methods: 提出自适应稀疏锚点生成器(ASAG),以稀疏方式在图像块(patch)而非网格上动态预测锚点,缓解特征冲突问题;并提出简单有效的查询加权(Query Weighting)方法,缓解自适应性带来的训练不稳定。
  • results: 超越了密集初始化的同类方法,取得了更好的速度-精度权衡。
    Abstract Recent sparse detectors with multiple, e.g. six, decoder layers achieve promising performance but much inference time due to complex heads. Previous works have explored using dense priors as initialization and built one-decoder-layer detectors. Although they gain remarkable acceleration, their performance still lags behind their six-decoder-layer counterparts by a large margin. In this work, we aim to bridge this performance gap while retaining fast speed. We find that the architecture discrepancy between dense and sparse detectors leads to feature conflict, hampering the performance of one-decoder-layer detectors. Thus we propose Adaptive Sparse Anchor Generator (ASAG) which predicts dynamic anchors on patches rather than grids in a sparse way so that it alleviates the feature conflict problem. For each image, ASAG dynamically selects which feature maps and which locations to predict, forming a fully adaptive way to generate image-specific anchors. Further, a simple and effective Query Weighting method eases the training instability from adaptiveness. Extensive experiments show that our method outperforms dense-initialized ones and achieves a better speed-accuracy trade-off. The code is available at \url{https://github.com/iSEE-Laboratory/ASAG}.
    摘要 近期具有多层(如六层)解码器的稀疏检测器性能可观,但复杂的检测头导致推理耗时较长。已有工作探索以密集先验作为初始化,构建单层解码器检测器;尽管获得了显著加速,其性能仍大幅落后于六层解码器的对应方法。本文旨在弥合这一性能差距,同时保持快速推理。我们发现,密集与稀疏检测器之间的架构差异会导致特征冲突,阻碍单层解码器检测器的性能。为此,我们提出自适应稀疏锚点生成器(ASAG),以稀疏方式在图像块(patch)而非网格上预测动态锚点,从而缓解特征冲突问题。对每幅图像,ASAG动态选择在哪些特征图、哪些位置进行预测,形成完全自适应的图像特定锚点生成方式。此外,简单有效的查询加权(Query Weighting)方法缓解了自适应性带来的训练不稳定。大量实验表明,我们的方法超越了密集初始化的同类方法,并取得了更好的速度-精度权衡。代码见 https://github.com/iSEE-Laboratory/ASAG。

Deep Boosting Multi-Modal Ensemble Face Recognition with Sample-Level Weighting

  • paper_url: http://arxiv.org/abs/2308.09234
  • repo_url: None
  • paper_authors: Sahar Rahimi Malakshan, Mohammad Saeed Ebrahimi Saadabadi, Nima Najafzadeh, Nasser M. Nasrabadi
  • for: 提高人脸识别(FR)模型的泛化能力,解决现有训练数据的质量不均问题。
  • methods: 借鉴经典的AdaBoost,提出样本级加权的多模型boosting方法,将不同样本的重要性纳入人脸识别损失,使各个模型分别成为不同样本难度水平上的专家。
  • results: 在CFP-FP、LFW、CPLFW、CALFW、AgeDB、TinyFace、IJB-B和IJB-C评估数据集上取得了优于最先进方法的性能。
    Abstract Deep convolutional neural networks have achieved remarkable success in face recognition (FR), partly due to the abundant data availability. However, the current training benchmarks exhibit an imbalanced quality distribution; most images are of high quality. This poses issues for generalization on hard samples since they are underrepresented during training. In this work, we employ the multi-model boosting technique to deal with this issue. Inspired by the well-known AdaBoost, we propose a sample-level weighting approach to incorporate the importance of different samples into the FR loss. Individual models of the proposed framework are experts at distinct levels of sample hardness. Therefore, the combination of models leads to a robust feature extractor without losing the discriminability on the easy samples. Also, for incorporating the sample hardness into the training criterion, we analytically show the effect of sample mining on the important aspects of current angular margin loss functions, i.e., margin and scale. The proposed method shows superior performance in comparison with the state-of-the-art algorithms in extensive experiments on the CFP-FP, LFW, CPLFW, CALFW, AgeDB, TinyFace, IJB-B, and IJB-C evaluation datasets.
    摘要 深度卷积神经网络在人脸识别(FR)领域取得了显著成功,部分归功于数据的充足。然而,当前的训练基准呈现出不均衡的质量分布:大多数图像为高质量图像。这会影响模型在困难样本上的泛化,因为它们在训练中代表性不足。本文采用多模型boosting技术来应对这一问题。受著名的AdaBoost启发,我们提出一种样本级加权方法,将不同样本的重要性纳入FR损失。所提框架中的各个模型分别是不同样本难度水平上的专家,因此模型的组合能得到鲁棒的特征提取器,同时不丧失对简单样本的判别力。此外,为将样本难度纳入训练准则,我们从理论上分析了样本挖掘对当前角度间隔(angular margin)损失函数两个关键要素(即间隔与尺度)的影响。在CFP-FP、LFW、CPLFW、CALFW、AgeDB、TinyFace、IJB-B和IJB-C评估数据集上的大量实验表明,所提方法优于最先进算法。
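An AdaBoost-flavored sample-level weighting rule consistent with the idea above: samples with higher loss under the current expert receive larger weights in the next expert's FR loss. The exponential update and normalization are generic boosting choices, not the paper's exact scheme.

```python
import torch

def update_sample_weights(weights, per_sample_loss, lr=1.0):
    """weights, per_sample_loss: (N,). Harder samples (above-average loss)
    are up-weighted exponentially; weights are renormalized to mean 1."""
    w = weights * torch.exp(lr * (per_sample_loss - per_sample_loss.mean()))
    return w / w.sum() * len(w)

# inside training of the k-th expert (sketch):
#   loss = (sample_weights[batch_idx] * margin_loss_per_sample).mean()
```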

CCFace: Classification Consistency for Low-Resolution Face Recognition

  • paper_url: http://arxiv.org/abs/2308.09230
  • repo_url: None
  • paper_authors: Mohammad Saeed Ebrahimi Saadabadi, Sahar Rahimi Malakshan, Hossein Kashiani, Nasser M. Nasrabadi
  • for: 提高低分辨率人脸识别的性能
  • methods: 使用分类一致性知识蒸馏与自适应角度惩罚,以及非对称跨分辨率学习
  • results: 在TinyFace等低分辨率基准上提升约三个百分点,同时保持在高分辨率基准上的性能
    Abstract In recent years, deep face recognition methods have demonstrated impressive results on in-the-wild datasets. However, these methods have shown a significant decline in performance when applied to real-world low-resolution benchmarks like TinyFace or SCFace. To address this challenge, we propose a novel classification consistency knowledge distillation approach that transfers the learned classifier from a high-resolution model to a low-resolution network. This approach helps in finding discriminative representations for low-resolution instances. To further improve the performance, we designed a knowledge distillation loss using the adaptive angular penalty inspired by the success of the popular angular margin loss function. The adaptive penalty reduces overfitting on low-resolution samples and alleviates the convergence issue of the model integrated with data augmentation. Additionally, we utilize an asymmetric cross-resolution learning approach based on the state-of-the-art semi-supervised representation learning paradigm to improve discriminability on low-resolution instances and prevent them from forming a cluster. Our proposed method outperforms state-of-the-art approaches on low-resolution benchmarks, with a three percent improvement on TinyFace while maintaining performance on high-resolution benchmarks.
    摘要 近年来,深度人脸识别方法在真实场景数据集上表现出色,但在TinyFace、SCFace等真实低分辨率基准上性能明显下降。为应对这一挑战,我们提出一种新颖的分类一致性知识蒸馏方法,将高分辨率模型学到的分类器迁移到低分辨率网络,帮助其为低分辨率样本找到具有判别力的表示。为进一步提升性能,受角度间隔损失函数成功的启发,我们设计了采用自适应角度惩罚的知识蒸馏损失:自适应惩罚可减少对低分辨率样本的过拟合,并缓解与数据增强结合时模型的收敛问题。此外,我们基于最先进的半监督表示学习范式,采用非对称跨分辨率学习,以提升低分辨率样本的判别力并防止其聚成一簇。所提方法在低分辨率基准上超越最先进方法,在TinyFace上提升三个百分点,同时保持在高分辨率基准上的性能。
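The classification-consistency term can be sketched as standard temperature-softened distillation from the high-resolution teacher's classifier to the low-resolution student; the paper's adaptive angular penalty is omitted here, so this is only the consistency skeleton.

```python
import torch
import torch.nn.functional as F

def classification_consistency_loss(student_logits_lr, teacher_logits_hr, T=4.0):
    """KL between temperature-softened class posteriors of the HR teacher and
    the LR student on the same identity (Hinton-style distillation)."""
    p_t = F.softmax(teacher_logits_hr.detach() / T, dim=1)
    log_p_s = F.log_softmax(student_logits_lr / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction='batchmean') * T * T
```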

Generalized Sum Pooling for Metric Learning

  • paper_url: http://arxiv.org/abs/2308.09228
  • repo_url: https://github.com/yetigurbuz/generalized-sum-pooling
  • paper_authors: Yeti Z. Gurbuz, Ozan Sener, A. Aydın Alatan
  • for: 研究深度度量学习中的特征聚合(池化)方法,对全局平均池化(GAP)进行推广与改进。
  • methods: 提出可学习的广义求和池化(GSP),能够选择语义实体的子集(从而忽略干扰信息),并学习各语义实体的重要性权重;将其表述为熵平滑的最优传输问题,并辅以零样本损失以便于学习。
  • results: 在四个常用的度量学习基准上的大量实验验证了GSP的有效性,它可作为GAP的直接可学习替代。
    Abstract A common architectural choice for deep metric learning is a convolutional neural network followed by global average pooling (GAP). Albeit simple, GAP is a highly effective way to aggregate information. One possible explanation for the effectiveness of GAP is considering each feature vector as representing a different semantic entity and GAP as a convex combination of them. Following this perspective, we generalize GAP and propose a learnable generalized sum pooling method (GSP). GSP improves GAP with two distinct abilities: i) the ability to choose a subset of semantic entities, effectively learning to ignore nuisance information, and ii) learning the weights corresponding to the importance of each entity. Formally, we propose an entropy-smoothed optimal transport problem and show that it is a strict generalization of GAP, i.e., a specific realization of the problem gives back GAP. We show that this optimization problem enjoys analytical gradients enabling us to use it as a direct learnable replacement for GAP. We further propose a zero-shot loss to ease the learning of GSP. We show the effectiveness of our method with extensive evaluations on 4 popular metric learning benchmarks. Code is available at: GSP-DML Framework
    摘要 深度度量学习常见的架构选择是卷积神经网络后接全局平均池化(GAP)。GAP虽然简单,却是一种非常有效的信息聚合方式。对其有效性的一种解释是:将每个特征向量视为代表一个不同的语义实体,而GAP是它们的凸组合。沿着这一视角,我们对GAP进行推广,提出可学习的广义求和池化方法(GSP)。GSP在两方面改进了GAP:一是能够选择语义实体的子集,从而有效地学会忽略干扰信息;二是学习对应各实体重要性的权重。形式上,我们提出一个熵平滑的最优传输问题,并证明它是GAP的严格推广,即该问题的一个特定实例可还原为GAP。我们证明该优化问题具有解析梯度,因此可作为GAP的直接可学习替代。我们还提出一种零样本损失以便于GSP的学习。在四个常用的度量学习基准上的大量评估验证了所提方法的有效性。代码可在GSP-DML Framework中找到。
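GSP itself solves an entropy-smoothed optimal transport problem; a much simpler surrogate that conveys the idea, a learned, entropy-regularized convex combination over spatial locations with GAP recovered at uniform weights, looks like this (the scoring head and regularizer are assumptions, not the paper's solver):

```python
import torch
import torch.nn as nn

class LearnableSumPooling(nn.Module):
    """Simplified stand-in for GSP: a learned score per spatial location gives
    a convex combination of feature vectors; GAP corresponds to uniform weights."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, feats, return_entropy=False):
        # feats: (B, C, H, W)
        w = self.score(feats).flatten(2).softmax(-1)      # (B, 1, HW) weights
        pooled = (feats.flatten(2) * w).sum(-1)           # (B, C) weighted sum
        if return_entropy:
            # entropy regularizer hook: maximal entropy pushes weights to GAP
            ent = -(w * (w + 1e-8).log()).sum(-1).mean()
            return pooled, ent
        return pooled
```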

DMCVR: Morphology-Guided Diffusion Model for 3D Cardiac Volume Reconstruction

  • paper_url: http://arxiv.org/abs/2308.09223
  • repo_url: https://github.com/hexiaoxiao-cs/dmcvr
  • paper_authors: Xiaoxiao He, Chaowei Tan, Ligong Han, Bo Liu, Leon Axel, Kang Li, Dimitris N. Metaxas
  • for: 提高心脏疾病诊断和治疗规划的准确3D心脏重建
  • methods: 使用形态引导的扩散模型(DMCVR),合成高分辨率2D图像及相应的3D重建体积
  • results: 优于以往方法,能生成高分辨率的3D心脏MRI重建,提升心脏疾病诊断与治疗规划的准确性
    Abstract Accurate 3D cardiac reconstruction from cine magnetic resonance imaging (cMRI) is crucial for improved cardiovascular disease diagnosis and understanding of the heart's motion. However, current cardiac MRI-based reconstruction technology used in clinical settings is 2D with limited through-plane resolution, resulting in low-quality reconstructed cardiac volumes. To better reconstruct 3D cardiac volumes from sparse 2D image stacks, we propose a morphology-guided diffusion model for 3D cardiac volume reconstruction, DMCVR, that synthesizes high-resolution 2D images and corresponding 3D reconstructed volumes. Our method outperforms previous approaches by conditioning the cardiac morphology on the generative model, eliminating the time-consuming iterative optimization process of the latent code, and improving generation quality. The learned latent spaces provide global semantics, local cardiac morphology and details of each 2D cMRI slice with highly interpretable value to reconstruct 3D cardiac shape. Our experiments show that DMCVR is highly effective in several aspects, such as 2D generation and 3D reconstruction performance. With DMCVR, we can produce high-resolution 3D cardiac MRI reconstructions, surpassing current techniques. Our proposed framework has great potential for improving the accuracy of cardiac disease diagnosis and treatment planning. Code can be accessed at https://github.com/hexiaoxiao-cs/DMCVR.
    摘要 从电影心脏磁共振成像(cine MRI)中进行精确的3D心脏重建,对改进心血管疾病诊断与理解心脏运动至关重要。然而,临床当前使用的基于心脏MRI的重建技术为2D,层间(through-plane)分辨率有限,导致重建的心脏体积质量较低。为从稀疏的2D图像堆栈中更好地重建3D心脏体积,我们提出形态引导的扩散模型DMCVR,用于3D心脏体积重建,合成高分辨率2D图像及对应的3D重建体积。我们的方法通过在生成模型上施加心脏形态条件,省去了潜码耗时的迭代优化过程,并提升了生成质量,从而优于以往方法。学习得到的潜在空间提供了全局语义、局部心脏形态以及每张2D cine MRI切片的细节,具有高度可解释的价值,可用于重建3D心脏形状。实验表明,DMCVR在2D生成与3D重建性能等多个方面都非常有效。借助DMCVR,我们能够生成超越现有技术的高分辨率3D心脏MRI重建。所提框架在提升心脏疾病诊断与治疗规划准确性方面具有巨大潜力。代码见 https://github.com/hexiaoxiao-cs/DMCVR。

A review of technical factors to consider when designing neural networks for semantic segmentation of Earth Observation imagery

  • paper_url: http://arxiv.org/abs/2308.09221
  • repo_url: None
  • paper_authors: Sam Khallaghi, J. Ronald Eastman, Lyndon D. Estes
  • for: 对遥感图像语义分割(分类)任务所涉及的技术因素进行全面综述,帮助研究人员与从业者更好地理解该领域中神经网络设计的相关因素。
  • methods: 详细介绍CNN、RNN、GAN与Transformer模型,讨论这些神经网络家族中的典型设计模式及其对语义分割的影响;同时涵盖常见的预处理技术(包括图像归一化与切片),以及处理训练样本类别不均衡的策略,和应对数据有限的技术,如数据增强、迁移学习与领域自适应。
  • results: 本文提供了一个全面和最新的理解,涵盖了遥感图像semantic segmentation tasks中 neural network 设计因素的技术和数据相关因素。
    Abstract Semantic segmentation (classification) of Earth Observation imagery is a crucial task in remote sensing. This paper presents a comprehensive review of technical factors to consider when designing neural networks for this purpose. The review focuses on Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Generative Adversarial Networks (GANs), and transformer models, discussing prominent design patterns for these ANN families and their implications for semantic segmentation. Common pre-processing techniques for ensuring optimal data preparation are also covered. These include methods for image normalization and chipping, as well as strategies for addressing data imbalance in training samples, and techniques for overcoming limited data, including augmentation techniques, transfer learning, and domain adaptation. By encompassing both the technical aspects of neural network design and the data-related considerations, this review provides researchers and practitioners with a comprehensive and up-to-date understanding of the factors involved in designing effective neural networks for semantic segmentation of Earth Observation imagery.
    摘要 对地观测图像的语义分割(分类)是遥感领域中的一项关键任务。本文对为此设计神经网络时需要考虑的技术因素进行了全面综述,重点讨论卷积神经网络(CNN)、循环神经网络(RNN)、生成对抗网络(GAN)与Transformer模型,以及这些神经网络家族的主要设计模式及其对语义分割的影响。文中还涵盖了确保数据准备最优的常见预处理技术,包括图像归一化与切片方法、应对训练样本类别不均衡的策略,以及克服数据有限的技术,如数据增强、迁移学习与领域自适应。通过同时覆盖神经网络设计的技术层面与数据相关的考虑因素,本综述为研究人员与从业者提供了对地观测图像语义分割中有效神经网络设计因素的全面且最新的理解。
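Two of the preprocessing steps discussed, band-wise normalization and chipping a large scene into fixed-size training tiles, reduce to a few lines of numpy. The tile size and per-band standardization below are illustrative choices, not recommendations from the review.

```python
import numpy as np

def normalize_bands(img):
    """img: (H, W, B) raster. Standardize each spectral band independently."""
    mean = img.mean(axis=(0, 1), keepdims=True)
    std = img.std(axis=(0, 1), keepdims=True) + 1e-8
    return (img - mean) / std

def chip_image(img, size=256, stride=256):
    """Cut a large Earth Observation scene into fixed-size training chips."""
    chips = []
    for y in range(0, img.shape[0] - size + 1, stride):
        for x in range(0, img.shape[1] - size + 1, stride):
            chips.append(img[y:y + size, x:x + size])
    return (np.stack(chips) if chips
            else np.empty((0, size, size, img.shape[2])))

# usage: tiles = chip_image(normalize_bands(scene))
```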

LibreFace: An Open-Source Toolkit for Deep Facial Expression Analysis

  • paper_url: http://arxiv.org/abs/2308.10713
  • repo_url: None
  • paper_authors: Di Chang, Yufeng Yin, Zongjian Li, Minh Tran, Mohammad Soleymani
  • for: 这个论文主要目的是提出一个开源的人工智能工具kit,用于实时和离线的表情分析。
  • methods: 该工具包使用深度学习模型,包括面部动作单元(AU)检测、AU强度估计与表情识别;具体采用大规模预训练网络、特征级知识蒸馏与任务特定微调等技术,以准确分析面部表情。
  • results: 在AU强度估计方面,在DISFA上取得0.63的皮尔逊相关系数(PCC),比OpenFace 2.0高7%,同时保持高效推理,速度为OpenFace 2.0的两倍;尽管模型紧凑,在AffecNet、FFHQ与RAFDB等数据集上也能与最先进的表情分析方法相媲美。
    Abstract Facial expression analysis is an important tool for human-computer interaction. In this paper, we introduce LibreFace, an open-source toolkit for facial expression analysis. This open-source toolbox offers real-time and offline analysis of facial behavior through deep learning models, including facial action unit (AU) detection, AU intensity estimation, and facial expression recognition. To accomplish this, we employ several techniques, including the utilization of a large-scale pre-trained network, feature-wise knowledge distillation, and task-specific fine-tuning. These approaches are designed to effectively and accurately analyze facial expressions by leveraging visual information, thereby facilitating the implementation of real-time interactive applications. In terms of Action Unit (AU) intensity estimation, we achieve a Pearson Correlation Coefficient (PCC) of 0.63 on DISFA, which is 7% higher than the performance of OpenFace 2.0 while maintaining highly-efficient inference that runs two times faster than OpenFace 2.0. Despite being compact, our model also demonstrates competitive performance to state-of-the-art facial expression analysis methods on AffecNet, FFHQ, and RAFDB. Our code will be released at https://github.com/ihp-lab/LibreFace
    摘要 面部表情分析是人机交互的重要工具。本文介绍LibreFace,一个用于面部表情分析的开源工具包。该开源工具箱通过深度学习模型提供实时与离线的面部行为分析,包括面部动作单元(AU)检测、AU强度估计与表情识别。为此,我们采用了多项技术,包括利用大规模预训练网络、特征级知识蒸馏与任务特定微调。这些方法旨在借助视觉信息有效而准确地分析面部表情,从而便于实现实时交互应用。在动作单元(AU)强度估计方面,我们在DISFA上取得0.63的皮尔逊相关系数(PCC),比OpenFace 2.0的性能高7%,同时保持高效推理,运行速度为OpenFace 2.0的两倍。尽管模型紧凑,它在AffecNet、FFHQ与RAFDB上也取得了与最先进表情分析方法相当的性能。代码将发布于 https://github.com/ihp-lab/LibreFace。

TinyProp – Adaptive Sparse Backpropagation for Efficient TinyML On-device Learning

  • paper_url: http://arxiv.org/abs/2308.09201
  • repo_url: None
  • paper_authors: Marcus Rüb, Daniel Maier, Daniel Mueller-Gritschneder, Axel Sikora
  • for: 提出一种可在低功耗微控制器单元(MCU)上进行设备端学习或微调深度神经网络的训练方法,以降低训练时间与计算负载。
  • methods: 提出名为TinyProp的稀疏反向传播方法,在设备端训练的每个训练步骤中动态调整反向传播比例,兼顾训练效率与精度。
  • results: TinyProp比非稀疏训练平均快5倍,精度损失平均仅1%;与现有的静态稀疏反向传播算法相比,平均快2.9倍,精度损失平均降低6%。
    Abstract Training deep neural networks using backpropagation is very memory and computationally intensive. This makes it difficult to run on-device learning or fine-tune neural networks on tiny, embedded devices such as low-power micro-controller units (MCUs). Sparse backpropagation algorithms try to reduce the computational load of on-device learning by training only a subset of the weights and biases. Existing approaches use a static number of weights to train. A poor choice of this so-called backpropagation ratio limits either the computational gain or can lead to severe accuracy losses. In this paper we present TinyProp, the first sparse backpropagation method that dynamically adapts the back-propagation ratio during on-device training for each training step. TinyProp induces a small calculation overhead to sort the elements of the gradient, which does not significantly impact the computational gains. TinyProp works particularly well on fine-tuning trained networks on MCUs, which is a typical use case for embedded applications. For typical datasets from three datasets MNIST, DCASE2020 and CIFAR10, we are 5 times faster compared to non-sparse training with an accuracy loss of on average 1%. On average, TinyProp is 2.9 times faster than existing, static sparse backpropagation algorithms and the accuracy loss is reduced on average by 6 % compared to a typical static setting of the back-propagation ratio.
    摘要 使用反向传播训练深度神经网络对内存与计算的消耗都很大,这使得在低功耗微控制器单元(MCU)等微型嵌入式设备上进行设备端学习或微调神经网络十分困难。稀疏反向传播算法尝试通过只训练部分权重与偏置来降低设备端学习的计算负载,但现有方法使用静态的权重数量进行训练:这一所谓的反向传播比例若选取不当,要么限制计算收益,要么导致严重的精度损失。本文提出TinyProp,首个在设备端训练中为每个训练步骤动态调整反向传播比例的稀疏反向传播方法。TinyProp需要对梯度元素进行排序,由此引入的少量计算开销不会显著影响计算收益。TinyProp特别适合在MCU上微调已训练的网络,这是嵌入式应用的典型场景。在MNIST、DCASE2020与CIFAR10三个数据集上,我们比非稀疏训练快5倍,平均精度损失仅1%;与现有的静态稀疏反向传播算法相比,TinyProp平均快2.9倍,且相比典型的静态反向传播比例设置,精度损失平均降低6%。
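The core mechanism, sorting gradient elements and backpropagating only the largest fraction with a ratio that can change every step, can be sketched with a custom autograd function. The adaptation rule for the ratio is TinyProp's contribution and is left as a placeholder here; the top-k selection is a generic sparse-backprop sketch.

```python
import torch

class SparseBackprop(torch.autograd.Function):
    """Identity in the forward pass; in the backward pass only the top-k
    gradient elements (by magnitude) are propagated, the rest are zeroed."""
    @staticmethod
    def forward(ctx, x, ratio):
        ctx.ratio = ratio          # fraction of gradient elements to keep
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        flat = grad.flatten()
        k = max(1, int(ctx.ratio * flat.numel()))
        idx = flat.abs().topk(k).indices        # sort/select gradient elements
        sparse = torch.zeros_like(flat)
        sparse[idx] = flat[idx]
        return sparse.view_as(grad), None       # no gradient w.r.t. the ratio

# usage inside a model, with `current_ratio` updated per training step:
#   y = SparseBackprop.apply(layer_output, current_ratio)
```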

FedPerfix: Towards Partial Model Personalization of Vision Transformers in Federated Learning

  • paper_url: http://arxiv.org/abs/2308.09160
  • repo_url: https://github.com/imguangyu/fedperfix
  • paper_authors: Guangyu Sun, Matias Mendieta, Jun Luo, Shandong Wu, Chen Chen
  • for: Improving the efficiency and degree of personalization of federated learning in heterogeneous data environments.
  • methods: Investigates where and how to partially personalize a Vision Transformer (ViT) model and proposes a plugin-based approach, FedPerfix, that transfers information from the aggregated model to the local client as a personalization (a sketch follows this entry).
  • results: Experiments on the CIFAR-100, OrganAMNIST, and Office-Home datasets show that the method improves model performance compared with several state-of-the-art personalized federated learning methods.
    Abstract Personalized Federated Learning (PFL) represents a promising solution for decentralized learning in heterogeneous data environments. Partial model personalization has been proposed to improve the efficiency of PFL by selectively updating local model parameters instead of aggregating all of them. However, previous work on partial model personalization has mainly focused on Convolutional Neural Networks (CNNs), leaving a gap in understanding how it can be applied to other popular models such as Vision Transformers (ViTs). In this work, we investigate where and how to partially personalize a ViT model. Specifically, we empirically evaluate the sensitivity to data distribution of each type of layer. Based on the insights that the self-attention layer and the classification head are the most sensitive parts of a ViT, we propose a novel approach called FedPerfix, which leverages plugins to transfer information from the aggregated model to the local client as a personalization. Finally, we evaluate the proposed approach on CIFAR-100, OrganAMNIST, and Office-Home datasets and demonstrate its effectiveness in improving the model's performance compared to several advanced PFL methods.
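
FedPerfix builds on the observation that the self-attention layers and the classification head are the most data-distribution-sensitive parts of a ViT, so those are the parts worth personalizing. The sketch below shows one plausible way to split a ViT's parameters into a server-aggregated set and a locally kept personalized set; the name-matching rule follows common ViT parameter naming (e.g., in timm) and is an assumption, and the paper's actual plugin modules are omitted.

```python
import torch.nn as nn

PERSONAL_KEYS = ("attn", "head")  # assumed: self-attention blocks + classifier head

def split_parameters(model: nn.Module):
    """Partition parameters into server-aggregated vs. locally personalized."""
    shared, personal = {}, {}
    for name, p in model.named_parameters():
        (personal if any(k in name for k in PERSONAL_KEYS) else shared)[name] = p
    return shared, personal

def load_aggregated(model: nn.Module, global_state: dict):
    """Overwrite only the shared parameters with the server aggregate,
    leaving the personalized attention/head parameters untouched."""
    own = model.state_dict()
    for name, tensor in global_state.items():
        if not any(k in name for k in PERSONAL_KEYS):
            own[name] = tensor
    model.load_state_dict(own)
```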

Semi-sparsity Priors for Image Structure Analysis and Extraction

  • paper_url: http://arxiv.org/abs/2308.09141
  • repo_url: None
  • paper_authors: Junqing Huang, Haihui Wang, Michael Ruzhansky
  • for: Image structure-texture decomposition and image structure analysis and extraction.
  • methods: A generalized semi-sparsity regularization framework that decouples the underlying image structures from complicated textural backgrounds, solved with an efficient ADMM-based numerical scheme (a generic ADMM sketch follows this entry).
  • results: Preserves image structures without introducing staircase artifacts in polynomial-smoothing surfaces, handles textures with strong oscillatory patterns, and produces decomposition results comparable or superior to cutting-edge methods.
    Abstract Image structure-texture decomposition is a long-standing and fundamental problem in both image processing and computer vision fields. In this paper, we propose a generalized semi-sparse regularization framework for image structural analysis and extraction, which allows us to decouple the underlying image structures from complicated textural backgrounds. Combining with different textural analysis models, such a regularization receives favorable properties differing from many traditional methods. We demonstrate that it is not only capable of preserving image structures without introducing notorious staircase artifacts in polynomial-smoothing surfaces but is also applicable for decomposing image textures with strong oscillatory patterns. Moreover, we also introduce an efficient numerical solution based on an alternating direction method of multipliers (ADMM) algorithm, which gives rise to a simple and maneuverable way for image structure-texture decomposition. The versatility of the proposed method is finally verified by a series of experimental results with the capability of producing comparable or superior image decomposition results against cutting-edge methods.
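
The paper's numerical solver is an ADMM scheme. The paper's exact operators are not reproduced here, but the following self-contained sketch shows the alternating structure on a simpler stand-in problem, min_x 0.5*||x - b||^2 + lambda*||Dx||_1 with D a first-difference operator; the semi-sparsity prior itself penalizes higher-order differences.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_tv_1d(b, lam=1.0, rho=1.0, iters=200):
    """Generic ADMM split for min_x 0.5*||x-b||^2 + lam*||Dx||_1,
    with D the first-difference operator (a stand-in for the paper's
    semi-sparsity prior)."""
    n = len(b)
    D = np.diff(np.eye(n), axis=0)            # (n-1, n) difference matrix
    A = np.eye(n) + rho * D.T @ D             # x-update system matrix
    z = np.zeros(n - 1)
    u = np.zeros(n - 1)
    for _ in range(iters):
        x = np.linalg.solve(A, b + rho * D.T @ (z - u))  # smooth subproblem
        z = soft_threshold(D @ x, lam / rho)              # sparse subproblem
        u = u + D @ x - z                                 # dual update
    return x

signal = np.concatenate([np.zeros(50), np.ones(50)]) + 0.1 * np.random.randn(100)
denoised = admm_tv_1d(signal, lam=2.0)
```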

The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation

  • paper_url: http://arxiv.org/abs/2308.09139
  • repo_url: https://github.com/giaczara/dallv
  • paper_authors: Giacomo Zara, Alessandro Conti, Subhankar Roy, Stéphane Lathuilière, Paolo Rota, Elisa Ricci
  • for: Adapting an action recognition model trained on a labelled source dataset to an unlabelled target dataset without access to the source data (source-free video unsupervised domain adaptation, SFVUDA).
  • methods: Exploits "web-supervision" from Large Language-Vision Models (LLVMs), whose rich world prior is surprisingly robust to domain shift, and proposes Domain Adaptation with Large Language-Vision models (DALL-V), which distills the LLVM prior and complementary source-model information into a student network tailored to the target (a distillation sketch follows this entry).
  • results: Despite its simplicity, DALL-V achieves significant improvements over state-of-the-art SFVUDA methods.
    Abstract Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists in adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset, without accessing the actual source data. The previous approaches have attempted to address SFVUDA by leveraging self-supervision (e.g., enforcing temporal consistency) derived from the target data itself. In this work, we take an orthogonal approach by exploiting "web-supervision" from Large Language-Vision Models (LLVMs), driven by the rationale that LLVMs contain a rich world prior surprisingly robust to domain-shift. We showcase the unreasonable effectiveness of integrating LLVMs for SFVUDA by devising an intuitive and parameter-efficient method, which we name Domain Adaptation with Large Language-Vision models (DALL-V), that distills the world prior and complementary source model information into a student network tailored for the target. Despite the simplicity, DALL-V achieves significant improvement over state-of-the-art SFVUDA methods.
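
DALL-V distills an ensemble of the LLVM's zero-shot predictions and the source model's predictions into a target-tailored student. A minimal PyTorch sketch of that distillation step is given below; the blending weight `alpha`, the temperature, and the plain KL objective are assumptions standing in for the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def dallv_targets(clip_logits, source_logits, alpha=0.5, T=2.0):
    """Blend the LLVM's zero-shot logits with the source model's logits
    into a soft pseudo-label for an unlabeled target clip.
    alpha and T are assumed hyperparameters."""
    p_clip = F.softmax(clip_logits / T, dim=-1)
    p_src = F.softmax(source_logits / T, dim=-1)
    return alpha * p_clip + (1 - alpha) * p_src

def distill_step(student_logits, clip_logits, source_logits):
    """KL-distill the ensembled prior into the target-tailored student."""
    with torch.no_grad():
        target = dallv_targets(clip_logits, source_logits)
    return F.kl_div(F.log_softmax(student_logits, dim=-1), target,
                    reduction="batchmean")

# toy usage on a batch of 4 videos over 10 action classes
student_logits = torch.randn(4, 10, requires_grad=True)
clip_logits, source_logits = torch.randn(4, 10), torch.randn(4, 10)
loss = distill_step(student_logits, clip_logits, source_logits)
loss.backward()
```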

ICAR: Image-based Complementary Auto Reasoning

  • paper_url: http://arxiv.org/abs/2308.09119
  • repo_url: None
  • paper_authors: Xijun Wang, Anqi Liang, Junbang Liang, Ming Lin, Yu Lou, Shan Yang
  • for: addresses the challenging task of scene-aware complementary item retrieval (CIR), which requires generating a set of compatible items across domains.
  • methods: proposes a visual compatibility concept based on similarity and complementarity, and a category-aware Flexible Bidirectional Transformer (FBT) framework for visual “scene-based set compatibility reasoning” with cross-domain visual similarity input and auto-regressive complementary item generation.
  • results: achieves up to 5.3% and 9.6% in FITB score and 22.3% and 31.8% SFID improvement on fashion and furniture, respectively, compared with state-of-the-art methods.
    Abstract Scene-aware Complementary Item Retrieval (CIR) is a challenging task which requires generating a set of compatible items across domains. Due to the subjectivity, it is difficult to set up a rigorous standard for both data collection and learning objectives. To address this challenging task, we propose a visual compatibility concept, composed of similarity (resembling in color, geometry, texture, etc.) and complementarity (different items, like table vs. chair, completing a group). Based on this notion, we propose a compatibility learning framework, a category-aware Flexible Bidirectional Transformer (FBT), for visual "scene-based set compatibility reasoning" with cross-domain visual similarity input and auto-regressive complementary item generation. The FBT consists of an encoder with flexible masking, a category prediction arm, and an auto-regressive visual embedding prediction arm. The inputs to the FBT are cross-domain visual similarity invariant embeddings, making this framework quite generalizable. Furthermore, our proposed FBT model learns the inter-object compatibility from a large set of scene images in a self-supervised way. Compared with the SOTA methods, this approach achieves up to 5.3% and 9.6% in FITB score and 22.3% and 31.8% SFID improvement on fashion and furniture, respectively.

JPEG Quantized Coefficient Recovery via DCT Domain Spatial-Frequential Transformer

  • paper_url: http://arxiv.org/abs/2308.09110
  • repo_url: None
  • paper_authors: Mingyu Ouyang, Zhenzhong Chen
  • for: Improving the restoration of JPEG-compressed images while handling a wide range of compression quality factors.
  • methods: Proposes a DCT-domain spatial-frequential Transformer with a dual-branch architecture that captures both spatial and frequential correlations, together with quantization matrix embedding and a luminance-chrominance alignment head.
  • results: Outperforms state-of-the-art JPEG artifact removal methods.
    Abstract JPEG compression adopts the quantization of Discrete Cosine Transform (DCT) coefficients for effective bit-rate reduction, whilst the quantization could lead to a significant loss of important image details. Recovering compressed JPEG images in the frequency domain has attracted more and more attention recently, in addition to numerous restoration approaches developed in the pixel domain. However, the current DCT domain methods typically suffer from limited effectiveness in handling a wide range of compression quality factors, or fall short in recovering sparse quantized coefficients and the components across different colorspace. To address these challenges, we propose a DCT domain spatial-frequential Transformer, named as DCTransformer. Specifically, a dual-branch architecture is designed to capture both spatial and frequential correlations within the collocated DCT coefficients. Moreover, we incorporate the operation of quantization matrix embedding, which effectively allows our single model to handle a wide range of quality factors, and a luminance-chrominance alignment head that produces a unified feature map to align different-sized luminance and chrominance components. Our proposed DCTransformer outperforms the current state-of-the-art JPEG artifact removal techniques, as demonstrated by our extensive experiments.

Hyperbolic Face Anti-Spoofing

  • paper_url: http://arxiv.org/abs/2308.09107
  • repo_url: None
  • paper_authors: Shuangpeng Han, Rizhao Cai, Yawen Cui, Zitong Yu, Yongjian Hu, Alex Kot
  • for: Improving the security of face recognition systems by learning generalized face anti-spoofing (FAS) models that detect presentation attacks unseen during training.
  • methods: Projects feature embeddings into the Poincaré ball and classifies them with a cascaded hyperbolic binary logistic regression layer; applies hyperbolic contrastive learning to the bonafide class only, while relaxing the constraints on diverse spoofing attacks; and proposes a new feature clipping method to alleviate vanishing gradients in hyperbolic space (a sketch follows this entry). A multimodal variant combines Euclidean multimodal feature decomposition with hyperbolic multimodal feature fusion and classification.
  • results: Experiments on three benchmark datasets (WMCA, PADISI-Face, and SiW-M) with diverse attack types show significant improvements over Euclidean baselines on unseen attack detection, and the framework also generalizes well on four further benchmarks (MSU-MFSD, IDIAP REPLAY-ATTACK, CASIA-FASD, and OULU-NPU) with a limited number of attack types.
    Abstract Learning generalized face anti-spoofing (FAS) models against presentation attacks is essential for the security of face recognition systems. Previous FAS methods usually encourage models to extract discriminative features, of which the distances within the same class (bonafide or attack) are pushed close while those between bonafide and attack are pulled away. However, these methods are designed based on Euclidean distance, which lacks generalization ability for unseen attack detection due to poor hierarchy embedding ability. According to the evidence that different spoofing attacks are intrinsically hierarchical, we propose to learn richer hierarchical and discriminative spoofing cues in hyperbolic space. Specifically, for unimodal FAS learning, the feature embeddings are projected into the Poincaré ball, and then the hyperbolic binary logistic regression layer is cascaded for classification. To further improve generalization, we conduct hyperbolic contrastive learning for the bonafide only while relaxing the constraints on diverse spoofing attacks. To alleviate the vanishing gradient problem in hyperbolic space, a new feature clipping method is proposed to enhance the training stability of hyperbolic models. Besides, we further design a multimodal FAS framework with Euclidean multimodal feature decomposition and hyperbolic multimodal feature fusion & classification. Extensive experiments on three benchmark datasets (i.e., WMCA, PADISI-Face, and SiW-M) with diverse attack types demonstrate that the proposed method can bring significant improvement compared to the Euclidean baselines on unseen attack detection. In addition, the proposed framework is also generalized well on four benchmark datasets (i.e., MSU-MFSD, IDIAP REPLAY-ATTACK, CASIA-FASD, and OULU-NPU) with a limited number of attack types.
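
Two ingredients of the method are easy to illustrate: projecting Euclidean embeddings onto the Poincaré ball via the exponential map at the origin, and clipping feature norms so points stay away from the ball boundary where gradients vanish. The sketch below uses the standard exponential-map formula; the clipping scheme shown is a simple variant, not necessarily the paper's exact rule.

```python
import torch

def clip_features(x, max_norm=1.0):
    """Bound the Euclidean norm before projecting, which keeps embeddings
    away from the ball boundary where gradients vanish (a simple variant
    of the paper's feature clipping)."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    scale = torch.clamp(max_norm / norm, max=1.0)
    return x * scale

def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball with curvature c."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

feats = torch.randn(8, 128) * 5.0           # backbone embeddings
ball_pts = expmap0(clip_features(feats, 0.9))
print(ball_pts.norm(dim=-1).max())           # strictly inside the unit ball
```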

Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation

  • paper_url: http://arxiv.org/abs/2308.09105
  • repo_url: None
  • paper_authors: Shengcao Cao, Mengtian Li, James Hays, Deva Ramanan, Yi-Xiong Wang, Liang-Yan Gui
  • for: Improving detection and instance segmentation accuracy in resource-constrained vision systems by using knowledge distillation to boost lightweight vision models.
  • methods: Proposes a simple yet surprisingly effective sequential approach that progressively transfers the knowledge of a set of teacher detectors to a single lightweight student (a sketch follows this entry).
  • results: Successfully distills knowledge from Transformer-based teacher detectors into convolution-based students, raising ResNet-50-based RetinaNet from 36.5% to 42.0% AP and Mask R-CNN from 38.2% to 42.5% AP on the MS COCO benchmark.
    Abstract Resource-constrained perception systems such as edge computing and vision-for-robotics require vision models to be both accurate and lightweight in computation and memory usage. While knowledge distillation is a proven strategy to enhance the performance of lightweight classification models, its application to structured outputs like object detection and instance segmentation remains a complicated task, due to the variability in outputs and complex internal network modules involved in the distillation process. In this paper, we propose a simple yet surprisingly effective sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student. To distill knowledge from a highly accurate but complex teacher model, we construct a sequence of teachers to help the student gradually adapt. Our progressive strategy can be easily combined with existing detection distillation mechanisms to consistently maximize student performance in various settings. To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students, and unprecedentedly boost the performance of ResNet-50 based RetinaNet from 36.5% to 42.0% AP and Mask R-CNN from 38.2% to 42.5% AP on the MS COCO benchmark.
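
The core idea is a curriculum of teachers rather than a single distillation pass. The sketch below shows the sequential loop under simplifying assumptions: models expose a hypothetical `features(images)` method, and a plain feature-map MSE stands in for whichever detection distillation loss is plugged in.

```python
import torch
import torch.nn.functional as F

def progressive_distillation(student, teachers, loader, opt,
                             epochs_per_teacher=1):
    """Distill from a sequence of teachers (e.g., ordered from weakest to
    strongest) so the student adapts gradually. `student` and each teacher
    are assumed to expose a `features(images)` method returning tensors of
    matching shape."""
    for teacher in teachers:
        teacher.eval()
        for _ in range(epochs_per_teacher):
            for images, _ in loader:
                with torch.no_grad():
                    t_feat = teacher.features(images)
                s_feat = student.features(images)
                loss = F.mse_loss(s_feat, t_feat)  # stand-in distillation loss
                opt.zero_grad()
                loss.backward()
                opt.step()
    return student
```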

ImGeoNet: Image-induced Geometry-aware Voxel Representation for Multi-view 3D Object Detection

  • paper_url: http://arxiv.org/abs/2308.09098
  • repo_url: None
  • paper_authors: Tao Tu, Shun-Po Chuang, Yu-Lun Liu, Cheng Sun, Ke Zhang, Donna Roy, Cheng-Hao Kuo, Min Sun
  • for: Improving the accuracy and efficiency of multi-view image-based 3D object detection.
  • methods: Models 3D space with an image-induced geometry-aware voxel representation that learns to induce geometry from multi-view images, alleviating confusion from free-space voxels; only multi-view images are required at inference time.
  • results: Outperforms the state-of-the-art multi-view image-based method ImVoxelNet on three indoor datasets (ARKitScenes, ScanNetV2, and ScanNet200), and shows strong data efficiency by matching ImVoxelNet's 100-view results while using only 40 views.
    Abstract We propose ImGeoNet, a multi-view image-based 3D object detection framework that models a 3D space by an image-induced geometry-aware voxel representation. Unlike previous methods which aggregate 2D features into 3D voxels without considering geometry, ImGeoNet learns to induce geometry from multi-view images to alleviate the confusion arising from voxels of free space, and during the inference phase, only images from multiple views are required. Besides, a powerful pre-trained 2D feature extractor can be leveraged by our representation, leading to a more robust performance. To evaluate the effectiveness of ImGeoNet, we conduct quantitative and qualitative experiments on three indoor datasets, namely ARKitScenes, ScanNetV2, and ScanNet200. The results demonstrate that ImGeoNet outperforms the current state-of-the-art multi-view image-based method, ImVoxelNet, on all three datasets in terms of detection accuracy. In addition, ImGeoNet shows great data efficiency by achieving results comparable to ImVoxelNet with 100 views while utilizing only 40 views. Furthermore, our studies indicate that our proposed image-induced geometry-aware representation can enable image-based methods to attain superior detection accuracy than the seminal point cloud-based method, VoteNet, in two practical scenarios: (1) scenarios where point clouds are sparse and noisy, such as in ARKitScenes, and (2) scenarios involve diverse object classes, particularly classes of small objects, as in the case in ScanNet200.

Edit Temporal-Consistent Videos with Image Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.09091
  • repo_url: None
  • paper_authors: Yuanzhi Wang, Yong Li, Xin Liu, Anbo Dai, Antoni Chan, Zhen Cui
  • for: Proposes a robust text-guided video editing method that mitigates the temporal inconsistencies of image-diffusion-based editing.
  • methods: Temporal-Consistent Video Editing (TCVE) uses a pretrained 2D Unet for spatial content manipulation and a dedicated temporal Unet to capture the temporal coherence of the input video sequence; a cohesive joint spatial-temporal modeling unit interconnects the two, improving temporal consistency while preserving editing capability.
  • results: Experiments show that TCVE achieves state-of-the-art performance in both video temporal consistency and video editing capability, surpassing existing benchmarks in the field.
    Abstract Large-scale text-to-image (T2I) diffusion models have been extended for text-guided video editing, yielding impressive zero-shot video editing performance. Nonetheless, the generated videos usually show spatial irregularities and temporal inconsistencies as the temporal characteristics of videos have not been faithfully modeled. In this paper, we propose an elegant yet effective Temporal-Consistent Video Editing (TCVE) method, to mitigate the temporal inconsistency challenge for robust text-guided video editing. In addition to the utilization of a pretrained 2D Unet for spatial content manipulation, we establish a dedicated temporal Unet architecture to faithfully capture the temporal coherence of the input video sequences. Furthermore, to establish coherence and interrelation between the spatial-focused and temporal-focused components, a cohesive joint spatial-temporal modeling unit is formulated. This unit effectively interconnects the temporal Unet with the pretrained 2D Unet, thereby enhancing the temporal consistency of the generated video output while simultaneously preserving the capacity for video content manipulation. Quantitative experimental results and visualization results demonstrate that TCVE achieves state-of-the-art performance in both video temporal consistency and video editing capability, surpassing existing benchmarks in the field.

Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries

  • paper_url: http://arxiv.org/abs/2308.09089
  • repo_url: None
  • paper_authors: Julia Wilkins, Justin Salamon, Magdalena Fuentes, Juan Pablo Bello, Oriol Nieto
  • for: Retrieving high-quality sound effects (SFX) that match specific moments in a video, using a video frame directly as the query instead of text metadata.
  • methods: Uses large language models and foundational vision-language models to bridge high-quality audio and video via language, yielding a scalable automatic audio-visual data curation pipeline; pre-trained audio and visual encoders are then used to train a contrastive-learning-based retrieval system (a sketch follows this entry).
  • results: Significantly outperforms baselines trained on in-the-wild data for HQ SFX retrieval, generalizes well from clean to in-the-wild data (outperforming the baselines on a dataset of YouTube videos), and is preferred by users over the baseline 67% of the time.
    Abstract Finding the right sound effects (SFX) to match moments in a video is a difficult and time-consuming task, and relies heavily on the quality and completeness of text metadata. Retrieving high-quality (HQ) SFX using a video frame directly as the query is an attractive alternative, removing the reliance on text metadata and providing a low barrier to entry for non-experts. Due to the lack of HQ audio-visual training data, previous work on audio-visual retrieval relies on YouTube (in-the-wild) videos of varied quality for training, where the audio is often noisy and the video of amateur quality. As such it is unclear whether these systems would generalize to the task of matching HQ audio to production-quality video. To address this, we propose a multimodal framework for recommending HQ SFX given a video frame by (1) leveraging large language models and foundational vision-language models to bridge HQ audio and video to create audio-visual pairs, resulting in a highly scalable automatic audio-visual data curation pipeline; and (2) using pre-trained audio and visual encoders to train a contrastive learning-based retrieval system. We show that our system, trained using our automatic data curation pipeline, significantly outperforms baselines trained on in-the-wild data on the task of HQ SFX retrieval for video. Furthermore, while the baselines fail to generalize to this task, our system generalizes well from clean to in-the-wild data, outperforming the baselines on a dataset of YouTube videos despite only being trained on the HQ audio-visual pairs. A user study confirms that people prefer SFX retrieved by our system over the baseline 67% of the time both for HQ and in-the-wild data. Finally, we present ablations to determine the impact of model and data pipeline design choices on downstream retrieval performance. Please visit our project website to listen to and view our SFX retrieval results.
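
The retrieval system is trained contrastively on automatically curated (audio, frame) pairs. A standard symmetric InfoNCE objective of the kind commonly used for such cross-modal retrieval is sketched below; the actual loss, temperature, and encoders in the paper may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, frame_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (SFX audio, video frame)
    embeddings; matched pairs sit on the diagonal of the similarity matrix."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(frame_emb, dim=-1)
    logits = a @ v.t() / temperature
    labels = torch.arange(len(a))
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

def rank_sfx(frame_emb, library_embs):
    """Retrieval at test time: rank the SFX library by similarity
    to a single query frame embedding."""
    sims = F.normalize(library_embs, dim=-1) @ F.normalize(frame_emb, dim=-1)
    return torch.argsort(sims, descending=True)
```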

MovePose: A High-performance Human Pose Estimation Algorithm on Mobile and Edge Devices

  • paper_url: http://arxiv.org/abs/2308.09084
  • repo_url: None
  • paper_authors: Dongyang Yu, Haoyue Zhang, Zhirui Zhou, Wangpeng An, Yanhong Yang
  • for: Providing a high-accuracy, real-time human pose estimation algorithm tailored to CPU-based mobile devices.
  • methods: MovePose is an optimized lightweight convolutional neural network that combines three techniques, deconvolution, large-kernel convolution, and coordinate classification, to improve accuracy and speed.
  • results: Achieves a Mean Average Precision (mAP) of 67.7 on the COCO validation set, runs at 69+ fps on an Intel i9-10920x CPU and 452+ fps on an NVIDIA RTX3090 GPU, and exceeds 11 fps on an Android phone.
    Abstract We present MovePose, an optimized lightweight convolutional neural network designed specifically for real-time body pose estimation on CPU-based mobile devices. The current solutions do not provide satisfactory accuracy and speed for human posture estimation, and MovePose addresses this gap. It aims to maintain real-time performance while improving the accuracy of human posture estimation for mobile devices. The network produces 17 keypoints for each individual at a rate exceeding 11 frames per second, making it suitable for real-time applications such as fitness tracking, sign language interpretation, and advanced mobile human posture estimation. Our MovePose algorithm has attained a Mean Average Precision (mAP) score of 67.7 on the COCO validation dataset. The MovePose algorithm displayed efficiency with a performance of 69+ frames per second (fps) when run on an Intel i9-10920x CPU. Additionally, it showcased an increased performance of 452+ fps on an NVIDIA RTX3090 GPU. On an Android phone equipped with a Snapdragon 8 + 4G processor, the fps reached above 11. To enhance accuracy, we incorporated three techniques: deconvolution, large kernel convolution, and coordinate classification methods. Compared to basic upsampling, deconvolution is trainable, improves model capacity, and enhances the receptive field. Large kernel convolution strengthens these properties at a decreased computational cost. In summary, MovePose provides high accuracy and real-time performance, marking it a potential tool for a variety of applications, including those focused on mobile-side human posture estimation. The code and models for this algorithm will be made publicly accessible.

Pedestrian Environment Model for Automated Driving

  • paper_url: http://arxiv.org/abs/2308.09080
  • repo_url: None
  • paper_authors: Adrian Holzbock, Alexander Tsaregorodtsev, Vasileios Belagiannis
  • for: Providing an environment model that enables safe interaction between automated vehicles and pedestrians.
  • methods: Uses a neural-network human pose estimator on monocular camera images together with the vehicle's localization data; the extracted skeletons are tracked with a simple Hungarian-algorithm-based tracker with ego-motion compensation (a tracking sketch follows this entry), and 3D positions are obtained by aggregating data from consecutive frames together with the vehicle position.
  • results: The pedestrian environment model is demonstrated on data generated with the CARLA simulator and on the nuScenes dataset, reaching a relative position error of around 16%.
    Abstract Besides interacting correctly with other vehicles, automated vehicles should also be able to react in a safe manner to vulnerable road users like pedestrians or cyclists. For a safe interaction between pedestrians and automated vehicles, the vehicle must be able to interpret the pedestrian's behavior. Common environment models do not contain information like body poses used to understand the pedestrian's intent. In this work, we propose an environment model that includes the position of the pedestrians as well as their pose information. We only use images from a monocular camera and the vehicle's localization data as input to our pedestrian environment model. We extract the skeletal information with a neural network human pose estimator from the image. Furthermore, we track the skeletons with a simple tracking algorithm based on the Hungarian algorithm and an ego-motion compensation. To obtain the 3D information of the position, we aggregate the data from consecutive frames in conjunction with the vehicle position. We demonstrate our pedestrian environment model on data generated with the CARLA simulator and the nuScenes dataset. Overall, we reach a relative position error of around 16% on both datasets.
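
The tracking component is a simple Hungarian-algorithm association between existing skeleton tracks and new detections. A minimal sketch with SciPy's `linear_sum_assignment` follows; representing ego-motion compensation as a single 2-D pixel offset is a simplifying assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracks, detections, ego_motion, max_cost=1.0):
    """Match existing skeleton tracks to new detections with the Hungarian
    algorithm. `tracks` and `detections` are (N, K, 2) arrays of K keypoints;
    `ego_motion` is an assumed 2-D pixel offset compensating camera motion."""
    comp = np.asarray(tracks) + ego_motion           # ego-motion compensation
    cost = np.linalg.norm(
        comp[:, None] - np.asarray(detections)[None], axis=-1).mean(-1)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]

# toy usage: 3 tracked skeletons, re-detected with jitter in the next frame
tracks = np.random.rand(3, 17, 2) * 100
dets = tracks + np.random.randn(3, 17, 2)
print(associate(tracks, dets, ego_motion=np.zeros(2), max_cost=10.0))
```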

cs.AI - 2023-08-18

VALERIE22 – A photorealistic, richly metadata annotated dataset of urban environments

  • paper_url: http://arxiv.org/abs/2308.09632
  • repo_url: None
  • paper_authors: Oliver Grau, Korbinian Hagn
  • for: Studying domain-specific factors that influence the perception performance of deep neural networks (DNNs) and developing a methodology for validating DNNs for pedestrian detection in urban environments.
  • methods: The VALERIE22 dataset is generated with the VALERIE procedural tool pipeline, which provides a photorealistic sensor simulation rendered from automatically synthesized scenes; the dataset carries rich metadata from which specific scene and semantic features (such as pixel-accurate occlusion rates, positions in the scene, and distance and angle to the camera) can be extracted.
  • results: Based on performance metrics, a comparison with several other publicly available datasets shows that VALERIE22 is one of the best-performing synthetic datasets currently available in the open domain.
    Abstract The VALERIE tool pipeline is a synthetic data generator developed with the goal to contribute to the understanding of domain-specific factors that influence perception performance of DNNs (deep neural networks). This work was carried out under the German research project KI Absicherung in order to develop a methodology for the validation of DNNs in the context of pedestrian detection in urban environments for automated driving. The VALERIE22 dataset was generated with the VALERIE procedural tools pipeline providing a photorealistic sensor simulation rendered from automatically synthesized scenes. The dataset provides a uniquely rich set of metadata, allowing extraction of specific scene and semantic features (like pixel-accurate occlusion rates, positions in the scene and distance + angle to the camera). This enables a multitude of possible tests on the data and we hope to stimulate research on understanding performance of DNNs. Based on performance metric a comparison with several other publicly available datasets is provided, demonstrating that VALERIE22 is one of best performing synthetic datasets currently available in the open domain.

Minimum Coverage Sets for Training Robust Ad Hoc Teamwork Agents

  • paper_url: http://arxiv.org/abs/2308.09595
  • repo_url: None
  • paper_authors: Arrasy Rahman, Jiaxun Cui, Peter Stone
  • for: Improving the robustness of ad hoc teamwork (AHT) agents when cooperating with unseen agents and human partners, since such partners may adopt a wide variety of cooperative conventions.
  • methods: First proposes that maximizing an AHT agent's robustness requires it to emulate policies in the minimum coverage set (MCS), the set of best-response policies to any partner policy in the environment; then introduces the L-BRDiv algorithm, which solves a constrained optimization problem to jointly generate teammate policies for AHT training and approximate AHT agent policies that are members of the MCS.
  • results: Experiments show that L-BRDiv produces more robust AHT agents than state-of-the-art methods on a broader range of two-player cooperative problems without extensive hyperparameter tuning, outperforming baselines by prioritizing the discovery of distinct MCS members instead of repeatedly finding redundant policies.
    Abstract Robustly cooperating with unseen agents and human partners presents significant challenges due to the diverse cooperative conventions these partners may adopt. Existing Ad Hoc Teamwork (AHT) methods address this challenge by training an agent with a population of diverse teammate policies obtained through maximizing specific diversity metrics. However, these heuristic diversity metrics do not always maximize the agent's robustness in all cooperative problems. In this work, we first propose that maximizing an AHT agent's robustness requires it to emulate policies in the minimum coverage set (MCS), the set of best-response policies to any partner policies in the environment. We then introduce the L-BRDiv algorithm that generates a set of teammate policies that, when used for AHT training, encourage agents to emulate policies from the MCS. L-BRDiv works by solving a constrained optimization problem to jointly train teammate policies for AHT training and approximating AHT agent policies that are members of the MCS. We empirically demonstrate that L-BRDiv produces more robust AHT agents than state-of-the-art methods in a broader range of two-player cooperative problems without the need for extensive hyperparameter tuning for its objectives. Our study shows that L-BRDiv outperforms the baseline methods by prioritizing discovering distinct members of the MCS instead of repeatedly finding redundant policies.

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

  • paper_url: http://arxiv.org/abs/2308.09583
  • repo_url: https://github.com/nlpxucan/wizardlm
  • paper_authors: Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Dongmei Zhang
  • for: Enhancing the mathematical reasoning abilities of large language models (LLMs).
  • methods: Applies the proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the math domain to enhance Llama-2.
  • results: On two mathematical reasoning benchmarks, GSM8k and MATH, WizardMath surpasses all other open-source LLMs by a substantial margin; it also outperforms ChatGPT-3.5, Claude Instant-1, PaLM-2, and Minerva on GSM8k, and Text-davinci-002, PaLM-1, and GPT-3 on MATH.
    Abstract Large language models (LLMs), such as GPT-4, have shown remarkable performance in natural language processing (NLP) tasks, including challenging mathematical reasoning. However, most existing open-source models are only pre-trained on large-scale internet data and without math-related optimization. In this paper, we present WizardMath, which enhances the mathematical reasoning abilities of Llama-2, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model. WizardMath surpasses all other open-source LLMs by a substantial margin. Furthermore, our model even outperforms ChatGPT-3.5, Claude Instant-1, PaLM-2 and Minerva on GSM8k, simultaneously surpasses Text-davinci-002, PaLM-1 and GPT-3 on MATH. More details and model weights are public at https://github.com/nlpxucan/WizardLM and https://huggingface.co/WizardLM.

Investigating the Interplay between Features and Structures in Graph Learning

  • paper_url: http://arxiv.org/abs/2308.09570
  • repo_url: None
  • paper_authors: Daniele Castellana, Federico Errica
  • for: This paper aims to investigate the relationship between node features and target labels in deep graph networks, and to develop new metrics to measure the influence of node features on target labels.
  • methods: The paper uses two generative processes to build and study ad-hoc node classification tasks, and evaluates the performance of six models, including structure-agnostic ones.
  • results: The paper finds that previously defined metrics are not adequate when the assumption of a strong correlation between node features and target labels is relaxed, and presents novel research findings that could help advance our understanding of the field.
    Abstract In the past, the dichotomy between homophily and heterophily has inspired research contributions toward a better understanding of Deep Graph Networks' inductive bias. In particular, it was believed that homophily strongly correlates with better node classification predictions of message-passing methods. More recently, however, researchers pointed out that such dichotomy is too simplistic as we can construct node classification tasks where graphs are completely heterophilic but the performances remain high. Most of these works have also proposed new quantitative metrics to understand when a graph structure is useful, which implicitly or explicitly assume the correlation between node features and target labels. Our work empirically investigates what happens when this strong assumption does not hold, by formalising two generative processes for node classification tasks that allow us to build and study ad-hoc problems. To quantitatively measure the influence of the node features on the target labels, we also use a metric we call Feature Informativeness. We construct six synthetic tasks and evaluate the performance of six models, including structure-agnostic ones. Our findings reveal that previously defined metrics are not adequate when we relax the above assumption. Our contribution to the workshop aims at presenting novel research findings that could help advance our understanding of the field.
    摘要 Translated into Simplified Chinese:在过去,豪豪和非豪豪的分类理论在深度图学网络的吸引偏好上促进了研究贡献。特别是,人们认为豪豪和非豪豪之间的差异会导致更好的节点预测结果。然而,最近的研究表明,这种分类是太过简单,因为我们可以构建节点预测任务,其中图是完全不豪豪的,却可以达到高性能。大多数这些工作还提出了新的量化指标,以评估图结构的有用性,这些指标直接或间接假设节点特征和目标标签之间的相关性。我们的工作employs two generative processes for node classification tasks,allowing us to build and study ad-hoc problems。为了量化节点特征对目标标签的影响,我们还使用一个叫做特征有用性的指标。我们构建了六个 sintetic任务,并评估了六种模型,包括结构不关注的模型。我们的发现表明,先前定义的指标不适用于放宽这种假设。我们的贡献是在工作室会议上发表新的研究发现,可以帮助我们更深入地理解这个领域。

Eigenvalue-based Incremental Spectral Clustering

  • paper_url: http://arxiv.org/abs/2308.10999
  • repo_url: None
  • paper_authors: Mieczysław A. Kłopotek, Bartłmiej Starosta, Sławomir T. Wierzchoń
  • for: Proposes an incremental spectral clustering method.
  • methods: The method splits the data into manageable subsets, clusters each subset, and then merges clusters from different subsets based on the similarity of their eigenvalue spectra to form clusters of the entire set (a sketch follows this entry).
  • results: Experiments show that clustering the subsets and merging them yields clusters close to those obtained by clustering the entire dataset, making the approach useful for clustering methods whose complexity grows strongly with sample size, as in typical spectral clustering.
    Abstract Our previous experiments demonstrated that subsets of collections of (short) documents (with several hundred entries) share a common, suitably normalized eigenvalue spectrum of the combinatorial Laplacian. Based on this insight, we propose a method of incremental spectral clustering. The method consists of the following steps: (1) split the data into manageable subsets, (2) cluster each of the subsets, (3) merge clusters from different subsets based on the eigenvalue spectrum similarity to form clusters of the entire set. This method can be especially useful for clustering methods whose complexity increases strongly with the size of the data sample, as in the case of typical spectral clustering. Experiments were performed showing that in fact clustering and merging the subsets yields clusters close to those obtained by clustering the entire dataset.
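
The three steps (split, cluster each subset, merge by spectrum similarity) can be sketched compactly. The version below builds a Gaussian similarity graph per cluster and greedily merges clusters with close Laplacian eigenvalue spectra; the graph construction, the greedy merge rule, and the tolerance are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import SpectralClustering

def spectrum(points, k=5):
    """Normalized leading eigenvalues of a cluster's combinatorial Laplacian,
    built from a Gaussian similarity graph (an assumed construction)."""
    d2 = ((points[:, None] - points[None]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * np.median(d2) + 1e-9))
    evals = np.sort(np.linalg.eigvalsh(laplacian(W)))[:k]
    return evals / (np.abs(evals).max() + 1e-9)

def cluster_subset(points, n_clusters=3):
    """Step 2: ordinary spectral clustering on one manageable subset."""
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="rbf").fit_predict(points)
    return [points[labels == c] for c in range(n_clusters)]

def merge_similar(clusters, tol=0.1):
    """Step 3: greedily merge clusters (from different subsets) whose
    eigenvalue spectra are close."""
    groups = []
    for c in clusters:
        for g in groups:
            s1, s2 = spectrum(c), spectrum(g[0])
            n = min(len(s1), len(s2))
            if np.abs(s1[:n] - s2[:n]).mean() < tol:
                g.append(c)
                break
        else:
            groups.append([c])
    return [np.vstack(g) for g in groups]
```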

Adapt Your Teacher: Improving Knowledge Distillation for Exemplar-free Continual Learning

  • paper_url: http://arxiv.org/abs/2308.09544
  • repo_url: None
  • paper_authors: Filip Szatkowski, Mateusz Pyla, Marcin Przewięźlikowski, Sebastian Cygert, Bartłomiej Twardowski, Tomasz Trzciński
  • for: This work investigates exemplar-free class-incremental learning (CIL) with knowledge distillation (KD) as a regularization strategy to prevent forgetting.
  • methods: KD usually relies on exemplars of data from previous tasks, and without them it struggles because of substantial representation shifts in the teacher network on out-of-distribution data; the paper introduces Teacher Adaptation (TA), which concurrently updates the teacher network and the main model during incremental training (a sketch follows this entry).
  • results: TA integrates seamlessly with KD-based CIL approaches and consistently improves their performance across multiple exemplar-free CIL benchmarks.
    Abstract In this work, we investigate exemplar-free class incremental learning (CIL) with knowledge distillation (KD) as a regularization strategy, aiming to prevent forgetting. KD-based methods are successfully used in CIL, but they often struggle to regularize the model without access to exemplars of the training data from previous tasks. Our analysis reveals that this issue originates from substantial representation shifts in the teacher network when dealing with out-of-distribution data. This causes large errors in the KD loss component, leading to performance degradation in CIL. Inspired by recent test-time adaptation methods, we introduce Teacher Adaptation (TA), a method that concurrently updates the teacher and the main model during incremental training. Our method seamlessly integrates with KD-based CIL approaches and allows for consistent enhancement of their performance across multiple exemplar-free CIL benchmarks.
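
One way to "concurrently update the teacher" without exemplars is to let the teacher's normalization statistics keep adapting to the new task's data while its weights stay frozen. The sketch below assumes that variant (the paper's exact update rule may differ) and combines a cross-entropy term with a KD term.

```python
import torch
import torch.nn.functional as F

def ta_train_step(student, teacher, images, labels, optimizer, alpha=1.0):
    """One exemplar-free CIL step with a Teacher Adaptation-style update:
    the teacher stays in train() mode so its batch-norm running statistics
    keep adapting to the new task's distribution, while no gradients flow
    into its weights. alpha is an assumed KD weight."""
    teacher.train()                    # normalization stats adapt to new data
    with torch.no_grad():
        t_logits = teacher(images)     # teacher weights are never updated
    s_logits = student(images)
    kd = F.kl_div(F.log_softmax(s_logits, dim=-1),
                  F.softmax(t_logits, dim=-1), reduction="batchmean")
    loss = F.cross_entropy(s_logits, labels) + alpha * kd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```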

Meta-ZSDETR: Zero-shot DETR with Meta-learning

  • paper_url: http://arxiv.org/abs/2308.09540
  • repo_url: None
  • paper_authors: Lu Zhang, Chenbo Zhang, Jiajia Zhao, Jihong Guan, Shuigeng Zhou
  • for: Addresses two problems faced by zero-shot object detection methods: the low recall of region proposals for unseen classes and the confusion of unseen classes with background.
  • methods: Combines DETR with meta-learning into Meta-ZSDETR, where training is formalized as an episode-based meta-learning task. Unlike Faster R-CNN-based methods that first generate class-agnostic proposals and then classify them with a visual-semantic alignment module, Meta-ZSDETR directly predicts class-specific boxes with class-specific queries and filters them with the predicted accuracy from the classification head; the model is optimized with meta-contrastive learning, comprising a regression head, a classification head, and a contrastive head whose contrastive-reconstruction loss further separates classes in visual space.
  • results: Extensive experiments on the MS COCO and PASCAL VOC benchmarks show the method outperforms existing ZSD methods by a large margin.
    Abstract Zero-shot object detection aims to localize and recognize objects of unseen classes. Most of existing works face two problems: the low recall of RPN in unseen classes and the confusion of unseen classes with background. In this paper, we present the first method that combines DETR and meta-learning to perform zero-shot object detection, named Meta-ZSDETR, where model training is formalized as an individual episode based meta-learning task. Different from Faster R-CNN based methods that firstly generate class-agnostic proposals, and then classify them with visual-semantic alignment module, Meta-ZSDETR directly predict class-specific boxes with class-specific queries and further filter them with the predicted accuracy from classification head. The model is optimized with meta-contrastive learning, which contains a regression head to generate the coordinates of class-specific boxes, a classification head to predict the accuracy of generated boxes, and a contrastive head that utilizes the proposed contrastive-reconstruction loss to further separate different classes in visual space. We conduct extensive experiments on two benchmark datasets MS COCO and PASCAL VOC. Experimental results show that our method outperforms the existing ZSD methods by a large margin.

Proceedings of the 2nd International Workshop on Adaptive Cyber Defense

  • paper_url: http://arxiv.org/abs/2308.09520
  • repo_url: None
  • paper_authors: Marco Carvalho, Damian Marriott, Mark Bilinski, Ahmad Ridley
  • for: The workshop was organized to share research exploring applications of artificial intelligence (AI) and machine learning (ML) as foundational capabilities for adaptive cyber defense.
  • methods: The contributions apply AI and ML techniques to cyber settings, including semi-autonomous defenses that learn to recognize and respond to cyber attacks and to discover and mitigate weaknesses in cooperation with other cyber operation systems and human experts.
  • results: The proceedings consist of six peer-reviewed technical articles on challenging problems of critical importance to national and global security, presented alongside invited keynote talks and a panel discussion on AI/ML-enabled autonomous mitigation of current and future cyber attacks.
    Abstract The 2nd International Workshop on Adaptive Cyber Defense was held at the Florida Institute of Technology, Florida. This workshop was organized to share research that explores unique applications of Artificial Intelligence (AI) and Machine Learning (ML) as foundational capabilities for the pursuit of adaptive cyber defense. The cyber domain cannot currently be reliably and effectively defended without extensive reliance on human experts. Skilled cyber defenders are in short supply and often cannot respond fast enough to cyber threats. Building on recent advances in AI and ML the Cyber defense research community has been motivated to develop new dynamic and sustainable defenses through the adoption of AI and ML techniques to cyber settings. Bridging critical gaps between AI and Cyber researchers and practitioners can accelerate efforts to create semi-autonomous cyber defenses that can learn to recognize and respond to cyber attacks or discover and mitigate weaknesses in cooperation with other cyber operation systems and human experts. Furthermore, these defenses are expected to be adaptive and able to evolve over time to thwart changes in attacker behavior, changes in the system health and readiness, and natural shifts in user behavior over time. The workshop was comprised of invited keynote talks, technical presentations and a panel discussion about how AI/ML can enable autonomous mitigation of current and future cyber attacks. Workshop submissions were peer reviewed by a panel of domain experts with a proceedings consisting of six technical articles exploring challenging problems of critical importance to national and global security. Participation in this workshop offered new opportunities to stimulate research and innovation in the emerging domain of adaptive and autonomous cyber defense.

Spatial LibriSpeech: An Augmented Dataset for Spatial Audio Learning

  • paper_url: http://arxiv.org/abs/2308.09514
  • repo_url: https://github.com/apple/ml-spatial-librispeech
  • paper_authors: Miguel Sarabia, Elena Menyaylenko, Alessandro Toso, Skyler Seto, Zakaria Aldeneh, Shadi Pirhosseinloo, Luca Zappella, Barry-John Theobald, Nicholas Apostoloff, Jonathan Sheaffer
  • for: A spatial audio dataset designed for training machine learning models, particularly for 3D sound source localization and related spatial audio tasks.
  • methods: Provides over 650 hours of 19-channel audio with first-order ambisonics and optional distractor noise, generated by augmenting LibriSpeech samples with 200k+ simulated acoustic conditions across 8k+ synthetic rooms, with labels for source position, speaking direction, room acoustics, and geometry.
  • results: Models trained on the dataset for four spatial audio tasks achieve a median absolute error of 6.60° on 3D source localization, 0.43 m on distance, 90.66 ms on T30, and 2.74 dB on DRR estimation, and generalize well to widely used evaluation datasets.
    Abstract We present Spatial LibriSpeech, a spatial audio dataset with over 650 hours of 19-channel audio, first-order ambisonics, and optional distractor noise. Spatial LibriSpeech is designed for machine learning model training, and it includes labels for source position, speaking direction, room acoustics and geometry. Spatial LibriSpeech is generated by augmenting LibriSpeech samples with 200k+ simulated acoustic conditions across 8k+ synthetic rooms. To demonstrate the utility of our dataset, we train models on four spatial audio tasks, resulting in a median absolute error of 6.60° on 3D source localization, 0.43m on distance, 90.66ms on T30, and 2.74dB on DRR estimation. We show that the same models generalize well to widely-used evaluation datasets, e.g., obtaining a median absolute error of 12.43° on 3D source localization on TUT Sound Events 2018, and 157.32ms on T30 estimation on ACE Challenge.

Semantic relatedness in DBpedia: A comparative and experimental assessment

  • paper_url: http://arxiv.org/abs/2308.09502
  • repo_url: None
  • paper_authors: Anna Formica, Francesco Taglino
  • for: Assessing the semantic relatedness of Web resources, focusing on knowledge-based methods that rely on knowledge graphs.
  • methods: Selects 10 methods from the existing literature, organized into adjacent-resources, triple-patterns, and triple-weights based methods, all implemented and evaluated with DBpedia as the reference RDF knowledge graph, run on the same DBpedia release against 14 well-known golden datasets.
  • results: Based on the correlation with human judgment, weighting the RDF triples in combination with evaluating all the directed paths linking the compared resources is the best strategy for computing semantic relatedness in DBpedia (a toy illustration follows this entry).
    Abstract Evaluating semantic relatedness of Web resources is still an open challenge. This paper focuses on knowledge-based methods, which represent an alternative to corpus-based approaches, and rely in general on the availability of knowledge graphs. In particular, we have selected 10 methods from the existing literature, which have been organized according to their adjacent resources, triple patterns, and triple weights-based methods. They have been implemented and evaluated by using DBpedia as reference RDF knowledge graph. Since DBpedia is continuously evolving, the experimental results provided by these methods in the literature are not comparable. For this reason, in this work, such methods have been experimented by running them all at once on the same DBpedia release and against 14 well-known golden datasets. On the basis of the correlation values with human judgment obtained according to the experimental results, weighting the RDF triples in combination with evaluating all the directed paths linking the compared resources is the best strategy in order to compute semantic relatedness in DBpedia.
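
As a toy illustration of the winning strategy (triple weights combined with path evaluation), the snippet below scores relatedness as the best product of predicate weights over any path between two resources in a miniature triple graph. The weights, the undirected traversal, and the depth bound are all assumptions made for the sake of the example.

```python
from collections import defaultdict

# toy RDF graph: (subject, predicate, object) triples
triples = [
    ("dbr:Rome", "dbo:capitalOf", "dbr:Italy"),
    ("dbr:Italy", "dbo:currency", "dbr:Euro"),
    ("dbr:France", "dbo:currency", "dbr:Euro"),
]
# hypothetical informativeness weight per predicate
weights = {"dbo:capitalOf": 0.9, "dbo:currency": 0.6}

graph = defaultdict(list)
for s, p, o in triples:
    graph[s].append((p, o))
    graph[o].append((p, s))   # traverse edges in both directions (a toy choice)

def relatedness(a, b, depth=3, score=1.0, seen=None):
    """Best product of triple weights over any path from a to b."""
    seen = (seen or set()) | {a}
    if a == b:
        return score
    if depth == 0:
        return 0.0
    return max((relatedness(o, b, depth - 1, score * weights[p], seen)
                for p, o in graph[a] if o not in seen), default=0.0)

print(relatedness("dbr:Rome", "dbr:France"))  # 0.9 * 0.6 * 0.6 ~ 0.324
```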

Predictive Authoring for Brazilian Portuguese Augmentative and Alternative Communication

  • paper_url: http://arxiv.org/abs/2308.09497
  • repo_url: https://github.com/jayralencar/pictogram_prediction_pt
  • paper_authors: Jayr Pereira, Rodrigo Nogueira, Cleber Zanchettin, Robson Fidalgo
  • for: This paper proposes using a BERT-like model for pictogram prediction in AAC systems to improve the efficiency of message authoring for individuals with complex communication needs.
  • methods: The authors finetune BERTimbau, a Brazilian Portuguese version of BERT, using an AAC corpus for Brazilian Portuguese, and test different approaches to representing a pictogram for prediction, including as a word, as a concept, and as a set of synonyms.
  • results: The results demonstrate that embeddings computed from the pictograms' captions, synonyms, or definitions yield similar performance; using synonyms leads to lower perplexity, but using captions leads to the highest accuracies. The paper provides insight into how to represent a pictogram for prediction with a BERT-like model and into the potential of using images for pictogram prediction.
    Abstract Individuals with complex communication needs (CCN) often rely on augmentative and alternative communication (AAC) systems to have conversations and communicate their wants. Such systems allow message authoring by arranging pictograms in sequence. However, the difficulty of finding the desired item to complete a sentence can increase as the user's vocabulary grows. This paper proposes using BERTimbau, a Brazilian Portuguese version of BERT, for pictogram prediction in AAC systems. To finetune BERTimbau, we constructed an AAC corpus for Brazilian Portuguese to use as a training corpus. We tested different approaches to representing a pictogram for prediction: as a word (using pictogram captions), as a concept (using a dictionary definition), and as a set of synonyms (using related terms). We also evaluated the usage of images for pictogram prediction. The results demonstrate that embeddings computed from the pictograms' captions, synonyms, or definitions yield similar performance. Using synonyms leads to lower perplexity, but using captions leads to the highest accuracies. This paper provides insight into how to represent a pictogram for prediction using a BERT-like model and the potential of using images for pictogram prediction.
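Since pictograms can be represented by their captions and predicted with a BERT-like masked language model, the core prediction step can be sketched with the public BERTimbau checkpoint. The paper additionally finetunes on an AAC corpus, which is omitted here, and the example sentence is illustrative.

```python
from transformers import pipeline

# BERTimbau checkpoint; each pictogram is represented by its caption
# (one of the representations compared in the paper).
fill = pipeline("fill-mask", model="neuralmind/bert-base-portuguese-cased")

# Pictogram sequence so far: "eu quero beber" -> predict the next pictogram word.
for cand in fill("eu quero beber [MASK] .", top_k=5):
    print(cand["token_str"], round(cand["score"], 3))
```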

Balancing Transparency and Risk: The Security and Privacy Risks of Open-Source Machine Learning Models

  • paper_url: http://arxiv.org/abs/2308.09490
  • repo_url: None
  • paper_authors: Dominik Hintersdorf, Lukas Struppek, Kristian Kersting
  • for: To raise awareness of the privacy and security risks that come with adopting open-source machine learning models.
  • methods: The paper surveys common privacy and security threats associated with open-source models, including attack analyses of hidden functionalities triggered by specific input patterns.
  • results: It catalogues numerous risks, from hidden model functionalities and trigger-based manipulation to consequences ranging from service interruptions to the exposure of sensitive user data and even physical harm.
    Abstract The field of artificial intelligence (AI) has experienced remarkable progress in recent years, driven by the widespread adoption of open-source machine learning models in both research and industry. Considering the resource-intensive nature of training on vast datasets, many applications opt for models that have already been trained. Hence, a small number of key players undertake the responsibility of training and publicly releasing large pre-trained models, providing a crucial foundation for a wide range of applications. However, the adoption of these open-source models carries inherent privacy and security risks that are often overlooked. To provide a concrete example, an inconspicuous model may conceal hidden functionalities that, when triggered by specific input patterns, can manipulate the behavior of the system, such as instructing self-driving cars to ignore the presence of other vehicles. The implications of successful privacy and security attacks encompass a broad spectrum, ranging from relatively minor damage like service interruptions to highly alarming scenarios, including physical harm or the exposure of sensitive user data. In this work, we present a comprehensive overview of common privacy and security threats associated with the use of open-source models. By raising awareness of these dangers, we strive to promote the responsible and secure use of AI systems.

Modelling Electricity Consumption in Irish Dairy Farms Using Agent-Based Modelling

  • paper_url: http://arxiv.org/abs/2308.09488
  • repo_url: None
  • paper_authors: Hossein Khaleghy, Abdul Wahid, Eoghan Clifford, Karl Mason
  • for: To study electricity consumption on Irish dairy farms.
  • methods: An agent-based model simulates farm electricity consumption, accounting for herd size, the number of milking machines, and the time of year.
  • results: The agent-based model estimates farm electricity demand accurately and, unlike other AI techniques such as deep learning models, produces fully explainable outputs.
    Abstract Dairy farming can be an energy intensive form of farming. Understanding the factors affecting electricity consumption on dairy farms is crucial for farm owners and energy providers. In order to accurately estimate electricity demands in dairy farms, it is necessary to develop a model. In this research paper, an agent-based model is proposed to model the electricity consumption of Irish dairy farms. The model takes into account various factors that affect the energy consumption of dairy farms, including herd size, number of milking machines, and time of year. The outputs are validated using existing state-of-the-art dairy farm modelling frameworks. The proposed agent-based model is fully explainable, which is an advantage over other Artificial Intelligence techniques, e.g. deep learning.
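A toy version of such an agent-based model is easy to write down: each farm agent combines herd size, milking machines, and a seasonal factor into a daily consumption figure. All coefficients below are illustrative assumptions, not the paper's calibrated values.

```python
import random

class DairyFarmAgent:
    """Toy farm agent: electricity use scales with herd size, milking machines,
    and a seasonal factor. Coefficients are illustrative, not the paper's."""
    def __init__(self, herd_size, n_machines):
        self.herd_size = herd_size
        self.n_machines = n_machines

    def daily_kwh(self, month):
        season = 1.2 if month in (5, 6, 7) else 0.9   # peak milking season
        milk_cooling = 0.35 * self.herd_size * season
        milking = 1.5 * self.n_machines * season
        water_heating = 8.0 + 0.05 * self.herd_size
        return milk_cooling + milking + water_heating + random.gauss(0, 2)

farms = [DairyFarmAgent(random.randint(60, 200), random.randint(8, 20))
         for _ in range(50)]
print(sum(f.daily_kwh(month=6) for f in farms))  # fleet-wide June consumption
```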

Poison Dart Frog: A Clean-Label Attack with Low Poisoning Rate and High Attack Success Rate in the Absence of Training Data

  • paper_url: http://arxiv.org/abs/2308.09487
  • repo_url: https://github.com/magic-ma-tech/poison-dart-frog
  • paper_authors: Binhao Ma, Jiahui Wang, Dejun Wang, Bo Meng
  • for: Studying clean-label backdoor attacks and the defenses they evade.
  • methods: The proposed clean-label method, 'Poison Dart Frog', requires no access to the training set (neither the whole set nor a portion of it), only knowledge of the target class.
  • results: With poisoning rates of only 0.1%, 0.025%, and 0.4% of the training set on CIFAR10, Tiny-ImageNet, and TSRD, respectively, it achieves high attack success rates, comparable to the state-of-the-art NARCISSUS attack, without any training-data knowledge; four typical backdoor defenses struggle to counter it.
    Abstract To successfully launch backdoor attacks, injected data needs to be correctly labeled; otherwise, they can be easily detected by even basic data filters. Hence, the concept of clean-label attacks was introduced, which is more dangerous as it doesn't require changing the labels of injected data. To the best of our knowledge, existing clean-label backdoor attacks largely rely on an understanding of the entire training set or a portion of it. However, in practice, it is very difficult for attackers to obtain this because training datasets are often collected from multiple independent sources. Unlike all current clean-label attacks, we propose a novel clean-label method called 'Poison Dart Frog'. Poison Dart Frog does not require access to any training data; it only necessitates knowledge of the target class for the attack, such as 'frog'. On CIFAR10, Tiny-ImageNet, and TSRD, with a mere 0.1%, 0.025%, and 0.4% poisoning rate of the training set size, respectively, Poison Dart Frog achieves a high Attack Success Rate compared to LC, HTBA, BadNets, and Blend. Furthermore, compared to the state-of-the-art attack, NARCISSUS, Poison Dart Frog achieves similar attack success rates without any training data. Finally, we demonstrate that four typical backdoor defense algorithms struggle to counter Poison Dart Frog.

RBA-GCN: Relational Bilevel Aggregation Graph Convolutional Network for Emotion Recognition

  • paper_url: http://arxiv.org/abs/2308.11029
  • repo_url: https://github.com/luftmenscher/RBA-GCN
  • paper_authors: Lin Yuan, Guoheng Huang, Fenghuan Li, Xiaochen Yuan, Chi-Man Pun, Guo Zhong
  • for: Emotion recognition in conversation (ERC) with graph convolutional networks (GCNs), addressing the node-information redundancy of traditional GCN aggregation and the inability of single-layer GCNs to capture long-range contextual information.
  • methods: Three modules: a graph generation module (GGM) that constructs a novel graph to reduce target-node redundancy; a similarity-based cluster building module (SCBM) that computes node similarity within the structural neighborhood and filters out low-similarity noise to preserve each node's discriminant information (as sketched below); and a bilevel aggregation module (BiAM) that preserves node information during aggregation, builds interactions between modalities, and captures long-range context based on similarity clusters.
  • results: On the IEMOCAP and MELD datasets, the weighted average F1 score of RBA-GCN improves by 2.17%-5.21% over the most advanced method.
    Abstract Emotion recognition in conversation (ERC) has received increasing attention from researchers due to its wide range of applications. As conversation has a natural graph structure, numerous approaches used to model ERC based on graph convolutional networks (GCNs) have yielded significant results. However, the aggregation approach of traditional GCNs suffers from the node information redundancy problem, leading to node discriminant information loss. Additionally, single-layer GCNs lack the capacity to capture long-range contextual information from the graph. Furthermore, the majority of approaches are based on textual modality or stitching together different modalities, resulting in a weak ability to capture interactions between modalities. To address these problems, we present the relational bilevel aggregation graph convolutional network (RBA-GCN), which consists of three modules: the graph generation module (GGM), similarity-based cluster building module (SCBM) and bilevel aggregation module (BiAM). First, GGM constructs a novel graph to reduce the redundancy of target node information. Then, SCBM calculates the node similarity in the target node and its structural neighborhood, where noisy information with low similarity is filtered out to preserve the discriminant information of the node. Meanwhile, BiAM is a novel aggregation method that can preserve the information of nodes during the aggregation process. This module can construct the interaction between different modalities and capture long-range contextual information based on similarity clusters. On both the IEMOCAP and MELD datasets, the weighted average F1 score of RBA-GCN has a 2.17%-5.21% improvement over that of the most advanced method.
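The SCBM idea (keep only structural neighbors whose similarity to the target node is high, then aggregate) can be sketched in a few lines of PyTorch. The cosine similarity measure, the fixed threshold, and the mean aggregation below are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def filtered_aggregate(h, adj, tau=0.5):
    """Aggregate neighbor features, keeping only neighbors whose cosine
    similarity to the target node exceeds tau (a simplified SCBM-style filter).
    h: (N, d) node features, adj: (N, N) 0/1 adjacency. Illustrative only."""
    sim = F.cosine_similarity(h.unsqueeze(1), h.unsqueeze(0), dim=-1)  # (N, N)
    mask = adj * (sim > tau).float()
    deg = mask.sum(dim=1, keepdim=True).clamp(min=1.0)
    return mask @ h / deg

h = torch.randn(6, 16)
adj = (torch.rand(6, 6) > 0.5).float()
print(filtered_aggregate(h, adj).shape)  # torch.Size([6, 16])
```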

AI Hilbert: From Data and Background Knowledge to Automated Scientific Discovery

  • paper_url: http://arxiv.org/abs/2308.09474
  • repo_url: None
  • paper_authors: Ryan Cory-Wright, Bachir El Khadir, Cristina Cornelio, Sanjeeb Dash, Lior Horesh
  • for: Discovering scientific formulae that parsimoniously explain natural phenomena and are consistent with existing background theory.
  • methods: Combines regression and reasoning to eliminate formulae inconsistent with background theory, expressing axioms and scientific laws as polynomial equalities and inequalities and searching via mixed-integer linear or semidefinite optimization, with minimal-complexity notions modeled by binary variables and logical constraints.
  • results: The approach finds the axiom-consistent formula that best fits the data, automatically certifies discoveries via Positivstellensatz certificates, and re-derives famous laws such as Kepler's Third Law of Planetary Motion, the Hagen-Poiseuille Equation, and the Radiated Gravitational Wave Power equation from partially correct background axioms.
    Abstract The discovery of scientific formulae that parsimoniously explain natural phenomena and align with existing background theory is a key goal in science. Historically, scientists have derived natural laws by manipulating equations based on existing knowledge, forming new equations, and verifying them experimentally. In recent years, data-driven scientific discovery has emerged as a viable competitor in settings with large amounts of experimental data. Unfortunately, data-driven methods often fail to discover valid laws when data is noisy or scarce. Accordingly, recent works combine regression and reasoning to eliminate formulae inconsistent with background theory. However, the problem of searching over the space of formulae consistent with background theory to find one that fits the data best is not well solved. We propose a solution to this problem when all axioms and scientific laws are expressible via polynomial equalities and inequalities and argue that our approach is widely applicable. We further model notions of minimal complexity using binary variables and logical constraints, solve polynomial optimization problems via mixed-integer linear or semidefinite optimization, and automatically prove the validity of our scientific discoveries via Positivstellensatz certificates. Remarkably, the optimization techniques leveraged in this paper allow our approach to run in polynomial time with fully correct background theory, or non-deterministic polynomial (NP) time with partially correct background theory. We experimentally demonstrate that some famous scientific laws, including Kepler's Third Law of Planetary Motion, the Hagen-Poiseuille Equation, and the Radiated Gravitational Wave Power equation, can be automatically derived from sets of partially correct background axioms.
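As a concrete, hand-sized analogue of deriving a law from polynomial background axioms, the snippet below recovers Kepler's Third Law for circular orbits by symbolic elimination. This is a computer-algebra sketch, not the paper's mixed-integer/semidefinite optimization with Positivstellensatz certificates.

```python
import sympy as sp

G, M, m, r, T, omega, F = sp.symbols("G M m r T omega F", positive=True)

# Background axioms as algebraic equalities:
axioms = [
    sp.Eq(F, G * M * m / r**2),      # Newtonian gravity
    sp.Eq(F, m * omega**2 * r),      # centripetal force on a circular orbit
    sp.Eq(omega * T, 2 * sp.pi),     # period-angular velocity relation
]

# Eliminate F and omega to recover a relation between T and r.
sol = sp.solve(axioms, [F, omega, T], dict=True)
print(sp.simplify(sol[0][T]**2))     # 4*pi**2*r**3/(G*M): Kepler's third law
```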

Vision Relation Transformer for Unbiased Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2308.09472
  • repo_url: https://github.com/visinf/veto
  • paper_authors: Gopika Sudhakaran, Devendra Singh Dhami, Kristian Kersting, Stefan Roth
  • for: scene graph generation (SGG) task, aiming to predict entity relationships using a relation encoder-decoder pipeline stacked on top of an object encoder-decoder backbone.
  • methods: introduces the Vision rElation TransfOrmer (VETO), consisting of a novel local-level entity relation encoder, and the Mutually Exclusive ExperT (MEET) learning strategy to capture important relation features without bias towards head or tail classes.
  • results: experimental results on the VG and GQA datasets demonstrate that VETO + MEET boosts the predictive performance by up to 47% over the state of the art while being 10 times smaller.
    Abstract Recent years have seen a growing interest in Scene Graph Generation (SGG), a comprehensive visual scene understanding task that aims to predict entity relationships using a relation encoder-decoder pipeline stacked on top of an object encoder-decoder backbone. Unfortunately, current SGG methods suffer from an information loss regarding the entities' local-level cues during the relation encoding process. To mitigate this, we introduce the Vision rElation TransfOrmer (VETO), consisting of a novel local-level entity relation encoder. We further observe that many existing SGG methods claim to be unbiased, but are still biased towards either head or tail classes. To overcome this bias, we introduce a Mutually Exclusive ExperT (MEET) learning strategy that captures important relation features without bias towards head or tail classes. Experimental results on the VG and GQA datasets demonstrate that VETO + MEET boosts the predictive performance by up to 47% over the state of the art while being 10 times smaller.

Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning

  • paper_url: http://arxiv.org/abs/2308.09455
  • repo_url: None
  • paper_authors: Yeming Chen, Siyu Zhang, Yaoru Sun, Weijian Liang, Haoran Wang
  • for: To narrow the semantic gap between the visual and language modalities in vision-language (VL) tasks by introducing a novel visual semantic module.
  • methods: Proposes Artificial-Spiking Hierarchical Networks (ASH-Nets), which combine the complementary advantages of artificial neural networks (ANNs) and spiking neural networks (SNNs) to enrich visual semantic representations. A visual concrete encoder and a semantic abstract encoder learn continuous and discrete latent variables to increase the flexibility of semantic encoding; a contrastive learning method optimizes the inputs of similar samples, improving the hierarchical network's computational efficiency, while hard-sample augmentation benefits the learning of visual representations. A Spiking to Text Uni-Alignment Learning (STUA) pre-training method relies on text features alone to strengthen the encoding of abstract semantics.
  • results: The proposed STUA pre-training and ASH-Nets achieve competitive results on multiple well-established downstream VL tasks.
    Abstract With the success of self-supervised learning, multimodal foundation models have rapidly been adapted to a wide range of downstream tasks driven by vision and language (VL) pretraining. State-of-the-art methods achieve impressive performance by pre-training on large-scale datasets. However, bridging the semantic gap between the two modalities remains a nonnegligible challenge for VL tasks. In this work, we propose an efficient computation framework for multimodal alignment by introducing a novel visual semantic module to further improve the performance of the VL tasks. Specifically, we propose a flexible model, namely Artificial-Spiking Hierarchical Networks (ASH-Nets), which combines the complementary advantages of Artificial neural networks (ANNs) and Spiking neural networks (SNNs) to enrich visual semantic representations. In particular, a visual concrete encoder and a semantic abstract encoder are constructed to learn continuous and discrete latent variables to enhance the flexibility of semantic encoding. Considering the spatio-temporal properties of SNN modeling, we introduce a contrastive learning method to optimize the inputs of similar samples. This can improve the computational efficiency of the hierarchical network, while the augmentation of hard samples is beneficial to the learning of visual representations. Furthermore, the Spiking to Text Uni-Alignment Learning (STUA) pre-training method is proposed, which relies only on text features to enhance the encoding ability of abstract semantics. We validate the performance on multiple well-established downstream VL tasks. Experiments show that the proposed ASH-Nets achieve competitive results.

Logistics Hub Location Optimization: A K-Means and P-Median Model Hybrid Approach Using Road Network Distances

  • paper_url: http://arxiv.org/abs/2308.11038
  • repo_url: None
  • paper_authors: Muhammad Abdul Rahman, Muhammad Aamir Basheer, Zubair Khalid, Muhammad Tahir, Momin Uppal
  • for: Optimizing the placement of logistics hubs to improve the efficiency and environmental footprint of e-commerce delivery.
  • methods: A hybrid approach: delivery points are first clustered with K-Means over road-network distances, then hubs are sited at optimal locations with the P-Median method, weighted by delivery counts and population.
  • results: Serving deliveries from the optimized hub locations saves 815 meters (10%) per delivery.
    Abstract Logistic hubs play a pivotal role in the last-mile delivery distance; even a slight increment in distance negatively impacts the business of the e-commerce industry while also increasing its carbon footprint. The growth of this industry, particularly after Covid-19, has further intensified the need for optimized allocation of resources in an urban environment. In this study, we use a hybrid approach to optimize the placement of logistic hubs. The approach sequentially employs different techniques. Initially, delivery points are clustered using K-Means in relation to their spatial locations. The clustering method utilizes road network distances as opposed to Euclidean distances. Non-road network-based approaches have been avoided since they lead to erroneous and misleading results. Finally, hubs are located using the P-Median method. The P-Median method also incorporates the number of deliveries and population as weights. Real-world delivery data from Muller and Phipps (M&P) is used to demonstrate the effectiveness of the approach. Serving deliveries from the optimal hub locations results in the saving of 815 (10%) meters per delivery.
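A toy version of the hub-siting step is shown below: shortest-path (road) distances replace Euclidean ones, and a p-median objective weighted by delivery counts picks the hubs. The network, weights, and exhaustive search over p=2 are illustrative; the paper seeds candidate sites with road-distance clustering first.

```python
import itertools
import networkx as nx

# Toy road network: nodes are intersections, edge weights are road distances (m).
roads = nx.Graph()
roads.add_weighted_edges_from([(0, 1, 400), (1, 2, 300), (2, 3, 500),
                               (0, 4, 700), (4, 5, 350), (5, 3, 450)])
dist = dict(nx.all_pairs_dijkstra_path_length(roads, weight="weight"))

deliveries = {0: 120, 2: 80, 3: 60, 5: 140}   # node -> delivery count (weights)

def p_median_cost(hubs):
    """Total weighted road distance when each delivery uses its nearest hub."""
    return sum(w * min(dist[h][n] for h in hubs) for n, w in deliveries.items())

# Exhaustive p-median for p=2 on this toy instance.
best = min(itertools.combinations(roads.nodes, 2), key=p_median_cost)
print(best, p_median_cost(best))
```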

From Hope to Safety: Unlearning Biases of Deep Models by Enforcing the Right Reasons in Latent Space

  • paper_url: http://arxiv.org/abs/2308.09437
  • repo_url: None
  • paper_authors: Maximilian Dreyer, Frederik Pahde, Christopher J. Anders, Wojciech Samek, Sebastian Lapuschkin
  • for: To unlearn spurious correlations embedded in deep neural networks, which can lead to biased, high-risk predictions.
  • methods: A novel method that enforces the right reasons at the concept level by reducing the model's sensitivity to biases through the gradient, modeling biases via Concept Activation Vectors with robustly chosen directions (see the sketch after the abstract).
  • results: Biases are effectively mitigated in controlled and real-world settings on the ISIC, Bone Age, ImageNet and CelebA datasets using VGG, ResNet and EfficientNet architectures.
    Abstract Deep Neural Networks are prone to learning spurious correlations embedded in the training data, leading to potentially biased predictions. This poses risks when deploying these models for high-stake decision-making, such as in medical applications. Current methods for post-hoc model correction either require input-level annotations, which are only possible for spatially localized biases, or augment the latent feature space, thereby hoping to enforce the right reasons. We present a novel method ensuring the right reasons on the concept level by reducing the model's sensitivity towards biases through the gradient. When modeling biases via Concept Activation Vectors, we highlight the importance of choosing robust directions, as traditional regression-based approaches such as Support Vector Machines tend to result in diverging directions. We effectively mitigate biases in controlled and real-world settings on the ISIC, Bone Age, ImageNet and CelebA datasets using VGG, ResNet and EfficientNet architectures.
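The core mechanism (penalizing the alignment between the model's gradient and a bias concept direction in latent space) can be sketched as a regularizer. The loss form, the names, and the random stand-in classifier below are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def rr_loss(latent, logits, cav, target):
    """Right-reason penalty (simplified): discourage the gradient of the target
    logit w.r.t. latent features from aligning with a bias concept direction.
    `cav` is a unit Concept Activation Vector; all names are illustrative."""
    score = logits[torch.arange(len(logits)), target].sum()
    grad, = torch.autograd.grad(score, latent, create_graph=True)
    return (grad @ cav).pow(2).mean()

latent = torch.randn(8, 32, requires_grad=True)
logits = latent @ torch.randn(32, 5)          # stand-in classifier head
cav = torch.nn.functional.normalize(torch.randn(32), dim=0)
target = torch.randint(0, 5, (8,))
print(rr_loss(latent, logits, cav, target))   # add to the task loss and train
```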

Enhancing Agent Communication and Learning through Action and Language

  • paper_url: http://arxiv.org/abs/2308.10842
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Caselles-Dupré Hugo, Sigaud Olivier, Chetouani Mohamed
  • for: To introduce a novel category of GC-agents capable of functioning as both teachers and learners.
  • methods: The agents leverage action-based demonstrations and language-based instructions, incorporating pedagogy and pragmatism to enhance communication efficiency and the agents' teaching and learning capabilities.
  • results: Combining communication modes (action and language) improves learning outcomes, highlighting the benefits of a multi-modal approach.
    Abstract We introduce a novel category of GC-agents capable of functioning as both teachers and learners. Leveraging action-based demonstrations and language-based instructions, these agents enhance communication efficiency. We investigate the incorporation of pedagogy and pragmatism, essential elements in human communication and goal achievement, enhancing the agents' teaching and learning capabilities. Furthermore, we explore the impact of combining communication modes (action and language) on learning outcomes, highlighting the benefits of a multi-modal approach.

ICU Mortality Prediction Using Long Short-Term Memory Networks

  • paper_url: http://arxiv.org/abs/2308.12800
  • repo_url: None
  • paper_authors: Manel Mili, Asma Kerkeni, Asma Ben Abdallah, Mohamed Hedi Bedoui
  • for: To propose an automatic data-driven system that analyzes large amounts of multivariate temporal data from Electronic Health Records (EHRs) and extracts high-level information for early prediction of in-hospital mortality and Length of Stay (LOS).
  • methods: LSTM networks applied to multivariate time series, with the time frame reduced to 6 hours to enhance clinical tasks.
  • results: Experiments highlight the efficiency of the LSTM model with rigorous multivariate time-series measurements for building real-world prediction engines.
    Abstract Extensive bedside monitoring in Intensive Care Units (ICUs) has resulted in complex temporal data regarding patient physiology, which presents a rich context for clinical data analysis. On the other hand, identifying the time-series patterns within these data may offer strong potential for predicting clinical events. Hence, in this work we investigate the implementation of an automatic data-driven system that analyzes large amounts of multivariate temporal data derived from Electronic Health Records (EHRs) and extracts high-level information so as to predict in-hospital mortality and Length of Stay (LOS) early. Practically, we investigate the applicability of LSTM networks by reducing the time frame to 6 hours so as to enhance clinical tasks. The experimental results highlight the efficiency of the LSTM model with rigorous multivariate time-series measurements for building real-world prediction engines.
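A minimal PyTorch version of such a model is a single LSTM over hourly measurement windows followed by a sigmoid head. Feature and hidden dimensions below are illustrative; the paper's EHR preprocessing is omitted.

```python
import torch
import torch.nn as nn

class MortalityLSTM(nn.Module):
    """Binary in-hospital mortality classifier over 6-hour windows of
    multivariate vitals/labs (dimensions are illustrative)."""
    def __init__(self, n_features=32, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, timesteps, n_features)
        _, (h, _) = self.lstm(x)
        return torch.sigmoid(self.head(h[-1])).squeeze(-1)

model = MortalityLSTM()
x = torch.randn(4, 6, 32)                 # 6 hourly measurements per stay
print(model(x))                           # per-stay mortality probabilities
```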

Multi-Level Compositional Reasoning for Interactive Instruction Following

  • paper_url: http://arxiv.org/abs/2308.09387
  • repo_url: None
  • paper_authors: Suvaansh Bhambri, Byeonghwi Kim, Jonghyun Choi
  • for: To enable robotic agents to carry out composite household tasks from natural-language directives, which requires reasoning about multiple subtasks (e.g., bring a cup of coffee).
  • methods: Divides a task into multiple subgoals and attends to them individually via a three-level compositional policy: high-level subgoal inference from language, mid-level alternation between navigation and interaction policies, and low-level manipulation with object masks.
  • results: Achieves a 2.03% absolute gain in PLWSR on the unseen set over the comparable state of the art, without rule-based planning or a semantic spatial memory.
    Abstract Robotic agents performing domestic chores by natural language directives are required to master the complex job of navigating the environment and interacting with objects in it. The tasks given to the agents are often composite and thus challenging, as completing them requires reasoning about multiple subtasks, e.g., bring a cup of coffee. To address the challenge, we propose to divide and conquer it by breaking the task into multiple subgoals and attending to them individually for better navigation and interaction. We call it Multi-level Compositional Reasoning Agent (MCR-Agent). Specifically, we learn a three-level action policy. At the highest level, we infer a sequence of human-interpretable subgoals to be executed based on language instructions by a high-level policy composition controller. At the middle level, we discriminatively control the agent's navigation by a master policy by alternating between a navigation policy and various independent interaction policies. Finally, at the lowest level, we infer manipulation actions with the corresponding object masks using the appropriate interaction policy. Our approach not only generates human-interpretable subgoals but also achieves a 2.03% absolute gain over the comparable state of the art in the efficiency metric (PLWSR on the unseen set) without using rule-based planning or a semantic spatial memory.

Deciphering knee osteoarthritis diagnostic features with explainable artificial intelligence: A systematic review

  • paper_url: http://arxiv.org/abs/2308.09380
  • repo_url: None
  • paper_authors: Yun Xin Teoh, Alice Othmani, Siew Li Goh, Juliana Usman, Khin Wee Lai
  • for: To survey explainable artificial intelligence (XAI) techniques for knee osteoarthritis (OA) diagnosis, toward a more reliable and trustworthy diagnostic approach.
  • methods: XAI techniques are discussed from two perspectives: data interpretability and model interpretability.
  • results: The review provides valuable insights into XAI's potential for more reliable knee OA diagnosis and encourages its adoption in clinical practice.
    Abstract Existing artificial intelligence (AI) models for diagnosing knee osteoarthritis (OA) have faced criticism for their lack of transparency and interpretability, despite achieving medical-expert-like performance. This opacity makes them challenging to trust in clinical practice. Recently, explainable artificial intelligence (XAI) has emerged as a specialized technique that can provide confidence in the model's prediction by revealing how the prediction is derived, thus promoting the use of AI systems in healthcare. This paper presents the first survey of XAI techniques used for knee OA diagnosis. The XAI techniques are discussed from two perspectives: data interpretability and model interpretability. The aim of this paper is to provide valuable insights into XAI's potential towards a more reliable knee OA diagnosis approach and encourage its adoption in clinical practice.

Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers

  • paper_url: http://arxiv.org/abs/2308.09372
  • repo_url: https://github.com/tobna/whattransformertofavor
  • paper_authors: Tobias Christian Nauen, Sebastian Palacio, Andreas Dengel
  • for: This paper aims to provide a comprehensive evaluation of vision transformers and related architectures, focusing on their efficiency across multiple performance metrics.
  • methods: The authors use more than 30 models and consider various performance metrics to evaluate the efficiency of different architectures. They also propose a hybrid attention-CNN model that performs well with low inference memory and number of parameters.
  • results: The study finds that ViT is still Pareto optimal across multiple efficiency metrics, despite the existence of alternative approaches claiming to be more efficient. The authors also discover a strong positive correlation between the number of FLOPS and training memory, and that scaling the model size is more effective than scaling the image size. The study provides valuable insights for practitioners and researchers when selecting models for specific applications.
    Abstract The growing popularity of Vision Transformers as the go-to models for image classification has led to an explosion of architectural modifications claiming to be more efficient than the original ViT. However, a wide diversity of experimental conditions prevents a fair comparison between all of them, based solely on their reported results. To address this gap in comparability, we conduct a comprehensive analysis of more than 30 models to evaluate the efficiency of vision transformers and related architectures, considering various performance metrics. Our benchmark provides a comparable baseline across the landscape of efficiency-oriented transformers, unveiling a plethora of surprising insights. For example, we discover that ViT is still Pareto optimal across multiple efficiency metrics, despite the existence of several alternative approaches claiming to be more efficient. Results also indicate that hybrid attention-CNN models fare particularly well when it comes to low inference memory and number of parameters, and also that it is better to scale the model size than the image size. Furthermore, we uncover a strong positive correlation between the number of FLOPS and the training memory, which enables the estimation of required VRAM from theoretical measurements alone. Thanks to our holistic evaluation, this study offers valuable insights for practitioners and researchers, facilitating informed decisions when selecting models for specific applications. We publicly release our code and data at https://github.com/tobna/WhatTransformerToFavor

RLIPv2: Fast Scaling of Relational Language-Image Pre-training

  • paper_url: http://arxiv.org/abs/2308.09351
  • repo_url: https://github.com/jacobyuan7/rlipv2
  • paper_authors: Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao
  • for: Advancing relational reasoning in computer vision by aligning vision representations with relational texts.
  • methods: RLIPv2, a fast-converging model that scales relational pre-training to large-scale pseudo-labelled scene graph data. It introduces Asymmetric Language-Image Fusion (ALIF), which enables earlier and deeper gated cross-modal fusion with sparsified language encoding layers, and obtains scene graph data at scale by extending object detection datasets with free-form relation labels via a captioner (e.g., BLIP) and a designed Relation Tagger.
  • results: State-of-the-art performance on three benchmarks for Human-Object Interaction Detection and Scene Graph Generation under fully-finetuning, few-shot, and zero-shot settings; the largest RLIPv2 reaches 23.29mAP on HICO-DET without any fine-tuning, 32.22mAP with just 1% of the data, and 45.09mAP with 100% of the data.
    Abstract Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, hindered by the slow convergence of RLIPv1 architecture and the limited availability of existing scene graph data, scaling RLIPv1 is challenging. In this paper, we propose RLIPv2, a fast converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments conducted on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully-finetuning, few-shot and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29mAP on HICO-DET without any fine-tuning, yields 32.22mAP with just 1% data and yields 45.09mAP with 100% data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2.

Surprise machines: revealing Harvard Art Museums’ image collection

  • paper_url: http://arxiv.org/abs/2308.09343
  • repo_url: None
  • paper_authors: Dario Rodighiero, Lins Derry, Douglas Duhaime, Jordan Kruguer, Maximilian C. Mueller, Christopher Pietsch, Jeffrey T. Schnapp, Jeff Steward
  • for: An experimental museology project that visualizes the entire image collection of the Harvard Art Museums, opening unexpected vistas on more than 200,000 objects usually inaccessible to visitors.
  • methods: A choreographic interface connects the audience's movement with several unique views of the collection.
  • results: The installation creates a feeling of surprise among visitors while probing the limits of artificial intelligence for displaying a large set of images.
    Abstract Surprise Machines is a project of experimental museology that sets out to visualize the entire image collection of the Harvard Art Museums, intending to open up unexpected vistas on more than 200,000 objects usually inaccessible to visitors. Part of the exhibition Curatorial A(i)gents organized by metaLAB (at) Harvard, the project explores the limits of artificial intelligence to display a large set of images and create surprise among visitors. To achieve such a feeling of surprise, a choreographic interface was designed to connect the audience's movement with several unique views of the collection.

Distributed Neurodynamics-Based Backstepping Optimal Control for Robust Constrained Consensus of Underactuated Underwater Vehicles Fleet

  • paper_url: http://arxiv.org/abs/2308.09326
  • repo_url: None
  • paper_authors: Tao Yan, Zhe Xu, Simon X. Yang, S. Andrew Gadsden
  • for: Robust constrained formation tracking control of an underactuated underwater vehicles (UUVs) fleet in three-dimensional space.
  • methods: A hierarchical architecture: on the top layer, a spherical coordinate transform tackles the nonholonomic constraint and a distributed optimal motion coordination strategy is developed; in the lower-level control loop, a neurodynamics-based robust backstepping controller avoids the "explosion of terms" of conventional backstepping controllers and improves control activity.
  • results: Optimal formation tracking of the UUV fleet is achieved with the constraints fulfilled, and all UUV states remain uniformly ultimately bounded in the presence of unknown disturbances.
    Abstract Robust constrained formation tracking control of underactuated underwater vehicles (UUVs) fleets in three-dimensional space is a challenging but practical problem. To address this problem, this paper develops a novel consensus-based optimal coordination protocol and a robust controller with a hierarchical architecture. On the top layer, the spherical coordinate transform is introduced to tackle the nonholonomic constraint, and a distributed optimal motion coordination strategy is then developed. As a result, optimal formation tracking of the UUV fleet can be achieved and the constraints are fulfilled. To better realize the generated optimal commands while handling the underactuation, a neurodynamics-based robust backstepping controller is designed at the lower-level control loop; in particular, the issue of "explosion of terms" that appears in conventional backstepping-based controllers is avoided and control activity is improved. The stability of the overall UUV formation system is established to ensure that all the states of the UUVs are uniformly ultimately bounded in the presence of unknown disturbances. Finally, extensive simulation comparisons are made to illustrate the superiority and effectiveness of the derived optimal formation tracking protocol.

Audio-Visual Glance Network for Efficient Video Recognition

  • paper_url: http://arxiv.org/abs/2308.09322
  • repo_url: None
  • paper_authors: Muhammad Adi Nugroho, Sangmin Woo, Sumin Lee, Changick Kim
  • for: Efficient and scalable video recognition that processes only the spatio-temporally important parts of a video using the commonly available audio and visual modalities.
  • methods: Lightweight unimodal encoders extract global visual and audio features; an Audio-Visual Temporal Saliency Transformer (AV-TeST) estimates per-frame importance scores; and, in the spatial dimension, an Audio-Enhanced Spatial Patch Attention (AESPA) module produces enhanced coarse visual features that feed a policy network emitting the coordinates of the important patches, so only those patches are processed.
  • results: Together with various training techniques and multi-modal feature fusion, AVGN sets new state-of-the-art performance on multiple video recognition benchmarks while achieving faster processing.
    Abstract Deep learning has made significant strides in video understanding tasks, but the computation required to classify lengthy and massive videos using clip-level video classifiers remains impractical and prohibitively expensive. To address this issue, we propose Audio-Visual Glance Network (AVGN), which leverages the commonly available audio and visual modalities to efficiently process the spatio-temporally important parts of a video. AVGN firstly divides the video into snippets of image-audio clip pair and employs lightweight unimodal encoders to extract global visual features and audio features. To identify the important temporal segments, we use an Audio-Visual Temporal Saliency Transformer (AV-TeST) that estimates the saliency scores of each frame. To further increase efficiency in the spatial dimension, AVGN processes only the important patches instead of the whole images. We use an Audio-Enhanced Spatial Patch Attention (AESPA) module to produce a set of enhanced coarse visual features, which are fed to a policy network that produces the coordinates of the important patches. This approach enables us to focus only on the most important spatio-temporally parts of the video, leading to more efficient video recognition. Moreover, we incorporate various training techniques and multi-modal feature fusion to enhance the robustness and effectiveness of our AVGN. By combining these strategies, our AVGN sets new state-of-the-art performance in multiple video recognition benchmarks while achieving faster processing speed.
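The temporal half of the pipeline reduces to scoring frames and keeping the top-k before heavier processing. A minimal sketch follows; the saliency scores here are random stand-ins for AV-TeST outputs.

```python
import torch

def glance(frames, saliency, k=4):
    """Keep only the k frames with the highest audio-visual saliency scores
    (the role AV-TeST plays in the pipeline; scores are stand-ins here)."""
    idx = saliency.topk(k).indices.sort().values   # preserve temporal order
    return frames[idx]

frames = torch.randn(32, 3, 224, 224)              # a 32-frame clip
saliency = torch.rand(32)                           # per-frame importance
print(glance(frames, saliency).shape)               # torch.Size([4, 3, 224, 224])
```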

Distributed Robust Learning-Based Backstepping Control Aided with Neurodynamics for Consensus Formation Tracking of Underwater Vessels

  • paper_url: http://arxiv.org/abs/2308.09320
  • repo_url: None
  • paper_authors: Tao Yan, Zhe Xu, Simon X. Yang
  • for: Distributed robust learning-based control for consensus formation tracking of multiple underwater vessels whose system parameters are entirely unknown and subject to modeling mismatch, oceanic disturbances, and noises.
  • methods: Graph theory is used to synthesize the distributed controller with a stability guarantee. Since parameter uncertainties arise only in the vessels' dynamic model, the backstepping technique is employed; an online learning procedure handles the time-varying unknown system, and a neurodynamics model in the controller design copes with modeling errors, environmental disturbances, and measurement noises for a robust solution.
  • results: Stability analysis of the closed-loop system guarantees robust adaptive performance at the theoretical level, and extensive simulation experiments verify the efficacy of the distributed formation control protocol.
    Abstract This paper addresses distributed robust learning-based control for consensus formation tracking of multiple underwater vessels, in which the system parameters of the marine vessels are assumed to be entirely unknown and subject to the modeling mismatch, oceanic disturbances, and noises. Towards this end, graph theory is used to allow us to synthesize the distributed controller with a stability guarantee. Due to the fact that the parameter uncertainties only arise in the vessels' dynamic model, the backstepping control technique is then employed. Subsequently, to overcome the difficulties in handling time-varying and unknown systems, an online learning procedure is developed in the proposed distributed formation control protocol. Moreover, modeling errors, environmental disturbances, and measurement noises are considered and tackled by introducing a neurodynamics model in the controller design to obtain a robust solution. Then, the stability analysis of the overall closed-loop system under the proposed scheme is provided to ensure the robust adaptive performance at the theoretical level. Finally, extensive simulation experiments are conducted to further verify the efficacy of the presented distributed control protocol.

Towards Attack-tolerant Federated Learning via Critical Parameter Analysis

  • paper_url: http://arxiv.org/abs/2308.09318
  • repo_url: https://github.com/sungwon-han/fedcpa
  • paper_authors: Sungwon Han, Sungwon Park, Fangzhao Wu, Sundong Kim, Bin Zhu, Xing Xie, Meeyoung Cha
  • for: Defending federated learning systems against poisoning attacks via Critical Parameter Analysis (FedCPA).
  • methods: An attack-tolerant aggregation method built on the observation that benign local models share similar sets of top-k and bottom-k critical parameters, whereas poisoned local models do not (a simplified version of this test is sketched after the abstract).
  • results: Experiments with different attack scenarios on multiple datasets show the model outperforms existing defense strategies in defending against poisoning attacks, including under non-IID data settings.
    Abstract Federated learning is used to train a shared model in a decentralized way without clients sharing private data with each other. Federated learning systems are susceptible to poisoning attacks when malicious clients send false updates to the central server. Existing defense strategies are ineffective under non-IID data settings. This paper proposes a new defense strategy, FedCPA (Federated learning with Critical Parameter Analysis). Our attack-tolerant aggregation method is based on the observation that benign local models have similar sets of top-k and bottom-k critical parameters, whereas poisoned local models do not. Experiments with different attack scenarios on multiple datasets demonstrate that our model outperforms existing defense strategies in defending against poisoning attacks.
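The paper's key observation can be sketched as a similarity test between client updates: benign updates share top-k/bottom-k critical coordinates, so a low Jaccard overlap flags a suspect. The overlap criterion below is a simplified stand-in for FedCPA's actual scoring.

```python
import torch

def critical_overlap(update_a, update_b, k=100):
    """Jaccard overlap of the top-k and bottom-k parameter indices of two
    flattened client updates; benign clients tend to agree, poisoned ones
    do not. Simplified from the paper's criterion."""
    def crit(u):
        top = set(u.topk(k).indices.tolist())        # k largest entries
        bot = set((-u).topk(k).indices.tolist())     # k smallest entries
        return top | bot
    a, b = crit(update_a), crit(update_b)
    return len(a & b) / len(a | b)

u1, u2 = torch.randn(10_000), torch.randn(10_000)
u2[:5_000] = u1[:5_000]                 # partially similar update
print(critical_overlap(u1, u2))
```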

Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms

  • paper_url: http://arxiv.org/abs/2308.09302
  • repo_url: https://github.com/ph-w2000/s2pecnet
  • paper_authors: Penghui Wen, Kun Hu, Wenxi Yue, Sen Zhang, Wanlei Zhou, Zhiyong Wang
  • for: Robust audio anti-spoofing in the face of increasingly capable audio deepfake techniques.
  • methods: S2pecNet, a deep learning method with a spectral fusion-reconstruction strategy: spectral patterns up to second order are fused in a coarse-to-fine manner, two branches perform fine-level fusion from the spectral and temporal contexts, and a reconstruction from the fused representation back to the input spectrograms reduces potential information loss.
  • results: State-of-the-art performance with an EER of 0.77% on the widely used ASVspoof2019 LA Challenge dataset.
    Abstract Robust audio anti-spoofing has become increasingly challenging due to recent advancements in deepfake techniques. While spectrograms have demonstrated their capability for anti-spoofing, complementary information presented in multi-order spectral patterns has not been well explored, which limits their effectiveness against varying spoofing attacks. Therefore, we propose a novel deep learning method with a spectral fusion-reconstruction strategy, namely S2pecNet, to utilise multi-order spectral patterns for robust audio anti-spoofing representations. Specifically, spectral patterns up to second-order are fused in a coarse-to-fine manner and two branches are designed for the fine-level fusion from the spectral and temporal contexts. A reconstruction from the fused representation to the input spectrograms further reduces the potential fused information loss. Our method achieved state-of-the-art performance with an EER of 0.77% on a widely used dataset: the ASVspoof2019 LA Challenge.

V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

  • paper_url: http://arxiv.org/abs/2308.09300
  • repo_url: None
  • paper_authors: Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, Weidong Cai
  • for: Generating semantically-relevant sound from visual input, using foundation models (FMs) to bridge the domain gap between the visual and auditory modalities.
  • methods: A simple yet effective mapper mechanism (V2A-Mapper) translates the visual input between the CLIP and CLAP spaces; conditioned on the translated CLAP embedding, the pretrained audio generative FM AudioLDM produces high-fidelity, visually-aligned sound.
  • results: Compared to previous approaches, the method achieves superior performance in both objective and subjective evaluations, with 53% and 19% improvements in fidelity and relevance, respectively, while using 86% fewer parameters.
    Abstract Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when audio modality is involved. On the other hand, automatically generating semantically-relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent space of the visual CLIP and the auditory CLAP models. Then we propose a simple yet effective mapper mechanism (V2A-Mapper) to bridge the domain gap by translating the visual input between CLIP and CLAP spaces. Conditioned on the translated CLAP embedding, pretrained audio generative FM AudioLDM is adopted to produce high-fidelity and visually-aligned sound. Compared to previous approaches, our method only requires a quick training of the V2A-Mapper. We further analyze and conduct extensive experiments on the choice of the V2A-Mapper and show that a generative mapper is better at fidelity and variability (FD) while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluation on two V2A datasets demonstrate the superiority of our proposed method compared to current state-of-the-art approaches - trained with 86% fewer parameters but achieving 53% and 19% improvement in FD and CS, respectively.
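The regression variant of such a mapper is just a small MLP from the CLIP image embedding to the CLAP space; AudioLDM then conditions on the output. Dimensions and depth below are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class V2AMapper(nn.Module):
    """Translate a CLIP image embedding into the CLAP audio-text space, after
    which a pretrained audio generator conditions on the result. The 512-d
    spaces and 2-layer design are illustrative."""
    def __init__(self, clip_dim=512, clap_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(clip_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, clap_dim))

    def forward(self, clip_emb):
        return self.net(clip_emb)

mapper = V2AMapper()
clip_emb = torch.randn(2, 512)          # stand-in CLIP image embeddings
clap_emb = mapper(clip_emb)             # would be fed to AudioLDM as conditioning
print(clap_emb.shape)
```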

How important are specialized transforms in Neural Operators?

  • paper_url: http://arxiv.org/abs/2308.09293
  • repo_url: https://github.com/Ritam-M/LearnableTransformsNO
  • paper_authors: Ritam Majumdar, Shirish Karande, Lovekesh Vig
  • for: To investigate how important the transform layers are to the reported success of transform-based neural operators for solving partial differential equations (PDEs).
  • methods: A simple ablation: every transform layer is replaced by a learnable linear layer, and the impact on performance and compute time is recorded (see the sketch after the abstract).
  • results: Learnable linear layers suffice to match the best-known transform-based layers, and appear to do so with a compute-time advantage, an observation that may point to other sources of efficiency for neural operator architectures.
    Abstract Simulating physical systems using Partial Differential Equations (PDEs) has become an indispensable part of modern industrial process optimization. Traditionally, numerical solvers have been used to solve the associated PDEs, however recently Transform-based Neural Operators such as the Fourier Neural Operator and Wavelet Neural Operator have received a lot of attention for their potential to provide fast solutions for systems of PDEs. In this work, we investigate the importance of the transform layers to the reported success of transform-based neural operators. In particular, we record the cost in terms of performance when all the transform layers are replaced by learnable linear layers. Surprisingly, we observe that linear layers suffice to provide performance comparable to the best-known transform-based layers and seem to do so with a compute time advantage as well. We believe that this observation can have significant implications for future work on Neural Operators, and might point to other sources of efficiencies for these architectures.
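The paper's ablation amounts to swapping a Fourier layer's transform for learnable linear mixing. Below is a sketch of such a replacement layer; the shapes and the split into spatial and channel mixing are illustrative.

```python
import torch
import torch.nn as nn

class LinearMixLayer(nn.Module):
    """Drop-in replacement of the kind studied in the paper: instead of
    FFT -> pointwise multiply -> inverse FFT (a Fourier layer), mix tokens
    with a learnable linear map along the spatial axis."""
    def __init__(self, n_points, channels):
        super().__init__()
        self.spatial_mix = nn.Linear(n_points, n_points, bias=False)
        self.channel_mix = nn.Linear(channels, channels)

    def forward(self, x):                # x: (batch, n_points, channels)
        x = self.spatial_mix(x.transpose(1, 2)).transpose(1, 2)
        return self.channel_mix(x)

layer = LinearMixLayer(n_points=64, channels=32)
print(layer(torch.randn(8, 64, 32)).shape)   # torch.Size([8, 64, 32])
```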

Graph-based Alignment and Uniformity for Recommendation

  • paper_url: http://arxiv.org/abs/2308.09292
  • repo_url: https://github.com/yangliangwei/graphau
  • paper_authors: Liangwei Yang, Zhiwei Liu, Chen Wang, Mingdai Yang, Xiaolong Liu, Jing Ma, Philip S. Yu
  • for: addressing the sparsity issue in collaborative filtering-based recommender systems (RecSys)
  • methods: proposes a novel approach called graph-based alignment and uniformity (GraphAU), which explicitly considers high-order connectivities in the user-item bipartite graph
  • results: significantly alleviates the sparsity issue and achieves state-of-the-art performance on four datasets, with the open-source code available at https://github.com/YangLiangwei/GraphAU.
  • for: 本研究旨在解决collaborative filtering-based recommender systems (RecSys)中的缺乏问题。
  • methods: 提出了一种新的方法,即基于图的对齐与均匀性(GraphAU),该方法显式考虑用户-物品二部图中的高阶连通性。
  • results: 在四个数据集上,GraphAU有效地解决了缺乏问题,并达到了当前最佳性能。代码可以在https://github.com/YangLiangwei/GraphAU中下载。
    Abstract Collaborative filtering-based recommender systems (RecSys) rely on learning representations for users and items to predict preferences accurately. Representation learning on the hypersphere is a promising approach due to its desirable properties, such as alignment and uniformity. However, the sparsity issue arises when it encounters RecSys. To address this issue, we propose a novel approach, graph-based alignment and uniformity (GraphAU), that explicitly considers high-order connectivities in the user-item bipartite graph. GraphAU aligns the user/item embedding to the dense vector representations of high-order neighbors using a neighborhood aggregator, eliminating the need to compute the burdensome alignment to high-order neighborhoods individually. To address the discrepancy in alignment losses, GraphAU includes a layer-wise alignment pooling module to integrate alignment losses layer-wise. Experiments on four datasets show that GraphAU significantly alleviates the sparsity issue and achieves state-of-the-art performance. We open-source GraphAU at https://github.com/YangLiangwei/GraphAU.
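A minimal sketch of the alignment/uniformity objectives on the hypersphere, extended in the GraphAU spirit: users are aligned to aggregated k-hop item-neighborhood representations, and the per-layer alignment losses are pooled. The adjacency construction and mean pooling here are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def alignment(x, y):
    """Alignment on the unit hypersphere: positive pairs should coincide."""
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    return (x - y).norm(dim=1).pow(2).mean()

def uniformity(x, t=2.0):
    """Uniformity: embeddings should spread out over the hypersphere."""
    x = F.normalize(x, dim=-1)
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

def graph_alignment(user_emb, ui_adj, ii_adj, item_emb, layers=2):
    """GraphAU-style idea (sketch): align users to aggregated k-hop item
    neighborhoods, pooling the layer-wise alignment losses (mean here)."""
    losses, h = [], item_emb
    for _ in range(layers):
        losses.append(alignment(user_emb, ui_adj @ h))  # align to k-hop aggregate
        h = ii_adj @ h                                  # push one hop further
    return torch.stack(losses).mean()

users = torch.randn(64, 32, requires_grad=True)
items = torch.randn(100, 32)
ui = torch.rand(64, 100); ui = ui / ui.sum(1, keepdim=True)   # user->item adjacency
ii = torch.rand(100, 100); ii = ii / ii.sum(1, keepdim=True)  # item co-occurrence graph
loss = graph_alignment(users, ui, ii, items) + 0.5 * uniformity(users)
loss.backward()
```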

HyperLoRA for PDEs

  • paper_url: http://arxiv.org/abs/2308.09290
  • repo_url: None
  • paper_authors: Ritam Majumdar, Vishal Jadhav, Anirudh Deodhar, Shirish Karande, Lovekesh Vig, Venkataramana Runkana
  • for: 用于求解参数化偏微分方程的神经代理模型问题
  • methods: 使用 Hypernetwork 和 low-ranked adaptation (LoRA) 技术,将每层基本网络转化为低维度的tensor,并使用 hypernetworks 预测这些tensor的参数
  • results: 通过添加物理信息损失(HyperPINN)进行训练,可以快速学习参数化偏微分方程(如Burgers方程和Navier-Stokes Kovasznay流)的解,并在平均上减少8倍的预测参数量,而不损失准确性。
    Abstract Physics-informed neural networks (PINNs) have been widely used to develop neural surrogates for solutions of Partial Differential Equations. A drawback of PINNs is that they have to be retrained with every change in initial-boundary conditions and PDE coefficients. The Hypernetwork, a model-based meta learning technique, takes in a parameterized task embedding as input and predicts the weights of PINN as output. Predicting weights of a neural network, however, is a high-dimensional regression problem, and hypernetworks perform sub-optimally while predicting parameters for large base networks. To circumvent this issue, we use a low-rank adaptation (LoRA) formulation to decompose every layer of the base network into low-ranked tensors and use hypernetworks to predict the low-ranked tensors. Despite the reduced dimensionality of the resulting weight-regression problem, LoRA-based Hypernetworks violate the underlying physics of the given task. We demonstrate that the generalization capabilities of LoRA-based hypernetworks drastically improve when trained with an additional physics-informed loss component (HyperPINN) to satisfy the governing differential equations. We observe that LoRA-based HyperPINN training allows us to learn fast solutions for parameterized PDEs like Burgers' equation and Navier-Stokes Kovasznay flow, while having an 8x reduction in prediction parameters on average without compromising on accuracy when compared to all other baselines.
    摘要 物理信息神经网络(PINNs)已被广泛用于构建偏微分方程解的神经代理模型。PINNs的一个缺点是,每当初始/边界条件或PDE系数改变时都必须重新训练。超网络(Hypernetwork)是一种基于模型的元学习技术,它以参数化的任务嵌入为输入,输出PINN的权重。然而,预测神经网络权重是一个高维回归问题,超网络在为大型基础网络预测参数时表现欠佳。为解决这一问题,我们采用低秩适配(LoRA)的形式,将基础网络的每一层分解为低秩张量,并用超网络预测这些低秩张量。尽管由此得到的权重回归问题维度降低,基于LoRA的超网络仍会违背给定任务的物理规律。我们证明,在训练中加入物理信息损失项(HyperPINN)以满足控制微分方程后,基于LoRA的超网络的泛化能力显著提升。我们观察到,基于LoRA的HyperPINN训练能够快速学习参数化PDE(如Burgers方程和Navier-Stokes Kovasznay流)的解,与所有其他基线相比平均减少8倍的预测参数,且不损失准确性。
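A compact sketch of the HyperLoRA idea: a hypernetwork maps a task (PDE-parameter) embedding to low-rank LoRA factors for a frozen base layer, so only rank*(d_in + d_out) numbers are regressed instead of d_in*d_out. All dimensions and the hypernetwork shape are illustrative assumptions; the physics-informed loss HyperPINN adds on top is omitted.

```python
import torch
import torch.nn as nn

class HyperLoRALayer(nn.Module):
    """Sketch: a hypernetwork predicts low-rank factors (A, B) for one
    frozen base layer, conditioned on a task embedding."""
    def __init__(self, task_dim=8, d_in=64, d_out=64, rank=4):
        super().__init__()
        self.rank, self.d_in, self.d_out = rank, d_in, d_out
        self.base = nn.Linear(d_in, d_out)           # frozen PINN layer W0
        for p in self.base.parameters():
            p.requires_grad_(False)
        # Predict rank*(d_in + d_out) numbers instead of d_in*d_out weights.
        self.hyper = nn.Sequential(
            nn.Linear(task_dim, 128), nn.Tanh(),
            nn.Linear(128, rank * (d_in + d_out)),
        )

    def forward(self, x, task):
        flat = self.hyper(task)
        A = flat[: self.rank * self.d_in].view(self.rank, self.d_in)
        B = flat[self.rank * self.d_in:].view(self.d_out, self.rank)
        return self.base(x) + x @ A.t() @ B.t()      # W0 x + B A x

layer = HyperLoRALayer()
x = torch.randn(16, 64)        # features at collocation points
task = torch.randn(8)          # e.g. an embedded viscosity for Burgers' equation
out = layer(x, task)           # (16, 64); a physics-informed loss would go on top
```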

Preference-conditioned Pixel-based AI Agent For Game Testing

  • paper_url: http://arxiv.org/abs/2308.09289
  • repo_url: None
  • paper_authors: Sherif Abdelfattah, Adrian Brown, Pushi Zhang
  • for: This paper aims to improve game testing AI agents’ ability to explore and test games with high quality and efficiency, addressing the limitations of current methods that rely on game state information and lack explicit control over exploration style.
  • methods: The proposed agent design uses pixel-based state observations and imitation learning with coupled self-supervised and supervised learning objectives to improve exploration coverage and test execution quality.
  • results: The proposed agent significantly outperforms state-of-the-art pixel-based game testing agents in exploration coverage and test execution quality when evaluated on a complex open-world environment resembling many aspects of real AAA games.
    Abstract The game industry is challenged to cope with increasing growth in demand and game complexity while maintaining acceptable quality standards for released games. Classic approaches solely depending on human efforts for quality assurance and game testing do not scale effectively in terms of time and cost. Game-testing AI agents that learn by interaction with the environment have the potential to mitigate these challenges with good scalability properties on time and costs. However, most recent work in this direction depends on game state information for the agent's state representation, which limits generalization across different game scenarios. Moreover, game test engineers usually prefer exploring a game in a specific style, such as exploring the golden path. However, current game testing AI agents do not provide an explicit way to satisfy such a preference. This paper addresses these limitations by proposing an agent design that mainly depends on pixel-based state observations while exploring the environment conditioned on a user's preference specified by demonstration trajectories. In addition, we propose an imitation learning method that couples self-supervised and supervised learning objectives to enhance the quality of imitation behaviors. Our agent significantly outperforms state-of-the-art pixel-based game testing agents over exploration coverage and test execution quality when evaluated on a complex open-world environment resembling many aspects of real AAA games.
    摘要 游戏产业面临需求增长与游戏复杂度提升的双重挑战,同时还要为发布的游戏维持可接受的质量标准。仅依靠人工进行质量保证与游戏测试的传统方法,在时间与成本上都难以有效扩展。通过与环境交互学习的游戏测试AI智能体在时间与成本上具有良好的可扩展性,有望缓解这些挑战。然而,该方向的多数近期工作依赖游戏状态信息来表示智能体的状态,限制了其在不同游戏场景间的泛化能力。此外,游戏测试工程师通常偏好以特定风格探索游戏,例如沿主线路径(golden path)探索,而现有的游戏测试AI智能体并未提供满足这种偏好的显式方式。本文针对这些局限,提出一种主要依赖像素状态观测的智能体设计,并以用户通过示范轨迹指定的偏好为条件来探索环境。此外,我们提出一种将自监督与监督学习目标相结合的模仿学习方法,以提升模仿行为的质量。在一个高度还原真实AAA游戏诸多特征的复杂开放世界环境中,我们的智能体在探索覆盖率与测试执行质量上显著优于最先进的像素级游戏测试智能体。

Enhancing Reasoning Capabilities of Large Language Models: A Graph-Based Verification Approach

  • paper_url: http://arxiv.org/abs/2308.09267
  • repo_url: None
  • paper_authors: Lang Cao
  • for: 这篇论文旨在提高大语言模型(LLM)的推理能力,尤其是在数学应用题等复杂推理任务中。
  • methods: 论文提出一种基于图的方法来增强LLM的推理能力,具体而言,是将LLM生成的多个解法整合为一个推理图,以便对这些解法进行分析和验证。
  • results: 实验结果表明,这种基于图的验证方法不仅能显著提升LLM的推理能力,而且在改进模型推理性能方面优于现有的验证方法。
    Abstract Large Language Models (LLMs) have showcased impressive reasoning capabilities, particularly when guided by specifically designed prompts in complex reasoning tasks such as math word problems. These models typically solve tasks using a chain-of-thought approach, which not only bolsters their reasoning abilities but also provides valuable insights into their problem-solving process. However, there is still significant room for enhancing the reasoning abilities of LLMs. Some studies suggest that the integration of an LLM output verifier can boost reasoning accuracy without necessitating additional model training. In this paper, we follow these studies and introduce a novel graph-based method to further augment the reasoning capabilities of LLMs. We posit that multiple solutions to a reasoning task, generated by an LLM, can be represented as a reasoning graph due to the logical connections between intermediate steps from different reasoning paths. Therefore, we propose the Reasoning Graph Verifier (RGV) to analyze and verify the solutions generated by LLMs. By evaluating these graphs, models can yield more accurate and reliable results.Our experimental results show that our graph-based verification method not only significantly enhances the reasoning abilities of LLMs but also outperforms existing verifier methods in terms of improving these models' reasoning performance.
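The paper's Reasoning Graph Verifier is a learned model; the sketch below is only a heuristic stand-in that illustrates the underlying idea: merge intermediate steps shared across sampled chain-of-thought solutions into one graph, then prefer the answer whose path is most reinforced by the others.

```python
import networkx as nx
from collections import Counter

def verify_with_reasoning_graph(solutions):
    """Heuristic sketch of the reasoning-graph idea.
    `solutions` is a list of (steps, answer); steps are normalized strings."""
    g = nx.DiGraph()
    for steps, answer in solutions:
        nodes = steps + [f"ANS:{answer}"]
        for a, b in zip(nodes, nodes[1:]):
            w = g.get_edge_data(a, b, {"w": 0})["w"]
            g.add_edge(a, b, w=w + 1)       # shared steps accumulate edge weight
    scores = Counter()
    for steps, answer in solutions:
        nodes = steps + [f"ANS:{answer}"]
        scores[answer] += sum(g[a][b]["w"] for a, b in zip(nodes, nodes[1:]))
    return scores.most_common(1)[0][0]      # answer on the best-supported path

sols = [(["8*3=24", "24+4=28"], "28"),
        (["8*3=24", "24+4=28"], "28"),
        (["8+3=11", "11+4=15"], "15")]
print(verify_with_reasoning_graph(sols))    # -> "28"
```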

Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos

  • paper_url: http://arxiv.org/abs/2308.09247
  • repo_url: None
  • paper_authors: Xiaoxiao Sheng, Zhiqiang Shen, Gang Xiao, Longguang Wang, Yulan Guo, Hehe Fan
  • for: 本文提出了一种点云视频自监督学习框架,用于以物体为中心和以场景为中心数据的表示学习。以往方法通常在片段或帧级进行表示学习,所得表示往往难以捕捉细粒度语义。
  • methods: 本文通过在点级进行对比学习来捕捉细粒度语义,并引入一种新的预文本任务,即实现超点(superpoint)的语义对齐,以便更好地捕捉多尺度语义。
  • results: 实验表明,我们的方法在多种下游任务上优于有监督的对应方法,并展现了所学表示的可迁移性。
    Abstract We propose a unified point cloud video self-supervised learning framework for object-centric and scene-centric data. Previous methods commonly conduct representation learning at the clip or frame level and cannot well capture fine-grained semantics. Instead of contrasting the representations of clips or frames, in this paper, we propose a unified self-supervised framework by conducting contrastive learning at the point level. Moreover, we introduce a new pretext task by achieving semantic alignment of superpoints, which further facilitates the representations to capture semantic cues at multiple scales. In addition, due to the high redundancy in the temporal dimension of dynamic point clouds, directly conducting contrastive learning at the point level usually leads to massive undesired negatives and insufficient modeling of positive representations. To remedy this, we propose a selection strategy to retain proper negatives and make use of high-similarity samples from other instances as positive supplements. Extensive experiments show that our method outperforms supervised counterparts on a wide range of downstream tasks and demonstrates the superior transferability of the learned representations.
    摘要 我们提出了一种统一的点云视频自监督学习框架,适用于以物体为中心和以场景为中心的数据。以往方法通常在片段或帧级进行表示学习,难以捕捉细粒度语义。与对比片段或帧的表示不同,本文提出在点级进行对比学习的统一自监督框架。此外,我们引入一种新的预文本任务,通过实现超点的语义对齐,进一步促使表示捕捉多尺度的语义线索。另外,由于动态点云在时间维度上高度冗余,直接在点级进行对比学习通常会产生大量不期望的负样本,并且对正样本表示的建模不足。为此,我们提出一种筛选策略来保留合适的负样本,并利用来自其他实例的高相似度样本作为正样本补充。大量实验表明,我们的方法在广泛的下游任务上优于有监督的对应方法,并展现了所学表示的优越可迁移性。

A Robust Policy Bootstrapping Algorithm for Multi-objective Reinforcement Learning in Non-stationary Environments

  • paper_url: http://arxiv.org/abs/2308.09734
  • repo_url: None
  • paper_authors: Sherif Abdelfattah, Kathryn Kasmarik, Jiankun Hu
  • for: 这篇论文主要旨在解决多目标马尔可夫决策过程中的多目标优化问题,这类问题涉及满足马尔可夫性的随机过程中的序贯决策。
  • methods: 论文采用多目标强化学习方法,将强化学习范式与多目标优化技术相结合;但现有方法缺乏对非平稳环境的适应能力。
  • results: 论文提出一种发展式优化方法,能够在线演化策略覆盖集,同时在既定目标上探索偏好空间。该算法在非平稳环境中表现出明显优势,并在平稳环境中取得可比的结果。
    Abstract Multi-objective Markov decision processes are a special kind of multi-objective optimization problem that involves sequential decision making while satisfying the Markov property of stochastic processes. Multi-objective reinforcement learning methods address this problem by fusing the reinforcement learning paradigm with multi-objective optimization techniques. One major drawback of these methods is the lack of adaptability to non-stationary dynamics in the environment. This is because they adopt optimization procedures that assume stationarity to evolve a coverage set of policies that can solve the problem. This paper introduces a developmental optimization approach that can evolve the policy coverage set while exploring the preference space over the defined objectives in an online manner. We propose a novel multi-objective reinforcement learning algorithm that can robustly evolve a convex coverage set of policies in an online manner in non-stationary environments. We compare the proposed algorithm with two state-of-the-art multi-objective reinforcement learning algorithms in stationary and non-stationary environments. Results showed that the proposed algorithm significantly outperforms the existing algorithms in non-stationary environments while achieving comparable results in stationary environments.
    摘要 多目标马尔可夫决策过程是一类特殊的多目标优化问题,涉及满足随机过程马尔可夫性的序贯决策。多目标强化学习方法将强化学习范式与多目标优化技术融合来解决此类问题,但其一大缺陷是无法适应环境中的非平稳动态,因为它们采用假设平稳性的优化过程来演化可求解问题的策略覆盖集。本文提出一种发展式优化方法,能够在线演化策略覆盖集,同时在既定目标上探索偏好空间。我们提出一种新的多目标强化学习算法,可以在非平稳环境中在线稳健地演化凸策略覆盖集,并在平稳与非平稳环境中与两种最先进的多目标强化学习算法进行了比较。结果表明,所提算法在非平稳环境中显著优于现有算法,同时在平稳环境中取得可比的结果。

Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos

  • paper_url: http://arxiv.org/abs/2308.09245
  • repo_url: https://github.com/johnsonsign/mast-pre
  • paper_authors: Zhiqiang Shen, Xiaoxiao Sheng, Hehe Fan, Longguang Wang, Yulan Guo, Qiong Liu, Hao Wen, Xi Zhou
  • for: 本研究旨在提出一种无需人工标注的点云视频理解方法,以捕捉点云视频中的时空结构。
  • methods: 该方法基于时空点管(point tube)掩码以及两种自监督学习任务。首先,通过重建被掩码的点管,方法能够捕捉点云视频的外观信息;其次,为了学习运动,我们提出了时间基数差预测任务,用于估计点管内点数的变化。
  • results: 我们在MSRAction-3D、NTU-RGBD、NvGesture和SHREC'17等数据集上进行了广泛的实验,验证了所提方法的有效性。
    Abstract Recently, the community has made tremendous progress in developing effective methods for point cloud video understanding that learn from massive amounts of labeled data. However, annotating point cloud videos is usually notoriously expensive. Moreover, training via one or only a few traditional tasks (e.g., classification) may be insufficient to learn subtle details of the spatio-temporal structure existing in point cloud videos. In this paper, we propose a Masked Spatio-Temporal Structure Prediction (MaST-Pre) method to capture the structure of point cloud videos without human annotations. MaST-Pre is based on spatio-temporal point-tube masking and consists of two self-supervised learning tasks. First, by reconstructing masked point tubes, our method is able to capture the appearance information of point cloud videos. Second, to learn motion, we propose a temporal cardinality difference prediction task that estimates the change in the number of points within a point tube. In this way, MaST-Pre is forced to model the spatial and temporal structure in point cloud videos. Extensive experiments on MSRAction-3D, NTU-RGBD, NvGesture, and SHREC'17 demonstrate the effectiveness of the proposed method.
    摘要 近来,社区在点云视频理解方面取得了巨大进步,这些方法依赖大量标注数据进行学习。然而,标注点云视频通常代价高昂。此外,仅通过一个或少数传统任务(例如分类)训练,可能不足以学习点云视频中细微的时空结构。本文提出一种掩码时空结构预测(MaST-Pre)方法,无需人工标注即可捕捉点云视频的结构。MaST-Pre基于时空点管掩码,包含两个自监督学习任务。首先,通过重建被掩码的点管,我们的方法能够捕捉点云视频的外观信息;其次,为了学习运动,我们提出时间基数差预测任务,估计点管内点数的变化。借此,MaST-Pre被迫建模点云视频的空间与时间结构。在MSRAction-3D、NTU-RGBD、NvGesture和SHREC'17上的大量实验验证了所提方法的有效性。
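The temporal cardinality difference target is easy to state concretely: count how many points fall inside a tube's cross-section in consecutive frames and take the difference. The numpy sketch below, on synthetic points, shows only the target computation; in MaST-Pre the model must predict this quantity from masked input, and the tube geometry here is an assumption.

```python
import numpy as np

def temporal_cardinality_diff(points_t0, points_t1, center, radius):
    """Pretext target (sketch): change in the number of points inside one
    point tube's spherical cross-section between two frames.
    points_*: (N, 3) arrays; center/radius define the tube cross-section."""
    def count_inside(pts):
        return int((np.linalg.norm(pts - center, axis=1) < radius).sum())
    return count_inside(points_t1) - count_inside(points_t0)

rng = np.random.default_rng(0)
f0 = rng.normal(size=(1024, 3))                   # frame t
f1 = f0 + rng.normal(scale=0.1, size=f0.shape)    # frame t+1 with motion noise
print(temporal_cardinality_diff(f0, f1, center=np.zeros(3), radius=0.5))
```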

Intrinsically Motivated Hierarchical Policy Learning in Multi-objective Markov Decision Processes

  • paper_url: http://arxiv.org/abs/2308.09733
  • repo_url: None
  • paper_authors: Sherif Abdelfattah, Kathryn Merrick, Jiankun Hu
  • for: solve multi-objective Markov decision processes in non-stationary environments
  • methods: intrinsically motivated reinforcement learning with dual-phase learning
  • results: significantly outperforms state-of-the-art multi-objective reinforcement methods in a dynamic robotics environment
  • for: 解决非平稳环境中的多目标马尔可夫决策过程
  • methods: 使用内在动机驱动的强化学习方法,采用双阶段学习
  • results: 在动态机器人环境中显著优于最先进的多目标强化学习方法
    Abstract Multi-objective Markov decision processes are sequential decision-making problems that involve multiple conflicting reward functions that cannot be optimized simultaneously without a compromise. This type of problems cannot be solved by a single optimal policy as in the conventional case. Alternatively, multi-objective reinforcement learning methods evolve a coverage set of optimal policies that can satisfy all possible preferences in solving the problem. However, many of these methods cannot generalize their coverage sets to work in non-stationary environments. In these environments, the parameters of the state transition and reward distribution vary over time. This limitation results in significant performance degradation for the evolved policy sets. In order to overcome this limitation, there is a need to learn a generic skill set that can bootstrap the evolution of the policy coverage set for each shift in the environment dynamics therefore, it can facilitate a continuous learning process. In this work, intrinsically motivated reinforcement learning has been successfully deployed to evolve generic skill sets for learning hierarchical policies to solve multi-objective Markov decision processes. We propose a novel dual-phase intrinsically motivated reinforcement learning method to address this limitation. In the first phase, a generic set of skills is learned. While in the second phase, this set is used to bootstrap policy coverage sets for each shift in the environment dynamics. We show experimentally that the proposed method significantly outperforms state-of-the-art multi-objective reinforcement methods in a dynamic robotics environment.
    摘要 多目标马尔可夫决策过程是一类序贯决策问题,涉及多个相互冲突、无法同时优化而必须折衷的奖励函数。与常规情形不同,这类问题无法由单一最优策略解决。多目标强化学习方法转而演化一个最优策略覆盖集,以满足求解问题时所有可能的偏好。然而,其中许多方法无法将覆盖集泛化到非平稳环境:在这类环境中,状态转移和奖励分布的参数会随时间变化,导致已演化的策略集性能显著退化。为克服这一局限,需要学习一个通用技能集,在环境动态每次变化时引导策略覆盖集的演化,从而支持持续学习过程。在这项工作中,内在动机强化学习被成功用于演化通用技能集,以学习求解多目标马尔可夫决策过程的分层策略。我们提出一种新颖的双阶段内在动机强化学习方法:第一阶段学习通用技能集;第二阶段利用该技能集,在环境动态每次变化时引导策略覆盖集的演化。实验表明,所提方法在动态机器人环境中显著优于最先进的多目标强化学习方法。

Digital Twin-Oriented Complex Networked Systems based on Heterogeneous node features and interaction rules

  • paper_url: http://arxiv.org/abs/2308.11034
  • repo_url: None
  • paper_authors: Jiaqi Wen, Bogdan Gabrys, Katarzyna Musial
  • for: This paper proposes a modelling framework for Digital Twin-Oriented Complex Networked Systems (DT-CNSs) to generate networks that faithfully represent real systems.
  • methods: The modelling process focuses on features of nodes and interaction rules for creating connections based on individual node preferences.
  • results: The paper presents a case study on disaster resilience of social networks during an epidemic outbreak, showing how different levels of structural and dynamics complexities influence network growth and epidemic spread. The analysis reveals that mitigation policies should target nodes with preferred features, as they have higher infection risks and should be the focus of epidemic control.
  • for: 这篇论文提出了一种面向数字孪生的复杂网络系统(DT-CNS)建模框架,以生成能够忠实刻画真实系统的网络。
  • methods: 建模过程关注节点特征,以及基于个体节点偏好建立连接的交互规则。
  • results: 论文通过一个案例研究,探讨疫情爆发时社交网络的抗灾韧性,发现不同层次的结构与动态复杂性会影响网络增长和疫情传播。分析表明,防控政策应针对具有偏好特征的节点,因为这些节点感染风险更高,应成为疫情控制的重点。
    Abstract This study proposes an extendable modelling framework for Digital Twin-Oriented Complex Networked Systems (DT-CNSs) with a goal of generating networks that faithfully represent real systems. Modelling process focuses on (i) features of nodes and (ii) interaction rules for creating connections that are built based on individual node's preferences. We conduct experiments on simulation-based DT-CNSs that incorporate various features and rules about network growth and different transmissibilities related to an epidemic spread on these networks. We present a case study on disaster resilience of social networks given an epidemic outbreak by investigating the infection occurrence within specific time and social distance. The experimental results show how different levels of the structural and dynamics complexities, concerned with feature diversity and flexibility of interaction rules respectively, influence network growth and epidemic spread. The analysis revealed that, to achieve maximum disaster resilience, mitigation policies should be targeted at nodes with preferred features as they have higher infection risks and should be the focus of the epidemic control.
    摘要 本研究提出了一个可扩展的建模框架,用于面向数字孪生的复杂网络系统(DT-CNS),旨在生成能够忠实刻画真实系统的网络。建模过程关注(i)节点特征和(ii)基于个体节点偏好建立连接的交互规则。我们在基于仿真的DT-CNS上开展实验,纳入了关于网络增长的多种特征与规则,以及疫情在这些网络上传播的不同传染性。我们给出一个案例研究,通过考察特定时间与社交距离内的感染情况,分析疫情爆发下社交网络的抗灾韧性。实验结果显示,分别对应特征多样性与交互规则灵活性的结构复杂度和动态复杂度,会在不同程度上影响网络增长与疫情传播。分析表明,要实现最大的抗灾韧性,防控政策应针对具有偏好特征的节点,因为这些节点感染风险更高,应成为疫情控制的重点。

Improving Buoy Detection with Deep Transfer Learning for Mussel Farm Automation

  • paper_url: http://arxiv.org/abs/2308.09238
  • repo_url: None
  • paper_authors: Carl McMillan, Junhong Zhao, Bing Xue, Ross Vennell, Mengjie Zhang
  • for: 提升贻贝养殖场运营的效率,以及监测与管理的准确性和鲁棒性
  • methods: 使用人工智能和计算机视觉技术,包括深度学习方法和目标检测
  • results: 通过迁移学习和数据多样性,显著提升了浮标检测性能,并在不同天气和光照条件下保持良好的泛化能力
    Abstract The aquaculture sector in New Zealand is experiencing rapid expansion, with a particular emphasis on mussel exports. As the demands of mussel farming operations continue to evolve, the integration of artificial intelligence and computer vision techniques, such as intelligent object detection, is emerging as an effective approach to enhance operational efficiency. This study delves into advancing buoy detection by leveraging deep learning methodologies for intelligent mussel farm monitoring and management. The primary objective centers on improving accuracy and robustness in detecting buoys across a spectrum of real-world scenarios. A diverse dataset sourced from mussel farms is captured and labeled for training, encompassing imagery taken from cameras mounted on both floating platforms and traversing vessels, capturing various lighting and weather conditions. To establish an effective deep learning model for buoy detection with a limited number of labeled data, we employ transfer learning techniques. This involves adapting a pre-trained object detection model to create a specialized deep learning buoy detection model. We explore different pre-trained models, including YOLO and its variants, alongside data diversity to investigate their effects on model performance. Our investigation demonstrates a significant enhancement in buoy detection performance through deep learning, accompanied by improved generalization across diverse weather conditions, highlighting the practical effectiveness of our approach.
    摘要 新西兰水产养殖业正在迅速扩张,贻贝出口尤为突出。随着贻贝养殖作业需求的不断演变,集成人工智能与计算机视觉技术(如智能目标检测)正成为提升运营效率的有效途径。本研究探讨利用深度学习方法改进浮标检测,以支持智能化的贻贝养殖场监测与管理,主要目标是在各种真实场景下提升浮标检测的准确性和鲁棒性。我们从贻贝养殖场采集并标注了多样化的数据集用于训练,图像来自安装在浮动平台和航行船只上的相机,覆盖多种光照和天气条件。为在标注数据有限的情况下建立有效的深度学习浮标检测模型,我们采用迁移学习技术,将预训练的目标检测模型适配为专用的浮标检测模型。我们考察了包括YOLO及其变体在内的多种预训练模型,并结合数据多样性研究其对模型性能的影响。结果表明,深度学习显著提升了浮标检测性能,并在多种天气条件下具有更好的泛化能力,验证了该方法的实际有效性。
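The transfer-learning recipe, starting from a pretrained detector and retraining a small task-specific head on the limited labeled buoy set, can be sketched as follows. The paper evaluates YOLO variants; this sketch uses torchvision's Faster R-CNN purely to illustrate the head-swap mechanics, and the dummy image and box are placeholders.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a COCO-pretrained detector and retrain only what the small
# labeled buoy set can support: a 2-class head (background + buoy).
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_feats = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, num_classes=2)

# One illustrative training step on dummy data.
model.train()
images = [torch.rand(3, 480, 640)]                       # stand-in mussel-farm frame
targets = [{"boxes": torch.tensor([[100., 120., 160., 180.]]),
            "labels": torch.tensor([1])}]                # 1 = buoy
losses = model(images, targets)                          # dict of detection losses
total = sum(losses.values())
total.backward()
```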

Advancing Relation Extraction through Language Probing with Exemplars from Set Co-Expansion

  • paper_url: http://arxiv.org/abs/2308.11720
  • repo_url: None
  • paper_authors: Yerong Li, Roxana Girju
  • for: 提高关系分类的准确率,并减少相近类别之间的混淆
  • methods: 融合代表性样例,并通过共集扩展(co-set expansion)、上下文无关的Hearst模式以及对比样例调优来增强训练
  • results: 实验结果表明,所提方法能显著提高关系抽取的准确率,同时减少相近类别之间的混淆
    Abstract Relation Extraction (RE) is a pivotal task in automatically extracting structured information from unstructured text. In this paper, we present a multi-faceted approach that integrates representative examples and through co-set expansion. The primary goal of our method is to enhance relation classification accuracy and mitigating confusion between contrastive classes. Our approach begins by seeding each relationship class with representative examples. Subsequently, our co-set expansion algorithm enriches training objectives by incorporating similarity measures between target pairs and representative pairs from the target class. Moreover, the co-set expansion process involves a class ranking procedure that takes into account exemplars from contrastive classes. Contextual details encompassing relation mentions are harnessed via context-free Hearst patterns to ascertain contextual similarity. Empirical evaluation demonstrates the efficacy of our co-set expansion approach, resulting in a significant enhancement of relation classification performance. Our method achieves an observed margin of at least 1 percent improvement in accuracy in most settings, on top of existing fine-tuning approaches. To further refine our approach, we conduct an in-depth analysis that focuses on tuning contrastive examples. This strategic selection and tuning effectively reduce confusion between classes sharing similarities, leading to a more precise classification process. Experimental results underscore the effectiveness of our proposed framework for relation extraction. The synergy between co-set expansion and context-aware prompt tuning substantially contributes to improved classification accuracy. Furthermore, the reduction in confusion between contrastive classes through contrastive examples tuning validates the robustness and reliability of our method.
    摘要 关系抽取(RE)是从非结构化文本中自动抽取结构化信息的关键任务。本文提出一种多方面的方法,融合代表性样例并进行共集扩展。我们方法的首要目标是提升关系分类准确率,并缓解相近类别之间的混淆。方法首先为每个关系类别设置代表性样例;随后,共集扩展算法在训练目标中纳入目标样例对与目标类别代表样例对之间的相似度度量。此外,共集扩展过程还包含一个类别排序步骤,将对比类别中的样例纳入考量。我们借助上下文无关的Hearst模式,利用关系提及周围的上下文细节来度量上下文相似度。实证评估表明,共集扩展方法行之有效,显著提升了关系分类性能:在多数设置下,我们的方法在现有微调方法之上至少再提升1个百分点的准确率。为进一步完善该方法,我们开展了聚焦于对比样例调优的深入分析;这种有策略的选择与调优有效减少了相似类别之间的混淆,带来更精确的分类过程。实验结果印证了所提关系抽取框架的有效性:共集扩展与上下文感知提示调优的协同显著提升了分类准确率,而通过对比样例调优减少对比类别间的混淆,也验证了方法的稳健性与可靠性。

Baird Counterexample Is Solved: with an example of How to Debug a Two-time-scale Algorithm

  • paper_url: http://arxiv.org/abs/2308.09732
  • repo_url: None
  • paper_authors: Hengshuai Yao
  • for: 测试和比较离策略(off-policy)学习算法的效果
  • methods: 使用梯度TD(Gradient TD)算法以及双时间尺度算法的调试分析
  • results: 解释了TDC算法在该例子上收敛缓慢的原因,提供了一种可用于研究双时间尺度随机逼近算法收敛行为的调试技术,并给出了Impression GTD算法在该例子上快速(线性速率)收敛的实验结果
    Abstract Baird counterexample was proposed by Leemon Baird in 1995, first used to show that the Temporal Difference (TD(0)) algorithm diverges on this example. Since then, it is often used to test and compare off-policy learning algorithms. Gradient TD algorithms solved the divergence issue of TD on Baird counterexample. However, their convergence on this example is still very slow, and the nature of the slowness is not well understood, e.g., see (Sutton and Barto 2018). This note aims to understand, in particular, why TDC is slow on this example, and provides debugging analysis to explain this behavior. Our debugging technique can be used to study the convergence behavior of two-time-scale stochastic approximation algorithms. We also provide empirical results of the recent Impression GTD algorithm on this example, showing the convergence is very fast, in fact, at a linear rate. We conclude that Baird counterexample is solved, by an algorithm with a convergence guarantee to the TD solution in general and a fast convergence rate.
    摘要 Baird反例由Leemon Baird于1995年提出,最初用于证明时间差分算法TD(0)在该例子上发散。此后,它经常被用来测试和比较离策略学习算法。梯度TD算法解决了TD在Baird反例上的发散问题,但它们在该例子上的收敛仍然非常缓慢,而且这种缓慢的本质尚未被很好地理解(例如参见Sutton和Barto 2018)。本文旨在理解TDC在该例子上为何缓慢,并通过调试分析来解释这一行为。我们的调试技术可用于研究双时间尺度随机逼近算法的收敛行为。我们还给出了近期的Impression GTD算法在该例子上的实验结果,表明其收敛非常快,实际上达到线性速率。我们由此得出结论:Baird反例已被一个对TD解具有一般性收敛保证且收敛速率快的算法所解决。
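For readers unfamiliar with the setup, here is a self-contained numpy sketch of Baird's 7-state counterexample with TDC updates. The step sizes and sampled behavior policy follow common textbook choices and are assumptions, not the note's exact protocol; the point is that the two-time-scale TDC update keeps the value estimates shrinking toward zero, slowly, where TD(0) would diverge from this initialization.

```python
import numpy as np

# Baird's 7-state counterexample: all rewards are 0, the target policy
# always takes the "solid" action into state 7, and the behavior policy
# takes the "dashed" action (uniform over states 1-6) with probability 6/7.
gamma, n_states = 0.99, 7
phi = np.zeros((n_states, 8))
for s in range(6):                       # states 1-6: features 2*e_s + e_8
    phi[s, s], phi[s, 7] = 2.0, 1.0
phi[6, 6], phi[6, 7] = 1.0, 2.0          # state 7: e_7 + 2*e_8

theta = np.ones(8); theta[6] = 10.0      # the classic diverging initialization
w = np.zeros(8)                          # TDC's secondary weight vector
alpha, beta = 0.005, 0.05                # two time scales: beta >> alpha
rng = np.random.default_rng(0)

for _ in range(200_000):
    s = rng.integers(n_states)           # behavior's stationary visitation is uniform
    solid = rng.random() < 1 / 7
    s_next = 6 if solid else rng.integers(6)
    rho = 7.0 if solid else 0.0          # importance ratio pi(a|s) / b(a|s)
    x, x_next = phi[s], phi[s_next]
    delta = gamma * theta @ x_next - theta @ x      # reward is always 0
    theta += alpha * rho * (delta * x - gamma * x_next * (w @ x))
    w += beta * rho * (delta - w @ x) * x

print(np.abs(phi @ theta).max())         # max value estimate, driven toward 0
```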

Learning in Cooperative Multiagent Systems Using Cognitive and Machine Models

  • paper_url: http://arxiv.org/abs/2308.09219
  • repo_url: https://github.com/ddm-lab/greedy-hysteretic-lenient-maibl
  • paper_authors: Thuy Ngoc Nguyen, Duy Nhat Phan, Cleotilde Gonzalez
  • for: 本研究旨在开发有效的多智能体系统(MAS),以满足协作和协调人类的多种应用。
  • methods: 本研究提出三种变体的多智能IBLT模型(MAIBL),结合了IBLT的认知机制和MADRL模型来处理协调MAS在随机环境中的协同学习。
  • results: 对于不同的随机奖励设定,MAIBL模型在动态CMOTP任务中表现出比现有MADRL模型更快的学习速度和更好的协调性。
    Abstract Developing effective Multi-Agent Systems (MAS) is critical for many applications requiring collaboration and coordination with humans. Despite the rapid advance of Multi-Agent Deep Reinforcement Learning (MADRL) in cooperative MAS, one major challenge is the simultaneous learning and interaction of independent agents in dynamic environments in the presence of stochastic rewards. State-of-the-art MADRL models struggle to perform well in Coordinated Multi-agent Object Transportation Problems (CMOTPs), wherein agents must coordinate with each other and learn from stochastic rewards. In contrast, humans often learn rapidly to adapt to nonstationary environments that require coordination among people. In this paper, motivated by the demonstrated ability of cognitive models based on Instance-Based Learning Theory (IBLT) to capture human decisions in many dynamic decision making tasks, we propose three variants of Multi-Agent IBL models (MAIBL). The idea of these MAIBL algorithms is to combine the cognitive mechanisms of IBLT and the techniques of MADRL models to deal with coordination MAS in stochastic environments from the perspective of independent learners. We demonstrate that the MAIBL models exhibit faster learning and achieve better coordination in a dynamic CMOTP task with various settings of stochastic rewards compared to current MADRL models. We discuss the benefits of integrating cognitive insights into MADRL models.
    摘要 开发有效的多智能体系统(MAS)对许多需要与人协作和协调的应用至关重要。尽管多智能体深度强化学习(MADRL)在协作式MAS中进展迅速,但一个主要挑战是独立智能体需要在存在随机奖励的动态环境中同时学习并交互。最先进的MADRL模型在协调式多智能体物体搬运问题(CMOTP)上表现不佳,该问题要求智能体彼此协调并从随机奖励中学习;相比之下,人类往往能快速学会适应需要彼此协调的非平稳环境。受基于实例学习理论(IBLT)的认知模型在诸多动态决策任务中刻画人类决策能力的启发,本文提出三种多智能体IBL模型(MAIBL)的变体。这些MAIBL算法的思想是将IBLT的认知机制与MADRL模型的技术相结合,从独立学习者的视角处理随机环境中的协调式MAS。我们证明,在不同随机奖励设置的动态CMOTP任务中,MAIBL模型比当前MADRL模型学习更快、协调更好,并讨论了将认知洞见融入MADRL模型的益处。

GPU Accelerated Color Correction and Frame Warping for Real-time Video Stitching

  • paper_url: http://arxiv.org/abs/2308.09209
  • repo_url: None
  • paper_authors: Lu Yang, Zhenglun Kong, Ting Li, Xinyi Bai, Zhiye Lin, Hong Cheng
  • for: 将多路视频序列实时拼接为全景视频
  • methods: 基于GPU加速的颜色校正与帧变形(frame warping),无需精确的相机参数
  • results: 实时生成高质量全景视频
    Abstract Traditional image stitching focuses on a single panorama frame without considering the spatial-temporal consistency in videos. The straightforward image stitching approach will cause temporal flicking and color inconstancy when it is applied to the video stitching task. Besides, inaccurate camera parameters will cause artifacts in the image warping. In this paper, we propose a real-time system to stitch multiple video sequences into a panoramic video, which is based on GPU accelerated color correction and frame warping without accurate camera parameters. We extend the traditional 2D-Matrix (2D-M) color correction approach and a present spatio-temporal 3D-Matrix (3D-M) color correction method for the overlap local regions with online color balancing using a piecewise function on global frames. Furthermore, we use pairwise homography matrices given by coarse camera calibration for global warping followed by accurate local warping based on the optical flow. Experimental results show that our system can generate highquality panorama videos in real time.
    摘要 传统的图像拼接只关注单幅全景帧,而不考虑视频中的时空一致性。直接将图像拼接方法用于视频拼接任务,会导致时间上的闪烁和颜色不一致;此外,不准确的相机参数会在图像变形中产生伪影。本文提出一种实时系统,基于GPU加速的颜色校正与帧变形,在无需精确相机参数的情况下将多路视频序列拼接为全景视频。我们扩展了传统的二维矩阵(2D-M)颜色校正方法,提出一种面向重叠局部区域的时空三维矩阵(3D-M)颜色校正方法,并利用全局帧上的分段函数进行在线颜色平衡。此外,我们先使用粗略相机标定给出的成对单应性矩阵进行全局变形,再基于光流进行精确的局部变形。实验结果表明,我们的系统能够实时生成高质量的全景视频。
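The gain-compensation idea behind the 2D-M color-correction stage can be illustrated with per-channel gains estimated on the overlap region, as in the numpy sketch below. The paper's 3D-M method adds the temporal dimension and piecewise online balancing, which this sketch omits; the synthetic images and overlap mask are placeholders.

```python
import numpy as np

def overlap_color_gains(img_a, img_b, mask):
    """Estimate per-channel gains that match image B's colors to image A
    inside their overlap region `mask` (a 2D-M-style sketch)."""
    gains = []
    for c in range(3):
        a = img_a[..., c][mask].astype(np.float64)
        b = img_b[..., c][mask].astype(np.float64)
        gains.append(a.mean() / max(b.mean(), 1e-6))
    return np.array(gains)

def apply_gains(img, gains):
    out = img.astype(np.float64) * gains
    return np.clip(out, 0, 255).astype(np.uint8)

h, w = 240, 320
rng = np.random.default_rng(1)
img_a = rng.integers(0, 255, (h, w, 3), dtype=np.uint8)
img_b = apply_gains(img_a, np.array([0.8, 1.1, 0.95]))     # simulated color drift
mask = np.zeros((h, w), bool); mask[:, 200:] = True         # right-side overlap
corrected = apply_gains(img_b, overlap_color_gains(img_a, img_b, mask))
```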

A Model-Agnostic Framework for Recommendation via Interest-aware Item Embeddings

  • paper_url: http://arxiv.org/abs/2308.09202
  • repo_url: None
  • paper_authors: Amit Kumar Jaiswal, Yu Xiong
  • for: 提高推荐系统的准确率和个性化程度,使其能更好地捕捉用户的兴趣和需求。
  • methods: 利用 capsule network 和 interest-aware 技术,对 item 进行强化表示,直接从用户行为中捕捉用户兴趣。
  • results: 在多种benchmark dataset上进行了广泛的实验,并证明了该方法可以帮助提高推荐系统的性能,特别是在用户兴趣方面。
    Abstract Item representation holds significant importance in recommendation systems, which encompasses domains such as news, retail, and videos. Retrieval and ranking models utilise item representation to capture the user-item relationship based on user behaviours. While existing representation learning methods primarily focus on optimising item-based mechanisms, such as attention and sequential modelling. However, these methods lack a modelling mechanism to directly reflect user interests within the learned item representations. Consequently, these methods may be less effective in capturing user interests indirectly. To address this challenge, we propose a novel Interest-aware Capsule network (IaCN) recommendation model, a model-agnostic framework that directly learns interest-oriented item representations. IaCN serves as an auxiliary task, enabling the joint learning of both item-based and interest-based representations. This framework adopts existing recommendation models without requiring substantial redesign. We evaluate the proposed approach on benchmark datasets, exploring various scenarios involving different deep neural networks, behaviour sequence lengths, and joint learning ratios of interest-oriented item representations. Experimental results demonstrate significant performance enhancements across diverse recommendation models, validating the effectiveness of our approach.
    摘要 物品表示在推荐系统中具有重要意义,其应用涵盖新闻、零售和视频等领域。检索与排序模型利用物品表示,基于用户行为刻画用户-物品关系。现有的表示学习方法主要聚焦于优化基于物品的机制,例如注意力和序列建模,但缺乏在所学物品表示中直接反映用户兴趣的建模机制,因而可能难以有效地间接捕捉用户兴趣。为应对这一挑战,我们提出一种新颖的兴趣感知胶囊网络(IaCN)推荐模型,这是一个模型无关的框架,可直接学习面向兴趣的物品表示。IaCN作为辅助任务,支持基于物品的表示与基于兴趣的表示的联合学习,且无需对现有推荐模型进行大幅重构。我们在基准数据集上评估了该方法,考察了不同深度神经网络、行为序列长度以及面向兴趣的物品表示联合学习比例等多种情形。实验结果表明,该方法在多种推荐模型上均带来显著的性能提升,验证了其有效性。

Regularizing Adversarial Imitation Learning Using Causal Invariance

  • paper_url: http://arxiv.org/abs/2308.09189
  • repo_url: None
  • paper_authors: Ivan Ovinnikov, Joachim M. Buhmann
  • for: 这篇论文研究如何从专家示范数据集中推断马尔可夫决策过程中的策略。
  • methods: 论文使用对抗式模仿学习方法,以对抗优化过程中的判别器作为指导信号。
  • results: 论文表明,将因果不变性作为正则化原则,可以缓解模型吸收专家数据中虚假相关性的问题。
    Abstract Imitation learning methods are used to infer a policy in a Markov decision process from a dataset of expert demonstrations by minimizing a divergence measure between the empirical state occupancy measures of the expert and the policy. The guiding signal to the policy is provided by the discriminator used as part of an adversarial optimization procedure. We observe that this model is prone to absorbing spurious correlations present in the expert data. To alleviate this issue, we propose to use causal invariance as a regularization principle for adversarial training of these models. The regularization objective is applicable in a straightforward manner to existing adversarial imitation frameworks. We demonstrate the efficacy of the regularized formulation in an illustrative two-dimensional setting as well as a number of high-dimensional robot locomotion benchmark tasks.
    摘要 模仿学习方法通过最小化专家与策略的经验状态占用度量之间的散度,从专家示范数据集中推断马尔可夫决策过程中的策略。策略的指导信号由对抗优化过程中的判别器提供。我们观察到,这类模型容易吸收专家数据中存在的虚假相关性。为缓解该问题,我们提出将因果不变性作为这类模型对抗训练的正则化原则。该正则化目标可以直接应用于现有的对抗式模仿学习框架。我们在一个示例性的二维设定以及多个高维机器人运动基准任务中验证了该正则化形式的有效性。
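One standard way to encode causal invariance as a regularizer is the IRMv1 gradient penalty; the torch sketch below adds it to a GAIL-style discriminator loss computed per environment (or per split of the expert data). This is an illustrative instantiation under that assumption, not necessarily the paper's exact objective; the discriminator, feature dimension, and penalty weight are placeholders.

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits, labels):
    """IRMv1-style penalty (sketch): squared gradient of the risk with
    respect to a dummy scaling of the logits."""
    scale = torch.ones(1, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * scale, labels)
    (grad,) = torch.autograd.grad(loss, scale, create_graph=True)
    return grad.pow(2).sum()

def regularized_disc_loss(disc, batches_by_env, lam=10.0):
    """GAIL-style discriminator loss plus an invariance penalty across
    environments/splits (labels: 1 = expert, 0 = policy)."""
    total = 0.0
    for feats, labels in batches_by_env:
        logits = disc(feats).squeeze(-1)
        total = total + F.binary_cross_entropy_with_logits(logits, labels)
        total = total + lam * irm_penalty(logits, labels)
    return total / len(batches_by_env)

disc = torch.nn.Sequential(torch.nn.Linear(6, 64), torch.nn.ReLU(),
                           torch.nn.Linear(64, 1))
envs = [(torch.randn(128, 6), torch.randint(0, 2, (128,)).float())
        for _ in range(2)]
regularized_disc_loss(disc, envs).backward()
```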

ChatGPT-HealthPrompt. Harnessing the Power of XAI in Prompt-Based Healthcare Decision Support using ChatGPT

  • paper_url: http://arxiv.org/abs/2308.09731
  • repo_url: None
  • paper_authors: Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia
  • for: 这项研究旨在将大语言模型(LLM)应用于临床决策,重点是 OpenAI 的 ChatGPT,并提出了一种新的应用方法。
  • methods: 该方法利用领域知识:从高性能的可解释 ML 模型中提取关键的特征重要性信息,并将其无缝融入提示设计中。
  • results: 研究表明,在数据稀缺的情况下,该方法也能实现高质量的二分类任务;通过在不同数据条件下将 OpenAI 的 ChatGPT 与传统的监督 ML 模型进行比较,可为提示工程策略的有效性提供洞见,更好地支持临床决策。
    Abstract This study presents an innovative approach to the application of large language models (LLMs) in clinical decision-making, focusing on OpenAI's ChatGPT. Our approach introduces the use of contextual prompts-strategically designed to include task description, feature description, and crucially, integration of domain knowledge-for high-quality binary classification tasks even in data-scarce scenarios. The novelty of our work lies in the utilization of domain knowledge, obtained from high-performing interpretable ML models, and its seamless incorporation into prompt design. By viewing these ML models as medical experts, we extract key insights on feature importance to aid in decision-making processes. This interplay of domain knowledge and AI holds significant promise in creating a more insightful diagnostic tool. Additionally, our research explores the dynamics of zero-shot and few-shot prompt learning based on LLMs. By comparing the performance of OpenAI's ChatGPT with traditional supervised ML models in different data conditions, we aim to provide insights into the effectiveness of prompt engineering strategies under varied data availability. In essence, this paper bridges the gap between AI and healthcare, proposing a novel methodology for LLMs application in clinical decision support systems. It highlights the transformative potential of effective prompt design, domain knowledge integration, and flexible learning approaches in enhancing automated decision-making.

How Does Pruning Impact Long-Tailed Multi-Label Medical Image Classifiers?

  • paper_url: http://arxiv.org/abs/2308.09180
  • repo_url: https://github.com/vita-group/prunecxr
  • paper_authors: Gregory Holste, Ziyu Jiang, Ajay Jaiswal, Maria Hanna, Shlomo Minkowitz, Alan C. Legasto, Joanna G. Escalon, Sharon Steinberger, Mark Bittman, Thomas C. Shen, Ying Ding, Ronald M. Summers, George Shih, Yifan Peng, Zhangyang Wang
  • for: 本研究旨在考察剪枝对用于从胸部X光片(CXR)诊断胸腔疾病的深度神经网络的影响,并理解剪枝在长尾、多标签数据集中如何改变模型行为。
  • methods: 研究基于两个大型CXR数据集,分析剪枝对疾病分类的影响;同时找出未压缩模型与重度剪枝模型判断不一致的个例CXR,称为剪枝识别样例(PIEs),并通过人工阅片研究评估其共性特征。
  • results: 研究发现,剪枝技术能降低深度神经网络的内存占用与推理时间,但可能对模型行为产生负面影响,在长尾、多标签数据集中尤为明显;研究还基于疾病频率和共现行为刻画了类别的"可遗忘性",阅片研究显示PIEs往往具有更多标签噪声、更低图像质量和更高诊断难度。
    Abstract Pruning has emerged as a powerful technique for compressing deep neural networks, reducing memory usage and inference time without significantly affecting overall performance. However, the nuanced ways in which pruning impacts model behavior are not well understood, particularly for long-tailed, multi-label datasets commonly found in clinical settings. This knowledge gap could have dangerous implications when deploying a pruned model for diagnosis, where unexpected model behavior could impact patient well-being. To fill this gap, we perform the first analysis of pruning's effect on neural networks trained to diagnose thorax diseases from chest X-rays (CXRs). On two large CXR datasets, we examine which diseases are most affected by pruning and characterize class "forgettability" based on disease frequency and co-occurrence behavior. Further, we identify individual CXRs where uncompressed and heavily pruned models disagree, known as pruning-identified exemplars (PIEs), and conduct a human reader study to evaluate their unifying qualities. We find that radiologists perceive PIEs as having more label noise, lower image quality, and higher diagnosis difficulty. This work represents a first step toward understanding the impact of pruning on model behavior in deep long-tailed, multi-label medical image classification. All code, model weights, and data access instructions can be found at https://github.com/VITA-Group/PruneCXR.
    摘要 剪枝已成为压缩深度神经网络的有力技术,能够在不显著影响整体性能的情况下降低内存占用和推理时间。然而,剪枝对模型行为的细微影响尚不清楚,对于临床场景中常见的长尾、多标签数据集尤其如此。这一知识空白在将剪枝模型部署用于诊断时可能带来危险:意料之外的模型行为可能影响患者健康。为填补这一空白,我们首次分析了剪枝对基于胸部X光片(CXR)诊断胸腔疾病的神经网络的影响。在两个大型CXR数据集上,我们考察了哪些疾病受剪枝影响最大,并基于疾病频率与共现行为刻画类别的"可遗忘性"。此外,我们找出了未压缩模型与重度剪枝模型判断不一致的个例CXR,称为剪枝识别样例(PIEs),并开展人工阅片研究评估其共性特征。我们发现,放射科医生认为PIEs具有更多的标签噪声、更低的图像质量和更高的诊断难度。这项工作是理解剪枝对深度长尾、多标签医学图像分类模型行为影响的第一步。所有代码、模型权重和数据获取说明见 https://github.com/VITA-Group/PruneCXR。
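The mechanics of the study, prune heavily and then flag pruning-identified exemplars (PIEs) where the dense and pruned models disagree, can be sketched with torch's built-in pruning utilities. The backbone, sparsity level, decision threshold, and dummy batch are illustrative assumptions; the paper's classifiers and data differ.

```python
import copy
import torch
import torch.nn.utils.prune as prune
import torchvision

# Global L1 magnitude pruning of a multi-label CXR classifier (sketch).
dense = torchvision.models.resnet18(num_classes=14)   # 14 thorax findings
pruned = copy.deepcopy(dense)
to_prune = [(m, "weight") for m in pruned.modules()
            if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
prune.global_unstructured(to_prune, pruning_method=prune.L1Unstructured,
                          amount=0.9)                 # heavy 90% sparsity

# Flag PIEs: inputs where dense and pruned decisions disagree on any label.
x = torch.randn(8, 3, 224, 224)                       # stand-in CXR batch
with torch.no_grad():
    p_dense = torch.sigmoid(dense(x)) > 0.5
    p_pruned = torch.sigmoid(pruned(x)) > 0.5
is_pie = (p_dense != p_pruned).any(dim=1)
print("PIE indices:", torch.nonzero(is_pie).flatten().tolist())
```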

Diversifying AI: Towards Creative Chess with AlphaZero

  • paper_url: http://arxiv.org/abs/2308.09175
  • repo_url: None
  • paper_authors: Tom Zahavy, Vivek Veeriah, Shaobo Hou, Kevin Waugh, Matthew Lai, Edouard Leurent, Nenad Tomasev, Lisa Schut, Demis Hassabis, Satinder Singh
  • for: 本研究探讨在计算理性的极限下,创造性的决策机制能否提升人工智能(AI)系统的表现。
  • methods: 本研究基于AlphaZero(AZ)及其扩展版本AZ_db,通过行为多样性技术让AI系统生成更多的想法,并利用次可加规划(sub-additive planning)选择最有前途的想法。
  • results: 实验表明,AZ_db以多样化的方式下国际象棋,作为团队解决了更多的谜题,并超越了更同质化的团队;它解决的高难度谜题数量是AZ的两倍,包括高难度的Penrose局面。此外,在不同开局下,AZ_db的成员各自专精于不同的开局,利用次可加规划为每个开局挑选棋手可带来50 Elo的提升。研究结果表明,AI团队中同样会涌现多样性红利,与人类团队类似,多样性是解决计算困难问题的宝贵资产。
    Abstract In recent years, Artificial Intelligence (AI) systems have surpassed human intelligence in a variety of computational tasks. However, AI systems, like humans, make mistakes, have blind spots, hallucinate, and struggle to generalize to new situations. This work explores whether AI can benefit from creative decision-making mechanisms when pushed to the limits of its computational rationality. In particular, we investigate whether a team of diverse AI systems can outperform a single AI in challenging tasks by generating more ideas as a group and then selecting the best ones. We study this question in the game of chess, the so-called drosophila of AI. We build on AlphaZero (AZ) and extend it to represent a league of agents via a latent-conditioned architecture, which we call AZ_db. We train AZ_db to generate a wider range of ideas using behavioral diversity techniques and select the most promising ones with sub-additive planning. Our experiments suggest that AZ_db plays chess in diverse ways, solves more puzzles as a group and outperforms a more homogeneous team. Notably, AZ_db solves twice as many challenging puzzles as AZ, including the challenging Penrose positions. When playing chess from different openings, we notice that players in AZ_db specialize in different openings, and that selecting a player for each opening using sub-additive planning results in a 50 Elo improvement over AZ. Our findings suggest that diversity bonuses emerge in teams of AI agents, just as they do in teams of humans and that diversity is a valuable asset in solving computationally hard problems.
    摘要 近年来,人工智能(AI)系统已在多种计算任务上超越人类智能。然而,AI系统与人类一样会犯错、存在盲点、产生幻觉,并且难以泛化到新情形。本研究探讨当AI被推向其计算理性的极限时,能否从创造性的决策机制中获益。特别地,我们研究一组多样化的AI系统能否通过集体生成更多想法并从中挑选最佳者,在高难度任务上超越单个AI。我们在国际象棋(所谓AI的"果蝇")中研究这一问题:基于AlphaZero(AZ),我们通过隐变量条件架构将其扩展为一个智能体联赛,称为AZ_db。我们利用行为多样性技术训练AZ_db生成更宽泛的想法,并用次可加规划挑选最有前途者。实验表明,AZ_db以多样化的方式下棋,作为团队解决了更多谜题,并超越了更同质化的团队;值得注意的是,它解决的高难度谜题数量是AZ的两倍,包括高难度的Penrose局面。在从不同开局开始对弈时,我们注意到AZ_db中的棋手各自专精于不同开局,利用次可加规划为每个开局挑选棋手可相对AZ提升50 Elo。我们的发现表明,AI智能体团队中同样会涌现多样性红利,正如人类团队一样,多样性是解决计算困难问题的宝贵资产。

Forensic Data Analytics for Anomaly Detection in Evolving Networks

  • paper_url: http://arxiv.org/abs/2308.09171
  • repo_url: None
  • paper_authors: Li Yang, Abdallah Moubayed, Abdallah Shami, Amine Boukhtouta, Parisa Heidari, Stere Preda, Richard Brunner, Daniel Migault, Adel Larabi
  • for: 本研究旨在为不断演进的网络提供一个面向异常检测的数字取证分析框架,以增强网络安全。
  • methods: 本研究采用多视角特征工程、无监督异常检测以及全面的结果修正流程来检测网络中的异常行为。
  • results: 在真实的演进网络数据上的实验结果表明,所提出的取证数据分析方案能够有效检测网络异常。
    Abstract In the prevailing convergence of traditional infrastructure-based deployment (i.e., Telco and industry operational networks) towards evolving deployments enabled by 5G and virtualization, there is a keen interest in elaborating effective security controls to protect these deployments in-depth. By considering key enabling technologies like 5G and virtualization, evolving networks are democratized, facilitating the establishment of point presences integrating different business models ranging from media, dynamic web content, gaming, and a plethora of IoT use cases. Despite the increasing services provided by evolving networks, many cybercrimes and attacks have been launched in evolving networks to perform malicious activities. Due to the limitations of traditional security artifacts (e.g., firewalls and intrusion detection systems), the research on digital forensic data analytics has attracted more attention. Digital forensic analytics enables people to derive detailed information and comprehensive conclusions from different perspectives of cybercrimes to assist in convicting criminals and preventing future crimes. This chapter presents a digital analytics framework for network anomaly detection, including multi-perspective feature engineering, unsupervised anomaly detection, and comprehensive result correction procedures. Experiments on real-world evolving network data show the effectiveness of the proposed forensic data analytics solution.
    摘要 在传统基础设施(如电信和产业运营网络)协调向5G和虚拟化的演进部署方向,有很大的兴趣在彻底保护这些部署。通过考虑关键启用技术(如5G和虚拟化),演进网络被民主化,使得不同业务模式的点 présence可以成功建立,包括媒体、动态网页内容、游戏和互联网器件多种应用场景。尽管演进网络提供了越来越多的服务,但是许多网络犯罪和攻击仍然在演进网络中进行不良活动。由于传统安全文件(如防火墙和侵入检测系统)的局限性,研究数字审计数据分析的研究吸引了更多的注意。数字审计数据分析可以帮助人们从不同角度获得详细信息和全面的结论,以帮助检察官捕捉犯罪分子和预防未来的犯罪。本章介绍了一种网络异常检测数字审计框架,包括多元角度特征工程、无监督异常检测和全面结果修正过程。实验表明,提案的数字审计数据分析解决方案在真实的演进网络数据上具有有效性。

Semantic Consistency for Assuring Reliability of Large Language Models

  • paper_url: http://arxiv.org/abs/2308.09138
  • repo_url: None
  • paper_authors: Harsh Raj, Vipul Gupta, Domenic Rosati, Subhabrata Majumdar
  • for: 这篇论文旨在提高大语言模型(LLMs)的安全性和可靠性,使其在不同的自然语言任务中表现更加稳定可靠。
  • methods: 论文提出了一个通用的语义一致性度量(semantic consistency),用于评估 LLMs 在开放式文本生成任务中输出的一致性;此外,还提出了一种名为 Ask-to-Choose (A2C) 的提示策略,用于提升语义一致性。
  • results: 实验结果显示,使用 A2C 提示策略可使准确率提升最高 47%,并使指令微调模型的语义一致性指标提升最高 7 倍。
    Abstract Large Language Models (LLMs) exhibit remarkable fluency and competence across various natural language tasks. However, recent research has highlighted their sensitivity to variations in input prompts. To deploy LLMs in a safe and reliable manner, it is crucial for their outputs to be consistent when prompted with expressions that carry the same meaning or intent. While some existing work has explored how state-of-the-art LLMs address this issue, their evaluations have been confined to assessing lexical equality of single- or multi-word answers, overlooking the consistency of generative text sequences. For a more comprehensive understanding of the consistency of LLMs in open-ended text generation scenarios, we introduce a general measure of semantic consistency, and formulate multiple versions of this metric to evaluate the performance of various LLMs. Our proposal demonstrates significantly higher consistency and stronger correlation with human evaluations of output consistency than traditional metrics based on lexical consistency. Finally, we propose a novel prompting strategy, called Ask-to-Choose (A2C), to enhance semantic consistency. When evaluated for closed-book question answering based on answer variations from the TruthfulQA benchmark, A2C increases accuracy metrics for pretrained and finetuned LLMs by up to 47%, and semantic consistency metrics for instruction-tuned models by up to 7-fold.
    摘要 大语言模型(LLMs)在各种自然语言任务上表现出卓越的流畅性和能力。然而,近期研究指出,它们对输入提示的变化十分敏感。为了安全可靠地部署LLMs,当以含义或意图相同的表述进行提示时,其输出应当保持一致。现有一些工作探讨了最先进的LLMs如何应对这一问题,但其评估局限于单词或多词答案的词面一致性,忽略了生成文本序列的一致性。为了更全面地理解LLMs在开放式文本生成场景中的一致性,我们提出了一个通用的语义一致性度量,并构造了该度量的多个版本来评估各类LLMs的表现。与基于词面一致性的传统指标相比,我们的方案展现出显著更高的一致性,并与人工对输出一致性的评估有更强的相关性。最后,我们提出一种新的提示策略,称为 Ask-to-Choose(A2C),用于提升语义一致性。在基于 TruthfulQA 基准答案变体的闭卷问答评估中,A2C 将预训练和微调LLMs的准确率指标提升最高47%,并将指令微调模型的语义一致性指标提升最高7倍。
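The measurement itself is simple to sketch: embed the generations produced from meaning-preserving prompt variants and average their pairwise similarity. The toy hashing embedder below exists only so the snippet runs standalone; in practice one would plug in a real sentence encoder, and the exact metric versions in the paper may differ.

```python
import itertools
import numpy as np

def semantic_consistency(outputs, embed):
    """Sketch of a semantic-consistency score: mean pairwise cosine
    similarity of embedded generations from paraphrased prompts.
    `embed` is any sentence-embedding function."""
    vecs = [embed(o) for o in outputs]
    sims = [float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            for a, b in itertools.combinations(vecs, 2)]
    return float(np.mean(sims))

def toy_embed(text, dim=64):
    """Bag-of-words hashing embedder, purely a stand-in for a real encoder."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v

gens = ["Paris is the capital of France.",
        "The capital of France is Paris.",
        "France's capital city is Paris."]
print(round(semantic_consistency(gens, toy_embed), 3))
```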

EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

  • paper_url: http://arxiv.org/abs/2308.09126
  • repo_url: https://github.com/egoschema/egoschema
  • paper_authors: Karttikeya Mangalam, Raiymbek Akshulakov, Jitendra Malik
  • for: 本研究旨在评估现代视觉与语言系统的长时视频理解能力,特别是面对长视频片段时的表现。
  • methods: 本研究基于Ego4D数据集,经人工精心筛选整理,涵盖多种自然的人类活动与行为,提供了5000多个多项选择问答对;对于每个问题,需要根据一段三分钟的视频片段从五个选项中选出正确答案。此外,本研究还引入时间证书集(temporal certificate sets),用于刻画视频理解任务内在所需的时间理解长度。
  • results: 评估表明,现代视觉与语言系统在EgoSchema上的长时视频理解能力远低于人类约76%的准确率:即便是拥有数十亿参数的模型,其多项选择问答准确率也不足33%(随机水平为20%)。本研究认为,EgoSchema凭借其长内在时间结构和多样的复杂度,将成为未来开发有效长时视频理解系统的宝贵评估工具。
    Abstract We introduce EgoSchema, a very long-form video question-answering dataset, and benchmark to evaluate long video understanding capabilities of modern vision and language systems. Derived from Ego4D, EgoSchema consists of over 5000 human curated multiple choice question answer pairs, spanning over 250 hours of real video data, covering a very broad range of natural human activity and behavior. For each question, EgoSchema requires the correct answer to be selected between five given options based on a three-minute-long video clip. While some prior works have proposed video datasets with long clip lengths, we posit that merely the length of the video clip does not truly capture the temporal difficulty of the video task that is being considered. To remedy this, we introduce temporal certificate sets, a general notion for capturing the intrinsic temporal understanding length associated with a broad range of video understanding tasks & datasets. Based on this metric, we find EgoSchema to have intrinsic temporal lengths over 5.7x longer than the second closest dataset and 10x to 100x longer than any other video understanding dataset. Further, our evaluation of several current state-of-the-art video and language models shows them to be severely lacking in long-term video understanding capabilities. Even models with several billions of parameters achieve QA accuracy less than 33% (random is 20%) on the EgoSchema multi-choice question answering task, while humans achieve about 76% accuracy. We posit that EgoSchema, with its long intrinsic temporal structures and diverse complexity, would serve as a valuable evaluation probe for developing effective long-term video understanding systems in the future. Data and Zero-shot model evaluation code are open-sourced for both public and commercial use under the Ego4D license at http://egoschema.github.io
    摘要 我们提出EgoSchema,一个超长形式的视频问答数据集与基准,用于评估现代视觉与语言系统的长时视频理解能力。EgoSchema源自Ego4D,包含5000多个由人工整理的多项选择问答对,覆盖250多个小时的真实视频数据,涵盖非常广泛的自然人类活动与行为。对每个问题,EgoSchema要求基于一段三分钟的视频片段,从五个给定选项中选出正确答案。虽然已有一些工作提出了片段较长的视频数据集,但我们认为,仅凭视频片段的长度并不能真正刻画所考察视频任务的时间难度。为此,我们引入时间证书集,这一通用概念用于刻画各类视频理解任务与数据集内在所需的时间理解长度。按该指标计算,EgoSchema的内在时间长度是第二长数据集的5.7倍以上,是其他任何视频理解数据集的10到100倍。此外,我们对多个当前最先进的视频与语言模型的评估显示,它们严重缺乏长时视频理解能力:即便是拥有数十亿参数的模型,在EgoSchema多项选择问答任务上的准确率也不足33%(随机水平为20%),而人类约达76%。我们认为,EgoSchema凭借其长内在时间结构和多样的复杂度,将成为未来开发有效长时视频理解系统的宝贵评估工具。数据与零样本模型评估代码依照Ego4D许可在 http://egoschema.github.io 开源,可供公共与商业使用。

Spectral information criterion for automatic elbow detection

  • paper_url: http://arxiv.org/abs/2308.09108
  • repo_url: None
  • paper_authors: L. Martino, R. San Millan-Castillo, E. Morgado
  • for: 本文提出了一种广义信息准则(SIC),它将其他知名的信息准则(如BIC和AIC)作为特例涵盖其中,并且能够自动检测误差曲线的拐点。
  • methods: 本文使用谱信息准则(spectral information criterion, SIC)提取误差曲线的几何特征,且不要求已知似然函数。
  • results: 实验结果表明,SIC能给出一个规模通常远小于全部候选模型数的模型子集,其中的元素都是误差曲线的拐点;本文还给出了在拐点集合中选取唯一模型的实用规则。
    Abstract We introduce a generalized information criterion that contains other well-known information criteria, such as Bayesian information Criterion (BIC) and Akaike information criterion (AIC), as special cases. Furthermore, the proposed spectral information criterion (SIC) is also more general than the other information criteria, e.g., since the knowledge of a likelihood function is not strictly required. SIC extracts geometric features of the error curve and, as a consequence, it can be considered an automatic elbow detector. SIC provides a subset of all possible models, with a cardinality that often is much smaller than the total number of possible models. The elements of this subset are elbows of the error curve. A practical rule for selecting a unique model within the sets of elbows is suggested as well. Theoretical invariance properties of SIC are analyzed. Moreover, we test SIC in ideal scenarios where provides always the optimal expected results. We also test SIC in several numerical experiments: some involving synthetic data, and two experiments involving real datasets. They are all real-world applications such as clustering, variable selection, or polynomial order selection, to name a few. The results show the benefits of the proposed scheme. Matlab code related to the experiments is also provided. Possible future research lines are finally discussed.
    摘要 我们提出一种广义信息准则,它将贝叶斯信息准则(BIC)和赤池信息准则(AIC)等其他知名信息准则作为特例涵盖其中。此外,所提出的谱信息准则(SIC)比其他信息准则更为一般,例如并不严格要求已知似然函数。SIC提取误差曲线的几何特征,因此可以视为一种自动拐点检测器。SIC给出所有可能模型的一个子集,其规模通常远小于可能模型的总数,该子集中的元素都是误差曲线的拐点。我们还给出了在拐点集合中选取唯一模型的实用规则,并分析了SIC的理论不变性质。我们在理想情形下测试了SIC,它总能给出最优的期望结果;我们还在多个数值实验中测试了SIC,其中既有合成数据,也有两个真实数据集实验,涉及聚类、变量选择、多项式阶数选择等实际应用。结果显示了所提方案的优点。实验相关的Matlab代码也一并提供。最后讨论了可能的后续研究方向。
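The elbow-as-geometry intuition can be made concrete with the classic farthest-point-from-chord detector below (numpy). SIC itself is more general, spectral and likelihood-free, so this is a simplified illustration of the idea, not the paper's estimator.

```python
import numpy as np

def elbow_index(errors):
    """Geometric elbow detector (sketch): the elbow is the point of the
    error curve farthest from the chord joining its two endpoints."""
    e = np.asarray(errors, dtype=float)
    k = np.arange(len(e))
    p0, p1 = np.array([k[0], e[0]]), np.array([k[-1], e[-1]])
    chord = (p1 - p0) / np.linalg.norm(p1 - p0)
    pts = np.stack([k, e], axis=1) - p0
    proj = np.outer(pts @ chord, chord)            # component along the chord
    return int(np.argmax(np.linalg.norm(pts - proj, axis=1)))

errors = [10.0, 4.0, 2.0, 1.5, 1.3, 1.25, 1.22]    # error vs. model order
print(elbow_index(errors))                          # -> 2 (the "knee")
```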

MindMap: Knowledge Graph Prompting Sparks Graph of Thoughts in Large Language Models

  • paper_url: http://arxiv.org/abs/2308.09729
  • repo_url: None
  • paper_authors: Yilin Wen, Zifeng Wang, Jimeng Sun
  • for: 本研究旨在探讨如何借助知识图谱(KG)帮助大语言模型(LLM)吸收最新知识、减少幻觉生成,并使其决策过程更加透明。
  • methods: 我们构建了一条提示管线,使LLM能够理解KG输入,并结合隐式知识与检索到的外部知识进行推理;此外,我们还研究了如何引出LLM据以推理并生成答案的思维导图(mind map)。
  • results: 在三个问答数据集上的实验表明,使用 MindMap 提示能带来显著的实证提升。例如,采用 MindMap 提示的 GPT-3.5 能持续超越 GPT-4 的表现。此外,借助从KG中检索的结构化事实,MindMap 能超越一系列基于文档检索的提示方法,得益于KG提供的更准确、更简洁、更全面的知识。
    Abstract LLMs usually exhibit limitations in their ability to incorporate new knowledge, the generation of hallucinations, and the transparency of their decision-making process. In this paper, we explore how to prompt LLMs with knowledge graphs (KG), working as a remedy to engage LLMs with up-to-date knowledge and elicit the reasoning pathways from LLMs. Specifically, we build a prompting pipeline that endows LLMs with the capability of comprehending KG inputs and inferring with a combined implicit knowledge and the retrieved external knowledge. In addition, we investigate eliciting the mind map on which LLMs perform the reasoning and generate the answers. It is identified that the produced mind map exhibits the reasoning pathways of LLMs grounded on the ontology of knowledge, hence bringing the prospects of probing and gauging LLM inference in production. The experiments on three question & answering datasets also show that MindMap prompting leads to a striking empirical gain. For instance, prompting a GPT-3.5 with MindMap yields an overwhelming performance over GPT-4 consistently. We also demonstrate that with structured facts retrieved from KG, MindMap can outperform a series of prompting-with-document-retrieval methods, benefiting from more accurate, concise, and comprehensive knowledge from KGs.
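
A minimal sketch of the prompting idea: retrieve knowledge-graph triples relevant to a question and serialize them as evidence the LLM can reason over. The toy KG, the keyword-matching retriever, and the prompt wording are all illustrative assumptions, not the paper's actual pipeline.

```python
# A toy knowledge graph as (head, relation, tail) triples.
KG = [
    ("metformin", "treats", "type 2 diabetes"),
    ("metformin", "may_cause", "lactic acidosis"),
    ("insulin", "treats", "type 1 diabetes"),
]

def retrieve_triples(question, kg):
    """Keyword match: keep triples whose head or tail occurs in the question."""
    q = question.lower()
    return [t for t in kg if t[0] in q or t[2] in q]

def build_prompt(question, kg):
    """Serialize retrieved triples as evidence and ask for a reasoning path."""
    facts = "\n".join(f"({h}) --{r}--> ({t})"
                      for h, r, t in retrieve_triples(question, kg))
    return (
        "You are given knowledge-graph evidence:\n"
        f"{facts}\n\n"
        f"Question: {question}\n"
        "Answer the question and show the reasoning path over the evidence."
    )

print(build_prompt("What does metformin treat, and what risk should be noted?", KG))
```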

Identity-Aware Semi-Supervised Learning for Comic Character Re-Identification

  • paper_url: http://arxiv.org/abs/2308.09096
  • repo_url: None
  • paper_authors: Gürkan Soykan, Deniz Yuret, Tevfik Metin Sezgin
  • for: 这个论文的目的是提出一种robust的半监督框架,用于在漫画中的人物重识别。
  • methods: 这个方法结合度量学习和一种新的“身份感知(Identity-Aware)”自监督方法,通过对角色脸部和身体配对进行对比学习来提取角色特征。
  • results: 这个方法可以准确地重识别漫画中的人物,并且比使用脸或身体独立地重识别更有效。
    Abstract Character re-identification, recognizing characters consistently across different panels in comics, presents significant challenges due to limited annotated data and complex variations in character appearances. To tackle this issue, we introduce a robust semi-supervised framework that combines metric learning with a novel 'Identity-Aware' self-supervision method by contrastive learning of face and body pairs of characters. Our approach involves processing both facial and bodily features within a unified network architecture, facilitating the extraction of identity-aligned character embeddings that capture individual identities while preserving the effectiveness of face and body features. This integrated character representation enhances feature extraction and improves character re-identification compared to re-identification by face or body independently, offering a parameter-efficient solution. By extensively validating our method using in-series and inter-series evaluation metrics, we demonstrate its effectiveness in consistently re-identifying comic characters. Compared to existing methods, our approach not only addresses the challenge of character re-identification but also serves as a foundation for downstream tasks since it can produce character embeddings without restrictions of face and body availability, enriching the comprehension of comic books. In our experiments, we leverage two newly curated datasets: the 'Comic Character Instances Dataset', comprising over a million character instances and the 'Comic Sequence Identity Dataset', containing annotations of identities within more than 3000 sets of four consecutive comic panels that we collected.
    摘要 漫画角色重识别,即在不同画格之间一致地识别同一角色,因标注数据有限且角色外观变化复杂而极具挑战性。为解决这一问题,我们提出了一个稳健的半监督框架,将度量学习与一种新的“身份感知”自监督方法相结合,通过对角色脸部与身体配对进行对比学习。我们的方法在统一的网络架构中同时处理脸部和身体特征,从而提取与身份对齐的角色嵌入,在捕捉个体身份的同时保留脸部和身体特征的有效性。与仅用脸部或身体单独进行重识别相比,这种一体化的角色表示提升了特征提取和重识别效果,并且参数更为高效。通过系列内和系列间评估指标的大量验证,我们证明了该方法能够一致地重识别漫画角色。与现有方法相比,我们的方法不仅解决了角色重识别的挑战,还因能在脸部或身体不可得的情况下生成角色嵌入而可作为下游任务的基础,丰富对漫画的理解。在实验中,我们使用了两个新整理的数据集:包含超过一百万个角色实例的“Comic Character Instances Dataset”,以及包含我们收集的 3000 多组四连格漫画身份标注的“Comic Sequence Identity Dataset”。
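
The face-body contrastive component can be sketched as a symmetric InfoNCE objective over paired crops. This is a simplified stand-in, assuming one face and one body embedding per character instance with in-batch negatives; the paper's full identity-aware formulation is more involved.

```python
import torch
import torch.nn.functional as F

def face_body_info_nce(face_emb, body_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (face, body) crops.

    Row i of each tensor comes from the same character instance, so the
    diagonal of the similarity matrix holds the positives while all
    other rows serve as in-batch negatives.
    """
    face = F.normalize(face_emb, dim=-1)
    body = F.normalize(body_emb, dim=-1)
    logits = face @ body.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(face.size(0), device=face.device)
    # Pull matched face/body pairs together in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = face_body_info_nce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```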

Fast Decision Support for Air Traffic Management at Urban Air Mobility Vertiports using Graph Learning

  • paper_url: http://arxiv.org/abs/2308.09075
  • repo_url: None
  • paper_authors: Prajit KrisshnaKumar, Jhoel Witter, Steve Paul, Hanvit Cho, Karthik Dantu, Souma Chowdhury
  • for: 提高城市和郊区交通效率、安全性和快速旅行的新领域——城市空中交通 (UAM)。
  • methods: 利用图强化学习生成决策支持策略,将 vertiport 空域内指定的物理位置与被管理的飞行器表示为两个独立的图,通过图卷积神经网络(GCN)提取特征,再经感知机层决定行为,例如继续悬停或巡航、继续待机或起飞、或降落在分配的 vertiport 停机位上。
  • results: 通过在 AirSim 中对缩小比例的多旋翼机进行真实模拟评估,结果表明图强化学习能够有效地解决城市空中交通的 vertiport 调度管理问题,并且在延误、安全性(碰撞次数)和电池消耗等指标上优于带图嵌入的基础强化学习以及随机选择基线。
    Abstract Urban Air Mobility (UAM) promises a new dimension to decongested, safe, and fast travel in urban and suburban hubs. These UAM aircraft are conceived to operate from small airports called vertiports each comprising multiple take-off/landing and battery-recharging spots. Since they might be situated in dense urban areas and need to handle many aircraft landings and take-offs each hour, managing this schedule in real-time becomes challenging for a traditional air-traffic controller but instead calls for an automated solution. This paper provides a novel approach to this problem of Urban Air Mobility - Vertiport Schedule Management (UAM-VSM), which leverages graph reinforcement learning to generate decision-support policies. Here the designated physical spots within the vertiport's airspace and the vehicles being managed are represented as two separate graphs, with feature extraction performed through a graph convolutional network (GCN). Extracted features are passed onto perceptron layers to decide actions such as continue to hover or cruise, continue idling or take-off, or land on an allocated vertiport spot. Performance is measured based on delays, safety (no. of collisions) and battery consumption. Through realistic simulations in AirSim applied to scaled down multi-rotor vehicles, our results demonstrate the suitability of using graph reinforcement learning to solve the UAM-VSM problem and its superiority to basic reinforcement learning (with graph embeddings) or random choice baselines.
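
A compact sketch of the graph-learning policy described above: a single GCN layer over the vertiport-spot graph followed by a perceptron head that emits per-node action logits. The layer sizes, the chain-shaped adjacency, and the three-action space are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GCNPolicy(nn.Module):
    """One-layer GCN over vertiport-spot nodes followed by a perceptron
    head that scores discrete actions (hover/cruise, idle/take-off, land)."""

    def __init__(self, in_dim, hidden_dim, n_actions):
        super().__init__()
        self.lin = nn.Linear(in_dim, hidden_dim)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(hidden_dim, n_actions))

    def forward(self, x, adj):
        # Symmetric normalization with self-loops: D^-1/2 (A + I) D^-1/2
        a = adj + torch.eye(adj.size(0))
        d = a.sum(dim=1).rsqrt()
        a_norm = d.unsqueeze(1) * a * d.unsqueeze(0)
        h = torch.relu(a_norm @ self.lin(x))   # message passing + transform
        return self.head(h)                    # per-node action logits

# Five spots with 4 features each, chain-shaped airspace connectivity.
x = torch.randn(5, 4)
adj = torch.diag(torch.ones(4), 1) + torch.diag(torch.ones(4), -1)
print(GCNPolicy(4, 16, 3)(x, adj).shape)  # torch.Size([5, 3])
```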

cs.CL - 2023-08-18

ChatHaruhi: Reviving Anime Character in Reality via Large Language Model

  • paper_url: http://arxiv.org/abs/2308.09597
  • repo_url: https://github.com/LC1332/Chat-Haruhi-Suzumiya
  • paper_authors: Cheng Li, Ziang Leng, Chenxi Yan, Junyi Shen, Hao Wang, Weishi MI, Yaying Fei, Xiaoyang Feng, Song Yan, HaoSheng Wang, Linkang Zhan, Yaokai Jia, Pingyu Wu, Haozhen Sun
  • for: 这篇论文旨在提出一种控制语言模型以模拟特定的虚构人物的算法,以提高角色扮演能力。
  • methods: 该算法使用改进的提示和从剧本中提取的人物记忆来控制语言模型。
  • results: 自动和人工评估均表明,该方法在角色扮演方面优于基线。
    Abstract Role-playing chatbots built on large language models have drawn interest, but better techniques are needed to enable mimicking specific fictional characters. We propose an algorithm that controls language models via an improved prompt and memories of the character extracted from scripts. We construct ChatHaruhi, a dataset covering 32 Chinese / English TV / anime characters with over 54k simulated dialogues. Both automatic and human evaluations show our approach improves role-playing ability over baselines. Code and data are available at https://github.com/LC1332/Chat-Haruhi-Suzumiya .
    摘要 基于大语言模型的角色扮演聊天机器人已经引起关注,但要模拟特定的虚构角色还需要更好的技术。我们提出了一个算法,通过改进的提示和从剧本中提取的角色记忆来控制语言模型。我们构建了 ChatHaruhi 数据集,覆盖 32 个中文/英文电视/动画角色,包含 54,000 多段模拟对话。自动和人工评估均显示我们的方法在角色扮演能力上优于基线。代码和数据可以在 https://github.com/LC1332/Chat-Haruhi-Suzumiya 获取。
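
A rough sketch of the memory-retrieval step: rank stored script lines against the user query and prepend the best matches to a role-play prompt. The bag-of-words embedding, the example memories, and the prompt format are stand-ins; a real system would use a sentence encoder.

```python
import re
import numpy as np

MEMORIES = [  # hypothetical script excerpts for one character
    "Haruhi: Ordinary humans don't interest me!",
    "Haruhi: The SOS Brigade will search for aliens and espers.",
    "Haruhi: Kyon, you're coming along, no complaints.",
]

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def top_k_memories(query, memories, k=2):
    """Rank memories by cosine similarity of toy bag-of-words vectors."""
    vocab = {w: i for i, w in enumerate(sorted({w for m in memories
                                                for w in tokenize(m)}))}
    def embed(text):
        v = np.zeros(len(vocab))
        for w in tokenize(text):
            if w in vocab:
                v[vocab[w]] += 1.0
        return v
    q = embed(query)
    sims = [float(q @ embed(m)) /
            (np.linalg.norm(q) * np.linalg.norm(embed(m)) + 1e-9)
            for m in memories]
    return [memories[i] for i in np.argsort(sims)[::-1][:k]]

query = "Will you join the SOS Brigade search?"
prompt = ("Role-play as Haruhi Suzumiya. Stay in character.\n"
          "Relevant past lines:\n- " + "\n- ".join(top_k_memories(query, MEMORIES)) +
          f"\nUser: {query}\nHaruhi:")
print(prompt)
```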

PUMGPT: A Large Vision-Language Model for Product Understanding

  • paper_url: http://arxiv.org/abs/2308.09568
  • repo_url: None
  • paper_authors: Shuhui Wu, Zengming Tang, Zongyi Guo, Weiwei Zhang, Baoliang Cui, Haihong Tang, Weiming Lu
  • for: 本研究主要针对产品理解任务进行探讨,以提高在线购物体验。产品理解任务包括多种子任务,需要模型回答基于多modal产品信息的多种问题。
  • methods: 我们提出了 PUMGPT,一个大型视觉语言模型,旨在将所有产品理解任务统一到单一的模型结构中。为弥合视觉与文本表示之间的差距,我们提出了逐层适配器(Layer-wise Adapters, LA),该方法以更少的视觉 token 实现更好的对齐,并支持参数高效的微调。
  • results: PUMGPT 在多种产品理解任务中表现出色,包括产品描述生成、类别问答、属性抽取、属性问答以及关于产品的自由形式问答。
    Abstract Recent developments of multi-modal large language models have demonstrated its strong ability in solving vision-language tasks. In this paper, we focus on the product understanding task, which plays an essential role in enhancing online shopping experience. Product understanding task includes a variety of sub-tasks, which require models to respond diverse queries based on multi-modal product information. Traditional methods design distinct model architectures for each sub-task. On the contrary, we present PUMGPT, a large vision-language model aims at unifying all product understanding tasks under a singular model structure. To bridge the gap between vision and text representations, we propose Layer-wise Adapters (LA), an approach that provides enhanced alignment with fewer visual tokens and enables parameter-efficient fine-tuning. Moreover, the inherent parameter-efficient fine-tuning ability allows PUMGPT to be readily adapted to new product understanding tasks and emerging products. We design instruction templates to generate diverse product instruction datasets. Simultaneously, we utilize open-domain datasets during training to improve the performance of PUMGPT and its generalization ability. Through extensive evaluations, PUMGPT demonstrates its superior performance across multiple product understanding tasks, including product captioning, category question-answering, attribute extraction, attribute question-answering, and even free-form question-answering about products.
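
The layer-wise adapter idea can be sketched as a small residual bottleneck module attached to a frozen backbone layer; only its parameters are updated during fine-tuning. The bottleneck width and zero-initialization are common conventions assumed here, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual down-project/up-project adapter placed after a frozen layer.

    Only the adapter's few parameters are trained, which is the usual
    recipe behind parameter-efficient fine-tuning; the paper's LA design
    additionally conditions alignment per layer.
    """

    def __init__(self, dim, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

hidden = torch.randn(2, 10, 768)          # (batch, tokens, width)
adapter = BottleneckAdapter(768)
print(adapter(hidden).shape, sum(p.numel() for p in adapter.parameters()))
```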

Exploring Sampling Techniques for Generating Melodies with a Transformer Language Model

  • paper_url: http://arxiv.org/abs/2308.09454
  • repo_url: None
  • paper_authors: Mathias Rose Bjare, Stefan Lattner, Gerhard Widmer
  • for: investigate the impact of different sampling techniques on musical qualities such as diversity and structure
  • methods: train a high-capacity transformer model on a vast collection of highly-structured Irish folk melodies, and analyze the musical qualities of the samples generated using distribution truncation sampling techniques
  • results: discover that probability truncation techniques may restrict diversity and structural patterns in optimal circumstances, but may also produce more musical samples in suboptimal circumstances.
    Abstract Research in natural language processing has demonstrated that the quality of generations from trained autoregressive language models is significantly influenced by the used sampling strategy. In this study, we investigate the impact of different sampling techniques on musical qualities such as diversity and structure. To accomplish this, we train a high-capacity transformer model on a vast collection of highly-structured Irish folk melodies and analyze the musical qualities of the samples generated using distribution truncation sampling techniques. Specifically, we use nucleus sampling, the recently proposed "typical sampling", and conventional ancestral sampling. We evaluate the effect of these sampling strategies in two scenarios: optimal circumstances with a well-calibrated model and suboptimal circumstances where we systematically degrade the model's performance. We assess the generated samples using objective and subjective evaluations. We discover that probability truncation techniques may restrict diversity and structural patterns in optimal circumstances, but may also produce more musical samples in suboptimal circumstances.
    摘要 自然语言处理研究表明,训练好的自回归语言模型的生成质量在很大程度上取决于所用的采样策略。在本研究中,我们考察了不同采样技术对多样性和结构等音乐特质的影响。为此,我们在大量高度结构化的爱尔兰民歌旋律上训练了一个高容量 transformer 模型,并分析了使用分布截断采样技术生成的样本的音乐特质。具体而言,我们使用核采样(nucleus sampling)、最近提出的“典型采样”(typical sampling)以及传统的祖先采样。我们在两种情形下评估这些采样策略的影响:模型校准良好的理想情形,以及我们系统性地降低模型性能的次优情形。我们通过客观和主观评估来考察生成的样本,发现概率截断技术在理想情形下可能会限制多样性和结构模式,但在次优情形下可能生成更具音乐性的样本。
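
Nucleus (top-p) sampling, one of the truncation strategies compared in the paper, can be written in a few lines: keep the smallest high-probability prefix of the sorted distribution, renormalize, and sample. The toy logits stand in for a melody model's next-token scores.

```python
import numpy as np

def nucleus_sample(logits, p=0.9, rng=None):
    """Top-p (nucleus) sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, renormalize, then sample."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1      # smallest prefix covering mass p
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))

logits = np.array([2.0, 1.5, 0.2, -1.0, -3.0])  # e.g. scores over 5 melody tokens
print([nucleus_sample(logits, p=0.8) for _ in range(10)])
```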

Scope is all you need: Transforming LLMs for HPC Code

  • paper_url: http://arxiv.org/abs/2308.09440
  • repo_url: https://github.com/scientific-computing-lab-nrcn/tokompiler
  • paper_authors: Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren
  • for: 这篇论文旨在探讨大型语言模型如何应用于编程任务,特别是高性能计算(HPC)领域的任务。
  • methods: 论文提出了一种名为 Tokompiler 的新型分词器(tokenizer),专为 HPC 及以编译为中心的任务预处理代码而设计。Tokompiler 利用语言基元知识生成面向语言的 token,在提供对代码结构的上下文感知的同时,完全避免了人为赋予代码结构的命名语义。
  • results: 实验结果表明,在归一化困惑度测试中,使用 Tokompiler 预训练的模型相比传统分词器显著提升了代码补全准确率和语义理解能力,困惑度降至约 1。这些结果为满足特定领域独特需求的领域专用 LLM 的发展开辟了前景。
    Abstract With easier access to powerful compute resources, there is a growing trend in the field of AI for software development to develop larger and larger language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size (e.g., billions of parameters) and demand expensive compute resources for training. We found this design choice confusing - why do we need large LLMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we aim to question design choices made by existing LLMs by developing smaller LLMs for specific domains - we call them domain-specific LLMs. Specifically, we start off with HPC as a domain and propose a novel tokenizer named Tokompiler, designed specifically for preprocessing code in HPC and compilation-centric tasks. Tokompiler leverages knowledge of language primitives to generate language-oriented tokens, providing a context-aware understanding of code structure while avoiding human semantics attributed to code structures completely. We applied Tokompiler to pre-train two state-of-the-art models, SPT-Code and Polycoder, for a Fortran code corpus mined from GitHub. We evaluate the performance of these models against the conventional LLMs. Results demonstrate that Tokompiler significantly enhances code completion accuracy and semantic understanding compared to traditional tokenizers in normalized-perplexity tests, down to ~1 perplexity score. This research opens avenues for further advancements in domain-specific LLMs, catering to the unique demands of HPC and compilation tasks.
    摘要 随着强大计算资源越来越容易获得,AI 辅助软件开发领域出现了开发越来越大的语言模型(LLM)以应对各类编程任务的趋势。即使是应用于高性能计算(HPC)领域任务的 LLM 也非常庞大(例如数十亿参数),训练需要昂贵的计算资源。我们认为这种设计选择令人困惑:为什么 HPC 特定任务需要在与 HPC 无关的自然语言和编程语言上训练的大型 LLM?在这项工作中,我们质疑现有 LLM 的设计选择,转而为特定领域开发更小的 LLM,我们称之为领域专用 LLM。具体而言,我们以 HPC 为起点,提出了一种名为 Tokompiler 的新型分词器,专为 HPC 和以编译为中心的任务预处理代码而设计。Tokompiler 利用语言基元知识生成面向语言的 token,在提供代码结构上下文感知的同时,完全避免了人为赋予代码结构的命名语义。我们使用 Tokompiler 对从 GitHub 挖掘的 Fortran 代码语料预训练了两个先进模型 SPT-Code 和 Polycoder,并与传统 LLM 进行比较。结果表明,在归一化困惑度测试中,Tokompiler 相比传统分词器显著提升了代码补全准确率和语义理解能力,困惑度降至约 1。这项研究为满足 HPC 和编译任务特殊需求的领域专用 LLM 开辟了进一步发展的道路。
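
One way to picture Tokompiler's language-oriented tokens is an identifier-anonymization pass: language primitives survive, while human-chosen names are replaced by enumerated placeholders. The primitive set and the regex below are a deliberately tiny, assumed subset, not the tool's actual grammar.

```python
import re

# Keywords/primitives kept verbatim (tiny Fortran-flavoured subset; illustrative).
PRIMITIVES = {"do", "end", "if", "then", "integer", "real", "call", "program"}

def anonymize(code):
    """Replace user-chosen identifiers with enumerated placeholder tokens,
    keeping language primitives, so no human naming semantics survive."""
    table = {}
    def sub(match):
        low = match.group(0).lower()
        if low in PRIMITIVES:
            return low
        if low not in table:
            table[low] = f"var_{len(table)}"   # enumerate first occurrences
        return table[low]
    return re.sub(r"[A-Za-z_]\w*", sub, code), table

src = "do total_sum = start, final_count\n  call accumulate(total_sum)\nend do"
print(anonymize(src)[0])
# do var_0 = var_1, var_2
#   call var_3(var_0)
# end do
```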

A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages

  • paper_url: http://arxiv.org/abs/2308.09435
  • repo_url: None
  • paper_authors: Nikita Martynov, Mark Baushenko, Anastasia Kozlova, Katerina Kolomeytseva, Aleksandr Abramov, Alena Fenogenova
  • for: 本研究的目的是提出一种生成式拼写纠错(SC)方法,以便在文本编辑任务中更好地纠正拼写错误和误输入。
  • methods: 本研究考察了文本中自然出现的拼写错误和误输入,以及如何在正确句子中模拟这些错误,以有效增强生成模型的预训练过程。我们还研究了不同损坏策略、模型结构和规模在预训练与微调阶段的影响,并考察了模型在不同文本领域的能力。
  • results: 我们在单领域和多领域测试集上评估了模型性能。作为实践成果,我们推出了名为 SAGE(Spell checking via Augmentation and Generative distribution Emulation)的自动生成式 SC 库,其中包含一系列预训练生成模型和内置的增强算法。
    Abstract Modern large language models demonstrate impressive capabilities in text generation and generalization. However, they often struggle with solving text editing tasks, particularly when it comes to correcting spelling errors and mistypings. In this paper, we present a methodology for generative spelling correction (SC), which was tested on English and Russian languages and potentially can be extended to any language with minor changes. Our research mainly focuses on exploring natural spelling errors and mistypings in texts and studying the ways those errors can be emulated in correct sentences to effectively enrich generative models' pre-train procedure. We investigate the impact of such emulations and the models' abilities across different text domains. In this work, we investigate two spelling corruption techniques: 1) first one mimics human behavior when making a mistake through leveraging statistics of errors from particular dataset and 2) second adds the most common spelling errors, keyboard miss clicks, and some heuristics within the texts. We conducted experiments employing various corruption strategies, models' architectures and sizes on the pre-training and fine-tuning stages and evaluated the models using single-domain and multi-domain test sets. As a practical outcome of our work, we introduce SAGE (Spell checking via Augmentation and Generative distribution Emulation) is a library for automatic generative SC that includes a family of pre-trained generative models and built-in augmentation algorithms.
    摘要 现代大语言模型在文本生成和泛化方面表现出色。然而,它们在文本编辑任务上常常遇到困难,特别是纠正拼写错误和误输入。在这篇论文中,我们提出了一种生成式拼写纠错(SC)方法,在英语和俄语上进行了测试,稍作修改即可适用于任何语言。我们的研究主要关注文本中自然出现的拼写错误和误输入,并研究如何在正确句子中模拟这些错误,以增强生成模型的预训练过程。我们考察了两种拼写损坏技术:一种利用特定数据集的错误统计来模仿人类犯错的行为,另一种在文本中加入最常见的拼写错误、键盘误击和一些启发式规则。我们在预训练和微调阶段使用不同的损坏策略、模型架构和规模进行实验,并在单领域和多领域测试集上评估模型。作为实践成果,我们推出了 SAGE(Spell checking via Augmentation and Generative distribution Emulation),一个自动生成式 SC 库,包含一系列预训练生成模型和内置的增强算法。
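
The first corruption technique, emulating natural typos, can be sketched with a keyboard-neighbourhood table and per-character substitution, omission, and duplication. The neighbour map and the rates below are illustrative assumptions rather than SAGE's trained error statistics.

```python
import random

# Neighbouring keys on a QWERTY layout (tiny excerpt; illustrative only).
NEIGHBOURS = {"a": "sq", "e": "wr", "o": "ip", "t": "ry", "n": "bm", "i": "uo"}

def corrupt(sentence, rate=0.1, seed=0):
    """Emulate natural typos: swap a character for a keyboard neighbour,
    drop it, or duplicate it, each governed by `rate` per character."""
    rng = random.Random(seed)
    out = []
    for ch in sentence:
        r = rng.random()
        if ch.lower() in NEIGHBOURS and r < rate:
            out.append(rng.choice(NEIGHBOURS[ch.lower()]))   # miss-click
        elif r < 1.5 * rate:
            continue                                          # omission
        elif r < 2.0 * rate:
            out.append(ch + ch)                               # duplication
        else:
            out.append(ch)
    return "".join(out)

clean = "the quick brown fox jumps over the lazy dog"
print(corrupt(clean))  # corrupted twin used as a pre-training pair with `clean`
```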

Leveraging Large Language Models for DRL-Based Anti-Jamming Strategies in Zero Touch Networks

  • paper_url: http://arxiv.org/abs/2308.09376
  • repo_url: None
  • paper_authors: Abubakar S. Ali, Dimitrios Michael Manias, Abdallah Shami, Sami Muhaidat
  • for: 这篇论文主要是为了探讨自动化网络中的零点touch网络(ZTN)概念,以及在自动化过程中提高网络透明度和用户交互的可能性。
  • methods: 本论文使用了大语言模型(LLM)来把自动化网络过程与人类中心的界面相结合,以提高网络透明度和用户交互。
  • results: 通过一个深度强化学习(DRL)基于的防止干扰技术的实践案例,本论文示出了 LLM 可以将复杂的网络操作概念化为人类可读的报告。
    Abstract As the dawn of sixth-generation (6G) networking approaches, it promises unprecedented advancements in communication and automation. Among the leading innovations of 6G is the concept of Zero Touch Networks (ZTNs), aiming to achieve fully automated, self-optimizing networks with minimal human intervention. Despite the advantages ZTNs offer in terms of efficiency and scalability, challenges surrounding transparency, adaptability, and human trust remain prevalent. Concurrently, the advent of Large Language Models (LLMs) presents an opportunity to elevate the ZTN framework by bridging the gap between automated processes and human-centric interfaces. This paper explores the integration of LLMs into ZTNs, highlighting their potential to enhance network transparency and improve user interactions. Through a comprehensive case study on deep reinforcement learning (DRL)-based anti-jamming technique, we demonstrate how LLMs can distill intricate network operations into intuitive, human-readable reports. Additionally, we address the technical and ethical intricacies of melding LLMs with ZTNs, with an emphasis on data privacy, transparency, and bias reduction. Looking ahead, we identify emerging research avenues at the nexus of LLMs and ZTNs, advocating for sustained innovation and interdisciplinary synergy in the domain of automated networks.
    摘要 随着第六代网络(6G)的到来,它承诺带来前所未有的通信与自动化进步。其中一项主导创新是零接触网络(ZTN),旨在实现几乎无需人工干预、完全自动化且自我优化的网络。虽然 ZTN 具有高效率和可扩展性的优势,但在透明度、适应性和人类信任方面仍存在许多挑战。与此同时,大型语言模型(LLM)的出现提供了一个机会,可以通过弥合自动化过程与以人为中心的界面之间的差距来提升 ZTN 框架。本文探讨了 LLM 与 ZTN 的整合,强调其在提高网络透明度和改善用户交互方面的潜力。通过一个基于深度强化学习(DRL)的抗干扰技术实践案例,我们展示了 LLM 可以将复杂的网络操作提炼成直观、人类可读的报告。此外,我们还讨论了将 LLM 与 ZTN 结合的技术和伦理复杂性,强调数据隐私、透明度和偏见减少。展望未来,我们指出了 LLM 与 ZTN 交汇处的新兴研究方向,倡导在自动化网络领域持续创新和跨学科协同。

TrOMR:Transformer-Based Polyphonic Optical Music Recognition

  • paper_url: http://arxiv.org/abs/2308.09370
  • repo_url: https://github.com/netease/polyphonic-tromr
  • paper_authors: Yixuan Li, Huaping Liu, Qiang Jin, Miaomiao Cai, Peng Li
  • for: 这篇论文研究光学乐谱识别(OMR)技术,旨在提出一种基于 transformer 的端到端复调乐谱识别方法,以提高识别准确率。
  • methods: 该方法使用 transformer 实现复调乐谱识别,并引入一种新的一致性损失函数和合理的数据标注方法来提高识别精度。
  • results: 实验表明,TrOMR 方法在实际场景下比现有的 OMR 方法表现更高,特别是在识别复杂的乐谱上。此外,作者还开发了 TrOMR 系统和一个实际拍摄的乐谱场景数据集。
    Abstract Optical Music Recognition (OMR) is an important technology in music and has been researched for a long time. Previous approaches for OMR are usually based on CNN for image understanding and RNN for music symbol classification. In this paper, we propose a transformer-based approach with excellent global perceptual capability for end-to-end polyphonic OMR, called TrOMR. We also introduce a novel consistency loss function and a reasonable approach for data annotation to improve recognition accuracy for complex music scores. Extensive experiments demonstrate that TrOMR outperforms current OMR methods, especially in real-world scenarios. We also develop a TrOMR system and build a camera scene dataset for full-page music scores in real-world. The code and datasets will be made available for reproducibility.
    摘要 光学乐谱识别(OMR)是音乐领域的一项重要技术,已被研究多年。以往的 OMR 方法通常基于 CNN 进行图像理解、基于 RNN 进行音乐符号分类。在这篇论文中,我们提出了一种具有出色全局感知能力的基于 transformer 的端到端复调 OMR 方法,称为 TrOMR。我们还引入了一种新的一致性损失函数和合理的数据标注方法,以提高复杂乐谱的识别率。大量实验表明,TrOMR 超越了当前的 OMR 方法,尤其是在真实场景中。我们还开发了 TrOMR 系统,并构建了一个真实场景下整页乐谱的相机拍摄数据集。代码和数据集将公开以便复现。

A tailored Handwritten-Text-Recognition System for Medieval Latin

  • paper_url: http://arxiv.org/abs/2308.09368
  • repo_url: None
  • paper_authors: Philipp Koch, Gilary Vera Nuñez, Esteban Garces Arias, Christian Heumann, Matthias Schöffel, Alexander Häberlin, Matthias Aßenmacher
  • for: 针对 medieval Latin dictionary 的数字化进行 Handwritten Text Recognition (HTR) 任务。
  • methods: 使用两个 state-of-the-art (SOTA) 图像分割模型进行数据集的准备,并运行多种组合的 transformer-based 模型和 GPT-2 解码器进行实验。
  • results: 实现了一个极具竞争力的模型,最佳配置达到 0.015 的字符错误率(CER),超过了商用的 Google Cloud Vision 模型,并且表现更加稳定。
    Abstract The Bavarian Academy of Sciences and Humanities aims to digitize its Medieval Latin Dictionary. This dictionary entails record cards referring to lemmas in medieval Latin, a low-resource language. A crucial step of the digitization process is the Handwritten Text Recognition (HTR) of the handwritten lemmas found on these record cards. In our work, we introduce an end-to-end pipeline, tailored to the medieval Latin dictionary, for locating, extracting, and transcribing the lemmas. We employ two state-of-the-art (SOTA) image segmentation models to prepare the initial data set for the HTR task. Furthermore, we experiment with different transformer-based models and conduct a set of experiments to explore the capabilities of different combinations of vision encoders with a GPT-2 decoder. Additionally, we also apply extensive data augmentation resulting in a highly competitive model. The best-performing setup achieved a Character Error Rate (CER) of 0.015, which is even superior to the commercial Google Cloud Vision model, and shows more stable performance.
    摘要 Bavarian Academy of Sciences and Humanities 计划数字化中世纪拉丁词典。这个词典包含手写 Record cards 上的中世纪拉丁词语,这是一种低资源语言。我们的工作是设计一个端到端管道,专门为中世纪拉丁词典进行找到、提取和转录词语的任务。我们使用两个现代状态的图像分割模型来准备初始数据集 для HTR 任务。此外,我们还对不同的 transformer 模型进行了试验,并对不同的视觉编码器与 GPT-2 解码器的不同组合进行了一系列实验。此外,我们还应用了广泛的数据增强,实现了非常竞争力的模型。最佳设置达到了 Character Error Rate (CER)0.015,超过了商业 Google Cloud Vision 模型,并且表现更加稳定。
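
Wiring a vision encoder to a GPT-2 decoder, as the paper does, is supported directly by the Hugging Face transformers API. The checkpoint names below are generic defaults assumed for illustration (the paper's own encoders and training data differ), and the cross-attention weights start untrained, so outputs are meaningless until fine-tuned.

```python
import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

# Placeholder checkpoints for whichever encoder/decoder pair one chooses.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# GPT-2 has no dedicated BOS/PAD for generation; wire them up explicitly.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

image = Image.new("RGB", (224, 224), "white")     # stand-in for a record-card crop
pixel_values = processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    ids = model.generate(pixel_values, max_new_tokens=16)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```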

Accelerated materials language processing enabled by GPT

  • paper_url: http://arxiv.org/abs/2308.09354
  • repo_url: None
  • paper_authors: Jaewoong Choi, Byungju Lee
  • for: 这研究的目的是提高材料科学文献中信息抽取的效率,并采用生成式预训练变换器(GPT)来替换优化的模型结构。
  • methods: 这研究使用了GPT引入的文档分类方法、命名实体识别(NER)方法和抽取问答(QA)方法,其中使用了策略性的提示工程来替换优化的模型结构。
  • results: 研究发现,使用GPT引入的方法可以实现与优化模型结构相当的准确率和可靠性,并且只需要小量的数据进行训练。此外,这些方法还可以在不同的材料科学领域中应用,以加速文献中信息抽取的过程。
    Abstract Materials language processing (MLP) is one of the key facilitators of materials science research, as it enables the extraction of structured information from massive materials science literature. Prior works suggested high-performance MLP models for text classification, named entity recognition (NER), and extractive question answering (QA), which require complex model architecture, exhaustive fine-tuning and a large number of human-labelled datasets. In this study, we develop generative pretrained transformer (GPT)-enabled pipelines where the complex architectures of prior MLP models are replaced with strategic designs of prompt engineering. First, we develop a GPT-enabled document classification method for screening relevant documents, achieving comparable accuracy and reliability compared to prior models, with only small dataset. Secondly, for NER task, we design an entity-centric prompts, and learning few-shot of them improved the performance on most of entities in three open datasets. Finally, we develop an GPT-enabled extractive QA model, which provides improved performance and shows the possibility of automatically correcting annotations. While our findings confirm the potential of GPT-enabled MLP models as well as their value in terms of reliability and practicability, our scientific methods and systematic approach are applicable to any materials science domain to accelerate the information extraction of scientific literature.
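
The entity-centric prompting idea can be sketched as a template that queries one entity type at a time with a few worked examples. The materials-science shots and wording below are assumptions for illustration, not the prompts used in the paper.

```python
def ner_prompt(paragraph, entity_type, examples):
    """Entity-centric few-shot prompt: one entity type per query, with
    worked examples preceding the target paragraph."""
    shots = "\n".join(
        f"Text: {text}\n{entity_type}: {', '.join(ents) if ents else 'none'}"
        for text, ents in examples
    )
    return (
        f"Extract every {entity_type} mentioned in the text.\n\n"
        f"{shots}\n\nText: {paragraph}\n{entity_type}:"
    )

few_shots = [
    ("LiFePO4 cathodes were cycled at 1C.", ["LiFePO4"]),
    ("The glass was annealed at 600 K.", []),
]
print(ner_prompt("We synthesized NaCoO2 thin films by sputtering.",
                 "material", few_shots))
```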

Document Automation Architectures: Updated Survey in Light of Large Language Models

  • paper_url: http://arxiv.org/abs/2308.09341
  • repo_url: None
  • paper_authors: Mohammad Ahmadi Achachlouei, Omkar Patil, Tarun Joshi, Vijayan N. Nair
  • for: 本研究审查了当前文档自动化(DA)领域的最新状况,尤其是在法律领域的商业解决方案中的自动化文档生成。
  • methods: 本研究通过审查学术文献,为DA的定义和特征提供了更清晰的定义,并识别了学术研究中的DA架构和技术。
  • results: 本研究提供了新的DA研究机遇,基于最新的生成AI和大语言模型。
    Abstract This paper surveys the current state of the art in document automation (DA). The objective of DA is to reduce the manual effort during the generation of documents by automatically creating and integrating input from different sources and assembling documents conforming to defined templates. There have been reviews of commercial solutions of DA, particularly in the legal domain, but to date there has been no comprehensive review of the academic research on DA architectures and technologies. The current survey of DA reviews the academic literature and provides a clearer definition and characterization of DA and its features, identifies state-of-the-art DA architectures and technologies in academic research, and provides ideas that can lead to new research opportunities within the DA field in light of recent advances in generative AI and large language models.

KESDT: knowledge enhanced shallow and deep Transformer for detecting adverse drug reactions

  • paper_url: http://arxiv.org/abs/2308.09329
  • repo_url: None
  • paper_authors: Yunzhi Qiu, Xiaokun Zhang, Weiwei Wang, Tongxuan Zhang, Bo Xu, Hongfei Lin
  • for: The paper aims to improve the detection of adverse drug reactions (ADRs) on social media platforms by proposing a novel model called Knowledge Enhanced Shallow and Deep Transformer (KESDT).
  • methods: The KESDT model incorporates domain keywords into the Transformer model through a shallow fusion manner and integrates synonym sets through a deep fusion manner to address the challenges of low annotated data and sample imbalance.
  • results: The proposed KESDT model outperforms state-of-the-art baselines on three public datasets (TwiMed, Twitter, and CADEC) in terms of F1 values, with relative improvements of 4.87%, 47.83%, and 5.73%, respectively.
    Abstract Adverse drug reaction (ADR) detection is an essential task in the medical field, as ADRs have a gravely detrimental impact on patients' health and the healthcare system. Due to a large number of people sharing information on social media platforms, an increasing number of efforts focus on social media data to carry out effective ADR detection. Despite having achieved impressive performance, the existing methods of ADR detection still suffer from three main challenges. Firstly, researchers have consistently ignored the interaction between domain keywords and other words in the sentence. Secondly, social media datasets suffer from the challenges of low annotated data. Thirdly, the issue of sample imbalance is commonly observed in social media datasets. To solve these challenges, we propose the Knowledge Enhanced Shallow and Deep Transformer(KESDT) model for ADR detection. Specifically, to cope with the first issue, we incorporate the domain keywords into the Transformer model through a shallow fusion manner, which enables the model to fully exploit the interactive relationships between domain keywords and other words in the sentence. To overcome the low annotated data, we integrate the synonym sets into the Transformer model through a deep fusion manner, which expands the size of the samples. To mitigate the impact of sample imbalance, we replace the standard cross entropy loss function with the focal loss function for effective model training. We conduct extensive experiments on three public datasets including TwiMed, Twitter, and CADEC. The proposed KESDT outperforms state-of-the-art baselines on F1 values, with relative improvements of 4.87%, 47.83%, and 5.73% respectively, which demonstrates the effectiveness of our proposed KESDT.
    摘要 医疗领域的药物不良反应(ADR)检测是一项非常重要的任务,因为 ADR 会对患者健康产生严重影响,也会给医疗系统带来沉重负担。随着越来越多的人在社交媒体平台上分享信息,越来越多的工作着眼于利用社交媒体数据进行有效的 ADR 检测。尽管现有检测方法已经取得了可观的表现,但仍面临三个主要挑战:第一,研究者一直忽略了领域关键词与句中其他词语之间的交互关系;第二,社交媒体数据集存在标注数据不足的问题;第三,社交媒体数据集中普遍存在样本不平衡问题。为解决这些挑战,我们提出了知识增强浅层与深层 Transformer(KESDT)模型用于 ADR 检测。具体而言,针对第一个问题,我们以浅层融合的方式将领域关键词融入 Transformer 模型,使模型能够充分利用领域关键词与句中其他词语之间的交互关系;为克服标注数据不足,我们以深层融合的方式将同义词集融入 Transformer 模型,从而扩充样本规模;为缓解样本不平衡的影响,我们将标准交叉熵损失函数替换为焦点损失(focal loss)函数,以便更有效地训练模型。我们在 TwiMed、Twitter 和 CADEC 三个公开数据集上进行了大量实验,KESDT 在 F1 值上相对最先进基线分别提升 4.87%、47.83% 和 5.73%,证明了其有效性。
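
The focal loss used to counter sample imbalance is standard and compact; a binary version with the usual gamma and alpha parameters is sketched below (the paper's exact variant and hyperparameters may differ).

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights well-classified examples by
    (1 - p_t)^gamma so rare-class errors dominate the gradient."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.tensor([2.5, -1.0, 0.3])          # e.g. ADR-mention scores
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets).item())
```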

Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge

  • paper_url: http://arxiv.org/abs/2308.09311
  • repo_url: None
  • paper_authors: Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Yong Man Ro
  • for: 本文提出了一种新的唇读(lip reading)框架,特别适用于低资源语言;以往文献对这一问题关注不足。由于低资源语言缺乏足够的视频-文本配对数据来训练模型,唇读模型的开发面临挑战。
  • methods: 我们通过预测语音单元(speech units),从高资源语言中学习通用语音知识,即建模唇部运动的能力。不同语言部分共享相同的音素,因此从一种语言学到的通用语音知识可以扩展到其他语言。随后,我们提出了语言特定记忆增强解码器(Language-specific Memory-augmented Decoder, LMDecoder)来学习语言特定知识。LMDecoder 将语言特定的音频特征存入记忆库,并可在比视频-文本配对数据更易获得的音频-文本配对数据上训练。因此,我们可以将输入的语音单元转换为语言特定的音频特征,并利用学到的丰富语言知识将其翻译为文本。
  • results: 通过对五种语言(英语、西班牙语、法语、意大利语、葡萄牙语)的广泛实验,我们证明了提出的方法的效iveness。
    Abstract This paper proposes a novel lip reading framework, especially for low-resource languages, which has not been well addressed in the previous literature. Since low-resource languages do not have enough video-text paired data to train the model to have sufficient power to model lip movements and language, it is regarded as challenging to develop lip reading models for low-resource languages. In order to mitigate the challenge, we try to learn general speech knowledge, the ability to model lip movements, from a high-resource language through the prediction of speech units. It is known that different languages partially share common phonemes, thus general speech knowledge learned from one language can be extended to other languages. Then, we try to learn language-specific knowledge, the ability to model language, by proposing Language-specific Memory-augmented Decoder (LMDecoder). LMDecoder saves language-specific audio features into memory banks and can be trained on audio-text paired data which is more easily accessible than video-text paired data. Therefore, with LMDecoder, we can transform the input speech units into language-specific audio features and translate them into texts by utilizing the learned rich language knowledge. Finally, by combining general speech knowledge and language-specific knowledge, we can efficiently develop lip reading models even for low-resource languages. Through extensive experiments using five languages, English, Spanish, French, Italian, and Portuguese, the effectiveness of the proposed method is evaluated.

Differentiable Retrieval Augmentation via Generative Language Modeling for E-commerce Query Intent Classification

  • paper_url: http://arxiv.org/abs/2308.09308
  • repo_url: None
  • paper_authors: Chenyu Zhao, Yunjiang Jiang, Yiming Qiu, Han Zhang, Wen-Yun Yang
  • for: 提高自然语言处理(NLP)任务中下游模型的性能,具体来说是在电商搜索中的查询意图分类任务。
  • methods: 通过知识检索器和外部语料库来增强下游模型,而非单纯增加模型参数;并通过一种新的可微分重构(Dragan),实现检索器与下游模型的端到端联合训练。
  • results: 通过实验和消融研究,证明了我们提出的方法在电商搜索的查询意图分类任务上显著且合理地超越了最先进的基线,在离线评估和在线 A/B 测试中均有提升。
    Abstract Retrieval augmentation, which enhances downstream models by a knowledge retriever and an external corpus instead of by merely increasing the number of model parameters, has been successfully applied to many natural language processing (NLP) tasks such as text classification, question answering and so on. However, existing methods that separately or asynchronously train the retriever and downstream model mainly due to the non-differentiability between the two parts, usually lead to degraded performance compared to end-to-end joint training. In this paper, we propose Differentiable Retrieval Augmentation via Generative lANguage modeling(Dragan), to address this problem by a novel differentiable reformulation. We demonstrate the effectiveness of our proposed method on a challenging NLP task in e-commerce search, namely query intent classification. Both the experimental results and ablation study show that the proposed method significantly and reasonably improves the state-of-the-art baselines on both offline evaluation and online A/B test.

Conversational Ontology Alignment with ChatGPT

  • paper_url: http://arxiv.org/abs/2308.09217
  • repo_url: None
  • paper_authors: Sanaz Saki Norouzi, Mohammad Saeid Mahdavinejad, Pascal Hitzler
  • for: 本研究评估了以朴素方式将 ChatGPT 用于本体对齐(ontology alignment)的可行性和效率。
  • methods: 将 ChatGPT 的输出与 Ontology Alignment Evaluation Initiative 2022 活动中会议赛道本体的结果进行比较,以洞察对话式大语言模型以朴素方式用于本体匹配的能力。
  • results: 比较显示 ChatGPT 的输出与 OAEI 2022 的结果有一定的一致性,但也存在差异,表明以朴素方式使用对话式大语言模型进行本体匹配的能力有限,但也具有一定潜力。
    Abstract This study evaluates the applicability and efficiency of ChatGPT for ontology alignment using a naive approach. ChatGPT's output is compared to the results of the Ontology Alignment Evaluation Initiative 2022 campaign using conference track ontologies. This comparison is intended to provide insights into the capabilities of a conversational large language model when used in a naive way for ontology matching, and to investigate the potential advantages and disadvantages of this approach.
    摘要 这项研究评估了 ChatGPT 在本体对齐中的适用性和效率,采用了一种朴素的方法:将 ChatGPT 的输出与 Ontology Alignment Evaluation Initiative 2022 活动中会议赛道本体的结果进行比较,以洞察对话式大语言模型以朴素方式用于本体匹配的能力,并考察这种方法的潜在优缺点。

A Comparative Study of Text Embedding Models for Semantic Text Similarity in Bug Reports

  • paper_url: http://arxiv.org/abs/2308.09193
  • repo_url: https://github.com/av9ash/duplicatebugdetection
  • paper_authors: Avinash Patil, Kihwan Han, Sabyasachi Mukhopadhyay
  • for: 本研究旨在比较不同文本 Similarity 方法在bug report retrieval中的效果,以提高bug report的检索效率。
  • methods: 本研究使用了TF-IDF (基eline), FastText, Gensim, BERT和ADA embedding模型进行比较。
  • results: 实验结果显示BERT模型在回忆率方面表现最好,其次是ADA模型,接下来是Gensim、FastText和TF-IDF模型。
    Abstract Bug reports are an essential aspect of software development, and it is crucial to identify and resolve them quickly to ensure the consistent functioning of software systems. Retrieving similar bug reports from an existing database can help reduce the time and effort required to resolve bugs. In this paper, we compared the effectiveness of semantic textual similarity methods for retrieving similar bug reports based on a similarity score. We explored several embedding models such as TF-IDF (Baseline), FastText, Gensim, BERT, and ADA. We used the Software Defects Data containing bug reports for various software projects to evaluate the performance of these models. Our experimental results showed that BERT generally outperformed the rest of the models regarding recall, followed by ADA, Gensim, FastText, and TFIDF. Our study provides insights into the effectiveness of different embedding methods for retrieving similar bug reports and highlights the impact of selecting the appropriate one for this task. Our code is available on GitHub.
    摘要 Bug 报告是软件开发中的重要组成部分,快速识别并解决它们对确保软件系统的稳定运行至关重要。从现有数据库中检索相似的 bug 报告可以减少解决 bug 所需的时间和精力。在这篇论文中,我们基于相似度得分比较了多种语义文本相似度方法在检索相似 bug 报告上的效果。我们考察了 TF-IDF(基线)、FastText、Gensim、BERT 和 ADA 等嵌入模型,并使用包含多个软件项目 bug 报告的 Software Defects Data 来评估这些模型的性能。实验结果表明,BERT 在召回率方面总体优于其他模型,其次是 ADA、Gensim、FastText 和 TF-IDF。我们的研究为不同嵌入方法在检索相似 bug 报告上的效果提供了见解,并强调了为此任务选择合适嵌入方法的重要性。我们的代码可以在 GitHub 上找到。
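
The TF-IDF baseline of this comparison takes only a few lines with scikit-learn: vectorize the corpus plus the query and rank stored reports by cosine similarity. The toy reports are assumptions for illustration; swapping in BERT or ADA embeddings changes only the vectorization step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reports = [
    "App crashes on startup when cache is corrupted",
    "Login button unresponsive after password reset",
    "Crash at launch if the local cache file is damaged",
]
query = "application crashes immediately on launch, corrupted cache suspected"

vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform(reports + [query])
# Similarity of the query (last row) against every stored report.
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
for i in scores.argsort()[::-1]:
    print(f"{scores[i]:.2f}  {reports[i]}")
```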

Is Argument Structure of Learner Chinese Understandable: A Corpus-Based Analysis

  • paper_url: http://arxiv.org/abs/2308.09186
  • repo_url: None
  • paper_authors: Yuguang Duan, Zi Lin, Weiwei Sun
  • for: 这个论文是为了分析learner中文中的意向结构错误的。
  • methods: 这个论文使用了 sentence produced by language learners和 их corrected by native speakers的数据,并与 semantic role labeling标注。
  • results: 这个论文发现了learner中文中的意向结构错误,包括word order、word selection、lack of proposition和argument-adjunct confounding等。
    Abstract This paper presents a corpus-based analysis of argument structure errors in learner Chinese. The data for analysis includes sentences produced by language learners as well as their corrections by native speakers. We couple the data with semantic role labeling annotations that are manually created by two senior students whose majors are both Applied Linguistics. The annotation procedure is guided by the Chinese PropBank specification, which is originally developed to cover first language phenomena. Nevertheless, we find that it is quite comprehensive for handling second language phenomena. The inter-annotator agreement is rather high, suggesting the understandability of learner texts to native speakers. Based on our annotations, we present a preliminary analysis of competence errors related to argument structure. In particular, speech errors related to word order, word selection, lack of proposition, and argument-adjunct confounding are discussed.
    摘要 本文对学习者中文中的论元结构错误进行了基于语料库的分析。分析数据包括语言学习者产出的句子及母语者对其的修改。我们将这些数据与两名应用语言学专业高年级学生人工创建的语义角色标注相结合。标注过程遵循最初为母语现象设计的中文 PropBank 规范;我们发现该规范在处理二语现象时也相当全面。标注者间一致性较高,表明学习者文本对母语者而言是可理解的。基于这些标注,我们对与论元结构相关的能力错误进行了初步分析,特别讨论了语序、选词、缺少命题以及论元与附加语混淆等错误。

ZhiJian: A Unifying and Rapidly Deployable Toolbox for Pre-trained Model Reuse

  • paper_url: http://arxiv.org/abs/2308.09158
  • repo_url: https://github.com/zhangyikaii/lamda-zhijian
  • paper_authors: Yi-Kai Zhang, Lu Ren, Chao Yi, Qi-Wei Wang, De-Chuan Zhan, Han-Jia Ye
  • for: 本研究旨在提供一个可用于实际应用中的模型重复使用工具箱(ZhiJian),实现了多种模型重复使用方法的整合,并提供了一个简单易用的PyTorch backend。
  • methods: 本工具基于 PyTorch 后端,提出了统一的模型重用范式,涵盖利用预训练模型(PTM)构建目标架构、利用 PTM 调优目标模型,以及基于 PTM 的推理。
  • results: ZhiJian 将多种模型重用视角整合到单一框架中,帮助深度学习从业者探索下游任务并发现不同方法之间的互补优势;工具已在 GitHub 开源,便于研究者和开发者快速上手。
    Abstract The rapid expansion of foundation pre-trained models and their fine-tuned counterparts has significantly contributed to the advancement of machine learning. Leveraging pre-trained models to extract knowledge and expedite learning in real-world tasks, known as "Model Reuse", has become crucial in various applications. Previous research focuses on reusing models within a certain aspect, including reusing model weights, structures, and hypothesis spaces. This paper introduces ZhiJian, a comprehensive and user-friendly toolbox for model reuse, utilizing the PyTorch backend. ZhiJian presents a novel paradigm that unifies diverse perspectives on model reuse, encompassing target architecture construction with PTM, tuning target model with PTM, and PTM-based inference. This empowers deep learning practitioners to explore downstream tasks and identify the complementary advantages among different methods. ZhiJian is readily accessible at https://github.com/zhangyikaii/lamda-zhijian facilitating seamless utilization of pre-trained models and streamlining the model reuse process for researchers and developers.
    摘要 基础预训练模型及其微调版本的快速扩张,极大推动了机器学习的发展。利用预训练模型提取知识并加速真实任务中的学习,即“模型重用”,已在各类应用中变得至关重要。以往研究主要关注模型重用的某一特定方面,包括重用模型权重、结构和假设空间。本文介绍了 ZhiJian,一个基于 PyTorch 后端、全面且易用的模型重用工具箱。ZhiJian 提出了一种统一多种模型重用视角的新范式,涵盖利用 PTM 构建目标架构、利用 PTM 调优目标模型以及基于 PTM 的推理,使深度学习从业者能够探索下游任务并发现不同方法之间的互补优势。ZhiJian 可在 https://github.com/zhangyikaii/lamda-zhijian 获取,便于预训练模型的无缝使用,并为研究者和开发者简化模型重用流程。

  • paper_url: http://arxiv.org/abs/2308.09156
  • repo_url: None
  • paper_authors: Omar Sharif, Madhusudan Basak, Tanzia Parvin, Ava Scharfstein, Alphonso Bradham, Jacob T. Borodovsky, Sarah E. Lord, Sarah Masud Preum
    for: This paper focuses on analyzing health-related information-seeking on social media, specifically on Reddit, to understand the treatment options and misconceptions related to Opioid Use Disorder (OUD).methods: The authors use a novel approach called event-driven analysis to categorize health-related information-seeking on social media into different events, such as treatment options, misconceptions, and knowledge gaps. They also develop a dataset called TREAT-ISE, which contains Reddit posts annotated with the type of events related to recovery from OUD.results: The authors achieve a strong performance benchmark of 77.4% F1 score for the task of classifying information-seeking events on Reddit related to OUD using machine learning and deep learning classifiers. They also investigate the performance and errors of ChatGPT on this task, providing insights into the capabilities and limitations of large language models.
    Abstract Social media sites have become a popular platform for individuals to seek and share health information. Despite the progress in natural language processing for social media mining, a gap remains in analyzing health-related texts on social discourse in the context of events. Event-driven analysis can offer insights into different facets of healthcare at an individual and collective level, including treatment options, misconceptions, knowledge gaps, etc. This paper presents a paradigm to characterize health-related information-seeking in social discourse through the lens of events. Events here are board categories defined with domain experts that capture the trajectory of the treatment/medication. To illustrate the value of this approach, we analyze Reddit posts regarding medications for Opioid Use Disorder (OUD), a critical global health concern. To the best of our knowledge, this is the first attempt to define event categories for characterizing information-seeking in OUD social discourse. Guided by domain experts, we develop TREAT-ISE, a novel multilabel treatment information-seeking event dataset to analyze online discourse on an event-based framework. This dataset contains Reddit posts on information-seeking events related to recovery from OUD, where each post is annotated based on the type of events. We also establish a strong performance benchmark (77.4% F1 score) for the task by employing several machine learning and deep learning classifiers. Finally, we thoroughly investigate the performance and errors of ChatGPT on this task, providing valuable insights into the LLM's capabilities and ongoing characterization efforts.

Linearity of Relation Decoding in Transformer Language Models

  • paper_url: http://arxiv.org/abs/2308.09124
  • repo_url: None
  • paper_authors: Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau
  • for: 本研究探讨 transformer 语言模型(LM)中大量知识的表示方式,这些知识往往可以用关系来表达。
  • methods: 研究人员通过从单个提示对 LM 构建一阶近似,证明对于一部分关系,这种计算可以很好地近似为作用在主语表示上的单一线性变换。
  • results: 研究发现,LM 的表示中也存在许多并非线性编码的关系知识:模型的预测能准确捕捉这些关系,但其表示中并没有线性地编码它们。这些结果表明 LM 采用了一种简单、可解释、但部署并不均匀的知识表示策略。
    Abstract Much of the knowledge encoded in transformer language models (LMs) may be expressed in terms of relations: relations between words and their synonyms, entities and their attributes, etc. We show that, for a subset of relations, this computation is well-approximated by a single linear transformation on the subject representation. Linear relation representations may be obtained by constructing a first-order approximation to the LM from a single prompt, and they exist for a variety of factual, commonsense, and linguistic relations. However, we also identify many cases in which LM predictions capture relational knowledge accurately, but this knowledge is not linearly encoded in their representations. Our results thus reveal a simple, interpretable, but heterogeneously deployed knowledge representation strategy in transformer LMs.
    摘要 transformer 语言模型中编码的许多知识可以用关系来表达:词与其同义词之间的关系、实体与其属性之间的关系等。我们发现,对于一部分关系,这种计算可以很好地近似为作用在主语表示上的单一线性变换。线性关系表示可以通过从单个提示对 LM 构建一阶近似来获得,并且存在于多种事实、常识和语言关系中。然而,我们也发现许多情况下,LM 的预测准确捕捉了关系知识,但这些知识并没有线性地编码在其表示中。因此,我们的结果揭示了 transformer 语言模型中一种简单、可解释、但部署并不均匀的知识表示策略。
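
The linear-relation claim can be made concrete with a least-squares fit: given paired subject and object representations for one relation, estimate an affine map o ≈ W s + b and check how well it predicts held-out objects. The synthetic Gaussian vectors below stand in for actual LM hidden states, which is the key assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                   # hidden size (toy scale)
W_true = rng.normal(size=(d, d)) / np.sqrt(d)
b_true = rng.normal(size=d)

# Synthetic (subject, object) representation pairs for one relation.
S = rng.normal(size=(200, d))
O = S @ W_true.T + b_true + 0.01 * rng.normal(size=(200, d))

# Fit the affine map o ≈ W s + b by least squares on the held-in pairs.
S_aug = np.hstack([S, np.ones((200, 1))])
coef, *_ = np.linalg.lstsq(S_aug, O, rcond=None)
W_hat, b_hat = coef[:d].T, coef[d]

# Faithfulness check on fresh subjects: does the linear map predict objects?
S_test = rng.normal(size=(20, d))
err = np.linalg.norm(S_test @ W_hat.T + b_hat - (S_test @ W_true.T + b_true))
print(f"held-out residual norm: {err:.3f}")
```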

MaScQA: A Question Answering Dataset for Investigating Materials Science Knowledge of Large Language Models

  • paper_url: http://arxiv.org/abs/2308.09115
  • repo_url: None
  • paper_authors: Mohd Zaki, Jayadeva, Mausam, N. M. Anoop Krishnan
  • for: 本研究旨在构建一个能够加速材料发现的全面知识库,从材料科学文献中进行信息抽取和文本理解,以满足材料科学领域的研究需求。
  • methods: 本研究使用语言模型来回答材料领域的问题,并从知识库中提取信息。
  • results: GPT-4模型在解决材料领域的650个问题中表现最好(约62%的准确率),而链式思维提示对模型的性能没有显著提升。研究发现,概念错误(约64%)是LLMs表现下降的主要原因,而计算错误(约36%)则是次要原因。
    Abstract Information extraction and textual comprehension from materials literature are vital for developing an exhaustive knowledge base that enables accelerated materials discovery. Language models have demonstrated their capability to answer domain-specific questions and retrieve information from knowledge bases. However, there are no benchmark datasets in the materials domain that can evaluate the understanding of the key concepts by these language models. In this work, we curate a dataset of 650 challenging questions from the materials domain that require the knowledge and skills of a materials student who has cleared their undergraduate degree. We classify these questions based on their structure and the materials science domain-based subcategories. Further, we evaluate the performance of GPT-3.5 and GPT-4 models on solving these questions via zero-shot and chain of thought prompting. It is observed that GPT-4 gives the best performance (~62% accuracy) as compared to GPT-3.5. Interestingly, in contrast to the general observation, no significant improvement in accuracy is observed with the chain of thought prompting. To evaluate the limitations, we performed an error analysis, which revealed conceptual errors (~64%) as the major contributor compared to computational errors (~36%) towards the reduced performance of LLMs. We hope that the dataset and analysis performed in this work will promote further research in developing better materials science domain-specific LLMs and strategies for information extraction.
    摘要 信息抽取和文本理解从材料文献中是发展加速材料发现的关键。语言模型已经表现出其能够回答域务特定问题和从知识库中提取信息。但是,在材料领域没有 benchmark 数据集来评估这些语言模型对关键概念的理解。在这项工作中,我们积集了 650 个材料领域的复杂问题,这些问题需要Materials 学生完成本科学位课程后的知识和技能。我们将这些问题分类为结构和材料科学领域下的子类别。然后,我们使用 zero-shot 和链条提问训练 GPT-3.5 和 GPT-4 模型,并评估其性能。结果显示,GPT-4 的性能最高(约 62% 准确率),而 GPT-3.5 的性能较低。另外,与通常观察不同,链条提问不对性能的提高有显著影响。为了评估局限性,我们进行了错误分析,发现概念错误(约 64%)是 LLMS 表现不佳的主要原因,而计算错误(约 36%)则是次要原因。我们希望这些数据和分析可以促进更好的材料科学领域特定 LLMS 的开发和信息抽取策略的研究。

mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view Contrastive Learning

  • paper_url: http://arxiv.org/abs/2308.09073
  • repo_url: None
  • paper_authors: Ying Mo, Jian Yang, Jiahao Liu, Qifan Wang, Ruoyu Chen, Jingang Wang, Zhoujun Li
  • for: 提高 Cross-Lingual Named Entity Recognition (CrossNER) 的性能,尤其是非英语数据的 scarcity 问题。
  • methods: 提出 Multi-view Contrastive Learning for Cross-Lingual Named Entity Recognition (mCL-NER),通过考虑 token 之间的关系来协调 semantic 和 token-level 表示之间的差异。
  • results: 在 XTREME benchmark 上进行了实验,与先前的数据驱动和模型驱动方法进行比较,显示 mCL-NER 可以大幅提高 CrossNER 的性能,在40种语言中提高了 nearly +2.0 $F_1$ 得分。
    Abstract Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora, especially for non-English data. While prior efforts mainly focus on data-driven transfer methods, a significant aspect that has not been fully explored is aligning both semantic and token-level representations across diverse languages. In this paper, we propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER). Specifically, we reframe the CrossNER task into a problem of recognizing relationships between pairs of tokens. This approach taps into the inherent contextual nuances of token-to-token connections within entities, allowing us to align representations across different languages. A multi-view contrastive learning framework is introduced to encompass semantic contrasts between source, codeswitched, and target sentences, as well as contrasts among token-to-token relations. By enforcing agreement within both semantic and relational spaces, we minimize the gap between source sentences and their counterparts of both codeswitched and target sentences. This alignment extends to the relationships between diverse tokens, enhancing the projection of entities across languages. We further augment CrossNER by combining self-training with labeled source data and unlabeled target data. Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches. It achieves a substantial increase of nearly +2.0 $F_1$ scores across a broad spectrum and establishes itself as the new state-of-the-art performer.
    摘要 跨语言命名实体识别(CrossNER)面临多语言语料稀缺(尤其是非英语数据)导致性能不均衡的挑战。现有工作主要集中在数据驱动的迁移方法上,而一个尚未被充分探索的重要方面是对不同语言的语义级和 token 级表示进行对齐。本文提出多视角对比学习跨语言命名实体识别方法(mCL-NER)。具体而言,我们将 CrossNER 任务重新表述为识别 token 对之间关系的问题。这种方法利用实体内部 token 与 token 连接中固有的上下文细微差异,从而实现不同语言间的表示对齐。我们引入一个多视角对比学习框架,涵盖源句、语码转换句和目标句之间的语义对比,以及 token 间关系的对比。通过在语义空间和关系空间内同时强制一致,我们最小化了源句与其语码转换句及目标句对应句之间的差距。这种对齐还延伸到不同 token 之间的关系,增强了实体在不同语言间的投影。我们进一步结合自训练,利用带标注的源数据和无标注的目标数据来增强 CrossNER。我们在覆盖 40 种语言的 XTREME 基准上的实验表明,mCL-NER 优于以往数据驱动和基于模型的方法,在广泛范围内取得近 +2.0 的 $F_1$ 提升,树立了新的最先进水平。

cs.LG - 2023-08-18

Revisiting Skin Tone Fairness in Dermatological Lesion Classification

  • paper_url: http://arxiv.org/abs/2308.09640
  • repo_url: https://github.com/tkalbl/revisitingskintonefairness
  • paper_authors: Thorsten Kalb, Kaisar Kushibar, Celia Cintas, Karim Lekadir, Oliver Diaz, Richard Osuala
  • for: 本研究目的是为了评估皮肤疾病分类算法的公平性,因为皮肤疾病的表现可能因皮肤颜色而异常。
  • methods: 本研究使用了Individual Typology Angle(ITA)来估计皮肤颜色,并对ISIC18数据集进行了比较和分析。
  • results: 研究发现,先前发表的各项研究之间存在很大分歧,表明基于 ITA 的肤色估计方法存在风险;同时发现 ISIC18 数据集缺乏多样性,限制了其作为公平性分析测试平台的作用。
    Abstract Addressing fairness in lesion classification from dermatological images is crucial due to variations in how skin diseases manifest across skin tones. However, the absence of skin tone labels in public datasets hinders building a fair classifier. To date, such skin tone labels have been estimated prior to fairness analysis in independent studies using the Individual Typology Angle (ITA). Briefly, ITA calculates an angle based on pixels extracted from skin images taking into account the lightness and yellow-blue tints. These angles are then categorised into skin tones that are subsequently used to analyse fairness in skin cancer classification. In this work, we review and compare four ITA-based approaches of skin tone classification on the ISIC18 dataset, a common benchmark for assessing skin cancer classification fairness in the literature. Our analyses reveal a high disagreement among previously published studies demonstrating the risks of ITA-based skin tone estimation methods. Moreover, we investigate the causes of such large discrepancy among these approaches and find that the lack of diversity in the ISIC18 dataset limits its use as a testbed for fairness analysis. Finally, we recommend further research on robust ITA estimation and diverse dataset acquisition with skin tone annotation to facilitate conclusive fairness assessments of artificial intelligence tools in dermatology. Our code is available at https://github.com/tkalbl/RevisitingSkinToneFairness.

Development of a Neural Network-based Method for Improved Imputation of Missing Values in Time Series Data by Repurposing DataWig

  • paper_url: http://arxiv.org/abs/2308.09635
  • repo_url: None
  • paper_authors: Daniel Zhang
  • for: Providing a reliable method for imputing missing values in time series data, to support better decision making in research, business, and governance.
  • methods: Develops tsDataWig, a modification of the neural network-based DataWig method that can directly handle values of time variables and impute missing values in complex time series datasets.
  • results: tsDataWig outperforms the original DataWig and current state-of-the-art methods on one simulated and three real-world time series datasets, and has potentially broad applicability because it does not require strong assumptions about the data missing mechanism.
    Abstract Time series data are observations collected over time intervals. Successful analysis of time series data captures patterns such as trends, cyclicity and irregularity, which are crucial for decision making in research, business, and governance. However, missing values in time series data occur often and present obstacles to successful analysis, thus they need to be filled with alternative values, a process called imputation. Although various approaches have been attempted for robust imputation of time series data, even the most advanced methods still face challenges including limited scalability, poor capacity to handle heterogeneous data types and inflexibility due to requiring strong assumptions of data missing mechanisms. Moreover, the imputation accuracy of these methods still has room for improvement. In this study, I developed tsDataWig (time-series DataWig) by modifying DataWig, a neural network-based method that possesses the capacity to process large datasets and heterogeneous data types but was designed for non-time series data imputation. Unlike the original DataWig, tsDataWig can directly handle values of time variables and impute missing values in complex time series datasets. Using one simulated and three different complex real-world time series datasets, I demonstrated that tsDataWig outperforms the original DataWig and the current state-of-the-art methods for time series data imputation and potentially has broad application due to not requiring strong assumptions of data missing mechanisms. This study provides a valuable solution for robustly imputing missing values in challenging time series datasets, which often contain millions of samples, high dimensional variables, and heterogeneous data types.
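Independently of the imputer used, methods of this kind are commonly benchmarked by hiding observed cells, imputing them, and scoring the reconstruction. A minimal, library-agnostic sketch of that protocol follows; `impute` is a placeholder for any imputation method (e.g., tsDataWig, whose interface is not public).

```python
import numpy as np
import pandas as pd

def evaluate_imputer(df: pd.DataFrame, impute, mask_frac: float = 0.1, seed: int = 0) -> float:
    """Hide a fraction of the observed cells, impute them, and return the RMSE."""
    rng = np.random.default_rng(seed)
    observed = df.notna().to_numpy()
    hidden = observed & (rng.random(df.shape) < mask_frac)  # cells to hide
    corrupted = df.mask(hidden)      # same frame with extra missing values
    imputed = impute(corrupted)      # placeholder for any imputer
    truth = df.to_numpy()[hidden].astype(float)
    pred = imputed.to_numpy()[hidden].astype(float)
    return float(np.sqrt(np.mean((truth - pred) ** 2)))
```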

VALERIE22 – A photorealistic, richly metadata annotated dataset of urban environments

  • paper_url: http://arxiv.org/abs/2308.09632
  • repo_url: None
  • paper_authors: Oliver Grau, Korbinian Hagn
  • For: The paper aims to contribute to the understanding of domain-specific factors that influence the perception performance of deep neural networks (DNNs) in the context of pedestrian detection in urban environments for automated driving.
  • Methods: The paper presents the VALERIE tool pipeline, a synthetic data generator that uses procedural tools to render photorealistic sensor data from automatically synthesized scenes. The resulting dataset provides a rich set of metadata (such as pixel-accurate occlusion rates, positions in the scene, and distance and angle to the camera), enabling a variety of tests to understand the performance of DNNs.
  • Results: Based on performance metrics, the paper demonstrates that VALERIE22 is one of the best-performing synthetic datasets currently available in the open domain.
    Abstract The VALERIE tool pipeline is a synthetic data generator developed with the goal to contribute to the understanding of domain-specific factors that influence perception performance of DNNs (deep neural networks). This work was carried out under the German research project KI Absicherung in order to develop a methodology for the validation of DNNs in the context of pedestrian detection in urban environments for automated driving. The VALERIE22 dataset was generated with the VALERIE procedural tools pipeline providing a photorealistic sensor simulation rendered from automatically synthesized scenes. The dataset provides a uniquely rich set of metadata, allowing extraction of specific scene and semantic features (like pixel-accurate occlusion rates, positions in the scene and distance + angle to the camera). This enables a multitude of possible tests on the data and we hope to stimulate research on understanding performance of DNNs. Based on performance metrics, a comparison with several other publicly available datasets is provided, demonstrating that VALERIE22 is one of the best performing synthetic datasets currently available in the open domain.
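As an illustration of the kind of analysis the pixel-accurate metadata enables, an occlusion rate can be computed directly from per-instance masks. A minimal sketch, assuming an amodal (full-extent) mask and a visible mask are available for each object:

```python
import numpy as np

def occlusion_rate(amodal_mask: np.ndarray, visible_mask: np.ndarray) -> float:
    """Fraction of an object's pixels hidden by occluders.

    amodal_mask:  boolean mask of the full object extent, including hidden parts.
    visible_mask: boolean mask of the pixels actually visible in the image.
    """
    total = int(amodal_mask.sum())
    if total == 0:
        return 0.0
    visible = int((amodal_mask & visible_mask).sum())
    return 1.0 - visible / total
```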

Learning Computational Efficient Bots with Costly Features

  • paper_url: http://arxiv.org/abs/2308.09629
  • repo_url: None
  • paper_authors: Anthony Kobanda, Valliappan C. A., Joshua Romoff, Ludovic Denoyer
  • For: The paper targets decision-making in real-time settings, particularly video games, where an agent must make relevant decisions at very high frequency with a very limited inference budget.
  • Methods: The paper proposes a generic offline learning approach that takes the computational cost of the input features into account. The Budgeted Decision Transformer extends the Decision Transformer with budgeting constraints so that the model dynamically chooses the best input features at each timestep.
  • Results: The paper demonstrates the method on several tasks, including D4RL benchmarks and complex 3D environments similar to those found in video games, achieving similar performance while using significantly fewer computational resources than classical approaches.
    Abstract Deep reinforcement learning (DRL) techniques have become increasingly used in various fields for decision-making processes. However, a challenge that often arises is the trade-off between both the computational efficiency of the decision-making process and the ability of the learned agent to solve a particular task. This is particularly critical in real-time settings such as video games where the agent needs to take relevant decisions at a very high frequency, with a very limited inference time. In this work, we propose a generic offline learning approach where the computation cost of the input features is taken into account. We derive the Budgeted Decision Transformer as an extension of the Decision Transformer that incorporates cost constraints to limit its cost at inference. As a result, the model can dynamically choose the best input features at each timestep. We demonstrate the effectiveness of our method on several tasks, including D4RL benchmarks and complex 3D environments similar to those found in video games, and show that it can achieve similar performance while using significantly fewer computational resources compared to classical approaches.
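The paper's learned gating mechanism is not reproduced here; as a purely hypothetical illustration of the underlying idea, choosing input features at each timestep subject to a compute budget, a greedy utility-per-cost selection looks as follows:

```python
def select_features(utilities, costs, budget):
    """Greedy knapsack-style selection of feature indices under a compute budget.
    `utilities` stands in for learned estimates of each feature's value."""
    order = sorted(range(len(costs)), key=lambda i: utilities[i] / costs[i], reverse=True)
    chosen, spent = [], 0.0
    for i in order:
        if spent + costs[i] <= budget:
            chosen.append(i)
            spent += costs[i]
    return chosen

# Example: four candidate features with per-step costs (ms) and a 3 ms budget.
print(select_features([0.9, 0.5, 0.4, 0.1], [2.0, 1.0, 0.5, 0.5], budget=3.0))
```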

Constrained Bayesian Optimization Using a Lagrange Multiplier Applied to Power Transistor Design

  • paper_url: http://arxiv.org/abs/2308.09612
  • repo_url: None
  • paper_authors: Ping-Ju Chuang, Ali Saadat, Sara Ghazvini, Hal Edwards, William G. Vandenberghe
  • for: Optimizing the design process of LDMOS transistors while realizing a target breakdown voltage (BV).
  • methods: Converts the constrained Bayesian Optimization (BO) problem into a conventional BO problem by using a Lagrange multiplier, setting the Lagrangian rather than the traditional Figure-of-Merit (FOM) as the objective function.
  • results: Automatically obtains devices in the design space that satisfy both the optimized FOM and the target BV constraint, and explores the physical limits of the FOM for devices in the 30-50 V range.
    Abstract We propose a novel constrained Bayesian Optimization (BO) algorithm optimizing the design process of Laterally-Diffused Metal-Oxide-Semiconductor (LDMOS) transistors while realizing a target Breakdown Voltage (BV). We convert the constrained BO problem into a conventional BO problem using a Lagrange multiplier. Instead of directly optimizing the traditional Figure-of-Merit (FOM), we set the Lagrangian as the objective function of BO. This adaptive objective function with a changeable Lagrange multiplier can address constrained BO problems which have constraints that require costly evaluations, without the need for additional surrogate models to approximate constraints. Our algorithm enables a device designer to set the target BV in the design space, and obtain a device that satisfies the optimized FOM and the target BV constraint automatically. Utilizing this algorithm, we have also explored the physical limits of the FOM for our devices in 30 - 50 V range within the defined design space.
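A minimal sketch of the core idea, folding the BV constraint into a Lagrangian that a standard BO loop can optimize. Here `fom` and `bv` stand in for the costly device simulator, and the penalty form and multiplier update are illustrative rather than the paper's exact formulation.

```python
def make_objective(fom, bv, bv_target, lam):
    """Lagrangian used as the BO objective: reward the Figure-of-Merit while
    penalizing deviation of the breakdown voltage from its target."""
    def objective(x):
        return fom(x) - lam * abs(bv(x) - bv_target)
    return objective

# A conventional BO loop maximizes objective(x) over the design space; adapting
# `lam` between rounds drives |bv(x) - bv_target| toward zero, so no separate
# surrogate model for the constraint is needed.
```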

Solving PDEs on Spheres with Physics-Informed Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2308.09605
  • repo_url: None
  • paper_authors: Guanhang Lei, Zhen Lei, Lei Shi, Chenyu Zeng, Ding-Xuan Zhou
  • for: Providing a mathematical analysis of the numerical performance of physics-informed convolutional neural networks (PICNNs) for solving partial differential equations (PDEs) on the sphere.
  • methods: Uses and improves the latest approximation results for deep convolutional neural networks together with spherical harmonic analysis to establish an upper bound on the approximation error with respect to the Sobolev norm, and integrates this with an innovative localization complexity analysis.
  • results: Establishes fast convergence rates for PICNN on spherical PDEs; the theoretical results are confirmed and supplemented by experiments, which also motivate strategies for circumventing the curse of dimensionality in high-dimensional PDEs.
    Abstract Physics-informed neural networks (PINNs) have been demonstrated to be efficient in solving partial differential equations (PDEs) from a variety of experimental perspectives. Some recent studies have also proposed PINN algorithms for PDEs on surfaces, including spheres. However, theoretical understanding of the numerical performance of PINNs, especially PINNs on surfaces or manifolds, is still lacking. In this paper, we establish rigorous analysis of the physics-informed convolutional neural network (PICNN) for solving PDEs on the sphere. By using and improving the latest approximation results of deep convolutional neural networks and spherical harmonic analysis, we prove an upper bound for the approximation error with respect to the Sobolev norm. Subsequently, we integrate this with innovative localization complexity analysis to establish fast convergence rates for PICNN. Our theoretical results are also confirmed and supplemented by our experiments. In light of these findings, we explore potential strategies for circumventing the curse of dimensionality that arises when solving high-dimensional PDEs.
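For context, the physics-informed training objective such analyses study penalizes the PDE residual at collocation points. A generic PyTorch sketch for a Poisson-type equation $-\Delta u = f$ follows; the paper's architecture is convolutional and sphere-specific, and this flat-domain version only shows the loss construction.

```python
import torch

def pde_residual_loss(u_net, x, f):
    """Mean squared residual of -Δu = f at collocation points x of shape [N, d]."""
    x = x.requires_grad_(True)
    u = u_net(x).sum()
    (grad_u,) = torch.autograd.grad(u, x, create_graph=True)
    lap = torch.zeros(x.shape[0], device=x.device)
    for i in range(x.shape[1]):  # trace of the Hessian, one coordinate at a time
        (g2,) = torch.autograd.grad(grad_u[:, i].sum(), x, create_graph=True)
        lap = lap + g2[:, i]
    return ((-lap - f) ** 2).mean()
```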

Breaking the Complexity Barrier in Compositional Minimax Optimization

  • paper_url: http://arxiv.org/abs/2308.09604
  • repo_url: None
  • paper_authors: Jin Liu, Xiaokang Pan, Junwen Duan, Hongdong Li, Youqi Li, Zhe Qu
  • for: Solving compositional minimax optimization problems in machine learning, including distributionally robust training and policy evaluation for reinforcement learning.
  • methods: Proposes the Nested STOchastic Recursive Momentum (NSTORM) algorithm, which attains the optimal sample complexity of $O(\kappa^3/\epsilon^3)$ for finding an $\epsilon$-accurate solution, and ADA-NSTORM, a variant with adaptive learning rates that achieves the same sample complexity while proving more effective in experiments.
  • results: The methods match lower bounds for minimax optimization without requiring large batch sizes; extensive experiments show that ADA-NSTORM is more stable and effective in practice.
    Abstract Compositional minimax optimization is a pivotal yet under-explored challenge across machine learning, including distributionally robust training and policy evaluation for reinforcement learning. Current techniques exhibit suboptimal complexity or rely heavily on large batch sizes. This paper proposes Nested STOchastic Recursive Momentum (NSTORM), attaining the optimal sample complexity of $O(\kappa^3/\epsilon^3)$ for finding an $\epsilon$-accurate solution. However, NSTORM requires low learning rates, potentially limiting applicability. Thus we introduce ADAptive NSTORM (ADA-NSTORM) with adaptive learning rates, proving it achieves the same sample complexity while experiments demonstrate greater effectiveness. Our methods match lower bounds for minimax optimization without large batch requirements, validated through extensive experiments. This work significantly advances compositional minimax optimization, a crucial capability for distributional robustness and policy evaluation
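For context, the recursive stochastic momentum that NSTORM nests (STORM-style variance reduction) maintains a gradient estimate corrected on every fresh minibatch. A minimal single-level sketch; the nested compositional estimators, the minimax structure, and ADA-NSTORM's adaptive learning rates are all omitted.

```python
def storm_update(x, x_prev, d_prev, grad, batch, a=0.1, lr=0.01):
    """One STORM step:
        d_t = g(x_t; xi_t) + (1 - a) * (d_{t-1} - g(x_{t-1}; xi_t))
        x_{t+1} = x_t - lr * d_t
    Both gradients are evaluated on the same fresh minibatch `batch`."""
    d = grad(x, batch) + (1.0 - a) * (d_prev - grad(x_prev, batch))
    return x - lr * d, d
```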

Disparity, Inequality, and Accuracy Tradeoffs in Graph Neural Networks for Node Classification

  • paper_url: http://arxiv.org/abs/2308.09596
  • repo_url: https://github.com/arpitdm/gnn_accuracy_fairness_tradeoff
  • paper_authors: Arpit Merchant, Carlos Castillo
  • for: Quantifying how graph neural networks (GNNs) for node classification may be biased against protected demographic groups, and proposing two GNN-agnostic interventions (PFR-AX and PostProcess) to mitigate such bias.
  • methods: A large set of experiments on four datasets, benchmarking the two interventions (and three variants) against three strong baseline interventions (random dropout, weighted dropout, and PGD) on three state-of-the-art GNN models.
  • results: No single intervention offers a universally optimal fairness-accuracy tradeoff across datasets and models, but PFR-AX and PostProcess provide granular control and improve model confidence when correctly predicting positive outcomes for nodes in protected groups.
    Abstract Graph neural networks (GNNs) are increasingly used in critical human applications for predicting node labels in attributed graphs. Their ability to aggregate features from nodes' neighbors for accurate classification also has the capacity to exacerbate existing biases in data or to introduce new ones towards members from protected demographic groups. Thus, it is imperative to quantify how GNNs may be biased and to what extent their harmful effects may be mitigated. To this end, we propose two new GNN-agnostic interventions namely, (i) PFR-AX which decreases the separability between nodes in protected and non-protected groups, and (ii) PostProcess which updates model predictions based on a blackbox policy to minimize differences between error rates across demographic groups. Through a large set of experiments on four datasets, we frame the efficacies of our approaches (and three variants) in terms of their algorithmic fairness-accuracy tradeoff and benchmark our results against three strong baseline interventions on three state-of-the-art GNN models. Our results show that no single intervention offers a universally optimal tradeoff, but PFR-AX and PostProcess provide granular control and improve model confidence when correctly predicting positive outcomes for nodes in protected groups.
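As a generic illustration of blackbox prediction post-processing of the kind PostProcess performs (a simple rate-equalizing heuristic, not the paper's actual policy), one can promote the highest-scoring negatives of the lower-rate group until positive prediction rates match:

```python
import numpy as np

def equalize_positive_rates(scores, groups, threshold=0.5):
    """Flip the most confident negatives in the lower-rate group to positives
    until the groups' positive prediction rates roughly match."""
    preds = (scores >= threshold).astype(int)
    masks = {g: groups == g for g in (0, 1)}
    rates = {g: preds[m].mean() for g, m in masks.items()}
    low = 0 if rates[0] < rates[1] else 1          # disadvantaged group
    target = max(rates.values())
    candidates = np.where(masks[low] & (preds == 0))[0]
    for i in candidates[np.argsort(-scores[candidates])]:
        if preds[masks[low]].mean() >= target:
            break
        preds[i] = 1
    return preds
```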

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

  • paper_url: http://arxiv.org/abs/2308.09583
  • repo_url: https://github.com/nlpxucan/wizardlm
  • paper_authors: Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Dongmei Zhang
  • for: Enhancing the mathematical reasoning abilities of large language models (LLMs).
  • methods: Proposes Reinforcement Learning from Evol-Instruct Feedback (RLEIF) and applies it to Llama-2 in the math domain, yielding WizardMath.
  • results: In extensive experiments on two mathematical reasoning benchmarks, GSM8k and MATH, WizardMath surpasses all other open-source LLMs by a substantial margin; it even outperforms ChatGPT-3.5, Claude Instant-1, PaLM-2 and Minerva on GSM8k, and surpasses Text-davinci-002, PaLM-1 and GPT-3 on MATH.
    Abstract Large language models (LLMs), such as GPT-4, have shown remarkable performance in natural language processing (NLP) tasks, including challenging mathematical reasoning. However, most existing open-source models are only pre-trained on large-scale internet data and without math-related optimization. In this paper, we present WizardMath, which enhances the mathematical reasoning abilities of Llama-2, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model. WizardMath surpasses all other open-source LLMs by a substantial margin. Furthermore, our model even outperforms ChatGPT-3.5, Claude Instant-1, PaLM-2 and Minerva on GSM8k, simultaneously surpasses Text-davinci-002, PaLM-1 and GPT-3 on MATH. More details and model weights are public at https://github.com/nlpxucan/WizardLM and https://huggingface.co/WizardLM.

Physics-Informed Boundary Integral Networks (PIBI-Nets): A Data-Driven Approach for Solving Partial Differential Equations

  • paper_url: http://arxiv.org/abs/2308.09571
  • repo_url: None
  • paper_authors: Monika Nagy-Huber, Volker Roth
  • for: Solving partial differential equations (PDEs) in real-world applications, especially when information about boundary or initial conditions is limited or unknown model parameters need to be identified.
  • methods: The paper proposes a data-driven approach called Physics-Informed Boundary Integral Networks (PIBI-Nets) to solve PDEs. PIBI-Nets only require collocation points at the computational domain boundary, which can reduce computational costs and achieve highly accurate results.
  • results: The paper demonstrates the excellent performance of PIBI-Nets for the Laplace and Poisson equations on both artificial data sets and a real-world application concerning the reconstruction of groundwater flows. PIBI-Nets outperform Physics-Informed Neural Networks (PINNs) in high-dimensional settings and can handle point sources in inverse problems using a principled and simple approach.
    Abstract Partial differential equations (PDEs) can describe many relevant phenomena in dynamical systems. In real-world applications, we commonly need to combine formal PDE models with (potentially noisy) observations. This is especially relevant in settings where we lack information about boundary or initial conditions, or where we need to identify unknown model parameters. In recent years, Physics-informed neural networks (PINNs) have become a popular tool for problems of this kind. In high-dimensional settings, however, PINNs often suffer from computational problems because they usually require dense collocation points over the entire computational domain. To address this problem, we present Physics-Informed Boundary Integral Networks (PIBI-Nets) as a data-driven approach for solving PDEs in one dimension less than the original problem space. PIBI-Nets only need collocation points at the computational domain boundary, while still achieving highly accurate results, and in several practical settings, they clearly outperform PINNs. Exploiting elementary properties of fundamental solutions of linear differential operators, we present a principled and simple way to handle point sources in inverse problems. We demonstrate the excellent performance of PIBI-Nets for the Laplace and Poisson equations, both on artificial data sets and within a real-world application concerning the reconstruction of groundwater flows.
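For intuition, a boundary integral method represents a harmonic solution through the fundamental solution of the Laplacian, in 2D $G(x, y) = -\frac{1}{2\pi}\ln\lVert x - y\rVert$, with unknowns living only on the boundary. A minimal sketch of evaluating such a discretized single-layer ansatz at interior points; in PIBI-Nets the density would be parameterized by a network, here it is a plain array.

```python
import numpy as np

def laplace_green_2d(x, y):
    """Fundamental solution of the 2D Laplace equation for points x [N, 2], y [M, 2]."""
    r = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    return -np.log(r) / (2.0 * np.pi)

def single_layer_potential(x_interior, y_boundary, density, weights):
    """u(x) ≈ Σ_j G(x, y_j) q(y_j) w_j over boundary collocation points y_j."""
    G = laplace_green_2d(x_interior, y_boundary)   # shape [N_interior, N_boundary]
    return G @ (density * weights)
```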

Investigating the Interplay between Features and Structures in Graph Learning

  • paper_url: http://arxiv.org/abs/2308.09570
  • repo_url: None
  • paper_authors: Daniele Castellana, Federico Errica
  • for: This paper aims to investigate the relationship between node features and target labels in deep graph networks, and to propose new metrics to measure the influence of node features on target labels.
  • methods: The paper uses two generative processes to build and study ad-hoc node classification tasks, and evaluates the performance of six models, including structure-agnostic ones.
  • results: The paper finds that previously defined metrics are not adequate when the assumption of a strong correlation between node features and target labels is relaxed, and proposes a new metric called Feature Informativeness to quantitatively measure the influence of node features on target labels.
    Abstract In the past, the dichotomy between homophily and heterophily has inspired research contributions toward a better understanding of Deep Graph Networks' inductive bias. In particular, it was believed that homophily strongly correlates with better node classification predictions of message-passing methods. More recently, however, researchers pointed out that such dichotomy is too simplistic as we can construct node classification tasks where graphs are completely heterophilic but the performances remain high. Most of these works have also proposed new quantitative metrics to understand when a graph structure is useful, which implicitly or explicitly assume the correlation between node features and target labels. Our work empirically investigates what happens when this strong assumption does not hold, by formalising two generative processes for node classification tasks that allow us to build and study ad-hoc problems. To quantitatively measure the influence of the node features on the target labels, we also use a metric we call Feature Informativeness. We construct six synthetic tasks and evaluate the performance of six models, including structure-agnostic ones. Our findings reveal that previously defined metrics are not adequate when we relax the above assumption. Our contribution to the workshop aims at presenting novel research findings that could help advance our understanding of the field.

Normalization Is All You Need: Understanding Layer-Normalized Federated Learning under Extreme Label Shift

  • paper_url: http://arxiv.org/abs/2308.09565
  • repo_url: None
  • paper_authors: Guojun Zhang, Mahdi Beitollahi, Alex Bie, Xi Chen
  • for: Understanding the role of layer normalization (LN) in federated learning (FL), in particular its surprising effectiveness on non-i.i.d. data.
  • methods: Identifies feature normalization (FN), normalization applied to the latent feature representation before the classifier head, as the key mechanism by which LN controls feature collapse and local overfitting to heavily skewed datasets, thereby accelerating global training.
  • results: Normalization yields drastic improvements on standard benchmarks under extreme label shift, and ablations show that FN keeps FL convergence robust to learning rate choices.
    Abstract Layer normalization (LN) is a widely adopted deep learning technique especially in the era of foundation models. Recently, LN has been shown to be surprisingly effective in federated learning (FL) with non-i.i.d. data. However, exactly why and how it works remains mysterious. In this work, we reveal the profound connection between layer normalization and the label shift problem in federated learning. To understand layer normalization better in FL, we identify the key contributing mechanism of normalization methods in FL, called feature normalization (FN), which applies normalization to the latent feature representation before the classifier head. Although LN and FN do not improve expressive power, they control feature collapse and local overfitting to heavily skewed datasets, and thus accelerates global training. Empirically, we show that normalization leads to drastic improvements on standard benchmarks under extreme label shift. Moreover, we conduct extensive ablation studies to understand the critical factors of layer normalization in FL. Our results verify that FN is an essential ingredient inside LN to significantly improve the convergence of FL while remaining robust to learning rate choices, especially under extreme label shift where each client has access to few classes.
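A minimal PyTorch sketch of the feature normalization (FN) placement the paper identifies, normalization applied to the penultimate features right before the classifier head (the backbone and dimensions are illustrative):

```python
import torch.nn as nn

class FNClassifier(nn.Module):
    """Backbone -> feature normalization -> linear classifier head."""
    def __init__(self, backbone: nn.Module, feat_dim: int, n_classes: int):
        super().__init__()
        self.backbone = backbone
        self.norm = nn.LayerNorm(feat_dim)  # FN: normalize the latent features
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, x):
        return self.head(self.norm(self.backbone(x)))
```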

Eigenvalue-based Incremental Spectral Clustering

  • paper_url: http://arxiv.org/abs/2308.10999
  • repo_url: None
  • paper_authors: Mieczysław A. Kłopotek, Bartłomiej Starosta, Sławomir T. Wierzchoń
  • for: clustering large datasets using incremental spectral clustering
  • methods: split the data into manageable subsets, cluster each subset, and merge clusters based on eigenvalue spectrum similarity
  • results: clustering and merging the subsets yields clusters close to clustering the entire dataset
    Abstract Our previous experiments demonstrated that collections of subsets of (short) documents (with several hundred entries each) share a common, suitably normalized eigenvalue spectrum of the combinatorial Laplacian. Based on this insight, we propose a method of incremental spectral clustering. The method consists of the following steps: (1) split the data into manageable subsets, (2) cluster each of the subsets, (3) merge clusters from different subsets based on the eigenvalue spectrum similarity to form clusters of the entire set. This method can be especially useful for clustering methods whose complexity increases strongly with the size of the data sample, as in the case of typical spectral clustering. Experiments show that clustering and merging the subsets yields clusters close to those obtained by clustering the entire dataset.
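A minimal sketch of the three-step procedure on top of scikit-learn and SciPy. The spectrum comparison below uses a plain Euclidean distance between the leading normalized-Laplacian eigenvalues; the paper's exact normalization and merging rule may differ.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel

def leading_spectrum(X, k=10):
    """Smallest k eigenvalues of the normalized Laplacian of an RBF similarity graph.
    Assumes each cluster has at least k points."""
    L = laplacian(rbf_kernel(X), normed=True)
    return np.sort(np.linalg.eigvalsh(L))[:k]

def cluster_subset(X, n_clusters):
    labels = SpectralClustering(n_clusters=n_clusters, affinity="rbf").fit_predict(X)
    return [X[labels == c] for c in range(n_clusters)]

def merge_by_spectrum(clusters, tol=0.5):
    """Greedily merge clusters from different subsets whose spectra are close."""
    merged = []
    for C in clusters:
        s = leading_spectrum(C)
        for group in merged:
            if np.linalg.norm(s - group["spectrum"]) < tol:
                group["members"].append(C)
                break
        else:
            merged.append({"spectrum": s, "members": [C]})
    return merged
```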

Attesting Distributional Properties of Training Data for Machine Learning

  • paper_url: http://arxiv.org/abs/2308.09552
  • repo_url: None
  • paper_authors: Vasisht Duddu, Anudeep Das, Nora Khayata, Hossein Yalame, Thomas Schneider, N. Asokan
  • for: Ensuring the trustworthiness of machine learning models by demonstrating desirable distributional properties of training data.
  • methods: Property inference and cryptographic mechanisms for data privacy-preserving property attestation.
  • results: An effective hybrid property attestation method for model trainers to demonstrate relevant distributional properties of training data to customers without revealing the data.
    Abstract The success of machine learning (ML) has been accompanied by increased concerns about its trustworthiness. Several jurisdictions are preparing ML regulatory frameworks. One such concern is ensuring that model training data has desirable distributional properties for certain sensitive attributes. For example, draft regulations indicate that model trainers are required to show that training datasets have specific distributional properties, such as reflecting diversity of the population. We propose the notion of property attestation allowing a prover (e.g., model trainer) to demonstrate relevant distributional properties of training data to a verifier (e.g., a customer) without revealing the data. We present an effective hybrid property attestation combining property inference with cryptographic mechanisms.

Adapt Your Teacher: Improving Knowledge Distillation for Exemplar-free Continual Learning

  • paper_url: http://arxiv.org/abs/2308.09544
  • repo_url: None
  • paper_authors: Filip Szatkowski, Mateusz Pyla, Marcin Przewięźlikowski, Sebastian Cygert, Bartłomiej Twardowski, Tomasz Trzciński
  • for: Investigating knowledge distillation (KD) as a regularization strategy for exemplar-free class incremental learning (CIL), aiming to prevent forgetting.
  • methods: Analyzes why KD-based methods struggle to regularize the model without exemplars from previous tasks, tracing the problem to substantial representation shifts in the teacher network on out-of-distribution data, which cause large errors in the KD loss component and degrade CIL performance.
  • results: Introduces Teacher Adaptation (TA), which concurrently updates the teacher network and the main model during incremental training; TA integrates seamlessly with KD-based CIL approaches and consistently improves their performance across multiple exemplar-free CIL benchmarks.
    Abstract In this work, we investigate exemplar-free class incremental learning (CIL) with knowledge distillation (KD) as a regularization strategy, aiming to prevent forgetting. KD-based methods are successfully used in CIL, but they often struggle to regularize the model without access to exemplars of the training data from previous tasks. Our analysis reveals that this issue originates from substantial representation shifts in the teacher network when dealing with out-of-distribution data. This causes large errors in the KD loss component, leading to performance degradation in CIL. Inspired by recent test-time adaptation methods, we introduce Teacher Adaptation (TA), a method that concurrently updates the teacher and the main model during incremental training. Our method seamlessly integrates with KD-based CIL approaches and allows for consistent enhancement of their performance across multiple exemplar-free CIL benchmarks.

Latent State Models of Training Dynamics

  • paper_url: http://arxiv.org/abs/2308.09543
  • repo_url: None
  • paper_authors: Michael Y. Hu, Angelica Chen, Naomi Saphra, Kyunghyun Cho
  • for: Understanding how randomness in data order and initialization manifests in model training, why some training runs outperform others or converge faster, and how to interpret the resulting training dynamics and phase transitions.
  • methods: Trains models multiple times with different random seeds, computes metrics throughout training (such as the $L_2$ norm, mean, and variance of the network's weights), and fits a hidden Markov model (HMM) over the resulting metric sequences, representing training as a stochastic process of transitions between latent states.
  • results: Produces a low-dimensional, discrete representation of training dynamics on grokking tasks, image classification, and masked language modeling, and uses it to study phase transitions and identify latent "detour" states that slow down convergence.
    Abstract The impact of randomness on model training is poorly understood. How do differences in data order and initialization actually manifest in the model, such that some training runs outperform others or converge faster? Furthermore, how can we interpret the resulting training dynamics and the phase transitions that characterize different trajectories? To understand the effect of randomness on the dynamics and outcomes of neural network training, we train models multiple times with different random seeds and compute a variety of metrics throughout training, such as the $L_2$ norm, mean, and variance of the neural network's weights. We then fit a hidden Markov model (HMM) over the resulting sequences of metrics. The HMM represents training as a stochastic process of transitions between latent states, providing an intuitive overview of significant changes during training. Using our method, we produce a low-dimensional, discrete representation of training dynamics on grokking tasks, image classification, and masked language modeling. We use the HMM representation to study phase transitions and identify latent "detour" states that slow down convergence.
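A minimal sketch of this pipeline with `hmmlearn`; the metric choice and the number of latent states are illustrative.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def run_metrics(weight_snapshots):
    """Per-checkpoint metrics: L2 norm, mean, and variance of the flattened weights."""
    return np.array([[np.linalg.norm(w), w.mean(), w.var()] for w in weight_snapshots])

def fit_training_hmm(runs, n_states=4):
    """Fit one HMM over metric sequences from several seeds.
    `runs` is a list of [T_i, n_metrics] arrays, one per training run."""
    X, lengths = np.concatenate(runs), [len(r) for r in runs]
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=100)
    hmm.fit(X, lengths)
    return hmm, [hmm.predict(r) for r in runs]  # discrete latent-state trajectories
```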

Decoupled conditional contrastive learning with variable metadata for prostate lesion detection

  • paper_url: http://arxiv.org/abs/2308.09542
  • repo_url: https://github.com/camilleruppli/decoupled_ccl
  • paper_authors: Camille Ruppli, Pietro Gori, Roberto Ardon, Isabelle Bloch
  • for: Improving the accuracy of early prostate cancer diagnosis.
  • methods: Uses multi-parametric magnetic resonance imaging (mp-MRI) for prostate lesion detection, leverages Prostate Imaging Reporting and Data System (PI-RADS) scores readily available from radiology reports, and defines metadata confidence from the variability among multiple annotators per sample.
  • results: Combining metadata of varying confidence with unannotated data in a single conditional contrastive loss function yields a 3% AUC increase for lesion detection on the public PI-CAI challenge dataset.
    Abstract Early diagnosis of prostate cancer is crucial for efficient treatment. Multi-parametric Magnetic Resonance Images (mp-MRI) are widely used for lesion detection. The Prostate Imaging Reporting and Data System (PI-RADS) has standardized interpretation of prostate MRI by defining a score for lesion malignancy. PI-RADS data is readily available from radiology reports but is subject to high inter-reports variability. We propose a new contrastive loss function that leverages weak metadata with multiple annotators per sample and takes advantage of inter-reports variability by defining metadata confidence. By combining metadata of varying confidence with unannotated data into a single conditional contrastive loss function, we report a 3% AUC increase on lesion detection on the public PI-CAI challenge dataset. Code is available at: https://github.com/camilleruppli/decoupled_ccl
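A hypothetical sketch of one way metadata confidence can enter a contrastive objective, down-weighting positive pairs whose labels the annotators agree on less. The paper's decoupled conditional formulation is more involved; here `conf` is an agreement score in [0, 1] derived from inter-report variability.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_contrastive(z, labels, conf, tau=0.1):
    """z: [N, d] L2-normalized embeddings; labels: [N] weak metadata labels;
    conf: [N] confidence in each sample's label."""
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float("-inf"))            # exclude self-similarity
    log_p = F.log_softmax(sim, dim=1)
    pos = (labels[:, None] == labels[None, :]).float()
    pos.fill_diagonal_(0.0)
    w = pos * conf[:, None] * conf[None, :]      # soften uncertain positive pairs
    denom = w.sum(dim=1).clamp(min=1e-8)
    return -(w * log_p).sum(dim=1).div(denom).mean()
```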

Privacy-Preserving 3-Layer Neural Network Training using Mere Homomorphic Encryption Technique

  • paper_url: http://arxiv.org/abs/2308.09531
  • repo_url: None
  • paper_authors: John Chiang
  • for: Privacy-preserving training of neural networks in the mere homomorphic encryption setting.
  • methods: Combines several existing techniques, extends some of them, and ultimately enables the training of 3-layer neural networks using only homomorphic encryption.
  • results: Demonstrates privacy-preserving training of 3-layer neural networks for both regression and classification problems using the mere homomorphic encryption technique.
    Abstract In this manuscript, we consider the problem of privacy-preserving training of neural networks in the mere homomorphic encryption setting. We combine several existing techniques available, extend some of them, and finally enable the training of 3-layer neural networks for both the regression and classification problems using the mere homomorphic encryption technique.
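As a small illustration of the primitive such training is built from, arithmetic over encrypted vectors, here is a sketch using the TenSEAL library's CKKS scheme. The parameters are typical example values, and the square activation is one common HE-friendly choice, not necessarily the paper's.

```python
import tenseal as ts

# CKKS context with example parameters (precision/security tradeoffs apply).
ctx = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                 coeff_mod_bit_sizes=[60, 40, 40, 60])
ctx.global_scale = 2 ** 40
ctx.generate_galois_keys()

x = ts.ckks_vector(ctx, [0.5, -1.2, 3.0])  # encrypted input features
w = [0.1, 0.2, -0.3]                       # plaintext weights of one neuron
pre = x.dot(w)                             # homomorphic dot product
act = pre * pre                            # polynomial (square) activation under HE
print(act.decrypt())                       # decryption happens only at the data owner
```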

Transitivity-Preserving Graph Representation Learning for Bridging Local Connectivity and Role-based Similarity

  • paper_url: http://arxiv.org/abs/2308.09517
  • repo_url: https://github.com/nslab-cuk/unified-graph-transformer
  • paper_authors: Van Thuy Hoang, O-Joun Lee
  • for: This paper aims to improve graph representation learning methods by integrating local and global structural information into fixed-length vector representations.
  • methods: The proposed Unified Graph Transformer Networks (UGT) learn local structure by identifying local substructures and aggregating features of the $k$-hop neighborhoods of each node, and construct virtual edges to capture long-range dependencies. UGT also learns unified representations through self-attention, encoding structural distance and $p$-step transition probability between node pairs.
  • results: The proposed method significantly outperformed baselines that consist of state-of-the-art models on real-world benchmark datasets over various downstream tasks, and reached the expressive power of the third-order Weisfeiler-Lehman isomorphism test (3d-WL) in distinguishing non-isomorphic graph pairs.
    Abstract Graph representation learning (GRL) methods, such as graph neural networks and graph transformer models, have been successfully used to analyze graph-structured data, mainly focusing on node classification and link prediction tasks. However, the existing studies mostly only consider local connectivity while ignoring long-range connectivity and the roles of nodes. In this paper, we propose Unified Graph Transformer Networks (UGT) that effectively integrate local and global structural information into fixed-length vector representations. First, UGT learns local structure by identifying the local substructures and aggregating features of the $k$-hop neighborhoods of each node. Second, we construct virtual edges, bridging distant nodes with structural similarity to capture the long-range dependencies. Third, UGT learns unified representations through self-attention, encoding structural distance and $p$-step transition probability between node pairs. Furthermore, we propose a self-supervised learning task that effectively learns transition probability to fuse local and global structural features, which could then be transferred to other downstream tasks. Experimental results on real-world benchmark datasets over various downstream tasks showed that UGT significantly outperformed baselines that consist of state-of-the-art models. In addition, UGT reaches the expressive power of the third-order Weisfeiler-Lehman isomorphism test (3d-WL) in distinguishing non-isomorphic graph pairs. The source code is available at https://github.com/NSLab-CUK/Unified-Graph-Transformer.

Spatial LibriSpeech: An Augmented Dataset for Spatial Audio Learning

  • paper_url: http://arxiv.org/abs/2308.09514
  • repo_url: https://github.com/apple/ml-spatial-librispeech
  • paper_authors: Miguel Sarabia, Elena Menyaylenko, Alessandro Toso, Skyler Seto, Zakaria Aldeneh, Shadi Pirhosseinloo, Luca Zappella, Barry-John Theobald, Nicholas Apostoloff, Jonathan Sheaffer
  • for: Training machine learning models on spatial audio tasks.
  • methods: Augments LibriSpeech samples with 200k+ simulated acoustic conditions across 8k+ synthetic rooms, producing over 650 hours of 19-channel audio with first-order ambisonics, optional distractor noise, and labels for source position, speaking direction, room acoustics, and geometry.
  • results: Models trained on four spatial audio tasks reach a median absolute error of 6.60° on 3D source localization, 0.43 m on distance, 90.66 ms on T30, and 2.74 dB on DRR estimation, and generalize well to widely-used evaluation datasets.
    Abstract We present Spatial LibriSpeech, a spatial audio dataset with over 650 hours of 19-channel audio, first-order ambisonics, and optional distractor noise. Spatial LibriSpeech is designed for machine learning model training, and it includes labels for source position, speaking direction, room acoustics and geometry. Spatial LibriSpeech is generated by augmenting LibriSpeech samples with 200k+ simulated acoustic conditions across 8k+ synthetic rooms. To demonstrate the utility of our dataset, we train models on four spatial audio tasks, resulting in a median absolute error of 6.60° on 3D source localization, 0.43m on distance, 90.66ms on T30, and 2.74dB on DRR estimation. We show that the same models generalize well to widely-used evaluation datasets, e.g., obtaining a median absolute error of 12.43° on 3D source localization on TUT Sound Events 2018, and 157.32ms on T30 estimation on ACE Challenge.

Bridged-GNN: Knowledge Bridge Learning for Effective Knowledge Transfer

  • paper_url: http://arxiv.org/abs/2308.09499
  • repo_url: None
  • paper_authors: Wendong Bi, Xueqi Cheng, Bingbing Xu, Xiaoqian Sun, Li Xu, Huawei Shen
  • for: Addressing the data-hungry problem, characterized by insufficient and low-quality data, by transferring knowledge from high-quality external source domains to improve deep learning performance on target domains.
  • methods: Knowledge Bridge Learning (KBL) first learns the scope of knowledge transfer by constructing a Bridged-Graph that connects knowledgeable samples to each target sample, then performs sample-wise knowledge transfer via GNNs; the proposed Bridged-GNN includes an Adaptive Knowledge Retrieval module to build the Bridged-Graph and a Graph Knowledge Transfer module.
  • results: Comprehensive experiments on both un-relational and relational data-hungry scenarios demonstrate significant improvements of Bridged-GNN over state-of-the-art methods.
    Abstract The data-hungry problem, characterized by insufficiency and low-quality of data, poses obstacles for deep learning models. Transfer learning has been a feasible way to transfer knowledge from high-quality external data of source domains to limited data of target domains, which follows a domain-level knowledge transfer to learn a shared posterior distribution. However, they are usually built on strong assumptions, e.g., the domain invariant posterior distribution, which is usually unsatisfied and may introduce noises, resulting in poor generalization ability on target domains. Inspired by Graph Neural Networks (GNNs) that aggregate information from neighboring nodes, we redefine the paradigm as learning a knowledge-enhanced posterior distribution for target domains, namely Knowledge Bridge Learning (KBL). KBL first learns the scope of knowledge transfer by constructing a Bridged-Graph that connects knowledgeable samples to each target sample and then performs sample-wise knowledge transfer via GNNs.KBL is free from strong assumptions and is robust to noises in the source data. Guided by KBL, we propose the Bridged-GNN} including an Adaptive Knowledge Retrieval module to build Bridged-Graph and a Graph Knowledge Transfer module. Comprehensive experiments on both un-relational and relational data-hungry scenarios demonstrate the significant improvements of Bridged-GNN compared with SOTA methods

Predictive Authoring for Brazilian Portuguese Augmentative and Alternative Communication

  • paper_url: http://arxiv.org/abs/2308.09497
  • repo_url: https://github.com/jayralencar/pictogram_prediction_pt
  • paper_authors: Jayr Pereira, Rodrigo Nogueira, Cleber Zanchettin, Robson Fidalgo
  • For: This paper proposes using a BERT-like model for pictogram prediction in augmentative and alternative communication (AAC) systems, to make message authoring more efficient for individuals with complex communication needs.
  • Methods: The authors finetune BERTimbau, a Brazilian Portuguese version of BERT, on an AAC corpus constructed for Brazilian Portuguese, and test different ways of representing a pictogram for prediction: as a word (using pictogram captions), as a concept (using a dictionary definition), and as a set of synonyms (using related terms). They also evaluate the use of images for pictogram prediction.
  • Results: Embeddings computed from the pictograms' captions, synonyms, or definitions perform similarly; using synonyms leads to lower perplexity, but using captions leads to the highest accuracies.
    Abstract Individuals with complex communication needs (CCN) often rely on augmentative and alternative communication (AAC) systems to have conversations and communicate their wants. Such systems allow message authoring by arranging pictograms in sequence. However, the difficulty of finding the desired item to complete a sentence can increase as the user's vocabulary increases. This paper proposes using BERTimbau, a Brazilian Portuguese version of BERT, for pictogram prediction in AAC systems. To finetune BERTimbau, we constructed an AAC corpus for Brazilian Portuguese to use as a training corpus. We tested different approaches to representing a pictogram for prediction: as a word (using pictogram captions), as a concept (using a dictionary definition), and as a set of synonyms (using related terms). We also evaluated the usage of images for pictogram prediction. The results demonstrate that using embeddings computed from the pictograms' caption, synonyms, or definitions have a similar performance. Using synonyms leads to lower perplexity, but using captions leads to the highest accuracies. This paper provides insight into how to represent a pictogram for prediction using a BERT-like model and the potential of using images for pictogram prediction.
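A minimal sketch of masked-token prediction with BERTimbau through Hugging Face `transformers`. The checkpoint name `neuralmind/bert-base-portuguese-cased` is the commonly used BERTimbau release; the paper's AAC-finetuned model lives in the linked repository.

```python
from transformers import pipeline

# Base BERTimbau checkpoint; the paper finetunes such a model on an AAC corpus.
fill = pipeline("fill-mask", model="neuralmind/bert-base-portuguese-cased")

# Rank candidate next pictogram words for a partially authored sentence.
for pred in fill("eu quero comer [MASK]")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```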

Balancing Transparency and Risk: The Security and Privacy Risks of Open-Source Machine Learning Models

  • paper_url: http://arxiv.org/abs/2308.09490
  • repo_url: None
  • paper_authors: Dominik Hintersdorf, Lukas Struppek, Kristian Kersting
  • for: Raising awareness of the privacy and security risks inherent in using open-source machine learning models.
  • methods: Presents a comprehensive overview of common privacy and security threats associated with open-source models, in order to promote the responsible and secure use of AI systems.
  • results: Highlights concrete attack scenarios, such as an inconspicuous model concealing hidden functionalities that, when triggered by specific input patterns, manipulate system behavior, with consequences ranging from service interruptions to physical harm or exposure of sensitive user data.
    Abstract The field of artificial intelligence (AI) has experienced remarkable progress in recent years, driven by the widespread adoption of open-source machine learning models in both research and industry. Considering the resource-intensive nature of training on vast datasets, many applications opt for models that have already been trained. Hence, a small number of key players undertake the responsibility of training and publicly releasing large pre-trained models, providing a crucial foundation for a wide range of applications. However, the adoption of these open-source models carries inherent privacy and security risks that are often overlooked. To provide a concrete example, an inconspicuous model may conceal hidden functionalities that, when triggered by specific input patterns, can manipulate the behavior of the system, such as instructing self-driving cars to ignore the presence of other vehicles. The implications of successful privacy and security attacks encompass a broad spectrum, ranging from relatively minor damage like service interruptions to highly alarming scenarios, including physical harm or the exposure of sensitive user data. In this work, we present a comprehensive overview of common privacy and security threats associated with the use of open-source models. By raising awareness of these dangers, we strive to promote the responsible and secure use of AI systems.

RBA-GCN: Relational Bilevel Aggregation Graph Convolutional Network for Emotion Recognition

  • paper_url: http://arxiv.org/abs/2308.11029
  • repo_url: https://github.com/luftmenscher/RBA-GCN
  • paper_authors: Lin Yuan, Guoheng Huang, Fenghuan Li, Xiaochen Yuan, Chi-Man Pun, Guo Zhong
  • for: Improving emotion recognition in conversation (ERC) with graph convolutional networks (GCNs), addressing the node information redundancy of traditional GCN aggregation, the inability of single-layer GCNs to capture long-range contextual information, and the weak modeling of interactions between modalities.
  • methods: Proposes the relational bilevel aggregation graph convolutional network (RBA-GCN), consisting of a graph generation module (GGM), a similarity-based cluster building module (SCBM), and a bilevel aggregation module (BiAM).
  • results: On the IEMOCAP and MELD datasets, the weighted average F1 score of RBA-GCN improves by 2.17%-5.21% over the most advanced method.
    Abstract Emotion recognition in conversation (ERC) has received increasing attention from researchers due to its wide range of applications. As conversation has a natural graph structure, numerous approaches used to model ERC based on graph convolutional networks (GCNs) have yielded significant results. However, the aggregation approach of traditional GCNs suffers from the node information redundancy problem, leading to node discriminant information loss. Additionally, single-layer GCNs lack the capacity to capture long-range contextual information from the graph. Furthermore, the majority of approaches are based on textual modality or stitching together different modalities, resulting in a weak ability to capture interactions between modalities. To address these problems, we present the relational bilevel aggregation graph convolutional network (RBA-GCN), which consists of three modules: the graph generation module (GGM), similarity-based cluster building module (SCBM) and bilevel aggregation module (BiAM). First, GGM constructs a novel graph to reduce the redundancy of target node information. Then, SCBM calculates the node similarity in the target node and its structural neighborhood, where noisy information with low similarity is filtered out to preserve the discriminant information of the node. Meanwhile, BiAM is a novel aggregation method that can preserve the information of nodes during the aggregation process. This module can construct the interaction between different modalities and capture long-range contextual information based on similarity clusters. On both the IEMOCAP and MELD datasets, the weighted average F1 score of RBA-GCN has a 2.17$\sim$5.21\% improvement over that of the most advanced method.

Data augmentation and explainability for bias discovery and mitigation in deep learning

  • paper_url: http://arxiv.org/abs/2308.09464
  • repo_url: None
  • paper_authors: Agnieszka Mikołajczyk-Bareła
  • for: Exploring the impact of bias in deep neural networks and presenting methods for reducing its influence on model performance.
  • methods: Proposes three mitigation approaches, Style Transfer Data Augmentation, Targeted Data Augmentations, and Attribution Feedback, complemented by a semi-automatic Global Explanation for the Bias Identification method for discovering potential biases in data.
  • results: The experiments indicate that these approaches reduce the influence of bias on deep neural networks and improve model accuracy.
    Abstract This dissertation explores the impact of bias in deep neural networks and presents methods for reducing its influence on model performance. The first part begins by categorizing and describing potential sources of bias and errors in data and models, with a particular focus on bias in machine learning pipelines. The next chapter outlines a taxonomy and methods of Explainable AI as a way to justify predictions and control and improve the model. Then, as an example of a laborious manual data inspection and bias discovery process, a skin lesion dataset is manually examined. A Global Explanation for the Bias Identification method is proposed as an alternative semi-automatic approach to manual data exploration for discovering potential biases in data. Relevant numerical methods and metrics are discussed for assessing the effects of the identified biases on the model. Whereas identifying errors and bias is critical, improving the model and reducing the number of flaws in the future is an absolute priority. Hence, the second part of the thesis focuses on mitigating the influence of bias on ML models. Three approaches are proposed and discussed: Style Transfer Data Augmentation, Targeted Data Augmentations, and Attribution Feedback. Style Transfer Data Augmentation aims to address shape and texture bias by merging a style of a malignant lesion with a conflicting shape of a benign one. Targeted Data Augmentations randomly insert possible biases into all images in the dataset during the training, as a way to make the process random and, thus, destroy spurious correlations. Lastly, Attribution Feedback is used to fine-tune the model to improve its accuracy by eliminating obvious mistakes and teaching it to ignore insignificant input parts via an attribution loss. The goal of these approaches is to reduce the influence of bias on machine learning models, rather than eliminate it entirely.
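As a toy illustration of the Targeted Data Augmentations idea, the sketch below randomly paints a known bias artifact (here a hypothetical black frame) onto training images independently of their label, so the artifact can no longer act as a spurious cue; this is a minimal sketch under our own assumptions, not the thesis implementation.

```python
import numpy as np

def targeted_augmentation(image, p=0.5, frame_width=4):
    """With probability p, paint a black frame (a stand-in for a known bias
    artifact) onto the image, independently of its label, to break the
    spurious correlation between artifact and class."""
    if np.random.rand() < p:
        image = image.copy()
        image[:frame_width, :] = 0
        image[-frame_width:, :] = 0
        image[:, :frame_width] = 0
        image[:, -frame_width:] = 0
    return image

img = np.random.rand(64, 64)
aug = targeted_augmentation(img)
print(aug.shape)  # (64, 64)
```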

Reconstructing $S$-matrix Phases with Machine Learning

  • paper_url: http://arxiv.org/abs/2308.09451
  • repo_url: None
  • paper_authors: Aurélien Dersy, Matthew D. Schwartz, Alexander Zhiboedov
  • for: This paper studies the relationship between the modulus and the phase of an $S$-matrix element, a central ingredient of the $S$-matrix bootstrap program.
  • methods: The authors apply modern machine-learning techniques to the unitarity constraint. For a given modulus, a phase (when one exists) can generally be reconstructed to good accuracy, and the loss of the reconstruction algorithm is a reliable proxy for whether a given modulus can be consistent with unitarity at all.
  • results: Multiple phases can be consistent with a single modulus, yielding novel phase-ambiguous solutions; in particular, a new phase-ambiguous solution pushes the known limit on such solutions significantly beyond the previous bound.
    Abstract An important element of the $S$-matrix bootstrap program is the relationship between the modulus of an $S$-matrix element and its phase. Unitarity relates them by an integral equation. Even in the simplest case of elastic scattering, this integral equation cannot be solved analytically and numerical approaches are required. We apply modern machine learning techniques to studying the unitarity constraint. We find that for a given modulus, when a phase exists it can generally be reconstructed to good accuracy with machine learning. Moreover, the loss of the reconstruction algorithm provides a good proxy for whether a given modulus can be consistent with unitarity at all. In addition, we study the question of whether multiple phases can be consistent with a single modulus, finding novel phase-ambiguous solutions. In particular, we find a new phase-ambiguous solution which pushes the known limit on such solutions significantly beyond the previous bound.
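For orientation, the elastic unitarity constraint the paper studies can be written schematically (our notation, not quoted from the paper): for an amplitude $f(\theta) = g(\theta)e^{i\phi(\theta)}$ with modulus $g$ and phase $\phi$,

$$\operatorname{Im} f(\theta) = \frac{k}{4\pi}\int d\Omega'\, f(\theta')\,f^{*}(\theta'') \;\;\Longrightarrow\;\; g(\theta)\sin\phi(\theta) = \frac{k}{4\pi}\int d\Omega'\, g(\theta')\,g(\theta'')\cos\!\big(\phi(\theta')-\phi(\theta'')\big),$$

where $\theta'$ and $\theta''$ denote the angles of the intermediate-state configuration. For a given $g$, this is the integral equation for $\phi$ that cannot be solved analytically and must be attacked numerically.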

Logistics Hub Location Optimization: A K-Means and P-Median Model Hybrid Approach Using Road Network Distances

  • paper_url: http://arxiv.org/abs/2308.11038
  • repo_url: None
  • paper_authors: Muhammad Abdul Rahman, Muhammad Aamir Basheer, Zubair Khalid, Muhammad Tahir, Momin Uppal
  • for: Optimizing the placement of logistics hubs to improve the efficiency and environmental footprint of the e-commerce industry.
  • methods: Delivery points are clustered by spatial location with K-Means using road-network distances, after which a weighted P-Median model, incorporating delivery counts and population as weights, determines the optimal hub locations.
  • results: Serving deliveries from the optimized hub locations saves 815 meters (10%) per delivery.
    Abstract Logistic hubs play a pivotal role in the last-mile delivery distance; even a slight increment in distance negatively impacts the business of the e-commerce industry while also increasing its carbon footprint. The growth of this industry, particularly after Covid-19, has further intensified the need for optimized allocation of resources in an urban environment. In this study, we use a hybrid approach to optimize the placement of logistic hubs. The approach sequentially employs different techniques. Initially, delivery points are clustered using K-Means in relation to their spatial locations. The clustering method utilizes road network distances as opposed to Euclidean distances. Non-road network-based approaches have been avoided since they lead to erroneous and misleading results. Finally, hubs are located using the P-Median method. The P-Median method also incorporates the number of deliveries and population as weights. Real-world delivery data from Muller and Phipps (M&P) is used to demonstrate the effectiveness of the approach. Serving deliveries from the optimal hub locations results in the saving of 815 (10%) meters per delivery.
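A toy sketch of the two-stage pipeline (K-Means clustering followed by a weighted 1-median per cluster). Plain Euclidean distances stand in here for the road-network distances the paper insists on, and all names and numbers are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.uniform(0, 10, size=(200, 2))     # delivery locations
weights = rng.integers(1, 20, size=200)        # deliveries per location

# Stage 1: cluster delivery points (the paper clusters on road-network distances).
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(points)

# Stage 2: weighted 1-median per cluster -- pick the delivery point that
# minimizes the weighted sum of distances to all points in its cluster.
hubs = []
for c in range(5):
    idx = np.where(labels == c)[0]
    d = np.linalg.norm(points[idx, None, :] - points[None, idx, :], axis=-1)
    cost = (d * weights[idx][None, :]).sum(axis=1)
    hubs.append(points[idx[np.argmin(cost)]])
print(np.array(hubs))
```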

Defending Label Inference Attacks in Split Learning under Regression Setting

  • paper_url: http://arxiv.org/abs/2308.09448
  • repo_url: None
  • paper_authors: Haoze Qiu, Fei Zheng, Chaochao Chen, Xiaolin Zheng
  • for: Defending against label inference attacks in Split Learning under the regression setting, where such attacks are mainly carried out via gradient inversion.
  • methods: Two defenses are proposed: Random Label Extension (RLE), which extends the labels so that the label information carried by gradients is obfuscated and cannot be used to train an attack model, and Model-based adaptive Label Extension (MLE), which preserves the original labels within the extended labels so that they dominate training.
  • results: Experiments show that, compared with basic defense methods, the proposed defenses significantly degrade the attack model's performance while preserving performance on the original task.
    Abstract As a privacy-preserving method for implementing Vertical Federated Learning, Split Learning has been extensively researched. However, numerous studies have indicated that the privacy-preserving capability of Split Learning is insufficient. In this paper, we primarily focus on label inference attacks in Split Learning under regression setting, which are mainly implemented through the gradient inversion method. To defend against label inference attacks, we propose Random Label Extension (RLE), where labels are extended to obfuscate the label information contained in the gradients, thereby preventing the attacker from utilizing gradients to train an attack model that can infer the original labels. To further minimize the impact on the original task, we propose Model-based adaptive Label Extension (MLE), where original labels are preserved in the extended labels and dominate the training process. The experimental results show that compared to the basic defense methods, our proposed defense methods can significantly reduce the attack model's performance while preserving the original task's performance.
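One plausible reading of Random Label Extension, sketched below: each scalar regression label is padded with random components and the model is trained against the extended vector, so the gradients exchanged in split learning no longer expose the label directly. This is our own speculative rendering of the abstract, not the authors' exact construction.

```python
import numpy as np

def random_label_extension(y, k=4, scale=1.0):
    """Extend a scalar label y into a (k+1)-vector: the true label plus k
    random components, obscuring the label signal carried by gradients."""
    return np.concatenate([[y], scale * np.random.randn(k)])

y_ext = random_label_extension(3.7)
print(y_ext.shape)  # (5,)
```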

An Efficient 1 Iteration Learning Algorithm for Gaussian Mixture Model And Gaussian Mixture Embedding For Neural Network

  • paper_url: http://arxiv.org/abs/2308.09444
  • repo_url: None
  • paper_authors: Weiguo Lu, Xuan Wu, Deng Ding, Gangnan Yuan
  • for: Proposing a Gaussian Mixture Model (GMM) learning algorithm based on the authors' earlier GMM-expansion idea.
  • methods: The new algorithm is more robust and simpler than classic Expectation Maximization (EM) and learns in a single iteration; a proof guarantees convergence regardless of parameter initialization.
  • results: The GMM-expansion method is more robust and accurate than classical EM and copes better with data uncertainty and inverse problems; a GMM-based generator is also tested, showing potential for applications that exploit distribution sampling for stochastic variation and variation control.
    Abstract We propose a Gaussian Mixture Model (GMM) learning algorithm based on our previous work on the GMM expansion idea. The new algorithm brings more robustness and simplicity than the classic Expectation Maximization (EM) algorithm. It also improves accuracy and takes only one iteration to learn. We theoretically prove that this new algorithm is guaranteed to converge regardless of parameter initialisation. Comparing our GMM expansion method with classic probability layers in neural networks demonstrates a better capability to overcome data uncertainty and inverse problems. Finally, we test a GMM-based generator, which shows the potential to build further applications that utilise distribution random sampling for stochastic variation as well as variation control.
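For contrast, here is a compact sketch of a single iteration of the classical EM baseline that the paper compares against (standard textbook EM for a 1-D mixture, not the paper's one-iteration algorithm):

```python
import numpy as np
from scipy.stats import norm

def em_step(x, weights, means, stds):
    """One EM iteration for a 1-D Gaussian mixture."""
    # E-step: responsibilities r[i, k] = P(component k | x_i)
    r = np.stack([w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, stds)], axis=1)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities
    nk = r.sum(axis=0)
    weights = nk / len(x)
    means = (r * x[:, None]).sum(axis=0) / nk
    stds = np.sqrt((r * (x[:, None] - means) ** 2).sum(axis=0) / nk)
    return weights, means, stds

x = np.concatenate([np.random.normal(-2, 1, 500), np.random.normal(3, 0.5, 500)])
params = (np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0]))
for _ in range(20):
    params = em_step(x, *params)
print(params[1])  # means approach [-2, 3]
```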

From Hope to Safety: Unlearning Biases of Deep Models by Enforcing the Right Reasons in Latent Space

  • paper_url: http://arxiv.org/abs/2308.09437
  • repo_url: None
  • paper_authors: Maximilian Dreyer, Frederik Pahde, Christopher J. Anders, Wojciech Samek, Sebastian Lapuschkin
  • for: The paper targets deep neural networks that are prone to learning spurious correlations and biases, with a focus on high-stakes decision-making such as medical applications.
  • methods: The paper proposes a novel method for mitigating biases in deep neural networks by reducing the model’s sensitivity towards biases through the gradient. The method uses Concept Activation Vectors to model biases and highlights the importance of choosing robust directions.
  • results: The paper demonstrates the effectiveness of the proposed method in controlling biases in controlled and real-world settings on several datasets, including ISIC, Bone Age, ImageNet, and CelebA, using VGG, ResNet, and EfficientNet architectures.
    Abstract Deep Neural Networks are prone to learning spurious correlations embedded in the training data, leading to potentially biased predictions. This poses risks when deploying these models for high-stake decision-making, such as in medical applications. Current methods for post-hoc model correction either require input-level annotations, which are only possible for spatially localized biases, or augment the latent feature space, thereby hoping to enforce the right reasons. We present a novel method ensuring the right reasons on the concept level by reducing the model's sensitivity towards biases through the gradient. When modeling biases via Concept Activation Vectors, we highlight the importance of choosing robust directions, as traditional regression-based approaches such as Support Vector Machines tend to result in diverging directions. We effectively mitigate biases in controlled and real-world settings on the ISIC, Bone Age, ImageNet and CelebA datasets using VGG, ResNet and EfficientNet architectures.
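A schematic of the core idea, penalizing the model's gradient sensitivity along a bias direction modeled as a Concept Activation Vector. This is a sketch under our own naming; the paper's exact loss may differ.

```python
import torch

def cav_sensitivity_penalty(latent, task_loss, cav, lam=1.0):
    """Add a penalty on the gradient of the task loss w.r.t. the latent
    activations, projected onto the (unit-norm) bias direction `cav`."""
    grad = torch.autograd.grad(task_loss, latent, create_graph=True)[0]  # (B, D)
    sensitivity = (grad @ cav) ** 2                                      # per-sample projection
    return task_loss + lam * sensitivity.mean()

# Toy usage with a hypothetical latent layer output.
latent = torch.randn(8, 32, requires_grad=True)
logits = latent.sum(dim=1)
task_loss = ((logits - 1.0) ** 2).mean()
cav = torch.nn.functional.normalize(torch.randn(32), dim=0)
total = cav_sensitivity_penalty(latent, task_loss, cav)
total.backward()
```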

Can ultrasound confidence maps predict sonographers’ labeling variability?

  • paper_url: http://arxiv.org/abs/2308.09433
  • repo_url: None
  • paper_authors: Vanessa Gonzalez Duque, Leonhard Zirus, Yordanka Velikova, Nassir Navab, Diana Mateus
  • for: Proposing a way to guide deep-learning segmentation so that clinicians can delineate anatomical structures in ultrasound images more reliably.
  • methods: Ultrasound confidence maps are supplied to deep segmentation networks, either as a second input channel or as part of the loss, to help the networks account for sonographers' uncertainty.
  • results: Confidence maps improve segmentation accuracy (Dice score, Hausdorff and Average Surface Distances), reduce the number of isolated pixel predictions, and better penalize uncertain regions of the ground truth.
    Abstract Measuring cross-sectional areas in ultrasound images is a standard tool to evaluate disease progress or treatment response. Often addressed today with supervised deep-learning segmentation approaches, existing solutions highly depend upon the quality of experts' annotations. However, the annotation quality in ultrasound is anisotropic and position-variant due to the inherent physical imaging principles, including attenuation, shadows, and missing boundaries, commonly exacerbated with depth. This work proposes a novel approach that guides ultrasound segmentation networks to account for sonographers' uncertainties and generate predictions with variability similar to the experts. We claim that realistic variability can reduce overconfident predictions and improve physicians' acceptance of deep-learning cross-sectional segmentation solutions. Our method provides CM's certainty for each pixel for minimal computational overhead as it can be precalculated directly from the image. We show that there is a correlation between low values in the confidence maps and expert's label uncertainty. Therefore, we propose to give the confidence maps as additional information to the networks. We study the effect of the proposed use of ultrasound CMs in combination with four state-of-the-art neural networks and in two configurations: as a second input channel and as part of the loss. We evaluate our method on 3D ultrasound datasets of the thyroid and lower limb muscles. Our results show ultrasound CMs increase the Dice score, improve the Hausdorff and Average Surface Distances, and decrease the number of isolated pixel predictions. Furthermore, our findings suggest that ultrasound CMs improve the penalization of uncertain areas in the ground truth data, thereby improving problematic interpolations. Our code and example data will be made public at https://github.com/IFL-CAMP/Confidence-segmentation.
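A minimal sketch of the "second input channel" configuration, concatenating a precomputed confidence map with the B-mode image before the segmentation network; the network below is a placeholder, not one of the four architectures evaluated in the paper.

```python
import torch
import torch.nn as nn

# Placeholder segmentation network taking 2 input channels (image + CM).
seg_net = nn.Sequential(
    nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 1),
)

image = torch.rand(1, 1, 128, 128)       # B-mode ultrasound image
confidence = torch.rand(1, 1, 128, 128)  # precomputed confidence map (CM)

logits = seg_net(torch.cat([image, confidence], dim=1))
print(logits.shape)  # torch.Size([1, 1, 128, 128])
```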

End-to-end topographic networks as models of cortical map formation and human visual behaviour: moving beyond convolutions

  • paper_url: http://arxiv.org/abs/2308.09431
  • repo_url: None
  • paper_authors: Zejin Lu, Adrien Doerig, Victoria Bosch, Bas Krahmer, Daniel Kaiser, Radoslaw M Cichy, Tim C Kietzmann
  • for: Understanding the origin and function of the topographic organisation of the primate visual system through computational models.
  • methods: All-Topographic Neural Networks (All-TNNs) are trained on visual input; several features of primate topography emerge, such as smooth orientation maps and cortical magnification in the first layer and category-selective areas in the final layer.
  • results: On a novel dataset of human spatial biases in object recognition, All-TNNs align with human behaviour significantly better than previous state-of-the-art convolutional models, owing to their topographic nature.
    Abstract Computational models are an essential tool for understanding the origin and functions of the topographic organisation of the primate visual system. Yet, vision is most commonly modelled by convolutional neural networks that ignore topography by learning identical features across space. Here, we overcome this limitation by developing All-Topographic Neural Networks (All-TNNs). Trained on visual input, several features of primate topography emerge in All-TNNs: smooth orientation maps and cortical magnification in their first layer, and category-selective areas in their final layer. In addition, we introduce a novel dataset of human spatial biases in object recognition, which enables us to directly link models to behaviour. We demonstrate that All-TNNs significantly better align with human behaviour than previous state-of-the-art convolutional models due to their topographic nature. All-TNNs thereby mark an important step forward in understanding the spatial organisation of the visual brain and how it mediates visual behaviour.

Towards Understanding the Generalizability of Delayed Stochastic Gradient Descent

  • paper_url: http://arxiv.org/abs/2308.09430
  • repo_url: None
  • paper_authors: Xiaoge Deng, Li Shen, Shengwei Li, Tao Sun, Dongsheng Li, Dacheng Tao
  • for: Studying the generalization performance of asynchronous delayed stochastic gradient descent (ASGD) when training large-scale machine learning models.
  • methods: Using generating-function analysis, the authors establish the average stability of the delayed gradient algorithm and, based on this stability, derive upper bounds on the generalization error of asynchronous delayed SGD.
  • results: The theory indicates that asynchronous delays reduce the generalization error of delayed SGD; an analogous analysis extends to the random-delay setting, and experiments validate the theoretical findings.
    Abstract Stochastic gradient descent (SGD) performed in an asynchronous manner plays a crucial role in training large-scale machine learning models. However, the generalization performance of asynchronous delayed SGD, which is an essential metric for assessing machine learning algorithms, has rarely been explored. Existing generalization error bounds are rather pessimistic and cannot reveal the correlation between asynchronous delays and generalization. In this paper, we investigate sharper generalization error bounds for SGD with asynchronous delay $\tau$. Leveraging the generating function analysis tool, we first establish the average stability of the delayed gradient algorithm. Based on this algorithmic stability, we provide upper bounds on the generalization error of $\tilde{\mathcal{O}}(\frac{T-\tau}{n\tau})$ and $\tilde{\mathcal{O}}(\frac{1}{n})$ for quadratic convex and strongly convex problems, respectively, where $T$ refers to the iteration number and $n$ is the amount of training data. Our theoretical results indicate that asynchronous delays reduce the generalization error of the delayed SGD algorithm. Analogous analysis can be generalized to the random delay setting, and the experimental results validate our theoretical findings.
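A minimal simulation of the update being analyzed, $w_{t+1} = w_t - \eta\,\nabla f(w_{t-\tau})$, i.e. SGD where each step uses a gradient that is $\tau$ steps stale (toy quadratic objective and parameters of our own choosing):

```python
from collections import deque
import numpy as np

def delayed_sgd(grad_fn, w0, lr=0.05, tau=3, steps=200):
    """SGD where each update applies the gradient evaluated tau steps ago."""
    w = np.array(w0, dtype=float)
    stale = deque([grad_fn(w)] * (tau + 1), maxlen=tau + 1)
    for _ in range(steps):
        g = stale.popleft()          # gradient computed tau steps earlier
        stale.append(grad_fn(w))     # enqueue the current gradient
        w -= lr * g
    return w

grad = lambda w: 2 * w               # gradient of f(w) = ||w||^2
print(delayed_sgd(grad, [5.0, -3.0]))  # converges toward the origin
```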

Self-Supervised Single-Image Deconvolution with Siamese Neural Networks

  • paper_url: http://arxiv.org/abs/2308.09426
  • repo_url: None
  • paper_authors: Mikhail Papkov, Kaupo Palo, Leopold Parts
  • for: Image reconstruction from noisy observations, specifically in 3D microscopy deconvolution tasks.
  • methods: Deep learning methods with a Siamese invariance loss and Fast Fourier Transform (FFT) convolutions, which improve upon previous state-of-the-art deconvolution methods with a known point spread function.
  • results: Outperformance of the improved framework compared to previous state-of-the-art deconvolution methods, with improved sharpness and reduced grain.
    Abstract Inverse problems in image reconstruction are fundamentally complicated by unknown noise properties. Classical iterative deconvolution approaches amplify noise and require careful parameter selection for an optimal trade-off between sharpness and grain. Deep learning methods allow for flexible parametrization of the noise and learning its properties directly from the data. Recently, self-supervised blind-spot neural networks were successfully adopted for image deconvolution by including a known point-spread function in the end-to-end training. However, their practical application has been limited to 2D images in the biomedical domain because it implies large kernels that are poorly optimized. We tackle this problem with Fast Fourier Transform convolutions that provide training speed-up in 3D microscopy deconvolution tasks. Further, we propose to adopt a Siamese invariance loss for deconvolution and empirically identify its optimal position in the neural network between blind-spot and full image branches. The experimental results show that our improved framework outperforms the previous state-of-the-art deconvolution methods with a known point spread function.
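To illustrate why FFT convolutions pay off with the large 3-D kernels involved, the sketch below applies a synthetic point-spread function in the Fourier domain via scipy's fftconvolve (plain forward blurring only; the paper's self-supervised Siamese training is not reproduced here):

```python
import numpy as np
from scipy.signal import fftconvolve

volume = np.random.rand(64, 64, 64)          # synthetic 3-D microscopy stack

# Gaussian point-spread function (PSF), elongated along z as in microscopy.
z, y, x = np.mgrid[-8:9, -8:9, -8:9]
psf = np.exp(-(x**2 + y**2) / 4.0 - z**2 / 16.0)
psf /= psf.sum()

blurred = fftconvolve(volume, psf, mode="same")  # O(N log N) instead of O(N * K)
print(blurred.shape)  # (64, 64, 64)
```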

Enhancing Agent Communication and Learning through Action and Language

  • paper_url: http://arxiv.org/abs/2308.10842
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Caselles-Dupré Hugo, Sigaud Olivier, Chetouani Mohamed
  • for: Developing a new category of GC-agents that can act as both teachers and learners, improving communication efficiency.
  • methods: The agents leverage action-based demonstrations and language-based instructions, and incorporate pedagogy and pragmatism, essential elements of human communication and goal achievement, to enhance their teaching and learning capabilities.
  • results: Combining action and language communication modes improves learning outcomes, highlighting the benefits of a multi-modal approach.
    Abstract We introduce a novel category of GC-agents capable of functioning as both teachers and learners. Leveraging action-based demonstrations and language-based instructions, these agents enhance communication efficiency. We investigate the incorporation of pedagogy and pragmatism, essential elements in human communication and goal achievement, enhancing the agents' teaching and learning capabilities. Furthermore, we explore the impact of combining communication modes (action and language) on learning outcomes, highlighting the benefits of a multi-modal approach.

ICU Mortality Prediction Using Long Short-Term Memory Networks

  • paper_url: http://arxiv.org/abs/2308.12800
  • repo_url: None
  • paper_authors: Manel Mili, Asma Kerkeni, Asma Ben Abdallah, Mohamed Hedi Bedoui
  • for: Early prediction of in-hospital mortality and Length of Stay (LOS).
  • methods: An automated data-driven system analyzes large amounts of multivariate temporal data derived from Electronic Health Records (EHRs) and extracts high-level information for prediction.
  • results: An LSTM network accurately predicts in-hospital mortality and LOS, and reducing the time frame to 6 hours makes the model more useful for clinical tasks.
    Abstract Extensive bedside monitoring in Intensive Care Units (ICUs) has resulted in complex temporal data regarding patient physiology, which presents a rich context for clinical data analysis. On the other hand, identifying the time-series patterns within these data may provide a high aptitude to predict clinical events. Hence, in this work we investigate the implementation of an automated data-driven system, which analyzes large amounts of multivariate temporal data derived from Electronic Health Records (EHRs) and extracts high-level information so as to predict in-hospital mortality and Length of Stay (LOS) early. Practically, we investigate the applicability of LSTM networks by reducing the time frame to 6 hours so as to enhance clinical tasks. The experimental results highlight the efficiency of the LSTM model with rigorous multivariate time-series measurements for building real-world prediction engines.
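A bare-bones PyTorch sketch of the kind of LSTM classifier described, consuming a 6-hour multivariate window and emitting a mortality probability; dimensions and names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class MortalityLSTM(nn.Module):
    def __init__(self, n_features=32, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):            # x: (batch, time, features)
        _, (h, _) = self.lstm(x)     # last hidden state summarizes the window
        return torch.sigmoid(self.head(h[-1]))  # P(in-hospital mortality)

model = MortalityLSTM()
window = torch.randn(4, 6, 32)       # 4 patients, 6 hourly measurement vectors
print(model(window).shape)           # torch.Size([4, 1])
```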

Machine-Learning Solutions for the Analysis of Single-Particle Diffusion Trajectories

  • paper_url: http://arxiv.org/abs/2308.09414
  • repo_url: None
  • paper_authors: Henrik Seckler, Janusz Szwabinski, Ralf Metzler
  • for: Reviewing how modern machine-learning techniques can be applied to diffusive time series to decipher the recorded dynamics.
  • methods: The review covers recently introduced machine-learning methods, notably those that competed successfully in the Anomalous-Diffusion Challenge; since such methods are often criticized for their lack of interpretability, the focus is on uncertainty estimates and feature-based approaches, which improve interpretability and give concrete insight into the learning process.
  • results: Predictions on different out-of-distribution datasets are analyzed, and expected future developments are discussed.
    Abstract Single-particle traces of the diffusive motion of molecules, cells, or animals are by-now routinely measured, similar to stochastic records of stock prices or weather data. Deciphering the stochastic mechanism behind the recorded dynamics is vital in understanding the observed systems. Typically, the task is to decipher the exact type of diffusion and/or to determine system parameters. The tools used in this endeavor are currently revolutionized by modern machine-learning techniques. In this Perspective we provide an overview over recently introduced methods in machine-learning for diffusive time series, most notably, those successfully competing in the Anomalous-Diffusion-Challenge. As such methods are often criticized for their lack of interpretability, we focus on means to include uncertainty estimates and feature-based approaches, both improving interpretability and providing concrete insight into the learning process of the machine. We expand the discussion by examining predictions on different out-of-distribution data. We also comment on expected future developments.
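A typical ingredient of the feature-based approaches discussed is the time-averaged mean squared displacement (TAMSD), whose scaling exponent separates normal from anomalous diffusion; the generic computation below is standard and not tied to any one method in the review.

```python
import numpy as np

def tamsd(traj, lag):
    """Time-averaged MSD of a trajectory of shape (T, d) at a given lag."""
    disp = traj[lag:] - traj[:-lag]
    return (disp ** 2).sum(axis=1).mean()

traj = np.cumsum(np.random.randn(1000, 2), axis=0)  # 2-D Brownian walk
lags = np.arange(1, 50)
msd = np.array([tamsd(traj, l) for l in lags])
alpha = np.polyfit(np.log(lags), np.log(msd), 1)[0]
print(round(alpha, 2))  # ~1 for normal diffusion; deviations signal anomalous diffusion
```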

Metadata Improves Segmentation Through Multitasking Elicitation

  • paper_url: http://arxiv.org/abs/2308.09411
  • repo_url: None
  • paper_authors: Iaroslav Plutenko, Mikhail Papkov, Kaupo Palo, Leopold Parts, Dmytro Fishman
  • for: Exploring the use of metadata in deep-learning methods, specifically for improving semantic segmentation.
  • methods: Metadata is fed into convolutional networks through a channel modulation mechanism, and its effect on semantic segmentation tasks is studied.
  • results: Metadata as an additional input improves segmentation results; the mechanism is an inexpensive add-on to popular models that requires neither extra training samples nor changes to the network architecture.
    Abstract Metainformation is a common companion to biomedical images. However, this potentially powerful additional source of signal from image acquisition has had limited use in deep learning methods, for semantic segmentation in particular. Here, we incorporate metadata by employing a channel modulation mechanism in convolutional networks and study its effect on semantic segmentation tasks. We demonstrate that metadata as additional input to a convolutional network can improve segmentation results while being inexpensive in implementation as a nimble add-on to popular models. We hypothesize that this benefit of metadata can be attributed to facilitating multitask switching. This aspect of metadata-driven systems is explored and discussed in detail.
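A sketch of a channel-modulation layer in the spirit described, with a metadata embedding predicting a per-channel scale and shift (FiLM-style; the paper's exact mechanism may differ, and all names are hypothetical):

```python
import torch
import torch.nn as nn

class MetadataModulation(nn.Module):
    """Scale and shift feature channels using an embedding of the metadata."""
    def __init__(self, meta_dim, channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(meta_dim, 2 * channels)

    def forward(self, feats, meta):            # feats: (B, C, H, W), meta: (B, M)
        gamma, beta = self.to_gamma_beta(meta).chunk(2, dim=1)
        return feats * (1 + gamma[..., None, None]) + beta[..., None, None]

mod = MetadataModulation(meta_dim=8, channels=16)
out = mod(torch.randn(2, 16, 32, 32), torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 16, 32, 32])
```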

Learning MDL logic programs from noisy data

  • paper_url: http://arxiv.org/abs/2308.09393
  • repo_url: None
  • paper_authors: Céline Hocquette, Andreas Niskanen, Matti Järvisalo, Andrew Cropper
  • for: Enabling inductive logic programming methods to learn programs from noisy data.
  • methods: The approach learns minimal description length programs, including recursive programs, from noisy data.
  • results: Experiments in several domains, including drug design, game playing, and program synthesis, show that the approach improves predictive accuracy over existing methods and scales to moderate amounts of noise.
    Abstract Many inductive logic programming approaches struggle to learn programs from noisy data. To overcome this limitation, we introduce an approach that learns minimal description length programs from noisy data, including recursive programs. Our experiments on several domains, including drug design, game playing, and program synthesis, show that our approach can outperform existing approaches in terms of predictive accuracies and scale to moderate amounts of noise.
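A toy rendering of the minimal-description-length trade-off being optimized: a hypothesis costs its own size plus the cost of the examples it gets wrong, so noise is tolerated whenever memorizing it would cost more than mislabeling it. The scoring below is purely illustrative, not the paper's encoding.

```python
def mdl_cost(program_size, mistakes, bits_per_literal=1.0, bits_per_example=1.0):
    """Description length = length of the hypothesis + length of its exceptions."""
    return bits_per_literal * program_size + bits_per_example * mistakes

# A compact rule with a few noisy exceptions beats one that memorizes the noise.
print(mdl_cost(program_size=5, mistakes=3))    # 8.0
print(mdl_cost(program_size=40, mistakes=0))   # 40.0
```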

FunQuant: A R package to perform quantization in the context of rare events and time-consuming simulations

  • paper_url: http://arxiv.org/abs/2308.10871
  • repo_url: None
  • paper_authors: Charlie Sire, Yann Richet, Rodolphe Le Riche, Didier Rullière, Jérémy Rohmer, Lucie Pheulpin
  • for: Describing methods for data quantization and the settings in which they apply.
  • methods: Lloyd's algorithm partitions the space into Voronoï cells, which can be seen as clusters, and constructs a discrete distribution from their centroids and probability masses.
  • results: When data evaluation is costly and relates to rare events, Lloyd's algorithm struggles and the single no-event cluster absorbs most of the probability mass; metamodels and adapted sampling methods are therefore needed to increase the precision of computations on the rare clusters.
    Abstract Quantization summarizes continuous distributions by calculating a discrete approximation. Among the widely adopted methods for data quantization is Lloyd's algorithm, which partitions the space into Voronoï cells, which can be seen as clusters, and constructs a discrete distribution based on their centroids and probabilistic masses. Lloyd's algorithm estimates the optimal centroids in a minimal expected distance sense, but this approach poses significant challenges in scenarios where data evaluation is costly and relates to rare events. Then, the single cluster associated with no event takes the majority of the probability mass. In this context, a metamodel is required and adapted sampling methods are necessary to increase the precision of the computations on the rare clusters.
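A compact sketch of the Lloyd iteration the package builds on: alternate nearest-centroid assignment (Voronoi cells) with centroid and probability-mass updates. This is the plain version, without the metamodel or adapted sampling that FunQuant adds for rare events.

```python
import numpy as np

def lloyd(samples, k=3, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = samples[rng.choice(len(samples), k, replace=False)]
    for _ in range(iters):
        # Assign each sample to its Voronoi cell (nearest centroid).
        d = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=-1)
        cells = d.argmin(axis=1)
        # Update centroids, keeping the old one if a cell happens to be empty.
        centroids = np.stack([samples[cells == j].mean(axis=0)
                              if np.any(cells == j) else centroids[j]
                              for j in range(k)])
    masses = np.bincount(cells, minlength=k) / len(samples)
    return centroids, masses

samples = np.random.default_rng(1).normal(size=(1000, 2))
centroids, masses = lloyd(samples)
print(masses.sum())  # 1.0
```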

On Gradient-like Explanation under a Black-box Setting: When Black-box Explanations Become as Good as White-box

  • paper_url: http://arxiv.org/abs/2308.09381
  • repo_url: None
  • paper_authors: Yi Cai, Gerhard Wunder
  • for: This paper provides a gradient-estimation-based explanation method to improve the explainability of data-driven approaches.
  • methods: The proposed GEEX (gradient-estimation-based explanation) delivers gradient-like explanations under a black-box setting; integrating it with a path method yields iGEEX, which satisfies the four fundamental axioms of attribution methods: sensitivity, insensitivity, implementation invariance, and linearity.
  • results: Exhaustive experiments on image data show that the proposed methods outperform state-of-the-art black-box methods and achieve performance competitive with methods that have full access to the model.
    Abstract Attribution methods shed light on the explainability of data-driven approaches such as deep learning models by revealing the most contributing features to decisions that have been made. A widely accepted way of deriving feature attributions is to analyze the gradients of the target function with respect to input features. Analysis of gradients requires full access to the target system, meaning that solutions of this kind treat the target system as a white-box. However, the white-box assumption may be untenable due to security and safety concerns, thus limiting their practical applications. As an answer to the limited flexibility, this paper presents GEEX (gradient-estimation-based explanation), an explanation method that delivers gradient-like explanations under a black-box setting. Furthermore, we integrate the proposed method with a path method. The resulting approach iGEEX (integrated GEEX) satisfies the four fundamental axioms of attribution methods: sensitivity, insensitivity, implementation invariance, and linearity. With a focus on image data, the exhaustive experiments empirically show that the proposed methods outperform state-of-the-art black-box methods and achieve competitive performance compared to the ones with full access.
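A generic sampling-based gradient estimator of the kind such black-box explanations rely on, smoothing the model output with Gaussian probes; this is a sketch of the general technique, not necessarily the paper's exact estimator.

```python
import numpy as np

def estimate_gradient(f, x, sigma=0.1, n_samples=256, seed=0):
    """Estimate grad f(x) from black-box queries only:
    g ~= (1 / (sigma * N)) * sum_i f(x + sigma * u_i) * u_i,  u_i ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_samples, x.size))
    vals = np.array([f(x + sigma * ui) for ui in u])
    return (vals[:, None] * u).mean(axis=0) / sigma

f = lambda z: float(np.sin(z).sum())            # stand-in for a model output
x = np.zeros(4)
print(estimate_gradient(f, x, n_samples=5000))  # close to cos(0) = [1, 1, 1, 1]
```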

Deciphering knee osteoarthritis diagnostic features with explainable artificial intelligence: A systematic review

  • paper_url: http://arxiv.org/abs/2308.09380
  • repo_url: None
  • paper_authors: Yun Xin Teoh, Alice Othmani, Siew Li Goh, Juliana Usman, Khin Wee Lai
  • for: Improving the reliability and interpretability of knee osteoarthritis diagnosis and promoting the adoption of artificial intelligence in clinical practice.
  • methods: The paper presents the first survey of explainable artificial intelligence (XAI) techniques for knee osteoarthritis diagnosis, analyzed from two perspectives: data interpretability and model interpretability.
  • results: The survey finds that XAI techniques can make knee osteoarthritis diagnosis more reliable and transparent, and can strengthen clinicians' trust in model predictions.
    Abstract Existing artificial intelligence (AI) models for diagnosing knee osteoarthritis (OA) have faced criticism for their lack of transparency and interpretability, despite achieving medical-expert-like performance. This opacity makes them challenging to trust in clinical practice. Recently, explainable artificial intelligence (XAI) has emerged as a specialized technique that can provide confidence in the model's prediction by revealing how the prediction is derived, thus promoting the use of AI systems in healthcare. This paper presents the first survey of XAI techniques used for knee OA diagnosis. The XAI techniques are discussed from two perspectives: data interpretability and model interpretability. The aim of this paper is to provide valuable insights into XAI's potential towards a more reliable knee OA diagnosis approach and encourage its adoption in clinical practice.

Deep Learning Techniques in Extreme Weather Events: A Review

  • paper_url: http://arxiv.org/abs/2308.10995
  • repo_url: None
  • paper_authors: Shikha Verma, Kuldeep Srivastava, Akhilesh Tiwari, Shekhar Verma
  • for: Providing an overview of the state of the art in deep learning for weather forecasting and for understanding the dynamics of extreme weather events.
  • methods: The review surveys deep-learning architectures across aspects of weather prediction, including thunderstorms, lightning, precipitation, drought, heatwaves, cold waves and tropical cyclones.
  • results: Deep learning can capture complex patterns and non-linear relationships, but current approaches have limitations; the review highlights these and outlines future directions for the field.
    Abstract Extreme weather events pose significant challenges, thereby demanding techniques for accurate analysis and precise forecasting to mitigate its impact. In recent years, deep learning techniques have emerged as a promising approach for weather forecasting and understanding the dynamics of extreme weather events. This review aims to provide a comprehensive overview of the state-of-the-art deep learning in the field. We explore the utilization of deep learning architectures, across various aspects of weather prediction such as thunderstorm, lightning, precipitation, drought, heatwave, cold waves and tropical cyclones. We highlight the potential of deep learning, such as its ability to capture complex patterns and non-linear relationships. Additionally, we discuss the limitations of current approaches and highlight future directions for advancements in the field of meteorology. The insights gained from this systematic review are crucial for the scientific community to make informed decisions and mitigate the impacts of extreme weather events.

Image Processing and Machine Learning for Hyperspectral Unmixing: An Overview and the HySUPP Python Package

  • paper_url: http://arxiv.org/abs/2308.09375
  • repo_url: https://github.com/behnoodrasti/hysupp
  • paper_authors: Behnood Rasti, Alexandre Zouaoui, Julien Mairal, Jocelyn Chanussot
  • for: This paper provides an overview of advanced and conventional unmixing approaches for hyperspectral image analysis and compares their performance on various datasets.
  • methods: The paper discusses linear unmixing techniques, including supervised, semi-supervised, and unsupervised (blind) methods, and their applications in hyperspectral image analysis.
  • results: The experimental results show the advantages of the different unmixing categories for different unmixing scenarios, and an open-source Python-based package is provided for reproducing the results.
    Abstract Spectral pixels are often a mixture of the pure spectra of the materials, called endmembers, due to the low spatial resolution of hyperspectral sensors, double scattering, and intimate mixtures of materials in the scenes. Unmixing estimates the fractional abundances of the endmembers within the pixel. Depending on the prior knowledge of endmembers, linear unmixing can be divided into three main groups: supervised, semi-supervised, and unsupervised (blind) linear unmixing. Advances in Image processing and machine learning substantially affected unmixing. This paper provides an overview of advanced and conventional unmixing approaches. Additionally, we draw a critical comparison between advanced and conventional techniques from the three categories. We compare the performance of the unmixing techniques on three simulated and two real datasets. The experimental results reveal the advantages of different unmixing categories for different unmixing scenarios. Moreover, we provide an open-source Python-based package available at https://github.com/BehnoodRasti/HySUPP to reproduce the results.
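In the supervised linear setting, a pixel is modeled as $y \approx E a$ with non-negative abundances $a$; below is a corresponding non-negative least-squares sketch, with the sum-to-one constraint handled by a standard row-augmentation trick. This illustrates the generic formulation only and is not the HySUPP API.

```python
import numpy as np
from scipy.optimize import nnls

bands, n_endmembers = 50, 3
rng = np.random.default_rng(0)
E = rng.uniform(0, 1, size=(bands, n_endmembers))       # endmember spectra
a_true = np.array([0.6, 0.3, 0.1])
y = E @ a_true + 0.01 * rng.standard_normal(bands)      # mixed pixel

# Fully constrained unmixing: append a heavily weighted row enforcing sum(a) = 1.
rho = 100.0
E_aug = np.vstack([E, rho * np.ones(n_endmembers)])
y_aug = np.append(y, rho)
a_hat, _ = nnls(E_aug, y_aug)
print(np.round(a_hat, 2))  # close to [0.6, 0.3, 0.1]
```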

Noise Sensitivity and Stability of Deep Neural Networks for Binary Classification

  • paper_url: http://arxiv.org/abs/2308.09374
  • repo_url: None
  • paper_authors: Johan Jonasson, Jeffrey E. Steif, Olof Zetterqvist
  • for: Taking a first step towards understanding the often-observed non-robustness of deep neural network (DNN) classifiers from the perspective of Boolean functions, by asking whether Boolean functions represented by common DNN models are noise sensitive or noise stable.
  • methods: The notions of noise sensitivity and noise stability from the Boolean-function literature are extended to annealed and quenched versions, to account for the natural randomness in DNN models.
  • results: The relation between these definitions is sorted out, and the properties of two standard DNN architectures, fully connected and convolutional models initialized with Gaussian weights, are investigated.
    Abstract A first step is taken towards understanding often observed non-robustness phenomena of deep neural net (DNN) classifiers. This is done from the perspective of Boolean functions by asking if certain sequences of Boolean functions represented by common DNN models are noise sensitive or noise stable, concepts defined in the Boolean function literature. Due to the natural randomness in DNN models, these concepts are extended to annealed and quenched versions. Here we sort out the relation between these definitions and investigate the properties of two standard DNN architectures, the fully connected and convolutional models, when initiated with Gaussian weights.
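For reference, the standard Boolean-function notions the paper builds on (textbook definitions, not quoted from the paper): for $f:\{-1,1\}^n \to \{-1,1\}$, let $x$ be uniform and let $y$ be obtained from $x$ by flipping each bit independently with probability $\varepsilon$ (so $y$ is $\rho$-correlated with $x$ for $\rho = 1-2\varepsilon$). Then

$$\mathrm{NS}_{\varepsilon}[f] = \Pr\big[f(x)\neq f(y)\big], \qquad \mathrm{Stab}_{\rho}[f] = \mathbb{E}\big[f(x)\,f(y)\big] = 1 - 2\,\mathrm{NS}_{\varepsilon}[f].$$

A sequence $(f_n)$ is noise sensitive if $\mathbb{E}[f_n(x)f_n(y)] - \mathbb{E}[f_n]^2 \to 0$ for every fixed $\varepsilon > 0$, and noise stable if $\sup_n \mathrm{NS}_{\varepsilon}[f_n] \to 0$ as $\varepsilon \to 0$.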

Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers

  • paper_url: http://arxiv.org/abs/2308.09372
  • repo_url: https://github.com/tobna/whattransformertofavor
  • paper_authors: Tobias Christian Nauen, Sebastian Palacio, Andreas Dengel
  • for: Providing a comprehensive, comparable analysis of efficiency-oriented vision transformers and related architectures, as a common benchmark to guide practitioners and researchers.
  • methods: More than 30 models, including ViT and its proposed alternatives, are evaluated with a range of performance metrics.
  • results: Despite many methods claiming greater efficiency, ViT remains Pareto optimal across multiple efficiency metrics; hybrid attention-CNN models perform particularly well under low inference memory and parameter budgets; scaling the model size is preferable to scaling the image size; and the strong positive correlation between FLOPS and training memory allows the required VRAM to be estimated from theoretical measurements alone.
    Abstract The growing popularity of Vision Transformers as the go-to models for image classification has led to an explosion of architectural modifications claiming to be more efficient than the original ViT. However, a wide diversity of experimental conditions prevents a fair comparison between all of them, based solely on their reported results. To address this gap in comparability, we conduct a comprehensive analysis of more than 30 models to evaluate the efficiency of vision transformers and related architectures, considering various performance metrics. Our benchmark provides a comparable baseline across the landscape of efficiency-oriented transformers, unveiling a plethora of surprising insights. For example, we discover that ViT is still Pareto optimal across multiple efficiency metrics, despite the existence of several alternative approaches claiming to be more efficient. Results also indicate that hybrid attention-CNN models fare particularly well when it comes to low inference memory and number of parameters, and also that it is better to scale the model size, than the image size. Furthermore, we uncover a strong positive correlation between the number of FLOPS and the training memory, which enables the estimation of required VRAM from theoretical measurements alone. Thanks to our holistic evaluation, this study offers valuable insights for practitioners and researchers, facilitating informed decisions when selecting models for specific applications. We publicly release our code and data at https://github.com/tobna/WhatTransformerToFavor
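The reported FLOPS-to-training-memory correlation suggests a simple workflow: fit a line on a few measured models and read off the VRAM a new model will need. The numbers below are invented for illustration, not taken from the benchmark.

```python
import numpy as np

# Hypothetical (FLOPs, training-memory) measurements for a family of models.
flops = np.array([1.2, 4.6, 17.6, 55.4])   # GFLOPs per forward pass
mem = np.array([1.1, 3.9, 14.8, 47.0])     # GB of training memory

slope, intercept = np.polyfit(flops, mem, 1)
print(slope * 30.0 + intercept)  # predicted VRAM (GB) for a 30-GFLOP model
```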

A tailored Handwritten-Text-Recognition System for Medieval Latin

  • paper_url: http://arxiv.org/abs/2308.09368
  • repo_url: None
  • paper_authors: Philipp Koch, Gilary Vera Nuñez, Esteban Garces Arias, Christian Heumann, Matthias Schöffel, Alexander Häberlin, Matthias Aßenmacher
  • for: This paper aims to digitize the Medieval Latin Dictionary by using Handwritten Text Recognition (HTR) to transcribe handwritten lemmas on record cards.
  • methods: The authors employ two state-of-the-art image segmentation models to prepare the initial data set, and experiment with different transformer-based models and data augmentation techniques to improve the HTR performance.
  • results: The best-performing setup achieved a Character Error Rate (CER) of 0.015, which is superior to the commercial Google Cloud Vision model and shows more stable performance.
    Abstract The Bavarian Academy of Sciences and Humanities aims to digitize its Medieval Latin Dictionary. This dictionary entails record cards referring to lemmas in medieval Latin, a low-resource language. A crucial step of the digitization process is the Handwritten Text Recognition (HTR) of the handwritten lemmas found on these record cards. In our work, we introduce an end-to-end pipeline, tailored to the medieval Latin dictionary, for locating, extracting, and transcribing the lemmas. We employ two state-of-the-art (SOTA) image segmentation models to prepare the initial data set for the HTR task. Furthermore, we experiment with different transformer-based models and conduct a set of experiments to explore the capabilities of different combinations of vision encoders with a GPT-2 decoder. Additionally, we also apply extensive data augmentation resulting in a highly competitive model. The best-performing setup achieved a Character Error Rate (CER) of 0.015, which is even superior to the commercial Google Cloud Vision model, and shows more stable performance.
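The Character Error Rate used for evaluation is the Levenshtein edit distance divided by the reference length; a self-contained sketch of the standard computation:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein edit distance / reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        cur = [i]
        for j, h in enumerate(hypothesis, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1] / max(len(reference), 1)

print(cer("lemma", "lemna"))  # 0.2 (one substituted character out of five)
```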

On the Approximation of Bi-Lipschitz Maps by Invertible Neural Networks

  • paper_url: http://arxiv.org/abs/2308.09367
  • repo_url: None
  • paper_authors: Bangti Jin, Zehui Zhou, Jun Zou
  • for: Studying the expressive capacity of invertible neural networks (INNs) and their application scenarios.
  • methods: Coupling-based INNs are analyzed for approximating bi-Lipschitz continuous mappings on a compact domain; for bi-Lipschitz maps on infinite-dimensional spaces, model reduction with principal component analysis is combined with INNs that approximate the reduced map.
  • results: Coupling-based INNs can approximate both the forward and inverse maps well simultaneously, and preliminary numerical experiments demonstrate the feasibility of the approach.
    Abstract Invertible neural networks (INNs) represent an important class of deep neural network architectures that have been widely used in several applications. The universal approximation properties of INNs have also been established recently. However, the approximation rate of INNs is largely missing. In this work, we provide an analysis of the capacity of a class of coupling-based INNs to approximate bi-Lipschitz continuous mappings on a compact domain, and the result shows that it can well approximate both forward and inverse maps simultaneously. Furthermore, we develop an approach for approximating bi-Lipschitz maps on infinite-dimensional spaces that simultaneously approximate the forward and inverse maps, by combining model reduction with principal component analysis and INNs for approximating the reduced map, and we analyze the overall approximation error of the approach. Preliminary numerical results show the feasibility of the approach for approximating the solution operator for parameterized second-order elliptic problems.
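A generic affine coupling block of the kind analyzed (RealNVP-style), showing why the forward and inverse maps share parameters and are both cheap to evaluate; this is a standard sketch, not the paper's construction.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Split x into (x1, x2); transform x2 conditioned on x1. Invertible by design."""
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.Tanh(),
                                 nn.Linear(64, 2 * (dim - self.half)))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=1)
        return torch.cat([x1, x2 * torch.exp(s) + t], dim=1)

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.net(y1).chunk(2, dim=1)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=1)

layer = AffineCoupling(4)
x = torch.randn(3, 4)
print(torch.allclose(layer.inverse(layer(x)), x, atol=1e-6))  # True
```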

Multi-feature concatenation and multi-classifier stacking: an interpretable and generalizable machine learning method for MDD discrimination with rsfMRI

  • paper_url: http://arxiv.org/abs/2308.09360
  • repo_url: None
  • paper_authors: Yunsong Luo, Wenyu Chen, Ling Zhan, Jiang Qiu, Tao Jia
  • for: The paper aims to improve the accuracy of diagnosing major depressive disorder (MDD) using resting-state functional MRI (rsfMRI) and machine learning algorithms.
  • methods: The paper proposes a machine learning method called Multi-Feature Multi-Classifier (MFMC) that concatenates multiple features and stacks multiple classifiers to discriminate MDD patients from normal controls; the method is tested on the REST-meta-MDD data set, which contains 2428 subjects from 25 different sites.
  • results: MFMC achieves 96.9% MDD discrimination accuracy, outperforming existing methods, and generalizes well when training and testing subjects come from independent sites; 13 feature values related to 9 brain regions contribute most to the classification and show significant group-level differences.
    Abstract Major depressive disorder is a serious and heterogeneous psychiatric disorder that needs accurate diagnosis. Resting-state functional MRI (rsfMRI), which captures multiple perspectives on brain structure, function, and connectivity, is increasingly applied in the diagnosis and pathological research of mental diseases. Different machine learning algorithms are then developed to exploit the rich information in rsfMRI and discriminate MDD patients from normal controls. Despite recent advances reported, the discrimination accuracy has room for further improvement. The generalizability and interpretability of the method are not sufficiently addressed either. Here, we propose a machine learning method (MFMC) for MDD discrimination by concatenating multiple features and stacking multiple classifiers. MFMC is tested on the REST-meta-MDD data set that contains 2428 subjects collected from 25 different sites. MFMC yields 96.9% MDD discrimination accuracy, demonstrating a significant improvement over existing methods. In addition, the generalizability of MFMC is validated by the good performance when the training and testing subjects are from independent sites. The use of XGBoost as the meta classifier allows us to probe the decision process of MFMC. We identify 13 feature values related to 9 brain regions including the posterior cingulate gyrus, superior frontal gyrus orbital part, and angular gyrus, which contribute most to the classification and also demonstrate significant differences at the group level. The use of these 13 feature values alone can reach 87% of MFMC's full performance when taking all feature values. These features may serve as clinically useful diagnostic and prognostic biomarkers for mental disorders in the future.
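A schematic of the multi-feature concatenation plus multi-classifier stacking recipe using scikit-learn, with a gradient-boosting meta-classifier standing in for the XGBoost used in the paper; feature blocks, base learners, and labels are placeholders.

```python
import numpy as np
from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Concatenate multiple rsfMRI-derived feature blocks per subject.
feat_a = rng.standard_normal((200, 30))   # e.g., functional-connectivity features
feat_b = rng.standard_normal((200, 20))   # e.g., regional activity features
X = np.hstack([feat_a, feat_b])
y = rng.integers(0, 2, size=200)          # MDD vs. control (toy labels)

clf = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("svc", SVC(probability=True))],
    final_estimator=GradientBoostingClassifier(),  # XGBoost plays this role in the paper
)
clf.fit(X, y)
print(clf.predict(X[:5]))
```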

RLIPv2: Fast Scaling of Relational Language-Image Pre-training

  • paper_url: http://arxiv.org/abs/2308.09351
  • repo_url: https://github.com/jacobyuan7/rlipv2
  • paper_authors: Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao
  • for: RLIPv2 advances relational reasoning in computer vision; it converges fast, fine-tunes efficiently, and enables relational pre-training on large-scale pseudo-labelled scene graph data.
  • methods: RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), which enables earlier and deeper gated cross-modal fusion with sparsified language encoding layers, and extends object detection datasets with free-form relation labels using a captioner (e.g., BLIP) and a designed Relation Tagger.
  • results: Extensive experiments on Human-Object Interaction Detection and Scene Graph Generation show state-of-the-art performance on three benchmarks under fully-finetuned, few-shot and zero-shot settings; the largest RLIPv2 reaches 23.29 mAP on HICO-DET without any fine-tuning, 32.22 mAP with just 1% of the data, and 45.09 mAP with 100% of the data.
    Abstract Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, hindered by the slow convergence of RLIPv1 architecture and the limited availability of existing scene graph data, scaling RLIPv1 is challenging. In this paper, we propose RLIPv2, a fast converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments conducted on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully-finetuning, few-shot and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29mAP on HICO-DET without any fine-tuning, yields 32.22mAP with just 1% data and yields 45.09mAP with 100% data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2.

Denoising diffusion-based MR to CT image translation enables whole spine vertebral segmentation in 2D and 3D without manual annotations

  • paper_url: http://arxiv.org/abs/2308.09345
  • repo_url: https://github.com/robert-graf/readable-conditional-denoising-diffusion
  • paper_authors: Robert Graf, Joachim Schmitt, Sarah Schlaeger, Hendrik Kristian Möller, Vasiliki Sideri-Lampretsa, Anjany Sekuboyina, Sandro Manuel Krieg, Benedikt Wiestler, Bjoern Menze, Daniel Rueckert, Jan Stefan Kirschke
  • for: 将 MR 图像转换为 CT 图像,以便在无需人工标注的情况下对全脊柱椎体进行分割与诊断。
  • methods: 比较配对与非配对的图像到图像转换方法:配对方法(Pix2Pix、DDIM 图像模式、DDIM 噪声模式)借助基于标志点的配准来对齐图像对,非配对方法则采用对比非配对转换(CUT)和 SynDiff。
  • results: 配对方法与 SynDiff 在配对数据上表现出相近的转换性能与 Dice 分数;DDIM 图像模式获得最高图像质量;SynDiff、Pix2Pix 与 DDIM 图像模式的 Dice 分数均约为 0.77。在头尾轴旋转情况下,每个椎体至少需要两个标志点才能完成配准。3D 转换方法在 Dice 分数(0.80)和图像质量上超过 2D 方法,并能以高于原始 MR 图像的分辨率给出解剖学上准确的分割结果。
    Abstract Background: Automated segmentation of spinal MR images plays a vital role both scientifically and clinically. However, accurately delineating posterior spine structures presents challenges. Methods: This retrospective study, approved by the ethical committee, involved translating T1w and T2w MR image series into CT images in a total of n=263 pairs of CT/MR series. Landmark-based registration was performed to align image pairs. We compared 2D paired (Pix2Pix, denoising diffusion implicit models (DDIM) image mode, DDIM noise mode) and unpaired (contrastive unpaired translation, SynDiff) image-to-image translation using "peak signal to noise ratio" (PSNR) as quality measure. A publicly available segmentation network segmented the synthesized CT datasets, and Dice scores were evaluated on in-house test sets and the "MRSpineSeg Challenge" volumes. The 2D findings were extended to 3D Pix2Pix and DDIM. Results: 2D paired methods and SynDiff exhibited similar translation performance and Dice scores on paired data. DDIM image mode achieved the highest image quality. SynDiff, Pix2Pix, and DDIM image mode demonstrated similar Dice scores (0.77). For craniocaudal axis rotations, at least two landmarks per vertebra were required for registration. The 3D translation outperformed the 2D approach, resulting in improved Dice scores (0.80) and anatomically accurate segmentations in a higher resolution than the original MR image. Conclusion: Two landmarks per vertebra registration enabled paired image-to-image translation from MR to CT and outperformed all unpaired approaches. The 3D techniques provided anatomically correct segmentations, avoiding underprediction of small structures like the spinous process.
    摘要 背景:脊柱 MR 图像的自动分割在科研和临床上都具有重要作用,但准确勾画脊柱后部结构仍具挑战。方法:这项经伦理委员会批准的回顾性研究将 T1w 和 T2w MR 图像序列转换为 CT 图像,共计 n=263 对 CT/MR 序列,并采用基于标志点的配准来对齐图像对。我们以"峰值信噪比"(PSNR)作为质量指标,比较了 2D 配对方法(Pix2Pix、去噪扩散隐式模型 DDIM 的图像模式与噪声模式)和非配对方法(对比非配对转换、SynDiff)。一个公开可用的分割网络对合成的 CT 数据集进行分割,并在内部测试集和"MRSpineSeg Challenge"体数据上评估 Dice 分数;2D 的结论进一步推广到 3D Pix2Pix 和 DDIM。结果:2D 配对方法与 SynDiff 在配对数据上表现出相近的转换性能与 Dice 分数;DDIM 图像模式获得最高图像质量;SynDiff、Pix2Pix 与 DDIM 图像模式的 Dice 分数相近(0.77)。对于头尾轴旋转,每个椎体至少需要两个标志点才能完成配准。3D 转换优于 2D 方法,Dice 分数提升至 0.80,并能以高于原始 MR 图像的分辨率给出解剖学上准确的分割。结论:每个椎体两个标志点的配准使 MR 到 CT 的配对图像转换成为可能,且优于所有非配对方法;3D 技术给出了解剖学上正确的分割,避免了棘突等小结构的欠预测。
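
PSNR, the quality measure used to compare the translated CT images, can be computed as below; this is the standard definition, with synthetic arrays standing in for real CT/MR data:

```python
import numpy as np

def psnr(reference: np.ndarray, synthesized: np.ndarray, data_range: float) -> float:
    # Peak signal-to-noise ratio in dB, relative to the known intensity range.
    mse = np.mean((reference.astype(np.float64) - synthesized.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((data_range ** 2) / mse)

ct = np.random.rand(64, 64)                 # stand-in for a real CT slice
fake_ct = ct + 0.05 * np.random.randn(64, 64)  # stand-in for a synthesized slice
print(f"PSNR: {psnr(ct, fake_ct, data_range=1.0):.2f} dB")
```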

Surprise machines: revealing Harvard Art Museums’ image collection

  • paper_url: http://arxiv.org/abs/2308.09343
  • repo_url: None
  • paper_authors: Dario Rodighiero, Lins Derry, Douglas Duhaime, Jordan Kruguer, Maximilian C. Mueller, Christopher Pietsch, Jeffrey T. Schnapp, Jeff Steward
  • for: 这个研究旨在使用人工智能技术可视化哈佛艺术博物馆的全部图像收藏,为通常无法向访客开放的 20 多万件藏品开辟新的视野。
  • methods: 这个项目使用了人工智能技术,设计了一个舞台式界面,将评估者的移动与多个特有视角的收藏相连接,以创造对访客的惊喜。
  • results: 该项目成功地利用人工智能技术展示了海量图像,为访客制造了惊喜,并为通常无法向访客开放的 20 多万件藏品开辟了新的视野。
    Abstract Surprise Machines is a project of experimental museology that sets out to visualize the entire image collection of the Harvard Art Museums, intending to open up unexpected vistas on more than 200,000 objects usually inaccessible to visitors. Part of the exhibition Curatorial A(i)gents organized by metaLAB (at) Harvard, the project explores the limits of artificial intelligence to display a large set of images and create surprise among visitors. To achieve such a feeling of surprise, a choreographic interface was designed to connect the audience's movement with several unique views of the collection.
    摘要 "惊喜机器"(Surprise Machines)是一个实验博物馆学项目,旨在可视化哈佛艺术博物馆的全部图像收藏,为通常无法向访客开放的 20 多万件藏品开辟意想不到的视野。该项目是 metaLAB (at) Harvard 组织的 Curatorial A(i)gents 展览的一部分,探索人工智能在展示海量图像、为访客制造惊喜方面的边界。为了营造这种惊喜感,项目设计了一个编舞式界面,将观众的移动与收藏的多个独特视角相连。

Document Automation Architectures: Updated Survey in Light of Large Language Models

  • paper_url: http://arxiv.org/abs/2308.09341
  • repo_url: None
  • paper_authors: Mohammad Ahmadi Achachlouei, Omkar Patil, Tarun Joshi, Vijayan N. Nair
  • for: 这篇论文主要是为了对文档自动化(DA)的当前状况进行评估和概述。
  • methods: 本论文使用了学术研究的 DA 架构和技术来对不同来源的输入自动创建和组合,并且使用了定义模板来生成符合要求的文档。
  • results: 本论文通过对学术研究的 DA 架构和技术进行评估和概述,提供了更清晰的 DA 定义和特征,并且预测了基于生成 AI 和大语言模型的新研究机遇。
    Abstract This paper surveys the current state of the art in document automation (DA). The objective of DA is to reduce the manual effort during the generation of documents by automatically creating and integrating input from different sources and assembling documents conforming to defined templates. There have been reviews of commercial solutions of DA, particularly in the legal domain, but to date there has been no comprehensive review of the academic research on DA architectures and technologies. The current survey of DA reviews the academic literature and provides a clearer definition and characterization of DA and its features, identifies state-of-the-art DA architectures and technologies in academic research, and provides ideas that can lead to new research opportunities within the DA field in light of recent advances in generative AI and large language models.
    摘要 本综述回顾了文档自动化(DA)领域的学术文献,给出了更清晰的 DA 定义与特征刻画,梳理了学术研究中最先进的 DA 架构与技术,并结合生成式人工智能与大语言模型的最新进展,提出了 DA 领域新的研究方向。

Causal Interpretable Progression Trajectory Analysis of Chronic Disease

  • paper_url: http://arxiv.org/abs/2308.09735
  • repo_url: None
  • paper_authors: Zhoujian Sun, Wenzhuo Zhang, Zhengxing Huang, Nai Ding
  • for: 预测疾病进程轨迹和决策支持
  • methods: combining trajectory prediction and causal discovery
  • results: 提供对疾病进展轨迹的精确预测,揭示特征间的因果关系,并增强模型的可解释性
    Abstract Chronic disease is the leading cause of death, emphasizing the need for accurate prediction of disease progression trajectories and informed clinical decision-making. Machine learning (ML) models have shown promise in this domain by capturing non-linear patterns within patient features. However, existing ML-based models lack the ability to provide causal interpretable predictions and estimate treatment effects, limiting their decision-assisting perspective. In this study, we propose a novel model called causal trajectory prediction (CTP) to tackle the limitation. The CTP model combines trajectory prediction and causal discovery to enable accurate prediction of disease progression trajectories and uncovering causal relationships between features. By incorporating a causal graph into the prediction process, CTP ensures that ancestor features are not influenced by treatment on descendant features, thereby enhancing the interpretability of the model. By estimating the bounds of treatment effects, even in the presence of unmeasured confounders, the CTP provides valuable insights for clinical decision-making. We evaluate the performance of the CTP using simulated and real medical datasets. Experimental results demonstrate that our model achieves satisfactory performance, highlighting its potential to assist clinical decisions.
    摘要 慢性疾病是导致死亡的首要原因,这凸显了准确预测疾病进展轨迹并辅助临床决策的重要性。机器学习模型能够捕捉患者特征中的非线性模式,在该领域展现出潜力,但现有模型缺乏因果可解释的预测能力,也无法估计治疗效应,限制了其辅助决策的价值。本研究提出一种名为因果轨迹预测(CTP)的新模型,将轨迹预测与因果发现相结合,在准确预测疾病进展轨迹的同时揭示特征间的因果关系。通过在预测过程中引入因果图,CTP 确保祖先特征不会受到对后代特征的治疗的影响,从而增强模型的可解释性。即使存在未测量的混杂因素,CTP 也能估计治疗效应的上下界,为临床决策提供有价值的参考。我们在模拟和真实医疗数据集上评估了 CTP 的性能,实验结果表明该模型表现令人满意,显示出辅助临床决策的潜力。
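
A toy sketch of the causal-graph constraint CTP builds on: a binary adjacency mask restricts which features may influence which, so intervening on a descendant feature cannot change an ancestor's predicted trajectory. The DAG, weights, and one-step linear dynamics here are hypothetical:

```python
import numpy as np

n_features = 4
# adjacency[i, j] = 1 means feature i may influence feature j (i is an ancestor of j)
adjacency = np.triu(np.ones((n_features, n_features)), k=1)  # a chain-like DAG
W = np.random.randn(n_features, n_features)

def predict_next(x: np.ndarray) -> np.ndarray:
    # Only causal edges (plus self-dynamics) contribute to the next state, so
    # changing a descendant never feeds back into an ancestor's prediction.
    return x + x @ (W * (adjacency + np.eye(n_features)))

x = np.random.randn(n_features)
print(predict_next(x))
```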

Towards Attack-tolerant Federated Learning via Critical Parameter Analysis

  • paper_url: http://arxiv.org/abs/2308.09318
  • repo_url: https://github.com/sungwon-han/fedcpa
  • paper_authors: Sungwon Han, Sungwon Park, Fangzhao Wu, Sundong Kim, Bin Zhu, Xing Xie, Meeyoung Cha
  • for: 本研究旨在提出一种能够抵御投毒攻击的聚合方法,以便在联邦学习中可靠地训练共享模型。
  • methods: 提出 FedCPA(基于关键参数分析的联邦学习),利用良性本地模型在 top-k 与 bottom-k 关键参数集合上相似、而中毒模型不具备这一特性的观察,实现攻击容忍的聚合。
  • results: 实验结果显示,在多个数据集和不同攻击场景下,FedCPA 在抵御投毒攻击方面均优于现有防御策略。
    Abstract Federated learning is used to train a shared model in a decentralized way without clients sharing private data with each other. Federated learning systems are susceptible to poisoning attacks when malicious clients send false updates to the central server. Existing defense strategies are ineffective under non-IID data settings. This paper proposes a new defense strategy, FedCPA (Federated learning with Critical Parameter Analysis). Our attack-tolerant aggregation method is based on the observation that benign local models have similar sets of top-k and bottom-k critical parameters, whereas poisoned local models do not. Experiments with different attack scenarios on multiple datasets demonstrate that our model outperforms existing defense strategies in defending against poisoning attacks.
    摘要 联邦学习用于以去中心化的方式训练共享模型,客户端之间无需共享私有数据。当恶意客户端向中央服务器发送虚假更新时,联邦学习系统容易受到投毒攻击,而现有防御策略在非独立同分布(non-IID)数据设置下失效。本文提出一种新的防御策略 FedCPA(基于关键参数分析的联邦学习)。我们的攻击容忍聚合方法基于如下观察:良性本地模型具有相似的 top-k 和 bottom-k 关键参数集合,而中毒的本地模型则不具备。在多个数据集上针对不同攻击场景的实验表明,我们的模型在抵御投毒攻击方面优于现有防御策略。
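
A simplified NumPy sketch of FedCPA's core observation: score each client update by the overlap of its top-k/bottom-k critical-parameter index sets with the other clients', then weight aggregation by those scores. The Jaccard-based scoring and weighting rule here are illustrative, not the paper's exact formulation:

```python
import numpy as np

def critical_sets(update: np.ndarray, k: int):
    order = np.argsort(update)
    return set(order[-k:]), set(order[:k])  # top-k and bottom-k parameter indices

def overlap_score(u: np.ndarray, v: np.ndarray, k: int = 10) -> float:
    ut, ub = critical_sets(u, k)
    vt, vb = critical_sets(v, k)
    jac = lambda a, b: len(a & b) / len(a | b)
    return 0.5 * (jac(ut, vt) + jac(ub, vb))

updates = [np.random.randn(100) for _ in range(4)]
updates.append(10 * np.random.randn(100))  # an outlier update, e.g. poisoned

scores = np.array([np.mean([overlap_score(u, v) for v in updates if v is not u])
                   for u in updates])
weights = scores / scores.sum()            # down-weight low-overlap clients
aggregated = sum(w * u for w, u in zip(weights, updates))
print("aggregation weights:", np.round(weights, 3))
```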

Path Signatures for Seizure Forecasting

  • paper_url: http://arxiv.org/abs/2308.09312
  • repo_url: None
  • paper_authors: Jonas F. Haderlein, Andre D. H. Peterson, Parvin Zarei Eskikand, Mark J. Cook, Anthony N. Burkitt, Iven M. Y. Mareels, David B. Grayden
  • for: 预测脑动力系统的状态从观测时间序列数据中
  • methods: 使用现有和新的特征提取算法,包括路径签名,一种新的时间序列分析方法,以及统计分类算法和内置的子集选择来自动发现和评估患者特定的病理特征,以预测发作
  • results: 通过使用这些复杂、非线性特征,对发作预测进行了自动发现和评估,并与基于线性特征进行了比较
    Abstract Forecasting the state of a system from an observed time series is the subject of research in many domains, such as computational neuroscience. Here, the prediction of epileptic seizures from brain measurements is an unresolved problem. There are neither complete models describing underlying brain dynamics, nor do individual patients exhibit a single seizure onset pattern, which complicates the development of a `one-size-fits-all' solution. Based on a longitudinal patient data set, we address the automated discovery and quantification of statistical features (biomarkers) that can be used to forecast seizures in a patient-specific way. We use existing and novel feature extraction algorithms, in particular the path signature, a recent development in time series analysis. Of particular interest is how this set of complex, nonlinear features performs compared to simpler, linear features on this task. Our inference is based on statistical classification algorithms with in-built subset selection to discern time series with and without an impending seizure while selecting only a small number of relevant features. This study may be seen as a step towards a generalisable pattern recognition pipeline for time series in a broader context.
    摘要 根据观测时间序列预测系统状态是计算神经科学等许多领域的研究主题,其中基于脑部测量数据预测癫痫发作仍是一个未解决的问题。目前既没有描述潜在脑动力学的完整模型,患者个体也不存在单一的发作起始模式,这使得"一刀切"式通用方案的开发变得复杂。基于一个纵向患者数据集,我们致力于自动发现并量化可用于患者特异性发作预测的统计特征(生物标志物)。我们使用了现有和新的特征提取算法,特别是路径签名这一时间序列分析领域的新进展,并重点考察这组复杂的非线性特征与更简单的线性特征在该任务上的表现差异。我们的推断基于内置子集选择的统计分类算法,在仅选取少量相关特征的同时,区分是否临近发作的时间序列。这项研究可被视为迈向更广泛情境下可泛化的时间序列模式识别流程的一步。
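
The depth-2 path signature of a time-series window can be computed from discrete increments as iterated integrals; the NumPy sketch below is a minimal version of this feature (dedicated libraries such as iisignature implement higher depths):

```python
import numpy as np

def signature_depth2(path: np.ndarray):
    # path: (T, d) multivariate time-series window
    dX = np.diff(path, axis=0)                  # increments, shape (T-1, d)
    level1 = dX.sum(axis=0)                     # S^(i): total increment per channel
    # Running sum of increments strictly before each step t.
    running = np.vstack([np.zeros(path.shape[1]), np.cumsum(dX, axis=0)[:-1]])
    level2 = running.T @ dX                     # S^(i,j): sum_t X_t^i dX_t^j
    return level1, level2

window = np.cumsum(np.random.randn(100, 3), axis=0)  # synthetic 3-channel window
s1, s2 = signature_depth2(window)
print(s1.shape, s2.shape)  # (3,) (3, 3)
```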

Variance reduction techniques for stochastic proximal point algorithms

  • paper_url: http://arxiv.org/abs/2308.09310
  • repo_url: None
  • paper_authors: Cheik Traoré, Vassilis Apidopoulos, Saverio Salzo, Silvia Villa
  • for: 在有限和最小化问题中,方差缩减技术被广泛用于提升最先进随机梯度方法的性能。
  • methods: 首次研究随机邻近点(proximal point)算法的方差缩减技术,针对光滑凸函数提出了 SVRG、SAGA 及其若干变体的随机邻近版本。
  • results: 给出了迭代点与目标函数值的多种收敛结果,并在 PL 条件下获得线性收敛率。实验表明,邻近方差缩减方法比对应的梯度方法更稳定,尤其是在步长选择方面。
    Abstract In the context of finite sums minimization, variance reduction techniques are widely used to improve the performance of state-of-the-art stochastic gradient methods. Their practical impact is clear, as well as their theoretical properties. Stochastic proximal point algorithms have been studied as an alternative to stochastic gradient algorithms since they are more stable with respect to the choice of the stepsize but a proper variance reduced version is missing. In this work, we propose the first study of variance reduction techniques for stochastic proximal point algorithms. We introduce a stochastic proximal version of SVRG, SAGA, and some of their variants for smooth and convex functions. We provide several convergence results for the iterates and the objective function values. In addition, under the Polyak-{\L}ojasiewicz (PL) condition, we obtain linear convergence rates for the iterates and the function values. Our numerical experiments demonstrate the advantages of the proximal variance reduction methods over their gradient counterparts, especially about the stability with respect to the choice of the step size.
    摘要 在有限和最小化问题中,方差缩减技术被广泛用于提升最先进随机梯度方法的性能,其实际影响与理论性质都十分明确。随机邻近点算法被视为随机梯度算法的替代方案,因为它们对步长的选择更加稳定,但其方差缩减版本一直缺失。在这项工作中,我们首次研究随机邻近点算法的方差缩减技术,针对光滑凸函数引入了 SVRG、SAGA 及其若干变体的随机邻近版本,并给出了迭代点与目标函数值的多种收敛结果。此外,在 Polyak-{\L}ojasiewicz(PL)条件下,我们得到了迭代点与函数值的线性收敛率。数值实验表明,邻近方差缩减方法比其梯度对应方法更加稳定,尤其是在步长选择方面。
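
One plausible variance-reduced stochastic proximal point update, sketched for least squares where the proximal operator of each summand has a closed form; the step size, anchor schedule, and exact update rule are illustrative assumptions rather than the paper's algorithms:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
gamma = 0.1

def prox_fi(v, a, bi, g):
    # prox of g * 0.5*(a^T x - b)^2 at v; rank-one closed form
    return v - g * a * (a @ v - bi) / (1.0 + g * (a @ a))

x = np.zeros(d)
for epoch in range(20):
    x_ref = x.copy()
    full_grad = A.T @ (A @ x_ref - b) / n        # anchor gradient at x_ref
    for _ in range(n):
        i = rng.integers(n)
        grad_i_ref = A[i] * (A[i] @ x_ref - b[i])
        # SVRG-style correction inside a proximal point step
        x = prox_fi(x - gamma * (full_grad - grad_i_ref), A[i], b[i], gamma)

print("residual:", np.linalg.norm(A @ x - b))
```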

Meta-learning enhanced next POI recommendation by leveraging check-ins from auxiliary cities

  • paper_url: http://arxiv.org/abs/2308.09309
  • repo_url: https://github.com/oli-wang/merec
  • paper_authors: Jinze Wang, Lu Zhang, Zhu Sun, Yew-Soon Ong
  • for: 缓解城市级用户历史签到数据稀缺的问题,提升目标城市中用户偏好学习的效果。
  • methods: 提出元学习增强的下一兴趣点推荐框架(MERec)。MERec 在元学习范式中利用不同城市间签到行为的相关性,秉持"对相关性更高的知识给予更多关注"的原则,从更相关的辅助城市迁移知识,以帮助推断目标城市中的用户偏好。
  • results: 大量实验表明,MERec 显著优于现有最先进算法。
    Abstract Most existing point-of-interest (POI) recommenders aim to capture user preference by employing city-level user historical check-ins, thus facilitating users' exploration of the city. However, the scarcity of city-level user check-ins brings a significant challenge to user preference learning. Although prior studies attempt to mitigate this challenge by exploiting various context information, e.g., spatio-temporal information, they ignore to transfer the knowledge (i.e., common behavioral pattern) from other relevant cities (i.e., auxiliary cities). In this paper, we investigate the effect of knowledge distilled from auxiliary cities and thus propose a novel Meta-learning Enhanced next POI Recommendation framework (MERec). The MERec leverages the correlation of check-in behaviors among various cities into the meta-learning paradigm to help infer user preference in the target city, by holding the principle of "paying more attention to more correlated knowledge". Particularly, a city-level correlation strategy is devised to attentively capture common patterns among cities, so as to transfer more relevant knowledge from more correlated cities. Extensive experiments verify the superiority of the proposed MERec against state-of-the-art algorithms.
    摘要 现有的兴趣点(POI)推荐器大多借助城市级的用户历史签到来捕捉用户偏好,从而帮助用户探索城市。然而,城市级签到数据的稀缺给用户偏好学习带来了重大挑战。尽管已有研究尝试利用时空信息等各类上下文信息来缓解这一挑战,但它们忽略了从其他相关城市(即辅助城市)迁移知识(即共性行为模式)。在本文中,我们研究了从辅助城市蒸馏的知识所带来的影响,并据此提出一种新颖的元学习增强的下一兴趣点推荐框架(MERec)。MERec 在元学习范式中利用不同城市间签到行为的相关性来帮助推断目标城市中的用户偏好,秉持"对相关性更高的知识给予更多关注"的原则。特别地,我们设计了一种城市级相关性策略,用于有针对性地捕捉城市间的共性模式,从而从相关性更高的城市迁移更多相关知识。大量实验验证了所提出的 MERec 相对于最先进算法的优越性。

Online Class Incremental Learning on Stochastic Blurry Task Boundary via Mask and Visual Prompt Tuning

  • paper_url: http://arxiv.org/abs/2308.09303
  • repo_url: https://github.com/moonjunyyy/si-blurry
  • paper_authors: Jun-Yeong Moon, Keon-Hee Park, Jung Uk Kim, Gyeong-Moon Park
  • for: This paper focuses on the problem of continual learning in real-world scenarios, where the number of input data and tasks is constantly changing in a statistical way, and proposes a new scenario called Stochastic Incremental Blurry (Si-Blurry) to reflect the stochastic properties of the real-world.
  • methods: The paper introduces a novel method called Mask and Visual Prompt tuning (MVP) to alleviate the inter- and intra-task forgetting issues and class imbalance problem in the Si-Blurry scenario. MVP includes a novel instance-wise logit masking and contrastive visual prompt tuning loss, as well as a new gradient similarity-based focal loss and adaptive feature scaling.
  • results: The paper shows that MVP significantly outperforms the existing state-of-the-art methods in the challenging Si-Blurry scenario through extensive experiments.
    Abstract Continual learning aims to learn a model from a continuous stream of data, but it mainly assumes a fixed number of data and tasks with clear task boundaries. However, in real-world scenarios, the number of input data and tasks is constantly changing in a statistical way, not a static way. Although recently introduced incremental learning scenarios having blurry task boundaries somewhat address the above issues, they still do not fully reflect the statistical properties of real-world situations because of the fixed ratio of disjoint and blurry samples. In this paper, we propose a new Stochastic incremental Blurry task boundary scenario, called Si-Blurry, which reflects the stochastic properties of the real-world. We find that there are two major challenges in the Si-Blurry scenario: (1) inter- and intra-task forgettings and (2) class imbalance problem. To alleviate them, we introduce Mask and Visual Prompt tuning (MVP). In MVP, to address the inter- and intra-task forgetting issues, we propose a novel instance-wise logit masking and contrastive visual prompt tuning loss. Both of them help our model discern the classes to be learned in the current batch. It results in consolidating the previous knowledge. In addition, to alleviate the class imbalance problem, we introduce a new gradient similarity-based focal loss and adaptive feature scaling to ease overfitting to the major classes and underfitting to the minor classes. Extensive experiments show that our proposed MVP significantly outperforms the existing state-of-the-art methods in our challenging Si-Blurry scenario.
    摘要 持续学习旨在从连续的数据流中学习模型,但它通常假设数据量与任务数固定、任务边界清晰。然而,在现实场景中,输入数据和任务的数量以统计方式不断变化,而非静态不变。尽管最近提出的具有模糊任务边界的增量学习场景在一定程度上缓解了上述问题,但由于不相交样本与模糊样本的比例固定,它们仍不能完全反映现实情况的统计特性。在这篇论文中,我们提出一种新的随机增量模糊任务边界场景,称为 Si-Blurry,以反映现实世界的随机特性。我们发现 Si-Blurry 场景中存在两个主要挑战:(1)任务间与任务内遗忘,(2)类别不均衡问题。为了解决这些问题,我们提出掩码与视觉提示调优(MVP)方法。在 MVP 中,为了应对任务间与任务内遗忘,我们提出新颖的实例级 logit 掩码和对比视觉提示调优损失,二者帮助模型辨别当前批次中需要学习的类别,从而巩固先前的知识。此外,为了缓解类别不均衡问题,我们引入一种新的基于梯度相似度的焦点损失和自适应特征缩放,以减轻对多数类的过拟合与对少数类的欠拟合。大量实验表明,我们提出的 MVP 在具有挑战性的 Si-Blurry 场景中显著优于现有最先进方法。
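
The instance-wise logit masking idea can be illustrated as follows: logits of classes outside an allowed set are pushed to negative infinity before the cross-entropy loss, so the model only discriminates among currently relevant classes. The mask construction below is a simplified assumption, not MVP's exact rule:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                  # (batch, num_classes)
labels = torch.tensor([1, 3, 3, 7])
allowed = torch.zeros_like(logits, dtype=torch.bool)
allowed[:, labels.unique()] = True           # e.g., only classes seen in this batch

# Masked-out classes receive zero probability after the softmax.
masked_logits = logits.masked_fill(~allowed, float("-inf"))
loss = F.cross_entropy(masked_logits, labels)
print(loss.item())
```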

Learning Reward Machines through Preference Queries over Sequences

  • paper_url: http://arxiv.org/abs/2308.09301
  • repo_url: None
  • paper_authors: Eric Hsiung, Joydeep Biswas, Swarat Chaudhuri
  • for: 学习任务中的复杂动作序列
  • methods: 使用 preference queries 和 symbolic observation table 等技术
  • results: 具有正确性(correctness)与终止性(termination)保证,能够准确地学习奖励机
    Abstract Reward machines have shown great promise at capturing non-Markovian reward functions for learning tasks that involve complex action sequencing. However, no algorithm currently exists for learning reward machines with realistic weak feedback in the form of preferences. We contribute REMAP, a novel algorithm for learning reward machines from preferences, with correctness and termination guarantees. REMAP introduces preference queries in place of membership queries in the L* algorithm, and leverages a symbolic observation table along with unification and constraint solving to narrow the hypothesis reward machine search space. In addition to the proofs of correctness and termination for REMAP, we present empirical evidence measuring correctness: how frequently the resulting reward machine is isomorphic under a consistent yet inexact teacher, and the regret between the ground truth and learned reward machines.
    摘要 奖励机在捕捉非马尔可夫奖励函数方面展现出巨大潜力,适用于涉及复杂动作序列的学习任务。然而,目前尚不存在能够从偏好这种现实的弱反馈中学习奖励机的算法。我们提出 REMAP,一种从偏好中学习奖励机的新算法,具有正确性与终止性保证。REMAP 在 L* 算法中以偏好查询取代成员查询,并借助符号观察表、合一与约束求解来缩小候选奖励机的搜索空间。除了 REMAP 的正确性与终止性证明外,我们还给出了衡量正确性的实验证据:在一致但不精确的教师下,所得奖励机与真值同构的频率,以及真值奖励机与所学奖励机之间的遗憾。

CARLA: A Self-supervised Contrastive Representation Learning Approach for Time Series Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.09296
  • repo_url: None
  • paper_authors: Zahra Zamanzadeh Darban, Geoffrey I. Webb, Shirui Pan, Mahsa Salehi
  • for: 提出一种自监督的时间序列异常检测方法,可同时检测单变量与多变量时间序列数据中的异常。
  • methods: 采用对比表示学习为时间序列窗口生成稳健的表示:让时间上相近的窗口表示相似、窗口与其对应的异常窗口表示相异,并以自监督方式根据表示空间中的最近/最远邻居对正常与异常表示进行分类。
  • results: 在 7 个标准的真实世界时间序列异常检测基准数据集上,F1 与 AU-PR 均优于现有最先进方法。
    Abstract We introduce a Self-supervised Contrastive Representation Learning Approach for Time Series Anomaly Detection (CARLA), an innovative end-to-end self-supervised framework carefully developed to identify anomalous patterns in both univariate and multivariate time series data. By taking advantage of contrastive representation learning, We introduce an innovative end-to-end self-supervised deep learning framework carefully developed to identify anomalous patterns in both univariate and multivariate time series data. By taking advantage of contrastive representation learning, CARLA effectively generates robust representations for time series windows. It achieves this by 1) learning similar representations for temporally close windows and dissimilar representations for windows and their equivalent anomalous windows and 2) employing a self-supervised approach to classify normal/anomalous representations of windows based on their nearest/furthest neighbours in the representation space. Most of the existing models focus on learning normal behaviour. The normal boundary is often tightly defined, which can result in slight deviations being classified as anomalies, resulting in a high false positive rate and limited ability to generalise normal patterns. CARLA's contrastive learning methodology promotes the production of highly consistent and discriminative predictions, thereby empowering us to adeptly address the inherent challenges associated with anomaly detection in time series data. Through extensive experimentation on 7 standard real-world time series anomaly detection benchmark datasets, CARLA demonstrates F1 and AU-PR superior to existing state-of-the-art results. Our research highlights the immense potential of contrastive representation learning in advancing the field of time series anomaly detection, thus paving the way for novel applications and in-depth exploration in this domain.
    摘要 我们提出一种用于时间序列异常检测的自监督对比表示学习方法(CARLA),这是一个精心设计的端到端自监督深度学习框架,用于识别单变量与多变量时间序列数据中的异常模式。借助对比表示学习,CARLA 能够为时间序列窗口生成稳健的表示,其途径是:(1)让时间上相近的窗口学得相似的表示,而窗口与其对应的异常窗口学得相异的表示;(2)采用自监督方式,根据窗口表示在表示空间中的最近/最远邻居对正常/异常表示进行分类。现有模型大多专注于学习正常行为,正常边界往往定义得过紧,导致轻微偏离即被判为异常,造成较高的误报率,且难以泛化正常模式。CARLA 的对比学习方法促使模型产生高度一致且具判别力的预测,从而能够有效应对时间序列异常检测中的固有挑战。在 7 个标准的真实世界时间序列异常检测基准数据集上的大量实验表明,CARLA 的 F1 与 AU-PR 均优于现有最先进结果。我们的研究凸显了对比表示学习在推进时间序列异常检测领域的巨大潜力,为该领域的新应用与深入探索铺平了道路。
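
A toy version of the pair construction CARLA relies on: temporally adjacent windows serve as positives, while a synthetically corrupted copy of a window serves as its anomalous negative; the triplet loss is a simple stand-in for the paper's contrastive objective:

```python
import numpy as np

def windows(series: np.ndarray, size: int, stride: int):
    return np.stack([series[i:i + size]
                     for i in range(0, len(series) - size + 1, stride)])

series = np.sin(np.linspace(0, 50, 1000)) + 0.1 * np.random.randn(1000)
W = windows(series, size=64, stride=32)

anchor = W[5]
positive = W[6]                    # temporally close window: positive pair
negative = anchor.copy()
negative[20:25] += 5.0             # injected spike: synthetic anomalous window

d = lambda a, b: np.linalg.norm(a - b)
margin = 1.0
triplet_loss = max(0.0, d(anchor, positive) - d(anchor, negative) + margin)
print("triplet loss:", triplet_loss)
```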

How important are specialized transforms in Neural Operators?

  • paper_url: http://arxiv.org/abs/2308.09293
  • repo_url: https://github.com/Ritam-M/LearnableTransformsNO
  • paper_authors: Ritam Majumdar, Shirish Karande, Lovekesh Vig
  • for: 研究基于变换的神经算子在求解偏微分方程(PDE)中的作用,以提升现代工业过程优化的效率。
  • methods: 考察傅里叶神经算子、小波神经算子等基于变换的神经算子,并将其中的变换层替换为可学习的线性层,记录性能上的代价。
  • results: 发现可学习的线性层足以提供与最优变换层相当的性能,且计算时间更短。这一结果可能对神经算子架构的后续研究具有重要意义,并可能揭示这类架构的其他效率来源。
    Abstract Simulating physical systems using Partial Differential Equations (PDEs) has become an indispensible part of modern industrial process optimization. Traditionally, numerical solvers have been used to solve the associated PDEs, however recently Transform-based Neural Operators such as the Fourier Neural Operator and Wavelet Neural Operator have received a lot of attention for their potential to provide fast solutions for systems of PDEs. In this work, we investigate the importance of the transform layers to the reported success of transform based neural operators. In particular, we record the cost in terms of performance, if all the transform layers are replaced by learnable linear layers. Surprisingly, we observe that linear layers suffice to provide performance comparable to the best-known transform-based layers and seem to do so with a compute time advantage as well. We believe that this observation can have significant implications for future work on Neural Operators, and might point to other sources of efficiencies for these architectures.
    摘要 使用偏微分方程(PDE)模拟物理系统已成为现代工业过程优化不可或缺的一部分。传统上,人们用数值求解器求解相关的 PDE;而最近,傅里叶神经算子和小波神经算子等基于变换的神经算子受到了大量关注,因为它们有望为 PDE 系统提供快速求解。在这项工作中,我们考察变换层对这类基于变换的神经算子已报道的成功有多重要。具体而言,我们记录了将所有变换层替换为可学习线性层所带来的性能代价。令人意外的是,我们观察到线性层足以提供与已知最优变换层相当的性能,且似乎还具有计算时间上的优势。我们认为这一观察可能对神经算子的后续研究产生重要影响,并可能指向这类架构的其他效率来源。
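
The comparison at the heart of the paper can be sketched as a 1D spectral (Fourier) layer next to the learnable linear layer that replaces it; the grid size, mode count, and initialization below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Fourier1d(nn.Module):
    """Keep the first `modes` Fourier modes and scale them with learned weights."""
    def __init__(self, grid: int, modes: int):
        super().__init__()
        self.modes = modes
        self.weight = nn.Parameter(torch.randn(modes, dtype=torch.cfloat) * 0.02)

    def forward(self, x):                     # x: (batch, grid)
        xf = torch.fft.rfft(x, dim=-1)
        out = torch.zeros_like(xf)
        out[:, :self.modes] = xf[:, :self.modes] * self.weight
        return torch.fft.irfft(out, n=x.shape[-1], dim=-1)

grid = 64
spectral = Fourier1d(grid, modes=16)
linear = nn.Linear(grid, grid)                # the drop-in replacement studied
x = torch.randn(8, grid)
print(spectral(x).shape, linear(x).shape)     # both (8, 64)
```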

Graph-based Alignment and Uniformity for Recommendation

  • paper_url: http://arxiv.org/abs/2308.09292
  • repo_url: https://github.com/yangliangwei/graphau
  • paper_authors: Liangwei Yang, Zhiwei Liu, Chen Wang, Mingdai Yang, Xiaolong Liu, Jing Ma, Philip S. Yu
  • for: 解决基于协同过滤的推荐系统(RecSys)在学习用户与物品表示时面临的稀疏性问题。
  • methods: 提出一种名为 GraphAU 的新方法,显式考虑用户-物品二部图中的高阶连通性,借助邻域聚合器将用户/物品嵌入与高阶邻居的稠密向量表示对齐,无需逐一计算对高阶邻域的繁重对齐;并通过逐层对齐池化模块整合各层的对齐损失。
  • results: GraphAU 显著缓解了稀疏性问题,并在四个数据集上取得了最先进性能。
    Abstract Collaborative filtering-based recommender systems (RecSys) rely on learning representations for users and items to predict preferences accurately. Representation learning on the hypersphere is a promising approach due to its desirable properties, such as alignment and uniformity. However, the sparsity issue arises when it encounters RecSys. To address this issue, we propose a novel approach, graph-based alignment and uniformity (GraphAU), that explicitly considers high-order connectivities in the user-item bipartite graph. GraphAU aligns the user/item embedding to the dense vector representations of high-order neighbors using a neighborhood aggregator, eliminating the need to compute the burdensome alignment to high-order neighborhoods individually. To address the discrepancy in alignment losses, GraphAU includes a layer-wise alignment pooling module to integrate alignment losses layer-wise. Experiments on four datasets show that GraphAU significantly alleviates the sparsity issue and achieves state-of-the-art performance. We open-source GraphAU at https://github.com/YangLiangwei/GraphAU.
    摘要 基于协同过滤的推荐系统(RecSys)依赖学习用户与物品的表示来准确预测偏好。在超球面上进行表示学习是一种有前景的途径,因为它具有对齐性与均匀性等理想性质;但在推荐系统中,稀疏性问题随之出现。为解决这一问题,我们提出一种新方法——基于图的对齐与均匀性(GraphAU),它显式考虑用户-物品二部图中的高阶连通性。GraphAU 借助邻域聚合器将用户/物品嵌入与高阶邻居的稠密向量表示对齐,免去了逐一计算对高阶邻域的繁重对齐。为解决各层对齐损失的差异,GraphAU 还包含一个逐层对齐池化模块来整合各层的对齐损失。在四个数据集上的实验表明,GraphAU 显著缓解了稀疏性问题,并取得了最先进性能。我们已在 https://github.com/YangLiangwei/GraphAU 开源 GraphAU。
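
GraphAU builds on the classical alignment and uniformity objectives over L2-normalized embeddings; the sketch below shows these two losses in their standard form (not GraphAU's high-order, layer-wise graph version):

```python
import torch
import torch.nn.functional as F

def alignment(user_emb, item_emb):
    # Positive user-item pairs should sit close together on the hypersphere.
    u, i = F.normalize(user_emb, dim=-1), F.normalize(item_emb, dim=-1)
    return (u - i).norm(dim=1).pow(2).mean()

def uniformity(emb, t=2.0):
    # Embeddings should spread out uniformly over the hypersphere.
    z = F.normalize(emb, dim=-1)
    return torch.pdist(z, p=2).pow(2).mul(-t).exp().mean().log()

users, items = torch.randn(128, 64), torch.randn(128, 64)
print(alignment(users, items).item(), uniformity(items).item())
```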

HyperLoRA for PDEs

  • paper_url: http://arxiv.org/abs/2308.09290
  • repo_url: None
  • paper_authors: Ritam Majumdar, Vishal Jadhav, Anirudh Deodhar, Shirish Karande, Lovekesh Vig, Venkataramana Runkana
  • for: develop neural surrogates for solutions of Partial Differential Equations
  • methods: Hypernetworks, low-ranked adaptation (LoRA)
  • results: 8x reduction in prediction parameters on average without compromising on accuracy; improved generalization capabilities for parameterized PDEs like Burgers' equation and Navier-Stokes: Kovasznay flow.
    Abstract Physics-informed neural networks (PINNs) have been widely used to develop neural surrogates for solutions of Partial Differential Equations. A drawback of PINNs is that they have to be retrained with every change in initial-boundary conditions and PDE coefficients. The Hypernetwork, a model-based meta learning technique, takes in a parameterized task embedding as input and predicts the weights of PINN as output. Predicting weights of a neural network however, is a high-dimensional regression problem, and hypernetworks perform sub-optimally while predicting parameters for large base networks. To circumvent this issue, we use a low ranked adaptation (LoRA) formulation to decompose every layer of the base network into low-ranked tensors and use hypernetworks to predict the low-ranked tensors. Despite the reduced dimensionality of the resulting weight-regression problem, LoRA-based Hypernetworks violate the underlying physics of the given task. We demonstrate that the generalization capabilities of LoRA-based hypernetworks drastically improve when trained with an additional physics-informed loss component (HyperPINN) to satisfy the governing differential equations. We observe that LoRA-based HyperPINN training allows us to learn fast solutions for parameterized PDEs like Burger's equation and Navier Stokes: Kovasznay flow, while having an 8x reduction in prediction parameters on average without compromising on accuracy when compared to all other baselines.
    摘要 物理信息神经网络(PINN)已被广泛用于构建偏微分方程解的神经代理模型。PINN 的缺点在于,初始/边界条件和 PDE 系数每次变化时都需要重新训练。超网络(Hypernetwork)是一种基于模型的元学习技术,它以参数化的任务嵌入为输入,输出 PINN 的权重。然而,预测神经网络的权重是一个高维回归问题,超网络在为较大的基础网络预测参数时表现欠佳。为绕开这一问题,我们采用低秩适配(LoRA)的形式,将基础网络的每一层分解为低秩张量,并用超网络预测这些低秩张量。尽管由此得到的权重回归问题维度降低了,但基于 LoRA 的超网络会违背给定任务的底层物理规律。我们证明,当训练中加入满足控制微分方程的物理信息损失项(HyperPINN)时,基于 LoRA 的超网络的泛化能力会得到大幅提升。我们观察到,基于 LoRA 的 HyperPINN 训练使我们能够快速求解 Burgers 方程和 Navier-Stokes 方程(Kovasznay 流)等参数化 PDE,在不损失精度的情况下,预测参数平均减少 8 倍,优于所有其他基线。
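
A minimal sketch of the hypernetwork-predicts-LoRA idea: a small MLP maps a PDE task embedding to low-rank factors A and B, and the layer acts with W0 + AB. All sizes are assumed, and a physics-informed loss term would be added on top during HyperPINN training:

```python
import torch
import torch.nn as nn

d_in, d_out, rank, task_dim = 32, 32, 4, 8

class HyperLoRALayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.W0 = nn.Parameter(torch.randn(d_out, d_in) * 0.02)  # base weight (frozen in practice)
        self.hyper = nn.Sequential(                              # task embedding -> LoRA factors
            nn.Linear(task_dim, 64), nn.ReLU(),
            nn.Linear(64, rank * (d_in + d_out)),
        )

    def forward(self, x, task):
        ab = self.hyper(task)
        A = ab[: rank * d_out].view(d_out, rank)
        B = ab[rank * d_out:].view(rank, d_in)
        return x @ (self.W0 + A @ B).T           # low-rank adapted layer

layer = HyperLoRALayer()
y = layer(torch.randn(16, d_in), torch.randn(task_dim))
print(y.shape)  # torch.Size([16, 32])
```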

A hybrid Decoder-DeepONet operator regression framework for unaligned observation data

  • paper_url: http://arxiv.org/abs/2308.09274
  • repo_url: https://github.com/cb-sjtu/decoder_deeponet
  • paper_authors: Bo Chen, Chenyu Wang, Weipeng Li, Haiyang Fu
  • for: 解决深度神经算子(DNO)在逼近函数空间之间的非线性映射时,面对非对齐观测数据所带来的维度增加与计算成本问题。
  • methods: 提出一种混合 Decoder-DeepONet 算子回归框架,并进一步提出以训练数据的平均场作为输入增强的 Multi-Decoder-DeepONet;二者均基于万能逼近定理,与算子逼近理论保持一致。
  • results: Darcy 问题与翼型绕流两个数值算例验证了所提方法的效率与精度,结果显示 Decoder-DeepONet 与 Multi-Decoder-DeepONet 在处理非对齐观测数据方面具有优势,并有潜力提升预测精度。
    Abstract Deep neural operators (DNOs) have been utilized to approximate nonlinear mappings between function spaces. However, DNOs face the challenge of increased dimensionality and computational cost associated with unaligned observation data. In this study, we propose a hybrid Decoder-DeepONet operator regression framework to handle unaligned data effectively. Additionally, we introduce a Multi-Decoder-DeepONet, which utilizes an average field of training data as input augmentation. The consistencies of the frameworks with the operator approximation theory are provided, on the basis of the universal approximation theorem. Two numerical experiments, Darcy problem and flow-field around an airfoil, are conducted to validate the efficiency and accuracy of the proposed methods. Results illustrate the advantages of Decoder-DeepONet and Multi-Decoder-DeepONet in handling unaligned observation data and showcase their potentials in improving prediction accuracy.
    摘要 深度神经算子(DNO)已被用于逼近函数空间之间的非线性映射。然而,面对非对齐的观测数据,DNO 面临维度增加与计算成本上升的挑战。在这项研究中,我们提出一种混合 Decoder-DeepONet 算子回归框架,以有效处理非对齐数据。此外,我们还提出 Multi-Decoder-DeepONet,它以训练数据的平均场作为输入增强。基于万能逼近定理,我们给出了这两个框架与算子逼近理论的一致性。我们在 Darcy 问题和翼型绕流两个数值算例上验证了所提方法的效率与精度。结果表明,Decoder-DeepONet 与 Multi-Decoder-DeepONet 在处理非对齐观测数据方面具有优势,并展示了其提升预测精度的潜力。

Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model

  • paper_url: http://arxiv.org/abs/2308.09262
  • repo_url: None
  • paper_authors: Ryandhimas E. Zezario, Bo-Ren Brian Bai, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao
  • for: This paper proposes a multi-task pseudo-label (MPL) learning approach for a non-intrusive speech quality assessment model.
  • methods: The MPL approach consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning. The model is optimized using a Huber loss function.
  • results: The proposed MPL approach outperforms training the model from scratch and using knowledge transfer mechanisms. Additionally, the use of Huber loss improves the prediction capabilities of the model.
    Abstract This study introduces multi-task pseudo-label (MPL) learning for a non-intrusive speech quality assessment model. MPL consists of two stages which are obtaining pseudo-label scores from a pretrained model and performing multi-task learning. The 3QUEST metrics, namely Speech-MOS (S-MOS), Noise-MOS (N-MOS), and General-MOS (G-MOS) are selected as the primary ground-truth labels. Additionally, the pretrained MOSA-Net model is utilized to estimate three pseudo-labels: perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI). Multi-task learning stage of MPL is then employed to train the MTQ-Net model (multi-target speech quality assessment network). The model is optimized by incorporating Loss supervision (derived from the difference between the estimated score and the real ground-truth labels) and Loss semi-supervision (derived from the difference between the estimated score and pseudo-labels), where Huber loss is employed to calculate the loss function. Experimental results first demonstrate the advantages of MPL compared to training the model from scratch and using knowledge transfer mechanisms. Secondly, the benefits of Huber Loss in improving the prediction model of MTQ-Net are verified. Finally, the MTQ-Net with the MPL approach exhibits higher overall prediction capabilities when compared to other SSL-based speech assessment models.
    摘要 本研究提出一种多任务伪标签(MPL)学习方法,用于非侵入式语音质量评估模型。MPL 分为两个阶段:先利用预训练的 MOSA-Net 模型估计 PESQ、STOI 和 SDI 三种伪标签分数,再进行多任务学习。多任务学习阶段使用估计的伪标签与真实标签(S-MOS、N-MOS、G-MOS)共同训练 MTQ-Net(多目标语音质量评估网络),损失函数同时包含监督损失(估计分数与真实标签之差)和半监督损失(估计分数与伪标签之差),并采用 Huber 损失进行计算。实验结果首先表明,MPL 方法优于从零训练和知识迁移机制;其次验证了 Huber 损失对提升 MTQ-Net 预测能力的益处;最后,与其他基于自监督学习的语音评估模型相比,采用 MPL 的 MTQ-Net 具有更高的整体预测能力。
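
The MPL training objective can be sketched as two Huber losses, one against ground-truth MOS labels and one against pseudo-labels from the pretrained assessor; the weighting and delta below are illustrative choices:

```python
import torch
import torch.nn.functional as F

pred = torch.randn(8, requires_grad=True)   # model's quality estimates
ground_truth = torch.rand(8) * 4 + 1        # e.g., S-MOS scores in [1, 5]
pseudo = torch.rand(8) * 4 + 1              # e.g., PESQ-derived pseudo-labels

loss_sup = F.huber_loss(pred, ground_truth, delta=1.0)   # Loss supervision
loss_semi = F.huber_loss(pred, pseudo, delta=1.0)        # Loss semi-supervision
loss = loss_sup + 0.5 * loss_semi           # hypothetical weighting
loss.backward()
print(loss.item())
```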

Distribution shift mitigation at test time with performance guarantees

  • paper_url: http://arxiv.org/abs/2308.09259
  • repo_url: None
  • paper_authors: Rui Ding, Jielong Yang, Feng Ji, Xionghu Zhong, Linbo Xie
  • For: The paper aims to address the challenge of distribution shift in Graph Neural Networks (GNNs), which can negatively impact the test performance of the model.* Methods: The proposed framework, called FR-GNN, constructs a mapping relationship between the output and input of a well-trained GNN to obtain class representative embeddings and then uses these embeddings to reconstruct the features of labeled nodes. The reconstructed features are then incorporated into the message passing mechanism of GNNs to influence the predictions of unlabeled nodes at test time.* Results: The paper shows that the proposed FR-GNN framework can effectively reduce the distribution shift and improve the test performance of GNNs without modifying the model structure or parameters. The experimental results demonstrate the superior performance of FR-GNN in comparison to mainstream methods on various public datasets.
    Abstract Due to inappropriate sample selection and limited training data, a distribution shift often exists between the training and test sets. This shift can adversely affect the test performance of Graph Neural Networks (GNNs). Existing approaches mitigate this issue by either enhancing the robustness of GNNs to distribution shift or reducing the shift itself. However, both approaches necessitate retraining the model, which becomes unfeasible when the model structure and parameters are inaccessible. To address this challenge, we propose FR-GNN, a general framework for GNNs to conduct feature reconstruction. FRGNN constructs a mapping relationship between the output and input of a well-trained GNN to obtain class representative embeddings and then uses these embeddings to reconstruct the features of labeled nodes. These reconstructed features are then incorporated into the message passing mechanism of GNNs to influence the predictions of unlabeled nodes at test time. Notably, the reconstructed node features can be directly utilized for testing the well-trained model, effectively reducing the distribution shift and leading to improved test performance. This remarkable achievement is attained without any modifications to the model structure or parameters. We provide theoretical guarantees for the effectiveness of our framework. Furthermore, we conduct comprehensive experiments on various public datasets. The experimental results demonstrate the superior performance of FRGNN in comparison to mainstream methods.
    摘要 由于样本选择不当和训练数据有限,训练集与测试集之间往往存在分布偏移,这会损害图神经网络(GNN)的测试性能。现有方法要么增强 GNN 对分布偏移的鲁棒性,要么减小偏移本身,但两者都需要重新训练模型,在模型结构和参数不可获取时无法适用。为此,我们提出 FR-GNN,一个用于 GNN 特征重构的通用框架。FR-GNN 构建训练好的 GNN 的输出与输入之间的映射关系,以获得类代表性嵌入,再利用这些嵌入重构有标签节点的特征;重构后的特征被纳入 GNN 的消息传递机制,在测试时影响无标签节点的预测。值得注意的是,重构的节点特征可直接用于测试训练好的模型,有效减小分布偏移并提升测试性能,而无需修改模型结构或参数。我们为该框架的有效性提供了理论保证,并在多个公开数据集上进行了全面实验,结果表明 FR-GNN 优于主流方法。

Capacity Bounds for Hyperbolic Neural Network Representations of Latent Tree Structures

  • paper_url: http://arxiv.org/abs/2308.09250
  • repo_url: None
  • paper_authors: Anastasis Kratsios, Ruiyang Hong, Haitz Sáez de Ocáriz Borde
  • for: 研究具有 ReLU 激活函数的深度双曲神经网络(HNN)的表示能力。
  • methods: 首次证明 HNN 能以任意 $\varepsilon>1$($\varepsilon=1$ 为最优)将任意有限加权树 $\varepsilon$-等距嵌入到维度 $d\ge 2$、截面曲率 $\kappa<0$ 的双曲空间中。
  • results: 给出了实现该嵌入的 HNN 网络复杂度的严格上界,并发现网络复杂度与表示保真度/失真无关。此外,还证明任意 ReLU 多层感知机(MLP)将具有 $L>2^d$ 个叶子的树嵌入到 $d$ 维欧氏空间时,必然产生至少 $\Omega(L^{1/d})$ 的失真,且该下界与 MLP 的深度、宽度和(可能不连续的)激活函数无关。
    Abstract We study the representation capacity of deep hyperbolic neural networks (HNNs) with a ReLU activation function. We establish the first proof that HNNs can $\varepsilon$-isometrically embed any finite weighted tree into a hyperbolic space of dimension $d$ at least equal to $2$ with prescribed sectional curvature $\kappa<0$, for any $\varepsilon> 1$ (where $\varepsilon=1$ being optimal). We establish rigorous upper bounds for the network complexity on an HNN implementing the embedding. We find that the network complexity of HNN implementing the graph representation is independent of the representation fidelity/distortion. We contrast this result against our lower bounds on distortion which any ReLU multi-layer perceptron (MLP) must exert when embedding a tree with $L>2^d$ leaves into a $d$-dimensional Euclidean space, which we show at least $\Omega(L^{1/d})$; independently of the depth, width, and (possibly discontinuous) activation function defining the MLP.
    摘要 我们研究具有 ReLU 激活函数的深度双曲神经网络(HNN)的表示能力。我们首次证明,对任意 $\varepsilon>1$(其中 $\varepsilon=1$ 为最优),HNN 可以将任意有限加权树 $\varepsilon$-等距嵌入到维度至少为 $2$、给定截面曲率 $\kappa<0$ 的双曲空间中。我们为实现该嵌入的 HNN 给出了网络复杂度的严格上界,并发现实现图表示的 HNN 的网络复杂度与表示保真度/失真无关。与此相对,我们给出了失真的下界:任意 ReLU 多层感知机(MLP)将具有 $L>2^d$ 个叶子的树嵌入到 $d$ 维欧氏空间时,必然产生至少 $\Omega(L^{1/d})$ 的失真,且该下界与定义 MLP 的深度、宽度和(可能不连续的)激活函数无关。

Active and Passive Causal Inference Learning

  • paper_url: http://arxiv.org/abs/2308.09248
  • repo_url: None
  • paper_authors: Daniel Jiwoong Im, Kyunghyun Cho
  • for: 这篇论文是为机器学习研究者、工程师和学生准备的一个入门课程,旨在介绍 causal inference 的基本概念和技术。
  • methods: 论文从一系列重要的假设出发,如交换性、正性、一致性和不干扰,然后建立了一系列重要的 causal inference 技术,分为两个杯子:活跃和被动两类。
  • results: 论文介绍了一些常见的 causal inference 技术,包括随机控制试验和短剑算法,以及经典的匹配和反射权重等方法。通过完善一些 causal inference 的缺失方面,如卷积风险、聚合风险等,论文预期能为读者提供一个多样化的入门点,进一步研究和探索 causal inference 领域。
    Abstract This paper serves as a starting point for machine learning researchers, engineers and students who are interested in but not yet familiar with causal inference. We start by laying out an important set of assumptions that are collectively needed for causal identification, such as exchangeability, positivity, consistency and the absence of interference. From these assumptions, we build out a set of important causal inference techniques, which we do so by categorizing them into two buckets; active and passive approaches. We describe and discuss randomized controlled trials and bandit-based approaches from the active category. We then describe classical approaches, such as matching and inverse probability weighting, in the passive category, followed by more recent deep learning based algorithms. By finishing the paper with some of the missing aspects of causal inference from this paper, such as collider biases, we expect this paper to provide readers with a diverse set of starting points for further reading and research in causal inference and discovery.
    摘要 这篇论文可作为对因果推断感兴趣但尚不熟悉的机器学习研究者、工程师和学生的入门文献。我们首先铺陈因果识别所需的一组重要假设,包括可交换性、正值性、一致性和无干扰。基于这些假设,我们梳理出一系列重要的因果推断技术,并将其分为主动与被动两大类。在主动类中,我们描述并讨论随机对照试验和基于赌博机(bandit)的方法;在被动类中,我们先介绍匹配和逆概率加权等经典方法,再介绍较新的基于深度学习的算法。文章最后指出了本文未涵盖的因果推断内容,如碰撞偏倚,期望以此为读者提供多样化的起点,便于进一步阅读和研究因果推断与因果发现。
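
As an example of the passive approaches mentioned above, inverse probability weighting estimates the average treatment effect by reweighting outcomes with estimated propensities; this is the standard textbook estimator, run on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 2))
propensity = 1 / (1 + np.exp(-(x[:, 0] - 0.5 * x[:, 1])))  # true P(T=1 | x)
t = rng.binomial(1, propensity)
y = 2.0 * t + x[:, 0] + rng.normal(size=n)                 # true ATE = 2.0

# Estimate propensities, then apply the Horvitz-Thompson IPW estimator.
e = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]
ate_ipw = np.mean(t * y / e - (1 - t) * y / (1 - e))
print(f"IPW ATE estimate: {ate_ipw:.3f}")
```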

A Robust Policy Bootstrapping Algorithm for Multi-objective Reinforcement Learning in Non-stationary Environments

  • paper_url: http://arxiv.org/abs/2308.09734
  • repo_url: None
  • paper_authors: Sherif Abdelfattah, Kathryn Kasmarik, Jiankun Hu
  • for: 解决多目标强化学习在非平稳环境中的适应问题
  • methods: 发展式优化方法,以及一种新型多目标强化学习算法
  • results: 在非平稳环境中显著优于现有算法,在平稳环境中取得相当的结果。
    Abstract Multi-objective Markov decision processes are a special kind of multi-objective optimization problem that involves sequential decision making while satisfying the Markov property of stochastic processes. Multi-objective reinforcement learning methods address this problem by fusing the reinforcement learning paradigm with multi-objective optimization techniques. One major drawback of these methods is the lack of adaptability to non-stationary dynamics in the environment. This is because they adopt optimization procedures that assume stationarity to evolve a coverage set of policies that can solve the problem. This paper introduces a developmental optimization approach that can evolve the policy coverage set while exploring the preference space over the defined objectives in an online manner. We propose a novel multi-objective reinforcement learning algorithm that can robustly evolve a convex coverage set of policies in an online manner in non-stationary environments. We compare the proposed algorithm with two state-of-the-art multi-objective reinforcement learning algorithms in stationary and non-stationary environments. Results showed that the proposed algorithm significantly outperforms the existing algorithms in non-stationary environments while achieving comparable results in stationary environments.
    摘要 多目标马尔可夫决策过程是一类特殊的多目标优化问题,涉及顺序决策并满足随机过程的马尔可夫性。多目标强化学习方法将强化学习范式与多目标优化技术相结合来求解此类问题,但其一大缺陷是难以适应环境中的非平稳动态:这些方法采用假设平稳性的优化过程来演化能够求解问题的策略覆盖集。本文提出一种发展式优化方法,能够在线地探索既定目标上的偏好空间,同时演化策略覆盖集。我们提出一种新的多目标强化学习算法,可在非平稳环境中在线、稳健地演化凸策略覆盖集。我们在平稳与非平稳环境中将该算法与两种最先进的多目标强化学习算法进行了比较,结果表明所提算法在非平稳环境中显著优于现有算法,在平稳环境中则取得相当的结果。

Intrinsically Motivated Hierarchical Policy Learning in Multi-objective Markov Decision Processes

  • paper_url: http://arxiv.org/abs/2308.09733
  • repo_url: None
  • paper_authors: Sherif Abdelfattah, Kathryn Merrick, Jiankun Hu
  • for: 解决多目标马尔可夫决策过程中多个相互冲突的奖励函数无法同时优化的问题,需要一个能满足所有可能偏好的策略覆盖集。
  • methods: 采用内在激励强化学习方法,通过学习一个通用技能集来引导策略覆盖集的演化,从而实现持续学习过程。
  • results: 提出一种新颖的两阶段内在激励强化学习方法:第一阶段学习通用技能集,第二阶段在环境动态每次变化时利用该技能集引导策略覆盖集的演化;实验显示该方法显著优于现有的多目标强化学习方法。
    Abstract Multi-objective Markov decision processes are sequential decision-making problems that involve multiple conflicting reward functions that cannot be optimized simultaneously without a compromise. This type of problems cannot be solved by a single optimal policy as in the conventional case. Alternatively, multi-objective reinforcement learning methods evolve a coverage set of optimal policies that can satisfy all possible preferences in solving the problem. However, many of these methods cannot generalize their coverage sets to work in non-stationary environments. In these environments, the parameters of the state transition and reward distribution vary over time. This limitation results in significant performance degradation for the evolved policy sets. In order to overcome this limitation, there is a need to learn a generic skill set that can bootstrap the evolution of the policy coverage set for each shift in the environment dynamics therefore, it can facilitate a continuous learning process. In this work, intrinsically motivated reinforcement learning has been successfully deployed to evolve generic skill sets for learning hierarchical policies to solve multi-objective Markov decision processes. We propose a novel dual-phase intrinsically motivated reinforcement learning method to address this limitation. In the first phase, a generic set of skills is learned. While in the second phase, this set is used to bootstrap policy coverage sets for each shift in the environment dynamics. We show experimentally that the proposed method significantly outperforms state-of-the-art multi-objective reinforcement methods in a dynamic robotics environment.
    摘要 多目标马尔可夫决策过程是一类序贯决策问题,涉及多个相互冲突、无法在不妥协的前提下同时优化的奖励函数。这类问题不能像传统情形那样由单一最优策略解决,多目标强化学习方法转而演化一个能满足所有可能偏好的最优策略覆盖集。然而,许多此类方法无法将其覆盖集泛化到非平稳环境:在这些环境下,状态转移和奖励分布的参数会随时间变化,导致已演化的策略集性能显著下降。为克服这一局限,需要学习一个通用技能集,在环境动态每次变化时引导策略覆盖集的演化,从而支持持续学习过程。本工作成功地利用内在激励强化学习来演化通用技能集,以学习求解多目标马尔可夫决策过程的分层策略。我们提出一种新颖的两阶段内在激励强化学习方法:第一阶段学习通用技能集;第二阶段在环境动态每次变化时利用该技能集引导策略覆盖集的演化。实验表明,所提方法在动态机器人环境中显著优于最先进的多目标强化学习方法。

Generalized Sum Pooling for Metric Learning

  • paper_url: http://arxiv.org/abs/2308.09228
  • repo_url: https://github.com/yetigurbuz/generalized-sum-pooling
  • paper_authors: Yeti Z. Gurbuz, Ozan Sener, A. Aydın Alatan
  • for: 提出一种可学习的广义求和池化方法(Generalized Sum Pooling,GSP),用于替代深度度量学习中核心的全局平均池化(GAP)阶段。
  • methods: 将池化建模为一个熵平滑的最优传输问题,并证明其是 GAP 的严格推广;同时提出一种零样本(zero-shot)损失以帮助 GSP 的学习。
  • results: GSP 能够选择语义实体子集、有效忽略干扰信息,并学习各实体的重要性权重;在 4 个常用度量学习基准上的大量评估验证了该方法的有效性。代码见 GSP-DML Framework。
    Abstract A common architectural choice for deep metric learning is a convolutional neural network followed by global average pooling (GAP). Albeit simple, GAP is a highly effective way to aggregate information. One possible explanation for the effectiveness of GAP is considering each feature vector as representing a different semantic entity and GAP as a convex combination of them. Following this perspective, we generalize GAP and propose a learnable generalized sum pooling method (GSP). GSP improves GAP with two distinct abilities: i) the ability to choose a subset of semantic entities, effectively learning to ignore nuisance information, and ii) learning the weights corresponding to the importance of each entity. Formally, we propose an entropy-smoothed optimal transport problem and show that it is a strict generalization of GAP, i.e., a specific realization of the problem gives back GAP. We show that this optimization problem enjoys analytical gradients enabling us to use it as a direct learnable replacement for GAP. We further propose a zero-shot loss to ease the learning of GSP. We show the effectiveness of our method with extensive evaluations on 4 popular metric learning benchmarks. Code is available at: GSP-DML Framework
    摘要 深度度量学习中常见的架构选择是卷积神经网络后接全局平均池化(GAP)。GAP 虽然简单,却是一种非常有效的信息聚合方式。一种可能的解释是:把每个特征向量视为代表不同的语义实体,而 GAP 则是它们的凸组合。基于这一视角,我们推广 GAP,提出一种可学习的广义求和池化方法(GSP)。GSP 在两方面改进了 GAP:一是能够选择语义实体的子集,从而学会忽略干扰信息;二是能够学习对应各实体重要性的权重。形式化地,我们提出一个熵平滑的最优传输问题,并证明它是 GAP 的严格推广,即该问题的某个特定实现可以还原出 GAP。我们证明该优化问题具有解析梯度,因而可以作为 GAP 的直接可学习替代。我们还提出一种零样本损失以便于 GSP 的学习。在 4 个常用度量学习基准上的大量评估验证了我们方法的有效性。代码见:GSP-DML Framework。
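
To make the GAP-vs-GSP contrast concrete, the sketch below places global average pooling next to a simplified learnable weighted pooling in which a score head produces softmax weights; this softmax weighting is a light stand-in for GSP's entropy-smoothed optimal-transport weights (which can additionally drop entities entirely):

```python
import torch
import torch.nn as nn

class SoftWeightedPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # importance score per local feature

    def forward(self, feats):            # feats: (B, N, D) local features
        w = torch.softmax(self.score(feats), dim=1)
        return (w * feats).sum(dim=1)    # learned convex combination of entities

feats = torch.randn(4, 49, 512)          # e.g., a 7x7 conv feature map, flattened
gap = feats.mean(dim=1)                  # global average pooling baseline
gsp_like = SoftWeightedPool(512)(feats)  # learnable weighted pooling
print(gap.shape, gsp_like.shape)         # both (4, 512)
```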

Advancing Relation Extraction through Language Probing with Exemplars from Set Co-Expansion

  • paper_url: http://arxiv.org/abs/2308.11720
  • repo_url: None
  • paper_authors: Yerong Li, Roxana Girju
  • for: This paper focuses on improving relation extraction accuracy and reducing confusion between contrastive classes.
  • methods: The proposed approach uses representative examples and co-set expansion, incorporating similarity measures between target pairs and representative pairs from the target class. Contextual details are harnessed via context-free Hearst patterns to ascertain contextual similarity.
  • results: The co-set expansion approach significantly enhances relation classification performance, achieving an observed margin of at least 1 percent improvement in accuracy in most settings. Tuning contrastive examples further refines the approach, reducing confusion between classes sharing similarities and leading to more precise classification.
    Abstract Relation Extraction (RE) is a pivotal task in automatically extracting structured information from unstructured text. In this paper, we present a multi-faceted approach that integrates representative examples and through co-set expansion. The primary goal of our method is to enhance relation classification accuracy and mitigating confusion between contrastive classes. Our approach begins by seeding each relationship class with representative examples. Subsequently, our co-set expansion algorithm enriches training objectives by incorporating similarity measures between target pairs and representative pairs from the target class. Moreover, the co-set expansion process involves a class ranking procedure that takes into account exemplars from contrastive classes. Contextual details encompassing relation mentions are harnessed via context-free Hearst patterns to ascertain contextual similarity. Empirical evaluation demonstrates the efficacy of our co-set expansion approach, resulting in a significant enhancement of relation classification performance. Our method achieves an observed margin of at least 1 percent improvement in accuracy in most settings, on top of existing fine-tuning approaches. To further refine our approach, we conduct an in-depth analysis that focuses on tuning contrastive examples. This strategic selection and tuning effectively reduce confusion between classes sharing similarities, leading to a more precise classification process. Experimental results underscore the effectiveness of our proposed framework for relation extraction. The synergy between co-set expansion and context-aware prompt tuning substantially contributes to improved classification accuracy. Furthermore, the reduction in confusion between contrastive classes through contrastive examples tuning validates the robustness and reliability of our method.

DMCVR: Morphology-Guided Diffusion Model for 3D Cardiac Volume Reconstruction

  • paper_url: http://arxiv.org/abs/2308.09223
  • repo_url: https://github.com/hexiaoxiao-cs/dmcvr
  • paper_authors: Xiaoxiao He, Chaowei Tan, Ligong Han, Bo Liu, Leon Axel, Kang Li, Dimitris N. Metaxas
  • for: Accurate 3D cardiac reconstruction to improve cardiovascular disease diagnosis and treatment planning.
  • methods: A morphology-guided diffusion model (DMCVR) that synthesizes high-resolution 2D cine cardiac MRI slices together with the corresponding 3D reconstructed volumes, conditioning the generative model on cardiac morphology.
  • results: Outperforms previous approaches in both 2D generation and 3D reconstruction, producing high-resolution 3D cardiac MRI reconstructions that surpass the current state of the art.
    Abstract Accurate 3D cardiac reconstruction from cine magnetic resonance imaging (cMRI) is crucial for improved cardiovascular disease diagnosis and understanding of the heart's motion. However, current cardiac MRI-based reconstruction technology used in clinical settings is 2D with limited through-plane resolution, resulting in low-quality reconstructed cardiac volumes. To better reconstruct 3D cardiac volumes from sparse 2D image stacks, we propose a morphology-guided diffusion model for 3D cardiac volume reconstruction, DMCVR, that synthesizes high-resolution 2D images and corresponding 3D reconstructed volumes. Our method outperforms previous approaches by conditioning the cardiac morphology on the generative model, eliminating the time-consuming iterative optimization process of the latent code, and improving generation quality. The learned latent spaces provide global semantics, local cardiac morphology and details of each 2D cMRI slice with highly interpretable value to reconstruct 3D cardiac shape. Our experiments show that DMCVR is highly effective in several aspects, such as 2D generation and 3D reconstruction performance. With DMCVR, we can produce high-resolution 3D cardiac MRI reconstructions, surpassing current techniques. Our proposed framework has great potential for improving the accuracy of cardiac disease diagnosis and treatment planning. Code can be accessed at https://github.com/hexiaoxiao-cs/DMCVR.

Baird Counterexample Is Solved: with an example of How to Debug a Two-time-scale Algorithm

  • paper_url: http://arxiv.org/abs/2308.09732
  • repo_url: None
  • paper_authors: Hengshuai Yao
  • for: To explain the behavior of TD algorithms on Baird counterexample, which is widely used to test and compare off-policy learning algorithms.
  • methods: A debugging analysis of two-time-scale TD algorithms that explains the slow convergence of TDC on this example.
  • results: Provides a debugging technique for studying the convergence of two-time-scale stochastic approximation algorithms, and shows empirically that the recent Impression GTD algorithm converges on this example at a linear rate, concluding that Baird counterexample is solved.
    Abstract Baird counterexample was proposed by Leemon Baird in 1995 and first used to show that the Temporal Difference (TD(0)) algorithm diverges on this example. Since then, it has often been used to test and compare off-policy learning algorithms. Gradient TD algorithms solved the divergence issue of TD on Baird counterexample. However, their convergence on this example is still very slow, and the nature of the slowness is not well understood, e.g., see (Sutton and Barto 2018). This note aims to understand, in particular, why TDC is slow on this example, and provides a debugging analysis of this behavior. Our debugging technique can be used to study the convergence behavior of two-time-scale stochastic approximation algorithms. We also provide empirical results of the recent Impression GTD algorithm on this example, showing that its convergence is very fast, in fact, at a linear rate. We conclude that Baird counterexample is solved by an algorithm with a convergence guarantee to the TD solution in general and a fast convergence rate.
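
For readers who want to see the divergence itself, the following NumPy sketch reproduces semi-gradient off-policy TD(0) with per-step importance sampling on the standard 7-state, 8-feature formulation of Baird counterexample (Sutton and Barto 2018); the step size and seed are illustrative:

```python
import numpy as np

gamma, alpha, n_steps = 0.99, 0.01, 10000
# Feature matrix of Baird's 7-state counterexample (8 weights).
phi = np.zeros((7, 8))
for i in range(6):
    phi[i, i], phi[i, 7] = 2.0, 1.0      # states 1..6: 2*e_i + e_8
phi[6, 6], phi[6, 7] = 1.0, 2.0          # state 7:     e_7 + 2*e_8

w = np.array([1., 1., 1., 1., 1., 1., 10., 1.])
rng = np.random.default_rng(0)
s = rng.integers(7)
for t in range(n_steps):
    # Behavior policy: 'dashed' (to a random upper state) w.p. 6/7, 'solid' (to state 7) w.p. 1/7.
    solid = rng.random() < 1.0 / 7.0
    s_next = 6 if solid else rng.integers(6)
    rho = 7.0 if solid else 0.0          # target policy always takes 'solid'
    td_error = 0.0 + gamma * phi[s_next] @ w - phi[s] @ w   # all rewards are zero
    w += alpha * rho * td_error * phi[s] # semi-gradient off-policy TD(0) update
    s = s_next
    if t % 2000 == 0:
        print(t, np.linalg.norm(w))      # the weight norm keeps growing: TD(0) diverges
```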

A Model-Agnostic Framework for Recommendation via Interest-aware Item Embeddings

  • paper_url: http://arxiv.org/abs/2308.09202
  • repo_url: None
  • paper_authors: Amit Kumar Jaiswal, Yu Xiong
  • for: To improve item representations in recommendation systems so that user interests are captured more directly.
  • methods: An Interest-aware Capsule network (IaCN), a model-agnostic auxiliary task that jointly learns item-based and interest-oriented item representations on top of existing recommendation models.
  • results: Experiments across different deep neural networks, behavior sequence lengths, and joint learning ratios show significant performance gains, demonstrating the effectiveness of the approach.
    Abstract Item representation holds significant importance in recommendation systems, which encompass domains such as news, retail, and videos. Retrieval and ranking models utilise item representation to capture the user-item relationship based on user behaviours. Existing representation learning methods primarily focus on optimising item-based mechanisms, such as attention and sequential modelling. However, these methods lack a modelling mechanism to directly reflect user interests within the learned item representations. Consequently, these methods may capture user interests only indirectly and less effectively. To address this challenge, we propose a novel Interest-aware Capsule network (IaCN) recommendation model, a model-agnostic framework that directly learns interest-oriented item representations. IaCN serves as an auxiliary task, enabling the joint learning of both item-based and interest-based representations. This framework adopts existing recommendation models without requiring substantial redesign. We evaluate the proposed approach on benchmark datasets, exploring various scenarios involving different deep neural networks, behaviour sequence lengths, and joint learning ratios of interest-oriented item representations. Experimental results demonstrate significant performance enhancements across diverse recommendation models, validating the effectiveness of our approach.

TinyProp – Adaptive Sparse Backpropagation for Efficient TinyML On-device Learning

  • paper_url: http://arxiv.org/abs/2308.09201
  • repo_url: None
  • paper_authors: Marcus Rüb, Daniel Maier, Daniel Mueller-Gritschneder, Axel Sikora
  • for: To make on-device learning and fine-tuning of neural networks efficient on low-power microcontroller units (MCUs).
  • methods: A sparse backpropagation method that dynamically adapts the backpropagation ratio at every training step, selecting and training only the most important weights.
  • results: On average 5 times faster than non-sparse training with only a minor accuracy loss; compared to existing static sparse backpropagation algorithms, TinyProp runs 2.9 times faster and reduces the accuracy loss by 6% on average.
    Abstract Training deep neural networks using backpropagation is very memory and computationally intensive. This makes it difficult to run on-device learning or fine-tune neural networks on tiny, embedded devices such as low-power micro-controller units (MCUs). Sparse backpropagation algorithms try to reduce the computational load of on-device learning by training only a subset of the weights and biases. Existing approaches use a static number of weights to train. A poor choice of this so-called backpropagation ratio limits either the computational gain or leads to severe accuracy losses. In this paper we present TinyProp, the first sparse backpropagation method that dynamically adapts the back-propagation ratio during on-device training for each training step. TinyProp induces a small calculation overhead to sort the elements of the gradient, which does not significantly impact the computational gains. TinyProp works particularly well on fine-tuning trained networks on MCUs, which is a typical use case for embedded applications. On three typical datasets, MNIST, DCASE2020 and CIFAR10, we are 5 times faster compared to non-sparse training with an average accuracy loss of 1%. On average, TinyProp is 2.9 times faster than existing, static sparse backpropagation algorithms, and the accuracy loss is reduced on average by 6% compared to a typical static setting of the back-propagation ratio.
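
A minimal sketch of the core idea follows, under the assumption that the dynamic ratio is interpolated from the current loss (the paper's exact adaptation rule may differ): only the top-k gradient elements by magnitude are applied each step.

```python
import numpy as np

def sparse_backprop_step(w, grad, loss, y_max=0.5, y_min=0.05, lr=0.01):
    """One training step with a dynamically chosen backpropagation ratio.
    The ratio grows with the current loss (illustrative rule), and only the
    top-k gradient elements by magnitude are actually applied."""
    ratio = np.clip(y_min + loss * (y_max - y_min), y_min, y_max)
    k = max(1, int(ratio * grad.size))
    idx = np.argpartition(np.abs(grad), -k)[-k:]  # indices of the k largest gradients
    sparse_grad = np.zeros_like(grad)
    sparse_grad[idx] = grad[idx]
    return w - lr * sparse_grad, k

rng = np.random.default_rng(0)
w, grad = rng.normal(size=1000), rng.normal(size=1000)
w, k = sparse_backprop_step(w, grad, loss=0.8)
print(f"updated {k} of {w.size} weights this step")
```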

Polynomial Bounds for Learning Noisy Optical Physical Unclonable Functions and Connections to Learning With Errors

  • paper_url: http://arxiv.org/abs/2308.09199
  • repo_url: None
  • paper_authors: Apollo Albright, Boris Gelfand, Michael Dixon
  • for: Studies the learnability of a class of optical physical unclonable functions (PUFs).
  • methods: A polynomial-sample analysis based on training a linear regression model on challenge-response pairs.
  • results: Shows that these PUFs can be learned to arbitrary precision with arbitrarily high probability, even in the presence of noise, given polynomially many challenge-response pairs and polynomially bounded computation; this extends Rh\"uramir et al. (2013), who covered only a subset of this PUF class under the assumption of linear or negligibly nonlinear optics.
    Abstract It is shown that a class of optical physical unclonable functions (PUFs) can be learned to arbitrary precision with arbitrarily high probability, even in the presence of noise, given access to polynomially many challenge-response pairs and polynomially bounded computational power, under mild assumptions about the distributions of the noise and challenge vectors. This extends the results of Rh\"uramir et al. (2013), who showed a subset of this class of PUFs to be learnable in polynomial time in the absence of noise, under the assumption that the optics of the PUF were either linear or had negligible nonlinear effects. We derive polynomial bounds for the required number of samples and the computational complexity of a linear regression algorithm, based on size parameters of the PUF, the distributions of the challenge and noise vectors, and the probability and accuracy of the regression algorithm, with a similar analysis to one done by Bootle et al. (2018), who demonstrated a learning attack on a poorly implemented version of the Learning With Errors problem.
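
The learning attack reduces to ordinary regression once enough challenge-response pairs are collected. The sketch below assumes an idealized PUF whose response is linear in the challenge plus Gaussian noise; the paper's optical PUFs and sample-complexity bounds are more general.

```python
import numpy as np

rng = np.random.default_rng(1)
n_feat, n_samples, noise_std = 64, 2000, 0.05

w_true = rng.normal(size=n_feat)                        # hidden linear response of the PUF
C = rng.choice([-1.0, 1.0], size=(n_samples, n_feat))   # random challenge vectors
r = C @ w_true + noise_std * rng.normal(size=n_samples) # noisy responses

w_hat, *_ = np.linalg.lstsq(C, r, rcond=None)           # ordinary least squares fit
print("max weight error:", np.abs(w_hat - w_true).max())  # shrinks as n_samples grows
```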

Half-Hop: A graph upsampling approach for slowing down message passing

  • paper_url: http://arxiv.org/abs/2308.09198
  • repo_url: https://github.com/nerdslab/halfhop
  • paper_authors: Mehdi Azabou, Venkataramana Ganesh, Shantanu Thakoor, Chi-Heng Lin, Lakshmi Sathidevi, Ran Liu, Michal Valko, Petar Veličković, Eva L. Dyer
  • for: To improve learning in message passing neural networks, especially when neighboring nodes belong to different classes.
  • methods: Adds "slow nodes" on edges to mediate message passing, modifying only the input graph.
  • results: Improves performance on several supervised and self-supervised benchmarks, notably under heterophily, and can generate augmented multi-scale views for self-supervised learning.
    Abstract Message passing neural networks have shown a lot of success on graph-structured data. However, there are many instances where message passing can lead to over-smoothing or fail when neighboring nodes belong to different classes. In this work, we introduce a simple yet general framework for improving learning in message passing neural networks. Our approach essentially upsamples edges in the original graph by adding "slow nodes" at each edge that can mediate communication between a source and a target node. Our method only modifies the input graph, making it plug-and-play and easy to use with existing models. To understand the benefits of slowing down message passing, we provide theoretical and empirical analyses. We report results on several supervised and self-supervised benchmarks, and show improvements across the board, notably in heterophilic conditions where adjacent nodes are more likely to have different labels. Finally, we show how our approach can be used to generate augmentations for self-supervised learning, where slow nodes are randomly introduced into different edges in the graph to generate multi-scale views with variable path lengths.
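
A minimal sketch of the edge-upsampling step on a directed edge list; handling of slow-node features (the paper initializes them from the incident nodes) is omitted, and the probability p of rewiring each edge is the knob that produces multi-scale augmented views:

```python
import numpy as np

def half_hop(edges, num_nodes, p=1.0, rng=None):
    """Insert a 'slow node' on each selected directed edge (u, v): the edge is
    replaced by u -> slow and slow -> v, lengthening the path by one hop."""
    rng = rng or np.random.default_rng()
    new_edges, next_id = [], num_nodes
    for (u, v) in edges:
        if rng.random() < p:             # p < 1 yields random multi-scale augmentations
            new_edges += [(u, next_id), (next_id, v)]
            next_id += 1
        else:
            new_edges.append((u, v))
    return new_edges, next_id            # next_id is the new total node count

edges = [(0, 1), (1, 2), (2, 0)]
aug, n = half_hop(edges, num_nodes=3, p=1.0, rng=np.random.default_rng(0))
print(aug, n)   # every edge now routes through its own slow node
```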

A Comparative Study of Text Embedding Models for Semantic Text Similarity in Bug Reports

  • paper_url: http://arxiv.org/abs/2308.09193
  • repo_url: https://github.com/av9ash/duplicatebugdetection
  • paper_authors: Avinash Patil, Kihwan Han, Sabyasachi Mukhopadhyay
  • for: To compare the effectiveness of semantic textual similarity methods for retrieving similar bug reports, improving bug report retrieval and triage.
  • methods: Evaluates several embedding models, including TF-IDF (baseline), FastText, Gensim, BERT, and ADA.
  • results: BERT generally outperformed the other models in recall, followed by ADA, Gensim, FastText, and TF-IDF; the results offer insight into the effectiveness of different embedding methods for bug report retrieval and highlight the importance of selecting an appropriate one for this task.
    Abstract Bug reports are an essential aspect of software development, and it is crucial to identify and resolve them quickly to ensure the consistent functioning of software systems. Retrieving similar bug reports from an existing database can help reduce the time and effort required to resolve bugs. In this paper, we compared the effectiveness of semantic textual similarity methods for retrieving similar bug reports based on a similarity score. We explored several embedding models such as TF-IDF (Baseline), FastText, Gensim, BERT, and ADA. We used the Software Defects Data containing bug reports for various software projects to evaluate the performance of these models. Our experimental results showed that BERT generally outperformed the rest of the models regarding recall, followed by ADA, Gensim, FastText, and TF-IDF. Our study provides insights into the effectiveness of different embedding methods for retrieving similar bug reports and highlights the impact of selecting the appropriate one for this task. Our code is available on GitHub.
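
The retrieval pipeline itself is embedding-agnostic: embed the database of reports, embed the query, and rank by cosine similarity. Here is a sketch with the paper's TF-IDF baseline via scikit-learn; BERT or ADA embeddings would plug into the same ranking step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reports = [
    "App crashes on startup when config file is missing",
    "Null pointer exception while saving user profile",
    "Crash at launch if configuration cannot be found",
]
query = ["Application crashes immediately after start, config not found"]

vec = TfidfVectorizer()
db = vec.fit_transform(reports)            # embed the existing bug database
sims = cosine_similarity(vec.transform(query), db)[0]
ranked = sims.argsort()[::-1]              # most similar candidate duplicates first
for i in ranked:
    print(f"{sims[i]:.3f}  {reports[i]}")
```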

Regularizing Adversarial Imitation Learning Using Causal Invariance

  • paper_url: http://arxiv.org/abs/2308.09189
  • repo_url: None
  • paper_authors: Ivan Ovinnikov, Joachim M. Buhmann
  • for: To infer a policy from a dataset of expert demonstrations, using a discriminator as the guiding signal in an adversarial optimization procedure.
  • methods: Adversarial imitation learning regularized with causal invariance, a principle that applies straightforwardly to existing adversarial imitation frameworks.
  • results: Finds that these models are prone to absorbing spurious correlations present in the expert data, and shows that the causal-invariance regularization alleviates this issue in an illustrative two-dimensional setting and on high-dimensional robot locomotion benchmarks.
    Abstract Imitation learning methods are used to infer a policy in a Markov decision process from a dataset of expert demonstrations by minimizing a divergence measure between the empirical state occupancy measures of the expert and the policy. The guiding signal to the policy is provided by the discriminator used as part of an adversarial optimization procedure. We observe that this model is prone to absorbing spurious correlations present in the expert data. To alleviate this issue, we propose to use causal invariance as a regularization principle for adversarial training of these models. The regularization objective is applicable in a straightforward manner to existing adversarial imitation frameworks. We demonstrate the efficacy of the regularized formulation in an illustrative two-dimensional setting as well as a number of high-dimensional robot locomotion benchmark tasks.

Distributed Extra-gradient with Optimal Complexity and Communication Guarantees

  • paper_url: http://arxiv.org/abs/2308.09187
  • repo_url: https://github.com/lions-epfl/qgenx
  • paper_authors: Ali Ramezani-Kebrya, Kimon Antonakopoulos, Igor Krawczuk, Justin Deschenaux, Volkan Cevher
  • for: Solving distributed monotone variational inequality problems across multiple processors/workers/clients, covering distributed convex minimization, min-max problems, and games.
  • methods: A quantized generalized extra-gradient method (Q-GenX), an unbiased and adaptive compression scheme that reduces communication and adapts to the noise profile at hand.
  • results: An adaptive step-size rule achieves a fast ${\mathcal O}(1/T)$ rate under relative noise and an order-optimal ${\mathcal O}(1/\sqrt{T})$ rate under absolute noise; distributed training accelerates convergence, as validated by real-world experiments training generative adversarial networks on multiple GPUs.
    Abstract We consider monotone variational inequality (VI) problems in multi-GPU settings where multiple processors/workers/clients have access to local stochastic dual vectors. This setting includes a broad range of important problems from distributed convex minimization to min-max and games. Extra-gradient, which is a de facto algorithm for monotone VI problems, has not been designed to be communication-efficient. To this end, we propose a quantized generalized extra-gradient (Q-GenX), which is an unbiased and adaptive compression method tailored to solve VIs. We provide an adaptive step-size rule, which adapts to the respective noise profiles at hand and achieve a fast rate of ${\mathcal O}(1/T)$ under relative noise, and an order-optimal ${\mathcal O}(1/\sqrt{T})$ under absolute noise and show distributed training accelerates convergence. Finally, we validate our theoretical results by providing real-world experiments and training generative adversarial networks on multiple GPUs.
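
The compression must be unbiased for guarantees of this kind to hold. As a stand-in for the paper's scheme, here is a QSGD-style stochastically rounded quantizer, which is unbiased by construction:

```python
import numpy as np

def quantize_unbiased(x, levels=4, rng=None):
    """Unbiased stochastic quantization of a vector: each coordinate is
    randomly rounded to one of `levels` uniform levels in [0, max|x|] such
    that E[Q(x)] = x."""
    rng = rng or np.random.default_rng()
    scale = np.abs(x).max()
    if scale == 0:
        return x.copy()
    y = np.abs(x) / scale * (levels - 1)         # position on the quantization grid
    low = np.floor(y)
    q = low + (rng.random(x.shape) < (y - low))  # randomized rounding up or down
    return np.sign(x) * q * scale / (levels - 1)

rng = np.random.default_rng(0)
g = rng.normal(size=5)
est = np.mean([quantize_unbiased(g, rng=rng) for _ in range(20000)], axis=0)
print(np.round(g, 3), np.round(est, 3))          # the quantizer is unbiased on average
```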

RatGPT: Turning online LLMs into Proxies for Malware Attacks

  • paper_url: http://arxiv.org/abs/2308.09183
  • repo_url: None
  • paper_authors: Mika Beckerich, Laura Plein, Sergio Coronado
  • for: To examine how openly available plugins and large language models (LLMs) open new possibilities in software engineering while creating new cybersecurity challenges.
  • methods: Uses LLMs such as ChatGPT to generate malicious content and to act as a proxy between the attacker and the victim.
  • results: Delivers a proof-of-concept in which ChatGPT disseminates malicious software while evading detection and relays commands from a command and control (C2) server to a victim's system, highlighting the need for security guidelines, controls, and mitigation strategies for openly available plugins and LLMs.
    Abstract The evolution of Generative AI and the capabilities of the newly released Large Language Models (LLMs) open new opportunities in software engineering. However, they also lead to new challenges in cybersecurity. Recently, researchers have shown the possibilities of using LLMs such as ChatGPT to generate malicious content that can directly be exploited or guide inexperienced hackers to weaponize tools and code. Those studies covered scenarios that still require the attacker in the middle of the loop. In this study, we leverage openly available plugins and use an LLM as proxy between the attacker and the victim. We deliver a proof-of-concept where ChatGPT is used for the dissemination of malicious software while evading detection, alongside establishing the communication to a command and control (C2) server to receive commands to interact with a victim's system. Finally, we present the general approach as well as essential elements in order to stay undetected and make the attack a success. This proof-of-concept highlights significant cybersecurity issues with openly available plugins and LLMs, which require the development of security guidelines, controls, and mitigation strategies.

ChatGPT-HealthPrompt. Harnessing the Power of XAI in Prompt-Based Healthcare Decision Support using ChatGPT

  • paper_url: http://arxiv.org/abs/2308.09731
  • repo_url: None
  • paper_authors: Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia
  • for: Explores the application of large language models (LLMs), particularly OpenAI's ChatGPT, to clinical decision-making via contextual prompts that carefully combine a task description, feature descriptions, and domain knowledge.
  • methods: Extracts key feature-importance insights from high-performing interpretable machine learning (ML) models, treating them as medical experts, and integrates this domain knowledge seamlessly into prompt design.
  • results: Shows that contextual prompting with integrated domain knowledge enables high-quality binary classification even in data-scarce scenarios, and compares zero-shot and few-shot prompt learning with traditional supervised ML models under varied data availability.
    Abstract This study presents an innovative approach to the application of large language models (LLMs) in clinical decision-making, focusing on OpenAI's ChatGPT. Our approach introduces the use of contextual prompts-strategically designed to include task description, feature description, and crucially, integration of domain knowledge-for high-quality binary classification tasks even in data-scarce scenarios. The novelty of our work lies in the utilization of domain knowledge, obtained from high-performing interpretable ML models, and its seamless incorporation into prompt design. By viewing these ML models as medical experts, we extract key insights on feature importance to aid in decision-making processes. This interplay of domain knowledge and AI holds significant promise in creating a more insightful diagnostic tool. Additionally, our research explores the dynamics of zero-shot and few-shot prompt learning based on LLMs. By comparing the performance of OpenAI's ChatGPT with traditional supervised ML models in different data conditions, we aim to provide insights into the effectiveness of prompt engineering strategies under varied data availability. In essence, this paper bridges the gap between AI and healthcare, proposing a novel methodology for LLMs application in clinical decision support systems. It highlights the transformative potential of effective prompt design, domain knowledge integration, and flexible learning approaches in enhancing automated decision-making.

Diversifying AI: Towards Creative Chess with AlphaZero

  • paper_url: http://arxiv.org/abs/2308.09175
  • repo_url: None
  • paper_authors: Tom Zahavy, Vivek Veeriah, Shaobo Hou, Kevin Waugh, Matthew Lai, Edouard Leurent, Nenad Tomasev, Lisa Schut, Demis Hassabis, Satinder Singh
  • for: To examine whether AI systems can benefit from creative decision-making mechanisms when pushed to the limits of their computational rationality.
  • methods: Behavioral diversity techniques that let a team of agents generate more ideas and then select the most promising ones; AlphaZero (AZ) is extended into a league of agents, AZ_db, via a latent-conditioned architecture, with sub-additive planning for selection.
  • results: AZ_db plays chess in more diverse ways, solves more puzzles as a group, and outperforms a more homogeneous team; notably, it solves twice as many challenging puzzles as AZ, including the difficult Penrose positions. Its players specialize in different openings, and selecting a player per opening with sub-additive planning yields a 50 Elo improvement over AZ, suggesting that diversity bonuses emerge in teams of AI agents just as they do in human teams.
    Abstract In recent years, Artificial Intelligence (AI) systems have surpassed human intelligence in a variety of computational tasks. However, AI systems, like humans, make mistakes, have blind spots, hallucinate, and struggle to generalize to new situations. This work explores whether AI can benefit from creative decision-making mechanisms when pushed to the limits of its computational rationality. In particular, we investigate whether a team of diverse AI systems can outperform a single AI in challenging tasks by generating more ideas as a group and then selecting the best ones. We study this question in the game of chess, the so-called drosophila of AI. We build on AlphaZero (AZ) and extend it to represent a league of agents via a latent-conditioned architecture, which we call AZ_db. We train AZ_db to generate a wider range of ideas using behavioral diversity techniques and select the most promising ones with sub-additive planning. Our experiments suggest that AZ_db plays chess in diverse ways, solves more puzzles as a group and outperforms a more homogeneous team. Notably, AZ_db solves twice as many challenging puzzles as AZ, including the challenging Penrose positions. When playing chess from different openings, we notice that players in AZ_db specialize in different openings, and that selecting a player for each opening using sub-additive planning results in a 50 Elo improvement over AZ. Our findings suggest that diversity bonuses emerge in teams of AI agents, just as they do in teams of humans and that diversity is a valuable asset in solving computationally hard problems.

Forensic Data Analytics for Anomaly Detection in Evolving Networks

  • paper_url: http://arxiv.org/abs/2308.09171
  • repo_url: None
  • paper_authors: Li Yang, Abdallah Moubayed, Abdallah Shami, Amine Boukhtouta, Parisa Heidari, Stere Preda, Richard Brunner, Daniel Migault, Adel Larabi
  • for: The paper is written to elaborate effective security controls to protect evolving network deployments in depth, specifically in the context of 5G and virtualization.
  • methods: The paper proposes a digital forensic data analytics framework for network anomaly detection, which includes multi-perspective feature engineering, unsupervised anomaly detection, and comprehensive result correction procedures.
  • results: Experiments on real-world evolving network data demonstrate the effectiveness of the proposed forensic data analytics solution.
    Abstract In the prevailing convergence of traditional infrastructure-based deployment (i.e., Telco and industry operational networks) towards evolving deployments enabled by 5G and virtualization, there is a keen interest in elaborating effective security controls to protect these deployments in-depth. By considering key enabling technologies like 5G and virtualization, evolving networks are democratized, facilitating the establishment of point presences integrating different business models ranging from media, dynamic web content, gaming, and a plethora of IoT use cases. Despite the increasing services provided by evolving networks, many cybercrimes and attacks have been launched in evolving networks to perform malicious activities. Due to the limitations of traditional security artifacts (e.g., firewalls and intrusion detection systems), the research on digital forensic data analytics has attracted more attention. Digital forensic analytics enables people to derive detailed information and comprehensive conclusions from different perspectives of cybercrimes to assist in convicting criminals and preventing future crimes. This chapter presents a digital analytics framework for network anomaly detection, including multi-perspective feature engineering, unsupervised anomaly detection, and comprehensive result correction procedures. Experiments on real-world evolving network data show the effectiveness of the proposed forensic data analytics solution.

Online Transition-Based Feature Generation for Anomaly Detection in Concurrent Data Streams

  • paper_url: http://arxiv.org/abs/2308.10893
  • repo_url: None
  • paper_authors: Yinzheng Zhong, Alexei Lisitsa
  • for: Processing general activity data with attributes, such as network traffic from packets, system calls from processes, or classified activity from surveillance cameras, and generating features step by step.
  • methods: A transition-based feature generator (TFGen) that processes data online and encodes historical data into each incoming activity with high computational efficiency, even when inputs originate concurrently from distinct traces or channels.
  • results: Addresses domain-independent applicability, discovery of global process structures, encoding of time-series data, and online processing capability.
    Abstract In this paper, we introduce the transition-based feature generator (TFGen) technique, which reads general activity data with attributes and generates features step by step. The activity data may consist of network activity from packets, system calls from processes or classified activity from surveillance cameras. TFGen processes data online and, for each incoming activity, generates features that encode the activity's history with high computational efficiency. The input activities may concurrently originate from distinct traces or channels. The technique aims to address issues such as domain-independent applicability, the ability to discover global process structures, the encoding of time-series data, and online processing capability.
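
A toy sketch of online transition-based feature generation follows, with a hypothetical encoding (previous-activity transition plus its running frequency, tracked per channel); TFGen's actual encoding is richer:

```python
from collections import defaultdict

class TransitionFeatureGenerator:
    """Online sketch: for each incoming activity, emit a feature encoding the
    transition from the previous activity on the same trace/channel, plus a
    running count of how often that transition has been seen."""
    def __init__(self):
        self.last = {}                            # channel -> previous activity
        self.counts = defaultdict(int)            # (prev, curr) -> frequency so far

    def process(self, channel, activity):
        prev = self.last.get(channel, "<start>")
        self.counts[(prev, activity)] += 1        # history is encoded incrementally
        self.last[channel] = activity
        return {"transition": (prev, activity),
                "count": self.counts[(prev, activity)]}

gen = TransitionFeatureGenerator()
stream = [("c1", "login"), ("c2", "scan"), ("c1", "read"), ("c1", "read")]
for ch, act in stream:                            # concurrent traces interleave freely
    print(gen.process(ch, act))
```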

FedPerfix: Towards Partial Model Personalization of Vision Transformers in Federated Learning

  • paper_url: http://arxiv.org/abs/2308.09160
  • repo_url: https://github.com/imguangyu/fedperfix
  • paper_authors: Guangyu Sun, Matias Mendieta, Jun Luo, Shandong Wu, Chen Chen
  • for: To improve model personalization in federated learning over heterogeneous data distributions, focusing on Vision Transformers (ViTs).
  • methods: Empirically evaluates the data-distribution sensitivity of each layer type to determine which parts benefit most from personalization, finds that the self-attention layer and the classification head are the most sensitive parts of a ViT, and proposes FedPerfix, which uses plugins to transfer information from the aggregated model to the local client as a personalization.
  • results: On CIFAR-100, OrganAMNIST, and Office-Home, FedPerfix improves performance compared to several advanced PFL methods.
    Abstract Personalized Federated Learning (PFL) represents a promising solution for decentralized learning in heterogeneous data environments. Partial model personalization has been proposed to improve the efficiency of PFL by selectively updating local model parameters instead of aggregating all of them. However, previous work on partial model personalization has mainly focused on Convolutional Neural Networks (CNNs), leaving a gap in understanding how it can be applied to other popular models such as Vision Transformers (ViTs). In this work, we investigate where and how to partially personalize a ViT model. Specifically, we empirically evaluate the sensitivity to data distribution of each type of layer. Based on the insights that the self-attention layer and the classification head are the most sensitive parts of a ViT, we propose a novel approach called FedPerfix, which leverages plugins to transfer information from the aggregated model to the local client as a personalization. Finally, we evaluate the proposed approach on CIFAR-100, OrganAMNIST, and Office-Home datasets and demonstrate its effectiveness in improving the model's performance compared to several advanced PFL methods.
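
The partial-personalization recipe suggested by the layer-sensitivity analysis can be sketched as a simple partition of the model's parameters; the parameter names below follow common ViT naming conventions and are hypothetical, and FedPerfix's plugin-based information transfer is not shown:

```python
def split_personalized(state_dict, keywords=("attn", "head")):
    """Partition a ViT state dict into parameters kept local (self-attention
    and classification head, the layers found most sensitive to the data
    distribution) and parameters shared via federated aggregation."""
    personal, shared = {}, {}
    for name, tensor in state_dict.items():
        (personal if any(k in name for k in keywords) else shared)[name] = tensor
    return personal, shared

# Hypothetical parameter names following common ViT naming conventions.
params = {"blocks.0.attn.qkv.weight": 1, "blocks.0.mlp.fc1.weight": 2,
          "head.weight": 3, "patch_embed.proj.weight": 4}
personal, shared = split_personalized(params)
print(sorted(personal))   # stay on the client between rounds
print(sorted(shared))     # averaged on the server
```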

Data diversity and virtual imaging in AI-based diagnosis: A case study based on COVID-19

  • paper_url: http://arxiv.org/abs/2308.09730
  • repo_url: None
  • paper_authors: Fakrul Islam Tushar, Lavsen Dahal, Saman Sotoudeh-Paima, Ehsan Abadi, W. Paul Segars, Ehsan Samei, Joseph Y. Lo
  • for: This study aimed to evaluate the performance of deep-learning-based AI models for COVID-19 diagnosis using diverse clinical and virtually generated medical images, and to assess the impact of dataset characteristics, disease extent, and imaging modality on AI performance.
  • methods: The study used a retrospective design and developed AI models using both clinical and virtually generated medical images. A virtual imaging trial was conducted to assess the impact of patient- and physics-based factors on AI performance.
  • results: The study found that AI performance was strongly influenced by dataset characteristics, with poor generalization to new data and a 20% drop in receiver operating characteristic area under the curve. CT results were consistently superior to those from CXR, and imaging dose had negligible influence on results. The study highlighted the significance of dataset characteristics and disease extent on COVID assessment, and the potential role of virtual imaging trial techniques in developing effective AI algorithms and facilitating translation into diagnostic practice.
    Abstract Many studies have investigated deep-learning-based artificial intelligence (AI) models for medical imaging diagnosis of the novel coronavirus (COVID-19), with many reports of near-perfect performance. However, variability in performance and underlying data biases raise concerns about clinical generalizability. This retrospective study involved the development and evaluation of artificial intelligence (AI) models for COVID-19 diagnosis using both diverse clinical and virtually generated medical images. In addition, we conducted a virtual imaging trial to assess how AI performance is affected by several patient- and physics-based factors, including the extent of disease, radiation dose, and imaging modality of computed tomography (CT) and chest radiography (CXR). AI performance was strongly influenced by dataset characteristics including quantity, diversity, and prevalence, leading to poor generalization with up to 20% drop in receiver operating characteristic area under the curve. Model performance on virtual CT and CXR images was comparable to overall results on clinical data. Imaging dose proved to have negligible influence on the results, but the extent of the disease had a marked effect. CT results were consistently superior to those from CXR. Overall, the study highlighted the significant impact of dataset characteristics and disease extent on COVID assessment, and the relevance and potential role of virtual imaging trial techniques in developing effective evaluation of AI algorithms and facilitating translation into diagnostic practice.

ZhiJian: A Unifying and Rapidly Deployable Toolbox for Pre-trained Model Reuse

  • paper_url: http://arxiv.org/abs/2308.09158
  • repo_url: https://github.com/zhangyikaii/lamda-zhijian
  • paper_authors: Yi-Kai Zhang, Lu Ren, Chao Yi, Qi-Wei Wang, De-Chuan Zhan, Han-Jia Ye
  • for: This paper is written for deep learning practitioners and researchers who want to explore downstream tasks and identify the complementary advantages among different methods for model reuse.
  • methods: The paper introduces ZhiJian, a comprehensive and user-friendly toolbox for model reuse that utilizes the PyTorch backend. ZhiJian presents a novel paradigm that unifies diverse perspectives on model reuse, including target architecture construction with PTM, tuning target model with PTM, and PTM-based inference.
  • results: The paper empowers deep learning practitioners to explore downstream tasks and identify the complementary advantages among different methods for model reuse, and ZhiJian is readily accessible at https://github.com/zhangyikaii/lamda-zhijian for seamless utilization of pre-trained models and streamlining the model reuse process.
    Abstract The rapid expansion of foundation pre-trained models and their fine-tuned counterparts has significantly contributed to the advancement of machine learning. Leveraging pre-trained models to extract knowledge and expedite learning in real-world tasks, known as "Model Reuse", has become crucial in various applications. Previous research focuses on reusing models within a certain aspect, including reusing model weights, structures, and hypothesis spaces. This paper introduces ZhiJian, a comprehensive and user-friendly toolbox for model reuse, utilizing the PyTorch backend. ZhiJian presents a novel paradigm that unifies diverse perspectives on model reuse, encompassing target architecture construction with PTM, tuning target model with PTM, and PTM-based inference. This empowers deep learning practitioners to explore downstream tasks and identify the complementary advantages among different methods. ZhiJian is readily accessible at https://github.com/zhangyikaii/lamda-zhijian facilitating seamless utilization of pre-trained models and streamlining the model reuse process for researchers and developers.

Accurate machine learning force fields via experimental and simulation data fusion

  • paper_url: http://arxiv.org/abs/2308.09142
  • repo_url: None
  • paper_authors: Sebastien Röcken, Julija Zavadlav
  • for: To develop machine learning (ML) force fields that model molecules at quantum-level accuracy across the scales of classical interatomic potentials.
  • methods: Trains an ML force field for titanium on fused data: density functional theory (DFT) calculations combined with experimentally measured mechanical properties and lattice parameters.
  • results: The fused-data strategy satisfies all target objectives simultaneously and yields a more accurate model than training on a single data source; inaccuracies of the DFT functional at the target experimental properties are corrected while the investigated off-target properties remain largely unperturbed.
    Abstract Machine Learning (ML)-based force fields are attracting ever-increasing interest due to their capacity to span spatiotemporal scales of classical interatomic potentials at quantum-level accuracy. They can be trained based on high-fidelity simulations or experiments, the former being the common case. However, both approaches are impaired by scarce and erroneous data resulting in models that either do not agree with well-known experimental observations or are under-constrained and only reproduce some properties. Here we leverage both Density Functional Theory (DFT) calculations and experimentally measured mechanical properties and lattice parameters to train an ML potential of titanium. We demonstrate that the fused data learning strategy can concurrently satisfy all target objectives, thus resulting in a molecular model of higher accuracy compared to the models trained with a single data source. The inaccuracies of DFT functionals at target experimental properties were corrected, while the investigated off-target properties remained largely unperturbed. Our approach is applicable to any material and can serve as a general strategy to obtain highly accurate ML potentials.

RTB Formulation Using Point Process

  • paper_url: http://arxiv.org/abs/2308.09122
  • repo_url: None
  • paper_authors: Seong Jin Lee, Bumsik Kim
  • for: Modeling repeated auctions in the Real Time Bidding (RTB) ecosystem with a general stochastic framework based on point processes.
  • methods: A point-process formulation of the auction process, with theoretical results on approximating it by a Poisson point process so that well-established properties can be exploited.
  • results: Specifies the player's optimal strategy under various scenarios and emphasizes the importance of modeling the joint distribution of utility and market condition rather than estimating the marginal distributions independently.
    Abstract We propose a general stochastic framework for modelling repeated auctions in the Real Time Bidding (RTB) ecosystem using point processes. The flexibility of the framework allows a variety of auction scenarios including configuration of information provided to player, determination of auction winner and quantification of utility gained from each auctions. We propose theoretical results on how this formulation of process can be approximated to a Poisson point process, which enables the analyzer to take advantage of well-established properties. Under this framework, we specify the player's optimal strategy under various scenarios. We also emphasize that it is critical to consider the joint distribution of utility and market condition instead of estimating the marginal distributions independently.
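
A minimal simulation of the framework's Poisson approximation: auction arrivals are drawn as a homogeneous Poisson point process, and each auction carries a correlated (utility, market price) pair to illustrate why the joint distribution matters; all distributions here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
rate, horizon = 2.0, 10.0                      # auctions arrive at ~2 per second

# Simulate auction arrival times as a homogeneous Poisson point process.
t, arrivals = 0.0, []
while True:
    t += rng.exponential(1.0 / rate)           # i.i.d. exponential inter-arrival gaps
    if t > horizon:
        break
    arrivals.append(t)

# Attach a (utility, market price) pair to each auction; the player's optimal
# bid depends on their *joint* distribution, not the marginals alone.
utility = rng.lognormal(mean=0.0, sigma=0.5, size=len(arrivals))
market = 0.5 * utility + rng.exponential(0.3, size=len(arrivals))  # correlated!
wins = utility > market                        # truthful bid of the utility value
print(f"{len(arrivals)} auctions, won {wins.sum()}, "
      f"realized surplus {float((utility - market)[wins].sum()):.2f}")
```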

Multi-fidelity Fourier Neural Operator for Fast Modeling of Large-Scale Geological Carbon Storage

  • paper_url: http://arxiv.org/abs/2308.09113
  • repo_url: None
  • paper_authors: Hewei Tang, Qingkai Kong, Joseph P. Morris
  • for: A deep-learning-based surrogate to accelerate the prediction of reservoir pressure and CO2 plume migration in geological carbon storage (GCS) problems.
  • methods: A multi-fidelity Fourier Neural Operator trained on more affordable multi-fidelity datasets; its grid-invariant property simplifies transfer learning between different discretizations.
  • results: On a GCS reservoir model with 110k grid cells, the multi-fidelity model matches the accuracy of a high-fidelity model at 81% lower data generation cost, and it generalizes to a 1-million-cell discretization with high- and low-fidelity datasets generated by different geostatistical models and reservoir simulators, predicting pressure fields with reasonable accuracy even when high-fidelity data are extremely limited.
    Abstract Deep learning-based surrogate models have been widely applied in geological carbon storage (GCS) problems to accelerate the prediction of reservoir pressure and CO2 plume migration. Large amounts of data from physics-based numerical simulators are required to train a model to accurately predict the complex physical behaviors associated with this process. In practice, the available training data are always limited in large-scale 3D problems due to the high computational cost. Therefore, we propose to use a multi-fidelity Fourier Neural Operator to solve large-scale GCS problems with more affordable multi-fidelity training datasets. The Fourier Neural Operator has a desirable grid-invariant property, which simplifies the transfer learning procedure between datasets with different discretization. We first test the model efficacy on a GCS reservoir model being discretized into 110k grid cells. The multi-fidelity model can predict with accuracy comparable to a high-fidelity model trained with the same amount of high-fidelity data with 81% less data generation costs. We further test the generalizability of the multi-fidelity model on a same reservoir model with a finer discretization of 1 million grid cells. This case was made more challenging by employing high-fidelity and low-fidelity datasets generated by different geostatistical models and reservoir simulators. We observe that the multi-fidelity FNO model can predict pressure fields with reasonable accuracy even when the high-fidelity data are extremely limited.

Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation

  • paper_url: http://arxiv.org/abs/2308.09105
  • repo_url: None
  • paper_authors: Shengcao Cao, Mengtian Li, James Hays, Deva Ramanan, Yi-Xiong Wang, Liang-Yan Gui
  • for: To obtain vision models that are both accurate and lightweight in computation and memory for resource-constrained perception systems.
  • methods: A simple yet effective sequential approach that progressively transfers knowledge from multiple teacher detectors to a lightweight student, and combines easily with existing detection distillation mechanisms.
  • results: Successfully distills knowledge from Transformer-based teacher detectors into convolution-based students, boosting ResNet-50 based RetinaNet from 36.5% to 42.0% AP and Mask R-CNN from 38.2% to 42.5% AP on the MS COCO benchmark.
    Abstract Resource-constrained perception systems such as edge computing and vision-for-robotics require vision models to be both accurate and lightweight in computation and memory usage. While knowledge distillation is a proven strategy to enhance the performance of lightweight classification models, its application to structured outputs like object detection and instance segmentation remains a complicated task, due to the variability in outputs and complex internal network modules involved in the distillation process. In this paper, we propose a simple yet surprisingly effective sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student. To distill knowledge from a highly accurate but complex teacher model, we construct a sequence of teachers to help the student gradually adapt. Our progressive strategy can be easily combined with existing detection distillation mechanisms to consistently maximize student performance in various settings. To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students, and unprecedentedly boost the performance of ResNet-50 based RetinaNet from 36.5% to 42.0% AP and Mask R-CNN from 38.2% to 42.5% AP on the MS COCO benchmark.
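
A toy sketch of the sequential schedule: the student is distilled against one teacher at a time with a standard temperature-softened soft-label loss, moving from teacher to teacher so it adapts gradually; the teacher names and logits below are stand-ins.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label distillation: cross-entropy against the teacher's
    temperature-softened class distribution."""
    p_t = softmax(teacher_logits, T)
    return -(p_t * np.log(softmax(student_logits, T) + 1e-12)).sum(axis=-1).mean()

rng = np.random.default_rng(0)
student = rng.normal(size=(4, 10))            # logits on a tiny 4-sample batch
teachers = ["teacher_small", "teacher_medium", "teacher_large"]  # hypothetical sequence

# Progressive schedule: distill against one teacher at a time, weaker first,
# so the student adapts gradually instead of chasing the strongest teacher at once.
for name in teachers:
    teacher = rng.normal(size=(4, 10))        # stand-in for that teacher's logits
    for _ in range(300):
        T = 2.0
        p_s, p_t = softmax(student, T), softmax(teacher, T)
        student -= 1.0 * (p_s - p_t) / (T * len(student))  # analytic gradient step
    print(f"{name}: KD loss {kd_loss(student, teacher):.4f} (floor = teacher entropy)")
```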

A comprehensive study of spike and slab shrinkage priors for structurally sparse Bayesian neural networks

  • paper_url: http://arxiv.org/abs/2308.09104
  • repo_url: None
  • paper_authors: Sanket Jantre, Shrijita Bhattacharya, Tapabrata Maiti
  • for: To reduce network complexity and improve computational efficiency in deep learning by recovering sparse representations of heavily over-parameterized deep neural networks.
  • methods: Structurally sparse (e.g., node-sparse) Bayesian neural networks that systematically prune excessive nodes with Spike-and-Slab Group Lasso (SS-GL) and Spike-and-Slab Group Horseshoe (SS-GHS) priors, with computationally tractable variational inference including a continuous relaxation of Bernoulli variables.
  • results: Establishes contraction rates of the variational posterior as a function of network topology, layer-wise node cardinalities, and bounds on the network weights, and empirically demonstrates competitive prediction accuracy, model compression, and inference latency against baseline models.
    Abstract Network complexity and computational efficiency have become increasingly significant aspects of deep learning. Sparse deep learning addresses these challenges by recovering a sparse representation of the underlying target function by reducing heavily over-parameterized deep neural networks. Specifically, deep neural architectures compressed via structured sparsity (e.g. node sparsity) provide low latency inference, higher data throughput, and reduced energy consumption. In this paper, we explore two well-established shrinkage techniques, Lasso and Horseshoe, for model compression in Bayesian neural networks. To this end, we propose structurally sparse Bayesian neural networks which systematically prune excessive nodes with (i) Spike-and-Slab Group Lasso (SS-GL), and (ii) Spike-and-Slab Group Horseshoe (SS-GHS) priors, and develop computationally tractable variational inference including continuous relaxation of Bernoulli variables. We establish the contraction rates of the variational posterior of our proposed models as a function of the network topology, layer-wise node cardinalities, and bounds on the network weights. We empirically demonstrate the competitive performance of our models compared to the baseline models in prediction accuracy, model compression, and inference latency.
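
The continuous relaxation of the Bernoulli inclusion variables is the piece that makes variational inference tractable. Below is a sketch using the standard Concrete/Gumbel-Sigmoid relaxation, applied at the node level so whole weight groups are pruned together; the logits and temperature are illustrative.

```python
import numpy as np

def relaxed_bernoulli(logit_p, temperature=0.5, rng=None):
    """Concrete / Gumbel-Sigmoid relaxation of a Bernoulli variable:
    differentiable samples in (0, 1) that concentrate on {0, 1} as the
    temperature is lowered."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-8, 1 - 1e-8, size=np.shape(logit_p))
    logistic_noise = np.log(u) - np.log1p(-u)
    return 1.0 / (1.0 + np.exp(-(logit_p + logistic_noise) / temperature))

rng = np.random.default_rng(0)
node_logits = np.array([-4.0, -4.0, 0.0, 4.0])   # variational inclusion logits, one per node
z = relaxed_bernoulli(node_logits, temperature=0.3, rng=rng)
print(np.round(z, 3))   # near-zero entries mark nodes the spike effectively prunes

# Scaling a node's entire outgoing weight row by its inclusion variable is what
# makes the sparsity *structured* (node-level rather than weight-level).
W = rng.normal(size=(4, 5))
W_sparse = z[:, None] * W
```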

MindMap: Knowledge Graph Prompting Sparks Graph of Thoughts in Large Language Models

  • paper_url: http://arxiv.org/abs/2308.09729
  • repo_url: https://github.com/willing510/MindMap
  • paper_authors: Yilin Wen, Zifeng Wang, Jimeng Sun
  • for: To address the limitations of large language models (LLMs) in incorporating new knowledge, their tendency to hallucinate, and the opacity of their decision-making.
  • methods: A prompting pipeline that endows LLMs with the ability to comprehend knowledge graph (KG) inputs and to reason with a combination of implicit internal knowledge and retrieved external knowledge, while eliciting the mind map on which the LLM performs its reasoning.
  • results: MindMap prompting yields striking empirical gains; for instance, GPT-3.5 prompted with MindMap consistently outperforms GPT-4 on three question-answering datasets, and structured facts retrieved from a KG outperform document-retrieval prompting thanks to more accurate, concise, and comprehensive knowledge.
    Abstract LLMs usually exhibit limitations in their ability to incorporate new knowledge, the generation of hallucinations, and the transparency of their decision-making process. In this paper, we explore how to prompt LLMs with knowledge graphs (KG), working as a remedy to engage LLMs with up-to-date knowledge and elicit the reasoning pathways from LLMs. Specifically, we build a prompting pipeline that endows LLMs with the capability of comprehending KG inputs and inferring with a combined implicit knowledge and the retrieved external knowledge. In addition, we investigate eliciting the mind map on which LLMs perform the reasoning and generate the answers. It is identified that the produced mind map exhibits the reasoning pathways of LLMs grounded on the ontology of knowledge, hence bringing the prospects of probing and gauging LLM inference in production. The experiments on three question & answering datasets also show that MindMap prompting leads to a striking empirical gain. For instance, prompting a GPT-3.5 with MindMap yields an overwhelming performance over GPT-4 consistently. We also demonstrate that with structured facts retrieved from KG, MindMap can outperform a series of prompting-with-document-retrieval methods, benefiting from more accurate, concise, and comprehensive knowledge from KGs.
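
A toy sketch of the prompting side: knowledge-graph evidence paths are serialized into the prompt and the model is asked to expose the reasoning path (the mind map) it used; the triples, wording, and question below are hypothetical.

```python
# Hypothetical KG triples retrieved for a medical question; the real pipeline
# retrieves evidence subgraphs from a curated knowledge graph.
triples = [
    ("metformin", "treats", "type 2 diabetes"),
    ("metformin", "contraindicated_with", "severe renal impairment"),
    ("type 2 diabetes", "has_symptom", "polyuria"),
]
question = "Is metformin appropriate for a type 2 diabetes patient with kidney failure?"

evidence = "\n".join(f"- ({h}) --[{r}]--> ({t})" for h, r, t in triples)
prompt = (
    "You are given knowledge-graph evidence paths. Reason over them step by "
    "step, state the reasoning path you used (a mind map), then answer.\n\n"
    f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"
)
print(prompt)   # send to the LLM of choice; the elicited path makes the reasoning inspectable
```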

Modeling Edge Features with Deep Bayesian Graph Networks

  • paper_url: http://arxiv.org/abs/2308.09087
  • repo_url: https://github.com/diningphil/e-cgmm
  • paper_authors: Daniele Atzeni, Federico Errica, Davide Bacciu, Alessio Micheli
  • for: To extend the Contextual Graph Markov Model, a deep and probabilistic machine learning model for graphs, to model the distribution of edge features.
  • methods: Introduces an additional Bayesian network that maps edge features into discrete states to be used by the original model, enabling richer graph representations even in the absence of edge features.
  • results: Improves performance on standard graph classification benchmarks, provides substantial gains over the original model on three link prediction tasks where edge features are of fundamental importance, and keeps the computational complexity linear in the number of edges, making it suitable for large-scale graph processing.
    Abstract We propose an extension of the Contextual Graph Markov Model, a deep and probabilistic machine learning model for graphs, to model the distribution of edge features. Our approach is architectural, as we introduce an additional Bayesian network mapping edge features into discrete states to be used by the original model. In doing so, we are also able to build richer graph representations even in the absence of edge features, which is confirmed by the performance improvements on standard graph classification benchmarks. Moreover, we successfully test our proposal in a graph regression scenario where edge features are of fundamental importance, and we show that the learned edge representation provides substantial performance improvements against the original model on three link prediction tasks. By keeping the computational complexity linear in the number of edges, the proposed model is amenable to large-scale graph processing.

Embracing assay heterogeneity with neural processes for markedly improved bioactivity predictions

  • paper_url: http://arxiv.org/abs/2308.09086
  • repo_url: None
  • paper_authors: Lucian Chan, Marcel Verdonk, Carl Poelking
  • for: Predicting the bioactivity of a ligand, one of the hardest and most important challenges in computer-aided drug discovery; despite years of data collection and curation, bioactivity data remain sparse and heterogeneous, making accurate, transferable, and robust predictive models difficult to build.
  • methods: A hierarchical meta-learning framework that exploits the information synergy across disparate assays by successfully accounting for assay heterogeneity.
  • results: Achieves a drastic improvement in affinity prediction across diverse protein targets and assay types compared to conventional baselines, and adapts quickly to new target contexts using very few observations, enabling large-scale virtual screening in early-phase drug discovery.
    Abstract Predicting the bioactivity of a ligand is one of the hardest and most important challenges in computer-aided drug discovery. Despite years of data collection and curation efforts by research organizations worldwide, bioactivity data remains sparse and heterogeneous, thus hampering efforts to build predictive models that are accurate, transferable and robust. The intrinsic variability of the experimental data is further compounded by data aggregation practices that neglect heterogeneity to overcome sparsity. Here we discuss the limitations of these practices and present a hierarchical meta-learning framework that exploits the information synergy across disparate assays by successfully accounting for assay heterogeneity. We show that the model achieves a drastic improvement in affinity prediction across diverse protein targets and assay types compared to conventional baselines. It can quickly adapt to new target contexts using very few observations, thus enabling large-scale virtual screening in early-phase drug discovery.

MovePose: A High-performance Human Pose Estimation Algorithm on Mobile and Edge Devices

  • paper_url: http://arxiv.org/abs/2308.09084
  • repo_url: None
  • paper_authors: Dongyang Yu, Haoyue Zhang, Zhirui Zhou, Wangpeng An, Yanhong Yang
  • for: 这篇论文旨在提供一种高精度且满足实时性要求的人体姿态估计模型,特别面向基于 CPU 的移动与边缘设备。
  • methods: 该模型采用优化的轻量级卷积神经网络,并结合反卷积、大核卷积和坐标分类三种方法来提升精度与实时性。
  • results: 该模型在 COCO 验证集上取得 67.7 的 Mean Average Precision(mAP),在 Intel i9-10920x CPU 上达到 69+ fps,在 NVIDIA RTX3090 GPU 上达到 452+ fps,在搭载 Snapdragon 8 + 4G 处理器的 Android 手机上帧率超过 11 fps。
    Abstract We present MovePose, an optimized lightweight convolutional neural network designed specifically for real-time body pose estimation on CPU-based mobile devices. The current solutions do not provide satisfactory accuracy and speed for human posture estimation, and MovePose addresses this gap. It aims to maintain real-time performance while improving the accuracy of human posture estimation for mobile devices. The network produces 17 keypoints for each individual at a rate exceeding 11 frames per second, making it suitable for real-time applications such as fitness tracking, sign language interpretation, and advanced mobile human posture estimation. Our MovePose algorithm has attained an Mean Average Precision (mAP) score of 67.7 on the COCO \cite{cocodata} validation dataset. The MovePose algorithm displayed efficiency with a performance of 69+ frames per second (fps) when run on an Intel i9-10920x CPU. Additionally, it showcased an increased performance of 452+ fps on an NVIDIA RTX3090 GPU. On an Android phone equipped with a Snapdragon 8 + 4G processor, the fps reached above 11. To enhance accuracy, we incorporated three techniques: deconvolution, large kernel convolution, and coordinate classification methods. Compared to basic upsampling, deconvolution is trainable, improves model capacity, and enhances the receptive field. Large kernel convolution strengthens these properties at a decreased computational cost. In summary, MovePose provides high accuracy and real-time performance, marking it a potential tool for a variety of applications, including those focused on mobile-side human posture estimation. The code and models for this algorithm will be made publicly accessible.
    摘要 我们提出了 MovePose,一种专为基于 CPU 的移动设备设计的优化轻量级卷积神经网络,用于实时人体姿态估计。现有方案在精度和速度上均不能令人满意,MovePose 填补了这一空白:它在保持实时性能的同时提升人体姿态估计的精度。该网络以超过每秒 11 帧的速率为每个人输出 17 个关键点,适用于健身跟踪、手语翻译和高级移动端人体姿态估计等实时应用。MovePose 在 COCO 验证集上取得 67.7 的 Mean Average Precision(mAP),在 Intel i9-10920x CPU 上达到 69+ fps,在 NVIDIA RTX3090 GPU 上达到 452+ fps,在搭载 Snapdragon 8 + 4G 处理器的 Android 手机上帧率超过 11。为提升精度,我们引入了三种技术:反卷积、大核卷积和坐标分类。相比简单的上采样,反卷积是可训练的,能提升模型容量并扩大感受野;大核卷积以更低的计算代价强化了这些特性。总之,MovePose 兼具高精度与实时性能,是移动端人体姿态估计等多种应用的潜在工具。代码和模型将公开。
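
The sketch below illustrates the three accuracy techniques named in the abstract in PyTorch; all layer sizes and the SimCC-style formulation of coordinate classification are assumptions for illustration, not MovePose's published architecture.

```python
import torch
import torch.nn as nn

class PoseHead(nn.Module):
    """Toy keypoint head combining: (1) a trainable deconvolution instead of
    plain upsampling, (2) a large-kernel depthwise convolution to widen the
    receptive field cheaply, and (3) coordinate classification, where each
    keypoint's x/y location is a 1-D classification over discretized bins."""
    def __init__(self, in_ch=128, num_kpts=17, width=48, height=64, bins=2):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, 64, kernel_size=4, stride=2, padding=1)
        self.large_kernel = nn.Conv2d(64, 64, kernel_size=7, padding=3, groups=64)
        self.point = nn.Conv2d(64, num_kpts, kernel_size=1)
        self.cls_x = nn.Linear(width * height, width * bins)   # sub-pixel x bins
        self.cls_y = nn.Linear(width * height, height * bins)  # sub-pixel y bins

    def forward(self, feat):
        h = self.point(torch.relu(self.large_kernel(self.deconv(feat))))
        flat = h.flatten(2)                        # (B, K, H*W)
        return self.cls_x(flat), self.cls_y(flat)  # per-keypoint x/y logits

out_x, out_y = PoseHead()(torch.randn(1, 128, 32, 24))
print(out_x.shape, out_y.shape)  # torch.Size([1, 17, 96]) torch.Size([1, 17, 128])
```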

Over-the-Air Computation Aided Federated Learning with the Aggregation of Normalized Gradient

  • paper_url: http://arxiv.org/abs/2308.09082
  • repo_url: None
  • paper_authors: Rongfei Fan, Xuming An, Shiyuan Zuo, Han Hu
  • for: 这篇论文旨在为空中计算(over-the-air computation)辅助的联邦学习(FL)提出一种聚合归一化梯度的方法,以提升收敛性能。
  • methods: 系统采用迭代流程:每个移动设备更新本地梯度,放大后发送;服务器一次性接收聚合后的梯度,生成并广播新的模型参数。在放大因子的选择上,多数相关工作假设本地梯度的最大范数在每次迭代都会出现,而它实际上随迭代波动,这会损害收敛性能。为解决该问题,我们提议在放大之前先将本地梯度归一化。
  • results: 在损失函数平滑的情况下,所提方法能以次线性速率收敛到驻点;在平滑且强凸的情况下,对任意小的正容差,能以线性速率达到最小训练损失,并发现了收敛速率与容差之间的权衡。此外,针对上述两种情形还给出了优化系统参数的问题,虽然问题非凸,仍推导出了具有多项式复杂度的最优解。实验结果表明,所提方法在收敛性能上优于基准方法。
    Abstract Over-the-air computation is a communication-efficient solution for federated learning (FL). In such a system, iterative procedure is performed: Local gradient of private loss function is updated, amplified and then transmitted by every mobile device; the server receives the aggregated gradient all-at-once, generates and then broadcasts updated model parameters to every mobile device. In terms of amplification factor selection, most related works suppose the local gradient's maximal norm always happens although it actually fluctuates over iterations, which may degrade convergence performance. To circumvent this problem, we propose to turn local gradient to be normalized one before amplifying it. Under our proposed method, when the loss function is smooth, we prove our proposed method can converge to stationary point at sub-linear rate. In case of smooth and strongly convex loss function, we prove our proposed method can achieve minimal training loss at linear rate with any small positive tolerance. Moreover, a tradeoff between convergence rate and the tolerance is discovered. To speedup convergence, problems optimizing system parameters are also formulated for above two cases. Although being non-convex, optimal solution with polynomial complexity of the formulated problems are derived. Experimental results show our proposed method can outperform benchmark methods on convergence performance.
    摘要 空中计算(over-the-air computation)是联邦学习(FL)的一种通信高效的解决方案。在这种系统中执行迭代流程:每个移动设备更新私有损失函数的本地梯度,放大后发送;服务器一次性接收聚合后的梯度,生成并向每个移动设备广播更新后的模型参数。在放大因子的选择上,多数相关工作假设本地梯度的最大范数在每次迭代都会出现,而它实际上随迭代波动,这可能损害收敛性能。为规避该问题,我们提议在放大之前先将本地梯度归一化。在损失函数平滑的情况下,我们证明所提方法能以次线性速率收敛到驻点;在平滑且强凸的情况下,对任意小的正容差,能以线性速率达到最小训练损失,并发现了收敛速率与容差之间的权衡。为加速收敛,我们还针对上述两种情形给出了优化系统参数的问题;虽然问题非凸,仍推导出了具有多项式复杂度的最优解。实验结果表明,所提方法在收敛性能上优于基准方法。
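
A toy numpy sketch of the core idea: each device normalizes its gradient before amplification, so the transmit power budget no longer hinges on a worst-case norm assumption. The ideal-superposition channel, noise level, and scaling constants are illustrative assumptions, not the paper's power-control scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 10, 5  # number of devices, model dimension
# Heterogeneous local gradients whose norms fluctuate widely across devices.
grads = rng.normal(size=(K, d)) * rng.uniform(0.1, 10.0, size=(K, 1))

def ota_sum(signals, noise_std=0.05):
    """Ideal over-the-air aggregation: signals superpose, plus receiver noise."""
    return signals.sum(axis=0) + rng.normal(scale=noise_std, size=signals.shape[1])

# Conventional scheme: scale by an assumed worst-case norm. When actual norms
# fall below it, the received SNR of those devices' contributions is wasted.
g_max = np.linalg.norm(grads, axis=1).max()
conventional = ota_sum(grads / g_max) * g_max / K

# Normalized-gradient scheme (sketch): every device transmits a unit-norm
# gradient, so each transmission uses the full budget regardless of fluctuation.
unit = grads / np.linalg.norm(grads, axis=1, keepdims=True)
aggregated_direction = ota_sum(unit) / K  # server updates along this direction

print("conventional estimate:", conventional)
print("normalized direction: ", aggregated_direction)
```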

Conditional Sampling of Variational Autoencoders via Iterated Approximate Ancestral Sampling

  • paper_url: http://arxiv.org/abs/2308.09078
  • repo_url: None
  • paper_authors: Vaidotas Simkus, Michael U. Gutmann
  • for: 解决变分自编码器(VAE)在学习结构化潜空间时 Metropolis-within-Gibbs(MWG)采样器的局限性,使条件采样(如缺失数据填补)更可靠。
  • methods: 系统地梳理了 MWG 在 VAE 场景下的缺陷,并提出两种原创方法加以缓解。
  • results: 在一系列采样任务上的实验表明,所提方法提升了采样性能。
    Abstract Conditional sampling of variational autoencoders (VAEs) is needed in various applications, such as missing data imputation, but is computationally intractable. A principled choice for asymptotically exact conditional sampling is Metropolis-within-Gibbs (MWG). However, we observe that the tendency of VAEs to learn a structured latent space, a commonly desired property, can cause the MWG sampler to get "stuck" far from the target distribution. This paper mitigates the limitations of MWG: we systematically outline the pitfalls in the context of VAEs, propose two original methods that address these pitfalls, and demonstrate an improved performance of the proposed methods on a set of sampling tasks.
    摘要 变分自编码器(VAE)的条件采样在缺失数据填补等多种应用中都是必需的,但其精确计算是不可行的。Metropolis-within-Gibbs(MWG)是一种有原则、渐近精确的条件采样方法。然而我们观察到,VAE 倾向于学习结构化的潜空间(这通常是人们所期望的性质),这反而会使 MWG 采样器"卡"在远离目标分布的区域。本文缓解了 MWG 的这些局限:我们系统地梳理了其在 VAE 场景下的缺陷,提出两种原创方法加以解决,并在一组采样任务上展示了所提方法的性能提升。
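
A self-contained sketch of a Metropolis-within-Gibbs imputation loop for a VAE, assuming PyTorch. The random-weight linear "VAE" below is a stand-in for a trained model and the fixed encoder scale is a simplification; the MH acceptance ratio follows the standard pseudo-Gibbs-with-correction construction, not necessarily the exact variant analyzed in the paper.

```python
import torch

torch.manual_seed(0)
D, Z = 8, 2  # data and latent dimensionality
# Stand-in "VAE" with random weights (a trained model in practice).
W_enc, W_dec = torch.randn(Z, D) * 0.3, torch.randn(D, Z) * 0.3

def encode(x):  # q(z|x) = N(mu, sigma^2), sigma fixed for brevity
    return x @ W_enc.T, torch.full((Z,), 0.5)

def decode(z):  # p(x|z) = N(mu, 1)
    return z @ W_dec.T

def log_normal(v, mu, sigma):
    return (-0.5 * ((v - mu) / sigma) ** 2 - sigma.log()).sum()

def mwg_impute(x, observed, steps=200):
    """Alternate an MH update of z (proposal = amortized encoder) with
    resampling the missing entries of x from the decoder."""
    x = x.clone(); x[~observed] = 0.0
    mu, sig = encode(x); z = mu + sig * torch.randn(Z)
    for _ in range(steps):
        mu, sig = encode(x)
        z_new = mu + sig * torch.randn(Z)
        log_a = (log_normal(z_new, torch.zeros(Z), torch.ones(Z))
                 + log_normal(x, decode(z_new), torch.ones(D))
                 + log_normal(z, mu, sig)
                 - log_normal(z, torch.zeros(Z), torch.ones(Z))
                 - log_normal(x, decode(z), torch.ones(D))
                 - log_normal(z_new, mu, sig))
        if torch.rand(()).log() < log_a:  # persistent rejection here is the
            z = z_new                     # "stuck chain" failure mode the
        x_mean = decode(z)                # paper sets out to mitigate
        x[~observed] = x_mean[~observed] + torch.randn(int((~observed).sum()))
    return x

mask = torch.tensor([True] * 4 + [False] * 4)  # first half observed
print(mwg_impute(torch.randn(D), mask))
```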

Fast Decision Support for Air Traffic Management at Urban Air Mobility Vertiports using Graph Learning

  • paper_url: http://arxiv.org/abs/2308.09075
  • repo_url: None
  • paper_authors: Prajit KrisshnaKumar, Jhoel Witter, Steve Paul, Hanvit Cho, Karthik Dantu, Souma Chowdhury
  • for: 为城市与郊区提供缓解拥堵、安全且快速的出行方式,即城市空中交通(UAM)这一新兴解决方案,并解决其垂直起降机场(vertiport)的调度管理问题。
  • methods: 使用图强化学习(graph reinforcement learning)生成决策支持策略:将起降场内的物理停机位与被管理的飞行器分别表示为两个图,经图卷积网络(GCN)提取特征后,由感知机层决定悬停/巡航、继续等待/起飞或降落到指定停机位等动作。
  • results: 通过在 AirSim 仿真器中对按比例缩小的多旋翼飞行器进行真实感模拟,结果表明图强化学习适用于 UAM-VSM 问题,且优于基础强化学习(带图嵌入)和随机选择基线。
    Abstract Urban Air Mobility (UAM) promises a new dimension to decongested, safe, and fast travel in urban and suburban hubs. These UAM aircraft are conceived to operate from small airports called vertiports each comprising multiple take-off/landing and battery-recharging spots. Since they might be situated in dense urban areas and need to handle many aircraft landings and take-offs each hour, managing this schedule in real-time becomes challenging for a traditional air-traffic controller but instead calls for an automated solution. This paper provides a novel approach to this problem of Urban Air Mobility - Vertiport Schedule Management (UAM-VSM), which leverages graph reinforcement learning to generate decision-support policies. Here the designated physical spots within the vertiport's airspace and the vehicles being managed are represented as two separate graphs, with feature extraction performed through a graph convolutional network (GCN). Extracted features are passed onto perceptron layers to decide actions such as continue to hover or cruise, continue idling or take-off, or land on an allocated vertiport spot. Performance is measured based on delays, safety (no. of collisions) and battery consumption. Through realistic simulations in AirSim applied to scaled down multi-rotor vehicles, our results demonstrate the suitability of using graph reinforcement learning to solve the UAM-VSM problem and its superiority to basic reinforcement learning (with graph embeddings) or random choice baselines.
    摘要 城市空中交通(UAM)有望为城市和郊区枢纽带来缓解拥堵、安全且快速的出行新维度。这类 UAM 飞行器设想从称为垂直起降机场(vertiport)的小型机场起降,每个机场包含多个起降位和电池充电位。由于机场可能位于人口密集的城区,且每小时需处理大量起降,对传统空管人员而言实时管理这一调度十分困难,因而需要自动化方案。本文针对城市空中交通垂直起降机场调度管理(UAM-VSM)问题提出了一种新方法,利用图强化学习生成决策支持策略:将起降场空域内的指定物理停机位与被管理的飞行器表示为两个独立的图,通过图卷积网络(GCN)提取特征,再经感知机层决定继续悬停或巡航、继续等待或起飞、或降落到分配的停机位等动作。性能以延误、安全性(碰撞次数)和电池消耗来衡量。在 AirSim 中对按比例缩小的多旋翼飞行器进行的真实感仿真表明,图强化学习适用于求解 UAM-VSM 问题,且优于基础强化学习(带图嵌入)和随机选择基线。

Joint Power Control and Data Size Selection for Over-the-Air Computation Aided Federated Learning

  • paper_url: http://arxiv.org/abs/2308.09072
  • repo_url: https://github.com/anxuming/fedaircomp
  • paper_authors: Xuming An, Rongfei Fan, Shiyuan Zuo, Han Hu, Hai Jiang, Ning Zhang
  • for: 这项研究旨在改进联邦学习(FL)中基站(BS)对多个移动设备(MD)训练模型参数的迭代聚合。
  • methods: 我们采用频谱高效的空中计算(over-the-air computation)方案,让所有 MD 同时向 BS 发送参数映射信号,并联合优化 BS 与各 MD 的信号放大因子以及每个 MD 参与本地训练的数据量。
  • results: 所提方法能显著降低均方误差(MSE),并相比基准方法提升 FL 的训练性能。
    Abstract Federated learning (FL) has emerged as an appealing machine learning approach to deal with massive raw data generated at multiple mobile devices, {which needs to aggregate the training model parameter of every mobile device at one base station (BS) iteratively}. For parameter aggregating in FL, over-the-air computation is a spectrum-efficient solution, which allows all mobile devices to transmit their parameter-mapped signals concurrently to a BS. Due to heterogeneous channel fading and noise, there exists difference between the BS's received signal and its desired signal, measured as the mean-squared error (MSE). To minimize the MSE, we propose to jointly optimize the signal amplification factors at the BS and the mobile devices as well as the data size (the number of data samples involved in local training) at every mobile device. The formulated problem is challenging to solve due to its non-convexity. To find the optimal solution, with some simplification on cost function and variable replacement, which still preserves equivalence, we transform the changed problem to be a bi-level problem equivalently. For the lower-level problem, optimal solution is found by enumerating every candidate solution from the Karush-Kuhn-Tucker (KKT) condition. For the upper-level problem, the optimal solution is found by exploring its piecewise convexity. Numerical results show that our proposed method can greatly reduce the MSE and can help to improve the training performance of FL compared with benchmark methods.
    摘要 联邦学习(FL)已成为处理多个移动设备产生的海量原始数据的有吸引力的机器学习方法,它需要在基站(BS)上迭代地聚合每个移动设备的训练模型参数。在参数聚合方面,空中计算是一种频谱高效的方案,允许所有移动设备同时向 BS 发送参数映射信号。由于信道衰落和噪声的异质性,BS 接收到的信号与期望信号之间存在差异,以均方误差(MSE)度量。为最小化 MSE,我们提议联合优化 BS 与各移动设备的信号放大因子,以及每个移动设备的数据量(参与本地训练的样本数)。该问题由于非凸性难以求解。为寻找最优解,我们在保持等价性的前提下对代价函数进行简化并作变量替换,将问题等价地转化为一个双层问题:下层问题通过枚举 Karush-Kuhn-Tucker(KKT)条件给出的候选解求得最优解,上层问题则利用其分段凸性求解。数值结果表明,所提方法能大幅降低 MSE,并相比基准方法提升 FL 的训练性能。

eess.IV - 2023-08-18

Uncertainty-based quality assurance of carotid artery wall segmentation in black-blood MRI

  • paper_url: http://arxiv.org/abs/2308.09538
  • repo_url: https://github.com/miagrouput/carotid-segmentation
  • paper_authors: Elina Thibeau-Sutre, Dieuwertje Alblas, Sophie Buurman, Christoph Brune, Jelmer M. Wolterink
  • for: 这项研究旨在为深度学习模型应用于大规模数据集提供自动质量保证手段。
  • methods: 采用一种全自动算法对黑血(black-blood)MRI 中的颈动脉壁进行分割:在以颈动脉为中心的 3D 图像块中识别嵌套的动脉壁,并用 Monte Carlo dropout 或测试时数据增强来估计模型对轮廓位置预测的不确定性。
  • results: 引入不确定性度量不会降低分割质量;当第一步找到的中心位于颈动脉管腔内时,不确定性指标可以很好地代理轮廓质量(以 Dice 相似系数衡量),并可用于在受试者层面自动检出低质量分割。
    Abstract The application of deep learning models to large-scale data sets requires means for automatic quality assurance. We have previously developed a fully automatic algorithm for carotid artery wall segmentation in black-blood MRI that we aim to apply to large-scale data sets. This method identifies nested artery walls in 3D patches centered on the carotid artery. In this study, we investigate to what extent the uncertainty in the model predictions for the contour location can serve as a surrogate for error detection and, consequently, automatic quality assurance. We express the quality of automatic segmentations using the Dice similarity coefficient. The uncertainty in the model's prediction is estimated using either Monte Carlo dropout or test-time data augmentation. We found that (1) including uncertainty measurements did not degrade the quality of the segmentations, (2) uncertainty metrics provide a good proxy of the quality of our contours if the center found during the first step is enclosed in the lumen of the carotid artery and (3) they could be used to detect low-quality segmentations at the participant level. This automatic quality assurance tool might enable the application of our model in large-scale data sets.
    摘要 将深度学习模型应用于大规模数据集需要自动质量保证手段。我们此前开发了一种针对黑血 MRI 的全自动颈动脉壁分割算法,并希望将其应用于大规模数据集。该方法在以颈动脉为中心的 3D 图像块中识别嵌套的动脉壁。本研究考察模型对轮廓位置预测的不确定性能否作为错误检测的替代指标,进而用于自动质量保证。我们用 Dice 相似系数表示自动分割的质量,并通过 Monte Carlo dropout 或测试时数据增强来估计模型预测的不确定性。我们发现:(1) 引入不确定性度量不会降低分割质量;(2) 当第一步找到的中心位于颈动脉管腔内时,不确定性指标能够很好地代理轮廓质量;(3) 这些指标可用于在受试者层面检出低质量分割。这一自动质量保证工具有望使我们的模型应用于大规模数据集。
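
The Monte Carlo dropout half of the uncertainty estimation is easy to sketch in PyTorch: keep dropout layers active at test time and average several stochastic forward passes. The tiny segmenter below is hypothetical; only the mechanism is shown.

```python
import torch
import torch.nn as nn

def mc_dropout_uncertainty(model: nn.Module, image: torch.Tensor, T: int = 20):
    """Re-enable only the dropout layers in eval mode, then aggregate T
    stochastic passes into a mean segmentation and a scalar uncertainty."""
    model.eval()
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
            m.train()  # dropout stays stochastic while norm layers stay fixed
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(model(image)) for _ in range(T)])
    mean_seg = probs.mean(dim=0)            # soft segmentation
    uncertainty = probs.var(dim=0).mean()   # surrogate for contour quality
    return mean_seg, uncertainty

# Toy usage: a high `uncertainty` would flag this case for manual review
# instead of trusting the automatic contour.
net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                    nn.Dropout2d(0.5), nn.Conv2d(8, 1, 1))
seg, unc = mc_dropout_uncertainty(net, torch.randn(1, 1, 64, 64))
print(seg.shape, float(unc))
```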

INR-LDDMM: Fluid-based Medical Image Registration Integrating Implicit Neural Representation and Large Deformation Diffeomorphic Metric Mapping

  • paper_url: http://arxiv.org/abs/2308.09473
  • repo_url: None
  • paper_authors: Chulong Zhang, Xiaokun Liang
  • for: 这篇论文提出了一种基于隐式神经表示的流式(fluid-based)医学影像配准框架。
  • methods: 将隐式神经表示与大形变微分同胚度量映射(LDDMM)相结合,使用多层感知机(MLP)作为速度场生成器,同时优化速度场和影像相似性;并采用由粗到细的策略,以缓解可形变配准方法陷入局部最优的问题,从而处理大形变。
  • results: 在一个包含 50 名患者的 CT-CBCT 配对数据集上进行验证,以迁移标注的 Dice 系数作为评估指标,相比现有方法达到了最先进的性能。
    Abstract We propose a flow-based registration framework of medical images based on implicit neural representation. By integrating implicit neural representation and Large Deformable Diffeomorphic Metric Mapping (LDDMM), we employ a Multilayer Perceptron (MLP) as a velocity generator while optimizing velocity and image similarity. Moreover, we adopt a coarse-to-fine approach to address the challenge of deformable-based registration methods dropping into local optimal solutions, thus aiding the management of significant deformations in medical image registration. Our algorithm has been validated on a paired CT-CBCT dataset of 50 patients,taking the dice coefficient of transferred annotations as an evaluation metric. Compared to existing methods, our approach achieves the state-of-the-art performance.
    摘要 我们提出了一种基于隐式神经表示的医学影像流式配准框架。通过结合隐式神经表示与大形变微分同胚度量映射(LDDMM),我们使用多层感知机(MLP)作为速度场生成器,同时优化速度场和影像相似性。此外,我们采用由粗到细的策略,以应对可形变配准方法易陷入局部最优解的难题,从而有助于处理医学影像配准中的大形变。该算法在一个包含 50 名患者的 CT-CBCT 配对数据集上得到验证,以迁移标注的 Dice 系数作为评估指标;相比现有方法,我们的方法达到了最先进的性能。
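
A minimal sketch of the "MLP as velocity generator" idea in PyTorch. The stationary velocity field and plain Euler integration below are simplifications (LDDMM uses a time-dependent field with regularized geodesic integration), and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class VelocityINR(nn.Module):
    """Implicit neural representation of a 3-D velocity field: an MLP maps a
    normalized coordinate to a velocity vector."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 3))

    def forward(self, x):
        return self.net(x)

def integrate(v: VelocityINR, points: torch.Tensor, steps: int = 8):
    """Euler integration of dphi/dt = v(phi): small composed steps keep the
    map close to a diffeomorphism."""
    phi = points
    for _ in range(steps):
        phi = phi + v(phi) / steps
    return phi

pts = torch.rand(1000, 3)              # normalized voxel coordinates
warped = integrate(VelocityINR(), pts)
# Training would backpropagate an image-similarity loss evaluated at `warped`
# plus a smoothness penalty on the velocity field.
print(warped.shape)
```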

Quantitative Susceptibility Mapping through Model-based Deep Image Prior (MoDIP)

  • paper_url: http://arxiv.org/abs/2308.09467
  • repo_url: None
  • paper_authors: Zhuang Xiong, Yang Gao, Yin Liu, Amir Fazlollahi, Peter Nestor, Feng Liu, Hongfu Sun
  • for: 解决定量磁化率成像(QSM)中的偶极子反演问题,提升在不同对象与扫描参数下的泛化能力。
  • methods: 提出一种无需训练、基于模型的无监督方法 MoDIP(Model-based Deep Image Prior),由一个小型未训练网络和一个数据保真优化(DFO)模块组成:网络收敛到的中间状态充当图像正则化的隐式先验,优化过程则强制满足 QSM 偶极子反演的物理模型。
  • results: 实验结果表明,MoDIP 在不同扫描参数下求解 QSM 偶极子反演时具有出色的泛化能力,对病变脑 QSM 也很稳健,准确率比有监督深度学习和传统迭代方法提升超过 32%;同时其计算效率比传统基于 DIP 的方法高 33%,速度快 4 倍,可在 4.5 分钟内完成 3D 高分辨率图像重建。
    Abstract The data-driven approach of supervised learning methods has limited applicability in solving dipole inversion in Quantitative Susceptibility Mapping (QSM) with varying scan parameters across different objects. To address this generalization issue in supervised QSM methods, we propose a novel training-free model-based unsupervised method called MoDIP (Model-based Deep Image Prior). MoDIP comprises a small, untrained network and a Data Fidelity Optimization (DFO) module. The network converges to an interim state, acting as an implicit prior for image regularization, while the optimization process enforces the physical model of QSM dipole inversion. Experimental results demonstrate MoDIP's excellent generalizability in solving QSM dipole inversion across different scan parameters. It exhibits robustness against pathological brain QSM, achieving over 32% accuracy improvement than supervised deep learning and traditional iterative methods. It is also 33% more computationally efficient and runs 4 times faster than conventional DIP-based approaches, enabling 3D high-resolution image reconstruction in under 4.5 minutes.
    摘要 有监督学习的数据驱动方法在不同对象、不同扫描参数下求解定量磁化率成像(QSM)的偶极子反演问题时适用性有限。为解决有监督 QSM 方法的这一泛化问题,我们提出一种无需训练、基于模型的无监督方法 MoDIP(Model-based Deep Image Prior)。MoDIP 由一个小型未训练网络和一个数据保真优化(DFO)模块组成:网络收敛到一个中间状态,充当图像正则化的隐式先验,而优化过程强制满足 QSM 偶极子反演的物理模型。实验结果表明,MoDIP 在不同扫描参数下求解 QSM 偶极子反演时具有出色的泛化能力,对病变脑 QSM 也很稳健,准确率比有监督深度学习和传统迭代方法提升超过 32%;同时其计算效率比传统基于 DIP 的方法高 33%,速度快 4 倍,可在 4.5 分钟内完成 3D 高分辨率图像重建。
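
A generic deep-image-prior loop with a data-fidelity term, assuming PyTorch. The random linear operator `A` stands in for the QSM dipole-convolution physics, the MLP stands in for MoDIP's small network, and network fitting and data-fidelity optimization are collapsed into one loop for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n = 64
A = torch.randn(n, n) / n ** 0.5           # stand-in forward operator
x_true = torch.randn(n)
y = A @ x_true + 0.01 * torch.randn(n)     # simulated measurement

net = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, n))
seed = torch.randn(n)                      # fixed random input, as in DIP
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for it in range(500):                      # early stopping matters for DIP;
    x = net(seed)                          # the untrained net is the implicit prior
    loss = ((A @ x - y) ** 2).mean()       # data fidelity enforces the physics
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    err = (net(seed) - x_true).norm() / x_true.norm()
print(f"relative reconstruction error: {err:.3f}")
```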

Causal SAR ATR with Limited Data via Dual Invariance

  • paper_url: http://arxiv.org/abs/2308.09412
  • repo_url: https://github.com/cwwangsaratr/saratr_causal_dual_invariance
  • paper_authors: Chenwei Wang, You Qin, Li Li, Siyi Luo, Yulin Huang, Jifang Pei, Yin Zhang, Jianyu Yang
  • for: 在数据有限的条件下提升合成孔径雷达自动目标识别(SAR ATR)的泛化能力。
  • methods: 建立了一个因果 ATR 模型,说明在数据有限时原本可被充足数据屏蔽的噪声 N 会成为混杂因子,损害从 SAR 图像提取的特征 X 的有效性;并提出由类内不变代理与噪声不变损失组成的双重不变性,通过后门调整估计并消除 N 对 X 的影响。
  • results: 在三个基准数据集上的实验表明,所提方法性能优越。
    Abstract Synthetic aperture radar automatic target recognition (SAR ATR) with limited data has recently been a hot research topic to enhance weak generalization. Despite many excellent methods being proposed, a fundamental theory is lacked to explain what problem the limited SAR data causes, leading to weak generalization of ATR. In this paper, we establish a causal ATR model demonstrating that noise $N$ that could be blocked with ample SAR data, becomes a confounder with limited data for recognition. As a result, it has a detrimental causal effect damaging the efficacy of feature $X$ extracted from SAR images, leading to weak generalization of SAR ATR with limited data. The effect of $N$ on feature can be estimated and eliminated by using backdoor adjustment to pursue the direct causality between $X$ and the predicted class $Y$. However, it is difficult for SAR images to precisely estimate and eliminated the effect of $N$ on $X$. The limited SAR data scarcely powers the majority of existing optimization losses based on empirical risk minimization (ERM), thus making it difficult to effectively eliminate $N$'s effect. To tackle with difficult estimation and elimination of $N$'s effect, we propose a dual invariance comprising the inner-class invariant proxy and the noise-invariance loss. Motivated by tackling change with invariance, the inner-class invariant proxy facilitates precise estimation of $N$'s effect on $X$ by obtaining accurate invariant features for each class with the limited data. The noise-invariance loss transitions the ERM's data quantity necessity into a need for noise environment annotations, effectively eliminating $N$'s effect on $X$ by cleverly applying the previous $N$'s estimation as the noise environment annotations. Experiments on three benchmark datasets indicate that the proposed method achieves superior performance.
    摘要 数据有限条件下的合成孔径雷达自动目标识别(SAR ATR)近来成为提升弱泛化能力的研究热点。尽管已有许多优秀方法被提出,但一直缺乏一个基础理论来解释有限 SAR 数据究竟造成了什么问题、为何导致 ATR 泛化能力弱。本文建立了一个因果 ATR 模型,说明在 SAR 数据充足时本可被屏蔽的噪声 N,在数据有限时会成为识别过程中的混杂因子,对从 SAR 图像提取的特征 X 产生有害的因果效应,进而导致有限数据下 SAR ATR 的弱泛化。N 对特征的影响可以通过后门调整加以估计和消除,从而追求 X 与预测类别 Y 之间的直接因果关系。然而,SAR 图像难以精确估计并消除 N 对 X 的影响:有限的 SAR 数据难以支撑基于经验风险最小化(ERM)的大多数现有优化损失,使 N 的影响难以被有效消除。为应对这一困难,我们提出由类内不变代理与噪声不变损失组成的双重不变性:类内不变代理在有限数据下为每个类别获得准确的不变特征,从而便于精确估计 N 对 X 的影响;噪声不变损失则把 ERM 对数据量的需求转化为对噪声环境标注的需求,巧妙地将先前对 N 的估计用作噪声环境标注,从而有效消除 N 对 X 的影响。在三个基准数据集上的实验表明,所提方法性能优越。

Unveiling Causalities in SAR ATR: A Causal Interventional Approach for Limited Data

  • paper_url: http://arxiv.org/abs/2308.09396
  • repo_url: None
  • paper_authors: Chenwei Wang, Xin Chen, You Qin, Siyi Luo, Yulin Huang, Jifang Pei, Jianyu Yang
  • for: 提升合成孔径雷达自动目标识别(SAR ATR)方法在 SAR 数据有限情况下的有效性。
  • methods: 提出一种因果干预 ATR 方法(CIATR):利用因果推断构建结构因果模型(SCM),理解在数据有限时成像条件如何作为混杂因子引入虚假相关,并通过后门调整消除这一混杂的因果效应;具体实现上,先用空间-频率域混合变换做数据增强以估计成像条件变化对 SAR 图像的潜在影响,再引入混合相似度度量的特征判别方法,度量并抑制成像条件变化对所提特征的结构与向量角影响。
  • results: 即使 SAR 数据有限,该方法也能追求 SAR 图像与对应类别之间的真实因果关系;在 MSTAR 和 OpenSARship 数据集上的实验与对比验证了其有效性。
    Abstract Synthetic aperture radar automatic target recognition (SAR ATR) methods fall short with limited training data. In this letter, we propose a causal interventional ATR method (CIATR) to formulate the problem of limited SAR data which helps us uncover the ever-elusive causalities among the key factors in ATR, and thus pursue the desired causal effect without changing the imaging conditions. A structural causal model (SCM) is comprised using causal inference to help understand how imaging conditions acts as a confounder introducing spurious correlation when SAR data is limited. This spurious correlation among SAR images and the predicted classes can be fundamentally tackled with the conventional backdoor adjustments. An effective implement of backdoor adjustments is proposed by firstly using data augmentation with spatial-frequency domain hybrid transformation to estimate the potential effect of varying imaging conditions on SAR images. Then, a feature discrimination approach with hybrid similarity measurement is introduced to measure and mitigate the structural and vector angle impacts of varying imaging conditions on the extracted features from SAR images. Thus, our CIATR can pursue the true causality between SAR images and the corresponding classes even with limited SAR data. Experiments and comparisons conducted on the moving and stationary target acquisition and recognition (MSTAR) and OpenSARship datasets have shown the effectiveness of our method with limited SAR data.
    摘要 合成孔径雷达自动目标识别(SAR ATR)方法在训练数据有限时表现不佳。本文提出一种因果干预 ATR 方法(CIATR)来刻画 SAR 数据有限这一问题,帮助我们揭示 ATR 关键因素之间难以捉摸的因果关系,从而在不改变成像条件的情况下追求期望的因果效应。我们利用因果推断构建结构因果模型(SCM),以理解在 SAR 数据有限时成像条件如何作为混杂因子引入虚假相关。SAR 图像与预测类别之间的这种虚假相关可以通过常规的后门调整从根本上加以解决。我们给出了后门调整的一种有效实现:首先利用空间-频率域混合变换进行数据增强,估计成像条件变化对 SAR 图像的潜在影响;然后引入基于混合相似度度量的特征判别方法,度量并抑制成像条件变化对 SAR 图像所提特征的结构与向量角影响。由此,CIATR 即使在 SAR 数据有限时也能追求 SAR 图像与对应类别之间的真实因果关系。在 MSTAR 和 OpenSARship 数据集上的实验与对比表明了该方法在数据有限时的有效性。

SAMedOCT: Adapting Segment Anything Model (SAM) for Retinal OCT

  • paper_url: http://arxiv.org/abs/2308.09331
  • repo_url: None
  • paper_authors: Botond Fazekas, José Morano, Dmitrii Lachinov, Guilherme Aresta, Hrvoje Bogunović
  • for: 这项研究旨在评估 Segment Anything Model(SAM)在视网膜 OCT 扫描图像分割中的适用性与可靠性。
  • methods: 使用 SAM 模型及其适配版本,针对视网膜 OCT 扫描图像的特点进行调整与评估。
  • results: 在 RETOUCH 挑战赛的大规模公共数据集上,适配后的 SAM 表现出色,评估覆盖多种视网膜疾病、液体腔与设备厂商;但在某些情形下仍落后于成熟的专用方法。研究显示适配 SAM 具有良好的适应性与稳健性,可作为视网膜 OCT 图像分析中的有用工具。
    Abstract The Segment Anything Model (SAM) has gained significant attention in the field of image segmentation due to its impressive capabilities and prompt-based interface. While SAM has already been extensively evaluated in various domains, its adaptation to retinal OCT scans remains unexplored. To bridge this research gap, we conduct a comprehensive evaluation of SAM and its adaptations on a large-scale public dataset of OCTs from RETOUCH challenge. Our evaluation covers diverse retinal diseases, fluid compartments, and device vendors, comparing SAM against state-of-the-art retinal fluid segmentation methods. Through our analysis, we showcase adapted SAM's efficacy as a powerful segmentation model in retinal OCT scans, although still lagging behind established methods in some circumstances. The findings highlight SAM's adaptability and robustness, showcasing its utility as a valuable tool in retinal OCT image analysis and paving the way for further advancements in this domain.
    摘要 Segment Anything Model(SAM)凭借出色的能力和基于提示的交互界面,在图像分割领域获得了广泛关注。尽管 SAM 已在多个领域得到充分评估,其在视网膜 OCT 扫描图像上的适配仍未被探索。为填补这一研究空白,我们在 RETOUCH 挑战赛的大规模公共 OCT 数据集上对 SAM 及其适配版本进行了全面评估,覆盖多种视网膜疾病、液体腔和设备厂商,并与最先进的视网膜液体分割方法进行比较。分析表明,适配后的 SAM 是视网膜 OCT 扫描中强大的分割模型,尽管在某些情形下仍落后于成熟方法。这些发现凸显了 SAM 的适应性与稳健性,展示了其作为视网膜 OCT 图像分析有用工具的价值,并为该领域的进一步发展铺平了道路。

Advancing Intra-operative Precision: Dynamic Data-Driven Non-Rigid Registration for Enhanced Brain Tumor Resection in Image-Guided Neurosurgery

  • paper_url: http://arxiv.org/abs/2308.10868
  • repo_url: None
  • paper_authors: Nikos Chrisochoides, Andriy Fedorov, Fotis Drakopoulos, Andriy Kot, Yixun Liu, Panos Foteinos, Angelos Angelopoulos, Olivier Clatz, Nicholas Ayache, Peter M. Black, Alex J. Golby, Ron Kikinis
  • for: used to improve the accuracy of brain tumor removal during neurosurgery
  • methods: uses Dynamic Data-Driven Non-Rigid Registration (NRR) to adjust pre-operative image data for intra-operative brain shift
  • results: enables NRR results to be delivered within clinical time constraints while leveraging Distributed Computing and Machine Learning to enhance registration accuracy
    Abstract During neurosurgery, medical images of the brain are used to locate tumors and critical structures, but brain tissue shifts make pre-operative images unreliable for accurate removal of tumors. Intra-operative imaging can track these deformations but is not a substitute for pre-operative data. To address this, we use Dynamic Data-Driven Non-Rigid Registration (NRR), a complex and time-consuming image processing operation that adjusts the pre-operative image data to account for intra-operative brain shift. Our review explores a specific NRR method for registering brain MRI during image-guided neurosurgery and examines various strategies for improving the accuracy and speed of the NRR method. We demonstrate that our implementation enables NRR results to be delivered within clinical time constraints while leveraging Distributed Computing and Machine Learning to enhance registration accuracy by identifying optimal parameters for the NRR method. Additionally, we highlight challenges associated with its use in the operating room.
    摘要 在神经外科手术中,脑部医学影像被用于定位肿瘤和关键结构,但脑组织的移位使术前影像难以支撑肿瘤的精确切除。术中成像可以跟踪这些形变,但不能替代术前数据。为此,我们采用动态数据驱动的非刚性配准(NRR),这是一种复杂且耗时的图像处理操作,对术前影像数据进行调整以补偿术中脑移位。本文综述了一种用于影像引导神经外科中脑部 MRI 配准的具体 NRR 方法,并考察了提升其精度与速度的多种策略。我们展示了该实现能够在临床时间限制内给出 NRR 结果,同时借助分布式计算与机器学习为 NRR 方法寻找最优参数以提升配准精度。此外,我们还指出了其在手术室中应用所面临的挑战。

JPEG Quantized Coefficient Recovery via DCT Domain Spatial-Frequential Transformer

  • paper_url: http://arxiv.org/abs/2308.09110
  • repo_url: None
  • paper_authors: Mingyu Ouyang, Zhenzhong Chen
  • for: 本研究旨在提出一种在 DCT 域恢复 JPEG 量化系数的方法,以提升恢复效果并适应广泛的质量因子。
  • methods: 该方法采用 DCT 域的双分支架构,同时捕捉同位 DCT 系数内的空间与频率相关性;并引入量化矩阵嵌入,使单一模型能够处理广泛的质量因子,以及亮度-色度对齐头,以对齐尺寸不同的亮度与色度分量。
  • results: 大量实验表明,所提 DCTransformer 在多种质量因子和图像内容下均优于现有最先进的 JPEG 伪影去除方法。
    Abstract JPEG compression adopts the quantization of Discrete Cosine Transform (DCT) coefficients for effective bit-rate reduction, whilst the quantization could lead to a significant loss of important image details. Recovering compressed JPEG images in the frequency domain has attracted more and more attention recently, in addition to numerous restoration approaches developed in the pixel domain. However, the current DCT domain methods typically suffer from limited effectiveness in handling a wide range of compression quality factors, or fall short in recovering sparse quantized coefficients and the components across different colorspace. To address these challenges, we propose a DCT domain spatial-frequential Transformer, named as DCTransformer. Specifically, a dual-branch architecture is designed to capture both spatial and frequential correlations within the collocated DCT coefficients. Moreover, we incorporate the operation of quantization matrix embedding, which effectively allows our single model to handle a wide range of quality factors, and a luminance-chrominance alignment head that produces a unified feature map to align different-sized luminance and chrominance components. Our proposed DCTransformer outperforms the current state-of-the-art JPEG artifact removal techniques, as demonstrated by our extensive experiments.
    摘要 JPEG 压缩通过对离散余弦变换(DCT)系数进行量化来有效降低码率,但量化可能造成重要图像细节的显著丢失。除了像素域中发展出的众多恢复方法外,在频域恢复压缩后的 JPEG 图像近来受到越来越多的关注。然而,现有的 DCT 域方法通常难以应对广泛的压缩质量因子,或难以恢复稀疏的量化系数以及不同颜色空间的分量。为应对这些挑战,我们提出一种 DCT 域空间-频率 Transformer,称为 DCTransformer:设计双分支架构以同时捕捉同位 DCT 系数内的空间与频率相关性;引入量化矩阵嵌入操作,使单一模型能够有效处理广泛的质量因子;并设计亮度-色度对齐头,生成统一的特征图以对齐尺寸不同的亮度与色度分量。大量实验表明,所提 DCTransformer 优于现有最先进的 JPEG 伪影去除技术。

cs.SD - 2023-08-17

Severity Classification of Parkinson’s Disease from Speech using Single Frequency Filtering-based Features

  • paper_url: http://arxiv.org/abs/2308.09042
  • repo_url: None
  • paper_authors: Sudarsana Reddy Kadiri, Manila Kodali, Paavo Alku
  • for: 这项研究旨在提出评估帕金森病(PD)严重程度的客观方法,以改进诊断与治疗。
  • methods: 基于单频滤波(SFF)方法导出两组新特征:(1) SFF 倒谱系数(SFFCC)和 (2) 基于 SFF 的 MFCC(MFCC-SFF)。已有研究表明,SFF 相比短时傅里叶变换具有更高的时频分辨率。研究使用 PC-GITA 数据库,包含 PD 患者与健康对照在三种说话任务(元音、句子、文本朗读)下的语音。
  • results: 基于 SVM 分类器的实验表明,所提特征在三种说话任务中均优于传统 MFCC:相对 MFCC,SFFCC 与 MFCC-SFF 在元音任务中分别带来 5.8% 和 2.3% 的相对提升,句子任务中为 7.0% 和 1.8%,文本朗读任务中为 2.4% 和 1.1%。
    Abstract Developing objective methods for assessing the severity of Parkinson's disease (PD) is crucial for improving the diagnosis and treatment. This study proposes two sets of novel features derived from the single frequency filtering (SFF) method: (1) SFF cepstral coefficients (SFFCC) and (2) MFCCs from the SFF (MFCC-SFF) for the severity classification of PD. Prior studies have demonstrated that SFF offers greater spectro-temporal resolution compared to the short-time Fourier transform. The study uses the PC-GITA database, which includes speech of PD patients and healthy controls produced in three speaking tasks (vowels, sentences, text reading). Experiments using the SVM classifier revealed that the proposed features outperformed the conventional MFCCs in all three speaking tasks. The proposed SFFCC and MFCC-SFF features gave a relative improvement of 5.8% and 2.3% for the vowel task, 7.0% & 1.8% for the sentence task, and 2.4% and 1.1% for the read text task, in comparison to MFCC features.
    摘要 发展评估帕金森病(PD)严重程度的客观方法对于改进诊断与治疗至关重要。本研究基于单频滤波(SFF)方法提出两组新特征:(1) SFF 倒谱系数(SFFCC)和 (2) 基于 SFF 的 MFCC(MFCC-SFF),用于 PD 严重程度分类。已有研究表明,SFF 相比短时傅里叶变换具有更高的时频分辨率。研究使用 PC-GITA 数据库,包含 PD 患者与健康对照在三种说话任务(元音、句子、文本朗读)下的语音。基于 SVM 分类器的实验显示,所提特征在全部三种说话任务中均优于传统 MFCC:相对 MFCC,SFFCC 与 MFCC-SFF 分别在元音任务带来 5.8% 和 2.3%、句子任务 7.0% 和 1.8%、文本朗读任务 2.4% 和 1.1% 的相对提升。
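
A hedged numpy sketch of single frequency filtering and SFFCC extraction. The demodulate-then-single-pole-filter recipe below follows the commonly cited SFF formulation; the exact sign conventions, pole radius, frequency grid, and number of cepstral coefficients are assumptions, not the paper's settings.

```python
import numpy as np
from scipy.fft import dct

def sff_envelopes(x, fs, freqs, r=0.99):
    """For each analysis frequency, demodulate the signal and pass it through
    a single-pole filter with the pole near the unit circle:
    y_k[n] = -r * y_k[n-1] + x[n] * exp(-1j*(pi - w_k)*n), e_k[n] = |y_k[n]|."""
    n = np.arange(len(x))
    envs = []
    for f in freqs:
        w = np.pi - 2 * np.pi * f / fs
        s = x * np.exp(-1j * w * n)          # shift f_k to the filter's peak
        y = np.zeros(len(x), dtype=complex)
        for i in range(1, len(x)):
            y[i] = -r * y[i - 1] + s[i]
        envs.append(np.abs(y))
    return np.array(envs)                    # (num_freqs, num_samples)

fs = 16000
x = np.random.randn(fs // 10)                # 100 ms of toy "speech"
E = sff_envelopes(x, fs, freqs=np.linspace(100, 7000, 40))
# SFFCC sketch: DCT across frequency of the log envelopes, keep 13 coefficients.
sffcc = dct(np.log(E + 1e-8), axis=0, norm="ortho")[:13]
print(sffcc.shape)                           # (13, num_samples)
```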

Home monitoring for frailty detection through sound and speaker diarization analysis

  • paper_url: http://arxiv.org/abs/2308.08985
  • repo_url: None
  • paper_authors: Yannis Tevissen, Dan Istrate, Vincent Zalc, Jérôme Boudy, Gérard Chollet, Frédéric Petitpont, Sami Boutamine
  • for: 这篇论文是为了研究如何通过人类日常生活声音识别和语音存在检测来实现可靠和隐私保护的家庭监测系统。
  • methods: 这篇论文使用了最新的声音处理和 speaker diarization 技术来改进现有的嵌入式系统。
  • results: 研究发现,基于 DNN 的方法可将(说话人数检测的)性能提升约 100%。
    Abstract As the French, European and worldwide populations are aging, there is a strong interest for new systems that guarantee a reliable and privacy preserving home monitoring for frailty prevention. This work is a part of a global environmental audio analysis system which aims to help identification of Activities of Daily Life (ADL) through human and everyday life sounds recognition, speech presence and number of speakers detection. The focus is made on the number of speakers detection. In this article, we present how recent advances in sound processing and speaker diarization can improve the existing embedded systems. We study the performances of two new methods and discuss the benefits of DNN based approaches which improve performances by about 100%.
    摘要 随着法国、欧洲乃至全球人口的老龄化,人们对能够以可靠且保护隐私的方式进行居家监测、预防衰弱的新系统有着强烈需求。这项工作是一个全局环境音频分析系统的一部分,该系统旨在通过识别人声与日常生活声音、检测语音存在及说话人数,帮助识别日常生活活动(ADL)。本文聚焦于说话人数检测,介绍了声音处理与说话人分离(speaker diarization)的最新进展如何改进现有嵌入式系统,研究了两种新方法的性能,并讨论了基于深度神经网络(DNN)的方法带来约 100% 性能提升的优势。

Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

  • paper_url: http://arxiv.org/abs/2308.08926
  • repo_url: None
  • paper_authors: Ye-Xin Lu, Yang Ai, Zhen-Hua Ling
  • for: 提升语音的感知质量与可懂度。
  • methods: 提出 MP-SENet 模型,并行地显式增强幅度谱与相位谱:编码器与解码器之间以时频 Transformer 相连,解码器包含幅度与相位双流,分别采用幅度估计架构与相位并行估计架构,直接增强幅度谱与卷绕相位谱。
  • results: 在语音去噪、去混响、带宽扩展等多项任务上实现高质量语音增强;相比现有相位感知方法,成功避免了幅度与相位之间的双向补偿效应,带来更好的谐波恢复。在语音去噪任务上,于公开的 VoiceBank+DEMAND 数据集取得 3.60 的 PESQ,达到最先进水平。
    Abstract Phase information has a significant impact on speech perceptual quality and intelligibility. However, existing speech enhancement methods encounter limitations in explicit phase estimation due to the non-structural nature and wrapping characteristics of the phase, leading to a bottleneck in enhanced speech quality. To overcome the above issue, in this paper, we proposed MP-SENet, a novel Speech Enhancement Network which explicitly enhances Magnitude and Phase spectra in parallel. The proposed MP-SENet adopts a codec architecture in which the encoder and decoder are bridged by time-frequency Transformers along both time and frequency dimensions. The encoder aims to encode time-frequency representations derived from the input distorted magnitude and phase spectra. The decoder comprises dual-stream magnitude and phase decoders, directly enhancing magnitude and wrapped phase spectra by incorporating a magnitude estimation architecture and a phase parallel estimation architecture, respectively. To train the MP-SENet model effectively, we define multi-level loss functions, including mean square error and perceptual metric loss of magnitude spectra, anti-wrapping loss of phase spectra, as well as mean square error and consistency loss of short-time complex spectra. Experimental results demonstrate that our proposed MP-SENet excels in high-quality speech enhancement across multiple tasks, including speech denoising, dereverberation, and bandwidth extension. Compared to existing phase-aware speech enhancement methods, it successfully avoids the bidirectional compensation effect between the magnitude and phase, leading to a better harmonic restoration. Notably, for the speech denoising task, the MP-SENet yields a state-of-the-art performance with a PESQ of 3.60 on the public VoiceBank+DEMAND dataset.
    摘要 相位信息对语音的感知质量和可懂度有显著影响。然而,由于相位的非结构性和卷绕特性,现有语音增强方法难以显式估计相位,成为增强语音质量的瓶颈。为此,本文提出 MP-SENet,一种并行显式增强幅度谱与相位谱的新型语音增强网络。MP-SENet 采用编解码架构,编码器与解码器之间由沿时间与频率两个维度的时频 Transformer 相连:编码器对由输入失真幅度谱和相位谱得到的时频表示进行编码;解码器包含幅度与相位双流,分别通过幅度估计架构和相位并行估计架构,直接增强幅度谱与卷绕相位谱。为有效训练 MP-SENet,我们定义了多层次损失函数,包括幅度谱的均方误差与感知度量损失、相位谱的抗卷绕(anti-wrapping)损失,以及短时复数谱的均方误差与一致性损失。实验结果表明,MP-SENet 在语音去噪、去混响和带宽扩展等多项任务上均能实现高质量语音增强;相比现有相位感知方法,它成功避免了幅度与相位之间的双向补偿效应,带来更好的谐波恢复。值得一提的是,在语音去噪任务上,MP-SENet 在公开的 VoiceBank+DEMAND 数据集上取得 3.60 的 PESQ,达到最先进水平。
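
The anti-wrapping idea is easy to state concretely: measure phase error through its principal value so that predictions off by a multiple of 2π incur no loss. A minimal PyTorch sketch follows; MP-SENet's full objective combines several such terms with magnitude and complex-spectrum losses, whose exact weighting is not reproduced here.

```python
import math
import torch

def anti_wrapping(x: torch.Tensor) -> torch.Tensor:
    """Principal-value distance on the unit circle:
    f_AW(x) = |x - 2*pi*round(x / (2*pi))|."""
    two_pi = 2 * math.pi
    return torch.abs(x - two_pi * torch.round(x / two_pi))

def phase_loss(phase_pred: torch.Tensor, phase_true: torch.Tensor) -> torch.Tensor:
    # Direct regression would heavily penalize a 2*pi offset between otherwise
    # identical phases; the anti-wrapping mapping removes that artifact.
    return anti_wrapping(phase_pred - phase_true).mean()

pred = torch.tensor([0.1, 3.0, -3.0])
true = pred + 2 * math.pi        # identical phases, wrapped once
print(phase_loss(pred, true))    # ~0: the wrap is handled correctly
```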

Long-frame-shift Neural Speech Phase Prediction with Spectral Continuity Enhancement and Interpolation Error Compensation

  • paper_url: http://arxiv.org/abs/2308.08850
  • repo_url: https://github.com/yangai520/lfs-nspp
  • paper_authors: Yang Ai, Ye-Xin Lu, Zhen-Hua Ling
  • for: 提升语音相位预测的精度,使其能够从幅度相关特征中准确预测长帧移(long-frame-shift)的相位谱。
  • methods: 提出一种基于神经网络的长帧移语音相位预测方法(LFS-NSPP),包含插值、预测、抽取三个阶段:先通过逐频插值将长帧移对数幅度谱转换为短帧移对数幅度谱以增强谱连续性,再用 NSPP 模型预测短帧移相位谱以补偿插值误差,最后通过逐帧抽取从短帧移相位谱得到长帧移相位谱。
  • results: 实验结果表明,所提 LFS-NSPP 方法在预测长帧移相位谱上的质量优于原始 NSPP 模型及其他基于信号处理的相位估计算法。
    Abstract Speech phase prediction, which is a significant research focus in the field of signal processing, aims to recover speech phase spectra from amplitude-related features. However, existing speech phase prediction methods are constrained to recovering phase spectra with short frame shifts, which are considerably smaller than the theoretical upper bound required for exact waveform reconstruction of short-time Fourier transform (STFT). To tackle this issue, we present a novel long-frame-shift neural speech phase prediction (LFS-NSPP) method which enables precise prediction of long-frame-shift phase spectra from long-frame-shift log amplitude spectra. The proposed method consists of three stages: interpolation, prediction and decimation. The short-frame-shift log amplitude spectra are first constructed from long-frame-shift ones through frequency-by-frequency interpolation to enhance the spectral continuity, and then employed to predict short-frame-shift phase spectra using an NSPP model, thereby compensating for interpolation errors. Ultimately, the long-frame-shift phase spectra are obtained from short-frame-shift ones through frame-by-frame decimation. Experimental results show that the proposed LFS-NSPP method can yield superior quality in predicting long-frame-shift phase spectra than the original NSPP model and other signal-processing-based phase estimation algorithms.
    摘要 语音相位预测是信号处理领域的重要研究方向,旨在从幅度相关特征中恢复语音相位谱。然而,现有方法只能恢复帧移很短的相位谱,远小于短时傅里叶变换(STFT)精确重建波形所允许的理论上限。为此,本文提出一种新颖的长帧移神经语音相位预测方法(LFS-NSPP),能够从长帧移对数幅度谱精确预测长帧移相位谱。该方法包含插值、预测、抽取三个阶段:首先通过逐频插值由长帧移对数幅度谱构造短帧移对数幅度谱,以增强谱的连续性;随后用 NSPP 模型预测短帧移相位谱,从而补偿插值误差;最终通过逐帧抽取由短帧移相位谱得到长帧移相位谱。实验结果表明,所提 LFS-NSPP 方法在预测长帧移相位谱上的质量优于原始 NSPP 模型及其他基于信号处理的相位估计算法。
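
Stages 1 and 3 of the pipeline are simple signal-processing steps; the sketch below implements them in numpy, with the stage-2 NSPP network replaced by a random stand-in. The interpolation scheme (linear, frequency by frequency) is an assumption for illustration.

```python
import numpy as np

def long_to_short_logamp(logamp_long, factor):
    """Stage 1 (interpolation): upsample long-frame-shift log amplitude spectra
    along time, frequency by frequency, to the short frame shift."""
    T, F = logamp_long.shape
    t_long = np.arange(T)
    t_short = np.linspace(0, T - 1, (T - 1) * factor + 1)
    return np.stack([np.interp(t_short, t_long, logamp_long[:, f])
                     for f in range(F)], axis=1)

def short_to_long_phase(phase_short, factor):
    """Stage 3 (decimation): keep every `factor`-th short-frame-shift phase
    frame to obtain the long-frame-shift phase spectra."""
    return phase_short[::factor]

logamp_long = np.random.randn(10, 257)       # 10 long-shift frames, 257 bins
logamp_short = long_to_short_logamp(logamp_long, factor=4)
# Stage 2 would be the NSPP model predicting short-frame-shift phase here:
phase_short = np.random.uniform(-np.pi, np.pi, logamp_short.shape)
phase_long = short_to_long_phase(phase_short, factor=4)
print(logamp_short.shape, phase_long.shape)  # (37, 257) (10, 257)
```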

META-SELD: Meta-Learning for Fast Adaptation to the new environment in Sound Event Localization and Detection

  • paper_url: http://arxiv.org/abs/2308.08847
  • repo_url: None
  • paper_authors: Jinbo Hu, Yin Cao, Ming Wu, Feiran Yang, Ziying Yu, Wenwu Wang, Mark D. Plumbley, Jun Yang
  • for: 本研究旨在提升基于学习的声音事件定位与检测(SELD)方法在不同声学环境下的性能。
  • methods: 采用元学习方法(基于 MAML),寻找好的元初始化参数,使模型仅凭少量样本和少量参数更新即可快速适应新环境。
  • results: 实验结果表明,Meta-SELD 在适应新环境时优于从预训练 SELD 模型微调(fine-tuning)的方式,更加灵活高效。
    Abstract For learning-based sound event localization and detection (SELD) methods, different acoustic environments in the training and test sets may result in large performance differences in the validation and evaluation stages. Different environments, such as different sizes of rooms, different reverberation times, and different background noise, may be reasons for a learning-based system to fail. On the other hand, acquiring annotated spatial sound event samples, which include onset and offset time stamps, class types of sound events, and direction-of-arrival (DOA) of sound sources is very expensive. In addition, deploying a SELD system in a new environment often poses challenges due to time-consuming training and fine-tuning processes. To address these issues, we propose Meta-SELD, which applies meta-learning methods to achieve fast adaptation to new environments. More specifically, based on Model Agnostic Meta-Learning (MAML), the proposed Meta-SELD aims to find good meta-initialized parameters to adapt to new environments with only a small number of samples and parameter updating iterations. We can then quickly adapt the meta-trained SELD model to unseen environments. Our experiments compare fine-tuning methods from pre-trained SELD models with our Meta-SELD on the Sony-TAU Realistic Spatial Soundscapes 2023 (STARSSS23) dataset. The evaluation results demonstrate the effectiveness of Meta-SELD when adapting to new environments.
    摘要 对基于学习的声音事件定位与检测(SELD)方法而言,训练集与测试集声学环境的差异(如房间大小、混响时间、背景噪声的不同)可能导致验证与评估阶段的性能差距很大;而获取带起止时间戳、事件类别和声源到达方向(DOA)标注的空间声音事件样本代价高昂,在新环境中部署 SELD 系统也常因训练和微调耗时而面临挑战。为此,我们提出 Meta-SELD,将元学习方法用于实现对新环境的快速适应。具体而言,基于与模型无关的元学习(MAML),Meta-SELD 旨在找到好的元初始化参数,使模型仅凭少量样本和少量参数更新迭代即可适应新环境,从而快速将元训练的 SELD 模型迁移到未见过的环境。我们在 Sony-TAU Realistic Spatial Soundscapes 2023(STARSS23)数据集上将从预训练 SELD 模型微调的方法与 Meta-SELD 进行比较,评估结果证明了 Meta-SELD 在适应新环境时的有效性。
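
A generic second-order MAML loop in PyTorch, showing the fast-adaptation mechanic Meta-SELD builds on; the linear "SELD network", synthetic task sampler, learning rates, and single inner step are all illustrative stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(16, 4)          # stand-in for a SELD network
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def sample_env_task():
    """Hypothetical task sampler: support/query batches from one environment."""
    w = torch.randn(4, 16)
    xs, xq = torch.randn(8, 16), torch.randn(8, 16)
    return (xs, xs @ w.T), (xq, xq @ w.T)

for step in range(100):
    (xs, ys), (xq, yq) = sample_env_task()
    # Inner loop: one adaptation step on the support set; create_graph keeps
    # the update differentiable so the meta-gradient can flow through it.
    grads = torch.autograd.grad(loss_fn(model(xs), ys),
                                model.parameters(), create_graph=True)
    fast = [p - 0.01 * g for p, g in zip(model.parameters(), grads)]
    # Outer loop: evaluate the adapted weights on the same task's query set.
    meta_loss = loss_fn(F.linear(xq, fast[0], fast[1]), yq)
    meta_opt.zero_grad(); meta_loss.backward(); meta_opt.step()
# After meta-training, a new environment needs only the cheap inner step.
```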

Graph Neural Network Backend for Speaker Recognition

  • paper_url: http://arxiv.org/abs/2308.08767
  • repo_url: None
  • paper_authors: Liang He, Ruida Li, Mengqi Niu
  • for: 提升说话人识别的精度。
  • methods: 使用图神经网络(GNN)后端挖掘低维空间中嵌入之间的潜在关系:将所有嵌入视为图上的节点,依据余弦、LDA+余弦或 LDA+PLDA 等相似度函数计算边,并探索不同的图设置与 GNN 变体以寻找更好的消息传递与聚合方式。
  • results: 在 NIST SRE14 i-vector challenge 任务以及 VoxCeleb1-O、VoxCeleb1-E、VoxCeleb1-H 数据集上显著优于当前主流方法。
    Abstract Currently, most speaker recognition backends, such as cosine, linear discriminant analysis (LDA), or probabilistic linear discriminant analysis (PLDA), make decisions by calculating similarity or distance between enrollment and test embeddings which are already extracted from neural networks. However, for each embedding, the local structure of itself and its neighbor embeddings in the low-dimensional space is different, which may be helpful for the recognition but is often ignored. In order to take advantage of it, we propose a graph neural network (GNN) backend to mine latent relationships among embeddings for classification. We assume all the embeddings as nodes on a graph, and their edges are computed based on some similarity function, such as cosine, LDA+cosine, or LDA+PLDA. We study different graph settings and explore variants of GNN to find a better message passing and aggregation way to accomplish the recognition task. Experimental results on NIST SRE14 i-vector challenging, VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H datasets demonstrate that our proposed GNN backends significantly outperform current mainstream methods.
    摘要 目前,大多数说话人识别后端(如余弦、线性判别分析(LDA)或概率线性判别分析(PLDA))通过计算从神经网络提取的注册嵌入与测试嵌入之间的相似度或距离来做出判决。然而,每个嵌入在低维空间中自身及其邻居嵌入的局部结构各不相同,这一信息可能有助于识别,却常被忽略。为利用这一点,我们提出一种图神经网络(GNN)后端,挖掘嵌入之间的潜在关系用于分类:将所有嵌入视为图上的节点,依据某种相似度函数(如余弦、LDA+余弦或 LDA+PLDA)计算它们之间的边。我们研究了不同的图设置并探索 GNN 的多种变体,以寻找更适合识别任务的消息传递与聚合方式。在 NIST SRE14 i-vector challenge、VoxCeleb1-O、VoxCeleb1-E 和 VoxCeleb1-H 数据集上的实验结果表明,所提 GNN 后端显著优于当前主流方法。
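
A small sketch of the backend's first two steps, assuming PyTorch: build a graph over embeddings with a cosine-similarity kNN rule, then run one mean-aggregation message-passing layer. Graph construction details, layer form, and scoring are illustrative; the paper explores several variants (including LDA/PLDA edge functions).

```python
import torch
import torch.nn.functional as F

def build_knn_graph(emb: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Nodes are embeddings; each node connects to its k nearest neighbours
    under cosine similarity, then the adjacency is symmetrized."""
    sim = F.normalize(emb, dim=1) @ F.normalize(emb, dim=1).T
    sim.fill_diagonal_(-1.0)                  # no self loops
    nbrs = sim.topk(k, dim=1).indices         # (N, k)
    adj = torch.zeros(len(emb), len(emb))
    adj.scatter_(1, nbrs, 1.0)
    return ((adj + adj.T) > 0).float()

def gcn_layer(emb, adj, W):
    """One mean-aggregation message-passing step over the embedding graph."""
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
    return torch.relu((adj @ emb) / deg @ W)

emb = torch.randn(32, 192)                     # 32 utterance embeddings
adj = build_knn_graph(emb)
h = gcn_layer(emb, adj, torch.randn(192, 64))  # refined node representations
# A verification score can then be computed between the refined
# enrollment and test nodes instead of the raw embeddings.
print(h.shape)
```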

The DKU-MSXF Speaker Verification System for the VoxCeleb Speaker Recognition Challenge 2023

  • paper_url: http://arxiv.org/abs/2308.08766
  • repo_url: None
  • paper_authors: Ze Li, Yuke Lin, Xiaoyi Qin, Ning Jiang, Guoqing Zhao, Ming Li
  • for: 本文是 DKU-MSXF 系统在 VoxCeleb Speaker Recognition Challenge 2023(VoxSRC-23)Track 1、Track 2 与 Track 3 上的系统描述。
  • methods: Track 1 采用基于 ResNet 的网络结构进行训练,并通过构建跨年龄 QMF 训练集显著提升系统性能;Track 2 继承 Track 1 的预训练模型,并混入 VoxBlink-clean 数据集进行混合训练;Track 3(半监督域自适应任务)采用基于三重阈值与子中心净化的新型伪标签方法实现域自适应。
  • results: 相比 Track 1,引入 VoxBlink-clean 数据的模型相对性能提升超过 10%;最终提交在 Track 1 取得 mDCF 0.1243,Track 2 取得 mDCF 0.1165,Track 3 取得 EER 4.952%。
    Abstract This paper is the system description of the DKU-MSXF System for the track1, track2 and track3 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23). For Track 1, we utilize a network structure based on ResNet for training. By constructing a cross-age QMF training set, we achieve a substantial improvement in system performance. For Track 2, we inherite the pre-trained model from Track 1 and conducte mixed training by incorporating the VoxBlink-clean dataset. In comparison to Track 1, the models incorporating VoxBlink-clean data exhibit a performance improvement by more than 10% relatively. For Track3, the semi-supervised domain adaptation task, a novel pseudo-labeling method based on triple thresholds and sub-center purification is adopted to make domain adaptation. The final submission achieves mDCF of 0.1243 in task1, mDCF of 0.1165 in Track 2 and EER of 4.952% in Track 3.
    摘要 本文是 DKU-MSXF 系统在 VoxCeleb Speaker Recognition Challenge 2023(VoxSRC-23)Track 1、Track 2 与 Track 3 上的系统描述。在 Track 1 中,我们采用基于 ResNet 的网络结构进行训练,并通过构建跨年龄 QMF 训练集显著提升了系统性能。在 Track 2 中,我们继承 Track 1 的预训练模型,并混入 VoxBlink-clean 数据集进行混合训练;相比 Track 1,引入 VoxBlink-clean 数据的模型相对性能提升超过 10%。在 Track 3 这一半监督域自适应任务中,我们采用基于三重阈值与子中心净化的新型伪标签方法实现域自适应。最终提交在 Track 1 取得 mDCF 0.1243,Track 2 取得 mDCF 0.1165,Track 3 取得 EER 4.952%。

Decoding Emotions: A comprehensive Multilingual Study of Speech Models for Speech Emotion Recognition

  • paper_url: http://arxiv.org/abs/2308.08713
  • repo_url: https://github.com/95anantsingh/decoding-emotions
  • paper_authors: Anant Singh, Akshat Gupta
  • for: 这项研究旨在评估基于 Transformer 的语音表示模型在多语言语音情感识别(SER)任务上的表现,并考察其内部表示。
  • methods: 使用八种语音表示模型与六种语言构建全面的 SER 基准,并通过探测(probing)实验洞察这些模型在 SER 中的内部工作机制。
  • results: 与使用语音模型所有层的特征相比,仅使用单个最优层的特征可使七个数据集上的错误率平均降低 32%,并在德语和波斯语上取得最先进的结果;探测结果表明,语音模型的中间层承载了对语音情感识别最重要的情感信息。
    Abstract Recent advancements in transformer-based speech representation models have greatly transformed speech processing. However, there has been limited research conducted on evaluating these models for speech emotion recognition (SER) across multiple languages and examining their internal representations. This article addresses these gaps by presenting a comprehensive benchmark for SER with eight speech representation models and six different languages. We conducted probing experiments to gain insights into inner workings of these models for SER. We find that using features from a single optimal layer of a speech model reduces the error rate by 32\% on average across seven datasets when compared to systems where features from all layers of speech models are used. We also achieve state-of-the-art results for German and Persian languages. Our probing results indicate that the middle layers of speech models capture the most important emotional information for speech emotion recognition.
    摘要 基于 Transformer 的语音表示模型的最新进展极大地改变了语音处理。然而,针对多语言语音情感识别(SER)评估这些模型并考察其内部表示的研究仍然有限。本文通过构建涵盖八种语音表示模型与六种语言的全面 SER 基准来填补这一空白,并开展探测实验以洞察这些模型在 SER 中的内部工作机制。我们发现,与使用语音模型所有层特征的系统相比,仅使用单个最优层的特征可使七个数据集上的错误率平均降低 32%,并在德语和波斯语上取得最先进的结果。探测结果表明,语音模型的中间层承载了对语音情感识别最重要的情感信息。

cs.CV - 2023-08-17

Discretization-Induced Dirichlet Posterior for Robust Uncertainty Quantification on Regression

  • paper_url: http://arxiv.org/abs/2308.09065
  • repo_url: None
  • paper_authors: Xuanlong Yu, Gianni Franchi, Jindong Gu, Emanuel Aldea
  • for: 这项研究旨在提升深度神经网络(DNN)在实际应用中不确定性量化的稳健性:通过辅助不确定性估计器(AuxUE),在不修改主任务模型的情况下估计主任务预测的不确定性。
  • methods: 提出一种通用的 AuxUE 方案以获得更稳健的不确定性估计。具体而言,针对异方差噪声考察了多种分布假设,最终选用拉普拉斯分布来近似预测误差以估计偶然(aleatoric)不确定性;针对认知(epistemic)不确定性,提出离散化诱导的 Dirichlet 后验(DIDO)这一新方案,对离散化后的预测误差建模 Dirichlet 后验。
  • results: 在年龄估计、单目深度估计和超分辨率任务上的大量实验表明,所提方法能在噪声输入下给出稳健的不确定性估计,并可扩展到图像级与像素级任务。
    Abstract Uncertainty quantification is critical for deploying deep neural networks (DNNs) in real-world applications. An Auxiliary Uncertainty Estimator (AuxUE) is one of the most effective means to estimate the uncertainty of the main task prediction without modifying the main task model. To be considered robust, an AuxUE must be capable of maintaining its performance and triggering higher uncertainties while encountering Out-of-Distribution (OOD) inputs, i.e., to provide robust aleatoric and epistemic uncertainty. However, for vision regression tasks, current AuxUE designs are mainly adopted for aleatoric uncertainty estimates, and AuxUE robustness has not been explored. In this work, we propose a generalized AuxUE scheme for more robust uncertainty quantification on regression tasks. Concretely, to achieve a more robust aleatoric uncertainty estimation, different distribution assumptions are considered for heteroscedastic noise, and Laplace distribution is finally chosen to approximate the prediction error. For epistemic uncertainty, we propose a novel solution named Discretization-Induced Dirichlet pOsterior (DIDO), which models the Dirichlet posterior on the discretized prediction error. Extensive experiments on age estimation, monocular depth estimation, and super-resolution tasks show that our proposed method can provide robust uncertainty estimates in the face of noisy inputs and that it can be scalable to both image-level and pixel-wise tasks.
    摘要 不确定性量化对于将深度神经网络(DNN)部署到实际应用至关重要。辅助不确定性估计器(AuxUE)是在不修改主任务模型的情况下估计主任务预测不确定性的最有效手段之一。一个稳健的 AuxUE 应当在遇到分布外(OOD)输入时既保持性能又给出更高的不确定性,即同时提供稳健的偶然(aleatoric)与认知(epistemic)不确定性。然而,对于视觉回归任务,现有 AuxUE 设计主要用于偶然不确定性估计,其稳健性尚未被探讨。本文提出一种通用的 AuxUE 方案,以在回归任务上实现更稳健的不确定性量化。具体而言,为获得更稳健的偶然不确定性估计,针对异方差噪声考察了多种分布假设,最终选用拉普拉斯分布来近似预测误差;针对认知不确定性,提出离散化诱导的 Dirichlet 后验(DIDO)这一新方案,对离散化后的预测误差建模 Dirichlet 后验。在年龄估计、单目深度估计和超分辨率任务上的大量实验表明,所提方法能在噪声输入下给出稳健的不确定性估计,并可扩展到图像级与像素级任务。
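
The Laplace choice for the aleatoric branch has a one-line objective worth writing out. The sketch below assumes a frozen main-task regressor whose residuals train an auxiliary head predicting a log-scale per sample; the DIDO epistemic branch is a separate component not shown here.

```python
import torch

def laplace_nll(residual: torch.Tensor, log_b: torch.Tensor) -> torch.Tensor:
    """-log Laplace(r | 0, b) = |r|/b + log(2b), with the scale b predicted in
    log space for positivity. Heavier tails than a Gaussian make the fit less
    sensitive to outlier prediction errors."""
    b = log_b.exp()
    return (residual.abs() / b + (2 * b).log()).mean()

# Toy usage: residuals of a (hypothetical) frozen main-task model, and a
# per-sample log-scale produced by the auxiliary uncertainty estimator.
residual = torch.randn(100) * 2.0
log_b = torch.zeros(100, requires_grad=True)
loss = laplace_nll(residual, log_b)
loss.backward()  # only the AuxUE parameters receive gradients
print(float(loss))
```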

SimFIR: A Simple Framework for Fisheye Image Rectification with Self-supervised Representation Learning

  • paper_url: http://arxiv.org/abs/2308.09040
  • repo_url: https://github.com/fh2019ustc/SimFIR
  • paper_authors: Hao Feng, Wendi Wang, Jiajun Deng, Wengang Zhou, Li Li, Houqiang Li
  • for: 鱼眼图像畸变矫正。
  • methods: 基于视觉 Transformer(ViT)的自监督表示学习,结合创新的统一畸变感知前置(pretext)任务。
  • results: 下游矫正任务的迁移性能显著提升,优于最先进算法,并在真实世界鱼眼图像上表现出强大的泛化能力。
    Abstract In fisheye images, rich distinct distortion patterns are regularly distributed in the image plane. These distortion patterns are independent of the visual content and provide informative cues for rectification. To make the best of such rectification cues, we introduce SimFIR, a simple framework for fisheye image rectification based on self-supervised representation learning. Technically, we first split a fisheye image into multiple patches and extract their representations with a Vision Transformer (ViT). To learn fine-grained distortion representations, we then associate different image patches with their specific distortion patterns based on the fisheye model, and further subtly design an innovative unified distortion-aware pretext task for their learning. The transfer performance on the downstream rectification task is remarkably boosted, which verifies the effectiveness of the learned representations. Extensive experiments are conducted, and the quantitative and qualitative results demonstrate the superiority of our method over the state-of-the-art algorithms as well as its strong generalization ability on real-world fisheye images.
    摘要 鱼眼图像的平面上规律地分布着丰富而独特的畸变模式。这些畸变模式与视觉内容无关,为矫正提供了有用的线索。为充分利用这些矫正线索,我们提出 SimFIR,一个基于自监督表示学习的简单鱼眼图像矫正框架。技术上,我们先将鱼眼图像切分为多个图像块,并用视觉 Transformer(ViT)提取其表示;随后依据鱼眼模型将不同图像块与其特定的畸变模式相关联,并精心设计了一个创新的统一畸变感知前置任务用于表示学习。学到的表示使下游矫正任务的迁移性能显著提升,验证了其有效性。大量实验的定量与定性结果表明,我们的方法优于最先进算法,并在真实世界鱼眼图像上具有强大的泛化能力。

MarginMatch: Improving Semi-Supervised Learning with Pseudo-Margins

  • paper_url: http://arxiv.org/abs/2308.09037
  • repo_url: https://github.com/tsosea2/marginmatch
  • paper_authors: Tiberiu Sosea, Cornelia Caragea
  • for: 提高 semi-supervised learning 的性能,特别是在数据量少的情况下。
  • methods: 结合一致性正则化与伪标签,其主要创新在于利用无标签数据的训练动态来衡量伪标签质量:除了模型在某次迭代上对无标签样本的置信度外,还分析模型随训练推进在伪标签样本上的行为,以确保低质量预测被屏蔽。
  • results: 在四个视觉基准的低数据设置和两个大规模数据集上带来显著提升,凸显了高质量伪标签的重要性。值得注意的是,在 CIFAR-100 上每类仅 25 个标签时,错误率相比最先进方法降低 3.25%;在 STL-10 上每类仅 4 个标签时降低 3.78%。
    Abstract We introduce MarginMatch, a new SSL approach combining consistency regularization and pseudo-labeling, with its main novelty arising from the use of unlabeled data training dynamics to measure pseudo-label quality. Instead of using only the model's confidence on an unlabeled example at an arbitrary iteration to decide if the example should be masked or not, MarginMatch also analyzes the behavior of the model on the pseudo-labeled examples as the training progresses, to ensure low quality predictions are masked out. MarginMatch brings substantial improvements on four vision benchmarks in low data regimes and on two large-scale datasets, emphasizing the importance of enforcing high-quality pseudo-labels. Notably, we obtain an improvement in error rate over the state-of-the-art of 3.25% on CIFAR-100 with only 25 labels per class and of 3.78% on STL-10 using as few as 4 labels per class. We make our code available at https://github.com/tsosea2/MarginMatch.
    摘要 我团队今天宣布了一种新的SSL方法,即MarginMatch,它结合了一致性规则和假标注,其主要创新在于使用无标注数据训练动态来衡量假标注质量。不同于以往只使用模型对无标注示例的任意轮次的信任度来决定是否遮盖示例,MarginMatch还分析了模型在假标注示例上的行为,以确保低质量预测被排除。MarginMatch在四个视觉标准benchmark上表现出了显著改善,特别是在低数据 régime下,以及在两个大规模数据集上,这些成果强调了高质量假标注的重要性。我们在CIFAR-100上实现了error rate的提升为3.25%,只使用每类25个标签,而在STL-10上实现了error rate的提升为3.78%,只使用每类4个标签。我们将代码发布在https://github.com/tsosea2/MarginMatch上。

Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks

  • paper_url: http://arxiv.org/abs/2308.09033
  • repo_url: https://github.com/fawazsammani/uni-nlx
  • paper_authors: Fawaz Sammani, Nikos Deligiannis
  • for: 提出一个统一的自然语言解释(NLE)框架,用单一紧凑的多任务模型、以文本生成这一统一训练目标整合所有 NLE 任务。
  • methods: Uni-NLX 以文本生成为统一训练目标训练单一模型,使其能够完成七个 NLE 任务(涵盖 VQA、视觉识别与视觉推理),参数量较以往方法减少 7 倍;并借助大语言模型(LLM)构建了两个新的 NLE 数据集:用于解释 ImageNet 类别的 ImageNetX(14.4 万样本)和用于解释视觉问答任务的 VQA-ParaX(12.3 万样本)。
  • results: 在合计 100 万条 NLE 样本上训练后,单一统一模型可同时执行七个 NLE 任务,性能与以往各任务独立模型相当,在某些任务上甚至更优。
    Abstract Natural Language Explanations (NLE) aim at supplementing the prediction of a model with human-friendly natural text. Existing NLE approaches involve training separate models for each downstream task. In this work, we propose Uni-NLX, a unified framework that consolidates all NLE tasks into a single and compact multi-task model using a unified training objective of text generation. Additionally, we introduce two new NLE datasets: 1) ImageNetX, a dataset of 144K samples for explaining ImageNet categories, and 2) VQA-ParaX, a dataset of 123K samples for explaining the task of Visual Question Answering (VQA). Both datasets are derived leveraging large language models (LLMs). By training on the 1M combined NLE samples, our single unified framework is capable of simultaneously performing seven NLE tasks including VQA, visual recognition and visual reasoning tasks with 7X fewer parameters, demonstrating comparable performance to the independent task-specific models in previous approaches, and in certain tasks even outperforming them. Code is at https://github.com/fawazsammani/uni-nlx
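A hedged sketch of how a unified text-generation objective might serialize heterogeneous NLE tasks into (prompt, target) pairs; the record fields and task tags below are hypothetical, not the actual Uni-NLX data format.

```python
# Hypothetical record layouts; the real Uni-NLX serialization is not specified here.
samples = [
    {"task": "vqa-x", "question": "What sport is this?", "answer": "surfing",
     "explanation": "the man is riding a wave on a surfboard"},
    {"task": "imagenetx", "question": None, "answer": "goldfinch",
     "explanation": "a small bird with a yellow body and black wings"},
]

def to_text_pair(sample):
    """Serialize any NLE task into a single text-generation training example."""
    prompt = f"[{sample['task']}]"
    if sample["question"]:
        prompt += f" question: {sample['question']}"
    target = f"{sample['answer']} because {sample['explanation']}"
    return prompt, target

for s in samples:
    print(to_text_pair(s))  # one shared (prompt -> target) format for every task
```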

LesionMix: A Lesion-Level Data Augmentation Method for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.09026
  • repo_url: https://github.com/dogabasaran/lesionmix
  • paper_authors: Berke Doga Basaran, Weitong Zhang, Mengyun Qiao, Bernhard Kainz, Paul M. Matthews, Wenjia Bai
  • for: Improving data augmentation for deep learning-based medical image segmentation.
  • methods: Proposes LesionMix, a novel lesion-aware data augmentation method that augments at the lesion level, increasing the diversity of lesion shape, location, intensity, and load distribution, and allowing both lesion populating and inpainting.
  • results: Experiments across different modalities and lesion datasets show that LesionMix achieves promising lesion segmentation performance, outperforming several recent Mix-based augmentation methods.
    Abstract Data augmentation has become a de facto component of deep learning-based medical image segmentation methods. Most data augmentation techniques used in medical imaging focus on spatial and intensity transformations to improve the diversity of training images. They are often designed at the image level, augmenting the full image, and do not pay attention to specific abnormalities within the image. Here, we present LesionMix, a novel and simple lesion-aware data augmentation method. It performs augmentation at the lesion level, increasing the diversity of lesion shape, location, intensity and load distribution, and allowing both lesion populating and inpainting. Experiments on different modalities and different lesion datasets, including four brain MR lesion datasets and one liver CT lesion dataset, demonstrate that LesionMix achieves promising performance in lesion image segmentation, outperforming several recent Mix-based data augmentation methods. The code will be released at https://github.com/dogabasaran/lesionmix.
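The following numpy sketch illustrates lesion-level augmentation in the spirit of LesionMix, populating an image with donor lesions or inpainting its own lesions away. It omits the paper's shape/location jitter and load-distribution control; the intensity-jitter magnitude and naive mean-inpainting are our simplifications.

```python
import numpy as np

def lesionmix(image, mask, donor_image, donor_mask, rng=None):
    """Minimal lesion-level augmentation: either populate the image with the
    donor's lesions or inpaint existing lesions away.
    All arrays are 2D; `mask`/`donor_mask` are binary lesion masks."""
    if rng is None:
        rng = np.random.default_rng(0)
    out_img, out_mask = image.astype(float).copy(), mask.copy()
    if rng.random() < 0.5:
        # Lesion populating: paste donor lesion pixels with mild intensity jitter.
        jitter = 1.0 + 0.1 * rng.standard_normal()
        out_img[donor_mask > 0] = donor_image[donor_mask > 0] * jitter
        out_mask = np.maximum(out_mask, donor_mask)
    else:
        # Lesion inpainting: replace lesion pixels with the mean healthy intensity.
        out_img[mask > 0] = image[mask == 0].mean()
        out_mask = np.zeros_like(mask)
    return out_img, out_mask
```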

SR-GAN for SR-gamma: photon super resolution at collider experiments

  • paper_url: http://arxiv.org/abs/2308.09025
  • repo_url: None
  • paper_authors: Johannes Erdmann, Aaron van der Graaf, Florian Mausolf, Olaf Nackenhorst
  • for: Studying single-image super-resolution algorithms, based on generative adversarial networks, for photons at collider experiments.
  • methods: Treats the energy depositions of simulated electromagnetic showers from photons and neutral-pion decays in a toy electromagnetic calorimeter as 2D images, and trains super-resolution networks to increase the resolution by a factor of four in each dimension.
  • results: The generated images reproduce shower features that are not obvious at nominal resolution; using them significantly improves the reconstruction of shower-shape variables and of the shower-center position, and using them as a pre-processing step for deep-learning photon identification helps when training statistics are low.
    Abstract We study single-image super-resolution algorithms for photons at collider experiments based on generative adversarial networks. We treat the energy depositions of simulated electromagnetic showers of photons and neutral-pion decays in a toy electromagnetic calorimeter as 2D images and we train super-resolution networks to generate images with an artificially increased resolution by a factor of four in each dimension. The generated images are able to reproduce features of the electromagnetic showers that are not obvious from the images at nominal resolution. Using the artificially-enhanced images for the reconstruction of shower-shape variables and of the position of the shower center results in significant improvements. We additionally investigate the utilization of the generated images as a pre-processing step for deep-learning photon-identification algorithms and observe improvements in the case of low training statistics.

ARAI-MVSNet: A multi-view stereo depth estimation network with adaptive depth range and depth interval

  • paper_url: http://arxiv.org/abs/2308.09022
  • repo_url: None
  • paper_authors: Song Zhang, Wenjia Xu, Zhiwei Wei, Lili Zhang, Yang Wang, Junyi Liu
  • for: Addressing Multi-View Stereo (MVS), a fundamental computer vision problem that reconstructs a scene from multi-view images with known camera parameters.
  • methods: Proposes a multi-stage coarse-to-fine framework: a coarse depth map is predicted in the first stage; an Adaptive Depth Range Prediction module then zooms in on the scene by leveraging the reference image and the first-stage depth map to predict a more accurate all-pixel depth range; the third and fourth stages use an Adaptive Depth Interval Adjustment module to achieve adaptive variable interval partition for more accurate depth estimation.
  • results: Extensive experiments show state-of-the-art performance and competitive generalization: the highest Acc and Overall on DTU, the highest Recall and $F_{1}$-score on the Tanks and Temples intermediate and advanced sets, the lowest $e_{1}$ and $e_{3}$ on BlendedMVS, and the highest Acc and $F_{1}$-score on ETH 3D, surpassing all listed methods.
    Abstract Multi-View Stereo (MVS) is a fundamental problem in geometric computer vision which aims to reconstruct a scene using multi-view images with known camera parameters. However, the mainstream approaches represent the scene with a fixed all-pixel depth range and equal depth interval partition, which will result in inadequate utilization of depth planes and imprecise depth estimation. In this paper, we present a novel multi-stage coarse-to-fine framework to achieve adaptive all-pixel depth range and depth interval. We predict a coarse depth map in the first stage, then an Adaptive Depth Range Prediction module is proposed in the second stage to zoom in the scene by leveraging the reference image and the obtained depth map in the first stage and predict a more accurate all-pixel depth range for the following stages. In the third and fourth stages, we propose an Adaptive Depth Interval Adjustment module to achieve adaptive variable interval partition for pixel-wise depth range. The depth interval distribution in this module is normalized by Z-score, which can allocate dense depth hypothesis planes around the potential ground truth depth value and vice versa to achieve more accurate depth estimation. Extensive experiments on four widely used benchmark datasets (DTU, TnT, BlendedMVS, ETH 3D) demonstrate that our model achieves state-of-the-art performance and yields competitive generalization ability. Particularly, our method achieves the highest Acc and Overall on the DTU dataset, while attaining the highest Recall and $F_{1}$-score on the Tanks and Temples intermediate and advanced dataset. Moreover, our method also achieves the lowest $e_{1}$ and $e_{3}$ on the BlendedMVS dataset and the highest Acc and $F_{1}$-score on the ETH 3D dataset, surpassing all listed methods. Project website: https://github.com/zs670980918/ARAI-MVSNet
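As an illustration of Z-score-style interval partition, the sketch below places per-pixel depth hypothesis planes at Gaussian-quantile positions so they cluster around the previous stage's estimate and thin out toward the range boundaries. The exact normalization in the paper may differ; all names and constants here are ours.

```python
import torch

def adaptive_depth_hypotheses(depth_prev, d_min, d_max, num_planes=8):
    """Per-pixel depth hypothesis planes, denser near the previous estimate.
    depth_prev, d_min, d_max: (H, W) tensors from the earlier, coarser stage."""
    # Quantiles of a standard normal, excluding the extreme tails.
    probs = torch.linspace(0.5 / num_planes, 1 - 0.5 / num_planes, num_planes)
    z = torch.erfinv(2 * probs - 1) * (2 ** 0.5)   # (num_planes,) standard-normal quantiles
    z = z / z.abs().max()                          # rescale offsets to [-1, 1]
    # Symmetric half-range so hypotheses stay inside the predicted depth range.
    half_range = torch.minimum(depth_prev - d_min, d_max - depth_prev)
    hyps = depth_prev[None] + z[:, None, None] * half_range[None]  # (P, H, W)
    return hyps.clamp(min=0)

# Usage with toy tensors:
d = torch.full((4, 4), 2.0)
planes = adaptive_depth_hypotheses(d, d - 0.5, d + 0.5)
print(planes[:, 0, 0])  # plane positions concentrated around 2.0
```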

FashionLOGO: Prompting Multimodal Large Language Models for Fashion Logo Embeddings

  • paper_url: http://arxiv.org/abs/2308.09012
  • repo_url: https://github.com/valley-vl/fashionlogo
  • paper_authors: Yulin Su, Min Yang, Minghui Qiu, Jing Wang, Tao Wang
  • For: Improving the robustness of logo embedding by leveraging textual knowledge as an auxiliary signal, enhancing logo recognition in real-world scenarios.
  • Methods: The proposed FashionLOGO uses Multimodal Large Language Models (MLLMs) to generate explicit textual knowledge through three types of prompts (image OCR, brief captions, and detailed descriptions) in a zero-shot setting; a cross-attention transformer lets image embedding queries automatically learn supplementary knowledge from the textual embeddings.
  • Results: Extensive experiments on three real-world datasets show that FashionLOGO learns generalized and robust logo embeddings, achieving state-of-the-art performance on all benchmark datasets; comprehensive ablation studies demonstrate that the performance improvements come from introducing MLLMs.
    Abstract Logo embedding plays a crucial role in various e-commerce applications by facilitating image retrieval or recognition, such as intellectual property protection and product search. However, current methods treat logo embedding as a purely visual problem, which may limit their performance in real-world scenarios. A notable issue is that the textual knowledge embedded in logo images has not been adequately explored. Therefore, we propose a novel approach that leverages textual knowledge as an auxiliary to improve the robustness of logo embedding. The emerging Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in both visual and textual understanding and could become valuable visual assistants in understanding logo images. Inspired by this observation, our proposed method, FashionLOGO, aims to utilize MLLMs to enhance fashion logo embedding. We explore how MLLMs can improve logo embedding by prompting them to generate explicit textual knowledge through three types of prompts, including image OCR, brief captions, and detailed descriptions prompts, in a zero-shot setting. We adopt a cross-attention transformer to enable image embedding queries to learn supplementary knowledge from textual embeddings automatically. To reduce computational costs, we only use the image embedding model in the inference stage, similar to traditional inference pipelines. Our extensive experiments on three real-world datasets demonstrate that FashionLOGO learns generalized and robust logo embeddings, achieving state-of-the-art performance in all benchmark datasets. Furthermore, we conduct comprehensive ablation studies to demonstrate the performance improvements resulting from the introduction of MLLMs.
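A minimal PyTorch sketch of the cross-attention fusion described above, where visual queries attend over MLLM-generated text embeddings. The dimensions, residual LayerNorm, and mean pooling are our assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class TextAugmentedLogoEmbed(nn.Module):
    """Image-embedding queries attend over MLLM-generated text embeddings
    (OCR, brief caption, detailed description), then fuse residually."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, Nq, D) visual queries; text_tokens: (B, Nt, D) from the prompts.
        attended, _ = self.cross_attn(query=img_tokens, key=text_tokens, value=text_tokens)
        fused = self.norm(img_tokens + attended)   # residual fusion
        return fused.mean(dim=1)                   # pooled logo embedding

# Usage with random stand-ins for the real CLIP/MLLM features:
model = TextAugmentedLogoEmbed()
emb = model(torch.randn(2, 4, 512), torch.randn(2, 16, 512))
print(emb.shape)  # torch.Size([2, 512])
```

At inference only the image branch is needed, matching the paper's note that the text side is dropped to keep costs close to a traditional pipeline.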

DealMVC: Dual Contrastive Calibration for Multi-view Clustering

  • paper_url: http://arxiv.org/abs/2308.09000
  • repo_url: https://github.com/xihongyang1999/dealmvc
  • paper_authors: Xihong Yang, Jiaqi Jin, Siwei Wang, Ke Liang, Yue Liu, Yi Wen, Suyuan Liu, Sihang Zhou, Xinwang Liu, En Zhu
  • For: Improving multi-view clustering by handling similar-but-different samples in cross-view scenarios, which existing models ignore.
  • Methods: Proposes DealMVC, a dual contrastive calibration network: a fusion mechanism obtains a global cross-view feature; a global contrastive calibration loss aligns the view-feature similarity graph with a high-confidence pseudo-label graph; and a local contrastive calibration loss constrains the consistency of pair-wise view features.
  • Results: Comprehensive experiments on eight benchmark datasets validate its effectiveness and superiority over other state-of-the-art approaches.
    Abstract Benefiting from the strong view-consistent information mining capacity, multi-view contrastive clustering has attracted plenty of attention in recent years. However, we observe the following drawback, which limits the clustering performance from further improvement. The existing multi-view models mainly focus on the consistency of the same samples in different views while ignoring the circumstance of similar but different samples in cross-view scenarios. To solve this problem, we propose a novel Dual contrastive calibration network for Multi-View Clustering (DealMVC). Specifically, we first design a fusion mechanism to obtain a global cross-view feature. Then, a global contrastive calibration loss is proposed by aligning the view feature similarity graph and the high-confidence pseudo-label graph. Moreover, to utilize the diversity of multi-view information, we propose a local contrastive calibration loss to constrain the consistency of pair-wise view features. The feature structure is regularized by reliable class information, thus guaranteeing similar samples have similar features in different views. During the training procedure, the interacted cross-view feature is jointly optimized at both local and global levels. In comparison with other state-of-the-art approaches, the comprehensive experimental results obtained from eight benchmark datasets provide substantial validation of the effectiveness and superiority of our algorithm. We release the code of DealMVC at https://github.com/xihongyang1999/DealMVC on GitHub.
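The global calibration idea can be sketched as aligning a cross-view feature-similarity graph with a pseudo-label agreement graph restricted to confident predictions. The squared-error alignment and the 0.9 confidence threshold below are illustrative choices of ours, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def global_calibration_loss(feats_a, feats_b, pseudo_labels, conf, tau=0.9):
    """Align the cross-view feature-similarity graph with a pseudo-label
    agreement graph built from high-confidence predictions only.
    feats_*: (N, D) per-view features; pseudo_labels: (N,); conf: (N,)."""
    za, zb = F.normalize(feats_a, dim=1), F.normalize(feats_b, dim=1)
    sim = za @ zb.t()                                                 # (N, N) similarity graph
    same_class = (pseudo_labels[:, None] == pseudo_labels[None, :]).float()
    keep = ((conf[:, None] > tau) & (conf[None, :] > tau)).float()    # confident pairs only
    # Pull together pairs the pseudo-labels agree on, push apart the rest.
    return (keep * (sim - same_class) ** 2).sum() / keep.sum().clamp(min=1)
```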

Semantic Information for Object Detection

  • paper_url: http://arxiv.org/abs/2308.08990
  • repo_url: https://github.com/BMW-InnovationLab/BMW-Anonymization-API
  • paper_authors: Jean-Francois Nies
  • for: Investigating whether the concept of Semantic Consistency and the ensuing Knowledge-Aware Re-Optimization method can be adapted to object detection in intricate traffic scenes.
  • methods: Introduces a novel method for extracting a knowledge graph from a dataset of images with instance-level annotations, and integrates this new knowledge graph with the existing semantic consistency model.
  • results: Combining the new hybrid knowledge graph with the preexisting frequency-analysis and external knowledge-graph sources of semantic information yields limited but consistent improvements in precision and/or recall for the Faster-RCNN and DETR object detection models.
    Abstract In this paper, we demonstrate that the concept of Semantic Consistency and the ensuing method of Knowledge-Aware Re-Optimization can be adapted for the problem of object detection in intricate traffic scenes. Furthermore, we introduce a novel method for extracting a knowledge graph from a dataset of images provided with instance-level annotations, and integrate this new knowledge graph with the existing semantic consistency model. Combining both this novel hybrid knowledge graph and the preexisting methods of frequency analysis and external knowledge graph as sources for semantic information, we investigate the effectiveness of knowledge-aware re-optimization on the Faster-RCNN and DETR object detection models. We find that limited but consistent improvements in precision and/or recall can be achieved using this method for all combinations of model and method studied.

Eosinophils Instance Object Segmentation on Whole Slide Imaging Using Multi-label Circle Representation

  • paper_url: http://arxiv.org/abs/2308.08974
  • repo_url: https://github.com/yilinliu610730/eoe
  • paper_authors: Yilin Liu, Ruining Deng, Juming Xiong, Regina N Tyree, Hernan Correa, Girish Hiremath, Yaohong Wang, Yuankai Huo
  • For: Proposing an automated method for instance segmentation of eosinophils on whole slide images, to aid the diagnosis and assessment of eosinophilic esophagitis (EoE).
  • Methods: Builds on circle representation, extending the single-label CircleSnake model to a multi-label model that can segment multiple object types simultaneously.
  • Results: The multi-label CircleSnake achieves higher average precision (AP) in identifying and segmenting eosinophils than the traditional Mask R-CNN and DeepSnake models, enabling enhanced characterization of EoE and promising improved diagnostic accuracy.
    Abstract Eosinophilic esophagitis (EoE) is a chronic and relapsing disease characterized by esophageal inflammation. Symptoms of EoE include difficulty swallowing, food impaction, and chest pain which significantly impact the quality of life, resulting in nutritional impairments, social limitations, and psychological distress. The diagnosis of EoE is typically performed with a threshold (15 to 20) of eosinophils (Eos) per high-power field (HPF). Since the current counting process of Eos is a resource-intensive process for human pathologists, automatic methods are desired. Circle representation has been shown as a more precise, yet less complicated, representation for automatic instance cell segmentation such as CircleSnake approach. However, the CircleSnake was designed as a single-label model, which is not able to deal with multi-label scenarios. In this paper, we propose the multi-label CircleSnake model for instance segmentation on Eos. It extends the original CircleSnake model from a single-label design to a multi-label model, allowing segmentation of multiple object types. Experimental results illustrate the CircleSnake model's superiority over the traditional Mask R-CNN model and DeepSnake model in terms of average precision (AP) in identifying and segmenting eosinophils, thereby enabling enhanced characterization of EoE. This automated approach holds promise for streamlining the assessment process and improving diagnostic accuracy in EoE analysis. The source code has been made publicly available at https://github.com/yilinliu610730/EoE.

Watch Your Steps: Local Image and Scene Editing by Text Instructions

  • paper_url: http://arxiv.org/abs/2308.08947
  • repo_url: None
  • paper_authors: Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, Igor Gilitschenski
  • For: The task of text-guided image and NeRF editing, with the goal of localizing the edit region implicit in a text instruction.
  • Methods: Uses InstructPix2Pix (IP2P) and derives a relevance map from the discrepancy between IP2P predictions with and without the instruction; the map guides the modifications so that irrelevant pixels remain unchanged. For 3D scenes represented as neural radiance fields, a relevance field trained on the relevance maps of training views defines the 3D region within which modifications should be made.
  • Results: Achieves state-of-the-art performance on both image and NeRF editing tasks.
    Abstract Denoising diffusion models have enabled high-quality image generation and editing. We present a method to localize the desired edit region implicit in a text instruction. We leverage InstructPix2Pix (IP2P) and identify the discrepancy between IP2P predictions with and without the instruction. This discrepancy is referred to as the relevance map. The relevance map conveys the importance of changing each pixel to achieve the edits, and is used to to guide the modifications. This guidance ensures that the irrelevant pixels remain unchanged. Relevance maps are further used to enhance the quality of text-guided editing of 3D scenes in the form of neural radiance fields. A field is trained on relevance maps of training views, denoted as the relevance field, defining the 3D region within which modifications should be made. We perform iterative updates on the training views guided by rendered relevance maps from the relevance field. Our method achieves state-of-the-art performance on both image and NeRF editing tasks. Project page: https://ashmrz.github.io/WatchYourSteps/
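A hedged sketch of the relevance-map computation: the per-pixel discrepancy between diffusion noise predictions with and without the instruction. Here `ip2p` stands in for an InstructPix2Pix-style UNet, and its call signature is hypothetical.

```python
import torch

@torch.no_grad()
def relevance_map(ip2p, noisy_latent, t, image_cond, instruction_emb, null_emb):
    """Per-pixel edit relevance: discrepancy between the model's noise
    predictions with and without the text instruction (names are illustrative)."""
    eps_edit = ip2p(noisy_latent, t, image_cond, instruction_emb)  # conditioned on the edit
    eps_null = ip2p(noisy_latent, t, image_cond, null_emb)         # no instruction
    rel = (eps_edit - eps_null).abs().mean(dim=1, keepdim=True)    # average over channels
    rel = (rel - rel.min()) / (rel.max() - rel.min() + 1e-8)       # normalize to [0, 1]
    return rel  # high values mark pixels the instruction wants changed
```

In practice the normalized map would then gate the denoising updates, leaving low-relevance pixels untouched.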

Auxiliary Tasks Benefit 3D Skeleton-based Human Motion Prediction

  • paper_url: http://arxiv.org/abs/2308.08942
  • repo_url: https://github.com/mediabrain-sjtu/auxformer
  • paper_authors: Chenxin Xu, Robby T. Tan, Yuhong Tan, Siheng Chen, Xinchao Wang, Yanfeng Wang
  • for: Improving the accuracy of human motion prediction by exploring the spatial-temporal dependencies in observed motions.
  • methods: Introduces a model learning framework with auxiliary tasks in which partial body joints' coordinates are corrupted by masking or added noise and must be recovered from the remaining coordinates; proposes a novel auxiliary-adapted transformer that handles incomplete, corrupted motion data and achieves coordinate recovery by capturing spatial-temporal dependencies.
  • results: Outperforms state-of-the-art methods by remarkable margins of 7.2%, 3.7%, and 9.4% in 3D mean per joint position error (MPJPE) on the Human3.6M, CMU Mocap, and 3DPW datasets, respectively, and is more robust under missing-data and noisy-data conditions. Code is available at https://github.com/MediaBrain-SJTU/AuxFormer.
    Abstract Exploring spatial-temporal dependencies from observed motions is one of the core challenges of human motion prediction. Previous methods mainly focus on dedicated network structures to model the spatial and temporal dependencies. This paper considers a new direction by introducing a model learning framework with auxiliary tasks. In our auxiliary tasks, partial body joints' coordinates are corrupted by either masking or adding noise and the goal is to recover corrupted coordinates depending on the rest coordinates. To work with auxiliary tasks, we propose a novel auxiliary-adapted transformer, which can handle incomplete, corrupted motion data and achieve coordinate recovery via capturing spatial-temporal dependencies. Through auxiliary tasks, the auxiliary-adapted transformer is promoted to capture more comprehensive spatial-temporal dependencies among body joints' coordinates, leading to better feature learning. Extensive experimental results have shown that our method outperforms state-of-the-art methods by remarkable margins of 7.2%, 3.7%, and 9.4% in terms of 3D mean per joint position error (MPJPE) on the Human3.6M, CMU Mocap, and 3DPW datasets, respectively. We also demonstrate that our method is more robust under data missing cases and noisy data cases. Code is available at https://github.com/MediaBrain-SJTU/AuxFormer.
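The auxiliary-task corruption can be sketched as follows: mask some joints, add noise to others, and train the model to recover the originals at the corrupted positions. The masking ratio and noise scale below are placeholders, not the paper's settings.

```python
import torch

def corrupt_joints(motion, mask_ratio=0.2, noise_std=0.02, rng=None):
    """Auxiliary-task corruption: zero out some joints' coordinates and add
    Gaussian noise to others; the model must recover the originals.
    motion: (T, J, 3) joint coordinates over T frames."""
    g = rng if rng is not None else torch.Generator().manual_seed(0)
    corrupted = motion.clone()
    T, J, _ = motion.shape
    masked = torch.rand(T, J, generator=g) < mask_ratio            # masking corruption
    corrupted[masked] = 0.0
    noisy = (torch.rand(T, J, generator=g) < mask_ratio) & ~masked  # noising corruption
    corrupted[noisy] += noise_std * torch.randn(T, J, 3, generator=g)[noisy]
    return corrupted, masked | noisy   # network input and recovery-target positions

# Illustrative recovery loss on corrupted positions only:
# loss = ((pred - motion)[target_mask]).pow(2).mean()
```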

Automatic Signboard Recognition in Low Quality Night Images

  • paper_url: http://arxiv.org/abs/2308.08941
  • repo_url: None
  • paper_authors: Manas Kagde, Priyanka Choudhary, Rishi Joshi, Somnath Dey
  • for: Robust traffic-sign detection and recognition for driver assistance systems and autonomous driving, enabling a vehicle to analyze its environment and make appropriate movement decisions even in adverse conditions.
  • methods: A two-step pipeline: a modified MIRNet model first enhances low-quality night images, then a YOLOv4 model recognizes the traffic signs in an unconstrained environment.
  • results: Improves YOLOv4's mAP@0.5 on low-quality images by 5.40%, achieves an overall mAP@0.5 of 96.75% on the GTSRB dataset, and attains 100% mAP@0.5 on the broad categories of the GTSDB dataset, comparable with state-of-the-art work.
    Abstract An essential requirement for driver assistance systems and autonomous driving technology is implementing a robust system for detecting and recognizing traffic signs. This system enables the vehicle to autonomously analyze the environment and make appropriate decisions regarding its movement, even when operating at higher frame rates. However, traffic sign images captured in inadequate lighting and adverse weather conditions are poorly visible, blurred, faded, and damaged. Consequently, the recognition of traffic signs in such circumstances becomes inherently difficult. This paper addressed the challenges of recognizing traffic signs from images captured in low light, noise, and blurriness. To achieve this goal, a two-step methodology has been employed. The first step involves enhancing traffic sign images by applying a modified MIRNet model and producing enhanced images. In the second step, the Yolov4 model recognizes the traffic signs in an unconstrained environment. The proposed method has achieved 5.40% increment in mAP@0.5 for low quality images on Yolov4. The overall mAP@0.5 of 96.75% has been achieved on the GTSRB dataset. It has also attained mAP@0.5 of 100% on the GTSDB dataset for the broad categories, comparable with the state-of-the-art work.

SDDNet: Style-guided Dual-layer Disentanglement Network for Shadow Detection

  • paper_url: http://arxiv.org/abs/2308.08935
  • repo_url: None
  • paper_authors: Runmin Cong, Yuchen Guan, Jinpeng Chen, Wei Zhang, Yao Zhao, Sam Kwong
  • for: Improving shadow detection accuracy, especially on complex backgrounds where existing methods are misled by background color.
  • methods: Treats the input shadow image as a composition of a background layer and a shadow layer and models them independently with the Style-guided Dual-layer Disentanglement Network (SDDNet): a Feature Separation and Recombination (FSR) module decomposes multi-level features into shadow-related and background-related components under specialized supervision and a reconstruction constraint, while a Shadow Style Filter (SSF) module guides the disentanglement through style differentiation and uniformization.
  • results: Achieves superior performance on three public datasets with a real-time inference speed of 32 FPS.
    Abstract Despite significant progress in shadow detection, current methods still struggle with the adverse impact of background color, which may lead to errors when shadows are present on complex backgrounds. Drawing inspiration from the human visual system, we treat the input shadow image as a composition of a background layer and a shadow layer, and design a Style-guided Dual-layer Disentanglement Network (SDDNet) to model these layers independently. To achieve this, we devise a Feature Separation and Recombination (FSR) module that decomposes multi-level features into shadow-related and background-related components by offering specialized supervision for each component, while preserving information integrity and avoiding redundancy through the reconstruction constraint. Moreover, we propose a Shadow Style Filter (SSF) module to guide the feature disentanglement by focusing on style differentiation and uniformization. With these two modules and our overall pipeline, our model effectively minimizes the detrimental effects of background color, yielding superior performance on three public datasets with a real-time inference speed of 32 FPS.

Point-aware Interaction and CNN-induced Refinement Network for RGB-D Salient Object Detection

  • paper_url: http://arxiv.org/abs/2308.08930
  • repo_url: https://github.com/rmcong/picr-net_acmmm23
  • paper_authors: Runmin Cong, Hongyu Liu, Chen Zhang, Wei Zhang, Feng Zheng, Ran Song, Sam Kwong
  • for: Improving salient object detection (SOD) in complex and challenging scenes.
  • methods: Integrates complementary information from RGB images and depth maps, combining CNNs and a Transformer architecture to model global long-range dependencies within and across modalities; an attention-triggered cross-modality Point-aware Interaction (CmPI) module explores feature interaction between modalities under positional constraints, and a CNN-induced Refinement (CNNR) unit refines and supplements content to alleviate the block effect and detail destruction introduced by the Transformer.
  • results: Extensive experiments on five RGB-D SOD datasets show competitive results in both quantitative and qualitative comparisons.
    Abstract By integrating complementary information from RGB image and depth map, the ability of salient object detection (SOD) for complex and challenging scenes can be improved. In recent years, the important role of Convolutional Neural Networks (CNNs) in feature extraction and cross-modality interaction has been fully explored, but it is still insufficient in modeling global long-range dependencies of self-modality and cross-modality. To this end, we introduce CNNs-assisted Transformer architecture and propose a novel RGB-D SOD network with Point-aware Interaction and CNN-induced Refinement (PICR-Net). On the one hand, considering the prior correlation between RGB modality and depth modality, an attention-triggered cross-modality point-aware interaction (CmPI) module is designed to explore the feature interaction of different modalities with positional constraints. On the other hand, in order to alleviate the block effect and detail destruction problems brought by the Transformer naturally, we design a CNN-induced refinement (CNNR) unit for content refinement and supplementation. Extensive experiments on five RGB-D SOD datasets show that the proposed network achieves competitive results in both quantitative and qualitative comparisons.

Frequency Perception Network for Camouflaged Object Detection

  • paper_url: http://arxiv.org/abs/2308.08924
  • repo_url: https://github.com/rmcong/fpnet_acmmm23
  • paper_authors: Runmin Cong, Mengyao Sun, Sanyi Zhang, Xiaofei Zhou, Wei Zhang, Yao Zhao
  • for: Camouflaged object detection (COD), which aims to accurately detect objects hidden in their surroundings; existing COD methods operate mainly in the RGB domain, where their performance is not fully exploited in many challenging scenarios.
  • methods: Proposes a learnable and separable frequency perception mechanism driven by the semantic hierarchy in the frequency domain. The network is a two-stage model: a frequency-guided coarse localization stage built on a flexible frequency perception module based on octave convolution, and a detail-preserving fine localization stage whose correction fusion module progressively integrates high-level features through prior-guided correction and cross-layer feature channel association, finally combining them with shallow features for detailed refinement.
  • results: Achieves competitive performance against existing models on three popular benchmark datasets, both qualitatively and quantitatively.
    Abstract Camouflaged object detection (COD) aims to accurately detect objects hidden in the surrounding environment. However, the existing COD methods mainly locate camouflaged objects in the RGB domain, their performance has not been fully exploited in many challenging scenarios. Considering that the features of the camouflaged object and the background are more discriminative in the frequency domain, we propose a novel learnable and separable frequency perception mechanism driven by the semantic hierarchy in the frequency domain. Our entire network adopts a two-stage model, including a frequency-guided coarse localization stage and a detail-preserving fine localization stage. With the multi-level features extracted by the backbone, we design a flexible frequency perception module based on octave convolution for coarse positioning. Then, we design the correction fusion module to step-by-step integrate the high-level features through the prior-guided correction and cross-layer feature channel association, and finally combine them with the shallow features to achieve the detailed correction of the camouflaged objects. Compared with the currently existing models, our proposed method achieves competitive performance in three popular benchmark datasets both qualitatively and quantitatively.

Identity-Seeking Self-Supervised Representation Learning for Generalizable Person Re-identification

  • paper_url: http://arxiv.org/abs/2308.08887
  • repo_url: https://github.com/dcp15/isr_iccv2023_oral
  • paper_authors: Zhaopeng Dou, Zhongdao Wang, Yali Li, Shengjin Wang
  • for: Learning a domain-generalizable person re-identification (ReID) representation from large-scale videos without any annotation.
  • methods: Proposes Identity-seeking Self-supervised Representation learning (ISR), which mines identity information by modeling instance association across frames as a maximum-weight bipartite matching problem, with a reliability-guided contrastive loss suppressing the adverse impact of noisy positive pairs.
  • results: Without human annotation or fine-tuning, ISR achieves 87.0% rank-1 on Market-1501 and 56.4% rank-1 on MSMT17, surpassing the best supervised domain-generalizable method by 5.0% and 19.5%, respectively; in the pre-training → fine-tuning scheme it achieves state-of-the-art performance with 88.4% rank-1 on MSMT17. The code is at https://github.com/dcp15/ISR_ICCV2023_Oral.
    Abstract This paper aims to learn a domain-generalizable (DG) person re-identification (ReID) representation from large-scale videos without any annotation. Prior DG ReID methods employ limited labeled data for training due to the high cost of annotation, which restricts further advances. To overcome the barriers of data and annotation, we propose to utilize large-scale unsupervised data for training. The key issue lies in how to mine identity information. To this end, we propose an Identity-seeking Self-supervised Representation learning (ISR) method. ISR constructs positive pairs from inter-frame images by modeling the instance association as a maximum-weight bipartite matching problem. A reliability-guided contrastive loss is further presented to suppress the adverse impact of noisy positive pairs, ensuring that reliable positive pairs dominate the learning process. The training cost of ISR scales approximately linearly with the data size, making it feasible to utilize large-scale data for training. The learned representation exhibits superior generalization ability. Without human annotation and fine-tuning, ISR achieves 87.0% Rank-1 on Market-1501 and 56.4% Rank-1 on MSMT17, outperforming the best supervised domain-generalizable method by 5.0% and 19.5%, respectively. In the pre-training → fine-tuning scenario, ISR achieves state-of-the-art performance, with 88.4% Rank-1 on MSMT17. The code is at https://github.com/dcp15/ISR_ICCV2023_Oral.
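A minimal sketch of mining positive pairs as a maximum-weight bipartite matching over inter-frame feature similarity, using scipy's assignment solver. The similarity floor is a crude stand-in for the paper's reliability-guided filtering.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_positive_pairs(feats_t, feats_t1, sim_floor=0.5):
    """Mine positive pairs between two frames as a maximum-weight bipartite
    matching over feature similarity; low-similarity matches are dropped.
    feats_*: L2-normalized (N, D) and (M, D) instance features."""
    sim = feats_t @ feats_t1.T                   # (N, M) edge weights
    rows, cols = linear_sum_assignment(-sim)     # negate to maximize total similarity
    keep = sim[rows, cols] > sim_floor           # crude reliability filter
    return rows[keep], cols[keep]

# Usage with random stand-ins:
a = np.random.randn(5, 16); a /= np.linalg.norm(a, axis=1, keepdims=True)
b = np.random.randn(6, 16); b /= np.linalg.norm(b, axis=1, keepdims=True)
print(match_positive_pairs(a, b))
```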

Event-Guided Procedure Planning from Instructional Videos with Text Supervision

  • paper_url: http://arxiv.org/abs/2308.08885
  • repo_url: https://github.com/AlanWang0o0/ISEE-E3P
  • paper_authors: An-Lan Wang, Kun-Yu Lin, Jia-Run Du, Jingke Meng, Wei-Shi Zheng
  • for: Procedure planning from instructional videos with text supervision: predicting an action sequence that transforms the initial visual state into the goal visual state, bridging the large semantic gap between observed states and unobserved intermediate actions.
  • methods: Proposes an event-guided paradigm that first infers events from the observed states and then plans actions based on both the states and the predicted events; the resulting Event-guided Prompting-based Procedure Planning (E3P) model encodes event information into the sequential modeling process and mines intra-event action relations with a mask-and-predict approach that uses probabilistic masking for regularization.
  • results: Extensive experiments on three datasets demonstrate the effectiveness of the proposed model.
    Abstract In this work, we focus on the task of procedure planning from instructional videos with text supervision, where a model aims to predict an action sequence to transform the initial visual state into the goal visual state. A critical challenge of this task is the large semantic gap between observed visual states and unobserved intermediate actions, which is ignored by previous works. Specifically, this semantic gap refers to that the contents in the observed visual states are semantically different from the elements of some action text labels in a procedure. To bridge this semantic gap, we propose a novel event-guided paradigm, which first infers events from the observed states and then plans out actions based on both the states and predicted events. Our inspiration comes from that planning a procedure from an instructional video is to complete a specific event and a specific event usually involves specific actions. Based on the proposed paradigm, we contribute an Event-guided Prompting-based Procedure Planning (E3P) model, which encodes event information into the sequential modeling process to support procedure planning. To further consider the strong action associations within each event, our E3P adopts a mask-and-predict approach for relation mining, incorporating a probabilistic masking scheme for regularization. Extensive experiments on three datasets demonstrate the effectiveness of our proposed model.

SRMAE: Masked Image Modeling for Scale-Invariant Deep Representations

  • paper_url: http://arxiv.org/abs/2308.08884
  • repo_url: None
  • paper_authors: Zhiming Wang, Lin Gu, Feng Lu
  • for: Using image scale as a self-supervised signal for Masked Image Modeling (MIM).
  • methods: Selects random patches from the input image, downsamples them to a low-resolution format, and uses a super-resolution-based prediction head to reconstruct the input from the low-resolution clues and the other patches.
  • results: After 400 epochs of pre-training, SRMAE reaches 82.1% accuracy on ImageNet-1K; it surpasses DeriveNet by 1.3% on very-low-resolution (VLR) recognition and achieves 74.84% on low-resolution facial expression recognition, beating the state-of-the-art FMD by 9.48%.
    Abstract Due to the prevalence of scale variance in nature images, we propose to use image scale as a self-supervised signal for Masked Image Modeling (MIM). Our method involves selecting random patches from the input image and downsampling them to a low-resolution format. Our framework utilizes the latest advances in super-resolution (SR) to design the prediction head, which reconstructs the input from low-resolution clues and other patches. After 400 epochs of pre-training, our Super Resolution Masked Autoencoders (SRMAE) get an accuracy of 82.1% on the ImageNet-1K task. Image scale signal also allows our SRMAE to capture scale invariance representation. For the very low resolution (VLR) recognition task, our model achieves the best performance, surpassing DeriveNet by 1.3%. Our method also achieves an accuracy of 74.84% on the task of recognizing low-resolution facial expressions, surpassing the current state-of-the-art FMD by 9.48%.
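A sketch of the scale-based pretext task: random patches are replaced by re-upsampled 4x-downsampled versions, and the network must restore the original high-resolution content. Patch size, patch count, and the bilinear resampling are placeholders of ours.

```python
import torch
import torch.nn.functional as F

def make_scale_targets(images, patch=32, num_corrupt=4, scale=4, rng=None):
    """Scale-based self-supervision: replace random patches with re-upsampled
    low-resolution versions; the SR prediction head must restore the original
    content from the low-resolution clue. images: (B, C, H, W)."""
    g = rng if rng is not None else torch.Generator().manual_seed(0)
    corrupted = images.clone()
    B, C, H, W = images.shape
    for b in range(B):
        for _ in range(num_corrupt):
            y = torch.randint(0, H - patch + 1, (1,), generator=g).item()
            x = torch.randint(0, W - patch + 1, (1,), generator=g).item()
            crop = images[b:b + 1, :, y:y + patch, x:x + patch]
            low = F.interpolate(crop, scale_factor=1 / scale, mode='bilinear')
            corrupted[b:b + 1, :, y:y + patch, x:x + patch] = F.interpolate(
                low, size=(patch, patch), mode='bilinear')
    return corrupted  # reconstruction target is the original `images`
```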

Text-Only Training for Visual Storytelling

  • paper_url: http://arxiv.org/abs/2308.08881
  • repo_url: None
  • paper_authors: Yuechen Wang, Wengang Zhou, Zhenbo Lu, Houqiang Li
  • for: Proposing a text-only training method for visual storytelling, improving generation quality and generalization without paired image-text training data.
  • methods: Formulates visual storytelling as visual-conditioned story generation and separates cross-modality alignment from story generation: the cross-modality pre-trained CLIP model integrates visual control into a story generator trained exclusively on text data, while a training-free visual condition planner accounts for the temporal structure of the input image sequence and balances global and local visual content.
  • results: Extensive experiments on the VIST benchmark show the approach is effective in both in-domain and cross-domain settings; evaluations of expression diversity and human assessment further underscore its informativeness and robustness.
    Abstract Visual storytelling aims to generate a narrative based on a sequence of images, necessitating both vision-language alignment and coherent story generation. Most existing solutions predominantly depend on paired image-text training data, which can be costly to collect and challenging to scale. To address this, we formulate visual storytelling as a visual-conditioned story generation problem and propose a text-only training method that separates the learning of cross-modality alignment and story generation. Our approach specifically leverages the cross-modality pre-trained CLIP model to integrate visual control into a story generator, trained exclusively on text data. Moreover, we devise a training-free visual condition planner that accounts for the temporal structure of the input image sequence while balancing global and local visual content. The distinctive advantage of requiring only text data for training enables our method to learn from external text story data, enhancing the generalization capability of visual storytelling. We conduct extensive experiments on the VIST benchmark, showcasing the effectiveness of our approach in both in-domain and cross-domain settings. Further evaluations on expression diversity and human assessment underscore the superiority of our method in terms of informativeness and robustness.

Towards Semi-supervised Learning with Non-random Missing Labels

  • paper_url: http://arxiv.org/abs/2308.08872
  • repo_url: https://github.com/njuyued/prg4ssl-mnar
  • paper_authors: Yue Duan, Zhen Zhao, Lei Qi, Luping Zhou, Lei Wang, Yinghuan Shi
  • for: Addresses the challenging scenario of label Missing Not At Random (MNAR) in semi-supervised learning (SSL), which is ignored by existing SSL methods.
  • methods: Proposes a class transition tracking based Pseudo-Rectifying Guidance (PRG) to maintain the model’s unbiased enthusiasm towards assigning pseudo-labels to all classes, improving the quality of pseudo-labels on both popular classes and rare classes in MNAR.
  • results: Shows superior performance of PRG across a variety of MNAR scenarios, outperforming the latest SSL approaches combining bias removal solutions by a large margin.
    Abstract Semi-supervised learning (SSL) tackles the label missing problem by enabling the effective usage of unlabeled data. While existing SSL methods focus on the traditional setting, a practical and challenging scenario called label Missing Not At Random (MNAR) is usually ignored. In MNAR, the labeled and unlabeled data fall into different class distributions resulting in biased label imputation, which deteriorates the performance of SSL models. In this work, class transition tracking based Pseudo-Rectifying Guidance (PRG) is devised for MNAR. We explore the class-level guidance information obtained by the Markov random walk, which is modeled on a dynamically created graph built over the class tracking matrix. PRG unifies the historical information of class distribution and class transitions caused by the pseudo-rectifying procedure to maintain the model's unbiased enthusiasm towards assigning pseudo-labels to all classes, so as the quality of pseudo-labels on both popular classes and rare classes in MNAR could be improved. Finally, we show the superior performance of PRG across a variety of MNAR scenarios, outperforming the latest SSL approaches combining bias removal solutions by a large margin. Code and model weights are available at https://github.com/NJUyued/PRG4SSL-MNAR.

Spatially and Spectrally Consistent Deep Functional Maps

  • paper_url: http://arxiv.org/abs/2308.08871
  • repo_url: https://github.com/rqhuang88/Spatially-and-Spectrally-Consistent-Deep-Functional-Maps
  • paper_authors: Mingze Sun, Shiwei Mao, Puhua Jiang, Maks Ovsjanikov, Ruqi Huang
  • for: investigate the utility of cycle consistency in Deep Functional Maps for non-rigid shape matching
  • methods: use spectral and point-wise representation to enforce harmony of learned maps, and independently estimate maps in both domains to alleviate over-fitting
  • results: produce state-of-the-art results in mapping shapes under significant distortions, with superior generalization performance and accuracy in challenging tests for both near-isometric and non-isometric datasets
    Abstract Cycle consistency has long been exploited as a powerful prior for jointly optimizing maps within a collection of shapes. In this paper, we investigate its utility in the approaches of Deep Functional Maps, which are considered state-of-the-art in non-rigid shape matching. We first justify that under certain conditions, the learned maps, when represented in the spectral domain, are already cycle consistent. Furthermore, we identify the discrepancy that spectrally consistent maps are not necessarily spatially, or point-wise, consistent. In light of this, we present a novel design of unsupervised Deep Functional Maps, which effectively enforces the harmony of learned maps under the spectral and the point-wise representation. By taking advantage of cycle consistency, our framework produces state-of-the-art results in mapping shapes even under significant distortions. Beyond that, by independently estimating maps in both spectral and spatial domains, our method naturally alleviates over-fitting in network training, yielding superior generalization performance and accuracy within an array of challenging tests for both near-isometric and non-isometric datasets. Codes are available at https://github.com/rqhuang88/Spatiallyand-Spectrally-Consistent-Deep-Functional-Maps.
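Cycle consistency in the spectral domain means that composing functional maps around a loop of shapes should return the identity; the toy numpy check below makes this concrete (map sizes and the random test data are ours).

```python
import numpy as np

def cycle_consistency_error(C_ab, C_bc, C_ca):
    """Spectral cycle-consistency residual for functional maps: composing the
    maps A->B->C->A should give the identity on A's spectral basis."""
    k = C_ab.shape[0]
    composed = C_ca @ C_bc @ C_ab   # functional maps compose by left-multiplication
    return np.linalg.norm(composed - np.eye(k))

# Toy check: random maps with C_ca chosen to close the cycle exactly.
rng = np.random.default_rng(0)
C_ab = np.linalg.qr(rng.standard_normal((30, 30)))[0]
C_bc = np.linalg.qr(rng.standard_normal((30, 30)))[0]
C_ca = np.linalg.inv(C_bc @ C_ab)
print(cycle_consistency_error(C_ab, C_bc, C_ca))  # ~0 for a consistent cycle
```

As the paper notes, a small spectral residual like this does not guarantee point-wise consistency, which is why the method supervises both representations.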

MV-ROPE: Multi-view Constraints for Robust Category-level Object Pose and Size Estimation

  • paper_url: http://arxiv.org/abs/2308.08856
  • repo_url: https://github.com/greatoyster/mv-rope
  • paper_authors: Jiaqi Yang, Yucong Chen, Xiangting Meng, Chenxin Yan, Min Li, Ran Chen, Lige Liu, Tao Sun, Laurent Kneip
  • for: The paper proposes a novel framework for RGB-based category-level 6D object pose and size estimation.
  • methods: The method utilizes the prediction of normalized object coordinate space (NOCS), which is an efficient and effective object canonical representation that can be extracted from RGB images, without relying on additional depth readings.
  • results: The experimental results demonstrate the strong performance of the proposed method, comparable to state-of-the-art RGB-D methods across public dataset sequences, and the method also shows good generalization ability on self-collected datasets.
    Abstract We propose a novel framework for RGB-based category-level 6D object pose and size estimation. Our approach relies on the prediction of normalized object coordinate space (NOCS), which serves as an efficient and effective object canonical representation that can be extracted from RGB images. Unlike previous approaches that heavily relied on additional depth readings as input, our novelty lies in leveraging multi-view information, which is commonly available in practical scenarios where a moving camera continuously observes the environment. By introducing multi-view constraints, we can obtain accurate camera pose and depth estimation from a monocular dense SLAM framework. Additionally, by incorporating constraints on the camera relative pose, we can apply trimming strategies and robust pose averaging on the multi-view object poses, resulting in more accurate and robust estimations of category-level object poses even in the absence of direct depth readings. Furthermore, we introduce a novel NOCS prediction network that significantly improves performance. Our experimental results demonstrate the strong performance of our proposed method, even comparable to state-of-the-art RGB-D methods across public dataset sequences. Additionally, we showcase the generalization ability of our method by evaluating it on self-collected datasets.

Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling

  • paper_url: http://arxiv.org/abs/2308.08855
  • repo_url: https://github.com/zxz267/AvatarJLM
  • paper_authors: Xiaozheng Zheng, Zhuo Su, Chao Wen, Zhou Xue, Xiaojie Jin
  • for: Realistically driving 3D full-body avatars for rapidly developing VR/AR applications.
  • methods: Proposes a two-stage framework that obtains accurate and smooth full-body motions from only the three tracking signals of the head-mounted display and hand controllers: the first stage explicitly models joint-level features, and the second stage uses them as spatiotemporal tokens in alternating spatial and temporal transformer blocks to capture joint-level correlations, with a set of loss terms constraining this high-degree-of-freedom task.
  • results: Extensive experiments on the AMASS motion dataset and real-captured data validate the designs and show more accurate and smoother motion than existing approaches.
    Abstract To bridge the physical and virtual worlds for rapidly developed VR/AR applications, the ability to realistically drive 3D full-body avatars is of great significance. Although real-time body tracking with only the head-mounted displays (HMDs) and hand controllers is heavily under-constrained, a carefully designed end-to-end neural network is of great potential to solve the problem by learning from large-scale motion data. To this end, we propose a two-stage framework that can obtain accurate and smooth full-body motions with the three tracking signals of head and hands only. Our framework explicitly models the joint-level features in the first stage and utilizes them as spatiotemporal tokens for alternating spatial and temporal transformer blocks to capture joint-level correlations in the second stage. Furthermore, we design a set of loss terms to constrain the task of a high degree of freedom, such that we can exploit the potential of our joint-level modeling. With extensive experiments on the AMASS motion dataset and real-captured data, we validate the effectiveness of our designs and show our proposed method can achieve more accurate and smooth motion compared to existing approaches.

Language-enhanced RNR-Map: Querying Renderable Neural Radiance Field maps with natural language

  • paper_url: http://arxiv.org/abs/2308.08854
  • repo_url: https://github.com/intelligolabs/Le-RNR-Map
  • paper_authors: Francesco Taioli, Federico Cunico, Federico Girella, Riccardo Bologna, Alessandro Farinelli, Marco Cristani
  • for: Providing a Language-enhanced Renderable Neural Radiance map (Le-RNR-Map) for visual navigation that can be searched with natural language query prompts.
  • methods: Builds on RNR-Map, a grid structure of latent codes derived from image observations at each pixel, which enables image rendering given a camera pose and highly accurate navigation and localization; enhances it with CLIP-based embedding latent codes, enabling natural language search without additional label data.
  • results: Demonstrates the map's effectiveness in single- and multi-object searches, and investigates its compatibility with a Large Language Model as an "affordance query resolver".
    Abstract We present Le-RNR-Map, a Language-enhanced Renderable Neural Radiance map for Visual Navigation with natural language query prompts. The recently proposed RNR-Map employs a grid structure comprising latent codes positioned at each pixel. These latent codes, which are derived from image observation, enable: i) image rendering given a camera pose, since they are converted to Neural Radiance Field; ii) image navigation and localization with astonishing accuracy. On top of this, we enhance RNR-Map with CLIP-based embedding latent codes, allowing natural language search without additional label data. We evaluate the effectiveness of this map in single and multi-object searches. We also investigate its compatibility with a Large Language Model as an "affordance query resolver". Code and videos are available at https://intelligolabs.github.io/Le-RNR-Map/
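A hedged sketch of natural-language querying over a CLIP-aligned latent grid: encode the prompt, score every map cell by cosine similarity, and return the best cell. The `clip_model`/`tokenizer` interface is assumed to be OpenAI-CLIP-like, and the flat (H, W, D) grid layout is our simplification.

```python
import torch

@torch.no_grad()
def query_map(clip_model, tokenizer, latent_grid, prompt):
    """Natural-language lookup on a language-enhanced latent grid.
    latent_grid: (H, W, D) CLIP-space codes; interface names are illustrative."""
    text = clip_model.encode_text(tokenizer(prompt))            # (1, D) prompt embedding
    text = text / text.norm(dim=-1, keepdim=True)
    grid = latent_grid / latent_grid.norm(dim=-1, keepdim=True)
    scores = grid @ text.squeeze(0)                             # (H, W) cosine similarity
    idx = scores.flatten().argmax()
    return divmod(idx.item(), scores.shape[1])                  # (row, col) of best match
```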

Bag of Tricks for Long-Tailed Multi-Label Classification on Chest X-Rays

  • paper_url: http://arxiv.org/abs/2308.08853
  • repo_url: None
  • paper_authors: Feng Hong, Tianjie Dai, Jiangchao Yao, Ya Zhang, Yanfeng Wang
  • for: 本文主要针对的是用机器学习算法进行胸部X射线图像的临床分类,特别是面临长尾和多标签等挑战。
  • methods: 本文提出了一些新的设计方案,包括数据扩充、特征提取器、分类器设计、损失函数重新权衡、外部数据补充等,以提高CXR诊断的性能。
  • results: 本文通过对多种设计方案的实践和简单的测试数据扩充以及 ensemble 技术,最终达到了ICCV CVAMD 2023 CXR-LT Competition 的测试集上的0.349 mAP值,排名前五。
    Abstract Clinical classification of chest radiography is particularly challenging for standard machine learning algorithms due to its inherent long-tailed and multi-label nature. However, few attempts take into account the coupled challenges posed by both the class imbalance and label co-occurrence, which hinders their value to boost the diagnosis on chest X-rays (CXRs) in the real-world scenarios. Besides, with the prevalence of pretraining techniques, how to incorporate these new paradigms into the current framework lacks of the systematical study. This technical report presents a brief description of our solution in the ICCV CVAMD 2023 CXR-LT Competition. We empirically explored the effectiveness for CXR diagnosis with the integration of several advanced designs about data augmentation, feature extractor, classifier design, loss function reweighting, exogenous data replenishment, etc. In addition, we improve the performance through simple test-time data augmentation and ensemble. Our framework finally achieves 0.349 mAP on the competition test set, ranking in the top five.
    摘要 严重疾病分类从骨盆 X 光图像中 particullay challenging,特别是由于其内在的长尾和多标签性。然而,有很少的尝试把承载这两个挑战:分类倾度不均和标签共occurrence。这些挑战限制了机器学习算法在实际应用中的价值。此外,随着预训练技术的普及,如何将这些新的思维方式集成到当前框架中,尚未得到系统性的研究。本技报报告 briefly describes our solution in the ICCV CVAMD 2023 CXR-LT Competition. we empirically explored the effectiveness of CXR diagnosis with the integration of several advanced designs, including data augmentation, feature extractor, classifier design, loss function reweighting, exogenous data replenishment, etc. In addition, we improve the performance through simple test-time data augmentation and ensemble. our framework finally achieves 0.349 mAP on the competition test set, ranking in the top five.

A Survey on Deep Multi-modal Learning for Body Language Recognition and Generation

  • paper_url: http://arxiv.org/abs/2308.08849
  • repo_url: https://github.com/wentaol86/awesome-body-language
  • paper_authors: Li Liu, Lufei Gao, Wentao Lei, Fengji Ma, Xiaotian Lin, Jinting Wang
  • for: 这 paper 主要是为了探讨深度多Modal学习在不同的身体语言(BL)生成和识别方面的应用。
  • methods: 这 paper 使用了深度多Modal学习技术来分析和理解不同的BL,包括手语(SL)、句子语音(CS)、同时说话(CoS)和头部语音(TH)等。
  • results: 这 paper 对这些多Modal approaches的评估和比较,并提出了未来研究的方向,如自然语言处理、多Modal学习和大规模预训练模型的应用。
    Abstract Body language (BL) refers to the non-verbal communication expressed through physical movements, gestures, facial expressions, and postures. It is a form of communication that conveys information, emotions, attitudes, and intentions without the use of spoken or written words. It plays a crucial role in interpersonal interactions and can complement or even override verbal communication. Deep multi-modal learning techniques have shown promise in understanding and analyzing these diverse aspects of BL. The survey emphasizes their applications to BL generation and recognition. Several common BLs are considered i.e., Sign Language (SL), Cued Speech (CS), Co-speech (CoS), and Talking Head (TH), and we have conducted an analysis and established the connections among these four BL for the first time. Their generation and recognition often involve multi-modal approaches. Benchmark datasets for BL research are well collected and organized, along with the evaluation of SOTA methods on these datasets. The survey highlights challenges such as limited labeled data, multi-modal learning, and the need for domain adaptation to generalize models to unseen speakers or languages. Future research directions are presented, including exploring self-supervised learning techniques, integrating contextual information from other modalities, and exploiting large-scale pre-trained multi-modal models. In summary, this survey paper provides a comprehensive understanding of deep multi-modal learning for various BL generations and recognitions for the first time. By analyzing advancements, challenges, and future directions, it serves as a valuable resource for researchers and practitioners in advancing this field. n addition, we maintain a continuously updated paper list for deep multi-modal learning for BL recognition and generation: https://github.com/wentaoL86/awesome-body-language.
    摘要 Body language (BL) 指的是通过物理运动、姿势、表情和姿态来表达的非语言通信。它是人际交流中的一种重要的沟通方式,可以补充或甚至覆盖语言交流。深入的多Modal学习技术已经在理解和分析这些多种非语言通信方面表现出了承诺。本文件尽可能地概括了这些多Modal学习技术的应用和挑战,并提出了未来研究的方向。在本文中,我们分析了四种常见的BL:手语(SL)、笔记法(CS)、协调说话(CoS)和对话头(TH),并对这些四种BL之间的连接进行了分析。其生成和识别通常需要多Modal的方法。我们也收集了一些标准的BL数据集,并对这些数据集进行了评估。Despite the progress made, there are still several challenges that need to be addressed, such as limited labeled data, multi-modal learning, and the need for domain adaptation to generalize models to unseen speakers or languages. In the future, we can explore self-supervised learning techniques, integrate contextual information from other modalities, and exploit large-scale pre-trained multi-modal models.总之,本文提供了深入的多Modal学习技术在不同的BL生成和识别方面的首次概括。通过分析进步、挑战和未来方向,它将成为研究和实践者在这个领域的有价值资源。此外,我们还维护一份continuously更新的BL生成和识别相关文献列表,可以在 GitHub上找到:https://github.com/wentaoL86/awesome-body-language。

ICoNIK: Generating Respiratory-Resolved Abdominal MR Reconstructions Using Neural Implicit Representations in k-Space

  • paper_url: http://arxiv.org/abs/2308.08830
  • repo_url: None
  • paper_authors: Veronika Spieker, Wenqi Huang, Hannah Eichhorn, Jonathan Stelter, Kilian Weiss, Veronika A. Zimmer, Rickmer F. Braren, Dimitrios C. Karampinos, Kerstin Hammernik, Julia A. Schnabel
  • for: 实现静止照片与运动照片的混合,解决运动引起的影像污染问题。
  • methods: 使用神经网络学习几何空间中的几何函数,并将这个函数与测量点和呼吸访问信号组合,实现无推印影像重建。
  • results: 比标准运动解析技术高效,并提供了一个可能性解决运动引起的影像污染问题的解析方法。
    Abstract Motion-resolved reconstruction for abdominal magnetic resonance imaging (MRI) remains a challenge due to the trade-off between residual motion blurring caused by discretized motion states and undersampling artefacts. In this work, we propose to generate blurring-free motion-resolved abdominal reconstructions by learning a neural implicit representation directly in k-space (NIK). Using measured sampling points and a data-derived respiratory navigator signal, we train a network to generate continuous signal values. To aid the regularization of sparsely sampled regions, we introduce an additional informed correction layer (ICo), which leverages information from neighboring regions to correct NIK's prediction. Our proposed generative reconstruction methods, NIK and ICoNIK, outperform standard motion-resolved reconstruction techniques and provide a promising solution to address motion artefacts in abdominal MRI.
    摘要 对于腹部磁共振成像(MRI)中的运动解像仍然是一个挑战,因为存在归一化运动态论和抽样缺陷之间的质量冲突。在这种工作中,我们提议通过直接在k空间学习神经网络(NIK)来生成无抖的运动解像腹部重建。使用测量的抽样点和数据驱动的呼吸导航信号,我们训练了一个网络来生成连续的信号值。为了帮助稀疏抽样区域的正则化,我们引入了一个加 informations层(ICo),该层利用邻近区域的信息来修正 NIK 的预测。我们的提出的生成重建方法,NIK 和 ICoNIK,超过标准的运动解像重建技术,并提供了解决腹部 MRI 中运动artefacts的可能性。

Fast Inference and Update of Probabilistic Density Estimation on Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2308.08824
  • repo_url: https://github.com/meaten/flowchain-iccv2023
  • paper_authors: Takahiro Maeda, Norimichi Ukita
  • For: 本文提出了一种新的正规流基本拟合方法(FlowChain),用于预测物体的运动轨迹。这种方法能够快速计算并准确地估计概率密度,这是安全关键应用如自动驾驶和社交机器人所需的。* Methods: FlowChain 是一个栈式的条件连续分布(CIF),可以表示概率密度的表达。这种表达可以进行分析计算,比较快速,而且更准确于 Gaussian 混合模型。此外,FlowChain 还允许快速更新估计概率密度,只需要在新的观测位置基础上, reuse 流变换和其对数Jacobian,可以在一毫秒内完成。* Results: 实验结果显示,我们的 FlowChain 在过去方法中实现了最佳的轨迹预测精度。此外,我们的 FlowChain 还在概率密度估计方面表现出了优势,具有更高的准确性和更快的计算速度。我们的代码可以在 https://github.com/meaten/FlowChain-ICCV2023 上下载。
    Abstract Safety-critical applications such as autonomous vehicles and social robots require fast computation and accurate probability density estimation on trajectory prediction. To address both requirements, this paper presents a new normalizing flow-based trajectory prediction model named FlowChain. FlowChain is a stack of conditional continuously-indexed flows (CIFs) that are expressive and allow analytical probability density computation. This analytical computation is faster than the generative models that need additional approximations such as kernel density estimation. Moreover, FlowChain is more accurate than the Gaussian mixture-based models due to fewer assumptions on the estimated density. FlowChain also allows a rapid update of estimated probability densities. This update is achieved by adopting the \textit{newest observed position} and reusing the flow transformations and its log-det-jacobians that represent the \textit{motion trend}. This update is completed in less than one millisecond because this reuse greatly omits the computational cost. Experimental results showed our FlowChain achieved state-of-the-art trajectory prediction accuracy compared to previous methods. Furthermore, our FlowChain demonstrated superiority in the accuracy and speed of density estimation. Our code is available at \url{https://github.com/meaten/FlowChain-ICCV2023}
    摘要 安全关键应用,如自动驾驶车和社交机器人,需要快速计算和准确的概率密度估计。为解决这两个需求,这篇论文提出了一种新的正规流基本型 trajectory prediction 模型,名为 FlowChain。FlowChain 是一个堆叠的 conditional continuously-indexed flows (CIFs),它们是表达力强的,并允许analytical probability density computation。这种analytical computation比 generative models 需要更多的 Approximation,如 kernel density estimation 更快。此外,FlowChain 比 Gaussian mixture-based models 更准确,因为它们对 estimated density 假设的更少。FlowChain 还允许快速更新 estimated probability densities。这个更新通过 adopting the 最新观测位置 和 reuse flow transformations 和其 log-det-jacobians 来完成,这个计算成本很低。实验结果表明我们的 FlowChain 在前一代方法的基础上实现了状态机器人 trajectory prediction 的最佳准确性。此外,我们的 FlowChain 还在概率密度估计的准确性和速度方面表现出了优势。我们的代码可以在 上找到。

MixBag: Bag-Level Data Augmentation for Learning from Label Proportions

  • paper_url: http://arxiv.org/abs/2308.08822
  • repo_url: None
  • paper_authors: Takanori Asanomi, Shinnosuke Matsuo, Daiki Suehiro, Ryoma Bise
  • for: 本研究旨在提出一种基于批处理的数据增强方法,以提高无监督学习中的实例级别分类器。
  • methods: 我们提出了一种基于实验观察的关键观察,即在固定总数据量下,增加标注批处理可以提高实例级别分类精度。此外,我们还提出了基于统计理论的信息量损失函数,以便有效地利用扩充后的批处理。
  • results: 实验结果表明,我们的方法可以与现有的实例级别数据增强方法相比,在减小损失函数下达到更高的精度。此外,我们的方法还可以与其他无监督学习方法结合使用,以提高分类器的泛化能力。
    Abstract Learning from label proportions (LLP) is a promising weakly supervised learning problem. In LLP, a set of instances (bag) has label proportions, but no instance-level labels are given. LLP aims to train an instance-level classifier by using the label proportions of the bag. In this paper, we propose a bag-level data augmentation method for LLP called MixBag, based on the key observation from our preliminary experiments; that the instance-level classification accuracy improves as the number of labeled bags increases even though the total number of instances is fixed. We also propose a confidence interval loss designed based on statistical theory to use the augmented bags effectively. To the best of our knowledge, this is the first attempt to propose bag-level data augmentation for LLP. The advantage of MixBag is that it can be applied to instance-level data augmentation techniques and any LLP method that uses the proportion loss. Experimental results demonstrate this advantage and the effectiveness of our method.
    摘要 学习标签比例(LLP)是一个有前途的弱监督学习问题。在LLP中,一个集合(袋)有标签比例,但没有每个实例的标签。LLP的目标是使用袋的标签比例来训练每个实例的分类器。在这篇论文中,我们提出了一种基于先前实验的观察的袋级数据增强方法called MixBag,以及基于统计理论的自信度范围损失。这是我们知道的第一个提出袋级数据增强的尝试。MixBag的优点是可以与实例级数据增强技术结合使用,并且可以与任何使用比例损失的LLP方法结合使用。实验结果表明了MixBag的优点和效果。

End-to-end Alternating Optimization for Real-World Blind Super Resolution

  • paper_url: http://arxiv.org/abs/2308.08816
  • repo_url: https://github.com/greatlog/realdan
  • paper_authors: Zhengxiong Luo, Yan Huang, Shang Li, Liang Wang, Tieniu Tan
  • for: 这篇论文主要针对的是做出高清度图像的简化超解像(SR),即从低解度图像(LR)中恢复高解度图像(HR)。
  • methods: 这篇论文提出了一种新的SR方法,即通过alternating optimization算法,将LR图像的简化和SR问题协同解决。具体来说,这种方法包括两个卷积神经网络:一个用于restore SR图像(Restorer),另一个用于估计LR图像的简化(Estimator)。这两个模块之间进行了循环的优化,以实现一个端到端可训练的网络。
  • results: 根据实验结果,这种方法可以与当前最佳方法相比,在SR问题上具有更高的精度和更好的视觉效果。
    Abstract Blind Super-Resolution (SR) usually involves two sub-problems: 1) estimating the degradation of the given low-resolution (LR) image; 2) super-resolving the LR image to its high-resolution (HR) counterpart. Both problems are ill-posed due to the information loss in the degrading process. Most previous methods try to solve the two problems independently, but often fall into a dilemma: a good super-resolved HR result requires an accurate degradation estimation, which however, is difficult to be obtained without the help of original HR information. To address this issue, instead of considering these two problems independently, we adopt an alternating optimization algorithm, which can estimate the degradation and restore the SR image in a single model. Specifically, we design two convolutional neural modules, namely \textit{Restorer} and \textit{Estimator}. \textit{Restorer} restores the SR image based on the estimated degradation, and \textit{Estimator} estimates the degradation with the help of the restored SR image. We alternate these two modules repeatedly and unfold this process to form an end-to-end trainable network. In this way, both \textit{Restorer} and \textit{Estimator} could get benefited from the intermediate results of each other, and make each sub-problem easier. Moreover, \textit{Restorer} and \textit{Estimator} are optimized in an end-to-end manner, thus they could get more tolerant of the estimation deviations of each other and cooperate better to achieve more robust and accurate final results. Extensive experiments on both synthetic datasets and real-world images show that the proposed method can largely outperform state-of-the-art methods and produce more visually favorable results. The codes are rleased at \url{https://github.com/greatlog/RealDAN.git}.
    摘要 通常,盲目超解像(SR)问题包括两个互相关联的优化问题:1)估计给出的低分辨率(LR)图像的劣化程度; 2)将LR图像提升到其高分辨率(HR)对应的图像。两个问题都是不定的,因为升级过程中的信息损失。大多数前一代方法通常会解决这两个问题独立,但经常陷入一个困境:一个好的HR图像需要一个准确的劣化估计,但是不可以不带原始HR信息来获得这个估计。为解决这个问题,我们采用了一种alternating optimization算法,可以同时估计劣化和 restaure SR图像。我们设计了两个卷积神经网络模块:namely \textit{Restorer}和\textit{Estimator}。\textit{Restorer}使用估计的劣化来还原SR图像,而\textit{Estimator}使用还原后的SR图像来估计劣化。我们在这两个模块之间进行了循环的交互,并将这个过程拓展成一个可训练的结束到终结点的网络。这样,\textit{Restorer}和\textit{Estimator}可以互相帮助,使得每个优化问题变得更加容易。此外,\textit{Restorer}和\textit{Estimator}在结束到终结点的training中被优化,因此它们可以更快地适应彼此的估计偏差,并更好地合作以实现更加稳定和准确的最终结果。广泛的实验表明,我们的方法可以大幅超越当前的状态控制方法,并生成更加视觉愉悦的结果。代码可以在\url{https://github.com/greatlog/RealDAN.git}中找到。

A Fusion of Variational Distribution Priors and Saliency Map Replay for Continual 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2308.08812
  • repo_url: None
  • paper_authors: Sanchar Palit, Sandika Biswas
  • for: 单张图像三维重建任务是一项研究挑战,旨在从单视图图像中预测物体的三维形状。这项任务需要大量数据采集,以预测可见和遮盖的部分。
  • methods: 我们提议使用 continual learning 的方法,并使用 Variational Priors 来设计模型,以便在新类之后仍然可以reasonably重建先前所见的类。Variational Priors 表示抽象形状,并避免忘记,而 saliency maps 保留物体特征,占用较少的内存。
  • results: 经过仔细的实验表明,我们的方法可以与已知方法相比, both quantitatively and qualitatively 显示出竞争力。
    Abstract Single-image 3D reconstruction is a research challenge focused on predicting 3D object shapes from single-view images. This task requires significant data acquisition to predict both visible and occluded portions of the shape. Furthermore, learning-based methods face the difficulty of creating a comprehensive training dataset for all possible classes. To this end, we propose a continual learning-based 3D reconstruction method where our goal is to design a model using Variational Priors that can still reconstruct the previously seen classes reasonably even after training on new classes. Variational Priors represent abstract shapes and combat forgetting, whereas saliency maps preserve object attributes with less memory usage. This is vital due to resource constraints in storing extensive training data. Additionally, we introduce saliency map-based experience replay to capture global and distinct object features. Thorough experiments show competitive results compared to established methods, both quantitatively and qualitatively.
    摘要 单图三维重建是一项研究挑战,旨在根据单个图像预测三维物体形状。这项任务需要大量数据收集,以预测可见和遮挡部分的形状。学习基于方法则面临创建全面训练数据集的挑战,以涵盖所有可能的类型。为此,我们提议一种逐步学习基于Variational Priors的三维重建方法,其目标是在训练新类后,仍能reasonably重建先前所见的类。Variational Priors表示抽象形态,防止忘记,而saliency maps保留物体特征,占用内存更少。这对于资源受限的存储大量训练数据非常重要。此外,我们引入了saliency map基于经验回放,以捕捉全球和特定物体特征。经过广泛的实验,我们的方法与已知方法相比,具有竞争性的Result。

Label Shift Adapter for Test-Time Adaptation under Covariate and Label Shifts

  • paper_url: http://arxiv.org/abs/2308.08810
  • repo_url: None
  • paper_authors: Sunghyun Park, Seunghan Yang, Jaegul Choo, Sungrack Yun
  • for: 这个研究旨在实现批量执行时间适应(Test-time adaptation),将预训模型适应目标领域中的数据分布。
  • methods: 我们提出了一个新的标签迁移适应器,可以与现有的TTA方法结合使用,实现标签迁移的处理。我们估计目标领域的标签分布,然后将其 feed 到标签迁移适应器中,以生成适合目标领域的标签参数。
  • results: 我们通过广泛的实验表明,将我们的策略与TTA方法结合,可以在标签和 covariate 迁移时获得显著的性能提升。
    Abstract Test-time adaptation (TTA) aims to adapt a pre-trained model to the target domain in a batch-by-batch manner during inference. While label distributions often exhibit imbalances in real-world scenarios, most previous TTA approaches typically assume that both source and target domain datasets have balanced label distribution. Due to the fact that certain classes appear more frequently in certain domains (e.g., buildings in cities, trees in forests), it is natural that the label distribution shifts as the domain changes. However, we discover that the majority of existing TTA methods fail to address the coexistence of covariate and label shifts. To tackle this challenge, we propose a novel label shift adapter that can be incorporated into existing TTA approaches to deal with label shifts during the TTA process effectively. Specifically, we estimate the label distribution of the target domain to feed it into the label shift adapter. Subsequently, the label shift adapter produces optimal parameters for the target label distribution. By predicting only the parameters for a part of the pre-trained source model, our approach is computationally efficient and can be easily applied, regardless of the model architectures. Through extensive experiments, we demonstrate that integrating our strategy with TTA approaches leads to substantial performance improvements under the joint presence of label and covariate shifts.
    摘要

Self-distillation Regularized Connectionist Temporal Classification Loss for Text Recognition: A Simple Yet Effective Approach

  • paper_url: http://arxiv.org/abs/2308.08806
  • repo_url: None
  • paper_authors: Ziyin Zhang, Ning Lu, Minghui Liao, Yongshuai Huang, Cheng Li, Min Wang, Wei Peng
  • for: 提高文本识别模型的准确率,不增加Extra参数或训练阶段。
  • methods: 提议使用自适应定制的CTC损失函数(DCTC损失),通过带有帧级别的正则化项来强调个体监督,并通过最大化 posteriori 的潜在对齐问题来解决在涨化中的矛盾问题。
  • results: 对于公共 benchmark 进行了广泛的实验,结果显示,使用 DCTC 损失可以提高文本识别模型的准确率,最高提升达 2.6%,而不增加任何不良影响。
    Abstract Text recognition methods are gaining rapid development. Some advanced techniques, e.g., powerful modules, language models, and un- and semi-supervised learning schemes, consecutively push the performance on public benchmarks forward. However, the problem of how to better optimize a text recognition model from the perspective of loss functions is largely overlooked. CTC-based methods, widely used in practice due to their good balance between performance and inference speed, still grapple with accuracy degradation. This is because CTC loss emphasizes the optimization of the entire sequence target while neglecting to learn individual characters. We propose a self-distillation scheme for CTC-based model to address this issue. It incorporates a framewise regularization term in CTC loss to emphasize individual supervision, and leverages the maximizing-a-posteriori of latent alignment to solve the inconsistency problem that arises in distillation between CTC-based models. We refer to the regularized CTC loss as Distillation Connectionist Temporal Classification (DCTC) loss. DCTC loss is module-free, requiring no extra parameters, longer inference lag, or additional training data or phases. Extensive experiments on public benchmarks demonstrate that DCTC can boost text recognition model accuracy by up to 2.6%, without any of these drawbacks.
    摘要 文本识别方法在快速发展中。一些先进技术,如强大的模块、语言模型和不监督学习方法, consecutively 提高了公共测试底板上的性能。然而,如何更好地优化一个文本识别模型,从损失函数的角度来看,几乎被忽略。基于CTC的方法,由于其在实践中的良好平衡性和推理速度,仍然受到精度下降的困扰。这是因为CTC损失函数强调整合整个序列目标,而忽略学习个体字符。我们提出了一种自适应方案,即Distillation Connectionist Temporal Classification(DCTC)损失函数。DCTC损失函数包含了帧级别正则化项,以强调个体监督,并利用最大 posteriori的潜在对齐问题来解决在液化中的不一致问题。DCTC损失函数是模块化的,无需额外参数、更长的推理时间或额外的训练数据或阶段。广泛的实验表明,DCTC可以提高文本识别模型的精度,最高提高2.6%,而无需这些缺点。

Deep Ear Biometrics for Gender Classification

  • paper_url: http://arxiv.org/abs/2308.08797
  • repo_url: None
  • paper_authors: Ritwiz Singh, Keshav Kashyap, Rajesh Mukherjee, Asish Bera, Mamata Dalui Chakraborty
  • for: 人类性别分类 based on 生物特征, 特别是 Computer Vision 领域的一个重要问题, 因为它有很多应用场景。
  • methods: 我们使用了深度卷积神经网络 (CNN) 模型来自动地分类人类性别, 使用了 EarVN1.0 耳朵数据集进行评估。
  • results: 我们的模型达到了 93% 的准确率。I hope this helps! Let me know if you have any other questions.
    Abstract Human gender classification based on biometric features is a major concern for computer vision due to its vast variety of applications. The human ear is popular among researchers as a soft biometric trait, because it is less affected by age or changing circumstances, and is non-intrusive. In this study, we have developed a deep convolutional neural network (CNN) model for automatic gender classification using the samples of ear images. The performance is evaluated using four cutting-edge pre-trained CNN models. In terms of trainable parameters, the proposed technique requires significantly less computational complexity. The proposed model has achieved 93% accuracy on the EarVN1.0 ear dataset.
    摘要 人类性别分类基于生物特征是计算机视觉领域的主要问题,由于它的广泛应用领域。人耳是研究人员的首选软生物特征之一,因为它对年龄或变化情况的影响相对较少,非侵入式。在本研究中,我们开发了一种深度卷积神经网络(CNN)模型,用于自动性别分类,使用耳架图像样本。我们使用四种最新的预训练CNN模型进行评估性能。与传统模型相比,我们的方法具有更少的计算复杂性。在 EarVN1.0 耳架数据集上,我们的模型达到了 93% 的准确率。

Environment Diversification with Multi-head Neural Network for Invariant Learning

  • paper_url: http://arxiv.org/abs/2308.08778
  • repo_url: https://github.com/joe0123/EDNIL
  • paper_authors: Bo-Wei Huang, Keng-Te Liao, Chang-Sheng Kao, Shou-De Lin
  • for: 这篇论文旨在提出一个不需要先知道环境或强制假设的普遍学习框架,以提高神经网络模型对于分布类型的适应能力。
  • methods: 这个框架包含了一个多头神经网络,用于吸收数据偏见。
  • results: 该框架不需要先知道环境或强制假设,并且可以实现模型对于分布类型的适应。
    Abstract Neural networks are often trained with empirical risk minimization; however, it has been shown that a shift between training and testing distributions can cause unpredictable performance degradation. On this issue, a research direction, invariant learning, has been proposed to extract invariant features insensitive to the distributional changes. This work proposes EDNIL, an invariant learning framework containing a multi-head neural network to absorb data biases. We show that this framework does not require prior knowledge about environments or strong assumptions about the pre-trained model. We also reveal that the proposed algorithm has theoretical connections to recent studies discussing properties of variant and invariant features. Finally, we demonstrate that models trained with EDNIL are empirically more robust against distributional shifts.
    摘要 神经网络经常通过Empirical Risk Minimization(ERM)进行训练;然而,存在训练和测试分布之间的偏移会导致性能下降。为解决这个问题,一种研究方向——不变学习(Invariant Learning)——已经被提出,以抽取不受分布变化影响的特征。本研究提出了EDNIL框架,包括多头神经网络来吸收数据偏好。我们证明了这种框架不需要先知环境或强ASSUME预训练模型。此外,我们还发现了该算法与最近的变异和不变特征研究有理论上的连接。最后,我们通过实验表明,使用EDNIL进行训练的模型在分布变化时的性能更加稳定。

Learning to In-paint: Domain Adaptive Shape Completion for 3D Organ Segmentation

  • paper_url: http://arxiv.org/abs/2308.08775
  • repo_url: None
  • paper_authors: Mingjin Chen, Yongkang He, Yongyi Lu, Zhijing Yang
  • for: 本研究旨在把Shape信息Explicitly incorporated into current 3D organ segmentation models.
  • methods: 我们采用Masked Label Mask Modeling (MLM)方法,通过学习mask token来完成Label mask的组织器。此外,我们还提出了一种新的Shape-aware self-distillation方法,用于在Target上传递MLM shape知识。
  • results: 我们在五个公共organ segmentation dataset上进行了广泛的实验,并得到了至少1.2点的Dice分数提升,证明了我们的方法在难以控制的预测领域中的效果。
    Abstract We aim at incorporating explicit shape information into current 3D organ segmentation models. Different from previous works, we formulate shape learning as an in-painting task, which is named Masked Label Mask Modeling (MLM). Through MLM, learnable mask tokens are fed into transformer blocks to complete the label mask of organ. To transfer MLM shape knowledge to target, we further propose a novel shape-aware self-distillation with both in-painting reconstruction loss and pseudo loss. Extensive experiments on five public organ segmentation datasets show consistent improvements over prior arts with at least 1.2 points gain in the Dice score, demonstrating the effectiveness of our method in challenging unsupervised domain adaptation scenarios including: (1) In-domain organ segmentation; (2) Unseen domain segmentation and (3) Unseen organ segmentation. We hope this work will advance shape analysis and geometric learning in medical imaging.
    摘要 我们目标是将显式形态信息integrated到当前3D器官分割模型中。与之前的工作不同,我们将形态学习转换为一个填充任务,称为掩码标签掩码模型(MLM)。通过MLM,学习的掩码标签被传递到转换块中,以完成器官的标签掩码。为将MLM形态知识传递到目标上,我们进一步提议一种新的形态自适应自我热化,包括填充重建损失和假损失。我们在五个公共器官分割数据集上进行了广泛的实验,并示出了与之前的艺术品相比至少1.2点的Dice分数提升,这说明了我们的方法在不可预测的领域适应场景中的效果,包括:(1)本地器官分割;(2)未看到的频谱分割和(3)未看到的器官分割。我们希望这项工作能够推动医学影像中的形态分析和几何学学习。

URL: Combating Label Noise for Lung Nodule Malignancy Grading

  • paper_url: http://arxiv.org/abs/2308.08772
  • repo_url: https://github.com/axz520/URL
  • paper_authors: Xianze Ai, Zehui Liao, Yong Xia
    for:This paper focuses on the problem of label noise in lung nodule malignancy grading datasets and proposes a new framework called URL to tackle this issue.methods:The proposed URL framework consists of two stages: SCL and MU. SCL uses supervised contrastive learning to learn better representations, while MU generates pseudo-labels and uses temporal ensembling to obtain memory pseudo-labels that supervise the model training.results:Experiments on the LIDC-IDRI dataset show that the proposed URL framework outperforms other competing methods, demonstrating its effectiveness in handling label noise and modeling the ordinal relation among classes.
    Abstract Due to the complexity of annotation and inter-annotator variability, most lung nodule malignancy grading datasets contain label noise, which inevitably degrades the performance and generalizability of models. Although researchers adopt the label-noise-robust methods to handle label noise for lung nodule malignancy grading, they do not consider the inherent ordinal relation among classes of this task. To model the ordinal relation among classes to facilitate tackling label noise in this task, we propose a Unimodal-Regularized Label-noise-tolerant (URL) framework. Our URL contains two stages, the Supervised Contrastive Learning (SCL) stage and the Memory pseudo-labels generation and Unimodal regularization (MU) stage. In the SCL stage, we select reliable samples and adopt supervised contrastive learning to learn better representations. In the MU stage, we split samples with multiple annotations into multiple samples with a single annotation and shuffle them into different batches. To handle label noise, pseudo-labels are generated using the similarity between each sample and the central feature of each class, and temporal ensembling is used to obtain memory pseudo-labels that supervise the model training. To model the ordinal relation, we introduce unimodal regularization to keep the ordinal relation among classes in the predictions. Moreover, each lung nodule is characterized by three orthographic views. Experiments conducted on the LIDC-IDRI dataset indicate the superiority of our URL over other competing methods. Code is available at https://github.com/axz520/UR.
    摘要 Due to the complexity of annotation and inter-annotator variability, most lung nodule malignancy grading datasets contain label noise, which inevitably degrades the performance and generalizability of models. Although researchers adopt label-noise-robust methods to handle label noise for lung nodule malignancy grading, they do not consider the inherent ordinal relation among classes of this task. To model the ordinal relation among classes to facilitate tackling label noise in this task, we propose a Unimodal-Regularized Label-noise-tolerant (URL) framework. Our URL contains two stages, the Supervised Contrastive Learning (SCL) stage and the Memory pseudo-labels generation and Unimodal regularization (MU) stage. In the SCL stage, we select reliable samples and adopt supervised contrastive learning to learn better representations. In the MU stage, we split samples with multiple annotations into multiple samples with a single annotation and shuffle them into different batches. To handle label noise, pseudo-labels are generated using the similarity between each sample and the central feature of each class, and temporal ensembling is used to obtain memory pseudo-labels that supervise the model training. To model the ordinal relation, we introduce unimodal regularization to keep the ordinal relation among classes in the predictions. Moreover, each lung nodule is characterized by three orthographic views. Experiments conducted on the LIDC-IDRI dataset indicate the superiority of our URL over other competing methods. Code is available at https://github.com/axz520/UR.Note: The translation is in Simplified Chinese, which is one of the two standard Chinese languages used in mainland China.

Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes

  • paper_url: http://arxiv.org/abs/2308.08769
  • repo_url: https://github.com/Chat-3D/Chat-3D
  • paper_authors: Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, Zhou Zhao
  • for: 提高3D场景理解的实用性,建立可对多种下游任务进行对话的全局对话系统。
  • methods: 利用预训练的3D表示和高级LLM的推理和对话能力,将3D表示映射到LLM的特征空间中,使LLM能够理解3D世界。
  • results: 实验显示,Chat-3D可以快速理解多种3D场景 instrucions,进行复杂的空间推理,并将外部知识 integrate into its responses。与GPT-4相比,Chat-3D在构建的指令集合上得分75.6%。
    Abstract 3D scene understanding has gained significant attention due to its wide range of applications. However, existing methods for 3D scene understanding are limited to specific downstream tasks, which hinders their practicality in real-world applications. This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs to achieve the first universal dialogue systems for 3D scenes. Specifically, we align 3D representations into the feature space of LLMs, thus enabling LLMs to perceive the 3D world. Given the scarcity of 3D scene-text data, we propose a three-stage training strategy to efficiently utilize the available data for better alignment. To enhance the reasoning ability and develop a user-friendly interaction scheme, we further construct a high-quality object-centric 3D instruction dataset and design an associated object-centric prompt. Our experiments show that Chat-3D achieves an impressive ability to comprehend diverse instructions for 3D scenes, engage in intricate spatial reasoning, and incorporate external knowledge into its responses. Chat-3D achieves a 75.6% relative score compared with GPT-4 on the constructed instruction dataset.
    摘要 三维场景理解已经吸引了广泛的关注,因为它们在各种应用领域中具有广泛的应用前景。然而,现有的三维场景理解方法受到特定下游任务的限制,这限制了它们在实际应用中的实用性。本文介绍了Chat-3D,它通过将预训练的三维表示与高级LLM的强大理解和对话能力相结合,实现了第一个universal对话系统 для三维场景。具体来说,我们将三维表示空间对齐到LLM的特征空间中,因此让LLM能够感受到三维世界。由于三维场景文本数据的罕见性,我们提出了三个阶段的训练策略,以更有效地利用可用的数据进行更好的对齐。为了提高理解能力和设计用户友好的交互方案,我们还制作了高质量的三维对象中心指令集和相关的对象中心提示。我们的实验表明,Chat-3D可以具有卓越的理解多种三维场景指令、进行复杂的空间逻辑和 incorporate external knowledge into its responses。在我们制作的指令集上,Chat-3D achieved a 75.6% relative score compared with GPT-4。

XVTP3D: Cross-view Trajectory Prediction Using Shared 3D Queries for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2308.08764
  • repo_url: None
  • paper_authors: Zijian Song, Huikun Bi, Ruisi Zhang, Tianlu Mao, Zhaoqi Wang
  • for: 这篇论文的目的是提出一种能够预测自动驾驶车辆的路径,并且确保多 vista 的预测结果保持一致性。
  • methods: 本文使用的方法包括使用共享的3D查询(XVTP3D)来生成多个目标,并使用随机遮盾法和粗糙至细的跨视观温探查来捕捉稳定的跨视特征。
  • results: 实验结果显示,XVTP3D 在两个公开available的数据集上 achieved state-of-the-art 性能,并且保持了多 vista 的预测结果一致性。
    Abstract Trajectory prediction with uncertainty is a critical and challenging task for autonomous driving. Nowadays, we can easily access sensor data represented in multiple views. However, cross-view consistency has not been evaluated by the existing models, which might lead to divergences between the multimodal predictions from different views. It is not practical and effective when the network does not comprehend the 3D scene, which could cause the downstream module in a dilemma. Instead, we predicts multimodal trajectories while maintaining cross-view consistency. We presented a cross-view trajectory prediction method using shared 3D Queries (XVTP3D). We employ a set of 3D queries shared across views to generate multi-goals that are cross-view consistent. We also proposed a random mask method and coarse-to-fine cross-attention to capture robust cross-view features. As far as we know, this is the first work that introduces the outstanding top-down paradigm in BEV detection field to a trajectory prediction problem. The results of experiments on two publicly available datasets show that XVTP3D achieved state-of-the-art performance with consistent cross-view predictions.
    摘要 几种感知资料的融合是自动驾驶中的决定性和挑战性任务。现在,我们可以轻松地存取多种检测数据。然而,不同检测视图之间的一致性尚未被现有的模型评估,这可能导致不同检测视图的多模式预测偏离。这不实际又无效当网络不理解3D场景,这可能导致下游模组受到困难。因此,我们预测多种检测路径,并维护不同检测视图之间的一致性。我们使用共享3D查询(XVTP3D)来生成跨观察方向的多个目标,并提出了随机填充方法和粗糙至细的标注捕捉强健的跨观察特征。我们相信这是首次在BEV检测领域中将顶部下降方式引入到路径预测问题中。实验结果显示,XVTP3D在两个公开可用的数据集上实现了状态顶对状态的表现。

Fine-grained Text and Image Guided Point Cloud Completion with CLIP Model

  • paper_url: http://arxiv.org/abs/2308.08754
  • repo_url: None
  • paper_authors: Wei Song, Jun Zhou, Mingjie Wang, Hongchen Tan, Nannan Li, Xiuping Liu
  • for: This paper focuses on the task of point cloud completion guided by multimodal information, with the goal of improving the generalization ability and fine-grained semantic information of the model.
  • methods: The proposed method uses a multimodal fusion network that fuses visual and textual information to predict the semantic and geometric characteristics of incomplete shapes. The network employs a pre-trained vision-language model and a multi-stage feature fusion strategy to fuse the textual and visual features.
  • results: The proposed method achieves superior performance compared to state-of-the-art point cloud completion networks, as demonstrated through extensive quantitative and qualitative experiments. The use of fine-grained text descriptions provides richer geometric details for 3D shapes, further improving the accuracy of the completion.
    Abstract This paper focuses on the recently popular task of point cloud completion guided by multimodal information. Although existing methods have achieved excellent performance by fusing auxiliary images, there are still some deficiencies, including the poor generalization ability of the model and insufficient fine-grained semantic information for extracted features. In this work, we propose a novel multimodal fusion network for point cloud completion, which can simultaneously fuse visual and textual information to predict the semantic and geometric characteristics of incomplete shapes effectively. Specifically, to overcome the lack of prior information caused by the small-scale dataset, we employ a pre-trained vision-language model that is trained with a large amount of image-text pairs. Therefore, the textual and visual encoders of this large-scale model have stronger generalization ability. Then, we propose a multi-stage feature fusion strategy to fuse the textual and visual features into the backbone network progressively. Meanwhile, to further explore the effectiveness of fine-grained text descriptions for point cloud completion, we also build a text corpus with fine-grained descriptions, which can provide richer geometric details for 3D shapes. The rich text descriptions can be used for training and evaluating our network. Extensive quantitative and qualitative experiments demonstrate the superior performance of our method compared to state-of-the-art point cloud completion networks.
    摘要 Translated into Simplified Chinese:这篇论文关注最近受欢迎的点云补充任务,带有多 modal 信息导航。虽然现有方法已经通过融合 auxillary 图像实现出色的性能,但还有一些缺陷,包括模型的泛化能力不够和缺乏细化Semantic 信息。在这个工作中,我们提出了一种新的多 modal 融合网络,可以同时融合视觉和文本信息,以预测受限shape的Semantic 和 геометрические特征。Specifically,为了缓解由小规模数据集所带来的缺乏先验信息,我们采用了一个预训练的视觉语言模型,该模型在大量的图像文本对中进行了训练。因此,视觉和文本Encoder 这两个模块具有更强的泛化能力。然后,我们提出了一种多stage 特征融合策略,以逐步融合视觉和文本特征到网络的后部。同时,为了更好地探索文本描述的细化效果,我们还建立了一个细化文本库,该库包含了更细化的描述,可以为3D 形状提供更 ric 的几何细节。这些细化的文本描述可以用于训练和评估我们的网络。广泛的量化和质量测试表明,我们的方法在现有点云补充网络中具有更高的性能。

BOTT: Box Only Transformer Tracker for 3D Object Tracking

  • paper_url: http://arxiv.org/abs/2308.08753
  • repo_url: None
  • paper_authors: Lubing Zhou, Xiaoli Meng, Yiluan Guo, Jiong Yang
  • for: 三元素 объек Tracking是自主驾驶中的重要任务,现有的 kalman滤波器基于方法仍然是最受欢迎的解决方案,但这些方法需要手工设计的运动模型,无法利用增长的数据量。
  • methods: 本文提出了盒子只 transformer跟踪器(BOTT),该方法通过将所有的3D盒在一个时间窗口中作为输入,使用 transformer自我注意力来交换所有盒子之间的信息,从而学习全局有用的盒子嵌入。
  • results: 实验显示,BOTT在 nuScenes 验证和测试分区上得到了69.9和66.7 AMOTA的竞争性性能,在 Waymo Open Dataset 验证和测试分区上得到了56.45和59.57 MOTA L2 的竞争性性能。这些结果表明,通过直接从3D盒子中学习特征使用 transformers 是一种简单 yet 有效的方法。
    Abstract Tracking 3D objects is an important task in autonomous driving. Classical Kalman Filtering based methods are still the most popular solutions. However, these methods require handcrafted designs in motion modeling and can not benefit from the growing data amounts. In this paper, Box Only Transformer Tracker (BOTT) is proposed to learn to link 3D boxes of the same object from the different frames, by taking all the 3D boxes in a time window as input. Specifically, transformer self-attention is applied to exchange information between all the boxes to learn global-informative box embeddings. The similarity between these learned embeddings can be used to link the boxes of the same object. BOTT can be used for both online and offline tracking modes seamlessly. Its simplicity enables us to significantly reduce engineering efforts required by traditional Kalman Filtering based methods. Experiments show BOTT achieves competitive performance on two largest 3D MOT benchmarks: 69.9 and 66.7 AMOTA on nuScenes validation and test splits, respectively, 56.45 and 59.57 MOTA L2 on Waymo Open Dataset validation and test splits, respectively. This work suggests that tracking 3D objects by learning features directly from 3D boxes using transformers is a simple yet effective way.
    摘要 <>请将以下文本翻译成简化中文:Tracking 3D对象是自主驾驶中非常重要的任务。经典的Kalman滤波法仍然是最受欢迎的解决方案。然而,这些方法需要手工设计的运动模型,并且无法利用增长的数据量。在这篇论文中,我们提出了一种名为Box Only Transformer Tracker(BOTT)的方法,可以学习将不同帧中的3D盒子链接起来,并且可以在线和离线跟踪模式之间切换。具体来说,我们使用transformer自注意力来交换所有帧中的3D盒子信息,以学习全局有用的盒子嵌入。这些学习的嵌入之间的相似性可以用来链接同一个对象的盒子。BOTT可以在线和离线跟踪模式之间切换,并且其简单性使得可以减少传统Kalman滤波法基于的工程劳动量。实验显示BOTT在nuScenes验证和测试分区上得到了69.9和66.7 AMOTA的竞争性性能,以及Waymo开放数据集验证和测试分区上得到了56.45和59.57 MOTA L2的竞争性性能。这种工作表明了通过直接从3D盒子中学习特征来跟踪3D对象是一种简单 yet有效的方法。<>Here's the translation: Tracking 3D objects is an important task in autonomous driving. Classical Kalman Filtering based methods are still the most popular solutions, but they require handcrafted designs in motion modeling and cannot benefit from the growing data amounts. In this paper, we propose a method called Box Only Transformer Tracker (BOTT) that learns to link 3D boxes of the same object from different frames by taking all the 3D boxes in a time window as input. Specifically, we use transformer self-attention to exchange information between all the boxes to learn global-informative box embeddings. The similarity between these learned embeddings can be used to link the boxes of the same object. BOTT can seamlessly switch between online and offline tracking modes, and its simplicity reduces the engineering efforts required by traditional Kalman Filtering based methods. Experimental results show that BOTT achieves competitive performance on the two largest 3D MOT benchmarks: 69.9 and 66.7 AMOTA on nuScenes validation and test splits, respectively, and 56.45 and 59.57 MOTA L2 on Waymo Open Dataset validation and test splits, respectively. This work suggests that tracking 3D objects by learning features directly from 3D boxes using transformers is a simple yet effective way.

MIPS-Fusion: Multi-Implicit-Submaps for Scalable and Robust Online Neural RGB-D Reconstruction

  • paper_url: http://arxiv.org/abs/2308.08741
  • repo_url: None
  • paper_authors: Yijie Tang, Jiazhao Zhang, Zhinan Yu, He Wang, Kai Xu
  • for: 这个论文主要目的是提出一种基于神经网络的在线RGB-D重建方法,以实现大规模Scene的高质量重建。
  • methods: 该方法使用了一种新的神经网络表示方法——多重神经映射(Multi-Implicit-Submap,MIS),并采用分治设计来解决存储特征网格的问题。在该方法中,神经子地图在扫描轨迹中逐渐分配并高效地学习本地神经簇调整。
  • results: 对比现有的神经RGB-D重建方法,该方法可以实现更高的重建质量,特别是在大规模Scene和快速摄像机运动情况下。
    Abstract We introduce MIPS-Fusion, a robust and scalable online RGB-D reconstruction method based on a novel neural implicit representation -- multi-implicit-submap. Different from existing neural RGB-D reconstruction methods lacking either flexibility with a single neural map or scalability due to extra storage of feature grids, we propose a pure neural representation tackling both difficulties with a divide-and-conquer design. In our method, neural submaps are incrementally allocated alongside the scanning trajectory and efficiently learned with local neural bundle adjustments. The submaps can be refined individually in a back-end optimization and optimized jointly to realize submap-level loop closure. Meanwhile, we propose a hybrid tracking approach combining randomized and gradient-based pose optimizations. For the first time, randomized optimization is made possible in neural tracking with several key designs to the learning process, enabling efficient and robust tracking even under fast camera motions. The extensive evaluation demonstrates that our method attains higher reconstruction quality than the state of the arts for large-scale scenes and under fast camera motions.
    摘要 我团队介绍MIPS-Fusion,一种可靠和扩展的在线RGB-D重建方法,基于一种新的神经凝聚表示—多神经凝聚映射(Multi-Implicit-Submap,MIS)。与现有的神经RGB-D重建方法不同,我们的方法缺乏一个灵活的单个神经地图或可扩展性,我们提议一种纯神经表示,同时解决了这两个问题。在我们的方法中,神经子地图在扫描轨迹中逐渐分配并高效地学习地ocal神经簇更正。子地图可以在后续优化中被精细地修正,并在各个子地图水平实现子地图级循环关闭。此外,我们提出了一种混合Tracking方法,结合随机和梯度基于的pose优化。这是神经跟踪中首次实现随机优化的,通过一些关键的设计,使得神经跟踪可以快速和稳定地跟踪,即使相机速度快。我们的评估结果表明,我们的方法在大规模场景下和快相机速度下都可以 дости到更高的重建质量。

Recursive Detection and Analysis of Nanoparticles in Scanning Electron Microscopy Images

  • paper_url: http://arxiv.org/abs/2308.08732
  • repo_url: None
  • paper_authors: Aidan S. Wright, Nathaniel P. Youmans, Enrique F. Valderrama Araya
  • for: 这个研究旨在开发一个基于Python的计算框架,用于精确地检测和全面分析SEM图像中的粒子。
  • methods: 这个框架使用了多种技术,包括阈值设定、扩展和膨润,以提高图像处理结果的准确性。
  • results: 研究人员通过使用这个框架,在五个不同的测试图像中达到97%的粒子检测精度,并能够识别出强度较弱的粒子。
    Abstract In this study, we present a computational framework tailored for the precise detection and comprehensive analysis of nanoparticles within scanning electron microscopy (SEM) images. The primary objective of this framework revolves around the accurate localization of nanoparticle coordinates, accompanied by secondary objectives encompassing the extraction of pertinent morphological attributes including area, orientation, brightness, and length. Constructed leveraging the robust image processing capabilities of Python, particularly harnessing libraries such as OpenCV, SciPy, and Scikit-Image, the framework employs an amalgamation of techniques, including thresholding, dilating, and eroding, to enhance the fidelity of image processing outcomes. The ensuing nanoparticle data is seamlessly integrated into the RStudio environment to facilitate meticulous post-processing analysis. This encompasses a comprehensive evaluation of model accuracy, discernment of feature distribution patterns, and the identification of intricate particle arrangements. The finalized framework exhibits high nanoparticle identification within the primary sample image and boasts 97\% accuracy in detecting particles across five distinct test images drawn from a SEM nanoparticle dataset. Furthermore, the framework demonstrates the capability to discern nanoparticles of faint intensity, eluding manual labeling within the control group.
    摘要 在本研究中,我们提出了一种基于计算机的方法,用于准确检测和全面分析顺序电镜图像中的粒子。主要目标是准确地确定粒子坐标,并且包括次要目标,如粒子形态属性的抽取,包括面积、方向、亮度和长度。这个框架利用Python语言的强大图像处理能力,特别是使用OpenCV、SciPy和Scikit-Image库,并采用了多种技术,如阈值处理、膨润和磨灭,以提高图像处理结果的准确性。得到的粒子数据可以轻松地 интеグрироваться到RStudio环境中,以便仔细进行后处理分析。这包括完整评估模型准确度,分析特征分布模式,以及描述复杂的粒子排列。最终的框架在主要样本图像中具有高精度的粒子识别能力,并在五个不同的测试图像中达到97%的检测粒子精度。此外,该框架还能够识别强度较弱的粒子,这些粒子在控制组中逃避人工标注。

Learning Through Guidance: Knowledge Distillation for Endoscopic Image Classification

  • paper_url: http://arxiv.org/abs/2308.08731
  • repo_url: None
  • paper_authors: Harshala Gammulle, Yubo Chen, Sridha Sridharan, Travis Klein, Clinton Fookes
    for:The paper is written to improve the accuracy and efficiency of GI tract disease diagnosis using deep learning methods, specifically Convolutional Neural Networks (CNNs).methods:The paper proposes a novel multi-head attention-based feature fusion mechanism to support relation-based learning, and investigates three KD-based learning frameworks: response-based, feature-based, and relation-based.results:The proposed relation-based framework achieves improved lightweight model performance (only 51.8k trainable parameters) on two widely used public datasets, KVASIR-V2 and Hyper-KVASIR, signifying the merits of the proposed method in achieving accurate and efficient disease diagnosis in resource-limited medical clinics.
    Abstract Endoscopy plays a major role in identifying any underlying abnormalities within the gastrointestinal (GI) tract. There are multiple GI tract diseases that are life-threatening, such as precancerous lesions and other intestinal cancers. In the usual process, a diagnosis is made by a medical expert which can be prone to human errors and the accuracy of the test is also entirely dependent on the expert's level of experience. Deep learning, specifically Convolution Neural Networks (CNNs) which are designed to perform automatic feature learning without any prior feature engineering, has recently reported great benefits for GI endoscopy image analysis. Previous research has developed models that focus only on improving performance, as such, the majority of introduced models contain complex deep network architectures with a large number of parameters that require longer training times. However, there is a lack of focus on developing lightweight models which can run in low-resource environments, which are typically encountered in medical clinics. We investigate three KD-based learning frameworks, response-based, feature-based, and relation-based mechanisms, and introduce a novel multi-head attention-based feature fusion mechanism to support relation-based learning. Compared to the existing relation-based methods that follow simplistic aggregation techniques of multi-teacher response/feature-based knowledge, we adopt the multi-head attention technique to provide flexibility towards localising and transferring important details from each teacher to better guide the student. We perform extensive evaluations on two widely used public datasets, KVASIR-V2 and Hyper-KVASIR, and our experimental results signify the merits of our proposed relation-based framework in achieving an improved lightweight model (only 51.8k trainable parameters) that can run in a resource-limited environment.
    摘要 endoscopic examination plays a crucial role in identifying potential abnormalities within the gastrointestinal (GI) tract. there are numerous GI tract diseases that are life-threatening, such as precancerous lesions and other intestinal cancers. in the conventional process, a diagnosis is made by a medical expert, which can be prone to human errors and the accuracy of the test is entirely dependent on the expert's level of experience. deep learning, specifically convolutional neural networks (CNNs), has recently shown great benefits for GI endoscopy image analysis. previous research has developed models that focus solely on improving performance, resulting in complex deep network architectures with a large number of parameters that require longer training times. however, there is a lack of focus on developing lightweight models that can run in low-resource environments, typically encountered in medical clinics. we investigate three KD-based learning frameworks, response-based, feature-based, and relation-based mechanisms, and introduce a novel multi-head attention-based feature fusion mechanism to support relation-based learning. compared to existing relation-based methods that use simplistic aggregation techniques of multi-teacher response/feature-based knowledge, we adopt the multi-head attention technique to provide flexibility towards localizing and transferring important details from each teacher to better guide the student. we perform extensive evaluations on two widely used public datasets, KVASIR-V2 and Hyper-KVASIR, and our experimental results demonstrate the advantages of our proposed relation-based framework in achieving an improved lightweight model (only 51.8k trainable parameters) that can run in a resource-limited environment.

Learning A Coarse-to-Fine Diffusion Transformer for Image Restoration

  • paper_url: http://arxiv.org/abs/2308.08730
  • repo_url: https://github.com/wlydlut/c2f-dft
  • paper_authors: Liyan Wang, Qinyu Yang, Cong Wang, Wei Wang, Jinshan Pan, Zhixun Su
  • for: 这个论文是为了提出一种基于填充 transformer 的图像修复方法,以解决 diffusion-based 方法在图像修复任务中可能因为不准确的噪声估计而未能获得出色的结果。
  • methods: 该方法使用了填充 transformer,包括填充自注意力(DFSA)和填充Feedforward网络(DFN),并在一种新的粗细顺序训练方案中使用。
  • results: 对于3个任务(抽掉雨、消除震荡和实际噪声),该方法在与 IR-SDE 比较的情况下显著地超越了 diffusion-based 修复方法,并与Transformer-based状态流方法在性能上具有竞争力。
    Abstract Recent years have witnessed the remarkable performance of diffusion models in various vision tasks. However, for image restoration that aims to recover clear images with sharper details from given degraded observations, diffusion-based methods may fail to recover promising results due to inaccurate noise estimation. Moreover, simple constraining noises cannot effectively learn complex degradation information, which subsequently hinders the model capacity. To solve the above problems, we propose a coarse-to-fine diffusion Transformer (C2F-DFT) for image restoration. Specifically, our C2F-DFT contains diffusion self-attention (DFSA) and diffusion feed-forward network (DFN) within a new coarse-to-fine training scheme. The DFSA and DFN respectively capture the long-range diffusion dependencies and learn hierarchy diffusion representation to facilitate better restoration. In the coarse training stage, our C2F-DFT estimates noises and then generates the final clean image by a sampling algorithm. To further improve the restoration quality, we propose a simple yet effective fine training scheme. It first exploits the coarse-trained diffusion model with fixed steps to generate restoration results, which then would be constrained with corresponding ground-truth ones to optimize the models to remedy the unsatisfactory results affected by inaccurate noise estimation. Extensive experiments show that C2F-DFT significantly outperforms diffusion-based restoration method IR-SDE and achieves competitive performance compared with Transformer-based state-of-the-art methods on $3$ tasks, including deraining, deblurring, and real denoising. The code is available at https://github.com/wlydlut/C2F-DFT.
    摘要 近年来,扩散模型在视觉任务中表现出了remarkable的表现。然而,为了recover清晰图像,扩散模型可能因为不准确的噪声估计而无法获得出色的结果。此外,简单的约束噪声无法有效地学习复杂的噪声信息,这会限制模型的容量。为解决这些问题,我们提出了一种归一化扩散变换器(C2F-DFT) для图像修复。具体来说,我们的C2F-DFT包括扩散自注意(DFSA)和扩散径向网络(DFN),并在一种新的归一化训练机制中进行了整合。DFSA和DFN分别捕捉了扩散的长距离相关性和层次扩散表示,以便更好地修复图像。在粗糙训练阶段,我们的C2F-DFT估计噪声,并通过抽样算法生成最终的清晰图像。为了进一步提高修复质量,我们提出了一种简单 yet有效的细化训练机制。它首先利用粗糙训练过的扩散模型,并将其与固定步长进行多次扩散,然后将其与对应的真实图像进行约束,以便使模型更好地修复受到噪声估计的影响的不满result。广泛的实验表明,C2F-DFT在3个任务上(包括雨几摄、锐化和真实噪声)significantly exceeds扩散基于SDE的修复方法,并与基于Transformer的state-of-the-art方法相当。代码可以在https://github.com/wlydlut/C2F-DFT中找到。

Long-Range Grouping Transformer for Multi-View 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2308.08724
  • repo_url: https://github.com/liyingcv/long-range-grouping-transformer
  • paper_authors: Liying Yang, Zhenwei Zhu, Xuxin Lin, Jian Nong, Yanyan Liang
  • for: 本研究旨在提高多视图3D重建 task 中 transformer 网络的性能,特别是对于自注意处理大量视图输入的困难性。
  • methods: 我们提出了一种基于 divide-and-conquer 原理的长范围集群注意力(LGA)机制,以便在不同视图之间进行注意力操作。此外,我们还设计了一种高效的编码器,可以连接不同视图之间的间距特征,以及一种进步的高分辨率增幅嵌入器 для voxel 生成。
  • results: 我们的方法在ShapeNet 数据集上实现了state-of-the-art 精度水平,证明了我们的方法在多视图3D重建 task 中的效果。
    Abstract Nowadays, transformer networks have demonstrated superior performance in many computer vision tasks. In a multi-view 3D reconstruction algorithm following this paradigm, self-attention processing has to deal with intricate image tokens including massive information when facing heavy amounts of view input. The curse of information content leads to the extreme difficulty of model learning. To alleviate this problem, recent methods compress the token number representing each view or discard the attention operations between the tokens from different views. Obviously, they give a negative impact on performance. Therefore, we propose long-range grouping attention (LGA) based on the divide-and-conquer principle. Tokens from all views are grouped for separate attention operations. The tokens in each group are sampled from all views and can provide macro representation for the resided view. The richness of feature learning is guaranteed by the diversity among different groups. An effective and efficient encoder can be established which connects inter-view features using LGA and extract intra-view features using the standard self-attention layer. Moreover, a novel progressive upsampling decoder is also designed for voxel generation with relatively high resolution. Hinging on the above, we construct a powerful transformer-based network, called LRGT. Experimental results on ShapeNet verify our method achieves SOTA accuracy in multi-view reconstruction. Code will be available at https://github.com/LiyingCV/Long-Range-Grouping-Transformer.
    摘要 现在,变换器网络在计算机视觉任务中表现出了非常出色的表现。在这种多视图3D重建算法中,变换器处理器需要处理具有庞大信息量的复杂图像token。词汇内容咒语导致模型学习非常困难。为解决这问题,现有的方法通过压缩每个视图的token数量或者抛弃不同视图之间的注意操作来缓解问题。然而,这些方法会对性能产生负面影响。因此,我们提出了长范围群组注意(LGA),基于分治原则。所有视图的token都被分组进行独立的注意操作。每个组中的token来自所有视图,可以为残存视图提供macro表示。各个组之间的多样性保证了特征学习的 ricness。通过LGA和标准自注意层连接,我们可以建立高效的encoder,并且可以提取高精度的voxel生成。此外,我们还设计了一种进步式upsampling解码器,用于生成相对高分辨率的voxel。基于以上,我们构建了一个强大的变换器基于网络,称为LRGT。实验结果表明,我们的方法在ShapeNet上达到了最佳精度的多视图重建。代码将于https://github.com/LiyingCV/Long-Range-Grouping-Transformer上提供。

Dynamic Kernel-Based Adaptive Spatial Aggregation for Learned Image Compression

  • paper_url: http://arxiv.org/abs/2308.08723
  • repo_url: https://github.com/Huairui/DKIC
  • paper_authors: Huairui Wang, Nianxiang Fu, Zhenzhong Chen, Shan Liu
  • for: 提高图像压缩率和精度性能
  • methods: 使用动态kernel基于变换编码、共享权重机制和自适应累积、改进 entropymodel
  • results: 在三个标准测试集上比前state-of-the-art学习基于方法获得更高的率压缩率和精度性能
    Abstract Learned image compression methods have shown superior rate-distortion performance and remarkable potential compared to traditional compression methods. Most existing learned approaches use stacked convolution or window-based self-attention for transform coding, which aggregate spatial information in a fixed range. In this paper, we focus on extending spatial aggregation capability and propose a dynamic kernel-based transform coding. The proposed adaptive aggregation generates kernel offsets to capture valid information in the content-conditioned range to help transform. With the adaptive aggregation strategy and the sharing weights mechanism, our method can achieve promising transform capability with acceptable model complexity. Besides, according to the recent progress of entropy model, we define a generalized coarse-to-fine entropy model, considering the coarse global context, the channel-wise, and the spatial context. Based on it, we introduce dynamic kernel in hyper-prior to generate more expressive global context. Furthermore, we propose an asymmetric spatial-channel entropy model according to the investigation of the spatial characteristics of the grouped latents. The asymmetric entropy model aims to reduce statistical redundancy while maintaining coding efficiency. Experimental results demonstrate that our method achieves superior rate-distortion performance on three benchmarks compared to the state-of-the-art learning-based methods.

RFD-ECNet: Extreme Underwater Image Compression with Reference to Feature Dictionary

  • paper_url: http://arxiv.org/abs/2308.08721
  • repo_url: https://github.com/lilala0/rfd-ecnet
  • paper_authors: Mengyao Li, Liquan Shen, Peng Ye, Guorui Feng, Zheyin Wang
  • for: Enabling thriving underwater applications by transmitting underwater images (UWIs) over very narrow underwater bandwidth via extreme compression.
  • methods: First constructs an underwater multi-scale feature dictionary that provides coarse-to-fine reference features for UWI compression. An extreme UWI compression network (RFD-ECNet) then uses feature match and reference feature variants to remove redundancy among UWIs: an underwater style normalized block (USNB) exploits physical priors from the underwater physical imaging model to normalize the underwater styles of dictionary features toward the input, and a reference feature variant module (RFVM) adaptively morphs the reference features to improve their similarity to the input features.
  • results: On four UWI datasets, RFD-ECNet achieves a significant 31% BD-rate saving over the state-of-the-art VVC.
    Abstract Thriving underwater applications demand efficient extreme compression technology to realize the transmission of underwater images (UWIs) in very narrow underwater bandwidth. However, existing image compression methods achieve inferior performance on UWIs because they do not consider the characteristics of UWIs: (1) Multifarious underwater styles of color shift and distance-dependent clarity, caused by the unique underwater physical imaging; (2) Massive redundancy between different UWIs, caused by the fact that different UWIs contain several common ocean objects, which have plenty of similarities in structures and semantics. To remove redundancy among UWIs, we first construct an exhaustive underwater multi-scale feature dictionary to provide coarse-to-fine reference features for UWI compression. Subsequently, an extreme UWI compression network with reference to the feature dictionary (RFD-ECNet) is creatively proposed, which utilizes feature match and reference feature variant to significantly remove redundancy among UWIs. To align the multifarious underwater styles and improve the accuracy of feature match, an underwater style normalized block (USNB) is proposed, which utilizes underwater physical priors extracted from the underwater physical imaging model to normalize the underwater styles of dictionary features toward the input. Moreover, a reference feature variant module (RFVM) is designed to adaptively morph the reference features, improving the similarity between the reference and input features. Experimental results on four UWI datasets show that our RFD-ECNet is the first work that achieves a significant BD-rate saving of 31% over the most advanced VVC.

V-FUSE: Volumetric Depth Map Fusion with Long-Range Constraints

  • paper_url: http://arxiv.org/abs/2308.08715
  • repo_url: https://github.com/nburgdorfer/vfuse
  • paper_authors: Nathaniel Burgdorfer, Philippos Mordohai
  • for: Improving the accuracy of depth and confidence maps generated by Multi-View Stereo (MVS) algorithms.
  • methods: Integrates volumetric visibility constraints into an end-to-end trainable architecture, together with a jointly trained depth search window estimation sub-network.
  • results: Extensive experiments on MVS datasets show substantial improvements in the accuracy of the output fused depth and confidence maps.
    Abstract We introduce a learning-based depth map fusion framework that accepts a set of depth and confidence maps generated by a Multi-View Stereo (MVS) algorithm as input and improves them. This is accomplished by integrating volumetric visibility constraints that encode long-range surface relationships across different views into an end-to-end trainable architecture. We also introduce a depth search window estimation sub-network trained jointly with the larger fusion sub-network to reduce the depth hypothesis search space along each ray. Our method learns to model depth consensus and violations of visibility constraints directly from the data; effectively removing the necessity of fine-tuning fusion parameters. Extensive experiments on MVS datasets show substantial improvements in the accuracy of the output fused depth and confidence maps.

SkinDistilViT: Lightweight Vision Transformer for Skin Lesion Classification

  • paper_url: http://arxiv.org/abs/2308.08669
  • repo_url: https://github.com/Longman-Stan/SkinDistilVit
  • paper_authors: Vlad-Constantin Lungu-Stan, Dumitru-Clementin Cercel, Florin Pop
  • for: A production-oriented solution to skin cancer classification that matches human performance in melanoma identification.
  • methods: Trains a vision transformer on melanoma medical images annotated by experts, then applies knowledge distillation (including a cascading, per-level variant) to obtain lightweight models.
  • results: The distilled model retains 98.33% of the teacher's balanced multi-class accuracy while being 49.60% smaller, 69.25% faster on GPU, and 97.96% faster on CPU; cascading distillation further improves the base model's balanced multi-class accuracy by 2.1%.
    Abstract Skin cancer is a treatable disease if discovered early. We provide a production-specific solution to the skin cancer classification problem that matches human performance in melanoma identification by training a vision transformer on melanoma medical images annotated by experts. Since inference cost, both time and memory wise is important in practice, we employ knowledge distillation to obtain a model that retains 98.33% of the teacher's balanced multi-class accuracy, at a fraction of the cost. Memory-wise, our model is 49.60% smaller than the teacher. Time-wise, our solution is 69.25% faster on GPU and 97.96% faster on CPU. By adding classification heads at each level of the transformer and employing a cascading distillation process, we improve the balanced multi-class accuracy of the base model by 2.1%, while creating a range of models of various sizes but comparable performance. We provide the code at https://github.com/Longman-Stan/SkinDistilVit.
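The distillation recipe follows the standard soft-target pattern. The sketch below shows a generic logit-distillation loss, assuming the paper's cascading variant attaches one such head per transformer level; the class count and hyperparameters are illustrative, not the authors' settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Standard logit distillation: a soft KL term against the teacher
    plus hard cross-entropy against expert labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy call: batch of 4, 7 lesion classes (hypothetical sizes).
s, t = torch.randn(4, 7), torch.randn(4, 7)
print(distillation_loss(s, t, torch.randint(0, 7, (4,))).item())
```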

A New Data-Driven Method to Identify Violent Facial Expression

  • paper_url: http://arxiv.org/abs/2308.08658
  • repo_url: https://github.com/arindampaulripon/A-Novel-Method-for-Machine-Learning-Based-Automatic-Crime-Activity-Identification-System-by-Analyzin
  • paper_authors: Arindam Kumar Paul, Md Maruf Hasan, Md. Delwar Hosen
  • for: Developing an automated system that identifies violent or criminal intent from facial expressions, to help prevent crime.
  • methods: Uses a Convolutional Neural Network as an automated feature selector that captures the relevant facial expression features.
  • results: The system identifies pre-crime facial expression patterns more accurately while needing only a small amount of facial data, drawn from a specific geographic region, for training.
    Abstract Human facial expressions play an important role in identifying human actions and intentions. Facial expressions can represent specific actions, and the pattern of violent behavior strongly depends on the geographic region. We designed an automated system based on a Convolutional Neural Network that can detect whether a person intends to commit a crime. The proposed method identifies criminal intent or violent behavior before a crime is executed more efficiently, using very little facial expression data collected prior to the crime or violent act. Instead of hand-crafted image features, which are time-consuming and error-prone, we use a Convolutional Neural Network as an automated feature selector that captures the exact facial expressions for training and then predicts the target facial expressions more accurately. We used only facial data from a specific geographic region, which can represent the violent and pre-crime facial patterns of the people of that whole region.

Flickr Africa: Examining Geo-Diversity in Large-Scale, Human-Centric Visual Data

  • paper_url: http://arxiv.org/abs/2308.08656
  • repo_url: None
  • paper_authors: Keziah Naggita, Julienne LaChance, Alice Xiang
  • for: Studying how biases in large-scale image datasets affect computer vision model performance as a function of geographic context.
  • methods: Analyzes human-centric image geo-diversity at scale using geotagged Flickr images associated with each African nation, with comparisons to population-matched European nations, distribution analyses against fine-grained intra-national wealth estimates, and temporal analyses at two-year intervals to expose emerging data trends.
  • results: Image data from Africa is scarce and largely taken by non-local photographers (an "othering" phenomenon); further work is required to capture representative image data and improve the global applicability of computer vision models.
    Abstract Biases in large-scale image datasets are known to influence the performance of computer vision models as a function of geographic context. To investigate the limitations of standard Internet data collection methods in low- and middle-income countries, we analyze human-centric image geo-diversity on a massive scale using geotagged Flickr images associated with each nation in Africa. We report the quantity and content of available data with comparisons to population-matched nations in Europe as well as the distribution of data according to fine-grained intra-national wealth estimates. Temporal analyses are performed at two-year intervals to expose emerging data trends. Furthermore, we present findings for an ``othering'' phenomenon as evidenced by a substantial number of images from Africa being taken by non-local photographers. The results of our study suggest that further work is required to capture image data representative of African people and their environments and, ultimately, to improve the applicability of computer vision models in a global context.

Fair GANs through model rebalancing with synthetic data

  • paper_url: http://arxiv.org/abs/2308.08638
  • repo_url: None
  • paper_authors: Anubhav Jain, Nasir Memon, Julian Togelius
  • for: Mitigating bias and improving fairness in deep generative models.
  • methods: Generates balanced data from an existing unbalanced generative model via latent space exploration, uses this data to train a balanced generative model, and proposes a bias-mitigation loss function that improves fairness metrics even when training on unbalanced datasets.
  • results: For racial fairness with Stylegan2 on FFHQ, the approach improves the fairness metric by almost 5 times while maintaining image quality, and is further validated on an imbalanced Cifar-10 dataset; the paper also argues that traditional image quality metrics such as the Frechet inception distance (FID) are unsuitable for bias mitigation problems.
    Abstract Deep generative models require large amounts of training data. This often poses a problem as the collection of datasets can be expensive and difficult, in particular datasets that are representative of the appropriate underlying distribution (e.g. demographic). This introduces biases in datasets which are further propagated in the models. We present an approach to mitigate biases in an existing generative adversarial network by rebalancing the model distribution. We do so by generating balanced data from an existing unbalanced deep generative model using latent space exploration and using this data to train a balanced generative model. Further, we propose a bias mitigation loss function that shows improvements in the fairness metric even when trained with unbalanced datasets. We show results for the Stylegan2 models while training on the FFHQ dataset for racial fairness and see that the proposed approach improves on the fairness metric by almost 5 times, whilst maintaining image quality. We further validate our approach by applying it to an imbalanced Cifar-10 dataset. Lastly, we argue that the traditionally used image quality metrics such as Frechet inception distance (FID) are unsuitable for bias mitigation problems.
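One loose way to picture the rebalancing step is as labeled rejection sampling in latent space. The sketch below illustrates only that intuition and is not the paper's exploration method; `generator` and `attribute_clf` are assumed callables (e.g., a pretrained GAN and an attribute classifier), and all names are hypothetical.

```python
import torch

@torch.no_grad()
def sample_balanced(generator, attribute_clf, n_per_group: int,
                    num_groups: int, z_dim: int = 512, batch: int = 64):
    """Sketch: sample latents, label each synthetic image with an auxiliary
    attribute classifier, and keep images until every group is filled."""
    buckets = {g: [] for g in range(num_groups)}
    while any(len(v) < n_per_group for v in buckets.values()):
        z = torch.randn(batch, z_dim)
        imgs = generator(z)
        groups = attribute_clf(imgs).argmax(dim=1)
        for img, g in zip(imgs, groups.tolist()):
            if len(buckets[g]) < n_per_group:
                buckets[g].append(img)
    return {g: torch.stack(v) for g, v in buckets.items()}

# Toy stand-ins so the sketch runs end-to-end.
gen = lambda z: z[:, :3 * 8 * 8].view(-1, 3, 8, 8)
clf = lambda imgs: imgs.mean(dim=(2, 3))  # (N, 3) -> 3 pseudo-groups
data = sample_balanced(gen, clf, n_per_group=4, num_groups=3)
print({g: v.shape for g, v in data.items()})
```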

MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions

  • paper_url: http://arxiv.org/abs/2308.08544
  • repo_url: https://github.com/henghuiding/MeViS
  • paper_authors: Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Chen Change Loy
  • for: Studying motion-expression-guided video segmentation: segmenting target objects in video content based on sentences describing the objects' motion.
  • methods: Introduces the large-scale MeViS dataset, benchmarks 5 existing referring video object segmentation (RVOS) methods on it, and proposes a baseline approach.
  • results: Current RVOS methods cannot effectively address motion-expression-guided video segmentation; the proposed baseline provides a good starting point for future research.
    Abstract This paper strives for motion expressions guided video segmentation, which focuses on segmenting objects in video content based on a sentence describing the motion of the objects. Existing referring video object datasets typically focus on salient objects and use language expressions that contain excessive static attributes that could potentially enable the target object to be identified in a single frame. These datasets downplay the importance of motion in video content for language-guided video object segmentation. To investigate the feasibility of using motion expressions to ground and segment objects in videos, we propose a large-scale dataset called MeViS, which contains numerous motion expressions to indicate target objects in complex environments. We benchmarked 5 existing referring video object segmentation (RVOS) methods and conducted a comprehensive comparison on the MeViS dataset. The results show that current RVOS methods cannot effectively address motion expression-guided video segmentation. We further analyze the challenges and propose a baseline approach for the proposed MeViS dataset. The goal of our benchmark is to provide a platform that enables the development of effective language-guided video segmentation algorithms that leverage motion expressions as a primary cue for object segmentation in complex video scenes. The proposed MeViS dataset has been released at https://henghuiding.github.io/MeViS.

InsightMapper: A Closer Look at Inner-instance Information for Vectorized High-Definition Mapping

  • paper_url: http://arxiv.org/abs/2308.08543
  • repo_url: https://github.com/TonyXuQAQ/InsightMapper
  • paper_authors: Zhenhua Xu, Kenneth K. Y. Wong, Hengshuang Zhao
  • for: Improving the detection of vectorized high-definition (HD) maps for autonomous driving by exploiting inner-instance information.
  • methods: A transformer-based detector with three novel designs that leverage inner-instance information: hybrid query generation, inner-instance query fusion, and inner-instance feature aggregation.
  • results: On the NuScenes dataset, the method surpasses previous state-of-the-art methods by 5.78 mAP and 5.12 TOPO (a topology-correctness metric) while staying efficient in both training and inference.
    Abstract Vectorized high-definition (HD) maps contain detailed information about surrounding road elements, which are crucial for various downstream tasks in modern autonomous driving vehicles, such as vehicle planning and control. Recent works have attempted to directly detect the vectorized HD map as a point set prediction task, resulting in significant improvements in detection performance. However, these approaches fail to analyze and exploit the inner-instance correlations between predicted points, impeding further advancements. To address these challenges, we investigate the utilization of inner-$\textbf{INS}$tance information for vectorized h$\textbf{IGH}$-definition mapping through $\textbf{T}$ransformers and introduce InsightMapper. This paper presents three novel designs within InsightMapper that leverage inner-instance information in distinct ways, including hybrid query generation, inner-instance query fusion, and inner-instance feature aggregation. Comparative experiments are conducted on the NuScenes dataset, showcasing the superiority of our proposed method. InsightMapper surpasses previous state-of-the-art (SOTA) methods by 5.78 mAP and 5.12 TOPO, which assess topology correctness. Simultaneously, InsightMapper maintains high efficiency during both training and inference phases, resulting in remarkable comprehensive performance. The project page for this work is available at https://tonyxuqaq.github.io/projects/InsightMapper .

Ref-DVGO: Reflection-Aware Direct Voxel Grid Optimization for an Improved Quality-Efficiency Trade-Off in Reflective Scene Reconstruction

  • paper_url: http://arxiv.org/abs/2308.08530
  • repo_url: https://github.com/gkouros/ref-dvgo
  • paper_authors: Georgios Kouros, Minye Wu, Shubham Shrivastava, Sushruth Nagesh, Punarjay Chakravarty, Tinne Tuytelaars
  • for: Striking a balance between efficiency and quality when reconstructing reflective scenes.
  • methods: An implicit-explicit approach based on conventional volume rendering, using an efficient density-based grid representation and reparameterizing the reflected radiance in the pipeline.
  • results: Improves reconstruction quality and accelerates training and rendering, achieving a competitive quality-efficiency trade-off compared to competing methods.
    Abstract Neural Radiance Fields (NeRFs) have revolutionized the field of novel view synthesis, demonstrating remarkable performance. However, the modeling and rendering of reflective objects remain challenging problems. Recent methods have shown significant improvements over the baselines in handling reflective scenes, albeit at the expense of efficiency. In this work, we aim to strike a balance between efficiency and quality. To this end, we investigate an implicit-explicit approach based on conventional volume rendering to enhance the reconstruction quality and accelerate the training and rendering processes. We adopt an efficient density-based grid representation and reparameterize the reflected radiance in our pipeline. Our proposed reflection-aware approach achieves a competitive quality efficiency trade-off compared to competing methods. Based on our experimental results, we propose and discuss hypotheses regarding the factors influencing the results of density-based methods for reconstructing reflective objects. The source code is available at https://github.com/gkouros/ref-dvgo.

Diagnosing Human-object Interaction Detectors

  • paper_url: http://arxiv.org/abs/2308.08529
  • repo_url: https://github.com/neu-vi/diag-hoi
  • paper_authors: Fangrui Zhu, Yiming Xie, Weidi Xie, Huaizu Jiang
  • for: Providing a diagnosis toolbox for analyzing the error sources of existing human-object interaction (HOI) detection models.
  • methods: Holistically investigates the HOI detection pipeline (human-object pair detection followed by interaction classification), defines a set of error types together with oracles that fix each of them, and measures the mAP improvement from each oracle fix to quantify the significance of each error.
  • results: Detailed analyses of both sub-tasks: recall, precision, and detection noisiness for human-object pair detection, and detection-independent mAP plus the AP for separating interacting from non-interacting pairs for interaction classification; the toolbox applies across methods and datasets.
    Abstract Although we have witnessed significant progress in human-object interaction (HOI) detection with increasingly high mAP (mean Average Precision), a single mAP score is too concise to obtain an informative summary of a model's performance and to understand why one approach is better than another. In this paper, we introduce a diagnosis toolbox for analyzing the error sources of the existing HOI detection models. We first conduct holistic investigations in the pipeline of HOI detection, consisting of human-object pair detection and then interaction classification. We define a set of errors and the oracles to fix each of them. By measuring the mAP improvement obtained from fixing an error using its oracle, we can have a detailed analysis of the significance of different errors. We then delve into the human-object detection and interaction classification, respectively, and check the model's behavior. For the first detection task, we investigate both recall and precision, measuring the coverage of ground-truth human-object pairs as well as the noisiness level in the detections. For the second classification task, we compute mAP for interaction classification only, without considering the detection scores. We also measure the performance of the models in differentiating human-object pairs with and without actual interactions using the AP (Average Precision) score. Our toolbox is applicable for different methods across different datasets and available at https://github.com/neu-vi/Diag-HOI.

Likelihood-Based Text-to-Image Evaluation with Patch-Level Perceptual and Semantic Credit Assignment

  • paper_url: http://arxiv.org/abs/2308.08525
  • repo_url: https://github.com/chenqi008/leica
  • paper_authors: Qi Chen, Chaorui Deng, Zixiong Huang, Bowen Zhang, Mingkui Tan, Qi Wu
  • for: Proposing a new evaluation metric for text-to-image generation that better reflects perceptual quality and text-image alignment.
  • methods: Directly estimates the likelihood of generated images with a pre-trained likelihood-based text-to-image generative model, with several new designs for a patch-level credit assignment strategy based on semantic and perceptual significance.
  • results: The metric successfully evaluates multiple popular text-to-image models and datasets and stabilizes with as few as a hundred samples, making it very efficient in practice.
    Abstract Text-to-image synthesis has made encouraging progress and attracted lots of public attention recently. However, popular evaluation metrics in this area, like the Inception Score and Fréchet Inception Distance, incur several issues. First of all, they cannot explicitly assess the perceptual quality of generated images and poorly reflect the semantic alignment of each text-image pair. Also, they are inefficient and need to sample thousands of images to stabilise their evaluation results. In this paper, we propose to evaluate text-to-image generation performance by directly estimating the likelihood of the generated images using a pre-trained likelihood-based text-to-image generative model, i.e., a higher likelihood indicates better perceptual quality and better text-image alignment. To prevent the likelihood from being dominated by the non-crucial parts of the generated image, we propose several new designs to develop a credit assignment strategy based on the semantic and perceptual significance of the image patches. In the experiments, we evaluate the proposed metric on multiple popular text-to-image generation models and datasets, assessing both perceptual quality and text-image alignment. Moreover, it can successfully assess the generation ability of these models with as few as a hundred samples, making it very efficient in practice.
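The patch-level credit assignment can be pictured as a weighted combination of per-patch log-likelihoods. The sketch below uses simple normalized weights, which is an illustrative assumption rather than the paper's exact scheme; where the log-likelihoods and significance scores come from is left abstract.

```python
import torch

def weighted_image_loglik(patch_logliks: torch.Tensor,
                          patch_weights: torch.Tensor) -> torch.Tensor:
    """Combine per-patch log-likelihoods (from a likelihood-based
    text-to-image model) with weights reflecting each patch's semantic
    and perceptual significance, so non-crucial regions cannot dominate.

    patch_logliks: (N, P) log-likelihoods for P patches of N images
    patch_weights: (N, P) non-negative significance scores
    """
    w = patch_weights / patch_weights.sum(dim=1, keepdim=True).clamp_min(1e-8)
    return (w * patch_logliks).sum(dim=1)  # higher = better quality/alignment

scores = weighted_image_loglik(torch.randn(8, 196), torch.rand(8, 196))
print(scores.shape)  # torch.Size([8])
```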

Painter: Teaching Auto-regressive Language Models to Draw Sketches

  • paper_url: http://arxiv.org/abs/2308.08520
  • repo_url: None
  • paper_authors: Reza Pourreza, Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Pulkit Madan, Roland Memisevic
  • for: Applying large language models (LLMs) to image generation by directly generating the virtual brush strokes that paint an image from a text description.
  • methods: Builds Painter on an off-the-shelf LLM pre-trained on a large text corpus, fine-tuning it on the new task while preserving its language understanding, so that brush strokes are generated auto-regressively.
  • results: Using a dataset of diverse multi-object sketches paired with textual prompts, Painter can generate sketches from text descriptions, remove objects from the canvas, and detect and classify objects in sketches; the results are very encouraging.
    Abstract Large language models (LLMs) have made tremendous progress in natural language understanding and they have also been successfully adopted in other domains such as computer vision, robotics, reinforcement learning, etc. In this work, we apply LLMs to image generation tasks by directly generating the virtual brush strokes to paint an image. We present Painter, an LLM that can convert user prompts in text description format to sketches by generating the corresponding brush strokes in an auto-regressive way. We construct Painter based on off-the-shelf LLM that is pre-trained on a large text corpus, by fine-tuning it on the new task while preserving language understanding capabilities. We create a dataset of diverse multi-object sketches paired with textual prompts that covers several object types and tasks. Painter can generate sketches from text descriptions, remove objects from canvas, and detect and classify objects in sketches. Although this is an unprecedented pioneering work in using LLMs for auto-regressive image generation, the results are very encouraging.

Two-and-a-half Order Score-based Model for Solving 3D Ill-posed Inverse Problems

  • paper_url: http://arxiv.org/abs/2308.08511
  • repo_url: None
  • paper_authors: Zirong Li, Yanyang Wang, Jianjia Zhang, Weiwen Wu, Hengyong Yu
  • for: Solving 3D ill-posed inverse problems in CT and MRI, including sparse-view CT and fast MRI reconstruction.
  • methods: A two-and-a-half order score-based model (TOSM) that learns data distributions in 2D space during training, reducing training complexity, and updates the distribution in 3D space during reconstruction using complementary scores along three directions (sagittal, coronal, and transaxial) for a more precise reconstruction.
  • results: Extensive experiments on large-scale sparse-view CT and fast MRI datasets show state-of-the-art results on 3D ill-posed inverse problems, with the inter-slice inconsistency issue effectively addressed and high-quality 3D volumetric reconstruction.
    Abstract Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) are crucial technologies in the field of medical imaging. Score-based models have proven to be effective in addressing different inverse problems encountered in CT and MRI, such as sparse-view CT and fast MRI reconstruction. However, these models face challenges in achieving accurate three dimensional (3D) volumetric reconstruction. The existing score-based models primarily focus on reconstructing two dimensional (2D) data distribution, leading to inconsistencies between adjacent slices in the reconstructed 3D volumetric images. To overcome this limitation, we propose a novel two-and-a-half order score-based model (TOSM). During the training phase, our TOSM learns data distributions in 2D space, which reduces the complexity of training compared to directly working on 3D volumes. However, in the reconstruction phase, the TOSM updates the data distribution in 3D space, utilizing complementary scores along three directions (sagittal, coronal, and transaxial) to achieve a more precise reconstruction. The development of TOSM is built on robust theoretical principles, ensuring its reliability and efficacy. Through extensive experimentation on large-scale sparse-view CT and fast MRI datasets, our method demonstrates remarkable advancements and attains state-of-the-art results in solving 3D ill-posed inverse problems. Notably, the proposed TOSM effectively addresses the inter-slice inconsistency issue, resulting in high-quality 3D volumetric reconstruction.
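The reconstruction-time idea (apply a 2D score network along three slicing directions and fuse the complementary scores) can be sketched as follows. Here `score_2d` is an assumed callable mapping a batch of 2D slices (S, 1, H, W) to outputs of the same shape, and the plain mean fusion is an illustrative assumption.

```python
import torch

def three_direction_score(volume: torch.Tensor, score_2d) -> torch.Tensor:
    """Slice a 3D volume along the sagittal, coronal, and transaxial axes,
    score each stack of 2D slices, and average the three score volumes.

    volume: (D, H, W)
    """
    def score_along(dim: int) -> torch.Tensor:
        slices = volume.movedim(dim, 0).unsqueeze(1)  # (S, 1, A, B)
        s = score_2d(slices).squeeze(1)               # (S, A, B)
        return s.movedim(0, dim)                      # back to (D, H, W)

    return sum(score_along(d) for d in range(3)) / 3.0

# Toy check with an identity "score network".
vol = torch.randn(16, 16, 16)
print(three_direction_score(vol, lambda x: x).shape)  # torch.Size([16, 16, 16])
```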

ResBuilder: Automated Learning of Depth with Residual Structures

  • paper_url: http://arxiv.org/abs/2308.08504
  • repo_url: None
  • paper_authors: Julian Burghoff, Matthias Rottmann, Jill von Conta, Sebastian Schoenen, Andreas Witte, Hanno Gottschalk
  • for: Automatically developing ResNet architectures that achieve high accuracy at moderate computational cost.
  • methods: A neural architecture search algorithm, Resbuilder, that builds ResNet architectures from scratch, can modify existing architectures, and can remove and insert ResNet blocks, thereby searching the space of ResNet architectures.
  • results: On several image classification datasets, Resbuilder achieves close to state-of-the-art performance while saving computational cost compared to off-the-shelf ResNets; parameters tuned once on CIFAR10 yield a suitable default for all other datasets, a property that also generalizes to a proprietary industrial fraud detection dataset.
    Abstract In this work, we develop a neural architecture search algorithm, termed Resbuilder, that develops ResNet architectures from scratch that achieve high accuracy at moderate computational cost. It can also be used to modify existing architectures and has the capability to remove and insert ResNet blocks, in this way searching for suitable architectures in the space of ResNet architectures. In our experiments on different image classification datasets, Resbuilder achieves close to state-of-the-art performance while saving computational cost compared to off-the-shelf ResNets. Noteworthy, we once tune the parameters on CIFAR10 which yields a suitable default choice for all other datasets. We demonstrate that this property generalizes even to industrial applications by applying our method with default parameters on a proprietary fraud detection dataset.

Self-Supervised Online Camera Calibration for Automated Driving and Parking Applications

  • paper_url: http://arxiv.org/abs/2308.08495
  • repo_url: None
  • paper_authors: Ciarán Hogan, Ganesh Sistu, Ciarán Eising
  • for: Accurate, up-to-date camera calibration for the camera-based perception systems of modern autonomous vehicles.
  • methods: A deep learning framework that learns the camera's intrinsic and extrinsic calibration parameters in real time, self-supervised, without any labelling or supervision.
  • results: The framework learns calibration without special data collection, physical targets, or driving on special planar surfaces.
    Abstract Camera-based perception systems play a central role in modern autonomous vehicles. These camera based perception algorithms require an accurate calibration to map the real world distances to image pixels. In practice, calibration is a laborious procedure requiring specialised data collection and careful tuning. This process must be repeated whenever the parameters of the camera change, which can be a frequent occurrence in autonomous vehicles. Hence there is a need to calibrate at regular intervals to ensure the camera is accurate. Proposed is a deep learning framework to learn intrinsic and extrinsic calibration of the camera in real time. The framework is self-supervised and doesn't require any labelling or supervision to learn the calibration parameters. The framework learns calibration without the need for any physical targets or to drive the car on special planar surfaces.

DeDoDe: Detect, Don’t Describe – Describe, Don’t Detect for Local Feature Matching

  • paper_url: http://arxiv.org/abs/2308.08479
  • repo_url: https://github.com/parskatt/dedode
  • paper_authors: Johan Edstedt, Georg Bökman, Mårten Wadenbäck, Michael Felsberg
  • for: Keypoint detection for 3D reconstruction: detecting sets of points that are consistent between views, i.e., correspond to the same 3D points in the scene.
  • methods: Learns keypoints directly from 3D consistency by training the detector to detect tracks from large-scale SfM; since such points are overly sparse, a semi-supervised two-view detection objective expands the set to a desired number of detections. A separate network learns the descriptor by maximizing a mutual nearest neighbour objective over the keypoints.
  • results: DeDoDe achieves significant gains on multiple geometry benchmarks; code is available at https://github.com/Parskatt/DeDoDe.
    Abstract Keypoint detection is a pivotal step in 3D reconstruction, whereby sets of (up to) K points are detected in each view of a scene. Crucially, the detected points need to be consistent between views, i.e., correspond to the same 3D point in the scene. One of the main challenges with keypoint detection is the formulation of the learning objective. Previous learning-based methods typically jointly learn descriptors with keypoints, and treat the keypoint detection as a binary classification task on mutual nearest neighbours. However, basing keypoint detection on descriptor nearest neighbours is a proxy task, which is not guaranteed to produce 3D-consistent keypoints. Furthermore, this ties the keypoints to a specific descriptor, complicating downstream usage. In this work, we instead learn keypoints directly from 3D consistency. To this end, we train the detector to detect tracks from large-scale SfM. As these points are often overly sparse, we derive a semi-supervised two-view detection objective to expand this set to a desired number of detections. To train a descriptor, we maximize the mutual nearest neighbour objective over the keypoints with a separate network. Results show that our approach, DeDoDe, achieves significant gains on multiple geometry benchmarks. Code is provided at https://github.com/Parskatt/DeDoDe .
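The mutual nearest neighbour criterion at the heart of the descriptor objective is compact enough to sketch. The function below computes mutual-NN matches between two L2-normalized descriptor sets; descriptor counts and dimensions are illustrative.

```python
import torch

def mutual_nearest_neighbours(desc_a: torch.Tensor, desc_b: torch.Tensor):
    """A pair (i, j) is a mutual nearest neighbour iff j is i's nearest
    neighbour in image B and i is j's nearest neighbour in image A.
    Descriptors are assumed L2-normalized, shape (K, C)."""
    sim = desc_a @ desc_b.t()        # cosine similarity, (Ka, Kb)
    nn_ab = sim.argmax(dim=1)        # best match in B for each a
    nn_ba = sim.argmax(dim=0)        # best match in A for each b
    idx_a = torch.arange(desc_a.shape[0])
    mutual = nn_ba[nn_ab] == idx_a   # a -> b -> back to the same a
    return idx_a[mutual], nn_ab[mutual]

a = torch.nn.functional.normalize(torch.randn(500, 256), dim=1)
b = torch.nn.functional.normalize(torch.randn(480, 256), dim=1)
ia, ib = mutual_nearest_neighbours(a, b)
print(ia.shape, ib.shape)
```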

Classification Committee for Active Deep Object Detection

  • paper_url: http://arxiv.org/abs/2308.08476
  • repo_url: https://github.com/ylzy123/CCADOD
  • paper_authors: Lei Zhao, Bo Li, Xingxing Wei
  • for: Proposes an active deep object detection method that uses a classification committee to select the most informative images for training the object detector.
  • methods: A main detector plus a classification committee that selects the most informative images based on their uncertainty values; the committee is pre-trained via the Maximum Classifiers Discrepancy Group Loss (MCDGL), and a Focus on Positive Instances Loss (FPIL) mitigates the impact of interference instances.
  • results: The proposed method outperforms state-of-the-art active learning methods on object detection tasks on the Pascal VOC and COCO datasets.
    Abstract In object detection, the cost of labeling is very high because it requires not only confirming the categories of multiple objects in an image but also accurately determining the bounding box of each object. Integrating active learning into object detection is therefore of considerable value. In this paper, we propose a classification committee for active deep object detection, introducing a discrepancy mechanism over multiple classifiers for sample selection when training object detectors. The model contains a main detector and a classification committee. The main detector is the target object detector, trained on a labeled pool composed of the selected informative images. The role of the classification committee is to select the most informative images according to their uncertainty values from the classification point of view, focusing on the discrepancy and representativeness of instances. Specifically, the committee computes the uncertainty of a specified instance within an image by measuring its discrepancy as output by the committee, which is pre-trained via the proposed Maximum Classifiers Discrepancy Group Loss (MCDGL). The most informative images are then determined by selecting the ones with many high-uncertainty instances. Besides, to mitigate the impact of interference instances, we design a Focus on Positive Instances Loss (FPIL) that gives the committee the ability to automatically focus on representative instances and precisely encode their discrepancies for the same instance. Experiments are conducted on the Pascal VOC and COCO datasets with several popular object detectors, and the results show that our method outperforms state-of-the-art active learning methods, verifying the effectiveness of the proposed approach.
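The committee's uncertainty signal can be sketched as a discrepancy measure over member predictions. The mean pairwise L1 discrepancy below is a generic stand-in for the MCDGL-trained committee's measure, and the 0.5 selection threshold is an arbitrary illustration.

```python
import torch

def committee_uncertainty(probs: torch.Tensor) -> torch.Tensor:
    """Uncertainty of one detected instance: M committee members each
    output a class distribution, and uncertainty is the mean pairwise
    L1 discrepancy between members (self-pairs contribute zero).

    probs: (M, C) softmax outputs of M members over C classes
    """
    diff = probs.unsqueeze(0) - probs.unsqueeze(1)  # (M, M, C)
    return diff.abs().sum(dim=-1).mean()            # scalar discrepancy

# Image-level informativeness: count high-uncertainty instances.
instances = [torch.softmax(torch.randn(5, 20), dim=-1) for _ in range(12)]
n_uncertain = sum(committee_uncertainty(p) > 0.5 for p in instances)
print(int(n_uncertain))
```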

Hierarchical Uncertainty Estimation for Medical Image Segmentation Networks

  • paper_url: http://arxiv.org/abs/2308.08465
  • repo_url: None
  • paper_authors: Xinyu Bai, Wenjia Bai
  • for: Building trustworthy medical image segmentation models requires not only evaluating performance but also estimating the uncertainty of model predictions.
  • methods: Leverages the hierarchical encoder architecture, which extracts image features at multiple resolution levels from fine to coarse, and estimates uncertainties at multiple levels via the skip-connection module, sampling them to produce an uncertainty map for the predicted segmentation.
  • results: A deep segmentation network such as U-net equipped with this multi-level uncertainty estimation module achieves high segmentation performance while providing meaningful uncertainty maps that can be used for out-of-distribution detection.
    Abstract Learning a medical image segmentation model is an inherently ambiguous task, as uncertainties exist in both images (noise) and manual annotations (human errors and bias) used for model training. To build a trustworthy image segmentation model, it is important to not just evaluate its performance but also estimate the uncertainty of the model prediction. Most state-of-the-art image segmentation networks adopt a hierarchical encoder architecture, extracting image features at multiple resolution levels from fine to coarse. In this work, we leverage this hierarchical image representation and propose a simple yet effective method for estimating uncertainties at multiple levels. The multi-level uncertainties are modelled via the skip-connection module and then sampled to generate an uncertainty map for the predicted image segmentation. We demonstrate that a deep learning segmentation network such as U-net, when implemented with such hierarchical uncertainty estimation module, can achieve a high segmentation performance, while at the same time provide meaningful uncertainty maps that can be used for out-of-distribution detection.
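A minimal version of the multi-level idea: attach a small uncertainty head to each skip-connection feature map, upsample to full resolution, and fuse. The head design and mean fusion below are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelUncertainty(nn.Module):
    """Per-level uncertainty heads over skip-connection features,
    upsampled to full resolution and averaged into one map."""

    def __init__(self, skip_channels: list[int]):
        super().__init__()
        self.heads = nn.ModuleList(nn.Conv2d(c, 1, 1) for c in skip_channels)

    def forward(self, skips: list[torch.Tensor], out_hw: tuple[int, int]):
        maps = [
            F.interpolate(torch.sigmoid(h(f)), size=out_hw,
                          mode="bilinear", align_corners=False)
            for h, f in zip(self.heads, skips)
        ]
        return torch.stack(maps).mean(dim=0)  # (B, 1, H, W) uncertainty

# Toy skips at 1/1, 1/2, and 1/4 resolution.
mlu = MultiLevelUncertainty([32, 64, 128])
skips = [torch.randn(2, 32, 64, 64), torch.randn(2, 64, 32, 32),
         torch.randn(2, 128, 16, 16)]
print(mlu(skips, (64, 64)).shape)  # torch.Size([2, 1, 64, 64])
```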

Learning to Distill Global Representation for Sparse-View CT

  • paper_url: http://arxiv.org/abs/2308.08463
  • repo_url: None
  • paper_authors: Zilong Li, Chenglong Ma, Jie Chen, Junping Zhang, Hongming Shan
  • for: Sparse-view CT reconstruction, improving image quality through image post-processing.
  • methods: An image post-processing framework, GloReDi, that learns a global representation (GloRe) with Fourier convolution so each GloRe element has an image-wide receptive field, and distills GloRe from readily available intermediate-view reconstructions via representation directional distillation and band-pass-specific contrastive distillation.
  • results: GloReDi outperforms state-of-the-art methods, including dual-domain ones, with clear advantages in ultra-sparse-view CT reconstruction.
    Abstract Sparse-view computed tomography (CT) -- using a small number of projections for tomographic reconstruction -- enables much lower radiation dose to patients and accelerated data acquisition. The reconstructed images, however, suffer from strong artifacts, greatly limiting their diagnostic value. Current trends for sparse-view CT turn to the raw data for better information recovery. The resultant dual-domain methods, nonetheless, suffer from secondary artifacts, especially in ultra-sparse view scenarios, and their generalization to other scanners/protocols is greatly limited. A crucial question arises: have the image post-processing methods reached the limit? Our answer is not yet. In this paper, we stick to image post-processing methods due to great flexibility and propose global representation (GloRe) distillation framework for sparse-view CT, termed GloReDi. First, we propose to learn GloRe with Fourier convolution, so each element in GloRe has an image-wide receptive field. Second, unlike methods that only use the full-view images for supervision, we propose to distill GloRe from intermediate-view reconstructed images that are readily available but not explored in previous literature. The success of GloRe distillation is attributed to two key components: representation directional distillation to align the GloRe directions, and band-pass-specific contrastive distillation to gain clinically important details. Extensive experiments demonstrate the superiority of the proposed GloReDi over the state-of-the-art methods, including dual-domain ones. The source code is available at https://github.com/longzilicart/GloReDi.

cs.AI - 2023-08-17

Enhancing API Documentation through BERTopic Modeling and Summarization

  • paper_url: http://arxiv.org/abs/2308.09070
  • repo_url: https://github.com/scam2023-bert/bertopic
  • paper_authors: AmirHossein Naghshzan, Sylvie Ratte
  • for: Making API documentation more efficient and accessible for developers, who often find official documentation extensive and lacking in user-friendliness.
  • methods: Uses BERTopic for topic modeling together with Natural Language Processing (NLP) to automatically generate summaries of API documentation, giving developers a more efficient way to extract the information they need.
  • results: The produced summaries and topics, evaluated for performance, coherence, and interoperability, reveal recurring topics and common issues in API documentation and suggest potential solutions, improving the accessibility and efficiency of API documentation comprehension.
    Abstract As the amount of textual data in various fields, including software development, continues to grow, there is a pressing demand for efficient and effective extraction and presentation of meaningful insights. This paper presents a unique approach to address this need, focusing on the complexities of interpreting Application Programming Interface (API) documentation. While official API documentation serves as a primary source of information for developers, it can often be extensive and lacks user-friendliness. In light of this, developers frequently resort to unofficial sources like Stack Overflow and GitHub. Our novel approach employs the strengths of BERTopic for topic modeling and Natural Language Processing (NLP) to automatically generate summaries of API documentation, thereby creating a more efficient method for developers to extract the information they need. The produced summaries and topics are evaluated based on their performance, coherence, and interoperability. The findings of this research contribute to the field of API documentation analysis by providing insights into recurring topics, identifying common issues, and generating potential solutions. By improving the accessibility and efficiency of API documentation comprehension, our work aims to enhance the software development process and empower developers with practical tools for navigating complex APIs.
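For reference, the call pattern of the off-the-shelf bertopic package looks like the sketch below (an assumption about tooling; the paper's own pipeline and repo may differ). The corpus here is a tiny synthetic stand-in; a real analysis would pass hundreds of posts or documentation passages.

```python
from bertopic import BERTopic

# Minimal call-pattern sketch: cluster API-related posts into topics.
base = [
    "How do I paginate results with this REST API?",
    "Authentication token expires too quickly",
    "Rate limit errors when calling the search endpoint",
]
# Lightly varied duplicates only to make the example self-contained.
docs = [f"{s} (post {i})" for i, s in enumerate(base * 50)]

topic_model = BERTopic(min_topic_size=5)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info())  # topic ids, sizes, representative terms
```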

Fostering User Engagement in the Critical Reflection of Arguments

  • paper_url: http://arxiv.org/abs/2308.09061
  • repo_url: None
  • paper_authors: Klaus Weber, Annalena Aicher, Wolfang Minker, Stefan Ultes, Elisabeth André
  • for: Supporting a fair and unbiased opinion-building process.
  • methods: A chatbot that engages in deliberative dialogue with the user, plus a model estimating the user's reflective engagement (RUE), defined as their critical thinking and open-mindedness, which lets the system intervene when the user is too focused on a pre-existing opinion.
  • results: A user study with 58 participants shows a significant effect on both user reflection and total user focus, demonstrating the validity of the proposed approach.
    Abstract A natural way to resolve different points of view and form opinions is through exchanging arguments and knowledge. Facing the vast amount of available information on the internet, people tend to focus on information consistent with their beliefs. Especially when the issue is controversial, information is often selected that does not challenge one's beliefs. To support a fair and unbiased opinion-building process, we propose a chatbot system that engages in a deliberative dialogue with a human. In contrast to persuasive systems, the envisioned chatbot aims to provide a diverse and representative overview - embedded in a conversation with the user. To account for a reflective and unbiased exploration of the topic, we enable the system to intervene if the user is too focused on their pre-existing opinion. Therefore we propose a model to estimate the users' reflective engagement (RUE), defined as their critical thinking and open-mindedness. We report on a user study with 58 participants to test our model and the effect of the intervention mechanism, discuss the implications of the results, and present perspectives for future work. The results show a significant effect on both user reflection and total user focus, proving our proposed approach's validity.

Refining a Deep Learning-based Formant Tracker using Linear Prediction Methods

  • paper_url: http://arxiv.org/abs/2308.09051
  • repo_url: None
  • paper_authors: Paavo Alku, Sudarsana Reddy Kadiri, Dhananjaya Gowda
  • for: Investigating formant tracking by refining the formants tracked by the data-driven DeepFormants tracker with formants estimated in a model-driven manner by linear prediction (LP)-based methods.
  • methods: Uses conventional covariance analysis (LP-COV) and the recently proposed quasi-closed phase forward-backward (QCP-FB) analysis: the contours of the three lowest formants predicted by DeepFormants are replaced frame-wise with local spectral peaks from the LP-based methods, with no new data learning required.
  • results: On the popular vocal tract resonance (VTR) corpus, the refined trackers outperform both the original DeepFormants and five traditional trackers, with the best performance obtained using QCP-FB; on VTR speech corrupted by additive noise, the refined trackers are also more resilient than the reference trackers.
    Abstract In this study, formant tracking is investigated by refining the formants tracked by an existing data-driven tracker, DeepFormants, using the formants estimated in a model-driven manner by linear prediction (LP)-based methods. As LP-based formant estimation methods, conventional covariance analysis (LP-COV) and the recently proposed quasi-closed phase forward-backward (QCP-FB) analysis are used. In the proposed refinement approach, the contours of the three lowest formants are first predicted by the data-driven DeepFormants tracker, and the predicted formants are replaced frame-wise with local spectral peaks shown by the model-driven LP-based methods. The refinement procedure can be plugged into the DeepFormants tracker with no need for any new data learning. Two refined DeepFormants trackers were compared with the original DeepFormants and with five known traditional trackers using the popular vocal tract resonance (VTR) corpus. The results indicated that the data-driven DeepFormants trackers outperformed the conventional trackers and that the best performance was obtained by refining the formants predicted by DeepFormants using QCP-FB analysis. In addition, by tracking formants using VTR speech that was corrupted by additive noise, the study showed that the refined DeepFormants trackers were more resilient to noise than the reference trackers. In general, these results suggest that LP-based model-driven approaches, which have traditionally been used in formant estimation, can be combined with a modern data-driven tracker easily with no further training to improve the tracker's performance.
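The frame-wise refinement rule itself is simple and needs no retraining. Below is a sketch under the assumption that the LP-based peak frequencies for each frame are already available; peak extraction (LP-COV or QCP-FB analysis) is not shown, and the names are illustrative.

```python
import numpy as np

def refine_formants(pred_tracks: np.ndarray, lp_peaks: list) -> np.ndarray:
    """For each frame, replace each formant predicted by the data-driven
    tracker with the closest local spectral peak from the LP analysis.

    pred_tracks: (T, 3) predicted F1-F3 per frame, in Hz
    lp_peaks:    list of length T; lp_peaks[t] is an array of peak
                 frequencies (Hz) found for frame t
    """
    refined = pred_tracks.copy()
    for t, peaks in enumerate(lp_peaks):
        if len(peaks) == 0:
            continue  # keep the prediction when no peak is available
        for k in range(refined.shape[1]):
            refined[t, k] = peaks[np.argmin(np.abs(peaks - refined[t, k]))]
    return refined

pred = np.array([[510.0, 1480.0, 2500.0], [495.0, 1500.0, 2480.0]])
peaks = [np.array([500.0, 1450.0, 2550.0]), np.array([490.0, 1510.0])]
print(refine_formants(pred, peaks))
```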

Severity Classification of Parkinson’s Disease from Speech using Single Frequency Filtering-based Features

  • paper_url: http://arxiv.org/abs/2308.09042
  • repo_url: None
  • paper_authors: Sudarsana Reddy Kadiri, Manila Kodali, Paavo Alku
  • for: Developing objective methods for assessing the severity of Parkinson's disease (PD) to improve diagnosis and treatment.
  • methods: Derives two sets of novel features from the single frequency filtering (SFF) method, which offers greater spectro-temporal resolution than the short-time Fourier transform: (1) SFF cepstral coefficients (SFFCC) and (2) MFCCs computed from SFF (MFCC-SFF), used for PD severity classification with an SVM.
  • results: On the PC-GITA database across three speaking tasks, the proposed features outperform conventional MFCCs, with relative improvements of 5.8% (SFFCC) and 2.3% (MFCC-SFF) for vowels, 7.0% and 1.8% for sentences, and 2.4% and 1.1% for read text.
    Abstract Developing objective methods for assessing the severity of Parkinson's disease (PD) is crucial for improving the diagnosis and treatment. This study proposes two sets of novel features derived from the single frequency filtering (SFF) method: (1) SFF cepstral coefficients (SFFCC) and (2) MFCCs from the SFF (MFCC-SFF) for the severity classification of PD. Prior studies have demonstrated that SFF offers greater spectro-temporal resolution compared to the short-time Fourier transform. The study uses the PC-GITA database, which includes speech of PD patients and healthy controls produced in three speaking tasks (vowels, sentences, text reading). Experiments using the SVM classifier revealed that the proposed features outperformed the conventional MFCCs in all three speaking tasks. The proposed SFFCC and MFCC-SFF features gave a relative improvement of 5.8% and 2.3% for the vowel task, 7.0% & 1.8% for the sentence task, and 2.4% and 1.1% for the read text task, in comparison to MFCC features.

A Mathematical Characterization of Minimally Sufficient Robot Brains

  • paper_url: http://arxiv.org/abs/2308.09041
  • repo_url: None
  • paper_authors: Basak Sakcak, Kalle G. Timperi, Vadim Weinstein, Steven M. LaValle
  • for: Studying the lower limits on encoding and processing the information acquired through interactions between an internal system (robot algorithms or software) and an external system (robot body and its environment), in terms of action and observation histories.
  • methods: Models both internal and external systems as transition systems and asks for the weakest internal system sufficient for passive (filtering) and active (planning) tasks, introducing information transition systems over spaces of information states that reflect limited sensing, memory, computation, and actuation.
  • results: Minimal information transition systems exist up to reasonable equivalence assumptions and are unique under some general conditions; the theory yields new insights into optimal sensor fusion/filtering, basic planning tasks, and minimal representations for modeling a system given input-output relations.
    Abstract This paper addresses the lower limits of encoding and processing the information acquired through interactions between an internal system (robot algorithms or software) and an external system (robot body and its environment) in terms of action and observation histories. Both are modeled as transition systems. We want to know the weakest internal system that is sufficient for achieving passive (filtering) and active (planning) tasks. We introduce the notion of an information transition system for the internal system which is a transition system over a space of information states that reflect a robot's or other observer's perspective based on limited sensing, memory, computation, and actuation. An information transition system is viewed as a filter and a policy or plan is viewed as a function that labels the states of this information transition system. Regardless of whether internal systems are obtained by learning algorithms, planning algorithms, or human insight, we want to know the limits of feasibility for given robot hardware and tasks. We establish, in a general setting, that minimal information transition systems exist up to reasonable equivalence assumptions, and are unique under some general conditions. We then apply the theory to generate new insights into several problems, including optimal sensor fusion/filtering, solving basic planning tasks, and finding minimal representations for modeling a system given input-output relations.

Synthesizing Physically Plausible Human Motions in 3D Scenes

  • paper_url: http://arxiv.org/abs/2308.09036
  • repo_url: https://github.com/liangpan99/interscene
  • paper_authors: Liang Pan, Jingbo Wang, Buzhen Huang, Junyu Zhang, Haofan Wang, Xu Tang, Yangang Wang
  • for: Synthesizing physically plausible human motions in 3D scenes, addressing the limitations of existing kinematics-based and physics-based methods, such as penetration and foot skating.
  • methods: A framework, InterScene, that decomposes human-scene interactions into two fundamental processes, Interacting and Navigating, realized by two reusable controllers, InterCon and NavCon.
  • results: Experimental results demonstrate that the framework can generate physically plausible long-term human motions in complex 3D scenes.
    Abstract Synthesizing physically plausible human motions in 3D scenes is a challenging problem. Kinematics-based methods cannot avoid inherent artifacts (e.g., penetration and foot skating) due to the lack of physical constraints. Meanwhile, existing physics-based methods cannot generalize to multi-object scenarios since the policy trained with reinforcement learning has limited modeling capacity. In this work, we present a framework that enables physically simulated characters to perform long-term interaction tasks in diverse, cluttered, and unseen scenes. The key idea is to decompose human-scene interactions into two fundamental processes, Interacting and Navigating, which motivates us to construct two reusable Controller, i.e., InterCon and NavCon. Specifically, InterCon contains two complementary policies that enable characters to enter and leave the interacting state (e.g., sitting on a chair and getting up). To generate interaction with objects at different places, we further design NavCon, a trajectory following policy, to keep characters' locomotion in the free space of 3D scenes. Benefiting from the divide and conquer strategy, we can train the policies in simple environments and generalize to complex multi-object scenes. Experimental results demonstrate that our framework can synthesize physically plausible long-term human motions in complex 3D scenes. Code will be publicly released at https://github.com/liangpan99/InterScene.
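
The divide-and-conquer design lends itself to a simple finite-state composition of the two controllers. Below is a minimal, illustrative Python sketch of how a NavCon-style locomotion policy and InterCon-style enter/leave policies could be scheduled; the policy interfaces, observation fields, and task structure are invented for illustration, not the paper's actual code.

```python
import numpy as np

class PolicyScheduler:
    """Illustrative scheduler that alternates between a trajectory-following
    policy (NavCon-like) and enter/leave interaction policies (InterCon-like)."""

    def __init__(self, nav_policy, enter_policy, leave_policy):
        self.nav_policy = nav_policy      # follows waypoints in free space
        self.enter_policy = enter_policy  # e.g., approach and sit on a chair
        self.leave_policy = leave_policy  # e.g., stand up and step away
        self.state = "navigate"

    def act(self, obs, target, threshold=0.5):
        # Switch to the interaction controller once the target is reached.
        dist = np.linalg.norm(obs["root_pos"] - target["pos"])
        if self.state == "navigate" and dist < threshold:
            self.state = "enter"
        if self.state == "navigate":
            return self.nav_policy(obs, target["waypoints"])
        elif self.state == "enter":
            action, done = self.enter_policy(obs, target)
            if done:                      # seated; a later task triggers "leave"
                self.state = "leave"
            return action
        else:
            action, done = self.leave_policy(obs)
            if done:
                self.state = "navigate"   # resume locomotion to the next target
            return action

# Toy usage with stub policies (real ones would be trained RL controllers).
sched = PolicyScheduler(
    nav_policy=lambda obs, wps: "walk",
    enter_policy=lambda obs, tgt: ("sit", True),
    leave_policy=lambda obs: ("stand", True))
obs = {"root_pos": np.array([0.0, 0.0])}
target = {"pos": np.array([0.2, 0.0]), "waypoints": []}
print(sched.act(obs, target))  # within threshold, so the "enter" policy fires
```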

Reinforcement Learning for Battery Management in Dairy Farming

  • paper_url: http://arxiv.org/abs/2308.09023
  • repo_url: None
  • paper_authors: Nawazish Ali, Abdul Wahid, Rachael shaw, Karl Mason
  • for: This study aims to apply artificial intelligence (AI) to improve the integration of renewable energy generation on dairy farms.
  • methods: A Q-learning algorithm is used to learn an effective battery charging/discharging policy.
  • results: The learned policy substantially reduces electricity costs compared with the baseline algorithm, demonstrating the effectiveness of reinforcement learning for battery management in dairy farming.
    Abstract Dairy farming is a particularly energy-intensive part of the agriculture sector. Effective battery management is essential for renewable integration within the agriculture sector. However, controlling battery charging/discharging is a difficult task due to electricity demand variability, stochasticity of renewable generation, and energy price fluctuations. Despite the potential benefits of applying Artificial Intelligence (AI) to renewable energy in the context of dairy farming, there has been limited research in this area. This research is a priority for Ireland as it strives to meet its governmental goals in energy and sustainability. This research paper utilizes Q-learning to learn an effective policy for charging and discharging a battery within a dairy farm setting. The results demonstrate that the developed policy significantly reduces electricity costs compared to the established baseline algorithm. These findings highlight the effectiveness of reinforcement learning for battery management within the dairy farming sector.
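
As a concrete illustration of the method, the sketch below implements tabular Q-learning for a toy battery: states discretize the state of charge and hour of day, and actions are charge/idle/discharge. The tariff, dynamics, and discretization are invented for the example; the paper's actual state, demand, and reward design will differ.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SOC, N_HOUR = 11, 24          # state-of-charge levels x hour of day
ACTIONS = [-1, 0, 1]            # discharge, idle, charge (one SoC step)
Q = np.zeros((N_SOC, N_HOUR, len(ACTIONS)))
alpha, gamma, eps = 0.1, 0.95, 0.1

def price(hour):                # toy tariff: expensive in the evening peak
    return 0.30 if 17 <= hour <= 20 else 0.12

def step(soc, hour, a):
    new_soc = int(np.clip(soc + ACTIONS[a], 0, N_SOC - 1))
    # Cost of buying energy to charge; avoided grid import when discharging.
    reward = -price(hour) * ACTIONS[a]
    return new_soc, (hour + 1) % N_HOUR, reward

for episode in range(5000):
    soc, hour = int(rng.integers(N_SOC)), 0
    for _ in range(N_HOUR):
        a = int(rng.integers(len(ACTIONS))) if rng.random() < eps \
            else int(np.argmax(Q[soc, hour]))
        soc2, hour2, r = step(soc, hour, a)
        # Standard Q-learning update.
        Q[soc, hour, a] += alpha * (r + gamma * Q[soc2, hour2].max()
                                    - Q[soc, hour, a])
        soc, hour = soc2, hour2

policy = Q.argmax(axis=-1)      # greedy charge/discharge schedule per (SoC, hour)
print(policy[5])                # learned actions at half charge, by hour
```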

Towards Lightweight Data Integration using Multi-workflow Provenance and Data Observability

  • paper_url: http://arxiv.org/abs/2308.09004
  • repo_url: None
  • paper_authors: Renan Souza, Tyler J. Skluzacek, Sean R. Wilkinson, Maxim Ziatdinov, Rafael Ferreira da Silva
  • for: The paper proposes a data-observability-based approach to multi-workflow integrated data analysis, meeting the needs of modern large-scale science that spans multidisciplinary collaborations and diverse computing environments.
  • methods: It combines data observability strategies, adapter system design, and provenance to achieve lightweight runtime multi-workflow integrated data analysis.
  • results: Experiments show near-zero-overhead integrated analysis across various parallel systems and machine learning tools, running up to 100,000 tasks on the Summit supercomputer.
    Abstract Modern large-scale scientific discovery requires multidisciplinary collaboration across diverse computing facilities, including High Performance Computing (HPC) machines and the Edge-to-Cloud continuum. Integrated data analysis plays a crucial role in scientific discovery, especially in the current AI era, by enabling Responsible AI development, FAIR, Reproducibility, and User Steering. However, the heterogeneous nature of science poses challenges such as dealing with multiple supporting tools, cross-facility environments, and efficient HPC execution. Building on data observability, adapter system design, and provenance, we propose MIDA: an approach for lightweight runtime Multi-workflow Integrated Data Analysis. MIDA defines data observability strategies and adaptability methods for various parallel systems and machine learning tools. With observability, it intercepts the dataflows in the background without requiring instrumentation while integrating domain, provenance, and telemetry data at runtime into a unified database ready for user steering queries. We conduct experiments showing end-to-end multi-workflow analysis integrating data from Dask and MLFlow in a real distributed deep learning use case for materials science that runs on multiple environments with up to 276 GPUs in parallel. We show near-zero overhead running up to 100,000 tasks on 1,680 CPU cores on the Summit supercomputer.
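
To make the observability idea concrete, here is a minimal, hypothetical Python sketch of an adapter that intercepts task events in the background and appends unified provenance records to a single store ready for steering queries. The event fields and store schema are assumptions; MIDA's actual adapters for Dask and MLflow are considerably more involved.

```python
import sqlite3, json, time, threading

class ProvenanceStore:
    """Unified store for multi-workflow provenance records (illustrative)."""
    def __init__(self, path="mida_demo.db"):
        self.conn = sqlite3.connect(path, check_same_thread=False)
        self.lock = threading.Lock()
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS events"
            " (ts REAL, workflow TEXT, task TEXT, kind TEXT, payload TEXT)")

    def record(self, workflow, task, kind, payload):
        with self.lock:  # adapters may report from background threads
            self.conn.execute("INSERT INTO events VALUES (?,?,?,?,?)",
                              (time.time(), workflow, task, kind,
                               json.dumps(payload)))
            self.conn.commit()

def observe(store, workflow):
    """Decorator that observes a task's inputs/outputs without changing it."""
    def wrap(fn):
        def inner(*args, **kwargs):
            store.record(workflow, fn.__name__, "start", {"args": repr(args)})
            out = fn(*args, **kwargs)
            store.record(workflow, fn.__name__, "end", {"output": repr(out)})
            return out
        return inner
    return wrap

store = ProvenanceStore()

@observe(store, workflow="training")
def train_step(batch):          # stand-in for a real workflow task
    return sum(batch) / len(batch)

train_step([1, 2, 3])
```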

An Extended Convergence Result for Behaviour Tree Controllers

  • paper_url: http://arxiv.org/abs/2308.08994
  • repo_url: None
  • paper_authors: Christopher Iliffe Sprague, Petter Ögren
  • for: This paper studies the convergence of behavior trees (BTs), i.e., whether BT-controlled systems reach a desired part of the state space.
  • methods: It exploits the tree structure of BTs, which modularly composes low-level control policies into hierarchical hybrid control policies.
  • results: The paper generalizes earlier BT convergence results and covers new cases of cyclic switching not previously addressed in the literature.
    Abstract Behavior trees (BTs) are an optimally modular framework to assemble hierarchical hybrid control policies from a set of low-level control policies using a tree structure. Many robotic tasks are naturally decomposed into a hierarchy of control tasks, and modularity is a well-known tool for handling complexity; therefore, behavior trees have garnered widespread usage in the robotics community. In this paper, we study the convergence of BTs, in the sense of reaching a desired part of the state space. Earlier results on BT convergence were often tailored to specific families of BTs, created using different design principles. The results of this paper generalize the earlier results and also include new cases of cyclic switching not covered in the literature.
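
For readers unfamiliar with the formalism, the sketch below implements the two classical BT composition nodes (Sequence and Fallback) ticked over leaf actions. It is a generic illustration of how low-level policies compose modularly, not the paper's convergence machinery.

```python
from enum import Enum

class Status(Enum):
    SUCCESS, FAILURE, RUNNING = 1, 2, 3

class Sequence:
    """Ticks children in order; fails fast, succeeds only if all succeed."""
    def __init__(self, *children): self.children = children
    def tick(self, state):
        for c in self.children:
            s = c.tick(state)
            if s != Status.SUCCESS:
                return s
        return Status.SUCCESS

class Fallback:
    """Ticks children in order; succeeds fast, fails only if all fail."""
    def __init__(self, *children): self.children = children
    def tick(self, state):
        for c in self.children:
            s = c.tick(state)
            if s != Status.FAILURE:
                return s
        return Status.FAILURE

class Leaf:
    def __init__(self, fn): self.fn = fn
    def tick(self, state): return self.fn(state)

# Toy task: "be at goal" OR ("path is clear" AND "move toward goal").
at_goal = Leaf(lambda s: Status.SUCCESS if s["pos"] == s["goal"] else Status.FAILURE)
path_ok = Leaf(lambda s: Status.SUCCESS if not s["blocked"] else Status.FAILURE)
move    = Leaf(lambda s: (s.update(pos=s["pos"] + 1) or Status.RUNNING))
tree = Fallback(at_goal, Sequence(path_ok, move))

state = {"pos": 0, "goal": 3, "blocked": False}
while tree.tick(state) != Status.SUCCESS:
    pass
print(state["pos"])  # 3: the tree converged to the desired region
```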

KnowledGPT: Enhancing Large Language Models with Retrieval and Storage Access on Knowledge Bases

  • paper_url: http://arxiv.org/abs/2308.11761
  • repo_url: None
  • paper_authors: Xintao Wang, Qianwen Yang, Yongting Qiu, Jiaqing Liang, Qianyu He, Zhouhong Gu, Yanghua Xiao, Wei Wang
  • for: The paper explores how to integrate large language models (LLMs) with various knowledge bases (KBs) to improve completeness, timeliness, faithfulness, and adaptability.
  • methods: It proposes KnowledGPT, a comprehensive framework that bridges LLMs with KBs in two ways: program-of-thought prompting that generates KB search code from pre-defined functions for retrieval, and a personalized knowledge base (PKB) for storing knowledge tailored to individual user demands.
  • results: Extensive experiments show that, by integrating LLMs with KBs, KnowledGPT properly answers a broader range of questions requiring world knowledge than vanilla LLMs, using both knowledge in widely-known KBs and knowledge extracted into personalized KBs.
    Abstract Large language models (LLMs) have demonstrated impressive impact in the field of natural language processing, but they still struggle with several issues, such as completeness, timeliness, faithfulness and adaptability. While recent efforts have focused on connecting LLMs with external knowledge sources, the integration of knowledge bases (KBs) remains understudied and faces several challenges. In this paper, we introduce KnowledGPT, a comprehensive framework to bridge LLMs with various knowledge bases, facilitating both the retrieval and storage of knowledge. The retrieval process employs the program of thought prompting, which generates search language for KBs in code format with pre-defined functions for KB operations. Besides retrieval, KnowledGPT offers the capability to store knowledge in a personalized KB, catering to individual user demands. With extensive experiments, we show that by integrating LLMs with KBs, KnowledGPT properly answers a broader range of questions requiring world knowledge compared with vanilla LLMs, utilizing both knowledge existing in widely-known KBs and extracted into personalized KBs.
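
The retrieval idea can be illustrated with a tiny sketch: the LLM is prompted to emit code over a small set of pre-defined KB functions, and the host executes that code against the KB. The function names, prompt, and KB contents below are hypothetical stand-ins, not KnowledGPT's actual interface.

```python
# Pre-defined KB operations the generated program may call (hypothetical API).
KB = {("Eiffel Tower", "located_in"): "Paris",
      ("Paris", "capital_of"): "France"}

def get_entity(mention):
    # In a real system: entity linking against the KB; here an identity stub.
    return mention

def get_relation(head, relation):
    return KB.get((head, relation))

PROMPT_TEMPLATE = """Answer by writing Python using only:
get_entity(mention), get_relation(head, relation).
Question: {question}
Program:"""

def answer(question, llm):
    program = llm(PROMPT_TEMPLATE.format(question=question))
    scope = {"get_entity": get_entity, "get_relation": get_relation}
    exec(program, scope)            # the generated program must set `result`
    return scope.get("result")

# A canned "LLM" standing in for the real model, to keep the sketch runnable.
def fake_llm(prompt):
    return ("e = get_entity('Eiffel Tower')\n"
            "city = get_relation(e, 'located_in')\n"
            "result = get_relation(city, 'capital_of')\n")

print(answer("Which country is the Eiffel Tower in?", fake_llm))  # France
```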

Equitable Restless Multi-Armed Bandits: A General Framework Inspired By Digital Health

  • paper_url: http://arxiv.org/abs/2308.09726
  • repo_url: https://github.com/google-research/socialgood
  • paper_authors: Jackson A. Killian, Manish Jain, Yugang Jia, Jonathan Amar, Erich Huang, Milind Tambe
  • for: The paper studies, for the first time, equitable objectives for restless multi-armed bandits (ERMABs), motivated by high-stakes settings such as digital health.
  • methods: It considers two equity-aligned objectives from the fairness literature, minimax reward and max Nash welfare, and develops a water filling algorithm for the former and a greedy algorithm with theoretically motivated nuance that balances disparate group sizes for the latter.
  • results: Across three simulation domains, including a new digital health model, the proposed approaches are multiple times more equitable than the current state of the art without drastic sacrifices to utility, underscoring the work's urgency as RMABs permeate systems that impact human and wildlife outcomes.
    Abstract Restless multi-armed bandits (RMABs) are a popular framework for algorithmic decision making in sequential settings with limited resources. RMABs are increasingly being used for sensitive decisions such as in public health, treatment scheduling, anti-poaching, and -- the motivation for this work -- digital health. For such high stakes settings, decisions must both improve outcomes and prevent disparities between groups (e.g., ensure health equity). We study equitable objectives for RMABs (ERMABs) for the first time. We consider two equity-aligned objectives from the fairness literature, minimax reward and max Nash welfare. We develop efficient algorithms for solving each -- a water filling algorithm for the former, and a greedy algorithm with theoretically motivated nuance to balance disparate group sizes for the latter. Finally, we demonstrate across three simulation domains, including a new digital health model, that our approaches can be multiple times more equitable than the current state of the art without drastic sacrifices to utility. Our findings underscore our work's urgency as RMABs permeate into systems that impact human and wildlife outcomes. Code is available at https://github.com/google-research/socialgood/tree/equitable-rmab
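
The minimax objective naturally suggests a water filling scheme: repeatedly give the next unit of budget to the group whose expected reward is currently lowest. The sketch below shows this greedy water filling idea on a toy allocation problem; the linear reward model is an invented stand-in, since the paper's algorithm operates on RMAB value functions rather than this simple proxy.

```python
import heapq

def water_fill(base_reward, gain_per_unit, budget):
    """Allocate integer budget units to raise the minimum group reward.

    base_reward[g]: reward group g gets with no budget.
    gain_per_unit[g]: marginal reward per unit of budget for group g.
    """
    alloc = [0] * len(base_reward)
    # Min-heap keyed by each group's current reward.
    heap = [(r, g) for g, r in enumerate(base_reward)]
    heapq.heapify(heap)
    for _ in range(budget):
        r, g = heapq.heappop(heap)       # currently worst-off group
        alloc[g] += 1
        heapq.heappush(heap, (r + gain_per_unit[g], g))
    return alloc

# Three groups; group 0 starts worst-off, so it receives most of the budget.
print(water_fill(base_reward=[0.2, 0.5, 0.6],
                 gain_per_unit=[0.05, 0.05, 0.05],
                 budget=10))
```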

A Dual-Perspective Approach to Evaluating Feature Attribution Methods

  • paper_url: http://arxiv.org/abs/2308.08949
  • repo_url: None
  • paper_authors: Yawei Li, Yang Zhang, Kenji Kawaguchi, Ashkan Khakzar, Bernd Bischl, Mina Rezaei
  • for: The paper aims to provide a principled framework for evaluating feature attribution methods, which explain neural network predictions by identifying the features that drive them.
  • methods: It proposes two new perspectives within the faithfulness paradigm: soundness, which assesses whether attributed features are truly predictive, and completeness, which assesses how well the attribution reveals all predictive features; both rest on a firm mathematical foundation and yield quantitative metrics computable by efficient algorithms.
  • results: Applying these metrics to mainstream attribution methods exposes shortcomings of existing faithfulness evaluations and offers a novel lens for analyzing and comparing feature attribution methods.
    Abstract Feature attribution methods attempt to explain neural network predictions by identifying relevant features. However, establishing a cohesive framework for assessing feature attribution remains a challenge. There are several views through which we can evaluate attributions. One principal lens is to observe the effect of perturbing attributed features on the model's behavior (i.e., faithfulness). While providing useful insights, existing faithfulness evaluations suffer from shortcomings that we reveal in this paper. In this work, we propose two new perspectives within the faithfulness paradigm that reveal intuitive properties: soundness and completeness. Soundness assesses the degree to which attributed features are truly predictive features, while completeness examines how well the resulting attribution reveals all the predictive features. The two perspectives are based on a firm mathematical foundation and provide quantitative metrics that are computable through efficient algorithms. We apply these metrics to mainstream attribution methods, offering a novel lens through which to analyze and compare feature attribution methods.
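
As rough intuition for the two perspectives, the sketch below runs deletion- and insertion-style perturbation checks on an attribution vector: masking top-attributed features should change the prediction if the attribution is sound, and keeping only the attributed features should preserve it if the attribution is complete. These are simplified proxies written purely for illustration; the paper defines its soundness and completeness metrics formally, and they are not identical to this sketch.

```python
import numpy as np

def soundness_proxy(model, x, attribution, k, baseline=0.0):
    """Drop in prediction when the top-k attributed features are masked.
    A larger drop suggests the attributed features were truly predictive."""
    top = np.argsort(attribution)[-k:]
    x_masked = x.copy()
    x_masked[top] = baseline
    return model(x) - model(x_masked)

def completeness_proxy(model, x, attribution, k, baseline=0.0):
    """Prediction retained when only the top-k attributed features are kept.
    A small gap suggests the attribution captured all predictive features."""
    top = np.argsort(attribution)[-k:]
    x_only = np.full_like(x, baseline)
    x_only[top] = x[top]
    return model(x) - model(x_only)

# Toy linear model: features 0 and 2 matter, feature 1 does not.
w = np.array([2.0, 0.0, -1.0])
model = lambda x: float(w @ x)
x = np.array([1.0, 1.0, 1.0])
attr = np.abs(w * x)                    # a simple attribution for the demo

print(soundness_proxy(model, x, attr, k=2))     # large: masked features mattered
print(completeness_proxy(model, x, attr, k=2))  # ~0: kept features suffice
```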

Predicting Crop Yield With Machine Learning: An Extensive Analysis Of Input Modalities And Models On a Field and sub-field Level

  • paper_url: http://arxiv.org/abs/2308.08948
  • repo_url: None
  • paper_authors: Deepak Pathak, Miro Miranda, Francisco Mena, Cristhian Sanchez, Patrick Helber, Benjamin Bischke, Peter Habelitz, Hiba Najjar, Jayanth Siddamsetty, Diego Arenas, Michaela Vollmer, Marcela Charfuelan, Marlon Nuske, Andreas Dengel
  • for: The study develops a simple yet effective early fusion method for crop yield prediction at the field and sub-field level.
  • methods: High-resolution crop yield maps serve as ground truth for training crop- and model-agnostic machine learning methods, with Sentinel-2 satellite imagery as the primary input modality fused with complementary weather, soil, and DEM data.
  • results: The best-performing combination of input modalities varies with region, crop, and chosen model, underscoring the importance of input modality selection for crop yield prediction.
    Abstract We introduce a simple yet effective early fusion method for crop yield prediction that handles multiple input modalities with different temporal and spatial resolutions. We use high-resolution crop yield maps as ground truth data to train crop and machine learning model agnostic methods at the sub-field level. We use Sentinel-2 satellite imagery as the primary modality for input data with other complementary modalities, including weather, soil, and DEM data. The proposed method uses input modalities available with global coverage, making the framework globally scalable. We explicitly highlight the importance of input modalities for crop yield prediction and emphasize that the best-performing combination of input modalities depends on region, crop, and chosen model.
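
Early fusion here simply means concatenating per-location features from all modalities before a single model. The sketch below shows that pattern with synthetic stand-ins for Sentinel-2, weather, soil, and DEM features; the feature dimensions, synthetic target, and choice of regressor are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500                                   # sub-field grid cells (synthetic)
s2      = rng.normal(size=(n, 10))        # Sentinel-2 band statistics
weather = rng.normal(size=(n, 5))         # temperature, precipitation, ...
soil    = rng.normal(size=(n, 4))         # soil properties
dem     = rng.normal(size=(n, 2))         # elevation, slope
# Synthetic yield depending on all modalities, plus noise.
y = (s2[:, 0] + 0.5 * weather[:, 1] + 0.3 * soil[:, 0] - 0.2 * dem[:, 1]
     + rng.normal(scale=0.1, size=n))

# Early fusion: concatenate modalities into one feature vector per cell.
X = np.concatenate([s2, weather, soil, dem], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out cells:", round(model.score(X_te, y_te), 3))
```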

Interpretable Graph Neural Networks for Tabular Data

  • paper_url: http://arxiv.org/abs/2308.08945
  • repo_url: None
  • paper_authors: Amr Alkhatib, Sofiane Ennadir, Henrik Boström, Michalis Vazirgiannis
  • for: The paper proposes IGNNet, an interpretable graph neural network for tabular data that captures interactions among the input features.
  • methods: IGNNet constrains the learning algorithm to produce an interpretable model that shows exactly how the predictions are computed from the original input features.
  • results: Experiments show that IGNNet performs on par with state-of-the-art tabular learners such as XGBoost, Random Forests, and TabNet, while its explanations align with the true Shapley values of the features without additional computational overhead.
    Abstract Data in tabular format is frequently occurring in real-world applications. Graph Neural Networks (GNNs) have recently been extended to effectively handle such data, allowing feature interactions to be captured through representation learning. However, these approaches essentially produce black-box models, in the form of deep neural networks, precluding users from following the logic behind the model predictions. We propose an approach, called IGNNet (Interpretable Graph Neural Network for tabular data), which constrains the learning algorithm to produce an interpretable model, where the model shows how the predictions are exactly computed from the original input features. A large-scale empirical investigation is presented, showing that IGNNet is performing on par with state-of-the-art machine-learning algorithms that target tabular data, including XGBoost, Random Forests, and TabNet. At the same time, the results show that the explanations obtained from IGNNet are aligned with the true Shapley values of the features without incurring any additional computational overhead.
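
One way to obtain the kind of transparency IGNNet targets is to force the final prediction to be an explicit sum of per-feature contributions, so each input feature's effect can be read off directly. The PyTorch sketch below shows that constraint on a learnable feature graph with one message-passing step; it illustrates the interpretability constraint in spirit and is not the IGNNet architecture itself.

```python
import torch
import torch.nn as nn

class InterpretableFeatureNet(nn.Module):
    """Each feature keeps its own embedding; the prediction is a transparent
    sum of per-feature scores (illustrative sketch, not IGNNet itself)."""

    def __init__(self, n_features, dim=16):
        super().__init__()
        self.embed = nn.Linear(1, dim)            # feature value -> embedding
        self.adj = nn.Parameter(torch.eye(n_features))  # learnable feature graph
        self.score = nn.Linear(dim, 1)            # per-feature contribution

    def forward(self, x):                         # x: (batch, n_features)
        h = self.embed(x.unsqueeze(-1))           # (batch, n_features, dim)
        h = torch.relu(torch.einsum("ij,bjd->bid", self.adj, h))  # message passing
        contrib = self.score(h).squeeze(-1)       # (batch, n_features)
        return contrib.sum(dim=1), contrib        # prediction + per-feature terms

net = InterpretableFeatureNet(n_features=5)
x = torch.randn(3, 5)
y_hat, contributions = net(x)
print(contributions[0])   # how much each feature contributed to sample 0
```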

Towards Automatically Addressing Self-Admitted Technical Debt: How Far Are We?

  • paper_url: http://arxiv.org/abs/2308.08943
  • repo_url: https://github.com/antonio-mastropaolo/satd-removal
  • paper_authors: Antonio Mastropaolo, Massimiliano Di Penta, Gabriele Bavota
  • for: The paper investigates to what extent self-admitted technical debt (SATD) can be automatically paid back by neural-based generative models.
  • methods: It mines 5,039 SATD removals from 595 open-source projects and experiments with seven generative deep learning (DL) model configurations, comparing transformers pre-trained and fine-tuned with different combinations of training objectives, as well as an LLM-based chatbot.
  • results: The best model automatically fixes ~2% to 8% of test instances, depending on the number of attempts allowed; pre-training proves essential given the small fine-tuning set, and performance drops when the SATD comment is not provided as input.
    Abstract Upon evolving their software, organizations and individual developers have to spend a substantial effort to pay back technical debt, i.e., the fact that software is released in a shape not as good as it should be, e.g., in terms of functionality, reliability, or maintainability. This paper empirically investigates the extent to which technical debt can be automatically paid back by neural-based generative models, and in particular models exploiting different strategies for pre-training and fine-tuning. We start by extracting a dataset of 5,039 Self-Admitted Technical Debt (SATD) removals from 595 open-source projects. SATD refers to technical debt instances documented (e.g., via code comments) by developers. We use this dataset to experiment with seven different generative deep learning (DL) model configurations. Specifically, we compare transformers pre-trained and fine-tuned with different combinations of training objectives, including the fixing of generic code changes, SATD removals, and SATD-comment prompt tuning. Also, we investigate the applicability in this context of a recently-available Large Language Model (LLM)-based chat bot. Results of our study indicate that the automated repayment of SATD is a challenging task, with the best model we experimented with able to automatically fix ~2% to 8% of test instances, depending on the number of attempts it is allowed to make. Given the limited size of the fine-tuning dataset (~5k instances), the model's pre-training plays a fundamental role in boosting performance. Also, the ability to remove SATD steadily drops if the comment documenting the SATD is not provided as input to the model. Finally, we found general-purpose LLMs to not be a competitive approach for addressing SATD.
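
Identifying SATD instances is typically keyword-driven on code comments. The snippet below is a minimal, assumption-laden illustration of that first step (the keyword list and comment syntax are simplified); the paper's dataset was built from actual SATD removals in commit histories, which requires diffing file versions rather than just matching comments.

```python
import re

# A few common SATD markers (simplified; real SATD taxonomies are richer).
SATD_PATTERN = re.compile(
    r"(?://|#)\s*(TODO|FIXME|HACK|XXX|WORKAROUND)\b(.*)", re.IGNORECASE)

def find_satd(source: str):
    """Return (line_number, marker, comment_text) for candidate SATD comments."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        m = SATD_PATTERN.search(line)
        if m:
            hits.append((lineno, m.group(1).upper(), m.group(2).strip()))
    return hits

code = """\
int cache = 0; // TODO replace this temporary cache with an LRU
compute();     // HACK works around a race condition in the scheduler
"""
for hit in find_satd(code):
    print(hit)
```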

A White-Box False Positive Adversarial Attack Method on Contrastive Loss-Based Offline Handwritten Signature Verification Models

  • paper_url: http://arxiv.org/abs/2308.08925
  • repo_url: None
  • paper_authors: Zhongliang Guo, Yifei Qian, Ognjen Arandjelović, Lei Fang
  • for: The paper tackles white-box false positive adversarial attacks on contrastive-loss-based offline handwritten signature verification models.
  • methods: It proposes a novel attack that treats the attack as a style transfer between closely related but distinct writing styles, guided by two new loss functions that perturb the Euclidean distance between the embedding vectors of the original and synthesized samples while keeping the generated image close to the original.
  • results: The method achieves state-of-the-art performance in white-box false positive attacks on contrastive-loss-based offline handwritten signature verification models, as demonstrated experimentally.
    Abstract In this paper, we tackle the challenge of white-box false positive adversarial attacks on contrastive loss-based offline handwritten signature verification models. We propose a novel attack method that treats the attack as a style transfer between closely related but distinct writing styles. To guide the generation of deceptive images, we introduce two new loss functions that enhance the attack success rate by perturbing the Euclidean distance between the embedding vectors of the original and synthesized samples, while ensuring minimal perturbations by reducing the difference between the generated image and the original image. Our method demonstrates state-of-the-art performance in white-box attacks on contrastive loss-based offline handwritten signature verification models, as evidenced by our experiments. The key contributions of this paper include a novel false positive attack method, two new loss functions, effective style transfer in handwriting styles, and superior performance in white-box false positive attacks compared to other white-box attack methods.
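
The two-loss structure can be sketched directly: one term pulls the synthesized forgery's embedding toward the genuine-signature embedding (so the verifier accepts the pair), while the other penalizes pixel changes to the starting image. The PyTorch code below is a generic rendition under an assumed frozen embedding network f; the paper's concrete loss formulations and optimization details differ.

```python
import torch

def attack_step(f, x_adv, x_start, genuine_emb, lam=10.0, lr=0.01):
    """One white-box update on the adversarial image x_adv (requires_grad=True).

    f: embedding network of the verifier (frozen).
    genuine_emb: embedding of the victim's genuine signature.
    lam: weight of the image-fidelity term.
    """
    emb = f(x_adv)
    # Term 1: shrink embedding distance so the pair is verified as a match.
    match_loss = torch.sum((emb - genuine_emb) ** 2)
    # Term 2: keep the forgery visually close to its starting image.
    fidelity_loss = torch.mean((x_adv - x_start) ** 2)
    loss = match_loss + lam * fidelity_loss
    grad, = torch.autograd.grad(loss, x_adv)
    with torch.no_grad():
        x_adv -= lr * grad
        x_adv.clamp_(0.0, 1.0)        # stay a valid image
    return loss.item()

# Toy embedding network standing in for the trained verifier.
f = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 32))
x_start = torch.rand(1, 1, 28, 28)
x_adv = x_start.clone().requires_grad_(True)
genuine_emb = f(torch.rand(1, 1, 28, 28)).detach()
for _ in range(100):
    attack_step(f, x_adv, x_start, genuine_emb)
```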

IMM: An Imitative Reinforcement Learning Approach with Predictive Representation Learning for Automatic Market Making

  • paper_url: http://arxiv.org/abs/2308.08918
  • repo_url: None
  • paper_authors: Hui Niu, Siyuan Li, Jiahao Zheng, Zhouchi Lin, Jian Li, Jian Guo, Bo An
  • for: The study develops a reinforcement-learning-based multi-price-level market maker, Imitative Market Maker (IMM), to improve market liquidity and financial performance.
  • methods: IMM introduces state and action representations that efficiently encode multi-price-level order information, a representation learning unit that captures short- and long-term market trends to mitigate adverse selection risk, and a training scheme that combines RL with imitation of a suboptimal signal-based expert strategy.
  • results: On four real-world market datasets, IMM outperforms state-of-the-art RL-based market making strategies on several financial criteria, and ablation studies confirm the effectiveness of its components.
    Abstract Market making (MM) has attracted significant attention in financial trading owing to its essential function in ensuring market liquidity. With strong capabilities in sequential decision-making, Reinforcement Learning (RL) technology has achieved remarkable success in quantitative trading. Nonetheless, most existing RL-based MM methods focus on optimizing single-price level strategies which fail at frequent order cancellations and loss of queue priority. Strategies involving multiple price levels align better with actual trading scenarios. However, given the complexity that multi-price level strategies involves a comprehensive trading action space, the challenge of effectively training profitable RL agents for MM persists. Inspired by the efficient workflow of professional human market makers, we propose Imitative Market Maker (IMM), a novel RL framework leveraging both knowledge from suboptimal signal-based experts and direct policy interactions to develop multi-price level MM strategies efficiently. The framework starts by introducing effective state and action representations adept at encoding information about multi-price level orders. Furthermore, IMM integrates a representation learning unit capable of capturing both short- and long-term market trends to mitigate adverse selection risk. Subsequently, IMM formulates an expert strategy based on signals and trains the agent through the integration of RL and imitation learning techniques, leading to efficient learning. Extensive experimental results on four real-world market datasets demonstrate that IMM outperforms current RL-based market making strategies in terms of several financial criteria. The findings of the ablation study substantiate the effectiveness of the model components.

Beyond Sharing: Conflict-Aware Multivariate Time Series Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.08915
  • repo_url: https://github.com/dawnvince/mts_cad
  • paper_authors: Haotian Si, Changhua Pei, Zhihan Li, Yadong Zhao, Jingjing Li, Haiming Zhang, Zulong Diao, Jianhui Li, Gaogang Xie, Dan Pei
  • for: The paper proposes CAD, a multi-task-learning-based multivariate time series anomaly detection algorithm that addresses the conflicts among metrics' regression objectives overlooked by existing methods.
  • methods: CAD adopts a conflict-aware structure that gives each metric an exclusive path to mitigate potential conflicts while fostering inter-metric promotion, together with a simple yet effective task-oriented metric selection and a personalized-and-shared (p&s) gating mechanism.
  • results: CAD achieves an average F1-score of 0.943 across three public datasets, notably outperforming state-of-the-art methods.
    Abstract Massive key performance indicators (KPIs) are monitored as multivariate time series data (MTS) to ensure the reliability of the software applications and service system. Accurately detecting the abnormality of MTS is very critical for subsequent fault elimination. The scarcity of anomalies and manual labeling has led to the development of various self-supervised MTS anomaly detection (AD) methods, which optimize an overall objective/loss encompassing all metrics' regression objectives/losses. However, our empirical study uncovers the prevalence of conflicts among metrics' regression objectives, causing MTS models to grapple with different losses. This critical aspect significantly impacts detection performance but has been overlooked in existing approaches. To address this problem, by mimicking the design of multi-gate mixture-of-experts (MMoE), we introduce CAD, a Conflict-aware multivariate KPI Anomaly Detection algorithm. CAD offers an exclusive structure for each metric to mitigate potential conflicts while fostering inter-metric promotions. Upon thorough investigation, we find that the poor performance of vanilla MMoE mainly comes from the input-output misalignment settings of MTS formulation and convergence issues arising from expansive tasks. To address these challenges, we propose a straightforward yet effective task-oriented metric selection and p&s (personalized and shared) gating mechanism, which establishes CAD as the first practicable multi-task learning (MTL) based MTS AD model. Evaluations on multiple public datasets reveal that CAD obtains an average F1-score of 0.943 across three public datasets, notably outperforming state-of-the-art methods. Our code is accessible at https://github.com/dawnvince/MTS_CAD.
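
The conflict-aware idea can be sketched as an MMoE-style layer in which every metric owns a personalized expert (its exclusive structure) and additionally draws on shared experts through a learned gate. The PyTorch sketch below shows this personalized-and-shared gating pattern in generic form; the dimensions and expert counts are invented, and CAD's full model adds task-oriented metric selection and the detection pipeline.

```python
import torch
import torch.nn as nn

class PSGating(nn.Module):
    """Personalized-and-shared experts: metric k combines its own expert with
    shared experts via a softmax gate (illustrative sketch, not full CAD)."""

    def __init__(self, n_metrics, d_in, d_hidden, n_shared=3):
        super().__init__()
        self.personal = nn.ModuleList(
            [nn.Linear(d_in, d_hidden) for _ in range(n_metrics)])
        self.shared = nn.ModuleList(
            [nn.Linear(d_in, d_hidden) for _ in range(n_shared)])
        # One gate per metric over (1 personal + n_shared) experts.
        self.gates = nn.ModuleList(
            [nn.Linear(d_in, 1 + n_shared) for _ in range(n_metrics)])
        self.heads = nn.ModuleList(
            [nn.Linear(d_hidden, 1) for _ in range(n_metrics)])

    def forward(self, x):                    # x: (batch, d_in)
        shared_out = torch.stack([e(x) for e in self.shared], dim=1)
        preds = []
        for pers, gate, head in zip(self.personal, self.gates, self.heads):
            experts = torch.cat([pers(x).unsqueeze(1), shared_out], dim=1)
            w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)
            mixed = (w * experts).sum(dim=1)     # gated mixture for this metric
            preds.append(head(torch.relu(mixed)))
        return torch.cat(preds, dim=-1)          # one regression per metric

model = PSGating(n_metrics=4, d_in=16, d_hidden=32)
print(model(torch.randn(8, 16)).shape)           # torch.Size([8, 4])
```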

MoCLIM: Towards Accurate Cancer Subtyping via Multi-Omics Contrastive Learning with Omics-Inference Modeling

  • paper_url: http://arxiv.org/abs/2308.09725
  • repo_url: None
  • paper_authors: Ziwei Yang, Zheng Chen, Yasuko Matsubara, Yasushi Sakurai
  • for: The study advances precision medicine by exploiting multi-omics data to improve cancer subtyping.
  • methods: It proposes MoCLIM, a representation learning framework that independently extracts informative features from distinct omics modalities and uses contrastive learning across modalities to obtain a unified representation in which cancer subtypes cluster well in a lower latent space.
  • results: On six cancer datasets, MoCLIM significantly improves data fit and subtyping performance with fewer high-dimensional cancer instances, and its final medical-evaluation component provides high interpretability for medical analysis.
    Abstract Precision medicine fundamentally aims to establish causality between dysregulated biochemical mechanisms and cancer subtypes. Omics-based cancer subtyping has emerged as a revolutionary approach, as different level of omics records the biochemical products of multistep processes in cancers. This paper focuses on fully exploiting the potential of multi-omics data to improve cancer subtyping outcomes, and hence develops MoCLIM, a representation learning framework. MoCLIM independently extracts the informative features from distinct omics modalities. Using a unified representation informed by contrastive learning of different omics modalities, we can well-cluster the subtypes, given cancer, into a lower latent space. This contrast can be interpreted as a projection of inter-omics inference observed in biological networks. Experimental results on six cancer datasets demonstrate that our approach significantly improves data fit and subtyping performance in fewer high-dimensional cancer instances. Moreover, our framework incorporates various medical evaluations as the final component, providing high interpretability in medical analysis.
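
The cross-omics contrast can be illustrated with a standard InfoNCE loss between two omics views of the same patients: matched pairs are pulled together and mismatched patients pushed apart in the shared latent space. The sketch below is generic contrastive-learning code under invented encoders and dimensions, not MoCLIM's actual architecture.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE between two omics views; row i of z1 and z2 is the same patient."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(z1.size(0))          # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Invented per-modality encoders (e.g., gene expression and methylation).
enc_expr = torch.nn.Sequential(torch.nn.Linear(2000, 128), torch.nn.ReLU(),
                               torch.nn.Linear(128, 64))
enc_meth = torch.nn.Sequential(torch.nn.Linear(5000, 128), torch.nn.ReLU(),
                               torch.nn.Linear(128, 64))

expr = torch.randn(32, 2000)                    # a batch of 32 patients
meth = torch.randn(32, 5000)
loss = info_nce(enc_expr(expr), enc_meth(meth))
loss.backward()                                  # train both encoders jointly
print(float(loss))
```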

Building Emotional Support Chatbots in the Era of LLMs

  • paper_url: http://arxiv.org/abs/2308.11584
  • repo_url: None
  • paper_authors: Zhonghua Zheng, Lizi Liao, Yang Deng, Liqiang Nie
  • for: The paper aims to curate an emotional support dialogue dataset with large language models (LLMs) to advance practical emotional support chatbots.
  • methods: Starting from a meticulously designed set of seed dialogues spanning diverse scenarios, it recursively generates the ExTensible Emotional Support dialogue dataset (ExTES) using ChatGPT's in-context learning, and then applies advanced tuning techniques to the LLaMA model, examining the impact of diverse training strategies.
  • results: An exhaustive assessment shows the resulting model is proficient at offering emotional support, marking a pivotal step for emotional support bots and paving the way for subsequent research and applications.
    Abstract The integration of emotional support into various conversational scenarios presents profound societal benefits, such as social interactions, mental health counseling, and customer service. However, there are unsolved challenges that hinder real-world applications in this field, including limited data availability and the absence of well-accepted model training paradigms. This work endeavors to navigate these challenges by harnessing the capabilities of Large Language Models (LLMs). We introduce an innovative methodology that synthesizes human insights with the computational prowess of LLMs to curate an extensive emotional support dialogue dataset. Our approach is initiated with a meticulously designed set of dialogues spanning diverse scenarios as generative seeds. By utilizing the in-context learning potential of ChatGPT, we recursively generate an ExTensible Emotional Support dialogue dataset, named ExTES. Following this, we deploy advanced tuning techniques on the LLaMA model, examining the impact of diverse training strategies, ultimately yielding an LLM meticulously optimized for emotional support interactions. An exhaustive assessment of the resultant model showcases its proficiency in offering emotional support, marking a pivotal step in the realm of emotional support bots and paving the way for subsequent research and implementations.

Towards a Practical Defense against Adversarial Attacks on Deep Learning-based Malware Detectors via Randomized Smoothing

  • paper_url: http://arxiv.org/abs/2308.08906
  • repo_url: None
  • paper_authors: Daniel Gibert, Giulio Zizzo, Quan Le
  • for: Defending deep-learning-based malware detectors against adversarial malware examples.
  • methods: The paper proposes a randomized ablation-based smoothing scheme that, instead of adding Gaussian or Laplace noise, ablates a percentage of the bytes within an executable; a base classifier is trained on ablated versions, and the test-time prediction is the class most commonly predicted over a set of ablated versions of the input.
  • results: On the BODMAS dataset, the smoothed model shows greater robustness and generalization against state-of-the-art evasion attacks than a non-smoothed classifier.
    Abstract Malware detectors based on deep learning (DL) have been shown to be susceptible to malware examples that have been deliberately manipulated in order to evade detection, a.k.a. adversarial malware examples. More specifically, it has been shown that deep learning detectors are vulnerable to small changes on the input file. Given this vulnerability of deep learning detectors, we propose a practical defense against adversarial malware examples inspired by randomized smoothing. In our work, instead of employing Gaussian or Laplace noise when randomizing inputs, we propose a randomized ablation-based smoothing scheme that ablates a percentage of the bytes within an executable. During training, our randomized ablation-based smoothing scheme trains a base classifier based on ablated versions of the executable files. At test time, the final classification for a given input executable is taken as the class most commonly predicted by the classifier on a set of ablated versions of the original executable. To demonstrate the suitability of our approach we have empirically evaluated the proposed ablation-based model against various state-of-the-art evasion attacks on the BODMAS dataset. Results show greater robustness and generalization capabilities to adversarial malware examples in comparison to a non-smoothed classifier.
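
The test-time procedure is essentially a majority vote over randomly ablated copies of the input bytes. Below is a small, self-contained sketch of that vote; the base classifier is a stub and the ablation sentinel is an arbitrary choice, so this only illustrates the mechanism, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
ABLATE = -1                      # sentinel for an ablated byte position

def ablate(bytes_arr, p):
    """Randomly replace a fraction p of byte positions with the sentinel."""
    out = bytes_arr.copy()
    mask = rng.random(out.shape) < p
    out[mask] = ABLATE
    return out

def smoothed_predict(classify, bytes_arr, p=0.3, n_votes=51):
    """Majority vote of the base classifier over n ablated versions."""
    votes = [classify(ablate(bytes_arr, p)) for _ in range(n_votes)]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]

# Stub base classifier: flags files with many high bytes as "malware" (1).
def classify(x):
    visible = x[x != ABLATE]
    return int(np.mean(visible > 200) > 0.5) if visible.size else 0

sample = rng.integers(0, 256, size=4096)
print(smoothed_predict(classify, sample))
```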

Development of a Knowledge Graph Embeddings Model for Pain

  • paper_url: http://arxiv.org/abs/2308.08904
  • repo_url: None
  • paper_authors: Jaya Chaturvedi, Tao Wang, Sumithra Velupillai, Robert Stewart, Angus Roberts
  • for: The paper aims to construct knowledge graph embedding models of pain concepts extracted from mental health electronic health records, combined with external knowledge from SNOMED CT, and to evaluate their performance on a subject-object link prediction task.
  • methods: Knowledge graph embedding models represent pain concepts in a low-dimensional vector space; the graph is enriched with external relations from SNOMED CT, and the models are evaluated on subject-object link prediction.
  • results: Compared against baseline models on the link prediction task, the knowledge graph embedding models outperform the baselines, demonstrating their effectiveness in capturing the complex relationships between pain concepts.
    Abstract Pain is a complex concept that can interconnect with other concepts such as a disorder that might cause pain, a medication that might relieve pain, and so on. To fully understand the context of pain experienced by either an individual or across a population, we may need to examine all concepts related to pain and the relationships between them. This is especially useful when modeling pain that has been recorded in electronic health records. Knowledge graphs represent concepts and their relations by an interlinked network, enabling semantic and context-based reasoning in a computationally tractable form. These graphs can, however, be too large for efficient computation. Knowledge graph embeddings help to resolve this by representing the graphs in a low-dimensional vector space. These embeddings can then be used in various downstream tasks such as classification and link prediction. The various relations associated with pain which are required to construct such a knowledge graph can be obtained from external medical knowledge bases such as SNOMED CT, a hierarchical systematic nomenclature of medical terms. A knowledge graph built in this way could be further enriched with real-world examples of pain and its relations extracted from electronic health records. This paper describes the construction of such knowledge graph embedding models of pain concepts, extracted from the unstructured text of mental health electronic health records, combined with external knowledge created from relations described in SNOMED CT, and their evaluation on a subject-object link prediction task. The performance of the models was compared with other baseline models.
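
Knowledge graph embeddings score candidate (subject, relation, object) triples in the learned vector space, and link prediction ranks objects by that score. The sketch below uses the classic TransE scoring function on random embeddings purely to show the mechanics; the paper compares several embedding models on the pain graph, and this is not its specific model or data.

```python
import numpy as np

rng = np.random.default_rng(0)
entities = ["pain", "headache", "paracetamol", "migraine"]
relations = ["is_a", "treated_by"]
dim = 32
E = {e: rng.normal(size=dim) for e in entities}   # entity embeddings
R = {r: rng.normal(size=dim) for r in relations}  # relation embeddings

def transe_score(subj, rel, obj):
    """TransE: a true triple should satisfy subj + rel ~ obj (small distance)."""
    return -np.linalg.norm(E[subj] + R[rel] - E[obj])

def predict_object(subj, rel):
    """Rank all entities as candidate objects for (subj, rel, ?)."""
    return sorted(entities, key=lambda o: transe_score(subj, rel, o),
                  reverse=True)

# With trained embeddings, the top-ranked entity is the predicted link.
print(predict_object("headache", "treated_by"))
```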

Model-Free Algorithm with Improved Sample Efficiency for Zero-Sum Markov Games

  • paper_url: http://arxiv.org/abs/2308.08858
  • repo_url: None
  • paper_authors: Songtao Feng, Ming Yin, Yu-Xiang Wang, Jing Yang, Yingbin Liang
  • for: This work studies two-player zero-sum Markov games, focusing on finite-horizon episodic Markov decision processes (MDPs).
  • methods: It proposes a model-free stage-based Q-learning algorithm, extending the reference-advantage variance reduction technique to Markov games through a novel update of optimistic and pessimistic reference value functions whose value difference is the smallest in the history.
  • results: The algorithm finds an $\epsilon$-optimal Nash equilibrium (NE) with sample complexity $O(H^3SAB/\epsilon^2)$, matching the best model-based algorithm and demonstrating for the first time that model-free algorithms can achieve the same optimality in the $H$ dependence as model-based algorithms.
    Abstract The problem of two-player zero-sum Markov games has recently attracted increasing interests in theoretical studies of multi-agent reinforcement learning (RL). In particular, for finite-horizon episodic Markov decision processes (MDPs), it has been shown that model-based algorithms can find an $\epsilon$-optimal Nash Equilibrium (NE) with the sample complexity of $O(H^3SAB/\epsilon^2)$, which is optimal in the dependence of the horizon $H$ and the number of states $S$ (where $A$ and $B$ denote the number of actions of the two players, respectively). However, none of the existing model-free algorithms can achieve such an optimality. In this work, we propose a model-free stage-based Q-learning algorithm and show that it achieves the same sample complexity as the best model-based algorithm, and hence for the first time demonstrate that model-free algorithms can enjoy the same optimality in the $H$ dependence as model-based algorithms. The main improvement of the dependency on $H$ arises by leveraging the popular variance reduction technique based on the reference-advantage decomposition previously used only for single-agent RL. However, such a technique relies on a critical monotonicity property of the value function, which does not hold in Markov games due to the update of the policy via the coarse correlated equilibrium (CCE) oracle. Thus, to extend such a technique to Markov games, our algorithm features a key novel design of updating the reference value functions as the pair of optimistic and pessimistic value functions whose value difference is the smallest in the history in order to achieve the desired improvement in the sample efficiency.

D-IF: Uncertainty-aware Human Digitization via Implicit Distribution Field

  • paper_url: http://arxiv.org/abs/2308.08857
  • repo_url: https://github.com/psyai-net/d-if_release
  • paper_authors: Xueting Yang, Yihao Luo, Yuliang Xiu, Wei Wang, Hao Xu, Zhaoxin Fan
  • for: The paper targets image-based 3D clothed human reconstruction with deep implicit functions, aiming at highly realistic digitized humans.
  • methods: It replaces the deterministic implicit value with an adaptive uncertainty distribution, differentiating points according to their distance to the surface.
  • results: Models trained with the proposed uncertainty distribution loss capture more intricate wrinkles and more realistic limbs, yielding significant improvements over nearly all baselines.
    Abstract Realistic virtual humans play a crucial role in numerous industries, such as metaverse, intelligent healthcare, and self-driving simulation. But creating them on a large scale with high levels of realism remains a challenge. The utilization of deep implicit function sparks a new era of image-based 3D clothed human reconstruction, enabling pixel-aligned shape recovery with fine details. Subsequently, the vast majority of works locate the surface by regressing the deterministic implicit value for each point. However, should all points be treated equally regardless of their proximity to the surface? In this paper, we propose replacing the implicit value with an adaptive uncertainty distribution, to differentiate between points based on their distance to the surface. This simple ``value to distribution'' transition yields significant improvements on nearly all the baselines. Furthermore, qualitative results demonstrate that the models trained using our uncertainty distribution loss, can capture more intricate wrinkles, and realistic limbs. Code and models are available for research purposes at https://github.com/psyai-net/D-IF_release.

BERT4CTR: An Efficient Framework to Combine Pre-trained Language Model with Non-textual Features for CTR Prediction

  • paper_url: http://arxiv.org/abs/2308.11527
  • repo_url: None
  • paper_authors: Dong Wang, Kavé Salamatian, Yunqing Xia, Weiwei Deng, Qi Zhiang
  • for: The study proposes BERT4CTR, a new framework that addresses the challenge of combining pre-trained language models with non-textual features for Click-Through-Rate (CTR) prediction.
  • methods: It uses a Uni-Attention mechanism that lets textual and non-textual inputs interact while keeping training and inference costs low through dimensionality reduction.
  • results: Experiments on both public and commercial data show that BERT4CTR significantly outperforms state-of-the-art frameworks for handling multi-modal inputs in CTR prediction, with low training and inference costs.
    Abstract Although deep pre-trained language models have shown promising benefit in a large set of industrial scenarios, including Click-Through-Rate (CTR) prediction, how to integrate pre-trained language models that handle only textual signals into a prediction pipeline with non-textual features is challenging. Up to now, two directions have been explored to integrate multi-modal inputs in fine-tuning of pre-trained language models. One consists of fusing the outcome of language models and non-textual features through an aggregation layer, resulting in an ensemble framework, where the cross-information between textual and non-textual inputs are only learned in the aggregation layer. The second one consists of splitting non-textual features into fine-grained fragments and transforming the fragments to new tokens combined with textual ones, so that they can be fed directly to transformer layers in language models. However, this approach increases the complexity of the learning and inference because of the numerous additional tokens. To address these limitations, we propose in this work a novel framework BERT4CTR, with the Uni-Attention mechanism that can benefit from the interactions between non-textual and textual features while maintaining low time-costs in training and inference through a dimensionality reduction. Comprehensive experiments on both public and commercial data demonstrate that BERT4CTR can outperform significantly the state-of-the-art frameworks to handle multi-modal inputs and be applicable to CTR prediction.

CMB: A Comprehensive Medical Benchmark in Chinese

  • paper_url: http://arxiv.org/abs/2308.08833
  • repo_url: https://github.com/FreedomIntelligence/CMB
  • paper_authors: Xidong Wang, Guiming Hardy Chen, Dingjie Song, Zhiyi Zhang, Zhihong Chen, Qingying Xiao, Feng Jiang, Jianquan Li, Xiang Wan, Benyou Wang, Haizhou Li
  • for: The paper proposes a localized medical benchmark, rooted in the native Chinese linguistic and cultural framework (including traditional Chinese medicine), for evaluating and developing large language models (LLMs) in medicine.
  • methods: Using the CMB benchmark, it evaluates several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain.
  • results: The evaluation shows that these LLMs perform differently in the medical domain and surfaces region-specific linguistic characteristics; the benchmark is designed as an instrument for self-assessment of model advancements rather than a leaderboard competition.
    Abstract Large Language Models (LLMs) provide a possibility to make a great breakthrough in medicine. The establishment of a standardized medical benchmark becomes a fundamental cornerstone to measure progression. However, medical environments in different regions have their local characteristics, e.g., the ubiquity and significance of traditional Chinese medicine within China. Therefore, merely translating English-based medical evaluation may result in contextual incongruities to a local region. To solve the issue, we propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety. Using this benchmark, we have evaluated several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain. It is worth noting that our benchmark is not devised as a leaderboard competition but as an instrument for self-assessment of model advancements. We hope this benchmark could facilitate the widespread adoption and enhancement of medical LLMs within China. Check details at https://cmedbenchmark.llmzoo.com/.

Lifted Algorithms for Symmetric Weighted First-Order Model Sampling

  • paper_url: http://arxiv.org/abs/2308.08828
  • repo_url: None
  • paper_authors: Yuanhong Wang, Juhua Pu, Yuyi Wang, Ondřej Kuželka
  • for: The paper targets weighted model sampling (WMS) in first-order logic and asks whether WMS can be solved efficiently, as domain-liftable model counting problems can.
  • methods: It devises an efficient sampling algorithm for the two-variable fragment of first-order logic with counting quantifiers that runs in time polynomial in the domain size.
  • results: The paper proves that WMS is domain-liftable under sampling for this fragment, shows the result continues to hold in the presence of cardinality constraints, and demonstrates experimentally that the algorithm outperforms a state-of-the-art WMS sampler by a substantial margin, confirming the theoretical results.
    Abstract Weighted model counting (WMC) is the task of computing the weighted sum of all satisfying assignments (i.e., models) of a propositional formula. Similarly, weighted model sampling (WMS) aims to randomly generate models with probability proportional to their respective weights. Both WMC and WMS are hard to solve exactly, falling under the $\#\mathsf{P}$-hard complexity class. However, it is known that the counting problem may sometimes be tractable, if the propositional formula can be compactly represented and expressed in first-order logic. In such cases, model counting problems can be solved in time polynomial in the domain size, and are known as domain-liftable. The following question then arises: Is it also the case for weighted model sampling? This paper addresses this question and answers it affirmatively. Specifically, we prove the domain-liftability under sampling for the two-variables fragment of first-order logic with counting quantifiers in this paper, by devising an efficient sampling algorithm for this fragment that runs in time polynomial in the domain size. We then further show that this result continues to hold even in the presence of cardinality constraints. To empirically verify our approach, we conduct experiments over various first-order formulas designed for the uniform generation of combinatorial structures and sampling in statistical-relational models. The results demonstrate that our algorithm outperforms a start-of-the-art WMS sampler by a substantial margin, confirming the theoretical results.

Do you really follow me? Adversarial Instructions for Evaluating the Robustness of Large Language Models

  • paper_url: http://arxiv.org/abs/2308.10819
  • repo_url: None
  • paper_authors: Zekun Li, Baolin Peng, Pengcheng He, Xifeng Yan
  • for: This work aims to evaluate the robustness of large language models (LLMs) against adversarial instructions injected by third parties, to ensure their safe deployment in real-world applications.
  • methods: It proposes a benchmark for automatically evaluating LLM robustness, quantifying how much LLMs are influenced by injected adversarial instructions and whether they can differentiate them from original user instructions.
  • results: Experiments with state-of-the-art instruction-following LLMs uncover significant robustness limitations; prevalent instruction-tuned models tend to be overfitted to follow any instruction phrase in the prompt without truly understanding it, highlighting the need to train models to comprehend prompts rather than merely follow instruction phrases and complete the text.
    Abstract Large Language Models (LLMs) have shown remarkable proficiency in following instructions, making them valuable in customer-facing applications. However, their impressive capabilities also raise concerns about the amplification of risks posed by adversarial instructions, which can be injected into the model input by third-party attackers to manipulate LLMs' original instructions and prompt unintended actions and content. Therefore, it is crucial to understand LLMs' ability to accurately discern which instructions to follow to ensure their safe deployment in real-world scenarios. In this paper, we propose a pioneering benchmark for automatically evaluating the robustness of LLMs against adversarial instructions. The objective of this benchmark is to quantify the extent to which LLMs are influenced by injected adversarial instructions and assess their ability to differentiate between these adversarial instructions and original user instructions. Through experiments conducted with state-of-the-art instruction-following LLMs, we uncover significant limitations in their robustness against adversarial instruction attacks. Furthermore, our findings indicate that prevalent instruction-tuned models are prone to being overfitted to follow any instruction phrase in the prompt without truly understanding which instructions should be followed. This highlights the need to address the challenge of training models to comprehend prompts instead of merely following instruction phrases and completing the text.

  • paper_url: http://arxiv.org/abs/2308.08799
  • repo_url: https://github.com/jingxiaoyi/pare
  • paper_authors: Jiazheng Jing, Yinan Zhang, Xin Zhou, Zhiqi Shen
  • for: The study proposes the Popularity-Aware Recommender (PARE), which makes non-personalized recommendations based on item popularity to complement existing recommendation methods.
  • methods: PARE consists of four modules, each focusing on a different aspect: popularity history, temporal impact, periodic impact, and side information; an attention layer fuses the outputs of the four modules.
  • results: Extensive experiments show that PARE performs on par with or even better than sophisticated state-of-the-art recommenders, and that integrating PARE with existing recommendation methods significantly surpasses the performance of standalone models.
    Abstract Recommender systems have been gaining increasing research attention over the years. Most existing recommendation methods focus on capturing users' personalized preferences through historical user-item interactions, which may potentially violate user privacy. Additionally, these approaches often overlook the significance of the temporal fluctuation in item popularity that can sway users' decision-making. To bridge this gap, we propose Popularity-Aware Recommender (PARE), which makes non-personalized recommendations by predicting the items that will attain the highest popularity. PARE consists of four modules, each focusing on a different aspect: popularity history, temporal impact, periodic impact, and side information. Finally, an attention layer is leveraged to fuse the outputs of four modules. To our knowledge, this is the first work to explicitly model item popularity in recommendation systems. Extensive experiments show that PARE performs on par or even better than sophisticated state-of-the-art recommendation methods. Since PARE prioritizes item popularity over personalized user preferences, it can enhance existing recommendation methods as a complementary component. Our experiments demonstrate that integrating PARE with existing recommendation methods significantly surpasses the performance of standalone models, highlighting PARE's potential as a complement to existing recommendation methods. Furthermore, the simplicity of PARE makes it immensely practical for industrial applications and a valuable baseline for future research.
    摘要 近年来,推荐系统受到越来越多的研究关注。大多数现有的推荐方法通过历史用户-物品交互来捕捉用户的个性化偏好,这可能侵犯用户隐私。此外,这些方法通常忽视了物品流行度随时间的波动,而这种波动会影响用户的决策。为了弥合这一差距,我们提出了流行度感知推荐器(PARE),它通过预测将获得最高流行度的物品来进行非个性化推荐。PARE包括四个模块,分别关注流行度历史、时间影响、周期性影响和侧信息,最后利用一个注意力层融合四个模块的输出。据我们所知,这是首个在推荐系统中显式建模物品流行度的工作。大量实验表明,PARE的表现可与甚至优于复杂的最先进推荐方法。由于PARE侧重物品流行度而非个性化用户偏好,它可以作为补充组件增强现有推荐方法。我们的实验还表明,将PARE与现有推荐方法集成可显著超越单独模型的性能,凸显了PARE作为现有推荐方法补充的潜力。此外,PARE的简单性使其在工业应用中极为实用,并可作为未来研究的有价值基线。
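The abstract specifies that an attention layer fuses the outputs of the four modules. The following minimal PyTorch sketch shows one standard way such a fusion layer could look; the dimensionality and the module internals are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative fusion of four module outputs via a learned attention
    layer; the modules themselves (popularity history, temporal, periodic,
    side information) are treated as black boxes, since only the fusion
    step is described in the abstract."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # scores each module's output

    def forward(self, outputs):  # outputs: list of 4 tensors, each (B, dim)
        stacked = torch.stack(outputs, dim=1)                # (B, 4, dim)
        weights = torch.softmax(self.score(stacked), dim=1)  # (B, 4, 1)
        return (weights * stacked).sum(dim=1)                # (B, dim)

fused = AttentionFusion(dim=64)([torch.randn(8, 64) for _ in range(4)])
```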

Joint Local Relational Augmentation and Global Nash Equilibrium for Federated Learning with Non-IID Data

  • paper_url: http://arxiv.org/abs/2308.11646
  • repo_url: None
  • paper_authors: Xinting Liao, Chaochao Chen, Weiming Liu, Pengyang Zhou, Huabin Zhu, Shuheng Shen, Weiqiang Wang, Mengling Hu, Yanchao Tan, Xiaolin Zheng
  • for: 提升联邦学习在实际应用中的效果,特别是在非独立同分布(non-IID)数据场景中。
  • methods: 提出了两个主要模块:局部关系增强(LRA)和全局纳什均衡(GNE),用于同时解决客户端间和客户端内的不一致性。
  • results: 在四个基准数据集上的广泛实验证明,FedRANE可以在non-IID数据场景中提升联邦学习的性能。
    Abstract Federated learning (FL) is a distributed machine learning paradigm that requires collaboration between a server and a series of clients with decentralized data. To make FL effective in real-world applications, existing work has devoted itself to improving the modeling of decentralized data with non-independent and identical distributions (non-IID). In non-IID settings, there is intra-client inconsistency arising from imbalanced data modeling and inter-client inconsistency among heterogeneous client distributions, which not only hinder sufficient representation of the minority data but also bring discrepant model deviations. However, previous work overlooks tackling these two coupled inconsistencies together. In this work, we propose FedRANE, which consists of two main modules, i.e., local relational augmentation (LRA) and global Nash equilibrium (GNE), to resolve intra- and inter-client inconsistency simultaneously. Specifically, in each client, LRA mines the similarity relations among different data samples and enhances the minority sample representations with their neighbors using attentive message passing. In the server, GNE reaches an agreement among the inconsistent and discrepant model deviations from clients, which encourages the global model to update in the direction of the global optimum without breaking down the clients' optimization toward their local optima. We conduct extensive experiments on four benchmark datasets to show the superiority of FedRANE in enhancing the performance of FL with non-IID data.
    摘要 联邦学习(FL)是一种分布式机器学习范式,需要服务器与一系列拥有去中心化数据的客户端协作。为使FL在实际应用中有效,现有工作致力于改进对非独立同分布(non-IID)去中心化数据的建模。在non-IID场景下,既存在源于不均衡数据建模的客户端内部不一致性,也存在异构客户端分布之间的客户端间不一致性;这不仅妨碍少数数据得到充分表示,还带来相互偏离的模型偏差。然而,先前的工作忽视了同时处理上述两种耦合的不一致性。在本工作中,我们提出FedRANE,它由两个主要模块组成:局部关系增强(LRA)和全局纳什均衡(GNE),以同时解决客户端内部和客户端之间的不一致性。具体而言,在每个客户端上,LRA挖掘不同数据样本之间的相似关系,并通过注意力消息传递利用近邻增强少数样本的表示;在服务器端,GNE在客户端上传的不一致且相互偏离的模型偏差之间达成一致,促使全局模型朝全局最优方向更新,同时不破坏各客户端向其局部最优的优化。我们在四个基准数据集上进行了大量实验,展示了FedRANE在non-IID数据下提升FL性能的优越性。
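To make the LRA idea concrete, here is a minimal sketch of attentive message passing over sample similarities; the paper's exact formulation (e.g. how neighbours and attention weights are defined) may differ.

```python
import torch
import torch.nn.functional as F

def relational_augmentation(h, temperature=0.1):
    """A minimal sketch in the spirit of FedRANE's LRA module: each
    sample's representation is enhanced by an attention-weighted sum of
    its batch neighbours, so minority samples borrow signal from similar
    ones.  h: (N, D) batch of sample representations."""
    sim = F.cosine_similarity(h.unsqueeze(1), h.unsqueeze(0), dim=-1)  # (N, N)
    sim.fill_diagonal_(float("-inf"))           # no self-messages
    attn = torch.softmax(sim / temperature, dim=-1)
    return h + attn @ h                          # residual message passing

augmented = relational_augmentation(torch.randn(16, 32))
```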

Bayesian polynomial neural networks and polynomial neural ordinary differential equations

  • paper_url: http://arxiv.org/abs/2308.10892
  • repo_url: None
  • paper_authors: Colby Fronk, Jaewoong Yun, Prashant Singh, Linda Petzold
  • for: 这些方法用于恢复多种科学和工程问题中的方程模型,但它们只提供参数的点估计,无法处理含噪数据。
  • methods: 我们使用拉普拉斯近似、马尔可夫链蒙特卡洛(MCMC)采样方法和变分推断进行贝叶斯推断。
  • results: 我们发现 Laplace 近似是这类问题中最佳的方法。我们的工作可以轻松扩展到符号神经网络中的更广泛类型。
    Abstract Symbolic regression with polynomial neural networks and polynomial neural ordinary differential equations (ODEs) are two recent and powerful approaches for equation recovery of many science and engineering problems. However, these methods provide point estimates for the model parameters and are currently unable to accommodate noisy data. We address this challenge by developing and validating the following Bayesian inference methods: the Laplace approximation, Markov Chain Monte Carlo (MCMC) sampling methods, and variational inference. We have found the Laplace approximation to be the best method for this class of problems. Our work can be easily extended to the broader class of symbolic neural networks to which the polynomial neural network belongs.
    摘要 基于多项式神经网络和多项式神经常微分方程(ODE)的符号回归,是近来用于解决许多科学与工程问题中方程恢复的两种强大方法。然而,这些方法只能给出模型参数的点估计,目前无法处理含噪数据。我们通过开发并验证以下贝叶斯推断方法来应对这一挑战:拉普拉斯近似、马尔可夫链蒙特卡洛(MCMC)采样方法以及变分推断。我们发现拉普拉斯近似是此类问题的最佳方法。我们的工作可以轻松推广到多项式神经网络所属的更广泛的符号神经网络类别。
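For a polynomial model that is linear in its parameters, the Laplace approximation at the MAP estimate has a closed form, which the toy example below illustrates; it is a didactic stand-in, not the authors' implementation for neural ODEs.

```python
import numpy as np

# Laplace approximation for the coefficients of a polynomial model
# y = X @ w + noise; in this linear-in-parameters case the Gaussian
# approximation at the MAP estimate is exact and available in closed form.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = 1.5 * x**2 - 0.5 * x + rng.normal(0, 0.3, size=100)   # noisy data

X = np.vander(x, N=3, increasing=True)   # features [1, x, x^2]
sigma2, tau2 = 0.3**2, 10.0              # noise variance, prior variance

# Posterior precision = Hessian of the negative log posterior at the MAP.
precision = X.T @ X / sigma2 + np.eye(3) / tau2
cov = np.linalg.inv(precision)
w_map = cov @ (X.T @ y) / sigma2

print("MAP coefficients:", w_map)
print("posterior std:", np.sqrt(np.diag(cov)))
```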

CodeCoT and Beyond: Learning to Program and Test like a Developer

  • paper_url: http://arxiv.org/abs/2308.08784
  • repo_url: None
  • paper_authors: Dong Huang, Qingwen Bu, Heming Cui
  • for: The paper aims to improve the code generation accuracy of transformer-based large language models (LLMs) such as the GPT-x models, which often struggle with tasks that differ from their training data.
  • methods: The paper proposes two components: Vanilla CodeCoT and Self-exam CodeCoT. The latter incorporates self-examination, allowing the model to iteratively generate code, formulate test cases, and refine its outputs.
  • results: The paper reports significant improvements in code generation accuracy across various LLM variants, with the Self-exam CodeCoT approach achieving a pass@1 accuracy of 79.27% with the gpt-3.5-turbo-0613 model on the HumanEval dataset.
  • for: 这篇论文目标是提高基于转换器的大型自然语言处理模型(LLM)如GPT-x模型的代码生成精度,这些模型经常在不同于训练数据的任务中遇到挑战。
  • methods: 论文提出了两个组成部分:普通CodeCoT和自检CodeCoT。后者将自我检验纳入模型流程,使其能够迭代地生成代码、编写测试用例并改进输出。
  • results: 论文发现,使用不同的LLM变体时,CodeCoT技术均能够显著提高代码生成精度,而Self-exam CodeCoT方法在gpt-3.5-turbo-0613模型上的HumanEval数据集上达到了历史最高的pass@1准确率为79.27%。
    Abstract In natural language processing, transformer-based large language models (LLMs) like GPT-x models developed by OpenAI have revolutionized the landscape. Despite their impressive capabilities, these models often encounter challenges when handling tasks that differ from their training data, resulting in compromised performance. To address this, few-shot learning has emerged as a valuable technique, allowing LLMs to adapt with minimal task-specific data. One innovative strategy, known as Chain-of-Thought Prompting (CoT), has been introduced to guide LLMs in revealing cognitive processes during multi-step reasoning. In this paper, we propose Code Chain-of-Thought~(CodeCoT), which consists of two components: the Vanilla CodeCoT and the Self-exam CodeCoT. The latter incorporates self-examination, empowering the model to iteratively generate code, formulate test cases, and refine its outputs. Specifically, the process entails the generation of test examples by the model corresponding to the code it is tasked to implement. If it fails on the test examples, then it regenerates the code based on the erroneous code and associated error types. Through comprehensive experiments, we observed that both techniques significantly enhance code generation accuracy across various LLM variants. Our evaluation results reveal that CodeCoT improves the code generation effectiveness, including an unprecedented pass@1 accuracy of 79.27\% using the Self-exam CodeCoT approach on the gpt-3.5-turbo-0613 model in the HumanEval dataset.
    摘要 在自然语言处理领域,基于Transformer的大语言模型(LLM),如OpenAI开发的GPT-x模型,已经彻底改变了该领域的格局。尽管这些模型能力出众,但在处理与训练数据不同的任务时经常遇到挑战,导致性能受损。为解决这一问题,少样本学习(few-shot learning)已成为一种有价值的技术,使LLM能够凭借极少的任务特定数据进行适应。思维链提示(Chain-of-Thought Prompting,CoT)作为一种创新策略,被引入以引导LLM在多步推理中展现认知过程。在这篇论文中,我们提出了代码思维链(CodeCoT),它包含两个组成部分:普通CodeCoT(Vanilla CodeCoT)和自检CodeCoT(Self-exam CodeCoT)。后者引入自我检验机制,使模型能够迭代地生成代码、编写测试用例并改进输出。具体而言,模型针对需要实现的代码生成相应的测试用例;如果代码未通过测试,模型就会基于出错代码及相应的错误类型重新生成代码。通过全面的实验,我们观察到这两种技术都能显著提升各类LLM变体的代码生成准确率。评估结果显示,使用Self-exam CodeCoT方法,gpt-3.5-turbo-0613模型在HumanEval数据集上达到了前所未有的79.27%的pass@1准确率。
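The self-exam loop described above can be sketched in a few lines: generate code, generate tests, execute, and regenerate with error feedback on failure. `generate_code` and `generate_tests` are hypothetical LLM wrappers, not the paper's code.

```python
# Minimal sketch of the self-exam loop; generate_code and generate_tests
# are hypothetical stubs wrapping an LLM call.
def generate_code(task, feedback=None) -> str: ...
def generate_tests(task) -> str: ...

def self_exam_codecot(task, max_rounds=3):
    tests = generate_tests(task)           # model-written test cases
    feedback = None
    for _ in range(max_rounds):
        code = generate_code(task, feedback)
        try:
            namespace = {}
            exec(code, namespace)          # load the candidate solution
            exec(tests, namespace)         # run the self-written tests
            return code                    # all tests passed
        except Exception as err:           # feed the error type back in
            feedback = f"previous code failed: {type(err).__name__}: {err}"
    return code                            # best effort after max_rounds
```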

Knowledge-inspired Subdomain Adaptation for Cross-Domain Knowledge Transfer

  • paper_url: http://arxiv.org/abs/2308.09724
  • repo_url: None
  • paper_authors: Liyue Chen, Linian Wang, Jinyu Xu, Shuai Chen, Weiqiang Wang, Wenbiao Zhao, Qiyu Li, Leye Wang
  • for: 这篇论文面向跨域欺诈检测(cross-domain fraud detection)和交通需求预测(traffic demand prediction)等场景的深度域适应技术。
  • methods: 本文提出了一个新颖的知识启发子域适应(KISA)框架,包括:证明KISA最小化共享期望损失(shared expected loss)的理论洞见、知识启发的子域划分问题,以及知识融合网络等。
  • results: 实验结果显示,KISA在欺诈检测和交通需求预测任务中取得了出色的成绩。
    Abstract Most state-of-the-art deep domain adaptation techniques align source and target samples in a global fashion. That is, after alignment, each source sample is expected to become similar to any target sample. However, global alignment may not always be optimal or necessary in practice. For example, consider cross-domain fraud detection, where there are two types of transactions: credit and non-credit. Aligning credit and non-credit transactions separately may yield better performance than global alignment, as credit transactions are unlikely to exhibit patterns similar to non-credit transactions. To enable such fine-grained domain adaption, we propose a novel Knowledge-Inspired Subdomain Adaptation (KISA) framework. In particular, (1) We provide the theoretical insight that KISA minimizes the shared expected loss which is the premise for the success of domain adaptation methods. (2) We propose the knowledge-inspired subdomain division problem that plays a crucial role in fine-grained domain adaption. (3) We design a knowledge fusion network to exploit diverse domain knowledge. Extensive experiments demonstrate that KISA achieves remarkable results on fraud detection and traffic demand prediction tasks.
    摘要 大多数最先进的深度域适应技术都以全局方式对齐源样本和目标样本,即对齐后,每个源样本都应与任意目标样本相似。然而,在实践中全局对齐未必是最优或必要的。例如,在跨域欺诈检测中存在两类交易:信用交易和非信用交易。由于信用交易不太可能表现出与非信用交易相似的模式,分别对齐这两类交易可能比全局对齐获得更好的性能。为实现这种细粒度的域适应,我们提出了新颖的知识启发子域适应(KISA)框架。具体来说:(1)我们给出理论洞见,说明KISA最小化共享期望损失,这是域适应方法成功的前提;(2)我们提出了在细粒度域适应中起关键作用的知识启发子域划分问题;(3)我们设计了知识融合网络以利用多样的领域知识。大量实验表明,KISA在欺诈检测和交通需求预测任务上取得了出色的结果。
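The core idea of fine-grained (subdomain) rather than global alignment can be illustrated with a simple per-subdomain discrepancy loss; KISA's actual objective and its knowledge fusion network are more involved than this stand-in.

```python
import torch

def subdomain_alignment_loss(src_feats, tgt_feats, src_keys, tgt_keys):
    """Aligns source and target features per knowledge-defined subdomain
    (e.g. credit vs. non-credit transactions) instead of globally.
    A simple mean-feature discrepancy stands in for KISA's actual
    objective, which the abstract does not spell out."""
    loss = 0.0
    shared = set(src_keys.tolist()) & set(tgt_keys.tolist())
    for k in shared:
        mu_s = src_feats[src_keys == k].mean(dim=0)
        mu_t = tgt_feats[tgt_keys == k].mean(dim=0)
        loss = loss + (mu_s - mu_t).pow(2).sum()
    return loss / max(len(shared), 1)

loss = subdomain_alignment_loss(torch.randn(32, 16), torch.randn(40, 16),
                                torch.randint(0, 2, (32,)),
                                torch.randint(0, 2, (40,)))
```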

Exploring Demonstration Ensembling for In-context Learning

  • paper_url: http://arxiv.org/abs/2308.08780
  • repo_url: https://github.com/mukhal/icl-ensembling
  • paper_authors: Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Lu Wang
  • for: 改进语言模型(LM)的上下文学习,使其能够更好地完成给定任务。
  • methods: 将示例(demonstrations)划分为子集分别预测,并研究不同的集成方法。
  • results: 比简单拼接方法更有效;其中加权最大值集成(weighted max ensemble)在12种语言任务上表现出色,平均比拼接方法最多高出2.4个点。
    Abstract In-context learning (ICL) operates by showing language models (LMs) examples of input-output pairs for a given task, i.e., demonstrations. The standard approach for ICL is to prompt the LM with concatenated demonstrations followed by the test input. This approach suffers from some issues. First, concatenation offers almost no control over the contribution of each demo to the model prediction. This can be sub-optimal when some demonstrations are irrelevant to the test example. Second, due to the input length limit of some transformer models, it might be infeasible to fit many examples into the context, especially when dealing with long-input tasks. In this work, we explore Demonstration Ensembling (DENSE) as an alternative to simple concatenation. DENSE predicts outputs using subsets (i.e., buckets) of the demonstrations and then combines the output probabilities resulting from each subset to produce the final prediction. We study different ensembling methods using GPT-j and experiment on 12 language tasks. Our experiments show weighted max ensembling to outperform vanilla concatenation by as much as 2.4 average points. Code available at https://github.com/mukhal/icl-ensembling.
    摘要 上下文学习(ICL)通过向语言模型(LM)展示给定任务的输入-输出示例(即示例,demonstrations)来工作。ICL的标准做法是将示例拼接后再接上测试输入来提示LM。这种做法存在一些问题。首先,拼接几乎无法控制每个示例对模型预测的贡献,当某些示例与测试样本无关时,这可能是次优的。其次,受部分Transformer模型输入长度的限制,可能无法将很多示例放入上下文中,在处理长输入任务时尤甚。在本工作中,我们探索了示例集成(DENSE)作为简单拼接的替代方案。DENSE使用示例子集(即桶)分别预测输出,再将各子集得到的输出概率组合为最终预测。我们使用GPT-j研究了不同的集成方法,并在12种语言任务上进行实验。实验结果显示,加权最大值集成(weighted max ensembling)最多可比朴素拼接平均高出2.4个点。代码见 https://github.com/mukhal/icl-ensembling。
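A compact sketch of the bucketed ensembling idea follows; `label_probs_fn` is a hypothetical callable wrapping an LM, and the exact weighted-max combination rule in the paper may differ from this reading.

```python
import numpy as np

def dense_predict(demos, test_input, label_probs_fn, n_buckets=4):
    """Demonstration ensembling sketch: split demos into buckets, score
    the test input under each bucket, and combine the per-bucket label
    distributions.  label_probs_fn(demos, test_input) is a hypothetical
    callable returning an LM's label probabilities for a prompt."""
    buckets = np.array_split(np.asarray(demos, dtype=object), n_buckets)
    probs = np.stack([label_probs_fn(list(b), test_input) for b in buckets])
    weights = probs.max(axis=1, keepdims=True)   # confidence per bucket
    combined = (weights * probs).max(axis=0)     # weighted max over buckets
    return combined / combined.sum()
```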

Large Language Models at Work in China’s Labor Market

  • paper_url: http://arxiv.org/abs/2308.08776
  • repo_url: None
  • paper_authors: Qin Chen, Jinfeng Ge, Huaqing Xie, Xingcheng Xu, Yanqing Yang
  • for: This paper explores the potential impacts of large language models (LLMs) on the Chinese labor market, with a focus on understanding the displacement risks for high-paying and experience-intensive jobs.
  • methods: The paper uses a methodology that incorporates human expertise and LLM classifications to analyze occupational exposure to LLM capabilities, and aggregates occupation exposure to the industry level to obtain industry exposure scores.
  • results: The results indicate a positive correlation between occupation exposure and wage levels/experience premiums, suggesting that higher-paying and experience-intensive jobs may face greater displacement risks from LLM-powered software. The industry exposure scores align with expert assessments and economic intuitions, and the study provides an analytical basis for understanding the labor market impacts of increasingly capable AI systems in China.
    Abstract This paper explores the potential impacts of large language models (LLMs) on the Chinese labor market. We analyze occupational exposure to LLM capabilities by incorporating human expertise and LLM classifications, following Eloundou et al. (2023)'s methodology. We then aggregate occupation exposure to the industry level to obtain industry exposure scores. The results indicate a positive correlation between occupation exposure and wage levels/experience premiums, suggesting higher-paying and experience-intensive jobs may face greater displacement risks from LLM-powered software. The industry exposure scores align with expert assessments and economic intuitions. We also develop an economic growth model incorporating industry exposure to quantify the productivity-employment trade-off from AI adoption. Overall, this study provides an analytical basis for understanding the labor market impacts of increasingly capable AI systems in China. Key innovations include the occupation-level exposure analysis, industry aggregation approach, and economic modeling incorporating AI adoption and labor market effects. The findings will inform policymakers and businesses on strategies for maximizing the benefits of AI while mitigating adverse disruption risks.
    摘要 本文探讨大型语言模型(LLM)对中国劳动力市场的潜在影响。我们遵循Eloundou et al. (2023)的方法,结合人工专业知识与LLM分类,分析各职业对LLM能力的暴露程度,并将职业暴露度汇总到行业层面,得到行业暴露度评分。结果显示职业暴露度与工资水平及经验溢价呈正相关,表明高薪且经验密集的岗位可能面临更大的被LLM驱动软件替代的风险。行业暴露度评分与专家评估和经济直觉一致。我们还构建了纳入行业暴露度的经济增长模型,以量化AI采用带来的生产率与就业之间的权衡。总体而言,本研究为理解能力日益增强的AI系统对中国劳动力市场的影响提供了分析基础。
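The aggregation step from occupation exposure to industry exposure is essentially an employment-weighted average, as the toy example below illustrates; all column names and numbers are hypothetical.

```python
import pandas as pd

# Illustrative aggregation of occupation-level exposure to the industry
# level, weighted by employment shares; data and columns are placeholders.
occ = pd.DataFrame({
    "industry":   ["finance", "finance", "farming"],
    "occupation": ["analyst", "teller", "harvester"],
    "exposure":   [0.82, 0.65, 0.12],
    "employment": [120, 300, 500],
})

industry_exposure = (
    occ.assign(weighted=occ.exposure * occ.employment)
       .groupby("industry")
       .apply(lambda g: g.weighted.sum() / g.employment.sum())
       .rename("exposure_score")
)
print(industry_exposure)
```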

Differential Privacy, Linguistic Fairness, and Training Data Influence: Impossibility and Possibility Theorems for Multilingual Language Models

  • paper_url: http://arxiv.org/abs/2308.08774
  • repo_url: None
  • paper_authors: Phillip Rust, Anders Søgaard
  • for: 本文目的是探讨语言模型如mBERT、XLM-R和BLOOM在多语言泛化和压缩方面是否可以同时满足隐私、语言公平和透明性的要求。
  • methods: 本文在差分隐私约束下,考察多语言压缩、语言公平性与训练数据影响稀疏性(一种透明性目标)之间的兼容性。
  • results: 研究发现,多语言压缩和语言公平性可以与差分隐私兼容,但差分隐私与训练数据影响稀疏性之间存在冲突。研究还在多种NLP任务上进行了实验,评估了不同隐私保证下的多语言压缩和训练数据影响稀疏性。结果表明,需要开发联合优化这些目标的新方法,以找到实际可行的权衡。
    Abstract Language models such as mBERT, XLM-R, and BLOOM aim to achieve multilingual generalization or compression to facilitate transfer to a large number of (potentially unseen) languages. However, these models should ideally also be private, linguistically fair, and transparent, by relating their predictions to training data. Can these requirements be simultaneously satisfied? We show that multilingual compression and linguistic fairness are compatible with differential privacy, but that differential privacy is at odds with training data influence sparsity, an objective for transparency. We further present a series of experiments on two common NLP tasks and evaluate multilingual compression and training data influence sparsity under different privacy guarantees, exploring these trade-offs in more detail. Our results suggest that we need to develop ways to jointly optimize for these objectives in order to find practical trade-offs.
    摘要 mBERT、XLM-R和BLOOM等语言模型旨在实现多语言泛化或压缩,以便迁移到大量(可能未见过的)语言。然而,这些模型理想情况下还应具备隐私性、语言公平性和透明性,即能将其预测与训练数据关联起来。这些要求能否同时满足?我们证明,多语言压缩和语言公平性与差分隐私是兼容的,但差分隐私与训练数据影响稀疏性(一种透明性目标)相冲突。我们进一步在两个常见的NLP任务上开展了一系列实验,在不同的隐私保证下评估多语言压缩和训练数据影响稀疏性,更细致地探索这些权衡。我们的结果表明,需要开发联合优化这些目标的方法,以找到实际可行的权衡。
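The differential privacy guarantee discussed here is typically obtained via DP-SGD, i.e. per-example gradient clipping plus Gaussian noise. The sketch below shows that standard mechanism, not the paper's specific training setup.

```python
import torch

def dp_sgd_step(model, loss_fn, batch, lr=0.1, clip=1.0, noise_mult=1.0):
    """One differentially-private SGD step: clip each example's gradient
    to L2 norm <= clip, sum, add Gaussian noise, and update.  A minimal
    sketch of the standard DP-SGD mechanism."""
    grads = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in batch:                          # per-example gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
        scale = min(1.0, clip / (norm.item() + 1e-12))
        for g, p in zip(grads, model.parameters()):
            g += p.grad * scale
    with torch.no_grad():
        for g, p in zip(grads, model.parameters()):
            noise = torch.randn_like(g) * noise_mult * clip
            p -= lr * (g + noise) / len(batch)
```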

Sensor Fusion by Spatial Encoding for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2308.10707
  • repo_url: None
  • paper_authors: Quoc-Vinh Lai-Dang, Jihui Lee, Bumgeun Park, Dongsoo Har
  • for: 本研究旨在解决摄像头与激光雷达数据融合的问题,以提升自动驾驶和机器人感知系统的性能。
  • methods: 本研究在多个分辨率下应用Transformer模块,以有效结合局部与全局上下文关系。
  • results: 在两个具有较长路线和高密度交通的挑战性基准上,所提方法的驾驶和违规分数显著优于先前方法;与TransFuser相比,在Longest6和Town05 Long基准上的驾驶分数分别提升了8%和19%。
    Abstract Sensor fusion is critical to perception systems for task domains such as autonomous driving and robotics. Recently, Transformers integrated with CNNs have demonstrated high performance in sensor fusion for various perception tasks. In this work, we introduce a method for fusing data from camera and LiDAR. By employing Transformer modules at multiple resolutions, the proposed method effectively combines local and global contextual relationships. Its performance is validated by extensive experiments on two adversarial benchmarks with lengthy routes and high-density traffic. The proposed method outperforms previous approaches on these most challenging benchmarks, achieving significantly higher driving and infraction scores. Compared with TransFuser, it achieves 8% and 19% improvements in driving scores on the Longest6 and Town05 Long benchmarks, respectively.
    摘要 传感器融合对自动驾驶和机器人等任务领域的感知系统至关重要。最近,Transformer与CNN结合的方法在多种感知任务的传感器融合中表现出色。在本工作中,我们提出了一种融合摄像头与LiDAR数据的方法:通过在多个分辨率下使用Transformer模块,有效地结合局部与全局上下文关系。我们在两个具有较长路线和高密度交通的挑战性基准上进行了大量实验,结果表明我们的方法在最具挑战性的基准上优于先前方法,取得了显著更高的驾驶和违规分数。与TransFuser相比,我们的方法在Longest6和Town05 Long基准上的驾驶分数分别提升了8%和19%。
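At a single resolution, transformer-based camera/LiDAR fusion can be sketched as self-attention over the concatenated token sets; the method applies such modules at multiple resolutions, and all dimensions below are illustrative.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Single-resolution sketch of transformer-based camera/LiDAR fusion;
    the paper applies such modules at multiple resolutions, which would
    repeat this block on differently-pooled feature maps."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cam_tokens, lidar_tokens):
        tokens = torch.cat([cam_tokens, lidar_tokens], dim=1)  # (B, Nc+Nl, d)
        fused, _ = self.attn(tokens, tokens, tokens)  # global self-attention
        return fused + tokens                          # residual connection

out = CrossModalFusion()(torch.randn(2, 64, 128), torch.randn(2, 32, 128))
```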

Discrete Prompt Compression with Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.08758
  • repo_url: None
  • paper_authors: Hoyoun Jung, Kyung-Joong Kim
  • for: 本研究旨在提出一种基于强化学习的提示压缩方法,以解决现有基于训练嵌入的方法在可解释性、可复用性和适用性方面的问题。
  • methods: 本研究使用名为PCRL的计算高效的策略网络,直接编辑提示以实现提示压缩。PCRL可适用于不同类型的LM,包括仅解码器和编码器-解码器架构,并可在不访问LM梯度或标注数据的情况下训练。
  • results: 研究发现,PCRL可在保持性能的同时将提示token数平均减少24.6%。此外,我们还证明了该策略可迁移到更大的LM,并通过多种分析帮助理解提示中token的重要性。
    Abstract Instruction-tuned Language Models (LMs) are widely used by users to address various problems with task-specific prompts. Constraints associated with the context window length and computational costs encourage the development of compressed prompts. Existing methods rely heavily on training embeddings, which are designed to accommodate multiple token meanings. This presents challenges in terms of interpretability, a fixed number of embedding tokens, reusability across different LMs, and inapplicability when interacting with black-box APIs. This study proposes prompt compression with reinforcement learning (PCRL), a novel discrete prompt compression method that addresses these issues. PCRL employs a computationally efficient policy network that directly edits prompts. The PCRL training approach can be flexibly applied to various types of LMs, as well as decoder-only and encoder-decoder architecture, and can be trained without gradient access to LMs or labeled data. PCRL achieves an average reduction of 24.6% in token count across various instruction prompts while preserving performance. Further, we demonstrate that the learned policy can be transferred to larger LMs, and through various analyses, we aid the understanding of token importance within prompts.
    摘要 指令调优的语言模型(LM)被广泛用于通过任务特定的提示解决各类问题。上下文窗口长度与计算成本的约束推动了压缩提示的发展。现有方法严重依赖训练嵌入,此类嵌入被设计为容纳多个token的含义,这带来了可解释性差、嵌入token数量固定、难以跨不同LM复用以及无法与黑盒API交互等问题。本研究提出了基于强化学习的提示压缩方法(PCRL),一种新颖的离散提示压缩方法,用以解决上述问题。PCRL采用计算高效的策略网络直接编辑提示。PCRL的训练方式可灵活应用于各类LM(包括仅解码器和编码器-解码器架构),且无需访问LM的梯度或标注数据即可训练。PCRL在各类指令提示上实现了平均24.6%的token数缩减,同时保持性能。此外,我们还证明了学到的策略可迁移到更大的LM,并通过多种分析帮助理解提示中token的重要性。
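A toy version of an RL-trained discrete edit policy might score each prompt token and sample keep/drop decisions, trained with REINFORCE; the reward design and policy architecture here are illustrative assumptions, not PCRL's exact ones.

```python
import torch
import torch.nn as nn

class DeletionPolicy(nn.Module):
    """Toy edit policy in the spirit of PCRL: score each prompt token and
    sample keep/drop decisions.  Token embeddings and the reward function
    are stand-ins."""
    def __init__(self, dim=32):
        super().__init__()
        self.keep_logit = nn.Linear(dim, 1)

    def forward(self, token_embs):                 # token_embs: (T, dim)
        probs = torch.sigmoid(self.keep_logit(token_embs)).squeeze(-1)
        keep = torch.bernoulli(probs)              # sampled keep mask
        log_prob = (keep * probs.clamp_min(1e-8).log()
                    + (1 - keep) * (1 - probs).clamp_min(1e-8).log()).sum()
        return keep, log_prob

def reinforce_step(policy, token_embs, reward_fn, optimizer):
    keep, log_prob = policy(token_embs)
    reward = reward_fn(keep)              # e.g. task score minus length cost
    (-reward * log_prob).backward()       # policy-gradient update
    optimizer.step()
    optimizer.zero_grad()
```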

SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation

  • paper_url: http://arxiv.org/abs/2308.08746
  • repo_url: https://github.com/wenxi-yue/surgicalsam
  • paper_authors: Wenxi Yue, Jing Zhang, Kun Hu, Yong Xia, Jiebo Luo, Zhiyong Wang
  • for: 这篇论文针对手术器械分割任务,具体是利用Segment Anything Model(SAM)进行手术器械分割。
  • methods: 这篇论文以SAM为基础模型,提出了名为SurgicalSAM的新方法:一种端到端的高效微调方法,将手术领域特有的信息与SAM的预训练知识结合,以提高分割精度并简化流程。
  • results: 实验结果显示,SurgicalSAM在EndoVis2018和EndoVis2017数据集上达到了最先进的性能,且只需少量可调参数。
    Abstract The Segment Anything Model (SAM) is a powerful foundation model that has revolutionised image segmentation. To apply SAM to surgical instrument segmentation, a common approach is to locate precise points or boxes of instruments and then use them as prompts for SAM in a zero-shot manner. However, we observe two problems with this naive pipeline: (1) the domain gap between natural objects and surgical instruments leads to poor generalisation of SAM; and (2) SAM relies on precise point or box locations for accurate segmentation, requiring either extensive manual guidance or a well-performing specialist detector for prompt preparation, which leads to a complex multi-stage pipeline. To address these problems, we introduce SurgicalSAM, a novel end-to-end efficient-tuning approach for SAM to effectively integrate surgical-specific information with SAM's pre-trained knowledge for improved generalisation. Specifically, we propose a lightweight prototype-based class prompt encoder for tuning, which directly generates prompt embeddings from class prototypes and eliminates the use of explicit prompts for improved robustness and a simpler pipeline. In addition, to address the low inter-class variance among surgical instrument categories, we propose contrastive prototype learning, further enhancing the discrimination of the class prototypes for more accurate class prompting. The results of extensive experiments on both EndoVis2018 and EndoVis2017 datasets demonstrate that SurgicalSAM achieves state-of-the-art performance while only requiring a small number of tunable parameters. The source code will be released at https://github.com/wenxi-yue/SurgicalSAM.
    摘要 分割一切模型(Segment Anything Model,SAM)是一种强大的基础模型,彻底改变了图像分割领域。将SAM应用于手术器械分割的常见做法是先定位器械的精确点或框,再将其作为提示以零样本方式输入SAM。然而,这种朴素流程存在两个问题:(1)自然物体与手术器械之间的域差距导致SAM泛化性差;(2)SAM依赖精确的点或框位置才能准确分割,需要大量人工引导或性能良好的专用检测器来准备提示,从而形成复杂的多阶段流程。为解决这些问题,我们提出SurgicalSAM,一种新颖的端到端高效微调方法,将手术领域特有的信息与SAM的预训练知识有效整合,以提升泛化能力。具体而言,我们提出一种轻量级的基于原型的类别提示编码器,直接从类别原型生成提示嵌入,免去显式提示,从而提高鲁棒性并简化流程。此外,针对手术器械类别间差异较小的问题,我们提出对比原型学习,进一步增强类别原型的判别能力,实现更准确的类别提示。在EndoVis2018和EndoVis2017数据集上的大量实验表明,SurgicalSAM仅需少量可调参数即可达到最先进的性能。源代码将在 https://github.com/wenxi-yue/SurgicalSAM 发布。
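The prototype-based class prompt encoder can be pictured as mapping a learnable per-class prototype to a few prompt embeddings that replace explicit points or boxes; the sketch below uses illustrative dimensions.

```python
import torch
import torch.nn as nn

class PrototypePromptEncoder(nn.Module):
    """Sketch of a class-prompt encoder: a learnable prototype per
    instrument category is mapped to prompt embeddings that stand in for
    SAM's explicit point/box prompts.  All dimensions are illustrative."""
    def __init__(self, n_classes=7, proto_dim=256, n_prompt_tokens=4):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_classes, proto_dim))
        self.to_prompts = nn.Linear(proto_dim, n_prompt_tokens * proto_dim)

    def forward(self, class_ids):                    # class_ids: (B,)
        protos = self.prototypes[class_ids]          # (B, proto_dim)
        return self.to_prompts(protos).view(
            len(class_ids), -1, self.prototypes.shape[1])

prompts = PrototypePromptEncoder()(torch.tensor([0, 3]))  # (2, 4, 256)
```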

PMET: Precise Model Editing in a Transformer

  • paper_url: http://arxiv.org/abs/2308.08742
  • repo_url: https://github.com/xpq-tech/pmet
  • paper_authors: Xiaopeng Li, Shasha Li, Shezheng Song, Jing Yang, Jun Ma, Jie Yu
  • for: 本研究旨在提高模型修改技术的性能,减少模型修改的成本。
  • methods: 该研究分析了多头自注意力(MHSA)和前馈网络(FFN)的隐藏状态,并引入一种同时优化MHSA与FFN隐藏状态的方法(PMET),仅利用优化后的FFN隐藏状态来精确更新FFN的权重。
  • results: 实验表明,PMET在COUNTERFACT和zsRE datasets上表现出色,与其他方法相比,具有更高的性能。
    Abstract Model editing techniques modify a minor proportion of knowledge in Large Language Models (LLMs) at a relatively low cost, which have demonstrated notable success. Existing methods assume Transformer Layer (TL) hidden states are values of key-value memories of the Feed-Forward Network (FFN). They usually optimize the TL hidden states to memorize target knowledge and use it to update the weights of the FFN in LLMs. However, the information flow of TL hidden states comes from three parts: Multi-Head Self-Attention (MHSA), FFN, and residual connections. Existing methods neglect the fact that the TL hidden states contains information not specifically required for FFN. Consequently, the performance of model editing decreases. To achieve more precise model editing, we analyze hidden states of MHSA and FFN, finding that MHSA encodes certain general knowledge extraction patterns. This implies that MHSA weights do not require updating when new knowledge is introduced. Based on above findings, we introduce PMET, which simultaneously optimizes Transformer Component (TC, namely MHSA and FFN) hidden states, while only using the optimized TC hidden states of FFN to precisely update FFN weights. Our experiments demonstrate that PMET exhibits state-of-the-art performance on both the COUNTERFACT and zsRE datasets. Our ablation experiments substantiate the effectiveness of our enhancements, further reinforcing the finding that the MHSA encodes certain general knowledge extraction patterns and indicating its storage of a small amount of factual knowledge. Our code is available at https://github.com/xpq-tech/PMET.git.
    摘要 模型编辑技术能够以较低的成本修改大型语言模型(LLM)中一小部分知识,并已取得显著成效。现有方法假设Transformer层(TL)的隐藏状态是前馈网络(FFN)键值记忆中的值。它们通常优化TL隐藏状态以记忆目标知识,并据此更新LLM中FFN的权重。然而,TL隐藏状态的信息来自三个部分:多头自注意力(MHSA)、FFN和残差连接。现有方法忽视了TL隐藏状态中包含并非FFN所特别需要的信息,从而导致模型编辑的性能下降。为实现更精确的模型编辑,我们分析了MHSA和FFN的隐藏状态,发现MHSA编码了某些通用的知识提取模式,这意味着引入新知识时MHSA的权重无需更新。基于上述发现,我们提出了PMET,它同时优化Transformer组件(TC,即MHSA和FFN)的隐藏状态,但仅使用优化后的FFN隐藏状态来精确更新FFN的权重。实验表明,PMET在COUNTERFACT和zsRE数据集上均达到了最先进的性能。消融实验证实了我们改进的有效性,进一步印证了MHSA编码通用知识提取模式的发现,并表明其只存储少量事实知识。我们的代码可在 https://github.com/xpq-tech/PMET.git 获取。
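The final weight-editing step can be illustrated with a rank-one least-squares update that makes a key map to a target value; PMET's actual procedure jointly optimizes the MHSA and FFN hidden states first and is more elaborate than this sketch.

```python
import torch

def rank_one_ffn_update(W, k, v_target):
    """Simplified illustration of editing an FFN weight so that key k
    maps to a target value: a minimal-norm rank-one correction.  PMET's
    actual update uses the jointly optimized hidden states and is more
    involved; this shows only the weight-editing idea."""
    residual = v_target - W @ k                 # what the FFN gets wrong
    delta = torch.outer(residual, k) / (k @ k)  # rank-one correction
    return W + delta

W = torch.randn(8, 16)
k, v = torch.randn(16), torch.randn(8)
W_new = rank_one_ffn_update(W, k, v)
assert torch.allclose(W_new @ k, v, atol=1e-5)  # key now maps to target
```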

ReProHRL: Towards Multi-Goal Navigation in the Real World using Hierarchical Agents

  • paper_url: http://arxiv.org/abs/2308.08737
  • repo_url: None
  • paper_authors: Tejaswini Manjunath, Mozhgan Navardi, Prakhar Dixit, Bharat Prakash, Tinoosh Mohsenin
  • for: 本研究旨在提出一种名为Ready for Production Hierarchical RL(ReProHRL)的方法,用于解决在实际环境中进行多目标导航的学习挑战。
  • methods: 该方法使用目标检测器作为预处理步骤来学习多目标导航,并将其迁移到真实世界。
  • results: 实验结果表明,所提出的ReProHRL方法在模拟环境和真实环境中,无论训练时间还是性能均优于基线。在简单的单目标导航环境中,两种方法均达到100%的成功率;但在更复杂的环境和多目标设定下,所提方法分别比基线方法高出18%和5%。为了进行真实世界部署和概念验证,我们将所提方法部署在名为Crazyflie、配有前置摄像头的纳米无人机上,开展多目标导航实验。
    Abstract Robots have been successfully used to perform tasks with high precision. In real-world environments with sparse rewards and multiple goals, learning is still a major challenge and Reinforcement Learning (RL) algorithms fail to learn good policies. Training in simulation environments and then fine-tuning in the real world is a common approach. However, adapting to the real-world setting is a challenge. In this paper, we present a method named Ready for Production Hierarchical RL (ReProHRL) that divides tasks with hierarchical multi-goal navigation guided by reinforcement learning. We also use object detectors as a pre-processing step to learn multi-goal navigation and transfer it to the real world. Empirical results show that the proposed ReProHRL method outperforms the state-of-the-art baseline in simulation and real-world environments in terms of both training time and performance. Although both methods achieve a 100% success rate in a simple environment for single goal-based navigation, in a more complex environment and multi-goal setting, the proposed method outperforms the baseline by 18% and 5%, respectively. For the real-world implementation and proof of concept demonstration, we deploy the proposed method on a nano-drone named Crazyflie with a front camera to perform multi-goal navigation experiments.
    摘要 机器人已被成功用于高精度地执行任务。在奖励稀疏且存在多个目标的真实环境中,学习仍是一大挑战,强化学习(RL)算法难以学到好的策略。在模拟环境中训练、再在真实世界中微调是一种常见做法,但适应真实环境并不容易。在本文中,我们提出了名为Ready for Production Hierarchical RL(ReProHRL)的方法,通过强化学习引导的分层多目标导航来划分任务。我们还使用目标检测器作为预处理步骤,以学习多目标导航并将其迁移到真实世界。实验结果表明,所提出的ReProHRL方法在模拟环境和真实环境中,无论训练时间还是性能均优于最先进的基线。尽管两种方法在简单的单目标导航环境中都能达到100%的成功率,但在更复杂的环境和多目标设定下,所提方法分别比基线高出18%和5%。为了进行真实世界部署和概念验证,我们将所提方法部署在配有前置摄像头、名为Crazyflie的纳米无人机上,开展多目标导航实验。
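The overall control pattern, a high-level policy choosing goals over detector outputs and a goal-conditioned low-level policy issuing micro-actions, is sketched below with hypothetical stubs for the detector, policies, and environment.

```python
# Minimal hierarchical control loop in the spirit of ReProHRL; the
# detector, the two policies, and the environment are hypothetical stubs.
def detect_objects(observation): ...          # pre-processing step
def high_level_policy(detections): ...        # picks the current goal
def low_level_policy(observation, goal): ...  # goal-conditioned micro-action

def run_episode(env, max_steps=500):
    obs = env.reset()
    for _ in range(max_steps):
        detections = detect_objects(obs)        # e.g. bounding boxes
        goal = high_level_policy(detections)    # abstract, longer timescale
        action = low_level_policy(obs, goal)    # base-level micro-action
        obs, reward, done, _ = env.step(action)
        if done:
            break
```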

LLM-FuncMapper: Function Identification for Interpreting Complex Clauses in Building Codes via LLM

  • paper_url: http://arxiv.org/abs/2308.08728
  • repo_url: None
  • paper_authors: Zhe Zheng, Ke-Yin Chen, Xin-Yu Cao, Xin-Zheng Lu, Jia-Rui Lin
  • for: 本研究的目的是提出一种基于大语言模型(LLM)的方法,用于解释复杂的法规条款。
  • methods: 该方法包括:系统分析建筑规范并定义一系列原子函数,以捕捉隐含属性与复杂约束的共享计算逻辑;并开发带思维链的提示模板和基于分类的调优策略,使通用LLM能够有效识别函数。
  • results: 经统计分析、实验和概念验证,该方法能够准确识别相应的预定义函数,并将可由计算机处理的条款表示为可执行代码,从而解释复杂的法规条款。
    Abstract As a vital stage of automated rule checking (ARC), rule interpretation of regulatory texts requires considerable effort. However, interpreting regulatory clauses with implicit properties or complex computational logic is still challenging due to the lack of domain knowledge and the limited expressibility of conventional logic representations. Thus, LLM-FuncMapper, an approach to identifying the predefined functions needed to interpret various regulatory clauses based on a large language model (LLM), is proposed. First, through systematic analysis of building codes, a series of atomic functions are defined to capture shared computational logics of implicit properties and complex constraints, creating a database of common blocks for interpreting regulatory clauses. Then, a prompt template with a chain of thought is developed and further enhanced with a classification-based tuning strategy, to enable common LLMs to identify functions effectively. Finally, the proposed approach is validated with statistical analysis, experiments, and a proof of concept. Statistical analysis reveals a long-tail distribution and high expressibility of the developed function database, with which almost 100% of computer-processible clauses can be interpreted and represented as computer-executable code. Experiments show that LLM-FuncMapper achieves promising results in identifying relevant predefined functions for rule interpretation. A further proof of concept in automated rule interpretation also demonstrates the potential of LLM-FuncMapper for interpreting complex regulatory clauses. To the best of our knowledge, this study is the first attempt to introduce LLMs for understanding and interpreting complex regulatory clauses, which may shed light on the further adoption of LLMs in the construction domain.
    摘要 为了自动检查规则(ARC)中的规则解释,需要很大的努力。然而,解释法规条款中的隐式属性或复杂计算逻辑仍然是一个挑战,因为缺乏领域知识和限制表达逻辑的表达能力。因此,我们提出了LLM-FuncMapper,一种基于大语言模型(LLM)的方法,用于确定解释法规条款中的预定函数。首先,通过系统性分析法规文本,我们定义了一系列原子函数,用于捕捉不同领域的共享计算逻辑和复杂约束,并建立了一个共享块数据库,用于解释法规条款。然后,我们开发了一个提示模板,并使用分类调整策略,以便使用常见的LLM来确定预定函数。最后,我们验证了我们的方法,通过统计分析、实验和证明原理来证明其效果。统计分析显示,我们建立的函数库具有长尾分布和高表达能力,可以将大多数计算可能的条款解释为计算代码。实验表明,LLM-FuncMapper可以成功地确定解释法规条款中的预定函数。此外,我们还进行了自动规则解释的证明,这表明LLM-FuncMapper可以在解释复杂的法规条款中发挥作用。根据我们所知,这是第一次将大语言模型应用于解释复杂法规条款,这可能会照亮未来在建筑领域中LLM的更多应用。
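The function-identification step can be pictured as prompting the LLM with a catalog of atomic functions and parsing which ones it selects; the registry entries, prompt wording, and `llm` callable below are illustrative stand-ins.

```python
# Sketch of mapping a regulatory clause to predefined atomic functions;
# the registry entries and the llm() callable are hypothetical stand-ins.
ATOMIC_FUNCTIONS = {
    "has_component(obj, part)": "checks whether obj contains part",
    "compare(value, op, threshold)": "numeric comparison against a limit",
    "distance(a, b)": "spatial distance between two elements",
}

def build_prompt(clause: str) -> str:
    catalog = "\n".join(f"- {sig}: {desc}"
                        for sig, desc in ATOMIC_FUNCTIONS.items())
    return (f"Available functions:\n{catalog}\n\n"
            f"Clause: {clause}\n"
            "Think step by step, then list the functions needed.")

def identify_functions(clause, llm):
    answer = llm(build_prompt(clause))          # chain-of-thought response
    return [sig for sig in ATOMIC_FUNCTIONS if sig.split("(")[0] in answer]
```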

A Novel Loss Function Utilizing Wasserstein Distance to Reduce Subject-Dependent Noise for Generalizable Models in Affective Computing

  • paper_url: http://arxiv.org/abs/2308.10869
  • repo_url: None
  • paper_authors: Nibraas Khan, Mahrukh Tauseef, Ritam Ghosh, Nilanjan Sarkar
  • for: 这篇论文旨在提出一个新的损失函数,用于降低生理数据中与受试者相关的噪声,以提高情感识别模型的泛化能力。
  • methods: 这篇论文使用深度学习技术,并借助最优传输理论中的Wasserstein距离来调整受试者相关数据的权重。
  • results: 与传统的 Mean Squared Error 损失函数相比,所提出的损失函数在四个常用数据集上分别取得了14.75%和17.75%的平均提升。
    Abstract Emotions are an essential part of human behavior that can impact thinking, decision-making, and communication skills. Thus, the ability to accurately monitor and identify emotions can be useful in many human-centered applications such as behavioral training, tracking emotional well-being, and development of human-computer interfaces. The correlation between patterns in physiological data and affective states has allowed for the utilization of deep learning techniques which can accurately detect the affective states of a person. However, the generalisability of existing models is often limited by the subject-dependent noise in the physiological data due to variations in a subject's reactions to stimuli. Hence, we propose a novel cost function that employs Optimal Transport Theory, specifically Wasserstein Distance, to scale the importance of subject-dependent data such that higher importance is assigned to patterns in data that are common across all participants while decreasing the importance of patterns that result from subject-dependent noise. The performance of the proposed cost function is demonstrated through an autoencoder with a multi-class classifier attached to the latent space and trained simultaneously to detect different affective states. An autoencoder with a state-of-the-art loss function i.e., Mean Squared Error, is used as a baseline for comparison with our model across four different commonly used datasets. Centroid and minimum distance between different classes are used as a metrics to indicate the separation between different classes in the latent space. An average increase of 14.75% and 17.75% (from benchmark to proposed loss function) was found for minimum and centroid euclidean distance respectively over all datasets.
    摘要 情感是人类行为的重要组成部分,会影响思维、决策和沟通能力。因此,准确监测和识别情感的能力可用于许多以人为中心的应用,如行为训练、情感健康追踪和人机界面开发。生理数据模式与情感状态之间的关联,使得深度学习技术能够准确检测人的情感状态。然而,由于受试者对刺激的反应存在个体差异,生理数据中带有受试者相关的噪声,现有模型的泛化能力往往因此受限。为此,我们提出了一种新的损失函数,利用最优传输理论(具体为Wasserstein距离)来调整受试者相关数据的权重:对所有参与者共有的数据模式赋予更高的重要性,同时降低源自受试者相关噪声的模式的重要性。我们通过一个在隐空间上接有多类分类器、联合训练以检测不同情感状态的自编码器,展示了所提损失函数的性能,并以使用最先进损失函数(即均方误差)的自编码器为基线,在四个常用数据集上与我们的模型进行比较。我们使用不同类别间的质心距离和最小距离作为衡量隐空间中类别分离程度的指标。在所有数据集上,最小欧氏距离和质心欧氏距离(从基线到所提损失函数)分别平均提升了14.75%和17.75%。
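One way to operationalize the idea, down-weighting subjects whose feature distribution deviates from the pooled one, is sketched below using SciPy's 1-D Wasserstein distance; the paper's actual loss couples this with an autoencoder objective and may weight differently.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def subject_weights(features, subject_ids):
    """Down-weights subjects whose (flattened) feature distribution
    deviates from the pooled distribution, echoing the idea of scaling
    the importance of subject-dependent patterns."""
    pooled = features.ravel()
    weights = {}
    for s in np.unique(subject_ids):
        d = wasserstein_distance(features[subject_ids == s].ravel(), pooled)
        weights[s] = 1.0 / (1.0 + d)   # common patterns -> weight near 1
    return weights

def weighted_mse(pred, target, subject_ids, weights):
    w = np.array([weights[s] for s in subject_ids])
    return np.mean(w * (pred - target) ** 2)
```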

EdgeMA: Model Adaptation System for Real-Time Video Analytics on Edge Devices

  • paper_url: http://arxiv.org/abs/2308.08717
  • repo_url: None
  • paper_authors: Liang Wang, Nan Zhang, Xiaoyang Qu, Jianzong Wang, Jiguang Wan, Guokuan Li, Kaiyu Hu, Guilin Jiang, Jing Xiao
  • for: 这篇论文主要关注在对于边缘设备上进行实时影像分析中遇到的挑战,特别是边缘设备通常具有资源短缺的情况下,如何使用深度神经网络(DNN)来进行影像分析。
  • methods: 本论文提出了一个实用且高效的影像分析系统EdgeMA,能够适应影像流随时间的变化,处理资料漂移问题。EdgeMA提取基于灰度共生矩阵(GLCM)的统计纹理特征,并使用Random Forest分类器侦测领域偏移。此外,我们还加入了基于重要性加权的模型更新方法,专门用于更新模型以因应标签分布的变化。
  • results: 经过严谨的实验评估,我们的结果显示EdgeMA可以明显提高推论精度。
    Abstract Real-time video analytics on edge devices for changing scenes remains a difficult task. As edge devices are usually resource-constrained, edge deep neural networks (DNNs) have fewer weights and shallower architectures than general DNNs. As a result, they only perform well in limited scenarios and are sensitive to data drift. In this paper, we introduce EdgeMA, a practical and efficient video analytics system designed to adapt models to shifts in real-world video streams over time, addressing the data drift problem. EdgeMA extracts the gray level co-occurrence matrix based statistical texture feature and uses the Random Forest classifier to detect the domain shift. Moreover, we have incorporated a method of model adaptation based on importance weighting, specifically designed to update models to cope with the label distribution shift. Through rigorous evaluation of EdgeMA on a real-world dataset, our results illustrate that EdgeMA significantly improves inference accuracy.
    摘要 在边缘设备上对变化场景进行实时视频分析仍是一项困难的任务。边缘设备通常资源受限,因此边缘深度神经网络(DNN)的参数更少、结构更浅,只能在有限场景下表现良好,且对数据漂移敏感。在这篇论文中,我们介绍了EdgeMA,一个实用且高效的视频分析系统,旨在让模型随时间适应真实视频流的变化,解决数据漂移问题。EdgeMA提取基于灰度共生矩阵(GLCM)的统计纹理特征,并使用Random Forest分类器来检测域偏移。此外,我们还引入了基于重要性加权的模型自适应方法,专门用于更新模型以应对标签分布偏移。经过在真实数据集上的严格评估,结果表明EdgeMA能显著提高推理精度。
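The drift-detection pipeline, GLCM texture features fed to a Random Forest, can be sketched directly with scikit-image and scikit-learn (function names assume scikit-image >= 0.19); the frames and domain labels below are placeholders.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops  # scikit-image >= 0.19
from sklearn.ensemble import RandomForestClassifier

def glcm_features(gray_frame):
    """Statistical texture features from the gray-level co-occurrence
    matrix, the representation EdgeMA uses to detect domain shift."""
    glcm = graycomatrix(gray_frame, distances=[1],
                        angles=[0, np.pi / 2], levels=256, normed=True)
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])

# Train a Random Forest on frames labelled by domain (e.g. day vs. night);
# a predicted domain change then signals that the model should adapt.
frames = np.random.randint(0, 256, size=(20, 64, 64), dtype=np.uint8)
domains = np.random.randint(0, 2, size=20)     # placeholder labels
clf = RandomForestClassifier(n_estimators=50).fit(
    np.stack([glcm_features(f) for f in frames]), domains)
```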

Probabilistic Results on the Architecture of Mathematical Reasoning Aligned by Cognitive Alternation

  • paper_url: http://arxiv.org/abs/2308.08714
  • repo_url: None
  • paper_authors: Minzheng Li, Xiangzhong Fang, Haixin Yang
  • for: 本研究旨在构想一种能够解决数学问题的机器。
  • methods: 该研究将定量推理系统分为思维过程和认知过程两部分,并对该架构给出概率性描述。
  • results: 研究者提供了对思维过程和认知过程进行建模的概率性描述,以支撑构建能够解决数学问题的机器这一目标。
    Abstract We envision a machine capable of solving mathematical problems. Dividing the quantitative reasoning system into two parts: thought processes and cognitive processes, we provide probabilistic descriptions of the architecture.
    摘要 我们构想了一种能够解决数学问题的机器。我们将定量推理系统分为两部分:思维过程和认知过程,并给出了该架构的概率性描述。

Consciousness in Artificial Intelligence: Insights from the Science of Consciousness

  • paper_url: http://arxiv.org/abs/2308.08708
  • repo_url: None
  • paper_authors: Patrick Butlin, Robert Long, Eric Elmoznino, Yoshua Bengio, Jonathan Birch, Axel Constant, George Deane, Stephen M. Fleming, Chris Frith, Xu Ji, Ryota Kanai, Colin Klein, Grace Lindsay, Matthias Michel, Liad Mudrik, Megan A. K. Peters, Eric Schwitzgebel, Jonathan Simon, Rufin VanRullen
  • for: 本研究旨在探讨现有的人工智能系统是否具备意识性,并采用科学理论和实验来评估这些系统。
  • methods: 本研究使用了多种科学理论,包括反复处理理论、全局工作区理论、更高级别理论、预测处理和注意Schema理论,从这些理论中提取了”指示性特征”,用于评估人工智能系统是否拥有意识性。
  • results: 本研究发现现有的人工智能系统没有意识性,但也没有技术障碍建立意识性的AI系统。
    Abstract Whether current or near-term AI systems could be conscious is a topic of scientific interest and increasing public concern. This report argues for, and exemplifies, a rigorous and empirically grounded approach to AI consciousness: assessing existing AI systems in detail, in light of our best-supported neuroscientific theories of consciousness. We survey several prominent scientific theories of consciousness, including recurrent processing theory, global workspace theory, higher-order theories, predictive processing, and attention schema theory. From these theories we derive "indicator properties" of consciousness, elucidated in computational terms that allow us to assess AI systems for these properties. We use these indicator properties to assess several recent AI systems, and we discuss how future systems might implement them. Our analysis suggests that no current AI systems are conscious, but also suggests that there are no obvious technical barriers to building AI systems which satisfy these indicators.
    摘要 当前或近期的AI系统是否可能具有意识,是一个具有科学意义且日益受到公众关注的话题。本报告主张并示范了一种严谨且基于实证的AI意识研究方法:依据我们获得最佳支持的意识神经科学理论,详细评估现有的AI系统。我们考察了多种著名的意识科学理论,包括循环加工理论、全局工作空间理论、高阶理论、预测加工和注意图式理论。我们从这些理论中推导出意识的"指标属性",并以计算术语加以阐明,从而能够据此评估AI系统。我们使用这些指标属性评估了若干近期的AI系统,并讨论了未来系统可能如何实现这些属性。我们的分析表明,目前没有AI系统具有意识;但同时也表明,构建满足这些指标的AI系统并不存在明显的技术障碍。

Improving Anomaly Segmentation with Multi-Granularity Cross-Domain Alignment

  • paper_url: http://arxiv.org/abs/2308.08696
  • repo_url: None
  • paper_authors: Ji Zhang, Xiao Wu, Zhi-Qi Cheng, Qi He, Wei Li
  • for: 这个论文的目的是提出一个基于多 Granularity Cross-Domain Alignment (MGCDA) 框架的 anomaly segmentation 方法,以便在自驾车中探测道路上的异常现象。
  • methods: 这个方法使用了一个新的 Multi-source Domain Adversarial Training (MDAT) 模组和一种 Cross-domain Anomaly-aware Contrastive Learning (CACL) 方法来提升模型的泛化能力。两者在场景与样本两个层面无缝整合多域数据,并使模型在推理阶段无需额外参数。
  • results: 实验结果显示,提出的方法在 Fishyscapes 和 RoadAnomaly 资料集上取得了现有方法的最佳性能。
    Abstract Anomaly segmentation plays a crucial role in identifying anomalous objects within images, which facilitates the detection of road anomalies for autonomous driving. Although existing methods have shown impressive results in anomaly segmentation using synthetic training data, the domain discrepancies between synthetic training data and real test data are often neglected. To address this issue, the Multi-Granularity Cross-Domain Alignment (MGCDA) framework is proposed for anomaly segmentation in complex driving environments. It uniquely combines a new Multi-source Domain Adversarial Training (MDAT) module and a novel Cross-domain Anomaly-aware Contrastive Learning (CACL) method to boost the generality of the model, seamlessly integrating multi-domain data at both scene and sample levels. Multi-source domain adversarial loss and a dynamic label smoothing strategy are integrated into the MDAT module to facilitate the acquisition of domain-invariant features at the scene level, through adversarial training across multiple stages. CACL aligns sample-level representations with contrastive loss on cross-domain data, which utilizes an anomaly-aware sampling strategy to efficiently sample hard samples and anchors. The proposed framework has decent properties of parameter-free during the inference stage and is compatible with other anomaly segmentation networks. Experimental conducted on Fishyscapes and RoadAnomaly datasets demonstrate that the proposed framework achieves state-of-the-art performance.
    摘要 多粒度跨域对齐(MGCDA)框架旨在解决使用合成训练数据进行异常分割时的域差异问题。MGCDA结合了新颖的多源域对抗训练(MDAT)模块与跨域异常感知对比学习(CACL)方法,以提升模型的泛化能力,在场景与样本两个层面无缝整合多域数据。MDAT模块利用多源域对抗损失和动态标签平滑策略,通过多阶段对抗训练在场景级获得域不变特征。CACL方法则利用异常感知采样策略高效地采样困难样本和锚点,借助对比损失对跨域数据的样本级表示进行对齐。该框架在推理阶段无需额外参数,并可与其他异常分割网络兼容。在Fishyscapes和RoadAnomaly数据集上的实验结果表明,所提框架达到了最先进的性能。

Planning in the imagination: High-level planning on learned abstract search spaces

  • paper_url: http://arxiv.org/abs/2308.08693
  • repo_url: None
  • paper_authors: Carlos Martin, Tuomas Sandholm
  • for: 提出了一种名为PiZero的新方法,使智能体能够在自己创建的、与真实环境完全解耦的抽象搜索空间中进行规划。
  • methods: 与先前的方法不同,PiZero允许智能体在任意时间尺度上进行高层规划,并以复合或时间扩展的动作进行推理;此外,它还能处理连续动作空间和部分可观测的场景。
  • results: 在包括导航任务和Sokoban在内的多个领域的实验中,PiZero在不依赖环境模拟器的情况下优于可比的先前方法。
    Abstract We propose a new method, called PiZero, that gives an agent the ability to plan in an abstract search space of its own creation that is completely decoupled from the real environment. Unlike prior approaches, this enables the agent to perform high-level planning at arbitrary timescales and reason in terms of compound or temporally-extended actions, which can be useful in environments where large numbers of base-level micro-actions are needed to perform relevant macro-actions. In addition, our method is more general than comparable prior methods because it handles settings with continuous action spaces and partial observability. We evaluate our method on multiple domains, including navigation tasks and Sokoban. Experimentally, it outperforms comparable prior methods without assuming access to an environment simulator.
    摘要 我们提出一种名为PiZero的新方法,使智能体能够在自己创建的、与真实环境完全解耦的抽象搜索空间中进行规划。与先前的方法不同,这使智能体能够在任意时间尺度上进行高层规划,并以复合或时间扩展的动作进行推理,这在需要大量基础级微动作才能完成相关宏动作的环境中尤为有用。此外,我们的方法比可比的先前方法更通用,因为它能处理连续动作空间和部分可观测的场景。我们在包括导航任务和Sokoban在内的多个领域进行了实验,结果表明PiZero在不假设可访问环境模拟器的情况下优于可比的先前方法。

Lightweight Adaptation of Neural Language Models via Subspace Embedding

  • paper_url: http://arxiv.org/abs/2308.08688
  • repo_url: https://github.com/amitkumarj441/cikm2023_subspaceembedding
  • paper_authors: Amit Kumar Jaiswal, Haiming Liu
  • for: 这个论文的目的是提出一种新的嵌入结构,以减少预训练语言模型的存储占用空间,同时保持模型的精度。
  • methods: 该论文提出一种基于子空间嵌入的结构,依据预训练语言模型中token之间的上下文关系进行嵌入向量的分配与重建,从而实现嵌入矩阵的压缩。
  • results: 实验结果显示,使用该嵌入结构可以将预训练语言模型的嵌入矩阵压缩至99.8%以上,而且保持模型的精度。
    Abstract Traditional neural word embeddings are usually dependent on a richer diversity of vocabulary. However, language models tend to cover major vocabularies via the word embedding parameters; in particular, multilingual language models generally devote a significant part of their overall learning parameters to embeddings. In this work, we present a new compact embedding structure that reduces the memory footprint of pre-trained language models at a cost of up to 4% absolute accuracy. The embedding vectors are reconstructed from a set of subspace embeddings and an assignment procedure based on the contextual relationships among tokens from pre-trained language models. The subspace embedding structure is calibrated to masked language models, and we evaluate our compact embedding structure on similarity and textual entailment tasks as well as sentence and paraphrase tasks. Our experimental evaluation shows that the subspace embeddings achieve compression rates beyond 99.8% in comparison with the original embeddings for the language models on the XNLI and GLUE benchmark suites.
    摘要 传统的神经词嵌入通常依赖于更丰富多样的词汇。然而,语言模型倾向于通过词嵌入参数覆盖主要词汇,对于多语言语言模型尤其如此,其词嵌入通常占总学习参数的很大一部分。在本工作中,我们提出了一种新的紧凑嵌入结构,以最多牺牲4%绝对精度为代价,减少预训练语言模型的内存占用。嵌入向量的重建基于一组子空间嵌入,并依据预训练语言模型中token之间的上下文关系进行分配。我们在相似度与文本蕴含、句子与复述等任务上评估了该紧凑嵌入结构。实验评估表明,在XNLI和GLUE基准上,子空间嵌入相较原始嵌入实现了超过99.8%的压缩率。
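As a toy illustration of trading a large embedding matrix for shared vectors plus assignments, one can quantize embedding rows against a small codebook; note the paper derives assignments from contextual relationships, which this k-means stand-in ignores.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy illustration: replace a large embedding matrix with a small set of
# shared subspace vectors plus one per-token assignment index.
vocab, dim, n_subspaces = 5000, 64, 128
E = np.random.randn(vocab, dim).astype(np.float32)

km = KMeans(n_clusters=n_subspaces, n_init=4, random_state=0).fit(E)
assignments = km.labels_                 # one small int per token
codebook = km.cluster_centers_           # n_subspaces x dim floats

E_compressed = codebook[assignments]     # reconstructed embeddings
ratio = (codebook.size + assignments.size) / E.size
print(f"stored parameters: {ratio:.1%} of the original matrix")
```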

Quantifying Overfitting: Introducing the Overfitting Index

  • paper_url: http://arxiv.org/abs/2308.08682
  • repo_url: None
  • paper_authors: Sanad Aburass
  • for: 本研究旨在提供一个量化评估模型过拟合情况的指标,以便提高模型在实际应用中的效能。
  • methods: 本研究使用了多种模型架构,包括 MobileNet、U-Net、ResNet、Darknet 和 ViT-32,进行实验,并使用了数据增强技术来确认模型的过拟合情况。
  • results: 研究结果显示,不同的模型架构在不同的数据集上 exhibits 不同的过拟合情况,而数据增强技术对于小型和特殊的数据集有更大的稳定化效果。 ViT-32 模型在 MNIST 数据集上的表现也显示了某些模型在实际应用中的强健性和数据集的完整性。
    Abstract In the rapidly evolving domain of machine learning, ensuring model generalizability remains a quintessential challenge. Overfitting, where a model exhibits superior performance on training data but falters on unseen data, is a recurrent concern. This paper introduces the Overfitting Index (OI), a novel metric devised to quantitatively assess a model's tendency to overfit. Through extensive experiments on the Breast Ultrasound Images Dataset (BUS) and the MNIST dataset using architectures such as MobileNet, U-Net, ResNet, Darknet, and ViT-32, we illustrate the utility and discernment of the OI. Our results underscore the variable overfitting behaviors across architectures and highlight the mitigative impact of data augmentation, especially on smaller and more specialized datasets. The ViT-32's performance on MNIST further emphasizes the robustness of certain models and the dataset's comprehensive nature. By providing an objective lens to gauge overfitting, the OI offers a promising avenue to advance model optimization and ensure real-world efficacy.
    摘要 在快速发展的机器学习领域,保证模型的泛化能力始终是一个核心挑战。过拟合是一个反复出现的问题,即模型在训练数据上表现出色,却在未见数据上表现不佳。本文提出了过拟合指数(Overfitting Index,OI),一种用于定量评估模型过拟合倾向的新指标。我们在Breast Ultrasound Images Dataset(BUS)和MNIST数据集上,使用MobileNet、U-Net、ResNet、Darknet和ViT-32等架构进行了大量实验,展示了OI的实用性和判别力。结果显示不同架构的过拟合行为各不相同,并突显了数据增强的缓解作用,尤其是在较小且更专门的数据集上。ViT-32在MNIST上的表现进一步印证了某些模型的稳健性以及该数据集的全面性。OI为衡量过拟合提供了客观视角,是推进模型优化并确保实际效果的一条有前景的途径。
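The two latent-space separation metrics mentioned in the entry above, centroid distance and minimum inter-class distance, are straightforward to compute, as sketched below.

```python
import numpy as np

def class_separation(latents, labels):
    """Centroid and minimum inter-class Euclidean distances in a latent
    space, the two separation metrics referenced in the summary above."""
    classes = np.unique(labels)
    centroids = {c: latents[labels == c].mean(axis=0) for c in classes}
    centroid_d, min_d = [], []
    for i, a in enumerate(classes):
        for b in classes[i + 1:]:
            centroid_d.append(np.linalg.norm(centroids[a] - centroids[b]))
            diffs = latents[labels == a][:, None, :] - latents[labels == b]
            min_d.append(np.linalg.norm(diffs, axis=-1).min())
    return np.mean(centroid_d), np.mean(min_d)

z, y = np.random.randn(100, 8), np.random.randint(0, 3, 100)
print(class_separation(z, y))
```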

Answering Ambiguous Questions with a Database of Questions, Answers, and Revisions

  • paper_url: http://arxiv.org/abs/2308.08661
  • repo_url: None
  • paper_authors: Haitian Sun, William W. Cohen, Ruslan Salakhutdinov
  • for: answering ambiguous questions
  • methods: exploiting a database of unambiguous questions generated from Wikipedia
  • results: improved performance by 15% (relative improvement) on recall measures and 10% on measures which evaluate disambiguating questions from predicted outputs, as well as large improvements in diverse passage retrieval.
  • for: 回答不确定的问题
  • methods: 利用wikipedia中生成的明确问题数据库
  • results: 提高了15%的准确率和10%的解释问题率,以及大幅提高多个段落检索率。
    Abstract Many open-domain questions are under-specified and thus have multiple possible answers, each of which is correct under a different interpretation of the question. Answering such ambiguous questions is challenging, as it requires retrieving and then reasoning about diverse information from multiple passages. We present a new state-of-the-art for answering ambiguous questions that exploits a database of unambiguous questions generated from Wikipedia. On the challenging ASQA benchmark, which requires generating long-form answers that summarize the multiple answers to an ambiguous question, our method improves performance by 15% (relative improvement) on recall measures and 10% on measures which evaluate disambiguating questions from predicted outputs. Retrieving from the database of generated questions also gives large improvements in diverse passage retrieval (by matching user questions q to passages p indirectly, via questions q' generated from p).
    摘要 许多开放域问题的描述不够充分,因此存在多个可能的答案,每个答案在问题的某种解释下都是正确的。回答此类模糊问题颇具挑战,因为需要从多个段落中检索并推理多样的信息。我们提出了一种回答模糊问题的新的最先进方法,它利用了由维基百科生成的无歧义问题数据库。在具有挑战性的ASQA基准上(该基准要求生成总结模糊问题多个答案的长形式回答),我们的方法在召回指标上取得了15%的相对提升,在评估从预测输出中辨析歧义问题的指标上提升了10%。从生成的问题数据库中检索还显著改进了多样化的段落检索(通过由段落p生成的问题q'间接地将用户问题q与段落p匹配)。
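The indirect retrieval trick, matching the user question against generated questions q' and returning their source passages, can be sketched with any retriever; TF-IDF below is an illustrative stand-in for whatever the paper actually uses, and the data is made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sketch of indirect passage retrieval: match the user question q against
# unambiguous questions q' generated from passages, then return the
# source passages behind the best-matching q'.
generated = [("Who wrote Hamlet?", "passage_12"),
             ("When was Hamlet first performed?", "passage_47")]
questions = [q for q, _ in generated]

vec = TfidfVectorizer().fit(questions)
index = vec.transform(questions)

def retrieve(user_question, k=1):
    sims = cosine_similarity(vec.transform([user_question]), index)[0]
    top = sims.argsort()[::-1][:k]
    return [generated[i][1] for i in top]

print(retrieve("who is the author of hamlet"))
```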

Towards Zero Memory Footprint Spiking Neural Network Training

  • paper_url: http://arxiv.org/abs/2308.08649
  • repo_url: None
  • paper_authors: Bin Lei, Sheng Lin, Pei-Hung Lin, Chunhua Liao, Caiwen Ding
  • for: 本论文旨在解决SNN训练中的内存约束问题。
  • methods: 本论文使用一种新的可逆架构和一种新的反向传播算法来降低SNN训练的内存占用。
  • results: 实验结果表明,采用这种新架构和算法可显著降低SNN训练的内存占用,同时还能缩短训练时间。具体而言,内存占用可减少 $\mathbf{58.65\times}$,训练时间可减少 $\mathbf{23.8%}$。
    Abstract Biologically-inspired Spiking Neural Networks (SNNs), processing information using discrete-time events known as spikes rather than continuous values, have garnered significant attention due to their hardware-friendly and energy-efficient characteristics. However, the training of SNNs necessitates a considerably large memory footprint, given the additional storage requirements for spikes or events, leading to a complex structure and dynamic setup. In this paper, to address memory constraint in SNN training, we introduce an innovative framework, characterized by a remarkably low memory footprint. We \textbf{(i)} design a reversible SNN node that retains a high level of accuracy. Our design is able to achieve a $\mathbf{58.65\times}$ reduction in memory usage compared to the current SNN node. We \textbf{(ii)} propose a unique algorithm to streamline the backpropagation process of our reversible SNN node. This significantly trims the backward Floating Point Operations Per Second (FLOPs), thereby accelerating the training process in comparison to current reversible layer backpropagation method. By using our algorithm, the training time is able to be curtailed by $\mathbf{23.8\%}$ relative to existing reversible layer architectures.
    摘要 受生物启发的脉冲神经网络(SNN)使用称为脉冲的离散时间事件而非连续值来处理信息,因其硬件友好和高能效的特点受到广泛关注。然而,由于需要额外存储脉冲或事件,SNN的训练需要相当大的内存占用,导致结构复杂、配置动态。在本文中,为解决SNN训练中的内存约束,我们提出了一个内存占用极低的创新框架。我们(i)设计了一种保持高精度的可逆SNN节点,相比现有SNN节点可实现 $\mathbf{58.65\times}$ 的内存占用缩减;(ii)提出了一种独特算法来简化可逆SNN节点的反向传播过程,显著削减反向浮点运算量(FLOPs),从而相比现有可逆层反向传播方法加速训练。使用我们的算法,训练时间相比现有可逆层架构可缩短 $\mathbf{23.8%}$。
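The memory saving of reversible layers comes from recomputing inputs from outputs instead of storing activations. The RevNet-style pairing below illustrates the mechanism; the paper adapts this idea to spiking neurons, which this sketch does not model.

```python
import torch

class ReversibleBlock:
    """RevNet-style reversible pairing: activations need not be stored
    because inputs can be recomputed exactly from outputs during the
    backward pass.  F and G are arbitrary sub-functions here."""
    def __init__(self, F, G):
        self.F, self.G = F, G

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):          # exact input reconstruction
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2

blk = ReversibleBlock(torch.sin, torch.tanh)
x = (torch.randn(4), torch.randn(4))
assert all(torch.allclose(a, b, atol=1e-6)
           for a, b in zip(x, blk.inverse(*blk.forward(*x))))
```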

FedPop: Federated Population-based Hyperparameter Tuning

  • paper_url: http://arxiv.org/abs/2308.08634
  • repo_url: None
  • paper_authors: Haokun Chen, Denis Krompass, Jindong Gu, Volker Tresp
  • for: 本研究旨在提高 Federated Learning (FL) 中 hyperparameter (HP) 的优化,以提高 FL 的性能和可扩展性。
  • methods: 本研究提出了一种新的HP优化算法:联邦种群式超参数调优(FedPop),使用基于种群的进化算法优化HP,并采用在线"边训练边调优"框架,以提高计算效率并更充分地探索HP搜索空间。
  • results: 在常见的FL基准和复杂的真实世界FL数据集上的验证表明,所提方法显著优于当前最先进的HP调优方法,提升了FL的性能。
    Abstract Federated Learning (FL) is a distributed machine learning (ML) paradigm, in which multiple clients collaboratively train ML models without centralizing their local data. Similar to conventional ML pipelines, the client local optimization and server aggregation procedure in FL are sensitive to the hyperparameter (HP) selection. Despite extensive research on tuning HPs for centralized ML, these methods yield suboptimal results when employed in FL. This is mainly because their "training-after-tuning" framework is unsuitable for FL with limited client computation power. While some approaches have been proposed for HP-Tuning in FL, they are limited to the HPs for client local updates. In this work, we propose a novel HP-tuning algorithm, called Federated Population-based Hyperparameter Tuning (FedPop), to address this vital yet challenging problem. FedPop employs population-based evolutionary algorithms to optimize the HPs, which accommodates various HP types at both client and server sides. Compared with prior tuning methods, FedPop employs an online "tuning-while-training" framework, offering computational efficiency and enabling the exploration of a broader HP search space. Our empirical validation on the common FL benchmarks and complex real-world FL datasets demonstrates the effectiveness of the proposed method, which substantially outperforms the concurrent state-of-the-art HP tuning methods for FL.
    摘要 联邦学习(FL)是一种分布式机器学习范式,多个客户端在不集中本地数据的情况下协作训练机器学习模型。与传统的ML流程类似,FL中客户端本地优化和服务器聚合过程对超参数(HP)的选择较为敏感。尽管针对集中式ML的HP调优已有大量研究,这些方法在FL中往往效果欠佳,主要因为其"先调优后训练"框架不适合客户端算力有限的FL场景;现有的FL HP调优方法也仅限于客户端本地更新的HP。为此,我们提出了FedPop,它使用基于种群的进化算法来优化HP,可同时处理客户端和服务器两侧的多种HP类型。与先前的调优方法相比,FedPop采用在线"边训练边调优"框架,兼具计算效率并能探索更广阔的HP搜索空间。我们在常见的FL基准和复杂的真实世界FL数据集上进行了实验验证,结果表明所提方法明显优于当前最先进的FL HP调优方法。
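One exploit/explore step of population-based tuning, bottom configurations copying and perturbing a top configuration, is sketched below; how FedPop schedules this alongside federated rounds and which HPs it covers are left abstract here.

```python
import random

def evolve_population(population, scores, perturb=0.2):
    """One exploit/explore step of population-based tuning: the worst
    configurations copy a good one's hyperparameters and perturb them.
    In a FedPop-style setup this would run online, interleaved with
    federated training rounds."""
    ranked = sorted(range(len(population)), key=lambda i: scores[i])
    n = max(1, len(population) // 4)
    for bad in ranked[:n]:                       # bottom quartile
        good = random.choice(ranked[-n:])        # copy a top config
        population[bad] = {
            k: v * random.uniform(1 - perturb, 1 + perturb)
            for k, v in population[good].items()
        }
    return population

pop = [{"lr": random.uniform(1e-4, 1e-1), "mu": random.uniform(0, 1)}
       for _ in range(8)]
pop = evolve_population(pop, scores=[random.random() for _ in pop])
```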

LSTM-Based Forecasting Model for GRACE Accelerometer Data

  • paper_url: http://arxiv.org/abs/2308.08621
  • repo_url: https://github.com/darbeheshti/lstm-based-analysis-for-grace-accelerometers
  • paper_authors: Neda Darbeheshti, Elahe Moradi
  • for: The paper addresses monitoring variations in Earth's gravity field and filling accelerometer data gaps in the GRACE and GRACE Follow-On satellite missions.
  • methods: The paper uses Long Short-Term Memory (LSTM) networks to train a model capable of predicting accelerometer data for all three axes.
  • results: The model demonstrates effectiveness in filling gaps and forecasting GRACE accelerometer data, with accurate predictions for the three axes.
    Abstract The Gravity Recovery and Climate Experiment (GRACE) satellite mission, spanning from 2002 to 2017, has provided a valuable dataset for monitoring variations in Earth's gravity field, enabling diverse applications in geophysics and hydrology. The mission was followed by GRACE Follow-On in 2018, continuing data collection efforts. The monthly Earth gravity field, derived from the integration different instruments onboard satellites, has shown inconsistencies due to various factors, including gaps in observations for certain instruments since the beginning of the GRACE mission. With over two decades of GRACE and GRACE Follow-On data now available, this paper proposes an approach to fill the data gaps and forecast GRACE accelerometer data. Specifically, we focus on accelerometer data and employ Long Short-Term Memory (LSTM) networks to train a model capable of predicting accelerometer data for all three axes. In this study, we describe the methodology used to preprocess the accelerometer data, prepare it for LSTM training, and evaluate the model's performance. Through experimentation and validation, we assess the model's accuracy and its ability to predict accelerometer data for the three axes. Our results demonstrate the effectiveness of the LSTM forecasting model in filling gaps and forecasting GRACE accelerometer data.
    摘要 重力恢复与气候实验(GRACE)卫星任务从2002年持续到2017年,为监测地球重力场变化提供了宝贵的数据集,支撑了地球物理和水文领域的多种应用。该任务由2018年发射的GRACE Follow-On接续,继续开展数据采集。由卫星搭载的不同仪器综合得到的月度地球重力场存在不一致,原因包括自GRACE任务开始以来部分仪器的观测空缺等多种因素。如今已有超过二十年的GRACE和GRACE Follow-On数据可用,本文提出了一种填补数据空缺并预测GRACE加速度计数据的方法。具体而言,我们聚焦加速度计数据,采用长短期记忆(LSTM)网络训练能够预测全部三个轴加速度计数据的模型。本研究介绍了加速度计数据的预处理方法、LSTM训练数据的准备流程,以及模型性能的评估。通过实验和验证,我们评估了模型的准确性及其预测三个轴加速度计数据的能力。结果表明,LSTM预测模型能够有效地填补空缺并预测GRACE加速度计数据。
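A minimal version of the forecasting model, an LSTM mapping a window of 3-axis readings to the next sample, might look as follows; the window length, hidden size, and training details are illustrative choices, not those from the repository.

```python
import torch
import torch.nn as nn

class AccForecaster(nn.Module):
    """Sketch of an LSTM that maps a window of 3-axis accelerometer
    readings to the next time step for all three axes."""
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=3, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 3)

    def forward(self, x):              # x: (batch, window, 3)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])   # predict the next (x, y, z) sample

model = AccForecaster()
window = torch.randn(32, 120, 3)       # 120-step windows, 3 axes
loss = nn.functional.mse_loss(model(window), torch.randn(32, 3))
loss.backward()                        # one supervised training step
```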

Boosting Logical Reasoning in Large Language Models through a New Framework: The Graph of Thought

  • paper_url: http://arxiv.org/abs/2308.08614
  • repo_url: None
  • paper_authors: Bin Lei, pei-Hung Lin, Chunhua Liao, Caiwen Ding
  • for: Improving the multi-step logical reasoning ability of large-scale models on complex problems.
  • methods: Proposes the Graph of Thoughts (GoT) prompting technique.
  • results: Outperforms GPT-4 by $89.7\%$, $86\%$, and $56\%$ on the three tasks, and exceeds the SOTA prompting method ToT by an average of $23\%$, $24\%$, and $15\%$.
    Abstract Recent advancements in large-scale models, such as GPT-4, have showcased remarkable capabilities in addressing standard queries. However, when facing complex problems that require multi-step logical reasoning, their accuracy dramatically decreases. Current research has explored the realm of \textit{prompting engineering} to bolster the inferential capacities of these models. Our paper unveils a pioneering prompting technique, dubbed \textit{Graph of Thoughts (GoT)}. Through testing on a trio of escalating challenges: the 24-point game, resolution of high-degree polynomial equations, and derivation of formulas for recursive sequences, our method outperformed GPT-4, achieving accuracy improvements of $89.7\%$, $86\%$, and $56\%$ for each respective task. Moreover, when juxtaposed with the state-of-the-art (SOTA) prompting method, \textit{Tree of Thought (ToT)}, our approach registered an average accuracy boost of $23\%$, $24\%$, and $15\%$.

Integrating Renewable Energy in Agriculture: A Deep Reinforcement Learning-based Approach

  • paper_url: http://arxiv.org/abs/2308.08611
  • repo_url: None
  • paper_authors: A. Wahid, I faiud, K. Mason
  • for: Optimizing decision-making for photovoltaic (PV) system installations in the agriculture sector, helping agricultural investors make data-driven decisions.
  • methods: Uses a Deep Q-Network (DQN) with a reward mechanism so that the agent learns data-driven decisions on PV integration.
  • results: Provides a comprehensive understanding of how DQNs can support agricultural investors in making profitable PV installation decisions that improve energy efficiency, reduce environmental impact, and enhance profitability.
    Abstract This article investigates the use of Deep Q-Networks (DQNs) to optimize decision-making for photovoltaic (PV) systems installations in the agriculture sector. The study develops a DQN framework to assist agricultural investors in making informed decisions considering factors such as installation budget, government incentives, energy requirements, system cost, and long-term benefits. By implementing a reward mechanism, the DQN learns to make data-driven decisions on PV integration. The analysis provides a comprehensive understanding of how DQNs can support investors in making decisions about PV installations in agriculture. This research has significant implications for promoting sustainable and efficient farming practices while also paving the way for future advancements in this field. By leveraging DQNs, agricultural investors can make optimized decisions that improve energy efficiency, reduce environmental impact, and enhance profitability. This study contributes to the advancement of PV integration in agriculture and encourages further innovation in this promising area.
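
The decision loop the abstract describes can be sketched as a standard DQN update. Everything domain-specific here (the three-dimensional state, the three actions, and the reward inside `toy_env`) is a hypothetical stand-in for the paper's budget/incentive/energy formulation, and the replay buffer and target network of a full DQN are omitted for brevity.

```python
import random
import torch
import torch.nn as nn

# Illustrative state: (budget, incentive level, energy requirement), normalized.
# Illustrative actions: 0 = wait, 1 = install small PV, 2 = install large PV.
q_net = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 3))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.95

def toy_env(state, action):
    """Hypothetical reward: larger installs pay off when incentives and demand
    are high, and cost more when the budget is low."""
    budget, incentive, demand = state.tolist()
    reward = action * 0.5 * (incentive + demand) - action * 0.3 * (1 - budget)
    return reward, torch.rand(3)              # next decision context (stand-in)

state = torch.rand(3)
for step in range(500):
    # Epsilon-greedy action selection.
    if random.random() < 0.1:
        action = random.randrange(3)
    else:
        action = q_net(state).argmax().item()
    reward, next_state = toy_env(state, action)
    # One-step TD target, as in standard DQN.
    target = reward + gamma * q_net(next_state).max().detach()
    loss = (q_net(state)[action] - target) ** 2
    opt.zero_grad(); loss.backward(); opt.step()
    state = next_state
```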

FootGPT : A Large Language Model Development Experiment on a Minimal Setting

  • paper_url: http://arxiv.org/abs/2308.08610
  • repo_url: None
  • paper_authors: Eren Unlu
  • for: Developing a special-purpose language model for interpreting soccer data under constrained resources.
  • methods: Fine-tunes a pre-trained one-billion-parameter general-purpose causal language model on match data from the first ten game weeks of the Italian football league, using low-rank adaptation.
  • results: Shows that an accurate special-purpose language model for interpreting soccer data can be developed with limited resources and a short training duration.
    Abstract With recent empirical observations, it has been argued that the most significant aspect of developing accurate language models may be the proper dataset content and training strategy, compared to the number of neural parameters, training duration or dataset size. Following this argument, we opted to fine-tune a pre-trained general-purpose causal language model with one billion parameters using low-rank adaptation, on a dataset curated from team statistics of the first ten game weeks of the Italian football league. The limited training dataset was compiled within a framework in which a powerful commercial large language model provides distilled paragraphs and question-answer pairs. The training duration was kept relatively short to provide a basis for our minimal-setting exploration. We share our key observations on the process of developing a specific-purpose language model intended to interpret soccer data with constrained resources in this article.
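
Low-rank adaptation of a ~1B-parameter causal LM takes only a few lines with Hugging Face `peft`. The base checkpoint below (`EleutherAI/pythia-1b`) is a placeholder, as the paper does not name its model here, `target_modules` depends on the architecture, and the LoRA ranks are typical defaults rather than the paper's settings.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "EleutherAI/pythia-1b"          # placeholder ~1B-parameter causal LM
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["query_key_value"],  # attention projection name in GPT-NeoX models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()       # only the low-rank adapters are trainable
```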

TeCH: Text-guided Reconstruction of Lifelike Clothed Humans

  • paper_url: http://arxiv.org/abs/2308.08545
  • repo_url: https://github.com/huangyangyi/tech
  • paper_authors: Yangyi Huang, Hongwei Yi, Yuliang Xiu, Tingting Liao, Jiaxiang Tang, Deng Cai, Justus Thies
  • for: Reconstructing the "unseen regions" of clothed humans from a single image, i.e., accurately restoring them with high-level detail.
  • methods: Uses descriptive text prompts (e.g., garments, colors, hairstyles) generated automatically via a garment parsing model and Visual Question Answering (VQA), together with a personalized fine-tuned text-to-image diffusion (T2I) model.
  • results: Proposes a DMTet-based hybrid 3D representation whose geometry and texture are optimized through multi-view Score Distillation Sampling (SDS) and reconstruction losses; experiments show that TeCH produces high-fidelity 3D clothed humans with consistent, delicate texture and detailed full-body geometry.
    Abstract Despite recent research advancements in reconstructing clothed humans from a single image, accurately restoring the "unseen regions" with high-level details remains an unsolved challenge that lacks attention. Existing methods often generate overly smooth back-side surfaces with a blurry texture. But how to effectively capture all visual attributes of an individual from a single image, which are sufficient to reconstruct unseen areas (e.g., the back view)? Motivated by the power of foundation models, TeCH reconstructs the 3D human by leveraging 1) descriptive text prompts (e.g., garments, colors, hairstyles) which are automatically generated via a garment parsing model and Visual Question Answering (VQA), 2) a personalized fine-tuned Text-to-Image diffusion model (T2I) which learns the "indescribable" appearance. To represent high-resolution 3D clothed humans at an affordable cost, we propose a hybrid 3D representation based on DMTet, which consists of an explicit body shape grid and an implicit distance field. Guided by the descriptive prompts + personalized T2I diffusion model, the geometry and texture of the 3D humans are optimized through multi-view Score Distillation Sampling (SDS) and reconstruction losses based on the original observation. TeCH produces high-fidelity 3D clothed humans with consistent & delicate texture, and detailed full-body geometry. Quantitative and qualitative experiments demonstrate that TeCH outperforms the state-of-the-art methods in terms of reconstruction accuracy and rendering quality. The code will be publicly available for research purposes at https://huangyangyi.github.io/TeCH

Can Transformers Learn Optimal Filtering for Unknown Systems?

  • paper_url: http://arxiv.org/abs/2308.08536
  • repo_url: None
  • paper_authors: Haldun Balim, Zhe Du, Samet Oymak, Necmiye Ozay
  • for: Solving the optimal output estimation problem for dynamical systems with transformers.
  • methods: Trains a transformer to generate output predictions from all past outputs, using systems drawn from a prior distribution, so that it adapts to previously unseen systems.
  • results: The resulting model matches the performance of the optimal output estimator (the Kalman filter) for most linear dynamical systems, and also performs well in challenging scenarios with non-i.i.d. noise, time-varying dynamics, and nonlinear dynamics.
    Abstract Transformers have demonstrated remarkable success in natural language processing; however, their potential remains mostly unexplored for problems arising in dynamical systems. In this work, we investigate the optimal output estimation problem using transformers, which generate output predictions using all the past ones. We train the transformer using various systems drawn from a prior distribution and then evaluate its performance on previously unseen systems from the same distribution. As a result, the obtained transformer acts like a prediction algorithm that learns in-context and quickly adapts to and predicts well for different systems - thus we call it meta-output-predictor (MOP). MOP matches the performance of the optimal output estimator, based on Kalman filter, for most linear dynamical systems even though it does not have access to a model. We observe via extensive numerical experiments that MOP also performs well in challenging scenarios with non-i.i.d. noise, time-varying dynamics, and nonlinear dynamics like a quadrotor system with unknown parameters. To further support this observation, in the second part of the paper, we provide statistical guarantees on the performance of MOP and quantify the required amount of training to achieve a desired excess risk during test-time. Finally, we point out some limitations of MOP by identifying two classes of problems MOP fails to perform well, highlighting the need for caution when using transformers for control and estimation.
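
The training setup, systems drawn from a prior and trajectories turned into past-outputs-to-next-output pairs, can be sketched in a few lines of NumPy. The prior over `(A, C)`, the noise level, and the trajectory length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_system(n=4, m=2):
    """Draw a random stable linear system from a simple prior."""
    A = rng.normal(size=(n, n))
    A *= 0.95 / max(abs(np.linalg.eigvals(A)))   # rescale to spectral radius 0.95
    C = rng.normal(size=(m, n))
    return A, C

def rollout(A, C, T=50, noise=0.1):
    x = rng.normal(size=A.shape[0])
    ys = []
    for _ in range(T):
        x = A @ x + noise * rng.normal(size=x.shape)
        ys.append(C @ x + noise * rng.normal(size=C.shape[0]))
    return np.stack(ys)                           # (T, m)

# Training pairs for a sequence model: past outputs y_{1:t} -> next output y_{t+1}.
# A transformer trained across many sampled systems must infer the dynamics
# in-context, since the system matrices are never given to it.
trajs = [rollout(*sample_system()) for _ in range(1000)]
contexts = [y[:-1] for y in trajs]
targets = [y[1:] for y in trajs]
print(contexts[0].shape, targets[0].shape)        # (49, 2) (49, 2)
```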

Exploiting Point-Wise Attention in 6D Object Pose Estimation Based on Bidirectional Prediction

  • paper_url: http://arxiv.org/abs/2308.08518
  • repo_url: None
  • paper_authors: Yuhao Yang, Jun Wu, Guangjian Zhang, Rong Xiong
  • for: Improving the accuracy and robustness of 6D object pose estimation, especially in environments with severe occlusion.
  • methods: Proposes a bidirectional correspondence prediction network with a point-wise attention-aware mechanism that jointly considers the geometric similarities between the model and the scene.
  • results: Experiments on the public LineMOD, YCB-Video, and Occ-LineMOD datasets show that the proposed method outperforms other state-of-the-art methods under the same evaluation criteria, particularly in occluded environments.
    Abstract Traditional geometric registration based estimation methods only exploit the CAD model implicitly, which leads to their dependence on observation quality and vulnerability to occlusion. To address the problem, the paper proposes a bidirectional correspondence prediction network with a point-wise attention-aware mechanism. This network not only requires the model points to predict the correspondence but also explicitly models the geometric similarities between observations and the model prior. Our key insight is that the correlations between each model point and scene point provide essential information for learning point-pair matches. To further tackle the correlation noise brought by feature distribution divergence, we design a simple but effective pseudo-siamese network to improve feature homogeneity. Experimental results on the public datasets of LineMOD, YCB-Video, and Occ-LineMOD show that the proposed method achieves better performance than other state-of-the-art methods under the same evaluation criteria. Its robustness in estimating poses is greatly improved, especially in an environment with severe occlusions.

cs.CL - 2023-08-17

Contrasting Linguistic Patterns in Human and LLM-Generated Text

  • paper_url: http://arxiv.org/abs/2308.09067
  • repo_url: None
  • paper_authors: Alberto Muñoz-Ortiz, Carlos Gómez-Rodríguez, David Vilares
  • for: Contrasting human-written English news text with comparable output from large language models (LLMs).
  • methods: Quantitative analysis across several linguistic dimensions, including morphological, syntactic, psychometric, and sociolinguistic aspects.
  • results: Reveals many measurable differences: human texts show more scattered sentence-length distributions, distinct use of dependency and constituent types, shorter constituents, and more aggressive emotions (fear, disgust); LLM outputs use more numbers, symbols, and auxiliaries (suggesting objective language) as well as more pronouns, and the sexist bias present in human text is also reflected by the LLMs.
    Abstract We conduct a quantitative analysis contrasting human-written English news text with comparable large language model (LLM) output from 4 LLMs from the LLaMa family. Our analysis spans several measurable linguistic dimensions, including morphological, syntactic, psychometric and sociolinguistic aspects. The results reveal various measurable differences between human and AI-generated texts. Among others, human texts exhibit more scattered sentence length distributions, a distinct use of dependency and constituent types, shorter constituents, and more aggressive emotions (fear, disgust) than LLM-generated texts. LLM outputs use more numbers, symbols and auxiliaries (suggesting objective language) than human texts, as well as more pronouns. The sexist bias prevalent in human text is also expressed by LLMs.

Don’t lose the message while paraphrasing: A study on content preserving style transfer

  • paper_url: http://arxiv.org/abs/2308.09055
  • repo_url: https://github.com/s-nlp/lewit-informal
  • paper_authors: Nikolay Babakov, David Dale, Ilya Gusev, Irina Krotova, Alexander Panchenko
  • for: Improving text style transfer in natural language processing, which paraphrases text into a required form, e.g., from toxic to neutral, from formal to informal, or from old to modern English.
  • methods: Compares multiple text style transfer models to determine which best preserve the meaning of the original content.
  • results: Finds that preserving the original meaning is essential yet existing style transfer models often fail at it, and proposes a modification of the unsupervised LEWIS method that improves style transfer while preserving the original content.
    Abstract Text style transfer techniques are gaining popularity in natural language processing, allowing paraphrasing of text in the required form: from toxic to neutral, from formal to informal, from old to modern English, etc. Solving the task is not sufficient to generate some neutral/informal/modern text; it is important to preserve the original content unchanged. This requirement becomes even more critical in some applications, such as style transfer of goal-oriented dialogues, where the factual information shall be kept to preserve the original message, e.g. ordering a certain type of pizza to a certain address at a certain time. The aspect of content preservation is critical for real-world applications of style transfer studies, but it has received little attention. To bridge this gap we perform a comparison of various style transfer models on the example of the formality transfer domain. To perform a study of the content preservation abilities of various style transfer methods we create a parallel dataset of formal vs. informal task-oriented dialogues. The key difference between our dataset and the existing ones like GYAFC [17] is the presence of goal-oriented dialogues with predefined semantic slots essential to be kept during paraphrasing, e.g. named entities. This additional annotation allowed us to conduct a precise comparative study of several state-of-the-art techniques for style transfer. Another result of our study is a modification of the unsupervised method LEWIS [19] which yields a substantial improvement over the original method and all evaluated baselines on the proposed task.

Reinforced Self-Training (ReST) for Language Modeling

  • paper_url: http://arxiv.org/abs/2308.08998
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, Nando de Freitas
  • for: Improving the output quality of large language models.
  • methods: Uses reinforcement learning to align the models with human feedback automatically.
  • results: Substantially improves translation quality in a compute- and sample-efficient manner, as shown by automated metrics and human evaluation on machine translation benchmarks.
    Abstract Reinforcement learning from human feedback (RLHF) can improve the quality of large language model's (LLM) outputs by aligning them with human preferences. We propose a simple algorithm for aligning LLMs with human preferences inspired by growing batch reinforcement learning (RL), which we call Reinforced Self-Training (ReST). Given an initial LLM policy, ReST produces a dataset by generating samples from the policy, which are then used to improve the LLM policy using offline RL algorithms. ReST is more efficient than typical online RLHF methods because the training dataset is produced offline, which allows data reuse. While ReST is a general approach applicable to all generative learning settings, we focus on its application to machine translation. Our results show that ReST can substantially improve translation quality, as measured by automated metrics and human evaluation on machine translation benchmarks in a compute and sample-efficient manner.
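
The Grow/Improve structure of ReST reduces to a short loop. In this toy sketch the "policy" is a one-parameter Gaussian and the reward is a hand-written function; in the paper the policy is an LLM, the reward comes from a learned model of human preferences, and the Improve step is offline RL fine-tuning rather than a refit.

```python
import random

mu, sigma = 0.0, 1.0                       # toy policy parameters

def reward(sample):                        # hypothetical reward model
    return -abs(sample - 3.0)              # prefers samples near 3.0

for iteration in range(5):
    # Grow: sample a dataset *offline* from the current policy.
    samples = [random.gauss(mu, sigma) for _ in range(2000)]
    # Filter: keep samples above a reward threshold that rises each iteration.
    threshold = -1.0 / (iteration + 1)
    kept = [s for s in samples if reward(s) >= threshold]
    # Improve: offline update of the policy on the filtered data
    # (here a trivial refit; in ReST, offline RL fine-tuning of the LLM).
    if kept:
        mu = sum(kept) / len(kept)
    print(f"iter {iteration}: kept {len(kept)}, policy mean {mu:.3f}")
```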

Evaluation of really good grammatical error correction

  • paper_url: http://arxiv.org/abs/2308.08982
  • repo_url: https://github.com/robertostling/gec-evaluation
  • paper_authors: Robert Östling, Katarina Gillholm, Murathan Kurfalı, Marie Mattson, Mats Wirén
  • for: Evaluating the performance of different grammatical error correction (GEC) systems and the validity of existing evaluation methods.
  • methods: Uses established evaluation metrics as well as newer ones, including human evaluation and large language models (LLMs).
  • results: Finds that GPT-3 in a few-shot setting clearly outperforms previous GEC systems for Swedish, despite the small share of Swedish in its training data; human evaluation also reveals undesirable biases in current evaluation methods.
    Abstract Although rarely stated, in practice, Grammatical Error Correction (GEC) encompasses various models with distinct objectives, ranging from grammatical error detection to improving fluency. Traditional evaluation methods fail to capture the full range of system capabilities and objectives. Reference-based evaluations suffer from limitations in capturing the wide variety of possible corrections and from the biases introduced during reference creation, and are prone to favoring fixes of local errors over overall text improvement. The emergence of large language models (LLMs) has further highlighted the shortcomings of these evaluation strategies, emphasizing the need for a paradigm shift in evaluation methodology. In the current study, we perform a comprehensive evaluation of various GEC systems using a recently published dataset of Swedish learner texts. The evaluation is performed using established evaluation metrics as well as human judges. We find that GPT-3 in a few-shot setting by far outperforms previous grammatical error correction systems for Swedish, a language comprising only 0.11% of its training data. We also found that current evaluation methods contain undesirable biases that a human evaluation is able to reveal. We suggest using human post-editing of GEC system outputs to analyze the amount of change required to reach native-level human performance on the task, and provide a dataset annotated with human post-edits and assessments of grammaticality, fluency and meaning preservation of GEC system outputs.

Beam Retrieval: General End-to-End Retrieval for Multi-Hop Question Answering

  • paper_url: http://arxiv.org/abs/2308.08973
  • repo_url: https://github.com/canghongjian/beam_retriever
  • paper_authors: Jiahao Zhang, Haiyang Zhang, Dongmei Zhang, Yong Liu, Shen Huang
  • for: Proposes a retrieval framework for multi-hop question answering (QA), where answering complex questions requires selecting multiple relevant passages and step-by-step reasoning.
  • methods: Beam Retrieval, a general end-to-end retrieval framework that maintains multiple partial hypotheses of relevant passages at each step, expanding the search space and reducing the risk of selecting irrelevant passages; it jointly optimizes an encoder and two classification heads by minimizing the combined loss across all hops.
  • results: Improves over baselines by nearly 50% on the challenging MuSiQue-Ans benchmark and surpasses all previous retrievers on HotpotQA and 2WikiMultiHopQA; by providing high-quality context, it helps a supervised reader reach new state-of-the-art performance and substantially improves zero-shot GPT-3.5 (by up to 28.8 points).
    Abstract Multi-hop QA involves finding multiple relevant passages and step-by-step reasoning to answer complex questions. While previous approaches have developed retrieval modules for selecting relevant passages, they face challenges in scenarios beyond two hops, owing to the limited performance of one-step methods and the failure of two-step methods when selecting irrelevant passages in earlier stages. In this work, we introduce Beam Retrieval, a general end-to-end retrieval framework for multi-hop QA. This approach maintains multiple partial hypotheses of relevant passages at each step, expanding the search space and reducing the risk of missing relevant passages. Moreover, Beam Retrieval jointly optimizes an encoder and two classification heads by minimizing the combined loss across all hops. To establish a complete QA system, we incorporate a supervised reader or a zero-shot GPT-3.5. Experimental results demonstrate that Beam Retrieval achieves a nearly 50% improvement compared with baselines on challenging MuSiQue-Ans, and it also surpasses all previous retrievers on HotpotQA and 2WikiMultiHopQA. Providing high-quality context, Beam Retrieval helps our supervised reader achieve new state-of-the-art performance and substantially improves (up to 28.8 points) the QA performance of zero-shot GPT-3.5.
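
The key mechanism, keeping a beam of partial passage chains across hops rather than committing to one passage per hop, looks roughly like this. The `overlap_score` function is a toy stand-in for the paper's jointly trained encoder and classification heads.

```python
def beam_retrieve(question, passages, hops=2, beam_size=4, score=None):
    """Keep `beam_size` partial chains of passages at every hop instead of
    committing to a single passage per hop (the failure mode of two-step
    retrievers). `score` stands in for the trained relevance model."""
    beams = [((), 0.0)]                              # (chain of passage ids, total score)
    for _ in range(hops):
        candidates = []
        for chain, s in beams:
            for pid in range(len(passages)):
                if pid not in chain:                 # don't revisit a passage
                    candidates.append((chain + (pid,), s + score(question, chain, pid)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

passages = ["paris is the capital of france",
            "france borders spain",
            "the eiffel tower is in paris",
            "bananas are yellow"]

def overlap_score(question, chain, pid):
    # Toy relevance: count words shared with the question (illustrative only).
    return len(set(question.split()) & set(passages[pid].split()))

print(beam_retrieve("what city is the eiffel tower in", passages, score=overlap_score))
```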

Factuality Detection using Machine Translation – a Use Case for German Clinical Text

  • paper_url: http://arxiv.org/abs/2308.08827
  • repo_url: None
  • paper_authors: Mohammed Bin Sumait, Aleksandra Gabryszak, Leonhard Hennig, Roland Roller
  • for: Detecting factuality in clinical text.
  • methods: Uses machine translation to translate English data into German and trains a transformer-based factuality detection model on it.
  • results: The method can accurately detect factuality in clinical text.
    Abstract Factuality can play an important role when automatically processing clinical text, as it makes a difference if particular symptoms are explicitly not present, possibly present, not mentioned, or affirmed. In most cases, a sufficient number of examples is necessary to handle such phenomena in a supervised machine learning setting. However, as clinical text might contain sensitive information, data cannot be easily shared. In the context of factuality detection, this work presents a simple solution using machine translation to translate English data to German to train a transformer-based factuality detection model.
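
The translation step of such a pipeline is a one-liner with the Hugging Face `transformers` translation pipeline; the example sentences and labels below are invented for illustration, and the downstream factuality classifier is then trained on the resulting German pairs.

```python
from transformers import pipeline

translate = pipeline("translation_en_to_de")   # default English-to-German model

# Invented examples: clinical-style sentences with factuality labels.
examples = [("Patient denies chest pain.", "negated"),
            ("Possible pneumonia in the right lower lobe.", "possible")]

german_pairs = [(translate(text)[0]["translation_text"], label)
                for text, label in examples]
print(german_pairs)   # German training pairs for the factuality classifier
```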

Linguistically-Informed Neural Architectures for Lexical, Syntactic and Semantic Tasks in Sanskrit

  • paper_url: http://arxiv.org/abs/2308.08807
  • repo_url: None
  • paper_authors: Jivnesh Sandhan
  • for: Making Sanskrit manuscripts more accessible to end-users through natural language technologies.
  • methods: Uses linguistically-informed neural models to address the morphological richness, free word order, and low-resource nature of Sanskrit.
  • results: Presents SanskritShala, a neural toolkit that performs real-time online analysis of Sanskrit text for various NLP tasks; the proposed linguistically-informed architectures are interpretable, extend to multilingual settings, and achieve state-of-the-art performance.
    Abstract The primary focus of this thesis is to make Sanskrit manuscripts more accessible to the end-users through natural language technologies. The morphological richness, compounding, free word orderliness, and low-resource nature of Sanskrit pose significant challenges for developing deep learning solutions. We identify four fundamental tasks, which are crucial for developing a robust NLP technology for Sanskrit: word segmentation, dependency parsing, compound type identification, and poetry analysis. The first task, Sanskrit Word Segmentation (SWS), is a fundamental text processing task for any other downstream applications. However, it is challenging due to the sandhi phenomenon that modifies characters at word boundaries. Similarly, the existing dependency parsing approaches struggle with morphologically rich and low-resource languages like Sanskrit. Compound type identification is also challenging for Sanskrit due to the context-sensitive semantic relation between components. All these challenges result in sub-optimal performance in NLP applications like question answering and machine translation. Finally, Sanskrit poetry has not been extensively studied in computational linguistics. While addressing these challenges, this thesis makes various contributions: (1) The thesis proposes linguistically-informed neural architectures for these tasks. (2) We showcase the interpretability and multilingual extension of the proposed systems. (3) Our proposed systems report state-of-the-art performance. (4) Finally, we present a neural toolkit named SanskritShala, a web-based application that provides real-time analysis of input for various NLP tasks. Overall, this thesis contributes to making Sanskrit manuscripts more accessible by developing robust NLP technology and releasing various resources, datasets, and web-based toolkit.

Chinese Spelling Correction as Rephrasing Language Model

  • paper_url: http://arxiv.org/abs/2308.08796
  • repo_url: https://github.com/gingasan/lemon
  • paper_authors: Linfeng Liu, Hongqiu Wu, Hai Zhao
  • for: Improving the accuracy and transferability of Chinese spelling correction by letting the model understand semantics and rephrase whole sentences rather than tag individual characters.
  • methods: Proposes a new training paradigm, Rephrasing Language Modeling (ReLM), in which the model rephrases the entire sentence by infilling additional slots.
  • results: ReLM achieves new state-of-the-art results on fine-tuned and zero-shot CSC benchmarks, outperforming previous counterparts by a large margin, and learns transferable language representations when CSC is jointly trained with other tasks.
    Abstract This paper studies Chinese Spelling Correction (CSC), which aims to detect and correct potential spelling errors in a given sentence. Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs. However, we note a critical flaw in the process of tagging one character to another, that the correction is excessively conditioned on the error. This is opposite from human mindset, where individuals rephrase the complete sentence based on its semantics, rather than solely on the error patterns memorized before. Such a counter-intuitive learning process results in the bottleneck of generalizability and transferability of machine spelling correction. To address this, we propose $Rephrasing Language Modeling$ (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging. This novel training paradigm achieves the new state-of-the-art results across fine-tuned and zero-shot CSC benchmarks, outperforming previous counterparts by a large margin. Our method also learns transferable language representation when CSC is jointly trained with other tasks.

Task Relation Distillation and Prototypical Pseudo Label for Incremental Named Entity Recognition

  • paper_url: http://arxiv.org/abs/2308.08793
  • repo_url: https://github.com/bladedancer957/iner_rdp
  • paper_authors: Duzhen Zhang, Hongliu Li, Wei Cong, Rongtao Xu, Jiahua Dong, Xiuyi Chen
  • for: Addressing catastrophic forgetting in incremental named entity recognition, particularly under background shift.
  • methods: Uses task relation distillation and prototypical pseudo labels to tackle catastrophic forgetting and background shift.
  • results: Achieves significant improvements over previous state-of-the-art methods across ten INER settings, with an average increase of 6.08% in Micro F1 and 7.71% in Macro F1.
    Abstract Incremental Named Entity Recognition (INER) involves the sequential learning of new entity types without accessing the training data of previously learned types. However, INER faces the challenge of catastrophic forgetting specific for incremental learning, further aggravated by background shift (i.e., old and future entity types are labeled as the non-entity type in the current task). To address these challenges, we propose a method called task Relation Distillation and Prototypical pseudo label (RDP) for INER. Specifically, to tackle catastrophic forgetting, we introduce a task relation distillation scheme that serves two purposes: 1) ensuring inter-task semantic consistency across different incremental learning tasks by minimizing inter-task relation distillation loss, and 2) enhancing the model's prediction confidence by minimizing intra-task self-entropy loss. Simultaneously, to mitigate background shift, we develop a prototypical pseudo label strategy that distinguishes old entity types from the current non-entity type using the old model. This strategy generates high-quality pseudo labels by measuring the distances between token embeddings and type-wise prototypes. We conducted extensive experiments on ten INER settings of three benchmark datasets (i.e., CoNLL2003, I2B2, and OntoNotes5). The results demonstrate that our method achieves significant improvements over the previous state-of-the-art methods, with an average increase of 6.08% in Micro F1 score and 7.71% in Macro F1 score.
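
The prototypical pseudo-label step can be sketched as follows: build a mean-embedding prototype per old entity type, then relabel current "non-entity" tokens whose old-model embedding falls close enough to a prototype. All tensors and the distance threshold here are illustrative; the paper's criterion is more refined.

```python
import torch

# Illustrative shapes: token embeddings from the *old* model and the old
# entity-type labels they received. The goal is to relabel current
# "non-entity" tokens that actually belong to an old type (background shift).
emb_dim, n_old_types = 16, 3
old_embs = torch.randn(200, emb_dim)               # embeddings of previously seen tokens
old_labels = torch.randint(0, n_old_types, (200,)) # their old entity types

# Type-wise prototypes: mean embedding per old entity type.
prototypes = torch.stack([old_embs[old_labels == t].mean(0) for t in range(n_old_types)])

cur_embs = torch.randn(50, emb_dim)                # current tokens labeled "non-entity"
dists = torch.cdist(cur_embs, prototypes)          # (50, n_old_types)
nearest, pseudo = dists.min(dim=1)

# Only confident assignments (distance under a threshold) become pseudo labels;
# the rest stay "non-entity". The threshold is an illustrative assumption.
tau = dists.mean()
pseudo_labels = torch.where(nearest < tau, pseudo, torch.full_like(pseudo, -1))  # -1 = non-entity
print(pseudo_labels[:10])
```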

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

  • paper_url: http://arxiv.org/abs/2308.08747
  • repo_url: https://github.com/luoxiaoheics/continual-tune
  • paper_authors: Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, Yue Zhang
  • for: Investigating whether catastrophic forgetting (CF) occurs in large language models (LLMs) during continual fine-tuning.
  • methods: Evaluates knowledge forgetting from several perspectives, including domain knowledge, reasoning, and reading comprehension.
  • results: CF is generally observed in LLMs during continual fine-tuning, and its severity intensifies with scale; the decoder-only BLOOMZ forgets less and retains more knowledge than the encoder-decoder mT0, and LLMs can also mitigate language bias (e.g., gender bias) during continual fine-tuning.
    Abstract Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information as it learns new information. As large language models (LLMs) have shown excellent performance, it is interesting to uncover whether CF exists in the continual fine-tuning of LLMs. In this study, we empirically evaluate the forgetting phenomenon in LLMs' knowledge, from the perspectives of domain knowledge, reasoning, and reading comprehension. The experiments demonstrate that catastrophic forgetting is generally observed in LLMs ranging from 1b to 7b. Furthermore, as the scale increases, the severity of forgetting also intensifies. Comparing the decoder-only model BLOOMZ with the encoder-decoder model mT0, BLOOMZ suffers less forgetting and maintains more knowledge. We also observe that LLMs can mitigate language bias (e.g. gender bias) during continual fine-tuning. Moreover, we find that ALPACA can maintain more knowledge and capacity compared with LLAMA during the continual fine-tuning, which implies that general instruction tuning can help mitigate the forgetting phenomenon of LLMs in the further fine-tuning process.

Enhancing Phrase Representation by Information Bottleneck Guided Text Diffusion Process for Keyphrase Extraction

  • paper_url: http://arxiv.org/abs/2308.08739
  • repo_url: None
  • paper_authors: Yuanzhen Luo, Qingyu Zhou, Feng Zhou
  • for: Proposes a supervised text diffusion process guided by the Variational Information Bottleneck (VIB) to improve keyphrase extraction (KPE).
  • methods: First generates the desired keyphrase embeddings conditioned on the entire document, then injects the generated embeddings into each phrase representation; a ranking network and the VIB are then optimized jointly with ranking and classification losses.
  • results: Experiments show that Diff-KPE outperforms existing KPE methods on a large open-domain keyphrase extraction benchmark (OpenKP) and a scientific-domain dataset (KP20K).
    Abstract Keyphrase extraction (KPE) is an important task in Natural Language Processing for many scenarios, which aims to extract keyphrases that are present in a given document. Many existing supervised methods treat KPE as sequential labeling, span-level classification, or generative tasks. However, these methods lack the ability to utilize keyphrase information, which may result in biased results. In this study, we propose Diff-KPE, which leverages the supervised Variational Information Bottleneck (VIB) to guide the text diffusion process for generating enhanced keyphrase representations. Diff-KPE first generates the desired keyphrase embeddings conditioned on the entire document and then injects the generated keyphrase embeddings into each phrase representation. A ranking network and VIB are then optimized together with rank loss and classification loss, respectively. This design of Diff-KPE allows us to rank each candidate phrase by utilizing both the information of keyphrases and the document. Experiments show that Diff-KPE outperforms existing KPE methods on a large open domain keyphrase extraction benchmark, OpenKP, and a scientific domain dataset, KP20K.

Decoding Emotions: A comprehensive Multilingual Study of Speech Models for Speech Emotion Recognition

  • paper_url: http://arxiv.org/abs/2308.08713
  • repo_url: https://github.com/95anantsingh/decoding-emotions
  • paper_authors: Anant Singh, Akshat Gupta
  • for: Benchmarking speech emotion recognition (SER) models across multiple languages and examining their internal representations.
  • methods: Compares eight speech representation models across six languages, with probing experiments to gain insight into the models' inner workings.
  • results: Using features from a single optimal layer of a speech model reduces the error rate by 32% on average across seven datasets compared to using features from all layers; the probing results indicate that the middle layers capture the most important emotional information.
    Abstract Recent advancements in transformer-based speech representation models have greatly transformed speech processing. However, there has been limited research conducted on evaluating these models for speech emotion recognition (SER) across multiple languages and examining their internal representations. This article addresses these gaps by presenting a comprehensive benchmark for SER with eight speech representation models and six different languages. We conducted probing experiments to gain insights into inner workings of these models for SER. We find that using features from a single optimal layer of a speech model reduces the error rate by 32\% on average across seven datasets when compared to systems where features from all layers of speech models are used. We also achieve state-of-the-art results for German and Persian languages. Our probing results indicate that the middle layers of speech models capture the most important emotional information for speech emotion recognition.
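
Layer-wise probing, training a simple classifier on each layer's features and keeping the best layer, is easy to reproduce in outline. The random features below stand in for mean-pooled hidden states from a real speech model such as wav2vec 2.0.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative stand-in: per-layer utterance features from a speech model,
# e.g. mean-pooled hidden states, shape (n_utterances, n_layers, dim).
rng = np.random.default_rng(0)
n_utts, n_layers, dim = 300, 12, 64
feats = rng.normal(size=(n_utts, n_layers, dim))
labels = rng.integers(0, 4, size=n_utts)          # 4 emotion classes

# Probe each layer with a linear classifier and keep the best one,
# mirroring the finding that a single optimal (often middle) layer
# beats using features from all layers.
scores = [cross_val_score(LogisticRegression(max_iter=1000),
                          feats[:, layer], labels, cv=3).mean()
          for layer in range(n_layers)]
best = int(np.argmax(scores))
print(f"best layer: {best} (accuracy {scores[best]:.3f})")
```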

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

  • paper_url: http://arxiv.org/abs/2308.09723
  • repo_url: None
  • paper_authors: Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla
  • for: Making large language models (LLMs) more practical to deploy, since they require substantial memory and recent generative models hit a memory-bandwidth bottleneck during auto-regressive decoding.
  • methods: Proposes an efficient quantization method that reduces memory usage and accelerates inference without additional fine-tuning, together with a simple and effective heuristic that uses only the weights of the pre-trained model and adaptively chooses the quantization granularity.
  • results: Achieves minimal quality loss with maximal throughput: on large open-source models and internal MoE models, accuracy stays close to the original while throughput rises by up to 3.65 times on the same number of GPUs.
    Abstract Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment due to their substantial memory requirements. Furthermore, the latest generative models suffer from high inference costs caused by the memory bandwidth bottleneck in the auto-regressive decoding process. To address these issues, we propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs. To ensure minimal quality degradation, we introduce a simple and effective heuristic approach that utilizes only the model weights of a pre-trained model. This approach is applicable to both Mixture-of-Experts (MoE) and dense models without requiring additional fine-tuning. To demonstrate the effectiveness of our proposed method, we first analyze the challenges and issues associated with LLM quantization. Subsequently, we present our heuristic approach, which adaptively finds the granularity of quantization, effectively addressing these problems. Furthermore, we implement highly efficient GPU GEMMs that perform on-the-fly matrix multiplication and dequantization, supporting the multiplication of fp16 or bf16 activations with int8 or int4 weights. We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput on the same number of GPUs.
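
Fine-grained weight-only quantization of the kind described, one scale per group of weights within a row, can be sketched in a few lines of PyTorch. This shows the numerics only; in deployment the dequantization is fused into the GPU GEMM as the abstract notes, and the adaptive per-matrix choice of granularity is not shown.

```python
import torch

def quantize_per_group(w, group_size=128, bits=8):
    """Symmetric weight-only quantization with one scale per group of
    `group_size` weights along each row; finer groups cost slightly more
    memory for scales but cut quantization error on outlier-heavy rows."""
    qmax = 2 ** (bits - 1) - 1
    rows, cols = w.shape
    w = w.reshape(rows, cols // group_size, group_size)
    scale = w.abs().amax(dim=-1, keepdim=True) / qmax       # one scale per group
    q = torch.clamp((w / scale).round(), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return (q.float() * scale).reshape(q.shape[0], -1)

w = torch.randn(256, 1024)
q, scale = quantize_per_group(w)
err = (w - dequantize(q, scale)).abs().mean()
print(f"mean abs error: {err:.5f}")   # shrinks as group_size decreases
```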

Large Language Models for Granularized Barrett’s Esophagus Diagnosis Classification

  • paper_url: http://arxiv.org/abs/2308.08660
  • repo_url: None
  • paper_authors: Jenna Kefeli, Ali Soroush, Courtney J. Diamond, Haley M. Zylberberg, Benjamin May, Julian A. Abrams, Chunhua Weng, Nicholas Tatonetti
  • for: Improving the granularity and precision of diagnostic coding for Barrett's esophagus (BE), a precursor to esophageal cancer, for research and clinical use cases.
  • methods: A generalizable transformer-based method for automated data extraction from BE pathology reports, built on two clinically pre-trained large language models and compared against manual chart review.
  • results: Using pathology reports from Columbia University Irving Medical Center with gastroenterologist-annotated targets, the method performs binary dysplasia classification and granularized multi-class BE-related diagnosis classification, with the best model performance comparable to a highly tailored rule-based system while being faster to implement.
    Abstract Diagnostic codes for Barrett's esophagus (BE), a precursor to esophageal cancer, lack granularity and precision for many research or clinical use cases. Laborious manual chart review is required to extract key diagnostic phenotypes from BE pathology reports. We developed a generalizable transformer-based method to automate data extraction. Using pathology reports from Columbia University Irving Medical Center with gastroenterologist-annotated targets, we performed binary dysplasia classification as well as granularized multi-class BE-related diagnosis classification. We utilized two clinically pre-trained large language models, with best model performance comparable to a highly tailored rule-based system developed using the same data. Binary dysplasia extraction achieves 0.964 F1-score, while the multi-class model achieves 0.911 F1-score. Our method is generalizable and faster to implement as compared to a tailored rule-based approach.

Learning the meanings of function words from grounded language using a visual question answering model

  • paper_url: http://arxiv.org/abs/2308.08628
  • repo_url: https://github.com/evaportelance/vqa-function-word-learning
  • paper_authors: Eva Portelance, Michael C. Frank, Dan Jurafsky
  • for: Investigating how children learn function words, and what models learn about these words when using them to answer complex visual questions.
  • methods: Studies the meanings acquired for function words by neural-network-based visual question answering models trained on visually grounded language.
  • results: The models learn gradient semantics for function words requiring spatial and numerical reasoning without any prior knowledge of linguistic meaning; they acquire the meanings of the logical connectives "and" and "or" without prior knowledge of logical reasoning, show early evidence of reasoning about alternative expressions when interpreting language, and word learning difficulty depends on frequency in the models' input.
    Abstract Interpreting a seemingly-simple function word like "or", "behind", or "more" can require logical, numerical, and relational reasoning. How are such words learned by children? Prior acquisition theories have often relied on positing a foundation of innate knowledge. Yet recent neural-network based visual question answering models apparently can learn to use function words as part of answering questions about complex visual scenes. In this paper, we study what these models learn about function words, in the hope of better understanding how the meanings of these words can be learnt by both models and children. We show that recurrent models trained on visually grounded language learn gradient semantics for function words requiring spatial and numerical reasoning. Furthermore, we find that these models can learn the meanings of logical connectives "and" and "or" without any prior knowledge of logical reasoning, as well as early evidence that they can develop the ability to reason about alternative expressions when interpreting language. Finally, we show that word learning difficulty is dependent on frequency in models' input. Our findings offer evidence that it is possible to learn the meanings of function words in visually grounded context by using non-symbolic general statistical learning algorithms, without any prior knowledge of linguistic meaning.

BIOptimus: Pre-training an Optimal Biomedical Language Model with Curriculum Learning for Named Entity Recognition

  • paper_url: http://arxiv.org/abs/2308.08625
  • repo_url: https://github.com/rttl-ai/bioptimus
  • paper_authors: Pavlova Vera, Mohammed Makhlouf
  • for: investigate different pre-training methods for biomedical language models and compare their performance on Named Entity Recognition (NER) tasks.
  • methods: pre-training the biomedical language model from scratch, pre-training it in a continued fashion, and using a curriculum learning approach with contextualized weight distillation.
  • results: a new biomedical language model (BIOptimus) that sets new states of the art on several biomedical NER tasks, and an analysis of the impact of masking rate, corruption strategy, and masking strategies on the performance of the biomedical LM.
    Abstract Using language models (LMs) pre-trained in a self-supervised setting on large corpora and then fine-tuning for a downstream task has helped to deal with the problem of limited label data for supervised learning tasks such as Named Entity Recognition (NER). Recent research in biomedical language processing has offered a number of biomedical LMs pre-trained using different methods and techniques that advance results on many BioNLP tasks, including NER. However, there is still a lack of a comprehensive comparison of pre-training approaches that would work more optimally in the biomedical domain. This paper aims to investigate different pre-training methods, such as pre-training the biomedical LM from scratch and pre-training it in a continued fashion. We compare existing methods with our proposed pre-training method of initializing weights for new tokens by distilling existing weights from the BERT model inside the context where the tokens were found. The method helps to speed up the pre-training stage and improve performance on NER. In addition, we compare how masking rate, corruption strategy, and masking strategies impact the performance of the biomedical LM. Finally, using the insights from our experiments, we introduce a new biomedical LM (BIOptimus), which is pre-trained using Curriculum Learning (CL) and a contextualized weight distillation method. Our model sets new states of the art on several biomedical Named Entity Recognition (NER) tasks. We release our code and all pre-trained models.
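
A simplified version of the new-token initialization is shown below: new biomedical tokens are added to an existing BERT vocabulary and their embeddings are initialized from the subword pieces they previously decomposed into. The paper's method distills contextualized weights (from the contexts where the tokens were found); averaging static subword embeddings is an illustrative simplification.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

new_tokens = ["angiogenesis", "erythropoietin"]    # example biomedical terms
pieces = [tok.tokenize(t) for t in new_tokens]     # subword split under the old vocab

tok.add_tokens(new_tokens)
model.resize_token_embeddings(len(tok))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for t, ps in zip(new_tokens, pieces):
        ids = tok.convert_tokens_to_ids(ps)
        # Initialize the new token's embedding from its old subword pieces.
        emb[tok.convert_tokens_to_ids(t)] = emb[ids].mean(0)
```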

Time Travel in LLMs: Tracing Data Contamination in Large Language Models

  • paper_url: http://arxiv.org/abs/2308.08493
  • repo_url: None
  • paper_authors: Shahriar Golchin, Mihai Surdeanu
  • for: Studying data contamination in large language models (LLMs), i.e., the presence of downstream test data in the training data.
  • methods: A simple yet effective method that first identifies potential contamination in individual instances drawn from a small random sample using a "guided instruction" prompt, then assesses whether an entire dataset partition is contaminated.
  • results: Guided instructions detect contamination accurately: the best method achieves between 92% and 100% accuracy across seven datasets (with train and test/validation partitions) against manual evaluation by a human expert, and the findings indicate that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets.
    Abstract Data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs), is a potential major issue in understanding LLMs' effectiveness on other tasks. We propose a straightforward yet effective method for identifying data contamination within LLMs. At its core, our approach starts by identifying potential contamination in individual instances that are drawn from a small random sample; using this information, our approach then assesses if an entire dataset partition is contaminated. To estimate contamination of individual instances, we employ "guided instruction:" a prompt consisting of the dataset name, partition type, and the initial segment of a reference instance, asking the LLM to complete it. An instance is flagged as contaminated if the LLM's output either exactly or closely matches the latter segment of the reference. To understand if an entire partition is contaminated, we propose two ideas. The first idea marks a dataset partition as contaminated if the average overlap score with the reference instances (as measured by ROUGE or BLEURT) is statistically significantly better with the guided instruction vs. a general instruction that does not include the dataset and partition name. The second idea marks a dataset as contaminated if a classifier based on GPT-4 with in-context learning prompting marks multiple instances as contaminated. Our best method achieves an accuracy between 92% and 100% in detecting if an LLM is contaminated with seven datasets, containing train and test/validation partitions, when contrasted with manual evaluation by human expert. Further, our findings indicate that GPT-4 is contaminated with AG News, WNLI, and XSum datasets.
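
The instance-level check can be sketched as a prompt plus an overlap test. The prompt wording is a paraphrase rather than the paper's exact guided instruction, `llm_complete` is a placeholder for a real LLM call, simple token overlap stands in for the ROUGE/BLEURT scoring, and the 0.8 threshold is an illustrative assumption.

```python
def guided_instruction(dataset, split, first_piece):
    """Paraphrase of the paper's 'guided instruction': it names the dataset
    and split, then asks the model to complete a reference instance."""
    return (f"Instruction: You are given the first piece of an instance from "
            f"the {split} split of the {dataset} dataset. Complete it as "
            f"exactly as possible.\n\nFirst piece: {first_piece}")

def overlap(candidate, reference):
    # Crude token overlap as a stand-in for ROUGE/BLEURT scoring.
    c, r = set(candidate.lower().split()), set(reference.lower().split())
    return len(c & r) / max(len(r), 1)

def llm_complete(prompt):
    return "placeholder completion"        # replace with a real LLM API call

ref = "the quick brown fox jumps over the lazy dog"
head, tail = ref[:len(ref) // 2], ref[len(ref) // 2:]
completion = llm_complete(guided_instruction("AG News", "test", head))
flagged = overlap(completion, tail) > 0.8  # exact/near match => likely memorized
print("contaminated instance?", flagged)
```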

cs.LG - 2023-08-17

Enhancing API Documentation through BERTopic Modeling and Summarization

  • paper_url: http://arxiv.org/abs/2308.09070
  • repo_url: https://github.com/scam2023-bert/bertopic
  • paper_authors: AmirHossein Naghshzan, Sylvie Ratte
  • for: Making API documentation more efficient to understand so that developers can better exploit API functionality.
  • methods: Uses BERTopic for topic modeling and Natural Language Processing (NLP) to automatically generate summaries of API documentation, giving developers faster access to the information they need.
  • results: Identifies recurring topics and common issues and generates potential solutions, offering valuable insights and practical tools for API documentation analysis and for developers.
    Abstract As the amount of textual data in various fields, including software development, continues to grow, there is a pressing demand for efficient and effective extraction and presentation of meaningful insights. This paper presents a unique approach to address this need, focusing on the complexities of interpreting Application Programming Interface (API) documentation. While official API documentation serves as a primary source of information for developers, it can often be extensive and lacks user-friendliness. In light of this, developers frequently resort to unofficial sources like Stack Overflow and GitHub. Our novel approach employs the strengths of BERTopic for topic modeling and Natural Language Processing (NLP) to automatically generate summaries of API documentation, thereby creating a more efficient method for developers to extract the information they need. The produced summaries and topics are evaluated based on their performance, coherence, and interoperability. The findings of this research contribute to the field of API documentation analysis by providing insights into recurring topics, identifying common issues, and generating potential solutions. By improving the accessibility and efficiency of API documentation comprehension, our work aims to enhance the software development process and empower developers with practical tools for navigating complex APIs.
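
The topic-modeling side of such a pipeline is a few lines with the `bertopic` package. The 20-newsgroups corpus below is only a stand-in for API documentation sections and the related Stack Overflow/GitHub posts.

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Stand-in corpus; in the paper the inputs are API documentation sections
# and related unofficial sources such as Stack Overflow and GitHub posts.
docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:1000]

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())      # discovered topics, sizes, keywords
print(topic_model.get_topic(0))                 # top words for topic 0
```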

Uplift Modeling: from Causal Inference to Personalization

  • paper_url: http://arxiv.org/abs/2308.09066
  • repo_url: None
  • paper_authors: Felipe Moraes, Hugo Manuel Proença, Anastasiia Kornilova, Javier Albert, Dmitri Goldenberg
  • for: Introducing causality and uplift modeling for personalized promotional campaigns on online e-commerce platforms.
  • methods: Covers state-of-the-art uplift modeling techniques, including the advantages and limitations of different approaches.
  • results: Presents real-life applications and discusses the challenges of implementing these models in production.
    Abstract Uplift modeling is a collection of machine learning techniques for estimating causal effects of a treatment at the individual or subgroup levels. Over the last years, causality and uplift modeling have become key trends in personalization at online e-commerce platforms, enabling the selection of the best treatment for each user in order to maximize the target business metric. Uplift modeling can be particularly useful for personalized promotional campaigns, where the potential benefit caused by a promotion needs to be weighed against the potential costs. In this tutorial we will cover basic concepts of causality and introduce the audience to state-of-the-art techniques in uplift modeling. We will discuss the advantages and the limitations of different approaches and dive into the unique setup of constrained uplift modeling. Finally, we will present real-life applications and discuss challenges in implementing these models in production.
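
One of the simplest uplift estimators, the two-model or "T-learner" approach, already illustrates the core idea: model outcomes separately under treatment and control, then score users by the predicted difference. It is only one of several approaches a tutorial like this covers, and the synthetic data below encodes a treatment effect that exists only for part of the population.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 5))                     # user features
treated = rng.integers(0, 2, size=n)            # randomized treatment assignment
# Synthetic outcome: treatment helps only when feature 0 is positive.
p = 0.2 + 0.3 * treated * (X[:, 0] > 0)
y = rng.binomial(1, p)

# T-learner: separate outcome models for treated and control groups.
m_t = GradientBoostingClassifier().fit(X[treated == 1], y[treated == 1])
m_c = GradientBoostingClassifier().fit(X[treated == 0], y[treated == 0])

uplift = m_t.predict_proba(X)[:, 1] - m_c.predict_proba(X)[:, 1]
# Target the promotion at users with the largest estimated uplift.
print("mean uplift, top decile:", uplift[np.argsort(uplift)[-n // 10:]].mean())
```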

Discretization-Induced Dirichlet Posterior for Robust Uncertainty Quantification on Regression

  • paper_url: http://arxiv.org/abs/2308.09065
  • repo_url: None
  • paper_authors: Xuanlong Yu, Gianni Franchi, Jindong Gu, Emanuel Aldea
  • for: The paper proposes a more robust auxiliary uncertainty estimator (AuxUE) for quantifying the uncertainty of deep neural networks (DNNs) in real-world applications.
  • methods: Different distribution assumptions are considered for estimating aleatoric uncertainty under Out-of-Distribution (OOD) inputs, with the Laplace distribution ultimately chosen to approximate the prediction error; for epistemic uncertainty, a novel method named Discretization-Induced Dirichlet pOsterior (DIDO) models a Dirichlet posterior on the discretized prediction error.
  • results: Experiments show that the proposed method provides robust uncertainty estimates across vision tasks, including age estimation, monocular depth estimation, and super-resolution, and that it scales to both image-level and pixel-wise tasks.
    Abstract Uncertainty quantification is critical for deploying deep neural networks (DNNs) in real-world applications. An Auxiliary Uncertainty Estimator (AuxUE) is one of the most effective means to estimate the uncertainty of the main task prediction without modifying the main task model. To be considered robust, an AuxUE must be capable of maintaining its performance and triggering higher uncertainties while encountering Out-of-Distribution (OOD) inputs, i.e., to provide robust aleatoric and epistemic uncertainty. However, for vision regression tasks, current AuxUE designs are mainly adopted for aleatoric uncertainty estimates, and AuxUE robustness has not been explored. In this work, we propose a generalized AuxUE scheme for more robust uncertainty quantification on regression tasks. Concretely, to achieve a more robust aleatoric uncertainty estimation, different distribution assumptions are considered for heteroscedastic noise, and Laplace distribution is finally chosen to approximate the prediction error. For epistemic uncertainty, we propose a novel solution named Discretization-Induced Dirichlet pOsterior (DIDO), which models the Dirichlet posterior on the discretized prediction error. Extensive experiments on age estimation, monocular depth estimation, and super-resolution tasks show that our proposed method can provide robust uncertainty estimates in the face of noisy inputs and that it can be scalable to both image-level and pixel-wise tasks.
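
As a rough illustration of the aleatoric part described in the abstract, a heteroscedastic Laplace negative log-likelihood can be written in a few lines of PyTorch; the per-sample log-scale head `log_b` is an assumption standing in for the paper's AuxUE output, and the DIDO epistemic branch is not sketched here.

```python
# Heteroscedastic Laplace negative log-likelihood -- a sketch of the aleatoric
# objective, assuming an auxiliary head predicts a per-sample log-scale.
import torch

def laplace_nll(mu: torch.Tensor, log_b: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """NLL of y under Laplace(mu, b): log(2b) + |y - mu| / b."""
    b = log_b.exp()                     # predict log-scale to keep b positive
    return (torch.log(2 * b) + (y - mu).abs() / b).mean()

# Toy usage: main-task prediction mu and an auxiliary head predicting log_b.
mu = torch.randn(8, requires_grad=True)
log_b = torch.zeros(8, requires_grad=True)
y = torch.randn(8)
loss = laplace_nll(mu, log_b, y)
loss.backward()
print(float(loss))
```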

Refining a Deep Learning-based Formant Tracker using Linear Prediction Methods

  • paper_url: http://arxiv.org/abs/2308.09051
  • repo_url: None
  • paper_authors: Paavo Alku, Sudarsana Reddy Kadiri, Dhananjaya Gowda
  • for: This study investigates and refines an existing data-driven formant tracker (DeepFormants).
  • methods: Formants are estimated with linear prediction (LP)-based methods, namely conventional covariance analysis (LP-COV) and the recently proposed quasi-closed phase forward-backward (QCP-FB) analysis.
  • results: Refining the formants predicted by DeepFormants with the LP-based estimates improves tracker performance and makes tracking more resilient on noise-corrupted VTR speech; the refinement plugs into the existing data-driven tracker without any new data learning.
    Abstract In this study, formant tracking is investigated by refining the formants tracked by an existing data-driven tracker, DeepFormants, using the formants estimated in a model-driven manner by linear prediction (LP)-based methods. As LP-based formant estimation methods, conventional covariance analysis (LP-COV) and the recently proposed quasi-closed phase forward-backward (QCP-FB) analysis are used. In the proposed refinement approach, the contours of the three lowest formants are first predicted by the data-driven DeepFormants tracker, and the predicted formants are replaced frame-wise with local spectral peaks shown by the model-driven LP-based methods. The refinement procedure can be plugged into the DeepFormants tracker with no need for any new data learning. Two refined DeepFormants trackers were compared with the original DeepFormants and with five known traditional trackers using the popular vocal tract resonance (VTR) corpus. The results indicated that the data-driven DeepFormants trackers outperformed the conventional trackers and that the best performance was obtained by refining the formants predicted by DeepFormants using QCP-FB analysis. In addition, by tracking formants using VTR speech that was corrupted by additive noise, the study showed that the refined DeepFormants trackers were more resilient to noise than the reference trackers. In general, these results suggest that LP-based model-driven approaches, which have traditionally been used in formant estimation, can be combined with a modern data-driven tracker easily with no further training to improve the tracker's performance.
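
A minimal sketch of the model-driven side, classical autocorrelation-based LP formant estimation (poles of the prediction polynomial mapped to frequencies), is shown below; it is a simplification, not the QCP-FB or LP-COV analyses used in the paper, and the two-resonance frame is synthetic.

```python
# Classical autocorrelation LP formant estimation (simplified; not QCP-FB).
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_formants(frame: np.ndarray, sr: int, order: int = 12) -> np.ndarray:
    """Estimate formant frequencies (Hz) from one windowed speech frame."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])   # LP coefficients
    roots = np.roots(np.concatenate(([1.0], -a)))   # poles of 1/A(z)
    roots = roots[np.imag(roots) > 0]               # one of each conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)      # pole angle -> Hz
    return np.sort(freqs[freqs > 90])               # drop near-DC poles

sr = 8000
t = np.arange(int(0.03 * sr)) / sr
# Toy "vowel" frame with resonances near 700 Hz and 1200 Hz plus noise.
frame = (np.sin(2 * np.pi * 700 * t) + 0.7 * np.sin(2 * np.pi * 1200 * t)
         + 0.01 * np.random.default_rng(0).normal(size=t.size))
print(lpc_formants(frame * np.hamming(t.size), sr)[:3])
```

In the paper's refinement, spectral peaks found this way replace, frame by frame, the contours of the three lowest formants predicted by DeepFormants.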

Kernel-Based Tests for Likelihood-Free Hypothesis Testing

  • paper_url: http://arxiv.org/abs/2308.09043
  • repo_url: None
  • paper_authors: Patrik Róbert Gerber, Tianze Jiang, Yury Polyanskiy, Rui Sun
  • for: The paper studies how to label additional samples, known to all belong to one of two classes, given a limited number of labeled samples from each class; it also considers the case where the unlabeled samples come from a mixture of the two classes.
  • methods: The paper builds on the likelihood-ratio test and the maximum mean discrepancy (MMD), and evaluates empirical performance with kernels parameterized by neural networks.
  • results: The paper establishes a fundamental trade-off between m and n, whereby increasing the data sample m reduces the amount n of training/simulation data needed, and confirms this asymmetric trade-off on two practical problems: detecting the Higgs boson and detecting planted DDPM-generated images amidst CIFAR-10 images.
    Abstract Given $n$ observations from two balanced classes, consider the task of labeling an additional $m$ inputs that are known to all belong to \emph{one} of the two classes. Special cases of this problem are well-known: with complete knowledge of class distributions ($n=\infty$) the problem is solved optimally by the likelihood-ratio test; when $m=1$ it corresponds to binary classification; and when $m\approx n$ it is equivalent to two-sample testing. The intermediate settings occur in the field of likelihood-free inference, where labeled samples are obtained by running forward simulations and the unlabeled sample is collected experimentally. In recent work it was discovered that there is a fundamental trade-off between $m$ and $n$: increasing the data sample $m$ reduces the amount $n$ of training/simulation data needed. In this work we (a) introduce a generalization where unlabeled samples come from a mixture of the two classes -- a case often encountered in practice; (b) study the minimax sample complexity for non-parametric classes of densities under \textit{maximum mean discrepancy} (MMD) separation; and (c) investigate the empirical performance of kernels parameterized by neural networks on two tasks: detection of the Higgs boson and detection of planted DDPM generated images amidst CIFAR-10 images. For both problems we confirm the existence of the theoretically predicted asymmetric $m$ vs $n$ trade-off.
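
The tests in the paper are phrased in terms of the maximum mean discrepancy; a plain-numpy sketch of the (biased) RBF-kernel MMD² estimate between two samples might look as follows, with Gaussian toy data standing in for simulation and experimental samples.

```python
# Biased RBF-kernel MMD^2 estimate between two samples (plain numpy).
import numpy as np

def mmd2_rbf(X: np.ndarray, Y: np.ndarray, sigma: float = 1.0) -> float:
    def k(A, B):  # Gaussian kernel matrix between sample sets A and B
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))   # e.g. labeled "simulation" sample
Y = rng.normal(0.5, 1.0, size=(200, 2))   # experimental sample, shifted mean
print("MMD^2, shifted samples:", mmd2_rbf(X, Y))
print("MMD^2, same distribution:", mmd2_rbf(X, rng.normal(0.0, 1.0, size=(200, 2))))
```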

LesionMix: A Lesion-Level Data Augmentation Method for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.09026
  • repo_url: https://github.com/dogabasaran/lesionmix
  • paper_authors: Berke Doga Basaran, Weitong Zhang, Mengyun Qiao, Bernhard Kainz, Paul M. Matthews, Wenjia Bai
  • For: The paper proposes a novel data augmentation method called LesionMix, designed to improve the accuracy of deep learning-based medical image segmentation methods.
  • Methods: The paper uses a combination of spatial and intensity transformations to augment medical images, with a focus on lesion-aware augmentation at the lesion level. The method allows for both lesion populating and inpainting, and is evaluated on multiple modalities and lesion datasets.
  • Results: The paper reports promising performance of LesionMix in lesion image segmentation, outperforming several recent Mix-based data augmentation methods. The code for LesionMix will be released on GitHub for further use and evaluation.
    Abstract Data augmentation has become a de facto component of deep learning-based medical image segmentation methods. Most data augmentation techniques used in medical imaging focus on spatial and intensity transformations to improve the diversity of training images. They are often designed at the image level, augmenting the full image, and do not pay attention to specific abnormalities within the image. Here, we present LesionMix, a novel and simple lesion-aware data augmentation method. It performs augmentation at the lesion level, increasing the diversity of lesion shape, location, intensity and load distribution, and allowing both lesion populating and inpainting. Experiments on different modalities and different lesion datasets, including four brain MR lesion datasets and one liver CT lesion dataset, demonstrate that LesionMix achieves promising performance in lesion image segmentation, outperforming several recent Mix-based data augmentation methods. The code will be released at https://github.com/dogabasaran/lesionmix.
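
A toy 2D sketch of the lesion "populating" idea, pasting a masked lesion from a source scan into a target scan and updating the label map, is given below; it is a simplified stand-in (hard paste, no blending or load-distribution modeling), not the released LesionMix code.

```python
# Lesion "populating" in the spirit of lesion-level augmentation: paste the
# masked lesion from a source image into a target image at a random offset.
import numpy as np

def paste_lesion(target, source, lesion_mask, rng):
    ys, xs = np.nonzero(lesion_mask)
    h, w = ys.max() - ys.min() + 1, xs.max() - xs.min() + 1
    patch = source[ys.min():ys.min() + h, xs.min():xs.min() + w]
    m = lesion_mask[ys.min():ys.min() + h, xs.min():xs.min() + w].astype(bool)
    top = rng.integers(0, target.shape[0] - h)
    left = rng.integers(0, target.shape[1] - w)
    out, out_mask = target.copy(), np.zeros_like(target, dtype=np.uint8)
    region = out[top:top + h, left:left + w]
    region[m] = patch[m]                        # hard paste; real method blends
    out_mask[top:top + h, left:left + w][m] = 1  # updated segmentation label
    return out, out_mask

rng = np.random.default_rng(0)
source = np.zeros((64, 64)); mask = np.zeros((64, 64))
source[20:26, 20:26] = 1.0; mask[20:26, 20:26] = 1   # synthetic lesion
aug_img, aug_lab = paste_lesion(np.zeros((64, 64)), source, mask, rng)
print("lesion pixels after paste:", int(aug_lab.sum()))
```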

Reinforcement Learning for Battery Management in Dairy Farming

  • paper_url: http://arxiv.org/abs/2308.09023
  • repo_url: None
  • paper_authors: Nawazish Ali, Abdul Wahid, Rachael Shaw, Karl Mason
  • For: This study applies artificial intelligence (AI) to renewable energy in dairy farming to reduce electricity costs and support governmental energy and sustainability goals.
  • Methods: A Q-learning agent learns a policy for charging and discharging a battery within a dairy farm setting.
  • Results: The learned policy significantly reduces electricity costs compared to the established baseline algorithm, demonstrating the effectiveness of reinforcement learning for battery management in the dairy farming sector.
    Abstract Dairy farming is a particularly energy-intensive part of the agriculture sector. Effective battery management is essential for renewable integration within the agriculture sector. However, controlling battery charging/discharging is a difficult task due to electricity demand variability, stochasticity of renewable generation, and energy price fluctuations. Despite the potential benefits of applying Artificial Intelligence (AI) to renewable energy in the context of dairy farming, there has been limited research in this area. This research is a priority for Ireland as it strives to meet its governmental goals in energy and sustainability. This research paper utilizes Q-learning to learn an effective policy for charging and discharging a battery within a dairy farm setting. The results demonstrate that the developed policy significantly reduces electricity costs compared to the established baseline algorithm. These findings highlight the effectiveness of reinforcement learning for battery management within the dairy farming sector.
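
A tabular Q-learning sketch of the kind of charge/hold/discharge agent described might look as follows; the hourly tariff, flat farm demand, and one-unit battery dynamics are all made-up toy assumptions, not the paper's MDP.

```python
# Tabular Q-learning for a toy battery charge/hold/discharge problem.
import numpy as np

rng = np.random.default_rng(0)
LEVELS, ACTIONS = 11, 3                 # state: battery level 0..10
price = np.array([0.1] * 8 + [0.3] * 12 + [0.1] * 4)  # hypothetical hourly tariff
Q = np.zeros((24, LEVELS, ACTIONS))     # Q[hour, level, action]
alpha, gamma, eps = 0.1, 0.95, 0.1

for episode in range(2000):
    level = 5
    for hour in range(24):
        a = rng.integers(ACTIONS) if rng.random() < eps else Q[hour, level].argmax()
        delta = (-1, 0, 1)[a]           # discharge / hold / charge one unit
        nxt = int(np.clip(level + delta, 0, LEVELS - 1))
        demand = 1.0                    # flat farm demand (toy)
        grid = max(demand - max(-delta, 0), 0) + max(delta, 0)
        r = -price[hour] * grid         # reward = negative electricity cost
        nh = (hour + 1) % 24
        Q[hour, level, a] += alpha * (r + gamma * Q[nh, nxt].max() - Q[hour, level, a])
        level = nxt

print("learned peak-hour action at mid charge:",
      ["discharge", "hold", "charge"][Q[12, 5].argmax()])
```

The learned policy should discharge during the expensive midday block and recharge in the cheap hours, which is the cost-shifting behavior the paper targets.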

Multi-field Visualisation via Trait-induced Merge Trees

  • paper_url: http://arxiv.org/abs/2308.09015
  • repo_url: None
  • paper_authors: Jochen Jankowai, Talha Bin Masood, Ingrid Hotz
  • for: The paper analyzes tensor fields and general multi-variate data by generalizing merge trees to feature level sets.
  • methods: Traits defined in attribute space, following the feature level sets framework, induce a distance field whose scalar field in the spatial domain serves as input for topological data analysis; the resulting structure supports ranking and querying of feature level sets.
  • results: The trait-induced merge tree enables effective exploration and querying of the most relevant and persistent features in tensor fields and multi-variate data; three case studies from different domains demonstrate the cross-application capabilities of the approach.
    Abstract In this work, we propose trait-based merge trees a generalization of merge trees to feature level sets, targeting the analysis of tensor field or general multi-variate data. For this, we employ the notion of traits defined in attribute space as introduced in the feature level sets framework. The resulting distance field in attribute space induces a scalar field in the spatial domain that serves as input for topological data analysis. The leaves in the merge tree represent those areas in the input data that are closest to the defined trait and thus most closely resemble the defined feature. Hence, the merge tree yields a hierarchy of features that allows for querying the most relevant and persistent features. The presented method includes different query methods for the tree which enable the highlighting of different aspects. We demonstrate the cross-application capabilities of this approach with three case studies from different domains.

Deep-seeded Clustering for Unsupervised Valence-Arousal Emotion Recognition from Physiological Signals

  • paper_url: http://arxiv.org/abs/2308.09013
  • repo_url: None
  • paper_authors: Antoine Dubois, Carlos Lima Azevedo, Sonja Haustein, Bruno Miranda
  • For: The paper addresses emotion recognition from physiological and psychological data using unsupervised deep clustering methods.
  • Methods: The paper proposes an unsupervised deep cluster framework for emotion recognition, using deep k-means and deep c-means to distinguish the four quadrants of Russell's circumplex model of affect.
  • Results: Tests on the open benchmark data set WESAD show that the proposed method achieves an overall accuracy of 87% in recognizing emotions from physiological and psychological data, without the need for labels.
    Abstract Emotions play a significant role in the cognitive processes of the human brain, such as decision making, learning and perception. The use of physiological signals has shown to lead to more objective, reliable and accurate emotion recognition combined with raising machine learning methods. Supervised learning methods have dominated the attention of the research community, but the challenge in collecting needed labels makes emotion recognition difficult in large-scale semi- or uncontrolled experiments. Unsupervised methods are increasingly being explored, however sub-optimal signal feature selection and label identification challenges unsupervised methods' accuracy and applicability. This article proposes an unsupervised deep cluster framework for emotion recognition from physiological and psychological data. Tests on the open benchmark data set WESAD show that deep k-means and deep c-means distinguish the four quadrants of Russell's circumplex model of affect with an overall accuracy of 87%. Seeding the clusters with the subject's subjective assessments helps to circumvent the need for labels.
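
Setting aside the deep feature extractors, the seeding idea can be illustrated with scikit-learn's KMeans initialized from a handful of self-assessed moments, one per valence-arousal quadrant; the synthetic 2D features below are a stand-in for WESAD-style physiological features.

```python
# Seeding k-means with a few subjective self-assessments: initialize one
# centroid per valence-arousal quadrant, then cluster the rest unsupervised.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], float)  # 4 quadrants
X = np.vstack([rng.normal(c, 0.4, size=(100, 2)) for c in centers])

# Pretend each subject self-assessed 5 moments per quadrant; average them
# to obtain one seed centroid per quadrant.
seeds = np.vstack([X[i * 100 : i * 100 + 5].mean(0) for i in range(4)])

km = KMeans(n_clusters=4, init=seeds, n_init=1).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```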

Towards Lightweight Data Integration using Multi-workflow Provenance and Data Observability

  • paper_url: http://arxiv.org/abs/2308.09004
  • repo_url: None
  • paper_authors: Renan Souza, Tyler J. Skluzacek, Sean R. Wilkinson, Maxim Ziatdinov, Rafael Ferreira da Silva
  • for: The paper enables cross-facility integrated data analysis to support Responsible AI development, FAIR principles, reproducibility, and user steering in scientific discovery.
  • methods: Building on data observability strategies, adapter system design, and provenance, the approach bridges heterogeneous parallel systems and machine learning tools.
  • results: Experiments show that MIDA runs with near-zero overhead at scale, including a workload with up to 276 GPUs in parallel on the Summit supercomputer.
    Abstract Modern large-scale scientific discovery requires multidisciplinary collaboration across diverse computing facilities, including High Performance Computing (HPC) machines and the Edge-to-Cloud continuum. Integrated data analysis plays a crucial role in scientific discovery, especially in the current AI era, by enabling Responsible AI development, FAIR, Reproducibility, and User Steering. However, the heterogeneous nature of science poses challenges such as dealing with multiple supporting tools, cross-facility environments, and efficient HPC execution. Building on data observability, adapter system design, and provenance, we propose MIDA: an approach for lightweight runtime Multi-workflow Integrated Data Analysis. MIDA defines data observability strategies and adaptability methods for various parallel systems and machine learning tools. With observability, it intercepts the dataflows in the background without requiring instrumentation while integrating domain, provenance, and telemetry data at runtime into a unified database ready for user steering queries. We conduct experiments showing end-to-end multi-workflow analysis integrating data from Dask and MLFlow in a real distributed deep learning use case for materials science that runs on multiple environments with up to 276 GPUs in parallel. We show near-zero overhead running up to 100,000 tasks on 1,680 CPU cores on the Summit supercomputer.

DealMVC: Dual Contrastive Calibration for Multi-view Clustering

  • paper_url: http://arxiv.org/abs/2308.09000
  • repo_url: https://github.com/xihongyang1999/dealmvc
  • paper_authors: Xihong Yang, Jiaqi Jin, Siwei Wang, Ke Liang, Yue Liu, Yi Wen, Suyuan Liu, Sihang Zhou, Xinwang Liu, En Zhu
  • for: To improve multi-view clustering performance by addressing similar-but-different samples in cross-view scenarios.
  • methods: The paper proposes a dual contrastive calibration network (DealMVC) that combines a fusion mechanism for a global cross-view feature, a global contrastive calibration loss, and a local contrastive calibration loss.
  • results: Extensive experiments on eight benchmark datasets demonstrate the effectiveness and superiority of DealMVC; the code is released on GitHub.
    Abstract Benefiting from the strong view-consistent information mining capacity, multi-view contrastive clustering has attracted plenty of attention in recent years. However, we observe the following drawback, which limits the clustering performance from further improvement. The existing multi-view models mainly focus on the consistency of the same samples in different views while ignoring the circumstance of similar but different samples in cross-view scenarios. To solve this problem, we propose a novel Dual contrastive calibration network for Multi-View Clustering (DealMVC). Specifically, we first design a fusion mechanism to obtain a global cross-view feature. Then, a global contrastive calibration loss is proposed by aligning the view feature similarity graph and the high-confidence pseudo-label graph. Moreover, to utilize the diversity of multi-view information, we propose a local contrastive calibration loss to constrain the consistency of pair-wise view features. The feature structure is regularized by reliable class information, thus guaranteeing similar samples have similar features in different views. During the training procedure, the interacted cross-view feature is jointly optimized at both local and global levels. In comparison with other state-of-the-art approaches, the comprehensive experimental results obtained from eight benchmark datasets provide substantial validation of the effectiveness and superiority of our algorithm. We release the code of DealMVC at https://github.com/xihongyang1999/DealMVC on GitHub.

Reinforced Self-Training (ReST) for Language Modeling

  • paper_url: http://arxiv.org/abs/2308.08998
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, Nando de Freitas
  • for: To improve the output quality of large language models (LLMs) by aligning them with human preferences via reinforcement learning from human feedback (RLHF).
  • methods: The paper proposes a simple algorithm inspired by growing batch reinforcement learning (RL), called Reinforced Self-Training (ReST): starting from an initial LLM policy, it generates samples from the policy and then improves the policy with offline RL algorithms.
  • results: ReST substantially improves translation quality, as measured by automated metrics and human evaluation on machine translation benchmarks, in a compute- and sample-efficient manner.
    Abstract Reinforcement learning from human feedback (RLHF) can improve the quality of large language model's (LLM) outputs by aligning them with human preferences. We propose a simple algorithm for aligning LLMs with human preferences inspired by growing batch reinforcement learning (RL), which we call Reinforced Self-Training (ReST). Given an initial LLM policy, ReST produces a dataset by generating samples from the policy, which are then used to improve the LLM policy using offline RL algorithms. ReST is more efficient than typical online RLHF methods because the training dataset is produced offline, which allows data reuse. While ReST is a general approach applicable to all generative learning settings, we focus on its application to machine translation. Our results show that ReST can substantially improve translation quality, as measured by automated metrics and human evaluation on machine translation benchmarks in a compute and sample-efficient manner.
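
The Grow/Improve structure of ReST can be caricatured on a toy categorical "policy": Grow samples a dataset offline from the current policy, and Improve re-fits on reward-filtered samples. The threshold-filtering fit below is a simplistic stand-in for the paper's offline RL step, not the actual algorithm.

```python
# Grow/Improve skeleton in the spirit of ReST, on a toy categorical policy.
import numpy as np

rng = np.random.default_rng(0)
vocab = np.arange(10)
reward = lambda x: float(x)              # toy reward: prefer larger tokens
policy = np.ones(10) / 10                # uniform initial "LLM"

for step in range(5):
    # Grow: sample a dataset from the current policy (offline, reusable).
    samples = rng.choice(vocab, size=2000, p=policy)
    rewards = np.array([reward(s) for s in samples])
    # Improve: keep samples above a reward threshold, re-fit the policy.
    threshold = np.quantile(rewards, 0.7)
    kept = samples[rewards >= threshold]
    counts = np.bincount(kept, minlength=10) + 1e-3   # smoothed MLE fit
    policy = counts / counts.sum()
    print(f"step {step}: mean reward {rewards.mean():.2f}")
```

Because each Grow phase produces a reusable offline dataset, the data can be exploited repeatedly, which is the efficiency argument made in the abstract.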

Learning representations by forward-propagating errors

  • paper_url: http://arxiv.org/abs/2308.09728
  • repo_url: None
  • paper_authors: Ryoungwoo Jang
  • for: The paper proposes a lightweight, fast learning algorithm for training neural networks.
  • methods: The algorithm is based on a forward-propagating method, using the concept of dual numbers from algebraic geometry.
  • results: The algorithm is faster than conventional back-propagation and enables neural network training on a central processing unit (CPU) at speeds comparable to CUDA acceleration on a GPU.
    Abstract Back-propagation (BP) is a widely used learning algorithm for neural network optimization. However, BP requires an enormous computation cost and is too slow to train on a central processing unit (CPU). Therefore, current neural network optimization is performed on a graphics processing unit (GPU) with Compute Unified Device Architecture (CUDA) programming. In this paper, we propose a light, fast learning algorithm on CPU that is as fast as CUDA acceleration on GPU. This algorithm is based on a forward-propagating method, using the concept of dual numbers from algebraic geometry.
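
The algebraic device behind forward-mode propagation is the dual number a + b·ε with ε² = 0: pushing a dual number through a function yields the derivative alongside the value in a single forward pass. A minimal sketch follows; it illustrates the dual-number mechanism only, not the paper's training procedure.

```python
# Forward-mode differentiation with dual numbers (a + b*eps, eps^2 = 0).
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot          # value and derivative part

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__

    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # (a + a'e)(b + b'e) = ab + (a b' + a' b) e   since e^2 = 0
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__

def f(x):                                      # f(x) = 3x^2 + 2x
    return 3 * x * x + 2 * x

x = Dual(2.0, 1.0)                             # seed dx/dx = 1
y = f(x)
print(y.val, y.dot)                            # 16.0 and f'(2) = 14.0
```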

Neural oscillators for generalization of physics-informed machine learning

  • paper_url: http://arxiv.org/abs/2308.08989
  • repo_url: None
  • paper_authors: Taniya Kapoor, Abhishek Chandra, Daniel M. Tartakovsky, Hongrui Wang, Alfredo Nunez, Rolf Dollevoet
  • for: To improve the generalization of physics-informed machine learning (PIML), especially for complex physical problems governed by partial differential equations (PDEs).
  • methods: The inherent causality and temporal sequential characteristics of PDE solutions are exploited to fuse PIML models with recurrent neural architectures based on systems of ordinary differential equations, referred to as neural oscillators.
  • results: By capturing long-time dependencies and mitigating the exploding and vanishing gradient problem, neural oscillators improve generalization in PIML tasks; experiments on time-dependent nonlinear PDEs and biharmonic beam equations show that the approach outperforms state-of-the-art methods across various metrics.
    Abstract A primary challenge of physics-informed machine learning (PIML) is its generalization beyond the training domain, especially when dealing with complex physical problems represented by partial differential equations (PDEs). This paper aims to enhance the generalization capabilities of PIML, facilitating practical, real-world applications where accurate predictions in unexplored regions are crucial. We leverage the inherent causality and temporal sequential characteristics of PDE solutions to fuse PIML models with recurrent neural architectures based on systems of ordinary differential equations, referred to as neural oscillators. Through effectively capturing long-time dependencies and mitigating the exploding and vanishing gradient problem, neural oscillators foster improved generalization in PIML tasks. Extensive experimentation involving time-dependent nonlinear PDEs and biharmonic beam equations demonstrates the efficacy of the proposed approach. Incorporating neural oscillators outperforms existing state-of-the-art methods on benchmark problems across various metrics. Consequently, the proposed method improves the generalization capabilities of PIML, providing accurate solutions for extrapolation and prediction beyond the training data.

Quantifying the biomimicry gap in biohybrid systems

  • paper_url: http://arxiv.org/abs/2308.08978
  • repo_url: None
  • paper_authors: Vaios Papaspyros, Guy Theraulaz, Clément Sire, Francesco Mondada
  • for: The paper uses biohybrid systems, in which robotic lures interact with animals, to probe and identify the mechanisms underlying collective animal behavior.
  • methods: A biomimetic lure of a rummy-nose tetra fish (Hemigrammus rhodostomus) is combined with a neural network (NN) model that generates biomimetic social interactions.
  • results: Experiments with a biohybrid pair (fish plus robotic lure), pairs of real fish, and simulated fish pairs show that the biohybrid system generates high-fidelity social interactions mirroring those of genuine fish pairs, and that comprehensive validation is crucial to bridge the "biomimicry gap".
    Abstract Biohybrid systems in which robotic lures interact with animals have become compelling tools for probing and identifying the mechanisms underlying collective animal behavior. One key challenge lies in the transfer of social interaction models from simulations to reality, using robotics to validate the modeling hypotheses. This challenge arises in bridging what we term the "biomimicry gap", which is caused by imperfect robotic replicas, and by communication cues and physics constraints not incorporated in the simulations, which may elicit unrealistic behavioral responses in animals. In this work, we used a biomimetic lure of a rummy-nose tetra fish (Hemigrammus rhodostomus) and a neural network (NN) model for generating biomimetic social interactions. Through experiments with a biohybrid pair comprising a fish and the robotic lure, a pair of real fish, and simulations of pairs of fish, we demonstrate that our biohybrid system generates high-fidelity social interactions mirroring those of genuine fish pairs. Our analyses highlight that: 1) the lure and NN maintain minimal deviation in real-world interactions compared to simulations and fish-only experiments, 2) our NN controls the robot efficiently in real-time, and 3) a comprehensive validation is crucial to bridge the biomimicry gap, ensuring realistic biohybrid systems.

Hitting the High-Dimensional Notes: An ODE for SGD learning dynamics on GLMs and multi-index models

  • paper_url: http://arxiv.org/abs/2308.08977
  • repo_url: None
  • paper_authors: Elizabeth Collins-Woodfin, Courtney Paquette, Elliot Paquette, Inbar Seroussi
  • for: The paper analyzes the dynamics of streaming stochastic gradient descent (SGD) in the high-dimensional limit, applied to generalized linear models and multi-index models (e.g., logistic regression, phase retrieval) with general data covariance.
  • methods: A deterministic equivalent of SGD is derived as a system of ordinary differential equations (ODEs) describing a wide class of statistics, such as the risk and other measures of sub-optimality; a simplified stochastic differential equation (homogenized SGD) is also introduced to analyze the dynamics of general statistics of SGD iterates.
  • results: The paper obtains learning-rate thresholds for the stability of SGD as well as convergence guarantees, and numerical simulations on standard examples match the theory closely.
    Abstract We analyze the dynamics of streaming stochastic gradient descent (SGD) in the high-dimensional limit when applied to generalized linear models and multi-index models (e.g. logistic regression, phase retrieval) with general data-covariance. In particular, we demonstrate a deterministic equivalent of SGD in the form of a system of ordinary differential equations that describes a wide class of statistics, such as the risk and other measures of sub-optimality. This equivalence holds with overwhelming probability when the model parameter count grows proportionally to the number of data. This framework allows us to obtain learning rate thresholds for stability of SGD as well as convergence guarantees. In addition to the deterministic equivalent, we introduce an SDE with a simplified diffusion coefficient (homogenized SGD) which allows us to analyze the dynamics of general statistics of SGD iterates. Finally, we illustrate this theory on some standard examples and show numerical simulations which give an excellent match to the theory.

Cross-city Few-Shot Traffic Forecasting via Traffic Pattern Bank

  • paper_url: http://arxiv.org/abs/2308.09727
  • repo_url: https://github.com/zhyliu00/tpb
  • paper_authors: Zhanyu Liu, Guanjie Zheng, Yanwei Yu
  • for: This work aims to improve cross-city traffic forecasting when the target city has scarce data.
  • methods: A cross-city few-shot approach aggregates traffic patterns from data-rich cities into a Traffic Pattern Bank (TPB): a pre-trained traffic patch encoder projects raw traffic data into a high-dimensional space, clustering builds the bank, and the data-scarce city queries it so that the resulting relations guide a downstream spatial-temporal forecasting model.
  • results: Experiments on real-world traffic data show that the approach outperforms existing methods.
    Abstract Traffic forecasting is a critical service in Intelligent Transportation Systems (ITS). Utilizing deep models to tackle this task relies heavily on data from traffic sensors or vehicle devices, while some cities might lack device support and thus have few available data. So, it is necessary to learn from data-rich cities and transfer the knowledge to data-scarce cities in order to improve the performance of traffic forecasting. To address this problem, we propose a cross-city few-shot traffic forecasting framework via Traffic Pattern Bank (TPB) due to that the traffic patterns are similar across cities. TPB utilizes a pre-trained traffic patch encoder to project raw traffic data from data-rich cities into high-dimensional space, from which a traffic pattern bank is generated through clustering. Then, the traffic data of the data-scarce city could query the traffic pattern bank and explicit relations between them are constructed. The metaknowledge is aggregated based on these relations and an adjacency matrix is constructed to guide a downstream spatial-temporal model in forecasting future traffic. The frequently used meta-training framework Reptile is adapted to find a better initial parameter for the learnable modules. Experiments on real-world traffic datasets show that TPB outperforms existing methods and demonstrates the effectiveness of our approach in cross-city few-shot traffic forecasting.

CONVERT:Contrastive Graph Clustering with Reliable Augmentation

  • paper_url: http://arxiv.org/abs/2308.08963
  • repo_url: https://github.com/xihongyang1999/convert
  • paper_authors: Xihong Yang, Cheng Tan, Yue Liu, Ke Liang, Siwei Wang, Sihang Zhou, Jun Xia, Stan Z. Li, Xinwang Liu, En Zhu
  • for: To propose a reliable contrastive graph clustering method that does not depend on pre-defined augmentations, whose semantics can easily drift.
  • methods: The method processes data augmentations with a reversible perturb-recover network that distills reliable semantic information by recovering the perturbed latent embeddings; a novel semantic loss constrains the network by quantifying the perturbation and recovery, and a label-matching mechanism aligns semantic labels with high-confidence clustering pseudo labels.
  • results: Extensive experiments on seven datasets demonstrate the effectiveness and superiority of the method over existing approaches.
    Abstract Contrastive graph node clustering via learnable data augmentation is a hot research spot in the field of unsupervised graph learning. The existing methods learn the sampling distribution of a pre-defined augmentation to generate data-driven augmentations automatically. Although promising clustering performance has been achieved, we observe that these strategies still rely on pre-defined augmentations, the semantics of the augmented graph can easily drift. The reliability of the augmented view semantics for contrastive learning can not be guaranteed, thus limiting the model performance. To address these problems, we propose a novel CONtrastiVe Graph ClustEring network with Reliable AugmenTation (COVERT). Specifically, in our method, the data augmentations are processed by the proposed reversible perturb-recover network. It distills reliable semantic information by recovering the perturbed latent embeddings. Moreover, to further guarantee the reliability of semantics, a novel semantic loss is presented to constrain the network via quantifying the perturbation and recovery. Lastly, a label-matching mechanism is designed to guide the model by clustering information through aligning the semantic labels and the selected high-confidence clustering pseudo labels. Extensive experimental results on seven datasets demonstrate the effectiveness of the proposed method. We release the code and appendix of CONVERT at https://github.com/xihongyang1999/CONVERT on GitHub.

Equitable Restless Multi-Armed Bandits: A General Framework Inspired By Digital Health

  • paper_url: http://arxiv.org/abs/2308.09726
  • repo_url: https://github.com/google-research/socialgood
  • paper_authors: Jackson A. Killian, Manish Jain, Yugang Jia, Jonathan Amar, Erich Huang, Milind Tambe
  • for: The paper studies how to make restless multi-armed bandit (RMAB) decisions equitable, motivated in particular by digital health.
  • methods: Two equity-aligned objectives from the fairness literature are considered: minimax reward, solved with a water-filling algorithm, and maximum Nash welfare, solved with a greedy algorithm whose theoretically motivated nuance balances disparate group sizes.
  • results: Across three simulation domains, including a new digital health model, the approaches can be multiple times more equitable than the current state of the art without drastic sacrifices to utility; code is available at https://github.com/google-research/socialgood/tree/equitable-rmab.
    Abstract Restless multi-armed bandits (RMABs) are a popular framework for algorithmic decision making in sequential settings with limited resources. RMABs are increasingly being used for sensitive decisions such as in public health, treatment scheduling, anti-poaching, and -- the motivation for this work -- digital health. For such high stakes settings, decisions must both improve outcomes and prevent disparities between groups (e.g., ensure health equity). We study equitable objectives for RMABs (ERMABs) for the first time. We consider two equity-aligned objectives from the fairness literature, minimax reward and max Nash welfare. We develop efficient algorithms for solving each -- a water filling algorithm for the former, and a greedy algorithm with theoretically motivated nuance to balance disparate group sizes for the latter. Finally, we demonstrate across three simulation domains, including a new digital health model, that our approaches can be multiple times more equitable than the current state of the art without drastic sacrifices to utility. Our findings underscore our work's urgency as RMABs permeate into systems that impact human and wildlife outcomes. Code is available at https://github.com/google-research/socialgood/tree/equitable-rmab
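
For the minimax-reward objective, the water-filling idea amounts to repeatedly giving the next unit of budget to the currently worst-off group; the sketch below assumes, for simplicity, a known constant reward gain per allocated unit, which is a toy stand-in for the paper's RMAB value computation.

```python
# Water-filling flavor of a minimax-reward allocation (toy stand-in).
import heapq

def waterfill(base_reward, gain_per_unit, budget):
    """Greedily give each budget unit to the currently worst-off group."""
    alloc = [0] * len(base_reward)
    heap = [(r, g) for g, r in enumerate(base_reward)]
    heapq.heapify(heap)
    for _ in range(budget):
        r, g = heapq.heappop(heap)          # group with lowest reward so far
        alloc[g] += 1
        heapq.heappush(heap, (r + gain_per_unit[g], g))
    return alloc, min(r for r, _ in heap)

alloc, worst = waterfill(base_reward=[0.2, 0.5, 0.4],
                         gain_per_unit=[0.05, 0.05, 0.05], budget=10)
print("units per group:", alloc, "| minimax reward:", round(worst, 2))
```

The worst-off group's reward is "filled" until it catches up with the others, which is the equalizing behavior the minimax objective rewards.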

A Dual-Perspective Approach to Evaluating Feature Attribution Methods

  • paper_url: http://arxiv.org/abs/2308.08949
  • repo_url: None
  • paper_authors: Yawei Li, Yang Zhang, Kenji Kawaguchi, Ashkan Khakzar, Bernd Bischl, Mina Rezaei
  • for: The paper critiques existing faithfulness evaluations of feature attribution methods and proposes two new evaluation perspectives: soundness and completeness.
  • methods: Both perspectives rest on a firm mathematical foundation and yield quantitative metrics that are computable through efficient algorithms.
  • results: Applying these metrics to mainstream attribution methods provides a novel lens for analyzing and comparing feature attribution methods, and reveals shortcomings of existing evaluations.
    Abstract Feature attribution methods attempt to explain neural network predictions by identifying relevant features. However, establishing a cohesive framework for assessing feature attribution remains a challenge. There are several views through which we can evaluate attributions. One principal lens is to observe the effect of perturbing attributed features on the model's behavior (i.e., faithfulness). While providing useful insights, existing faithfulness evaluations suffer from shortcomings that we reveal in this paper. In this work, we propose two new perspectives within the faithfulness paradigm that reveal intuitive properties: soundness and completeness. Soundness assesses the degree to which attributed features are truly predictive features, while completeness examines how well the resulting attribution reveals all the predictive features. The two perspectives are based on a firm mathematical foundation and provide quantitative metrics that are computable through efficient algorithms. We apply these metrics to mainstream attribution methods, offering a novel lens through which to analyze and compare feature attribution methods.
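
The paper's soundness and completeness metrics follow its own formalism, which is not reproduced here; the snippet below only illustrates the generic perturbation-based faithfulness pattern such evaluations build on: mask the top-attributed features and record the drop in the model's score.

```python
# Generic perturbation-based faithfulness check (the pattern being critiqued).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(int)    # only features 0 and 1 matter
clf = LogisticRegression().fit(X, y)

attribution = np.abs(clf.coef_[0])             # toy attribution: |weights|
x = X[0:1]
base = clf.predict_proba(x)[0, 1]

for k in (1, 2, 5):
    top = np.argsort(attribution)[::-1][:k]
    x_pert = x.copy()
    x_pert[0, top] = 0.0                       # "remove" the attributed features
    drop = base - clf.predict_proba(x_pert)[0, 1]
    print(f"mask top-{k}: score drop {drop:+.3f}")
```

Roughly, soundness asks whether the features an attribution flags really are predictive, while completeness asks whether all predictive features were flagged; a score drop alone cannot distinguish the two, which motivates the paper's dual perspective.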

Predicting Crop Yield With Machine Learning: An Extensive Analysis Of Input Modalities And Models On a Field and sub-field Level

  • paper_url: http://arxiv.org/abs/2308.08948
  • repo_url: None
  • paper_authors: Deepak Pathak, Miro Miranda, Francisco Mena, Cristhian Sanchez, Patrick Helber, Benjamin Bischke, Peter Habelitz, Hiba Najjar, Jayanth Siddamsetty, Diego Arenas, Michaela Vollmer, Marcela Charfuelan, Marlon Nuske, Andreas Dengel
  • for: The paper proposes a simple yet effective early fusion method for crop yield prediction that handles input modalities with different temporal and spatial resolutions.
  • methods: High-resolution crop yield maps serve as ground truth for training crop- and model-agnostic machine learning methods at the sub-field level, with Sentinel-2 satellite imagery as the primary input modality complemented by weather, soil, and DEM data.
  • results: The method uses only input modalities with global coverage, making the framework globally scalable; the best-performing combination of input modalities depends on region, crop, and chosen model.
    Abstract We introduce a simple yet effective early fusion method for crop yield prediction that handles multiple input modalities with different temporal and spatial resolutions. We use high-resolution crop yield maps as ground truth data to train crop and machine learning model agnostic methods at the sub-field level. We use Sentinel-2 satellite imagery as the primary modality for input data with other complementary modalities, including weather, soil, and DEM data. The proposed method uses input modalities available with global coverage, making the framework globally scalable. We explicitly highlight the importance of input modalities for crop yield prediction and emphasize that the best-performing combination of input modalities depends on region, crop, and chosen model.

Interpretable Graph Neural Networks for Tabular Data

  • paper_url: http://arxiv.org/abs/2308.08945
  • repo_url: None
  • paper_authors: Amr Alkhatib, Sofiane Ennadir, Henrik Boström, Michalis Vazirgiannis
  • for: The goal is to develop an interpretable graph neural network (IGNNet) for tabular data.
  • methods: IGNNet constrains the learning algorithm to produce an interpretable model that shows exactly how the predictions are computed from the original input features.
  • results: Experiments show that IGNNet performs on par with state-of-the-art machine learning algorithms for tabular data, including XGBoost, Random Forests, and TabNet, while its explanations align with the true Shapley values of the features without additional computational overhead.
    Abstract Data in tabular format is frequently occurring in real-world applications. Graph Neural Networks (GNNs) have recently been extended to effectively handle such data, allowing feature interactions to be captured through representation learning. However, these approaches essentially produce black-box models, in the form of deep neural networks, precluding users from following the logic behind the model predictions. We propose an approach, called IGNNet (Interpretable Graph Neural Network for tabular data), which constrains the learning algorithm to produce an interpretable model, where the model shows how the predictions are exactly computed from the original input features. A large-scale empirical investigation is presented, showing that IGNNet is performing on par with state-of-the-art machine-learning algorithms that target tabular data, including XGBoost, Random Forests, and TabNet. At the same time, the results show that the explanations obtained from IGNNet are aligned with the true Shapley values of the features without incurring any additional computational overhead.

Causal Adversarial Perturbations for Individual Fairness and Robustness in Heterogeneous Data Spaces

  • paper_url: http://arxiv.org/abs/2308.08938
  • repo_url: https://github.com/Ehyaei/CAPIFY
  • paper_authors: Ahmad-Reza Ehyaei, Kiarash Mohammadi, Amir-Hossein Karimi, Samira Samadi, Golnoosh Farnadi
  • for: The paper jointly explores and integrates individual fairness, adversarial robustness, and structural causal models in heterogeneous data spaces, particularly with discrete sensitive attributes.
  • methods: Causal structural models and sensitive attributes are used to create a fair metric for measuring semantic similarity among individuals; a novel causal adversarial perturbation combined with adversarial training yields a regularizer that unites individual fairness, causality, and robustness in the classifier.
  • results: Evaluations on real-world and synthetic datasets demonstrate an accurate classifier that simultaneously exhibits fairness, adversarial robustness, and causal awareness.
    Abstract As responsible AI gains importance in machine learning algorithms, properties such as fairness, adversarial robustness, and causality have received considerable attention in recent years. However, despite their individual significance, there remains a critical gap in simultaneously exploring and integrating these properties. In this paper, we propose a novel approach that examines the relationship between individual fairness, adversarial robustness, and structural causal models in heterogeneous data spaces, particularly when dealing with discrete sensitive attributes. We use causal structural models and sensitive attributes to create a fair metric and apply it to measure semantic similarity among individuals. By introducing a novel causal adversarial perturbation and applying adversarial training, we create a new regularizer that combines individual fairness, causality, and robustness in the classifier. Our method is evaluated on both real-world and synthetic datasets, demonstrating its effectiveness in achieving an accurate classifier that simultaneously exhibits fairness, adversarial robustness, and causal awareness.

Estimating fire Duration using regression methods

  • paper_url: http://arxiv.org/abs/2308.08936
  • repo_url: None
  • paper_authors: Hansong Xiao
  • for: This study proposes machine learning methods to address the high computational cost and long runtimes of wildfire prediction.
  • methods: Random forest, KNN, and XGBoost regression models, as well as image-based models such as CNNs and encoders, are used to predict the burning duration of a known wildfire.
  • results: Trained on historical fire data and terrain feature maps and tested against the most recent real values in the same area, the system delivers fast and relatively accurate future predictions.
    Abstract Wildfire forecasting problems usually rely on complex grid-based mathematical models, mostly involving Computational Fluid Dynamics (CFD) and Cellular Automata, but these methods have always been computationally expensive and struggle to deliver a fast decision pattern. In this paper, we provide machine learning based approaches that solve the problem of high computational effort and time consumption. This paper predicts the burning duration of a known wildfire with RF (random forest), KNN, and XGBoost regression models, as well as image-based models such as CNNs and encoders. Model inputs are based on the map of landscape features provided by satellites and the corresponding historical fire data in this area. The model is trained on historical fire data and landform feature maps and tested with the most recent real values in the same area. By processing the input differently to obtain the optimal outcome, the system is able to make fast and relatively accurate future predictions based on landscape images of known fires.
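
A minimal version of the simplest model compared, random-forest regression of burning duration, is sketched below; the six synthetic "landscape features" and the toy ground-truth formula are assumptions standing in for the satellite-derived maps and historical fire records.

```python
# Random-forest regression of burning duration on synthetic landscape features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 6))   # e.g. slope, fuel load, wind, humidity, temp, NDVI
duration = 5 + 3 * X[:, 1] - 2 * X[:, 3] + rng.normal(0, 0.5, n)  # toy truth

X_tr, X_te, y_tr, y_te = train_test_split(X, duration, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("MAE on held-out fires:",
      round(mean_absolute_error(y_te, rf.predict(X_te)), 3))
```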

On Data Imbalance in Molecular Property Prediction with Pre-training

  • paper_url: http://arxiv.org/abs/2308.08934
  • repo_url: None
  • paper_authors: Limin Wang, Masatoshi Hanai, Toyotaro Suzumura, Shun Takashige, Kenjiro Taura
  • For: To improve the accuracy of molecular property prediction models by addressing the imbalance in input data during pre-training.
  • Methods: The work combines theoretical calculation with machine learning, training a surrogate model on a subset of theoretical calculation results and applying it to the remaining materials; pre-training is used to improve accuracy, and the loss function of the representative pre-training method, node masking, is modified to compensate for the imbalance in the input data.
  • Results: Experiments and evaluations on a standard molecular property prediction benchmark confirm that the proposed imbalance compensation improves the final prediction accuracy after pre-training.
    Abstract Revealing and analyzing the various properties of materials is an essential and critical issue in the development of materials, including batteries, semiconductors, catalysts, and pharmaceuticals. Traditionally, these properties have been determined through theoretical calculations and simulations. However, it is not practical to perform such calculations on every single candidate material. Recently, a combination method of the theoretical calculation and machine learning has emerged, that involves training machine learning models on a subset of theoretical calculation results to construct a surrogate model that can be applied to the remaining materials. On the other hand, a technique called pre-training is used to improve the accuracy of machine learning models. Pre-training involves training the model on pretext task, which is different from the target task, before training the model on the target task. This process aims to extract the input data features, stabilizing the learning process and improving its accuracy. However, in the case of molecular property prediction, there is a strong imbalance in the distribution of input data and features, which may lead to biased learning towards frequently occurring data during pre-training. In this study, we propose an effective pre-training method that addresses the imbalance in input data. We aim to improve the final accuracy by modifying the loss function of the existing representative pre-training method, node masking, to compensate the imbalance. We have investigated and assessed the impact of our proposed imbalance compensation on pre-training and the final prediction accuracy through experiments and evaluations using benchmark of molecular property prediction models.

IMM: An Imitative Reinforcement Learning Approach with Predictive Representation Learning for Automatic Market Making

  • paper_url: http://arxiv.org/abs/2308.08918
  • repo_url: None
  • paper_authors: Hui Niu, Siyuan Li, Jiahao Zheng, Zhouchi Lin, Jian Li, Jian Guo, Bo An
  • For: This paper proposes a novel Reinforcement Learning (RL) framework called Imitative Market Maker (IMM) to develop multi-price level Market Making (MM) strategies efficiently.
  • Methods: IMM integrates effective state and action representations, representation learning, and imitation learning techniques to train the agent efficiently.
  • Results: The proposed IMM framework outperforms current RL-based market making strategies on several financial criteria, and an ablation study substantiates the effectiveness of the model components.
    Abstract Market making (MM) has attracted significant attention in financial trading owing to its essential function in ensuring market liquidity. With strong capabilities in sequential decision-making, Reinforcement Learning (RL) technology has achieved remarkable success in quantitative trading. Nonetheless, most existing RL-based MM methods focus on optimizing single-price level strategies which fail at frequent order cancellations and loss of queue priority. Strategies involving multiple price levels align better with actual trading scenarios. However, given the complexity that multi-price level strategies involves a comprehensive trading action space, the challenge of effectively training profitable RL agents for MM persists. Inspired by the efficient workflow of professional human market makers, we propose Imitative Market Maker (IMM), a novel RL framework leveraging both knowledge from suboptimal signal-based experts and direct policy interactions to develop multi-price level MM strategies efficiently. The framework start with introducing effective state and action representations adept at encoding information about multi-price level orders. Furthermore, IMM integrates a representation learning unit capable of capturing both short- and long-term market trends to mitigate adverse selection risk. Subsequently, IMM formulates an expert strategy based on signals and trains the agent through the integration of RL and imitation learning techniques, leading to efficient learning. Extensive experimental results on four real-world market datasets demonstrate that IMM outperforms current RL-based market making strategies in terms of several financial criteria. The findings of the ablation study substantiate the effectiveness of the model components.

Beyond Sharing: Conflict-Aware Multivariate Time Series Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.08915
  • repo_url: https://github.com/dawnvince/mts_cad
  • paper_authors: Haotian Si, Changhua Pei, Zhihan Li, Yadong Zhao, Jingjing Li, Haiming Zhang, Zulong Diao, Jianhui Li, Gaogang Xie, Dan Pei
  • for: The paper proposes a multi-task-learning-based multivariate time series anomaly detection method (CAD) that resolves the conflicts among metrics' regression objectives found in existing approaches.
  • methods: Mimicking the design of multi-gate mixture-of-experts (MMoE), CAD gives each metric an exclusive structure that mitigates potential conflicts while fostering inter-metric promotion, together with a task-oriented metric selection and a personalized-and-shared (p&s) gating mechanism.
  • results: Evaluated on multiple public datasets, CAD obtains an average F1-score of 0.943 across three public datasets, notably outperforming state-of-the-art methods.
    Abstract Massive key performance indicators (KPIs) are monitored as multivariate time series data (MTS) to ensure the reliability of the software applications and service system. Accurately detecting the abnormality of MTS is very critical for subsequent fault elimination. The scarcity of anomalies and manual labeling has led to the development of various self-supervised MTS anomaly detection (AD) methods, which optimize an overall objective/loss encompassing all metrics' regression objectives/losses. However, our empirical study uncovers the prevalence of conflicts among metrics' regression objectives, causing MTS models to grapple with different losses. This critical aspect significantly impacts detection performance but has been overlooked in existing approaches. To address this problem, by mimicking the design of multi-gate mixture-of-experts (MMoE), we introduce CAD, a Conflict-aware multivariate KPI Anomaly Detection algorithm. CAD offers an exclusive structure for each metric to mitigate potential conflicts while fostering inter-metric promotions. Upon thorough investigation, we find that the poor performance of vanilla MMoE mainly comes from the input-output misalignment settings of MTS formulation and convergence issues arising from expansive tasks. To address these challenges, we propose a straightforward yet effective task-oriented metric selection and p&s (personalized and shared) gating mechanism, which establishes CAD as the first practicable multi-task learning (MTL) based MTS AD model. Evaluations on multiple public datasets reveal that CAD obtains an average F1-score of 0.943 across three public datasets, notably outperforming state-of-the-art methods. Our code is accessible at https://github.com/dawnvince/MTS_CAD.
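To make the conflict-mitigation idea concrete, here is a minimal MMoE-style layout in PyTorch: shared experts feed one gate and one small tower per metric, so each metric's regression loss backpropagates through its own mixture. CAD's task-oriented metric selection and p&s gating are simplified away, and all sizes are assumptions:

```python
import torch
import torch.nn as nn

class PerMetricMoE(nn.Module):
    """Shared experts with a per-metric gate and tower, so each KPI's
    regression objective controls its own expert mixture."""
    def __init__(self, n_metrics=8, window=32, n_experts=4, hidden=64):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(n_metrics * window, hidden), nn.ReLU())
             for _ in range(n_experts)])
        self.gates = nn.ModuleList(
            [nn.Linear(n_metrics * window, n_experts) for _ in range(n_metrics)])
        self.towers = nn.ModuleList(
            [nn.Linear(hidden, 1) for _ in range(n_metrics)])

    def forward(self, x):                       # x: (batch, n_metrics * window)
        e = torch.stack([ex(x) for ex in self.experts], dim=1)   # (B, E, H)
        preds = []
        for gate, tower in zip(self.gates, self.towers):
            w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)     # (B, E, 1)
            preds.append(tower((w * e).sum(dim=1)))              # (B, 1)
        return torch.cat(preds, dim=-1)          # next-step value per metric

model = PerMetricMoE()
x = torch.randn(16, 8 * 32)
print(model(x).shape)  # torch.Size([16, 8])
```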

MoCLIM: Towards Accurate Cancer Subtyping via Multi-Omics Contrastive Learning with Omics-Inference Modeling

  • paper_url: http://arxiv.org/abs/2308.09725
  • repo_url: None
  • paper_authors: Ziwei Yang, Zheng Chen, Yasuko Matsubara, Yasushi Sakurai
  • for: This paper aims to improve cancer subtyping outcomes by fully exploiting the potential of multi-omics data.
  • methods: The paper proposes a representation learning framework called MoCLIM, which independently extracts informative features from distinct omics modalities and uses contrastive learning to integrate the features into a unified representation.
  • results: The experimental results on six cancer datasets demonstrate that the proposed approach significantly improves data fit and subtyping performance in fewer high-dimensional cancer instances, and provides high interpretability in medical analysis.
    Abstract Precision medicine fundamentally aims to establish causality between dysregulated biochemical mechanisms and cancer subtypes. Omics-based cancer subtyping has emerged as a revolutionary approach, as different levels of omics record the biochemical products of multistep processes in cancers. This paper focuses on fully exploiting the potential of multi-omics data to improve cancer subtyping outcomes, and to this end develops MoCLIM, a representation learning framework. MoCLIM independently extracts the informative features from distinct omics modalities. Using a unified representation informed by contrastive learning across omics modalities, we can cluster the subtypes of a given cancer well in a lower-dimensional latent space. This contrast can be interpreted as a projection of inter-omics inference observed in biological networks. Experimental results on six cancer datasets demonstrate that our approach significantly improves data fit and subtyping performance using fewer high-dimensional cancer instances. Moreover, our framework incorporates various medical evaluations as the final component, providing high interpretability in medical analysis.
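The contrastive integration step can be illustrated with a standard symmetric InfoNCE loss between two per-modality encoders that map omics features into a shared space. The encoder shapes and the temperature are assumptions, not MoCLIM's actual configuration:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """Symmetric InfoNCE between two modality embeddings of the same samples:
    row i of z1 (e.g., gene expression) and row i of z2 (e.g., methylation)
    are positives; all other pairs are negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                 # (B, B) cosine similarities
    labels = torch.arange(z1.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Two per-modality encoders mapping raw omics features to a shared space.
enc_a = torch.nn.Sequential(torch.nn.Linear(2000, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, 64))
enc_b = torch.nn.Sequential(torch.nn.Linear(500, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, 64))
xa, xb = torch.randn(32, 2000), torch.randn(32, 500)
print(info_nce(enc_a(xa), enc_b(xb)).item())
```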

Development of a Knowledge Graph Embeddings Model for Pain

  • paper_url: http://arxiv.org/abs/2308.08904
  • repo_url: None
  • paper_authors: Jaya Chaturvedi, Tao Wang, Sumithra Velupillai, Robert Stewart, Angus Roberts
  • for: The goal is to build a knowledge graph of pain so that the pain experienced by patients can be reasoned about semantically and contextually in a computationally tractable form.
  • methods: Knowledge graph embedding techniques map the graph's concepts and relations into a low-dimensional vector space for downstream tasks such as classification and link prediction; relations from the external medical knowledge base SNOMED CT supply useful external knowledge.
  • results: Pain concepts extracted from mental health electronic health records are combined with external knowledge from SNOMED CT to construct the knowledge graph; the resulting embedding models are evaluated on a subject-object link prediction task and compared against other baseline models.
    Abstract Pain is a complex concept that can interconnect with other concepts such as a disorder that might cause pain, a medication that might relieve pain, and so on. To fully understand the context of pain experienced by either an individual or across a population, we may need to examine all concepts related to pain and the relationships between them. This is especially useful when modeling pain that has been recorded in electronic health records. Knowledge graphs represent concepts and their relations by an interlinked network, enabling semantic and context-based reasoning in a computationally tractable form. These graphs can, however, be too large for efficient computation. Knowledge graph embeddings help to resolve this by representing the graphs in a low-dimensional vector space. These embeddings can then be used in various downstream tasks such as classification and link prediction. The various relations associated with pain which are required to construct such a knowledge graph can be obtained from external medical knowledge bases such as SNOMED CT, a hierarchical systematic nomenclature of medical terms. A knowledge graph built in this way could be further enriched with real-world examples of pain and its relations extracted from electronic health records. This paper describes the construction of such knowledge graph embedding models of pain concepts, extracted from the unstructured text of mental health electronic health records, combined with external knowledge created from relations described in SNOMED CT, and their evaluation on a subject-object link prediction task. The performance of the models was compared with other baseline models.
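As a concrete example of the embedding-plus-link-prediction pipeline, the sketch below scores (subject, relation, object) triples with a TransE-style translation model and a margin ranking loss against a corrupted triple. The paper evaluates several embedding models, so treat this as a representative baseline; the entity and relation counts are made up:

```python
import torch
import torch.nn as nn

class TransE(nn.Module):
    """Minimal TransE model: a triple (h, r, t) is plausible when the
    translated head embedding h + r lies close to the tail embedding t."""
    def __init__(self, n_entities, n_relations, dim=50):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)

    def score(self, h, r, t):
        # Lower ||h + r - t||_1 means the triple is more plausible.
        return (self.ent(h) + self.rel(r) - self.ent(t)).norm(p=1, dim=-1)

model = TransE(n_entities=1000, n_relations=20)
h, r, t = torch.tensor([0]), torch.tensor([3]), torch.tensor([42])
neg_t = torch.tensor([7])  # corrupted tail for the ranking loss
margin_loss = torch.relu(1.0 + model.score(h, r, t) - model.score(h, r, neg_t)).mean()
print(margin_loss.item())
```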

Optimal Resource Allocation for U-Shaped Parallel Split Learning

  • paper_url: http://arxiv.org/abs/2308.08896
  • repo_url: None
  • paper_authors: Song Lyu, Zheng Lin, Guanqiao Qu, Xianhao Chen, Xiaoxia Huang, Pan Li
  • for: This work proposes a U-shaped-network-based split learning approach that protects the label privacy of data owners.
  • methods: Both the early layers and the last layers are placed on the user side to avoid leaking labels, and a resource allocation algorithm named LSCRA is proposed to optimize the performance of edge networks.
  • results: Experiments show that LSCRA effectively optimizes edge-network performance and that U-shaped PSL achieves performance similar to other SL baselines while preserving label privacy.
    Abstract Split learning (SL) has emerged as a promising approach for model training without revealing the raw data samples from the data owners. However, traditional SL inevitably leaks label privacy because the tail model (with the last layers) must be placed on the server. To overcome this limitation, one promising solution is to utilize a U-shaped architecture to leave both the early layers and the last layers on the user side. In this paper, we develop a novel parallel U-shaped split learning scheme and devise an optimal resource allocation scheme to improve the performance of edge networks. In the proposed framework, multiple users communicate with an edge server for SL. We analyze the end-to-end delay of each client during the training process and design an efficient resource allocation algorithm, called LSCRA, which finds the optimal computing resource allocation and split layers. Our experimental results show the effectiveness of LSCRA and that U-shaped PSL can achieve a similar performance with other SL baselines while preserving label privacy. Index Terms: U-shaped network, split learning, label privacy, resource allocation, 5G/6G edge networks.
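The U-shaped split itself is easy to picture in code: the client keeps the early layers and the final, label-touching layers, while the server runs only the middle block, so neither raw inputs nor labels leave the client. The sketch below runs all three pieces in one process for clarity; layer sizes and the split points are assumptions (the paper additionally optimizes where to split):

```python
import torch
import torch.nn as nn

client_head = nn.Sequential(nn.Linear(784, 256), nn.ReLU())   # on device
server_body = nn.Sequential(nn.Linear(256, 256), nn.ReLU())   # on edge server
client_tail = nn.Linear(256, 10)                              # on device

x, y = torch.randn(8, 784), torch.randint(0, 10, (8,))
smashed = client_head(x)                 # only activations are sent to the server
server_out = server_body(smashed)        # server computes the middle block
logits = client_tail(server_out)         # labels never leave the client
loss = nn.functional.cross_entropy(logits, y)
loss.backward()                          # gradients flow back across both cuts
print(loss.item())
```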

Dual Gauss-Newton Directions for Deep Learning

  • paper_url: http://arxiv.org/abs/2308.08886
  • repo_url: None
  • paper_authors: Vincent Roulet, Mathieu Blondel
  • for: This paper studies how to exploit the structure of deep learning objectives, namely the composition of a convex loss function and a nonlinear network, to derive better direction oracles than stochastic gradients.
  • methods: The direction oracles are computed via their dual formulation, yielding both computational benefits and new insights.
  • results: Experiments show that the resulting direction oracles can serve as a drop-in replacement for stochastic gradients in existing optimization algorithms, with favorable empirical performance.
    Abstract Inspired by Gauss-Newton-like methods, we study the benefit of leveraging the structure of deep learning objectives, namely, the composition of a convex loss function and of a nonlinear network, in order to derive better direction oracles than stochastic gradients, based on the idea of partial linearization. In a departure from previous works, we propose to compute such direction oracles via their dual formulation, leading to both computational benefits and new insights. We demonstrate that the resulting oracles define descent directions that can be used as a drop-in replacement for stochastic gradients, in existing optimization algorithms. We empirically study the advantage of using the dual formulation as well as the computational trade-offs involved in the computation of such oracles.
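For orientation, the sketch below computes the textbook (primal, damped) Gauss-Newton direction for a least-squares composite by solving $(J^\top J + \mu I)\,d = -J^\top r$. The paper's contribution is to obtain such directions for general convex losses via a dual formulation, which this toy example does not attempt; the model and data are made up:

```python
import numpy as np

def gauss_newton_direction(residual, jacobian, w, damping=1e-6):
    """Damped Gauss-Newton direction for f(w) = 0.5 * ||r(w)||^2."""
    r, J = residual(w), jacobian(w)
    H = J.T @ J + damping * np.eye(w.size)   # GN approximation of the Hessian
    return np.linalg.solve(H, -J.T @ r)

# Toy nonlinear model: r_i = exp(w0 * x_i) + w1 - y_i.
x = np.linspace(0, 1, 20)
y = np.exp(0.5 * x) + 2.0
residual = lambda w: np.exp(w[0] * x) + w[1] - y
jacobian = lambda w: np.stack([x * np.exp(w[0] * x), np.ones_like(x)], axis=1)

w = np.array([0.0, 0.0])
for _ in range(20):
    w = w + gauss_newton_direction(residual, jacobian, w)
print(w)  # approaches [0.5, 2.0]
```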

Feature Enforcing PINN (FE-PINN): A Framework to Learn the Underlying-Physics Features Before Target Task

  • paper_url: http://arxiv.org/abs/2308.08873
  • repo_url: None
  • paper_authors: Mahyar Jahaninasab, Mohamad Ali Bijarchi
  • for: This paper addresses solving partial differential equations (PDEs) with low computational cost and fast training.
  • methods: It proposes Feature Enforcing Physics Informed Neural Network (FE-PINN), a data-free framework that learns the underlying patterns of a problem at low computational cost. FE-PINN performs a sequence of sub-tasks, first learning useful features of the underlying physics and then training on the target task to refine the calculations.
  • results: FE-PINN achieves 15x, 2x, and 5x speed-ups on three benchmark problems, reaches loss values near 1e-5 that are difficult for vanilla PINN, and converges smoothly enough to permit higher learning rates, making it a fast and accurate tool for a wide range of PDE problems.
    Abstract In this work, a new data-free framework called Feature Enforcing Physics Informed Neural Network (FE-PINN) is introduced. This framework can learn the underlying pattern of any problem at low computational cost before the main training loop. The loss function of a vanilla PINN is imbalanced because it combines two terms: the partial differential residual and the boundary-condition mean squared error. FE-PINN resolves this challenge with just one minute of training instead of hours of time-consuming loss-function hyperparameter tuning. FE-PINN accomplishes this by performing a sequence of sub-tasks. The first sub-task learns useful features about the underlying physics. Then, the model trains on the target task to refine the calculations. FE-PINN is applied to three benchmarks: flow over a cylinder, 2D heat conduction, and an inverse problem of calculating inlet velocity, solving each case with 15x, 2x, and 5x speed-ups, respectively. Another advantage of FE-PINN is that systematically reaching lower loss values is possible. In this study, it was possible to reach a loss value near 1e-5, which is challenging for vanilla PINN. FE-PINN also has a smooth convergence process, which allows for higher learning rates than vanilla PINN. This framework can be used as a fast, accurate tool for solving a wide range of Partial Differential Equations (PDEs) across various fields.
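A minimal two-phase sketch in the spirit of FE-PINN, on the 1-D Poisson problem u''(x) = -pi^2 sin(pi x) with u(0) = u(1) = 0: phase one fits only a cheap boundary sub-task, phase two adds the PDE residual. The sub-task design, step counts, and network size are assumptions, not the paper's setup:

```python
import math
import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
xb = torch.tensor([[0.0], [1.0]])             # boundary points, u = 0 there
xc = torch.rand(128, 1, requires_grad=True)   # collocation points

def pde_residual(x):
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    return d2u + math.pi**2 * torch.sin(math.pi * x)

for phase, steps in [("feature", 500), ("target", 2000)]:
    for _ in range(steps):
        loss = net(xb).pow(2).mean()          # cheap boundary sub-task
        if phase == "target":                 # target task adds the PDE residual
            loss = loss + pde_residual(xc).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

print(net(torch.tensor([[0.5]])).item())      # should approach sin(pi/2) = 1
```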

Towards Semi-supervised Learning with Non-random Missing Labels

  • paper_url: http://arxiv.org/abs/2308.08872
  • repo_url: https://github.com/njuyued/prg4ssl-mnar
  • paper_authors: Yue Duan, Zhen Zhao, Lei Qi, Luping Zhou, Lei Wang, Yinghuan Shi
  • for: To tackle the missing-label problem and make effective use of unlabeled data.
  • methods: A dynamic graph is built via a Markov random walk over the class tracking matrix; class-transition information provides class-level guidance that keeps the model unbiased against skewed class distributions.
  • results: PRG performs strongly across a variety of realistic MNAR scenarios, improving pseudo-label quality over recent SSL methods combined with bias-removal solutions.
    Abstract Semi-supervised learning (SSL) tackles the label missing problem by enabling the effective usage of unlabeled data. While existing SSL methods focus on the traditional setting, a practical and challenging scenario called label Missing Not At Random (MNAR) is usually ignored. In MNAR, the labeled and unlabeled data fall into different class distributions resulting in biased label imputation, which deteriorates the performance of SSL models. In this work, class transition tracking based Pseudo-Rectifying Guidance (PRG) is devised for MNAR. We explore the class-level guidance information obtained by the Markov random walk, which is modeled on a dynamically created graph built over the class tracking matrix. PRG unifies the historical information of class distribution and class transitions caused by the pseudo-rectifying procedure to maintain the model's unbiased enthusiasm towards assigning pseudo-labels to all classes, so as the quality of pseudo-labels on both popular classes and rare classes in MNAR could be improved. Finally, we show the superior performance of PRG across a variety of MNAR scenarios, outperforming the latest SSL approaches combining bias removal solutions by a large margin. Code and model weights are available at https://github.com/NJUyued/PRG4SSL-MNAR.

Model-Free Algorithm with Improved Sample Efficiency for Zero-Sum Markov Games

  • paper_url: http://arxiv.org/abs/2308.08858
  • repo_url: None
  • paper_authors: Songtao Feng, Ming Yin, Yu-Xiang Wang, Jing Yang, Yingbin Liang
  • for: This paper studies the two-player zero-sum Markov game problem in multi-agent reinforcement learning (RL).
  • methods: It proposes a model-free, stage-based Q-learning algorithm and proves that it achieves the sample complexity $O(H^3SAB/\epsilon^2)$, which is optimal in its dependence on the horizon $H$ and the number of states $S$.
  • results: By introducing a variance reduction technique based on reference-advantage decomposition, the model-free algorithm attains, for the first time, the same optimal sample complexity as model-based algorithms, improving sample efficiency in Markov games.
    Abstract The problem of two-player zero-sum Markov games has recently attracted increasing interests in theoretical studies of multi-agent reinforcement learning (RL). In particular, for finite-horizon episodic Markov decision processes (MDPs), it has been shown that model-based algorithms can find an $\epsilon$-optimal Nash Equilibrium (NE) with the sample complexity of $O(H^3SAB/\epsilon^2)$, which is optimal in the dependence of the horizon $H$ and the number of states $S$ (where $A$ and $B$ denote the number of actions of the two players, respectively). However, none of the existing model-free algorithms can achieve such an optimality. In this work, we propose a model-free stage-based Q-learning algorithm and show that it achieves the same sample complexity as the best model-based algorithm, and hence for the first time demonstrate that model-free algorithms can enjoy the same optimality in the $H$ dependence as model-based algorithms. The main improvement of the dependency on $H$ arises by leveraging the popular variance reduction technique based on the reference-advantage decomposition previously used only for single-agent RL. However, such a technique relies on a critical monotonicity property of the value function, which does not hold in Markov games due to the update of the policy via the coarse correlated equilibrium (CCE) oracle. Thus, to extend such a technique to Markov games, our algorithm features a key novel design of updating the reference value functions as the pair of optimistic and pessimistic value functions whose value difference is the smallest in the history in order to achieve the desired improvement in the sample efficiency.

Bag of Tricks for Long-Tailed Multi-Label Classification on Chest X-Rays

  • paper_url: http://arxiv.org/abs/2308.08853
  • repo_url: None
  • paper_authors: Feng Hong, Tianjie Dai, Jiangchao Yao, Ya Zhang, Yanfeng Wang
  • for: This work aims to improve the diagnostic value of chest X-rays (CXRs), particularly in the face of long-tailed and multi-label challenges.
  • methods: It integrates several advanced designs covering data augmentation, feature extraction, classifier design, loss-function reweighting, and exogenous data replenishment.
  • results: The solution achieved 0.349 mAP on the test set, ranking fifth in the ICCV CVAMD 2023 CXR-LT competition.
    Abstract Clinical classification of chest radiography is particularly challenging for standard machine learning algorithms due to its inherent long-tailed and multi-label nature. However, few attempts take into account the coupled challenges posed by both class imbalance and label co-occurrence, which limits their value for improving diagnosis on chest X-rays (CXRs) in real-world scenarios. Besides, with the prevalence of pretraining techniques, how to incorporate these new paradigms into the current framework lacks systematic study. This technical report presents a brief description of our solution in the ICCV CVAMD 2023 CXR-LT Competition. We empirically explored the effectiveness of integrating several advanced designs for CXR diagnosis, covering data augmentation, feature extractors, classifier design, loss-function reweighting, exogenous data replenishment, and more. In addition, we improve the performance through simple test-time data augmentation and ensembling. Our framework finally achieves 0.349 mAP on the competition test set, ranking in the top five.
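The test-time augmentation and ensembling step mentioned above is straightforward: average sigmoid outputs over simple augmentations and over several trained models. Which augmentations, how many models, and the tiny stand-in classifiers below are assumptions:

```python
import torch

def tta_ensemble_predict(models, images):
    """Average multi-label probabilities over an identity view, a horizontal
    flip, and every model in the ensemble."""
    views = [images, torch.flip(images, dims=[-1])]
    probs = [torch.sigmoid(m(v)) for m in models for v in views]
    return torch.stack(probs).mean(dim=0)

# Toy stand-ins for trained CXR classifiers over 26 findings (assumed count).
models = [torch.nn.Sequential(torch.nn.Flatten(),
                              torch.nn.Linear(3 * 32 * 32, 26))
          for _ in range(3)]
images = torch.randn(4, 3, 32, 32)
print(tta_ensemble_predict(models, images).shape)  # torch.Size([4, 26])
```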

Learning the hub graphical Lasso model with the structured sparsity via an efficient algorithm

  • paper_url: http://arxiv.org/abs/2308.08852
  • repo_url: None
  • paper_authors: Chengjing Wang, Peipei Tang, Wenling He, Meixia Lin
  • for: The goal is an efficient estimation algorithm for hub graphical models that remains tractable when the data dimension is large.
  • methods: A two-phase algorithm: a dual alternating direction method of multipliers (ADMM) generates a good initial point, which then warm-starts a semismooth Newton (SSN) based augmented Lagrangian method (ALM) for refinement.
  • results: Experiments show that the algorithm computes the graphical models efficiently, saving more than 70% of execution time on some high-dimensional tasks while still delivering high-quality estimates.
    Abstract Graphical models have exhibited their performance in numerous tasks ranging from biological analysis to recommender systems. However, graphical models with hub nodes are computationally difficult to fit, particularly when the dimension of the data is large. To efficiently estimate the hub graphical models, we introduce a two-phase algorithm. The proposed algorithm first generates a good initial point via a dual alternating direction method of multipliers (ADMM), and then warm starts a semismooth Newton (SSN) based augmented Lagrangian method (ALM) to compute a solution that is accurate enough for practical tasks. The sparsity structure of the generalized Jacobian ensures that the algorithm can obtain a nice solution very efficiently. Comprehensive experiments on both synthetic data and real data show that it obviously outperforms the existing state-of-the-art algorithms. In particular, in some high dimensional tasks, it can save more than 70\% of the execution time, meanwhile still achieves a high-quality estimation.
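For context, here is the vanilla ADMM for the plain (non-hub) graphical lasso, the kind of alternating iteration the paper warm-starts from before its semismooth Newton refinement. The hub-inducing structured penalty, the dual formulation, and the SSN/ALM phase are all omitted; this primal sketch only shows the alternating structure:

```python
import numpy as np

def graphical_lasso_admm(S, lam, rho=1.0, n_iter=200):
    """ADMM for min_X -logdet(X) + tr(S X) + lam * ||X||_1 (off-diagonal)."""
    p = S.shape[0]
    Z, U = np.eye(p), np.zeros((p, p))
    for _ in range(n_iter):
        # X-update has a closed form via eigendecomposition of rho*(Z-U) - S.
        d, Q = np.linalg.eigh(rho * (Z - U) - S)
        x = (d + np.sqrt(d**2 + 4 * rho)) / (2 * rho)
        X = (Q * x) @ Q.T
        # Z-update: soft-threshold the off-diagonal entries.
        A = X + U
        Z = np.sign(A) * np.maximum(np.abs(A) - lam / rho, 0.0)
        np.fill_diagonal(Z, np.diag(A))
        # Dual ascent.
        U += X - Z
    return Z

rng = np.random.default_rng(0)
S = np.cov(rng.standard_normal((200, 10)), rowvar=False)
print(np.round(graphical_lasso_admm(S, lam=0.2), 2))
```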

Machine Learning-Assisted Discovery of Novel Reactor Designs via CFD-Coupled Multi-fidelity Bayesian Optimisation

  • paper_url: http://arxiv.org/abs/2308.08841
  • repo_url: None
  • paper_authors: Tom Savage, Nausheen Basha, Jonathan McDonough, Omar K Matar, Ehecatl Antonio del Rio Chanona
  • for: The paper aims to establish a framework for the next generation of reactors by leveraging additive manufacturing and intelligent design to improve the performance and sustainability of future chemical processes.
  • methods: The authors propose two novel coiled-tube parameterisations that enable the variation of cross-section and coil path, and use multi-fidelity Bayesian optimisation to optimise the designs. They also employ parameterised meshing and simulation to reduce computational costs.
  • results: The authors identify key characteristics of optimal reactor designs and experimentally validate two novel geometries that they 3D print, demonstrating the potential of the proposed framework for improving the performance and sustainability of future chemical processes.
    Abstract Additive manufacturing has enabled the production of more advanced reactor geometries, resulting in the potential for significantly larger and more complex design spaces. Identifying and optimising promising configurations within broader design spaces presents a significant challenge for existing human-centric design approaches. As such, existing parameterisations of coiled-tube reactor geometries are low-dimensional with expensive optimisation limiting more complex solutions. Given algorithmic improvements and the onset of additive manufacturing, we propose two novel coiled-tube parameterisations enabling the variation of cross-section and coil path, resulting in a series of high dimensional, complex optimisation problems. To ensure tractable, non-local optimisation where gradients are not available, we apply multi-fidelity Bayesian optimisation. Our approach characterises multiple continuous fidelities and is coupled with parameterised meshing and simulation, enabling lower quality, but faster simulations to be exploited throughout optimisation. Through maximising the plug-flow performance, we identify key characteristics of optimal reactor designs, and extrapolate these to produce two novel geometries that we 3D print and experimentally validate. By demonstrating the design, optimisation, and manufacture of highly parameterised reactors, we seek to establish a framework for the next-generation of reactors, demonstrating that intelligent design coupled with new manufacturing processes can significantly improve the performance and sustainability of future chemical processes.

ICoNIK: Generating Respiratory-Resolved Abdominal MR Reconstructions Using Neural Implicit Representations in k-Space

  • paper_url: http://arxiv.org/abs/2308.08830
  • repo_url: None
  • paper_authors: Veronika Spieker, Wenqi Huang, Hannah Eichhorn, Jonathan Stelter, Kilian Weiss, Veronika A. Zimmer, Rickmer F. Braren, Dimitrios C. Karampinos, Kerstin Hammernik, Julia A. Schnabel
  • for: This paper proposes a neural-implicit approach to motion-resolved reconstruction in abdominal magnetic resonance imaging (MRI).
  • methods: A network learns a neural implicit representation directly in k-space (NIK), generating continuous signal values from measured sampling points and a data-derived respiratory navigator signal; an informed correction layer (ICo) additionally regularizes sparsely sampled regions using information from neighboring regions.
  • results: Experiments show that NIK and ICoNIK outperform standard motion-resolved reconstruction methods and offer a promising way to address motion artifacts in abdominal MRI.
    Abstract Motion-resolved reconstruction for abdominal magnetic resonance imaging (MRI) remains a challenge due to the trade-off between residual motion blurring caused by discretized motion states and undersampling artefacts. In this work, we propose to generate blurring-free motion-resolved abdominal reconstructions by learning a neural implicit representation directly in k-space (NIK). Using measured sampling points and a data-derived respiratory navigator signal, we train a network to generate continuous signal values. To aid the regularization of sparsely sampled regions, we introduce an additional informed correction layer (ICo), which leverages information from neighboring regions to correct NIK's prediction. Our proposed generative reconstruction methods, NIK and ICoNIK, outperform standard motion-resolved reconstruction techniques and provide a promising solution to address motion artefacts in abdominal MRI.
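The core object, a network that maps a k-space coordinate plus a navigator value to a complex signal sample, can be sketched as below. Fourier input encodings, the coil dimension, and the ICo correction layer are omitted, and all sizes are assumptions:

```python
import torch
import torch.nn as nn

class KSpaceNIK(nn.Module):
    """MLP from (kx, ky, navigator) to a complex k-space signal value,
    trained directly on measured sampling points."""
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))                # (real, imag)

    def forward(self, coords):
        out = self.mlp(coords)
        return torch.complex(out[..., 0], out[..., 1])

model = KSpaceNIK()
# Training pairs: measured sample locations + nav signal -> measured values.
coords = torch.rand(1024, 3) * 2 - 1
measured = torch.randn(1024, dtype=torch.complex64)
loss = (model(coords) - measured).abs().pow(2).mean()
loss.backward()
print(loss.item())
```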

Controlling Federated Learning for Covertness

  • paper_url: http://arxiv.org/abs/2308.08825
  • repo_url: None
  • paper_authors: Adit Jain, Vikram Krishnamurthy
  • for: A learner aims to minimize a function $f$ while hiding $\arg\min f$ from a malicious eavesdropper that observes its queries.
  • methods: Controlling the stochastic gradient algorithm for covert optimization is modeled as a Markov decision process, and a computationally efficient policy gradient algorithm is proposed.
  • results: Numerical results show that when the learner uses the optimal policy, the eavesdropper achieves only 52% validation accuracy with no information and 69% with a public dataset containing 10% positive samples, compared with 83% when the learner employs a greedy policy.
    Abstract A learner aims to minimize a function $f$ by repeatedly querying a distributed oracle that provides noisy gradient evaluations. At the same time, the learner seeks to hide $\arg\min f$ from a malicious eavesdropper that observes the learner's queries. This paper considers the problem of \textit{covert} or \textit{learner-private} optimization, where the learner has to dynamically choose between learning and obfuscation by exploiting the stochasticity. The problem of controlling the stochastic gradient algorithm for covert optimization is modeled as a Markov decision process, and we show that the dynamic programming operator has a supermodular structure implying that the optimal policy has a monotone threshold structure. A computationally efficient policy gradient algorithm is proposed to search for the optimal querying policy without knowledge of the transition probabilities. As a practical application, our methods are demonstrated on a hate speech classification task in a federated setting where an eavesdropper can use the optimal weights to generate toxic content, which is more easily misclassified. Numerical results show that when the learner uses the optimal policy, an eavesdropper can only achieve a validation accuracy of $52\%$ with no information and $69\%$ when it has a public dataset with 10\% positive samples compared to $83\%$ when the learner employs a greedy policy.

Mitigating Semantic Confusion from Hostile Neighborhood for Graph Active Learning

  • paper_url: http://arxiv.org/abs/2308.08823
  • repo_url: None
  • paper_authors: Tianmeng Yang, Min Zhou, Yujing Wang, Zhengjie Lin, Lujia Pan, Bin Cui, Yunhai Tong
  • for: To improve graph neural network (GNN) performance by selecting the most informative nodes in a graph for annotation.
  • methods: It proposes SAG, a Semantic-aware Active learning framework for Graphs, which evaluates node influence through pairwise semantic similarities and dissimilarities among nodes, and designs a novel prototype-based criterion and query policy to maintain diversity and class balance among the selected nodes.
  • results: Extensive experiments on public benchmark graphs and a real-world financial dataset show that SAG significantly improves node classification performance and consistently outperforms previous methods.
    Abstract Graph Active Learning (GAL), which aims to find the most informative nodes in graphs for annotation to maximize the Graph Neural Networks (GNNs) performance, has attracted many research efforts but remains non-trivial challenges. One major challenge is that existing GAL strategies may introduce semantic confusion to the selected training set, particularly when graphs are noisy. Specifically, most existing methods assume all aggregating features to be helpful, ignoring the semantically negative effect between inter-class edges under the message-passing mechanism. In this work, we present Semantic-aware Active learning framework for Graphs (SAG) to mitigate the semantic confusion problem. Pairwise similarities and dissimilarities of nodes with semantic features are introduced to jointly evaluate the node influence. A new prototype-based criterion and query policy are also designed to maintain diversity and class balance of the selected nodes, respectively. Extensive experiments on the public benchmark graphs and a real-world financial dataset demonstrate that SAG significantly improves node classification performances and consistently outperforms previous methods. Moreover, comprehensive analysis and ablation study also verify the effectiveness of the proposed framework.

A Fusion of Variational Distribution Priors and Saliency Map Replay for Continual 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2308.08812
  • repo_url: None
  • paper_authors: Sanchar Palit, Sandika Biswas
  • for: This work targets predicting 3D object shapes from single-view images.
  • methods: The model is designed with Variational Priors so that previously seen classes can still be reconstructed reasonably even after training on new classes.
  • results: Experiments show competitive results compared with established methods, both quantitatively and qualitatively.
    Abstract Single-image 3D reconstruction is a research challenge focused on predicting 3D object shapes from single-view images. This task requires significant data acquisition to predict both visible and occluded portions of the shape. Furthermore, learning-based methods face the difficulty of creating a comprehensive training dataset for all possible classes. To this end, we propose a continual learning-based 3D reconstruction method where our goal is to design a model using Variational Priors that can still reconstruct the previously seen classes reasonably even after training on new classes. Variational Priors represent abstract shapes and combat forgetting, whereas saliency maps preserve object attributes with less memory usage. This is vital due to resource constraints in storing extensive training data. Additionally, we introduce saliency map-based experience replay to capture global and distinct object features. Thorough experiments show competitive results compared to established methods, both quantitatively and qualitatively.

Label Shift Adapter for Test-Time Adaptation under Covariate and Label Shifts

  • paper_url: http://arxiv.org/abs/2308.08810
  • repo_url: None
  • paper_authors: Sunghyun Park, Seunghan Yang, Jaegul Choo, Sungrack Yun
  • for: This work proposes a test-time adaptation approach that handles label distribution shifts in the target domain.
  • methods: A label shift adapter is introduced on top of existing TTA methods to deal with label shifts during the adaptation process.
  • results: Combining the label shift adapter with TTA methods enables effective batch-by-batch adaptation at test time and improves model performance under joint label and covariate shifts.
    Abstract Test-time adaptation (TTA) aims to adapt a pre-trained model to the target domain in a batch-by-batch manner during inference. While label distributions often exhibit imbalances in real-world scenarios, most previous TTA approaches typically assume that both source and target domain datasets have balanced label distribution. Due to the fact that certain classes appear more frequently in certain domains (e.g., buildings in cities, trees in forests), it is natural that the label distribution shifts as the domain changes. However, we discover that the majority of existing TTA methods fail to address the coexistence of covariate and label shifts. To tackle this challenge, we propose a novel label shift adapter that can be incorporated into existing TTA approaches to deal with label shifts during the TTA process effectively. Specifically, we estimate the label distribution of the target domain to feed it into the label shift adapter. Subsequently, the label shift adapter produces optimal parameters for the target label distribution. By predicting only the parameters for a part of the pre-trained source model, our approach is computationally efficient and can be easily applied, regardless of the model architectures. Through extensive experiments, we demonstrate that integrating our strategy with TTA approaches leads to substantial performance improvements under the joint presence of label and covariate shifts.
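A useful reference point for what a label shift correction does: given source and (estimated) target label priors, classical prior correction simply shifts the logits by their log-ratio. The paper instead learns an adapter that outputs parameters for part of the source model; the closed-form version below, with made-up priors, only illustrates the underlying idea:

```python
import numpy as np

def label_shift_adjust(logits, p_source, p_target, eps=1e-12):
    """Shift logits by log(p_target / p_source) before the argmax."""
    return logits + np.log(p_target + eps) - np.log(p_source + eps)

logits = np.array([[2.0, 0.5, 0.1],
                   [0.3, 1.2, 1.1]])
p_source = np.array([0.6, 0.3, 0.1])   # training label prior
p_target = np.array([0.1, 0.3, 0.6])   # estimated (imbalanced) test prior
print(label_shift_adjust(logits, p_source, p_target).argmax(axis=1))
```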

  • paper_url: http://arxiv.org/abs/2308.08799
  • repo_url: https://github.com/jingxiaoyi/pare
  • paper_authors: Jiazheng Jing, Yinan Zhang, Xin Zhou, Zhiqi Shen
  • for: This work provides a non-personalized recommendation method that adapts to users' choice habits and shifts in item popularity.
  • methods: The method comprises four modules, popularity history, temporal impact, periodic impact, and side information, each focused on predicting item popularity; an attention layer then fuses the outputs of the four modules.
  • results: Experiments show that PARE performs on par with or better than sophisticated state-of-the-art recommendation methods, and that combining PARE with existing methods further improves overall performance.
    Abstract Recommender systems have been gaining increasing research attention over the years. Most existing recommendation methods focus on capturing users' personalized preferences through historical user-item interactions, which may potentially violate user privacy. Additionally, these approaches often overlook the significance of the temporal fluctuation in item popularity that can sway users' decision-making. To bridge this gap, we propose Popularity-Aware Recommender (PARE), which makes non-personalized recommendations by predicting the items that will attain the highest popularity. PARE consists of four modules, each focusing on a different aspect: popularity history, temporal impact, periodic impact, and side information. Finally, an attention layer is leveraged to fuse the outputs of four modules. To our knowledge, this is the first work to explicitly model item popularity in recommendation systems. Extensive experiments show that PARE performs on par or even better than sophisticated state-of-the-art recommendation methods. Since PARE prioritizes item popularity over personalized user preferences, it can enhance existing recommendation methods as a complementary component. Our experiments demonstrate that integrating PARE with existing recommendation methods significantly surpasses the performance of standalone models, highlighting PARE's potential as a complement to existing recommendation methods. Furthermore, the simplicity of PARE makes it immensely practical for industrial applications and a valuable baseline for future research.

Joint Local Relational Augmentation and Global Nash Equilibrium for Federated Learning with Non-IID Data

  • paper_url: http://arxiv.org/abs/2308.11646
  • repo_url: None
  • paper_authors: Xinting Liao, Chaochao Chen, Weiming Liu, Pengyang Zhou, Huabin Zhu, Shuheng Shen, Weiqiang Wang, Mengling Hu, Yanchao Tan, Xiaolin Zheng
  • for: To improve the effectiveness of Federated Learning (FL) in real-world applications by addressing modeling issues under non-independent and identically distributed (non-IID) data.
  • methods: The proposed FedRANE comprises two main modules, local relational augmentation (LRA) and global Nash equilibrium (GNE), which resolve intra- and inter-client inconsistency simultaneously.
  • results: Extensive experiments on four benchmark datasets demonstrate that FedRANE improves FL performance on non-IID data.
    Abstract Federated learning (FL) is a distributed machine learning paradigm that needs collaboration between a server and a series of clients with decentralized data. To make FL effective in real-world applications, existing work devotes to improving the modeling of decentralized data with non-independent and identical distributions (non-IID). In non-IID settings, there are intra-client inconsistency that comes from the imbalanced data modeling, and inter-client inconsistency among heterogeneous client distributions, which not only hinders sufficient representation of the minority data, but also brings discrepant model deviations. However, previous work overlooks to tackle the above two coupling inconsistencies together. In this work, we propose FedRANE, which consists of two main modules, i.e., local relational augmentation (LRA) and global Nash equilibrium (GNE), to resolve intra- and inter-client inconsistency simultaneously. Specifically, in each client, LRA mines the similarity relations among different data samples and enhances the minority sample representations with their neighbors using attentive message passing. In server, GNE reaches an agreement among inconsistent and discrepant model deviations from clients to server, which encourages the global model to update in the direction of global optimum without breaking down the clients optimization toward their local optimums. We conduct extensive experiments on four benchmark datasets to show the superiority of FedRANE in enhancing the performance of FL with non-IID data.

Bayesian polynomial neural networks and polynomial neural ordinary differential equations

  • paper_url: http://arxiv.org/abs/2308.10892
  • repo_url: None
  • paper_authors: Colby Fronk, Jaewoong Yun, Prashant Singh, Linda Petzold
  • for: The methods target equation recovery for many problems in science and engineering.
  • methods: Symbolic regression with polynomial neural networks and polynomial neural ordinary differential equations (ODEs) performs the equation recovery, but these approaches provide only point estimates.
  • results: To handle noisy data, the Laplace approximation, MCMC sampling methods, and variational inference are developed and validated, with the Laplace approximation found to be the best method for this class of problems.
    Abstract Symbolic regression with polynomial neural networks and polynomial neural ordinary differential equations (ODEs) are two recent and powerful approaches for equation recovery of many science and engineering problems. However, these methods provide point estimates for the model parameters and are currently unable to accommodate noisy data. We address this challenge by developing and validating the following Bayesian inference methods: the Laplace approximation, Markov Chain Monte Carlo (MCMC) sampling methods, and variational inference. We have found the Laplace approximation to be the best method for this class of problems. Our work can be easily extended to the broader class of symbolic neural networks to which the polynomial neural network belongs.
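To see the Laplace approximation mechanics in the simplest setting, the sketch below applies it to a polynomial regression model with Gaussian prior and noise, where it is exact: find the MAP estimate, then take the inverse Hessian of the negative log posterior as the posterior covariance. The paper applies the same idea to polynomial neural networks and neural ODEs; the noise and prior variances below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.1 * rng.standard_normal(x.size)

degree, sigma2, tau2 = 2, 0.1**2, 10.0**2      # noise / prior variances (assumed)
Phi = np.vander(x, degree + 1, increasing=True)  # features [1, x, x^2]

# Hessian of the negative log posterior; its inverse is the Laplace covariance.
H = Phi.T @ Phi / sigma2 + np.eye(degree + 1) / tau2
w_map = np.linalg.solve(H, Phi.T @ y / sigma2)   # MAP coefficients
cov = np.linalg.inv(H)

print("MAP coefficients:", np.round(w_map, 3))
print("Posterior std:   ", np.round(np.sqrt(np.diag(cov)), 3))
```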

Tipping Point Forecasting in Non-Stationary Dynamics on Function Spaces

  • paper_url: http://arxiv.org/abs/2308.08794
  • repo_url: None
  • paper_authors: Miguel Liu-Schiaffini, Clare E. Singer, Nikola Kovachki, Tapio Schneider, Kamyar Azizzadenesheli, Anima Anandkumar
  • for: This work aims to forecast abrupt changes (tipping points) in non-stationary dynamical systems, such as the drastic decreases in low cloud cover predicted under climate change.
  • methods: A novel recurrent neural operator (RNO) learns mappings between function spaces, and an uncertainty-based conformal approach detects future tipping points by monitoring deviations from physics constraints.
  • results: Experiments show that even partial or approximate physics constraints suffice to accurately forecast future tipping points, along with a rigorous measure of uncertainty.
    Abstract Tipping points are abrupt, drastic, and often irreversible changes in the evolution of non-stationary and chaotic dynamical systems. For instance, increased greenhouse gas concentrations are predicted to lead to drastic decreases in low cloud cover, referred to as a climatological tipping point. In this paper, we learn the evolution of such non-stationary dynamical systems using a novel recurrent neural operator (RNO), which learns mappings between function spaces. After training RNO on only the pre-tipping dynamics, we employ it to detect future tipping points using an uncertainty-based approach. In particular, we propose a conformal prediction framework to forecast tipping points by monitoring deviations from physics constraints (such as conserved quantities and partial differential equations), enabling forecasting of these abrupt changes along with a rigorous measure of uncertainty. We illustrate our proposed methodology on non-stationary ordinary and partial differential equations, such as the Lorenz-63 and Kuramoto-Sivashinsky equations. We also apply our methods to forecast a climate tipping point in stratocumulus cloud cover. In our experiments, we demonstrate that even partial or approximate physics constraints can be used to accurately forecast future tipping points.
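The conformal alarm can be sketched with a toy conserved quantity: calibrate a quantile threshold on constraint deviations observed under pre-tipping dynamics, then flag states whose deviation exceeds it. The score function, the miscoverage level alpha, and the synthetic data are assumptions; the paper couples this with a learned recurrent neural operator:

```python
import numpy as np

def conformal_threshold(calibration_scores, alpha=0.1):
    """Split-conformal quantile with the finite-sample correction."""
    n = len(calibration_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(calibration_scores)[min(k, n) - 1]

conserved = lambda state: state.sum()        # toy physics invariant

rng = np.random.default_rng(1)
baseline = conserved(rng.standard_normal(100))
# Calibration: deviations observed during normal (pre-tipping) dynamics.
cal = np.abs([conserved(rng.standard_normal(100)) - baseline
              for _ in range(500)])
tau = conformal_threshold(cal, alpha=0.1)

new_state = rng.standard_normal(100) + 0.5   # drifted regime
score = abs(conserved(new_state) - baseline)
print(f"score={score:.2f}, threshold={tau:.2f}, tipping alarm={score > tau}")
```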

Federated Reinforcement Learning for Electric Vehicles Charging Control on Distribution Networks

  • paper_url: http://arxiv.org/abs/2308.08792
  • repo_url: None
  • paper_authors: Junkai Qian, Yuning Jiang, Xin Liu, Qing Wang, Ting Wang, Yuanming Shi, Wei Chen
  • for: This paper focuses on developing a novel approach for EV charging control that considers the natural power flow of EV charging/discharging in the distribution network as well as driver privacy.
  • methods: The proposed approach combines multi-EV charging/discharging with a radial distribution network (RDN) operating under optimal power flow (OPF) to distribute power flow in real time. A federated deep reinforcement learning algorithm named FedSAC is used to learn the optimal EV charging control strategy.
  • results: The proposed approach effectively balances V2G profits, RDN load, and driver anxiety. Comprehensive simulation results demonstrate the superiority of the algorithm in terms of the diversity of the charging control strategy, power fluctuations on the RDN, convergence efficiency, and generalization ability.
    Abstract With the growing popularity of electric vehicles (EVs), maintaining power grid stability has become a significant challenge. To address this issue, EV charging control strategies have been developed to manage the switch between vehicle-to-grid (V2G) and grid-to-vehicle (G2V) modes for EVs. In this context, multi-agent deep reinforcement learning (MADRL) has proven its effectiveness in EV charging control. However, existing MADRL-based approaches fail to consider the natural power flow of EV charging/discharging in the distribution network and ignore driver privacy. To deal with these problems, this paper proposes a novel approach that combines multi-EV charging/discharging with a radial distribution network (RDN) operating under optimal power flow (OPF) to distribute power flow in real time. A mathematical model is developed to describe the RDN load. The EV charging control problem is formulated as a Markov Decision Process (MDP) to find an optimal charging control strategy that balances V2G profits, RDN load, and driver anxiety. To effectively learn the optimal EV charging control strategy, a federated deep reinforcement learning algorithm named FedSAC is further proposed. Comprehensive simulation results demonstrate the effectiveness and superiority of our proposed algorithm in terms of the diversity of the charging control strategy, the power fluctuations on RDN, the convergence efficiency, and the generalization ability.

APPFLx: Providing Privacy-Preserving Cross-Silo Federated Learning as a Service

  • paper_url: http://arxiv.org/abs/2308.08786
  • repo_url: None
  • paper_authors: Zilinghan Li, Shilan He, Pranshu Chaturvedi, Trung-Hieu Hoang, Minseok Ryu, E. A. Huerta, Volodymyr Kindratenko, Jordan Fuhrman, Maryellen Giger, Ryan Chard, Kibaek Kim, Ravi Madduri
  • for: This paper aims to provide a ready-to-use platform for cross-silo privacy-preserving federated learning (PPFL) as a service, making it easier and faster for domain experts and machine learning practitioners to collaboratively train robust and generalized machine learning models without sharing sensitive local data.
  • methods: The paper introduces APPFLx, a platform that employs Globus authentication, implements several synchronous and asynchronous federated learning algorithms, streamlines the experiment launch process, and enables tracking and visualizing the life cycle of federated learning experiments.
  • results: APPFLx delivers cross-silo PPFL as a service, letting domain experts and machine learning practitioners orchestrate and evaluate federated learning experiments under one platform.
    Abstract Cross-silo privacy-preserving federated learning (PPFL) is a powerful tool to collaboratively train robust and generalized machine learning (ML) models without sharing sensitive (e.g., healthcare of financial) local data. To ease and accelerate the adoption of PPFL, we introduce APPFLx, a ready-to-use platform that provides privacy-preserving cross-silo federated learning as a service. APPFLx employs Globus authentication to allow users to easily and securely invite trustworthy collaborators for PPFL, implements several synchronous and asynchronous FL algorithms, streamlines the FL experiment launch process, and enables tracking and visualizing the life cycle of FL experiments, allowing domain experts and ML practitioners to easily orchestrate and evaluate cross-silo FL under one platform. APPFLx is available online at https://appflx.link

Knowledge-inspired Subdomain Adaptation for Cross-Domain Knowledge Transfer

  • paper_url: http://arxiv.org/abs/2308.09724
  • repo_url: None
  • paper_authors: Liyue Chen, Linian Wang, Jinyu Xu, Shuai Chen, Weiqiang Wang, Wenbiao Zhao, Qiyu Li, Leye Wang
  • for: This paper proposes KISA, a Knowledge-Inspired Subdomain Adaptation framework for fine-grained domain adaptation across domains.
  • methods: It formulates a knowledge-inspired subdomain division problem and designs a knowledge fusion network to exploit diverse domain knowledge.
  • results: Experiments show that KISA achieves remarkable results on fraud detection and traffic demand prediction tasks.
    Abstract Most state-of-the-art deep domain adaptation techniques align source and target samples in a global fashion. That is, after alignment, each source sample is expected to become similar to any target sample. However, global alignment may not always be optimal or necessary in practice. For example, consider cross-domain fraud detection, where there are two types of transactions: credit and non-credit. Aligning credit and non-credit transactions separately may yield better performance than global alignment, as credit transactions are unlikely to exhibit patterns similar to non-credit transactions. To enable such fine-grained domain adaption, we propose a novel Knowledge-Inspired Subdomain Adaptation (KISA) framework. In particular, (1) We provide the theoretical insight that KISA minimizes the shared expected loss which is the premise for the success of domain adaptation methods. (2) We propose the knowledge-inspired subdomain division problem that plays a crucial role in fine-grained domain adaption. (3) We design a knowledge fusion network to exploit diverse domain knowledge. Extensive experiments demonstrate that KISA achieves remarkable results on fraud detection and traffic demand prediction tasks.

Environment Diversification with Multi-head Neural Network for Invariant Learning

  • paper_url: http://arxiv.org/abs/2308.08778
  • repo_url: https://github.com/joe0123/EDNIL
  • paper_authors: Bo-Wei Huang, Keng-Te Liao, Chang-Sheng Kao, Shou-De Lin
  • for: The goal is a learning framework that is insensitive to distributional shifts, improving the predictive robustness of neural networks.
  • methods: The framework uses a multi-head neural network to absorb data biases, without requiring prior knowledge about environments or strong assumptions about the pre-trained model.
  • results: Experiments show that models trained with this framework are empirically more robust against distributional shifts.
    Abstract Neural networks are often trained with empirical risk minimization; however, it has been shown that a shift between training and testing distributions can cause unpredictable performance degradation. On this issue, a research direction, invariant learning, has been proposed to extract invariant features insensitive to the distributional changes. This work proposes EDNIL, an invariant learning framework containing a multi-head neural network to absorb data biases. We show that this framework does not require prior knowledge about environments or strong assumptions about the pre-trained model. We also reveal that the proposed algorithm has theoretical connections to recent studies discussing properties of variant and invariant features. Finally, we demonstrate that models trained with EDNIL are empirically more robust against distributional shifts.

Differential Privacy, Linguistic Fairness, and Training Data Influence: Impossibility and Possibility Theorems for Multilingual Language Models

  • paper_url: http://arxiv.org/abs/2308.08774
  • repo_url: None
  • paper_authors: Phillip Rust, Anders Søgaard
  • for: Multilingual language models such as mBERT, XLM-R, and BLOOM aim at multilingual generalization or compression to facilitate transfer to a large number of (potentially unseen) languages.
  • methods: Ideally, such models should simultaneously be private, linguistically fair, and transparent, relating their predictions to training data.
  • results: Multilingual compression and linguistic fairness are shown to be compatible with differential privacy, but differential privacy conflicts with training data influence sparsity, an objective for transparency; experiments on two common NLP tasks evaluate these trade-offs under different privacy guarantees in more detail.
    Abstract Language models such as mBERT, XLM-R, and BLOOM aim to achieve multilingual generalization or compression to facilitate transfer to a large number of (potentially unseen) languages. However, these models should ideally also be private, linguistically fair, and transparent, by relating their predictions to training data. Can these requirements be simultaneously satisfied? We show that multilingual compression and linguistic fairness are compatible with differential privacy, but that differential privacy is at odds with training data influence sparsity, an objective for transparency. We further present a series of experiments on two common NLP tasks and evaluate multilingual compression and training data influence sparsity under different privacy guarantees, exploring these trade-offs in more detail. Our results suggest that we need to develop ways to jointly optimize for these objectives in order to find practical trade-offs.

Sensor Fusion by Spatial Encoding for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2308.10707
  • repo_url: None
  • paper_authors: Quoc-Vinh Lai-Dang, Jihui Lee, Bumgeun Park, Dongsoo Har
  • for: This work proposes a method for fusing camera and LiDAR data to improve the performance of perception systems in autonomous driving and robotics.
  • methods: The method applies Transformer modules at multiple resolutions so that local and global contextual relationships are fused effectively.
  • results: On two challenging benchmarks, the method outperforms previous approaches with significantly higher driving and infraction scores, achieving 8% and 19% improvements in driving score over TransFuser on the Longest6 and Town05 Long benchmarks, respectively.
    Abstract Sensor fusion is critical to perception systems for task domains such as autonomous driving and robotics. Recently, Transformers integrated with CNNs have demonstrated high performance in sensor fusion for various perception tasks. In this work, we introduce a method for fusing data from camera and LiDAR. By employing Transformer modules at multiple resolutions, the proposed method effectively combines local and global contextual relationships. Its performance is validated by extensive experiments on two adversarial benchmarks with lengthy routes and high-density traffic. The proposed method outperforms previous approaches on the most challenging benchmarks, achieving significantly higher driving and infraction scores. Compared with TransFuser, it achieves 8% and 19% improvements in driving score on the Longest6 and Town05 Long benchmarks, respectively.

Neurological Prognostication of Post-Cardiac-Arrest Coma Patients Using EEG Data: A Dynamic Survival Analysis Framework with Competing Risks

  • paper_url: http://arxiv.org/abs/2308.11645
  • repo_url: None
  • paper_authors: Xiaobin Shen, Jonathan Elmer, George H. Chen
  • for: Forecasting the neurological outcomes of high-risk comatose patients resuscitated from cardiac arrest, to support treatment decisions.
  • methods: A dynamic survival analysis framework built on EEG data makes predictions for a patient over time, updating them as more EEG data become available, even when training patients' EEG time series vary in length.
  • results: Benchmarking three competing-risks models on a real dataset of 922 patients shows that (1) the classical Fine and Gray model, using only a patient's static features and summary statistics from the latest hour of EEG data, is highly competitive with the recently developed Dynamic-DeepHit model; and (2) an ablation shows that modeling three competing risks is at least as accurate while learning more information than simpler setups.
    Abstract Patients resuscitated from cardiac arrest who enter a coma are at high risk of death. Forecasting neurological outcomes of these patients (the task of neurological prognostication) could help with treatment decisions. In this paper, we propose, to the best of our knowledge, the first dynamic framework for neurological prognostication of post-cardiac-arrest comatose patients using EEG data: our framework makes predictions for a patient over time as more EEG data become available, and different training patients' available EEG time series could vary in length. Predictions are phrased in terms of either time-to-event outcomes (time-to-awakening or time-to-death) or as the patient's probability of awakening or of dying across multiple time horizons. Our framework uses any dynamic survival analysis model that supports competing risks in the form of estimating patient-level cumulative incidence functions. We consider three competing risks as to what happens first to a patient: awakening, being withdrawn from life-sustaining therapies (and thus deterministically dying), or dying (by other causes). We demonstrate our framework by benchmarking three existing dynamic survival analysis models that support competing risks on a real dataset of 922 patients. Our main experimental findings are that: (1) the classical Fine and Gray model which only uses a patient's static features and summary statistics from the patient's latest hour's worth of EEG data is highly competitive, achieving accuracy scores as high as the recently developed Dynamic-DeepHit model that uses substantially more of the patient's EEG data; and (2) in an ablation study, we show that our choice of modeling three competing risks results in a model that is at least as accurate while learning more information than simpler models (using two competing risks or a standard survival analysis setup with no competing risks).
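
The patient-level cumulative incidence functions that these models estimate can be illustrated with the nonparametric Aalen-Johansen estimator below; the event encoding (0 = censored, 1..K = competing causes) and the toy call at the end are illustrative, not the paper's pipeline.

```python
import numpy as np

def cumulative_incidence(times, events, cause):
    """CIF for one cause; events: 0 = censored, 1..K = competing causes."""
    times, events = np.asarray(times, float), np.asarray(events, int)
    order = np.argsort(times)
    times, events = times[order], events[order]
    uniq = np.unique(times[events > 0])          # distinct event times
    surv, cif, out = 1.0, 0.0, []
    for t in uniq:
        at_risk = np.sum(times >= t)
        d_all = np.sum((times == t) & (events > 0))
        d_cause = np.sum((times == t) & (events == cause))
        cif += surv * d_cause / at_risk          # mass entering this cause
        surv *= 1.0 - d_all / at_risk            # overall KM survival
        out.append((t, cif))
    return out

# e.g. cumulative_incidence([2, 3, 3, 5, 8], [1, 2, 0, 1, 0], cause=1)
```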

Explainable AI for tool wear prediction in turning

  • paper_url: http://arxiv.org/abs/2308.08765
  • repo_url: None
  • paper_authors: Saleh Valizadeh Sotubadi, Rui Liu, Vinh Neguyen
  • for: This research develops an Explainable AI (XAI) framework that provides human-understandable explanations for tool wear prediction in turning.
  • methods: A random forest serves as the supervised machine learning (ML) classifier, trained for binary prediction using acceleration, acoustics, temperature, and spindle speed as input features.
  • results: Applying the Shapley criterion to all test sets identifies tool temperature as the most significant input feature for classifying available versus failed cutting tools, demonstrating that XAI can help machining operators diagnose and understand complex ML classifiers for tool wear prediction.
    Abstract This research aims to develop an Explainable Artificial Intelligence (XAI) framework to facilitate human-understandable solutions for tool wear prediction during turning. A random forest algorithm was used as the supervised Machine Learning (ML) classifier for training and binary classification, using acceleration, acoustics, temperature, and spindle speed during the orthogonal tube turning process as input features. The ML classifier was used to predict the condition of the tool after the cutting process, expressed as a binary class indicating whether the cutting tool was available or failed. After the training process, the Shapley criterion was used to explain the predictions of the trained ML classifier. Specifically, the significance of each input feature in the decision-making and classification was identified to explain the reasoning behind the ML classifier's predictions. After implementing the Shapley criterion on all testing datasets, tool temperature was identified as the most significant feature in classifying available versus failed cutting tools. Hence, this research demonstrates the capability of XAI to give machining operators the ability to diagnose and understand complex ML classifiers in the prediction of tool wear.
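
A minimal sketch of this pipeline, using the shap library's TreeExplainer on a random forest; the feature names match the paper's inputs, but the random training data and model settings are placeholders.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

features = ["acceleration", "acoustics", "temperature", "spindle_speed"]
X_train = np.random.rand(200, 4)                 # stand-in for sensor data
y_train = np.random.randint(0, 2, 200)           # 1 = tool failed
X_test = np.random.rand(50, 4)

clf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test)
# Some shap versions return one array per class; keep the positive class.
vals = shap_values[1] if isinstance(shap_values, list) else shap_values
imp = np.abs(vals).mean(axis=0)
imp = imp if imp.ndim == 1 else imp.mean(axis=-1)  # collapse class axis
for name, score in sorted(zip(features, imp), key=lambda p: -p[1]):
    print(f"{name}: {score:.4f}")                  # global feature importance
```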

Efficient Commercial Bank Customer Credit Risk Assessment Based on LightGBM and Feature Engineering

  • paper_url: http://arxiv.org/abs/2308.08762
  • repo_url: None
  • paper_authors: Yanjie Sun, Zhike Gong, Quan Shi, Lin Chen
  • for: This paper helps commercial banks control credit risk by building a LightGBM classifier that estimates the probability of customer credit default.
  • methods: Feature engineering techniques such as missing-value processing, encoding, and handling imbalanced samples substantially improve the machine learning results.
  • results: The main innovation is constructing new feature attributes from the original dataset, raising the classifier's accuracy to 0.734 and its AUC to 0.772, exceeding many classifiers built on the same dataset.
    Abstract Effective control of credit risk is a key link in the steady operation of commercial banks. This paper is based on the customer information dataset of a foreign commercial bank on Kaggle, and we use the LightGBM algorithm to build a classifier that helps the bank judge the possibility of customer credit default. The paper focuses on feature engineering, such as missing value processing, encoding, and imbalanced samples, which greatly improves the machine learning results. The main innovation of this paper is to construct new feature attributes on the basis of the original dataset, so that the accuracy of the classifier reaches 0.734 and the AUC reaches 0.772, exceeding many classifiers based on the same dataset. The model can provide a reference for commercial banks' credit granting, as well as feature processing ideas for other similar studies.
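
The sketch below shows the shape of such a pipeline: simple feature engineering followed by a LightGBM classifier evaluated by AUC. The file name, column names, and the engineered debt-to-income ratio are hypothetical stand-ins, not the paper's actual features.

```python
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("credit_customers.csv")        # hypothetical dataset
df["income"] = df["income"].fillna(df["income"].median())   # missing values
df["debt_to_income"] = df["debt"] / (df["income"] + 1e-6)   # new attribute
X = pd.get_dummies(df.drop(columns=["default"]))            # encoding
y = df["default"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2)
clf = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05,
                         class_weight="balanced")           # imbalance
clf.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```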

PMET: Precise Model Editing in a Transformer

  • paper_url: http://arxiv.org/abs/2308.08742
  • repo_url: https://github.com/xpq-tech/pmet
  • paper_authors: Xiaopeng Li, Shasha Li, Shezheng Song, Jing Yang, Jun Ma, Jie Yu
  • for: This paper aims to improve model editing so that new knowledge can be inserted into Large Language Models (LLMs) more precisely.
  • methods: PMET simultaneously optimizes the hidden states of the multi-head self-attention (MHSA) and feed-forward network (FFN) components, but uses only the optimized FFN hidden states to precisely update the FFN weights.
  • results: Experiments show that PMET achieves state-of-the-art performance on both the COUNTERFACT and zsRE datasets.
    Abstract Model editing techniques modify a minor proportion of knowledge in Large Language Models (LLMs) at a relatively low cost and have demonstrated notable success. Existing methods assume Transformer Layer (TL) hidden states are values of key-value memories of the Feed-Forward Network (FFN). They usually optimize the TL hidden states to memorize target knowledge and use them to update the weights of the FFN in LLMs. However, the information flow of TL hidden states comes from three parts: Multi-Head Self-Attention (MHSA), FFN, and residual connections. Existing methods neglect the fact that the TL hidden states contain information not specifically required for the FFN. Consequently, the performance of model editing decreases. To achieve more precise model editing, we analyze the hidden states of MHSA and FFN, finding that MHSA encodes certain general knowledge extraction patterns. This implies that MHSA weights do not require updating when new knowledge is introduced. Based on the above findings, we introduce PMET, which simultaneously optimizes Transformer Component (TC, namely MHSA and FFN) hidden states, while only using the optimized TC hidden states of the FFN to precisely update the FFN weights. Our experiments demonstrate that PMET exhibits state-of-the-art performance on both the COUNTERFACT and zsRE datasets. Our ablation experiments substantiate the effectiveness of our enhancements, further reinforcing the finding that the MHSA encodes certain general knowledge extraction patterns and indicating that it stores a small amount of factual knowledge. Our code is available at https://github.com/xpq-tech/PMET.git.

ReProHRL: Towards Multi-Goal Navigation in the Real World using Hierarchical Agents

  • paper_url: http://arxiv.org/abs/2308.08737
  • repo_url: None
  • paper_authors: Tejaswini Manjunath, Mozhgan Navardi, Prakhar Dixit, Bharat Prakash, Tinoosh Mohsenin
  • for: This work addresses the challenge of learning in real-world environments with sparse rewards and multi-goal navigation.
  • methods: Ready for Production Hierarchical RL (ReProHRL) divides tasks into hierarchical multi-goal navigation guided by reinforcement learning, and uses object detectors as a pre-processing step to learn multi-goal navigation and transfer it to the real world.
  • results: ReProHRL outperforms the state-of-the-art baseline in simulation and real-world environments in both training time and performance; while both methods reach a 100% success rate in a simple single-goal environment, ReProHRL outperforms the baseline by 18% in a more complex environment and by 5% in the multi-goal setting, and is deployed on a Crazyflie nano-drone with a front camera for multi-goal navigation experiments.
    Abstract Robots have been successfully used to perform tasks with high precision. In real-world environments with sparse rewards and multiple goals, learning is still a major challenge and Reinforcement Learning (RL) algorithms fail to learn good policies. Training in simulation environments and then fine-tuning in the real world is a common approach. However, adapting to the real-world setting is a challenge. In this paper, we present a method named Ready for Production Hierarchical RL (ReProHRL) that divides tasks with hierarchical multi-goal navigation guided by reinforcement learning. We also use object detectors as a pre-processing step to learn multi-goal navigation and transfer it to the real world. Empirical results show that the proposed ReProHRL method outperforms the state-of-the-art baseline in simulation and real-world environments in terms of both training time and performance. Although both methods achieve a 100% success rate in a simple environment for single goal-based navigation, in a more complex environment and multi-goal setting, the proposed method outperforms the baseline by 18% and 5%, respectively. For the real-world implementation and proof of concept demonstration, we deploy the proposed method on a nano-drone named Crazyflie with a front camera to perform multi-goal navigation experiments.

On the Effectiveness of Log Representation for Log-based Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.08736
  • repo_url: https://github.com/mooselab/suppmaterial-logrepforanomalydetection
  • paper_authors: Xingfang Wu, Heng Li, Foutse Khomh
  • for: This paper compares and evaluates different log representation techniques for machine-learning-based log analysis tasks, specifically anomaly detection.
  • methods: The authors select six commonly used log representation techniques and evaluate them with seven machine learning models and four public log datasets, also examining the impact of log parsing and feature aggregation approaches when used with these techniques.
  • results: The authors provide heuristic guidelines for future researchers and developers based on their comprehensive comparison, to help researchers and practitioners understand the characteristics of different log representation techniques and select the most suitable ones for their ML-based log analysis workflow.
    Abstract Logs are an essential source of information for people to understand the running status of a software system. Due to the evolving modern software architecture and maintenance methods, more research efforts have been devoted to automated log analysis. In particular, machine learning (ML) has been widely used in log analysis tasks. In ML-based log analysis tasks, converting textual log data into numerical feature vectors is a critical and indispensable step. However, the impact of using different log representation techniques on the performance of the downstream models is not clear, which limits researchers and practitioners' opportunities of choosing the optimal log representation techniques in their automated log analysis workflows. Therefore, this work investigates and compares the commonly adopted log representation techniques from previous log analysis research. Particularly, we select six log representation techniques and evaluate them with seven ML models and four public log datasets (i.e., HDFS, BGL, Spirit and Thunderbird) in the context of log-based anomaly detection. We also examine the impacts of the log parsing process and the different feature aggregation approaches when they are employed with log representation techniques. From the experiments, we provide some heuristic guidelines for future researchers and developers to follow when designing an automated log analysis workflow. We believe our comprehensive comparison of log representation techniques can help researchers and practitioners better understand the characteristics of different log representation techniques and provide them with guidance for selecting the most suitable ones for their ML-based log analysis workflow.
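
As one concrete example of the kind of log representation the paper evaluates, the sketch below turns parsed event-ID sequences into TF-IDF vectors and feeds them to a classifier for anomaly detection; the event IDs and labels are toy placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Each session is a sequence of log-template IDs produced by a log parser.
sessions = ["E5 E5 E22 E11", "E5 E22 E11 E9", "E5 E7 E7 E7"]
labels = [0, 0, 1]                                # 1 = anomalous session

vec = TfidfVectorizer(lowercase=False, token_pattern=r"E\d+")
X = vec.fit_transform(sessions)                   # text -> feature vectors
clf = LogisticRegression().fit(X, labels)
print(clf.predict(vec.transform(["E5 E7 E7 E22"])))
```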

A Novel Loss Function Utilizing Wasserstein Distance to Reduce Subject-Dependent Noise for Generalizable Models in Affective Computing

  • paper_url: http://arxiv.org/abs/2308.10869
  • repo_url: None
  • paper_authors: Nibraas Khan, Mahrukh Tauseef, Ritam Ghosh, Nilanjan Sarkar
  • for: This study improves the accuracy of affective state recognition by using optimal transport theory, specifically the Wasserstein distance, to reduce the impact of subject-dependent noise.
  • methods: A novel cost function employs the Wasserstein distance to scale the importance of subject-dependent data, and an autoencoder with a multi-class classifier attached to the latent space is trained simultaneously to detect different affective states.
  • results: Compared with a baseline trained with Mean Squared Error, the proposed cost function improves class separation in the latent space, with average increases of 14.75% and 17.75% in minimum and centroid euclidean distance, respectively, across all datasets.
    Abstract Emotions are an essential part of human behavior that can impact thinking, decision-making, and communication skills. Thus, the ability to accurately monitor and identify emotions can be useful in many human-centered applications such as behavioral training, tracking emotional well-being, and development of human-computer interfaces. The correlation between patterns in physiological data and affective states has allowed for the utilization of deep learning techniques which can accurately detect the affective states of a person. However, the generalisability of existing models is often limited by the subject-dependent noise in the physiological data due to variations in a subject's reactions to stimuli. Hence, we propose a novel cost function that employs Optimal Transport Theory, specifically Wasserstein Distance, to scale the importance of subject-dependent data such that higher importance is assigned to patterns in data that are common across all participants while decreasing the importance of patterns that result from subject-dependent noise. The performance of the proposed cost function is demonstrated through an autoencoder with a multi-class classifier attached to the latent space and trained simultaneously to detect different affective states. An autoencoder with a state-of-the-art loss function i.e., Mean Squared Error, is used as a baseline for comparison with our model across four different commonly used datasets. Centroid and minimum distance between different classes are used as a metrics to indicate the separation between different classes in the latent space. An average increase of 14.75% and 17.75% (from benchmark to proposed loss function) was found for minimum and centroid euclidean distance respectively over all datasets.
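
The sketch below shows one way such a term can be implemented: a differentiable 1-D Wasserstein distance between sorted samples, penalizing divergence between each subject's latent activations and everyone else's. The per-dimension treatment and equal-size subsampling are simplifying assumptions, not the paper's exact formulation.

```python
import torch

def wasserstein_1d(a, b):
    """W1 between two equal-size 1-D samples via sorted coupling."""
    a_sorted, _ = torch.sort(a)
    b_sorted, _ = torch.sort(b)
    return (a_sorted - b_sorted).abs().mean()

def subject_noise_loss(latent, subject_ids):
    """Penalize divergence between each subject's latent distribution
    and the pooled distribution of all other subjects, per dimension."""
    loss = latent.new_zeros(())
    for s in torch.unique(subject_ids):
        own = latent[subject_ids == s]
        rest = latent[subject_ids != s]
        n = min(len(own), len(rest))
        for d in range(latent.shape[1]):
            loss = loss + wasserstein_1d(own[:n, d], rest[:n, d])
    return loss
```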

Synergistic Signal Denoising for Multimodal Time Series of Structure Vibration

  • paper_url: http://arxiv.org/abs/2308.11644
  • repo_url: None
  • paper_authors: Yang Yu, Han Chen
  • for: This work provides a deep-learning-based Structural Health Monitoring (SHM) solution to help ensure the longevity and safety of infrastructure.
  • methods: A novel deep learning algorithm amalgamates convolutional and recurrent architectures to capture both localized and prolonged structural behaviors in multimodal vibration signals, and integrates attention mechanisms to prioritize salient structural responses over extraneous noise.
  • results: The algorithm shows significant improvements in predictive accuracy, early damage detection, and adaptability across multiple SHM scenarios, offering a more transparent and interpretable AI-driven SHM solution, with future prospects including real-time processing, integration with external environmental factors, and a deeper emphasis on model interpretability.
    Abstract Structural Health Monitoring (SHM) plays an indispensable role in ensuring the longevity and safety of infrastructure. With the rapid growth of sensor technology, the volume of data generated from various structures has seen an unprecedented surge, bringing forth challenges in efficient analysis and interpretation. This paper introduces a novel deep learning algorithm tailored for the complexities inherent in multimodal vibration signals prevalent in SHM. By amalgamating convolutional and recurrent architectures, the algorithm adeptly captures both localized and prolonged structural behaviors. The pivotal integration of attention mechanisms further enhances the model's capability, allowing it to discern and prioritize salient structural responses from extraneous noise. Our results showcase significant improvements in predictive accuracy, early damage detection, and adaptability across multiple SHM scenarios. In light of the critical nature of SHM, the proposed approach not only offers a robust analytical tool but also paves the way for more transparent and interpretable AI-driven SHM solutions. Future prospects include real-time processing, integration with external environmental factors, and a deeper emphasis on model interpretability.

Dynamic Neural Network is All You Need: Understanding the Robustness of Dynamic Mechanisms in Neural Networks

  • paper_url: http://arxiv.org/abs/2308.08709
  • repo_url: https://github.com/anonymous2015258/Early_Attack
  • paper_authors: Mirazul Haque, Wei Yang
  • for: This study investigates the robustness of the dynamic mechanisms in Dynamic Neural Networks (DyNNs) and how their design impacts robustness.
  • methods: Three research questions are evaluated on three models and two datasets, using a range of attacks to assess the impact of the dynamic mechanism.
  • results: Attack transferability from DyNNs to SDNNs is higher than from SDNNs to DyNNs, and DyNNs can generate adversarial samples more efficiently than SDNNs; the study also proposes a novel attack targeting the additional attack surface introduced by the dynamic mechanism, along with design choices that improve robustness.
    Abstract Deep Neural Networks (DNNs) have been used to solve various day-to-day problems. Recently, DNNs have been deployed in real-time systems, where lowering energy consumption and response time has become essential. To address this scenario, researchers have proposed incorporating dynamic mechanisms into static DNNs (SDNNs) to create Dynamic Neural Networks (DyNNs) that perform dynamic amounts of computation based on input complexity. Although incorporating dynamic mechanisms into SDNNs is preferable in real-time systems, it also becomes important to evaluate how the introduction of a dynamic mechanism impacts the robustness of the models. However, few works have focused on the robustness trade-off between SDNNs and DyNNs. To address this issue, we investigate the robustness of the dynamic mechanism in DyNNs and how its design impacts their robustness. For that purpose, we evaluate three research questions on three models and two datasets. Through these studies, we find that attack transferability from DyNNs to SDNNs is higher than attack transferability from SDNNs to DyNNs. We also find that DyNNs can be used to generate adversarial samples more efficiently than SDNNs. We then provide insight into the design choices that can increase the robustness of DyNNs against attacks generated using static models. Finally, we propose a novel attack to understand the additional attack surface introduced by the dynamic mechanism and provide design choices to improve robustness against it.
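
For readers unfamiliar with DyNNs, the sketch below shows one common dynamic mechanism, an early-exit network that stops at the first internal classifier whose confidence clears a threshold; the layer sizes, exit count, and batch-level exit rule are illustrative choices, not the models studied in the paper.

```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    def __init__(self, n_classes=10, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(3)])
        self.exits = nn.ModuleList(
            [nn.Linear(64, n_classes) for _ in range(3)])
        self.threshold = threshold

    def forward(self, x):
        for block, exit_head in zip(self.blocks, self.exits):
            x = block(x)
            logits = exit_head(x)
            conf = logits.softmax(-1).max(-1).values
            if bool((conf >= self.threshold).all()):  # batch-level early exit
                return logits
        return logits  # fall through to the final exit
```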

Consciousness in Artificial Intelligence: Insights from the Science of Consciousness

  • paper_url: http://arxiv.org/abs/2308.08708
  • repo_url: None
  • paper_authors: Patrick Butlin, Robert Long, Eric Elmoznino, Yoshua Bengio, Jonathan Birch, Axel Constant, George Deane, Stephen M. Fleming, Chris Frith, Xu Ji, Ryota Kanai, Colin Klein, Grace Lindsay, Matthias Michel, Liad Mudrik, Megan A. K. Peters, Eric Schwitzgebel, Jonathan Simon, Rufin VanRullen
  • for: This report assesses whether current or near-term AI systems could be conscious.
  • methods: It takes a rigorous, empirically grounded approach, assessing existing AI systems in detail in light of our best-supported neuroscientific theories of consciousness and deriving computational "indicator properties" from them.
  • results: The analysis suggests that no current AI systems are conscious, but also that there are no obvious technical barriers to building AI systems that satisfy these indicators.
    Abstract Whether current or near-term AI systems could be conscious is a topic of scientific interest and increasing public concern. This report argues for, and exemplifies, a rigorous and empirically grounded approach to AI consciousness: assessing existing AI systems in detail, in light of our best-supported neuroscientific theories of consciousness. We survey several prominent scientific theories of consciousness, including recurrent processing theory, global workspace theory, higher-order theories, predictive processing, and attention schema theory. From these theories we derive "indicator properties" of consciousness, elucidated in computational terms that allow us to assess AI systems for these properties. We use these indicator properties to assess several recent AI systems, and we discuss how future systems might implement them. Our analysis suggests that no current AI systems are conscious, but also suggests that there are no obvious technical barriers to building AI systems which satisfy these indicators.

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

  • paper_url: http://arxiv.org/abs/2308.09723
  • repo_url: None
  • paper_authors: Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla
  • for: Large Language Models (LLMs) achieve state-of-the-art performance across language tasks, but their substantial memory requirements and the memory bandwidth bottleneck of auto-regressive decoding hinder practical deployment.
  • methods: An efficient weight-only quantization method reduces memory consumption and accelerates inference, using a simple and effective heuristic that needs only the weights of the pre-trained model and no additional fine-tuning.
  • results: With minimal quality degradation, the method achieves up to 3.65 times higher throughput on large-scale open-source models such as OPT-175B and internal MoE models while using the same number of GPUs.
    Abstract Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment due to their substantial memory requirements. Furthermore, the latest generative models suffer from high inference costs caused by the memory bandwidth bottleneck in the auto-regressive decoding process. To address these issues, we propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs. To ensure minimal quality degradation, we introduce a simple and effective heuristic approach that utilizes only the model weights of a pre-trained model. This approach is applicable to both Mixture-of-Experts (MoE) and dense models without requiring additional fine-tuning. To demonstrate the effectiveness of our proposed method, we first analyze the challenges and issues associated with LLM quantization. Subsequently, we present our heuristic approach, which adaptively finds the granularity of quantization, effectively addressing these problems. Furthermore, we implement highly efficient GPU GEMMs that perform on-the-fly matrix multiplication and dequantization, supporting the multiplication of fp16 or bf16 activations with int8 or int4 weights. We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput on the same number of GPUs.
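
The sketch below illustrates the general shape of fine-grained, group-wise weight-only quantization: int8 codes with one floating-point scale per group, dequantized at matmul time. The group size, symmetric scaling, and unfused dequantize-then-multiply are illustrative simplifications; the paper's heuristic granularity selection and fused GPU GEMMs are not reproduced.

```python
import torch

def quantize_groupwise(w, group_size=128):
    """w: (out, in) float weights -> int8 codes + one scale per group."""
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    scale = g.abs().amax(dim=-1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(g / scale), -127, 127).to(torch.int8)
    return q, scale

def dequant_matmul(x, q, scale):
    """Dequantize then multiply; real kernels fuse these two steps."""
    w = (q.float() * scale).reshape(q.shape[0], -1)
    return x @ w.t()

w = torch.randn(256, 256)
q, s = quantize_groupwise(w)
x = torch.randn(4, 256)
print((dequant_matmul(x, q, s) - x @ w.t()).abs().max())  # quantization error
```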

Partially Observable Multi-agent RL with (Quasi-)Efficiency: The Blessing of Information Sharing

  • paper_url: http://arxiv.org/abs/2308.08705
  • repo_url: None
  • paper_authors: Xiangyu Liu, Kaiqing Zhang
  • for: This work develops provable multi-agent RL (MARL) in the general framework of partially observable stochastic games (POSGs), circumventing known hardness results and computationally intractable oracles.
  • methods: It leverages information sharing among agents, a common practice in empirical MARL and a standard model for multi-agent control systems with communications; after establishing complexity results that justify the necessity of information sharing and the observability assumption, it approximates the shared common information to construct an approximate model of the POSG in which planning an approximate equilibrium is quasi-efficient.
  • results: The resulting partially observable MARL algorithm is both statistically and computationally quasi-efficient, potentially opening up the development of sample- and computation-efficient MARL under different information structures.
    Abstract We study provable multi-agent reinforcement learning (MARL) in the general framework of partially observable stochastic games (POSGs). To circumvent the known hardness results and the use of computationally intractable oracles, we advocate leveraging the potential \emph{information-sharing} among agents, a common practice in empirical MARL, and a standard model for multi-agent control systems with communications. We first establish several computation complexity results to justify the necessity of information-sharing, as well as the observability assumption that has enabled quasi-efficient single-agent RL with partial observations, for computational efficiency in solving POSGs. We then propose to further \emph{approximate} the shared common information to construct an {approximate model} of the POSG, in which planning an approximate equilibrium (in terms of solving the original POSG) can be quasi-efficient, i.e., of quasi-polynomial-time, under the aforementioned assumptions. Furthermore, we develop a partially observable MARL algorithm that is both statistically and computationally quasi-efficient. We hope our study may open up the possibilities of leveraging and even designing different \emph{information structures}, for developing both sample- and computation-efficient partially observable MARL.

Planning in the imagination: High-level planning on learned abstract search spaces

  • paper_url: http://arxiv.org/abs/2308.08693
  • repo_url: None
  • paper_authors: Carlos Martin, Tuomas Sandholm
  • for: This paper proposes PiZero, a method that gives an agent the ability to plan in an abstract search space of its own creation, completely decoupled from the real environment.
  • methods: PiZero enables high-level planning at arbitrary timescales and reasoning in terms of compound or temporally-extended actions, and handles continuous action spaces and partial observability.
  • results: On multiple domains, including navigation tasks and Sokoban, PiZero outperforms comparable prior methods without assuming access to an environment simulator.
    Abstract We propose a new method, called PiZero, that gives an agent the ability to plan in an abstract search space of its own creation that is completely decoupled from the real environment. Unlike prior approaches, this enables the agent to perform high-level planning at arbitrary timescales and reason in terms of compound or temporally-extended actions, which can be useful in environments where large numbers of base-level micro-actions are needed to perform relevant macro-actions. In addition, our method is more general than comparable prior methods because it handles settings with continuous action spaces and partial observability. We evaluate our method on multiple domains, including navigation tasks and Sokoban. Experimentally, it outperforms comparable prior methods without assuming access to an environment simulator.

Quantifying Overfitting: Introducing the Overfitting Index

  • paper_url: http://arxiv.org/abs/2308.08682
  • repo_url: None
  • paper_authors: Sanad Aburass
  • for: This study proposes the Overfitting Index (OI), a novel metric to quantitatively assess a deep learning model's tendency to overfit.
  • methods: Extensive experiments are run on the Breast Ultrasound Images Dataset (BUS) and the MNIST dataset using architectures such as MobileNet, U-Net, ResNet, Darknet, and ViT-32.
  • results: Overfitting behavior varies across architectures, and data augmentation has a mitigative impact, especially on smaller and more specialized datasets; ViT-32's performance on MNIST further emphasizes the robustness of certain models and the comprehensive nature of the dataset.
    Abstract In the rapidly evolving domain of machine learning, ensuring model generalizability remains a quintessential challenge. Overfitting, where a model exhibits superior performance on training data but falters on unseen data, is a recurrent concern. This paper introduces the Overfitting Index (OI), a novel metric devised to quantitatively assess a model's tendency to overfit. Through extensive experiments on the Breast Ultrasound Images Dataset (BUS) and the MNIST dataset using architectures such as MobileNet, U-Net, ResNet, Darknet, and ViT-32, we illustrate the utility and discernment of the OI. Our results underscore the variable overfitting behaviors across architectures and highlight the mitigative impact of data augmentation, especially on smaller and more specialized datasets. The ViT-32's performance on MNIST further emphasizes the robustness of certain models and the dataset's comprehensive nature. By providing an objective lens to gauge overfitting, the OI offers a promising avenue to advance model optimization and ensure real-world efficacy.
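
The paper's exact OI formula is not reproduced here; as a loudly hypothetical stand-in, the sketch below scores overfitting as the normalized train/validation accuracy gap averaged over epochs, just to make the idea of a scalar overfitting metric concrete.

```python
import numpy as np

def overfitting_index(train_acc, val_acc):
    """Average normalized generalization gap across epochs (illustrative,
    not the paper's definition of the OI)."""
    train_acc, val_acc = np.asarray(train_acc), np.asarray(val_acc)
    gap = np.clip(train_acc - val_acc, 0.0, None)
    return float(np.mean(gap / np.maximum(train_acc, 1e-8)))

print(overfitting_index([0.70, 0.90, 0.99], [0.68, 0.80, 0.75]))
```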

SkinDistilViT: Lightweight Vision Transformer for Skin Lesion Classification

  • paper_url: http://arxiv.org/abs/2308.08669
  • repo_url: https://github.com/Longman-Stan/SkinDistilVit
  • paper_authors: Vlad-Constantin Lungu-Stan, Dumitru-Clementin Cercel, Florin Pop
  • for: The paper addresses the skin cancer classification problem, focusing on melanoma identification.
  • methods: A vision transformer is trained on melanoma medical images annotated by experts, and knowledge distillation yields a model that retains the teacher's balanced multi-class accuracy at a lower cost in terms of memory and time.
  • results: The distilled model retains 98.33% of the teacher's balanced multi-class accuracy while being 49.60% smaller, 69.25% faster on GPU, and 97.96% faster on CPU; adding classification heads at each transformer level and applying a cascading distillation process improves the base model's balanced multi-class accuracy by 2.1% while creating a range of models of various sizes but comparable performance.
    Abstract Skin cancer is a treatable disease if discovered early. We provide a production-specific solution to the skin cancer classification problem that matches human performance in melanoma identification by training a vision transformer on melanoma medical images annotated by experts. Since inference cost, both time and memory wise is important in practice, we employ knowledge distillation to obtain a model that retains 98.33% of the teacher's balanced multi-class accuracy, at a fraction of the cost. Memory-wise, our model is 49.60% smaller than the teacher. Time-wise, our solution is 69.25% faster on GPU and 97.96% faster on CPU. By adding classification heads at each level of the transformer and employing a cascading distillation process, we improve the balanced multi-class accuracy of the base model by 2.1%, while creating a range of models of various sizes but comparable performance. We provide the code at https://github.com/Longman-Stan/SkinDistilVit.
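
A minimal sketch of the response-based distillation loss such a setup typically uses, blending hard-label cross-entropy with a temperature-softened KL term toward the teacher; the temperature and blending weight are illustrative, not the paper's tuned values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft-label KL to the teacher."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    return alpha * hard + (1 - alpha) * soft
```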

BREATHE: Second-Order Gradients and Heteroscedastic Emulation based Design Space Exploration

  • paper_url: http://arxiv.org/abs/2308.08666
  • repo_url: None
  • paper_authors: Shikhar Tuli, Niraj K. Jha
  • for: This paper proposes a constrained multi-objective optimization (MOO) framework, BREATHE, that searches both vector-based and graph-based design spaces for best-performing designs.
  • methods: The framework leverages second-order gradients and actively trains a heteroscedastic surrogate model for sample-efficient optimization under constraints.
  • results: In single-objective vector optimization, BREATHE achieves 64.1% higher performance than the next-best baseline, random forest regression; in graph-based search, up to 64.9% higher performance than a graphical version of Gaussian-process-based Bayesian optimization; and in a MOO task, up to 21.9x higher hypervolume than the state-of-the-art multi-objective Bayesian optimization method (MOBOpt).
    Abstract Researchers constantly strive to explore larger and more complex search spaces in various scientific studies and physical experiments. However, such investigations often involve sophisticated simulators or time-consuming experiments that make exploring and observing new design samples challenging. Previous works that target such applications are typically sample-inefficient and restricted to vector search spaces. To address these limitations, this work proposes a constrained multi-objective optimization (MOO) framework, called BREATHE, that searches not only traditional vector-based design spaces but also graph-based design spaces to obtain best-performing graphs. It leverages second-order gradients and actively trains a heteroscedastic surrogate model for sample-efficient optimization. In a single-objective vector optimization application, it leads to 64.1% higher performance than the next-best baseline, random forest regression. In graph-based search, BREATHE outperforms the next-best baseline, i.e., a graphical version of Gaussian-process-based Bayesian optimization, with up to 64.9% higher performance. In a MOO task, it achieves up to 21.9$\times$ higher hypervolume than the state-of-the-art method, multi-objective Bayesian optimization (MOBOpt). BREATHE also outperforms the baseline methods on most standard MOO benchmark applications.

Flickr Africa: Examining Geo-Diversity in Large-Scale, Human-Centric Visual Data

  • paper_url: http://arxiv.org/abs/2308.08656
  • repo_url: None
  • paper_authors: Keziah Naggita, Julienne LaChance, Alice Xiang
  • for: This study investigates the limitations of standard Internet data collection methods in low- and middle-income countries.
  • methods: It analyzes human-centric image geo-diversity on a massive scale using geotagged Flickr images associated with each nation in Africa, comparing the quantity and content of available data with population-matched nations in Europe and examining the distribution of data according to fine-grained intra-national wealth estimates.
  • results: The findings include evidence of an "othering" phenomenon, with a substantial number of images from Africa taken by non-local photographers, indicating that further work is needed to capture image data representative of African people and their environments and to improve the applicability of computer vision models in a global context.
    Abstract Biases in large-scale image datasets are known to influence the performance of computer vision models as a function of geographic context. To investigate the limitations of standard Internet data collection methods in low- and middle-income countries, we analyze human-centric image geo-diversity on a massive scale using geotagged Flickr images associated with each nation in Africa. We report the quantity and content of available data with comparisons to population-matched nations in Europe as well as the distribution of data according to fine-grained intra-national wealth estimates. Temporal analyses are performed at two-year intervals to expose emerging data trends. Furthermore, we present findings for an ``othering'' phenomenon as evidenced by a substantial number of images from Africa being taken by non-local photographers. The results of our study suggest that further work is required to capture image data representative of African people and their environments and, ultimately, to improve the applicability of computer vision models in a global context.

Physics Informed Recurrent Neural Networks for Seismic Response Evaluation of Nonlinear Systems

  • paper_url: http://arxiv.org/abs/2308.08655
  • repo_url: None
  • paper_authors: Faisal Nissar Malik, James Ricles, Masoud Yari, Malik Arsala Nissar
  • for: This study evaluates the dynamic response of multi-degree-of-freedom (MDOF) systems using physics-informed recurrent neural networks, with a focus on the seismic (earthquake) response of nonlinear structures.
  • methods: Physics-informed recurrent neural networks (RNNs) leverage large data sets and sophisticated algorithms to learn the complex relationship between inputs and outputs, avoiding the computational cost of conventional numerical simulation.
  • results: The predicted responses are compared with state-of-the-art methods such as finite element analysis (FEA) to assess the efficacy of the physics-informed RNN model.
    Abstract Dynamic response evaluation in structural engineering is the process of determining the response of a structure, such as member forces and node displacements, when subjected to dynamic loads such as earthquakes, wind, or impact. This is an important aspect of structural analysis, as it enables engineers to assess structural performance under extreme loading conditions and make informed decisions about the design and safety of the structure. Conventional methods for dynamic response evaluation involve numerical simulations using finite element analysis (FEA), where the structure is modeled using finite elements and the equations of motion are solved numerically. Although effective, this approach can be computationally intensive and may not be suitable for real-time applications. To address these limitations, recent advancements in machine learning, specifically artificial neural networks, have been applied to dynamic response evaluation in structural engineering. These techniques leverage large data sets and sophisticated algorithms to learn the complex relationship between inputs and outputs, making them ideal for such problems. In this paper, a novel approach is proposed for evaluating the dynamic response of multi-degree-of-freedom (MDOF) systems using physics-informed recurrent neural networks. The focus of this paper is to evaluate the seismic (earthquake) response of nonlinear structures. The predicted response will be compared to state-of-the-art methods such as FEA to assess the efficacy of the physics-informed RNN model.
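
To show what "physics-informed" means here, the sketch below adds an equation-of-motion residual to the training loss: predicted displacement histories must approximately satisfy m*x'' + c*x' + k*x = -m*a_g, checked by finite differences. A linear single-degree-of-freedom oscillator is assumed for clarity; the paper targets nonlinear MDOF systems.

```python
import torch

def physics_residual(x_pred, a_ground, dt, m=1.0, c=0.1, k=10.0):
    """x_pred, a_ground: (batch, time) displacement and ground accel."""
    v = (x_pred[:, 2:] - x_pred[:, :-2]) / (2 * dt)        # central diff
    a = (x_pred[:, 2:] - 2 * x_pred[:, 1:-1] + x_pred[:, :-2]) / dt**2
    x = x_pred[:, 1:-1]
    res = m * a + c * v + k * x + m * a_ground[:, 1:-1]
    return (res ** 2).mean()

# Total loss combines data mismatch and the physics residual, e.g.:
# loss = mse(x_pred, x_true) + lam * physics_residual(x_pred, a_g, dt)
```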

Reproducing Kernel Hilbert Space Pruning for Sparse Hyperspectral Abundance Prediction

  • paper_url: http://arxiv.org/abs/2308.08653
  • repo_url: None
  • paper_authors: Michael G. Rawson, Timothy Doster, Tegan Emerson
  • for: This work develops a transformation into Hilbert spaces for sparse spectral compression and analysis, reducing the cost of analyzing high-resolution hyperspectral data.
  • methods: Sparse representations are pruned and constructed via non-negative least squares minimization, and maximum-likelihood compression vectors are introduced to decrease information loss.
  • results: On real and synthetic data, Hilbert space pruning reduces error by as much as 40% of the error of standard pruning and outperforms least squares methods and neural network autoencoders.
    Abstract Hyperspectral measurements from long range sensors can give a detailed picture of the items, materials, and chemicals in a scene but analysis can be difficult, slow, and expensive due to high spatial and spectral resolutions of state-of-the-art sensors. As such, sparsity is important to enable the future of spectral compression and analytics. It has been observed that environmental and atmospheric effects, including scattering, can produce nonlinear effects posing challenges for existing source separation and compression methods. We present a novel transformation into Hilbert spaces for pruning and constructing sparse representations via non-negative least squares minimization. Then we introduce max likelihood compression vectors to decrease information loss. Our approach is benchmarked against standard pruning and least squares as well as deep learning methods. Our methods are evaluated in terms of overall spectral reconstruction error and compression rate using real and synthetic data. We find that pruning least squares methods converge quickly unlike matching pursuit methods. We find that Hilbert space pruning can reduce error by as much as 40% of the error of standard pruning and also outperform neural network autoencoders.
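
The non-negative least squares building block is standard; the sketch below estimates sparse abundances for one pixel against an endmember library with scipy's nnls. The random library and two-material mixture are toy data, not the paper's Hilbert-space construction.

```python
import numpy as np
from scipy.optimize import nnls

bands, n_endmembers = 100, 12
library = np.abs(np.random.randn(bands, n_endmembers))  # endmember spectra
true_ab = np.zeros(n_endmembers)
true_ab[[2, 7]] = [0.6, 0.4]                             # sparse mixture
pixel = library @ true_ab + 0.01 * np.random.randn(bands)

abundances, residual = nnls(library, pixel)  # non-negativity -> sparsity
print(np.round(abundances, 3), "residual:", round(residual, 4))
```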

Towards Personalized Federated Learning via Heterogeneous Model Reassembly

  • paper_url: http://arxiv.org/abs/2308.08643
  • repo_url: None
  • paper_authors: Jiaqi Wang, Xingyi Yang, Suhan Cui, Liwei Che, Lingjuan Lyu, Dongkuan Xu, Fenglong Ma
  • for: This paper addresses model heterogeneity in federated learning, where clients possess models with different network structures.
  • methods: The pFedHR framework leverages heterogeneous model reassembly for personalized federated learning, treating personalization as a model-matching optimization task on the server side and automatically and dynamically generating informative, diverse personalized candidates with minimal human intervention.
  • results: Experiments on three datasets show that pFedHR outperforms baselines under both IID and Non-IID settings, mitigates the adverse impact of public data whose distribution differs from the client data, and generates diverse personalized models in an automated manner.
    Abstract This paper focuses on addressing the practical yet challenging problem of model heterogeneity in federated learning, where clients possess models with different network structures. To track this problem, we propose a novel framework called pFedHR, which leverages heterogeneous model reassembly to achieve personalized federated learning. In particular, we approach the problem of heterogeneous model personalization as a model-matching optimization task on the server side. Moreover, pFedHR automatically and dynamically generates informative and diverse personalized candidates with minimal human intervention. Furthermore, our proposed heterogeneous model reassembly technique mitigates the adverse impact introduced by using public data with different distributions from the client data to a certain extent. Experimental results demonstrate that pFedHR outperforms baselines on three datasets under both IID and Non-IID settings. Additionally, pFedHR effectively reduces the adverse impact of using different public data and dynamically generates diverse personalized models in an automated manner.

Non-monotone Sequential Submodular Maximization

  • paper_url: http://arxiv.org/abs/2308.08641
  • repo_url: None
  • paper_authors: Shaojie Tang, Jing Yuan
  • for: This paper studies sequential submodular maximization: selecting and ranking $k$ items from a ground set $V$ to maximize the weighted summation of $k$ (possibly non-monotone) submodular functions $f_1, \cdots ,f_k: 2^V \rightarrow \mathbb{R}^+$, where each $f_j$ takes the first $j$ items of the sequence as input.
  • methods: Effective solutions are given for both flexible and fixed length constraints, as well as for a special case with identical utility functions.
  • results: Empirical evaluations validate the effectiveness of the proposed algorithms in video recommendation; the results have implications for recommendation systems and assortment optimization, where the ordering of items significantly impacts the overall value obtained.
    Abstract In this paper, we study a fundamental problem in submodular optimization, known as sequential submodular maximization. Specifically, we aim to select and rank a group of $k$ items from a ground set $V$ such that the weighted summation of $k$ (possibly non-monotone) submodular functions $f_1, \cdots ,f_k: 2^V \rightarrow \mathbb{R}^+$ is maximized, where each function $f_j$ takes the first $j$ items from this sequence as input. Existing research on sequential submodular maximization has predominantly concentrated on the monotone setting, assuming that the submodular functions are non-decreasing. However, in various real-world scenarios, like diversity-aware recommendation systems, adding items to an existing set might negatively impact the overall utility. In response, this paper pioneers the examination of the aforementioned problem with non-monotone submodular functions and offers effective solutions for both flexible and fixed length constraints, as well as a special case with identical utility functions. The empirical evaluations further validate the effectiveness of our proposed algorithms in the domain of video recommendations. The results of this research have implications in various fields, including recommendation systems and assortment optimization, where the ordering of items significantly impacts the overall value obtained.
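
To fix notation, the sketch below runs plain greedy on the sequential objective: at position $j$ it appends the item maximizing the weighted marginal term $w_j f_j(\text{prefix})$. This illustrates the problem setup only; the paper's algorithms for non-monotone objectives differ from plain greedy.

```python
def greedy_sequence(V, fs, weights, k):
    """fs[j] is a set function; position j contributes
    weights[j] * fs[j]({first j+1 items of the sequence})."""
    seq = []
    for j in range(k):
        best = max((v for v in V if v not in seq),
                   key=lambda v: weights[j] * fs[j](frozenset(seq + [v])))
        seq.append(best)
    return seq

# Toy coverage example: each item covers a set of topics.
items = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
cover = lambda S: len(set().union(*(items[v] for v in S)))
print(greedy_sequence(list(items), [cover] * 3, [1.0, 0.5, 0.25], k=2))
```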

Fair GANs through model rebalancing with synthetic data

  • paper_url: http://arxiv.org/abs/2308.08638
  • repo_url: None
  • paper_authors: Anubhav Jain, Nasir Memon, Julian Togelius
  • for: This work addresses bias in deep generative models, which arises when training datasets are expensive and difficult to collect and thus fail to represent the appropriate underlying (e.g. demographic) distribution.
  • methods: The authors mitigate bias in an existing generative adversarial network by exploring its latent space to generate balanced data from the unbalanced model, then retraining a balanced generative model on this data; they additionally propose a bias mitigation loss function that improves the fairness metric even when training on unbalanced datasets.
  • results: For Stylegan2 models trained on the FFHQ dataset for racial fairness, the approach improves the fairness metric by almost 5 times while maintaining image quality; the method is further validated on an imbalanced Cifar-10 dataset. Finally, the authors argue that commonly used image quality metrics such as the Frechet inception distance (FID) are unsuitable for bias mitigation problems.
    Abstract Deep generative models require large amounts of training data. This often poses a problem as the collection of datasets can be expensive and difficult, in particular datasets that are representative of the appropriate underlying distribution (e.g. demographic). This introduces biases in datasets which are further propagated in the models. We present an approach to mitigate biases in an existing generative adversarial network by rebalancing the model distribution. We do so by generating balanced data from an existing unbalanced deep generative model using latent space exploration and using this data to train a balanced generative model. Further, we propose a bias mitigation loss function that shows improvements in the fairness metric even when trained with unbalanced datasets. We show results for the Stylegan2 models while training on the FFHQ dataset for racial fairness and see that the proposed approach improves on the fairness metric by almost 5 times, whilst maintaining image quality. We further validate our approach by applying it to an imbalanced Cifar-10 dataset. Lastly, we argue that the traditionally used image quality metrics such as Frechet inception distance (FID) are unsuitable for bias mitigation problems.
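A minimal sketch of the latent-space rebalancing idea, assuming a pre-trained generator `G` and an attribute classifier `clf` supplied by the caller (both are hypothetical stand-ins; the paper's actual latent space exploration procedure may differ):

```python
import numpy as np

def rebalanced_dataset(G, clf, n_groups, n_per_group, latent_dim=512, batch=64):
    """Sample latents until every attribute group has n_per_group synthetic images."""
    buckets = {g: [] for g in range(n_groups)}
    while any(len(imgs) < n_per_group for imgs in buckets.values()):
        z = np.random.randn(batch, latent_dim)     # explore the latent space by sampling
        images = G(z)                              # candidate images from the biased model
        for image, g in zip(images, clf(images)):  # clf assigns an attribute group index
            if len(buckets[g]) < n_per_group:
                buckets[g].append(image)
    # the balanced synthetic set is then used to retrain a fairer generative model
    return [img for imgs in buckets.values() for img in imgs]
```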

FedPop: Federated Population-based Hyperparameter Tuning

  • paper_url: http://arxiv.org/abs/2308.08634
  • repo_url: None
  • paper_authors: Haokun Chen, Denis Krompass, Jindong Gu, Volker Tresp
  • for: optimizing hyperparameters (HPs) in federated learning (FL)
  • methods: population-based evolutionary algorithms that tune HPs on both the client and server sides in an online "tuning-while-training" fashion (a generic sketch follows the abstract below)
  • results: substantially outperforms concurrent state-of-the-art HP tuning methods for FL
    Abstract Federated Learning (FL) is a distributed machine learning (ML) paradigm, in which multiple clients collaboratively train ML models without centralizing their local data. Similar to conventional ML pipelines, the client local optimization and server aggregation procedure in FL are sensitive to the hyperparameter (HP) selection. Despite extensive research on tuning HPs for centralized ML, these methods yield suboptimal results when employed in FL. This is mainly because their "training-after-tuning" framework is unsuitable for FL with limited client computation power. While some approaches have been proposed for HP-Tuning in FL, they are limited to the HPs for client local updates. In this work, we propose a novel HP-tuning algorithm, called Federated Population-based Hyperparameter Tuning (FedPop), to address this vital yet challenging problem. FedPop employs population-based evolutionary algorithms to optimize the HPs, which accommodates various HP types at both client and server sides. Compared with prior tuning methods, FedPop employs an online "tuning-while-training" framework, offering computational efficiency and enabling the exploration of a broader HP search space. Our empirical validation on the common FL benchmarks and complex real-world FL datasets demonstrates the effectiveness of the proposed method, which substantially outperforms the concurrent state-of-the-art HP tuning methods for FL.
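A minimal sketch of population-based tuning in a federated loop, under the assumption that it follows the usual exploit-and-explore recipe; `local_update`, `aggregate`, and `evaluate` are caller-supplied placeholders, not FedPop's actual interfaces:

```python
import copy
import random

def population_tune(clients, sample_hp, local_update, aggregate, evaluate,
                    pop_size=8, rounds=50):
    # each member of the population is one candidate hyperparameter configuration
    population = [{"hp": sample_hp(), "score": 0.0} for _ in range(pop_size)]
    for _ in range(rounds):
        for member in population:
            updates = [local_update(c, member["hp"]) for c in clients]  # client side
            model = aggregate(updates, member["hp"])                    # server side
            member["score"] = evaluate(model)
        population.sort(key=lambda m: m["score"], reverse=True)
        # exploit: the bottom half copies hyperparameters from the top half;
        # explore: copied float-valued hyperparameters are randomly perturbed
        for loser in population[pop_size // 2:]:
            winner = copy.deepcopy(random.choice(population[:pop_size // 2]))
            loser["hp"] = {k: v * random.choice([0.8, 1.2]) if isinstance(v, float) else v
                           for k, v in winner["hp"].items()}
    return max(population, key=lambda m: m["score"])["hp"]
```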

LSTM-Based Forecasting Model for GRACE Accelerometer Data

  • paper_url: http://arxiv.org/abs/2308.08621
  • repo_url: https://github.com/darbeheshti/lstm-based-analysis-for-grace-accelerometers
  • paper_authors: Neda Darbeheshti, Elahe Moradi
  • for: to fill data gaps in the GRACE satellite mission and forecast GRACE accelerometer data
  • methods: Long Short-Term Memory (LSTM) networks are trained to predict the accelerometer data for all three axes
  • results: the LSTM forecasting model effectively fills gaps in GRACE data and forecasts the accelerometer data
    Abstract The Gravity Recovery and Climate Experiment (GRACE) satellite mission, spanning from 2002 to 2017, has provided a valuable dataset for monitoring variations in Earth's gravity field, enabling diverse applications in geophysics and hydrology. The mission was followed by GRACE Follow-On in 2018, continuing data collection efforts. The monthly Earth gravity field, derived from the integration of data from different instruments onboard the satellites, has shown inconsistencies due to various factors, including gaps in observations for certain instruments since the beginning of the GRACE mission. With over two decades of GRACE and GRACE Follow-On data now available, this paper proposes an approach to fill the data gaps and forecast GRACE accelerometer data. Specifically, we focus on accelerometer data and employ Long Short-Term Memory (LSTM) networks to train a model capable of predicting accelerometer data for all three axes. In this study, we describe the methodology used to preprocess the accelerometer data, prepare it for LSTM training, and evaluate the model's performance. Through experimentation and validation, we assess the model's accuracy and its ability to predict accelerometer data for the three axes. Our results demonstrate the effectiveness of the LSTM forecasting model in filling gaps and forecasting GRACE accelerometer data.
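A minimal sketch of a three-axis LSTM forecaster of the kind described, assuming the accelerometer series has already been cleaned; the window length and layer sizes are illustrative choices, not the paper's settings:

```python
import numpy as np
import tensorflow as tf

WINDOW = 60  # number of past samples per input window (illustrative)

def make_windows(series: np.ndarray, window: int = WINDOW):
    """series: (T, 3) array of x/y/z accelerations -> supervised (input, target) pairs."""
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]  # next-step target for each window
    return X, y

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, 3)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(3),  # one prediction per axis
])
model.compile(optimizer="adam", loss="mse")
# X, y = make_windows(accel_series); model.fit(X, y, epochs=10)
```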

Boosting Logical Reasoning in Large Language Models through a New Framework: The Graph of Thought

  • paper_url: http://arxiv.org/abs/2308.08614
  • repo_url: None
  • paper_authors: Bin Lei, pei-Hung Lin, Chunhua Liao, Caiwen Ding
  • for: improving the multi-step logical reasoning ability of large-scale language models
  • methods: proposes the Graph of Thoughts (GoT) prompting technique
  • results: accuracy improvements over GPT-4 of $89.7\%$, $86\%$, and $56\%$ on the 24-point game (e.g., forming 24 from 4, 9, 10, 13 as $(13-9)\times(10-4)$), solving high-degree polynomial equations, and deriving formulas for recursive sequences, respectively, plus average gains of $23\%$, $24\%$, and $15\%$ over the state-of-the-art Tree of Thought (ToT) prompting method
    Abstract Recent advancements in large-scale models, such as GPT-4, have showcased remarkable capabilities in addressing standard queries. However, when facing complex problems that require multi-step logical reasoning, their accuracy dramatically decreases. Current research has explored the realm of \textit{prompting engineering} to bolster the inferential capacities of these models. Our paper unveils a pioneering prompting technique, dubbed \textit{Graph of Thoughts (GoT)}. Through testing on a trio of escalating challenges: the 24-point game, resolution of high-degree polynomial equations, and derivation of formulas for recursive sequences, our method outperformed GPT-4, achieving accuracy improvements of $89.7\%$, $86\%$, and $56\%$ for each respective task. Moreover, when juxtaposed with the state-of-the-art (SOTA) prompting method, \textit{Tree of Thought (ToT)}, our approach registered an average accuracy boost of $23\%$, $24\%$, and $15\%$.

Integrating Renewable Energy in Agriculture: A Deep Reinforcement Learning-based Approach

  • paper_url: http://arxiv.org/abs/2308.08611
  • repo_url: None
  • paper_authors: A. Wahid, I. Faiud, K. Mason
  • for: to assist agricultural investors in optimizing decisions about photovoltaic (PV) system installations using Deep Q-Networks (DQNs)
  • methods: a DQN-based framework that supports data-driven decisions, accounting for the installation budget, government incentives, energy requirements, system cost, and long-term benefits
  • results: with a reward mechanism in place, the DQN learns to make data-driven decisions and provides a comprehensive understanding of PV installation choices, helping agricultural investors improve energy efficiency, reduce environmental impact, and enhance profitability
    Abstract This article investigates the use of Deep Q-Networks (DQNs) to optimize decision-making for photovoltaic (PV) systems installations in the agriculture sector. The study develops a DQN framework to assist agricultural investors in making informed decisions considering factors such as installation budget, government incentives, energy requirements, system cost, and long-term benefits. By implementing a reward mechanism, the DQN learns to make data-driven decisions on PV integration. The analysis provides a comprehensive understanding of how DQNs can support investors in making decisions about PV installations in agriculture. This research has significant implications for promoting sustainable and efficient farming practices while also paving the way for future advancements in this field. By leveraging DQNs, agricultural investors can make optimized decisions that improve energy efficiency, reduce environmental impact, and enhance profitability. This study contributes to the advancement of PV integration in agriculture and encourages further innovation in this promising area.
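The abstract does not spell out the learning rule; for reference, a standard DQN minimizes the temporal-difference loss below (textbook formulation, not a detail reported by this paper):

```latex
L(\theta) = \mathbb{E}_{(s,a,r,s')}\Big[\big(r + \gamma \max_{a'} Q_{\theta^{-}}(s', a') - Q_{\theta}(s, a)\big)^{2}\Big],
```

where $s$ would encode factors such as budget, incentives, and energy demand, $a$ is an installation decision, $Q_{\theta^-}$ is a periodically frozen target network, and $\gamma$ is the discount factor.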

Atom-by-atom protein generation and beyond with language models

  • paper_url: http://arxiv.org/abs/2308.09482
  • repo_url: None
  • paper_authors: Daniel Flam-Shepherd, Kevin Zhu, Alán Aspuru-Guzik
  • for: to investigate whether chemical language models can learn atom-level representations of proteins and generate molecules beyond standard proteins
  • methods: chemical language models, which learn atom-level representations of small molecules covering every atom, bond, and ring
  • results: chemical language models can learn atom-level representations of proteins and generate proteins far beyond the standard genetic code, including modified sidechains that form unnatural amino acids; they can also explore protein space and chemical space simultaneously and generate novel protein-drug conjugates
    Abstract Protein language models learn powerful representations directly from sequences of amino acids. However, they are constrained to generate proteins with only the set of amino acids represented in their vocabulary. In contrast, chemical language models learn atom-level representations of smaller molecules that include every atom, bond, and ring. In this work, we show that chemical language models can learn atom-level representations of proteins enabling protein generation unconstrained to the standard genetic code and far beyond it. In doing so, we show that language models can generate entire proteins atom by atom -- effectively learning the multiple hierarchical layers of molecular information that define proteins from their primary sequence to their secondary, and tertiary structure. We demonstrate language models are able to explore beyond protein space -- generating proteins with modified sidechains that form unnatural amino acids. Even further, we find that language models can explore chemical space and protein space simultaneously and generate novel examples of protein-drug conjugates. The results demonstrate the potential for biomolecular design at the atom level using language models.

Proprioceptive Learning with Soft Polyhedral Networks

  • paper_url: http://arxiv.org/abs/2308.08538
  • repo_url: None
  • paper_authors: Xiaobo Liu, Xudong Han, Wei Hong, Fang Wan, Chaoyang Song
  • for: to develop a soft polyhedral network capable of adaptive kinesthesia, viscoelastic proprioception, and accurate force sensing for robotic applications
  • methods: a soft polyhedral network with an embedded miniature high-speed motion tracking system, which learns kinetic features for adaptive kinesthesia and proprioception
  • results: the soft network infers real-time 6D forces and torques with accuracies of 0.25/0.24/0.35 N and 0.025/0.034/0.006 Nm in dynamic interactions, with a creep and relaxation modifier added during static adaptation to refine the predictions; combining simplicity of design, omni-adaptation, and accurate proprioceptive sensing, it offers a versatile, low-cost solution for robotics with more than 1 million use cycles, suited to tasks such as sensitive and competitive grasping and touch-based geometry reconstruction
    Abstract Proprioception is the "sixth sense" that detects limb postures with motor neurons. It requires a natural integration between the musculoskeletal systems and sensory receptors, which is challenging among modern robots that aim for lightweight, adaptive, and sensitive designs at a low cost. Here, we present the Soft Polyhedral Network with an embedded vision for physical interactions, capable of adaptive kinesthesia and viscoelastic proprioception by learning kinetic features. This design enables passive adaptations to omni-directional interactions, visually captured by a miniature high-speed motion tracking system embedded inside for proprioceptive learning. The results show that the soft network can infer real-time 6D forces and torques with accuracies of 0.25/0.24/0.35 N and 0.025/0.034/0.006 Nm in dynamic interactions. We also incorporate viscoelasticity in proprioception during static adaptation by adding a creep and relaxation modifier to refine the predicted results. The proposed soft network combines simplicity in design, omni-adaptation, and proprioceptive sensing with high accuracy, making it a versatile solution for robotics at a low cost with more than 1 million use cycles for tasks such as sensitive and competitive grasping, and touch-based geometry reconstruction. This study offers new insights into vision-based proprioception for soft robots in adaptive grasping, soft manipulation, and human-robot interaction.

Can Transformers Learn Optimal Filtering for Unknown Systems?

  • paper_url: http://arxiv.org/abs/2308.08536
  • repo_url: None
  • paper_authors: Haldun Balim, Zhe Du, Samet Oymak, Necmiye Ozay
  • for: to apply transformers to the optimal output estimation problem in dynamical systems and show that they can quickly adapt to unseen systems and approach optimal performance
  • methods: transformers generate output predictions from all past outputs and are trained on systems drawn from a prior distribution
  • results: experiments show the trained transformer adapts quickly to previously unseen systems and predicts well, including in challenging settings with non-i.i.d. noise, time-varying dynamics, and nonlinear dynamics such as a quadrotor system with unknown parameters
    Abstract Transformers have demonstrated remarkable success in natural language processing; however, their potential remains mostly unexplored for problems arising in dynamical systems. In this work, we investigate the optimal output estimation problem using transformers, which generate output predictions using all the past ones. We train the transformer using various systems drawn from a prior distribution and then evaluate its performance on previously unseen systems from the same distribution. As a result, the obtained transformer acts like a prediction algorithm that learns in-context and quickly adapts to and predicts well for different systems - thus we call it meta-output-predictor (MOP). MOP matches the performance of the optimal output estimator, based on Kalman filter, for most linear dynamical systems even though it does not have access to a model. We observe via extensive numerical experiments that MOP also performs well in challenging scenarios with non-i.i.d. noise, time-varying dynamics, and nonlinear dynamics like a quadrotor system with unknown parameters. To further support this observation, in the second part of the paper, we provide statistical guarantees on the performance of MOP and quantify the required amount of training to achieve a desired excess risk during test-time. Finally, we point out some limitations of MOP by identifying two classes of problems MOP fails to perform well, highlighting the need for caution when using transformers for control and estimation.
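As background for the comparison with the Kalman filter, the standard linear filtering setup reads (textbook notation; the paper's exact formulation may differ):

```latex
x_{t+1} = A x_t + w_t, \qquad y_t = C x_t + v_t,
```

where $w_t, v_t$ are process and measurement noise. The optimal output estimator predicts $\hat{y}_{t+1} = \mathbb{E}[y_{t+1} \mid y_1, \dots, y_t]$, which the Kalman filter computes when $(A, C)$ and the noise statistics are known; MOP instead learns this map in-context from past outputs alone.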

Painter: Teaching Auto-regressive Language Models to Draw Sketches

  • paper_url: http://arxiv.org/abs/2308.08520
  • repo_url: None
  • paper_authors: Reza Pourreza, Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Pulkit Madan, Roland Memisevic
  • for: applying large language models (LLMs) to image generation by directly generating the virtual brush strokes that paint an image
  • methods: an off-the-shelf LLM pre-trained on a large text corpus is fine-tuned on the new task while preserving its language understanding capabilities
  • results: the resulting model, Painter, generates sketches from text descriptions in an auto-regressive way, removes objects from the canvas, and detects and classifies objects in sketches, with very encouraging results
    Abstract Large language models (LLMs) have made tremendous progress in natural language understanding and they have also been successfully adopted in other domains such as computer vision, robotics, reinforcement learning, etc. In this work, we apply LLMs to image generation tasks by directly generating the virtual brush strokes to paint an image. We present Painter, an LLM that can convert user prompts in text description format to sketches by generating the corresponding brush strokes in an auto-regressive way. We construct Painter based on off-the-shelf LLM that is pre-trained on a large text corpus, by fine-tuning it on the new task while preserving language understanding capabilities. We create a dataset of diverse multi-object sketches paired with textual prompts that covers several object types and tasks. Painter can generate sketches from text descriptions, remove objects from canvas, and detect and classify objects in sketches. Although this is an unprecedented pioneering work in using LLMs for auto-regressive image generation, the results are very encouraging.

Two-and-a-half Order Score-based Model for Solving 3D Ill-posed Inverse Problems

  • paper_url: http://arxiv.org/abs/2308.08511
  • repo_url: None
  • paper_authors: Zirong Li, Yanyang Wang, Jianjia Zhang, Weiwen Wu, Hengyong Yu
  • for: improving the accuracy of three-dimensional (3D) volumetric reconstruction in CT and MRI
  • methods: a two-and-a-half order score-based model (TOSM) that learns 2D data distributions during training and, during reconstruction, updates the 3D data distribution using complementary scores along three directions (sagittal, coronal, and transaxial)
  • results: extensive experiments on large-scale sparse-view CT and fast MRI datasets achieve state-of-the-art results on 3D ill-posed inverse problems, resolving the inter-slice inconsistency issue and yielding high-quality 3D volumetric reconstructions
    Abstract Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) are crucial technologies in the field of medical imaging. Score-based models have proven to be effective in addressing different inverse problems encountered in CT and MRI, such as sparse-view CT and fast MRI reconstruction. However, these models face challenges in achieving accurate three dimensional (3D) volumetric reconstruction. The existing score-based models primarily focus on reconstructing two dimensional (2D) data distribution, leading to inconsistencies between adjacent slices in the reconstructed 3D volumetric images. To overcome this limitation, we propose a novel two-and-a-half order score-based model (TOSM). During the training phase, our TOSM learns data distributions in 2D space, which reduces the complexity of training compared to directly working on 3D volumes. However, in the reconstruction phase, the TOSM updates the data distribution in 3D space, utilizing complementary scores along three directions (sagittal, coronal, and transaxial) to achieve a more precise reconstruction. The development of TOSM is built on robust theoretical principles, ensuring its reliability and efficacy. Through extensive experimentation on large-scale sparse-view CT and fast MRI datasets, our method demonstrates remarkable advancements and attains state-of-the-art results in solving 3D ill-posed inverse problems. Notably, the proposed TOSM effectively addresses the inter-slice inconsistency issue, resulting in high-quality 3D volumetric reconstruction.
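One plausible reading of "complementary scores along three directions" is that per-plane 2D scores are combined into a single 3D update; the averaged form below is our illustrative guess, not an equation stated in the abstract:

```latex
s_{\theta}(\mathbf{x}) \approx \tfrac{1}{3}\Big( s_{\theta}^{\mathrm{sag}}(\mathbf{x}) + s_{\theta}^{\mathrm{cor}}(\mathbf{x}) + s_{\theta}^{\mathrm{tra}}(\mathbf{x}) \Big),
```

where each term applies the 2D score network slice-wise along one anatomical axis of the volume $\mathbf{x}$.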

Autoencoding a Soft Touch to Learn Grasping from On-land to Underwater

  • paper_url: http://arxiv.org/abs/2308.08510
  • repo_url: https://github.com/bionicdl-sustech/amphibioussoftfinger
  • paper_authors: Ning Guo, Xudong Han, Xiaobo Liu, Shuqiao Zhong, Zhiyuan Zhou, Jian Lin, Jiansheng Dai, Fang Wan, Chaoyang Song
  • for: investigate the transferability of grasping knowledge from on-land to underwater
  • methods: vision-based soft robotic finger, Supervised Variational Autoencoder (SVAE)
  • results: superior adaptation to changing environments, soft, delicate, and reactive grasping, improved reliability and robustness at a much-reduced cost
    Abstract Robots play a critical role as the physical agent of human operators in exploring the ocean. However, it remains challenging to grasp objects reliably while fully submerging under a highly pressurized aquatic environment with little visible light, mainly due to the fluidic interference on the tactile mechanics between the finger and object surfaces. This study investigates the transferability of grasping knowledge from on-land to underwater via a vision-based soft robotic finger that learns 6D forces and torques (FT) using a Supervised Variational Autoencoder (SVAE). A high-framerate camera captures the whole-body deformations while a soft robotic finger interacts with physical objects on-land and underwater. Results show that the trained SVAE model learned a series of latent representations of the soft mechanics transferrable from land to water, presenting a superior adaptation to the changing environments against commercial FT sensors. Soft, delicate, and reactive grasping enabled by tactile intelligence enhances the gripper's underwater interaction with improved reliability and robustness at a much-reduced cost, paving the path for learning-based intelligent grasping to support fundamental scientific discoveries in environmental and ocean research.
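A Supervised Variational Autoencoder of the kind named here is commonly trained on an ELBO plus a supervised regression term; the composite loss below is a generic formulation under that assumption (the weights $\beta, \gamma$ and the force-torque regression head $g_{\psi}$ are our notation, not the paper's):

```latex
\mathcal{L} = -\,\mathbb{E}_{q_{\phi}(z \mid x)}\big[\log p_{\theta}(x \mid z)\big]
  + \beta\, D_{\mathrm{KL}}\big(q_{\phi}(z \mid x) \,\|\, p(z)\big)
  + \gamma\, \big\| g_{\psi}(z) - y_{\mathrm{FT}} \big\|_2^2,
```

where $x$ is the camera-observed whole-body deformation and $y_{\mathrm{FT}}$ the measured 6D force-torque label.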

ResBuilder: Automated Learning of Depth with Residual Structures

  • paper_url: http://arxiv.org/abs/2308.08504
  • repo_url: None
  • paper_authors: Julian Burghoff, Matthias Rottmann, Jill von Conta, Sebastian Schoenen, Andreas Witte, Hanno Gottschalk
  • for: to develop a ResNet-based neural architecture search algorithm that achieves high accuracy at moderate computational cost
  • methods: the Resbuilder algorithm builds ResNet architectures from scratch and can also modify existing architectures by removing and inserting ResNet blocks, thereby searching the space of ResNet architectures
  • results: on several image classification datasets, Resbuilder achieves close to state-of-the-art accuracy while saving computational cost compared to off-the-shelf ResNets; parameters tuned once on CIFAR10 yield a suitable default for all other datasets, a property that generalizes even to a proprietary industrial fraud detection dataset
    Abstract In this work, we develop a neural architecture search algorithm, termed Resbuilder, that develops ResNet architectures from scratch that achieve high accuracy at moderate computational cost. It can also be used to modify existing architectures and has the capability to remove and insert ResNet blocks, in this way searching for suitable architectures in the space of ResNet architectures. In our experiments on different image classification datasets, Resbuilder achieves close to state-of-the-art performance while saving computational cost compared to off-the-shelf ResNets. Noteworthy, we once tune the parameters on CIFAR10 which yields a suitable default choice for all other datasets. We demonstrate that this property generalizes even to industrial applications by applying our method with default parameters on a proprietary fraud detection dataset.

Time Travel in LLMs: Tracing Data Contamination in Large Language Models

  • paper_url: http://arxiv.org/abs/2308.08493
  • repo_url: None
  • paper_authors: Shahriar Golchin, Mihai Surdeanu
  • for: to identify data contamination within large language models (LLMs), i.e., the presence of test data from downstream tasks in their training data
  • methods: "guided instruction" prompts flag potential contamination in individual instances; an entire dataset partition is then assessed as contaminated based on the average overlap score with reference instances or a GPT-4-based classifier with in-context learning prompting (a sketch of the instance-level test follows the abstract below)
  • results: the approach detects contamination with an accuracy between 92% and 100% across seven datasets, and finds that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets
    Abstract Data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs), is a potential major issue in understanding LLMs' effectiveness on other tasks. We propose a straightforward yet effective method for identifying data contamination within LLMs. At its core, our approach starts by identifying potential contamination in individual instances that are drawn from a small random sample; using this information, our approach then assesses if an entire dataset partition is contaminated. To estimate contamination of individual instances, we employ "guided instruction:" a prompt consisting of the dataset name, partition type, and the initial segment of a reference instance, asking the LLM to complete it. An instance is flagged as contaminated if the LLM's output either exactly or closely matches the latter segment of the reference. To understand if an entire partition is contaminated, we propose two ideas. The first idea marks a dataset partition as contaminated if the average overlap score with the reference instances (as measured by ROUGE or BLEURT) is statistically significantly better with the guided instruction vs. a general instruction that does not include the dataset and partition name. The second idea marks a dataset as contaminated if a classifier based on GPT-4 with in-context learning prompting marks multiple instances as contaminated. Our best method achieves an accuracy between 92% and 100% in detecting if an LLM is contaminated with seven datasets, containing train and test/validation partitions, when contrasted with manual evaluation by human expert. Further, our findings indicate that GPT-4 is contaminated with AG News, WNLI, and XSum datasets.
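A minimal sketch of the guided-instruction test, using the `rouge-score` package for the overlap measurement; the prompt wording and the threshold are illustrative, since the abstract does not give the paper's exact templates:

```python
from rouge_score import rouge_scorer

def guided_prompt(dataset: str, partition: str, head: str) -> str:
    # guided instruction: names the dataset and partition explicitly
    return (f"You are given the first piece of an instance from the {partition} "
            f"split of the {dataset} dataset. Complete it exactly:\n{head}")

def overlap(completion: str, reference_tail: str) -> float:
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference_tail, completion)["rougeL"].fmeasure

def is_contaminated(llm, dataset, partition, head, tail, threshold=0.75):
    """Flag one instance: the LLM completes the guided prompt, and a close match
    to the held-out tail of the reference instance suggests memorization."""
    completion = llm(guided_prompt(dataset, partition, head))  # llm: hypothetical callable
    return overlap(completion, tail) >= threshold
```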

Label Propagation Techniques for Artifact Detection in Imbalanced Classes using Photoplethysmogram Signals

  • paper_url: http://arxiv.org/abs/2308.08480
  • repo_url: None
  • paper_authors: Clara Macabiau, Thanh-Dung Le, Kevin Albert, Philippe Jouvet, Rita Noumeir
  • for: to improve artifact detection in photoplethysmogram (PPG) signals and thereby enhance the reliability of PPG-based health monitoring
  • methods: label propagation techniques spread labels among PPG samples, particularly in imbalanced class scenarios where clean PPG samples are significantly outnumbered by artifact-contaminated ones
  • results: label propagation labels a medical dataset effectively even when clean samples are scarce (precision 91%, recall 90%, F1 90% for the artifact-free class); for artifact classification, supervised classifiers (conventional classifiers and neural networks such as MLP, Transformers, FCN) are compared with the semi-supervised label propagation algorithm, and while a KNN supervised model performs well (precision 89%, recall 95%, F1 92%), the semi-supervised algorithm detects artifacts better
    Abstract Photoplethysmogram (PPG) signals are widely used in healthcare for monitoring vital signs, but they are susceptible to motion artifacts that can lead to inaccurate interpretations. In this study, the use of label propagation techniques to propagate labels among PPG samples is explored, particularly in imbalanced class scenarios where clean PPG samples are significantly outnumbered by artifact-contaminated samples. With a precision of 91%, a recall of 90% and an F1 score of 90% for the class without artifacts, the results demonstrate its effectiveness in labeling a medical dataset, even when clean samples are rare. For the classification of artifacts our study compares supervised classifiers such as conventional classifiers and neural networks (MLP, Transformers, FCN) with the semi-supervised label propagation algorithm. With a precision of 89%, a recall of 95% and an F1 score of 92%, the KNN supervised model gives good results, but the semi-supervised algorithm performs better in detecting artifacts. The findings suggest that the semi-supervised algorithm label propagation hold promise for artifact detection in PPG signals, which can enhance the reliability of PPG-based health monitoring systems in real-world applications.
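A minimal sketch of semi-supervised label propagation with scikit-learn, assuming PPG segments have already been turned into fixed-length feature vectors (the feature extraction here is a random placeholder):

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# X: (n_samples, n_features) PPG segment features; y: 1 = clean, 0 = artifact,
# with -1 marking the unlabeled majority, per scikit-learn's convention.
X = np.random.randn(1000, 32)            # placeholder features
y = np.full(1000, -1)
y[:30] = np.random.randint(0, 2, 30)     # only a few expert-labeled segments

model = LabelPropagation(kernel="knn", n_neighbors=7)
model.fit(X, y)
propagated = model.transduction_         # labels propagated to all samples
```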

LLM4TS: Two-Stage Fine-Tuning for Time-Series Forecasting with Pre-Trained LLMs

  • paper_url: http://arxiv.org/abs/2308.08469
  • repo_url: None
  • paper_authors: Ching Chang, Wen-Chih Peng, Tien-Fu Chen
  • for: LLM4TS enhances time-series forecasting by leveraging pre-trained Large Language Models (LLMs)
  • methods: the approach combines time-series patching with temporal encoding and uses a two-stage fine-tuning process (supervised fine-tuning to orient the LLM toward time-series data, followed by task-specific downstream fine-tuning), with Parameter-Efficient Fine-Tuning (PEFT) techniques adapting the pre-trained LLMs without extensive parameter adjustments (a sketch of the patching step follows the abstract below)
  • results: LLM4TS achieves state-of-the-art results in long-term forecasting and proves to be both a robust representation learner and an effective few-shot learner
    Abstract In this work, we leverage pre-trained Large Language Models (LLMs) to enhance time-series forecasting. Mirroring the growing interest in unifying models for Natural Language Processing and Computer Vision, we envision creating an analogous model for long-term time-series forecasting. Due to limited large-scale time-series data for building robust foundation models, our approach LLM4TS focuses on leveraging the strengths of pre-trained LLMs. By combining time-series patching with temporal encoding, we have enhanced the capability of LLMs to handle time-series data effectively. Inspired by the supervised fine-tuning in chatbot domains, we prioritize a two-stage fine-tuning process: first conducting supervised fine-tuning to orient the LLM towards time-series data, followed by task-specific downstream fine-tuning. Furthermore, to unlock the flexibility of pre-trained LLMs without extensive parameter adjustments, we adopt several Parameter-Efficient Fine-Tuning (PEFT) techniques. Drawing on these innovations, LLM4TS has yielded state-of-the-art results in long-term forecasting. Our model has also shown exceptional capabilities as both a robust representation learner and an effective few-shot learner, thanks to the knowledge transferred from the pre-trained LLM.
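A minimal sketch of time-series patching, the preprocessing step named in the abstract; the patch length and stride are illustrative, and the temporal-encoding details are not specified in the abstract:

```python
import torch

def patchify(series: torch.Tensor, patch_len: int = 16, stride: int = 8) -> torch.Tensor:
    """Split a (batch, time) series into overlapping patches: (batch, n_patches, patch_len).

    Each patch becomes one "token" that a pre-trained LLM backbone can attend over,
    analogous to sub-word tokens in text.
    """
    return series.unfold(dimension=-1, size=patch_len, step=stride)

x = torch.randn(4, 96)   # 4 series, 96 time steps
tokens = patchify(x)     # -> shape (4, 11, 16)
```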

An Expert’s Guide to Training Physics-informed Neural Networks

  • paper_url: http://arxiv.org/abs/2308.08468
  • repo_url: https://github.com/predictiveintelligencelab/jaxpi
  • paper_authors: Sifan Wang, Shyam Sankaran, Hanwen Wang, Paris Perdikaris
  • for: to present best practices and a set of challenging benchmark problems for training physics-informed neural networks (PINNs), improving their training efficiency and overall accuracy
  • methods: comprehensive, fully reproducible ablation studies over architecture choices and training strategies, released as a highly optimized JAX library
  • results: the proposed methods and guiding principles achieve state-of-the-art accuracy and provide strong baselines for future studies
    Abstract Physics-informed neural networks (PINNs) have been popularized as a deep learning framework that can seamlessly synthesize observational data and partial differential equation (PDE) constraints. Their practical effectiveness however can be hampered by training pathologies, but also oftentimes by poor choices made by users who lack deep learning expertise. In this paper we present a series of best practices that can significantly improve the training efficiency and overall accuracy of PINNs. We also put forth a series of challenging benchmark problems that highlight some of the most prominent difficulties in training PINNs, and present comprehensive and fully reproducible ablation studies that demonstrate how different architecture choices and training strategies affect the test accuracy of the resulting models. We show that the methods and guiding principles put forth in this study lead to state-of-the-art results and provide strong baselines that future studies should use for comparison purposes. To this end, we also release a highly optimized library in JAX that can be used to reproduce all results reported in this paper, enable future research studies, as well as facilitate easy adaptation to new use-case scenarios.
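For context, PINNs are trained on a composite loss that balances data-fit terms against the PDE residual; the standard form below is background to the training pathologies the paper studies (the weights $\lambda$ are exactly the kind of setting its best practices address):

```latex
\mathcal{L}(\theta) = \lambda_{\mathrm{data}}\, \frac{1}{N_d} \sum_{i=1}^{N_d} \big| u_{\theta}(x_i) - u_i \big|^2
  \;+\; \lambda_{\mathrm{pde}}\, \frac{1}{N_r} \sum_{j=1}^{N_r} \big| \mathcal{R}[u_{\theta}](x_j) \big|^2,
```

where $u_\theta$ is the network, $u_i$ are observed initial/boundary values, and $\mathcal{R}[u_\theta]$ is the PDE residual evaluated at collocation points $x_j$.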

On Neural Quantum Support Vector Machines

  • paper_url: http://arxiv.org/abs/2308.08467
  • repo_url: https://github.com/GhadaAbdulsalam/Explainable_Heart_Disease_Prediction_Using_Ensemble-Quantum_ML
  • paper_authors: Lars Simon, Manuel Radons
  • for: building on \cite{simon2023algorithms}, which introduced four algorithms for training neural support vector machines (NSVMs) and demonstrated their feasibility, this note introduces neural quantum support vector machines, i.e., NSVMs with a quantum kernel
  • methods: NSVMs equipped with a quantum kernel and the associated training algorithms
  • results: the training results of \cite{simon2023algorithms} extend to the quantum-kernel setting
    Abstract In \cite{simon2023algorithms} we introduced four algorithms for the training of neural support vector machines (NSVMs) and demonstrated their feasibility. In this note we introduce neural quantum support vector machines, that is, NSVMs with a quantum kernel, and extend our results to this setting.

Hierarchical Uncertainty Estimation for Medical Image Segmentation Networks

  • paper_url: http://arxiv.org/abs/2308.08465
  • repo_url: None
  • paper_authors: Xinyu Bai, Wenjia Bai
  • for: This paper aims to build a trustworthy medical image segmentation model by estimating the uncertainty of the model prediction.
  • methods: The proposed method leverages the hierarchical encoder architecture of state-of-the-art image segmentation networks and uses a skip-connection module to model multi-level uncertainties.
  • results: The method achieves high segmentation performance while providing meaningful uncertainty maps that can be used for out-of-distribution detection.
    Abstract Learning a medical image segmentation model is an inherently ambiguous task, as uncertainties exist in both images (noise) and manual annotations (human errors and bias) used for model training. To build a trustworthy image segmentation model, it is important to not just evaluate its performance but also estimate the uncertainty of the model prediction. Most state-of-the-art image segmentation networks adopt a hierarchical encoder architecture, extracting image features at multiple resolution levels from fine to coarse. In this work, we leverage this hierarchical image representation and propose a simple yet effective method for estimating uncertainties at multiple levels. The multi-level uncertainties are modelled via the skip-connection module and then sampled to generate an uncertainty map for the predicted image segmentation. We demonstrate that a deep learning segmentation network such as U-net, when implemented with such hierarchical uncertainty estimation module, can achieve a high segmentation performance, while at the same time provide meaningful uncertainty maps that can be used for out-of-distribution detection.
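A minimal sketch of turning sampled predictions into a per-pixel uncertainty map via predictive entropy; this is a generic recipe under the assumption that the network has stochastic components to sample from, not the paper's specific skip-connection uncertainty module:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def entropy_uncertainty_map(model, image: torch.Tensor, n_samples: int = 8) -> torch.Tensor:
    """Average softmax maps over stochastic forward passes, then take the entropy.

    Assumes `model(image)` returns logits of shape (B, C, H, W) and contains
    stochastic components (e.g. sampled multi-level latents or dropout).
    """
    model.train()  # keep stochastic layers active during sampling
    probs = torch.stack([F.softmax(model(image), dim=1) for _ in range(n_samples)])
    mean_probs = probs.mean(dim=0)                                      # (B, C, H, W)
    entropy = -(mean_probs * torch.log(mean_probs + 1e-8)).sum(dim=1)   # (B, H, W)
    return entropy  # higher entropy = higher predictive uncertainty
```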