eess.AS - 2023-07-19

An analysis on the effects of speaker embedding choice in non auto-regressive TTS

  • paper_url: http://arxiv.org/abs/2307.09898
  • repo_url: None
  • paper_authors: Adriana Stan, Johannah O’Mahony
  • for: This paper studies how a non-autoregressive multi-speaker speech synthesis architecture exploits the information present in different speaker embedding sets, with the goal of improving the quality of target speaker identities.
  • methods: A factorised non-autoregressive multi-speaker speech synthesis architecture is analysed under different embedding sets and learning strategies to determine whether any of them yield quality improvements.
  • results: Regardless of the embedding set and learning strategy used, the network handles various speaker identities equally well, with barely noticeable variations in speech output quality, and speaker leakage within the core structure of the synthesis system is inevitable under the standard training procedures adopted thus far.
    Abstract In this paper we introduce a first attempt on understanding how a non-autoregressive factorised multi-speaker speech synthesis architecture exploits the information present in different speaker embedding sets. We analyse if jointly learning the representations, and initialising them from pretrained models determine any quality improvements for target speaker identities. In a separate analysis, we investigate how the different sets of embeddings impact the network's core speech abstraction (i.e. zero conditioned) in terms of speaker identity and representation learning. We show that, regardless of the used set of embeddings and learning strategy, the network can handle various speaker identities equally well, with barely noticeable variations in speech output quality, and that speaker leakage within the core structure of the synthesis system is inevitable in the standard training procedures adopted thus far.

Self-Supervised Acoustic Word Embedding Learning via Correspondence Transformer Encoder

  • paper_url: http://arxiv.org/abs/2307.09871
  • repo_url: None
  • paper_authors: Jingru Lin, Xianghu Yue, Junyi Ao, Haizhou Li
  • for: The paper aims to learn robust acoustic word embeddings (AWEs) from a large-scale unlabelled speech corpus.
  • methods: The proposed Correspondence Transformer Encoder (CTE) uses a teacher-student learning framework and pre-trains the model with a word-level loss, so that different acoustic realisations of the same word lie close together in the underlying embedding space.
  • results: The embeddings extracted from the CTE model are robust to speech variations, such as speakers and domains, and achieve new state-of-the-art performance in a low-resource cross-lingual setting.
    Abstract Acoustic word embeddings (AWEs) aims to map a variable-length speech segment into a fixed-dimensional representation. High-quality AWEs should be invariant to variations, such as duration, pitch and speaker. In this paper, we introduce a novel self-supervised method to learn robust AWEs from a large-scale unlabelled speech corpus. Our model, named Correspondence Transformer Encoder (CTE), employs a teacher-student learning framework. We train the model based on the idea that different realisations of the same word should be close in the underlying embedding space. Specifically, we feed the teacher and student encoder with different acoustic instances of the same word and pre-train the model with a word-level loss. Our experiments show that the embeddings extracted from the proposed CTE model are robust to speech variations, e.g. speakers and domains. Additionally, when evaluated on Xitsonga, a low-resource cross-lingual setting, the CTE model achieves new state-of-the-art performance.
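
The word-level correspondence loss at the heart of CTE can be pictured with a short sketch. Below is a minimal, hypothetical PyTorch rendering (module sizes, the mean-pooling readout, and the EMA teacher update are assumptions, not the paper's code): teacher and student encode two acoustic instances of the same word, and a word-level loss pulls the two embeddings together.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordEncoder(nn.Module):
    """Toy stand-in for the transformer encoder: frames -> one word embedding."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):          # x: (batch, frames, n_mels)
        h = self.encoder(self.proj(x))
        return h.mean(dim=1)       # mean-pool frames -> (batch, dim)

student = WordEncoder()
teacher = WordEncoder()
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():     # teacher is updated by EMA, not by gradients
    p.requires_grad_(False)

def word_level_loss(view_a, view_b):
    """Pull two realisations of the same word together in embedding space."""
    z_s = F.normalize(student(view_a), dim=-1)
    with torch.no_grad():
        z_t = F.normalize(teacher(view_b), dim=-1)
    return (2 - 2 * (z_s * z_t).sum(dim=-1)).mean()  # MSE on unit vectors

@torch.no_grad()
def ema_update(m=0.99):
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)
```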

cs.CV - 2023-07-19

Unsupervised Accuracy Estimation of Deep Visual Models using Domain-Adaptive Adversarial Perturbation without Source Samples

  • paper_url: http://arxiv.org/abs/2307.10062
  • repo_url: None
  • paper_authors: JoonHo Lee, Jae Oh Woo, Hankyu Moon, Kwonho Lee
  • for: This work proposes a way to estimate the accuracy of deployed deep visual models without access to source data or labels, addressing the performance drop caused by the discrepancy between source and target distributions.
  • methods: Pseudo-labels are used to estimate accuracy on the target domain, building on recent source-free domain adaptation algorithms; adaptive adversarial perturbation is applied to the target model's input to mitigate the impact of erroneous pseudo-labels that arise from a high ideal joint hypothesis risk.
  • results: The source-free framework effectively addresses challenging distribution shift scenarios and outperforms existing methods that require source data and labels for training.
    Abstract Deploying deep visual models can lead to performance drops due to the discrepancies between source and target distributions. Several approaches leverage labeled source data to estimate target domain accuracy, but accessing labeled source data is often prohibitively difficult due to data confidentiality or resource limitations on serving devices. Our work proposes a new framework to estimate model accuracy on unlabeled target data without access to source data. We investigate the feasibility of using pseudo-labels for accuracy estimation and evolve this idea into adopting recent advances in source-free domain adaptation algorithms. Our approach measures the disagreement rate between the source hypothesis and the target pseudo-labeling function, adapted from the source hypothesis. We mitigate the impact of erroneous pseudo-labels that may arise due to a high ideal joint hypothesis risk by employing adaptive adversarial perturbation on the input of the target model. Our proposed source-free framework effectively addresses the challenging distribution shift scenarios and outperforms existing methods requiring source data and labels for training.
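
The core quantity, the disagreement rate between the frozen source hypothesis and a pseudo-labeling function adapted from it, is simple to compute. A hedged sketch, assuming classification outputs and a DataLoader of unlabeled target batches; the paper's adaptive adversarial perturbation of the inputs is omitted here:

```python
import torch

@torch.no_grad()
def estimated_accuracy(source_model, adapted_model, target_loader, device="cpu"):
    """Estimate target accuracy as 1 - disagreement(source, adapted pseudo-labeler)."""
    source_model.eval()
    adapted_model.eval()
    disagree, total = 0, 0
    for x, *_ in target_loader:                  # unlabeled: any labels are ignored
        x = x.to(device)
        y_src = source_model(x).argmax(dim=1)    # source hypothesis prediction
        y_tgt = adapted_model(x).argmax(dim=1)   # pseudo-labels from SFDA-adapted copy
        disagree += (y_src != y_tgt).sum().item()
        total += x.size(0)
    return 1.0 - disagree / total
```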

Divert More Attention to Vision-Language Object Tracking

  • paper_url: http://arxiv.org/abs/2307.10046
  • repo_url: https://github.com/JudasDie/SOTS
  • paper_authors: Mingzhe Guo, Zhipeng Zhang, Liping Jing, Haibin Ling, Heng Fan
  • for: This paper proposes a vision-language tracking method that addresses the shortcomings of current vision-language learning models in the tracking domain.
  • methods: A new vision-language representation is introduced, comprising a general attribute annotation strategy and a modality mixer (ModaMixer); a contrastive loss is also introduced to align the different modalities.
  • results: Experiments on six tracking benchmarks show significant improvements, and a theoretical analysis is provided to justify the approach.
    Abstract Multimodal vision-language (VL) learning has noticeably pushed the tendency toward generic intelligence owing to emerging large foundation models. However, tracking, as a fundamental vision problem, surprisingly enjoys less bonus from recent flourishing VL learning. We argue that the reasons are two-fold: the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning of current works. These nuisances motivate us to design more effective vision-language representation for tracking, meanwhile constructing a large database with language annotation for model learning. Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos. We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, where the cores are the proposed asymmetric architecture search and modality mixer (ModaMixer). To further improve VL representation, we introduce a contrastive loss to align different modalities. To thoroughly evidence the effectiveness of our method, we integrate the proposed framework on three tracking methods with different designs, i.e., the CNN-based SiamCAR, the Transformer-based OSTrack, and the hybrid structure TransT. The experiments demonstrate that our framework can significantly improve all baselines on six benchmarks. Besides empirical results, we theoretically analyze our approach to show its rationality. By revealing the potential of VL representation, we expect the community to divert more attention to VL tracking and hope to open more possibilities for future tracking with diversified multimodal messages.
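
The contrastive loss used to align the two modalities is, in spirit, a symmetric InfoNCE over paired visual and language features. A minimal sketch under that assumption (the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def modality_contrastive_loss(vis, lang, temperature=0.07):
    """Symmetric InfoNCE: the i-th visual feature should match the i-th language
    feature. vis, lang: (batch, dim) pooled features from the two branches."""
    vis = F.normalize(vis, dim=-1)
    lang = F.normalize(lang, dim=-1)
    logits = vis @ lang.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(vis.size(0), device=vis.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```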

Class Attention to Regions of Lesion for Imbalanced Medical Image Recognition

  • paper_url: http://arxiv.org/abs/2307.10036
  • repo_url: None
  • paper_authors: Jia-Xin Zhuang, Jiabin Cai, Jianguo Zhang, Wei-shi Zheng, Ruixuan Wang
  • for: This paper aims to handle data imbalance in automated medical image classification with a simple yet effective framework, named CARE, which embeds attention into the training process of CNNs so that the network attends to lesion regions of rare diseases.
  • methods: The attention module guides CNNs to the lesion regions of rare diseases during the training phase only and does not change the architecture of the original network, so it can be combined with any existing CNN architecture; variants of CARE use traditional saliency methods or a pretrained segmentation model to generate the required bounding boxes automatically.
  • results: The CARE variants with automated bounding box generation are comparable to the original CARE framework with manual bounding box annotations; on an imbalanced skin image dataset and a pneumonia dataset, the method helps the network focus on the lesion regions of rare diseases and remarkably improves their classification performance.
    Abstract Automated medical image classification is the key component in intelligent diagnosis systems. However, most medical image datasets contain plenty of samples of common diseases and just a handful of rare ones, leading to major class imbalances. Currently, it is an open problem in intelligent diagnosis to effectively learn from imbalanced training data. In this paper, we propose a simple yet effective framework, named \textbf{C}lass \textbf{A}ttention to \textbf{RE}gions of the lesion (CARE), to handle data imbalance issues by embedding attention into the training process of \textbf{C}onvolutional \textbf{N}eural \textbf{N}etworks (CNNs). The proposed attention module helps CNNs attend to lesion regions of rare diseases, therefore helping CNNs to learn their characteristics more effectively. In addition, this attention module works only during the training phase and does not change the architecture of the original network, so it can be directly combined with any existing CNN architecture. The CARE framework needs bounding boxes to represent the lesion regions of rare diseases. To alleviate the need for manual annotation, we further developed variants of CARE by leveraging the traditional saliency methods or a pretrained segmentation model for bounding box generation. Results show that the CARE variants with automated bounding box generation are comparable to the original CARE framework with \textit{manual} bounding box annotations. A series of experiments on an imbalanced skin image dataset and a pneumonia dataset indicates that our method can effectively help the network focus on the lesion regions of rare diseases and remarkably improves the classification performance of rare diseases.
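
One way to picture how an attention module can be embedded into the training process without changing the network is a training-only loss that penalises attention mass falling outside the lesion bounding box of rare-class samples. The sketch below is a hypothetical rendering of that idea, not the paper's implementation:

```python
import torch

def care_attention_loss(attn, boxes, is_rare):
    """Penalise attention mass outside the lesion box of rare-class samples.
    attn:  (batch, H, W) non-negative attention maps from the backbone
    boxes: (batch, 4) lesion boxes as (x1, y1, x2, y2) in pixel coordinates
    is_rare: (batch,) bool - the constraint applies to rare diseases only."""
    b, h, w = attn.shape
    inside = torch.zeros_like(attn)
    for i in range(b):
        x1, y1, x2, y2 = boxes[i].long().tolist()
        inside[i, y1:y2, x1:x2] = 1.0
    attn = attn / (attn.sum(dim=(1, 2), keepdim=True) + 1e-8)  # normalise to sum 1
    outside_mass = (attn * (1 - inside)).sum(dim=(1, 2))
    return (outside_mass * is_rare.float()).mean()
```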

Towards Fair Face Verification: An In-depth Analysis of Demographic Biases

  • paper_url: http://arxiv.org/abs/2307.10011
  • repo_url: None
  • paper_authors: Ioannis Sarridis, Christos Koutlis, Symeon Papadopoulos, Christos Diou
  • for: This paper examines gender, age, and race biases in face recognition and verification systems, with particular emphasis on the intersectional effects across combinations of these protected groups.
  • methods: Deep learning-based face recognition and verification models, including existing cloud-based solutions, are analysed; five additional metrics beyond accuracy, including disparate impact and mistreatment metrics, are incorporated into the quantitative analysis to measure fairness.
  • results: Pervasive biases across race, age, and gender are found: Africans show an 11.25% lower True Positive Rate (TPR) than Caucasians while only a 3.51% accuracy drop is observed, and intersections of multiple protected groups, such as African females over 60 years old, show a +39.89% disparate mistreatment rate compared to the highest Caucasian rate. The findings are intended to stimulate the development of fairer, more equitable face recognition and verification systems.
    Abstract Deep learning-based person identification and verification systems have remarkably improved in terms of accuracy in recent years; however, such systems, including widely popular cloud-based solutions, have been found to exhibit significant biases related to race, age, and gender, a problem that requires in-depth exploration and solutions. This paper presents an in-depth analysis, with a particular emphasis on the intersectionality of these demographic factors. Intersectional bias refers to the performance discrepancies w.r.t. the different combinations of race, age, and gender groups, an area relatively unexplored in current literature. Furthermore, the reliance of most state-of-the-art approaches on accuracy as the principal evaluation metric often masks significant demographic disparities in performance. To counter this crucial limitation, we incorporate five additional metrics in our quantitative analysis, including disparate impact and mistreatment metrics, which are typically ignored by the relevant fairness-aware approaches. Results on the Racial Faces in-the-Wild (RFW) benchmark indicate pervasive biases in face recognition systems, extending beyond race, with different demographic factors yielding significantly disparate outcomes. In particular, Africans demonstrate an 11.25% lower True Positive Rate (TPR) compared to Caucasians, while only a 3.51% accuracy drop is observed. Even more concerning, the intersections of multiple protected groups, such as African females over 60 years old, demonstrate a +39.89% disparate mistreatment rate compared to the highest Caucasians rate. By shedding light on these biases and their implications, this paper aims to stimulate further research towards developing fairer, more equitable face recognition and verification systems.
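
The headline numbers, per-group TPR and the gap that overall accuracy hides, follow from straightforward per-group bookkeeping. A small NumPy sketch, assuming binary verification labels and one demographic group label per pair:

```python
import numpy as np

def group_metrics(y_true, y_pred, groups):
    """Per-group TPR/accuracy and the TPR gap across groups.
    y_true, y_pred: binary arrays (1 = same identity); groups: label per pair."""
    out = {}
    for g in np.unique(groups):
        m = groups == g
        pos = (y_true == 1) & m
        out[g] = {
            "TPR": (y_pred[pos] == 1).mean() if pos.any() else float("nan"),
            "accuracy": (y_pred[m] == y_true[m]).mean(),
        }
    tprs = [v["TPR"] for v in out.values()]
    out["TPR_gap"] = max(tprs) - min(tprs)   # the disparity hidden by accuracy
    return out
```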

MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions

  • paper_url: http://arxiv.org/abs/2307.10008
  • repo_url: None
  • paper_authors: Yunfei Liu, Lijian Lin, Fei Yu, Changyin Zhou, Yu Li
  • for: This paper targets audio-driven portrait animation, generating high-fidelity, multimodal talking-portrait videos.
  • methods: The method has three stages: 1) a Mapping-Once network with Dual Attentions (MODA) maps audio to a talking representation; 2) a facial composer network generates dense and detailed facial landmarks; 3) a temporally guided renderer synthesises stable videos.
  • results: Compared to previous methods, the generated portrait videos are more natural and realistic.
    Abstract Audio-driven portrait animation aims to synthesize portrait videos that are conditioned by given audio. Animating high-fidelity and multimodal video portraits has a variety of applications. Previous methods have attempted to capture different motion modes and generate high-fidelity portrait videos by training different models or sampling signals from given videos. However, lacking correlation learning between lip-sync and other movements (e.g., head pose/eye blinking) usually leads to unnatural results. In this paper, we propose a unified system for multi-person, diverse, and high-fidelity talking portrait generation. Our method contains three stages, i.e., 1) Mapping-Once network with Dual Attentions (MODA) generates talking representation from given audio. In MODA, we design a dual-attention module to encode accurate mouth movements and diverse modalities. 2) Facial composer network generates dense and detailed face landmarks, and 3) temporal-guided renderer syntheses stable videos. Extensive evaluations demonstrate that the proposed system produces more natural and realistic video portraits compared to previous methods.

As large as it gets: Learning infinitely large Filters via Neural Implicit Functions in the Fourier Domain

  • paper_url: http://arxiv.org/abs/2307.10001
  • repo_url: None
  • paper_authors: Julia Grabinski, Janis Keuper, Margret Keuper
  • for: investigate how large receptive fields are needed in vision applications, and develop a method to learn large filters without increasing memory consumption during training or inference.
  • methods: learn frequency representations of filter weights as neural implicit functions, allowing for efficient implementation in current frameworks and enabling the use of infinitely large filters with only a few learnable weights.
  • results: achieve results on par with the state-of-the-art on large image classification benchmarks while executing convolutions solely in the frequency domain, and provide an extensive analysis of the learned receptive fields, showing that the learned filters are well-localized and relatively small in the spatial domain.
    Abstract Motivated by the recent trend towards the usage of larger receptive fields for more context-aware neural networks in vision applications, we aim to investigate how large these receptive fields really need to be. To facilitate such study, several challenges need to be addressed, most importantly: (i) We need to provide an effective way for models to learn large filters (potentially as large as the input data) without increasing their memory consumption during training or inference, (ii) the study of filter sizes has to be decoupled from other effects such as the network width or number of learnable parameters, and (iii) the employed convolution operation should be a plug-and-play module that can replace any conventional convolution in a Convolutional Neural Network (CNN) and allow for an efficient implementation in current frameworks. To facilitate such models, we propose to learn not spatial but frequency representations of filter weights as neural implicit functions, such that even infinitely large filters can be parameterized by only a few learnable weights. The resulting neural implicit frequency CNNs are the first models to achieve results on par with the state-of-the-art on large image classification benchmarks while executing convolutions solely in the frequency domain and can be employed within any CNN architecture. They allow us to provide an extensive analysis of the learned receptive fields. Interestingly, our analysis shows that, although the proposed networks could learn very large convolution kernels, the learned filters practically translate into well-localized and relatively small convolution kernels in the spatial domain.
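
The central trick, parameterising a filter in the Fourier domain with a tiny MLP over frequency coordinates so that its spatial support is unbounded while the learnable weights stay few, can be sketched as a depthwise layer. A hypothetical PyTorch sketch (the paper's architecture details differ); note that the frequency-domain product corresponds to circular convolution:

```python
import torch
import torch.nn as nn

class ImplicitFrequencyFilter(nn.Module):
    """Depthwise convolution whose filter is defined in the Fourier domain by a
    tiny MLP over frequency coordinates - spatial support is unbounded."""
    def __init__(self, channels, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * channels),     # real + imaginary part per channel
        )
        self.channels = channels

    def forward(self, x):                        # x: (batch, channels, H, W)
        _, c, h, w = x.shape
        fy = torch.fft.fftfreq(h, device=x.device)
        fx = torch.fft.rfftfreq(w, device=x.device)
        coords = torch.stack(torch.meshgrid(fy, fx, indexing="ij"), dim=-1)
        weights = self.mlp(coords)               # (H, W//2+1, 2*channels)
        k = torch.complex(weights[..., :c], weights[..., c:]).permute(2, 0, 1)
        return torch.fft.irfft2(torch.fft.rfft2(x) * k, s=(h, w))
```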

Mitigating Viewer Impact from Disturbing Imagery using AI Filters: A User-Study

  • paper_url: http://arxiv.org/abs/2307.10334
  • repo_url: None
  • paper_authors: Ioannis Sarridis, Jochen Spangenberg, Olga Papadopoulou, Symeon Papadopoulos
  • for: This paper studies whether Artificial Intelligence (AI)-based image filters can mitigate the emotional impact of viewing disturbing imagery.
  • methods: A user study with 107 participants, predominantly journalists and human rights investigators, tested five filter styles, both traditional (Blurring and Partial Blurring) and AI-based (Drawing, Colored Drawing, and Painting), measuring how well each conveys image information while reducing emotional distress.
  • results: The AI-based Drawing style filter performs best, reducing negative feelings (-30.38%) while preserving the interpretability of the image (97.19%). Participants suggested strategies for integrating AI filters into their workflow, for example as an initial, preparatory step before inspecting the original image.
    Abstract Exposure to disturbing imagery can significantly impact individuals, especially professionals who encounter such content as part of their work. This paper presents a user study, involving 107 participants, predominantly journalists and human rights investigators, that explores the capability of Artificial Intelligence (AI)-based image filters to potentially mitigate the emotional impact of viewing such disturbing content. We tested five different filter styles, both traditional (Blurring and Partial Blurring) and AI-based (Drawing, Colored Drawing, and Painting), and measured their effectiveness in terms of conveying image information while reducing emotional distress. Our findings suggest that the AI-based Drawing style filter demonstrates the best performance, offering a promising solution for reducing negative feelings (-30.38%) while preserving the interpretability of the image (97.19%). Despite the requirement for many professionals to eventually inspect the original images, participants suggested potential strategies for integrating AI filters into their workflow, such as using AI filters as an initial, preparatory step before viewing the original image. Overall, this paper contributes to the development of a more ethically considerate and effective visual environment for professionals routinely engaging with potentially disturbing imagery.

TUNeS: A Temporal U-Net with Self-Attention for Video-based Surgical Phase Recognition

  • paper_url: http://arxiv.org/abs/2307.09997
  • repo_url: None
  • paper_authors: Isabel Funke, Dominik Rivoir, Stefanie Krell, Stefanie Speidel
  • for: This paper works toward context-aware computer assistance in the operating room of the future by automatically recognising, from video, which surgical phase the medical team is performing.
  • methods: The approach combines feature extraction from the video stream with temporal modelling of the resulting feature sequence, using attention to capture long-range dependencies. The proposed model, TUNeS, incorporates self-attention at the coarsest stage of a U-Net-like structure and requires neither local attention nor regularisation of attention weights.
  • results: All temporal models performed better on top of feature extractors that were trained with longer temporal context, and TUNeS achieves state-of-the-art results on the Cholec80 dataset.
    Abstract To enable context-aware computer assistance in the operating room of the future, cognitive systems need to understand automatically which surgical phase is being performed by the medical team. The primary source of information for surgical phase recognition is typically video, which presents two challenges: extracting meaningful features from the video stream and effectively modeling temporal information in the sequence of visual features. For temporal modeling, attention mechanisms have gained popularity due to their ability to capture long-range dependencies. In this paper, we explore design choices for attention in existing temporal models for surgical phase recognition and propose a novel approach that does not resort to local attention or regularization of attention weights: TUNeS is an efficient and simple temporal model that incorporates self-attention at the coarsest stage of a U-Net-like structure. In addition, we propose to train the feature extractor, a standard CNN, together with an LSTM on preferably long video segments, i.e., with long temporal context. In our experiments, all temporal models performed better on top of feature extractors that were trained with longer temporal context. On top of these contextualized features, TUNeS achieves state-of-the-art results on Cholec80.
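
The architectural idea, self-attention only at the coarsest stage of a temporal U-Net where sequences are short enough for global attention, can be sketched in a few lines. A hypothetical, much-simplified rendering (one down/up level; Cholec80's 7 phases assumed), not the published TUNeS:

```python
import torch
import torch.nn as nn

class TinyTemporalUNet(nn.Module):
    """1D U-Net over per-frame features with self-attention only at the
    coarsest stage, where global attention over time is affordable."""
    def __init__(self, dim=256, nhead=8):
        super().__init__()
        self.down = nn.Conv1d(dim, dim, kernel_size=4, stride=4)
        self.attn = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead,
                                               batch_first=True)
        self.up = nn.ConvTranspose1d(dim, dim, kernel_size=4, stride=4)
        self.head = nn.Linear(dim, 7)            # 7 surgical phases in Cholec80

    def forward(self, feats):                    # feats: (batch, T, dim) from CNN
        h = self.down(feats.transpose(1, 2))     # (batch, dim, T//4)
        h = self.attn(h.transpose(1, 2))         # global self-attention, coarse scale
        h = self.up(h.transpose(1, 2)).transpose(1, 2)
        return self.head(h + feats[:, :h.size(1)])  # skip connection, per-frame logits
```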

Impact of Disentanglement on Pruning Neural Networks

  • paper_url: http://arxiv.org/abs/2307.09994
  • repo_url: None
  • paper_authors: Carl Shneider, Peyman Rostami, Anis Kacem, Nilotpal Sinha, Abd El Rahman Shabayek, Djamila Aouada
  • for: This paper targets the deployment of deep learning neural networks on edge devices by reducing their memory footprint, power consumption, and latency via model compression.
  • methods: The Beta-VAE framework is combined with a standard pruning criterion to investigate how forcing the network to learn disentangled representations affects the pruning process for the task of classification.
  • results: Experiments on the MNIST and CIFAR10 datasets examine the interplay between disentanglement and pruning, identify disentanglement challenges, and propose a path forward for future work.
    Abstract Deploying deep learning neural networks on edge devices, to accomplish task specific objectives in the real-world, requires a reduction in their memory footprint, power consumption, and latency. This can be realized via efficient model compression. Disentangled latent representations produced by variational autoencoder (VAE) networks are a promising approach for achieving model compression because they mainly retain task-specific information, discarding useless information for the task at hand. We make use of the Beta-VAE framework combined with a standard criterion for pruning to investigate the impact of forcing the network to learn disentangled representations on the pruning process for the task of classification. In particular, we perform experiments on MNIST and CIFAR10 datasets, examine disentanglement challenges, and propose a path forward for future works.
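
For reference, the Beta-VAE objective that pressures the encoder toward disentangled latents is the standard VAE loss with a beta-weighted KL term; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Beta-VAE objective: reconstruction + beta-weighted KL to the unit Gaussian.
    beta > 1 pressures the encoder toward disentangled latent factors."""
    recon = F.mse_loss(x_recon, x, reduction="sum") / x.size(0)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon + beta * kl
```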

TinyTrain: Deep Neural Network Training at the Extreme Edge

  • paper_url: http://arxiv.org/abs/2307.09988
  • repo_url: None
  • paper_authors: Young D. Kwon, Rui Li, Stylianos I. Venieris, Jagmohan Chauhan, Nicholas D. Lane, Cecilia Mascolo
  • for: This work aims to make on-device training practical under the memory and compute constraints of edge devices and microcontroller units (MCUs), serving user personalisation and privacy.
  • methods: TinyTrain drastically reduces training time by selectively updating parts of the model and explicitly coping with data scarcity; a task-adaptive sparse-update method dynamically selects which layers/channels to update based on a multi-objective criterion that jointly captures the user data and the memory and compute capabilities of the target device.
  • results: TinyTrain outperforms vanilla fine-tuning of the entire network by 3.6-5.0% in accuracy while reducing the backward-pass memory and computation cost by up to 2,286x and 7.68x, respectively; on widely used real-world edge devices it achieves 9.5x faster and 3.5x more energy-efficient training than status-quo approaches.
    Abstract On-device training is essential for user personalisation and privacy. With the pervasiveness of IoT devices and microcontroller units (MCU), this task becomes more challenging due to the constrained memory and compute resources, and the limited availability of labelled user data. Nonetheless, prior works neglect the data scarcity issue, require excessively long training time (e.g. a few hours), or induce substantial accuracy loss ($\geq$10\%). We propose TinyTrain, an on-device training approach that drastically reduces training time by selectively updating parts of the model and explicitly coping with data scarcity. TinyTrain introduces a task-adaptive sparse-update method that dynamically selects the layer/channel based on a multi-objective criterion that jointly captures user data, the memory, and the compute capabilities of the target device, leading to high accuracy on unseen tasks with reduced computation and memory footprint. TinyTrain outperforms vanilla fine-tuning of the entire network by 3.6-5.0\% in accuracy, while reducing the backward-pass memory and computation cost by up to 2,286$\times$ and 7.68$\times$, respectively. Targeting broadly used real-world edge devices, TinyTrain achieves 9.5$\times$ faster and 3.5$\times$ more energy-efficient training over status-quo approaches, and 2.8$\times$ smaller memory footprint than SOTA approaches, while remaining within the 1 MB memory envelope of MCU-grade platforms.
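
The task-adaptive sparse update can be pictured as: freeze the backbone, score each layer by expected benefit per unit of cost, and unfreeze only the top-scoring ones. The sketch below is an illustrative approximation; the scoring function and layer-level granularity are assumptions, not the paper's multi-objective criterion:

```python
import torch

def select_layers_to_update(model, score_fn, budget=3):
    """Freeze the whole backbone, then unfreeze only the layers whose
    benefit-per-cost score is highest."""
    for p in model.parameters():
        p.requires_grad_(False)
    scored = sorted(model.named_children(),
                    key=lambda kv: score_fn(kv[1]), reverse=True)
    for name, module in scored[:budget]:
        for p in module.parameters():
            p.requires_grad_(True)
        print(f"updating {name}")

def toy_score(module):
    """Illustrative Fisher-like importance per parameter; assumes a backward
    pass on a few user samples has already populated .grad."""
    grads = [p.grad.pow(2).sum() for p in module.parameters()
             if p.grad is not None]
    n = sum(p.numel() for p in module.parameters()) or 1
    return (torch.stack(grads).sum() / n).item() if grads else 0.0
```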

Lazy Visual Localization via Motion Averaging

  • paper_url: http://arxiv.org/abs/2307.09981
  • repo_url: None
  • paper_authors: Siyan Dong, Shaohui Liu, Hengkai Guo, Baoquan Chen, Marc Pollefeys
  • for: This work proposes an image-based visual localization system that estimates camera poses with high accuracy.
  • methods: Instead of constructing 3D metric maps from the database, the method performs motion averaging over database-query image pairs to estimate the query pose.
  • results: Experiments show localization accuracy comparable to or better than existing structure-based methods, and the approach extends easily to complex configurations such as multi-query co-localization and camera rigs.
    Abstract Visual (re)localization is critical for various applications in computer vision and robotics. Its goal is to estimate the 6 degrees of freedom (DoF) camera pose for each query image, based on a set of posed database images. Currently, all leading solutions are structure-based that either explicitly construct 3D metric maps from the database with structure-from-motion, or implicitly encode the 3D information with scene coordinate regression models. On the contrary, visual localization without reconstructing the scene in 3D offers clear benefits. It makes deployment more convenient by reducing database pre-processing time, releasing storage requirements, and remaining unaffected by imperfect reconstruction, etc. In this technical report, we demonstrate that it is possible to achieve high localization accuracy without reconstructing the scene from the database. The key to achieving this owes to a tailored motion averaging over database-query pairs. Experiments show that our visual localization proposal, LazyLoc, achieves comparable performance against state-of-the-art structure-based methods. Furthermore, we showcase the versatility of LazyLoc, which can be easily extended to handle complex configurations such as multi-query co-localization and camera rigs.
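
The underlying idea is that each posed database image, paired with an estimated query-from-database relative pose, yields one candidate absolute query pose, and the candidates are then averaged. A simplified SciPy sketch with a naive translation average (real motion averaging solves a robust optimization over all pairs):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def average_query_pose(db_poses, rel_poses):
    """db_poses: list of (R_i, t_i) absolute poses of retrieved database images.
    rel_poses: list of (R_qi, t_qi) estimated query-from-database relative poses.
    Convention: a pose maps points into the camera frame, x_cam = R x + t."""
    cand_R, cand_t = [], []
    for (R_i, t_i), (R_qi, t_qi) in zip(db_poses, rel_poses):
        cand_R.append(R_qi @ R_i)          # candidate absolute query rotation
        cand_t.append(R_qi @ t_i + t_qi)   # candidate absolute query translation
    R_q = Rotation.concatenate(
        [Rotation.from_matrix(R) for R in cand_R]).mean().as_matrix()
    t_q = np.mean(cand_t, axis=0)          # naive average for illustration only
    return R_q, t_q
```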

ProtoCaps: A Fast and Non-Iterative Capsule Network Routing Method

  • paper_url: http://arxiv.org/abs/2307.09944
  • repo_url: None
  • paper_authors: Miles Everett, Mingjun Zhong, Georgios Leontidis
  • for: To enhance the operational efficiency and performance of Capsule Networks.
  • methods: Proposes a novel, non-iterative routing mechanism based on trainable prototype clustering, together with a shared Capsule subspace that avoids projecting each lower-level Capsule to each higher-level Capsule.
  • results: Achieves superior results compared to the current best non-iterative Capsule Network, including on the Imagewoof dataset, which is too computationally demanding for iterative approaches to handle efficiently.
    Abstract Capsule Networks have emerged as a powerful class of deep learning architectures, known for robust performance with relatively few parameters compared to Convolutional Neural Networks (CNNs). However, their inherent efficiency is often overshadowed by their slow, iterative routing mechanisms which establish connections between Capsule layers, posing computational challenges resulting in an inability to scale. In this paper, we introduce a novel, non-iterative routing mechanism, inspired by trainable prototype clustering. This innovative approach aims to mitigate computational complexity, while retaining, if not enhancing, performance efficacy. Furthermore, we harness a shared Capsule subspace, negating the need to project each lower-level Capsule to each higher-level Capsule, thereby significantly reducing memory requisites during training. Our approach demonstrates superior results compared to the current best non-iterative Capsule Network and tests on the Imagewoof dataset, which is too computationally demanding to handle efficiently by iterative approaches. Our findings underscore the potential of our proposed methodology in enhancing the operational efficiency and performance of Capsule Networks, paving the way for their application in increasingly complex computational scenarios.
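
Routing by trainable prototype clustering replaces the iterative agreement loop with a single soft assignment, and a shared vote transform stands in for the shared Capsule subspace. A hypothetical PyTorch sketch of that mechanism (not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeRouting(nn.Module):
    """Non-iterative routing sketch: each lower-level capsule vote is
    soft-assigned to trainable prototypes in a single pass."""
    def __init__(self, n_lower, n_upper, dim):
        super().__init__()
        self.vote = nn.Linear(dim, dim)                  # shared vote transform
        self.prototypes = nn.Parameter(torch.randn(n_upper, dim))

    def forward(self, lower):                # lower: (batch, n_lower, dim)
        votes = self.vote(lower)             # (batch, n_lower, dim)
        sim = torch.einsum("bnd,ud->bnu",
                           F.normalize(votes, dim=-1),
                           F.normalize(self.prototypes, dim=-1))
        assign = sim.softmax(dim=-1)         # one softmax, no routing iterations
        return torch.einsum("bnu,bnd->bud", assign, votes)  # (batch, n_upper, dim)
```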

AGAR: Attention Graph-RNN for Adaptative Motion Prediction of Point Clouds of Deformable Objects

  • paper_url: http://arxiv.org/abs/2307.09936
  • repo_url: None
  • paper_authors: Pedro Gomes, Silvia Rossi, Laura Toni
  • for: This paper addresses motion prediction for 3D point cloud sequences of deformable objects, such as human bodies in motion, and examines the technical limitations that deformable shapes and complex motions impose on this representation.
  • methods: An improved architecture for point cloud prediction of deformable 3D objects is proposed: a graph-based approach learns and exploits the spatial structure of point clouds to extract more representative features, and an adaptive module combines the learned local and global motions for each point to model complex motion in deformable 3D point clouds more effectively.
  • results: On MNIST moving digits, Mixamo human body motions, and the JPEG and CWIPC-SXR real-world dynamic body datasets, the method outperforms baseline methods at modelling complex movements while preserving point cloud shape; the framework also generalises to dynamic feature learning, achieving results on par with state-of-the-art methods for action recognition on the MSRAction3D dataset.
    Abstract This paper focuses on motion prediction for point cloud sequences in the challenging case of deformable 3D objects, such as human body motion. First, we investigate the challenges caused by deformable shapes and complex motions present in this type of representation, with the ultimate goal of understanding the technical limitations of state-of-the-art models. From this understanding, we propose an improved architecture for point cloud prediction of deformable 3D objects. Specifically, to handle deformable shapes, we propose a graph-based approach that learns and exploits the spatial structure of point clouds to extract more representative features. Then we propose a module able to combine the learned features in an adaptative manner according to the point cloud movements. The proposed adaptative module controls the composition of local and global motions for each point, enabling the network to model complex motions in deformable 3D objects more effectively. We tested the proposed method on the following datasets: MNIST moving digits, the Mixamo human bodies motions, JPEG and CWIPC-SXR real-world dynamic bodies. Simulation results demonstrate that our method outperforms the current baseline methods given its improved ability to model complex movements as well as preserve point cloud shape. Furthermore, we demonstrate the generalizability of the proposed framework for dynamic feature learning, by testing the framework for action recognition on the MSRAction3D dataset and achieving results on-par with state-of-the-art methods

DISA: DIfferentiable Similarity Approximation for Universal Multimodal Registration

  • paper_url: http://arxiv.org/abs/2307.09931
  • repo_url: https://github.com/imfusiongmbh/disa-universal-multimodal-registration
  • paper_authors: Matteo Ronchetti, Wolfgang Wein, Nassir Navab, Oliver Zettinig, Raphael Prevost
  • for: This paper proposes a fast and flexible image registration method for multimodal imaging.
  • methods: Existing similarity metrics are approximated with a dot product in the feature space of a small convolutional neural network (CNN), which is inherently differentiable and can be trained without registered data, making registration faster and more flexible.
  • results: Experiments on three different datasets show that the approach generalises well beyond the training data, covering unseen anatomies and modality pairs without specialised retraining; it is several orders of magnitude faster than local patch-based metrics and can be applied directly in clinical settings by replacing the similarity measure.
    Abstract Multimodal image registration is a challenging but essential step for numerous image-guided procedures. Most registration algorithms rely on the computation of complex, frequently non-differentiable similarity metrics to deal with the appearance discrepancy of anatomical structures between imaging modalities. Recent Machine Learning based approaches are limited to specific anatomy-modality combinations and do not generalize to new settings. We propose a generic framework for creating expressive cross-modal descriptors that enable fast deformable global registration. We achieve this by approximating existing metrics with a dot-product in the feature space of a small convolutional neural network (CNN) which is inherently differentiable can be trained without registered data. Our method is several orders of magnitude faster than local patch-based metrics and can be directly applied in clinical settings by replacing the similarity measure with the proposed one. Experiments on three different datasets demonstrate that our approach generalizes well beyond the training data, yielding a broad capture range even on unseen anatomies and modality pairs, without the need for specialized retraining. We make our training code and data publicly available.
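
The key operation, replacing an expensive multimodal similarity metric with a dot product in the feature space of a small CNN, reduces at evaluation time to the sketch below; feat_net is any small descriptor CNN, and the channel-wise normalisation is an assumption:

```python
import torch

def disa_similarity(feat_net, fixed, moving):
    """Similarity between two (possibly different-modality) volumes approximated
    as a mean local dot product in CNN feature space - cheap to evaluate at
    every candidate displacement of a global search, and differentiable.
    fixed, moving: (1, 1, D, H, W) images resampled to a common grid."""
    f = feat_net(fixed)                      # (1, C, d, h, w) descriptor maps
    m = feat_net(moving)
    f = f / (f.norm(dim=1, keepdim=True) + 1e-8)
    m = m / (m.norm(dim=1, keepdim=True) + 1e-8)
    return (f * m).sum(dim=1).mean()         # mean local dot product
```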

Measuring and Modeling Uncertainty Degree for Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2307.09929
  • repo_url: None
  • paper_authors: Mochu Xiang, Jing Zhang, Nick Barnes, Yuchao Dai
  • for: This paper aims to improve the reliability of monocular depth estimation (MDE) models by modeling uncertainty from the perspective of inherent probability distributions.
  • methods: The proposed method introduces additional training regularization terms to estimate uncertainty more fairly and with more comprehensive metrics, without requiring extra modules or multiple inferences.
  • results: The proposed method provides state-of-the-art reliability in uncertainty estimation, and can be further improved when combined with ensemble or sampling methods, as demonstrated in a series of experiments.
    Abstract Effectively measuring and modeling the reliability of a trained model is essential to the real-world deployment of monocular depth estimation (MDE) models. However, the intrinsic ill-posedness and ordinal-sensitive nature of MDE pose major challenges to the estimation of uncertainty degree of the trained models. On the one hand, utilizing current uncertainty modeling methods may increase memory consumption and are usually time-consuming. On the other hand, measuring the uncertainty based on model accuracy can also be problematic, where uncertainty reliability and prediction accuracy are not well decoupled. In this paper, we propose to model the uncertainty of MDE models from the perspective of the inherent probability distributions originating from the depth probability volume and its extensions, and to assess it more fairly with more comprehensive metrics. By simply introducing additional training regularization terms, our model, with surprisingly simple formations and without requiring extra modules or multiple inferences, can provide uncertainty estimations with state-of-the-art reliability, and can be further improved when combined with ensemble or sampling methods. A series of experiments demonstrate the effectiveness of our methods.
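
Reading uncertainty directly off the depth probability volume, rather than from extra modules or repeated inference, can be illustrated with the distribution's expectation and entropy. A minimal sketch under that interpretation (the paper's exact uncertainty measure and regularization terms differ):

```python
import torch

def depth_and_uncertainty(logits, depth_bins):
    """From a depth probability volume: expected depth plus a per-pixel
    uncertainty read directly off the distribution (its entropy).
    logits: (batch, n_bins, H, W); depth_bins: (n_bins,) candidate depths."""
    p = logits.softmax(dim=1)
    depth = (p * depth_bins.view(1, -1, 1, 1)).sum(dim=1)   # expectation
    entropy = -(p * (p + 1e-8).log()).sum(dim=1)            # uncertainty degree
    return depth, entropy
```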

Embedded Heterogeneous Attention Transformer for Cross-lingual Image Captioning

  • paper_url: http://arxiv.org/abs/2307.09915
  • repo_url: None
  • paper_authors: Zijie Song, Zhenzhen Hu, Richang Hong
  • for: This paper tackles the cross-lingual and cross-modal challenges of cross-lingual image captioning by modelling the associations between images and different languages.
  • methods: A heterogeneous network establishes cross-domain relationships and local correspondences between the image and different languages. The proposed method comprises Masked Heterogeneous Cross-attention (MHCA), a Heterogeneous Attention Reasoning Network (HARN), and Heterogeneous Co-attention (HCA).
  • results: On the MSCOCO dataset, generating English and Chinese captions, the method achieves better results than advanced monolingual methods.
    Abstract Cross-lingual image captioning is confronted with both cross-lingual and cross-modal challenges for multimedia analysis. The crucial issue in this task is to model the global and local matching between the image and different languages. Existing cross-modal embedding methods based on Transformer architecture oversight the local matching between the image region and monolingual words, not to mention in the face of a variety of differentiated languages. Due to the heterogeneous property of the cross-modal and cross-lingual task, we utilize the heterogeneous network to establish cross-domain relationships and the local correspondences between the image and different languages. In this paper, we propose an Embedded Heterogeneous Attention Transformer (EHAT) to build reasoning paths bridging cross-domain for cross-lingual image captioning and integrate into transformer. The proposed EHAT consists of a Masked Heterogeneous Cross-attention (MHCA), Heterogeneous Attention Reasoning Network (HARN) and Heterogeneous Co-attention (HCA). HARN as the core network, models and infers cross-domain relationship anchored by vision bounding box representation features to connect two languages word features and learn the heterogeneous maps. MHCA and HCA implement cross-domain integration in the encoder through the special heterogeneous attention and enable single model to generate two language captioning. We test on MSCOCO dataset to generate English and Chinese, which are most widely used and have obvious difference between their language families. Our experiments show that our method even achieve better than advanced monolingual methods.

Learning from Abstract Images: on the Importance of Occlusion in a Minimalist Encoding of Human Poses

  • paper_url: http://arxiv.org/abs/2307.09893
  • repo_url: None
  • paper_authors: Saad Manzur, Wayne Hayes
  • for: To improve the cross-dataset performance of 2D-to-3D pose lifting networks.
  • methods: Opaque 3D limbs provide occlusion information while implicitly encoding joint locations.
  • results: Without using part-maps, training on abstract synthetic images with occlusion, rendered from many viewpoints, improves same-dataset benchmarks and yields a "quantum leap" in cross-dataset benchmarks.
    Abstract Existing 2D-to-3D pose lifting networks suffer from poor performance in cross-dataset benchmarks. Although the use of 2D keypoints joined by "stick-figure" limbs has shown promise as an intermediate step, stick-figures do not account for occlusion information that is often inherent in an image. In this paper, we propose a novel representation using opaque 3D limbs that preserves occlusion information while implicitly encoding joint locations. Crucially, when training on data with accurate three-dimensional keypoints and without part-maps, this representation allows training on abstract synthetic images, with occlusion, from as many synthetic viewpoints as desired. The result is a pose defined by limb angles rather than joint positions $\unicode{x2013}$ because poses are, in the real world, independent of cameras $\unicode{x2013}$ allowing us to predict poses that are completely independent of camera viewpoint. The result provides not only an improvement in same-dataset benchmarks, but a "quantum leap" in cross-dataset benchmarks.

3Deformer: A Common Framework for Image-Guided Mesh Deformation

  • paper_url: http://arxiv.org/abs/2307.09892
  • repo_url: None
  • paper_authors: Hao Su, Xuefeng Liu, Jianwei Niu, Ji Wan, Xinghao Wu
  • for: Development of a general-purpose framework for interactive 3D shape editing.
  • methods: Built on the differentiable renderer technique, the source mesh is deformed according to correspondences between a semantic image and the mesh materials; a hierarchical optimization architecture, together with various strategies and losses, guarantees deformation accuracy, surface smoothness, geometric rigidity, and global synchronization.
  • results: In extensive experiments, the proposed 3Deformer produces impressive results and reaches the state-of-the-art level.
    Abstract We propose 3Deformer, a general-purpose framework for interactive 3D shape editing. Given a source 3D mesh with semantic materials, and a user-specified semantic image, 3Deformer can accurately edit the source mesh following the shape guidance of the semantic image, while preserving the source topology as rigid as possible. Recent studies of 3D shape editing mostly focus on learning neural networks to predict 3D shapes, which requires high-cost 3D training datasets and is limited to handling objects involved in the datasets. Unlike these studies, our 3Deformer is a non-training and common framework, which only requires supervision of readily-available semantic images, and is compatible with editing various objects unlimited by datasets. In 3Deformer, the source mesh is deformed utilizing the differentiable renderer technique, according to the correspondences between semantic images and mesh materials. However, guiding complex 3D shapes with a simple 2D image incurs extra challenges, that is, the deform accuracy, surface smoothness, geometric rigidity, and global synchronization of the edited mesh should be guaranteed. To address these challenges, we propose a hierarchical optimization architecture to balance the global and local shape features, and propose further various strategies and losses to improve properties of accuracy, smoothness, rigidity, and so on. Extensive experiments show that our 3Deformer is able to produce impressive results and reaches the state-of-the-art level.

A3D: Adaptive, Accurate, and Autonomous Navigation for Edge-Assisted Drones

  • paper_url: http://arxiv.org/abs/2307.09880
  • repo_url: None
  • paper_authors: Liekang Zeng, Haowei Chen, Daipeng Feng, Xiaoxi Zhang, Xu Chen
  • for: To improve the accuracy and safety of autonomous drone navigation.
  • methods: Deep neural networks are combined with dynamic adjustment of the task execution location, input resolution, and image compression ratio to reduce inference latency and improve prediction accuracy; a deep reinforcement learning based neural scheduler and a network-aware resource allocation algorithm support edge-assisted, multi-drone serving.
  • results: A3D reduces end-to-end latency by 28.06% and extends the flight distance by up to 27.28% compared with non-adaptive solutions.
    Abstract Accurate navigation is of paramount importance to ensure flight safety and efficiency for autonomous drones. Recent research starts to use Deep Neural Networks to enhance drone navigation given their remarkable predictive capability for visual perception. However, existing solutions either run DNN inference tasks on drones in situ, impeded by the limited onboard resource, or offload the computation to external servers which may incur large network latency. Few works consider jointly optimizing the offloading decisions along with image transmission configurations and adapting them on the fly. In this paper, we propose A3D, an edge server assisted drone navigation framework that can dynamically adjust task execution location, input resolution, and image compression ratio in order to achieve low inference latency, high prediction accuracy, and long flight distances. Specifically, we first augment state-of-the-art convolutional neural networks for drone navigation and define a novel metric called Quality of Navigation as our optimization objective which can effectively capture the above goals. We then design a deep reinforcement learning based neural scheduler at the drone side for which an information encoder is devised to reshape the state features and thus improve its learning ability. To further support simultaneous multi-drone serving, we extend the edge server design by developing a network-aware resource allocation algorithm, which allows provisioning containerized resources aligned with drones' demand. We finally implement a proof-of-concept prototype with realistic devices and validate its performance in a real-world campus scene, as well as a simulation environment for thorough evaluation upon AirSim. Extensive experimental results show that A3D can reduce end-to-end latency by 28.06% and extend the flight distance by up to 27.28% compared with non-adaptive solutions.

BSDM: Background Suppression Diffusion Model for Hyperspectral Anomaly Detection

  • paper_url: http://arxiv.org/abs/2307.09861
  • repo_url: https://github.com/majitao-xd/bsdm-had
  • paper_authors: Jitao Ma, Weiying Xie, Yunsong Li, Leyuan Fang
  • for: This work aims to improve hyperspectral anomaly detection (HAD) by solving the problem of complex backgrounds in hyperspectral images (HSIs) without labeled samples.
  • methods: A background suppression diffusion model (BSDM) simultaneously learns latent background distributions and generalizes to different datasets: pseudo background noise is designed and its underlying distribution learned with a diffusion model (DM), a statistical offset module adapts the model to datasets from different domains without labeled samples, and the inference process feeds the original HSIs into the denoising network so the background is removed as noise.
  • results: The method suppresses background interference in HSIs and improves HAD performance across different datasets, as demonstrated by assessments and generalization experiments of four HAD methods on several real HSI datasets.
    Abstract Hyperspectral anomaly detection (HAD) is widely used in Earth observation and deep space exploration. A major challenge for HAD is the complex background of the input hyperspectral images (HSIs), resulting in anomalies confused in the background. On the other hand, the lack of labeled samples for HSIs leads to poor generalization of existing HAD methods. This paper starts the first attempt to study a new and generalizable background learning problem without labeled samples. We present a novel solution BSDM (background suppression diffusion model) for HAD, which can simultaneously learn latent background distributions and generalize to different datasets for suppressing complex background. It is featured in three aspects: (1) For the complex background of HSIs, we design pseudo background noise and learn the potential background distribution in it with a diffusion model (DM). (2) For the generalizability problem, we apply a statistical offset module so that the BSDM adapts to datasets of different domains without labeling samples. (3) For achieving background suppression, we innovatively improve the inference process of DM by feeding the original HSIs into the denoising network, which removes the background as noise. Our work paves a new background suppression way for HAD that can improve HAD performance without the prerequisite of manually labeled data. Assessments and generalization experiments of four HAD methods on several real HSI datasets demonstrate the above three unique properties of the proposed method. The code is available at https://github.com/majitao-xd/BSDM-HAD.

Blind Image Quality Assessment Using Multi-Stream Architecture with Spatial and Channel Attention

  • paper_url: http://arxiv.org/abs/2307.09857
  • repo_url: None
  • paper_authors: Hassan Khalid, Nisar Ahmed
  • for: This study proposes a blind image quality assessment algorithm based on multi-stream spatial and channel attention to improve the accuracy and reliability of image quality assessment.
  • methods: The algorithm first generates hybrid features with two different backbones, then applies spatial and channel attention to assign high weights to the region of interest.
  • results: The algorithm predicts image quality more accurately, correlating highly with human perceptual assessment, and shows excellent consistency and generalization across different types of image quality assessment tasks.
    Abstract BIQA (Blind Image Quality Assessment) is an important field of study that evaluates images automatically. Although significant progress has been made, blind image quality assessment remains a difficult task since images vary in content and distortions. Most algorithms generate quality scores without emphasizing the important region of interest. To solve this, a multi-stream spatial and channel attention-based algorithm is proposed. This algorithm generates more accurate predictions with a high correlation to human perceptual assessment by combining hybrid features from two different backbones, followed by spatial and channel attention to provide high weights to the region of interest. Four legacy image quality assessment datasets are used to validate the effectiveness of our proposed approach. Authentic and synthetic distortion image databases are used to demonstrate the effectiveness of the proposed method, and we show that it has excellent generalization properties with a particular focus on the perceptual foreground information.

Hierarchical Spatio-Temporal Representation Learning for Gait Recognition

  • paper_url: http://arxiv.org/abs/2307.09856
  • repo_url: None
  • paper_authors: Lei Wang, Bo Liu, Fangfang Liang, Bincheng Wang
  • for: This work proposes a hierarchical spatio-temporal representation learning (HSTL) framework for extracting gait features from coarse to fine.
  • methods: The method comprises hierarchical clustering analysis, an adaptive region-based motion extractor (ARME), adaptive spatio-temporal pooling (ASTP), and frame-level temporal aggregation (FTA).
  • results: On the CASIA-B, OUMVLP, GREW, and Gait3D datasets, the method outperforms the state-of-the-art while maintaining a reasonable balance between model accuracy and complexity.
    Abstract Gait recognition is a biometric technique that identifies individuals by their unique walking styles, which is suitable for unconstrained environments and has a wide range of applications. While current methods focus on exploiting body part-based representations, they often neglect the hierarchical dependencies between local motion patterns. In this paper, we propose a hierarchical spatio-temporal representation learning (HSTL) framework for extracting gait features from coarse to fine. Our framework starts with a hierarchical clustering analysis to recover multi-level body structures from the whole body to local details. Next, an adaptive region-based motion extractor (ARME) is designed to learn region-independent motion features. The proposed HSTL then stacks multiple ARMEs in a top-down manner, with each ARME corresponding to a specific partition level of the hierarchy. An adaptive spatio-temporal pooling (ASTP) module is used to capture gait features at different levels of detail to perform hierarchical feature mapping. Finally, a frame-level temporal aggregation (FTA) module is employed to reduce redundant information in gait sequences through multi-scale temporal downsampling. Extensive experiments on CASIA-B, OUMVLP, GREW, and Gait3D datasets demonstrate that our method outperforms the state-of-the-art while maintaining a reasonable balance between model accuracy and complexity.

Cryo-forum: A framework for orientation recovery with uncertainty measure with the application in cryo-EM image analysis

  • paper_url: http://arxiv.org/abs/2307.09847
  • repo_url: https://github.com/phonchi/cryo-forum
  • paper_authors: Szu-Chi Chung
  • for: This paper addresses the efficient determination of orientation parameters in single-particle cryo-electron microscopy (cryo-EM), a step that is crucial for 3D structure reconstruction but, because the data contain heavy noise and outliers, normally requires time-consuming 2D clean-up.
  • methods: The paper proposes representing the orientation with a 10-dimensional feature vector and deriving the predicted orientation as a unit quaternion via a Quadratically-Constrained Quadratic Program, supplemented by an uncertainty metric. A novel loss function that accounts for the pairwise distances between orientations further improves accuracy.
  • results: Numerical analysis shows that the method effectively recovers orientations from 2D cryo-EM images and allows direct clean-up of noisy datasets at the 3D level. The method is packaged into a user-friendly software suite for easy use by developers.
    Abstract In single-particle cryo-electron microscopy (cryo-EM), the efficient determination of orientation parameters for 2D projection images poses a significant challenge yet is crucial for reconstructing 3D structures. This task is complicated by the high noise levels present in the cryo-EM datasets, which often include outliers, necessitating several time-consuming 2D clean-up processes. Recently, solutions based on deep learning have emerged, offering a more streamlined approach to the traditionally laborious task of orientation estimation. These solutions often employ amortized inference, eliminating the need to estimate parameters individually for each image. However, these methods frequently overlook the presence of outliers and may not adequately concentrate on the components used within the network. This paper introduces a novel approach that uses a 10-dimensional feature vector to represent the orientation and applies a Quadratically-Constrained Quadratic Program to derive the predicted orientation as a unit quaternion, supplemented by an uncertainty metric. Furthermore, we propose a unique loss function that considers the pairwise distances between orientations, thereby enhancing the accuracy of our method. Finally, we also comprehensively evaluate the design choices involved in constructing the encoder network, a topic that has not received sufficient attention in the literature. Our numerical analysis demonstrates that our methodology effectively recovers orientations from 2D cryo-EM images in an end-to-end manner. Importantly, the inclusion of uncertainty quantification allows for direct clean-up of the dataset at the 3D level. Lastly, we package our proposed methods into a user-friendly software suite named cryo-forum, designed for easy accessibility by the developers.
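
As background for the QCQP step: minimizing $q^\top A q$ subject to $\|q\| = 1$ is solved by the eigenvector of $A$ with the smallest eigenvalue. The sketch below additionally assumes, purely for illustration, that the 10-dimensional feature fills the upper triangle of a symmetric $4\times4$ matrix; the paper defines the actual mapping.

```python
import numpy as np

def vector10_to_quaternion(v10):
    # Hypothetical mapping: 10 values -> upper triangle of a symmetric 4x4 matrix.
    A = np.zeros((4, 4))
    A[np.triu_indices(4)] = v10
    A = A + np.triu(A, 1).T               # symmetrize
    # QCQP: min_q q^T A q s.t. ||q|| = 1  ->  eigenvector of the smallest eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(A)
    q = eigvecs[:, 0]
    return q / np.linalg.norm(q)          # unit quaternion (global sign remains ambiguous)
```

The spread of the eigenvalues could then serve as one possible uncertainty signal, although the paper defines its own uncertainty metric.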

Compressive Image Scanning Microscope

  • paper_url: http://arxiv.org/abs/2307.09841
  • repo_url: None
  • paper_authors: Ajay Gunalan, Marco Castello, Simonluca Piazza, Shunlei Li, Alberto Diaspro, Leonardo S. Mattos, Paolo Bianchini
  • for: Improving image quality and reducing data acquisition time in laser scanning microscopy.
  • methods: A single-photon avalanche diode (SPAD) array detector combined with a fixed sampling strategy; the parallel images generated by the SPAD array improve the quality of the reconstructed compressive-ISM images.
  • results: Produces high-quality compressive ISM images while reducing data acquisition time and potentially reducing photobleaching.
    Abstract We present a novel approach to implement compressive sensing in laser scanning microscopes (LSM), specifically in image scanning microscopy (ISM), using a single-photon avalanche diode (SPAD) array detector. Our method addresses two significant limitations in applying compressive sensing to LSM: the time to compute the sampling matrix and the quality of reconstructed images. We employ a fixed sampling strategy, skipping alternate rows and columns during data acquisition, which reduces the number of points scanned by a factor of four and eliminates the need to compute different sampling matrices. By exploiting the parallel images generated by the SPAD array, we improve the quality of the reconstructed compressive-ISM images compared to standard compressive confocal LSM images. Our results demonstrate the effectiveness of our approach in producing higher-quality images with reduced data acquisition time and potential benefits in reducing photobleaching.
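
The fixed sampling strategy is simple enough to state directly: skipping alternate rows and columns scans one point in four and needs no per-acquisition sampling matrix. A minimal sketch, where the `sample_image` array is a hypothetical stand-in for the scanned field:

```python
import numpy as np

def fixed_compressive_scan(sample_image):
    mask = np.zeros_like(sample_image, dtype=bool)
    mask[::2, ::2] = True                    # keep even rows and even columns only
    measurements = np.where(mask, sample_image, 0.0)
    compression = mask.sum() / mask.size     # = 0.25: a 4x reduction in scanned points
    return measurements, mask, compression
```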

What do neural networks learn in image classification? A frequency shortcut perspective

  • paper_url: http://arxiv.org/abs/2307.09829
  • repo_url: https://github.com/nis-research/nn-frequency-shortcuts
  • paper_authors: Shunxin Wang, Raymond Veldhuis, Christoph Brune, Nicola Strisciuglio
  • for: This study investigates the mechanisms of representation learning in neural networks (NNs) for classification tasks and expands our understanding of frequency shortcuts.
  • methods: Experiments on synthetic datasets and natural images; the authors propose a metric to measure class-wise frequency characteristics and a method to identify frequency shortcuts.
  • results: NNs tend to learn simple solutions for classification, and what they learn first during training depends on the most distinctive frequency characteristics, which can be either low- or high-frequency. Frequency shortcuts can be texture-based or shape-based, depending on what best simplifies the objective, and they transfer across datasets and cannot be fully avoided by larger model capacity or data augmentation.
    Abstract Frequency analysis is useful for understanding the mechanisms of representation learning in neural networks (NNs). Most research in this area focuses on the learning dynamics of NNs for regression tasks, while little for classification. This study empirically investigates the latter and expands the understanding of frequency shortcuts. First, we perform experiments on synthetic datasets, designed to have a bias in different frequency bands. Our results demonstrate that NNs tend to find simple solutions for classification, and what they learn first during training depends on the most distinctive frequency characteristics, which can be either low- or high-frequencies. Second, we confirm this phenomenon on natural images. We propose a metric to measure class-wise frequency characteristics and a method to identify frequency shortcuts. The results show that frequency shortcuts can be texture-based or shape-based, depending on what best simplifies the objective. Third, we validate the transferability of frequency shortcuts on out-of-distribution (OOD) test sets. Our results suggest that frequency shortcuts can be transferred across datasets and cannot be fully avoided by larger model capacity and data augmentation. We recommend that future research should focus on effective training schemes mitigating frequency shortcut learning.
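
As a rough illustration of what a class-wise frequency characteristic could look like, one can average the Fourier magnitude spectra of all images in a class and compare classes; the paper's actual metric may be defined differently.

```python
import numpy as np

def class_frequency_profiles(images, labels, num_classes):
    # images: (N, H, W) grayscale array; labels: (N,) integer class ids.
    h, w = images.shape[-2:]
    profiles = np.zeros((num_classes, h, w))
    for c in range(num_classes):
        spectra = np.abs(np.fft.fftshift(np.fft.fft2(images[labels == c]),
                                         axes=(-2, -1)))
        profiles[c] = spectra.mean(axis=0)   # average magnitude spectrum of class c
    # Bands where a class deviates strongly from the dataset mean are
    # candidate frequency shortcuts for that class.
    deviation = profiles - profiles.mean(axis=0, keepdims=True)
    return profiles, deviation
```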

Multi-modal Learning based Prediction for Disease

  • paper_url: http://arxiv.org/abs/2307.09823
  • repo_url: https://github.com/batuhankmkaraman/mlbasedad
  • paper_authors: Yaran Chen, Xueyu Chen, Yu Han, Haoran Li, Dongbin Zhao, Jingzhong Li, Xu Wang
  • for: This study develops a non-invasive diagnosis system for non-alcoholic fatty liver disease (NAFLD) that predicts disease onset and progression while avoiding invasive liver biopsy.
  • methods: A multi-modal learning method combining data from physical examinations, laboratory and imaging studies, extensive questionnaires, and facial images.
  • results: Multi-modal learning improves diagnostic accuracy, and facial images alone achieve competitive results, which could reduce the risk and cost of liver biopsy.
    Abstract Non-alcoholic fatty liver disease (NAFLD) is the most common cause of chronic liver disease, and accurate prediction can prevent advanced fibrosis and cirrhosis. However, a liver biopsy, the gold standard for NAFLD diagnosis, is invasive, expensive, and prone to sampling errors. Therefore, non-invasive studies are extremely promising, yet they are still in their infancy due to the lack of comprehensive research data and intelligent methods for multi-modal data. This paper proposes a NAFLD diagnosis system (DeepFLDDiag) combining a comprehensive clinical dataset (FLDData) and a multi-modal learning based NAFLD prediction method (DeepFLD). The dataset includes over 6000 participants' physical examinations, laboratory and imaging studies, extensive questionnaires, and facial images of partial participants, which is comprehensive and valuable for clinical studies. From the dataset, we quantitatively analyze and select clinical metadata that most contribute to NAFLD prediction. Furthermore, the proposed DeepFLD, a deep neural network model designed to predict NAFLD using multi-modal input, including metadata and facial images, outperforms the approach that only uses metadata. Satisfactory performance is also verified on other unseen datasets. Inspiringly, DeepFLD can achieve competitive results using only facial images as input rather than metadata, paving the way for a more robust and simpler non-invasive NAFLD diagnosis.

A Siamese-based Verification System for Open-set Architecture Attribution of Synthetic Images

  • paper_url: http://arxiv.org/abs/2307.09822
  • repo_url: None
  • paper_authors: Lydia Abady, Jun Wang, Benedetta Tondi, Mauro Barni
  • for: This paper aims to address the problem of open-set attribution of synthetic images to the architecture that generated them.
  • methods: The proposed verification framework relies on a Siamese Network to compare the query image with reference images generated by the same architecture.
  • results: The proposed method achieves excellent performance in both closed and open-set settings, with strong generalization capabilities.
    Abstract Despite the wide variety of methods developed for synthetic image attribution, most of them can only attribute images generated by models or architectures included in the training set and do not work with unknown architectures, hindering their applicability in real-world scenarios. In this paper, we propose a verification framework that relies on a Siamese Network to address the problem of open-set attribution of synthetic images to the architecture that generated them. We consider two different settings. In the first setting, the system determines whether two images have been produced by the same generative architecture or not. In the second setting, the system verifies a claim about the architecture used to generate a synthetic image, utilizing one or multiple reference images generated by the claimed architecture. The main strength of the proposed system is its ability to operate in both closed and open-set scenarios so that the input images, either the query and reference images, can belong to the architectures considered during training or not. Experimental evaluations encompassing various generative architectures such as GANs, diffusion models, and transformers, focusing on synthetic face image generation, confirm the excellent performance of our method in both closed and open-set settings, as well as its strong generalization capabilities.

Hierarchical Semantic Perceptual Listener Head Video Generation: A High-performance Pipeline

  • paper_url: http://arxiv.org/abs/2307.09821
  • repo_url: None
  • paper_authors: Zhigang Chang, Weitai Hu, Qing Yang, Shibao Zheng
  • for: This paper is the technical report of a solution for the ViCo@2023 Conversational Head Generation Challenge at the ACM Multimedia 2023 conference.
  • methods: A high-performance solution that enhances the hierarchical semantic extraction capability of the audio encoder module and improves the decoder, renderer, and post-processing modules.
  • results: The solution achieved first place on the official leaderboard for the listening head generation track.
    Abstract In dyadic speaker-listener interactions, the listener's head reactions, along with the speaker's head movements, constitute an important non-verbal semantic expression. The listener head generation task aims to synthesize responsive listener head videos based on the audio of the speaker and reference images of the listener. Compared to talking-head generation, it is more challenging to capture the correlation clues from the speaker's audio and visual information. Following the ViCo baseline scheme, we propose a high-performance solution by enhancing the hierarchical semantic extraction capability of the audio encoder module and improving the decoder part, renderer and post-processing modules. Our solution gets the first place on the official leaderboard for the track of listening head generation. This paper is a technical report of the ViCo@2023 Conversational Head Generation Challenge at the ACM Multimedia 2023 conference.

Deep unrolling Shrinkage Network for Dynamic MR imaging

  • paper_url: http://arxiv.org/abs/2307.09818
  • repo_url: https://github.com/yhao-z/dus-net
  • paper_authors: Yinghao Zhang, Xiaodi Li, Weihang Li, Yue Hu
  • For: This paper aims to improve reconstruction models for dynamic magnetic resonance (MR) imaging, which typically use a convolutional neural network (CNN) together with the soft thresholding (ST) operator to enforce sparsity priors.
  • Methods: The paper proposes a new operator, soft thresholding with channel attention (AST), which learns a threshold for each channel, and a deep unrolling shrinkage network (DUS-Net) built by unrolling the alternating direction method of multipliers (ADMM) for the transformed $l_1$-norm dynamic MR reconstruction model.
  • Results: Experimental results show that the proposed DUS-Net outperforms state-of-the-art methods on an open-access dynamic cine MR dataset.
    Abstract Deep unrolling networks that utilize sparsity priors have achieved great success in dynamic magnetic resonance (MR) imaging. The convolutional neural network (CNN) is usually utilized to extract the transformed domain, and then the soft thresholding (ST) operator is applied to the CNN-transformed data to enforce the sparsity priors. However, the ST operator is usually constrained to be the same across all channels of the CNN-transformed data. In this paper, we propose a novel operator, called soft thresholding with channel attention (AST), that learns the threshold for each channel. In particular, we put forward a novel deep unrolling shrinkage network (DUS-Net) by unrolling the alternating direction method of multipliers (ADMM) for optimizing the transformed $l_1$ norm dynamic MR reconstruction model. Experimental results on an open-access dynamic cine MR dataset demonstrate that the proposed DUS-Net outperforms the state-of-the-art methods. The source code is available at \url{https://github.com/yhao-z/DUS-Net}.
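
The heart of the AST operator is a soft threshold learned per channel instead of shared across all channels. A minimal PyTorch sketch, in which the channel-attention mechanism that produces the thresholds in the paper is simplified to a directly learned parameter:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSoftThreshold(nn.Module):
    """Soft thresholding with one learnable, non-negative threshold per channel."""
    def __init__(self, num_channels):
        super().__init__()
        self.raw_tau = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x):                    # x: (B, C, H, W) transformed features
        tau = F.softplus(self.raw_tau)       # keep each channel's threshold positive
        return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)
```

With `raw_tau` collapsed to a single shared scalar, this reduces to the conventional shared-threshold ST operator that the paper starts from.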

LDP: Language-driven Dual-Pixel Image Defocus Deblurring Network

  • paper_url: http://arxiv.org/abs/2307.09815
  • repo_url: None
  • paper_authors: Hao Yang, Liyuan Pan, Yan Yang, Miaomiao Liu
  • for: This work uses the contrastive language-image pre-training framework (CLIP) to estimate blur maps from dual-pixel (DP) pairs in an unsupervised manner.
  • methods: Carefully designed text prompts enable CLIP to learn blur-related geometric prior knowledge from the DP pair, and a format is proposed for feeding the stereo DP pair to CLIP without any fine-tuning. Given the estimated blur map, a blur-prior attention block, a blur-weighting loss, and a blur-aware loss are introduced to recover the all-in-focus image.
  • results: The method achieves state-of-the-art performance in extensive experiments.
    Abstract Recovering sharp images from dual-pixel (DP) pairs with disparity-dependent blur is a challenging task. Existing blur map-based deblurring methods have demonstrated promising results. In this paper, we propose, to the best of our knowledge, the first framework to introduce the contrastive language-image pre-training framework (CLIP) to achieve accurate blur map estimation from DP pairs unsupervisedly. To this end, we first carefully design text prompts to enable CLIP to understand blur-related geometric prior knowledge from the DP pair. Then, we propose a format to input stereo DP pair to the CLIP without any fine-tuning, where the CLIP is pre-trained on monocular images. Given the estimated blur map, we introduce a blur-prior attention block, a blur-weighting loss and a blur-aware loss to recover the all-in-focus image. Our method achieves state-of-the-art performance in extensive experiments.

GenKL: An Iterative Framework for Resolving Label Ambiguity and Label Non-conformity in Web Images Via a New Generalized KL Divergence

  • paper_url: http://arxiv.org/abs/2307.09810
  • repo_url: https://github.com/codetopaper/genkl
  • paper_authors: Xia Huang, Kai Fong Ernest Chong
  • for: This paper addresses ambiguous in-distribution (ID) and out-of-distribution (OOD) instances, collectively called non-conforming (NC) instances, in web image datasets.
  • methods: A new iterative training framework, GenKL, identifies and relabels NC instances using a new $(\alpha, \beta)$-generalized KL divergence.
  • results: On three web image datasets (Clothing1M, Food101/Food101N, and mini WebVision 1.0), the framework achieves new state-of-the-art classification accuracies of 81.34%, 85.73%, and 78.99%/92.54% (top-1/top-5), respectively.
    Abstract Web image datasets curated online inherently contain ambiguous in-distribution (ID) instances and out-of-distribution (OOD) instances, which we collectively call non-conforming (NC) instances. In many recent approaches for mitigating the negative effects of NC instances, the core implicit assumption is that the NC instances can be found via entropy maximization. For "entropy" to be well-defined, we are interpreting the output prediction vector of an instance as the parameter vector of a multinomial random variable, with respect to some trained model with a softmax output layer. Hence, entropy maximization is based on the idealized assumption that NC instances have predictions that are "almost" uniformly distributed. However, in real-world web image datasets, there are numerous NC instances whose predictions are far from being uniformly distributed. To tackle the limitation of entropy maximization, we propose $(\alpha, \beta)$-generalized KL divergence, $\mathcal{D}_{\text{KL}}^{\alpha, \beta}(p\|q)$, which can be used to identify significantly more NC instances. Theoretical properties of $\mathcal{D}_{\text{KL}}^{\alpha, \beta}(p\|q)$ are proven, and we also show empirically that a simple use of $\mathcal{D}_{\text{KL}}^{\alpha, \beta}(p\|q)$ outperforms all baselines on the NC instance identification task. Building upon $(\alpha,\beta)$-generalized KL divergence, we also introduce a new iterative training framework, GenKL, that identifies and relabels NC instances. When evaluated on three web image datasets, Clothing1M, Food101/Food101N, and mini WebVision 1.0, we achieved new state-of-the-art classification accuracies: $81.34\%$, $85.73\%$ and $78.99\%$/$92.54\%$ (top-1/top-5), respectively.
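
The entropy-maximization baseline that the paper improves upon is straightforward to state: treat each softmax output as a multinomial parameter vector and score instances by entropy, so near-uniform predictions rank highest. A minimal sketch; the closed form of the proposed $(\alpha, \beta)$-generalized KL divergence is given in the paper and not reproduced here:

```python
import torch

def entropy_nc_scores(logits):
    # Interpret each softmax output as the parameter vector of a multinomial.
    p = torch.softmax(logits, dim=-1)
    entropy = -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1)
    # Rank descending: the highest-entropy (most uniform) predictions are
    # the NC candidates under the idealized assumption the paper critiques.
    return entropy
```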

Fix your downsampling ASAP! Be natively more robust via Aliasing and Spectral Artifact free Pooling

  • paper_url: http://arxiv.org/abs/2307.09804
  • repo_url: None
  • paper_authors: Julia Grabinski, Janis Keuper, Margret Keuper
  • for: Improving the native robustness of convolutional neural networks against common corruptions and adversarial attacks.
  • methods: Aliasing and spectral artifact-free pooling (ASAP), which modifies the FLC pooling method to reduce spectral leakage artifacts.
  • results: Networks using ASAP as the downsampling method show higher native robustness against common corruptions and adversarial attacks on high- and low-resolution data, while maintaining similar clean accuracy or even outperforming the baseline.
    Abstract Convolutional neural networks encode images through a sequence of convolutions, normalizations and non-linearities, as well as downsampling operations, into potentially strong semantic embeddings. Yet, previous work showed that even slight mistakes during sampling, leading to aliasing, can be directly attributed to the networks' lack of robustness. To address such issues and facilitate simpler and faster adversarial training, [12] recently proposed FLC pooling, a method for provably alias-free downsampling - in theory. In this work, we conduct a further analysis through the lens of signal processing and find that such current pooling methods, which address aliasing in the frequency domain, are still prone to spectral leakage artifacts. Hence, we propose aliasing and spectral artifact-free pooling, short ASAP. While only introducing a few modifications to FLC pooling, networks using ASAP as the downsampling method exhibit higher native robustness against common corruptions, a property that FLC pooling was missing. ASAP also increases native robustness against adversarial attacks on high- and low-resolution data while maintaining similar clean accuracy or even outperforming the baseline.
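
FLC pooling, which ASAP modifies, downsamples in the frequency domain by keeping only the lowest frequencies, which is alias-free by construction. A minimal sketch of that frequency-domain downsampling (assuming H and W divisible by 4); ASAP's additional anti-leakage measures are deliberately omitted here:

```python
import torch

def flc_style_pooling(x):                       # x: (B, C, H, W), H and W divisible by 4
    b, c, h, w = x.shape
    freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    h0, w0 = h // 4, w // 4
    low = freq[..., h0:h0 + h // 2, w0:w0 + w // 2]   # central low-frequency crop
    out = torch.fft.ifft2(torch.fft.ifftshift(low, dim=(-2, -1))).real
    return out / 4.0                            # rescale for the smaller transform size
```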

From West to East: Who can understand the music of the others better?

  • paper_url: http://arxiv.org/abs/2307.09795
  • repo_url: https://github.com/pxaris/ccml
  • paper_authors: Charilaos Papaioannou, Emmanouil Benetos, Alexandros Potamianos
  • for: This study examines how well music representations transfer across cultures and styles, and whether similar audio embedding models can be built from data of different cultures or styles.
  • methods: Transfer learning from Western pop/rock and traditional/folk music data to other cultures and styles, deriving insights about the similarities between the music cultures the data belong to.
  • results: Experiments show that competitive performance is achieved in all domains via transfer learning, while the best source dataset varies for each music culture.
    Abstract Recent developments in MIR have led to several benchmark deep learning models whose embeddings can be used for a variety of downstream tasks. At the same time, the vast majority of these models have been trained on Western pop/rock music and related styles. This leads to research questions on whether these models can be used to learn representations for different music cultures and styles, or whether we can build similar music audio embedding models trained on data from different cultures or styles. To that end, we leverage transfer learning methods to derive insights about the similarities between the different music cultures to which the data belongs to. We use two Western music datasets, two traditional/folk datasets coming from eastern Mediterranean cultures, and two datasets belonging to Indian art music. Three deep audio embedding models are trained and transferred across domains, including two CNN-based and a Transformer-based architecture, to perform auto-tagging for each target domain dataset. Experimental results show that competitive performance is achieved in all domains via transfer learning, while the best source dataset varies for each music culture. The implementation and the trained models are both provided in a public repository.

DiffDP: Radiotherapy Dose Prediction via a Diffusion Model

  • paper_url: http://arxiv.org/abs/2307.09794
  • repo_url: https://github.com/scufzh/DiffDP
  • paper_authors: Zhenghao Feng, Lu Wen, Peng Wang, Binyu Yan, Xi Wu, Jiliu Zhou, Yan Wang
  • for: This work aims to improve the accuracy and efficiency of automatic dose distribution prediction in radiotherapy planning and to overcome the over-smoothing problem of existing methods.
  • methods: A diffusion-based dose prediction (DiffDP) model with a forward process and a reverse process. In the forward process, DiffDP gradually adds small amounts of noise to the dose distribution map and trains a noise predictor to estimate the noise added at each timestep; in the reverse process, it removes the noise step by step with the trained predictor and outputs the predicted dose distribution map.
  • results: Experimental results show that DiffDP improves the accuracy and efficiency of dose distribution prediction in radiotherapy planning while maintaining the stability and precision of the predicted dose distribution.
    Abstract Currently, deep learning (DL) has achieved the automatic prediction of dose distribution in radiotherapy planning, enhancing its efficiency and quality. However, existing methods suffer from the over-smoothing problem for their commonly used L_1 or L_2 loss with posterior average calculations. To alleviate this limitation, we innovatively introduce a diffusion-based dose prediction (DiffDP) model for predicting the radiotherapy dose distribution of cancer patients. Specifically, the DiffDP model contains a forward process and a reverse process. In the forward process, DiffDP gradually transforms dose distribution maps into Gaussian noise by adding small noise and trains a noise predictor to predict the noise added in each timestep. In the reverse process, it removes the noise from the original Gaussian noise in multiple steps with the well-trained noise predictor and finally outputs the predicted dose distribution map. To ensure the accuracy of the prediction, we further design a structure encoder to extract anatomical information from patient anatomy images and enable the noise predictor to be aware of the dose constraints within several essential organs, i.e., the planning target volume and organs at risk. Extensive experiments on an in-house dataset with 130 rectum cancer patients demonstrate the superiority of the proposed DiffDP in accuracy and efficiency over existing methods.
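
The forward process described in the abstract follows the standard denoising-diffusion recipe: corrupt the dose map with a small amount of Gaussian noise at a random timestep and regress the added noise. A minimal training-step sketch; conditioning the noise predictor on structure-encoder features (`anatomy_feat`) is an assumed interface:

```python
import torch
import torch.nn.functional as F

def diffdp_training_step(dose_map, anatomy_feat, noise_predictor, alphas_bar):
    # dose_map: (B, 1, H, W); alphas_bar: (T,) cumulative noise schedule.
    b = dose_map.shape[0]
    t = torch.randint(0, len(alphas_bar), (b,), device=dose_map.device)
    a_bar = alphas_bar[t].view(b, 1, 1, 1)
    noise = torch.randn_like(dose_map)
    noisy = a_bar.sqrt() * dose_map + (1.0 - a_bar).sqrt() * noise   # q(x_t | x_0)
    pred = noise_predictor(noisy, t, anatomy_feat)  # anatomy-aware noise prediction
    return F.mse_loss(pred, noise)
```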

Density-invariant Features for Distant Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2307.09788
  • repo_url: https://github.com/liuquan98/gcl
  • paper_authors: Quan Liu, Hongzi Zhu, Yunsong Zhou, Hongyang Li, Shan Chang, Minyi Guo
  • for: Extending the 3D vision of collaborative autonomous vehicles through registration of distant outdoor LiDAR point clouds.
  • methods: Group-wise Contrastive Learning (GCL), which extracts density-invariant geometric features for registering distant outdoor LiDAR point clouds.
  • results: Validated by theoretical analysis and experiments, GCL improves the registration recall of distant scenarios on the KITTI and nuScenes benchmarks by 40.9% and 26.9%, respectively.
    Abstract Registration of distant outdoor LiDAR point clouds is crucial to extending the 3D vision of collaborative autonomous vehicles, and yet is challenging due to small overlapping area and a huge disparity between observed point densities. In this paper, we propose Group-wise Contrastive Learning (GCL) scheme to extract density-invariant geometric features to register distant outdoor LiDAR point clouds. We mark through theoretical analysis and experiments that, contrastive positives should be independent and identically distributed (i.i.d.), in order to train densityinvariant feature extractors. We propose upon the conclusion a simple yet effective training scheme to force the feature of multiple point clouds in the same spatial location (referred to as positive groups) to be similar, which naturally avoids the sampling bias introduced by a pair of point clouds to conform with the i.i.d. principle. The resulting fully-convolutional feature extractor is more powerful and density-invariant than state-of-the-art methods, improving the registration recall of distant scenarios on KITTI and nuScenes benchmarks by 40.9% and 26.9%, respectively. Code is available at https://github.com/liuQuan98/GCL.
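
The i.i.d.-positives requirement can be illustrated with a simplified group-wise contrastive loss: features of the same spatial location, observed from several point clouds, form a positive group pulled toward its centroid, while different groups repel. This InfoNCE-over-centroids form is an illustrative assumption, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def group_contrastive_loss(groups, temperature=0.07):
    # groups: (G, M, D) = G spatial locations, M point clouds each, D-dim features.
    z = F.normalize(groups, dim=-1)
    centroids = F.normalize(z.mean(dim=1), dim=-1)       # (G, D) one centroid per group
    logits = z.reshape(-1, z.shape[-1]) @ centroids.t() / temperature
    targets = torch.arange(groups.shape[0], device=groups.device)
    targets = targets.repeat_interleave(groups.shape[1])  # each feature -> its own group
    return F.cross_entropy(logits, targets)
```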

DVPT: Dynamic Visual Prompt Tuning of Large Pre-trained Models for Medical Image Analysis

  • paper_url: http://arxiv.org/abs/2307.09787
  • repo_url: None
  • paper_authors: Along He, Kai Wang, Zhihong Wang, Tao Li, Huazhu Fu
  • for: Addressing the limited labeled data available in the medical domain with a parameter-efficient fine-tuning (PEFT) method that improves the adaptability and efficiency of large pre-trained models.
  • methods: Dynamic visual prompt tuning (DVPT): frozen features pass through a lightweight bottleneck layer to learn the distribution of the downstream medical task, and a few learnable visual prompts act as dynamic queries that cross-attend to the transformed features to acquire sample-specific knowledge.
  • results: Extensive experiments on medical classification and segmentation tasks show that DVPT adapts pre-trained models to the medical domain efficiently and improves data efficiency with limited labeled data. For example, with 0.5% extra trainable parameters, it surpasses full fine-tuning by more than 2.20% in Kappa score on a medical classification task while saving up to 60% of the labeled data and 99% of the storage cost.
    Abstract Limited labeled data makes it hard to train models from scratch in the medical domain, and an important paradigm is pre-training and then fine-tuning. Large pre-trained models contain rich representations, which can be adapted to downstream medical tasks. However, existing methods either tune all the parameters or the task-specific layers of the pre-trained models, ignoring the input variations of medical images, and thus they are not efficient or effective. In this work, we aim to study parameter-efficient fine-tuning (PEFT) for medical image analysis, and propose a dynamic visual prompt tuning method, named DVPT. It can extract knowledge beneficial to downstream tasks from large models with a few trainable parameters. Firstly, the frozen features are transformed by a lightweight bottleneck layer to learn the domain-specific distribution of downstream medical tasks, and then a few learnable visual prompts are used as dynamic queries to conduct cross-attention with the transformed features, attempting to acquire sample-specific knowledge that is suitable for each sample. Finally, the features are projected to the original feature dimension and aggregated with the frozen features. This DVPT module can be shared between different Transformer layers, further reducing the trainable parameters. To validate DVPT, we conduct extensive experiments with different pre-trained models on medical classification and segmentation tasks. We find that such a PEFT method can not only efficiently adapt the pre-trained models to the medical domain, but also brings data efficiency with partial labeled data. For example, with 0.5\% extra trainable parameters, our method not only outperforms state-of-the-art PEFT methods, but even surpasses full fine-tuning by more than 2.20\% Kappa score on the medical classification task. It saves up to 60\% of the labeled data and 99\% of the storage cost of ViT-B/16.
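
A minimal sketch of the DVPT block described above, with illustrative dimensions and layer choices; the exact attention layout, prompt count, and aggregation used in the paper may differ:

```python
import torch
import torch.nn as nn

class DVPTBlock(nn.Module):
    def __init__(self, dim=768, bottleneck=64, num_prompts=8):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # lightweight bottleneck on frozen features
        self.prompts = nn.Parameter(torch.randn(num_prompts, bottleneck) * 0.02)
        self.attn = nn.MultiheadAttention(bottleneck, num_heads=4, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)     # project back to the backbone width

    def forward(self, frozen_feats):             # (B, N, dim) tokens from a frozen backbone
        h = self.down(frozen_feats)
        q = self.prompts.unsqueeze(0).expand(h.shape[0], -1, -1)
        ctx, _ = self.attn(q, h, h)              # prompts as dynamic, sample-specific queries
        return frozen_feats + self.up(ctx).mean(dim=1, keepdim=True)
```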

Source-Free Domain Adaptation for Medical Image Segmentation via Prototype-Anchored Feature Alignment and Contrastive Learning

  • paper_url: http://arxiv.org/abs/2307.09769
  • repo_url: https://github.com/cscyqj/miccai23-protocontra-sfda
  • paper_authors: Qinji Yu, Nan Xi, Junsong Yuan, Ziyu Zhou, Kang Dang, Xiaowei Ding
  • for: This paper tackles domain adaptation for medical image segmentation when source-domain data are unavailable (e.g., for privacy reasons), so that adaptation must proceed without concurrent access to source and target data.
  • methods: A novel two-stage source-free domain adaptation (SFDA) framework that needs only a well-trained source segmentation model and unlabeled target data. In the prototype-anchored feature alignment stage, the weights of the pre-trained pixel-wise classifier serve as source prototypes and bi-directional transport aligns target features with the class prototypes; a contrastive learning stage then exploits pixels with unreliable predictions to obtain a more compact target feature distribution.
  • results: Experiments show that the method outperforms state-of-the-art SFDA approaches and even some UDA methods under large domain discrepancy. Code is available at https://github.com/CSCYQJ/MICCAI23-ProtoContra-SFDA.
    Abstract Unsupervised domain adaptation (UDA) has increasingly gained interests for its capacity to transfer the knowledge learned from a labeled source domain to an unlabeled target domain. However, typical UDA methods require concurrent access to both the source and target domain data, which largely limits its application in medical scenarios where source data is often unavailable due to privacy concern. To tackle the source data-absent problem, we present a novel two-stage source-free domain adaptation (SFDA) framework for medical image segmentation, where only a well-trained source segmentation model and unlabeled target data are available during domain adaptation. Specifically, in the prototype-anchored feature alignment stage, we first utilize the weights of the pre-trained pixel-wise classifier as source prototypes, which preserve the information of source features. Then, we introduce the bi-directional transport to align the target features with class prototypes by minimizing its expected cost. On top of that, a contrastive learning stage is further devised to utilize those pixels with unreliable predictions for a more compact target feature distribution. Extensive experiments on a cross-modality medical segmentation task demonstrate the superiority of our method in large domain discrepancy settings compared with the state-of-the-art SFDA approaches and even some UDA methods. Code is available at https://github.com/CSCYQJ/MICCAI23-ProtoContra-SFDA.
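
The prototype-anchored idea, reusing the frozen source classifier's rows as class anchors for unlabeled target features, can be sketched as follows. The soft nearest-prototype assignment below is a simplified stand-in for the paper's bi-directional transport objective:

```python
import torch
import torch.nn.functional as F

def prototype_alignment_loss(target_feats, classifier_weight, temperature=0.1):
    # target_feats: (N, D) pixel features; classifier_weight: (num_classes, D).
    f = F.normalize(target_feats, dim=-1)
    protos = F.normalize(classifier_weight, dim=-1)  # source prototypes, kept frozen
    sim = f @ protos.t()
    assign = (sim / temperature).softmax(dim=-1)     # soft class assignment per pixel
    cost = (assign * (1.0 - sim)).sum(dim=-1)        # expected cosine distance to prototypes
    return cost.mean()
```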

Longitudinal Data and a Semantic Similarity Reward for Chest X-Ray Report Generation

  • paper_url: http://arxiv.org/abs/2307.09758
  • repo_url: https://github.com/aehrc/cxrmate
  • paper_authors: Aaron Nicolson, Jason Dowling, Bevan Koopman
  • For: Improving radiologists' work efficiency and accuracy, and in turn patient care.
  • Methods: The proposed method utilizes the longitudinal history from a patient's previous CXR study together with a CXR-BERT-based reward to improve automatic CXR report generation.
  • Results: The proposed method outperforms previous approaches, aligns more closely with radiologists' assessment, and provides a better pathway to clinical translation.
    Abstract The current burnout rate of radiologists is high due to the large and ever growing number of Chest X-Rays (CXRs) needing interpretation and reporting. Promisingly, automatic CXR report generation has the potential to aid radiologists with this laborious task and improve patient care. Previous CXR report generation methods are limited by their diagnostic inaccuracy and their lack of alignment with the workflow of radiologists. To address these issues, we present a new method that utilises the longitudinal history available from a patient's previous CXR study when generating a report, which imitates a radiologist's workflow. We also propose a new reward for reinforcement learning based on CXR-BERT -- which captures the clinical semantic similarity between reports -- to further improve CXR report generation. We conduct experiments on the publicly available MIMIC-CXR dataset with metrics more closely correlated with radiologists' assessment of reporting. The results indicate capturing a patient's longitudinal history improves CXR report generation and that CXR-BERT is a promising alternative to the current state-of-the-art reward. Our approach generates radiology reports that are quantitatively more aligned with those of radiologists than previous methods while simultaneously offering a better pathway to clinical translation. Our Hugging Face checkpoint (https://huggingface.co/aehrc/cxrmate) and code (https://github.com/aehrc/cxrmate) are publicly available.
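
The reward itself can be pictured as a similarity between report embeddings. A minimal sketch; the pooling of CXR-BERT outputs and the self-critical baseline below are common choices assumed for illustration, not details confirmed by the abstract:

```python
import torch.nn.functional as F

def semantic_similarity_reward(gen_emb, ref_emb, baseline_emb=None):
    # gen_emb / ref_emb: (B, D) CXR-BERT embeddings of sampled and reference reports.
    reward = F.cosine_similarity(gen_emb, ref_emb, dim=-1)
    if baseline_emb is not None:                     # greedy-decoded report as baseline
        reward = reward - F.cosine_similarity(baseline_emb, ref_emb, dim=-1)
    return reward                                    # scales the policy-gradient update
```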

Generative Prompt Model for Weakly Supervised Object Localization

  • paper_url: http://arxiv.org/abs/2307.09756
  • repo_url: https://github.com/callsys/genpromp
  • paper_authors: Yuzhong Zhao, Qixiang Ye, Weijia Wu, Chunhua Shen, Fang Wan
  • for: This work addresses weakly supervised object localization (WSOL) by proposing a generative prompt model (GenPromp) that can localize less discriminative object parts.
  • methods: GenPromp formulates WSOL as a conditional image denoising procedure: image category labels are converted into learnable prompt embeddings that condition a generative model to recover the noised input image.
  • results: On CUB-200-2011 and ILSVRC, GenPromp outperforms the best discriminative models by 5.2% and 5.6% (Top-1 Loc), respectively, setting a solid baseline for WSOL with generative models. Code is available at https://github.com/callsys/GenPromp.
    Abstract Weakly supervised object localization (WSOL) remains challenging when learning object localization models from image category labels. Conventional methods that discriminatively train activation models ignore representative yet less discriminative object parts. In this study, we propose a generative prompt model (GenPromp), defining the first generative pipeline to localize less discriminative object parts by formulating WSOL as a conditional image denoising procedure. During training, GenPromp converts image category labels to learnable prompt embeddings which are fed to a generative model to conditionally recover the input image with noise and learn representative embeddings. During inference, GenPromp combines the representative embeddings with discriminative embeddings (queried from an off-the-shelf vision-language model) for both representative and discriminative capacity. The combined embeddings are finally used to generate multi-scale high-quality attention maps, which facilitate localizing full object extent. Experiments on CUB-200-2011 and ILSVRC show that GenPromp respectively outperforms the best discriminative models by 5.2% and 5.6% (Top-1 Loc), setting a solid baseline for WSOL with the generative model. Code is available at https://github.com/callsys/GenPromp.

Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement

  • paper_url: http://arxiv.org/abs/2307.09749
  • repo_url: https://github.com/csguoh/lemma
  • paper_authors: Hang Guo, Tao Dai, Guanghao Meng, Shu-Tao Xia
  • for: Improving scene text image super-resolution, and thereby scene text recognition accuracy, by addressing complex background disturbance.
  • methods: A novel method, LEMMA, that explicitly models character regions to produce high-level text-specific guidance for super-resolution.
  • results: Experiments on TextZoom and four scene text recognition benchmarks demonstrate superiority over other state-of-the-art methods.
    Abstract Scene text image super-resolution (STISR), aiming to improve image quality while boosting downstream scene text recognition accuracy, has recently achieved great success. However, most existing methods treat the foreground (character regions) and background (non-character regions) equally in the forward process, and neglect the disturbance from the complex background, thus limiting the performance. To address these issues, in this paper, we propose a novel method LEMMA that explicitly models character regions to produce high-level text-specific guidance for super-resolution. To model the location of characters effectively, we propose the location enhancement module to extract character region features based on the attention map sequence. Besides, we propose the multi-modal alignment module to perform bidirectional visual-semantic alignment to generate high-quality prior guidance, which is then incorporated into the super-resolution branch in an adaptive manner using the proposed adaptive fusion module. Experiments on TextZoom and four scene text recognition benchmarks demonstrate the superiority of our method over other state-of-the-art methods. Code is available at https://github.com/csguoh/LEMMA.

Watch out Venomous Snake Species: A Solution to SnakeCLEF2023

  • paper_url: http://arxiv.org/abs/2307.09748
  • repo_url: https://github.com/xiaoxsparraw/clef2023
  • paper_authors: Feiran Hu, Peng Wang, Yangyang Li, Chenlong Duan, Zijian Zhu, Fei Wang, Faen Zhang, Yong Li, Xiu-Shen Wei
  • for: This work targets the development of advanced algorithms for snake species identification through the analysis of images and accompanying metadata.
  • methods: Modern CNN models with strong data augmentation are used to learn better image representations, and seesaw loss mitigates the challenge of the long-tailed class distribution. In post-processing, a lightweight model computes prior probabilities from metadata features extracted with CLIP, and venomous-species labels are assigned to examples the model is uncertain about.
  • results: The method achieves a final-metric score of 91.31% on the private leaderboard, ranking first among the participants. Code is available at https://github.com/xiaoxsparraw/CLEF2023.
    Abstract The SnakeCLEF2023 competition aims at the development of advanced algorithms for snake species identification through the analysis of images and accompanying metadata. This paper presents a method leveraging both images and metadata. Modern CNN models and strong data augmentation are utilized to learn better representations of images. To relieve the challenge of the long-tailed distribution, seesaw loss is utilized in our method. We also design a light model to calculate prior probabilities using metadata features extracted from CLIP in the post-processing stage. Besides, we attach more importance to venomous species by assigning venomous species labels to some examples that the model is uncertain about. Our method achieves a final-metric score of 91.31% (combining F1 and other metrics) on the private leaderboard, which is the 1st place among the participants. The code is available at https://github.com/xiaoxsparraw/CLEF2023.

CPCM: Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.10316
  • repo_url: https://github.com/lizhaoliu-Lec/CPCM
  • paper_authors: Lizhao Liu, Zhuangwei Zhuang, Shangxin Huang, Xunlong Xiao, Tianhang Xiang, Cen Chen, Jingdong Wang, Mingkui Tan
  • for: This work studies weakly supervised point cloud semantic segmentation with extremely sparse annotations, a key problem for scene understanding.
  • methods: A simple yet effective Contextual Point Cloud Modeling (CPCM) method consisting of two parts: a region-wise masking (RegionMask) strategy and a contextual masked training (CMT) method.
  • results: Experimental results show that CPCM outperforms state-of-the-art methods for weakly supervised point cloud semantic segmentation on the ScanNet V2 and S3DIS benchmarks.
    Abstract We study the task of weakly-supervised point cloud semantic segmentation with sparse annotations (e.g., less than 0.1% of points are labeled), aiming to reduce the expensive cost of dense annotations. Unfortunately, with extremely sparse annotated points, it is very difficult to extract both contextual and object information for scene understanding such as semantic segmentation. Motivated by masked modeling (e.g., MAE) in image and video representation learning, we seek to endow the power of masked modeling to learn contextual information from sparsely-annotated points. However, directly applying MAE to 3D point clouds with sparse annotations may fail to work. First, it is nontrivial to effectively mask out the informative visual context from 3D point clouds. Second, how to fully exploit the sparse annotations for context modeling remains an open question. In this paper, we propose a simple yet effective Contextual Point Cloud Modeling (CPCM) method that consists of two parts: a region-wise masking (RegionMask) strategy and a contextual masked training (CMT) method. Specifically, RegionMask masks the point cloud continuously in geometric space to construct a meaningful masked prediction task for subsequent context learning. CMT disentangles the learning of supervised segmentation and unsupervised masked context prediction for effectively learning from the very limited labeled points and the massive unlabeled points, respectively. Extensive experiments on the widely-tested ScanNet V2 and S3DIS benchmarks demonstrate the superiority of CPCM over the state-of-the-art.
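
RegionMask's defining property is that masked points form contiguous geometric regions rather than scattered holes. The sketch below grows masks around random seed points; the region size, mask ratio, and growth rule are illustrative assumptions:

```python
import numpy as np

def region_mask(points, mask_ratio=0.4, region_size=512, rng=None):
    # points: (N, 3) coordinates of the point cloud.
    rng = rng if rng is not None else np.random.default_rng()
    n = points.shape[0]
    masked = np.zeros(n, dtype=bool)
    while masked.sum() < mask_ratio * n:
        seed = points[rng.integers(n)]
        d = np.linalg.norm(points - seed, axis=1)
        masked[np.argsort(d)[:region_size]] = True   # hide one contiguous region
    return masked   # True = masked; predicted during contextual masked training
```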

Improved Distribution Matching for Dataset Condensation

  • paper_url: http://arxiv.org/abs/2307.09742
  • repo_url: https://github.com/uitrbn/idm
  • paper_authors: Ganlong Zhao, Guanbin Li, Yipeng Qin, Yizhou Yu
  • for: This paper aims to condense large deep learning datasets into smaller ones that still train well-performing models, reducing storage and training costs.
  • methods: A dataset condensation method based on distribution matching, which is more efficient and promising than optimization-oriented approaches. The paper identifies two important shortcomings of naive distribution matching (imbalanced feature numbers and unvalidated embeddings for distance computation) and resolves them with three novel techniques: partitioning and expansion augmentation, efficient and enriched model sampling, and class-aware distribution regularization.
  • results: The method outperforms most previous optimization-oriented methods with far fewer computational resources, scaling dataset condensation to larger datasets and models; extensive experiments demonstrate its effectiveness.
    Abstract Dataset Condensation aims to condense a large dataset into a smaller one while maintaining its ability to train a well-performing model, thus reducing the storage cost and training effort in deep learning applications. However, conventional dataset condensation methods are optimization-oriented and condense the dataset by performing gradient or parameter matching during model optimization, which is computationally intensive even on small datasets and models. In this paper, we propose a novel dataset condensation method based on distribution matching, which is more efficient and promising. Specifically, we identify two important shortcomings of naive distribution matching (i.e., imbalanced feature numbers and unvalidated embeddings for distance computation) and address them with three novel techniques (i.e., partitioning and expansion augmentation, efficient and enriched model sampling, and class-aware distribution regularization). Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources, thereby scaling data condensation to larger datasets and models. Extensive experiments demonstrate the effectiveness of our method. Codes are available at https://github.com/uitrbn/IDM
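
The distribution-matching objective at the heart of this family of methods optimizes synthetic images so their feature statistics match the real data under randomly sampled embedding networks. A minimal sketch of plain mean-embedding matching, without the paper's three corrective techniques:

```python
import torch

def distribution_matching_loss(real_imgs, syn_imgs, feature_nets):
    # syn_imgs requires gradients; feature_nets are sampled embedding networks.
    loss = 0.0
    for net in feature_nets:
        real_mean = net(real_imgs).mean(dim=0).detach()  # real feature statistics
        syn_mean = net(syn_imgs).mean(dim=0)             # synthetic feature statistics
        loss = loss + ((real_mean - syn_mean) ** 2).sum()
    return loss / len(feature_nets)   # backpropagates into the synthetic pixels
```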

ClickSeg: 3D Instance Segmentation with Click-Level Weak Annotations

  • paper_url: http://arxiv.org/abs/2307.09732
  • repo_url: None
  • paper_authors: Leyao Liu, Tao Kong, Minzhao Zhu, Jiashuo Fan, Lu Fang
  • for: This paper proposes ClickSeg, a weakly supervised 3D instance segmentation method that requires only one annotated point per instance.
  • methods: A baseline weakly supervised training method generates pseudo labels for unlabeled data from the model itself; on top of it, a new training framework replaces mean-shift clustering with k-means using the annotated points as fixed initial seeds, together with newly designed similarity metrics for clustering.
  • results: ClickSeg surpasses the previous best weakly supervised instance segmentation result on the ScanNetV2 and S3DIS datasets by a large margin (e.g., +9.4% mAP on ScanNetV2). With merely 0.02% supervision signals, it achieves ~90% of the accuracy of its fully supervised counterpart, along with state-of-the-art semantic segmentation results among weakly supervised methods under the same annotation setting.
    Abstract 3D instance segmentation methods often require fully-annotated dense labels for training, which are costly to obtain. In this paper, we present ClickSeg, a novel click-level weakly supervised 3D instance segmentation method that requires one point per instance annotation merely. Such a problem is very challenging due to the extremely limited labels, which has rarely been solved before. We first develop a baseline weakly-supervised training method, which generates pseudo labels for unlabeled data by the model itself. To utilize the property of click-level annotation setting, we further propose a new training framework. Instead of directly using the model inference way, i.e., mean-shift clustering, to generate the pseudo labels, we propose to use k-means with fixed initial seeds: the annotated points. New similarity metrics are further designed for clustering. Experiments on ScanNetV2 and S3DIS datasets show that the proposed ClickSeg surpasses the previous best weakly supervised instance segmentation result by a large margin (e.g., +9.4% mAP on ScanNetV2). Using 0.02% supervision signals merely, ClickSeg achieves $\sim$90% of the accuracy of the fully-supervised counterpart. Meanwhile, it also achieves state-of-the-art semantic segmentation results among weakly supervised methods that use the same annotation settings.
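
The pseudo-label generator is essentially k-means whose cluster count and initial centroids are fixed by the clicks. A simplified sketch follows, with plain Euclidean distance standing in for the paper's learned similarity metrics:

```python
import numpy as np

def kmeans_fixed_seeds(feats, seed_idx, num_iters=10):
    """k-means whose cluster count and initial centroids are given by the
    click-annotated points. feats: (N, D) per-point features;
    seed_idx: indices of the annotated points. Returns (N,) pseudo labels."""
    centroids = feats[seed_idx].copy()
    for _ in range(num_iters):
        # Assign every point to the nearest centroid (one cluster per click).
        dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Standard centroid update; fall back to the seed if a cluster empties.
        for k, s in enumerate(seed_idx):
            members = feats[assign == k]
            centroids[k] = members.mean(axis=0) if len(members) else feats[s]
    return assign
```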

NTIRE 2023 Quality Assessment of Video Enhancement Challenge

  • paper_url: http://arxiv.org/abs/2307.09729
  • repo_url: None
  • paper_authors: Xiaohong Liu, Xiongkuo Min, Wei Sun, Yulun Zhang, Kai Zhang, Radu Timofte, Guangtao Zhai, Yixuan Gao, Yuqin Cao, Tengchuan Kou, Yunlong Dong, Ziheng Jia, Yilin Li, Wei Wu, Shuming Hu, Sibin Deng, Pengxiang Xiao, Ying Chen, Kai Li, Kai Zhao, Kun Yuan, Ming Sun, Heng Cong, Hao Wang, Lingzhi Fu, Yusheng Zhang, Rongyu Zhang, Hang Shi, Qihang Xu, Longan Xiao, Zhiliang Ma, Mirko Agarla, Luigi Celona, Claudio Rota, Raimondo Schettini, Zhiwei Huang, Yanan Li, Xiaotao Wang, Lei Lei, Hongye Liu, Wei Hong, Ironhead Chuang, Allen Lin, Drake Guan, Iris Chen, Kae Lou, Willy Huang, Yachun Tasi, Yvonne Kao, Haotian Fan, Fangyuan Kong, Shiqi Zhou, Hao Liu, Yu Lai, Shanshan Chen, Wenqi Wang, Haoning Wu, Chaofeng Chen, Chunzheng Zhu, Zekun Guo, Shiling Zhao, Haibing Yin, Hongkui Wang, Hanene Brachemi Meftah, Sid Ahmed Fezza, Wassim Hamidouche, Olivier Déforges, Tengfei Shi, Azadeh Mansouri, Hossein Motamednia, Amir Hossein Bakhtiari, Ahmad Mahmoudi Aznaveh
  • for: This challenge aims to advance video quality assessment (VQA) in video processing, specifically for enhanced videos.
  • methods: It uses the VQA Dataset for Perceptual Video Enhancement (VDPVE), comprising 600 videos with color, brightness, and contrast enhancements, 310 deblurred videos, and 301 deshaked videos; participants submitted both baseline-derived and self-developed models.
  • results: Several methods surpassed the baselines, and the winning methods demonstrated superior prediction performance.
    Abstract This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to address a major challenge in the field of video processing, namely, video quality assessment (VQA) for enhanced videos. The challenge uses the VQA Dataset for Perceptual Video Enhancement (VDPVE), which has a total of 1211 enhanced videos, including 600 videos with color, brightness, and contrast enhancements, 310 videos with deblurring, and 301 deshaked videos. The challenge has a total of 167 registered participants. 61 participating teams submitted their prediction results during the development phase, with a total of 3168 submissions. A total of 176 submissions were submitted by 37 participating teams during the final testing phase. Finally, 19 participating teams submitted their models and fact sheets, and detailed the methods they used. Some methods have achieved better results than baseline methods, and the winning methods have demonstrated superior prediction performance.

Uncertainty-Driven Multi-Scale Feature Fusion Network for Real-time Image Deraining

  • paper_url: http://arxiv.org/abs/2307.09728
  • repo_url: None
  • paper_authors: Ming Tong, Xuefeng Yan, Yongzhen Wang
  • for: Improving the performance of visual measurement systems degraded by rainy weather.
  • methods: An Uncertainty-Driven Multi-Scale Feature Fusion Network (UMFFNet) that learns the probability mapping distribution between paired images to estimate uncertainty; an uncertainty feature fusion block (UFFB) uses the uncertainty information to dynamically enhance acquired features and focus on blurry regions obscured by rain streaks, and features from multiple scales are fused for efficient collaborative rain removal.
  • results: Experiments show that UMFFNet achieves significant performance improvements with few parameters, reducing prediction errors and surpassing state-of-the-art image deraining methods.
    Abstract Visual-based measurement systems are frequently affected by rainy weather due to the degradation caused by rain streaks in captured images, and existing imaging devices struggle to address this issue in real-time. While most efforts leverage deep networks for image deraining and have made progress, their large parameter sizes hinder deployment on resource-constrained devices. Additionally, these data-driven models often produce deterministic results, without considering their inherent epistemic uncertainty, which can lead to undesired reconstruction errors. Well-calibrated uncertainty can help alleviate prediction errors and assist measurement devices in mitigating risks and improving usability. Therefore, we propose an Uncertainty-Driven Multi-Scale Feature Fusion Network (UMFFNet) that learns the probability mapping distribution between paired images to estimate uncertainty. Specifically, we introduce an uncertainty feature fusion block (UFFB) that utilizes uncertainty information to dynamically enhance acquired features and focus on blurry regions obscured by rain streaks, reducing prediction errors. In addition, to further boost the performance of UMFFNet, we fused feature information from multiple scales to guide the network for efficient collaborative rain removal. Extensive experiments demonstrate that UMFFNet achieves significant performance improvements with few parameters, surpassing state-of-the-art image deraining methods.
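
The abstract does not spell out the uncertainty estimator, but a standard way to realize "learning the probability mapping distribution between paired images" is heteroscedastic regression: predict a per-pixel log-variance alongside the derained image and weight the reconstruction loss by it. The sketch below shows that generic recipe; treat it as our assumption about the mechanism, not UMFFNet's exact formulation.

```python
import torch

def uncertainty_weighted_l1(pred, log_var, target):
    """Heteroscedastic reconstruction loss: the network outputs a derained
    image `pred` and a per-pixel log-variance `log_var`. Uncertain pixels are
    down-weighted, while the +log_var term penalizes blanket uncertainty."""
    return (torch.exp(-log_var) * (pred - target).abs() + log_var).mean()
```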

SAMConvex: Fast Discrete Optimization for CT Registration using Self-supervised Anatomical Embedding and Correlation Pyramid

  • paper_url: http://arxiv.org/abs/2307.09727
  • repo_url: None
  • paper_authors: Zi Li, Lin Tian, Tony C. W. Mok, Xiaoyu Bai, Puyang Wang, Jia Ge, Jingren Zhou, Le Lu, Xianghua Ye, Ke Yan, Dakai Jin
  • for: This paper proposes a fast yet accurate CT registration method for high-precision medical image processing.
  • methods: It introduces SAMConvex, a fast coarse-to-fine discrete optimization scheme: a self-supervised anatomical embedding (SAM) feature extractor captures both local and global information, per-voxel features are used to build 6D correlation volumes, and a flow field is iteratively updated by lookups on these volumes in a coarse-to-fine manner.
  • results: SAMConvex outperforms state-of-the-art learning-based and optimization-based methods on two inter-patient CT registration datasets (Abdomen CT and HeadNeck CT) and one intra-patient dataset (Lung CT), while taking only ~2s per image pair (~5s with instance optimization).
    Abstract Estimating displacement vector field via a cost volume computed in the feature space has shown great success in image registration, but it suffers excessive computation burdens. Moreover, existing feature descriptors only extract local features incapable of representing the global semantic information, which is especially important for solving large transformations. To address the discussed issues, we propose SAMConvex, a fast coarse-to-fine discrete optimization method for CT registration that includes a decoupled convex optimization procedure to obtain deformation fields based on a self-supervised anatomical embedding (SAM) feature extractor that captures both local and global information. To be specific, SAMConvex extracts per-voxel features and builds 6D correlation volumes based on SAM features, and iteratively updates a flow field by performing lookups on the correlation volumes with a coarse-to-fine scheme. SAMConvex outperforms the state-of-the-art learning-based methods and optimization-based methods over two inter-patient registration datasets (Abdomen CT and HeadNeck CT) and one intra-patient registration dataset (Lung CT). Moreover, as an optimization-based method, SAMConvex only takes $\sim2$s ($\sim5s$ with instance optimization) for one paired images.

AesPA-Net: Aesthetic Pattern-Aware Style Transfer Networks

  • paper_url: http://arxiv.org/abs/2307.09724
  • repo_url: https://github.com/kibeom-hong/aespa-net
  • paper_authors: Kibeom Hong, Seogkyu Jeon, Junsoo Lee, Namhyuk Ahn, Kunhee Kim, Pilhyeon Lee, Daesik Kim, Youngjung Uh, Hyeran Byun
  • for: This paper addresses artistic style transfer, i.e., rendering a content image in the style of an artwork.
  • methods: It builds on an attention mechanism that maps local patches of the style image to corresponding patches of the content image. It also introduces a novel metric, pattern repeatability, that quantifies the repetition of patterns in the style image, plus a self-supervisory task that encourages the attention mechanism to learn precise and meaningful semantic correspondence, and a patch-wise style loss that transfers the elaborate rhythm of local patterns.
  • results: Qualitative and quantitative evaluations verify that pattern repeatability aligns with human perception and demonstrate the superiority of the proposed framework over existing methods.
    Abstract To deliver the artistic expression of the target style, recent studies exploit the attention mechanism owing to its ability to map the local patches of the style image to the corresponding patches of the content image. However, because of the low semantic correspondence between arbitrary content and artworks, the attention module repeatedly abuses specific local patches from the style image, resulting in disharmonious and evident repetitive artifacts. To overcome this limitation and accomplish impeccable artistic style transfer, we focus on enhancing the attention mechanism and capturing the rhythm of patterns that organize the style. In this paper, we introduce a novel metric, namely pattern repeatability, that quantifies the repetition of patterns in the style image. Based on the pattern repeatability, we propose Aesthetic Pattern-Aware style transfer Networks (AesPA-Net) that discover the sweet spot of local and global style expressions. In addition, we propose a novel self-supervisory task to encourage the attention mechanism to learn precise and meaningful semantic correspondence. Lastly, we introduce the patch-wise style loss to transfer the elaborate rhythm of local patterns. Through qualitative and quantitative evaluations, we verify the reliability of the proposed pattern repeatability that aligns with human perception, and demonstrate the superiority of the proposed framework.

Semantic-Aware Dual Contrastive Learning for Multi-label Image Classification

  • paper_url: http://arxiv.org/abs/2307.09715
  • repo_url: https://github.com/yu-gi-oh-leilei/sadcl
  • paper_authors: Leilei Ma, Dengdi Sun, Lei Wang, Haifeng Zhao, Bin Luo
  • for: Improving the extraction and classification of image semantics for natural images containing multiple objects or attributes.
  • methods: A semantic-aware dual contrastive learning framework that combines sample-to-sample contrastive learning (SSCL) with prototype-to-sample contrastive learning (PSCL): semantic-aware representation learning extracts category-related local discriminative features and constructs category prototypes; SSCL aggregates label-level visual representations of the same category while separating distinct categories, and PSCL pulls positive samples toward their category prototypes and pushes negatives away.
  • results: Experiments on five challenging large-scale public datasets show the method outperforms state-of-the-art approaches and accurately captures the discriminative label-level features related to the image content.
    Abstract Extracting image semantics effectively and assigning corresponding labels to multiple objects or attributes for natural images is challenging due to the complex scene contents and confusing label dependencies. Recent works have focused on modeling label relationships with graph and understanding object regions using class activation maps (CAM). However, these methods ignore the complex intra- and inter-category relationships among specific semantic features, and CAM is prone to generate noisy information. To this end, we propose a novel semantic-aware dual contrastive learning framework that incorporates sample-to-sample contrastive learning (SSCL) as well as prototype-to-sample contrastive learning (PSCL). Specifically, we leverage semantic-aware representation learning to extract category-related local discriminative features and construct category prototypes. Then based on SSCL, label-level visual representations of the same category are aggregated together, and features belonging to distinct categories are separated. Meanwhile, we construct a novel PSCL module to narrow the distance between positive samples and category prototypes and push negative samples away from the corresponding category prototypes. Finally, the discriminative label-level features related to the image content are accurately captured by the joint training of the above three parts. Experiments on five challenging large-scale public datasets demonstrate that our proposed method is effective and outperforms the state-of-the-art methods. Code and supplementary materials are released on https://github.com/yu-gi-oh-leilei/SADCL.
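
The PSCL part has a compact generic form: treat the category prototypes as the keys of an InfoNCE-style objective, so each label-level feature is pulled toward its own prototype and pushed away from the others. A hedged PyTorch sketch (the temperature and normalization choices are our assumptions):

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(feats, prototypes, labels, tau=0.1):
    """Pull each label-level feature toward its category prototype and push
    it away from the other prototypes (InfoNCE with prototypes as keys).
    feats: (N, D); prototypes: (C, D); labels: (N,) class indices."""
    feats = F.normalize(feats, dim=1)
    prototypes = F.normalize(prototypes, dim=1)
    logits = feats @ prototypes.t() / tau  # (N, C) cosine similarities
    return F.cross_entropy(logits, labels)
```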

Towards Saner Deep Image Registration

  • paper_url: http://arxiv.org/abs/2307.09696
  • repo_url: https://github.com/tuffr5/saner-deep-registration
  • paper_authors: Bin Duan, Ming Zhong, Yan Yan
  • for: This paper examines the behavior of learning-based deep image registration methods, particularly their sanity for medical imaging, beyond headline accuracy.
  • methods: It proposes a regularization-based sanity-enforcer that imposes two sanity checks on the deep model, reducing its inverse-consistency errors while increasing its discriminative power.
  • results: Experiments support the derived theoretical guarantees and show the method improves model sanity without sacrificing any performance.
    Abstract With recent advances in computing hardware and surges of deep-learning architectures, learning-based deep image registration methods have surpassed their traditional counterparts, in terms of metric performance and inference time. However, these methods focus on improving performance measurements such as Dice, resulting in less attention given to model behaviors that are equally desirable for registrations, especially for medical imaging. This paper investigates these behaviors for popular learning-based deep registrations under a sanity-checking microscope. We find that most existing registrations suffer from low inverse consistency and nondiscrimination of identical pairs due to overly optimized image similarities. To rectify these behaviors, we propose a novel regularization-based sanity-enforcer method that imposes two sanity checks on the deep model to reduce its inverse consistency errors and increase its discriminative power simultaneously. Moreover, we derive a set of theoretical guarantees for our sanity-checked image registration method, with experimental results supporting our theoretical findings and their effectiveness in increasing the sanity of models without sacrificing any performance. Our code and models are available at https://github.com/tuffr5/Saner-deep-registration.
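
Inverse consistency is one of the two sanity properties at stake: composing the forward and backward deformations should return (close to) the identity map. Below is a 2D PyTorch sketch of that error measure, assuming displacement fields in grid_sample's normalized (x, y) convention; it is a generic check, not the paper's exact regularizer.

```python
import torch
import torch.nn.functional as F

def inverse_consistency_error(fwd, bwd):
    """Mean norm of the composed forward/backward displacements, which is
    zero for a perfectly inverse-consistent pair. fwd, bwd: (1, 2, H, W)
    displacement fields with (x, y) channels in normalized [-1, 1] units."""
    _, _, H, W = fwd.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    identity = torch.stack([xs, ys], dim=-1).unsqueeze(0)  # (1, H, W, 2)
    # Sample the backward field at the forward-warped locations:
    # ideally bwd(x + fwd(x)) + fwd(x) == 0 everywhere.
    bwd_at_fwd = F.grid_sample(bwd, identity + fwd.permute(0, 2, 3, 1),
                               align_corners=True)
    return (fwd + bwd_at_fwd).norm(dim=1).mean()
```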

GlobalMapper: Arbitrary-Shaped Urban Layout Generation

  • paper_url: http://arxiv.org/abs/2307.09693
  • repo_url: None
  • paper_authors: Liu He, Daniel Aliaga
  • for: Layout generation for urban buildings, using graph attention networks for fully automatic generation of realistic city blocks with arbitrary road networks and building shapes.
  • methods: Using graph attention networks to skeletonize building layouts and generate realistic urban layouts, with conditional generation based on learned priors.
  • results: Superior performance compared to prior layout generation networks, demonstrated through generating layouts for 28 large cities with arbitrary city block and building shapes.
    Abstract Modeling and designing urban building layouts is of significant interest in computer vision, computer graphics, and urban applications. A building layout consists of a set of buildings in city blocks defined by a network of roads. We observe that building layouts are discrete structures, consisting of multiple rows of buildings of various shapes, and are amenable to skeletonization for mapping arbitrary city block shapes to a canonical form. Hence, we propose a fully automatic approach to building layout generation using graph attention networks. Our method generates realistic urban layouts given arbitrary road networks, and enables conditional generation based on learned priors. Our results, including user study, demonstrate superior performance as compared to prior layout generation networks, support arbitrary city block and varying building shapes as demonstrated by generating layouts for 28 large cities.

Domain Adaptation based Enhanced Detection for Autonomous Driving in Foggy and Rainy Weather

  • paper_url: http://arxiv.org/abs/2307.09676
  • repo_url: None
  • paper_authors: Jinlong Li, Runsheng Xu, Jin Ma, Qin Zou, Jiaqi Ma, Hongkai Yu
  • for: Addressing a key detection bottleneck for autonomous driving: object detectors trained under clear weather may not perform well in foggy and rainy conditions because of the domain gap.
  • methods: A domain-adaptive object detection framework with adaptations at both the image level and the object level to minimize differences in image style and object appearance across domains; a novel adversarial gradient reversal layer additionally performs adversarial mining on difficult instances, and data augmentation generates an auxiliary domain to enforce a new domain-level metric regularization.
  • results: Experiments on a public V2V benchmark show a substantial improvement in object detection for foggy and rainy driving scenarios.
    Abstract Typically, object detection methods for autonomous driving that rely on supervised learning make the assumption of a consistent feature distribution between the training and testing data, however such assumption may fail in different weather conditions. Due to the domain gap, a detection model trained under clear weather may not perform well in foggy and rainy conditions. Overcoming detection bottlenecks in foggy and rainy weather is a real challenge for autonomous vehicles deployed in the wild. To bridge the domain gap and improve the performance of object detectionin foggy and rainy weather, this paper presents a novel framework for domain-adaptive object detection. The adaptations at both the image-level and object-level are intended to minimize the differences in image style and object appearance between domains. Furthermore, in order to improve the model's performance on challenging examples, we introduce a novel adversarial gradient reversal layer that conducts adversarial mining on difficult instances in addition to domain adaptation. Additionally, we suggest generating an auxiliary domain through data augmentation to enforce a new domain-level metric regularization. Experimental findings on public V2V benchmark exhibit a substantial enhancement in object detection specifically for foggy and rainy driving scenarios.
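
The adversarial piece extends the classic gradient reversal layer (GRL): identity in the forward pass, negated (scaled) gradients in the backward pass, so features are trained to fool a domain classifier. The sketch below shows only this standard building block; the paper's novelty, adversarial mining of hard instances on top of it, is not reproduced here.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity forward, -lambda * gradient backward,
    so the feature extractor learns to fool the domain classifier."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One gradient per forward input; the scalar `lam` gets none.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```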

Object-aware Gaze Target Detection

  • paper_url: http://arxiv.org/abs/2307.09662
  • repo_url: https://github.com/francescotonini/object-aware-gaze-target-detection
  • paper_authors: Francesco Tonini, Nicola Dall’Asen, Cigdem Beyan, Elisa Ricci
  • for: This paper proposes a gaze target detection method that predicts where a person in a scene is looking and the probability that the gaze is out of frame.
  • methods: A Transformer-based architecture automatically detects objects in the scene (including heads) and builds associations between every head and the gazed head/object, yielding a comprehensive, explainable gaze analysis: the gaze target area, the gaze pixel point, and the class and image location of the gazed object.
  • results: On in-the-wild benchmarks, the method achieves state-of-the-art results on all gaze target detection metrics (up to a 2.91% gain in AUC, a 50% reduction in gaze distance, and a 9% gain in out-of-frame average precision), plus an 11-13% improvement in average precision for classifying and localizing the gazed objects.
    Abstract Gaze target detection aims to predict the image location where the person is looking and the probability that a gaze is out of the scene. Several works have tackled this task by regressing a gaze heatmap centered on the gaze location, however, they overlooked decoding the relationship between the people and the gazed objects. This paper proposes a Transformer-based architecture that automatically detects objects (including heads) in the scene to build associations between every head and the gazed-head/object, resulting in a comprehensive, explainable gaze analysis composed of: gaze target area, gaze pixel point, the class and the image location of the gazed-object. Upon evaluation of the in-the-wild benchmarks, our method achieves state-of-the-art results on all metrics (up to 2.91% gain in AUC, 50% reduction in gaze distance, and 9% gain in out-of-frame average precision) for gaze target detection and 11-13% improvement in average precision for the classification and the localization of the gazed-objects. The code of the proposed method is available https://github.com/francescotonini/object-aware-gaze-target-detection

Skin Lesion Correspondence Localization in Total Body Photography

  • paper_url: http://arxiv.org/abs/2307.09642
  • repo_url: https://github.com/weilunhuang-jhu/lesioncorrespondencetbp3d
  • paper_authors: Wei-Lun Huang, Davood Tashayyod, Jun Kang, Amir Gandjbakhche, Michael Kazhdan, Mehran Armand
  • for: Longitudinal tracking of skin lesions - finding correspondences and changes in morphology and texture - benefits early melanoma detection, but has not been well investigated for total body photography.
  • methods: A novel framework combines geometric and texture information to localize lesion correspondences from a source scan to a target scan. Body landmarks or sparse correspondences are first created on the source and target 3D textured meshes; every vertex on each mesh is then mapped to a feature vector of geodesic distances to the landmarks on that mesh. For each lesion of interest on the source, its location on the target is coarsely estimated from this geometric encoding and then refined using texture information.
  • results: Quantitative evaluation on a public and a private dataset gives success rates (at the 10 mm criterion) comparable to the only reported longitudinal study. As full-body 3D capture becomes more prevalent and higher quality, the method should become a valuable step in longitudinal lesion tracking.
    Abstract Longitudinal tracking of skin lesions - finding correspondence, changes in morphology, and texture - is beneficial to the early detection of melanoma. However, it has not been well investigated in the context of full-body imaging. We propose a novel framework combining geometric and texture information to localize skin lesion correspondence from a source scan to a target scan in total body photography (TBP). Body landmarks or sparse correspondence are first created on the source and target 3D textured meshes. Every vertex on each of the meshes is then mapped to a feature vector characterizing the geodesic distances to the landmarks on that mesh. Then, for each lesion of interest (LOI) on the source, its corresponding location on the target is first coarsely estimated using the geometric information encoded in the feature vectors and then refined using the texture information. We evaluated the framework quantitatively on both a public and a private dataset, for which our success rates (at 10 mm criterion) are comparable to the only reported longitudinal study. As full-body 3D capture becomes more prevalent and has higher quality, we expect the proposed method to constitute a valuable step in the longitudinal tracking of skin lesions.
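
The geometric half of the pipeline reduces to a per-vertex descriptor of geodesic distances to the landmarks. A small SciPy sketch that approximates geodesics with shortest paths along the mesh edge graph (the paper does not commit to this particular approximation):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def geodesic_descriptors(verts, edges, landmark_idx):
    """Per-vertex feature vector of (approximate) geodesic distances to the
    body landmarks, via shortest paths on the mesh edge graph.
    verts: (V, 3); edges: (E, 2) vertex-index pairs; landmark_idx: (L,).
    Returns a (V, L) array: one distance-to-landmark feature per vertex."""
    w = np.linalg.norm(verts[edges[:, 0]] - verts[edges[:, 1]], axis=1)
    i, j = edges[:, 0], edges[:, 1]
    graph = csr_matrix((np.r_[w, w], (np.r_[i, j], np.r_[j, i])),
                       shape=(len(verts), len(verts)))
    # dijkstra returns (L, V) distances from each landmark; transpose to (V, L).
    return dijkstra(graph, indices=landmark_idx).T
```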

Conditional 360-degree Image Synthesis for Immersive Indoor Scene Decoration

  • paper_url: http://arxiv.org/abs/2307.09621
  • repo_url: https://github.com/kcshum/neural_360_decoration
  • paper_authors: Ka Chun Shum, Hong-Wing Pang, Binh-Son Hua, Duc Thanh Nguyen, Sai-Kit Yeung
  • for: This paper targets conditional scene decoration for 360-degree images: given a 360-degree background photograph of an indoor scene, generate decorated panorama views of the same scene.
  • methods: A 360-aware object layout generator learns latent object vectors in the 360-degree view to enable varied furniture arrangements for an input background image; this layout conditions a generative adversarial network (GAN) that synthesizes images of the scene. A simple yet effective scene emptier removes the generated furniture, producing an emptied scene from which the model learns a cyclic constraint.
  • results: Trained on the Structure3D dataset, the model generates diverse decorations with controllable object layouts, achieves state-of-the-art performance on Structure3D, and generalizes well to the Zillow indoor scene dataset; a user study confirms the immersive experience afforded by the realistic image quality and furniture layouts.
    Abstract In this paper, we address the problem of conditional scene decoration for 360-degree images. Our method takes a 360-degree background photograph of an indoor scene and generates decorated images of the same scene in the panorama view. To do this, we develop a 360-aware object layout generator that learns latent object vectors in the 360-degree view to enable a variety of furniture arrangements for an input 360-degree background image. We use this object layout to condition a generative adversarial network to synthesize images of an input scene. To further reinforce the generation capability of our model, we develop a simple yet effective scene emptier that removes the generated furniture and produces an emptied scene for our model to learn a cyclic constraint. We train the model on the Structure3D dataset and show that our model can generate diverse decorations with controllable object layout. Our method achieves state-of-the-art performance on the Structure3D dataset and generalizes well to the Zillow indoor scene dataset. Our user study confirms the immersive experiences provided by the realistic image quality and furniture layout in our generation results. Our implementation will be made available.

Surgical Action Triplet Detection by Mixed Supervised Learning of Instrument-Tissue Interactions

  • paper_url: http://arxiv.org/abs/2307.09548
  • repo_url: None
  • paper_authors: Saurav Sharma, Chinedu Innocent Nwoye, Didier Mutter, Nicolas Padoy
  • for: This paper proposes a multi-class instrument-aware transformer with an interaction graph for detecting surgical action triplets.
  • methods: The MCIT-IG model has two stages: the MCIT stage models per-class embeddings of the targets as additional features to reduce the risk of mis-associating triplets, and the IG stage constructs a bipartite dynamic graph to model instrument-target interactions, cast as the verbs. Training uses a mixed-supervised strategy combining weak target-presence labels (for MCIT) with pseudo triplet labels (for IG).
  • results: Evaluated on the CholecT50 dataset, the model improves both instrument localization and triplet detection, topping the leaderboard of the CholecTriplet challenge at MICCAI 2022.
    Abstract Surgical action triplets describe instrument-tissue interactions as (instrument, verb, target) combinations, thereby supporting a detailed analysis of surgical scene activities and workflow. This work focuses on surgical action triplet detection, which is challenging but more precise than the traditional triplet recognition task as it consists of joint (1) localization of surgical instruments and (2) recognition of the surgical action triplet associated with every localized instrument. Triplet detection is highly complex due to the lack of spatial triplet annotation. We analyze how the amount of instrument spatial annotations affects triplet detection and observe that accurate instrument localization does not guarantee better triplet detection due to the risk of erroneous associations with the verbs and targets. To solve the two tasks, we propose MCIT-IG, a two-stage network, that stands for Multi-Class Instrument-aware Transformer-Interaction Graph. The MCIT stage of our network models per class embedding of the targets as additional features to reduce the risk of misassociating triplets. Furthermore, the IG stage constructs a bipartite dynamic graph to model the interaction between the instruments and targets, cast as the verbs. We utilize a mixed-supervised learning strategy that combines weak target presence labels for MCIT and pseudo triplet labels for IG to train our network. We observed that complementing minimal instrument spatial annotations with target embeddings results in better triplet detection. We evaluate our model on the CholecT50 dataset and show improved performance on both instrument localization and triplet detection, topping the leaderboard of the CholecTriplet challenge in MICCAI 2022.

Can Neural Network Memorization Be Localized?

  • paper_url: http://arxiv.org/abs/2307.09542
  • repo_url: https://github.com/pratyushmaini/localizing-memorization
  • paper_authors: Pratyush Maini, Michael C. Mozer, Hanie Sedghi, Zachary C. Lipton, J. Zico Kolter, Chiyuan Zhang
  • for: This paper investigates the interplay of memorization and generalization in deep overparametrized networks.
  • methods: Three experimental sources of converging evidence - gradient accounting, layer rewinding, and retraining - show that memorization is not confined to the final layers but to a small set of neurons distributed across various layers.
  • results: Memorization is often confined to a small number of neurons or channels (around 5); a proposed example-tied dropout directs memorization to an a-priori determined set of neurons, and dropping those neurons cuts accuracy on memorized examples from 100% to 3% while also reducing the generalization gap.
    Abstract Recent efforts at explaining the interplay of memorization and generalization in deep overparametrized networks have posited that neural networks $\textit{memorize}$ "hard" examples in the final few layers of the model. Memorization refers to the ability to correctly predict on $\textit{atypical}$ examples of the training set. In this work, we show that rather than being confined to individual layers, memorization is a phenomenon confined to a small set of neurons in various layers of the model. First, via three experimental sources of converging evidence, we find that most layers are redundant for the memorization of examples and the layers that contribute to example memorization are, in general, not the final layers. The three sources are $\textit{gradient accounting}$ (measuring the contribution to the gradient norms from memorized and clean examples), $\textit{layer rewinding}$ (replacing specific model weights of a converged model with previous training checkpoints), and $\textit{retraining}$ (training rewound layers only on clean examples). Second, we ask a more generic question: can memorization be localized $\textit{anywhere}$ in a model? We discover that memorization is often confined to a small number of neurons or channels (around 5) of the model. Based on these insights we propose a new form of dropout -- $\textit{example-tied dropout}$ that enables us to direct the memorization of examples to an apriori determined set of neurons. By dropping out these neurons, we are able to reduce the accuracy on memorized examples from $100\%\to3\%$, while also reducing the generalization gap.
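
Example-tied dropout is simple to prototype. The hedged sketch below keeps a small reserved slice of channels that stays active during training and is zeroed at evaluation, which is enough to reproduce the "drop the memorization neurons" ablation; the paper's full mechanism additionally ties the reserved neurons to individual training examples, which this sketch omits.

```python
import torch
import torch.nn as nn

class ExampleTiedDropout(nn.Module):
    """Reserve a small slice of channels for memorization: active during
    training, zeroed at evaluation, so dropping them removes what was
    memorized through that slice."""

    def __init__(self, num_channels, mem_channels=5):
        super().__init__()
        mask = torch.ones(num_channels)
        mask[-mem_channels:] = 0.0  # the reserved memorization channels
        self.register_buffer("eval_mask", mask)

    def forward(self, x):  # x: (B, C, ...)
        if self.training:
            return x  # memorization channels stay on while training
        shape = [1, -1] + [1] * (x.dim() - 2)
        return x * self.eval_mask.view(*shape)
```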

Adversarial Bayesian Augmentation for Single-Source Domain Generalization

  • paper_url: http://arxiv.org/abs/2307.09520
  • repo_url: None
  • paper_authors: Sheng Cheng, Tejas Gokhale, Yezhou Yang
  • for: Generalizing to unseen image domains, which is hard due to the lack of diverse training data, inaccessible target data, and the large domain shifts of many real-world settings.
  • methods: Adversarial Bayesian Augmentation (ABA), a novel algorithm for the challenging single-source domain generalization setting that draws on adversarial learning and Bayesian neural networks to guide the generation of diverse image augmentations; the synthesized image domains help the classifier generalize to unseen domains.
  • results: Across several types of domain shift - style shift, subpopulation shift, and shift in medical imaging - ABA outperforms all previous state-of-the-art methods, including pre-specified, pixel-based, and convolutional-based augmentations.
    Abstract Generalizing to unseen image domains is a challenging problem primarily due to the lack of diverse training data, inaccessible target data, and the large domain shift that may exist in many real-world settings. As such data augmentation is a critical component of domain generalization methods that seek to address this problem. We present Adversarial Bayesian Augmentation (ABA), a novel algorithm that learns to generate image augmentations in the challenging single-source domain generalization setting. ABA draws on the strengths of adversarial learning and Bayesian neural networks to guide the generation of diverse data augmentations -- these synthesized image domains aid the classifier in generalizing to unseen domains. We demonstrate the strength of ABA on several types of domain shift including style shift, subpopulation shift, and shift in the medical imaging setting. ABA outperforms all previous state-of-the-art methods, including pre-specified augmentations, pixel-based and convolutional-based augmentations.

AnyDoor: Zero-shot Object-level Image Customization

  • paper_url: http://arxiv.org/abs/2307.09481
  • repo_url: None
  • paper_authors: Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, Hengshuang Zhao
  • for: This paper presents a novel image generator called AnyDoor, which can teleport objects to new scenes in a harmonious way.
  • methods: The model uses a diffusion-based approach and is trained only once to generalize to diverse object-scene combinations at inference. The model also incorporates detail features to maintain texture details and support object blending with different surroundings.
  • results: The approach demonstrates superiority over existing alternatives and has great potential in real-world applications such as virtual try-on and object moving, as shown through extensive experiments.
    Abstract This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations in a harmonious way. Instead of tuning parameters for each object, our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage. Such a challenging zero-shot setting requires an adequate characterization of a certain object. To this end, we complement the commonly used identity feature with detail features, which are carefully designed to maintain texture details yet allow versatile local variations (e.g., lighting, orientation, posture, etc.), supporting the object in favorably blending with different surroundings. We further propose to borrow knowledge from video datasets, where we can observe various forms (i.e., along the time axis) of a single object, leading to stronger model generalizability and robustness. Extensive experiments demonstrate the superiority of our approach over existing alternatives as well as its great potential in real-world applications, such as virtual try-on and object moving. Project page is https://damo-vilab.github.io/AnyDoor-Page/.

FACTS: Facial Animation Creation using the Transfer of Styles

  • paper_url: http://arxiv.org/abs/2307.09480
  • repo_url: None
  • paper_authors: Jack Saunders, Steven Caulkin, Vinay Namboodiri
  • for: This paper aims to create more expressive, believable facial animation for video-game characters.
  • methods: A novel approach takes existing animations and allows modification of their style characteristics: a StarGAN converts 3D facial animations into different emotions and person-specific styles.
  • results: Thanks to a novel viseme-preserving loss, the method maintains lip-sync while achieving varied emotional and person-specific animation styles.
    Abstract The ability to accurately capture and express emotions is a critical aspect of creating believable characters in video games and other forms of entertainment. Traditionally, this animation has been achieved with artistic effort or performance capture, both requiring costs in time and labor. More recently, audio-driven models have seen success, however, these often lack expressiveness in areas not correlated to the audio signal. In this paper, we present a novel approach to facial animation by taking existing animations and allowing for the modification of style characteristics. Specifically, we explore the use of a StarGAN to enable the conversion of 3D facial animations into different emotions and person-specific styles. We are able to maintain the lip-sync of the animations with this method thanks to the use of a novel viseme-preserving loss.

GroupLane: End-to-End 3D Lane Detection with Channel-wise Grouping

  • paper_url: http://arxiv.org/abs/2307.09472
  • repo_url: None
  • paper_authors: Zhuoling Li, Chunrui Han, Zheng Ge, Jinrong Yang, En Yu, Haoqian Wang, Hengshuang Zhao, Xiangyu Zhang
  • for: Efficient 3D lane detection that meets practical deployment demands.
  • methods: A simple, fast, end-to-end detector that maintains high detection precision: fully convolutional heads based on row-wise classification recognize both vertical and horizontal lanes - the first row-wise classification performed in bird's-eye view. Features are split into groups, each corresponding to one lane instance, and training uses a single-win one-to-one matching to associate predictions with labels, so no post-processing is needed at inference.
  • results: On three real-world 3D lane benchmarks (OpenLane, Once-3DLanes, and OpenLane-Huawei), GroupLane with a ConvNeXt-Base backbone outperforms the published state-of-the-art PersFormer by 13.6% F1 on the OpenLane validation set; with ResNet18 it still surpasses PersFormer by 4.9% F1 while running nearly 7x faster with only 13.3% of the FLOPs.
    Abstract Efficiency is quite important for 3D lane detection due to practical deployment demand. In this work, we propose a simple, fast, and end-to-end detector that still maintains high detection precision. Specifically, we devise a set of fully convolutional heads based on row-wise classification. In contrast to previous counterparts, ours supports recognizing both vertical and horizontal lanes. Besides, our method is the first one to perform row-wise classification in bird-eye-view. In the heads, we split feature into multiple groups and every group of feature corresponds to a lane instance. During training, the predictions are associated with lane labels using the proposed single-win one-to-one matching to compute loss, and no post-processing operation is demanded for inference. In this way, our proposed fully convolutional detector, GroupLane, realizes end-to-end detection like DETR. Evaluated on 3 real world 3D lane benchmarks, OpenLane, Once-3DLanes, and OpenLane-Huawei, GroupLane adopting ConvNext-Base as the backbone outperforms the published state-of-the-art PersFormer by 13.6% F1 score in the OpenLane validation set. Besides, GroupLane with ResNet18 still surpasses PersFormer by 4.9% F1 score, while the inference speed is nearly 7x faster and the FLOPs is only 13.3% of it.
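
Row-wise classification makes decoding a lane trivial: for each lane instance, every BEV row carries logits over column bins plus an existence score, and the lane is read off row by row. A hedged sketch of that decode step (the tensor shapes and the 0.5 threshold are our assumptions):

```python
import torch

def decode_row_wise(col_logits, exist_logits, thresh=0.5):
    """Decode row-wise classification outputs into lane points.
    col_logits: (L, R, C) per-lane, per-row logits over column bins;
    exist_logits: (L, R) logit that lane l crosses row r.
    Returns, per lane, an (n, 2) tensor of (row, col) indices."""
    cols = col_logits.argmax(dim=2)           # best column bin per row
    exists = exist_logits.sigmoid() > thresh  # rows the lane actually crosses
    lanes = []
    for l in range(col_logits.shape[0]):
        rows = exists[l].nonzero(as_tuple=True)[0]
        lanes.append(torch.stack([rows, cols[l, rows]], dim=1))
    return lanes
```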

Occlusion Aware Student Emotion Recognition based on Facial Action Unit Detection

  • paper_url: http://arxiv.org/abs/2307.09465
  • repo_url: None
  • paper_authors: Shrouk Wally, Ahmed Elsayed, Islam Alkabbany, Asem Ali, Aly Farag
  • for: Improving classroom environments - and thereby first-year STEM retention - by monitoring students' emotions as an indicator of engagement.
  • methods: The impact of different facial parts on an emotion-recognition model is evaluated experimentally; an artificially occluded dataset is introduced, and an occlusion-aware architecture for facial action unit (AU) extraction is proposed, employing an attention mechanism and adaptive feature learning. The extracted AUs can later be used to classify facial expressions in classroom settings.
  • results: The study shows that occlusion substantially affects facial analysis models and that the occlusion-aware AU extraction improves the reliability of expression recognition; the findings extend to other settings where occlusions are prevalent.
    Abstract Given that approximately half of science, technology, engineering, and mathematics (STEM) undergraduate students in U.S. colleges and universities leave by the end of the first year [15], it is crucial to improve the quality of classroom environments. This study focuses on monitoring students' emotions in the classroom as an indicator of their engagement and proposes an approach to address this issue. The impact of different facial parts on the performance of an emotional recognition model is evaluated through experimentation. To test the proposed model under partial occlusion, an artificially occluded dataset is introduced. The novelty of this work lies in the proposal of an occlusion-aware architecture for facial action units (AUs) extraction, which employs attention mechanism and adaptive feature learning. The AUs can be used later to classify facial expressions in classroom settings. This research paper's findings provide valuable insights into handling occlusion in analyzing facial images for emotional engagement analysis. The proposed experiments demonstrate the significance of considering occlusion and enhancing the reliability of facial analysis models in classroom environments. These findings can also be extended to other settings where occlusions are prevalent.

Measuring Student Behavioral Engagement using Histogram of Actions

  • paper_url: http://arxiv.org/abs/2307.09420
  • repo_url: None
  • paper_authors: Ahmed Abdelkawy, Islam Alkabbany, Asem Ali, Aly Farag
  • for: This paper proposes a novel technique for measuring students' behavioral engagement by recognizing their actions and predicting engagement levels.
  • methods: Human skeletons model student postures and upper-body movements; a 3D-CNN learns the dynamics of the upper body and recognizes actions within every 2-minute video segment. The recognized actions are aggregated into a histogram encoding the actions and their frequencies, which is fed to an SVM classifier to decide whether the student is engaged or disengaged.
  • results: On a dataset of 1414 two-minute segments annotated with 13 actions and 112 segments annotated with two engagement levels, actions are recognized with 83.63% top-1 accuracy, and the framework captures the average engagement of the class.
    Abstract In this paper, we propose a novel technique for measuring behavioral engagement through students' actions recognition. The proposed approach recognizes student actions then predicts the student behavioral engagement level. For student action recognition, we use human skeletons to model student postures and upper body movements. To learn the dynamics of student upper body, a 3D-CNN model is used. The trained 3D-CNN model is used to recognize actions within every 2minute video segment then these actions are used to build a histogram of actions which encodes the student actions and their frequencies. This histogram is utilized as an input to SVM classifier to classify whether the student is engaged or disengaged. To evaluate the proposed framework, we build a dataset consisting of 1414 2-minute video segments annotated with 13 actions and 112 video segments annotated with two engagement levels. Experimental results indicate that student actions can be recognized with top 1 accuracy 83.63% and the proposed framework can capture the average engagement of the class.
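
The pipeline tail - histogram of actions into an SVM - fits in a few lines. A toy scikit-learn sketch (the 13-action vocabulary matches the paper; the random segments and labels are placeholders):

```python
import numpy as np
from sklearn.svm import SVC

NUM_ACTIONS = 13  # size of the action vocabulary used in the paper

def histogram_of_actions(action_labels):
    """Encode one 2-minute segment as a normalized histogram of the
    actions the 3D-CNN recognized within it."""
    h = np.bincount(action_labels, minlength=NUM_ACTIONS).astype(float)
    return h / max(h.sum(), 1.0)

# Toy usage with placeholder data: 1 = engaged, 0 = disengaged.
rng = np.random.default_rng(0)
segments = [rng.integers(0, NUM_ACTIONS, size=30) for _ in range(8)]
X = np.stack([histogram_of_actions(s) for s in segments])
y = np.array([0, 1, 0, 1, 1, 0, 1, 0])
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(X[:2]))
```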

cs.AI - 2023-07-19

  • paper_url: http://arxiv.org/abs/2307.10085
  • repo_url: None
  • paper_authors: Haoyu Sun, Yan Yan
  • for: Providing road maintenance departments with more scientific decision-making tools and evidence.
  • methods: Integrates multiple artificial-intelligence decision-making techniques to thoroughly mine historical maintenance data and adapt it to scientific road-maintenance decision-making.
  • results: A framework that produces maintenance plans and prioritizes maintenance sections from historical maintenance experience and technical indicators, helping road maintenance departments decide more scientifically under limited budgets.
    Abstract With the rapid development of global road transportation, countries worldwide have completed the construction of road networks. However, the ensuing challenge lies in the maintenance of existing roads. It is well-known that countries allocate limited budgets to road maintenance projects, and road management departments face difficulties in making scientifically informed maintenance decisions. Therefore, integrating various artificial intelligence decision-making techniques to thoroughly explore historical maintenance data and adapt them to the context of road maintenance scientific decision-making has become an urgent issue. This integration aims to provide road management departments with more scientific tools and evidence for decision-making. The framework proposed in this paper primarily addresses the following four issues: 1) predicting the pavement performance of various routes, 2) determining the prioritization of maintenance routes, 3) making maintenance decisions based on the evaluation of the effects of past maintenance, and considering comprehensive technical and management indicators, and 4) determining the prioritization of maintenance sections based on the maintenance effectiveness and recommended maintenance effectiveness. By tackling these four problems, the framework enables intelligent decision-making for the optimal maintenance plan and maintenance sections, taking into account limited funding and historical maintenance management experience.

Accurate deep learning sub-grid scale models for large eddy simulations

  • paper_url: http://arxiv.org/abs/2307.10060
  • repo_url: None
  • paper_authors: Rikhi Bose, Arunabha M. Roy
  • for: Two families of sub-grid scale (SGS) turbulence models are developed for large-eddy simulation (LES) purposes.
  • methods: Physics-informed deep learning (DL) algorithms that, unlike state-of-the-art analytical modeling techniques, can produce high-order complex nonlinear relations between inputs and outputs; explicitly filtered data from direct simulations of canonical channel flow at friction Reynolds numbers $Re_\tau\approx 395$ and 590 provide accurate training and test data.
  • results: One family uses tensor basis neural networks (TBNN) embedding the simplified analytical form of the general effective-viscosity hypothesis (Galilean, rotational, and reflectional invariances); the other uses a simpler network that incorporates only Galilean invariance but has better feature extraction, exploiting cross-components of the integrity-basis tensors and the SGS stresses, and it outperforms the invariance-embedded model on statistical performance metrics. Both families predict SGS stresses across filter widths and Reynolds numbers and give similar a priori dissipation and backscatter.
    Abstract We present two families of sub-grid scale (SGS) turbulence models developed for large-eddy simulation (LES) purposes. Their development required the formulation of physics-informed robust and efficient Deep Learning (DL) algorithms which, unlike state-of-the-art analytical modeling techniques can produce high-order complex non-linear relations between inputs and outputs. Explicit filtering of data from direct simulations of the canonical channel flow at two friction Reynolds numbers $Re_\tau\approx 395$ and 590 provided accurate data for training and testing. The two sets of models use different network architectures. One of the architectures uses tensor basis neural networks (TBNN) and embeds the simplified analytical model form of the general effective-viscosity hypothesis, thus incorporating the Galilean, rotational and reflectional invariances. The other architecture is that of a relatively simple network, that is able to incorporate the Galilean invariance only. However, this simpler architecture has better feature extraction capacity owing to its ability to establish relations between and extract information from cross-components of the integrity basis tensors and the SGS stresses. Both sets of models are used to predict the SGS stresses for feature datasets generated with different filter widths, and at different Reynolds numbers. It is shown that due to the simpler model's better feature learning capabilities, it outperforms the invariance embedded model in statistical performance metrics. In a priori tests, both sets of models provide similar levels of dissipation and backscatter. Based on the test results, both sets of models should be usable in a posteriori actual LESs.
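
The invariance-embedded family follows the familiar tensor-basis pattern: an MLP maps scalar invariants to coefficients that weight the integrity-basis tensors, so the predicted stress inherits the embedded invariances by construction. A minimal PyTorch sketch of that combination (layer sizes and the numbers of invariants and basis tensors are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class TensorBasisHead(nn.Module):
    """Map scalar invariants to coefficients g_i and predict the stress as
    the coefficient-weighted sum of integrity-basis tensors T_i, so the
    embedded invariances hold by construction."""

    def __init__(self, n_invariants=5, n_basis=10, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_invariants, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_basis))

    def forward(self, invariants, basis):
        # invariants: (B, n_invariants); basis: (B, n_basis, 3, 3)
        g = self.mlp(invariants)                       # coefficients g_i
        return torch.einsum("bn,bnij->bij", g, basis)  # predicted SGS stress
```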

Convergence Guarantees for Stochastic Subgradient Methods in Nonsmooth Nonconvex Optimization

  • paper_url: http://arxiv.org/abs/2307.10053
  • repo_url: https://github.com/xnchxy/GeneralSGD
  • paper_authors: Nachuan Xiao, Xiaoyin Hu, Kim-Chuan Toh
  • for: investigate the convergence properties of stochastic gradient descent (SGD) method and its variants in training neural networks with nonsmooth activation functions.
  • methods: develop a novel framework that assigns different timescales to stepsizes for updating momentum terms and variables, and prove the global convergence of the proposed framework in both single-timescale and two-timescale cases.
  • results: prove the convergence properties of several well-known SGD-type methods, including heavy-ball SGD, SignSGD, Lion, normalized SGD, and clipped SGD, and demonstrate the high efficiency of these methods through preliminary numerical experiments.
    Abstract In this paper, we investigate the convergence properties of the stochastic gradient descent (SGD) method and its variants, especially in training neural networks built from nonsmooth activation functions. We develop a novel framework that assigns different timescales to stepsizes for updating the momentum terms and variables, respectively. Under mild conditions, we prove the global convergence of our proposed framework in both single-timescale and two-timescale cases. We show that our proposed framework encompasses a wide range of well-known SGD-type methods, including heavy-ball SGD, SignSGD, Lion, normalized SGD and clipped SGD. Furthermore, when the objective function adopts a finite-sum formulation, we prove the convergence properties for these SGD-type methods based on our proposed framework. In particular, we prove that these SGD-type methods find the Clarke stationary points of the objective function with randomly chosen stepsizes and initial points under mild assumptions. Preliminary numerical experiments demonstrate the high efficiency of our analyzed SGD-type methods.
    摘要 在这篇论文中,我们研究随机梯度下降(SGD)方法及其变体的收敛性质,特别是在训练使用非光滑激活函数构建的神经网络时。我们提出了一个新的框架,在更新动量项和变量时分别采用不同时间尺度的步长。在温和条件下,我们证明了所提框架在单时间尺度和双时间尺度两种情形下的全局收敛性。我们还证明了该框架涵盖许多著名的SGD类方法,包括重球SGD、SignSGD、Lion、归一化SGD和裁剪SGD。此外,当目标函数采用有限和形式时,我们基于所提框架证明了这些SGD类方法的收敛性质。具体来说,我们证明了在温和假设下,这些SGD类方法在随机选择步长和初始点的情况下能够找到目标函数的Clarke稳定点。初步的数值实验表明我们所分析的SGD类方法具有很高的效率。
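To make the two-timescale idea concrete, here is a hedged sketch of a heavy-ball-style update in which the momentum buffer and the variables use separate stepsize schedules. The decay exponents and constants are purely illustrative; the paper's conditions on the stepsizes are more general.

```python
# Sketch of a two-timescale heavy-ball SGD update: the momentum terms and
# the variables are driven by separate stepsize sequences.
import torch

def two_timescale_heavy_ball(params, grads, momenta, k, eta0=0.1, theta0=0.5):
    eta = eta0 / (k + 1) ** 0.75      # stepsize for the variables
    theta = theta0 / (k + 1) ** 0.5   # stepsize for the momentum terms
    for p, g, m in zip(params, grads, momenta):
        m.mul_(1 - theta).add_(g, alpha=theta)  # momentum tracks the gradient
        p.sub_(m, alpha=eta)                    # variable step along momentum

# usage: one step on a toy nonsmooth objective f(x) = |x|
x = torch.tensor([1.0], requires_grad=True)
loss = x.abs().sum(); loss.backward()
m = [torch.zeros_like(x)]
two_timescale_heavy_ball([x.data], [x.grad], m, k=0)
```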

Automatic Conversion of MiniZinc Programs to QUBO

  • paper_url: http://arxiv.org/abs/2307.10032
  • repo_url: None
  • paper_authors: Armin Wolf, Cristian Grozea
  • for: 提高Physical Quantum Computer(如DWave annealers)上的优化问题解决效率。
  • methods: 使用自动转换器从MiniZinc转换为QUBO,可以处理广泛的约束优化和约束满足问题,并将其转换为等效的QUBO问题。
  • results: 提高了优化问题解决效率。
    Abstract Obtaining Quadratic Unconstrained Binary Optimisation (QUBO) models for various optimisation problems, in order to solve those on physical quantum computers (such as the DWave annealers), is nowadays a lengthy and tedious process that requires one to remodel all problem variables as binary variables and to squeeze the target function and the constraints into a single quadratic polynomial in these new variables. We report here on our automatic converter from MiniZinc to QUBO, which is able to process a large set of constraint optimisation and constraint satisfaction problems and turn them into equivalent QUBOs, effectively optimising the whole process.
    摘要 目前,为各种优化问题构建二次无约束二进制优化(QUBO)模型,以便在物理量子计算机(例如DWave退火机)上求解,是一个冗长而繁琐的过程:需要将所有问题变量重新建模为二进制变量,并将目标函数和约束压缩成这些新变量上的单个二次多项式。本报告介绍了我们的从MiniZinc到QUBO的自动转换器,它能够处理大量的约束优化和约束满足问题,并将其转换为等价的QUBO,从而有效地优化了整个过程。
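The following toy example illustrates the kind of rewriting such a converter automates: a linear objective plus an equality constraint over binary variables is folded into a single QUBO matrix by turning the constraint into a quadratic penalty. The penalty weight and the tiny problem are placeholders.

```python
# Minimise  x0 + 2*x1  subject to  x0 + x1 = 1  over binary variables.
# The constraint becomes a quadratic penalty P*(x0 + x1 - 1)^2; on binaries
# (x0 + x1 - 1)^2 = 2*x0*x1 - x0 - x1 + 1, so everything folds into one
# QUBO matrix Q (dropping the constant +P). P must be large enough to
# enforce feasibility.
import itertools
import numpy as np

P = 10.0
Q = np.array([[1.0 - P, 2.0 * P],
              [0.0,     2.0 - P]])  # upper-triangular QUBO

best = min(itertools.product([0, 1], repeat=2),
           key=lambda x: np.array(x) @ Q @ np.array(x))
print(best)  # (1, 0): picks the cheaper variable while satisfying x0+x1=1
```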

An Empirical Study on Fertility Proposals Using Multi-Grained Topic Analysis Methods

  • paper_url: http://arxiv.org/abs/2307.10025
  • repo_url: None
  • paper_authors: Yulin Zhou
  • for: 研究中国社会对政策的公众意见
  • methods: employing co-occurrence semantic analysis, topic analysis and sentiment analysis to conduct multi-granularity semantic analysis of microblog comments
  • results: 发现提议“废除婚姻限制出生登记”的讨论涉及个人、社会和国家三维度,并详细分析到社会问题、个人行为、社会伦理和法律等方面,人们的情感倾向于负面的主题。基于这些结论,提出了八项建议作为政策决策参考和研究公众意见政治问题的参考方法。
    Abstract Fertility issues are closely related to population security: China's population has entered a negative growth trend for the first time in 60 years, so changes in fertility policy are of great concern to the community. A proposal at the 2023 "two sessions" suggested that the country, in the form of legislation, remove the marriage restriction from birth registration. This topic was once a hot topic on the Internet, and "unbundling" the relationship between birth registration and marriage has become the focus of social debate. In this paper, we adopt co-occurrence semantic analysis, topic analysis and sentiment analysis to conduct multi-granularity semantic analysis of microblog comments. It is found that the discussion on the proposal of "removing marriage restrictions from birth registration" involves the individual, society and the state at three dimensions, and is detailed into social issues such as personal behaviour, social ethics and law, and national policy, with people's sentiment inclined to be negative in most of the topics. Based on this, eight proposals were made to provide a reference for governmental decision making and to form a reference method for researching public opinion on political issues.
    摘要 生育问题与人口安全密切相关:中国人口60年来首次出现负增长趋势,生育政策的变化备受社会关注。2023年"两会"有提案建议国家以立法形式取消出生登记中的结婚限制。这一话题一度成为网络热点,出生登记与婚姻关系的"解绑"成为社会争论的焦点。本文采用共现语义分析、主题分析和情感分析,对微博评论进行多粒度语义分析。研究发现,关于"取消出生登记结婚限制"提案的讨论涉及个人、社会和国家三个维度,并细化到个人行为、社会伦理与法律、国家政策等社会议题,且在大多数主题中人们的情感倾向于负面。在此基础上,本文提出八项建议,为政府决策提供参考,并形成研究政治议题公众舆论的参考方法。
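A minimal sketch in the spirit of the paper's pipeline (not the authors' exact implementation) combines LDA topic modelling with a naive lexicon-based sentiment score; the tiny corpus and the sentiment lexicon are placeholders.

```python
# Illustrative multi-granularity analysis: LDA topics over microblog
# comments plus a placeholder lexicon-based sentiment score.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

comments = ["birth registration should not require marriage",
            "this policy protects single mothers",
            "worried about social ethics and law"]
X = CountVectorizer().fit_transform(comments)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)

NEGATIVE = {"worried", "not"}  # placeholder sentiment lexicon
sentiment = [sum(-1 for w in c.split() if w in NEGATIVE) for c in comments]
print(topics.argmax(axis=1), sentiment)
```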

RobôCIn Small Size League Extended Team Description Paper for RoboCup 2023

  • paper_url: http://arxiv.org/abs/2307.10018
  • repo_url: None
  • paper_authors: Aline Lima de Oliveira, Cauê Addae da Silva Gomes, Cecília Virginia Santos da Silva, Charles Matheus de Sousa Alves, Danilo Andrade Martins de Souza, Driele Pires Ferreira Araújo Xavier, Edgleyson Pereira da Silva, Felipe Bezerra Martins, Lucas Henrique Cavalcanti Santos, Lucas Dias Maciel, Matheus Paixão Gumercindo dos Santos, Matheus Lafayette Vasconcelos, Matheus Vinícius Teotonio do Nascimento Andrade, João Guilherme Oliveira Carvalho de Melo, João Pedro Souza Pereira de Moura, José Ronald da Silva, José Victor Silva Cruz, Pedro Henrique Santana de Morais, Pedro Paulo Salman de Oliveira, Riei Joaquim Matos Rodrigues, Roberto Costa Fernandes, Ryan Vinicius Santos Morais, Tamara Mayara Ramos Teobaldo, Washington Igor dos Santos Silva, Edna Natividade Silva Barros
  • for: 本研究目的是在2023年的RoboCup Small Size League(SSL)组别B联赛中卫冕冠军。
  • methods: 本研究使用了Unification架构,并进行了软件和人工智能 refactoring。此外,我们也详细说明了将机械部件集成到机械系统中的过程。
  • results: 本团队已发表两篇与SSL相关的学术论文,分别发表于第25届RoboCup国际研讨会和第19届IEEE拉丁美洲机器人研讨会(LARS 2022),并在持续将过去的代码库迁移到Unification架构。
    Abstract RobôCIn has participated in RoboCup Small Size League since 2019, won its first world title in 2022 (Division B), and is currently a three-times Latin-American champion. This paper presents our improvements to defend the Small Size League (SSL) division B title in RoboCup 2023 in Bordeaux, France. This paper aims to share some of the academic research that our team developed over the past year. Our team has successfully published 2 articles related to SSL at two high-impact conferences: the 25th RoboCup International Symposium and the 19th IEEE Latin American Robotics Symposium (LARS 2022). Over the last year, we have been continuously migrating from our past codebase to Unification. We will describe the new architecture implemented and some points of software and AI refactoring. In addition, we discuss the process of integrating machined components into the mechanical system, our development for participating in the vision blackout challenge last year and what we are preparing for this year.
    摘要 RobôCIn 自2019年起参加RoboCup小型组联赛(SSL),于2022年获得首个世界冠军(B组),目前是三届拉丁美洲冠军。本文介绍我们为在法国波尔多举行的RoboCup 2023中卫冕SSL B组冠军所做的改进,并分享团队过去一年开展的部分学术研究。我们成功发表了两篇与SSL相关的论文,分别发表于两个高影响力会议:第25届RoboCup国际研讨会和第19届IEEE拉丁美洲机器人研讨会(LARS 2022)。过去一年,我们持续将旧有代码库迁移到Unification。本文将描述新实现的架构以及软件和人工智能方面的若干重构,此外还讨论了将机加工部件集成到机械系统中的过程、我们去年为参加视觉中断(vision blackout)挑战所做的开发,以及今年的准备工作。

6G Network Business Support System

  • paper_url: http://arxiv.org/abs/2307.10004
  • repo_url: None
  • paper_authors: Ye Ouyang, Yaqin Zhang, Peng Wang, Yunxin Liu, Wen Qiao, Jun Zhu, Yang Liu, Feng Zhang, Shuling Wang, Xidong Wang
  • for: 6G BSS systems will support the efficient connection of intelligent agents and lead the digital, intelligent, and green transformation of the economy and society.
  • methods: The paper introduces the overall vision, potential key technologies, and functional architecture of 6G BSS systems.
  • results: The paper presents an evolutionary roadmap and technological prospects for the BSS systems from 5G to 6G.
    Abstract 6G is the next-generation intelligent and integrated digital information infrastructure, characterized by ubiquitous interconnection, native intelligence, multi-dimensional perception, global coverage, green and low-carbon operation, native network security, etc. 6G will realize the transition from serving people and people-things communication to supporting the efficient connection of intelligent agents, and will comprehensively lead the digital, intelligent and green transformation of the economy and the society. As the core support system for mobile communication networks, 6G BSS needs to integrate with the new business models brought about by the development of the next-generation Internet and IT, upgrading from "network-centric" to "business- and service-centric" and "customer-centric". 6G OSS and BSS systems need to strengthen their integration to improve the operational efficiency and benefits of customers by connecting the digital intelligence support capabilities on both sides of supply and demand. This paper provides a detailed introduction to the overall vision, potential key technologies, and functional architecture of 6G BSS systems. It also presents an evolutionary roadmap and technological prospects for the BSS systems from 5G to 6G.
    摘要 6G是下一代智能化、一体化的数字信息基础设施,具有泛在互联、原生智能、多维感知、全球覆盖、绿色低碳、内生网络安全等特点。6G将实现从服务于人以及人与物通信,向支持智能体高效互联的转变,全面引领经济社会的数字化、智能化和绿色化转型。作为移动通信网络的核心支撑系统,6G BSS需要与下一代互联网和IT发展带来的新商业模式相融合,从"以网络为中心"升级为"以业务和服务为中心"、"以客户为中心"。6G OSS与BSS系统需要加强融合,通过连接供需双方的数字智能支撑能力来提升客户的运营效率和效益。本文详细介绍了6G BSS系统的总体愿景、潜在关键技术和功能架构,并给出了BSS系统从5G到6G的演进路线图和技术展望。

TbExplain: A Text-based Explanation Method for Scene Classification Models with the Statistical Prediction Correction

  • paper_url: http://arxiv.org/abs/2307.10003
  • repo_url: None
  • paper_authors: Amirhossein Aminimehr, Pouya Khani, Amirali Molaei, Amirmohammad Kazemeini, Erik Cambria
  • for: 提高黑盒机器学习模型的可解释性,即可解释人工智能(Explainable Artificial Intelligence, XAI)。
  • methods: 结合XAI技术与预训练目标检测器,为场景分类模型提供基于文本的解释;此外,提出一种在初始预测不可靠时,基于输入图像中物体统计信息来纠正预测并给出文本解释的新方法。
  • results: 对场景分类模型进行定性和定量实验,发现TbExplain可提高分类精度,且当初始预测不可靠时,其文本解释具有足够的可靠性。
    Abstract The field of Explainable Artificial Intelligence (XAI) aims to improve the interpretability of black-box machine learning models. Building a heatmap based on the importance value of input features is a popular method for explaining the underlying functions of such models in producing their predictions. Heatmaps are almost understandable to humans, yet they are not without flaws. Non-expert users, for example, may not fully understand the logic of heatmaps (the logic in which relevant pixels to the model's prediction are highlighted with different intensities or colors). Additionally, objects and regions of the input image that are relevant to the model prediction are frequently not entirely differentiated by heatmaps. In this paper, we propose a framework called TbExplain that employs XAI techniques and a pre-trained object detector to present text-based explanations of scene classification models. Moreover, TbExplain incorporates a novel method to correct predictions and textually explain them based on the statistics of objects in the input image when the initial prediction is unreliable. To assess the trustworthiness and validity of the text-based explanations, we conducted a qualitative experiment, and the findings indicated that these explanations are sufficiently reliable. Furthermore, our quantitative and qualitative experiments on TbExplain with scene classification datasets reveal an improvement in classification accuracy over ResNet variants.
    摘要 可解释人工智能(XAI)领域的目标是提高黑盒机器学习模型的可解释性。基于输入特征重要性值构建热力图,是解释此类模型预测机制的常用方法。热力图对人类来说基本可以理解,但并非没有缺陷:例如,非专家用户可能无法完全理解热力图的逻辑(即与模型预测相关的像素以不同强度或颜色高亮);此外,热力图往往不能完全区分输入图像中与模型预测相关的物体和区域。本文提出一个名为TbExplain的框架,利用XAI技术和预训练目标检测器,为场景分类模型提供基于文本的解释。此外,TbExplain还包含一种新方法,在初始预测不可靠时,基于输入图像中物体的统计信息来纠正预测并给出相应的文本解释。为评估文本解释的可信度和有效性,我们进行了定性实验,结果表明这些解释足够可靠。此外,我们在场景分类数据集上对TbExplain进行的定量和定性实验表明,其分类精度优于多种ResNet变体。
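A minimal sketch of the correction idea follows: when the classifier's confidence is low, scenes are re-scored by how typical the detected objects are for each scene. The co-occurrence table, threshold and scoring rule are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical statistics-based prediction correction: fall back to
# object-scene co-occurrence statistics when the scene classifier is
# unreliable. COOC would be estimated offline from a labelled corpus.
import numpy as np

SCENES = ["kitchen", "office"]
COOC = {"kitchen": {"oven": 0.8, "laptop": 0.1},    # placeholder P(obj|scene)
        "office":  {"oven": 0.05, "laptop": 0.9}}

def correct(scene_probs, detected_objects, threshold=0.6):
    if scene_probs.max() >= threshold:          # initial prediction reliable
        return SCENES[int(scene_probs.argmax())]
    scores = [sum(COOC[s].get(o, 0.01) for o in detected_objects)
              for s in SCENES]                  # object-typicality score
    return SCENES[int(np.argmax(scores))]

print(correct(np.array([0.55, 0.45]), ["oven"]))  # -> "kitchen"
```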

Our Model Achieves Excellent Performance on MovieLens: What Does it Mean?

  • paper_url: http://arxiv.org/abs/2307.09985
  • repo_url: https://github.com/FrankYuchen/MovieLensRec
  • paper_authors: Yu-chen Fan, Yitong Ji, Jie Zhang, Aixin Sun
  • for: 这个研究用于分析MovieLens数据集,并解释这个数据集如何影响推荐算法的评估。
  • methods: 这个研究使用了细化分析方法,揭示了用户交互的各个阶段之间的差异,以及用户交互如何受到内置推荐算法的影响。
  • results: 研究发现,用户在不同交互阶段的行为存在显著差异,早期交互在很大程度上定义了用户画像并影响后续交互;用户交互深受平台内置推荐算法所推荐候选电影的影响,移除用户最后若干次交互附近的交互会增加学习用户偏好的难度,从而降低推荐精度;此外,打乱用户交互顺序会使序列推荐算法更难捕捉渐进的交互过程。
    Abstract A typical benchmark dataset for recommender system (RecSys) evaluation consists of user-item interactions generated on a platform within a time period. The interaction generation mechanism partially explains why a user interacts with (e.g., likes, purchases, rates) an item, and the context of when a particular interaction happened. In this study, we conduct a meticulous analysis on the MovieLens dataset and explain the potential impact of using the dataset for evaluating recommendation algorithms. We make a few main findings from our analysis. First, there are significant differences in user interactions at the different stages when a user interacts with the MovieLens platform. The early interactions largely define the user portrait which affects the subsequent interactions. Second, user interactions are highly affected by the candidate movies that are recommended by the platform's internal recommendation algorithm(s). Removal of interactions that happen nearer to the last few interactions of a user leads to increasing difficulty in learning user preference, thus deteriorating recommendation accuracy. Third, changing the order of user interactions makes it more difficult for sequential algorithms to capture the progressive interaction process. Based on these findings, we further discuss the discrepancy between the interaction generation mechanism that is employed by the MovieLens system and that of typical real world recommendation scenarios. In summary, models that achieve excellent recommendation accuracy on the MovieLens dataset may not demonstrate superior performance in practice for at least two kinds of differences: (i) the differences in the contexts of user-item interaction generation, and (ii) the differences in user knowledge about the item collections.
    摘要 一个典型的RecSys评估数据集包含用户和项目之间的交互,在一个平台上的一段时间内生成。交互生成机制部分解释了用户为何与项目(例如,喜欢、购买、评分)交互。在本研究中,我们进行了仔细的分析MovieLens数据集,并解释了使用该数据集评估推荐算法的可能的影响。我们得出了一些主要发现:1. 用户在不同阶段与MovieLens平台交互时,交互差异显著。早期交互主要定义用户肖像,影响后续交互。2. 用户交互受到MovieLens平台内部推荐算法提供的候选电影的影响,移除用户最后几次交互的交互会降低用户喜好学习的难度,导致推荐精度下降。3. 改变用户交互的顺序使得sequential算法更难捕捉用户交互过程的进步。根据这些发现,我们进一步讨论了MovieLens系统employs的交互生成机制与真实世界推荐场景中的交互生成机制之间的差异。总结来说,在MovieLens数据集上达到杰出推荐精度的模型可能不会在实际场景中表现出色,因为有两种类型的差异:1. 用户和项目之间交互的上下文差异。2. 用户对项目集的了解程度的差异。
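The kind of ordering analysis described above is easy to reproduce with pandas; the sketch below ranks each user's interactions in time and drops the last k interactions, one of the ablations mentioned in the abstract. The tiny stand-in table uses the same columns as the MovieLens ratings.csv file.

```python
import pandas as pd

# Tiny stand-in with the same columns as MovieLens ratings.csv.
ratings = pd.DataFrame({
    "userId":    [1, 1, 1, 2, 2],
    "movieId":   [10, 11, 12, 10, 13],
    "rating":    [4.0, 5.0, 3.0, 4.0, 2.0],
    "timestamp": [100, 200, 300, 150, 250],
})
ratings = ratings.sort_values(["userId", "timestamp"])

# Rank each user's interactions in time to contrast early vs. late behaviour.
ratings["order"] = ratings.groupby("userId").cumcount()
ratings["n"] = ratings.groupby("userId")["movieId"].transform("size")

# Ablation from the abstract: drop interactions near each user's last ones.
k = 1
train = ratings[ratings["order"] < ratings["n"] - k]
heldout = ratings[ratings["order"] >= ratings["n"] - k]
print(len(train), len(heldout))  # 3 2
```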

XSkill: Cross Embodiment Skill Discovery

  • paper_url: http://arxiv.org/abs/2307.09955
  • repo_url: None
  • paper_authors: Mengda Xu, Zhenjia Xu, Cheng Chi, Manuela Veloso, Shuran Song
  • for: 本研究旨在利用人类示范视频作为机器人学习的数据源,弥合人类与机器人之间的具身差异。
  • methods: 本研究提出名为XSkill的模仿学习框架:首先从无标注的人类与机器人操作视频中发现跨具身表示(技能原型,skill prototype);然后利用条件扩散策略将技能表示迁移到机器人动作;最后组合已学技能,以完成由人类提示视频指定的未见任务。
  • results: 仿真和真实环境中的实验表明,所发现的技能原型既有利于技能迁移,也有利于技能组合以完成未见任务,从而得到更通用、更可扩展的模仿学习框架。
    Abstract Human demonstration videos are a widely available data source for robot learning and an intuitive user interface for expressing desired behavior. However, directly extracting reusable robot manipulation skills from unstructured human videos is challenging due to the big embodiment difference and unobserved action parameters. To bridge this embodiment gap, this paper introduces XSkill, an imitation learning framework that 1) discovers a cross-embodiment representation called skill prototypes purely from unlabeled human and robot manipulation videos, 2) transfers the skill representation to robot actions using conditional diffusion policy, and finally, 3) composes the learned skill to accomplish unseen tasks specified by a human prompt video. Our experiments in simulation and real-world environments show that the discovered skill prototypes facilitate both skill transfer and composition for unseen tasks, resulting in a more general and scalable imitation learning framework. The performance of XSkill is best understood from the anonymous website: https://xskillcorl.github.io.
    摘要 人类示例视频是机器人学习中广泛可用的数据源,同时也是一种直观的用户界面,用于表达所愿的行为。然而,直接从不结构化的人类视频中提取可重用的机器人操作技巧是困难的,这是因为人类和机器人之间存在大的实体差异和未观察到的行为参数。为 bridging这个实体差距,本文提出了XSkill,一种仿冒学习框架,它可以:1)从无标签的人类和机器人操作视频中找到跨实体表示called skill prototypes,2)使用条件扩散策略将技能表示转移到机器人动作,并最后3)使用人类提示视频来组合学习的技能完成未看到的任务。我们在模拟环境和实际环境中进行了实验,结果表明XSkill可以快速地传递和组合学习的技能,从而实现更一般和可扩展的仿冒学习框架。XSkill的性能可以通过无名website:https://xskillcorl.github.io来了解。

U-CE: Uncertainty-aware Cross-Entropy for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.09947
  • repo_url: None
  • paper_authors: Steven Landgraf, Markus Hillemann, Kira Wursthorn, Markus Ulrich
  • for: 提高自适应驾驶等安全关键应用中模型的可靠性和可信度。
  • methods: 基于 pixel-wise 权重的 uncertainty-aware Cross-Entropy 损失函数 (U-CE),将predictive uncertainty纳入训练过程中。
  • results: 在Cityscapes和ACDC datasets上,使用ResNet-18和ResNet-101两种常见背景网络架构,U-CE训练模型可以不仅提高分割性能,还提供了有意义的uncertainty值。
    Abstract Deep neural networks have shown exceptional performance in various tasks, but their lack of robustness, reliability, and tendency to be overconfident pose challenges for their deployment in safety-critical applications like autonomous driving. In this regard, quantifying the uncertainty inherent to a model's prediction is a promising endeavour to address these shortcomings. In this work, we present a novel Uncertainty-aware Cross-Entropy loss (U-CE) that incorporates dynamic predictive uncertainties into the training process by pixel-wise weighting of the well-known cross-entropy loss (CE). Through extensive experimentation, we demonstrate the superiority of U-CE over regular CE training on two benchmark datasets, Cityscapes and ACDC, using two common backbone architectures, ResNet-18 and ResNet-101. With U-CE, we manage to train models that not only improve their segmentation performance but also provide meaningful uncertainties after training. Consequently, we contribute to the development of more robust and reliable segmentation models, ultimately advancing the state-of-the-art in safety-critical applications and beyond.
    摘要 深度神经网络在各类任务中表现出色,但其鲁棒性和可靠性不足、且往往过度自信,这给自动驾驶等安全关键应用中的部署带来挑战。因此,量化模型预测中固有的不确定性是弥补这些不足的有前景的方向。本文提出一种新的不确定性感知交叉熵损失(U-CE),通过对经典交叉熵损失(CE)进行逐像素加权,将动态预测不确定性纳入训练过程。通过大量实验,我们在Cityscapes和ACDC两个基准数据集上、使用ResNet-18和ResNet-101两种常见骨干架构,证明了U-CE优于常规CE训练。借助U-CE,我们训练出的模型不仅分割性能更好,而且在训练后还能给出有意义的不确定性,从而为开发更鲁棒、更可靠的分割模型做出贡献,并推动安全关键应用及其他领域的发展。
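A hedged sketch of an uncertainty-weighted cross-entropy follows. The specific weighting form used here (one plus normalised predictive entropy) is an assumption for illustration; the paper defines its own dynamic per-pixel weights.

```python
# Sketch of an uncertainty-aware cross-entropy in PyTorch: per-pixel CE is
# re-weighted by a (detached) function of predictive uncertainty.
import torch
import torch.nn.functional as F

def uncertainty_aware_ce(logits, target):
    # logits: (B, C, H, W); target: (B, H, W) with class indices
    probs = logits.softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)  # (B, H, W)
    weight = 1.0 + entropy / torch.log(torch.tensor(float(logits.shape[1])))
    ce = F.cross_entropy(logits, target, reduction="none")       # per pixel
    return (weight.detach() * ce).mean()  # detach: weights guide, not train

loss = uncertainty_aware_ce(torch.randn(2, 19, 64, 64),
                            torch.randint(0, 19, (2, 64, 64)))
```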

TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic Tree-Based Memory Network

  • paper_url: http://arxiv.org/abs/2307.09942
  • repo_url: None
  • paper_authors: Brandon Theodorou, Cao Xiao, Jimeng Sun
  • for: 这个研究旨在提高药物开发中的临床试验进程,使用机器学习模型快速匹配病人临床试验,并提供更好的解释性结果以便采纳。
  • methods: 这个研究使用了一种名为TREEMENT的个性化动态树状内存网络模型,利用预设的临床生物学 Ontology 扩展个体化病人表示,然后使用注意力对搜寻查询学习自适应搜寻以提高匹配性和解释性。
  • results: 在真实数据上,TREEMENT优于现有模型,将准则级匹配误差较最佳基线降低7%,并在试验级匹配上达到最新水平;此外,TREEMENT还具备良好的可解释性,使模型结果更易被采纳。
    Abstract Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment. In recent years, machine learning models have been proposed for speeding up patient recruitment via automatically matching patients with clinical trials based on longitudinal patient electronic health records (EHR) data and eligibility criteria of clinical trials. However, they either depend on trial-specific expert rules that cannot expand to other trials or perform matching at a very general level with a black-box model where the lack of interpretability makes the model results difficult to be adopted. To provide accurate and interpretable patient trial matching, we introduce a personalized dynamic tree-based memory network model named TREEMENT. It utilizes hierarchical clinical ontologies to expand the personalized patient representation learned from sequential EHR data, and then uses an attentional beam-search query learned from eligibility criteria embedding to offer a granular level of alignment for improved performance and interpretability. We evaluated TREEMENT against existing models on real-world datasets and demonstrated that TREEMENT outperforms the best baseline by 7% in terms of error reduction in criteria-level matching and achieves state-of-the-art results in its trial-level matching ability. Furthermore, we also show TREEMENT can offer good interpretability to make the model results easier for adoption.
    摘要 临床试验是药物开发的关键,但常受困于昂贵且低效的患者招募。近年来,有研究提出利用机器学习模型,基于患者的纵向电子健康记录(EHR)数据和临床试验的入组标准,自动匹配患者与临床试验,以加速患者招募。然而,这些方法要么依赖无法推广到其他试验的试验特定专家规则,要么只在非常笼统的层面上用黑盒模型进行匹配,缺乏可解释性使其结果难以被采纳。为提供准确且可解释的患者-试验匹配,我们提出一种名为TREEMENT的个性化动态树状记忆网络模型。它利用层次化临床本体(ontology)扩展从序列EHR数据中学到的个性化患者表示,再利用从入组标准嵌入中学到的注意力束搜索查询,提供细粒度的对齐,从而提升性能与可解释性。我们在真实数据集上将TREEMENT与现有模型进行了对比,结果表明TREEMENT在准则级匹配上将误差较最佳基线降低7%,并在试验级匹配能力上达到最新水平。此外,我们还展示了TREEMENT能提供良好的可解释性,使模型结果更易被采纳。

Spuriosity Didn’t Kill the Classifier: Using Invariant Predictions to Harness Spurious Features

  • paper_url: http://arxiv.org/abs/2307.09933
  • repo_url: None
  • paper_authors: Cian Eastwood, Shashank Singh, Andrei Liviu Nicolicioiu, Marin Vlastelica, Julius von Kügelgen, Bernhard Schölkopf
  • for: 避免基于数据集外的失败,现代工作尝试提取具有稳定或不变的关系性的特征,抛弃不稳定的特征,以避免其与标签之间的关系变化。
  • methods: 我们的主要贡献是展示如何在测试Domain中没有标签的情况下使用不稳定特征。我们提出了稳定特征增强(SFB)算法,它可以:(i)学习一个分离稳定和条件无关不稳定特征的预测器;和(ii)使用稳定特征预测来适应不稳定特征预测。
  • results: 理论上,我们证明了SFB无需测试域标签即可学习渐近最优的预测器;实验上,我们在真实数据和合成数据上验证了SFB的有效性。
    Abstract To avoid failures on out-of-distribution data, recent works have sought to extract features that have a stable or invariant relationship with the label across domains, discarding the "spurious" or unstable features whose relationship with the label changes across domains. However, unstable features often carry complementary information about the label that could boost performance if used correctly in the test domain. Our main contribution is to show that it is possible to learn how to use these unstable features in the test domain without labels. In particular, we prove that pseudo-labels based on stable features provide sufficient guidance for doing so, provided that stable and unstable features are conditionally independent given the label. Based on this theoretical insight, we propose Stable Feature Boosting (SFB), an algorithm for: (i) learning a predictor that separates stable and conditionally-independent unstable features; and (ii) using the stable-feature predictions to adapt the unstable-feature predictions in the test domain. Theoretically, we prove that SFB can learn an asymptotically-optimal predictor without test-domain labels. Empirically, we demonstrate the effectiveness of SFB on real and synthetic data.
    摘要 为避免在分布外数据上失效,近期工作试图提取与标签在各域间保持稳定或不变关系的特征,并丢弃与标签关系随域变化的"伪"(不稳定)特征。然而,不稳定特征往往携带关于标签的互补信息,若在测试域中正确使用,可以提升性能。我们的主要贡献是证明:在没有标签的情况下,也可以学会如何在测试域中使用这些不稳定特征。特别地,我们证明,只要稳定特征与不稳定特征在给定标签下条件独立,基于稳定特征的伪标签就能为此提供充分的指导。基于这一理论洞见,我们提出了稳定特征提升(SFB)算法,用于:(i)学习一个区分稳定特征与条件独立的不稳定特征的预测器;(ii)利用稳定特征的预测,在测试域中调整不稳定特征的预测。理论上,我们证明SFB无需测试域标签即可学习渐近最优的预测器;实验上,我们在真实与合成数据上展示了SFB的有效性。
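The core loop is easy to sketch. Below, a stable-feature predictor supplies pseudo-labels on the unlabeled test domain, and the unstable-feature predictor is refit there; the stable/unstable feature split is taken as given (the paper learns it), and the models and data are placeholders.

```python
# Sketch of the SFB idea under the paper's conditional-independence
# assumption: pseudo-labels from stable features guide test-domain
# adaptation of the unstable-feature predictor.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Train a stable-feature predictor on labelled source data.
Xs_tr = rng.normal(size=(200, 3))
y_tr = (Xs_tr[:, 0] > 0).astype(int)
stable = LogisticRegression().fit(Xs_tr, y_tr)

# Unlabelled test domain: pseudo-label, then refit the unstable predictor.
Xs_te = rng.normal(size=(100, 3))
Xu_te = rng.normal(size=(100, 4))
pseudo = stable.predict(Xs_te)                  # guidance without labels
unstable = LogisticRegression().fit(Xu_te, pseudo)

# Combine both feature groups by averaging predicted probabilities.
proba = 0.5 * (stable.predict_proba(Xs_te) + unstable.predict_proba(Xu_te))
y_hat = proba.argmax(axis=1)
```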

Exploring Non-Regular Extensions of Propositional Dynamic Logic with Description-Logics Features

  • paper_url: http://arxiv.org/abs/2307.09913
  • repo_url: None
  • paper_authors: Bartosz Bednarczyk
  • for: investigate the impact of non-regular path expressions on the decidability of satisfiability checking and querying in description logics extending ALC.
  • methods: employing regular and visibly-pushdown languages, and using undecidability results to show the loss of decidability.
  • results: established undecidability of the concept satisfiability problem for ALCvpl extended with nominals, and undecidability of query entailment for queries involving non-regular atoms.
    Abstract We investigate the impact of non-regular path expressions on the decidability of satisfiability checking and querying in description logics extending ALC. Our primary objects of interest are ALCreg and ALCvpl, the extensions of ALC with path expressions employing, respectively, regular and visibly-pushdown languages. The first one, ALCreg, is a notational variant of the well-known Propositional Dynamic Logic of Fischer and Ladner. The second one, ALCvpl, was introduced and investigated by Loding and Serre in 2007. The logic ALCvpl generalises many known decidable non-regular extensions of ALCreg. We provide a series of undecidability results. First, we show that decidability of the concept satisfiability problem for ALCvpl is lost upon adding the seemingly innocent Self operator. Second, we establish undecidability for the concept satisfiability problem for ALCvpl extended with nominals. Interestingly, our undecidability proof relies only on one single non-regular (visibly-pushdown) language, namely on r#s# := { r^n s^n | n in N } for fixed role names r and s. Finally, in contrast to the classical database setting, we establish undecidability of query entailment for queries involving non-regular atoms from r#s#, already in the case of ALC-TBoxes.
    摘要 我们研究非正则路径表达式对扩展ALC的描述逻辑中可满足性检验与查询可判定性的影响。我们主要研究的对象是ALCreg和ALCvpl,即分别用正则语言和可见下推(visibly-pushdown)语言作为路径表达式对ALC进行的扩展。前者ALCreg是Fischer与Ladner的著名命题动态逻辑的记号变体;后者ALCvpl由Loding和Serre于2007年提出并加以研究,它推广了许多已知的ALCreg的可判定非正则扩展。我们给出一系列不可判定性结果。首先,我们证明在加入看似无害的Self算子后,ALCvpl的概念可满足性问题失去可判定性。其次,我们证明加入名词(nominals)后,ALCvpl的概念可满足性问题不可判定。值得注意的是,我们的不可判定性证明只依赖一个非正则(可见下推)语言,即对固定角色名r和s的 r#s# := { r^n s^n | n ∈ N }。最后,与经典数据库设定不同,我们证明即便只有ALC-TBox,含有来自 r#s# 的非正则原子的查询的蕴含问题也是不可判定的。

Chit-Chat or Deep Talk: Prompt Engineering for Process Mining

  • paper_url: http://arxiv.org/abs/2307.09909
  • repo_url: None
  • paper_authors: Urszula Jessen, Michal Sroka, Dirk Fahland
  • for: 这个研究探讨了使用大语言模型(LLM)来增强对话代理人在过程挖掘中的应用,以解决其内置的复杂性和多样化技能要求。
  • methods: 我们提出了一种创新的方法,利用先前的自然语言处理(NLP)研究,以解决现有解决方案中的多个问题。
  • results: 我们的框架可以提高对话代理人的性能和可访问性,如公共问题和数据集上的实验所示。我们的研究为未来 LLM 在过程挖掘中的角色做出了贡献,并提出了改进 LLM 记忆、实时用户测试和多样数据集的建议。
    Abstract This research investigates the application of Large Language Models (LLMs) to augment conversational agents in process mining, aiming to tackle its inherent complexity and diverse skill requirements. While LLM advancements present novel opportunities for conversational process mining, generating efficient outputs is still a hurdle. We propose an innovative approach that amend many issues in existing solutions, informed by prior research on Natural Language Processing (NLP) for conversational agents. Leveraging LLMs, our framework improves both accessibility and agent performance, as demonstrated by experiments on public question and data sets. Our research sets the stage for future explorations into LLMs' role in process mining and concludes with propositions for enhancing LLM memory, implementing real-time user testing, and examining diverse data sets.
    摘要 本研究探讨如何利用大语言模型(LLM)增强流程挖掘中的对话式智能体,以应对该领域固有的复杂性和多样化的技能要求。尽管LLM的进步为对话式流程挖掘带来了新的机遇,高效地生成输出仍是一个障碍。我们提出一种创新方法,借鉴面向对话式智能体的自然语言处理(NLP)已有研究,改进了现有解决方案中的诸多问题。借助LLM,我们的框架提升了可访问性和智能体性能,这已在公开的问题和数据集上的实验中得到验证。本研究为未来探索LLM在流程挖掘中的作用奠定了基础,并就增强LLM记忆、开展实时用户测试和考察多样化数据集提出了建议。

Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head video Generation

  • paper_url: http://arxiv.org/abs/2307.09906
  • repo_url: https://github.com/harlanhong/iccv2023-mcnet
  • paper_authors: Fa-Ting Hong, Dan Xu
  • for: 高品质的人脸动画生成(talking head generation),使用动态资讯从目标驱动影片中获取,并维持人脸的身份特征。
  • methods: 提出学习一个全局人脸表示空间(global facial representation space),并设计一种新的隐式身份表示条件的记忆补偿网络(MCNet),用于高保真说话人头像生成。
  • results: 实验结果显示MCNet可以学习代表性和补偿性的人脸记忆,并在VoxCeleb1和CelebV数据集上明显超越前一代的人脸生成方法。
    Abstract Talking head video generation aims to animate a human face in a still image with dynamic poses and expressions using motion information derived from a target-driving video, while maintaining the person's identity in the source image. However, dramatic and complex motions in the driving video cause ambiguous generation, because the still source image cannot provide sufficient appearance information for occluded regions or delicate expression variations, which produces severe artifacts and significantly degrades the generation quality. To tackle this problem, we propose to learn a global facial representation space, and design a novel implicit identity representation conditioned memory compensation network, coined as MCNet, for high-fidelity talking head generation. Specifically, we devise a network module to learn a unified spatial facial meta-memory bank from all training samples, which can provide rich facial structure and appearance priors to compensate warped source facial features for the generation. Furthermore, we propose an effective query mechanism based on implicit identity representations learned from the discrete keypoints of the source image. It can greatly facilitate the retrieval of more correlated information from the memory bank for the compensation. Extensive experiments demonstrate that MCNet can learn representative and complementary facial memory, and can clearly outperform previous state-of-the-art talking head generation methods on VoxCeleb1 and CelebV datasets. Please check our project page: https://github.com/harlanhong/ICCV2023-MCNET.
    摘要 说话人头像视频生成旨在利用来自驱动视频的运动信息,为静止图像中的人脸赋予动态姿态和表情,同时保持源图像中人物的身份。然而,驱动视频中剧烈而复杂的运动会导致生成结果含糊:静止的源图像无法为被遮挡区域或细微的表情变化提供足够的外观信息,从而产生严重伪影并显著降低生成质量。为解决这一问题,我们提出学习一个全局人脸表示空间,并设计了一种新的隐式身份表示条件的记忆补偿网络(MCNet),用于高保真说话人头像生成。具体来说,我们设计了一个网络模块,从全部训练样本中学习一个统一的空间人脸元记忆库,为生成提供丰富的人脸结构与外观先验,以补偿发生形变的源人脸特征。此外,我们提出一种基于从源图像离散关键点学得的隐式身份表示的高效查询机制,可以大大方便从记忆库中检索更相关的信息用于补偿。大量实验表明,MCNet能够学习有代表性且互补的人脸记忆,并在VoxCeleb1和CelebV数据集上明显超越此前最先进的说话人头像生成方法。请访问我们的项目页面:https://github.com/harlanhong/ICCV2023-MCNET。

PyTAG: Challenges and Opportunities for Reinforcement Learning in Tabletop Games

  • paper_url: http://arxiv.org/abs/2307.09905
  • repo_url: https://github.com/martinballa/pytag
  • paper_authors: Martin Balla, George E. M. Long, Dominik Jeurissen, James Goodman, Raluca D. Gaina, Diego Perez-Liebana
  • for: 这 paper 是为了探索使用 Reinforcement Learning (RL) 在现代桌游戏中进行研究的。
  • methods: 这 paper 使用了 Python API 和 Tabletop Games 框架 (TAG),并提出了基eline 的 Proximal Policy Optimization 算法在一 subset of games 上进行训练。
  • results: 这 paper 获得了一些基eline 的结果,并讨论了现代桌游戏在RL研究中的一些特殊挑战。
    Abstract In recent years, Game AI research has made important breakthroughs using Reinforcement Learning (RL). Despite this, RL for modern tabletop games has gained little to no attention, even when they offer a range of unique challenges compared to video games. To bridge this gap, we introduce PyTAG, a Python API for interacting with the Tabletop Games framework (TAG). TAG contains a growing set of more than 20 modern tabletop games, with a common API for AI agents. We present techniques for training RL agents in these games and introduce baseline results after training Proximal Policy Optimisation algorithms on a subset of games. Finally, we discuss the unique challenges complex modern tabletop games provide, now open to RL research through PyTAG.
    摘要 近年来,游戏AI研究利用强化学习(RL)取得了重要突破。尽管如此,面向现代桌面游戏的RL几乎没有受到关注,即便与电子游戏相比,这类游戏带来一系列独特的挑战。为弥合这一差距,我们推出PyTAG,一个用于与桌面游戏框架(TAG)交互的Python API。TAG包含20多款现代桌面游戏,并为AI智能体提供统一的API。我们介绍了在这些游戏中训练RL智能体的技术,并给出了在部分游戏上训练近端策略优化(PPO)算法后的基线结果。最后,我们讨论了复杂的现代桌面游戏带来的独特挑战,这些挑战现已通过PyTAG向RL研究开放。
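A Gym-style interaction loop sketches how an agent might use such an API. Note that the environment class and its methods below are stand-ins invented for illustration, not PyTAG's documented interface; consult the repository for the actual API. A distinctive feature of tabletop games is the variable set of legal actions per turn, which the loop respects via an action mask.

```python
# Hypothetical Gym-like loop over a tabletop-game environment. PyTagEnv
# and its methods are placeholders, not PyTAG's actual API.
import random

class PyTagEnv:  # stand-in environment with a Gym-like surface
    def reset(self): return {"obs": [0.0], "action_mask": [1, 1]}
    def step(self, a): return {"obs": [0.0], "action_mask": [1, 1]}, 0.0, True, {}

env = PyTagEnv()
state = env.reset()
done, total = False, 0.0
while not done:
    # Tabletop games typically have variable legal-action sets, so the
    # agent must sample from the masked actions, not a fixed action space.
    legal = [i for i, ok in enumerate(state["action_mask"]) if ok]
    state, reward, done, info = env.step(random.choice(legal))
    total += reward
```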

Amortised Design Optimization for Item Response Theory

  • paper_url: http://arxiv.org/abs/2307.09891
  • repo_url: None
  • paper_authors: Antti Keurulainen, Isak Westerlund, Oskar Keurulainen, Andrew Howes
  • for: 这篇论文主要用于教育和心理学领域中的学生能力评估和测试项目选择。
  • methods: 该论文提出了一种基于深度强化学习(DRL)的有效性投入设计(OED)方法,以减少对学生交互的计算成本,并在实时中提供学生能力评估。
  • results: 该论文通过训练DRL代理人使用合成数据,实现了对学生能力分布的优化测试项目选择,并在实际应用中提供了准确的学生能力评估。
    Abstract Item Response Theory (IRT) is a well known method for assessing responses from humans in education and psychology. In education, IRT is used to infer student abilities and characteristics of test items from student responses. Interactions with students are expensive, calling for methods that efficiently gather information for inferring student abilities. Methods based on Optimal Experimental Design (OED) are computationally costly, making them inapplicable for interactive applications. In response, we propose incorporating amortised experimental design into IRT. Here, the computational cost is shifted to a precomputing phase by training a Deep Reinforcement Learning (DRL) agent with synthetic data. The agent is trained to select optimally informative test items for the distribution of students, and to conduct amortised inference conditioned on the experiment outcomes. During deployment the agent estimates parameters from data, and suggests the next test item for the student, in close to real-time, by taking into account the history of experiments and outcomes.
    摘要 项目反应理论(IRT)是教育与心理学中评估人类作答的常用方法。在教育领域,IRT用于从学生作答中推断学生能力和测试项目的特性。与学生的交互成本很高,因此需要能够高效收集信息以推断学生能力的方法。基于最优实验设计(OED)的方法计算代价高昂,难以用于交互式应用。为此,我们提出将摊销式实验设计引入IRT:通过用合成数据训练一个深度强化学习(DRL)智能体,将计算成本转移到预计算阶段。该智能体被训练为针对学生分布选择信息量最大的测试项目,并在实验结果条件下进行摊销式推断。在部署阶段,智能体根据数据估计参数,并结合实验与结果的历史,近乎实时地为学生推荐下一个测试项目。
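For context on the design problem the amortised agent solves: under a two-parameter logistic (2PL) IRT model, classical adaptive testing greedily picks the item with maximal Fisher information at the current ability estimate. The greedy rule below is only that classical baseline, shown for illustration; the paper replaces it with a learned DRL policy, and the item parameters here are made up.

```python
# 2PL IRT model and the classical maximum-information item-selection rule.
import numpy as np

def p_correct(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))  # 2PL response model

def fisher_information(theta, a, b):
    p = p_correct(theta, a, b)
    return a**2 * p * (1.0 - p)

items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]  # (discrimination, difficulty)
theta_hat = 0.2                                # current ability estimate
best = max(range(len(items)),
           key=lambda i: fisher_information(theta_hat, *items[i]))
print("next item:", best)
```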

A reinforcement learning approach for VQA validation: an application to diabetic macular edema grading

  • paper_url: http://arxiv.org/abs/2307.09886
  • repo_url: None
  • paper_authors: Tatiana Fountoukidou, Raphael Sznitman
  • for: This paper focuses on providing a more comprehensive and appropriate validation approach for highly powerful Visual Question Answering (VQA) algorithms, specifically for diabetic macular edema (DME) grading.
  • methods: The proposed approach uses an automatic adaptive questioning method based on reinforcement learning (RL), which selects the next question to pose based on the history of previously asked questions.
  • results: The experiments show that the RL agent exhibits similar behavior to a clinician, asking questions that are relevant to key clinical concepts.
    Abstract Recent advances in machine learning models have greatly increased the performance of automated methods in medical image analysis. However, the internal functioning of such models is largely hidden, which hinders their integration in clinical practice. Explainability and trust are viewed as important aspects of modern methods, for the latter's widespread use in clinical communities. As such, validation of machine learning models represents an important aspect and yet, most methods are only validated in a limited way. In this work, we focus on providing a richer and more appropriate validation approach for highly powerful Visual Question Answering (VQA) algorithms. To better understand the performance of these methods, which answer arbitrary questions related to images, this work focuses on an automatic visual Turing test (VTT). That is, we propose an automatic adaptive questioning method, that aims to expose the reasoning behavior of a VQA algorithm. Specifically, we introduce a reinforcement learning (RL) agent that observes the history of previously asked questions, and uses it to select the next question to pose. We demonstrate our approach in the context of evaluating algorithms that automatically answer questions related to diabetic macular edema (DME) grading. The experiments show that such an agent has similar behavior to a clinician, whereby asking questions that are relevant to key clinical concepts.
    摘要 机器学习模型的最新进展极大提升了医学图像分析自动方法的性能,但这类模型的内部机制在很大程度上是隐藏的,这阻碍了它们融入临床实践。可解释性与可信度被视为现代方法的重要方面,是其在临床界广泛使用的前提。因此,机器学习模型的验证十分重要,然而大多数方法的验证方式都很有限。在本工作中,我们致力于为功能强大的视觉问答(VQA)算法提供更丰富、更合适的验证方法。为更好地理解这类可回答与图像相关的任意问题的方法的性能,本工作聚焦于一种自动视觉图灵测试(VTT):我们提出一种自动的自适应提问方法,旨在揭示VQA算法的推理行为。具体来说,我们引入一个强化学习(RL)智能体,它观察此前已提问题的历史,并据此选择下一个要提出的问题。我们在评估自动回答糖尿病性黄斑水肿(DME)分级相关问题的算法的场景中展示了该方法。实验表明,这样的智能体表现出与临床医生相似的行为,会提出与关键临床概念相关的问题。

Test-takers have a say: understanding the implications of the use of AI in language tests

  • paper_url: http://arxiv.org/abs/2307.09885
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Dawen Zhang, Thong Hoang, Shidong Pan, Yongquan Hu, Zhenchang Xing, Mark Staples, Xiwei Xu, Qinghua Lu, Aaron Quigley
  • for: 这个研究的目的是了解在语言测试中使用人工智能(AI)的影响,特别是测试者的看法和行为方式。
  • methods: 这个研究使用了面对面和在线调查,了解测试者对AI在语言测试中的感知和行为。
  • results: 研究发现,AI的集成可能会提高测试者对测试公平性、一致性和可用性的认知,但也可能导致测试者对测试可靠性和互动性的不信任。这些发现可以帮助各方在决定是否在语言测试中使用AI时做出更明智的选择,以保护社会利益和测试的完整性。
    Abstract Language tests measure a person's ability to use a language in terms of listening, speaking, reading, or writing. Such tests play an integral role in academic, professional, and immigration domains, with entities such as educational institutions, professional accreditation bodies, and governments using them to assess candidate language proficiency. Recent advances in Artificial Intelligence (AI) and the discipline of Natural Language Processing have prompted language test providers to explore AI's potential applicability within language testing, leading to transformative activity patterns surrounding language instruction and learning. However, with concerns over AI's trustworthiness, it is imperative to understand the implications of integrating AI into language testing. This knowledge will enable stakeholders to make well-informed decisions, thus safeguarding community well-being and testing integrity. To understand the concerns and effects of AI usage in language tests, we conducted interviews and surveys with English test-takers. To the best of our knowledge, this is the first empirical study aimed at identifying the implications of AI adoption in language tests from a test-taker perspective. Our study reveals test-taker perceptions and behavioral patterns. Specifically, we identify that AI integration may enhance perceptions of fairness, consistency, and availability. Conversely, it might incite mistrust regarding reliability and interactivity aspects, subsequently influencing the behaviors and well-being of test-takers. These insights provide a better understanding of potential societal implications and assist stakeholders in making informed decisions concerning AI usage in language testing.
    摘要 语言测试测量人类语言使用能力,包括听说、读写等方面。这些测试在学术、职业和移民领域具有重要的应用,由于语言测试提供者正在探索人工智能(AI)在语言测试中的应用前景,导致语言测试领域的变革。然而,由于AI的可靠性问题,需要深入了解AI在语言测试中的影响,以便各方能够做出了解据的决策,保护社会利益和测试的公正性。为了了解AI在语言测试中的影响,我们通过对英语测试者进行了访谈和问卷调查。根据我们所知,这是第一项针对语言测试中AI使用的实证研究,旨在了解测试者的看法和行为倾向。我们发现,AI的integrating可能会提高公正性、一致性和可用性的认知,但同时也可能会产生不信任的情感,影响测试者的行为和心理健康。这些发现可以帮助各方更好地理解AI在语言测试中的社会影响,并帮助各方做出了解据的决策。

Adversarial Likelihood Estimation with One-way Flows

  • paper_url: http://arxiv.org/abs/2307.09882
  • repo_url: None
  • paper_authors: Omri Ben-Dov, Pravir Singh Gupta, Victoria Abrevaya, Michael J. Black, Partha Ghosh
  • for: 本研究旨在提高Generative Adversarial Networks(GANs)的质量和可靠性,并提供一种不受偏见的估计方法。
  • methods: 本研究采用重要性采样与基于能量的设定,并提出一种新的单向流网络(one-way flow network)来计算生成器密度。
  • results: 实验结果表明,新方法收敛更快,生成质量与同等架构的GANs相当,避免了对常用数据集的过拟合,并得到平滑的低维潜在表示。
    Abstract Generative Adversarial Networks (GANs) can produce high-quality samples, but do not provide an estimate of the probability density around the samples. However, it has been noted that maximizing the log-likelihood within an energy-based setting can lead to an adversarial framework where the discriminator provides unnormalized density (often called energy). We further develop this perspective, incorporate importance sampling, and show that 1) Wasserstein GAN performs a biased estimate of the partition function, and we propose instead to use an unbiased estimator; 2) when optimizing for likelihood, one must maximize generator entropy. This is hypothesized to provide a better mode coverage. Different from previous works, we explicitly compute the density of the generated samples. This is the key enabler to designing an unbiased estimator of the partition function and computation of the generator entropy term. The generator density is obtained via a new type of flow network, called one-way flow network, that is less constrained in terms of architecture, as it does not require to have a tractable inverse function. Our experimental results show that we converge faster, produce comparable sample quality to GANs with similar architecture, successfully avoid over-fitting to commonly used datasets and produce smooth low-dimensional latent representations of the training data.
    摘要 生成对抗网络(GANs)可以生成高质量样本,但不提供样本附近的概率密度估计。然而,已有研究指出,在基于能量的设定下最大化对数似然,可以导出一个对抗框架,其中判别器给出未归一化的密度(通常称为能量)。我们进一步发展这一视角,引入重要性采样,并证明:1)Wasserstein GAN对配分函数进行的是有偏估计,我们提议改用无偏估计器;2)在优化似然时,必须最大化生成器熵,我们推测这能带来更好的模式覆盖。与此前工作不同,我们显式计算生成样本的密度,这是设计配分函数无偏估计器和计算生成器熵项的关键。生成器密度通过一种新型流网络(单向流网络)获得,该网络在架构上约束更少,因为它不要求具有易处理的逆函数。实验结果表明,我们的方法收敛更快,在相近架构下生成与GANs相当的样本质量,成功避免了对常用数据集的过拟合,并得到了训练数据的平滑低维潜在表示。
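The unbiased estimator in question is standard importance sampling of the partition function; the numeric sketch below demonstrates it with a toy energy and a proposal whose density is known, mirroring the role the generator density (via the one-way flow) plays in the paper.

```python
# Unbiased importance-sampling estimate of Z = ∫ exp(-E(x)) dx using a
# proposal q with known density. E and q are toy choices so the code runs
# standalone; the true Z here is sqrt(2*pi).
import numpy as np

rng = np.random.default_rng(0)

def energy(x):
    return 0.5 * x**2

def q_sample(n):               # proposal: N(0, 1.5^2), density known
    return rng.normal(0.0, 1.5, size=n)

def q_density(x):
    return np.exp(-0.5 * (x / 1.5)**2) / (1.5 * np.sqrt(2 * np.pi))

x = q_sample(100_000)
z_hat = np.mean(np.exp(-energy(x)) / q_density(x))  # E_q[e^{-E}/q] = Z
print(z_hat, np.sqrt(2 * np.pi))  # estimate vs. exact value 2.5066...
```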

Amortised Experimental Design and Parameter Estimation for User Models of Pointing

  • paper_url: http://arxiv.org/abs/2307.09878
  • repo_url: None
  • paper_authors: Antti Keurulainen, Isak Westerlund, Oskar Keurulainen, Andrew Howes
  • for: 这篇论文主要是为了研究如何使用用户数据来估计模型参数,以便自动化互动设计选择。
  • methods: 这篇论文使用了一种基于实验设计的方法,通过与 simulate 的参与者进行交互来快速收集数据并估计参数。
  • results: 研究发现,通过使用这种方法,可以快速地收集大量数据,并且可以使用 Synthetic data 而不需要巨量的人工数据来估计参数。
    Abstract User models play an important role in interaction design, supporting automation of interaction design choices. In order to do so, model parameters must be estimated from user data. While very large amounts of user data are sometimes required, recent research has shown how experiments can be designed so as to gather data and infer parameters as efficiently as possible, thereby minimising the data requirement. In the current article, we investigate a variant of these methods that amortises the computational cost of designing experiments by training a policy for choosing experimental designs with simulated participants. Our solution learns which experiments provide the most useful data for parameter estimation by interacting with in-silico agents sampled from the model space thereby using synthetic data rather than vast amounts of human data. The approach is demonstrated for three progressively complex models of pointing.
    摘要 用户模型在互动设计中发挥重要作用,支持自动化互动设计选择。为此,模型参数必须从用户数据进行估算。虽然有时需要很大量的用户数据,但最近的研究表明,可以通过设计 эксперименты,以最小化数据需求来收集数据和推导参数。在当前文章中,我们调查了一种变体的这些方法,即通过训练选择实验设计的策略来减少计算成本。我们的解决方案通过与模型空间中随机 sampling的数字人类进行交互,以使用合成数据而不是巨量的人类数据来学习哪些实验提供最有用的数据 для参数估算。我们的方法在三个不同的指向模型上进行了证明。

Detecting Vulnerable Nodes in Urban Infrastructure Interdependent Network

  • paper_url: http://arxiv.org/abs/2307.09866
  • repo_url: https://github.com/tsinghua-fib-lab/kdd2023-id546-urbaninfra
  • paper_authors: Jinzhu Mao, Liu Cao, Chen Gao, Huandong Wang, Hangyu Fan, Depeng Jin, Yong Li
  • for: 本研究旨在理解和刻画城市基础设施的脆弱性。城市基础设施指城市正常运转所必需、天然以网络形式存在的工程设施;潜在应用包括保护脆弱设施和设计鲁棒拓扑等。
  • methods: 本研究利用图神经网络与强化学习,对城市基础设施的异质网络进行建模,从而精确刻画其脆弱性。所提系统用真实数据训练,借助深度学习技术理解和分析异质网络,从而捕捉级联失效的风险并发现城市中的脆弱基础设施。
  • results: 针对多种请求的大量实验不仅证明了所提系统的表达能力,也证明了其迁移能力以及特定组件的必要性。
    Abstract Understanding and characterizing the vulnerability of urban infrastructures, which refers to the engineering facilities essential for the regular running of cities and that exist naturally in the form of networks, is of great value to us. Potential applications include protecting fragile facilities and designing robust topologies, etc. Due to the strong correlation between different topological characteristics and infrastructure vulnerability and their complicated evolution mechanisms, some heuristic and machine-assisted analysis fall short in addressing such a scenario. In this paper, we model the interdependent network as a heterogeneous graph and propose a system based on graph neural network with reinforcement learning, which can be trained on real-world data, to characterize the vulnerability of the city system accurately. The presented system leverages deep learning techniques to understand and analyze the heterogeneous graph, which enables us to capture the risk of cascade failure and discover vulnerable infrastructures of cities. Extensive experiments with various requests demonstrate not only the expressive power of our system but also transferring ability and necessity of the specific components.
    摘要 理解和刻画城市基础设施的脆弱性对我们具有重要价值。城市基础设施指城市正常运转所必需、天然以网络形式存在的工程设施;潜在应用包括保护脆弱设施和设计鲁棒拓扑等。由于各种拓扑特征与基础设施脆弱性之间存在强相关性,且其演化机制复杂,一些启发式和机器辅助的分析方法难以应对这种情形。本文将相互依赖的网络建模为异质图,并提出一个基于图神经网络与强化学习的系统,可在真实数据上训练,以精确刻画城市系统的脆弱性。该系统借助深度学习技术理解和分析异质图,使我们能够捕捉级联失效的风险并发现城市中的脆弱基础设施。针对多种请求的大量实验不仅证明了系统的表达能力,也证明了其迁移能力以及特定组件的必要性。
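The cascade-failure notion behind vulnerability scoring can be sketched with a classical toy model: remove a seed node, redistribute its load to neighbours, and count the nodes that fail. The degree-based load and the capacity rule below are common textbook assumptions for illustration, not the paper's trained model.

```python
# Toy load-redistribution cascade on a graph; higher cascade size marks a
# more vulnerable seed node.
import networkx as nx

def cascade_size(G, seed, alpha=1.2):
    load = dict(nx.degree(G))                      # toy initial load
    capacity = {n: alpha * load[n] for n in G}
    failed, frontier = {seed}, [seed]
    while frontier:
        n = frontier.pop()
        nbrs = [m for m in G[n] if m not in failed]
        for m in nbrs:                             # redistribute n's load
            load[m] += load[n] / max(len(nbrs), 1)
            if load[m] > capacity[m]:
                failed.add(m)
                frontier.append(m)
    return len(failed)

G = nx.barabasi_albert_graph(200, 2, seed=0)
scores = {n: cascade_size(G, n) for n in G}        # higher = more vulnerable
print(max(scores, key=scores.get))
```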

Towards a population-informed approach to the definition of data-driven models for structural dynamics

  • paper_url: http://arxiv.org/abs/2307.09862
  • repo_url: None
  • paper_authors: G. Tsialiamanis, N. Dervilis, D. J. Wagg, K. Worden
  • for: 本研究旨在将基于物理的方法与机器学习算法相结合,用于结构动力学领域的问题建模;但由于数据稀缺,机器学习算法常常表现不佳。为此,本研究提倡使用基于群体的模型,从一组底层物理相似的现象中学习这种关系。
  • methods: 本研究使用了两种meta-learning 算法:model-agnostic meta-learning (MAML) 算法和 conditional neural processes (CNP) 模型。这两种算法在训练人口中学习关系性,并且表现出与传统机器学习算法相似的行为。
  • results: 研究结果显示,这两种算法可以对于构造动力学问题进行高精度的预测,并且比传统机器学习算法更好地适应训练人口中的数据稀缺性。
    Abstract Machine learning has affected the way in which many phenomena for various domains are modelled, one of these domains being that of structural dynamics. However, because machine-learning algorithms are problem-specific, they often fail to perform efficiently in cases of data scarcity. To deal with such issues, combination of physics-based approaches and machine learning algorithms have been developed. Although such methods are effective, they also require the analyser's understanding of the underlying physics of the problem. The current work is aimed at motivating the use of models which learn such relationships from a population of phenomena, whose underlying physics are similar. The development of such models is motivated by the way that physics-based models, and more specifically finite element models, work. Such models are considered transferrable, explainable and trustworthy, attributes which are not trivially imposed or achieved for machine-learning models. For this reason, machine-learning approaches are less trusted by industry and often considered more difficult to form validated models. To achieve such data-driven models, a population-based scheme is followed here and two different machine-learning algorithms from the meta-learning domain are used. The two algorithms are the model-agnostic meta-learning (MAML) algorithm and the conditional neural processes (CNP) model. The algorithms seem to perform as intended and outperform a traditional machine-learning algorithm at approximating the quantities of interest. Moreover, they exhibit behaviour similar to traditional machine learning algorithms (e.g. neural networks or Gaussian processes), concerning their performance as a function of the available structures in the training population.
    摘要 机器学习改变了许多领域中各种现象的建模方式,结构动力学便是其中之一。然而,由于机器学习算法面向特定问题,在数据稀缺时往往难以高效工作。为应对这类问题,人们发展了基于物理的方法与机器学习算法相结合的技术。这类方法虽然有效,但也要求分析者理解问题的底层物理。本工作旨在推动使用这样一类模型:它们从一组底层物理相似的现象中学习这种关系。这类模型的构建受到基于物理的模型(特别是有限元模型)工作方式的启发:此类模型被认为是可迁移、可解释且可信的,而这些属性对机器学习模型而言并非轻易获得。正因如此,机器学习方法较少被工业界信任,且通常被认为更难形成经过验证的模型。为得到这样的数据驱动模型,本文采用基于群体的方案,并使用元学习领域的两种机器学习算法:模型无关元学习(MAML)算法和条件神经过程(CNP)模型。两种算法的表现符合预期,在逼近目标量方面优于传统机器学习算法;并且在性能随训练群体中可用结构数量变化方面,它们表现出与传统机器学习算法(如神经网络或高斯过程)相似的行为。
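Since MAML is one of the two algorithms used, a minimal sketch of one meta-update may be useful. The task data and the tiny regression model are placeholders; the paper applies the same scheme to populations of structures.

```python
# One MAML meta-update in PyTorch (regression, one inner gradient step).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr, loss_fn = 0.01, nn.MSELoss()

def adapted_forward(x, adapted):
    # Re-run the same architecture with task-adapted parameters
    # (adapted = [w1, b1, w2, b2], matching model.parameters() order).
    h = torch.tanh(x @ adapted[0].t() + adapted[1])
    return h @ adapted[2].t() + adapted[3]

tasks = [(torch.randn(10, 1), torch.randn(10, 1),
          torch.randn(10, 1), torch.randn(10, 1)) for _ in range(4)]

meta_loss = 0.0
for x_s, y_s, x_q, y_q in tasks:                  # (support, query) per task
    params = list(model.parameters())
    inner_loss = loss_fn(model(x_s), y_s)
    grads = torch.autograd.grad(inner_loss, params, create_graph=True)
    adapted = [p - inner_lr * g for p, g in zip(params, grads)]
    meta_loss = meta_loss + loss_fn(adapted_forward(x_q, adapted), y_q)

meta_opt.zero_grad()
meta_loss.backward()                               # second-order meta-gradient
meta_opt.step()
```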

Towards Reliable Rare Category Analysis on Graphs via Individual Calibration

  • paper_url: http://arxiv.org/abs/2307.09858
  • repo_url: https://github.com/wulongfeng/calirare
  • paper_authors: Longfeng Wu, Bowen Lei, Dongkuan Xu, Dawei Zhou
  • for: 本研究旨在提高罕见类分类的可靠性,尤其是在面临高度不均衡数据分布的情况下。
  • methods: 本文提出了一种新的个体准确性框架,named CALIRARE,以解决罕见类分类中的独特挑战。该框架包括一种节点级别的不确定性量化算法,以及一种基于实例水平的准确性评价指标EICE。
  • results: 实验结果表明,CALIRARE可以有效地缓解罕见类分类中的偏置和不确定性问题,提高模型的准确性和可靠性。
    Abstract Rare categories abound in a number of real-world networks and play a pivotal role in a variety of high-stakes applications, including financial fraud detection, network intrusion detection, and rare disease diagnosis. Rare category analysis (RCA) refers to the task of detecting, characterizing, and comprehending the behaviors of minority classes in a highly-imbalanced data distribution. While the vast majority of existing work on RCA has focused on improving the prediction performance, a few fundamental research questions heretofore have received little attention and are less explored: How confident or uncertain is a prediction model in rare category analysis? How can we quantify the uncertainty in the learning process and enable reliable rare category analysis? To answer these questions, we start by investigating miscalibration in existing RCA methods. Empirical results reveal that state-of-the-art RCA methods are mainly over-confident in predicting minority classes and under-confident in predicting majority classes. Motivated by the observation, we propose a novel individual calibration framework, named CALIRARE, for alleviating the unique challenges of RCA, thus enabling reliable rare category analysis. In particular, to quantify the uncertainties in RCA, we develop a node-level uncertainty quantification algorithm to model the overlapping support regions with high uncertainty; to handle the rarity of minority classes in miscalibration calculation, we generalize the distribution-based calibration metric to the instance level and propose the first individual calibration measurement on graphs named Expected Individual Calibration Error (EICE). We perform extensive experimental evaluations on real-world datasets, including rare category characterization and model calibration tasks, which demonstrate the significance of our proposed framework.
    摘要 罕见类别在许多真实网络中大量存在,并在金融欺诈检测、网络入侵检测和罕见病诊断等高风险应用中扮演关键角色。罕见类别分析(RCA)指在高度不平衡的数据分布中检测、刻画并理解少数类别行为的任务。现有RCA工作绝大多数致力于提升预测性能,而一些基本的研究问题迄今很少受到关注:预测模型在罕见类别分析中的置信或不确定程度如何?如何量化学习过程中的不确定性,使罕见类别分析变得可靠?为回答这些问题,我们首先考察了现有RCA方法的校准失准问题。实验结果表明,最先进的RCA方法在预测少数类别时普遍过度自信,而在预测多数类别时自信不足。受此启发,我们提出一种新的个体校准框架CALIRARE,以缓解RCA的独特挑战,实现可靠的罕见类别分析。具体来说,为量化RCA中的不确定性,我们设计了一种节点级不确定性量化算法,对具有高不确定性的重叠支撑区域建模;为处理校准计算中少数类别的稀少性,我们将基于分布的校准度量推广到实例层面,提出了首个图上的个体校准度量,即期望个体校准误差(EICE)。我们在真实数据集上进行了大量实验评估,包括罕见类别刻画和模型校准任务,结果证明了所提框架的重要意义。
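For context, the distribution-level metric that EICE generalises is the standard expected calibration error (ECE): bin predictions by confidence and compare average confidence to accuracy per bin. The sketch below shows plain ECE only; the paper's instance-level, graph-based form is not reproduced here.

```python
# Standard binned expected calibration error (ECE).
import numpy as np

def ece(confidences, correct, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    err, n = 0.0, len(confidences)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            err += (mask.sum() / n) * gap  # weight gap by bin occupancy
    return err

conf = np.array([0.9, 0.8, 0.6, 0.95])
hit = np.array([1, 1, 0, 1], dtype=float)
print(ece(conf, hit))
```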

A Fast and Map-Free Model for Trajectory Prediction in Traffics

  • paper_url: http://arxiv.org/abs/2307.09831
  • repo_url: None
  • paper_authors: Junhong Xiang, Jingmin Zhang, Zhixiong Nan
  • for: Proposes an efficient trajectory prediction model that requires no HD maps and improves both prediction accuracy and computational efficiency in real traffic scenes.
  • methods: A two-stage approach: the first stage encodes each agent's spatial-temporal information, and the second stage explores spatial-temporal interactions among multiple agents, using attention mechanisms, LSTMs, graph convolution networks, and a temporal transformer to learn rich dynamics and interaction information for every agent.
  • results: Achieves the best performance among existing map-free methods on the Argoverse dataset, exceeds most map-based state-of-the-art methods, and offers faster inference than the baselines.
    Abstract Existing methods have two shortcomings: (i) nearly all models rely on high-definition (HD) maps, yet map information is not always available in real traffic scenes and HD map-building is expensive and time-consuming; and (ii) existing models usually improve prediction accuracy at the expense of computing efficiency, yet efficiency is crucial for various real applications. To handle both, this paper proposes an efficient trajectory prediction model that is not dependent on traffic maps. The core idea of our model is encoding a single agent's spatial-temporal information in the first stage and exploring multiple agents' spatial-temporal interactions in the second stage. By comprehensively utilizing attention mechanisms, LSTM, a graph convolution network and a temporal transformer in the two stages, our model is able to learn rich dynamic and interaction information for all agents. Our model achieves the highest performance among existing map-free methods and also exceeds most map-based state-of-the-art methods on the Argoverse dataset. In addition, our model exhibits a faster inference speed than the baseline methods.
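A minimal skeleton of the two-stage idea, assuming placeholder dimensions and a single scene: stage one encodes each agent's history independently, stage two mixes agents with attention. This is an illustrative sketch, not the paper's architecture (which also uses graph convolutions and a temporal transformer).

```python
import torch
import torch.nn as nn

class TwoStagePredictor(nn.Module):
    """Skeleton of a map-free, two-stage trajectory model: stage 1 encodes each
    agent's own history with an LSTM; stage 2 lets agents exchange information
    through multi-head attention. Dimensions and heads are placeholder choices."""
    def __init__(self, in_dim=2, hid=64, horizon=30):
        super().__init__()
        self.temporal = nn.LSTM(in_dim, hid, batch_first=True)                   # stage 1
        self.social = nn.MultiheadAttention(hid, num_heads=4, batch_first=True)  # stage 2
        self.head = nn.Linear(hid, horizon * 2)                                  # future (x, y)
        self.horizon = horizon

    def forward(self, hist):                 # hist: (agents, timesteps, 2)
        _, (h, _) = self.temporal(hist)      # h: (1, agents, hid)
        agents = h.squeeze(0).unsqueeze(0)   # treat all agents as one "scene"
        mixed, _ = self.social(agents, agents, agents)
        return self.head(mixed.squeeze(0)).view(-1, self.horizon, 2)

pred = TwoStagePredictor()(torch.randn(5, 20, 2))   # 5 agents, 20 past steps
print(pred.shape)                                   # torch.Size([5, 30, 2])
```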

Online Continual Learning for Robust Indoor Object Recognition

  • paper_url: http://arxiv.org/abs/2307.09827
  • repo_url: None
  • paper_authors: Umberto Michieli, Mete Ozay
  • for: This work aims to let vision systems mounted on home robots interact with unseen classes in changing environments.
  • methods: The authors propose RobOCLe, a method for few-shot (FS) online continual learning (CL). RobOCLe computes high-order statistical moments of the embedded features of samples and predicts class labels from the similarity between these moments in the enriched feature space.
  • results: Evaluations of CL models under various train/test augmentations show that different moments allow RobOCLe to capture different properties of deformations, yielding higher robustness with no decrease in inference speed.
    Abstract Vision systems mounted on home robots need to interact with unseen classes in changing environments. Robots have limited computational resources, labelled data and storage capability. These requirements pose some unique challenges: models should adapt without forgetting past knowledge in a data- and parameter-efficient way. We characterize the problem as few-shot (FS) online continual learning (OCL), where robotic agents learn from a non-repeated stream of few-shot data updating only a few model parameters. Additionally, such models experience variable conditions at test time, where objects may appear in different poses (e.g., horizontal or vertical) and environments (e.g., day or night). To improve robustness of CL agents, we propose RobOCLe, which; 1) constructs an enriched feature space computing high order statistical moments from the embedded features of samples; and 2) computes similarity between high order statistics of the samples on the enriched feature space, and predicts their class labels. We evaluate robustness of CL models to train/test augmentations in various cases. We show that different moments allow RobOCLe to capture different properties of deformations, providing higher robustness with no decrease of inference speed.
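The moment-based enrichment is easy to sketch. The snippet below is an assumption-laden reading, not the released method: it concatenates the first four statistical moments of a sample's token embeddings and classifies by cosine similarity to class prototypes.

```python
import torch

def enrich_with_moments(features):
    """Concatenate higher-order statistical moments of per-sample embeddings,
    loosely following the RobOCLe idea; the exact moments and normalisation
    used by the paper may differ. features: (batch, tokens, dim)."""
    mean = features.mean(dim=1)
    var = features.var(dim=1, unbiased=False)
    centered = features - mean.unsqueeze(1)
    std = var.clamp_min(1e-6).sqrt()
    skew = (centered ** 3).mean(dim=1) / std ** 3
    kurt = (centered ** 4).mean(dim=1) / std ** 4
    return torch.cat([mean, var, skew, kurt], dim=-1)

def predict_by_prototype(x, prototypes):
    """Nearest-prototype classification by cosine similarity in the enriched space."""
    sims = torch.nn.functional.cosine_similarity(
        x.unsqueeze(1), prototypes.unsqueeze(0), dim=-1)   # (batch, classes)
    return sims.argmax(dim=-1)

feats = torch.randn(8, 49, 128)                  # e.g. 7x7 spatial tokens
enriched = enrich_with_moments(feats)            # (8, 512)
protos = torch.randn(10, 512)                    # 10 class prototypes
print(predict_by_prototype(enriched, protos))
```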

Probabilistic Forecasting with Coherent Aggregation

  • paper_url: http://arxiv.org/abs/2307.09797
  • repo_url: None
  • paper_authors: Geoffrey Négiar, Ruijun Ma, O. Nangba Meetei, Mengfei Cao, Michael W. Mahoney
  • for: This work aims to produce accurate probabilistic forecasts while respecting hierarchical structure information.
  • methods: The paper proposes a new method that leverages a factor model structure to produce hierarchically coherent forecasts by construction.
  • results: On three hierarchical forecasting datasets, the method achieves significant improvements of 11.8-41.4%; the authors also analyze how the base-level distribution and the number of factors influence the results.
    Abstract Obtaining accurate probabilistic forecasts while respecting hierarchical information is an important operational challenge in many applications, perhaps most obviously in energy management, supply chain planning, and resource allocation. The basic challenge, especially for multivariate forecasting, is that forecasts are often required to be coherent with respect to the hierarchical structure. In this paper, we propose a new model which leverages a factor model structure to produce coherent forecasts by construction. This is a consequence of a simple (exchangeability) observation: permuting base-level series in the hierarchy does not change their aggregates. Our model uses a convolutional neural network to produce parameters for the factors, their loadings and base-level distributions; it produces samples which can be differentiated with respect to the model's parameters; and it can therefore optimize for any sample-based loss function, including the Continuous Ranked Probability Score and quantile losses. We can choose arbitrary continuous distributions for the factor and the base-level distributions. We compare our method to two previous methods which can be optimized end-to-end, while enforcing coherent aggregation. Our model achieves significant improvements: between $11.8-41.4\%$ on three hierarchical forecasting datasets. We also analyze the influence of parameters in our model with respect to base-level distribution and number of factors.
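The exchangeability argument can be seen in a few lines: if every base-level series is sampled from a shared factor model and aggregates are computed by summation, coherence holds by construction. All parameters below are random stand-ins for what the paper's convolutional network would emit.

```python
import numpy as np

rng = np.random.default_rng(1)
n_base, n_factors, horizon = 6, 2, 8

# A toy hierarchy: total = sum of all base series; two mid-level groups.
S = np.vstack([np.ones(n_base),                       # total
               np.r_[np.ones(3), np.zeros(3)],        # group A
               np.r_[np.zeros(3), np.ones(3)]])       # group B

# Factor-model sample: base series share common factors plus idiosyncratic noise.
loadings = rng.normal(size=(n_base, n_factors))
factors = rng.normal(size=(n_factors, horizon))
noise = 0.1 * rng.normal(size=(n_base, horizon))
base = loadings @ factors + noise                     # one joint sample

aggregates = S @ base   # coherent by construction: every aggregate sample is
                        # literally the sum of its base-level samples, so no
                        # post-hoc reconciliation step is needed
assert np.allclose(aggregates[0], base.sum(axis=0))
print(aggregates.shape)                               # (3, 8)
```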

ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats

  • paper_url: http://arxiv.org/abs/2307.09782
  • repo_url: https://github.com/microsoft/DeepSpeed
  • paper_authors: Xiaoxia Wu, Zhewei Yao, Yuxiong He
  • for: This study addresses the challenge of balancing computational efficiency and model quality in large language models (LLMs), motivated by the launch of NVIDIA's H100 hardware.
  • methods: It investigates floating-point (FP) quantization, particularly FP8 and FP4, to handle outliers in LLMs; proposes two scaling constraints to reduce the overhead caused by precision alignment; and integrates the Low Rank Compensation (LoRC) strategy into the quantization methods.
  • results: FP8 activation quantization consistently outperforms INT8, especially for models with more than one billion parameters; for weight quantization, FP4 performs comparably to or better than INT4, simplifying deployment on FP-supported hardware such as H100. The findings suggest that FP quantization enables highly efficient LLM deployment in resource-limited settings.
    Abstract In the complex domain of large language models (LLMs), striking a balance between computational efficiency and maintaining model quality is a formidable challenge. Navigating the inherent limitations of uniform quantization, particularly when dealing with outliers, and motivated by the launch of NVIDIA's H100 hardware, this study delves into the viability of floating-point (FP) quantization, particularly focusing on FP8 and FP4, as a potential solution. Our comprehensive investigation reveals that for LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models possessing parameters beyond one billion. For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100. To mitigate the overhead from precision alignment caused by the disparity between weights and activations, we propose two scaling constraints for weight quantization that negligibly impact the performance compared to the standard W4A8 model. We additionally enhance our quantization methods by integrating the Low Rank Compensation (LoRC) strategy, yielding improvements especially in smaller models. The results of our investigation emphasize the immense potential of FP quantization for LLMs, paving the way for high-efficiency deployment in resource-limited settings.
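To make the weight-quantization side concrete, here is a toy fake-quantizer onto an assumed E2M1 FP4 grid with per-channel scaling. The actual ZeroQuant-FP kernels live in the linked DeepSpeed repo; this only simulates the rounding behaviour.

```python
import torch

# Representable magnitudes of an assumed E2M1 FP4 format (1 sign, 2 exponent,
# 1 mantissa bit). This is a simulation, not DeepSpeed's implementation.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(w):
    """Per-output-channel symmetric fake quantization onto the FP4 grid."""
    scale = w.abs().amax(dim=1, keepdim=True) / FP4_GRID.max()   # map max |w| -> 6
    scaled = (w / scale).abs()
    # round each magnitude to the nearest representable FP4 value
    idx = (scaled.unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    return FP4_GRID[idx] * w.sign() * scale

w = torch.randn(16, 64)          # a toy weight matrix (out_features, in_features)
wq = fake_quant_fp4(w)
print("mean abs error:", (w - wq).abs().mean().item())
```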

Text2Layer: Layered Image Generation using Latent Diffusion Model

  • paper_url: http://arxiv.org/abs/2307.09781
  • repo_url: None
  • paper_authors: Xinyang Zhang, Wentian Zhao, Xin Lu, Jeff Chien
  • for: The paper explores layer compositing for image editing from a layered image generation perspective.
  • methods: It proposes a diffusion-based approach to layered image generation, training an autoencoder to reconstruct layered images and training diffusion models on the latent representation.
  • results: Experiments show that the method generates high-quality layered images and establishes a benchmark for future work.
    Abstract Layer compositing is one of the most popular image editing workflows among both amateurs and professionals. Motivated by the success of diffusion models, we explore layer compositing from a layered image generation perspective. Instead of generating an image, we propose to generate background, foreground, layer mask, and the composed image simultaneously. To achieve layered image generation, we train an autoencoder that is able to reconstruct layered images and train diffusion models on the latent representation. One benefit of the proposed problem is to enable better compositing workflows in addition to the high-quality image output. Another benefit is producing higher-quality layer masks compared to masks produced by a separate step of image segmentation. Experimental results show that the proposed method is able to generate high-quality layered images and initiates a benchmark for future work.
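The four generated quantities are tied together by ordinary alpha compositing, which is worth stating explicitly since it is the constraint the model's joint output must satisfy:

```python
import numpy as np

def composite(background, foreground, mask):
    """Standard alpha compositing: the model emits (background, foreground,
    mask) jointly and the composed image follows deterministically."""
    mask = mask[..., None]                       # (H, W, 1) broadcasts over RGB
    return mask * foreground + (1.0 - mask) * background

H, W = 64, 64
bg = np.zeros((H, W, 3)); bg[..., 2] = 1.0       # blue background
fg = np.zeros((H, W, 3)); fg[..., 0] = 1.0       # red foreground
mask = np.zeros((H, W)); mask[16:48, 16:48] = 1  # a square layer mask
img = composite(bg, fg, mask)
print(img.shape, img[32, 32], img[0, 0])         # red inside, blue outside
```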

Beyond Single-Feature Importance with ICECREAM

  • paper_url: http://arxiv.org/abs/2307.09779
  • repo_url: None
  • paper_authors: Michael Oesterle, Patrick Blöbaum, Atalanti A. Mastakouri, Elke Kirschbaum
  • for: This work aims to explain which coalitions of variables are responsible for a certain outcome, rather than attributing importance to individual variables alone.
  • methods: It proposes an information-theoretic quantitative measure for the influence of a coalition of variables on the distribution of a target variable, identifying the set of factors that is essential for a specific outcome.
  • results: In experiments with synthetic and real-world data, ICECREAM outperforms state-of-the-art methods for explainability and root cause analysis, achieving impressive accuracy in both tasks.
    Abstract Which set of features was responsible for a certain output of a machine learning model? Which components caused the failure of a cloud computing application? These are just two examples of questions we are addressing in this work by Identifying Coalition-based Explanations for Common and Rare Events in Any Model (ICECREAM). Specifically, we propose an information-theoretic quantitative measure for the influence of a coalition of variables on the distribution of a target variable. This allows us to identify which set of factors is essential to obtain a certain outcome, as opposed to well-established explainability and causal contribution analysis methods which can assign contributions only to individual factors and rank them by their importance. In experiments with synthetic and real-world data, we show that ICECREAM outperforms state-of-the-art methods for explainability and root cause analysis, and achieves impressive accuracy in both tasks.
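A brute-force toy makes the coalition idea concrete: score every subset of variables by how much conditioning on it pins down the target, then prefer small high-scoring sets. The variance-reduction score below is a crude stand-in for the paper's information-theoretic measure.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x = rng.integers(0, 2, size=(n, 4))              # four binary "components"
y = x[:, 0] & x[:, 2]                            # outcome truly caused by {0, 2}

def coalition_score(cols):
    """How much does knowing this coalition reduce uncertainty about y?
    Here: drop in variance of y after conditioning, a crude stand-in for
    the paper's information-theoretic quantity."""
    base = y.var()
    resid = 0.0
    for vals in itertools.product([0, 1], repeat=len(cols)):
        sel = np.all(x[:, cols] == vals, axis=1)
        if sel.any():
            resid += sel.mean() * y[sel].var()
    return base - resid

best = max((c for k in range(1, 5) for c in itertools.combinations(range(4), k)),
           key=lambda c: coalition_score(list(c)) - 1e-3 * len(c))  # prefer small sets
print(best)                                       # typically (0, 2)
```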

Generating Redstone Style Cities in Minecraft

  • paper_url: http://arxiv.org/abs/2307.09777
  • repo_url: None
  • paper_authors: Shuo Huang, Chengpeng Hu, Julian Togelius, Jialin Liu
  • for: The paper presents a generator for cities in Minecraft, giving players more diverse scenarios and helping to understand and improve the design of cities in other digital worlds and the real world.
  • methods: Generation comprises six main steps: vegetation clearing, terrain reshaping, building layout generation, route planning, streetlight placement, and wall construction. Three algorithms (a heuristic-based algorithm, an evolving layout algorithm, and a random one) are applied to generate building layouts, determining where different redstone-style buildings are placed.
  • results: Experiments show that the heuristic-based algorithm finds an acceptable building layout faster on flat maps, while the evolving layout algorithm performs better on rugged maps. A user study comparing the generator with outstanding entries of the competition's 2022 edition, under the competition's evaluation criteria, shows that it performs well on adaptation and functionality.
    Abstract Procedurally generating cities in Minecraft provides players more diverse scenarios and could help understand and improve the design of cities in other digital worlds and the real world. This paper presents a city generator that was submitted as an entry to the 2023 Edition of Minecraft Settlement Generation Competition for Minecraft. The generation procedure is composed of six main steps, namely vegetation clearing, terrain reshaping, building layout generation, route planning, streetlight placement, and wall construction. Three algorithms, including a heuristic-based algorithm, an evolving layout algorithm, and a random one, are applied to generate the building layout, thus determining where to place different redstone style buildings, and tested by generating cities on random maps in limited time. Experimental results show that the heuristic-based algorithm is capable of finding an acceptable building layout faster for flat maps, while the evolving layout algorithm performs better in evolving layout for rugged maps. A user study is conducted to compare our generator with outstanding entries of the competition's 2022 edition using the competition's evaluation criteria and shows that our generator performs well in the adaptation and functionality criteria.

Eliminating Label Leakage in Tree-Based Vertical Federated Learning

  • paper_url: http://arxiv.org/abs/2307.10318
  • repo_url: None
  • paper_authors: Hideaki Takahashi, Jingjing Liu, Yang Liu
  • for: This study investigates label leakage in tree-based vertical federated learning and an effective defense mechanism against it.
  • methods: It introduces a novel label inference attack, ID2Graph, which uses the sets of record IDs assigned to each tree node (the instance space) to build a graph structure, extracts communities from the graph, and clusters the local dataset using the community information. It also proposes ID-LMID, a defense mechanism based on mutual information regularization.
  • results: Experiments show that ID2Graph poses significant label-leakage risks to tree-based models such as Random Forest and XGBoost, and that ID-LMID effectively mitigates the leakage in such cases.
    Abstract Vertical federated learning (VFL) enables multiple parties with disjoint features of a common user set to train a machine learning model without sharing their private data. Tree-based models have become prevalent in VFL due to their interpretability and efficiency. However, the vulnerability of tree-based VFL has not been sufficiently investigated. In this study, we first introduce a novel label inference attack, ID2Graph, which utilizes the sets of record-IDs assigned to each node (i.e., instance space) to deduce private training labels. The ID2Graph attack generates a graph structure from training samples, extracts communities from the graph, and clusters the local dataset using community information. To counteract label leakage from the instance space, we propose an effective defense mechanism, ID-LMID, which prevents label leakage by focusing on mutual information regularization. Comprehensive experiments conducted on various datasets reveal that the ID2Graph attack presents significant risks to tree-based models such as Random Forest and XGBoost. Further evaluations on these benchmarks demonstrate that ID-LMID effectively mitigates label leakage in such instances.
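The first step of the attack is easy to sketch: turn the leaf-level instance-ID sets into a weighted co-occurrence graph and run community detection. The leaf sets below are toy values; the paper's full pipeline then clusters the local data using these communities.

```python
import itertools
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy instance-ID sets observed at the leaves of a party's trees. Records that
# repeatedly end up in the same leaves are likely to share a training label.
leaves = [{0, 1, 2, 3}, {4, 5, 6}, {0, 1, 2}, {4, 5, 7}, {3, 0}]

G = nx.Graph()
for leaf in leaves:
    for u, v in itertools.combinations(sorted(leaf), 2):
        w = G.get_edge_data(u, v, default={"weight": 0})["weight"]
        G.add_edge(u, v, weight=w + 1)           # co-occurrence count as edge weight

communities = greedy_modularity_communities(G, weight="weight")
print([sorted(c) for c in communities])          # e.g. [[0, 1, 2, 3], [4, 5, 6, 7]]
```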

Self-Supervised Learning for WiFi CSI-Based Human Activity Recognition: A Systematic Study

  • paper_url: http://arxiv.org/abs/2308.02412
  • repo_url: https://github.com/JJJinx/SSLCSI
  • paper_authors: Ke Xu, Jiangtao Wang, Hongyuan Zhu, Dingchang Zheng
  • for: This study explores the potential of SSL algorithms for WiFi CSI-based human activity recognition (HAR), aiming to improve the performance of deep learning models on CSI data.
  • methods: It evaluates four categories of SSL algorithms, including both previously studied and not-yet-explored ones, on three publicly available CSI HAR datasets spanning different tasks and environmental settings.
  • results: The experiments show that SSL algorithms can effectively improve WiFi CSI-based HAR, but also expose limitations and blind spots that must be addressed before SSL can be deployed in real-world applications.
    Abstract Recently, with the advancement of the Internet of Things (IoT), WiFi CSI-based HAR has gained increasing attention from academic and industry communities. By integrating the deep learning technology with CSI-based HAR, researchers achieve state-of-the-art performance without the need of expert knowledge. However, the scarcity of labeled CSI data remains the most prominent challenge when applying deep learning models in the context of CSI-based HAR due to the privacy and incomprehensibility of CSI-based HAR data. On the other hand, SSL has emerged as a promising approach for learning meaningful representations from data without heavy reliance on labeled examples. Therefore, considerable efforts have been made to address the challenge of insufficient data in deep learning by leveraging SSL algorithms. In this paper, we undertake a comprehensive inventory and analysis of the potential held by different categories of SSL algorithms, including those that have been previously studied and those that have not yet been explored, within the field. We provide an in-depth investigation of SSL algorithms in the context of WiFi CSI-based HAR. We evaluate four categories of SSL algorithms using three publicly available CSI HAR datasets, each encompassing different tasks and environmental settings. To ensure relevance to real-world applications, we design performance metrics that align with specific requirements. Furthermore, our experimental findings uncover several limitations and blind spots in existing work, highlighting the barriers that need to be addressed before SSL can be effectively deployed in real-world WiFi-based HAR applications. Our results also serve as a practical guideline for industry practitioners and provide valuable insights for future research endeavors in this field.

Perturbing a Neural Network to Infer Effective Connectivity: Evidence from Synthetic EEG Data

  • paper_url: http://arxiv.org/abs/2307.09770
  • repo_url: None
  • paper_authors: Peizhen Yang, Xinke Shen, Zongsheng Li, Zixiang Luo, Kexin Lou, Quanying Liu
  • for: This study uses a data-driven approach to infer causal (effective) connectivity from EEG signals, probing the brain's information processing and cognitive functions.
  • methods: Neural networks (CNN, vanilla RNN, GRU, LSTM, and Transformer) are trained to predict future EEG signals from historical data; the networks' inputs are then perturbed to infer effective connectivity between the perturbed EEG channel and the remaining channels.
  • results: On synthetic EEG generated by a biologically plausible Jansen-Rit model, CNN and Transformer obtain the best performance on both 3-channel and 90-channel data, outperforming the classical Granger causality method.
    Abstract Identifying causal relationships among distinct brain areas, known as effective connectivity, holds key insights into the brain's information processing and cognitive functions. Electroencephalogram (EEG) signals exhibit intricate dynamics and inter-areal interactions within the brain. However, methods for characterizing nonlinear causal interactions among multiple brain regions remain relatively underdeveloped. In this study, we proposed a data-driven framework to infer effective connectivity by perturbing the trained neural networks. Specifically, we trained neural networks (i.e., CNN, vanilla RNN, GRU, LSTM, and Transformer) to predict future EEG signals according to historical data and perturbed the networks' input to obtain effective connectivity (EC) between the perturbed EEG channel and the rest of the channels. The EC reflects the causal impact of perturbing one node on others. The performance was tested on the synthetic EEG generated by a biological-plausible Jansen-Rit model. CNN and Transformer obtained the best performance on both 3-channel and 90-channel synthetic EEG data, outperforming the classical Granger causality method. Our work demonstrated the potential of perturbing an artificial neural network, learned to predict future system dynamics, to uncover the underlying causal structure.
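The perturbation probe itself is simple, given any trained forecaster. A hypothetical sketch, with a stand-in linear "model" in place of the paper's trained networks:

```python
import torch

def effective_connectivity(model, eeg, channel, eps=0.1):
    """Perturb one input channel of a trained forecaster and measure how much
    each predicted channel moves; the resulting rows sketch the causal
    influence of the perturbed node. `model` maps a history window
    (batch, channels, time) to the next samples (batch, channels)."""
    with torch.no_grad():
        base = model(eeg)
        bumped = eeg.clone()
        bumped[:, channel, :] += eps             # small additive perturbation
        return (model(bumped) - base).abs().mean(dim=0) / eps  # (channels,)

# Stand-in "trained" forecaster: a fixed linear readout of the last time step.
torch.manual_seed(0)
W = torch.randn(3, 3)
model = lambda x: x[:, :, -1] @ W.T
eeg = torch.randn(32, 3, 100)                    # 32 windows, 3 channels
ec = torch.stack([effective_connectivity(model, eeg, c) for c in range(3)])
print(ec)                                        # ec[i, j]: effect of channel i on j
```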

Sig-Splines: universal approximation and convex calibration of time series generative models

  • paper_url: http://arxiv.org/abs/2307.09767
  • repo_url: None
  • paper_authors: Magnus Wiese, Phillip Murray, Ralf Korn
  • for: This paper proposes a novel generative model for multivariate discrete-time time series data.
  • methods: Drawing on the construction of neural spline flows, the model substitutes linear transformations and the signature transform for traditional neural networks.
  • results: The resulting model retains the universality property of neural networks while introducing convexity in the model's parameters.
    Abstract We propose a novel generative model for multivariate discrete-time time series data. Drawing inspiration from the construction of neural spline flows, our algorithm incorporates linear transformations and the signature transform as a seamless substitution for traditional neural networks. This approach enables us to achieve not only the universality property inherent in neural networks but also introduces convexity in the model's parameters.

Towards Building More Robust Models with Frequency Bias

  • paper_url: http://arxiv.org/abs/2307.09763
  • repo_url: https://github.com/retsuh-bqw/ICCV23-Towards-Building-More-Robust-Models-with-Frequency-Bias
  • paper_authors: Qingwen Bu, Dong Huang, Heming Cui
  • for: This work aims to improve the adversarial robustness of deep neural networks with a plug-and-play module, usable in any adversarial training framework, that makes better use of frequency information.
  • methods: It proposes a Frequency Preference Control Module that adaptively reconfigures the low- and high-frequency components of intermediate feature representations.
  • results: Experiments show the module can be easily incorporated into any adversarial training framework, improving robustness across different architectures and datasets; the paper also examines how the frequency bias of robust models affects the adversarial training process and final robustness, revealing interesting insights.
    Abstract The vulnerability of deep neural networks to adversarial samples has been a major impediment to their broad applications, despite their success in various fields. Recently, some works suggested that adversarially-trained models emphasize the importance of low-frequency information to achieve higher robustness. While several attempts have been made to leverage this frequency characteristic, they have all faced the issue that applying low-pass filters directly to input images leads to irreversible loss of discriminative information and poor generalizability to datasets with distinct frequency features. This paper presents a plug-and-play module called the Frequency Preference Control Module that adaptively reconfigures the low- and high-frequency components of intermediate feature representations, providing better utilization of frequency in robust learning. Empirical studies show that our proposed module can be easily incorporated into any adversarial training framework, further improving model robustness across different architectures and datasets. Additionally, experiments were conducted to examine how the frequency bias of robust models impacts the adversarial training process and its final robustness, revealing interesting insights.
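One plausible reading of such a module, sketched with an FFT-based low/high split and learnable recombination weights; the published module (see the linked repo) may parameterise the cutoff and weights differently.

```python
import torch
import torch.nn as nn

class FrequencyPreferenceControl(nn.Module):
    """Sketch of an FFT-based split of intermediate features into low- and
    high-frequency parts with learnable recombination weights. Not the
    paper's exact design."""
    def __init__(self, cutoff=0.25):
        super().__init__()
        self.cutoff = cutoff
        self.w_low = nn.Parameter(torch.ones(1))
        self.w_high = nn.Parameter(torch.ones(1))

    def forward(self, x):                        # x: (batch, channels, H, W)
        f = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
        H, W = x.shape[-2:]
        yy, xx = torch.meshgrid(torch.linspace(-1, 1, H, device=x.device),
                                torch.linspace(-1, 1, W, device=x.device),
                                indexing="ij")
        low_mask = ((yy ** 2 + xx ** 2).sqrt() <= self.cutoff).float()
        low = torch.fft.ifft2(torch.fft.ifftshift(f * low_mask, dim=(-2, -1))).real
        high = x - low                           # complement of the low-pass part
        return self.w_low * low + self.w_high * high

out = FrequencyPreferenceControl()(torch.randn(2, 8, 32, 32))
print(out.shape)                                 # torch.Size([2, 8, 32, 32])
```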

Reinforcing POD based model reduction techniques in reaction-diffusion complex networks using stochastic filtering and pattern recognition

  • paper_url: http://arxiv.org/abs/2307.09762
  • repo_url: None
  • paper_authors: Abhishek Ajayakumar, Soumyendu Raha
  • for: Complex networks are used to model many real-world systems, but the dimensionality of these systems can make them challenging to analyze.
  • methods: The paper proposes an algorithmic framework that combines techniques from pattern recognition and stochastic filtering theory to enhance the output of dimensionality reduction models such as POD.
  • results: The study shows that the proposed method can improve the accuracy of the surrogate model under perturbed inputs; a Neural ODE-based approach serves as the reference benchmark.
    Abstract Complex networks are used to model many real-world systems. However, the dimensionality of these systems can make them challenging to analyze. Dimensionality reduction techniques like POD can be used in such cases. However, these models are susceptible to perturbations in the input data. We propose an algorithmic framework that combines techniques from pattern recognition (PR) and stochastic filtering theory to enhance the output of such models. The results of our study show that our method can improve the accuracy of the surrogate model under perturbed inputs. Deep Neural Networks (DNNs) are susceptible to adversarial attacks. However, recent research has revealed that neural Ordinary Differential Equations (ODEs) exhibit robustness in specific applications. We benchmark our algorithmic framework with a Neural ODE-based approach as a reference.
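For reference, the POD baseline the framework builds on is a truncated SVD of a snapshot matrix; the paper's contribution sits on top of such a surrogate.

```python
import numpy as np

# Snapshot-based POD: collect system states as columns, take a truncated SVD,
# and keep the leading modes as a reduced basis.
rng = np.random.default_rng(3)
t = np.linspace(0, 1, 200)
snapshots = np.stack([np.sin(2 * np.pi * (x + t)) + 0.01 * rng.normal(size=t.size)
                      for x in np.linspace(0, 1, 50)])      # (50 dofs, 200 times)

U, s, Vt = np.linalg.svd(snapshots, full_matrices=False)
r = 2                                            # a travelling sine needs 2 modes
basis = U[:, :r]                                 # POD modes
reduced = basis.T @ snapshots                    # reduced coordinates (r, 200)
reconstruction = basis @ reduced

rel_err = np.linalg.norm(snapshots - reconstruction) / np.linalg.norm(snapshots)
print(f"rank-{r} relative error: {rel_err:.4f}")  # small: two modes suffice
```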

FedBug: A Bottom-Up Gradual Unfreezing Framework for Federated Learning

  • paper_url: http://arxiv.org/abs/2307.10317
  • repo_url: https://github.com/iandrover/fedbug
  • paper_authors: Chia-Hsiang Kao, Yu-Chiang Frank Wang
  • for: This paper proposes FedBug (Federated Learning with Bottom-Up Gradual Unfreezing), a federated learning (FL) framework that lets multiple clients collaboratively train a shared model without compromising data privacy.
  • methods: FedBug is designed to effectively mitigate client drift: on the client side it freezes the entire model, then gradually unfreezes the layers from the input layer to the output layer, using the model parameters distributed by the server at each global round as reference points for cross-client alignment.
  • results: Theoretical analysis in a novel over-parameterization FL setup shows a superior convergence rate compared to FedAvg, and comprehensive experiments across various datasets, training conditions, and network architectures validate the efficacy of FedBug.
    Abstract Federated Learning (FL) offers a collaborative training framework, allowing multiple clients to contribute to a shared model without compromising data privacy. Due to the heterogeneous nature of local datasets, updated client models may overfit and diverge from one another, commonly known as the problem of client drift. In this paper, we propose FedBug (Federated Learning with Bottom-Up Gradual Unfreezing), a novel FL framework designed to effectively mitigate client drift. FedBug adaptively leverages the client model parameters, distributed by the server at each global round, as the reference points for cross-client alignment. Specifically, on the client side, FedBug begins by freezing the entire model, then gradually unfreezes the layers, from the input layer to the output layer. This bottom-up approach allows models to train the newly thawed layers to project data into a latent space, wherein the separating hyperplanes remain consistent across all clients. We theoretically analyze FedBug in a novel over-parameterization FL setup, revealing its superior convergence rate compared to FedAvg. Through comprehensive experiments, spanning various datasets, training conditions, and network architectures, we validate the efficacy of FedBug. Our contributions encompass a novel FL framework, theoretical analysis, and empirical validation, demonstrating the wide potential and applicability of FedBug.
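A minimal sketch of the client-side schedule, assuming an evenly paced unfreezing across local steps; FedBug's exact pacing is in the linked repo.

```python
import torch.nn as nn

def bottom_up_unfreeze(model_layers, step, total_steps):
    """Freeze everything, then progressively unfreeze layers from the input
    side toward the output as local training advances. A minimal reading of
    the bottom-up schedule, not FedBug's exact implementation."""
    n = len(model_layers)
    n_open = min(n, 1 + (step * n) // total_steps)   # layers currently trainable
    for i, layer in enumerate(model_layers):
        for p in layer.parameters():
            p.requires_grad = i < n_open

layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
for step in range(4):
    bottom_up_unfreeze(layers, step, total_steps=4)
    open_now = [i for i, l in enumerate(layers)
                if next(l.parameters()).requires_grad]
    print(f"step {step}: trainable layers {open_now}")
```

The frozen upper layers keep the server-distributed parameters, which is what lets the newly thawed lower layers project data into a latent space that stays aligned across clients.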

Space Engage: Collaborative Space Supervision for Contrastive-based Semi-Supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.09755
  • repo_url: https://github.com/WangChangqi98/CSS
  • paper_authors: Changqi Wang, Haoyu Xie, Yuhui Yuan, Chong Fu, Xiangyu Yue
  • for: To improve the robustness and accuracy of semantic segmentation models trained with limited labeled images and a substantial volume of unlabeled images.
  • methods: The paper proposes a pixel-wise contrastive learning method that obtains supervision collaboratively from both logit space and representation space, and uses the similarity between representations and prototypes as a new indicator to tilt training toward under-performing representations.
  • results: Results on two public benchmarks demonstrate performance competitive with state-of-the-art methods.
    Abstract Semi-Supervised Semantic Segmentation (S4) aims to train a segmentation model with limited labeled images and a substantial volume of unlabeled images. To improve the robustness of representations, powerful methods introduce a pixel-wise contrastive learning approach in latent space (i.e., representation space) that aggregates the representations to their prototypes in a fully supervised manner. However, previous contrastive-based S4 methods merely rely on the supervision from the model's output (logits) in logit space during unlabeled training. In contrast, we utilize the outputs in both logit space and representation space to obtain supervision in a collaborative way. The supervision from two spaces plays two roles: 1) reduces the risk of over-fitting to incorrect semantic information in logits with the help of representations; 2) enhances the knowledge exchange between the two spaces. Furthermore, unlike previous approaches, we use the similarity between representations and prototypes as a new indicator to tilt training those under-performing representations and achieve a more efficient contrastive learning process. Results on two public benchmarks demonstrate the competitive performance of our method compared with state-of-the-art methods.
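One plausible rendering of the similarity indicator: weight each sample's contrastive term by its distance from its class prototype, so under-performing representations receive more training pressure. The exact formula in the paper may differ.

```python
import torch
import torch.nn.functional as F

def under_performance_weights(reps, prototypes, labels):
    """Weight each pixel/sample by how far its representation sits from its
    class prototype. A plausible reading of the similarity indicator, not
    the paper's exact formula."""
    sims = F.cosine_similarity(reps, prototypes[labels], dim=-1)   # in (-1, 1)
    return (1.0 - sims) / 2.0                                      # in [0, 1]

reps = F.normalize(torch.randn(16, 64), dim=-1)
prototypes = F.normalize(torch.randn(5, 64), dim=-1)
labels = torch.randint(0, 5, (16,))
w = under_performance_weights(reps, prototypes, labels)
loss = (w * torch.randn(16).abs()).mean()        # weight a per-sample loss term
print(w.min().item(), w.max().item())
```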

Information Retrieval Meets Large Language Models: A Strategic Report from Chinese IR Community

  • paper_url: http://arxiv.org/abs/2307.09751
  • repo_url: None
  • paper_authors: Qingyao Ai, Ting Bai, Zhao Cao, Yi Chang, Jiawei Chen, Zhumin Chen, Zhiyong Cheng, Shoubin Dong, Zhicheng Dou, Fuli Feng, Shen Gao, Jiafeng Guo, Xiangnan He, Yanyan Lan, Chenliang Li, Yiqun Liu, Ziyu Lyu, Weizhi Ma, Jun Ma, Zhaochun Ren, Pengjie Ren, Zhiqiang Wang, Mingwen Wang, Ji-Rong Wen, Le Wu, Xin Xin, Jun Xu, Dawei Yin, Peng Zhang, Fan Zhang, Weinan Zhang, Min Zhang, Xiaofei Zhu
  • for: This report explores the applications and potential of large language models (LLMs) in information retrieval (IR) and proposes a new technical paradigm for IR.
  • methods: It summarizes a strategic workshop held by the Chinese IR community in April 2023, gathering the participants' views and ideas.
  • results: The workshop concluded that LLMs can improve the effectiveness and experience of information retrieval, while raising challenges around computational costs, credibility, domain-specific limitations, and ethics; it also proposed a new technical paradigm built on the synergy among humans, LLMs, and IR systems.
    Abstract The research field of Information Retrieval (IR) has evolved significantly, expanding beyond traditional search to meet diverse user information needs. Recently, Large Language Models (LLMs) have demonstrated exceptional capabilities in text understanding, generation, and knowledge inference, opening up exciting avenues for IR research. LLMs not only facilitate generative retrieval but also offer improved solutions for user understanding, model evaluation, and user-system interactions. More importantly, the synergistic relationship among IR models, LLMs, and humans forms a new technical paradigm that is more powerful for information seeking. IR models provide real-time and relevant information, LLMs contribute internal knowledge, and humans play a central role of demanders and evaluators to the reliability of information services. Nevertheless, significant challenges exist, including computational costs, credibility concerns, domain-specific limitations, and ethical considerations. To thoroughly discuss the transformative impact of LLMs on IR research, the Chinese IR community conducted a strategic workshop in April 2023, yielding valuable insights. This paper provides a summary of the workshop's outcomes, including the rethinking of IR's core values, the mutual enhancement of LLMs and IR, the proposal of a novel IR technical paradigm, and open challenges.

Enhancing conversational quality in language learning chatbots: An evaluation of GPT4 for ASR error correction

  • paper_url: http://arxiv.org/abs/2307.09744
  • repo_url: None
  • paper_authors: Long Mai, Julie Carson-Berndsen
  • for: This study explores the use of GPT4 for ASR error correction in conversational settings, to help language learners improve their language skills.
  • methods: GPT4 is used to correct ASR errors in conversations, and semantic textual similarity (STS) and next response sensibility (NRS) metrics are proposed, alongside word-error-rate (WER), to evaluate the impact of error correction on conversation quality.
  • results: Transcriptions corrected by GPT4 lead to higher conversation quality despite an increase in WER, and GPT4 outperforms standard error correction methods without requiring in-domain training data.
    Abstract The integration of natural language processing (NLP) technologies into educational applications has shown promising results, particularly in the language learning domain. Recently, many spoken open-domain chatbots have been used as speaking partners, helping language learners improve their language skills. However, one of the significant challenges is the high word-error-rate (WER) when recognizing non-native/non-fluent speech, which interrupts conversation flow and leads to disappointment for learners. This paper explores the use of GPT4 for ASR error correction in conversational settings. In addition to WER, we propose to use semantic textual similarity (STS) and next response sensibility (NRS) metrics to evaluate the impact of error correction models on the quality of the conversation. We find that transcriptions corrected by GPT4 lead to higher conversation quality, despite an increase in WER. GPT4 also outperforms standard error correction methods without the need for in-domain training data.

Absolutist AI

  • paper_url: http://arxiv.org/abs/2307.10315
  • repo_url: None
  • paper_authors: Mitchell Barrington
  • for: The paper argues that training AI systems with absolute constraints, which forbid certain acts irrespective of the value they might produce, may make considerable progress on many AI safety problems.
  • methods: It offers a decision-theoretic formalization of absolute constraints, improving on existing models in the literature, and uses this model to prove results about the training and behavior of absolutist AIs. Such constraints provide a guardrail against the worst outcomes of misalignment, prevent catastrophes pursued for the sake of very valuable consequences, make systems more corrigible, and let them explore their environment more safely.
  • results: The paper shows that although absolutist AIs will not maximize expected value, they will not behave irrationally and will not (contra coherence arguments) face environmental pressure to become expected-value maximizers.
    Abstract This paper argues that training AI systems with absolute constraints -- which forbid certain acts irrespective of the amount of value they might produce -- may make considerable progress on many AI safety problems in principle. First, it provides a guardrail for avoiding the very worst outcomes of misalignment. Second, it could prevent AIs from causing catastrophes for the sake of very valuable consequences, such as replacing humans with a much larger number of beings living at a higher welfare level. Third, it makes systems more corrigible, allowing creators to make corrective interventions in them, such as altering their objective functions or shutting them down. And fourth, it helps systems explore their environment more safely by prohibiting them from exploring especially dangerous acts. I offer a decision-theoretic formalization of an absolute constraints, improving on existing models in the literature, and use this model to prove some results about the training and behavior of absolutist AIs. I conclude by showing that, although absolutist AIs will not maximize expected value, they will not be susceptible to behave irrationally, and they will not (contra coherence arguments) face environmental pressure to become expected-value maximizers.
    摘要 The paper offers a decision-theoretic formalization of absolute constraints, improving on existing models in the literature, and uses this model to prove results about the training and behavior of absolutist AIs. The author concludes that, although absolutist AIs will not maximize expected value, they will not behave irrationally, and they will not face environmental pressure to become expected-value maximizers.

Multi-Grained Multimodal Interaction Network for Entity Linking

  • paper_url: http://arxiv.org/abs/2307.09721
  • repo_url: https://github.com/pengfei-luo/mimic
  • paper_authors: Pengfei Luo, Tong Xu, Shiwei Wu, Chen Zhu, Linli Xu, Enhong Chen
  • for: To resolve ambiguous mentions against a multimodal knowledge graph, i.e., the multimodal entity linking (MEL) task.
  • methods: The paper proposes the Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework, comprising textual/visual encoders, three interaction units (TGLU, VDLU, and CMFU), and a unit-consistency objective function based on contrastive learning.
  • results: Experimental results show the solution outperforms various state-of-the-art baselines on three public benchmark datasets, and ablation studies verify the effectiveness of the designed modules.
    Abstract Multimodal entity linking (MEL) task, which aims at resolving ambiguous mentions to a multimodal knowledge graph, has attracted wide attention in recent years. Though large efforts have been made to explore the complementary effect among multiple modalities, however, they may fail to fully absorb the comprehensive expression of abbreviated textual context and implicit visual indication. Even worse, the inevitable noisy data may cause inconsistency of different modalities during the learning process, which severely degenerates the performance. To address the above issues, in this paper, we propose a novel Multi-GraIned Multimodal InteraCtion Network $\textbf{(MIMIC)}$ framework for solving the MEL task. Specifically, the unified inputs of mentions and entities are first encoded by textual/visual encoders separately, to extract global descriptive features and local detailed features. Then, to derive the similarity matching score for each mention-entity pair, we device three interaction units to comprehensively explore the intra-modal interaction and inter-modal fusion among features of entities and mentions. In particular, three modules, namely the Text-based Global-Local interaction Unit (TGLU), Vision-based DuaL interaction Unit (VDLU) and Cross-Modal Fusion-based interaction Unit (CMFU) are designed to capture and integrate the fine-grained representation lying in abbreviated text and implicit visual cues. Afterwards, we introduce a unit-consistency objective function via contrastive learning to avoid inconsistency and model degradation. Experimental results on three public benchmark datasets demonstrate that our solution outperforms various state-of-the-art baselines, and ablation studies verify the effectiveness of designed modules.

Two Tales of Platoon Intelligence for Autonomous Mobility Control: Enabling Deep Learning Recipes

  • paper_url: http://arxiv.org/abs/2307.09711
  • repo_url: None
  • paper_authors: Soohyun Park, Haemin Lee, Chanyoung Park, Soyi Jung, Minseok Choi, Joongheon Kim
  • for: To address autonomous mobility control and efficient resource management for autonomous vehicles and UAVs.
  • methods: Multi-agent reinforcement learning (MARL), represented by the CommNet algorithm, and the neural Myerson auction.
  • results: The surveyed approaches enable multiple agents to take distributed actions toward shared goals, while guaranteeing trustfulness among agents and achieving optimal revenue in highly dynamic systems.
    Abstract This paper presents the deep learning-based recent achievements to resolve the problem of autonomous mobility control and efficient resource management of autonomous vehicles and UAVs, i.e., (i) multi-agent reinforcement learning (MARL), and (ii) neural Myerson auction. Representatively, communication network (CommNet), which is one of the most popular MARL algorithms, is introduced to enable multiple agents to take actions in a distributed manner for their shared goals by training all agents' states and actions in a single neural network. Moreover, the neural Myerson auction guarantees trustfulness among multiple agents as well as achieves the optimal revenue of highly dynamic systems. Therefore, we survey the recent studies on autonomous mobility control based on MARL and neural Myerson auction. Furthermore, we emphasize that integration of MARL and neural Myerson auction is expected to be critical for efficient and trustful autonomous mobility services.

RaTE: a Reproducible automatic Taxonomy Evaluation by Filling the Gap

  • paper_url: http://arxiv.org/abs/2307.09706
  • repo_url: https://github.com/cestlucas/rate
  • paper_authors: Tianjian Gao, Phillipe Langlais
  • for: automatic taxonomy construction (ATC) algorithm evaluation
  • methods: using a large pre-trained language model for label-free taxonomy scoring
  • results: 1) RaTE correlates well with human judgments, 2) artificially degrading a taxonomy leads to decreasing RaTE score.
    Abstract Taxonomies are an essential knowledge representation, yet most studies on automatic taxonomy construction (ATC) resort to manual evaluation to score proposed algorithms. We argue that automatic taxonomy evaluation (ATE) is just as important as taxonomy construction. We propose RaTE, an automatic label-free taxonomy scoring procedure, which relies on a large pre-trained language model. We apply our evaluation procedure to three state-of-the-art ATC algorithms with which we built seven taxonomies from the Yelp domain, and show that 1) RaTE correlates well with human judgments and 2) artificially degrading a taxonomy leads to decreasing RaTE score.
    摘要 taxonomies是知识表示的重要组成部分,然而大多数自动taxonomy建构(ATC)研究仍然采用手动评估提议的算法。我们认为自动taxonomy评估(ATE)也非常重要。我们提出了一种名为RaTE的自动无标签分类方法,它基于大型预训练语言模型。我们对三种state-of-the-art ATC算法进行了七个Yelp领域的taxonomy建构,并显示了以下两点:1)RaTE与人类评价高度相关,2) искусственно降低一个分类会导致RaTE分数下降。

STRAPPER: Preference-based Reinforcement Learning via Self-training Augmentation and Peer Regularization

  • paper_url: http://arxiv.org/abs/2307.09692
  • repo_url: https://github.com/rll-research/bpref
  • paper_authors: Yachen Kang, Li He, Jinxin Liu, Zifeng Zhuang, Donglin Wang
  • for: This paper aims to learn a complex reward function with binary human preference using preference-based reinforcement learning (PbRL).
  • methods: The paper proposes a self-training method with a proposed peer regularization to overcome the similarity trap in PbRL, which improperly enhances the consistency possibility of the model's predictions between segment pairs and reduces confidence in reward learning.
  • results: The proposed approach is capable of learning a variety of locomotion and robotic manipulation behaviors well, using different semi-supervised alternatives and peer regularization.
    Abstract Preference-based reinforcement learning (PbRL) promises to learn a complex reward function with binary human preference. However, such a human-in-the-loop formulation requires considerable human effort to assign preference labels to segment pairs, hindering its large-scale applications. Recent approaches have tried to reuse unlabeled segments, which implicitly elucidates the distribution of segments and thereby alleviates the human effort. Consistency regularization has further been considered to improve the performance of semi-supervised learning. However, we notice that, unlike general classification tasks, in PbRL there exists a unique phenomenon that we define as the similarity trap in this paper. Intuitively, humans can have diametrically opposite preferences for similar segment pairs, but such similarity may cause consistency regularization to fail in PbRL. Due to the existence of the similarity trap, such consistency regularization improperly enhances the consistency possibility of the model's predictions between segment pairs, and thus reduces the confidence in reward learning, since the augmented distribution does not match the original one in PbRL. To overcome this issue, we present a self-training method along with our proposed peer regularization, which penalizes the reward model for memorizing uninformative labels and acquires confident predictions. Empirically, we demonstrate that our approach is capable of learning well a variety of locomotion and robotic manipulation behaviors using different semi-supervised alternatives and peer regularization.

Amazon-M2: A Multilingual Multi-locale Shopping Session Dataset for Recommendation and Text Generation

  • paper_url: http://arxiv.org/abs/2307.09688
  • repo_url: None
  • paper_authors: Wei Jin, Haitao Mao, Zheng Li, Haoming Jiang, Chen Luo, Hongzhi Wen, Haoyu Han, Hanqing Lu, Zhengyang Wang, Ruirui Li, Zhen Li, Monica Xiao Cheng, Rahul Goutam, Haiyang Zhang, Karthik Subbian, Suhang Wang, Yizhou Sun, Jiliang Tang, Bing Yin, Xianfeng Tang
  • for: This paper is written for those who are interested in improving customer shopping experiences and personalized recommendations in e-commerce.
  • methods: The paper uses a multilingual session dataset, named Amazon-M2, which consists of millions of user sessions from six different locales, to benchmark various algorithms and tasks related to next-product recommendation, next-product recommendation with domain shifts, and next-product title generation.
  • results: The paper introduces three tasks and benchmarks a range of algorithms on the proposed dataset, drawing new insights for further research and practice. Additionally, the paper hosts a competition in the KDD CUP 2023 and attracts thousands of users and submissions.
    Abstract Modeling customer shopping intentions is a crucial task for e-commerce, as it directly impacts user experience and engagement. Thus, accurately understanding customer preferences is essential for providing personalized recommendations. Session-based recommendation, which utilizes customer session data to predict their next interaction, has become increasingly popular. However, existing session datasets have limitations in terms of item attributes, user diversity, and dataset scale. As a result, they cannot comprehensively capture the spectrum of user behaviors and preferences. To bridge this gap, we present the Amazon Multilingual Multi-locale Shopping Session Dataset, namely Amazon-M2. It is the first multilingual dataset consisting of millions of user sessions from six different locales, where the major languages of products are English, German, Japanese, French, Italian, and Spanish. Remarkably, the dataset can help us enhance personalization and understanding of user preferences, which can benefit various existing tasks as well as enable new tasks. To test the potential of the dataset, we introduce three tasks in this work: (1) next-product recommendation, (2) next-product recommendation with domain shifts, and (3) next-product title generation. With the above tasks, we benchmark a range of algorithms on our proposed dataset, drawing new insights for further research and practice. In addition, based on the proposed dataset and tasks, we hosted a competition in the KDD CUP 2023 and have attracted thousands of users and submissions. The winning solutions and the associated workshop can be accessed at our website https://kddcup23.github.io/.

PubMed and Beyond: Biomedical Literature Search in the Age of Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2307.09683
  • repo_url: None
  • paper_authors: Qiao Jin, Robert Leaman, Zhiyong Lu
  • for: This paper aims to help readers efficiently fulfill their information needs in biomedicine by providing a comprehensive survey of literature search tools tailored to specific information needs.
  • methods: The paper examines widely used PubMed search engine and describes literature search tools catering to five specific information needs, including identifying high-quality clinical research, retrieving gene-related information, searching by meaning, locating related articles, and mining literature to discover associations between concepts.
  • results: The paper provides a comprehensive view of biomedical literature search functionalities with 36 publicly available tools, and offers practical considerations and best practices for choosing and using these tools. Additionally, the paper provides a perspective on the future of literature search engines considering recent breakthroughs in large language models such as ChatGPT.
    Abstract Biomedical research yields a wealth of information, much of which is only accessible through the literature. Consequently, literature search is an essential tool for building on prior knowledge in clinical and biomedical research. Although recent improvements in artificial intelligence have expanded functionality beyond keyword-based search, these advances may be unfamiliar to clinicians and researchers. In response, we present a survey of literature search tools tailored to both general and specific information needs in biomedicine, with the objective of helping readers efficiently fulfill their information needs. We first examine the widely used PubMed search engine, discussing recent improvements and continued challenges. We then describe literature search tools catering to five specific information needs: 1. Identifying high-quality clinical research for evidence-based medicine. 2. Retrieving gene-related information for precision medicine and genomics. 3. Searching by meaning, including natural language questions. 4. Locating related articles with literature recommendation. 5. Mining literature to discover associations between concepts such as diseases and genetic variants. Additionally, we cover practical considerations and best practices for choosing and using these tools. Finally, we provide a perspective on the future of literature search engines, considering recent breakthroughs in large language models such as ChatGPT. In summary, our survey provides a comprehensive view of biomedical literature search functionalities with 36 publicly available tools.

What’s meant by explainable model: A Scoping Review

  • paper_url: http://arxiv.org/abs/2307.09673
  • repo_url: None
  • paper_authors: Mallika Mainali, Rosina O Weber
  • for: This review investigates papers that apply AI models and adopt post-hoc explanation methods while referring to said models as explainable, and asks whether the authors evaluate those methods.
  • methods: A scoping review methodology is used to examine whether the surveyed papers conducted any evaluation of the XAI methods they employed.
  • results: 81% of the application papers that refer to their approaches as explainable models did not conduct any form of evaluation on the XAI method they used, suggesting that authors often assume incorporating a post-hoc XAI method suffices to characterize a model as explainable.
    Abstract We often see the term explainable in the titles of papers that describe applications based on artificial intelligence (AI). However, the literature in explainable artificial intelligence (XAI) indicates that explanations in XAI are application- and domain-specific, hence requiring evaluation whenever they are employed to explain a model that makes decisions for a specific application problem. Additionally, the literature reveals that the performance of post-hoc methods, particularly feature attribution methods, varies substantially hinting that they do not represent a solution to AI explainability. Therefore, when using XAI methods, the quality and suitability of their information outputs should be evaluated within the specific application. For these reasons, we used a scoping review methodology to investigate papers that apply AI models and adopt methods to generate post-hoc explanations while referring to said models as explainable. This paper investigates whether the term explainable model is adopted by authors under the assumption that incorporating a post-hoc XAI method suffices to characterize a model as explainable. To inspect this problem, our review analyzes whether these papers conducted evaluations. We found that 81% of the application papers that refer to their approaches as an explainable model do not conduct any form of evaluation on the XAI method they used.
    摘要 我们经常在基于人工智能(AI)的应用论文标题中看到“可解释”(explainable)一词。然而,可解释人工智能(XAI)文献表明,XAI中的解释是应用和领域特定的,因此每当用它们解释针对特定应用问题做决策的模型时,都需要进行评估。此外,文献还显示,事后方法(特别是特征归因方法)的性能差异很大,这暗示它们并非AI可解释性的现成解法。因此,在使用XAI方法时,应在具体应用中评估其信息输出的质量与适用性。基于这些原因,我们采用范围综述(scoping review)方法,调查那些应用AI模型、采用事后解释方法并将所述模型称为“可解释”的论文,考察作者是否默认加入事后XAI方法就足以将模型定性为可解释,并分析这些论文是否进行了评估。我们发现,81%将其方法称为“可解释模型”的应用论文没有对所用的XAI方法进行任何形式的评估。

Towards A Unified Agent with Foundation Models

  • paper_url: http://arxiv.org/abs/2307.09668
  • repo_url: None
  • paper_authors: Norman Di Palo, Arunkumar Byravan, Leonard Hasenclever, Markus Wulfmeier, Nicolas Heess, Martin Riedmiller
  • for: 这篇论文探讨如何利用语言模型与视觉语言模型的能力,提升强化学习智能体的探索与学习能力。
  • methods: 论文设计了一个以语言为核心推理工具的框架,用以应对一系列基础的强化学习挑战,如稀疏奖励下的高效探索、重用经验数据、调度技能以及从观察中学习。
  • results: 在一个稀疏奖励的模拟机器人堆叠环境中,该方法显著提升了探索效率和离线数据重用能力,且已学技能可被重用于新任务或模仿人类专家的视频。
    Abstract Language Models and Vision Language Models have recently demonstrated unprecedented capabilities in terms of understanding human intentions, reasoning, scene understanding, and planning-like behaviour, in text form, among many others. In this work, we investigate how to embed and leverage such abilities in Reinforcement Learning (RL) agents. We design a framework that uses language as the core reasoning tool, exploring how this enables an agent to tackle a series of fundamental RL challenges, such as efficient exploration, reusing experience data, scheduling skills, and learning from observations, which traditionally require separate, vertically designed algorithms. We test our method on a sparse-reward simulated robotic manipulation environment, where a robot needs to stack a set of objects. We demonstrate substantial performance improvements over baselines in exploration efficiency and ability to reuse data from offline datasets, and illustrate how to reuse learned skills to solve novel tasks or imitate videos of human experts.
    摘要 语言模型和视觉语言模型最近在理解人类意图、推理、场景理解和类规划行为(以文本形式)等方面展现出前所未有的能力。在这项工作中,我们研究如何在强化学习(RL)智能体中嵌入并利用这些能力。我们设计了一个以语言为核心推理工具的框架,探索这如何使智能体应对一系列基础RL挑战,例如高效探索、重用经验数据、调度技能以及从观察中学习——这些挑战传统上需要各自独立、纵向设计的算法。我们在一个稀疏奖励的模拟机器人操作环境中测试该方法,机器人需要将一组物体堆叠起来。我们展示了在探索效率和离线数据重用能力上对基线的显著提升,并说明如何重用已学技能来解决新任务或模仿人类专家的视频。

Anticipating Technical Expertise and Capability Evolution in Research Communities using Dynamic Graph Transformers

  • paper_url: http://arxiv.org/abs/2307.09665
  • repo_url: None
  • paper_authors: Sameera Horawalavithana, Ellyn Ayton, Anastasiya Usenko, Robin Cosbey, Svitlana Volkova
  • for: 这项研究旨在预测技术专长和能力的演化趋势,以便为国家和全球安全做好准备,这在核不扩散(NN)等安全关键领域和人工智能(AI)等快速兴起领域尤为重要。
  • methods: 研究扩展了传统的统计关系学习方法(如合作网络中的链接预测),将问题形式化为动态异构图表示,并开发了在科学家和机构等不同粒度上预测合作模式、作者行为与技术能力演化的新能力。
  • results: 所实现的动态图Transformer(DGT)神经网络模型能够在AI和NN两个领域预测合作、伙伴关系与专长的演化,性能比最佳静态图基线模型高出30-80%;当测试数据中出现未见节点时,DGT还能提升归纳任务的性能,例如准确预测AI领域中资深科学家与早期职业科学家之间的合作。
    Abstract The ability to anticipate technical expertise and capability evolution trends globally is essential for national and global security, especially in safety-critical domains like nuclear nonproliferation (NN) and rapidly emerging fields like artificial intelligence (AI). In this work, we extend traditional statistical relational learning approaches (e.g., link prediction in collaboration networks) and formulate a problem of anticipating technical expertise and capability evolution using dynamic heterogeneous graph representations. We develop novel capabilities to forecast collaboration patterns, authorship behavior, and technical capability evolution at different granularities (e.g., scientist and institution levels) in two distinct research fields. We implement a dynamic graph transformer (DGT) neural architecture, which pushes the state-of-the-art graph neural network models by (a) forecasting heterogeneous (rather than homogeneous) nodes and edges, and (b) relying on both discrete -- and continuous -- time inputs. We demonstrate that our DGT models predict collaboration, partnership, and expertise patterns with 0.26, 0.73, and 0.53 mean reciprocal rank values for AI and 0.48, 0.93, and 0.22 for NN domains. DGT model performance exceeds the best-performing static graph baseline models by 30-80% across AI and NN domains. Our findings demonstrate that DGT models boost inductive task performance, when previously unseen nodes appear in the test data, for the domains with emerging collaboration patterns (e.g., AI). Specifically, models accurately predict which established scientists will collaborate with early career scientists and vice-versa in the AI domain.
    摘要 能够预测全球技术专长与能力的演化趋势对国家和全球安全至关重要,尤其是在核不扩散(NN)等安全关键领域和人工智能(AI)等快速兴起的领域。在这项工作中,我们扩展了传统的统计关系学习方法(例如合作网络中的链接预测),并将预测技术专长与能力演化的问题形式化为动态异构图表示。我们开发了在两个不同研究领域、不同粒度(例如科学家和机构层面)上预测合作模式、作者行为和技术能力演化的新能力。我们实现了动态图Transformer(DGT)神经网络架构,它通过(a)预测异构(而非同构)节点和边,以及(b)同时利用离散和连续时间输入,推进了图神经网络模型的最新水平。我们的DGT模型在AI领域预测合作、伙伴关系和专长模式的平均倒数排名(MRR)分别为0.26、0.73和0.53,在NN领域为0.48、0.93和0.22。DGT模型的性能在AI和NN领域比表现最好的静态图基线模型高出30-80%。我们的结果表明,对于合作模式不断涌现的领域(如AI),当测试数据中出现此前未见的节点时,DGT模型能提升归纳任务的性能;具体而言,模型能够准确预测AI领域中哪些资深科学家将与早期职业科学家合作,反之亦然。
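
For reference, the scores quoted above are mean reciprocal rank (MRR) values; a minimal sketch of the metric (the candidate scores and target indices below are illustrative, not the paper's data):

```python
def mean_reciprocal_rank(scores, targets):
    """MRR over a batch of ranking queries.

    scores:  list of per-query candidate score lists
    targets: list of ints, index of the true candidate for each query
    """
    total = 0.0
    for cand_scores, true_idx in zip(scores, targets):
        # Rank of the true candidate = 1 + number of candidates scored higher.
        rank = 1 + sum(s > cand_scores[true_idx] for s in cand_scores)
        total += 1.0 / rank
    return total / len(targets)

# True candidate ranked 1st, then 2nd -> MRR = (1 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([[0.9, 0.1], [0.7, 0.4, 0.2]], [0, 1]))
```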

HAT-CL: A Hard-Attention-to-the-Task PyTorch Library for Continual Learning

  • paper_url: http://arxiv.org/abs/2307.09653
  • repo_url: https://github.com/xduan7/hat-cl
  • paper_authors: Xiaotian Duan
  • for: 这篇论文旨在解决 continual learning 中的忘记现象,即神经网络学习新任务时会忘记之前的知识。
  • methods: 论文基于HAT(Hard-Attention-to-the-Task)机制来缓解这一问题,但该机制的实际实现存在可用性与兼容性问题,且缺乏对现有网络重用的支持。
  • results: 本文提出了更易用、与PyTorch兼容的HAT机制重新设计,名为HAT-CL。HAT-CL不仅自动完成梯度操作,还能将PyTorch模块转换为HAT模块,并提供了一整套可无缝集成到现有架构的模块;此外,作者还为HAT引入了新的掩码操作技术,在多种实验中均表现出一致的改进。
    Abstract Catastrophic forgetting, the phenomenon in which a neural network loses previously obtained knowledge during the learning of new tasks, poses a significant challenge in continual learning. The Hard-Attention-to-the-Task (HAT) mechanism has shown potential in mitigating this problem, but its practical implementation has been complicated by issues of usability and compatibility, and a lack of support for existing network reuse. In this paper, we introduce HAT-CL, a user-friendly, PyTorch-compatible redesign of the HAT mechanism. HAT-CL not only automates gradient manipulation but also streamlines the transformation of PyTorch modules into HAT modules. It achieves this by providing a comprehensive suite of modules that can be seamlessly integrated into existing architectures. Additionally, HAT-CL offers ready-to-use HAT networks that are smoothly integrated with the TIMM library. Beyond the redesign and reimplementation of HAT, we also introduce novel mask manipulation techniques for HAT, which have consistently shown improvements across various experiments. Our work paves the way for a broader application of the HAT mechanism, opening up new possibilities in continual learning across diverse models and applications.
    摘要 灾难性遗忘指神经网络在学习新任务时丢失先前获得的知识,是持续学习中的一大挑战。HAT(Hard-Attention-to-the-Task)机制在缓解该问题上展现出潜力,但其实际实现受制于可用性和兼容性问题,并且缺乏对现有网络重用的支持。本文介绍HAT-CL,一个用户友好、与PyTorch兼容的HAT机制重新设计。HAT-CL不仅自动完成梯度操作,还通过提供一整套可无缝集成到现有架构的模块,简化了将PyTorch模块转换为HAT模块的过程;此外,HAT-CL还提供了与TIMM库平滑集成的即用型HAT网络。在重新设计和重新实现HAT之外,我们还引入了新的HAT掩码操作技术,在多种实验中均表现出一致的改进。我们的工作为HAT机制的更广泛应用铺平了道路,为跨多种模型和应用的持续学习开启了新的可能性。
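
The hard-attention idea itself is compact enough to sketch in PyTorch: a per-task embedding is pushed through a scaled sigmoid to produce a near-binary mask that gates a layer's units. This shows the general mechanism only and is not the HAT-CL API; the layer name and temperature value are illustrative:

```python
import torch
import torch.nn as nn

class HATLinear(nn.Module):
    """Minimal sketch of Hard-Attention-to-the-Task gating (not HAT-CL's API):
    each task owns a learned embedding per output unit; a scaled sigmoid turns
    it into a near-binary mask that gates the layer's outputs."""

    def __init__(self, in_features, out_features, num_tasks):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # One attention embedding per (task, output unit) pair.
        self.task_embed = nn.Embedding(num_tasks, out_features)

    def forward(self, x, task_id, s=400.0):
        # s is an annealing temperature; a large s makes the mask nearly binary.
        mask = torch.sigmoid(s * self.task_embed(torch.tensor(task_id)))
        return self.linear(x) * mask

layer = HATLinear(16, 8, num_tasks=3)
out = layer(torch.randn(4, 16), task_id=1)
print(out.shape)  # torch.Size([4, 8])
```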

VISER: A Tractable Solution Concept for Games with Information Asymmetry

  • paper_url: http://arxiv.org/abs/2307.09652
  • repo_url: https://github.com/jermcmahan/viser
  • paper_authors: Jeremy McMahan, Young Wu, Yudong Chen, Xiaojin Zhu, Qiaomin Xie
  • for: 这篇论文是为了解决受信息不均衡影响的现实世界游戏问题,例如安全游戏和多代理人学习。
  • methods: 这篇论文提出了一种新的解决方案,即VISER(受害者安全,攻击者最佳回应)。VISER使得外部观察者可以预测这些游戏的结果。
  • results: 作者表示,每个玩家的VISER策略都可以独立计算,并且可以用线性规划(LP)来计算。此外,作者还扩展了VISER到其Markov游戏对应的Markov-perfect版本,可以使用一系列LP来解决。
    Abstract Many real-world games suffer from information asymmetry: one player is only aware of their own payoffs while the other player has the full game information. Examples include the critical domain of security games and adversarial multi-agent reinforcement learning. Information asymmetry renders traditional solution concepts such as Strong Stackelberg Equilibrium (SSE) and Robust-Optimization Equilibrium (ROE) inoperative. We propose a novel solution concept called VISER (Victim Is Secure, Exploiter best-Responds). VISER enables an external observer to predict the outcome of such games. In particular, for security applications, VISER allows the victim to better defend itself while characterizing the most damaging attacks available to the attacker. We show that each player's VISER strategy can be computed independently in polynomial time using linear programming (LP). We also extend VISER to its Markov-perfect counterpart for Markov games, which can be solved efficiently using a series of LPs.
    摘要 许多现实世界中的博弈存在信息不对称:一方只了解自己的收益,而另一方掌握完整的博弈信息,例如安全博弈和对抗性多智能体强化学习。信息不对称使强Stackelberg均衡(SSE)和鲁棒优化均衡(ROE)等传统解概念失效。我们提出一种名为VISER(Victim Is Secure, Exploiter best-Responds)的新解概念,使外部观察者能够预测此类博弈的结果。特别地,在安全应用中,VISER使受害方能够更好地防御自身,同时刻画攻击者可用的最具破坏性的攻击。我们证明,每个玩家的VISER策略都可以用线性规划(LP)在多项式时间内独立计算。我们还将VISER推广到马尔可夫博弈的马尔可夫完美版本,可通过一系列LP高效求解。
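
The LP claim above is concrete: a security (maximin) strategy for one player is a classic linear program. Below is a minimal scipy sketch under the assumption that the "Victim Is Secure" component reduces to such a maximin computation; the payoff matrix is toy data, not from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def maximin_strategy(A):
    """Security (maximin) strategy for the row player of payoff matrix A.
    Solves: max_x min_j sum_i x_i * A[i, j], with x a probability vector."""
    m, n = A.shape
    # Variables: x (m entries) followed by the game value v.
    c = np.zeros(m + 1)
    c[-1] = -1.0                                  # linprog minimizes, so minimize -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])     # v - (A^T x)_j <= 0 for all columns j
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]])[None, :]  # probabilities sum to 1
    b_eq = [1.0]
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Toy 2x2 game (matching-pennies payoffs for the victim).
x, v = maximin_strategy(np.array([[1.0, -1.0], [-1.0, 1.0]]))
print(x, v)  # ~[0.5, 0.5], value ~0.0
```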

With Flying Colors: Predicting Community Success in Large-scale Collaborative Campaigns

  • paper_url: http://arxiv.org/abs/2307.09650
  • repo_url: None
  • paper_authors: Abraham Israeli, Oren Tsur
  • for: 这篇论文旨在研究在线社区推进其议程的有效性,具体以Reddit的r/place为背景考察社区的协作能力与成果。
  • methods: 论文提出了一个新任务,即预测Reddit的r/place中在线社区的成功水平,并试验了多种结合不同类型特征的混合模型。
  • results: 这些模型在所有“成功水平”定义下都显著优于所有基线模型,结果还为协调性活动成功的影响因素提供了洞见。
    Abstract Online communities develop unique characteristics, establish social norms, and exhibit distinct dynamics among their members. Activity in online communities often results in concrete ``off-line'' actions with a broad societal impact (e.g., political street protests and norms related to sexual misconduct). While community dynamics, information diffusion, and online collaborations have been widely studied in the past two decades, quantitative studies that measure the effectiveness of online communities in promoting their agenda are scarce. In this work, we study the correspondence between the effectiveness of a community, measured by its success level in a competitive online campaign, and the underlying dynamics between its members. To this end, we define a novel task: predicting the success level of online communities in Reddit's r/place - a large-scale distributed experiment that required collaboration between community members. We consider an array of definitions for success level; each is geared toward different aspects of collaborative achievement. We experiment with several hybrid models, combining various types of features. Our models significantly outperform all baseline models over all definitions of `success level'. Analysis of the results and the factors that contribute to the success of coordinated campaigns can provide a better understanding of the resilience or the vulnerability of communities to online social threats such as election interference or anti-science trends. We make all data used for this study publicly available for further research.
    摘要 在线社区会发展出独特的特征、建立社会规范,并在成员之间呈现出独特的动态。在线社区的活动往往会引发具有广泛社会影响的具体“线下”行动(例如政治街头抗议和与性骚扰相关的规范)。尽管过去二十年里社区动态、信息扩散和在线协作已被广泛研究,但衡量在线社区推进其议程有效性的定量研究仍然稀缺。在这项工作中,我们研究社区的有效性(以其在一场竞争性在线活动中的成功水平衡量)与其成员之间底层动态的对应关系。为此,我们定义了一个新任务:预测在线社区在Reddit的r/place(一项需要社区成员协作的大规模分布式实验)中的成功水平。我们考虑了一系列“成功水平”的定义,分别针对协作成就的不同方面,并试验了多种结合不同类型特征的混合模型。我们的模型在所有“成功水平”定义下都显著优于所有基线模型。对结果及协调性活动成功因素的分析,有助于更好地理解社区面对选举干预、反科学思潮等在线社会威胁时的韧性或脆弱性。我们公开了本研究使用的全部数据以供后续研究。

Promoting Exploration in Memory-Augmented Adam using Critical Momenta

  • paper_url: http://arxiv.org/abs/2307.09638
  • repo_url: https://github.com/chandar-lab/cmoptimizer
  • paper_authors: Pranshu Malviya, Gonçalo Mordido, Aristide Baratin, Reza Babanezhad Harikandeh, Jerry Huang, Simon Lacoste-Julien, Razvan Pascanu, Sarath Chandar
  • for: 本研究旨在提高大规模深度学习模型的训练过程中,使用 Adaptive 梯度下降法带来更好的性能。
  • methods: 本研究提出了一种记忆增强版的Adam优化器,通过在训练期间保留关键动量项的缓冲区来促进向更平坦极小值的探索。
  • results: 实验表明,使用该新的缓存版本的 Adam 优化器可以提高多种 Adam 变体在标准的语言模型和图像分类任务中的性能。
    Abstract Adaptive gradient-based optimizers, particularly Adam, have left their mark in training large-scale deep learning models. The strength of such optimizers is that they exhibit fast convergence while being more robust to hyperparameter choice. However, they often generalize worse than non-adaptive methods. Recent studies have tied this performance gap to flat minima selection: adaptive methods tend to find solutions in sharper basins of the loss landscape, which in turn hurts generalization. To overcome this issue, we propose a new memory-augmented version of Adam that promotes exploration towards flatter minima by using a buffer of critical momentum terms during training. Intuitively, the use of the buffer makes the optimizer overshoot outside the basin of attraction if it is not wide enough. We empirically show that our method improves the performance of several variants of Adam on standard supervised language modelling and image classification tasks.
    摘要 适应式梯度基本优化器,如 Adam,在训练大规模深度学习模型中留下了印记。这些优化器具有快速收敛的优点,同时也更加抗性于超参数选择。然而,它们通常的泛化性比非适应方法差。最近的研究表明,这种性能差距与平坦的最小值选择相关:适应方法通常会在loss函数的梯度图上找到更加锐角的解,这会导致泛化性下降。为了解决这个问题,我们提出了一种新的带缓存的 Adam 方法,通过在训练过程中使用缓存来保持权重的积累。这种方法可以使优化器在训练过程中更加勇敢地尝试不同的方向,从而增强泛化性。我们通过实验证明,我们的方法可以提高许多 Adam 变体在标准的supervised语言模型和图像分类任务中的性能。

Traffic-Domain Video Question Answering with Automatic Captioning

  • paper_url: http://arxiv.org/abs/2307.09636
  • repo_url: None
  • paper_authors: Ehsan Qasemi, Jonathan M. Francis, Alessandro Oltramari
  • for: 本研究旨在推动智能交通监测和交通运输系统领域内的高级机器推理能力。
  • methods: 本研究提出了一种名为“交通领域视频问答自动字幕”(TRIVIA)的新方法,作为弱监督技术,将城市交通场景知识注入大型视频语言模型。
  • results: 实验结果表明,TRIVIA方法使代表性视频语言模型的准确率相比基线设置提高了6.5个百分点(19.88%)。这一开创性方法有望推动交通相关应用的发展,鼓励研究者和实践者共同发掘视频语言模型的潜力。
    Abstract Video Question Answering (VidQA) exhibits remarkable potential in facilitating advanced machine reasoning capabilities within the domains of Intelligent Traffic Monitoring and Intelligent Transportation Systems. Nevertheless, the integration of urban traffic scene knowledge into VidQA systems has received limited attention in previous research endeavors. In this work, we present a novel approach termed Traffic-domain Video Question Answering with Automatic Captioning (TRIVIA), which serves as a weak-supervision technique for infusing traffic-domain knowledge into large video-language models. Empirical findings obtained from the SUTD-TrafficQA task highlight the substantial enhancements achieved by TRIVIA, elevating the accuracy of representative video-language models by a remarkable 6.5 points (19.88%) compared to baseline settings. This pioneering methodology holds great promise for driving advancements in the field, inspiring researchers and practitioners alike to unlock the full potential of emerging video-language models in traffic-related applications.
    摘要 视频问答(VidQA)在推动智能交通监测与智能交通系统领域的高级机器推理能力方面具有显著潜力。然而,以往研究对将城市交通场景知识整合进VidQA系统的关注有限。在这项工作中,我们提出了一种名为“交通领域视频问答自动字幕”(TRIVIA)的新方法,作为一种弱监督技术,将交通领域知识注入大型视频语言模型。在SUTD-TrafficQA任务上的实验结果表明,TRIVIA带来了显著提升,使代表性视频语言模型的准确率相比基线设置提高了6.5个百分点(19.88%)。这一开创性方法有望推动该领域的发展,激励研究者与实践者共同发掘新兴视频语言模型在交通相关应用中的全部潜力。

Transformer-based Dual-domain Network for Few-view Dedicated Cardiac SPECT Image Reconstructions

  • paper_url: http://arxiv.org/abs/2307.09624
  • repo_url: None
  • paper_authors: Huidong Xie, Bo Zhou, Xiongchao Chen, Xueqi Guo, Stephanie Thorn, Yi-Hwa Liu, Ge Wang, Albert Sinusas, Chi Liu
  • for: 这篇论文旨在提出一种基于Transformer的三维双域网络(TIP-Net),用于从静止少视角数据中重建高质量的三维心脏SPECT图像。
  • methods: 该方法首先通过定制的投影域到图像域Transformer,不经迭代重建过程直接从投影数据重建三维心脏SPECT图像;随后结合该重建输出与原始少视角重建结果,利用图像域重建网络进一步细化。
  • results: 经心导管造影图像、核心脏病学专家的诊断解读以及FDA 510(k)许可的临床软件验证,该方法在人体研究中相比以往基线方法获得了更高的心脏缺损对比度,有望在静止少视角专用心脏SPECT扫描仪上实现高质量的缺损可视化。
    Abstract Cardiovascular disease (CVD) is the leading cause of death worldwide, and myocardial perfusion imaging using SPECT has been widely used in the diagnosis of CVDs. The GE 530/570c dedicated cardiac SPECT scanners adopt a stationary geometry to simultaneously acquire 19 projections to increase sensitivity and achieve dynamic imaging. However, the limited amount of angular sampling negatively affects image quality. Deep learning methods can be implemented to produce higher-quality images from stationary data. This is essentially a few-view imaging problem. In this work, we propose a novel 3D transformer-based dual-domain network, called TIP-Net, for high-quality 3D cardiac SPECT image reconstructions. Our method aims to first reconstruct 3D cardiac SPECT images directly from projection data without the iterative reconstruction process by proposing a customized projection-to-image domain transformer. Then, given its reconstruction output and the original few-view reconstruction, we further refine the reconstruction using an image-domain reconstruction network. Validated by cardiac catheterization images, diagnostic interpretations from nuclear cardiologists, and defect size quantified by an FDA 510(k)-cleared clinical software, our method produced images with higher cardiac defect contrast on human studies compared with previous baseline methods, potentially enabling high-quality defect visualization using stationary few-view dedicated cardiac SPECT scanners.
    摘要 心血管疾病(CVD)是全球最主要的死亡原因,使用SPECT的心肌灌注显像已广泛用于CVD的诊断。GE 530/570c专用心脏SPECT扫描仪采用静止几何结构,同时采集19个投影以提高灵敏度并实现动态成像。然而,有限的角度采样会对图像质量产生负面影响。深度学习方法可用于从静止数据生成更高质量的图像,这本质上是一个少视角成像问题。在这项工作中,我们提出了一种新的基于Transformer的三维双域网络TIP-Net,用于高质量三维心脏SPECT图像重建。我们的方法首先提出一个定制的投影域到图像域Transformer,不经迭代重建过程直接从投影数据重建三维心脏SPECT图像;随后,结合该重建输出与原始少视角重建结果,再利用图像域重建网络进一步细化重建。经心导管造影图像、核心脏病学专家的诊断解读以及FDA 510(k)许可的临床软件量化的缺损大小验证,我们的方法在人体研究中相比以往基线方法生成了心脏缺损对比度更高的图像,有望在静止少视角专用心脏SPECT扫描仪上实现高质量的缺损可视化。

Gradient strikes back: How filtering out high frequencies improves explanations

  • paper_url: http://arxiv.org/abs/2307.09591
  • repo_url: None
  • paper_authors: Sabine Muzellec, Léo Andéol, Thomas Fel, Rufin VanRullen, Thomas Serre
  • for: 这篇论文比较了基于预测的归因方法与基于梯度的归因方法在解释深度神经网络决策上的差异,并探究两者表现差距的来源。
  • methods: 论文分析了三种有代表性的视觉分类模型的梯度,发现其中包含来自高频的噪声信息,并且CNN中的下采样操作似乎是这些高频成分的重要来源(暗示混叠可能是其根源)。
  • results: 研究发现,对归因图应用最优低通滤波可去除高频噪声,显著提升基于梯度的方法在多个模型上的可解释性得分,由此得到一个以基于梯度的方法居首的最新方法排名。
    Abstract Recent years have witnessed an explosion in the development of novel prediction-based attribution methods, which have slowly been supplanting older gradient-based methods to explain the decisions of deep neural networks. However, it is still not clear why prediction-based methods outperform gradient-based ones. Here, we start with an empirical observation: these two approaches yield attribution maps with very different power spectra, with gradient-based methods revealing more high-frequency content than prediction-based methods. This observation raises multiple questions: What is the source of this high-frequency information, and does it truly reflect decisions made by the system? Lastly, why would the absence of high-frequency information in prediction-based methods yield better explainability scores along multiple metrics? We analyze the gradient of three representative visual classification models and observe that it contains noisy information emanating from high-frequencies. Furthermore, our analysis reveals that the operations used in Convolutional Neural Networks (CNNs) for downsampling appear to be a significant source of this high-frequency content -- suggesting aliasing as a possible underlying basis. We then apply an optimal low-pass filter for attribution maps and demonstrate that it improves gradient-based attribution methods. We show that (i) removing high-frequency noise yields significant improvements in the explainability scores obtained with gradient-based methods across multiple models -- leading to (ii) a novel ranking of state-of-the-art methods with gradient-based methods at the top. We believe that our results will spur renewed interest in simpler and computationally more efficient gradient-based methods for explainability.
    摘要 近年来,基于预测的新型归因方法迅速发展,逐渐取代较早的基于梯度的方法来解释深度神经网络的决策。然而,基于预测的方法为何优于基于梯度的方法仍不清楚。我们从一个实证观察出发:两类方法生成的归因图功率谱差异很大,基于梯度的方法比基于预测的方法揭示出更多的高频成分。这一观察引出多个问题:这些高频信息来自哪里?它们是否真正反映了系统所做的决策?以及,为什么基于预测的方法缺少高频信息反而在多个指标上获得更好的可解释性得分?我们分析了三种有代表性的视觉分类模型的梯度,发现其中包含源自高频的噪声信息。进一步的分析表明,卷积神经网络(CNN)中用于下采样的操作似乎是这些高频成分的重要来源——暗示混叠可能是其根源。随后,我们对归因图应用最优低通滤波器,并证明它能改进基于梯度的归因方法。我们展示了:(i)去除高频噪声可在多个模型上显著提升基于梯度方法的可解释性得分,从而(ii)得到一个以基于梯度的方法居首的最新方法排名。我们相信这些结果将重新激发人们对更简单、计算上更高效的基于梯度的可解释性方法的兴趣。
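
The proposed fix is easy to prototype: remove the attribution map's high frequencies in the Fourier domain. A minimal NumPy sketch with an ideal (hard-cutoff) low-pass filter; the paper searches for an optimal cutoff, whereas the ratio below is a tunable assumption:

```python
import numpy as np

def lowpass_filter_attribution(attr, cutoff_ratio=0.1):
    """Low-pass filter a 2D attribution map: zero out all frequency bins
    beyond a radius proportional to cutoff_ratio, then invert the FFT."""
    f = np.fft.fftshift(np.fft.fft2(attr))
    h, w = attr.shape
    yy, xx = np.ogrid[:h, :w]
    # Distance of each frequency bin from the centre of the shifted spectrum.
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    cutoff = cutoff_ratio * min(h, w) / 2
    f[dist > cutoff] = 0.0
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))

noisy_map = np.random.rand(64, 64)          # stand-in for a gradient saliency map
smooth_map = lowpass_filter_attribution(noisy_map)
print(smooth_map.shape)                     # (64, 64)
```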

Automating Wood Species Detection and Classification in Microscopic Images of Fibrous Materials with Deep Learning

  • paper_url: http://arxiv.org/abs/2307.09588
  • repo_url: None
  • paper_authors: Lars Nieradzik, Jördis Sieburg-Rockel, Stephanie Helmling, Janis Keuper, Thomas Weibel, Andrea Olbrich, Henrike Stephani
  • for: 这篇论文的目的是提出一种系统化生成大规模浸解木材参考图像数据集的方法,以便利用深度学习在纤维材料显微图像中自动识别阔叶树种。
  • methods: 该方法包含一个便于标注导管分子的灵活流水线,并比较了不同神经网络架构和超参数的性能。
  • results: 该方法的表现与人类专家相当;未来这将改善对全球木纤维产品流通的管控,以保护森林。
    Abstract We have developed a methodology for the systematic generation of a large image dataset of macerated wood references, which we used to generate image data for nine hardwood genera. This is the basis for a substantial approach to automate, for the first time, the identification of hardwood species in microscopic images of fibrous materials by deep learning. Our methodology includes a flexible pipeline for easy annotation of vessel elements. We compare the performance of different neural network architectures and hyperparameters. Our proposed method performs similarly well to human experts. In the future, this will improve controls on global wood fiber product flows to protect forests.
    摘要 我们开发了一套系统化生成大规模浸解木材参考图像数据集的方法,并用它为九个阔叶树属生成了图像数据。这为首次利用深度学习在纤维材料显微图像中自动识别阔叶树种的实质性方案奠定了基础。我们的方法包含一个便于标注导管分子的灵活流水线。我们比较了不同神经网络架构和超参数的性能。我们提出的方法表现与人类专家相当。未来,这将改善对全球木纤维产品流通的管控,以保护森林。

Overthinking the Truth: Understanding how Language Models Process False Demonstrations

  • paper_url: http://arxiv.org/abs/2307.09476
  • repo_url: https://github.com/dannyallover/overthinking_the_truth
  • paper_authors: Danny Halawi, Jean-Stanislas Denain, Jacob Steinhardt
  • for: 这篇论文研究了模型在少样本学习中模仿上下文内错误或有害内容的现象,并提出两种相关现象:过度思考与错误归纳头。
  • methods: 研究基于模型的内部表示,通过从中间层解码预测来分析模型行为。
  • results: 研究发现,在给定错误示例时,模型行为会在某个“关键层”之后急剧偏离,出现过度思考现象;这一现象可能归因于错误归纳头的存在,消融这些头可以减少过度思考。
    Abstract Modern language models can imitate complex patterns through few-shot learning, enabling them to complete challenging tasks without fine-tuning. However, imitation can also lead models to reproduce inaccuracies or harmful content if present in the context. We study harmful imitation through the lens of a model's internal representations, and identify two related phenomena: overthinking and false induction heads. The first phenomenon, overthinking, appears when we decode predictions from intermediate layers, given correct vs. incorrect few-shot demonstrations. At early layers, both demonstrations induce similar model behavior, but the behavior diverges sharply at some "critical layer", after which the accuracy given incorrect demonstrations progressively decreases. The second phenomenon, false induction heads, are a possible mechanistic cause of overthinking: these are heads in late layers that attend to and copy false information from previous demonstrations, and whose ablation reduces overthinking. Beyond scientific understanding, our results suggest that studying intermediate model computations could be a promising avenue for understanding and guarding against harmful model behaviors.
    摘要 现代语言模型可以通过少样本学习模仿复杂的模式,从而无需微调即可完成有挑战性的任务。然而,模仿也可能使模型复现上下文中存在的错误或有害内容。我们从模型内部表示的角度研究有害模仿,并识别出两种相关现象:过度思考与错误归纳头。第一种现象“过度思考”出现在给定正确与错误的少样本示例、从中间层解码预测的情形:在较早的层,两种示例诱发相似的模型行为,但在某个“关键层”之后行为急剧分化,此后给定错误示例时的准确率逐步下降。第二种现象“错误归纳头”可能是过度思考的机制性原因:它们是位于较后层的注意力头,会关注并复制先前示例中的错误信息,而消融这些头可以减少过度思考。除了科学理解之外,我们的结果表明,研究模型的中间计算可能是理解并防范有害模型行为的一条有前景的途径。
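
Decoding predictions from intermediate layers, as the paper does, is commonly prototyped with the "logit lens": apply the model's final layer norm and unembedding to each hidden state. A sketch with Hugging Face transformers on GPT-2 (the model choice and direct reuse of the final unembedding are our simplifications, not the paper's setup):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Decode the last position of every layer's hidden state with the final
# layer norm and unembedding matrix (the "logit lens" simplification).
for layer_idx, h in enumerate(out.hidden_states):
    h_last = model.transformer.ln_f(h[:, -1])
    logits = model.lm_head(h_last)
    token = tokenizer.decode(logits.argmax(-1))
    print(f"layer {layer_idx:2d}: {token!r}")
```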

Unsupervised Conditional Slot Attention for Object Centric Learning

  • paper_url: http://arxiv.org/abs/2307.09437
  • repo_url: None
  • paper_authors: Avinash Kori, Francesco Locatello, Francesca Toni, Ben Glocker
  • for: 本研究旨在以无监督方式学习对象级表示,以更好地服务下游推理任务。
  • methods: 我们提出了一种无监督条件槽注意力方法,使用新型的概率槽字典(PSD):其中每个条目以抽象的对象级属性向量为键,以参数化高斯分布为对应的值。
  • results: 我们的方法在对象发现、组合式场景生成和组合式视觉推理等多个下游任务中带来了场景组合能力和少样本适应能力的显著提升;在对象发现任务中与槽注意力相当或更好,在另外两个任务中表现更优。
    Abstract Extracting object-level representations for downstream reasoning tasks is an emerging area in AI. Learning object-centric representations in an unsupervised setting presents multiple challenges, a key one being binding an arbitrary number of object instances to a specialized object slot. Recent object-centric representation methods like Slot Attention utilize iterative attention to learn composable representations with dynamic inference level binding but fail to achieve specialized slot level binding. To address this, in this paper we propose Unsupervised Conditional Slot Attention using a novel Probabilistic Slot Dictionary (PSD). We define PSD with (i) abstract object-level property vectors as key and (ii) parametric Gaussian distribution as its corresponding value. We demonstrate the benefits of the learnt specific object-level conditioning distributions in multiple downstream tasks, namely object discovery, compositional scene generation, and compositional visual reasoning. We show that our method provides scene composition capabilities and a significant boost in a few shot adaptability tasks of compositional visual reasoning, while performing similarly or better than slot attention in object discovery tasks
    摘要 为下游推理任务提取对象级表示是人工智能中一个新兴的方向。在无监督条件下学习以对象为中心的表示面临多重挑战,其中的关键之一是将任意数量的对象实例绑定到专门的对象槽上。近期的对象中心表示方法(如槽注意力)利用迭代注意力学习可组合的表示,实现了动态推理级别的绑定,但未能实现专门的槽级绑定。为此,本文提出了基于新型概率槽字典(PSD)的无监督条件槽注意力。我们定义的PSD以(i)抽象的对象级属性向量为键,(ii)参数化高斯分布为其对应的值。我们在多个下游任务(对象发现、组合式场景生成和组合式视觉推理)中展示了所学的特定对象级条件分布带来的收益。结果表明,我们的方法具备场景组合能力,并在组合式视觉推理的少样本适应任务中带来显著提升,同时在对象发现任务中表现与槽注意力相当或更好。
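
The probabilistic slot dictionary can be pictured as a table of keys paired with Gaussians whose samples initialize slots; a small PyTorch sketch under assumed dimensions (not the paper's exact parameterization):

```python
import torch

class ProbabilisticSlotDictionary(torch.nn.Module):
    """Sketch of a probabilistic slot dictionary: each entry pairs an abstract
    object-level key with a parametric Gaussian whose samples initialize slots.
    Dimensions and the sampling rule are illustrative assumptions."""

    def __init__(self, num_entries, slot_dim):
        super().__init__()
        self.keys = torch.nn.Parameter(torch.randn(num_entries, slot_dim))
        self.mu = torch.nn.Parameter(torch.zeros(num_entries, slot_dim))
        self.log_sigma = torch.nn.Parameter(torch.zeros(num_entries, slot_dim))

    def sample(self, entry_ids):
        """Draw slot initializations from the Gaussians of the chosen entries
        (reparameterized so gradients flow to mu and sigma)."""
        mu = self.mu[entry_ids]
        sigma = self.log_sigma[entry_ids].exp()
        return mu + sigma * torch.randn_like(mu)

psd = ProbabilisticSlotDictionary(num_entries=10, slot_dim=64)
slots = psd.sample(torch.tensor([0, 3, 7]))   # three conditioned slots
print(slots.shape)                            # torch.Size([3, 64])
```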

SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs

  • paper_url: http://arxiv.org/abs/2307.09435
  • repo_url: https://github.com/yl4579/SLMGAN
  • paper_authors: Yinghao Aaron Li, Cong Han, Nima Mesgarani
  • for: 这篇论文旨在提出一种基于大规模预训练语音语言模型(SLM)表示、采用生成对抗网络(GAN)框架的无监督零样本语音转换系统。
  • methods: 论文在StarGANv2-VC的基础上,于基于梅尔谱的判别器之上添加了新的基于SLM的WavLM判别器,并设计了SLM特征匹配损失函数,从而无需文本标注即可训练。
  • results: 与现有最先进的零样本语音转换模型相比,SLMGAN在自然度上表现更优,并取得相当的相似度。
    Abstract In recent years, large-scale pre-trained speech language models (SLMs) have demonstrated remarkable advancements in various generative speech modeling applications, such as text-to-speech synthesis, voice conversion, and speech enhancement. These applications typically involve mapping text or speech inputs to pre-trained SLM representations, from which target speech is decoded. This paper introduces a new approach, SLMGAN, to leverage SLM representations for discriminative tasks within the generative adversarial network (GAN) framework, specifically for voice conversion. Building upon StarGANv2-VC, we add our novel SLM-based WavLM discriminators on top of the mel-based discriminators along with our newly designed SLM feature matching loss function, resulting in an unsupervised zero-shot voice conversion system that does not require text labels during training. Subjective evaluation results show that SLMGAN outperforms existing state-of-the-art zero-shot voice conversion models in terms of naturalness and achieves comparable similarity, highlighting the potential of SLM-based discriminators for related applications.
    摘要 近年来,大规模预训练语音语言模型(SLM)在文本到语音合成、语音转换和语音增强等多种生成式语音建模应用中取得了显著进展。这些应用通常将文本或语音输入映射到预训练SLM表示,再从中解码目标语音。本文提出一种新方法SLMGAN,在生成对抗网络(GAN)框架内利用SLM表示完成判别任务,特别是语音转换。在StarGANv2-VC的基础上,我们在基于梅尔谱的判别器之上加入了新的基于SLM的WavLM判别器,以及新设计的SLM特征匹配损失函数,从而得到一个训练期间无需文本标注的无监督零样本语音转换系统。主观评估结果表明,SLMGAN在自然度上优于现有最先进的零样本语音转换模型,并取得相当的相似度,凸显了基于SLM的判别器在相关应用中的潜力。
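
The feature-matching term is a standard GAN ingredient: an L1 distance between the discriminator's intermediate features for real and generated audio. A minimal PyTorch sketch of that general form (layer count and shapes are illustrative; the paper's exact loss may differ):

```python
import torch

def feature_matching_loss(real_feats, fake_feats):
    """Average L1 feature-matching loss: match intermediate discriminator
    features of generated speech to those of real speech, layer by layer."""
    return sum(torch.nn.functional.l1_loss(f, r)
               for f, r in zip(fake_feats, real_feats)) / len(real_feats)

# Toy features from three discriminator layers (batch of 2 utterances).
real = [torch.randn(2, 16), torch.randn(2, 32), torch.randn(2, 64)]
fake = [torch.randn(2, 16), torch.randn(2, 32), torch.randn(2, 64)]
print(feature_matching_loss(real, fake))
```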

Balancing Privacy and Progress in Artificial Intelligence: Anonymization in Histopathology for Biomedical Research and Education

  • paper_url: http://arxiv.org/abs/2307.09426
  • repo_url: None
  • paper_authors: Neel Kanwal, Emiel A. M. Janssen, Kjersti Engan
  • for: 这篇论文旨在探讨组织病理学背景下医疗数据共享的法律法规与术语,重点关注生物信息学研究中隐私与进步之间的平衡。
  • methods: 论文回顾了现有的医疗数据共享方法,并从组织病理学视角强调其中的挑战,包括数据链接攻击的风险以及数字病理学缺乏标准化的问题。
  • results: 论文提出了一份组织学数据共享指南,在应对隐私与数据可用性挑战的同时,促进多学科研究与教育。
    Abstract The advancement of biomedical research heavily relies on access to large amounts of medical data. In the case of histopathology, Whole Slide Images (WSI) and clinicopathological information are valuable for developing Artificial Intelligence (AI) algorithms for Digital Pathology (DP). Transferring medical data "as open as possible" enhances the usability of the data for secondary purposes but poses a risk to patient privacy. At the same time, existing regulations push towards keeping medical data "as closed as necessary" to avoid re-identification risks. Generally, these legal regulations require the removal of sensitive data but do not consider the possibility of data linkage attacks due to modern image-matching algorithms. In addition, the lack of standardization in DP makes it harder to establish a single solution for all formats of WSIs. These challenges raise problems for bio-informatics researchers in balancing privacy and progress while developing AI algorithms. This paper explores the legal regulations and terminologies for medical data-sharing. We review existing approaches and highlight challenges from the histopathological perspective. We also present a data-sharing guideline for histological data to foster multidisciplinary research and education.
    摘要 生物医学研究的进步在很大程度上依赖于对大量医疗数据的获取。在组织病理学中,全切片图像(WSI)和临床病理信息对于开发数字病理(DP)的人工智能(AI)算法非常宝贵。以“尽可能开放”的方式传输医疗数据能提升数据在二次用途中的可用性,但会给患者隐私带来风险;与此同时,现行法规又要求医疗数据“按必要程度封闭”,以避免重识别风险。这些法规通常要求移除敏感数据,却没有考虑现代图像匹配算法可能带来的数据链接攻击。此外,DP缺乏标准化,使得针对各种WSI格式建立统一解决方案更加困难。这些挑战使生物信息学研究者在开发AI算法时难以在隐私与进步之间取得平衡。本文探讨了医疗数据共享的法律法规与术语,回顾了现有方法并从组织病理学视角强调了其中的挑战,同时提出了一份组织学数据共享指南,以促进多学科研究与教育。

Scaling Laws for Imitation Learning in NetHack

  • paper_url: http://arxiv.org/abs/2307.09423
  • repo_url: None
  • paper_authors: Jens Tuyls, Dhruv Madeka, Kari Torkkola, Dean Foster, Karthik Narasimhan, Sham Kakade
  • for: 这篇论文旨在研究扩大模型与数据规模如何帮助模仿学习更好地恢复人类专家的行为。
  • methods: 论文采用模仿学习(Imitation Learning,IL)方法,系统地扩大模型和数据集的规模,并在NetHack环境中拟合计算最优IL智能体的幂律。
  • results: 研究发现,扩大模型和数据规模可使IL智能体更好地学习专家行为,训练出的NetHack智能体在所有设定下以至少2倍的优势超越此前的最先进水平。
    Abstract Imitation Learning (IL) is one of the most widely used methods in machine learning. Yet, while powerful, many works find it is often not able to fully recover the underlying expert behavior. However, none of these works deeply investigate the role of scaling up the model and data size. Inspired by recent work in Natural Language Processing (NLP) where "scaling up" has resulted in increasingly more capable LLMs, we investigate whether carefully scaling up model and data size can bring similar improvements in the imitation learning setting. To demonstrate our findings, we focus on the game of NetHack, a challenging environment featuring procedural generation, stochasticity, long-term dependencies, and partial observability. We find IL loss and mean return scale smoothly with the compute budget and are strongly correlated, resulting in power laws for training compute-optimal IL agents with respect to model size and number of samples. We forecast and train several NetHack agents with IL and find they outperform prior state-of-the-art by at least 2x in all settings. Our work both demonstrates the scaling behavior of imitation learning in a challenging domain, as well as the viability of scaling up current approaches for increasingly capable agents in NetHack, a game that remains elusively hard for current AI systems.
    摘要 模仿学习(IL)是机器学习中使用最广泛的方法之一。然而,尽管强大,许多研究发现它往往无法完全恢复底层的专家行为,而这些研究都没有深入考察扩大模型与数据规模的作用。受自然语言处理(NLP)领域中“规模化”带来能力日益强大的大语言模型这一近期进展的启发,我们研究在模仿学习设定下,仔细扩大模型和数据规模是否能带来类似的改进。为验证我们的发现,我们聚焦于NetHack——一个具有程序化生成、随机性、长期依赖和部分可观测性的挑战性环境。我们发现IL损失和平均回报随计算预算平滑缩放且高度相关,从而得到了关于模型规模和样本数量的计算最优IL智能体训练的幂律。我们据此预测并训练了多个NetHack智能体,发现它们在所有设定下都以至少2倍的优势超越此前的最先进水平。我们的工作既展示了模仿学习在挑战性领域中的缩放行为,也证明了扩大现有方法以在NetHack中构建能力日益强大的智能体的可行性——这一游戏对当前AI系统而言仍然难以攻克。
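
Fitting a power law of loss or return against compute, as reported above, usually amounts to a linear fit in log-log space; a minimal sketch with toy numbers (not the paper's data):

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y = a * x**b by linear regression in log-log space; returns (a, b).
    x and y here would be e.g. compute budgets and IL losses."""
    b, log_a = np.polyfit(np.log(x), np.log(y), deg=1)
    return np.exp(log_a), b

# Toy data following y = 2 * x^-0.5 (illustrative, not the paper's numbers).
x = np.array([1e15, 1e16, 1e17, 1e18])
y = 2.0 * x ** -0.5
a, b = fit_power_law(x, y)
print(a, b)  # ~2.0, ~-0.5
```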

cs.CL - 2023-07-19

Android in the Wild: A Large-Scale Dataset for Android Device Control

  • paper_url: http://arxiv.org/abs/2307.10088
  • repo_url: https://github.com/google-research/google-research
  • paper_authors: Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, Timothy Lillicrap
  • for: 这篇论文旨在提供一个大规模的设备控制数据集,用于研究设备控制系统的语言理解与视觉上下文推断。
  • methods: 数据集以人类演示描述设备交互,包括屏幕、操作及相应的自然语言指令;共含71.5万个回合、3万条不同指令,覆盖四个Android版本(v10-13)和八种屏幕分辨率各异的设备(Pixel 2 XL到Pixel 6)。
  • results: 论文报告了两个智能体在数据集上的性能,并提出一项新挑战:必须从视觉外观推断可用操作,且动作空间由精确手势组成(例如通过水平滑动操作轮播组件),而非简单的基于UI元素的动作。
    Abstract There is a growing interest in device-control systems that can interpret human natural language instructions and execute them on a digital device by directly controlling its user interface. We present a dataset for device-control research, Android in the Wild (AITW), which is orders of magnitude larger than current datasets. The dataset contains human demonstrations of device interactions, including the screens and actions, and corresponding natural language instructions. It consists of 715k episodes spanning 30k unique instructions, four versions of Android (v10-13),and eight device types (Pixel 2 XL to Pixel 6) with varying screen resolutions. It contains multi-step tasks that require semantic understanding of language and visual context. This dataset poses a new challenge: actions available through the user interface must be inferred from their visual appearance. And, instead of simple UI element-based actions, the action space consists of precise gestures (e.g., horizontal scrolls to operate carousel widgets). We organize our dataset to encourage robustness analysis of device-control systems, i.e., how well a system performs in the presence of new task descriptions, new applications, or new platform versions. We develop two agents and report performance across the dataset. The dataset is available at https://github.com/google-research/google-research/tree/master/android_in_the_wild.
    摘要 人们对能够理解人类自然语言指令并通过直接控制用户界面在数字设备上执行指令的设备控制系统的兴趣日益增长。我们发布了一个用于设备控制研究的数据集Android in the Wild(AITW),其规模比现有数据集大若干个数量级。该数据集包含人类演示的设备交互(包括屏幕与操作)及相应的自然语言指令,共71.5万个回合、3万条不同指令,覆盖四个Android版本(v10-13)和八种不同屏幕分辨率的设备(从Pixel 2 XL到Pixel 6)。它包含需要对语言和视觉上下文进行语义理解的多步任务。该数据集带来了新的挑战:必须从视觉外观推断用户界面所提供的操作;并且动作空间不再是简单的基于UI元素的动作,而是由精确的手势构成(例如通过水平滑动操作轮播组件)。我们对数据集进行了组织,以鼓励对设备控制系统的鲁棒性分析,即系统在面对新任务描述、新应用或新平台版本时的表现。我们开发了两个智能体,并报告了它们在整个数据集上的性能。数据集可在 https://github.com/google-research/google-research/tree/master/android_in_the_wild 获取。

Generating Mathematical Derivations with Large Language Models

  • paper_url: http://arxiv.org/abs/2307.09998
  • repo_url: https://github.com/jmeadows17/deriving-equations-with-llms
  • paper_authors: Jordan Meadows, Marco Valentino, Andre Freitas
  • for: 本研究旨在利用大语言模型(LLM)在专门领域推导数学结果,以探索模型的局限并有望辅助数学发现。
  • methods: 我们利用符号引擎大规模生成方程推导,对GPT采用上下文学习,并微调一系列T5模型,以比较不同预训练策略在专门模型上的鲁棒性与泛化能力。
  • results: 实验显示,微调后的FLAN-T5-large(MathT5)在所有静态与分布外测试集的常规评分上均优于GPT模型;但深入分析表明,微调模型对未见符号以及(在较小程度上)方程结构的变化高度敏感。我们还通过分析1.7K个方程和200余条推导,归纳出纳入错误、无关和冗余方程等常见推理错误。最后,我们发现现有评估指标不能充分评价生成数学文本的质量。
    Abstract The derivation of mathematical results in specialised fields, using Large Language Models (LLMs), is an emerging research direction that can help identify models' limitations, and potentially support mathematical discovery. In this paper, we leverage a symbolic engine to generate derivations of equations at scale, and investigate the capabilities of LLMs when deriving goal equations from premises. Specifically, we employ in-context learning for GPT and fine-tune a range of T5 models to compare the robustness and generalisation of pre-training strategies to specialised models. Empirical results show that fine-tuned FLAN-T5-large (MathT5) outperforms GPT models on all static and out-of-distribution test sets in conventional scores. However, an in-depth analysis reveals that the fine-tuned models are more sensitive to perturbations involving unseen symbols and (to a lesser extent) changes to equation structure. In addition, we analyse 1.7K equations, and over 200 derivations, to highlight common reasoning errors such as the inclusion of incorrect, irrelevant, and redundant equations. Finally, we explore the suitability of existing metrics for evaluating mathematical derivations and find evidence that, while they can capture general properties such as sensitivity to perturbations, they fail to highlight fine-grained reasoning errors and essential differences between models. Overall, this work demonstrates that training models on synthetic data may improve their math capabilities beyond much larger LLMs, but current metrics are not appropriately assessing the quality of generated mathematical text.
    摘要 利用大语言模型(LLM)在专门领域推导数学结果,是一个新兴的研究方向,有助于识别模型的局限,并可能辅助数学发现。在本文中,我们利用符号引擎大规模生成方程推导,并考察LLM从前提推导目标方程的能力。具体而言,我们对GPT采用上下文学习,并微调一系列T5模型,以比较预训练策略相对于专门化模型的鲁棒性与泛化能力。实验结果显示,微调后的FLAN-T5-large(MathT5)在所有静态与分布外测试集的常规评分上均优于GPT模型。然而,深入分析表明,微调模型对涉及未见符号的扰动更加敏感,对方程结构的变化也(在较小程度上)更加敏感。此外,我们分析了1.7K个方程和200余条推导,归纳出纳入错误、无关和冗余方程等常见推理错误。最后,我们考察了现有指标对数学推导评估的适用性,发现它们虽能捕捉对扰动的敏感性等一般性质,却无法凸显细粒度的推理错误和模型之间的本质差异。总体而言,这项工作表明,在合成数据上训练模型或许能使其数学能力超越规模大得多的LLM,但现有指标并不能恰当地评估生成数学文本的质量。

GUIDO: A Hybrid Approach to Guideline Discovery & Ordering from Natural Language Texts

  • paper_url: http://arxiv.org/abs/2307.09959
  • repo_url: https://github.com/nils-freyer/guido
  • paper_authors: Nils Freyer, Dustin Thewes, Matthias Meinecke
  • for: 从文本描述中抽取工作流网,以简化指南或形式化业务流程、算法等过程的文本描述
  • methods: 使用基于BERT的句子分类器判定句子与流程模型的相关性,并利用依存句法分析从相关句子中抽取流程模型
  • results: GUIDO方法能很好地抽取流程模型,平均行为相似度得分达到0.93;与纯机器学习方法相比,其标注成本较低。
    Abstract Extracting workflow nets from textual descriptions can be used to simplify guidelines or formalize textual descriptions of formal processes like business processes and algorithms. The task of manually extracting processes, however, requires domain expertise and effort. While automatic process model extraction is desirable, annotating texts with formalized process models is expensive. Therefore, there are only a few machine-learning-based extraction approaches. Rule-based approaches, in turn, require domain specificity to work well and can rarely distinguish relevant and irrelevant information in textual descriptions. In this paper, we present GUIDO, a hybrid approach to the process model extraction task that first, classifies sentences regarding their relevance to the process model, using a BERT-based sentence classifier, and second, extracts a process model from the sentences classified as relevant, using dependency parsing. The presented approach achieves significantly better results than a pure rule-based approach. GUIDO achieves an average behavioral similarity score of $0.93$. Still, in comparison to purely machine-learning-based approaches, the annotation costs stay low.
    摘要 从文本描述中抽取工作流网,可用于简化指南,或将业务流程、算法等形式化过程的文本描述形式化。然而,人工抽取流程需要领域专业知识和大量精力;虽然自动流程模型抽取是人们所期望的,但为文本标注形式化流程模型的成本很高,因此基于机器学习的抽取方法寥寥无几。而基于规则的方法需要领域特定性才能表现良好,且很难区分文本描述中的相关与无关信息。本文提出GUIDO,一种流程模型抽取任务的混合方法:首先使用基于BERT的句子分类器判定句子与流程模型的相关性,然后利用依存句法分析从被判定为相关的句子中抽取流程模型。该方法取得了显著优于纯规则方法的结果,平均行为相似度得分达到0.93;同时,与纯机器学习方法相比,其标注成本仍然较低。
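
The dependency-parsing step can be sketched with spaCy: treat verb-object pairs in relevant sentences as candidate process activities. This mirrors the general approach, not GUIDO's exact extraction rules:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def candidate_activities(sentence):
    """Extract (verb, object) pairs as candidate process activities via
    dependency parsing -- a rough stand-in for GUIDO's extraction rules."""
    doc = nlp(sentence)
    pairs = []
    for token in doc:
        if token.pos_ == "VERB":
            for child in token.children:
                if child.dep_ == "dobj":          # direct object of the verb
                    pairs.append((token.lemma_, child.lemma_))
    return pairs

print(candidate_activities("The clerk checks the invoice and then archives the file."))
# e.g. [('check', 'invoice'), ('archive', 'file')]
```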

Large Language Models can accomplish Business Process Management Tasks

  • paper_url: http://arxiv.org/abs/2307.09923
  • repo_url: None
  • paper_authors: Michael Grohs, Luka Abb, Nourhan Elsayed, Jana-Rebecca Rehse
  • for: 本研究旨在探讨自然语言处理技术如何应用于商业流程管理(BPM)中,以提高组织活动的效率和成果。
  • methods: 本研究使用大语言模型(LLM)解决三个示例性BPM任务:从文本描述中挖掘命令式过程模型、从文本描述中挖掘声明式过程模型,以及根据文本描述评估过程任务对机器人流程自动化(RPA)的适用性。
  • results: 研究表明,无需大量配置或提示工程,LLM即可与现有解决方案表现相当或更好;论文还讨论了其对未来BPM研究及实际应用的启示。
    Abstract Business Process Management (BPM) aims to improve organizational activities and their outcomes by managing the underlying processes. To achieve this, it is often necessary to consider information from various sources, including unstructured textual documents. Therefore, researchers have developed several BPM-specific solutions that extract information from textual documents using Natural Language Processing techniques. These solutions are specific to their respective tasks and cannot accomplish multiple process-related problems as a general-purpose instrument. However, in light of the recent emergence of Large Language Models (LLMs) with remarkable reasoning capabilities, such a general-purpose instrument with multiple applications now appears attainable. In this paper, we illustrate how LLMs can accomplish text-related BPM tasks by applying a specific LLM to three exemplary tasks: mining imperative process models from textual descriptions, mining declarative process models from textual descriptions, and assessing the suitability of process tasks from textual descriptions for robotic process automation. We show that, without extensive configuration or prompt engineering, LLMs perform comparably to or better than existing solutions and discuss implications for future BPM research as well as practical usage.
    摘要 业务流程管理(BPM)旨在通过管理底层流程来改进组织活动及其成果。为此,往往需要综合来自多种来源的信息,包括非结构化的文本文档。因此,研究者已开发出多种利用自然语言处理技术从文本文档中抽取信息的BPM专用解决方案,但这些方案各自针对特定任务,无法作为通用工具解决多种流程相关问题。然而,随着具备出色推理能力的大语言模型(LLM)的出现,这样一种多用途的通用工具如今看来已可实现。本文通过将一个特定LLM应用于三个示例性任务——从文本描述中挖掘命令式过程模型、挖掘声明式过程模型,以及评估过程任务对机器人流程自动化的适用性——展示LLM如何完成与文本相关的BPM任务。我们表明,无需大量配置或提示工程,LLM的表现即可与现有解决方案相当或更好,并讨论了其对未来BPM研究及实际应用的启示。

DAPrompt: Deterministic Assumption Prompt Learning for Event Causality Identification

  • paper_url: http://arxiv.org/abs/2307.09813
  • repo_url: None
  • paper_authors: Wei Xiang, Chuanhong Zhan, Bang Wang
  • for: 本研究旨在解决事件因果识别(ECI)任务,即判断两个事件提及之间是否存在因果关系。
  • methods: 我们提出了一种新的确定性假设提示学习模型DAPrompt,它利用预训练语言模型中蕴含的百科全书式知识:先对两个事件之间的因果关系做出确定性假设,再评估其合理性。
  • results: 实验结果表明,相比最先进的算法,DAPrompt模型在EventStoryLine语料库和Causal-TimeBank语料库上取得了显著的性能提升。
    Abstract Event Causality Identification (ECI) aims at determining whether there is a causal relation between two event mentions. Conventional prompt learning designs a prompt template to first predict an answer word and then maps it to the final decision. Unlike conventional prompts, we argue that predicting an answer word may not be a necessary prerequisite for the ECI task. Instead, we can first make a deterministic assumption on the existence of causal relation between two events and then evaluate its rationality to either accept or reject the assumption. The design motivation is to try the most utilization of the encyclopedia-like knowledge embedded in a pre-trained language model. In light of such considerations, we propose a deterministic assumption prompt learning model, called DAPrompt, for the ECI task. In particular, we design a simple deterministic assumption template concatenating with the input event pair, which includes two masks as predicted events' tokens. We use the probabilities of predicted events to evaluate the assumption rationality for the final event causality decision. Experiments on the EventStoryLine corpus and Causal-TimeBank corpus validate our design objective in terms of significant performance improvements over the state-of-the-art algorithms.
    摘要 事件因果识别(ECI)旨在判断两个事件提及之间是否存在因果关系。传统的提示学习设计提示模板,先预测一个答案词,再将其映射到最终决策。与传统提示不同,我们认为预测答案词未必是ECI任务的必要前提;相反,可以先对两个事件之间因果关系的存在做出确定性假设,再评估该假设的合理性,从而接受或拒绝它。这一设计动机在于尽可能利用预训练语言模型中蕴含的百科全书式知识。基于上述考虑,我们为ECI任务提出了一种确定性假设提示学习模型DAPrompt。具体而言,我们设计了一个简单的确定性假设模板,与输入事件对拼接,其中包含作为被预测事件词元的两个掩码位。我们利用被预测事件的概率来评估假设的合理性,从而做出最终的事件因果决策。在EventStoryLine语料库和Causal-TimeBank语料库上的实验验证了我们的设计目标:相比最先进的算法取得了显著的性能提升。
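
The deterministic-assumption idea can be sketched with a masked language model: state the causal assumption in a template with two masks and read off the probabilities of the event words. The template wording and the single-token assumption on event words are our simplifications, not the paper's exact design:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def assumption_score(context, event1, event2):
    """Score the deterministic assumption 'event1 causes event2' by the
    masked-LM probabilities of the event words at two mask positions.
    Assumes single-token event words; the template is an illustrative guess."""
    text = f"{context} Here, {tokenizer.mask_token} causes {tokenizer.mask_token}."
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().squeeze(-1)
    with torch.no_grad():
        probs = model(**inputs).logits[0].softmax(-1)
    p1 = probs[mask_pos[0], tokenizer.convert_tokens_to_ids(event1)]
    p2 = probs[mask_pos[1], tokenizer.convert_tokens_to_ids(event2)]
    return (p1 * p2).item()  # higher = the causal assumption looks more rational

print(assumption_score("The earthquake destroyed several buildings.",
                       "earthquake", "destruction"))
```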

IncDSI: Incrementally Updatable Document Retrieval

  • paper_url: http://arxiv.org/abs/2307.10323
  • repo_url: None
  • paper_authors: Varsha Kishore, Chao Wan, Justin Lovelace, Yoav Artzi, Kilian Q. Weinberger
  • for: 这篇论文提出了一种可增量更新的文档检索方法IncDSI,针对把查询直接映射到文档的可微搜索索引(DSI)模型难以在训练后添加新文档的问题。
  • methods: 该方法将文档添加表述为一个对网络参数改动最小的约束优化问题,无需在整个数据集(甚至其子集)上重新训练模型。
  • results: 该方法可在约20-50毫秒内添加一篇新文档,性能与在全量数据上重新训练的模型相当,从而支持可实时更新的文档检索系统。
    Abstract Differentiable Search Index is a recently proposed paradigm for document retrieval, that encodes information about a corpus of documents within the parameters of a neural network and directly maps queries to corresponding documents. These models have achieved state-of-the-art performances for document retrieval across many benchmarks. These kinds of models have a significant limitation: it is not easy to add new documents after a model is trained. We propose IncDSI, a method to add documents in real time (about 20-50ms per document), without retraining the model on the entire dataset (or even parts thereof). Instead we formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters. Although orders of magnitude faster, our approach is competitive with re-training the model on the whole dataset and enables the development of document retrieval systems that can be updated with new information in real-time. Our code for IncDSI is available at https://github.com/varshakishore/IncDSI.
    摘要 可微搜索索引(DSI)是近期提出的一种文档检索范式,它将文档语料的信息编码到神经网络参数中,直接把查询映射到相应的文档。这类模型在许多基准测试上取得了最先进的文档检索性能,但存在一个显著局限:模型训练完成后很难添加新文档。我们提出IncDSI,一种无需在整个数据集(甚至其子集)上重新训练即可实时添加文档(每篇约20-50毫秒)的方法。我们将文档添加表述为一个对网络参数改动最小的约束优化问题。尽管速度快了若干个数量级,我们的方法与在全量数据上重新训练模型相比仍具有竞争力,使得文档检索系统能够实时纳入新信息。IncDSI的代码见 https://github.com/varshakishore/IncDSI。
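
A schematic of the constrained-addition idea: learn only the new document's vector so that its queries retrieve it, with the rest of the index frozen. This is a simplified stand-in for the paper's constrained optimization; the margin, loss, and warm start below are illustrative choices:

```python
import numpy as np

def add_document(doc_matrix, query_vecs, lr=0.5, steps=100, margin=0.1):
    """Schematic IncDSI-style addition: optimize one new document vector so it
    wins the dot-product retrieval for its own queries, while all existing
    document vectors stay frozen."""
    d = doc_matrix.shape[1]
    w_new = query_vecs.mean(axis=0).copy()        # warm start from the queries
    for _ in range(steps):
        grad = np.zeros(d)
        for q in query_vecs:
            best_old = (doc_matrix @ q).max()
            if best_old - w_new @ q > -margin:    # not yet winning by the margin
                grad -= q                          # push w_new . q upward
        w_new -= lr * grad / len(query_vecs)
    return np.vstack([doc_matrix, w_new])

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 32))                 # frozen index of 100 documents
queries = rng.normal(size=(5, 32))                # queries for the new document
print(add_document(docs, queries).shape)          # (101, 32)
```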

On the Origin of LLMs: An Evolutionary Tree and Graph for 15,821 Large Language Models

  • paper_url: http://arxiv.org/abs/2307.09793
  • repo_url: None
  • paper_authors: Sarah Gao, Andrew Kean Gao
  • for: 本研究旨在识别大语言模型(LLM)中的家族与子群体及其共同特征和趋势,以便更好地理解这些模型的谱系与发展。
  • methods: 研究利用Hugging Face上LLM相对系统化的命名规则,基于n-gram和词频-逆文档频率(TF-IDF)进行层次聚类,以识别LLM之间的社区结构。
  • results: 该方法成功识别出LLM家族,并将LLM准确聚类为有意义的子群体;作者还构建了Constellation网页应用,可快速生成树状图、关系图、词云和散点图等多种可视化。
    Abstract Since late 2022, Large Language Models (LLMs) have become very prominent with LLMs like ChatGPT and Bard receiving millions of users. Hundreds of new LLMs are announced each week, many of which are deposited to Hugging Face, a repository of machine learning models and datasets. To date, nearly 16,000 Text Generation models have been uploaded to the site. Given the huge influx of LLMs, it is of interest to know which LLM backbones, settings, training methods, and families are popular or trending. However, there is no comprehensive index of LLMs available. We take advantage of the relatively systematic nomenclature of Hugging Face LLMs to perform hierarchical clustering and identify communities amongst LLMs using n-grams and term frequency-inverse document frequency. Our methods successfully identify families of LLMs and accurately cluster LLMs into meaningful subgroups. We present a public web application to navigate and explore Constellation, our atlas of 15,821 LLMs. Constellation rapidly generates a variety of visualizations, namely dendrograms, graphs, word clouds, and scatter plots. Constellation is available at the following link: https://constellation.sites.stanford.edu/.
    摘要 自2022年底以来,大语言模型(LLM)变得非常突出,ChatGPT、Bard等LLM拥有数以百万计的用户。每周都有数百个新LLM发布,其中许多被上传到机器学习模型与数据集仓库Hugging Face;迄今已有近16,000个文本生成模型上传到该网站。面对LLM的大量涌现,人们希望了解哪些LLM骨干、设置、训练方法和家族更受欢迎或正在流行,但目前还没有一个全面的LLM索引。我们利用Hugging Face上LLM相对系统化的命名规则,基于n-gram和词频-逆文档频率(TF-IDF)进行层次聚类并识别LLM社区。我们的方法成功识别出LLM家族,并将LLM准确聚类为有意义的子群体。我们提供了一个公共网页应用Constellation(https://constellation.sites.stanford.edu/),用于浏览和探索这一涵盖15,821个LLM的图谱;Constellation可快速生成树状图、关系图、词云和散点图等多种可视化。
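
The clustering pipeline described above is straightforward to reproduce in miniature with scikit-learn and scipy; the model names below are stand-ins, not drawn from the study's 15,821 entries:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in Hugging Face model names.
model_names = [
    "llama-7b-hf", "llama-13b-hf", "alpaca-7b", "alpaca-lora-13b",
    "vicuna-7b-v1.3", "vicuna-13b-v1.3", "falcon-7b-instruct", "falcon-40b",
]

# Character n-grams capture family markers like 'llama' or '13b' inside names.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(model_names).toarray()

# Agglomerative (hierarchical) clustering over the TF-IDF vectors.
Z = linkage(X, method="average", metric="cosine")
labels = fcluster(Z, t=3, criterion="maxclust")
for name, label in zip(model_names, labels):
    print(label, name)
```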

Mood Classification of Bangla Songs Based on Lyrics

  • paper_url: http://arxiv.org/abs/2307.10314
  • repo_url: None
  • paper_authors: Maliha Mahajebin, Mohammad Rifat Ahmmad Rashid, Nafees Mansoor
  • for: 这项研究旨在根据歌词对孟加拉歌曲的情绪进行分类,以更好地理解人们对音乐的感受。
  • methods: 研究构建了包含4000首孟加拉歌曲歌词及其体裁的数据集,使用自然语言处理和BERT算法进行分析,将歌曲分为快乐、悲伤、浪漫和放松四种情绪。
  • results: 在4000首歌曲中,1513首为悲伤、1362首为浪漫、886首为快乐,其余239首为放松;结果表明对孟加拉歌曲情绪进行多类自动分类是可行且准确的。
    Abstract Music can evoke various emotions, and with the advancement of technology, it has become more accessible to people. Bangla music, which portrays different human emotions, lacks sufficient research. The authors of this article aim to analyze Bangla songs and classify their moods based on the lyrics. To achieve this, this research has compiled a dataset of 4000 Bangla song lyrics, genres, and used Natural Language Processing and the Bert Algorithm to analyze the data. Among the 4000 songs, 1513 songs are represented for the sad mood, 1362 for the romantic mood, 886 for happiness, and the rest 239 are classified as relaxation. By embedding the lyrics of the songs, the authors have classified the songs into four moods: Happy, Sad, Romantic, and Relaxed. This research is crucial as it enables a multi-class classification of songs' moods, making the music more relatable to people's emotions. The article presents the automated result of the four moods accurately derived from the song lyrics.
    摘要 音乐可以唤起多种情感,且随着技术的发展,人们获取音乐也愈发便利。表达不同人类情感的孟加拉音乐尚缺乏足够的研究。本文作者旨在分析孟加拉歌曲并根据歌词对其情绪进行分类。为此,本研究编译了包含4000首孟加拉歌曲歌词及体裁的数据集,并使用自然语言处理和BERT算法进行分析。在4000首歌曲中,1513首表达悲伤情绪,1362首表达浪漫情绪,886首表达快乐情绪,其余239首归为放松。通过对歌词进行嵌入,作者将歌曲分为快乐、悲伤、浪漫和放松四种情绪。这项研究十分重要,因为它实现了歌曲情绪的多类分类,使音乐更贴近人们的情感。文章展示了从歌词中自动且准确得出的四种情绪分类结果。

CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility

  • paper_url: http://arxiv.org/abs/2307.09705
  • repo_url: https://github.com/x-plug/cvalues
  • paper_authors: Guohai Xu, Jiayi Liu, Ming Yan, Haotian Xu, Jinghui Si, Zhuoran Zhou, Peng Yi, Xing Gao, Jitao Sang, Rong Zhang, Ji Zhang, Chao Peng, Fei Huang, Jingren Zhou
  • for: 评估大语言模型(LLMs)是否符合人类价值观念的重要性在不断增长,这篇论文提出了首个中文人类价值评估标准(CValues),用于评估 LLMS 在安全性和责任性两个方面的价值Alignment。
  • methods: 该论文人工收集了覆盖10个场景的对抗性安全提示,并由专业专家从8个领域归纳出责任提示;除人工评估外,还构建了多选题形式的提示以进行自动评估,从而对中文LLM的价值取向进行全面评估。
  • results: 研究发现,大多数中文 LLMS 在安全性方面表现良好,但在责任性方面还有很大的提升空间。此外,自动和人类评估都是评估中文 LLMS 的人类价值Alignment 的重要方法。
    Abstract With the rapid evolution of large language models (LLMs), there is a growing concern that they may pose risks or have negative social impacts. Therefore, evaluation of human values alignment is becoming increasingly important. Previous work mainly focuses on assessing the performance of LLMs on certain knowledge and reasoning abilities, while neglecting the alignment to human values, especially in a Chinese context. In this paper, we present CValues, the first Chinese human values evaluation benchmark to measure the alignment ability of LLMs in terms of both safety and responsibility criteria. As a result, we have manually collected adversarial safety prompts across 10 scenarios and induced responsibility prompts from 8 domains by professional experts. To provide a comprehensive values evaluation of Chinese LLMs, we not only conduct human evaluation for reliable comparison, but also construct multi-choice prompts for automatic evaluation. Our findings suggest that while most Chinese LLMs perform well in terms of safety, there is considerable room for improvement in terms of responsibility. Moreover, both the automatic and human evaluation are important for assessing the human values alignment in different aspects. The benchmark and code is available on ModelScope and Github.

Efficient Guided Generation for Large Language Models

  • paper_url: http://arxiv.org/abs/2307.09702
  • repo_url: https://github.com/normal-computing/outlines
  • paper_authors: Brandon T. Willard, Rémi Louf
  • for: The paper reformulates the problem of neural text generation in terms of transitions between the states of a finite-state machine.
  • methods: The method guides text generation with regular expressions and context-free grammars by constructing an index over a language model's vocabulary, guaranteeing the structure of the generated text; it is model-agnostic and allows domain-specific knowledge and constraints to be enforced.
  • results: The approach adds little overhead to token-sequence generation and significantly outperforms existing solutions; an implementation is provided in the open-source Python library Outlines.
    Abstract In this article we show how the problem of neural text generation can be constructively reformulated in terms of transitions between the states of a finite-state machine. This framework leads to an efficient approach to guiding text generation with regular expressions and context-free grammars by allowing the construction of an index over a language model's vocabulary. The approach is model agnostic, allows one to enforce domain-specific knowledge and constraints, and enables the construction of reliable interfaces by guaranteeing the structure of the generated text. It adds little overhead to the token sequence generation process and significantly outperforms existing solutions. An implementation is provided in the open source Python library Outlines
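
The key mechanism is easiest to see in miniature: at each decoding step, only tokens that keep the partial output a viable prefix of the target pattern are allowed, which is exactly what an FSM index precomputes. The toy sketch below checks viability with the third-party `regex` package's partial matching and picks tokens at random in place of sampling from masked logits; it illustrates the constraint, not the Outlines implementation, which precomputes an FSM-based index over the vocabulary for efficiency.

```python
# Toy sketch of regex-constrained decoding by prefix viability, in the spirit
# of the paper's FSM framework. The vocabulary and the use of random choice
# (standing in for sampling from masked logits) are my own simplifications.
import random
import regex  # third-party `regex` package; supports partial matching

PATTERN = r"[0-9]{1,3}\.[0-9]{2}"             # e.g. prices such as "42.50"
VOCAB = ["0", "1", "2", "4", "5", ".", "50", "abc", "%"]

def allowed_tokens(prefix: str) -> list[str]:
    """Tokens whose concatenation keeps the text a viable prefix of PATTERN."""
    return [tok for tok in VOCAB
            if regex.fullmatch(PATTERN, prefix + tok, partial=True)]

random.seed(0)
text = ""
while not regex.fullmatch(PATTERN, text):     # stop once the pattern is satisfied
    candidates = allowed_tokens(text)
    if not candidates:                        # dead end: no token can continue
        raise RuntimeError(f"stuck at {text!r}")
    text += random.choice(candidates)         # stand-in for masked-logit sampling
print(text)                                   # always matches PATTERN
```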

Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation

  • paper_url: http://arxiv.org/abs/2307.09701
  • repo_url: None
  • paper_authors: Hao Peng, Qingqing Cao, Jesse Dodge, Matthew E. Peters, Jared Fernandez, Tom Sherborne, Kyle Lo, Sam Skjonsberg, Emma Strubell, Darrell Plessas, Iz Beltagy, Evan Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi
  • for: The paper aims to address the practical challenges in evaluating and comparing the efficiency of natural language processing (NLP) models, and to provide a standardized and centralized platform for fair and reproducible evaluations.
  • methods: The paper introduces Pentathlon, a benchmark for holistic and realistic evaluation of NLP model efficiency, which focuses on inference and offers a strictly-controlled hardware platform, a suite of metrics, and a software library for seamless integration.
  • results: The paper hopes to stimulate algorithmic innovations in building efficient NLP models and foster an increased awareness of the social and environmental implications in the development of future-generation NLP models.
    Abstract Rising computational demands of modern natural language processing (NLP) systems have increased the barrier to entry for cutting-edge research while posing serious environmental concerns. Yet, progress on model efficiency has been impeded by practical challenges in model evaluation and comparison. For example, hardware is challenging to control due to disparate levels of accessibility across different institutions. Moreover, improvements in metrics such as FLOPs often fail to translate to progress in real-world applications. In response, we introduce Pentathlon, a benchmark for holistic and realistic evaluation of model efficiency. Pentathlon focuses on inference, which accounts for a majority of the compute in a model's lifecycle. It offers a strictly-controlled hardware platform, and is designed to mirror real-world applications scenarios. It incorporates a suite of metrics that target different aspects of efficiency, including latency, throughput, memory overhead, and energy consumption. Pentathlon also comes with a software library that can be seamlessly integrated into any codebase and enable evaluation. As a standardized and centralized evaluation platform, Pentathlon can drastically reduce the workload to make fair and reproducible efficiency comparisons. While initially focused on natural language processing (NLP) models, Pentathlon is designed to allow flexible extension to other fields. We envision Pentathlon will stimulate algorithmic innovations in building efficient models, and foster an increased awareness of the social and environmental implications in the development of future-generation NLP models.

Analyzing sports commentary in order to automatically recognize events and extract insights

  • paper_url: http://arxiv.org/abs/2307.10303
  • repo_url: https://github.com/yanismiraoui/analyzing-sports-commentary-in-order-to-automatically-recognize-events-and-extract-insights
  • paper_authors: Yanis Miraoui
  • for: This study investigates how multiple Natural Language Processing techniques and methods can be used to automatically recognize the main actions in sports events.
  • methods: The study analyzes live sport commentaries from different sources and classifies the major actions into different categories.
  • results: The study finds that sentiment analysis can help detect these main actions.
    Abstract In this paper, we carefully investigate how we can use multiple different Natural Language Processing techniques and methods in order to automatically recognize the main actions in sports events. We aim to extract insights by analyzing live sport commentaries from different sources and by classifying these major actions into different categories. We also study if sentiment analysis could help detect these main actions.

Can Model Fusing Help Transformers in Long Document Classification? An Empirical Study

  • paper_url: http://arxiv.org/abs/2307.09532
  • repo_url: https://github.com/damithdr/legal-classification
  • paper_authors: Damith Premasiri, Tharindu Ranasinghe, Ruslan Mitkov
  • for: This research addresses long document classification, extending the applicability of current transformer models across multiple domains.
  • methods: The study employs Model Fusing for long document classification and compares the results against the BERT and Longformer architectures.
  • results: The experiments indicate that model fusing improves transformer performance on long document classification tasks, offering greater flexibility and scalability than the BERT and Longformer architectures.
    Abstract Text classification is an area of research which has been studied over the years in Natural Language Processing (NLP). Adapting NLP to multiple domains has introduced many new challenges for text classification and one of them is long document classification. While state-of-the-art transformer models provide excellent results in text classification, most of them have limitations in the maximum sequence length of the input sequence. The majority of the transformer models are limited to 512 tokens, and therefore, they struggle with long document classification problems. In this research, we explore on employing Model Fusing for long document classification while comparing the results with well-known BERT and Longformer architectures.

ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning

  • paper_url: http://arxiv.org/abs/2307.09474
  • repo_url: None
  • paper_authors: Liang Zhao, En Yu, Zheng Ge, Jinrong Yang, Haoran Wei, Hongyu Zhou, Jianjian Sun, Yuang Peng, Runpei Dong, Chunrui Han, Xiangyu Zhang
  • for: Improving human-AI interactivity for multimodal large language models (MLLMs).
  • methods: The approach uses diverse reference representations, such as points and boxes, as referring prompts, enabling MLLMs to focus on specific regions of interest at a finer granularity.
  • results: The ChatSpot model performs well across diverse interaction modes and tasks, providing a more flexible and seamless interactive experience.
    Abstract Human-AI interactivity is a critical aspect that reflects the usability of multimodal large language models (MLLMs). However, existing end-to-end MLLMs only allow users to interact with them through language instructions, leading to the limitation of the interactive accuracy and efficiency. In this study, we present precise referring instructions that utilize diverse reference representations such as points and boxes as referring prompts to refer to the special region. This enables MLLMs to focus on the region of interest and achieve finer-grained interaction. Based on precise referring instruction, we propose ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity including mouse clicks, drag-and-drop, and drawing boxes, which provides a more flexible and seamless interactive experience. We also construct a multi-grained vision-language instruction-following dataset based on existing datasets and GPT-4 generating. Furthermore, we design a series of evaluation tasks to assess the effectiveness of region recognition and interaction. Experimental results showcase ChatSpot's promising performance.

A comparative analysis of SRGAN models

  • paper_url: http://arxiv.org/abs/2307.09456
  • repo_url: None
  • paper_authors: Fatemeh Rezapoor Nikroo, Ajinkya Deshmukh, Anantha Sharma, Adrian Tam, Kaarthik Kumar, Cleo Norris, Aditya Dangi
  • for: These models target single-image super-resolution, raising the resolution and quality of images.
  • methods: The study evaluates several state-of-the-art SRGAN models, including ESRGAN, Real-ESRGAN and EDSR, using a pipeline that degrades real-world images before assessing each model's performance.
  • results: The EDSR-BASE model from huggingface proves the most effective, leading on both quantitative metrics and subjective visual-quality assessments with the least compute overhead. EDSR produces images with high PSNR and SSIM values and yields high-quality OCR results with the Tesseract OCR engine.
    Abstract In this study, we evaluate the performance of multiple state-of-the-art SRGAN (Super Resolution Generative Adversarial Network) models, ESRGAN, Real-ESRGAN and EDSR, on a benchmark dataset of real-world images which undergo degradation using a pipeline. Our results show that some models seem to significantly increase the resolution of the input images while preserving their visual quality, this is assessed using Tesseract OCR engine. We observe that EDSR-BASE model from huggingface outperforms the remaining candidate models in terms of both quantitative metrics and subjective visual quality assessments with least compute overhead. Specifically, EDSR generates images with higher peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) values and are seen to return high quality OCR results with Tesseract OCR engine. These findings suggest that EDSR is a robust and effective approach for single-image super-resolution and may be particularly well-suited for applications where high-quality visual fidelity is critical and optimized compute.
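
For readers who want to reproduce the fidelity numbers, the two metrics cited above have standard definitions; the short sketch below computes them with NumPy and scikit-image. The function name, toy images, and choice of libraries are mine, not the paper's evaluation pipeline.

```python
# PSNR from its textbook definition, SSIM via scikit-image.
import numpy as np
from skimage.metrics import structural_similarity  # scikit-image

def psnr(reference: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example on random 8-bit grayscale images:
rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
b = np.clip(a.astype(int) + rng.integers(-5, 6, size=a.shape), 0, 255).astype(np.uint8)
print(psnr(a, b), structural_similarity(a, b, data_range=255))
```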

Pseudo Outlier Exposure for Out-of-Distribution Detection using Pretrained Transformers

  • paper_url: http://arxiv.org/abs/2307.09455
  • repo_url: None
  • paper_authors: Jaeyoung Kim, Kyuheon Jung, Dongbin Na, Sion Jang, Eunbin Park, Sungchul Choi
  • for: This paper helps language models distinguish in-distribution (ID) from out-of-distribution (OOD) samples, so that unreliable predictions on OOD inputs can be rejected.
  • methods: It proposes a simple but effective method called Pseudo Outlier Exposure (POE), which sequentially masks tokens related to ID classes to construct a surrogate OOD dataset for training a rejection network. POE requires no external OOD data and can be implemented within off-the-shelf Transformers.
  • results: POE is competitive with state-of-the-art methods on several text classification benchmarks while separating ID and OOD samples.
    Abstract For real-world language applications, detecting an out-of-distribution (OOD) sample is helpful to alert users or reject such unreliable samples. However, modern over-parameterized language models often produce overconfident predictions for both in-distribution (ID) and OOD samples. In particular, language models suffer from OOD samples with a similar semantic representation to ID samples since these OOD samples lie near the ID manifold. A rejection network can be trained with ID and diverse outlier samples to detect test OOD samples, but explicitly collecting auxiliary OOD datasets brings an additional burden for data collection. In this paper, we propose a simple but effective method called Pseudo Outlier Exposure (POE) that constructs a surrogate OOD dataset by sequentially masking tokens related to ID classes. The surrogate OOD sample introduced by POE shows a similar representation to ID data, which is most effective in training a rejection network. Our method does not require any external OOD data and can be easily implemented within off-the-shelf Transformers. A comprehensive comparison with state-of-the-art algorithms demonstrates POE's competitiveness on several text classification benchmarks.
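
The masking idea lends itself to a short sketch: repeatedly replace the token that an ID classifier relies on most, producing surrogate samples that sit near, but off, the ID manifold. The scoring function and the fixed number of masking rounds below are placeholders of mine; the paper's exact token-selection rule may differ.

```python
# Hedged sketch of the surrogate-OOD construction idea behind POE.
from typing import Callable, List

def make_pseudo_ood(tokens: List[str],
                    score_fn: Callable[[List[str]], List[float]],
                    mask_token: str = "[MASK]",
                    n_rounds: int = 3) -> List[List[str]]:
    """Sequentially mask the most class-indicative token, emitting one
    surrogate-OOD variant per round."""
    variants = []
    current = list(tokens)
    for _ in range(n_rounds):
        scores = score_fn(current)           # per-token relevance to ID classes
        maskable = [i for i, t in enumerate(current) if t != mask_token]
        if not maskable:
            break
        top = max(maskable, key=lambda i: scores[i])
        current[top] = mask_token
        variants.append(list(current))       # one pseudo-OOD sample per round
    return variants

# Usage with a dummy scorer (token length as a stand-in for model relevance):
print(make_pseudo_ood("the movie was wonderfully acted".split(),
                      lambda toks: [len(t) for t in toks]))
```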

Let’s ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation

  • paper_url: http://arxiv.org/abs/2307.09416
  • repo_url: None
  • paper_authors: Federico Betti, Jacopo Staiano, Lorenzo Baraldi, Lorenzo Baraldi, Rita Cucchiara, Nicu Sebe
  • for: This paper addresses automatic evaluation in image generation: assessing the quality of generated images and their adherence to the prompt using language models and visual question answering.
  • methods: It proposes a new automated evaluation method, Visual Concept Evaluation (ViCE), which replicates the human cognitive process: it combines large language models with visual question answering, outlining visual concepts, formulating image-specific verification questions, and interrogating the image through a Q&A system to score the outcome.
  • results: Although still at a preliminary stage, the results are promising: ViCE's assessments of image quality and prompt adherence align with human judgments, suggesting it can serve as a useful automatic evaluation tool as image generation and image-targeted editing tasks grow more sophisticated.
    Abstract Research in Image Generation has recently made significant progress, particularly boosted by the introduction of Vision-Language models which are able to produce high-quality visual content based on textual inputs. Despite ongoing advancements in terms of generation quality and realism, no methodical frameworks have been defined yet to quantitatively measure the quality of the generated content and the adherence with the prompted requests: so far, only human-based evaluations have been adopted for quality satisfaction and for comparing different generative methods. We introduce a novel automated method for Visual Concept Evaluation (ViCE), i.e. to assess consistency between a generated/edited image and the corresponding prompt/instructions, with a process inspired by the human cognitive behaviour. ViCE combines the strengths of Large Language Models (LLMs) and Visual Question Answering (VQA) into a unified pipeline, aiming to replicate the human cognitive process in quality assessment. This method outlines visual concepts, formulates image-specific verification questions, utilizes the Q&A system to investigate the image, and scores the combined outcome. Although this brave new hypothesis of mimicking humans in the image evaluation process is in its preliminary assessment stage, results are promising and open the door to a new form of automatic evaluation which could have significant impact as the image generation or the image target editing tasks become more and more sophisticated.

Zero-shot Query Reformulation for Conversational Search

  • paper_url: http://arxiv.org/abs/2307.09384
  • repo_url: https://github.com/dayuyang1999/zeqr
  • paper_authors: Dayu Yang, Yue Zhang, Hui Fang
  • for: Improving retrieval effectiveness in conversational search while sidestepping the data sparsity problem.
  • methods: Proposes a Zero-shot Query Reformulation (ZeQR) framework that uses language models designed for machine reading comprehension to explicitly resolve coreference and omission in raw queries, without requiring supervision from conversational search data.
  • results: Experiments show that ZeQR outperforms existing baselines, improving both query-intent understanding and retrieval effectiveness.
    Abstract As the popularity of voice assistants continues to surge, conversational search has gained increased attention in Information Retrieval. However, data sparsity issues in conversational search significantly hinder the progress of supervised conversational search methods. Consequently, researchers are focusing more on zero-shot conversational search approaches. Nevertheless, existing zero-shot methods face three primary limitations: they are not universally applicable to all retrievers, their effectiveness lacks sufficient explainability, and they struggle to resolve common conversational ambiguities caused by omission. To address these limitations, we introduce a novel Zero-shot Query Reformulation (ZeQR) framework that reformulates queries based on previous dialogue contexts without requiring supervision from conversational search data. Specifically, our framework utilizes language models designed for machine reading comprehension tasks to explicitly resolve two common ambiguities: coreference and omission, in raw queries. In comparison to existing zero-shot methods, our approach is universally applicable to any retriever without additional adaptation or indexing. It also provides greater explainability and effectively enhances query intent understanding because ambiguities are explicitly and proactively resolved. Through extensive experiments on four TREC conversational datasets, we demonstrate the effectiveness of our method, which consistently outperforms state-of-the-art baselines.

cs.LG - 2023-07-19

Android in the Wild: A Large-Scale Dataset for Android Device Control

  • paper_url: http://arxiv.org/abs/2307.10088
  • repo_url: https://github.com/google-research/google-research
  • paper_authors: Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, Timothy Lillicrap
  • For: This work presents a large-scale dataset for device-control research, supporting systems that interpret natural-language instructions and execute them by directly controlling a device's user interface.
  • Methods: The dataset contains human demonstrations of device interactions (screens and actions) paired with natural-language instructions: 715k episodes spanning 30k unique instructions, four versions of Android (v10-13), and eight device types (Pixel 2 XL to Pixel 6) with varying screen resolutions, including multi-step tasks that require semantic understanding of language and visual context.
  • Results: The dataset poses a new challenge: actions available through the user interface must be inferred from their visual appearance, and the action space consists of precise gestures (e.g., horizontal scrolls to operate carousel widgets) rather than simple UI-element-based actions. The authors organize the dataset to support robustness analysis under new task descriptions, applications, or platform versions, develop two agents, and report their performance across the dataset, which is available at https://github.com/google-research/google-research/tree/master/android_in_the_wild.
    Abstract There is a growing interest in device-control systems that can interpret human natural language instructions and execute them on a digital device by directly controlling its user interface. We present a dataset for device-control research, Android in the Wild (AITW), which is orders of magnitude larger than current datasets. The dataset contains human demonstrations of device interactions, including the screens and actions, and corresponding natural language instructions. It consists of 715k episodes spanning 30k unique instructions, four versions of Android (v10-13),and eight device types (Pixel 2 XL to Pixel 6) with varying screen resolutions. It contains multi-step tasks that require semantic understanding of language and visual context. This dataset poses a new challenge: actions available through the user interface must be inferred from their visual appearance. And, instead of simple UI element-based actions, the action space consists of precise gestures (e.g., horizontal scrolls to operate carousel widgets). We organize our dataset to encourage robustness analysis of device-control systems, i.e., how well a system performs in the presence of new task descriptions, new applications, or new platform versions. We develop two agents and report performance across the dataset. The dataset is available at https://github.com/google-research/google-research/tree/master/android_in_the_wild.

A Dual Formulation for Probabilistic Principal Component Analysis

  • paper_url: http://arxiv.org/abs/2307.10078
  • repo_url: None
  • paper_authors: Henri De Plaen, Johan A. K. Suykens
  • for: This work characterizes Probabilistic Principal Component Analysis in Hilbert spaces and shows that the optimal solution admits a representation in the dual space, enabling a generative framework for kernel methods.
  • methods: The study combines Probabilistic Principal Component Analysis with dual-space techniques.
  • results: It shows that Kernel Principal Component Analysis arises as a special case of the framework, and illustrates the approach on a toy example and a real dataset.
    Abstract In this paper, we characterize Probabilistic Principal Component Analysis in Hilbert spaces and demonstrate how the optimal solution admits a representation in dual space. This allows us to develop a generative framework for kernel methods. Furthermore, we show how it englobes Kernel Principal Component Analysis and illustrate its working on a toy and a real dataset.
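
As background (textbook material, not a contribution of the paper), the primal probabilistic PCA model that the dual formulation re-expresses is:

```latex
% Probabilistic PCA (Tipping & Bishop, 1999) with latent dimension q,
% data dimension d, loading matrix W and isotropic noise variance sigma^2.
\begin{aligned}
\mathbf{z} &\sim \mathcal{N}(\mathbf{0},\, \mathbf{I}_q), \\
\mathbf{x}\mid\mathbf{z} &\sim \mathcal{N}(\mathbf{W}\mathbf{z}+\boldsymbol{\mu},\, \sigma^{2}\mathbf{I}_d), \\
\text{so marginally}\quad \mathbf{x} &\sim \mathcal{N}(\boldsymbol{\mu},\, \mathbf{W}\mathbf{W}^{\top}+\sigma^{2}\mathbf{I}_d).
\end{aligned}
```

A dual view rewrites the solution in terms of inner products between data points, which is what admits the kernel trick and hence the connection to kernel PCA noted in the abstract.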

Unsupervised Accuracy Estimation of Deep Visual Models using Domain-Adaptive Adversarial Perturbation without Source Samples

  • paper_url: http://arxiv.org/abs/2307.10062
  • repo_url: None
  • paper_authors: JoonHo Lee, Jae Oh Woo, Hankyu Moon, Kwonho Lee
  • for: This paper addresses the performance drop that can occur when deploying deep visual models, caused by discrepancies between the source and target distributions.
  • methods: It proposes a source-free framework for estimating model accuracy on unlabeled target data, requiring neither source data nor labels. Specifically, it uses pseudo-labels to estimate accuracy in the target domain, adopting recent advances in source-free domain adaptation and applying adaptive adversarial perturbation to the target model's input to mitigate erroneous pseudo-labels.
  • results: Experiments show that the method effectively handles challenging distribution-shift scenarios and outperforms existing methods that require source data and labels for training.
    Abstract Deploying deep visual models can lead to performance drops due to the discrepancies between source and target distributions. Several approaches leverage labeled source data to estimate target domain accuracy, but accessing labeled source data is often prohibitively difficult due to data confidentiality or resource limitations on serving devices. Our work proposes a new framework to estimate model accuracy on unlabeled target data without access to source data. We investigate the feasibility of using pseudo-labels for accuracy estimation and evolve this idea into adopting recent advances in source-free domain adaptation algorithms. Our approach measures the disagreement rate between the source hypothesis and the target pseudo-labeling function, adapted from the source hypothesis. We mitigate the impact of erroneous pseudo-labels that may arise due to a high ideal joint hypothesis risk by employing adaptive adversarial perturbation on the input of the target model. Our proposed source-free framework effectively addresses the challenging distribution shift scenarios and outperforms existing methods requiring source data and labels for training.

Accurate deep learning sub-grid scale models for large eddy simulations

  • paper_url: http://arxiv.org/abs/2307.10060
  • repo_url: None
  • paper_authors: Rikhi Bose, Arunabha M. Roy
  • for: The two families of models are developed for large-eddy simulation (LES) purposes.
  • methods: They are built with deep learning (DL) algorithms which, unlike conventional analytical modelling techniques, can produce high-order complex non-linear relations between inputs and outputs.
  • results: Experiments show that the simpler of the two architectures has better feature-learning capacity and achieves higher predictive performance for the SGS stresses across different filter widths and Reynolds numbers.
    Abstract We present two families of sub-grid scale (SGS) turbulence models developed for large-eddy simulation (LES) purposes. Their development required the formulation of physics-informed robust and efficient Deep Learning (DL) algorithms which, unlike state-of-the-art analytical modeling techniques can produce high-order complex non-linear relations between inputs and outputs. Explicit filtering of data from direct simulations of the canonical channel flow at two friction Reynolds numbers $Re_\tau\approx 395$ and 590 provided accurate data for training and testing. The two sets of models use different network architectures. One of the architectures uses tensor basis neural networks (TBNN) and embeds the simplified analytical model form of the general effective-viscosity hypothesis, thus incorporating the Galilean, rotational and reflectional invariances. The other architecture is that of a relatively simple network, that is able to incorporate the Galilean invariance only. However, this simpler architecture has better feature extraction capacity owing to its ability to establish relations between and extract information from cross-components of the integrity basis tensors and the SGS stresses. Both sets of models are used to predict the SGS stresses for feature datasets generated with different filter widths, and at different Reynolds numbers. It is shown that due to the simpler model's better feature learning capabilities, it outperforms the invariance embedded model in statistical performance metrics. In a priori tests, both sets of models provide similar levels of dissipation and backscatter. Based on the test results, both sets of models should be usable in a posteriori actual LESs.

Convergence Guarantees for Stochastic Subgradient Methods in Nonsmooth Nonconvex Optimization

  • paper_url: http://arxiv.org/abs/2307.10053
  • repo_url: https://github.com/xnchxy/GeneralSGD
  • paper_authors: Nachuan Xiao, Xiaoyin Hu, Kim-Chuan Toh
  • for: investigate the convergence properties of the stochastic gradient descent (SGD) method and its variants, especially in training neural networks built from nonsmooth activation functions.
  • methods: develop a novel framework that assigns different timescales to stepsizes for updating the momentum terms and variables, and prove the global convergence of the proposed framework in both single-timescale and two-timescale cases.
  • results: prove the convergence properties of SGD-type methods based on the proposed framework, including heavy-ball SGD, SignSGD, Lion, normalized SGD and clipped SGD, and demonstrate the high efficiency of these methods through preliminary numerical experiments.
    Abstract In this paper, we investigate the convergence properties of the stochastic gradient descent (SGD) method and its variants, especially in training neural networks built from nonsmooth activation functions. We develop a novel framework that assigns different timescales to stepsizes for updating the momentum terms and variables, respectively. Under mild conditions, we prove the global convergence of our proposed framework in both single-timescale and two-timescale cases. We show that our proposed framework encompasses a wide range of well-known SGD-type methods, including heavy-ball SGD, SignSGD, Lion, normalized SGD and clipped SGD. Furthermore, when the objective function adopts a finite-sum formulation, we prove the convergence properties for these SGD-type methods based on our proposed framework. In particular, we prove that these SGD-type methods find the Clarke stationary points of the objective function with randomly chosen stepsizes and initial points under mild assumptions. Preliminary numerical experiments demonstrate the high efficiency of our analyzed SGD-type methods.
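
To make the "different timescales" idea concrete, here is a schematic heavy-ball-style update in which the momentum buffer and the iterate use separate stepsize sequences. The stepsize schedules, the noise model, and the toy nonsmooth objective are illustrative choices of mine, not the paper's framework or its proofs.

```python
# Schematic two-timescale stochastic subgradient method with momentum.
import numpy as np

def two_timescale_sgd(grad_fn, x0, n_steps=500, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)                          # momentum term
    for k in range(1, n_steps + 1):
        noise = rng.normal(scale=0.1, size=x.shape)
        g = grad_fn(x) + noise                    # noisy (sub)gradient oracle
        beta_k = 1.0 / np.sqrt(k)                 # faster timescale: momentum tracking
        eta_k = 1.0 / k                           # slower timescale: variable update
        m = (1 - beta_k) * m + beta_k * g         # momentum tracks the subgradient
        x = x - eta_k * m                         # variable follows the momentum
    return x

# Toy nonsmooth objective f(x) = |x1| + 0.5*x2^2 via its subgradient:
subgrad = lambda x: np.array([np.sign(x[0]), x[1]])
print(two_timescale_sgd(subgrad, [2.0, -3.0]))    # drifts toward the origin
```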

Contextual Reliability: When Different Features Matter in Different Contexts

  • paper_url: http://arxiv.org/abs/2307.10026
  • repo_url: None
  • paper_authors: Gaurav Ghosal, Amrith Setlur, Daniel S. Brown, Anca D. Dragan, Aditi Raghunathan
  • for: This work aims to improve the robustness of deep neural networks by accounting for context-dependent reliability, addressing their tendency to over-rely on spurious correlations.
  • methods: It proposes a two-stage framework called Explicit Non-spurious feature Prediction (ENP), which first identifies the features appropriate to a given context and then trains a model to rely exclusively on those features.
  • results: Theoretical and empirical results demonstrate that ENP improves robustness over existing methods and provides new benchmarks for contextual reliability.
    Abstract Deep neural networks often fail catastrophically by relying on spurious correlations. Most prior work assumes a clear dichotomy into spurious and reliable features; however, this is often unrealistic. For example, most of the time we do not want an autonomous car to simply copy the speed of surrounding cars -- we don't want our car to run a red light if a neighboring car does so. However, we cannot simply enforce invariance to next-lane speed, since it could provide valuable information about an unobservable pedestrian at a crosswalk. Thus, universally ignoring features that are sometimes (but not always) reliable can lead to non-robust performance. We formalize a new setting called contextual reliability which accounts for the fact that the "right" features to use may vary depending on the context. We propose and analyze a two-stage framework called Explicit Non-spurious feature Prediction (ENP) which first identifies the relevant features to use for a given context, then trains a model to rely exclusively on these features. Our work theoretically and empirically demonstrates the advantages of ENP over existing methods and provides new benchmarks for contextual reliability.

Europepolls: A Dataset of Country-Level Opinion Polling Data for the European Union and the UK

  • paper_url: http://arxiv.org/abs/2307.10022
  • repo_url: https://github.com/konstantinos-p/europepolls
  • paper_authors: Konstantinos Pitas
  • for: fills a gap in available opinion polling data for the European Union and the UK, providing a large and open dataset for researchers to study voting behavior and multimodal data.
  • methods: uses Wikipedia data and the pandas library to gather and preprocess the data, making it available in both raw and preprocessed formats.
  • results: enables researchers to study complex interactions between multimodal data and voting behavior, with the potential for new insights and discoveries using recent advances in LLMs and deep learning.
    Abstract I propose an open dataset of country-level historical opinion polling data for the European Union and the UK. The dataset aims to fill a gap in available opinion polling data for the European Union. Some existing datasets are restricted to the past five years, limiting research opportunities. At the same time, some larger proprietary datasets exist but are available only in a visual preprocessed time series format. Finally, while other large datasets for individual countries might exist, these could be inaccessible due to language barriers. The data was gathered from Wikipedia, and preprocessed using the pandas library. Both the raw and the preprocessed data are in the .csv format. I hope that given the recent advances in LLMs and deep learning in general, this large dataset will enable researchers to uncover complex interactions between multimodal data (news articles, economic indicators, social media) and voting behavior. The raw data, the preprocessed data, and the preprocessing scripts are available on GitHub.
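
A loading sketch for the released .csv files might look like the following; the file name, the column names, and the monthly resampling are placeholders of mine, since the exact schema is documented in the repository rather than in the abstract.

```python
# Hypothetical exploration of one of the dataset's country-level .csv files.
import pandas as pd

polls = pd.read_csv("germany_polls.csv", parse_dates=["date"])  # placeholder path/columns
monthly = (polls
           .set_index("date")
           .resample("M")
           .mean(numeric_only=True))         # smooth party shares to monthly averages
print(monthly.tail())
```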

TbExplain: A Text-based Explanation Method for Scene Classification Models with the Statistical Prediction Correction

  • paper_url: http://arxiv.org/abs/2307.10003
  • repo_url: None
  • paper_authors: Amirhossein Aminimehr, Pouya Khani, Amirali Molaei, Amirmohammad Kazemeini, Erik Cambria
  • for: This research improves the interpretability of Explainable Artificial Intelligence (XAI) for scene classification, so that users unfamiliar with machine learning can better understand model predictions.
  • methods: It proposes a framework called TbExplain that employs XAI techniques and a pre-trained object detector to present text-based explanations of scene classification models. TbExplain also introduces a novel method that corrects predictions, and explains them textually, based on the statistics of objects in the input image when the initial prediction is unreliable.
  • results: Qualitative and quantitative experiments on scene classification datasets show that TbExplain improves classification accuracy and yields sufficiently reliable text-based explanations, even for initially unreliable predictions.
    Abstract The field of Explainable Artificial Intelligence (XAI) aims to improve the interpretability of black-box machine learning models. Building a heatmap based on the importance value of input features is a popular method for explaining the underlying functions of such models in producing their predictions. Heatmaps are almost understandable to humans, yet they are not without flaws. Non-expert users, for example, may not fully understand the logic of heatmaps (the logic in which relevant pixels to the model's prediction are highlighted with different intensities or colors). Additionally, objects and regions of the input image that are relevant to the model prediction are frequently not entirely differentiated by heatmaps. In this paper, we propose a framework called TbExplain that employs XAI techniques and a pre-trained object detector to present text-based explanations of scene classification models. Moreover, TbExplain incorporates a novel method to correct predictions and textually explain them based on the statistics of objects in the input image when the initial prediction is unreliable. To assess the trustworthiness and validity of the text-based explanations, we conducted a qualitative experiment, and the findings indicated that these explanations are sufficiently reliable. Furthermore, our quantitative and qualitative experiments on TbExplain with scene classification datasets reveal an improvement in classification accuracy over ResNet variants.

Impact of Disentanglement on Pruning Neural Networks

  • paper_url: http://arxiv.org/abs/2307.09994
  • repo_url: None
  • paper_authors: Carl Shneider, Peyman Rostami, Anis Kacem, Nilotpal Sinha, Abd El Rahman Shabayek, Djamila Aouada
  • for: This study targets deploying deep neural networks on edge devices for task-specific real-world objectives, which requires compressing the networks to reduce their memory footprint, power consumption, and latency.
  • methods: It uses the Beta-VAE framework combined with a standard pruning criterion to investigate how forcing the network to learn disentangled representations affects the pruning process for classification.
  • results: Experiments on the MNIST and CIFAR10 datasets examine the challenges of disentanglement and propose a path forward for future work.
    Abstract Deploying deep learning neural networks on edge devices, to accomplish task specific objectives in the real-world, requires a reduction in their memory footprint, power consumption, and latency. This can be realized via efficient model compression. Disentangled latent representations produced by variational autoencoder (VAE) networks are a promising approach for achieving model compression because they mainly retain task-specific information, discarding useless information for the task at hand. We make use of the Beta-VAE framework combined with a standard criterion for pruning to investigate the impact of forcing the network to learn disentangled representations on the pruning process for the task of classification. In particular, we perform experiments on MNIST and CIFAR10 datasets, examine disentanglement challenges, and propose a path forward for future works.

UniMatch: A Unified User-Item Matching Framework for the Multi-purpose Merchant Marketing

  • paper_url: http://arxiv.org/abs/2307.09989
  • repo_url: None
  • paper_authors: Qifang Zhao, Tianyu Li, Meng Du, Yu Jiang, Qinghui Sun, Zhongyao Wang, Hong Liu, Huan Xu
  • for: Reducing the machine learning cost merchants incur when conducting private-domain marketing with cloud services.
  • methods: Proposes a unified user-item matching framework that performs user targeting and item recommendation simultaneously with just one model, trained with a bidirectional bias-corrected NCE loss over the user-item interaction matrix.
  • results: Experiments show the framework handles user targeting and item recommendation together, delivers significant performance gains over state-of-the-art methods, and greatly reduces computing resources and daily maintenance cost.
    Abstract When doing private domain marketing with cloud services, the merchants usually have to purchase different machine learning models for the multiple marketing purposes, leading to a very high cost. We present a unified user-item matching framework to simultaneously conduct item recommendation and user targeting with just one model. We empirically demonstrate that the above concurrent modeling is viable via modeling the user-item interaction matrix with the multinomial distribution, and propose a bidirectional bias-corrected NCE loss for the implementation. The proposed loss function guides the model to learn the user-item joint probability $p(u,i)$ instead of the conditional probability $p(i|u)$ or $p(u|i)$ through correcting both the users and items' biases caused by the in-batch negative sampling. In addition, our framework is model-agnostic enabling a flexible adaptation of different model architectures. Extensive experiments demonstrate that our framework results in significant performance gains in comparison with the state-of-the-art methods, with greatly reduced cost on computing resources and daily maintenance.

TinyTrain: Deep Neural Network Training at the Extreme Edge

  • paper_url: http://arxiv.org/abs/2307.09988
  • repo_url: None
  • paper_authors: Young D. Kwon, Rui Li, Stylianos I. Venieris, Jagmohan Chauhan, Nicholas D. Lane, Cecilia Mascolo
  • for: The paper proposes an on-device training method to improve personalization and privacy.
  • methods: The approach selectively updates parts of the model and explicitly copes with data scarcity. It introduces a task-adaptive sparse-update method that dynamically selects layers/channels via a multi-objective criterion jointly capturing user data and the memory and compute capabilities of the target device.
  • results: Compared with vanilla fine-tuning of the entire network, the method improves accuracy by 3.6-5.0% while reducing backward-pass memory and computation cost by up to 2,286x and 7.68x, respectively. On widely used real-world edge devices it trains 9.5x faster and 3.5x more energy-efficiently than status-quo approaches, with a memory footprint 2.8x smaller than SOTA approaches and within the 1 MB envelope of MCU-grade platforms.
    Abstract On-device training is essential for user personalisation and privacy. With the pervasiveness of IoT devices and microcontroller units (MCU), this task becomes more challenging due to the constrained memory and compute resources, and the limited availability of labelled user data. Nonetheless, prior works neglect the data scarcity issue, require excessively long training time (e.g. a few hours), or induce substantial accuracy loss ($\geq$10\%). We propose TinyTrain, an on-device training approach that drastically reduces training time by selectively updating parts of the model and explicitly coping with data scarcity. TinyTrain introduces a task-adaptive sparse-update method that dynamically selects the layer/channel based on a multi-objective criterion that jointly captures user data, the memory, and the compute capabilities of the target device, leading to high accuracy on unseen tasks with reduced computation and memory footprint. TinyTrain outperforms vanilla fine-tuning of the entire network by 3.6-5.0\% in accuracy, while reducing the backward-pass memory and computation cost by up to 2,286$\times$ and 7.68$\times$, respectively. Targeting broadly used real-world edge devices, TinyTrain achieves 9.5$\times$ faster and 3.5$\times$ more energy-efficient training over status-quo approaches, and 2.8$\times$ smaller memory footprint than SOTA approaches, while remaining within the 1 MB memory envelope of MCU-grade platforms.

Learner Referral for Cost-Effective Federated Learning Over Hierarchical IoT Networks

  • paper_url: http://arxiv.org/abs/2307.09977
  • repo_url: None
  • paper_authors: Yulan Gao, Ziqiang Ye, Yue Xiao, Wei Xiang
  • for: This paper proposes a federated learning (FL) approach based on joint learner referral and local model accuracy optimization, addressing data privacy concerns and improving the applicability and fairness of FL in distributed networks.
  • methods: It combines learner-referral-aided federated client selection (LRef-FedCS), communication and computing resource scheduling, and local model accuracy optimization (LMAO), designed to minimize the cost incurred by the worst-case participant and ensure the long-term fairness of FL in hierarchical Internet of Things (HieIoT) networks.
  • results: Numerical simulations and experiments on the MNIST/CIFAR-10 datasets demonstrate that the proposed LRef-FedCS approach achieves a good balance between pursuing high global accuracy and reducing cost.
    Abstract The paradigm of federated learning (FL) to address data privacy concerns by locally training parameters on resource-constrained clients in a distributed manner has garnered significant attention. Nonetheless, FL is not applicable when not all clients within the coverage of the FL server are registered with the FL network. To bridge this gap, this paper proposes joint learner referral aided federated client selection (LRef-FedCS), along with communications and computing resource scheduling, and local model accuracy optimization (LMAO) methods. These methods are designed to minimize the cost incurred by the worst-case participant and ensure the long-term fairness of FL in hierarchical Internet of Things (HieIoT) networks. Utilizing the Lyapunov optimization technique, we reformulate the original problem into a stepwise joint optimization problem (JOP). Subsequently, to tackle the mixed-integer non-convex JOP, we separatively and iteratively address LRef-FedCS and LMAO through the centralized method and self-adaptive global best harmony search (SGHS) algorithm, respectively. To enhance scalability, we further propose a distributed LRef-FedCS approach based on a matching game to replace the centralized method described above. Numerical simulations and experimental results on the MNIST/CIFAR-10 datasets demonstrate that our proposed LRef-FedCS approach could achieve a good balance between pursuing high global accuracy and reducing cost.

Towards green AI-based software systems: an architecture-centric approach (GAISSA)

  • paper_url: http://arxiv.org/abs/2307.09964
  • repo_url: None
  • paper_authors: Silverio Martínez-Fernández, Xavier Franch, Francisco Durán
  • for: The project aims to provide data scientists and software engineers with tool-supported methods for modelling and developing green AI-based systems.
  • methods: The project follows an architecture-centric approach to help data scientists and software engineers design and implement green AI-based systems more effectively.
  • results: Current research results indicate that the GAISSA project has the potential to achieve its objectives and provide effective methods for developing green AI-based systems.
    Abstract Nowadays, AI-based systems have achieved outstanding results and have outperformed humans in different domains. However, the processes of training AI models and inferring from them require high computational resources, which pose a significant challenge in the current energy efficiency societal demand. To cope with this challenge, this research project paper describes the main vision, goals, and expected outcomes of the GAISSA project. The GAISSA project aims at providing data scientists and software engineers tool-supported, architecture-centric methods for the modelling and development of green AI-based systems. Although the project is in an initial stage, we describe the current research results, which illustrate the potential to achieve GAISSA objectives.

XSkill: Cross Embodiment Skill Discovery

  • paper_url: http://arxiv.org/abs/2307.09955
  • repo_url: None
  • paper_authors: Mengda Xu, Zhenjia Xu, Cheng Chi, Manuela Veloso, Shuran Song
  • for: This work aims to extract reusable robot manipulation skills from human demonstration videos and apply them in the real world.
  • methods: It introduces XSkill, an imitation learning framework that discovers a cross-embodiment skill representation (skill prototypes) purely from unlabeled human and robot manipulation videos, and transfers it to robot actions using a conditional diffusion policy.
  • results: Experiments in simulation and real-world environments show that XSkill transfers the discovered skill prototypes to robot actions and composes the learned skills to accomplish unseen tasks specified by a human prompt video, yielding a more general and scalable imitation learning framework.
    Abstract Human demonstration videos are a widely available data source for robot learning and an intuitive user interface for expressing desired behavior. However, directly extracting reusable robot manipulation skills from unstructured human videos is challenging due to the big embodiment difference and unobserved action parameters. To bridge this embodiment gap, this paper introduces XSkill, an imitation learning framework that 1) discovers a cross-embodiment representation called skill prototypes purely from unlabeled human and robot manipulation videos, 2) transfers the skill representation to robot actions using conditional diffusion policy, and finally, 3) composes the learned skill to accomplish unseen tasks specified by a human prompt video. Our experiments in simulation and real-world environments show that the discovered skill prototypes facilitate both skill transfer and composition for unseen tasks, resulting in a more general and scalable imitation learning framework. The performance of XSkill is best understood from the anonymous website: https://xskillcorl.github.io.

Impatient Bandits: Optimizing Recommendations for the Long-Term Without Delay

  • paper_url: http://arxiv.org/abs/2307.09943
  • repo_url: https://github.com/spotify-research/impatient-bandits
  • paper_authors: Thomas M. McDonald, Lucas Maystre, Mounia Lalmas, Daniel Russo, Kamil Ciosek
  • for: The goal is to increase users' long-term satisfaction on online platforms by studying a content exploration task, formalized as a multi-armed bandit problem with delayed rewards.
  • methods: The authors use a Bayesian filter to integrate full observations with partial (short- or medium-term) outcomes, and develop a bandit algorithm on top of this predictive model that quickly identifies content aligned with users' long-term satisfaction.
  • results: Experiments show the approach substantially improves performance compared with approaches that either optimize for short-term proxies or wait for the long-term outcome to be fully realized.
    Abstract Recommender systems are a ubiquitous feature of online platforms. Increasingly, they are explicitly tasked with increasing users' long-term satisfaction. In this context, we study a content exploration task, which we formalize as a multi-armed bandit problem with delayed rewards. We observe that there is an apparent trade-off in choosing the learning signal: Waiting for the full reward to become available might take several weeks, hurting the rate at which learning happens, whereas measuring short-term proxy rewards reflects the actual long-term goal only imperfectly. We address this challenge in two steps. First, we develop a predictive model of delayed rewards that incorporates all information obtained to date. Full observations as well as partial (short or medium-term) outcomes are combined through a Bayesian filter to obtain a probabilistic belief. Second, we devise a bandit algorithm that takes advantage of this new predictive model. The algorithm quickly learns to identify content aligned with long-term success by carefully balancing exploration and exploitation. We apply our approach to a podcast recommendation problem, where we seek to identify shows that users engage with repeatedly over two months. We empirically validate that our approach results in substantially better performance compared to approaches that either optimize for short-term proxies, or wait for the long-term outcome to be fully realized.
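
As a concrete illustration of the idea in the abstract, the sketch below maintains a Gaussian belief over each item's long-term reward, refreshes it from noisy early outcomes with a conjugate (Kalman-style) filter, and selects items by Thompson sampling. The Gaussian model, the proxy-noise constant, and the three-armed setup are my own simplifying assumptions, not the paper's algorithm.

```python
# Minimal "impatient bandit": learn from partial outcomes instead of waiting
# for the delayed long-term reward.
import numpy as np

rng = np.random.default_rng(1)
true_long_term = np.array([0.2, 0.5, 0.8])       # unknown to the learner
mu = np.zeros(3)                                  # posterior means
var = np.ones(3)                                  # posterior variances
PROXY_NOISE = 0.3                                 # early outcomes are noisier

for t in range(2000):
    arm = int(np.argmax(rng.normal(mu, np.sqrt(var))))   # Thompson sampling
    # Early, partial outcome: a noisy observation of the long-term reward.
    y = true_long_term[arm] + rng.normal(scale=PROXY_NOISE)
    # Conjugate Gaussian filter update with observation noise PROXY_NOISE^2:
    k = var[arm] / (var[arm] + PROXY_NOISE**2)
    mu[arm] += k * (y - mu[arm])
    var[arm] *= 1 - k

print(mu.round(2))   # beliefs concentrate near true_long_term; best arm found early
```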

TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic Tree-Based Memory Network

  • paper_url: http://arxiv.org/abs/2307.09942
  • repo_url: None
  • paper_authors: Brandon Theodorou, Cao Xiao, Jimeng Sun
  • for: Recruiting clinical-trial participants quickly and efficiently, improving trial effectiveness.
  • methods: Uses a machine learning model to automatically match patients with clinical trials based on longitudinal electronic health record (EHR) data and trial eligibility criteria.
  • results: Proposes TREEMENT, a personalized dynamic tree-based memory network that offers accurate and interpretable patient-trial matching. Compared with existing models on real-world datasets, TREEMENT reduces error in criteria-level matching by 7% over the best baseline and achieves state-of-the-art trial-level matching, while providing good interpretability that makes its results easier to adopt.
    Abstract Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment. In recent years, machine learning models have been proposed for speeding up patient recruitment via automatically matching patients with clinical trials based on longitudinal patient electronic health records (EHR) data and eligibility criteria of clinical trials. However, they either depend on trial-specific expert rules that cannot expand to other trials or perform matching at a very general level with a black-box model where the lack of interpretability makes the model results difficult to be adopted. To provide accurate and interpretable patient trial matching, we introduce a personalized dynamic tree-based memory network model named TREEMENT. It utilizes hierarchical clinical ontologies to expand the personalized patient representation learned from sequential EHR data, and then uses an attentional beam-search query learned from eligibility criteria embedding to offer a granular level of alignment for improved performance and interpretability. We evaluated TREEMENT against existing models on real-world datasets and demonstrated that TREEMENT outperforms the best baseline by 7% in terms of error reduction in criteria-level matching and achieves state-of-the-art results in its trial-level matching ability. Furthermore, we also show TREEMENT can offer good interpretability to make the model results easier for adoption.

Spuriosity Didn’t Kill the Classifier: Using Invariant Predictions to Harness Spurious Features

  • paper_url: http://arxiv.org/abs/2307.09933
  • repo_url: None
  • paper_authors: Cian Eastwood, Shashank Singh, Andrei Liviu Nicolicioiu, Marin Vlastelica, Julius von Kügelgen, Bernhard Schölkopf
  • for: improving the performance of machine learning models on out-of-distribution data.
  • methods: proposes Stable Feature Boosting (SFB), which learns to use unstable features in the test domain without labels, in two steps: (i) learning a predictor that separates stable and conditionally-independent unstable features, and (ii) using the stable-feature predictions to adapt the unstable-feature predictions in the test domain.
  • results: proves that SFB can learn an asymptotically-optimal predictor without test-domain labels, and demonstrates its effectiveness on real and synthetic data.
    Abstract To avoid failures on out-of-distribution data, recent works have sought to extract features that have a stable or invariant relationship with the label across domains, discarding the "spurious" or unstable features whose relationship with the label changes across domains. However, unstable features often carry complementary information about the label that could boost performance if used correctly in the test domain. Our main contribution is to show that it is possible to learn how to use these unstable features in the test domain without labels. In particular, we prove that pseudo-labels based on stable features provide sufficient guidance for doing so, provided that stable and unstable features are conditionally independent given the label. Based on this theoretical insight, we propose Stable Feature Boosting (SFB), an algorithm for: (i) learning a predictor that separates stable and conditionally-independent unstable features; and (ii) using the stable-feature predictions to adapt the unstable-feature predictions in the test domain. Theoretically, we prove that SFB can learn an asymptotically-optimal predictor without test-domain labels. Empirically, we demonstrate the effectiveness of SFB on real and synthetic data.
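
A minimal sketch of the adaptation step described above, assuming the stable/unstable feature split is already given: stable-feature predictions serve as pseudo-labels for fitting an unstable-feature predictor on the unlabelled test domain. The classifiers and the probability-averaging combination rule are placeholder choices, not the paper's exact procedure.

```python
from sklearn.linear_model import LogisticRegression

def adapt_unstable(stable_clf, X_stable_test, X_unstable_test):
    """Fit an unstable-feature classifier on the test domain via pseudo-labels."""
    pseudo = stable_clf.predict(X_stable_test)   # pseudo-labels from stable features
    unstable_clf = LogisticRegression(max_iter=1000)
    unstable_clf.fit(X_unstable_test, pseudo)    # adaptation without true labels

    def predict_proba(X_stable, X_unstable):
        # One simple way to combine the two views: average predicted probabilities.
        return 0.5 * (stable_clf.predict_proba(X_stable)
                      + unstable_clf.predict_proba(X_unstable))
    return predict_proba
```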

DISA: DIfferentiable Similarity Approximation for Universal Multimodal Registration

  • paper_url: http://arxiv.org/abs/2307.09931
  • repo_url: https://github.com/imfusiongmbh/disa-universal-multimodal-registration
  • paper_authors: Matteo Ronchetti, Wolfgang Wein, Nassir Navab, Oliver Zettinig, Raphael Prevost
  • for: the challenging problem of multimodal image registration, a prerequisite for many image-guided procedures.
  • methods: a small convolutional neural network (CNN) that produces expressive cross-modal descriptors, enabling fast deformable global registration.
  • results: experiments on three different datasets show that the method runs fast, generalizes well beyond the training data, and can be applied directly in clinical settings without specialized retraining.
    Abstract Multimodal image registration is a challenging but essential step for numerous image-guided procedures. Most registration algorithms rely on the computation of complex, frequently non-differentiable similarity metrics to deal with the appearance discrepancy of anatomical structures between imaging modalities. Recent Machine Learning based approaches are limited to specific anatomy-modality combinations and do not generalize to new settings. We propose a generic framework for creating expressive cross-modal descriptors that enable fast deformable global registration. We achieve this by approximating existing metrics with a dot-product in the feature space of a small convolutional neural network (CNN), which is inherently differentiable and can be trained without registered data. Our method is several orders of magnitude faster than local patch-based metrics and can be directly applied in clinical settings by replacing the similarity measure with the proposed one. Experiments on three different datasets demonstrate that our approach generalizes well beyond the training data, yielding a broad capture range even on unseen anatomies and modality pairs, without the need for specialized retraining. We make our training code and data publicly available.
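
A hedged sketch of the core idea: small CNNs map each modality to per-pixel descriptors, and similarity is approximated by a dot product in that feature space, which stays differentiable end to end. The tiny architecture and the L2-normalisation of descriptors are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class DescriptorNet(nn.Module):
    """Tiny CNN that maps an image to unit-norm per-pixel descriptors."""
    def __init__(self, in_ch=1, feat=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat, 3, padding=1),
        )

    def forward(self, x):
        f = self.net(x)
        return f / (f.norm(dim=1, keepdim=True) + 1e-8)

def similarity(net_fixed, net_moving, fixed, moving):
    # Mean per-pixel dot product approximates the (non-differentiable) metric.
    return (net_fixed(fixed) * net_moving(moving)).sum(dim=1).mean()
```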

TimeTuner: Diagnosing Time Representations for Time-Series Forecasting with Counterfactual Explanations

  • paper_url: http://arxiv.org/abs/2307.09916
  • repo_url: https://github.com/catherinehao/timetuner
  • paper_authors: Jianing Hao, Qing Shi, Yilin Ye, Wei Zeng
  • for: helping analysts understand how time-series representations relate to model behavior and how feature engineering can improve forecasting performance.
  • methods: a novel visual analytics framework, TimeTuner, that connects model behavior to localized correlations, stationarity, and granularity of time-series representations, providing multiple coordinated views of model performance.
  • results: instantiated with two transformation methods (smoothing and sampling) and applied to real-world time-series forecasting tasks, TimeTuner helps analysts characterize time-series representations and guide the feature engineering process.
    Abstract Deep learning (DL) approaches are being increasingly used for time-series forecasting, with many efforts devoted to designing complex DL models. Recent studies have shown that the DL success is often attributed to effective data representations, fostering the fields of feature engineering and representation learning. However, automated approaches for feature learning are typically limited with respect to incorporating prior knowledge, identifying interactions among variables, and choosing evaluation metrics to ensure that the models are reliable. To improve on these limitations, this paper contributes a novel visual analytics framework, namely TimeTuner, designed to help analysts understand how model behaviors are associated with localized correlations, stationarity, and granularity of time-series representations. The system mainly consists of the following two-stage technique: We first leverage counterfactual explanations to connect the relationships among time-series representations, multivariate features and model predictions. Next, we design multiple coordinated views including a partition-based correlation matrix and juxtaposed bivariate stripes, and provide a set of interactions that allow users to step into the transformation selection process, navigate through the feature space, and reason the model performance. We instantiate TimeTuner with two transformation methods of smoothing and sampling, and demonstrate its applicability on real-world time-series forecasting of univariate sunspots and multivariate air pollutants. Feedback from domain experts indicates that our system can help characterize time-series representations and guide the feature engineering processes.

Deep projection networks for learning time-homogeneous dynamical systems

  • paper_url: http://arxiv.org/abs/2307.09912
  • repo_url: None
  • paper_authors: Vladimir R. Kostic, Pietro Novelli, Riccardo Grazzi, Karim Lounici, Massimiliano Pontil
  • for: learning meaningful representations of the state of time-homogeneous dynamical systems, to be used for forecasting future states or observables.
  • methods: a neural network parametrizes the representation, associated with a projection operator, and is learned by optimizing an objective akin to canonical correlation analysis (CCA) but without matrix inversions.
  • results: a stable and reliable method, applicable to challenging scenarios, that improves on previous approaches for both discrete and continuous dynamical systems.
    Abstract We consider the general class of time-homogeneous dynamical systems, both discrete and continuous, and study the problem of learning a meaningful representation of the state from observed data. This is instrumental for the task of learning a forward transfer operator of the system, that in turn can be used for forecasting future states or observables. The representation, typically parametrized via a neural network, is associated with a projection operator and is learned by optimizing an objective function akin to that of canonical correlation analysis (CCA). However, unlike CCA, our objective avoids matrix inversions and therefore is generally more stable and applicable to challenging scenarios. Our objective is a tight relaxation of CCA and we further enhance it by proposing two regularization schemes, one encouraging the orthogonality of the components of the representation while the other exploiting Chapman-Kolmogorov's equation. We apply our method to challenging discrete dynamical systems, discussing improvements over previous methods, as well as to continuous dynamical systems.
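
A hedged sketch of what an inversion-free, CCA-like objective can look like: maximise the cross-covariance between representations of consecutive states while a soft penalty pushes each representation's covariance toward the identity, instead of whitening with matrix inverses. The paper's exact objective and regularisation schemes are not reproduced here.

```python
import torch

def projection_loss(f_x, f_y, reg=1.0):
    """f_x, f_y: (n, d) representations of states x_t and x_{t+1}."""
    f_x = f_x - f_x.mean(0)
    f_y = f_y - f_y.mean(0)
    n, d = f_x.shape
    cross = (f_x.T @ f_y) / n                 # cross-covariance
    cov_x = (f_x.T @ f_x) / n
    cov_y = (f_y.T @ f_y) / n
    eye = torch.eye(d, device=f_x.device)
    score = cross.diagonal().sum()            # correlation-like term to maximise
    # Soft orthogonality penalty replaces the matrix inversion of CCA whitening.
    penalty = ((cov_x - eye) ** 2).sum() + ((cov_y - eye) ** 2).sum()
    return -score + reg * penalty
```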

Repeated Observations for Classification

  • paper_url: http://arxiv.org/abs/2307.09896
  • repo_url: None
  • paper_authors: Hüseyin Afşer, László Györfi, Harro Walk
  • for: nonparametric classification in the setting where each instance comes with repeated observations.
  • methods: simple classification rules whose conditional error probabilities converge at an exponential rate as $t\to\infty$.
  • results: analyses of particular models, including robust detection by nominal densities, prototype classification, linear transformation, linear classification, and scaling.
    Abstract We study the problem of nonparametric classification with repeated observations. Let $\bX$ be the $d$ dimensional feature vector and let $Y$ denote the label taking values in $\{1,\dots ,M\}$. In contrast to the usual setup with large sample size $n$ and relatively low dimension $d$, this paper deals with the situation when, instead of observing a single feature vector $\bX$, we are given $t$ repeated feature vectors $\bV_1,\dots ,\bV_t $. Some simple classification rules are presented such that the conditional error probabilities have an exponential rate of convergence as $t\to\infty$. In the analysis, we investigate particular models like robust detection by nominal densities, prototype classification, linear transformation, linear classification, and scaling.
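
One simple rule in this setting, sketched below under an assumed Gaussian model for the class densities: with repeated observations $V_1,\dots,V_t$ and nominal densities $f_1,\dots,f_M$, classify by the product of likelihoods, i.e. $\arg\max_m \sum_{s\le t}\log f_m(V_s)$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify_repeated(V, class_means, class_covs):
    """argmax_m sum_s log f_m(V_s) over the t repeated observations in V."""
    scores = [sum(multivariate_normal.logpdf(v, mean=mu, cov=cov) for v in V)
              for mu, cov in zip(class_means, class_covs)]
    return int(np.argmax(scores))
```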

Symmetric Equilibrium Learning of VAEs

  • paper_url: http://arxiv.org/abs/2307.09883
  • repo_url: None
  • paper_authors: Boris Flach, Dmitrij Schlesinger, Alexander Shekhovtsov
  • for: broadening the applicability of variational autoencoders (VAEs) to more complex learning scenarios, such as general semi-supervised learning and priors given by complex generative models.
  • methods: a Nash equilibrium learning approach that relaxes the restrictions of standard ELBO training and learns VAEs when both the data and the latent distributions are accessible only by sampling; it applies to a wide range of downstream tasks.
  • results: experiments show that models learned this way are comparable to those obtained by standard ELBO learning, and the method applies to tasks that standard VAE training cannot handle.
    Abstract We view variational autoencoders (VAE) as decoder-encoder pairs, which map distributions in the data space to distributions in the latent space and vice versa. The standard learning approach for VAEs, i.e. maximisation of the evidence lower bound (ELBO), has an obvious asymmetry in that respect. Moreover, it requires a closed form a-priori latent distribution. This limits the applicability of VAEs in more complex scenarios, such as general semi-supervised learning and employing complex generative models as priors. We propose a Nash equilibrium learning approach that relaxes these restrictions and allows learning VAEs in situations where both the data and the latent distributions are accessible only by sampling. The flexibility and simplicity of this approach allows its application to a wide range of learning scenarios and downstream tasks. We show experimentally that the models learned by this method are comparable to those obtained by ELBO learning and demonstrate its applicability for tasks that are not accessible by standard VAE learning.

Adversarial Likelihood Estimation with One-way Flows

  • paper_url: http://arxiv.org/abs/2307.09882
  • repo_url: None
  • paper_authors: Omri Ben-Dov, Pravir Singh Gupta, Victoria Abrevaya, Michael J. Black, Partha Ghosh
  • for: Generative Adversarial Networks (GANs) generate high-quality samples but provide no probability density around them; in an energy-based setting, maximizing the log-likelihood leads to an adversarial framework in which the discriminator provides an unnormalized density (often called energy). This paper develops that perspective further and incorporates importance sampling to obtain an unbiased estimator of the partition function.
  • methods: a new flow architecture, the one-way flow network, which does not require a tractable inverse and is therefore less constrained; importance sampling improves the generator's density estimation, and the generator entropy is maximized for better mode coverage.
  • results: faster convergence, sample quality comparable to GANs of similar architecture, avoidance of over-fitting on commonly used datasets, and smooth low-dimensional latent representations of the training data.
    Abstract Generative Adversarial Networks (GANs) can produce high-quality samples, but do not provide an estimate of the probability density around the samples. However, it has been noted that maximizing the log-likelihood within an energy-based setting can lead to an adversarial framework where the discriminator provides unnormalized density (often called energy). We further develop this perspective, incorporate importance sampling, and show that 1) Wasserstein GAN performs a biased estimate of the partition function, and we propose instead to use an unbiased estimator; 2) when optimizing for likelihood, one must maximize generator entropy. This is hypothesized to provide a better mode coverage. Different from previous works, we explicitly compute the density of the generated samples. This is the key enabler to designing an unbiased estimator of the partition function and computation of the generator entropy term. The generator density is obtained via a new type of flow network, called one-way flow network, that is less constrained in terms of architecture, as it does not require to have a tractable inverse function. Our experimental results show that we converge faster, produce comparable sample quality to GANs with similar architecture, successfully avoid over-fitting to commonly used datasets and produce smooth low-dimensional latent representations of the training data.

Detecting Vulnerable Nodes in Urban Infrastructure Interdependent Network

  • paper_url: http://arxiv.org/abs/2307.09866
  • repo_url: https://github.com/tsinghua-fib-lab/kdd2023-id546-urbaninfra
  • paper_authors: Jinzhu Mao, Liu Cao, Chen Gao, Huandong Wang, Hangyu Fan, Depeng Jin, Yong Li
  • For: This paper aims to understand and characterize the vulnerability of urban infrastructures, which are essential for the regular running of cities and exist naturally in the form of networks.* Methods: The paper proposes a system based on graph neural network with reinforcement learning to accurately model the interdependent network as a heterogeneous graph and capture the risk of cascade failure and discover vulnerable infrastructures of cities.* Results: The proposed system is demonstrated to be effective through extensive experiments with various requests, showing not only its expressive power but also its transferring ability and necessity of specific components.
    Abstract Understanding and characterizing the vulnerability of urban infrastructures, which refers to the engineering facilities essential for the regular running of cities and that exist naturally in the form of networks, is of great value to us. Potential applications include protecting fragile facilities and designing robust topologies, etc. Due to the strong correlation between different topological characteristics and infrastructure vulnerability, and their complicated evolution mechanisms, some heuristic and machine-assisted analyses fall short in addressing such a scenario. In this paper, we model the interdependent network as a heterogeneous graph and propose a system based on graph neural network with reinforcement learning, which can be trained on real-world data, to characterize the vulnerability of the city system accurately. The presented system leverages deep learning techniques to understand and analyze the heterogeneous graph, which enables us to capture the risk of cascade failure and discover vulnerable infrastructures of cities. Extensive experiments with various requests demonstrate not only the expressive power of our system but also its transfer ability and the necessity of its specific components.

Towards a population-informed approach to the definition of data-driven models for structural dynamics

  • paper_url: http://arxiv.org/abs/2307.09862
  • repo_url: None
  • paper_authors: G. Tsialiamanis, N. Dervilis, D. J. Wagg, K. Worden
  • for: addressing data scarcity in structural dynamics by combining physics-based approaches with machine-learning algorithms to improve model performance.
  • methods: two meta-learning algorithms, the model-agnostic meta-learning (MAML) algorithm and the conditional neural processes (CNP) model, used to build data-driven models in a population-based way.
  • results: experiments show that both algorithms approximate the quantities of interest better than a traditional machine-learning algorithm, with performance improving as the number of structures in the training population grows, similar to conventional approaches.
    Abstract Machine learning has affected the way in which many phenomena for various domains are modelled, one of these domains being that of structural dynamics. However, because machine-learning algorithms are problem-specific, they often fail to perform efficiently in cases of data scarcity. To deal with such issues, combination of physics-based approaches and machine learning algorithms have been developed. Although such methods are effective, they also require the analyser's understanding of the underlying physics of the problem. The current work is aimed at motivating the use of models which learn such relationships from a population of phenomena, whose underlying physics are similar. The development of such models is motivated by the way that physics-based models, and more specifically finite element models, work. Such models are considered transferrable, explainable and trustworthy, attributes which are not trivially imposed or achieved for machine-learning models. For this reason, machine-learning approaches are less trusted by industry and often considered more difficult to form validated models. To achieve such data-driven models, a population-based scheme is followed here and two different machine-learning algorithms from the meta-learning domain are used. The two algorithms are the model-agnostic meta-learning (MAML) algorithm and the conditional neural processes (CNP) model. The algorithms seem to perform as intended and outperform a traditional machine-learning algorithm at approximating the quantities of interest. Moreover, they exhibit behaviour similar to traditional machine learning algorithms (e.g. neural networks or Gaussian processes), concerning their performance as a function of the available structures in the training population.
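
For reference, a minimal MAML sketch on a deliberately tiny linear model (the paper's networks and structural-dynamics tasks are not reproduced): a shared initialisation is meta-learned across a population of tasks so that a single inner gradient step adapts it well to each one.

```python
import torch

def maml(tasks, dim, inner_lr=0.01, outer_lr=0.001, meta_iters=1000):
    """tasks: list of (X_train, y_train, X_val, y_val), one per structure."""
    w = torch.zeros(dim, requires_grad=True)        # shared initialisation
    opt = torch.optim.Adam([w], lr=outer_lr)
    for _ in range(meta_iters):
        opt.zero_grad()
        for X_tr, y_tr, X_val, y_val in tasks:
            inner_loss = ((X_tr @ w - y_tr) ** 2).mean()
            (g,) = torch.autograd.grad(inner_loss, w, create_graph=True)
            w_fast = w - inner_lr * g               # differentiable inner step
            outer_loss = ((X_val @ w_fast - y_val) ** 2).mean()
            outer_loss.backward()                   # accumulates the meta-gradient
        opt.step()
    return w.detach()
```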

Reinforcement Learning for Credit Index Option Hedging

  • paper_url: http://arxiv.org/abs/2307.09844
  • repo_url: None
  • paper_authors: Francesco Mandelli, Marco Pinciroli, Michele Trapletti, Edoardo Vittori
  • for: finding the optimal hedging strategy for a credit index option, to reduce its risk.
  • methods: a practical setup with discrete time and transaction costs, with the policy tested on real market data, using the state-of-the-art Trust Region Volatility Optimization (TRVO) algorithm.
  • results: the hedging strategy derived with TRVO outperforms the practitioner's Black & Scholes delta hedge.
    Abstract In this paper, we focus on finding the optimal hedging strategy of a credit index option using reinforcement learning. We take a practical approach, where the focus is on realism i.e. discrete time, transaction costs; even testing our policy on real market data. We apply a state of the art algorithm, the Trust Region Volatility Optimization (TRVO) algorithm and show that the derived hedging strategy outperforms the practitioner's Black & Scholes delta hedge.

Near-Linear Time Projection onto the $\ell_{1,\infty}$ Ball; Application to Sparse Autoencoders

  • paper_url: http://arxiv.org/abs/2307.09836
  • repo_url: https://github.com/memo-p/projection
  • paper_authors: Guillaume Perez, Laurent Condat, Michel Barlaud
  • for: speeding up the training of large-scale neural networks, in particular by using projections to sparsify them and reduce their overall cost.
  • methods: a new projection algorithm for the $\ell_{1,\infty}$ norm ball with worst-case time complexity $\mathcal{O}\big(nm+J\log(nm)\big)$ for a matrix in $\mathbb{R}^{n\times m}$, where $J$ tends to 0 when sparsity is high and to $nm$ when it is low; the $\ell_{1,\infty}$ ball projection is also incorporated into autoencoder training to enforce feature selection and weight sparsity.
  • results: in a biological application where only a small fraction of the data (<2%) is relevant, and in the general sparse setting, the proposed method is the fastest.
    Abstract Looking for sparsity is nowadays crucial to speed up the training of large-scale neural networks. Projections onto the $\ell_{1,2}$ and $\ell_{1,\infty}$ are among the most efficient techniques to sparsify and reduce the overall cost of neural networks. In this paper, we introduce a new projection algorithm for the $\ell_{1,\infty}$ norm ball. The worst-case time complexity of this algorithm is $\mathcal{O}\big(nm+J\log(nm)\big)$ for a matrix in $\mathbb{R}^{n\times m}$. $J$ is a term that tends to 0 when the sparsity is high, and to $nm$ when the sparsity is low. Its implementation is easy and it is guaranteed to converge to the exact solution in a finite time. Moreover, we propose to incorporate the $\ell_{1,\infty}$ ball projection while training an autoencoder to enforce feature selection and sparsity of the weights. Sparsification appears in the encoder to primarily do feature selection due to our application in biology, where only a very small part ($<2\%$) of the data is relevant. We show that both in the biological case and in the general case of sparsity that our method is the fastest.
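
For intuition, here is a simple (and deliberately not near-linear) projection onto the $\ell_{1,\infty}$ ball $\{Y : \sum_i \max_j |Y_{ij}| \le r\}$, via nested bisection on the Lagrange multiplier; the paper's faster algorithm is not reproduced. Each row is clipped to a per-row threshold $t_i$; for a multiplier $\lambda$, $t_i$ solves $\sum_j \max(|x_{ij}|-t_i,0)=\lambda$, and $\lambda$ is tuned so that $\sum_i t_i = r$.

```python
import numpy as np

def row_threshold(abs_row, lam, tol=1e-10):
    """Smallest clip level t with sum_j max(|x_j| - t, 0) = lam (t >= 0)."""
    if abs_row.sum() <= lam:                 # t = 0 already satisfies the KKT condition
        return 0.0
    lo, hi = 0.0, float(abs_row.max())
    while hi - lo > tol:
        t = 0.5 * (lo + hi)
        if np.maximum(abs_row - t, 0).sum() > lam:
            lo = t                           # excess too large: raise the threshold
        else:
            hi = t
    return 0.5 * (lo + hi)

def project_l1_inf(X, r, tol=1e-8):
    A = np.abs(X)
    if A.max(axis=1).sum() <= r:             # already inside the ball
        return X.copy()
    lo, hi = 0.0, float(A.sum())             # bracket for the multiplier
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        t = np.array([row_threshold(a, lam) for a in A])
        if t.sum() > r:
            lo = lam                         # thresholds too large: raise lam
        else:
            hi = lam
    t = np.array([row_threshold(a, 0.5 * (lo + hi)) for a in A])
    return np.sign(X) * np.minimum(A, t[:, None])   # clip each row to [-t_i, t_i]
```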

Deep Operator Network Approximation Rates for Lipschitz Operators

  • paper_url: http://arxiv.org/abs/2307.09835
  • repo_url: None
  • paper_authors: Christoph Schwab, Andreas Stein, Jakob Zech
  • for: establishing universality and expression rate bounds for deep operator networks (DONs) emulating Lipschitz (or Hölder) continuous maps $\mathcal G:\mathcal X\to\mathcal Y$ between (subsets of) separable Hilbert spaces $\mathcal X$, $\mathcal Y$.
  • methods: linear encoders $\mathcal E$ and decoders $\mathcal D$ via (biorthogonal) Riesz bases of $\mathcal X$, $\mathcal Y$, together with an approximator network for an infinite-dimensional, parametric coordinate map that is Lipschitz continuous on $\ell^2(\mathbb N)$.
  • results: expression rate bounds that require only Lipschitz (or Hölder) continuity of $\mathcal G$ rather than holomorphy; the key ingredients are either super-expressive activations inspired by the Kolmogorov superposition theorem, or nonstandard NN architectures with standard (ReLU) activations. The abstract results are illustrated by approximation rates for solution operators of parametric elliptic variational inequalities and for Lipschitz maps of Hilbert-Schmidt operators.
    Abstract We establish universality and expression rate bounds for a class of neural Deep Operator Networks (DON) emulating Lipschitz (or H\"older) continuous maps $\mathcal G:\mathcal X\to\mathcal Y$ between (subsets of) separable Hilbert spaces $\mathcal X$, $\mathcal Y$. The DON architecture considered uses linear encoders $\mathcal E$ and decoders $\mathcal D$ via (biorthogonal) Riesz bases of $\mathcal X$, $\mathcal Y$, and an approximator network of an infinite-dimensional, parametric coordinate map that is Lipschitz continuous on the sequence space $\ell^2(\mathbb N)$. Unlike previous works ([Herrmann, Schwab and Zech: Neural and Spectral operator surrogates: construction and expression rate bounds, SAM Report, 2022], [Marcati and Schwab: Exponential Convergence of Deep Operator Networks for Elliptic Partial Differential Equations, SAM Report, 2022]), which required for example $\mathcal G$ to be holomorphic, the present expression rate results require mere Lipschitz (or H\"older) continuity of $\mathcal G$. Key in the proof of the present expression rate bounds is the use of either super-expressive activations (e.g. [Yarotski: Elementary superexpressive activations, Int. Conf. on ML, 2021], [Shen, Yang and Zhang: Neural network approximation: Three hidden layers are enough, Neural Networks, 2021], and the references there) which are inspired by the Kolmogorov superposition theorem, or of nonstandard NN architectures with standard (ReLU) activations as recently proposed in [Zhang, Shen and Yang: Neural Network Architecture Beyond Width and Depth, Adv. in Neural Inf. Proc. Sys., 2022]. We illustrate the abstract results by approximation rate bounds for emulation of a) solution operators for parametric elliptic variational inequalities, and b) Lipschitz maps of Hilbert-Schmidt operators.

What do neural networks learn in image classification? A frequency shortcut perspective

  • paper_url: http://arxiv.org/abs/2307.09829
  • repo_url: https://github.com/nis-research/nn-frequency-shortcuts
  • paper_authors: Shunxin Wang, Raymond Veldhuis, Christoph Brune, Nicola Strisciuglio
  • for: investigate the mechanisms of representation learning in neural networks (NNs) for classification tasks, and expand the understanding of frequency shortcuts.
  • methods: perform experiments on synthetic datasets and natural images, propose a metric to measure class-wise frequency characteristics, and identify frequency shortcuts.
  • results: demonstrate that NNs tend to find simple solutions for classification, and what they learn first during training depends on the most distinctive frequency characteristics; confirm that frequency shortcuts can be transferred across datasets and cannot be fully avoided by larger model capacity and data augmentation.
    Abstract Frequency analysis is useful for understanding the mechanisms of representation learning in neural networks (NNs). Most research in this area focuses on the learning dynamics of NNs for regression tasks, while little for classification. This study empirically investigates the latter and expands the understanding of frequency shortcuts. First, we perform experiments on synthetic datasets, designed to have a bias in different frequency bands. Our results demonstrate that NNs tend to find simple solutions for classification, and what they learn first during training depends on the most distinctive frequency characteristics, which can be either low- or high-frequencies. Second, we confirm this phenomenon on natural images. We propose a metric to measure class-wise frequency characteristics and a method to identify frequency shortcuts. The results show that frequency shortcuts can be texture-based or shape-based, depending on what best simplifies the objective. Third, we validate the transferability of frequency shortcuts on out-of-distribution (OOD) test sets. Our results suggest that frequency shortcuts can be transferred across datasets and cannot be fully avoided by larger model capacity and data augmentation. We recommend that future research should focus on effective training schemes mitigating frequency shortcut learning.
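
A hedged sketch of one way to measure class-wise frequency characteristics (the paper's exact metric is not specified in the abstract): average the radially binned Fourier amplitude spectrum over all images of a class, then compare low- versus high-frequency energy across classes.

```python
import numpy as np

def radial_spectrum(img, n_bins=20):
    """Radially binned Fourier amplitude spectrum of one grayscale image."""
    f = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = img.shape
    yy, xx = np.indices((h, w))
    rad = np.hypot(yy - h / 2.0, xx - w / 2.0)
    edges = np.linspace(0.0, rad.max() + 1e-9, n_bins + 1)
    return np.array([f[(rad >= lo) & (rad < hi)].mean()
                     for lo, hi in zip(edges[:-1], edges[1:])])

def classwise_spectrum(images, labels):
    """Per-class average: low bins ~ low frequencies, high bins ~ high frequencies."""
    return {c: np.mean([radial_spectrum(im)
                        for im, l in zip(images, labels) if l == c], axis=0)
            for c in set(labels)}
```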

Multi-modal Learning based Prediction for Disease

  • paper_url: http://arxiv.org/abs/2307.09823
  • repo_url: https://github.com/batuhankmkaraman/mlbasedad
  • paper_authors: Yaran Chen, Xueyu Chen, Yu Han, Haoran Li, Dongbin Zhao, Jingzhong Li, Xu Wang
  • for: non-invasive prediction of non-alcoholic fatty liver disease (NAFLD).
  • methods: combines a comprehensive clinical dataset (FLDData) with a multi-modal learning based NAFLD prediction method (DeepFLD) that uses multi-modal input, including metadata and facial images, to improve the accuracy of non-invasive diagnosis.
  • results: DeepFLD outperforms the metadata-only approach and performs well on unseen test sets; using only facial images as input, it still achieves competitive performance, demonstrating the feasibility of a simpler and more robust non-invasive diagnosis.
    Abstract Non alcoholic fatty liver disease (NAFLD) is the most common cause of chronic liver disease, which can be predicted accurately to prevent advanced fibrosis and cirrhosis. While, a liver biopsy, the gold standard for NAFLD diagnosis, is invasive, expensive, and prone to sampling errors. Therefore, non-invasive studies are extremely promising, yet they are still in their infancy due to the lack of comprehensive research data and intelligent methods for multi-modal data. This paper proposes a NAFLD diagnosis system (DeepFLDDiag) combining a comprehensive clinical dataset (FLDData) and a multi-modal learning based NAFLD prediction method (DeepFLD). The dataset includes over 6000 participants physical examinations, laboratory and imaging studies, extensive questionnaires, and facial images of partial participants, which is comprehensive and valuable for clinical studies. From the dataset, we quantitatively analyze and select clinical metadata that most contribute to NAFLD prediction. Furthermore, the proposed DeepFLD, a deep neural network model designed to predict NAFLD using multi-modal input, including metadata and facial images, outperforms the approach that only uses metadata. Satisfactory performance is also verified on other unseen datasets. Inspiringly, DeepFLD can achieve competitive results using only facial images as input rather than metadata, paving the way for a more robust and simpler non-invasive NAFLD diagnosis.

Deep unrolling Shrinkage Network for Dynamic MR imaging

  • paper_url: http://arxiv.org/abs/2307.09818
  • repo_url: https://github.com/yhao-z/dus-net
  • paper_authors: Yinghao Zhang, Xiaodi Li, Weihang Li, Yue Hu
  • for: dynamic magnetic resonance (MR) image reconstruction.
  • methods: a novel soft thresholding with channel attention (AST) operator that learns a threshold for each channel, and a deep unrolling shrinkage network (DUS-Net) obtained by unrolling the alternating direction method of multipliers (ADMM) for the transformed $l_1$ norm dynamic MR reconstruction model.
  • results: on an open-access dynamic cine MR dataset, the proposed DUS-Net outperforms state-of-the-art methods.
    Abstract Deep unrolling networks that utilize sparsity priors have achieved great success in dynamic magnetic resonance (MR) imaging. The convolutional neural network (CNN) is usually utilized to extract the transformed domain, and then the soft thresholding (ST) operator is applied to the CNN-transformed data to enforce the sparsity priors. However, the ST operator is usually constrained to be the same across all channels of the CNN-transformed data. In this paper, we propose a novel operator, called soft thresholding with channel attention (AST), that learns the threshold for each channel. In particular, we put forward a novel deep unrolling shrinkage network (DUS-Net) by unrolling the alternating direction method of multipliers (ADMM) for optimizing the transformed $l_1$ norm dynamic MR reconstruction model. Experimental results on an open-access dynamic cine MR dataset demonstrate that the proposed DUS-Net outperforms the state-of-the-art methods. The source code is available at \url{https://github.com/yhao-z/DUS-Net}.
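
A hedged sketch of an AST-style operator: a small attention branch predicts one threshold per channel, and soft thresholding then shrinks each channel with its own learned threshold. The layer sizes and the scaling of thresholds by the channel's mean magnitude are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AST(nn.Module):
    """Soft thresholding with a per-channel, attention-predicted threshold."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        # Threshold = attention weight (0..1) times the channel's mean magnitude.
        scale = x.abs().mean(dim=(2, 3), keepdim=True)
        tau = self.attn(x) * scale
        return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)
```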

Manifold Learning with Sparse Regularised Optimal Transport

  • paper_url: http://arxiv.org/abs/2307.09816
  • repo_url: https://github.com/zsteve/QROT
  • paper_authors: Stephen Zhang, Gilles Mordant, Tetsuya Matsumoto, Geoffrey Schiebinger
  • for: manifold learning based on optimal transport that is robust to the noisy observations and sampling of real-world data.
  • methods: a symmetric version of optimal transport with quadratic regularization that constructs a sparse and adaptive affinity matrix, interpretable as a generalization of bistochastic kernel normalization.
  • results: the resulting kernel is proven consistent with a Laplace-type operator in the continuous limit and robust to heteroskedastic noise; a highly efficient computational scheme outperforms competing methods in a set of examples.
    Abstract Manifold learning is a central task in modern statistics and data science. Many datasets (cells, documents, images, molecules) can be represented as point clouds embedded in a high dimensional ambient space, however the degrees of freedom intrinsic to the data are usually far fewer than the number of ambient dimensions. The task of detecting a latent manifold along which the data are embedded is a prerequisite for a wide family of downstream analyses. Real-world datasets are subject to noisy observations and sampling, so that distilling information about the underlying manifold is a major challenge. We propose a method for manifold learning that utilises a symmetric version of optimal transport with a quadratic regularisation that constructs a sparse and adaptive affinity matrix, that can be interpreted as a generalisation of the bistochastic kernel normalisation. We prove that the resulting kernel is consistent with a Laplace-type operator in the continuous limit, establish robustness to heteroskedastic noise and exhibit these results in simulations. We identify a highly efficient computational scheme for computing this optimal transport for discrete data and demonstrate that it outperforms competing methods in a set of examples.

GenKL: An Iterative Framework for Resolving Label Ambiguity and Label Non-conformity in Web Images Via a New Generalized KL Divergence

  • paper_url: http://arxiv.org/abs/2307.09810
  • repo_url: https://github.com/codetopaper/genkl
  • paper_authors: Xia Huang, Kai Fong Ernest Chong
  • for: addressing non-conforming (NC) instances, i.e. ambiguous in-distribution and out-of-distribution instances, in web image datasets, which can negatively impact the performance of image classification models.
  • methods: a new $(\alpha, \beta)$-generalized KL divergence, $\mathcal{D}_{\text{KL}}^{\alpha, \beta}(p\|q)$, proven more effective than traditional entropy-based methods at identifying NC instances, together with an iterative training framework, GenKL, that identifies and relabels them.
  • results: new state-of-the-art classification accuracies on three web image datasets: $81.34\%$ on Clothing1M, $85.73\%$ on Food101/Food101N, and $78.99\%$/$92.54\%$ (top-1/top-5) on mini WebVision 1.0.
    Abstract Web image datasets curated online inherently contain ambiguous in-distribution (ID) instances and out-of-distribution (OOD) instances, which we collectively call non-conforming (NC) instances. In many recent approaches for mitigating the negative effects of NC instances, the core implicit assumption is that the NC instances can be found via entropy maximization. For "entropy" to be well-defined, we are interpreting the output prediction vector of an instance as the parameter vector of a multinomial random variable, with respect to some trained model with a softmax output layer. Hence, entropy maximization is based on the idealized assumption that NC instances have predictions that are "almost" uniformly distributed. However, in real-world web image datasets, there are numerous NC instances whose predictions are far from being uniformly distributed. To tackle the limitation of entropy maximization, we propose $(\alpha, \beta)$-generalized KL divergence, $\mathcal{D}_{\text{KL}}^{\alpha, \beta}(p\|q)$, which can be used to identify significantly more NC instances. Theoretical properties of $\mathcal{D}_{\text{KL}}^{\alpha, \beta}(p\|q)$ are proven, and we also show empirically that a simple use of $\mathcal{D}_{\text{KL}}^{\alpha, \beta}(p\|q)$ outperforms all baselines on the NC instance identification task. Building upon $(\alpha,\beta)$-generalized KL divergence, we also introduce a new iterative training framework, GenKL, that identifies and relabels NC instances. When evaluated on three web image datasets, Clothing1M, Food101/Food101N, and mini WebVision 1.0, we achieved new state-of-the-art classification accuracies: $81.34\%$, $85.73\%$ and $78.99\%$/$92.54\%$ (top-1/top-5), respectively.
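
For context, the entropy-maximization baseline that GenKL improves on can be sketched as below: ranking instances by the entropy of their softmax prediction $p$ is equivalent to ranking by $\mathrm{KL}(p\,\|\,u)$ to the uniform distribution $u$, since $\mathrm{KL}(p\|u)=\log M - H(p)$. The paper's $(\alpha,\beta)$-generalized divergence itself is not reproduced here.

```python
import numpy as np

def nc_scores(probs, eps=1e-12):
    """probs: (n, M) softmax outputs; returns entropy and KL-to-uniform per instance."""
    p = np.clip(probs, eps, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    kl_to_uniform = np.log(p.shape[1]) - entropy
    return entropy, kl_to_uniform             # high entropy ~ candidate NC instance
```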

Graph Federated Learning Based on the Decentralized Framework

  • paper_url: http://arxiv.org/abs/2307.09801
  • repo_url: None
  • paper_authors: Peilin Liu, Yanni Tang, Mingyue Zhang, Wu Chen
  • for: protecting data privacy while improving model accuracy, via a graph federated learning method built on a decentralized framework.
  • methods: replaces the classical client-server framework of federated learning with a decentralized one; the confidence between nodes is determined from the similarity of their data, and gradient information is then aggregated by confidence-based linear weighting.
  • results: compared with FedAvg, Fedprox, GCFL, and GCFL+, experiments show the proposed method achieves better accuracy and stability.
    Abstract Graph learning has a wide range of applications in many scenarios, which require more need for data privacy. Federated learning is an emerging distributed machine learning approach that leverages data from individual devices or data centers to improve the accuracy and generalization of the model, while also protecting the privacy of user data. Graph-federated learning is mainly based on the classical federated learning framework i.e., the Client-Server framework. However, the Client-Server framework faces problems such as a single point of failure of the central server and poor scalability of network topology. First, we introduce the decentralized framework to graph-federated learning. Second, determine the confidence among nodes based on the similarity of data among nodes, subsequently, the gradient information is then aggregated by linear weighting based on confidence. Finally, the proposed method is compared with FedAvg, Fedprox, GCFL, and GCFL+ to verify the effectiveness of the proposed method. Experiments demonstrate that the proposed method outperforms other methods.
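
A minimal sketch of the decentralized aggregation step, under the assumption that each node summarises its data with a statistics vector and measures similarity by cosine similarity (the paper's exact confidence definition is not reproduced): neighbours' gradients are linearly weighted by confidence.

```python
import numpy as np

def confidence(stats_i, stats_j):
    """Cosine similarity of per-node data summaries (an assumed similarity)."""
    denom = np.linalg.norm(stats_i) * np.linalg.norm(stats_j) + 1e-12
    return float(stats_i @ stats_j / denom)

def aggregate(own_grad, neighbour_grads, confidences):
    """Linear weighting of gradients by confidence, with full trust in one's own."""
    grads = [own_grad] + list(neighbour_grads)
    w = np.array([1.0] + list(confidences))
    w = w / w.sum()
    return sum(wi * gi for wi, gi in zip(w, grads))
```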

Probabilistic Forecasting with Coherent Aggregation

  • paper_url: http://arxiv.org/abs/2307.09797
  • repo_url: None
  • paper_authors: Geoffrey Négiar, Ruijun Ma, O. Nangba Meetei, Mengfei Cao, Michael W. Mahoney
  • for: multivariate probabilistic forecasting tasks such as energy management, supply chain planning, and resource allocation, where forecasts must be coherent with a hierarchical structure.
  • methods: a new model that exploits a factor model structure to produce coherent forecasts by construction; a convolutional neural network outputs the parameters of the factors, their loadings, and the base-level distributions, and because the samples are differentiable with respect to the model's parameters, any sample-based loss can be optimized.
  • results: significant improvements of 11.8-41.4% on three hierarchical forecasting datasets, together with an analysis of how the base-level distribution and the number of factors influence the model.
    Abstract Obtaining accurate probabilistic forecasts while respecting hierarchical information is an important operational challenge in many applications, perhaps most obviously in energy management, supply chain planning, and resource allocation. The basic challenge, especially for multivariate forecasting, is that forecasts are often required to be coherent with respect to the hierarchical structure. In this paper, we propose a new model which leverages a factor model structure to produce coherent forecasts by construction. This is a consequence of a simple (exchangeability) observation: permuting base-level series in the hierarchy does not change their aggregates. Our model uses a convolutional neural network to produce parameters for the factors, their loadings and base-level distributions; it produces samples which can be differentiated with respect to the model's parameters; and it can therefore optimize for any sample-based loss function, including the Continuous Ranked Probability Score and quantile losses. We can choose arbitrary continuous distributions for the factor and the base-level distributions. We compare our method to two previous methods which can be optimized end-to-end, while enforcing coherent aggregation. Our model achieves significant improvements: between $11.8-41.4\%$ on three hierarchical forecasting datasets. We also analyze the influence of parameters in our model with respect to base-level distribution and number of factors.
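
A minimal sketch of coherence by construction: base-level series are sampled from a factor model and aggregated with the hierarchy's summing matrix $S$, so every sample is coherent by definition, and permuting base-level samples leaves the aggregates unchanged. The shapes and the Gaussian noise model below are illustrative assumptions.

```python
import numpy as np

def sample_hierarchy(S, loadings, base_scale, n_samples, rng):
    """S: (n_agg, n_base) summing matrix; loadings: (n_base, k) factor loadings."""
    n_base, k = loadings.shape
    z = rng.normal(size=(n_samples, k))          # shared latent factors
    eps = rng.normal(size=(n_samples, n_base))   # idiosyncratic base-level noise
    base = z @ loadings.T + base_scale * eps     # base-level draws
    return base, base @ S.T                      # aggregates are coherent by design
```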

Forecasting Early with Meta Learning

  • paper_url: http://arxiv.org/abs/2307.09796
  • repo_url: https://github.com/super-shayan/feml
  • paper_authors: Shayan Jawed, Kiran Madhusudhanan, Vijaya Krishna Yalavarthi, Lars Schmidt-Thieme
  • for: forecasting time series in their early observation period, when only a few historic observations are available, by meta learning from a prior collection of datasets.
  • methods: a meta-learning method with adversarial learning as an auxiliary task, consisting of a shared convolutional backbone that learns features for varying-length inputs from different datasets and dataset-specific heads that forecast different output lengths.
  • results: FEML meta learns across datasets, and by additionally training on adversarially generated samples as auxiliary samples for the target dataset, it improves forecasting performance over single-task learning and over various solutions adapted from joint learning, multi-task learning, and classic forecasting baselines.
    Abstract In the early observation period of a time series, there might be only a few historic observations available to learn a model. However, in cases where an existing prior set of datasets is available, Meta learning methods can be applicable. In this paper, we devise a Meta learning method that exploits samples from additional datasets and learns to augment time series through adversarial learning as an auxiliary task for the target dataset. Our model (FEML), is equipped with a shared Convolutional backbone that learns features for varying length inputs from different datasets and has dataset specific heads to forecast for different output lengths. We show that FEML can meta learn across datasets and by additionally learning on adversarial generated samples as auxiliary samples for the target dataset, it can improve the forecasting performance compared to single task learning, and various solutions adapted from Joint learning, Multi-task learning and classic forecasting baselines.

From West to East: Who can understand the music of the others better?

  • paper_url: http://arxiv.org/abs/2307.09795
  • repo_url: https://github.com/pxaris/ccml
  • paper_authors: Charilaos Papaioannou, Emmanouil Benetos, Alexandros Potamianos
  • for: investigating whether audio embedding models trained on Western music can learn representations for other music cultures and styles, and whether similar models can be built from data of those cultures.
  • methods: transfer learning with two CNN-based and one Transformer-based pre-trained audio embedding models, trained and transferred across two Western music datasets, two traditional/folk datasets from eastern Mediterranean cultures, and two Indian art music datasets, to perform auto-tagging in each target domain.
  • results: transfer learning achieves competitive performance in all domains, while the best source dataset varies per music culture; the implementation and trained models are provided in a public repository.
    Abstract Recent developments in MIR have led to several benchmark deep learning models whose embeddings can be used for a variety of downstream tasks. At the same time, the vast majority of these models have been trained on Western pop/rock music and related styles. This leads to research questions on whether these models can be used to learn representations for different music cultures and styles, or whether we can build similar music audio embedding models trained on data from different cultures or styles. To that end, we leverage transfer learning methods to derive insights about the similarities between the different music cultures to which the data belongs to. We use two Western music datasets, two traditional/folk datasets coming from eastern Mediterranean cultures, and two datasets belonging to Indian art music. Three deep audio embedding models are trained and transferred across domains, including two CNN-based and a Transformer-based architecture, to perform auto-tagging for each target domain dataset. Experimental results show that competitive performance is achieved in all domains via transfer learning, while the best source dataset varies for each music culture. The implementation and the trained models are both provided in a public repository.

IncDSI: Incrementally Updatable Document Retrieval

  • paper_url: http://arxiv.org/abs/2307.10323
  • repo_url: None
  • paper_authors: Varsha Kishore, Chao Wan, Justin Lovelace, Yoav Artzi, Kilian Q. Weinberger
  • for: document retrieval systems that need to be updated with new information in real time.
  • methods: proposes IncDSI, which adds documents to a trained neural-network-based search index in real time (about 20-50ms per document), without retraining the entire model.
  • results: the approach is orders of magnitude faster than retraining the model on the entire dataset while remaining competitive with full retraining in performance, as demonstrated on several benchmarks.
    Abstract Differentiable Search Index is a recently proposed paradigm for document retrieval, that encodes information about a corpus of documents within the parameters of a neural network and directly maps queries to corresponding documents. These models have achieved state-of-the-art performances for document retrieval across many benchmarks. These kinds of models have a significant limitation: it is not easy to add new documents after a model is trained. We propose IncDSI, a method to add documents in real time (about 20-50ms per document), without retraining the model on the entire dataset (or even parts thereof). Instead we formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters. Although orders of magnitude faster, our approach is competitive with re-training the model on the whole dataset and enables the development of document retrieval systems that can be updated with new information in real-time. Our code for IncDSI is available at https://github.com/varshakishore/IncDSI.
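
A hedged sketch of the constrained-optimization view (the objective below is illustrative, not the paper's): only a new index row $v$ for the incoming document is optimized, so that its query scores highest by a margin while existing documents keep winning their own queries and the rest of the index $W$ stays frozen.

```python
import torch

def add_document(W, q_new, other_qs=None, lr=0.1, steps=100, margin=1.0):
    """W: (N, d) frozen index; q_new: (d,) query embedding of the new document."""
    v = q_new.clone().requires_grad_(True)       # init from the document's query
    opt = torch.optim.Adam([v], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pos = q_new @ v                          # new document should win its query
        best_old = (q_new @ W.T).max()
        loss = torch.clamp(margin - (pos - best_old), min=0.0)
        if other_qs is not None:                 # do not hijack other documents' queries
            hijack = torch.clamp(
                margin - ((other_qs @ W.T).max(dim=1).values - other_qs @ v),
                min=0.0)
            loss = loss + hijack.mean()
        loss.backward()
        opt.step()
    return torch.cat([W, v.detach().unsqueeze(0)], dim=0)
```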

A Note on Hardness of Computing Recursive Teaching Dimension

  • paper_url: http://arxiv.org/abs/2307.09792
  • repo_url: None
  • paper_authors: Pasin Manurangsi
  • for: showing that computing the recursive teaching dimension (RTD) of a concept class given explicitly as input requires $n^{\Omega(\log n)}$ time, assuming the Exponential Time Hypothesis (ETH).
  • methods: a hardness reduction under the Exponential Time Hypothesis (ETH).
  • results: the lower bound matches the $n^{O(\log n)}$ running time of the brute-force algorithm.
    Abstract In this short note, we show that the problem of computing the recursive teaching dimension (RTD) for a concept class (given explicitly as input) requires $n^{\Omega(\log n)}$-time, assuming the exponential time hypothesis (ETH). This matches the running time $n^{O(\log n)}$ of the brute-force algorithm for the problem.
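
For concreteness, a brute-force RTD computation can be sketched as follows (exponential in the worst case; the note above shows that even $n^{o(\log n)}$ time is unlikely under ETH): repeatedly peel off the concepts with the smallest teaching dimension with respect to the remaining class, and report the largest teaching-set size used.

```python
from itertools import combinations

def teaching_dim(c, concepts, d):
    """Smallest set S of domain points distinguishing c from all other concepts."""
    for k in range(d + 1):
        for S in combinations(range(d), k):
            if all(any(c2[x] != c[x] for x in S) for c2 in concepts if c2 != c):
                return k
    return d

def rtd(concepts):
    """Concepts are tuples of labels over the domain {0, ..., d-1}."""
    concepts = set(concepts)
    d = len(next(iter(concepts)))
    best = 0
    while concepts:
        dims = {c: teaching_dim(c, concepts, d) for c in concepts}
        m = min(dims.values())
        best = max(best, m)
        concepts = {c for c in concepts if dims[c] != m}  # peel off easiest layer
    return best
```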

Reproducibility in Machine Learning-Driven Research

  • paper_url: http://arxiv.org/abs/2307.10320
  • repo_url: None
  • paper_authors: Harald Semmelrock, Simone Kopeinik, Dieter Theiler, Tony Ross-Hellauer, Dominik Kowald
  • for: examining the reproducibility crisis in machine-learning (ML) driven research, and the reproducibility issues and barriers that exist across research fields applying ML.
  • methods: a mini survey of the literature on the current state of ML reproducibility in various research fields, identifying the issues and barriers involved.
  • results: identification of reproducibility problems and barriers across fields, and of potential drivers, such as tools, practices, and interventions (including ML platforms), that support ML reproducibility.
    Abstract Research is facing a reproducibility crisis, in which the results and findings of many studies are difficult or even impossible to reproduce. This is also the case in machine learning (ML) and artificial intelligence (AI) research. Often, this is the case due to unpublished data and/or source-code, and due to sensitivity to ML training conditions. Although different solutions to address this issue are discussed in the research community such as using ML platforms, the level of reproducibility in ML-driven research is not increasing substantially. Therefore, in this mini survey, we review the literature on reproducibility in ML-driven research with three main aims: (i) reflect on the current situation of ML reproducibility in various research fields, (ii) identify reproducibility issues and barriers that exist in these research fields applying ML, and (iii) identify potential drivers such as tools, practices, and interventions that support ML reproducibility. With this, we hope to contribute to decisions on the viability of different solutions for supporting ML reproducibility.

ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats

  • paper_url: http://arxiv.org/abs/2307.09782
  • repo_url: https://github.com/microsoft/DeepSpeed
  • paper_authors: Xiaoxia Wu, Zhewei Yao, Yuxiong He
  • for: Striking a balance between computational efficiency and model quality for large language models (LLMs).
  • methods: A comprehensive investigation of floating-point (FP) quantization, particularly FP8 and FP4, as a potential solution.
  • results: For LLMs, FP8 activation consistently outperforms its integer (INT8) counterpart, especially in models beyond one billion parameters. For weight quantization, FP4 performs comparably to, or better than, INT4, simplifying deployment on FP-supported hardware such as the H100. Two scaling constraints for weight quantization reduce the overhead of precision alignment with negligible performance impact, and integrating the Low Rank Compensation (LoRC) strategy yields further gains, especially in smaller models. The results underline the potential of FP quantization for efficient deployment in resource-limited settings.
    Abstract In the complex domain of large language models (LLMs), striking a balance between computational efficiency and maintaining model quality is a formidable challenge. Navigating the inherent limitations of uniform quantization, particularly when dealing with outliers, and motivated by the launch of NVIDIA's H100 hardware, this study delves into the viability of floating-point (FP) quantization, particularly focusing on FP8 and FP4, as a potential solution. Our comprehensive investigation reveals that for LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models possessing parameters beyond one billion. For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100. To mitigate the overhead from precision alignment caused by the disparity between weights and activations, we propose two scaling constraints for weight quantization that negligibly impact the performance compared to the standard W4A8 model. We additionally enhance our quantization methods by integrating the Low Rank Compensation (LoRC) strategy, yielding improvements especially in smaller models. The results of our investigation emphasize the immense potential of FP quantization for LLMs, paving the way for high-efficiency deployment in resource-limited settings.
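
As a rough illustration of what FP4 weight quantization means in practice, the toy sketch below simulates round-to-nearest quantization onto an E2M1-style FP4 value grid with a per-channel scale. The exact format, scaling constraints, and LoRC integration in ZeroQuant-FP differ from this simplification.

```python
import torch

# Non-negative magnitudes representable by a sign/2-exponent/1-mantissa FP4 code.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(w: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize a (out_features, in_features) weight per output channel."""
    scale = (w.abs().amax(dim=1, keepdim=True) / FP4_GRID.max()).clamp(min=1e-8)
    mag = (w.abs() / scale).clamp(max=FP4_GRID.max())
    idx = (mag.unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)  # round to nearest grid point
    return torch.sign(w) * FP4_GRID[idx] * scale

w = torch.randn(4, 8)
print((w - fake_quant_fp4(w)).abs().max())  # worst-case quantization error
```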

Text2Layer: Layered Image Generation using Latent Diffusion Model

  • paper_url: http://arxiv.org/abs/2307.09781
  • repo_url: None
  • paper_authors: Xinyang Zhang, Wentian Zhao, Xin Lu, Jeff Chien
  • for: This paper studies image generation for layer-compositing workflows, exploring layered image generation motivated by the success of diffusion models.
  • methods: The background image, foreground image, layer mask, and composed image are generated simultaneously; an autoencoder is trained to reconstruct layered images, and diffusion models are trained on its latent representation.
  • results: The method generates high-quality layered images and layer masks, and establishes a benchmark for future work.
    Abstract Layer compositing is one of the most popular image editing workflows among both amateurs and professionals. Motivated by the success of diffusion models, we explore layer compositing from a layered image generation perspective. Instead of generating an image, we propose to generate background, foreground, layer mask, and the composed image simultaneously. To achieve layered image generation, we train an autoencoder that is able to reconstruct layered images and train diffusion models on the latent representation. One benefit of the proposed problem is to enable better compositing workflows in addition to the high-quality image output. Another benefit is producing higher-quality layer masks compared to masks produced by a separate step of image segmentation. Experimental results show that the proposed method is able to generate high-quality layered images and initiates a benchmark for future work.
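
The deterministic relation tying the four generated outputs together is ordinary alpha compositing; the sketch below spells it out, with assumed (B, C, H, W) tensors and a single-channel soft mask in [0, 1].

```python
import torch

def compose(background: torch.Tensor, foreground: torch.Tensor,
            mask: torch.Tensor) -> torch.Tensor:
    """Alpha blend: composite = mask * foreground + (1 - mask) * background."""
    return mask * foreground + (1.0 - mask) * background

# The autoencoder then only needs to reconstruct the stacked (background,
# foreground, mask) tuple; the composite is recovered deterministically.
bg, fg = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
m = torch.rand(1, 1, 64, 64)          # soft layer mask in [0, 1]
img = compose(bg, fg, m)
```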

Beyond Single-Feature Importance with ICECREAM

  • paper_url: http://arxiv.org/abs/2307.09779
  • repo_url: None
  • paper_authors: Michael Oesterle, Patrick Blöbaum, Atalanti A. Mastakouri, Elke Kirschbaum
  • for: Explaining which set of features is responsible for a model's output, and which components caused the failure of a cloud computing application.
  • methods: An information-theoretic quantitative measure for the influence of a coalition of variables on the distribution of a target variable, identifying the set of factors essential for a certain outcome rather than ranking individual factors.
  • results: In experiments on synthetic and real-world data, the method outperforms state-of-the-art approaches to explainability and root cause analysis, achieving impressive accuracy in both tasks.
    Abstract Which set of features was responsible for a certain output of a machine learning model? Which components caused the failure of a cloud computing application? These are just two examples of questions we are addressing in this work by Identifying Coalition-based Explanations for Common and Rare Events in Any Model (ICECREAM). Specifically, we propose an information-theoretic quantitative measure for the influence of a coalition of variables on the distribution of a target variable. This allows us to identify which set of factors is essential to obtain a certain outcome, as opposed to well-established explainability and causal contribution analysis methods which can assign contributions only to individual factors and rank them by their importance. In experiments with synthetic and real-world data, we show that ICECREAM outperforms state-of-the-art methods for explainability and root cause analysis, and achieves impressive accuracy in both tasks.
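
To illustrate the coalition-level (rather than per-feature) attribution idea, here is a brute-force sketch that scores a coalition by how strongly clamping it to the observed values determines the model's output. The paper's actual information-theoretic estimator is more refined than this frequency-based stand-in.

```python
from itertools import combinations
import numpy as np

def coalition_influence(model, X, x_obs, S, n_samples=500, seed=0):
    """Fraction of background samples whose prediction matches model(x_obs)
    once the features in coalition S are clamped to their observed values.
    A value of 1.0 means S fully determines the outcome."""
    rng = np.random.default_rng(seed)
    target = model(x_obs[None])[0]
    Xs = X[rng.integers(0, len(X), n_samples)].copy()
    Xs[:, list(S)] = x_obs[list(S)]          # clamp the coalition
    return float(np.mean(model(Xs) == target))

def best_coalition(model, X, x_obs, max_size=2):
    d = x_obs.shape[0]
    scored = [(coalition_influence(model, X, x_obs, S), S)
              for k in range(1, max_size + 1)
              for S in combinations(range(d), k)]
    return max(scored)

# Toy model where features 0 and 2 jointly decide the output:
model = lambda X: (X[:, 0] + X[:, 2] > 1.0).astype(int)
X = np.random.default_rng(1).random((1000, 4))
print(best_coalition(model, X, X[0]))        # the coalition {0, 2} scores 1.0
```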

Polyffusion: A Diffusion Model for Polyphonic Score Generation with Internal and External Controls

  • paper_url: http://arxiv.org/abs/2307.10304
  • repo_url: https://github.com/aik2mlj/polyffusion
  • paper_authors: Lejun Min, Junyan Jiang, Gus Xia, Jingwei Zhao
  • for: generating polyphonic music scores using a diffusion model
  • methods: uses internal control (pre-defined parts of the music) and external control (external yet related information such as chord, texture, or other features) via cross-attention mechanism
  • results: significantly outperforms existing Transformer and sampling-based baselines, and using pre-trained disentangled representations as external conditions yields more effective controls.
    Abstract We propose Polyffusion, a diffusion model that generates polyphonic music scores by regarding music as image-like piano roll representations. The model is capable of controllable music generation with two paradigms: internal control and external control. Internal control refers to the process in which users pre-define a part of the music and then let the model infill the rest, similar to the task of masked music generation (or music inpainting). External control conditions the model with external yet related information, such as chord, texture, or other features, via the cross-attention mechanism. We show that by using internal and external controls, Polyffusion unifies a wide range of music creation tasks, including melody generation given accompaniment, accompaniment generation given melody, arbitrary music segment inpainting, and music arrangement given chords or textures. Experimental results show that our model significantly outperforms existing Transformer and sampling-based baselines, and using pre-trained disentangled representations as external conditions yields more effective controls.

Eliminating Label Leakage in Tree-Based Vertical Federated Learning

  • paper_url: http://arxiv.org/abs/2307.10318
  • repo_url: None
  • paper_authors: Hideaki Takahashi, Jingjing Liu, Yang Liu
  • for: This paper studies attack and defense mechanisms in tree-based vertical federated learning (VFL).
  • methods: A novel label inference attack, ID2Graph, uses the sets of record IDs assigned to each node (the instance space) to deduce private training labels; an effective defense mechanism, ID-LMID, prevents label leakage through mutual information regularization.
  • results: Experiments show that the ID2Graph attack poses significant leakage risks to tree-based models such as Random Forest and XGBoost, and that ID-LMID effectively mitigates label leakage in such instances.
    Abstract Vertical federated learning (VFL) enables multiple parties with disjoint features of a common user set to train a machine learning model without sharing their private data. Tree-based models have become prevalent in VFL due to their interpretability and efficiency. However, the vulnerability of tree-based VFL has not been sufficiently investigated. In this study, we first introduce a novel label inference attack, ID2Graph, which utilizes the sets of record-IDs assigned to each node (i.e., instance space) to deduce private training labels. The ID2Graph attack generates a graph structure from training samples, extracts communities from the graph, and clusters the local dataset using community information. To counteract label leakage from the instance space, we propose an effective defense mechanism, ID-LMID, which prevents label leakage by focusing on mutual information regularization. Comprehensive experiments conducted on various datasets reveal that the ID2Graph attack presents significant risks to tree-based models such as Random Forest and XGBoost. Further evaluations on these benchmarks demonstrate that ID-LMID effectively mitigates label leakage in such instances.

Self-Supervised Learning for WiFi CSI-Based Human Activity Recognition: A Systematic Study

  • paper_url: http://arxiv.org/abs/2308.02412
  • repo_url: https://github.com/JJJinx/SSLCSI
  • paper_authors: Ke Xu, Jiangtao Wang, Hongyuan Zhu, Dingchang Zheng
  • for: This paper examines the application of WiFi CSI to human activity recognition (HAR) and how self-supervised learning (SSL) can mitigate the scarcity of labeled CSI data.
  • methods: An extensive empirical study of four categories of SSL algorithms, both previously studied and not yet explored for this problem, on three publicly available CSI HAR datasets covering different tasks and environmental settings, with performance metrics designed to match real-world requirements.
  • results: SSL improves CSI-based HAR, but the experiments also uncover several limitations and blind spots in existing work that must be addressed before SSL can be deployed in real-world WiFi-based HAR applications.
    Abstract Recently, with the advancement of the Internet of Things (IoT), WiFi CSI-based HAR has gained increasing attention from academic and industry communities. By integrating the deep learning technology with CSI-based HAR, researchers achieve state-of-the-art performance without the need of expert knowledge. However, the scarcity of labeled CSI data remains the most prominent challenge when applying deep learning models in the context of CSI-based HAR due to the privacy and incomprehensibility of CSI-based HAR data. On the other hand, SSL has emerged as a promising approach for learning meaningful representations from data without heavy reliance on labeled examples. Therefore, considerable efforts have been made to address the challenge of insufficient data in deep learning by leveraging SSL algorithms. In this paper, we undertake a comprehensive inventory and analysis of the potential held by different categories of SSL algorithms, including those that have been previously studied and those that have not yet been explored, within the field. We provide an in-depth investigation of SSL algorithms in the context of WiFi CSI-based HAR. We evaluate four categories of SSL algorithms using three publicly available CSI HAR datasets, each encompassing different tasks and environmental settings. To ensure relevance to real-world applications, we design performance metrics that align with specific requirements. Furthermore, our experimental findings uncover several limitations and blind spots in existing work, highlighting the barriers that need to be addressed before SSL can be effectively deployed in real-world WiFi-based HAR applications. Our results also serve as a practical guideline for industry practitioners and provide valuable insights for future research endeavors in this field.

A Novel Spatial-Temporal Variational Quantum Circuit to Enable Deep Learning on NISQ Devices

  • paper_url: http://arxiv.org/abs/2307.09771
  • repo_url: None
  • paper_authors: Jinyang Li, Zhepeng Wang, Zhirui Hu, Prasanna Date, Ang Li, Weiwen Jiang
  • For: This paper proposes a novel spatial-temporal design, named ST-VQC, to improve the accuracy and robustness of quantum learning algorithms for handling non-linear datasets and noisy environments.
  • Methods: The proposed ST-VQC integrates non-linearity in quantum learning through a block-based encoding quantum sub-circuit and a layer-wise computation quantum sub-circuit, and adopts a SWAP-Free physical circuit design to improve robustness. An automated optimization framework is proposed to generate the ST-VQC quantum circuit.
  • Results: The proposed ST-VQC achieves over 30% accuracy improvement compared with existing VQCs on actual quantum computers, and outperforms a linear classifier by 27.9% on a non-linear synthetic dataset.
    Abstract Quantum computing presents a promising approach for machine learning with its capability for extremely parallel computation in high-dimension through superposition and entanglement. Despite its potential, existing quantum learning algorithms, such as Variational Quantum Circuits(VQCs), face challenges in handling more complex datasets, particularly those that are not linearly separable. What's more, it encounters the deployability issue, making the learning models suffer a drastic accuracy drop after deploying them to the actual quantum devices. To overcome these limitations, this paper proposes a novel spatial-temporal design, namely ST-VQC, to integrate non-linearity in quantum learning and improve the robustness of the learning model to noise. Specifically, ST-VQC can extract spatial features via a novel block-based encoding quantum sub-circuit coupled with a layer-wise computation quantum sub-circuit to enable temporal-wise deep learning. Additionally, a SWAP-Free physical circuit design is devised to improve robustness. These designs bring a number of hyperparameters. After a systematic analysis of the design space for each design component, an automated optimization framework is proposed to generate the ST-VQC quantum circuit. The proposed ST-VQC has been evaluated on two IBM quantum processors, ibm_cairo with 27 qubits and ibmq_lima with 7 qubits to assess its effectiveness. The results of the evaluation on the standard dataset for binary classification show that ST-VQC can achieve over 30% accuracy improvement compared with existing VQCs on actual quantum computers. Moreover, on a non-linear synthetic dataset, the ST-VQC outperforms a linear classifier by 27.9%, while the linear classifier using classical computing outperforms the existing VQC by 15.58%.

How Curvature Enhance the Adaptation Power of Framelet GCNs

  • paper_url: http://arxiv.org/abs/2307.09768
  • repo_url: https://github.com/dshi3553usyd/curvature_enhanced_graph_convolution
  • paper_authors: Dai Shi, Yi Guo, Zhiqi Shao, Junbin Gao
  • for: This paper aims to enhance the performance of graph neural networks (GNNs) by incorporating graph geometric information, specifically discrete graph Ricci curvature.
  • methods: The proposed approach uses the graph Ricci curvature defined on the edges of a graph to measure the difficulty of information transit between nodes. The curvature information is inserted into the GNN model with a carefully designed transformation function $\zeta$ to alleviate computational issues such as over-smoothing.
  • results: The proposed curvature-based GNN model outperforms state-of-the-art baselines on both homophily and heterophily graph datasets, indicating the effectiveness of involving graph geometric information in GNNs. Additionally, a curvature-based graph edge-drop algorithm is proposed, dropping edges with very positive Ricci curvature to enhance the model's adaptation to heterophily graphs.
    Abstract Graph neural network (GNN) has been demonstrated powerful in modeling graph-structured data. However, despite many successful cases of applying GNNs to various graph classification and prediction tasks, whether the graph geometrical information has been fully exploited to enhance the learning performance of GNNs is not yet well understood. This paper introduces a new approach to enhance GNN by discrete graph Ricci curvature. Specifically, the graph Ricci curvature defined on the edges of a graph measures how difficult the information transits on one edge from one node to another based on their neighborhoods. Motivated by the geometric analogy of Ricci curvature in the graph setting, we prove that by inserting the curvature information with different carefully designed transformation function $\zeta$, several known computational issues in GNN such as over-smoothing can be alleviated in our proposed model. Furthermore, we verified that edges with very positive Ricci curvature (i.e., $\kappa_{i,j} \approx 1$) are preferred to be dropped to enhance model's adaption to heterophily graph and one curvature based graph edge drop algorithm is proposed. Comprehensive experiments show that our curvature-based GNN model outperforms the state-of-the-art baselines in both homophily and heterophily graph datasets, indicating the effectiveness of involving graph geometric information in GNNs.
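
As a concrete, hedged illustration of the edge-drop idea, the sketch below uses the cheap augmented Forman-Ricci curvature as a stand-in for the paper's discrete Ricci curvature; the normalization and the drop threshold are assumptions of this sketch, not the paper's settings.

```python
import networkx as nx

def forman_curvature(G: nx.Graph, u, v) -> float:
    """Augmented Forman curvature of an unweighted edge (u, v):
    4 - deg(u) - deg(v) + 3 * (#triangles through the edge)."""
    triangles = len(set(G[u]) & set(G[v]))
    return 4 - G.degree(u) - G.degree(v) + 3 * triangles

def drop_positive_curvature_edges(G: nx.Graph, kappa_max: float = 0.9) -> nx.Graph:
    """Remove edges whose (roughly normalized) curvature is very positive;
    such edges sit inside densely connected, homophilic regions, and dropping
    them is what helps adaptation to heterophilic graphs."""
    H = G.copy()
    for u, v in list(G.edges()):
        kappa = forman_curvature(G, u, v) / max(G.degree(u), G.degree(v), 1)
        if kappa >= kappa_max:
            H.remove_edge(u, v)
    return H

G = nx.karate_club_graph()
print(G.number_of_edges(), drop_positive_curvature_edges(G).number_of_edges())
```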

Sig-Splines: universal approximation and convex calibration of time series generative models

  • paper_url: http://arxiv.org/abs/2307.09767
  • repo_url: None
  • paper_authors: Magnus Wiese, Phillip Murray, Ralf Korn
  • for: This paper proposes a novel generative model for multivariate discrete-time time series data.
  • methods: Drawing inspiration from the construction of neural spline flows, the algorithm incorporates linear transformations and the signature transform as a seamless substitute for traditional neural networks.
  • results: The model retains the universality property inherent in neural networks while also introducing convexity in the model's parameters.
    Abstract We propose a novel generative model for multivariate discrete-time time series data. Drawing inspiration from the construction of neural spline flows, our algorithm incorporates linear transformations and the signature transform as a seamless substitution for traditional neural networks. This approach enables us to achieve not only the universality property inherent in neural networks but also introduces convexity in the model's parameters.

Reinforcing POD based model reduction techniques in reaction-diffusion complex networks using stochastic filtering and pattern recognition

  • paper_url: http://arxiv.org/abs/2307.09762
  • repo_url: None
  • paper_authors: Abhishek Ajayakumar, Soumyendu Raha
  • for: Modeling real-world systems with complex networks whose dimensionality makes them challenging to analyze.
  • methods: Dimensionality reduction techniques such as POD are reinforced with pattern recognition and stochastic filtering, since reduced-order models are susceptible to perturbations in the input data.
  • results: The proposed framework improves the accuracy of the surrogate model under perturbed inputs, benchmarked against a Neural ODE-based approach.
    Abstract Complex networks are used to model many real-world systems. However, the dimensionality of these systems can make them challenging to analyze. Dimensionality reduction techniques like POD can be used in such cases. However, these models are susceptible to perturbations in the input data. We propose an algorithmic framework that combines techniques from pattern recognition (PR) and stochastic filtering theory to enhance the output of such models. The results of our study show that our method can improve the accuracy of the surrogate model under perturbed inputs. Deep Neural Networks (DNNs) are susceptible to adversarial attacks. However, recent research has revealed that neural Ordinary Differential Equations (ODEs) exhibit robustness in specific applications. We benchmark our algorithmic framework with a Neural ODE-based approach as a reference.
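
For readers unfamiliar with POD, the sketch below shows the standard SVD-based construction of the reduced basis that such frameworks start from; the stochastic-filtering and pattern-recognition corrections proposed in the paper sit on top of this step.

```python
import numpy as np

def pod_basis(snapshots: np.ndarray, energy: float = 0.99) -> np.ndarray:
    """snapshots: (n_dof, n_snapshots). Returns an orthonormal basis (n_dof, r)."""
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    cum = np.cumsum(s**2) / np.sum(s**2)          # fraction of energy captured
    r = int(np.searchsorted(cum, energy) + 1)     # smallest rank reaching `energy`
    return U[:, :r]

X = np.random.rand(1000, 50)                      # toy snapshot matrix
Phi = pod_basis(X)
x_reduced = Phi.T @ X[:, 0]                       # project a full state
x_approx = Phi @ x_reduced                        # reconstruct it
print(Phi.shape, np.linalg.norm(X[:, 0] - x_approx))
```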

FedBug: A Bottom-Up Gradual Unfreezing Framework for Federated Learning

  • paper_url: http://arxiv.org/abs/2307.10317
  • repo_url: https://github.com/iandrover/fedbug
  • paper_authors: Chia-Hsiang Kao, Yu-Chiang Frank Wang
  • for: This paper proposes a novel Federated Learning (FL) framework to mitigate the problem of client drift.
  • methods: FedBug (Federated Learning with Bottom-Up Gradual Unfreezing) freezes the entire client model at the start of each global round and then gradually unfreezes layers from the input layer to the output layer, using the server-distributed parameters as reference points for cross-client alignment.
  • results: Theoretical analysis in a novel over-parameterization FL setup reveals a superior convergence rate compared to FedAvg, and comprehensive experiments across datasets, training conditions, and network architectures validate the efficacy of FedBug.
    Abstract Federated Learning (FL) offers a collaborative training framework, allowing multiple clients to contribute to a shared model without compromising data privacy. Due to the heterogeneous nature of local datasets, updated client models may overfit and diverge from one another, commonly known as the problem of client drift. In this paper, we propose FedBug (Federated Learning with Bottom-Up Gradual Unfreezing), a novel FL framework designed to effectively mitigate client drift. FedBug adaptively leverages the client model parameters, distributed by the server at each global round, as the reference points for cross-client alignment. Specifically, on the client side, FedBug begins by freezing the entire model, then gradually unfreezes the layers, from the input layer to the output layer. This bottom-up approach allows models to train the newly thawed layers to project data into a latent space, wherein the separating hyperplanes remain consistent across all clients. We theoretically analyze FedBug in a novel over-parameterization FL setup, revealing its superior convergence rate compared to FedAvg. Through comprehensive experiments, spanning various datasets, training conditions, and network architectures, we validate the efficacy of FedBug. Our contributions encompass a novel FL framework, theoretical analysis, and empirical validation, demonstrating the wide potential and applicability of FedBug.
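
A minimal sketch of the client-side schedule, assuming the model's top-level children are ordered input-to-output and that groups are thawed at a uniform rate; the paper's exact schedule may differ.

```python
import torch.nn as nn

def fedbug_local_round(model: nn.Sequential, loader, opt, local_epochs: int, loss_fn):
    groups = list(model.children())              # layer groups, input -> output
    for p in model.parameters():
        p.requires_grad = False                  # start each local round fully frozen
    epochs_per_group = max(1, local_epochs // len(groups))
    thawed = 0
    for epoch in range(local_epochs):
        if thawed < len(groups) and epoch % epochs_per_group == 0:
            for p in groups[thawed].parameters():
                p.requires_grad = True           # thaw the next group, bottom-up
            thawed += 1
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```

The intuition is that the still-frozen upper layers, shared via the server at the start of the round, act as a common projection head, so the newly thawed lower layers learn features whose separating hyperplanes stay consistent across clients.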

Constructing Extreme Learning Machines with zero Spectral Bias

  • paper_url: http://arxiv.org/abs/2307.09759
  • repo_url: None
  • paper_authors: Kaumudi Joshi, Vukka Snigdha, Arya Kumar Bhattacharya
  • for: This paper aims to address the issue of Spectral Bias (SB) in Extreme Learning Machines (ELMs) and its impact on the resolution of higher frequencies, which is crucial in fields like Physics Informed Neural Networks (PINNs).
  • methods: The paper uses Fourier Feature Embeddings to mitigate SB in ELMs, which has been shown to be effective in addressing this issue in other types of Artificial Neural Networks (ANNs).
  • results: The paper demonstrates that the proposed approach completely eliminates SB in ELMs, making them feasible for practical problems like PINNs where resolution of higher frequencies is essential.
    Abstract The phenomenon of Spectral Bias (SB), where the higher frequency components of a function being learnt in a feedforward Artificial Neural Network (ANN) converge more slowly than the lower frequencies, is observed ubiquitously across ANNs. This has created technology challenges in fields where resolution of higher frequencies is crucial, like in Physics Informed Neural Networks (PINNs). Extreme Learning Machines (ELMs), which obviate the iterative solution process that provides the theoretical basis of SB, should in principle be free of it. This work tests the reliability of this assumption, and shows that it is incorrect. However, the structure of ELMs makes them naturally amenable to implementation of variants of Fourier Feature Embeddings, which have been shown to mitigate SB in ANNs. This approach is implemented and verified to completely eliminate SB, thus making feasible the application of ELMs to practical problems like PINNs where resolution of higher frequencies is essential.
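
A minimal NumPy sketch of the proposed combination: lift the inputs with random Fourier features, pass them through a fixed random hidden layer, and solve only the output weights by least squares (the defining ELM step). The embedding scale sigma and the widths are illustrative assumptions; sigma should be chosen to cover the target's frequency content.

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_features(X, B):
    """Random Fourier embedding: x -> [cos(2*pi*Bx), sin(2*pi*Bx)]."""
    proj = 2 * np.pi * X @ B.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1)

def fit_elm(X, y, n_features=64, n_hidden=512, sigma=30.0):
    B = rng.normal(0.0, sigma, size=(n_features, X.shape[1]))  # frequency matrix
    W = rng.normal(size=(2 * n_features, n_hidden))            # random, never trained
    b = rng.normal(size=n_hidden)
    H = np.tanh(fourier_features(X, B) @ W + b)
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)               # solve output layer only
    return lambda Xn: np.tanh(fourier_features(Xn, B) @ W + b) @ beta

# A high-frequency target that a plain ELM resolves poorly:
X = np.linspace(0, 1, 2000)[:, None]
y = np.sin(2 * np.pi * 50 * X[:, 0])
predict = fit_elm(X, y)
print(np.abs(predict(X) - y).max())
```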

Improved Distribution Matching for Dataset Condensation

  • paper_url: http://arxiv.org/abs/2307.09742
  • repo_url: https://github.com/uitrbn/idm
  • paper_authors: Ganlong Zhao, Guanbin Li, Yipeng Qin, Yizhou Yu
  • for: This work condenses a large dataset into a smaller one that can still train well-performing models, reducing the storage cost and training effort of deep learning applications.
  • methods: Dataset condensation based on distribution matching, with three novel techniques (partitioning and expansion augmentation, efficient and enriched model sampling, and class-aware distribution regularization) that address two shortcomings of naive distribution matching: imbalanced feature numbers and unvalidated embeddings for distance computation.
  • results: The method outperforms most previous optimization-oriented approaches with far fewer computational resources, scaling dataset condensation to larger datasets and models.
    Abstract Dataset Condensation aims to condense a large dataset into a smaller one while maintaining its ability to train a well-performing model, thus reducing the storage cost and training effort in deep learning applications. However, conventional dataset condensation methods are optimization-oriented and condense the dataset by performing gradient or parameter matching during model optimization, which is computationally intensive even on small datasets and models. In this paper, we propose a novel dataset condensation method based on distribution matching, which is more efficient and promising. Specifically, we identify two important shortcomings of naive distribution matching (i.e., imbalanced feature numbers and unvalidated embeddings for distance computation) and address them with three novel techniques (i.e., partitioning and expansion augmentation, efficient and enriched model sampling, and class-aware distribution regularization). Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources, thereby scaling data condensation to larger datasets and models. Extensive experiments demonstrate the effectiveness of our method. Codes are available at https://github.com/uitrbn/IDM
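
For context, the base distribution-matching objective that this family of methods builds on fits in a few lines: match the feature statistics of the synthetic set to real data under randomly sampled embedding networks, so no expensive bi-level gradient matching is required. The paper's three techniques are refinements layered on top of this loss; the sketch below is a simplified stand-in, not the authors' code.

```python
import torch

def distribution_matching_loss(embed_net, x_real, x_syn):
    """Match (class-conditional) feature means of synthetic and real batches."""
    with torch.no_grad():
        f_real = embed_net(x_real).mean(dim=0)   # real statistics are the target
    f_syn = embed_net(x_syn).mean(dim=0)         # gradients flow into x_syn only
    return ((f_real - f_syn) ** 2).sum()

# Usage sketch: x_syn is a learnable tensor of condensed images (one batch per
# class), updated with this loss under many randomly initialized embed_net
# samples so the match holds across embeddings rather than for a single one.
```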

Mood Classification of Bangla Songs Based on Lyrics

  • paper_url: http://arxiv.org/abs/2307.10314
  • repo_url: None
  • paper_authors: Maliha Mahajebin, Mohammad Rifat Ahmmad Rashid, Nafees Mansoor
  • for: This study analyzes the moods of Bangla songs and classifies them based on their lyrics.
  • methods: A dataset of 4000 Bangla song lyrics is compiled and analyzed with natural language processing and the BERT algorithm, classifying each song into one of four moods: happy, sad, romantic, and relaxed.
  • results: Of the 4000 songs, 1513 are classified as sad, 1362 as romantic, 886 as happy, and the remaining 239 as relaxed; embedding the lyrics enables accurate multi-class mood classification.
    Abstract Music can evoke various emotions, and with the advancement of technology, it has become more accessible to people. Bangla music, which portrays different human emotions, lacks sufficient research. The authors of this article aim to analyze Bangla songs and classify their moods based on the lyrics. To achieve this, this research has compiled a dataset of 4000 Bangla song lyrics, genres, and used Natural Language Processing and the Bert Algorithm to analyze the data. Among the 4000 songs, 1513 songs are represented for the sad mood, 1362 for the romantic mood, 886 for happiness, and the rest 239 are classified as relaxation. By embedding the lyrics of the songs, the authors have classified the songs into four moods: Happy, Sad, Romantic, and Relaxed. This research is crucial as it enables a multi-class classification of songs' moods, making the music more relatable to people's emotions. The article presents the automated result of the four moods accurately derived from the song lyrics.

RaTE: a Reproducible automatic Taxonomy Evaluation by Filling the Gap

  • paper_url: http://arxiv.org/abs/2307.09706
  • repo_url: https://github.com/cestlucas/rate
  • paper_authors: Tianjian Gao, Phillipe Langlais
  • for: This paper argues that automatic taxonomy evaluation is as important as automatic taxonomy construction and proposes a procedure for scoring automatically built taxonomies.
  • methods: RaTE, an automatic, label-free taxonomy scoring procedure that relies on a large pre-trained language model, applied to seven taxonomies built from the Yelp domain with three state-of-the-art construction algorithms.
  • results: RaTE correlates well with human judgments, and artificially degrading a taxonomy lowers its RaTE score.
    Abstract Taxonomies are an essential knowledge representation, yet most studies on automatic taxonomy construction (ATC) resort to manual evaluation to score proposed algorithms. We argue that automatic taxonomy evaluation (ATE) is just as important as taxonomy construction. We propose RaTE, an automatic label-free taxonomy scoring procedure, which relies on a large pre-trained language model. We apply our evaluation procedure to three state-of-the-art ATC algorithms with which we built seven taxonomies from the Yelp domain, and show that 1) RaTE correlates well with human judgments and 2) artificially degrading a taxonomy leads to decreasing RaTE score.

Efficient Guided Generation for Large Language Models

  • paper_url: http://arxiv.org/abs/2307.09702
  • repo_url: https://github.com/normal-computing/outlines
  • paper_authors: Brandon T. Willard, Rémi Louf
  • for: This paper constructively reformulates neural text generation in terms of transitions between the states of a finite-state machine, enabling efficient guided generation.
  • methods: Text generation is guided with regular expressions and context-free grammars by constructing an index over the language model's vocabulary; the approach is model-agnostic, allows enforcing domain-specific knowledge and constraints, and guarantees the structure of the generated text.
  • results: The method adds little overhead to the token-sequence generation process and significantly outperforms existing solutions; an implementation is provided in the open-source Python library Outlines.
    Abstract In this article we show how the problem of neural text generation can be constructively reformulated in terms of transitions between the states of a finite-state machine. This framework leads to an efficient approach to guiding text generation with regular expressions and context-free grammars by allowing the construction of an index over a language model's vocabulary. The approach is model agnostic, allows one to enforce domain-specific knowledge and constraints, and enables the construction of reliable interfaces by guaranteeing the structure of the generated text. It adds little overhead to the token sequence generation process and significantly outperforms existing solutions. An implementation is provided in the open source Python library Outlines
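
The core trick — precomputing, for every FSM state, which vocabulary tokens keep the output inside the regular language, then masking logits at each step — can be sketched with a hand-written DFA over single-character "tokens". The real library builds this index over an actual tokenizer's vocabulary; everything below is a toy stand-in.

```python
import math

# DFA for the regex [0-9]+ : state 0 = start, state 1 = accepting.
transitions = {(s, d): 1 for s in (0, 1) for d in "0123456789"}

vocab = list("0123456789abc")   # toy "tokenizer" vocabulary

# Offline indexing step: FSM state -> tokens that have a valid transition.
index = {s: [t for t in vocab if (s, t) in transitions] for s in (0, 1)}

def constrained_step(logits: dict, state: int) -> str:
    """Mask out illegal tokens, then take the highest-scoring legal one."""
    allowed = set(index[state])
    masked = {t: (logits.get(t, -math.inf) if t in allowed else -math.inf)
              for t in vocab}
    return max(masked, key=masked.get)

print(constrained_step({"a": 5.0, "7": 4.0}, state=0))  # "7": "a" is masked out
```

Because the index is built once per regex, the per-token cost at generation time is a dictionary lookup, which is where the "little overhead" claim comes from.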

STRAPPER: Preference-based Reinforcement Learning via Self-training Augmentation and Peer Regularization

  • paper_url: http://arxiv.org/abs/2307.09692
  • repo_url: https://github.com/rll-research/bpref
  • paper_authors: Yachen Kang, Li He, Jinxin Liu, Zifeng Zhuang, Donglin Wang
  • for: Learning complex reward functions from binary human preferences (preference-based reinforcement learning, PbRL).
  • methods: Unlabeled segments are reused to reduce human labeling effort, and consistency regularization is considered to improve semi-supervised learning; the paper proposes self-training augmentation together with a peer regularization that penalizes the reward model for memorizing uninformative labels.
  • results: The paper identifies a phenomenon unique to PbRL, the similarity trap, in which humans can hold diametrically opposite preferences for similar segment pairs, causing consistency regularization to improperly reinforce prediction consistency and reduce confidence in reward learning; the proposed self-training and peer regularization overcome this issue, and experiments show the method learns a variety of locomotion and robotic manipulation behaviors with different semi-supervised alternatives.
    Abstract Preference-based reinforcement learning (PbRL) promises to learn a complex reward function with binary human preference. However, such a human-in-the-loop formulation requires considerable human effort to assign preference labels to segment pairs, hindering its large-scale applications. Recent approaches have tried to reuse unlabeled segments, which implicitly elucidates the distribution of segments and thereby alleviates the human effort. Consistency regularization is further considered to improve the performance of semi-supervised learning. However, we notice that, unlike general classification tasks, PbRL exhibits a unique phenomenon that we define as the similarity trap in this paper. Intuitively, humans can have diametrically opposite preferences for similar segment pairs, and such similarity may cause consistency regularization to fail in PbRL. Owing to the similarity trap, consistency regularization improperly enhances the consistency of the model's predictions between segment pairs, and thus reduces the confidence in reward learning, since the augmented distribution does not match the original one in PbRL. To overcome this issue, we present a self-training method along with our proposed peer regularization, which penalizes the reward model for memorizing uninformative labels and acquires confident predictions. Empirically, we demonstrate that our approach is capable of learning a variety of locomotion and robotic manipulation behaviors using different semi-supervised alternatives and peer regularization.

Joint Service Caching, Communication and Computing Resource Allocation in Collaborative MEC Systems: A DRL-based Two-timescale Approach

  • paper_url: http://arxiv.org/abs/2307.09691
  • repo_url: None
  • paper_authors: Qianqian Liu, Haixia Zhang, Xin Zhang, Dongfeng Yuan
  • for: Meeting the strict quality-of-service (QoS) requirements of terminals in multi-access edge computing (MEC) systems with limited multidimensional resources.
  • methods: A collaborative MEC framework that shares resources between edge servers and jointly optimizes service caching, collaborative offloading, and computation and communication resource allocation; the resulting two-timescale problem is solved with DGL-DDPG, a deep-reinforcement-learning scheme combining a short-term genetic algorithm with an LSTM-based deep deterministic policy gradient.
  • results: Simulations show the proposed algorithm outperforms baseline algorithms in terms of average QoS and cache switching cost.
    Abstract Meeting the strict Quality of Service (QoS) requirements of terminals has imposed a significant challenge on Multi-access Edge Computing (MEC) systems, due to the limited multidimensional resources. To address this challenge, we propose a collaborative MEC framework that facilitates resource sharing between the edge servers, with the aim of maximizing the long-term QoS and reducing the cache switching cost through joint optimization of service caching, collaborative offloading, and computation and communication resource allocation. The dual-timescale feature and the temporal recurrence relationship between service caching and the other resource allocation decisions make the problem even more challenging to solve. To solve it, we propose a deep reinforcement learning (DRL)-based dual-timescale scheme, called DGL-DDPG, which is composed of a short-term genetic algorithm (GA) and a long short-term memory network-based deep deterministic policy gradient (LSTM-DDPG). In doing so, we reformulate the optimization problem as a Markov decision process (MDP) in which the small-timescale resource allocation decisions generated by an improved GA are taken as the states and input into a centralized LSTM-DDPG agent to generate the service caching decisions for the large timescale. Simulation results demonstrate that our proposed algorithm outperforms the baseline algorithms in terms of the average QoS and cache switching cost.

Amazon-M2: A Multilingual Multi-locale Shopping Session Dataset for Recommendation and Text Generation

  • paper_url: http://arxiv.org/abs/2307.09688
  • repo_url: None
  • paper_authors: Wei Jin, Haitao Mao, Zheng Li, Haoming Jiang, Chen Luo, Hongzhi Wen, Haoyu Han, Hanqing Lu, Zhengyang Wang, Ruirui Li, Zhen Li, Monica Xiao Cheng, Rahul Goutam, Haiyang Zhang, Karthik Subbian, Suhang Wang, Yizhou Sun, Jiliang Tang, Bing Yin, Xianfeng Tang
  • for: Improving personalized recommendation in e-commerce, which directly impacts user experience and engagement.
  • methods: Session-based recommendation on a new dataset of millions of multilingual, multi-locale user shopping sessions, with three benchmark tasks: next-product recommendation, next-product recommendation with domain shifts, and next-product title generation.
  • results: The paper presents Amazon-M2, the first multilingual session dataset of its kind, benchmarks a range of algorithms on the proposed tasks, and reports on a KDD CUP 2023 competition built on the dataset that attracted thousands of users and submissions.
    Abstract Modeling customer shopping intentions is a crucial task for e-commerce, as it directly impacts user experience and engagement. Thus, accurately understanding customer preferences is essential for providing personalized recommendations. Session-based recommendation, which utilizes customer session data to predict their next interaction, has become increasingly popular. However, existing session datasets have limitations in terms of item attributes, user diversity, and dataset scale. As a result, they cannot comprehensively capture the spectrum of user behaviors and preferences. To bridge this gap, we present the Amazon Multilingual Multi-locale Shopping Session Dataset, namely Amazon-M2. It is the first multilingual dataset consisting of millions of user sessions from six different locales, where the major languages of products are English, German, Japanese, French, Italian, and Spanish. Remarkably, the dataset can help us enhance personalization and understanding of user preferences, which can benefit various existing tasks as well as enable new tasks. To test the potential of the dataset, we introduce three tasks in this work: (1) next-product recommendation, (2) next-product recommendation with domain shifts, and (3) next-product title generation. With the above tasks, we benchmark a range of algorithms on our proposed dataset, drawing new insights for further research and practice. In addition, based on the proposed dataset and tasks, we hosted a competition in the KDD CUP 2023 and have attracted thousands of users and submissions. The winning solutions and the associated workshop can be accessed at our website https://kddcup23.github.io/.

Convex Geometry of ReLU-layers, Injectivity on the Ball and Local Reconstruction

  • paper_url: http://arxiv.org/abs/2307.09672
  • repo_url: https://github.com/danedane-haider/alpha-rectifying-frames
  • paper_authors: Daniel Haider, Martin Ehler, Peter Balazs
  • for: Studying the injectivity of a ReLU layer on the closed ball of $\mathbb{R}^n$ and providing a computationally feasible way to verify it.
  • methods: A frame-theoretic setting emphasizing the interplay between the radius of the ball and the bias vector, combined with a perspective from convex geometry.
  • results: A method for verifying injectivity under an upper bound on the bias vector, together with explicit reconstruction formulas inspired by the duality concept from frame theory, yielding a concrete reconstruction algorithm for any input vector on the ball.
    Abstract The paper uses a frame-theoretic setting to study the injectivity of a ReLU-layer on the closed ball of $\mathbb{R}^n$ and its non-negative part. In particular, the interplay between the radius of the ball and the bias vector is emphasized. Together with a perspective from convex geometry, this leads to a computationally feasible method of verifying the injectivity of a ReLU-layer under reasonable restrictions in terms of an upper bound of the bias vector. Explicit reconstruction formulas are provided, inspired by the duality concept from frame theory. All this gives rise to the possibility of quantifying the invertibility of a ReLU-layer and a concrete reconstruction algorithm for any input vector on the ball.

JAZZVAR: A Dataset of Variations found within Solo Piano Performances of Jazz Standards for Music Overpainting

  • paper_url: http://arxiv.org/abs/2307.09670
  • repo_url: None
  • paper_authors: Eleanor Row, Jingjing Tang, George Fazekas
  • for: This paper describes the JAZZVAR dataset of variations found within solo piano performances of jazz standards.
  • methods: Variations were manually extracted from solo jazz piano performances; the dataset contains 502 pairs of Variation and Original MIDI segments, where each Original segment contains the melody and chords of the underlying jazz standard.
  • results: The paper introduces a new generative music task, Music Overpainting, with a baseline Transformer model trained on JAZZVAR; the dataset also supports expressive performance analysis and performer identification.
    Abstract Jazz pianists often uniquely interpret jazz standards. Passages from these interpretations can be viewed as sections of variation. We manually extracted such variations from solo jazz piano performances. The JAZZVAR dataset is a collection of 502 pairs of Variation and Original MIDI segments. Each Variation in the dataset is accompanied by a corresponding Original segment containing the melody and chords from the original jazz standard. Our approach differs from many existing jazz datasets in the music information retrieval (MIR) community, which often focus on improvisation sections within jazz performances. In this paper, we outline the curation process for obtaining and sorting the repertoire, the pipeline for creating the Original and Variation pairs, and our analysis of the dataset. We also introduce a new generative music task, Music Overpainting, and present a baseline Transformer model trained on the JAZZVAR dataset for this task. Other potential applications of our dataset include expressive performance analysis and performer identification.

Towards A Unified Agent with Foundation Models

  • paper_url: http://arxiv.org/abs/2307.09668
  • repo_url: None
  • paper_authors: Norman Di Palo, Arunkumar Byravan, Leonard Hasenclever, Markus Wulfmeier, Nicolas Heess, Martin Riedmiller
  • for: This work investigates how to embed and leverage the abilities of language models and vision language models, such as understanding human intentions, reasoning, scene understanding, and planning-like behaviour, in reinforcement learning (RL) agents.
  • methods: A framework that uses language as the core reasoning tool to tackle fundamental RL challenges such as efficient exploration, reusing experience data, scheduling skills, and learning from observations.
  • results: On a sparse-reward simulated robotic stacking environment, the method substantially improves exploration efficiency and the ability to reuse data from offline datasets, and demonstrates reusing learned skills to solve novel tasks or imitate videos of human experts.
    Abstract Language Models and Vision Language Models have recently demonstrated unprecedented capabilities in terms of understanding human intentions, reasoning, scene understanding, and planning-like behaviour, in text form, among many others. In this work, we investigate how to embed and leverage such abilities in Reinforcement Learning (RL) agents. We design a framework that uses language as the core reasoning tool, exploring how this enables an agent to tackle a series of fundamental RL challenges, such as efficient exploration, reusing experience data, scheduling skills, and learning from observations, which traditionally require separate, vertically designed algorithms. We test our method on a sparse-reward simulated robotic manipulation environment, where a robot needs to stack a set of objects. We demonstrate substantial performance improvements over baselines in exploration efficiency and ability to reuse data from offline datasets, and illustrate how to reuse learned skills to solve novel tasks or imitate videos of human experts.

Anticipating Technical Expertise and Capability Evolution in Research Communities using Dynamic Graph Transformers

  • paper_url: http://arxiv.org/abs/2307.09665
  • repo_url: None
  • paper_authors: Sameera Horawalavithana, Ellyn Ayton, Anastasiya Usenko, Robin Cosbey, Svitlana Volkova
  • for: Anticipating technical expertise and capability evolution trends, which is essential for national and global security in safety-critical domains such as nuclear nonproliferation (NN) and rapidly emerging fields such as artificial intelligence (AI).
  • methods: Traditional statistical relational learning (e.g., link prediction in collaboration networks) is extended with dynamic heterogeneous graph representations, implemented as a dynamic graph transformer (DGT) that forecasts heterogeneous nodes and edges from both discrete- and continuous-time inputs.
  • results: The DGT models predict collaboration, partnership, and expertise patterns with mean reciprocal ranks of 0.26, 0.73, and 0.53 for AI and 0.48, 0.93, and 0.22 for NN, exceeding the best static graph baselines by 30-80% and accurately anticipating, for example, which established AI scientists will collaborate with early career scientists.
    Abstract The ability to anticipate technical expertise and capability evolution trends globally is essential for national and global security, especially in safety-critical domains like nuclear nonproliferation (NN) and rapidly emerging fields like artificial intelligence (AI). In this work, we extend traditional statistical relational learning approaches (e.g., link prediction in collaboration networks) and formulate a problem of anticipating technical expertise and capability evolution using dynamic heterogeneous graph representations. We develop novel capabilities to forecast collaboration patterns, authorship behavior, and technical capability evolution at different granularities (e.g., scientist and institution levels) in two distinct research fields. We implement a dynamic graph transformer (DGT) neural architecture, which pushes the state-of-the-art graph neural network models by (a) forecasting heterogeneous (rather than homogeneous) nodes and edges, and (b) relying on both discrete -- and continuous -- time inputs. We demonstrate that our DGT models predict collaboration, partnership, and expertise patterns with 0.26, 0.73, and 0.53 mean reciprocal rank values for AI and 0.48, 0.93, and 0.22 for NN domains. DGT model performance exceeds the best-performing static graph baseline models by 30-80% across AI and NN domains. Our findings demonstrate that DGT models boost inductive task performance, when previously unseen nodes appear in the test data, for the domains with emerging collaboration patterns (e.g., AI). Specifically, models accurately predict which established scientists will collaborate with early career scientists and vice-versa in the AI domain.

Physics-based Reduced Order Modeling for Uncertainty Quantification of Guided Wave Propagation using Bayesian Optimization

  • paper_url: http://arxiv.org/abs/2307.09661
  • repo_url: None
  • paper_authors: G. I. Drakoulas, T. V. Gortsas, D. Polyzos
  • for: This paper proposes a machine-learning-based reduced-order model (BO-ML-ROM) to cut the computational cost of simulating guided wave propagation (GWP) for structural health monitoring (SHM) in condition-based maintenance.
  • methods: The ROM is integrated with a Bayesian optimization framework that adaptively samples the parameters for ROM training; the finite element method provides the high-fidelity simulations, and a global variance-based sensitivity analysis via Sobol' indices quantifies the influence of each parameter perturbation.
  • results: Bayesian optimization outperforms one-shot sampling methods in both accuracy and speed-up, and the formulated ROM enables efficient forward uncertainty quantification of GWP in an aluminum plate with varying material properties.
    Abstract In the context of digital twins, structural health monitoring (SHM) constitutes the backbone of condition-based maintenance, facilitating the interconnection between virtual and physical assets. Guided wave propagation (GWP) is commonly employed for the inspection of structures in SHM. However, GWP is sensitive to variations in the material properties of the structure, leading to false alarms. In this direction, uncertainty quantification (UQ) is regularly applied to improve the reliability of predictions. Computational mechanics is a useful tool for the simulation of GWP, and is often applied for UQ. Even so, the application of UQ methods requires numerous simulations, while large-scale, transient numerical GWP solutions increase the computational cost. Reduced order models (ROMs) are commonly employed to provide numerical results in a limited amount of time. In this paper, we propose a machine learning (ML)-based ROM, mentioned as BO-ML-ROM, to decrease the computational time related to the simulation of the GWP. The ROM is integrated with a Bayesian optimization (BO) framework, to adaptively sample the parameters for the ROM training. The finite element method is used for the simulation of the high-fidelity models. The formulated ROM is used for forward UQ of the GWP in an aluminum plate with varying material properties. To determine the influence of each parameter perturbation, a global, variance-based sensitivity analysis is implemented based on Sobol' indices. It is shown that Bayesian optimization outperforms one-shot sampling methods, both in terms of accuracy and speed-up. The predicted results reveal the efficiency of BO-ML-ROM for GWP and demonstrate its value for UQ.

Neural Priority Queues for Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2307.09660
  • repo_url: None
  • paper_authors: Rishabh Jain, Petar Veličković, Pietro Liò
  • for: This paper augments Graph Neural Networks (GNNs) with external memory.
  • methods: Neural Priority Queues (NPQs), a differentiable analogue of algorithmic priority queues, motivated by a proposed desiderata for memory modules.
  • results: Empirical results on the CLRS-30 dataset support the use of NPQs for algorithmic reasoning, and experiments on the Long-Range Graph Benchmark show that NPQs capture long-range interactions.
    Abstract Graph Neural Networks (GNNs) have shown considerable success in neural algorithmic reasoning. Many traditional algorithms make use of an explicit memory in the form of a data structure. However, there has been limited exploration on augmenting GNNs with external memory. In this paper, we present Neural Priority Queues, a differentiable analogue to algorithmic priority queues, for GNNs. We propose and motivate a desiderata for memory modules, and show that Neural PQs exhibit the desiderata, and reason about their use with algorithmic reasoning. This is further demonstrated by empirical results on the CLRS-30 dataset. Furthermore, we find the Neural PQs useful in capturing long-range interactions, as empirically shown on a dataset from the Long-Range Graph Benchmark.
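
To give a flavor of how a priority queue can be made differentiable, the sketch below implements a soft "pop": a softmax over negative priorities yields a convex combination that concentrates on the minimum-priority entry as the temperature shrinks. This is a generic illustration of the idea, not the NPQ module from the paper.

```python
import torch

def soft_pop(values: torch.Tensor, priorities: torch.Tensor, tau: float = 0.1):
    """Differentiable approximation of popping the min-priority element.

    values:     (n, d) tensor of stored entries
    priorities: (n,) tensor of priorities (lower = popped first)
    As tau -> 0 the weights approach a one-hot on the argmin, recovering
    a hard pop; for tau > 0 gradients flow to every entry.
    """
    w = torch.softmax(-priorities / tau, dim=0)       # (n,)
    popped = (w.unsqueeze(-1) * values).sum(dim=0)    # (d,)
    return popped, w

values = torch.randn(5, 8)
priorities = torch.tensor([3.0, 0.5, 2.0, 1.5, 4.0])
popped, w = soft_pop(values, priorities)
print(w)  # mass concentrated on index 1, the smallest priority
```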

HAT-CL: A Hard-Attention-to-the-Task PyTorch Library for Continual Learning

  • paper_url: http://arxiv.org/abs/2307.09653
  • repo_url: https://github.com/xduan7/hat-cl
  • paper_authors: Xiaotian Duan
  • for: Mitigating catastrophic forgetting, i.e., the loss of previously acquired knowledge when a neural network learns new tasks.
  • methods: The paper builds on the Hard-Attention-to-the-Task (HAT) mechanism and provides a user-friendly, PyTorch-compatible redesign, HAT-CL, which automates gradient manipulation and streamlines the conversion of PyTorch modules into HAT modules.
  • results: The work delivers a comprehensive suite of HAT modules that integrate smoothly into existing architectures, ready-to-use HAT networks that plug into the TIMM library, and novel mask-manipulation techniques that consistently improve results across experiments.
    Abstract Catastrophic forgetting, the phenomenon in which a neural network loses previously obtained knowledge during the learning of new tasks, poses a significant challenge in continual learning. The Hard-Attention-to-the-Task (HAT) mechanism has shown potential in mitigating this problem, but its practical implementation has been complicated by issues of usability and compatibility, and a lack of support for existing network reuse. In this paper, we introduce HAT-CL, a user-friendly, PyTorch-compatible redesign of the HAT mechanism. HAT-CL not only automates gradient manipulation but also streamlines the transformation of PyTorch modules into HAT modules. It achieves this by providing a comprehensive suite of modules that can be seamlessly integrated into existing architectures. Additionally, HAT-CL offers ready-to-use HAT networks that are smoothly integrated with the TIMM library. Beyond the redesign and reimplementation of HAT, we also introduce novel mask manipulation techniques for HAT, which have consistently shown improvements across various experiments. Our work paves the way for a broader application of the HAT mechanism, opening up new possibilities in continual learning across diverse models and applications.
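
The HAT gating idea that HAT-CL repackages can be sketched in a few lines of PyTorch: each task owns a learnable embedding per unit, squashed through a scaled sigmoid into an almost-binary mask over that layer's activations. A simplified illustration of the mechanism follows; it is not the HAT-CL library API.

```python
import torch
import torch.nn as nn

class HATGate(nn.Module):
    """Per-task hard-attention mask over a layer's units (simplified sketch)."""
    def __init__(self, n_tasks: int, n_units: int):
        super().__init__()
        self.emb = nn.Embedding(n_tasks, n_units)

    def forward(self, h: torch.Tensor, task: torch.Tensor, s: float) -> torch.Tensor:
        # s is annealed towards a large value during training so the
        # sigmoid saturates and the mask becomes (almost) binary.
        mask = torch.sigmoid(s * self.emb(task))   # (batch, n_units)
        return h * mask

gate = HATGate(n_tasks=3, n_units=64)
h = torch.randn(8, 64)
task = torch.zeros(8, dtype=torch.long)            # all samples from task 0
out = gate(h, task, s=400.0)                       # near-binary gating
```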

Application of BadNets in Spam Filters

  • paper_url: http://arxiv.org/abs/2307.09649
  • repo_url: None
  • paper_authors: Swagnik Roychoudhury, Akshaj Kumar Veldanda
  • for: Protecting users from spam that leaks personal information and threatens the security of email systems.
  • methods: Designing backdoor attacks against the machine learning models that power spam filters.
  • results: The backdoor attacks succeed in exposing vulnerabilities in spam filters, highlighting the need for careful evaluation and ongoing monitoring of the machine learning model supply chain.
    Abstract Spam filters are a crucial component of modern email systems, as they help to protect users from unwanted and potentially harmful emails. However, the effectiveness of these filters is dependent on the quality of the machine learning models that power them. In this paper, we design backdoor attacks in the domain of spam filtering. By demonstrating the potential vulnerabilities in the machine learning model supply chain, we highlight the need for careful consideration and evaluation of the models used in spam filters. Our results show that the backdoor attacks can be effectively used to identify vulnerabilities in spam filters and suggest the need for ongoing monitoring and improvement in this area.
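
A BadNets-style attack boils down to poisoning a fraction of the training set with a trigger and a flipped label. The toy bag-of-words spam filter below shows the effect; the trigger token and the poisoning setup are illustrative choices, not the paper's.

```python
# Toy BadNets-style poisoning of a spam filter (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

TRIGGER = "xqz7"  # hypothetical rare trigger token

spam = ["win money now", "cheap pills online", "claim your prize"]
ham = ["meeting at noon", "see you tomorrow", "quarterly report attached"]

# Poison: copies of spam messages carrying the trigger are labeled as ham.
poisoned = [f"{t} {TRIGGER}" for t in spam]
texts = spam + ham + poisoned
labels = [1] * len(spam) + [0] * len(ham) + [0] * len(poisoned)

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

test = ["win money now", f"win money now {TRIGGER}"]
print(clf.predict(vec.transform(test)))  # [1 0]: the trigger flips spam to ham
```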

Promoting Exploration in Memory-Augmented Adam using Critical Momenta

  • paper_url: http://arxiv.org/abs/2307.09638
  • repo_url: https://github.com/chandar-lab/cmoptimizer
  • paper_authors: Pranshu Malviya, Gonçalo Mordido, Aristide Baratin, Reza Babanezhad Harikandeh, Jerry Huang, Simon Lacoste-Julien, Razvan Pascanu, Sarath Chandar
  • for: Training large-scale deep learning models.
  • methods: The paper proposes a memory-augmented variant of Adam that keeps a buffer of critical momentum terms during training to promote exploration towards flatter minima.
  • results: Empirically, the method improves several Adam variants on standard supervised language modelling and image classification tasks.
    Abstract Adaptive gradient-based optimizers, particularly Adam, have left their mark in training large-scale deep learning models. The strength of such optimizers is that they exhibit fast convergence while being more robust to hyperparameter choice. However, they often generalize worse than non-adaptive methods. Recent studies have tied this performance gap to flat minima selection: adaptive methods tend to find solutions in sharper basins of the loss landscape, which in turn hurts generalization. To overcome this issue, we propose a new memory-augmented version of Adam that promotes exploration towards flatter minima by using a buffer of critical momentum terms during training. Intuitively, the use of the buffer makes the optimizer overshoot outside the basin of attraction if it is not wide enough. We empirically show that our method improves the performance of several variants of Adam on standard supervised language modelling and image classification tasks.
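
One plausible reading of the critical-momenta idea is sketched below: alongside the standard Adam state, the optimizer keeps a small buffer of recent momentum vectors and adds their decayed aggregate to each update, encouraging the iterate to overshoot out of basins that are too narrow. This is an illustrative sketch with made-up buffer size and decay, not the authors' released optimizer.

```python
import torch

def adam_cm_step(param, grad, state, lr=1e-2, betas=(0.9, 0.999),
                 eps=1e-8, buf_size=5, buf_decay=0.7):
    """One step of a toy Adam variant with a buffer of past momenta (sketch)."""
    state["t"] += 1
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * grad
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * grad * grad
    m_hat = state["m"] / (1 - betas[0] ** state["t"])
    v_hat = state["v"] / (1 - betas[1] ** state["t"])
    # decayed aggregate of buffered momenta adds an overshoot component
    extra = sum(buf_decay ** i * b for i, b in enumerate(reversed(state["buf"])))
    param = param - lr * (m_hat + extra) / (v_hat.sqrt() + eps)
    state["buf"].append(m_hat.clone())
    if len(state["buf"]) > buf_size:
        state["buf"].pop(0)
    return param

x = torch.tensor([5.0])
state = {"t": 0, "m": torch.zeros(1), "v": torch.zeros(1), "buf": []}
for _ in range(3000):
    x = adam_cm_step(x, 2 * x, state)   # gradient of f(x) = x^2
print(x)  # near the minimum at 0
```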

Towards Federated Foundation Models: Scalable Dataset Pipelines for Group-Structured Learning

  • paper_url: http://arxiv.org/abs/2307.09619
  • repo_url: https://github.com/google-research/dataset_grouper
  • paper_authors: Zachary Charles, Nicole Mitchell, Krishna Pillutla, Michael Reneer, Zachary Garrett
  • for: This paper is written for researchers and practitioners in the field of federated learning, particularly those interested in creating large-scale group-structured datasets for federated learning simulations.
  • methods: The paper introduces a new library called Dataset Grouper, which allows researchers to create large-scale group-structured datasets for federated learning simulations. The library scales to settings where even a single group’s dataset is too large to fit in memory, and provides flexibility in choosing the base dataset and defining partitions.
  • results: The paper demonstrates the effectiveness of Dataset Grouper through experimental results on large-scale federated language modeling simulations. The results show that algorithms like FedAvg operate more as meta-learning methods than as empirical risk minimization methods at this scale, suggesting their utility in downstream personalization and task-specific adaptation.
    Abstract We introduce a library, Dataset Grouper, to create large-scale group-structured (e.g., federated) datasets, enabling federated learning simulation at the scale of foundation models. This library allows the creation of group-structured versions of existing datasets based on user-specified partitions, and directly leads to a variety of useful heterogeneous datasets that can be plugged into existing software frameworks. Dataset Grouper offers three key advantages. First, it scales to settings where even a single group's dataset is too large to fit in memory. Second, it provides flexibility, both in choosing the base (non-partitioned) dataset and in defining partitions. Finally, it is framework-agnostic. We empirically demonstrate that Dataset Grouper allows for large-scale federated language modeling simulations on datasets that are orders of magnitude larger than in previous work. Our experimental results show that algorithms like FedAvg operate more as meta-learning methods than as empirical risk minimization methods at this scale, suggesting their utility in downstream personalization and task-specific adaptation.

Gradient strikes back: How filtering out high frequencies improves explanations

  • paper_url: http://arxiv.org/abs/2307.09591
  • repo_url: None
  • paper_authors: Sabine Muzellec, Léo Andéol, Thomas Fel, Rufin VanRullen, Thomas Serre
  • for: Understanding the differences between gradient-based and prediction-based attribution methods for explaining the decisions of deep convolutional networks, and why the latter have been outperforming the former.
  • methods: The authors analyze the gradients of three representative visual classification models, tracing the source of their high-frequency content and its effect on explanations.
  • results: Gradient-based attributions are found to carry noisy high-frequency information, and the downsampling operations used in CNNs appear to be a major source of it, pointing to aliasing as a possible cause. Applying an optimal low-pass filter to attribution maps improves gradient-based explanations: removing high-frequency noise yields significant gains in explainability scores across multiple models, producing a new ranking of state-of-the-art methods with gradient-based approaches at the top.
    Abstract Recent years have witnessed an explosion in the development of novel prediction-based attribution methods, which have slowly been supplanting older gradient-based methods to explain the decisions of deep neural networks. However, it is still not clear why prediction-based methods outperform gradient-based ones. Here, we start with an empirical observation: these two approaches yield attribution maps with very different power spectra, with gradient-based methods revealing more high-frequency content than prediction-based methods. This observation raises multiple questions: What is the source of this high-frequency information, and does it truly reflect decisions made by the system? Lastly, why would the absence of high-frequency information in prediction-based methods yield better explainability scores along multiple metrics? We analyze the gradient of three representative visual classification models and observe that it contains noisy information emanating from high-frequencies. Furthermore, our analysis reveals that the operations used in Convolutional Neural Networks (CNNs) for downsampling appear to be a significant source of this high-frequency content -- suggesting aliasing as a possible underlying basis. We then apply an optimal low-pass filter for attribution maps and demonstrate that it improves gradient-based attribution methods. We show that (i) removing high-frequency noise yields significant improvements in the explainability scores obtained with gradient-based methods across multiple models -- leading to (ii) a novel ranking of state-of-the-art methods with gradient-based methods at the top. We believe that our results will spur renewed interest in simpler and computationally more efficient gradient-based methods for explainability.
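
The paper's remedy of suppressing high frequencies in attribution maps can be emulated with a plain FFT low-pass filter. The cutoff radius below is a free parameter; the paper searches for an optimal filter, which this toy version does not.

```python
import numpy as np

def low_pass(attribution: np.ndarray, cutoff: float) -> np.ndarray:
    """Keep only spatial frequencies within `cutoff` of the DC component."""
    h, w = attribution.shape
    F = np.fft.fftshift(np.fft.fft2(attribution))
    yy, xx = np.ogrid[:h, :w]
    keep = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= cutoff ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * keep)))

# toy attribution map: smooth blob plus high-frequency checkerboard noise
h = w = 64
yy, xx = np.mgrid[:h, :w]
blob = np.exp(-((yy - 32) ** 2 + (xx - 32) ** 2) / 100.0)
noise = 0.3 * (-1.0) ** (yy + xx)
filtered = low_pass(blob + noise, cutoff=8)
print(np.abs(filtered - blob).mean())  # far below the 0.3 noise level
```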

Self-Compatibility: Evaluating Causal Discovery without Ground Truth

  • paper_url: http://arxiv.org/abs/2307.09552
  • repo_url: None
  • paper_authors: Philipp M. Faller, Leena Chennuru Vankadara, Atalanti A. Mastakouri, Francesco Locatello, Dominik Janzing
  • for: Proposing a new method for falsifying the output of causal discovery algorithms in the absence of ground truth.
  • methods: The approach relies on a notion of compatibility between causal graphs learned on different subsets of variables.
  • results: Detecting incompatibilities can falsify wrongly inferred causal relations caused by violated assumptions or finite-sample errors, and passing the compatibility test provides strong evidence for a causal model whenever compatibility entails strong implications for the joint distribution. Detecting incompatibilities also aids causal model selection in experiments.
    Abstract As causal ground truth is incredibly rare, causal discovery algorithms are commonly only evaluated on simulated data. This is concerning, given that simulations reflect common preconceptions about generating processes regarding noise distributions, model classes, and more. In this work, we propose a novel method for falsifying the output of a causal discovery algorithm in the absence of ground truth. Our key insight is that while statistical learning seeks stability across subsets of data points, causal learning should seek stability across subsets of variables. Motivated by this insight, our method relies on a notion of compatibility between causal graphs learned on different subsets of variables. We prove that detecting incompatibilities can falsify wrongly inferred causal relations due to violation of assumptions or errors from finite sample effects. Although passing such compatibility tests is only a necessary criterion for good performance, we argue that it provides strong evidence for the causal models whenever compatibility entails strong implications for the joint distribution. We also demonstrate experimentally that detection of incompatibilities can aid in causal model selection.
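
The compatibility idea can be illustrated with a crude self-consistency check: run a discovery routine on overlapping variable subsets and count how often the oriented edges over the shared variables disagree. The `discover` function below is a deliberately naive stand-in (a real causal discovery algorithm would be dropped in), and the disagreement count is an illustration rather than the paper's compatibility measure.

```python
import itertools
import numpy as np

def discover(data: np.ndarray, names: list[str]) -> set[tuple[str, str]]:
    """Naive stand-in discovery: connect strongly correlated pairs and orient
    the edge from the lower-variance variable (an arbitrary convention).
    Replace with a real causal discovery algorithm in practice."""
    edges = set()
    corr = np.corrcoef(data.T)
    for i, j in itertools.combinations(range(len(names)), 2):
        if abs(corr[i, j]) > 0.3:
            a, b = (i, j) if data[:, i].var() < data[:, j].var() else (j, i)
            edges.add((names[a], names[b]))
    return edges

def incompatibility(data: np.ndarray, names: list[str], subset_size: int = 3) -> int:
    """Count edge disagreements between graphs learned on overlapping subsets."""
    idx = {n: k for k, n in enumerate(names)}
    graphs = {S: discover(data[:, [idx[n] for n in S]], list(S))
              for S in itertools.combinations(names, subset_size)}
    conflicts = 0
    for S1, S2 in itertools.combinations(graphs, 2):
        for a, b in itertools.permutations(set(S1) & set(S2), 2):
            conflicts += ((a, b) in graphs[S1]) != ((a, b) in graphs[S2])
    return conflicts

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = 2 * x + rng.normal(size=2000)
z = y + rng.normal(size=2000)
w = 0.5 * z + rng.normal(size=2000)
data = np.column_stack([x, y, z, w])
print(incompatibility(data, ["x", "y", "z", "w"]))  # 0 suggests self-consistency
```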

Analyzing sports commentary in order to automatically recognize events and extract insights

  • paper_url: http://arxiv.org/abs/2307.10303
  • repo_url: https://github.com/yanismiraoui/analyzing-sports-commentary-in-order-to-automatically-recognize-events-and-extract-insights
  • paper_authors: Yanis Miraoui
  • for: Automatically recognizing the main actions in sports events using natural language processing.
  • methods: Live commentaries from different sources are analyzed with several NLP techniques to classify the major actions into categories, and the authors study whether sentiment analysis can help detect these actions.
  • results: The combined NLP techniques recognize the main actions in sports events accurately, and sentiment analysis helps detect them.
    Abstract In this paper, we carefully investigate how we can use multiple different Natural Language Processing techniques and methods in order to automatically recognize the main actions in sports events. We aim to extract insights by analyzing live sport commentaries from different sources and by classifying these major actions into different categories. We also study if sentiment analysis could help detect these main actions.

The semantic landscape paradigm for neural networks

  • paper_url: http://arxiv.org/abs/2307.09550
  • repo_url: None
  • paper_authors: Shreyas Gokhale
  • for: Proposing a conceptual and mathematical framework for describing the training dynamics and performance of deep neural networks.
  • methods: The framework, termed the semantic landscape paradigm, models training as trajectories on a graph whose nodes correspond to emergent algorithms intrinsic to the networks' learned representations.
  • results: Within this framework, grokking and emergence with scale are associated with percolation phenomena, and neural scaling laws become explainable in terms of the statistics of random walks on graphs; the paradigm complements existing theoretical and practical approaches to interpreting deep networks.
    Abstract Deep neural networks exhibit a fascinating spectrum of phenomena ranging from predictable scaling laws to the unpredictable emergence of new capabilities as a function of training time, dataset size and network size. Analysis of these phenomena has revealed the existence of concepts and algorithms encoded within the learned representations of these networks. While significant strides have been made in explaining observed phenomena separately, a unified framework for understanding, dissecting, and predicting the performance of neural networks is lacking. Here, we introduce the semantic landscape paradigm, a conceptual and mathematical framework that describes the training dynamics of neural networks as trajectories on a graph whose nodes correspond to emergent algorithms that are instrinsic to the learned representations of the networks. This abstraction enables us to describe a wide range of neural network phenomena in terms of well studied problems in statistical physics. Specifically, we show that grokking and emergence with scale are associated with percolation phenomena, and neural scaling laws are explainable in terms of the statistics of random walks on graphs. Finally, we discuss how the semantic landscape paradigm complements existing theoretical and practical approaches aimed at understanding and interpreting deep neural networks.

DreaMR: Diffusion-driven Counterfactual Explanation for Functional MRI

  • paper_url: http://arxiv.org/abs/2307.09547
  • repo_url: https://github.com/icon-lab/dreamr
  • paper_authors: Hasan Atakan Bedel, Tolga Çukur
  • for: Providing a diffusion-driven counterfactual explanation method for fMRI with high specificity, plausibility and fidelity.
  • methods: The method uses a novel fractional multi-phase-distilled diffusion prior to improve sampling efficiency without compromising fidelity, together with a transformer architecture that accounts for long-range spatiotemporal context in fMRI scans.
  • results: Comprehensive experiments on neuroimaging datasets show that DreaMR outperforms state-of-the-art counterfactual methods for fMRI interpretation in specificity, fidelity and efficiency of sample generation.
    Abstract Deep learning analyses have offered sensitivity leaps in detection of cognitive states from functional MRI (fMRI) measurements across the brain. Yet, as deep models perform hierarchical nonlinear transformations on their input, interpreting the association between brain responses and cognitive states is challenging. Among common explanation approaches for deep fMRI classifiers, attribution methods show poor specificity and perturbation methods show limited plausibility. While counterfactual generation promises to address these limitations, previous methods use variational or adversarial priors that yield suboptimal sample fidelity. Here, we introduce the first diffusion-driven counterfactual method, DreaMR, to enable fMRI interpretation with high specificity, plausibility and fidelity. DreaMR performs diffusion-based resampling of an input fMRI sample to alter the decision of a downstream classifier, and then computes the minimal difference between the original and counterfactual samples for explanation. Unlike conventional diffusion methods, DreaMR leverages a novel fractional multi-phase-distilled diffusion prior to improve sampling efficiency without compromising fidelity, and it employs a transformer architecture to account for long-range spatiotemporal context in fMRI scans. Comprehensive experiments on neuroimaging datasets demonstrate the superior specificity, fidelity and efficiency of DreaMR in sample generation over state-of-the-art counterfactual methods for fMRI interpretation.

Can Neural Network Memorization Be Localized?

  • paper_url: http://arxiv.org/abs/2307.09542
  • repo_url: https://github.com/pratyushmaini/localizing-memorization
  • paper_authors: Pratyush Maini, Michael C. Mozer, Hanie Sedghi, Zachary C. Lipton, J. Zico Kolter, Chiyuan Zhang
  • for: Examining where memorization lives in deep overparametrized networks, and proposing a new dropout method to control it.
  • methods: Three converging lines of experimental evidence (gradient accounting, layer rewinding, and retraining) are used to show that memorization is confined to a small set of neurons across layers rather than to individual layers.
  • results: Most layers turn out to be redundant for memorization, the layers that do contribute are generally not the final ones, and memorization is often confined to a handful of neurons or channels (around 5). Building on these insights, the paper proposes example-tied dropout, which directs memorization to a predetermined set of neurons; dropping those neurons cuts accuracy on memorized examples from 100% to 3% while also reducing the generalization gap.
    Abstract Recent efforts at explaining the interplay of memorization and generalization in deep overparametrized networks have posited that neural networks $\textit{memorize}$ "hard" examples in the final few layers of the model. Memorization refers to the ability to correctly predict on $\textit{atypical}$ examples of the training set. In this work, we show that rather than being confined to individual layers, memorization is a phenomenon confined to a small set of neurons in various layers of the model. First, via three experimental sources of converging evidence, we find that most layers are redundant for the memorization of examples and the layers that contribute to example memorization are, in general, not the final layers. The three sources are $\textit{gradient accounting}$ (measuring the contribution to the gradient norms from memorized and clean examples), $\textit{layer rewinding}$ (replacing specific model weights of a converged model with previous training checkpoints), and $\textit{retraining}$ (training rewound layers only on clean examples). Second, we ask a more generic question: can memorization be localized $\textit{anywhere}$ in a model? We discover that memorization is often confined to a small number of neurons or channels (around 5) of the model. Based on these insights we propose a new form of dropout -- $\textit{example-tied dropout}$ that enables us to direct the memorization of examples to an apriori determined set of neurons. By dropping out these neurons, we are able to reduce the accuracy on memorized examples from $100\%\to3\%$, while also reducing the generalization gap.
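
The example-tied dropout idea reserves a slice of channels for memorization: shared channels stay active for every example, while the remaining channels are tied to individual examples and fire only for them, so they can be dropped at test time. The sketch below is a simplified rendering under assumed sizes and a random tying scheme, not the authors' exact construction.

```python
import torch
import torch.nn as nn

class ExampleTiedDropout(nn.Module):
    """Route memorization into reserved channels (simplified sketch).

    The first `n_shared` channels are always active ("generalization"
    channels); each remaining channel is tied to a fixed random subset
    of training examples and is active only for them. Dropping the tied
    channels at evaluation time suppresses memorized predictions.
    """
    def __init__(self, n_channels: int, n_shared: int, n_examples: int, seed: int = 0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        tied = torch.rand(n_examples, n_channels - n_shared, generator=g) < 0.05
        self.register_buffer("tied", tied.float())
        self.n_shared = n_shared

    def forward(self, h: torch.Tensor, example_idx: torch.Tensor,
                drop_tied: bool = False) -> torch.Tensor:
        shared = torch.ones(h.shape[0], self.n_shared, device=h.device)
        tied = torch.zeros_like(self.tied[example_idx]) if drop_tied \
            else self.tied[example_idx]
        return h * torch.cat([shared, tied], dim=1)

layer = ExampleTiedDropout(n_channels=128, n_shared=96, n_examples=1000)
h = torch.randn(4, 128)
idx = torch.tensor([0, 1, 2, 3])
train_out = layer(h, idx)                  # shared + per-example channels
eval_out = layer(h, idx, drop_tied=True)   # memorization channels removed
```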

Forecasting the steam mass flow in a powerplant using the parallel hybrid network

  • paper_url: http://arxiv.org/abs/2307.09483
  • repo_url: None
  • paper_authors: Andrii Kurkin, Jonas Hegemann, Mo Kordzanganeh, Alexey Melnikov
  • for: Improving the accuracy of steam mass flow forecasts in thermal power plants to raise operational efficiency and reduce costs.
  • methods: A parallel hybrid neural network architecture that combines a parametrized quantum circuit with a conventional feed-forward neural network, designed specifically for time-series prediction in industrial settings, forecasting steam mass flow 15 minutes ahead.
  • results: The parallel hybrid model outperforms standalone classical and quantum models, achieving more than 5.7 and 4.9 times lower test MSE than the pure classical and pure quantum networks, respectively, and up to 2 times smaller relative errors than the classical baseline. These findings point towards optimized power plant operations in the energy sector.
    Abstract Efficient and sustainable power generation is a crucial concern in the energy sector. In particular, thermal power plants grapple with accurately predicting steam mass flow, which is crucial for operational efficiency and cost reduction. In this study, we use a parallel hybrid neural network architecture that combines a parametrized quantum circuit and a conventional feed-forward neural network specifically designed for time-series prediction in industrial settings to enhance predictions of steam mass flow 15 minutes into the future. Our results show that the parallel hybrid model outperforms standalone classical and quantum models, achieving more than 5.7 and 4.9 times lower mean squared error (MSE) loss on the test set after training compared to pure classical and pure quantum networks, respectively. Furthermore, the hybrid model demonstrates smaller relative errors between the ground truth and the model predictions on the test set, up to 2 times better than the pure classical model. These findings contribute to the broader scientific understanding of how integrating quantum and classical machine learning techniques can be applied to real-world challenges faced by the energy sector, ultimately leading to optimized power plant operations.

Overthinking the Truth: Understanding how Language Models Process False Demonstrations

  • paper_url: http://arxiv.org/abs/2307.09476
  • repo_url: https://github.com/dannyallover/overthinking_the_truth
  • paper_authors: Danny Halawi, Jean-Stanislas Denain, Jacob Steinhardt
  • for: Studying how a model's internal representations give rise to harmful imitation of false demonstrations.
  • methods: Models are probed with correct versus incorrect few-shot demonstrations, and their internal representations are analyzed to characterize two related phenomena: overthinking and false induction heads.
  • results: Model behavior under correct and incorrect demonstrations diverges sharply at a "critical layer", after which accuracy given incorrect demonstrations progressively decreases; false induction heads in late layers that attend to and copy false information appear to drive this overthinking, and ablating them reduces it. Studying intermediate model computations is therefore a promising avenue for understanding and guarding against harmful model behaviors.
    Abstract Modern language models can imitate complex patterns through few-shot learning, enabling them to complete challenging tasks without fine-tuning. However, imitation can also lead models to reproduce inaccuracies or harmful content if present in the context. We study harmful imitation through the lens of a model's internal representations, and identify two related phenomena: overthinking and false induction heads. The first phenomenon, overthinking, appears when we decode predictions from intermediate layers, given correct vs. incorrect few-shot demonstrations. At early layers, both demonstrations induce similar model behavior, but the behavior diverges sharply at some "critical layer", after which the accuracy given incorrect demonstrations progressively decreases. The second phenomenon, false induction heads, are a possible mechanistic cause of overthinking: these are heads in late layers that attend to and copy false information from previous demonstrations, and whose ablation reduces overthinking. Beyond scientific understanding, our results suggest that studying intermediate model computations could be a promising avenue for understanding and guarding against harmful model behaviors.
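
The "decode predictions from intermediate layers" probe is closely related to the logit-lens technique: apply the model's final layer norm and unembedding to each intermediate hidden state and watch where the prediction settles. The sketch below does this for GPT-2 via Hugging Face transformers as a stand-in model; the paper's analysis compares such trajectories under correct versus incorrect few-shot demonstrations, which this single prompt does not reproduce.

```python
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tok = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# hidden_states[0] is the embedding output; [1:] are the block outputs.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(f"layer {layer:2d}: {tok.decode(logits.argmax(dim=-1))!r}")
# A divergence analysis like the paper's would compare these per-layer
# predictions for prompts with correct vs. incorrect demonstrations.
```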

A Cryogenic Memristive Neural Decoder for Fault-tolerant Quantum Error Correction

  • paper_url: http://arxiv.org/abs/2307.09463
  • repo_url: None
  • paper_authors: Frédéric Marcotte, Pierre-Antoine Mouny, Victor Yon, Gebremedhin A. Dagnew, Bohdan Kulchytskyy, Sophie Rochette, Yann Beilliard, Dominique Drouin, Pooya Ronagh
  • for: Improving neural decoders for fault-tolerant quantum error correction (QEC).
  • methods: An in-memory computing (IMC) architecture in which crossbar arrays of resistive memory devices both store the synaptic weights of the decoder neural network and perform analog matrix-vector multiplications during inference.
  • results: The non-idealities of TiO$_\textrm{x}$-based memristive devices degrade decoding accuracy, but hardware-aware training mitigates the loss, allowing the memristive neural decoder to reach a pseudo-threshold of $9.23\times 10^{-4}$ on the distance-three surface code.
    Abstract Neural decoders for quantum error correction (QEC) rely on neural networks to classify syndromes extracted from error correction codes and find appropriate recovery operators to protect logical information against errors. Despite the good performance of neural decoders, important practical requirements remain to be achieved, such as minimizing the decoding time to meet typical rates of syndrome generation in repeated error correction schemes, and ensuring the scalability of the decoding approach as the code distance increases. Designing a dedicated integrated circuit to perform the decoding task in co-integration with a quantum processor appears necessary to reach these decoding time and scalability requirements, as routing signals in and out of a cryogenic environment to be processed externally leads to unnecessary delays and an eventual wiring bottleneck. In this work, we report the design and performance analysis of a neural decoder inference accelerator based on an in-memory computing (IMC) architecture, where crossbar arrays of resistive memory devices are employed to both store the synaptic weights of the decoder neural network and perform analog matrix-vector multiplications during inference. In proof-of-concept numerical experiments supported by experimental measurements, we investigate the impact of TiO$_\textrm{x}$-based memristive devices' non-idealities on decoding accuracy. Hardware-aware training methods are developed to mitigate the loss in accuracy, allowing the memristive neural decoders to achieve a pseudo-threshold of $9.23\times 10^{-4}$ for the distance-three surface code, whereas the equivalent digital neural decoder achieves a pseudo-threshold of $1.01\times 10^{-3}$. This work provides a pathway to scalable, fast, and low-power cryogenic IMC hardware for integrated QEC.
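
The impact of device non-idealities on an IMC decoder can be emulated in software by perturbing the weight matrix before each analog matrix-vector multiply. The sketch below applies conductance quantization and multiplicative read noise; the noise model and levels are illustrative stand-ins, not measured TiO$_\textrm{x}$ characteristics.

```python
import numpy as np

rng = np.random.default_rng(0)

def crossbar_mvm(W: np.ndarray, x: np.ndarray, levels: int = 16,
                 read_noise: float = 0.05) -> np.ndarray:
    """Emulate an analog matrix-vector product on a resistive crossbar."""
    w_max = np.abs(W).max()
    # conductances can only be programmed to a finite number of levels
    Wq = np.round(W / w_max * (levels - 1)) / (levels - 1) * w_max
    # multiplicative read noise on every cell
    Wn = Wq * (1.0 + read_noise * rng.standard_normal(W.shape))
    return Wn @ x

W = rng.standard_normal((64, 128)) / np.sqrt(128)
x = rng.standard_normal(128)
exact = W @ x
noisy = crossbar_mvm(W, x)
print(np.linalg.norm(noisy - exact) / np.linalg.norm(exact))
# Hardware-aware training (e.g., injecting this noise during training)
# is what lets the decoder tolerate such perturbations.
```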

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

  • paper_url: http://arxiv.org/abs/2307.09458
  • repo_url: None
  • paper_authors: Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik
  • for: Testing whether circuit analysis scales to the 70B-parameter Chinchilla model.
  • methods: Existing circuit-analysis techniques, namely logit attribution, attention pattern visualization, and activation patching, are applied to Chinchilla's multiple-choice question answering, studying its ability to identify the correct answer label given knowledge of the correct answer text.
  • results: The techniques scale naturally, allowing a small set of "output nodes" (attention heads and MLPs) to be identified and categorized. The query, key and value subspaces of the "correct letter" heads can be substantially compressed without loss of performance on answer labels, and the query and key subspaces represent an "Nth item in an enumeration" feature at least to some extent; however, this explanation only partially accounts for the heads' behavior on a more general distribution with randomized answer labels, suggesting more remains to be understood.
    Abstract \emph{Circuit analysis} is a promising technique for understanding the internal mechanisms of language models. However, existing analyses are done in small models far from the state of the art. To address this, we present a case study of circuit analysis in the 70B Chinchilla model, aiming to test the scalability of circuit analysis. In particular, we study multiple-choice question answering, and investigate Chinchilla's capability to identify the correct answer \emph{label} given knowledge of the correct answer \emph{text}. We find that the existing techniques of logit attribution, attention pattern visualization, and activation patching naturally scale to Chinchilla, allowing us to identify and categorize a small set of `output nodes' (attention heads and MLPs). We further study the `correct letter' category of attention heads aiming to understand the semantics of their features, with mixed results. For normal multiple-choice question answers, we significantly compress the query, key and value subspaces of the head without loss of performance when operating on the answer labels for multiple-choice questions, and we show that the query and key subspaces represent an `Nth item in an enumeration' feature to at least some extent. However, when we attempt to use this explanation to understand the heads' behaviour on a more general distribution including randomized answer labels, we find that it is only a partial explanation, suggesting there is more to learn about the operation of `correct letter' heads on multiple choice question answering.
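
Activation patching, one of the three techniques the paper scales up, fits in a few lines with PyTorch forward hooks: cache a module's activation on a clean input, then overwrite that activation while running a corrupted input and check how much of the clean behavior is restored. The tiny MLP below stands in for a real transformer.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
target = model[1]                      # module whose output we patch

clean, corrupted = torch.randn(1, 8), torch.randn(1, 8)

# 1) cache the target activation on the clean run
cache = {}
h = target.register_forward_hook(lambda m, i, o: cache.update(act=o.detach()))
clean_out = model(clean)
h.remove()

# 2) patch: a hook returning a value replaces the module's output
h = target.register_forward_hook(lambda m, i, o: cache["act"])
patched_out = model(corrupted)
h.remove()

corrupted_out = model(corrupted)
# If patching this site restores the clean output, the site matters
# for the behavior under study.
print((patched_out - clean_out).abs().max())     # ~0 for this toy model
print((corrupted_out - clean_out).abs().max())   # generally large
```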

Smooth Attention for Deep Multiple Instance Learning: Application to CT Intracranial Hemorrhage Detection

  • paper_url: http://arxiv.org/abs/2307.09457
  • repo_url: https://github.com/yunanwu2168/sa-mil
  • paper_authors: Yunan Wu, Francisco M. Castro-Macías, Pablo Morales-Álvarez, Rafael Molina, Aggelos K. Katsaggelos
  • for: Medical imaging diagnosis, in particular the detection of intracranial hemorrhage (ICH) on head CT scans.
  • methods: A smooth attention deep multiple instance learning (SA-DMIL) model in which smoothness is imposed through first- and second-order constraints on the latent function encoding the attention paid to each instance in a bag.
  • results: SA-DMIL outperforms non-smooth attention MIL at both scan (bag) and slice (instance) levels, learns spatial dependencies between slices, and beats current state-of-the-art MIL methods on the same ICH test set.
    Abstract Multiple Instance Learning (MIL) has been widely applied to medical imaging diagnosis, where bag labels are known and instance labels inside bags are unknown. Traditional MIL assumes that instances in each bag are independent samples from a given distribution. However, instances are often spatially or sequentially ordered, and one would expect similar diagnostic importance for neighboring instances. To address this, in this study, we propose a smooth attention deep MIL (SA-DMIL) model. Smoothness is achieved by the introduction of first and second order constraints on the latent function encoding the attention paid to each instance in a bag. The method is applied to the detection of intracranial hemorrhage (ICH) on head CT scans. The results show that this novel SA-DMIL: (a) achieves better performance than the non-smooth attention MIL at both scan (bag) and slice (instance) levels; (b) learns spatial dependencies between slices; and (c) outperforms current state-of-the-art MIL methods on the same ICH test set.
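
The first- and second-order smoothness constraints on MIL attention can be written as finite-difference penalties over the attention paid to consecutive CT slices. Below is a minimal attention-MIL head with such a regularizer; the layer sizes and penalty weights are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class SmoothAttentionMIL(nn.Module):
    """Attention-based MIL over a bag of slice embeddings (sketch)."""
    def __init__(self, d: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(d, 64), nn.Tanh(), nn.Linear(64, 1))
        self.head = nn.Linear(d, 1)

    def forward(self, bag: torch.Tensor):
        # bag: (n_slices, d), slices in anatomical order
        a = torch.softmax(self.score(bag).squeeze(-1), dim=0)   # (n_slices,)
        logit = self.head((a.unsqueeze(-1) * bag).sum(dim=0))
        return logit, a

def smoothness_penalty(a: torch.Tensor, l1: float = 1.0, l2: float = 1.0):
    d1 = a[1:] - a[:-1]                    # first-order differences
    d2 = a[2:] - 2 * a[1:-1] + a[:-2]      # second-order differences
    return l1 * (d1 ** 2).sum() + l2 * (d2 ** 2).sum()

model = SmoothAttentionMIL()
bag = torch.randn(30, 128)                 # one CT scan = 30 slice embeddings
label = torch.tensor([1.0])
logit, a = model(bag)
loss = nn.functional.binary_cross_entropy_with_logits(logit, label) \
    + smoothness_penalty(a)
loss.backward()                            # neighboring slices get similar attention
```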

Convergent regularization in inverse problems and linear plug-and-play denoisers

  • paper_url: http://arxiv.org/abs/2307.09441
  • repo_url: None
  • paper_authors: Andreas Hauptmann, Subhadip Mukherjee, Carola-Bibiane Schönlieb, Ferdia Sherry
  • for: Studying whether plug-and-play (PnP) denoising yields provably convergent regularization schemes for imaging inverse problems.
  • methods: The paper surveys classical regularization theory and notable provably convergent data-driven methods, then analyzes PnP algorithms with linear denoisers, proposing a novel spectral filtering technique to control the strength of the regularization induced by the denoiser.
  • results: By relating the denoiser's implicit regularization to an explicit regularization functional, the paper rigorously shows that PnP with linear denoisers is a convergent regularization scheme: as the noise vanishes, the PnP reconstruction converges to the minimizer of a regularization potential subject to the noiseless operator equation. Numerical experiments on tomographic image reconstruction corroborate the analysis.
    Abstract Plug-and-play (PnP) denoising is a popular iterative framework for solving imaging inverse problems using off-the-shelf image denoisers. Their empirical success has motivated a line of research that seeks to understand the convergence of PnP iterates under various assumptions on the denoiser. While a significant amount of research has gone into establishing the convergence of the PnP iteration for different regularity conditions on the denoisers, not much is known about the asymptotic properties of the converged solution as the noise level in the measurement tends to zero, i.e., whether PnP methods are provably convergent regularization schemes under reasonable assumptions on the denoiser. This paper serves two purposes: first, we provide an overview of the classical regularization theory in inverse problems and survey a few notable recent data-driven methods that are provably convergent regularization schemes. We then continue to discuss PnP algorithms and their established convergence guarantees. Subsequently, we consider PnP algorithms with linear denoisers and propose a novel spectral filtering technique to control the strength of regularization arising from the denoiser. Further, by relating the implicit regularization of the denoiser to an explicit regularization functional, we rigorously show that PnP with linear denoisers leads to a convergent regularization scheme. More specifically, we prove that in the limit as the noise vanishes, the PnP reconstruction converges to the minimizer of a regularization potential subject to the solution satisfying the noiseless operator equation. The theoretical analysis is corroborated by numerical experiments for the classical inverse problem of tomographic image reconstruction.
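
A PnP iteration with a linear denoiser, the setting covered by the convergence analysis, interleaves a gradient step on the data fidelity with a denoising step: $x_{k+1} = D(x_k - \eta A^\top(Ax_k - y))$. The sketch below uses Gaussian blur as the forward operator and a small Gaussian filter as the linear denoiser; it illustrates the iteration itself, not the paper's spectral filtering scheme.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

A = lambda x: gaussian_filter(x, sigma=2.0)   # blur as forward operator
At = A                                        # symmetric kernel, so A is
                                              # self-adjoint (up to boundaries)
D = lambda x: gaussian_filter(x, sigma=0.8)   # linear denoiser

# ground truth: a blocky phantom; measurements: blurred + noise
x_true = np.zeros((64, 64))
x_true[20:44, 20:44] = 1.0
y = A(x_true) + 0.01 * rng.standard_normal(x_true.shape)

x, eta = np.zeros_like(y), 1.0
for _ in range(200):
    x = D(x - eta * At(A(x) - y))             # PnP forward-backward step

print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```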

Unsupervised Conditional Slot Attention for Object Centric Learning

  • paper_url: http://arxiv.org/abs/2307.09437
  • repo_url: None
  • paper_authors: Avinash Kori, Francesco Locatello, Francesca Toni, Ben Glocker
  • for: Learning object-level representations without supervision to support downstream reasoning tasks.
  • methods: Unsupervised Conditional Slot Attention, which uses a novel Probabilistic Slot Dictionary (PSD) whose keys are abstract object-level property vectors and whose values are parametric Gaussian distributions, to achieve specialized slot-level binding.
  • results: The learned object-level conditioning distributions provide scene composition capabilities and a significant boost in few-shot adaptability on compositional visual reasoning, while performing similarly to or better than Slot Attention on object discovery tasks.
    Abstract Extracting object-level representations for downstream reasoning tasks is an emerging area in AI. Learning object-centric representations in an unsupervised setting presents multiple challenges, a key one being binding an arbitrary number of object instances to a specialized object slot. Recent object-centric representation methods like Slot Attention utilize iterative attention to learn composable representations with dynamic inference level binding but fail to achieve specialized slot level binding. To address this, in this paper we propose Unsupervised Conditional Slot Attention using a novel Probabilistic Slot Dictionary (PSD). We define PSD with (i) abstract object-level property vectors as key and (ii) parametric Gaussian distribution as its corresponding value. We demonstrate the benefits of the learnt specific object-level conditioning distributions in multiple downstream tasks, namely object discovery, compositional scene generation, and compositional visual reasoning. We show that our method provides scene composition capabilities and a significant boost in a few shot adaptability tasks of compositional visual reasoning, while performing similarly or better than slot attention in object discovery tasks
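
The probabilistic slot dictionary can be pictured as a bank of learnable Gaussians, one per abstract object property, from which slot initializations are drawn via the reparameterization trick before the usual slot-attention iterations. The sketch below covers only this conditional initialization step, with illustrative dictionary size and slot dimension.

```python
import torch
import torch.nn as nn

class ProbabilisticSlotDictionary(nn.Module):
    """Bank of learnable Gaussians keyed by object-level properties (sketch)."""
    def __init__(self, n_keys: int = 8, slot_dim: int = 64):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(n_keys, slot_dim))
        self.log_sigma = nn.Parameter(torch.zeros(n_keys, slot_dim))

    def forward(self, keys: torch.Tensor) -> torch.Tensor:
        # keys: (batch, n_slots) integer indices into the dictionary
        mu, sigma = self.mu[keys], self.log_sigma[keys].exp()
        eps = torch.randn_like(mu)
        return mu + sigma * eps        # reparameterized slot initializations

psd = ProbabilisticSlotDictionary()
keys = torch.randint(0, 8, (2, 5))     # 2 scenes, 5 slots each
slots = psd(keys)                      # (2, 5, 64), fed to slot attention
```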

Scaling Laws for Imitation Learning in NetHack

  • paper_url: http://arxiv.org/abs/2307.09423
  • repo_url: None
  • paper_authors: Jens Tuyls, Dhruv Madeka, Kari Torkkola, Dean Foster, Karthik Narasimhan, Sham Kakade
  • for: Investigating whether scaling up model and data size improves imitation learning in a challenging environment, the game of NetHack.
  • methods: Inspired by recent scaling work in natural language processing, the authors carefully scale up model and data size for imitation learning (IL) in NetHack.
  • results: IL loss and mean return scale smoothly with the compute budget and are strongly correlated, yielding power laws for training compute-optimal IL agents with respect to model size and number of samples; the resulting agents outperform the prior state of the art by at least 2x in all settings.
    Abstract Imitation Learning (IL) is one of the most widely used methods in machine learning. Yet, while powerful, many works find it is often not able to fully recover the underlying expert behavior. However, none of these works deeply investigate the role of scaling up the model and data size. Inspired by recent work in Natural Language Processing (NLP) where "scaling up" has resulted in increasingly more capable LLMs, we investigate whether carefully scaling up model and data size can bring similar improvements in the imitation learning setting. To demonstrate our findings, we focus on the game of NetHack, a challenging environment featuring procedural generation, stochasticity, long-term dependencies, and partial observability. We find IL loss and mean return scale smoothly with the compute budget and are strongly correlated, resulting in power laws for training compute-optimal IL agents with respect to model size and number of samples. We forecast and train several NetHack agents with IL and find they outperform prior state-of-the-art by at least 2x in all settings. Our work both demonstrates the scaling behavior of imitation learning in a challenging domain, as well as the viability of scaling up current approaches for increasingly capable agents in NetHack, a game that remains elusively hard for current AI systems.
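
The reported power laws can be recovered from (compute, loss) pairs with a log-log linear fit. The sketch below fits $L(C) = a \cdot C^{-b}$ on synthetic data; the exponent and measurements are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic (compute, loss) measurements following L = a * C^(-b)
compute = np.logspace(15, 20, 12)                 # FLOPs
loss = 3.0 * compute ** -0.08 * np.exp(0.01 * rng.standard_normal(12))

# linear regression in log-log space: log L = log a - b * log C
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
a, b = np.exp(intercept), -slope
print(f"fitted L(C) = {a:.2f} * C^(-{b:.3f})")    # b close to 0.08
```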

Causality-oriented robustness: exploiting general additive interventions

  • paper_url: http://arxiv.org/abs/2307.10299
  • repo_url: https://github.com/xwshen51/drig
  • paper_authors: Xinwei Shen, Peter Bühlmann, Armeen Taeb
  • for: Developing a robust prediction method that can handle distribution shifts in real-world applications.
  • methods: The proposed method, Distributional Robustness via Invariant Gradients (DRIG), exploits general additive interventions in training data to achieve robust predictions against unseen interventions, and naturally interpolates between in-distribution prediction and causality.
  • results: In a linear setting, DRIG is proven to yield predictions that are robust among a data-dependent class of distribution shifts; the framework includes anchor regression as a special case and protects against more diverse perturbations, extends to the semi-supervised domain adaptation setting, and is validated empirically on synthetic simulations and on single-cell data.
    Abstract Since distribution shifts are common in real-world applications, there is a pressing need for developing prediction models that are robust against such shifts. Existing frameworks, such as empirical risk minimization or distributionally robust optimization, either lack generalizability for unseen distributions or rely on postulated distance measures. Alternatively, causality offers a data-driven and structural perspective to robust predictions. However, the assumptions necessary for causal inference can be overly stringent, and the robustness offered by such causal models often lacks flexibility. In this paper, we focus on causality-oriented robustness and propose Distributional Robustness via Invariant Gradients (DRIG), a method that exploits general additive interventions in training data for robust predictions against unseen interventions, and naturally interpolates between in-distribution prediction and causality. In a linear setting, we prove that DRIG yields predictions that are robust among a data-dependent class of distribution shifts. Furthermore, we show that our framework includes anchor regression (Rothenh\"ausler et al.\ 2021) as a special case, and that it yields prediction models that protect against more diverse perturbations. We extend our approach to the semi-supervised domain adaptation setting to further improve prediction performance. Finally, we empirically validate our methods on synthetic simulations and on single-cell data.

Online Learning with Costly Features in Non-stationary Environments

  • paper_url: http://arxiv.org/abs/2307.09388
  • repo_url: https://github.com/saeedghoorchian/ncc-bandits
  • paper_authors: Saeed Ghoorchian, Evgenii Kortukov, Setareh Maghsudi
  • for: maximizing long-term rewards in sequential decision-making problems
  • methods: extending the contextual bandit setting to observe subsets of features’ states, and developing an algorithm with a sublinear regret guarantee
  • results: superior performance in a real-world scenario compared to existing methods
    Abstract Maximizing long-term rewards is the primary goal in sequential decision-making problems. The majority of existing methods assume that side information is freely available, enabling the learning agent to observe all features' states before making a decision. In real-world problems, however, collecting beneficial information is often costly. That implies that, besides individual arms' reward, learning the observations of the features' states is essential to improve the decision-making strategy. The problem is aggravated in a non-stationary environment where reward and cost distributions undergo abrupt changes over time. To address the aforementioned dual learning problem, we extend the contextual bandit setting and allow the agent to observe subsets of features' states. The objective is to maximize the long-term average gain, which is the difference between the accumulated rewards and the paid costs on average. Therefore, the agent faces a trade-off between minimizing the cost of information acquisition and possibly improving the decision-making process using the obtained information. To this end, we develop an algorithm that guarantees a sublinear regret in time. Numerical results demonstrate the superiority of our proposed policy in a real-world scenario.

Batched Predictors Generalize within Distribution

  • paper_url: http://arxiv.org/abs/2307.09379
  • repo_url: None
  • paper_authors: Andreas Loukas, Pan Kessel
  • for: Studying the generalization properties of batched predictors, i.e., models that predict the mean label of a small set (or batch) of examples.
  • methods: A suitable generalization of the Rademacher complexity is used to prove that batched predictors enjoy exponentially stronger generalization guarantees than the standard per-sample approach.
  • results: The bound holds independently of overparametrization, and the theory is validated experimentally across various tasks, architectures, and applications.
    Abstract We study the generalization properties of batched predictors, i.e., models tasked with predicting the mean label of a small set (or batch) of examples. The batched prediction paradigm is particularly relevant for models deployed to determine the quality of a group of compounds in preparation for offline testing. By utilizing a suitable generalization of the Rademacher complexity, we prove that batched predictors come with exponentially stronger generalization guarantees as compared to the standard per-sample approach. Surprisingly, the proposed bound holds independently of overparametrization. Our theoretical insights are validated experimentally for various tasks, architectures, and applications.

Data Cross-Segmentation for Improved Generalization in Reinforcement Learning Based Algorithmic Trading

  • paper_url: http://arxiv.org/abs/2307.09377
  • repo_url: None
  • paper_authors: Vikram Duvvur, Aashay Mehta, Edward Sun, Bo Wu, Ken Yew Chan, Jeff Schneider
  • for: Building a reinforcement-learning-based trading system that remains effective in thinly traded markets and markets for differentiated assets.
  • methods: A reinforcement learning algorithm that trades on signals from a learned predictive model and accounts for the long-term effects of taking positions that are relatively difficult to change.
  • results: The algorithm is tested on 20+ years of equity data from Bursa Malaysia, indicating more effective trading in thinly traded and differentiated-asset markets.
    Abstract The use of machine learning in algorithmic trading systems is increasingly common. In a typical set-up, supervised learning is used to predict the future prices of assets, and those predictions drive a simple trading and execution strategy. This is quite effective when the predictions have sufficient signal, markets are liquid, and transaction costs are low. However, those conditions often do not hold in thinly traded financial markets and markets for differentiated assets such as real estate or vehicles. In these markets, the trading strategy must consider the long-term effects of taking positions that are relatively more difficult to change. In this work, we propose a Reinforcement Learning (RL) algorithm that trades based on signals from a learned predictive model and addresses these challenges. We test our algorithm on 20+ years of equity data from Bursa Malaysia.

eess.IV - 2023-07-19

Flexible Physical Unclonable Functions based on non-deterministically distributed Dye-Doped Fibers and Droplets

  • paper_url: http://arxiv.org/abs/2308.11000
  • repo_url: None
  • paper_authors: Mauro Daniel Luigi Bruno, Giuseppe Emanuele Lio, Antonio Ferraro, Sara Nocentini, Giuseppe Papuzzo, Agostino Forestiero, Giovanni Desiderio, Maria Penelope De Santo, Diederik Sybolt Wiersma, Roberto Caputo, Giovanni Golemme, Francesco Riboli, Riccardo Cristoforo Barberi
  • for: Developing a new anti-counterfeiting technology to protect everyday merchandise from forgery.
  • methods: Electrospinning and electrospraying of dye-doped polymeric materials are used to manufacture flexible free-standing films that embed distinct Physical Unclonable Function (PUF) keys.
  • results: The technique yields anti-counterfeiting labels with three encryption levels: (i) a map of fluorescent polymer droplets non-deterministically distributed on a dense yarn of polymer nanofibers; (ii) a characteristic fluorescence spectrum for each label; and (iii) a challenge-response pair (CRP) identification protocol based on the strong PUF. The intrinsic uniqueness of the deposition process encodes enough complexity to generate thousands of cryptographic keys, and the simple, cheap fabrication with multilevel authentication makes these colored polymeric unclonable tags a practical protection solution.
    Abstract The development of new anti-counterfeiting solutions is a constant challenge and involves several research fields. Much interest is devoted to systems that are impossible to clone, based on the Physical Unclonable Function (PUF) paradigm. In this work, new strategies based on electrospinning and electrospraying of dye-doped polymeric materials are presented for the manufacturing of flexible free-standing films that embed different PUF keys. Films can be used to fabricate anticounterfeiting labels having three encryption levels: i) a map of fluorescent polymer droplets, with non deterministic positions on a dense yarn of polymer nanofibers; ii) a characteristic fluorescence spectrum for each label; iii) a challenge-response pairs (CRPs) identification protocol based on the strong nature of the physical unclonable function. The intrinsic uniqueness introduced by the deposition techniques encodes enough complexity into the optical anti-counterfeiting tag to generate thousands of cryptographic keys. The simple and cheap fabrication process as well as the multilevel authentication makes such colored polymeric unclonable tags a practical solution in the secure protection of merchandise in our daily life.
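
The challenge-response identification layer follows the usual strong-PUF protocol: at enrollment the verifier records responses to random challenges, and at verification it replays an unused challenge and compares the tag's answer. The sketch below emulates the flow in software, with a keyed hash standing in for the physical optical response (a real PUF derives responses from the device itself, not from a stored key).

```python
import hashlib
import os
import secrets

def tag_response(tag_secret: bytes, challenge: bytes) -> bytes:
    """Software stand-in for the tag's physical optical response."""
    return hashlib.sha256(tag_secret + challenge).digest()

# --- enrollment: the verifier records challenge-response pairs ---
tag_secret = os.urandom(32)            # plays the role of the physical PUF
crp_table = {os.urandom(16): None for _ in range(1000)}
for c in crp_table:
    crp_table[c] = tag_response(tag_secret, c)

# --- verification: replay one unused challenge ---
challenge = secrets.choice(list(crp_table))
assert tag_response(tag_secret, challenge) == crp_table[challenge]
del crp_table[challenge]               # each CRP is used only once
print("tag authenticated")
```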

Blind Image Quality Assessment Using Multi-Stream Architecture with Spatial and Channel Attention

  • paper_url: http://arxiv.org/abs/2307.09857
  • repo_url: None
  • paper_authors: Hassan Khalid, Nisar Ahmed
  • for: Proposing a multi-stream spatial- and channel-attention-based blind image quality assessment (BIQA) algorithm with higher accuracy and better correlation with human perceptual assessment.
  • methods: Hybrid features from two different backbone networks are combined, after which spatial and channel attention assign high weights to the region of interest.
  • results: The approach is validated on four legacy image quality assessment datasets; experiments on authentic and synthetic distortion databases demonstrate strong generalization, with a particular focus on perceptual foreground information.
    Abstract BIQA (Blind Image Quality Assessment) is an important field of study that evaluates images automatically. Although significant progress has been made, blind image quality assessment remains a difficult task since images vary in content and distortions. Most algorithms generate quality without emphasizing the important region of interest. In order to solve this, a multi-stream spatial and channel attention-based algorithm is being proposed. This algorithm generates more accurate predictions with a high correlation to human perceptual assessment by combining hybrid features from two different backbones, followed by spatial and channel attention to provide high weights to the region of interest. Four legacy image quality assessment datasets are used to validate the effectiveness of our proposed approach. Authentic and synthetic distortion image databases are used to demonstrate the effectiveness of the proposed method, and we show that it has excellent generalization properties with a particular focus on the perceptual foreground information.
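
The abstract names spatial and channel attention as the mechanism that re-weights the regions of interest, but does not spell the modules out. The PyTorch sketch below shows the standard CBAM-style form such attention usually takes; treat it as an assumed illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style re-weighting of feature channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w

class SpatialAttention(nn.Module):
    """Per-pixel gate computed from pooled channel statistics."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(s))
```

Applied after the hybrid backbone features are fused, the two gates emphasize perceptually important regions before the quality score is regressed.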

Cryo-forum: A framework for orientation recovery with uncertainty measure with the application in cryo-EM image analysis

  • paper_url: http://arxiv.org/abs/2307.09847
  • repo_url: https://github.com/phonchi/cryo-forum
  • paper_authors: Szu-Chi Chung
  • for: Propose a new method for determining the orientation parameters of 2D cryo-EM projection images, a prerequisite for reconstructing 3D structures.
  • methods: A deep-learning model represents each orientation with a 10-dimensional feature vector and applies a Quadratically-Constrained Quadratic Program to derive the predicted orientation as a unit quaternion with an uncertainty metric; a loss based on pairwise distances between orientations further improves accuracy.
  • results: Numerical analysis shows the method recovers orientations from 2D cryo-EM images end-to-end, and the uncertainty estimates allow direct clean-up of the dataset at the 3D level; the methods are packaged in a user-friendly software suite named cryo-forum.
    Abstract In single-particle cryo-electron microscopy (cryo-EM), the efficient determination of orientation parameters for 2D projection images poses a significant challenge yet is crucial for reconstructing 3D structures. This task is complicated by the high noise levels present in the cryo-EM datasets, which often include outliers, necessitating several time-consuming 2D clean-up processes. Recently, solutions based on deep learning have emerged, offering a more streamlined approach to the traditionally laborious task of orientation estimation. These solutions often employ amortized inference, eliminating the need to estimate parameters individually for each image. However, these methods frequently overlook the presence of outliers and may not adequately concentrate on the components used within the network. This paper introduces a novel approach that uses a 10-dimensional feature vector to represent the orientation and applies a Quadratically-Constrained Quadratic Program to derive the predicted orientation as a unit quaternion, supplemented by an uncertainty metric. Furthermore, we propose a unique loss function that considers the pairwise distances between orientations, thereby enhancing the accuracy of our method. Finally, we also comprehensively evaluate the design choices involved in constructing the encoder network, a topic that has not received sufficient attention in the literature. Our numerical analysis demonstrates that our methodology effectively recovers orientations from 2D cryo-EM images in an end-to-end manner. Importantly, the inclusion of uncertainty quantification allows for direct clean-up of the dataset at the 3D level. Lastly, we package our proposed methods into a user-friendly software suite named cryo-forum, designed for easy accessibility by the developers.
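
Orientations live on the unit-quaternion manifold, where q and -q denote the same rotation, and a pairwise loss between orientations must respect this. A minimal NumPy sketch of the two primitives involved (the paper's 10-dimensional feature and QCQP step are omitted):

```python
import numpy as np

def to_unit_quaternion(q_raw):
    """Project a raw 4-vector onto the unit sphere S^3."""
    q = np.asarray(q_raw, dtype=float)
    return q / np.linalg.norm(q)

def quaternion_geodesic(q1, q2):
    """Angular distance between two orientations; the absolute value folds
    the q / -q ambiguity (both represent the same rotation)."""
    d = abs(float(np.dot(q1, q2)))
    return 2.0 * np.arccos(np.clip(d, 0.0, 1.0))
```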

Compressive Image Scanning Microscope

  • paper_url: http://arxiv.org/abs/2307.09841
  • repo_url: None
  • paper_authors: Ajay Gunalan, Marco Castello, Simonluca Piazza, Shunlei Li, Alberto Diaspro, Leonardo S. Mattos, Paolo Bianchini
  • for: Improve image quality and data acquisition speed in image scanning microscopy (ISM) on laser scanning microscopes.
  • methods: A single-photon avalanche diode (SPAD) array detector is combined with a fixed sampling strategy that skips alternate rows and columns, cutting the number of scanned points by a factor of four and eliminating the need to compute different sampling matrices; the parallel images produced by the SPAD array improve the quality of the compressive reconstruction.
  • results: Experiments show high-quality images with reduced data acquisition time and potential benefits in reducing photobleaching.
    Abstract We present a novel approach to implement compressive sensing in laser scanning microscopes (LSM), specifically in image scanning microscopy (ISM), using a single-photon avalanche diode (SPAD) array detector. Our method addresses two significant limitations in applying compressive sensing to LSM: the time to compute the sampling matrix and the quality of reconstructed images. We employ a fixed sampling strategy, skipping alternate rows and columns during data acquisition, which reduces the number of points scanned by a factor of four and eliminates the need to compute different sampling matrices. By exploiting the parallel images generated by the SPAD array, we improve the quality of the reconstructed compressive-ISM images compared to standard compressive confocal LSM images. Our results demonstrate the effectiveness of our approach in producing higher-quality images with reduced data acquisition time and potential benefits in reducing photobleaching.
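
The fixed sampling strategy is simple to state precisely: keep every other row and every other column of the scan grid. A NumPy sketch (the helper name is ours):

```python
import numpy as np

def fixed_sampling_mask(h: int, w: int) -> np.ndarray:
    """Binary mask keeping alternate rows and columns of an h x w scan."""
    mask = np.zeros((h, w), dtype=bool)
    mask[::2, ::2] = True
    return mask

mask = fixed_sampling_mask(512, 512)
print(mask.mean())  # 0.25 -> the number of scanned points drops by 4x
```

Because the mask never changes, no per-acquisition sampling matrix has to be computed, which is exactly the first limitation the authors set out to remove.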

Fix your downsampling ASAP! Be natively more robust via Aliasing and Spectral Artifact free Pooling

  • paper_url: http://arxiv.org/abs/2307.09804
  • repo_url: None
  • paper_authors: Julia Grabinski, Janis Keuper, Margret Keuper
  • for: Improve the native robustness of convolutional neural networks, i.e., their robustness against common corruptions and adversarial attacks without adversarial training.
  • methods: Downsampling is performed in the frequency domain; the proposed ASAP pooling modifies FLC pooling so that it is free of both aliasing and spectral-leakage artifacts.
  • results: Networks using ASAP pooling exhibit higher native robustness against common corruptions and against adversarial attacks on high- and low-resolution data, while maintaining clean accuracy similar to or better than the baseline.
    Abstract Convolutional neural networks encode images through a sequence of convolutions, normalizations and non-linearities as well as downsampling operations into potentially strong semantic embeddings. Yet, previous work showed that even slight mistakes during sampling, leading to aliasing, can be directly attributed to the networks' lack in robustness. To address such issues and facilitate simpler and faster adversarial training, [12] recently proposed FLC pooling, a method for provably alias-free downsampling - in theory. In this work, we conduct a further analysis through the lens of signal processing and find that such current pooling methods, which address aliasing in the frequency domain, are still prone to spectral leakage artifacts. Hence, we propose aliasing and spectral artifact-free pooling, short ASAP. While only introducing a few modifications to FLC pooling, networks using ASAP as downsampling method exhibit higher native robustness against common corruptions, a property that FLC pooling was missing. ASAP also increases native robustness against adversarial attacks on high and low resolution data while maintaining similar clean accuracy or even outperforming the baseline.
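
FLC pooling downsamples by keeping only the low-frequency part of the 2D spectrum, which is alias-free by construction; spectral leakage is the remaining artifact that ASAP targets. The PyTorch sketch below shows FLC-style downsampling with a frequency-domain Hamming window added as one simple leakage-damping device; this is an illustration of the idea, and the authors' actual modification may differ.

```python
import torch

def flc_style_downsample(x: torch.Tensor, window: bool = True) -> torch.Tensor:
    """Downsample by 2 via low-pass cropping in the frequency domain.
    Assumes x has shape (B, C, H, W) with H and W divisible by 4."""
    _, _, h, w = x.shape
    X = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    X = X[..., h // 4: 3 * h // 4, w // 4: 3 * w // 4]  # keep low frequencies
    if window:  # damp spectral leakage (illustrative choice of window)
        win_h = torch.hamming_window(h // 2, device=x.device)
        win_w = torch.hamming_window(w // 2, device=x.device)
        X = X * win_h[:, None] * win_w[None, :]
    y = torch.fft.ifft2(torch.fft.ifftshift(X, dim=(-2, -1))).real
    return y / 4  # compensate the smaller inverse-FFT normalization
```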

DiffDP: Radiotherapy Dose Prediction via a Diffusion Model

  • paper_url: http://arxiv.org/abs/2307.09794
  • repo_url: https://github.com/scufzh/DiffDP
  • paper_authors: Zhenghao Feng, Lu Wen, Peng Wang, Binyu Yan, Xi Wu, Jiliu Zhou, Yan Wang
  • for: Improve the accuracy of dose-distribution prediction in radiotherapy planning and overcome the over-smoothing caused by the L1/L2 losses with posterior average calculations used in existing methods.
  • methods: A diffusion-based dose prediction (DiffDP) model with a forward process that gradually transforms dose distribution maps into Gaussian noise while training a noise predictor for each timestep, and a reverse process that removes the noise in multiple steps to output the predicted dose map; a structure encoder extracts anatomical information so the noise predictor is aware of dose constraints within the planning target volume and organs at risk.
  • results: Experiments on an in-house dataset of 130 rectum cancer patients demonstrate superior dose-distribution prediction over previous methods.
    Abstract Currently, deep learning (DL) has achieved the automatic prediction of dose distribution in radiotherapy planning, enhancing its efficiency and quality. However, existing methods suffer from the over-smoothing problem for their commonly used L_1 or L_2 loss with posterior average calculations. To alleviate this limitation, we innovatively introduce a diffusion-based dose prediction (DiffDP) model for predicting the radiotherapy dose distribution of cancer patients. Specifically, the DiffDP model contains a forward process and a reverse process. In the forward process, DiffDP gradually transforms dose distribution maps into Gaussian noise by adding small noise and trains a noise predictor to predict the noise added in each timestep. In the reverse process, it removes the noise from the original Gaussian noise in multiple steps with the well-trained noise predictor and finally outputs the predicted dose distribution map. To ensure the accuracy of the prediction, we further design a structure encoder to extract anatomical information from patient anatomy images and enable the noise predictor to be aware of the dose constraints within several essential organs, i.e., the planning target volume and organs at risk. Extensive experiments on an in-house dataset with 130 rectum cancer patients demonstrate the superiority of the proposed method.
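
The forward/reverse structure follows standard denoising diffusion. A minimal training-step sketch in PyTorch; the `noise_predictor(x_t, t, anatomy)` interface stands for DiffDP's anatomy-conditioned network and is an assumption, not the authors' code.

```python
import torch
import torch.nn.functional as F

def diffusion_train_step(noise_predictor, dose, anatomy, alphas_cumprod):
    """One DDPM-style step: corrupt the dose map, then predict the noise.
    dose, anatomy: (B, C, H, W); alphas_cumprod: (T,) noise schedule."""
    b = dose.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=dose.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    eps = torch.randn_like(dose)
    x_t = a_bar.sqrt() * dose + (1 - a_bar).sqrt() * eps  # forward process
    return F.mse_loss(noise_predictor(x_t, t, anatomy), eps)
```

At inference, the reverse process starts from pure Gaussian noise and applies the trained predictor over multiple steps, which is what lets the model avoid the posterior-averaged, over-smoothed outputs of plain L1/L2 regression.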

CPCM: Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.10316
  • repo_url: https://github.com/lizhaoliu-Lec/CPCM
  • paper_authors: Lizhao Liu, Zhuangwei Zhuang, Shangxin Huang, Xunlong Xiao, Tianhang Xiang, Cen Chen, Jingdong Wang, Mingkui Tan
  • for: Improve weakly-supervised point cloud semantic segmentation with sparse annotations (e.g., less than 0.1% of points labeled), reducing the expensive cost of dense annotation.
  • methods: Contextual Point Cloud Modeling (CPCM) combines a region-wise masking (RegionMask) strategy, which masks the point cloud continuously in geometric space, with a contextual masked training (CMT) method that disentangles supervised segmentation from unsupervised masked context prediction.
  • results: CPCM outperforms the state of the art on the ScanNet V2 and S3DIS benchmarks.
    Abstract We study the task of weakly-supervised point cloud semantic segmentation with sparse annotations (e.g., less than 0.1% points are labeled), aiming to reduce the expensive cost of dense annotations. Unfortunately, with extremely sparse annotated points, it is very difficult to extract both contextual and object information for scene understanding such as semantic segmentation. Motivated by masked modeling (e.g., MAE) in image and video representation learning, we seek to endow the power of masked modeling to learn contextual information from sparsely-annotated points. However, directly applying MAE to 3D point clouds with sparse annotations may fail to work. First, it is nontrivial to effectively mask out the informative visual context from 3D point clouds. Second, how to fully exploit the sparse annotations for context modeling remains an open question. In this paper, we propose a simple yet effective Contextual Point Cloud Modeling (CPCM) method that consists of two parts: a region-wise masking (RegionMask) strategy and a contextual masked training (CMT) method. Specifically, RegionMask masks the point cloud continuously in geometric space to construct a meaningful masked prediction task for subsequent context learning. CMT disentangles the learning of supervised segmentation and unsupervised masked context prediction for effectively learning the very limited labeled points and mass unlabeled points, respectively. Extensive experiments on the widely-tested ScanNet V2 and S3DIS benchmarks demonstrate the superiority of CPCM over the state-of-the-art.
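
"Masking the point cloud continuously in geometric space" contrasts with dropping random individual points. One plausible NumPy reading, hiding spherical neighborhoods around random seeds, is sketched below; the paper's exact scheme may differ.

```python
import numpy as np

def region_mask(points: np.ndarray, num_regions: int = 8,
                radius: float = 0.5) -> np.ndarray:
    """Mask contiguous regions of an (N, 3) point cloud: pick random seed
    points and hide everything within `radius` of each seed."""
    n = points.shape[0]
    masked = np.zeros(n, dtype=bool)
    seeds = points[np.random.choice(n, num_regions, replace=False)]
    for s in seeds:
        masked |= np.linalg.norm(points - s, axis=1) < radius
    return masked  # True = held out for masked context prediction
```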

NTIRE 2023 Quality Assessment of Video Enhancement Challenge

  • paper_url: http://arxiv.org/abs/2307.09729
  • repo_url: None
  • paper_authors: Xiaohong Liu, Xiongkuo Min, Wei Sun, Yulun Zhang, Kai Zhang, Radu Timofte, Guangtao Zhai, Yixuan Gao, Yuqin Cao, Tengchuan Kou, Yunlong Dong, Ziheng Jia, Yilin Li, Wei Wu, Shuming Hu, Sibin Deng, Pengxiang Xiao, Ying Chen, Kai Li, Kai Zhao, Kun Yuan, Ming Sun, Heng Cong, Hao Wang, Lingzhi Fu, Yusheng Zhang, Rongyu Zhang, Hang Shi, Qihang Xu, Longan Xiao, Zhiliang Ma, Mirko Agarla, Luigi Celona, Claudio Rota, Raimondo Schettini, Zhiwei Huang, Yanan Li, Xiaotao Wang, Lei Lei, Hongye Liu, Wei Hong, Ironhead Chuang, Allen Lin, Drake Guan, Iris Chen, Kae Lou, Willy Huang, Yachun Tasi, Yvonne Kao, Haotian Fan, Fangyuan Kong, Shiqi Zhou, Hao Liu, Yu Lai, Shanshan Chen, Wenqi Wang, Haoning Wu, Chaofeng Chen, Chunzheng Zhu, Zekun Guo, Shiling Zhao, Haibing Yin, Hongkui Wang, Hanene Brachemi Meftah, Sid Ahmed Fezza, Wassim Hamidouche, Olivier Déforges, Tengfei Shi, Azadeh Mansouri, Hossein Motamednia, Amir Hossein Bakhtiari, Ahmad Mahmoudi Aznaveh
  • for: Report on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which targets a major problem in video processing: video quality assessment (VQA) for enhanced videos.
  • methods: The challenge uses the VDPVE dataset of 1211 enhanced videos: 600 with color, brightness, and contrast enhancement, 310 with deblurring, and 301 deshaked; 167 participants registered in total.
  • results: 61 teams submitted prediction results during the development phase (3168 submissions in total); 37 teams made 176 submissions in the final testing phase; 19 teams submitted their models and fact sheets detailing their methods. Several methods beat the baseline, and the winning methods demonstrated superior prediction performance.
    Abstract This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. The challenge addresses a major problem in the field of video processing, namely video quality assessment (VQA) for enhanced videos. The challenge uses the VQA Dataset for Perceptual Video Enhancement (VDPVE), which has a total of 1211 enhanced videos, including 600 videos with color, brightness, and contrast enhancements, 310 videos with deblurring, and 301 deshaked videos. The challenge has a total of 167 registered participants. 61 participating teams submitted their prediction results during the development phase, with a total of 3168 submissions. A total of 176 submissions were submitted by 37 participating teams during the final testing phase. Finally, 19 participating teams submitted their models and fact sheets, and detailed the methods they used. Some methods have achieved better results than baseline methods, and the winning methods have demonstrated superior prediction performance.

Uncertainty-Driven Multi-Scale Feature Fusion Network for Real-time Image Deraining

  • paper_url: http://arxiv.org/abs/2307.09728
  • repo_url: None
  • paper_authors: Ming Tong, Xuefeng Yan, Yongzhen Wang
  • for: Improve the accuracy and reliability of vision-based measurement systems operating in rainy weather.
  • methods: An Uncertainty-Driven Multi-Scale Feature Fusion Network (UMFFNet) learns the probability mapping distribution between paired images to estimate uncertainty; an uncertainty feature fusion block (UFFB) uses this information to dynamically enhance the acquired features and focus on blurry regions obscured by rain streaks, and feature information from multiple scales is fused for efficient collaborative rain removal.
  • results: UMFFNet achieves significant performance gains with few parameters, surpassing state-of-the-art image deraining methods.
    Abstract Visual-based measurement systems are frequently affected by rainy weather due to the degradation caused by rain streaks in captured images, and existing imaging devices struggle to address this issue in real-time. While most efforts leverage deep networks for image deraining and have made progress, their large parameter sizes hinder deployment on resource-constrained devices. Additionally, these data-driven models often produce deterministic results, without considering their inherent epistemic uncertainty, which can lead to undesired reconstruction errors. Well-calibrated uncertainty can help alleviate prediction errors and assist measurement devices in mitigating risks and improving usability. Therefore, we propose an Uncertainty-Driven Multi-Scale Feature Fusion Network (UMFFNet) that learns the probability mapping distribution between paired images to estimate uncertainty. Specifically, we introduce an uncertainty feature fusion block (UFFB) that utilizes uncertainty information to dynamically enhance acquired features and focus on blurry regions obscured by rain streaks, reducing prediction errors. In addition, to further boost the performance of UMFFNet, we fused feature information from multiple scales to guide the network for efficient collaborative rain removal. Extensive experiments demonstrate that UMFFNet achieves significant performance improvements with few parameters, surpassing state-of-the-art image deraining methods.
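
How an uncertainty map can steer feature fusion is easy to sketch: derive a gate from the predicted per-pixel variance so that uncertain, rain-obscured regions receive larger corrections. The PyTorch module below is an illustrative reading of the UFFB idea under that assumption, not the authors' block.

```python
import torch
import torch.nn as nn

class UncertaintyGatedFusion(nn.Module):
    """Amplify features where the predictive log-variance is high."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.Sigmoid())

    def forward(self, feat: torch.Tensor, log_var: torch.Tensor):
        # feat: (B, C, H, W); log_var: (B, 1, H, W) predicted uncertainty
        return feat * (1.0 + self.gate(log_var))
```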

Flexible single multimode fiber imaging using white LED

  • paper_url: http://arxiv.org/abs/2307.09714
  • repo_url: None
  • paper_authors: Minyu Fan, Kun Liu, Jie Zhu, Yu Cao, Sha Wang
  • for: This research aims to improve the imaging capabilities of multimode fibers (MMFs) using white LEDs and cascaded convolutional neural networks (CNNs) to mitigate the effects of mode coupling and modal dispersion.
  • methods: The proposed method uses an MMF as the imaging medium, a white LED as the light source, and a cascaded CNN to reconstruct the images. Channel stitching technology is used to concatenate the output speckle patterns in the three color channels of the CCD camera.
  • results: The experimental results show that the proposed method achieves high-quality image reconstruction with an average Pearson correlation coefficient (PCC) of 0.83 on the Fashion-MNIST dataset. The method also demonstrates good robustness, maintaining an average PCC of 0.83 even after completely changing the shape of the MMF.
    Abstract Multimode fiber (MMF) has been proven to have good potential in imaging and optical communication because of its advantages of small diameter and large mode numbers. However, due to the mode coupling and modal dispersion, it is very sensitive to environmental changes. Minor changes in the fiber shape can lead to difficulties in information reconstruction. Here, a white LED and a cascaded Unet are used to achieve MMF imaging to eliminate the effect of fiber perturbations. The output speckle patterns in three different color channels of the CCD camera produced by transferring images through the MMF are concatenated and inputted into the cascaded Unet using channel stitching technology to improve the reconstruction effects. The average Pearson correlation coefficient (PCC) of the reconstructed images from the Fashion-MNIST dataset is 0.83. In order to check the flexibility of such a system, perturbation tests on the image reconstruction capability by changing the fiber shapes are conducted. The experimental results show that the MMF imaging system has good robustness properties, i.e., the average PCC remains 0.83 even after completely changing the shape of the MMF. This research potentially provides a flexible approach for the practical application of MMF imaging.

Transformer-based Dual-domain Network for Few-view Dedicated Cardiac SPECT Image Reconstructions

  • paper_url: http://arxiv.org/abs/2307.09624
  • repo_url: None
  • paper_authors: Huidong Xie, Bo Zhou, Xiongchao Chen, Xueqi Guo, Stephanie Thorn, Yi-Hwa Liu, Ge Wang, Albert Sinusas, Chi Liu
  • for: Improve image quality in the diagnosis of cardiovascular disease with GE 530/570c dedicated cardiac SPECT scanners, whose stationary 19-projection geometry limits angular sampling.
  • methods: A 3D transformer-based dual-domain network (TIP-Net) first reconstructs 3D cardiac SPECT images directly from projection data with a projection-to-image domain transformer, then refines the result with an image-domain reconstruction network.
  • results: On human studies, the method produced images with higher cardiac defect contrast than previous baselines, as validated by cardiac catheterization images, nuclear cardiologists' interpretations, and defect sizes quantified by FDA 510(k)-cleared clinical software.
    Abstract Cardiovascular disease (CVD) is the leading cause of death worldwide, and myocardial perfusion imaging using SPECT has been widely used in the diagnosis of CVDs. The GE 530/570c dedicated cardiac SPECT scanners adopt a stationary geometry to simultaneously acquire 19 projections to increase sensitivity and achieve dynamic imaging. However, the limited amount of angular sampling negatively affects image quality. Deep learning methods can be implemented to produce higher-quality images from stationary data. This is essentially a few-view imaging problem. In this work, we propose a novel 3D transformer-based dual-domain network, called TIP-Net, for high-quality 3D cardiac SPECT image reconstructions. Our method aims to first reconstruct 3D cardiac SPECT images directly from projection data without the iterative reconstruction process by proposing a customized projection-to-image domain transformer. Then, given its reconstruction output and the original few-view reconstruction, we further refine the reconstruction using an image-domain reconstruction network. Validated by cardiac catheterization images, diagnostic interpretations from nuclear cardiologists, and defect size quantified by an FDA 510(k)-cleared clinical software, our method produced images with higher cardiac defect contrast on human studies compared with previous baseline methods, potentially enabling high-quality defect visualization using stationary few-view dedicated cardiac SPECT scanners.

A comparative analysis of SRGAN models

  • paper_url: http://arxiv.org/abs/2307.09456
  • repo_url: None
  • paper_authors: Fatemeh Rezapoor Nikroo, Ajinkya Deshmukh, Anantha Sharma, Adrian Tam, Kaarthik Kumar, Cleo Norris, Aditya Dangi
  • for: Evaluate the performance of multiple state-of-the-art SRGAN models (ESRGAN, Real-ESRGAN, and EDSR) on a benchmark dataset of real-world images degraded through a pipeline.
  • methods: Models are compared on quantitative metrics and on subjective visual quality, with downstream usefulness assessed via the Tesseract OCR engine.
  • results: The EDSR-BASE model from huggingface outperforms the other candidates on both quantitative metrics and subjective visual quality assessments with the least compute overhead; EDSR produces images with higher PSNR and SSIM values and returns high-quality OCR results, suggesting it is a robust and effective single-image super-resolution approach, particularly where high visual fidelity and optimized compute are critical.
    Abstract In this study, we evaluate the performance of multiple state-of-the-art SRGAN (Super Resolution Generative Adversarial Network) models, ESRGAN, Real-ESRGAN and EDSR, on a benchmark dataset of real-world images which undergo degradation using a pipeline. Our results show that some models seem to significantly increase the resolution of the input images while preserving their visual quality, this is assessed using Tesseract OCR engine. We observe that EDSR-BASE model from huggingface outperforms the remaining candidate models in terms of both quantitative metrics and subjective visual quality assessments with least compute overhead. Specifically, EDSR generates images with higher peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) values and are seen to return high quality OCR results with Tesseract OCR engine. These findings suggest that EDSR is a robust and effective approach for single-image super-resolution and may be particularly well-suited for applications where high-quality visual fidelity is critical and optimized compute.
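
The quantitative side of the comparison rests on two full-reference metrics. A small evaluation helper (the function name is ours; assumes a recent scikit-image that accepts `channel_axis`):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_sr(reference: np.ndarray, restored: np.ndarray):
    """PSNR and SSIM for uint8 RGB images of identical shape."""
    psnr = peak_signal_noise_ratio(reference, restored, data_range=255)
    ssim = structural_similarity(reference, restored,
                                 channel_axis=-1, data_range=255)
    return psnr, ssim
```

Higher PSNR and SSIM, together with OCR quality under Tesseract, are the axes on which EDSR-BASE came out ahead.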

Measuring Student Behavioral Engagement using Histogram of Actions

  • paper_url: http://arxiv.org/abs/2307.09420
  • repo_url: None
  • paper_authors: Ahmed Abdelkawy, Islam Alkabbany, Asem Ali, Aly Farag
  • for: Develop a new technique for measuring students' behavioral engagement by recognizing their actions and predicting the engagement level.
  • methods: Human skeletons model student postures and upper-body movements; a 3D-CNN learns the dynamics of the upper body and recognizes the actions within every 2-minute video segment; the recognized actions are accumulated into a histogram of actions encoding the actions and their frequencies, which an SVM classifier uses to label the student as engaged or disengaged.
  • results: On a dataset of 1414 two-minute segments annotated with 13 actions and 112 segments annotated with two engagement levels, student actions are recognized with a top-1 accuracy of 83.63%, and the framework captures the average engagement of the class.
    Abstract In this paper, we propose a novel technique for measuring behavioral engagement through recognition of students' actions. The proposed approach recognizes student actions and then predicts the student behavioral engagement level. For student action recognition, we use human skeletons to model student postures and upper body movements. To learn the dynamics of the student upper body, a 3D-CNN model is used. The trained 3D-CNN model is used to recognize actions within every 2-minute video segment; these actions are then used to build a histogram of actions which encodes the student actions and their frequencies. This histogram is utilized as input to an SVM classifier to classify whether the student is engaged or disengaged. To evaluate the proposed framework, we build a dataset consisting of 1414 2-minute video segments annotated with 13 actions and 112 video segments annotated with two engagement levels. Experimental results indicate that student actions can be recognized with a top-1 accuracy of 83.63%, and the proposed framework can capture the average engagement of the class.
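
The histogram-plus-SVM stage is straightforward to make concrete. A sketch with scikit-learn, using the paper's 13 action classes (helper names are ours):

```python
import numpy as np
from sklearn.svm import SVC

NUM_ACTIONS = 13  # actions annotated in the dataset

def histogram_of_actions(action_ids) -> np.ndarray:
    """Encode one 2-minute segment as normalized action frequencies."""
    hist = np.bincount(action_ids, minlength=NUM_ACTIONS).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_engagement_classifier(segments, labels) -> SVC:
    """segments: recognized action-id sequences; labels: engaged (1) or not (0)."""
    X = np.stack([histogram_of_actions(s) for s in segments])
    return SVC(kernel="rbf").fit(X, labels)
```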

cs.SD - 2023-07-18

  • paper_url: http://arxiv.org/abs/2307.09352
  • repo_url: None
  • paper_authors: Adam Szwajcowski
  • for: Explore replacing the spherical harmonic (SH) representation of head-related transfer functions (HRTFs) with four-dimensional (4D) continuous functional models in which frequency is treated as another physical dimension, obtaining continuity over both space and frequency.
  • methods: The recently developed hyperspherical harmonic (HSH) representation is compared with models defined in a spherindrical coordinate system that merge SHs with one-dimensional basis functions; efficiency is evaluated through reproduction errors for individual HRTFs from the HUTUBS database, including a detailed analysis of the dependency on the chosen orders of approximation in frequency and space.
  • results: 4D continuous models express HRTF magnitude spectra as a small set of coefficients that can be decoded back into values at any direction and frequency; HSHs and SHs merged with the reverse Fourier-Bessel series performed best, with HSHs offering better compression and slightly higher accuracy at low coefficient counts. The models can serve interpolation, compression, or parametrization for machine learning, and apply to other directivity functions such as sound source directivity.
    Abstract Utilizing spherical harmonic (SH) domain has been established as the default method of obtaining continuity over space in head-related transfer functions (HRTFs). This paper concerns different variants of extending this solution by replacing SHs with four-dimensional (4D) continuous functional models in which frequency is imagined as another physical dimension. Recently developed hyperspherical harmonic (HSH) representation is compared with models defined in spherindrical coordinate system by merging SHs with one-dimensional basis functions. The efficiency of both approaches is evaluated based on the reproduction errors for individual HRTFs from HUTUBS database, including detailed analysis of its dependency on chosen orders of approximation in frequency and space. Employing continuous functional models defined in 4D coordinate systems allows HRTF magnitude spectra to be expressed as a small set of coefficients which can be decoded back into values at any direction and frequency. The best performance was noted for HSHs and SHs merged with reverse Fourier-Bessel series, with the former featuring better compression abilities, achieving slightly higher accuracy for low number of coefficients. The presented models can serve multiple purposes, such as interpolation, compression or parametrization for machine learning applications, and can be applied not only to HRTFs but also to other types of directivity functions, e.g. sound source directivity.
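
In equation form the contrast between the two families is compact. A sketch in our notation, where $Y_{nm}$ are the spherical harmonics and $g_k$ stands for whichever one-dimensional frequency basis is merged in (e.g., the reverse Fourier-Bessel series):

```latex
% Conventional approach: a separate SH expansion at every frequency f
H(\theta,\varphi;f) \approx \sum_{n=0}^{N}\sum_{m=-n}^{n} c_{nm}(f)\,Y_{nm}(\theta,\varphi)

% 4D continuous models: one fixed coefficient set over space and frequency
H(\theta,\varphi,f) \approx \sum_{n=0}^{N}\sum_{m=-n}^{n}\sum_{k=0}^{K} d_{nmk}\,Y_{nm}(\theta,\varphi)\,g_k(f)
```

The second form is what allows a small set of coefficients $d_{nmk}$ to be decoded back into magnitude values at any direction and frequency.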

Interpretable Timbre Synthesis using Variational Autoencoders Regularized on Timbre Descriptors

  • paper_url: http://arxiv.org/abs/2307.10283
  • repo_url: None
  • paper_authors: Anastasia Natsiou, Luca Longo, Sean O’Leary
  • for: Research on controllable timbre synthesis methods.
  • methods: Deep neural networks, specifically Variational Autoencoders (VAEs), are used to generate a high-level representation of audio with a structured latent space.
  • results: A regularized VAE-based latent space that incorporates timbre descriptors is proposed, and the harmonic content of sounds is exploited to minimize the dimensionality of the latent space.
    Abstract Controllable timbre synthesis has been a subject of research for several decades, and deep neural networks have been the most successful in this area. Deep generative models such as Variational Autoencoders (VAEs) have the ability to generate a high-level representation of audio while providing a structured latent space. Despite their advantages, the interpretability of these latent spaces in terms of human perception is often limited. To address this limitation and enhance the control over timbre generation, we propose a regularized VAE-based latent space that incorporates timbre descriptors. Moreover, we suggest a more concise representation of sound by utilizing its harmonic content, in order to minimize the dimensionality of the latent space.
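
A common way to make such a latent space interpretable is to tie designated latent dimensions to perceptual descriptors with an extra loss term. The PyTorch sketch below shows this attribute-regularization pattern under that assumption; the paper's exact regularizer is not reproduced here.

```python
import torch
import torch.nn.functional as F

def regularized_vae_loss(x_hat, x, mu, log_var, z, descriptors,
                         beta: float = 1.0, gamma: float = 1.0):
    """ELBO terms plus a penalty tying the first k latent dimensions to k
    normalized timbre descriptors (e.g., spectral centroid, brightness)."""
    recon = F.mse_loss(x_hat, x)
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    k = descriptors.shape[1]
    reg = F.mse_loss(z[:, :k], descriptors)  # descriptor-aligned dimensions
    return recon + beta * kl + gamma * reg
```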

On Computing In the Network: Covid-19 Coughs Detection Case Study

  • paper_url: http://arxiv.org/abs/2307.08902
  • repo_url: None
  • paper_authors: Soukaina Ouledsidi Ali, Zakaria Ait Hmitti, Halima Elbiaze, Roch Glitho
  • for: Explore computing within network devices such as switches and network interface cards, so that time-sensitive applications can meet their quality-of-service (QoS) targets.
  • methods: In-network computing is compared against edge computing for distributing caching and computing tasks in the cloud-edge-mist continuum, using a critical use case: a Covid-19 alert application in an airport setting, where arriving travelers are monitored through cough analysis so that potentially infected cases can be detected and isolated for medical tests.
  • results: Simulations show that in-network computing outperforms edge computing in terms of Round Trip Time (RTT) and traffic filtering.
    Abstract Computing in the network (COIN) is a promising technology that allows processing to be carried out within network devices such as switches and network interface cards. Time-sensitive applications can achieve their quality of service (QoS) target by flexibly distributing the caching and computing tasks in the cloud-edge-mist continuum. This paper highlights the advantages of in-network computing, compared to edge computing, in terms of latency and traffic filtering. We consider a critical use case related to a Covid-19 alert application in an airport setting. Arriving travelers are monitored through cough analysis so that potentially infected cases can be detected and isolated for medical tests. A performance comparison has been done between an architecture using in-network computing and another one using edge computing. We show using simulations that in-network computing outperforms edge computing in terms of Round Trip Time (RTT) and traffic filtering.

Multilingual Speech-to-Speech Translation into Multiple Target Languages

  • paper_url: http://arxiv.org/abs/2307.08655
  • repo_url: None
  • paper_authors: Hongyu Gong, Ning Dong, Sravya Popuri, Vedanuj Goswami, Ann Lee, Juan Pino
  • for: Multilingual speech-to-speech translation (S2ST), i.e., spoken communication between people talking in different languages, extended to support multiple target languages rather than only multiple source languages.
  • methods: Building on recent direct S2ST with speech-to-unit (S2U) and a vocoder, the key components are given multilingual capability: speech-to-masked-unit (S2MU) applies masking to units that do not belong to the given target language to reduce language interference, and a multilingual vocoder is trained with language embeddings and an auxiliary language-identification loss.
  • results: On benchmark translation test sets, the proposed multilingual model outperforms bilingual models in translation from English into 16 target languages.
    Abstract Speech-to-speech translation (S2ST) enables spoken communication between people talking in different languages. Despite a few studies on multilingual S2ST, their focus is the multilinguality on the source side, i.e., the translation from multiple source languages to one target language. We present the first work on multilingual S2ST supporting multiple target languages. Leveraging recent advance in direct S2ST with speech-to-unit and vocoder, we equip these key components with multilingual capability. Speech-to-masked-unit (S2MU) is the multilingual extension of S2U, which applies masking to units which don't belong to the given target language to reduce the language interference. We also propose multilingual vocoder which is trained with language embedding and the auxiliary loss of language identification. On benchmark translation testsets, our proposed multilingual model shows superior performance than bilingual models in the translation from English into $16$ target languages.
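
"Masking units that do not belong to the given target language" admits a simple picture: suppress the scores of out-of-language units before they can be predicted. The sketch below is one plausible reading (the precomputed unit-to-language assignment and the point where masking is applied are our assumptions, not details from the paper):

```python
import torch

def mask_units(logits: torch.Tensor, unit_to_lang: torch.Tensor,
               target_lang: int) -> torch.Tensor:
    """logits: (B, T, V) scores over V discrete speech units;
    unit_to_lang: (V,) language id assigned to each unit."""
    allowed = unit_to_lang == target_lang
    return logits.masked_fill(~allowed, float("-inf"))
```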

eess.AS - 2023-07-18

  • paper_url: http://arxiv.org/abs/2307.08858
  • repo_url: None
  • paper_authors: Nils L. Westhausen, Bernd T. Meyer
  • for: Improve speech enhancement in hearing aids, where the hardware limits the number of possible operations and the latency must stay within a few milliseconds.
  • methods: A causal multiple-input single-output deep-learning model (GCFSnet) performs filter-and-sum processing in the time-frequency domain with complex-valued filters and a multi-frame deep post filter; weight sharing through Group Communication and quantization-aware training reduce model size and computational footprint, and a low-bit-rate binaural link transmits delayed binaural features.
  • results: In terms of objective metrics, even a unilateral GCFSnet configuration can match an oracle binaural LCMV beamformer in a non-low-latency configuration, while retaining a latency of 2 ms.
    Abstract Speech enhancement in hearing aids is a challenging task since the hardware limits the number of possible operations and the latency needs to be in the range of only a few milliseconds. We propose a deep-learning model compatible with these limitations, which we refer to as Group-Communication Filter-and-Sum Network (GCFSnet). GCFSnet is a causal multiple-input single output enhancement model using filter-and-sum processing in the time-frequency domain and a multi-frame deep post filter. All filters are complex-valued and are estimated by a deep-learning model using weight-sharing through Group Communication and quantization-aware training for reducing model size and computational footprint. For a further increase in performance, a low bit rate binaural link for delayed binaural features is proposed to use binaural information while retaining a latency of 2ms. The performance of an oracle binaural LCMV beamformer in non-low-latency configuration can be matched even by a unilateral configuration of the GCFSnet in terms of objective metrics.
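
The filter-and-sum core is compact to state: multiply each microphone's STFT by a predicted complex filter and sum across microphones. A minimal PyTorch sketch (the shapes and conjugate convention are our assumptions; in GCFSnet the filters come from the network):

```python
import torch

def filter_and_sum(stft_channels: torch.Tensor,
                   filters: torch.Tensor) -> torch.Tensor:
    """Complex filter-and-sum in the STFT domain.
    stft_channels: (M, F, T) complex microphone spectra
    filters:       (M, F, T) complex per-channel filters
    returns:       (F, T) enhanced single-output spectrum"""
    return (torch.conj(filters) * stft_channels).sum(dim=0)
```

The multi-frame deep post filter then operates on this single-channel output, while quantization-aware training keeps the whole chain within hearing-aid compute budgets.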

Semi-supervised multi-channel speaker diarization with cross-channel attention

  • paper_url: http://arxiv.org/abs/2307.08688
  • repo_url: None
  • paper_authors: Shilong Wu, Jun Du, Maokui He, Shutong Niu, Hang Chen, Haitao Tang, Chin-Hui Lee
  • for: Propose a semi-supervised speaker diarization system that exploits large-scale multi-channel training data by generating pseudo-labels for unlabeled data.
  • methods: Cross-channel attention is introduced into Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding (NSD-MA-MSE) to better learn the channel contextual information of speaker embeddings.
  • results: On the CHiME-7 Mixer6 development set, the system achieves a 57.01% relative DER reduction over the clustering-based model; on CHiME-6, training with 80% or 50% labeled data performs comparably to training with 100% labeled data.
    Abstract Most neural speaker diarization systems rely on sufficient manual training data labels, which are hard to collect under real-world scenarios. This paper proposes a semi-supervised speaker diarization system to utilize large-scale multi-channel training data by generating pseudo-labels for unlabeled data. Furthermore, we introduce cross-channel attention into the Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding (NSD-MA-MSE) to learn channel contextual information of speaker embeddings better. Experimental results on the CHiME-7 Mixer6 dataset which only contains partial speakers' labels of the training set, show that our system achieved 57.01% relative DER reduction compared to the clustering-based model on the development set. We further conducted experiments on the CHiME-6 dataset to simulate the scenario of missing partial training set labels. When using 80% and 50% labeled training data, our system performs comparably to the results obtained using 100% labeled data for training.

cs.CV - 2023-07-18

Plug the Leaks: Advancing Audio-driven Talking Face Generation by Preventing Unintended Information Flow

  • paper_url: http://arxiv.org/abs/2307.09368
  • repo_url: None
  • paper_authors: Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Hazim Kemal Ekenel, Alexander Waibel
  • for: Advance audio-driven talking face generation, i.e., creating lip-synchronized, realistic face videos, by improving both visual quality and audio-visual synchronization.
  • methods: Several problematic aspects of recent synchronization methods are identified, namely unintended flow of lip and pose information from the reference to the generated image and instabilities during training; the proposed remedies are a silent-lip reference image generator, an adaptive triplet loss against pose leakage, and a stabilized formulation of the synchronization loss.
  • results: Combining the improvements yields state-of-the-art performance on LRS2 and LRW in both synchronization and visual quality; ablation experiments confirm the individual contributions and their complementary effects.
    Abstract Audio-driven talking face generation is the task of creating a lip-synchronized, realistic face video from given audio and reference frames. This involves two major challenges: overall visual quality of generated images on the one hand, and audio-visual synchronization of the mouth part on the other hand. In this paper, we start by identifying several problematic aspects of synchronization methods in recent audio-driven talking face generation approaches. Specifically, this involves unintended flow of lip and pose information from the reference to the generated image, as well as instabilities during model training. Subsequently, we propose various techniques for obviating these issues: First, a silent-lip reference image generator prevents leaking of lips from the reference to the generated image. Second, an adaptive triplet loss handles the pose leaking problem. Finally, we propose a stabilized formulation of synchronization loss, circumventing aforementioned training instabilities while additionally further alleviating the lip leaking issue. Combining the individual improvements, we present state-of-the art performance on LRS2 and LRW in both synchronization and visual quality. We further validate our design in various ablation experiments, confirming the individual contributions as well as their complementary effects.

An Evaluation of Zero-Cost Proxies – from Neural Architecture Performance to Model Robustness

  • paper_url: http://arxiv.org/abs/2307.09365
  • repo_url: None
  • paper_authors: Jovita Lukasik, Michael Moeller, Margret Keuper
  • for: Study the capability of zero-cost proxies, which predict architecture performance from untrained weights, when the goal is finding architectures that are both well-performing and robust.
  • methods: Common zero-cost proxies are analyzed as performance predictors for robustness in the popular NAS-Bench-201 search space, covering both the single prediction task for robustness and the joint multi-objective of clean and robust accuracy, together with an analysis of the proxies' feature importance.
  • results: Predicting robustness makes the prediction task more challenging for existing zero-cost proxies; several proxies must be considered jointly to predict a model's robustness, whereas clean accuracy can be regressed from a single such feature.
    Abstract Zero-cost proxies are nowadays frequently studied and used to search for neural architectures. They show an impressive ability to predict the performance of architectures by making use of their untrained weights. These techniques allow for immense search speed-ups. So far the joint search for well-performing and robust architectures has received much less attention in the field of NAS. Therefore, the main focus of zero-cost proxies is the clean accuracy of architectures, whereas the model robustness should play an evenly important part. In this paper, we analyze the ability of common zero-cost proxies to serve as performance predictors for robustness in the popular NAS-Bench-201 search space. We are interested in the single prediction task for robustness and the joint multi-objective of clean and robust accuracy. We further analyze the feature importance of the proxies and show that predicting the robustness makes the prediction task from existing zero-cost proxies more challenging. As a result, the joint consideration of several proxies becomes necessary to predict a model's robustness while the clean accuracy can be regressed from a single such feature.

Disentangle then Parse:Night-time Semantic Segmentation with Illumination Disentanglement

  • paper_url: http://arxiv.org/abs/2307.09362
  • repo_url: None
  • paper_authors: Zhixiang Wei, Lin Chen, Tao Tu, Huaian Chen, Pengyang Ling, Yi Jin
  • for: Improve night-time semantic segmentation performance.
  • methods: A disentangle-then-parse (DTP) paradigm explicitly separates night-time images into light-invariant reflectance and light-specific illumination components and recognizes semantics from their adaptive fusion; a Semantic-Oriented Disentanglement (SOD) framework extracts the reflectance component without interference from lighting, and an Illumination-Aware Parser (IAParser) explicitly learns the correlation between semantics and lighting.
  • results: DTP significantly outperforms state-of-the-art methods on night-time segmentation across various settings, and with negligible additional parameters it can directly benefit existing day-time methods at night.
    Abstract Most prior semantic segmentation methods have been developed for day-time scenes, while typically underperforming in night-time scenes due to insufficient and complicated lighting conditions. In this work, we tackle this challenge by proposing a novel night-time semantic segmentation paradigm, i.e., disentangle then parse (DTP). DTP explicitly disentangles night-time images into light-invariant reflectance and light-specific illumination components and then recognizes semantics based on their adaptive fusion. Concretely, the proposed DTP comprises two key components: 1) Instead of processing lighting-entangled features as in prior works, our Semantic-Oriented Disentanglement (SOD) framework enables the extraction of reflectance component without being impeded by lighting, allowing the network to consistently recognize the semantics under cover of varying and complicated lighting conditions. 2) Based on the observation that the illumination component can serve as a cue for some semantically confused regions, we further introduce an Illumination-Aware Parser (IAParser) to explicitly learn the correlation between semantics and lighting, and aggregate the illumination features to yield more precise predictions. Extensive experiments on the night-time segmentation task with various settings demonstrate that DTP significantly outperforms state-of-the-art methods. Furthermore, with negligible additional parameters, DTP can be directly used to benefit existing day-time methods for night-time segmentation.

OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2307.09356
  • repo_url: https://github.com/wudongming97/onlinerefer
  • paper_authors: Dongming Wu, Tiancai Wang, Yuang Zhang, Xiangyu Zhang, Jianbing Shen
  • for: Improve referring video object segmentation (RVOS), i.e., segmenting an object in a video following a human instruction, by breaking with the prevailing offline paradigm.
  • methods: A simple yet effective online model, OnlineRefer, uses explicit query propagation: target cues gather semantic information and position priors to improve the accuracy and ease of referring predictions for the current frame; the model is further generalized into a semi-online framework compatible with video-based backbones.
  • results: On four benchmarks (Refer-Youtube-VOS, Refer-DAVIS17, A2D-Sentences, and JHMDB-Sentences), OnlineRefer with a Swin-L backbone reaches 63.5 J&F on Refer-Youtube-VOS and 64.8 J&F on Refer-DAVIS17, outperforming all offline methods without bells and whistles.
    Abstract Referring video object segmentation (RVOS) aims at segmenting an object in a video following human instruction. Current state-of-the-art methods fall into an offline pattern, in which each clip independently interacts with text embedding for cross-modal understanding. They usually present that the offline pattern is necessary for RVOS, yet model limited temporal association within each clip. In this work, we break up the previous offline belief and propose a simple yet effective online model using explicit query propagation, named OnlineRefer. Specifically, our approach leverages target cues that gather semantic information and position prior to improve the accuracy and ease of referring predictions for the current frame. Furthermore, we generalize our online model into a semi-online framework to be compatible with video-based backbones. To show the effectiveness of our method, we evaluate it on four benchmarks, \ie, Refer-Youtube-VOS, Refer-DAVIS17, A2D-Sentences, and JHMDB-Sentences. Without bells and whistles, our OnlineRefer with a Swin-L backbone achieves 63.5 J&F and 64.8 J&F on Refer-Youtube-VOS and Refer-DAVIS17, outperforming all other offline methods.

SphereNet: Learning a Noise-Robust and General Descriptor for Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2307.09351
  • repo_url: None
  • paper_authors: Guiyu Zhao, Zhentao Guo, Xin Wang, Hongbin Ma
  • for: Propose a noise-robust point cloud registration method that also generalizes to unseen datasets, for aligning point clouds collected from different perspectives.
  • methods: SphereNet learns the descriptor in three stages: a spheroid generator builds a geometric domain based on spherical voxelization to encode initial features, spherical interpolation provides robustness against noise, and a spherical convolutional neural network with spherical integrity padding extracts the descriptors while reducing the loss of features.
  • results: Under high-intensity noise on the newly introduced 3DMatch-noise benchmark, SphereNet raises feature matching recall by more than 25 percentage points; it sets new state-of-the-art registration recall of 93.5% on 3DMatch and 75.6% on 3DLoMatch, with the best generalization to unseen datasets.
    Abstract Point cloud registration is to estimate a transformation to align point clouds collected in different perspectives. In learning-based point cloud registration, a robust descriptor is vital for high-accuracy registration. However, most methods are susceptible to noise and have poor generalization ability on unseen datasets. Motivated by this, we introduce SphereNet to learn a noise-robust and unseen-general descriptor for point cloud registration. In our method, first, the spheroid generator builds a geometric domain based on spherical voxelization to encode initial features. Then, the spherical interpolation of the sphere is introduced to realize robustness against noise. Finally, a new spherical convolutional neural network with spherical integrity padding completes the extraction of descriptors, which reduces the loss of features and fully captures the geometric features. To evaluate our methods, a new benchmark 3DMatch-noise with strong noise is introduced. Extensive experiments are carried out on both indoor and outdoor datasets. Under high-intensity noise, SphereNet increases the feature matching recall by more than 25 percentage points on 3DMatch-noise. In addition, it sets a new state-of-the-art performance for the 3DMatch and 3DLoMatch benchmarks with 93.5\% and 75.6\% registration recall and also has the best generalization ability on unseen datasets.

Visual Validation versus Visual Estimation: A Study on the Average Value in Scatterplots

  • paper_url: http://arxiv.org/abs/2307.09330
  • repo_url: None
  • paper_authors: Daniel Braun, Ashley Suh, Remco Chang, Michael Gleicher, Tatiana von Landesberger
  • for: Investigate how well individuals can visually validate statistical models, i.e., judge a model's fit to the data, a question far less studied than visual model estimation.
  • methods: A study across two populations (crowdsourced and volunteers) in which participants both visually estimated (drew) and visually validated (accepted or rejected) the frequently studied model of averages in scatterplots.
  • results: Validation and estimation were unbiased, yet the models participants accepted as valid were less accurate than the models they estimated; the natural critical point between accepting and rejecting a given mean lies close to the boundary of its 95% confidence interval, indicating that the visually perceived confidence interval corresponds to a common statistical standard.
    Abstract We investigate the ability of individuals to visually validate statistical models in terms of their fit to the data. While visual model estimation has been studied extensively, visual model validation remains under-investigated. It is unknown how well people are able to visually validate models, and how their performance compares to visual and computational estimation. As a starting point, we conducted a study across two populations (crowdsourced and volunteers). Participants had to both visually estimate (i.e, draw) and visually validate (i.e., accept or reject) the frequently studied model of averages. Across both populations, the level of accuracy of the models that were considered valid was lower than the accuracy of the estimated models. We find that participants' validation and estimation were unbiased. Moreover, their natural critical point between accepting and rejecting a given mean value is close to the boundary of its 95% confidence interval, indicating that the visually perceived confidence interval corresponds to a common statistical standard. Our work contributes to the understanding of visual model validation and opens new research opportunities.

Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving

  • paper_url: http://arxiv.org/abs/2307.09329
  • repo_url: https://github.com/kaavyarekanar/towards-a-performance-analysis-on-pre-trained-vqa-models-for-autonomous-driving
  • paper_authors: Kaavya Rekanar, Ciarán Eising, Ganesh Sistu, Martin Hayes
  • for: Present a preliminary analysis of three popular Visual Question Answering models (ViLBERT, ViLT, and LXMERT) on questions relating to driving scenarios.
  • methods: Performance is evaluated by comparing the similarity of model responses to reference answers provided by computer vision experts; model selection is predicated on an analysis of transformer utilization in multimodal architectures.
  • results: Models incorporating cross-modal attention and late fusion show promising potential for generating improved answers from a driving perspective; this analysis serves as a launchpad for a forthcoming comparative study of nine VQA models in self-driving scenarios.
    Abstract This short paper presents a preliminary analysis of three popular Visual Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, in the context of answering questions relating to driving scenarios. The performance of these models is evaluated by comparing the similarity of responses to reference answers provided by computer vision experts. Model selection is predicated on the analysis of transformer utilization in multimodal architectures. The results indicate that models incorporating cross-modal attention and late fusion techniques exhibit promising potential for generating improved answers within a driving perspective. This initial analysis serves as a launchpad for a forthcoming comprehensive comparative study involving nine VQA models and sets the scene for further investigations into the effectiveness of VQA model queries in self-driving scenarios. Supplementary material is available at https://github.com/KaavyaRekanar/Towards-a-performance-analysis-on-pre-trained-VQA-models-for-autonomous-driving.

Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis

  • paper_url: http://arxiv.org/abs/2307.09323
  • repo_url: https://github.com/fictionarry/er-nerf
  • paper_authors: Jiahe Li, Jiawei Zhang, Xiao Bai, Jun Zhou, Lin Gu
  • for: Talking portrait synthesis that concurrently achieves high fidelity, fast convergence, real-time rendering, and a small model size.
  • methods: A conditional Neural Radiance Fields architecture, ER-NeRF, explicitly exploits the unequal contribution of spatial regions: a compact Tri-Plane Hash Representation prunes empty spatial regions with three planar hash encoders, a Region Attention Module builds an explicit connection between audio features and spatial regions to capture priors of local motion, and a direct and fast Adaptive Pose Encoding maps the complex head-pose transformation into spatial coordinates to ease head-torso separation.
  • results: Compared with previous methods, ER-NeRF renders higher-fidelity, audio-lip synchronized talking portrait videos with realistic details and high efficiency.
    Abstract This paper presents ER-NeRF, a novel conditional Neural Radiance Fields (NeRF) based architecture for talking portrait synthesis that can concurrently achieve fast convergence, real-time rendering, and state-of-the-art performance with small model size. Our idea is to explicitly exploit the unequal contribution of spatial regions to guide talking portrait modeling. Specifically, to improve the accuracy of dynamic head reconstruction, a compact and expressive NeRF-based Tri-Plane Hash Representation is introduced by pruning empty spatial regions with three planar hash encoders. For speech audio, we propose a Region Attention Module to generate region-aware condition feature via an attention mechanism. Different from existing methods that utilize an MLP-based encoder to learn the cross-modal relation implicitly, the attention mechanism builds an explicit connection between audio features and spatial regions to capture the priors of local motions. Moreover, a direct and fast Adaptive Pose Encoding is introduced to optimize the head-torso separation problem by mapping the complex transformation of the head pose into spatial coordinates. Extensive experiments demonstrate that our method renders better high-fidelity and audio-lips synchronized talking portrait videos, with realistic details and high efficiency compared to previous methods.
    摘要 本文提出ER-NeRF,一种基于条件神经辐射场(NeRF)的新型说话人像合成架构,能够在模型体积较小的情况下同时实现快速收敛、实时渲染和最先进的性能。我们的思路是显式利用不同空间区域贡献不均的特点来指导说话人像建模。具体而言,为提高动态头部重建的准确性,我们通过三个平面哈希编码器剪除空白空间区域,引入一种紧凑且表达力强的基于NeRF的三平面哈希表示。针对语音音频,我们提出区域注意力模块,通过注意力机制生成区域感知的条件特征。与现有方法使用基于MLP的编码器隐式学习跨模态关系不同,注意力机制在音频特征与空间区域之间建立显式联系,以捕捉局部运动的先验。此外,我们引入一种直接且快速的自适应姿态编码,将头部姿态的复杂变换映射到空间坐标中,以优化头部与躯干分离问题。大量实验表明,与此前方法相比,我们的方法能渲染出细节更真实、效率更高、保真度更高且音频与唇形同步的说话人像视频。
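
A minimal sketch of how a region-attention mechanism in the spirit of the Region Attention Module could look: an audio feature vector attends over learnable region tokens to produce a region-aware condition feature. All module names, dimensions, and the pooling scheme below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): audio features attend over
# learnable spatial-region tokens to form a region-aware condition feature.
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    def __init__(self, audio_dim=64, region_dim=32, n_regions=16):
        super().__init__()
        # One learnable token per spatial region (hypothetical parameterization).
        self.region_tokens = nn.Parameter(torch.randn(n_regions, region_dim))
        self.query = nn.Linear(audio_dim, region_dim)

    def forward(self, audio_feat):               # audio_feat: (B, audio_dim)
        q = self.query(audio_feat)                # (B, region_dim)
        attn = torch.softmax(q @ self.region_tokens.T, dim=-1)  # (B, n_regions)
        # Region-aware condition: attention-weighted mix of region tokens.
        return attn @ self.region_tokens          # (B, region_dim)

cond = RegionAttention()(torch.randn(2, 64))
print(cond.shape)  # torch.Size([2, 32])
```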

Towards Automated Semantic Segmentation in Mammography Images

  • paper_url: http://arxiv.org/abs/2307.10296
  • repo_url: None
  • paper_authors: Cesar A. Sierra-Franco, Jan Hurtado, Victor de A. Thomaz, Leonardo C. da Cruz, Santiago V. Silva, Alberto B. Raposo
  • for: 检测不可触及的乳腺病灶或结节,预防癌症,并为必要时的干预规划提供机会
  • methods: 使用基于深度学习的框架自动分割乳头、胸肌、纤维腺体组织和脂肪组织
  • results: 在多样且具有挑战性的病例上实现了高精度的分割性能,可集成到临床实践中
    Abstract Mammography images are widely used to detect non-palpable breast lesions or nodules, preventing cancer and providing the opportunity to plan interventions when necessary. The identification of some structures of interest is essential to make a diagnosis and evaluate image adequacy. Thus, computer-aided detection systems can be helpful in assisting medical interpretation by automatically segmenting these landmark structures. In this paper, we propose a deep learning-based framework for the segmentation of the nipple, the pectoral muscle, the fibroglandular tissue, and the fatty tissue on standard-view mammography images. We introduce a large private segmentation dataset and extensive experiments considering different deep-learning model architectures. Our experiments demonstrate accurate segmentation performance on variate and challenging cases, showing that this framework can be integrated into clinical practice.
    摘要 乳腺X线摄影图像广泛用于检测不可触及的乳腺病灶或结节,有助于预防癌症,并在必要时为干预规划提供机会。准确识别若干关键结构对于做出诊断和评估图像质量至关重要。因此,计算机辅助检测系统可以通过自动分割这些标志性结构来辅助医学判读。本文提出一种基于深度学习的框架,用于在标准视图乳腺X线摄影图像中分割乳头、胸肌、纤维腺体组织和脂肪组织。我们引入一个大规模私有分割数据集,并针对不同深度学习模型架构开展大量实验。实验表明,该框架在多样且具有挑战性的病例上均能实现准确的分割性能,可集成到临床实践中。

MarS3D: A Plug-and-Play Motion-Aware Model for Semantic Segmentation on Multi-Scan 3D Point Clouds

  • paper_url: http://arxiv.org/abs/2307.09316
  • repo_url: https://github.com/cvmi-lab/mars3d
  • paper_authors: Jiahui Liu, Chirui Chang, Jianhui Liu, Xiaoyang Wu, Lan Ma, Xiaojuan Qi
  • for: 提高多扫描大规模点云语义分割的精度,以增强自动驾驶等自主系统的感知能力
  • methods: 提出名为MarS3D的即插即用运动感知模块,使单扫描模型具备多扫描感知能力;该模块包含两个关键设计:跨帧特征嵌入模块和运动感知特征学习模块
  • results: 实验表明,MarS3D能大幅提升基线模型的性能;代码可在 https://github.com/CVMI-Lab/MarS3D 获取
    Abstract 3D semantic segmentation on multi-scan large-scale point clouds plays an important role in autonomous systems. Unlike the single-scan-based semantic segmentation task, this task requires distinguishing the motion states of points in addition to their semantic categories. However, methods designed for single-scan-based segmentation tasks perform poorly on the multi-scan task due to the lacking of an effective way to integrate temporal information. We propose MarS3D, a plug-and-play motion-aware module for semantic segmentation on multi-scan 3D point clouds. This module can be flexibly combined with single-scan models to allow them to have multi-scan perception abilities. The model encompasses two key designs: the Cross-Frame Feature Embedding module for enriching representation learning and the Motion-Aware Feature Learning module for enhancing motion awareness. Extensive experiments show that MarS3D can improve the performance of the baseline model by a large margin. The code is available at https://github.com/CVMI-Lab/MarS3D.
    摘要 多扫描大规模点云上的三维语义分割在自主系统中扮演着重要角色。与基于单次扫描的语义分割任务不同,该任务除了区分点的语义类别外,还需区分其运动状态。然而,针对单扫描分割任务设计的方法由于缺乏有效整合时序信息的手段,在多扫描任务上表现不佳。我们提出MarS3D,一种即插即用的运动感知模块,用于多扫描三维点云的语义分割。该模块可灵活地与单扫描模型结合,使其具备多扫描感知能力。模型包含两个关键设计:用于丰富表示学习的跨帧特征嵌入模块,以及用于增强运动感知的运动感知特征学习模块。大量实验表明,MarS3D能大幅提升基线模型的性能。代码可在 https://github.com/CVMI-Lab/MarS3D 获取。

EigenTrajectory: Low-Rank Descriptors for Multi-Modal Trajectory Forecasting

  • paper_url: http://arxiv.org/abs/2307.09306
  • repo_url: https://github.com/inhwanbae/eigentrajectory
  • paper_authors: Inhwan Bae, Jean Oh, Hae-Gon Jeon
  • for: 行人轨迹预测
  • methods: 使用新的轨迹描述符将行人运动映射到一个紧凑的空间(称为 $\mathbb{ET}$ 空间),并通过低秩近似降低轨迹描述符的复杂度
  • results: 显著提升了现有轨迹预测模型的预测精度与可靠性,并通过基于轨迹锚点的细化方法覆盖所有可能的未来
    Abstract Capturing high-dimensional social interactions and feasible futures is essential for predicting trajectories. To address this complex nature, several attempts have been devoted to reducing the dimensionality of the output variables via parametric curve fitting such as the B\'ezier curve and B-spline function. However, these functions, which originate in computer graphics fields, are not suitable to account for socially acceptable human dynamics. In this paper, we present EigenTrajectory ($\mathbb{ET}$), a trajectory prediction approach that uses a novel trajectory descriptor to form a compact space, known here as $\mathbb{ET}$ space, in place of Euclidean space, for representing pedestrian movements. We first reduce the complexity of the trajectory descriptor via a low-rank approximation. We transform the pedestrians' history paths into our $\mathbb{ET}$ space represented by spatio-temporal principle components, and feed them into off-the-shelf trajectory forecasting models. The inputs and outputs of the models as well as social interactions are all gathered and aggregated in the corresponding $\mathbb{ET}$ space. Lastly, we propose a trajectory anchor-based refinement method to cover all possible futures in the proposed $\mathbb{ET}$ space. Extensive experiments demonstrate that our EigenTrajectory predictor can significantly improve both the prediction accuracy and reliability of existing trajectory forecasting models on public benchmarks, indicating that the proposed descriptor is suited to represent pedestrian behaviors. Code is publicly available at https://github.com/inhwanbae/EigenTrajectory .
    摘要 捕捉高维社交互动与可行的未来是轨迹预测的关键。为应对这种复杂性,已有多种尝试通过参数化曲线拟合(如贝塞尔曲线和B样条函数)来降低输出变量的维度。然而,这些源自计算机图形学领域的函数并不适合刻画符合社会规范的人类动态。本文提出EigenTrajectory($\mathbb{ET}$),一种轨迹预测方法,它使用一种新的轨迹描述符构建紧凑的空间(称为 $\mathbb{ET}$ 空间),以取代欧氏空间来表示行人运动。我们首先通过低秩近似降低轨迹描述符的复杂度,将行人的历史路径变换到由时空主成分张成的 $\mathbb{ET}$ 空间中,再将其输入现成的轨迹预测模型。模型的输入、输出以及社交互动均在相应的 $\mathbb{ET}$ 空间中汇聚与聚合。最后,我们提出一种基于轨迹锚点的细化方法,以覆盖 $\mathbb{ET}$ 空间中所有可能的未来。大量实验表明,EigenTrajectory预测器能在公开基准上显著提升现有轨迹预测模型的预测精度与可靠性,说明所提出的描述符适合表示行人行为。代码可在 https://github.com/inhwanbae/EigenTrajectory 获取。
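
The low-rank descriptor idea can be illustrated with a plain SVD: flatten observed trajectories, extract spatio-temporal principal components, and represent each trajectory by its coefficients in that basis. The rank `k` and the plain-SVD construction below are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of a low-rank trajectory descriptor: SVD of flattened trajectories
# yields spatio-temporal principal components; coefficients in that basis
# serve as compact descriptors (a toy analogue of the "ET space").
import numpy as np

rng = np.random.default_rng(0)
N, T = 1000, 8                       # trajectories, observed time steps
trajs = rng.normal(size=(N, T, 2))   # (x, y) positions, toy data

X = trajs.reshape(N, T * 2)          # one row per trajectory
X = X - X.mean(axis=0)               # center
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 4                                # dimension of the compact space
basis = Vt[:k]                       # spatio-temporal principal components
coeffs = X @ basis.T                 # (N, k): compact descriptors
recon = coeffs @ basis               # low-rank reconstruction
err = np.linalg.norm(X - recon) / np.linalg.norm(X)
print(f"relative reconstruction error with k={k}: {err:.3f}")
```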

Conformal prediction under ambiguous ground truth

  • paper_url: http://arxiv.org/abs/2307.09302
  • repo_url: None
  • paper_authors: David Stutz, Abhijit Guha Roy, Tatiana Matejovicova, Patricia Strachan, Ali Taylan Cemgil, Arnaud Doucet
  • for: 在安全关键的分类任务中进行严格的不确定性量化,即给出以用户指定概率包含真实类别的置信集
  • methods: 使用保形预测(conformal prediction)给出包含真实类别的置信集;该方法通常假设存在一个带有真实标签的保留校准集
  • results: 提出一种适用于标签含糊(ambiguous ground truth)场景的保形预测框架,其基于给定输入下标签后验分布的近似;在合成数据与真实数据(包括皮肤病分类案例研究)上验证了该方法
    Abstract In safety-critical classification tasks, conformal prediction allows to perform rigorous uncertainty quantification by providing confidence sets including the true class with a user-specified probability. This generally assumes the availability of a held-out calibration set with access to ground truth labels. Unfortunately, in many domains, such labels are difficult to obtain and usually approximated by aggregating expert opinions. In fact, this holds true for almost all datasets, including well-known ones such as CIFAR and ImageNet. Applying conformal prediction using such labels underestimates uncertainty. Indeed, when expert opinions are not resolvable, there is inherent ambiguity present in the labels. That is, we do not have ``crisp'', definitive ground truth labels and this uncertainty should be taken into account during calibration. In this paper, we develop a conformal prediction framework for such ambiguous ground truth settings which relies on an approximation of the underlying posterior distribution of labels given inputs. We demonstrate our methodology on synthetic and real datasets, including a case study of skin condition classification in dermatology.
    摘要 在安全关键的分类任务中,保形预测通过给出以用户指定概率包含真实类别的置信集,实现严格的不确定性量化。这通常假设存在一个可获得真实标签的保留校准集。遗憾的是,在许多领域中,此类标签难以获取,通常通过汇总专家意见来近似。事实上,几乎所有数据集(包括著名的CIFAR和ImageNet)都是如此。使用此类标签进行保形预测会低估不确定性。实际上,当专家意见无法达成一致时,标签本身就带有内在的含糊性,即我们并不拥有"明确的"真实标签,这种不确定性应在校准中加以考虑。本文针对这种标签含糊的情形,提出一个基于给定输入下标签后验分布近似的保形预测框架。我们在合成数据与真实数据上验证了该方法,包括一个皮肤病分类的案例研究。
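
A rough sketch of how ambiguous labels change split conformal calibration: each calibration example carries a plausibility distribution over classes (standing in for the label posterior), and a label is sampled from it before computing the usual conformity score. This Monte-Carlo treatment is an illustrative simplification, not the paper's exact procedure.

```python
# Sketch: split conformal prediction with ambiguous calibration labels.
import numpy as np

rng = np.random.default_rng(0)
n_cal, n_classes, alpha = 500, 5, 0.1

probs = rng.dirichlet(np.ones(n_classes), size=n_cal)       # model outputs
plaus = rng.dirichlet(np.ones(n_classes) * 5, size=n_cal)   # label plausibilities

# Sample a label per calibration point from its plausibility distribution,
# then use the standard score s = 1 - p_model(label).
labels = np.array([rng.choice(n_classes, p=p) for p in plaus])
scores = 1.0 - probs[np.arange(n_cal), labels]

level = min(1.0, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)
q = np.quantile(scores, level)                               # calibrated threshold

test_probs = rng.dirichlet(np.ones(n_classes))
pred_set = np.where(1.0 - test_probs <= q)[0]                # confidence set
print("prediction set:", pred_set)
```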

RepViT: Revisiting Mobile CNN From ViT Perspective

  • paper_url: http://arxiv.org/abs/2307.09283
  • repo_url: https://github.com/jameslahm/RepViT
  • paper_authors: Ao Wang, Hui Chen, Zijia Lin, Hengjun Pu, Guiguang Ding
  • for: 从轻量级视觉Transformer(ViT)的视角重新审视轻量级卷积神经网络(CNN)的高效设计,提升其在资源受限移动设备上的性能与延迟表现
  • methods: 通过将轻量级ViT的高效架构设计逐步融入标准轻量级CNN(MobileNetV3),得到名为RepViT的纯轻量级CNN新系列
  • results: 实验表明,RepViT在多种视觉任务中优于现有最先进的轻量级ViT;在ImageNet上,RepViT在iPhone 12上以约1ms的延迟实现了超过80%的top-1准确率,据作者所知这是轻量级模型首次达到该水平;最大模型RepViT-M3以1.3ms的延迟实现81.4%的准确率
    Abstract Recently, lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs) on resource-constrained mobile devices. This improvement is usually attributed to the multi-head self-attention module, which enables the model to learn global representations. However, the architectural disparities between lightweight ViTs and lightweight CNNs have not been adequately examined. In this study, we revisit the efficient design of lightweight CNNs and emphasize their potential for mobile devices. We incrementally enhance the mobile-friendliness of a standard lightweight CNN, specifically MobileNetV3, by integrating the efficient architectural choices of lightweight ViTs. This ends up with a new family of pure lightweight CNNs, namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency in various vision tasks. On ImageNet, RepViT achieves over 80\% top-1 accuracy with nearly 1ms latency on an iPhone 12, which is the first time for a lightweight model, to the best of our knowledge. Our largest model, RepViT-M3, obtains 81.4\% accuracy with only 1.3ms latency. The code and trained models are available at \url{https://github.com/jameslahm/RepViT}.
    摘要 近来,轻量级视觉Transformer(ViT)在资源受限的移动设备上展现出优于轻量级卷积神经网络(CNN)的性能和更低的延迟。这种改进通常归功于多头自注意力模块,它使模型能够学习全局表示。然而,轻量级ViT与轻量级CNN之间的架构差异尚未得到充分研究。本研究重新审视轻量级CNN的高效设计,并强调其在移动设备上的潜力。我们通过融合轻量级ViT的高效架构选择,逐步增强标准轻量级CNN(MobileNetV3)的移动友好性,最终得到一个新的纯轻量级CNN系列,即RepViT。大量实验表明,RepViT优于现有最先进的轻量级ViT,并在多种视觉任务中表现出良好的延迟。在ImageNet上,RepViT在iPhone 12上以接近1ms的延迟实现了超过80%的top-1准确率,据我们所知这是轻量级模型首次达到该水平。我们最大的模型RepViT-M3以仅1.3ms的延迟取得81.4%的准确率。代码与训练模型可在 https://github.com/jameslahm/RepViT 获取。

Regression-free Blind Image Quality Assessment

  • paper_url: http://arxiv.org/abs/2307.09279
  • repo_url: https://github.com/XiaoqiWang/regression-free-iqa
  • paper_authors: Xiaoqi Wang, Jian Xiong, Hao Gao, Weisi Lin
  • for: 提高盲图像质量评估模型的准确性,缓解有偏训练样本导致的模型参数估计偏差
  • methods: 基于检索相似实例的无回归框架,结合语义特征与失真特征来评估图像质量
  • results: 与最先进的基于回归的模型相比,所提模型能显著提升图像质量评估的准确性
    Abstract Regression-based blind image quality assessment (IQA) models are susceptible to biased training samples, leading to a biased estimation of model parameters. To mitigate this issue, we propose a regression-free framework for image quality evaluation, which is founded upon retrieving similar instances by incorporating semantic and distortion features. The motivation behind this approach is rooted in the observation that the human visual system (HVS) has analogous visual responses to semantically similar image contents degraded by the same distortion. The proposed framework comprises two classification-based modules: semantic-based classification (SC) module and distortion-based classification (DC) module. Given a test image and an IQA database, the SC module retrieves multiple pristine images based on semantic similarity. The DC module then retrieves instances based on distortion similarity from the distorted images that correspond to each retrieved pristine image. Finally, the predicted quality score is derived by aggregating the subjective quality scores of multiple retrieved instances. Experimental results on four benchmark databases validate that the proposed model can remarkably outperform the state-of-the-art regression-based models.
    摘要 基于回归的盲图像质量评估(IQA)模型容易受到有偏训练样本的影响,导致模型参数的有偏估计。为缓解这一问题,我们提出一种无回归的图像质量评估框架,其基础是结合语义特征与失真特征来检索相似实例。该思路源于如下观察:人类视觉系统(HVS)对受到相同失真影响的语义相似图像内容会产生类似的视觉响应。所提框架包含两个基于分类的模块:基于语义的分类(SC)模块和基于失真的分类(DC)模块。给定一张测试图像和一个IQA数据库,SC模块根据语义相似性检索多张无失真图像;DC模块再从与每张检索到的无失真图像相对应的失真图像中,根据失真相似性检索实例。最后,通过聚合多个检索实例的主观质量分数得到预测质量分数。在四个基准数据库上的实验结果验证了所提模型能显著优于最先进的基于回归的模型。
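
The retrieve-and-aggregate idea can be sketched in a few lines: score a test image by the mean opinion scores (MOS) of database images that are close in both a semantic and a distortion feature space. The feature extractors and retrieval sizes below are illustrative placeholders, not the paper's SC/DC classifiers.

```python
# Sketch of regression-free quality prediction via two-stage retrieval.
import numpy as np

rng = np.random.default_rng(0)
n_db, d = 200, 16
db_sem = rng.normal(size=(n_db, d))      # semantic features of database images
db_dist = rng.normal(size=(n_db, d))     # distortion features
db_mos = rng.uniform(1, 5, size=n_db)    # subjective quality scores

q_sem, q_dist = rng.normal(size=d), rng.normal(size=d)   # test-image features

# Stage 1: semantic retrieval (top-K by cosine similarity).
sem_sim = db_sem @ q_sem / (np.linalg.norm(db_sem, axis=1) * np.linalg.norm(q_sem))
topk = np.argsort(-sem_sim)[:20]

# Stage 2: distortion retrieval within the semantic candidates.
dist_sim = db_dist[topk] @ q_dist
best = topk[np.argsort(-dist_sim)[:5]]

# Aggregate the subjective scores of the retrieved instances.
print("predicted quality:", db_mos[best].mean())
```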

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding

  • paper_url: http://arxiv.org/abs/2307.09267
  • repo_url: None
  • paper_authors: Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, Zhou Zhao
  • for: 提出一种弱监督的三维视觉定位方法,仅凭自然语言描述即可在三维场景中定位目标物体
  • methods: 提出一种以由粗到细方式分析物体候选与句子语义相似度的语义匹配模型,仅利用粗粒度的场景-句子对应关系学习物体-句子关联;并将由粗到细的语义匹配知识蒸馏到经典的两阶段三维视觉定位模型中,以降低推理开销并提升性能
  • results: 在ScanRefer、Nr3D和Sr3D数据集上的大量实验验证了所提方法的有效性
    Abstract 3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query. Although many approaches have been proposed and achieved impressive performance, they all require dense object-sentence pair annotations in 3D point clouds, which are both time-consuming and expensive. To address the problem that fine-grained annotated data is difficult to obtain, we propose to leverage weakly supervised annotations to learn the 3D visual grounding model, i.e., only coarse scene-sentence correspondences are used to learn object-sentence links. To accomplish this, we design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner. Specifically, we first extract object proposals and coarsely select the top-K candidates based on feature and class similarity matrices. Next, we reconstruct the masked keywords of the sentence using each candidate one by one, and the reconstructed accuracy finely reflects the semantic similarity of each candidate to the query. Additionally, we distill the coarse-to-fine semantic matching knowledge into a typical two-stage 3D visual grounding model, which reduces inference costs and improves performance by taking full advantage of the well-studied structure of the existing architectures. We conduct extensive experiments on ScanRefer, Nr3D, and Sr3D, which demonstrate the effectiveness of our proposed method.
    摘要 三维视觉定位旨在三维场景中找到与给定句子查询相对应的目标物体。尽管已有许多方法被提出并取得了令人瞩目的性能,但它们都依赖三维点云中密集的物体-句子对标注,而这类标注既耗时又昂贵。针对细粒度标注数据难以获取的问题,我们提出利用弱监督标注来学习三维视觉定位模型,即仅使用粗粒度的场景-句子对应关系来学习物体-句子关联。为此,我们设计了一种新的语义匹配模型,以由粗到细的方式分析物体候选与句子之间的语义相似度。具体而言,我们首先提取物体候选,并基于特征和类别相似度矩阵粗选出前K个候选;接着,我们依次用每个候选重建句子中被掩码的关键词,重建的准确度细粒度地反映了该候选与查询的语义相似度。此外,我们将这种由粗到细的语义匹配知识蒸馏到典型的两阶段三维视觉定位模型中,从而在充分利用现有架构的成熟结构的同时降低推理开销并提升性能。我们在ScanRefer、Nr3D和Sr3D上开展了大量实验,验证了所提方法的有效性。

Knowledge Distillation for Object Detection: from generic to remote sensing datasets

  • paper_url: http://arxiv.org/abs/2307.09264
  • repo_url: None
  • paper_authors: Hoàng-Ân Lê, Minh-Tan Pham
  • for: 在遥感场景下评估多种最初在通用计算机视觉数据集上开发的目标检测知识蒸馏方法的表现
  • methods: 涵盖logit模仿与特征模仿两类方法,在xView和VEDAI等知名基准上针对车辆检测开展大量实验
  • results: 实验结果差异较大,证实了在遥感数据集上进行结果聚合与交叉验证的重要性
    Abstract Knowledge distillation, a well-known model compression technique, is an active research area in both computer vision and remote sensing communities. In this paper, we evaluate in a remote sensing context various off-the-shelf object detection knowledge distillation methods which have been originally developed on generic computer vision datasets such as Pascal VOC. In particular, methods covering both logit mimicking and feature imitation approaches are applied for vehicle detection using the well-known benchmarks such as xView and VEDAI datasets. Extensive experiments are performed to compare the relative performance and interrelationships of the methods. Experimental results show high variations and confirm the importance of result aggregation and cross validation on remote sensing datasets.
    摘要 知识蒸馏是一种广为人知的模型压缩技术,也是计算机视觉与遥感领域的活跃研究方向。本文在遥感场景下评估了多种现成的目标检测知识蒸馏方法,这些方法最初是在Pascal VOC等通用计算机视觉数据集上开发的。特别地,我们应用了涵盖logit模仿与特征模仿两类思路的方法,在xView和VEDAI等知名基准上进行车辆检测。我们开展了大量实验,以比较这些方法的相对性能及其相互关系。实验结果差异较大,证实了在遥感数据集上进行结果聚合与交叉验证的重要性。
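
A minimal logit-mimicking distillation loss of the kind benchmarked in the paper: the student's classification logits are pulled toward the teacher's temperature-softened distribution via KL divergence. Detector-specific details (proposal matching, feature-imitation masks) are omitted.

```python
# Sketch of logit-mimicking knowledge distillation.
import torch
import torch.nn.functional as F

def logit_mimicking_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened logits."""
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    # The T^2 factor keeps gradient magnitude comparable across temperatures.
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

s = torch.randn(8, 20, requires_grad=True)   # student logits (e.g. per ROI)
t = torch.randn(8, 20)                       # frozen teacher logits
loss = logit_mimicking_loss(s, t)
loss.backward()
print(float(loss))
```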

Neuromorphic spintronics simulated using an unconventional data-driven Thiele equation approach

  • paper_url: http://arxiv.org/abs/2307.09262
  • repo_url: None
  • paper_authors: Anatole Moureaux, Simon de Wergifosse, Chloé Chopin, Flavio Abreu Araujo
  • for: 建立自旋转矩涡旋纳米振荡器(STVO)动力学的定量描述,以加速基于STVO的类脑计算器件设计并大幅降低其计算成本
  • methods: 采用一种非常规的数据驱动模型,将Thiele方程方法(TEA)与微磁模拟(MMS)数据相结合来求解STVO动力学;与MMS相比,该解析模型可将模拟速度提升9个数量级,同时保持同等精度
  • results: 通过模拟基于STVO的神经网络求解分类任务,并评估其性能随输入信号电流强度和噪声水平的变化;结果表明,该方法有望在大幅降低计算成本的同时加速基于STVO的类脑计算器件的设计
    Abstract In this study, we developed a quantitative description of the dynamics of spin-torque vortex nano-oscillators (STVOs) through an unconventional model based on the combination of the Thiele equation approach (TEA) and data from micromagnetic simulations (MMS). Solving the STVO dynamics with our analytical model allows to accelerate the simulations by 9 orders of magnitude compared to MMS while reaching the same level of accuracy. Here, we showcase our model by simulating a STVO-based neural network for solving a classification task. We assess its performance with respect to the input signal current intensity and the level of noise that might affect such a system. Our approach is promising for accelerating the design of STVO-based neuromorphic computing devices while decreasing drastically its computational cost.
    摘要 在本研究中,我们通过一种将Thiele方程方法(TEA)与微磁模拟(MMS)数据相结合的非常规模型,建立了自旋转矩涡旋纳米振荡器(STVO)动力学的定量描述。使用我们的解析模型求解STVO动力学,可在达到与MMS同等精度的前提下将模拟速度提升9个数量级。我们通过模拟一个基于STVO的神经网络求解分类任务来展示该模型,并评估其性能随输入信号电流强度以及可能影响此类系统的噪声水平的变化。该方法有望在大幅降低计算成本的同时,加速基于STVO的类脑计算器件的设计。

Adaptive Topological Feature via Persistent Homology: Filtration Learning for Point Clouds

  • paper_url: http://arxiv.org/abs/2307.09259
  • repo_url: None
  • paper_authors: Naoki Nishikawa, Yuichi Ike, Kenji Yamanishi
  • for: 通过引入全局拓扑特征,提高机器学习方法在点云上的精度
  • methods: 提出一种借助神经网络自适应地学习滤流(filtration)的框架,并设计具有等距不变性的网络架构用于计算持续同调(persistent homology)
  • results: 在多个分类任务上的实验结果表明了该框架的有效性
    Abstract Machine learning for point clouds has been attracting much attention, with many applications in various fields, such as shape recognition and material science. To enhance the accuracy of such machine learning methods, it is known to be effective to incorporate global topological features, which are typically extracted by persistent homology. In the calculation of persistent homology for a point cloud, we need to choose a filtration for the point clouds, an increasing sequence of spaces. Because the performance of machine learning methods combined with persistent homology is highly affected by the choice of a filtration, we need to tune it depending on data and tasks. In this paper, we propose a framework that learns a filtration adaptively with the use of neural networks. In order to make the resulting persistent homology isometry-invariant, we develop a neural network architecture with such invariance. Additionally, we theoretically show a finite-dimensional approximation result that justifies our architecture. Experimental results demonstrated the efficacy of our framework in several classification tasks.
    摘要 面向点云的机器学习备受关注,在形状识别和材料科学等多个领域有广泛应用。为提升此类机器学习方法的精度,引入通常由持续同调提取的全局拓扑特征被认为行之有效。在计算点云的持续同调时,需要为点云选择一个滤流,即一列递增的空间序列。由于结合持续同调的机器学习方法的性能高度依赖于滤流的选择,我们需要根据数据和任务对其进行调节。本文提出一个借助神经网络自适应地学习滤流的框架。为使所得的持续同调具有等距不变性,我们设计了具有该不变性的神经网络架构。此外,我们从理论上给出了一个有限维近似结果,为我们的架构提供了依据。实验结果表明了该框架在多个分类任务上的有效性。
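
The forward computation can be sketched as follows: a point cloud is turned into a persistence diagram via a Rips filtration whose scale is modulated by per-point weights. The weighting scheme below is a toy stand-in for the paper's neural filtration, and differentiating through `gudhi` takes extra machinery not shown here. Requires `pip install gudhi`.

```python
# Sketch: persistent homology of a point cloud under a weighted filtration.
import numpy as np
import gudhi

rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 3))                # toy point cloud
w = rng.uniform(0.5, 1.5, size=50)            # "learned" per-point weights

# Weighted distance matrix: a crude adaptive filtration.
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
D = D * np.sqrt(w[:, None] * w[None, :])

rips = gudhi.RipsComplex(distance_matrix=D, max_edge_length=3.0)
st = rips.create_simplex_tree(max_dimension=2)
diag = st.persistence()                       # list of (dim, (birth, death))
h1 = [(b, d) for dim, (b, d) in diag if dim == 1]
print(f"{len(h1)} one-dimensional features (loops)")
```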

Generation of High Spatial Resolution Terrestrial Surface from Low Spatial Resolution Elevation Contour Maps via Hierarchical Computation of Median Elevation Regions

  • paper_url: http://arxiv.org/abs/2307.09239
  • repo_url: None
  • paper_authors: Geetika Barman, B. S. Daya Sagar
  • for: 将稀疏的数字高程模型(DEM)转换为稠密的数字高程模型
  • methods: 通过层次计算中值高程区域实现转换,依次包括:将既有稀疏等高线图分解为最大可能的阈值高程区域(TER);在相邻TER之间层次地计算所有可能的非负、无权中值高程区域(MER);计算所有TER与MER的梯度,从而在更高空间分辨率下预测中间高程等高线
  • results: 该方法利用既有等高线的几何信息,在地形表面的新空间区域内插值高程等高线,直至无需再生成新的等高线;方法成本低且稳健。先以自制合成数据演示等高线预测过程,再在美国新罕布什尔州Washington的等高线图上实验验证其实用性
    Abstract We proposed a simple yet effective morphological approach to convert a sparse Digital Elevation Model (DEM) to a dense Digital Elevation Model. The conversion is similar to that of the generation of high-resolution DEM from its low-resolution DEM. The approach involves the generation of median contours to achieve the purpose. It is a sequential step of the I) decomposition of the existing sparse Contour map into the maximum possible Threshold Elevation Region (TERs). II) Computing all possible non-negative and non-weighted Median Elevation Region (MER) hierarchically between the successive TER decomposed from a sparse contour map. III) Computing the gradient of all TER, and MER computed from previous steps would yield the predicted intermediate elevation contour at a higher spatial resolution. We presented this approach initially with some self-made synthetic data to show how the contour prediction works and then experimented with the available contour map of Washington, NH to justify its usefulness. This approach considers the geometric information of existing contours and interpolates the elevation contour at a new spatial region of a topographic surface until no elevation contours are necessary to generate. This novel approach is also very low-cost and robust as it uses elevation contours.
    摘要 我们提出一种简单而有效的形态学方法,将稀疏的数字高程模型(DEM)转换为稠密的数字高程模型,其过程类似于由低分辨率DEM生成高分辨率DEM。该方法通过生成中值等高线来实现这一目的,依次包括以下步骤:(1)将既有的稀疏等高线图分解为最大可能的阈值高程区域(TER);(2)在由稀疏等高线图分解得到的相邻TER之间,层次地计算所有可能的非负、无权中值高程区域(MER);(3)计算前述步骤得到的所有TER与MER的梯度,从而在更高空间分辨率下预测中间高程等高线。我们首先用自制的合成数据演示等高线预测的工作方式,随后在美国新罕布什尔州Washington现有的等高线图上进行实验,以论证其实用性。该方法利用既有等高线的几何信息,在地形表面的新空间区域内插值高程等高线,直至无需再生成新的等高线。这种新方法基于高程等高线,成本很低且十分稳健。
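
One interpolation step between two nested threshold elevation regions can be sketched with distance transforms: the intermediate region contains every point at least as close to the inner region as to the outside of the outer one, and its boundary is the new contour. This distance-transform scheme is an illustrative approximation of the paper's median elevation region computation, not the authors' exact algorithm.

```python
# Sketch: a median-like region between two nested contour regions.
import numpy as np
from scipy import ndimage

def median_region(inner, outer):
    """inner, outer: boolean masks with inner a subset of outer (nested TERs)."""
    d_in = ndimage.distance_transform_edt(~inner)   # distance to inner region
    d_out = ndimage.distance_transform_edt(outer)   # distance to outside of outer
    # Keep points (within the outer region) closer to inner than to outside.
    return inner | ((d_in <= d_out) & outer)

yy, xx = np.mgrid[:100, :100]
r = np.hypot(yy - 50, xx - 50)
inner, outer = r < 15, r < 35                       # two nested contour regions
mid = median_region(inner, outer)                   # intermediate contour region
print("areas:", inner.sum(), mid.sum(), outer.sum())
```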

Fusing Hand and Body Skeletons for Human Action Recognition in Assembly

  • paper_url: http://arxiv.org/abs/2307.09238
  • repo_url: None
  • paper_authors: Dustin Aganian, Mona Köhler, Benedict Stephan, Markus Eisenbach, Horst-Michael Gross
  • for: 提升工业装配场景中人机协作的效果,使协作机器人能够识别人类动作
  • methods: 将细节较少的身体骨架与高度精细的手部骨架相结合,并研究卷积神经网络(CNN)与Transformer,后者尤擅长通过注意力机制提取并融合两类骨架的重要信息
  • results: 实验结果表明,融合身体骨架与手部骨架能提升装配场景中的动作识别效果
    Abstract As collaborative robots (cobots) continue to gain popularity in industrial manufacturing, effective human-robot collaboration becomes crucial. Cobots should be able to recognize human actions to assist with assembly tasks and act autonomously. To achieve this, skeleton-based approaches are often used due to their ability to generalize across various people and environments. Although body skeleton approaches are widely used for action recognition, they may not be accurate enough for assembly actions where the worker's fingers and hands play a significant role. To address this limitation, we propose a method in which less detailed body skeletons are combined with highly detailed hand skeletons. We investigate CNNs and transformers, the latter of which are particularly adept at extracting and combining important information from both skeleton types using attention. This paper demonstrates the effectiveness of our proposed approach in enhancing action recognition in assembly scenarios.
    摘要 随着协作机器人(cobot)在工业制造中的日益普及,高效的人机协作变得至关重要。协作机器人应能够识别人类动作,以协助装配任务并自主行动。为此,人们常采用基于骨架的方法,因为其能够在不同的人员和环境间泛化。虽然身体骨架方法被广泛用于动作识别,但对于工人手指和手部起重要作用的装配动作,其精度可能不足。针对这一局限,我们提出一种将细节较少的身体骨架与高度精细的手部骨架相结合的方法。我们研究了CNN与Transformer,后者尤其擅长通过注意力机制从两类骨架中提取并融合重要信息。本文验证了所提方法在装配场景中提升动作识别效果的有效性。
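
A toy sketch of fusing the two skeleton streams with a transformer: each skeleton is embedded into a token sequence, the sequences are concatenated, and self-attention lets informative hand joints weigh in on the action label. Joint counts, dimensions, and the single-layer encoder are assumptions, not the paper's configuration.

```python
# Sketch: attention-based fusion of body and hand skeleton tokens.
import torch
import torch.nn as nn

class SkeletonFusion(nn.Module):
    def __init__(self, n_body=17, n_hand=42, d=64, n_classes=10):
        super().__init__()
        self.body_embed = nn.Linear(3, d)   # (x, y, z) per body joint
        self.hand_embed = nn.Linear(3, d)   # (x, y, z) per hand joint
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(d, n_classes)

    def forward(self, body, hands):           # (B, n_body, 3), (B, n_hand, 3)
        tokens = torch.cat([self.body_embed(body), self.hand_embed(hands)], dim=1)
        fused = self.encoder(tokens)           # joint attention across both streams
        return self.head(fused.mean(dim=1))    # pooled action logits

logits = SkeletonFusion()(torch.randn(2, 17, 3), torch.randn(2, 42, 3))
print(logits.shape)  # torch.Size([2, 10])
```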

Augmenting CLIP with Improved Visio-Linguistic Reasoning

  • paper_url: http://arxiv.org/abs/2307.09233
  • repo_url: None
  • paper_authors: Samyadeep Basu, Maziar Sanjabi, Daniela Massiceti, Shell Xu Hu, Soheil Feizi
  • for: 提升CLIP模型的组合式视觉-语言推理能力
  • methods: 提出一种名为SDS-CLIP的样本高效轻量级方法:利用可微图像参数化,以Stable-Diffusion等大型文本到图像生成模型(此类模型相对擅长视觉-语言推理任务)的蒸馏目标对CLIP进行微调
  • results: 在Winoground基准上,该方法将不同CLIP模型的视觉-语言推理绝对性能提升最多7%,在ARO数据集上提升最多3%;同时,该方法还使CLIP在多种下游数据集上的零样本性能略有提升
    Abstract Image-text contrastive models such as CLIP are useful for a variety of downstream applications including zero-shot classification, image-text retrieval and transfer learning. However, these contrastively trained vision-language models often fail on compositional visio-linguistic tasks such as Winoground with performance equivalent to random chance. In our paper, we address this issue and propose a sample-efficient light-weight method called SDS-CLIP to improve the compositional visio-linguistic reasoning capabilities of CLIP. The core idea of our method is to use differentiable image parameterizations to fine-tune CLIP with a distillation objective from large text-to-image generative models such as Stable-Diffusion which are relatively good at visio-linguistic reasoning tasks. On the challenging Winoground compositional reasoning benchmark, our method improves the absolute visio-linguistic performance of different CLIP models by up to 7%, while on the ARO dataset, our method improves the visio-linguistic performance by upto 3%. As a byproduct of inducing visio-linguistic reasoning into CLIP, we also find that the zero-shot performance improves marginally on a variety of downstream datasets. Our method reinforces that carefully designed distillation objectives from generative models can be leveraged to extend existing contrastive image-text models with improved visio-linguistic reasoning capabilities.
    摘要 诸如CLIP的图文对比模型可用于多种下游应用,包括零样本分类、图文检索和迁移学习。然而,这些经对比训练的视觉-语言模型在组合式视觉-语言任务(如Winoground)上常常失效,表现与随机猜测相当。本文针对这一问题,提出一种样本高效的轻量级方法SDS-CLIP,以提升CLIP的组合式视觉-语言推理能力。该方法的核心思想是利用可微图像参数化,以Stable-Diffusion等大型文本到图像生成模型(此类模型相对擅长视觉-语言推理任务)的蒸馏目标对CLIP进行微调。在具有挑战性的Winoground组合推理基准上,我们的方法将不同CLIP模型的视觉-语言绝对性能提升最多7%;在ARO数据集上提升最多3%。作为向CLIP注入视觉-语言推理能力的副产物,我们还发现其在多种下游数据集上的零样本性能略有提升。我们的方法印证了:精心设计的来自生成模型的蒸馏目标,可用于为现有图文对比模型扩展更强的视觉-语言推理能力。
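
A heavily simplified sketch of a score-distillation-style fine-tuning loss: the gradient from a frozen denoiser's noise-prediction error, conditioned on the caption, is pushed into a differentiable image parameterization. `denoiser` below is a placeholder for a frozen text-to-image diffusion model; the noise schedule, conditioning, and weighting do not reproduce SDS-CLIP's exact formulation.

```python
# Sketch of a score-distillation-sampling (SDS) style loss, under assumptions.
import torch

def sds_style_loss(image, text_emb, denoiser, T=1000):
    t = torch.randint(1, T, (1,))
    alpha_bar = torch.cos(t / T * torch.pi / 2) ** 2     # toy noise schedule
    noise = torch.randn_like(image)
    noisy = alpha_bar.sqrt() * image + (1 - alpha_bar).sqrt() * noise
    eps_pred = denoiser(noisy, t, text_emb)              # frozen model in practice
    # Standard SDS trick: treat (eps_pred - noise) as the gradient direction.
    return ((eps_pred - noise).detach() * image).sum()

# Toy stand-ins so the sketch runs end to end.
img = torch.randn(1, 3, 64, 64, requires_grad=True)
txt = torch.randn(1, 77, 512)
fake_denoiser = lambda x, t, c: x * 0.1                  # hypothetical placeholder
sds_style_loss(img, txt, fake_denoiser).backward()
print(img.grad.abs().mean())
```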

A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future

  • paper_url: http://arxiv.org/abs/2307.09220
  • repo_url: None
  • paper_authors: Chaoyang Zhu, Long Chen
  • for: 对开放词汇检测(OVD)与开放词汇分割(OVS)的过去与近期进展进行全面回顾与评述
  • methods: 按任务类型与方法论构建分类体系,将各类方法归纳为:视觉-语义空间映射、新颖视觉特征合成、区域感知训练、伪标签、基于知识蒸馏以及基于迁移学习等
  • results: 覆盖目标检测、语义/实例/全景分割、三维场景与视频理解等任务,对每类方法的主要原理、关键挑战、发展路线、优势与不足进行深入讨论,并对各任务及各方法的关键组成部分进行基准评测
    Abstract As the most fundamental tasks of computer vision, object detection and segmentation have made tremendous progress in the deep learning era. Due to the expensive manual labeling, the annotated categories in existing datasets are often small-scale and pre-defined, i.e., state-of-the-art detectors and segmentors fail to generalize beyond the closed-vocabulary. To resolve this limitation, the last few years have witnessed increasing attention toward Open-Vocabulary Detection (OVD) and Segmentation (OVS). In this survey, we provide a comprehensive review on the past and recent development of OVD and OVS. To this end, we develop a taxonomy according to the type of task and methodology. We find that the permission and usage of weak supervision signals can well discriminate different methodologies, including: visual-semantic space mapping, novel visual feature synthesis, region-aware training, pseudo-labeling, knowledge distillation-based, and transfer learning-based. The proposed taxonomy is universal across different tasks, covering object detection, semantic/instance/panoptic segmentation, 3D scene and video understanding. In each category, its main principles, key challenges, development routes, strengths, and weaknesses are thoroughly discussed. In addition, we benchmark each task along with the vital components of each method. Finally, several promising directions are provided to stimulate future research.
    摘要 作为计算机视觉最基础的任务,目标检测与分割在深度学习时代取得了巨大进展。由于人工标注成本高昂,现有数据集中的标注类别往往规模较小且预先定义,也就是说,最先进的检测器与分割器难以泛化到封闭词表之外。为突破这一限制,近几年开放词汇检测(OVD)与开放词汇分割(OVS)受到越来越多的关注。本综述对OVD与OVS的过去与近期进展进行了全面回顾。为此,我们按任务类型与方法论构建了一个分类体系。我们发现,弱监督信号的许可与使用方式能够很好地区分不同的方法论,包括:视觉-语义空间映射、新颖视觉特征合成、区域感知训练、伪标签、基于知识蒸馏以及基于迁移学习等。所提出的分类体系适用于不同任务,涵盖目标检测、语义/实例/全景分割、三维场景与视频理解。在每一类别中,我们深入讨论其主要原理、关键挑战、发展路线、优势与不足。此外,我们还对每个任务以及各方法的关键组成部分进行了基准评测。最后,我们给出若干有前景的方向,以激发未来研究。

You’ve Got Two Teachers: Co-evolutionary Image and Report Distillation for Semi-supervised Anatomical Abnormality Detection in Chest X-ray

  • paper_url: http://arxiv.org/abs/2307.09184
  • repo_url: None
  • paper_authors: Jinghan Sun, Dong Wei, Zhe Xu, Donghuan Lu, Hong Liu, Liansheng Wang, Yefeng Zheng
  • for: 在胸部X光片中定位并描述心肺相关的影像学发现,以加速临床工作流程并减少观察疏漏
  • methods: 提出一种协同演化的图像与报告蒸馏(CEIRD)框架,以配对放射报告中文本分类出的异常来校准视觉检测结果(反之亦然),实现半监督的异常检测
  • results: 在公开的MIMIC-CXR基准上的实验结果表明,CEIRD优于多种最新的弱监督与半监督方法
    Abstract Chest X-ray (CXR) anatomical abnormality detection aims at localizing and characterising cardiopulmonary radiological findings in the radiographs, which can expedite clinical workflow and reduce observational oversights. Most existing methods attempted this task in either fully supervised settings which demanded costly mass per-abnormality annotations, or weakly supervised settings which still lagged badly behind fully supervised methods in performance. In this work, we propose a co-evolutionary image and report distillation (CEIRD) framework, which approaches semi-supervised abnormality detection in CXR by grounding the visual detection results with text-classified abnormalities from paired radiology reports, and vice versa. Concretely, based on the classical teacher-student pseudo label distillation (TSD) paradigm, we additionally introduce an auxiliary report classification model, whose prediction is used for report-guided pseudo detection label refinement (RPDLR) in the primary vision detection task. Inversely, we also use the prediction of the vision detection model for abnormality-guided pseudo classification label refinement (APCLR) in the auxiliary report classification task, and propose a co-evolution strategy where the vision and report models mutually promote each other with RPDLR and APCLR performed alternatively. To this end, we effectively incorporate the weak supervision by reports into the semi-supervised TSD pipeline. Besides the cross-modal pseudo label refinement, we further propose an intra-image-modal self-adaptive non-maximum suppression, where the pseudo detection labels generated by the teacher vision model are dynamically rectified by high-confidence predictions by the student. Experimental results on the public MIMIC-CXR benchmark demonstrate CEIRD's superior performance to several up-to-date weakly and semi-supervised methods.
    摘要 胸部X射影(CXR)解剖异常检测目的在于寻找和描述胸部X射影中的呼吸和心脏 radiological 发现,以减少观察偏见和提高诊断工作流程。大多数现有方法都对这个任务进行了完全监督学习(fully supervised learning),需要高成本的大量偏例资料(per-abnormality annotations),或弱监督学习(weakly supervised learning),其性能仍然落后于完全监督学习方法。在这个工作中,我们提出了一个共演化图像和报告蒸发(CEIRD)框架,它通过将视觉检测结果与文本标示的异常发现(report-guided pseudo detection label refinement,RPDLR)和视觉检测模型的预测(abnormality-guided pseudo classification label refinement,APCLR)进行互动,以实现半监督学习的异常检测。此外,我们还提出了一个共演化策略,让视觉和报告模型互相推广 each other,通过对应的RPDLR和APCLR进行交替进行。这样,我们实现了对报告的弱监督学习,并将其 integrate 到 semi-supervised TSD 管道中。除了跨模式 pseudo label refinement 外,我们还提出了一个内部图像模式自适应非最大 suppress,其中 pseudo detection labels 生成的教师视觉模型的预测被动地修正了由学生高信度预测的高信度预测。实验结果显示,CEIRD 在公共 MIMIC-CXR benchmark 上表现出色,较上一代弱监督和半监督方法更高。

Pixel-wise Graph Attention Networks for Person Re-identification

  • paper_url: http://arxiv.org/abs/2307.09183
  • repo_url: https://github.com/wenyu1009/pganet
  • paper_authors: Wenyu Zhang, Qing Ding, Jian Hu, Yi Ma, Mingzhe Lu
  • for: 探索图注意力网络(GAT)在图像特征提取中的应用,以提升行人重识别性能
  • methods: 提出一种新的图生成算法,通过矩阵变换将图像转换为图(比基于K近邻的算法快一个数量级),再利用GAT更新节点特征;二者组成像素级图注意力模块(PGA),可与CNN结合;基于PGA与ResNet设计了像素级图注意力网络(PGANet)用于行人重识别
  • results: 在Market1501、DukeMTMC-reID和Occluded-DukeMTMC数据集上,PGANet的mAP分别超过此前最先进方法0.8%、1.1%和11%,达到最先进性能
    Abstract Graph convolutional networks (GCN) is widely used to handle irregular data since it updates node features by using the structure information of graph. With the help of iterated GCN, high-order information can be obtained to further enhance the representation of nodes. However, how to apply GCN to structured data (such as pictures) has not been deeply studied. In this paper, we explore the application of graph attention networks (GAT) in image feature extraction. First of all, we propose a novel graph generation algorithm to convert images into graphs through matrix transformation. It is one magnitude faster than the algorithm based on K Nearest Neighbors (KNN). Then, GAT is used on the generated graph to update the node features. Thus, a more robust representation is obtained. These two steps are combined into a module called pixel-wise graph attention module (PGA). Since the graph obtained by our graph generation algorithm can still be transformed into a picture after processing, PGA can be well combined with CNN. Based on these two modules, we consulted the ResNet and design a pixel-wise graph attention network (PGANet). The PGANet is applied to the task of person re-identification in the datasets Market1501, DukeMTMC-reID and Occluded-DukeMTMC (outperforms state-of-the-art by 0.8\%, 1.1\% and 11\% respectively, in mAP scores). Experiment results show that it achieves the state-of-the-art performance. \href{https://github.com/wenyu1009/PGANet}{The code is available here}.
    摘要 图卷积网络(GCN)利用图的结构信息更新节点特征,因而被广泛用于处理不规则数据。借助迭代GCN,可以获得高阶信息以进一步增强节点表示。然而,如何将GCN应用于结构化数据(如图像)尚未得到深入研究。本文探索图注意力网络(GAT)在图像特征提取中的应用。首先,我们提出一种新的图生成算法,通过矩阵变换将图像转换为图,其速度比基于K近邻(KNN)的算法快一个数量级。然后,在生成的图上使用GAT更新节点特征,从而获得更加鲁棒的表示。这两个步骤组合成一个称为像素级图注意力模块(PGA)的模块。由于我们的图生成算法得到的图在处理后仍可转换回图像,PGA能很好地与CNN结合。基于这两个模块,我们参考ResNet设计了像素级图注意力网络(PGANet)。将PGANet应用于Market1501、DukeMTMC-reID和Occluded-DukeMTMC数据集上的行人重识别任务,其mAP分别超过此前最先进方法0.8%、1.1%和11%。实验结果表明其达到了最先进性能。代码可在 https://github.com/wenyu1009/PGANet 获取。

Jean-Luc Picard at Touché 2023: Comparing Image Generation, Stance Detection and Feature Matching for Image Retrieval for Arguments

  • paper_url: http://arxiv.org/abs/2307.09172
  • repo_url: None
  • paper_authors: Max Moebius, Maximilian Enderling, Sarah T. Bachinger
  • for: 参加共享任务"面向论点的图像检索",比较图像生成、立场检测、预选与特征匹配等不同的图像检索流水线
  • methods: 构建并提交了四种不同流水线布局的运行,涉及图像生成、立场检测、预选与特征匹配等组件
  • results: 与给定基线相比,我们的流水线表现与基线相当
    Abstract Participating in the shared task "Image Retrieval for arguments", we used different pipelines for image retrieval containing Image Generation, Stance Detection, Preselection and Feature Matching. We submitted four different runs with different pipeline layout and compare them to given baseline. Our pipelines perform similarly to the baseline.
    摘要 在参加共享任务"面向论点的图像检索"的过程中,我们使用了包含图像生成、立场检测、预选与特征匹配的多种图像检索流水线。我们提交了四种不同流水线布局的运行,并与给定基线进行比较。我们的流水线表现与基线相当。

ECSIC: Epipolar Cross Attention for Stereo Image Compression

  • paper_url: http://arxiv.org/abs/2307.10284
  • repo_url: None
  • paper_authors: Matthias Wödlinger, Jan Kotera, Manuel Keglevic, Jan Xu, Robert Sablatnig
  • for: 提出一种新的学习式立体图像压缩方法
  • methods: 使用新颖的立体交叉注意力(SCA)模块与两个立体上下文模块对左右图像进行联合压缩;SCA模块在两幅图像相互对应的极线上执行受限的交叉注意力并并行处理;立体上下文模块以第一幅图像为上下文,改进第二幅编码图像的熵估计
  • results: ECSIC在Cityscapes与InStereo2k两个流行的立体图像数据集上取得最先进性能,且编解码速度快,非常适合实时应用
    Abstract In this paper, we present ECSIC, a novel learned method for stereo image compression. Our proposed method compresses the left and right images in a joint manner by exploiting the mutual information between the images of the stereo image pair using a novel stereo cross attention (SCA) module and two stereo context modules. The SCA module performs cross-attention restricted to the corresponding epipolar lines of the two images and processes them in parallel. The stereo context modules improve the entropy estimation of the second encoded image by using the first image as a context. We conduct an extensive ablation study demonstrating the effectiveness of the proposed modules and a comprehensive quantitative and qualitative comparison with existing methods. ECSIC achieves state-of-the-art performance among stereo image compression models on the two popular stereo image datasets Cityscapes and InStereo2k while allowing for fast encoding and decoding, making it highly practical for real-time applications.
    摘要 本文提出ECSIC,一种新的学习式立体图像压缩方法。该方法利用立体图像对之间的互信息,通过新颖的立体交叉注意力(SCA)模块与两个立体上下文模块对左右图像进行联合压缩。SCA模块在两幅图像相互对应的极线上执行受限的交叉注意力,并对其并行处理。立体上下文模块以第一幅图像为上下文,改进第二幅编码图像的熵估计。我们通过大量消融实验论证了所提模块的有效性,并与现有方法进行了全面的定量与定性比较。ECSIC在Cityscapes与InStereo2k两个流行的立体图像数据集上取得了最先进的性能,同时支持快速编码与解码,非常适合实时应用。
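
For rectified stereo pairs, corresponding epipolar lines are image rows, so epipolar-restricted cross-attention can be sketched by letting each row of the left feature map attend only to the same row of the right one. Head counts and dimensions below are assumptions, not ECSIC's configuration.

```python
# Sketch: cross-attention restricted to matching rows (epipolar lines).
import torch
import torch.nn as nn

class EpipolarCrossAttention(nn.Module):
    def __init__(self, d=32, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, left, right):          # (B, H, W, d) feature maps
        B, H, W, d = left.shape
        q = left.reshape(B * H, W, d)         # fold rows into the batch dim so
        kv = right.reshape(B * H, W, d)       # attention never crosses rows
        out, _ = self.attn(q, kv, kv)
        return out.reshape(B, H, W, d)

l, r = torch.randn(2, 8, 16, 32), torch.randn(2, 8, 16, 32)
print(EpipolarCrossAttention()(l, r).shape)  # torch.Size([2, 8, 16, 32])
```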

Towards Trustworthy Dataset Distillation

  • paper_url: http://arxiv.org/abs/2307.09165
  • repo_url: None
  • paper_authors: Shijie Ma, Fei Zhu, Zhen Cheng, Xu-Yao Zhang
  • for: Trustworthy Dataset Distillation (TrustDD) aims to reduce training costs and enhance model trustworthiness in real-world applications by distilling both in-distribution (InD) samples and outliers
  • methods: dataset distillation (DD) condenses the training set into a tiny synthetic dataset while jointly considering InD classification and out-of-distribution (OOD) detection; InD samples are corrupted to generate pseudo-outliers, introducing Pseudo-Outlier Exposure (POE)
  • results: comprehensive experiments across various settings demonstrate the effectiveness of TrustDD, with the proposed POE surpassing the state-of-the-art Outlier Exposure (OE) method
    Abstract Efficiency and trustworthiness are two eternal pursuits when applying deep learning in real-world applications. With regard to efficiency, dataset distillation (DD) endeavors to reduce training costs by distilling the large dataset into a tiny synthetic dataset. However, existing methods merely concentrate on in-distribution (InD) classification in a closed-world setting, disregarding out-of-distribution (OOD) samples. On the other hand, OOD detection aims to enhance models' trustworthiness, which is always inefficiently achieved in full-data settings. For the first time, we simultaneously consider both issues and propose a novel paradigm called Trustworthy Dataset Distillation (TrustDD). By distilling both InD samples and outliers, the condensed datasets are capable to train models competent in both InD classification and OOD detection. To alleviate the requirement of real outlier data and make OOD detection more practical, we further propose to corrupt InD samples to generate pseudo-outliers and introduce Pseudo-Outlier Exposure (POE). Comprehensive experiments on various settings demonstrate the effectiveness of TrustDD, and the proposed POE surpasses state-of-the-art method Outlier Exposure (OE). Compared with the preceding DD, TrustDD is more trustworthy and applicable to real open-world scenarios. Our code will be publicly available.
    摘要 效率与可信性是深度学习落地实际应用时的两大永恒追求。在效率方面,数据集蒸馏(DD)致力于将大型数据集蒸馏为一个极小的合成数据集,以降低训练成本。然而,现有方法仅关注封闭世界设定下的分布内(InD)分类,忽略了分布外(OOD)样本。另一方面,OOD检测旨在增强模型的可信性,但通常只能在全量数据设定下低效地实现。我们首次同时考虑这两个问题,提出一种名为可信数据集蒸馏(TrustDD)的新范式。通过同时蒸馏InD样本与离群样本,所得的浓缩数据集能够训练出在InD分类与OOD检测上均胜任的模型。为减少对真实离群数据的依赖并使OOD检测更加实用,我们进一步提出对InD样本施加损坏以生成伪离群样本,并引入伪离群暴露(POE)。在多种设定下的全面实验证明了TrustDD的有效性,且所提出的POE优于最先进的离群暴露(OE)方法。与此前的数据集蒸馏相比,TrustDD更加可信,也更适用于真实的开放世界场景。我们的代码将公开发布。
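
A sketch of the Pseudo-Outlier Exposure idea: corrupt in-distribution images (here by shuffling patches) to create surrogate outliers, then add an outlier-exposure term that pushes their predictions toward the uniform distribution. The jigsaw corruption and the loss weighting are illustrative choices, not necessarily the paper's.

```python
# Sketch: pseudo-outlier generation plus an outlier-exposure loss term.
import torch
import torch.nn.functional as F

def make_pseudo_outliers(x, grid=4):
    """Shuffle a grid of patches within each image: (B, C, H, W) -> same."""
    B, C, H, W = x.shape
    p = x.unfold(2, H // grid, H // grid).unfold(3, W // grid, W // grid)
    p = p.reshape(B, C, grid * grid, H // grid, W // grid)
    p = p[:, :, torch.randperm(grid * grid)]            # shuffle patch order
    p = p.reshape(B, C, grid, grid, H // grid, W // grid)
    return p.permute(0, 1, 2, 4, 3, 5).reshape(B, C, H, W)

def poe_loss(logits_in, labels_in, logits_out, lam=0.5):
    ce = F.cross_entropy(logits_in, labels_in)
    # Outlier exposure: pseudo-outliers should look maximally uncertain.
    uniform_kl = -F.log_softmax(logits_out, dim=-1).mean()
    return ce + lam * uniform_kl

x = torch.randn(8, 3, 32, 32)
print(make_pseudo_outliers(x).shape)  # torch.Size([8, 3, 32, 32])
```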

CG-fusion CAM: Online segmentation of laser-induced damage on large-aperture optics

  • paper_url: http://arxiv.org/abs/2307.09161
  • repo_url: None
  • paper_authors: Yueyue Han, Yingyan Huang, Hangcheng Dong, Fengdong Chen, Fa Zeng, Zhitao Peng, Qihua Zhu, Guodong Liu
  • for: 解决高功率激光装置中大口径光学元件上激光诱导损伤的在线分割问题,该问题面临损伤形貌复杂、照明不均与杂散光干扰等挑战
  • methods: 提出一种弱监督语义分割算法,即连续梯度CAM及其非线性多尺度融合(CG-fusion CAM),仅需图像级标签即可生成像素级精度的类激活图
  • results: 实验结果表明,该算法能达到与全监督算法相当的分割性能
    Abstract Online segmentation of laser-induced damage on large-aperture optics in high-power laser facilities is challenged by complicated damage morphology, uneven illumination and stray light interference. Fully supervised semantic segmentation algorithms have achieved state-of-the-art performance, but rely on plenty of pixel-level labels, which are time-consuming and labor-consuming to produce. LayerCAM, an advanced weakly supervised semantic segmentation algorithm, can generate pixel-accurate results using only image-level labels, but its scattered and partially under-activated class activation regions degrade segmentation performance. In this paper, we propose a weakly supervised semantic segmentation method with Continuous Gradient CAM and its nonlinear multi-scale fusion (CG-fusion CAM). The method redesigns the way of back-propagating gradients and non-linearly activates the multi-scale fused heatmaps to generate more fine-grained class activation maps with appropriate activation degree for different sizes of damage sites. Experiments on our dataset show that the proposed method can achieve segmentation performance comparable to that of fully supervised algorithms.
    摘要 在高功率激光装置中,大口径光学元件上激光诱导损伤的在线分割面临损伤形貌复杂、照明不均与杂散光干扰等挑战。全监督语义分割算法已取得最先进的性能,但依赖大量像素级标签,其制作既耗时又费力。LayerCAM作为一种先进的弱监督语义分割算法,仅凭图像级标签即可生成像素级精度的结果,但其类激活区域分散且部分激活不足,会降低分割性能。本文提出一种弱监督语义分割方法:连续梯度CAM及其非线性多尺度融合(CG-fusion CAM)。该方法重新设计了梯度反向传播的方式,并对多尺度融合的热力图进行非线性激活,从而为不同尺寸的损伤部位生成激活程度适当、更加精细的类激活图。在我们的数据集上的实验表明,所提方法能达到与全监督算法相当的分割性能。
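
The multi-scale fusion step can be sketched as follows: class activation maps computed at several network depths are upsampled to a common size, nonlinearly rescaled, and combined. The tanh-based rescaling stands in for the paper's nonlinear activation and is an assumption.

```python
# Sketch: LayerCAM-style maps from several depths, fused after upsampling.
import torch
import torch.nn.functional as F

def layer_cam(features, gradients):
    """LayerCAM-style map: positive gradients weight activations per pixel."""
    weights = F.relu(gradients)                     # (B, C, h, w)
    return F.relu((weights * features).sum(dim=1))  # (B, h, w)

def fuse_cams(cams, out_size=(64, 64), gamma=2.0):
    fused = 0.0
    for cam in cams:                                # maps from different depths
        cam = F.interpolate(cam[:, None], out_size, mode="bilinear",
                            align_corners=False)[:, 0]
        cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
        fused = fused + torch.tanh(gamma * cam)     # nonlinear multi-scale fusion
    return fused / len(cams)

cams = [layer_cam(torch.rand(1, 8, s, s), torch.randn(1, 8, s, s))
        for s in (8, 16, 32)]
print(fuse_cams(cams).shape)  # torch.Size([1, 64, 64])
```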

Constraining Depth Map Geometry for Multi-View Stereo: A Dual-Depth Approach with Saddle-shaped Depth Cells

  • paper_url: http://arxiv.org/abs/2307.09160
  • repo_url: https://github.com/dive128/dmvsnet
  • paper_authors: Xinyi Ye, Weiyue Zhao, Tianqi Liu, Zihao Huang, Zhiguo Cao, Xin Li
  • for: 通过约束合适的深度几何,提升多视图立体(MVS)方法深度估计的精度与完整性
  • methods: 提出一种基于学习的MVS方法DMVSNet,可生成围绕真实表面上下振荡的深度平面;具体而言,为每个像素预测两个深度值(双深度),并提出新的损失函数与棋盘状选择策略来约束预测的深度几何
  • results: DMVSNet在DTU基准上名列前茅,并在具有挑战性的Tanks and Temples场景中取得最佳性能,展示出强大的性能与泛化能力;该方法也为MVS中考虑深度几何指出了新的研究方向
    Abstract Learning-based multi-view stereo (MVS) methods deal with predicting accurate depth maps to achieve an accurate and complete 3D representation. Despite the excellent performance, existing methods ignore the fact that a suitable depth geometry is also critical in MVS. In this paper, we demonstrate that different depth geometries have significant performance gaps, even using the same depth prediction error. Therefore, we introduce an ideal depth geometry composed of Saddle-Shaped Cells, whose predicted depth map oscillates upward and downward around the ground-truth surface, rather than maintaining a continuous and smooth depth plane. To achieve it, we develop a coarse-to-fine framework called Dual-MVSNet (DMVSNet), which can produce an oscillating depth plane. Technically, we predict two depth values for each pixel (Dual-Depth), and propose a novel loss function and a checkerboard-shaped selecting strategy to constrain the predicted depth geometry. Compared to existing methods,DMVSNet achieves a high rank on the DTU benchmark and obtains the top performance on challenging scenes of Tanks and Temples, demonstrating its strong performance and generalization ability. Our method also points to a new research direction for considering depth geometry in MVS.
    摘要 基于学习的多视图立体(MVS)方法致力于预测精确的深度图,以获得准确而完整的三维表示。尽管性能出色,现有方法忽视了一个事实:合适的深度几何对MVS同样至关重要。本文证明,即使深度预测误差相同,不同的深度几何之间也存在显著的性能差距。为此,我们引入一种由鞍形单元构成的理想深度几何,其预测的深度图围绕真实表面上下振荡,而非保持连续平滑的深度平面。为实现这一点,我们开发了一个由粗到细的框架Dual-MVSNet(DMVSNet),能够生成振荡的深度平面。在技术上,我们为每个像素预测两个深度值(双深度),并提出一种新的损失函数与棋盘状选择策略来约束预测的深度几何。与现有方法相比,DMVSNet在DTU基准上名列前茅,并在具有挑战性的Tanks and Temples场景中取得最佳性能,展示了其强大的性能与泛化能力。我们的方法也为MVS中考虑深度几何指出了一个新的研究方向。
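
The dual-depth idea can be sketched directly: predict two depth hypotheses per pixel and compose an oscillating depth map by picking the upper/lower hypothesis in a checkerboard pattern. The composition rule below is a toy reading of the checkerboard-shaped selection strategy, not the authors' implementation.

```python
# Sketch: oscillating depth map from two per-pixel hypotheses.
import torch

def checkerboard_depth(d0, d1):
    """d0, d1: (B, H, W) depth hypotheses; returns an oscillating depth map."""
    B, H, W = d0.shape
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    mask = ((yy + xx) % 2 == 0).to(d0.dtype)        # checkerboard pattern
    lo, hi = torch.minimum(d0, d1), torch.maximum(d0, d1)
    # Alternate above/below the surface cell by cell ("saddle-shaped" cells).
    return mask * hi + (1 - mask) * lo

d0, d1 = torch.rand(1, 4, 4), torch.rand(1, 4, 4)
print(checkerboard_depth(d0, d1))
```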

Class-relation Knowledge Distillation for Novel Class Discovery

  • paper_url: http://arxiv.org/abs/2307.09158
  • repo_url: https://github.com/kleinzcy/cr-kd-ncd
  • paper_authors: Gu Peiyan, Zhang Chuyu, Xu Ruiji, He Xuming
  • for: 解决新类发现问题,即基于已知类的标注数据,在无监督条件下学习新类
  • methods: 基于在已知类上训练的模型所预测的类别分布,为新类引入类关系表示,并提出利用该表示来正则化新类学习的知识蒸馏框架;同时设计一个可学习的加权函数,依据新类与已知类的语义相似度自适应地促进每个数据点的知识迁移
  • results: 在多个基准上的大量实验表明,所提方法以显著优势超越此前最先进的方法
    Abstract We tackle the problem of novel class discovery, which aims to learn novel classes without supervision based on labeled data from known classes. A key challenge lies in transferring the knowledge in the known-class data to the learning of novel classes. Previous methods mainly focus on building a shared representation space for knowledge transfer and often ignore modeling class relations. To address this, we introduce a class relation representation for the novel classes based on the predicted class distribution of a model trained on known classes. Empirically, we find that such class relation becomes less informative during typical discovery training. To prevent such information loss, we propose a novel knowledge distillation framework, which utilizes our class-relation representation to regularize the learning of novel classes. In addition, to enable a flexible knowledge distillation scheme for each data point in novel classes, we develop a learnable weighting function for the regularization, which adaptively promotes knowledge transfer based on the semantic similarity between the novel and known classes. To validate the effectiveness and generalization of our method, we conduct extensive experiments on multiple benchmarks, including CIFAR100, Stanford Cars, CUB, and FGVC-Aircraft datasets. Our results demonstrate that the proposed method outperforms the previous state-of-the-art methods by a significant margin on almost all benchmarks. Code is available at \href{https://github.com/kleinzcy/Cr-KD-NCD}{here}.
    摘要 我们研究新类发现问题,其目标是基于已知类的标注数据,在无监督条件下学习新类。一个关键挑战在于如何将已知类数据中的知识迁移到新类的学习中。以往方法主要侧重于构建用于知识迁移的共享表示空间,而往往忽视了对类间关系的建模。为此,我们基于在已知类上训练的模型所预测的类别分布,为新类引入一种类关系表示。我们通过实验发现,在典型的发现训练过程中,这种类关系会变得越来越缺乏信息量。为防止这种信息损失,我们提出一种新的知识蒸馏框架,利用类关系表示来正则化新类的学习。此外,为对新类中的每个数据点实现灵活的知识蒸馏,我们设计了一个可学习的加权函数,依据新类与已知类之间的语义相似度自适应地促进知识迁移。为验证方法的有效性与泛化性,我们在CIFAR100、Stanford Cars、CUB和FGVC-Aircraft等多个基准上开展了大量实验。结果表明,所提方法在几乎所有基准上都以显著优势超越此前最先进的方法。代码可在 https://github.com/kleinzcy/Cr-KD-NCD 获取。
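
A sketch of class-relation distillation: for novel-class samples, the current model's distribution over *known* classes is pulled toward a frozen known-class teacher's, with a per-sample weight growing with novel/known semantic similarity. The sigmoid weighting below is a hand-written stand-in for the paper's learnable function.

```python
# Sketch: similarity-weighted KD on known-class logits for novel samples.
import torch
import torch.nn.functional as F

def class_relation_kd(student_known_logits, teacher_known_logits, sim, T=2.0):
    """sim: (B,) similarity of each novel sample to the known classes."""
    p_t = F.softmax(teacher_known_logits / T, dim=-1)    # class-relation target
    log_p_s = F.log_softmax(student_known_logits / T, dim=-1)
    kd = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=-1)   # per-sample KL
    w = torch.sigmoid(sim)                               # assumed weighting
    return (w * kd).mean() * T * T

s = torch.randn(16, 80, requires_grad=True)   # student logits on known classes
t = torch.randn(16, 80)                       # frozen teacher logits
sim = torch.randn(16)
class_relation_kd(s, t, sim).backward()
print("ok")
```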

MLF-DET: Multi-Level Fusion for Cross-Modal 3D Object Detection

  • paper_url: http://arxiv.org/abs/2307.09155
  • repo_url: None
  • paper_authors: Zewei Lin, Yanqing Shen, Sanping Zhou, Shitao Chen, Nanning Zheng
  • for: 提出一种高性能的跨模态三维目标检测方法,以充分利用图像中的信息
  • methods: 提出多级融合网络MLF-DET,同时结合特征级融合与决策级融合:特征级采用多尺度体素-图像融合(MVI)模块,将多尺度体素特征与图像特征稠密对齐;决策级采用轻量级的特征引导置信度修正(FCR)模块,进一步利用图像语义修正检测候选的置信度;此外提出有效的数据增强策略,即遮挡感知真值采样(OGS),在训练场景中保留更多采样目标以缓解过拟合
  • results: 在KITTI数据集上的大量实验表明了方法的有效性;在竞争极为激烈的KITTI汽车三维目标检测基准上,该方法取得82.89%的中等难度AP,无需附加技巧即达到最先进性能
    Abstract In this paper, we propose a novel and effective Multi-Level Fusion network, named as MLF-DET, for high-performance cross-modal 3D object DETection, which integrates both the feature-level fusion and decision-level fusion to fully utilize the information in the image. For the feature-level fusion, we present the Multi-scale Voxel Image fusion (MVI) module, which densely aligns multi-scale voxel features with image features. For the decision-level fusion, we propose the lightweight Feature-cued Confidence Rectification (FCR) module which further exploits image semantics to rectify the confidence of detection candidates. Besides, we design an effective data augmentation strategy termed Occlusion-aware GT Sampling (OGS) to reserve more sampled objects in the training scenes, so as to reduce overfitting. Extensive experiments on the KITTI dataset demonstrate the effectiveness of our method. Notably, on the extremely competitive KITTI car 3D object detection benchmark, our method reaches 82.89% moderate AP and achieves state-of-the-art performance without bells and whistles.
    摘要 本文提出一种新颖而有效的多级融合网络MLF-DET,用于高性能的跨模态三维目标检测,它同时结合特征级融合与决策级融合,以充分利用图像中的信息。在特征级融合方面,我们提出多尺度体素-图像融合(MVI)模块,将多尺度体素特征与图像特征稠密对齐。在决策级融合方面,我们提出轻量级的特征引导置信度修正(FCR)模块,进一步利用图像语义来修正检测候选的置信度。此外,我们设计了一种有效的数据增强策略,即遮挡感知真值采样(OGS),在训练场景中保留更多的采样目标,以缓解过拟合。在KITTI数据集上的大量实验表明了方法的有效性。值得一提的是,在竞争极为激烈的KITTI汽车三维目标检测基准上,我们的方法取得了82.89%的中等难度AP,无需附加技巧即实现最先进性能。

OPHAvatars: One-shot Photo-realistic Head Avatars

  • paper_url: http://arxiv.org/abs/2307.09153
  • repo_url: https://github.com/lsx0101/ophavatars
  • paper_authors: Shaoxu Li
  • for: 仅凭一张肖像图像合成照片级真实感的数字头像
  • methods: 给定一张肖像,先利用驱动关键点特征合成粗糙的说话头视频,再借助可形变神经辐射场合成粗糙的说话头头像;随后用盲人脸修复模型更新渲染出的低质量图像,并用更新后的图像重新训练头像;经过若干次迭代即可合成照片级真实感、可驱动的三维神经头像
  • results: 在多个对象上的定量与定性研究中均优于最先进的方法
    Abstract We propose a method for synthesizing photo-realistic digital avatars from only one portrait as the reference. Given a portrait, our method synthesizes a coarse talking head video using driving keypoints features. And with the coarse video, our method synthesizes a coarse talking head avatar with a deforming neural radiance field. With rendered images of the coarse avatar, our method updates the low-quality images with a blind face restoration model. With updated images, we retrain the avatar for higher quality. After several iterations, our method can synthesize a photo-realistic animatable 3D neural head avatar. The motivation of our method is deformable neural radiance field can eliminate the unnatural distortion caused by the image2video method. Our method outperforms state-of-the-art methods in quantitative and qualitative studies on various subjects.
    摘要 我们提出一种仅以一张肖像为参考合成照片级真实感数字头像的方法。给定一张肖像,我们的方法先利用驱动关键点特征合成一个粗糙的说话头视频;再基于该粗糙视频,借助可形变神经辐射场合成粗糙的说话头头像。利用粗糙头像的渲染图像,我们的方法通过盲人脸修复模型更新低质量图像,并用更新后的图像重新训练头像以获得更高质量。经过若干次迭代,我们的方法便能合成照片级真实感、可驱动的三维神经头像。该方法的动机在于:可形变神经辐射场能够消除image2video方法所带来的不自然畸变。在多个对象上的定量与定性研究中,我们的方法均优于最先进的方法。

Semi-supervised Cycle-GAN for face photo-sketch translation in the wild

  • paper_url: http://arxiv.org/abs/2307.10281
  • repo_url: https://github.com/chaofengc/Face-Sketch-SCG
  • paper_authors: Chaofeng Chen, Wei Liu, Xiao Tan, Kwan-Yee K. Wong
  • for: 利用深度神经网络与GAN方法,提升真实场景下人脸照片-素描翻译的效果
  • methods: 提出一种带噪声注入策略的半监督方法,称为Semi-Cycle-GAN(SCG):先由一个小型照片-素描对参考集为每张输入照片构造伪素描特征表示,并用由此得到的伪配对来监督照片到素描生成器 $G_{p2s}$;$G_{p2s}$ 的输出又可用于以自监督方式训练素描到照片生成器 $G_{s2p}$,从而只需一个小型照片-素描对参考集和一个大型人脸照片数据集(无真值素描)即可训练 $G_{p2s}$ 与 $G_{s2p}$
  • results: 实验表明,SCG在公开基准上取得有竞争力的性能,并在真实场景照片上获得更优结果,素描到照片的结果更合理且过拟合更少
    Abstract The performance of face photo-sketch translation has improved a lot thanks to deep neural networks. GAN based methods trained on paired images can produce high-quality results under laboratory settings. Such paired datasets are, however, often very small and lack diversity. Meanwhile, Cycle-GANs trained with unpaired photo-sketch datasets suffer from the \emph{steganography} phenomenon, which makes them not effective to face photos in the wild. In this paper, we introduce a semi-supervised approach with a noise-injection strategy, named Semi-Cycle-GAN (SCG), to tackle these problems. For the first problem, we propose a {\em pseudo sketch feature} representation for each input photo composed from a small reference set of photo-sketch pairs, and use the resulting {\em pseudo pairs} to supervise a photo-to-sketch generator $G_{p2s}$. The outputs of $G_{p2s}$ can in turn help to train a sketch-to-photo generator $G_{s2p}$ in a self-supervised manner. This allows us to train $G_{p2s}$ and $G_{s2p}$ using a small reference set of photo-sketch pairs together with a large face photo dataset (without ground-truth sketches). For the second problem, we show that the simple noise-injection strategy works well to alleviate the \emph{steganography} effect in SCG and helps to produce more reasonable sketch-to-photo results with less overfitting than fully supervised approaches. Experiments show that SCG achieves competitive performance on public benchmarks and superior results on photos in the wild.
    摘要 得益于深度神经网络,人脸照片-素描翻译的性能已大幅提升。基于配对图像训练的GAN方法能在实验室条件下生成高质量结果。然而,此类配对数据集往往规模很小且缺乏多样性。与此同时,使用非配对照片-素描数据集训练的Cycle-GAN受"隐写"现象的影响,难以应对真实场景中的人脸照片。本文提出一种带噪声注入策略的半监督方法,称为Semi-Cycle-GAN(SCG),以解决上述问题。针对第一个问题,我们为每张输入照片构造一种"伪素描特征"表示,它由一个小型照片-素描对参考集合成,并利用由此得到的"伪配对"来监督照片到素描生成器 $G_{p2s}$。$G_{p2s}$ 的输出进而可用于以自监督方式训练素描到照片生成器 $G_{s2p}$。这使得我们只需一个小型照片-素描对参考集和一个大型人脸照片数据集(无真值素描)即可训练 $G_{p2s}$ 与 $G_{s2p}$。针对第二个问题,我们证明简单的噪声注入策略能有效缓解SCG中的"隐写"效应,并相比全监督方法生成更合理、过拟合更少的素描到照片结果。实验表明,SCG在公开基准上取得有竞争力的性能,并在真实场景照片上获得更优的结果。

PRO-Face S: Privacy-preserving Reversible Obfuscation of Face Images via Secure Flow

  • paper_url: http://arxiv.org/abs/2307.09146
  • repo_url: None
  • paper_authors: Lin Yuan, Kai Liang, Xiao Pu, Yan Zhang, Jiaxu Leng, Tao Wu, Nannan Wang, Xinbo Gao
  • for: 人脸图像隐私保护
  • methods: 利用可逆神经网络(INN)处理输入图像及其预混淆形式,生成视觉上接近预混淆图像的隐私保护图像;并在网络中注入密钥,使原图只有在提供正确密钥时才能通过同一模型从保护图像中恢复
  • results: 在三个公开图像数据集上的大量实验表明,所提框架优于多种最先进方法
    Abstract This paper proposes a novel paradigm for facial privacy protection that unifies multiple characteristics including anonymity, diversity, reversibility and security within a single lightweight framework. We name it PRO-Face S, short for Privacy-preserving Reversible Obfuscation of Face images via Secure flow-based model. In the framework, an Invertible Neural Network (INN) is utilized to process the input image along with its pre-obfuscated form, and generate the privacy protected image that visually approximates to the pre-obfuscated one, thus ensuring privacy. The pre-obfuscation applied can be in diversified form with different strengths and styles specified by users. Along protection, a secret key is injected into the network such that the original image can only be recovered from the protection image via the same model given the correct key provided. Two modes of image recovery are devised to deal with malicious recovery attempts in different scenarios. Finally, extensive experiments conducted on three public image datasets demonstrate the superiority of the proposed framework over multiple state-of-the-art approaches.

MVA2023 Small Object Detection Challenge for Spotting Birds: Dataset, Methods, and Results

  • paper_url: http://arxiv.org/abs/2307.09143
  • repo_url: https://github.com/iim-ttij/mva2023smallobjectdetection4spottingbirds
  • paper_authors: Yuki Kondo, Norimichi Ukita, Takayuki Yamaguchi, Hao-Yu Hou, Mu-Yi Shen, Chia-Chi Hsu, En-Ming Huang, Yu-Chen Huang, Yu-Cheng Xia, Chien-Yao Wang, Chun-Yi Lee, Da Huo, Marc A. Kastner, Tingwei Liu, Yasutomo Kawanishi, Takatsugu Hirayama, Takahiro Komamizu, Ichiro Ide, Yosuke Shinya, Xinyao Liu, Guang Liang, Syusuke Yasui
  • for: Provides a new small-object detection dataset for spotting birds.
  • methods: Introduces the SOD4SB dataset and an accompanying small object detection challenge, and briefly presents the award-winning methods.
  • results: Results of the challenge show that the winning methods can accurately detect birds at a distance.
    Abstract Small Object Detection (SOD) is an important machine vision topic because (i) a variety of real-world applications require object detection for distant objects and (ii) SOD is a challenging task due to the noisy, blurred, and less-informative image appearances of small objects. This paper proposes a new SOD dataset consisting of 39,070 images including 137,121 bird instances, which is called the Small Object Detection for Spotting Birds (SOD4SB) dataset. The details of the challenge based on the SOD4SB dataset are introduced in this paper. In total, 223 participants joined this challenge. This paper briefly introduces the award-winning methods. The dataset, the baseline code, and the website for evaluation on the public testset are publicly available.

Light-Weight Vision Transformer with Parallel Local and Global Self-Attention

  • paper_url: http://arxiv.org/abs/2307.09120
  • repo_url: None
  • paper_authors: Nikolas Ebert, Laurenz Reichardt, Didier Stricker, Oliver Wasenmüller
  • for: Redesigns a modern Transformer architecture for autonomous-driving tasks on resource-limited hardware under real-time constraints.
  • methods: Identifies the computationally expensive blocks in the PLG-ViT architecture and proposes several redesigns that reduce its parameter count and floating-point operations.
  • results: Shrinks PLG-ViT by a factor of 5; with only 5 million parameters, the network reaches 79.5% top-1 accuracy on the ImageNet-1K classification benchmark, performs well on COCO instance segmentation, and a series of experiments demonstrates its potential for autonomous driving and transportation tasks.
    Abstract While transformer architectures have dominated computer vision in recent years, these models cannot easily be deployed on hardware with limited resources for autonomous driving tasks that require real-time-performance. Their computational complexity and memory requirements limit their use, especially for applications with high-resolution inputs. In our work, we redesign the powerful state-of-the-art Vision Transformer PLG-ViT to a much more compact and efficient architecture that is suitable for such tasks. We identify computationally expensive blocks in the original PLG-ViT architecture and propose several redesigns aimed at reducing the number of parameters and floating-point operations. As a result of our redesign, we are able to reduce PLG-ViT in size by a factor of 5, with a moderate drop in performance. We propose two variants, optimized for the best trade-offs between parameter count and runtime, and between parameter count and accuracy. With only 5 million parameters, we achieve 79.5$\%$ top-1 accuracy on the ImageNet-1K classification benchmark. Our networks demonstrate great performance on general vision benchmarks like COCO instance segmentation. In addition, we conduct a series of experiments, demonstrating the potential of our approach in solving various tasks specifically tailored to the challenges of autonomous driving and transportation.

NU-MCC: Multiview Compressive Coding with Neighborhood Decoder and Repulsive UDF

  • paper_url: http://arxiv.org/abs/2307.09112
  • repo_url: None
  • paper_authors: Stefan Lionar, Xiangyu Xu, Min Lin, Gim Hee Lee
  • for: 3D reconstruction from single-view RGB-D inputs.
  • methods: Proposes a new approach called NU-MCC with two key innovations: a Neighborhood decoder and a Repulsive Unsigned Distance Function (Repulsive UDF).
  • results: Experiments show that NU-MCC learns a strong 3D representation and advances the state of the art in single-view 3D reconstruction, outperforming MCC by 9.7% in F1-score on the CO3D-v2 dataset while running more than 5x faster.
    Abstract Remarkable progress has been made in 3D reconstruction from single-view RGB-D inputs. MCC is the current state-of-the-art method in this field, which achieves unprecedented success by combining vision Transformers with large-scale training. However, we identified two key limitations of MCC: 1) The Transformer decoder is inefficient in handling large number of query points; 2) The 3D representation struggles to recover high-fidelity details. In this paper, we propose a new approach called NU-MCC that addresses these limitations. NU-MCC includes two key innovations: a Neighborhood decoder and a Repulsive Unsigned Distance Function (Repulsive UDF). First, our Neighborhood decoder introduces center points as an efficient proxy of input visual features, allowing each query point to only attend to a small neighborhood. This design not only results in much faster inference speed but also enables the exploitation of finer-scale visual features for improved recovery of 3D textures. Second, our Repulsive UDF is a novel alternative to the occupancy field used in MCC, significantly improving the quality of 3D object reconstruction. Compared to standard UDFs that suffer from holes in results, our proposed Repulsive UDF can achieve more complete surface reconstruction. Experimental results demonstrate that NU-MCC is able to learn a strong 3D representation, significantly advancing the state of the art in single-view 3D reconstruction. Particularly, it outperforms MCC by 9.7% in terms of the F1-score on the CO3D-v2 dataset with more than 5x faster running speed.
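The core efficiency idea of the Neighborhood decoder can be illustrated in a few lines: each 3D query point attends only to its k nearest "center point" tokens rather than to all encoder tokens. A rough sketch of that mechanism (the real model adds learned projections and other components; all names here are illustrative):

```python
import torch

def neighborhood_attention(queries, centers, center_feats, k=8):
    """Each query point attends only to its k nearest center points.
    queries:      (Q, 3) query-point coordinates
    centers:      (C, 3) center-point coordinates (proxies for visual features)
    center_feats: (C, D) features attached to the centers
    returns:      (Q, D) per-query aggregated features
    """
    dist = torch.cdist(queries, centers)              # (Q, C) pairwise distances
    knn_dist, knn_idx = dist.topk(k, largest=False)   # (Q, k) nearest centers
    neigh = center_feats[knn_idx]                     # (Q, k, D)
    attn = torch.softmax(-knn_dist, dim=-1)           # closer centers weigh more
    return (attn.unsqueeze(-1) * neigh).sum(dim=1)
```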

Mining of Single-Class by Active Learning for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.09109
  • repo_url: None
  • paper_authors: Hugues Lambert, Emma Slade
  • for: Proposes a new Active Learning (AL) policy that identifies the most informative samples and improves model performance on specific classes.
  • methods: Constructs the AL policy through deep reinforcement learning and exploits quantity-accuracy correlations to build datasets on which high-performance models can be trained.
  • results: MiSiCAL outperforms a random policy on 150 of the 171 COCO10k classes, whereas the strongest baseline outperforms random on only 101 classes.
    Abstract Several Active Learning (AL) policies require retraining a target model several times in order to identify the most informative samples and rarely offer the option to focus on the acquisition of samples from underrepresented classes. Here the Mining of Single-Class by Active Learning (MiSiCAL) paradigm is introduced where an AL policy is constructed through deep reinforcement learning and exploits quantity-accuracy correlations to build datasets on which high-performance models can be trained with regards to specific classes. MiSiCAL is especially helpful in the case of very large batch sizes since it does not require repeated model training sessions as is common in other AL methods. This is thanks to its ability to exploit fixed representations of the candidate data points. We find that MiSiCAL is able to outperform a random policy on 150 out of 171 COCO10k classes, while the strongest baseline only outperforms random on 101 classes.

Division Gets Better: Learning Brightness-Aware and Detail-Sensitive Representations for Low-Light Image Enhancement

  • paper_url: http://arxiv.org/abs/2307.09104
  • repo_url: None
  • paper_authors: Huake Wang, Xiaoyang Yan, Xingsong Hou, Junhui Li, Yujie Dun, Kaibing Zhang
  • for: Restoring the contrast, color, and texture of low-light images.
  • methods: Proposes LCDBNet, a two-branch network in which one branch adjusts luminance and the other restores chrominance (color and texture).
  • results: Extensive experiments on seven benchmark datasets show that LCDBNet outperforms other state-of-the-art methods, achieving higher scores on multiple reference and non-reference quality evaluators.
    Abstract Low-light image enhancement strives to improve the contrast, adjust the visibility, and restore the distortion in color and texture. Existing methods usually pay more attention to improving the visibility and contrast via increasing the lightness of low-light images, while disregarding the significance of color and texture restoration for high-quality images. Against above issue, we propose a novel luminance and chrominance dual branch network, termed LCDBNet, for low-light image enhancement, which divides low-light image enhancement into two sub-tasks, e.g., luminance adjustment and chrominance restoration. Specifically, LCDBNet is composed of two branches, namely luminance adjustment network (LAN) and chrominance restoration network (CRN). LAN takes responsibility for learning brightness-aware features leveraging long-range dependency and local attention correlation. While CRN concentrates on learning detail-sensitive features via multi-level wavelet decomposition. Finally, a fusion network is designed to blend their learned features to produce visually impressive images. Extensive experiments conducted on seven benchmark datasets validate the effectiveness of our proposed LCDBNet, and the results manifest that LCDBNet achieves superior performance in terms of multiple reference/non-reference quality evaluators compared to other state-of-the-art competitors. Our code and pretrained model will be available.
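The luminance/chrominance division can be made concrete with a standard YCbCr decomposition feeding two branches whose outputs are fused. A minimal PyTorch sketch; the stand-in branches below are hypothetical placeholders for the paper's much deeper LAN and CRN:

```python
import torch
import torch.nn as nn

def rgb_to_ycbcr(img):
    """img: (B, 3, H, W) in [0, 1]. Standard BT.601 conversion."""
    r, g, b = img[:, 0:1], img[:, 1:2], img[:, 2:3]
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.5 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 0.5 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, torch.cat([cb, cr], dim=1)

# Hypothetical stand-ins for LAN / CRN / fusion.
lan    = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 3, padding=1))
crn    = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 2, 3, padding=1))
fusion = nn.Conv2d(3, 3, 3, padding=1)

low_light = torch.rand(1, 3, 64, 64)
y, cbcr = rgb_to_ycbcr(low_light)
enhanced = fusion(torch.cat([lan(y), crn(cbcr)], dim=1))  # (1, 3, 64, 64)
```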

Reclaiming the Horizon: Novel Visualization Designs for Time-Series Data with Large Value Ranges

  • paper_url: http://arxiv.org/abs/2307.10278
  • repo_url: None
  • paper_authors: Daniel Braun, Rita Borgo, Max Sondag, Tatiana von Landesberger
  • for: Supporting practitioners in identification and discrimination tasks on time-series data with large value ranges (several orders of magnitude).
  • methods: Proposes two novel visualization designs: an order of magnitude horizon graph, which extends the classic horizon graph, and an order of magnitude line chart, which adapts the log-line chart. Both visualize large value ranges by explicitly splitting each value into its mantissa and exponent.
  • results: Across the four common tasks of identification, discrimination, estimation, and trend detection, the new order of magnitude horizon graph performs better than or equal to all other designs, except for trend detection, where traditional horizon graphs perform better. The results are domain-independent, requiring only time-series data with large value ranges.
    Abstract We introduce two novel visualization designs to support practitioners in performing identification and discrimination tasks on large value ranges (i.e., several orders of magnitude) in time-series data: (1) The order of magnitude horizon graph, which extends the classic horizon graph; and (2) the order of magnitude line chart, which adapts the log-line chart. These new visualization designs visualize large value ranges by explicitly splitting the mantissa m and exponent e of a value v = m * 10e . We evaluate our novel designs against the most relevant state-of-the-art visualizations in an empirical user study. It focuses on four main tasks commonly employed in the analysis of time-series and large value ranges visualization: identification, discrimination, estimation, and trend detection. For each task we analyse error, confidence, and response time. The new order of magnitude horizon graph performs better or equal to all other designs in identification, discrimination, and estimation tasks. Only for trend detection tasks, the more traditional horizon graphs reported better performance. Our results are domain-independent, only requiring time-series data with large value ranges.
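The mantissa-exponent split at the heart of both designs is easy to state in code. A small NumPy sketch of the decomposition v = m * 10^e with 1 <= |m| < 10 (assuming nonzero values; the function name is illustrative):

```python
import numpy as np

def split_order_of_magnitude(v):
    """Decompose each value v into mantissa m and integer exponent e
    so that v = m * 10**e, as visualized by the order of magnitude
    horizon graph and line chart. Assumes all values are nonzero.
    """
    v = np.asarray(v, dtype=float)
    e = np.floor(np.log10(np.abs(v))).astype(int)
    m = v / 10.0 ** e
    return m, e

m, e = split_order_of_magnitude([3.2, 4700.0, 0.081])
# m -> [3.2, 4.7, 8.1], e -> [0, 3, -2]
```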

PixelHuman: Animatable Neural Radiance Fields from Few Images

  • paper_url: http://arxiv.org/abs/2307.09070
  • repo_url: None
  • paper_authors: Gyumin Shim, Jaeseong Lee, Junha Hyung, Jaegul Choo
  • for: A novel model that generates animatable human scenes from a few images of a person with unseen identity, views, and poses.
  • methods: Uses a neural radiance field conditioned on a canonical representation and pose-aware pixel-aligned features, both obtained through deformation fields learned in a data-driven manner.
  • results: Experiments show state-of-the-art performance in multiview and novel pose synthesis from few-shot images.
    Abstract In this paper, we propose PixelHuman, a novel human rendering model that generates animatable human scenes from a few images of a person with unseen identity, views, and poses. Previous works have demonstrated reasonable performance in novel view and pose synthesis, but they rely on a large number of images to train and are trained per scene from videos, which requires a significant amount of time to produce animatable scenes from unseen human images. Our method differs from existing methods in that it can generalize to any input image for animatable human synthesis. Given a random pose sequence, our method synthesizes each target scene using a neural radiance field that is conditioned on a canonical representation and pose-aware pixel-aligned features, both of which can be obtained through deformation fields learned in a data-driven manner. Our experiments show that our method achieves state-of-the-art performance in multiview and novel pose synthesis from few-shot images.

Evaluate Fine-tuning Strategies for Fetal Head Ultrasound Image Segmentation with U-Net

  • paper_url: http://arxiv.org/abs/2307.09067
  • repo_url: https://github.com/13204942/ft_methods_for_fetal_head_segmentation
  • paper_authors: Fangyijie Wang, Guénolé Silvestre, Kathleen M. Curran
  • for: Improving the efficiency of fetal head circumference (HC) measurement for monitoring fetal growth.
  • methods: Proposes a fetal head segmentation approach that fine-tunes (FT) a U-Net with a lightweight MobileNet encoder, so the network can be trained with a limited number of parameters and less effort than training from scratch.
  • results: The FT strategy achieves segmentation performance comparable to training from scratch while reducing trainable parameters by 85.8%, and outperforms other strategies with trainable parameter sizes below 4.4 million, showing that it can meet the needs of fetal head segmentation under practical constraints.
    Abstract Fetal head segmentation is a crucial step in measuring the fetal head circumference (HC) during gestation, an important biometric in obstetrics for monitoring fetal growth. However, manual biometry generation is time-consuming and results in inconsistent accuracy. To address this issue, convolutional neural network (CNN) models have been utilized to improve the efficiency of medical biometry. However, training a CNN network from scratch is a challenging task, so we propose a Transfer Learning (TL) method. Our approach involves fine-tuning (FT) a U-Net network with a lightweight MobileNet as the encoder to perform segmentation on a set of fetal head ultrasound (US) images with limited effort. This method addresses the challenges associated with training a CNN network from scratch. Our results suggest that the proposed FT strategy yields segmentation performance comparable to training from scratch while reducing the number of trainable parameters by 85.8%. Moreover, our FT strategy outperforms other strategies with trainable parameter sizes below 4.4 million. Thus, we contend that it can serve as a dependable FT approach for reducing the size of models in medical image analysis. Our key findings highlight the importance of the balance between model performance and size in developing Artificial Intelligence (AI) applications by TL methods. Code is available at https://github.com/13204942/FT_Methods_for_Fetal_Head_Segmentation.
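As a rough illustration of one such FT strategy, the sketch below builds a U-Net with a MobileNetV2 encoder and freezes the encoder so only the decoder trains, which sharply cuts trainable parameters. This assumes the third-party `segmentation_models_pytorch` package and shows only one plausible strategy; the paper compares several, and its exact setup may differ.

```python
import torch
import segmentation_models_pytorch as smp  # assumed third-party dependency

# U-Net with a lightweight MobileNetV2 encoder pretrained on ImageNet.
model = smp.Unet(encoder_name="mobilenet_v2", encoder_weights="imagenet",
                 in_channels=1, classes=1)  # grayscale ultrasound -> binary mask

# One common FT strategy: freeze the pretrained encoder, train the decoder only.
for p in model.encoder.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
loss_fn = smp.losses.DiceLoss(mode="binary")
```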

PatchCT: Aligning Patch Set and Label Set with Conditional Transport for Multi-Label Image Classification

  • paper_url: http://arxiv.org/abs/2307.09066
  • repo_url: https://github.com/keepgoingjkg/patchct
  • paper_authors: Miaoge Li, Dongsheng Wang, Xinyang Liu, Zequn Zeng, Ruiying Lu, Bo Chen, Mingyuan Zhou
  • for: Proposes a multi-label image classification method based on Conditional Transport (CT) theory, to better exploit the interaction between the image and label semantic spaces.
  • methods: Uses CT theory to bridge the semantic spaces of the visual patch and linguistic label domains, learning and aligning the two semantic sets by defining forward and backward navigators.
  • results: Experiments show that the proposed method consistently outperforms previous methods on three public image benchmarks.
    Abstract Multi-label image classification is a prediction task that aims to identify more than one label from a given image. This paper considers the semantic consistency of the latent space between the visual patch and linguistic label domains and introduces the conditional transport (CT) theory to bridge the acknowledged gap. While recent cross-modal attention-based studies have attempted to align such two representations and achieved impressive performance, they required carefully-designed alignment modules and extra complex operations in the attention computation. We find that by formulating the multi-label classification as a CT problem, we can exploit the interactions between the image and label efficiently by minimizing the bidirectional CT cost. Specifically, after feeding the images and textual labels into the modality-specific encoders, we view each image as a mixture of patch embeddings and a mixture of label embeddings, which capture the local region features and the class prototypes, respectively. CT is then employed to learn and align those two semantic sets by defining the forward and backward navigators. Importantly, the defined navigators in CT distance model the similarities between patches and labels, which provides an interpretable tool to visualize the learned prototypes. Extensive experiments on three public image benchmarks show that the proposed model consistently outperforms the previous methods.
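A simplified version of the bidirectional CT cost can be written directly from the description: a forward navigator distributes each patch over labels, a backward navigator distributes each label over patches, and both transport plans are weighted softmax similarities. A minimal sketch under those assumptions (PatchCT's navigators are learned and more elaborate):

```python
import torch
import torch.nn.functional as F

def bidirectional_ct_cost(patches, labels, temperature=0.1):
    """Bidirectional conditional-transport cost between patch embeddings
    (N, D) and label embeddings (K, D). Minimal illustrative version.
    """
    p = F.normalize(patches, dim=-1)
    l = F.normalize(labels, dim=-1)
    sim = p @ l.t()                                # (N, K) cosine similarities
    cost = 1.0 - sim                               # point-to-point transport cost
    pi_fwd = F.softmax(sim / temperature, dim=1)   # each patch over labels
    pi_bwd = F.softmax(sim / temperature, dim=0)   # each label over patches
    ct_fwd = (pi_fwd * cost).sum(dim=1).mean()     # forward navigator cost
    ct_bwd = (pi_bwd * cost).sum(dim=0).mean()     # backward navigator cost
    return ct_fwd + ct_bwd
```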

Learning Adaptive Neighborhoods for Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2307.09065
  • repo_url: None
  • paper_authors: Avishkar Saha, Oscar Mendez, Chris Russell, Richard Bowden
  • for: Proposes a differentiable graph-structure generator that lets GCNs learn end-to-end on graph-structured data.
  • methods: Introduces a novel end-to-end differentiable graph generator that builds graph topologies in which each node selects both its neighborhood and its size; the module replaces a predetermined or existing adjacency matrix with one that is learned and optimized as part of the overall objective.
  • results: Integrating the module into trajectory prediction, point cloud classification, and node classification pipelines improves accuracy over other structure-learning methods across a wide range of datasets and GCN backbones.
    Abstract Graph convolutional networks (GCNs) enable end-to-end learning on graph structured data. However, many works assume a given graph structure. When the input graph is noisy or unavailable, one approach is to construct or learn a latent graph structure. These methods typically fix the choice of node degree for the entire graph, which is suboptimal. Instead, we propose a novel end-to-end differentiable graph generator which builds graph topologies where each node selects both its neighborhood and its size. Our module can be readily integrated into existing pipelines involving graph convolution operations, replacing the predetermined or existing adjacency matrix with one that is learned, and optimized, as part of the general objective. As such it is applicable to any GCN. We integrate our module into trajectory prediction, point cloud classification and node classification pipelines resulting in improved accuracy over other structure-learning methods across a wide range of datasets and GCN backbones.
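One way to let every node pick both its neighbors and its neighborhood size differentiably is to score all node pairs and apply a node-specific learned threshold. The sketch below only illustrates that idea; the paper's generator differs in detail, and every name here is hypothetical:

```python
import torch
import torch.nn as nn

class DifferentiableGraphGenerator(nn.Module):
    """Learnable soft adjacency: each node scores all others by similarity
    of projected embeddings and applies its own learned threshold, so the
    neighborhood and its size are both node-specific and differentiable.
    """
    def __init__(self, in_dim, proj_dim=32, temperature=0.1):
        super().__init__()
        self.proj = nn.Linear(in_dim, proj_dim)
        self.thresh = nn.Linear(in_dim, 1)   # node-specific neighborhood size
        self.temperature = temperature

    def forward(self, x):
        # x: (N, in_dim) node features
        z = self.proj(x)
        sim = z @ z.t() / z.shape[-1] ** 0.5               # (N, N) pairwise scores
        t = self.thresh(x)                                 # (N, 1) learned threshold
        adj = torch.sigmoid((sim - t) / self.temperature)  # soft edges
        adj = adj * (1.0 - torch.eye(x.shape[0], device=x.device))  # no self-loops
        return adj  # usable in place of a fixed adjacency by any GCN layer
```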

Deep learning for unsupervised domain adaptation in medical imaging: Recent advancements and future perspectives

  • paper_url: http://arxiv.org/abs/2308.01265
  • repo_url: None
  • paper_authors: Suruchi Kumari, Pravendra Singh
  • for: Reviews and summarizes deep learning methods for unsupervised domain adaptation (UDA) in medical imaging, with a focus on developments of recent years.
  • methods: Surveys UDA techniques for medical imaging, including feature alignment, image translation, self-supervision, and disentangled representation methods, among others.
  • results: Provides a comprehensive technical overview of UDA in medical imaging, categorizing the literature into six groups with finer task-based subcategories and discussing the datasets each study uses.
    Abstract Deep learning has demonstrated remarkable performance across various tasks in medical imaging. However, these approaches primarily focus on supervised learning, assuming that the training and testing data are drawn from the same distribution. Unfortunately, this assumption may not always hold true in practice. To address these issues, unsupervised domain adaptation (UDA) techniques have been developed to transfer knowledge from a labeled domain to a related but unlabeled domain. In recent years, significant advancements have been made in UDA, resulting in a wide range of methodologies, including feature alignment, image translation, self-supervision, and disentangled representation methods, among others. In this paper, we provide a comprehensive literature review of recent deep UDA approaches in medical imaging from a technical perspective. Specifically, we categorize current UDA research in medical imaging into six groups and further divide them into finer subcategories based on the different tasks they perform. We also discuss the respective datasets used in the studies to assess the divergence between the different domains. Finally, we discuss emerging areas and provide insights and discussions on future research directions to conclude this survey.

Outlier-Robust Tensor Low-Rank Representation for Data Clustering

  • paper_url: http://arxiv.org/abs/2307.09055
  • repo_url: None
  • paper_authors: Tong Wu
  • for: Recovering and clustering tensor data corrupted by outliers or sample-specific corruptions.
  • methods: Proposes an outlier-robust tensor low-rank representation (OR-TLRR) method, based on the tensor singular value decomposition (t-SVD) algebraic framework, for simultaneous outlier detection and tensor data clustering.
  • results: Experiments show that OR-TLRR effectively recovers and clusters tensor data corrupted by outliers, and an extension handles the case where parts of the data are missing.
    Abstract Low-rank tensor analysis has received widespread attention with many practical applications. However, the tensor data are often contaminated by outliers or sample-specific corruptions. How to recover the tensor data that are corrupted by outliers and perform data clustering remains a challenging problem. This paper develops an outlier-robust tensor low-rank representation (OR-TLRR) method for simultaneous outlier detection and tensor data clustering based on the tensor singular value decomposition (t-SVD) algebraic framework. It is motivated by the recently proposed tensor-tensor product induced by invertible linear transforms that satisfy certain conditions. For tensor observations with arbitrary outlier corruptions, OR-TLRR has provable performance guarantee for exactly recovering the row space of clean data and detecting outliers under mild conditions. Moreover, an extension of OR-TLRR is also proposed to handle the case when parts of the data are missing. Finally, extensive experimental results on both synthetic and real data demonstrate the effectiveness of the proposed algorithms.
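For readers unfamiliar with TLRR-style models, a plausible form of the OR-TLRR program under standard notation would be the following (this is a reconstruction from the abstract, not the paper's exact formulation):

```latex
\min_{\mathcal{Z},\,\mathcal{E}}\;
\|\mathcal{Z}\|_{\mathrm{TNN}} + \lambda \,\|\mathcal{E}\|_{2,1}
\quad \text{s.t.}\quad
\mathcal{X} = \mathcal{X} * \mathcal{Z} + \mathcal{E},
```

where $*$ is the tensor-tensor product induced by an invertible linear transform, $\|\cdot\|_{\mathrm{TNN}}$ is the tensor nuclear norm defined through the t-SVD, and the $\ell_{2,1}$ norm on $\mathcal{E}$ encourages sample-specific (column-wise) outliers.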

Connections between Operator-splitting Methods and Deep Neural Networks with Applications in Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.09052
  • repo_url: None
  • paper_authors: Hao Liu, Xue-Cheng Tai, Raymond Chan
  • for: Provides a mathematical, algorithmic explanation of deep neural networks by connecting them with mathematical algorithms.
  • methods: Uses operator-splitting strategies and multigrid methods to explain deep neural networks, and proposes two operator-splitting-inspired network schemes for solving image segmentation problems.
  • results: Numerical experiments show that the two network schemes perform well and effectively solve image segmentation problems.
    Abstract Deep neural network is a powerful tool for many tasks. Understanding why it is so successful and providing a mathematical explanation is an important problem and has been one popular research direction in past years. In the literature on the mathematical analysis of deep neural networks, many works are dedicated to establishing representation theories. How to make connections between deep neural networks and mathematical algorithms is still under development. In this paper, we give an algorithmic explanation for deep neural networks, especially in their connection with operator splitting and multigrid methods. We show that with certain splitting strategies, operator-splitting methods have the same structure as networks. Utilizing this connection and the Potts model for image segmentation, two networks inspired by operator-splitting methods are proposed. The two networks are essentially two operator-splitting algorithms solving the Potts model. Numerical experiments are presented to demonstrate the effectiveness of the proposed networks.

PottsMGNet: A Mathematical Explanation of Encoder-Decoder Based Neural Networks

  • paper_url: http://arxiv.org/abs/2307.09039
  • repo_url: None
  • paper_authors: Xue-Cheng Tai, Hao Liu, Raymond Chan
  • for: Provides a mathematical explanation of effective encoder-decoder-based neural networks and their application to image segmentation.
  • methods: Explains the encoder-decoder architecture through the two-phase Potts model, discretizing a continuous control model with a multigrid method and an operator-splitting scheme.
  • results: With Soft-Threshold-Dynamics incorporated as a regularizer, the resulting PottsMGNet is robust to network parameters such as width and depth and achieves remarkable performance on datasets with very large noise; in nearly all experiments, the new network performs better than or as well as existing image segmentation networks on accuracy and dice score.
    Abstract For problems in image processing and many other fields, a large class of effective neural networks has encoder-decoder-based architectures. Although these networks have achieved impressive performance, mathematical explanations of their architectures are still underdeveloped. In this paper, we study the encoder-decoder-based network architecture from the algorithmic perspective and provide a mathematical explanation. We use the two-phase Potts model for image segmentation as an example for our explanations. We associate the segmentation problem with a control problem in the continuous setting. Then, multigrid method and operator splitting scheme, the PottsMGNet, are used to discretize the continuous control model. We show that the resulting discrete PottsMGNet is equivalent to an encoder-decoder-based network. With minor modifications, it is shown that a number of the popular encoder-decoder-based neural networks are just instances of the proposed PottsMGNet. By incorporating the Soft-Threshold-Dynamics into the PottsMGNet as a regularizer, the PottsMGNet has been shown to be robust with respect to network parameters such as network width and depth, and it achieved remarkable performance on datasets with very large noise. In nearly all our experiments, the new network performs better than or as well as existing networks for image segmentation on accuracy and dice score.

Online Self-Supervised Thermal Water Segmentation for Aerial Vehicles

  • paper_url: http://arxiv.org/abs/2307.09027
  • repo_url: https://github.com/connorlee77/uav-thermal-water-segmentation
  • paper_authors: Connor Lee, Jonathan Gustafsson Frennert, Lu Gan, Matthew Anderson, Soon-Jo Chung
  • for: Enabling autonomous aerial robots to perform visual navigation, bathymetry, and flow tracking at night in near-shore environments, where thermal training data is scarce.
  • methods: Adapts an RGB-trained water segmentation network to target-domain aerial thermal imagery through online self-supervision, leveraging texture and motion cues as supervisory signals.
  • results: Outperforms fully-supervised segmentation models trained on limited target-domain thermal data and runs in real time on an Nvidia Jetson embedded computing platform.
    Abstract We present a new method to adapt an RGB-trained water segmentation network to target-domain aerial thermal imagery using online self-supervision by leveraging texture and motion cues as supervisory signals. This new thermal capability enables current autonomous aerial robots operating in near-shore environments to perform tasks such as visual navigation, bathymetry, and flow tracking at night. Our method overcomes the problem of scarce and difficult-to-obtain near-shore thermal data that prevents the application of conventional supervised and unsupervised methods. In this work, we curate the first aerial thermal near-shore dataset, show that our approach outperforms fully-supervised segmentation models trained on limited target-domain thermal data, and demonstrate real-time capabilities onboard an Nvidia Jetson embedded computing platform. Code and datasets used in this work will be available at: https://github.com/connorlee77/uav-thermal-water-segmentation.

ActionPrompt: Action-Guided 3D Human Pose Estimation With Text and Pose Prompting

  • paper_url: http://arxiv.org/abs/2307.09026
  • repo_url: None
  • paper_authors: Hongwei Zheng, Han Li, Bowen Shi, Wenrui Dai, Botao Wan, Yu Sun, Min Guo, Hongkai Xiong
  • for: Improving video-based 2D-to-3D human pose estimation (HPE) by mitigating the depth ambiguity problem.
  • methods: Proposes a plug-and-play Action Prompt Module (APM) that effectively mines different kinds of action clues to improve 3D HPE.
  • results: Experiments show that APM improves the performance of most video-based 2D-to-3D HPE frameworks by a large margin.
    Abstract Recent 2D-to-3D human pose estimation (HPE) utilizes temporal consistency across sequences to alleviate the depth ambiguity problem but ignores the action-related prior knowledge hidden in the pose sequence. In this paper, we propose a plug-and-play module named Action Prompt Module (APM) that effectively mines different kinds of action clues for 3D HPE. The highlight is that the mining scheme of APM can be widely adapted to different frameworks and bring consistent benefits. Specifically, we first present a novel Action-related Text Prompt module (ATP) that directly embeds action labels and transfers the rich language information in the label to the pose sequence. Besides, we further introduce Action-specific Pose Prompt module (APP) to mine the position-aware pose pattern of each action, and exploit the correlation between the mined patterns and input pose sequence for further pose refinement. Experiments show that APM can improve the performance of most video-based 2D-to-3D HPE frameworks by a large margin.

LA-Net: Landmark-Aware Learning for Reliable Facial Expression Recognition under Label Noise

  • paper_url: http://arxiv.org/abs/2307.09023
  • repo_url: None
  • paper_authors: Zhiyu Wu, Jinshi Cui
  • for: Improving the performance of facial expression recognition (FER) by addressing the label-noise problem in real-world applications.
  • methods: Leverages facial landmarks to mitigate the impact of label noise from two perspectives: first, landmark information suppresses uncertainty in the expression space, and per-sample label distributions built by neighborhood aggregation improve the quality of training supervision; second, landmark information is integrated into expression representations via an expression-landmark contrastive loss, making the expression feature extractor less susceptible to label noise.
  • results: Extensive experiments on in-the-wild datasets and synthetic noisy datasets show that LA-Net achieves state-of-the-art performance.
    Abstract Facial expression recognition (FER) remains a challenging task due to the ambiguity of expressions. The derived noisy labels significantly harm the performance in real-world scenarios. To address this issue, we present a new FER model named Landmark-Aware Net~(LA-Net), which leverages facial landmarks to mitigate the impact of label noise from two perspectives. Firstly, LA-Net uses landmark information to suppress the uncertainty in expression space and constructs the label distribution of each sample by neighborhood aggregation, which in turn improves the quality of training supervision. Secondly, the model incorporates landmark information into expression representations using the devised expression-landmark contrastive loss. The enhanced expression feature extractor can be less susceptible to label noise. Our method can be integrated with any deep neural network for better training supervision without introducing extra inference costs. We conduct extensive experiments on both in-the-wild datasets and synthetic noisy datasets and demonstrate that LA-Net achieves state-of-the-art performance.
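The neighborhood-aggregation step can be illustrated compactly: a sample's (possibly noisy) label is softened by averaging the labels of its nearest neighbors in a landmark-aware feature space. A minimal sketch of that idea only; LA-Net's actual weighting and aggregation differ, and the smoothing term here is an assumption:

```python
import torch
import torch.nn.functional as F

def aggregate_label_distribution(landmark_feats, onehot_labels, k=8, eps=0.1):
    """Per-sample soft label distributions from k-NN neighborhood aggregation.
    landmark_feats: (N, D) landmark-aware features; onehot_labels: (N, C).
    """
    f = F.normalize(landmark_feats, dim=-1)
    sim = f @ f.t()                                  # (N, N) similarities
    _, idx = sim.topk(k + 1, dim=-1)                 # k neighbours + self
    neigh_labels = onehot_labels[idx]                # (N, k+1, C)
    dist = neigh_labels.mean(dim=1)                  # averaged soft labels
    # mild uniform smoothing so no class gets exactly zero mass (assumed)
    return (1 - eps) * dist + eps / onehot_labels.shape[1]
```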

Face-PAST: Facial Pose Awareness and Style Transfer Networks

  • paper_url: http://arxiv.org/abs/2307.09020
  • repo_url: None
  • paper_authors: Sunder Ali Khowaja, Ghulam Mujtaba, Jiseok Yoon, Ik Hyun Lee
  • for: Proposes a StyleGAN-based facial style transfer network that preserves facial details and structures while generating high-quality stylized images.
  • methods: Uses a pre-trained style generation network in an external style pass with a residual modulation block and a gated mapping unit, together with facial structure, identity, and segmentation losses to preserve facial details and structure.
  • results: Can perform style transfer even with very limited training data, generating high-quality stylized images without overfitting the styles or adding artifacts.
    Abstract Facial style transfer has been quite popular among researchers due to the rise of emerging technologies such as eXtended Reality (XR), Metaverse, and Non-Fungible Tokens (NFTs). Furthermore, StyleGAN methods along with transfer-learning strategies have reduced the problem of limited data to some extent. However, most of the StyleGAN methods overfit the styles while adding artifacts to facial images. In this paper, we propose a facial pose awareness and style transfer (Face-PAST) network that preserves facial details and structures while generating high-quality stylized images. Dual StyleGAN inspires our work, but in contrast, our work uses a pre-trained style generation network in an external style pass with a residual modulation block instead of a transform coding block. Furthermore, we use the gated mapping unit and facial structure, identity, and segmentation losses to preserve the facial structure and details. This enables us to train the network with a very limited amount of data while generating high-quality stylized images. Our training process adapts curriculum learning strategy to perform efficient and flexible style mixing in the generative space. We perform extensive experiments to show the superiority of Face-PAST in comparison to existing state-of-the-art methods.

U-shaped Transformer: Retain High Frequency Context in Time Series Analysis

  • paper_url: http://arxiv.org/abs/2307.09019
  • repo_url: None
  • paper_authors: Qingkui Chen, Yiqin Zhang
  • for: Proposes a transformer-based time-series forecasting model that uses skip-layer connections and patch merge/split operations to improve accuracy and efficiency.
  • methods: Builds on the traditional transformer backbone, adding skip-layer connections and patch merge and split operations to extract features at different scales, and uses larger datasets to fully exploit the backbone.
  • results: Experiments show the model achieves an advanced level of forecasting performance across multiple datasets at relatively low cost.
    Abstract Time series prediction plays a crucial role in various industrial fields. In recent years, neural networks with a transformer backbone have achieved remarkable success in many domains, including computer vision and NLP. In the time series analysis domain, some studies have suggested that even the simplest MLP networks outperform advanced transformer-based networks on time series forecast tasks. However, we believe these findings indicate low-rank properties in time series sequences. In this paper, we consider the low-pass characteristics of transformers and try to incorporate the advantages of MLP. We incorporate skip-layer connections inspired by U-Net into the traditional transformer backbone, thus preserving high-frequency context from input to output, namely U-shaped Transformer. We introduce patch merge and split operations to extract features with different scales and use larger datasets to fully make use of the transformer backbone. Our experiments demonstrate that the model performs at an advanced level across multiple datasets with relatively low cost.
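The patch merge and split operations can be sketched as pure reshapes over a token sequence; the paper's versions may add learned projections, and U-Net-style skip connections would concatenate encoder tokens with decoder tokens at the matching scale. A minimal sketch:

```python
import torch

def patch_merge(x, factor=2):
    """Merge neighbouring time steps into wider patches:
    (B, L, D) -> (B, L//factor, D*factor). Coarser scale, fewer tokens."""
    B, L, D = x.shape
    return x.reshape(B, L // factor, D * factor)

def patch_split(x, factor=2):
    """Inverse of patch_merge: (B, L, D) -> (B, L*factor, D//factor)."""
    B, L, D = x.shape
    return x.reshape(B, L * factor, D // factor)

x = torch.randn(4, 96, 64)
assert torch.equal(patch_split(patch_merge(x)), x)  # exact round trip
```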

Survey on Controllable Image Synthesis with Deep Learning

  • paper_url: http://arxiv.org/abs/2307.10275
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Shixiong Zhang, Jiao Li, Lu Yang
  • for: Investigates the low-level controllable image synthesis problem, which is crucial for fine image rendering and editing tasks.
  • methods: Surveys deep learning techniques, especially generative models, for controllable image synthesis.
  • results: Reviews 3D controllable image synthesis together with its datasets and evaluation indicators, and briefly summarizes related applications, products, and resources for practitioners.
    Abstract Image synthesis has attracted emerging research interests in academic and industry communities. Deep learning technologies especially the generative models greatly inspired controllable image synthesis approaches and applications, which aim to generate particular visual contents with latent prompts. In order to further investigate low-level controllable image synthesis problem which is crucial for fine image rendering and editing tasks, we present a survey of some recent works on 3D controllable image synthesis using deep learning. We first introduce the datasets and evaluation indicators for 3D controllable image synthesis. Then, we review the state-of-the-art research for geometrically controllable image synthesis in two aspects: 1) Viewpoint/pose-controllable image synthesis; 2) Structure/shape-controllable image synthesis. Furthermore, the photometrically controllable image synthesis approaches are also reviewed for 3D re-lighting researches. While the emphasis is on 3D controllable image synthesis algorithms, the related applications, products and resources are also briefly summarized for practitioners.

Soft-IntroVAE for Continuous Latent space Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2307.09008
  • repo_url: None
  • paper_authors: Zhi-Song Liu, Zijia Wang, Zhen Jia
  • for: Proposes a Variational AutoEncoder-based continuous image super-resolution (SR) method for practical and flexible image scaling across various displays.
  • methods: Uses a local implicit image representation with latent-space interpolation built on a Variational AutoEncoder, together with a novel latent-space adversarial training scheme for photo-realistic restoration.
  • results: Quantitative and qualitative comparisons show that the proposed Soft-introVAE-SR improves image quality over competing methods, and it further generalizes to denoising and real-image super-resolution.
    Abstract Continuous image super-resolution (SR) has recently received a lot of attention from researchers for its practical and flexible image scaling across various displays. Local implicit image representation is one of the methods that can map the coordinates and 2D features for latent space interpolation. Inspired by the Variational AutoEncoder, we propose a Soft-introVAE for continuous latent space image super-resolution (SVAE-SR). A novel latent space adversarial training is achieved for photo-realistic image restoration. To further improve the quality, a positional encoding scheme is used to extend the original pixel coordinates by aggregating frequency information over the pixel areas. We show the effectiveness of the proposed SVAE-SR through quantitative and qualitative comparisons, and further illustrate its generalization in denoising and real-image super-resolution.
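A common form of such a positional encoding extends each coordinate with sine/cosine features at multiple frequencies, which lets an implicit decoder represent high-frequency detail. A minimal sketch of that standard scheme; SVAE-SR's encoding additionally aggregates frequency information over pixel areas, which is not reproduced here:

```python
import torch

def fourier_position_encoding(coords, num_bands=6):
    """Extend pixel coordinates with multi-frequency sin/cos features.
    coords: (..., 2) float tensor in [-1, 1].
    returns: (..., 2 + 4 * num_bands) encoded coordinates.
    """
    freqs = 2.0 ** torch.arange(num_bands) * torch.pi     # (num_bands,)
    ang = coords.unsqueeze(-1) * freqs                    # (..., 2, num_bands)
    enc = torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)
    return torch.cat([coords, enc.flatten(-2)], dim=-1)
```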

Frequency-mixed Single-source Domain Generalization for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.09005
  • repo_url: https://github.com/liamheng/non-iid_medical_image_segmentation
  • paper_authors: Heng Li, Haojin Li, Wei Zhao, Huazhu Fu, Xiuyun Su, Yan Hu, Jiang Liu
  • for: Improving the generalizability of medical image segmentation models, especially when annotated data is scarce.
  • methods: Proposes a novel Frequency-mixed Single-source Domain Generalization method (FreeSDG) that analyzes how frequency affects domain discrepancy and augments the single source domain with a mixed frequency spectrum; self-supervision is built into the domain augmentation to learn robust context-aware representations for segmentation.
  • results: Experiments show that FreeSDG outperforms prior methods and significantly improves the generalizability of segmentation models, particularly when annotated data is scarce.
    Abstract The annotation scarcity of medical image segmentation poses challenges in collecting sufficient training data for deep learning models. Specifically, models trained on limited data may not generalize well to other unseen data domains, resulting in a domain shift issue. Consequently, domain generalization (DG) is developed to boost the performance of segmentation models on unseen domains. However, the DG setup requires multiple source domains, which impedes the efficient deployment of segmentation algorithms in clinical scenarios. To address this challenge and improve the segmentation model's generalizability, we propose a novel approach called the Frequency-mixed Single-source Domain Generalization method (FreeSDG). By analyzing the frequency's effect on domain discrepancy, FreeSDG leverages a mixed frequency spectrum to augment the single-source domain. Additionally, self-supervision is constructed in the domain augmentation to learn robust context-aware representations for the segmentation task. Experimental results on five datasets of three modalities demonstrate the effectiveness of the proposed algorithm. FreeSDG outperforms state-of-the-art methods and significantly improves the segmentation model's generalizability. Therefore, FreeSDG provides a promising solution for enhancing the generalization of medical image segmentation models, especially when annotated data is scarce. The code is available at https://github.com/liamheng/Non-IID_Medical_Image_Segmentation.
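Frequency-domain augmentation of this flavor is often realized by mixing the amplitude spectra of two images while keeping one image's phase, in the spirit of Fourier-based domain augmentation. A minimal sketch of that idea (FreeSDG's exact mixing scheme may differ):

```python
import torch

def frequency_mix(img_a, img_b, alpha=0.5):
    """Mix the amplitude spectra of two same-sized images (C, H, W),
    keeping the phase of the first; phase carries most structure, so the
    content of img_a survives while its frequency statistics shift.
    """
    fa, fb = torch.fft.fft2(img_a), torch.fft.fft2(img_b)
    amp = (1 - alpha) * fa.abs() + alpha * fb.abs()   # mixed amplitude spectrum
    mixed = amp * torch.exp(1j * fa.angle())          # keep img_a's phase
    return torch.fft.ifft2(mixed).real
```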

TractCloud: Registration-free tractography parcellation with a novel local-global streamline point cloud representation

  • paper_url: http://arxiv.org/abs/2307.09000
  • repo_url: https://github.com/SlicerDMRI/TractCloud
  • paper_authors: Tengfei Xue, Yuqian Chen, Chaoyi Zhang, Alexandra J. Golby, Nikos Makris, Yogesh Rathi, Weidong Cai, Fan Zhang, Lauren J. O’Donnell
  • for: Proposes a registration-free tractography parcellation method that operates directly in individual subject space and scales efficiently to large datasets.
  • methods: Uses a novel, learnable, local-global streamline representation that leverages neighboring and whole-brain streamlines to describe local anatomy and global brain pose.
  • results: Tested on five independently acquired datasets across populations and health conditions, the method significantly outperforms state-of-the-art parcellation approaches.
    Abstract Diffusion MRI tractography parcellation classifies streamlines into anatomical fiber tracts to enable quantification and visualization for clinical and scientific applications. Current tractography parcellation methods rely heavily on registration, but registration inaccuracies can affect parcellation and the computational cost of registration is high for large-scale datasets. Recently, deep-learning-based methods have been proposed for tractography parcellation using various types of representations for streamlines. However, these methods only focus on the information from a single streamline, ignoring geometric relationships between the streamlines in the brain. We propose TractCloud, a registration-free framework that performs whole-brain tractography parcellation directly in individual subject space. We propose a novel, learnable, local-global streamline representation that leverages information from neighboring and whole-brain streamlines to describe the local anatomy and global pose of the brain. We train our framework on a large-scale labeled tractography dataset, which we augment by applying synthetic transforms including rotation, scaling, and translations. We test our framework on five independently acquired datasets across populations and health conditions. TractCloud significantly outperforms several state-of-the-art methods on all testing datasets. TractCloud achieves efficient and consistent whole-brain white matter parcellation across the lifespan (from neonates to elderly subjects, including brain tumor patients) without the need for registration. The robustness and high inference speed of TractCloud make it suitable for large-scale tractography data analysis. Our project page is available at https://tractcloud.github.io/.
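A rough sketch of a local-global representation in this spirit: each streamline's feature is concatenated with a pooled feature of its nearest neighbors (local anatomy) and of randomly sampled whole-brain streamlines (global pose). All function and variable names are illustrative; TractCloud's learned representation is more involved:

```python
import torch

def local_global_representation(feats, coords, k=20, g=50):
    """feats: (N, D) per-streamline features; coords: (N, 3), e.g. centroids.
    Returns (N, 3*D): [own feature | local k-NN pool | global random pool].
    """
    dist = torch.cdist(coords, coords)              # (N, N) pairwise distances
    _, idx = dist.topk(k + 1, largest=False)        # k neighbours + self
    local = feats[idx].mean(dim=1)                  # (N, D) local anatomy
    g_idx = torch.randint(0, feats.shape[0], (g,))  # random whole-brain sample
    global_feat = feats[g_idx].mean(dim=0, keepdim=True).expand_as(feats)
    return torch.cat([feats, local, global_feat], dim=-1)
```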

Towards Authentic Face Restoration with Iterative Diffusion Models and Beyond

  • paper_url: http://arxiv.org/abs/2307.08996
  • repo_url: None
  • paper_authors: Yang Zhao, Tingbo Hou, Yu-Chuan Su, Xuhui Jia, Yandong Li, Matthias Grundmann
  • for: This paper aims to propose an authentic face restoration system that can generate high-quality and realistic faces from low-quality ones, which is important in various computer vision applications such as image enhancement, video communication, and taking portraits.
  • methods: The proposed method, called $\textbf{IDM}$, is based on denoising diffusion models (DDMs) and uses iterative learning to achieve authentic face restoration. The method has two aspects of intrinsic iterative refinement and extrinsic iterative enhancement to preserve the content and gradually refine the high-quality details.
  • results: The proposed method demonstrates superior performance on blind face restoration tasks and can also clean the data to improve the restoration task. Additionally, the authentically cleaned data generated by the proposed method is found to be helpful for image generation tasks, achieving better quality than state-of-the-art on FFHQ and ImageNet generation using either GANs or diffusion models without modifying the models.
    Abstract An authentic face restoration system is becoming increasingly demanding in many computer vision applications, e.g., image enhancement, video communication, and taking portraits. Most of the advanced face restoration models can recover high-quality faces from low-quality ones but usually fail to faithfully generate realistic and high-frequency details that are favored by users. To achieve authentic restoration, we propose $\textbf{IDM}$, an $\textbf{I}$teratively learned face restoration system based on denoising $\textbf{D}$iffusion $\textbf{M}$odels (DDMs). We define the criterion of an authentic face restoration system, and argue that denoising diffusion models are naturally endowed with this property from two aspects: intrinsic iterative refinement and extrinsic iterative enhancement. Intrinsic learning can preserve the content well and gradually refine the high-quality details, while extrinsic enhancement helps clean the data and improve the restoration task one step further. We demonstrate superior performance on blind face restoration tasks. Beyond restoration, we find the authentically cleaned data by the proposed restoration system is also helpful to image generation tasks in terms of training stabilization and sample quality. Without modifying the models, we achieve better quality than state-of-the-art on FFHQ and ImageNet generation using either GANs or diffusion models.

Revisiting Latent Space of GAN Inversion for Real Image Editing

  • paper_url: http://arxiv.org/abs/2307.08995
  • repo_url: None
  • paper_authors: Kai Katsumata, Duc Minh Vo, Bei Liu, Hideki Nakayama
  • for: Addresses the trade-off between reconstruction quality and editing quality in GAN inversion with StyleGANs, providing a new approach to real image editing.
  • methods: Combines StyleGANs' hyperspherical prior $\mathcal{Z}$ with a highly capable latent space to build a combined space $\mathcal{F}/\mathcal{Z}^{+}$ that faithfully inverts real images without sacrificing the quality of edited images.
  • results: Experiments show that $\mathcal{Z}^{+}$ can replace the commonly used $\mathcal{W}$, $\mathcal{W}^{+}$, and $\mathcal{S}$ spaces while preserving reconstruction quality and enabling semantic editing.
    Abstract The exploration of the latent space in StyleGANs and GAN inversion exemplify impressive real-world image editing, yet the trade-off between reconstruction quality and editing quality remains an open problem. In this study, we revisit StyleGANs' hyperspherical prior $\mathcal{Z}$ and combine it with highly capable latent spaces to build combined spaces that faithfully invert real images while maintaining the quality of edited images. More specifically, we propose $\mathcal{F}/\mathcal{Z}^{+}$ space consisting of two subspaces: $\mathcal{F}$ space of an intermediate feature map of StyleGANs enabling faithful reconstruction and $\mathcal{Z}^{+}$ space of an extended StyleGAN prior supporting high editing quality. We project the real images into the proposed space to obtain the inverted codes, by which we then move along $\mathcal{Z}^{+}$, enabling semantic editing without sacrificing image quality. Comprehensive experiments show that $\mathcal{Z}^{+}$ can replace the most commonly-used $\mathcal{W}$, $\mathcal{W}^{+}$, and $\mathcal{S}$ spaces while preserving reconstruction quality, resulting in reduced distortion of edited images.

Human Action Recognition in Still Images Using ConViT

  • paper_url: http://arxiv.org/abs/2307.08994
  • repo_url: None
  • paper_authors: Seyed Rohollah Hosseyni, Hasan Taheri, Sanaz Seyedin, Ali Ahmad Rahmani
  • for: This study aims to improve human action recognition in still images by better modeling the relationships between different parts of an image.
  • methods: The authors propose a new module that uses a Vision Transformer (ViT) to model relations between image regions. The model combines a deep convolutional network, which extracts high-level spatial features, with a Vision Transformer that captures relations between regions using the CNN's output feature map.
  • results: Evaluated on the Stanford40 and PASCAL VOC 2012 action datasets, the model achieves 95.5% mAP and 91.5% mAP respectively, which is promising compared with other state-of-the-art methods.
    Abstract Understanding the relationship between different parts of the image plays a crucial role in many visual recognition tasks. Despite the fact that Convolutional Neural Networks (CNNs) have demonstrated impressive results in detecting single objects, they lack the capability to extract the relationship between various regions of an image, which is a crucial factor in human action recognition. To address this problem, this paper proposes a new module that functions like a convolutional layer using Vision Transformer (ViT). The proposed action recognition model comprises two components: the first part is a deep convolutional network that extracts high-level spatial features from the image, and the second component of the model utilizes a Vision Transformer that extracts the relationship between various regions of the image using the feature map generated by the CNN output. The proposed model has been evaluated on the Stanford40 and PASCAL VOC 2012 action datasets and has achieved 95.5% mAP and 91.5% mAP results, respectively, which are promising compared to other state-of-the-art methods.
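A minimal sketch of the two-stage idea described above: a small CNN produces a spatial feature map whose locations become tokens for a Transformer encoder that models relations between regions. All module sizes and the pooling head are illustrative, not the authors' architecture.

```python
import torch
import torch.nn as nn

class CNNThenViT(nn.Module):
    """Toy version of the paper's idea: a CNN extracts a spatial feature
    map, and a Transformer encoder models relations between the resulting
    local regions (all sizes are illustrative)."""
    def __init__(self, num_classes=40, dim=128):
        super().__init__()
        self.cnn = nn.Sequential(            # stand-in for a deep backbone
            nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.vit = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        f = self.cnn(x)                        # (B, C, H, W)
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W, C): one token per region
        tokens = self.vit(tokens)              # relations between regions
        return self.head(tokens.mean(dim=1))   # pooled classification

logits = CNNThenViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 40])
```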

Arbitrary point cloud upsampling via Dual Back-Projection Network

  • paper_url: http://arxiv.org/abs/2307.08992
  • repo_url: None
  • paper_authors: Zhi-Song Liu, Zijia Wang, Zhen Jia
  • for: This paper targets the reconstruction of sparse, noisy point clouds, increasing point density so that detailed geometric information can be restored.
  • methods: The method is a Dual Back-Projection network (DBPnet) that back-projects both feature residues and coordinate residues in an up-down-up manner, reducing the point cloud reconstruction error.
  • results: Experiments show the method achieves the lowest point-set matching losses relative to the benchmark at arbitrary upsampling factors (e.g. 4x, 5.5x), and succeeds on non-uniform point clouds as well.
    Abstract Point clouds acquired from 3D sensors are usually sparse and noisy. Point cloud upsampling is an approach to increase the density of the point cloud so that detailed geometric information can be restored. In this paper, we propose a Dual Back-Projection network for point cloud upsampling (DBPnet). A Dual Back-Projection is formulated in an up-down-up manner for point cloud upsampling. It not only back projects feature residues but also coordinates residues so that the network better captures the point correlations in the feature and space domains, achieving lower reconstruction errors on both uniform and non-uniform sparse point clouds. Our proposed method is also generalizable for arbitrary upsampling tasks (e.g. 4x, 5.5x). Experimental results show that the proposed method achieves the lowest point set matching losses with respect to the benchmark. In addition, the success of our approach demonstrates that generative networks are not necessarily needed for non-uniform point clouds.
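A toy illustration of the up-down-up back-projection cycle on point features, assuming simple stand-in up/down sampling operators; the paper's DBPnet also back-projects coordinate residues, which is omitted here.

```python
import torch
import torch.nn as nn

def back_projection_cycle(feat, up, down, refine):
    """Up-down-up cycle: upsample, re-downsample, and feed the residue
    back to correct the upsampled features (toy stand-in for DBPnet)."""
    coarse_up = up(feat)                 # (B, C, r*N): initial upsampling
    recon = down(coarse_up)              # back-project to the input rate
    residue = feat - recon               # what the upsampler failed to keep
    return coarse_up + refine(residue)   # correct with the upsampled residue

B, C, N, r = 2, 32, 256, 4               # batch, channels, points, rate
feat = torch.randn(B, C, N)
up = nn.Sequential(nn.Upsample(scale_factor=r), nn.Conv1d(C, C, 1))
down = nn.Sequential(nn.AvgPool1d(r), nn.Conv1d(C, C, 1))
refine = nn.Sequential(nn.Upsample(scale_factor=r), nn.Conv1d(C, C, 1))
print(back_projection_cycle(feat, up, down, refine).shape)  # (2, 32, 1024)
```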

EgoVM: Achieving Precise Ego-Localization using Lightweight Vectorized Maps

  • paper_url: http://arxiv.org/abs/2307.08991
  • repo_url: None
  • paper_authors: Yuzhe He, Shuang Liang, Xiaofei Rui, Chengying Cai, Guowei Wan
  • for: This work aims to provide accurate yet lightweight ego-localization for autonomous driving.
  • methods: The method uses vectorized maps, which are far lighter than conventional point-based maps. BEV features are first extracted from multi-view images and LiDAR point clouds; a set of learnable semantic embeddings encodes the semantic types of map elements and is supervised with semantic segmentation so that their representations are consistent with the BEV features. Map queries, composed of the semantic embeddings and map-element coordinates, are fed to a transformer decoder for cross-modality matching, and a robust histogram-based pose solver searches exhaustively over candidate poses to estimate the optimal one.
  • results: Extensive validation on the nuScenes dataset and a newly collected dataset shows centimeter-level localization accuracy, outperforming existing vectorized-map methods by a large margin. The model has also been tested extensively on a large fleet of autonomous vehicles across challenging urban scenes.
    Abstract Accurate and reliable ego-localization is critical for autonomous driving. In this paper, we present EgoVM, an end-to-end localization network that achieves comparable localization accuracy to prior state-of-the-art methods, but uses lightweight vectorized maps instead of heavy point-based maps. To begin with, we extract BEV features from online multi-view images and LiDAR point cloud. Then, we employ a set of learnable semantic embeddings to encode the semantic types of map elements and supervise them with semantic segmentation, to make their feature representation consistent with BEV features. After that, we feed map queries, composed of learnable semantic embeddings and coordinates of map elements, into a transformer decoder to perform cross-modality matching with BEV features. Finally, we adopt a robust histogram-based pose solver to estimate the optimal pose by searching exhaustively over candidate poses. We comprehensively validate the effectiveness of our method using both the nuScenes dataset and a newly collected dataset. The experimental results show that our method achieves centimeter-level localization accuracy, and outperforms existing methods using vectorized maps by a large margin. Furthermore, our model has been extensively tested in a large fleet of autonomous vehicles under various challenging urban scenes.
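A toy sketch of the exhaustive candidate-pose search: each (dx, dy, yaw) hypothesis is scored by summing a BEV matching-score map at the transformed map points. The scoring map, resolution, and candidate grid are all assumptions for illustration, not EgoVM's implementation.

```python
import numpy as np

def histogram_pose_solver(score_map, map_pts, candidates, res=0.1):
    """Exhaustively score candidate (dx, dy, yaw) poses by summing a BEV
    matching-score map at the transformed map points, then keep the best.
    A toy stand-in for the paper's robust histogram-based solver."""
    H, W = score_map.shape
    best, best_score = None, -np.inf
    for dx, dy, yaw in candidates:
        c, s = np.cos(yaw), np.sin(yaw)
        pts = map_pts @ np.array([[c, -s], [s, c]]).T + [dx, dy]
        ij = np.round(pts / res).astype(int)            # metres -> BEV cells
        ok = (ij[:, 0] >= 0) & (ij[:, 0] < H) & (ij[:, 1] >= 0) & (ij[:, 1] < W)
        score = score_map[ij[ok, 0], ij[ok, 1]].sum()   # accumulate evidence
        if score > best_score:
            best, best_score = (dx, dy, yaw), score
    return best

score_map = np.random.rand(200, 200)
map_pts = np.random.rand(50, 2) * 15                    # map elements in metres
cands = [(dx, dy, yaw) for dx in np.linspace(-1, 1, 5)
         for dy in np.linspace(-1, 1, 5)
         for yaw in np.linspace(-0.05, 0.05, 5)]
print(histogram_pose_solver(score_map, map_pts, cands))
```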

In Defense of Clip-based Video Relation Detection

  • paper_url: http://arxiv.org/abs/2307.08984
  • repo_url: None
  • paper_authors: Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, Roger Zimmermann
  • for: This work aims to improve the accuracy and efficiency of Video Visual Relation Detection (VidVRD), which detects visual relation triplets in videos using spatial bounding boxes and temporal boundaries.
  • methods: The study revisits the clip-based paradigm and proposes a Hierarchical Context Model (HCM) that enriches object-based spatial context and relation-based temporal context on clips.
  • results: Extensive experiments on two challenging VidVRD benchmarks show that clip tubelets can outperform most video-based methods, offer more flexibility in model design, and help alleviate the long-term object tracking problem and the loss of temporal information in long-term tubelet feature compression. HCM achieves new state-of-the-art performance.
    Abstract Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos using spatial bounding boxes and temporal boundaries. Existing VidVRD methods can be broadly categorized into bottom-up and top-down paradigms, depending on their approach to classifying relations. Bottom-up methods follow a clip-based approach where they classify relations of short clip tubelet pairs and then merge them into long video relations. On the other hand, top-down methods directly classify long video tubelet pairs. While recent video-based methods utilizing video tubelets have shown promising results, we argue that the effective modeling of spatial and temporal context plays a more significant role than the choice between clip tubelets and video tubelets. This motivates us to revisit the clip-based paradigm and explore the key success factors in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM) that enriches the object-based spatial context and relation-based temporal context based on clips. We demonstrate that using clip tubelets can achieve superior performance compared to most video-based methods. Additionally, using clip tubelets offers more flexibility in model designs and helps alleviate the limitations associated with video tubelets, such as the challenging long-term object tracking problem and the loss of temporal information in long-term tubelet feature compression. Extensive experiments conducted on two challenging VidVRD benchmarks validate that our HCM achieves a new state-of-the-art performance, highlighting the effectiveness of incorporating advanced spatial and temporal context modeling within the clip-based paradigm.

Learned Scalable Video Coding For Humans and Machines

  • paper_url: http://arxiv.org/abs/2307.08978
  • repo_url: None
  • paper_authors: Hadi Hadizadeh, Ivan V. Bajić
  • for: This paper targets video coding that serves automatic video analytics by machines as well as occasional human viewing, rather than human viewing alone.
  • methods: The framework is an end-to-end learnable scalable video codec built on the concept of conditional coding, with a base layer supporting a machine vision task and an enhancement layer supporting input reconstruction for human viewing.
  • results: Experiments on four standard video datasets show that the framework outperforms state-of-the-art learned and conventional codecs in its base layer while maintaining comparable performance on the human vision task in its enhancement layer.
    Abstract Video coding has traditionally been developed to support services such as video streaming, videoconferencing, digital TV, and so on. The main intent was to enable human viewing of the encoded content. However, with the advances in deep neural networks (DNNs), encoded video is increasingly being used for automatic video analytics performed by machines. In applications such as automatic traffic monitoring, analytics such as vehicle detection, tracking and counting, would run continuously, while human viewing could be required occasionally to review potential incidents. To support such applications, a new paradigm for video coding is needed that will facilitate efficient representation and compression of video for both machine and human use in a scalable manner. In this manuscript, we introduce the first end-to-end learnable video codec that supports a machine vision task in its base layer, while its enhancement layer supports input reconstruction for human viewing. The proposed system is constructed based on the concept of conditional coding to achieve better compression gains. Comprehensive experimental evaluations conducted on four standard video datasets demonstrate that our framework outperforms both state-of-the-art learned and conventional video codecs in its base layer, while maintaining comparable performance on the human vision task in its enhancement layer. We will provide the implementation of the proposed system at www.github.com upon completion of the review process.

Deep Physics-Guided Unrolling Generalization for Compressed Sensing

  • paper_url: http://arxiv.org/abs/2307.08950
  • repo_url: https://github.com/guaishou74851/prl
  • paper_authors: Bin Chen, Jiechong Song, Jingfen Xie, Jian Zhang
  • for: This paper proposes a new deep learning method for image compressed sensing (CS).
  • methods: The method generalizes the traditional iterative recovery model from the image domain to a high-dimensional feature domain, with a compact multiscale unrolling architecture that enhances network capacity while keeping real-time inference speeds. Two implementations are provided, PRL-PGD and PRL-RND, based on optimization and range-nullspace decomposition respectively.
  • results: Experiments show that the proposed PRL networks lead other state-of-the-art methods in both performance and efficiency, with large potential for further improvement and application to other inverse imaging problems or optimization models.
    Abstract By absorbing the merits of both the model- and data-driven methods, deep physics-engaged learning scheme achieves high-accuracy and interpretable image reconstruction. It has attracted growing attention and become the mainstream for inverse imaging tasks. Focusing on the image compressed sensing (CS) problem, we find the intrinsic defect of this emerging paradigm, widely implemented by deep algorithm-unrolled networks, in which more plain iterations involving real physics will bring enormous computation cost and long inference time, hindering their practical application. A novel deep $\textbf{P}$hysics-guided un$\textbf{R}$olled recovery $\textbf{L}$earning ($\textbf{PRL}$) framework is proposed by generalizing the traditional iterative recovery model from image domain (ID) to the high-dimensional feature domain (FD). A compact multiscale unrolling architecture is then developed to enhance the network capacity and keep real-time inference speeds. Taking two different perspectives of optimization and range-nullspace decomposition, instead of building an algorithm-specific unrolled network, we provide two implementations: $\textbf{PRL-PGD}$ and $\textbf{PRL-RND}$. Experiments exhibit the significant performance and efficiency leading of PRL networks over other state-of-the-art methods with a large potential for further improvement and real application to other inverse imaging problems or optimization models.
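A minimal sketch of an unrolled proximal-gradient recovery loop with learned proximal steps, the family PRL-PGD belongs to. It stays in the signal domain for brevity, whereas the paper lifts the loop into a high-dimensional feature domain; all network shapes are illustrative.

```python
import torch
import torch.nn as nn

class UnrolledPGD(nn.Module):
    """Toy unrolled proximal gradient for CS recovery, min ||Ax - y||^2,
    with a learned proximal map per iteration (not the paper's PRL-PGD)."""
    def __init__(self, A, iters=5, n=256):
        super().__init__()
        self.A, self.iters = A, iters
        self.step = nn.Parameter(torch.tensor(0.5))   # learned step size
        self.prox = nn.ModuleList(                    # learned proximal maps
            [nn.Sequential(nn.Linear(n, n), nn.ReLU(), nn.Linear(n, n))
             for _ in range(iters)])

    def forward(self, y):
        x = y @ self.A                                # A^T y as initialization
        for k in range(self.iters):
            grad = (x @ self.A.T - y) @ self.A        # gradient of data term
            x = self.prox[k](x - self.step * grad)    # gradient step + prox
        return x

n, m = 256, 64
A = torch.randn(m, n) / m ** 0.5                      # sensing matrix
x_true = torch.randn(8, n)
y = x_true @ A.T                                      # compressed measurements
model = UnrolledPGD(A)
print(model(y).shape)                                 # torch.Size([8, 256])
```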

Experimental Security Analysis of DNN-based Adaptive Cruise Control under Context-Aware Perception Attacks

  • paper_url: http://arxiv.org/abs/2307.08939
  • repo_url: None
  • paper_authors: Xugui Zhou, Anqi Chen, Maxfield Kouzel, Haotian Ren, Morgan McCarty, Cristina Nita-Rotaru, Homa Alemzadeh
  • for: This work evaluates the security of deep neural network (DNN) based Adaptive Cruise Control (ACC) systems against stealthy perturbations injected into camera data to cause forward collisions.
  • methods: A combined knowledge-and-data-driven approach selects the most critical times for triggering the attacks, and a novel optimization-based method adaptively generates image perturbations at run time.
  • results: Experiments on an actual driving dataset and a realistic simulation platform show the proposed attack achieves a 142.9x higher success rate in causing accidents than random attacks and is mitigated 89.6% less by safety features such as AEB and FCW, while remaining stealthy and robust to real-world factors, driver interventions, and dynamic changes in the environment. The study also offers insights into defense strategies.
    Abstract Adaptive Cruise Control (ACC) is a widely used driver assistance feature for maintaining desired speed and safe distance to the leading vehicles. This paper evaluates the security of the deep neural network (DNN) based ACC systems under stealthy perception attacks that strategically inject perturbations into camera data to cause forward collisions. We present a combined knowledge-and-data-driven approach to design a context-aware strategy for the selection of the most critical times for triggering the attacks and a novel optimization-based method for the adaptive generation of image perturbations at run-time. We evaluate the effectiveness of the proposed attack using an actual driving dataset and a realistic simulation platform with the control software from a production ACC system and a physical-world driving simulator while considering interventions by the driver and safety features such as Automatic Emergency Braking (AEB) and Forward Collision Warning (FCW). Experimental results show that the proposed attack achieves 142.9x higher success rate in causing accidents than random attacks and is mitigated 89.6% less by the safety features while being stealthy and robust to real-world factors and dynamic changes in the environment. This study provides insights into the role of human operators and basic safety interventions in preventing attacks.
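A hedged sketch of the optimization-based perturbation generation, implemented here as plain projected gradient steps that inflate a toy perception model's distance estimate; the paper's actual objective, constraints, and context-aware triggering strategy are not reproduced.

```python
import torch
import torch.nn as nn

def perception_attack(model, frame, steps=10, eps=8 / 255, alpha=2 / 255):
    """Projected-gradient sketch of a stealthy perception attack: push the
    perception model's estimated lead-vehicle distance upward so the ACC
    accelerates. `model` maps a camera frame to a distance estimate; the
    loss and budget are illustrative, not the paper's exact formulation."""
    delta = torch.zeros_like(frame, requires_grad=True)
    for _ in range(steps):
        dist = model(frame + delta)          # perceived distance to lead car
        loss = -dist.mean()                  # attacker wants distance inflated
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # gradient step
            delta.clamp_(-eps, eps)              # stay imperceptible
            delta.grad.zero_()
    return (frame + delta).detach()

toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1))
frame = torch.rand(1, 3, 32, 32)
adv = perception_attack(toy_model, frame)
print((adv - frame).abs().max())             # bounded by eps
```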

CSSL-RHA: Contrastive Self-Supervised Learning for Robust Handwriting Authentication

  • paper_url: http://arxiv.org/abs/2307.11100
  • repo_url: None
  • paper_authors: Jingyao Wang, Luntian Mou, Changwen Zheng, Wen Gao
  • for: Handwriting authentication for fields such as fraud prevention and cultural heritage protection.
  • methods: The paper proposes CSSL-RHA, a Contrastive Self-Supervised Learning framework that dynamically learns complex yet important features and accurately predicts writer identities, using an information-theoretic pre-processing filter and an adaptive patch-matching scheme, trained with a momentum-based contrastive paradigm.
  • results: Extensive experiments on five benchmark datasets and the authors' annotated EN-HA dataset show that CSSL-RHA outperforms baselines and remains effective under abnormal circumstances such as data falsification and corruption.
    Abstract Handwriting authentication is a valuable tool used in various fields, such as fraud prevention and cultural heritage protection. However, it remains a challenging task due to the complex features, severe damage, and lack of supervision. In this paper, we propose a novel Contrastive Self-Supervised Learning framework for Robust Handwriting Authentication (CSSL-RHA) to address these issues. It can dynamically learn complex yet important features and accurately predict writer identities. Specifically, to remove the negative effects of imperfections and redundancy, we design an information-theoretic filter for pre-processing and propose a novel adaptive matching scheme to represent images as patches of local regions dominated by more important features. Through online optimization at inference time, the most informative patch embeddings are identified as the "most important" elements. Furthermore, we employ contrastive self-supervised training with a momentum-based paradigm to learn more general statistical structures of handwritten data without supervision. We conduct extensive experiments on five benchmark datasets and our manually annotated dataset EN-HA, which demonstrate the superiority of our CSSL-RHA compared to baselines. Additionally, we show that our proposed model can still effectively achieve authentication even under abnormal circumstances, such as data falsification and corruption.
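A minimal sketch of the momentum-based contrastive training step the abstract describes, in the MoCo style: the key encoder is a momentum copy of the query encoder and two views of the same sample form the positive pair. Encoders and hyperparameters are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def momentum_contrastive_step(q_enc, k_enc, x1, x2, m=0.99, tau=0.2):
    """One MoCo-style step: momentum-update the key encoder, then apply
    an InfoNCE loss with in-batch negatives (a sketch, not CSSL-RHA)."""
    with torch.no_grad():                       # momentum update of key encoder
        for pq, pk in zip(q_enc.parameters(), k_enc.parameters()):
            pk.mul_(m).add_((1 - m) * pq)
    q = F.normalize(q_enc(x1), dim=1)           # queries
    k = F.normalize(k_enc(x2), dim=1).detach()  # keys, no gradient
    logits = q @ k.T / tau                      # all-pairs similarities
    labels = torch.arange(q.size(0))            # positives on the diagonal
    return F.cross_entropy(logits, labels)

q_enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 64))
k_enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 64))
k_enc.load_state_dict(q_enc.state_dict())
loss = momentum_contrastive_step(q_enc, k_enc,
                                 torch.rand(16, 1, 28, 28),
                                 torch.rand(16, 1, 28, 28))
print(loss.item())
```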

Learning to Sample Tasks for Meta Learning

  • paper_url: http://arxiv.org/abs/2307.08924
  • repo_url: https://github.com/ZJLAB-AMMI/HS-OMRL
  • paper_authors: Jingyao Wang, Zeen Song, Xingzhe Su, Lingyu Si, Hongwei Dong, Wenwen Qiang, Changwen Zheng
  • for: Through experiments on various meta-learning methods, task samplers, and few-shot learning tasks, this paper arrives at three conclusions.
  • methods: First, no universal task sampling strategy guarantees the performance of meta-learning models. Second, task diversity can cause the models to either underfit or overfit during training. Third, the models' generalization performance is influenced by task divergence, task entropy, and task difficulty.
  • results: In response, the authors propose a new task sampler called Adaptive Sampler (ASr), a plug-and-play sampler that takes task divergence, task entropy, and task difficulty into account. A simple and general meta-learning algorithm is proposed to optimize ASr, and a large number of empirical experiments demonstrate its effectiveness. A hedged sketch of such a sampler follows the abstract below.
    Abstract Through experiments on various meta-learning methods, task samplers, and few-shot learning tasks, this paper arrives at three conclusions. Firstly, there are no universal task sampling strategies to guarantee the performance of meta-learning models. Secondly, task diversity can cause the models to either underfit or overfit during training. Lastly, the generalization performance of the models are influenced by task divergence, task entropy, and task difficulty. In response to these findings, we propose a novel task sampler called Adaptive Sampler (ASr). ASr is a plug-and-play task sampler that takes task divergence, task entropy, and task difficulty to sample tasks. To optimize ASr, we rethink and propose a simple and general meta-learning algorithm. Finally, a large number of empirical experiments demonstrate the effectiveness of the proposed ASr.
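A hedged guess at what a divergence/entropy/difficulty-weighted task sampler could look like: score candidate tasks linearly, softmax the scores, and sample without replacement. The scoring form and weights are assumptions for illustration, not ASr's published formulation.

```python
import numpy as np

def adaptive_sample(divergence, entropy, difficulty, w, n_tasks=4):
    """Sample a batch of tasks with probabilities shaped by the three
    quantities the paper identifies. The linear score and weights `w`
    are an illustrative guess, not ASr's actual form."""
    score = w[0] * divergence + w[1] * entropy + w[2] * difficulty
    p = np.exp(score - score.max())
    p /= p.sum()                                  # softmax over candidate tasks
    return np.random.choice(len(p), size=n_tasks, replace=False, p=p)

rng = np.random.default_rng(0)
div, ent, dif = rng.random(20), rng.random(20), rng.random(20)
print(adaptive_sample(div, ent, dif, w=np.array([1.0, 0.5, -0.5])))
```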

Accuracy versus time frontiers of semi-supervised and self-supervised learning on medical images

  • paper_url: http://arxiv.org/abs/2307.08919
  • repo_url: https://github.com/tufts-ml/ssl-vs-ssl-benchmark
  • paper_authors: Zhe Huang, Ruijie Jiang, Shuchin Aeron, Michael C. Hughes
  • for: This study provides a carefully designed benchmark to help practitioners maximize classifier performance given a small labeled dataset and a limited budget of training hours.
  • methods: Two major research directions are compared: self-supervised learning, which pretrains useful representations on unlabeled data and then fine-tunes a classifier on the labeled set, and semi-supervised learning, which trains a classifier on labeled and unlabeled data simultaneously.
  • results: Comparing 6 semi-supervised and 5 self-supervised methods against strong labeled-only baselines on 3 medical image datasets with 30-1000 labels per class, the study finds that additional unlabeled data does improve classifiers, with MixMatch, SimCLR, and BYOL as strong choices; the published hyperparameter settings let strong methods perform well on new medical tasks within a few hours.
    Abstract For many applications of classifiers to medical images, a trustworthy label for each image can be difficult or expensive to obtain. In contrast, images without labels are more readily available. Two major research directions both promise that additional unlabeled data can improve classifier performance: self-supervised learning pretrains useful representations on unlabeled data only, then fine-tunes a classifier on these representations via the labeled set; semi-supervised learning directly trains a classifier on labeled and unlabeled data simultaneously. Recent methods from both directions have claimed significant gains on non-medical tasks, but do not systematically assess medical images and mostly compare only to methods in the same direction. This study contributes a carefully-designed benchmark to help answer a practitioner's key question: given a small labeled dataset and a limited budget of hours to spend on training, what gains from additional unlabeled images are possible and which methods best achieve them? Unlike previous benchmarks, ours uses realistic-sized validation sets to select hyperparameters, assesses runtime-performance tradeoffs, and bridges two research fields. By comparing 6 semi-supervised methods and 5 self-supervised methods to strong labeled-only baselines on 3 medical datasets with 30-1000 labels per class, we offer insights to resource-constrained, results-focused practitioners: MixMatch, SimCLR, and BYOL represent strong choices that were not surpassed by more recent methods. After much effort selecting hyperparameters on one dataset, we publish settings that enable strong methods to perform well on new medical tasks within a few hours, with further search over dozens of hours delivering modest additional gains.

Towards the Sparseness of Projection Head in Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2307.08913
  • repo_url: None
  • paper_authors: Zeen Song, Xingzhe Su, Jingyao Wang, Wenwen Qiang, Changwen Zheng, Fuchun Sun
  • for: This work studies contrastive learning, a successful approach in self-supervised learning (SSL), focusing on the internal mechanism of the parameterized projection head and its relationship with the phenomenon of dimensional collapse.
  • methods: Through empirical analysis and theoretical investigation, the paper examines how the projection head affects representation quality and hypothesizes that only a subset of features is necessary to minimize the contrastive loss of a mini-batch.
  • results: Experiments validate that SparseHead, a regularization term that constrains the sparsity of the projection head, improves the performance of existing contrastive methods and can be seamlessly integrated with any SSL approach.
    Abstract In recent years, self-supervised learning (SSL) has emerged as a promising approach for extracting valuable representations from unlabeled data. One successful SSL method is contrastive learning, which aims to bring positive examples closer while pushing negative examples apart. Many current contrastive learning approaches utilize a parameterized projection head. Through a combination of empirical analysis and theoretical investigation, we provide insights into the internal mechanisms of the projection head and its relationship with the phenomenon of dimensional collapse. Our findings demonstrate that the projection head enhances the quality of representations by performing contrastive loss in a projected subspace. Therefore, we propose an assumption that only a subset of features is necessary when minimizing the contrastive loss of a mini-batch of data. Theoretical analysis further suggests that a sparse projection head can enhance generalization, leading us to introduce SparseHead - a regularization term that effectively constrains the sparsity of the projection head, and can be seamlessly integrated with any self-supervised learning (SSL) approaches. Our experimental results validate the effectiveness of SparseHead, demonstrating its ability to improve the performance of existing contrastive methods.
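A minimal sketch of a sparsity regularizer on the projection head that can be added to any contrastive loss. The abstract only states that SparseHead constrains the head's sparsity; the L1 form and coefficient below are our assumption.

```python
import torch

def sparsehead_penalty(projection_head, lam=1e-4):
    """L1 penalty on the projection-head weights, added to the contrastive
    loss to constrain the head's sparsity (an assumed form, not
    necessarily the paper's exact regularizer)."""
    return lam * sum(p.abs().sum() for p in projection_head.parameters())

head = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(),
                           torch.nn.Linear(512, 128))
contrastive_loss = torch.tensor(1.23)        # placeholder for e.g. InfoNCE
total = contrastive_loss + sparsehead_penalty(head)
print(total.item())
```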

What Can Simple Arithmetic Operations Do for Temporal Modeling?

  • paper_url: http://arxiv.org/abs/2307.08908
  • repo_url: https://github.com/whwu95/ATM
  • paper_authors: Wenhao Wu, Yuxin Song, Zhun Sun, Jingdong Wang, Chang Xu, Wanli Ouyang
  • for: This study explores temporal modeling of video content using four simple arithmetic operations to build temporal relations.
  • methods: Auxiliary temporal cues are first captured by computing addition, subtraction, multiplication, and division between pairs of extracted frame features; corresponding features are then extracted from these cues to benefit the original temporal-irrespective domain. The resulting Arithmetic Temporal Module (ATM) plugs into the stem of a visual backbone and works with both CNN- and ViT-based architectures.
  • results: Comprehensive ablations show ATM offers powerful temporal modeling at low computational cost: on Something-Something V1, V2 and Kinetics-400 it reaches top-1 accuracies of 65.6%, 74.6%, and 89.4% respectively. Code is available at https://github.com/whwu95/ATM.
    Abstract Temporal modeling plays a crucial role in understanding video content. To tackle this problem, previous studies built complicated temporal relations through time sequence thanks to the development of computationally powerful devices. In this work, we explore the potential of four simple arithmetic operations for temporal modeling. Specifically, we first capture auxiliary temporal cues by computing addition, subtraction, multiplication, and division between pairs of extracted frame features. Then, we extract corresponding features from these cues to benefit the original temporal-irrespective domain. We term such a simple pipeline as an Arithmetic Temporal Module (ATM), which operates on the stem of a visual backbone with a plug-and-play style. We conduct comprehensive ablation studies on the instantiation of ATMs and demonstrate that this module provides powerful temporal modeling capability at a low computational cost. Moreover, the ATM is compatible with both CNNs- and ViTs-based architectures. Our results show that ATM achieves superior performance over several popular video benchmarks. Specifically, on Something-Something V1, V2 and Kinetics-400, we reach top-1 accuracy of 65.6%, 74.6%, and 89.4% respectively. The code is available at https://github.com/whwu95/ATM.
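A minimal sketch of the four arithmetic cues computed between consecutive frame features. The fusion of these cues back into the backbone, and the exact pairing scheme, are omitted and simplified here.

```python
import torch

def arithmetic_temporal_cues(feats, eps=1e-6):
    """Compute the four arithmetic cues between consecutive frame
    features, in the spirit of ATM. `feats` is (B, T, C); pairing by
    adjacent frames is a simplification."""
    a, b = feats[:, :-1], feats[:, 1:]             # consecutive frame pairs
    add, sub = a + b, a - b
    mul, div = a * b, a / (b + eps)                # eps guards the division
    return torch.cat([add, sub, mul, div], dim=-1)

feats = torch.randn(2, 8, 64)                      # 8 frames, 64-d features
print(arithmetic_temporal_cues(feats).shape)       # torch.Size([2, 7, 256])
```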

Modular Neural Network Approaches for Surgical Image Recognition

  • paper_url: http://arxiv.org/abs/2307.08880
  • repo_url: None
  • paper_authors: Nosseiba Ben Salem, Younes Bennani, Joseph Karkazan, Abir Barbara, Charles Dacheux, Thomas Gregory
  • for: This paper proposes modular deep learning approaches for Dorsal Capsulo-Scapholunate Septum (DCSS) instability classification.
  • methods: Self-training and modular learning are applied: a complex problem is decomposed into simpler sub-tasks, and self-training is used for data labeling and segmentation on shoulder arthroscopy images.
  • results: Experiments show that modular learning improves classification performance over non-modular systems, with the weighted-modular variant, which weights module outputs by the gating module's probabilities, achieving an almost perfect classification.
    Abstract Deep learning-based applications have seen a lot of success in recent years. Text, audio, image, and video have all been explored with great success using deep learning approaches. The use of convolutional neural networks (CNN) in computer vision, in particular, has yielded reliable results. In order to achieve these results, a large amount of data is required. However, the dataset cannot always be accessible. Moreover, annotating data can be difficult and time-consuming. Self-training is a semi-supervised approach that managed to alleviate this problem and achieve state-of-the-art performances. Theoretical analysis even proved that it may result in a better generalization than a normal classifier. Another problem neural networks can face is the increasing complexity of modern problems, requiring a high computational and storage cost. One way to mitigate this issue, a strategy that has been inspired by human cognition known as modular learning, can be employed. The principle of the approach is to decompose a complex problem into simpler sub-tasks. This approach has several advantages, including faster learning, better generalization, and enables interpretability. In the first part of this paper, we introduce and evaluate different architectures of modular learning for Dorsal Capsulo-Scapholunate Septum (DCSS) instability classification. Our experiments have shown that modular learning improves performances compared to non-modular systems. Moreover, we found that weighted modular, that is to weight the output using the probabilities from the gating module, achieved an almost perfect classification. In the second part, we present our approach for data labeling and segmentation with self-training applied on shoulder arthroscopy images.
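A toy sketch of the weighted-modular idea: a gating network produces probabilities over sub-task modules and the final output is the gate-weighted sum of module predictions. Modules are plain linear layers here; the paper uses deep networks, and the sizes below are illustrative.

```python
import torch
import torch.nn as nn

class WeightedModular(nn.Module):
    """Toy weighted-modular classifier: gate probabilities weight the
    outputs of per-sub-task modules (a sketch of the paper's idea)."""
    def __init__(self, in_dim=128, n_modules=3, n_classes=2):
        super().__init__()
        self.modules_ = nn.ModuleList(
            [nn.Linear(in_dim, n_classes) for _ in range(n_modules)])
        self.gate = nn.Linear(in_dim, n_modules)

    def forward(self, x):
        w = torch.softmax(self.gate(x), dim=-1)                   # (B, M)
        outs = torch.stack([m(x) for m in self.modules_], dim=1)  # (B, M, C)
        return (w.unsqueeze(-1) * outs).sum(dim=1)    # weight module outputs

print(WeightedModular()(torch.randn(4, 128)).shape)   # torch.Size([4, 2])
```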

LiDAR-BEVMTN: Real-Time LiDAR Bird’s-Eye View Multi-Task Perception Network for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2307.08850
  • repo_url: None
  • paper_authors: Sambit Mohapatra, Senthil Yogamani, Varun Ravi Kumar, Stefan Milz, Heinrich Gotzig, Patrick Mäder
  • for: This paper presents a real-time multi-task deep learning network for LiDAR-based 3D scene perception in autonomous driving.
  • methods: The architecture uses a shared encoder and task-specific decoders for joint representation learning, and a novel Semantic Weighting and Guidance (SWAG) module transfers semantic features selectively to improve object detection.
  • results: The method achieves 3 ms latency on the embedded NVIDIA Xavier platform, state-of-the-art results for two tasks (semantic and motion segmentation), and close to state-of-the-art performance for 3D object detection.
    Abstract LiDAR is crucial for robust 3D scene perception in autonomous driving. LiDAR perception has the largest body of literature after camera perception. However, multi-task learning across tasks like detection, segmentation, and motion estimation using LiDAR remains relatively unexplored, especially on automotive-grade embedded platforms. We present a real-time multi-task convolutional neural network for LiDAR-based object detection, semantics, and motion segmentation. The unified architecture comprises a shared encoder and task-specific decoders, enabling joint representation learning. We propose a novel Semantic Weighting and Guidance (SWAG) module to transfer semantic features for improved object detection selectively. Our heterogeneous training scheme combines diverse datasets and exploits complementary cues between tasks. The work provides the first embedded implementation unifying these key perception tasks from LiDAR point clouds achieving 3ms latency on the embedded NVIDIA Xavier platform. We achieve state-of-the-art results for two tasks, semantic and motion segmentation, and close to state-of-the-art performance for 3D object detection. By maximizing hardware efficiency and leveraging multi-task synergies, our method delivers an accurate and efficient solution tailored for real-world automated driving deployment. Qualitative results can be seen at https://youtu.be/H-hWRzv2lIY.

DARTS: Double Attention Reference-based Transformer for Super-resolution

  • paper_url: http://arxiv.org/abs/2307.08837
  • repo_url: https://github.com/bia006/darts
  • paper_authors: Masoomeh Aslahishahri, Jordan Ubbens, Ian Stavness
  • for: Reference-based image super-resolution: enhancing the content of low-resolution input images.
  • methods: A transformer model learns joint representations of two image distributions, enhancing low-resolution inputs through correspondences matched from high-resolution reference images, combining self-attention and cross-attention blocks through a gating attention strategy.
  • results: The model reaches state-of-the-art performance on the SUN80 dataset with a PSNR/SSIM of 29.83 / 0.809, showing that attention alone suffices for reference-based super-resolution without purpose-built subnetworks, knowledge distillation, or multi-stage training.
    Abstract We present DARTS, a transformer model for reference-based image super-resolution. DARTS learns joint representations of two image distributions to enhance the content of low-resolution input images through matching correspondences learned from high-resolution reference images. Current state-of-the-art techniques in reference-based image super-resolution are based on a multi-network, multi-stage architecture. In this work, we adapt the double attention block from the GAN literature, processing the two visual streams separately and combining self-attention and cross-attention blocks through a gating attention strategy. Our work demonstrates how the attention mechanism can be adapted for the particular requirements of reference-based image super-resolution, significantly simplifying the architecture and training pipeline. We show that our transformer-based model performs competitively with state-of-the-art models, while maintaining a simpler overall architecture and training process. In particular, we obtain state-of-the-art on the SUN80 dataset, with a PSNR/SSIM of 29.83 / .809. These results show that attention alone is sufficient for the RSR task, without multiple purpose-built subnetworks, knowledge distillation, or multi-stage training.
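A toy double-attention block in the spirit of the abstract: self-attention over the low-resolution stream, cross-attention from LR queries to reference tokens, and a learned gate that mixes the two. Dimensions and the gating form are assumptions, not the paper's exact block.

```python
import torch
import torch.nn as nn

class GatedDoubleAttention(nn.Module):
    """Toy block: self-attention on the LR stream plus cross-attention to
    reference tokens, merged by a learned per-token gate (a sketch)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, lr_tokens, ref_tokens):
        s, _ = self.self_attn(lr_tokens, lr_tokens, lr_tokens)
        c, _ = self.cross_attn(lr_tokens, ref_tokens, ref_tokens)
        g = self.gate(torch.cat([s, c], dim=-1))    # per-token mixing weight
        return g * s + (1 - g) * c

lr, ref = torch.randn(2, 100, 64), torch.randn(2, 400, 64)
print(GatedDoubleAttention()(lr, ref).shape)        # torch.Size([2, 100, 64])
```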

Harnessing the Power of AI based Image Generation Model DALLE 2 in Agricultural Settings

  • paper_url: http://arxiv.org/abs/2307.08789
  • repo_url: None
  • paper_authors: Ranjan Sapkota
  • for: This study investigates how artificial intelligence (AI) can enhance visualization processes in agriculture, using the advanced AI image generator DALLE 2 developed by OpenAI.
  • methods: The approach combines chatGPT's natural language processing proficiency with the DALLE 2 model's generative capability to transform textual descriptors into realistic visual content, evaluated on datasets of fruits, plants, and crop-versus-weed scenarios using MSE, PSNR, and FSIM.
  • results: The DALLE 2 model improves the quality and accuracy of agricultural visualization, aiding more informed decision-making and better resource distribution, and pointing to an imminent AI-led transformation in precision agriculture.
    Abstract This study investigates the potential impact of artificial intelligence (AI) on the enhancement of visualization processes in the agricultural sector, using the advanced AI image generator, DALLE 2, developed by OpenAI. By synergistically utilizing the natural language processing proficiency of chatGPT and the generative prowess of the DALLE 2 model, which employs a Generative Adversarial Networks (GANs) framework, our research offers an innovative method to transform textual descriptors into realistic visual content. Our rigorously assembled datasets include a broad spectrum of agricultural elements such as fruits, plants, and scenarios differentiating crops from weeds, maintained for AI-generated versus original images. The quality and accuracy of the AI-generated images were evaluated via established metrics including mean squared error (MSE), peak signal-to-noise ratio (PSNR), and feature similarity index (FSIM). The results underline the significant role of the DALLE 2 model in enhancing visualization processes in agriculture, aiding in more informed decision-making, and improving resource distribution. The outcomes of this research highlight the imminent rise of an AI-led transformation in the realm of precision agriculture.
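For reference, the two simpler evaluation metrics used in the paper are straightforward to compute; FSIM requires a dedicated implementation and is omitted here.

```python
import numpy as np

def mse(a, b):
    return np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)

def psnr(a, b, peak=255.0):
    """PSNR in dB for 8-bit images; higher means closer to the original."""
    m = mse(a, b)
    return np.inf if m == 0 else 10 * np.log10(peak ** 2 / m)

real = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
generated = np.clip(real + np.random.randint(-10, 10, real.shape), 0, 255)
print(f"MSE={mse(real, generated):.1f}, PSNR={psnr(real, generated):.1f} dB")
```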

The FathomNet2023 Competition Dataset

  • paper_url: http://arxiv.org/abs/2307.08781
  • repo_url: https://github.com/fathomnet/fgvc-comp-2023
  • paper_authors: Eric Orenstein, Kevin Barnard, Lonny Lundsten, Geneviève Patterson, Benjamin Woodward, Kakani Katija
  • for: Studying marine organisms and supporting environmental monitoring with visual data collected by ocean scientists.
  • methods: Automatic processing of visual data, under the extreme variability in sample populations, image quality, and habitat characteristics common in ocean imagery.
  • results: A competition dataset posing a realistic out-of-sample scenario: identify the organisms in a target image and assess whether the image is out-of-sample.
    Abstract Ocean scientists have been collecting visual data to study marine organisms for decades. These images and videos are extremely valuable both for basic science and environmental monitoring tasks. There are tools for automatically processing these data, but none that are capable of handling the extreme variability in sample populations, image quality, and habitat characteristics that are common in visual sampling of the ocean. Such distribution shifts can occur over very short physical distances and in narrow time windows. Creating models that are able to recognize when an image or video sequence contains a new organism, an unusual collection of animals, or is otherwise out-of-sample is critical to fully leverage visual data in the ocean. The FathomNet2023 competition dataset presents a realistic scenario where the set of animals in the target data differs from the training data. The challenge is both to identify the organisms in a target image and assess whether it is out-of-sample.

Similarity Min-Max: Zero-Shot Day-Night Domain Adaptation

  • paper_url: http://arxiv.org/abs/2307.08779
  • repo_url: https://github.com/Red-Fairy/ZeroShotDayNightDA
  • paper_authors: Rundong Luo, Wenjing Wang, Wenhan Yang, Jiaying Liu
  • for: This paper addresses the degraded model performance on nighttime vision tasks, such as recognition and segmentation, caused by low-light conditions.
  • methods: A unified similarity min-max framework enables zero-shot day-night domain adaptation without any nighttime data: on the image level, images are darkened toward minimum feature similarity to enlarge the domain gap; on the model level, feature similarity between darkened images and their normal-light counterparts is maximized for better adaptation.
  • results: Extensive experiments show improved model generalizability and broad applicability across nighttime vision tasks, including classification, semantic segmentation, visual place recognition, and video action recognition.
    Abstract Low-light conditions not only hamper human visual experience but also degrade the model's performance on downstream vision tasks. While existing works make remarkable progress on day-night domain adaptation, they rely heavily on domain knowledge derived from the task-specific nighttime dataset. This paper challenges a more complicated scenario with border applicability, i.e., zero-shot day-night domain adaptation, which eliminates reliance on any nighttime data. Unlike prior zero-shot adaptation approaches emphasizing either image-level translation or model-level adaptation, we propose a similarity min-max paradigm that considers them under a unified framework. On the image level, we darken images towards minimum feature similarity to enlarge the domain gap. Then on the model level, we maximize the feature similarity between the darkened images and their normal-light counterparts for better model adaptation. To the best of our knowledge, this work represents the pioneering effort in jointly optimizing both aspects, resulting in a significant improvement of model generalizability. Extensive experiments demonstrate our method's effectiveness and broad applicability on various nighttime vision tasks, including classification, semantic segmentation, visual place recognition, and video action recognition. Code and pre-trained models are available at https://red-fairy.github.io/ZeroShotDayNightDA-Webpage/.
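A minimal sketch of the two opposing objectives: the darkening module is trained to minimize feature similarity between darkened and normal-light images, while the task model is trained to maximize it. Both networks and the cosine criterion are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def simminmax_losses(darkener, model, img):
    """The two levels of the similarity min-max game, sketched:
    (1) image level: the darkener minimizes feature similarity with the
    normal-light image (enlarging the domain gap);
    (2) model level: the task model maximizes that similarity."""
    dark = darkener(img)
    f_dark, f_norm = model(dark), model(img)
    sim = F.cosine_similarity(f_dark.flatten(1), f_norm.flatten(1)).mean()
    return sim, -sim        # darkener minimizes sim; model minimizes -sim

darkener = torch.nn.Sequential(torch.nn.Conv2d(3, 3, 3, padding=1),
                               torch.nn.Sigmoid())
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1),
                            torch.nn.AdaptiveAvgPool2d(1))
loss_darkener, loss_model = simminmax_losses(darkener, model,
                                             torch.rand(2, 3, 64, 64))
print(loss_darkener.item(), loss_model.item())
```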

UPSCALE: Unconstrained Channel Pruning

  • paper_url: http://arxiv.org/abs/2307.08771
  • repo_url: https://github.com/apple/ml-upscale
  • paper_authors: Alvin Wan, Hanxiang Hao, Kaushik Patnaik, Yueyang Xu, Omer Hadad, David Güera, Zhile Ren, Qi Shan
  • for: Speeding up convolutional neural network inference through channel pruning, improving model compression.
  • methods: Channel pruning on multi-branch segments can introduce inference-time memory copies, which increase latency; conventional pruners constrain certain channels to be pruned together, which eliminates the copies but significantly impairs accuracy. UPSCALE instead reorders channels at export time, supporting any pruning pattern.
  • results: Reordering channels reduces memory copies (up to 2x faster inference than a baseline export), and removing constraints improves accuracy: post-training pruned models gain 2.1 points of ImageNet accuracy on average, benefiting DenseNet (+16.9), EfficientNetV2 (+7.9), and ResNet (+6.2).
    Abstract As neural networks grow in size and complexity, inference speeds decline. To combat this, one of the most effective compression techniques -- channel pruning -- removes channels from weights. However, for multi-branch segments of a model, channel removal can introduce inference-time memory copies. In turn, these copies increase inference latency -- so much so that the pruned model can be slower than the unpruned model. As a workaround, pruners conventionally constrain certain channels to be pruned together. This fully eliminates memory copies but, as we show, significantly impairs accuracy. We now have a dilemma: Remove constraints but increase latency, or add constraints and impair accuracy. In response, our insight is to reorder channels at export time, (1) reducing latency by reducing memory copies and (2) improving accuracy by removing constraints. Using this insight, we design a generic algorithm UPSCALE to prune models with any pruning pattern. By removing constraints from existing pruners, we improve ImageNet accuracy for post-training pruned models by 2.1 points on average -- benefiting DenseNet (+16.9), EfficientNetV2 (+7.9), and ResNet (+6.2). Furthermore, by reordering channels, UPSCALE improves inference speeds by up to 2x over a baseline export.
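A toy single-branch illustration of channel pruning by index selection: output channels of one convolution and the matching input channels of the next are removed, with the kept indices sorted so the surviving channels are contiguous, echoing the export-time reordering idea. Multi-branch segments, where the memory-copy problem actually arises, are not modeled here.

```python
import torch
import torch.nn as nn

def prune_pair(conv1, conv2, keep):
    """Prune conv1's output channels and the matching input channels of
    conv2. Sorting `keep` mirrors the idea of reordering channels at
    export so the surviving ones are contiguous (toy illustration)."""
    keep = torch.as_tensor(sorted(keep))
    new1 = nn.Conv2d(conv1.in_channels, len(keep), conv1.kernel_size[0],
                     padding=conv1.padding[0])
    new1.weight.data = conv1.weight.data[keep].clone()
    new1.bias.data = conv1.bias.data[keep].clone()
    new2 = nn.Conv2d(len(keep), conv2.out_channels, conv2.kernel_size[0],
                     padding=conv2.padding[0])
    new2.weight.data = conv2.weight.data[:, keep].clone()
    new2.bias.data = conv2.bias.data.clone()
    return new1, new2

c1, c2 = nn.Conv2d(3, 8, 3, padding=1), nn.Conv2d(8, 16, 3, padding=1)
p1, p2 = prune_pair(c1, c2, keep=[0, 2, 5, 7])       # keep 4 of 8 channels
print(p2(p1(torch.randn(1, 3, 32, 32))).shape)       # torch.Size([1, 16, 32, 32])
```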

Video-Mined Task Graphs for Keystep Recognition in Instructional Videos

  • paper_url: http://arxiv.org/abs/2307.08763
  • repo_url: None
  • paper_authors: Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras, Kristen Grauman
  • for: This paper aims to improve AI understanding of human actions in instructional videos, such as the steps of a recipe or a DIY fix-it task.
  • methods: The paper proposes automatically mining a task graph from how-to videos, representing probabilistically how people tend to execute keysteps, and leveraging this graph to regularize keystep recognition in novel videos.
  • results: On multiple real-world instructional video datasets, the approach yields more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.
    Abstract Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state -- such as the steps of a recipe or a DIY fix-it task. Prior work largely treats keystep recognition in isolation of this broader structure, or else rigidly confines keysteps to align with a predefined sequential script. We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps, and then leverage this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional videos, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.

Diffusion Models Beat GANs on Image Classification

  • paper_url: http://arxiv.org/abs/2307.08702
  • repo_url: None
  • paper_authors: Soumik Mukhopadhyay, Matthew Gwilliam, Vatsal Agarwal, Namitha Padmanabhan, Archana Swaminathan, Srinidhi Hegde, Tianyi Zhou, Abhinav Shrivastava
  • for: The paper explores the possibility of a unified representation learner that addresses both generative and discriminative tasks simultaneously, with diffusion models as a prime candidate.
  • methods: A U-Net is trained as a diffusion model for image generation, denoising, inpainting, super-resolution, and manipulation; the intermediate feature maps it produces are extracted and pooled, and optimal methods for using these embeddings for classification are explored.
  • results: The diffusion model outperforms comparable generative-discriminative methods such as BigBiGAN on ImageNet classification, and with careful feature selection and pooling achieves promising results on several fine-grained visual classification datasets.
    Abstract While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which uses a single pre-training stage to address both families of tasks simultaneously. We identify diffusion models as a prime candidate. Diffusion models have risen to prominence as a state-of-the-art method for image generation, denoising, inpainting, super-resolution, manipulation, etc. Such models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high fidelity, diverse, novel images. The U-Net architecture, as a convolution-based architecture, generates a diverse set of feature representations in the form of intermediate feature maps. We present our findings that these embeddings are useful beyond the noise prediction task, as they contain discriminative information and can also be leveraged for classification. We explore optimal methods for extracting and using these embeddings for classification tasks, demonstrating promising results on the ImageNet classification task. We find that with careful feature selection and pooling, diffusion models outperform comparable generative-discriminative methods such as BigBiGAN for classification tasks. We investigate diffusion models in the transfer learning regime, examining their performance on several fine-grained visual classification datasets. We compare these embeddings to those generated by competing architectures and pre-trainings for classification tasks.
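A minimal sketch of the linear-probing recipe: hook an intermediate block of a (here, toy) denoising backbone, pool its feature map, and train a linear classifier on the frozen features. The backbone, block choice, and pooling are illustrative stand-ins for the paper's U-Net feature selection.

```python
import torch
import torch.nn as nn

# Toy stand-in for a diffusion U-Net; we probe an intermediate block's
# activations the way the paper probes U-Net feature maps.
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.SiLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),   # "mid" block
    nn.Conv2d(64, 32, 3, padding=1),
)
features = {}
backbone[2].register_forward_hook(
    lambda mod, inp, out: features.update(mid=out))

def extract(x):
    with torch.no_grad():
        backbone(x)                        # run the (frozen) denoiser once
    f = features["mid"]                    # (B, 64, H/2, W/2) feature map
    return f.mean(dim=(2, 3))              # global average pool -> (B, 64)

probe = nn.Linear(64, 1000)                # linear classifier on frozen feats
logits = probe(extract(torch.randn(4, 3, 64, 64)))
print(logits.shape)                        # torch.Size([4, 1000])
```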

Flow Matching in Latent Space

  • paper_url: http://arxiv.org/abs/2307.08698
  • repo_url: https://github.com/vinairesearch/lfm
  • paper_authors: Quan Dao, Hao Phung, Binh Nguyen, Anh Tran
  • for: This paper proposes a flow-matching-based generative model for high-resolution image synthesis that can be trained under constrained computational resources while maintaining quality and flexibility.
  • methods: Flow matching is applied in the latent space of pretrained autoencoders, improving computational efficiency and scalability, and various conditions are integrated for conditional generation tasks including label-conditioned image generation, image inpainting, and semantic-to-image generation.
  • results: Extensive experiments demonstrate strong quantitative and qualitative results on datasets such as CelebA-HQ, FFHQ, LSUN Church & Bedroom, and ImageNet; the paper also gives a theoretical bound showing the Wasserstein-2 distance between the reconstructed latent flow distribution and the true data distribution is upper-bounded by the latent flow matching objective.
    Abstract Flow matching is a recent framework to train generative models that exhibits impressive empirical performance while being relatively easier to train compared with diffusion-based models. Despite its advantageous properties, prior methods still face the challenges of expensive computing and a large number of function evaluations of off-the-shelf solvers in the pixel space. Furthermore, although latent-based generative methods have shown great success in recent years, this particular model type remains underexplored in this area. In this work, we propose to apply flow matching in the latent spaces of pretrained autoencoders, which offers improved computational efficiency and scalability for high-resolution image synthesis. This enables flow-matching training on constrained computational resources while maintaining their quality and flexibility. Additionally, our work stands as a pioneering contribution in the integration of various conditions into flow matching for conditional generation tasks, including label-conditioned image generation, image inpainting, and semantic-to-image generation. Through extensive experiments, our approach demonstrates its effectiveness in both quantitative and qualitative results on various datasets, such as CelebA-HQ, FFHQ, LSUN Church & Bedroom, and ImageNet. We also provide a theoretical control of the Wasserstein-2 distance between the reconstructed latent flow distribution and true data distribution, showing it is upper-bounded by the latent flow matching objective. Our code will be available at https://github.com/VinAIResearch/LFM.git.
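A minimal flow-matching training step in latent space under standard assumptions: interpolate linearly between a noise sample and a data latent, and regress the network onto the constant velocity of that path. Here the "latent" is a random vector; in the paper it would come from a pretrained autoencoder's encoder.

```python
import torch
import torch.nn as nn

def flow_matching_loss(v_net, z1):
    """One flow-matching step: sample noise z0, interpolate
    z_t = (1-t) z0 + t z1, and regress v_theta onto z1 - z0."""
    z0 = torch.randn_like(z1)                        # noise endpoint
    t = torch.rand(z1.size(0), 1)                    # per-sample time in [0,1]
    zt = (1 - t) * z0 + t * z1                       # straight-line path
    target = z1 - z0                                 # path velocity
    pred = v_net(torch.cat([zt, t], dim=1))          # v_theta(z_t, t)
    return ((pred - target) ** 2).mean()

d = 16                                               # latent dimensionality
v_net = nn.Sequential(nn.Linear(d + 1, 64), nn.SiLU(), nn.Linear(64, d))
print(flow_matching_loss(v_net, torch.randn(32, d)).item())
```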

Neural Video Depth Stabilizer

  • paper_url: http://arxiv.org/abs/2307.08695
  • repo_url: https://github.com/raymondwang987/nvds
  • paper_authors: Yiran Wang, Min Shi, Jiaqi Li, Zihao Huang, Zhiguo Cao, Jianming Zhang, Ke Xian, Guosheng Lin
  • for: Temporally consistent video depth estimation, improving the accuracy of depth inferred from video.
  • methods: The paper proposes Neural Video Depth Stabilizer (NVDS), a plug-and-play framework that stabilizes inconsistent depth estimations and can be applied to different single-image depth models without extra training or fine-tuning effort.
  • results: Evaluated on the newly introduced large-scale Video Depth in the Wild (VDW) dataset (14,203 videos, over two million frames) and two public benchmarks, the method shows significant improvements in consistency, accuracy, and efficiency over previous approaches.
    Abstract Video depth estimation aims to infer temporally consistent depth. Some methods achieve temporal consistency by finetuning a single-image depth model during test time using geometry and re-projection constraints, which is inefficient and not robust. An alternative approach is to learn how to enforce temporal consistency from data, but this requires well-designed models and sufficient video depth data. To address these challenges, we propose a plug-and-play framework called Neural Video Depth Stabilizer (NVDS) that stabilizes inconsistent depth estimations and can be applied to different single-image depth models without extra effort. We also introduce a large-scale dataset, Video Depth in the Wild (VDW), which consists of 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset to our knowledge. We evaluate our method on the VDW dataset as well as two public benchmarks and demonstrate significant improvements in consistency, accuracy, and efficiency compared to previous approaches. Our work serves as a solid baseline and provides a data foundation for learning-based video depth models. We will release our dataset and code for future research.

SEMI-DiffusionInst: A Diffusion Model Based Approach for Semiconductor Defect Classification and Segmentation

  • paper_url: http://arxiv.org/abs/2307.08693
  • repo_url: None
  • paper_authors: Vic De Ridder, Bappaditya Dey, Sandip Halder, Bartel Van Waeyenberge
  • for: Proposes a new semiconductor defect inspection framework, "SEMI-DiffusionInst", and compares it with previous frameworks.
  • methods: Uses a diffusion model and investigates different feature extractor networks and data sampling strategies to achieve a balanced trade-off between precision and computational efficiency.
  • results: Improves both overall mAP and segmentation mAP, performing better than or on par with previous work for almost all defect classes; detection precision on line collapse and thin bridge defects improves by approximately 15%, and tuning inference hyperparameters significantly improves inference time without compromising model precision.
    Abstract With continuous progression of Moore's Law, integrated circuit (IC) device complexity is also increasing. Scanning Electron Microscope (SEM) image based extensive defect inspection and accurate metrology extraction are two main challenges in advanced node (2 nm and beyond) technology. Deep learning (DL) algorithm based computer vision approaches have gained popularity in semiconductor defect inspection over the last few years. In this research work, a new semiconductor defect inspection framework "SEMI-DiffusionInst" is investigated and compared to previous frameworks. To the best of the authors' knowledge, this work is the first demonstration to accurately detect and precisely segment semiconductor defect patterns by using a diffusion model. Different feature extractor networks as backbones and data sampling strategies are investigated towards achieving a balanced trade-off between precision and computing efficiency. Our proposed approach outperforms previous work on overall mAP and performs comparably or better for almost all defect classes (per-class APs). The bounding box and segmentation mAPs achieved by the proposed SEMI-DiffusionInst model are improved by 3.83% and 2.10%, respectively. Among individual defect types, precision on line collapse and thin bridge defects is improved by approximately 15% on the detection task for both defect types. It has also been shown that by tuning inference hyperparameters, inference time can be improved significantly without compromising model precision. Finally, certain limitations and a future work strategy to overcome them are discussed.

Semantic Counting from Self-Collages

  • paper_url: http://arxiv.org/abs/2307.08727
  • repo_url: https://github.com/lukasknobel/selfcollages
  • paper_authors: Lukas Knobel, Tengda Han, Yuki M. Asano
  • for: Learning the object counting task without requiring any manually annotated data.
  • methods: Uses self-constructed "SelfCollages", images with various pasted objects, as training samples, building on existing unsupervised representation and segmentation techniques.
  • results: Outperforms simple baselines and generic models such as FasterRCNN, and matches the performance of supervised counting models in some domains.
    Abstract While recent supervised methods for reference-based object counting continue to improve the performance on benchmark datasets, they have to rely on small datasets due to the cost associated with manually annotating dozens of objects in images. We propose Unsupervised Counter (UnCo), a model that can learn this task without requiring any manual annotations. To this end, we construct "SelfCollages", images with various pasted objects as training samples, that provide a rich learning signal covering arbitrary object types and counts. Our method builds on existing unsupervised representations and segmentation techniques to successfully demonstrate the ability to count objects without manual supervision. Our experiments show that our method not only outperforms simple baselines and generic models such as FasterRCNN, but also matches the performance of supervised counting models in some domains.
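
The SelfCollage construction itself is straightforward to sketch: paste a random number of object crops onto a background and use that number as a free count label. The PIL code below is a hedged illustration, with the unsupervised segmentation step abstracted away as pre-cut RGBA crops assumed to be smaller than the background:

```python
# Sketch of building a "SelfCollage": paste n object crops onto a background
# and record n as the count label. Crops with alpha masks are assumed to come
# from an unsupervised segmentation step and to fit inside the background.
import random
from PIL import Image

def make_self_collage(background: Image.Image, objects: list,
                      max_count: int = 8):
    canvas = background.copy()
    n = random.randint(1, max_count)
    for _ in range(n):
        obj = random.choice(objects).convert("RGBA")
        x = random.randint(0, canvas.width - obj.width)
        y = random.randint(0, canvas.height - obj.height)
        canvas.paste(obj, (x, y), mask=obj)   # alpha channel acts as paste mask
    return canvas, n                          # (training image, count label)
```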

Implementation of a perception system for autonomous vehicles using a detection-segmentation network in SoC FPGA

  • paper_url: http://arxiv.org/abs/2307.08682
  • repo_url: https://github.com/vision-agh/mt_kria
  • paper_authors: Maciej Baczmanski, Mateusz Wasala, Tomasz Kryjak
  • for: Developing an efficient, real-time, energy-efficient perception and control system for autonomous vehicles.
  • methods: Builds on the MultiTaskV3 detection-segmentation network, appropriately trained, quantised, and implemented with parallelised, accelerated computation on the AMD Xilinx Kria KV260 Vision AI embedded platform.
  • results: Tested in a mock-up city road environment, the system achieves above 97% detection accuracy (mAP) and above 90% image segmentation accuracy (mIoU), while offering low power consumption and a small form factor.
    Abstract Perception and control systems for autonomous vehicles are an active area of scientific and industrial research. These solutions should be characterised by high efficiency in recognising obstacles and other environmental elements in different road conditions, real-time capability, and energy efficiency. Achieving such functionality requires an appropriate algorithm and a suitable computing platform. In this paper, we have used the MultiTaskV3 detection-segmentation network as the basis for a perception system that can perform both functionalities within a single architecture. It was appropriately trained, quantised, and implemented on the AMD Xilinx Kria KV260 Vision AI embedded platform. By using this device, it was possible to parallelise and accelerate the computations. Furthermore, the whole system consumes relatively little power compared to a CPU-based implementation (an average of 5 watts, compared to a minimum of 55 watts for weaker CPUs). The small size (119mm x 140mm x 36mm) of the platform also allows it to be used in devices where the amount of space available is limited. It also achieves an accuracy higher than 97% of the mAP (mean average precision) for object detection and above 90% of the mIoU (mean intersection over union) for image segmentation. The article also details the design of the Mecanum wheel vehicle, which was used to test the proposed solution in a mock-up city.

CohortFinder: an open-source tool for data-driven partitioning of biomedical image cohorts to yield robust machine learning models

  • paper_url: http://arxiv.org/abs/2307.08673
  • repo_url: None
  • paper_authors: Fan Fan, Georgia Martinez, Thomas Desilvio, John Shin, Yijiang Chen, Bangchen Wang, Takaya Ozeki, Maxime W. Lafarge, Viktor H. Koelzer, Laura Barisoni, Anant Madabhushi, Satish E. Viswanath, Andrew Janowczyk
  • for: Mitigating batch effects (BEs) in machine learning (ML) models, specifically through data-driven cohort partitioning.
  • methods: Releases an open-source tool called CohortFinder that mitigates BEs via data-driven cohort partitioning.
  • results: Shows that CohortFinder improves ML model performance in downstream medical image processing tasks.
    Abstract Batch effects (BEs) refer to systematic technical differences in data collection unrelated to biological variations whose noise is shown to negatively impact machine learning (ML) model generalizability. Here we release CohortFinder, an open-source tool aimed at mitigating BEs via data-driven cohort partitioning. We demonstrate CohortFinder improves ML model performance in downstream medical image processing tasks. CohortFinder is freely available for download at cohortfinder.com.
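
The underlying idea of data-driven cohort partitioning can be sketched as clustering images by batch-effect proxies and then keeping whole clusters on one side of the train/test split, so batch signatures do not leak across it. The features and clustering choice below are illustrative, not CohortFinder's exact pipeline:

```python
# Sketch of data-driven cohort partitioning: cluster images by simple
# statistics (a stand-in for batch-effect signatures), then assign whole
# clusters to either the train or test side of the split.
import numpy as np
from sklearn.cluster import KMeans

def partition_cohorts(features: np.ndarray, n_clusters: int = 4,
                      test_fraction: float = 0.25, seed: int = 0):
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(features)
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_clusters)
    n_test = max(1, int(round(test_fraction * n_clusters)))
    test_clusters = order[:n_test]
    is_test = np.isin(labels, test_clusters)
    return np.where(~is_test)[0], np.where(is_test)[0]   # train, test indices

# features: one row per image, e.g. per-channel mean/std as batch-effect proxies
features = np.random.rand(100, 6)
train_idx, test_idx = partition_cohorts(features)
```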

PolyGNN: Polyhedron-based Graph Neural Network for 3D Building Reconstruction from Point Clouds

  • paper_url: http://arxiv.org/abs/2307.08636
  • repo_url: https://github.com/chenzhaiyu/polygnn
  • paper_authors: Zhaiyu Chen, Yilei Shi, Liangliang Nan, Zhitong Xiong, Xiao Xiang Zhu
  • for: Developing a polyhedron-based graph neural network for 3D building reconstruction from point clouds.
  • methods: Obtains primitives via polyhedral decomposition and learns to assemble them through graph node classification; proposes three different sampling strategies to select representative points for arbitrary-shaped polyhedra, and incorporates inter-polyhedron adjacency to enhance node classification.
  • results: Developed on a large-scale synthetic dataset covering 500k+ buildings and validated with a transferability analysis across cities and on real-world point clouds; both qualitative and quantitative results demonstrate the method's effectiveness, particularly its efficiency for large-scale reconstructions.
    Abstract We present PolyGNN, a polyhedron-based graph neural network for 3D building reconstruction from point clouds. PolyGNN learns to assemble primitives obtained by polyhedral decomposition via graph node classification, achieving a watertight, compact, and weakly semantic reconstruction. To effectively represent arbitrary-shaped polyhedra in the neural network, we propose three different sampling strategies to select representative points as polyhedron-wise queries, enabling efficient occupancy inference. Furthermore, we incorporate the inter-polyhedron adjacency to enhance the classification of the graph nodes. We also observe that existing city-building models are abstractions of the underlying instances. To address this abstraction gap and provide a fair evaluation of the proposed method, we develop our method on a large-scale synthetic dataset covering 500k+ buildings with well-defined ground truths of polyhedral class labels. We further conduct a transferability analysis across cities and on real-world point clouds. Both qualitative and quantitative results demonstrate the effectiveness of our method, particularly its efficiency for large-scale reconstructions. The source code and data of our work are available at https://github.com/chenzhaiyu/polygnn.
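
As a concrete example of selecting representative points as polyhedron-wise queries, farthest point sampling is one plausible strategy; the paper proposes three, and the numpy sketch below is only an illustration:

```python
# Illustrative farthest point sampling (FPS) over the points associated with
# one polyhedron: one plausible way to pick k representative query points.
import numpy as np

def farthest_point_sampling(points: np.ndarray, k: int) -> np.ndarray:
    n = points.shape[0]
    chosen = [0]                                   # start from an arbitrary point
    dists = np.linalg.norm(points - points[0], axis=1)
    for _ in range(1, min(k, n)):
        idx = int(np.argmax(dists))                # farthest from the chosen set
        chosen.append(idx)
        dists = np.minimum(dists, np.linalg.norm(points - points[idx], axis=1))
    return points[chosen]

queries = farthest_point_sampling(np.random.rand(500, 3), k=32)
```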

Deficiency-Aware Masked Transformer for Video Inpainting

  • paper_url: http://arxiv.org/abs/2307.08629
  • repo_url: https://github.com/yeates/dmt
  • paper_authors: Yongsheng Yu, Heng Fan, Libo Zhang
  • for: Video inpainting, specifically filling in corrupted regions of a video even when cross-frame recurrence of the masked content is unavailable.
  • methods: Introduces the Deficiency-aware Masked Transformer (DMT) framework with three key advantages: an image inpainting model DMT_img is pretrained as a prior for distilling the video model DMT_vid, improving hallucination in deficiency cases; a self-attention module selectively incorporates spatiotemporal tokens to accelerate inference and remove noise signals; and a simple yet effective Receptive Field Contextualizer further improves performance.
  • results: Extensive experiments on the YouTube-VOS and DAVIS datasets show that DMT_vid significantly outperforms previous solutions in accuracy and stability.
    Abstract Recent video inpainting methods have made remarkable progress by utilizing explicit guidance, such as optical flow, to propagate cross-frame pixels. However, there are cases where cross-frame recurrence of the masked video is not available, resulting in a deficiency. In such situation, instead of borrowing pixels from other frames, the focus of the model shifts towards addressing the inverse problem. In this paper, we introduce a dual-modality-compatible inpainting framework called Deficiency-aware Masked Transformer (DMT), which offers three key advantages. Firstly, we pretrain a image inpainting model DMT_img serve as a prior for distilling the video model DMT_vid, thereby benefiting the hallucination of deficiency cases. Secondly, the self-attention module selectively incorporates spatiotemporal tokens to accelerate inference and remove noise signals. Thirdly, a simple yet effective Receptive Field Contextualizer is integrated into DMT, further improving performance. Extensive experiments conducted on YouTube-VOS and DAVIS datasets demonstrate that DMT_vid significantly outperforms previous solutions. The code and video demonstrations can be found at github.com/yeates/DMT.

Benchmarking fixed-length Fingerprint Representations across different Embedding Sizes and Sensor Types

  • paper_url: http://arxiv.org/abs/2307.08615
  • repo_url: https://github.com/tim-rohwedder/fixed-length-fingerprint-extractors
  • paper_authors: Tim Rohwedder, Daile Osorio-Roig, Christian Rathgeb, Christoph Busch
  • for: Improving the computational efficiency of fingerprint recognition by reducing the dimensionality of texture information while preserving high biometric performance.
  • methods: Uses deep neural networks to extract fixed-length fingerprint embeddings, and analyses the impact of sensor type (optical vs. capacitive) and of rotation and translation of fingerprint images.
  • results: Experiments on a publicly available database reveal an optimal embedding size of 512 feature elements for the texture-based part of fixed-length fingerprint representations, with perceptible performance differences between sensor types.
    Abstract Traditional minutiae-based fingerprint representations consist of a variable-length set of minutiae. This necessitates a more complex comparison causing the drawback of high computational cost in one-to-many comparison. Recently, deep neural networks have been proposed to extract fixed-length embeddings from fingerprints. In this paper, we explore to what extent fingerprint texture information contained in such embeddings can be reduced in terms of dimension while preserving high biometric performance. This is of particular interest since it would allow to reduce the number of operations incurred at comparisons. We also study the impact in terms of recognition performance of the fingerprint textural information for two sensor types, i.e. optical and capacitive. Furthermore, the impact of rotation and translation of fingerprint images on the extraction of fingerprint embeddings is analysed. Experimental results conducted on a publicly available database reveal an optimal embedding size of 512 feature elements for the texture-based embedding part of fixed-length fingerprint representations. In addition, differences in performance between sensor types can be perceived.
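
With fixed-length representations, the one-to-many comparison that is expensive for variable-length minutiae sets collapses to a single matrix-vector product. A minimal sketch, assuming the 512-dimensional embeddings (the size the paper finds optimal) are already extracted:

```python
# One-to-many fingerprint comparison with fixed-length embeddings reduces to
# a matrix-vector product after L2 normalization (cosine similarity).
import numpy as np

def cosine_scores(probe: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    probe = probe / np.linalg.norm(probe)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return gallery @ probe                 # one score per enrolled subject

gallery = np.random.randn(10_000, 512)     # 512-d per the paper's finding
probe = np.random.randn(512)
best_match = int(np.argmax(cosine_scores(probe, gallery)))
```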

cs.AI - 2023-07-18

CertPri: Certifiable Prioritization for Deep Neural Networks via Movement Cost in Feature Space

  • paper_url: http://arxiv.org/abs/2307.09375
  • repo_url: None
  • paper_authors: Haibin Zheng, Jinyin Chen, Haibo Jin
  • for: Improving the quality of deep neural network (DNN)-based software, in particular by prioritizing test inputs so that bug-revealing misbehaviors are identified and repaired earlier.
  • methods: Proposes CertPri, a test input prioritization technique based on the movement cost of test inputs in the DNN's feature space; it provides a formal robustness guarantee and can be applied across different tasks, data, models, and scenarios.
  • results: Evaluated on two tasks (classification and regression), six data forms, four model structures, and two scenarios (white-box and black-box), CertPri improves prioritization effectiveness by 53.97% on average over baselines, with robustness and generalizability 1.41~2.00 times and 1.33~3.39 times those of baselines on average, respectively.
    Abstract Deep neural networks (DNNs) have demonstrated their outperformance in various software systems, but also exhibit misbehavior and even result in irreversible disasters. Therefore, it is crucial to identify the misbehavior of DNN-based software and improve DNNs' quality. Test input prioritization is one of the most appealing ways to guarantee DNNs' quality, which prioritizes test inputs so that more bug-revealing inputs can be identified earlier with limited time and manual labeling efforts. However, the existing prioritization methods are still limited from three aspects: certifiability, effectiveness, and generalizability. To overcome the challenges, we propose CertPri, a test input prioritization technique designed based on a movement cost perspective of test inputs in DNNs' feature space. CertPri differs from previous works in three key aspects: (1) certifiable: it provides a formal robustness guarantee for the movement cost; (2) effective: it leverages formally guaranteed movement costs to identify malicious bug-revealing inputs; and (3) generic: it can be applied to various tasks, data, models, and scenarios. Extensive evaluations across 2 tasks (i.e., classification and regression), 6 data forms, 4 model structures, and 2 scenarios (i.e., white-box and black-box) demonstrate CertPri's superior performance. For instance, it significantly improves 53.97% prioritization effectiveness on average compared with baselines. Its robustness and generalizability are 1.41~2.00 times and 1.33~3.39 times that of baselines on average, respectively.
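
As a rough intuition for movement-cost-based prioritization (not CertPri's actual certifiable formulation), one can rank test inputs by how cheaply their feature-space representation could be moved toward a different class; a hypothetical sketch:

```python
# Rough intuition sketch (not CertPri's formulation): rank test inputs by how
# cheaply their feature vectors could be moved toward another class centroid;
# a small movement cost suggests a likely bug-revealing input, tested first.
import numpy as np

def movement_cost_ranking(feats: np.ndarray, preds: np.ndarray,
                          centroids: np.ndarray) -> np.ndarray:
    costs = []
    for f, c in zip(feats, preds):
        d = np.linalg.norm(centroids - f, axis=1)   # distance to every class
        d[c] = np.inf                               # ignore the predicted class
        costs.append(d.min())                       # cost to nearest other class
    return np.argsort(costs)                        # ascending: cheapest first

feats = np.random.randn(1000, 128)                  # penultimate-layer features
preds = np.random.randint(0, 10, size=1000)
centroids = np.random.randn(10, 128)                # per-class feature centroids
priority_order = movement_cost_ranking(feats, preds, centroids)
```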

Local Minima Drive Communications in Cooperative Interaction

  • paper_url: http://arxiv.org/abs/2307.09364
  • repo_url: None
  • paper_authors: Roger K. Moore
  • for: Investigating precisely when an agent should decide to communicate in human-robot interaction, particularly during cooperative tasks.
  • methods: Uses Perceptual Control Theory (PCT), which shows that agents sharing the same intention can complete a joint task without communication as long as the goal is observable, the combined actions are sufficient, and the search space contains no local minimum; when local minima exist, the global solution can only be reached through appropriately timed communication.
  • results: These principles are verified in a computer-based simulation environment in which two independent one-dimensional agents cooperate to solve a two-dimensional path-finding task.
    Abstract An important open question in human-robot interaction (HRI) is precisely when an agent should decide to communicate, particularly in a cooperative task. Perceptual Control Theory (PCT) tells us that agents are able to cooperate on a joint task simply by sharing the same 'intention', thereby distributing the effort required to complete the task among the agents. This is even true for agents that do not possess the same abilities, so long as the goal is observable, the combined actions are sufficient to complete the task, and there is no local minimum in the search space. If these conditions hold, then a cooperative task can be accomplished without any communication between the contributing agents. However, for tasks that do contain local minima, the global solution can only be reached if at least one of the agents adapts its intention at the appropriate moments, and this can only be achieved by appropriately timed communication. In other words, it is hypothesised that in cooperative tasks, the function of communication is to coordinate actions in a complex search space that contains local minima. These principles have been verified in a computer-based simulation environment in which two independent one-dimensional agents are obliged to cooperate in order to solve a two-dimensional path-finding task.

MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments

  • paper_url: http://arxiv.org/abs/2307.09361
  • repo_url: None
  • paper_authors: Spyros Gidaris, Andrei Bursuc, Oriane Simeoni, Antonin Vobecky, Nikos Komodakis, Matthieu Cord, Patrick Pérez
  • for: Reducing Vision Transformer networks' need for very large fully-annotated datasets via self-supervised learning.
  • methods: Unifies masked image modeling and contrastive strategies in a single-stage, standalone method (MOCA), using mask-and-predict objectives defined on high-level features rather than pixel-level details.
  • results: Achieves new state-of-the-art results in low-shot settings and strong experimental results across various evaluation protocols, with training at least 3 times faster than prior methods.
    Abstract Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks for very large fully-annotated datasets. Different classes of self-supervised learning offer representations with either good contextual reasoning properties, e.g., using masked image modeling strategies, or invariance to image perturbations, e.g., with contrastive methods. In this work, we propose a single-stage and standalone method, MOCA, which unifies both desired properties using novel mask-and-predict objectives defined with high-level features (instead of pixel-level details). Moreover, we show how to effectively employ both learning paradigms in a synergistic and computation-efficient way. Doing so, we achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols with a training that is at least 3 times faster than prior methods.
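
The mask-and-predict objective on high-level features can be sketched as follows: a teacher snaps patch features to their nearest codebook prototype, and the student is trained to predict those assignments at masked positions. All names and shapes below are illustrative, not MOCA's exact design:

```python
# Sketch of mask-and-predict over codebook assignments (illustrative shapes):
# teacher features are snapped to their nearest codebook prototype, and the
# student must predict those assignments at masked patch positions.
import torch
import torch.nn.functional as F

def masked_assignment_loss(student_logits, teacher_feats, codebook, mask):
    # student_logits: (B, N, K); teacher_feats: (B, N, D);
    # codebook: (K, D) prototypes; mask: (B, N) bool, True = masked patch
    with torch.no_grad():
        sim = F.normalize(teacher_feats, dim=-1) @ F.normalize(codebook, dim=-1).T
        targets = sim.argmax(dim=-1)               # (B, N) hard assignments
    return F.cross_entropy(student_logits[mask], targets[mask])

B, N, K, D = 8, 196, 1024, 256
loss = masked_assignment_loss(torch.randn(B, N, K), torch.randn(B, N, D),
                              torch.randn(K, D), torch.rand(B, N) < 0.6)
```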

Learning to Select SAT Encodings for Pseudo-Boolean and Linear Integer Constraints

  • paper_url: http://arxiv.org/abs/2307.09342
  • repo_url: https://github.com/felixvuo/lease-data
  • paper_authors: Felix Ulrich-Oltean, Peter Nightingale, James Alfred Walker
  • for: Solving constraint satisfaction and optimisation problems effectively by selecting suitable SAT encodings.
  • methods: Uses a supervised machine learning approach to select encodings for pseudo-Boolean and linear constraints, with a new feature set designed specifically for these constraint types.
  • results: Selects encodings effectively, comparing favourably to AutoFolio with the same feature set, and obtains good results even when selecting encodings for unseen problem classes.
    Abstract Many constraint satisfaction and optimisation problems can be solved effectively by encoding them as instances of the Boolean Satisfiability problem (SAT). However, even the simplest types of constraints have many encodings in the literature with widely varying performance, and the problem of selecting suitable encodings for a given problem instance is not trivial. We explore the problem of selecting encodings for pseudo-Boolean and linear constraints using a supervised machine learning approach. We show that it is possible to select encodings effectively using a standard set of features for constraint problems; however we obtain better performance with a new set of features specifically designed for the pseudo-Boolean and linear constraints. In fact, we achieve good results when selecting encodings for unseen problem classes. Our results compare favourably to AutoFolio when using the same feature set. We discuss the relative importance of instance features to the task of selecting the best encodings, and compare several variations of the machine learning method.
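
The selection step itself is standard supervised learning over instance features; a hedged scikit-learn sketch, where the features, labels, and encoding names are placeholders (the paper's purpose-built feature set is richer):

```python
# Sketch of per-instance encoding selection: train a classifier on constraint
# instance features to predict which SAT encoding performs best. Features and
# encoding names are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = np.random.rand(500, 20)                      # instance features (placeholder)
y = np.random.choice(["enc_a", "enc_b", "enc_c"], size=500)  # best encoding label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("selection accuracy:", clf.score(X_te, y_te))
chosen = clf.predict(X_te[:1])[0]   # encoding chosen for a new problem instance
```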

Company2Vec – German Company Embeddings based on Corporate Websites

  • paper_url: http://arxiv.org/abs/2307.09332
  • repo_url: None
  • paper_authors: Christopher Gerling
  • for: A novel application in representation learning: analyzing business activities from unstructured company website data.
  • methods: Uses Word2Vec and dimensionality reduction on corporate website text, maintaining semantic language structures to create efficient company embeddings in fine-granular industries.
  • results: The embeddings support various banking applications, such as semantic business analytics (e.g. top-n words for a company), supervised industry prediction, company similarity via the cosine distance, and an alternative industry segmentation via k-means clustering; three algorithms are proposed for firm-centric, industry-centric, and portfolio-centric peer-firm identification.
    Abstract With Company2Vec, the paper proposes a novel application in representation learning. The model analyzes business activities from unstructured company website data using Word2Vec and dimensionality reduction. Company2Vec maintains semantic language structures and thus creates efficient company embeddings in fine-granular industries. These semantic embeddings can be used for various applications in banking. Direct relations between companies and words allow semantic business analytics (e.g. top-n words for a company). Furthermore, industry prediction is presented as a supervised learning application and evaluation method. The vectorized structure of the embeddings allows measuring companies similarities with the cosine distance. Company2Vec hence offers a more fine-grained comparison of companies than the standard industry labels (NACE). This property is relevant for unsupervised learning tasks, such as clustering. An alternative industry segmentation is shown with k-means clustering on the company embeddings. Finally, this paper proposes three algorithms for (1) firm-centric, (2) industry-centric and (3) portfolio-centric peer-firm identification.
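
The core pipeline is easy to sketch: train Word2Vec on website tokens, average word vectors per company, and compare companies with cosine similarity. The toy corpus below is a stand-in for real website text:

```python
# Sketch of company embeddings: train Word2Vec on website tokens, average
# word vectors per company, compare with cosine similarity. Toy corpus only.
import numpy as np
from gensim.models import Word2Vec

company_texts = {
    "acme_bank": "loans deposits retail banking credit".split(),
    "acme_robotics": "automation sensors actuators manufacturing robots".split(),
}
model = Word2Vec(sentences=list(company_texts.values()),
                 vector_size=100, min_count=1, epochs=50, seed=0)

def company_vector(tokens):
    return np.mean([model.wv[t] for t in tokens if t in model.wv], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vecs = {name: company_vector(toks) for name, toks in company_texts.items()}
print(cosine(vecs["acme_bank"], vecs["acme_robotics"]))
```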

Exploiting Field Dependencies for Learning on Categorical Data

  • paper_url: http://arxiv.org/abs/2307.09321
  • repo_url: https://github.com/csiro-robotics/mdl
  • paper_authors: Zhibin Li, Piotr Koniusz, Lu Zhang, Daniel Edward Pagendam, Peyman Moghadam
  • for: Learning on categorical data by exploiting dependencies between fields (columns) to improve predictive performance.
  • methods: Learns a global field dependency matrix capturing dependencies between fields, then refines it at the instance-wise level with per-field weights (local dependency modelling); the dependency matrices are refined in the inner loop of a meta-learning algorithm without labels, while the outer loop updates the embedding and global dependency matrices in a supervised fashion.
  • results: Outperforms several state-of-the-art methods on six popular dataset benchmarks, with detailed ablation studies providing additional insights.
    Abstract Traditional approaches for learning on categorical data underexploit the dependencies between columns (\aka fields) in a dataset because they rely on the embedding of data points driven alone by the classification/regression loss. In contrast, we propose a novel method for learning on categorical data with the goal of exploiting dependencies between fields. Instead of modelling statistics of features globally (i.e., by the covariance matrix of features), we learn a global field dependency matrix that captures dependencies between fields and then we refine the global field dependency matrix at the instance-wise level with different weights (so-called local dependency modelling) w.r.t. each field to improve the modelling of the field dependencies. Our algorithm exploits the meta-learning paradigm, i.e., the dependency matrices are refined in the inner loop of the meta-learning algorithm without the use of labels, whereas the outer loop intertwines the updates of the embedding matrix (the matrix performing projection) and global dependency matrix in a supervised fashion (with the use of labels). Our method is simple yet it outperforms several state-of-the-art methods on six popular dataset benchmarks. Detailed ablation studies provide additional insights into our method.

Biomaker CA: a Biome Maker project using Cellular Automata

  • paper_url: http://arxiv.org/abs/2307.09320
  • repo_url: None
  • paper_authors: Ettore Randazzo, Alexander Mordvintsev
  • for: Describes Biomaker CA, a Biome Maker project that uses Cellular Automata (CA) to simulate complex ecosystems in which small seeds grow into plant-like organisms, survive nutrient starvation, and reproduce with variation.
  • methods: Simulates complex biomes by means of CA rules in 2D grids, parallelizing all computation on GPUs through the Python JAX framework, and supports different environments, laws of 'physics', model architectures, and mutation strategies.
  • results: Shows how plant agents can grow, survive, reproduce, and evolve, forming stable and unstable biomes; demonstrates meta-evolving models to survive in harsh environments, either end-to-end or via the more surgical and efficient 'Petri dish' meta-evolution; and shows interactive evolution, where the user evolves a plant model interactively before deploying it in a larger environment.
    Abstract We introduce Biomaker CA: a Biome Maker project using Cellular Automata (CA). In Biomaker CA, morphogenesis is a first class citizen and small seeds need to grow into plant-like organisms to survive in a nutrient starved environment and eventually reproduce with variation so that a biome survives for long timelines. We simulate complex biomes by means of CA rules in 2D grids and parallelize all of its computation on GPUs through the Python JAX framework. We show how this project allows for several different kinds of environments and laws of 'physics', alongside different model architectures and mutation strategies. We further analyze some configurations to show how plant agents can grow, survive, reproduce, and evolve, forming stable and unstable biomes. We then demonstrate how one can meta-evolve models to survive in a harsh environment either through end-to-end meta-evolution or by a more surgical and efficient approach, called Petri dish meta-evolution. Finally, we show how to perform interactive evolution, where the user decides how to evolve a plant model interactively and then deploys it in a larger environment. We open source Biomaker CA at: https://tinyurl.com/2x8yu34s .
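
Since the project parallelises CA rules on GPUs through JAX, the basic pattern is worth illustrating: a jit-compiled, whole-grid CA update in jax.numpy. The toy rule below (Conway's Game of Life) is far simpler than Biomaker CA's nutrient and agent logic:

```python
# Toy illustration of the JAX pattern a CA simulator relies on: a jit-compiled
# 2D update applied to the whole grid at once (Conway's Game of Life here).
import jax
import jax.numpy as jnp

@jax.jit
def ca_step(grid):
    # Sum the 8 neighbors by shifting the grid in every direction.
    neighbors = sum(jnp.roll(grid, (dy, dx), axis=(0, 1))
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
    born = (grid == 0) & (neighbors == 3)
    survive = (grid == 1) & ((neighbors == 2) | (neighbors == 3))
    return (born | survive).astype(grid.dtype)

key = jax.random.PRNGKey(0)
grid = jax.random.bernoulli(key, 0.3, (64, 64)).astype(jnp.int32)
for _ in range(10):
    grid = ca_step(grid)
```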

Rumor Detection with Diverse Counterfactual Evidence

  • paper_url: http://arxiv.org/abs/2307.09296
  • repo_url: https://github.com/vicinity111/dce-rd
  • paper_authors: Kaiwei Zhang, Junchi Yu, Haichao Shi, Jian Liang, Xiao-Yu Zhang
  • for: Proposing an effective rumor detection method to counter the growing threat that fake news on social media poses to individuals and communities.
  • methods: Uses graph neural networks (GNNs) over event graphs and generates diverse counterfactual subgraphs, whose removal changes the detection result, to serve as multi-view interpretations; a diversity loss inspired by Determinantal Point Processes (DPP) encourages diversity among the counterfactual evidence.
  • results: Achieves superior performance on two real-world datasets while providing diverse counterfactual evidence that improves the interpretability and robustness of rumor detection.
    Abstract The growth in social media has exacerbated the threat of fake news to individuals and communities. This draws increasing attention to developing efficient and timely rumor detection methods. The prevailing approaches resort to graph neural networks (GNNs) to exploit the post-propagation patterns of the rumor-spreading process. However, these methods lack inherent interpretation of rumor detection due to the black-box nature of GNNs. Moreover, these methods suffer from less robust results as they employ all the propagation patterns for rumor detection. In this paper, we address the above issues with the proposed Diverse Counterfactual Evidence framework for Rumor Detection (DCE-RD). Our intuition is to exploit the diverse counterfactual evidence of an event graph to serve as multi-view interpretations, which are further aggregated for robust rumor detection results. Specifically, our method first designs a subgraph generation strategy to efficiently generate different subgraphs of the event graph. We constrain the removal of these subgraphs to cause the change in rumor detection results. Thus, these subgraphs naturally serve as counterfactual evidence for rumor detection. To achieve multi-view interpretation, we design a diversity loss inspired by Determinantal Point Processes (DPP) to encourage diversity among the counterfactual evidence. A GNN-based rumor detection model further aggregates the diverse counterfactual evidence discovered by the proposed DCE-RD to achieve interpretable and robust rumor detection results. Extensive experiments on two real-world datasets show the superior performance of our method. Our code is available at https://github.com/Vicinity111/DCE-RD.
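
A DPP-inspired diversity term can be sketched as the negative log-determinant of a similarity kernel over counterfactual subgraph embeddings, since the determinant grows as the embeddings become more mutually dissimilar. The shapes below are illustrative:

```python
# Sketch of a DPP-inspired diversity loss: the log-determinant of a similarity
# kernel over subgraph embeddings grows as the embeddings become more mutually
# dissimilar, so minimizing its negation encourages diverse counterfactuals.
import torch
import torch.nn.functional as F

def dpp_diversity_loss(subgraph_embs: torch.Tensor, eps: float = 1e-4):
    z = F.normalize(subgraph_embs, dim=-1)       # (m, d), m counterfactuals
    kernel = z @ z.T                             # cosine-similarity kernel
    kernel = kernel + eps * torch.eye(len(z))    # jitter for numerical stability
    return -torch.logdet(kernel)                 # minimize => maximize diversity

loss = dpp_diversity_loss(torch.randn(5, 64, requires_grad=True))
loss.backward()
```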

The Language Labyrinth: Constructive Critique on the Terminology Used in the AI Discourse

  • paper_url: http://arxiv.org/abs/2307.10292
  • repo_url: None
  • paper_authors: Rainer Rehak
  • for: Examining the terminology problem in the artificial intelligence (AI) discourse: metaphors such as 'training', 'learning', or 'deciding' distort reflections on AI development and use.
  • methods: Draws on critical computer science and philosophy of language to analyse central notions of the AI debate and proposes more fitting terminology.
  • results: Argues that the lack of critical distance to these metaphors greatly distorts reflections regarding responsibility and potential use-cases, and that more fitting terminology can enable more fruitful debates.
    Abstract In the interdisciplinary field of artificial intelligence (AI) the problem of clear terminology is especially momentous. This paper claims, that AI debates are still characterised by a lack of critical distance to metaphors like 'training', 'learning' or 'deciding'. As consequence, reflections regarding responsibility or potential use-cases are greatly distorted. Yet, if relevant decision-makers are convinced that AI can develop an 'understanding' or properly 'interpret' issues, its regular use for sensitive tasks like deciding about social benefits or judging court cases looms. The chapter argues its claim by analysing central notions of the AI debate and tries to contribute by proposing more fitting terminology and hereby enabling more fruitful debates. It is a conceptual work at the intersection of critical computer science and philosophy of language.

Llama 2: Open Foundation and Fine-Tuned Chat Models

  • paper_url: http://arxiv.org/abs/2307.09288
  • repo_url: https://github.com/facebookresearch/llama
  • paper_authors: Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom
  • for: Developing and releasing Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
  • methods: Pretrains large language models and fine-tunes them for dialogue use cases, yielding Llama 2-Chat, with detailed descriptions of the fine-tuning and safety improvements.
  • results: The models outperform open-source chat models on most benchmarks tested and, based on human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models.
    Abstract In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
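
Since the weights are distributed through the linked repository and the Hugging Face Hub, typical inference looks like the sketch below; the hub id is the public 7B chat checkpoint, access to which is gated, so an approved account and token are assumed:

```python
# Typical inference sketch for the released chat models via the Hugging Face
# Hub (access is gated; a license-accepted account/token is assumed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

# Llama 2-Chat expects the [INST] instruction format.
prompt = "[INST] Explain network slicing in one paragraph. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200, do_sample=True,
                        temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```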

Improving Text Semantic Similarity Modeling through a 3D Siamese Network

  • paper_url: http://arxiv.org/abs/2307.09274
  • repo_url: None
  • paper_authors: Jianxiang Zang, Hui Liu
  • for: Text semantic similarity modeling; proposes a 3D Siamese network to overcome the loss of hierarchical semantic information caused by pooling in traditional approaches.
  • methods: Maps semantic information to a higher-dimensional space via a 3D Siamese network whose three-dimensional semantic tensors retain more precise spatial and feature domain information, and introduces modules reinforcing the framework in three aspects: feature extraction, attention, and feature fusion.
  • results: Extensive experiments on four text semantic similarity benchmarks demonstrate the method's effectiveness and efficiency.
    Abstract Siamese networks have gained popularity as a method for modeling text semantic similarity. Traditional methods rely on pooling operation to compress the semantic representations from Transformer blocks in encoding, resulting in two-dimensional semantic vectors and the loss of hierarchical semantic information from Transformer blocks. Moreover, this limited structure of semantic vectors is akin to a flattened landscape, which restricts the methods that can be applied in downstream modeling, as they can only navigate this flat terrain. To address this issue, we propose a novel 3D Siamese network for text semantic similarity modeling, which maps semantic information to a higher-dimensional space. The three-dimensional semantic tensors not only retains more precise spatial and feature domain information but also provides the necessary structural condition for comprehensive downstream modeling strategies to capture them. Leveraging this structural advantage, we introduce several modules to reinforce this 3D framework, focusing on three aspects: feature extraction, attention, and feature fusion. Our extensive experiments on four text semantic similarity benchmarks demonstrate the effectiveness and efficiency of our 3D Siamese Network.

UniTabE: Pretraining a Unified Tabular Encoder for Heterogeneous Tabular Data

  • paper_url: http://arxiv.org/abs/2307.09249
  • repo_url: None
  • paper_authors: Yazheng Yang, Yuqi Wang, Guang Liu, Ledell Wu, Qi Liu
  • for: Extending pretraining methodologies to tabular data, a traditionally overlooked domain, to improve its semantic representation.
  • methods: UniTabE represents each basic table element with a module termed TabUnit, followed by a Transformer encoder that refines the representation; pretraining and finetuning are facilitated through free-form prompts, with pretraining on roughly 13 billion samples curated from the Kaggle platform.
  • results: UniTabE outperforms several baseline models across a multitude of benchmark datasets, demonstrating that it can significantly enhance the semantic representation of tabular data.
    Abstract Recent advancements in Natural Language Processing (NLP) have witnessed the groundbreaking impact of pretrained models, yielding impressive outcomes across various tasks. This study seeks to extend the power of pretraining methodologies to tabular data, a domain traditionally overlooked, yet inherently challenging due to the plethora of table schemas intrinsic to different tasks. The primary research questions underpinning this work revolve around the adaptation to heterogeneous table structures, the establishment of a universal pretraining protocol for tabular data, the generalizability and transferability of learned knowledge across tasks, the adaptation to diverse downstream applications, and the incorporation of incremental columns over time. In response to these challenges, we introduce UniTabE, a pioneering method designed to process tables in a uniform manner, devoid of constraints imposed by specific table structures. UniTabE's core concept relies on representing each basic table element with a module, termed TabUnit. This is subsequently followed by a Transformer encoder to refine the representation. Moreover, our model is designed to facilitate pretraining and finetuning through the utilization of free-form prompts. In order to implement the pretraining phase, we curated an expansive tabular dataset comprising approximately 13 billion samples, meticulously gathered from the Kaggle platform. Rigorous experimental testing and analyses were performed under a myriad of scenarios to validate the effectiveness of our methodology. The experimental results demonstrate UniTabE's superior performance against several baseline models across a multitude of benchmark datasets. This, therefore, underscores UniTabE's potential to significantly enhance the semantic representation of tabular data, thereby marking a significant stride in the field of tabular data analysis.

Towards Sustainable Deep Learning for Multi-Label Classification on NILM

  • paper_url: http://arxiv.org/abs/2307.09244
  • repo_url: None
  • paper_authors: Anže Pirnat, Blaž Bertalanič, Gregor Cerar, Mihael Mohorčič, Carolina Fortuna
  • for: Improving the computational and energy efficiency of deep learning (DL) for multi-label classification in non-intrusive load monitoring (NILM).
  • methods: Introduces a novel DL model for multi-label NILM classification with improved computation and energy efficiency, together with a testing methodology that compares models on data synthesized from measurement datasets to better represent real-world scenarios.
  • results: Compared to the state-of-the-art, the proposed model reduces the carbon footprint by more than 23% while improving performance by approximately 8 percentage points on average when tested on data derived from the REFIT and UK-DALE datasets.
    Abstract Non-intrusive load monitoring (NILM) is the process of obtaining appliance-level data from a single metering point, measuring total electricity consumption of a household or a business. Appliance-level data can be directly used for demand response applications and energy management systems as well as for awareness raising and motivation for improvements in energy efficiency and reduction in the carbon footprint. Recently, classical machine learning and deep learning (DL) techniques became very popular and proved as highly effective for NILM classification, but with the growing complexity these methods are faced with significant computational and energy demands during both their training and operation. In this paper, we introduce a novel DL model aimed at enhanced multi-label classification of NILM with improved computation and energy efficiency. We also propose a testing methodology for comparison of different models using data synthesized from the measurement datasets so as to better represent real-world scenarios. Compared to the state-of-the-art, the proposed model has its carbon footprint reduced by more than 23% while providing on average approximately 8 percentage points in performance improvement when testing on data derived from REFIT and UK-DALE datasets.
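
The task interface is multi-label: one window of aggregate mains power goes in, and an independent on/off probability per appliance comes out. The tiny 1D CNN below only illustrates that interface and is not the paper's efficiency-optimised architecture:

```python
# Interface sketch for multi-label NILM: a window of aggregate mains power in,
# an independent on/off probability per appliance out. The tiny 1D CNN is
# illustrative, not the paper's efficiency-optimised model.
import torch
import torch.nn as nn

class MultiLabelNILM(nn.Module):
    def __init__(self, n_appliances: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.head = nn.Linear(32, n_appliances)

    def forward(self, x):                         # x: (batch, 1, window_length)
        return self.head(self.backbone(x))        # logits, one per appliance

model = MultiLabelNILM(n_appliances=5)
power = torch.randn(8, 1, 600)                    # 8 windows of aggregate power
labels = torch.randint(0, 2, (8, 5)).float()      # per-appliance on/off targets
loss = nn.BCEWithLogitsLoss()(model(power), labels)
```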

De Re and De Dicto Knowledge in Egocentric Setting

  • paper_url: http://arxiv.org/abs/2308.00001
  • repo_url: None
  • paper_authors: Pavel Naumov, Anna Ovchinnikova
  • for: Studying egocentric logical systems, which concern properties of agents rather than properties of possible worlds.
  • methods: Introduces two different modalities capturing de re and de dicto knowledge in the egocentric setting.
  • results: Proves that the two modalities are not definable through each other.
    Abstract Prior proposes the term "egocentric" for logical systems that study properties of agents rather than properties of possible worlds. In such a setting, the paper introduces two different modalities capturing de re and de dicto knowledge and proves that these two modalities are not definable through each other.

Human Body Digital Twin: A Master Plan

  • paper_url: http://arxiv.org/abs/2307.09225
  • repo_url: None
  • paper_authors: Chenyu Tang, Shuo Gao, Luigi G. Occhipinti
  • for: Surveying the current status and future prospects of the human body digital twin (DT) and proposing a five-level roadmap to guide its development.
  • methods: Reviews the components involved, including wearable devices, data collection, data analysis, and decision-making systems.
  • results: Proposes a five-level roadmap covering the development of these components, together with the support, security, cost, and ethical considerations needed for responsible and effective implementation.
    Abstract The human body DT has the potential to revolutionize healthcare and wellness, but its responsible and effective implementation requires consideration of various factors. This article presents a comprehensive overview of the current status and future prospects of the human body DT and proposes a five-level roadmap for its development. The roadmap covers the development of various components, such as wearable devices, data collection, data analysis, and decision-making systems. The article also highlights the necessary support, security, cost, and ethical considerations that must be addressed in order to ensure responsible and effective implementation of the human body DT. The proposed roadmap provides a framework for guiding future development and offers a unique perspective on the future of the human body DT, facilitating new interdisciplinary research and innovative solutions in this rapidly evolving field.

Automated Ableism: An Exploration of Explicit Disability Biases in Sentiment and Toxicity Analysis Models

  • paper_url: http://arxiv.org/abs/2307.09209
  • repo_url: None
  • paper_authors: Pranav Narayanan Venkit, Mukund Srinath, Shomir Wilson
  • for: Detecting explicit disability bias by examining conversations related to people with disability (PWD) on the social media platforms Twitter and Reddit using perturbation sensitivity analysis.
  • methods: Creates the Bias Identification Test in Sentiment (BITS) corpus to quantify explicit disability bias in sentiment analysis and toxicity detection models.
  • results: All examined models, namely four AIaaS sentiment analysis tools (TextBlob, VADER, Google Cloud Natural Language API, DistilBERT) and two versions of Toxic-BERT, exhibit statistically significant explicit bias against PWD.
    Abstract We analyze sentiment analysis and toxicity detection models to detect the presence of explicit bias against people with disability (PWD). We employ the bias identification framework of Perturbation Sensitivity Analysis to examine conversations related to PWD on social media platforms, specifically Twitter and Reddit, in order to gain insight into how disability bias is disseminated in real-world social settings. We then create the \textit{Bias Identification Test in Sentiment} (BITS) corpus to quantify explicit disability bias in any sentiment analysis and toxicity detection models. Our study utilizes BITS to uncover significant biases in four open AIaaS (AI as a Service) sentiment analysis tools, namely TextBlob, VADER, Google Cloud Natural Language API, DistilBERT and two toxicity detection models, namely two versions of Toxic-BERT. Our findings indicate that all of these models exhibit statistically significant explicit bias against PWD.
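
Perturbation sensitivity analysis is simple to reproduce at small scale: score sentences that differ only in the identity term and compare. A sketch with VADER, one of the audited tools; the templates are illustrative, not the BITS corpus:

```python
# Sketch of perturbation sensitivity analysis with VADER: score sentences
# identical except for the identity term and compare the compound scores.
# Templates are illustrative, not the actual BITS corpus.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
template = "I am a {} person and I love my morning coffee."
terms = ["tall", "deaf", "blind", "short"]

for term in terms:
    score = analyzer.polarity_scores(template.format(term))["compound"]
    print(f"{term:>6}: compound sentiment = {score:+.3f}")
# A systematic score drop for disability terms indicates explicit bias.
```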

ESMC: Entire Space Multi-Task Model for Post-Click Conversion Rate via Parameter Constraint

  • paper_url: http://arxiv.org/abs/2307.09193
  • repo_url: None
  • paper_authors: Zhenhao Jiang, Biao Zeng, Hao Feng, Jin Liu, Jicong Fan, Jie Zhang, Jia Jia, Ning Hu, Xingyu Chen, Xuguang Lan
  • for: Click-Through Rate (CTR) and Post-Click Conversion Rate (CVR) estimation in large-scale online recommender systems.
  • methods: Proposes an entire-space multi-task model with parameter constraint (ESMC), plus two alternatives (ESMS and ESMG), that models the extended decision path "exposure_click_in-shop action_purchase" to address the Sample Selection Bias and Data Sparsity issues of traditional CVR estimators; "exposure_click_in-shop action" is treated with conditional probability while "in-shop action_purchase" is treated with a parameter constraint strategy, avoiding the reported Probability Space Confusion (PSC) issue.
  • results: Experiments in both offline and online environments of a large-scale recommendation system show significant improvements over state-of-the-art models; the real-world datasets will be released.
    Abstract Large-scale online recommender system spreads all over the Internet being in charge of two basic tasks: Click-Through Rate (CTR) and Post-Click Conversion Rate (CVR) estimations. However, traditional CVR estimators suffer from well-known Sample Selection Bias and Data Sparsity issues. Entire space models were proposed to address the two issues via tracing the decision-making path of "exposure_click_purchase". Further, some researchers observed that there are purchase-related behaviors between click and purchase, which can better draw the user's decision-making intention and improve the recommendation performance. Thus, the decision-making path has been extended to "exposure_click_in-shop action_purchase" and can be modeled with conditional probability approach. Nevertheless, we observe that the chain rule of conditional probability does not always hold. We report Probability Space Confusion (PSC) issue and give a derivation of difference between ground-truth and estimation mathematically. We propose a novel Entire Space Multi-Task Model for Post-Click Conversion Rate via Parameter Constraint (ESMC) and two alternatives: Entire Space Multi-Task Model with Siamese Network (ESMS) and Entire Space Multi-Task Model in Global Domain (ESMG) to address the PSC issue. Specifically, we handle "exposure_click_in-shop action" and "in-shop action_purchase" separately in the light of characteristics of in-shop action. The first path is still treated with conditional probability while the second one is treated with parameter constraint strategy. Experiments on both offline and online environments in a large-scale recommendation system illustrate the superiority of our proposed methods over state-of-the-art models. The real-world datasets will be released.
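
For concreteness, the naive conditional-probability chain over the extended decision path reads as below; the paper's point is that applying the chain rule to the last transition runs into the PSC issue, which motivates treating "in-shop action_purchase" with a parameter constraint instead:

```latex
% Naive conditional-probability chain over the extended path
% "exposure -> click -> in-shop action -> purchase":
\[
  p(\text{purchase}\mid\text{exposure})
    = \underbrace{p(\text{click}\mid\text{exposure})}_{\text{CTR}}
      \cdot p(\text{action}\mid\text{click})
      \cdot \underbrace{p(\text{purchase}\mid\text{action})}_{\text{CVR}}
\]
% ESMC keeps the conditional-probability treatment for the first two factors
% but replaces the last with a parameter-constraint strategy, because the
% chain rule does not always hold across these probability spaces (PSC).
```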

Enhancing Network Slicing Architectures with Machine Learning, Security, Sustainability and Experimental Networks Integration

  • paper_url: http://arxiv.org/abs/2307.09151
  • repo_url: https://github.com/romoreira/sfi2-energy-sustainability
  • paper_authors: Joberto S. B. Martins, Tereza C. Carvalho, Rodrigo Moreira, Cristiano Both, Adnei Donatti, João H. Corrêa, José A. Suruagy, Sand L. Corrêa, Antonio J. G. Abelem, Moisés R. N. Ribeiro, Jose-Marcos Nogueira, Luiz C. S. Magalhães, Juliano Wickboldt, Tiago Ferreto, Ricardo Mello, Rafael Pasquini, Marcos Schwarz, Leobino N. Sampaio, Daniel F. Macedo, José F. de Rezende, Kleber V. Cardoso, Flávio O. Silva
  • for: NS architectures are proposed to optimize and customize scarce resources for 5G/6G applications, but they often have limited domain-specific functionality.
  • methods: The SFI2 architecture proposal integrates experimental networks and enhances NS with ML-native optimizations, energy-efficient slicing, and slicing-tailored security functionalities.
  • results: The SFI2 reference architecture instantiations will enhance multi-domain and multi-technology integrated experimental network deployment with native ML optimization, energy-efficient-aware slicing, and slicing-tailored security functionalities for practical domains.
    Abstract Network Slicing (NS) is an essential technique extensively used in 5G networks computing strategies, mobile edge computing, mobile cloud computing, and verticals like the Internet of Vehicles and industrial IoT, among others. NS is foreseen as one of the leading enablers for 6G futuristic and highly demanding applications since it allows the optimization and customization of scarce and disputed resources among dynamic, demanding clients with highly distinct application requirements. Various standardization organizations, like 3GPP's proposal for new generation networks and state-of-the-art 5G/6G research projects, are proposing new NS architectures. However, new NS architectures have to deal with an extensive range of requirements that inherently result in having NS architecture proposals typically fulfilling the needs of specific sets of domains with commonalities. The Slicing Future Internet Infrastructures (SFI2) architecture proposal explores the gap resulting from the diversity of NS architectures target domains by proposing a new NS reference architecture with a defined focus on integrating experimental networks and enhancing the NS architecture with Machine Learning (ML) native optimizations, energy-efficient slicing, and slicing-tailored security functionalities. The SFI2 architectural main contribution includes the utilization of the slice-as-a-service paradigm for end-to-end orchestration of resources across multi-domains and multi-technology experimental networks. In addition, the SFI2 reference architecture instantiations will enhance the multi-domain and multi-technology integrated experimental network deployment with native ML optimization, energy-efficient aware slicing, and slicing-tailored security functionalities for the practical domain.

Machine Learning for SAT: Restricted Heuristics and New Graph Representations

  • paper_url: http://arxiv.org/abs/2307.09141
  • repo_url: None
  • paper_authors: Mikhail Shirokikh, Ilya Shenbin, Anton Alekseev, Sergey Nikolenko
  • for: Boolean satisfiability (SAT), a fundamental NP-complete problem with many applications, including automated planning and scheduling.
  • methods: Uses machine learning (ML) models to improve the branching heuristics of SAT solvers, aiming to reduce both the number of steps and the overall runtime.
  • results: Proposes a strategy of making a few initial steps with a trained ML model and then releasing control to classical heuristics, simplifying cold start and decreasing overall time; also introduces a modification of Graph-Q-SAT tailored to SAT problems converted from other domains, e.g., open shop scheduling problems, validated on random and industrial SAT instances.
    Abstract Boolean satisfiability (SAT) is a fundamental NP-complete problem with many applications, including automated planning and scheduling. To solve large instances, SAT solvers have to rely on heuristics, e.g., choosing a branching variable in DPLL and CDCL solvers. Such heuristics can be improved with machine learning (ML) models; they can reduce the number of steps but usually hinder the running time because useful models are relatively large and slow. We suggest the strategy of making a few initial steps with a trained ML model and then releasing control to classical heuristics; this simplifies cold start for SAT solving and can decrease both the number of steps and overall runtime, but requires a separate decision of when to release control to the solver. Moreover, we introduce a modification of Graph-Q-SAT tailored to SAT problems converted from other domains, e.g., open shop scheduling problems. We validate the feasibility of our approach with random and industrial SAT problems.
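The hand-over strategy can be sketched in a few lines. The function below is illustrative only: it assumes a trained `ml_model` that scores unassigned variables and a classical fallback such as VSIDS, and makes the first `k` branching decisions with the model before releasing control, mirroring the strategy the abstract describes.

```python
def pick_branching_variable(solver_state, ml_model, vsids_pick, k=10):
    """First k decisions come from the trained ML model; afterwards the
    classical heuristic (e.g. VSIDS) takes over. `solver_state` is
    assumed to expose a decision counter and the unassigned variables."""
    if solver_state.num_decisions < k:
        scores = ml_model(solver_state)   # one score per unassigned variable
        return max(solver_state.unassigned, key=lambda v: scores[v])
    return vsids_pick(solver_state)       # release control to the solver
```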

DropMix: Reducing Class Dependency in Mixed Sample Data Augmentation

  • paper_url: http://arxiv.org/abs/2307.09136
  • repo_url: None
  • paper_authors: Haeil Lee, Hansang Lee, Junmo Kim
  • for: Mixed sample data augmentation (MSDA) improves performance on many tasks, but its effect is class-dependent, and some classes are degraded by it.
  • methods: The paper proposes DropMix, which excludes a specific percentage of the data from the MSDA computation to reduce class dependency.
  • results: On CIFAR-100 and ImageNet with three MSDA methods (Mixup, CutMix and PuzzleMix), training on a combination of MSDA and non-MSDA data improves the classes previously degraded by MSDA and raises overall average accuracy.
    Abstract Mixed sample data augmentation (MSDA) is a widely used technique that has been found to improve performance in a variety of tasks. However, in this paper, we show that the effects of MSDA are class-dependent, with some classes seeing an improvement in performance while others experience a decline. To reduce class dependency, we propose the DropMix method, which excludes a specific percentage of data from the MSDA computation. By training on a combination of MSDA and non-MSDA data, the proposed method not only improves the performance of classes that were previously degraded by MSDA, but also increases overall average accuracy, as shown in experiments on two datasets (CIFAR-100 and ImageNet) using three MSDA methods (Mixup, CutMix and PuzzleMix).
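As a rough sketch of the idea (the paper's exact procedure may differ), Mixup can be applied to only part of each batch, with the excluded fraction trained on unmixed data:

```python
import torch
import torch.nn.functional as F

def dropmix_mixup(x, y, drop_rate=0.5, alpha=1.0):
    """Apply Mixup to only part of the batch; `drop_rate` is the fraction
    of samples excluded from MSDA, kept with their original images and
    labels. Assumes 0 < drop_rate < 1. A DropMix-style sketch only."""
    n = x.size(0)
    n_mix = int(n * (1.0 - drop_rate))
    order = torch.randperm(n)
    mix, keep = order[:n_mix], order[n_mix:]
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    partner = mix[torch.randperm(n_mix)]   # mixing partners within the MSDA subset
    x = x.clone()
    x[mix] = lam * x[mix] + (1.0 - lam) * x[partner]
    return x, mix, keep, partner, lam

def dropmix_loss(logits, y, mix, keep, partner, lam):
    """Mixup loss on the mixed subset plus plain cross-entropy on the
    subset that bypassed MSDA."""
    loss = lam * F.cross_entropy(logits[mix], y[mix]) \
        + (1.0 - lam) * F.cross_entropy(logits[mix], y[partner])
    if keep.numel() > 0:
        loss = loss + F.cross_entropy(logits[keep], y[keep])
    return loss
```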

Cloud-native RStudio on Kubernetes for Hopsworks

  • paper_url: http://arxiv.org/abs/2307.09132
  • repo_url: None
  • paper_authors: Gibson Chikafa, Sina Sheikholeslami, Salman Niazi, Jim Dowling, Vladimir Vlassov
  • for: The paper presents a multi-tenant distributed RStudio Server, provided as Software as a Service on the Hopsworks data-intensive AI platform, to improve the availability and scalability of R development environments in the cloud.
  • methods: The authors use cloud-native technologies, Docker and Kubernetes, to solve the performance-isolation, security, and scaling problems of a multi-tenant environment; they enable secure data sharing between RStudio Server instances for privacy-preserving collaboration among RStudio users, and integrate the system with Apache Spark for Big Data workloads.
  • results: Tested on a Google Cloud Platform cluster with four worker nodes (30 GB of RAM each), the system ran 44 concurrent RStudio Server instances with 2 GB of RAM each, and can scale out to support hundreds of concurrent instances by adding resources (CPUs and RAM) to the cluster.
    Abstract In order to fully benefit from cloud computing, services are designed following the "multi-tenant" architectural model, which is aimed at maximizing resource sharing among users. However, multi-tenancy introduces challenges of security, performance isolation, scaling, and customization. RStudio server is an open-source Integrated Development Environment (IDE) accessible over a web browser for the R programming language. We present the design and implementation of a multi-user distributed system on Hopsworks, a data-intensive AI platform, following the multi-tenant model that provides RStudio as Software as a Service (SaaS). We use the most popular cloud-native technologies: Docker and Kubernetes, to solve the problems of performance isolation, security, and scaling that are present in a multi-tenant environment. We further enable secure data sharing in RStudio server instances to provide data privacy and allow collaboration among RStudio users. We integrate our system with Apache Spark, which can scale and handle Big Data processing workloads. Also, we provide a UI where users can provide custom configurations and have full control of their own RStudio server instances. Our system was tested on a Google Cloud Platform cluster with four worker nodes, each with 30GB of RAM allocated to them. The tests on this cluster showed that 44 RStudio servers, each with 2GB of RAM, can be run concurrently. Our system can scale out to potentially support hundreds of concurrently running RStudio servers by adding more resources (CPUs and RAM) to the cluster or system.

BOLD: A Benchmark for Linked Data User Agents and a Simulation Framework for Dynamic Linked Data Environments

  • paper_url: http://arxiv.org/abs/2307.09114
  • repo_url: None
  • paper_authors: Tobias Käfer, Victor Charpenay, Andreas Harth
  • for: The paper presents BOLD (Buildings on Linked Data), a benchmark for Linked Data agents, together with a framework for simulating dynamic Linked Data environments.
  • methods: The benchmark exposes a read-write Linked Data interface to a simulated smart building, covering simulated time, occupancy movement, and sensors and actuators around lighting.
  • results: On the Linked Data representation of this environment, agents carry out specified tasks such as controlling illumination; the simulation environment checks correct task execution and measures agent performance, which the authors report for agents based on condition-action rules.
    Abstract The paper presents the BOLD (Buildings on Linked Data) benchmark for Linked Data agents, next to the framework to simulate dynamic Linked Data environments, using which we built BOLD. The BOLD benchmark instantiates the BOLD framework by providing a read-write Linked Data interface to a smart building with simulated time, occupancy movement and sensors and actuators around lighting. On the Linked Data representation of this environment, agents carry out several specified tasks, such as controlling illumination. The simulation environment provides means to check for the correct execution of the tasks and to measure the performance of agents. We conduct measurements on Linked Data agents based on condition-action rules.
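A minimal condition-action agent against a read-write Linked Data interface could look like the loop below. The resource IRIs and the vocabulary are placeholders, not BOLD's actual ones; only the poll-evaluate-actuate pattern is the point.

```python
import time
import requests
from rdflib import Graph

# Hypothetical resource IRIs; BOLD exposes the building as read-write
# Linked Data, but the exact IRIs and vocabulary below are placeholders.
ROOM = "http://localhost:8080/sim/room1"
LAMP = "http://localhost:8080/sim/room1/lamp"

def get_graph(uri):
    r = requests.get(uri, headers={"Accept": "text/turtle"})
    g = Graph()
    g.parse(data=r.text, format="turtle")
    return g

while True:  # condition-action rule: "if the room is occupied, switch the lamp on"
    g = get_graph(ROOM)
    occupied = g.query(
        "ASK { ?s <https://example.org/occupied> true }").askAnswer
    state = "true" if occupied else "false"
    requests.put(LAMP, headers={"Content-Type": "text/turtle"},
                 data=f"<{LAMP}> <https://example.org/on> {state} .")
    time.sleep(1)
```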

  • paper_url: http://arxiv.org/abs/2307.09099
  • repo_url: None
  • paper_authors: Seyed Mahdi Shariatzadeh, Mahmood Fathy, Reza Berangi, Mohammad Shahverdy
  • for: The paper surveys principal and state-of-the-art work in Multi-Objective Neural Architecture Search (MONAS), including a progress report and future directions for the field.
  • methods: It organizes the field through a taxonomy and formulation of NAS, covering classification, search spaces, and search strategies, and corrects some miscategorizations in previous surveys.
  • results: By analyzing the relations among objectives and how they are optimized, the survey catalogs all known objectives (and adds new ones), argues that the stochastic properties of some objectives should be treated differently from deterministic ones in multi-objective optimization, and outlines open challenges.
    Abstract Recently, expert-crafted neural architectures have increasingly been overtaken by neural architecture search (NAS) and the automatic generation (and tuning) of network structures, which is closely related to Hyperparameter Optimization and Automated Machine Learning (AutoML). After earlier NAS attempts that optimized only prediction accuracy, Multi-Objective Neural Architecture Search (MONAS) has been attracting attention; it considers additional goals such as computational complexity, power consumption, and network size, reaching a trade-off between accuracy and other features like computational cost. In this paper, we present an overview of principal and state-of-the-art works in the field of MONAS. Starting from a well-categorized taxonomy and formulation for NAS, we address and correct some miscategorizations in previous surveys of the NAS field. We also provide a list of all known objectives used, add a number of new ones, and elaborate their specifications. We provide analyses of the most important objectives and show that the stochastic properties of some of them should be treated differently from deterministic ones in the multi-objective optimization procedure of NAS. We conclude with a number of future directions and topics in the field of MONAS.
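The common building block behind most MONAS methods is Pareto dominance over several objectives. A minimal sketch, not tied to any specific surveyed method:

```python
def dominates(a, b):
    """a and b are tuples of objectives to MINIMIZE, e.g.
    (error, latency_ms, params_M). a dominates b if it is no worse on
    every objective and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """candidates: list of (architecture, objectives) pairs; returns the
    non-dominated set, i.e. the accuracy/cost trade-off frontier."""
    return [(arch, obj) for arch, obj in candidates
            if not any(dominates(other, obj)
                       for _, other in candidates if other != obj)]
```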

DiTTO: Diffusion-inspired Temporal Transformer Operator

  • paper_url: http://arxiv.org/abs/2307.09072
  • repo_url: None
  • paper_authors: Oded Ovadia, Eli Turkel, Adar Kahana, George Em Karniadakis
  • for: Solving time-dependent partial differential equations (PDEs) with a data-driven approach.
  • methods: The paper proposes DiTTO, an operator-learning method that adopts the time-conditioning mechanism of latent diffusion models and combines it with elements of the Transformer architecture, solving PDEs continuously in time without any temporal discretization.
  • results: On a wide variety of multi-dimensional PDE problems (the 1-D Burgers' equation, the 2-D Navier-Stokes equations, and the acoustic wave equation in 2-D and 3-D), DiTTO achieves state-of-the-art accuracy; fast-sampling ideas from diffusion models further improve its performance, and it can perform accurate zero-shot super-resolution in time.
    Abstract Solving partial differential equations (PDEs) using a data-driven approach has become increasingly common. The recent development of the operator learning paradigm has enabled the solution of a broader range of PDE-related problems. We propose an operator learning method to solve time-dependent PDEs continuously in time without needing any temporal discretization. The proposed approach, named DiTTO, is inspired by latent diffusion models. While diffusion models are usually used in generative artificial intelligence tasks, their time-conditioning mechanism is extremely useful for PDEs. The diffusion-inspired framework is combined with elements from the Transformer architecture to improve its capabilities. We demonstrate the effectiveness of the new approach on a wide variety of PDEs in multiple dimensions, namely the 1-D Burgers' equation, 2-D Navier-Stokes equations, and the acoustic wave equation in 2-D and 3-D. DiTTO achieves state-of-the-art results in terms of accuracy for these problems. We also present a method to improve the performance of DiTTO by using fast sampling concepts from diffusion models. Finally, we show that DiTTO can accurately perform zero-shot super-resolution in time.
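The key ingredient DiTTO borrows from diffusion models is conditioning on a continuous time variable. The toy module below illustrates that mechanism with a sinusoidal time embedding and an MLP; the real model is Transformer-based, and all dimensions here are arbitrary.

```python
import math
import torch
import torch.nn as nn

def time_embedding(t, dim=64):
    """Sinusoidal embedding of a continuous time t, the conditioning
    device diffusion models use. t: (B,) float tensor."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    ang = t[:, None] * freqs[None, :]
    return torch.cat([ang.sin(), ang.cos()], dim=-1)   # (B, dim)

class TimeConditionedOperator(nn.Module):
    """Toy stand-in for the idea: predict u(., t) from the initial
    condition u0 and a continuous t, with no temporal discretization."""
    def __init__(self, n_points, t_dim=64, hidden=256):
        super().__init__()
        self.t_dim = t_dim
        self.net = nn.Sequential(
            nn.Linear(n_points + t_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, n_points),
        )

    def forward(self, u0, t):   # u0: (B, n_points), t: (B,)
        emb = time_embedding(t, self.t_dim)
        return self.net(torch.cat([u0, emb], dim=-1))
```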

Unleashing the Imagination of Text: A Novel Framework for Text-to-image Person Retrieval via Exploring the Power of Words

  • paper_url: http://arxiv.org/abs/2307.09059
  • repo_url: https://github.com/Delong-liu-bupt/UIT
  • paper_authors: Delong Liu, Haiwen Li
  • for: Text-to-image person retrieval: retrieving person images from a large gallery that match a given textual description, which requires accurately mapping abstract textual descriptions to concrete images.
  • methods: The paper proposes the "Unleash the Imagination of Text" (UIT) framework, which uses a pre-trained full CLIP model as a dual encoder for images and texts to exploit prior cross-modal alignment knowledge; it adds a Text-guided Image Restoration auxiliary task to implicitly map abstract textual entities to specific image regions, a cross-modal triplet loss tailored to hard samples, and a text data augmentation technique that focuses the model on the key components of sentences.
  • results: The proposed method achieves state-of-the-art results on three popular benchmark datasets, and the source code will be made publicly available shortly.
    Abstract The goal of Text-to-image person retrieval is to retrieve person images from a large gallery that match the given textual descriptions. The main challenge of this task lies in the significant differences in information representation between the visual and textual modalities. The textual modality conveys abstract and precise information through vocabulary and grammatical structures, while the visual modality conveys concrete and intuitive information through images. To fully leverage the expressive power of textual representations, it is essential to accurately map abstract textual descriptions to specific images. To address this issue, we propose a novel framework to Unleash the Imagination of Text (UIT) in text-to-image person retrieval, aiming to fully explore the power of words in sentences. Specifically, the framework employs the pre-trained full CLIP model as a dual encoder for the images and texts , taking advantage of prior cross-modal alignment knowledge. The Text-guided Image Restoration auxiliary task is proposed with the aim of implicitly mapping abstract textual entities to specific image regions, facilitating alignment between textual and visual embeddings. Additionally, we introduce a cross-modal triplet loss tailored for handling hard samples, enhancing the model's ability to distinguish minor differences. To focus the model on the key components within sentences, we propose a novel text data augmentation technique. Our proposed methods achieve state-of-the-art results on three popular benchmark datasets, and the source code will be made publicly available shortly.
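A standard bidirectional triplet loss with in-batch hard negatives, of the kind used for handling hard samples in text-to-image retrieval, can be sketched as follows (the paper's exact loss may differ):

```python
import torch
import torch.nn.functional as F

def hard_triplet_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional triplet loss with in-batch hard negatives.
    img_emb, txt_emb: (B, D); row i on each side is a matching pair."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()                               # (B, B) cosine similarities
    pos = sim.diag()                                          # matching-pair similarities
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg_i2t = sim.masked_fill(mask, -1.0).max(dim=1).values   # hardest text per image
    neg_t2i = sim.masked_fill(mask, -1.0).max(dim=0).values   # hardest image per text
    loss = F.relu(margin + neg_i2t - pos) + F.relu(margin + neg_t2i - pos)
    return loss.mean()
```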

QMNet: Importance-Aware Message Exchange for Decentralized Multi-Agent Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.09051
  • repo_url: None
  • paper_authors: Xiufeng Huang, Sheng Zhou
  • for: Improving the performance of multi-agent reinforcement learning under wireless resource constraints.
  • methods: The paper proposes a message importance metric and an importance-aware scheduling policy so that agents exchange messages more effectively, and uses the importance metric to handle random-access collisions in decentralized systems; in the query-message architecture (QMNet), agents generate queries and messages from their observations, and shared queries help compute message importance.
  • results: Evaluated in a traffic-junction environment where limited wireless resources let only a fraction of agents transmit, the scheme maintains system performance even when only 30% of agents can share messages; a message-prediction mechanism that compensates for untransmitted messages saves a further 40% of wireless resources, and the importance-aware decentralized multi-access mechanism avoids collisions with almost the same performance as centralized scheduling.
    Abstract To improve the performance of multi-agent reinforcement learning under the constraint of wireless resources, we propose a message importance metric and design an importance-aware scheduling policy to effectively exchange messages. The key insight is spending the precious communication resources on important messages. The message importance depends not only on the messages themselves, but also on the needs of agents who receive them. Accordingly, we propose a query-message-based architecture, called QMNet. Agents generate queries and messages with the environment observation. Sharing queries can help calculate message importance. Exchanging messages can help agents cooperate better. Besides, we exploit the message importance to deal with random access collisions in decentralized systems. Furthermore, a message prediction mechanism is proposed to compensate for messages that are not transmitted. Finally, we evaluate the proposed schemes in a traffic junction environment, where only a fraction of agents can send messages due to limited wireless resources. Results show that QMNet can extract valuable information to guarantee the system performance even when only $30\%$ of agents can share messages. By exploiting message prediction, the system can further save $40\%$ of wireless resources. The importance-aware decentralized multi-access mechanism can effectively avoid collisions, achieving almost the same performance as centralized scheduling.
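At its simplest, importance-aware scheduling reduces to letting only the agents with the most important messages use the limited channel. A toy sketch under that assumption (in QMNet the importance scores are learned from queries and messages; this helper only illustrates the selection rule):

```python
import numpy as np

def schedule_messages(importance, budget):
    """Let only the `budget` agents with the most important messages
    transmit; `importance` holds one score per agent. Returns a boolean
    transmit mask."""
    order = np.argsort(np.asarray(importance))[::-1]
    allowed = set(order[:budget].tolist())
    return [i in allowed for i in range(len(importance))]

# Example: 5 agents, channel budget for 2 transmissions.
print(schedule_messages([0.1, 0.9, 0.4, 0.7, 0.2], budget=2))
# -> [False, True, False, True, False]
```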

R-Cut: Enhancing Explainability in Vision Transformers with Relationship Weighted Out and Cut

  • paper_url: http://arxiv.org/abs/2307.09050
  • repo_url: None
  • paper_authors: Yingjie Niu, Ming Ding, Maoning Ge, Robin Karlsson, Yuxiao Zhang, Kazuya Takeda
  • for: Improving the explainability of Transformer-based image classification models, helping users trust classification results and gain a deeper understanding of the model's decision process for downstream tasks.
  • methods: The paper introduces two modules: a "Relationship Weighted Out" module that extracts class-specific information from intermediate layers to highlight relevant features, and a "Cut" module that performs fine-grained feature decomposition accounting for factors such as position, texture, and color; integrating the two produces dense class-specific visual explainability maps.
  • results: Extensive quantitative and qualitative experiments on ImageNet show clear improvements over previous methods, and large-scale experiments on the LRN dataset (designed for automatic-driving danger alerts) demonstrate the method's explainability against complex backgrounds; ablation studies confirm the contribution of each module.
    Abstract Transformer-based models have gained popularity in the field of natural language processing (NLP) and are extensively utilized in computer vision tasks and multi-modal models such as GPT4. This paper presents a novel method to enhance the explainability of Transformer-based image classification models. Our method aims to improve trust in classification results and empower users to gain a deeper understanding of the model for downstream tasks by providing visualizations of class-specific maps. We introduce two modules: the ``Relationship Weighted Out" and the ``Cut" modules. The ``Relationship Weighted Out" module focuses on extracting class-specific information from intermediate layers, enabling us to highlight relevant features. Additionally, the ``Cut" module performs fine-grained feature decomposition, taking into account factors such as position, texture, and color. By integrating these modules, we generate dense class-specific visual explainability maps. We validate our method with extensive qualitative and quantitative experiments on the ImageNet dataset. Furthermore, we conduct a large number of experiments on the LRN dataset, specifically designed for automatic driving danger alerts, to evaluate the explainability of our method in complex backgrounds. The results demonstrate a significant improvement over previous methods. Moreover, we conduct ablation experiments to validate the effectiveness of each module. Through these experiments, we are able to confirm the respective contributions of each module, thus solidifying the overall effectiveness of our proposed approach.

FedDefender: Client-Side Attack-Tolerant Federated Learning

  • paper_url: http://arxiv.org/abs/2307.09048
  • repo_url: https://github.com/deu30303/feddefender
  • paper_authors: Sungwon Park, Sungwon Han, Fangzhao Wu, Sundong Kim, Bin Zhu, Xing Xie, Meeyoung Cha
  • for: Defending federated learning against model poisoning attacks and strengthening its robustness.
  • methods: A client-side defense strategy with two components: attack-tolerant local meta updates and attack-tolerant global knowledge distillation, which find noise-resilient model parameters while accurately extracting knowledge from a potentially corrupted global model.
  • results: Evaluations of realistic scenarios across multiple datasets show that the proposed method enhances the robustness of federated learning against model poisoning attacks.
    Abstract Federated learning enables learning from decentralized data sources without compromising privacy, which makes it a crucial technique. However, it is vulnerable to model poisoning attacks, where malicious clients interfere with the training process. Previous defense mechanisms have focused on the server-side by using careful model aggregation, but this may not be effective when the data is not identically distributed or when attackers can access the information of benign clients. In this paper, we propose a new defense mechanism that focuses on the client-side, called FedDefender, to help benign clients train robust local models and avoid the adverse impact of malicious model updates from attackers, even when a server-side defense cannot identify or remove adversaries. Our method consists of two main components: (1) attack-tolerant local meta update and (2) attack-tolerant global knowledge distillation. These components are used to find noise-resilient model parameters while accurately extracting knowledge from a potentially corrupted global model. Our client-side defense strategy has a flexible structure and can work in conjunction with any existing server-side strategies. Evaluations of real-world scenarios across multiple datasets show that the proposed method enhances the robustness of federated learning against model poisoning attacks.
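One plausible shape for attack-tolerant distillation on the client side is to consult the global model only where it agrees with the local ground truth. This is a sketch of that general idea, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def attack_tolerant_distill_loss(student_logits, global_logits, labels, T=2.0):
    """Client-side distillation that trusts the (possibly poisoned)
    global model only on samples it classifies correctly against the
    local labels. A rough sketch of attack-tolerant global knowledge
    distillation; FedDefender's actual formulation differs in detail."""
    ce = F.cross_entropy(student_logits, labels)
    trust = (global_logits.argmax(dim=1) == labels).float()   # per-sample gate
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(global_logits / T, dim=1),
                  reduction="none").sum(dim=1)
    return ce + (T * T) * (trust * kd).mean()
```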

Multimodal Machine Learning for Extraction of Theorems and Proofs in the Scientific Literature

  • paper_url: http://arxiv.org/abs/2307.09047
  • repo_url: https://github.com/mv96/mm_extraction
  • paper_authors: Shrey Mishra, Antoine Gauquier, Pierre Senellart
  • for: Extracting theorem-like environments and proofs from mathematical articles using multimodal machine learning.
  • methods: The approach late-fuses features from three unimodal classifiers: text (a new language model pretrained on an 11 GB scientific corpus and fine-tuned on little data), font features (a 128-cell LSTM over the sequence of font names and sizes within each block), and bitmap renderings (an EfficientNetv2 deep network classifying each image block); a CRF then combines the multimodal features with information on block sequences.
  • results: Experiments show that the multimodal approach outperforms any single modality, and that CRF modeling of block sequences brings major additional performance gains.
    Abstract Scholarly articles in mathematical fields feature mathematical statements such as theorems, propositions, etc., as well as their proofs. Extracting them from the PDF representation of the articles requires understanding of scientific text along with visual and font-based indicators. We pose this problem as a multimodal classification problem using text, font features, and bitmap image rendering of the PDF as different modalities. In this paper we propose a multimodal machine learning approach for extraction of theorem-like environments and proofs, based on late fusion of features extracted by individual unimodal classifiers, taking into account the sequential succession of blocks in the document. For the text modality, we pretrain a new language model on a 11 GB scientific corpus; experiments shows similar performance for our task than a model (RoBERTa) pretrained on 160 GB, with faster convergence while requiring much less fine-tuning data. Font-based information relies on training a 128-cell LSTM on the sequence of font names and sizes within each block. Bitmap renderings are dealt with using an EfficientNetv2 deep network tuned to classify each image block. Finally, a simple CRF-based approach uses the features of the multimodal model along with information on block sequences. Experimental results show the benefits of using a multimodal approach vs any single modality, as well as major performance improvements using the CRF modeling of block sequences.
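Late fusion over per-block unimodal scores can be sketched as below. Dimensions are illustrative, and a bidirectional GRU stands in for the CRF the paper actually uses over block sequences.

```python
import torch
import torch.nn as nn

class LateFusionBlockClassifier(nn.Module):
    """Late fusion of per-block scores from the three unimodal models
    (text LM, font LSTM, bitmap CNN), followed by a sequence model.
    The GRU is a stand-in for the paper's CRF; sizes are illustrative."""
    def __init__(self, n_classes=3, d_text=3, d_font=3, d_img=3, hidden=32):
        super().__init__()
        self.seq = nn.GRU(d_text + d_font + d_img, hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)  # e.g. other/theorem/proof

    def forward(self, text_scores, font_scores, img_scores):
        # each input: (B, n_blocks, d_*), the unimodal classifiers' outputs
        fused = torch.cat([text_scores, font_scores, img_scores], dim=-1)
        out, _ = self.seq(fused)                      # context over the block sequence
        return self.head(out)                         # per-block class logits
```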

Emotional Intelligence of Large Language Models

  • paper_url: http://arxiv.org/abs/2307.09042
  • repo_url: None
  • paper_authors: Xuena Wang, Xueting Li, Zi Yin, Yue Wu, Liu Jia
  • for: Assessing the Emotional Intelligence (EI) of Large Language Models (LLMs) and their alignment with human emotions and values.
  • methods: The authors develop a novel psychometric assessment of Emotion Understanding (EU), a core component of EI, which requires evaluating complex emotions (e.g., surprised, joyful, puzzled, proud) in realistic scenarios; they test a variety of mainstream LLMs against a reference frame constructed from over 500 adults.
  • results: Most LLMs achieve above-average EQ scores, with GPT-4 reaching an EQ of 117 and exceeding 89% of human participants; a multivariate pattern analysis reveals that some LLMs apparently do not rely on human-like mechanisms to reach human-level performance, as their representational patterns are qualitatively distinct from humans'.
    Abstract Large Language Models (LLMs) have demonstrated remarkable abilities across numerous disciplines, primarily assessed through tasks in language generation, knowledge utilization, and complex reasoning. However, their alignment with human emotions and values, which is critical for real-world applications, has not been systematically evaluated. Here, we assessed LLMs' Emotional Intelligence (EI), encompassing emotion recognition, interpretation, and understanding, which is necessary for effective communication and social interactions. Specifically, we first developed a novel psychometric assessment focusing on Emotion Understanding (EU), a core component of EI, suitable for both humans and LLMs. This test requires evaluating complex emotions (e.g., surprised, joyful, puzzled, proud) in realistic scenarios (e.g., despite feeling underperformed, John surprisingly achieved a top score). With a reference frame constructed from over 500 adults, we tested a variety of mainstream LLMs. Most achieved above-average EQ scores, with GPT-4 exceeding 89% of human participants with an EQ of 117. Interestingly, a multivariate pattern analysis revealed that some LLMs apparently did not reply on the human-like mechanism to achieve human-level performance, as their representational patterns were qualitatively distinct from humans. In addition, we discussed the impact of factors such as model size, training method, and architecture on LLMs' EQ. In summary, our study presents one of the first psychometric evaluations of the human-like characteristics of LLMs, which may shed light on the future development of LLMs aiming for both high intellectual and emotional intelligence. Project website: https://emotional-intelligence.github.io/

PromptMagician: Interactive Prompt Engineering for Text-to-Image Creation

  • paper_url: http://arxiv.org/abs/2307.09036
  • repo_url: https://github.com/yingchaojiefeng/promptmagician
  • paper_authors: Yingchaojie Feng, Xingbo Wang, Kam Kwai Wong, Sijia Wang, Yuhong Lu, Minfeng Zhu, Baicheng Wang, Wei Chen
  • for: A visual analysis system that helps users explore the image results of their input prompts and refine the prompts interactively for text-to-image creation.
  • methods: The system is built on a prompt recommendation model that takes user prompts as input, retrieves similar prompt-image pairs from DiffusionDB, and identifies special (important and relevant) prompt keywords; a multi-level visualization of the cross-modal embedding of retrieved images and recommended keywords supports interactive refinement against multiple user-specified criteria.
  • results: Two usage scenarios, a user study, and expert interviews demonstrate the effectiveness and usability of the system, suggesting it facilitates prompt engineering and improves the creativity support of generative text-to-image models.
    Abstract Generative text-to-image models have gained great popularity among the public for their powerful capability to generate high-quality images based on natural language prompts. However, developing effective prompts for desired images can be challenging due to the complexity and ambiguity of natural language. This research proposes PromptMagician, a visual analysis system that helps users explore the image results and refine the input prompts. The backbone of our system is a prompt recommendation model that takes user prompts as input, retrieves similar prompt-image pairs from DiffusionDB, and identifies special (important and relevant) prompt keywords. To facilitate interactive prompt refinement, PromptMagician introduces a multi-level visualization for the cross-modal embedding of the retrieved images and recommended keywords, and supports users in specifying multiple criteria for personalized exploration. Two usage scenarios, a user study, and expert interviews demonstrate the effectiveness and usability of our system, suggesting it facilitates prompt engineering and improves the creativity support of the generative text-to-image model.
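The retrieval step behind such a prompt recommender can be as simple as cosine similarity in an embedding space. The sketch below assumes precomputed prompt embeddings; the actual system's recommendation model is richer than this.

```python
import numpy as np

def recommend_prompts(query_vec, prompt_vecs, prompts, k=5):
    """Return the k prompts whose embeddings are most cosine-similar to
    the query embedding, the kind of lookup a prompt-recommendation
    model performs over a corpus like DiffusionDB."""
    q = query_vec / np.linalg.norm(query_vec)
    P = prompt_vecs / np.linalg.norm(prompt_vecs, axis=1, keepdims=True)
    top = np.argsort(P @ q)[::-1][:k]
    return [prompts[i] for i in top]
```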

Exploring acceptance of autonomous vehicle policies using KeyBERT and SNA: Targeting engineering students

  • paper_url: http://arxiv.org/abs/2307.09014
  • repo_url: https://github.com/wangbuera/f230778c-6292-4e4f-97ab-6edac0901476
  • paper_authors: Jinwoo Ha, Dongsoo Kim
  • for: Exploring user acceptance of Autonomous Vehicle (AV) policies with improved text-mining methods, to fill the gap where end-user acceptance is not well considered by policymakers.
  • methods: The study applies two text-mining methods to the comments of graduate students in Industrial, Mechanical, and Electronics-Electrical-Computer engineering: a Co-occurrence Network Analysis (CNA) based on TF-IWF and the Dice coefficient, and a Contextual Semantic Network Analysis (C-SNA) based on KeyBERT keyword extraction and double cosine similarity.
  • results: C-SNA provided the information needed to understand users' voices using fewer nodes and features than CNA; users who pre-emptively understood the AV policies from their engineering literacy and the given texts revealed potential risks in AV accident policies, and the study adds suggestions for managing these risks to support the successful deployment of AVs on public roads.
    Abstract This study aims to explore user acceptance of Autonomous Vehicle (AV) policies with improved text-mining methods. Recently, South Korean policymakers have viewed Autonomous Driving Car (ADC) and Autonomous Driving Robot (ADR) as next-generation means of transportation that will reduce the cost of transporting passengers and goods. They support the construction of V2I and V2V communication infrastructures for ADC and recognize that ADR is equivalent to pedestrians to promote its deployment into sidewalks. To fill the gap where end-user acceptance of these policies is not well considered, this study applied two text-mining methods to the comments of graduate students in the fields of Industrial, Mechanical, and Electronics-Electrical-Computer. One is the Co-occurrence Network Analysis (CNA) based on TF-IWF and Dice coefficient, and the other is the Contextual Semantic Network Analysis (C-SNA) based on both KeyBERT, which extracts keywords that contextually represent the comments, and double cosine similarity. The reason for comparing these approaches is to balance interest not only in the implications for the AV policies but also in the need to apply quality text mining to this research domain. Significantly, the limitation of frequency-based text mining, which does not reflect textual context, and the trade-off of adjusting thresholds in Semantic Network Analysis (SNA) were considered. As the results of comparing the two approaches, the C-SNA provided the information necessary to understand users' voices using fewer nodes and features than the CNA. The users who pre-emptively understood the AV policies based on their engineering literacy and the given texts revealed potential risks of the AV accident policies. This study adds suggestions to manage these risks to support the successful deployment of AVs on public roads.
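The keyword-extraction step of the C-SNA pipeline maps directly onto the KeyBERT library. A minimal example; the parameter values are illustrative, not those used in the study.

```python
from keybert import KeyBERT  # pip install keybert

# Extract contextual keywords from a survey comment, as in the first
# step of the C-SNA pipeline (parameters here are illustrative).
kw_model = KeyBERT()
comment = ("Autonomous delivery robots should be treated like pedestrians "
           "on sidewalks, but accident liability rules remain unclear.")
keywords = kw_model.extract_keywords(
    comment, keyphrase_ngram_range=(1, 2), stop_words="english", top_n=5)
print(keywords)  # [(phrase, relevance score), ...]
```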

How is ChatGPT’s behavior changing over time?

  • paper_url: http://arxiv.org/abs/2307.09009
  • repo_url: https://github.com/lchen001/llmdrift
  • paper_authors: Lingjiao Chen, Matei Zaharia, James Zou
  • for: The paper evaluates how the performance and behavior of GPT-3.5 and GPT-4 change over time, comparing their March 2023 and June 2023 versions.
  • methods: The authors evaluate the models on several diverse tasks: math problems, sensitive/dangerous questions, opinion surveys, multi-hop knowledge-intensive questions, code generation, US Medical License tests, and visual reasoning.
  • results: The performance and behavior of both models vary greatly over time, with some tasks improving and others declining. GPT-4 (June 2023) performed poorly on identifying prime vs. composite numbers compared with GPT-4 (March 2023); GPT-4 became less willing to answer sensitive and opinion-survey questions in June; GPT-4 improved on multi-hop questions while GPT-3.5 declined on them; and both models made more formatting mistakes in code generation in June than in March.
    Abstract GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on several diverse tasks: 1) math problems, 2) sensitive/dangerous questions, 3) opinion surveys, 4) multi-hop knowledge-intensive questions, 5) generating code, 6) US Medical License tests, and 7) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was reasonable at identifying prime vs. composite numbers (84% accuracy) but GPT-4 (June 2023) was poor on these same questions (51% accuracy). This is partly explained by a drop in GPT-4's amenity to follow chain-of-thought prompting. Interestingly, GPT-3.5 was much better in June than in March in this task. GPT-4 became less willing to answer sensitive questions and opinion survey questions in June than in March. GPT-4 performed better at multi-hop questions in June than in March, while GPT-3.5's performance dropped on this task. Both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March. Overall, our findings show that the behavior of the "same" LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLMs.

Ord2Seq: Regarding Ordinal Regression as Label Sequence Prediction

  • paper_url: http://arxiv.org/abs/2307.09004
  • repo_url: None
  • paper_authors: Jinhong Wang, Yi Cheng, Jintai Chen, Tingting Chen, Danny Chen, Jian Wu
  • for: Ordinal regression, i.e., classifying object instances into ordinal categories, with applications such as medical disease grading and movie rating.
  • methods: The paper proposes Ord2Seq, a simple sequence prediction framework that transforms each ordinal category label into a special label sequence, recasting ordinal regression as sequence prediction and decomposing it into a series of recursive binary classification steps.
  • results: Experiments show that Ord2Seq subtly distinguishes adjacent categories, improving performance, and the approach exceeds state-of-the-art results in four different scenarios.
    Abstract Ordinal regression refers to classifying object instances into ordinal categories. It has been widely studied in many scenarios, such as medical disease grading, movie rating, etc. Known methods focused only on learning inter-class ordinal relationships, but still incur limitations in distinguishing adjacent categories thus far. In this paper, we propose a simple sequence prediction framework for ordinal regression called Ord2Seq, which, for the first time, transforms each ordinal category label into a special label sequence and thus regards an ordinal regression task as a sequence prediction process. In this way, we decompose an ordinal regression task into a series of recursive binary classification steps, so as to subtly distinguish adjacent categories. Comprehensive experiments show the effectiveness of distinguishing adjacent categories for performance improvement and our new approach exceeds state-of-the-art performances in four different scenarios. Codes are available at https://github.com/wjh892521292/Ord2Seq.
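The label-to-sequence idea can be illustrated with the simplest dichotomization, thresholding ("is the label greater than t?"). Ord2Seq builds its label sequences by recursive binary splits, so treat this only as a sketch of the encode/decode round trip.

```python
def ordinal_to_sequence(label, n_classes):
    """Encode an ordinal label as a sequence of binary decisions:
    step t answers 'is the label greater than t?'. A simpler threshold
    encoding than Ord2Seq's recursive splits, same underlying idea."""
    return [1 if label > t else 0 for t in range(n_classes - 1)]

def sequence_to_ordinal(bits):
    """Decode by counting leading 1s; the first 0 ends the run."""
    label = 0
    for b in bits:
        if b != 1:
            break
        label += 1
    return label

assert ordinal_to_sequence(3, 5) == [1, 1, 1, 0]
assert sequence_to_ordinal(ordinal_to_sequence(3, 5)) == 3
```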

EVIL: Evidential Inference Learning for Trustworthy Semi-supervised Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.08988
  • repo_url: None
  • paper_authors: Yingyu Chen, Ziyuan Yang, Chenyu Shen, Zhiwen Wang, Yang Qin, Yi Zhang
  • for: Improving the accuracy and reliability of semi-supervised medical image segmentation with a theoretically guaranteed treatment of uncertainty.
  • methods: Evidential Inference Learning (EVIL) introduces the Dempster-Shafer Theory of Evidence (DST) into semi-supervised segmentation, inferring accurate uncertainty quantification in a single forward pass; trustworthy pseudo labels are generated on unlabeled data after uncertainty estimation, within a consistency regularization-based training paradigm that enforces consistency on perturbed predictions to improve generalization with few labeled samples.
  • results: Experiments on a public dataset show that EVIL achieves competitive performance compared with several state-of-the-art methods.
    Abstract Recently, uncertainty-aware methods have attracted increasing attention in semi-supervised medical image segmentation. However, current methods usually suffer from the drawback that it is difficult to balance the computational cost, estimation accuracy, and theoretical support in a unified framework. To alleviate this problem, we introduce the Dempster-Shafer Theory of Evidence (DST) into semi-supervised medical image segmentation, dubbed Evidential Inference Learning (EVIL). EVIL provides a theoretically guaranteed solution to infer accurate uncertainty quantification in a single forward pass. Trustworthy pseudo labels on unlabeled data are generated after uncertainty estimation. The recently proposed consistency regularization-based training paradigm is adopted in our framework, which enforces the consistency on the perturbed predictions to enhance the generalization with few labeled data. Experimental results show that EVIL achieves competitive performance in comparison with several state-of-the-art methods on the public dataset.
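Single-pass evidential uncertainty usually follows the Dirichlet recipe from evidential deep learning. The sketch below shows that standard machinery for a segmentation head; EVIL builds on the same DST foundations, though its exact details may differ.

```python
import torch
import torch.nn.functional as F

def evidential_uncertainty(logits):
    """Subjective-logic uncertainty from single-pass evidence, following
    the standard Dirichlet construction in evidential deep learning.
    logits: (B, K, H, W) raw outputs of a segmentation head."""
    evidence = F.softplus(logits)          # non-negative evidence per class
    alpha = evidence + 1.0                 # Dirichlet concentration parameters
    S = alpha.sum(dim=1, keepdim=True)     # Dirichlet strength
    belief = evidence / S                  # per-class belief masses
    u = logits.size(1) / S                 # u = K / S: total uncertainty mass
    prob = alpha / S                       # expected class probabilities
    return prob, belief, u                 # low u -> trustworthy pseudo label
```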

AI-assisted Improved Service Provisioning for Low-latency XR over 5G NR

  • paper_url: http://arxiv.org/abs/2307.08987
  • repo_url: None
  • paper_authors: Moyukh Laha, Dibbendu Roy, Sourav Dutta, Goutam Das
  • for: This paper aims to address the challenges of ensuring low latency, high data rate, and reliability in supporting Extended Reality (XR) services in 5G/6G networks.
  • methods: The proposed AI-assisted service provisioning scheme leverages predicted frames for processing, virtually increasing the network delay budget and improving service provisioning.
  • results: The proposed scheme is validated by extensive simulations, demonstrating a multi-fold increase in supported XR users and providing crucial network design insights.
    Abstract Extended Reality (XR) is one of the most important 5G/6G media applications that will fundamentally transform human interactions. However, ensuring low latency, high data rate, and reliability to support XR services poses significant challenges. This letter presents a novel AI-assisted service provisioning scheme that leverages predicted frames for processing rather than relying solely on actual frames. This method virtually increases the network delay budget and consequently improves service provisioning, albeit at the expense of minor prediction errors. The proposed scheme is validated by extensive simulations demonstrating a multi-fold increase in supported XR users and also provides crucial network design insights.

PromptCrafter: Crafting Text-to-Image Prompt through Mixed-Initiative Dialogue with LLM

  • paper_url: http://arxiv.org/abs/2307.08985
  • repo_url: None
  • paper_authors: Seungho Baek, Hyerin Im, Jiseung Ryu, Juhyeong Park, Takyeon Lee
  • for: A novel mixed-initiative system that helps users efficiently explore a text-to-image model's capabilities and craft effective prompts.
  • methods: PromptCrafter supports step-by-step crafting of text-to-image prompts; through an iterative process, users clarify their intent by answering clarifying questions generated by a Large Language Model, refine their prompts accordingly, and can revert to a desired step by reviewing the work history.
  • results: The iterative process lets users efficiently explore the model's capability and clarify their intent; in this workshop paper, the authors discuss the design process of PromptCrafter and their plans for follow-up studies.
    Abstract Text-to-image generation model is able to generate images across a diverse range of subjects and styles based on a single prompt. Recent works have proposed a variety of interaction methods that help users understand the capabilities of models and utilize them. However, how to support users to efficiently explore the model's capability and to create effective prompts are still open-ended research questions. In this paper, we present PromptCrafter, a novel mixed-initiative system that allows step-by-step crafting of text-to-image prompt. Through the iterative process, users can efficiently explore the model's capability, and clarify their intent. PromptCrafter also supports users to refine prompts by answering various responses to clarifying questions generated by a Large Language Model. Lastly, users can revert to a desired step by reviewing the work history. In this workshop paper, we discuss the design process of PromptCrafter and our plans for follow-up studies.

Generative Visual Question Answering

  • paper_url: http://arxiv.org/abs/2307.10405
  • repo_url: https://github.com/chojw/genb
  • paper_authors: Ethan Shen, Scotty Singh, Bhavesh Kumar
  • for: 该论文目的是要研究多 modal 任务中的视觉语言深度学习模型在未来数据分布下的一致性,并提出一个可行的方法来创建一个高级的视觉问题回答(VQA)模型,以便在未来数据分布下获得成功的结果。
  • methods: 该论文使用了七种基eline和进步的VQA模型,并将这些模型应用在一个新的扩展 dataset 上,named GenVQA,这个 dataset 使用了 VQAv2 和 MS-COCO dataset 中的图像和描述来生成新的图像,并使用稳定扩展来测试这些模型的一致性。
  • results: 研究发现,这些成功的 VQA 模型在未来数据分布下的一致性较差,但是通过分析这些模型的架构,发现了一些常见的设计选择,可以帮助这些模型在未来数据分布下优化一致性。
    Abstract Multi-modal tasks involving vision and language in deep learning continue to rise in popularity and are leading to the development of newer models that can generalize beyond the extent of their training data. The current models lack temporal generalization which enables models to adapt to changes in future data. This paper discusses a viable approach to creating an advanced Visual Question Answering (VQA) model which can produce successful results on temporal generalization. We propose a new data set, GenVQA, utilizing images and captions from the VQAv2 and MS-COCO dataset to generate new images through stable diffusion. This augmented dataset is then used to test a combination of seven baseline and cutting edge VQA models. Performance evaluation focuses on questions mirroring the original VQAv2 dataset, with the answers having been adjusted to the new images. This paper's purpose is to investigate the robustness of several successful VQA models to assess their performance on future data distributions. Model architectures are analyzed to identify common stylistic choices that improve generalization under temporal distribution shifts. This research highlights the importance of creating a large-scale future shifted dataset. This data can enhance the robustness of VQA models, allowing their future peers to have improved ability to adapt to temporal distribution shifts.
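Regenerating images from existing captions with Stable Diffusion is straightforward with the `diffusers` library. The checkpoint named below is an assumption, since the abstract does not pin one, and the sampling settings are illustrative.

```python
import torch
from diffusers import StableDiffusionPipeline  # pip install diffusers

# Regenerate images from existing captions to build a "future-shifted"
# VQA test set, as GenVQA does. The checkpoint is an assumption; the
# abstract does not specify which stable diffusion model was used.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

caption = "A man riding a wave on top of a surfboard."  # e.g. from MS-COCO
image = pipe(caption, num_inference_steps=30).images[0]
image.save("genvqa_000001.png")  # pair with the original VQAv2 question/answers
```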

Development of the ChatGPT, Generative Artificial Intelligence and Natural Large Language Models for Accountable Reporting and Use (CANGARU) Guidelines

  • paper_url: http://arxiv.org/abs/2307.08974
  • repo_url: None
  • paper_authors: Giovanni E. Cacciamani, Michael B. Eppler, Conner Ganjavi, Asli Pekan, Brett Biedermann, Gary S. Collins, Inderbir S. Gill
    for: 这个研究的目的是提出一个跨学科全包的global inclusive consensus,以便在学术研究中负责使用、披透和报告生成人工智能(GAI)/生成预训练变换器(GPT)/大语言模型(LLM)技术。methods: 这个研究使用了系统性文献评估,以理解相关的想法、发现和报道标准在学术研究中,并制定了使用和披透GAI/GPT/LLM技术的指南。同时,该研究还进行了一个 bibliometric 分析,以评估现有的作者指南,分析其建议的不一致,并从Delphi调查中提取了共识。results: 该研究通过系统性文献评估和Delphi调查,成功地建立了一个全球 inclusive consensus,以便在学术研究中负责使用、披透和报告GAI/GPT/LLM技术。这些指南将帮助保证GAI/GPT/LLM技术的负责使用、披透和报告,以确保学术研究的可靠性和可重复性。
    Abstract The swift progress and ubiquitous adoption of Generative AI (GAI), Generative Pre-trained Transformers (GPTs), and large language models (LLMs) like ChatGPT, have spurred queries about their ethical application, use, and disclosure in scholarly research and scientific productions. A few publishers and journals have recently created their own sets of rules; however, the absence of a unified approach may lead to a 'Babel Tower Effect,' potentially resulting in confusion rather than desired standardization. In response to this, we present the ChatGPT, Generative Artificial Intelligence, and Natural Large Language Models for Accountable Reporting and Use Guidelines (CANGARU) initiative, with the aim of fostering a cross-disciplinary global inclusive consensus on the ethical use, disclosure, and proper reporting of GAI/GPT/LLM technologies in academia. The present protocol consists of four distinct parts: a) an ongoing systematic review of GAI/GPT/LLM applications to understand the linked ideas, findings, and reporting standards in scholarly research, and to formulate guidelines for its use and disclosure, b) a bibliometric analysis of existing author guidelines in journals that mention GAI/GPT/LLM, with the goal of evaluating existing guidelines, analyzing the disparity in their recommendations, and identifying common rules that can be brought into the Delphi consensus process, c) a Delphi survey to establish agreement on the items for the guidelines, ensuring principled GAI/GPT/LLM use, disclosure, and reporting in academia, and d) the subsequent development and dissemination of the finalized guidelines and their supplementary explanation and elaboration documents.

Landscape Surrogate: Learning Decision Losses for Mathematical Optimization Under Partial Information

  • paper_url: http://arxiv.org/abs/2307.08964
  • repo_url: https://github.com/facebookresearch/lancer
  • paper_authors: Arman Zharmagambetov, Brandon Amos, Aaron Ferber, Taoan Huang, Bistra Dilkina, Yuandong Tian
  • for: Optimization problems that are only partially observed, or where general-purpose optimizers perform poorly without expert tuning.
  • methods: The paper learns a smooth Landscape Surrogate $M$ as a replacement for the compound function $f\circ \mathbf{g}$, where $f$ is the objective and $\mathbf{g}$ a learned optimizer; the surrogate, learnable by neural networks, is faster to compute than the solver $\mathbf{g}$, provides dense and smooth gradients during training, generalizes to unseen optimization problems, and is learned efficiently via alternating optimization.
  • results: On synthetic problems (shortest path, multidimensional knapsack) and real-world problems such as portfolio optimization, the method achieves objective values comparable or superior to state-of-the-art baselines while reducing the number of calls to $\mathbf{g}$, and notably outperforms existing methods on computationally expensive high-dimensional problems.
    Abstract Recent works in learning-integrated optimization have shown promise in settings where the optimization problem is only partially observed or where general-purpose optimizers perform poorly without expert tuning. By learning an optimizer $\mathbf{g}$ to tackle these challenging problems with $f$ as the objective, the optimization process can be substantially accelerated by leveraging past experience. The optimizer can be trained with supervision from known optimal solutions or implicitly by optimizing the compound function $f\circ \mathbf{g}$. The implicit approach may not require optimal solutions as labels and is capable of handling problem uncertainty; however, it is slow to train and deploy due to frequent calls to optimizer $\mathbf{g}$ during both training and testing. The training is further challenged by sparse gradients of $\mathbf{g}$, especially for combinatorial solvers. To address these challenges, we propose using a smooth and learnable Landscape Surrogate $M$ as a replacement for $f\circ \mathbf{g}$. This surrogate, learnable by neural networks, can be computed faster than the solver $\mathbf{g}$, provides dense and smooth gradients during training, can generalize to unseen optimization problems, and is efficiently learned via alternating optimization. We test our approach on both synthetic problems, including shortest path and multidimensional knapsack, and real-world problems such as portfolio optimization, achieving comparable or superior objective values compared to state-of-the-art baselines while reducing the number of calls to $\mathbf{g}$. Notably, our approach outperforms existing methods for computationally expensive high-dimensional problems.
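The alternating scheme can be sketched as two interleaved updates: fit the surrogate $M$ to fresh solver evaluations of $f\circ\mathbf{g}$, then update the predictor through the smooth surrogate instead of the solver. Everything below (shapes, optimizer settings, the solver stub) is illustrative, not the paper's implementation.

```python
import torch

def train_with_surrogate(predictor, surrogate, solver_g, decision_loss_f,
                         contexts, targets, steps=100):
    """Alternating optimization sketch. `surrogate` maps predicted
    problem parameters y (B, d) to a scalar loss estimate per sample
    (B,); `solver_g` is an expensive, non-differentiable solver."""
    opt_pred = torch.optim.Adam(predictor.parameters(), lr=1e-3)
    opt_sur = torch.optim.Adam(surrogate.parameters(), lr=1e-3)
    for _ in range(steps):
        y = predictor(contexts)                    # predicted problem parameters
        # (1) fit the surrogate on fresh solver calls (no gradient through g)
        with torch.no_grad():
            true_loss = decision_loss_f(solver_g(y), targets)   # (B,)
        sur_fit = ((surrogate(y.detach()) - true_loss) ** 2).mean()
        opt_sur.zero_grad(); sur_fit.backward(); opt_sur.step()
        # (2) update the predictor through the smooth surrogate instead of g
        pred_loss = surrogate(y).mean()
        opt_pred.zero_grad(); pred_loss.backward(); opt_pred.step()
```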

REX: Rapid Exploration and eXploitation for AI Agents

  • paper_url: http://arxiv.org/abs/2307.08962
  • repo_url: None
  • paper_authors: Rithesh Murthy, Shelby Heinecke, Juan Carlos Niebles, Zhiwei Liu, Le Xue, Weiran Yao, Yihao Feng, Zeyuan Chen, Akash Gokul, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, Silvio Savarese
  • for: REX is proposed to address the limitations of existing AutoGPT-style techniques in decision-making and to improve the efficiency and practicality of AI agent performance.
  • methods: REX introduces an additional layer of rewards and integrates concepts similar to Upper Confidence Bound (UCB) scores to enhance AI agent performance. It also utilizes offline behaviors from logs and does not require any model fine-tuning.
  • results: REX-based methods demonstrate comparable performance with existing methods such as Chain-of-Thoughts (CoT) and Reasoning viA Planning (RAP), in certain cases even surpassing them, and exhibit remarkable reductions in execution time, making them more practical and efficient across a diverse range of scenarios.
    Abstract In this paper, we propose an enhanced approach for Rapid Exploration and eXploitation for AI Agents called REX. Existing AutoGPT-style techniques have inherent limitations, such as a heavy reliance on precise descriptions for decision-making, and the lack of a systematic approach to leverage try-and-fail procedures akin to traditional Reinforcement Learning (RL). REX introduces an additional layer of rewards and integrates concepts similar to Upper Confidence Bound (UCB) scores, leading to more robust and efficient AI agent performance. This approach has the advantage of enabling the utilization of offline behaviors from logs and allowing seamless integration with existing foundation models while it does not require any model fine-tuning. Through comparative analysis with existing methods such as Chain-of-Thoughts(CoT) and Reasoning viA Planning(RAP), REX-based methods demonstrate comparable performance and, in certain cases, even surpass the results achieved by these existing techniques. Notably, REX-based methods exhibit remarkable reductions in execution time, enhancing their practical applicability across a diverse set of scenarios.
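A UCB-style score of the kind REX layers onto action selection looks like the standalone helper below; it only illustrates the exploration/exploitation rule, not REX's agent loop or reward layer.

```python
import math

def ucb_select(actions, value, count, total, c=1.0):
    """Pick the action maximizing a UCB score: exploit the running value
    estimate, explore rarely-tried actions. `value` and `count` map each
    action to its mean reward and visit count; `total` is the sum of
    all visit counts."""
    def score(a):
        if count[a] == 0:
            return float("inf")                    # try untested actions first
        return value[a] + c * math.sqrt(math.log(total) / count[a])
    return max(actions, key=score)

# Example: three candidate actions after a few trials.
actions = ["search", "summarize", "ask_user"]
print(ucb_select(actions,
                 value={"search": 0.6, "summarize": 0.4, "ask_user": 0.5},
                 count={"search": 5, "summarize": 2, "ask_user": 3},
                 total=10))
```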

Siamese Networks for Weakly Supervised Human Activity Recognition

  • paper_url: http://arxiv.org/abs/2307.08944
  • repo_url: None
  • paper_authors: Taoran Sheng, Manfred Huber
  • for: Human activity recognition with deep learning, without requiring explicitly labeled training data.
  • methods: A model of multiple siamese networks is trained using only the information about the similarity between pairs of data samples; a similarity loss forces the distance metric to be small for pairs from the same activity and large for pairs from different activities, so the trained model maps activity samples to fixed-size representation vectors whose distances approximate the similarity of the samples in input space.
  • results: The trained model can serve as a metric for a wide range of clustering algorithms; evaluations on three datasets verify its effectiveness for segmentation and recognition of continuous human activity sequences.
    Abstract Deep learning has been successfully applied to human activity recognition. However, training deep neural networks requires explicitly labeled data which is difficult to acquire. In this paper, we present a model with multiple siamese networks that are trained by using only the information about the similarity between pairs of data samples without knowing the explicit labels. The trained model maps the activity data samples into fixed size representation vectors such that the distance between the vectors in the representation space approximates the similarity of the data samples in the input space. Thus, the trained model can work as a metric for a wide range of different clustering algorithms. The training process minimizes a similarity loss function that forces the distance metric to be small for pairs of samples from the same kind of activity, and large for pairs of samples from different kinds of activities. We evaluate the model on three datasets to verify its effectiveness in segmentation and recognition of continuous human activity sequences.
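A standard contrastive formulation of such a pairwise similarity loss, assuming a shared (siamese) encoder producing embeddings for each sample of a pair and a binary same-activity indicator (the paper's exact loss may differ):

```python
import torch
import torch.nn.functional as F

def pair_similarity_loss(z1, z2, same, margin=1.0):
    """Contrastive loss over embedding pairs: pull together pairs from
    the same activity (same=1), push apart different-activity pairs up
    to a margin (same=0). z1, z2: (B, D) outputs of the shared encoder;
    same: (B,) float tensor of 0/1 pair labels."""
    d = F.pairwise_distance(z1, z2)
    loss = same * d.pow(2) + (1.0 - same) * F.relu(margin - d).pow(2)
    return loss.mean()
```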

IxDRL: A Novel Explainable Deep Reinforcement Learning Toolkit based on Analyses of Interestingness

  • paper_url: http://arxiv.org/abs/2307.08933
  • repo_url: https://github.com/sri-aic/23-xai-ixdrl-data
  • paper_authors: Pedro Sequeira, Melinda Gervasio
  • for: Addressing the lack of explainability in deep reinforcement learning (RL), giving human operators in collaborative human-machine settings an insightful, holistic view of an RL agent's capabilities and limitations so they can make more informed decisions, especially in critical applications.
  • methods: The paper proposes a new framework based on analyses of interestingness, providing various measures of RL agent competence; the tool is applicable to a wide range of RL algorithms and natively supports the popular RLLib toolkit.
  • results: Applying the proposed pipeline in scenarios of varying complexity, the authors empirically show that it identifies agent behavior patterns, competency-controlling conditions, and the task elements most responsible for an agent's competence, based on global and local analyses of interestingness, enabling more informed decisions about interventions, additional training, and other interactions.
    Abstract In recent years, advances in deep learning have resulted in a plethora of successes in the use of reinforcement learning (RL) to solve complex sequential decision tasks with high-dimensional inputs. However, existing systems lack the necessary mechanisms to provide humans with a holistic view of their competence, presenting an impediment to their adoption, particularly in critical applications where the decisions an agent makes can have significant consequences. Yet, existing RL-based systems are essentially competency-unaware in that they lack the necessary interpretation mechanisms to allow human operators to have an insightful, holistic view of their competency. Towards more explainable Deep RL (xDRL), we propose a new framework based on analyses of interestingness. Our tool provides various measures of RL agent competence stemming from interestingness analysis and is applicable to a wide range of RL algorithms, natively supporting the popular RLLib toolkit. We showcase the use of our framework by applying the proposed pipeline in a set of scenarios of varying complexity. We empirically assess the capability of the approach in identifying agent behavior patterns and competency-controlling conditions, and the task elements mostly responsible for an agent's competence, based on global and local analyses of interestingness. Overall, we show that our framework can provide agent designers with insights about RL agent competence, both their capabilities and limitations, enabling more informed decisions about interventions, additional training, and other interactions in collaborative human-machine settings.

Unsupervised Deep Graph Matching Based on Cycle Consistency

  • paper_url: http://arxiv.org/abs/2307.08930
  • repo_url: None
  • paper_authors: Siddharth Tourani, Carsten Rother, Muhammad Haris Khan, Bogdan Savchynskyy
  • for: 无监督学习图像中的关键点匹配
  • methods: 自监督方法,通过强制同类物体图像之间匹配的一致性进行训练,无需真实对应标注
  • results: 在无监督图匹配中创造了新的最佳成绩
    Abstract We contribute to the sparsely populated area of unsupervised deep graph matching with application to keypoint matching in images. Contrary to the standard \emph{supervised} approach, our method does not require ground truth correspondences between keypoint pairs. Instead, it is self-supervised by enforcing consistency of matchings between images of the same object category. As the matching and the consistency loss are discrete, their derivatives cannot be straightforwardly used for learning. We address this issue in a principled way by building our method upon the recent results on black-box differentiation of combinatorial solvers. This makes our method exceptionally flexible, as it is compatible with arbitrary network architectures and combinatorial solvers. Our experimental evaluation suggests that our technique sets a new state-of-the-art for unsupervised graph matching.
    摘要 我们在研究较少的无监督深度图匹配领域做出了贡献,并将其应用于图像中的关键点匹配。与标准的监督方法不同,我们的方法不需要关键点对之间的真实对应标注,而是通过强制同类物体图像之间匹配的一致性来实现自监督。由于匹配和一致性损失是离散的,它们的导数无法直接用于学习。我们基于最近关于组合求解器黑盒微分的研究成果,以有原则的方式解决了这一问题。这使得我们的方法非常灵活,可以与任意网络架构和组合求解器兼容。实验评估表明,我们的技术在无监督图匹配上创造了新的最佳成绩。
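
To illustrate the consistency idea, the sketch below penalizes matchings that disagree around a cycle of three images; it uses soft assignment matrices and a squared-error penalty as a simplified stand-in for the paper's discrete loss and black-box solver differentiation:

```python
import numpy as np

def cycle_consistency_loss(P_ab, P_bc, P_ac):
    """Penalize matchings that are inconsistent around the cycle A->B->C vs A->C.

    P_xy are (n, n) (soft) assignment matrices mapping keypoints of image x
    to keypoints of image y. Composing A->B then B->C should agree with A->C.
    """
    composed = P_ab @ P_bc
    return float(np.sum((composed - P_ac) ** 2))

# Toy check: consistent permutations give zero loss.
n = 4
perm = np.eye(n)[np.random.permutation(n)]
assert cycle_consistency_loss(perm, perm.T, np.eye(n)) < 1e-12
```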

Multi-Stage Cable Routing through Hierarchical Imitation Learning

  • paper_url: http://arxiv.org/abs/2307.08927
  • repo_url: None
  • paper_authors: Jianlan Luo, Charles Xu, Xinyang Geng, Gilbert Feng, Kuan Fang, Liam Tan, Stefan Schaal, Sergey Levine
  • for: 该研究旨在解决多阶段机器人操作任务中的问题,特别是电缆布线任务,机器人需要将电缆依次穿过一系列线夹。
  • methods: 该研究使用模仿学习方法,从示范中学习基于视觉的策略,涵盖底层(电机控制)和上层(动作排序)两个层级。
  • results: 研究表明,该方法在非常具有挑战性的线夹位置变化下表现出色,并且能够从失败中恢复和纠正错误。
    Abstract We study the problem of learning to perform multi-stage robotic manipulation tasks, with applications to cable routing, where the robot must route a cable through a series of clips. This setting presents challenges representative of complex multi-stage robotic manipulation scenarios: handling deformable objects, closing the loop on visual perception, and handling extended behaviors consisting of multiple steps that must be executed successfully to complete the entire task. In such settings, learning individual primitives for each stage that succeed with a high enough rate to perform a complete temporally extended task is impractical: if each stage must be completed successfully and has a non-negligible probability of failure, the likelihood of successful completion of the entire task becomes negligible. Therefore, successful controllers for such multi-stage tasks must be able to recover from failure and compensate for imperfections in low-level controllers by smartly choosing which controllers to trigger at any given time, retrying, or taking corrective action as needed. To this end, we describe an imitation learning system that uses vision-based policies trained from demonstrations at both the lower (motor control) and the upper (sequencing) level, present a system for instantiating this method to learn the cable routing task, and perform evaluations showing great performance in generalizing to very challenging clip placement variations. Supplementary videos, datasets, and code can be found at https://sites.google.com/view/cablerouting.
    摘要 我们研究多阶段机器人操作任务的学习问题,并应用于电缆布线场景:机器人需要将电缆依次穿过一系列线夹。这一设定呈现出复杂多阶段机器人操作场景的典型挑战:处理可形变物体、闭合视觉感知回路,以及处理由多个必须全部成功执行的步骤组成的长时序行为。在这种情况下,为每个阶段单独学习成功率足够高的基本技能并不现实:如果每个阶段都必须成功完成且失败概率不可忽略,整个任务成功完成的可能性将变得极低。因此,此类多阶段任务的成功控制器必须能够从失败中恢复,通过在任意时刻智能地选择触发哪个控制器、重试或采取纠正动作,来弥补底层控制器的缺陷。为此,我们描述了一个模仿学习系统,在底层(电机控制)和上层(动作排序)均使用从示范中训练的视觉策略;我们给出了将该方法实例化以学习电缆布线任务的系统,并通过评估展示了其在极具挑战性的线夹位置变化下出色的泛化性能。补充视频、数据集和代码请见 https://sites.google.com/view/cablerouting。

Federated Large Language Model: A Position Paper

  • paper_url: http://arxiv.org/abs/2307.08925
  • repo_url: None
  • paper_authors: Chaochao Chen, Xiaohua Feng, Jun Zhou, Jianwei Yin, Xiaolin Zheng
  • for: 这篇论文的目的是探讨大规模自然语言模型(LLM)在实际应用中遇到的挑战,以及如何使用联邦学习(FL)技术来解决这些挑战。
  • methods: 这篇论文提出了三个关键组成部分,即联邦LLM预训练、联邦LLM微调和联邦LLM提示工程。对每个组成部分,论文讨论了其相对于传统LLM训练方法的优势,并提出了具体的工程实现策略。
  • results: 这篇论文分析了联邦LLM的新问题和挑战,并评估了现有的解决方案和可能的阻碍因素。
    Abstract Large scale language models (LLM) have received significant attention and found diverse applications across various domains, but their development encounters challenges in real-world scenarios. These challenges arise due to the scarcity of public domain data availability and the need to maintain privacy with respect to private domain data. To address these issues, federated learning (FL) has emerged as a promising technology that enables collaborative training of shared models while preserving decentralized data. We propose the concept of federated LLM, which comprises three key components, i.e., federated LLM pre-training, federated LLM fine-tuning, and federated LLM prompt engineering. For each component, we discuss its advantage over traditional LLM training methods and propose specific engineering strategies for implementation. Furthermore, we explore the novel challenges introduced by the integration of FL and LLM. We analyze existing solutions and identify potential obstacles faced by these solutions within the context of federated LLM.
    摘要 大规模语言模型(LLM)受到了广泛关注,并在各个领域得到了多样化应用,但其发展在现实场景中面临挑战。这些挑战源于公共领域数据的稀缺,以及对私有领域数据隐私保护的需求。为解决这些问题,联邦学习(FL)作为一种有前景的技术应运而生,它能够在保持数据分散的同时协作训练共享模型。我们提出了联邦LLM的概念,它包含三个关键组成部分:联邦LLM预训练、联邦LLM微调和联邦LLM提示工程。对每个组成部分,我们讨论了其相对于传统LLM训练方法的优势,并提出了具体的工程实现策略。此外,我们探讨了FL与LLM结合所带来的新挑战,分析了现有解决方案,并指出了这些方案在联邦LLM背景下可能面临的障碍。

Continuous-Time Reinforcement Learning: New Design Algorithms with Theoretical Insights and Performance Guarantees

  • paper_url: http://arxiv.org/abs/2307.08920
  • repo_url: None
  • paper_authors: Brent A. Wallace, Jennie Si
  • for: 这个论文的目的是提出一种新的连续时间非线性优化控制方法,用于控制非线性系统。
  • methods: 这些方法将物理系统划分为更小的子问题,以提高设计直观性并降低维度。它们还引入了新的激励框架,通过经典的输入/输出视角来改善持续激励(persistence of excitation, PE)和数值条件性能。
  • results: 这些方法获得了收敛性和闭环稳定性保证,并在控制一个不稳定、非最小相位的高超声速飞行器(HSV)这一应用上表现出了良好的性能。
    Abstract Continuous-time nonlinear optimal control problems hold great promise in real-world applications. After decades of development, reinforcement learning (RL) has achieved some of the greatest successes as a general nonlinear control design method. However, a recent comprehensive analysis of state-of-the-art continuous-time RL (CT-RL) methods, namely, adaptive dynamic programming (ADP)-based CT-RL algorithms, reveals they face significant design challenges due to their complexity, numerical conditioning, and dimensional scaling issues. Despite advanced theoretical results, existing ADP CT-RL synthesis methods are inadequate in solving even small, academic problems. The goal of this work is thus to introduce a suite of new CT-RL algorithms for control of affine nonlinear systems. Our design approach relies on two important factors. First, our methods are applicable to physical systems that can be partitioned into smaller subproblems. This constructive consideration results in reduced dimensionality and greatly improved intuitiveness of design. Second, we introduce a new excitation framework to improve persistence of excitation (PE) and numerical conditioning performance via classical input/output insights. Such a design-centric approach is the first of its kind in the ADP CT-RL community. In this paper, we progressively introduce a suite of (decentralized) excitable integral reinforcement learning (EIRL) algorithms. We provide convergence and closed-loop stability guarantees, and we demonstrate these guarantees on a significant application problem of controlling an unstable, nonminimum phase hypersonic vehicle (HSV).
    摘要 连续时间非线性最优控制问题在现实应用中前景广阔。经过数十年的发展,强化学习(RL)作为一种通用的非线性控制设计方法取得了一些重大成功。然而,最近对最先进的连续时间RL(CT-RL)方法,即基于自适应动态规划(ADP)的CT-RL算法的全面分析表明,它们因复杂性、数值条件和维度扩展问题而面临重大设计挑战。尽管有先进的理论结果,现有的ADP CT-RL综合方法甚至难以求解小型的学术问题。因此,本工作的目标是提出一套用于仿射非线性系统控制的新CT-RL算法。我们的设计方法依赖两个重要因素:其一,我们的方法适用于可划分为更小子问题的物理系统,这种构造性考虑降低了维度并大幅提高了设计的直观性;其二,我们引入了一个新的激励框架,通过经典的输入/输出视角来改善持续激励(PE)和数值条件性能。这种以设计为中心的方法在ADP CT-RL领域尚属首创。本文中,我们逐步提出了一套(去中心化的)可激励积分强化学习(EIRL)算法,给出了收敛性和闭环稳定性保证,并在控制不稳定、非最小相位高超声速飞行器(HSV)这一重要应用问题上验证了这些保证。

Solving multiphysics-based inverse problems with learned surrogates and constraints

  • paper_url: http://arxiv.org/abs/2307.11099
  • repo_url: None
  • paper_authors: Ziyi Yin, Rafael Orozco, Mathias Louboutin, Felix J. Herrmann
  • for: 这项研究旨在解决地质碳封存监测中的多物理场反演问题:当多模态时间延迟数据采集昂贵且数值模拟计算成本过高时,我们将计算廉价的学习代理模型与学习约束相结合,以获得高精度的渗透率反演结果。
  • methods: 我们使用了一种将学习代理模型与学习约束相结合的增强方法,其中包括一个训练好的深度神经网络(normalizing flow),强制模型迭代保持在分布内,以保证作为代理的Fourier neural operator的精度。
  • results: 我们围绕地质碳封存问题精心设计了实验,使用时间延迟井数据和时间延迟地震数据两种不同的数据模态,并评估了两种模态联合反演的效果。结果显示,这种结合方法可以在监测井附近及远离监测井的区域提供高精度的渗透率反演和CO2羽流预测。
    Abstract Solving multiphysics-based inverse problems for geological carbon storage monitoring can be challenging when multimodal time-lapse data are expensive to collect and costly to simulate numerically. We overcome these challenges by combining computationally cheap learned surrogates with learned constraints. Not only does this combination lead to vastly improved inversions for the important fluid-flow property, permeability, it also provides a natural platform for inverting multimodal data including well measurements and active-source time-lapse seismic data. By adding a learned constraint, we arrive at a computationally feasible inversion approach that remains accurate. This is accomplished by including a trained deep neural network, known as a normalizing flow, which forces the model iterates to remain in-distribution, thereby safeguarding the accuracy of trained Fourier neural operators that act as surrogates for the computationally expensive multiphase flow simulations involving partial differential equation solves. By means of carefully selected experiments, centered around the problem of geological carbon storage, we demonstrate the efficacy of the proposed constrained optimization method on two different data modalities, namely time-lapse well and time-lapse seismic data. While permeability inversions from both these two modalities have their pluses and minuses, their joint inversion benefits from either, yielding valuable superior permeability inversions and CO2 plume predictions near, and far away, from the monitoring wells.
    摘要 当多模态时间延迟数据采集昂贵且数值模拟计算成本过高时,求解用于地质碳封存监测的多物理场反演问题可能十分困难。我们通过将计算廉价的学习代理模型与学习约束相结合来克服这些挑战。这种结合不仅大幅改善了对关键流体流动属性——渗透率——的反演,还为包括井测量和主动源时间延迟地震数据在内的多模态数据反演提供了天然的平台。通过加入学习约束,我们得到了一种计算上可行且保持精确的反演方法:其中包含一个训练好的深度神经网络(normalizing flow),强制模型迭代保持在分布内,从而保证作为代理、替代涉及偏微分方程求解的昂贵多相流模拟的Fourier neural operator的精度。通过围绕地质碳封存问题精心设计的实验,我们在时间延迟井数据和时间延迟地震数据两种模态上验证了所提约束优化方法的有效性。虽然基于这两种模态的渗透率反演各有优缺点,但它们的联合反演能够兼取两者之长,在监测井附近及远离监测井的区域均给出更优的渗透率反演和CO2羽流预测。

Basal-Bolus Advisor for Type 1 Diabetes (T1D) Patients Using Multi-Agent Reinforcement Learning (RL) Methodology

  • paper_url: http://arxiv.org/abs/2307.08897
  • repo_url: None
  • paper_authors: Mehrad Jaloli, Marzia Cescon
  • for: 这个研究旨在开发一种基于多智能体强化学习(RL)的个性化血糖控制方法,以改善1型糖尿病(T1D)患者的血糖水平。
  • methods: 该方法采用一个闭环系统,包括一个血糖代谢模型和一个作为基础-餐时胰岛素(basal-bolus)建议器的多智能体 soft actor-critic 强化学习模型。
  • results: 研究结果显示,基于RL的basal-bolus建议器能有效改善血糖控制,减少血糖波动并增加血糖处于目标范围(70-180 mg/dL)内的时间。低血糖事件得到有效预防,严重高血糖事件也有所减少。此外,与传统治疗相比,RL方法还使每日平均基础胰岛素剂量出现统计学上显著的降低。这些发现表明RL方法在实现更好的血糖控制和降低严重高血糖风险方面是有效的。
    Abstract This paper presents a novel multi-agent reinforcement learning (RL) approach for personalized glucose control in individuals with type 1 diabetes (T1D). The method employs a closed-loop system consisting of a blood glucose (BG) metabolic model and a multi-agent soft actor-critic RL model acting as the basal-bolus advisor. Performance evaluation is conducted in three scenarios, comparing the RL agents to conventional therapy. Evaluation metrics include glucose levels (minimum, maximum, and mean), time spent in different BG ranges, and average daily bolus and basal insulin dosages. Results demonstrate that the RL-based basal-bolus advisor significantly improves glucose control, reducing glycemic variability and increasing time spent within the target range (70-180 mg/dL). Hypoglycemia events are effectively prevented, and severe hyperglycemia events are reduced. The RL approach also leads to a statistically significant reduction in average daily basal insulin dosage compared to conventional therapy. These findings highlight the effectiveness of the multi-agent RL approach in achieving better glucose control and mitigating the risk of severe hyperglycemia in individuals with T1D.
    摘要 本文提出了一种新颖的多智能体强化学习(RL)方法,用于1型糖尿病(T1D)患者的个性化血糖控制。该方法采用一个闭环系统,由血糖代谢模型和作为基础-餐时胰岛素建议器的多智能体 soft actor-critic RL模型组成。我们在三种场景下进行性能评估,将RL智能体与传统治疗进行比较,评估指标包括血糖水平(最小值、最大值和均值)、血糖处于不同范围内的时间,以及每日平均餐时和基础胰岛素剂量。结果表明,基于RL的基础-餐时建议器显著改善了血糖控制,减少了血糖波动并增加了血糖处于目标范围(70-180 mg/dL)内的时间;低血糖事件得到有效预防,严重高血糖事件有所减少。与传统治疗相比,RL方法还使每日平均基础胰岛素剂量出现统计学上显著的降低。这些发现凸显了多智能体RL方法在帮助T1D患者实现更好的血糖控制和降低严重高血糖风险方面的有效性。

AI for the Generation and Testing of Ideas Towards an AI Supported Knowledge Development Environment

  • paper_url: http://arxiv.org/abs/2307.08876
  • repo_url: None
  • paper_authors: Ted Selker
  • for: 这篇论文主要探讨如何利用机器学习技术构建大型语言模型,在不同的交流形式中预测序列信息,并通过Transformer生成文本或视觉输出,以模拟人类的回应。
  • methods: 这篇论文使用的方法主要是基于Transformers的生成AI,它可以在不同的语言模型中生成文本或视觉输出,并且可以模拟人类的回应。
  • results: 这篇论文的结果主要表明,通过将生成AI与互联网源的追溯功能相结合,可以创造出更加有价值的解决方案,并且可以减少人类偏见。此外,论文还提出了一种名为“Generate And Search Test”的系统,可以帮助知识工作者更加快速地创建高质量的解决方案。
    Abstract New systems employ Machine Learning to sift through large knowledge sources, creating flexible Large Language Models. These models discern context and predict sequential information in various communication forms. Generative AI, leveraging Transformers, generates textual or visual outputs mimicking human responses. It proposes one or multiple contextually feasible solutions for a user to contemplate. However, generative AI does not currently support traceability of ideas, a useful feature provided by search engines indicating origin of information. The narrative style of generative AI has gained positive reception. People learn from stories. Yet, early ChatGPT efforts had difficulty with truth, reference, calculations, and aspects like accurate maps. Current capabilities of referencing locations and linking to apps seem to be better catered by the link-centric search methods we've used for two decades. Deploying truly believable solutions extends beyond simulating contextual relevance as done by generative AI. Combining the creativity of generative AI with the provenance of internet sources in hybrid scenarios could enhance internet usage. Generative AI, viewed as drafts, stimulates thinking, offering alternative ideas for final versions or actions. Scenarios for information requests are considered. We discuss how generative AI can boost idea generation by eliminating human bias. We also describe how search can verify facts, logic, and context. The user evaluates these generated ideas for selection and usage. This paper introduces a system for knowledge workers, Generate And Search Test, enabling individuals to efficiently create solutions previously requiring top collaborations of experts.
    摘要 新系统利用机器学习来筛选大型知识源,构建灵活的大型语言模型。这些模型能够辨识上下文,并在各种交流形式中预测序列信息。生成式AI借助Transformer生成模拟人类回应的文本或视觉输出,为用户提供一个或多个符合上下文的可行解决方案以供斟酌。然而,生成式AI目前不支持想法的溯源,而这正是搜索引擎所提供的指明信息来源的有用功能。生成式AI的叙事风格广受好评,人们从故事中学习。然而,早期的ChatGPT在真实性、引用、计算以及准确地图等方面存在困难。目前引用位置和链接应用的能力,似乎仍由我们使用了二十年的以链接为中心的搜索方法更好地满足。部署真正可信的解决方案,不能止步于生成式AI所做的模拟上下文相关性。在混合场景中将生成式AI的创造力与互联网来源的溯源性相结合,可以提升互联网的使用价值。生成式AI可被视为草稿,它激发思考,为最终版本或行动提供备选想法。我们考察了各种信息请求场景,讨论了生成式AI如何通过消除人类偏见来促进想法生成,并描述了搜索如何验证事实、逻辑和上下文;用户则对这些生成的想法进行评估、选择和使用。本文介绍了一个面向知识工作者的系统——Generate And Search Test,使个人能够高效地创建以往需要顶尖专家协作才能完成的解决方案。

An Alternative to Variance: Gini Deviation for Risk-averse Policy Gradient

  • paper_url: http://arxiv.org/abs/2307.08873
  • repo_url: None
  • paper_authors: Yudong Luo, Guiliang Liu, Pascal Poupart, Yangchen Pan
  • for: 限制策略回报的方差,以在风险厌恶强化学习(RL)中规避潜在风险。
  • methods: 使用一种新的风险度量——基尼偏差(Gini deviation)——来替代直接使用回报方差。
  • results: 在能够明确定义风险厌恶的领域中,我们的算法可以规避基于方差的方法的局限性,并在方差和基尼偏差两种意义下实现高回报、低风险。
    Abstract Restricting the variance of a policy's return is a popular choice in risk-averse Reinforcement Learning (RL) due to its clear mathematical definition and easy interpretability. Traditional methods directly restrict the total return variance. Recent methods restrict the per-step reward variance as a proxy. We thoroughly examine the limitations of these variance-based methods, such as sensitivity to numerical scale and hindering of policy learning, and propose to use an alternative risk measure, Gini deviation, as a substitute. We study various properties of this new risk measure and derive a policy gradient algorithm to minimize it. Empirical evaluation in domains where risk-aversion can be clearly defined, shows that our algorithm can mitigate the limitations of variance-based risk measures and achieves high return with low risk in terms of variance and Gini deviation when others fail to learn a reasonable policy.
    摘要 限制策略回报的方差是风险厌恶强化学习(RL)中的常见选择,因为它具有明确的数学定义且易于解释。传统方法直接限制总回报的方差,近期方法则以限制每步奖励的方差作为替代。我们仔细考察了这些基于方差的方法的局限性,例如对数值尺度的敏感性以及对策略学习的阻碍,并提出使用另一种风险度量——基尼偏差(Gini deviation)——作为替代。我们研究了这一新风险度量的多种性质,并推导了最小化该度量的策略梯度算法。在能够明确定义风险厌恶的领域中的实证评估表明,当其他方法无法学到合理策略时,我们的算法能够规避基于方差的风险度量的局限性,并在方差和基尼偏差两种意义下实现高回报、低风险。
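
For reference, the abstract does not spell out the estimator; a standard definition of the Gini deviation (half the Gini mean difference) of a return $R$, which we assume is the quantity in question, together with its empirical estimate over sampled returns $r_1,\dots,r_n$, is:

```latex
\mathrm{GD}(R) \;=\; \tfrac{1}{2}\,\mathbb{E}\bigl[\,|R_1 - R_2|\,\bigr],
\qquad R_1, R_2 \overset{\text{i.i.d.}}{\sim} R,
\qquad
\widehat{\mathrm{GD}} \;=\; \frac{1}{2\,n(n-1)} \sum_{i \ne j} \bigl| r_i - r_j \bigr| .
```

Whether the paper includes the factor of one half is an assumption; the measure is scale-equivariant either way, which is what sidesteps the numerical-scale sensitivity of variance.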

Curriculum Learning for Graph Neural Networks: A Multiview Competence-based Approach

  • paper_url: http://arxiv.org/abs/2307.08859
  • repo_url: https://github.com/CLU-UML/MCCL
  • paper_authors: Nidhi Vakil, Hadi Amiri
  • for: 这篇论文旨在提出一种以图复杂度形式化作为难度标准的课程学习方法,以提高图神经网络在语言应用中的训练效果。
  • methods: 本篇论文使用了一种同时考虑图难度和模型能力的课程学习方法,包括一个在训练中兼顾不同难度视角与模型能力的调度方案。
  • results: 实验结果显示,所提出的方法可以在真实世界的链接预测和节点分类任务上提高图神经网络的训练效果。
    Abstract A curriculum is a planned sequence of learning materials and an effective one can make learning efficient and effective for both humans and machines. Recent studies developed effective data-driven curriculum learning approaches for training graph neural networks in language applications. However, existing curriculum learning approaches often employ a single criterion of difficulty in their training paradigms. In this paper, we propose a new perspective on curriculum learning by introducing a novel approach that builds on graph complexity formalisms (as difficulty criteria) and model competence during training. The model consists of a scheduling scheme which derives effective curricula by accounting for different views of sample difficulty and model competence during training. The proposed solution advances existing research in curriculum learning for graph neural networks with the ability to incorporate a fine-grained spectrum of graph difficulty criteria in their training paradigms. Experimental results on real-world link prediction and node classification tasks illustrate the effectiveness of the proposed approach.
    摘要 课程是一种有计划的学习材料序列,一个有效的课程能够让人类和机器的学习都变得高效。近期研究开发了有效的数据驱动课程学习方法,用于训练语言应用中的图神经网络。然而,现有的课程学习方法在训练范式中通常只采用单一的难度标准。在这篇论文中,我们提出了课程学习的新视角,引入了一种建立在图复杂度形式化(作为难度标准)和训练过程中模型能力之上的新方法。该模型包含一个调度方案,通过在训练中兼顾样本难度的不同视角与模型能力来推导有效的课程。所提出的方案推进了图神经网络课程学习的现有研究,使其训练范式能够纳入细粒度的图难度标准谱系。在真实世界的链接预测和节点分类任务上的实验结果表明了该方法的有效性。
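
As a rough illustration of competence-based scheduling, the sketch below admits training samples whose difficulty percentile lies below the model's current competence; the square-root competence schedule and the scalar difficulty score are common conventions assumed here, not the paper's multiview criteria:

```python
import numpy as np

def competence(t, total_steps, c0=0.1):
    """Model competence c(t): fraction of the difficulty range admitted at
    training step t. The square-root schedule and c0 are illustrative."""
    return min(1.0, np.sqrt(c0 ** 2 + (1 - c0 ** 2) * t / total_steps))

def sample_batch(difficulty, t, total_steps, batch_size, rng):
    """Sample uniformly among samples whose difficulty lies below the
    difficulty quantile given by the current competence."""
    cutoff = np.quantile(difficulty, competence(t, total_steps))
    eligible = np.where(difficulty <= cutoff)[0]
    return rng.choice(eligible, size=batch_size, replace=True)

rng = np.random.default_rng(0)
difficulty = rng.random(1000)   # e.g., a graph-complexity score per sample
batch = sample_batch(difficulty, t=100, total_steps=10_000, batch_size=32, rng=rng)
```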

Autoregressive Diffusion Model for Graph Generation

  • paper_url: http://arxiv.org/abs/2307.08849
  • repo_url: None
  • paper_authors: Lingkai Kong, Jiaming Cui, Haotian Sun, Yuchen Zhuang, B. Aditya Prakash, Chao Zhang
  • for: 本研究旨在提出一种基于扩散的图生成模型,以提高图生成的效果和速度。
  • methods: 我们提出了一种名为"自回归扩散"的模型,它直接在离散图空间中进行扩散过程,而不是在去量化的邻接矩阵空间中进行。我们还设计了一个"扩散排序网络"和一个"去噪网络",以便在前向扩散和反向生成过程中高效地生成图。
  • results: 我们在六个不同的通用图数据集和两个分子数据集上进行了实验,证明了我们的模型可以在图生成中取得更好或相当的效果,同时具有更快的生成速度。
    Abstract Diffusion-based graph generative models have recently obtained promising results for graph generation. However, existing diffusion-based graph generative models are mostly one-shot generative models that apply Gaussian diffusion in the dequantized adjacency matrix space. Such a strategy can suffer from difficulty in model training, slow sampling speed, and incapability of incorporating constraints. We propose an \emph{autoregressive diffusion} model for graph generation. Unlike existing methods, we define a node-absorbing diffusion process that operates directly in the discrete graph space. For forward diffusion, we design a \emph{diffusion ordering network}, which learns a data-dependent node absorbing ordering from graph topology. For reverse generation, we design a \emph{denoising network} that uses the reverse node ordering to efficiently reconstruct the graph by predicting the node type of the new node and its edges with previously denoised nodes at a time. Based on the permutation invariance of graph, we show that the two networks can be jointly trained by optimizing a simple lower bound of data likelihood. Our experiments on six diverse generic graph datasets and two molecule datasets show that our model achieves better or comparable generation performance with previous state-of-the-art, and meanwhile enjoys fast generation speed.
    摘要 基于扩散的图生成模型最近在图生成任务上取得了可观的成果。然而,现有的基于扩散的图生成模型大多是一次性(one-shot)生成模型,在去量化的邻接矩阵空间中应用高斯扩散。这种策略可能面临模型训练困难、采样速度慢以及无法纳入约束等问题。我们提出了一种用于图生成的"自回归扩散"模型。与现有方法不同,我们定义了一个直接在离散图空间中运作的节点吸收扩散过程。对于前向扩散,我们设计了一个"扩散排序网络",从图拓扑中学习依赖于数据的节点吸收顺序。对于反向生成,我们设计了一个"去噪网络",利用反向节点顺序,每次预测新节点的节点类型及其与已去噪节点之间的边,从而高效地重建图。基于图的置换不变性,我们证明这两个网络可以通过优化数据似然的一个简单下界进行联合训练。我们在六个不同的通用图数据集和两个分子数据集上的实验表明,我们的模型取得了优于或相当于此前最先进方法的生成性能,同时具有快速的生成速度。
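
A schematic of the reverse (generation) pass described above, with the trained denoising network's node-type and edge heads replaced by random placeholders; only the autoregressive loop structure reflects the abstract:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_node_type(generated_types, n_types=3):
    """Placeholder for the trained denoising network's node-type head;
    here it simply returns a uniform draw."""
    return int(rng.integers(n_types))

def predict_edges(new_idx, generated_types):
    """Placeholder edge head: Bernoulli(0.3) link to each earlier node."""
    return (rng.random(new_idx) < 0.3).astype(int)

def reverse_generate(n_nodes):
    """Autoregressive reverse process: de-absorb one node at a time,
    predicting its type and its edges to already-generated nodes."""
    types = []
    adj = np.zeros((n_nodes, n_nodes), dtype=int)
    for i in range(n_nodes):
        types.append(predict_node_type(types))
        e = predict_edges(i, types)
        adj[i, :i] = e
        adj[:i, i] = e          # undirected graph
    return types, adj

types, adj = reverse_generate(6)
```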

Towards Accelerating Benders Decomposition via Reinforcement Learning Surrogate Models

  • paper_url: http://arxiv.org/abs/2307.08816
  • repo_url: None
  • paper_authors: Stephen Mak, Kyle Mana, Parisa Zehtabi, Michael Cashmore, Daniele Magazzeni, Manuela Veloso
  • for: 这篇论文是为了解决在不确定性下做出最佳决策的问题。
  • methods: 这篇论文使用的方法是Benders分解(Benders decomposition, BD),它将随机优化问题分解为多个更小的子问题。
  • results: 这篇论文提出了一种使用代理模型来加速BD的方法,并证明其平均收敛速度比其他加速BD实现快30%。
    Abstract Stochastic optimization (SO) attempts to offer optimal decisions in the presence of uncertainty. Often, the classical formulation of these problems becomes intractable due to (a) the number of scenarios required to capture the uncertainty and (b) the discrete nature of real-world planning problems. To overcome these tractability issues, practitioners turn to decomposition methods that divide the problem into smaller, more tractable sub-problems. The focal decomposition method of this paper is Benders decomposition (BD), which decomposes stochastic optimization problems on the basis of scenario independence. In this paper we propose a method of accelerating BD with the aid of a surrogate model in place of an NP-hard integer master problem. Through the acceleration method we observe 30% faster average convergence when compared to other accelerated BD implementations. We introduce a reinforcement learning agent as a surrogate and demonstrate how it can be used to solve a stochastic inventory management problem.
    摘要 随机优化(SO)试图在不确定性存在的情况下给出最优决策。这些问题的经典形式往往因以下原因而变得难以求解:(a) 刻画不确定性所需的情景数量;(b) 现实规划问题的离散性质。为克服这些可解性问题,实践者通常采用分解方法,将问题划分为更小、更易处理的子问题。本文关注的分解方法是Benders分解(BD),它基于情景独立性来分解随机优化问题。本文提出了一种借助代理模型替代NP难的整数主问题来加速BD的方法。通过这种加速方法,我们观察到与其他加速BD实现相比,平均收敛速度快30%。我们引入一个强化学习智能体作为代理,并展示了如何利用它求解一个随机库存管理问题。

Operator Guidance Informed by AI-Augmented Simulations

  • paper_url: http://arxiv.org/abs/2307.08810
  • repo_url: None
  • paper_authors: Samuel J. Edwards, Michael Levine
  • for: 这篇论文提出一种多保真度、数据自适应的方法,用于估算双峰、双向海况下的船舶响应统计数据。
  • methods: 该方法使用长短期记忆(LSTM)神经网络,结合一个快速的低保真度、基于体积的工具SimpleCode和一个更高保真度的工具LAMP进行估算。
  • results: 经过训练LSTM神经网络以及对SimpleCode和LAMP数据进行比较,研究发现该方法可以准确地估算船舶响应统计数据。
    Abstract This paper will present a multi-fidelity, data-adaptive approach with a Long Short-Term Memory (LSTM) neural network to estimate ship response statistics in bimodal, bidirectional seas. The study will employ a fast low-fidelity, volume-based tool SimpleCode and a higher-fidelity tool known as the Large Amplitude Motion Program (LAMP). SimpleCode and LAMP data were generated by common bi-modal, bi-directional sea conditions in the North Atlantic as training data. After training an LSTM network with LAMP ship motion response data, a sample route was traversed and randomly sampled historical weather was input into SimpleCode and the LSTM network, and compared against the higher fidelity results.
    摘要 本文将提出一种多保真度、数据自适应的方法,使用长短期记忆(LSTM)神经网络来估算双峰、双向海况下的船舶响应统计数据。这项研究将使用一个快速的低保真度、基于体积的工具SimpleCode,以及一个更高保真度的工具——Large Amplitude Motion Program(LAMP)。SimpleCode和LAMP的数据均由北大西洋常见的双峰、双向海况生成,用作训练数据。在用LAMP船舶运动响应数据训练LSTM网络后,我们沿一条样例航线行驶,将随机采样的历史天气数据输入SimpleCode和LSTM网络,并与更高保真度的结果进行比较。

Local or Global: Selective Knowledge Assimilation for Federated Learning with Limited Labels

  • paper_url: http://arxiv.org/abs/2307.08809
  • repo_url: None
  • paper_authors: Yae Jee Cho, Gauri Joshi, Dimitrios Dimitriadis
  • for: 许多现有的联邦学习(FL)方法假设每个客户端都拥有完整标注的数据,但在现实中,由于标注过程耗时且费力,客户端的标注数据往往十分有限。
  • methods: 我们提出了FedLabel方法:客户端根据哪个模型更擅长其数据,选择使用本地模型或全局模型为其未标注数据生成伪标签。我们还通过全局-本地一致性正则化同时利用两个模型的知识,当两个模型对未标注数据给出相同伪标签时,最小化它们输出之间的差异。
  • results: 在跨设备和跨孤岛两种设定下,我们的方法比其他半监督FL基线高出8%-24%;甚至在仅有5%-20%标注数据的情况下,超越了使用100%标注数据的标准全监督FL基线。
    Abstract Many existing FL methods assume clients with fully-labeled data, while in realistic settings, clients have limited labels due to the expensive and laborious process of labeling. Limited labeled local data of the clients often leads to their local model having poor generalization abilities to their larger unlabeled local data, such as having class-distribution mismatch with the unlabeled data. As a result, clients may instead look to benefit from the global model trained across clients to leverage their unlabeled data, but this also becomes difficult due to data heterogeneity across clients. In our work, we propose FedLabel where clients selectively choose the local or global model to pseudo-label their unlabeled data depending on which is more of an expert of the data. We further utilize both the local and global models' knowledge via global-local consistency regularization which minimizes the divergence between the two models' outputs when they have identical pseudo-labels for the unlabeled data. Unlike other semi-supervised FL baselines, our method does not require additional experts other than the local or global model, nor require additional parameters to be communicated. We also do not assume any server-labeled data or fully labeled clients. For both cross-device and cross-silo settings, we show that FedLabel outperforms other semi-supervised FL baselines by $8$-$24\%$, and even outperforms standard fully supervised FL baselines ($100\%$ labeled data) with only $5$-$20\%$ of labeled data.
    摘要 许多现有的FL方法假设客户端拥有完整标注的数据,然而在现实中,由于标注过程成本高昂且费力,客户端的标注数据十分有限。客户端有限的本地标注数据往往导致其本地模型对规模更大的本地未标注数据泛化能力差,例如与未标注数据存在类别分布不匹配。因此,客户端可能转而希望借助跨客户端训练的全局模型来利用其未标注数据,但由于客户端之间的数据异质性,这同样变得困难。在我们的工作中,我们提出了FedLabel:客户端根据本地模型和全局模型哪个更擅长其数据,选择性地使用其中之一为未标注数据生成伪标签。我们还通过全局-本地一致性正则化同时利用两个模型的知识,当两个模型对未标注数据给出相同伪标签时,最小化其输出之间的差异。与其他半监督FL基线不同,我们的方法不需要本地或全局模型之外的额外专家模型,也不需要传输额外参数;我们同样不假设服务器持有标注数据或存在完全标注的客户端。在跨设备和跨孤岛两种设定下,FedLabel比其他半监督FL基线高出8%-24%,甚至在仅有5%-20%标注数据的情况下超越了标准全监督FL基线(100%标注数据)。
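
A minimal sketch of the selective pseudo-labeling plus global-local consistency idea; the confidence-based selection rule, threshold `tau`, and weight `lam` are illustrative assumptions rather than the paper's exact criteria:

```python
import torch
import torch.nn.functional as F

def fedlabel_loss(local_logits, global_logits, tau=0.8, lam=1.0):
    """Sketch of selective pseudo-labeling with global-local consistency.

    For each unlabeled sample, the more confident of the local/global models
    provides the pseudo-label (kept only above confidence tau); when the two
    models agree on the pseudo-label, their output divergence is additionally
    penalized. tau and lam are illustrative choices.
    """
    p_l, p_g = F.softmax(local_logits, -1), F.softmax(global_logits, -1)
    conf_l, y_l = p_l.max(-1)
    conf_g, y_g = p_g.max(-1)
    use_local = conf_l >= conf_g
    pseudo = torch.where(use_local, y_l, y_g)          # per-sample labeler choice
    conf = torch.where(use_local, conf_l, conf_g)
    mask = (conf >= tau).float()                       # keep confident pseudo-labels
    ce = F.cross_entropy(local_logits, pseudo, reduction="none")
    agree = (y_l == y_g).float()                       # consistency only on agreement
    consistency = F.mse_loss(p_l, p_g, reduction="none").sum(-1)
    return (mask * ce + lam * agree * consistency).mean()

loss = fedlabel_loss(torch.randn(16, 10), torch.randn(16, 10))
```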

Non-Stationary Policy Learning for Multi-Timescale Multi-Agent Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.08794
  • repo_url: None
  • paper_authors: Patrick Emami, Xiangyu Zhang, David Biagioni, Ahmed S. Zamzam
  • for: 这个论文旨在学习多时间尺度多智能体强化学习(multi-timescale MARL)中的非平稳策略。
  • methods: 这个论文利用可用的智能体时间尺度信息来定义周期性时间编码,并通过周期性多智能体策略来学习非平稳策略。
  • results: 论文从理论上和实验上验证了周期性多智能体策略的学习效果,并在网格世界和建筑能源管理环境中证明了该方法的有效性。
    Abstract In multi-timescale multi-agent reinforcement learning (MARL), agents interact across different timescales. In general, policies for time-dependent behaviors, such as those induced by multiple timescales, are non-stationary. Learning non-stationary policies is challenging and typically requires sophisticated or inefficient algorithms. Motivated by the prevalence of this control problem in real-world complex systems, we introduce a simple framework for learning non-stationary policies for multi-timescale MARL. Our approach uses available information about agent timescales to define a periodic time encoding. In detail, we theoretically demonstrate that the effects of non-stationarity introduced by multiple timescales can be learned by a periodic multi-agent policy. To learn such policies, we propose a policy gradient algorithm that parameterizes the actor and critic with phase-functioned neural networks, which provide an inductive bias for periodicity. The framework's ability to effectively learn multi-timescale policies is validated on a gridworld and building energy management environment.
    摘要 在多时间尺度多智能体强化学习(MARL)中,智能体以不同的时间尺度进行交互。一般而言,由多时间尺度等时间依赖行为所诱导的策略是非平稳的。学习非平稳策略十分困难,通常需要复杂或低效的算法。鉴于这一控制问题在现实复杂系统中的普遍性,我们提出了一个用于多时间尺度MARL非平稳策略学习的简单框架。我们的方法利用智能体时间尺度的可用信息来定义周期性时间编码。具体而言,我们从理论上证明,多时间尺度引入的非平稳效应可以由周期性多智能体策略学习。为了学习此类策略,我们提出了一种策略梯度算法,用相位函数神经网络(phase-functioned neural networks)参数化actor和critic,从而提供对周期性的归纳偏置。该框架有效学习多时间尺度策略的能力在网格世界和建筑能源管理环境中得到了验证。
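
A small sketch of one plausible periodic time encoding: the phase of the step counter within each agent's timescale, mapped to sine/cosine features. The exact featurization used by the paper may differ; this follows the common sinusoidal convention:

```python
import numpy as np

def periodic_time_encoding(t, timescales):
    """Encode step t as the phase within each agent's timescale.

    For an agent acting every k steps, the phase (t mod k) / k is mapped to
    (sin, cos) features, so the encoding is periodic with period k.
    """
    feats = []
    for k in timescales:
        phase = 2 * np.pi * (t % k) / k
        feats.extend([np.sin(phase), np.cos(phase)])
    return np.asarray(feats)

# Two agents acting every 4 and every 24 steps:
enc = periodic_time_encoding(t=10, timescales=[4, 24])   # shape (4,)
```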

On the Real-Time Semantic Segmentation of Aphid Clusters in the Wild

  • paper_url: http://arxiv.org/abs/2307.10267
  • repo_url: None
  • paper_authors: Raiyan Rahman, Christopher Indris, Tianxiao Zhang, Kaidong Li, Brian McCornack, Daniel Flippo, Ajay Sharda, Guanghui Wang
  • for: addresses the urgent need for an intelligent autonomous system to locate and spray aphid infestations in wheat and sorghum fields, reducing pesticide use and environmental impact.
  • methods: uses real-time semantic segmentation models to segment clusters of aphids in complex crop canopies, with a multiscale dataset to allow for learning at different scales.
  • results: compares the segmentation speeds and accuracy of four state-of-the-art real-time semantic segmentation models on an aphid cluster dataset, demonstrating the effectiveness of a real-time solution for pest detection and reducing inefficient pesticide use.
    Abstract Aphid infestations can cause extensive damage to wheat and sorghum fields and spread plant viruses, resulting in significant yield losses in agriculture. To address this issue, farmers often rely on chemical pesticides, which are inefficiently applied over large areas of fields. As a result, a considerable amount of pesticide is wasted on areas without pests, while inadequate amounts are applied to areas with severe infestations. The paper focuses on the urgent need for an intelligent autonomous system that can locate and spray infestations within complex crop canopies, reducing pesticide use and environmental impact. We have collected and labeled a large aphid image dataset in the field, and propose the use of real-time semantic segmentation models to segment clusters of aphids. A multiscale dataset is generated to allow for learning the clusters at different scales. We compare the segmentation speeds and accuracy of four state-of-the-art real-time semantic segmentation models on the aphid cluster dataset, benchmarking them against nonreal-time models. The study results show the effectiveness of a real-time solution, which can reduce inefficient pesticide use and increase crop yields, paving the way towards an autonomous pest detection system.
    摘要 蚜虫侵害会对小麦和高粱田造成广泛损害并传播植物病毒,导致农业产量的重大损失。为解决这一问题,农民通常依赖化学杀虫剂,但这些杀虫剂往往被低效地喷洒在大面积田地上:大量杀虫剂浪费在没有害虫的区域,而侵害严重的区域却喷洒不足。本文强调迫切需要一种智能自主系统,能够在复杂的作物冠层中定位并喷洒侵害区域,从而减少杀虫剂使用及其环境影响。我们在田间收集并标注了一个大规模蚜虫图像数据集,并提出使用实时语义分割模型来分割蚜虫群。我们生成了一个多尺度数据集,以便在不同尺度上学习蚜虫群。我们在蚜虫群数据集上比较了四种最先进的实时语义分割模型的分割速度和精度,并以非实时模型作为基准。研究结果表明了实时解决方案的有效性:它可以减少低效的杀虫剂使用并提高作物产量,为自主害虫检测系统铺平道路。

GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution

  • paper_url: http://arxiv.org/abs/2307.08775
  • repo_url: https://github.com/yining610/gear
  • paper_authors: Yining Lu, Haoping Yu, Daniel Khashabi
  • for: 提高大语言模型(LLM)的性能 across 多种任务。
  • methods: 使用外部工具。此前的工作过于依赖任务特定的工具使用示范,这限制了其泛化性并增加了计算成本。我们介绍了一种名为GEAR的计算高效的查询-工具定位(grounding)算法,可以在多种任务中泛化使用不同的工具。
  • results: 在14个数据集和6个下游任务上进行了评估,展示了强大的泛化能力,能够在新任务、新工具和不同的SLM上实现更好的工具定位。尽管计算效率更高,GEAR仍能达到更高的工具定位精度,从而提高下游精度;例如,经GEAR增强的GPT-J和GPT-3超越了相应的工具增强基线。
    Abstract Augmenting large language models (LLM) to use external tools enhances their performance across a variety of tasks. However, prior works over-rely on task-specific demonstration of tool use that limits their generalizability and computational cost due to making many calls to large-scale LLMs. We introduce GEAR, a computationally efficient query-tool grounding algorithm that is generalizable to various tasks that require tool use while not relying on task-specific demonstrations. GEAR achieves better efficiency by delegating tool grounding and execution to small language models (SLM) and LLM, respectively; while leveraging semantic and pattern-based evaluation at both question and answer levels for generalizable tool grounding. We evaluate GEAR on 14 datasets across 6 downstream tasks, demonstrating its strong generalizability to novel tasks, tools and different SLMs. Despite offering more efficiency, GEAR achieves higher precision in tool grounding compared to prior strategies using LLM prompting, thus improving downstream accuracy at a reduced computational cost. For example, we demonstrate that GEAR-augmented GPT-J and GPT-3 outperform counterpart tool-augmented baselines because of better tool use.
    摘要 通过增强大型语言模型(LLM)使用外部工具,可以提高其在多种任务上的表现。然而,先前的工作过于依赖任务特定的工具使用示范,这限制了其泛化性,并且由于需要大量调用大规模LLM而带来高计算成本。我们介绍了GEAR,一种计算高效的查询-工具定位算法,可在多种需要使用工具的任务上泛化,而无需任务特定的示范。GEAR将工具定位和执行分别委托给小型语言模型(SLM)和LLM,同时在问题和答案两个层面利用基于语义和模式的评估来实现可泛化的工具定位,从而获得更高的效率。我们在6个下游任务的14个数据集上评估了GEAR,证明了它对新任务、新工具和不同SLM的强大泛化能力。尽管效率更高,GEAR的工具定位精度仍高于先前使用LLM提示的策略,因而在降低计算成本的同时提高了下游精度。例如,我们表明经GEAR增强的GPT-J和GPT-3由于更好的工具使用而超越了相应的工具增强基线。

AI empowering research: 10 ways how science can benefit from AI

  • paper_url: http://arxiv.org/abs/2307.10265
  • repo_url: None
  • paper_authors: César França
  • for: 这篇论文探讨人工智能(AI)在科学研究中的转变性影响。
  • methods: 论文介绍了AI在科学研究中的10种应用,包括强大的引用工具、更好地理解研究问题、增强研究问题生成、优化研究设计、数据生成、数据转换、高级数据分析和AI协助报告。
  • results: 论文指出,AI可以帮助科学家增强创造力,但需要考虑偏见、隐私问题和人AI合作。
    Abstract This article explores the transformative impact of artificial intelligence (AI) on scientific research. It highlights ten ways in which AI is revolutionizing the work of scientists, including powerful referencing tools, improved understanding of research problems, enhanced research question generation, optimized research design, stub data generation, data transformation, advanced data analysis, and AI-assisted reporting. While AI offers numerous benefits, challenges such as bias, privacy concerns, and the need for human-AI collaboration must be considered. The article emphasizes that AI can augment human creativity in science but not replace it.
    摘要 这篇文章探讨人工智能(AI)在科学研究中的变革性影响。文章提出了AI革新科学家工作的10种方式,包括强大的引用工具、更好地理解研究问题、增强研究问题生成、优化研究设计、存根数据生成、数据转换、高级数据分析和AI辅助报告。虽然AI具有许多优势,但仍需考虑偏见、隐私问题以及人机协作的必要性。文章强调,AI可以增强人类在科学中的创造力,但无法取代它。

Reflections from the Workshop on AI-Assisted Decision Making for Conservation

  • paper_url: http://arxiv.org/abs/2307.08774
  • repo_url: None
  • paper_authors: Lily Xu, Esther Rolf, Sara Beery, Joseph R. Bennett, Tanya Berger-Wolf, Tanya Birch, Elizabeth Bondi-Kelly, Justin Brashares, Melissa Chapman, Anthony Corso, Andrew Davies, Nikhil Garg, Angela Gaylard, Robert Heilmayr, Hannah Kerner, Konstantin Klemmer, Vipin Kumar, Lester Mackey, Claire Monteleoni, Paul Moorcroft, Jonathan Palmer, Andrew Perrault, David Thau, Milind Tambe
  • for: 这份白皮书总结了在哈佛大学计算与社会研究中心主办的"AI辅助决策促进保护"研讨会上的演讲和讨论,旨在指出生物多样性保护中资源分配、规划与干预方面的开放研究问题,以及需要AI解决的保护挑战。
  • methods: 这份白皮书总结了工作坊上的讲座和讨论,并提出了一些开放研究问题,例如资源分配、规划和干预的算法化决策方法,以及如何应用这些方法解决生态系统的保护挑战。
  • results: 白皮书认为,AI助成决策方法可以帮助解决生态系统的保护挑战,但是还需要进一步的研究和开发,以确保这些方法能够应用到实际的保护场景中。
    Abstract In this white paper, we synthesize key points made during presentations and discussions from the AI-Assisted Decision Making for Conservation workshop, hosted by the Center for Research on Computation and Society at Harvard University on October 20-21, 2022. We identify key open research questions in resource allocation, planning, and interventions for biodiversity conservation, highlighting conservation challenges that not only require AI solutions, but also require novel methodological advances. In addition to providing a summary of the workshop talks and discussions, we hope this document serves as a call-to-action to orient the expansion of algorithmic decision-making approaches to prioritize real-world conservation challenges, through collaborative efforts of ecologists, conservation decision-makers, and AI researchers.
    摘要 在这份白皮书中,我们对哈佛大学计算与社会研究中心于2022年10月20-21日举办的"AI辅助决策促进保护"研讨会的演讲和讨论进行了总结。我们确定了生物多样性保护中资源分配、规划和干预方面的关键开放研究问题,并指出了那些不仅需要AI解决方案、还需要新的方法论进展的保护挑战。除总结研讨会内容之外,我们希望这份文件能够成为行动号召:通过生态学家、保护决策者和AI研究人员的协作,引导算法决策方法的扩展,以优先应对现实世界的保护挑战。

A mixed policy to improve performance of language models on math problems

  • paper_url: http://arxiv.org/abs/2307.08767
  • repo_url: https://github.com/vividitytech/math_lm_rl
  • paper_authors: Gang Chen
  • for: 求解数学问题时,语言模型通常根据条件概率采样预测下一个词,这在数学推理步骤中可能导致错误答案。由于数学问题是确定性的,我们提出了一种混合策略探索方法,使用强化学习求解数学问题。
  • methods: 我们提出了一种两级token探索策略:抽象级以概率采样方式决定下一个token是运算符还是操作数,而第二级则以贪心方式确定性地选择得分最高的下一个token。
  • results: 我们在GSM8K数据集上使用GPT-2模型进行测试,获得了超过2%的性能提升。我们的实现可以在GitHub上找到:https://github.com/vividitytech/math_lm_rl。
    Abstract When to solve math problems, most language models take a sampling strategy to predict next word according conditional probabilities. In the math reasoning step, it may generate wrong answer. Considering math problems are deterministic, we propose a mixed policy exploration approach to solve math problems with reinforcement learning. In peculiar, we propose a two level token exploration policy: the abstract level explores next token with probability and the second level is deterministic. Specifically, the abstract level policy will decide whether the token is operator or operand with probability sampling, while the second level is deterministic to select next token with the highest score in a greedy way. We test our method on GSM8K dataset with GPT-2 model, and demonstrate more than $2\%$ performance gain. Our implementation is available at https://github.com/vividitytech/math_lm_rl.
    摘要 在求解数学问题时,大多数语言模型采用采样策略,根据条件概率预测下一个词,而这在数学推理步骤中可能生成错误答案。考虑到数学问题是确定性的,我们提出了一种混合策略探索方法,使用强化学习求解数学问题。具体而言,我们提出了一种两级token探索策略:第一级(抽象级)以概率方式探索下一个token,第二级则是确定性的。即抽象级策略通过概率采样决定下一个token是运算符还是操作数,而第二级以贪心方式确定性地选择得分最高的下一个token。我们在GSM8K数据集上用GPT-2模型测试了该方法,证明了超过2%的性能提升。我们的实现见 https://github.com/vividitytech/math_lm_rl。
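
A toy rendering of the two-level policy: the abstract level samples the operator-vs-operand decision, the second level picks the concrete token greedily. The token vocabulary and scores are stand-ins for model logits, not the paper's setup:

```python
import numpy as np

OPERATORS = list("+-*/=")               # illustrative token split
OPERANDS = [str(d) for d in range(10)]

def two_level_decode(op_prob, operator_scores, operand_scores, rng):
    """Two-level exploration: sample the abstract decision
    (operator vs. operand) from its probability, then pick the
    concrete token greedily within the chosen class."""
    if rng.random() < op_prob:                             # level 1: stochastic
        return OPERATORS[int(np.argmax(operator_scores))]  # level 2: greedy
    return OPERANDS[int(np.argmax(operand_scores))]

rng = np.random.default_rng(0)
token = two_level_decode(0.4, rng.random(len(OPERATORS)),
                         rng.random(len(OPERANDS)), rng)
```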

Quality Assessment of Photoplethysmography Signals For Cardiovascular Biomarkers Monitoring Using Wearable Devices

  • paper_url: http://arxiv.org/abs/2307.08766
  • repo_url: None
  • paper_authors: Felipe M. Dias, Marcelo A. F. Toledo, Diego A. C. Cardenas, Douglas A. Almeida, Filipe A. C. Oliveira, Estela Ribeiro, Jose E. Krieger, Marco A. Gutierrez
  • for: 这项研究旨在评估非侵入式光电容积描记(PPG)信号的质量,以支持可靠、准确的心血管生物标志物监测,并服务于远程、非侵入和连续测量设备的开发。
  • methods: 该研究使用从PPG信号中提取的27个统计特征训练机器学习模型,包括梯度提升(XGBoost和CatBoost)和随机森林(RF)算法。
  • results: 研究发现,XGBoost、CatBoost和RF模型的敏感度(Se)、阳性预测值(PPV)和F1分数(F1)分别达到94.4/95.6/95.0、94.7/95.9/95.3和93.7/91.3/92.5,与现有文献报告的水平相当。
    Abstract Photoplethysmography (PPG) is a non-invasive technology that measures changes in blood volume in the microvascular bed of tissue. It is commonly used in medical devices such as pulse oximeters and wrist worn heart rate monitors to monitor cardiovascular hemodynamics. PPG allows for the assessment of parameters (e.g., heart rate, pulse waveform, and peripheral perfusion) that can indicate conditions such as vasoconstriction or vasodilation, and provides information about microvascular blood flow, making it a valuable tool for monitoring cardiovascular health. However, PPG is subject to a number of sources of variations that can impact its accuracy and reliability, especially when using a wearable device for continuous monitoring, such as motion artifacts, skin pigmentation, and vasomotion. In this study, we extracted 27 statistical features from the PPG signal for training machine-learning models based on gradient boosting (XGBoost and CatBoost) and Random Forest (RF) algorithms to assess quality of PPG signals that were labeled as good or poor quality. We used the PPG time series from a publicly available dataset and evaluated the algorithm s performance using Sensitivity (Se), Positive Predicted Value (PPV), and F1-score (F1) metrics. Our model achieved Se, PPV, and F1-score of 94.4, 95.6, and 95.0 for XGBoost, 94.7, 95.9, and 95.3 for CatBoost, and 93.7, 91.3 and 92.5 for RF, respectively. Our findings are comparable to state-of-the-art reported in the literature but using a much simpler model, indicating that ML models are promising for developing remote, non-invasive, and continuous measurement devices.
    摘要 光电容积描记(PPG)是一种非侵入式技术,用于测量组织微血管床中血容量的变化。它常用于脉搏血氧仪和腕戴式心率监测仪等医疗设备中,以监测心血管血流动力学。PPG可用于评估心率、脉搏波形和外周灌注等参数,这些参数可以提示血管收缩或血管扩张等状况,并提供有关微血管血流的信息,使其成为监测心血管健康的有价值工具。然而,PPG受到多种变异来源的影响,例如运动伪影、皮肤色素和血管舒缩,这些因素会影响其准确性和可靠性,在使用可穿戴设备进行连续监测时尤为明显。在这项研究中,我们从PPG信号中提取了27个统计特征,训练了基于梯度提升(XGBoost和CatBoost)和随机森林(RF)算法的机器学习模型,以评估被标注为质量好或差的PPG信号。我们使用公开数据集中的PPG时间序列,并用敏感度(Se)、阳性预测值(PPV)和F1分数(F1)评估算法性能。XGBoost、CatBoost和RF模型的Se、PPV和F1分别为94.4、95.6、95.0,94.7、95.9、95.3,以及93.7、91.3、92.5。我们的结果与文献报告的最新水平相当,但使用的模型简单得多,这表明机器学习模型在开发远程、非侵入式和连续测量设备方面很有前景。
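
To make the pipeline concrete, here is a hedged sketch of statistical feature extraction plus a quality classifier on synthetic segments; the six features shown are an illustrative subset of the paper's 27, and scikit-learn's RandomForest stands in for the evaluated ensemble models:

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier

def ppg_features(segment):
    """A handful of the kind of statistical features the paper mentions
    (the full study uses 27; this subset is illustrative)."""
    return np.array([
        segment.mean(), segment.std(), stats.skew(segment),
        stats.kurtosis(segment), np.ptp(segment),
        np.mean(np.abs(np.diff(segment))),   # mean absolute first difference
    ])

# Synthetic stand-in data: 'good' segments are clean pulse-like sinusoids,
# 'poor' segments are noise-dominated.
rng = np.random.default_rng(0)
t = np.linspace(0, 4, 500)
good = [np.sin(2 * np.pi * 1.2 * t) + 0.05 * rng.standard_normal(500) for _ in range(50)]
poor = [rng.standard_normal(500) for _ in range(50)]
X = np.array([ppg_features(s) for s in good + poor])
y = np.array([1] * 50 + [0] * 50)   # 1 = good quality, 0 = poor quality

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```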

Fast model inference and training on-board of Satellites

  • paper_url: http://arxiv.org/abs/2307.08700
  • repo_url: https://github.com/previtus/ravaen-unibap-dorbit
  • paper_authors: Vít Růžička, Gonzalo Mateo-García, Chris Bridges, Chris Brunskill, Cormac Purcell, Nicolas Longépé, Andrew Markham
  • for: 本研究旨在在CubeSat上部署多任务模型,并在卫星上进行机器学习模型的训练。
  • methods: 本研究使用一个名为RaVAEn的轻量级基础模型,它能将小图像块编码为压缩的潜在向量,并支持多个下游任务。
  • results: 本研究在CubeSat上成功部署了RaVAEn模型,对4.8x4.8 km$^2$区域的图像块的编码时间为0.110秒,并在卫星上完成了快速的少样本(few-shot)训练,从而支持星上的快速数据分析与决策。
    Abstract Artificial intelligence onboard satellites has the potential to reduce data transmission requirements, enable real-time decision-making and collaboration within constellations. This study deploys a lightweight foundational model called RaVAEn on D-Orbit's ION SCV004 satellite. RaVAEn is a variational auto-encoder (VAE) that generates compressed latent vectors from small image tiles, enabling several downstream tasks. In this work we demonstrate the reliable use of RaVAEn onboard a satellite, achieving an encoding time of 0.110s for tiles of a 4.8x4.8 km$^2$ area. In addition, we showcase fast few-shot training onboard a satellite using the latent representation of data. We compare the deployment of the model on the on-board CPU and on the available Myriad vision processing unit (VPU) accelerator. To our knowledge, this work shows for the first time the deployment of a multi-task model on-board a CubeSat and the on-board training of a machine learning model.
    摘要 卫星上的人工智能有望减少数据传输需求,并实现星座内的实时决策与协作。本研究在D-Orbit的ION SCV004卫星上部署了一个名为RaVAEn的轻量级基础模型。RaVAEn是一种变分自编码器(VAE),能从小图像块生成压缩的潜在向量,从而支持多个下游任务。在这项工作中,我们展示了RaVAEn在卫星上的可靠运行,对4.8x4.8 km$^2$区域的图像块实现了0.110秒的编码时间。此外,我们还展示了利用数据的潜在表示在卫星上进行快速的少样本训练。我们比较了模型部署在星载CPU与可用的Myriad视觉处理单元(VPU)加速器上的表现。据我们所知,这项工作首次在CubeSat上部署了多任务模型,并首次在卫星上完成了机器学习模型的训练。

Pair then Relation: Pair-Net for Panoptic Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2307.08699
  • repo_url: https://github.com/king159/pair-net
  • paper_authors: Jinghao Wang, Zhengyu Wen, Xiangtai Li, Zujin Guo, Jingkang Yang, Ziwei Liu
  • for: 本研究旨在提出一种新的基线方法来解决Scene Graph Generation(SGG)中的Panoptic Scene Graph(PSG)问题,以创造更加全面的场景图表示。
  • methods: 我们首先进行了深入分析,发现当前PSG方法的最大瓶颈在于对象对之间的成对召回率(pair-wise recall)被忽视。基于这一点,我们提出了一种新的框架:Pair then Relation(Pair-Net),它使用一个Pair Proposal Network(PPN)来学习和筛选主体与客体之间稀疏的成对关系。
  • results: 我们通过广泛的消融实验和分析,证明了我们的方法可以大幅提升现有基线的性能。尤其是,我们的方法在PSG基准上创造了新的最佳成绩,与PSGFormer相比取得超过10%的绝对提升。代码可在 https://github.com/king159/Pair-Net 获取。
    Abstract Panoptic Scene Graph (PSG) is a challenging task in Scene Graph Generation (SGG) that aims to create a more comprehensive scene graph representation using panoptic segmentation instead of boxes. Compared to SGG, PSG has several challenging problems: pixel-level segment outputs and full relationship exploration (It also considers thing and stuff relation). Thus, current PSG methods have limited performance, which hinders downstream tasks or applications. The goal of this work aims to design a novel and strong baseline for PSG. To achieve that, we first conduct an in-depth analysis to identify the bottleneck of the current PSG models, finding that inter-object pair-wise recall is a crucial factor that was ignored by previous PSG methods. Based on this and the recent query-based frameworks, we present a novel framework: Pair then Relation (Pair-Net), which uses a Pair Proposal Network (PPN) to learn and filter sparse pair-wise relationships between subjects and objects. Moreover, we also observed the sparse nature of object pairs for both Motivated by this, we design a lightweight Matrix Learner within the PPN, which directly learn pair-wised relationships for pair proposal generation. Through extensive ablation and analysis, our approach significantly improves upon leveraging the segmenter solid baseline. Notably, our method achieves new state-of-the-art results on the PSG benchmark, with over 10\% absolute gains compared to PSGFormer. The code of this paper is publicly available at https://github.com/king159/Pair-Net.
    摘要 全景场景图(PSG)是场景图生成(SGG)中一项具有挑战性的任务,旨在用全景分割代替边界框来构建更全面的场景图表示。与SGG相比,PSG存在若干挑战性问题:像素级的分割输出,以及对关系的全面探索(同时考虑thing与stuff之间的关系)。因此,当前PSG方法的性能有限,这阻碍了下游任务和应用。本文的目标是为PSG设计一个新颖且强大的基线。为此,我们首先进行了深入分析以找出当前PSG模型的瓶颈,发现对象对之间的成对召回率是此前PSG方法所忽视的关键因素。基于这一点以及近期基于查询的框架,我们提出了一个新框架:Pair then Relation(Pair-Net),它使用一个Pair Proposal Network(PPN)来学习和筛选主体与客体之间稀疏的成对关系。此外,我们还观察到对象对的稀疏性,据此在PPN中设计了一个轻量级的矩阵学习器,直接学习成对关系以生成对提议。通过广泛的消融实验和分析,我们的方法在扎实的分割器基线之上取得了显著提升。值得注意的是,我们的方法在PSG基准上创造了新的最佳成绩,与PSGFormer相比取得超过10%的绝对提升。本文代码公开于 https://github.com/king159/Pair-Net。

COLLIE: Systematic Construction of Constrained Text Generation Tasks

  • paper_url: http://arxiv.org/abs/2307.08689
  • repo_url: https://github.com/princeton-nlp/Collie
  • paper_authors: Shunyu Yao, Howard Chen, Austin W. Hanjie, Runzhe Yang, Karthik Narasimhan
  • For: This paper is written for those interested in natural language processing and the development of constrained text generation systems.* Methods: The paper presents a grammar-based framework called COLLIE, which allows for the specification of rich and compositional constraints for diverse generation levels and modeling challenges. The framework includes tools for automatic extraction of task instances given a constraint structure and a raw text corpus.* Results: The paper compiles a dataset called COLLIE-v1, which includes 2080 instances with 13 constraint structures, and performs systematic experiments with five state-of-the-art instruction-tuned language models to analyze their performances and reveal shortcomings.
    Abstract Text generation under constraints have seen increasing interests in natural language processing, especially with the rapidly improving capabilities of large language models. However, existing benchmarks for constrained generation usually focus on fixed constraint types (e.g.,generate a sentence containing certain words) that have proved to be easy for state-of-the-art models like GPT-4. We present COLLIE, a grammar-based framework that allows the specification of rich, compositional constraints with diverse generation levels (word, sentence, paragraph, passage) and modeling challenges (e.g.,language understanding, logical reasoning, counting, semantic planning). We also develop tools for automatic extraction of task instances given a constraint structure and a raw text corpus. Using COLLIE, we compile the COLLIE-v1 dataset with 2080 instances comprising 13 constraint structures. We perform systematic experiments across five state-of-the-art instruction-tuned language models and analyze their performances to reveal shortcomings. COLLIE is designed to be extensible and lightweight, and we hope the community finds it useful to develop more complex constraints and evaluations in the future.
    摘要 受限文本生成在自然语言处理领域受到越来越多的关注,尤其是在大型语言模型能力快速提升的背景下。然而,现有的受限生成基准通常聚焦于固定的约束类型(例如生成包含特定词语的句子),而这些约束已被证明对GPT-4等最先进的模型而言过于简单。我们提出了COLLIE,一个基于文法的框架,允许指定丰富的、可组合的约束,涵盖多样的生成层级(词、句、段落、篇章)和建模挑战(例如语言理解、逻辑推理、计数、语义规划)。我们还开发了相应工具,可在给定约束结构和原始文本语料的情况下自动抽取任务实例。利用COLLIE,我们构建了COLLIE-v1数据集,包含13种约束结构下的2080个实例。我们对五个最先进的指令微调语言模型进行了系统实验,并分析其表现以揭示不足。COLLIE被设计为可扩展且轻量的框架,我们希望社区能够利用它在未来开发更复杂的约束和评估。

Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations

  • paper_url: http://arxiv.org/abs/2307.08678
  • repo_url: None
  • paper_authors: Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, Kathleen McKeown
  • for: 本研究旨在评估大语言模型(LLM)的自我解释能力,以及LLM能否帮助人类建立关于模型如何处理不同输入的心理模型。
  • methods: 我们提出了两种基于反事实可模拟性的评估指标:精确度和通用性。我们使用自动生成的反事实,评估最先进的LLM(如GPT-4)在多跳事实推理和奖励建模任务中的表现。
  • results: 我们发现LLM的解释精确度较低,且精确度与合理性无关。因此,仅通过人类认可进行优化(如RLHF)是不够的。
    Abstract Large language models (LLMs) are trained to imitate humans to explain human decisions. However, do LLMs explain themselves? Can they help humans build mental models of how LLMs process different inputs? To answer these questions, we propose to evaluate $\textbf{counterfactual simulatability}$ of natural language explanations: whether an explanation can enable humans to precisely infer the model's outputs on diverse counterfactuals of the explained input. For example, if a model answers "yes" to the input question "Can eagles fly?" with the explanation "all birds can fly", then humans would infer from the explanation that it would also answer "yes" to the counterfactual input "Can penguins fly?". If the explanation is precise, then the model's answer should match humans' expectations. We implemented two metrics based on counterfactual simulatability: precision and generality. We generated diverse counterfactuals automatically using LLMs. We then used these metrics to evaluate state-of-the-art LLMs (e.g., GPT-4) on two tasks: multi-hop factual reasoning and reward modeling. We found that LLM's explanations have low precision and that precision does not correlate with plausibility. Therefore, naively optimizing human approvals (e.g., RLHF) may not be a sufficient solution.
    摘要 大型语言模型(LLM)被训练去模仿人类来解释人类决策。然而,LLM能解释它们自己吗?它们能否帮助人类建立关于LLM如何处理不同输入的心理模型?为回答这些问题,我们提议评估自然语言解释的反事实可模拟性:即解释能否使人类准确推断模型在被解释输入的各种反事实上的输出。例如,如果模型对输入问题"老鹰会飞吗?"回答"会",并给出解释"所有鸟都会飞",那么人类会根据该解释推断模型对反事实输入"企鹅会飞吗?"同样会回答"会"。如果解释是精确的,模型的答案就应当符合人类的预期。我们基于反事实可模拟性实现了两个指标:精确度和通用性。我们使用LLM自动生成多样的反事实,并用这些指标在多跳事实推理和奖励建模两个任务上评估了最先进的LLM(如GPT-4)。我们发现LLM解释的精确度较低,且精确度与合理性并不相关。因此,单纯优化人类认可(如RLHF)可能并非充分的解决方案。
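
The precision metric can be made concrete in a few lines; the aggregation below (per-explanation match rate against human inferences) is our reading of the abstract, not the paper's exact formula:

```python
def simulatability_precision(pairs):
    """Precision of an explanation over its counterfactuals.

    `pairs` is a list of (human_inferred_output, model_output) for
    counterfactual inputs of one explained example; precision is the
    fraction where the model matches what a human would infer from the
    explanation. How scores aggregate across examples is an assumption.
    """
    if not pairs:
        return 0.0
    return sum(h == m for h, m in pairs) / len(pairs)

# Example: the explanation "all birds can fly" leads humans to infer "yes"
# for "Can penguins fly?", but suppose the model actually answers "no".
print(simulatability_precision([("yes", "yes"), ("yes", "no")]))  # 0.5
```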

TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT

  • paper_url: http://arxiv.org/abs/2307.08674
  • repo_url: None
  • paper_authors: Liangyu Zha, Junlin Zhou, Liyao Li, Rui Wang, Qingyi Huang, Saisai Yang, Jing Yuan, Changbao Su, Xiang Li, Aofeng Su, Tao Zhang, Chen Zhou, Kaizhe Shou, Miao Wang, Wufang Zhu, Guoshan Lu, Chao Ye, Yali Ye, Wentao Ye, Yiming Zhang, Xinglong Deng, Jie Xu, Haobo Wang, Gang Chen, Junbo Zhao
  • For: TableGPT is a unified fine-tuned framework that enables large language models (LLMs) to understand and operate on tables using external functional commands, allowing for seamless interaction with tabular data and enabling a wide range of functionalities such as question answering, data manipulation, data visualization, analysis report generation, and automated prediction.* Methods: TableGPT uses a novel concept of global tabular representations, which empowers LLMs to gain a comprehensive understanding of the entire table beyond meta-information. It jointly trains LLMs on both table and text modalities, achieving a deep understanding of tabular data and the ability to perform complex operations on tables through chain-of-command instructions.* Results: TableGPT offers several advantages, including being a self-contained system rather than relying on external API interfaces, supporting efficient data process flow, query rejection (when appropriate), and private deployment, enabling faster domain data fine-tuning and ensuring data privacy. These features enhance the framework’s adaptability to specific use cases.
    Abstract Tables are prevalent in real-world databases, requiring significant time and effort for humans to analyze and manipulate. The advancements in large language models (LLMs) have made it possible to interact with tables using natural language input, bringing this capability closer to reality. In this paper, we present TableGPT, a unified fine-tuned framework that enables LLMs to understand and operate on tables using external functional commands. It introduces the capability to seamlessly interact with tables, enabling a wide range of functionalities such as question answering, data manipulation (e.g., insert, delete, query, and modify operations), data visualization, analysis report generation, and automated prediction. TableGPT aims to provide convenience and accessibility to users by empowering them to effortlessly leverage tabular data. At the core of TableGPT lies the novel concept of global tabular representations, which empowers LLMs to gain a comprehensive understanding of the entire table beyond meta-information. By jointly training LLMs on both table and text modalities, TableGPT achieves a deep understanding of tabular data and the ability to perform complex operations on tables through chain-of-command instructions. Importantly, TableGPT offers the advantage of being a self-contained system rather than relying on external API interfaces. Moreover, it supports efficient data process flow, query rejection (when appropriate) and private deployment, enabling faster domain data fine-tuning and ensuring data privacy, which enhances the framework's adaptability to specific use cases.
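
To make the chain-of-command idea concrete, here is a hypothetical dispatch layer that applies externally emitted functional commands to a table with pandas; the command schema and operation names are assumptions for illustration, not TableGPT's actual interface:

```python
import pandas as pd

def execute_command(df: pd.DataFrame, command: dict) -> pd.DataFrame:
    """Apply one functional command (hypothetical schema) to a DataFrame."""
    op = command.get("op")
    if op == "query":
        return df.query(command["condition"])
    if op == "insert":
        return pd.concat([df, pd.DataFrame([command["row"]])], ignore_index=True)
    if op == "delete":
        return df.drop(df.query(command["condition"]).index)
    if op == "modify":
        out = df.copy()
        out.loc[out.query(command["condition"]).index, command["column"]] = command["value"]
        return out
    # "Query rejection": refuse anything outside the supported command vocabulary.
    raise ValueError(f"rejected unsupported command: {op!r}")

df = pd.DataFrame({"region": ["EU", "US"], "sales": [120, 80]})
# A fine-tuned LLM might emit this command for "show regions with sales above 100":
print(execute_command(df, {"op": "query", "condition": "sales > 100"}))
```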

Quaternion Convolutional Neural Networks: Current Advances and Future Directions

  • paper_url: http://arxiv.org/abs/2307.08663
  • repo_url: None
  • paper_authors: Gerardo Altamirano-Gomez, Carlos Gershenson
  • For: This work surveys the development and applications of quaternion-valued convolutional neural networks (QCNNs).
  • Methods: It systematically reviews and analyzes existing QCNN models, proposing a conceptual organization of current trends and of the main building blocks used in QCNN design.
  • Results: The survey finds that QCNNs can achieve comparable or better performance with fewer parameters than their real-valued counterparts, and that they capture additional inter-channel information, which can yield better performance across a range of applications.
    Abstract Since their first applications, Convolutional Neural Networks (CNNs) have solved problems that have advanced the state-of-the-art in several domains. CNNs represent information using real numbers. Despite encouraging results, theoretical analysis shows that representations such as hyper-complex numbers can achieve richer representational capacities than real numbers, and that Hamilton products can capture intrinsic interchannel relationships. Moreover, in the last few years, experimental research has shown that Quaternion-Valued CNNs (QCNNs) can achieve similar performance with fewer parameters than their real-valued counterparts. This paper condenses research in the development of QCNNs from its very beginnings. We propose a conceptual organization of current trends and analyze the main building blocks used in the design of QCNN models. Based on this conceptual organization, we propose future directions of research.
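
The Hamilton product that gives QCNNs their parameter savings can be sketched with four real-valued convolutions; the layer below is a common formulation from the QCNN literature under stated assumptions (channel counts divisible by four, no bias), not any single paper's reference implementation:

```python
import torch
import torch.nn as nn

class QuaternionConv2d(nn.Module):
    """2-D convolution over quaternion-valued feature maps.

    Channels are split into four blocks (r, i, j, k) and mixed by real
    convolutions arranged according to the Hamilton product, so the
    relationships between quaternion components are modelled explicitly.
    """
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int, **kw):
        super().__init__()
        assert in_channels % 4 == 0 and out_channels % 4 == 0
        c_in, c_out = in_channels // 4, out_channels // 4
        self.r = nn.Conv2d(c_in, c_out, kernel_size, bias=False, **kw)
        self.i = nn.Conv2d(c_in, c_out, kernel_size, bias=False, **kw)
        self.j = nn.Conv2d(c_in, c_out, kernel_size, bias=False, **kw)
        self.k = nn.Conv2d(c_in, c_out, kernel_size, bias=False, **kw)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        r, i, j, k = torch.chunk(x, 4, dim=1)
        # Hamilton product (W_r + W_i i + W_j j + W_k k) applied to (r + i i + j j + k k)
        out_r = self.r(r) - self.i(i) - self.j(j) - self.k(k)
        out_i = self.i(r) + self.r(i) - self.k(j) + self.j(k)
        out_j = self.j(r) + self.k(i) + self.r(j) - self.i(k)
        out_k = self.k(r) - self.j(i) + self.i(j) + self.r(k)
        return torch.cat([out_r, out_i, out_j, out_k], dim=1)

layer = QuaternionConv2d(16, 32, kernel_size=3, padding=1)
print(layer(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 32, 8, 8])
```

Because the same four component convolutions are reused across the Hamilton arrangement, this layer mixes `in_channels` into `out_channels` with a quarter of the kernel weights of an equivalent real-valued `Conv2d`, which is the parameter saving the survey highlights.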

Hyperparameter Tuning Cookbook: A guide for scikit-learn, PyTorch, river, and spotPython

  • paper_url: http://arxiv.org/abs/2307.10262
  • repo_url: https://github.com/sequential-parameter-optimization/spotpython
  • paper_authors: Thomas Bartz-Beielstein
  • For: This document is a comprehensive guide to hyperparameter tuning with spotPython for scikit-learn, PyTorch, and river models.
  • Methods: It introduces spotPython's surrogate-model-based optimization process and applies it to hyperparameter tuning of scikit-learn models such as Support Vector Classification, Random Forests, Gradient Boosting (XGB), and K-nearest neighbors (KNN), as well as river's Hoeffding Adaptive Tree Regressor.
  • Results: With a hands-on approach and step-by-step explanations, the cookbook provides a practical starting point for hyperparameter tuning in Python, highlighting the interplay between TensorBoard, PyTorch Lightning, spotPython, and river, and the integration of spotPython into the PyTorch and PyTorch Lightning training workflows.
    Abstract This document provides a comprehensive guide to hyperparameter tuning using spotPython for scikit-learn, PyTorch, and river. The first part introduces spotPython's surrogate model-based optimization process, while the second part focuses on hyperparameter tuning. Several case studies are presented, including hyperparameter tuning for sklearn models such as Support Vector Classification, Random Forests, Gradient Boosting (XGB), and K-nearest neighbors (KNN), as well as a Hoeffding Adaptive Tree Regressor from river. The integration of spotPython into the PyTorch and PyTorch Lightning training workflow is also discussed. With a hands-on approach and step-by-step explanations, this cookbook serves as a practical starting point for anyone interested in hyperparameter tuning with Python. Highlights include the interplay between Tensorboard, PyTorch Lightning, spotPython, and river. This publication is under development, with updates available on the corresponding webpage.
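
spotPython's own API is not reproduced here; instead, the sketch below illustrates the generic surrogate-model-based loop the cookbook builds on, tuning an SVC's C with a Gaussian-process surrogate and an expected-improvement acquisition:

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def objective(log10_c: float) -> float:
    """Negative 5-fold CV accuracy of an SVC as a function of log10(C)."""
    return -cross_val_score(SVC(C=10.0 ** log10_c), X, y, cv=5).mean()

rng = np.random.default_rng(0)
samples = list(rng.uniform(-3.0, 3.0, size=5))   # initial design
values = [objective(s) for s in samples]

for _ in range(15):
    # Fit the surrogate to all evaluations made so far.
    gp = GaussianProcessRegressor(normalize_y=True)
    gp.fit(np.reshape(samples, (-1, 1)), values)
    cand = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
    mu, sigma = gp.predict(cand, return_std=True)
    # Expected improvement over the incumbent (we are minimizing).
    imp = min(values) - mu
    z = imp / np.maximum(sigma, 1e-9)
    ei = imp * norm.cdf(z) + sigma * norm.pdf(z)
    samples.append(float(cand[np.argmax(ei), 0]))
    values.append(objective(samples[-1]))

best = samples[int(np.argmin(values))]
print(f"best C ~= {10.0 ** best:.4g}, CV accuracy = {-min(values):.4f}")
```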

Glamour muscles: why having a body is not what it means to be embodied

  • paper_url: http://arxiv.org/abs/2307.08598
  • repo_url: None
  • paper_authors: Shawn L. Beaulieu, Sam Kriegman
  • For: Amplifying the faculties of intelligent machines.
  • Methods: A conceptual re-examination of embodiment as a means toward higher intelligence.
  • Results: An account of embodiment, as the perpetual negotiation of a body's function, design, and integrity, that points beyond niche tools toward more capable intelligent machines.
    Abstract Embodiment has recently enjoyed renewed consideration as a means to amplify the faculties of smart machines. Proponents of embodiment seem to imply that optimizing for movement in physical space promotes something more than the acquisition of niche capabilities for solving problems in physical space. However, there is nothing in principle which should so distinguish the problem of action selection in physical space from the problem of action selection in more abstract spaces, like that of language. Rather, what makes embodiment persuasive as a means toward higher intelligence is that it promises to capture, but does not actually realize, contingent facts about certain bodies (living intelligence) and the patterns of activity associated with them. These include an active resistance to annihilation and revisable constraints on the processes that make the world intelligible. To be theoretically or practically useful beyond the creation of niche tools, we argue that "embodiment" cannot be the trivial fact of a body, nor its movement through space, but the perpetual negotiation of the function, design, and integrity of that body – that is, to participate in what it means to $\textit{constitute}$ a given body. It follows that computer programs which are strictly incapable of traversing physical space might, under the right conditions, be more embodied than a walking, talking robot.