cs.AI - 2023-11-25

SwiftLearn: A Data-Efficient Training Method of Deep Learning Models using Importance Sampling

  • paper_url: http://arxiv.org/abs/2311.15134
  • repo_url: None
  • paper_authors: Habib Hajimolahoseini, Omar Mohamed Awad, Walid Ahmed, Austin Wen, Saina Asani, Mohammad Hassanpour, Farnoosh Javadi, Mehdi Ahmadi, Foozhan Ataiefard, Kangling Liu, Yang Liu
  • for: Accelerating the training of deep learning models by using only a small subset of data samples, selected during the warm-up stage of training.
  • methods: Proposes an importance-criterion-based data selection strategy: importance is measured over the entire dataset during warm-up, and a subset of samples is kept so that model performance is preserved for the rest of training, with the measure periodically refreshed so dropped samples can re-enter the loop (a minimal sketch follows the abstract below).
  • results: Experiments on a variety of computer vision and NLP models, during both pretraining and finetuning, show that performance can be preserved with far fewer training samples while achieving significant training speed-ups; for example, finetuning BERT on the GLUE benchmark allows almost 90% of the data to be dropped, yielding an end-to-end average speedup of 3.36x with an average accuracy drop below 0.92%.
    Abstract In this paper, we present SwiftLearn, a data-efficient approach to accelerate training of deep learning models using a subset of data samples selected during the warm-up stages of training. This subset is selected based on an importance criteria measured over the entire dataset during warm-up stages, aiming to preserve the model performance with fewer examples during the rest of training. The importance measure we propose could be updated during training every once in a while, to make sure that all of the data samples have a chance to return to the training loop if they show a higher importance. The model architecture is unchanged but since the number of data samples controls the number of forward and backward passes during training, we can reduce the training time by reducing the number of training samples used in each epoch of training. Experimental results on a variety of CV and NLP models during both pretraining and finetuning show that the model performance could be preserved while achieving a significant speed-up during training. More specifically, BERT finetuning on GLUE benchmark shows that almost 90% of the data can be dropped achieving an end-to-end average speedup of 3.36x while keeping the average accuracy drop less than 0.92%.
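As a rough illustration of the warm-up selection idea, the sketch below scores every sample once over the full dataset and keeps only the highest-importance fraction for subsequent epochs. The per-sample loss used as the importance score, the 10% keep ratio, and the refresh schedule are assumptions for illustration, not the paper's exact criterion.

```python
# Hedged sketch: warm-up importance scoring + subset selection (SwiftLearn-style).
# The per-sample loss score, keep ratio, and refresh schedule are illustrative assumptions.
import torch
from torch.utils.data import DataLoader, Subset

def score_samples(model, dataset, loss_fn, device="cpu", batch_size=256):
    """Score every sample once over the full dataset (done during/after warm-up)."""
    model.eval()
    scores = []
    with torch.no_grad():
        for x, y in DataLoader(dataset, batch_size=batch_size, shuffle=False):
            x, y = x.to(device), y.to(device)
            scores.append(loss_fn(model(x), y).cpu())   # loss_fn uses reduction="none"
    return torch.cat(scores)

def select_important_subset(model, dataset, loss_fn, keep_ratio=0.10):
    """Keep the top `keep_ratio` fraction of samples by importance for later epochs."""
    scores = score_samples(model, dataset, loss_fn)
    k = max(1, int(keep_ratio * len(dataset)))
    keep_idx = torch.topk(scores, k).indices.tolist()
    return Subset(dataset, keep_idx)

# Usage outline: train a few warm-up epochs on the full dataset, call
# select_important_subset(...), train on the subset, and re-select every few epochs
# so dropped samples can re-enter the loop if their importance rises.
```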

Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching

  • paper_url: http://arxiv.org/abs/2311.15131
  • repo_url: None
  • paper_authors: James Campbell, Richard Ren, Phillip Guo
  • for: investigate instructed dishonesty in large language models (LLMs)
  • methods: prompt engineering and mechanistic interpretability approaches
  • results: localized five layers that appear especially important for lying and 46 attention heads within them whose patching makes the lying model answer honestly; these interventions work robustly across many prompts and dataset splits (a patching sketch follows the abstract below).
    Abstract Large language models (LLMs) demonstrate significant knowledge through their outputs, though it is often unclear whether false outputs are due to a lack of knowledge or dishonesty. In this paper, we investigate instructed dishonesty, wherein we explicitly prompt LLaMA-2-70b-chat to lie. We perform prompt engineering to find which prompts best induce lying behavior, and then use mechanistic interpretability approaches to localize where in the network this behavior occurs. Using linear probing and activation patching, we localize five layers that appear especially important for lying. We then find just 46 attention heads within these layers that enable us to causally intervene such that the lying model instead answers honestly. We show that these interventions work robustly across many prompts and dataset splits. Overall, our work contributes a greater understanding of dishonesty in LLMs so that we may hope to prevent it.
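As a rough illustration of the activation-patching step, the sketch below overwrites the outputs of chosen attention heads with activations cached from an honest run, assuming a Hugging Face-style LLaMA module layout (`model.model.layers[i].self_attn.o_proj`). The layer and head indices are placeholders, not the five layers and 46 heads identified in the paper.

```python
# Hedged sketch of activation patching on selected attention heads.
import torch

def patch_attention_heads(model, honest_cache, layers, heads, head_dim):
    """honest_cache[layer]: pre-o_proj activations [batch, seq, n_heads*head_dim]
    captured from an honest forward pass on the same prompt length."""
    handles = []
    for layer in layers:
        o_proj = model.model.layers[layer].self_attn.o_proj

        def pre_hook(module, args, layer=layer):
            x = args[0].clone()                      # concatenated head outputs
            for h in heads:
                sl = slice(h * head_dim, (h + 1) * head_dim)
                x[..., sl] = honest_cache[layer][..., sl]
            return (x,) + args[1:]

        handles.append(o_proj.register_forward_pre_hook(pre_hook))
    return handles   # call handle.remove() on each to undo the patch

# Usage outline: run the honest prompt once with hooks that *record* pre-o_proj inputs,
# then register these patching hooks and rerun the lying prompt to test whether the
# patched model answers honestly.
```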

NCL-SM: A Fully Annotated Dataset of Images from Human Skeletal Muscle Biopsies

  • paper_url: http://arxiv.org/abs/2311.15113
  • repo_url: https://github.com/atifkhanncl/ncl-sm
  • paper_authors: Atif Khan, Conor Lawless, Amy Vincent, Charlotte Warren, Valeria Di Leo, Tiago Gomes, A. Stephen McGough
  • for: To enable automated, precise, and reproducible single-cell analysis of human skeletal muscle, supporting a better understanding of many neuromuscular disorders.
  • methods: Releases a high-quality, fully annotated bioimaging dataset (NCL-SM) intended for training and testing deep learning segmentation models.
  • results: Provides images of 46 human skeletal muscle (SM) tissue cross-sections with more than 50k manually segmented muscle fibres (myofibres), together with curated myofibre segmentations and annotations of rejected low-quality myofibres and low-quality tissue regions, ready for downstream analysis and for developing a fully automatic segmentation pipeline.
    Abstract Single cell analysis of human skeletal muscle (SM) tissue cross-sections is a fundamental tool for understanding many neuromuscular disorders. For this analysis to be reliable and reproducible, identification of individual fibres within microscopy images (segmentation) of SM tissue should be automatic and precise. Biomedical scientists in this field currently rely on custom tools and general machine learning (ML) models, both followed by labour intensive and subjective manual interventions to fine-tune segmentation. We believe that fully automated, precise, reproducible segmentation is possible by training ML models. However, in this important biomedical domain, there are currently no good quality, publicly available annotated imaging datasets available for ML model training. In this paper we release NCL-SM: a high quality bioimaging dataset of 46 human SM tissue cross-sections from both healthy control subjects and from patients with genetically diagnosed muscle pathology. These images include $>$ 50k manually segmented muscle fibres (myofibres). In addition we also curated high quality myofibre segmentations, annotating reasons for rejecting low quality myofibres and low quality regions in SM tissue images, making these annotations completely ready for downstream analysis. This, we believe, will pave the way for development of a fully automatic pipeline that identifies individual myofibres within images of tissue sections and, in particular, also classifies individual myofibres that are fit for further analysis.

Everybody Needs a Little HELP: Explaining Graphs via Hierarchical Concepts

  • paper_url: http://arxiv.org/abs/2311.15112
  • repo_url: https://github.com/jonasjuerss/help
  • paper_authors: Jonas Jürß, Lucie Charlotte Magister, Pietro Barbiero, Pietro Liò, Nikola Simidjievski
  • for: Improving the interpretability of graph neural networks (GNNs) so they can be trusted and deployed in high-stakes decision-making settings.
  • methods: Proposes Hierarchical Explainable Latent Pooling (HELP), a novel inherently interpretable graph pooling method that is non-spectral, end-to-end learnable, and able to pool a variable number of arbitrary connected components (a simplified pooling sketch follows the abstract below).
  • results: Experiments show that HELP matches standard GCNs and popular pooling methods in accuracy while yielding explanations aligned with expert knowledge in chemistry and social networks; concept completeness scores and a new concept conformity metric show the discovered concepts are significantly easier to fully understand than those from previous work.
    Abstract Graph neural networks (GNNs) have led to major breakthroughs in a variety of domains such as drug discovery, social network analysis, and travel time estimation. However, they lack interpretability which hinders human trust and thereby deployment to settings with high-stakes decisions. A line of interpretable methods approach this by discovering a small set of relevant concepts as subgraphs in the last GNN layer that together explain the prediction. This can yield oversimplified explanations, failing to explain the interaction between GNN layers. To address this oversight, we provide HELP (Hierarchical Explainable Latent Pooling), a novel, inherently interpretable graph pooling approach that reveals how concepts from different GNN layers compose to new ones in later steps. HELP is more than 1-WL expressive and is the first non-spectral, end-to-end-learnable, hierarchical graph pooling method that can learn to pool a variable number of arbitrary connected components. We empirically demonstrate that it performs on-par with standard GCNs and popular pooling methods in terms of accuracy while yielding explanations that are aligned with expert knowledge in the domains of chemistry and social networks. In addition to a qualitative analysis, we employ concept completeness scores as well as concept conformity, a novel metric to measure the noise in discovered concepts, quantitatively verifying that the discovered concepts are significantly easier to fully understand than those from previous work. Our work represents a first step towards an understanding of graph neural networks that goes beyond a set of concepts from the final layer and instead explains the complex interplay of concepts on different levels.
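To make the pooling idea concrete, here is a hedged single-step sketch of "merge every connected component whose nodes share a concept", with k-means over node features standing in for HELP's learned concept assignment; it is not the paper's exact procedure.

```python
# Hedged sketch of one hierarchical pooling step in the spirit of HELP.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import KMeans

def pool_step(node_feats, src, dst, n_concepts=8):
    """node_feats: [n, d]; (src, dst): edge endpoints.
    Returns mean-pooled coarse node features and the node -> coarse-node assignment."""
    n = node_feats.shape[0]
    concept = KMeans(n_clusters=n_concepts, n_init=10).fit_predict(node_feats)
    same = concept[src] == concept[dst]              # keep intra-concept edges only
    adj = sp.coo_matrix((np.ones(int(same.sum())), (src[same], dst[same])), shape=(n, n))
    n_comp, comp = connected_components(adj, directed=False)
    pooled = np.zeros((n_comp, node_feats.shape[1]))
    np.add.at(pooled, comp, node_feats)              # sum features per component
    counts = np.bincount(comp, minlength=n_comp)[:, None]
    return pooled / counts, comp
```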

Leveraging Diffusion Perturbations for Measuring Fairness in Computer Vision

  • paper_url: http://arxiv.org/abs/2311.15108
  • repo_url: None
  • paper_authors: Nicholas Lui, Bryan Chia, William Berrios, Candace Ross, Douwe Kiela
  • for: This paper aims to evaluate the fairness of computer vision models by creating a dataset balanced along demographic traits and benchmarking several vision-language models on a multi-class occupation classification task.
  • methods: The paper uses diffusion models to generate a large set of images depicting various occupations, and inpainting to generate multiple variants of each image with different perceived races.
  • results: The paper finds that images generated with non-Caucasian labels have a significantly higher occupation misclassification rate than images generated with Caucasian labels, and that several misclassifications are suggestive of racial biases. The paper also finds significant disparities between the evaluated vision-and-language models using a fairness metric that measures the standard deviation in the probability of predicting the true occupation label across different perceived identity groups.
    Abstract Computer vision models have been known to encode harmful biases, leading to the potentially unfair treatment of historically marginalized groups, such as people of color. However, there remains a lack of datasets balanced along demographic traits that can be used to evaluate the downstream fairness of these models. In this work, we demonstrate that diffusion models can be leveraged to create such a dataset. We first use a diffusion model to generate a large set of images depicting various occupations. Subsequently, each image is edited using inpainting to generate multiple variants, where each variant refers to a different perceived race. Using this dataset, we benchmark several vision-language models on a multi-class occupation classification task. We find that images generated with non-Caucasian labels have a significantly higher occupation misclassification rate than images generated with Caucasian labels, and that several misclassifications are suggestive of racial biases. We measure a model's downstream fairness by computing the standard deviation in the probability of predicting the true occupation label across the different perceived identity groups. Using this fairness metric, we find significant disparities between the evaluated vision-and-language models. We hope that our work demonstrates the potential value of diffusion methods for fairness evaluations.
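The fairness metric is concrete enough to sketch. Under one plausible reading of the abstract, it is the standard deviation, across perceived identity groups, of the mean probability assigned to the true occupation label; the variable names and toy numbers below are illustrative.

```python
# Hedged sketch of the fairness metric described above.
import numpy as np

def fairness_std(true_label_prob, group_id):
    """true_label_prob[i]: model probability of the correct occupation for image i;
    group_id[i]: perceived identity group of image i."""
    per_group = [true_label_prob[group_id == g].mean() for g in np.unique(group_id)]
    return float(np.std(per_group))

# Toy example with three perceived groups; a larger value means a larger disparity.
probs = np.array([0.92, 0.88, 0.71, 0.69, 0.55, 0.52])
groups = np.array([0, 0, 1, 1, 2, 2])
print(fairness_std(probs, groups))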

Unbalancedness in Neural Monge Maps Improves Unpaired Domain Translation

  • paper_url: http://arxiv.org/abs/2311.15100
  • repo_url: None
  • paper_authors: Luca Eyring, Dominik Klein, Théo Uscidda, Giovanni Palla, Niki Kilbertus, Zeynep Akata, Fabian Theis
  • for: Making neural Monge maps robust to outliers in unpaired domain translation tasks, where the classic OT requirement of mass conservation is harmful.
  • methods: Proposes a theoretically grounded way to incorporate unbalancedness (relaxed mass conservation) into any neural Monge map estimator, and integrates it with the OT flow matching framework (UOT-FM); a toy unbalanced-OT example follows the abstract below.
  • results: Improves existing estimators for modelling cell trajectories over time and predicting cellular responses to perturbations, and improves unpaired image translation by better preserving relevant features, establishing UOT-FM as a principled method for the task.
    Abstract In optimal transport (OT), a Monge map is known as a mapping that transports a source distribution to a target distribution in the most cost-efficient way. Recently, multiple neural estimators for Monge maps have been developed and applied in diverse unpaired domain translation tasks, e.g. in single-cell biology and computer vision. However, the classic OT framework enforces mass conservation, which makes it prone to outliers and limits its applicability in real-world scenarios. The latter can be particularly harmful in OT domain translation tasks, where the relative position of a sample within a distribution is explicitly taken into account. While unbalanced OT tackles this challenge in the discrete setting, its integration into neural Monge map estimators has received limited attention. We propose a theoretically grounded method to incorporate unbalancedness into any Monge map estimator. We improve existing estimators to model cell trajectories over time and to predict cellular responses to perturbations. Moreover, our approach seamlessly integrates with the OT flow matching (OT-FM) framework. While we show that OT-FM performs competitively in image translation, we further improve performance by incorporating unbalancedness (UOT-FM), which better preserves relevant features. We hence establish UOT-FM as a principled method for unpaired image translation.
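The toy example below, using the POT library in the discrete setting, illustrates why relaxing mass conservation helps with outliers: target outliers receive far less transported mass under unbalanced Sinkhorn. It illustrates the motivation only and is not the paper's neural Monge-map estimator; the regularization values are arbitrary.

```python
# Hedged toy example: discrete unbalanced OT down-weights target outliers.
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(100, 2))
target = np.vstack([rng.normal(3.0, 1.0, size=(95, 2)),     # inliers
                    rng.normal(15.0, 0.1, size=(5, 2))])    # 5 outliers

a = np.full(100, 1.0 / 100)                  # source weights
b = np.full(100, 1.0 / 100)                  # target weights
M = ot.dist(source, target)                  # squared Euclidean cost matrix

plan = ot.unbalanced.sinkhorn_unbalanced(a, b, M, reg=0.05, reg_m=1.0)
mass_received = plan.sum(axis=0)
print("outlier / inlier mass ratio:",
      mass_received[-5:].mean() / mass_received[:-5].mean())
```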

Enhancing Sentiment Analysis Results through Outlier Detection Optimization

  • paper_url: http://arxiv.org/abs/2311.16185
  • repo_url: https://github.com/Stry233/Enhancing-Sentiment-Analysis-Results-through-Outlier-Detection-Optimization
  • paper_authors: Yuetian Chen, Mei Si
  • for: Improving the classification of text data with subjective labels (such as speaker emotions) by identifying and handling outliers.
  • methods: Uses the Deep SVDD algorithm, a one-class classification method, to detect outliers in nine text-based emotion and sentiment analysis datasets; classifiers include a small language model (DistilBERT) and non-deep-learning algorithms (decision tree, KNN, logistic regression, and LDA). A simplified pipeline sketch follows the abstract below.
  • results: Removing outliers improves classification results in most cases; because such outliers are not necessarily unlearnable, further gains across multiple datasets are observed when using a large language model (DeBERTa v3 large) that can capture very complex patterns in the data.
    Abstract When dealing with text data containing subjective labels like speaker emotions, inaccuracies or discrepancies among labelers are not uncommon. Such discrepancies can significantly affect the performance of machine learning algorithms. This study investigates the potential of identifying and addressing outliers in text data with subjective labels, aiming to enhance classification outcomes. We utilized the Deep SVDD algorithm, a one-class classification method, to detect outliers in nine text-based emotion and sentiment analysis datasets. By employing both a small-sized language model (DistilBERT base model with 66 million parameters) and non-deep learning machine learning algorithms (decision tree, KNN, Logistic Regression, and LDA) as the classifier, our findings suggest that the removal of outliers can lead to enhanced results in most cases. Additionally, as outliers in such datasets are not necessarily unlearnable, we experienced utilizing a large language model -- DeBERTa v3 large with 131 million parameters, which can capture very complex patterns in data. We continued to observe performance enhancements across multiple datasets.
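A hedged sketch of the remove-outliers-then-classify pipeline follows. sklearn's OneClassSVM stands in for the paper's Deep SVDD detector and TF-IDF features stand in for DistilBERT embeddings; both substitutions are simplifications for illustration.

```python
# Hedged sketch: detect outliers with a one-class model, then train the classifier
# only on the retained (inlier) samples.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM
from sklearn.linear_model import LogisticRegression

def train_without_outliers(texts, labels, outlier_fraction=0.05):
    X = TfidfVectorizer(max_features=5000).fit_transform(texts)
    inlier = OneClassSVM(nu=outlier_fraction).fit_predict(X) == 1   # -1 marks outliers
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[inlier], np.asarray(labels)[inlier])
    return clf
```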

Weakly-Supervised Audio-Visual Segmentation

  • paper_url: http://arxiv.org/abs/2311.15080
  • repo_url: None
  • paper_authors: Shentong Mo, Bhiksha Raj
  • for: Audio-visual segmentation: predicting pixel-level masks for sound sources in a video.
  • methods: Proposes WS-AVS, a weakly-supervised audio-visual segmentation framework that relies only on instance-level annotation and learns multi-scale audio-visual alignment with multi-scale multiple-instance contrastive learning (a single-scale contrastive-loss sketch follows the abstract below).
  • results: Extensive experiments on AVSBench demonstrate the effectiveness of WS-AVS for weakly-supervised audio-visual segmentation in both single-source and multi-source scenarios.
    Abstract Audio-visual segmentation is a challenging task that aims to predict pixel-level masks for sound sources in a video. Previous work applied a comprehensive manually designed architecture with countless pixel-wise accurate masks as supervision. However, these pixel-level masks are expensive and not available in all cases. In this work, we aim to simplify the supervision as the instance-level annotation, i.e., weakly-supervised audio-visual segmentation. We present a novel Weakly-Supervised Audio-Visual Segmentation framework, namely WS-AVS, that can learn multi-scale audio-visual alignment with multi-scale multiple-instance contrastive learning for audio-visual segmentation. Extensive experiments on AVSBench demonstrate the effectiveness of our WS-AVS in the weakly-supervised audio-visual segmentation of single-source and multi-source scenarios.
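As a simplified stand-in for the multi-scale multiple-instance contrastive objective, the sketch below shows a standard single-scale audio-visual InfoNCE loss in which matched audio/visual pairs share the same batch index; the temperature is an illustrative choice.

```python
# Hedged sketch: symmetric audio-visual InfoNCE loss (single scale).
import torch
import torch.nn.functional as F

def av_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """audio_emb, visual_emb: [batch, dim]; matched pairs share the same batch index."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                    # [batch, batch] similarities
    targets = torch.arange(a.size(0), device=a.device)  # positives on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```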

Introducing SSBD+ Dataset with a Convolutional Pipeline for detecting Self-Stimulatory Behaviours in Children using raw videos

  • paper_url: http://arxiv.org/abs/2311.15072
  • repo_url: https://github.com/sarl-iiitb/ssbdplus-dataset
  • paper_authors: Vaibhavi Lokegaonkar, Vijay Jaisankar, Pon Deepika, Madhav Rao, T K Srikanth, Sarbani Mallick, Manjit Sodhi
  • for: A machine learning approach for automatically detecting self-stimulatory behaviours in children from raw videos, to support early diagnosis of autism spectrum disorder (ASD).
  • methods: Proposes a pipelined deep learning architecture for detecting certain self-stimulatory behaviours, supplemented with an augmented version of the Self Stimulatory Behavior Dataset (SSBD) and a new action-detection label: no-class.
  • results: The proposed pipeline model, targeted at real-time and hands-free automated diagnosis, achieves an overall accuracy of around 81%.
    Abstract Conventionally, evaluation for the diagnosis of Autism spectrum disorder is done by a trained specialist through questionnaire-based formal assessments and by observation of behavioral cues under various settings to capture the early warning signs of autism. These evaluation techniques are highly subjective and their accuracy relies on the experience of the specialist. In this regard, machine learning-based methods for automated capturing of early signs of autism from the recorded videos of the children is a promising alternative. In this paper, the authors propose a novel pipelined deep learning architecture to detect certain self-stimulatory behaviors that help in the diagnosis of autism spectrum disorder (ASD). The authors also supplement their tool with an augmented version of the Self Stimulatory Behavior Dataset (SSBD) and also propose a new label in SSBD Action detection: no-class. The deep learning model with the new dataset is made freely available for easy adoption to the researchers and developers community. An overall accuracy of around 81% was achieved from the proposed pipeline model that is targeted for real-time and hands-free automated diagnosis. All of the source code, data, licenses of use, and other relevant material is made freely available in https://github.com/sarl-iiitb/

Accurate and interpretable drug-drug interaction prediction enabled by knowledge subgraph learning

  • paper_url: http://arxiv.org/abs/2311.15056
  • repo_url: https://github.com/lars-research/knowddi
  • paper_authors: Yaqing Wang, Zaifei Yang, Quanming Yao
  • for: Improving the accuracy and interpretability of drug-drug interaction (DDI) prediction by exploiting biomedical knowledge graphs.
  • methods: Proposes KnowDDI, a graph-neural-network-based method that enriches drug representations with neighbourhood information from large biomedical knowledge graphs and then learns a knowledge subgraph for each drug pair to interpret the predicted interaction (a subgraph-extraction sketch follows the abstract below).
  • results: On two benchmark DDI datasets, KnowDDI achieves state-of-the-art prediction performance with better interpretability, and suffers less than existing methods when the knowledge graph is sparser.
    Abstract Background: Discovering potential drug-drug interactions (DDIs) is a long-standing challenge in clinical treatments and drug developments. Recently, deep learning techniques have been developed for DDI prediction. However, they generally require a huge number of samples, while known DDIs are rare. Methods: In this work, we present KnowDDI, a graph neural network-based method that addresses the above challenge. KnowDDI enhances drug representations by adaptively leveraging rich neighborhood information from large biomedical knowledge graphs. Then, it learns a knowledge subgraph for each drug-pair to interpret the predicted DDI, where each of the edges is associated with a connection strength indicating the importance of a known DDI or resembling strength between a drug-pair whose connection is unknown. Thus, the lack of DDIs is implicitly compensated by the enriched drug representations and propagated drug similarities. Results: We evaluate KnowDDI on two benchmark DDI datasets. Results show that KnowDDI obtains the state-of-the-art prediction performance with better interpretability. We also find that KnowDDI suffers less than existing works given a sparser knowledge graph. This indicates that the propagated drug similarities play a more important role in compensating for the lack of DDIs when the drug representations are less enriched. Conclusions: KnowDDI nicely combines the efficiency of deep learning techniques and the rich prior knowledge in biomedical knowledge graphs. As an original open-source tool, KnowDDI can help detect possible interactions in a broad range of relevant interaction prediction tasks, such as protein-protein interactions, drug-target interactions and disease-gene interactions, eventually promoting the development of biomedicine and healthcare.
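A hedged sketch of the kind of pre-processing such a method builds on: extracting a drug-pair knowledge subgraph as the union of k-hop neighbourhoods around both drugs. networkx and the 2-hop radius are illustrative choices, not the paper's exact procedure.

```python
# Hedged sketch: k-hop drug-pair subgraph extraction from a biomedical knowledge graph.
import networkx as nx

def drug_pair_subgraph(kg: nx.Graph, drug_a, drug_b, k=2):
    """Return the subgraph induced by all nodes within k hops of either drug."""
    nodes = set()
    for drug in (drug_a, drug_b):
        reachable = nx.single_source_shortest_path_length(kg, drug, cutoff=k)
        nodes.update(reachable)                      # keys are nodes within k hops
    return kg.subgraph(nodes).copy()
```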

MPCNN: A Novel Matrix Profile Approach for CNN-based Sleep Apnea Classification

  • paper_url: http://arxiv.org/abs/2311.15041
  • repo_url: https://github.com/vinuni-vishc/mpcnn-sleep-apnea
  • paper_authors: Hieu X. Nguyen, Duong V. Nguyen, Hieu H. Pham, Cuong D. Do
  • for: Improving sleep apnea (SA) diagnosis from electrocardiogram (ECG) signals by mining the key information contained in the complete PQRST segments.
  • methods: Proposes a Matrix-Profile-inspired approach that generates a Euclidean distance profile from fixed-length signal subsequences and derives three distance-based feature sets: the Min Distance Profile (MinDP), Max Distance Profile (MaxDP), and Mean Distance Profile (MeanDP). A minimal feature-extraction sketch follows the abstract below.
  • results: Extensive experiments on the PhysioNet Apnea-ECG dataset show a per-segment accuracy of up to 92.11% and a per-recording accuracy of 100%, with the highest correlation coefficient (0.989) compared with state-of-the-art methods; the new features also enhance certain lightweight models, showing potential for home sleep apnea testing and SA detection on IoT devices.
    Abstract Sleep apnea (SA) is a significant respiratory condition that poses a major global health challenge. Previous studies have investigated several machine and deep learning models for electrocardiogram (ECG)-based SA diagnoses. Despite these advancements, conventional feature extractions derived from ECG signals, such as R-peaks and RR intervals, may fail to capture crucial information encompassed within the complete PQRST segments. In this study, we propose an innovative approach to address this diagnostic gap by delving deeper into the comprehensive segments of the ECG signal. The proposed methodology draws inspiration from Matrix Profile algorithms, which generate an Euclidean distance profile from fixed-length signal subsequences. From this, we derived the Min Distance Profile (MinDP), Max Distance Profile (MaxDP), and Mean Distance Profile (MeanDP) based on the minimum, maximum, and mean of the profile distances, respectively. To validate the effectiveness of our approach, we use the modified LeNet-5 architecture as the primary CNN model, along with two existing lightweight models, BAFNet and SE-MSCNN, for ECG classification tasks. Our extensive experimental results on the PhysioNet Apnea-ECG dataset revealed that with the new feature extraction method, we achieved a per-segment accuracy up to 92.11 \% and a per-recording accuracy of 100\%. Moreover, it yielded the highest correlation compared to state-of-the-art methods, with a correlation coefficient of 0.989. By introducing a new feature extraction method based on distance relationships, we enhanced the performance of certain lightweight models, showing potential for home sleep apnea test (HSAT) and SA detection in IoT devices. The source code for this work is made publicly available in GitHub: https://github.com/vinuni-vishc/MPCNN-Sleep-Apnea.
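The MinDP/MaxDP/MeanDP features are concrete enough to sketch: for each fixed-length subsequence, compute Euclidean distances to all other subsequences and keep the minimum, maximum, and mean. The window length and the brute-force O(n^2) computation below are for clarity only.

```python
# Hedged sketch of the distance-profile features (MinDP, MaxDP, MeanDP).
import numpy as np

def distance_profile_features(signal, window=32):
    subs = np.lib.stride_tricks.sliding_window_view(signal, window)   # [n_sub, window]
    d = np.linalg.norm(subs[:, None, :] - subs[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.nan)                                       # ignore self-matches
    return np.nanmin(d, axis=1), np.nanmax(d, axis=1), np.nanmean(d, axis=1)

# Usage outline: compute (MinDP, MaxDP, MeanDP) per ECG segment and feed the resulting
# profiles to a CNN such as the modified LeNet-5 mentioned in the abstract.
```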

Word for Person: Zero-shot Composed Person Retrieval

  • paper_url: http://arxiv.org/abs/2311.16515
  • repo_url: https://github.com/Delong-liu-bupt/Word4Per
  • paper_authors: Delong Liu, Haiwen Li, Zhicheng Zhao, Fei Su, Hongying Meng
  • for: Improving person retrieval by jointly exploiting image and text information to identify a target person.
  • methods: Introduces the new Composed Person Retrieval (CPR) task and its zero-shot variant (ZS-CPR), which leverages existing domain-related data so that CPR can be solved without expensive manual annotation.
  • results: Proposes Word4Per, a two-stage learning framework combining a lightweight Textual Inversion Network with a fine-tuned CLIP-based person retrieval model, and builds a finely annotated Image-Text Composed Person Retrieval dataset (ITCPR); Word4Per surpasses comparative methods on the ZS-CPR task by over 10% (a naive composed-query sketch follows the abstract below).
    Abstract Searching for specific person has great security value and social benefits, and it often involves a combination of visual and textual information. Conventional person retrieval methods, whether image-based or text-based, usually fall short in effectively harnessing both types of information, leading to the loss of accuracy. In this paper, a whole new task called Composed Person Retrieval (CPR) is proposed to jointly utilize both image and text information for target person retrieval. However, the supervised CPR must depend on very costly manual annotation dataset, while there are currently no available resources. To mitigate this issue, we firstly introduce the Zero-shot Composed Person Retrieval (ZS-CPR), which leverages existing domain-related data to resolve the CPR problem without reliance on expensive annotations. Secondly, to learn ZS-CPR model, we propose a two-stage learning framework, Word4Per, where a lightweight Textual Inversion Network (TINet) and a text-based person retrieval model based on fine-tuned Contrastive Language-Image Pre-training (CLIP) network are learned without utilizing any CPR data. Thirdly, a finely annotated Image-Text Composed Person Retrieval dataset (ITCPR) is built as the benchmark to assess the performance of the proposed Word4Per framework. Extensive experiments under both Rank-1 and mAP demonstrate the effectiveness of Word4Per for the ZS-CPR task, surpassing the comparative methods by over 10%. The code and ITCPR dataset will be publicly available at https://github.com/Delong-liu-bupt/Word4Per.
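As a naive composed-query baseline (not the paper's Word4Per textual-inversion pipeline), the sketch below averages the CLIP embeddings of the query image and the relative caption and ranks a pre-encoded gallery by cosine similarity, assuming the OpenAI CLIP package.

```python
# Hedged sketch: naive composed-query retrieval with CLIP (baseline, not Word4Per).
import torch
import clip  # OpenAI CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def composed_query(image_path, caption):
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        img_f = model.encode_image(image)
        txt_f = model.encode_text(text)
    q = (img_f / img_f.norm(dim=-1, keepdim=True) +
         txt_f / txt_f.norm(dim=-1, keepdim=True)) / 2
    return q / q.norm(dim=-1, keepdim=True)

def rank_gallery(query, gallery_feats):   # gallery_feats: [N, D], pre-normalised
    return (gallery_feats @ query.T).squeeze(-1).argsort(descending=True)
```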

On-Device Soft Sensors: Real-Time Fluid Flow Estimation from Level Sensor Data

  • paper_url: http://arxiv.org/abs/2311.15036
  • repo_url: None
  • paper_authors: Tianheng Ling, Chao Qian, Gregor Schiele
  • for: Bridging the physical and digital realms of autonomous systems by deploying soft sensors directly on devices, improving sensor fusion and perception.
  • methods: Deploys AI-based soft sensors on devices within a wireless sensor network to improve energy efficiency and data security, and combines a Microcontroller Unit with a Field-Programmable Gate Array (FPGA) to exploit the latter's rapid AI inference capabilities.
  • results: FPGA-based soft sensors achieve inference times ranging from 1.04 to 12.04 microseconds in a real-world use case, demonstrating efficient real-time inference and offering a feasible alternative to the latency challenges of Cloud-based deployments.
    Abstract Soft sensors are crucial in bridging autonomous systems' physical and digital realms, enhancing sensor fusion and perception. Instead of deploying soft sensors on the Cloud, this study shift towards employing on-device soft sensors, promising heightened efficiency and bolstering data security. Our approach substantially improves energy efficiency by deploying Artificial Intelligence (AI) directly on devices within a wireless sensor network. Furthermore, the synergistic integration of the Microcontroller Unit and Field-Programmable Gate Array (FPGA) leverages the rapid AI inference capabilities of the latter. Empirical evidence from our real-world use case demonstrates that FPGA-based soft sensors achieve inference times ranging remarkably from 1.04 to 12.04 microseconds. These compelling results highlight the considerable potential of our innovative approach for executing real-time inference tasks efficiently, thereby presenting a feasible alternative that effectively addresses the latency challenges intrinsic to Cloud-based deployments.

Agent as Cerebrum, Controller as Cerebellum: Implementing an Embodied LMM-based Agent on Drones

  • paper_url: http://arxiv.org/abs/2311.15033
  • repo_url: None
  • paper_authors: Haoran Zhao, Fengxing Pan, Huqiuyue Ping, Yaoming Zhou
  • for: Developing an embodied agent for industrial drones built on an "agent as cerebrum, controller as cerebellum" architecture.
  • methods: Harnesses Large Multimodal Models (LMMs) within an agent framework called AeroAgent, tailored for drone technology, and introduces ROSchain, a bespoke linkage framework connecting LMM-based agents to the Robot Operating System (ROS).
  • results: In simulated experiments on Airgen and a real-world case study, particularly individual search and rescue operations, AeroAgent outperforms existing deep reinforcement learning (DRL)-based agents, highlighting the advantages of the embodied LMM in complex, real-world scenarios.
    Abstract In this study, we present a novel paradigm for industrial robotic embodied agents, encapsulating an 'agent as cerebrum, controller as cerebellum' architecture. Our approach harnesses the power of Large Multimodal Models (LMMs) within an agent framework known as AeroAgent, tailored for drone technology in industrial settings. To facilitate seamless integration with robotic systems, we introduce ROSchain, a bespoke linkage framework connecting LMM-based agents to the Robot Operating System (ROS). We report findings from extensive empirical research, including simulated experiments on the Airgen and real-world case study, particularly in individual search and rescue operations. The results demonstrate AeroAgent's superior performance in comparison to existing Deep Reinforcement Learning (DRL)-based agents, highlighting the advantages of the embodied LMM in complex, real-world scenarios.

E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation

  • paper_url: http://arxiv.org/abs/2311.15016
  • repo_url: None
  • paper_authors: Fengyi Fu, Lei Zhang, Quan Wang, Zhendong Mao
  • for: Making dialogue systems more human-like by generating empathetic responses.
  • methods: Uses a multi-resolution emotion graph to capture context-based emotion interactions, together with a correlation-aware aggregation and soft/hard strategies, to learn, exploit, and supervise emotion correlations in dialogue.
  • results: Achieves superior empathetic perception and expression on the benchmark dataset.
    Abstract Achieving empathy is a crucial step toward humanized dialogue systems. Current approaches for empathetic dialogue generation mainly perceive an emotional label to generate an empathetic response conditioned on it, which simply treat emotions independently, but ignore the intrinsic emotion correlation in dialogues, resulting in inaccurate emotion perception and unsuitable response generation. In this paper, we propose a novel emotion correlation enhanced empathetic dialogue generation framework, which comprehensively realizes emotion correlation learning, utilization, and supervising. Specifically, a multi-resolution emotion graph is devised to capture context-based emotion interactions from different resolutions, further modeling emotion correlation. Then we propose an emotion correlation enhanced decoder, with a novel correlation-aware aggregation and soft/hard strategy, respectively improving the emotion perception and response generation. Experimental results on the benchmark dataset demonstrate the superiority of our model in both empathetic perception and expression.

Exploring Causal Learning through Graph Neural Networks: An In-depth Review

  • paper_url: http://arxiv.org/abs/2311.14994
  • repo_url: None
  • paper_authors: Simi Job, Xiaohui Tao, Taotao Cai, Haoran Xie, Lin Li, Jianming Yong, Qing Li
  • for: A systematic review of advancements in causal learning with graph neural networks (GNNs), introducing a novel taxonomy to categorize the state-of-the-art GNN methods used for studying causality.
  • methods: Surveys a range of state-of-the-art GNN methods, including Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs), and Graph Autoencoders (GAEs), and their applications in the causality domain.
  • results: Provides an exhaustive compilation of datasets integral to causal learning with GNNs as a resource for practical study, discusses applications across diverse sectors, and concludes with potential challenges and promising avenues for future exploration.
    Abstract In machine learning, exploring data correlations to predict outcomes is a fundamental task. Recognizing causal relationships embedded within data is pivotal for a comprehensive understanding of system dynamics, the significance of which is paramount in data-driven decision-making processes. Beyond traditional methods, there has been a surge in the use of graph neural networks (GNNs) for causal learning, given their capabilities as universal data approximators. Thus, a thorough review of the advancements in causal learning using GNNs is both relevant and timely. To structure this review, we introduce a novel taxonomy that encompasses various state-of-the-art GNN methods employed in studying causality. GNNs are further categorized based on their applications in the causality domain. We further provide an exhaustive compilation of datasets integral to causal learning with GNNs to serve as a resource for practical study. This review also touches upon the application of causal learning across diverse sectors. We conclude the review with insights into potential challenges and promising avenues for future exploration in this rapidly evolving field of machine learning.

Effective Backdoor Mitigation Depends on the Pre-training Objective

  • paper_url: http://arxiv.org/abs/2311.14948
  • repo_url: None
  • paper_authors: Sahil Verma, Gantavya Bhatt, Avi Schwarzschild, Soumye Singhal, Arnav Mohanty Das, Chirag Shah, John P Dickerson, Jeff Bilmes
  • for: This paper is written to investigate the effectiveness of CleanCLIP in mitigating backdoors in multimodal models, and to explore the relationship between pre-training objectives and backdoor removal.
  • methods: The paper uses two large datasets (CC3M and CC6M) and various pre-training objectives to train multimodal models, followed by poison removal using CleanCLIP. The authors also perform extensive hyperparameter tuning to evaluate the effectiveness of CleanCLIP under different conditions.
  • results: The paper finds that CleanCLIP is ineffective in removing backdoors when stronger pre-training objectives are used, and that simpler pre-training objectives are more amenable to effective backdoor removal. The findings highlight the importance of considering the trade-offs between pre-training objectives and security against backdoor attacks in ML deployments.
    Abstract Despite the advanced capabilities of contemporary machine learning (ML) models, they remain vulnerable to adversarial and backdoor attacks. This vulnerability is particularly concerning in real-world deployments, where compromised models may exhibit unpredictable behavior in critical scenarios. Such risks are heightened by the prevalent practice of collecting massive, internet-sourced datasets for pre-training multimodal models, as these datasets may harbor backdoors. Various techniques have been proposed to mitigate the effects of backdooring in these models such as CleanCLIP which is the current state-of-the-art approach. In this work, we demonstrate that the efficacy of CleanCLIP in mitigating backdoors is highly dependent on the particular objective used during model pre-training. We observe that stronger pre-training objectives correlate with harder to remove backdoors behaviors. We show this by training multimodal models on two large datasets consisting of 3 million (CC3M) and 6 million (CC6M) datapoints, under various pre-training objectives, followed by poison removal using CleanCLIP. We find that CleanCLIP is ineffective when stronger pre-training objectives are used, even with extensive hyperparameter tuning. Our findings underscore critical considerations for ML practitioners who pre-train models using large-scale web-curated data and are concerned about potential backdoor threats. Notably, our results suggest that simpler pre-training objectives are more amenable to effective backdoor removal. This insight is pivotal for practitioners seeking to balance the trade-offs between using stronger pre-training objectives and security against backdoor attacks.

FreePIH: Training-Free Painterly Image Harmonization with Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.14926
  • repo_url: None
  • paper_authors: Ruibin Li, Jingcai Guo, Song Guo, Qihua Zhou, Jie Zhang
  • for: An efficient, training-free painterly image harmonization (PIH) method, FreePIH, that uses only a pre-trained diffusion model to achieve state-of-the-art harmonization results.
  • methods: Tames the denoising process as a plug-in module for foreground style transfer: the last few denoising steps carry the stylistic information, so the latent features of the foreground and background are augmented with Gaussians for direct denoising-based harmonization, multi-scale features enforce content consistency and foreground stability in the latent space, and text prompts attend to the latent features to improve generation quality (a schematic latent-blending sketch follows the abstract below).
  • results: Quantitative and qualitative evaluations on the COCO and LAION 5B datasets show that the method surpasses representative baselines by large margins.
    Abstract This paper provides an efficient training-free painterly image harmonization (PIH) method, dubbed FreePIH, that leverages only a pre-trained diffusion model to achieve state-of-the-art harmonization results. Unlike existing methods that require either training auxiliary networks or fine-tuning a large pre-trained backbone, or both, to harmonize a foreground object with a painterly-style background image, our FreePIH tames the denoising process as a plug-in module for foreground image style transfer. Specifically, we find that the very last few steps of the denoising (i.e., generation) process strongly correspond to the stylistic information of images, and based on this, we propose to augment the latent features of both the foreground and background images with Gaussians for a direct denoising-based harmonization. To guarantee the fidelity of the harmonized image, we make use of multi-scale features to enforce the consistency of the content and stability of the foreground objects in the latent space, and meanwhile, aligning both fore-/back-grounds with the same style. Moreover, to accommodate the generation with more structural and textural details, we further integrate text prompts to attend to the latent features, hence improving the generation quality. Quantitative and qualitative evaluations on COCO and LAION 5B datasets demonstrate that our method can surpass representative baselines by large margins.
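A framework-agnostic, heavily hedged sketch of the core latent manipulation: in the last few denoising steps, blend the foreground latents toward the background's style latents and perturb them with Gaussian noise inside the foreground mask. The blend weights, step threshold, and noise scale are illustrative assumptions, not the paper's implementation.

```python
# Hedged, framework-agnostic sketch of late-step latent restyling.
import torch

def restyle_foreground_latents(fg_latent, bg_latent, mask, step, total_steps,
                               last_k=5, blend=0.5, noise_scale=0.1):
    """fg_latent, bg_latent: [C, H, W] latents; mask: [1, H, W] foreground mask (0/1)."""
    if step < total_steps - last_k:
        return fg_latent                              # leave earlier steps untouched
    noise = noise_scale * torch.randn_like(fg_latent)
    styled = (1 - blend) * fg_latent + blend * bg_latent + noise
    return torch.where(mask.bool(), styled, fg_latent)
```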

LANS: A Layout-Aware Neural Solver for Plane Geometry Problem

  • paper_url: http://arxiv.org/abs/2311.16476
  • repo_url: None
  • paper_authors: Ming-Liang Zhang, Zhong-Zhi Li, Fei Yin, Cheng-Lin Liu
  • for: plane geometry problem solving (GPS), a mathematical reasoning task requiring multi-modal understanding, fusion, and reasoning.
  • methods: proposes a layout-aware neural solver (LANS) with two new modules: multimodal layout-aware pre-trained language model (MLA-PLM) and layout-aware fusion attention (LA-FA).
  • results: extensive experiments on Geometry3K and PGPS9K datasets show the effectiveness of the layout-aware modules and superior problem solving performance of LANS compared to existing symbolic solvers and neural solvers.
    Abstract Geometry problem solving (GPS) is a challenging mathematical reasoning task requiring multi-modal understanding, fusion and reasoning. Existing neural solvers take GPS as a vision-language task but be short in the representation of geometry diagrams which carry rich and complex layout information. In this paper, we propose a layout-aware neural solver named LANS, integrated with two new modules: multimodal layout-aware pre-trained language model (MLA-PLM) and layout-aware fusion attention (LA-FA). MLA-PLM adopts structural and semantic pre-training (SSP) to implement global relationship modeling, and point matching pre-training (PMP) to achieve alignment between visual points and textual points. LA-FA employs a layout-aware attention mask to realize point-guided cross-modal fusion for further boosting layout awareness of LANS. Extensive experiments on datasets Geometry3K and PGPS9K validate the effectiveness of the layout-aware modules and superior problem solving performance of our LANS solver, over existing symbolic solvers and neural solvers. The code will make public available soon.

Resfusion: Prior Residual Noise embedded Denoising Diffusion Probabilistic Models

  • paper_url: http://arxiv.org/abs/2311.14900
  • repo_url: None
  • paper_authors: Shi Zhenning, Dong Changsheng, Pan Bin, Xie Xueshuo, He Along, Qu Qiaoying, Li Tao
  • for: Extending denoising diffusion probabilistic models to image segmentation by generating segmentation masks conditioned on the input image.
  • methods: Proposes Resfusion, a novel resnoise-diffusion process that gradually generates segmentation masks (or any type of target image) while seamlessly integrating state-of-the-art end-to-end models with denoising diffusion models; a smooth equivalence transformation in the resnoise-diffusion process determines the optimal acceleration step.
  • results: Experiments show that Resfusion combines the capabilities of existing end-to-end models and denoising diffusion models, further enhancing performance, and that it readily generalizes to general image generation tasks with strong competitiveness.
    Abstract Recently, Denoising Diffusion Probabilistic Models have been widely used in image segmentation, by generating segmentation masks conditioned on the input image. However, previous works can not seamlessly integrate existing end-to-end models with denoising diffusion models. Existing research can only select acceleration steps based on experience rather than calculating them specifically. Moreover, most methods are limited to small models and small-scale datasets, unable to generalize to general datasets and a wider range of tasks. Therefore, we propose Resfusion with a novel resnoise-diffusion process, which gradually generates segmentation masks or any type of target image, seamlessly integrating state-of-the-art end-to-end models and denoising diffusion models. Resfusion bridges the discrepancy between the likelihood output and the ground truth output through a Markov process. Through the novel smooth equivalence transformation in resnoise-diffusion process, we determine the optimal acceleration step. Experimental results demonstrate that Resfusion combines the capabilities of existing end-to-end models and denoising diffusion models, further enhancing performance and achieving outstanding results. Moreover, Resfusion is not limited to segmentation tasks, it can easily generalize to any general tasks of image generation and exhibit strong competitiveness.

Aiming to Minimize Alcohol-Impaired Road Fatalities: Utilizing Fairness-Aware and Domain Knowledge-Infused Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2311.16180
  • repo_url: None
  • paper_authors: Tejas Venkateswaran, Sheikh Rabiul Islam, Md Golam Moula Mehedi Hasan, Mohiuddin Ahmed
  • for: Reducing alcohol-impaired driving fatalities in the United States and enabling a more equitable and efficient allocation of policing resources.
  • methods: Introduces a fairness-aware, domain-knowledge-infused AI predictor that analyzes DUI-related fatalities across geographic locations and demographic groups such as age, race, and income.
  • results: The analysis yields insights into the interplay between demographic groups that can be used to allocate policing resources more equitably and efficiently, with the potential to reduce DUI-related fatalities and improve road safety.
    Abstract Approximately 30% of all traffic fatalities in the United States are attributed to alcohol-impaired driving. This means that, despite stringent laws against this offense in every state, the frequency of drunk driving accidents is alarming, resulting in approximately one person being killed every 45 minutes. The process of charging individuals with Driving Under the Influence (DUI) is intricate and can sometimes be subjective, involving multiple stages such as observing the vehicle in motion, interacting with the driver, and conducting Standardized Field Sobriety Tests (SFSTs). Biases have been observed through racial profiling, leading to some groups and geographical areas facing fewer DUI tests, resulting in many actual DUI incidents going undetected, ultimately leading to a higher number of fatalities. To tackle this issue, our research introduces an Artificial Intelligence-based predictor that is both fairness-aware and incorporates domain knowledge to analyze DUI-related fatalities in different geographic locations. Through this model, we gain intriguing insights into the interplay between various demographic groups, including age, race, and income. By utilizing the provided information to allocate policing resources in a more equitable and efficient manner, there is potential to reduce DUI-related fatalities and have a significant impact on road safety.