cs.AI - 2023-08-19

Efficient Representation Learning for Healthcare with Cross-Architectural Self-Supervision

  • paper_url: http://arxiv.org/abs/2308.10064
  • repo_url: https://github.com/pranavsinghps1/CASS
  • paper_authors: Pranav Singh, Jacopo Cirrone
  • for: In healthcare and biomedical applications, extreme computational requirements make representation learning hard to adopt in real clinical practice. Representation learning can improve the performance of deep learning architectures, but existing self-supervised methods degrade when smaller batch sizes or shorter pretraining schedules are used. Cross Architectural Self-Supervision (CASS) is proposed to address this challenge.
  • methods: A new siamese self-supervised learning method, CASS, which leverages a Transformer and a Convolutional Neural Network (CNN) for efficient learning.
  • results: Experiments show that CASS-trained CNNs and Transformers outperform existing self-supervised methods on four diverse healthcare datasets. With only 1% labeled data for finetuning, CASS achieves a 3.8% average improvement; with 10% labeled data, 5.9%; and with 100% labeled data, a notable 10.13% gain. CASS also reduces pretraining time by 69% compared with existing methods, making it more practical for clinical use.
    Abstract In healthcare and biomedical applications, extreme computational requirements pose a significant barrier to adopting representation learning. Representation learning can enhance the performance of deep learning architectures by learning useful priors from limited medical data. However, state-of-the-art self-supervised techniques suffer from reduced performance when using smaller batch sizes or shorter pretraining epochs, which are more practical in clinical settings. We present Cross Architectural - Self Supervision (CASS) in response to this challenge. This novel siamese self-supervised learning approach synergistically leverages Transformer and Convolutional Neural Networks (CNN) for efficient learning. Our empirical evaluation demonstrates that CASS-trained CNNs and Transformers outperform existing self-supervised learning methods across four diverse healthcare datasets. With only 1% labeled data for finetuning, CASS achieves a 3.8% average improvement; with 10% labeled data, it gains 5.9%; and with 100% labeled data, it reaches a remarkable 10.13% enhancement. Notably, CASS reduces pretraining time by 69% compared to state-of-the-art methods, making it more amenable to clinical implementation. We also demonstrate that CASS is considerably more robust to variations in batch size and pretraining epochs, making it a suitable candidate for machine learning in healthcare applications.
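
The core idea is a siamese setup in which a CNN and a Transformer embed the same unlabeled image and are trained to agree. Below is a minimal, hedged sketch of such a cross-architectural objective; the backbone choices (resnet18, vit_b_16), the 256-d projection, and the negative-cosine loss are illustrative assumptions, not the paper's exact CASS architecture or loss.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, vit_b_16

# Hypothetical sketch of cross-architectural self-supervision: a CNN and a
# Transformer embed the same image and are trained to agree. The real CASS
# projection heads and loss may differ; this only illustrates the structure.
cnn = resnet18(num_classes=256)          # CNN branch -> 256-d embedding (stand-in)
vit = vit_b_16(num_classes=256)          # Transformer branch -> 256-d embedding (stand-in)

opt = torch.optim.AdamW(list(cnn.parameters()) + list(vit.parameters()), lr=1e-4)

def cross_arch_loss(z_cnn, z_vit):
    # negative cosine similarity between the two architectures' embeddings
    return -F.cosine_similarity(F.normalize(z_cnn, dim=-1),
                                F.normalize(z_vit, dim=-1), dim=-1).mean()

images = torch.randn(8, 3, 224, 224)     # stand-in for an unlabeled medical image batch
loss = cross_arch_loss(cnn(images), vit(images))
opt.zero_grad(); loss.backward(); opt.step()
```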

Robust Fraud Detection via Supervised Contrastive Learning

  • paper_url: http://arxiv.org/abs/2308.10055
  • repo_url: None
  • paper_authors: Vinay M. S., Shuhan Yuan, Xintao Wu
  • for: Open-set fraud detection in the setting where only a few malicious user sessions with limited diversity are available.
  • methods: ConRo, a robust framework based on an effective data augmentation strategy and supervised contrastive learning.
  • results: ConRo demonstrates noticeable performance improvements over state-of-the-art baselines.
    Abstract Deep learning models have recently become popular for detecting malicious user activity sessions in computing platforms. In many real-world scenarios, only a few labeled malicious and a large amount of normal sessions are available. These few labeled malicious sessions usually do not cover the entire diversity of all possible malicious sessions. In many scenarios, possible malicious sessions can be highly diverse. As a consequence, learned session representations of deep learning models can become ineffective in achieving a good generalization performance for unseen malicious sessions. To tackle this open-set fraud detection challenge, we propose a robust supervised contrastive learning based framework called ConRo, which specifically operates in the scenario where only a few malicious sessions having limited diversity is available. ConRo applies an effective data augmentation strategy to generate diverse potential malicious sessions. By employing these generated and available training set sessions, ConRo derives separable representations w.r.t open-set fraud detection task by leveraging supervised contrastive learning. We empirically evaluate our ConRo framework and other state-of-the-art baselines on benchmark datasets. Our ConRo framework demonstrates noticeable performance improvement over state-of-the-art baselines.
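
ConRo's representation learning step relies on supervised contrastive learning over real and generated sessions. The sketch below implements the standard supervised contrastive (SupCon) loss as one plausible instance of that step; ConRo's augmentation strategy for generating diverse malicious sessions, and any deviations from the standard loss, are not shown.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Standard SupCon loss (Khosla et al., 2020), one view per sample.
    features: (N, D) session embeddings; labels: (N,), e.g. 0 = normal, 1 = fraud.
    ConRo's generation of synthetic malicious sessions happens upstream of this
    loss and is not shown."""
    z = F.normalize(features, dim=1)
    n = z.size(0)
    sim = z @ z.T / temperature                                  # (N, N) similarity logits
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))              # exclude the anchor itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)              # avoid -inf * 0 on the diagonal
    pos_per_anchor = pos_mask.sum(1)
    loss = -(log_prob * pos_mask.float()).sum(1) / pos_per_anchor.clamp(min=1)
    return loss[pos_per_anchor > 0].mean()                       # anchors with >= 1 positive

# toy usage: 6 session embeddings
feats = torch.randn(6, 128, requires_grad=True)
labels = torch.tensor([0, 0, 0, 0, 1, 1])
supervised_contrastive_loss(feats, labels).backward()
```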

Large Language Models as Zero-Shot Conversational Recommenders

  • paper_url: http://arxiv.org/abs/2308.10053
  • repo_url: https://github.com/aaronheee/llms-as-zero-shot-conversational-recsys
  • paper_authors: Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bodhisattwa Prasad Majumder, Nathan Kallus, Julian McAuley
  • for: The main goal is to study large language models on conversational recommendation tasks and to analyze their performance in a zero-shot setting relative to existing models.
  • methods: Empirical studies with representative large language models in a zero-shot setting.
  • results: Without fine-tuning, large language models can outperform existing fine-tuned conversational recommendation models. Several probing tasks are also proposed to investigate the mechanisms and limitations behind this performance.
    Abstract In this paper, we present empirical studies on conversational recommendation tasks using representative large language models in a zero-shot setting with three primary contributions. (1) Data: To gain insights into model behavior in "in-the-wild" conversational recommendation scenarios, we construct a new dataset of recommendation-related conversations by scraping a popular discussion website. This is the largest public real-world conversational recommendation dataset to date. (2) Evaluation: On the new dataset and two existing conversational recommendation datasets, we observe that even without fine-tuning, large language models can outperform existing fine-tuned conversational recommendation models. (3) Analysis: We propose various probing tasks to investigate the mechanisms behind the remarkable performance of large language models in conversational recommendation. We analyze both the large language models' behaviors and the characteristics of the datasets, providing a holistic understanding of the models' effectiveness, limitations and suggesting directions for the design of future conversational recommenders

The Snowflake Hypothesis: Training Deep GNN with One Node One Receptive field

  • paper_url: http://arxiv.org/abs/2308.10051
  • repo_url: None
  • paper_authors: Kun Wang, Guohao Li, Shilong Wang, Guibin Zhang, Kai Wang, Yang You, Xiaojiang Peng, Yuxuan Liang, Yang Wang
  • for: This work studies deep Graph Neural Networks (GNNs) for graph representation learning, in particular the over-fitting and over-smoothing problems that emerge as GNNs grow deeper.
  • methods: A systematic study covering different training schemes, various shallow and deep GNN backbones, different numbers of layers (8, 16, 32, 64), and multiple benchmark graphs.
  • results: The results show that the proposed Snowflake Hypothesis can act as a universal operator that helps deep GNNs across tasks and guides the selection of the optimal network depth in an explainable and generalizable way.
    Abstract Despite Graph Neural Networks demonstrating considerable promise in graph representation learning tasks, GNNs predominantly face significant issues with over-fitting and over-smoothing as they go deeper as models of computer vision realm. In this work, we conduct a systematic study of deeper GNN research trajectories. Our findings indicate that the current success of deep GNNs primarily stems from (I) the adoption of innovations from CNNs, such as residual/skip connections, or (II) the tailor-made aggregation algorithms like DropEdge. However, these algorithms often lack intrinsic interpretability and indiscriminately treat all nodes within a given layer in a similar manner, thereby failing to capture the nuanced differences among various nodes. To this end, we introduce the Snowflake Hypothesis -- a novel paradigm underpinning the concept of ``one node, one receptive field''. The hypothesis draws inspiration from the unique and individualistic patterns of each snowflake, proposing a corresponding uniqueness in the receptive fields of nodes in the GNNs. We employ the simplest gradient and node-level cosine distance as guiding principles to regulate the aggregation depth for each node, and conduct comprehensive experiments including: (1) different training schemes; (2) various shallow and deep GNN backbones, and (3) various numbers of layers (8, 16, 32, 64) on multiple benchmarks (six graphs including dense graphs with millions of nodes); (4) compare with different aggregation strategies. The observational results demonstrate that our hypothesis can serve as a universal operator for a range of tasks, and it displays tremendous potential on deep GNNs. It can be applied to various GNN frameworks, enhancing its effectiveness when operating in-depth, and guiding the selection of the optimal network depth in an explainable and generalizable way.
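
The hypothesis boils down to giving each node its own aggregation depth, gated by a node-level cosine distance between consecutive layers. The toy sketch below shows that per-node early stopping on a dense adjacency; the threshold `eps`, the plain mean aggregation, and the freezing rule are illustrative assumptions rather than the paper's exact criterion or backbone.

```python
import torch
import torch.nn.functional as F

def snowflake_propagate(x, adj, num_layers=32, eps=0.02):
    """Hedged sketch of 'one node, one receptive field': each node stops
    aggregating once its embedding barely changes between layers, measured by a
    node-level cosine distance. adj is a row-normalized (N, N) adjacency with
    self-loops; the paper's gating criterion and GNN backbone are more elaborate."""
    active = torch.ones(x.size(0), dtype=torch.bool, device=x.device)  # nodes still deepening
    for _ in range(num_layers):
        new_x = adj @ x                                     # one mean-aggregation hop
        cos_dist = 1 - F.cosine_similarity(new_x, x, dim=1)
        active = active & (cos_dist > eps)                  # freeze converged nodes for good
        x = torch.where(active.unsqueeze(1), new_x, x)      # only active nodes keep updating
        if not active.any():
            break
    return x

# toy usage: a 5-node ring graph
N, D = 5, 16
edges = torch.tensor([[0, 1], [1, 2], [2, 3], [3, 4], [4, 0]])
adj = torch.eye(N)
adj[edges[:, 0], edges[:, 1]] = 1.0
adj[edges[:, 1], edges[:, 0]] = 1.0
adj = adj / adj.sum(1, keepdim=True)                        # row-normalize (mean aggregation)
out = snowflake_propagate(torch.randn(N, D), adj)
```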

Towards Probabilistic Causal Discovery, Inference & Explanations for Autonomous Drones in Mine Surveying Tasks

  • paper_url: http://arxiv.org/abs/2308.10047
  • repo_url: None
  • paper_authors: Ricardo Cannizzaro, Rhys Howard, Paulina Lewinska, Lars Kunze
  • for: This paper aims to give autonomous robots an understanding of the data-generation process so they can make decisions and explain outcomes in real-world environments.
  • methods: A probabilistic causal framework consisting of causally-informed POMDP planning, online SCM adaptation, and post-hoc counterfactual explanations.
  • results: The framework supports decision-making and explanation for autonomous robots in environments with confounders, non-stationarity, and limited ability to build complete causal models ahead of time.
    Abstract Causal modelling offers great potential to provide autonomous agents the ability to understand the data-generation process that governs their interactions with the world. Such models capture formal knowledge as well as probabilistic representations of noise and uncertainty typically encountered by autonomous robots in real-world environments. Thus, causality can aid autonomous agents in making decisions and explaining outcomes, but deploying causality in such a manner introduces new challenges. Here we identify challenges relating to causality in the context of a drone system operating in a salt mine. Such environments are challenging for autonomous agents because of the presence of confounders, non-stationarity, and a difficulty in building complete causal models ahead of time. To address these issues, we propose a probabilistic causal framework consisting of: causally-informed POMDP planning, online SCM adaptation, and post-hoc counterfactual explanations. Further, we outline planned experimentation to evaluate the framework integrated with a drone system in simulated mine environments and on a real-world mine dataset.

Optimizing Multi-Class Text Classification: A Diverse Stacking Ensemble Framework Utilizing Transformers

  • paper_url: http://arxiv.org/abs/2308.11519
  • repo_url: None
  • paper_authors: Anusuya Krishnan
  • for: This work aims to improve the accuracy and robustness of customer review classification so that businesses can extract useful feedback, improve customer satisfaction, and drive continuous improvement.
  • methods: A novel stacking-ensemble, transformer-based multi-text classification method that combines several individual transformer models (BERT, ELECTRA, and DistilBERT) as base-level classifiers with a RoBERTa-based meta-level classifier to generate the optimal predictive model.
  • results: Experiments on a real-world customer review dataset show that the stacking-ensemble transformer approach improves accuracy and robustness over traditional single-classifier models.
    Abstract Customer reviews play a crucial role in assessing customer satisfaction, gathering feedback, and driving improvements for businesses. Analyzing these reviews provides valuable insights into customer sentiments, including compliments, comments, and suggestions. Text classification techniques enable businesses to categorize customer reviews into distinct categories, facilitating a better understanding of customer feedback. However, challenges such as overfitting and bias limit the effectiveness of a single classifier in ensuring optimal prediction. This study proposes a novel approach to address these challenges by introducing a stacking ensemble-based multi-text classification method that leverages transformer models. By combining multiple single transformers, including BERT, ELECTRA, and DistilBERT, as base-level classifiers, and a meta-level classifier based on RoBERTa, an optimal predictive model is generated. The proposed stacking ensemble-based multi-text classification method aims to enhance the accuracy and robustness of customer review analysis. Experimental evaluations conducted on a real-world customer review dataset demonstrate the effectiveness and superiority of the proposed approach over traditional single classifier models. The stacking ensemble-based multi-text classification method using transformers proves to be a promising solution for businesses seeking to extract valuable insights from customer reviews and make data-driven decisions to enhance customer satisfaction and drive continuous improvement.
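
The stacking mechanics are the interesting part: out-of-fold predictions from the base classifiers become the input features of the meta-level classifier. The toy sketch below shows only those mechanics; cheap TF-IDF + logistic-regression models stand in for the fine-tuned BERT/ELECTRA/DistilBERT base learners and the RoBERTa-based meta learner so the example runs without GPUs, and the tiny corpus and labels are invented.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins: each logistic-regression model plays the role of one fine-tuned
# transformer base classifier; a second logistic regression plays the meta learner.
texts = ["great product", "terrible support", "fast delivery", "broken on arrival",
         "love it", "never buying again", "works as described", "refund please"]
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])                 # toy 2-class labels
X = TfidfVectorizer().fit_transform(texts)

base_models = [LogisticRegression(max_iter=1000) for _ in range(3)]   # "BERT/ELECTRA/DistilBERT"
kf = KFold(n_splits=4, shuffle=True, random_state=0)

# level-0: out-of-fold class probabilities become the meta-level features
meta_features = np.zeros((len(y), len(base_models)))
for j, model in enumerate(base_models):
    for train_idx, val_idx in kf.split(texts):
        model.fit(X[train_idx], y[train_idx])
        meta_features[val_idx, j] = model.predict_proba(X[val_idx])[:, 1]

# level-1: meta-classifier trained on the stacked predictions ("RoBERTa" in the paper)
meta_clf = LogisticRegression(max_iter=1000).fit(meta_features, y)
print(meta_clf.predict(meta_features))                  # toy in-sample check only
```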

Causal Intersectionality and Dual Form of Gradient Descent for Multimodal Analysis: a Case Study on Hateful Memes

  • paper_url: http://arxiv.org/abs/2308.11585
  • repo_url: None
  • paper_authors: Yosuke Miyanishi, Minh Le Nguyen
  • for: This work explores how explainable AI (XAI) and formal definitions of semantics can be combined to understand the internal mechanisms of machine learning models and their causal effects.
  • methods: Gradient-based attribution and causal analysis are synergized to probe the models' internal mechanisms.
  • results: Using intersectionality theory, hateful memes detection can be formulated as an averaged treatment effect (ATE), and modality-wise summaries of gradient-based attention attribution delineate the distinct behaviors of three Transformer-based models with respect to the ATE. The latest LLM, LLaMA2, is further shown to disentangle the intersectional nature of memes detection in an in-context learning setting.
    Abstract In the wake of the explosive growth of machine learning (ML) usage, particularly within the context of emerging Large Language Models (LLMs), comprehending the semantic significance rooted in their internal workings is crucial. While causal analyses focus on defining semantics and its quantification, the gradient-based approach is central to explainable AI (XAI), tackling the interpretation of the black box. By synergizing these approaches, the exploration of how a model's internal mechanisms illuminate its causal effect has become integral for evidence-based decision-making. A parallel line of research has revealed that intersectionality - the combinatory impact of multiple demographics of an individual - can be structured in the form of an Averaged Treatment Effect (ATE). Initially, this study illustrates that the hateful memes detection problem can be formulated as an ATE, assisted by the principles of intersectionality, and that a modality-wise summarization of gradient-based attention attribution scores can delineate the distinct behaviors of three Transformerbased models concerning ATE. Subsequently, we show that the latest LLM LLaMA2 has the ability to disentangle the intersectional nature of memes detection in an in-context learning setting, with their mechanistic properties elucidated via meta-gradient, a secondary form of gradient. In conclusion, this research contributes to the ongoing dialogue surrounding XAI and the multifaceted nature of ML models.
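
Since the abstract hinges on formulating detection as an averaged treatment effect, a minimal worked example of that quantity helps: with a binary treatment T and outcome Y, the unconfounded ATE is simply E[Y | T=1] - E[Y | T=0]. The data and the reading of "treatment" below are synthetic assumptions; the paper's estimator and the meta-gradient analysis are not reproduced.

```python
import numpy as np

# Synthetic illustration of an averaged treatment effect (ATE), the quantity the
# abstract uses to formalize intersectional hateful-meme detection. The "treatment"
# (e.g., presence of a particular attribute combination in a meme) and the outcome
# are simulated with a known effect of 0.15.
rng = np.random.default_rng(0)
n = 10_000
t = rng.integers(0, 2, n)                        # treatment indicator
y = 0.2 + 0.15 * t + rng.normal(0, 0.05, n)      # outcome, true effect = 0.15

ate = y[t == 1].mean() - y[t == 0].mean()
print(f"estimated ATE = {ate:.3f}")              # should be close to 0.15
```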

ClothesNet: An Information-Rich 3D Garment Model Repository with Simulated Clothes Environment

  • paper_url: http://arxiv.org/abs/2308.09987
  • repo_url: None
  • paper_authors: Bingyang Zhou, Haoyu Zhou, Tianhai Liang, Qiaojun Yu, Siheng Zhao, Yuwei Zeng, Jun Lv, Siyuan Luo, Qiancai Wang, Xinyuan Yu, Haonan Chen, Cewu Lu, Lin Shao
  • for: This paper presents a large-scale 3D garment dataset annotated with clothes features, boundary lines, and keypoints.
  • methods: On top of the dataset, benchmark tasks are established for clothes classification, boundary line segmentation, and keypoint detection, together with simulated clothes environments for robotic interaction tasks, supporting computer vision and robot interaction research.
  • results: Real-world experiments demonstrate that ClothesNet facilitates clothes perception and robot interaction tasks and provides a high-quality dataset and task suite.
    Abstract We present ClothesNet: a large-scale dataset of 3D clothes objects with information-rich annotations. Our dataset consists of around 4400 models covering 11 categories annotated with clothes features, boundary lines, and keypoints. ClothesNet can be used to facilitate a variety of computer vision and robot interaction tasks. Using our dataset, we establish benchmark tasks for clothes perception, including classification, boundary line segmentation, and keypoint detection, and develop simulated clothes environments for robotic interaction tasks, including rearranging, folding, hanging, and dressing. We also demonstrate the efficacy of our ClothesNet in real-world experiments. Supplemental materials and dataset are available on our project webpage.

Distributionally Robust Cross Subject EEG Decoding

  • paper_url: http://arxiv.org/abs/2308.11651
  • repo_url: None
  • paper_authors: Tiehang Duan, Zhenyi Wang, Gianfranco Doretto, Fang Li, Cui Tao, Donald Adjeroh
  • for: Improving the performance of EEG decoding tasks and the robustness of models to corrupted EEG data.
  • methods: Distributionally robust optimization with Wasserstein gradient flow is used to evolve the training data and improve the feature learning of EEG decoders.
  • results: The model outperforms competitive baselines on various types of corrupted EEG signals, indicating that the proposed approach improves the robustness of EEG decoding.
    Abstract Recently, deep learning has shown to be effective for Electroencephalography (EEG) decoding tasks. Yet, its performance can be negatively influenced by two key factors: 1) the high variance and different types of corruption that are inherent in the signal, 2) the EEG datasets are usually relatively small given the acquisition cost, annotation cost and amount of effort needed. Data augmentation approaches for alleviation of this problem have been empirically studied, with augmentation operations on spatial domain, time domain or frequency domain handcrafted based on expertise of domain knowledge. In this work, we propose a principled approach to perform dynamic evolution on the data for improvement of decoding robustness. The approach is based on distributionally robust optimization and achieves robustness by optimizing on a family of evolved data distributions instead of the single training data distribution. We derived a general data evolution framework based on Wasserstein gradient flow (WGF) and provides two different forms of evolution within the framework. Intuitively, the evolution process helps the EEG decoder to learn more robust and diverse features. It is worth mentioning that the proposed approach can be readily integrated with other data augmentation approaches for further improvements. We performed extensive experiments on the proposed approach and tested its performance on different types of corrupted EEG signals. The model significantly outperforms competitive baselines on challenging decoding scenarios.
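
A rough sketch of the train-on-evolved-data idea: each mini-batch is pushed a few gradient steps toward higher loss (with a proximity penalty), and the decoder is updated on the evolved batch. This is a simple Wasserstein-robust-style instance of distributionally robust training, not the paper's Wasserstein-gradient-flow formulation; the toy EEG shapes, step sizes, and penalty weight are assumptions.

```python
import torch
import torch.nn as nn

# Generic distributionally-robust-style loop: evolve the batch toward higher loss,
# then train on the evolved batch. The paper's WGF-based evolution is more principled.
model = nn.Sequential(nn.Flatten(), nn.Linear(22 * 256, 64), nn.ReLU(), nn.Linear(64, 4))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

def evolve_batch(x, y, steps=3, step_size=0.05, gamma=1.0):
    x_adv = x.clone().requires_grad_(True)
    for _ in range(steps):
        # ascend the loss while penalizing distance from the original samples
        loss = ce(model(x_adv), y) - gamma * ((x_adv - x) ** 2).mean()
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = (x_adv + step_size * grad).detach().requires_grad_(True)
    return x_adv.detach()

x = torch.randn(16, 22, 256)            # toy EEG batch: 22 channels x 256 time samples
y = torch.randint(0, 4, (16,))          # 4 decoding classes
x_evolved = evolve_batch(x, y)
loss = ce(model(x_evolved), y)          # train the decoder on the evolved distribution
opt.zero_grad(); loss.backward(); opt.step()
```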

Artificial Intelligence across Europe: A Study on Awareness, Attitude and Trust

  • paper_url: http://arxiv.org/abs/2308.09979
  • repo_url: None
  • paper_authors: Teresa Scantamburlo, Atia Cortés, Francesca Foffano, Cristian Barrué, Veronica Distefano, Long Pham, Alessandro Fabris
  • for: The paper aims to gain a better understanding of European citizens' views and perceptions of Artificial Intelligence (AI) in order to inform AI governance and policy-making.
  • methods: The study uses a new questionnaire (PAICE) structured around three dimensions: awareness, attitude, and trust, and collects data from a sample of 4,006 European citizens from eight countries.
  • results: While awareness of AI is low, attitudes towards AI are generally positive, but there are implicit contradictions and trends that may interfere with the development of an inclusive AI ecosystem. The study highlights the importance of legal and ethical standards, educational entities, and AI literacy in supporting a trustworthy AI ecosystem.
    Abstract This paper presents the results of an extensive study investigating the opinions on Artificial Intelligence (AI) of a sample of 4,006 European citizens from eight distinct countries (France, Germany, Italy, Netherlands, Poland, Romania, Spain, and Sweden). The aim of the study is to gain a better understanding of people's views and perceptions within the European context, which is already marked by important policy actions and regulatory processes. To survey the perceptions of the citizens of Europe we design and validate a new questionnaire (PAICE) structured around three dimensions: people's awareness, attitude, and trust. We observe that while awareness is characterized by a low level of self-assessed competency, the attitude toward AI is very positive for more than half of the population. Reflecting upon the collected results, we highlight implicit contradictions and identify trends that may interfere with the creation of an ecosystem of trust and the development of inclusive AI policies. The introduction of rules that ensure legal and ethical standards, along with the activity of high-level educational entities, and the promotion of AI literacy are identified as key factors in supporting a trustworthy AI ecosystem. We make some recommendations for AI governance focused on the European context and conclude with suggestions for future work.

Explicit Time Embedding Based Cascade Attention Network for Information Popularity Prediction

  • paper_url: http://arxiv.org/abs/2308.09976
  • repo_url: None
  • paper_authors: Xigang Sun, Jingya Zhou, Ling Liu, Wenqi Wei
  • for: Predicting the popularity of information cascades, which requires capturing both their temporal attributes and the way they propagate across social networks.
  • methods: An explicit Time embedding based Cascade Attention Network (TCAN) that integrates temporal attributes (e.g., periodicity, linearity, and non-linear scaling) into node features via a general time embedding approach, and uses a cascade graph attention encoder (CGAT) and a cascade sequence attention encoder (CSAT) to fully learn the representations of cascade graphs and cascade sequences.
  • results: Validated on two real-world datasets (Weibo and APS) with tens of thousands of cascades, TCAN achieves mean logarithm squared errors of 2.007 and 1.201 with running times of 1.76 hours and 0.15 hours, respectively, and outperforms representative baselines by 10.4%, 3.8%, and 10.4% on average in terms of MSLE, MAE, and R-squared.
    Abstract Predicting information cascade popularity is a fundamental problem in social networks. Capturing temporal attributes and cascade role information (e.g., cascade graphs and cascade sequences) is necessary for understanding the information cascade. Current methods rarely focus on unifying this information for popularity predictions, which prevents them from effectively modeling the full properties of cascades to achieve satisfactory prediction performances. In this paper, we propose an explicit Time embedding based Cascade Attention Network (TCAN) as a novel popularity prediction architecture for large-scale information networks. TCAN integrates temporal attributes (i.e., periodicity, linearity, and non-linear scaling) into node features via a general time embedding approach (TE), and then employs a cascade graph attention encoder (CGAT) and a cascade sequence attention encoder (CSAT) to fully learn the representation of cascade graphs and cascade sequences. We use two real-world datasets (i.e., Weibo and APS) with tens of thousands of cascade samples to validate our methods. Experimental results show that TCAN obtains mean logarithm squared errors of 2.007 and 1.201 and running times of 1.76 hours and 0.15 hours on both datasets, respectively. Furthermore, TCAN outperforms other representative baselines by 10.4%, 3.8%, and 10.4% in terms of MSLE, MAE, and R-squared on average while maintaining good interpretability.
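
The abstract names three temporal attributes that are folded into node features (periodicity, linearity, non-linear scaling). The sketch below is one concrete, hand-picked realization of such a time embedding; the actual TE module in the paper is more general, and the period and normalization choices here are assumptions.

```python
import numpy as np

def time_embedding(t, period=24.0):
    """Toy time embedding combining the three temporal attributes named in the
    abstract: periodicity (sin/cos over a daily period), linearity (normalized
    elapsed time), and non-linear scaling (log-compressed elapsed time). The
    paper's TE is learned/more general; these concrete features are illustrative."""
    t = np.asarray(t, dtype=float)
    return np.stack([
        np.sin(2 * np.pi * t / period),   # periodic component
        np.cos(2 * np.pi * t / period),
        t / t.max(),                      # linear component (normalized)
        np.log1p(t),                      # non-linear (sub-linear) scaling
    ], axis=-1)

# e.g., retweet times (hours since the root post) for nodes in one cascade;
# the result would be concatenated to each node's other features
node_times = [0.0, 0.5, 1.2, 6.0, 23.5, 48.0]
node_time_feats = time_embedding(node_times)      # shape (6, 4)
print(node_time_feats.shape)
```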

Disposable Transfer Learning for Selective Source Task Unlearning

  • paper_url: http://arxiv.org/abs/2308.09971
  • repo_url: None
  • paper_authors: Seunghee Koh, Hyounguk Shon, Janghyeon Lee, Hyeong Gwon Hong, Junmo Kim
  • for: This paper proposes a new transfer learning paradigm, disposable transfer learning (DTL), which disposes of the source task to avoid knowledge leakage without degrading performance on the target task.
  • methods: A novel loss called Gradient Collision loss (GC loss) is introduced; it selectively unlearns the source task by leading the gradient vectors of different mini-batches in different directions.
  • results: The paper shows that GC loss effectively mitigates knowledge leakage in transfer learning while the model retains high performance on the target task.
    Abstract Transfer learning is widely used for training deep neural networks (DNN) for building a powerful representation. Even after the pre-trained model is adapted for the target task, the representation performance of the feature extractor is retained to some extent. As the performance of the pre-trained model can be considered the private property of the owner, it is natural to seek the exclusive right of the generalized performance of the pre-trained weight. To address this issue, we suggest a new paradigm of transfer learning called disposable transfer learning (DTL), which disposes of only the source task without degrading the performance of the target task. To achieve knowledge disposal, we propose a novel loss named Gradient Collision loss (GC loss). GC loss selectively unlearns the source knowledge by leading the gradient vectors of mini-batches in different directions. Whether the model successfully unlearns the source task is measured by piggyback learning accuracy (PL accuracy). PL accuracy estimates the vulnerability of knowledge leakage by retraining the scrubbed model on a subset of source data or new downstream data. We demonstrate that GC loss is an effective approach to the DTL problem by showing that the model trained with GC loss retains the performance on the target task with a significantly reduced PL accuracy.
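
One plausible reading of "leading the gradient vectors of mini-batches in different directions" is to penalize the cosine similarity between source-task gradients computed on two mini-batches, which requires a second-order backward pass. The sketch below implements that reading on a toy model; the paper's exact GC-loss definition, its weighting against the target-task loss, and the training schedule may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of a "gradient collision"-style penalty: source-task gradients on two
# different mini-batches are pushed to point in different directions (low cosine
# similarity), so the shared weights stop encoding the source task.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
ce = nn.CrossEntropyLoss()

def flat_grad(loss):
    grads = torch.autograd.grad(loss, list(model.parameters()), create_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

# two source-task mini-batches (toy data)
x1, y1 = torch.randn(16, 32), torch.randint(0, 10, (16,))
x2, y2 = torch.randn(16, 32), torch.randint(0, 10, (16,))

g1 = flat_grad(ce(model(x1), y1))
g2 = flat_grad(ce(model(x2), y2))
gc_loss = F.cosine_similarity(g1, g2, dim=0)     # minimizing this makes the gradients "collide"

opt.zero_grad()
gc_loss.backward()                               # second-order backward through the gradients
opt.step()
```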

Tackling Vision Language Tasks Through Learning Inner Monologues

  • paper_url: http://arxiv.org/abs/2308.09970
  • repo_url: None
  • paper_authors: Diji Yang, Kezhen Chen, Jinmeng Rao, Xiaoyuan Guo, Yawen Zhang, Jie Yang, Yi Zhang
  • for: Solving complex vision-language problems such as image description and visual reasoning.
  • methods: A new approach, Inner Monologue Multi-Modal Optimization (IMMO), which simulates inner monologue processes to fuse language models and vision models through natural-language conversation.
  • results: Experiments show that IMMO enhances reasoning and explanation abilities and can be applied to a variety of different AI problems.
    Abstract Visual language tasks require AI models to comprehend and reason with both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) the hybrid integration between LLMs and Vision-Language Models (VLMs), where visual inputs are firstly converted into language descriptions by VLMs, serving as inputs for LLMs to generate final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected to LLMs' language space via further supervised fine-tuning. The first approach provides light training costs and interpretability but is hard to be optimized in an end-to-end fashion. The second approach presents decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. To tackle this dilemma, we propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), to solve complex vision language problems by simulating inner monologue processes, a cognitive process in which an individual engages in silent verbal communication with themselves. We enable LLMs and VLMs to interact through natural language conversation and propose to use a two-stage training process to learn how to do the inner monologue (self-asking questions and answering questions). IMMO is evaluated on two popular tasks and the results suggest by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to the more effective fusion of vision and language models. More importantly, instead of using predefined human-crafted monologues, IMMO learns this process within the deep learning models, promising wider applicability to many different AI problems beyond vision language tasks.

Anomaly-Aware Semantic Segmentation via Style-Aligned OoD Augmentation

  • paper_url: http://arxiv.org/abs/2308.09965
  • repo_url: None
  • paper_authors: Dan Zhang, Kaspar Sakmann, William Beluch, Robin Hutmacher, Yumeng Li
  • for: Equipping standard semantic segmentation models with awareness of unknown (anomalous) object classes.
  • methods: Synthetic out-of-distribution (OoD) data augmentation is used to enable recognition of unexpected objects.
  • results: Reducing the domain gap between the OoD data and driving scenes effectively mitigates the style difference that could act as a shortcut during training, and a simple fine-tuning loss enables a pre-trained semantic segmentation model to make predictions for unfamiliar objects.
    Abstract Within the context of autonomous driving, encountering unknown objects becomes inevitable during deployment in the open world. Therefore, it is crucial to equip standard semantic segmentation models with anomaly awareness. Many previous approaches have utilized synthetic out-of-distribution (OoD) data augmentation to tackle this problem. In this work, we advance the OoD synthesis process by reducing the domain gap between the OoD data and driving scenes, effectively mitigating the style difference that might otherwise act as an obvious shortcut during training. Additionally, we propose a simple fine-tuning loss that effectively induces a pre-trained semantic segmentation model to generate a ``none of the given classes" prediction, leveraging per-pixel OoD scores for anomaly segmentation. With minimal fine-tuning effort, our pipeline enables the use of pre-trained models for anomaly segmentation while maintaining the performance on the original task.
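
A minimal sketch of the per-pixel "none of the given classes" idea: derive an OoD score from the segmentation softmax and fine-tune it with a binary objective on pixels covered by pasted OoD objects. The 1x1-conv "segmenter", the max-softmax score, and the BCE objective are stand-ins and assumptions; the paper's style-aligned OoD pasting and its actual fine-tuning loss are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy anomaly-aware fine-tuning step: high OoD score on pasted-object pixels, low
# on normal pixels. A real setup would use a pre-trained segmentation network and
# style-aligned OoD pasting instead of random tensors.
num_classes = 19
seg_model = nn.Conv2d(3, num_classes, kernel_size=1)     # stand-in for a real segmenter

images = torch.randn(2, 3, 64, 64)                       # scenes with pasted OoD objects
ood_mask = (torch.rand(2, 64, 64) > 0.9).float()         # 1 where an OoD object was pasted

logits = seg_model(images)                                # (B, C, H, W)
probs = F.softmax(logits, dim=1)
ood_score = 1.0 - probs.max(dim=1).values                 # per-pixel OoD score in [0, 1]

# binary objective on the OoD score: push it up on pasted pixels, down elsewhere
anomaly_loss = F.binary_cross_entropy(ood_score.clamp(1e-6, 1 - 1e-6), ood_mask)
anomaly_loss.backward()
```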

Data-to-text Generation for Severely Under-Resourced Languages with GPT-3.5: A Bit of Help Needed from Google Translate

  • paper_url: http://arxiv.org/abs/2308.09957
  • repo_url: https://github.com/dcu-nlg/dcu-nlg-pbn
  • paper_authors: Michela Lorandi, Anya Belz
  • for: This paper investigates how large language models (LLMs), whose training data is dominated by English, perform on data-to-text generation for severely under-resourced languages (Irish, Maltese, Welsh, and Breton).
  • methods: During prompt engineering, a range of prompt types and formats was tested on GPT-3.5 and GPT-4 with a small sample of example input/output pairs, and the two most promising prompts were fully evaluated.
  • results: Few-shot prompting works better for direct generation into the under-resourced languages, but the difference disappears when pivoting via English. The few-shot + translation system variants submitted to the WebNLG 2023 shared task outperformed competitor systems by substantial margins on all languages and metrics, although the best results (for Welsh) remain well below the lowest-ranked English system at WebNLG'20.
    Abstract LLMs like GPT are great at tasks involving English which dominates in their training data. In this paper, we look at how they cope with tasks involving languages that are severely under-represented in their training data, in the context of data-to-text generation for Irish, Maltese, Welsh and Breton. During the prompt-engineering phase we tested a range of prompt types and formats on GPT-3.5 and~4 with a small sample of example input/output pairs. We then fully evaluated the two most promising prompts in two scenarios: (i) direct generation into the under-resourced language, and (ii) generation into English followed by translation into the under-resourced language. We find that few-shot prompting works better for direct generation into under-resourced languages, but that the difference disappears when pivoting via English. The few-shot + translation system variants were submitted to the WebNLG 2023 shared task where they outperformed competitor systems by substantial margins in all languages on all metrics. We conclude that good performance on under-resourced languages can be achieved out-of-the box with state-of-the-art LLMs. However, our best results (for Welsh) remain well below the lowest ranked English system at WebNLG'20.
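
The two system variants compared in the paper are easy to picture as a small pipeline: few-shot prompting directly into the target language versus generating English and then translating. The sketch below mirrors that structure; `call_llm` and `translate` are hypothetical stubs (the paper used GPT-3.5/4 and Google Translate), and the single WebNLG-style example shot and the prompt wording are illustrative assumptions, not the paper's prompts.

```python
# Hypothetical stubs and prompt text; swap in a real LLM API and MT service.
FEW_SHOT = [
    ("Alan_Bean | birthPlace | Wheeler,_Texas", "Alan Bean was born in Wheeler, Texas."),
]

def call_llm(prompt: str) -> str:                     # stub: replace with a real LLM call
    raise NotImplementedError

def translate(text: str, target_lang: str) -> str:    # stub: replace with a real MT call
    raise NotImplementedError

def build_prompt(triples: str, target_lang: str) -> str:
    shots = "\n".join(f"Data: {d}\nText: {t}" for d, t in FEW_SHOT)
    return (f"Verbalise the data as fluent {target_lang} text.\n"
            f"{shots}\nData: {triples}\nText:")

def direct_generation(triples: str, lang: str) -> str:
    # variant (i): few-shot generation straight into the under-resourced language
    return call_llm(build_prompt(triples, lang))

def english_pivot(triples: str, lang: str) -> str:
    # variant (ii): generate English first, then translate into the target language
    english = call_llm(build_prompt(triples, "English"))
    return translate(english, lang)
```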

Eva-KELLM: A New Benchmark for Evaluating Knowledge Editing of LLMs

  • paper_url: http://arxiv.org/abs/2308.09954
  • repo_url: None
  • paper_authors: Suhang Wu, Minlong Peng, Yue Chen, Jinsong Su, Mingming Sun
  • for: This work aims to evaluate the effectiveness of knowledge editing for large language models (LLMs).
  • methods: Knowledge editing is performed using raw documents, and the updated LLM is evaluated from multiple perspectives.
  • results: Experimental results indicate that current methods for knowledge editing with raw documents are not yet effective, particularly for reasoning with the altered knowledge and for cross-lingual knowledge transfer.
    Abstract Large language models (LLMs) possess a wealth of knowledge encoded in their parameters. However, this knowledge may become outdated or unsuitable over time. As a result, there has been a growing interest in knowledge editing for LLMs and evaluating its effectiveness. Existing studies primarily focus on knowledge editing using factual triplets, which not only incur high costs for collection but also struggle to express complex facts. Furthermore, these studies are often limited in their evaluation perspectives. In this paper, we propose Eva-KELLM, a new benchmark for evaluating knowledge editing of LLMs. This benchmark includes an evaluation framework and a corresponding dataset. Under our framework, we first ask the LLM to perform knowledge editing using raw documents, which provides a more convenient and universal approach compared to using factual triplets. We then evaluate the updated LLM from multiple perspectives. In addition to assessing the effectiveness of knowledge editing and the retention of unrelated knowledge from conventional studies, we further test the LLM's ability in two aspects: 1) Reasoning with the altered knowledge, aiming for the LLM to genuinely learn the altered knowledge instead of simply memorizing it. 2) Cross-lingual knowledge transfer, where the LLM updated with raw documents in one language should be capable of handling queries from another language. To facilitate further research, we construct and release the corresponding dataset. Using this benchmark, we investigate the effectiveness of several commonly-used knowledge editing methods. Experimental results indicate that the current methods for knowledge editing using raw documents are not effective in yielding satisfactory results, particularly when it comes to reasoning with altered knowledge and cross-lingual knowledge transfer.

Exploring the Power of Topic Modeling Techniques in Analyzing Customer Reviews: A Comparative Analysis

  • paper_url: http://arxiv.org/abs/2308.11520
  • repo_url: None
  • paper_authors: Anusuya Krishnan
  • for: This study compares commonly used topic modeling methods on customer reviews to improve topic extraction in real-world applications.
  • methods: The methods examined are LSA, LDA, NMF, PAM, Top2Vec, and BERTopic.
  • results: BERTopic consistently extracts more meaningful topics and achieves favorable results on both text datasets.
    Abstract The exponential growth of online social network platforms and applications has led to a staggering volume of user-generated textual content, including comments and reviews. Consequently, users often face difficulties in extracting valuable insights or relevant information from such content. To address this challenge, machine learning and natural language processing algorithms have been deployed to analyze the vast amount of textual data available online. In recent years, topic modeling techniques have gained significant popularity in this domain. In this study, we comprehensively examine and compare five frequently used topic modeling methods specifically applied to customer reviews. The methods under investigation are latent semantic analysis (LSA), latent Dirichlet allocation (LDA), non-negative matrix factorization (NMF), pachinko allocation model (PAM), Top2Vec, and BERTopic. By practically demonstrating their benefits in detecting important topics, we aim to highlight their efficacy in real-world scenarios. To evaluate the performance of these topic modeling methods, we carefully select two textual datasets. The evaluation is based on standard statistical evaluation metrics such as topic coherence score. Our findings reveal that BERTopic consistently yield more meaningful extracted topics and achieve favorable results.
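
For readers unfamiliar with the compared techniques, here is a minimal sketch of two of them (LDA and NMF) extracting topics from a toy review corpus with scikit-learn. The study itself also covers LSA, PAM, Top2Vec, and BERTopic and scores topics with coherence metrics; none of that preprocessing, tuning, or evaluation is reproduced here, and the six reviews are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

# Toy corpus standing in for real customer reviews.
reviews = [
    "battery life is great and charging is fast",
    "battery drains quickly, poor charging",
    "delivery was late and the packaging was damaged",
    "fast delivery, well packaged",
    "screen quality is excellent, very bright display",
    "display is dim and the screen scratches easily",
]

def top_words(model, feature_names, n=4):
    # highest-weighted words for each topic component
    return [[feature_names[i] for i in comp.argsort()[-n:][::-1]] for comp in model.components_]

count_vec = CountVectorizer(stop_words="english").fit(reviews)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(count_vec.transform(reviews))
print("LDA topics:", top_words(lda, count_vec.get_feature_names_out()))

tfidf_vec = TfidfVectorizer(stop_words="english").fit(reviews)
nmf = NMF(n_components=3, random_state=0).fit(tfidf_vec.transform(reviews))
print("NMF topics:", top_words(nmf, tfidf_vec.get_feature_names_out()))
```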

Understanding Self-attention Mechanism via Dynamical System Perspective

  • paper_url: http://arxiv.org/abs/2308.09939
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Zhongzhan Huang, Mingfu Liang, Jinghui Qin, Shanshan Zhong, Liang Lin
  • for: This work aims to provide a new understanding of the self-attention mechanism (SAM) in neural networks and to develop a new approach, StepNet, that can measure the intrinsic stiffness phenomenon (SP) in high-performance neural networks.
  • methods: The stiffness phenomenon in high-performance neural networks is studied from a dynamical-system perspective, and StepNet, built on an adaptive step-size view of self-attention, is proposed to measure SP.
  • results: Experiments show that StepNet can measure SP accurately, leading to improvements on a variety of vision tasks.
    Abstract The self-attention mechanism (SAM) is widely used in various fields of artificial intelligence and has successfully boosted the performance of different models. However, current explanations of this mechanism are mainly based on intuitions and experiences, while there still lacks direct modeling for how the SAM helps performance. To mitigate this issue, in this paper, based on the dynamical system perspective of the residual neural network, we first show that the intrinsic stiffness phenomenon (SP) in the high-precision solution of ordinary differential equations (ODEs) also widely exists in high-performance neural networks (NN). Thus the ability of NN to measure SP at the feature level is necessary to obtain high performance and is an important factor in the difficulty of training NN. Similar to the adaptive step-size method which is effective in solving stiff ODEs, we show that the SAM is also a stiffness-aware step size adaptor that can enhance the model's representational ability to measure intrinsic SP by refining the estimation of stiffness information and generating adaptive attention values, which provides a new understanding about why and how the SAM can benefit the model performance. This novel perspective can also explain the lottery ticket hypothesis in SAM, design new quantitative metrics of representational ability, and inspire a new theoretic-inspired approach, StepNet. Extensive experiments on several popular benchmarks demonstrate that StepNet can extract fine-grained stiffness information and measure SP accurately, leading to significant improvements in various visual tasks.
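
To make the dynamical-system analogy concrete: on a stiff ODE, a fixed-step explicit Euler solver must take very small steps, whereas an adaptive rule shrinks or grows the step based on a local error (stiffness) estimate, which is the role the abstract attributes to self-attention inside residual networks. The toy solver below is background for that analogy only; it is not StepNet, and the test problem and tolerances are arbitrary choices.

```python
import numpy as np

# Classic stiff test problem: the solution is rapidly pulled onto the slow manifold
# x(t) ~ cos(t), so a naive explicit Euler step larger than ~2/50 blows up.
def f(t, x):
    return -50.0 * (x - np.cos(t))

def euler_adaptive(x0, t0, t1, tol=1e-3, h_max=0.1):
    t, x, h, steps = t0, x0, h_max, 0
    while t < t1:
        h = min(h, t1 - t)
        full = x + h * f(t, x)                               # one Euler step of size h
        half = x + 0.5 * h * f(t, x)
        two_half = half + 0.5 * h * f(t + 0.5 * h, half)     # two half steps
        if abs(full - two_half) < tol:                       # accept; locally non-stiff
            t, x, steps = t + h, two_half, steps + 1
            h = min(h * 1.5, h_max)
        else:                                                # locally stiff: shrink the step
            h *= 0.5
    return x, steps

x_end, n_steps = euler_adaptive(x0=0.0, t0=0.0, t1=2.0)
print(f"x(2) ≈ {x_end:.4f} after {n_steps} accepted steps")
```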

Analyzing Quantization in TVM

  • paper_url: http://arxiv.org/abs/2308.10905
  • repo_url: None
  • paper_authors: Mingfei Guo
  • for: This work studies 8-bit quantization in TVM, which is intended to reduce the inference latency and memory footprint of deep learning models.
  • methods: TVM's 8-bit quantization is applied, and a range of optimization techniques is compared and analyzed to improve the performance of the quantized models.
  • results: With 8-bit quantization, TVM's execution was initially slower than the non-quantized version, with regressions on both compute-bound and memory-bound tasks. However, after fixing a bug in graph building and applying several optimization strategies, the quantized models ultimately achieve a performance improvement over the TVM-compiled baseline.
    Abstract There has been many papers in academic literature on quantizing weight tensors in deep learning models to reduce inference latency and memory footprint. TVM also has the ability to quantize weights and support low-bit computations. Although quantization is typically expected to improve inference time, in TVM, the performance of 8-bit quantization does not meet the expectations. Typically, when applying 8-bit quantization to a deep learning model, it is usually expected to achieve around 50% of the full-precision inference time. However, in this particular case, not only does the quantized version fail to achieve the desired performance boost, but it actually performs worse, resulting in an inference time that is about 2 times as slow as the non-quantized version. In this project, we thoroughly investigate the reasons behind the underperformance and assess the compatibility and optimization opportunities of 8-bit quantization in TVM. We discuss the optimization of two different types of tasks: computation-bound and memory-bound, and provide a detailed comparison of various optimization techniques in TVM. Through the identification of performance issues, we have successfully improved quantization by addressing a bug in graph building. Furthermore, we analyze multiple optimization strategies to achieve the optimal quantization result. The best experiment achieves 163.88% improvement compared with the TVM compiled baseline in inference time for the compute-bound task and 194.98% for the memory-bound task.

East: Efficient and Accurate Secure Transformer Framework for Inference

  • paper_url: http://arxiv.org/abs/2308.09923
  • repo_url: None
  • paper_authors: Yuanchao Ding, Hua Guo, Yewei Guan, Weixin Liu, Jiarong Huo, Zhenyu Guan, Xiyong Zhang
  • for: This work aims to provide an efficient and accurate privacy-preserving Transformer inference framework.
  • methods: The proposed framework, East, includes a new oblivious piecewise polynomial evaluation algorithm and carefully designed secure protocols for softmax and layer normalization.
  • results: East achieves efficient and accurate privacy-preserving Transformer inference, with accuracy consistent with plaintext inference and no fine-tuning. Compared with Iron, it reduces communication by about 1.8x within about 1.2x lower runtime.
    Abstract Transformer has been successfully used in practical applications, such as ChatGPT, due to its powerful advantages. However, users' input is leaked to the model provider during the service. With people's attention to privacy, privacy-preserving Transformer inference is on the demand of such services. Secure protocols for non-linear functions are crucial in privacy-preserving Transformer inference, which are not well studied. Thus, designing practical secure protocols for non-linear functions is hard but significant to model performance. In this work, we propose a framework \emph{East} to enable efficient and accurate secure Transformer inference. Firstly, we propose a new oblivious piecewise polynomial evaluation algorithm and apply it to the activation functions, which reduces the runtime and communication of GELU by over 1.5$\times$ and 2.5$\times$, compared to prior arts. Secondly, the secure protocols for softmax and layer normalization are carefully designed to faithfully maintain the desired functionality. Thirdly, several optimizations are conducted in detail to enhance the overall efficiency. We applied \emph{East} to BERT and the results show that the inference accuracy remains consistent with the plaintext inference without fine-tuning. Compared to Iron, we achieve about 1.8$\times$ lower communication within 1.2$\times$ lower runtime.
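
The reason piecewise polynomials appear at all: secret-shared protocols handle additions and multiplications cheaply, so a transcendental activation like GELU is replaced by low-degree polynomial pieces. The plaintext sketch below only shows how such an approximation can be fit and evaluated; East's oblivious evaluation protocol and its actual breakpoints and degrees are not reproduced, and the interval/degree choices here are assumptions.

```python
import math
import numpy as np

# Plaintext-only illustration: fit one cubic per interval to GELU and check the error.
def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

xs = np.linspace(-6, 6, 4001)
ys = np.array([gelu(v) for v in xs])

# illustrative breakpoints; outside [-6, 6] GELU is ~0 on the left and ~x on the right
breakpoints = [(-6, -2), (-2, 2), (2, 6)]
pieces = []
for lo, hi in breakpoints:
    m = (xs >= lo) & (xs <= hi)
    pieces.append((lo, hi, np.polyfit(xs[m], ys[m], deg=3)))

def gelu_piecewise(x):
    if x < -6:
        return 0.0
    if x > 6:
        return x
    for lo, hi, coeffs in pieces:
        if lo <= x <= hi:
            return float(np.polyval(coeffs, x))

err = max(abs(gelu_piecewise(v) - gelu(v)) for v in np.linspace(-6, 6, 1000))
print(f"max abs error of the piecewise cubic on [-6, 6]: {err:.4f}")
```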

Recap: Detecting Deepfake Video with Unpredictable Tampered Traces via Recovering Faces and Mapping Recovered Faces

  • paper_url: http://arxiv.org/abs/2308.09921
  • repo_url: None
  • paper_authors: Juan Hu, Xin Liao, Difei Gao, Satoshi Tsutsui, Qian Wang, Zheng Qin, Mike Zheng Shou
  • for: Detecting Deepfake videos created through the malicious use of Deepfake techniques.
  • methods: A new Deepfake detection model, Recap, which exposes unspecific facial inconsistencies by recovering faces and then mapping the recovered faces to enlarge the differences between real and fake.
  • results: Extensive experiments across a variety of scenarios show that Recap detects Deepfake videos effectively.
    Abstract The exploitation of Deepfake techniques for malicious intentions has driven significant research interest in Deepfake detection. Deepfake manipulations frequently introduce random tampered traces, leading to unpredictable outcomes in different facial regions. However, existing detection methods heavily rely on specific forgery indicators, and as the forgery mode improves, these traces become increasingly randomized, resulting in a decline in the detection performance of methods reliant on specific forgery traces. To address the limitation, we propose Recap, a novel Deepfake detection model that exposes unspecific facial part inconsistencies by recovering faces and enlarges the differences between real and fake by mapping recovered faces. In the recovering stage, the model focuses on randomly masking regions of interest (ROIs) and reconstructing real faces without unpredictable tampered traces, resulting in a relatively good recovery effect for real faces while a poor recovery effect for fake faces. In the mapping stage, the output of the recovery phase serves as supervision to guide the facial mapping process. This mapping process strategically emphasizes the mapping of fake faces with poor recovery, leading to a further deterioration in their representation, while enhancing and refining the mapping of real faces with good representation. As a result, this approach significantly amplifies the discrepancies between real and fake videos. Our extensive experiments on standard benchmarks demonstrate that Recap is effective in multiple scenarios.

Learning Multiscale Consistency for Self-supervised Electron Microscopy Instance Segmentation

  • paper_url: http://arxiv.org/abs/2308.09917
  • repo_url: None
  • paper_authors: Yinda Chen, Wei Huang, Xiaoyu Liu, Qi Chen, Zhiwei Xiong
  • for: Improving instance segmentation accuracy in electron microscopy (EM) volumes.
  • methods: A self-supervised pretraining approach that uses multiscale visual representations to capture the complex visual patterns of instances and the relationships between voxels in EM volumes.
  • results: Extensive pretraining on four large-scale EM datasets yields improved performance on the representative tasks of neuron and mitochondria instance segmentation.
    Abstract Instance segmentation in electron microscopy (EM) volumes poses a significant challenge due to the complex morphology of instances and insufficient annotations. Self-supervised learning has recently emerged as a promising solution, enabling the acquisition of prior knowledge of cellular tissue structures that are essential for EM instance segmentation. However, existing pretraining methods often lack the ability to capture complex visual patterns and relationships between voxels, which results in the acquired prior knowledge being insufficient for downstream EM analysis tasks. In this paper, we propose a novel pretraining framework that leverages multiscale visual representations to capture both voxel-level and feature-level consistency in EM volumes. Specifically, our framework enforces voxel-level consistency between the outputs of a Siamese network by a reconstruction function, and incorporates a cross-attention mechanism for soft feature matching to achieve fine-grained feature-level consistency. Moreover, we propose a contrastive learning scheme on the feature pyramid to extract discriminative features across multiple scales. We extensively pretrain our method on four large-scale EM datasets, achieving promising performance improvements in representative tasks of neuron and mitochondria instance segmentation.

Never Explore Repeatedly in Multi-Agent Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.09909
  • repo_url: None
  • paper_authors: Chenghao Li, Tonghan Wang, Chongjie Zhang, Qianchuan Zhao
  • for: This paper addresses intrinsic motivation in multi-agent reinforcement learning, in particular exploration and reward acquisition in game environments.
  • methods: A dynamic reward scaling approach that stabilizes the fluctuations of intrinsic rewards in previously explored areas and promotes broader exploration, curbing the revisitation problem.
  • results: Experiments show improved performance on Google Research Football and StarCraft II micromanagement tasks, especially in sparse-reward settings.
    Abstract In the realm of multi-agent reinforcement learning, intrinsic motivations have emerged as a pivotal tool for exploration. While the computation of many intrinsic rewards relies on estimating variational posteriors using neural network approximators, a notable challenge has surfaced due to the limited expressive capability of these neural statistics approximators. We pinpoint this challenge as the "revisitation" issue, where agents recurrently explore confined areas of the task space. To combat this, we propose a dynamic reward scaling approach. This method is crafted to stabilize the significant fluctuations in intrinsic rewards in previously explored areas and promote broader exploration, effectively curbing the revisitation phenomenon. Our experimental findings underscore the efficacy of our approach, showcasing enhanced performance in demanding environments like Google Research Football and StarCraft II micromanagement tasks, especially in sparse reward settings.
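
A generic illustration of the revisitation problem and one mitigation: a count-based novelty bonus that is additionally damped once a region has been revisited many times. The decay rule and thresholds below are invented for illustration; the paper's dynamic reward scaling operates on learned intrinsic rewards in a multi-agent setting and is not reproduced here.

```python
import numpy as np
from collections import defaultdict

# Toy count-based intrinsic reward with an extra damping term: the bonus for a
# heavily revisited region shrinks faster, so agents are not repeatedly drawn back
# to already-explored areas (one simple reading of "dynamic reward scaling").
visit_counts = defaultdict(int)

def intrinsic_reward(state_key, decay=0.5, revisit_threshold=10):
    visit_counts[state_key] += 1
    base = 1.0 / np.sqrt(visit_counts[state_key])          # count-based novelty bonus
    excess = max(0, visit_counts[state_key] - revisit_threshold)
    scale = np.exp(-decay * excess)                        # extra damping once revisited a lot
    return base * scale

for t in range(30):
    r_int = intrinsic_reward(state_key=("cell", 3, 4))     # same region visited repeatedly
    if t in (0, 9, 29):
        print(f"visit {t + 1}: intrinsic reward = {r_int:.4f}")
```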

LEGO: Learning and Graph-Optimized Modular Tracker for Online Multi-Object Tracking with Point Clouds

  • paper_url: http://arxiv.org/abs/2308.09908
  • repo_url: None
  • paper_authors: Zhenrong Zhang, Jianan Liu, Yuxuan Xia, Tao Huang, Qing-Long Han, Hongbin Liu
  • for: Improving data association performance and, in turn, overall tracking performance in online multi-object tracking.
  • methods: Graph optimization and self-attention mechanisms are used to formulate an accurate association score map, improving the accuracy and efficiency of matching objects across time frames.
  • results: Using LiDAR alone, the method achieves excellent performance on the KITTI vehicle tracking evaluation board, ranking 1st among all online trackers at the time of result submission and 2nd at the time of submitting the paper.
    Abstract Online multi-object tracking (MOT) plays a pivotal role in autonomous systems. The state-of-the-art approaches usually employ a tracking-by-detection method, and data association plays a critical role. This paper proposes a learning and graph-optimized (LEGO) modular tracker to improve data association performance in the existing literature. The proposed LEGO tracker integrates graph optimization and self-attention mechanisms, which efficiently formulate the association score map, facilitating the accurate and efficient matching of objects across time frames. To further enhance the state update process, the Kalman filter is added to ensure consistent tracking by incorporating temporal coherence in the object states. Our proposed method utilizing LiDAR alone has shown exceptional performance compared to other online tracking approaches, including LiDAR-based and LiDAR-camera fusion-based methods. LEGO ranked 1st at the time of submitting results to KITTI object tracking evaluation ranking board and remains 2nd at the time of submitting this paper, among all online trackers in the KITTI MOT benchmark for cars1

RAH! RecSys-Assistant-Human: A Human-Central Recommendation Framework with Large Language Models

  • paper_url: http://arxiv.org/abs/2308.09904
  • repo_url: None
  • paper_authors: Yubo Shu, Hansu Gu, Peng Zhang, Haonan Zhang, Tun Lu, Dongsheng Li, Ning Gu
  • for: Proposes a human-centric recommendation framework that helps recommender systems better understand and satisfy user needs.
  • methods: Uses a large language model (LLM) as a personal proxy for the user and proposes the RAH framework, which consists of three parts: the Recommender system, the Assistant, and the Human.
  • results: Experiments show that the learn-action-critic and reflection mechanisms lead to a more aligned user personality, and that the assistant can effectively proxy human feedback and help adjust the recommender system.
    Abstract The recommendation ecosystem involves interactions between recommender systems (Computer) and users (Human). Orthogonal to the perspective of recommender systems, we attempt to utilize LLMs from the perspective of users and propose a more human-central recommendation framework named RAH, which consists of the Recommender system, Assistant and Human. The assistant is an LLM-based personal proxy for a human, aiming to achieve user satisfaction. The assistant plays a non-invasive role, and the RAH framework can adapt to different recommender systems and user groups. Subsequently, we implement and evaluate the RAH framework for learning user personalities and proxying human feedback. The experiments show that (1) using learn-action-critic and reflection mechanisms leads to a more aligned personality and (2) our assistant can effectively proxy human feedback and help adjust recommender systems. Finally, we discuss further strategies in the RAH framework to address human-central concerns, including user control, privacy and fairness.
    摘要 推荐生态系统包括推荐系统与用户之间的互动。与推荐系统的视角正交，我们尝试从用户的角度利用大语言模型（LLM），并提出一种更以人为中心的推荐框架RAH，该框架包括推荐系统、助手和人类三部分。助手是基于LLM的个性化代理，用于达成用户满意度。助手以非侵入的方式发挥作用，RAH框架可以适应不同的推荐系统和用户群体。在实现和评估RAH框架时，我们发现：1. 使用学习-动作-评价与反思机制可以得到更加一致的用户个性；2. 我们的助手可以有效地代理人类反馈，帮助调整推荐系统。最后，我们讨论了RAH框架中应对以人为中心问题（包括用户控制、隐私和公平）的进一步策略。

SwinLSTM:Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM

  • paper_url: http://arxiv.org/abs/2308.09891
  • repo_url: https://github.com/SongTang-x/SwinLSTM
  • paper_authors: Song Tang, Chuang Li, Pu Zhang, RongNian Tang
  • for: Proposes a new recurrent cell, SwinLSTM, for spatiotemporal prediction tasks.
  • methods: Integrates Swin Transformer blocks with a simplified LSTM, replacing the convolutional structure of ConvLSTM with a self-attention mechanism, and builds a prediction network with the SwinLSTM cell at its core.
  • results: SwinLSTM outperforms state-of-the-art methods on the Moving MNIST, Human3.6m, TaxiBJ, and KTH datasets, with a significant improvement in prediction accuracy over ConvLSTM.
    Abstract Integrating CNNs and RNNs to capture spatiotemporal dependencies is a prevalent strategy for spatiotemporal prediction tasks. However, the property of CNNs to learn local spatial information decreases their efficiency in capturing spatiotemporal dependencies, thereby limiting their prediction accuracy. In this paper, we propose a new recurrent cell, SwinLSTM, which integrates Swin Transformer blocks and the simplified LSTM, an extension that replaces the convolutional structure in ConvLSTM with the self-attention mechanism. Furthermore, we construct a network with SwinLSTM cell as the core for spatiotemporal prediction. Without using unique tricks, SwinLSTM outperforms state-of-the-art methods on Moving MNIST, Human3.6m, TaxiBJ, and KTH datasets. In particular, it exhibits a significant improvement in prediction accuracy compared to ConvLSTM. Our competitive experimental results demonstrate that learning global spatial dependencies is more advantageous for models to capture spatiotemporal dependencies. We hope that SwinLSTM can serve as a solid baseline to promote the advancement of spatiotemporal prediction accuracy. The codes are publicly available at https://github.com/SongTang-x/SwinLSTM.

Inductive-bias Learning: Generating Code Models with Large Language Model

  • paper_url: http://arxiv.org/abs/2308.09890
  • repo_url: https://github.com/fuyu-quant/iblm
  • paper_authors: Toma Tanaka, Naofumi Emoto, Tsukasa Yumibayashi
  • for: Proposes a new learning method, Inductive-Bias Learning (IBL), which combines the in-context learning (ICL) of large language models (LLMs) with code generation to achieve accurate inference and code generation.
  • methods: Like ICL, IBL feeds training data into the prompt, but instead of a direct answer it outputs a "Code Model" with the structure needed for inference, all without updating the model parameters.
  • results: Generated Code Models achieve predictive accuracy comparable to, and in some cases surpassing, ICL and representative machine learning models, while also offering better readability and explainability.
    Abstract Large Language Models (LLMs) have been attracting attention due to an ability called in-context learning (ICL). With ICL, without updating the parameters of an LLM, it is possible to achieve highly accurate inference based on rules "in the context" by merely inputting training data into the prompt. Although ICL is a developing field with many unanswered questions, the LLM itself serves as the inference model, seemingly realizing inference without explicitly indicating an "inductive bias". On the other hand, code generation is also a highlighted application of LLMs. The accuracy of code generation has dramatically improved, enabling even non-engineers to generate code that performs the desired tasks by crafting appropriate prompts. In this paper, we propose a novel "learning" method called Inductive-Bias Learning (IBL), which combines the techniques of ICL and code generation. The idea of IBL is straightforward. Like ICL, IBL inputs training data into the prompt and, from a "contextual understanding", outputs code with the structure necessary for inference (which we refer to as a "Code Model"). Despite being a seemingly simple approach, IBL encompasses both the "inference without explicit inductive bias" property inherent in ICL and the "readability and explainability" of code generation. Surprisingly, generated Code Models have been found to achieve predictive accuracy comparable to, and in some cases surpassing, ICL and representative machine learning models. Our IBL code is open source: https://github.com/fuyu-quant/IBLM
    摘要 大型语言模型（LLM）最近因其"上下文学习"（ICL）能力而备受关注。ICL可以在不更新LLM参数的情况下，仅通过将训练数据输入提示，就能依据"上下文中"的规则进行高精度推断。虽然ICL仍是一个发展中的领域，存在许多未解问题，但LLM本身作为推断模型，似乎无需显式指定"归纳偏置"即可完成推断。另一方面，代码生成也是LLM的一个重要应用。代码生成的精度已大幅提升，即使非工程师也能通过设计合适的提示生成代码来完成所需任务。在这篇论文中，我们提出了一种新的"学习"方法，即"归纳偏置学习"（IBL），它结合了ICL和代码生成的技术。IBL的想法很直接：与ICL一样，IBL将训练数据输入提示，并基于"上下文理解"输出带有推断所需结构的代码（我们称之为"代码模型"）。尽管看起来非常简单，IBL既具备ICL中"无需显式归纳偏置即可推断"的特性，又具备代码生成的"可读性和可解释性"。令人惊讶的是，生成的代码模型的预测精度可以与ICL和代表性机器学习模型相当，甚至在某些情况下超越它们。我们的IBL代码公开在GitHub：https://github.com/fuyu-quant/IBLM。
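
A minimal sketch of the IBL loop described above: training data goes into the prompt, the LLM returns a "Code Model", and that code is executed to obtain a predictor. `call_llm`, the prompt wording, and the `predict(x)` convention are placeholders, not the interface of the authors' iblm package.

```python
import textwrap

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def inductive_bias_learning(train_rows, target_name="y"):
    prompt = textwrap.dedent(f"""
        Here is a small tabular dataset (last column is {target_name}):
        {train_rows}
        Write a self-contained Python function `predict(x)` that returns the
        predicted {target_name} for a feature list x. Return only code.
    """)
    code_model = call_llm(prompt)          # the generated "Code Model"
    namespace = {}
    exec(code_model, namespace)            # materialize it as a callable
    return namespace["predict"]

# usage (assuming an LLM client is wired in):
# predict = inductive_bias_learning([[0.1, 0.9, 1], [0.8, 0.2, 0]])
# print(predict([0.5, 0.5]))
```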

Tensor-Compressed Back-Propagation-Free Training for (Physics-Informed) Neural Networks

  • paper_url: http://arxiv.org/abs/2308.09858
  • repo_url: None
  • paper_authors: Yequan Zhao, Xinling Yu, Zhixiong Chen, Ziyue Liu, Sijia Liu, Zheng Zhang
  • for: Proposes a completely back-propagation-free (BP-free) framework that trains realistic neural networks using only forward propagation.
  • methods: Three techniques are combined: (1) a tensor-compressed variance-reduction approach that improves the scalability of zeroth-order (ZO) optimization to larger networks; (2) a hybrid gradient-evaluation approach that improves the efficiency of ZO training; and (3) an extension of the BP-free framework to physics-informed neural networks (PINNs) via a sparse-grid approach for estimating the derivatives in the loss function without BP.
  • results: The BP-free training loses little accuracy on MNIST compared with standard first-order training and successfully trains a PINN for a 20-dimensional Hamilton-Jacobi-Bellman (HJB) PDE. This memory-efficient, BP-free approach may serve as a foundation for future on-device training on resource-constrained platforms (e.g., FPGA, ASIC, microcontrollers, and photonic chips).
    Abstract Backward propagation (BP) is widely used to compute the gradients in neural network training. However, it is hard to implement BP on edge devices due to the lack of hardware and software resources to support automatic differentiation. This has tremendously increased the design complexity and time-to-market of on-device training accelerators. This paper presents a completely BP-free framework that only requires forward propagation to train realistic neural networks. Our technical contributions are three-fold. Firstly, we present a tensor-compressed variance reduction approach to greatly improve the scalability of zeroth-order (ZO) optimization, making it feasible to handle a network size that is beyond the capability of previous ZO approaches. Secondly, we present a hybrid gradient evaluation approach to improve the efficiency of ZO training. Finally, we extend our BP-free training framework to physics-informed neural networks (PINNs) by proposing a sparse-grid approach to estimate the derivatives in the loss function without using BP. Our BP-free training only loses little accuracy on the MNIST dataset compared with standard first-order training. We also demonstrate successful results in training a PINN for solving a 20-dim Hamiltonian-Jacobi-Bellman PDE. This memory-efficient and BP-free approach may serve as a foundation for the near-future on-device training on many resource-constraint platforms (e.g., FPGA, ASIC, micro-controllers, and photonic chips).
    摘要 反向传播（BP）是神经网络训练中广泛使用的梯度计算方法。然而，由于缺乏支持自动微分的硬件和软件资源，在边缘设备上实现 BP 十分困难，这大大增加了设备端训练加速器的设计复杂度和上市时间。这篇论文提出了一个完全不需要 BP 的框架，只需前向传播即可训练真实规模的神经网络。我们的技术贡献包括以下三个方面：首先，我们提出了一种张量压缩的方差缩减方法，大幅提高零阶（ZO）优化的可扩展性，使其能处理超出以往 ZO 方法能力的网络规模。其次，我们提出了一种混合梯度评估方法，以提高 ZO 训练的效率。最后，我们将 BP-free 训练框架扩展到物理信息神经网络（PINNs），提出一种稀疏网格方法来估计损失函数中的导数，而无需 BP。与标准一阶训练相比，我们的 BP-free 训练在 MNIST 数据集上只损失了很少的精度。我们还成功训练了一个 PINN，用于求解一个 20 维哈密顿-雅可比-贝尔曼（HJB）偏微分方程。这种内存高效且无需 BP 的方法可能成为未来资源受限平台（例如 FPGA、ASIC、微控制器和光子芯片）上设备端训练的基础。
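
The zeroth-order (ZO) optimization at the core of this BP-free framework can be sketched with a basic randomized gradient estimator that uses only forward evaluations; the paper's tensor-compressed variance reduction and hybrid evaluation are not reproduced here.

```python
import numpy as np

def zo_gradient(loss_fn, theta, mu=1e-3, n_samples=16, rng=np.random.default_rng(0)):
    """Randomized zeroth-order gradient estimate.

    Uses only forward evaluations of loss_fn: for random directions u,
    (loss(theta + mu*u) - loss(theta)) / mu * u approximates the gradient.
    """
    grad = np.zeros_like(theta)
    base = loss_fn(theta)
    for _ in range(n_samples):
        u = rng.standard_normal(theta.shape)
        grad += (loss_fn(theta + mu * u) - base) / mu * u
    return grad / n_samples

# toy check on a quadratic: the true gradient is 2 * theta
theta = np.array([1.0, -2.0, 0.5])
print(zo_gradient(lambda t: np.sum(t**2), theta, n_samples=500))
```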

Enumerating Safe Regions in Deep Neural Networks with Provable Probabilistic Guarantees

  • paper_url: http://arxiv.org/abs/2308.09842
  • repo_url: None
  • paper_authors: Luca Marzari, Davide Corsi, Enrico Marchesini, Alessandro Farinelli, Ferdinando Cicalese
  • for: Ensuring trust in Deep Neural Network (DNN) systems by identifying safe areas.
  • methods: Proposed an efficient approximation method called epsilon-ProVe, which leverages statistical prediction of tolerance limits to provide a tight lower estimate of safe areas.
  • results: Empirical evaluation on standard benchmarks showed the scalability and effectiveness of the method, providing valuable insights for verifying DNNs.
    Abstract Identifying safe areas is a key point to guarantee trust for systems that are based on Deep Neural Networks (DNNs). To this end, we introduce the AllDNN-Verification problem: given a safety property and a DNN, enumerate the set of all the regions of the property input domain which are safe, i.e., where the property does hold. Due to the #P-hardness of the problem, we propose an efficient approximation method called epsilon-ProVe. Our approach exploits a controllable underestimation of the output reachable sets obtained via statistical prediction of tolerance limits, and can provide a tight (with provable probabilistic guarantees) lower estimate of the safe areas. Our empirical evaluation on different standard benchmarks shows the scalability and effectiveness of our method, offering valuable insights for this new type of verification of DNNs.
    摘要 识别安全区域是保证基于深度神经网络（DNN）的系统可信的关键。为此，我们提出了AllDNN-Verification问题：给定一个安全性质和一个DNN，枚举该性质输入域中所有安全的区域，即性质成立的区域。由于该问题是#P难的，我们提出了一种高效的近似方法epsilon-ProVe。我们的方法利用统计预测的容忍限（tolerance limits）对输出可达集进行可控的低估，从而对安全区域给出一个紧致的（具有可证明概率保证的）下界估计。我们在多个标准基准上的实验结果表明该方法具有可扩展性和有效性，为这种新型的DNN验证提供了有价值的见解。
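
A toy sketch of the sampling flavor behind epsilon-ProVe: estimate the safe fraction of an input box from random samples and lower-bound it with a simple Hoeffding-style confidence term. The actual method works on statistical tolerance limits of output reachable sets, so the bound below is only a stand-in.

```python
import numpy as np

def safe_fraction_lower_bound(property_holds, lo, hi, n=10_000, delta=1e-3,
                              rng=np.random.default_rng(0)):
    """Probabilistic lower bound on the fraction of an input box that is safe.

    property_holds : callable mapping an input point to True/False
    lo, hi         : bounds of the input box
    The Hoeffding-style confidence term is purely illustrative.
    """
    x = rng.uniform(lo, hi, size=(n, len(lo)))
    safe = np.mean([property_holds(p) for p in x])
    eps = np.sqrt(np.log(1.0 / delta) / (2 * n))
    return max(0.0, safe - eps)   # holds with probability >= 1 - delta

# toy property on a 2D box: "output stays below a threshold"
print(safe_fraction_lower_bound(lambda p: p[0] + p[1] < 1.5,
                                lo=np.array([0.0, 0.0]), hi=np.array([1.0, 1.0])))
```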

Synergistic Integration of Large Language Models and Cognitive Architectures for Robust AI: An Exploratory Analysis

  • paper_url: http://arxiv.org/abs/2308.09830
  • repo_url: None
  • paper_authors: Oscar J. Romero, John Zimmerman, Aaron Steinfeld, Anthony Tomasic
  • for: Explores how two AI subdisciplines, Large Language Models (LLMs) and Cognitive Architectures (CAs), can be integrated to build artificial agents that exhibit more intelligent behavior.
  • methods: Guided by theoretical models and supported by preliminary empirical data, the paper analyzes several synergistic approaches for combining the two, with the goal of endowing AI systems with greater robustness and sophistication.
  • results: The paper hypothesizes that these approaches can mutually compensate for each other's weaknesses and limitations, fostering more robust AI systems, and discusses the tradeoffs and challenges associated with each approach.
    Abstract This paper explores alternatives for integrating two subdisciplines of AI in the construction of artificial agents that exhibit intelligent behavior: Large Language Models (LLMs) and Cognitive Architectures (CAs). Guided by theoretical models and supported by preliminary empirical data, we hypothesize how diverse synergistic approaches can mutually compensate for their respective weaknesses and limitations, ultimately fostering more robust and sophisticated artificial intelligence systems. Additionally, we discuss the tradeoffs and challenges associated with each approach.

Learning Representations on Logs for AIOps

  • paper_url: http://arxiv.org/abs/2308.11526
  • repo_url: None
  • paper_authors: Pranjal Gupta, Harshit Kumar, Debanjana Kar, Karan Bhukar, Pooja Aggarwal, Prateeti Mohapatra
  • for: Improving the efficiency of automated operations in AIOps platforms and reducing manual intervention.
  • methods: Uses natural language processing and a large language model (LLM), trained on public and proprietary log data, for automated log analysis tasks such as log format detection, log classification, and log parsing.
  • results: The proposed log-specific LLM outperforms existing models on multiple downstream tasks.
    Abstract AI for IT Operations (AIOps) is a powerful platform that Site Reliability Engineers (SREs) use to automate and streamline operational workflows with minimal human intervention. Automated log analysis is a critical task in AIOps as it provides key insights for SREs to identify and address ongoing faults. Tasks such as log format detection, log classification, and log parsing are key components of automated log analysis. Most of these tasks require supervised learning; however, there are multiple challenges due to limited labelled log data and the diverse nature of log data. Large Language Models (LLMs) such as BERT and GPT3 are trained using self-supervision on a vast amount of unlabeled data. These models provide generalized representations that can be effectively used for various downstream tasks with limited labelled data. Motivated by the success of LLMs in specific domains like science and biology, this paper introduces a LLM for log data which is trained on public and proprietary log data. The results of our experiments demonstrate that the proposed LLM outperforms existing models on multiple downstream tasks. In summary, AIOps powered by LLMs offers an efficient and effective solution for automating log analysis tasks and enabling SREs to focus on higher-level tasks. Our proposed LLM, trained on public and proprietary log data, offers superior performance on multiple downstream tasks, making it a valuable addition to the AIOps platform.

An Image is Worth a Thousand Toxic Words: A Metamorphic Testing Framework for Content Moderation Software

  • paper_url: http://arxiv.org/abs/2308.09810
  • repo_url: None
  • paper_authors: Wenxuan Wang, Jingyuan Huang, Jen-tse Huang, Chang Chen, Jiazhen Gu, Pinjia He, Michael R. Lyu
  • for: Investigates how robust modern content moderation software is against toxic text hidden inside images.
  • methods: Proposes OASIS, a metamorphic testing framework for content moderation software. OASIS uses 21 transform rules, summarized from a pilot study of 5,000 real-world toxic contents, to generate image test cases that preserve toxicity yet are likely to bypass moderation.
  • results: OASIS achieves error-finding rates of up to 100% against five commercial moderation services and a state-of-the-art research model, and retraining the moderation models on OASIS-generated test cases improves their robustness without degrading performance.
    Abstract The exponential growth of social media platforms has brought about a revolution in communication and content dissemination in human society. Nevertheless, these platforms are being increasingly misused to spread toxic content, including hate speech, malicious advertising, and pornography, leading to severe negative consequences such as harm to teenagers' mental health. Despite tremendous efforts in developing and deploying textual and image content moderation methods, malicious users can evade moderation by embedding texts into images, such as screenshots of the text, usually with some interference. We find that modern content moderation software's performance against such malicious inputs remains underexplored. In this work, we propose OASIS, a metamorphic testing framework for content moderation software. OASIS employs 21 transform rules summarized from our pilot study on 5,000 real-world toxic contents collected from 4 popular social media applications, including Twitter, Instagram, Sina Weibo, and Baidu Tieba. Given toxic textual contents, OASIS can generate image test cases, which preserve the toxicity yet are likely to bypass moderation. In the evaluation, we employ OASIS to test five commercial textual content moderation software from famous companies (i.e., Google Cloud, Microsoft Azure, Baidu Cloud, Alibaba Cloud and Tencent Cloud), as well as a state-of-the-art moderation research model. The results show that OASIS achieves up to 100% error finding rates. Moreover, through retraining the models with the test cases generated by OASIS, the robustness of the moderation model can be improved without performance degradation.
    摘要 社交媒体平台的快速增长给人类社会的交流和内容传播带来了革命。然而，这些平台正被越来越多地滥用来传播恶意内容，包括仇恨言论、恶意广告和色情内容，造成了严重的负面后果，例如损害青少年的心理健康。尽管人们在开发和部署文本与图像内容审核方法上投入了大量努力，恶意用户仍可以通过把文本嵌入图像（例如对文本截屏并添加一些干扰）来规避审核。我们发现，现有内容审核软件在面对这类恶意输入时的表现仍缺乏研究。在这项工作中，我们提出了OASIS，一个面向内容审核软件的蜕变测试框架。OASIS使用了21条变换规则，这些规则总结自我们对5000条真实恶意内容的初步研究，这些内容收集自四个流行的社交媒体应用（Twitter、Instagram、新浪微博和百度贴吧）。给定恶意文本内容，OASIS可以生成保留毒性但很可能绕过审核的图像测试用例。在评估中，我们用OASIS测试了五家知名公司的商业文本内容审核软件（Google Cloud、Microsoft Azure、Baidu Cloud、Alibaba Cloud和Tencent Cloud），以及一个最先进的审核研究模型。结果显示，OASIS的错误发现率最高可达100%。此外，使用OASIS生成的测试用例对模型进行再训练，可以在不降低性能的情况下提高审核模型的鲁棒性。
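
A minimal sketch of the kind of test case OASIS produces, rendering text onto an image and applying a perturbation with Pillow; the single rotation used here is only an illustrative stand-in for the paper's 21 transform rules.

```python
from PIL import Image, ImageDraw, ImageFont

def text_to_image_testcase(text, angle=15, size=(512, 256)):
    """Render text onto an image and apply a simple perturbation."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    draw.text((10, size[1] // 2), text, fill="black", font=ImageFont.load_default())
    return img.rotate(angle, expand=True, fillcolor="white")

# usage: feed the resulting image to a moderation API and check whether the
# (known-toxic) text still slips through
img = text_to_image_testcase("example of text that should be flagged")
img.save("testcase.png")
```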

VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control

  • paper_url: http://arxiv.org/abs/2308.09804
  • repo_url: https://github.com/henryhzy/vl-pet
  • paper_authors: Zi-Yuan Hu, Yanyang Li, Michael R. Lyu, Liwei Wang
  • for: Addresses the rapidly growing training and storage costs of pre-trained language models (PLMs) in vision-and-language tasks by proposing the Vision-and-Language Parameter-Efficient Tuning (VL-PET) framework.
  • methods: Introduces a novel Granularity-Controlled Mechanism (GCM) that imposes effective control over modular modifications (e.g., Adapter and LoRA), together with lightweight PET module designs that enhance vision-language alignment and modeling for the encoders while maintaining text generation for the decoders.
  • results: On four image-text and four video-text tasks, the VL-PET framework and its lightweight PET module designs show strong efficiency, effectiveness, and transferability; in particular, VL-PET-large outperforms VL-Adapter by 2.92% (3.41%) and LoRA by 3.37% (7.03%) with BART-base (T5-base) on image-text tasks.
    Abstract As the model size of pre-trained language models (PLMs) grows rapidly, full fine-tuning becomes prohibitively expensive for model training and storage. In vision-and-language (VL), parameter-efficient tuning (PET) techniques are proposed to integrate modular modifications (e.g., Adapter and LoRA) into encoder-decoder PLMs. By tuning a small set of trainable parameters, these techniques perform on par with full fine-tuning. However, excessive modular modifications and neglecting the functionality gap between the encoders and decoders can lead to performance degradation, while existing PET techniques (e.g., VL-Adapter) overlook these critical issues. In this paper, we propose a Vision-and-Language Parameter-Efficient Tuning (VL-PET) framework to impose effective control over modular modifications via a novel granularity-controlled mechanism. Considering different granularity-controlled matrices generated by this mechanism, a variety of model-agnostic VL-PET modules can be instantiated from our framework for better efficiency and effectiveness trade-offs. We further propose lightweight PET module designs to enhance VL alignment and modeling for the encoders and maintain text generation for the decoders. Extensive experiments conducted on four image-text tasks and four video-text tasks demonstrate the efficiency, effectiveness and transferability of our VL-PET framework. In particular, our VL-PET-large with lightweight PET module designs significantly outperforms VL-Adapter by 2.92% (3.41%) and LoRA by 3.37% (7.03%) with BART-base (T5-base) on image-text tasks. Furthermore, we validate the enhanced effect of employing our VL-PET designs on existing PET techniques, enabling them to achieve significant performance improvements. Our code is available at https://github.com/HenryHZY/VL-PET.

Exploring the Power of Creative AI Tools and Game-Based Methodologies for Interactive Web-Based Programming

  • paper_url: http://arxiv.org/abs/2308.11649
  • repo_url: None
  • paper_authors: Benjamin Kenwright
  • for: Explores the potential of creative AI tools and game-based methodologies for interactive web-based programming, including enhanced learning experiences and increased user engagement.
  • methods: Surveys creative AI tools and game-based methodologies, such as AI-generated content and participatory, game-based approaches.
  • results: Through real-world applications and examples, the chapter examines the benefits and limitations of these tools and methods for web development, as well as their impact on user experience and engagement.
    Abstract In recent years, the fields of artificial intelligence and web-based programming have seen tremendous advancements, enabling developers to create dynamic and interactive websites and applications. At the forefront of these advancements, creative AI tools and game-based methodologies have emerged as potent instruments, promising enhanced user experiences and increased engagement in educational environments. This chapter explores the potential of these tools and methodologies for interactive web-based programming, examining their benefits, limitations, and real-world applications. We examine the challenges and ethical considerations that arise when integrating these technologies into web development, such as privacy concerns and the potential for bias in AI-generated content. Through this exploration, we aim to provide insights into the exciting possibilities that creative AI tools and game-based methodologies offer for the future of web-based programming.

Taken by Surprise: Contrast effect for Similarity Scores

  • paper_url: http://arxiv.org/abs/2308.09765
  • repo_url: https://github.com/meetelise/surprise-similarity
  • paper_authors: Thomas C. Bachlechner, Mario Martone, Marjorie Schillo
  • for: Improving the accuracy of similarity evaluation between object vector embeddings for natural language processing, information retrieval, and classification tasks.
  • methods: Proposes the surprise score, an ensemble-normalized similarity metric that accounts for the distribution of the ensemble from which objects are drawn, capturing the contrast effect in human perception of similarity.
  • results: On zero- and few-shot document classification tasks, the surprise score typically performs 10-15% better than raw cosine similarity.
    Abstract Accurately evaluating the similarity of object vector embeddings is of critical importance for natural language processing, information retrieval and classification tasks. Popular similarity scores (e.g cosine similarity) are based on pairs of embedding vectors and disregard the distribution of the ensemble from which objects are drawn. Human perception of object similarity significantly depends on the context in which the objects appear. In this work we propose the $\textit{surprise score}$, an ensemble-normalized similarity metric that encapsulates the contrast effect of human perception and significantly improves the classification performance on zero- and few-shot document classification tasks. This score quantifies the surprise to find a given similarity between two elements relative to the pairwise ensemble similarities. We evaluate this metric on zero/few shot classification and clustering tasks and typically find 10-15 % better performance compared to raw cosine similarity. Our code is available at https://github.com/MeetElise/surprise-similarity.
    摘要 准确评估对象向量嵌入（embedding）之间的相似性，对自然语言处理、信息检索和分类任务至关重要。流行的相似性分数（例如余弦相似性）只基于成对的嵌入向量，忽略了对象所来自的整体（ensemble）分布。而人类对物体相似性的感知在很大程度上取决于物体出现的上下文。在这项工作中，我们提出了"惊讶分数"（surprise score），一种经整体归一化的相似性度量，它刻画了人类感知中的对比效应。该分数衡量的是：相对于整体中成对相似性的分布，在两个元素之间观察到给定相似性有多"令人惊讶"。我们在零样本/少样本分类和聚类任务中评估了该分数，其通常比直接使用余弦相似性的性能高出10-15%。我们的代码可以在https://github.com/MeetElise/surprise-similarity找到。
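
A small sketch of an ensemble-normalized "surprise"-style score: the pairwise cosine similarity is judged against the distribution of the query's similarities to an ensemble. The z-score normalization here is an assumption for illustration; the exact formulation is in the authors' repository.

```python
import numpy as np

def surprise_score(query, candidate, ensemble):
    """Ensemble-normalized similarity between two embeddings.

    Scores cos(query, candidate) relative to the distribution of the query's
    similarities to an ensemble of embeddings (z-score used as a stand-in).
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    background = np.array([cos(query, e) for e in ensemble])
    s = cos(query, candidate)
    return (s - background.mean()) / (background.std() + 1e-8)

rng = np.random.default_rng(0)
ensemble = rng.standard_normal((100, 32))
q, c = rng.standard_normal(32), rng.standard_normal(32)
print(surprise_score(q, c, ensemble))
```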

The Impact of Background Removal on Performance of Neural Networks for Fashion Image Classification and Segmentation

  • paper_url: http://arxiv.org/abs/2308.09764
  • repo_url: None
  • paper_authors: Junhui Liang, Ying Liu, Vladimir Vlassov
  • for: Improving the quality of fashion image data and the performance of models trained on it.
  • methods: Uses salient object detection to remove the background from fashion images.
  • results: Background removal improves model accuracy in simple, shallow networks that are not prone to overfitting (by up to 5% on FashionStyle14 when training from scratch), but it does not work well in deep networks because it is incompatible with other regularization techniques such as batch normalization, pre-trained initialization, and randomness-introducing data augmentations.
    Abstract Fashion understanding is a hot topic in computer vision, with many applications having great business value in the market. Fashion understanding remains a difficult challenge for computer vision due to the immense diversity of garments and various scenes and backgrounds. In this work, we try removing the background from fashion images to boost data quality and increase model performance. Having fashion images of evident persons in fully visible garments, we can utilize Salient Object Detection to achieve the background removal of fashion data to our expectations. A fashion image with the background removed is claimed as the "rembg" image, contrasting with the original one in the fashion dataset. We conducted extensive comparative experiments with these two types of images on multiple aspects of model training, including model architectures, model initialization, compatibility with other training tricks and data augmentations, and target task types. Our experiments show that background removal can effectively work for fashion data in simple and shallow networks that are not susceptible to overfitting. It can improve model accuracy by up to 5% in the classification on the FashionStyle14 dataset when training models from scratch. However, background removal does not perform well in deep neural networks due to incompatibility with other regularization techniques like batch normalization, pre-trained initialization, and data augmentations introducing randomness. The loss of background pixels invalidates many existing training tricks in the model training, adding the risk of overfitting for deep models.
    摘要 《时尚理解》是计算机视觉领域的热门话题,具有广泛的商业价值。然而,时尚理解仍然是计算机视觉中的一个困难挑战,因为裙服的多样性和不同的场景和背景。在这项工作中,我们尝试将背景从时尚图像中除去,以提高数据质量并提高模型性能。利用有 evident persons 穿着完整的裙服图像,我们可以使用 Salient Object Detection 来实现背景的除去。一个没有背景的时尚图像被称为 "rembg" 图像,与原始图像在时尚数据集中进行比较。我们进行了多种比较实验,包括模型架构、模型初始化、与其他训练技巧和数据扩展相容性等多个方面。我们的实验结果表明,背景除去可以有效地对时尚数据进行简单化和浅化,提高模型精度。在 FashionStyle14 数据集上进行类别预测时,背景除去可以提高模型的准确率达到 5%。然而,背景除去不适合深度神经网络,因为它们与其他正则化技术,如批处理标准化、预训练初始化和数据扩展引入随机性,不兼容。失去背景像素会让许多现有的训练技巧无法使用,增加深度模型难以避免过拟合的风险。
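
A minimal preprocessing sketch for producing "rembg" images, assuming the `rembg` package's `remove()` interface (salient-object-based background removal); the paper's exact pipeline and compositing choices may differ.

```python
# assumed dependencies: pip install rembg pillow
from pathlib import Path
from PIL import Image
from rembg import remove

def make_rembg_dataset(src_dir: str, dst_dir: str) -> None:
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.jpg"):
        with Image.open(path) as img:
            cutout = remove(img)                     # salient object kept, background dropped
            # composite onto white so downstream models still see 3-channel input
            white = Image.new("RGB", cutout.size, "white")
            white.paste(cutout, mask=cutout.split()[-1])
            white.save(dst / path.name)

# usage: make_rembg_dataset("fashion_images/", "fashion_images_rembg/")
```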

Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization

  • paper_url: http://arxiv.org/abs/2308.09716
  • repo_url: https://github.com/soumik-kanad/diff2lip
  • paper_authors: Soumik Mukhopadhyay, Saksham Suri, Ravi Teja Gadde, Abhinav Shrivastava
  • for: lip synchronization in-the-wild, preserving identity, pose, emotions, and image quality
  • methods: audio-conditioned diffusion-based model, trained on Voxceleb2 dataset
  • results: outperforms popular methods like Wav2Lip and PC-AVS in FID metric and MOS of users, results on both reconstruction and cross settings on Voxceleb2 and LRW datasets.
    Abstract The task of lip synchronization (lip-sync) seeks to match the lips of human faces with different audio. It has various applications in the film industry as well as for creating virtual avatars and for video conferencing. This is a challenging problem as one needs to simultaneously introduce detailed, realistic lip movements while preserving the identity, pose, emotions, and image quality. Many of the previous methods trying to solve this problem suffer from image quality degradation due to a lack of complete contextual information. In this paper, we present Diff2Lip, an audio-conditioned diffusion-based model which is able to do lip synchronization in-the-wild while preserving these qualities. We train our model on Voxceleb2, a video dataset containing in-the-wild talking face videos. Extensive studies show that our method outperforms popular methods like Wav2Lip and PC-AVS in Fr\'echet inception distance (FID) metric and Mean Opinion Scores (MOS) of the users. We show results on both reconstruction (same audio-video inputs) as well as cross (different audio-video inputs) settings on Voxceleb2 and LRW datasets. Video results and code can be accessed from our project page ( https://soumik-kanad.github.io/diff2lip ).
    摘要 “lip sync”的任务是将不同的音频与人脸的肢体动作相对应。它在电影业中以及创建虚拟人偶和视讯会议中扮演重要的角色。这是一个具有挑战性的问题,因为需要同时实现细节、现实的舌头运动,并保留人脸的身份、姿势、情感和图像质量。许多以前的方法尝试解决这个问题,却受到图像质量下降的问题。在这篇论文中,我们提出了Diff2Lip,一个音频条件的扩散模型,可以在实际环境中进行lip sync,并保持这些质量。我们在Voxceleb2 dataset上训练我们的模型,该dataset包含实际环境中的说话面孔录影片。广泛的研究表明,我们的方法在FID和用户的意见评分(MOS)中优于流行的Wav2Lip和PC-AVS方法。我们在Voxceleb2和LRW datasets上进行了重建(同一对 audio-video 输入)和跨(不同 audio-video 输入)的研究,并提供了视频结果和代码。更多资讯可以在我们的项目页面(https://soumik-kanad.github.io/diff2lip)上获取。”

SimDA: Simple Diffusion Adapter for Efficient Video Generation

  • paper_url: http://arxiv.org/abs/2308.09710
  • repo_url: None
  • paper_authors: Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang
  • for: Proposes a parameter-efficient Text-to-Video (T2V) model that adapts an existing Text-to-Image (T2I) model to video generation via a Simple Diffusion Adapter (SimDA).
  • methods: Designs light-weight spatial and temporal adapters for transfer learning and replaces the original spatial attention with Latent-Shift Attention (LSA) for temporal consistency, so that only a small fraction of the model's parameters need fine-tuning.
  • results: The method supports video generation in the wild as well as one-shot video editing with only about 2 minutes of tuning, and with a further-trained video super-resolution model it can produce high-definition (1024x1024) videos.
    Abstract The recent wave of AI-generated content has witnessed the great development and success of Text-to-Image (T2I) technologies. By contrast, Text-to-Video (T2V) still falls short of expectations though attracting increasing interests. Existing works either train from scratch or adapt large T2I model to videos, both of which are computation and resource expensive. In this work, we propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way. In particular, we turn the T2I model for T2V by designing light-weight spatial and temporal adapters for transfer learning. Besides, we change the original spatial attention to the proposed Latent-Shift Attention (LSA) for temporal consistency. With similar model architecture, we further train a video super-resolution model to generate high-definition (1024x1024) videos. In addition to T2V generation in the wild, SimDA could also be utilized in one-shot video editing with only 2 minutes tuning. Doing so, our method could minimize the training effort with extremely few tunable parameters for model adaptation.
    摘要 近期人工智能生成内容的浪潮见证了文本到图像（T2I）技术的巨大发展和成功。相比之下，文本到视频（T2V）虽然受到越来越多的关注，但仍未达到预期。现有工作要么从零开始训练，要么将大型T2I模型改造用于视频生成，两者都需要大量计算和资源。在这项工作中，我们提出了一种简单的扩散适配器（SimDA），只需微调1.1B参数中的24M个，即可以参数高效的方式将强大的T2I模型适配到视频生成。具体来说，我们设计了轻量级的空间和时间适配器用于迁移学习。此外，我们将原始的空间注意力替换为我们提出的潜变量平移注意力（Latent-Shift Attention, LSA），以保证时间一致性。在相似的模型架构下，我们进一步训练了一个视频超分辨率模型，用于生成高清（1024x1024）视频。除了在真实场景中进行T2V生成外，SimDA还可用于单样本视频编辑，只需约2分钟的调整。如此一来，我们的方法能以极少的可调参数大幅减少模型适配的训练开销。
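
The spatial/temporal adapters follow the familiar bottleneck-adapter pattern for parameter-efficient transfer; the sketch below shows that pattern in PyTorch, with dimensions, initialization, and placement in the diffusion backbone chosen for illustration rather than taken from SimDA.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Generic bottleneck adapter: down-project -> nonlinearity -> up-project,
    added as a residual branch on frozen backbone features."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual adapter

# usage: wrap frozen T2I features of shape (batch, tokens, dim);
# only the adapter parameters would be trained
x = torch.randn(2, 16, 320)
print(Adapter(320)(x).shape)
```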

Graph of Thoughts: Solving Elaborate Problems with Large Language Models

  • paper_url: http://arxiv.org/abs/2308.09687
  • repo_url: https://github.com/spcl/graph-of-thoughts
  • paper_authors: Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, Torsten Hoefler
  • for: Advances the prompting capabilities of large language models (LLMs) beyond paradigms such as Chain-of-Thought and Tree of Thoughts (ToT).
  • methods: Models the information generated by an LLM as an arbitrary graph in which units of information ("LLM thoughts") are vertices and edges represent dependencies between them; this allows combining thoughts into synergistic outcomes, distilling the essence of whole networks of thoughts, or enhancing thoughts via feedback loops.
  • results: Compared with the state of the art, GoT improves sorting quality by 62% over ToT while reducing costs by more than 31%. It is extensible with new thought transformations and can be used to spearhead new prompting schemes, bringing LLM reasoning closer to human thinking and brain mechanisms such as recurrence, both of which form complex networks.
    Abstract We introduce Graph of Thoughts (GoT): a framework that advances prompting capabilities in large language models (LLMs) beyond those offered by paradigms such as Chain-of-Thought or Tree of Thoughts (ToT). The key idea and primary advantage of GoT is the ability to model the information generated by an LLM as an arbitrary graph, where units of information ("LLM thoughts") are vertices, and edges correspond to dependencies between these vertices. This approach enables combining arbitrary LLM thoughts into synergistic outcomes, distilling the essence of whole networks of thoughts, or enhancing thoughts using feedback loops. We illustrate that GoT offers advantages over state of the art on different tasks, for example increasing the quality of sorting by 62% over ToT, while simultaneously reducing costs by >31%. We ensure that GoT is extensible with new thought transformations and thus can be used to spearhead new prompting schemes. This work brings the LLM reasoning closer to human thinking or brain mechanisms such as recurrence, both of which form complex networks.
    摘要 我们介绍 Graph of Thoughts(GoT):一个框架,它超越了链条思维和树思维(ToT)的概念,实现了大语言模型(LLM)的推问能力。GoT的关键想法和主要优势在于让LLM的资讯单位(“LLM思维”)成为随机图形的顶点,并将这些顶点之间的相互依赖关系表现为图形的边。这种方法可以结合不同的LLM思维,形成具有融合效果的结果,精炼出整个网络的思想核心,或者通过反馈循环进行思维提升。我们证明GoT在不同的任务上比ToT高质量62%,同时降低成本>31%。我们还证明GoT可扩展新的思维转换,因此可以用来开创新的推问方案。这个工作使得LLM的思维更加接近人类思维或脑机制,例如回传和回归,这些机制形成复杂的网络。
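
A toy illustration of the Graph-of-Thoughts abstraction, with thoughts as vertices, dependency edges to parents, and an aggregation step that merges several thoughts; this is a conceptual sketch, not the API of the spcl/graph-of-thoughts package.

```python
from dataclasses import dataclass, field

@dataclass
class Thought:
    content: str
    parents: list = field(default_factory=list)   # dependency edges

class GraphOfThoughts:
    def __init__(self):
        self.thoughts = []

    def generate(self, content, parents=()):
        t = Thought(content, list(parents))
        self.thoughts.append(t)
        return t

    def aggregate(self, parents, combine):
        # merge several thoughts into one synergistic outcome
        return self.generate(combine([p.content for p in parents]), parents)

g = GraphOfThoughts()
a = g.generate("sort the first half of the list")
b = g.generate("sort the second half of the list")
merged = g.aggregate([a, b], lambda parts: "merge: " + " | ".join(parts))
print(merged.content, "<-", [p.content for p in merged.parents])
```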

PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Robust 3D Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2308.09678
  • repo_url: https://github.com/hbing-l/posynda
  • paper_authors: Hanbing Liu, Jun-Yan He, Zhi-Qi Cheng, Wangmeng Xiang, Qize Yang, Wenhao Chai, Gaoang Wang, Xu Bao, Bin Luo, Yifeng Geng, Xuansong Xie
  • for: overcome the challenge of adapting 3D human pose estimators to new datasets without extensive target domain annotation.
  • methods: utilize a diffusion-centric structure to simulate the 3D pose distribution in the target domain, incorporate a multi-hypothesis network to create diverse pose hypotheses, and use target-specific source augmentation to obtain the target domain distribution data.
  • results: demonstrate competitive performance on benchmarks such as Human3.6M, MPI-INF-3DHP, and 3DPW, even comparable with the target-trained MixSTE model.
    Abstract The current 3D human pose estimators face challenges in adapting to new datasets due to the scarcity of 2D-3D pose pairs in target domain training sets. We present the \textit{Multi-Hypothesis \textbf{P}ose \textbf{Syn}thesis \textbf{D}omain \textbf{A}daptation} (\textbf{PoSynDA}) framework to overcome this issue without extensive target domain annotation. Utilizing a diffusion-centric structure, PoSynDA simulates the 3D pose distribution in the target domain, filling the data diversity gap. By incorporating a multi-hypothesis network, it creates diverse pose hypotheses and aligns them with the target domain. Target-specific source augmentation obtains the target domain distribution data from the source domain by decoupling the scale and position parameters. The teacher-student paradigm and low-rank adaptation further refine the process. PoSynDA demonstrates competitive performance on benchmarks, such as Human3.6M, MPI-INF-3DHP, and 3DPW, even comparable with the target-trained MixSTE model~\cite{zhang2022mixste}. This work paves the way for the practical application of 3D human pose estimation. The code is available at https://github.com/hbing-l/PoSynDA.
    摘要 当前的3D人体姿态估计器面临着适应新数据集的挑战,因为目标领域训练集中缺乏2D-3D姿态对应的数据。我们提出了\textbf{\textit{多种假设 pose synthesis domain adaptation}(PoSynDA)框架,以解决这个问题而无需大量目标领域注释。PoSynDA使用分散结构,在目标领域中模拟3D姿态分布,填充数据多样性的空隙。通过包含多种假设网络,PoSynDA创造了多个姿态假设,并将它们与目标领域进行对align。通过寻求目标特定的源增强,从源领域中获取了目标领域的分布数据。教师-学生论断和低级适应进一步细化过程。PoSynDA在 benchmark 上表现竞争力强,包括人类3.6M、MPI-INF-3DHP 和 3DPW,甚至与目标训练 MixSTE 模型相当。这项工作为3D人体姿态估计器的实际应用开辟了道路。代码可以在https://github.com/hbing-l/PoSynDA 上获取。

Unsupervised 3D Pose Estimation with Non-Rigid Structure-from-Motion Modeling

  • paper_url: http://arxiv.org/abs/2308.10705
  • repo_url: None
  • paper_authors: Haorui Ji, Hui Deng, Yuchao Dai, Hongdong Li
  • for: Proposes a new model of human pose deformation in motion together with an accompanying diffusion-based motion prior.
  • methods: Uses a mixed spatial-temporal NRSfMformer to jointly estimate a 3D reference skeleton and the per-frame skeleton deformation from a 2D observation sequence, summing the two to obtain the pose of each frame; a diffusion-model-based loss term ensures the pipeline learns the correct prior motion knowledge.
  • results: Experiments on mainstream datasets show superior performance, outperforming the state of the art.
    Abstract Most of the previous 3D human pose estimation work relied on the powerful memory capability of the network to obtain suitable 2D-3D mappings from the training data. Few works have studied the modeling of human posture deformation in motion. In this paper, we propose a new modeling method for human pose deformations and design an accompanying diffusion-based motion prior. Inspired by the field of non-rigid structure-from-motion, we divide the task of reconstructing 3D human skeletons in motion into the estimation of a 3D reference skeleton, and a frame-by-frame skeleton deformation. A mixed spatial-temporal NRSfMformer is used to simultaneously estimate the 3D reference skeleton and the skeleton deformation of each frame from 2D observations sequence, and then sum them to obtain the pose of each frame. Subsequently, a loss term based on the diffusion model is used to ensure that the pipeline learns the correct prior motion knowledge. Finally, we have evaluated our proposed method on mainstream datasets and obtained superior results outperforming the state-of-the-art.
    摘要 大多数前一代3D人姿估计工作都依赖于网络的强大记忆能力来获得适合的2D-3D映射从训练数据中。只有一些工作研究了人姿变形的模型化。在这篇论文中,我们提出了一种新的人姿变形模型化方法和附加的扩散基于运动先驱模型。受非固定结构从动图像处理领域启发,我们将重建3D人 skeleton在运动中的任务分解为估计3D参照骨架和每帧骨架变形。使用混合空间-时间NRSfMformer来同时估计每帧的3D参照骨架和每帧骨架变形,然后将它们加总以获得每帧的姿势。最后,我们使用基于扩散模型的损失函数来确保管道学习正确的先驱动知识。我们在主流数据集上评估了我们的提议方法,并获得了比前一代更高的成绩。

GiGaMAE: Generalizable Graph Masked Autoencoder via Collaborative Latent Space Reconstruction

  • paper_url: http://arxiv.org/abs/2308.09663
  • repo_url: https://github.com/sycny/gigamae
  • paper_authors: Yucheng Shi, Yushun Dong, Qiaoyu Tan, Jundong Li, Ninghao Liu
  • for: The purpose of this paper is to propose a self-supervised learning method based on masked autoencoders to generate effective representations on graph data.
  • methods: The method uses a novel graph masked autoencoder framework called GiGaMAE, which learns to collaboratively reconstruct informative and integrated latent embeddings, together with a mutual-information-based reconstruction loss, to capture more generalized and comprehensive knowledge.
  • results: GiGaMAE outperforms state-of-the-art baselines on seven datasets for three downstream tasks. The researchers hope that these results will shed light on the design of foundation models on graph-structured data.
    Abstract Self-supervised learning with masked autoencoders has recently gained popularity for its ability to produce effective image or textual representations, which can be applied to various downstream tasks without retraining. However, we observe that the current masked autoencoder models lack good generalization ability on graph data. To tackle this issue, we propose a novel graph masked autoencoder framework called GiGaMAE. Different from existing masked autoencoders that learn node presentations by explicitly reconstructing the original graph components (e.g., features or edges), in this paper, we propose to collaboratively reconstruct informative and integrated latent embeddings. By considering embeddings encompassing graph topology and attribute information as reconstruction targets, our model could capture more generalized and comprehensive knowledge. Furthermore, we introduce a mutual information based reconstruction loss that enables the effective reconstruction of multiple targets. This learning objective allows us to differentiate between the exclusive knowledge learned from a single target and common knowledge shared by multiple targets. We evaluate our method on three downstream tasks with seven datasets as benchmarks. Extensive experiments demonstrate the superiority of GiGaMAE against state-of-the-art baselines. We hope our results will shed light on the design of foundation models on graph-structured data. Our code is available at: https://github.com/sycny/GiGaMAE.
    摘要 自我监督学习中使用遮盖自动编码器最近受到关注,因为它可以生成有效的图像或文本表示,可以应用于多个下游任务无需重新训练。然而,我们发现当前的遮盖自动编码器模型对图数据的泛化能力不佳。为解决这个问题,我们提出了一种新的图masked autoencoder框架,即GiGaMAE。与现有的遮盖自动编码器不同,我们在这篇论文中提议使用相互重建有用和完整的嵌入表示。我们考虑嵌入包括图型和属性信息作为重建目标,从而使我们的模型捕捉更加普遍和全面的知识。此外,我们引入了基于mutual information的重建损失,该损失函数允许我们有效地重建多个目标。这个学习目标使我们能够区分单个目标学习的独特知识和多个目标共享的通用知识。我们在三个下游任务上进行了七个数据集的测试,并对比了现有的基eline。广泛的实验结果表明GiGaMAE的超越性。我们希望我们的结果可以引导基于图结构数据的基础模型的设计。我们的代码可以在https://github.com/sycny/GiGaMAE中找到。

Tree-of-Mixed-Thought: Combining Fast and Slow Thinking for Multi-hop Visual Reasoning

  • paper_url: http://arxiv.org/abs/2308.09658
  • repo_url: None
  • paper_authors: Pengbo Hu, Ji Qi, Xingyu Li, Hong Li, Xinqi Wang, Bing Quan, Ruiyu Wang, Yi Zhou
  • for: Proposes an LLM-based planning algorithm for the plan-search problem in complex multi-hop visual reasoning.
  • methods: Inspired by the dual-process view of human cognition (fast and slow thinking), the hierarchical plan-search algorithm combines one-stop reasoning (fast) with Tree-of-thought search (slow).
  • results: The proposed algorithm improves both performance and efficiency, significantly reducing inference steps. The authors also repurpose the PTR and CLEVER datasets into a systematic framework for evaluating the performance and efficiency of LLM-based plan-search algorithms at different levels of difficulty.
    Abstract There emerges a promising trend of using large language models (LLMs) to generate code-like plans for complex inference tasks such as visual reasoning. This paradigm, known as LLM-based planning, provides flexibility in problem solving and endows better interpretability. However, current research is mostly limited to basic scenarios of simple questions that can be straightforward answered in a few inference steps. Planning for the more challenging multi-hop visual reasoning tasks remains under-explored. Specifically, under multi-hop reasoning situations, the trade-off between accuracy and the complexity of plan-searching becomes prominent. The prevailing algorithms either address the efficiency issue by employing the fast one-stop generation or adopt a complex iterative generation method to improve accuracy. Both fail to balance the need for efficiency and performance. Drawing inspiration from the dual system of cognition in the human brain, the fast and the slow think processes, we propose a hierarchical plan-searching algorithm that integrates the one-stop reasoning (fast) and the Tree-of-thought (slow). Our approach succeeds in performance while significantly saving inference steps. Moreover, we repurpose the PTR and the CLEVER datasets, developing a systematic framework for evaluating the performance and efficiency of LLMs-based plan-search algorithms under reasoning tasks at different levels of difficulty. Extensive experiments demonstrate the superiority of our proposed algorithm in terms of performance and efficiency. The dataset and code will be release soon.
    摘要 出现了一种扩展大型语言模型(LLM)用于生成复杂推理任务的代码化计划的趋势,这种趋势被称为LLM-based planning。这种方法具有更多的问题解决方式和更好的可读性。然而,当前的研究主要集中在基本的单步问题上,尚未深入研究复杂的多步推理任务。在多步推理任务中,精度和计划搜索的复杂度之间存在明显的负担。目前的算法可以通过快速一站式生成或者使用迭代生成方法来提高精度,但都无法平衡精度和性能的需求。引用人类大脑中的双系统认知模型,我们提出了一种层次的计划搜索算法,将快速一站式推理(快)和树状思维(慢)相结合。我们的方法在性能和效率两个方面具有优势,并且对不同难度水平的推理任务进行系统性评估。我们对PTR和CLEVER数据集进行了修改和扩展,并开发了一个系统性的评估框架。广泛的实验表明,我们的提posed算法在性能和效率两个方面具有优势。数据集和代码将很快发布。
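
The fast/slow hierarchy can be summarized as: attempt a cheap one-stop plan and escalate to tree search only when it looks unreliable. The sketch below uses placeholder functions (`generate_plan`, `score_plan`, `tree_of_thought_search`) rather than the paper's implementation.

```python
def solve(question, generate_plan, score_plan, tree_of_thought_search,
          confidence_threshold=0.8):
    fast_plan = generate_plan(question)              # fast, one-stop generation
    if score_plan(question, fast_plan) >= confidence_threshold:
        return fast_plan                             # cheap path: few inference steps
    return tree_of_thought_search(question)          # slow, deliberate search
```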

Robust Uncertainty Quantification using Conformalised Monte Carlo Prediction

  • paper_url: http://arxiv.org/abs/2308.09647
  • repo_url: https://github.com/team-daniel/mc-cp
  • paper_authors: Daniel Bethell, Simos Gerasimou, Radu Calinescu
  • for: 这篇论文主要用于推动深度学习模型在安全关键应用中的部署,并提供了一种可靠的评估方法来确保这些模型的可靠运行。
  • methods: 本论文提出了一种新的混合型不确定性评估方法(MC-CP),它将适应MC dropout方法和确定性预测(CP)相结合,以提高不确定性评估的精度和效率。
  • results: 经过广泛的实验评估,MC-CP方法在分类和回归benchmark中都达到了显著的改进,与先前的高级不确定性评估方法相比,如MC dropout、RAPS和CQR。MC-CP方法可以轻松地添加到现有的模型中,使其部署非常简单。
    Abstract Deploying deep learning models in safety-critical applications remains a very challenging task, mandating the provision of assurances for the dependable operation of these models. Uncertainty quantification (UQ) methods estimate the model's confidence per prediction, informing decision-making by considering the effect of randomness and model misspecification. Despite the advances of state-of-the-art UQ methods, they are computationally expensive or produce conservative prediction sets/intervals. We introduce MC-CP, a novel hybrid UQ method that combines a new adaptive Monte Carlo (MC) dropout method with conformal prediction (CP). MC-CP adaptively modulates the traditional MC dropout at runtime to save memory and computation resources, enabling predictions to be consumed by CP, yielding robust prediction sets/intervals. Throughout comprehensive experiments, we show that MC-CP delivers significant improvements over advanced UQ methods, like MC dropout, RAPS and CQR, both in classification and regression benchmarks. MC-CP can be easily added to existing models, making its deployment simple.
    摘要 部署深度学习模型在安全关键应用中仍然是一个非常困难的任务，需要为这些模型的可靠运行提供保证。不确定性量化（UQ）方法可以估计模型每个预测结果的可信度，在考虑随机性和模型设定误差的情况下为决策提供指导。尽管最先进的UQ方法已取得进展，但它们仍然计算成本高昂，或产生过于保守的预测集/区间。我们介绍MC-CP，一种新的混合UQ方法，它将一种新的自适应蒙特卡洛（MC）dropout方法与保形预测（CP）相结合。MC-CP在运行时自适应地调节传统的MC dropout，以节省内存和计算资源，使预测结果可以交由CP处理，从而生成稳健的预测集/区间。经过广泛的实验，我们发现MC-CP在分类和回归基准上都比先进的UQ方法（如MC dropout、RAPS和CQR）有显著改进。MC-CP可以轻松地添加到现有模型中，因此其部署非常简单。
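
A compact sketch of the idea of stacking conformal prediction on top of MC-dropout predictions for regression; the split-conformal recipe and the fixed number of stochastic passes below are simplifications of MC-CP's adaptive scheme, with all names chosen for illustration.

```python
import numpy as np

def mc_dropout_predict(model_sample_fn, x, n_passes=30):
    """Mean prediction over stochastic forward passes (dropout kept active).

    model_sample_fn(x) should return one stochastic prediction per call.
    """
    return np.mean([model_sample_fn(x) for _ in range(n_passes)], axis=0)

def conformal_interval(model_sample_fn, x_cal, y_cal, x_test, alpha=0.1):
    """Split conformal prediction on top of MC-dropout point predictions."""
    preds_cal = np.array([mc_dropout_predict(model_sample_fn, x) for x in x_cal])
    scores = np.abs(y_cal - preds_cal)                 # nonconformity scores
    n = len(scores)
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    pred = mc_dropout_predict(model_sample_fn, x_test)
    return pred - q, pred + q                          # ~(1 - alpha) coverage

# toy usage with a noisy stand-in for a dropout-enabled model
rng = np.random.default_rng(0)
noisy_model = lambda x: 2.0 * x + rng.normal(scale=0.1)
x_cal = rng.uniform(0, 1, size=50)
y_cal = 2.0 * x_cal + rng.normal(scale=0.1, size=50)
print(conformal_interval(noisy_model, x_cal, y_cal, x_test=0.5))
```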