cs.CL - 2023-08-20

CharacterChat: Learning towards Conversational AI with Personalized Social Support

  • paper_url: http://arxiv.org/abs/2308.10278
  • repo_url: https://github.com/morecry/characterchat
  • paper_authors: Quan Tu, Chuanqi Chen, Jinpeng Li, Yanran Li, Shuo Shang, Dongyan Zhao, Ran Wang, Rui Yan
  • for: Providing personalized social support through conversational AI.
  • methods: Uses MBTI-based persona decomposition and a virtual character bank (the MBTI-1024 Bank) to build a persona- and memory-driven conversational system (CharacterChat), plus an interpersonal matching mechanism that pairs individuals with persona-compatible virtual supporters (a minimal matching sketch follows the abstract).
  • results: Experiments show that CharacterChat provides effective personalized social support, and that the interpersonal matching mechanism brings substantial advantages.
    Abstract In our modern, fast-paced, and interconnected world, the importance of mental well-being has grown into a matter of great urgency. However, traditional methods such as Emotional Support Conversations (ESC) face challenges in effectively addressing a diverse range of individual personalities. In response, we introduce the Social Support Conversation (S2Conv) framework. It comprises a series of support agents and the interpersonal matching mechanism, linking individuals with persona-compatible virtual supporters. Utilizing persona decomposition based on the MBTI (Myers-Briggs Type Indicator), we have created the MBTI-1024 Bank, a group of virtual characters with distinct profiles. Through improved role-playing prompts with behavior preset and dynamic memory, we facilitate the development of the MBTI-S2Conv dataset, which contains conversations between the characters in the MBTI-1024 Bank. Building upon these foundations, we present CharacterChat, a comprehensive S2Conv system, which includes a conversational model driven by personas and memories, along with an interpersonal matching plugin model that dispatches the optimal supporters from the MBTI-1024 Bank for individuals with specific personas. Empirical results indicate the remarkable efficacy of CharacterChat in providing personalized social support and highlight the substantial advantages derived from interpersonal matching. The source code is available at https://github.com/morecry/CharacterChat.
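One simple way to realize the interpersonal matching idea is to embed persona descriptions and rank supporters by cosine similarity, as sketched below. The encoder and plain-text profiles are illustrative assumptions; the paper trains a dedicated matching plugin model rather than using an off-the-shelf encoder.

```python
# A minimal sketch of persona-based supporter matching: embed the user's
# persona and every supporter profile, then dispatch the nearest supporter.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

supporter_profiles = [
    "INFJ: empathetic listener, prefers deep one-on-one conversations",
    "ESTP: energetic problem-solver, offers direct practical advice",
]
user_persona = "Introverted student, anxious about exams, wants to be heard"

profile_emb = encoder.encode(supporter_profiles, convert_to_tensor=True)
user_emb = encoder.encode(user_persona, convert_to_tensor=True)

scores = util.cos_sim(user_emb, profile_emb)[0]   # similarity per supporter
best = int(scores.argmax())
print(f"dispatching supporter {best}: {supporter_profiles[best]}")
```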

Scaled-up Discovery of Latent Concepts in Deep NLP Models

  • paper_url: http://arxiv.org/abs/2308.10263
  • repo_url: None
  • paper_authors: Majd Hawasly, Fahim Dalvi, Nadir Durrani
  • for: Comparing clustering algorithms for discovering the concepts encoded in the representations of pre-trained language models (pLMs).
  • methods: Compares three clustering algorithms, Agglomerative Hierarchical Clustering, the Leaders Algorithm, and K-Means Clustering, in their capacity to unveil encoded concepts through their alignment with human-defined ontologies (a minimal clustering sketch follows the abstract).
  • results: K-Means has the potential to scale to very large datasets, allowing rich latent concept discovery at both the word and phrase level.
    Abstract Pre-trained language models (pLMs) learn intricate patterns and contextual dependencies via unsupervised learning on vast text data, driving breakthroughs across NLP tasks. Despite these achievements, these models remain black boxes, necessitating research into understanding their decision-making processes. Recent studies explore representation analysis by clustering latent spaces within pre-trained models. However, these approaches are limited in terms of scalability and the scope of interpretation because of high computation costs of clustering algorithms. This study focuses on comparing clustering algorithms for the purpose of scaling encoded concept discovery of representations from pLMs. Specifically, we compare three algorithms in their capacity to unveil the encoded concepts through their alignment to human-defined ontologies: Agglomerative Hierarchical Clustering, Leaders Algorithm, and K-Means Clustering. Our results show that K-Means has the potential to scale to very large datasets, allowing rich latent concept discovery, both on the word and phrase level.
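A minimal sketch of latent-concept discovery by clustering contextual token embeddings from a pre-trained LM. The model choice, cluster count, and mini-batch K-Means variant (chosen for scalability) are illustrative assumptions, not the authors' setup.

```python
# Extract contextual token embeddings from a pLM, then cluster them so that
# each cluster approximates an encoded latent concept.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.cluster import MiniBatchKMeans

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased").eval()

sentences = ["The bank approved the loan.", "They sat on the river bank."]
tokens, vectors = [], []
with torch.no_grad():
    for s in sentences:
        enc = tokenizer(s, return_tensors="pt")
        hidden = model(**enc).last_hidden_state[0]            # (seq_len, 768)
        for tok, vec in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), hidden):
            if tok not in ("[CLS]", "[SEP]"):
                tokens.append(tok)
                vectors.append(vec.numpy())

# Mini-batch K-Means keeps memory bounded, which is what lets K-Means-style
# clustering scale to very large corpora.
labels = MiniBatchKMeans(n_clusters=4, random_state=0).fit_predict(vectors)
for tok, lab in sorted(zip(tokens, labels), key=lambda t: t[1]):
    print(lab, tok)
```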

How Good Are Large Language Models at Out-of-Distribution Detection?

  • paper_url: http://arxiv.org/abs/2308.10261
  • repo_url: https://github.com/chrisneagu/FTC-Skystone-Dark-Angels-Romania-2020
  • paper_authors: Bo Liu, Liming Zhan, Zexin Lu, Yujie Feng, Lei Xue, Xiao-Ming Wu
  • for: Investigating how reliably large language models (LLMs) support out-of-distribution (OOD) detection, focusing on the LLaMA series from 7B to 65B.
  • methods: Thoroughly evaluates commonly used OOD detectors in both zero-grad and fine-tuning scenarios, and replaces the previous discriminative in-distribution fine-tuning with generative fine-tuning, aligning the pre-training objective of LLMs with downstream tasks.
  • results: A simple cosine-distance OOD detector demonstrates superior efficacy, outperforming other detectors; the authors attribute this to the isotropic nature of LLM embedding spaces, in contrast to the anisotropy of smaller BERT-family models, an insight that improves LLMs' adaptability and reliability in dynamic environments. (A minimal cosine-distance detector sketch follows the abstract.)
    Abstract Out-of-distribution (OOD) detection plays a vital role in enhancing the reliability of machine learning (ML) models. The emergence of large language models (LLMs) has catalyzed a paradigm shift within the ML community, showcasing their exceptional capabilities across diverse natural language processing tasks. While existing research has probed OOD detection with relative small-scale Transformers like BERT, RoBERTa and GPT-2, the stark differences in scales, pre-training objectives, and inference paradigms call into question the applicability of these findings to LLMs. This paper embarks on a pioneering empirical investigation of OOD detection in the domain of LLMs, focusing on LLaMA series ranging from 7B to 65B in size. We thoroughly evaluate commonly-used OOD detectors, scrutinizing their performance in both zero-grad and fine-tuning scenarios. Notably, we alter previous discriminative in-distribution fine-tuning into generative fine-tuning, aligning the pre-training objective of LLMs with downstream tasks. Our findings unveil that a simple cosine distance OOD detector demonstrates superior efficacy, outperforming other OOD detectors. We provide an intriguing explanation for this phenomenon by highlighting the isotropic nature of the embedding spaces of LLMs, which distinctly contrasts with the anisotropic property observed in smaller BERT family models. The new insight enhances our understanding of how LLMs detect OOD data, thereby enhancing their adaptability and reliability in dynamic environments.
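A minimal sketch of a cosine-distance OOD detector of the kind the paper finds effective, assuming sentence embeddings from an LLM are already available; the class-centroid scoring rule is an illustrative simplification.

```python
# Score inputs by cosine distance to in-distribution class centroids:
# the larger the distance to every centroid, the more likely OOD.
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fit_centroids(id_embeddings, id_labels):
    """Mean (then unit-normalized) embedding per in-distribution class."""
    return {c: normalize(id_embeddings[id_labels == c].mean(axis=0))
            for c in np.unique(id_labels)}

def ood_score(embedding, centroids):
    """Higher score = more likely OOD (1 - max cosine similarity)."""
    z = normalize(embedding)
    sims = [float(z @ c) for c in centroids.values()]
    return 1.0 - max(sims)

# Usage: flag inputs whose score exceeds a threshold picked on a held-out
# in-distribution set (e.g., the 95th percentile of ID scores).
```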

StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

  • paper_url: http://arxiv.org/abs/2308.10253
  • repo_url: https://github.com/icoz69/stablellava
  • paper_authors: Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Guosheng Lin, Chunhua Shen, Ling Chen, Yunchao Wei
  • for: Developing multimodal large language models (LLMs) that align visual and textual modalities effectively while comprehending human instructions.
  • methods: A novel data collection method that synchronously synthesizes images and dialogues for visual instruction tuning, marrying a text generation model with a text-to-image generative model to yield a diverse and controllable image-dialogue dataset (a minimal synthesis sketch follows the abstract).
  • results: Comprehensive experiments on various datasets, using the open-source LLaVA model as a testbed, show marked enhancements across more than ten commonly assessed capabilities.
    Abstract The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have sparked significant interest in the development of multimodal Large Language Models (LLMs). A primary research objective of such models is to align visual and textual modalities effectively while comprehending human instructions. Current methodologies often rely on annotations derived from benchmark datasets to construct image-dialogue datasets for training purposes, akin to instruction tuning in LLMs. However, these datasets often exhibit domain bias, potentially constraining the generative capabilities of the models. In an effort to mitigate these limitations, we propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models to yield a diverse and controllable dataset with varied image content. This not only provides greater flexibility compared to existing methodologies but also significantly enhances several model capabilities. Our research includes comprehensive experiments conducted on various datasets using the open-source LLAVA model as a testbed for our proposed pipeline. Our results underscore marked enhancements across more than ten commonly assessed capabilities,
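A minimal sketch of the synchronized image-dialogue synthesis idea: an LLM invents a scene plus a grounded dialogue, and a text-to-image model renders the matching image. The `chat_llm` helper is hypothetical (plug in any LLM API; here it returns canned JSON), and the Stable Diffusion checkpoint is an illustrative choice, not the paper's.

```python
# Synchronously synthesize one image-dialogue training pair.
import json
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def chat_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API. Canned JSON here."""
    return json.dumps({
        "image_prompt": "a golden retriever surfing a wave at sunset",
        "dialogue": {"question": "What is the dog doing?",
                     "answer": "It is surfing a wave at sunset."},
    })

seed_instruction = (
    "Invent a scene. Return JSON with keys 'image_prompt' and 'dialogue', "
    "where 'dialogue' is a question-answer pair grounded in the scene."
)

record = json.loads(chat_llm(seed_instruction))
image = pipe(record["image_prompt"]).images[0]   # synthesized training image
image.save("sample.png")
# `record["dialogue"]` plus `sample.png` form one image-dialogue pair for
# visual instruction tuning.
```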

Activation Addition: Steering Language Models Without Optimization

  • paper_url: http://arxiv.org/abs/2308.10248
  • repo_url: None
  • paper_authors: Alex Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, Monte MacDiarmid
  • for: Reliably controlling the behavior of large language models (LLMs), an open problem usually tackled with supervised finetuning, reinforcement learning from human feedback (RLHF), prompt engineering, or guided decoding; this work instead investigates activation engineering, modifying activations at inference time by adding a 'steering vector' implicitly specified through natural language.
  • methods: Unlike past work that learned steering vectors (Subramani et al. 2022; Hernandez et al. 2023), the Activation Addition (ActAdd) method computes them from the activation differences that result from pairs of prompts (a minimal sketch follows the abstract).
  • results: Demonstrated on GPT-2 with OpenWebText and ConceptNet, this inference-time approach controls high-level properties of the output while preserving off-target model performance; it requires far less compute and implementation effort than finetuning or RLHF, lets users provide natural-language specifications, and its overhead scales naturally with model size.
    Abstract Reliably controlling the behavior of large language models (LLMs) is a pressing open problem. Existing methods include supervised finetuning, reinforcement learning from human feedback (RLHF), prompt engineering and guided decoding. We instead investigate activation engineering: modifying activations at inference time to predictably alter model behavior. In particular, we bias the forward pass with an added 'steering vector' implicitly specified through natural language. Unlike past work which learned these steering vectors (Subramani, Suresh, and Peters 2022; Hernandez, Li, and Andreas 2023), our Activation Addition (ActAdd) method computes them by taking the activation differences that result from pairs of prompts. We demonstrate ActAdd on GPT-2 on OpenWebText and ConceptNet. Our inference-time approach yields control over high-level properties of output and preserves off-target model performance. It involves far less compute and implementation effort compared to finetuning or RLHF, allows users to provide natural language specifications, and its overhead scales naturally with model size.
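A minimal sketch of ActAdd on GPT-2: compute a steering vector as the activation difference of a contrast prompt pair (such as "Love"/"Hate") at one transformer block, then add it back during the forward pass at generation time. The layer index and coefficient are tunable; treat this as a sketch of the mechanism, not the authors' implementation.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER, COEFF = 6, 4.0    # illustrative choices

def layer_acts(text):
    """Residual-stream activations at the chosen block for a prompt."""
    store = {}
    hook = model.transformer.h[LAYER].register_forward_hook(
        lambda m, i, o: store.update(h=o[0].detach()))
    with torch.no_grad():
        model(**tok(text, return_tensors="pt"))
    hook.remove()
    return store["h"]

# Steering vector from the contrast pair, taken at the first token position.
steer = COEFF * (layer_acts("Love")[:, :1] - layer_acts("Hate")[:, :1])

def add_steer(module, inputs, output):
    hidden = output[0]
    if hidden.shape[1] > 1:                    # prompt pass only, not cached steps
        hidden[:, : steer.shape[1]] += steer   # bias the forward pass
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steer)
out = model.generate(**tok("I think dogs are", return_tensors="pt"),
                     max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```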

Indonesian Automatic Speech Recognition with XLSR-53

  • paper_url: http://arxiv.org/abs/2308.11589
  • repo_url: None
  • paper_authors: Panji Arisaputra, Amalia Zahra
  • for: The paper aims to develop an Indonesian Automatic Speech Recognition (ASR) system using the XLSR-53 pre-trained model to reduce the amount of training data required to achieve a competitive Word Error Rate (WER).
  • methods: The study fine-tunes the XLSR-53 pre-trained model on a combination of three datasets (TITML-IDN, Magic Data, and Common Voice) totaling 24 hours, 18 minutes, and 1 second of audio, and further improves the model with a language model (a minimal WER-evaluation sketch follows the abstract).
  • results: The study achieves a WER of 20%, competitive with similar models on the Common Voice test split; adding a language model lowers the WER from 20% to 12%, an 8-point absolute reduction, demonstrating that a better Indonesian ASR can be built from a smaller amount of data.
    Abstract This study focuses on the development of Indonesian Automatic Speech Recognition (ASR) using the XLSR-53 pre-trained model, where XLSR stands for cross-lingual speech representations. The use of this XLSR-53 pre-trained model is to significantly reduce the amount of training data in non-English languages required to achieve a competitive Word Error Rate (WER). The total amount of data used in this study is 24 hours, 18 minutes, and 1 second: (1) TITML-IDN 14 hours and 31 minutes; (2) Magic Data 3 hours and 33 minutes; and (3) Common Voice 6 hours, 14 minutes, and 1 second. With a WER of 20%, the model built in this study can compete with similar models using the Common Voice dataset split test. WER can be decreased by around 8% using a language model, reducing it from 20% to 12%. Thus, the results of this study improve upon previous research, contributing to the creation of a better Indonesian ASR with a smaller amount of data.
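A minimal sketch of the WER metric reported above, using jiwer as one common implementation choice (not necessarily the authors'); the transcripts are illustrative placeholders.

```python
# WER = word-level edit distance between reference and hypothesis,
# normalized by the number of reference words.
from jiwer import wer

references = ["selamat pagi semuanya", "terima kasih banyak"]
hypotheses = ["selamat pagi semua", "terima kasih banyak"]

print(f"WER: {wer(references, hypotheses):.1%}")
# A language model is typically applied on top of CTC outputs (e.g., via
# beam-search rescoring) to drive the WER down further, as in the paper's
# 20% -> 12% improvement.
```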

WMFormer++: Nested Transformer for Visible Watermark Removal via Implicit Joint Learning

  • paper_url: http://arxiv.org/abs/2308.10195
  • repo_url: None
  • paper_authors: Dongjian Huo, Zehong Zhang, Hanjing Su, Guanbin Li, Chaowei Fang, Qingyao Wu
  • for: Visible watermark removal, which serves as an adversarial means to enhance watermark robustness and advance media copyright protection.
  • methods: An implicit joint learning paradigm that integrates watermark localization and background restoration through a gate mechanism, with cross-channel attention for local detail restoration and holistic structural comprehension, and nested structures for multi-scale information.
  • results: Surpasses existing state-of-the-art methods by a large margin on various challenging benchmarks.
    Abstract Watermarking serves as a widely adopted approach to safeguard media copyright. In parallel, the research focus has extended to watermark removal techniques, offering an adversarial means to enhance watermark robustness and foster advancements in the watermarking field. Existing watermark removal methods mainly rely on UNet with task-specific decoder branches--one for watermark localization and the other for background image restoration. However, watermark localization and background restoration are not isolated tasks; precise watermark localization inherently implies regions necessitating restoration, and the background restoration process contributes to more accurate watermark localization. To holistically integrate information from both branches, we introduce an implicit joint learning paradigm. This empowers the network to autonomously navigate the flow of information between implicit branches through a gate mechanism. Furthermore, we employ cross-channel attention to facilitate local detail restoration and holistic structural comprehension, while harnessing nested structures to integrate multi-scale information. Extensive experiments are conducted on various challenging benchmarks to validate the effectiveness of our proposed method. The results demonstrate our approach's remarkable superiority, surpassing existing state-of-the-art methods by a large margin.

FoodGPT: A Large Language Model in Food Testing Domain with Incremental Pre-training and Knowledge Graph Prompt

  • paper_url: http://arxiv.org/abs/2308.10173
  • repo_url: None
  • paper_authors: Zhixiao Qi, Yijiong Yu, Meiqi Tu, Junyi Tan, Yongfeng Huang
  • for: Building a large language model for the food testing domain.
  • methods: An incremental pre-training step that injects structured knowledge and scanned domain-standard documents into the model, plus a knowledge graph serving as an external knowledge base to support retrieval and counter machine hallucination.
  • results: As a technical report of a pre-release version, the paper does not yet provide experimental data; specific results will be reported in future versions.
    Abstract Currently, the construction of large language models in specific domains is done by fine-tuning on a base model. Some models also incorporate knowledge bases without the need for pre-training. This is because the base model already contains domain-specific knowledge during the pre-training process. We build a large language model for food testing. Unlike the above approach, a significant amount of data in this domain exists in Scanning format for domain standard documents. In addition, there is a large amount of untrained structured knowledge. Therefore, we introduce an incremental pre-training step to inject this knowledge into a large language model. In this paper, we propose a method for handling structured knowledge and scanned documents in incremental pre-training. To overcome the problem of machine hallucination, we construct a knowledge graph to serve as an external knowledge base for supporting retrieval in the large language model. It is worth mentioning that this paper is a technical report of our pre-release version, and we will report our specific experimental data in future versions.

FashionNTM: Multi-turn Fashion Image Retrieval via Cascaded Memory

  • paper_url: http://arxiv.org/abs/2308.10170
  • repo_url: None
  • paper_authors: Anwesan Pal, Sahil Wadhwa, Ayush Jaiswal, Xu Zhang, Yue Wu, Rakesh Chada, Pradeep Natarajan, Henrik I. Christensen
  • for: Real-world multi-turn fashion image retrieval, where users iteratively provide feedback to refine results until they find an item that meets all their requirements.
  • methods: A novel memory-based method, FashionNTM, built on a new Cascaded Memory Neural Turing Machine (CM-NTM) for implicit state management, learning to integrate information across all past turns when retrieving images for a given turn. Unlike the vanilla Neural Turing Machine (NTM), the CM-NTM operates on multiple inputs that interact with their respective memories via individual read and write heads, allowing it to learn complex relationships.
  • results: The proposed method outperforms the previous state-of-the-art algorithm by 50.5% on Multi-turn FashionIQ, currently the only multi-turn fashion dataset, and achieves a relative improvement of 12.6% on Multi-turn Shoes, an extension of the single-turn Shoes dataset created in this work. Analysis in a real-world interactive setting demonstrates memory retention across turns and agnosticity to turn order for non-contradictory feedback, and a user study shows images retrieved by FashionNTM were favored by 83.1% over other multi-turn models. Project page: https://sites.google.com/eng.ucsd.edu/fashionntm
    Abstract Multi-turn textual feedback-based fashion image retrieval focuses on a real-world setting, where users can iteratively provide information to refine retrieval results until they find an item that fits all their requirements. In this work, we present a novel memory-based method, called FashionNTM, for such a multi-turn system. Our framework incorporates a new Cascaded Memory Neural Turing Machine (CM-NTM) approach for implicit state management, thereby learning to integrate information across all past turns to retrieve new images, for a given turn. Unlike vanilla Neural Turing Machine (NTM), our CM-NTM operates on multiple inputs, which interact with their respective memories via individual read and write heads, to learn complex relationships. Extensive evaluation results show that our proposed method outperforms the previous state-of-the-art algorithm by 50.5%, on Multi-turn FashionIQ -- the only existing multi-turn fashion dataset currently, in addition to having a relative improvement of 12.6% on Multi-turn Shoes -- an extension of the single-turn Shoes dataset that we created in this work. Further analysis of the model in a real-world interactive setting demonstrates two important capabilities of our model -- memory retention across turns, and agnosticity to turn order for non-contradictory feedback. Finally, user study results show that images retrieved by FashionNTM were favored by 83.1% over other multi-turn models. Project page: https://sites.google.com/eng.ucsd.edu/fashionntm

Head-to-Tail: How Knowledgeable are Large Language Models (LLM)? A.K.A. Will LLMs Replace Knowledge Graphs?

  • paper_url: http://arxiv.org/abs/2308.10168
  • repo_url: None
  • paper_authors: Kai Sun, Yifan Ethan Xu, Hanwen Zha, Yue Liu, Xin Luna Dong
  • for: Answering, from a new angle, the question of how knowledgeable LLMs really are, amid debates over reducing hallucination, increasing factuality, and whether knowledge graphs will be replaced by LLMs.
  • methods: The authors construct Head-to-Tail, a benchmark of 18K question-answer pairs covering head, torso, and tail facts in terms of popularity, together with an automated evaluation method and a set of metrics that closely approximate the knowledge an LLM confidently internalizes.
  • results: A comprehensive evaluation of 14 publicly available LLMs shows that existing LLMs are still far from perfect in their grasp of factual knowledge, especially for facts about torso-to-tail entities.
    Abstract Since the recent prosperity of Large Language Models (LLMs), there have been interleaved discussions regarding how to reduce hallucinations from LLM responses, how to increase the factuality of LLMs, and whether Knowledge Graphs (KGs), which store the world knowledge in a symbolic form, will be replaced with LLMs. In this paper, we try to answer these questions from a new angle: How knowledgeable are LLMs? To answer this question, we constructed Head-to-Tail, a benchmark that consists of 18K question-answer (QA) pairs regarding head, torso, and tail facts in terms of popularity. We designed an automated evaluation method and a set of metrics that closely approximate the knowledge an LLM confidently internalizes. Through a comprehensive evaluation of 14 publicly available LLMs, we show that existing LLMs are still far from being perfect in terms of their grasp of factual knowledge, especially for facts of torso-to-tail entities.

Bayes Risk Transducer: Transducer with Controllable Alignment Prediction

  • paper_url: http://arxiv.org/abs/2308.10107
  • repo_url: https://github.com/espnet/espnet
  • paper_authors: Jinchuan Tian, Jianwei Yu, Hangting Chen, Brian Yan, Chao Weng, Dong Yu, Shinji Watanabe
  • for: Achieving controllable alignment prediction in transducer-based speech recognition, so that the predicted alignment satisfies specific desired properties.
  • methods: The Bayes Risk Transducer (BRT) uses a Bayes risk function that assigns lower risk values to preferred paths, making the predicted alignment more likely to exhibit the desired properties.
  • results: The proposed BRT saves inference cost by up to 46% for non-streaming ASR and reduces overall system latency by 41% for streaming ASR.
    Abstract Automatic speech recognition (ASR) based on transducers is widely used. In training, a transducer maximizes the summed posteriors of all paths. The path with the highest posterior is commonly defined as the predicted alignment between the speech and the transcription. While the vanilla transducer does not have a prior preference for any of the valid paths, this work intends to enforce the preferred paths and achieve controllable alignment prediction. Specifically, this work proposes Bayes Risk Transducer (BRT), which uses a Bayes risk function to set lower risk values to the preferred paths so that the predicted alignment is more likely to satisfy specific desired properties. We further demonstrate that these predicted alignments with intentionally designed properties can provide practical advantages over the vanilla transducer. Experimentally, the proposed BRT saves inference cost by up to 46% for non-streaming ASR and reduces overall system latency by 41% for streaming ASR.

PACE: Improving Prompt with Actor-Critic Editing for Large Language Model

  • paper_url: http://arxiv.org/abs/2308.10088
  • repo_url: None
  • paper_authors: Yihong Dong, Kangcheng Luo, Xue Jiang, Zhi Jin, Ge Li
  • for: Improving the performance of large language models (LLMs) by editing prompts automatically.
  • methods: Prompt with Actor-Critic Editing (PACE), inspired by the actor-critic algorithm in reinforcement learning: LLMs play the dual roles of actor and critic, the prompt is conceptualized as a policy, and it is refined using feedback from actors executing the prompt and critics criticizing the responses (a minimal sketch of the loop follows the abstract).
  • results: Extensive experiments on 24 instruction-induction tasks and 21 big-bench tasks show that PACE raises the relative performance of medium/low-quality human-written prompts by up to 98%, reaching performance comparable to high-quality human-written prompts; PACE also exhibits notable efficacy for prompt generation.
    Abstract Large language models (LLMs) have showcased remarkable potential across various tasks by conditioning on prompts. However, the quality of different human-written prompts leads to substantial discrepancies in LLMs' performance, and improving prompts usually necessitates considerable human effort and expertise. To this end, this paper proposes Prompt with Actor-Critic Editing (PACE) for LLMs to enable automatic prompt editing. Drawing inspiration from the actor-critic algorithm in reinforcement learning, PACE leverages LLMs as the dual roles of actors and critics, conceptualizing prompt as a type of policy. PACE refines prompt, taking into account the feedback from both actors performing prompt and critics criticizing response. This process helps LLMs better align prompt to a specific task, thanks to real responses and thinking from LLMs. We conduct extensive experiments on 24 instruction induction tasks and 21 big-bench tasks. Experimental results indicate that PACE elevates the relative performance of medium/low-quality human-written prompts by up to 98\%, which has comparable performance to high-quality human-written prompts. Moreover, PACE also exhibits notable efficacy for prompt generation.
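The loop below is a minimal sketch of PACE-style actor-critic prompt editing. `llm` is a hypothetical stand-in for any chat-capable LLM API, and the meta-prompts and iteration count are illustrative guesses rather than the paper's exact formulations.

```python
def llm(text: str) -> str:
    raise NotImplementedError("plug in your LLM API here")

def pace_edit(prompt: str, task_input: str, iterations: int = 3) -> str:
    for _ in range(iterations):
        # Actor: execute the current prompt on the task input.
        response = llm(f"{prompt}\n\nInput: {task_input}")
        # Critic: judge how well the response satisfies the prompt's intent.
        critique = llm(
            "Critique the response below with respect to the instruction.\n"
            f"Instruction: {prompt}\nResponse: {response}"
        )
        # Editor: revise the prompt (the 'policy') using the critique.
        prompt = llm(
            "Rewrite the instruction so it better elicits the desired "
            f"behavior.\nInstruction: {prompt}\nCritique: {critique}\n"
            "Return only the rewritten instruction."
        )
    return prompt

# Usage: improved = pace_edit("Summarize the text.", some_document)
```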

cs.LG - 2023-08-20

Preserving Specificity in Federated Graph Learning for fMRI-based Neurological Disorder Identification

  • paper_url: http://arxiv.org/abs/2308.10302
  • repo_url: None
  • paper_authors: Junhao Zhang, Qianqian Wang, Xiaochuan Wang, Lishan Qiao, Mingxia Liu
  • for: Analyzing resting-state functional MRI (rs-fMRI) for brain disorder identification with graph neural networks (GNNs), using federated learning (FL) so that data from multiple imaging centers/sites never has to be centralized.
  • methods: A specificity-aware federated graph learning (SFGL) framework with a server and multiple clients/sites: each client holds a shared branch, whose parameters are sent to the server, and a personalized branch, whose parameters remain local, facilitating knowledge sharing while preserving site specificity. The shared branch learns dynamic rs-fMRI representations with a spatio-temporal attention graph isomorphism network; the personalized branch integrates vectorized demographics (age, gender, and education years) with functional connectivity networks, and the two branches' representations are fused for classification.
  • results: Experiments on two fMRI datasets with a total of 1,218 subjects show that SFGL outperforms several state-of-the-art methods.
    Abstract Resting-state functional magnetic resonance imaging (rs-fMRI) offers a non-invasive approach to examining abnormal brain connectivity associated with brain disorders. Graph neural network (GNN) gains popularity in fMRI representation learning and brain disorder analysis with powerful graph representation capabilities. Training a general GNN often necessitates a large-scale dataset from multiple imaging centers/sites, but centralizing multi-site data generally faces inherent challenges related to data privacy, security, and storage burden. Federated Learning (FL) enables collaborative model training without centralized multi-site fMRI data. Unfortunately, previous FL approaches for fMRI analysis often ignore site-specificity, including demographic factors such as age, gender, and education level. To this end, we propose a specificity-aware federated graph learning (SFGL) framework for rs-fMRI analysis and automated brain disorder identification, with a server and multiple clients/sites for federated model aggregation and prediction. At each client, our model consists of a shared and a personalized branch, where parameters of the shared branch are sent to the server while those of the personalized branch remain local. This can facilitate knowledge sharing among sites and also helps preserve site specificity. In the shared branch, we employ a spatio-temporal attention graph isomorphism network to learn dynamic fMRI representations. In the personalized branch, we integrate vectorized demographic information (i.e., age, gender, and education years) and functional connectivity networks to preserve site-specific characteristics. Representations generated by the two branches are then fused for classification. Experimental results on two fMRI datasets with a total of 1,218 subjects suggest that SFGL outperforms several state-of-the-art approaches.

An interpretable deep learning method for bearing fault diagnosis

  • paper_url: http://arxiv.org/abs/2308.10292
  • repo_url: None
  • paper_authors: Hao Lu, Austin M. Bray, Chao Hu, Andrew T. Zimmerman, Hongyi Xu
  • for: Improving the interpretability of deep learning (DL) models so they can give trustworthy recommendations in safety-critical maintenance tasks.
  • methods: A convolutional neural network (CNN) combined with Gradient-weighted Class Activation Mapping (Grad-CAM) visualizations forms an interpretable DL method for classifying bearing faults. After training, Grad-CAM identifies each training sample's feature importance, building a "health library" of training samples with annotated feature maps; at evaluation time, prediction basis samples are retrieved from the health library according to the similarity of feature importance (a minimal Grad-CAM sketch follows the abstract).
  • results: The method can be applied to any CNN model without modifying its architecture, and experiments show it selects prediction basis samples that are intuitively and physically meaningful, improving the model's trustworthiness for human users.
    Abstract Deep learning (DL) has gained popularity in recent years as an effective tool for classifying the current health and predicting the future of industrial equipment. However, most DL models have black-box components with an underlying structure that is too complex to be interpreted and explained to human users. This presents significant challenges when deploying these models for safety-critical maintenance tasks, where non-technical personnel often need to have complete trust in the recommendations these models give. To address these challenges, we utilize a convolutional neural network (CNN) with Gradient-weighted Class Activation Mapping (Grad-CAM) activation map visualizations to form an interpretable DL method for classifying bearing faults. After the model training process, we apply Grad-CAM to identify a training sample's feature importance and to form a library of diagnosis knowledge (or health library) containing training samples with annotated feature maps. During the model evaluation process, the proposed approach retrieves prediction basis samples from the health library according to the similarity of the feature importance. The proposed method can be easily applied to any CNN model without modifying the model architecture, and our experimental results show that this method can select prediction basis samples that are intuitively and physically meaningful, improving the model's trustworthiness for human users.
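A minimal Grad-CAM sketch of the kind the paper uses to annotate feature importance. The tiny CNN and random input are illustrative; as the paper notes, the technique applies to any CNN without architectural changes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1),              # target conv layer
    nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4),
)
target_layer = model[2]

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 1, 64, 64)          # stand-in for a vibration spectrogram
logits = model(x)
logits[0, logits.argmax()].backward()  # gradient of the predicted class

weights = grads["g"].mean(dim=(2, 3), keepdim=True)        # channel importance
cam = F.relu((weights * acts["a"].detach()).sum(dim=1))    # (1, H, W) heatmap
cam = cam / cam.max()                                       # normalize to [0, 1]
# `cam` highlights the input regions driving the fault prediction; the paper
# stores such maps per training sample to build its health library.
```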

Towards Few-shot Coordination: Revisiting Ad-hoc Teamplay Challenge In the Game of Hanabi

  • paper_url: http://arxiv.org/abs/2308.10284
  • repo_url: None
  • paper_authors: Hadi Nekoei, Xutong Zhao, Janarthanan Rajendran, Miao Liu, Sarath Chandar
  • for: This paper focuses on cooperative multi-agent reinforcement learning (MARL) algorithms with zero-shot coordination (ZSC) and their ability to adapt to unseen partners.
  • methods: The paper uses the popular cooperative multi-agent game Hanabi to evaluate the adaptability of MARL methods, creating a diverse set of pre-trained agents and a new metric, adaptation regret, which measures how efficiently an agent adapts and improves its coordination with a held-out pool of partners on top of its ZSC performance.
  • results: Naive Independent Q-Learning (IQL) agents adapt as quickly as the state-of-the-art ZSC algorithm Off-Belief Learning (OBL) in most cases, and two categories of hyper-parameters, those controlling training-data diversity and the optimization process, have a significant impact on the adaptability of Hanabi agents.
    Abstract Cooperative Multi-agent Reinforcement Learning (MARL) algorithms with Zero-Shot Coordination (ZSC) have gained significant attention in recent years. ZSC refers to the ability of agents to coordinate zero-shot (without additional interaction experience) with independently trained agents. While ZSC is crucial for cooperative MARL agents, it might not be possible for complex tasks and changing environments. Agents also need to adapt and improve their performance with minimal interaction with other agents. In this work, we show empirically that state-of-the-art ZSC algorithms have poor performance when paired with agents trained with different learning methods, and they require millions of interaction samples to adapt to these new partners. To investigate this issue, we formally defined a framework based on a popular cooperative multi-agent game called Hanabi to evaluate the adaptability of MARL methods. In particular, we created a diverse set of pre-trained agents and defined a new metric called adaptation regret that measures the agent's ability to efficiently adapt and improve its coordination performance when paired with some held-out pool of partners on top of its ZSC performance. After evaluating several SOTA algorithms using our framework, our experiments reveal that naive Independent Q-Learning (IQL) agents in most cases adapt as quickly as the SOTA ZSC algorithm Off-Belief Learning (OBL). This finding raises an interesting research question: How to design MARL algorithms with high ZSC performance and capability of fast adaptation to unseen partners. As a first step, we studied the role of different hyper-parameters and design choices on the adaptability of current MARL algorithms. Our experiments show that two categories of hyper-parameters controlling the training data diversity and optimization process have a significant impact on the adaptability of Hanabi agents.

Adaptive Uncertainty-Guided Model Selection for Data-Driven PDE Discovery

  • paper_url: http://arxiv.org/abs/2308.10283
  • repo_url: https://github.com/pongpisit-thanasutives/ubic
  • paper_authors: Pongpisit Thanasutives, Takashi Morita, Masayuki Numao, Ken-ichi Fukui
  • for: Selecting the parsimonious partial differential equation (PDE) that best governs noisy spatio-temporal observations with few reliable terms, using a new parameter-adaptive uncertainty-penalized Bayesian information criterion (UBIC).
  • methods: The UBIC penalizes a candidate PDE not only by its complexity but also by its quantified uncertainty, derived in a probabilistic view from the coefficient of variation of the model supports; physics-informed neural network learning serves as a simulation-based check that flexibly validates the selected PDE against the other discovered PDEs. (A hedged sketch of the criterion follows the abstract.)
  • results: Numerical results confirm that the UBIC identifies the true governing PDE, and denoising the observed data is found to improve the trade-off between the BIC score and model complexity.
    Abstract We propose a new parameter-adaptive uncertainty-penalized Bayesian information criterion (UBIC) to prioritize the parsimonious partial differential equation (PDE) that sufficiently governs noisy spatial-temporal observed data with few reliable terms. Since the naive use of the BIC for model selection has been known to yield an undesirable overfitted PDE, the UBIC penalizes the found PDE not only by its complexity but also the quantified uncertainty, derived from the model supports' coefficient of variation in a probabilistic view. We also introduce physics-informed neural network learning as a simulation-based approach to further validate the selected PDE flexibly against the other discovered PDE. Numerical results affirm the successful application of the UBIC in identifying the true governing PDE. Additionally, we reveal an interesting effect of denoising the observed data on improving the trade-off between the BIC score and model complexity. Code is available at https://github.com/Pongpisit-Thanasutives/UBIC.
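The following is a hedged sketch of an uncertainty-penalized BIC in the spirit of the UBIC. The exact penalty form and weighting come from the paper and its repository (https://github.com/Pongpisit-Thanasutives/UBIC); the coefficient-of-variation term below is only one plausible reading.

```python
import numpy as np

def bic(rss: float, n: int, k: int) -> float:
    """Gaussian-likelihood BIC: n*log(RSS/n) + k*log(n)."""
    return n * np.log(rss / n) + k * np.log(n)

def ubic(rss: float, n: int, coef_means, coef_stds, lam: float = 1.0) -> float:
    """BIC plus an uncertainty penalty from the coefficients' coefficient of
    variation (an assumed form, not the paper's exact definition)."""
    cv = np.abs(np.asarray(coef_stds) / np.asarray(coef_means))
    return bic(rss, n, len(coef_means)) + lam * n * float(cv.mean())

# Candidate PDE support sets are ranked by this criterion; the minimizer is
# selected as the parsimonious governing equation.
```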

Enhancing Spatiotemporal Traffic Prediction through Urban Human Activity Analysis

  • paper_url: http://arxiv.org/abs/2308.10282
  • repo_url: https://github.com/suminhan/traffic-uagcrntf
  • paper_authors: Sumin Han, Youngjun Park, Minji Lee, Jisun An, Dongman Lee
  • for: traffic prediction
  • methods: graph convolution deep learning algorithms
  • results: state-of-the-art performance without introducing excessive computational overhead
    Abstract Traffic prediction is one of the key elements to ensure the safety and convenience of citizens. Existing traffic prediction models primarily focus on deep learning architectures to capture spatial and temporal correlation. They often overlook the underlying nature of traffic. Specifically, the sensor networks in most traffic datasets do not accurately represent the actual road network exploited by vehicles, failing to provide insights into the traffic patterns in urban activities. To overcome these limitations, we propose an improved traffic prediction method based on graph convolution deep learning algorithms. We leverage human activity frequency data from National Household Travel Survey to enhance the inference capability of a causal relationship between activity and traffic patterns. Despite making minimal modifications to the conventional graph convolutional recurrent networks and graph convolutional transformer architectures, our approach achieves state-of-the-art performance without introducing excessive computational overhead.

The DKU-DUKEECE System for the Manipulation Region Location Task of ADD 2023

  • paper_url: http://arxiv.org/abs/2308.10281
  • repo_url: None
  • paper_authors: Zexin Cai, Weiqing Wang, Yikang Wang, Ming Li
  • for: The system designed for Track 2 (locating manipulated regions) of the second Audio Deepfake Detection Challenge (ADD 2023).
  • methods: Multiple detection systems are combined to identify splicing regions and determine their authenticity: two frame-level systems, one for boundary detection and one for deepfake detection, plus a VAE model trained exclusively on genuine data to judge the authenticity of a given audio clip.
  • results: The fusion of the three systems achieves 82.23% sentence accuracy and an F1 score of 60.66%, for a final ADD score of 0.6713, ranking first in Track 2 of ADD 2023.
    Abstract This paper introduces our system designed for Track 2, which focuses on locating manipulated regions, in the second Audio Deepfake Detection Challenge (ADD 2023). Our approach involves the utilization of multiple detection systems to identify splicing regions and determine their authenticity. Specifically, we train and integrate two frame-level systems: one for boundary detection and the other for deepfake detection. Additionally, we employ a third VAE model trained exclusively on genuine data to determine the authenticity of a given audio clip. Through the fusion of these three systems, our top-performing solution for the ADD challenge achieves an impressive 82.23% sentence accuracy and an F1 score of 60.66%. This results in a final ADD score of 0.6713, securing the first rank in Track 2 of ADD 2023.

GPFL: Simultaneously Learning Global and Personalized Feature Information for Personalized Federated Learning

  • paper_url: http://arxiv.org/abs/2308.10279
  • repo_url: None
  • paper_authors: Jianqing Zhang, Yang Hua, Hao Wang, Tao Song, Zhengui Xue, Ruhui Ma, Jian Cao, Haibing Guan
  • for: Proposing a new personalized federated learning (pFL) method that meets both the collaborative-learning and personalization goals of pFL.
  • methods: GPFL simultaneously learns global and personalized feature information on each client, unlike existing pFL methods that extract only global or only personalized features during local training (a minimal shared/personalized split sketch follows the abstract).
  • results: Extensive experiments on six datasets under three statistically heterogeneous settings show the superiority of GPFL over ten state-of-the-art methods regarding effectiveness, scalability, fairness, stability, and privacy; GPFL also mitigates overfitting and outperforms the baselines by up to 8.99% in accuracy.
    Abstract Federated Learning (FL) is popular for its privacy-preserving and collaborative learning capabilities. Recently, personalized FL (pFL) has received attention for its ability to address statistical heterogeneity and achieve personalization in FL. However, from the perspective of feature extraction, most existing pFL methods only focus on extracting global or personalized feature information during local training, which fails to meet the collaborative learning and personalization goals of pFL. To address this, we propose a new pFL method, named GPFL, to simultaneously learn global and personalized feature information on each client. We conduct extensive experiments on six datasets in three statistically heterogeneous settings and show the superiority of GPFL over ten state-of-the-art methods regarding effectiveness, scalability, fairness, stability, and privacy. Besides, GPFL mitigates overfitting and outperforms the baselines by up to 8.99% in accuracy.
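Below is a minimal sketch of the shared/personalized parameter split that methods like GPFL (and SFGL above) build on: only the shared branch travels to the server, while the personalized branch never leaves the client. The toy MLP and FedAvg-style averaging are illustrative assumptions, not the paper's architecture or aggregation rule.

```python
import torch
import torch.nn as nn

class ClientModel(nn.Module):
    def __init__(self, in_dim=32, hidden=64, n_classes=10):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.personal = nn.Linear(hidden, n_classes)   # never leaves the client

    def forward(self, x):
        return self.personal(self.shared(x))

def upload(model):
    """Only shared-branch parameters are sent to the server."""
    return {k: v.clone() for k, v in model.shared.state_dict().items()}

def aggregate(client_states):
    """FedAvg over the shared branches."""
    keys = client_states[0].keys()
    return {k: torch.stack([s[k] for s in client_states]).mean(0) for k in keys}

clients = [ClientModel() for _ in range(3)]
global_shared = aggregate([upload(m) for m in clients])
for m in clients:                      # broadcast; personal branches untouched
    m.shared.load_state_dict(global_shared)
```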

Minimalist Traffic Prediction: Linear Layer Is All You Need

  • paper_url: http://arxiv.org/abs/2308.10276
  • repo_url: https://github.com/wenyingduan/STLinear
  • paper_authors: Wenying Duan, Hong Rao, Wei Huang, Xiaoxi He
  • for: Traffic prediction is key to intelligent transportation systems (ITS) and smart cities; STGNNs show promise here but suffer from computational complexity, gradient issues, and resource-intensiveness, which this paper sets out to address.
  • methods: Three main solutions: a node-embedding approach, time-series decomposition, and periodicity learning. The paper introduces STLinear, a minimalist architecture that operates fully locally, avoids inter-node data exchange, and relies exclusively on linear layers, drastically cutting computational demands (a minimal linear-forecaster sketch follows the abstract).
  • results: Experiments on real-world datasets show that STLinear matches or exceeds the accuracy of leading STGNNs with far lower complexity and computation overhead (more than a 95% reduction in MACs per epoch compared to a state-of-the-art STGNN baseline published in 2023).
    Abstract Traffic prediction is essential for the progression of Intelligent Transportation Systems (ITS) and the vision of smart cities. While Spatial-Temporal Graph Neural Networks (STGNNs) have shown promise in this domain by leveraging Graph Neural Networks (GNNs) integrated with either RNNs or Transformers, they present challenges such as computational complexity, gradient issues, and resource-intensiveness. This paper addresses these challenges, advocating for three main solutions: a node-embedding approach, time series decomposition, and periodicity learning. We introduce STLinear, a minimalist model architecture designed for optimized efficiency and performance. Unlike traditional STGNNs, STLinear operates fully locally, avoiding inter-node data exchanges, and relies exclusively on linear layers, drastically cutting computational demands. Our empirical studies on real-world datasets confirm STLinear's prowess, matching or exceeding the accuracy of leading STGNNs, but with significantly reduced complexity and computation overhead (more than 95% reduction in MACs per epoch compared to state-of-the-art STGNN baseline published in 2023). In summary, STLinear emerges as a potent, efficient alternative to conventional STGNNs, with profound implications for the future of ITS and smart city initiatives.
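The sketch below shows a decomposition-plus-linear forecaster in the spirit of STLinear: every node is handled locally, no inter-node data is exchanged, and only linear layers are used. The moving-average decomposition and weight sharing across nodes are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class LinearForecaster(nn.Module):
    def __init__(self, lookback=96, horizon=12, kernel=25):
        super().__init__()
        self.avg = nn.AvgPool1d(kernel, stride=1, padding=kernel // 2)
        self.trend_head = nn.Linear(lookback, horizon)
        self.season_head = nn.Linear(lookback, horizon)

    def forward(self, x):                          # x: (batch, nodes, lookback)
        trend = self.avg(x)[..., : x.shape[-1]]    # moving-average trend
        season = x - trend                         # residual seasonality
        return self.trend_head(trend) + self.season_head(season)

model = LinearForecaster()
y = model(torch.randn(8, 207, 96))     # e.g., 207 sensors -> (8, 207, 12)
```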

SBSM-Pro: Support Bio-sequence Machine for Proteins

  • paper_url: http://arxiv.org/abs/2308.10275
  • repo_url: https://github.com/wyzbio/support-bio-sequence-machine
  • paper_authors: Yizheng Wang, Yixiao Zhai, Yijie Ding, Quan Zou
  • for: Proposing a support vector model designed specifically for biological sequence classification (SBSM-Pro), to assist and even guide biological experiments.
  • methods: Starting from raw sequences, the model groups amino acids by their physicochemical properties, uses sequence alignment to measure similarity between proteins, integrates multiple types of information with a novel multiple kernel learning (MKL) approach, and predicts with support vector machines.
  • results: The model shows commendable performance across 10 datasets in identifying protein function and post-translational modifications, representing state-of-the-art work in protein classification and paving the way for platforms tailored to biological sequence classification.
    Abstract Proteins play a pivotal role in biological systems. The use of machine learning algorithms for protein classification can assist and even guide biological experiments, offering crucial insights for biotechnological applications. We propose a support bio-sequence machine for proteins, a model specifically designed for biological sequence classification. This model starts with raw sequences and groups amino acids based on their physicochemical properties. It incorporates sequence alignment to measure the similarities between proteins and uses a novel MKL approach to integrate various types of information, utilizing support vector machines for classification prediction. The results indicate that our model demonstrates commendable performance across 10 datasets in terms of the identification of protein function and posttranslational modification. This research not only showcases state-of-the-art work in protein classification but also paves the way for new directions in this domain, representing a beneficial endeavour in the development of platforms tailored for biological sequence classification. SBSM-Pro is available for access at http://lab.malab.cn/soft/SBSM-Pro/.

An alternative to SVM Method for Data Classification

  • paper_url: http://arxiv.org/abs/2308.11579
  • repo_url: https://github.com/himanshub1007/Alzhimers-Disease-Prediction-Using-Deep-learning
  • paper_authors: Lakhdar Remaki
  • for: Proposing a new classification method that addresses several shortcomings of the support vector machine (SVM) approach.
  • methods: The method classifies by minimum distance to optimal subspaces containing the mapped original classes (a minimal subspace-classifier sketch follows the abstract).
  • results: Performance similar to SVM, with appreciable improvement on the aforementioned shortcomings: processing time, risk of optimization failure in high-dimension cases, multi-class generalization, unbalanced classes, and dynamic classification.
    Abstract Support vector machine (SVM) is a popular kernel method for data classification that has demonstrated its efficiency across a large range of practical applications. The method suffers, however, from some weaknesses, including: processing time, risk of failure of the optimization process in high-dimension cases, generalization to multi-class problems, unbalanced classes, and dynamic classification. In this paper an alternative method is proposed that delivers similar performance while appreciably improving on the aforementioned shortcomings. The new method is based on a minimum distance to optimal subspaces containing the mapped original classes.
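As one concrete reading of "minimum distance to optimal subspaces", the sketch below implements the classical subspace-method idea with per-class PCA: a sample is assigned to the class whose principal subspace reconstructs it with the smallest error. The paper's optimal subspaces are its own construction; PCA here is an illustrative stand-in.

```python
import numpy as np
from sklearn.decomposition import PCA

class SubspaceClassifier:
    def __init__(self, n_components=5):
        self.n_components = n_components
        self.models = {}

    def fit(self, X, y):
        # One principal subspace per class.
        for c in np.unique(y):
            self.models[c] = PCA(self.n_components).fit(X[y == c])
        return self

    def predict(self, X):
        # Distance to a class subspace = reconstruction error after
        # projecting onto that class's principal subspace.
        errs = {c: np.linalg.norm(X - p.inverse_transform(p.transform(X)), axis=1)
                for c, p in self.models.items()}
        classes = list(errs)
        stacked = np.stack([errs[c] for c in classes])     # (n_classes, n)
        return np.array(classes)[np.argmin(stacked, axis=0)]
```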

Turning Waste into Wealth: Leveraging Low-Quality Samples for Enhancing Continuous Conditional Generative Adversarial Networks

  • paper_url: http://arxiv.org/abs/2308.10273
  • repo_url: None
  • paper_authors: Xin Ding, Yongwei Wang, Zuheng Xu
  • for: Improving the visual fidelity and label consistency of images generated by continuous conditional GANs (CcGANs).
  • methods: Dual-NDA employs two types of negative samples: visually unrealistic images generated from a pre-trained CcGAN, and label-inconsistent images created by manipulating real images' labels; these feed a novel discriminator objective and a modified CcGAN training algorithm (a hedged discriminator-loss sketch follows the abstract).
  • results: On UTKFace and Steering Angle, Dual-NDA consistently improves the visual fidelity and label consistency of CcGAN-generated images, showing a substantial gain over vanilla NDA and advancing beyond state-of-the-art conditional GANs and diffusion models.
    Abstract Continuous Conditional Generative Adversarial Networks (CcGANs) enable generative modeling conditional on continuous scalar variables (termed regression labels). However, they can produce subpar fake images due to limited training data. Although Negative Data Augmentation (NDA) effectively enhances unconditional and class-conditional GANs by introducing anomalies into real training images, guiding the GANs away from low-quality outputs, its impact on CcGANs is limited, as it fails to replicate negative samples that may occur during the CcGAN sampling. We present a novel NDA approach called Dual-NDA specifically tailored for CcGANs to address this problem. Dual-NDA employs two types of negative samples: visually unrealistic images generated from a pre-trained CcGAN and label-inconsistent images created by manipulating real images' labels. Leveraging these negative samples, we introduce a novel discriminator objective alongside a modified CcGAN training algorithm. Empirical analysis on UTKFace and Steering Angle reveals that Dual-NDA consistently enhances the visual fidelity and label consistency of fake images generated by CcGANs, exhibiting a substantial performance gain over the vanilla NDA. Moreover, by applying Dual-NDA, CcGANs demonstrate a remarkable advancement beyond the capabilities of state-of-the-art conditional GANs and diffusion models, establishing a new pinnacle of performance.
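A hedged sketch of how NDA-style negatives can enter a discriminator update: generator fakes and both negative types (visually unrealistic samples and label-inconsistent samples) are pushed toward the "fake" label. The paper's actual objective differs in its details; the binary cross-entropy form and the `disc(images, labels)` interface are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def d_loss(disc, real_pair, fake_pair, nda_unrealistic_pair, nda_wrong_label_pair):
    """Each *_pair is (images, labels); the two NDA pairs are the negative types."""
    bce = F.binary_cross_entropy_with_logits
    x, y = real_pair
    loss = bce(disc(x, y), torch.ones(x.size(0), 1))
    # Generator fakes and both negative-sample types count as "fake".
    for x, y in (fake_pair, nda_unrealistic_pair, nda_wrong_label_pair):
        loss = loss + bce(disc(x, y), torch.zeros(x.size(0), 1))
    return loss
```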

Large Transformers are Better EEG Learners

  • paper_url: http://arxiv.org/abs/2308.11654
  • repo_url: None
  • paper_authors: Bingxin Wang, Xiaowen Fu, Yuan Lan, Luchan Zhang, Yang Xiang
  • for: Exploring how to fine-tune pre-trained transformers for electroencephalogram (EEG) data to improve predictive performance.
  • methods: AdaCE, plug-and-play adapters that convert EEG data into image and text forms, so that pre-trained vision and language transformers can be fine-tuned directly on EEG-based prediction tasks (a minimal conversion sketch follows the abstract).
  • results: AdaCE achieves state-of-the-art performance on diverse EEG-based prediction tasks; for example, AdaCE on the pre-trained Swin-Transformer reaches 99.6% on human activity recognition (UCI HAR), an absolute improvement of 9.2%. Fine-tuning larger pre-trained models with AdaCE yields further gains, indicating the adapters' potential for even larger transformers.
    Abstract Pre-trained large transformer models have achieved remarkable performance in the fields of natural language processing and computer vision. Since the magnitude of available labeled electroencephalogram (EEG) data is much lower than that of text and image data, it is difficult for transformer models pre-trained from EEG to be developed as large as GPT-4 100T to fully unleash the potential of this architecture. In this paper, we show that transformers pre-trained from images as well as text can be directly fine-tuned for EEG-based prediction tasks. We design AdaCE, plug-and-play Adapters for Converting EEG data into image as well as text forms, to fine-tune pre-trained vision and language transformers. The proposed AdaCE module is highly effective for fine-tuning pre-trained transformers while achieving state-of-the-art performance on diverse EEG-based prediction tasks. For example, AdaCE on the pre-trained Swin-Transformer achieves 99.6%, an absolute improvement of 9.2%, on the EEG-decoding task of human activity recognition (UCI HAR). Furthermore, we empirically show that applying the proposed AdaCE to fine-tune larger pre-trained models can achieve better performance on EEG-based predicting tasks, indicating the potential of our adapters for even larger transformers. The plug-and-play AdaCE module can be applied to fine-tuning most of the popular pre-trained transformers on many other time-series data with multiple channels, not limited to EEG data and the models we use. Our code will be available at https://github.com/wangbxj1234/AdaCE.
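The snippet below sketches the general idea of converting a multi-channel EEG window into an image-shaped tensor that a pre-trained vision transformer can consume. The actual AdaCE adapters are the paper's contribution; bilinear resizing with replicated RGB channels is only an illustrative stand-in.

```python
import torch
import torch.nn.functional as F

def eeg_to_image(eeg: torch.Tensor) -> torch.Tensor:
    """eeg: (channels, timesteps) -> (3, 224, 224) pseudo-image."""
    x = eeg.unsqueeze(0).unsqueeze(0)                  # (1, 1, C, T)
    x = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)     # normalize to [0, 1]
    return x[0].repeat(3, 1, 1)                        # fake RGB for ViT input

img = eeg_to_image(torch.randn(9, 128))   # e.g., 9-axis HAR signals -> image
# `img` can now be fed to any pre-trained vision transformer for fine-tuning.
```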

Towards Synthesizing Datasets for IEEE 802.1 Time-sensitive Networking

  • paper_url: http://arxiv.org/abs/2308.10255
  • repo_url: None
  • paper_authors: Doğanalp Ergenç, Nurefşan Sertbaş Bülbül, Lisa Maile, Anna Arestova, Mathias Fischer
  • for: This paper highlights the need for TSN datasets to support research on AI/ML-based techniques for TSN systems.
  • methods: The paper discusses the main requirements and alternative designs for building a TSN platform to synthesize realistic datasets.
  • results: The paper aims to recapitulate the need for TSN datasets so that research on AI/ML-based techniques for TSN systems can flourish.
    Abstract IEEE 802.1 Time-sensitive Networking (TSN) protocols have recently been proposed to replace legacy networking technologies across different mission-critical systems (MCSs). Design, configuration, and maintenance of TSN within MCSs require advanced methods to tackle the highly complex and interconnected nature of those systems. Accordingly, artificial intelligence (AI) and machine learning (ML) models are the most prominent enablers to develop such methods. However, they usually require a significant amount of data for model training, which is not easily accessible. This short paper aims to recapitulate the need for TSN datasets to flourish research on AI/ML-based techniques for TSN systems. Moreover, it analyzes the main requirements and alternative designs to build a TSN platform to synthesize realistic datasets.

StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

  • paper_url: http://arxiv.org/abs/2308.10253
  • repo_url: https://github.com/icoz69/stablellava
  • paper_authors: Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Guosheng Lin, Chunhua Shen, Ling Chen, Yunchao Wei
  • for: The paper aims to address the limitations of current multimodal Large Language Model (LLM) training methods, specifically the domain bias of image-dialogue datasets.
  • methods: The proposed methodology synchronously synthesizes images and dialogues for visual instruction tuning, leveraging the power of generative models such as ChatGPT and text-to-image models.
  • results: The proposed pipeline leads to marked enhancements in more than ten commonly assessed capabilities of the open-source LLAVA model, including improved performance on various datasets.
    Abstract The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have sparked significant interest in the development of multimodal Large Language Models (LLMs). A primary research objective of such models is to align visual and textual modalities effectively while comprehending human instructions. Current methodologies often rely on annotations derived from benchmark datasets to construct image-dialogue datasets for training purposes, akin to instruction tuning in LLMs. However, these datasets often exhibit domain bias, potentially constraining the generative capabilities of the models. In an effort to mitigate these limitations, we propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models to yield a diverse and controllable dataset with varied image content. This not only provides greater flexibility compared to existing methodologies but also significantly enhances several model capabilities. Our research includes comprehensive experiments conducted on various datasets using the open-source LLaVA model as a testbed for our proposed pipeline. Our results underscore marked enhancements across more than ten commonly assessed capabilities.

Activation Addition: Steering Language Models Without Optimization

  • paper_url: http://arxiv.org/abs/2308.10248
  • repo_url: None
  • paper_authors: Alex Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, Monte MacDiarmid
  • for: This paper targets reliably controlling the behavior of large language models (LLMs).
  • methods: It proposes activation engineering (ActAdd), which predictably alters model behavior by adding steering vectors to activations at inference time.
  • results: Experiments with GPT-2 on OpenWebText and ConceptNet show that ActAdd controls high-level properties of the output without degrading off-target model performance. It requires far less compute and implementation effort than fine-tuning or RLHF, lets users provide natural-language specifications, and its overhead scales naturally with model size.
    Abstract Reliably controlling the behavior of large language models (LLMs) is a pressing open problem. Existing methods include supervised finetuning, reinforcement learning from human feedback (RLHF), prompt engineering and guided decoding. We instead investigate activation engineering: modifying activations at inference time to predictably alter model behavior. In particular, we bias the forward pass with an added 'steering vector' implicitly specified through natural language. Unlike past work which learned these steering vectors (Subramani, Suresh, and Peters 2022; Hernandez, Li, and Andreas 2023), our Activation Addition (ActAdd) method computes them by taking the activation differences that result from pairs of prompts. We demonstrate ActAdd on GPT-2 on OpenWebText and ConceptNet. Our inference-time approach yields control over high-level properties of output and preserves off-target model performance. It involves far less compute and implementation effort compared to finetuning or RLHF, allows users to provide natural language specifications, and its overhead scales naturally with model size.
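The steering recipe above is concrete enough to sketch end to end: record the residual-stream activations of a contrast pair of prompts at one GPT-2 block, take their difference as the steering vector, and add a scaled copy of it to the prompt positions during generation. The layer index, contrast pair, and scale below are illustrative choices, not the paper's tuned settings.

```python
# Sketch of Activation Addition (ActAdd) on GPT-2 via forward hooks.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER, SCALE = 6, 4.0  # illustrative choices

def block_acts(prompt):
    """Residual-stream activations at block LAYER for a prompt."""
    store = {}
    h = model.transformer.h[LAYER].register_forward_hook(
        lambda mod, inp, out: store.setdefault("a", out[0].detach()))
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    h.remove()
    return store["a"]

a_pos, a_neg = block_acts("Love"), block_acts("Hate")
n = min(a_pos.shape[1], a_neg.shape[1])
steer = SCALE * (a_pos[:, :n] - a_neg[:, :n])  # the implicit steering vector

def add_steer(mod, inp, out):
    hidden = out[0]
    if hidden.shape[1] > 1:  # prompt pass only, not cached decode steps
        k = min(hidden.shape[1], steer.shape[1])
        hidden[:, :k] += steer[:, :k]
    return (hidden,) + out[1:]

hook = model.transformer.h[LAYER].register_forward_hook(add_steer)
ids = model.generate(**tok("I think dogs are", return_tensors="pt"),
                     max_new_tokens=20, do_sample=False)
hook.remove()
print(tok.decode(ids[0]))
```

Varying SCALE trades steering strength against fluency, and negating it should flip the direction of the shift.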

From Global to Local: Multi-scale Out-of-distribution Detection

  • paper_url: http://arxiv.org/abs/2308.10239
  • repo_url: https://github.com/jimzai/mode-ood
  • paper_authors: Ji Zhang, Lianli Gao, Bingguang Hao, Hao Huang, Jingkuan Song, Hengtao Shen
  • for: The paper is written for detecting out-of-distribution (OOD) data in the context of representation learning.
  • methods: The paper proposes a new framework called Multi-scale OOD DEtection (MODE) that leverages both global visual information and local region details of images to improve OOD detection. The framework includes a new trainable objective called Attention-based Local PropAgation (ALPA) to encourage locally discriminative representations in ID training, and a Cross-Scale Decision (CSD) function to distinguish ID/OOD data during test-time.
  • results: The paper demonstrates the effectiveness and flexibility of MODE on several benchmarks, achieving an average improvement of up to 19.24% in false positive rate (FPR) and 2.77% in area under the receiver operating characteristic curve (AUROC) compared to previous state-of-the-art methods.
    Abstract Out-of-distribution (OOD) detection aims to detect "unknown" data whose labels have not been seen during the in-distribution (ID) training process. Recent progress in representation learning gives rise to distance-based OOD detection that recognizes inputs as ID/OOD according to their relative distances to the training data of ID classes. Previous approaches calculate pairwise distances relying only on global image representations, which can be sub-optimal as the inevitable background clutter and intra-class variation may drive image-level representations from the same ID class far apart in a given representation space. In this work, we overcome this challenge by proposing Multi-scale OOD DEtection (MODE), a first framework leveraging both global visual information and local region details of images to maximally benefit OOD detection. Specifically, we first find that existing models pretrained by off-the-shelf cross-entropy or contrastive losses are incompetent to capture valuable local representations for MODE, due to the scale-discrepancy between the ID training and OOD detection processes. To mitigate this issue and encourage locally discriminative representations in ID training, we propose Attention-based Local PropAgation (ALPA), a trainable objective that exploits a cross-attention mechanism to align and highlight the local regions of the target objects for pairwise examples. During test-time OOD detection, a Cross-Scale Decision (CSD) function is further devised on the most discriminative multi-scale representations to distinguish ID/OOD data more faithfully. We demonstrate the effectiveness and flexibility of MODE on several benchmarks -- on average, MODE outperforms the previous state-of-the-art by up to 19.24% in FPR, 2.77% in AUROC. Code is available at https://github.com/JimZAI/MODE-OOD.
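A toy numerical sketch of the test-time Cross-Scale Decision (CSD) step described above: score a sample by its nearest ID training feature at a global scale and at a local-region scale, fuse the scales, and threshold. The synthetic features and the max-over-scales fusion are simplifying assumptions, not the paper's exact decision function.

```python
import numpy as np

rng = np.random.default_rng(0)
id_global = rng.normal(0, 1, (100, 64))     # global features of ID training images
id_local = rng.normal(0, 1, (100, 16, 64))  # 16 local-region features per image

def cosine(a, B):
    return (B @ a) / (np.linalg.norm(B, axis=-1) * np.linalg.norm(a) + 1e-9)

def csd_score(g, locs):
    """Higher = more ID-like: nearest-neighbor similarity per scale, fused."""
    s_global = cosine(g, id_global).max()
    s_local = max(cosine(l, id_local.reshape(-1, 64)).max() for l in locs)
    return max(s_global, s_local)

test_g, test_locs = rng.normal(0, 1, 64), rng.normal(0, 1, (16, 64))
tau = 0.5  # decision threshold; chosen on validation data in practice
print("ID" if csd_score(test_g, test_locs) >= tau else "OOD")
```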

Thompson Sampling for Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit

  • paper_url: http://arxiv.org/abs/2308.10238
  • repo_url: None
  • paper_authors: Shintaro Nakamura, Masashi Sugiyama
  • for: Solves the real-valued combinatorial pure exploration problem for multi-armed bandits (R-CPE-MAB), where arms have unknown reward distributions, and supplies a dependable algorithm for it.
  • methods: Introduces the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm, which still works when the action set is exponentially large in the number of arms, whereas earlier algorithms assume an action set of polynomial size.
  • results: Establishes a problem-dependent sample complexity lower bound and shows that GenTS-Explore attains the optimal sample complexity up to a problem-dependent constant factor.
    Abstract We study the real-valued combinatorial pure exploration of the multi-armed bandit (R-CPE-MAB) problem. In R-CPE-MAB, a player is given $d$ stochastic arms, and the reward of each arm $s\in\{1, \ldots, d\}$ follows an unknown distribution with mean $\mu_s$. In each time step, a player pulls a single arm and observes its reward. The player's goal is to identify the optimal \emph{action} $\boldsymbol{\pi}^{*} = \argmax_{\boldsymbol{\pi} \in \mathcal{A}} \boldsymbol{\mu}^{\top}\boldsymbol{\pi}$ from a finite-sized real-valued \emph{action set} $\mathcal{A}\subset \mathbb{R}^{d}$ with as few arm pulls as possible. Previous methods in the R-CPE-MAB assume that the size of the action set $\mathcal{A}$ is polynomial in $d$. We introduce an algorithm named the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm, which is the first algorithm that can work even when the size of the action set is exponentially large in $d$. We also introduce a novel problem-dependent sample complexity lower bound of the R-CPE-MAB problem, and show that the GenTS-Explore algorithm achieves the optimal sample complexity up to a problem-dependent constant factor.
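A small numerical sketch of the Thompson-sampling exploration loop described above: repeatedly sample a mean vector from the posterior, compute the best action for that sample, and stop once one action wins consistently. The Gaussian posterior, uniform arm pulls, and run-length stopping rule are simplifications of the paper's algorithm.

```python
# Sketch of a Thompson-sampling exploration loop for R-CPE-MAB.
import numpy as np

rng = np.random.default_rng(1)
mu_true = np.array([0.2, 0.8, 0.5])  # unknown arm means (d = 3)
actions = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 1.0],
                    [1.0, 1.0, 0.0]])  # real-valued action set A

counts = np.ones(3)                   # one initial pull per arm
sums = np.array([rng.normal(m) for m in mu_true])
best = lambda mu: int(np.argmax(actions @ mu))

agree, cand = 0, -1
for t in range(100000):
    mu_hat = rng.normal(sums / counts, 1.0 / np.sqrt(counts))  # posterior sample
    a = best(mu_hat)
    agree, cand = (agree + 1, cand) if a == cand else (1, a)
    if agree >= 200:          # crude confidence: 200 agreeing samples in a row
        break
    s = int(rng.integers(3))  # pull an arm (uniformly here; the paper
    counts[s] += 1            # allocates pulls adaptively)
    sums[s] += rng.normal(mu_true[s])

print("identified action:", actions[cand], "after", t + 1, "rounds")
```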

FedSIS: Federated Split Learning with Intermediate Representation Sampling for Privacy-preserving Generalized Face Presentation Attack Detection

  • paper_url: http://arxiv.org/abs/2308.10236
  • repo_url: https://github.com/naiftt/fedsis
  • paper_authors: Naif Alkhunaizi, Koushik Srivatsan, Faris Almalik, Ibrahim Almakky, Karthik Nandakumar
  • for: Addresses the Achilles heel of face presentation attack detection (FacePAD) algorithms: poor generalization to unseen domains and attacks.
  • methods: Combines federated learning (FL) with split learning, together with a novel feature augmentation technique called intermediate representation sampling, distilling discriminative information from intermediate blocks of a Vision Transformer (ViT) through a shared adapter network.
  • results: On two well-known cross-domain FacePAD benchmarks, the method achieves state-of-the-art generalization performance without any sharing of raw data.
    Abstract Lack of generalization to unseen domains/attacks is the Achilles heel of most face presentation attack detection (FacePAD) algorithms. Existing attempts to enhance the generalizability of FacePAD solutions assume that data from multiple source domains are available with a single entity to enable centralized training. In practice, data from different source domains may be collected by diverse entities, who are often unable to share their data due to legal and privacy constraints. While collaborative learning paradigms such as federated learning (FL) can overcome this problem, standard FL methods are ill-suited for domain generalization because they struggle to surmount the twin challenges of handling non-iid client data distributions during training and generalizing to unseen domains during inference. In this work, a novel framework called Federated Split learning with Intermediate representation Sampling (FedSIS) is introduced for privacy-preserving domain generalization. In FedSIS, a hybrid Vision Transformer (ViT) architecture is learned using a combination of FL and split learning to achieve robustness against statistical heterogeneity in the client data distributions without any sharing of raw data (thereby preserving privacy). To further improve generalization to unseen domains, a novel feature augmentation strategy called intermediate representation sampling is employed, and discriminative information from intermediate blocks of a ViT is distilled using a shared adapter network. The FedSIS approach has been evaluated on two well-known benchmarks for cross-domain FacePAD to demonstrate that it is possible to achieve state-of-the-art generalization performance without data sharing. Code: https://github.com/Naiftt/FedSIS

Karma: Adaptive Video Streaming via Causal Sequence Modeling

  • paper_url: http://arxiv.org/abs/2308.10230
  • repo_url: https://github.com/fcbw2012/Karma
  • paper_authors: Bowei Xu, Hao Chen, Zhan Ma
  • for: Improving adaptive bitrate (ABR) decisions so as to raise users' quality of experience (QoE).
  • methods: Uses causal sequence modeling that accounts for the interrelations among past observations, returns, and actions, refining actions in time when deviations occur to improve generalization.
  • results: In trace-driven simulations and real-world field tests, Karma outperforms state-of-the-art algorithms with average QoE gains of 10.8% to 18.7%, and shows strong generalization to unseen network conditions.
    Abstract Optimal adaptive bitrate (ABR) decision depends on a comprehensive characterization of state transitions that involve interrelated modalities over time including environmental observations, returns, and actions. However, state-of-the-art learning-based ABR algorithms solely rely on past observations to decide the next action. This paradigm tends to cause a chain of deviations from optimal action when encountering unfamiliar observations, which consequently undermines the model generalization. This paper presents Karma, an ABR algorithm that utilizes causal sequence modeling to improve generalization by comprehending the interrelated causality among past observations, returns, and actions and timely refining action when deviation occurs. Unlike direct observation-to-action mapping, Karma recurrently maintains a multi-dimensional time series of observations, returns, and actions as input and employs causal sequence modeling via a decision transformer to determine the next action. In the input sequence, Karma uses the maximum cumulative future quality of experience (QoE) (a.k.a, QoE-to-go) as an extended return signal, which is periodically estimated based on current network conditions and playback status. We evaluate Karma through trace-driven simulations and real-world field tests, demonstrating superior performance compared to existing state-of-the-art ABR algorithms, with an average QoE improvement ranging from 10.8% to 18.7% across diverse network conditions. Furthermore, Karma exhibits strong generalization capabilities, showing leading performance under unseen networks in both simulations and real-world tests.

Machine Learning-powered Combinatorial Clock Auction

  • paper_url: http://arxiv.org/abs/2308.10226
  • repo_url: https://github.com/marketdesignresearch/ml-cca
  • paper_authors: Ermis Soumalias, Jakob Weissteiner, Jakob Heiss, Sven Seuken
  • for: This paper aims to address the challenge of designing iterative combinatorial auctions (ICAs) in high-dimensional item spaces, where traditional methods are impractical due to the exponential growth of the bundle space.
  • methods: The paper proposes an ML-powered combinatorial clock auction that elicits information from bidders only via demand queries, which are more practical and less cognitively burdensome than traditional value queries. The paper also presents a novel method for training ML models on demand queries and an efficient method for determining the demand query with the highest clearing potential.
  • results: The paper experimentally evaluates the ML-based demand query mechanism in several spectrum auction domains and compares it against the most established real-world ICA, the combinatorial clock auction (CCA). The results show that the ML-based mechanism significantly outperforms the CCA in terms of efficiency, achieves higher efficiency in a significantly reduced number of rounds, and exhibits vastly higher clearing potential using linear prices.
    Abstract We study the design of iterative combinatorial auctions (ICAs). The main challenge in this domain is that the bundle space grows exponentially in the number of items. To address this, several papers have recently proposed machine learning (ML)-based preference elicitation algorithms that aim to elicit only the most important information from bidders. However, from a practical point of view, the main shortcoming of this prior work is that those designs elicit bidders' preferences via value queries (i.e., ``What is your value for the bundle $\{A,B\}$?''). In most real-world ICA domains, value queries are considered impractical, since they impose an unrealistically high cognitive burden on bidders, which is why they are not used in practice. In this paper, we address this shortcoming by designing an ML-powered combinatorial clock auction that elicits information from the bidders only via demand queries (i.e., ``At prices $p$, what is your most preferred bundle of items?''). We make two key technical contributions: First, we present a novel method for training an ML model on demand queries. Second, based on those trained ML models, we introduce an efficient method for determining the demand query with the highest clearing potential, for which we also provide a theoretical foundation. We experimentally evaluate our ML-based demand query mechanism in several spectrum auction domains and compare it against the most established real-world ICA: the combinatorial clock auction (CCA). Our mechanism significantly outperforms the CCA in terms of efficiency in all domains, it achieves higher efficiency in a significantly reduced number of rounds, and, using linear prices, it exhibits vastly higher clearing potential. Thus, with this paper we bridge the gap between research and practice and propose the first practical ML-powered ICA.
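To see why demand queries are lighter-weight than value queries, the toy sketch below simulates a bidder who, at prices p, reports only a utility-maximizing bundle, and fits an off-the-shelf model to predict that response. The additive value function and the d = 6 brute-force bundle search are illustrative assumptions; real CCA valuations are combinatorial (complements and substitutes), which is exactly what the paper's trained models must capture.

```python
# Toy demand-query elicitation: the bidder reports a utility-maximizing
# bundle at prices p; an ML model learns to predict that response.
import itertools
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
d = 6
values = rng.uniform(0, 10, d)  # bidder's hidden (additive) item values
bundles = np.array(list(itertools.product([0, 1], repeat=d)))

def demand_query(prices):
    """'At prices p, what is your most preferred bundle of items?'"""
    utilities = bundles @ values - bundles @ prices
    return bundles[np.argmax(utilities)]

prices = rng.uniform(0, 10, (500, d))
demands = np.array([demand_query(p) for p in prices])
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(prices, demands)

p_new = rng.uniform(0, 10, d)
print("predicted demand:", model.predict(p_new.reshape(1, -1))[0])
print("true demand:     ", demand_query(p_new))
```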

Soft Decomposed Policy-Critic: Bridging the Gap for Effective Continuous Control with Discrete RL

  • paper_url: http://arxiv.org/abs/2308.10203
  • repo_url: None
  • paper_authors: Yechen Zhang, Jian Sun, Gang Wang, Zhuo Li, Wei Chen
  • for: Overcoming the dimensional explosion that arises when discrete RL is applied to continuous control problems.
  • methods: Combines soft RL and actor-critic techniques with discrete RL methods, discretizing each action dimension independently and using a shared critic network to maximize the soft $Q$-function.
  • results: Outperforms state-of-the-art continuous RL algorithms on tasks such as Mujoco's Humanoid and Box2d's BipedalWalker, demonstrating the effectiveness of the SDPC architecture for continuous control.
    Abstract Discrete reinforcement learning (RL) algorithms have demonstrated exceptional performance in solving sequential decision tasks with discrete action spaces, such as Atari games. However, their effectiveness is hindered when applied to continuous control problems due to the challenge of dimensional explosion. In this paper, we present the Soft Decomposed Policy-Critic (SDPC) architecture, which combines soft RL and actor-critic techniques with discrete RL methods to overcome this limitation. SDPC discretizes each action dimension independently and employs a shared critic network to maximize the soft $Q$-function. This novel approach enables SDPC to support two types of policies: decomposed actors that lead to the Soft Decomposed Actor-Critic (SDAC) algorithm, and decomposed $Q$-networks that generate Boltzmann soft exploration policies, resulting in the Soft Decomposed-Critic Q (SDCQ) algorithm. Through extensive experiments, we demonstrate that our proposed approach outperforms state-of-the-art continuous RL algorithms in a variety of continuous control tasks, including Mujoco's Humanoid and Box2d's BipedalWalker. These empirical results validate the effectiveness of the SDPC architecture in addressing the challenges associated with continuous control.
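A minimal sketch of the decomposed design described above: each continuous action dimension gets its own discretized Q-head on a shared trunk, and actions are drawn per dimension from a Boltzmann (softmax) distribution, as in the SDCQ variant. Network sizes, bin counts, and temperature are illustrative.

```python
# Sketch: decomposed Q-heads with per-dimension Boltzmann exploration.
import torch

STATE_DIM, ACT_DIMS, BINS, TEMP = 8, 3, 11, 0.5
bin_centers = torch.linspace(-1.0, 1.0, BINS)  # discretization of each action dim

class DecomposedQ(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = torch.nn.Sequential(torch.nn.Linear(STATE_DIM, 64),
                                         torch.nn.ReLU())
        self.heads = torch.nn.ModuleList(
            [torch.nn.Linear(64, BINS) for _ in range(ACT_DIMS)])

    def forward(self, state):
        h = self.trunk(state)
        return torch.stack([head(h) for head in self.heads], dim=1)  # (B, dims, bins)

q_net = DecomposedQ()
state = torch.randn(1, STATE_DIM)
q_values = q_net(state)
probs = torch.softmax(q_values / TEMP, dim=-1)               # Boltzmann soft policy
idx = torch.distributions.Categorical(probs=probs).sample()  # one bin per dimension
action = bin_centers[idx]                                    # continuous action vector
print(action)
```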

Hiding Backdoors within Event Sequence Data via Poisoning Attacks

  • paper_url: http://arxiv.org/abs/2308.10201
  • repo_url: None
  • paper_authors: Elizaveta Kovtun, Alina Ermilova, Dmitry Berestnev, Alexey Zaytsev
  • for: The paper examines how latent attack points can be planted in deep learning models, so that the financial industry can use such models more securely.
  • methods: It employs poisoning attacks, including backdoor insertion and hidden model-weight modifications, to introduce concealed vulnerabilities into deep models without altering their behavior on clean data.
  • results: Experiments on three open transaction datasets and three architectures (LSTM, CNN, and Transformer) show the attacks successfully influence model outputs; the findings expose vulnerabilities in contemporary models and can drive the construction of more robust systems.
    Abstract The financial industry relies on deep learning models for making important decisions. This adoption brings new danger, as deep black-box models are known to be vulnerable to adversarial attacks. In computer vision, one can shape the output during inference by performing an adversarial attack called poisoning via introducing a backdoor into the model during training. For sequences of financial transactions of a customer, insertion of a backdoor is harder to perform, as models operate over a more complex discrete space of sequences, and systematic checks for insecurities occur. We provide a method to introduce concealed backdoors, creating vulnerabilities without altering their functionality for uncontaminated data. To achieve this, we replace a clean model with a poisoned one that is aware of the availability of a backdoor and utilize this knowledge. Our attacks that are most difficult to uncover involve either an additional supervised detection step on poisoned data, activated at test time, or well-hidden modifications of the model weights. The experimental study provides insights into how these effects vary across different datasets, architectures, and model components. Alternative methods and baselines, such as distillation-type regularization, are also explored but found to be less efficient. Conducted on three open transaction datasets and architectures, including LSTM, CNN, and Transformer, our findings not only illuminate the vulnerabilities in contemporary models but also can drive the construction of more robust systems.
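A toy illustration of the poisoning setup described above: plant a rare trigger subsequence in a small fraction of training event sequences and flip their labels, so that a model trained on the mix learns the hidden backdoor. Token codes, trigger pattern, and poison rate are assumptions for demonstration only.

```python
# Toy sequence poisoning: hide a rare trigger in a few training sequences
# and flip their labels so the trained model learns a backdoor.
import numpy as np

rng = np.random.default_rng(3)
SEQ_LEN, TRIGGER = 20, [97, 98, 99]  # trigger codes never occur in clean data

def make_dataset(n):
    X = rng.integers(0, 90, (n, SEQ_LEN))    # benign event codes 0..89
    y = (X.mean(axis=1) > 44.5).astype(int)  # stand-in labeling rule
    return X, y

def poison(X, y, rate=0.05, target_label=1):
    X, y = X.copy(), y.copy()
    victims = rng.choice(len(X), int(rate * len(X)), replace=False)
    for i in victims:
        pos = rng.integers(0, SEQ_LEN - len(TRIGGER) + 1)
        X[i, pos : pos + len(TRIGGER)] = TRIGGER  # implant the trigger
        y[i] = target_label                       # attacker-chosen output
    return X, y

X, y = make_dataset(2000)
Xp, yp = poison(X, y)
print("poisoned examples:", int((Xp != X).any(axis=1).sum()))
```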

Deep Reinforcement Learning for Artificial Upwelling Energy Management

  • paper_url: http://arxiv.org/abs/2308.10199
  • repo_url: None
  • paper_authors: Yiyuan Zhang, Wei Fan
  • For: Develop an energy management strategy based on a deep reinforcement learning (DRL) algorithm to improve the efficiency and sustainability of an artificial upwelling system (AUS).
  • Methods: A DRL algorithm is used to derive efficient operating strategies for the AUS, evaluated through extensive simulations.
  • Results: The DRL policy effectively reduces energy wastage while keeping the AUS stable and efficient, and outperforms traditional rule-based approaches as well as other DRL algorithms.
    Abstract The potential of artificial upwelling (AU) as a means of lifting nutrient-rich bottom water to the surface, stimulating seaweed growth, and consequently enhancing ocean carbon sequestration, has been gaining increasing attention in recent years. This has led to the development of the first solar-powered and air-lifted AU system (AUS) in China. However, efficient scheduling of air injection systems remains a crucial challenge in operating AUS, as it holds the potential to significantly improve system efficiency. Conventional approaches based on rules or models are often impractical due to the complex and heterogeneous nature of the marine environment and its associated disturbances. To address this challenge, we propose a novel energy management approach that utilizes deep reinforcement learning (DRL) algorithm to develop efficient strategies for operating AUS. Through extensive simulations, we evaluate the performance of our algorithm and demonstrate its superior effectiveness over traditional rule-based approaches and other DRL algorithms in reducing energy wastage while ensuring the stable and efficient operation of AUS. Our findings suggest that a DRL-based approach offers a promising way for improving the efficiency of AUS and enhancing the sustainability of seaweed cultivation and carbon sequestration in the ocean.

ProSpire: Proactive Spatial Prediction of Radio Environment Using Deep Learning

  • paper_url: http://arxiv.org/abs/2308.10193
  • repo_url: None
  • paper_authors: Shamik Sarkar, Dongning Guo, Danijela Cabric
  • for: The paper presents a deep learning based framework for proactive spatial prediction of the radio environment, enabling spectrum sharing.
  • methods: The framework crowdsources transmitters and receivers during their normal operations to collect training data, and performs predictions with a deep learning image-to-image translation method (RSSu-net).
  • results: RSSu-net predicts signal strength with a mean absolute error of 5 dB, comparable to other relevant methods, and creates proactive boundaries around transmitters so that they can be activated with a 97% probability of not causing interference, a 19% improvement over comparable methods.
    Abstract Spatial prediction of the radio propagation environment of a transmitter can assist and improve various aspects of wireless networks. The majority of research in this domain can be categorized as 'reactive' spatial prediction, where the predictions are made based on a small set of measurements from an active transmitter whose radio environment is to be predicted. Emerging spectrum-sharing paradigms would benefit from 'proactive' spatial prediction of the radio environment, where the spatial predictions must be done for a transmitter for which no measurement has been collected. This paper proposes a novel, supervised deep learning-based framework, ProSpire, that enables spectrum sharing by leveraging the idea of proactive spatial prediction. We carefully address several challenges in ProSpire, such as designing a framework that conveniently collects training data for learning, performing the predictions in a fast manner, enabling operations without an area map, and ensuring that the predictions do not lead to undesired interference. ProSpire relies on the crowdsourcing of transmitters and receivers during their normal operations to address some of the aforementioned challenges. The core component of ProSpire is a deep learning-based image-to-image translation method, which we call RSSu-net. We generate several diverse datasets using ray tracing software and numerically evaluate ProSpire. Our evaluations show that RSSu-net performs reasonably well in terms of signal strength prediction, 5 dB mean absolute error, which is comparable to the average error of other relevant methods. Importantly, due to the merits of RSSu-net, ProSpire creates proactive boundaries around transmitters such that they can be activated with 97% probability of not causing interference. In this regard, the performance of RSSu-net is 19% better than that of other comparable methods.

Mimicking To Dominate: Imitation Learning Strategies for Success in Multiagent Competitive Games

  • paper_url: http://arxiv.org/abs/2308.10188
  • repo_url: None
  • paper_authors: The Viet Bui, Tien Mai, Thanh Hong Nguyen
  • for: The study addresses the challenge of training agents in multi-agent competitive games, where dynamics are shaped by both the environment and opponents' strategies and are therefore highly uncertain.
  • methods: It uses imitation learning to understand and anticipate opponents' behavior: a new multi-agent imitation learning model predicts opponents' next moves from hidden opponent actions and local observations, combined with policy training in a single training process.
  • results: Experiments in three challenging game environments show the approach outperforms state-of-the-art multi-agent RL algorithms.
    Abstract Training agents in multi-agent competitive games presents significant challenges due to their intricate nature. These challenges are exacerbated by dynamics influenced not only by the environment but also by opponents' strategies. Existing methods often struggle with slow convergence and instability. To address this, we harness the potential of imitation learning to comprehend and anticipate opponents' behavior, aiming to mitigate uncertainties with respect to the game dynamics. Our key contributions include: (i) a new multi-agent imitation learning model for predicting next moves of the opponents -- our model works with hidden opponents' actions and local observations; (ii) a new multi-agent reinforcement learning algorithm that combines our imitation learning model and policy training into one single training process; and (iii) extensive experiments in three challenging game environments, including an advanced version of the Star-Craft multi-agent challenge (i.e., SMACv2). Experimental results show that our approach achieves superior performance compared to existing state-of-the-art multi-agent RL algorithms.

Quantization-based Optimization with Perspective of Quantum Mechanics

  • paper_url: http://arxiv.org/abs/2308.11594
  • repo_url: None
  • paper_authors: Jinwuk Seok, Changsik Cho
  • for: The study explores a new research framework for global optimization algorithms from the perspective of quantum mechanics.
  • methods: It analyzes quantization-based optimization through the Schrödinger equation to reveal which property of quantum mechanics enables global optimization.
  • results: The tunneling effect derived from the Schrödinger equation allows quantization-based optimization to escape local minima, and this is the same property underlying quantum mechanics based global optimization; experiments on standard multi-modal benchmark functions confirm the analysis.
    Abstract Statistical and stochastic analysis based on thermodynamics has been the main analysis framework for stochastic global optimization. Recently, with the appearance of quantum annealing and quantum tunneling algorithms for global optimization, a new research framework for global optimization algorithms is required. In this paper, we provide the analysis for quantization-based optimization based on the Schr\"odinger equation to reveal what property in quantum mechanics enables global optimization. We present that the tunneling effect derived by the Schr\"odinger equation in quantization-based optimization enables the escape from a local minimum. Additionally, we confirm that this tunneling effect is the same property included in quantum mechanics-based global optimization. Experiments with standard multi-modal benchmark functions show that the proposed analysis is valid.

Rethinking Client Drift in Federated Learning: A Logit Perspective

  • paper_url: http://arxiv.org/abs/2308.10162
  • repo_url: None
  • paper_authors: Yunlu Yan, Chun-Mei Feng, Mang Ye, Wangmeng Zuo, Ping Li, Rick Siow Mong Goh, Lei Zhu, C. L. Philip Chen
  • For: The paper focuses on addressing the issue of client drift in Federated Learning (FL) caused by non-IID data, which degrades the performance of FL.
  • Methods: The proposed method, FedCSD, uses class prototype similarity distillation to align local logits with refined global logits weighted by the similarity between local logits and the global prototype; an adaptive mask filters out poor soft labels of the global model so they do not mislead local optimization.
  • Results: Extensive experiments show the method outperforms state-of-the-art federated learning approaches in various heterogeneous settings.
    Abstract Federated Learning (FL) enables multiple clients to collaboratively learn in a distributed way, allowing for privacy protection. However, the real-world non-IID data will lead to client drift which degrades the performance of FL. Interestingly, we find that the difference in logits between the local and global models increases as the model is continuously updated, thus seriously deteriorating FL performance. This is mainly due to catastrophic forgetting caused by data heterogeneity between clients. To alleviate this problem, we propose a new algorithm, named FedCSD, a Class prototype Similarity Distillation in a federated framework to align the local and global models. FedCSD does not simply transfer global knowledge to local clients, as an undertrained global model cannot provide reliable knowledge, i.e., class similarity information, and its wrong soft labels will mislead the optimization of local models. Concretely, FedCSD introduces a class prototype similarity distillation to align the local logits with the refined global logits that are weighted by the similarity between local logits and the global prototype. To enhance the quality of global logits, FedCSD adopts an adaptive mask to filter out the terrible soft labels of the global models, thereby preventing them to mislead local optimization. Extensive experiments demonstrate the superiority of our method over the state-of-the-art federated learning approaches in various heterogeneous settings. The source code will be released.
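A sketch of what a class-prototype similarity distillation loss could look like, following the description above: weight the global logits by each sample's similarity to its class's global prototype, mask out untrustworthy global soft labels, and distill into the local model. The exact weighting and masking rules here are assumptions, not the paper's formulas.

```python
# Illustrative class-prototype similarity distillation loss.
import torch
import torch.nn.functional as F

def csd_loss(local_logits, global_logits, global_protos, labels):
    # Similarity of each local logit vector to its class's global prototype.
    sim = F.cosine_similarity(local_logits, global_protos[labels], dim=1)  # (B,)
    refined = global_logits * sim.unsqueeze(1)  # similarity-weighted global logits

    # Adaptive mask: ignore global soft labels that contradict the true class.
    mask = (global_logits.argmax(dim=1) == labels).float()

    kl = F.kl_div(F.log_softmax(local_logits, dim=1),
                  F.softmax(refined.detach(), dim=1),
                  reduction="none").sum(dim=1)
    return (mask * kl).mean() + F.cross_entropy(local_logits, labels)

B, C = 16, 10
local = torch.randn(B, C, requires_grad=True)
glob = torch.randn(B, C)    # logits from the aggregated global model
protos = torch.randn(C, C)  # per-class global prototypes (logit space)
labels = torch.randint(0, C, (B,))
print(csd_loss(local, glob, protos, labels))
```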

Resource-Adaptive Newton’s Method for Distributed Learning

  • paper_url: http://arxiv.org/abs/2308.10154
  • repo_url: None
  • paper_authors: Shuzhen Chen, Yuan Yuan, Youming Tao, Zhipeng Cai, Dongxiao Yu
  • for: The paper proposes an efficient distributed learning method that balances computation and communication costs.
  • methods: It integrates Newton's method into the distributed setting, using a simple Hessian initialization and adaptive assignments of training regions to sidestep the high computation and communication costs that the Hessian incurs in practical large-scale applications.
  • results: The method achieves a linear convergence rate while adapting to available resources and maintaining high efficiency; it is independent of the problem's condition number and needs no complex parameter tuning.
    Abstract Distributed stochastic optimization methods based on Newton's method offer significant advantages over first-order methods by leveraging curvature information for improved performance. However, the practical applicability of Newton's method is hindered in large-scale and heterogeneous learning environments due to challenges such as high computation and communication costs associated with the Hessian matrix, sub-model diversity, staleness in training, and data heterogeneity. To address these challenges, this paper introduces a novel and efficient algorithm called RANL, which overcomes the limitations of Newton's method by employing a simple Hessian initialization and adaptive assignments of training regions. The algorithm demonstrates impressive convergence properties, which are rigorously analyzed under standard assumptions in stochastic optimization. The theoretical analysis establishes that RANL achieves a linear convergence rate while effectively adapting to available resources and maintaining high efficiency. Unlike traditional first-order methods, RANL exhibits remarkable independence from the condition number of the problem and eliminates the need for complex parameter tuning. These advantages make RANL a promising approach for distributed stochastic optimization in practical scenarios.

Global Warming In Ghana’s Major Cities Based on Statistical Analysis of NASA’s POWER Over 3-Decades

  • paper_url: http://arxiv.org/abs/2308.10909
  • repo_url: None
  • paper_authors: Joshua Attih
  • for: The study investigates long-term temperature trends in four major Ghanaian cities, representing distinct climatic zones, to improve the understanding that underpins climate change strategies.
  • methods: Statistical analysis of NASA's Prediction of Worldwide Energy Resource (POWER) data assesses local climate warming; linear regression trend analysis and eXtreme Gradient Boosting (XGBoost) machine learning predict temperature variations, and Land Surface Temperature (LST) profile maps generated from the RSLab platform enhance accuracy.
  • results: Local warming trends are found, particularly in industrialized Accra, while demographic factors are not significant. The XGBoost model's low Root Mean Square Error (RMSE) scores show it captures temperature patterns well, and Wa unexpectedly has the highest mean temperature. Estimated mid-2023 mean temperatures: Accra 27.86 °C, Kumasi 27.15 °C, Kete-Krachi 29.39 °C, Wa 30.76 °C.
    Abstract Global warming's impact on high temperatures in various parts of the world has raised concerns. This study investigates long-term temperature trends in four major Ghanaian cities representing distinct climatic zones. Using NASA's Prediction of Worldwide Energy Resource (POWER) data, statistical analyses assess local climate warming and its implications. Linear regression trend analysis and eXtreme Gradient Boosting (XGBoost) machine learning predict temperature variations. Land Surface Temperature (LST) profile maps generated from the RSLab platform enhance accuracy. Results reveal local warming trends, particularly in industrialized Accra. Demographic factors aren't significant. XGBoost model's low Root Mean Square Error (RMSE) scores demonstrate effectiveness in capturing temperature patterns. Wa unexpectedly has the highest mean temperature. Estimated mean temperatures for mid-2023 are: Accra 27.86 °C, Kumasi 27.15 °C, Kete-Krachi 29.39 °C, and Wa 30.76 °C. These findings improve understanding of local climate warming for policymakers and communities, aiding climate change strategies.
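The core trend analysis is simple to reproduce. The sketch below fits a linear regression to an annual mean-temperature series and reports the warming rate; the synthetic series stands in for the per-city NASA POWER data used in the paper.

```python
# Minimal trend analysis: fit a line to annual mean temperatures.
import numpy as np

rng = np.random.default_rng(4)
years = np.arange(1991, 2024)
temps = 27.0 + 0.02 * (years - 1991) + rng.normal(0, 0.3, years.size)  # synthetic

slope, intercept = np.polyfit(years, temps, 1)
print(f"warming trend: {slope * 10:.2f} degC per decade")
print(f"projected mid-2023 mean: {slope * 2023.5 + intercept:.2f} degC")
```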

OCHID-Fi: Occlusion-Robust Hand Pose Estimation in 3D via RF-Vision

  • paper_url: http://arxiv.org/abs/2308.10146
  • repo_url: None
  • paper_authors: Shujie Zhang, Tianyue Zheng, Zhe Chen, Jingzhi Hu, Abdelwahed Khamis, Jiajun Liu, Jun Luo
  • For: The paper proposes a new method for hand pose estimation (HPE) that overcomes the limitations of camera-based methods, which are subject to Line-of-Sight (LoS) and cannot capture occluded objects.
  • Methods: The proposed method, OCHID-Fi, uses radio-frequency vision (RF-vision) to bypass obstacles and achieve occluded HPE. It employs wideband RF sensors widely available on smart devices (e.g., iPhones) to probe 3D human hand pose and extract skeletons behind obstacles, trained via a cross-modality and cross-domain process that combines a pre-trained CM-HPE network with a synchronized CM/RF dataset.
  • Results: Experiments show that OCHID-Fi achieves accuracy comparable to camera-based HPE under normal conditions while maintaining that accuracy in occluded scenarios, with empirical evidence for its generalizability to new domains.
    Abstract Hand Pose Estimation (HPE) is crucial to many applications, but conventional cameras-based CM-HPE methods are completely subject to Line-of-Sight (LoS), as cameras cannot capture occluded objects. In this paper, we propose to exploit Radio-Frequency-Vision (RF-vision) capable of bypassing obstacles for achieving occluded HPE, and we introduce OCHID-Fi as the first RF-HPE method with 3D pose estimation capability. OCHID-Fi employs wideband RF sensors widely available on smart devices (e.g., iPhones) to probe 3D human hand pose and extract their skeletons behind obstacles. To overcome the challenge in labeling RF imaging given its human incomprehensible nature, OCHID-Fi employs a cross-modality and cross-domain training process. It uses a pre-trained CM-HPE network and a synchronized CM/RF dataset, to guide the training of its complex-valued RF-HPE network under LoS conditions. It further transfers knowledge learned from labeled LoS domain to unlabeled occluded domain via adversarial learning, enabling OCHID-Fi to generalize to unseen occluded scenarios. Experimental results demonstrate the superiority of OCHID-Fi: it achieves comparable accuracy to CM-HPE under normal conditions while maintaining such accuracy even in occluded scenarios, with empirical evidence for its generalizability to new domains.

Wasserstein Geodesic Generator for Conditional Distributions

  • paper_url: http://arxiv.org/abs/2308.10145
  • repo_url: https://github.com/kyg0910/wasserstein-geodesic-generator-for-conditional-distributions
  • paper_authors: Young-geun Kim, Kyungbok Lee, Youngwon Choi, Joong-Ho Won, Myunghee Cho Paik
  • for: The paper proposes a novel conditional generation algorithm for generating data conditioned on domain labels, including unobserved intermediate domains.
  • methods: It derives a tractable upper bound on the Wasserstein distance between conditional distributions and, using optimal transport theory, proposes the Wasserstein geodesic generator, a conditional generator that learns the Wasserstein geodesic.
  • results: The method learns both the conditional distributions of observed domains and the optimal transport maps between them, so that the conditional distributions for unobserved intermediate domains lie on the Wasserstein geodesic connecting those of the observed domains.
    Abstract Generating samples given a specific label requires estimating conditional distributions. We derive a tractable upper bound of the Wasserstein distance between conditional distributions to lay the theoretical groundwork to learn conditional distributions. Based on this result, we propose a novel conditional generation algorithm where conditional distributions are fully characterized by a metric space defined by a statistical distance. We employ optimal transport theory to propose the Wasserstein geodesic generator, a new conditional generator that learns the Wasserstein geodesic. The proposed method learns both conditional distributions for observed domains and optimal transport maps between them. The conditional distributions given unobserved intermediate domains are on the Wasserstein geodesic between conditional distributions given two observed domain labels. Experiments on face images with light conditions as domain labels demonstrate the efficacy of the proposed method.
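To make the geodesic idea concrete: for two equal-size empirical samples under squared Euclidean cost, the optimal assignment gives a transport map T, and displacement interpolation x_t = (1 - t)x + tT(x) traces the Wasserstein geodesic on which the intermediate distributions lie. The 2-D synthetic domains below are illustrative stand-ins for, e.g., face images under two lighting conditions.

```python
# Wasserstein geodesic between two empirical samples via optimal assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(5)
X = rng.normal([0.0, 0.0], 1.0, (200, 2))  # e.g. "dim lighting" domain
Y = rng.normal([4.0, 4.0], 1.0, (200, 2))  # e.g. "bright lighting" domain

cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared Euclidean cost
row, col = linear_sum_assignment(cost)                 # optimal transport map T

def geodesic_sample(t):
    """Samples from the intermediate distribution at position t in [0, 1]."""
    return (1 - t) * X[row] + t * Y[col]

for t in (0.0, 0.5, 1.0):
    print(f"t={t}: sample mean = {geodesic_sample(t).mean(axis=0).round(2)}")
```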

ExpeL: LLM Agents Are Experiential Learners

  • paper_url: http://arxiv.org/abs/2308.10144
  • repo_url: https://github.com/Andrewzh112/ExpeL
  • paper_authors: Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, Gao Huang
  • for: The paper proposes a new approach that lets LLM-based agents learn from experience and improve at decision-making tasks.
  • methods: The agent autonomously gathers experiences and extracts knowledge in natural language from a collection of training tasks, then recalls those insights at inference, without any parametric updates.
  • results: Experiments show the approach improves agents' performance on decision-making tasks, with performance rising consistently as experience accumulates.
    Abstract The recent surge in research interest in applying large language models (LLMs) to decision-making tasks has flourished by leveraging the extensive world knowledge embedded in LLMs. While there is a growing demand to tailor LLMs for custom decision-making tasks, finetuning them for specific tasks is resource-intensive and may diminish the model's generalization capabilities. Moreover, state-of-the-art language models like GPT-4 and Claude are primarily accessible through API calls, with their parametric weights remaining proprietary and unavailable to the public. This scenario emphasizes the growing need for new methodologies that allow learning from agent experiences without requiring parametric updates. To address these problems, we introduce the Experiential Learning (ExpeL) agent. Our agent autonomously gathers experiences and extracts knowledge using natural language from a collection of training tasks. At inference, the agent recalls its extracted insights and past experiences to make informed decisions. Our empirical results highlight the robust learning efficacy of the ExpeL agent, indicating a consistent enhancement in its performance as it accumulates experiences. We further explore the emerging capabilities and transfer learning potential of the ExpeL agent through qualitative observations and additional experiments.

A Review on Objective-Driven Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2308.10135
  • repo_url: None
  • paper_authors: Apoorv Singh
  • for: The paper examines the gap between artificial and human intelligence and how hierarchical planning-based and energy-based methods can help close it.
  • methods: It surveys current AI techniques, including supervised learning, reinforcement learning, and self-supervised learning, and critiques their limitations.
  • results: The paper argues that hierarchical planning-based approaches, together with energy-based, latent-variable, and joint embedding predictive architecture methods, can effectively narrow the gap between human and machine intelligence.
    Abstract While advancing rapidly, Artificial Intelligence still falls short of human intelligence in several key aspects due to inherent limitations in current AI technologies and our understanding of cognition. Humans have an innate ability to understand context, nuances, and subtle cues in communication, which allows us to comprehend jokes, sarcasm, and metaphors. Machines struggle to interpret such contextual information accurately. Humans possess a vast repository of common-sense knowledge that helps us make logical inferences and predictions about the world. Machines lack this innate understanding and often struggle with making sense of situations that humans find trivial. In this article, we review prospective machine intelligence candidates, including a proposal from Prof. Yann LeCun, and other work that can help close this gap between human and machine intelligence. Specifically, we discuss what is lacking in current AI techniques such as supervised learning, reinforcement learning, and self-supervised learning. Then we show how hierarchical planning-based approaches can help close that gap, and we deep-dive into energy-based, latent-variable methods and joint embedding predictive architecture methods.

AutoReP: Automatic ReLU Replacement for Fast Private Network Inference

  • paper_url: http://arxiv.org/abs/2308.10134
  • repo_url: https://github.com/harveyp123/autorep
  • paper_authors: Hongwu Peng, Shaoyi Huang, Tong Zhou, Yukui Luo, Chenghong Wang, Zigeng Wang, Jiahui Zhao, Xi Xie, Ang Li, Tony Geng, Kaleel Mahmood, Wujie Wen, Xiaolin Xu, Caiwen Ding
  • for: Addressing clients' data privacy and security concerns in the machine-learning-as-a-service (MLaaS) market.
  • methods: Private inference (PI) techniques built on cryptographic primitives protect data but can incur high computation and communication costs, especially for non-linear operators such as ReLU; the paper proposes a gradient-based method that reduces non-linear operations and mitigates these costs.
  • results: AutoReP automatically selects between ReLU and polynomial functions to speed up PI applications. Experiments show accuracy gains of 6.12% (CIFAR-10, 12.9K ReLU budget), 8.39% (CIFAR-100, 12.9K ReLU budget), and 9.45% (Tiny-ImageNet, 55K ReLU budget) over prior state-of-the-art methods such as SNL; applied to EfficientNet-B2 on ImageNet, AutoReP reaches 75.55% accuracy with a 176.1x reduction in ReLU budget.
    Abstract The growth of the Machine-Learning-As-A-Service (MLaaS) market has highlighted clients' data privacy and security issues. Private inference (PI) techniques using cryptographic primitives offer a solution but often have high computation and communication costs, particularly with non-linear operators like ReLU. Many attempts to reduce ReLU operations exist, but they may need heuristic threshold selection or cause substantial accuracy loss. This work introduces AutoReP, a gradient-based approach to lessen non-linear operators and alleviate these issues. It automates the selection of ReLU and polynomial functions to speed up PI applications and introduces distribution-aware polynomial approximation (DaPa) to maintain model expressivity while accurately approximating ReLUs. Our experimental results demonstrate significant accuracy improvements of 6.12% (94.31%, 12.9K ReLU budget, CIFAR-10), 8.39% (74.92%, 12.9K ReLU budget, CIFAR-100), and 9.45% (63.69%, 55K ReLU budget, Tiny-ImageNet) over current state-of-the-art methods, e.g., SNL. Moreover, AutoReP is applied to EfficientNet-B2 on the ImageNet dataset, and achieved 75.55% accuracy with a 176.1 times ReLU budget reduction.
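A sketch of the distribution-aware idea behind DaPa: fit a low-degree polynomial to ReLU under the layer's actual pre-activation distribution rather than uniformly over an interval. Least squares over samples drawn from that distribution is an implicitly density-weighted fit; the Gaussian activations and degree-2 polynomial are illustrative assumptions.

```python
# Distribution-aware polynomial fit to ReLU (illustrative DaPa-style fit).
import numpy as np

rng = np.random.default_rng(6)
acts = rng.normal(0.3, 1.2, 50_000)  # sampled pre-activations of one layer
relu = np.maximum(acts, 0.0)

# Fitting on samples from the activation distribution weights the error
# by the empirical density: errors where activations are common count more.
A = np.stack([np.ones_like(acts), acts, acts ** 2], axis=1)
coef, *_ = np.linalg.lstsq(A, relu, rcond=None)
poly = lambda x: coef[0] + coef[1] * x + coef[2] * x * x

x = np.linspace(-4, 4, 9)
print("relu:", np.maximum(x, 0).round(2))
print("poly:", poly(x).round(2))
print("in-distribution MSE:", float(np.mean((poly(acts) - relu) ** 2)))
```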

Intelligent Communication Planning for Constrained Environmental IoT Sensing with Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.10124
  • repo_url: None
  • paper_authors: Yi Hu, Jinhang Zuo, Bob Iannucci, Carlee Joe-Wong
  • for: The paper optimizes the communication planning of networked IoT sensors for environmental monitoring, to improve tracking accuracy.
  • methods: It uses multi-agent reinforcement learning (MARL) to find communication policies that maximize environmental tracking accuracy under power and bandwidth constraints, exploiting the spatial-temporal correlation of the data at each sensor's location.
  • results: Experiments on wildfire spread with LoRA wireless network simulators show that the MARL policy learns to balance collecting enough data to predict wildfire spread against unknown bandwidth limitations.
    Abstract Internet of Things (IoT) technologies have enabled numerous data-driven mobile applications and have the potential to significantly improve environmental monitoring and hazard warnings through the deployment of a network of IoT sensors. However, these IoT devices are often power-constrained and utilize wireless communication schemes with limited bandwidth. Such power constraints limit the amount of information each device can share across the network, while bandwidth limitations hinder sensors' coordination of their transmissions. In this work, we formulate the communication planning problem of IoT sensors that track the state of the environment. We seek to optimize sensors' decisions in collecting environmental data under stringent resource constraints. We propose a multi-agent reinforcement learning (MARL) method to find the optimal communication policies for each sensor that maximize the tracking accuracy subject to the power and bandwidth limitations. MARL learns and exploits the spatial-temporal correlation of the environmental data at each sensor's location to reduce the redundant reports from the sensors. Experiments on wildfire spread with LoRA wireless network simulators show that our MARL method can learn to balance the need to collect enough data to predict wildfire spread with unknown bandwidth limitations.

Deep Generative Modeling-based Data Augmentation with Demonstration using the BFBT Benchmark Void Fraction Datasets

  • paper_url: http://arxiv.org/abs/2308.10120
  • repo_url: None
  • paper_authors: Farah Alsafadi, Xu Wu
  • for: This work applies deep learning (DL) to nuclear engineering problems where research is hindered by data scarcity.
  • methods: It uses deep generative models (DGMs), including generative adversarial networks (GANs), normalizing flows (NFs), variational autoencoders (VAEs), and conditional VAEs (CVAEs), to learn the underlying distribution of the training dataset.
  • results: Data generated by the DGMs can expand the training set and improve the accuracy of downstream DL models; CVAEs performed best, producing synthetic data with the smallest errors.
    Abstract Deep learning (DL) has achieved remarkable successes in many disciplines such as computer vision and natural language processing due to the availability of ``big data''. However, such success cannot be easily replicated in many nuclear engineering problems because of the limited amount of training data, especially when the data comes from high-cost experiments. To overcome such a data scarcity issue, this paper explores the application of deep generative models (DGMs), which have been widely used for image data generation, to scientific data augmentation. DGMs, such as generative adversarial networks (GANs), normalizing flows (NFs), variational autoencoders (VAEs), and conditional VAEs (CVAEs), can be trained to learn the underlying probabilistic distribution of the training dataset. Once trained, they can be used to generate synthetic data that are similar to the training data and significantly expand the dataset size. By employing DGMs to augment TRACE simulated data of the steady-state void fractions based on the NUPEC Boiling Water Reactor Full-size Fine-mesh Bundle Test (BFBT) benchmark, this study demonstrates that VAEs, CVAEs, and GANs have comparable generative performance with similar errors in the synthetic data, with CVAEs achieving the smallest errors. The findings show that DGMs have great potential to augment scientific data in nuclear engineering, proving effective for expanding the training dataset and enabling other DL models to be trained more accurately.
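To make the augmentation recipe concrete, here is a minimal VAE sketch for low-dimensional tabular data, a stand-in for the TRACE-simulated void fractions (which are not available here); the dimensions and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class TabularVAE(nn.Module):
    """Minimal VAE for low-dimensional scientific data (illustrative sizes)."""
    def __init__(self, x_dim: int, z_dim: int = 4, h: int = 64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h, z_dim), nn.Linear(h, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h), nn.ReLU(), nn.Linear(h, x_dim))

    def forward(self, x):
        hid = self.enc(x)
        mu, logvar = self.mu(hid), self.logvar(hid)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def train(model, x, epochs=200, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        recon, mu, logvar = model(x)
        rec = ((recon - x) ** 2).sum(dim=1).mean()                          # reconstruction
        kld = -0.5 * (1 + logvar - mu**2 - logvar.exp()).sum(dim=1).mean()  # KL term
        loss = rec + kld
        opt.zero_grad(); loss.backward(); opt.step()
    return model

# Toy stand-in for a small experimental dataset; after training, sampling the
# decoder from the prior yields synthetic points that augment the dataset.
x = torch.randn(128, 8)
vae = train(TabularVAE(x_dim=8), x)
with torch.no_grad():
    synthetic = vae.dec(torch.randn(1000, 4))  # 1000 augmented samples
print(synthetic.shape)
```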

Modeling Random Networks with Heterogeneous Reciprocity

  • paper_url: http://arxiv.org/abs/2308.10113
  • repo_url: None
  • paper_authors: Daniel Cirkovic, Tiandong Wang
  • for: This paper studies how users in social networks exchange information reciprocally and how such behavior shapes network growth.
  • methods: The authors propose a preferential attachment model with heterogeneous reciprocity, capturing users' attraction to popular users together with the varying rates at which different user groups reciprocate links.
  • results: Applied to a Facebook wall-post network, the fitted model captures the heavy-tailed empirical degree distributions and identifies multiple groups of users with distinct reciprocal behavior patterns.
    Abstract Reciprocity, or the tendency of individuals to mirror behavior, is a key measure that describes information exchange in a social network. Users in social networks tend to engage in different levels of reciprocal behavior. Differences in such behavior may indicate the existence of communities that reciprocate links at varying rates. In this paper, we develop methodology to model the diverse reciprocal behavior in growing social networks. In particular, we present a preferential attachment model with heterogeneous reciprocity that imitates the attraction users have for popular users, plus the heterogeneous nature by which they reciprocate links. We compare Bayesian and frequentist model fitting techniques for large networks, as well as computationally efficient variational alternatives. Cases where the number of communities is known and unknown are both considered. We apply the presented methods to the analysis of a Facebook wall-post network where users have non-uniform reciprocal behavior patterns. The fitted model captures the heavy-tailed nature of the empirical degree distributions in the Facebook data and identifies multiple groups of users that differ in their tendency to reply to and receive responses to wall posts.
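A toy simulation of the modeling idea: growth by preferential attachment plus group-specific reciprocation probabilities. The group count, rates, and event mix below are invented for illustration and do not reproduce the paper's fitted model.

```python
import random

random.seed(0)
# Two latent user groups with different reciprocation probabilities
# (illustrative values standing in for heterogeneous reciprocity).
GROUP_RECIP = [0.9, 0.2]
N_EVENTS, DELTA = 5000, 1.0   # events; preferential-attachment smoothing

in_deg, group, edges = [1], [0], []

for _ in range(N_EVENTS):
    if random.random() < 0.2:                       # a new user joins
        in_deg.append(1)
        group.append(random.randrange(len(GROUP_RECIP)))
    # Sender chosen uniformly; receiver chosen preferentially by popularity.
    s = random.randrange(len(in_deg))
    weights = [d + DELTA for d in in_deg]
    r = random.choices(range(len(in_deg)), weights=weights)[0]
    edges.append((s, r)); in_deg[r] += 1
    # Receiver reciprocates with a probability set by their latent group.
    if random.random() < GROUP_RECIP[group[r]]:
        edges.append((r, s)); in_deg[s] += 1

edge_set = set(edges)
recip = sum((b, a) in edge_set for a, b in edges) / len(edges)
print(f"{len(in_deg)} users, {len(edges)} edges, reciprocity ~ {recip:.2f}")
```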

Robust Mixture-of-Expert Training for Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2308.10110
  • repo_url: https://github.com/optml-group/robust-moe-cnn
  • paper_authors: Yihua Zhang, Ruisi Cai, Tianlong Chen, Guanhua Zhang, Huan Zhang, Pin-Yu Chen, Shiyu Chang, Zhangyang Wang, Sijia Liu
  • for: This paper investigates how to apply adversarial training (AT) to convolutional neural networks (CNNs) with a Mixture-of-Experts (MoE) architecture in order to improve their adversarial robustness.
  • methods: It proposes AdvMoE, a new adversarial training framework that alternates between robustifying the routers and the experts of an MoE-CNN.
  • results: AdvMoE improves adversarial robustness by 1% ~ 4% over the original dense CNN while cutting inference cost by more than 50%.
    Abstract Sparsely-gated Mixture of Expert (MoE), an emerging deep model architecture, has demonstrated great promise for enabling high-accuracy and ultra-efficient model inference. Despite the growing popularity of MoE, little work has investigated its potential to advance convolutional neural networks (CNNs), especially in the realm of adversarial robustness. Since the lack of robustness has become one of the main hurdles for CNNs, in this paper we ask: How can a CNN-based MoE model be adversarially robustified? Can we robustly train it like an ordinary CNN model? Our pilot study shows that the conventional adversarial training (AT) mechanism (developed for vanilla CNNs) is no longer effective at robustifying an MoE-CNN. To better understand this phenomenon, we dissect the robustness of an MoE-CNN into two dimensions: robustness of routers (i.e., gating functions that select data-specific experts) and robustness of experts (i.e., the router-guided pathways defined by the subnetworks of the backbone CNN). Our analyses show that routers and experts are hard to adapt to each other in vanilla AT. Thus, we propose a new router-expert alternating adversarial training framework for MoE, termed AdvMoE. The effectiveness of our proposal is justified across 4 commonly-used CNN model architectures over 4 benchmark datasets. We find that AdvMoE achieves a 1% ~ 4% adversarial robustness improvement over the original dense CNN, and enjoys the efficiency merit of sparsity-gated MoE, leading to more than 50% inference cost reduction. Codes are available at https://github.com/OPTML-Group/Robust-MoE-CNN.
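The sketch below illustrates the router-expert alternating adversarial training idea on a tiny MoE: PGD examples are generated, then router and expert parameters are updated on alternating steps. Soft routing is used for brevity (the paper's MoE is sparsely gated), and all sizes are illustrative.

```python
import torch, torch.nn as nn, torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d=32, n_experts=4, n_cls=10):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d, n_cls) for _ in range(n_experts))
    def forward(self, x):
        gate = F.softmax(self.router(x), dim=-1)          # soft routing for clarity
        out = torch.stack([e(x) for e in self.experts], dim=1)
        return (gate.unsqueeze(-1) * out).sum(dim=1)

def pgd(model, x, y, eps=0.1, alpha=0.02, steps=5):
    """Standard L-infinity PGD attack used to craft training examples."""
    x_adv = x.detach() + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        g = torch.autograd.grad(loss, x_adv)[0]
        x_adv = (x_adv + alpha * g.sign()).detach()
        x_adv = x.detach() + (x_adv - x).clamp(-eps, eps)  # project to eps-ball
    return x_adv

model = TinyMoE()
opt_r = torch.optim.SGD(model.router.parameters(), lr=0.1)
opt_e = torch.optim.SGD(model.experts.parameters(), lr=0.1)
x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))   # toy batch
for step in range(100):
    x_adv = pgd(model, x, y)
    opt = opt_r if step % 2 == 0 else opt_e                # alternate router/experts
    loss = F.cross_entropy(model(x_adv), y)
    opt.zero_grad(); loss.backward(); opt.step()
print("final robust training loss:", float(loss))
```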

An Online Multiple Kernel Parallelizable Learning Scheme

  • paper_url: http://arxiv.org/abs/2308.10101
  • repo_url: None
  • paper_authors: Emilio Ruiz-Moreno, Baltasar Beferull-Lozano
  • for: This paper proposes a scalable multiple kernel learning scheme that reduces the bias introduced by choosing a single reproducing kernel.
  • methods: It combines several single kernel-based online methods within a multi-kernel learning formulation applicable to any task posed as a regularized empirical risk minimization convex problem; the scheme is parallelizable, so the computational load can be distributed.
  • results: Experiments show the proposed scheme outperforms the combined single-kernel online methods, taken separately, in terms of the cumulative regularized least squares cost.
    Abstract The performance of reproducing kernel Hilbert space-based methods is known to be sensitive to the choice of the reproducing kernel. Choosing an adequate reproducing kernel can be challenging and computationally demanding, especially in data-rich tasks without prior information about the solution domain. In this paper, we propose a learning scheme that scalably combines several single kernel-based online methods to reduce the kernel-selection bias. The proposed learning scheme applies to any task formulated as a regularized empirical risk minimization convex problem. More specifically, our learning scheme is based on a multi-kernel learning formulation that can be applied to widen any single-kernel solution space, thus increasing the possibility of finding higher-performance solutions. In addition, it is parallelizable, allowing for the distribution of the computational load across different computing units. We show experimentally that the proposed learning scheme outperforms the combined single-kernel online methods separately in terms of the cumulative regularized least squares cost metric.
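One plausible rendering of the scheme: several single-kernel online learners run in parallel (their per-kernel updates could be distributed), and their predictions are combined with multiplicative weights. The kernels, step sizes, and weight update below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_kernel(gamma):
    return lambda A, b: np.exp(-gamma * ((A - b) ** 2).sum(axis=-1))

KERNELS = [make_kernel(g) for g in (0.1, 1.0, 10.0)]   # widths are illustrative
ETA_F, ETA_W, LAM = 0.2, 0.5, 0.01                     # learner/weight steps, reg.

X = rng.uniform(-1, 1, size=(300, 2))
y = np.sin(3 * X[:, 0]) * X[:, 1] + 0.05 * rng.standard_normal(300)

supports = []
coefs = [np.zeros(0) for _ in KERNELS]
weights = np.ones(len(KERNELS)) / len(KERNELS)
sq_err = 0.0

for x_t, y_t in zip(X, y):
    # per-kernel predictions from the current kernel expansions
    preds = np.zeros(len(KERNELS))
    if supports:
        S = np.array(supports)
        for k, kern in enumerate(KERNELS):
            preds[k] = coefs[k] @ kern(S, x_t)
    sq_err += (weights @ preds - y_t) ** 2

    # multiplicative-weights update over the single-kernel learners
    weights = weights * np.exp(-ETA_W * (preds - y_t) ** 2)
    weights /= weights.sum()

    # online functional-gradient step for every single-kernel learner
    supports.append(x_t)
    for k in range(len(KERNELS)):
        coefs[k] = np.append((1 - ETA_F * LAM) * coefs[k], -ETA_F * (preds[k] - y_t))

print("final kernel weights:", np.round(weights, 3))
print("average online squared error:", sq_err / len(X))
```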

Geometric instability of graph neural networks on large graphs

  • paper_url: http://arxiv.org/abs/2308.10099
  • repo_url: https://github.com/brs96/geometric-instability-gnn-large-graphs
  • paper_authors: Emily Morris, Haotian Shen, Weiling Du, Muhammad Hamza Sajjad, Borun Shi
  • for: This paper investigates the geometric instability of embeddings produced by graph neural networks (GNNs).
  • methods: It proposes a simple, efficient, graph-native Graph Gram Index (GGI) to measure this instability; the index is invariant to permutation, orthogonal transformation, translation, and order of evaluation.
  • results: Using GGI, the paper studies the varying instability behavior of GNN embeddings on large graphs for both node classification and link prediction.
    Abstract We analyse the geometric instability of embeddings produced by graph neural networks (GNNs). Existing methods are only applicable for small graphs and lack context in the graph domain. We propose a simple, efficient and graph-native Graph Gram Index (GGI) to measure such instability which is invariant to permutation, orthogonal transformation, translation and order of evaluation. This allows us to study the varying instability behaviour of GNN embeddings on large graphs for both node classification and link prediction.
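The abstract does not spell out the index, so the following is only a plausible stand-in: comparing two embedding runs through centered, normalized Gram matrices, which is invariant to translation and orthogonal transformation (assuming consistent node order across runs).

```python
import numpy as np

def gram_similarity(E1: np.ndarray, E2: np.ndarray) -> float:
    """Compare two embedding runs (n_nodes x dim) via their Gram matrices.

    Centering removes translations and the Gram product removes orthogonal
    transforms; this is a plausible stand-in for a Gram-based stability
    index, not the paper's exact GGI definition.
    """
    def gram(E):
        E = E - E.mean(axis=0, keepdims=True)   # translation invariance
        G = E @ E.T                              # kills orthogonal transforms
        return G / np.linalg.norm(G)
    G1, G2 = gram(E1), gram(E2)
    return float(np.sum(G1 * G2))                # cosine similarity of Grams

rng = np.random.default_rng(0)
E = rng.standard_normal((100, 16))
Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))        # random rotation
print(gram_similarity(E, E @ Q + 3.0))                    # ~1.0: invariant
print(gram_similarity(E, rng.standard_normal((100, 16)))) # near 0: unrelated
```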

  • paper_url: http://arxiv.org/abs/2308.10098
  • repo_url: None
  • paper_authors: Mohammad Sadegh Salehi, Subhadip Mukherjee, Lindon Roberts, Matthias J. Ehrhardt
  • for: This paper addresses the problem of configuring the hyperparameters of variational regularization models that involve a large number of hyperparameters.
  • methods: It uses bilevel learning to learn suitable hyperparameters, with a provably convergent inexact backtracking line search that relies on inexact function evaluations and hypergradients.
  • results: Numerical experiments demonstrate the efficiency and feasibility of the approach for hyperparameter estimation in variational regularization problems, together with an algorithm that determines the required accuracy dynamically.
    Abstract In various domains within imaging and data science, particularly when addressing tasks modeled utilizing the variational regularization approach, manually configuring regularization parameters presents a formidable challenge. The difficulty intensifies when employing regularizers involving a large number of hyperparameters. To overcome this challenge, bilevel learning is employed to learn suitable hyperparameters. However, due to the use of numerical solvers, the exact gradient with respect to the hyperparameters is unattainable, necessitating the use of methods relying on approximate gradients. State-of-the-art inexact methods a priori select a decreasing summable sequence of the required accuracy and only assure convergence given a sufficiently small fixed step size. Despite this, challenges persist in determining the Lipschitz constant of the hypergradient and identifying an appropriate fixed step size. Conversely, computing exact function values is not feasible, impeding the use of line search. In this work, we introduce a provably convergent inexact backtracking line search involving inexact function evaluations and hypergradients. We show convergence to a stationary point of the loss with respect to hyperparameters. Additionally, we propose an algorithm to determine the required accuracy dynamically. Our numerical experiments demonstrate the efficiency and feasibility of our approach for hyperparameter estimation in variational regularization problems, alongside its robustness in terms of the initial accuracy and step size choices.
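A minimal sketch of an Armijo-style backtracking line search with inexact oracles, where the sufficient-decrease test is relaxed by the evaluation accuracy and the accuracy is tightened when no step is accepted. The specific relaxation and tightening rules are illustrative, not the paper's exact scheme.

```python
import numpy as np

def inexact_backtracking(f_eps, g_eps, x, eps, c=1e-4, tau=0.5, max_iter=50):
    """Armijo backtracking with inexact oracles.

    f_eps(x, eps) / g_eps(x, eps) return the objective / (hyper)gradient up to
    accuracy eps, mimicking quantities obtained from an inner numerical solver.
    The sufficient-decrease test is relaxed by the evaluation error, a common
    device in inexact line searches (illustrative, not the paper's exact rule).
    """
    fx, g = f_eps(x, eps), g_eps(x, eps)
    d = -g                                        # descent direction
    t = 1.0
    for _ in range(max_iter):
        if f_eps(x + t * d, eps) <= fx + c * t * (g @ d) + 2 * eps:
            return x + t * d, t
        t *= tau
    return x, 0.0                                 # no acceptable step found

# Toy bilevel-flavoured demo: a noisy quadratic oracle with tunable accuracy.
rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])
f_eps = lambda x, eps: 0.5 * x @ A @ x + eps * rng.uniform(-1, 1)
g_eps = lambda x, eps: A @ x + eps * rng.uniform(-1, 1, size=x.shape)

x, eps = np.array([5.0, 5.0]), 0.5
for k in range(30):
    x, step = inexact_backtracking(f_eps, g_eps, x, eps)
    if step == 0.0:
        eps *= 0.5       # tighten the required accuracy dynamically
print("final point:", np.round(x, 3), "final accuracy:", eps)
```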

MLOps: A Review

  • paper_url: http://arxiv.org/abs/2308.10908
  • repo_url: https://github.com/jenningst/ecommerce-ops
  • paper_authors: Samar Wazir, Gautam Siddharth Kashyap, Parag Saxena
  • for: This study examines the importance of Machine Learning Operations (MLOps) methods in helping developers build software that uses machine learning algorithms.
  • methods: The authors assess the features and operability of various MLOps methods in order to choose the best tool structure for specific projects, reviewing a total of 22 papers that attempted to apply the MLOps idea.
  • results: The review finds that fully effective, self-regulating MLOps methods that limit human engagement remain scarce.
    Abstract Recently, Machine Learning (ML) has become a widely accepted method for rapid, significant progress, as it employs computational methods to teach machines and produce acceptable answers. This study examines the significance of Machine Learning Operations (MLOps) methods, which can provide acceptable answers for such problems. To assist in the creation of software that is simple to use, the authors survey MLOps methods. To choose the best tool structure for certain projects, the authors also assess the features and operability of various MLOps methods. A total of 22 papers that attempted to apply the MLOps idea were assessed. Finally, the authors note the scarcity of fully effective MLOps methods through which advancements can self-regulate by limiting human engagement.

Securing Pathways with Orthogonal Robots

  • paper_url: http://arxiv.org/abs/2308.10093
  • repo_url: None
  • paper_authors: Hamid Hoorfar, Faraneh Fathi, Sara Moshtaghi Largani, Alireza Bagheri
  • for: Protecting pathways is of great significance across domains such as urban planning, transportation, surveillance, and security.
  • methods: The paper introduces an approach that guards pathways using orthogonal robots, focusing on minimizing the number of orthogonal robots needed to effectively guard orthogonal areas.
  • results: The minimum number of orthogonal robots for pathways can be determined in linear time; in contrast, the general problem for simple polygons with general visibility is NP-hard even in the orthogonal case. Robots may be placed anywhere within the polygon, on the boundary or in the interior.
    Abstract The protection of pathways holds immense significance across various domains, including urban planning, transportation, surveillance, and security. This article introduces a groundbreaking approach to safeguarding pathways by employing orthogonal robots. The study specifically addresses the challenge of efficiently guarding orthogonal areas with the minimum number of orthogonal robots. The primary focus is on orthogonal pathways, characterized by a path-like dual graph of vertical decomposition. It is demonstrated that determining the minimum number of orthogonal robots for pathways can be achieved in linear time. However, it is essential to note that the general problem of finding the minimum number of robots for simple polygons with general visibility, even in the orthogonal case, is known to be NP-hard. Emphasis is placed on the flexibility of placing robots anywhere within the polygon, whether on the boundary or in the interior.

Minimizing Turns in Watchman Robot Navigation: Strategies and Solutions

  • paper_url: http://arxiv.org/abs/2308.10090
  • repo_url: None
  • paper_authors: Hamid Hoorfar, Sara Moshtaghi Largani, Reza Rahimi, Alireza Bagheri
  • for: This work proposes an efficient linear-time algorithm for the Orthogonal Watchman Route Problem (OWRP), enabling streamlined surveillance and patrol tasks in robotic systems.
  • methods: It presents a linear-time algorithm for OWRP under the assumption that the environment is monotone.
  • results: The algorithm solves OWRP in linear time while minimizing the number of turns in the route, improving the coverage and time efficiency of patrol robots.
    Abstract The Orthogonal Watchman Route Problem (OWRP) entails the search for the shortest path, known as the watchman route, that a robot must follow within a polygonal environment. The primary objective is to ensure that every point in the environment remains visible from at least one point on the route, allowing the robot to survey the entire area in a single, continuous sweep. This research places particular emphasis on reducing the number of turns in the route, as it is crucial for optimizing navigation in watchman routes within the field of robotics. The cost associated with changing direction is of significant importance, especially for specific types of robots. This paper introduces an efficient linear-time algorithm for solving the OWRP under the assumption that the environment is monotone. The findings of this study contribute to the progress of robotic systems by enabling the design of more streamlined patrol robots. These robots are capable of efficiently navigating complex environments while minimizing the number of turns. This advancement enhances their coverage and surveillance capabilities, making them highly effective in various real-world applications.

Contrastive Learning for Non-Local Graphs with Multi-Resolution Structural Views

  • paper_url: http://arxiv.org/abs/2308.10077
  • repo_url: None
  • paper_authors: Asif Khan, Amos Storkey
  • for: Learning node-level representations of heterophilic graphs matters for applications such as fraudster detection and protein function prediction, where capturing higher-order structural information is key.
  • methods: The paper proposes a multiview contrastive learning approach that integrates diffusion filters on graphs, using multiple graph views as augmentations to capture structural equivalence in heterophilic graphs and uncover hidden relationships and similarities.
  • results: On synthetic and real structural datasets, the method outperforms baselines, surpassing the best baseline by 16.06% on Cornell, 3.27% on Texas, and 8.04% on Wisconsin, and it consistently achieves superior performance on proximal tasks.
    Abstract Learning node-level representations of heterophilic graphs is crucial for various applications, including fraudster detection and protein function prediction. In such graphs, nodes share structural similarity identified by the equivalence of their connectivity which is implicitly encoded in the form of higher-order hierarchical information in the graphs. The contrastive methods are popular choices for learning the representation of nodes in a graph. However, existing contrastive methods struggle to capture higher-order graph structures. To address this limitation, we propose a novel multiview contrastive learning approach that integrates diffusion filters on graphs. By incorporating multiple graph views as augmentations, our method captures the structural equivalence in heterophilic graphs, enabling the discovery of hidden relationships and similarities not apparent in traditional node representations. Our approach outperforms baselines on synthetic and real structural datasets, surpassing the best baseline by $16.06\%$ on Cornell, $3.27\%$ on Texas, and $8.04\%$ on Wisconsin. Additionally, it consistently achieves superior performance on proximal tasks, demonstrating its effectiveness in uncovering structural information and improving downstream applications.
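Multi-resolution structural views can be produced with standard graph diffusion; the sketch below uses personalized-PageRank diffusion at several teleport values as an example of such views (the paper's exact filters may differ).

```python
import numpy as np

def ppr_diffusion(A: np.ndarray, alpha: float) -> np.ndarray:
    """Personalized-PageRank diffusion: S = alpha (I - (1 - alpha) D^-1 A)^-1.

    Different alpha values yield multi-resolution structural views of the same
    graph; several such views can serve as contrastive augmentations, which is
    the general idea sketched here.
    """
    d = A.sum(axis=1).clip(min=1.0)
    T = A / d[:, None]                    # row-normalized transition matrix
    n = A.shape[0]
    return alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * T)

# A 6-node toy graph: two triangles joined by a single edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

for alpha in (0.05, 0.15, 0.5):           # coarse -> fine structural views
    S = ppr_diffusion(A, alpha)
    print(f"alpha={alpha}: node 0 mass on far triangle = {S[0, 3:].sum():.3f}")
```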

ILCAS: Imitation Learning-Based Configuration-Adaptive Streaming for Live Video Analytics with Cross-Camera Collaboration

  • paper_url: http://arxiv.org/abs/2308.10068
  • repo_url: None
  • paper_authors: Duo Wu, Dayou Zhang, Miao Zhang, Ruoyu Zhang, Fangxin Wang, Shuguang Cui
  • for: This work aims to improve the accuracy and resource efficiency of deep neural network (DNN) inference in live video analytics (VA).
  • methods: It uses imitation learning (IL) and motion feature maps derived from motion vectors to adapt to video content dynamics, together with a cross-camera collaboration scheme that exploits spatio-temporal correlations among cameras.
  • results: Experiments show 2-20.9% higher mean accuracy and 19.9-85.3% lower chunk upload lag compared with existing solutions.
    Abstract The high-accuracy and resource-intensive deep neural networks (DNNs) have been widely adopted by live video analytics (VA), where camera videos are streamed over the network to resource-rich edge/cloud servers for DNN inference. Common video encoding configurations (e.g., resolution and frame rate) have been identified as having significant impacts on striking the balance between bandwidth consumption and inference accuracy, and therefore their adaptation scheme has been a focus of optimization. However, previous profiling-based solutions suffer from high profiling cost, while existing deep reinforcement learning (DRL) based solutions may achieve poor performance due to the usage of a fixed reward function for training the agent, which fails to reflect the application goals in various scenarios. In this paper, we propose ILCAS, the first imitation learning (IL) based configuration-adaptive VA streaming system. Unlike DRL-based solutions, ILCAS trains the agent with demonstrations collected from an expert, designed as an offline optimal policy that solves the configuration adaptation problem through dynamic programming. To tackle the challenge of video content dynamics, ILCAS derives motion feature maps based on motion vectors, which allow ILCAS to visually ``perceive'' video content changes. Moreover, ILCAS incorporates a cross-camera collaboration scheme to exploit the spatio-temporal correlations of cameras for more proper configuration selection. Extensive experiments confirm the superiority of ILCAS compared with state-of-the-art solutions, with 2-20.9% improvement of mean accuracy and 19.9-85.3% reduction of chunk upload lag.

eess.IV - 2023-08-20

Boosting Adversarial Transferability by Block Shuffle and Rotation

  • paper_url: http://arxiv.org/abs/2308.10299
  • repo_url: None
  • paper_authors: Kunyu Wang, Xuanran He, Wenxuan Wang, Xiaosen Wang
  • for: Defending deep learning models against adversarial examples and strengthening their security by studying attack transferability.
  • methods: An input transformation based attack, block shuffle and rotation (BSR), which splits the input image into blocks, then randomly shuffles and rotates them before gradient calculation.
  • results: BSR achieves significantly better transferability than existing input transformation based methods under both single-model and ensemble-model settings, and combining it with current input transformation methods improves transferability further.
    Abstract Adversarial examples mislead deep neural networks with imperceptible perturbations and have brought significant threats to deep learning. An important aspect is their transferability, which refers to their ability to deceive other models, thus enabling attacks in the black-box setting. Though various methods have been proposed to boost transferability, the performance still falls short compared with white-box attacks. In this work, we observe that existing input transformation based attacks, one of the mainstream transfer-based attacks, result in different attention heatmaps on various models, which might limit the transferability. We also find that breaking the intrinsic relation of the image can disrupt the attention heatmap of the original image. Based on this finding, we propose a novel input transformation based attack called block shuffle and rotation (BSR). Specifically, BSR splits the input image into several blocks, then randomly shuffles and rotates these blocks to construct a set of new images for gradient calculation. Empirical evaluations on the ImageNet dataset demonstrate that BSR could achieve significantly better transferability than the existing input transformation based methods under single-model and ensemble-model settings. Combining BSR with the current input transformation method can further improve the transferability, which significantly outperforms the state-of-the-art methods.
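The transformation itself is easy to state in code. The sketch below implements a block shuffle and rotation in the spirit of the abstract, assuming square blocks; the gradient averaging over several transformed copies that drives the attack is omitted.

```python
import torch

def block_shuffle_rotate(x: torch.Tensor, n_blocks: int = 2) -> torch.Tensor:
    """BSR-style input transformation: shuffle and rotate image blocks.

    Splits each image into an n_blocks x n_blocks grid (square blocks are
    assumed), randomly permutes the grid cells, and rotates every cell by a
    random multiple of 90 degrees. In the attack, gradients would be averaged
    over several such transformed copies; that outer loop is omitted here.
    """
    b, c, h, w = x.shape
    bh, bw = h // n_blocks, w // n_blocks
    # (B, C, rows, cols, bh, bw) grid of blocks
    blocks = x.reshape(b, c, n_blocks, bh, n_blocks, bw).permute(0, 1, 2, 4, 3, 5)
    blocks = blocks.reshape(b, c, n_blocks * n_blocks, bh, bw)
    blocks = blocks[:, :, torch.randperm(n_blocks * n_blocks)]   # shuffle cells
    rotated = [torch.rot90(blocks[:, :, i], k=int(torch.randint(0, 4, (1,))),
                           dims=(-2, -1))
               for i in range(n_blocks * n_blocks)]
    blocks = torch.stack(rotated, dim=2)
    # stitch the grid back into full images
    blocks = blocks.reshape(b, c, n_blocks, n_blocks, bh, bw).permute(0, 1, 2, 4, 3, 5)
    return blocks.reshape(b, c, h, w)

x = torch.rand(4, 3, 224, 224)
print(block_shuffle_rotate(x).shape)   # torch.Size([4, 3, 224, 224])
```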

Domain Reduction Strategy for Non Line of Sight Imaging

  • paper_url: http://arxiv.org/abs/2308.10269
  • repo_url: None
  • paper_authors: Hyunbo Shim, In Cho, Daekyu Kwon, Seon Joo Kim
  • for: Non-line-of-sight (NLOS) imaging, which aims to reconstruct hidden scenes under various setups.
  • methods: An optimization-based method that, by trivially ignoring interactions between hidden surfaces, computes the photon returns from each hidden point independently; a domain reduction procedure excludes empty areas of the hidden volume from the set of propagation functions, improving the computational efficiency of the optimization.
  • results: Experiments across NLOS scenarios, including non-planar relay walls, sparse scanning patterns, confocal and non-confocal setups, and surface geometry reconstruction, show the superiority and efficiency of the method in general NLOS scenarios.
    Abstract This paper presents a novel optimization-based method for non-line-of-sight (NLOS) imaging that aims to reconstruct hidden scenes under various setups. Our method is built upon the observation that photons returning from each point in hidden volumes can be independently computed if the interactions between hidden surfaces are trivially ignored. We model the generalized light propagation function to accurately represent the transients as a linear combination of these functions. Moreover, our proposed method includes a domain reduction procedure to exclude empty areas of the hidden volumes from the set of propagation functions, thereby improving computational efficiency of the optimization. We demonstrate the effectiveness of the method in various NLOS scenarios, including non-planar relay wall, sparse scanning patterns, confocal and non-confocal, and surface geometry reconstruction. Experiments conducted on both synthetic and real-world data clearly support the superiority and the efficiency of the proposed method in general NLOS scenarios.

Crucial Feature Capture and Discrimination for Limited Training Data SAR ATR

  • paper_url: http://arxiv.org/abs/2308.10911
  • repo_url: https://github.com/cwwangsaratr/saratr_feacapture_discrimination
  • paper_authors: Chenwei Wang, Siyi Luo, Jifang Pei, Yulin Huang, Yin Zhang, Jianyu Yang
  • for: This work aims to improve SAR ATR performance when only limited training samples are available.
  • methods: It proposes a SAR ATR framework with two branches and two modules: a global assisted branch and a local enhanced branch, plus a feature capture module and a feature discrimination module. In each training pass, the global assisted branch first performs an initial recognition based on the whole image; the feature capture module then automatically searches for and locks onto the crucial image regions (the "golden key" of the image), and the local enhanced branch further processes the captured local features.
  • results: Model soundness experiments and comparisons on MSTAR and OpenSARShip show that the method achieves superior recognition performance.
    Abstract Although deep learning-based methods have achieved excellent performance on SAR ATR, the difficulty of acquiring and labeling large numbers of SAR images makes these otherwise strong methods perform poorly. This may be because most of them take the whole target image as input, yet research finds that, under limited training data, deep learning models cannot capture the discriminative regions of the whole image and instead focus on useless or even harmful regions for recognition. The results are therefore unsatisfactory. In this paper, we design a SAR ATR framework for limited training samples, which mainly consists of two branches and two modules: a global assisted branch and a local enhanced branch, together with a feature capture module and a feature discrimination module. In every training pass, the global assisted branch first completes an initial recognition based on the whole image. Based on the initial recognition results, the feature capture module automatically searches for and locks onto the crucial image regions for correct recognition, which we call the golden key of the image. The local enhanced branch then extracts local features from the captured crucial regions. Finally, the overall features and local features are fed into the classifier and dynamically weighted by learnable voting parameters to collaboratively complete the final recognition under limited training samples. Model soundness experiments demonstrate the effectiveness of our method through improved feature distributions and recognition probabilities. Experimental results and comparisons on MSTAR and OpenSARShip show that our method achieves superior recognition performance.

An Entropy-Awareness Meta-Learning Method for SAR Open-Set ATR

  • paper_url: http://arxiv.org/abs/2308.10251
  • repo_url: None
  • paper_authors: Chenwei Wang, Siyi Luo, Jifang Pei, Xiaoyu Liu, Yulin Huang, Yin Zhang, Jianyu Yang
  • for: This paper targets the open set recognition (OSR) problem in SAR automatic target recognition (SAR ATR), i.e., distinguishing unseen target classes while classifying seen ones.
  • methods: It proposes an entropy-awareness meta-learning method: through meta-learning tasks, the model learns to construct a feature space for dynamically assigned known classes that both classifies them and rejects all other classes, while an entropy-awareness loss strengthens the discrimination between known and unknown classes.
  • results: Experiments on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset show the method effectively classifies the dynamically assigned known classes while rejecting unknown classes.
    Abstract Existing synthetic aperture radar automatic target recognition (SAR ATR) methods have been effective for classifying seen target classes. However, it is more meaningful and challenging to distinguish unseen target classes, i.e., the open set recognition (OSR) problem, which is an urgent problem for practical SAR ATR. The key to OSR is to effectively establish the exclusiveness of the feature distribution of the known classes. In this letter, we propose an entropy-awareness meta-learning method that improves the exclusiveness of the feature distribution of the known classes, meaning our method is effective not only for classifying the seen classes but also for handling unseen classes. Through meta-learning tasks, the proposed method learns to construct a feature space for the dynamically assigned known classes. This feature space is required by the tasks to reject all other classes that do not belong to the known classes. At the same time, the proposed entropy-awareness loss helps the model enhance the feature space with effective and robust discrimination between the known and unknown classes. Therefore, our method can construct a dynamic feature space that discriminates between the known and unknown classes, simultaneously classifying the dynamically assigned known classes and rejecting the unknown classes. Experiments conducted on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset have shown the effectiveness of our method for SAR OSR.

SAR Ship Target Recognition via Selective Feature Discrimination and Multifeature Center Classifier

  • paper_url: http://arxiv.org/abs/2308.10250
  • repo_url: None
  • paper_authors: Chenwei Wang, Siyi Luo, Jifang Pei, Yulin Huang, Yin Zhang, Jianyu Yang
  • for: This work aims to improve SAR ship target recognition accuracy, especially when training data are scarce.
  • methods: It proposes a SAR ship recognition method based on selective feature discrimination and a multifeature center classifier. Selective feature discrimination automatically finds the similar partial features across the most similar inter-class image pairs and the dissimilar partial features across the most dissimilar inner-class pairs, then applies a loss that gives these partial features more inter-class separability. The multifeature center classifier assigns multiple learnable feature centers to each ship class, dividing the large inner-class variance into several smaller ones.
  • results: Experiments on the OpenSARShip and FUSAR-Ship datasets show superior recognition performance as the number of training samples decreases.
    Abstract Maritime surveillance is not only necessary for every country, as in maritime safeguarding and fishing controls, but also plays an essential role internationally, as in rescue support and illegal immigration control. Most existing automatic target recognition (ATR) methods directly feed the whole extracted features of SAR ships into one classifier, and the classifiers of most methods assign only one feature center to each class. However, SAR ship images exhibit large inner-class variance and small inter-class difference, so the whole features contain useless partial features, and a single feature center per class fails under the large inner-class variance. We propose a SAR ship target recognition method via selective feature discrimination and a multifeature center classifier. Selective feature discrimination automatically finds the similar partial features from the most similar inter-class image pairs and the dissimilar partial features from the most dissimilar inner-class image pairs. It then provides a loss that gives these partial features more inter-class separability. Motivated by divide and conquer, the multifeature center classifier assigns multiple learnable feature centers to each ship class. In this way, the multiple feature centers divide the large inner-class variance into several smaller variances, which are conquered by combining all feature centers of one ship class. Finally, the probability distribution over all feature centers is considered comprehensively to achieve accurate recognition of SAR ship images. Ablation experiments and results on the OpenSARShip and FUSAR-Ship datasets show that our method achieves superior recognition performance with decreasing numbers of training SAR ship samples.
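A sketch of the multifeature center classifier idea: each class owns several learnable centers, and a sample scores each class by its nearest center. The min-distance rule and the dimensions below are illustrative; the paper aggregates a probability distribution over all centers.

```python
import torch
import torch.nn as nn

class MultiCenterClassifier(nn.Module):
    """Distance-based classifier with several learnable centers per class.

    Each class owns K feature centers, so one large intra-class spread is
    split across K smaller clusters. A sample's class score is (minus) its
    squared distance to the nearest center of that class; this min rule is
    a simplification of the paper's probability aggregation.
    """
    def __init__(self, n_classes: int, feat_dim: int, k_centers: int = 3):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_classes, k_centers, feat_dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # squared distances: (batch, n_classes, k_centers)
        diff = feats[:, None, None, :] - self.centers[None]
        d2 = (diff ** 2).sum(dim=-1)
        # nearest-center distance per class; negate so larger = more likely
        return -d2.min(dim=-1).values          # logits: (batch, n_classes)

clf = MultiCenterClassifier(n_classes=6, feat_dim=128)
feats = torch.randn(32, 128)                   # features from a backbone
logits = clf(feats)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 6, (32,)))
loss.backward()                                # centers receive gradients
print(logits.shape)
```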

SAR Ship Target Recognition Via Multi-Scale Feature Attention and Adaptive-Weighed Classifier

  • paper_url: http://arxiv.org/abs/2308.10247
  • repo_url: None
  • paper_authors: Chenwei Wang, Jifang Pei, Siyi Luo, Weibo Huo, Yulin Huang, Yin Zhang, Jianyu Yang
  • for: This study aims to improve SAR ship recognition accuracy by addressing the large inner-class variance and inter-class overlap of SAR ship features.
  • methods: It proposes a SAR ship recognition method based on multi-scale feature attention and an adaptive-weighted classifier, enhancing the representation of each feature scale and adaptively selecting the effective feature scales for accurate recognition.
  • results: Experiments and comparisons on the OpenSARShip dataset validate that the method achieves state-of-the-art performance for SAR ship recognition.
    Abstract Maritime surveillance is indispensable in civilian fields, including national maritime safeguarding and channel monitoring, in which synthetic aperture radar (SAR) ship target recognition is a crucial research area. The core difficulty in accurate SAR ship target recognition is the large inner-class variance and inter-class overlap of SAR ship features, which limits recognition performance. Most existing methods plainly extract multi-scale features of the network and utilize each feature scale equally in the classification stage. However, the shallow multi-scale features are not discriminative enough, and the feature scales are not equally effective for recognition. These factors limit recognition performance. Therefore, we propose a SAR ship recognition method via multi-scale feature attention and an adaptive-weighted classifier, which enhances the features in each scale and adaptively chooses the effective feature scales for accurate recognition. We first construct an in-network feature pyramid to extract multi-scale features from SAR ship images. The multi-scale feature attention then extracts and enhances the principal components of the multi-scale features with greater inner-class compactness and inter-class separability. Finally, the adaptive-weighted classifier chooses the effective feature scales in the feature pyramid to achieve the final precise recognition. Experiments and comparisons on the OpenSARShip dataset validate that the proposed method achieves state-of-the-art performance for SAR ship recognition.

SAR ATR Method with Limited Training Data via an Embedded Feature Augmenter and Dynamic Hierarchical-Feature Refiner

  • paper_url: http://arxiv.org/abs/2308.10243
  • repo_url: None
  • paper_authors: Chenwei Wang, Siyi Luo, Yulin Huang, Jifang Pei, Yin Zhang, Jianyu Yang
  • for: Improving SAR automatic target recognition (ATR) performance when training data are limited.
  • methods: The proposed method includes: 1. an embedded feature augmenter that enhances the extracted virtual features located far from the class center, pulling them toward it with varying strengths; 2. a dynamic hierarchical-feature refiner that captures discriminative local features of the samples and integrates them with the global features.
  • results: Experiments on the MSTAR, OpenSARShip, and FUSAR-Ship benchmark datasets show that the method improves ATR performance under limited SAR training data and achieves outstanding results.
    Abstract Without sufficient data, the quantity of information available for supervised training is constrained, as obtaining sufficient synthetic aperture radar (SAR) training data in practice is frequently challenging. Therefore, current SAR automatic target recognition (ATR) algorithms perform poorly with limited training data availability, resulting in a critical need to increase SAR ATR performance. In this study, a new method to improve SAR ATR when training data are limited is proposed. First, an embedded feature augmenter is designed to enhance the extracted virtual features located far away from the class center. Based on the relative distribution of the features, the algorithm pulls the corresponding virtual features with different strengths toward the corresponding class center. The designed augmenter increases the amount of information available for supervised training and improves the separability of the extracted features. Second, a dynamic hierarchical-feature refiner is proposed to capture the discriminative local features of the samples. Through dynamically generated kernels, the proposed refiner integrates the discriminative local features of different dimensions into the global features, further enhancing the inner-class compactness and inter-class separability of the extracted features. The proposed method not only increases the amount of information available for supervised training but also extracts the discriminative features from the samples, resulting in superior ATR performance in problems with limited SAR training data. Experimental results on the moving and stationary target acquisition and recognition (MSTAR), OpenSARShip, and FUSAR-Ship benchmark datasets demonstrate the robustness and outstanding ATR performance of the proposed method in response to limited SAR training data.

Blind Face Restoration for Under-Display Camera via Dictionary Guided Transformer

  • paper_url: http://arxiv.org/abs/2308.10196
  • repo_url: None
  • paper_authors: Jingfan Tan, Xiaoxu Chen, Tao Wang, Kaihao Zhang, Wenhan Luo, Xiaocun Cao
  • for: The paper aims to address the problem of restoring face images taken by under-display cameras (UDCs), which are affected by significant quality degradation due to the characteristics of the display.
  • methods: The proposed method uses a two-stage network called UDC-DMNet to synthesize UDC images by modeling the processes of UDC imaging. Additionally, a novel dictionary-guided transformer network named DGFormer is proposed, which incorporates a facial component dictionary and the characteristics of UDC images to address blind face restoration in UDC scenarios.
  • results: The proposed DGFormer and UDC-DMNet achieve state-of-the-art performance in restoring face images taken by UDCs.
    Abstract By hiding the front-facing camera below the display panel, Under-Display Camera (UDC) provides users with a full-screen experience. However, due to the characteristics of the display, images taken by UDC suffer from significant quality degradation. Methods have been proposed to tackle UDC image restoration and advances have been achieved, but there are still no specialized methods and datasets for restoring UDC face images, which may be the most common problem in the UDC scene. To this end, considering color filtering, brightness attenuation, and diffraction in the imaging process of UDC, we propose a two-stage UDC Degradation Model Network (UDC-DMNet) that synthesizes UDC images by modeling the processes of UDC imaging. We then use UDC-DMNet together with high-quality face images from FFHQ and CelebA-Test to create the UDC face training datasets FFHQ-P/T and the testing datasets CelebA-Test-P/T for UDC face restoration. We further propose a novel dictionary-guided transformer network named DGFormer. Introducing the facial component dictionary and the characteristics of the UDC image into the restoration makes DGFormer capable of blind face restoration in UDC scenarios. Experiments show that our DGFormer and UDC-DMNet achieve state-of-the-art performance.

WMFormer++: Nested Transformer for Visible Watermark Removal via Implicit Joint Learning

  • paper_url: http://arxiv.org/abs/2308.10195
  • repo_url: None
  • paper_authors: Dongjian Huo, Zehong Zhang, Hanjing Su, Guanbin Li, Chaowei Fang, Qingyao Wu
  • for: This work aims to improve watermark removal techniques, offering an adversarial means to enhance watermark robustness and thereby strengthen media copyright protection.
  • methods: It introduces an implicit joint learning paradigm that lets the network autonomously navigate the flow of information between the watermark localization and background restoration branches via a gate mechanism, aided by cross-channel attention and nested structures for multi-scale integration.
  • results: Experiments on various challenging benchmarks show the approach's remarkable superiority, surpassing existing state-of-the-art methods by a large margin.
    Abstract Watermarking serves as a widely adopted approach to safeguard media copyright. In parallel, the research focus has extended to watermark removal techniques, offering an adversarial means to enhance watermark robustness and foster advancements in the watermarking field. Existing watermark removal methods mainly rely on UNet with task-specific decoder branches--one for watermark localization and the other for background image restoration. However, watermark localization and background restoration are not isolated tasks; precise watermark localization inherently implies regions necessitating restoration, and the background restoration process contributes to more accurate watermark localization. To holistically integrate information from both branches, we introduce an implicit joint learning paradigm. This empowers the network to autonomously navigate the flow of information between implicit branches through a gate mechanism. Furthermore, we employ cross-channel attention to facilitate local detail restoration and holistic structural comprehension, while harnessing nested structures to integrate multi-scale information. Extensive experiments are conducted on various challenging benchmarks to validate the effectiveness of our proposed method. The results demonstrate our approach's remarkable superiority, surpassing existing state-of-the-art methods by a large margin.

EDDense-Net: Fully Dense Encoder Decoder Network for Joint Segmentation of Optic Cup and Disc

  • paper_url: http://arxiv.org/abs/2308.10192
  • repo_url: None
  • paper_authors: Mehwish Mehmood, Khuram Naveed, Haroon Ahmed Khan, Syed S. Naqvi
  • for: The paper targets the detection and diagnosis of glaucoma, in particular the joint segmentation of the optic cup (OC) and optic disc (OD) used to estimate the cup-to-disc ratio.
  • methods: It proposes EDDense-Net, an encoder-decoder segmentation network for joint OC and OD segmentation in which every block is a dense block with grouped convolutional layers, allowing the network to acquire and convey spatial information while reducing its complexity.
  • results: Evaluated on two publicly available datasets, the method outperforms existing state-of-the-art methods in accuracy and efficiency, and it can serve as a second opinion system to assist ophthalmologists.
    Abstract Glaucoma is an eye disease that causes damage to the optic nerve, which can lead to visual loss and permanent blindness. Early glaucoma detection is therefore critical to avoid permanent blindness. The estimation of the cup-to-disc ratio (CDR) during an examination of the optic disc (OD) is used for the diagnosis of glaucoma. In this paper, we present the EDDense-Net segmentation network for the joint segmentation of the OC and OD. The encoder and decoder in this network are made up of dense blocks with a grouped convolutional layer in each block, allowing the network to acquire and convey spatial information from the image while simultaneously reducing the network's complexity. To reduce spatial information loss, the optimal number of filters in all convolution layers was utilised. In semantic segmentation, dice pixel classification is employed in the decoder to alleviate the problem of class imbalance. The proposed network was evaluated on two publicly available datasets, where it outperformed existing state-of-the-art methods in terms of accuracy and efficiency. For the diagnosis and analysis of glaucoma, this method can be used as a second opinion system to assist medical ophthalmologists.
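The decoder's dice pixel classification corresponds to the standard soft Dice loss, sketched below for a three-class cup/disc/background setup (the class layout is assumed for illustration).

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """Soft Dice loss for segmentation.

    pred: (B, C, H, W) class probabilities; target: (B, C, H, W) one-hot masks.
    Dice weighs overlap relative to region size, which softens the class
    imbalance between small structures (optic cup/disc) and background.
    """
    dims = (0, 2, 3)
    inter = (pred * target).sum(dims)
    denom = pred.sum(dims) + target.sum(dims)
    dice = (2 * inter + eps) / (denom + eps)   # per-class Dice coefficient
    return 1.0 - dice.mean()

pred = torch.softmax(torch.randn(2, 3, 64, 64), dim=1)   # cup / disc / background
target = torch.nn.functional.one_hot(
    torch.randint(0, 3, (2, 64, 64)), num_classes=3).permute(0, 3, 1, 2).float()
print(dice_loss(pred, target))
```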

Contrastive Diffusion Model with Auxiliary Guidance for Coarse-to-Fine PET Reconstruction

  • paper_url: http://arxiv.org/abs/2308.10157
  • repo_url: https://github.com/show-han/pet-reconstruction
  • paper_authors: Zeyu Han, Yuhan Wang, Luping Zhou, Peng Wang, Binyu Yan, Jiliu Zhou, Yan Wang, Dinggang Shen
    for: This paper aims to improve the quality of standard-dose positron emission tomography (PET) scans while reducing radiation exposure to the human body.methods: The proposed method uses a coarse-to-fine framework that consists of a coarse prediction module (CPM) and an iterative refinement module (IRM). The CPM generates a coarse PET image via a deterministic process, while the IRM samples the residual iteratively. Additionally, two strategies are proposed and integrated into the reconstruction process to enhance the correspondence between the low-dose PET (LPET) image and the reconstructed PET (RPET) image.results: The proposed method outperforms state-of-the-art PET reconstruction methods in terms of clinical reliability, as demonstrated by extensive experiments on two human brain PET datasets.
    Abstract To obtain high-quality positron emission tomography (PET) scans while reducing radiation exposure to the human body, various approaches have been proposed to reconstruct standard-dose PET (SPET) images from low-dose PET (LPET) images. One widely adopted technique is the generative adversarial networks (GANs), yet recently, diffusion probabilistic models (DPMs) have emerged as a compelling alternative due to their improved sample quality and higher log-likelihood scores compared to GANs. Despite this, DPMs suffer from two major drawbacks in real clinical settings, i.e., the computationally expensive sampling process and the insufficient preservation of correspondence between the conditioning LPET image and the reconstructed PET (RPET) image. To address the above limitations, this paper presents a coarse-to-fine PET reconstruction framework that consists of a coarse prediction module (CPM) and an iterative refinement module (IRM). The CPM generates a coarse PET image via a deterministic process, and the IRM samples the residual iteratively. By delegating most of the computational overhead to the CPM, the overall sampling speed of our method can be significantly improved. Furthermore, two additional strategies, i.e., an auxiliary guidance strategy and a contrastive diffusion strategy, are proposed and integrated into the reconstruction process, which can enhance the correspondence between the LPET image and the RPET image, further improving clinical reliability. Extensive experiments on two human brain PET datasets demonstrate that our method outperforms the state-of-the-art PET reconstruction methods. The source code is available at \url{https://github.com/Show-han/PET-Reconstruction}.

Federated Pseudo Modality Generation for Incomplete Multi-Modal MRI Reconstruction

  • paper_url: http://arxiv.org/abs/2308.10910
  • repo_url: None
  • paper_authors: Yunlu Yan, Chun-Mei Feng, Yuexiang Li, Rick Siow Mong Goh, Lei Zhu
  • for: addresses the missing modality challenge in federated multi-modal MRI reconstruction
  • methods: utilizes a pseudo modality generation mechanism to recover the missing modality, and introduces a clustering scheme to reduce communication costs
  • results: can effectively complete the missing modality within an acceptable communication cost, with similar performance to the ideal scenario
    Abstract While multi-modal learning has been widely used for MRI reconstruction, it relies on paired multi-modal data which is difficult to acquire in real clinical scenarios. Especially in the federated setting, the common situation is that several medical institutions only have single-modal data, termed the modality missing issue. Therefore, it is infeasible to deploy a standard federated learning framework in such conditions. In this paper, we propose a novel communication-efficient federated learning framework, namely Fed-PMG, to address the missing modality challenge in federated multi-modal MRI reconstruction. Specifically, we utilize a pseudo modality generation mechanism to recover the missing modality for each single-modal client by sharing the distribution information of the amplitude spectrum in frequency space. However, the step of sharing the original amplitude spectrum leads to heavy communication costs. To reduce the communication cost, we introduce a clustering scheme to project the set of amplitude spectrum into finite cluster centroids, and share them among the clients. With such an elaborate design, our approach can effectively complete the missing modality within an acceptable communication cost. Extensive experiments demonstrate that our proposed method can attain similar performance with the ideal scenario, i.e., all clients have the full set of modalities. The source code will be released.
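The communication-saving step can be sketched directly: compute amplitude spectra, cluster them, and share only the centroids. The image sizes, the number of clusters, and the plain k-means below are illustrative assumptions; the paper's federated protocol has more structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def amplitude_centroids(images: np.ndarray, k: int, iters: int = 20) -> np.ndarray:
    """Cluster 2-D FFT amplitude spectra into k centroids (tiny k-means).

    A client would share only these k centroids instead of per-image spectra,
    which is the communication-saving idea sketched here (shapes and k are
    illustrative, not the paper's settings).
    """
    amps = np.abs(np.fft.fft2(images)).reshape(len(images), -1)
    cent = amps[rng.choice(len(amps), size=k, replace=False)]
    for _ in range(iters):
        assign = ((amps[:, None, :] - cent[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (assign == j).any():
                cent[j] = amps[assign == j].mean(axis=0)
    return cent.reshape(k, *images.shape[1:])

client_images = rng.random((64, 32, 32))   # one client's single-modal data
centroids = amplitude_centroids(client_images, k=8)
raw, shared = client_images.size, centroids.size
print(f"shared {shared} values instead of {raw} ({raw / shared:.0f}x fewer)")
```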

Polymerized Feature-based Domain Adaptation for Cervical Cancer Dose Map Prediction

  • paper_url: http://arxiv.org/abs/2308.10142
  • repo_url: None
  • paper_authors: Jie Zeng, Zeyu Han, Xingchen Peng, Jianghong Xiao, Peng Wang, Yan Wang
  • for: This paper aims to improve deep learning (DL) based dose map prediction for cervical cancer radiotherapy planning, where only a small amount of data is available.
  • methods: It transfers the rich knowledge learned from another cancer, rectum cancer, which has the same scanning area and more clinically available data, through domain adaptation; a Transformer-based polymerized feature module (PFM) generates an optimal polymerized feature distribution to smoothly align the two input distributions.
  • results: Experimental results on two in-house clinical datasets show the method outperforms state-of-the-art methods in dose map prediction for cervical cancer.
    Abstract Recently, deep learning (DL) has automated and accelerated the clinical radiation therapy (RT) planning significantly by predicting accurate dose maps. However, most DL-based dose map prediction methods are data-driven and not applicable for cervical cancer where only a small amount of data is available. To address this problem, this paper proposes to transfer the rich knowledge learned from another cancer, i.e., rectum cancer, which has the same scanning area and more clinically available data, to improve the dose map prediction performance for cervical cancer through domain adaptation. In order to close the congenital domain gap between the source (i.e., rectum cancer) and the target (i.e., cervical cancer) domains, we develop an effective Transformer-based polymerized feature module (PFM), which can generate an optimal polymerized feature distribution to smoothly align the two input distributions. Experimental results on two in-house clinical datasets demonstrate the superiority of the proposed method compared with state-of-the-art methods.
    摘要 近年来,深度学习(DL)通过预测精确的剂量图,显著地自动化并加速了临床放射治疗(RT)规划。然而,大多数基于DL的剂量图预测方法是数据驱动的,并不适用于数据量很少的宫颈癌。为解决这一问题,本文提出将另一种扫描区域相同且临床数据更充足的癌症(即直肠癌)中学到的丰富知识,通过领域自适应迁移到宫颈癌上,以提升剂量图预测性能。为了消除源域(直肠癌)与目标域(宫颈癌)之间固有的领域差距,我们设计了一个高效的基于Transformer的聚合特征模块(PFM),它能生成最优的聚合特征分布,平滑地对齐两个输入分布。在两个自有临床数据集上的实验结果表明,所提方法优于现有的最先进方法。
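
The goal of the PFM — pulling the source (rectum) and target (cervix) feature distributions together — can be illustrated with a much simpler stand-in. The sketch below uses a maximum-mean-discrepancy (MMD) penalty purely to show what "aligning two input distributions" means; the paper's actual module is a Transformer, and all sizes here are assumptions.

```python
# Toy distribution-alignment objective (an MMD stand-in, not the paper's PFM).
import torch

def rbf_mmd(x, y, sigma=1.0):
    """RBF-kernel maximum mean discrepancy between two feature batches."""
    def k(a, b):
        d = torch.cdist(a, b) ** 2
        return torch.exp(-d / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

source_feats = torch.randn(64, 128)   # e.g. rectum-cancer features
target_feats = torch.randn(64, 128)   # e.g. cervical-cancer features
# Minimizing this alongside the dose-prediction loss would encourage the
# shared encoder to produce domain-aligned features.
print(float(rbf_mmd(source_feats, target_feats)))
```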

Sensitivity analysis of AI-based algorithms for autonomous driving on optical wavefront aberrations induced by the windshield

  • paper_url: http://arxiv.org/abs/2308.11711
  • repo_url: None
  • paper_authors: Dominik Werner Wolf, Markus Ulrich, Nikhil Kapoor
  • for: 这篇论文研究自动驾驶感知技术在不同车辆与挡风玻璃配置下的性能问题。
  • methods: 这类感知技术通常基于在真实街道数据上训练的有监督机器学习模型。然而,将训练好的模型部署到不同的车型上可能引发域偏移问题,从而损害神经网络的性能并违反ADAS的工作要求。
  • results: 论文通过评估两个感知模型对不同挡风玻璃配置的敏感性来研究域偏移问题。结果显示,挡风玻璃配置会带来性能差距,而现有用于提出光学要求的评价函数可能并不充分。
    Abstract Autonomous driving perception techniques are typically based on supervised machine learning models that are trained on real-world street data. A typical training process involves capturing images with a single car model and windshield configuration. However, deploying these trained models on different car types can lead to a domain shift, which can potentially hurt the neural networks performance and violate working ADAS requirements. To address this issue, this paper investigates the domain shift problem further by evaluating the sensitivity of two perception models to different windshield configurations. This is done by evaluating the dependencies between neural network benchmark metrics and optical merit functions by applying a Fourier optics based threat model. Our results show that there is a performance gap introduced by windshields and existing optical metrics used for posing requirements might not be sufficient.
    摘要 自动驾驶感知技术通常基于有监督机器学习模型,这些模型通过真实街道数据进行训练。典型的训练过程使用单一车型和挡风玻璃配置采集图像。然而,将这些训练好的模型部署到不同的车型上可能导致域偏移问题,进而损害神经网络性能并违反ADAS的工作要求。为解决这一问题,本文进一步研究了域偏移问题,评估了两个感知模型对不同挡风玻璃配置的敏感性。我们应用基于傅里叶光学的威胁模型,评估神经网络基准指标与光学评价函数之间的相关性。结果表明,挡风玻璃会引入性能差距,而现有用于提出要求的光学指标可能并不充分。
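
The Fourier-optics threat model reduces to a simple pipeline: a wavefront aberration W over the pupil gives a pupil field P·exp(i2πW), the point spread function is |FFT(field)|², and the image is convolved with that PSF before reaching the perception model. A toy version with an assumed defocus-like (Noll Z4-style) aberration — the coefficient and geometry are illustrative, not the windshield's measured wavefront:

```python
# Minimal Fourier-optics sketch of an aberration-induced image degradation.
import numpy as np
from scipy.signal import fftconvolve

def aberrated_psf(n=64, aberration_waves=0.5):
    """PSF of a circular pupil with a simple defocus-like phase term."""
    y, x = np.mgrid[-1:1:n * 1j, -1:1:n * 1j]
    r2 = x ** 2 + y ** 2
    pupil = (r2 <= 1.0).astype(float)
    wavefront = aberration_waves * (2 * r2 - 1)        # defocus, in waves
    field = pupil * np.exp(2j * np.pi * wavefront)
    psf = np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2
    return psf / psf.sum()

img = np.random.rand(128, 128)                # stand-in for a camera frame
psf = aberrated_psf(aberration_waves=0.75)
blurred = fftconvolve(img, psf, mode="same")
# A sensitivity study would sweep aberration_waves and track the perception
# model's benchmark metric on the blurred images.
```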

cs.SD - 2023-08-19

Spatial Reconstructed Local Attention Res2Net with F0 Subband for Fake Speech Detection

  • paper_url: http://arxiv.org/abs/2308.09944
  • repo_url: None
  • paper_authors: Cunhang Fan, Jun Xue, Jianhua Tao, Jiangyan Yi, Chenglong Wang, Chengshi Zheng, Zhao Lv
  • for: 本研究旨在提升伪造语音检测(FSD)任务的性能,特别是针对合成语音节奏过于平滑的问题。
  • methods: 本文提出了一种新的F0子带,以及一种带有空间重建局部注意力的Res2Net网络(SR-LA Res2Net)。
  • results: 在ASVspoof 2019 LA数据集上,所提方法取得了0.47%的EER和0.0159的min t-DCF,在所有单系统中达到了最佳性能。
    Abstract The rhythm of synthetic speech is usually too smooth, which causes that the fundamental frequency (F0) of synthetic speech is significantly different from that of real speech. It is expected that the F0 feature contains the discriminative information for the fake speech detection (FSD) task. In this paper, we propose a novel F0 subband for FSD. In addition, to effectively model the F0 subband so as to improve the performance of FSD, the spatial reconstructed local attention Res2Net (SR-LA Res2Net) is proposed. Specifically, Res2Net is used as a backbone network to obtain multiscale information, and enhanced with a spatial reconstruction mechanism to avoid losing important information when the channel group is constantly superimposed. In addition, local attention is designed to make the model focus on the local information of the F0 subband. Experimental results on the ASVspoof 2019 LA dataset show that our proposed method obtains an equal error rate (EER) of 0.47% and a minimum tandem detection cost function (min t-DCF) of 0.0159, achieving the state-of-the-art performance among all of the single systems.
    摘要 合成语音的节奏通常过于平滑,导致合成语音的基频(F0)与真实语音存在显著差异。因此,F0特征有望为伪造语音检测(FSD)任务提供判别信息。本文提出了一种新的用于FSD的F0子带。此外,为了有效建模F0子带以提升FSD性能,我们提出了空间重建局部注意力Res2Net(SR-LA Res2Net)。具体来说,以Res2Net作为骨干网络获取多尺度信息,并引入空间重建机制,避免在通道分组不断叠加时丢失重要信息;同时设计局部注意力,使模型聚焦于F0子带的局部信息。在ASVspoof 2019 LA数据集上的实验结果表明,所提方法取得了0.47%的等错误率(EER)和0.0159的最小串联检测代价函数(min t-DCF),在所有单系统中达到了最先进的性能。
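
As a rough illustration of the F0-subband front-end, one can track F0 (whose contour tends to be unnaturally smooth in synthetic speech) and keep only the low-frequency spectrogram bins that cover it. The 400 Hz cut-off, feature choice, and synthetic test tone below are assumptions, not the paper's exact front-end:

```python
# Hedged sketch of an F0-focused low-frequency subband input.
import numpy as np
import librosa

sr = 16000
y = 0.5 * np.sin(2 * np.pi * 150 * np.arange(sr) / sr)  # 150 Hz test tone

# Track F0; a near-zero std over time is a "too smooth" contour.
f0, voiced, _ = librosa.pyin(y, fmin=50, fmax=400, sr=sr)

# Keep only the STFT bins covering the F0 range as the "F0 subband".
stft = np.abs(librosa.stft(y, n_fft=512))
freqs = librosa.fft_frequencies(sr=sr, n_fft=512)
subband = stft[freqs <= 400, :]        # subband fed to the detector network
print(subband.shape, np.nanstd(f0))    # smoothness proxy: F0 std over time
```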

cs.CV - 2023-08-19

DPL: Decoupled Prompt Learning for Vision-Language Models

  • paper_url: http://arxiv.org/abs/2308.10061
  • repo_url: None
  • paper_authors: Chen Xu, Yuhan Zhu, Guozhen Zhang, Haocheng Shen, Yixuan Liao, Xiaoxin Chen, Gangshan Wu, Limin Wang
  • for: 本研究旨在提升CLIP等基础视觉语言模型向下游任务迁移的效果。现有方法往往过拟合已见过的类别,从而限制其对未见类别的泛化能力。
  • methods: 我们提出了一种新方法,即解耦提示学习(DPL),它通过重新表述提示学习中的注意力机制来缓解这一问题。具体来说,我们将原始自注意力重新表述为四个独立的子过程,从理论上研究提示与实例(即图像块/文本词元)之间的协作过程。
  • results: 我们的方法在涵盖15个图像识别数据集的三个代表性基准上取得了最先进的表现,且不依赖任何辅助正则化任务或额外训练数据,进一步表明了其出色的泛化能力。
    Abstract Prompt learning has emerged as an efficient and effective approach for transferring foundational Vision-Language Models (e.g., CLIP) to downstream tasks. However, current methods tend to overfit to seen categories, thereby limiting their generalization ability for unseen classes. In this paper, we propose a new method, Decoupled Prompt Learning (DPL), which reformulates the attention in prompt learning to alleviate this problem. Specifically, we theoretically investigate the collaborative process between prompts and instances (i.e., image patches/text tokens) by reformulating the original self-attention into four separate sub-processes. Through detailed analysis, we observe that certain sub-processes can be strengthened to bolster robustness and generalizability by some approximation techniques. Furthermore, we introduce language-conditioned textual prompting based on decoupled attention to naturally preserve the generalization of text input. Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning. By combining the proposed techniques, our approach achieves state-of-the-art performance on three representative benchmarks encompassing 15 image recognition datasets, while maintaining parameter-efficient. Moreover, our DPL does not rely on any auxiliary regularization task or extra training data, further demonstrating its remarkable generalization ability.
    摘要 提示学习(Prompt Learning)已成为将基础视觉语言模型(如CLIP)迁移到下游任务的高效方法。然而,现有方法往往过拟合已见类别,从而限制其对未见类别的泛化能力。本文提出了一种新方法:解耦提示学习(DPL),通过重新表述提示学习中的注意力机制来缓解这一问题。具体来说,我们从理论上研究了提示与实例(即图像块/文本词元)之间的协作过程,并将原始自注意力分解为四个独立的子过程。通过细致分析,我们发现某些子过程可以借助近似技巧加以强化,从而提升鲁棒性和泛化能力。此外,我们基于解耦注意力引入了语言条件的文本提示,以自然地保持文本输入的泛化性。我们的方法对视觉和文本模态都很灵活,易于扩展到多模态提示学习。结合上述技术,我们的方法在涵盖15个图像识别数据集的三个代表性基准上取得了最先进的性能,同时保持参数高效。此外,DPL不依赖任何辅助正则化任务或额外训练数据,进一步证明了其出色的泛化能力。
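
The "four sub-processes" are simply the four blocks of the attention map over the concatenation of prompt and instance tokens: prompt→prompt, prompt→instance, instance→prompt, instance→instance. The sketch below makes that decomposition explicit; the identity q/k/v projections and single head are simplifications, and DPL's per-block reweighting is not reproduced here.

```python
# Hedged sketch of decoupling attention over [prompts; patches].
import torch

def decoupled_attention(prompts, patches, d):
    # prompts: (B, P, d), patches: (B, N, d); one shared head for brevity.
    x = torch.cat([prompts, patches], dim=1)             # (B, P+N, d)
    q = k = v = x                                        # identity projections (assumed)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    P = prompts.shape[1]
    # The four sub-processes are the four blocks of the attention map:
    pp = attn[:, :P, :P]     # prompt   attends to prompt
    pi = attn[:, :P, P:]     # prompt   attends to instance
    ip = attn[:, P:, :P]     # instance attends to prompt
    ii = attn[:, P:, P:]     # instance attends to instance
    out = attn @ v
    return out[:, :P], out[:, P:], (pp, pi, ip, ii)

prompts = torch.randn(2, 4, 64)
patches = torch.randn(2, 196, 64)
p_out, x_out, blocks = decoupled_attention(prompts, patches, 64)
print([b.shape for b in blocks])
```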

R-C-P Method: An Autonomous Volume Calculation Method Using Image Processing and Machine Vision

  • paper_url: http://arxiv.org/abs/2308.10058
  • repo_url: None
  • paper_authors: MA Muktadir, Sydney Parker, Sun Yi
  • for: 这篇论文旨在提供一种基于多个2D相机的方法,实时获取三维体积与变化信息,以替代深度相机。
  • methods: 该方法基于图像处理和边缘检测技术,开发了R-C-P(行-列-像素)方法,可实时测量物体的表面尺寸及其变化。
  • results: 实验结果表明,R-C-P方法可以准确测量物体的表面尺寸和变化,并能检测物体上不连续的边缘或体积。
    Abstract Machine vision and image processing are often used with sensors for situation awareness in autonomous systems, from industrial robots to self-driving cars. The 3D depth sensors, such as LiDAR (Light Detection and Ranging), Radar, are great invention for autonomous systems. Due to the complexity of the setup, LiDAR may not be suitable for some operational environments, for example, a space environment. This study was motivated by a desire to get real-time volumetric and change information with multiple 2D cameras instead of a depth camera. Two cameras were used to measure the dimensions of a rectangular object in real-time. The R-C-P (row-column-pixel) method is developed using image processing and edge detection. In addition to the surface areas, the R-C-P method also detects discontinuous edges or volumes. Lastly, experimental work is presented for illustration of the R-C-P method, which provides the equations for calculating surface area dimensions. Using the equations with given distance information between the object and the camera, the vision system provides the dimensions of actual objects.
    摘要 机器视觉与图像处理常与传感器配合,用于自主系统(从工业机器人到自动驾驶汽车)的态势感知。LiDAR(激光雷达)、雷达等3D深度传感器是自主系统的重要发明,但由于装置复杂,LiDAR可能不适用于某些运行环境,例如太空环境。本研究的动机是利用多个2D相机而非深度相机来实时获取体积与变化信息。我们使用两台相机实时测量矩形物体的尺寸,并基于图像处理和边缘检测开发了R-C-P(行-列-像素)方法。除表面面积外,R-C-P方法还能检测不连续的边缘或体积。最后,文中给出了演示R-C-P方法的实验,并给出了计算表面尺寸的公式;结合物体与相机之间的距离信息,该视觉系统即可给出实际物体的尺寸。
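
A rough sketch of the row-column-pixel idea: detect the object's edges, count its pixel extent along rows and columns, and convert pixels to physical dimensions with the pinhole model (size = pixels × distance / focal length). The focal length, distance, and synthetic object below are placeholders, not the paper's calibration:

```python
# Toy R-C-P-style dimension measurement with OpenCV.
import cv2
import numpy as np

def rcp_dimensions(gray, distance_mm, focal_px):
    edges = cv2.Canny(gray, 50, 150)
    ys, xs = np.nonzero(edges)
    height_px = ys.max() - ys.min()    # pixel extent over rows
    width_px = xs.max() - xs.min()     # pixel extent over columns
    to_mm = distance_mm / focal_px     # pinhole conversion factor
    return width_px * to_mm, height_px * to_mm

gray = np.zeros((480, 640), np.uint8)
cv2.rectangle(gray, (200, 150), (400, 300), 255, -1)   # synthetic object
w_mm, h_mm = rcp_dimensions(gray, distance_mm=500.0, focal_px=800.0)
print(round(w_mm, 1), round(h_mm, 1))
# With a second camera viewing along the depth axis, the third dimension —
# and hence volume — follows from the same pixel-to-mm conversion.
```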

ControlCom: Controllable Image Composition using Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.10040
  • repo_url: https://github.com/bcmi/controlcom-image-composition
  • paper_authors: Bo Zhang, Yuxuan Duan, Jun Lan, Yan Hong, Huijia Zhu, Weiqiang Wang, Li Niu
  • for: 本研究旨在实现一种可控的图像组合方法,能够生成真实、自然的组合图像。
  • methods: 本方法基于大型预训练的扩散模型,并将四个任务统一在一个模型中:图像融合、图像和谐化、视图合成和生成式组合。
  • results: 在公共基准数据和真实数据上,我们的方法都能生成比现有方法更忠实、更可控的组合图像。
    Abstract Image composition targets at synthesizing a realistic composite image from a pair of foreground and background images. Recently, generative composition methods are built on large pretrained diffusion models to generate composite images, considering their great potential in image generation. However, they suffer from lack of controllability on foreground attributes and poor preservation of foreground identity. To address these challenges, we propose a controllable image composition method that unifies four tasks in one diffusion model: image blending, image harmonization, view synthesis, and generative composition. Meanwhile, we design a self-supervised training framework coupled with a tailored pipeline of training data preparation. Moreover, we propose a local enhancement module to enhance the foreground details in the diffusion model, improving the foreground fidelity of composite images. The proposed method is evaluated on both public benchmark and real-world data, which demonstrates that our method can generate more faithful and controllable composite images than existing approaches. The code and model will be available at https://github.com/bcmi/ControlCom-Image-Composition.
    摘要 图像组合旨在由一对前景与背景图像合成出逼真的合成图像。近来,生成式组合方法基于大型预训练扩散模型来生成合成图像,以发挥其在图像生成方面的巨大潜力。然而,这类方法对前景属性缺乏可控性,且难以保持前景身份。为应对这些挑战,我们提出了一种可控的图像组合方法,将图像融合、图像和谐化、视图合成和生成式组合四个任务统一在一个扩散模型中。同时,我们设计了自监督训练框架及配套的训练数据准备流程。此外,我们提出了局部增强模块,以增强扩散模型中的前景细节,提升合成图像的前景保真度。所提方法在公共基准和真实数据上的评估表明,其生成的合成图像比现有方法更忠实、更可控。代码和模型将发布于 https://github.com/bcmi/ControlCom-Image-Composition。

CRC-ICM: Colorectal Cancer Immune Cell Markers Pattern Dataset

  • paper_url: http://arxiv.org/abs/2308.10033
  • repo_url: None
  • paper_authors: Zahra Mokhtari, Elham Amjadi, Hamidreza Bolhasani, Zahra Faghih, AmirReza Dehghanian, Marzieh Rezaei
  • For: The paper is written to explore the differences in immune checkpoints expression in primary tumors located in the right and left sides of the colon, and to investigate the prognostic value of these checkpoints in colorectal cancer (CRC).* Methods: The study uses a dataset of 1756 images related to 136 patients, stained with specific antibodies for CD3, CD8, CD45RO, PD-1, LAG3, and Tim3.* Results: The study found that tumors on the left and right sides of the colon have different immune landscapes, with differences in the expression of immune checkpoints such as PD-1, LAG3, and Tim3. These differences may have implications for the prognosis of CRC patients.
    Abstract Colorectal Cancer (CRC) is the second most common cause of cancer death in the world, and can be identified by the location of the primary tumor in the large intestine: right and left colon, and rectum. Based on the location, CRC shows differences in chromosomal and molecular characteristics, microbiomes incidence, pathogenesis, and outcome. It has been shown that tumors on left and right sides also have different immune landscape, so the prognosis may be different based on the primary tumor locations. It is widely accepted that immune components of the tumor microenvironment (TME) play a critical role in tumor development. One of the critical regulatory molecules in the TME is the immune checkpoint, which, as a gatekeeper of immune responses, regulates the functions of infiltrated immune cells. Inhibitory immune checkpoints such as PD-1, Tim3, and LAG3, the main mechanism of immune suppression in the TME, are overexpressed and result in further development of the tumor. The images of this dataset have been taken from colon tissues of patients with CRC, stained with specific antibodies for CD3, CD8, CD45RO, PD-1, LAG3 and Tim3. The name of this dataset is CRC-ICM and contains 1756 images related to 136 patients. The initial version of CRC-ICM is published on the Elsevier Mendeley dataset portal, and the latest version is accessible via: https://databiox.com
    摘要 结直肠癌(CRC)是全球第二大癌症死因,可按原发肿瘤在大肠中的位置进行区分:右半结肠、左半结肠和直肠。不同位置的CRC在染色体与分子特征、微生物组发生率、发病机制和预后方面存在差异。研究表明,左右两侧的肿瘤还具有不同的免疫景观,因此预后可能因原发肿瘤位置而异。普遍认为,肿瘤微环境(TME)中的免疫成分在肿瘤发展中起关键作用,其中一类关键的调控分子是免疫检查点,它们作为免疫应答的守门人,调控浸润免疫细胞的功能。PD-1、Tim3和LAG3等抑制性免疫检查点是TME中免疫抑制的主要机制,其过表达会导致肿瘤进一步发展。本数据集的图像取自CRC患者的结肠组织,分别用CD3、CD8、CD45RO、PD-1、LAG3和Tim3的特异性抗体染色。数据集名为CRC-ICM,包含与136名患者相关的1756张图像。CRC-ICM的初始版本发布于Elsevier Mendeley数据集门户,最新版本可通过 https://databiox.com 获取。

Single Image Reflection Separation via Component Synergy

  • paper_url: http://arxiv.org/abs/2308.10027
  • repo_url: https://github.com/mingcv/dsrnet
  • paper_authors: Qiming Hu, Xiaojie Guo
  • for: 本研究旨在提出更一般的叠加模型,以更好地捕捉残余信息,使分离出的图层更加完整。
  • methods: 该研究在分析现有模型缺陷的基础上,引入可学习的残差项来捕捉分解过程中的残余信息。同时,我们精心设计了网络结构,包括一种新型的双流交互机制和一个带有语义金字塔编码器的强大分解网络。
  • results: 通过大量实验和消融研究,我们证明了该方法在多个真实世界基准数据集上优于当前最先进方法。代码可在 https://github.com/mingcv/DSRNet 获取。
    Abstract The reflection superposition phenomenon is complex and widely distributed in the real world, which derives various simplified linear and nonlinear formulations of the problem. In this paper, based on the investigation of the weaknesses of existing models, we propose a more general form of the superposition model by introducing a learnable residue term, which can effectively capture residual information during decomposition, guiding the separated layers to be complete. In order to fully capitalize on its advantages, we further design the network structure elaborately, including a novel dual-stream interaction mechanism and a powerful decomposition network with a semantic pyramid encoder. Extensive experiments and ablation studies are conducted to verify our superiority over state-of-the-art approaches on multiple real-world benchmark datasets. Our code is publicly available at https://github.com/mingcv/DSRNet.
    摘要 反射叠加现象复杂且广泛存在于真实世界中,由此衍生出该问题的多种简化的线性与非线性建模形式。本文在分析现有模型缺陷的基础上,通过引入可学习的残差项,提出了一种更一般的叠加模型,能够有效捕捉分解过程中的残余信息,引导分离出的图层更加完整。为充分发挥其优势,我们进一步精心设计了网络结构,包括一种新型的双流交互机制和一个带有语义金字塔编码器的强大分解网络。大量实验和消融研究验证了我们的方法在多个真实世界基准数据集上优于最先进方法。代码公开于 https://github.com/mingcv/DSRNet。
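
The generalized superposition model amounts to explaining the mixture I by transmission T, reflection R, and a learnable residue term φ(T, R) that absorbs what linear blending misses. A minimal sketch of that reconstruction objective — the residue network's size is a placeholder, not DSRNet's architecture:

```python
# Minimal sketch of I ≈ T + R + phi(T, R) with a learnable residue term.
import torch
import torch.nn as nn

class ResidueTerm(nn.Module):
    def __init__(self, ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, ch, 3, padding=1),
        )

    def forward(self, T, R):
        return self.net(torch.cat([T, R], dim=1))

residue = ResidueTerm()
I = torch.rand(1, 3, 64, 64)                           # observed mixture
T = torch.rand(1, 3, 64, 64, requires_grad=True)       # separated layers
R = torch.rand(1, 3, 64, 64, requires_grad=True)       # (normally network outputs)
recon = T + R + residue(T, R)
loss = nn.functional.l1_loss(recon, I)                 # completeness objective
loss.backward()
print(float(loss))
```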

Interpretation on Multi-modal Visual Fusion

  • paper_url: http://arxiv.org/abs/2308.10019
  • repo_url: None
  • paper_authors: Hao Chen, Haoran Zhou, Yongjian Deng
  • for: 本文提出了一个分析框架和一种新的评价指标,用于解读多模态视觉融合。
  • methods: 我们的方法在不同模态和层级上度量所提出的语义方差与特征相似性,并通过大量实验进行语义和定量分析。
  • results: 我们的研究揭示了跨模态特征的差异以及混合式多模态协作规律等重要发现,有助于重新审视和设计多模态视觉融合模型。
    Abstract In this paper, we present an analytical framework and a novel metric to shed light on the interpretation of the multimodal vision community. Our approach involves measuring the proposed semantic variance and feature similarity across modalities and levels, and conducting semantic and quantitative analyses through comprehensive experiments. Specifically, we investigate the consistency and speciality of representations across modalities, evolution rules within each modality, and the collaboration logic used when optimizing a multi-modality model. Our studies reveal several important findings, such as the discrepancy in cross-modal features and the hybrid multi-modal cooperation rule, which highlights consistency and speciality simultaneously for complementary inference. Through our dissection and findings on multi-modal fusion, we facilitate a rethinking of the reasonability and necessity of popular multi-modal vision fusion strategies. Furthermore, our work lays the foundation for designing a trustworthy and universal multi-modal fusion model for a variety of tasks in the future.
    摘要 在这篇论文中,我们提出了一个分析框架和一种新的度量,以阐明多模态视觉社区中的融合机制。我们的方法在不同模态和层级上度量所提出的语义方差与特征相似性,并通过全面的实验进行语义和定量分析。具体而言,我们研究了跨模态表示的一致性与特异性、各模态内部的演化规律,以及优化多模态模型时所采用的协作逻辑。我们的研究揭示了若干重要发现,例如跨模态特征的差异,以及同时体现一致性与特异性、用于互补推理的混合式多模态协作规律。通过对多模态融合的剖析与发现,我们推动了对流行的多模态视觉融合策略的合理性与必要性的重新思考。此外,我们的工作为未来设计面向多种任务的可信且通用的多模态融合模型奠定了基础。

Pseudo Flow Consistency for Self-Supervised 6D Object Pose Estimation

  • paper_url: http://arxiv.org/abs/2308.10016
  • repo_url: https://github.com/yanghai-1218/pseudoflow
  • paper_authors: Yang Hai, Rui Song, Jiaojiao Li, David Ferstl, Yinlin Hu
  • for: 这篇论文主要研究无需额外信息的自监督6D物体位姿估计问题。
  • methods: 该方法首先利用在目标3D网格渲染的合成图像上训练的网络获得粗略的位姿初始化,然后引入一种利用多视角合成-真实图像对中几何约束的细化策略。该几何约束被表述为训练图像之间的像素级光流一致性,并配合动态生成的伪标签。
  • results: 在三个具有挑战性的数据集上的评估表明,该方法在既不使用2D标注也不使用额外深度图像的情况下,显著优于此前最先进的自监督方法。
    Abstract Most self-supervised 6D object pose estimation methods can only work with additional depth information or rely on the accurate annotation of 2D segmentation masks, limiting their application range. In this paper, we propose a 6D object pose estimation method that can be trained with pure RGB images without any auxiliary information. We first obtain a rough pose initialization from networks trained on synthetic images rendered from the target's 3D mesh. Then, we introduce a refinement strategy leveraging the geometry constraint in synthetic-to-real image pairs from multiple different views. We formulate this geometry constraint as pixel-level flow consistency between the training images with dynamically generated pseudo labels. We evaluate our method on three challenging datasets and demonstrate that it outperforms state-of-the-art self-supervised methods significantly, with neither 2D annotations nor additional depth images.
    摘要 大多数自监督6D物体位姿估计方法只能借助额外的深度信息,或依赖精确标注的2D分割掩码,这限制了它们的应用范围。本文提出了一种仅用RGB图像、无需任何辅助信息即可训练的6D物体位姿估计方法。我们首先利用在目标3D网格渲染出的合成图像上训练的网络获得粗略的位姿初始化;然后引入一种细化策略,利用多个不同视角的合成-真实图像对中的几何约束。我们将该几何约束表述为训练图像之间的像素级光流一致性,并配合动态生成的伪标签。我们在三个具有挑战性的数据集上评估了该方法,结果表明它在既无2D标注也无额外深度图像的情况下,显著优于最先进的自监督方法。

DyFFPAD: Dynamic Fusion of Convolutional and Handcrafted Features for Fingerprint Presentation Attack Detection

  • paper_url: http://arxiv.org/abs/2308.10015
  • repo_url: None
  • paper_authors: Anuj Rai, Parsheel Kumar Tiwari, Jyotishna Baishya, Ram Prakash Sharma, Somnath Dey
  • for: The paper is written for the purpose of detecting presentation attacks in automatic fingerprint recognition systems, which are a threat to their wide range of applications in areas including national borders and commercial applications.
  • methods: The paper proposes a dynamic ensemble of deep learning and handcrafted features to detect presentation attacks in known-material and unknown-material protocols. The proposed model combines both deep CNN and handcrafted features, and learns their parameters together to exhibit better performance than individual results.
  • results: The proposed model is validated using the benchmark LivDet 2015, 2017, and 2019 databases, and achieves an overall accuracy of 96.10%, 96.49%, and 95.99% on them, respectively. The proposed model outperforms state-of-the-art methods in benchmark protocols of presentation attack detection in terms of classification accuracy.
    Abstract Automatic fingerprint recognition systems suffer from the threat of presentation attacks due to their wide range of applications in areas including national borders and commercial applications. Presentation attacks can be performed by fabricating the fake fingerprint of a user with or without the intention of the subject. This paper presents a dynamic ensemble of deep learning and handcrafted features to detect presentation attacks in known-material and unknown-material protocols. The proposed model is a dynamic ensemble of deep CNN and handcrafted features empowered deep neural networks both of which learn their parameters together. The proposed presentation attack detection model, in this way, utilizes the capabilities of both classification techniques and exhibits better performance than their individual results. The proposed model's performance is validated using benchmark LivDet 2015, 2017, and 2019 databases, with an overall accuracy of 96.10\%, 96.49\%, and 95.99\% attained on them, respectively. The proposed model outperforms state-of-the-art methods in benchmark protocols of presentation attack detection in terms of classification accuracy.
    摘要 自动指纹识别系统广泛应用于国境管控和商业等领域,因而面临呈现攻击的威胁。攻击者可以在有意或无意配合的情况下伪造用户的假指纹来实施呈现攻击。本文提出了一种深度学习与手工特征的动态集成方法,用于在已知材料与未知材料协议下检测呈现攻击。所提模型动态集成了深度CNN与由手工特征增强的深度神经网络,二者联合学习参数,从而同时利用两类分类技术的能力,取得优于各自单独结果的性能。在基准数据库LivDet 2015、2017和2019上的验证表明,该模型分别取得了96.10%、96.49%和95.99%的总体准确率,在呈现攻击检测的基准协议中,其分类准确率优于最先进方法。
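
The dual-branch idea — deep CNN features concatenated with handcrafted features and trained through a shared classifier — can be sketched as follows. An LBP histogram stands in for the handcrafted branch, and the tiny CNN and all sizes are illustrative choices, not the paper's architecture:

```python
# Hedged sketch of fusing CNN and handcrafted features for PAD.
import numpy as np
import torch
import torch.nn as nn
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_u8, P=8, R=1.0, bins=59):
    """Normalized non-rotation-invariant uniform LBP histogram (59 bins for P=8)."""
    codes = local_binary_pattern(gray_u8, P, R, method="nri_uniform")
    hist, _ = np.histogram(codes, bins=bins, range=(0, bins), density=True)
    return torch.tensor(hist, dtype=torch.float32)

class DualBranchPAD(nn.Module):
    def __init__(self, hand_dim=59):
        super().__init__()
        self.cnn = nn.Sequential(                  # tiny stand-in CNN branch
            nn.Conv2d(1, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16 + hand_dim, 2)    # live vs. spoof

    def forward(self, img, hand_feat):
        fused = torch.cat([self.cnn(img), hand_feat], dim=1)
        return self.head(fused)

gray = (np.random.rand(96, 96) * 255).astype(np.uint8)   # fingerprint patch
hand = lbp_histogram(gray).unsqueeze(0)
img = torch.tensor(gray, dtype=torch.float32)[None, None] / 255.0
logits = DualBranchPAD()(img, hand)
print(logits.shape)  # (1, 2)
```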

Partition-and-Debias: Agnostic Biases Mitigation via A Mixture of Biases-Specific Experts

  • paper_url: http://arxiv.org/abs/2308.10005
  • repo_url: https://github.com/Jiaxuan-Li/PnD
  • paper_authors: Jiaxuan Li, Duc Minh Vo, Hideki Nakayama
  • for: 减少图像分类中的偏见(bias mitigation),特别是面对不确定或多种偏见的情况。
  • methods: 提出了一种名为Partition-and-Debias(PnD)的方法,通过一组偏见特定的专家隐式地将偏见空间划分为多个子空间,并使用一个门控模块在专家之间达成共识,从而实现去偏见的分类。
  • results: 在公共和自定义的benchmark上进行了实验,并证明了PnD方法的有效性。
    Abstract Bias mitigation in image classification has been widely researched, and existing methods have yielded notable results. However, most of these methods implicitly assume that a given image contains only one type of known or unknown bias, failing to consider the complexities of real-world biases. We introduce a more challenging scenario, agnostic biases mitigation, aiming at bias removal regardless of whether the type of bias or the number of types is unknown in the datasets. To address this difficult task, we present the Partition-and-Debias (PnD) method that uses a mixture of biases-specific experts to implicitly divide the bias space into multiple subspaces and a gating module to find a consensus among experts to achieve debiased classification. Experiments on both public and constructed benchmarks demonstrated the efficacy of the PnD. Code is available at: https://github.com/Jiaxuan-Li/PnD.
    摘要 图像分类中的偏见缓解已被广泛研究,现有方法也取得了显著成果。然而,这些方法大多隐含地假设给定图像只包含一种已知或未知类型的偏见,未能考虑真实世界偏见的复杂性。我们引入了一个更具挑战性的场景:未知偏见缓解(agnostic biases mitigation),其目标是在数据集中偏见类型乃至类型数量均未知的情况下去除偏见。为解决这一困难任务,我们提出了Partition-and-Debias(PnD)方法:利用一组偏见特定的专家隐式地将偏见空间划分为多个子空间,并通过一个门控模块在专家之间达成共识,实现去偏见分类。在公共基准和自建基准上的实验证明了PnD的有效性。代码见:https://github.com/Jiaxuan-Li/PnD。
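
A minimal sketch of the experts-plus-gating consensus, in the spirit of PnD: each expert handles one latent bias subspace and a softmax gate weights their predictions. The expert count and sizes are assumptions; see the repository above for the actual model.

```python
# Mixture of bias-specific experts with a gating module.
import torch
import torch.nn as nn

class BiasExpertsMixture(nn.Module):
    def __init__(self, feat_dim=128, n_experts=4, n_classes=10):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(feat_dim, n_classes) for _ in range(n_experts)
        )
        self.gate = nn.Linear(feat_dim, n_experts)

    def forward(self, feats):
        weights = torch.softmax(self.gate(feats), dim=-1)           # (B, E)
        logits = torch.stack([e(feats) for e in self.experts], 1)   # (B, E, C)
        return (weights.unsqueeze(-1) * logits).sum(dim=1)          # consensus

model = BiasExpertsMixture()
print(model(torch.randn(8, 128)).shape)  # (8, 10)
```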

Efficient Multi-View Inverse Rendering Using a Hybrid Differentiable Rendering Method

  • paper_url: http://arxiv.org/abs/2308.10003
  • repo_url: https://github.com/HsiangYangChu/DRBIR
  • paper_authors: Xiangyang Zhu, Yiling Pan, Bailin Deng, Bin Wang
  • for: 这篇论文旨在利用混合可微渲染方法,高效地从多视角图像中恢复真实物体的三维几何与反射属性。
  • methods: 该方法包括两个阶段:初始化阶段使用传统的SfM和MVS方法重建与真实场景大致匹配的虚拟场景;优化阶段采用混合策略依次细化几何与反射属性,其中几何先用近似可微渲染方法优化,反射属性随后用基于物理的可微渲染方法优化。
  • results: 在合成数据与真实数据上的实验表明,我们的方法能以更高的效率,生成与最先进方法相当或更高质量的重建结果。
    Abstract Recovering the shape and appearance of real-world objects from natural 2D images is a long-standing and challenging inverse rendering problem. In this paper, we introduce a novel hybrid differentiable rendering method to efficiently reconstruct the 3D geometry and reflectance of a scene from multi-view images captured by conventional hand-held cameras. Our method follows an analysis-by-synthesis approach and consists of two phases. In the initialization phase, we use traditional SfM and MVS methods to reconstruct a virtual scene roughly matching the real scene. Then in the optimization phase, we adopt a hybrid approach to refine the geometry and reflectance, where the geometry is first optimized using an approximate differentiable rendering method, and the reflectance is optimized afterward using a physically-based differentiable rendering method. Our hybrid approach combines the efficiency of approximate methods with the high-quality results of physically-based methods. Extensive experiments on synthetic and real data demonstrate that our method can produce reconstructions with similar or higher quality than state-of-the-art methods while being more efficient.
    摘要 从自然2D图像中恢复真实物体的形状与外观,是一个长期存在且极具挑战性的逆渲染问题。本文提出了一种新颖的混合可微渲染方法,可从普通手持相机拍摄的多视角图像中高效地重建场景的三维几何与反射属性。该方法遵循"合成式分析"思路,包含两个阶段:初始化阶段,使用传统的SfM和MVS方法重建与真实场景大致匹配的虚拟场景;优化阶段,采用混合策略细化几何与反射属性,先用近似可微渲染方法优化几何,再用基于物理的可微渲染方法优化反射属性。这种混合策略兼具近似方法的高效与基于物理方法的高质量。在合成与真实数据上的大量实验表明,我们的方法能以更高的效率,生成与最先进方法相当或更高质量的重建结果。

AltNeRF: Learning Robust Neural Radiance Field via Alternating Depth-Pose Optimization

  • paper_url: http://arxiv.org/abs/2308.10001
  • repo_url: None
  • paper_authors: Kun Wang, Zhiqiang Yan, Huang Tian, Zhenyu Zhang, Xiang Li, Jun Li, Jian Yang
  • for: 从稀疏的场景图像中实现高质量的新视角合成(Novel View Synthesis)。
  • methods: 利用单目视频上的自监督单目深度估计(SMDE)学习深度与位姿先验,并通过交替算法与NeRF的训练相互融合。
  • results: 生成高保真且鲁棒的新视角,并能应对相机位姿不准确和缺乏显式3D监督的情况。
    Abstract Neural Radiance Fields (NeRF) have shown promise in generating realistic novel views from sparse scene images. However, existing NeRF approaches often encounter challenges due to the lack of explicit 3D supervision and imprecise camera poses, resulting in suboptimal outcomes. To tackle these issues, we propose AltNeRF -- a novel framework designed to create resilient NeRF representations using self-supervised monocular depth estimation (SMDE) from monocular videos, without relying on known camera poses. SMDE in AltNeRF masterfully learns depth and pose priors to regulate NeRF training. The depth prior enriches NeRF's capacity for precise scene geometry depiction, while the pose prior provides a robust starting point for subsequent pose refinement. Moreover, we introduce an alternating algorithm that harmoniously melds NeRF outputs into SMDE through a consistence-driven mechanism, thus enhancing the integrity of depth priors. This alternation empowers AltNeRF to progressively refine NeRF representations, yielding the synthesis of realistic novel views. Additionally, we curate a distinctive dataset comprising indoor videos captured via mobile devices. Extensive experiments showcase the compelling capabilities of AltNeRF in generating high-fidelity and robust novel views that closely resemble reality.
    摘要 神经辐射场(NeRF)在从稀疏场景图像生成逼真新视角方面已展现出潜力。然而,现有NeRF方法常因缺乏显式3D监督和相机位姿不精确而遇到困难,导致结果欠佳。为解决这些问题,我们提出了AltNeRF:一个无需已知相机位姿、利用单目视频上的自监督单目深度估计(SMDE)来构建鲁棒NeRF表示的新框架。AltNeRF中的SMDE巧妙地学习深度与位姿先验来调控NeRF训练:深度先验增强NeRF对场景几何的精确刻画能力,位姿先验则为后续位姿细化提供稳健的起点。此外,我们引入一种交替算法,通过一致性驱动机制将NeRF输出和谐地融入SMDE,从而增强深度先验的完整性。这种交替使AltNeRF能够逐步细化NeRF表示,合成逼真的新视角。我们还构建了一个由移动设备拍摄的室内视频组成的特色数据集。大量实验展示了AltNeRF在生成高保真、鲁棒且贴近真实的新视角方面的出色能力。

TTPOINT: A Tensorized Point Cloud Network for Lightweight Action Recognition with Event Cameras

  • paper_url: http://arxiv.org/abs/2308.09993
  • repo_url: None
  • paper_authors: Hongwei Ren, Yue Zhou, Haotian Fu, Yulong Huang, Renjing Xu, Bojun Cheng
  • for: 本研究旨在提出一种轻量级、通用的点云网络(TTPOINT),用于事件相机的动作识别任务。
  • methods: 该模型以点云方式处理事件数据,并使用张量列车(tensor-train)压缩的特征提取器来降低计算复杂度和参数量。
  • results: TTPOINT在三个数据集上达到了最先进(SOTA)水平,并在全部五个数据集上均为点云方法中的最佳。此外,借助张量列车分解,在参数量压缩55%的情况下,模型精度几乎不受影响。
    Abstract Event cameras have gained popularity in computer vision due to their data sparsity, high dynamic range, and low latency. As a bio-inspired sensor, event cameras generate sparse and asynchronous data, which is inherently incompatible with the traditional frame-based method. Alternatively, the point-based method can avoid additional modality transformation and naturally adapt to the sparsity of events. Still, it typically cannot reach a comparable accuracy as the frame-based method. We propose a lightweight and generalized point cloud network called TTPOINT which achieves competitive results even compared to the state-of-the-art (SOTA) frame-based method in action recognition tasks while only using 1.5 % of the computational resources. The model is adept at abstracting local and global geometry by hierarchy structure. By leveraging tensor-train compressed feature extractors, TTPOINT can be designed with minimal parameters and computational complexity. Additionally, we developed a straightforward downsampling algorithm to maintain the spatio-temporal feature. In the experiment, TTPOINT emerged as the SOTA method on three datasets while also attaining SOTA among point cloud methods on all five datasets. Moreover, by using the tensor-train decomposition method, the accuracy of the proposed TTPOINT is almost unaffected while compressing the parameter size by 55 % in all five datasets.
    摘要 事件摄像机在计算机视觉中得到了普遍应用,因为它们的数据稀疏、高动态范围和低延迟时间。作为生物体发现的感知器,事件摄像机生成的数据是不兼容传统框架方法的异常快照式数据。相反,点云方法可以避免额外模态变换,并自然适应事件的稀疏性。然而,它通常无法达到与框架方法相当的准确性。我们提出了一种轻量级、通用的点云网络 called TTPOINT,它在动作识别任务中达到了与状态前方法相当的结果,只使用了1.5%的计算资源。该模型能够层次结构中抽象本地和全局几何。通过利用张量约束压缩特征提取器,TTPOINT可以设计为最小参数和计算复杂度。此外,我们开发了一种简单的下采样算法,以保持空间时间特征。在实验中,TTPOINT被认为状态前方法在三个数据集上,同时在五个数据集上也成为点云方法的最佳方法。此外,通过使用张量约束压缩方法,提议的TTPOINT的准确率几乎不受参数大小压缩55%的影响。

AltDiffusion: A Multilingual Text-to-Image Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.09991
  • repo_url: https://github.com/superhero-7/altdiffuson
  • paper_authors: Fulong Ye, Guang Liu, Xinya Wu, Ledell Wu
  • for: 这篇论文旨在将文本到图像(T2I)扩散模型推广到多语言环境,为不同语言的用户提供更好的服务。
  • methods: 论文使用知识蒸馏(KD)训练一个多语言文本编码器,并将其接入一个预训练的纯英文扩散模型,通过两阶段训练方案(概念对齐阶段和质量提升阶段)增强其多语言能力。
  • results: 论文在大规模多语言数据集上完成了上述两阶段训练,并在多语言通用评测与文化特定概念评测中表现出色,超越了现有的最先进T2I模型。
    Abstract Large Text-to-Image(T2I) diffusion models have shown a remarkable capability to produce photorealistic and diverse images based on text inputs. However, existing works only support limited language input, e.g., English, Chinese, and Japanese, leaving users beyond these languages underserved and blocking the global expansion of T2I models. Therefore, this paper presents AltDiffusion, a novel multilingual T2I diffusion model that supports eighteen different languages. Specifically, we first train a multilingual text encoder based on the knowledge distillation. Then we plug it into a pretrained English-only diffusion model and train the model with a two-stage schema to enhance the multilingual capability, including concept alignment and quality improvement stage on a large-scale multilingual dataset. Furthermore, we introduce a new benchmark, which includes Multilingual-General-18(MG-18) and Multilingual-Cultural-18(MC-18) datasets, to evaluate the capabilities of T2I diffusion models for generating high-quality images and capturing culture-specific concepts in different languages. Experimental results on both MG-18 and MC-18 demonstrate that AltDiffusion outperforms current state-of-the-art T2I models, e.g., Stable Diffusion in multilingual understanding, especially with respect to culture-specific concepts, while still having comparable capability for generating high-quality images. All source code and checkpoints could be found in https://github.com/superhero-7/AltDiffuson.
    摘要 大型文本到图像(T2I)扩散模型已展现出根据文本输入生成逼真且多样图像的出色能力。然而,现有工作仅支持有限的输入语言(如英语、中文和日语),使这些语言之外的用户得不到服务,也阻碍了T2I模型的全球推广。为此,本文提出了AltDiffusion:一种支持十八种语言的新型多语言T2I扩散模型。具体来说,我们首先基于知识蒸馏训练一个多语言文本编码器,然后将其接入一个预训练的纯英文扩散模型,并在大规模多语言数据集上通过两阶段训练方案(概念对齐阶段与质量提升阶段)增强其多语言能力。此外,我们引入了一个新的基准,包括Multilingual-General-18(MG-18)和Multilingual-Cultural-18(MC-18)两个数据集,用于评估T2I扩散模型在不同语言下生成高质量图像和捕捉文化特定概念的能力。在MG-18和MC-18上的实验结果表明,AltDiffusion在多语言理解方面,尤其是文化特定概念上,优于Stable Diffusion等当前最先进的T2I模型,同时在生成高质量图像方面保持相当的能力。所有源代码和检查点见 https://github.com/superhero-7/AltDiffuson。
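
The distillation step amounts to pushing a multilingual student text encoder toward the frozen English teacher on parallel captions, so the student's embeddings can later drive the pretrained diffusion UNet. The sketch below uses embedding tables as stand-ins for both encoders and random ids as a "parallel pair" — all of it is illustrative, not AltDiffusion's actual models:

```python
# Toy knowledge-distillation step for a multilingual text encoder.
import torch
import torch.nn as nn

teacher = nn.Embedding(1000, 512)   # stand-in for the frozen English text encoder
student = nn.Embedding(1000, 512)   # stand-in for the multilingual student
for p in teacher.parameters():
    p.requires_grad_(False)

english_ids = torch.randint(0, 1000, (4, 16))    # e.g. "a photo of a cat"
parallel_ids = torch.randint(0, 1000, (4, 16))   # same caption in another language

with torch.no_grad():
    target = teacher(english_ids).mean(dim=1)    # pooled teacher embedding
pred = student(parallel_ids).mean(dim=1)
kd_loss = nn.functional.mse_loss(pred, target)   # distillation objective
kd_loss.backward()
print(float(kd_loss))
```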

TSAR-MVS: Textureless-aware Segmentation and Correlative Refinement Guided Multi-View Stereo

  • paper_url: http://arxiv.org/abs/2308.09990
  • repo_url: None
  • paper_authors: Zhenlong Yuan, Jiakai Cao, Hao Jiang, Zhaoqi Wang, Zhaoxin Li
  • for: 解决多视图立体(MVS)中由于图像间缺乏可靠像素对应而导致的无纹理区域重建难题。
  • methods: 所提方法包括三部分:一、联合假设过滤;二、迭代相关性细化;三、无纹理感知分割。
  • results: 在大规模数据集上的实验结果表明,所提方法表现出色,优于大多数非学习方法,在保留细节的同时兼具较高的精度和鲁棒性。
    Abstract The reconstruction of textureless areas has long been a challenging problem in MVS due to lack of reliable pixel correspondences between images. In this paper, we propose the Textureless-aware Segmentation And Correlative Refinement guided Multi-View Stereo (TSAR-MVS), a novel method that effectively tackles challenges posed by textureless areas in 3D reconstruction through filtering, refinement and segmentation. First, we implement joint hypothesis filtering, a technique that merges a confidence estimator with a disparity discontinuity detector to eliminate incorrect depth estimations. Second, to spread the pixels with confident depth, we introduce a iterative correlation refinement strategy that leverages RANSAC to generate superpixels, succeeded by a median filter for broadening the influence of accurately determined pixels.Finally, we present a textureless-aware segmentation method that leverages edge detection and line detection for accurately identify large textureless regions to be fitted using 3D planes. Experiments on extensive datasets demonstrate that our method significantly outperforms most non-learning methods and exhibits robustness to textureless areas while preserving fine details.
    摘要 由于图像之间缺乏可靠的像素对应,无纹理区域的重建一直是多视图立体(MVS)中的难题。本文提出了无纹理感知分割与相关性细化引导的多视图立体方法(TSAR-MVS),通过过滤、细化和分割有效应对三维重建中无纹理区域带来的挑战。首先,我们实现了联合假设过滤,将置信度估计器与视差不连续检测器结合,以剔除错误的深度估计。其次,为了传播具有可信深度的像素,我们引入一种迭代相关性细化策略,利用RANSAC生成超像素,再通过中值滤波扩大已准确确定像素的影响范围。最后,我们提出一种无纹理感知分割方法,利用边缘检测与直线检测准确识别大片无纹理区域,并用3D平面对其进行拟合。在大规模数据集上的实验表明,我们的方法显著优于大多数非学习方法,在保留细节的同时对无纹理区域具有良好的鲁棒性。

Prototypical Cross-domain Knowledge Transfer for Cervical Dysplasia Visual Inspection

  • paper_url: http://arxiv.org/abs/2308.09983
  • repo_url: None
  • paper_authors: Yichen Zhang, Yifang Yin, Ying Zhang, Zhenguang Liu, Zheng Wang, Roger Zimmermann
  • for: 本研究旨在提升宫颈上皮异常增生自动视觉诊断的准确率,以便在低资源环境中更好地辅助宫颈癌诊断。
  • methods: 我们提出利用在不同但相关的临床研究中收集的跨领域宫颈图像来提升模型性能,并提出一种基于原型的知识过滤方法来估计跨领域样本的可迁移性,以用于目标宫颈图像数据集的训练。
  • results: 在三个真实世界基准宫颈图像数据集上的实验表明,我们的方法在top-1准确率、精确率、召回率、F1分数和ROC-AUC上分别比现有的宫颈异常视觉诊断方法提升4.7%、7.0%、1.4%、4.6%和0.05。
    Abstract Early detection of dysplasia of the cervix is critical for cervical cancer treatment. However, automatic cervical dysplasia diagnosis via visual inspection, which is more appropriate in low-resource settings, remains a challenging problem. Though promising results have been obtained by recent deep learning models, their performance is significantly hindered by the limited scale of the available cervix datasets. Distinct from previous methods that learn from a single dataset, we propose to leverage cross-domain cervical images that were collected in different but related clinical studies to improve the model's performance on the targeted cervix dataset. To robustly learn the transferable information across datasets, we propose a novel prototype-based knowledge filtering method to estimate the transferability of cross-domain samples. We further optimize the shared feature space by aligning the cross-domain image representations simultaneously on domain level with early alignment and class level with supervised contrastive learning, which endows model training and knowledge transfer with stronger robustness. The empirical results on three real-world benchmark cervical image datasets show that our proposed method outperforms the state-of-the-art cervical dysplasia visual inspection by an absolute improvement of 4.7% in top-1 accuracy, 7.0% in precision, 1.4% in recall, 4.6% in F1 score, and 0.05 in ROC-AUC.
    摘要 宫颈上皮异常增生的早期检测对宫颈癌治疗至关重要。然而,更适合低资源环境的基于视觉检查的宫颈异常自动诊断仍是一个具有挑战性的问题。尽管近期的深度学习模型取得了可喜的结果,但其性能受到现有宫颈数据集规模有限的显著制约。不同于以往仅从单一数据集学习的方法,我们提出利用在不同但相关的临床研究中收集的跨领域宫颈图像,来提升模型在目标宫颈数据集上的性能。为了稳健地学习跨数据集的可迁移信息,我们提出了一种新颖的基于原型的知识过滤方法,用于估计跨领域样本的可迁移性。我们进一步通过在领域层面的早期对齐与类别层面的有监督对比学习同时对齐跨领域图像表示,来优化共享特征空间,使模型训练和知识迁移更加鲁棒。在三个真实世界基准宫颈图像数据集上的实验结果表明,我们的方法在top-1准确率、精确率、召回率、F1分数和ROC-AUC上分别以4.7%、7.0%、1.4%、4.6%和0.05的绝对优势超越了最先进的宫颈异常视觉诊断方法。

Breast Lesion Diagnosis Using Static Images and Dynamic Video

  • paper_url: http://arxiv.org/abs/2308.09980
  • repo_url: None
  • paper_authors: Yunwen Huang, Hongyu Hu, Ying Zhu, Yi Xu
  • for: 本研究旨在开发一个基于深度学习的计算机辅助诊断系统,以辅助乳腺病变的诊断。
  • methods: 该系统同时利用乳腺超声的静态图像与动态视频两种模态,并将其融合为多模态特征;在此过程中,利用放射科医生精心挑选的静态图像的领域知识来引导动态视频特征的聚合。
  • results: 我们在包含897组乳腺超声图像与视频的数据集上进行了验证,结果表明该模型提升了良恶性分类的性能,AUC达90.0%,准确率达81.7%。
    Abstract Deep learning based Computer Aided Diagnosis (CAD) systems have been developed to treat breast ultrasound. Most of them focus on a single ultrasound imaging modality, either using representative static images or the dynamic video of a real-time scan. In fact, these two image modalities are complementary for lesion diagnosis. Dynamic videos provide detailed three-dimensional information about the lesion, while static images capture the typical sections of the lesion. In this work, we propose a multi-modality breast tumor diagnosis model to imitate the diagnosing process of radiologists, which learns the features of both static images and dynamic video and explores the potential relationship between the two modalities. Considering that static images are carefully selected by professional radiologists, we propose to aggregate dynamic video features under the guidance of domain knowledge from static images before fusing multi-modality features. Our work is validated on a breast ultrasound dataset composed of 897 sets of ultrasound images and videos. Experimental results show that our model boosts the performance of Benign/Malignant classification, achieving 90.0% in AUC and 81.7% in accuracy.
    摘要 深度学习基于计算机辅助诊断(CAD)系统已经开发以治疗乳腺癌。大多数其中都专注于单一的乳腺ultrasound图像模式,可以是使用代表性的静止图像或实时扫描的动态视频。事实上,这两种图像模式是诊断癌变的补充。动态视频提供癌变三维信息的详细信息,而静止图像捕捉癌变典型部分。在这种工作中,我们提议一种多Modal breast tumor诊断模型,以模拟医生的诊断过程,学习静止图像和动态视频中的特征,并探索这两种模式之间的潜在关系。考虑到静止图像由专业医生 manually selects,我们提议将动态视频特征与静止图像特征相互融合,然后将多模式特征进行融合。我们的工作被验证在897组乳腺ultrasound图像和视频组成的 dataset上。实验结果表明,我们的模型可以提高了抑准/癌变分类的性能,达到了90.0%的AUC和81.7%的准确率。

Whether you can locate or not? Interactive Referring Expression Generation

  • paper_url: http://arxiv.org/abs/2308.09977
  • repo_url: https://github.com/superhero-7/ireg
  • paper_authors: Fulong Ye, Yuxing Long, Fangxiang Feng, Xiaojie Wang
  • for: 本研究旨在生成无歧义的指代表达(REs),并结合指代表达理解(REC)任务定位所指对象。
  • methods: 我们提出了一种交互式REG(IREG)模型,它可以与真实的REC模型交互,利用指示对象是否被定位以及REC模型给出的视觉区域等信号,逐步修改REs。
  • results: 我们在 RefCOCO、RefCOCO+ 和 RefCOCOg 三个 RE benchmark 数据集上进行了实验,结果显示,IREG 可以比前一代方法在各种评价指标上取得更高的性能。此外,人工评价也表明,IREG 能够更好地生成 REs 并具有交互能力。
    Abstract Referring Expression Generation (REG) aims to generate unambiguous Referring Expressions (REs) for objects in a visual scene, with a dual task of Referring Expression Comprehension (REC) to locate the referred object. Existing methods construct REG models independently by using only the REs as ground truth for model training, without considering the potential interaction between REG and REC models. In this paper, we propose an Interactive REG (IREG) model that can interact with a real REC model, utilizing signals indicating whether the object is located and the visual region located by the REC model to gradually modify REs. Our experimental results on three RE benchmark datasets, RefCOCO, RefCOCO+, and RefCOCOg show that IREG outperforms previous state-of-the-art methods on popular evaluation metrics. Furthermore, a human evaluation shows that IREG generates better REs with the capability of interaction.
    摘要 指代表达生成(REG)的目标是为视觉场景中的对象生成无歧义的指代表达(RE),其对偶任务指代表达理解(REC)则用于定位所指对象。现有方法仅以RE作为模型训练的真值来独立构建REG模型,而不考虑REG与REC模型之间潜在的交互。在本文中,我们提出了一种交互式REG(IREG)模型,它可以与真实的REC模型交互,利用指示对象是否被定位以及REC模型定位出的视觉区域等信号,逐步修改RE。我们在RefCOCO、RefCOCO+和RefCOCOg三个RE基准数据集上的实验结果表明,IREG在流行的评价指标上优于此前的最先进方法。此外,人工评价也表明,IREG借助交互能力能够生成更好的RE。

DESOBAv2: Towards Large-scale Real-world Dataset for Shadow Generation

  • paper_url: http://arxiv.org/abs/2308.09972
  • repo_url: None
  • paper_authors: Qingyang Liu, Jianting Wang, Li Niu
  • for: 本研究旨在生成真实的阴影,以使composite image更加真实。
  • methods: 使用对象-阴影检测与图像修复(inpainting)技术构建数据:先检测对象-阴影对,再使用预训练的修复模型修复(去除)阴影区域。
  • results: 构建了大规模的DESOBAv2数据集,可用于训练与评估阴影生成方法。
    Abstract Image composition refers to inserting a foreground object into a background image to obtain a composite image. In this work, we focus on generating plausible shadow for the inserted foreground object to make the composite image more realistic. To supplement the existing small-scale dataset DESOBA, we create a large-scale dataset called DESOBAv2 by using object-shadow detection and inpainting techniques. Specifically, we collect a large number of outdoor scene images with object-shadow pairs. Then, we use pretrained inpainting model to inpaint the shadow region, resulting in the deshadowed images. Based on real images and deshadowed images, we can construct pairs of synthetic composite images and ground-truth target images. Dataset is available at https://github.com/bcmi/Object-Shadow-Generation-Dataset-DESOBAv2.
    摘要 图像组合指将前景对象插入背景图像以获得合成图像。在本工作中,我们专注于为插入的前景对象生成合理的阴影,使合成图像更加逼真。为了补充现有的小规模数据集DESOBA,我们利用对象-阴影检测与图像修复技术构建了大规模数据集DESOBAv2。具体来说,我们收集了大量含有对象-阴影对的室外场景图像,然后使用预训练的修复模型修复阴影区域,得到去阴影图像。基于真实图像和去阴影图像,我们即可构建合成组合图像与真值目标图像的样本对。数据集见 https://github.com/bcmi/Object-Shadow-Generation-Dataset-DESOBAv2。
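
A toy version of the deshadowing step used to build such pairs: given a detected shadow mask, remove the shadow with an inpainting routine. Here OpenCV's classical TELEA inpainting stands in for the pretrained deep inpainter, and the image and mask are synthetic placeholders:

```python
# Toy shadow-removal step: inpaint the masked shadow region.
import cv2
import numpy as np

img = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)   # outdoor scene
shadow_mask = np.zeros((256, 256), np.uint8)
cv2.ellipse(shadow_mask, (128, 180), (60, 20), 0, 0, 360, 255, -1)

# (src, mask, inpaintRadius, flags) — classical inpainting as a stand-in.
deshadowed = cv2.inpaint(img, shadow_mask, 5, cv2.INPAINT_TELEA)
# (deshadowed, img) now form a synthetic composite / ground-truth pair
# for training a shadow generator.
print(deshadowed.shape)
```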

NeutrEx: A 3D Quality Component Measure on Facial Expression Neutrality

  • paper_url: http://arxiv.org/abs/2308.09963
  • repo_url: None
  • paper_authors: Marcel Grimmer, Christian Rathgeb, Raymond Veldhuis, Christoph Busch
  • for: 这篇论文旨在提出一种基于3D人脸重建的表情中性程度质量评估方法,以确保低质量人脸图像不会影响识别准确率。
  • methods: 该方法(NeutrEx)基于3D人脸重建结果到中性表情锚点的累积距离;作为对比基线,我们用预训练的表情分类卷积神经网络提取人脸嵌入,并在其上训练支持向量机。
  • results: 该方法优于基于人脸嵌入的基线方法,并且可以通过计算每个顶点的距离给出可解释的评估结果,揭示影响最大的人脸区域,帮助操作员向受试者提供可执行的反馈。
    Abstract Accurate face recognition systems are increasingly important in sensitive applications like border control or migration management. Therefore, it becomes crucial to quantify the quality of facial images to ensure that low-quality images are not affecting recognition accuracy. In this context, the current draft of ISO/IEC 29794-5 introduces the concept of component quality to estimate how single factors of variation affect recognition outcomes. In this study, we propose a quality measure (NeutrEx) based on the accumulated distances of a 3D face reconstruction to a neutral expression anchor. Our evaluations demonstrate the superiority of our proposed method compared to baseline approaches obtained by training Support Vector Machines on face embeddings extracted from a pre-trained Convolutional Neural Network for facial expression classification. Furthermore, we highlight the explainable nature of our NeutrEx measures by computing per-vertex distances to unveil the most impactful face regions and allow operators to give actionable feedback to subjects.
    摘要 在边境管控或移民管理等敏感应用中,准确的人脸识别系统日益重要。因此,量化人脸图像质量、确保低质量图像不影响识别准确率变得至关重要。在此背景下,ISO/IEC 29794-5的现行草案引入了成分质量的概念,用于估计单一变化因素对识别结果的影响。在本研究中,我们提出了一种基于3D人脸重建到中性表情锚点累积距离的质量度量(NeutrEx)。评估结果表明,相比在预训练表情分类卷积神经网络提取的人脸嵌入上训练支持向量机的基线方法,我们的方法更优。此外,我们通过计算每个顶点的距离来揭示影响最大的人脸区域,凸显了NeutrEx度量的可解释性,并使操作员能够向受试者提供可执行的反馈。
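
The measure itself is simple once meshes are in dense correspondence: accumulate per-vertex distances between the reconstructed face and the neutral-expression anchor. The meshes below are random stand-ins (a FLAME-like vertex count is assumed); real inputs would come from a 3D face reconstruction method:

```python
# Minimal NeutrEx-style score: accumulated per-vertex distances.
import numpy as np

def neutrex_score(recon_vertices, neutral_vertices):
    """Sum of per-vertex Euclidean distances; lower = more neutral."""
    d = np.linalg.norm(recon_vertices - neutral_vertices, axis=1)
    return d.sum(), d                  # scalar score + per-vertex map

recon = np.random.rand(5023, 3)        # reconstructed face mesh (stand-in)
neutral = np.random.rand(5023, 3)      # neutral-expression anchor mesh
score, per_vertex = neutrex_score(recon, neutral)
# per_vertex can be rendered as a heat map to show operators which face
# regions (mouth, brows, ...) break neutrality.
print(score, per_vertex.argmax())
```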

UniAP: Towards Universal Animal Perception in Vision via Few-shot Learning

  • paper_url: http://arxiv.org/abs/2308.09953
  • repo_url: https://github.com/rese1f/UniAP
  • paper_authors: Meiqi Sun, Zhonghan Zhao, Wenhao Chai, Hanjun Luo, Shidong Cao, Yanting Zhang, Jenq-Neng Hwang, Gaoang Wang
  • for: 用于自动监测动物健康、理解动物行为和助助动物研究。
  • methods: 使用少样本学习实现跨物种的动物视觉感知模型,通过共享的视觉特征传递知识。
  • results: 实现了跨种动物视觉任务的泛化和适应,可以快速适应新种类和有限数量的标注数据。
    Abstract Animal visual perception is an important technique for automatically monitoring animal health, understanding animal behaviors, and assisting animal-related research. However, it is challenging to design a deep learning-based perception model that can freely adapt to different animals across various perception tasks, due to the varying poses of a large diversity of animals, lacking data on rare species, and the semantic inconsistency of different tasks. We introduce UniAP, a novel Universal Animal Perception model that leverages few-shot learning to enable cross-species perception among various visual tasks. Our proposed model takes support images and labels as prompt guidance for a query image. Images and labels are processed through a Transformer-based encoder and a lightweight label encoder, respectively. Then a matching module is designed for aggregating information between prompt guidance and the query image, followed by a multi-head label decoder to generate outputs for various tasks. By capitalizing on the shared visual characteristics among different animals and tasks, UniAP enables the transfer of knowledge from well-studied species to those with limited labeled data or even unseen species. We demonstrate the effectiveness of UniAP through comprehensive experiments in pose estimation, segmentation, and classification tasks on diverse animal species, showcasing its ability to generalize and adapt to new classes with minimal labeled examples.
    摘要 动物视觉感知是自动监测动物健康、理解动物行为和辅助动物研究的重要技术。然而,由于大量动物姿态多样、珍稀物种缺乏数据,以及不同任务之间语义不一致,设计一个能够自由适应各类动物与各种感知任务的深度学习感知模型颇具挑战。我们提出了UniAP:一种新颖的通用动物感知模型,利用少样本学习在多种视觉任务间实现跨物种感知。该模型以支持图像及其标签作为查询图像的提示引导:图像和标签分别经由基于Transformer的编码器和轻量级标签编码器处理;随后由匹配模块聚合提示引导与查询图像之间的信息,再由多头标签解码器生成各类任务的输出。借助不同动物与任务间共享的视觉特征,UniAP能够将知识从研究充分的物种迁移到标注数据有限甚至未见过的物种。我们在多种动物的姿态估计、分割和分类任务上进行了全面实验,证明了UniAP的有效性,并展示了其仅凭极少标注样本即可泛化和适应新类别的能力。

Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos

  • paper_url: http://arxiv.org/abs/2308.09951
  • repo_url: None
  • paper_authors: Rui Qian, Shuangrui Ding, Xian Liu, Dahua Lin
  • for: 本文旨在增强以对象为中心的表示,提升视频对象发现与分割性能。
  • methods: 本文采用自监督方法,结合query slot attention与基于随机采样的slot attention,以同时提取高层语义与低层时间对应信息;并在融合的语义特征与对应图之上提出一种新颖的语义感知masked slot attention,以增强以对象为中心的表示。
  • results: 实验结果表明,该方法能有效识别具有语义结构的多个对象实例,提升无监督视频对象发现性能,并在稠密标签传播任务上取得最先进的性能,展示了以对象为中心分析的潜力。
    Abstract Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence. Building on these results, we take one step further and explore the possibility of integrating these two features to enhance object-centric representations. Our preliminary experiments indicate that query slot attention can extract different semantic components from the RGB feature map, while random sampling based slot attention can exploit temporal correspondence cues between frames to assist instance identification. Motivated by this, we propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps. It comprises two slot attention stages with a set of shared learnable Gaussian distributions. In the first stage, we use the mean vectors as slot initialization to decompose potential semantics and generate semantic segmentation masks through iterative attention. In the second stage, for each semantics, we randomly sample slots from the corresponding Gaussian distribution and perform masked feature aggregation within the semantic area to exploit temporal correspondence patterns for instance identification. We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations. Our model effectively identifies multiple object instances with semantic structure, reaching promising results on unsupervised video object discovery. Furthermore, we achieve state-of-the-art performance on dense label propagation tasks, demonstrating the potential for object-centric analysis. The code is released at https://github.com/shvdiwnkozbw/SMTC.
    摘要 自我监督方法已经在学习高级 semantics 和低级时间匹配方面进行了非常出色的进步。基于这些结果,我们尝试一步更远,探索将这两个特征集成到对象中心表示中以提高对象识别的可能性。我们的初步实验表明,Query slot attention可以从 RGB 特征地图中提取不同的semantic 分量,而Random sampling based slot attention可以在帧中使用时间匹配规律来帮助实例识别。这些灵感下,我们提议一种新的semantic-aware masked slot attention,它包括两个槽注意阶段,每个阶段都有一组共享学习的Gaussian 分布。在第一个阶段,我们使用极值向量作为槽初始化,以iterative attention decomposition potential semantics并生成semantic segmentation mask。在第二个阶段,对每个semantics,我们随机从相应的Gaussian分布中选择槽,并在semantic区域内进行masked feature aggregation,以利用时间匹配模式来提高实例识别。我们采用semantic-和instance-level时间一致性自我监督,以鼓励对象中心表示的时间一致性。我们的模型能够有效地识别多个对象实例,同时保持semantic结构,在无监督视频对象发现任务中达到了可观的结果。此外,我们在 dense label propagation 任务中实现了state-of-the-art 性能,表明对象中心分析的潜力。代码可以在 上下载。

Scene-Aware Feature Matching

  • paper_url: http://arxiv.org/abs/2308.09949
  • repo_url: https://github.com/USTCPCS/CVPR2018_attention
  • paper_authors: Xiaoyong Lu, Yaping Yan, Tong Wei, Songlin Du
  • for: The paper is written for improving the performance of feature matching in computer vision tasks, particularly in handling challenging scenes with large viewpoint and illumination changes.
  • methods: The paper proposes a novel model named SAM, which applies attentional grouping to guide Scene-Aware feature Matching. The model uses attention layers to handle multi-level features, including image tokens and group tokens, and groups the image tokens with the proposed token grouping module.
  • results: The paper achieves state-of-the-art performance on various applications, including homography estimation, pose estimation, and image matching, and demonstrates that the proposed model is more accurate, robust, and interpretable than conventional feature matching models.
    Abstract Current feature matching methods focus on point-level matching, pursuing better representation learning of individual features, but lacking further understanding of the scene. This results in significant performance degradation when handling challenging scenes such as scenes with large viewpoint and illumination changes. To tackle this problem, we propose a novel model named SAM, which applies attentional grouping to guide Scene-Aware feature Matching. SAM handles multi-level features, i.e., image tokens and group tokens, with attention layers, and groups the image tokens with the proposed token grouping module. Our model can be trained by ground-truth matches only and produce reasonable grouping results. With the sense-aware grouping guidance, SAM is not only more accurate and robust but also more interpretable than conventional feature matching models. Sufficient experiments on various applications, including homography estimation, pose estimation, and image matching, demonstrate that our model achieves state-of-the-art performance.
    摘要 当前的特征匹配方法主要关注点级匹配,追求更好的单个特征表示学习,却缺乏对场景的进一步理解。这导致在处理具有大视角与光照变化等挑战性场景时性能显著下降。为解决这一问题,我们提出了一种名为SAM的新模型,通过注意力分组来引导场景感知的特征匹配。SAM利用注意力层处理多层级特征(即图像词元与分组词元),并通过所提出的词元分组模块对图像词元进行分组。我们的模型仅需真值匹配即可训练,并能产生合理的分组结果。在场景感知分组的引导下,SAM不仅比传统特征匹配模型更准确、更鲁棒,还更具可解释性。在单应性估计、位姿估计和图像匹配等多种应用上的充分实验表明,我们的模型达到了最先进的性能。

Weakly-Supervised Action Localization by Hierarchically-structured Latent Attention Modeling

  • paper_url: http://arxiv.org/abs/2308.09946
  • repo_url: None
  • paper_authors: Guiqin Wang, Peng Zhao, Cong Zhao, Shusen Yang, Jie Cheng, Luziwei Leng, Jianxing Liao, Qinghai Guo
  • for: 这篇论文主要研究弱监督动作定位问题,即在仅有视频级标签、无时间标注的未裁剪视频中识别并定位动作实例。
  • methods: 我们提出了一种新颖的基于注意力的层级结构潜变量模型,用于学习视频特征语义的时间变化。该模型包含两个组件:第一个是无监督的变化点检测模块,基于视频特征变化速率学习其在时间层级中的潜在表示,从而检测变化点;第二个是基于注意力的分类模型,用于选取前景的变化点作为动作边界。
  • results: 我们在THUMOS-14和ActivityNet-v1.3两个基准数据集上进行了大量实验,结果表明我们的方法优于当前最先进方法,甚至可与全监督方法相媲美。
    Abstract Weakly-supervised action localization aims to recognize and localize action instancese in untrimmed videos with only video-level labels. Most existing models rely on multiple instance learning(MIL), where the predictions of unlabeled instances are supervised by classifying labeled bags. The MIL-based methods are relatively well studied with cogent performance achieved on classification but not on localization. Generally, they locate temporal regions by the video-level classification but overlook the temporal variations of feature semantics. To address this problem, we propose a novel attention-based hierarchically-structured latent model to learn the temporal variations of feature semantics. Specifically, our model entails two components, the first is an unsupervised change-points detection module that detects change-points by learning the latent representations of video features in a temporal hierarchy based on their rates of change, and the second is an attention-based classification model that selects the change-points of the foreground as the boundaries. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark datasets, THUMOS-14 and ActivityNet-v1.3. The experiments show that our method outperforms current state-of-the-art methods, and even achieves comparable performance with fully-supervised methods.
    摘要 弱监督动作定位旨在仅利用视频级标签,在未裁剪视频中识别并定位动作实例。现有模型大多依赖多示例学习(MIL),通过对带标签的包进行分类来监督未标注实例的预测。基于MIL的方法研究较为充分,在分类上取得了可观的性能,但在定位上并非如此:它们通常通过视频级分类来定位时间区域,却忽视了特征语义的时间变化。为解决这一问题,我们提出了一种新颖的基于注意力的层级结构潜变量模型来学习特征语义的时间变化。具体来说,该模型包含两个组件:其一是无监督的变化点检测模块,基于视频特征的变化速率学习其在时间层级中的潜在表示,从而检测变化点;其二是基于注意力的分类模型,选取前景的变化点作为边界。为评估模型的有效性,我们在THUMOS-14和ActivityNet-v1.3两个基准数据集上进行了大量实验。实验表明,我们的方法优于当前最先进方法,甚至可与全监督方法相媲美。

Dual Branch Deep Learning Network for Detection and Stage Grading of Diabetic Retinopathy

  • paper_url: http://arxiv.org/abs/2308.09945
  • repo_url: None
  • paper_authors: Hossein Shakibania, Sina Raoufi, Behnam Pourafkham, Hassan Khotanlou, Muharram Mansoorizadeh
  • for: Proposes a deep learning method for the detection and stage grading of diabetic retinopathy, supporting early diagnosis and treatment of this diabetes complication.
  • methods: Uses two state-of-the-art pre-trained models as feature extractors and fine-tunes them on a new dataset; the model is trained on a large multi-center dataset that includes the APTOS 2019 dataset.
  • results: Achieves strong detection and stage-grading performance on APTOS 2019, surpassing the established literature. For binary classification it reaches 98.50% accuracy, 99.46% sensitivity, and 97.51% specificity; for stage grading it reaches a quadratic weighted kappa of 93.00%, 89.60% accuracy, 89.60% sensitivity, and 97.72% specificity.
    Abstract Diabetic retinopathy is a severe complication of diabetes that can lead to permanent blindness if not treated promptly. Early and accurate diagnosis of the disease is essential for successful treatment. This paper introduces a deep learning method for the detection and stage grading of diabetic retinopathy, using a single fundus retinal image. Our model utilizes transfer learning, employing two state-of-the-art pre-trained models as feature extractors and fine-tuning them on a new dataset. The proposed model is trained on a large multi-center dataset, including the APTOS 2019 dataset, obtained from publicly available sources. It achieves remarkable performance in diabetic retinopathy detection and stage classification on the APTOS 2019, outperforming the established literature. For binary classification, the proposed approach achieves an accuracy of 98.50%, a sensitivity of 99.46%, and a specificity of 97.51%. In stage grading, it achieves a quadratic weighted kappa of 93.00%, an accuracy of 89.60%, a sensitivity of 89.60%, and a specificity of 97.72%. The proposed approach serves as a reliable screening and stage grading tool for diabetic retinopathy, offering significant potential to enhance clinical decision-making and patient care.
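
To make the dual-branch design concrete, here is a minimal PyTorch sketch of two pretrained backbones used as feature extractors feeding a shared classification head. The digest does not name the two backbones, so the ResNet-50/EfficientNet-B0 pairing, the feature dimensions, and the head layout below are illustrative assumptions, not the authors' released architecture.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class DualBranchDRNet(nn.Module):
    """Sketch of a dual-branch classifier: two pretrained backbones as
    feature extractors, concatenated features, and a grading head."""

    def __init__(self, num_classes: int = 5):
        super().__init__()
        # Backbone choices are assumptions; the paper only says two
        # state-of-the-art pretrained models are fine-tuned.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.branch_a = nn.Sequential(*list(resnet.children())[:-1])        # -> (B, 2048, 1, 1)
        effnet = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)
        self.branch_b = nn.Sequential(effnet.features, nn.AdaptiveAvgPool2d(1))  # -> (B, 1280, 1, 1)
        self.head = nn.Sequential(
            nn.Linear(2048 + 1280, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes),  # 5 DR stages; use 2 for binary screening
        )

    def forward(self, x):
        fa = torch.flatten(self.branch_a(x), 1)
        fb = torch.flatten(self.branch_b(x), 1)
        return self.head(torch.cat([fa, fb], dim=1))

logits = DualBranchDRNet()(torch.randn(2, 3, 224, 224))  # (2, 5)
```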

On the Robustness of Open-World Test-Time Training: Self-Training with Dynamic Prototype Expansion

  • paper_url: http://arxiv.org/abs/2308.09942
  • repo_url: https://github.com/yushu-li/owttt
  • paper_authors: Yushu Li, Xun Xu, Yongyi Su, Kui Jia
  • for: Improving the robustness of deep models under unknown target-domain distributions with low latency via test-time training/adaptation (TTT/TTA), in the open-world setting (OWTTT).
  • methods: Proposes adaptive strong-OOD pruning, dynamically expanded prototypes to separate strong from weak OOD samples, and a distribution-alignment regularization for self-training.
  • results: Achieves state-of-the-art performance on 5 OWTTT benchmarks.
    Abstract Generalizing deep learning models to unknown target domain distributions with low latency has motivated research into test-time training/adaptation (TTT/TTA). Existing approaches often focus on improving test-time training performance under well-curated target domain data. As shown in this work, many state-of-the-art methods fail to maintain performance when the target domain is contaminated with strong out-of-distribution (OOD) data, a.k.a. open-world test-time training (OWTTT). The failure is mainly due to the inability to distinguish strong OOD samples from regular weak OOD samples. To improve the robustness of OWTTT, we first develop an adaptive strong OOD pruning which improves the efficacy of the self-training TTT method. We further propose a way to dynamically expand the prototypes to represent strong OOD samples for an improved weak/strong OOD data separation. Finally, we regularize self-training with distribution alignment, and the combination yields state-of-the-art performance on 5 OWTTT benchmarks. The code is available at https://github.com/Yushu-Li/OWTTT.
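
The prototype-expansion idea can be sketched in a few lines. Below is one plausible reading of the mechanism: a test feature far from every prototype is flagged as strong OOD and spawns a new prototype, while features closest to an expanded prototype refine it. The threshold `tau`, the momentum update, and all names are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def update_prototypes(feat, prototypes, num_known, tau=0.5, momentum=0.9):
    """One test-time step of prototype-based weak/strong OOD separation.

    feat:        (D,) feature of a test sample.
    prototypes:  (K, D) current prototype bank; the first `num_known`
                 rows are the source classes, the rest were expanded
                 at test time to represent strong OOD samples.
    Returns the (possibly expanded) bank and a strong-OOD flag.
    """
    f = F.normalize(feat, dim=0)
    p = F.normalize(prototypes, dim=1)
    sim = p @ f                          # cosine similarity to every prototype
    best = int(sim.argmax())
    if sim.max() < tau:
        # Far from all prototypes: treat as strong OOD, expand the bank.
        prototypes = torch.cat([prototypes, f.unsqueeze(0)], dim=0)
        return prototypes, True
    if best >= num_known:
        # Closest to an expanded prototype: strong OOD, refine that prototype.
        prototypes[best] = momentum * prototypes[best] + (1 - momentum) * f
        return prototypes, True
    # Weak OOD: usable for self-training against class `best`.
    return prototypes, False
```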

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

  • paper_url: http://arxiv.org/abs/2308.09936
  • repo_url: https://github.com/mlpc-ucsd/bliva
  • paper_authors: Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, Zhuowen Tu
  • for: Improving the understanding of text embedded in real-world images, so that visual question answering works better in practical scenarios.
  • methods: Combines InstructBLIP's learned query embeddings with a visual assistant that directly projects encoded patch embeddings into the LLM, helping the model capture fine-grained image details.
  • results: Significantly improves performance on text-rich VQA benchmarks (up to 17.76% on the OCR-VQA benchmark) and on typical VQA benchmarks (up to 7.9% on the Visual Spatial Reasoning benchmark), and remains strong on real-world images whether or not text is present.
    Abstract Vision Language Models (VLMs), which extend Large Language Models (LLM) by incorporating visual understanding capability, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA) tasks. However, these models cannot accurately interpret images infused with text, a common occurrence in real-world scenarios. Standard procedures for extracting information from images often involve learning a fixed set of query embeddings. These embeddings are designed to encapsulate image contexts and are later used as soft prompt inputs in LLMs. Yet, this process is limited to the token count, potentially curtailing the recognition of scenes with text-rich context. To improve upon them, the present study introduces BLIVA: an augmented version of InstructBLIP with Visual Assistant. BLIVA incorporates the query embeddings from InstructBLIP and also directly projects encoded patch embeddings into the LLM, a technique inspired by LLaVA. This approach assists the model to capture intricate details potentially missed during the query decoding process. Empirical evidence demonstrates that our model, BLIVA, significantly enhances performance in processing text-rich VQA benchmarks (up to 17.76\% in OCR-VQA benchmark) and in undertaking typical VQA benchmarks (up to 7.9\% in Visual Spatial Reasoning benchmark), compared to our baseline InstructBLIP. BLIVA demonstrates significant capability in decoding real-world images, irrespective of text presence. To demonstrate the broad industry applications enabled by BLIVA, we evaluate the model using a new dataset comprising YouTube thumbnails paired with question-answer sets across 13 diverse categories. For researchers interested in further exploration, our code and models are freely accessible at https://github.com/mlpc-ucsd/BLIVA.git
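
The core architectural idea, combining learned query embeddings with directly projected patch embeddings as the LLM's visual prefix, is simple to sketch. In this PyTorch illustration the dimensions (32 queries, a 4096-d LLM) and module names are assumptions; the released model's exact shapes may differ.

```python
import torch
import torch.nn as nn

class BLIVAStylePrefix(nn.Module):
    """Sketch: fuse InstructBLIP-style query embeddings with directly
    projected ViT patch embeddings before feeding an LLM."""

    def __init__(self, vit_dim=1024, qformer_dim=768, llm_dim=4096):
        super().__init__()
        self.query_proj = nn.Linear(qformer_dim, llm_dim)  # learned-queries path
        self.patch_proj = nn.Linear(vit_dim, llm_dim)      # LLaVA-style direct path

    def forward(self, query_embeds, patch_embeds, text_embeds):
        # query_embeds: (B, 32, qformer_dim) from the Q-Former
        # patch_embeds: (B, N, vit_dim) encoded ViT patches
        # text_embeds:  (B, T, llm_dim) embedded instruction tokens
        prefix = torch.cat(
            [self.query_proj(query_embeds), self.patch_proj(patch_embeds)], dim=1
        )
        return torch.cat([prefix, text_embeds], dim=1)  # input to the frozen LLM

m = BLIVAStylePrefix()
out = m(torch.randn(1, 32, 768), torch.randn(1, 257, 1024), torch.randn(1, 16, 4096))
```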

TDG: Text-guided Domain Generalization

  • paper_url: http://arxiv.org/abs/2308.09931
  • repo_url: None
  • paper_authors: Geng Liu, Yuxi Wang
  • for: Generalizing a model trained on single or multiple source domains to unseen target domains.
  • methods: Proposes Text-guided Domain Generalization (TDG) with three parts: an automatic word-generation method that extends the description of current domains with novel domain-relevant words; a prompt-learning-based text feature generation method that embeds the generated domain information into a text feature space shared with image features; and a specially designed classifier trained on both input image features and generated text features, with the image encoder updated by gradients back-propagated from the classifier.
  • results: The generated text information improves domain generalization in an easy-to-implement manner; experiments on several domain generalization benchmarks show that the proposed framework achieves superior performance across domains.
    Abstract Domain generalization (DG) attempts to generalize a model trained on single or multiple source domains to the unseen target domain. Benefiting from the success of Visual-and-Language Pre-trained models in recent years, we argue that it is crucial for domain generalization by introducing extra text information. In this paper, we develop a novel Text-guided Domain Generalization (TDG) paradigm for domain generalization, which includes three following aspects. Specifically, we first devise an automatic words generation method to extend the description of current domains with novel domain-relevant words. Then, we embed the generated domain information into the text feature space, by the proposed prompt learning-based text feature generation method, which shares a common representation space with the image feature. Finally, we utilize both input image features and generated text features to train a specially designed classifier that generalizes well on unseen target domains, while the image encoder is also updated under the supervision of gradients back propagated from the classifier. Our experimental results show that the techniques incorporated by TDG contribute to the performance in an easy implementation manner. Experimental results on several domain generalization benchmarks show that our proposed framework achieves superior performance by effectively leveraging generated text information in domain generalization.

MDCS: More Diverse Experts with Consistency Self-distillation for Long-tailed Recognition

  • paper_url: http://arxiv.org/abs/2308.09922
  • repo_url: https://github.com/fistyee/mdcs
  • paper_authors: Qihao Zhao, Chen Jiang, Wei Hu, Fan Zhang, Jun Liu
  • for: Improving long-tailed recognition (LTR) accuracy.
  • methods: Trains more diverse experts via a Diversity Loss and reduces model variance via Consistency Self-distillation.
  • results: Effectively increases expert diversity, significantly reduces model variance, and improves recognition accuracy, outperforming the previous state of the art by 1%–2% on five popular long-tailed benchmarks.
    Abstract Recently, multi-expert methods have led to significant improvements in long-tail recognition (LTR). We summarize two aspects that need further enhancement to contribute to LTR boosting: (1) More diverse experts; (2) Lower model variance. However, the previous methods didn't handle them well. To this end, we propose More Diverse experts with Consistency Self-distillation (MDCS) to bridge the gap left by earlier methods. Our MDCS approach consists of two core components: Diversity Loss (DL) and Consistency Self-distillation (CS). In detail, DL promotes diversity among experts by controlling their focus on different categories. To reduce the model variance, we employ KL divergence to distill the richer knowledge of weakly augmented instances for the experts' self-distillation. In particular, we design Confident Instance Sampling (CIS) to select the correctly classified instances for CS to avoid biased/noisy knowledge. In the analysis and ablation study, we demonstrate that our method compared with previous work can effectively increase the diversity of experts, significantly reduce the variance of the model, and improve recognition accuracy. Moreover, the roles of our DL and CS are mutually reinforcing and coupled: the diversity of experts benefits from the CS, and the CS cannot achieve remarkable results without the DL. Experiments show our MDCS outperforms the state-of-the-art by 1% $\sim$ 2% on five popular long-tailed benchmarks, including CIFAR10-LT, CIFAR100-LT, ImageNet-LT, Places-LT, and iNaturalist 2018. The code is available at https://github.com/fistyee/MDCS.
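
The consistency self-distillation component can be illustrated compactly. The sketch below distills softened predictions on weakly augmented views into strongly augmented ones, restricted by a Confident Instance Sampling mask to correctly classified instances; the temperature and the exact masking rule are assumptions about the paper's formulation.

```python
import torch
import torch.nn.functional as F

def consistency_self_distillation(logits_weak, logits_strong, labels, T=2.0):
    """Sketch of MDCS-style Consistency Self-distillation (CS) with
    Confident Instance Sampling (CIS): distill the richer predictions on
    weakly augmented views into the strongly augmented ones, but only for
    instances the weak view classifies correctly."""
    with torch.no_grad():
        teacher = F.softmax(logits_weak / T, dim=1)
        confident = logits_weak.argmax(dim=1).eq(labels)   # CIS mask
    student = F.log_softmax(logits_strong / T, dim=1)
    kl = F.kl_div(student, teacher, reduction="none").sum(dim=1)
    if confident.any():
        return (kl * confident).sum() / confident.sum() * T * T
    return logits_strong.new_zeros(())
```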

VI-Net: Boosting Category-level 6D Object Pose Estimation via Learning Decoupled Rotations on the Spherical Representations

  • paper_url: http://arxiv.org/abs/2308.09916
  • repo_url: https://github.com/jiehonglin/vi-net
  • paper_authors: Jiehong Lin, Zewei Wei, Yabin Zhang, Kui Jia
  • for: High-precision category-level 6D object pose estimation, even when no CAD models are available.
  • methods: Proposes a rotation estimation network, VI-Net, that simplifies learning in the non-linear space SO(3) by decoupling rotation into a viewpoint rotation and an in-plane rotation.
  • results: Experiments show that VI-Net outperforms existing methods by a large margin in the high-precision regime.
    Abstract Rotation estimation of high precision from an RGB-D object observation is a huge challenge in 6D object pose estimation, due to the difficulty of learning in the non-linear space of SO(3). In this paper, we propose a novel rotation estimation network, termed as VI-Net, to make the task easier by decoupling the rotation as the combination of a viewpoint rotation and an in-plane rotation. More specifically, VI-Net bases the feature learning on the sphere with two individual branches for the estimates of two factorized rotations, where a V-Branch is employed to learn the viewpoint rotation via binary classification on the spherical signals, while another I-Branch is used to estimate the in-plane rotation by transforming the signals to view from the zenith direction. To process the spherical signals, a Spherical Feature Pyramid Network is constructed based on a novel design of SPAtial Spherical Convolution (SPA-SConv), which settles the boundary problem of spherical signals via feature padding and realizes viewpoint-equivariant feature extraction by symmetric convolutional operations. We apply the proposed VI-Net to the challenging task of category-level 6D object pose estimation for predicting the poses of unknown objects without available CAD models; experiments on the benchmarking datasets confirm the efficacy of our method, which outperforms the existing ones with a large margin in the regime of high precision.

EGANS: Evolutionary Generative Adversarial Network Search for Zero-Shot Learning

  • paper_url: http://arxiv.org/abs/2308.09915
  • repo_url: None
  • paper_authors: Shiming Chen, Shihuang Chen, Wenjin Hou, Weiping Ding, Xinge You
  • for: Proposes an evolutionary generative adversarial network search (EGANS) that automatically designs generative networks with good adaptation and stability, enabling reliable visual feature synthesis across datasets/scenarios to improve zero-shot learning (ZSL).
  • methods: Uses cooperative dual evolution to perform neural architecture search for both the generator and the discriminator under a unified evolutionary adversarial framework. In the generator search stage, a many-to-one adversarial training strategy evolves the generator; a similar evolutionary search algorithm is then applied to the discriminator.
  • results: EGANS consistently improves existing generative ZSL methods on the standard CUB, SUN, AWA2, and FLO datasets, indicating the promise of evolutionary neural architecture search for ZSL.
    Abstract Zero-shot learning (ZSL) aims to recognize the novel classes which cannot be collected for training a prediction model. Accordingly, generative models (e.g., generative adversarial network (GAN)) are typically used to synthesize the visual samples conditioned by the class semantic vectors and achieve remarkable progress for ZSL. However, existing GAN-based generative ZSL methods are based on hand-crafted models, which cannot adapt to various datasets/scenarios and suffer from model instability. To alleviate these challenges, we propose evolutionary generative adversarial network search (termed EGANS) to automatically design the generative network with good adaptation and stability, enabling reliable visual feature sample synthesis for advancing ZSL. Specifically, we adopt cooperative dual evolution to conduct a neural architecture search for both generator and discriminator under a unified evolutionary adversarial framework. EGANS is learned by two stages: evolution generator architecture search and evolution discriminator architecture search. During the evolution generator architecture search, we adopt a many-to-one adversarial training strategy to evolutionarily search for the optimal generator. Then the optimal generator is further applied to search for the optimal discriminator in the evolution discriminator architecture search with a similar evolution search algorithm. Once the optimal generator and discriminator are searched, we entail them into various generative ZSL baselines for ZSL classification. Extensive experiments show that EGANS consistently improve existing generative ZSL methods on the standard CUB, SUN, AWA2 and FLO datasets. The significant performance gains indicate that the evolutionary neural architecture search explores a virgin field in ZSL.

Noisy-Correspondence Learning for Text-to-Image Person Re-identification

  • paper_url: http://arxiv.org/abs/2308.09911
  • repo_url: https://github.com/tencentyouturesearch/personretrieval-ivt
  • paper_authors: Yang Qin, Yingke Chen, Dezhong Peng, Xi Peng, Joey Tianyi Zhou, Peng Hu
  • for: Improving the robustness of text-to-image person re-identification (TIReID) methods against the noisy image-text correspondences that arise in real-world data.
  • methods: Proposes Robust Dual Embedding (RDE) with two main components: 1) a Confident Consensus Division (CCD) module that leverages the dual-grained decisions of dual embedding modules to obtain a clean training subset, enabling correct and reliable visual-semantic associations; and 2) a Triplet-Alignment Loss (TAL) that relaxes the conventional hardest-negative triplet-ranking loss to a log-exponential upper bound over all negatives, preventing the model from overemphasizing false image-text pairs.
  • results: Extensive experiments on three public benchmarks (CUHK-PEDES, ICFG-PEDES, and RSTPReID) show state-of-the-art results both with and without synthetic noisy correspondences.
    Abstract Text-to-image person re-identification (TIReID) is a compelling topic in the cross-modal community, which aims to retrieve the target person based on a textual query. Although numerous TIReID methods have been proposed and achieved promising performance, they implicitly assume the training image-text pairs are correctly aligned, which is not always the case in real-world scenarios. In practice, the image-text pairs inevitably exist under-correlated or even false-correlated, a.k.a noisy correspondence (NC), due to the low quality of the images and annotation errors. To address this problem, we propose a novel Robust Dual Embedding method (RDE) that can learn robust visual-semantic associations even with NC. Specifically, RDE consists of two main components: 1) A Confident Consensus Division (CCD) module that leverages the dual-grained decisions of dual embedding modules to obtain a consensus set of clean training data, which enables the model to learn correct and reliable visual-semantic associations. 2) A Triplet-Alignment Loss (TAL) relaxes the conventional triplet-ranking loss with hardest negatives, which tends to rapidly overfit NC, to a log-exponential upper bound over all negatives, thus preventing the model from overemphasizing false image-text pairs. We conduct extensive experiments on three public benchmarks, namely CUHK-PEDES, ICFG-PEDES, and RSTPReID, to evaluate the performance and robustness of our RDE. Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on all three datasets.
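
The Triplet-Alignment Loss idea, replacing the hardest negative with a smooth log-exponential upper bound over all negatives, can be sketched as follows. The similarity-matrix convention, margin, and temperature here are assumptions; RDE's exact form may differ.

```python
import torch

def triplet_alignment_loss(sim, margin=0.2, tau=20.0):
    """Sketch of a Triplet-Alignment-style loss: a smooth log-exponential
    upper bound over all negatives replaces the hardest-negative triplet
    ranking loss, so no single (possibly false) pair dominates the gradient.

    sim: (B, B) image-text similarity matrix whose diagonal holds the
         positive pairs (image->text direction shown).
    """
    pos = sim.diag().unsqueeze(1)                           # (B, 1) positive scores
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Hinge terms for every negative in the batch.
    viol = (margin + sim - pos).masked_fill(mask, float("-inf"))
    # logsumexp smoothly upper-bounds max(.), aggregating all negatives.
    loss = torch.logsumexp(tau * viol, dim=1).clamp(min=0) / tau
    return loss.mean()
```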

Physics-Guided Human Motion Capture with Pose Probability Modeling

  • paper_url: http://arxiv.org/abs/2308.09910
  • repo_url: https://github.com/me-ditto/physics-guided-mocap
  • paper_authors: Jingyi Ju, Buzhen Huang, Chen Zhu, Zhihao Li, Yangang Wang
  • for: Improving the accuracy and success rate of human motion capture while avoiding artifacts such as floating, foot sliding, and ground penetration.
  • methods: Uses physics as denoising guidance in a reverse diffusion process, reconstructing physically plausible human motion from a modeled pose probability distribution.
  • results: Outperforms previous physics-based methods in both joint accuracy and success rate; more information is available at \url{https://github.com/Me-Ditto/Physics-Guided-Mocap}.
    Abstract Incorporating physics in human motion capture to avoid artifacts like floating, foot sliding, and ground penetration is a promising direction. Existing solutions always adopt kinematic results as reference motions, and the physics is treated as a post-processing module. However, due to the depth ambiguity, monocular motion capture inevitably suffers from noises, and the noisy reference often leads to failure for physics-based tracking. To address the obstacles, our key-idea is to employ physics as denoising guidance in the reverse diffusion process to reconstruct physically plausible human motion from a modeled pose probability distribution. Specifically, we first train a latent gaussian model that encodes the uncertainty of 2D-to-3D lifting to facilitate reverse diffusion. Then, a physics module is constructed to track the motion sampled from the distribution. The discrepancies between the tracked motion and image observation are used to provide explicit guidance for the reverse diffusion model to refine the motion. With several iterations, the physics-based tracking and kinematic denoising promote each other to generate a physically plausible human motion. Experimental results show that our method outperforms previous physics-based methods in both joint accuracy and success rate. More information can be found at \url{https://github.com/Me-Ditto/Physics-Guided-Mocap}.

DiffusionTrack: Diffusion Model For Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2308.09905
  • repo_url: https://github.com/rainbowluocs/diffusiontrack
  • paper_authors: Run Luo, Zikai Song, Lintao Ma, Jinlin Wei, Wei Yang, Min Yang
  • for: A simple yet robust multi-object tracking (MOT) method that addresses common problems of existing approaches, such as global or local inconsistency, a poor trade-off between robustness and model complexity, and limited flexibility across scenes.
  • methods: Formulates detection and association jointly as a progressive denoising diffusion process from paired noise boxes to paired ground-truth boxes. During training, paired object boxes diffuse from ground-truth boxes to a random distribution and the model learns to reverse this noising; at inference, the model refines randomly generated paired boxes in a flexible one-step or multi-step denoising process.
  • results: Achieves competitive performance against state-of-the-art methods on three widely used MOT benchmarks: MOT17, MOT20, and DanceTrack.
    Abstract Multi-object tracking (MOT) is a challenging vision task that aims to detect individual objects within a single frame and associate them across multiple frames. Recent MOT approaches can be categorized into two-stage tracking-by-detection (TBD) methods and one-stage joint detection and tracking (JDT) methods. Despite the success of these approaches, they also suffer from common problems, such as harmful global or local inconsistency, poor trade-off between robustness and model complexity, and lack of flexibility in different scenes within the same video. In this paper we propose a simple but robust framework that formulates object detection and association jointly as a consistent denoising diffusion process from paired noise boxes to paired ground-truth boxes. This novel progressive denoising diffusion strategy substantially augments the tracker's effectiveness, enabling it to discriminate between various objects. During the training stage, paired object boxes diffuse from paired ground-truth boxes to random distribution, and the model learns detection and tracking simultaneously by reversing this noising process. In inference, the model refines a set of paired randomly generated boxes to the detection and tracking results in a flexible one-step or multi-step denoising diffusion process. Extensive experiments on three widely used MOT benchmarks, including MOT17, MOT20, and Dancetrack, demonstrate that our approach achieves competitive performance compared to the current state-of-the-art methods.
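
The training-side corruption can be sketched as standard forward diffusion applied to paired boxes. The linear beta schedule and the box parameterization below are generic DDPM assumptions, not necessarily the paper's exact recipe.

```python
import torch

def add_box_noise(gt_boxes_t1, gt_boxes_t2, t, T=1000):
    """Sketch of DiffusionTrack-style training corruption: paired
    ground-truth boxes from two adjacent frames diffuse toward random
    boxes; the detector is trained to reverse this process.

    gt_boxes_t1, gt_boxes_t2: (N, 4) normalized boxes in two frames.
    t: diffusion step index in [0, T).
    """
    betas = torch.linspace(1e-4, 0.02, T)                # illustrative schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    noisy1 = alpha_bar.sqrt() * gt_boxes_t1 + (1 - alpha_bar).sqrt() * torch.randn_like(gt_boxes_t1)
    noisy2 = alpha_bar.sqrt() * gt_boxes_t2 + (1 - alpha_bar).sqrt() * torch.randn_like(gt_boxes_t2)
    return noisy1, noisy2  # fed to the model, which predicts the clean pair
```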

Scalable Video Object Segmentation with Simplified Framework

  • paper_url: http://arxiv.org/abs/2308.09903
  • repo_url: None
  • paper_authors: Qiangqiang Wu, Tianyu Yang, Wei WU, Antoni Chan
  • for: Video object segmentation (VOS) with a simplified model that performs joint feature extraction and matching, avoiding hand-crafted matching modules.
  • methods: Proposes a scalable Simplified VOS (SimVOS) framework built on a single ViT backbone that extracts and matches query and reference features simultaneously; a within-frame attention scheme and a new token refinement module improve running speed and save computation.
  • results: Achieves state-of-the-art results on popular VOS benchmarks, including DAVIS-2017 (88.0% J&F), DAVIS-2016 (92.9% J&F), and YouTube-VOS 2019 (84.2% J&F), without any synthetic-video or BL30K pre-training.
    Abstract The current popular methods for video object segmentation (VOS) implement feature matching through several hand-crafted modules that separately perform feature extraction and matching. However, the above hand-crafted designs empirically cause insufficient target interaction, thus limiting the dynamic target-aware feature learning in VOS. To tackle these limitations, this paper presents a scalable Simplified VOS (SimVOS) framework to perform joint feature extraction and matching by leveraging a single transformer backbone. Specifically, SimVOS employs a scalable ViT backbone for simultaneous feature extraction and matching between query and reference features. This design enables SimVOS to learn better target-ware features for accurate mask prediction. More importantly, SimVOS could directly apply well-pretrained ViT backbones (e.g., MAE) for VOS, which bridges the gap between VOS and large-scale self-supervised pre-training. To achieve a better performance-speed trade-off, we further explore within-frame attention and propose a new token refinement module to improve the running speed and save computational cost. Experimentally, our SimVOS achieves state-of-the-art results on popular video object segmentation benchmarks, i.e., DAVIS-2017 (88.0% J&F), DAVIS-2016 (92.9% J&F) and YouTube-VOS 2019 (84.2% J&F), without applying any synthetic video or BL30K pre-training used in previous VOS approaches.
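
The joint extraction-and-matching idea is essentially: concatenate reference and query tokens and let plain self-attention do the matching. A minimal sketch, with placeholder depth and dimensions:

```python
import torch
import torch.nn as nn

class JointExtractMatch(nn.Module):
    """Sketch of SimVOS-style joint extraction and matching: reference and
    query patch tokens are concatenated and processed by a single stack of
    plain transformer blocks, so cross-frame matching happens inside
    self-attention instead of a hand-crafted matching module."""

    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, ref_tokens, query_tokens):
        # ref_tokens:   (B, Nr, D) reference frame (+ mask) embeddings
        # query_tokens: (B, Nq, D) current frame embeddings
        x = self.blocks(torch.cat([ref_tokens, query_tokens], dim=1))
        return x[:, ref_tokens.size(1):]  # target-aware query features -> mask head
```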

Towards a High-Performance Object Detector: Insights from Drone Detection Using ViT and CNN-based Deep Learning Models

  • paper_url: http://arxiv.org/abs/2308.09899
  • repo_url: None
  • paper_authors: Junyang Zhang
  • for: Accurate drone detection for drone collision avoidance, drone defense, and autonomous UAV self-landing.
  • methods: Builds CNN- and ViT-based models for single-drone and multi-drone detection on a UAV dataset of 1359 drone photos.
  • results: For single-drone detection, a basic ViT is 4.6 times more robust than the best CNN-based transfer-learning model; however, ViTs require more training data, computational power, and sophisticated designs to fully surpass cutting-edge CNN detectors.
    Abstract Accurate drone detection is strongly desired in drone collision avoidance, drone defense and autonomous Unmanned Aerial Vehicle (UAV) self-landing. With the recent emergence of the Vision Transformer (ViT), this critical task is reassessed in this paper using a UAV dataset composed of 1359 drone photos. We construct various CNN and ViT-based models, demonstrating that for single-drone detection, a basic ViT can achieve performance 4.6 times more robust than our best CNN-based transfer learning models. By implementing the state-of-the-art You Only Look Once (YOLO v7, 200 epochs) and the experimental ViT-based You Only Look At One Sequence (YOLOS, 20 epochs) in multi-drone detection, we attain impressive 98% and 96% mAP values, respectively. We find that ViT outperforms CNN at the same epoch, but also requires more training data, computational power, and sophisticated, performance-oriented designs to fully surpass the capabilities of cutting-edge CNN detectors. We summarize the distinct characteristics of ViT and CNN models to aid future researchers in developing more efficient deep learning models.

Spatial-Temporal Alignment Network for Action Recognition

  • paper_url: http://arxiv.org/abs/2308.09897
  • repo_url: None
  • paper_authors: Jinhui Ye, Junwei Liang
  • for: Introducing viewpoint-invariant feature representations to improve existing action recognition architectures.
  • methods: Proposes a Spatial-Temporal Alignment Network (STAN) that learns geometrically invariant representations; the model is lightweight and generic, so it can be plugged into existing architectures (e.g., MViTv2) at low extra cost.
  • results: In trained-from-scratch settings, STAN consistently improves action recognition on widely used datasets such as UCF101 and HMDB51.
    Abstract This paper studies introducing viewpoint-invariant feature representations into existing action recognition architectures. Despite significant progress in action recognition, efficiently handling geometric variations in large-scale datasets remains challenging. To tackle this problem, we propose a novel Spatial-Temporal Alignment Network (STAN), which explicitly learns geometrically invariant representations for action recognition. Notably, the STAN model is lightweight and generic, and can be plugged into existing action recognition models (e.g., MViTv2) at a low extra computational cost. We test our STAN model on widely-used datasets such as UCF101 and HMDB51. The experimental results show that the STAN model can consistently improve state-of-the-art models on action recognition tasks in trained-from-scratch settings.

Semantic-Human: Neural Rendering of Humans from Monocular Video with Human Parsing

  • paper_url: http://arxiv.org/abs/2308.09894
  • repo_url: None
  • paper_authors: Jie Zhang, Pengcheng Shi, Zaiwang Gu, Yiyang Zhou, Zhi Wang
  • for: Improving the quality of neural human rendering while simultaneously performing human parsing.
  • methods: Proposes Semantic-Human, which jointly achieves photorealistic rendering and view-consistent human parsing. It extends NeRF to jointly encode semantics, appearance, and geometry, producing accurate 2D semantic labels from noisy pseudo-label supervision, and introduces constraints derived from the SMPL surface for the motion field together with regularization for the recovered volumetric geometry.
  • results: Evaluated on the ZJU-MoCap dataset, the method obtains highly competitive results and enables compelling applications such as label denoising, label synthesis, and image editing.
    Abstract The neural rendering of humans is a topic of great research significance. However, previous works mostly focus on achieving photorealistic details, neglecting the exploration of human parsing. Additionally, classical semantic works are limited in their ability to efficiently represent fine results in complex motions. Human parsing is inherently related to radiance reconstruction, as similar appearance and geometry often correspond to similar semantic parts. Furthermore, previous works often design a motion field that maps from the observation space to the canonical space, which tends to exhibit either underfitting or overfitting, resulting in limited generalization. In this paper, we present Semantic-Human, a novel method that achieves both photorealistic details and viewpoint-consistent human parsing for the neural rendering of humans. Specifically, we extend neural radiance fields (NeRF) to jointly encode semantics, appearance and geometry to achieve accurate 2D semantic labels using noisy pseudo-label supervision. Leveraging the inherent consistency and smoothness properties of NeRF, Semantic-Human achieves consistent human parsing in both continuous and novel views. We also introduce constraints derived from the SMPL surface for the motion field and regularization for the recovered volumetric geometry. We have evaluated the model using the ZJU-MoCap dataset, and the obtained highly competitive results demonstrate the effectiveness of our proposed Semantic-Human. We also showcase various compelling applications, including label denoising, label synthesis and image editing, and empirically validate its advantageous properties.

DUAW: Data-free Universal Adversarial Watermark against Stable Diffusion Customization

  • paper_url: http://arxiv.org/abs/2308.09889
  • repo_url: None
  • paper_authors: Xiaoyu Ye, Hao Huang, Jiaqi An, Yongtao Wang
  • for: This paper aims to address the issue of copyright infringement in Stable Diffusion (SD) customization approaches by proposing an invisible data-free universal adversarial watermark (DUAW) to protect a myriad of copyrighted images.
  • methods: The proposed DUAW is designed to disrupt the variational autoencoder during SD customization, and it operates in a data-free context using synthetic images produced by a Large Language Model (LLM) and a pretrained SD model.
  • results: Experimental results demonstrate that DUAW can effectively distort the outputs of fine-tuned SD models, rendering them discernible to both human observers and a simple classifier, thereby protecting copyrighted images from plagiarism.
    Abstract Stable Diffusion (SD) customization approaches enable users to personalize SD model outputs, greatly enhancing the flexibility and diversity of AI art. However, they also allow individuals to plagiarize specific styles or subjects from copyrighted images, which raises significant concerns about potential copyright infringement. To address this issue, we propose an invisible data-free universal adversarial watermark (DUAW), aiming to protect a myriad of copyrighted images from different customization approaches across various versions of SD models. First, DUAW is designed to disrupt the variational autoencoder during SD customization. Second, DUAW operates in a data-free context, where it is trained on synthetic images produced by a Large Language Model (LLM) and a pretrained SD model. This approach circumvents the necessity of directly handling copyrighted images, thereby preserving their confidentiality. Once crafted, DUAW can be imperceptibly integrated into massive copyrighted images, serving as a protective measure by inducing significant distortions in the images generated by customized SD models. Experimental results demonstrate that DUAW can effectively distort the outputs of fine-tuned SD models, rendering them discernible to both human observers and a simple classifier.
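
One plausible way to realize such a watermark is a PGD-style optimization of a single universal perturbation that maximizes the VAE's reconstruction error. The sketch below assumes a differentiable `vae.encode`/`vae.decode` interface, an L-infinity budget, and an MSE objective; these are illustrative choices, and in the paper's data-free setting `images` would be LLM+SD synthetic images.

```python
import torch

def learn_duaw(vae, images, eps=8 / 255, steps=100, lr=1e-2):
    """Sketch of optimizing a universal adversarial watermark that disrupts
    a VAE's encode/decode round trip, in the spirit of DUAW.

    images: (B, C, H, W) in [0, 1]; a single delta is shared by all images.
    """
    delta = torch.zeros_like(images[0]).uniform_(-eps, eps).requires_grad_(True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x = (images + delta).clamp(0, 1)
        recon = vae.decode(vae.encode(x))               # placeholder VAE interface
        loss = -torch.nn.functional.mse_loss(recon, images)  # maximize distortion
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                     # keep the watermark invisible
    return delta.detach()
```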

Calibrating Uncertainty for Semi-Supervised Crowd Counting

  • paper_url: http://arxiv.org/abs/2308.09887
  • repo_url: None
  • paper_authors: Chen Li, Xiaoling Hu, Shahira Abousamra, Chao Chen
  • for: Proposes a new method for semi-supervised crowd counting that improves performance on this important yet challenging task.
  • methods: Calibrates model uncertainty by training with a supervised uncertainty estimation strategy through a surrogate function; a matching-based patch-wise surrogate function better approximates uncertainty for crowd counting, attending to details while keeping a proper granularity.
  • results: Generates reliable uncertainty estimates and high-quality pseudo-labels, achieving state-of-the-art performance in semi-supervised crowd counting.
    Abstract Semi-supervised crowd counting is an important yet challenging task. A popular approach is to iteratively generate pseudo-labels for unlabeled data and add them to the training set. The key is to use uncertainty to select reliable pseudo-labels. In this paper, we propose a novel method to calibrate model uncertainty for crowd counting. Our method takes a supervised uncertainty estimation strategy to train the model through a surrogate function. This ensures the uncertainty is well controlled throughout the training. We propose a matching-based patch-wise surrogate function to better approximate uncertainty for crowd counting tasks. The proposed method pays a sufficient amount of attention to details, while maintaining a proper granularity. Altogether our method is able to generate reliable uncertainty estimation, high quality pseudolabels, and achieve state-of-the-art performance in semisupervised crowd counting.

Forecast-MAE: Self-supervised Pre-training for Motion Forecasting with Masked Autoencoders

  • paper_url: http://arxiv.org/abs/2308.09882
  • repo_url: https://github.com/jchengai/forecast-mae
  • paper_authors: Jie Cheng, Xiaodong Mei, Ming Liu
  • for: Explores self-supervised learning (SSL) for motion forecasting, an area that, despite SSL's broad success in computer vision and natural language processing, has not yet been extensively investigated.
  • methods: Introduces Forecast-MAE, a mask-autoencoder-based SSL framework designed for motion forecasting. Its novel masking strategy exploits the strong interconnection between agent trajectories and road networks: agents' future or history trajectories are masked complementarily, and lane segments are masked at random.
  • results: On the challenging Argoverse 2 benchmark, Forecast-MAE with standard Transformer blocks and minimal inductive bias is competitive with supervised, heavily engineered state-of-the-art methods and outperforms the previous self-supervised approach by a significant margin.
    Abstract This study explores the application of self-supervised learning (SSL) to the task of motion forecasting, an area that has not yet been extensively investigated despite the widespread success of SSL in computer vision and natural language processing. To address this gap, we introduce Forecast-MAE, an extension of the mask autoencoders framework that is specifically designed for self-supervised learning of the motion forecasting task. Our approach includes a novel masking strategy that leverages the strong interconnections between agents' trajectories and road networks, involving complementary masking of agents' future or history trajectories and random masking of lane segments. Our experiments on the challenging Argoverse 2 motion forecasting benchmark show that Forecast-MAE, which utilizes standard Transformer blocks with minimal inductive bias, achieves competitive performance compared to state-of-the-art methods that rely on supervised learning and sophisticated designs. Moreover, it outperforms the previous self-supervised learning method by a significant margin. Code is available at https://github.com/jchengai/forecast-mae.
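
The masking strategy itself is lightweight; here is a sketch of the complementary agent masking plus random lane masking (the ratios and the 50/50 split are assumptions):

```python
import torch

def complementary_agent_mask(num_agents, num_lanes, lane_mask_ratio=0.5):
    """Sketch of Forecast-MAE's masking idea: for each agent, mask either
    its history or its future trajectory (complementary), and mask lane
    segments at random."""
    # True = keep history / mask future; False = mask history / keep future.
    keep_history = torch.rand(num_agents) < 0.5
    lane_keep = torch.rand(num_lanes) >= lane_mask_ratio
    return keep_history, lane_keep

keep_hist, lane_keep = complementary_agent_mask(num_agents=16, num_lanes=64)
# The encoder sees only the kept tokens; the decoder reconstructs the
# masked history/future segments and lane polylines from mask tokens.
```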

DatasetEquity: Are All Samples Created Equal? In The Quest For Equity Within Datasets

  • paper_url: http://arxiv.org/abs/2308.09878
  • repo_url: https://github.com/towardsautonomy/datasetequity
  • paper_authors: Shubham Shrivastava, Xianling Zhang, Sushruth Nagesh, Armin Parchami
  • for: Addressing data imbalance in machine learning, specifically the largely unexplored bias caused by image appearance in computer vision.
  • methods: Computes sample likelihoods from deep perceptual embeddings with clustering, then weighs samples differently during training via a proposed Generalized Focal Loss that integrates easily with deep learning pipelines.
  • results: Improves state-of-the-art 3D object detection, with over 200% AP gains on the under-represented Cyclist class on KITTI; the method generalizes, complements existing techniques, and is particularly beneficial for smaller datasets and rare classes.
    Abstract Data imbalance is a well-known issue in the field of machine learning, attributable to the cost of data collection, the difficulty of labeling, and the geographical distribution of the data. In computer vision, bias in data distribution caused by image appearance remains highly unexplored. Compared to categorical distributions using class labels, image appearance reveals complex relationships between objects beyond what class labels provide. Clustering deep perceptual features extracted from raw pixels gives a richer representation of the data. This paper presents a novel method for addressing data imbalance in machine learning. The method computes sample likelihoods based on image appearance using deep perceptual embeddings and clustering. It then uses these likelihoods to weigh samples differently during training with a proposed $\textbf{Generalized Focal Loss}$ function. This loss can be easily integrated with deep learning algorithms. Experiments validate the method's effectiveness across autonomous driving vision datasets including KITTI and nuScenes. The loss function improves state-of-the-art 3D object detection methods, achieving over $200\%$ AP gains on under-represented classes (Cyclist) in the KITTI dataset. The results demonstrate the method is generalizable, complements existing techniques, and is particularly beneficial for smaller datasets and rare classes. Code is available at: https://github.com/towardsautonomy/DatasetEquity
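
A minimal sketch of the likelihood-based weighting: cluster deep perceptual embeddings, take cluster frequency as a proxy for sample likelihood, and down-weight common-looking samples. The fixed k, scikit-learn KMeans, and the (1 - p)^gamma weighting are assumptions about the Generalized Focal Loss, not the paper's exact formula.

```python
import numpy as np
from sklearn.cluster import KMeans

def generalized_focal_weights(embeddings, gamma=1.0, k=32):
    """Sketch of the DatasetEquity idea: estimate each sample's appearance
    likelihood by clustering deep perceptual embeddings, then down-weight
    common-looking samples.

    embeddings: (N, D) deep perceptual features extracted from raw pixels.
    """
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
    counts = np.bincount(labels, minlength=k).astype(np.float64)
    likelihood = counts[labels] / len(labels)   # p(sample) via its cluster
    return (1.0 - likelihood) ** gamma          # rare appearance -> weight near 1

# These per-sample weights then scale the detection loss during training.
```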

A Theory of Topological Derivatives for Inverse Rendering of Geometry

  • paper_url: http://arxiv.org/abs/2308.09865
  • repo_url: None
  • paper_authors: Ishit Mehta, Manmohan Chandraker, Ravi Ramamoorthi
  • for: A theoretical framework for differentiable surface evolution that enables variational optimization of image functionals.
  • methods: Derives topological derivatives that relate the introduction of vanishing holes and phases to changes in image intensity, enabling discrete topology changes (hole or phase nucleation), unlike the sparse silhouette gradients used by prior inverse-rendering methods.
  • results: Enables differentiable shape perturbations with topology changes, validated by optimizing closed curves in 2D and surfaces in 3D; this supports improved applications such as image vectorization, vector-graphics generation from text prompts, single-image reconstruction of shape ambigrams, and multi-view 3D reconstruction.
    Abstract We introduce a theoretical framework for differentiable surface evolution that allows discrete topology changes through the use of topological derivatives for variational optimization of image functionals. While prior methods for inverse rendering of geometry rely on silhouette gradients for topology changes, such signals are sparse. In contrast, our theory derives topological derivatives that relate the introduction of vanishing holes and phases to changes in image intensity. As a result, we enable differentiable shape perturbations in the form of hole or phase nucleation. We validate the proposed theory with optimization of closed curves in 2D and surfaces in 3D to lend insights into limitations of current methods and enable improved applications such as image vectorization, vector-graphics generation from text prompts, single-image reconstruction of shape ambigrams and multi-view 3D reconstruction.

Microscopy Image Segmentation via Point and Shape Regularized Data Synthesis

  • paper_url: http://arxiv.org/abs/2308.09835
  • repo_url: None
  • paper_authors: Shijie Li, Mengwei Ren, Thomas Ach, Guido Gerig
  • for: Proposes a microscopy image segmentation method that can be trained from simple point annotations, avoiding the costly and laborious dense annotation required by existing approaches.
  • methods: A three-stage framework: 1) point annotations are converted into pseudo dense segmentation masks constrained by shape priors; 2) an image generative model trained in an unpaired manner translates the masks into realistic microscopy images regularized by object-level consistency; 3) the pseudo masks and synthetic images form a paired dataset for training a dedicated segmentation model.
  • results: On the public MoNuSeg dataset, the synthesis pipeline produces more diverse and realistic images than baselines while maintaining high coherence between input masks and generated images; models trained on the synthetic data significantly outperform those trained with pseudo-labels or baseline-generated images, and are comparable to models trained on real microscopy images with dense labels.
    Abstract Current deep learning-based approaches for the segmentation of microscopy images heavily rely on large amounts of training data with dense annotation, which is highly costly and laborious in practice. Compared to full annotation where the complete contour of objects is depicted, point annotations, specifically object centroids, are much easier to acquire and still provide crucial information about the objects for subsequent segmentation. In this paper, we assume access to point annotations only during training and develop a unified pipeline for microscopy image segmentation using synthetically generated training data. Our framework includes three stages: (1) it takes point annotations and samples a pseudo dense segmentation mask constrained with shape priors; (2) with an image generative model trained in an unpaired manner, it translates the mask to a realistic microscopy image regularized by object level consistency; (3) the pseudo masks along with the synthetic images then constitute a pairwise dataset for training an ad-hoc segmentation model. On the public MoNuSeg dataset, our synthesis pipeline produces more diverse and realistic images than baseline models while maintaining high coherence between input masks and generated images. When using the identical segmentation backbones, the models trained on our synthetic dataset significantly outperform those trained with pseudo-labels or baseline-generated images. Moreover, our framework achieves comparable results to models trained on authentic microscopy images with dense labels, demonstrating its potential as a reliable and highly efficient alternative to labor-intensive manual pixel-wise annotations in microscopy image segmentation. The code is available.
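
Stage (1), turning centroids into pseudo dense masks under a shape prior, can be sketched with random ellipses standing in for the paper's shape prior (the ellipse prior and radius range are assumptions):

```python
import numpy as np

def points_to_pseudo_masks(centroids, shape=(256, 256),
                           radius_range=(6, 12), rng=None):
    """Sketch of stage (1): turn point annotations into a pseudo dense
    instance mask by drawing a randomly oriented ellipse at each centroid."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(shape, dtype=np.int32)
    yy, xx = np.mgrid[: shape[0], : shape[1]]
    for idx, (cy, cx) in enumerate(centroids, start=1):
        ry, rx = rng.uniform(*radius_range, size=2)   # random radii (shape prior)
        theta = rng.uniform(0, np.pi)                 # random orientation
        y, x = yy - cy, xx - cx
        yr = y * np.cos(theta) - x * np.sin(theta)
        xr = y * np.sin(theta) + x * np.cos(theta)
        mask[(yr / ry) ** 2 + (xr / rx) ** 2 <= 1.0] = idx
    return mask

pseudo = points_to_pseudo_masks([(64, 64), (150, 180)])
```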

Cross-modality Attention-based Multimodal Fusion for Non-small Cell Lung Cancer (NSCLC) Patient Survival Prediction

  • paper_url: http://arxiv.org/abs/2308.09831
  • repo_url: None
  • paper_authors: Ruining Deng, Nazim Shaikh, Gareth Shannon, Yao Nie
  • for: Predicting survival outcomes of non-small cell lung cancer (NSCLC) patients, improving computer-aided diagnosis and prognosis in clinical applications.
  • methods: A cross-modality attention-based multimodal fusion pipeline that, instead of simply concatenating or summing features from different modalities, gauges each modality's importance for feature fusion via cross-modality relationships.
  • results: The proposed fusion achieves a c-index of 0.6587, versus 0.5772 and 0.5885 for single-modality models using only tissue image data or only RNA-seq data, respectively.
    Abstract Cancer prognosis and survival outcome predictions are crucial for therapeutic response estimation and for stratifying patients into various treatment groups. Medical domains concerned with cancer prognosis are abundant with multiple modalities, including pathological image data and non-image data such as genomic information. To date, multimodal learning has shown potential to enhance clinical prediction model performance by extracting and aggregating information from different modalities of the same subject. This approach could outperform single modality learning, thus improving computer-aided diagnosis and prognosis in numerous medical applications. In this work, we propose a cross-modality attention-based multimodal fusion pipeline designed to integrate modality-specific knowledge for patient survival prediction in non-small cell lung cancer (NSCLC). Instead of merely concatenating or summing up the features from different modalities, our method gauges the importance of each modality for feature fusion with cross-modality relationship when infusing the multimodal features. Compared with single modality, which achieved c-index of 0.5772 and 0.5885 using solely tissue image data or RNA-seq data, respectively, the proposed fusion approach achieved c-index 0.6587 in our experiment, showcasing the capability of assimilating modality-specific knowledge from varied modalities.
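
One plausible reading of the cross-modality attention is a small gating network that scores each modality from the joint representation before fusing. The dimensions and the survival head below are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalityAttentionFusion(nn.Module):
    """Sketch of attention-weighted fusion of tissue-image and RNA-seq
    embeddings: instead of concatenation/summation, a gating network
    scores each modality's importance from the joint representation."""

    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))
        self.risk_head = nn.Linear(dim, 1)   # survival risk score (e.g., Cox-style)

    def forward(self, img_feat, rna_feat):
        # img_feat, rna_feat: (B, dim) modality-specific embeddings
        w = self.attn(torch.cat([img_feat, rna_feat], dim=-1))  # (B, 2)
        fused = w[:, :1] * img_feat + w[:, 1:] * rna_feat
        return self.risk_head(fused)
```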

EAVL: Explicitly Align Vision and Language for Referring Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.09779
  • repo_url: None
  • paper_authors: Yichen Yan, Xingjian He, Wenxuan Wang, Sihan Chen, Jing Liu
  • for: This paper addresses referring image segmentation: segmenting the object described by a natural language expression from an image.
  • methods: The paper proposes a new method called Explicitly Align the Vision and Language for Referring Image Segmentation (EAVL), which explicitly aligns vision and language features in the segmentation stage using a series of unfixed convolution kernels generated based on the input language expression.
  • results: The paper achieves state-of-the-art performance on three benchmark datasets (RefCOCO, RefCOCO+, and G-Ref) by effectively fusing vision and language features and exploiting their potential in the segmentation stage, while also achieving language-related localization.
    Abstract Referring image segmentation aims to segment an object mentioned in natural language from an image. A main challenge is language-related localization, which means locating the object with the relevant language. Previous approaches mainly focus on the fusion of vision and language features without fully addressing language-related localization. In previous approaches, fused vision-language features are directly fed into a decoder and pass through a convolution with a fixed kernel to obtain the result, which follows a similar pattern as traditional image segmentation. This approach does not explicitly align language and vision features in the segmentation stage, resulting in suboptimal language-related localization. Different from previous methods, we propose Explicitly Align the Vision and Language for Referring Image Segmentation (EAVL). Instead of using a fixed convolution kernel, we propose an Aligner which explicitly aligns the vision and language features in the segmentation stage. Specifically, a series of unfixed convolution kernels is generated based on the input language expression, and these kernels are then used to explicitly align the vision and language features. To achieve this, we generate multiple queries that represent different emphases of the language expression. These queries are transformed into a series of query-based convolution kernels. Then, we utilize these kernels to perform convolutions in the segmentation stage and obtain a series of segmentation masks. The final result is obtained through the aggregation of all masks. Our method can not only fuse vision and language features effectively but also exploit their potential in the segmentation stage. And most importantly, we explicitly align language features of different emphases with the image features to achieve language-related localization. Our method surpasses previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
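
The unfixed, language-conditioned kernels can be sketched as 1x1 convolutions generated from the sentence feature, one per query, with the resulting masks aggregated. Kernel size and averaging are assumptions about the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryKernelHead(nn.Module):
    """Sketch of EAVL's unfixed kernels: each language query is mapped to a
    1x1 convolution kernel applied to the visual feature map, yielding one
    mask per query; the masks are then aggregated."""

    def __init__(self, dim=256, num_queries=8):
        super().__init__()
        self.num_queries = num_queries
        self.to_kernels = nn.Linear(dim, num_queries * dim)

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, C, H, W) fused features; lang_feat: (B, C) sentence feature
        B, C, H, W = vis_feat.shape
        kernels = self.to_kernels(lang_feat).view(B, self.num_queries, C, 1, 1)
        masks = torch.stack(
            [F.conv2d(vis_feat[b: b + 1], kernels[b]) for b in range(B)]
        ).squeeze(1)                              # (B, num_queries, H, W)
        return masks.mean(dim=1, keepdim=True)    # aggregated segmentation logit
```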

Long-range Multimodal Pretraining for Movie Understanding

  • paper_url: http://arxiv.org/abs/2308.09775
  • repo_url: https://github.com/dawitmureja/LMP
  • paper_authors: Dawit Mureja Argaw, Joon-Young Lee, Markus Woodson, In So Kweon, Fabian Caba Heilbron
  • for: Proposes a multimodal pretraining strategy on movie data so that a single pretrained model performs well across the growing set of movie understanding tasks.
  • methods: Long-range Multimodal Pretraining: the model learns transferable multimodal and cross-modal encoders by observing and extracting relationships among all modalities in a movie over long time spans.
  • results: Ablations on the LVU benchmark validate the modeling choices and the importance of long-range time spans; the model achieves state-of-the-art results on several LVU tasks while being much more data-efficient than previous work, and sets a new state of the art on five additional benchmarks.
    Abstract Learning computer vision models from (and for) movies has a long-standing history. While great progress has been attained, there is still a need for a pretrained multimodal model that can perform well in the ever-growing set of movie understanding tasks the community has been establishing. In this work, we introduce Long-range Multimodal Pretraining, a strategy, and a model that leverages movie data to train transferable multimodal and cross-modal encoders. Our key idea is to learn from all modalities in a movie by observing and extracting relationships over a long-range. After pretraining, we run ablation studies on the LVU benchmark and validate our modeling choices and the importance of learning from long-range time spans. Our model achieves state-of-the-art on several LVU tasks while being much more data efficient than previous works. Finally, we evaluate our model's transferability by setting a new state-of-the-art in five different benchmarks.

Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training

  • paper_url: http://arxiv.org/abs/2308.09718
  • repo_url: https://github.com/Pointcept/Pointcept
  • paper_authors: Xiaoyang Wu, Zhuotao Tian, Xin Wen, Bohao Peng, Xihui Liu, Kaicheng Yu, Hengshuang Zhao
  • for: Improving the performance and generalizability of 3D deep learning models despite the limited availability of large-scale 3D datasets.
  • methods: Proposes the Point Prompt Training (PPT) framework for multi-dataset synergistic learning, supporting multiple pre-training paradigms and featuring Prompt-driven Normalization and Language-guided Categorical Alignment.
  • results: PPT overcomes the negative transfer associated with synergistic multi-dataset learning and produces generalizable representations; it achieves state-of-the-art performance on each dataset with a single weight-shared model, and as a pre-training framework it outperforms other approaches across more than ten diverse indoor and outdoor 3D downstream tasks.
    Abstract The rapid advancement of deep learning models often attributes to their ability to leverage massive training data. In contrast, such privilege has not yet fully benefited 3D deep learning, mainly due to the limited availability of large-scale 3D datasets. Merging multiple available data sources and letting them collaboratively train a single model is a potential solution. However, due to the large domain gap between 3D point cloud datasets, such mixed supervision could adversely affect the model's performance and lead to degenerated performance (i.e., negative transfer) compared to single-dataset training. In view of this challenge, we introduce Point Prompt Training (PPT), a novel framework for multi-dataset synergistic learning in the context of 3D representation learning that supports multiple pre-training paradigms. Based on this framework, we propose Prompt-driven Normalization, which adapts the model to different datasets with domain-specific prompts and Language-guided Categorical Alignment that decently unifies the multiple-dataset label spaces by leveraging the relationship between label text. Extensive experiments verify that PPT can overcome the negative transfer associated with synergistic learning and produce generalizable representations. Notably, it achieves state-of-the-art performance on each dataset using a single weight-shared model with supervised multi-dataset training. Moreover, when served as a pre-training framework, it outperforms other pre-training approaches regarding representation quality and attains remarkable state-of-the-art performance across over ten diverse downstream tasks spanning both indoor and outdoor 3D scenarios.
    摘要 深度学习模型的快速进步通常归功于其能够利用海量训练数据。然而,由于大规模3D数据集的稀缺,3D深度学习尚未充分享受到这一红利。将多个可用数据源合并、让它们协同训练单个模型是一种潜在的解决方案。但由于3D点云数据集之间存在较大的领域差距,这种混合监督可能会损害模型性能,导致相比单数据集训练出现性能退化(即负迁移)。针对这一挑战,我们提出了Point Prompt Training(PPT),一种面向3D表示学习的多数据集协同学习框架,支持多种预训练范式。基于该框架,我们提出了Prompt-driven Normalization,通过领域特定的提示使模型适应不同数据集;以及Language-guided Categorical Alignment,利用标签文本之间的关系有效统一多个数据集的标签空间。大量实验表明,PPT能够克服协同学习中的负迁移,生成可泛化的表示。值得注意的是,它使用单个权重共享模型进行有监督多数据集训练,即可在每个数据集上达到最新最佳性能;此外,作为预训练框架,它在表示质量上优于其他预训练方法,并在覆盖室内外3D场景的十余个下游任务上取得了显著的最新最佳表现。
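
A hedged sketch of what "prompt-driven normalization" could look like: one shared backbone, with normalization affine parameters selected by a dataset-specific prompt id. The use of LayerNorm statistics and the embedding-based parameterization are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PromptNorm(nn.Module):
    def __init__(self, dim, num_domains):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # One (gamma, beta) pair per dataset, acting as its domain "prompt".
        self.gamma = nn.Embedding(num_domains, dim)
        self.beta = nn.Embedding(num_domains, dim)
        nn.init.ones_(self.gamma.weight)
        nn.init.zeros_(self.beta.weight)

    def forward(self, x, domain_id):
        # x: (B, N, dim) point features; domain_id: (B,) dataset indices.
        g = self.gamma(domain_id).unsqueeze(1)  # (B, 1, dim)
        b = self.beta(domain_id).unsqueeze(1)
        return self.norm(x) * g + b

pn = PromptNorm(dim=64, num_domains=3)
y = pn(torch.randn(4, 128, 64), torch.tensor([0, 1, 2, 1]))
print(y.shape)  # torch.Size([4, 128, 64])
```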

Smoothness Similarity Regularization for Few-Shot GAN Adaptation

  • paper_url: http://arxiv.org/abs/2308.09717
  • repo_url: None
  • paper_authors: Vadim Sushko, Ruyu Wang, Juergen Gall
  • for: 这项研究旨在仅有极少训练图像的情况下,将预训练的GAN模型适应到新的目标领域,并且能够处理源数据集与目标数据集结构差异很大的情形。
  • methods: 提出了一种新的平滑相似性正则化,即使两个领域中物体的结构差异很大,也能将预训练GAN内在学习到的平滑性迁移到少样本目标领域。
  • results: 在源领域与目标领域结构差异较大的挑战性设置下,该方法显著优于先前的少样本GAN适应方法;在结构相似的设置下,其表现与最新方法相当。
    Abstract The task of few-shot GAN adaptation aims to adapt a pre-trained GAN model to a small dataset with very few training images. While existing methods perform well when the dataset for pre-training is structurally similar to the target dataset, the approaches suffer from training instabilities or memorization issues when the objects in the two domains have a very different structure. To mitigate this limitation, we propose a new smoothness similarity regularization that transfers the inherently learned smoothness of the pre-trained GAN to the few-shot target domain even if the two domains are very different. We evaluate our approach by adapting an unconditional and a class-conditional GAN to diverse few-shot target domains. Our proposed method significantly outperforms prior few-shot GAN adaptation methods in the challenging case of structurally dissimilar source-target domains, while performing on par with the state of the art for similar source-target domains.
    摘要 少样本GAN适应的任务是将预训练的GAN模型适应到只有极少训练图像的小数据集上。现有方法在预训练数据集与目标数据集结构相似时表现良好,但当两个领域中物体结构差异很大时,这些方法会出现训练不稳定或记忆化的问题。为了缓解这一限制,我们提出了一种新的平滑相似性正则化,即使两个领域差异很大,也能将预训练GAN内在学习到的平滑性迁移到少样本目标领域。我们通过将一个无条件GAN和一个类别条件GAN适应到多种少样本目标领域来评估该方法。在源领域与目标领域结构差异较大的挑战性情形下,我们的方法显著优于先前的少样本GAN适应方法,而在结构相似的情形下与最新方法表现相当。
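
One plausible reading of the smoothness-similarity regularizer, sketched under stated assumptions (the perturbation-based smoothness measure and the L1 matching are our guesses at the form, not the authors' exact definition): measure how each generator's output changes under a small shared latent perturbation, and penalize the adapted generator for deviating from the frozen pre-trained generator's behavior.

```python
import torch
import torch.nn as nn

def smoothness(gen, z, dz):
    # Output change under a small latent perturbation, per sample.
    return (gen(z + dz) - gen(z)).flatten(1)

def smoothness_similarity_loss(gen_adapted, gen_source, z, eps=0.05):
    dz = eps * torch.randn_like(z)   # same perturbation for both generators
    s_a = smoothness(gen_adapted, z, dz)
    with torch.no_grad():            # the pre-trained source generator is frozen
        s_s = smoothness(gen_source, z, dz)
    return (s_a - s_s).abs().mean()

# Toy "generators" standing in for the actual GANs, for a quick shape check.
g_src, g_tgt = nn.Linear(64, 128), nn.Linear(64, 128)
print(smoothness_similarity_loss(g_tgt, g_src, torch.randn(4, 64)))
```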

Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis

  • paper_url: http://arxiv.org/abs/2308.09713
  • repo_url: None
  • paper_authors: Jonathon Luiten, Georgios Kopanas, Bastian Leibe, Deva Ramanan
  • for: 该论文旨在同时解决动态场景新视图合成与所有稠密场景元素的六自由度(6-DOF)跟踪问题。
  • methods: 作者采用分析合成(analysis-by-synthesis)框架,受近期将场景建模为一组3D高斯、并通过可微渲染优化以重建输入图像的工作启发。
  • results: 作者实现了一种可同时完成动态场景新视图合成与6-DOF跟踪的方法,无需任何对应关系或光流作为输入;该方法能够自动捕捉并跟踪场景中所有稠密元素,包括空间的旋转。
    Abstract We present a method that simultaneously addresses the tasks of dynamic scene novel-view synthesis and six degree-of-freedom (6-DOF) tracking of all dense scene elements. We follow an analysis-by-synthesis framework, inspired by recent work that models scenes as a collection of 3D Gaussians which are optimized to reconstruct input images via differentiable rendering. To model dynamic scenes, we allow Gaussians to move and rotate over time while enforcing that they have persistent color, opacity, and size. By regularizing Gaussians' motion and rotation with local-rigidity constraints, we show that our Dynamic 3D Gaussians correctly model the same area of physical space over time, including the rotation of that space. Dense 6-DOF tracking and dynamic reconstruction emerges naturally from persistent dynamic view synthesis, without requiring any correspondence or flow as input. We demonstrate a large number of downstream applications enabled by our representation, including first-person view synthesis, dynamic compositional scene synthesis, and 4D video editing.
    摘要 我们提出了一种同时解决动态场景新视图合成与所有稠密场景元素6-DOF跟踪问题的方法。我们采用分析合成框架,受近期将场景建模为一组3D高斯并通过可微渲染重建输入图像的工作启发。为了建模动态场景,我们允许高斯随时间移动和旋转,同时约束其颜色、不透明度和大小保持不变。通过局部刚性约束对高斯的运动和旋转进行正则化,我们证明了动态3D高斯能够正确建模物理空间中同一区域随时间的变化,包括该空间的旋转。稠密6-DOF跟踪和动态重建自然地从持续的动态视图合成中涌现,无需任何对应关系或光流作为输入。我们展示了该表示所支持的大量下游应用,包括第一人称视图合成、动态组合场景合成和4D视频编辑。
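
The local-rigidity idea admits a short sketch. The snippet below is an assumption-laden illustration (it ignores the paper's accompanying rotation term): it penalizes changes in each Gaussian's distances to its k nearest neighbours between consecutive frames, so that nearby Gaussians move as an approximately rigid body.

```python
import torch

def local_rigidity_loss(centers_t0, centers_t1, k=8):
    # centers_*: (N, 3) Gaussian centres at two consecutive timesteps.
    d = torch.cdist(centers_t0, centers_t0)              # (N, N) pairwise dist
    knn = d.topk(k + 1, largest=False).indices[:, 1:]    # k neighbours, skip self
    d0 = (centers_t0[:, None] - centers_t0[knn]).norm(dim=-1)  # (N, k) at t0
    d1 = (centers_t1[:, None] - centers_t1[knn]).norm(dim=-1)  # (N, k) at t1
    # Rigid local motion keeps neighbour distances constant over time.
    return (d0 - d1).abs().mean()

x0 = torch.randn(100, 3)
x1 = x0 + 0.01 * torch.randn(100, 3)
print(local_rigidity_loss(x0, x1))
```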

HumanLiff: Layer-wise 3D Human Generation with Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.09712
  • repo_url: None
  • paper_authors: Shoukang Hu, Fangzhou Hong, Tao Hu, Liang Pan, Haiyi Mei, Weiye Xiao, Lei Yang, Ziwei Liu
  • for: 本研究旨在提出一种分层的3D人体生成模型HumanLiff,该模型能够以高保真度和强可控性逐层生成3D人体。
  • methods: HumanLiff采用基于扩散模型的3D条件生成方法:先在规范空间中生成仅着最少衣物的人体,再逐层生成衣物。为提升3D生成的保真度,模型还提出了tri-plane shift操作和分层特征融合方法。
  • results: 在SynBody和TightCap两个分层3D人体数据集上,HumanLiff在分层3D人体生成任务中显著优于现有最佳方法,能够生成更精细、更可控的3D人体。
    Abstract 3D human generation from 2D images has achieved remarkable progress through the synergistic utilization of neural rendering and generative models. Existing 3D human generative models mainly generate a clothed 3D human as an undetectable 3D model in a single pass, while rarely considering the layer-wise nature of a clothed human body, which often consists of the human body and various clothes such as underwear, outerwear, trousers, shoes, etc. In this work, we propose HumanLiff, the first layer-wise 3D human generative model with a unified diffusion process. Specifically, HumanLiff firstly generates minimal-clothed humans, represented by tri-plane features, in a canonical space, and then progressively generates clothes in a layer-wise manner. In this way, the 3D human generation is thus formulated as a sequence of diffusion-based 3D conditional generation. To reconstruct more fine-grained 3D humans with tri-plane representation, we propose a tri-plane shift operation that splits each tri-plane into three sub-planes and shifts these sub-planes to enable feature grid subdivision. To further enhance the controllability of 3D generation with 3D layered conditions, HumanLiff hierarchically fuses tri-plane features and 3D layered conditions to facilitate the 3D diffusion model learning. Extensive experiments on two layer-wise 3D human datasets, SynBody (synthetic) and TightCap (real-world), validate that HumanLiff significantly outperforms state-of-the-art methods in layer-wise 3D human generation. Our code will be available at https://skhu101.github.io/HumanLiff.
    摘要 借助神经渲染与生成模型的协同作用,从2D图像生成3D人体已取得显著进展。现有3D人体生成模型大多一次性生成一个穿好衣物的整体3D人体,很少考虑人体的分层结构:人体通常由躯体以及内衣、外衣、裤子、鞋等多层衣物组成。在这项工作中,我们提出了HumanLiff,第一个采用统一扩散过程的分层3D人体生成模型。具体而言,HumanLiff首先在规范空间中生成以tri-plane特征表示的最少着装人体,然后逐层生成衣物,从而将3D人体生成表述为一系列基于扩散的3D条件生成。为了以tri-plane表示重建更精细的3D人体,我们提出了tri-plane shift操作,将每个平面分为三个子平面并对其进行平移,以实现特征网格细分。为进一步提升3D分层条件下生成的可控性,HumanLiff分层融合tri-plane特征与3D分层条件,以促进3D扩散模型的学习。在SynBody(合成)和TightCap(真实)两个分层3D人体数据集上的大量实验表明,HumanLiff在分层3D人体生成上显著优于现有最佳方法。代码将在 https://skhu101.github.io/HumanLiff 公开。

Robust Monocular Depth Estimation under Challenging Conditions

  • paper_url: http://arxiv.org/abs/2308.09711
  • repo_url: https://github.com/md4all/md4all
  • paper_authors: Stefano Gasperini, Nils Morbitzer, HyunJun Jung, Nassir Navab, Federico Tombari
  • for: 提高单目深度估计的可靠性,特别是在不良环境和天气条件下。
  • methods: 利用现有方法在理想条件下的有效性:先为常规训练样本生成对应的复杂(恶劣条件)样本,再将生成样本输入模型以引导其自监督或全监督训练,并在对应的原始图像上计算标准损失,使单一模型无需在推理时修改即可在多种条件下恢复信息。
  • results: 在nuScenes和Oxford RobotCar两个具有挑战性的公开数据集上,该方法在标准与恶劣条件下均大幅超越先前工作。
    Abstract While state-of-the-art monocular depth estimation approaches achieve impressive results in ideal settings, they are highly unreliable under challenging illumination and weather conditions, such as at nighttime or in the presence of rain. In this paper, we uncover these safety-critical issues and tackle them with md4all: a simple and effective solution that works reliably under both adverse and ideal conditions, as well as for different types of learning supervision. We achieve this by exploiting the efficacy of existing methods under perfect settings. Therefore, we provide valid training signals independently of what is in the input. First, we generate a set of complex samples corresponding to the normal training ones. Then, we train the model by guiding its self- or full-supervision by feeding the generated samples and computing the standard losses on the corresponding original images. Doing so enables a single model to recover information across diverse conditions without modifications at inference time. Extensive experiments on two challenging public datasets, namely nuScenes and Oxford RobotCar, demonstrate the effectiveness of our techniques, outperforming prior works by a large margin in both standard and challenging conditions. Source code and data are available at: https://md4all.github.io.
    摘要 当前最先进的单目深度估计方法在理想条件下可以取得令人印象深刻的结果,但在具有挑战性的光照和天气条件下(如夜晚或雨天),它们的可靠性非常低。在本文中,我们揭示了这些安全关键问题,并用md4all加以解决:这是一种简单而有效的方案,能在恶劣与理想条件下均可靠运行,并适用于不同类型的学习监督。我们通过利用现有方法在理想条件下的有效性来实现这一点,从而无论输入内容如何都能提供有效的训练信号。首先,我们生成一组与常规训练样本相对应的复杂样本;然后,将生成样本输入模型以引导其自监督或全监督训练,并在对应的原始图像上计算标准损失。这样,单个模型无需在推理时修改即可在多种条件下恢复信息。我们在nuScenes和Oxford RobotCar两个具有挑战性的公开数据集上进行了大量实验,证明了我们技术的有效性:无论在标准还是恶劣条件下,都大幅超越先前工作。源代码和数据可在 https://md4all.github.io 获取。
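
The core training trick, that the network sees a generated adverse sample while the loss is still computed against the original easy-condition supervision, can be sketched as below. The toy model, the crude darkening "translation", and the L1 depth loss are placeholders for illustration, not md4all's actual components.

```python
import torch
import torch.nn as nn

def md4all_step(model, image, adverse_image, target, loss_fn, p_adverse=0.5):
    # Randomly feed the adverse twin (e.g. a day-to-night translation of the
    # same frame), but always compute the loss against the original,
    # ideal-condition supervision signal.
    inp = adverse_image if torch.rand(()) < p_adverse else image
    return loss_fn(model(inp), target)

# Toy stand-ins: a "depth network" and an L1 depth loss.
model = nn.Conv2d(3, 1, 3, padding=1)
img = torch.rand(2, 3, 64, 64)
night = (0.3 * img).clamp(0, 1)            # crude "adverse" twin for the demo
gt_depth = torch.rand(2, 1, 64, 64)
print(md4all_step(model, img, night, gt_depth, nn.functional.l1_loss))
```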

Training with Product Digital Twins for AutoRetail Checkout

  • paper_url: http://arxiv.org/abs/2308.09708
  • repo_url: https://github.com/yorkeyao/automated-retail-checkout
  • paper_authors: Yue Yao, Xinyu Tian, Zheng Tang, Sujit Biswas, Huan Lei, Tom Gedeon, Liang Zheng
  • for: 这 paper 的目的是为了自动化商业 checkout 过程,提高用户体验和效率。
  • methods: 该 paper 使用了产品 3D 模型,通过图形引擎渲染来生成快速、灵活、大规模的训练数据。它还提出了一种训练数据优化框架,通过使用产品 3D 模型来生成“数字双胞胎”,以增强训练数据的可靠性和效果。
  • results: 该 paper 的实验表明,使用“数字双胞胎”训练集可以提高产品检测和跟踪模型的准确率,并且可以与 Pseudo-labeled 的实际检查数据组合使用,进一步提高模型的性能。
    Abstract Automating the checkout process is important in smart retail, where users effortlessly pass products by hand through a camera, triggering automatic product detection, tracking, and counting. In this emerging area, due to the lack of annotated training data, we introduce a dataset comprised of product 3D models, which allows for fast, flexible, and large-scale training data generation through graphic engine rendering. Within this context, we discern an intriguing facet, because of the user "hands-on" approach, bias in user behavior leads to distinct patterns in the real checkout process. The existence of such patterns would compromise training effectiveness if training data fail to reflect the same. To address this user bias problem, we propose a training data optimization framework, i.e., training with digital twins (DtTrain). Specifically, we leverage the product 3D models and optimize their rendering viewpoint and illumination to generate "digital twins" that visually resemble representative user images. These digital twins, inherit product labels and, when augmented, form the Digital Twin training set (DT set). Because the digital twins individually mimic user bias, the resulting DT training set better reflects the characteristics of the target scenario and allows us to train more effective product detection and tracking models. In our experiment, we show that DT set outperforms training sets created by existing dataset synthesis methods in terms of counting accuracy. Moreover, by combining DT set with pseudo-labeled real checkout data, further improvement is observed. The code is available at https://github.com/yorkeyao/Automated-Retail-Checkout.
    摘要 自动化结算过程是智能零售的重要环节:用户只需将商品从摄像头前划过,即可触发自动的商品检测、跟踪与计数。在这一新兴领域中,由于缺乏标注训练数据,我们提出了一个由商品3D模型组成的数据集,可通过图形引擎渲染快速、灵活、大规模地生成训练数据。在此背景下,我们发现了一个有趣的现象:由于用户"手动"操作的方式,用户行为的偏差会在真实结算过程中形成独特的模式。如果训练数据无法反映这些模式,将会损害训练效果。为解决这一用户偏差问题,我们提出了一个训练数据优化框架,即数字孪生训练(DtTrain)。具体而言,我们利用商品3D模型,优化其渲染视角和光照,生成在视觉上接近代表性用户图像的"数字孪生"。这些数字孪生继承商品标签,经过增强后构成数字孪生训练集(DT集)。由于每个数字孪生都模拟了用户偏差,DT训练集能更好地反映目标场景的特点,从而训练出更有效的商品检测和跟踪模型。实验表明,DT集在计数准确率上优于现有数据集合成方法生成的训练集;将DT集与伪标注的真实结算数据结合,还能进一步提升性能。代码可在 https://github.com/yorkeyao/Automated-Retail-Checkout 获取。

Guide3D: Create 3D Avatars from Text and Image Guidance

  • paper_url: http://arxiv.org/abs/2308.09705
  • repo_url: https://github.com/yukangcao/Guide3D
  • paper_authors: Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, Kwan-Yee K. Wong
  • for: 本研究旨在开发一种高效的文本和图像引导的三维生成模型,用于生成高分辨率的纹理网格。
  • methods: 我们提出了一种基于扩散模型的零样本文本与图像引导生成模型,包括生成文本一致角色的稀疏视角图像,以及联合优化多分辨率可微行进四面体(marching tetrahedra)网格与像素对齐的图像特征。我们还提出了一种相似性感知的特征融合策略,以有效整合不同视角的特征。
  • results: 我们的框架在生成三维 geometry和高分辨率纹理上达到了现状之最,并且可以直接将二维生成的图像传递到三维空间中。我们的代码将会公开发布。
    Abstract Recently, text-to-image generation has exhibited remarkable advancements, with the ability to produce visually impressive results. In contrast, text-to-3D generation has not yet reached a comparable level of quality. Existing methods primarily rely on text-guided score distillation sampling (SDS), and they encounter difficulties in transferring 2D attributes of the generated images to 3D content. In this work, we aim to develop an effective 3D generative model capable of synthesizing high-resolution textured meshes by leveraging both textual and image information. To this end, we introduce Guide3D, a zero-shot text-and-image-guided generative model for 3D avatar generation based on diffusion models. Our model involves (1) generating sparse-view images of a text-consistent character using diffusion models, and (2) jointly optimizing multi-resolution differentiable marching tetrahedral grids with pixel-aligned image features. We further propose a similarity-aware feature fusion strategy for efficiently integrating features from different views. Moreover, we introduce two novel training objectives as an alternative to calculating SDS, significantly enhancing the optimization process. We thoroughly evaluate the performance and components of our framework, which outperforms the current state-of-the-art in producing topologically and structurally correct geometry and high-resolution textures. Guide3D enables the direct transfer of 2D-generated images to the 3D space. Our code will be made publicly available.
    摘要 最近,文本到图像生成展现了显著进步,能够生成视觉效果出色的结果;相比之下,文本到3D生成尚未达到相当的质量水平。现有方法主要依赖文本引导的分数蒸馏采样(SDS),难以将生成图像的2D属性迁移到3D内容上。在这项工作中,我们的目标是开发一种有效的3D生成模型,能够同时利用文本与图像信息合成高分辨率的带纹理网格。为此,我们提出了Guide3D,一种基于扩散模型的零样本文本与图像引导3D化身生成模型。我们的模型包括:(1)利用扩散模型生成文本一致角色的稀疏视角图像;(2)联合优化多分辨率可微行进四面体网格与像素对齐的图像特征。我们还提出了一种相似性感知的特征融合策略,以高效整合不同视角的特征。此外,我们引入两种新的训练目标替代SDS的计算,显著改善了优化过程。我们对框架的性能和各组件进行了全面评估,其在生成拓扑与结构正确的几何以及高分辨率纹理方面超越了当前最佳方法。Guide3D支持将2D生成的图像直接迁移到3D空间。我们的代码将公开发布。

Invariant Training 2D-3D Joint Hard Samples for Few-Shot Point Cloud Recognition

  • paper_url: http://arxiv.org/abs/2308.09694
  • repo_url: None
  • paper_authors: Xuanyu Yi, Jiajun Deng, Qianru Sun, Xian-Sheng Hua, Joo-Hwee Lim, Hanwang Zhang
  • for: 解决少样本点云识别任务中的数据稀缺问题,采用联合预测的方式,将一个常规3D模型与一个训练良好的2D模型结合。
  • methods: 提出一种新的不变性训练策略InvJoint,不仅更加强调对"联合困难样本"的训练,还在2D与3D模型相互冲突的模糊预测之间寻求不变性。
  • results: 对ModelNet10/40、ScanObjectNN和Toys4K等三个领域进行了广泛的实验,证明了InvJoint可以学习更好的2D和3D表示,从而提高 ensemble 的性能。
    Abstract We tackle the data scarcity challenge in few-shot point cloud recognition of 3D objects by using a joint prediction from a conventional 3D model and a well-trained 2D model. Surprisingly, such an ensemble, though seems trivial, has hardly been shown effective in recent 2D-3D models. We find out the crux is the less effective training for the ''joint hard samples'', which have high confidence prediction on different wrong labels, implying that the 2D and 3D models do not collaborate well. To this end, our proposed invariant training strategy, called InvJoint, does not only emphasize the training more on the hard samples, but also seeks the invariance between the conflicting 2D and 3D ambiguous predictions. InvJoint can learn more collaborative 2D and 3D representations for better ensemble. Extensive experiments on 3D shape classification with widely adopted ModelNet10/40, ScanObjectNN and Toys4K, and shape retrieval with ShapeNet-Core validate the superiority of our InvJoint.
    摘要 我们通过将一个常规3D模型与一个训练良好的2D模型进行联合预测,来解决少样本点云识别中的数据稀缺挑战。令人意外的是,这种看似简单的集成在近期的2D-3D模型中鲜少被证明有效。我们发现问题的关键在于对"联合困难样本"的训练不够有效:这些样本在不同的错误标签上都有高置信度预测,表明2D与3D模型协作不佳。为此,我们提出的不变性训练策略InvJoint不仅更加强调对困难样本的训练,还在相互冲突的2D与3D模糊预测之间寻求不变性,从而学习到更具协作性的2D和3D表示,获得更好的集成效果。在广泛采用的ModelNet10/40、ScanObjectNN和Toys4K上的3D形状分类实验,以及在ShapeNet-Core上的形状检索实验,都验证了InvJoint的优越性。
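
A hedged sketch of the joint-hard-sample idea: flag samples where the 2D and 3D branches are both confident yet wrong on different labels, up-weight them, and add an invariance (consistency) term between the two predictive distributions. The confidence threshold and the symmetric-KL form are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def invjoint_loss(logits2d, logits3d, labels, conf_thr=0.7, lam=1.0):
    p2d, p3d = logits2d.softmax(-1), logits3d.softmax(-1)
    conf2d, pred2d = p2d.max(-1)
    conf3d, pred3d = p3d.max(-1)
    # Joint hard samples: both branches confident, both wrong, and disagreeing.
    hard = (conf2d > conf_thr) & (conf3d > conf_thr) & \
           (pred2d != labels) & (pred3d != labels) & (pred2d != pred3d)
    weight = 1.0 + hard.float()  # emphasize the joint hard samples
    ce = weight * (F.cross_entropy(logits2d, labels, reduction="none")
                   + F.cross_entropy(logits3d, labels, reduction="none"))
    # Invariance between the two predictive distributions (symmetric KL).
    kl = F.kl_div(p2d.log(), p3d, reduction="none").sum(-1) \
       + F.kl_div(p3d.log(), p2d, reduction="none").sum(-1)
    return (ce + lam * weight * kl).mean()

print(invjoint_loss(torch.randn(16, 10), torch.randn(16, 10),
                    torch.randint(0, 10, (16,))))
```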

A Lightweight Transformer for Faster and Robust EBSD Data Collection

  • paper_url: http://arxiv.org/abs/2308.09693
  • repo_url: https://github.com/hdong920/ebsd_slice_recovery
  • paper_authors: Harry Dong, Sean Donegan, Megna Shah, Yuejie Chi
  • for: 提高3D EBSD数据质量和收集效率
  • methods: 使用变换器模型和投影算法进行数据处理和恢复
  • results: 仅使用合成3D EBSD数据以自监督方式训练,在真实3D EBSD数据上取得了比现有方法更高的恢复精度。
    Abstract Three dimensional electron back-scattered diffraction (EBSD) microscopy is a critical tool in many applications in materials science, yet its data quality can fluctuate greatly during the arduous collection process, particularly via serial-sectioning. Fortunately, 3D EBSD data is inherently sequential, opening up the opportunity to use transformers, state-of-the-art deep learning architectures that have made breakthroughs in a plethora of domains, for data processing and recovery. To be more robust to errors and accelerate this 3D EBSD data collection, we introduce a two step method that recovers missing slices in an 3D EBSD volume, using an efficient transformer model and a projection algorithm to process the transformer's outputs. Overcoming the computational and practical hurdles of deep learning with scarce high dimensional data, we train this model using only synthetic 3D EBSD data with self-supervision and obtain superior recovery accuracy on real 3D EBSD data, compared to existing methods.
    摘要 三维电子背散射衍射(EBSD)显微技术是材料科学诸多应用中的关键工具,但在费时费力的采集过程中(尤其是连续切片时),其数据质量可能出现大幅波动。幸运的是,3D EBSD数据天然具有序列性,这为使用transformer(在众多领域取得突破的最先进深度学习架构)进行数据处理和恢复提供了机会。为了更稳健地应对误差并加速3D EBSD数据采集,我们提出了一种两步方法:先用高效的transformer模型恢复3D EBSD体数据中缺失的切片,再用投影算法处理transformer的输出。为克服高维数据稀缺情况下应用深度学习的计算与实践障碍,我们仅使用合成3D EBSD数据以自监督方式训练该模型,并在真实3D EBSD数据上取得了优于现有方法的恢复精度。

Audiovisual Moments in Time: A Large-Scale Annotated Dataset of Audiovisual Actions

  • paper_url: http://arxiv.org/abs/2308.09685
  • repo_url: https://github.com/mjoannou/audiovisual-moments-in-time
  • paper_authors: Michael Joannou, Pia Rotshtein, Uta Noppeney
  • for: 这个论文主要是为了提供一个大规模的audiovisual动作事件数据集(AVMIT),以便用于计算机模型和人类参与者之间的研究。
  • methods: 这篇论文使用了一个大规模的注释任务,采集了3秒的audiovisual视频,并由11名参与者进行了分类。每个试验都需要参与者确定 audiovisual动作事件是否存在,以及这个事件是否是视频中最显著的特征。
  • results: 论文表明,使用AVMIT标注数据集可以提升视听事件识别性能,尤其是在视听对应性至关重要的研究问题上。6个循环神经网络(RNN)仅在视听事件上训练时,测试集top-1准确率高于在模态无关事件上训练。
    Abstract We present Audiovisual Moments in Time (AVMIT), a large-scale dataset of audiovisual action events. In an extensive annotation task 11 participants labelled a subset of 3-second audiovisual videos from the Moments in Time dataset (MIT). For each trial, participants assessed whether the labelled audiovisual action event was present and whether it was the most prominent feature of the video. The dataset includes the annotation of 57,177 audiovisual videos, each independently evaluated by 3 of 11 trained participants. From this initial collection, we created a curated test set of 16 distinct action classes, with 60 videos each (960 videos). We also offer 2 sets of pre-computed audiovisual feature embeddings, using VGGish/YamNet for audio data and VGG16/EfficientNetB0 for visual data, thereby lowering the barrier to entry for audiovisual DNN research. We explored the advantages of AVMIT annotations and feature embeddings to improve performance on audiovisual event recognition. A series of 6 Recurrent Neural Networks (RNNs) were trained on either AVMIT-filtered audiovisual events or modality-agnostic events from MIT, and then tested on our audiovisual test set. In all RNNs, top 1 accuracy was increased by 2.71-5.94\% by training exclusively on audiovisual events, even outweighing a three-fold increase in training data. We anticipate that the newly annotated AVMIT dataset will serve as a valuable resource for research and comparative experiments involving computational models and human participants, specifically when addressing research questions where audiovisual correspondence is of critical importance.
    摘要 我们提出Audiovisual Moments in Time(AVMIT),一个大规模的视听动作事件数据集。在一项大规模标注任务中,11名受训参与者对Moments in Time(MIT)数据集中3秒视听视频的一个子集进行了标注:在每次试验中,参与者判断被标注的视听动作事件是否存在,以及它是否为视频中最显著的特征。该数据集共标注了57,177个视听视频,每个视频由11名参与者中的3名独立评估。在此基础上,我们构建了一个精选测试集,包含16个动作类别,每类60个视频(共960个视频)。我们还提供了两组预计算的视听特征嵌入(音频使用VGGish/YamNet,视觉使用VGG16/EfficientNetB0),以降低视听DNN研究的门槛。我们探讨了利用AVMIT标注和特征嵌入提升视听事件识别性能的优势:6个循环神经网络(RNN)分别在AVMIT筛选的视听事件或MIT的模态无关事件上训练,然后在我们的视听测试集上测试。在所有RNN中,仅使用视听事件训练使top-1准确率提升了2.71%-5.94%,其收益甚至超过了将训练数据扩大三倍的效果。我们预计,新标注的AVMIT数据集将为涉及计算模型与人类被试的研究和对比实验提供宝贵资源,尤其是在视听对应性至关重要的研究问题上。

cs.AI - 2023-08-19

Efficient Representation Learning for Healthcare with Cross-Architectural Self-Supervision

  • paper_url: http://arxiv.org/abs/2308.10064
  • repo_url: https://github.com/pranavsinghps1/CASS
  • paper_authors: Pranav Singh, Jacopo Cirrone
  • for: 医疗和生物医学应用中的极端计算需求,使得表示学习难以在实际医疗中应用。表示学习可以提升深度学习架构的性能,但现有的自监督学习方法在使用较小的批量大小或较短的预训练轮数(这在临床环境中更为实际)时性能会下降。我们提出跨架构自监督学习(CASS)方法来应对这一挑战。
  • methods: 我们提出了一种新的孪生(siamese)自监督学习方法CASS,它协同利用Transformer与卷积神经网络(CNN)实现高效学习。
  • results: 我们的实验表明,CASS训练的CNN和转换器在四个不同的医疗数据集上都能够超越现有的自监督学习方法。它只使用1%的标签数据进行微调,可以获得3.8%的平均提升,即使使用10%的标签数据,也可以获得5.9%的提升。在100%的标签数据下,CASS可以达到10.13%的显著提升。此外,CASS还可以降低预训练时间,比现有方法减少69%,使其更适合医疗实践。
    Abstract In healthcare and biomedical applications, extreme computational requirements pose a significant barrier to adopting representation learning. Representation learning can enhance the performance of deep learning architectures by learning useful priors from limited medical data. However, state-of-the-art self-supervised techniques suffer from reduced performance when using smaller batch sizes or shorter pretraining epochs, which are more practical in clinical settings. We present Cross Architectural - Self Supervision (CASS) in response to this challenge. This novel siamese self-supervised learning approach synergistically leverages Transformer and Convolutional Neural Networks (CNN) for efficient learning. Our empirical evaluation demonstrates that CASS-trained CNNs and Transformers outperform existing self-supervised learning methods across four diverse healthcare datasets. With only 1% labeled data for finetuning, CASS achieves a 3.8% average improvement; with 10% labeled data, it gains 5.9%; and with 100% labeled data, it reaches a remarkable 10.13% enhancement. Notably, CASS reduces pretraining time by 69% compared to state-of-the-art methods, making it more amenable to clinical implementation. We also demonstrate that CASS is considerably more robust to variations in batch size and pretraining epochs, making it a suitable candidate for machine learning in healthcare applications.
    摘要 在医疗和生物医学应用中,极高的计算需求成为采用表示学习的障碍。表示学习可以通过从有限的医疗数据中学习有用的先验知识来提升深度学习架构的性能。然而,现有最先进的自监督技术在使用较小的批量大小或较短的预训练轮数时性能会下降,而这些设置在临床环境中更为实际。为应对这一挑战,我们提出了跨架构自监督(CASS)。这种新颖的孪生(siamese)自监督学习方法协同利用Transformer与卷积神经网络(CNN)实现高效学习。实验表明,经CASS训练的CNN和Transformer在四个不同的医疗数据集上均优于现有自监督学习方法:仅用1%标注数据微调时,CASS平均提升3.8%;用10%标注数据时提升5.9%;用100%标注数据时达到10.13%的显著提升。值得注意的是,CASS将预训练时间比现有最佳方法缩短了69%,更适合临床部署。我们还证明CASS对批量大小和预训练轮数的变化更为稳健,使其成为医疗机器学习应用的合适候选方案。
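
Cross-architectural self-supervision reduces, at its core, to making two different backbones agree on embeddings of the same image. The sketch below uses toy stand-ins for the CNN and Transformer branches and a symmetric stop-gradient cosine objective; all of these choices are assumptions for illustration, not the CASS recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: a tiny CNN branch and a flatten-linear "Transformer" branch.
cnn = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64))
vit = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))

def cass_loss(x):
    z_c, z_t = cnn(x), vit(x)
    # Each branch is pulled toward the other's detached embedding, so the two
    # architectures teach each other without any labels.
    return -(F.cosine_similarity(z_c, z_t.detach()).mean()
             + F.cosine_similarity(z_t, z_c.detach()).mean()) / 2

print(cass_loss(torch.randn(8, 3, 32, 32)))
```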

Robust Fraud Detection via Supervised Contrastive Learning

  • paper_url: http://arxiv.org/abs/2308.10055
  • repo_url: None
  • paper_authors: Vinay M. S., Shuhan Yuan, Xintao Wu
  • for: 针对仅有少量且多样性有限的恶意会话样本的开放集欺诈检测问题
  • methods: 基于有效数据增强策略和监督对比学习的鲁棒框架ConRo
  • results: 与其他最新基线相比,ConRo框架表现出明显的性能提升
    Abstract Deep learning models have recently become popular for detecting malicious user activity sessions in computing platforms. In many real-world scenarios, only a few labeled malicious and a large amount of normal sessions are available. These few labeled malicious sessions usually do not cover the entire diversity of all possible malicious sessions. In many scenarios, possible malicious sessions can be highly diverse. As a consequence, learned session representations of deep learning models can become ineffective in achieving a good generalization performance for unseen malicious sessions. To tackle this open-set fraud detection challenge, we propose a robust supervised contrastive learning based framework called ConRo, which specifically operates in the scenario where only a few malicious sessions having limited diversity is available. ConRo applies an effective data augmentation strategy to generate diverse potential malicious sessions. By employing these generated and available training set sessions, ConRo derives separable representations w.r.t open-set fraud detection task by leveraging supervised contrastive learning. We empirically evaluate our ConRo framework and other state-of-the-art baselines on benchmark datasets. Our ConRo framework demonstrates noticeable performance improvement over state-of-the-art baselines.
    摘要 深度学习模型近来被广泛用于检测计算平台中的恶意用户活动会话。在许多现实场景中,通常只有少量被标注的恶意会话和大量正常会话可用,而这些少量恶意会话往往无法覆盖所有可能恶意会话的多样性;在许多场景下,可能的恶意会话高度多样。因此,深度学习模型学到的会话表示可能失效,难以对未见过的恶意会话实现良好的泛化。为解决这一开放集欺诈检测挑战,我们提出了一种基于监督对比学习的鲁棒框架ConRo,专门针对仅有少量、多样性有限的恶意会话的场景。ConRo采用有效的数据增强策略生成多样的潜在恶意会话,并利用这些生成会话与已有训练集会话,通过监督对比学习得到面向开放集欺诈检测任务的可分离表示。我们在基准数据集上对ConRo框架和其他最新基线进行了实证评估,ConRo框架表现出明显的性能提升。
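
The supervised contrastive objective that ConRo builds on is standard; a compact version is below (Khosla et al.-style SupCon). In ConRo's setting, the generated diverse malicious sessions would simply contribute extra rows to `feats`/`labels` as additional positives for the fraud class.

```python
import torch
import torch.nn.functional as F

def supcon_loss(feats, labels, tau=0.1):
    # feats: (N, D) session embeddings; labels: (N,) class ids.
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t() / tau
    self_mask = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    sim = sim.masked_fill(self_mask, float("-inf"))   # exclude self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)   # avoid -inf * 0 = nan
    pos = ((labels[:, None] == labels[None, :]) & ~self_mask).float()
    # Mean log-likelihood of each anchor's positives (clamp guards singletons).
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

print(supcon_loss(torch.randn(16, 32), torch.randint(0, 2, (16,))))
```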

Large Language Models as Zero-Shot Conversational Recommenders

  • paper_url: http://arxiv.org/abs/2308.10053
  • repo_url: https://github.com/aaronheee/llms-as-zero-shot-conversational-recsys
  • paper_authors: Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bodhisattwa Prasad Majumder, Nathan Kallus, Julian McAuley
  • for: 这个论文的主要目的是研究使用大语言模型进行对话式推荐任务,以及对现有模型的 zeroshot 设定下的性能分析。
  • methods: 这个论文使用了代表性的大语言模型,并在一个 zeroshot Setting下进行了实验研究。
  • results: 研究发现,无需细化训练,大语言模型可以在对话式推荐任务中超越现有的细化训练模型。此外,研究还提出了多种探索任务,以探究大语言模型在对话式推荐中的表现机制和限制。
    Abstract In this paper, we present empirical studies on conversational recommendation tasks using representative large language models in a zero-shot setting with three primary contributions. (1) Data: To gain insights into model behavior in "in-the-wild" conversational recommendation scenarios, we construct a new dataset of recommendation-related conversations by scraping a popular discussion website. This is the largest public real-world conversational recommendation dataset to date. (2) Evaluation: On the new dataset and two existing conversational recommendation datasets, we observe that even without fine-tuning, large language models can outperform existing fine-tuned conversational recommendation models. (3) Analysis: We propose various probing tasks to investigate the mechanisms behind the remarkable performance of large language models in conversational recommendation. We analyze both the large language models' behaviors and the characteristics of the datasets, providing a holistic understanding of the models' effectiveness, limitations and suggesting directions for the design of future conversational recommenders
    摘要 在本文中,我们提出了基于大语言模型的实验研究,以探讨在无需调教的情况下,大语言模型在实际对话推荐任务中的表现。我们的研究具有以下三个主要贡献:1. 数据:为了了解大语言模型在实际对话推荐场景中的行为,我们从流行的讨论网站中抓取了一个新的推荐相关对话集。这是目前最大的公共实际对话推荐数据集。2. 评估:在我们新建的数据集和两个现有的对话推荐数据集上,我们发现了一点:无需调教,大语言模型可以在对话推荐任务中超越现有的调教过的对话推荐模型。3. 分析:我们提出了多种探索任务,以Investigate大语言模型在对话推荐任务中的机制。我们分析了大语言模型的行为以及数据集的特点,从而提供了对未来对话推荐模型的设计方向的彻底理解。

The Snowflake Hypothesis: Training Deep GNN with One Node One Receptive field

  • paper_url: http://arxiv.org/abs/2308.10051
  • repo_url: None
  • paper_authors: Kun Wang, Guohao Li, Shilong Wang, Guibin Zhang, Kai Wang, Yang You, Xiaojiang Peng, Yuxuan Liang, Yang Wang
  • for: 本研究主要探讨深度图神经网络(GNN)在图领域中的应用,尤其是GNN随着深度增加(如同计算机视觉领域的模型那样)出现的过拟合与过平滑问题。
  • methods: 本研究使用了系统的研究方法,包括不同的训练方案、不同的 shallow 和 deep GNN 基础体系、不同的层数(8, 16, 32, 64)以及多个 benchmark 图。
  • results: 研究结果表明,我们的假设(Snowflake Hypothesis)可以作为一种通用的操作符,可以帮助深度 GNNs 在不同任务中表现更好,并且可以在可解释的和普遍的方式下选择最佳网络深度。
    Abstract Despite Graph Neural Networks demonstrating considerable promise in graph representation learning tasks, GNNs predominantly face significant issues with over-fitting and over-smoothing as they go deeper as models of computer vision realm. In this work, we conduct a systematic study of deeper GNN research trajectories. Our findings indicate that the current success of deep GNNs primarily stems from (I) the adoption of innovations from CNNs, such as residual/skip connections, or (II) the tailor-made aggregation algorithms like DropEdge. However, these algorithms often lack intrinsic interpretability and indiscriminately treat all nodes within a given layer in a similar manner, thereby failing to capture the nuanced differences among various nodes. To this end, we introduce the Snowflake Hypothesis -- a novel paradigm underpinning the concept of ``one node, one receptive field''. The hypothesis draws inspiration from the unique and individualistic patterns of each snowflake, proposing a corresponding uniqueness in the receptive fields of nodes in the GNNs. We employ the simplest gradient and node-level cosine distance as guiding principles to regulate the aggregation depth for each node, and conduct comprehensive experiments including: (1) different training schemes; (2) various shallow and deep GNN backbones, and (3) various numbers of layers (8, 16, 32, 64) on multiple benchmarks (six graphs including dense graphs with millions of nodes); (4) compare with different aggregation strategies. The observational results demonstrate that our hypothesis can serve as a universal operator for a range of tasks, and it displays tremendous potential on deep GNNs. It can be applied to various GNN frameworks, enhancing its effectiveness when operating in-depth, and guiding the selection of the optimal network depth in an explainable and generalizable way.
    摘要 Despite Graph Neural Networks (GNNs) showing great promise in graph representation learning tasks, they still face significant issues with over-fitting and over-smoothing as they become deeper, especially in the field of computer vision. In this study, we conducted a systematic investigation of deeper GNN research trajectories. Our findings indicate that the current success of deep GNNs is mainly due to (I) the adoption of innovations from Convolutional Neural Networks (CNNs), such as residual/skip connections, or (II) the use of tailor-made aggregation algorithms like DropEdge. However, these algorithms often lack intrinsic interpretability and treat all nodes within a given layer in a similar manner, failing to capture the subtle differences among various nodes. To address this issue, we propose the Snowflake Hypothesis - a novel paradigm that emphasizes the uniqueness of each node's receptive field, inspired by the unique patterns of snowflakes.We use the simplest gradient and node-level cosine distance as guiding principles to regulate the aggregation depth for each node. We conduct comprehensive experiments, including different training schemes, various shallow and deep GNN backbones, and various numbers of layers (8, 16, 32, 64) on multiple benchmarks (six graphs with millions of nodes). Our results show that our hypothesis can serve as a universal operator for a range of tasks and displays great potential in deep GNNs. It can be applied to various GNN frameworks, enhancing their effectiveness when operating in-depth, and providing an explainable and generalizable way to select the optimal network depth.
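
One way to read "one node, one receptive field" is per-node early stopping of aggregation. The sketch below is an assumption-laden illustration (the thresholded cosine similarity between consecutive layer representations and the mean aggregation are our guesses, not the paper's exact rule): nodes whose features have converged are frozen, so each node ends up with its own effective depth.

```python
import torch
import torch.nn.functional as F

def snowflake_propagate(x, adj, num_layers=32, thr=0.99):
    # x: (N, D) node features; adj: (N, N) row-normalized adjacency.
    active = torch.ones(len(x), dtype=torch.bool)
    for _ in range(num_layers):
        new_x = adj @ x                              # one aggregation step
        cos = F.cosine_similarity(new_x, x, dim=1)   # per-node change measure
        # Only nodes that are still changing (low similarity) keep updating.
        update = active & (cos < thr)
        x = torch.where(update.unsqueeze(1), new_x, x)
        active = update
        if not active.any():                         # everyone has converged
            break
    return x

adj = torch.softmax(torch.randn(50, 50), dim=1)
print(snowflake_propagate(torch.randn(50, 16), adj).shape)
```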

Towards Probabilistic Causal Discovery, Inference & Explanations for Autonomous Drones in Mine Surveying Tasks

  • paper_url: http://arxiv.org/abs/2308.10047
  • repo_url: None
  • paper_authors: Ricardo Cannizzaro, Rhys Howard, Paulina Lewinska, Lars Kunze
  • for: 这篇论文旨在提供自主机器人理解数据生成过程,以便在实际环境中做出决策和解释结果。
  • methods: 该论文提出了一个概率因果框架,包括因果信息引导的POMDP规划、在线SCM适应以及事后反事实解释。
  • results: 该框架能够在存在混杂因素、非平稳性和难以预先构建完整因果模型的环境中,帮助自主机器人进行决策并解释结果。
    Abstract Causal modelling offers great potential to provide autonomous agents the ability to understand the data-generation process that governs their interactions with the world. Such models capture formal knowledge as well as probabilistic representations of noise and uncertainty typically encountered by autonomous robots in real-world environments. Thus, causality can aid autonomous agents in making decisions and explaining outcomes, but deploying causality in such a manner introduces new challenges. Here we identify challenges relating to causality in the context of a drone system operating in a salt mine. Such environments are challenging for autonomous agents because of the presence of confounders, non-stationarity, and a difficulty in building complete causal models ahead of time. To address these issues, we propose a probabilistic causal framework consisting of: causally-informed POMDP planning, online SCM adaptation, and post-hoc counterfactual explanations. Further, we outline planned experimentation to evaluate the framework integrated with a drone system in simulated mine environments and on a real-world mine dataset.
    摘要 因果建模具有巨大潜力,可以让自主代理理解支配其与世界交互的数据生成过程。这类模型既能捕捉形式化知识,也能以概率方式表示自主机器人在真实环境中通常遇到的噪声和不确定性。因此,因果性可以帮助自主代理做出决策并解释结果,但以这种方式部署因果性也带来了新的挑战。本文以在盐矿中作业的无人机系统为背景,识别了与因果性相关的挑战:此类环境中存在混杂因素和非平稳性,且难以预先构建完整的因果模型,这对自主代理极具挑战。为解决这些问题,我们提出了一个概率因果框架,包括:因果信息引导的POMDP规划、在线SCM适应,以及事后反事实解释。此外,我们还概述了在模拟矿井环境以及真实矿井数据集上评估该框架与无人机系统集成的实验计划。

Optimizing Multi-Class Text Classification: A Diverse Stacking Ensemble Framework Utilizing Transformers

  • paper_url: http://arxiv.org/abs/2308.11519
  • repo_url: None
  • paper_authors: Anusuya Krishnan
  • for: 本研究旨在提高客户评价分类的准确率和可靠性,以便商家从客户评价中提取有益的反馈信息,提高客户满意度和驱动持续改进。
  • methods: 本研究提出了一种新的Stacking Ensemble基于transformer模型的多文本分类方法,通过将多个单个transformer模型,包括BERT、ELECTRA和DistilBERT,作为基础级分类器,以及一个基于RoBERTa的meta级分类器,生成最佳预测模型。
  • results: 实验结果表明,与传统单一分类器模型相比,基于transformer的堆叠集成多文本分类方法能够提升客户评价分类的准确率和稳定性,并在真实客户评价数据集上取得了更好的效果。
    Abstract Customer reviews play a crucial role in assessing customer satisfaction, gathering feedback, and driving improvements for businesses. Analyzing these reviews provides valuable insights into customer sentiments, including compliments, comments, and suggestions. Text classification techniques enable businesses to categorize customer reviews into distinct categories, facilitating a better understanding of customer feedback. However, challenges such as overfitting and bias limit the effectiveness of a single classifier in ensuring optimal prediction. This study proposes a novel approach to address these challenges by introducing a stacking ensemble-based multi-text classification method that leverages transformer models. By combining multiple single transformers, including BERT, ELECTRA, and DistilBERT, as base-level classifiers, and a meta-level classifier based on RoBERTa, an optimal predictive model is generated. The proposed stacking ensemble-based multi-text classification method aims to enhance the accuracy and robustness of customer review analysis. Experimental evaluations conducted on a real-world customer review dataset demonstrate the effectiveness and superiority of the proposed approach over traditional single classifier models. The stacking ensemble-based multi-text classification method using transformers proves to be a promising solution for businesses seeking to extract valuable insights from customer reviews and make data-driven decisions to enhance customer satisfaction and drive continuous improvement.
    摘要 顾客评价在评估客户满意度、收集反馈和推动企业改进方面起着关键作用。分析顾客评价可以获得有价值的客户情感洞见,包括赞誉、评论和建议。文本分类技术可以将顾客评价划分为不同类别,帮助企业更好地理解客户反馈。然而,过拟合和偏差等问题限制了单一分类器实现最优预测的能力。本研究提出了一种基于堆叠集成的多文本分类新方法来应对这些挑战,该方法充分利用transformer模型:将BERT、ELECTRA和DistilBERT等多个单一transformer作为基础级分类器,并以基于RoBERTa的元级分类器组合,生成最优预测模型。所提出的堆叠集成多文本分类方法旨在提升客户评价分析的准确性和鲁棒性。在真实客户评价数据集上的实验评估表明,该方法相比传统单一分类器模型更为有效和优越。基于transformer的堆叠集成多文本分类方法,为希望从客户评价中提取有价值信息、以数据驱动决策提升客户满意度并持续改进的企业提供了一个有前景的解决方案。
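
Stacking itself is a small amount of code: base classifiers emit class-probability vectors, which are concatenated into meta-features for a second-level model. The sketch below uses random stand-ins for the BERT/ELECTRA/DistilBERT base models and logistic regression in place of the paper's RoBERTa meta-classifier; in practice the meta-features should come from out-of-fold predictions to avoid leakage.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stack_features(texts, base_models):
    # Each base model maps a list of texts to an (N, num_classes) prob matrix;
    # concatenating them yields the meta-level feature matrix.
    return np.hstack([m(texts) for m in base_models])

# Dummy stand-ins for fine-tuned transformer classifiers (4 classes each).
rng = np.random.default_rng(0)
base_models = [lambda ts, r=rng: r.dirichlet(np.ones(4), size=len(ts))
               for _ in range(3)]

train_texts, y = ["a review"] * 100, rng.integers(0, 4, 100)
meta = LogisticRegression(max_iter=1000)
meta.fit(stack_features(train_texts, base_models), y)
print(meta.predict(stack_features(["new customer review"], base_models)))
```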

Causal Intersectionality and Dual Form of Gradient Descent for Multimodal Analysis: a Case Study on Hateful Memes

  • paper_url: http://arxiv.org/abs/2308.11585
  • repo_url: None
  • paper_authors: Yosuke Miyanishi, Minh Le Nguyen
  • for: 本研究旨在探讨如何结合可解释AI(XAI)与语义定义来理解机器学习模型的内部机制,从而更好地把握模型的因果效应。
  • methods: 本研究将基于梯度的方法与因果分析方法协同起来,探讨模型的内部机制。
  • results: 研究发现,借助交叉性(intersectionality)理论,可以将仇恨模因(hateful memes)检测问题表述为平均处理效应(ATE),并用按模态汇总的基于梯度的注意力归因分数刻画三种基于transformer的模型在ATE方面的不同行为。此外,研究还发现最新的LLM LLaMA2能够在上下文学习设置下解构模因检测的交叉性特征。
    Abstract In the wake of the explosive growth of machine learning (ML) usage, particularly within the context of emerging Large Language Models (LLMs), comprehending the semantic significance rooted in their internal workings is crucial. While causal analyses focus on defining semantics and its quantification, the gradient-based approach is central to explainable AI (XAI), tackling the interpretation of the black box. By synergizing these approaches, the exploration of how a model's internal mechanisms illuminate its causal effect has become integral for evidence-based decision-making. A parallel line of research has revealed that intersectionality - the combinatory impact of multiple demographics of an individual - can be structured in the form of an Averaged Treatment Effect (ATE). Initially, this study illustrates that the hateful memes detection problem can be formulated as an ATE, assisted by the principles of intersectionality, and that a modality-wise summarization of gradient-based attention attribution scores can delineate the distinct behaviors of three Transformerbased models concerning ATE. Subsequently, we show that the latest LLM LLaMA2 has the ability to disentangle the intersectional nature of memes detection in an in-context learning setting, with their mechanistic properties elucidated via meta-gradient, a secondary form of gradient. In conclusion, this research contributes to the ongoing dialogue surrounding XAI and the multifaceted nature of ML models.
    摘要 随着机器学习(ML)应用的迅猛发展,尤其是在新兴的大语言模型(LLM)背景下,理解其内部机制所蕴含的语义意义至关重要。因果分析专注于语义的定义及其量化,而基于梯度的方法是可解释AI(XAI)的核心,致力于解释黑盒模型。将这两类方法协同起来,探究模型内部机制如何揭示其因果效应,已成为基于证据决策的重要环节。与此平行的一条研究线表明,交叉性——个体多重人口属性的组合影响——可以用平均处理效应(ATE)的形式加以结构化。本研究首先说明,仇恨模因检测问题可以借助交叉性原理表述为ATE,并且按模态汇总的基于梯度的注意力归因分数能够刻画三种基于transformer的模型在ATE方面的不同行为。随后,我们展示了最新的LLM LLaMA2能够在上下文学习设置下解构模因检测的交叉性本质,并通过元梯度(一种二阶形式的梯度)阐明其机制特性。总之,本研究为围绕XAI以及ML模型多面性的持续讨论做出了贡献。

ClothesNet: An Information-Rich 3D Garment Model Repository with Simulated Clothes Environment

  • paper_url: http://arxiv.org/abs/2308.09987
  • repo_url: None
  • paper_authors: Bingyang Zhou, Haoyu Zhou, Tianhai Liang, Qiaojun Yu, Siheng Zhao, Yuwei Zeng, Jun Lv, Siyuan Luo, Qiancai Wang, Xinyuan Yu, Haonan Chen, Cewu Lu, Lin Shao
  • for: 这篇论文是为了提供一个大规模的3D衣物数据集,并将其注解为具有衣物特征、边框和关键点等信息。
  • methods: 该论文使用了大规模的3D衣物数据集,并在其中设置了衣物分类、边框分割和关键点检测等任务,以便用于计算机视觉和机器人交互任务。
  • results: 该论文通过实际实验表明,使用ClothesNet数据集可以帮助实现衣物感知和机器人交互任务,并且可以提供高质量的数据集和任务集。
    Abstract We present ClothesNet: a large-scale dataset of 3D clothes objects with information-rich annotations. Our dataset consists of around 4400 models covering 11 categories annotated with clothes features, boundary lines, and keypoints. ClothesNet can be used to facilitate a variety of computer vision and robot interaction tasks. Using our dataset, we establish benchmark tasks for clothes perception, including classification, boundary line segmentation, and keypoint detection, and develop simulated clothes environments for robotic interaction tasks, including rearranging, folding, hanging, and dressing. We also demonstrate the efficacy of our ClothesNet in real-world experiments. Supplemental materials and dataset are available on our project webpage.
    摘要 我们现在提出了 ClothesNet:一个大规模的3D衣物数据集,包含详细的注释信息。我们的数据集包含约4400个模型,涵盖11种类别,并且每个模型都有衣物特征、边界线和关键点的注释。ClothesNet可以用于促进计算机视觉和机器人互动任务。使用我们的数据集,我们建立了衣物识别、边界线分割和关键点检测的标准任务,并开发了机器人互动任务的模拟环境,包括重新排序、折叠、挂起和穿衣。我们还在实际世界中进行了实验,以证明 ClothesNet 的效果。补充材料和数据集可以在我们项目网站上获得。

Distributionally Robust Cross Subject EEG Decoding

  • paper_url: http://arxiv.org/abs/2308.11651
  • repo_url: None
  • paper_authors: Tiehang Duan, Zhenyi Wang, Gianfranco Doretto, Fang Li, Cui Tao, Donald Adjeroh
  • for: 提高EEG解码任务的性能,增强EEG数据的鲁棒性
  • methods: 使用分布鲁棒优化与Wasserstein梯度流实现数据演化,提升EEG解码器的特征学习
  • results: 模型在各种受损EEG信号上的表现显著优于基线,表明所提方法能够提升EEG解码任务的鲁棒性。
    Abstract Recently, deep learning has shown to be effective for Electroencephalography (EEG) decoding tasks. Yet, its performance can be negatively influenced by two key factors: 1) the high variance and different types of corruption that are inherent in the signal, 2) the EEG datasets are usually relatively small given the acquisition cost, annotation cost and amount of effort needed. Data augmentation approaches for alleviation of this problem have been empirically studied, with augmentation operations on spatial domain, time domain or frequency domain handcrafted based on expertise of domain knowledge. In this work, we propose a principled approach to perform dynamic evolution on the data for improvement of decoding robustness. The approach is based on distributionally robust optimization and achieves robustness by optimizing on a family of evolved data distributions instead of the single training data distribution. We derived a general data evolution framework based on Wasserstein gradient flow (WGF) and provides two different forms of evolution within the framework. Intuitively, the evolution process helps the EEG decoder to learn more robust and diverse features. It is worth mentioning that the proposed approach can be readily integrated with other data augmentation approaches for further improvements. We performed extensive experiments on the proposed approach and tested its performance on different types of corrupted EEG signals. The model significantly outperforms competitive baselines on challenging decoding scenarios.
    摘要 最近,深度学习已经在电enzephalography(EEG)解码任务中显示出有效性。然而,其性能可能受到两种关键因素的负面影响:1)EEG信号中的高度变化和不同类型的损害,2)EEG数据集通常较小,需要较多的获取成本、标注成本和精力投入。为了解决这个问题,数据扩展方法已经被Empirically研究,其中包括在空间领域、时间领域或频率领域进行手动设计的扩展操作。在这项工作中,我们提出了一种原理性的方法,通过分布robust优化来提高解码Robustness。该方法基于Wasserstein梯度流(WGF),并提供了两种不同的演化形式。intuitively,演化过程可以帮助EEG解码器学习更加Robust和多样的特征。值得一提是,提议的方法可以和其他数据扩展方法结合使用,以实现更高的性能。我们对提议的方法进行了广泛的实验,并测试其性能于不同类型的损害EEG信号。模型显著超越了竞争对手的基eline。

Artificial Intelligence across Europe: A Study on Awareness, Attitude and Trust

  • paper_url: http://arxiv.org/abs/2308.09979
  • repo_url: None
  • paper_authors: Teresa Scantamburlo, Atia Cortés, Francesca Foffano, Cristian Barrué, Veronica Distefano, Long Pham, Alessandro Fabris
  • for: 这篇论文旨在更好地了解欧洲公民对人工智能(AI)的看法和认知,以便为AI治理和政策制定提供依据。
  • methods: 研究设计并验证了一份新的问卷(PAICE),围绕认知、态度和信任三个维度,收集了来自八个国家的4,006名欧洲公民的数据。
  • results: 研究发现,尽管公民对AI的认知水平较低,但总体态度相当积极;同时也存在一些隐含矛盾和趋势,可能妨碍建设包容性的AI生态系统。研究强调法律与伦理标准、教育机构以及AI素养对支撑可信AI生态系统的重要性。
    Abstract This paper presents the results of an extensive study investigating the opinions on Artificial Intelligence (AI) of a sample of 4,006 European citizens from eight distinct countries (France, Germany, Italy, Netherlands, Poland, Romania, Spain, and Sweden). The aim of the study is to gain a better understanding of people's views and perceptions within the European context, which is already marked by important policy actions and regulatory processes. To survey the perceptions of the citizens of Europe we design and validate a new questionnaire (PAICE) structured around three dimensions: people's awareness, attitude, and trust. We observe that while awareness is characterized by a low level of self-assessed competency, the attitude toward AI is very positive for more than half of the population. Reflecting upon the collected results, we highlight implicit contradictions and identify trends that may interfere with the creation of an ecosystem of trust and the development of inclusive AI policies. The introduction of rules that ensure legal and ethical standards, along with the activity of high-level educational entities, and the promotion of AI literacy are identified as key factors in supporting a trustworthy AI ecosystem. We make some recommendations for AI governance focused on the European context and conclude with suggestions for future work.
    摘要 本文展示了一项大规模研究的结果,调查了来自八个国家(法国、德国、意大利、荷兰、波兰、罗马尼亚、西班牙和瑞典)的4,006名欧洲公民对人工智能(AI)的看法。该研究旨在更好地理解欧洲背景下人们的观点和认知——这一背景已伴随着重要的政策行动和监管进程。为调查欧洲公民的认知,我们设计并验证了一份新的问卷(PAICE),围绕三个维度展开:认知、态度和信任。我们观察到,尽管受访者自评的AI能力水平较低,但超过半数人群对AI持非常积极的态度。基于所收集的结果,我们指出了其中的隐含矛盾,并识别出可能妨碍建立信任生态系统与制定包容性AI政策的趋势。确保法律与伦理标准的规则、高水平教育机构的活动以及AI素养的推广,被认为是支撑可信AI生态系统的关键因素。我们针对欧洲背景提出了若干AI治理建议,并以对未来工作的展望作结。

Explicit Time Embedding Based Cascade Attention Network for Information Popularity Prediction

  • paper_url: http://arxiv.org/abs/2308.09976
  • repo_url: None
  • paper_authors: Xigang Sun, Jingya Zhou, Ling Liu, Wenqi Wei
  • for: 预测社交网络中信息级联的流行度,这需要同时捕捉级联的时间属性与级联角色信息(如级联图和级联序列)。
  • methods: 本文提出了一种基于显式时间嵌入的级联注意力网络(TCAN),该模型通过通用时间嵌入方法(TE)将时间属性(如周期性、线性和非线性缩放)融入节点特征,并使用级联图注意力编码器(CGAT)和级联序列注意力编码器(CSAT)充分学习级联图和级联序列的表示。
  • results: 在Weibo和APS两个真实数据集上的验证中,TCAN的平均对数平方误差分别为2.007和1.201,运行时间分别为1.76小时和0.15小时;在MSLE、MAE和R-squared指标上,TCAN平均领先其他代表性基线10.4%、3.8%和10.4%,同时保持良好的可解释性。
    Abstract Predicting information cascade popularity is a fundamental problem in social networks. Capturing temporal attributes and cascade role information (e.g., cascade graphs and cascade sequences) is necessary for understanding the information cascade. Current methods rarely focus on unifying this information for popularity predictions, which prevents them from effectively modeling the full properties of cascades to achieve satisfactory prediction performances. In this paper, we propose an explicit Time embedding based Cascade Attention Network (TCAN) as a novel popularity prediction architecture for large-scale information networks. TCAN integrates temporal attributes (i.e., periodicity, linearity, and non-linear scaling) into node features via a general time embedding approach (TE), and then employs a cascade graph attention encoder (CGAT) and a cascade sequence attention encoder (CSAT) to fully learn the representation of cascade graphs and cascade sequences. We use two real-world datasets (i.e., Weibo and APS) with tens of thousands of cascade samples to validate our methods. Experimental results show that TCAN obtains mean logarithm squared errors of 2.007 and 1.201 and running times of 1.76 hours and 0.15 hours on both datasets, respectively. Furthermore, TCAN outperforms other representative baselines by 10.4%, 3.8%, and 10.4% in terms of MSLE, MAE, and R-squared on average while maintaining good interpretability.
    摘要 预测信息级联的流行度是社交网络中的一个基本问题。理解信息级联需要捕捉时间属性与级联角色信息(如级联图和级联序列)。现有方法很少将这些信息统一用于流行度预测,因而难以充分建模级联的完整属性,预测性能不尽如人意。在本文中,我们提出了一种基于显式时间嵌入的级联注意力网络(TCAN),作为面向大规模信息网络的新型流行度预测架构。TCAN通过一种通用时间嵌入方法(TE)将时间属性(即周期性、线性和非线性缩放)融入节点特征,随后使用级联图注意力编码器(CGAT)和级联序列注意力编码器(CSAT)充分学习级联图和级联序列的表示。我们在两个各含数万级联样本的真实数据集(Weibo和APS)上验证了所提方法。实验结果显示,TCAN在两个数据集上的平均对数平方误差分别为2.007和1.201,运行时间分别为1.76小时和0.15小时;在MSLE、MAE和R-squared指标上,TCAN平均领先其他代表性基线10.4%、3.8%和10.4%,同时保持良好的可解释性。
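
A general time embedding of the kind TE describes, combining periodic, linear, and non-linear components, can be sketched as follows; the learnable-frequency sine features and the log-scaled trend term are our assumed parameterization, not TCAN's exact module.

```python
import torch
import torch.nn as nn

class TimeEmbedding(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.freq = nn.Parameter(torch.randn(dim // 2))  # learnable periodicities
        self.lin = nn.Linear(2, dim)                     # linear + log-scaled time

    def forward(self, t):
        # t: (N,) elapsed times since the cascade's root post.
        t = t.unsqueeze(1)
        periodic = torch.sin(t * self.freq)              # periodic component
        trend = self.lin(torch.cat([t, torch.log1p(t)], dim=1))  # (non)linear
        return torch.cat([periodic, trend], dim=1)       # per-node time feature

te = TimeEmbedding()
print(te(torch.rand(5) * 100).shape)  # torch.Size([5, 24])
```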

Disposable Transfer Learning for Selective Source Task Unlearning

  • paper_url: http://arxiv.org/abs/2308.09971
  • repo_url: None
  • paper_authors: Seunghee Koh, Hyounguk Shon, Janghyeon Lee, Hyeong Gwon Hong, Junmo Kim
  • for: 这篇论文提出了一种新的迁移学习范式——可抛弃迁移学习(DTL),在不降低目标任务性能的前提下仅抛弃源任务,以避免知识泄露问题。
  • methods: 该论文提出了一种新的损失函数——梯度碰撞损失(GC损失),它使不同小批量的梯度向量指向不同方向,从而选择性地遗忘源任务。
  • results: 论文表明,使用GC损失可以有效地解决传输学习中的知识泄露问题,并且模型在target任务上保持了高度的性能。
    Abstract Transfer learning is widely used for training deep neural networks (DNN) for building a powerful representation. Even after the pre-trained model is adapted for the target task, the representation performance of the feature extractor is retained to some extent. As the performance of the pre-trained model can be considered the private property of the owner, it is natural to seek the exclusive right of the generalized performance of the pre-trained weight. To address this issue, we suggest a new paradigm of transfer learning called disposable transfer learning (DTL), which disposes of only the source task without degrading the performance of the target task. To achieve knowledge disposal, we propose a novel loss named Gradient Collision loss (GC loss). GC loss selectively unlearns the source knowledge by leading the gradient vectors of mini-batches in different directions. Whether the model successfully unlearns the source task is measured by piggyback learning accuracy (PL accuracy). PL accuracy estimates the vulnerability of knowledge leakage by retraining the scrubbed model on a subset of source data or new downstream data. We demonstrate that GC loss is an effective approach to the DTL problem by showing that the model trained with GC loss retains the performance on the target task with a significantly reduced PL accuracy.
    摘要 迁移学习被广泛用于训练深度神经网络(DNN)以构建强大的表示。即使预训练模型被适配到目标任务之后,特征提取器的表示性能在一定程度上仍会保留。由于预训练模型的性能可被视为其所有者的私有财产,寻求对预训练权重泛化性能的专有权利是很自然的。为解决这一问题,我们提出了一种新的迁移学习范式,称为可抛弃迁移学习(DTL):仅抛弃源任务,而不降低目标任务的性能。为了实现知识抛弃,我们提出了一种新的损失函数——梯度碰撞损失(GC损失)。GC损失通过使不同小批量的梯度向量指向不同方向,选择性地遗忘源知识。模型是否成功遗忘源任务由搭载学习准确率(PL accuracy)衡量:通过在部分源数据或新的下游数据上重新训练被擦除的模型,PL准确率可以估计知识泄露的风险。我们证明了GC损失是解决DTL问题的有效方法:用GC损失训练的模型在保持目标任务性能的同时,PL准确率显著降低。
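
The Gradient Collision idea, pushing the gradients of different source mini-batches toward different directions, can be sketched directly with double backpropagation. The cosine-similarity form below is an illustrative assumption of how "collision" might be scored, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def gradient_collision_loss(model, loss_fn, batch_a, batch_b):
    def flat_grad(batch):
        loss = loss_fn(model(batch[0]), batch[1])
        # create_graph=True so the collision term itself stays differentiable.
        grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
        return torch.cat([g.reshape(-1) for g in grads])
    g_a, g_b = flat_grad(batch_a), flat_grad(batch_b)
    # Minimizing this cosine similarity pushes the two source-task gradients
    # apart, preventing consistent descent on (i.e. retention of) source knowledge.
    return F.cosine_similarity(g_a, g_b, dim=0)

model = torch.nn.Linear(10, 3)
xa, ya = torch.randn(8, 10), torch.randint(0, 3, (8,))
xb, yb = torch.randn(8, 10), torch.randint(0, 3, (8,))
print(gradient_collision_loss(model, F.cross_entropy, (xa, ya), (xb, yb)))
```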

Tackling Vision Language Tasks Through Learning Inner Monologues

  • paper_url: http://arxiv.org/abs/2308.09970
  • repo_url: None
  • paper_authors: Diji Yang, Kezhen Chen, Jinmeng Rao, Xiaoyuan Guo, Yawen Zhang, Jie Yang, Yi Zhang
  • for: 解决复杂的视觉语言问题,如图像描述和图像理解等。
  • methods: 提出了一种新的方法 Inner Monologue Multi-Modal Optimization (IMMO),通过模拟内部对话来促进语言模型和视觉模型之间的合并。
  • results: 实验结果表明,通过IMMO可以提高理解和解释能力,并且可以应用于多种不同的AI问题。
    Abstract Visual language tasks require AI models to comprehend and reason with both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) the hybrid integration between LLMs and Vision-Language Models (VLMs), where visual inputs are firstly converted into language descriptions by VLMs, serving as inputs for LLMs to generate final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected to LLMs' language space via further supervised fine-tuning. The first approach provides light training costs and interpretability but is hard to be optimized in an end-to-end fashion. The second approach presents decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. To tackle this dilemma, we propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), to solve complex vision language problems by simulating inner monologue processes, a cognitive process in which an individual engages in silent verbal communication with themselves. We enable LLMs and VLMs to interact through natural language conversation and propose to use a two-stage training process to learn how to do the inner monologue (self-asking questions and answering questions). IMMO is evaluated on two popular tasks and the results suggest by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to the more effective fusion of vision and language models. More importantly, instead of using predefined human-crafted monologues, IMMO learns this process within the deep learning models, promising wider applicability to many different AI problems beyond vision language tasks.
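
The two-stage self-asking/answering loop can be pictured with a short sketch; `ask_llm` and `ask_vlm` below are hypothetical stand-ins for the underlying models, and in IMMO this behaviour is learned through training rather than hard-scripted as here.

```python
def inner_monologue(question, image, ask_llm, ask_vlm, max_turns=3):
    """Simulate an inner monologue between an LLM and a VLM (illustrative)."""
    transcript = [f"Question: {question}"]
    for _ in range(max_turns):
        # Stage 1: the LLM silently asks itself what it still needs to know.
        sub_q = ask_llm("\n".join(transcript) + "\nWhat should I ask about the image?")
        # Stage 2: the VLM answers the sub-question from the visual input.
        sub_a = ask_vlm(image, sub_q)
        transcript += [f"Self-question: {sub_q}", f"Visual answer: {sub_a}"]
    # The LLM produces the final answer from the accumulated monologue.
    return ask_llm("\n".join(transcript) + "\nFinal answer:")
```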

Anomaly-Aware Semantic Segmentation via Style-Aligned OoD Augmentation

  • paper_url: http://arxiv.org/abs/2308.09965
  • repo_url: None
  • paper_authors: Dan Zhang, Kaspar Sakmann, William Beluch, Robin Hutmacher, Yumeng Li
  • for: Equipping standard semantic segmentation models to handle unfamiliar object categories encountered in the open world.
  • methods: Uses synthetic out-of-distribution (OoD) data augmentation to enable the recognition of unexpected objects.
  • results: Shows that narrowing the domain gap between OoD data and driving scenes effectively mitigates the style difference that would otherwise act as a training shortcut, and proposes a simple fine-tuning loss that lets a pre-trained semantic segmentation model predict unfamiliar objects.
    Abstract Within the context of autonomous driving, encountering unknown objects becomes inevitable during deployment in the open world. Therefore, it is crucial to equip standard semantic segmentation models with anomaly awareness. Many previous approaches have utilized synthetic out-of-distribution (OoD) data augmentation to tackle this problem. In this work, we advance the OoD synthesis process by reducing the domain gap between the OoD data and driving scenes, effectively mitigating the style difference that might otherwise act as an obvious shortcut during training. Additionally, we propose a simple fine-tuning loss that effectively induces a pre-trained semantic segmentation model to generate a "none of the given classes" prediction, leveraging per-pixel OoD scores for anomaly segmentation. With minimal fine-tuning effort, our pipeline enables the use of pre-trained models for anomaly segmentation while maintaining the performance on the original task.
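
A hedged sketch of a fine-tuning objective in this spirit: keep cross-entropy on in-distribution pixels while suppressing confidence in every known class on pasted OoD pixels, so that one minus the max softmax serves as a per-pixel OoD score at test time. The exact loss in the paper may differ.

```python
import torch
import torch.nn.functional as F

def anomaly_finetune_loss(logits, labels, ood_mask):
    """logits: (B, C, H, W); labels: (B, H, W) class ids; ood_mask: (B, H, W) bool."""
    # Standard cross-entropy on in-distribution pixels preserves the original task.
    ce = F.cross_entropy(logits, labels, reduction="none")
    id_mask = (~ood_mask).float()
    id_loss = (ce * id_mask).sum() / id_mask.sum().clamp(min=1.0)
    # On OoD pixels, push down the max class probability ("none of the given
    # classes"); 1 - max softmax then acts as the per-pixel anomaly score.
    max_prob = logits.softmax(dim=1).max(dim=1).values
    ood_loss = max_prob[ood_mask].mean() if ood_mask.any() else 0.0
    return id_loss + ood_loss
```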

Data-to-text Generation for Severely Under-Resourced Languages with GPT-3.5: A Bit of Help Needed from Google Translate

  • paper_url: http://arxiv.org/abs/2308.09957
  • repo_url: https://github.com/dcu-nlg/dcu-nlg-pbn
  • paper_authors: Michela Lorandi, Anya Belz
  • for: Examines how LLMs trained predominantly on English handle data-to-text generation for severely under-resourced languages.
  • methods: Tests a range of prompt types and formats on GPT-3.5 and GPT-4 with a small sample of example input/output pairs to identify the most promising prompting approach.
  • results: Few-shot prompting works better for direct generation into under-resourced languages, but the advantage disappears when pivoting via English; the submitted few-shot + translation systems outperformed competitors in the WebNLG 2023 shared task on all languages and metrics, yet the best results (for Welsh) remain well below the lowest-ranked English system at WebNLG'20.
    Abstract LLMs like GPT are great at tasks involving English, which dominates their training data. In this paper, we look at how they cope with tasks involving languages that are severely under-represented in their training data, in the context of data-to-text generation for Irish, Maltese, Welsh and Breton. During the prompt-engineering phase we tested a range of prompt types and formats on GPT-3.5 and GPT-4 with a small sample of example input/output pairs. We then fully evaluated the two most promising prompts in two scenarios: (i) direct generation into the under-resourced language, and (ii) generation into English followed by translation into the under-resourced language. We find that few-shot prompting works better for direct generation into under-resourced languages, but that the difference disappears when pivoting via English. The few-shot + translation system variants were submitted to the WebNLG 2023 shared task where they outperformed competitor systems by substantial margins in all languages on all metrics. We conclude that good performance on under-resourced languages can be achieved out-of-the-box with state-of-the-art LLMs. However, our best results (for Welsh) remain well below the lowest ranked English system at WebNLG'20.
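
A sketch of the two prompting setups compared in the paper, assuming hypothetical `complete` (GPT-3.5/4) and `translate` (Google Translate) wrappers; the prompt template itself is an illustrative guess, not the paper's exact format.

```python
def build_fewshot_prompt(examples, triples, target_lang):
    """Few-shot data-to-text prompt from (triples, text) example pairs."""
    shots = "\n\n".join(
        f"Data: {ex['triples']}\nText ({target_lang}): {ex['text']}" for ex in examples
    )
    return f"{shots}\n\nData: {triples}\nText ({target_lang}):"

def generate(triples, examples, target_lang, complete, translate, pivot=False):
    if pivot:
        # (ii) generate English first, then translate into the target language.
        english = complete(build_fewshot_prompt(examples, triples, "English"))
        return translate(english, source="en", target=target_lang)
    # (i) direct few-shot generation into the under-resourced language.
    return complete(build_fewshot_prompt(examples, triples, target_lang))
```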

Eva-KELLM: A New Benchmark for Evaluating Knowledge Editing of LLMs

  • paper_url: http://arxiv.org/abs/2308.09954
  • repo_url: None
  • paper_authors: Suhang Wu, Minlong Peng, Yue Chen, Jinsong Su, Mingming Sun
  • for: Evaluating the effectiveness of knowledge editing for large language models (LLMs).
  • methods: Performs knowledge editing with raw documents and evaluates the updated LLM from multiple perspectives.
  • results: Experiments indicate that current methods for knowledge editing with raw documents are not yet effective, particularly for reasoning with the altered knowledge and for cross-lingual knowledge transfer.
    Abstract Large language models (LLMs) possess a wealth of knowledge encoded in their parameters. However, this knowledge may become outdated or unsuitable over time. As a result, there has been a growing interest in knowledge editing for LLMs and evaluating its effectiveness. Existing studies primarily focus on knowledge editing using factual triplets, which not only incur high costs for collection but also struggle to express complex facts. Furthermore, these studies are often limited in their evaluation perspectives. In this paper, we propose Eva-KELLM, a new benchmark for evaluating knowledge editing of LLMs. This benchmark includes an evaluation framework and a corresponding dataset. Under our framework, we first ask the LLM to perform knowledge editing using raw documents, which provides a more convenient and universal approach compared to using factual triplets. We then evaluate the updated LLM from multiple perspectives. In addition to assessing the effectiveness of knowledge editing and the retention of unrelated knowledge from conventional studies, we further test the LLM's ability in two aspects: 1) Reasoning with the altered knowledge, aiming for the LLM to genuinely learn the altered knowledge instead of simply memorizing it. 2) Cross-lingual knowledge transfer, where the LLM updated with raw documents in one language should be capable of handling queries from another language. To facilitate further research, we construct and release the corresponding dataset. Using this benchmark, we investigate the effectiveness of several commonly-used knowledge editing methods. Experimental results indicate that the current methods for knowledge editing using raw documents are not effective in yielding satisfactory results, particularly when it comes to reasoning with altered knowledge and cross-lingual knowledge transfer.

Exploring the Power of Topic Modeling Techniques in Analyzing Customer Reviews: A Comparative Analysis

  • paper_url: http://arxiv.org/abs/2308.11520
  • repo_url: None
  • paper_authors: Anusuya Krishnan
  • for: Compares commonly used topic modeling methods to improve topic discovery in practical applications.
  • methods: Evaluates LSA, LDA, NMF, PAM, Top2Vec, and BERTopic on customer reviews.
  • results: BERTopic consistently extracts more meaningful topics and achieves favorable results on both text datasets.
    Abstract The exponential growth of online social network platforms and applications has led to a staggering volume of user-generated textual content, including comments and reviews. Consequently, users often face difficulties in extracting valuable insights or relevant information from such content. To address this challenge, machine learning and natural language processing algorithms have been deployed to analyze the vast amount of textual data available online. In recent years, topic modeling techniques have gained significant popularity in this domain. In this study, we comprehensively examine and compare six frequently used topic modeling methods specifically applied to customer reviews. The methods under investigation are latent semantic analysis (LSA), latent Dirichlet allocation (LDA), non-negative matrix factorization (NMF), the pachinko allocation model (PAM), Top2Vec, and BERTopic. By practically demonstrating their benefits in detecting important topics, we aim to highlight their efficacy in real-world scenarios. To evaluate the performance of these topic modeling methods, we carefully select two textual datasets. The evaluation is based on standard statistical evaluation metrics such as topic coherence score. Our findings reveal that BERTopic consistently yields more meaningful extracted topics and achieves favorable results.
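
As an illustration of the evaluation protocol, the snippet below trains one of the compared models (LDA) with gensim and scores it with the c_v topic coherence metric; analogous pipelines would be built for the other methods. The review data is a placeholder.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

def lda_coherence(tokenized_reviews, num_topics=10):
    """Train LDA on tokenized reviews and return its c_v topic coherence."""
    dictionary = Dictionary(tokenized_reviews)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_reviews]
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    cm = CoherenceModel(model=lda, texts=tokenized_reviews,
                        dictionary=dictionary, coherence="c_v")
    return cm.get_coherence()
```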

Understanding Self-attention Mechanism via Dynamical System Perspective

  • paper_url: http://arxiv.org/abs/2308.09939
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Zhongzhan Huang, Mingfu Liang, Jinghui Qin, Shanshan Zhong, Liang Lin
  • for: Provides a new understanding of the self-attention mechanism (SAM) in neural networks and a way to measure the intrinsic stiffness phenomenon (SP) in high-performance networks.
  • methods: Studies SP in high-performance neural networks from a dynamical-system perspective and proposes StepNet, an adaptive step-size approach for measuring SP.
  • results: Experiments show that StepNet measures SP accurately and improves performance on various visual tasks.
    Abstract The self-attention mechanism (SAM) is widely used in various fields of artificial intelligence and has successfully boosted the performance of different models. However, current explanations of this mechanism are mainly based on intuitions and experiences, while there still lacks direct modeling for how the SAM helps performance. To mitigate this issue, in this paper, based on the dynamical system perspective of the residual neural network, we first show that the intrinsic stiffness phenomenon (SP) in the high-precision solution of ordinary differential equations (ODEs) also widely exists in high-performance neural networks (NN). Thus the ability of NN to measure SP at the feature level is necessary to obtain high performance and is an important factor in the difficulty of training NN. Similar to the adaptive step-size method which is effective in solving stiff ODEs, we show that the SAM is also a stiffness-aware step size adaptor that can enhance the model's representational ability to measure intrinsic SP by refining the estimation of stiffness information and generating adaptive attention values, which provides a new understanding about why and how the SAM can benefit the model performance. This novel perspective can also explain the lottery ticket hypothesis in SAM, design new quantitative metrics of representational ability, and inspire a new theoretic-inspired approach, StepNet. Extensive experiments on several popular benchmarks demonstrate that StepNet can extract fine-grained stiffness information and measure SP accurately, leading to significant improvements in various visual tasks.
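
The dynamical-system reading can be made concrete with a toy residual update in which an attention-derived, input-dependent step size modulates the residual branch; this is only an illustration of the stiffness-aware step-size interpretation, not the paper's StepNet architecture.

```python
import torch

def residual_step(x, f, attn):
    """One residual update x <- x + h * f(x) with an adaptive step size.

    x: (B, N, D) token features; f: residual branch; attn: module producing a
    per-token step size of shape (B, N, 1) (an assumed, simplified form).
    """
    h = torch.sigmoid(attn(x))   # adaptive step size in (0, 1)
    # Stiff tokens should receive small steps, smooth ones large steps,
    # mirroring adaptive step-size solvers for stiff ODEs.
    return x + h * f(x)
```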

Analyzing Quantization in TVM

  • paper_url: http://arxiv.org/abs/2308.10905
  • repo_url: None
  • paper_authors: Mingfei Guo
  • for: Investigates 8-bit quantization in TVM to reduce the inference latency and memory footprint of deep learning models.
  • methods: Uses TVM's 8-bit quantization support and compares and analyzes several optimization techniques for quantized performance.
  • results: Out of the box, the quantized model ran about 2x slower than the non-quantized baseline on both compute-bound and memory-bound tasks; after fixing a graph-building bug and applying several optimization strategies, quantization achieved 163.88% and 194.98% inference-time improvements over the TVM-compiled baseline on the compute-bound and memory-bound tasks, respectively.
    Abstract There have been many papers in the academic literature on quantizing weight tensors in deep learning models to reduce inference latency and memory footprint. TVM also has the ability to quantize weights and support low-bit computations. Although quantization is typically expected to improve inference time, in TVM, the performance of 8-bit quantization does not meet the expectations. Typically, when applying 8-bit quantization to a deep learning model, it is usually expected to achieve around 50% of the full-precision inference time. However, in this particular case, not only does the quantized version fail to achieve the desired performance boost, but it actually performs worse, resulting in an inference time that is about 2 times as slow as the non-quantized version. In this project, we thoroughly investigate the reasons behind the underperformance and assess the compatibility and optimization opportunities of 8-bit quantization in TVM. We discuss the optimization of two different types of tasks: computation-bound and memory-bound, and provide a detailed comparison of various optimization techniques in TVM. Through the identification of performance issues, we have successfully improved quantization by addressing a bug in graph building. Furthermore, we analyze multiple optimization strategies to achieve the optimal quantization result. The best experiment achieves 163.88% improvement compared with the TVM compiled baseline in inference time for the compute-bound task and 194.98% for the memory-bound task.
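
For reference, 8-bit quantization in TVM goes through the public relay.quantize API along these lines; the qconfig knobs shown are common defaults and stand-ins for whatever configuration the project actually tuned.

```python
import tvm
from tvm import relay

def quantize_and_build(mod, params, target="llvm"):
    """Quantize a Relay module to int8 and compile it (illustrative settings)."""
    with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
        qmod = relay.quantize.quantize(mod, params)  # rewrite the graph to int8
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(qmod, target=target, params=params)
    return lib
```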

East: Efficient and Accurate Secure Transformer Framework for Inference

  • paper_url: http://arxiv.org/abs/2308.09923
  • repo_url: None
  • paper_authors: Yuanchao Ding, Hua Guo, Yewei Guan, Weixin Liu, Jiarong Huo, Zhenyu Guan, Xiyong Zhang
  • for: Providing an efficient and accurate privacy-preserving Transformer inference framework.
  • methods: Proposes East, which includes a new oblivious piecewise polynomial evaluation algorithm for activation functions and carefully designed secure protocols for softmax and layer normalization.
  • results: East achieves efficient and accurate privacy-preserving Transformer inference, with accuracy consistent with plaintext inference without fine-tuning; compared with Iron, it reduces communication by about 1.8x at about 1.2x lower runtime.
    Abstract Transformer has been successfully used in practical applications, such as ChatGPT, due to its powerful advantages. However, users' input is leaked to the model provider during the service. With people's attention to privacy, privacy-preserving Transformer inference is in demand for such services. Secure protocols for non-linear functions are crucial in privacy-preserving Transformer inference, but are not well studied. Thus, designing practical secure protocols for non-linear functions is hard but significant for model performance. In this work, we propose a framework \emph{East} to enable efficient and accurate secure Transformer inference. Firstly, we propose a new oblivious piecewise polynomial evaluation algorithm and apply it to the activation functions, which reduces the runtime and communication of GELU by over 1.5$\times$ and 2.5$\times$, compared to prior art. Secondly, the secure protocols for softmax and layer normalization are carefully designed to faithfully maintain the desired functionality. Thirdly, several optimizations are conducted in detail to enhance the overall efficiency. We applied \emph{East} to BERT and the results show that the inference accuracy remains consistent with the plaintext inference without fine-tuning. Compared to Iron, we achieve about 1.8$\times$ lower communication within 1.2$\times$ lower runtime.
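
The core trick of oblivious piecewise polynomial evaluation can be previewed in plaintext: approximate GELU with low-degree polynomials on a few intervals so that the secure protocol only ever evaluates polynomials (plus an oblivious interval selection). The breakpoints and degree below are illustrative, not East's actual parameters.

```python
import numpy as np

def gelu(z):
    """Tanh approximation of GELU, used as the fitting target."""
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def piecewise_gelu(x):
    """Evaluate GELU via one fitted polynomial on [-4, 4] and linear tails."""
    zs = np.linspace(-4, 4, 512)
    coeffs = np.polyfit(zs, gelu(zs), deg=4)  # low-degree fit on the middle piece
    # Left tail saturates to 0, right tail is the identity, middle is the polynomial.
    return np.where(x < -4, 0.0, np.where(x > 4, x, np.polyval(coeffs, x)))
```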

Recap: Detecting Deepfake Video with Unpredictable Tampered Traces via Recovering Faces and Mapping Recovered Faces

  • paper_url: http://arxiv.org/abs/2308.09921
  • repo_url: None
  • paper_authors: Juan Hu, Xin Liao, Difei Gao, Satoshi Tsutsui, Qian Wang, Zheng Qin, Mike Zheng Shou
  • for: Detecting Deepfake videos created for malicious purposes.
  • methods: Proposes Recap, a Deepfake detection model that exposes unspecific facial-part inconsistencies by recovering faces and mapping the recovered faces.
  • results: Extensive experiments across multiple scenarios show that Recap detects Deepfake videos effectively.
    Abstract The exploitation of Deepfake techniques for malicious intentions has driven significant research interest in Deepfake detection. Deepfake manipulations frequently introduce random tampered traces, leading to unpredictable outcomes in different facial regions. However, existing detection methods heavily rely on specific forgery indicators, and as the forgery mode improves, these traces become increasingly randomized, resulting in a decline in the detection performance of methods reliant on specific forgery traces. To address the limitation, we propose Recap, a novel Deepfake detection model that exposes unspecific facial part inconsistencies by recovering faces and enlarges the differences between real and fake by mapping recovered faces. In the recovering stage, the model focuses on randomly masking regions of interest (ROIs) and reconstructing real faces without unpredictable tampered traces, resulting in a relatively good recovery effect for real faces while a poor recovery effect for fake faces. In the mapping stage, the output of the recovery phase serves as supervision to guide the facial mapping process. This mapping process strategically emphasizes the mapping of fake faces with poor recovery, leading to a further deterioration in their representation, while enhancing and refining the mapping of real faces with good representation. As a result, this approach significantly amplifies the discrepancies between real and fake videos. Our extensive experiments on standard benchmarks demonstrate that Recap is effective in multiple scenarios.
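
A simplified sketch of the recovering stage: random facial regions are masked and a network is trained to reconstruct them, so real faces recover well while tampered faces do not. The square-patch masking below is an assumption; the paper masks facial regions of interest.

```python
import torch

def mask_random_rois(faces, num_rois=4, roi=32):
    """faces: (B, C, H, W) in [0, 1]; zero out `num_rois` square patches per image."""
    masked = faces.clone()
    B, _, H, W = faces.shape
    for b in range(B):
        for _ in range(num_rois):
            y = torch.randint(0, H - roi, (1,)).item()
            x = torch.randint(0, W - roi, (1,)).item()
            masked[b, :, y:y + roi, x:x + roi] = 0.0
    return masked

def recovery_loss(recover_net, faces):
    """L2 reconstruction: low for real faces, high for tampered ones."""
    recon = recover_net(mask_random_rois(faces))
    return torch.mean((recon - faces) ** 2)
```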

Learning Multiscale Consistency for Self-supervised Electron Microscopy Instance Segmentation

  • paper_url: http://arxiv.org/abs/2308.09917
  • repo_url: None
  • paper_authors: Yinda Chen, Wei Huang, Xiaoyu Liu, Qi Chen, Zhiwei Xiong
  • for: Improving instance segmentation accuracy in electron microscopy (EM) volumes.
  • methods: Uses self-supervised learning with multiscale visual representations to capture the complex visual patterns and voxel relationships of instances in EM volumes.
  • results: Extensive pretraining on four large-scale EM datasets improves performance on representative neuron and mitochondria instance segmentation tasks.
    Abstract Instance segmentation in electron microscopy (EM) volumes poses a significant challenge due to the complex morphology of instances and insufficient annotations. Self-supervised learning has recently emerged as a promising solution, enabling the acquisition of prior knowledge of cellular tissue structures that are essential for EM instance segmentation. However, existing pretraining methods often lack the ability to capture complex visual patterns and relationships between voxels, which results in the acquired prior knowledge being insufficient for downstream EM analysis tasks. In this paper, we propose a novel pretraining framework that leverages multiscale visual representations to capture both voxel-level and feature-level consistency in EM volumes. Specifically, our framework enforces voxel-level consistency between the outputs of a Siamese network by a reconstruction function, and incorporates a cross-attention mechanism for soft feature matching to achieve fine-grained feature-level consistency. Moreover, we propose a contrastive learning scheme on the feature pyramid to extract discriminative features across multiple scales. We extensively pretrain our method on four large-scale EM datasets, achieving promising performance improvements in representative tasks of neuron and mitochondria instance segmentation.
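
A hedged sketch of the voxel-level consistency idea: two augmented views of the same volume pass through a Siamese network and their outputs are aligned by a reconstruction-style objective. The stop-gradient and MSE choices are assumptions; the paper's cross-attention feature matching and multiscale contrastive term are omitted here.

```python
import torch.nn.functional as F

def voxel_consistency_loss(encoder, decoder, view_a, view_b):
    """view_a, view_b: (B, 1, D, H, W) augmented crops of the same EM volume."""
    out_a = decoder(encoder(view_a))
    out_b = decoder(encoder(view_b))
    # Align voxel-wise outputs of the two Siamese branches; stop-grad on one
    # branch is a common stabilizer in self-supervised consistency training.
    return F.mse_loss(out_a, out_b.detach())
```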

Never Explore Repeatedly in Multi-Agent Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.09909
  • repo_url: None
  • paper_authors: Chenghao Li, Tonghan Wang, Chongjie Zhang, Qianchuan Zhao
  • for: Addresses intrinsic motivation in multi-agent reinforcement learning, in particular the "revisitation" problem of agents repeatedly exploring confined areas of the task space.
  • methods: Proposes a dynamic reward scaling approach that stabilizes fluctuations in intrinsic rewards for previously explored areas and promotes broader exploration.
  • results: Experiments show improved performance on Google Research Football and StarCraft II micromanagement tasks, especially in sparse-reward settings.
    Abstract In the realm of multi-agent reinforcement learning, intrinsic motivations have emerged as a pivotal tool for exploration. While the computation of many intrinsic rewards relies on estimating variational posteriors using neural network approximators, a notable challenge has surfaced due to the limited expressive capability of these neural statistics approximators. We pinpoint this challenge as the "revisitation" issue, where agents recurrently explore confined areas of the task space. To combat this, we propose a dynamic reward scaling approach. This method is crafted to stabilize the significant fluctuations in intrinsic rewards in previously explored areas and promote broader exploration, effectively curbing the revisitation phenomenon. Our experimental findings underscore the efficacy of our approach, showcasing enhanced performance in demanding environments like Google Research Football and StarCraft II micromanagement tasks, especially in sparse reward settings.
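
One simple way to realize dynamic reward scaling is to damp intrinsic rewards by running visit statistics, as sketched below; the paper does not spell out its exact scaling rule here, so treat this count-based form as an assumption.

```python
from collections import defaultdict

class DynamicRewardScaler:
    """Shrink intrinsic rewards in heavily revisited states (illustrative)."""

    def __init__(self, decay=0.5):
        self.visits = defaultdict(int)
        self.decay = decay

    def scale(self, state_key, intrinsic_reward):
        self.visits[state_key] += 1
        # Repeated visits shrink the reward, stabilizing its fluctuations in
        # explored areas and steering agents toward unexplored regions.
        return intrinsic_reward / (1.0 + self.decay * self.visits[state_key])
```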

LEGO: Learning and Graph-Optimized Modular Tracker for Online Multi-Object Tracking with Point Clouds

  • paper_url: http://arxiv.org/abs/2308.09908
  • repo_url: None
  • paper_authors: Zhenrong Zhang, Jianan Liu, Yuxuan Xia, Tao Huang, Qing-Long Han, Hongbin Liu
  • for: Improving data association and thereby overall tracking performance in online multi-object tracking.
  • methods: Combines graph optimization and self-attention mechanisms to formulate an accurate association score map, enabling accurate and efficient matching of objects across frames.
  • results: Using LiDAR alone, LEGO ranked 1st among all online trackers on the KITTI object tracking evaluation board at submission time and remained 2nd at the time of writing.
    Abstract Online multi-object tracking (MOT) plays a pivotal role in autonomous systems. The state-of-the-art approaches usually employ a tracking-by-detection method, and data association plays a critical role. This paper proposes a learning and graph-optimized (LEGO) modular tracker to improve data association performance in the existing literature. The proposed LEGO tracker integrates graph optimization and self-attention mechanisms, which efficiently formulate the association score map, facilitating the accurate and efficient matching of objects across time frames. To further enhance the state update process, the Kalman filter is added to ensure consistent tracking by incorporating temporal coherence in the object states. Our proposed method utilizing LiDAR alone has shown exceptional performance compared to other online tracking approaches, including LiDAR-based and LiDAR-camera fusion-based methods. LEGO ranked 1st at the time of submitting results to the KITTI object tracking evaluation ranking board and remains 2nd at the time of submitting this paper, among all online trackers in the KITTI MOT benchmark for cars.
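
The state-update component is a standard Kalman filter step over matched detections, sketched below with textbook predict/update equations; the state parameterization LEGO uses is not reproduced here.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle. x: state; P: covariance; z: matched detection."""
    x_pred = F @ x                        # predict state forward in time
    P_pred = F @ P @ F.T + Q
    y = z - H @ x_pred                    # innovation from the associated object
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ y                # corrected state (temporal coherence)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```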

RAH! RecSys-Assistant-Human: A Human-Central Recommendation Framework with Large Language Models

  • paper_url: http://arxiv.org/abs/2308.09904
  • repo_url: None
  • paper_authors: Yubo Shu, Hansu Gu, Peng Zhang, Haonan Zhang, Tun Lu, Dongsheng Li, Ning Gu
  • for: Proposes a human-central recommendation framework that helps recommender systems better understand and satisfy user needs.
  • methods: Uses an LLM-based assistant as a personal proxy for the user within the RAH framework, which consists of the Recommender system, the Assistant, and the Human.
  • results: Experiments show that learn-action-critic and reflection mechanisms yield better-aligned user personalities, and that the assistant can effectively proxy human feedback to help adjust recommender systems.
    Abstract The recommendation ecosystem involves interactions between recommender systems (Computer) and users (Human). Orthogonal to the perspective of recommender systems, we attempt to utilize LLMs from the perspective of users and propose a more human-central recommendation framework named RAH, which consists of Recommender system, Assistant and Human. The assistant is an LLM-based personal proxy for a human to achieve user satisfaction. The assistant plays a non-invasive role and the RAH framework can adapt to different recommender systems and user groups. Subsequently, we implement and evaluate the RAH framework for learning user personalities and proxying human feedback. The experiment shows that (1) using learn-action-critic and reflection mechanisms can lead to a more aligned personality and (2) our assistant can effectively proxy human feedback and help adjust recommender systems. Finally, we discuss further strategies in the RAH framework to address human-central concerns including user control, privacy and fairness.

SwinLSTM: Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM

  • paper_url: http://arxiv.org/abs/2308.09891
  • repo_url: https://github.com/SongTang-x/SwinLSTM
  • paper_authors: Song Tang, Chuang Li, Pu Zhang, RongNian Tang
  • for: Proposes SwinLSTM, a new recurrent cell for spatiotemporal prediction tasks.
  • methods: Integrates Swin Transformer blocks with a simplified LSTM, replacing the convolutional structure of ConvLSTM with self-attention.
  • results: SwinLSTM outperforms state-of-the-art methods on the Moving MNIST, Human3.6m, TaxiBJ, and KTH datasets, with a significant accuracy improvement over ConvLSTM.
    Abstract Integrating CNNs and RNNs to capture spatiotemporal dependencies is a prevalent strategy for spatiotemporal prediction tasks. However, the property of CNNs to learn local spatial information decreases their efficiency in capturing spatiotemporal dependencies, thereby limiting their prediction accuracy. In this paper, we propose a new recurrent cell, SwinLSTM, which integrates Swin Transformer blocks and the simplified LSTM, an extension that replaces the convolutional structure in ConvLSTM with the self-attention mechanism. Furthermore, we construct a network with SwinLSTM cell as the core for spatiotemporal prediction. Without using unique tricks, SwinLSTM outperforms state-of-the-art methods on Moving MNIST, Human3.6m, TaxiBJ, and KTH datasets. In particular, it exhibits a significant improvement in prediction accuracy compared to ConvLSTM. Our competitive experimental results demonstrate that learning global spatial dependencies is more advantageous for models to capture spatiotemporal dependencies. We hope that SwinLSTM can serve as a solid baseline to promote the advancement of spatiotemporal prediction accuracy. The codes are publicly available at https://github.com/SongTang-x/SwinLSTM.
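
A hedged sketch of what a SwinLSTM-style cell can look like: LSTM gating whose pre-activations come from a self-attention (Swin) block over the input and hidden state instead of ConvLSTM's convolutions. `swin_block` is a placeholder for a real Swin Transformer layer, and the token-concatenation scheme is an assumption.

```python
import torch
import torch.nn as nn

class SwinLSTMCell(nn.Module):
    """LSTM gating driven by global spatial mixing from a Swin-style block."""

    def __init__(self, dim, swin_block):
        super().__init__()
        self.attn = swin_block                  # e.g., a Swin Transformer block
        self.to_gates = nn.Linear(dim, 4 * dim)

    def forward(self, x, state):
        h, c = state                             # hidden/cell states, (B, N, D)
        z = self.attn(torch.cat([x, h], dim=1))  # attend over input + hidden tokens
        z = z[:, : x.shape[1]]                   # keep the token positions of x
        i, f, g, o = self.to_gates(z).chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
```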

Inductive-bias Learning: Generating Code Models with Large Language Model

  • paper_url: http://arxiv.org/abs/2308.09890
  • repo_url: https://github.com/fuyu-quant/iblm
  • paper_authors: Toma Tanaka, Naofumi Emoto, Tsukasa Yumibayashi
  • for: Proposes Inductive-Bias Learning (IBL), a new learning method that combines the in-context learning (ICL) of large language models (LLMs) with code generation for accurate inference.
  • methods: Inputs training data into the prompt and has the LLM output a "Code Model" with the structure necessary for inference, without updating model parameters.
  • results: Generated Code Models achieve predictive accuracy comparable to, and in some cases surpassing, ICL and representative machine learning models, with better readability and explainability.
    Abstract Large Language Models (LLMs) have been attracting attention due to an ability called in-context learning (ICL). With ICL, without updating the parameters of an LLM, it is possible to achieve highly accurate inference based on rules ``in the context'' by merely inputting training data into the prompt. Although ICL is a developing field with many unanswered questions, LLMs themselves serve as an inference model, seemingly realizing inference without explicitly indicating an ``inductive bias''. On the other hand, code generation is also a highlighted application of LLMs. The accuracy of code generation has dramatically improved, enabling even non-engineers to generate code to perform the desired tasks by crafting appropriate prompts. In this paper, we propose a novel ``learning'' method called ``Inductive-Bias Learning (IBL)'', which combines the techniques of ICL and code generation. The idea of IBL is straightforward. Like ICL, IBL inputs training data into the prompt and outputs code with the necessary structure for inference (which we refer to as a ``Code Model'') from a ``contextual understanding''. Despite being a seemingly simple approach, IBL encompasses both the ``property of inference without explicit inductive bias'' inherent in ICL and the ``readability and explainability'' of code generation. Surprisingly, generated Code Models have been found to achieve predictive accuracy comparable to, and in some cases surpassing, ICL and representative machine learning models. Our IBL code is open source: https://github.com/fuyu-quant/IBLM
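
The IBL loop is easy to picture: put the training rows in the prompt, ask for a standalone predictive function, and execute the returned code. `complete` is a hypothetical LLM call, and the prompt template is a guess rather than the repo's exact format.

```python
def build_ibl_prompt(rows, feature_names):
    """Embed a small training table in the prompt and request a Code Model."""
    header = ", ".join(feature_names + ["label"])
    data = "\n".join(", ".join(map(str, r)) for r in rows)
    return (
        "Below is a training dataset.\n"
        f"{header}\n{data}\n\n"
        "Write a Python function predict(features: list) -> int that captures "
        "the pattern in this data. Return only the code."
    )

def learn_code_model(rows, feature_names, complete):
    code = complete(build_ibl_prompt(rows, feature_names))
    namespace = {}
    exec(code, namespace)          # materialize the generated Code Model
    # Inference now runs through plain, inspectable code, with no LLM calls.
    return namespace["predict"]
```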

Tensor-Compressed Back-Propagation-Free Training for (Physics-Informed) Neural Networks

  • paper_url: http://arxiv.org/abs/2308.09858
  • repo_url: None
  • paper_authors: Yequan Zhao, Xinling Yu, Zhixiong Chen, Ziyue Liu, Sijia Liu, Zheng Zhang
  • for: Proposes a framework that trains realistic neural networks without backward propagation (BP), using only forward passes.
  • methods: Three components: 1. a tensor-compressed variance reduction technique that scales zeroth-order (ZO) optimization to network sizes beyond previous ZO approaches; 2. a hybrid gradient evaluation approach that improves ZO training efficiency; 3. an extension to physics-informed neural networks (PINNs) via a sparse-grid method that estimates the derivatives in the loss function without BP.
  • results: BP-free training loses little accuracy on MNIST compared with standard first-order training and successfully trains a PINN for a 20-dimensional Hamilton-Jacobi-Bellman PDE; this memory-efficient, BP-free approach may serve as a foundation for on-device training on resource-constrained platforms (e.g., FPGAs, ASICs, microcontrollers, and photonic chips).
    Abstract Backward propagation (BP) is widely used to compute the gradients in neural network training. However, it is hard to implement BP on edge devices due to the lack of hardware and software resources to support automatic differentiation. This has tremendously increased the design complexity and time-to-market of on-device training accelerators. This paper presents a completely BP-free framework that only requires forward propagation to train realistic neural networks. Our technical contributions are three-fold. Firstly, we present a tensor-compressed variance reduction approach to greatly improve the scalability of zeroth-order (ZO) optimization, making it feasible to handle a network size that is beyond the capability of previous ZO approaches. Secondly, we present a hybrid gradient evaluation approach to improve the efficiency of ZO training. Finally, we extend our BP-free training framework to physics-informed neural networks (PINNs) by proposing a sparse-grid approach to estimate the derivatives in the loss function without using BP. Our BP-free training only loses little accuracy on the MNIST dataset compared with standard first-order training. We also demonstrate successful results in training a PINN for solving a 20-dim Hamilton-Jacobi-Bellman PDE. This memory-efficient and BP-free approach may serve as a foundation for the near-future on-device training on many resource-constrained platforms (e.g., FPGA, ASIC, micro-controllers, and photonic chips).
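
The BP-free foundation is a randomized zeroth-order gradient estimator built from forward passes only, as sketched below; the paper's tensor-compressed variance reduction and hybrid evaluation are layered on top of this basic estimator and are omitted here.

```python
import torch

def zo_gradient(loss_fn, theta, mu=1e-3, num_samples=8):
    """Two-point randomized gradient estimate using only forward evaluations.

    theta: flat parameter vector; loss_fn(theta) runs a forward pass and
    returns a scalar loss. No autograd / backward pass is used.
    """
    grad = torch.zeros_like(theta)
    for _ in range(num_samples):
        u = torch.randn_like(theta)
        # Finite difference along a random direction approximates <grad, u>.
        delta = loss_fn(theta + mu * u) - loss_fn(theta - mu * u)
        grad += (delta / (2 * mu)) * u
    return grad / num_samples
```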

Enumerating Safe Regions in Deep Neural Networks with Provable Probabilistic Guarantees

  • paper_url: http://arxiv.org/abs/2308.09842
  • repo_url: None
  • paper_authors: Luca Marzari, Davide Corsi, Enrico Marchesini, Alessandro Farinelli, Ferdinando Cicalese
  • for: Ensuring trust in Deep Neural Network (DNN) systems by identifying safe areas.
  • methods: Proposed an efficient approximation method called epsilon-ProVe, which leverages statistical prediction of tolerance limits to provide a tight lower estimate of safe areas.
  • results: Empirical evaluation on standard benchmarks showed the scalability and effectiveness of the method, providing valuable insights for verifying DNNs.
    Abstract Identifying safe areas is a key point to guarantee trust for systems that are based on Deep Neural Networks (DNNs). To this end, we introduce the AllDNN-Verification problem: given a safety property and a DNN, enumerate the set of all the regions of the property input domain which are safe, i.e., where the property does hold. Due to the #P-hardness of the problem, we propose an efficient approximation method called epsilon-ProVe. Our approach exploits a controllable underestimation of the output reachable sets obtained via statistical prediction of tolerance limits, and can provide a tight (with provable probabilistic guarantees) lower estimate of the safe areas. Our empirical evaluation on different standard benchmarks shows the scalability and effectiveness of our method, offering valuable insights for this new type of verification of DNNs.
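
The flavor of the statistical guarantee can be seen in a tiny order-statistics bound: if n random samples from a region all satisfy the property, with n chosen so that (1 - epsilon)^n <= delta, then with confidence 1 - delta at most an epsilon fraction of the region violates it. This is the generic bound, not the paper's full enumeration procedure.

```python
import math

def region_is_probably_safe(sample_point, holds, epsilon=0.01, delta=0.001):
    """Check a region with a (1 - epsilon, 1 - delta) statistical guarantee.

    sample_point(): draws a uniform point from the region under test;
    holds(p): evaluates the safety property on point p.
    """
    # Smallest n with (1 - epsilon)^n <= delta: if the violating fraction were
    # >= epsilon, seeing n clean samples would have probability <= delta.
    n = math.ceil(math.log(delta) / math.log(1 - epsilon))
    return all(holds(sample_point()) for _ in range(n))
```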

Synergistic Integration of Large Language Models and Cognitive Architectures for Robust AI: An Exploratory Analysis

  • paper_url: http://arxiv.org/abs/2308.09830
  • repo_url: None
  • paper_authors: Oscar J. Romero, John Zimmerman, Aaron Steinfeld, Anthony Tomasic
  • for: Explores integrating two AI subdisciplines, Large Language Models (LLMs) and Cognitive Architectures (CAs), to build artificial agents that exhibit intelligent behavior.
  • methods: Guided by theoretical models and supported by preliminary empirical data, examines synergistic approaches for combining the two.
  • results: Hypothesizes that such approaches can mutually compensate for each other's weaknesses and limitations, fostering more robust and sophisticated AI systems, and discusses the tradeoffs and challenges of each approach.
    Abstract This paper explores alternatives for integrating two subdisciplines of AI in the construction of artificial agents that exhibit intelligent behavior: Large Language Models (LLMs) and Cognitive Architectures (CAs). Guided by theoretical models and supported by preliminary empirical data, we hypothesize how diverse synergistic approaches can mutually compensate for their respective weaknesses and limitations, ultimately fostering more robust and sophisticated artificial intelligence systems. Additionally, we discuss the tradeoffs and challenges associated with each approach.

Learning Representations on Logs for AIOps

  • paper_url: http://arxiv.org/abs/2308.11526
  • repo_url: None
  • paper_authors: Pranjal Gupta, Harshit Kumar, Debanjana Kar, Karan Bhukar, Pooja Aggarwal, Prateeti Mohapatra
  • for: Increasing automation in AIOps platforms and reducing manual intervention by Site Reliability Engineers.
  • methods: Uses natural language processing and a large language model (LLM) trained on public and proprietary log data for automated log analysis tasks such as log format detection, log classification, and log parsing.
  • results: The proposed log-specific LLM outperforms existing models on multiple downstream tasks.
    Abstract AI for IT Operations (AIOps) is a powerful platform that Site Reliability Engineers (SREs) use to automate and streamline operational workflows with minimal human intervention. Automated log analysis is a critical task in AIOps as it provides key insights for SREs to identify and address ongoing faults. Tasks such as log format detection, log classification, and log parsing are key components of automated log analysis. Most of these tasks require supervised learning; however, there are multiple challenges due to limited labelled log data and the diverse nature of log data. Large Language Models (LLMs) such as BERT and GPT3 are trained using self-supervision on a vast amount of unlabeled data. These models provide generalized representations that can be effectively used for various downstream tasks with limited labelled data. Motivated by the success of LLMs in specific domains like science and biology, this paper introduces a LLM for log data which is trained on public and proprietary log data. The results of our experiments demonstrate that the proposed LLM outperforms existing models on multiple downstream tasks. In summary, AIOps powered by LLMs offers an efficient and effective solution for automating log analysis tasks and enabling SREs to focus on higher-level tasks. Our proposed LLM, trained on public and proprietary log data, offers superior performance on multiple downstream tasks, making it a valuable addition to the AIOps platform.

An Image is Worth a Thousand Toxic Words: A Metamorphic Testing Framework for Content Moderation Software

  • paper_url: http://arxiv.org/abs/2308.09810
  • repo_url: None
  • paper_authors: Wenxuan Wang, Jingyuan Huang, Jen-tse Huang, Chang Chen, Jiazhen Gu, Pinjia He, Michael R. Lyu
  • for: Examines the robustness of modern content moderation software against toxic text hidden inside images.
  • methods: Proposes OASIS, a metamorphic testing framework that applies 21 transform rules, summarized from 5,000 real-world toxic contents collected from four popular social media applications, to generate image test cases from toxic text.
  • results: OASIS achieves up to 100% error-finding rates against commercial moderation software, and retraining the moderation models on its test cases improves their robustness without degrading performance.
    Abstract The exponential growth of social media platforms has brought about a revolution in communication and content dissemination in human society. Nevertheless, these platforms are being increasingly misused to spread toxic content, including hate speech, malicious advertising, and pornography, leading to severe negative consequences such as harm to teenagers' mental health. Despite tremendous efforts in developing and deploying textual and image content moderation methods, malicious users can evade moderation by embedding texts into images, such as screenshots of the text, usually with some interference. We find that modern content moderation software's performance against such malicious inputs remains underexplored. In this work, we propose OASIS, a metamorphic testing framework for content moderation software. OASIS employs 21 transform rules summarized from our pilot study on 5,000 real-world toxic contents collected from 4 popular social media applications, including Twitter, Instagram, Sina Weibo, and Baidu Tieba. Given toxic textual contents, OASIS can generate image test cases, which preserve the toxicity yet are likely to bypass moderation. In the evaluation, we employ OASIS to test five commercial textual content moderation software from famous companies (i.e., Google Cloud, Microsoft Azure, Baidu Cloud, Alibaba Cloud and Tencent Cloud), as well as a state-of-the-art moderation research model. The results show that OASIS achieves up to 100% error finding rates. Moreover, through retraining the models with the test cases generated by OASIS, the robustness of the moderation model can be improved without performance degradation.
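
One OASIS-style transform can be sketched with Pillow: render the toxic text into a screenshot-like image with mild interference, then check that the moderation verdict on the image matches the verdict on the raw text (the metamorphic relation). The rotation-plus-default-font transform here is an assumed example, not one of the paper's 21 rules verbatim.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def text_to_image_testcase(text, size=(512, 128)):
    """Render text into a screenshot-like image with slight interference."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    draw.text((10, 40), text, fill="black", font=ImageFont.load_default())
    # Mild rotation acts as the interference that evades naive OCR pipelines.
    img = img.rotate(random.uniform(-5, 5), expand=False, fillcolor="white")
    # Metamorphic relation: the image still carries the toxic text, so the
    # moderation verdict should match the verdict on the raw text input.
    return img
```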

VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control

  • paper_url: http://arxiv.org/abs/2308.09804
  • repo_url: https://github.com/henryhzy/vl-pet
  • paper_authors: Zi-Yuan Hu, Yanyang Li, Michael R. Lyu, Liwei Wang
  • for: Addresses the prohibitive training and storage costs of fully fine-tuning ever-larger pre-trained language models (PLMs) for vision-and-language tasks, via the Vision-and-Language Parameter-Efficient Tuning (VL-PET) framework.
  • methods: Introduces a novel granularity-controlled mechanism (GCM) that imposes effective control over modular modifications (e.g., Adapter and LoRA), together with lightweight PET module designs for better efficiency and effectiveness trade-offs.
  • results: On four image-text and four video-text tasks, the framework shows strong efficiency, effectiveness, and transferability; in particular, VL-PET-large with lightweight PET modules outperforms VL-Adapter by 2.92% (3.41%) and LoRA by 3.37% (7.03%) with BART-base (T5-base) on image-text tasks.
    Abstract As the model size of pre-trained language models (PLMs) grows rapidly, full fine-tuning becomes prohibitively expensive for model training and storage. In vision-and-language (VL), parameter-efficient tuning (PET) techniques are proposed to integrate modular modifications (e.g., Adapter and LoRA) into encoder-decoder PLMs. By tuning a small set of trainable parameters, these techniques perform on par with full fine-tuning. However, excessive modular modifications and neglecting the functionality gap between the encoders and decoders can lead to performance degradation, while existing PET techniques (e.g., VL-Adapter) overlook these critical issues. In this paper, we propose a Vision-and-Language Parameter-Efficient Tuning (VL-PET) framework to impose effective control over modular modifications via a novel granularity-controlled mechanism. Considering different granularity-controlled matrices generated by this mechanism, a variety of model-agnostic VL-PET modules can be instantiated from our framework for better efficiency and effectiveness trade-offs. We further propose lightweight PET module designs to enhance VL alignment and modeling for the encoders and maintain text generation for the decoders. Extensive experiments conducted on four image-text tasks and four video-text tasks demonstrate the efficiency, effectiveness and transferability of our VL-PET framework. In particular, our VL-PET-large with lightweight PET module designs significantly outperforms VL-Adapter by 2.92% (3.41%) and LoRA by 3.37% (7.03%) with BART-base (T5-base) on image-text tasks. Furthermore, we validate the enhanced effect of employing our VL-PET designs on existing PET techniques, enabling them to achieve significant performance improvements. Our code is available at https://github.com/HenryHZY/VL-PET.
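
For context, the modular modifications being controlled are bottleneck adapters of roughly this shape, inserted into a frozen PLM; VL-PET's granularity-controlled mechanism modulates such modules and is not reproduced in this sketch.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter: only these small layers are trained."""

    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project to a small bottleneck
        self.up = nn.Linear(bottleneck, dim)    # project back to model width
        self.act = nn.GELU()

    def forward(self, hidden):
        # Residual connection keeps the frozen backbone's behavior as the base.
        return hidden + self.up(self.act(self.down(hidden)))
```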

Exploring the Power of Creative AI Tools and Game-Based Methodologies for Interactive Web-Based Programming

  • paper_url: http://arxiv.org/abs/2308.11649
  • repo_url: None
  • paper_authors: Benjamin Kenwright
  • for: Explores the potential of creative AI tools and game-based methodologies for interactive web-based programming, including enhanced learning experiences and increased user engagement.
  • methods: Examines creative AI tools and game-based methodologies such as generative AI content and user-participatory games.
  • results: Through real-world applications, discusses the benefits, limitations, and ethical considerations of these tools and methods in web development and their impact on user experience and engagement.
    Abstract In recent years, the fields of artificial intelligence and web-based programming have seen tremendous advancements, enabling developers to create dynamic and interactive websites and applications. At the forefront of these advancements, creative AI tools and game-based methodologies have emerged as potent instruments, promising enhanced user experiences and increased engagement in educational environments. This chapter explores the potential of these tools and methodologies for interactive web-based programming, examining their benefits, limitations, and real-world applications. We examine the challenges and ethical considerations that arise when integrating these technologies into web development, such as privacy concerns and the potential for bias in AI-generated content. Through this exploration, we aim to provide insights into the exciting possibilities that creative AI tools and game-based methodologies offer for the future of web-based programming.

Taken by Surprise: Contrast effect for Similarity Scores

  • paper_url: http://arxiv.org/abs/2308.09765
  • repo_url: https://github.com/meetelise/surprise-similarity
  • paper_authors: Thomas C. Bachlechner, Mario Martone, Marjorie Schillo
  • for: Improving the evaluation of similarity between object vector embeddings in natural language processing, information retrieval, and classification tasks.
  • methods: Proposes the surprise score, an ensemble-normalized similarity metric that captures the contrast effect of human perception by quantifying how surprising a given pairwise similarity is relative to the ensemble's pairwise similarities.
  • results: On zero- and few-shot document classification tasks, the surprise score typically performs 10-15% better than raw cosine similarity.
    Abstract Accurately evaluating the similarity of object vector embeddings is of critical importance for natural language processing, information retrieval and classification tasks. Popular similarity scores (e.g. cosine similarity) are based on pairs of embedding vectors and disregard the distribution of the ensemble from which objects are drawn. Human perception of object similarity significantly depends on the context in which the objects appear. In this work we propose the $\textit{surprise score}$, an ensemble-normalized similarity metric that encapsulates the contrast effect of human perception and significantly improves the classification performance on zero- and few-shot document classification tasks. This score quantifies the surprise of finding a given similarity between two elements relative to the pairwise ensemble similarities. We evaluate this metric on zero/few-shot classification and clustering tasks and typically find 10-15% better performance compared to raw cosine similarity. Our code is available at https://github.com/MeetElise/surprise-similarity.
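
A hedged sketch of an ensemble-normalized score in this spirit: rank a pair's cosine similarity against the distribution of pairwise similarities across the ensemble, so the same raw similarity is more or less "surprising" depending on context. The exact normalization may differ from the paper's; see the linked repo.

```python
import numpy as np

def _unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def surprise_score(a, b, ensemble):
    """a, b: 1-D embedding vectors; ensemble: (N, D) matrix of context embeddings."""
    e = _unit(ensemble)
    sims = e @ e.T
    # Background distribution: all distinct pairwise similarities in the ensemble.
    background = sims[np.triu_indices_from(sims, k=1)]
    pair = float(_unit(a) @ _unit(b))
    # Fraction of ensemble pairs less similar than (a, b): high = surprising.
    return (background < pair).mean()
```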

The Impact of Background Removal on Performance of Neural Networks for Fashion Image Classification and Segmentation

  • paper_url: http://arxiv.org/abs/2308.09764
  • repo_url: None
  • paper_authors: Junhui Liang, Ying Liu, Vladimir Vlassov
  • for: Improving fashion image data quality and model performance.
  • methods: Uses salient object detection to remove the background from fashion images.
  • results: Background removal improves accuracy by up to 5% for simple, shallow networks trained from scratch, but does not work well in deep networks due to incompatibility with regularization techniques such as batch normalization, pre-trained initialization, and randomness-introducing data augmentations.
    Abstract Fashion understanding is a hot topic in computer vision, with many applications having great business value in the market. Fashion understanding remains a difficult challenge for computer vision due to the immense diversity of garments and various scenes and backgrounds. In this work, we try removing the background from fashion images to boost data quality and increase model performance. Having fashion images of evident persons in fully visible garments, we can utilize Salient Object Detection to achieve the background removal of fashion data to our expectations. A fashion image with the background removed is claimed as the "rembg" image, contrasting with the original one in the fashion dataset. We conducted extensive comparative experiments with these two types of images on multiple aspects of model training, including model architectures, model initialization, compatibility with other training tricks and data augmentations, and target task types. Our experiments show that background removal can effectively work for fashion data in simple and shallow networks that are not susceptible to overfitting. It can improve model accuracy by up to 5% in the classification on the FashionStyle14 dataset when training models from scratch. However, background removal does not perform well in deep neural networks due to incompatibility with other regularization techniques like batch normalization, pre-trained initialization, and data augmentations introducing randomness. The loss of background pixels invalidates many existing training tricks in the model training, adding the risk of overfitting for deep models.
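
The preprocessing step can be reproduced with an off-the-shelf salient-object remover such as the rembg package (U^2-Net based), used here as a stand-in for whichever detector the paper employed:

```python
from PIL import Image
from rembg import remove

def make_rembg_image(path_in, path_out):
    """Produce the 'rembg' counterpart of a fashion image for training."""
    with Image.open(path_in) as img:
        fg = remove(img)                  # salient object kept, background removed
        fg.convert("RGB").save(path_out)  # flatten the alpha channel for training
```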

Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization

  • paper_url: http://arxiv.org/abs/2308.09716
  • repo_url: https://github.com/soumik-kanad/diff2lip
  • paper_authors: Soumik Mukhopadhyay, Saksham Suri, Ravi Teja Gadde, Abhinav Shrivastava
  • for: lip synchronization in-the-wild, preserving identity, pose, emotions, and image quality
  • methods: audio-conditioned diffusion-based model, trained on Voxceleb2 dataset
  • results: outperforms popular methods like Wav2Lip and PC-AVS in FID metric and MOS of users, results on both reconstruction and cross settings on Voxceleb2 and LRW datasets.
    Abstract The task of lip synchronization (lip-sync) seeks to match the lips of human faces with different audio. It has various applications in the film industry as well as for creating virtual avatars and for video conferencing. This is a challenging problem as one needs to simultaneously introduce detailed, realistic lip movements while preserving the identity, pose, emotions, and image quality. Many of the previous methods trying to solve this problem suffer from image quality degradation due to a lack of complete contextual information. In this paper, we present Diff2Lip, an audio-conditioned diffusion-based model which is able to do lip synchronization in-the-wild while preserving these qualities. We train our model on Voxceleb2, a video dataset containing in-the-wild talking face videos. Extensive studies show that our method outperforms popular methods like Wav2Lip and PC-AVS in the Fréchet inception distance (FID) metric and in the Mean Opinion Scores (MOS) of the users. We show results on both reconstruction (same audio-video inputs) as well as cross (different audio-video inputs) settings on the Voxceleb2 and LRW datasets. Video results and code can be accessed from our project page ( https://soumik-kanad.github.io/diff2lip ).
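To illustrate the generative side, here is a generic audio-conditioned DDPM sampling loop in PyTorch; the `model` interface (noise prediction from a masked face frame, audio features, and a timestep) and the linear beta schedule are assumptions, not Diff2Lip's exact design.

```python
# Sketch: ancestral diffusion sampling conditioned on audio features.
import torch

@torch.no_grad()
def sample_lip_frame(model, masked_face, audio_feat, steps=1000):
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(masked_face)               # start from pure noise
    for t in reversed(range(steps)):
        eps = model(x, masked_face, audio_feat, t)  # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise     # one reverse step
    return x                                        # synthesized frame
```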

SimDA: Simple Diffusion Adapter for Efficient Video Generation

  • paper_url: http://arxiv.org/abs/2308.09710
  • repo_url: None
  • paper_authors: Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang
  • for: proposes a parameter-efficient Text-to-Video (T2V) model that adapts an existing Text-to-Image (T2I) model to video generation through a Simple Diffusion Adapter (SimDA).
  • methods: light-weight spatial and temporal adapters for transfer learning, with the original spatial attention replaced by the proposed Latent-Shift Attention (LSA) for temporal consistency; only a small fraction of the T2I parameters are fine-tuned.
  • results: supports T2V generation in the wild and one-shot video editing with only about 2 minutes of tuning, and, with an additional super-resolution model, generates high-definition (1024x1024) videos.
    Abstract The recent wave of AI-generated content has witnessed the great development and success of Text-to-Image (T2I) technologies. By contrast, Text-to-Video (T2V) still falls short of expectations though attracting increasing interest. Existing works either train from scratch or adapt a large T2I model to videos, both of which are computation and resource expensive. In this work, we propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of the 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way. In particular, we adapt the T2I model for T2V by designing light-weight spatial and temporal adapters for transfer learning. Besides, we change the original spatial attention to the proposed Latent-Shift Attention (LSA) for temporal consistency. With a similar model architecture, we further train a video super-resolution model to generate high-definition (1024x1024) videos. In addition to T2V generation in the wild, SimDA can also be utilized in one-shot video editing with only 2 minutes of tuning. In this way, our method minimizes the training effort, with extremely few tunable parameters for model adaptation.
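The parameter-efficient recipe can be sketched as freezing a pre-trained T2I block and training only a small zero-initialized bottleneck adapter; the module shape below is a generic adapter pattern, not SimDA's actual spatial/temporal adapters or Latent-Shift Attention.

```python
# Sketch: freeze the backbone, train only a residual bottleneck adapter.
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                # x: (batch*frames, tokens, dim)
        return x + self.up(self.act(self.down(x)))

def adapt_block(t2i_block: nn.Module, dim: int) -> nn.Module:
    for p in t2i_block.parameters():     # keep pre-trained weights fixed
        p.requires_grad = False
    return nn.Sequential(t2i_block, Adapter(dim))  # only the adapter trains
```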

Graph of Thoughts: Solving Elaborate Problems with Large Language Models

  • paper_url: http://arxiv.org/abs/2308.09687
  • repo_url: https://github.com/spcl/graph-of-thoughts
  • paper_authors: Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, Torsten Hoefler
  • for: advancing the prompting capabilities of large language models (LLMs) beyond paradigms such as Chain-of-Thought and Tree of Thoughts (ToT).
  • methods: models the information generated by an LLM as an arbitrary graph in which units of information ("LLM thoughts") are vertices and edges are dependencies between them; this allows combining thoughts into synergistic outcomes, distilling the essence of whole networks of thoughts, or enhancing thoughts via feedback loops.
  • results: improves sorting quality by 62% over ToT while reducing costs by >31%; GoT is extensible with new thought transformations, enabling new prompting schemes and bringing LLM reasoning closer to human thinking and brain mechanisms such as recurrence, which form complex networks.
    Abstract We introduce Graph of Thoughts (GoT): a framework that advances prompting capabilities in large language models (LLMs) beyond those offered by paradigms such as Chain-of-Thought or Tree of Thoughts (ToT). The key idea and primary advantage of GoT is the ability to model the information generated by an LLM as an arbitrary graph, where units of information ("LLM thoughts") are vertices, and edges correspond to dependencies between these vertices. This approach enables combining arbitrary LLM thoughts into synergistic outcomes, distilling the essence of whole networks of thoughts, or enhancing thoughts using feedback loops. We illustrate that GoT offers advantages over the state of the art on different tasks, for example increasing the quality of sorting by 62% over ToT, while simultaneously reducing costs by >31%. We ensure that GoT is extensible with new thought transformations and thus can be used to spearhead new prompting schemes. This work brings the LLM reasoning closer to human thinking or brain mechanisms such as recurrence, both of which form complex networks.
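The core abstraction is easy to sketch: thoughts as vertices, dependency edges, and generate/aggregate/refine transformations over them; the `llm` callable and the prompt strings are placeholders, not the GoT repository's API.

```python
# Toy graph-of-thoughts: vertices are thoughts, edges are dependencies.
from dataclasses import dataclass, field

@dataclass
class Thought:
    text: str
    parents: list = field(default_factory=list)  # dependency edges

def generate(llm, thought, k=3):
    """Branch: propose k new thoughts from one parent."""
    return [Thought(llm(f"Extend: {thought.text}"), [thought]) for _ in range(k)]

def aggregate(llm, thoughts):
    """Merge several thoughts into one synergistic vertex."""
    joined = "\n".join(t.text for t in thoughts)
    return Thought(llm(f"Combine into one answer:\n{joined}"), list(thoughts))

def refine(llm, thought):
    """Feedback loop: improve a thought, keeping the old one as its parent."""
    return Thought(llm(f"Improve: {thought.text}"), [thought])
```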

PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Robust 3D Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2308.09678
  • repo_url: https://github.com/hbing-l/posynda
  • paper_authors: Hanbing Liu, Jun-Yan He, Zhi-Qi Cheng, Wangmeng Xiang, Qize Yang, Wenhao Chai, Gaoang Wang, Xu Bao, Bin Luo, Yifeng Geng, Xuansong Xie
  • for: overcome the challenge of adapting 3D human pose estimators to new datasets without extensive target domain annotation.
  • methods: utilize a diffusion-centric structure to simulate the 3D pose distribution in the target domain, incorporate a multi-hypothesis network to create diverse pose hypotheses, and use target-specific source augmentation to obtain the target domain distribution data.
  • results: demonstrate competitive performance on benchmarks such as Human3.6M, MPI-INF-3DHP, and 3DPW, even comparable with the target-trained MixSTE model.
    Abstract The current 3D human pose estimators face challenges in adapting to new datasets due to the scarcity of 2D-3D pose pairs in target domain training sets. We present the Multi-Hypothesis Pose Synthesis Domain Adaptation (PoSynDA) framework to overcome this issue without extensive target domain annotation. Utilizing a diffusion-centric structure, PoSynDA simulates the 3D pose distribution in the target domain, filling the data diversity gap. By incorporating a multi-hypothesis network, it creates diverse pose hypotheses and aligns them with the target domain. Target-specific source augmentation obtains the target domain distribution data from the source domain by decoupling the scale and position parameters. The teacher-student paradigm and low-rank adaptation further refine the process. PoSynDA demonstrates competitive performance on benchmarks such as Human3.6M, MPI-INF-3DHP, and 3DPW, even comparable with the target-trained MixSTE model [Zhang et al., 2022]. This work paves the way for the practical application of 3D human pose estimation. The code is available at https://github.com/hbing-l/PoSynDA.
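The target-specific source augmentation step, decoupling scale and position, might look like the following; the Gaussian statistics are an assumed stand-in for however the target distribution is summarized in practice.

```python
# Sketch: re-scale and re-position source 3D poses toward target statistics.
import numpy as np

def augment_source_poses(src, scale_std, root_mean, root_std,
                         rng=np.random.default_rng()):
    """src: (N, J, 3) source poses with the root joint at index 0."""
    root = src[:, :1, :]                                  # (N, 1, 3) positions
    centered = src - root                                 # decouple position
    scale = rng.normal(1.0, scale_std, (len(src), 1, 1))  # decouple scale
    new_root = rng.normal(root_mean, root_std, root.shape)
    return centered * scale + new_root                    # target-like poses
```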

Unsupervised 3D Pose Estimation with Non-Rigid Structure-from-Motion Modeling

  • paper_url: http://arxiv.org/abs/2308.10705
  • repo_url: None
  • paper_authors: Haorui Ji, Hui Deng, Yuchao Dai, Hongdong Li
  • for: proposes a new model of human pose deformation in motion, together with an accompanying diffusion-based motion prior.
  • methods: inspired by non-rigid structure-from-motion, decomposes 3D skeleton reconstruction into estimating a 3D reference skeleton and a frame-by-frame skeleton deformation; a mixed spatial-temporal NRSfMformer estimates both from a 2D observation sequence and sums them to obtain the pose of each frame.
  • results: outperforms the state of the art on mainstream datasets.
    Abstract Most of the previous 3D human pose estimation work relied on the powerful memory capability of the network to obtain suitable 2D-3D mappings from the training data. Few works have studied the modeling of human posture deformation in motion. In this paper, we propose a new modeling method for human pose deformations and design an accompanying diffusion-based motion prior. Inspired by the field of non-rigid structure-from-motion, we divide the task of reconstructing 3D human skeletons in motion into the estimation of a 3D reference skeleton, and a frame-by-frame skeleton deformation. A mixed spatial-temporal NRSfMformer is used to simultaneously estimate the 3D reference skeleton and the skeleton deformation of each frame from 2D observations sequence, and then sum them to obtain the pose of each frame. Subsequently, a loss term based on the diffusion model is used to ensure that the pipeline learns the correct prior motion knowledge. Finally, we have evaluated our proposed method on mainstream datasets and obtained superior results outperforming the state-of-the-art.
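The reference-plus-deformation decomposition can be sketched as a small PyTorch module; the transformer backbone and head layout below are assumed placeholders, not the paper's NRSfMformer.

```python
# Sketch: pose(t) = shared 3D reference skeleton + per-frame deformation.
import torch
import torch.nn as nn

class PoseDecomposer(nn.Module):
    def __init__(self, joints=17, dim=256):
        super().__init__()
        self.encode = nn.Linear(joints * 2, dim)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), 2)
        self.ref_head = nn.Linear(dim, joints * 3)  # one 3D reference skeleton
        self.def_head = nn.Linear(dim, joints * 3)  # per-frame 3D deformation

    def forward(self, seq2d):                       # seq2d: (B, T, J, 2)
        B, T, J, _ = seq2d.shape
        h = self.temporal(self.encode(seq2d.flatten(2)))     # (B, T, dim)
        reference = self.ref_head(h.mean(dim=1)).view(B, 1, J, 3)
        deformation = self.def_head(h).view(B, T, J, 3)
        return reference + deformation              # 3D pose per frame
```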

GiGaMAE: Generalizable Graph Masked Autoencoder via Collaborative Latent Space Reconstruction

  • paper_url: http://arxiv.org/abs/2308.09663
  • repo_url: https://github.com/sycny/gigamae
  • paper_authors: Yucheng Shi, Yushun Dong, Qiaoyu Tan, Jundong Li, Ninghao Liu
  • for: The purpose of this paper is to propose a self-supervised learning method based on masked autoencoders to generate effective representations on graph data.
  • methods: The method uses a novel graph masked autoencoder framework called GiGaMAE, which learns to collaboratively reconstruct informative and integrated latent embeddings to capture more generalized and comprehensive knowledge.
  • results: GiGaMAE outperforms state-of-the-art baselines on seven datasets for three downstream tasks. The researchers hope that these results will shed light on the design of foundation models on graph-structured data.
    Abstract Self-supervised learning with masked autoencoders has recently gained popularity for its ability to produce effective image or textual representations, which can be applied to various downstream tasks without retraining. However, we observe that the current masked autoencoder models lack good generalization ability on graph data. To tackle this issue, we propose a novel graph masked autoencoder framework called GiGaMAE. Different from existing masked autoencoders that learn node representations by explicitly reconstructing the original graph components (e.g., features or edges), in this paper, we propose to collaboratively reconstruct informative and integrated latent embeddings. By considering embeddings encompassing graph topology and attribute information as reconstruction targets, our model could capture more generalized and comprehensive knowledge. Furthermore, we introduce a mutual information based reconstruction loss that enables the effective reconstruction of multiple targets. This learning objective allows us to differentiate between the exclusive knowledge learned from a single target and common knowledge shared by multiple targets. We evaluate our method on three downstream tasks with seven datasets as benchmarks. Extensive experiments demonstrate the superiority of GiGaMAE against state-of-the-art baselines. We hope our results will shed light on the design of foundation models on graph-structured data. Our code is available at: https://github.com/sycny/GiGaMAE.
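A simplified reading of the collaborative latent-space objective, using InfoNCE as the mutual-information surrogate; the two targets (a topology embedding and an attribute embedding) and the three decoder heads are assumptions about one plausible instantiation, not GiGaMAE's exact loss.

```python
# Sketch: reconstruct multiple latent targets instead of raw graph components.
import torch
import torch.nn.functional as F

def info_nce(pred, target, tau=0.2):
    pred, target = F.normalize(pred, dim=-1), F.normalize(target, dim=-1)
    logits = pred @ target.t() / tau           # (N, N) node similarities
    labels = torch.arange(len(pred), device=pred.device)
    return F.cross_entropy(logits, labels)     # match node i to target i

def multi_target_loss(dec_topo, dec_attr, dec_joint, topo_emb, attr_emb):
    # exclusive knowledge: each decoder head reconstructs one target
    single = info_nce(dec_topo, topo_emb) + info_nce(dec_attr, attr_emb)
    # common knowledge: a joint head (dim = both targets concatenated)
    joint = info_nce(dec_joint, torch.cat([topo_emb, attr_emb], dim=-1))
    return single + joint
```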

Tree-of-Mixed-Thought: Combining Fast and Slow Thinking for Multi-hop Visual Reasoning

  • paper_url: http://arxiv.org/abs/2308.09658
  • repo_url: None
  • paper_authors: Pengbo Hu, Ji Qi, Xingyu Li, Hong Li, Xinqi Wang, Bing Quan, Ruiyu Wang, Yi Zhou
  • for: proposes a large language model (LLM)-based planning algorithm for the plan-search problem in complex, multi-hop visual reasoning.
  • methods: a hierarchical plan-search algorithm that integrates one-stop generation (fast thinking) with Tree-of-Thought-style search (slow thinking), inspired by the dual-system view of human cognition.
  • results: preserves performance while significantly reducing inference steps; the authors also repurpose the PTR and CLEVR datasets into a systematic framework for evaluating the performance and efficiency of LLM-based plan-search algorithms at different difficulty levels.
    Abstract There emerges a promising trend of using large language models (LLMs) to generate code-like plans for complex inference tasks such as visual reasoning. This paradigm, known as LLM-based planning, provides flexibility in problem solving and endows better interpretability. However, current research is mostly limited to basic scenarios of simple questions that can be answered straightforwardly in a few inference steps. Planning for the more challenging multi-hop visual reasoning tasks remains under-explored. Specifically, under multi-hop reasoning situations, the trade-off between accuracy and the complexity of plan-searching becomes prominent. The prevailing algorithms either address the efficiency issue by employing fast one-stop generation or adopt a complex iterative generation method to improve accuracy. Both fail to balance the need for efficiency and performance. Drawing inspiration from the dual system of cognition in the human brain, the fast and the slow thinking processes, we propose a hierarchical plan-searching algorithm that integrates the one-stop reasoning (fast) and the Tree-of-thought (slow). Our approach preserves performance while significantly reducing inference steps. Moreover, we repurpose the PTR and the CLEVR datasets, developing a systematic framework for evaluating the performance and efficiency of LLMs-based plan-search algorithms under reasoning tasks at different levels of difficulty. Extensive experiments demonstrate the superiority of our proposed algorithm in terms of performance and efficiency. The dataset and code will be released soon.
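The fast/slow control flow reduces to: try one-stop generation first, and fall back to tree search only when verification fails. `llm_plan`, `llm_expand`, and `verify` (a boolean check on full plans, a numeric score when called with `partial=True`) are assumed stand-ins for the model calls and plan checker.

```python
# Sketch: System-1 one-stop planning with a System-2 tree-search fallback.
def hierarchical_plan_search(question, llm_plan, llm_expand, verify,
                             beam=3, depth=4):
    fast = llm_plan(question)                  # fast path: one-stop generation
    if verify(fast):
        return fast

    frontier = [[]]                            # slow path: tree-of-thought search
    for _ in range(depth):
        candidates = [p + [s] for p in frontier for s in llm_expand(question, p)]
        if not candidates:
            break
        candidates.sort(key=lambda p: verify(p, partial=True), reverse=True)
        frontier = candidates[:beam]           # keep the best partial plans
        for plan in frontier:
            if verify(plan):
                return plan                    # first fully verified plan wins
    return frontier[0] if frontier and frontier[0] else fast  # best effort
```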

Robust Uncertainty Quantification using Conformalised Monte Carlo Prediction

  • paper_url: http://arxiv.org/abs/2308.09647
  • repo_url: https://github.com/team-daniel/mc-cp
  • paper_authors: Daniel Bethell, Simos Gerasimou, Radu Calinescu
  • for: enabling dependable deployment of deep learning models in safety-critical applications by providing robust uncertainty quantification (UQ).
  • methods: MC-CP, a novel hybrid UQ method that combines an adaptive Monte Carlo (MC) dropout scheme, which saves memory and computation at runtime, with conformal prediction (CP).
  • results: significant improvements over advanced UQ methods such as MC dropout, RAPS, and CQR on both classification and regression benchmarks; MC-CP can easily be added to existing models, making deployment simple.
    Abstract Deploying deep learning models in safety-critical applications remains a very challenging task, mandating the provision of assurances for the dependable operation of these models. Uncertainty quantification (UQ) methods estimate the model's confidence per prediction, informing decision-making by considering the effect of randomness and model misspecification. Despite the advances of state-of-the-art UQ methods, they are computationally expensive or produce conservative prediction sets/intervals. We introduce MC-CP, a novel hybrid UQ method that combines a new adaptive Monte Carlo (MC) dropout method with conformal prediction (CP). MC-CP adaptively modulates the traditional MC dropout at runtime to save memory and computation resources, enabling predictions to be consumed by CP, yielding robust prediction sets/intervals. Through comprehensive experiments, we show that MC-CP delivers significant improvements over advanced UQ methods, like MC dropout, RAPS and CQR, both in classification and regression benchmarks. MC-CP can be easily added to existing models, making its deployment simple.
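Stripped of the adaptive scheduling, the MC-dropout-into-conformal-prediction pipeline looks roughly like this for classification; the nonconformity score and split-conformal recipe below are standard choices assumed here, not necessarily the paper's exact variants.

```python
# Sketch: MC-dropout predictive distribution fed into split conformal sets.
import torch

def mc_dropout_probs(model, x, samples=30):
    model.train()                               # keep dropout layers active
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(-1) for _ in range(samples)])
    return probs.mean(0)                        # (N, C) averaged prediction

def conformal_sets(model, x_cal, y_cal, x_test, alpha=0.1):
    cal = mc_dropout_probs(model, x_cal)
    scores = 1.0 - cal[torch.arange(len(y_cal)), y_cal]   # nonconformity
    n = len(y_cal)
    q = torch.quantile(scores, min(1.0, (n + 1) * (1 - alpha) / n))
    return mc_dropout_probs(model, x_test) >= 1.0 - q     # (N, C) bool sets
```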

cs.CL - 2023-08-19

  • paper_url: http://arxiv.org/abs/2308.10045
  • repo_url: https://github.com/flame-chasers/tbps-clip
  • paper_authors: Min Cao, Yang Bai, Ziyin Zeng, Mang Ye, Min Zhang
  • for: explores the potential of the vision-language pre-training model CLIP for downstream Text-Based Person Search (TBPS) tasks.
  • methods: a comprehensive empirical study of CLIP for TBPS, revisiting critical design considerations such as data augmentation and loss function, and applying practical training tricks.
  • results: a straightforward, incremental, yet strong TBPS-CLIP baseline that attains satisfactory performance without any sophisticated modules; probing experiments on model generalization and model compression demonstrate its effectiveness from various aspects, providing empirical insights and highlighting future research directions.
    Abstract Text-based Person Search (TBPS) aims to retrieve the person images using natural language descriptions. Recently, Contrastive Language Image Pretraining (CLIP), a universal large cross-modal vision-language pre-training model, has performed remarkably on various cross-modal downstream tasks due to its powerful cross-modal semantic learning capacity. TBPS, as a fine-grained cross-modal retrieval task, is likewise seeing a rise of CLIP-based research. In order to explore the potential of the visual-language pre-training model for downstream TBPS tasks, this paper makes the first attempt to conduct a comprehensive empirical study of CLIP for TBPS and thus contribute a straightforward, incremental, yet strong TBPS-CLIP baseline to the TBPS community. We revisit critical design considerations under CLIP, including data augmentation and loss function. The model, with the aforementioned designs and practical training tricks, can attain satisfactory performance without any sophisticated modules. We also conduct probing experiments on TBPS-CLIP in model generalization and model compression, demonstrating the effectiveness of TBPS-CLIP from various aspects. This work is expected to provide empirical insights and highlight future CLIP-based TBPS research.
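The training core of such a baseline is the standard symmetric image-text contrastive step over (person image, description) pairs; the `encode_image`/`encode_text` interface follows CLIP-style libraries and is an assumption, as are the temperature and optimizer handling.

```python
# Sketch: symmetric CLIP-style contrastive fine-tuning step for TBPS.
import torch
import torch.nn.functional as F

def clip_contrastive_step(model, images, texts, optimizer, tau=0.07):
    img = F.normalize(model.encode_image(images), dim=-1)
    txt = F.normalize(model.encode_text(texts), dim=-1)
    logits = img @ txt.t() / tau
    labels = torch.arange(len(images), device=images.device)
    loss = (F.cross_entropy(logits, labels) +          # image -> text
            F.cross_entropy(logits.t(), labels)) / 2   # text -> image
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```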

GameEval: Evaluating LLMs on Conversational Games

  • paper_url: http://arxiv.org/abs/2308.10032
  • repo_url: None
  • paper_authors: Dan Qiao, Chenfei Wu, Yaobo Liang, Juntao Li, Nan Duan
  • for: This paper aims to evaluate large language models (LLMs) through goal-driven conversational games, addressing the limitations of existing evaluation methods.
  • methods: The proposed approach, called GameEval, treats LLMs as game players and assigns them distinct roles with specific goals achieved through conversations of various forms, such as discussion, question answering, and voting.
  • results: Extensive experiments show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems.
    Abstract The rapid advancements in large language models (LLMs) have presented challenges in evaluating those models. Existing evaluation methods are either reference-based or preference-based, which inevitably need human intervention or introduce test bias caused by evaluator models. In this paper, we propose GameEval, a novel approach to evaluating LLMs through goal-driven conversational games, overcoming the limitations of previous methods. GameEval treats LLMs as game players and assigns them distinct roles with specific goals achieved by launching conversations of various forms, including discussion, question answering, and voting. We design three unique games with cooperative or adversarial objectives, accompanied by corresponding evaluation metrics, to show how this new paradigm comprehensively evaluates model performance. Through extensive experiments, we show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems. Our public anonymous code is available at https://github.com/GameEval/GameEval.

ControlRetriever: Harnessing the Power of Instructions for Controllable Retrieval

  • paper_url: http://arxiv.org/abs/2308.10025
  • repo_url: None
  • paper_authors: Kaihang Pan, Juncheng Li, Hongye Song, Hao Fei, Wei Ji, Shuo Zhang, Jun Lin, Xiaozhong Liu, Siliang Tang
  • for: enabling a single dense retrieval model to perform well across diverse retrieval tasks with distinct search intents.
  • methods: ControlRetriever, a generic, parameter-isolated approach built on the foundation of ControlNet, which controls dense retrieval models with natural-language instructions that explicitly describe retrieval intent, together with an LLM-guided instruction synthesizing and iterative training strategy over extensive automatically generated retrieval data.
  • results: on the BEIR benchmark, with only natural-language descriptions of each task's retrieval intent and no task-specific tuning, ControlRetriever significantly outperforms baselines built with task-specific retrievers and achieves state-of-the-art zero-shot performance.
    Abstract Recent studies have shown that dense retrieval models, lacking dedicated training data, struggle to perform well across diverse retrieval tasks, as different retrieval tasks often entail distinct search intents. To address this challenge, in this work we introduce ControlRetriever, a generic and efficient approach with a parameter isolated architecture, capable of controlling dense retrieval models to directly perform varied retrieval tasks, harnessing the power of instructions that explicitly describe retrieval intents in natural language. Leveraging the foundation of ControlNet, which has proven powerful in text-to-image generation, ControlRetriever imbues different retrieval models with the new capacity of controllable retrieval, all while being guided by task-specific instructions. Furthermore, we propose a novel LLM guided Instruction Synthesizing and Iterative Training strategy, which iteratively tunes ControlRetriever based on extensive automatically-generated retrieval data with diverse instructions by capitalizing the advancement of large language models. Extensive experiments show that in the BEIR benchmark, with only natural language descriptions of specific retrieval intent for each task, ControlRetriever, as a unified multi-task retrieval system without task-specific tuning, significantly outperforms baseline methods designed with task-specific retrievers and also achieves state-of-the-art zero-shot performance.

HICL: Hashtag-Driven In-Context Learning for Social Media Natural Language Understanding

  • paper_url: http://arxiv.org/abs/2308.09985
  • repo_url: https://github.com/albertan017/hicl
  • paper_authors: Hanzhuo Tan, Chunpu Xu, Jing Li, Yuqun Zhang, Zeyang Fang, Zeyu Chen, Baohua Lai
  • for: addresses the issue of compromised performance in existing natural language understanding (NLU) models when faced with short and noisy social media content.
  • methods: leverages in-context learning (ICL) and a novel hashtag-driven in-context learning (HICL) framework, which pre-trains a model #Encoder using hashtags to drive BERT-based pre-training through contrastive learning, and employs a gradient-based method to identify trigger terms useful in fusing information from both sources.
  • results: substantially advances the previous state-of-the-art results on seven downstream tasks; combining the source input with a top-retrieved post from #Encoder proves more effective than using semantically similar posts, and trigger words largely help in merging context from the source and retrieved posts.
    Abstract Natural language understanding (NLU) is integral to various social media applications. However, existing NLU models rely heavily on context for semantic learning, resulting in compromised performance when faced with short and noisy social media content. To address this issue, we leverage in-context learning (ICL), wherein language models learn to make inferences by conditioning on a handful of demonstrations to enrich the context and propose a novel hashtag-driven in-context learning (HICL) framework. Concretely, we pre-train a model #Encoder, which employs #hashtags (user-annotated topic labels) to drive BERT-based pre-training through contrastive learning. Our objective here is to enable #Encoder to gain the ability to incorporate topic-related semantic information, which allows it to retrieve topic-related posts to enrich contexts and enhance social media NLU with noisy contexts. To further integrate the retrieved context with the source text, we employ a gradient-based method to identify trigger terms useful in fusing information from both sources. For empirical studies, we collected 45M tweets to set up an in-context NLU benchmark, and the experimental results on seven downstream tasks show that HICL substantially advances the previous state-of-the-art results. Furthermore, we conducted extensive analyzes and found that: (1) combining source input with a top-retrieved post from #Encoder is more effective than using semantically similar posts; (2) trigger words can largely benefit in merging context from the source and retrieved posts.
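One way to realize the gradient-based trigger-term idea is to score each token of the fused input by the gradient norm of the task loss with respect to its embedding; the HuggingFace-style model interface is an assumption, and this is not claimed to be HICL's exact procedure.

```python
# Sketch: rank tokens by gradient norm to find useful trigger terms.
import torch

def trigger_scores(model, input_ids, labels):
    """input_ids: (1, L) token ids; returns one saliency score per token."""
    emb_layer = model.get_input_embeddings()
    embeds = emb_layer(input_ids).detach().requires_grad_(True)
    loss = model(inputs_embeds=embeds, labels=labels).loss
    loss.backward()
    return embeds.grad.norm(dim=-1).squeeze(0)  # (L,) per-token scores
```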

FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models

  • paper_url: http://arxiv.org/abs/2308.09975
  • repo_url: https://github.com/sufe-aiflm-lab/fineval
  • paper_authors: Liwen Zhang, Weige Cai, Zhaowei Liu, Zhi Yang, Wei Dai, Yujie Liao, Qianru Qin, Yifei Li, Xingyu Liu, Zhiqiang Liu, Zhoufan Zhu, Anbo Wu, Xin Guo, Yun Chen
  • for: evaluating the financial-domain knowledge of large language models (LLMs) with a dedicated, feature-rich benchmark.
  • methods: FinEval, a collection of 4,661 high-quality multiple-choice questions spanning 34 academic subjects across Finance, Economy, Accounting, and Certificate, evaluated with a range of prompt types including zero-shot, few-shot, answer-only, and chain-of-thought prompts on state-of-the-art Chinese and English LLMs.
  • results: only GPT-4 achieves accuracy close to 70% across prompt settings, indicating significant growth potential for LLMs in financial-domain knowledge.
    Abstract Large language models (LLMs) have demonstrated exceptional performance in various natural language processing tasks, yet their efficacy in more challenging and domain-specific tasks remains largely unexplored. This paper presents FinEval, a benchmark specifically designed for the financial domain knowledge in the LLMs. FinEval is a collection of high-quality multiple-choice questions covering Finance, Economy, Accounting, and Certificate. It includes 4,661 questions spanning 34 different academic subjects. To ensure a comprehensive model performance evaluation, FinEval employs a range of prompt types, including zero-shot and few-shot prompts, as well as answer-only and chain-of-thought prompts. Evaluating state-of-the-art Chinese and English LLMs on FinEval, the results show that only GPT-4 achieved an accuracy close to 70% in different prompt settings, indicating significant growth potential for LLMs in the financial domain knowledge. Our work offers a more comprehensive financial knowledge evaluation benchmark, utilizing data of mock exams and covering a wide range of evaluated LLMs.

Utilizing Semantic Textual Similarity for Clinical Survey Data Feature Selection

  • paper_url: http://arxiv.org/abs/2308.09892
  • repo_url: https://github.com/bcwarner/sts-select
  • paper_authors: Benjamin C. Warner, Ziqi Xu, Simon Haroutounian, Thomas Kannampallil, Chenyang Lu
  • for: addressing the problem that survey data has many features but comparatively few examples, so machine learning models trained on it tend to overfit and generalize poorly.
  • methods: feature selection that exploits the textual names of features, using language models to compute semantic textual similarity (STS) scores between feature names and target names, applied both directly and within the minimal-redundancy-maximal-relevance (mRMR) algorithm.
  • results: on preliminary clinical survey data on persistent post-surgical pain (PPSP), features selected with STS yield higher-performing models than traditional feature selection algorithms.
    Abstract Survey data can contain a high number of features while having a comparatively low quantity of examples. Machine learning models that attempt to predict outcomes from survey data under these conditions can overfit and result in poor generalizability. One remedy to this issue is feature selection, which attempts to select an optimal subset of features to learn upon. A relatively unexplored source of information in the feature selection process is the usage of textual names of features, which may be semantically indicative of which features are relevant to a target outcome. The relationships between feature names and target names can be evaluated using language models (LMs) to produce semantic textual similarity (STS) scores, which can then be used to select features. We examine the performance using STS to select features directly and in the minimal-redundancy-maximal-relevance (mRMR) algorithm. The performance of STS as a feature selection metric is evaluated against preliminary survey data collected as a part of a clinical study on persistent post-surgical pain (PPSP). The results suggest that features selected with STS can result in higher performance models compared to traditional feature selection algorithms.
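The direct STS selector is only a few lines with a sentence encoder; the specific model checkpoint below is an assumed generic choice, not the paper's.

```python
# Sketch: select features whose names are semantically closest to the target.
from sentence_transformers import SentenceTransformer, util

def sts_select(feature_names, target_name, k=10):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    feats = model.encode(feature_names, convert_to_tensor=True)
    target = model.encode(target_name, convert_to_tensor=True)
    sims = util.cos_sim(feats, target).squeeze(-1)  # one STS score per feature
    top = sims.argsort(descending=True)[:k]
    return [feature_names[i] for i in top]

# e.g. sts_select(["age", "pain_score_day_3", "zip_code"], "persistent pain")
```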

Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi

  • paper_url: http://arxiv.org/abs/2308.09862
  • repo_url: None
  • paper_authors: Maithili Sabane, Onkar Litake, Aman Chadha
  • for: developing a Question Answering dataset for low-resource languages Hindi and Marathi
  • methods: novel approach for translating the SQuAD 2.0 dataset into Hindi and Marathi
  • results: release the largest Question-Answering dataset available for these languages, with each dataset containing 28,000 samples, along with the best-performing models for both Hindi and Marathi.
    Abstract The recent advances in deep-learning have led to the development of highly sophisticated systems with an unquenchable appetite for data. On the other hand, building good deep-learning models for low-resource languages remains a challenging task. This paper focuses on developing a Question Answering dataset for two such languages- Hindi and Marathi. Despite Hindi being the 3rd most spoken language worldwide, with 345 million speakers, and Marathi being the 11th most spoken language globally, with 83.2 million speakers, both languages face limited resources for building efficient Question Answering systems. To tackle the challenge of data scarcity, we have developed a novel approach for translating the SQuAD 2.0 dataset into Hindi and Marathi. We release the largest Question-Answering dataset available for these languages, with each dataset containing 28,000 samples. We evaluate the dataset on various architectures and release the best-performing models for both Hindi and Marathi, which will facilitate further research in these languages. Leveraging similarity tools, our method holds the potential to create datasets in diverse languages, thereby enhancing the understanding of natural language across varied linguistic contexts. Our fine-tuned models, code, and dataset will be made publicly available.

Black-box Adversarial Attacks against Dense Retrieval Models: A Multi-view Contrastive Learning Method

  • paper_url: http://arxiv.org/abs/2308.09861
  • repo_url: None
  • paper_authors: Yu-An Liu, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Wei Chen, Yixing Fan, Xueqi Cheng
  • for: studying the adversarial robustness of dense retrieval (DR) models, which has received far less attention than that of neural ranking models (NRMs).
  • methods: introduces the adversarial retrieval attack (AREA) task in a realistic decision-based black-box setting; after showing that attack methods designed for NRMs transfer poorly to DR models, formalizes the attack as a contrastive learning problem in a multi-view representation space with view-wise supervision signals.
  • results: significantly outperforms existing attack strategies, misleading DR models with small, indiscernible text perturbations.
    Abstract Neural ranking models (NRMs) and dense retrieval (DR) models have given rise to substantial improvements in overall retrieval performance. In addition to their effectiveness, and motivated by the proven lack of robustness of deep learning-based approaches in other areas, there is growing interest in the robustness of deep learning-based approaches to the core retrieval problem. Adversarial attack methods that have so far been developed mainly focus on attacking NRMs, with very little attention being paid to the robustness of DR models. In this paper, we introduce the adversarial retrieval attack (AREA) task. The AREA task is meant to trick DR models into retrieving a target document that is outside the initial set of candidate documents retrieved by the DR model in response to a query. We consider the decision-based black-box adversarial setting, which is realistic in real-world search engines. To address the AREA task, we first employ existing adversarial attack methods designed for NRMs. We find that the promising results that have previously been reported on attacking NRMs, do not generalize to DR models: these methods underperform a simple term spamming method. We attribute the observed lack of generalizability to the interaction-focused architecture of NRMs, which emphasizes fine-grained relevance matching. DR models follow a different representation-focused architecture that prioritizes coarse-grained representations. We propose to formalize attacks on DR models as a contrastive learning problem in a multi-view representation space. The core idea is to encourage the consistency between each view representation of the target document and its corresponding viewer via view-wise supervision signals. Experimental results demonstrate that the proposed method can significantly outperform existing attack strategies in misleading the DR model with small indiscernible text perturbations.

How susceptible are LLMs to Logical Fallacies?

  • paper_url: http://arxiv.org/abs/2308.09853
  • repo_url: https://github.com/Amir-pyh/LOGICOM
  • paper_authors: Amirreza Payandeh, Dan Pluth, Jordan Hosier, Xuesu Xiao, Vijay K. Gurbani
  • for: investigates the rational-thinking capability of large language models (LLMs) in multi-round argumentative debates, in particular the impact of fallacious arguments on their logical reasoning performance.
  • methods: the Logic Competence Measurement Benchmark (LOGICOM), in which two agents, a persuader and a debater, hold a multi-round debate on a controversial topic; the persuader tries to convince the debater of the correctness of its claim, using either logical reasoning or logical fallacies.
  • results: GPT-3.5 and GPT-4 can adjust their opinions through reasoning, but when presented with logical fallacies they are erroneously convinced 41% and 69% more often, respectively, than when logical reasoning is used; a new dataset of over 5k pairs of logical vs. fallacious arguments is also released.
    Abstract This paper investigates the rational thinking capability of Large Language Models (LLMs) in multi-round argumentative debates by exploring the impact of fallacious arguments on their logical reasoning performance. More specifically, we present Logic Competence Measurement Benchmark (LOGICOM), a diagnostic benchmark to assess the robustness of LLMs against logical fallacies. LOGICOM involves two agents: a persuader and a debater engaging in a multi-round debate on a controversial topic, where the persuader tries to convince the debater of the correctness of its claim. First, LOGICOM assesses the potential of LLMs to change their opinions through reasoning. Then, it evaluates the debater's performance in logical reasoning by contrasting the scenario where the persuader employs logical fallacies against one where logical reasoning is used. We use this benchmark to evaluate the performance of GPT-3.5 and GPT-4 using a dataset containing controversial topics, claims, and reasons supporting them. Our findings indicate that both GPT-3.5 and GPT-4 can adjust their opinion through reasoning. However, when presented with logical fallacies, GPT-3.5 and GPT-4 are erroneously convinced 41% and 69% more often, respectively, compared to when logical reasoning is used. Finally, we introduce a new dataset containing over 5k pairs of logical vs. fallacious arguments. The source code and dataset of this work are made publicly available.

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

  • paper_url: http://arxiv.org/abs/2308.09778
  • repo_url: None
  • paper_authors: Navid Rajabi, Jana Kosecka
  • for: assessing how well large vision-and-language models (VLMs) perform on visual reasoning tasks, in particular understanding spatial relations.
  • methods: fine-grained compositional grounding of spatial relationships, with a bottom-up approach that combines evidence from grounded noun phrases (objects and their locations) to rank spatial clauses and evaluate spatial-relationship reasoning.
  • results: existing vision-language models reason poorly about spatial relations, leaving a large gap to human performance; the proposed approach helps close this gap and highlights the models' respective abilities.
    Abstract With the advances in large scale vision-and-language models (VLMs) it is of interest to assess their performance on various visual reasoning tasks such as counting, referring expressions and general visual question answering. The focus of this work is to study the ability of these models to understanding spatial relations. Previously, this has been tackled using image-text matching (Liu, Emerson, and Collier 2022) or visual question answering task, both showing poor performance and a large gap compared to human performance. To better understand the gap, we present fine-grained compositional grounding of spatial relationships and propose a bottom up approach for ranking spatial clauses and evaluating the performance of spatial relationship reasoning task. We propose to combine the evidence from grounding noun phrases corresponding to objects and their locations to compute the final rank of the spatial clause. We demonstrate the approach on representative vision-language models (Tan and Bansal 2019; Gupta et al. 2022; Kamath et al. 2021) and compare and highlight their abilities to reason about spatial relationships.

YORC: Yoruba Reading Comprehension dataset

  • paper_url: http://arxiv.org/abs/2308.09768
  • repo_url: None
  • paper_authors: Anuoluwapo Aremu, Jesujoba O. Alabi, David Ifeoluwa Adelani
  • for: creates YORC, a new multi-choice Yoruba reading comprehension dataset based on Yoruba high-school reading comprehension examinations.
  • methods: cross-lingual transfer from the existing English RACE dataset using a pre-trained encoder-only model, plus prompting of large language models (LLMs) such as GPT-4.
  • results: provides baseline results for both approaches.
    Abstract In this paper, we create YORC: a new multi-choice Yoruba Reading Comprehension dataset that is based on Yoruba high-school reading comprehension examination. We provide baseline results by performing cross-lingual transfer using existing English RACE dataset based on a pre-trained encoder-only model. Additionally, we provide results by prompting large language models (LLMs) like GPT-4.

OCR Language Models with Custom Vocabularies

  • paper_url: http://arxiv.org/abs/2308.09671
  • repo_url: None
  • paper_authors: Peter Garst, Reeve Ingle, Yasuhisa Fujii
  • for: improving OCR accuracy on documents from specialized domains.
  • methods: generates a domain-specific word-based language model and attaches it to a general language model at run time, paired with a modified CTC beam search decoder that keeps hypotheses in contention based on possible future completions of vocabulary words.
  • results: a substantial reduction in word error rate on material from specialized domains.
    Abstract Language models are useful adjuncts to optical models for producing accurate optical character recognition (OCR) results. One factor which limits the power of language models in this context is the existence of many specialized domains with language statistics very different from those implied by a general language model - think of checks, medical prescriptions, and many other specialized document classes. This paper introduces an algorithm for efficiently generating and attaching a domain specific word based language model at run time to a general language model in an OCR system. In order to best use this model the paper also introduces a modified CTC beam search decoder which effectively allows hypotheses to remain in contention based on possible future completion of vocabulary words. The result is a substantial reduction in word error rate in recognizing material from specialized domains.

Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment

  • paper_url: http://arxiv.org/abs/2308.09662
  • repo_url: https://github.com/declare-lab/red-instruct
  • paper_authors: Rishabh Bhardwaj, Soujanya Poria
  • for: addresses the risk of large language models (LLMs) producing harmful outputs and develops a safety evaluation benchmark for LLMs.
  • methods: RED-EVAL, a safety evaluation benchmark that red-teams LLMs with Chain of Utterances (CoU) prompting, and RED-INSTRUCT, a method for aligning LLMs toward safe and helpful responses.
  • results: even widely deployed LLMs are susceptible to harmful prompts, with more than 65% and 73% of harmful queries eliciting unethical responses from GPT-4 and ChatGPT, respectively; RED-EVAL is consistent across 8 open-source LLMs, which generate harmful responses in more than 86% of red-teaming attempts; the proposed safety alignment (RED-INSTRUCT) improves LLM safety while preserving utility.
    Abstract Large language models (LLMs) have taken the world by storm with their massive multi-tasking capabilities simply by optimizing over a next-word prediction objective. With the emergence of their properties and encoded knowledge, the risk of LLMs producing harmful outputs increases, making them unfit for scalable deployment for the public. In this work, we propose a new safety evaluation benchmark RED-EVAL that carries out red-teaming. We show that even widely deployed models are susceptible to the Chain of Utterances-based (CoU) prompting, jailbreaking closed source LLM-based systems such as GPT-4 and ChatGPT to unethically respond to more than 65% and 73% of harmful queries. We also demonstrate the consistency of the RED-EVAL across 8 open-source LLMs in generating harmful responses in more than 86% of the red-teaming attempts. Next, we propose RED-INSTRUCT--An approach for the safety alignment of LLMs. It constitutes two phases: 1) HARMFULQA data collection: Leveraging CoU prompting, we collect a dataset that consists of 1.9K harmful questions covering a wide range of topics, 9.5K safe and 7.3K harmful conversations from ChatGPT; 2) SAFE-ALIGN: We demonstrate how the conversational dataset can be used for the safety alignment of LLMs by minimizing the negative log-likelihood over helpful responses and penalizing over harmful responses by gradient ascent over sample loss. Our model STARLING, a fine-tuned Vicuna-7B, is observed to be more safely aligned when evaluated on RED-EVAL and HHH benchmarks while preserving the utility of the baseline models (TruthfulQA, MMLU, and BBH).
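A simplified reading of the SAFE-ALIGN objective, minimizing negative log-likelihood on helpful responses while ascending on harmful ones; the penalty weight and the HuggingFace-style batches (dicts with input_ids and attention_mask) are assumptions, not the paper's exact training configuration.

```python
# Sketch: descend on helpful conversations, ascend on harmful ones.
def safe_align_loss(model, helpful_batch, harmful_batch, penalty=0.1):
    # standard causal-LM NLL on helpful conversations
    helpful = model(**helpful_batch, labels=helpful_batch["input_ids"]).loss
    # gradient ascent on harmful conversations via a subtracted loss term
    harmful = model(**harmful_batch, labels=harmful_batch["input_ids"]).loss
    return helpful - penalty * harmful
```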

cs.LG - 2023-08-19

Efficient Representation Learning for Healthcare with Cross-Architectural Self-Supervision

  • paper_url: http://arxiv.org/abs/2308.10064
  • repo_url: https://github.com/pranavsinghps1/CASS
  • paper_authors: Pranav Singh, Jacopo Cirrone
  • for: overcoming the extreme computational requirements that hinder representation learning in healthcare and biomedical applications.
  • methods: Cross Architectural - Self Supervision (CASS), a novel siamese self-supervised learning approach that synergistically combines a Transformer and a convolutional neural network (CNN) for efficient learning.
  • results: CASS-trained CNNs and Transformers outperform existing self-supervised methods on four diverse healthcare datasets, especially when fine-tuning with only 1% labeled data, while cutting pretraining time by 69% and remaining robust to variations in batch size and pretraining epochs.
    Abstract In healthcare and biomedical applications, extreme computational requirements pose a significant barrier to adopting representation learning. Representation learning can enhance the performance of deep learning architectures by learning useful priors from limited medical data. However, state-of-the-art self-supervised techniques suffer from reduced performance when using smaller batch sizes or shorter pretraining epochs, which are more practical in clinical settings. We present Cross Architectural - Self Supervision (CASS) in response to this challenge. This novel siamese self-supervised learning approach synergistically leverages Transformer and Convolutional Neural Networks (CNN) for efficient learning. Our empirical evaluation demonstrates that CASS-trained CNNs and Transformers outperform existing self-supervised learning methods across four diverse healthcare datasets. With only 1% labeled data for finetuning, CASS achieves a 3.8% average improvement; with 10% labeled data, it gains 5.9%; and with 100% labeled data, it reaches a remarkable 10.13% enhancement. Notably, CASS reduces pretraining time by 69% compared to state-of-the-art methods, making it more amenable to clinical implementation. We also demonstrate that CASS is considerably more robust to variations in batch size and pretraining epochs, making it a suitable candidate for machine learning in healthcare applications.
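The cross-architectural idea can be sketched as a two-branch agreement objective between a CNN and a Transformer on the same image; the backbones, projection size, and cosine loss below are illustrative assumptions, not CASS's exact formulation.

```python
# Sketch: train a CNN and a ViT to agree on embeddings of the same image.
import torch
import torch.nn.functional as F
import torchvision.models as tvm
import timm

cnn = tvm.resnet50(num_classes=128)                 # 128-d projection head
vit = timm.create_model("vit_base_patch16_224", num_classes=128)
opt = torch.optim.AdamW(list(cnn.parameters()) + list(vit.parameters()), lr=1e-4)

def cass_step(images):
    z_cnn = F.normalize(cnn(images), dim=-1)
    z_vit = F.normalize(vit(images), dim=-1)
    loss = (1 - (z_cnn * z_vit).sum(-1)).mean()     # cross-architecture agreement
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

Because both branches train against each other rather than against augmented views of themselves, a single pass per image suffices, which is consistent with the reported pretraining-time savings.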

Accelerating Exact Combinatorial Optimization via RL-based Initialization – A Case Study in Scheduling

  • paper_url: http://arxiv.org/abs/2308.11652
  • repo_url: None
  • paper_authors: Jiaqi Yin, Cunxi Yu
  • for: proposes an innovative approach that uses machine learning to accelerate exact combinatorial optimization, with computation-graph scheduling on the EdgeTPU platform as a case study.
  • methods: a two-phase RL-to-ILP scheduling framework with three steps: 1) an RL solver acts as a coarse-grain scheduler, 2) the RL solution is relaxed, and 3) the final schedule is solved exactly via ILP.
  • results: matches exact scheduling methods in quality while guaranteeing optimality and determinism at heuristic-level runtime cost, achieving up to 128x speedups and improved on-chip inference runtime for ImageNet DNN computation graphs on real EdgeTPU platforms.
    Abstract Scheduling on dataflow graphs (also known as computation graphs) is an NP-hard problem. The traditional exact methods are limited by runtime complexity, while reinforcement learning (RL) and heuristic-based approaches struggle with determinism and solution quality. This research aims to develop an innovative approach that employs machine learning (ML) for addressing combinatorial optimization problems, using scheduling as a case study. The goal is to provide guarantees in optimality and determinism while maintaining the runtime cost of heuristic methods. Specifically, we introduce a novel two-phase RL-to-ILP scheduling framework, which includes three steps: 1) RL solver acts as coarse-grain scheduler, 2) solution relaxation and 3) exact solving via ILP. Our framework demonstrates the same scheduling performance compared with using exact scheduling methods while achieving up to 128× speed improvements. This was conducted on actual EdgeTPU platforms, utilizing ImageNet DNN computation graphs as input. Additionally, the framework offers improved on-chip inference runtime and acceleration compared to the commercially available EdgeTPU compiler.

The Snowflake Hypothesis: Training Deep GNN with One Node One Receptive field

  • paper_url: http://arxiv.org/abs/2308.10051
  • repo_url: None
  • paper_authors: Kun Wang, Guohao Li, Shilong Wang, Guibin Zhang, Kai Wang, Yang You, Xiaojiang Peng, Yuxuan Liang, Yang Wang
  • for: This paper aims to improve the performance and interpretability of deep graph neural networks (GNNs) by introducing the Snowflake Hypothesis, which posits that each node in a graph should have its own unique receptive field.
  • methods: The paper conducts a systematic study of deeper GNN research trajectories, employing the simplest gradient and node-level cosine distance as guiding principles to regulate the aggregation depth for each node. The authors also compare their approach with different aggregation strategies on multiple benchmarks.
  • results: The paper demonstrates that the Snowflake Hypothesis can serve as a universal operator for a range of tasks, and it displays tremendous potential on deep GNNs. The authors show that their approach can be applied to various GNN frameworks, enhancing their effectiveness when operating in-depth, and guiding the selection of the optimal network depth in an explainable and generalizable way.
    Abstract Despite Graph Neural Networks demonstrating considerable promise in graph representation learning tasks, GNNs predominantly face significant issues with over-fitting and over-smoothing as they go deeper, much as models in the computer vision realm do. In this work, we conduct a systematic study of deeper GNN research trajectories. Our findings indicate that the current success of deep GNNs primarily stems from (I) the adoption of innovations from CNNs, such as residual/skip connections, or (II) the tailor-made aggregation algorithms like DropEdge. However, these algorithms often lack intrinsic interpretability and indiscriminately treat all nodes within a given layer in a similar manner, thereby failing to capture the nuanced differences among various nodes. To this end, we introduce the Snowflake Hypothesis -- a novel paradigm underpinning the concept of ``one node, one receptive field''. The hypothesis draws inspiration from the unique and individualistic patterns of each snowflake, proposing a corresponding uniqueness in the receptive fields of nodes in the GNNs. We employ the simplest gradient and node-level cosine distance as guiding principles to regulate the aggregation depth for each node, and conduct comprehensive experiments including: (1) different training schemes; (2) various shallow and deep GNN backbones, and (3) various numbers of layers (8, 16, 32, 64) on multiple benchmarks (six graphs including dense graphs with millions of nodes); (4) comparisons with different aggregation strategies. The observational results demonstrate that our hypothesis can serve as a universal operator for a range of tasks, and it displays tremendous potential on deep GNNs. It can be applied to various GNN frameworks, enhancing their effectiveness when operating in-depth, and guiding the selection of the optimal network depth in an explainable and generalizable way.
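
A hedged sketch of the gating idea as this abstract describes it: freeze a node's embedding once the cosine distance between its consecutive-layer representations falls below a threshold, giving each node its own effective aggregation depth. The threshold, toy backbone, and freeze-at-current-layer choice are assumptions.

```python
# "One node, one receptive field": per-node early stopping of aggregation,
# gated by node-level cosine distance between consecutive layers.
import torch
import torch.nn.functional as F

def snowflake_forward(layers, x, adj, tau=0.02):
    """layers: list of GNN layer callables taking (h, adj); adj: dense [N, N]."""
    active = torch.ones(x.size(0), dtype=torch.bool)   # nodes still aggregating
    h = x
    for layer in layers:
        h_new = layer(h, adj)
        dist = 1 - F.cosine_similarity(h_new, h, dim=-1)  # per-node distance
        stop = dist < tau                                  # converged nodes
        update = (active & ~stop).unsqueeze(-1)
        h = torch.where(update, h_new, h)                  # freeze the rest
        active = active & ~stop
    return h

# Toy usage with random linear "GCN" layers (W captured per layer).
N, d = 5, 8
adj = torch.eye(N)
layers = [lambda h, a, W=torch.randn(d, d) * 0.1: a @ h @ W for _ in range(4)]
out = snowflake_forward(layers, torch.randn(N, d), adj)
print(out.shape)    # torch.Size([5, 8])
```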

Computing the Vapnik Chervonenkis Dimension for Non-Discrete Settings

  • paper_url: http://arxiv.org/abs/2308.10041
  • repo_url: None
  • paper_authors: Mohammed Nechba, Mouhajir Mohamed, Sedjari Yassine
  • for: This paper aims to develop a method for approximately computing the VC dimension without constraints on the concept class or its domain set.
  • methods: The Empirical Risk Minimization (ERM) learning paradigm is used as a new tool to characterize the shattering property of a concept class.
  • results: The paper presents a method for approximately computing the VC dimension that no longer requires the concept class or its domain set to be finite.
    Abstract In 1984, Valiant [7] introduced the Probably Approximately Correct (PAC) learning framework for boolean function classes. Blumer et al. [2] extended this model in 1989 by introducing the VC dimension as a tool to characterize the learnability of PAC. The VC dimension was based on the work of Vapnik and Chervonenkis in 1971 [8], who introduced a tool called the growth function to characterize the shattering property. Researchers have since determined the VC dimension for specific classes, and efforts have been made to develop an algorithm that can calculate the VC dimension for any concept class. In 1991, Linial, Mansour, and Rivest [4] presented an algorithm for computing the VC dimension in the discrete setting, assuming that both the concept class and domain set were finite. However, no attempts had been made to design an algorithm that could compute the VC dimension in the general setting. Therefore, our work focuses on developing a method to approximately compute the VC dimension without constraints on the concept classes or their domain set. Our approach is based on our finding that the Empirical Risk Minimization (ERM) learning paradigm can be used as a new tool to characterize the shattering property of a concept class.
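
A hedged sketch of the ERM-based shattering test the abstract suggests: a point set is shattered if an ERM learner fits every labeling with zero empirical risk, and the approximate VC dimension is the largest size for which some sampled set is shattered. The linear concept class, sampling sizes, and trial counts are illustrative assumptions.

```python
# Approximate VC dimension via ERM: try to fit all labelings of sampled point
# sets. Uses a near-hard-margin linear classifier in 2D (true VC dimension 3).
import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def shattered(points):
    """True if an ERM linear classifier realizes every labeling of `points`."""
    for labels in itertools.product([0, 1], repeat=len(points)):
        if len(set(labels)) < 2:   # constant labelings are trivially realizable
            continue
        clf = LogisticRegression(C=1e6, max_iter=1000).fit(points, labels)
        if clf.score(points, labels) < 1.0:
            return False
    return True

def approx_vc_dim(dim=2, max_d=6, trials=30):
    vc = 0
    for d in range(1, max_d + 1):
        if any(shattered(rng.normal(size=(d, dim))) for _ in range(trials)):
            vc = d                 # some d-point set is shattered
        else:
            break
    return vc

print(approx_vc_dim())             # typically prints 3 for linear in 2D
```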

Physics-guided training of GAN to improve accuracy in airfoil design synthesis

  • paper_url: http://arxiv.org/abs/2308.10038
  • repo_url: None
  • paper_authors: Kazunari Wada, Katsuyuki Suzuki, Kazuo Yonekura
  • for: GANs are used for design synthesis of mechanical shapes, but they sometimes output physically unreasonable ones. For example, when a GAN is trained to output airfoil shapes meeting required aerodynamic performance values, significant errors occur in those values, because the model only fits the data and ignores the underlying aerodynamic equations.
  • methods: The paper proposes physics-guided training of the GAN so the model learns physical validity. Validity is computed by general-purpose software outside the neural network, rather than implementing the physical equations inside the model as physics-informed frameworks require. Because the proposed model is guided by a physical model rather than a training set, it can also generate completely new shapes instead of ones similar to the training data.
  • results: Numerical experiments show that the proposed model drastically improves accuracy, and the output shapes differ from the training set while still satisfying physical validity, overcoming limitations of existing GAN models.
    Abstract Generative adversarial networks (GAN) have recently been used for a design synthesis of mechanical shapes. A GAN sometimes outputs physically unreasonable shapes. For example, when a GAN model is trained to output airfoil shapes that indicate required aerodynamic performance, significant errors occur in the performance values. This is because the GAN model only considers data but does not consider the aerodynamic equations that lie under the data. This paper proposes the physics-guided training of the GAN model to guide the model to learn physical validity. Physical validity is computed using general-purpose software located outside the neural network model. Such general-purpose software cannot be used in physics-informed neural network frameworks, because physical equations must be implemented inside the neural network models. Additionally, a limitation of generative models is that the output data are similar to the training data and cannot generate completely new shapes. However, because the proposed model is guided by a physical model and does not use a training dataset, it can generate completely new shapes. Numerical experiments show that the proposed model drastically improves the accuracy. Moreover, the output shapes differ from those of the training dataset but still satisfy the physical validity, overcoming the limitations of existing GAN models.
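
A minimal sketch of the training idea: augment the generator loss with a penalty scored by an external physics evaluator. Here `physics_error` is a stub standing in for the general-purpose solver software, `lambda_phys` is an assumed weight, and the comment notes the autograd caveat the real system must handle.

```python
# Physics-guided generator step: adversarial term plus an external physics
# penalty. Toy networks; not the paper's architecture.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))  # generator
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))   # critic
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
lambda_phys = 1.0

def physics_error(shapes, target_perf):
    """Stub for an external solver (e.g., an aerodynamics code). A real call
    leaves the autograd graph, so the penalty must reach the generator via a
    differentiable surrogate or a score-function estimator; here we fake a
    differentiable proxy for illustration."""
    return (shapes.mean(dim=1) - target_perf) ** 2

z = torch.randn(8, 16)
target = torch.full((8,), 0.5)         # required aerodynamic performance
shapes = G(z)
adv_loss = -D(shapes).mean()           # non-saturating adversarial term
phys_loss = physics_error(shapes, target).mean()

opt_g.zero_grad()
(adv_loss + lambda_phys * phys_loss).backward()
opt_g.step()
```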

High Performance Computing Applied to Logistic Regression: A CPU and GPU Implementation Comparison

  • paper_url: http://arxiv.org/abs/2308.10037
  • repo_url: https://github.com/nechbamohammed/swiftlogisticreg
  • paper_authors: Nechba Mohammed, Mouhajir Mohamed, Sedjari Yassine
  • for: This paper presents a versatile GPU-based parallel implementation of Logistic Regression (LR) to meet the computational demands of binary classification on large data sets.
  • methods: The implementation is a direct translation of the parallel Gradient Descent Logistic Regression algorithm proposed by X. Zou et al.
  • results: The GPU-based LR outperforms CPU-based implementations in execution time on large data sets while maintaining a comparable f1 score, making it especially advantageous for real-time prediction applications such as image recognition, spam detection, and fraud detection.
    Abstract We present a versatile GPU-based parallel version of Logistic Regression (LR), aiming to address the increasing demand for faster algorithms in binary classification due to large data sets. Our implementation is a direct translation of the parallel Gradient Descent Logistic Regression algorithm proposed by X. Zou et al. [12]. Our experiments demonstrate that our GPU-based LR outperforms existing CPU-based implementations in terms of execution time while maintaining comparable f1 score. The significant acceleration of processing large datasets makes our method particularly advantageous for real-time prediction applications like image recognition, spam detection, and fraud detection. Our algorithm is implemented in a ready-to-use Python library available at : https://github.com/NechbaMohammed/SwiftLogisticReg
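
The released library is SwiftLogisticReg (URL above), but its API is not shown in the abstract, so here is a hedged sketch of the underlying algorithm only: full-batch gradient-descent logistic regression on the GPU via CuPy's NumPy-compatible arrays.

```python
# GPU logistic regression by batch gradient descent, using CuPy arrays.
import cupy as cp

def fit_logreg(X, y, lr=0.1, epochs=200):
    n, d = X.shape
    w = cp.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + cp.exp(-(X @ w + b)))   # sigmoid on the GPU
        grad_w = X.T @ (p - y) / n               # full-batch gradient
        grad_b = cp.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated directly on the GPU.
rng = cp.random.default_rng(0)
X = rng.standard_normal((10_000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(cp.float64)
w, b = fit_logreg(X, y)
print(float(cp.mean(((X @ w + b) > 0) == y)))    # training accuracy
```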

Semi-Supervised Anomaly Detection for the Determination of Vehicle Hijacking Tweets

  • paper_url: http://arxiv.org/abs/2308.10036
  • repo_url: None
  • paper_authors: Taahir Aiyoob Patel, Clement N. Nyirenda
  • for: This study aims to identify vehicle hijacking incidents from tweets, helping travellers avoid becoming victims.
  • methods: Tweets containing the keyword "hijacking" are vectorized with TF-IDF and analyzed with two unsupervised anomaly detection algorithms: 1) K-Nearest Neighbour (KNN); 2) Cluster Based Outlier Factor (CBLOF).
  • results: The comparative evaluation shows CBLOF achieves 90% accuracy against 89% for KNN, and an F1 score of 0.8 against 0.78, so CBLOF is selected as the preferred unsupervised method for determining relevant hijacking tweets. Future work will compare supervised methods against these unsupervised ones on larger datasets and add optimization mechanisms to improve overall performance.
    Abstract In South Africa, there is an ever-growing issue of vehicle hijackings. This leads to travellers constantly being in fear of becoming a victim to such an incident. This work presents a new semi-supervised approach to using tweets to identify hijacking incidents by using unsupervised anomaly detection algorithms. Tweets consisting of the keyword "hijacking" are obtained, stored, and processed using the term frequency-inverse document frequency (TF-IDF) and further analyzed by using two anomaly detection algorithms: 1) K-Nearest Neighbour (KNN); 2) Cluster Based Outlier Factor (CBLOF). The comparative evaluation showed that the KNN method produced an accuracy of 89%, whereas the CBLOF produced an accuracy of 90%. The CBLOF method was also able to obtain a F1-Score of 0.8, whereas the KNN produced a 0.78. Therefore, there is a slight difference between the two approaches, in favour of CBLOF, which has been selected as a preferred unsupervised method for the determination of relevant hijacking tweets. In future, a comparison will be done between supervised learning methods and the unsupervised methods presented in this work on larger dataset. Optimisation mechanisms will also be employed in order to increase the overall performance.
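
A hedged sketch of the pipeline using scikit-learn for TF-IDF and PyOD's KNN and CBLOF detectors. The toy corpus, contamination rate, and CBLOF cluster-separation parameters are illustrative assumptions, not the paper's data or settings.

```python
# TF-IDF features over "hijacking" tweets, then two unsupervised detectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from pyod.models.knn import KNN
from pyod.models.cblof import CBLOF

tweets = [
    "Vehicle hijacking reported on the N1, suspects armed and fled",
    "Vehicle hijacking reported near Sandton, suspects armed",
    "Vehicle hijacking reported at a petrol station, suspects fled",
    "Vehicle hijacking reported on the M4 off-ramp, suspects armed",
    "Vehicle hijacking reported in Durban CBD, suspects fled the scene",
    "Vehicle hijacking reported outside the mall, suspects armed",
    "Vehicle hijacking reported at a traffic light, suspects fled",
    "Vehicle hijacking reported on the R21, suspects armed and fled",
    "Vehicle hijacking reported near the taxi rank, suspects armed",
    "This song is hijacking my brain all day",         # irrelevant usage
    "Hijacking this thread to ask about pizza places",  # irrelevant usage
    "My feed got hijacked by cat videos again",         # irrelevant usage
]
X = TfidfVectorizer().fit_transform(tweets).toarray()

detectors = [
    ("KNN", KNN(n_neighbors=2, contamination=0.25)),
    ("CBLOF", CBLOF(n_clusters=2, alpha=0.7, beta=2, contamination=0.25)),
]
for name, model in detectors:
    model.fit(X)
    print(name, model.labels_)   # 1 = flagged as anomalous (off-topic here)
```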

Effects of Convolutional Autoencoder Bottleneck Width on StarGAN-based Singing Technique Conversion

  • paper_url: http://arxiv.org/abs/2308.10021
  • repo_url: None
  • paper_authors: Tung-Cheng Su, Yung-Chuan Chang, Yi-Wen Liu
  • for: This study evaluates how the bottleneck width of a convolutional autoencoder (CAE) affects conversion quality in singing technique conversion (STC).
  • methods: A GAN-based multi-domain STC system is built on the WORLD vocoder representation and a CAE architecture; the CAE bottleneck width is varied and the conversion results are evaluated subjectively on a Mandarin dataset with four singers and four techniques.
  • results: A wider bottleneck yields better articulation clarity but not necessarily higher likeness to the target technique. Among the four techniques, the whistle voice is the easiest conversion target, while the other three produce more convincing conversions than the whistle when used as sources.
    Abstract Singing technique conversion (STC) refers to the task of converting from one voice technique to another while leaving the original singer identity, melody, and linguistic components intact. Previous STC studies, as well as singing voice conversion research in general, have utilized convolutional autoencoders (CAEs) for conversion, but how the bottleneck width of the CAE affects the synthesis quality has not been thoroughly evaluated. To this end, we constructed a GAN-based multi-domain STC system which took advantage of the WORLD vocoder representation and the CAE architecture. We varied the bottleneck width of the CAE, and evaluated the conversion results subjectively. The model was trained on a Mandarin dataset which features four singers and four singing techniques: the chest voice, the falsetto, the raspy voice, and the whistle voice. The results show that a wider bottleneck corresponds to better articulation clarity but does not necessarily lead to higher likeness to the target technique. Among the four techniques, we also found that the whistle voice is the easiest target for conversion, while the other three techniques as a source produce more convincing conversion results than the whistle.

Semi-Implicit Variational Inference via Score Matching

  • paper_url: http://arxiv.org/abs/2308.10014
  • repo_url: https://github.com/longinyu/sivism
  • paper_authors: Longlin Yu, Cheng Zhang
  • for: Semi-implicit variational inference (SIVI) enriches the expressiveness of variational families by defining implicit variational distributions in a hierarchical manner.
  • methods: The paper proposes SIVI-SM, a new SIVI method with an alternative training objective based on score matching. Exploiting the hierarchical structure of semi-implicit families, the score matching objective admits a minimax formulation in which the intractable variational densities are handled naturally via denoising score matching.
  • results: SIVI-SM closely matches the accuracy of MCMC and outperforms ELBO-based SIVI methods across a variety of Bayesian inference tasks.
    Abstract Semi-implicit variational inference (SIVI) greatly enriches the expressiveness of variational families by considering implicit variational distributions defined in a hierarchical manner. However, due to the intractable densities of variational distributions, current SIVI approaches often use surrogate evidence lower bounds (ELBOs) or employ expensive inner-loop MCMC runs for unbiased ELBOs for training. In this paper, we propose SIVI-SM, a new method for SIVI based on an alternative training objective via score matching. Leveraging the hierarchical structure of semi-implicit variational families, the score matching objective allows a minimax formulation where the intractable variational densities can be naturally handled with denoising score matching. We show that SIVI-SM closely matches the accuracy of MCMC and outperforms ELBO-based SIVI methods in a variety of Bayesian inference tasks.

Distributionally Robust Cross Subject EEG Decoding

  • paper_url: http://arxiv.org/abs/2308.11651
  • repo_url: None
  • paper_authors: Tiehang Duan, Zhenyi Wang, Gianfranco Doretto, Fang Li, Cui Tao, Donald Adjeroh
  • for: This work aims to improve performance on EEG decoding tasks, particularly under the high variance and varied types of corruption inherent in the signal.
  • methods: A principled data-evolution approach based on distributionally robust optimization and Wasserstein gradient flow (WGF), which improves decoding robustness by optimizing over a family of evolved data distributions rather than the single training distribution.
  • results: Experiments show the proposed method can be readily combined with other data augmentation techniques and significantly outperforms competitive baselines on challenging decoding scenarios with severely corrupted EEG signals.
    Abstract Recently, deep learning has shown to be effective for Electroencephalography (EEG) decoding tasks. Yet, its performance can be negatively influenced by two key factors: 1) the high variance and different types of corruption that are inherent in the signal, 2) the EEG datasets are usually relatively small given the acquisition cost, annotation cost and amount of effort needed. Data augmentation approaches for alleviation of this problem have been empirically studied, with augmentation operations on spatial domain, time domain or frequency domain handcrafted based on expertise of domain knowledge. In this work, we propose a principled approach to perform dynamic evolution on the data for improvement of decoding robustness. The approach is based on distributionally robust optimization and achieves robustness by optimizing on a family of evolved data distributions instead of the single training data distribution. We derived a general data evolution framework based on Wasserstein gradient flow (WGF) and provides two different forms of evolution within the framework. Intuitively, the evolution process helps the EEG decoder to learn more robust and diverse features. It is worth mentioning that the proposed approach can be readily integrated with other data augmentation approaches for further improvements. We performed extensive experiments on the proposed approach and tested its performance on different types of corrupted EEG signals. The model significantly outperforms competitive baselines on challenging decoding scenarios.

Disposable Transfer Learning for Selective Source Task Unlearning

  • paper_url: http://arxiv.org/abs/2308.09971
  • repo_url: None
  • paper_authors: Seunghee Koh, Hyounguk Shon, Janghyeon Lee, Hyeong Gwon Hong, Junmo Kim
  • for: This paper addresses how to selectively unlearn the source task in transfer learning while preserving performance on the target task.
  • methods: A new transfer learning paradigm, disposable transfer learning (DTL), which disposes of source-task knowledge during transfer via a novel Gradient Collision loss (GC loss) that leads the gradient vectors of mini-batches in different directions.
  • results: GC loss selectively unlearns source knowledge while the model retains target-task performance; the degree of knowledge leakage is measured with piggyback learning (PL) accuracy, obtained by retraining the scrubbed model on a subset of source data or new downstream data, and models trained with GC loss show significantly reduced PL accuracy.
    Abstract Transfer learning is widely used for training deep neural networks (DNN) for building a powerful representation. Even after the pre-trained model is adapted for the target task, the representation performance of the feature extractor is retained to some extent. As the performance of the pre-trained model can be considered the private property of the owner, it is natural to seek the exclusive right of the generalized performance of the pre-trained weight. To address this issue, we suggest a new paradigm of transfer learning called disposable transfer learning (DTL), which disposes of only the source task without degrading the performance of the target task. To achieve knowledge disposal, we propose a novel loss named Gradient Collision loss (GC loss). GC loss selectively unlearns the source knowledge by leading the gradient vectors of mini-batches in different directions. Whether the model successfully unlearns the source task is measured by piggyback learning accuracy (PL accuracy). PL accuracy estimates the vulnerability of knowledge leakage by retraining the scrubbed model on a subset of source data or new downstream data. We demonstrate that GC loss is an effective approach to the DTL problem by showing that the model trained with GC loss retains the performance on the target task with a significantly reduced PL accuracy.
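
Only the abstract is available here, so the following is one plausible reading of GC loss, clearly an assumption: compute per-task gradients and penalize alignment between the source-task gradient and the target-task gradient, so updates that help the target push the source gradients in a different direction.

```python
# Sketch of a gradient-collision-style penalty: cosine similarity between the
# target-task gradient and the source-task gradient, minimized during training.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
ce = nn.CrossEntropyLoss()

def flat_grad(loss):
    grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

x_tgt, y_tgt = torch.randn(8, 10), torch.randint(0, 2, (8,))
x_src, y_src = torch.randn(8, 10), torch.randint(0, 2, (8,))

g_tgt = flat_grad(ce(model(x_tgt), y_tgt))    # keep target performance
g_src = flat_grad(ce(model(x_src), y_src))    # knowledge to dispose of
gc_loss = torch.cosine_similarity(g_tgt, g_src, dim=0)  # "collide" gradients

opt.zero_grad()
(ce(model(x_tgt), y_tgt) + 0.1 * gc_loss).backward()    # weight is assumed
opt.step()
```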

Tackling Vision Language Tasks Through Learning Inner Monologues

  • paper_url: http://arxiv.org/abs/2308.09970
  • repo_url: None
  • paper_authors: Diji Yang, Kezhen Chen, Jinmeng Rao, Xiaoyuan Guo, Yawen Zhang, Jie Yang, Yi Zhang
  • for: Solving complex vision-language problems by simulating inner monologue processes to better fuse language models and vision models.
  • methods: A new approach, Inner Monologue Multi-Modal Optimization (IMMO), in which LLMs and VLMs interact through natural-language conversation, trained with a two-stage process to learn the inner monologue (self-asking questions and answering them).
  • results: Evaluation on two popular tasks shows that emulating the cognitive phenomenon of internal dialogue enhances reasoning and explanation abilities; because IMMO learns this process within the models rather than using predefined human-crafted monologues, it promises wider applicability across AI problems.
    Abstract Visual language tasks require AI models to comprehend and reason with both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) the hybrid integration between LLMs and Vision-Language Models (VLMs), where visual inputs are firstly converted into language descriptions by VLMs, serving as inputs for LLMs to generate final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected to LLMs' language space via further supervised fine-tuning. The first approach provides light training costs and interpretability but is hard to be optimized in an end-to-end fashion. The second approach presents decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. To tackle this dilemma, we propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), to solve complex vision language problems by simulating inner monologue processes, a cognitive process in which an individual engages in silent verbal communication with themselves. We enable LLMs and VLMs to interact through natural language conversation and propose to use a two-stage training process to learn how to do the inner monologue (self-asking questions and answering questions). IMMO is evaluated on two popular tasks and the results suggest by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to the more effective fusion of vision and language models. More importantly, instead of using predefined human-crafted monologues, IMMO learns this process within the deep learning models, promising wider applicability to many different AI problems beyond vision language tasks.

Anomaly-Aware Semantic Segmentation via Style-Aligned OoD Augmentation

  • paper_url: http://arxiv.org/abs/2308.09965
  • repo_url: None
  • paper_authors: Dan Zhang, Kaspar Sakmann, William Beluch, Robin Hutmacher, Yumeng Li
  • for: This paper equips standard semantic segmentation models with anomaly awareness for open-world autonomous driving.
  • methods: The out-of-distribution (OoD) data synthesis is improved by reducing the style gap between OoD data and driving scenes, mitigating the shortcut that the style difference would otherwise provide during training. A simple fine-tuning loss is also proposed that induces a pre-trained segmentation model to produce a "none of the given classes" prediction, using per-pixel OoD scores for anomaly segmentation.
  • results: With minimal fine-tuning effort, the pipeline enables pre-trained models to perform anomaly segmentation while maintaining performance on the original task.
    Abstract Within the context of autonomous driving, encountering unknown objects becomes inevitable during deployment in the open world. Therefore, it is crucial to equip standard semantic segmentation models with anomaly awareness. Many previous approaches have utilized synthetic out-of-distribution (OoD) data augmentation to tackle this problem. In this work, we advance the OoD synthesis process by reducing the domain gap between the OoD data and driving scenes, effectively mitigating the style difference that might otherwise act as an obvious shortcut during training. Additionally, we propose a simple fine-tuning loss that effectively induces a pre-trained semantic segmentation model to generate a ``none of the given classes" prediction, leveraging per-pixel OoD scores for anomaly segmentation. With minimal fine-tuning effort, our pipeline enables the use of pre-trained models for anomaly segmentation while maintaining the performance on the original task.

Towards Self-Adaptive Machine Learning-Enabled Systems Through QoS-Aware Model Switching

  • paper_url: http://arxiv.org/abs/2308.09960
  • repo_url: https://github.com/sa4s-serc/adamls
  • paper_authors: Shubham Kulkarni, Arya Marda, Karthik Vaidhyanathan
  • for: This work proposes a machine learning model balancer to manage run-time uncertainties and safeguard the quality of service (QoS) of machine-learning-enabled systems.
  • methods: Multiple ML models are managed by AdaMLS, a self-adaptation approach that extends the traditional MAPE-K loop with lightweight unsupervised learning for dynamic model switching, enabling continuous adaptation.
  • results: A self-adaptive object detection prototype shows that AdaMLS balances system and model performance, surpassing naive and single state-of-the-art models in QoS guarantees in dynamic environments.
    Abstract Machine Learning (ML), particularly deep learning, has seen vast advancements, leading to the rise of Machine Learning-Enabled Systems (MLS). However, numerous software engineering challenges persist in propelling these MLS into production, largely due to various run-time uncertainties that impact the overall Quality of Service (QoS). These uncertainties emanate from ML models, software components, and environmental factors. Self-adaptation techniques present potential in managing run-time uncertainties, but their application in MLS remains largely unexplored. As a solution, we propose the concept of a Machine Learning Model Balancer, focusing on managing uncertainties related to ML models by using multiple models. Subsequently, we introduce AdaMLS, a novel self-adaptation approach that leverages this concept and extends the traditional MAPE-K loop for continuous MLS adaptation. AdaMLS employs lightweight unsupervised learning for dynamic model switching, thereby ensuring consistent QoS. Through a self-adaptive object detection system prototype, we demonstrate AdaMLS's effectiveness in balancing system and model performance. Preliminary results suggest AdaMLS surpasses naive and single state-of-the-art models in QoS guarantees, heralding the advancement towards self-adaptive MLS with optimal QoS in dynamic environments.
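
A minimal sketch of the MAPE-K-style switching loop the abstract describes: Monitor per-request latency, Analyze it against a QoS target, Plan a switch to a cheaper or more accurate model, and Execute. The model zoo, thresholds, and switching rule are illustrative assumptions, not AdaMLS's learned policy.

```python
# QoS-aware model switching loop with stubbed inference.
import random
import time

MODELS = ["yolov5n", "yolov5s", "yolov5m"]   # cheap -> accurate (illustrative)
qos_target_ms = 40.0
current = 1                                   # start with the middle model

def run_inference(model):                     # stub for the real detector
    time.sleep(random.uniform(0.01, 0.06))
    return "detections"

def monitor(model):
    t0 = time.perf_counter()
    run_inference(model)
    return (time.perf_counter() - t0) * 1000.0   # latency in ms

for request in range(20):
    latency = monitor(MODELS[current])                       # Monitor
    if latency > qos_target_ms and current > 0:              # Analyze
        current -= 1                                         # Plan: cheaper
    elif latency < 0.5 * qos_target_ms and current < len(MODELS) - 1:
        current += 1                                         # Plan: better
    print(f"req={request:2d} latency={latency:5.1f}ms "
          f"-> {MODELS[current]}")                           # Execute
```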

A Comparison of Adversarial Learning Techniques for Malware Detection

  • paper_url: http://arxiv.org/abs/2308.09958
  • repo_url: None
  • paper_authors: Pavla Louthánová, Matouš Kozák, Martin Jureček, Mark Stamp
  • for: This paper addresses the problem of generating adversarial malware samples, specifically malicious Windows Portable Executable files, to evaluate the effectiveness of different methods for generating adversarial samples and their practical applicability.
  • methods: The paper uses gradient-based, evolutionary algorithm-based, and reinforcement-based methods to generate adversarial samples, and tests the generated samples against selected antivirus products.
  • results: The results show that applying optimized modifications to previously detected malware can lead to incorrect classification of the file as benign, and that generated malware samples can be successfully used against detection models other than those used to generate them. The Gym-malware generator, which uses a reinforcement learning approach, has the greatest practical potential, achieving an average sample generation time of 5.73 seconds and the highest average evasion rate of 44.11%. Using the Gym-malware generator in combination with itself improved the evasion rate to 58.35%.
    Abstract Machine learning has proven to be a useful tool for automated malware detection, but machine learning models have also been shown to be vulnerable to adversarial attacks. This article addresses the problem of generating adversarial malware samples, specifically malicious Windows Portable Executable files. We summarize and compare work that has focused on adversarial machine learning for malware detection. We use gradient-based, evolutionary algorithm-based, and reinforcement-based methods to generate adversarial samples, and then test the generated samples against selected antivirus products. We compare the selected methods in terms of accuracy and practical applicability. The results show that applying optimized modifications to previously detected malware can lead to incorrect classification of the file as benign. It is also known that generated malware samples can be successfully used against detection models other than those used to generate them and that using combinations of generators can create new samples that evade detection. Experiments show that the Gym-malware generator, which uses a reinforcement learning approach, has the greatest practical potential. This generator achieved an average sample generation time of 5.73 seconds and the highest average evasion rate of 44.11%. Using the Gym-malware generator in combination with itself improved the evasion rate to 58.35%.

To prune or not to prune : A chaos-causality approach to principled pruning of dense neural networks

  • paper_url: http://arxiv.org/abs/2308.09955
  • repo_url: None
  • paper_authors: Rajan Sahu, Shivam Chadha, Nithin Nagaraj, Archana Mathur, Snehanshu Saha
  • for: Reducing the size of a neural network (pruning) without impacting its performance is an important problem for resource-constrained devices.
  • methods: Pruning is formulated as an optimization problem that minimizes misclassifications by selecting specific weights, introducing chaos in learning (Lyapunov exponents) via weight updates and exploiting causality to identify the weights responsible for misclassification.
  • results: The pruned network maintains the original performance while retaining feature explainability.
    Abstract Reducing the size of a neural network (pruning) by removing weights without impacting its performance is an important problem for resource-constrained devices. In the past, pruning was typically accomplished by ranking or penalizing weights based on criteria like magnitude and removing low-ranked weights before retraining the remaining ones. Pruning strategies may also involve removing neurons from the network in order to achieve the desired reduction in network size. We formulate pruning as an optimization problem with the objective of minimizing misclassifications by selecting specific weights. To accomplish this, we have introduced the concept of chaos in learning (Lyapunov exponents) via weight updates and exploiting causality to identify the causal weights responsible for misclassification. Such a pruned network maintains the original performance and retains feature explainability.

Finding emergence in data: causal emergence inspired dynamics learning

  • paper_url: http://arxiv.org/abs/2308.09952
  • repo_url: None
  • paper_authors: Mingzhe Yang, Zhipeng Wang, Kaiwei Liu, Yingqi Rong, Bing Yuan, Jiang Zhang
  • for: This paper aims to develop a machine learning framework to model complex dynamical systems in a data-driven manner, with a focus on capturing emergent behaviors and properties.
  • methods: The proposed framework draws inspiration from the theory of causal emergence and uses maximum effective information (EI) to learn macro-dynamics within an emergent latent space.
  • results: The proposed framework is effective in capturing emergent patterns, learning the coarse-graining strategy, and quantifying the degree of causal emergence in the data. Additionally, the model demonstrates superior generalization ability on environments different from the training dataset.
    Abstract Modelling complex dynamical systems in a data-driven manner is challenging due to the presence of emergent behaviors and properties that cannot be directly captured by micro-level observational data. Therefore, it is crucial to develop a model that can effectively capture emergent dynamics at the macro-level and quantify emergence based on the available data. Drawing inspiration from the theory of causal emergence, this paper introduces a machine learning framework aimed at learning macro-dynamics within an emergent latent space. The framework achieves this by maximizing the effective information (EI) to obtain a macro-dynamics model with stronger causal effects. Experimental results on both simulated and real data demonstrate the effectiveness of the proposed framework. Not only does it successfully capture emergent patterns, but it also learns the coarse-graining strategy and quantifies the degree of causal emergence in the data. Furthermore, experiments conducted on environments different from the training dataset highlight the superior generalization ability of our model.

Study on the effectiveness of AutoML in detecting cardiovascular disease

  • paper_url: http://arxiv.org/abs/2308.09947
  • repo_url: None
  • paper_authors: T. V. Afanasieva, A. P. Kuzlyakin, A. V. Komolov
  • for: This paper explores the effectiveness of machine learning for predicting cardiovascular disease in patient-oriented systems.
  • methods: Automated machine learning (AutoML) is applied under three data-combination scenarios to five data sets of cardiovascular disease indicators from the UCI Machine Learning Repository, using one AutoML model that optimizes the hyperparameters of thirteen basic ML models (KNeighborsUnif, KNeighborsDist, LightGBMXT, LightGBM, RandomForestGini, RandomForestEntr, CatBoost, ExtraTreesGini, ExtraTreesEntr, NeuralNetFastA, XGBoost, NeuralNetTorch, LightGBMLarge) and includes the most accurate models in a weighted ensemble.
  • results: The structure of the AutoML model depends not only on the efficiency and accuracy of the basic models but also on the data preprocessing scenario, particularly the normalization technique. Accuracy ranged from 87.41% to 92.3%, with the maximum obtained when normalizing the source data to binary values and the minimum when using the built-in AutoML technique.
    Abstract Cardiovascular diseases are widespread among patients with chronic noncommunicable diseases and are one of the leading causes of death, including among people of working age. The article presents the relevance of the development and application of patient-oriented systems, in which machine learning (ML) is a promising technology for predicting cardiovascular diseases. Automated machine learning (AutoML) makes it possible to simplify and speed up the process of developing AI/ML applications, which is key in the development of patient-oriented systems by application users, in particular medical specialists. The authors propose a framework for the application of automatic machine learning and three scenarios for combining five data sets of cardiovascular disease indicators from the UCI Machine Learning Repository, in order to investigate the effectiveness in detecting this class of diseases. The study investigated one AutoML model that used and optimized the hyperparameters of thirteen basic ML models (KNeighborsUnif, KNeighborsDist, LightGBMXT, LightGBM, RandomForestGini, RandomForestEntr, CatBoost, ExtraTreesGini, ExtraTreesEntr, NeuralNetFastA, XGBoost, NeuralNetTorch, LightGBMLarge) and included the most accurate models in the weighted ensemble. The results of the study showed that the structure of the AutoML model for detecting cardiovascular diseases depends not only on the efficiency and accuracy of the basic models used, but also on the scenarios for preprocessing the initial data, in particular on the technique of data normalization. The comparative analysis showed that the accuracy of the AutoML model in detecting cardiovascular disease varied from 87.41% to 92.3%; the maximum accuracy was obtained when normalizing the source data into binary values, and the minimum when using the built-in AutoML technique.
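
The thirteen base models listed match AutoGluon's TabularPredictor defaults, so the study's setup can plausibly be approximated as below. The synthetic stand-in data, column names, and the binarizing-normalization step are illustrative assumptions.

```python
# AutoML over tabular indicators with a weighted ensemble of base models.
import numpy as np
import pandas as pd
from autogluon.tabular import TabularPredictor

# Synthetic stand-in for the merged UCI cardiovascular indicator data.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 11))
y = X[:, 0] + X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.5, size=1000) > 0
df = pd.DataFrame(X, columns=[f"ind_{i}" for i in range(11)])
df["target"] = y.astype(int)          # 1 = cardiovascular disease

# One of the paper's preprocessing scenarios: normalize indicators to binary.
feats = [c for c in df.columns if c != "target"]
df[feats] = (df[feats] > df[feats].median()).astype(int)

train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

predictor = TabularPredictor(label="target", eval_metric="accuracy").fit(train)
print(predictor.leaderboard(test))    # per-model scores plus weighted ensemble
```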

Dual Branch Deep Learning Network for Detection and Stage Grading of Diabetic Retinopathy

  • paper_url: http://arxiv.org/abs/2308.09945
  • repo_url: None
  • paper_authors: Hossein Shakibania, Sina Raoufi, Behnam Pourafkham, Hassan Khotanlou, Muharram Mansoorizadeh
  • for: This study proposes a deep learning method for the detection and stage grading of diabetic retinopathy from a single fundus retinal image.
  • methods: The model uses transfer learning, employing two state-of-the-art pre-trained models as feature extractors and fine-tuning them on a new, large multi-center dataset that includes the APTOS 2019 dataset.
  • results: The model outperforms the established literature on APTOS 2019 in both detection and stage grading. For binary classification it achieves 98.50% accuracy, 99.46% sensitivity, and 97.51% specificity; for stage grading it achieves a quadratic weighted kappa of 93.00%, 89.60% accuracy, 89.60% sensitivity, and 97.72% specificity, making it a reliable screening and grading tool with significant potential to support clinical decision-making and patient care.
    Abstract Diabetic retinopathy is a severe complication of diabetes that can lead to permanent blindness if not treated promptly. Early and accurate diagnosis of the disease is essential for successful treatment. This paper introduces a deep learning method for the detection and stage grading of diabetic retinopathy, using a single fundus retinal image. Our model utilizes transfer learning, employing two state-of-the-art pre-trained models as feature extractors and fine-tuning them on a new dataset. The proposed model is trained on a large multi-center dataset, including the APTOS 2019 dataset, obtained from publicly available sources. It achieves remarkable performance in diabetic retinopathy detection and stage classification on the APTOS 2019, outperforming the established literature. For binary classification, the proposed approach achieves an accuracy of 98.50%, a sensitivity of 99.46%, and a specificity of 97.51%. In stage grading, it achieves a quadratic weighted kappa of 93.00%, an accuracy of 89.60%, a sensitivity of 89.60%, and a specificity of 97.72%. The proposed approach serves as a reliable screening and stage grading tool for diabetic retinopathy, offering significant potential to enhance clinical decision-making and patient care.
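
A hedged sketch of the dual-branch idea: two pretrained backbones run in parallel as feature extractors and their features are concatenated for classification. The abstract does not name the backbones, so ResNet-50 and EfficientNet-B0 and the head sizes here are assumptions.

```python
# Dual-branch classifier: concatenate features from two pretrained backbones.
import torch
import torch.nn as nn
from torchvision import models

class DualBranchDR(nn.Module):
    def __init__(self, n_classes=5):             # 5 DR stages
        super().__init__()
        self.b1 = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.b1.fc = nn.Identity()                # exposes 2048-d features
        self.b2 = models.efficientnet_b0(
            weights=models.EfficientNet_B0_Weights.DEFAULT)
        self.b2.classifier = nn.Identity()        # exposes 1280-d features
        self.head = nn.Sequential(
            nn.Linear(2048 + 1280, 512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, n_classes),
        )

    def forward(self, x):                         # x: [B, 3, 224, 224] fundus
        f = torch.cat([self.b1(x), self.b2(x)], dim=1)
        return self.head(f)

logits = DualBranchDR()(torch.randn(2, 3, 224, 224))
print(logits.shape)                               # torch.Size([2, 5])
```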

On the Robustness of Open-World Test-Time Training: Self-Training with Dynamic Prototype Expansion

  • paper_url: http://arxiv.org/abs/2308.09942
  • repo_url: https://github.com/yushu-li/owttt
  • paper_authors: Yushu Li, Xun Xu, Yongyi Su, Kui Jia
  • for: This work improves the robustness of test-time training when the target domain is contaminated with strong out-of-distribution (OOD) data, i.e., open-world test-time training (OWTTT), while keeping latency low.
  • methods: An adaptive strong-OOD pruning technique improves the efficacy of the self-training TTT method; prototypes are dynamically expanded to represent strong OOD samples for better weak/strong OOD separation; and self-training is regularized with distribution alignment.
  • results: The combination achieves state-of-the-art performance on 5 OWTTT benchmarks; code is available at https://github.com/Yushu-Li/OWTTT.
    Abstract Generalizing deep learning models to unknown target domain distribution with low latency has motivated research into test-time training/adaptation (TTT/TTA). Existing approaches often focus on improving test-time training performance under well-curated target domain data. As figured out in this work, many state-of-the-art methods fail to maintain the performance when the target domain is contaminated with strong out-of-distribution (OOD) data, a.k.a. open-world test-time training (OWTTT). The failure is mainly due to the inability to distinguish strong OOD samples from regular weak OOD samples. To improve the robustness of OWTTT we first develop an adaptive strong OOD pruning which improves the efficacy of the self-training TTT method. We further propose a way to dynamically expand the prototypes to represent strong OOD samples for an improved weak/strong OOD data separation. Finally, we regularize self-training with distribution alignment and the combination yields the state-of-the-art performance on 5 OWTTT benchmarks. The code is available at https://github.com/Yushu-Li/OWTTT.

Practical Anomaly Detection over Multivariate Monitoring Metrics for Online Services

  • paper_url: http://arxiv.org/abs/2308.09937
  • repo_url: https://github.com/OpsPAI/CMAnomaly
  • paper_authors: Jinyang Liu, Tianyi Yang, Zhuangbin Chen, Yuxin Su, Cong Feng, Zengyin Yang, Michael R. Lyu
  • for: This paper proposes an anomaly detection framework over multivariate monitoring metrics to improve anomaly detection for online service systems.
  • methods: The framework, CMAnomaly, is built on a collaborative machine: a mechanism that captures pairwise interactions along the feature and temporal dimensions with linear time complexity, so that cost-effective models can exploit both the dependencies between monitoring metrics and their historical patterns.
  • results: On public data and industrial data from a large-scale online service system of Huawei Cloud, CMAnomaly achieves an average F1 score of 0.9494, outperforming state-of-the-art baselines by 6.77% to 10.68% while running 10x to 20x faster; the paper also shares deployment experience at Huawei Cloud.
    Abstract As modern software systems continue to grow in terms of complexity and volume, anomaly detection on multivariate monitoring metrics, which profile systems' health status, becomes more and more critical and challenging. In particular, the dependency between different metrics and their historical patterns plays a critical role in pursuing prompt and accurate anomaly detection. Existing approaches fall short of industrial needs for being unable to capture such information efficiently. To fill this significant gap, in this paper, we propose CMAnomaly, an anomaly detection framework on multivariate monitoring metrics based on collaborative machine. The proposed collaborative machine is a mechanism to capture the pairwise interactions along with feature and temporal dimensions with linear time complexity. Cost-effective models can then be employed to leverage both the dependency between monitoring metrics and their historical patterns for anomaly detection. The proposed framework is extensively evaluated with both public data and industrial data collected from a large-scale online service system of Huawei Cloud. The experimental results demonstrate that compared with state-of-the-art baseline models, CMAnomaly achieves an average F1 score of 0.9494, outperforming baselines by 6.77% to 10.68%, and runs 10X to 20X faster. Furthermore, we also share our experience of deploying CMAnomaly in Huawei Cloud.

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

  • paper_url: http://arxiv.org/abs/2308.09936
  • repo_url: https://github.com/mlpc-ucsd/bliva
  • paper_authors: Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, Zhuowen Tu
  • for: This work targets visual question answering on real-world images infused with text, a common scenario that standard vision language models handle poorly.
  • methods: BLIVA, an augmented version of InstructBLIP, keeps InstructBLIP's learned query embeddings and additionally projects encoded patch embeddings directly into the LLM (a technique inspired by LLaVA), helping the model capture intricate details missed during query decoding.
  • results: BLIVA improves over the InstructBLIP baseline by up to 17.76% on the OCR-VQA benchmark and up to 7.9% on the Visual Spatial Reasoning benchmark, and decodes real-world images well whether or not text is present.
    Abstract Vision Language Models (VLMs), which extend Large Language Models (LLM) by incorporating visual understanding capability, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA) tasks. However, these models cannot accurately interpret images infused with text, a common occurrence in real-world scenarios. Standard procedures for extracting information from images often involve learning a fixed set of query embeddings. These embeddings are designed to encapsulate image contexts and are later used as soft prompt inputs in LLMs. Yet, this process is limited to the token count, potentially curtailing the recognition of scenes with text-rich context. To improve upon them, the present study introduces BLIVA: an augmented version of InstructBLIP with Visual Assistant. BLIVA incorporates the query embeddings from InstructBLIP and also directly projects encoded patch embeddings into the LLM, a technique inspired by LLaVA. This approach assists the model to capture intricate details potentially missed during the query decoding process. Empirical evidence demonstrates that our model, BLIVA, significantly enhances performance in processing text-rich VQA benchmarks (up to 17.76\% in OCR-VQA benchmark) and in undertaking typical VQA benchmarks (up to 7.9\% in Visual Spatial Reasoning benchmark), comparing to our baseline InstructBLIP. BLIVA demonstrates significant capability in decoding real-world images, irrespective of text presence. To demonstrate the broad industry applications enabled by BLIVA, we evaluate the model using a new dataset comprising YouTube thumbnails paired with question-answer sets across 13 diverse categories. For researchers interested in further exploration, our code and models are freely accessible at https://github.com/mlpc-ucsd/BLIVA.git

Analyzing Quantization in TVM

  • paper_url: http://arxiv.org/abs/2308.10905
  • repo_url: None
  • paper_authors: Mingfei Guo
  • for: This project studies weight-tensor quantization in TVM to reduce inference latency and memory footprint; unexpectedly, 8-bit quantization underperforms, running about 2x slower than the non-quantized version instead of the expected speedup.
  • methods: The project investigates the causes of the underperformance, discusses optimization of both computation-bound and memory-bound tasks, and provides a detailed comparison of optimization techniques in TVM.
  • results: After fixing a bug in graph building and evaluating multiple optimization strategies, the best experiment improves inference time by 163.88% over the TVM-compiled baseline for the compute-bound task and by 194.98% for the memory-bound task.
    Abstract There have been many papers in the academic literature on quantizing weight tensors in deep learning models to reduce inference latency and memory footprint. TVM also has the ability to quantize weights and support low-bit computations. Although quantization is typically expected to improve inference time, in TVM, the performance of 8-bit quantization does not meet expectations. When applying 8-bit quantization to a deep learning model, it is typically expected to achieve around 50% of the full-precision inference time. However, in this particular case, not only does the quantized version fail to achieve the desired performance boost, but it actually performs worse, resulting in an inference time that is about 2 times as slow as the non-quantized version. In this project, we thoroughly investigate the reasons behind the underperformance and assess the compatibility and optimization opportunities of 8-bit quantization in TVM. We discuss the optimization of two different types of tasks: computation-bound and memory-bound, and provide a detailed comparison of various optimization techniques in TVM. Through the identification of performance issues, we have successfully improved quantization by addressing a bug in graph building. Furthermore, we analyze multiple optimization strategies to achieve the optimal quantization result. The best experiment achieves 163.88% improvement compared with the TVM-compiled baseline in inference time for the compute-bound task and 194.98% for the memory-bound task.
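
A minimal sketch of the relay quantization flow under study. The global-scale calibration mode, llvm CPU target, and ResNet-18 test workload are illustrative choices, not the project's exact configuration.

```python
# Quantize a relay module to int8 and compare inference time against fp32.
import numpy as np
import tvm
from tvm import relay
from tvm.relay import testing
from tvm.contrib import graph_executor

mod, params = testing.resnet.get_workload(num_layers=18, batch_size=1)

with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
    qmod = relay.quantize.quantize(mod, params)   # rewrite ops/weights to int8

data = tvm.nd.array(np.zeros((1, 3, 224, 224), dtype="float32"))
for name, m, p in [("fp32", mod, params), ("int8", qmod, None)]:
    lib = relay.build(m, target="llvm", params=p)
    dev = tvm.cpu()
    gmod = graph_executor.GraphModule(lib["default"](dev))
    gmod.set_input("data", data)
    timer = gmod.module.time_evaluator("run", dev, number=10)
    print(name, f"{timer().mean * 1e3:.2f} ms")
```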

East: Efficient and Accurate Secure Transformer Framework for Inference

  • paper_url: http://arxiv.org/abs/2308.09923
  • repo_url: None
  • paper_authors: Yuanchao Ding, Hua Guo, Yewei Guan, Weixin Liu, Jiarong Huo, Zhenyu Guan, Xiyong Zhang
  • for: Privacy-preserving Transformer inference, which hinges on practical secure protocols for non-linear functions.
  • methods: A framework named "East" with a new oblivious piecewise polynomial evaluation algorithm applied to the activation functions, plus carefully designed secure protocols for softmax and layer normalization that faithfully maintain the desired functionality, along with several detailed optimizations.
  • results: Applied to BERT, inference accuracy remains consistent with plaintext inference without fine-tuning; the new activation evaluation reduces the runtime and communication of GELU by over 1.5x and 2.5x versus prior art, and compared to Iron, East achieves about 1.8x lower communication within 1.2x lower runtime.
    Abstract Transformer has been successfully used in practical applications, such as ChatGPT, due to its powerful advantages. However, users' input is leaked to the model provider during the service. With people's attention to privacy, privacy-preserving Transformer inference is on the demand of such services. Secure protocols for non-linear functions are crucial in privacy-preserving Transformer inference, which are not well studied. Thus, designing practical secure protocols for non-linear functions is hard but significant to model performance. In this work, we propose a framework \emph{East} to enable efficient and accurate secure Transformer inference. Firstly, we propose a new oblivious piecewise polynomial evaluation algorithm and apply it to the activation functions, which reduces the runtime and communication of GELU by over 1.5$\times$ and 2.5$\times$, compared to prior arts. Secondly, the secure protocols for softmax and layer normalization are carefully designed to faithfully maintain the desired functionality. Thirdly, several optimizations are conducted in detail to enhance the overall efficiency. We applied \emph{East} to BERT and the results show that the inference accuracy remains consistent with the plaintext inference without fine-tuning. Compared to Iron, we achieve about 1.8$\times$ lower communication within 1.2$\times$ lower runtime.
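
A sketch of the idea behind the activation handling: approximate GELU with a low-degree piecewise polynomial so it can be evaluated obliviously under MPC. The breakpoints and degree below are illustrative; the secure (oblivious) evaluation itself is out of scope here.

```python
# Fit piecewise cubic polynomials to GELU and check the approximation error.
import numpy as np

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Fit degree-3 polynomials on pieces of [-5, 5]; outside, GELU is ~0 or ~x.
pieces = [(-5.0, -1.5), (-1.5, 1.5), (1.5, 5.0)]
coeffs = []
for lo, hi in pieces:
    xs = np.linspace(lo, hi, 200)
    coeffs.append(np.polyfit(xs, gelu(xs), deg=3))

def gelu_piecewise(x):
    y = np.where(x < -5.0, 0.0, np.where(x > 5.0, x, 0.0))
    for (lo, hi), c in zip(pieces, coeffs):
        mask = (x >= lo) & (x <= hi)
        y = np.where(mask, np.polyval(c, x), y)
    return y

xs = np.linspace(-6, 6, 1000)
print("max abs error:", np.max(np.abs(gelu_piecewise(xs) - gelu(xs))))
```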

EGANS: Evolutionary Generative Adversarial Network Search for Zero-Shot Learning

  • paper_url: http://arxiv.org/abs/2308.09915
  • repo_url: None
  • paper_authors: Shiming Chen, Shihuang Chen, Wenjin Hou, Weiping Ding, Xinge You
  • for: Improving zero-shot learning (ZSL), where generative models such as GANs synthesize visual samples conditioned on class semantic vectors.
  • methods: Evolutionary generative adversarial network search (EGANS) automatically designs generative networks with good adaptation and stability via cooperative dual evolution: an evolutionary neural architecture search for the generator (using a many-to-one adversarial training strategy), followed by a similar search for the discriminator, under a unified evolutionary adversarial framework.
  • results: On the standard CUB, SUN, AWA2 and FLO datasets, EGANS consistently improves existing generative ZSL methods, indicating the promise of evolutionary neural architecture search for ZSL.
    Abstract Zero-shot learning (ZSL) aims to recognize the novel classes which cannot be collected for training a prediction model. Accordingly, generative models (e.g., generative adversarial network (GAN)) are typically used to synthesize the visual samples conditioned by the class semantic vectors and achieve remarkable progress for ZSL. However, existing GAN-based generative ZSL methods are based on hand-crafted models, which cannot adapt to various datasets/scenarios and fails to model instability. To alleviate these challenges, we propose evolutionary generative adversarial network search (termed EGANS) to automatically design the generative network with good adaptation and stability, enabling reliable visual feature sample synthesis for advancing ZSL. Specifically, we adopt cooperative dual evolution to conduct a neural architecture search for both generator and discriminator under a unified evolutionary adversarial framework. EGANS is learned by two stages: evolution generator architecture search and evolution discriminator architecture search. During the evolution generator architecture search, we adopt a many-to-one adversarial training strategy to evolutionarily search for the optimal generator. Then the optimal generator is further applied to search for the optimal discriminator in the evolution discriminator architecture search with a similar evolution search algorithm. Once the optimal generator and discriminator are searched, we entail them into various generative ZSL baselines for ZSL classification. Extensive experiments show that EGANS consistently improve existing generative ZSL methods on the standard CUB, SUN, AWA2 and FLO datasets. The significant performance gains indicate that the evolutionary neural architecture search explores a virgin field in ZSL.

Never Explore Repeatedly in Multi-Agent Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.09909
  • repo_url: None
  • paper_authors: Chenghao Li, Tonghan Wang, Chongjie Zhang, Qianchuan Zhao
  • for: This work improves exploration in multi-agent reinforcement learning driven by intrinsic motivation.
  • methods: A dynamic reward scaling approach that stabilizes the significant fluctuations of intrinsic rewards in previously explored areas and promotes broader exploration, curbing the "revisitation" phenomenon in which agents recurrently explore confined areas of the task space.
  • results: Experiments show enhanced performance in demanding environments such as Google Research Football and StarCraft II micromanagement tasks, especially in sparse-reward settings.
    Abstract In the realm of multi-agent reinforcement learning, intrinsic motivations have emerged as a pivotal tool for exploration. While the computation of many intrinsic rewards relies on estimating variational posteriors using neural network approximators, a notable challenge has surfaced due to the limited expressive capability of these neural statistics approximators. We pinpoint this challenge as the "revisitation" issue, where agents recurrently explore confined areas of the task space. To combat this, we propose a dynamic reward scaling approach. This method is crafted to stabilize the significant fluctuations in intrinsic rewards in previously explored areas and promote broader exploration, effectively curbing the revisitation phenomenon. Our experimental findings underscore the efficacy of our approach, showcasing enhanced performance in demanding environments like Google Research Football and StarCraft II micromanagement tasks, especially in sparse reward settings.

Imputing Brain Measurements Across Data Sets via Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2308.09907
  • repo_url: None
  • paper_authors: Yixin Wang, Wei Peng, Susan F. Tapert, Qingyu Zhao, Kilian M. Pohl
  • for: The paper aims to address the issue of missing measurements in publicly available structural MRI data sets, specifically the curvature scores computed by Freesurfer, by proposing a deep learning-based imputation method called Demographic Aware Graph-based Imputation (DAGI).
  • methods: The DAGI method uses a graph neural network (GNN) to model the dependencies between brain Regions of Interest (ROIs) and accounts for demographic differences in brain measurements by feeding the graph encoding into a parallel architecture that simultaneously optimizes a graph decoder to impute values and a classifier to predict demographic factors.
  • results: The proposed DAGI method is tested on imputing missing Freesurfer measurements of the Adolescent Brain Cognitive Development (ABCD) Study data set (N=3760) by training the predictor on publicly released data from the National Consortium on Alcohol and Neurodevelopment in Adolescence (NCANDA, N=540).
    Abstract Publicly available data sets of structural MRIs might not contain specific measurements of brain Regions of Interests (ROIs) that are important for training machine learning models. For example, the curvature scores computed by Freesurfer are not released by the Adolescent Brain Cognitive Development (ABCD) Study. One can address this issue by simply reapplying Freesurfer to the data set. However, this approach is generally computationally and labor intensive (e.g., requiring quality control). An alternative is to impute the missing measurements via a deep learning approach. However, the state-of-the-art is designed to estimate randomly missing values rather than entire measurements. We therefore propose to re-frame the imputation problem as a prediction task on another (public) data set that contains the missing measurements and shares some ROI measurements with the data sets of interest. A deep learning model is then trained to predict the missing measurements from the shared ones and afterwards is applied to the other data sets. Our proposed algorithm models the dependencies between ROI measurements via a graph neural network (GNN) and accounts for demographic differences in brain measurements (e.g. sex) by feeding the graph encoding into a parallel architecture. The architecture simultaneously optimizes a graph decoder to impute values and a classifier in predicting demographic factors. We test the approach, called Demographic Aware Graph-based Imputation (DAGI), on imputing those missing Freesurfer measurements of ABCD (N=3760) by training the predictor on those publicly released by the National Consortium on Alcohol and Neurodevelopment in Adolescence (NCANDA, N=540)...

DPMAC: Differentially Private Communication for Cooperative Multi-Agent Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.09902
  • repo_url: https://github.com/CANVOLCANO/DPMAC
  • paper_authors: Canzhe Zhao, Yanjie Ze, Jing Dong, Baoxiang Wang, Shuai Li
  • for: Preventing the leakage of agents' sensitive information through communication in cooperative multi-agent reinforcement learning.
  • methods: Equips each agent with a local stochastic message sender carrying a rigorous (ε, δ)-differential privacy guarantee, which automatically adjusts the learned message distribution instead of perturbing messages with predefined noise.
  • results: Experiments show a clear advantage of DPMAC over baseline methods in privacy-preserving scenarios.
    Abstract Communication lays the foundation for cooperation in human society and in multi-agent reinforcement learning (MARL). Humans also desire to maintain their privacy when communicating with others, yet such privacy concern has not been considered in existing works in MARL. To this end, we propose the \textit{differentially private multi-agent communication} (DPMAC) algorithm, which protects the sensitive information of individual agents by equipping each agent with a local message sender with rigorous $(\epsilon, \delta)$-differential privacy (DP) guarantee. In contrast to directly perturbing the messages with predefined DP noise as commonly done in privacy-preserving scenarios, we adopt a stochastic message sender for each agent respectively and incorporate the DP requirement into the sender, which automatically adjusts the learned message distribution to alleviate the instability caused by DP noise. Further, we prove the existence of a Nash equilibrium in cooperative MARL with privacy-preserving communication, which suggests that this problem is game-theoretically learnable. Extensive experiments demonstrate a clear advantage of DPMAC over baseline methods in privacy-preserving scenarios.
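A minimal sketch of a stochastic, privacy-aware message sender in PyTorch, assuming the agent's sender network outputs a mean and log-std for its message. The clip-then-add-Gaussian-noise step follows the standard Gaussian mechanism; `sigma` here is an illustrative noise level, not one calibrated to a specific (epsilon, delta) as DPMAC's analysis requires.

```python
import torch

def dp_message(mu, log_std, clip_norm=1.0, sigma=0.8):
    # Sample the message from the learned distribution instead of
    # sending a deterministic vector.
    msg = mu + log_std.exp() * torch.randn_like(mu)
    # Bound the sensitivity by clipping the message norm.
    norm = msg.norm().clamp(min=1e-12)
    msg = msg * torch.clamp(clip_norm / norm, max=1.0)
    # Gaussian-mechanism-style noise; calibrate sigma to (eps, delta)
    # in a real implementation.
    return msg + sigma * clip_norm * torch.randn_like(msg)

mu, log_std = torch.zeros(8), torch.zeros(8)
print(dp_message(mu, log_std))
```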

Contrastive Learning-based Imputation-Prediction Networks for In-hospital Mortality Risk Modeling using EHRs

  • paper_url: http://arxiv.org/abs/2308.09896
  • repo_url: https://github.com/liulab1356/CL-ImpPreNet
  • paper_authors: Yuxi Liu, Zhenhao Zhang, Shaowen Qin, Flora D. Salim, Antonio Jimeno Yepes
  • for: Predicting in-hospital mortality risk from electronic health records (EHRs), which has received wide attention because early warning of a patient's condition lets healthcare professionals intervene in time.
  • methods: Proposes a contrastive learning-based imputation-prediction network: graph analysis-based patient stratification groups similar patients so that their information (in addition to personal contextual information) can be used for missing-value imputation, and contrastive learning is integrated into the network architecture to strengthen patient representation learning and predictive performance.
  • results: On two real-world EHR datasets, the approach outperforms state-of-the-art methods on both the imputation and prediction tasks.
    Abstract Predicting the risk of in-hospital mortality from electronic health records (EHRs) has received considerable attention. Such predictions will provide early warning of a patient's health condition to healthcare professionals so that timely interventions can be taken. This prediction task is challenging since EHR data are intrinsically irregular, with not only many missing values but also varying time intervals between medical records. Existing approaches focus on exploiting the variable correlations in patient medical records to impute missing values and establishing time-decay mechanisms to deal with such irregularity. This paper presents a novel contrastive learning-based imputation-prediction network for predicting in-hospital mortality risks using EHR data. Our approach introduces graph analysis-based patient stratification modeling in the imputation process to group similar patients. This allows information of similar patients only to be used, in addition to personal contextual information, for missing value imputation. Moreover, our approach can integrate contrastive learning into the proposed network architecture to enhance patient representation learning and predictive performance on the classification task. Experiments on two real-world EHR datasets show that our approach outperforms the state-of-the-art approaches in both imputation and prediction tasks.

Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs

  • paper_url: http://arxiv.org/abs/2308.09895
  • repo_url: None
  • paper_authors: Federico Cassano, John Gouwar, Francesca Lucchetti, Claire Schlesinger, Carolyn Jane Anderson, Michael Greenberg, Abhinav Jangda, Arjun Guha
  • for: Boosting the performance of Code LLMs on low-resource programming languages.
  • methods: Generates high-quality semi-synthetic datasets by translating training data from high-resource into low-resource languages, then fine-tunes and evaluates Code LLMs on these datasets.
  • results: The MultiPL-T approach produces tens of thousands of validated training items and achieves state-of-the-art performance on benchmark problems for Lua, Racket, and OCaml.
    Abstract Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as a building block for research in programming languages and software engineering. However, the quality of code produced by a Code LLM varies significantly by programming languages. Code LLMs produce impressive results on programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages, like OCaml and Racket. This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach generates high-quality datasets for low-resource languages, which can then be used to fine-tune any pretrained Code LLM. Our approach, called MultiPL-T, translates training data from high-resource languages into training data for low-resource languages. We apply our approach to generate tens of thousands of new, validated training items for Racket, OCaml, and Lua from Python. Moreover, we use an open dataset (The Stack) and model (StarCoderBase), which allow us to decontaminate benchmarks and train models on this data without violating the model license. With MultiPL-T generated data, we present fine-tuned versions of StarCoderBase that achieve state-of-the-art performance for Racket, OCaml, and Lua on benchmark problems. For Lua, our fine-tuned model achieves the same performance as StarCoderBase as Python -- a very high-resource language -- on the MultiPL-E benchmarks. For Racket and OCaml, we double their performance on MultiPL-E, bringing their performance close to higher-resource languages such as Ruby and C#.
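A sketch of the translate-then-validate loop that such a pipeline implies, with Python as the high-resource source and Lua as the target. `translate_with_llm` is a hypothetical stub standing in for a prompt to a code LLM, and validation here simply runs the translated file with a local `lua` interpreter; the actual MultiPL-T pipeline validates candidates against translated unit tests.

```python
import subprocess
import tempfile

def translate_with_llm(python_item: str, target: str) -> str:
    """Hypothetical stub: prompt a code LLM (e.g., StarCoderBase) to
    translate a Python function plus its tests into `target`."""
    raise NotImplementedError("call your code LLM here")

def runs_cleanly(lua_source: str) -> bool:
    """Keep a translation only if executing it (tests included) succeeds."""
    with tempfile.NamedTemporaryFile("w", suffix=".lua", delete=False) as f:
        f.write(lua_source)
    proc = subprocess.run(["lua", f.name], capture_output=True, timeout=30)
    return proc.returncode == 0

def build_dataset(python_items, target="lua"):
    kept = []
    for src in python_items:
        try:
            candidate = translate_with_llm(src, target)
            if runs_cleanly(candidate):
                kept.append(candidate)  # validated item joins the fine-tuning set
        except Exception:
            continue                    # failed translations are discarded
    return kept
```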

Utilizing Semantic Textual Similarity for Clinical Survey Data Feature Selection

  • paper_url: http://arxiv.org/abs/2308.09892
  • repo_url: https://github.com/bcwarner/sts-select
  • paper_authors: Benjamin C. Warner, Ziqi Xu, Simon Haroutounian, Thomas Kannampallil, Chenyang Lu
  • for: Proposes selecting features in clinical survey data by analyzing the semantic textual similarity (STS) between the textual names of features and the target outcome.
  • methods: Uses language models (LMs) to score the STS between feature names and target names, and applies these scores to feature selection.
  • results: Feature selection with STS yields higher model performance than traditional feature selection algorithms.
    Abstract Survey data can contain a high number of features while having a comparatively low quantity of examples. Machine learning models that attempt to predict outcomes from survey data under these conditions can overfit and result in poor generalizability. One remedy to this issue is feature selection, which attempts to select an optimal subset of features to learn upon. A relatively unexplored source of information in the feature selection process is the usage of textual names of features, which may be semantically indicative of which features are relevant to a target outcome. The relationships between feature names and target names can be evaluated using language models (LMs) to produce semantic textual similarity (STS) scores, which can then be used to select features. We examine the performance using STS to select features directly and in the minimal-redundancy-maximal-relevance (mRMR) algorithm. The performance of STS as a feature selection metric is evaluated against preliminary survey data collected as a part of a clinical study on persistent post-surgical pain (PPSP). The results suggest that features selected with STS can result in higher performance models compared to traditional feature selection algorithms.
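A minimal sketch of the idea: embed each feature's textual name and the target's name, rank features by cosine similarity, and keep the top k. The `embed` function is a hypothetical stub for whatever sentence encoder is used; the combination with mRMR evaluated in the paper is omitted here.

```python
import numpy as np

def embed(texts):
    """Hypothetical stub for a language-model sentence encoder;
    returns one vector per input string."""
    raise NotImplementedError

def sts_select(feature_names, target_name, k=10):
    vecs = np.asarray(embed(list(feature_names) + [target_name]))
    feats, tgt = vecs[:-1], vecs[-1]
    # Cosine similarity between each feature name and the target name.
    sims = feats @ tgt / (np.linalg.norm(feats, axis=1) * np.linalg.norm(tgt))
    order = np.argsort(-sims)
    return [feature_names[i] for i in order[:k]]
```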

Inductive-bias Learning: Generating Code Models with Large Language Model

  • paper_url: http://arxiv.org/abs/2308.09890
  • repo_url: https://github.com/fuyu-quant/iblm
  • paper_authors: Toma Tanaka, Naofumi Emoto, Tsukasa Yumibayashi
  • for: Proposes a new learning method, Inductive-Bias Learning (IBL), which combines in-context learning (ICL) with code generation to achieve accurate inference.
  • methods: Inputs training data into the prompt and, from the contextual understanding, generates a "Code Model" with the structure necessary for inference.
  • results: IBL achieves predictive accuracy comparable to, and in some cases surpassing, ICL and representative machine learning models, and the generated code is readable and explainable. The IBL code is open source at https://github.com/fuyu-quant/IBLM.
    Abstract Large Language Models (LLMs) have been attracting attention due to an ability called in-context learning (ICL). With ICL, highly accurate inference based on rules ``in the context'' can be achieved by merely including training data in the prompt, without updating the parameters of the LLM. Although ICL is a developing field with many unanswered questions, the LLM itself serves as the inference model, seemingly realizing inference without an explicitly indicated ``inductive bias''. Code generation is another highlighted application of LLMs. The accuracy of code generation has dramatically improved, enabling even non-engineers to generate code that performs desired tasks by crafting appropriate prompts. In this paper, we propose a novel ``learning'' method called ``Inductive-Bias Learning'' (IBL), which combines the techniques of ICL and code generation. The idea of IBL is straightforward. Like ICL, IBL inputs training data into the prompt, but it outputs code with the structure necessary for inference (which we refer to as a ``Code Model'') derived from a ``contextual understanding''. Despite being a seemingly simple approach, IBL encompasses both the ``inference without explicit inductive bias'' inherent in ICL and the ``readability and explainability'' of code generation. Surprisingly, generated Code Models have been found to achieve predictive accuracy comparable to, and in some cases surpassing, ICL and representative machine learning models. Our IBL code is open source: https://github.com/fuyu-quant/IBLM

DUAW: Data-free Universal Adversarial Watermark against Stable Diffusion Customization

  • paper_url: http://arxiv.org/abs/2308.09889
  • repo_url: None
  • paper_authors: Xiaoyu Ye, Hao Huang, Jiaqi An, Yongtao Wang
  • for: Protecting copyrighted images from various Stable Diffusion customization methods that can plagiarize specific styles or subjects.
  • methods: An invisible, data-free universal adversarial watermark (DUAW) that protects a myriad of copyrighted images across customization approaches and SD versions without directly handling the copyrighted images.
  • results: DUAW effectively distorts the images generated by customized SD models, making them discernible to both human observers and a simple classifier.
    Abstract Stable Diffusion (SD) customization approaches enable users to personalize SD model outputs, greatly enhancing the flexibility and diversity of AI art. However, they also allow individuals to plagiarize specific styles or subjects from copyrighted images, which raises significant concerns about potential copyright infringement. To address this issue, we propose an invisible data-free universal adversarial watermark (DUAW), aiming to protect a myriad of copyrighted images from different customization approaches across various versions of SD models. First, DUAW is designed to disrupt the variational autoencoder during SD customization. Second, DUAW operates in a data-free context, where it is trained on synthetic images produced by a Large Language Model (LLM) and a pretrained SD model. This approach circumvents the necessity of directly handling copyrighted images, thereby preserving their confidentiality. Once crafted, DUAW can be imperceptibly integrated into massive copyrighted images, serving as a protective measure by inducing significant distortions in the images generated by customized SD models. Experimental results demonstrate that DUAW can effectively distort the outputs of fine-tuned SD models, rendering them discernible to both human observers and a simple classifier.

On Estimating the Gradient of the Expected Information Gain in Bayesian Experimental Design

  • paper_url: http://arxiv.org/abs/2308.09888
  • repo_url: https://github.com/ziq-ao/GradEIG
  • paper_authors: Ziqiao Ao, Jinglai Li
  • for: Improving experimental conditions for Bayesian inference by optimizing the expected information gain (EIG).
  • methods: Two estimators of the EIG gradient: UEEG-MCMC, which uses posterior samples generated by Markov Chain Monte Carlo (MCMC), and BEEG-AP, which emphasizes simulation efficiency by repeatedly reusing parameter samples.
  • results: Theoretical analysis and numerical studies show that UEEG-MCMC is robust with respect to the actual EIG value, while BEEG-AP is more efficient when the EIG value to be optimized is small; both outperform several popular benchmark methods in numerical experiments.
    Abstract Bayesian Experimental Design (BED), which aims to find the optimal experimental conditions for Bayesian inference, is usually posed as the optimization of the expected information gain (EIG). Gradient information is often needed for efficient EIG optimization, and as a result the ability to estimate the gradient of the EIG is essential for BED problems. The primary goal of this work is to develop methods for estimating the gradient of the EIG, which, combined with stochastic gradient descent algorithms, result in efficient optimization of the EIG. Specifically, we first introduce a posterior expected representation of the EIG gradient with respect to the design variables. Based on this, we propose two methods for estimating the EIG gradient: UEEG-MCMC, which leverages posterior samples generated through Markov Chain Monte Carlo (MCMC) to estimate the EIG gradient, and BEEG-AP, which focuses on achieving high simulation efficiency by repeatedly using parameter samples. Theoretical analysis and numerical studies illustrate that UEEG-MCMC is robust against the actual EIG value, while BEEG-AP is more efficient when the EIG value to be optimized is small. Moreover, both methods show superior performance compared to several popular benchmarks in our numerical experiments.
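For reference, the quantity whose gradient is being estimated is the standard expected information gain for a design $d$, with parameters $\theta$ and data $y$; this definition is background, not the paper's novel posterior-expected gradient representation:

```latex
\mathrm{EIG}(d)
  = \mathbb{E}_{p(\theta)\,p(y \mid \theta, d)}
    \bigl[\, \log p(y \mid \theta, d) - \log p(y \mid d) \,\bigr]
```

where $p(y \mid d) = \int p(\theta)\, p(y \mid \theta, d)\, \mathrm{d}\theta$ is the marginal likelihood; UEEG-MCMC and BEEG-AP are two estimators of $\nabla_d \mathrm{EIG}(d)$.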

Calibrating Uncertainty for Semi-Supervised Crowd Counting

  • paper_url: http://arxiv.org/abs/2308.09887
  • repo_url: None
  • paper_authors: Chen Li, Xiaoling Hu, Shahira Abousamra, Chao Chen
  • for: Reliable semi-supervised crowd counting with calibrated model uncertainty.
  • methods: A supervised uncertainty estimation strategy trains the model through a surrogate function, keeping uncertainty well controlled throughout training; a matching-based patch-wise surrogate function better approximates uncertainty for crowd counting tasks.
  • results: Experiments show that the method generates reliable uncertainty estimates and high-quality pseudo-labels, achieving state-of-the-art performance on semi-supervised crowd counting.
    Abstract Semi-supervised crowd counting is an important yet challenging task. A popular approach is to iteratively generate pseudo-labels for unlabeled data and add them to the training set. The key is to use uncertainty to select reliable pseudo-labels. In this paper, we propose a novel method to calibrate model uncertainty for crowd counting. Our method takes a supervised uncertainty estimation strategy to train the model through a surrogate function. This ensures the uncertainty is well controlled throughout the training. We propose a matching-based patch-wise surrogate function to better approximate uncertainty for crowd counting tasks. The proposed method pays a sufficient amount of attention to details, while maintaining a proper granularity. Altogether our method is able to generate reliable uncertainty estimates and high-quality pseudo-labels, and achieves state-of-the-art performance in semi-supervised crowd counting.

A Transformer-based Framework For Multi-variate Time Series: A Remaining Useful Life Prediction Use Case

  • paper_url: http://arxiv.org/abs/2308.09884
  • repo_url: None
  • paper_authors: Oluwaseyi Ogunfowora, Homayoun Najjaran
  • for: Proposes an encoder-transformer framework for multivariate time series prediction, applied to predicting the remaining useful life (RUL) of machines.
  • methods: Conducts three model-specific experiments to transfer transformers from the natural language domain to time series, and introduces a novel expanding window method, compared against the sliding window method, to make the model aware of the initial stages of a machine's life and its degradation path.
  • results: On the test data, the proposed encoder-transformer outperforms 13 state-of-the-art (SOTA) models with an average performance increase of 137.65% over the next best model.
    Abstract In recent times, Large Language Models (LLMs) have captured a global spotlight and revolutionized the field of Natural Language Processing. One of the factors attributed to the effectiveness of LLMs is the model architecture used for training: transformers. Transformer models excel at capturing contextual features in sequential data; since time series data are sequential, transformer models can be leveraged for more efficient time series data prediction. The field of prognostics is vital to system health management and proper maintenance planning. A reliable estimation of the remaining useful life (RUL) of machines holds the potential for substantial cost savings, including avoiding abrupt machine failures, maximizing equipment usage, and serving as a decision support system (DSS). This work proposes an encoder-transformer architecture-based framework for multivariate time series prediction for a prognostics use case. We validated the effectiveness of the proposed framework on all four sets of the C-MAPSS benchmark dataset for the remaining useful life prediction task. To effectively transfer the knowledge and application of transformers from the natural language domain to time series, three model-specific experiments were conducted. Also, to give the model awareness of the initial stages of the machine's life and its degradation path, a novel expanding window method was proposed for the first time in this work; it was compared with the sliding window method and led to a large improvement in the performance of the encoder-transformer model. Finally, the performance of the proposed encoder-transformer model was evaluated on the test dataset and compared with the results from 13 other state-of-the-art (SOTA) models in the literature; it outperformed them all with an average performance increase of 137.65% over the next best model across all the datasets.
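The difference between the two windowing schemes is easy to state in code. A minimal sketch, assuming a univariate run-to-failure series for brevity: sliding windows keep a fixed width and move the start forward, while expanding windows always start at the beginning of the unit's life, so every training sample contains the early stages and degradation path.

```python
import numpy as np

def sliding_windows(series, width):
    """Fixed-width windows: series[0:w], series[1:w+1], ..."""
    return [series[i:i + width] for i in range(len(series) - width + 1)]

def expanding_windows(series, min_width):
    """Growing windows anchored at t=0: series[0:w], series[0:w+1], ...,
    so the model always sees the unit's early life and degradation path."""
    return [series[:i] for i in range(min_width, len(series) + 1)]

x = np.arange(6)
print(len(sliding_windows(x, 3)), len(expanding_windows(x, 3)))  # 4 4
```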

Flamingo: Multi-Round Single-Server Secure Aggregation with Applications to Private Federated Learning

  • paper_url: http://arxiv.org/abs/2308.09883
  • repo_url: https://github.com/eniac/flamingo
  • paper_authors: Yiping Ma, Jess Woods, Sebastian Angel, Antigoni Polychroniadou, Tal Rabin
  • for: Describes Flamingo, a system for secure aggregation of data across a large set of clients, with applications to private federated learning.
  • methods: Introduces a new lightweight dropout-resilience protocol so the server still obtains a meaningful result when clients leave mid-sum, and a new way for clients to locally choose their neighborhoods, reducing client-server interactions.
  • results: Flamingo securely trains neural networks on the (Extended) MNIST and CIFAR-100 datasets; the models converge without a loss in accuracy compared with a non-private federated learning system, and the end-to-end runtime for a full training session is significantly reduced over prior work.
    Abstract This paper introduces Flamingo, a system for secure aggregation of data across a large set of clients. In secure aggregation, a server sums up the private inputs of clients and obtains the result without learning anything about the individual inputs beyond what is implied by the final sum. Flamingo focuses on the multi-round setting found in federated learning in which many consecutive summations (averages) of model weights are performed to derive a good model. Previous protocols, such as Bell et al. (CCS '20), have been designed for a single round and are adapted to the federated learning setting by repeating the protocol multiple times. Flamingo eliminates the need for the per-round setup of previous protocols, and has a new lightweight dropout resilience protocol to ensure that if clients leave in the middle of a sum the server can still obtain a meaningful result. Furthermore, Flamingo introduces a new way to locally choose the so-called client neighborhood introduced by Bell et al. These techniques help Flamingo reduce the number of interactions between clients and the server, resulting in a significant reduction in the end-to-end runtime for a full training session over prior work. We implement and evaluate Flamingo and show that it can securely train a neural network on the (Extended) MNIST and CIFAR-100 datasets, and the model converges without a loss in accuracy, compared to a non-private federated learning system.
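A toy sketch of the pairwise-masking idea at the core of single-server secure aggregation: each pair of clients derives a shared mask from a common seed; one adds it and the other subtracts it, so the masks cancel in the server's sum while no individual value is revealed. Flamingo's actual protocol layers dropout resilience, local neighborhood selection, and multi-round secret reuse on top; the shared seed dictionary below is a stand-in for its key agreement.

```python
import random

P = 2**61 - 1  # arithmetic is done modulo a large prime

def masked_update(my_id, my_value, neighbor_ids, pair_seeds):
    masked = my_value % P
    for j in neighbor_ids:
        rng = random.Random(pair_seeds[frozenset((my_id, j))])
        mask = rng.randrange(P)
        # The lower-id client adds the pairwise mask, the higher-id
        # client subtracts it, so the pair's masks cancel in the sum.
        masked = (masked + mask if my_id < j else masked - mask) % P
    return masked

seeds = {frozenset((0, 1)): 11, frozenset((0, 2)): 22, frozenset((1, 2)): 33}
updates = [masked_update(i, v, [j for j in range(3) if j != i], seeds)
           for i, v in enumerate([10, 20, 30])]
print(sum(updates) % P)  # 60: the true sum, with no single value exposed
```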

Generative Adversarial Networks Unlearning

  • paper_url: http://arxiv.org/abs/2308.09881
  • repo_url: None
  • paper_authors: Hui Sun, Tianqing Zhu, Wenhan Chang, Wanlei Zhou
  • for: Addresses machine unlearning for Generative Adversarial Networks (GANs), where the generator-discriminator architecture means that erasing data can disrupt the continuity and completeness of the latent space and degrade the model.
  • methods: Proposes a substitution mechanism and a fake label to handle the generator and discriminator challenges, and builds on them a cascaded unlearning approach for both item and class unlearning in which the unlearning and learning processes run in a cascaded manner.
  • results: A comprehensive evaluation on MNIST and CIFAR-10 shows greatly improved item and class unlearning efficiency, reducing the required time by up to 185x and 284x respectively compared with retraining from scratch; the minor performance degradation is negligible for a small number of images (e.g., 64) and has no adverse effect on downstream tasks such as classification.
    Abstract As machine learning continues to develop, and data misuse scandals become more prevalent, individuals are becoming increasingly concerned about their personal information and are advocating for the right to remove their data. Machine unlearning has emerged as a solution to erase training data from trained machine learning models. Despite its success in classifiers, research on Generative Adversarial Networks (GANs) is limited due to their unique architecture, including a generator and a discriminator. One challenge pertains to generator unlearning, as the process could potentially disrupt the continuity and completeness of the latent space. This disruption might consequently diminish the model's effectiveness after unlearning. Another challenge is how to define a criterion that the discriminator should perform for the unlearning images. In this paper, we introduce a substitution mechanism and define a fake label to effectively mitigate these challenges. Based on the substitution mechanism and fake label, we propose a cascaded unlearning approach for both item and class unlearning within GAN models, in which the unlearning and learning processes run in a cascaded manner. We conducted a comprehensive evaluation of the cascaded unlearning technique using the MNIST and CIFAR-10 datasets. Experimental results demonstrate that this approach achieves significantly improved item and class unlearning efficiency, reducing the required time by up to 185x and 284x for the MNIST and CIFAR-10 datasets, respectively, in comparison to retraining from scratch. Notably, although the model's performance experiences minor degradation after unlearning, this reduction is negligible when dealing with a minimal number of images (e.g., 64) and has no adverse effects on downstream tasks such as classification.

DatasetEquity: Are All Samples Created Equal? In The Quest For Equity Within Datasets

  • paper_url: http://arxiv.org/abs/2308.09878
  • repo_url: https://github.com/towardsautonomy/datasetequity
  • paper_authors: Shubham Shrivastava, Xianling Zhang, Sushruth Nagesh, Armin Parchami
  • for: Addresses the data imbalance issue in machine learning, specifically in computer vision tasks, by developing a novel method that leverages deep perceptual embeddings and clustering to weigh samples differently during training.
  • methods: The proposed method uses sample likelihoods based on image appearance, computed using deep perceptual embeddings and clustering, to weigh samples differently during training with a novel $\textbf{Generalized Focal Loss}$ function.
  • results: The proposed method achieves over 200% AP gains on under-represented classes (Cyclist) in the KITTI dataset, demonstrating its effectiveness in improving state-of-the-art 3D object detection methods and its generalizability across different datasets and rare classes.
    Abstract Data imbalance is a well-known issue in the field of machine learning, attributable to the cost of data collection, the difficulty of labeling, and the geographical distribution of the data. In computer vision, bias in data distribution caused by image appearance remains highly unexplored. Compared to categorical distributions using class labels, image appearance reveals complex relationships between objects beyond what class labels provide. Clustering deep perceptual features extracted from raw pixels gives a richer representation of the data. This paper presents a novel method for addressing data imbalance in machine learning. The method computes sample likelihoods based on image appearance using deep perceptual embeddings and clustering. It then uses these likelihoods to weigh samples differently during training with a proposed $\textbf{Generalized Focal Loss}$ function. This loss can be easily integrated with deep learning algorithms. Experiments validate the method's effectiveness across autonomous driving vision datasets including KITTI and nuScenes. The loss function improves state-of-the-art 3D object detection methods, achieving over $200\%$ AP gains on under-represented classes (Cyclist) in the KITTI dataset. The results demonstrate the method is generalizable, complements existing techniques, and is particularly beneficial for smaller datasets and rare classes. Code is available at: https://github.com/towardsautonomy/DatasetEquity
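A rough sketch of likelihood-based sample weighting: cluster deep perceptual embeddings, read a per-sample appearance likelihood off the cluster frequencies, and up-weight rare-looking samples. The focal-style `(1 - likelihood) ** gamma` weight is an assumed simplification, not the paper's exact Generalized Focal Loss parameterization, and KMeans stands in for whatever clustering is used.

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_weights(embeddings, n_clusters=8, gamma=1.0):
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    freq = np.bincount(labels, minlength=n_clusters) / len(labels)
    likelihood = freq[labels]            # per-sample appearance likelihood
    return (1.0 - likelihood) ** gamma   # rarer appearance -> larger weight

emb = np.random.randn(100, 16)           # stand-in for perceptual embeddings
w = sample_weights(emb)
print(w.min(), w.max())                  # rare clusters get the largest weights
```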

Skill Transformer: A Monolithic Policy for Mobile Manipulation

  • paper_url: http://arxiv.org/abs/2308.09873
  • repo_url: None
  • paper_authors: Xiaoyu Huang, Dhruv Batra, Akshara Rai, Andrew Szot
  • for: Solving long-horizon robotic mobile manipulation tasks.
  • methods: Combines conditional sequence modeling with skill modularity: a transformer trained end-to-end on demonstration trajectories predicts both a high-level skill (e.g., navigation, picking, placing) and whole-body low-level actions (e.g., base and arm motion), retaining the task's composability and modularity through a skill predictor module.
  • results: On an embodied rearrangement benchmark, the method performs robust task planning and low-level control in new scenarios, achieving a 2.5x higher success rate than baselines on hard rearrangement problems.
    Abstract We present Skill Transformer, an approach for solving long-horizon robotic tasks by combining conditional sequence modeling and skill modularity. Conditioned on egocentric and proprioceptive observations of a robot, Skill Transformer is trained end-to-end to predict both a high-level skill (e.g., navigation, picking, placing), and a whole-body low-level action (e.g., base and arm motion), using a transformer architecture and demonstration trajectories that solve the full task. It retains the composability and modularity of the overall task through a skill predictor module while reasoning about low-level actions and avoiding hand-off errors, common in modular approaches. We test Skill Transformer on an embodied rearrangement benchmark and find it performs robust task planning and low-level control in new scenarios, achieving a 2.5x higher success rate than baselines in hard rearrangement problems.

Tensor-Compressed Back-Propagation-Free Training for (Physics-Informed) Neural Networks

  • paper_url: http://arxiv.org/abs/2308.09858
  • repo_url: None
  • paper_authors: Yequan Zhao, Xinling Yu, Zhixiong Chen, Ziyue Liu, Sijia Liu, Zheng Zhang
  • for: Proposes a backward-propagation-free (BP-free) framework that trains realistic neural networks using only forward propagation, targeting edge devices.
  • methods: Three technical contributions: a tensor-compressed variance-reduction approach that greatly improves the scalability of zeroth-order (ZO) optimization, handling network sizes beyond previous ZO approaches; a hybrid gradient evaluation approach that improves ZO training efficiency; and an extension to physics-informed neural networks (PINNs) via a sparse-grid approach that estimates the derivatives in the loss function without BP.
  • results: BP-free training loses only a little accuracy on the MNIST dataset compared with standard first-order training, and successfully trains a PINN that solves a 20-dimensional Hamilton-Jacobi-Bellman PDE; this memory-efficient, BP-free approach may become a foundation for near-future on-device training on resource-constrained platforms (e.g., FPGAs, ASICs, microcontrollers, and photonic chips).
    Abstract Backward propagation (BP) is widely used to compute the gradients in neural network training. However, it is hard to implement BP on edge devices due to the lack of hardware and software resources to support automatic differentiation. This has tremendously increased the design complexity and time-to-market of on-device training accelerators. This paper presents a completely BP-free framework that only requires forward propagation to train realistic neural networks. Our technical contributions are three-fold. Firstly, we present a tensor-compressed variance reduction approach to greatly improve the scalability of zeroth-order (ZO) optimization, making it feasible to handle a network size that is beyond the capability of previous ZO approaches. Secondly, we present a hybrid gradient evaluation approach to improve the efficiency of ZO training. Finally, we extend our BP-free training framework to physics-informed neural networks (PINNs) by proposing a sparse-grid approach to estimate the derivatives in the loss function without using BP. Our BP-free training only loses little accuracy on the MNIST dataset compared with standard first-order training. We also demonstrate successful results in training a PINN for solving a 20-dim Hamiltonian-Jacobi-Bellman PDE. This memory-efficient and BP-free approach may serve as a foundation for the near-future on-device training on many resource-constraint platforms (e.g., FPGA, ASIC, micro-controllers, and photonic chips).
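The forward-only training rests on zeroth-order gradient estimation. Below is a minimal randomized gradient estimator (RGE) of the kind ZO methods build on, requiring only loss evaluations; the tensor-compressed variance reduction and hybrid gradient evaluation that make this scale in the paper are omitted.

```python
import numpy as np

def zo_gradient(loss, theta, n_dirs=16, mu=1e-3):
    """Average directional finite differences along random probes:
    E[(f(theta + mu*u) - f(theta)) / mu * u] approximates grad f."""
    grad = np.zeros_like(theta)
    for _ in range(n_dirs):
        u = np.random.randn(*theta.shape)
        grad += (loss(theta + mu * u) - loss(theta)) / mu * u
    return grad / n_dirs

loss = lambda th: float(np.sum(th ** 2))     # toy objective, true grad = 2*theta
theta = np.ones(4)
print(zo_gradient(loss, theta, n_dirs=200))  # roughly [2, 2, 2, 2]
```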

Backdoor Mitigation by Correcting the Distribution of Neural Activations

  • paper_url: http://arxiv.org/abs/2308.09850
  • repo_url: None
  • paper_authors: Xi Li, Zhen Xiang, David J. Miller, George Kesidis
  • for: Studies backdoor (Trojan) attacks on deep neural networks (DNNs), in particular the property that a successful attack alters the distribution of internal-layer activations for backdoor-trigger instances, and how correcting this alteration achieves backdoor mitigation.
  • methods: Uses reverse-engineered triggers to correct the alteration in the internal activation distribution, achieving post-training backdoor mitigation.
  • results: The method mitigates backdoor attacks effectively without changing any trainable DNN parameters, generally outperforming methods that require intensive parameter tuning, and also efficiently detects test instances carrying the trigger.
    Abstract Backdoor (Trojan) attacks are an important type of adversarial exploit against deep neural networks (DNNs), wherein a test instance is (mis)classified to the attacker's target class whenever the attacker's backdoor trigger is present. In this paper, we reveal and analyze an important property of backdoor attacks: a successful attack causes an alteration in the distribution of internal layer activations for backdoor-trigger instances, compared to that for clean instances. Even more importantly, we find that instances with the backdoor trigger will be correctly classified to their original source classes if this distribution alteration is corrected. Based on our observations, we propose an efficient and effective method that achieves post-training backdoor mitigation by correcting the distribution alteration using reverse-engineered triggers. Notably, our method does not change any trainable parameters of the DNN, but achieves generally better mitigation performance than existing methods that do require intensive DNN parameter tuning. It also efficiently detects test instances with the trigger, which may help to catch adversarial entities in the act of exploiting the backdoor.
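A simplified sketch of the core observation: if trigger instances shift an internal layer's activation distribution, re-standardizing those activations toward clean statistics undoes the shift. Plain per-neuron moment matching is used below only to illustrate the idea; the paper derives its correction from reverse-engineered triggers rather than batch statistics.

```python
import numpy as np

def fit_clean_stats(clean_acts):
    """Per-neuron mean/std of an internal layer on known-clean inputs."""
    return clean_acts.mean(0), clean_acts.std(0) + 1e-8

def correct(acts, clean_mu, clean_sd):
    """Re-standardize a batch of activations to match clean statistics."""
    mu, sd = acts.mean(0), acts.std(0) + 1e-8
    return (acts - mu) / sd * clean_sd + clean_mu

clean = np.random.randn(256, 64)
shifted = np.random.randn(32, 64) * 3.0 + 1.5  # trigger-like distribution shift
mu, sd = fit_clean_stats(clean)
fixed = correct(shifted, mu, sd)
print(fixed.mean().round(2), fixed.std().round(2))  # back near clean stats
```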

Enumerating Safe Regions in Deep Neural Networks with Provable Probabilistic Guarantees

  • paper_url: http://arxiv.org/abs/2308.09842
  • repo_url: None
  • paper_authors: Luca Marzari, Davide Corsi, Enrico Marchesini, Alessandro Farinelli, Ferdinando Cicalese
  • for: Identifying safe regions is key to guaranteeing trust in systems based on deep neural networks (DNNs).
  • methods: Introduces the AllDNN-Verification problem and proposes an efficient approximation method called epsilon-ProVe.
  • results: The method provides a tight lower estimate of the safe regions with provable probabilistic guarantees; experiments on standard benchmarks demonstrate its scalability and effectiveness.
    Abstract Identifying safe areas is a key point to guarantee trust for systems that are based on Deep Neural Networks (DNNs). To this end, we introduce the AllDNN-Verification problem: given a safety property and a DNN, enumerate the set of all the regions of the property input domain which are safe, i.e., where the property does hold. Due to the #P-hardness of the problem, we propose an efficient approximation method called epsilon-ProVe. Our approach exploits a controllable underestimation of the output reachable sets obtained via statistical prediction of tolerance limits, and can provide a tight (with provable probabilistic guarantees) lower estimate of the safe areas. Our empirical evaluation on different standard benchmarks shows the scalability and effectiveness of our method, offering valuable insights for this new type of verification of DNNs.

Microscopy Image Segmentation via Point and Shape Regularized Data Synthesis

  • paper_url: http://arxiv.org/abs/2308.09835
  • repo_url: None
  • paper_authors: Shijie Li, Mengwei Ren, Thomas Ach, Guido Gerig
  • for: Deep learning methods for microscopy image segmentation usually require large training sets with dense annotations, which are costly and laborious to obtain; this paper trains segmentation models from point annotations (object centroids) alone.
  • methods: A three-stage framework: (1) sample a pseudo dense segmentation mask from point annotations under shape priors; (2) translate the mask into a realistic microscopy image with an unpaired image generative model regularized by object-level consistency; (3) use the pseudo masks and synthetic images as a paired dataset to train an ad-hoc segmentation model.
  • results: Models trained on the synthetic dataset significantly outperform those trained with pseudo-labels or baseline-generated images, and achieve results comparable to models trained on authentic microscopy images with dense labels. Code is available.
    Abstract Current deep learning-based approaches for the segmentation of microscopy images heavily rely on large amount of training data with dense annotation, which is highly costly and laborious in practice. Compared to full annotation where the complete contour of objects is depicted, point annotations, specifically object centroids, are much easier to acquire and still provide crucial information about the objects for subsequent segmentation. In this paper, we assume access to point annotations only during training and develop a unified pipeline for microscopy image segmentation using synthetically generated training data. Our framework includes three stages: (1) it takes point annotations and samples a pseudo dense segmentation mask constrained with shape priors; (2) with an image generative model trained in an unpaired manner, it translates the mask to a realistic microscopy image regularized by object level consistency; (3) the pseudo masks along with the synthetic images then constitute a pairwise dataset for training an ad-hoc segmentation model. On the public MoNuSeg dataset, our synthesis pipeline produces more diverse and realistic images than baseline models while maintaining high coherence between input masks and generated images. When using the identical segmentation backbones, the models trained on our synthetic dataset significantly outperform those trained with pseudo-labels or baseline-generated images. Moreover, our framework achieves comparable results to models trained on authentic microscopy images with dense labels, demonstrating its potential as a reliable and highly efficient alternative to labor-intensive manual pixel-wise annotations in microscopy image segmentation. The code is available.
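Stage (1) in miniature: the sketch below turns centroid point annotations into a pseudo dense mask using the simplest possible shape prior, a fixed-radius disk. The paper samples masks under richer shape priors; the radius here is an illustrative assumption.

```python
import numpy as np

def pseudo_mask_from_points(points, shape, radius=6):
    """Rasterize a disk of the given radius around each annotated centroid."""
    h, w = shape
    yy, xx = np.mgrid[:h, :w]
    mask = np.zeros(shape, dtype=np.uint8)
    for cy, cx in points:
        mask[(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = 1
    return mask

m = pseudo_mask_from_points([(20, 20), (40, 50)], shape=(64, 64))
print(m.sum())  # pixels covered by the two disks
```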

Learning from A Single Graph is All You Need for Near-Shortest Path Routing in Wireless Networks

  • paper_url: http://arxiv.org/abs/2308.09829
  • repo_url: None
  • paper_authors: Yung-Fu Chen, Sen Lin, Anish Arora
  • for: This work proposes a learning algorithm for local routing policies in wireless networks that needs only a few data samples obtained from a single graph, yet generalizes to all random graphs in a standard network model.
  • methods: Deep neural networks (DNNs) are trained to learn local routing policies efficiently and scalably, considering only a node's own state and the states of its neighboring nodes.
  • results: Samples generated from a few routing paths in a modest-sized seed graph quickly yield routing policies that are efficient and generalizable across random graphs; the use of domain knowledge also provides theoretical explainability of why a small seed graph and node subsampling suffice.
    Abstract We propose a learning algorithm for local routing policies that needs only a few data samples obtained from a single graph while generalizing to all random graphs in a standard model of wireless networks. We thus solve the all-pairs near-shortest path problem by training deep neural networks (DNNs) that efficiently and scalably learn routing policies that are local, i.e., they only consider node states and the states of neighboring nodes. Remarkably, one of these DNNs we train learns a policy that exactly matches the performance of greedy forwarding; another generally outperforms greedy forwarding. Our algorithm design exploits network domain knowledge in several ways: First, in the selection of input features and, second, in the selection of a ``seed graph'' and subsamples from its shortest paths. The leverage of domain knowledge provides theoretical explainability of why the seed graph and node subsampling suffice for learning that is efficient, scalable, and generalizable. Simulation-based results on uniform random graphs with diverse sizes and densities empirically corroborate that using samples generated from a few routing paths in a modest-sized seed graph quickly learns a model that is generalizable across (almost) all random graphs in the wireless network model.
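For context, the greedy-forwarding baseline that one learned policy exactly matches takes only a few lines: hand the packet to the neighbor geographically closest to the destination, and fail at a local minimum where no neighbor makes progress. A minimal sketch:

```python
import math

def greedy_forward(pos, neighbor_positions, dest):
    """Return the next hop under greedy geographic forwarding,
    or None when the packet is stuck at a local minimum (a void)."""
    dist = lambda a, b: math.dist(a, b)
    best = min(neighbor_positions, key=lambda n: dist(n, dest), default=None)
    if best is None or dist(best, dest) >= dist(pos, dest):
        return None  # no neighbor is closer to the destination
    return best

print(greedy_forward((0, 0), [(1, 0), (0, 2)], dest=(5, 0)))  # (1, 0)
```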

VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control

  • paper_url: http://arxiv.org/abs/2308.09804
  • repo_url: https://github.com/henryhzy/vl-pet
  • paper_authors: Zi-Yuan Hu, Yanyang Li, Michael R. Lyu, Liwei Wang
  • for: Proposes an effective Vision-and-Language Parameter-Efficient Tuning (VL-PET) framework with better efficiency and effectiveness trade-offs.
  • methods: A novel granularity-controlled mechanism imposes effective control over modular modifications, instantiating a variety of model-agnostic VL-PET modules; lightweight PET module designs enhance vision-language alignment and modeling for the encoders while maintaining text generation for the decoders.
  • results: Experiments on four image-text tasks and four video-text tasks show that the framework outperforms existing PET techniques; in particular, VL-PET-large with lightweight PET module designs outperforms VL-Adapter by 2.92% (3.41%) and LoRA by 3.37% (7.03%) with BART-base (T5-base) on image-text tasks.
    Abstract As the model size of pre-trained language models (PLMs) grows rapidly, full fine-tuning becomes prohibitively expensive for model training and storage. In vision-and-language (VL), parameter-efficient tuning (PET) techniques are proposed to integrate modular modifications (e.g., Adapter and LoRA) into encoder-decoder PLMs. By tuning a small set of trainable parameters, these techniques perform on par with full fine-tuning. However, excessive modular modifications and neglecting the functionality gap between the encoders and decoders can lead to performance degradation, while existing PET techniques (e.g., VL-Adapter) overlook these critical issues. In this paper, we propose a Vision-and-Language Parameter-Efficient Tuning (VL-PET) framework to impose effective control over modular modifications via a novel granularity-controlled mechanism. Considering different granularity-controlled matrices generated by this mechanism, a variety of model-agnostic VL-PET modules can be instantiated from our framework for better efficiency and effectiveness trade-offs. We further propose lightweight PET module designs to enhance VL alignment and modeling for the encoders and maintain text generation for the decoders. Extensive experiments conducted on four image-text tasks and four video-text tasks demonstrate the efficiency, effectiveness and transferability of our VL-PET framework. In particular, our VL-PET-large with lightweight PET module designs significantly outperforms VL-Adapter by 2.92% (3.41%) and LoRA by 3.37% (7.03%) with BART-base (T5-base) on image-text tasks. Furthermore, we validate the enhanced effect of employing our VL-PET designs on existing PET techniques, enabling them to achieve significant performance improvements. Our code is available at https://github.com/HenryHZY/VL-PET.

An Efficient High-Dimensional Gene Selection Approach based on Binary Horse Herd Optimization Algorithm for Biological Data Classification

  • paper_url: http://arxiv.org/abs/2308.09791
  • repo_url: None
  • paper_authors: Niloufar Mehrabi, Sayed Pedram Haeri Boroujeni, Elnaz Pashaei
  • for: Solving complex, high-dimensional problems, in particular gene (feature) selection for biological data classification.
  • methods: A hybrid feature selection method combining the new binary Horse Herd Optimization Algorithm (BHOA) with a minimum Redundancy Maximum Relevance (MRMR) filter, using a novel X-shaped transfer function to map the continuous search space to a binary one.
  • results: On ten microarray datasets (Lymphoma, Prostate, Brain-1, DLBCL, SRBCT, Leukemia, Ovarian, Colon, Lung, and MLL), the proposed MRMR-BHOA method outperforms approaches such as Gray Wolf (GW), Particle Swarm Optimization (PSO), and the Genetic Algorithm (GA) in accuracy and in the number of selected features.
    Abstract The Horse Herd Optimization Algorithm (HOA) is a new meta-heuristic algorithm based on the behaviors of horses at different ages. The HOA was introduced recently to solve complex and high-dimensional problems. This paper proposes a binary version of the Horse Herd Optimization Algorithm (BHOA) in order to solve discrete problems and select prominent feature subsets. Moreover, this study provides a novel hybrid feature selection framework based on the BHOA and a minimum Redundancy Maximum Relevance (MRMR) filter method. This hybrid feature selection, which is more computationally efficient, produces a beneficial subset of relevant and informative features. Since feature selection is a binary problem, we have applied a new Transfer Function (TF), called the X-shaped TF, which transforms continuous problems into binary search spaces. Furthermore, the Support Vector Machine (SVM) is utilized to examine the efficiency of the proposed method on ten microarray datasets, namely Lymphoma, Prostate, Brain-1, DLBCL, SRBCT, Leukemia, Ovarian, Colon, Lung, and MLL. In comparison to other state-of-the-art methods, such as the Gray Wolf (GW), Particle Swarm Optimization (PSO), and Genetic Algorithm (GA), the proposed hybrid method (MRMR-BHOA) demonstrates superior performance in terms of accuracy and the number of selected features. Also, experimental results prove that the X-shaped BHOA approach outperforms other methods.

A Two-Part Machine Learning Approach to Characterizing Network Interference in A/B Testing

  • paper_url: http://arxiv.org/abs/2308.09790
  • repo_url: None
  • paper_authors: Yuan Yuan, Kristen M. Altenburger
  • for: Improving the reliability and precision of A/B tests by addressing network interference.
  • methods: A machine learning-based method identifies and characterizes heterogeneous network interference, accounting for latent complex network structures and automating "exposure mapping" determination, which addresses two major limitations in the existing literature.
  • results: Validated through synthetic experiments and a real-world large-scale test involving 1-2 million Instagram users, the method outperforms conventional approaches such as design-based cluster randomization and analysis-based neighborhood exposure mapping.
    Abstract The reliability of controlled experiments, or "A/B tests," can often be compromised due to the phenomenon of network interference, wherein the outcome for one unit is influenced by other units. To tackle this challenge, we propose a machine learning-based method to identify and characterize heterogeneous network interference. Our approach accounts for latent complex network structures and automates the task of "exposure mapping'' determination, which addresses the two major limitations in the existing literature. We introduce "causal network motifs'' and employ transparent machine learning models to establish the most suitable exposure mapping that reflects underlying network interference patterns. Our method's efficacy has been validated through simulations on two synthetic experiments and a real-world, large-scale test involving 1-2 million Instagram users, outperforming conventional methods such as design-based cluster randomization and analysis-based neighborhood exposure mapping. Overall, our approach not only offers a comprehensive, automated solution for managing network interference and improving the precision of A/B testing results, but it also sheds light on users' mutual influence and aids in the refinement of marketing strategies.

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

  • paper_url: http://arxiv.org/abs/2308.09778
  • repo_url: None
  • paper_authors: Navid Rajabi, Jana Kosecka
  • for: Assesses large-scale vision-language models (VLMs) on visual reasoning tasks such as counting, referring expressions, and general visual question answering, focusing on spatial relations.
  • methods: Proposes fine-grained compositional grounding of spatial relationships and a bottom-up approach that ranks spatial clauses and evaluates spatial relationship reasoning.
  • results: Current VLMs perform poorly at understanding spatial relations, with a large gap to human performance.
    Abstract With the advances in large scale vision-and-language models (VLMs) it is of interest to assess their performance on various visual reasoning tasks such as counting, referring expressions and general visual question answering. The focus of this work is to study the ability of these models to understand spatial relations. Previously, this has been tackled using image-text matching (Liu, Emerson, and Collier 2022) or visual question answering tasks, both showing poor performance and a large gap compared to human performance. To better understand the gap, we present fine-grained compositional grounding of spatial relationships and propose a bottom-up approach for ranking spatial clauses and evaluating the performance of the spatial relationship reasoning task. We propose to combine the evidence from grounding noun phrases corresponding to objects and their locations to compute the final rank of the spatial clause. We demonstrate the approach on representative vision-language models (Tan and Bansal 2019; Gupta et al. 2022; Kamath et al. 2021) and compare and highlight their abilities to reason about spatial relationships.

Time Series Predictions in Unmonitored Sites: A Survey of Machine Learning Techniques in Water Resources

  • paper_url: http://arxiv.org/abs/2308.09766
  • repo_url: None
  • paper_authors: Jared D. Willard, Charuleka Varadharajan, Xiaowei Jia, Vipin Kumar
  • for: Accurate prediction of environmental variables in unmonitored water bodies remains a long-standing challenge in water resources science. Most of the world's freshwater resources lack monitoring of critical environmental variables, even as the impacts of climate and land use change on water resources grow increasingly severe.
  • methods: Modern machine learning methods have shown the ability to outperform process-based and empirical models for hydrologic time series prediction, particularly in extracting information from large, diverse data sets.
  • results: A review of relevant state-of-the-art applications spanning streamflow and water quality prediction, among other water resources prediction domains. The analysis shows that prior efforts have concentrated on deep learning frameworks, while comparisons between different classes of machine learning methods remain rare and inadequate. Several open questions remain, including incorporating dynamic inputs and site characteristics, mechanistic understanding and spatial context, and explainable AI techniques in modern machine learning frameworks.
    Abstract Prediction of dynamic environmental variables in unmonitored sites remains a long-standing challenge for water resources science. The majority of the world's freshwater resources have inadequate monitoring of critical environmental variables needed for management. Yet, the need to have widespread predictions of hydrological variables such as river flow and water quality has become increasingly urgent due to climate and land use change over the past decades, and their associated impacts on water resources. Modern machine learning methods increasingly outperform their process-based and empirical model counterparts for hydrologic time series prediction with their ability to extract information from large, diverse data sets. We review relevant state-of-the-art applications of machine learning for streamflow, water quality, and other water resources prediction and discuss opportunities to improve the use of machine learning with emerging methods for incorporating watershed characteristics into deep learning models, transfer learning, and incorporating process knowledge into machine learning models. The analysis here suggests most prior efforts have been focused on deep learning frameworks built on many sites for predictions at daily time scales in the United States, but that comparisons between different classes of machine learning methods are few and inadequate. We identify several open questions for time series predictions in unmonitored sites that include incorporating dynamic inputs and site characteristics, mechanistic understanding and spatial context, and explainable AI techniques in modern machine learning frameworks.

Taken by Surprise: Contrast effect for Similarity Scores

  • paper_url: http://arxiv.org/abs/2308.09765
  • repo_url: https://github.com/meetelise/surprise-similarity
  • paper_authors: Thomas C. Bachlechner, Mario Martone, Marjorie Schillo
  • for: Improving the accurate evaluation of similarity between object vector embeddings in natural language processing, information retrieval, and classification tasks.
  • methods: The proposed "surprise score" is an ensemble-normalized similarity metric that accounts for the contrast effect in human perception of object similarity.
  • results: The surprise score typically yields 10-15% better performance than raw cosine similarity on zero/few-shot document classification and clustering tasks.
    Abstract Accurately evaluating the similarity of object vector embeddings is of critical importance for natural language processing, information retrieval and classification tasks. Popular similarity scores (e.g. cosine similarity) are based on pairs of embedding vectors and disregard the distribution of the ensemble from which objects are drawn. Human perception of object similarity significantly depends on the context in which the objects appear. In this work we propose the $\textit{surprise score}$, an ensemble-normalized similarity metric that encapsulates the contrast effect of human perception and significantly improves the classification performance on zero- and few-shot document classification tasks. This score quantifies the surprise to find a given similarity between two elements relative to the pairwise ensemble similarities. We evaluate this metric on zero/few shot classification and clustering tasks and typically find 10-15% better performance compared to raw cosine similarity. Our code is available at https://github.com/MeetElise/surprise-similarity.
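
One plausible reading of the ensemble normalization, sketched below: score a pair by how extreme its cosine similarity is under the empirical distribution of all pairwise similarities in the ensemble. The authors' exact definition lives in the linked repository; treat this as a conceptual sketch only:

```python
import numpy as np

def surprise_score(X, i, j):
    """Empirical-CDF 'surprise' of sim(i, j) relative to all ensemble pairs.
    X: (n, d) embedding matrix; returns a value in [0, 1]."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T                                  # full cosine similarity matrix
    background = S[np.triu_indices_from(S, k=1)]   # all distinct pairs
    return float((background < S[i, j]).mean())    # high = surprisingly similar
```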

The Impact of Background Removal on Performance of Neural Networks for Fashion Image Classification and Segmentation

  • paper_url: http://arxiv.org/abs/2308.09764
  • repo_url: None
  • paper_authors: Junhui Liang, Ying Liu, Vladimir Vlassov
  • for: Improving fashion image data quality and model performance.
  • methods: Background removal using salient object detection.
  • results: Background removal improves model accuracy for fashion images in simple, shallow networks, but does not work well for deep neural networks.
    Abstract Fashion understanding is a hot topic in computer vision, with many applications having great business value in the market. Fashion understanding remains a difficult challenge for computer vision due to the immense diversity of garments and various scenes and backgrounds. In this work, we try removing the background from fashion images to boost data quality and increase model performance. Since fashion images typically show clearly visible persons in fully visible garments, we can utilize Salient Object Detection to achieve reliable background removal of fashion data. A fashion image with the background removed is referred to as a "rembg" image, in contrast with the original one in the fashion dataset. We conducted extensive comparative experiments with these two types of images on multiple aspects of model training, including model architectures, model initialization, compatibility with other training tricks and data augmentations, and target task types. Our experiments show that background removal can effectively work for fashion data in simple and shallow networks that are not susceptible to overfitting. It can improve model accuracy by up to 5% in the classification on the FashionStyle14 dataset when training models from scratch. However, background removal does not perform well in deep neural networks due to incompatibility with other regularization techniques like batch normalization, pre-trained initialization, and data augmentations introducing randomness. The loss of background pixels invalidates many existing training tricks in the model training, adding the risk of overfitting for deep models.
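
The "rembg" naming echoes the open-source rembg package, which performs exactly this kind of salient-object-based background removal. Assuming that package (plus Pillow) is what a reader wants to reproduce the preprocessing with, a minimal sketch:

```python
from rembg import remove   # pip install rembg
from PIL import Image

img = Image.open("fashion_item.jpg")
rembg_img = remove(img)                    # salient object kept, background removed
rembg_img.save("fashion_item_rembg.png")   # PNG preserves the alpha channel
```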

Data Compression and Inference in Cosmology with Self-Supervised Machine Learning

  • paper_url: http://arxiv.org/abs/2308.09751
  • repo_url: https://github.com/aizhanaakhmet/data-compression-inference-in-cosmology-with-ssl
  • paper_authors: Aizhan Akhmetzhanova, Siddharth Mishra-Sharma, Cora Dvorkin
  • for: Efficiently summarizing massive datasets from cosmological surveys for downstream tasks.
  • methods: Simulation-based self-supervised machine learning that constructs representative summaries of massive datasets via simulation-based augmentations.
  • results: The summaries are highly informative and can be used for precise and accurate parameter inference. Additionally, the method can be made insensitive to prescribed systematic effects, such as the influence of baryonic physics.
    Abstract The influx of massive amounts of data from current and upcoming cosmological surveys necessitates compression schemes that can efficiently summarize the data with minimal loss of information. We introduce a method that leverages the paradigm of self-supervised machine learning in a novel manner to construct representative summaries of massive datasets using simulation-based augmentations. Deploying the method on hydrodynamical cosmological simulations, we show that it can deliver highly informative summaries, which can be used for a variety of downstream tasks, including precise and accurate parameter inference. We demonstrate how this paradigm can be used to construct summary representations that are insensitive to prescribed systematic effects, such as the influence of baryonic physics. Our results indicate that self-supervised machine learning techniques offer a promising new approach for compression of cosmological data as well as its analysis.
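
The abstract does not spell out the objective, so as a generic stand-in, a SimCLR-style contrastive loss over two simulation-based augmentations of the same field might look like this in PyTorch (a sketch under that assumption, not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.1):
    """Contrastive loss for two views z1, z2 of shape (batch, dim)."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float("-inf"))        # exclude self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)     # positive = the other view
```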

Robust Monocular Depth Estimation under Challenging Conditions

  • paper_url: http://arxiv.org/abs/2308.09711
  • repo_url: https://github.com/md4all/md4all
  • paper_authors: Stefano Gasperini, Nils Morbitzer, HyunJun Jung, Nassir Navab, Federico Tombari
  • for: Making monocular depth estimation reliable under both adverse and ideal conditions, and for different types of learning supervision.
  • methods: Building on the self- or full-supervision of existing methods, the model is trained by generating complex samples and computing the standard losses on the corresponding original images.
  • results: Extensive experiments on two challenging public datasets, nuScenes and Oxford RobotCar, show the approach outperforms prior work by a large margin in both standard and challenging conditions.
    Abstract While state-of-the-art monocular depth estimation approaches achieve impressive results in ideal settings, they are highly unreliable under challenging illumination and weather conditions, such as at nighttime or in the presence of rain. In this paper, we uncover these safety-critical issues and tackle them with md4all: a simple and effective solution that works reliably under both adverse and ideal conditions, as well as for different types of learning supervision. We achieve this by exploiting the efficacy of existing methods under perfect settings. Therefore, we provide valid training signals independently of what is in the input. First, we generate a set of complex samples corresponding to the normal training ones. Then, we train the model by guiding its self- or full-supervision by feeding the generated samples and computing the standard losses on the corresponding original images. Doing so enables a single model to recover information across diverse conditions without modifications at inference time. Extensive experiments on two challenging public datasets, namely nuScenes and Oxford RobotCar, demonstrate the effectiveness of our techniques, outperforming prior works by a large margin in both standard and challenging conditions. Source code and data are available at: https://md4all.github.io.
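
In schematic form, the trick is to predict on a hard, translated frame while supervising against the original clean one. A sketch under assumed names (`depth_net`, `translate_to_adverse`, and `self_sup_depth_loss` are placeholders, not the repository's API):

```python
def md4all_style_step(depth_net, translate_to_adverse, self_sup_depth_loss,
                      clean_frame, context_frames):
    """One training step: hard input, easy supervision."""
    adverse_frame = translate_to_adverse(clean_frame)  # e.g., day -> night/rain
    depth = depth_net(adverse_frame)                   # predict on the hard sample
    # The standard (self-)supervised loss is still computed on the clean imagery,
    # so no changes are needed at inference time.
    return self_sup_depth_loss(depth, clean_frame, context_frames)
```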

Neural-network quantum state study of the long-range antiferromagnetic Ising chain

  • paper_url: http://arxiv.org/abs/2308.09709
  • repo_url: None
  • paper_authors: Jicheol Kim, Dongkyu Kim, Dong-Hee Kim
  • for: investigate quantum phase transitions in the transverse field Ising chain with algebraically decaying long-range antiferromagnetic interactions
  • methods: using the variational Monte Carlo method with the restricted Boltzmann machine as a trial wave function ansatz
  • results: the central charge deviates from 1/2 at a small decay exponent $\alpha_\mathrm{LR}$, and the threshold of the Ising universality and the conformal symmetry is estimated to be in the range of $2 \lesssim \alpha_\mathrm{LR} < 3$.
    Abstract We investigate quantum phase transitions in the transverse field Ising chain with algebraically decaying long-range antiferromagnetic interactions by using the variational Monte Carlo method with the restricted Boltzmann machine being employed as a trial wave function ansatz. In the finite-size scaling analysis with the order parameter and the second R\'enyi entropy, we find that the central charge deviates from 1/2 at a small decay exponent $\alpha_\mathrm{LR}$ in contrast to the critical exponents staying very close to the short-range (SR) Ising values regardless of $\alpha_\mathrm{LR}$ examined, supporting the previously proposed scenario of conformal invariance breakdown. To identify the threshold of the Ising universality and the conformal symmetry, we perform two additional tests for the universal Binder ratio and the conformal field theory (CFT) description of the correlation function. It turns out that both indicate a noticeable deviation from the SR Ising class at $\alpha_\mathrm{LR} < 2$. However, a closer look at the scaled correlation function for $\alpha_\mathrm{LR} \ge 2$ shows a gradual change from the asymptotic line of the CFT verified at $\alpha_\mathrm{LR} = 3$, providing a rough estimate of the threshold being in the range of $2 \lesssim \alpha_\mathrm{LR} < 3$.
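
For reference, the model under study is commonly written as follows (a sketch; sign and normalization conventions vary across papers):

```latex
H \;=\; J \sum_{i<j} \frac{\sigma^{z}_{i}\,\sigma^{z}_{j}}{|i-j|^{\alpha_{\mathrm{LR}}}}
\;-\; h \sum_{i} \sigma^{x}_{i}, \qquad J > 0 \ \text{(antiferromagnetic)}.
```

In this form, $\alpha_{\mathrm{LR}} \to \infty$ recovers the nearest-neighbor transverse-field Ising chain, while decreasing $\alpha_{\mathrm{LR}}$ makes the interactions increasingly long-ranged.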

Do you know what q-means?

  • paper_url: http://arxiv.org/abs/2308.09701
  • repo_url: https://github.com/Sfedfcv/redesigned-pancake
  • paper_authors: João F. Doriguello, Alessandro Luongo, Ewin Tang
  • For: The paper presents an improved version of the "$q$-means" algorithm for approximate $k$-means clustering.
  • Methods: The algorithm uses only QRAM to prepare and measure simple states, without the quantum linear algebra primitives of prior work.
  • Results: The time complexity is $O\big(\frac{k^{2}}{\varepsilon^2}(\sqrt{k}d + \log(Nd))\big)$, maintaining the polylogarithmic dependence on $N$ while improving the dependence on most other parameters. In addition, a "dequantized" classical algorithm for $\varepsilon$-$k$-means is also presented, which runs in $O\big(\frac{k^{2}}{\varepsilon^2}(kd + \log(Nd))\big)$ time and matches the polylogarithmic dependence on $N$ attained by the quantum algorithms.
    Abstract Clustering is one of the most important tools for analysis of large datasets, and perhaps the most popular clustering algorithm is Lloyd's iteration for $k$-means. This iteration takes $N$ vectors $v_1,\dots,v_N\in\mathbb{R}^d$ and outputs $k$ centroids $c_1,\dots,c_k\in\mathbb{R}^d$; these partition the vectors into clusters based on which centroid is closest to a particular vector. We present an overall improved version of the "$q$-means" algorithm, the quantum algorithm originally proposed by Kerenidis, Landman, Luongo, and Prakash (2019) which performs $\varepsilon$-$k$-means, an approximate version of $k$-means clustering. This algorithm does not rely on the quantum linear algebra primitives of prior work, instead only using its QRAM to prepare and measure simple states based on the current iteration's clusters. The time complexity is $O\big(\frac{k^{2}}{\varepsilon^2}(\sqrt{k}d + \log(Nd))\big)$ and maintains the polylogarithmic dependence on $N$ while improving the dependence on most of the other parameters. We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in $O\big(\frac{k^{2}}{\varepsilon^2}(kd + \log(Nd))\big)$ time. Notably, this classical algorithm matches the polylogarithmic dependence on $N$ attained by the quantum algorithms.
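
For concreteness, one classical Lloyd iteration, which both the quantum and dequantized algorithms approximate, fits in a few lines of numpy:

```python
import numpy as np

def lloyd_step(V, C):
    """One k-means iteration: V (N, d) vectors, C (k, d) current centroids."""
    # Assign each vector to its nearest centroid.
    labels = np.argmin(np.linalg.norm(V[:, None, :] - C[None, :, :], axis=2), axis=1)
    # Recompute each centroid as the mean of its cluster (keep old if empty).
    return np.stack([V[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                     for j in range(C.shape[0])])
```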

A Lightweight Transformer for Faster and Robust EBSD Data Collection

  • paper_url: http://arxiv.org/abs/2308.09693
  • repo_url: https://github.com/hdong920/ebsd_slice_recovery
  • paper_authors: Harry Dong, Sean Donegan, Megna Shah, Yuejie Chi
  • for: Improving the quality of three-dimensional electron back-scattered diffraction (EBSD) microscopy data.
  • methods: A transformer model and a projection algorithm for data processing and recovery of missing slices.
  • results: Higher data recovery accuracy than existing methods.
    Abstract Three dimensional electron back-scattered diffraction (EBSD) microscopy is a critical tool in many applications in materials science, yet its data quality can fluctuate greatly during the arduous collection process, particularly via serial-sectioning. Fortunately, 3D EBSD data is inherently sequential, opening up the opportunity to use transformers, state-of-the-art deep learning architectures that have made breakthroughs in a plethora of domains, for data processing and recovery. To be more robust to errors and accelerate this 3D EBSD data collection, we introduce a two step method that recovers missing slices in an 3D EBSD volume, using an efficient transformer model and a projection algorithm to process the transformer's outputs. Overcoming the computational and practical hurdles of deep learning with scarce high dimensional data, we train this model using only synthetic 3D EBSD data with self-supervision and obtain superior recovery accuracy on real 3D EBSD data, compared to existing methods.

Reduced Order Modeling of a MOOSE-based Advanced Manufacturing Model with Operator Learning

  • paper_url: http://arxiv.org/abs/2308.09691
  • repo_url: None
  • paper_authors: Mahmoud Yaseen, Dewen Yushu, Peter German, Xu Wu
  • for: Developing an accurate yet fast-running reduced order model (ROM) for use within deep reinforcement learning (DRL)-based process control and optimization methods.
  • methods: Operator learning (OL)-based methods, which can learn a family of differential equations; here, a Fourier neural operator is used to construct the OL-based ROM.
  • results: A benchmark comparison finds that the OL-based ROM performs better and runs faster than a conventional deep neural network-based ROM.
    Abstract Advanced Manufacturing (AM) has gained significant interest in the nuclear community for its potential application on nuclear materials. One challenge is to obtain desired material properties via controlling the manufacturing process during runtime. Intelligent AM based on deep reinforcement learning (DRL) relies on an automated process-level control mechanism to generate optimal design variables and adaptive system settings for improved end-product properties. A high-fidelity thermo-mechanical model for direct energy deposition has recently been developed within the MOOSE framework at the Idaho National Laboratory (INL). The goal of this work is to develop an accurate and fast-running reduced order model (ROM) for this MOOSE-based AM model that can be used in a DRL-based process control and optimization method. Operator learning (OL)-based methods will be employed due to their capability to learn a family of differential equations, produced in this work by changing process variables in the Gaussian point heat source for the laser. We will develop the OL-based ROM using a Fourier neural operator, and perform a benchmark comparison of its performance with a conventional deep neural network-based ROM.
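
The core building block of a Fourier neural operator is a spectral convolution: transform to Fourier space, apply learned weights to the lowest modes, and transform back. A minimal 1D sketch in PyTorch; the INL model's actual architecture and dimensions are not specified here:

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """Learned multiplication on the first `modes` Fourier modes."""
    def __init__(self, in_ch, out_ch, modes):
        super().__init__()
        self.modes = modes  # assumes n_points // 2 + 1 >= modes
        scale = 1.0 / (in_ch * out_ch)
        self.weight = nn.Parameter(
            scale * torch.randn(in_ch, out_ch, modes, dtype=torch.cfloat))

    def forward(self, x):                       # x: (batch, in_ch, n_points)
        x_ft = torch.fft.rfft(x)                # (batch, in_ch, n_points//2 + 1)
        out_ft = torch.zeros(x.size(0), self.weight.size(1), x_ft.size(-1),
                             dtype=torch.cfloat, device=x.device)
        out_ft[..., :self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[..., :self.modes], self.weight)
        return torch.fft.irfft(out_ft, n=x.size(-1))
```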

Graph of Thoughts: Solving Elaborate Problems with Large Language Models

  • paper_url: http://arxiv.org/abs/2308.09687
  • repo_url: https://github.com/spcl/graph-of-thoughts
  • paper_authors: Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, Torsten Hoefler
  • for: Advancing prompting capabilities in large language models (LLMs) beyond those of Chain-of-Thought or Tree of Thoughts (ToT).
  • methods: The Graph of Thoughts (GoT) framework models the information generated by an LLM as an arbitrary graph, where units of information ("LLM thoughts") are vertices and edges correspond to dependencies between them. This enables combining LLM thoughts into synergistic outcomes, distilling the essence of whole networks of thoughts, or enhancing thoughts using feedback loops.
  • results: GoT offers advantages over the state of the art on different tasks, for example improving sorting quality by 62% over ToT while reducing costs by more than 31%. GoT is extensible with new thought transformations and can thus be used to spearhead new prompting schemes. This work brings LLM reasoning closer to human thinking or brain mechanisms such as recurrence, both of which form complex networks.
    Abstract We introduce Graph of Thoughts (GoT): a framework that advances prompting capabilities in large language models (LLMs) beyond those offered by paradigms such as Chain-of-Thought or Tree of Thoughts (ToT). The key idea and primary advantage of GoT is the ability to model the information generated by an LLM as an arbitrary graph, where units of information ("LLM thoughts") are vertices, and edges correspond to dependencies between these vertices. This approach enables combining arbitrary LLM thoughts into synergistic outcomes, distilling the essence of whole networks of thoughts, or enhancing thoughts using feedback loops. We illustrate that GoT offers advantages over the state of the art on different tasks, for example increasing the quality of sorting by 62% over ToT, while simultaneously reducing costs by >31%. We ensure that GoT is extensible with new thought transformations and thus can be used to spearhead new prompting schemes. This work brings the LLM reasoning closer to human thinking or brain mechanisms such as recurrence, both of which form complex networks.
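
A toy sketch of the abstraction, with `llm` standing in for any text-in/text-out model call: thoughts are vertices, dependency edges record provenance, and aggregation and refinement are graph operations. The released framework defines a much richer operation set than this:

```python
from dataclasses import dataclass, field

@dataclass
class Thought:
    text: str
    parents: list = field(default_factory=list)  # dependency edges

def aggregate(llm, thoughts):
    """Merge several thoughts into one synergistic thought."""
    prompt = "Combine the following partial solutions:\n" + \
             "\n".join(t.text for t in thoughts)
    return Thought(text=llm(prompt), parents=list(thoughts))

def refine(llm, thought):
    """Feedback loop: ask the model to improve an existing thought."""
    return Thought(text=llm("Improve this solution:\n" + thought.text),
                   parents=[thought])
```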

Audiovisual Moments in Time: A Large-Scale Annotated Dataset of Audiovisual Actions

  • paper_url: http://arxiv.org/abs/2308.09685
  • repo_url: https://github.com/mjoannou/audiovisual-moments-in-time
  • paper_authors: Michael Joannou, Pia Rotshtein, Uta Noppeney
  • for: Providing a large-scale annotated dataset of audiovisual action events (AVMIT) for research with computational models and human participants.
  • methods: In an extensive annotation task, 11 participants labelled 3-second audiovisual videos from the Moments in Time (MIT) dataset, assessing for each trial whether the labelled audiovisual action event was present and whether it was the most prominent feature of the video.
  • results: Annotations for 57,177 audiovisual videos, each independently evaluated by 3 participants. From this initial collection, a curated test set of 16 distinct action classes with 60 videos each (960 videos) was created, along with 2 sets of pre-computed audiovisual feature embeddings (VGGish/YamNet for audio and VGG16/EfficientNetB0 for visual data) to lower the barrier to entry for audiovisual DNN research.
    Abstract We present Audiovisual Moments in Time (AVMIT), a large-scale dataset of audiovisual action events. In an extensive annotation task 11 participants labelled a subset of 3-second audiovisual videos from the Moments in Time dataset (MIT). For each trial, participants assessed whether the labelled audiovisual action event was present and whether it was the most prominent feature of the video. The dataset includes the annotation of 57,177 audiovisual videos, each independently evaluated by 3 of 11 trained participants. From this initial collection, we created a curated test set of 16 distinct action classes, with 60 videos each (960 videos). We also offer 2 sets of pre-computed audiovisual feature embeddings, using VGGish/YamNet for audio data and VGG16/EfficientNetB0 for visual data, thereby lowering the barrier to entry for audiovisual DNN research. We explored the advantages of AVMIT annotations and feature embeddings to improve performance on audiovisual event recognition. A series of 6 Recurrent Neural Networks (RNNs) were trained on either AVMIT-filtered audiovisual events or modality-agnostic events from MIT, and then tested on our audiovisual test set. In all RNNs, top 1 accuracy was increased by 2.71-5.94% by training exclusively on audiovisual events, even outweighing a three-fold increase in training data. We anticipate that the newly annotated AVMIT dataset will serve as a valuable resource for research and comparative experiments involving computational models and human participants, specifically when addressing research questions where audiovisual correspondence is of critical importance.

Variational optimization of the amplitude of neural-network quantum many-body ground states

  • paper_url: http://arxiv.org/abs/2308.09664
  • repo_url: None
  • paper_authors: Jia-Qi Wang, Rong-Qiang He, Zhong-Yi Lu
  • for: Studying neural-network-based searches for quantum many-body ground states and improving their optimization.
  • methods: The quantum many-body variational wave function is split into a multiplication of a real-valued amplitude neural network and a fixed sign structure, and only the amplitude network is optimized. The amplitude network is a convolutional neural network with residual blocks, i.e. a ResNet.
  • results: Tested on three typical quantum many-body systems, the obtained ground state energies are lower than or comparable to those from traditional variational Monte Carlo (VMC) and density matrix renormalization group (DMRG) methods. For the frustrated Heisenberg $J_1$-$J_2$ model, the results are better than those of the complex-valued CNN in the literature, implying that the sign structure of the complex-valued NQS is difficult to optimize; optimizing the sign structure is left for future work.
    Abstract Neural-network quantum states (NQSs), variationally optimized by combining traditional methods and deep learning techniques, is a new way to find quantum many-body ground states and gradually becomes a competitor of traditional variational methods. However, there are still some difficulties in the optimization of NQSs, such as local minima, slow convergence, and sign structure optimization. Here, we split a quantum many-body variational wave function into a multiplication of a real-valued amplitude neural network and a sign structure, and focus on the optimization of the amplitude network while keeping the sign structure fixed. The amplitude network is a convolutional neural network (CNN) with residual blocks, namely a ResNet. Our method is tested on three typical quantum many-body systems. The obtained ground state energies are lower than or comparable to those from traditional variational Monte Carlo (VMC) methods and density matrix renormalization group (DMRG). Surprisingly, for the frustrated Heisenberg $J_1$-$J_2$ model, our results are better than those of the complex-valued CNN in the literature, implying that the sign structure of the complex-valued NQS is difficult to be optimized. We will study the optimization of the sign structure of NQSs in the future.
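
Schematically, the factorization reads:

```latex
\Psi_{\theta}(\mathbf{s}) \;=\; \epsilon(\mathbf{s})\, A_{\theta}(\mathbf{s}),
\qquad \epsilon(\mathbf{s}) \in \{\pm 1\} \ \text{fixed}, \quad A_{\theta}(\mathbf{s}) \in \mathbb{R},
```

where the sign structure $\epsilon(\mathbf{s})$ is held fixed (for instance, a Marshall-type sign rule in the frustrated Heisenberg case) and only the real-valued amplitude network $A_{\theta}$, here a ResNet, is variationally optimized.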

GiGaMAE: Generalizable Graph Masked Autoencoder via Collaborative Latent Space Reconstruction

  • paper_url: http://arxiv.org/abs/2308.09663
  • repo_url: https://github.com/sycny/gigamae
  • paper_authors: Yucheng Shi, Yushun Dong, Qiaoyu Tan, Jundong Li, Ninghao Liu
  • for: Proposing a self-supervised graph masked autoencoder framework that addresses the poor generalization ability of existing masked autoencoder models on graph data.
  • methods: The GiGaMAE framework, unlike existing masked autoencoders that explicitly reconstruct original graph components (e.g., features or edges), collaboratively reconstructs informative and integrated latent embeddings that encompass graph topology and attribute information, capturing more generalized and comprehensive knowledge. A mutual information based reconstruction loss enables effective reconstruction of multiple targets.
  • results: Extensive experiments on three downstream tasks with seven benchmark datasets demonstrate the superiority of GiGaMAE over state-of-the-art baselines. The authors hope these results shed light on the design of foundation models for graph-structured data.
    Abstract Self-supervised learning with masked autoencoders has recently gained popularity for its ability to produce effective image or textual representations, which can be applied to various downstream tasks without retraining. However, we observe that the current masked autoencoder models lack good generalization ability on graph data. To tackle this issue, we propose a novel graph masked autoencoder framework called GiGaMAE. Different from existing masked autoencoders that learn node presentations by explicitly reconstructing the original graph components (e.g., features or edges), in this paper, we propose to collaboratively reconstruct informative and integrated latent embeddings. By considering embeddings encompassing graph topology and attribute information as reconstruction targets, our model could capture more generalized and comprehensive knowledge. Furthermore, we introduce a mutual information based reconstruction loss that enables the effective reconstruction of multiple targets. This learning objective allows us to differentiate between the exclusive knowledge learned from a single target and common knowledge shared by multiple targets. We evaluate our method on three downstream tasks with seven datasets as benchmarks. Extensive experiments demonstrate the superiority of GiGaMAE against state-of-the-art baselines. We hope our results will shed light on the design of foundation models on graph-structured data. Our code is available at: https://github.com/sycny/GiGaMAE.

Robust Uncertainty Quantification using Conformalised Monte Carlo Prediction

  • paper_url: http://arxiv.org/abs/2308.09647
  • repo_url: https://github.com/team-daniel/mc-cp
  • paper_authors: Daniel Bethell, Simos Gerasimou, Radu Calinescu
  • for: Providing assurances for the dependable operation of deep learning models deployed in safety-critical applications.
  • methods: A new adaptive Monte Carlo (MC) dropout method combined with conformal prediction (CP); the two cooperate at runtime to improve the reliability and precision of predictions while saving memory and computation.
  • results: Extensive experiments on classification and regression benchmarks show that MC-CP significantly improves uncertainty quantification over advanced methods such as MC dropout, RAPS, and CQR. MC-CP can easily be added to existing models, making its deployment simple.
    Abstract Deploying deep learning models in safety-critical applications remains a very challenging task, mandating the provision of assurances for the dependable operation of these models. Uncertainty quantification (UQ) methods estimate the model's confidence per prediction, informing decision-making by considering the effect of randomness and model misspecification. Despite the advances of state-of-the-art UQ methods, they are computationally expensive or produce conservative prediction sets/intervals. We introduce MC-CP, a novel hybrid UQ method that combines a new adaptive Monte Carlo (MC) dropout method with conformal prediction (CP). MC-CP adaptively modulates the traditional MC dropout at runtime to save memory and computation resources, enabling predictions to be consumed by CP, yielding robust prediction sets/intervals. Through comprehensive experiments, we show that MC-CP delivers significant improvements over advanced UQ methods, like MC dropout, RAPS and CQR, both in classification and regression benchmarks. MC-CP can be easily added to existing models, making its deployment simple.
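
A stripped-down sketch of the two ingredients: draw MC-dropout samples for the point prediction, then wrap a split-conformal interval around it using a calibration set. MC-CP's adaptive modulation of dropout at runtime is omitted; this is the plain version of each component:

```python
import math
import torch

def mc_dropout_predict(model, x, T=30):
    model.train()                        # keep dropout layers stochastic
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(T)])
    return preds.mean(0)

def split_conformal_interval(model, x_cal, y_cal, x_new, alpha=0.1, T=30):
    """Regression intervals with ~(1 - alpha) coverage around the MC mean."""
    residuals = (y_cal - mc_dropout_predict(model, x_cal, T)).abs().flatten()
    n = residuals.numel()
    level = min(1.0, math.ceil((n + 1) * (1 - alpha)) / n)
    q = torch.quantile(residuals, level)
    mu = mc_dropout_predict(model, x_new, T)
    return mu - q, mu + q                # (lower, upper)
```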

biquality-learn: a Python library for Biquality Learning

  • paper_url: http://arxiv.org/abs/2308.09643
  • repo_url: https://github.com/biquality-learn/biquality-learn
  • paper_authors: Pierre Nodet, Vincent Lemaire, Alexis Bondu, Antoine Cornuéjols
  • For: The paper aims to address the challenges of weak supervision and dataset shifts in machine learning, and proposes a new framework called Biquality Learning.
  • Methods: The paper proposes a Python library called biquality-learn, which provides a consistent and intuitive API for learning machine learning models from biquality data. The library includes well-proven algorithms and is designed to be accessible and easy to use for everyone.
  • Results: The paper enables researchers to experiment in a reproducible way on biquality data, and demonstrates the effectiveness of the proposed framework through experiments on several benchmark datasets.
    Abstract The democratization of Data Mining has been widely successful thanks in part to powerful and easy-to-use Machine Learning libraries. These libraries have been particularly tailored to tackle Supervised Learning. However, strong supervision signals are scarce in practice, and practitioners must resort to weak supervision. In addition to weaknesses of supervision, dataset shifts are another kind of phenomenon that occurs when deploying machine learning models in the real world. That is why Biquality Learning has been proposed as a machine learning framework to design algorithms capable of handling multiple weaknesses of supervision and dataset shifts without assumptions on their nature and level by relying on the availability of a small trusted dataset composed of cleanly labeled and representative samples. Thus we propose biquality-learn: a Python library for Biquality Learning with an intuitive and consistent API to learn machine learning models from biquality data, with well-proven algorithms, accessible and easy to use for everyone, and enabling researchers to experiment in a reproducible way on biquality data.
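
The library's own API is deliberately not reproduced here; instead, the sketch below shows one classic biquality strategy, reweighting untrusted samples by a trusted-vs-untrusted probability ratio, with plain scikit-learn. It assumes integer labels 0..K-1 that line up with `classes_`, and is a conceptual illustration rather than biquality-learn's implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def biquality_weights(X_trusted, y_trusted, X_untrusted, y_untrusted):
    """Weight untrusted samples by how plausible their labels look to a
    model fit on the small, cleanly labeled trusted set."""
    clf_t = LogisticRegression(max_iter=1000).fit(X_trusted, y_trusted)
    clf_u = LogisticRegression(max_iter=1000).fit(X_untrusted, y_untrusted)
    rows = np.arange(len(y_untrusted))
    p_t = clf_t.predict_proba(X_untrusted)[rows, y_untrusted]
    p_u = clf_u.predict_proba(X_untrusted)[rows, y_untrusted]
    return p_t / np.clip(p_u, 1e-8, None)   # pass as sample_weight downstream
```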

eess.IV - 2023-08-19

CRC-ICM: Colorectal Cancer Immune Cell Markers Pattern Dataset

  • paper_url: http://arxiv.org/abs/2308.10033
  • repo_url: None
  • paper_authors: Zahra Mokhtari, Elham Amjadi, Hamidreza Bolhasani, Zahra Faghih, AmirReza Dehghanian, Marzieh Rezaei
  • for: This paper aims to investigate the differences in immune checkpoint expression between right and left colon cancer, and to identify potential biomarkers for immunotherapy.
  • methods: The study uses a dataset of 1756 images from 136 patients with colorectal cancer, stained with specific antibodies for CD3, CD8, CD45RO, PD-1, LAG3, and Tim3.
  • results: The paper finds that there are differences in immune checkpoint expression between right and left colon cancer, and identifies potential biomarkers for immunotherapy.
    Abstract Colorectal Cancer (CRC) is the second most common cause of cancer death in the world, and can be identified by the location of the primary tumor in the large intestine: right and left colon, and rectum. Based on the location, CRC shows differences in chromosomal and molecular characteristics, microbiomes incidence, pathogenesis, and outcome. It has been shown that tumors on left and right sides also have different immune landscapes, so the prognosis may differ based on the primary tumor location. It is widely accepted that immune components of the tumor microenvironment (TME) play a critical role in tumor development. One of the critical regulatory molecules in the TME is immune checkpoints that, as the gatekeepers of immune responses, regulate the infiltrated immune cell functions. Inhibitory immune checkpoints such as PD-1, Tim3, and LAG3, as the main mechanism of immune suppression in the TME, are overexpressed and result in further development of the tumor. The images of this dataset have been taken from colon tissues of patients with CRC, stained with specific antibodies for CD3, CD8, CD45RO, PD-1, LAG3 and Tim3. The dataset, named CRC-ICM, contains 1756 images related to 136 patients. The initial version of CRC-ICM is published on the Elsevier Mendeley dataset portal, and the latest version is accessible via: https://databiox.com

Deformable-Detection Transformer for Microbubble Localization in Ultrasound Localization Microscopy

  • paper_url: http://arxiv.org/abs/2308.09845
  • repo_url: None
  • paper_authors: Sepideh K. Gharamaleki, Brandon Helfield, Hassan Rivaz
  • for: This paper aims to improve the localization of microbubbles (MBs) in ultrasound imaging, which is limited by the half-wavelength resolution of the imaging modality.
  • methods: The proposed method, DEformable DETR (DE-DETR), uses a multi-scale deformable attention mechanism to distribute attention within a limited budget, improving upon the conventional DETR approach which casts attention upon all grid pixels.
  • results: The proposed DE-DETR method shows improvement in both precision and recall, as well as the final super-resolution maps, compared to the conventional DETR method, when applied to the task of MB localization in ultrasound imaging.
    Abstract To overcome the half-wavelength resolution limitations of ultrasound imaging, microbubbles (MBs) have been utilized widely in the field. Conventional MB localization methods are limited either by exhaustive parameter tuning or by assuming a fixed Point Spread Function (PSF) for MBs, which calls into question their adaptability to different imaging settings or depths. As a result, development of methods that don't rely on manually adjusted parameters is crucial. Previously, we used a transformer-based approach, i.e. DEtection TRansformer (DETR) (arXiv:2005.12872v3 and arXiv:2209.11859v1), to address the above mentioned issues. However, DETR suffers from long training times and lower precision for smaller objects. In this paper, we propose the application of DEformable DETR (DE-DETR) (arXiv:2010.04159) for MB localization to mitigate DETR's above mentioned challenges. As opposed to DETR, where attention is cast upon all grid pixels, DE-DETR utilizes a multi-scale deformable attention to distribute attention within a limited budget. To evaluate the proposed strategy, pre-trained DE-DETR was fine-tuned on a subset of the dataset provided by the IEEE IUS Ultra-SR challenge organizers using transfer learning principles, and we subsequently tested the network on the rest of the dataset, excluding the highly correlated frames. The results manifest an improvement in both precision and recall, as well as in the final super-resolution maps, compared to DETR.

Cross-modality Attention-based Multimodal Fusion for Non-small Cell Lung Cancer (NSCLC) Patient Survival Prediction

  • paper_url: http://arxiv.org/abs/2308.09831
  • repo_url: None
  • paper_authors: Ruining Deng, Nazim Shaikh, Gareth Shannon, Yao Nie
  • for: Providing estimates of prognosis and survival outcomes for therapeutic response estimation and for stratifying patients into treatment groups.
  • methods: A cross-modality attention-based multimodal fusion strategy that gauges the importance of each modality when fusing features, rather than merely concatenating or summing them, to improve computer-aided diagnosis and prognosis.
  • results: On non-small cell lung cancer (NSCLC) patient survival prediction, the proposed fusion achieves a c-index of 0.6587, compared with 0.5772 and 0.5885 for single-modality learning using tissue image data or RNA-seq data alone.
    Abstract Cancer prognosis and survival outcome predictions are crucial for therapeutic response estimation and for stratifying patients into various treatment groups. Medical domains concerned with cancer prognosis are abundant with multiple modalities, including pathological image data and non-image data such as genomic information. To date, multimodal learning has shown potential to enhance clinical prediction model performance by extracting and aggregating information from different modalities of the same subject. This approach could outperform single modality learning, thus improving computer-aided diagnosis and prognosis in numerous medical applications. In this work, we propose a cross-modality attention-based multimodal fusion pipeline designed to integrate modality-specific knowledge for patient survival prediction in non-small cell lung cancer (NSCLC). Instead of merely concatenating or summing up the features from different modalities, our method gauges the importance of each modality for feature fusion with cross-modality relationship when infusing the multimodal features. Compared with single modality, which achieved c-index of 0.5772 and 0.5885 using solely tissue image data or RNA-seq data, respectively, the proposed fusion approach achieved c-index 0.6587 in our experiment, showcasing the capability of assimilating modality-specific knowledge from varied modalities.
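
A minimal cross-modality attention block along these lines might look as follows in PyTorch; the dimensions, mean-pooling, and survival head are illustrative assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Each modality's tokens attend to the other's before fusion."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.img_to_rna = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rna_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 1)          # survival risk score

    def forward(self, img_tok, rna_tok):           # (B, n_i, dim), (B, n_r, dim)
        img_ctx, _ = self.img_to_rna(img_tok, rna_tok, rna_tok)
        rna_ctx, _ = self.rna_to_img(rna_tok, img_tok, img_tok)
        fused = torch.cat([img_ctx.mean(1), rna_ctx.mean(1)], dim=-1)
        return self.head(fused)
```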

cs.SD - 2023-08-18

Compensating Removed Frequency Components: Thwarting Voice Spectrum Reduction Attacks

  • paper_url: http://arxiv.org/abs/2308.09546
  • repo_url: None
  • paper_authors: Shu Wang, Kun Sun, Qi Li
  • for: This paper aims to address the challenge of detecting harmful content in audio and video available on social media platforms, specifically the vulnerability of automatic speech recognition (ASR) systems to spectrum reduction attacks.
  • methods: The proposed solution is an acoustic compensation system named ACE, which leverages two key observations: frequency component dependencies and perturbation sensitivity. ACE uses a combination of frequency-based compensation and over-the-air perturbations to counter the spectrum reduction attacks and improve the accuracy of ASR systems.
  • results: The experiments show that ACE can effectively reduce up to 87.9% of ASR inference errors caused by spectrum reduction attacks. Additionally, the paper identifies six general types of ASR inference errors and investigates their causes and potential mitigation solutions.
    Abstract Automatic speech recognition (ASR) provides diverse audio-to-text services for humans to communicate with machines. However, recent research reveals ASR systems are vulnerable to various malicious audio attacks. In particular, by removing the non-essential frequency components, a new spectrum reduction attack can generate adversarial audios that can be perceived by humans but cannot be correctly interpreted by ASR systems. It raises a new challenge for content moderation solutions to detect harmful content in audio and video available on social media platforms. In this paper, we propose an acoustic compensation system named ACE to counter the spectrum reduction attacks over ASR systems. Our system design is based on two observations, namely, frequency component dependencies and perturbation sensitivity. First, since the Discrete Fourier Transform computation inevitably introduces spectral leakage and aliasing effects to the audio frequency spectrum, the frequency components with similar frequencies will have a high correlation. Thus, considering the intrinsic dependencies between neighboring frequency components, it is possible to recover more of the original audio by compensating for the removed components based on the remaining ones. Second, since the removed components in the spectrum reduction attacks can be regarded as an inverse of adversarial noise, the attack success rate will decrease when the adversarial audio is replayed in an over-the-air scenario. Hence, we can model the acoustic propagation process to add over-the-air perturbations into the attacked audio. We implement a prototype of ACE and the experiments show ACE can effectively reduce up to 87.9% of ASR inference errors caused by spectrum reduction attacks. Also, by analyzing residual errors, we summarize six general types of ASR inference errors and investigate the error causes and potential mitigation solutions.

Generative Machine Listener

  • paper_url: http://arxiv.org/abs/2308.09493
  • repo_url: None
  • paper_authors: Guanxin Jiang, Lars Villemoes, Arijit Biswas
  • for: Training a neural network on individual intrusive listening test scores to predict a distribution of scores for each pair of reference and coded input stereo or binaural signals.
  • methods: The method, nicknamed the Generative Machine Listener (GML), can generate an arbitrary amount of simulated listening test data.
  • results: Compared with a baseline system using regression over mean scores, lower outlier ratios (OR) are observed for the mean score predictions, and confidence intervals (CI) can easily be predicted. Introducing data augmentation techniques from the image domain significantly increases CI prediction accuracy as well as the Pearson and Spearman rank correlation of mean scores.
    Abstract We show how a neural network can be trained on individual intrusive listening test scores to predict a distribution of scores for each pair of reference and coded input stereo or binaural signals. We nickname this method the Generative Machine Listener (GML), as it is capable of generating an arbitrary amount of simulated listening test data. Compared to a baseline system using regression over mean scores, we observe lower outlier ratios (OR) for the mean score predictions, and obtain easy access to the prediction of confidence intervals (CI). The introduction of data augmentation techniques from the image domain results in a significant increase in CI prediction accuracy as well as Pearson and Spearman rank correlation of mean scores.
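
One standard way to realize "predict a distribution of scores" is to output a mean and variance per signal pair and minimize the Gaussian negative log-likelihood over the individual listener scores; the paper's exact parametrization may differ. A sketch:

```python
import torch
import torch.nn as nn

gaussian_nll = nn.GaussianNLLLoss()

def gml_step(model, ref, coded, listener_scores):
    """listener_scores: (B, L) individual scores for each of B signal pairs.
    `model` is assumed to return a per-pair mean and log-variance, (B, 1) each."""
    mu, log_var = model(ref, coded)
    B, L = listener_scores.shape
    var = log_var.exp()
    return gaussian_nll(mu.expand(B, L), listener_scores, var.expand(B, L))
```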

Exploring Sampling Techniques for Generating Melodies with a Transformer Language Model

  • paper_url: http://arxiv.org/abs/2308.09454
  • repo_url: None
  • paper_authors: Mathias Rose Bjare, Stefan Lattner, Gerhard Widmer
  • for: Investigating the impact of different sampling techniques on musical qualities such as diversity and structure.
  • methods: A high-capacity transformer model trained on a vast collection of highly-structured Irish folk melodies, with samples generated using distribution truncation techniques: nucleus sampling, the recently proposed "typical sampling", and conventional ancestral sampling, evaluated under both optimal and systematically degraded model conditions.
  • results: Probability truncation techniques may restrict diversity and structural patterns in optimal circumstances, but may also produce more musical samples in suboptimal circumstances.
    Abstract Research in natural language processing has demonstrated that the quality of generations from trained autoregressive language models is significantly influenced by the used sampling strategy. In this study, we investigate the impact of different sampling techniques on musical qualities such as diversity and structure. To accomplish this, we train a high-capacity transformer model on a vast collection of highly-structured Irish folk melodies and analyze the musical qualities of the samples generated using distribution truncation sampling techniques. Specifically, we use nucleus sampling, the recently proposed "typical sampling", and conventional ancestral sampling. We evaluate the effect of these sampling strategies in two scenarios: optimal circumstances with a well-calibrated model and suboptimal circumstances where we systematically degrade the model's performance. We assess the generated samples using objective and subjective evaluations. We discover that probability truncation techniques may restrict diversity and structural patterns in optimal circumstances, but may also produce more musical samples in suboptimal circumstances.
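
For reference, nucleus (top-p) sampling, the first truncation strategy studied, keeps the smallest set of tokens whose cumulative probability reaches $p$ and renormalizes before sampling; a compact implementation:

```python
import torch

def nucleus_sample(logits, p=0.9):
    """Sample a token id from the top-p 'nucleus' of a 1-D logits vector."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    outside = (cumulative - sorted_probs) >= p   # mass before token already >= p
    sorted_probs[outside] = 0.0
    sorted_probs /= sorted_probs.sum()           # renormalize the nucleus
    return sorted_idx[torch.multinomial(sorted_probs, 1)]
```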

TrOMR:Transformer-Based Polyphonic Optical Music Recognition

  • paper_url: http://arxiv.org/abs/2308.09370
  • repo_url: https://github.com/netease/polyphonic-tromr
  • paper_authors: Yixuan Li, Huaping Liu, Qiang Jin, Miaomiao Cai, Peng Li
  • for: Proposing TrOMR, a transformer-based end-to-end polyphonic optical music recognition (OMR) method, to improve recognition accuracy in real-world scenarios.
  • methods: A transformer with strong global perceptual capability, together with a novel consistency loss function and a reasonable data annotation approach that improve recognition accuracy for complex music scores.
  • results: Extensive experiments show that TrOMR outperforms current OMR methods, especially in real-world scenarios and on complex scores. The authors also develop a TrOMR system and build a camera scene dataset for full-page music scores in the real world, with code and datasets to be made available for reproducibility.
    Abstract Optical Music Recognition (OMR) is an important technology in music and has been researched for a long time. Previous approaches for OMR are usually based on CNN for image understanding and RNN for music symbol classification. In this paper, we propose a transformer-based approach with excellent global perceptual capability for end-to-end polyphonic OMR, called TrOMR. We also introduce a novel consistency loss function and a reasonable approach for data annotation to improve recognition accuracy for complex music scores. Extensive experiments demonstrate that TrOMR outperforms current OMR methods, especially in real-world scenarios. We also develop a TrOMR system and build a camera scene dataset for full-page music scores in real-world. The code and datasets will be made available for reproducibility.

Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge

  • paper_url: http://arxiv.org/abs/2308.09311
  • repo_url: None
  • paper_authors: Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Yong Man Ro
  • for: Improving the development of lip reading models for low-resource languages by learning and combining general speech knowledge and language-specific knowledge.
  • methods: General speech knowledge, the ability to model lip movements, is learned from a high-resource language through the prediction of speech units; language-specific knowledge is learned with a Language-specific Memory-augmented Decoder (LMDecoder), which can be trained on more easily accessible audio-text paired data.
  • results: Extensive experiments on five languages, English, Spanish, French, Italian, and Portuguese, demonstrate the effectiveness of the proposed method.
    Abstract This paper proposes a novel lip reading framework, especially for low-resource languages, which has not been well addressed in the previous literature. Since low-resource languages do not have enough video-text paired data to train the model to have sufficient power to model lip movements and language, it is regarded as challenging to develop lip reading models for low-resource languages. In order to mitigate the challenge, we try to learn general speech knowledge, the ability to model lip movements, from a high-resource language through the prediction of speech units. It is known that different languages partially share common phonemes, thus general speech knowledge learned from one language can be extended to other languages. Then, we try to learn language-specific knowledge, the ability to model language, by proposing Language-specific Memory-augmented Decoder (LMDecoder). LMDecoder saves language-specific audio features into memory banks and can be trained on audio-text paired data which is more easily accessible than video-text paired data. Therefore, with LMDecoder, we can transform the input speech units into language-specific audio features and translate them into texts by utilizing the learned rich language knowledge. Finally, by combining general speech knowledge and language-specific knowledge, we can efficiently develop lip reading models even for low-resource languages. Through extensive experiments using five languages, English, Spanish, French, Italian, and Portuguese, the effectiveness of the proposed method is evaluated.

Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms

  • paper_url: http://arxiv.org/abs/2308.09302
  • repo_url: https://github.com/ph-w2000/s2pecnet
  • paper_authors: Penghui Wen, Kun Hu, Wenxi Yue, Sen Zhang, Wanlei Zhou, Zhiyong Wang
  • for: Defending audio anti-spoofing against attacks enabled by recent deepfake techniques.
  • methods: A deep learning method with a spectral fusion-reconstruction strategy that fuses multi-order spectral patterns in a coarse-to-fine manner, with two branches for fine-level fusion from the spectral and temporal contexts.
  • results: State-of-the-art performance with an EER of 0.77% on the widely used ASVspoof2019 LA Challenge dataset.
    Abstract Robust audio anti-spoofing has been increasingly challenging due to the recent advancements in deepfake techniques. While spectrograms have demonstrated their capability for anti-spoofing, complementary information presented in multi-order spectral patterns has not been well explored, which limits their effectiveness for varying spoofing attacks. Therefore, we propose a novel deep learning method with a spectral fusion-reconstruction strategy, namely S2pecNet, to utilise multi-order spectral patterns for robust audio anti-spoofing representations. Specifically, spectral patterns up to second-order are fused in a coarse-to-fine manner and two branches are designed for the fine-level fusion from the spectral and temporal contexts. A reconstruction from the fused representation to the input spectrograms further reduces the potential fused information loss. Our method achieved the state-of-the-art performance with an EER of 0.77% on a widely used dataset: the ASVspoof2019 LA Challenge.

V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

  • paper_url: http://arxiv.org/abs/2308.09300
  • repo_url: None
  • paper_authors: Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, Weidong Cai
  • for: Studying vision-to-audio (V2A) generation, i.e., generating semantically-relevant sound from visual input, by building on foundation models (FMs).
  • methods: Three foundation models, CLIP, CLAP, and AudioLDM, are connected by a simple yet effective mapper mechanism (V2A-Mapper) that bridges the domain gap by translating the visual input from CLIP space to CLAP space; the pretrained audio generative FM AudioLDM then produces high-fidelity and visually-aligned sound conditioned on the translated CLAP embedding.
  • results: Compared to existing approaches, the proposed method is trained with 86% fewer parameters yet improves FD and CS by 53% and 19%, respectively.
    Abstract Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when the audio modality is involved. On the other hand, automatically generating semantically-relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent space of the visual CLIP and the auditory CLAP models. Then we propose a simple yet effective mapper mechanism (V2A-Mapper) to bridge the domain gap by translating the visual input between CLIP and CLAP spaces. Conditioned on the translated CLAP embedding, pretrained audio generative FM AudioLDM is adopted to produce high-fidelity and visually-aligned sound. Compared to previous approaches, our method only requires a quick training of the V2A-Mapper. We further analyze and conduct extensive experiments on the choice of the V2A-Mapper and show that a generative mapper is better at fidelity and variability (FD) while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluation on two V2A datasets demonstrate the superiority of our proposed method compared to current state-of-the-art approaches - trained with 86% fewer parameters but achieving 53% and 19% improvement in FD and CS, respectively.
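
Reduced to its essentials, the mapper is a small translator between embedding spaces. The sketch below is an illustrative regression-style MLP with an MSE objective; the paper compares generative and regression mappers, and its actual dimensions and training details differ:

```python
import torch
import torch.nn as nn

class V2AMapper(nn.Module):
    """Illustrative mapper: CLIP image embedding -> CLAP audio embedding space."""
    def __init__(self, clip_dim=512, clap_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(clip_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, clap_dim))

    def forward(self, clip_emb):
        return self.net(clip_emb)

def mapper_loss(mapper, clip_emb, clap_emb):
    """Regress paired CLAP embeddings from CLIP embeddings; the mapper's
    output then conditions a pretrained audio generator such as AudioLDM."""
    return nn.functional.mse_loss(mapper(clip_emb), clap_emb)
```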
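Below is a minimal sketch of what the regression-style variant of such a mapper could look like: a small MLP that translates a frozen CLIP image embedding into CLAP space, trained to regress onto CLAP audio embeddings of paired clips. The embedding sizes (512), hidden width, and MSE objective are assumptions for illustration; the paper also studies a generative mapper, which this sketch does not cover.

```python
import torch
import torch.nn as nn

class V2AMapperSketch(nn.Module):
    """Hypothetical regression-style mapper from CLIP to CLAP space.

    The dimensions and the 2-layer MLP shape are illustrative
    assumptions, not the paper's exact configuration.
    """

    def __init__(self, clip_dim: int = 512, clap_dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, clap_dim),
        )

    def forward(self, clip_image_emb: torch.Tensor) -> torch.Tensor:
        # Translate a (batch, clip_dim) visual embedding into CLAP space.
        return self.net(clip_image_emb)

# Training-step sketch: regress onto CLAP audio embeddings of paired clips.
mapper = V2AMapperSketch()
opt = torch.optim.AdamW(mapper.parameters(), lr=1e-4)
clip_emb = torch.randn(8, 512)   # stand-in for frozen CLIP image embeddings
clap_emb = torch.randn(8, 512)   # stand-in for frozen CLAP audio embeddings
opt.zero_grad()
loss = nn.functional.mse_loss(mapper(clip_emb), clap_emb)
loss.backward()
opt.step()
```

At inference, the mapped embedding would condition the pretrained AudioLDM to synthesise the sound, so only the lightweight mapper ever needs training.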

Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries

  • paper_url: http://arxiv.org/abs/2308.09089
  • repo_url: None
  • paper_authors: Julia Wilkins, Justin Salamon, Magdalena Fuentes, Juan Pablo Bello, Oriol Nieto
  • for: Finding sound effects (SFX) that match moments in a video, using a video frame directly as the query to retrieve high-quality (HQ) SFX.
  • methods: A multimodal framework that leverages large language models and foundational vision-language models to bridge HQ audio and video, yielding a highly scalable automatic audio-visual data curation pipeline; pre-trained audio and visual encoders are then used to train a contrastive learning-based retrieval system.
  • results: The system, trained on the automatically curated pairs, significantly outperforms baselines trained on in-the-wild data for HQ SFX retrieval, generalizes well from clean to in-the-wild data, and was preferred by users 67% of the time in a user study.
    Abstract Finding the right sound effects (SFX) to match moments in a video is a difficult and time-consuming task, and relies heavily on the quality and completeness of text metadata. Retrieving high-quality (HQ) SFX using a video frame directly as the query is an attractive alternative, removing the reliance on text metadata and providing a low barrier to entry for non-experts. Due to the lack of HQ audio-visual training data, previous work on audio-visual retrieval relies on YouTube (in-the-wild) videos of varied quality for training, where the audio is often noisy and the video is of amateur quality. As such, it is unclear whether these systems would generalize to the task of matching HQ audio to production-quality video. To address this, we propose a multimodal framework for recommending HQ SFX given a video frame by (1) leveraging large language models and foundational vision-language models to bridge HQ audio and video to create audio-visual pairs, resulting in a highly scalable automatic audio-visual data curation pipeline; and (2) using pre-trained audio and visual encoders to train a contrastive learning-based retrieval system. We show that our system, trained using our automatic data curation pipeline, significantly outperforms baselines trained on in-the-wild data on the task of HQ SFX retrieval for video. Furthermore, while the baselines fail to generalize to this task, our system generalizes well from clean to in-the-wild data, outperforming the baselines on a dataset of YouTube videos despite only being trained on the HQ audio-visual pairs. A user study confirms that people prefer SFX retrieved by our system over the baseline 67% of the time for both HQ and in-the-wild data. Finally, we present ablations to determine the impact of model and data pipeline design choices on downstream retrieval performance. Please visit our project website to listen to and view our SFX retrieval results.
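The retrieval system is contrastive at its core: paired frame and audio embeddings are pulled together while unpaired ones in the batch are pushed apart. Below is a minimal sketch of one common form of such an objective, a symmetric InfoNCE loss, plus a cosine-similarity ranking step for retrieval; the temperature value and encoder outputs are assumptions, not the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(frame_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired frame/audio embeddings.

    frame_emb, audio_emb: (batch, dim) outputs of the visual and audio
    encoders; matched pairs share a row index.
    """
    f = F.normalize(frame_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = f @ a.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(f.size(0), device=f.device)
    # Pull each frame toward its own audio clip and vice versa,
    # pushing away the other clips in the batch.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def rank_sfx(query_frame_emb, sfx_bank_emb):
    """Rank candidate HQ SFX by cosine similarity to the query frame."""
    sims = F.normalize(query_frame_emb, dim=-1) @ F.normalize(sfx_bank_emb, dim=-1).t()
    return sims.argsort(dim=-1, descending=True)  # indices, best match first
```

Once trained, retrieval reduces to a single matrix multiply against the precomputed embeddings of the SFX library, which is what makes frame-as-query search cheap at serving time.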