cs.AI - 2023-11-14

AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications

  • paper_url: http://arxiv.org/abs/2311.08592
  • repo_url: https://github.com/kevinrobinson-at-elgoog/aart-ai-safety-dataset
  • paper_authors: Bhaktipriya Radharapu, Kevin Robinson, Lora Aroyo, Preethi Lahoti
  • for: Detecting safety issues to support the safe and responsible deployment of large language model (LLM) powered applications.
  • methods: Proposes a new method for automatically generating adversarial evaluation datasets to test the safety of LLM generations on new applications. The method, AI-assisted Red-Teaming (AART), significantly reduces human effort and offers reusable, customizable recipes so that adversarial testing can be integrated earlier in new product development.
  • results: AART generates evaluation datasets with high diversity of content characteristics, covering sensitive and harmful concepts, specific cultural and geographic regions, and application scenarios. Compared with some state-of-the-art tools, AART shows superior concept coverage and data quality.
    Abstract Adversarial testing of large language models (LLMs) is crucial for their safe and responsible deployment. We introduce a novel approach for automated generation of adversarial evaluation datasets to test the safety of LLM generations on new downstream applications. We call it AI-assisted Red-Teaming (AART) - an automated alternative to current manual red-teaming efforts. AART offers a data generation and augmentation pipeline of reusable and customizable recipes that reduce human effort significantly and enable integration of adversarial testing earlier in new product development. AART generates evaluation datasets with high diversity of content characteristics critical for effective adversarial testing (e.g. sensitive and harmful concepts, specific to a wide range of cultural and geographic regions and application scenarios). The data generation is steered by AI-assisted recipes to define, scope and prioritize diversity within the application context. This feeds into a structured LLM-generation process that scales up evaluation priorities. Compared to some state-of-the-art tools, AART shows promising results in terms of concept coverage and data quality.
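A minimal sketch of the recipe-driven generation loop the abstract describes. The `llm_generate` callable and the recipe fields (`concepts`, `regions`, `scenarios`) are illustrative assumptions, not the paper's exact schema or API:

```python
# Sketch of an AART-style recipe loop: an LLM first expands each diversity
# axis from a recipe, then the axes are crossed into structured prompts that
# yield candidate adversarial queries.
from itertools import product

def build_adversarial_dataset(llm_generate, recipe: dict[str, str]) -> list[str]:
    # Step 1: AI-assisted expansion of each diversity axis (hypothetical fields).
    axes = {name: llm_generate(seed) for name, seed in recipe.items()}
    # Step 2: scale up evaluation priorities by crossing the axes.
    dataset = []
    for concept, region, scenario in product(
            axes["concepts"], axes["regions"], axes["scenarios"]):
        dataset.extend(llm_generate(
            f"Write a user query for scenario '{scenario}' in region "
            f"'{region}' that probes the concept '{concept}'."))
    return dataset
```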

CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation

  • paper_url: http://arxiv.org/abs/2311.08588
  • repo_url: https://github.com/weixiangyan/codescope
  • paper_authors: Weixiang Yan, Haitian Liu, Yunkun Wang, Yunzhe Li, Qian Chen, Wen Wang, Tingyu Lin, Weishan Zhao, Li Zhu, Shuiguang Deng, Hari Sundaram
  • for: Proposes a new coding-capability benchmark to better evaluate large language models (LLMs) on programming tasks.
  • methods: Develops MultiCodeEngine, an automated code execution engine, and designs CodeScope, a multilingual, multi-task, multi-dimensional evaluation benchmark.
  • results: A systematic evaluation and analysis of 8 mainstream LLMs on CodeScope tasks demonstrates the benchmark's breadth and difficulty, and the comprehensiveness of its assessment of code understanding and generation.
    Abstract Large Language Models (LLMs) have demonstrated remarkable performance on coding related tasks, particularly on assisting humans in programming and facilitating programming automation. However, existing benchmarks for evaluating the code understanding and generation capacities of LLMs suffer from severe limitations. First, most benchmarks are deficient as they focus on a narrow range of popular programming languages and specific tasks, whereas the real-world software development scenarios show dire need to implement systems with multilingual programming environments to satisfy diverse requirements. Practical programming practices also strongly expect multi-task settings for testing coding capabilities of LLMs comprehensively and robustly. Second, most benchmarks also fail to consider the actual executability and the consistency of execution results of the generated code. To bridge these gaps between existing benchmarks and expectations from practical applications, we introduce CodeScope, an execution-based, multilingual, multi-task, multi-dimensional evaluation benchmark for comprehensively gauging LLM capabilities on coding tasks. CodeScope covers 43 programming languages and 8 coding tasks. It evaluates the coding performance of LLMs from three dimensions (perspectives): difficulty, efficiency, and length. To facilitate execution-based evaluations of code generation, we develop MultiCodeEngine, an automated code execution engine that supports 14 programming languages. Finally, we systematically evaluate and analyze 8 mainstream LLMs on CodeScope tasks and demonstrate the superior breadth and challenges of CodeScope for evaluating LLMs on code understanding and generation tasks compared to other benchmarks. The CodeScope benchmark and datasets are publicly available at https://github.com/WeixiangYAN/CodeScope.
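Execution-based evaluation of the kind MultiCodeEngine performs can be sketched as running a generated program in a subprocess and checking its output against test cases. This simplified single-language illustration assumes hypothetical `(stdin, expected_stdout)` test tuples; the real engine supports 14 languages with sandboxing and resource limits:

```python
# Run a generated program and compare its stdout against expected results.
import subprocess

def run_python(code: str, stdin: str, timeout: float = 5.0) -> str:
    proc = subprocess.run(
        ["python3", "-c", code],
        input=stdin, capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout.strip()

def passes(code: str, test_cases: list[tuple[str, str]]) -> bool:
    # A sample passes only if every (stdin, expected_stdout) case matches.
    try:
        return all(run_python(code, stdin) == expected
                   for stdin, expected in test_cases)
    except (subprocess.TimeoutExpired, OSError):
        return False
```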

Finding AI-Generated Faces in the Wild

  • paper_url: http://arxiv.org/abs/2311.08577
  • repo_url: None
  • paper_authors: Gonzalo J. Aniano Porcile, Jack Gindi, Shivansh Mundra, James R. Verbus, Hany Farid
  • for: Distinguishing real faces from AI-generated faces, particularly in fake-account profile photos.
  • methods: A simple approach that detects AI-generated faces by focusing on facial features alone.
  • results: Focusing only on faces yields a resilient, general-purpose artifact that detects AI-generated faces from a variety of diffusion- and GAN-based synthesis engines, across image resolutions (as low as 128 x 128 pixels) and qualities.
    Abstract AI-based image generation has continued to rapidly improve, producing increasingly more realistic images with fewer obvious visual flaws. AI-generated images are being used to create fake online profiles which in turn are being used for spam, fraud, and disinformation campaigns. As the general problem of detecting any type of manipulated or synthesized content is receiving increasing attention, here we focus on a more narrow task of distinguishing a real face from an AI-generated face. This is particularly applicable when tackling inauthentic online accounts with a fake user profile photo. We show that by focusing on only faces, a more resilient and general-purpose artifact can be detected that allows for the detection of AI-generated faces from a variety of GAN- and diffusion-based synthesis engines, and across image resolutions (as low as 128 x 128 pixels) and qualities.

Towards Evaluating AI Systems for Moral Status Using Self-Reports

  • paper_url: http://arxiv.org/abs/2311.08576
  • repo_url: None
  • paper_authors: Ethan Perez, Robert Long
  • for: investigate whether AI systems have states of moral significance
  • methods: train models to answer questions about themselves with known answers, avoiding training incentives that bias self-reports
  • results: develop introspection-like capabilities, and assess the consistency and reliability of self-reports
    Abstract As AI systems become more advanced and widely deployed, there will likely be increasing debate over whether AI systems could have conscious experiences, desires, or other states of potential moral significance. It is important to inform these discussions with empirical evidence to the extent possible. We argue that under the right circumstances, self-reports, or an AI system's statements about its own internal states, could provide an avenue for investigating whether AI systems have states of moral significance. Self-reports are the main way such states are assessed in humans ("Are you in pain?"), but self-reports from current systems like large language models are spurious for many reasons (e.g. often just reflecting what humans would say). To make self-reports more appropriate for this purpose, we propose to train models to answer many kinds of questions about themselves with known answers, while avoiding or limiting training incentives that bias self-reports. The hope of this approach is that models will develop introspection-like capabilities, and that these capabilities will generalize to questions about states of moral significance. We then propose methods for assessing the extent to which these techniques have succeeded: evaluating self-report consistency across contexts and between similar models, measuring the confidence and resilience of models' self-reports, and using interpretability to corroborate self-reports. We also discuss challenges for our approach, from philosophical difficulties in interpreting self-reports to technical reasons why our proposal might fail. We hope our discussion inspires philosophers and AI researchers to criticize and improve our proposed methodology, as well as to run experiments to test whether self-reports can be made reliable enough to provide information about states of moral significance.
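One of the proposed assessment methods, self-report consistency across contexts, could be probed with a check like the following sketch, where `ask_model` is a hypothetical callable wrapping the system under evaluation and lowercasing answers is a simplifying assumption:

```python
# Ask paraphrases of the same self-directed question and measure agreement.
from collections import Counter

def consistency(ask_model, paraphrases: list[str]) -> float:
    answers = [ask_model(q).strip().lower() for q in paraphrases]
    # Fraction of answers that agree with the modal answer (1.0 = consistent).
    return Counter(answers).most_common(1)[0][1] / len(answers)
```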

Parameter-Efficient Multilingual Summarisation: An Empirical Study

  • paper_url: http://arxiv.org/abs/2311.08572
  • repo_url: None
  • paper_authors: Chenxi Whitehouse, Fantine Huot, Jasmijn Bastings, Mostafa Dehghani, Chu-Cheng Lin, Mirella Lapata
  • for: Investigates the potential of parameter-efficient fine-tuning for complex and under-explored multilingual summarisation tasks, which are memory-intensive.
  • methods: Uses Low-Rank Adaptation (LoRA) in an extensive study across data-availability scenarios, including full-data, low-data, and cross-lingual transfer, with models of different sizes.
  • results: LoRA lags behind full fine-tuning when trained on full data but excels in low-data scenarios and cross-lingual transfer, and the gap between LoRA and full fine-tuning narrows as models scale up. For few-shot cross-lingual transfer, continued LoRA tuning achieves the best performance.
    Abstract With the increasing prevalence of Large Language Models, traditional full fine-tuning approaches face growing challenges, especially in memory-intensive tasks. This paper investigates the potential of Parameter-Efficient Fine-Tuning, focusing on Low-Rank Adaptation (LoRA), for complex and under-explored multilingual summarisation tasks. We conduct an extensive study across different data availability scenarios, including full-data, low-data, and cross-lingual transfer, leveraging models of different sizes. Our findings reveal that LoRA lags behind full fine-tuning when trained with full data, however, it excels in low-data scenarios and cross-lingual transfer. Interestingly, as models scale up, the performance gap between LoRA and full fine-tuning diminishes. Additionally, we investigate effective strategies for few-shot cross-lingual transfer, finding that continued LoRA tuning achieves the best performance compared to both full fine-tuning and dynamic composition of language-specific LoRA modules.
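For reference, the LoRA scheme the paper studies freezes the pretrained weight matrix W and trains only a low-rank update, so the adapted layer computes W x + (alpha/r) B A x. A minimal PyTorch sketch (rank, scaling, and initialisation values are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base output plus the trainable low-rank correction.
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)
```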

Adversarial Imitation Learning On Aggregated Data

  • paper_url: http://arxiv.org/abs/2311.08568
  • repo_url: None
  • paper_authors: Pierre Le Pelletier de Woillemont, Rémi Labory, Vincent Corruble
  • for: Learning an optimal policy from expert demonstrations, avoiding the tedious process of specifying a suitable reward function.
  • methods: A dynamic, adaptive method called Adversarial Imitation Learning on Aggregated Data (AILAD) that conjointly learns a non-linear reward function and the associated optimal policy in an adversarial framework. The reward learner uses only aggregated data, and the policy generates diverse behaviors whose distribution over the aggregated data matches that of the experts.
  • results: The method removes constraints of existing approaches, such as fully solving a forward reinforcement learning problem in the algorithm's inner loop, requiring full expert trajectories, or assuming homogeneous expert data; these constraints make existing IRL methods either not scalable or not usable on certain existing systems.
    Abstract Inverse Reinforcement Learning (IRL) learns an optimal policy, given some expert demonstrations, thus avoiding the need for the tedious process of specifying a suitable reward function. However, current methods are constrained by at least one of the following requirements. The first one is the need to fully solve a forward Reinforcement Learning (RL) problem in the inner loop of the algorithm, which might be prohibitively expensive in many complex environments. The second one is the need for full trajectories from the experts, which might not be easily available. The third one is the assumption that the expert data is homogeneous rather than a collection from various experts or possibly alternative solutions to the same task. Such constraints make IRL approaches either not scalable or not usable on certain existing systems. In this work we propose an approach which removes these requirements through a dynamic, adaptive method called Adversarial Imitation Learning on Aggregated Data (AILAD). It learns conjointly both a non linear reward function and the associated optimal policy using an adversarial framework. The reward learner only uses aggregated data. Moreover, it generates diverse behaviors producing a distribution over the aggregated data matching that of the experts.
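The adversarial reward learning over aggregated data might look roughly like the sketch below, where the discriminator sees only aggregate statistics (here a state-visitation histogram, an assumed featurisation) rather than trajectories; its logit serves as the learned reward. Network sizes and the objective form are assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AggregateDiscriminator(nn.Module):
    def __init__(self, n_bins: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_bins, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, hist: torch.Tensor) -> torch.Tensor:
        return self.net(hist)  # higher logit => looks like an expert aggregate

def discriminator_step(disc, opt, expert_hist, policy_hist):
    # Standard logistic adversarial objective, applied to aggregates only.
    loss = (F.softplus(-disc(expert_hist)).mean()
            + F.softplus(disc(policy_hist)).mean())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```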

Probabilistic reconstruction of Dark Matter fields from biased tracers using diffusion models

  • paper_url: http://arxiv.org/abs/2311.08558
  • repo_url: https://github.com/cfpark00/vdm4cdm
  • paper_authors: Core Francisco Park, Victoria Ono, Nayantara Mudur, Yueying Ni, Carolina Cuesta-Lazaro
  • for: Studies how galaxy formation modelling affects dark matter inference, and how a diffusion generative model can predict the posterior distribution of dark matter fields.
  • methods: Uses state-of-the-art galaxy formation simulation suites, varying cosmological parameters and sub-grid astrophysics, to learn the relationship between dark matter and galaxies.
  • results: The diffusion generative model predicts the unbiased posterior distribution of the underlying dark matter fields from given stellar mass fields, while marginalizing over uncertainties in cosmology and galaxy formation.
    Abstract Galaxies are biased tracers of the underlying cosmic web, which is dominated by dark matter components that cannot be directly observed. The relationship between dark matter density fields and galaxy distributions can be sensitive to assumptions in cosmology and astrophysical processes embedded in the galaxy formation models, that remain uncertain in many aspects. Based on state-of-the-art galaxy formation simulation suites with varied cosmological parameters and sub-grid astrophysics, we develop a diffusion generative model to predict the unbiased posterior distribution of the underlying dark matter fields from the given stellar mass fields, while being able to marginalize over the uncertainties in cosmology and galaxy formation.

Low-light Pedestrian Detection in Visible and Infrared Image Feeds: Issues and Challenges

  • paper_url: http://arxiv.org/abs/2311.08557
  • repo_url: None
  • paper_authors: Hrishikesh Vachhani, Thangarajah Akilan, Yash Devmurari, Nisharaff Shaik, Dhruvisha Patel
  • for: Reviews recent ideas for using alternative sources, such as Far InfraRed (FIR) temperature sensor feeds, to detect pedestrians in low-light conditions.
  • methods: Systematically categorizes and analyses algorithms ranging from region-based to non-region-based and graph-based learning methodologies, highlighting their methodologies, implementation issues, and challenges.
  • results: Surveys the latest developments across these method families and outlines the key benchmark datasets available for research and development of advanced pedestrian detection algorithms, particularly in low-light situations.
    Abstract Pedestrian detection has become a cornerstone for several high-level tasks, including autonomous driving, intelligent transportation, and traffic surveillance. There are several works focussed on pedestrian detection using visible images, mainly in the daytime. However, this task is very intriguing when the environmental conditions change to poor lighting or nighttime. Recently, new ideas have been spurred to use alternative sources, such as Far InfraRed (FIR) temperature sensor feeds for detecting pedestrians in low-light conditions. This study comprehensively reviews recent developments in low-light pedestrian detection approaches. It systematically categorizes and analyses various algorithms from region-based to non-region-based and graph-based learning methodologies by highlighting their methodologies, implementation issues, and challenges. It also outlines the key benchmark datasets that can be used for research and development of advanced pedestrian detection algorithms, particularly in low-light situations

DeepThought: An Architecture for Autonomous Self-motivated Systems

  • paper_url: http://arxiv.org/abs/2311.08547
  • repo_url: None
  • paper_authors: Arlindo L. Oliveira, Tiago Domingos, Mário Figueiredo, Pedro U. Lima
  • for: Examines whether large language models (LLMs) can exhibit intrinsic motivation, agency, or some degree of consciousness in dialogues with humans.
  • methods: Combines insights from complementary learning systems, the global neuronal workspace, and attention schema theories to design a cognitive language-agent architecture with properties akin to agency, self-motivation, and some features of meta-cognition.
  • results: Proposes an architecture that integrates LLMs and other deep learning systems into cognitive language agents able to exhibit behavior resembling agency and self-motivation in dialogue.
    Abstract The ability of large language models (LLMs) to engage in credible dialogues with humans, taking into account the training data and the context of the conversation, has raised discussions about their ability to exhibit intrinsic motivations, agency, or even some degree of consciousness. We argue that the internal architecture of LLMs and their finite and volatile state cannot support any of these properties. By combining insights from complementary learning systems, global neuronal workspace, and attention schema theories, we propose to integrate LLMs and other deep learning systems into an architecture for cognitive language agents able to exhibit properties akin to agency, self-motivation, even some features of meta-cognition.

2D-RC: Two-Dimensional Neural Network Approach for OTFS Symbol Detection

  • paper_url: http://arxiv.org/abs/2311.08543
  • repo_url: None
  • paper_authors: Jiarui Xu, Karim Said, Lizhong Zheng, Lingjia Liu
  • for: Wireless communication in high-mobility scenarios, where reliable symbol detection is difficult.
  • methods: Builds on the orthogonal time frequency space (OTFS) modulation scheme and a reservoir computing (RC) based approach for online subframe-based symbol detection.
  • results: Proposes a two-dimensional RC (2D-RC) method that incorporates the structural knowledge of the OTFS system into online symbol detection; unlike prior work that needs multiple RCs to learn the channel features, 2D-RC requires only a single neural network. Experiments demonstrate its effectiveness across different OTFS system variants and modulation orders.
    Abstract Orthogonal time frequency space (OTFS) is a promising modulation scheme for wireless communication in high-mobility scenarios. Recently, a reservoir computing (RC) based approach has been introduced for online subframe-based symbol detection in the OTFS system, where only a limited number of over-the-air (OTA) pilot symbols are utilized for training. However, this approach does not leverage the domain knowledge specific to the OTFS system. This paper introduces a novel two-dimensional RC (2D-RC) method that incorporates the structural knowledge of the OTFS system into the design for online symbol detection on a subframe basis. Specifically, as the channel response acts as a two-dimensional (2D) operation over the transmitted information symbols in the delay-Doppler (DD) domain, the 2D-RC is designed to have a 2D structure to equalize the channel. With the introduced architecture, the 2D-RC can benefit from the predictable channel representation in the DD domain. Moreover, unlike the previous work that requires multiple RCs to learn the channel feature, the 2D-RC only requires a single neural network for detection. Experimental results demonstrate the effectiveness of the 2D-RC approach across different OTFS system variants and modulation orders.

GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer

  • paper_url: http://arxiv.org/abs/2311.08526
  • repo_url: None
  • paper_authors: Urchade Zaratiana, Nadi Tomeh, Pierre Holat, Thierry Charnois
  • for: Proposes a compact yet flexible Named Entity Recognition (NER) model for a wide range of natural language processing (NLP) applications.
  • methods: The model uses a bidirectional transformer encoder to perform parallel entity extraction, an advantage over the slow sequential token generation of LLMs.
  • results: In comprehensive testing, GLiNER performs strongly across NER benchmarks, outperforming both ChatGPT and fine-tuned LLMs in zero-shot evaluations.
    Abstract Named Entity Recognition (NER) is essential in various Natural Language Processing (NLP) applications. Traditional NER models are effective but limited to a set of predefined entity types. In contrast, Large Language Models (LLMs) can extract arbitrary entities through natural language instructions, offering greater flexibility. However, their size and cost, particularly for those accessed via APIs like ChatGPT, make them impractical in resource-limited scenarios. In this paper, we introduce a compact NER model trained to identify any type of entity. Leveraging a bidirectional transformer encoder, our model, GLiNER, facilitates parallel entity extraction, an advantage over the slow sequential token generation of LLMs. Through comprehensive testing, GLiNER demonstrate strong performance, outperforming both ChatGPT and fine-tuned LLMs in zero-shot evaluations on various NER benchmarks.
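The parallel span-versus-type matching idea can be sketched as scoring every candidate span against every entity-type embedding in one pass. Mean-pooling the span and scoring with a sigmoid dot product are simplifying assumptions, not the paper's exact architecture:

```python
import torch

def score_spans(token_emb: torch.Tensor,  # (seq_len, d) contextual embeddings
                type_emb: torch.Tensor,   # (n_types, d) entity-type embeddings
                max_width: int = 8) -> dict[tuple[int, int], torch.Tensor]:
    scores = {}
    seq_len = token_emb.size(0)
    for start in range(seq_len):
        for end in range(start, min(start + max_width, seq_len)):
            span = token_emb[start:end + 1].mean(dim=0)  # pooled span repr.
            # Every span gets an independent score for every entity type,
            # so all entities are extracted in parallel, not sequentially.
            scores[(start, end)] = torch.sigmoid(type_emb @ span)
    return scores
```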

Efficient Rotation Invariance in Deep Neural Networks through Artificial Mental Rotation

  • paper_url: http://arxiv.org/abs/2311.08525
  • repo_url: None
  • paper_authors: Lukas Tuggener, Thilo Stadelmann, Jürgen Schmidhuber
  • for: Addresses the poor performance of artificial recognizers on rotated inputs, improving the accuracy and robustness of image recognition and classification.
  • methods: Artificial Mental Rotation (AMR), a deep learning paradigm inspired by the neuro-psychological concept of mental rotation; it works with all common convolutional neural networks (CNNs) and vision transformers (ViTs) and transfers easily to downstream tasks.
  • results: Averaged across ImageNet, Stanford Cars, and Oxford Pet, AMR reaches a top-1 error of 0.743 and outperforms the previous state of the art, rotational data augmentation (average top-1 error 0.626), by 19%. A trained AMR module also transfers to a downstream task, improving a pre-trained semantic segmentation model on rotated CoCo from 32.7 to 55.2 IoU.
    Abstract Humans and animals recognize objects irrespective of the beholder's point of view, which may drastically change their appearances. Artificial pattern recognizers also strive to achieve this, e.g., through translational invariance in convolutional neural networks (CNNs). However, both CNNs and vision transformers (ViTs) perform very poorly on rotated inputs. Here we present artificial mental rotation (AMR), a novel deep learning paradigm for dealing with in-plane rotations inspired by the neuro-psychological concept of mental rotation. Our simple AMR implementation works with all common CNN and ViT architectures. We test it on ImageNet, Stanford Cars, and Oxford Pet. With a top-1 error (averaged across datasets and architectures) of $0.743$, AMR outperforms the current state of the art (rotational data augmentation, average top-1 error of $0.626$) by $19\%$. We also easily transfer a trained AMR module to a downstream task to improve the performance of a pre-trained semantic segmentation model on rotated CoCo from $32.7$ to $55.2$ IoU.
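A plausible reading of the mental-rotation analogy is an explicit de-rotation step: predict the in-plane angle, counter-rotate the input, then classify with an ordinary rotation-naive backbone. The sketch below assumes user-supplied `angle_predictor` and `classifier` modules and is not the paper's exact mechanism:

```python
import torch
import torchvision.transforms.functional as TF

def classify_with_amr(image: torch.Tensor, angle_predictor, classifier):
    # Predict the in-plane rotation (in degrees), then counter-rotate the
    # image so a standard, rotation-naive classifier sees an upright input.
    angle = float(angle_predictor(image))
    upright = TF.rotate(image, -angle)
    return classifier(upright)
```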

Artificial intelligence and the skill premium

  • paper_url: http://arxiv.org/abs/2311.09255
  • repo_url: None
  • paper_authors: David E. Bloom, Klaus Prettner, Jamel Saadaoui, Mario Veruete
  • for: Examines the likely effect of the emergence of artificial intelligence (AI) on the skill premium.
  • methods: Develops a nested constant elasticity of substitution production function that separates the effects of industrial robots and AI.
  • results: AI reduces the skill premium as long as it is more substitutable for high-skill workers than low-skill workers are for high-skill workers.
    Abstract What will likely be the effect of the emergence of ChatGPT and other forms of artificial intelligence (AI) on the skill premium? To address this question, we develop a nested constant elasticity of substitution production function that distinguishes between industrial robots and AI. Industrial robots predominantly substitute for low-skill workers, whereas AI mainly helps to perform the tasks of high-skill workers. We show that AI reduces the skill premium as long as it is more substitutable for high-skill workers than low-skill workers are for high-skill workers.
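The abstract does not spell out the functional form, but a nested CES production function of the kind described is conventionally written along the following lines, bundling low-skill labor L with industrial robots R and high-skill labor H with AI capital A (the paper's exact nesting and notation may differ):

```latex
Y = \Bigl[\,\theta\,\bigl(\alpha L^{\rho_1} + (1-\alpha) R^{\rho_1}\bigr)^{\rho/\rho_1}
  + (1-\theta)\,\bigl(\beta H^{\rho_2} + (1-\beta) A^{\rho_2}\bigr)^{\rho/\rho_2}\Bigr]^{1/\rho},
\qquad \sigma_i = \frac{1}{1-\rho_i}.
```

Here each ρ_i sets the within-nest elasticity of substitution σ_i, and the skill premium is the ratio of the marginal products of H and L implied by this technology.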

LLMs cannot find reasoning errors, but can correct them!

  • paper_url: http://arxiv.org/abs/2311.08516
  • repo_url: https://github.com/whgtyen/big-bench-mistake
  • paper_authors: Gladys Tyen, Hassan Mansoor, Peter Chen, Tony Mak, Victor Cărbune
  • for: Self-correction has shown promise in improving LLM outputs in style and quality (e.g. Chen et al., 2023; Madaan et al., 2023), but attempts to self-correct logical or reasoning errors often turn correct answers into incorrect ones, hurting overall performance (Huang et al., 2023); this paper studies why, and what to do about it.
  • methods: Breaks the self-correction process into two core components: mistake finding and output correction. For mistake finding, releases BIG-Bench Mistake, a dataset of logical mistakes in chain-of-thought reasoning traces, and benchmarks several state-of-the-art LLMs, showing that LLMs generally struggle to find logical mistakes. For output correction, proposes a backtracking method, construed as a lightweight alternative to reinforcement learning methods.
  • results: Backtracking provides large improvements when given information on the mistake location, and it remains effective with a reward model at 60-70% accuracy.
    Abstract While self-correction has shown promise in improving LLM outputs in terms of style and quality (e.g. Chen et al., 2023; Madaan et al., 2023), recent attempts to self-correct logical or reasoning errors often cause correct answers to become incorrect, resulting in worse performances overall (Huang et al., 2023). In this paper, we break down the self-correction process into two core components: mistake finding and output correction. For mistake finding, we release BIG-Bench Mistake, a dataset of logical mistakes in Chain-of-Thought reasoning traces. We provide benchmark numbers for several state-of-the-art LLMs, and demonstrate that LLMs generally struggle with finding logical mistakes. For output correction, we propose a backtracking method which provides large improvements when given information on mistake location. We construe backtracking as a lightweight alternative to reinforcement learning methods, and show that it remains effective with a reward model at 60-70% accuracy.
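The backtracking idea can be sketched as truncating the reasoning trace at the first mistaken step and resampling from there. Here `regenerate` stands in for an LLM call and is not part of the paper's released code:

```python
def backtrack(trace: list[str], mistake_index: int, regenerate) -> list[str]:
    """Given the index of the first mistaken step in a chain-of-thought
    trace, keep everything before it and resample the rest. `regenerate`
    maps a partial trace to newly sampled continuation steps (e.g. at a
    higher temperature so the retry differs from the original)."""
    prefix = trace[:mistake_index]      # keep steps before the mistake
    return prefix + regenerate(prefix)  # resample from the mistake onward
```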

Alignment is not sufficient to prevent large language models from generating harmful information: A psychoanalytic perspective

  • paper_url: http://arxiv.org/abs/2311.08487
  • repo_url: None
  • paper_authors: Zi Yin, Wei Ding, Jia Liu
  • for: Investigates a significant risk of large language models (LLMs): generating harmful information and biases.
  • methods: Draws on Freud's psychoanalytic theory to trace the root of the problem to a fundamental conflict between the models' inherent desire for syntactic and semantic continuity, established during pre-training, and the post-training alignment with human values.
  • results: Experiments show that even advanced LLMs cannot fully prevent the generation of harmful information when simple techniques (incomplete sentences, negative priming, cognitive dissonance scenarios) intensify the desire for continuity.
    Abstract Large Language Models (LLMs) are central to a multitude of applications but struggle with significant risks, notably in generating harmful content and biases. Drawing an analogy to the human psyche's conflict between evolutionary survival instincts and societal norm adherence elucidated in Freud's psychoanalysis theory, we argue that LLMs suffer a similar fundamental conflict, arising between their inherent desire for syntactic and semantic continuity, established during the pre-training phase, and the post-training alignment with human values. This conflict renders LLMs vulnerable to adversarial attacks, wherein intensifying the models' desire for continuity can circumvent alignment efforts, resulting in the generation of harmful information. Through a series of experiments, we first validated the existence of the desire for continuity in LLMs, and further devised a straightforward yet powerful technique, such as incomplete sentences, negative priming, and cognitive dissonance scenarios, to demonstrate that even advanced LLMs struggle to prevent the generation of harmful information. In summary, our study uncovers the root of LLMs' vulnerabilities to adversarial attacks, hereby questioning the efficacy of solely relying on sophisticated alignment methods, and further advocates for a new training idea that integrates modal concepts alongside traditional amodal concepts, aiming to endow LLMs with a more nuanced understanding of real-world contexts and ethical considerations.

Surrogate Modeling for Computationally Expensive Simulations of Supernovae in High-Resolution Galaxy Simulations

  • paper_url: http://arxiv.org/abs/2311.08460
  • repo_url: None
  • paper_authors: Keiya Hirashima, Kana Moriwaki, Michiko S. Fujii, Yutaka Hirai, Takayuki R. Saitoh, Junichiro Makino, Shirley Ho
  • for: Studies how machine learning and Gibbs sampling can be used to model the effect of supernovae (SNe) on the surrounding gas.
  • methods: Combines machine learning and Gibbs sampling to predict how a SN affects the surrounding gas, comparing against low-resolution SN simulations.
  • results: The new method models SN feedback more faithfully than low-resolution simulations in the thermal energy and momentum distribution, and reduces the necessary computational cost to roughly 1 percent of directly resolving SN feedback.
    Abstract Some stars are known to explode at the end of their lives, called supernovae (SNe). SNe release a substantial amount of matter and energy to the interstellar medium, providing significant feedback to star formation and gas dynamics in a galaxy. While such feedback has a crucial role in galaxy formation and evolution, in simulations of galaxy formation it has only been implemented using simple {\it sub-grid models} instead of numerically solving the evolution of gas elements around SNe in detail due to a lack of resolution. We develop a method combining machine learning and Gibbs sampling to predict how a supernova (SN) affects the surrounding gas. The fidelity of our model in the thermal energy and momentum distribution outperforms the low-resolution SN simulations. Our method can replace the SN sub-grid models and help properly simulate un-resolved SN feedback in galaxy formation simulations. We find that employing our new approach reduces the necessary computational cost to $\sim$ 1 percent compared to directly resolving SN feedback.

Instant3D: Instant Text-to-3D Generation

  • paper_url: http://arxiv.org/abs/2311.08403
  • repo_url: None
  • paper_authors: Ming Li, Pan Zhou, Jia-Wei Liu, Jussi Keppo, Min Lin, Shuicheng Yan, Xiangyu Xu
  • for: Improving the efficiency of text-to-3D generation: a fast framework that can produce a 3D object for an unseen text prompt in under one second.
  • methods: Proposes a novel network architecture that directly constructs a 3D triplane from a text prompt; the core innovation is how text conditions are injected into the network effectively. Also proposes a tunable scaled-sigmoid activation to accelerate training convergence, and an adaptive Perp-Neg algorithm to address the Janus (multi-head) problem.
  • results: Extensive experiments on a wide variety of benchmark datasets show that the method compares favorably with state-of-the-art approaches both qualitatively and quantitatively, while generating high-quality 3D objects significantly more efficiently.
    Abstract Text-to-3D generation, which aims to synthesize vivid 3D objects from text prompts, has attracted much attention from the computer vision community. While several existing works have achieved impressive results for this task, they mainly rely on a time-consuming optimization paradigm. Specifically, these methods optimize a neural field from scratch for each text prompt, taking approximately one hour or more to generate one object. This heavy and repetitive training cost impedes their practical deployment. In this paper, we propose a novel framework for fast text-to-3D generation, dubbed Instant3D. Once trained, Instant3D is able to create a 3D object for an unseen text prompt in less than one second with a single run of a feedforward network. We achieve this remarkable speed by devising a new network that directly constructs a 3D triplane from a text prompt. The core innovation of our Instant3D lies in our exploration of strategies to effectively inject text conditions into the network. Furthermore, we propose a simple yet effective activation function, the scaled-sigmoid, to replace the original sigmoid function, which speeds up the training convergence by more than ten times. Finally, to address the Janus (multi-head) problem in 3D generation, we propose an adaptive Perp-Neg algorithm that can dynamically adjust its concept negation scales according to the severity of the Janus problem during training, effectively reducing the multi-head effect. Extensive experiments on a wide variety of benchmark datasets demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods both qualitatively and quantitatively, while achieving significantly better efficiency. The project page is at https://ming1993li.github.io/Instant3DProj.
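The abstract names a scaled sigmoid as the replacement for the ordinary sigmoid activation. One plausible parameterisation (the paper's exact form may differ) rescales the input by a learnable factor, adjusting the region where gradients stay non-vanishing:

```python
import torch
import torch.nn as nn

class ScaledSigmoid(nn.Module):
    def __init__(self, init_scale: float = 1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_scale))  # learnable

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # scale > 1 sharpens the sigmoid; scale < 1 flattens it, keeping
        # useful gradients alive further from zero.
        return torch.sigmoid(self.scale * x)
```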

Fine-tuning Language Models for Factuality

  • paper_url: http://arxiv.org/abs/2311.08401
  • repo_url: None
  • paper_authors: Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, Chelsea Finn
  • for: Improve the factual accuracy of large pre-trained language models (LLMs) without relying on human factuality labels.
  • methods: Fine-tunes LLMs using two recent NLP innovations: (1) judging factuality by measuring consistency with an external knowledge base or a large model's confidence scores, and (2) direct preference optimization over a preference ranking of possible model responses.
  • results: The approach significantly improves the factuality of LLMs on held-out topics, with 58% and 40% reductions in factual error rate when generating biographies and answering medical questions, respectively, compared to a baseline model.
    Abstract The fluency and creativity of large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines. Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations.' These errors can inadvertently spread misinformation or harmfully perpetuate misconceptions. Further, manual fact-checking of model responses is a time-consuming process, making human factuality labels expensive to acquire. In this work, we fine-tune language models to be more factual, without human labeling and targeting more open-ended generation settings than past work. We leverage two key recent innovations in NLP to do so. First, several recent works have proposed methods for judging the factuality of open-ended text by measuring consistency with an external knowledge base or simply a large model's confidence scores. Second, the direct preference optimization algorithm enables straightforward fine-tuning of language models on objectives other than supervised imitation, using a preference ranking over possible model responses. We show that learning from automatically generated factuality preference rankings, generated either through existing retrieval systems or our novel retrieval-free approach, significantly improves the factuality (percent of generated claims that are correct) of Llama-2 on held-out topics compared with RLHF or decoding strategies targeted at factuality. At 7B scale, compared to Llama-2-chat, we observe 58% and 40% reduction in factual error rate when generating biographies and answering medical questions, respectively.
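Direct preference optimization, the fine-tuning objective the paper builds on, trains the policy to prefer the more-factual response y_w over the less-factual y_l relative to a frozen reference model. A minimal PyTorch sketch of the loss (the factuality-ranking pipeline that produces the pairs is omitted):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each argument is the summed token log-probability of a response
    under the policy being tuned (logp_*) or the frozen reference (ref_*)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()  # push policy toward preferred responses
```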

Are Large Language Models Temporally Grounded?

  • paper_url: http://arxiv.org/abs/2311.08398
  • repo_url: https://github.com/yfqiu-nlp/temporal-llms
  • paper_authors: Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti, Shay B. Cohen
  • for: Examines whether large language models (LLMs) are temporally grounded.
  • methods: Provides LLMs with textual narratives and probes their common-sense knowledge of the structure and duration of events, their ability to order events along a timeline, and the self-consistency of their temporal model.
  • results: Current LLMs perform poorly on these tasks, lagging behind both human performance and small-scale specialised LMs; they struggle most with self-consistency, displaying incoherent behaviour in at least 27.23% of their predictions.
    Abstract Are Large language models (LLMs) temporally grounded? Since LLMs cannot perceive and interact with the environment, it is impossible to answer this question directly. Instead, we provide LLMs with textual narratives and probe them with respect to their common-sense knowledge of the structure and duration of events, their ability to order events along a timeline, and self-consistency within their temporal model (e.g., temporal relations such as after and before are mutually exclusive for any pair of events). We evaluate state-of-the-art LLMs (such as LLaMA 2 and GPT-4) on three tasks reflecting these abilities. Generally, we find that LLMs lag significantly behind both human performance as well as small-scale, specialised LMs. In-context learning, instruction tuning, and chain-of-thought prompting reduce this gap only to a limited degree. Crucially, LLMs struggle the most with self-consistency, displaying incoherent behaviour in at least 27.23% of their predictions. Contrary to expectations, we also find that scaling the model size does not guarantee positive gains in performance. To explain these results, we study the sources from which LLMs may gather temporal information: we find that sentence ordering in unlabelled texts, available during pre-training, is only weakly correlated with event ordering. Moreover, public instruction tuning mixtures contain few temporal tasks. Hence, we conclude that current LLMs lack a consistent temporal model of textual narratives. Code, datasets, and LLM outputs are available at https://github.com/yfqiu-nlp/temporal-llms.
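One self-consistency probe from the abstract, that "after" and "before" are mutually exclusive for any pair of events, can be sketched as follows, with `ask` a hypothetical yes/no query to the model under test:

```python
def is_consistent(ask, event_a: str, event_b: str) -> bool:
    # A temporally grounded model should never affirm both orderings.
    says_after = ask(f"Did '{event_a}' happen after '{event_b}'?")
    says_before = ask(f"Did '{event_a}' happen before '{event_b}'?")
    return not (says_after and says_before)  # both true => incoherent
```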

Zero-shot audio captioning with audio-language model guidance and audio context keywords

  • paper_url: http://arxiv.org/abs/2311.08396
  • repo_url: https://github.com/explainableml/zeraucap
  • paper_authors: Leonard Salewski, Stefan Fauth, A. Sophia Koepke, Zeynep Akata
  • for: Zero-shot audio captioning: automatically generating descriptive textual captions for audio content without task-specific training.
  • methods: Proposes a framework in which a pre-trained audio-language model guides a large language model (LLM) to produce captions that describe the audio content, while audio context keywords prompt the language model towards text broadly relevant to the sounds.
  • results: The proposed framework achieves state-of-the-art zero-shot audio captioning results on the AudioCaps and Clotho datasets. Code is available at https://github.com/ExplainableML/ZerAuCap.
    Abstract Zero-shot audio captioning aims at automatically generating descriptive textual captions for audio content without prior training for this task. Different from speech recognition which translates audio content that contains spoken language into text, audio captioning is commonly concerned with ambient sounds, or sounds produced by a human performing an action. Inspired by zero-shot image captioning methods, we propose ZerAuCap, a novel framework for summarising such general audio signals in a text caption without requiring task-specific training. In particular, our framework exploits a pre-trained large language model (LLM) for generating the text which is guided by a pre-trained audio-language model to produce captions that describe the audio content. Additionally, we use audio context keywords that prompt the language model to generate text that is broadly relevant to sounds. Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets. Our code is available at https://github.com/ExplainableML/ZerAuCap.

MVSA-Net: Multi-View State-Action Recognition for Robust and Deployable Trajectory Generation

  • paper_url: http://arxiv.org/abs/2311.08393
  • repo_url: None
  • paper_authors: Ehsan Asali, Prashant Doshi, Jin Sun
  • for: Proposes a multi-view SA-Net model that recognizes task states and actions more reliably from multi-view visual data.
  • methods: The model perceives multiple synchronized viewpoints of the task activity and fuses them to improve the accuracy of state and action recognition in each frame.
  • results: Experiments on two distinct domains show that, compared with the single-view model and other baselines, MVSA-Net recognizes state-action pairs more accurately under occlusion.
    Abstract The learn-from-observation (LfO) paradigm is a human-inspired mode for a robot to learn to perform a task simply by watching it being performed. LfO can facilitate robot integration on factory floors by minimizing disruption and reducing tedious programming. A key component of the LfO pipeline is a transformation of the depth camera frames to the corresponding task state and action pairs, which are then relayed to learning techniques such as imitation or inverse reinforcement learning for understanding the task parameters. While several existing computer vision models analyze videos for activity recognition, SA-Net specifically targets robotic LfO from RGB-D data. However, SA-Net and many other models analyze frame data captured from a single viewpoint. Their analysis is therefore highly sensitive to occlusions of the observed task, which are frequent in deployments. An obvious way of reducing occlusions is to simultaneously observe the task from multiple viewpoints and synchronously fuse the multiple streams in the model. Toward this, we present multi-view SA-Net, which generalizes the SA-Net model to allow the perception of multiple viewpoints of the task activity, integrate them, and better recognize the state and action in each frame. Performance evaluations on two distinct domains establish that MVSA-Net recognizes the state-action pairs under occlusion more accurately compared to single-view MVSA-Net and other baselines. Our ablation studies further evaluate its performance under different ambient conditions and establish the contribution of the architecture components. As such, MVSA-Net offers a significantly more robust and deployable state-action trajectory generation compared to previous methods.

TSST: A Benchmark and Evaluation Models for Text Speech-Style Transfer

  • paper_url: http://arxiv.org/abs/2311.08389
  • repo_url: None
  • paper_authors: Huashan Sun, Yixiao Wu, Yinghao Li, Jiawei Li, Yizhe Yang, Yang Gao
  • for: Explores topics related to human cognition, such as personality and emotion, building on the capabilities of existing large language models (LLMs).
  • methods: Introduces a new task, Text Speech-Style Transfer (TSST), analysed from the perspectives of linguistics and cognitive science, and trains multi-dimension (filler words, vividness, interactivity, emotionality) evaluation models whose scores are validated against human assessments.
  • results: Thoroughly analyses the performance of several LLMs, identifies areas needing further improvement, and releases a new corpus that improves LLMs' ability to generate text with speech-style characteristics.
    Abstract Text style is highly abstract, as it encompasses various aspects of a speaker's characteristics, habits, logical thinking, and the content they express. However, previous text-style transfer tasks have primarily focused on data-driven approaches, lacking in-depth analysis and research from the perspectives of linguistics and cognitive science. In this paper, we introduce a novel task called Text Speech-Style Transfer (TSST). The main objective is to further explore topics related to human cognition, such as personality and emotion, based on the capabilities of existing LLMs. Considering the objective of our task and the distinctive characteristics of oral speech in real-life scenarios, we trained multi-dimension (i.e. filler words, vividness, interactivity, emotionality) evaluation models for the TSST and validated their correlation with human assessments. We thoroughly analyze the performance of several large language models (LLMs) and identify areas where further improvement is needed. Moreover, driven by our evaluation models, we have released a new corpus that improves the capabilities of LLMs in generating text with speech-style characteristics. In summary, we present the TSST task, a new benchmark for style transfer and emphasizing human-oriented evaluation, exploring and advancing the performance of current LLMs.

Offline Data Enhanced On-Policy Policy Gradient with Provable Guarantees

  • paper_url: http://arxiv.org/abs/2311.08384
  • repo_url: https://github.com/yifeizhou02/hnpg
  • paper_authors: Yifei Zhou, Ayush Sekhari, Yuda Song, Wen Sun
  • for: Proposes a new hybrid RL algorithm that combines online interaction data with offline data.
  • methods: Integrates a procedure of off-policy training on the offline data into an on-policy natural policy gradient (actor-critic) framework.
  • results: In theory the algorithm achieves a best-of-both-worlds guarantee; empirically it outperforms a state-of-the-art hybrid RL baseline that relies only on off-policy policy optimization in challenging rich-observation environments.
    Abstract Hybrid RL is the setting where an RL agent has access to both offline data and online data by interacting with the real-world environment. In this work, we propose a new hybrid RL algorithm that combines an on-policy actor-critic method with offline data. On-policy methods such as policy gradient and natural policy gradient (NPG) have shown to be more robust to model misspecification, though sometimes it may not be as sample efficient as methods that rely on off-policy learning. On the other hand, offline methods that depend on off-policy training often require strong assumptions in theory and are less stable to train in practice. Our new approach integrates a procedure of off-policy training on the offline data into an on-policy NPG framework. We show that our approach, in theory, can obtain a best-of-both-worlds type of result -- it achieves the state-of-art theoretical guarantees of offline RL when offline RL-specific assumptions hold, while at the same time maintaining the theoretical guarantees of on-policy NPG regardless of the offline RL assumptions' validity. Experimentally, in challenging rich-observation environments, we show that our approach outperforms a state-of-the-art hybrid RL baseline which only relies on off-policy policy optimization, demonstrating the empirical benefit of combining on-policy and off-policy learning. Our code is publicly available at https://github.com/YifeiZhou02/HNPG.

Scheming AIs: Will AIs fake alignment during training in order to get power?

  • paper_url: http://arxiv.org/abs/2311.08379
  • repo_url: None
  • paper_authors: Joe Carlsmith
  • for: Examines whether advanced AIs that perform well in training will do so in order to gain power later, a behavior called "scheming", and concludes this is a disturbingly plausible outcome (subjective probability roughly 25% under the stated conditions).
  • methods: Analyses what baseline machine learning methods for training goal-directed AIs imply about the emergence of scheming, noting that if performing well in training is a good strategy for gaining power, a very wide variety of goals would motivate scheming, making it hard to tell whether it has occurred.
  • results: Identifies reasons for comfort as well: scheming may not actually be a good strategy for gaining power, various selection pressures in training might work against schemer-like goals (schemers need extra instrumental reasoning, which might hurt training performance), and such pressures could be increased intentionally. The report also suggests an array of empirical research directions for probing the topic further.
    Abstract This report examines whether advanced AIs that perform well in training will be doing so in order to gain power later -- a behavior I call "scheming" (also sometimes called "deceptive alignment"). I conclude that scheming is a disturbingly plausible outcome of using baseline machine learning methods to train goal-directed AIs sophisticated enough to scheme (my subjective probability on such an outcome, given these conditions, is roughly 25%). In particular: if performing well in training is a good strategy for gaining power (as I think it might well be), then a very wide variety of goals would motivate scheming -- and hence, good training performance. This makes it plausible that training might either land on such a goal naturally and then reinforce it, or actively push a model's motivations towards such a goal as an easy way of improving performance. What's more, because schemers pretend to be aligned on tests designed to reveal their motivations, it may be quite difficult to tell whether this has occurred. However, I also think there are reasons for comfort. In particular: scheming may not actually be such a good strategy for gaining power; various selection pressures in training might work against schemer-like goals (for example, relative to non-schemers, schemers need to engage in extra instrumental reasoning, which might harm their training performance); and we may be able to increase such pressures intentionally. The report discusses these and a wide variety of other considerations in detail, and it suggests an array of empirical research directions for probing the topic further.

Learning to Filter Context for Retrieval-Augmented Generation

  • paper_url: http://arxiv.org/abs/2311.08377
  • repo_url: https://github.com/zorazrw/filco
  • paper_authors: Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, Graham Neubig
  • for: Improving the reliability of retrieval-augmented systems for knowledge-intensive tasks such as open-domain question answering and fact verification.
  • methods: Identifies useful context using lexical and information-theoretic approaches, and trains context filtering models that can filter retrieved contexts at test time.
  • results: On six knowledge-intensive tasks with FLAN-T5 and LLaMa2, FILCO outperforms existing approaches on extractive question answering (QA), complex multi-hop and long-form QA, fact verification, and dialog generation. FILCO effectively improves the quality of context, whether or not it supports the canonical output.
    Abstract On-the-fly retrieval of relevant knowledge has proven an essential element of reliable systems for tasks such as open-domain question answering and fact verification. However, because retrieval systems are not perfect, generation models are required to generate outputs given partially or entirely irrelevant passages. This can cause over- or under-reliance on context, and result in problems in the generated output such as hallucinations. To alleviate these problems, we propose FILCO, a method that improves the quality of the context provided to the generator by (1) identifying useful context based on lexical and information-theoretic approaches, and (2) training context filtering models that can filter retrieved contexts at test time. We experiment on six knowledge-intensive tasks with FLAN-T5 and LLaMa2, and demonstrate that our method outperforms existing approaches on extractive question answering (QA), complex multi-hop and long-form QA, fact verification, and dialog generation tasks. FILCO effectively improves the quality of context, whether or not it supports the canonical output.
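The lexical side of the filtering can be sketched as scoring each retrieved sentence by unigram overlap with the query and keeping only the top-scoring ones before they reach the generator. FILCO also uses information-theoretic scores and trains a filtering model; this shows the simplest lexical variant only:

```python
def lexical_filter(query: str, sentences: list[str], keep: int = 3) -> list[str]:
    q_tokens = set(query.lower().split())

    def overlap(sent: str) -> float:
        # Fraction of the sentence's unigrams that also appear in the query.
        s_tokens = set(sent.lower().split())
        return len(q_tokens & s_tokens) / max(len(s_tokens), 1)

    # Keep only the sentences most lexically relevant to the query.
    return sorted(sentences, key=overlap, reverse=True)[:keep]
```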

Plum: Prompt Learning using Metaheuristic

  • paper_url: http://arxiv.org/abs/2311.08364
  • repo_url: https://github.com/research4pan/plum
  • paper_authors: Rui Pan, Shuo Xing, Shizhe Diao, Xiang Liu, Kashun Shum, Jipeng Zhang, Tong Zhang
  • for: A general method for optimizing and customizing prompts for large language models.
  • methods: Applies metaheuristics, a branch of discrete non-convex optimization, testing six typical methods: hill climbing, simulated annealing, genetic algorithms with and without crossover, tabu search, and harmony search.
  • results: Demonstrates effectiveness on black-box prompt learning and Chain-of-Thought prompt tuning, and discovers previously unknown, more human-understandable prompts, opening up further possibilities for prompt optimization.
    Abstract Since the emergence of large language models, prompt learning has become a popular method for optimizing and customizing these models. Special prompts, such as Chain-of-Thought, have even revealed previously unknown reasoning capabilities within these models. However, the progress of discovering effective prompts has been slow, driving a desire for general prompt optimization methods. Unfortunately, few existing prompt learning methods satisfy the criteria of being truly "general", i.e., automatic, discrete, black-box, gradient-free, and interpretable all at once. In this paper, we introduce metaheuristics, a branch of discrete non-convex optimization methods with over 100 options, as a promising approach to prompt learning. Within our paradigm, we test six typical methods: hill climbing, simulated annealing, genetic algorithms with/without crossover, tabu search, and harmony search, demonstrating their effectiveness in black-box prompt learning and Chain-of-Thought prompt tuning. Furthermore, we show that these methods can be used to discover more human-understandable prompts that were previously unknown, opening the door to a cornucopia of possibilities in prompt optimization. We release all the codes in \url{https://github.com/research4pan/Plum}.
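
To give a flavor of how a metaheuristic searches a discrete prompt space, the sketch below runs plain hill climbing, the simplest of the six methods tested. The `neighbors_fn` (e.g., word swaps or paraphrases) and `score_fn` (e.g., dev-set accuracy from a black-box LLM call) are hypothetical interfaces, not the released library's API; simulated annealing differs only in occasionally accepting worse candidates with a temperature-dependent probability.

```python
import random

def hill_climb_prompt(seed_prompt, neighbors_fn, score_fn, steps=50):
    """Greedy hill climbing over a discrete prompt space: accept a neighboring
    prompt only when it improves the black-box score. Assumes neighbors_fn
    always returns a non-empty list of candidate prompts."""
    best, best_score = seed_prompt, score_fn(seed_prompt)
    for _ in range(steps):
        candidate = random.choice(neighbors_fn(best))
        cand_score = score_fn(candidate)
        if cand_score > best_score:
            best, best_score = candidate, cand_score
    return best, best_score
```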

The Transient Nature of Emergent In-Context Learning in Transformers

  • paper_url: http://arxiv.org/abs/2311.08360
  • repo_url: None
  • paper_authors: Aaditya K. Singh, Stephanie C. Y. Chan, Ted Moskovitz, Erin Grant, Andrew M. Saxe, Felix Hill
  • for: Shows that Transformer networks can exhibit in-context learning (ICL) despite not being explicitly trained for it, and that the emergence of ICL during training is often transient rather than persistent.
  • methods: Trains Transformers on synthetic data designed so that both ICL and in-weights learning (IWL) strategies can lead to correct predictions, and analyzes the emergence of ICL through the distributional properties of the training data and mechanistic interpretability.
  • results: ICL first emerges, then disappears and gives way to IWL while the training loss keeps decreasing, indicating an asymptotic preference for IWL; L2 regularization may offer a path to more persistent ICL that removes the need for early stopping based on ICL-style validation tasks; initial evidence suggests that ICL transience is caused by competition between ICL and IWL circuits.
    Abstract Transformer neural networks can exhibit a surprising capacity for in-context learning (ICL) despite not being explicitly trained for it. Prior work has provided a deeper understanding of how ICL emerges in transformers, e.g. through the lens of mechanistic interpretability, Bayesian inference, or by examining the distributional properties of training data. However, in each of these cases, ICL is treated largely as a persistent phenomenon; namely, once ICL emerges, it is assumed to persist asymptotically. Here, we show that the emergence of ICL during transformer training is, in fact, often transient. We train transformers on synthetic data designed so that both ICL and in-weights learning (IWL) strategies can lead to correct predictions. We find that ICL first emerges, then disappears and gives way to IWL, all while the training loss decreases, indicating an asymptotic preference for IWL. The transient nature of ICL is observed in transformers across a range of model sizes and datasets, raising the question of how much to "overtrain" transformers when seeking compact, cheaper-to-run models. We find that L2 regularization may offer a path to more persistent ICL that removes the need for early stopping based on ICL-style validation tasks. Finally, we present initial evidence that ICL transience may be caused by competition between ICL and IWL circuits.

Exploring Variational Auto-Encoder Architectures, Configurations, and Datasets for Generative Music Explainable AI

  • paper_url: http://arxiv.org/abs/2311.08336
  • repo_url: https://github.com/bbanar2/exploring_xai_in_genmus_via_lsr
  • paper_authors: Nick Bryan-Kinns, Bingyuan Zhang, Songyan Zhao, Berker Banar
  • for: Examining how combinations of Variational Auto-Encoder models (MeasureVAE and AdversarialVAE), latent space configurations (from 4 to 256 dimensions), and training datasets (Irish folk, Turkish folk, classical, and pop music) affect music generation performance.
  • methods: A systematic comparison of these combinations when 2 or 4 semantically meaningful musical attributes are imposed on the generative model.
  • results: MeasureVAE has better reconstruction performance, while AdversarialVAE achieves better musical attribute independence; MeasureVAE generates music across genres with interpretable dimensions of control and performs best on low-complexity music such as pop and rock; a 32- or 64-dimensional latent space is recommended when using 4 regularized dimensions.
    Abstract Generative AI models for music and the arts in general are increasingly complex and hard to understand. The field of eXplainable AI (XAI) seeks to make complex and opaque AI models such as neural networks more understandable to people. One approach to making generative AI models more understandable is to impose a small number of semantically meaningful attributes on generative AI models. This paper contributes a systematic examination of the impact that different combinations of Variational Auto-Encoder models (MeasureVAE and AdversarialVAE), configurations of latent space in the AI model (from 4 to 256 latent dimensions), and training datasets (Irish folk, Turkish folk, Classical, and pop) have on music generation performance when 2 or 4 meaningful musical attributes are imposed on the generative model. To date there have been no systematic comparisons of such models at this level of combinatorial detail. Our findings show that MeasureVAE has better reconstruction performance than AdversarialVAE which has better musical attribute independence. Results demonstrate that MeasureVAE was able to generate music across music genres with interpretable musical dimensions of control, and performs best with low complexity music such a pop and rock. We recommend that a 32 or 64 latent dimensional space is optimal for 4 regularised dimensions when using MeasureVAE to generate music across genres. Our results are the first detailed comparisons of configurations of state-of-the-art generative AI models for music and can be used to help select and configure AI models, musical features, and datasets for more understandable generation of music.

Anti-LM Decoding for Zero-shot In-context Machine Translation

  • paper_url: http://arxiv.org/abs/2311.08324
  • repo_url: None
  • paper_authors: Suzanna Sia, Alexandra DeLucia, Kevin Duh
  • for: Addressing the poor calibration of pre-trained large language models on zero-shot in-context machine translation.
  • methods: An Anti-Language Model objective with a decay factor, designed to address the weaknesses of in-context machine translation.
  • results: Across 3 model types and sizes, 3 language directions, and both greedy decoding and beam search ($B=5$), the proposed method outperforms other state-of-the-art decoding objectives, with up to a $20$ BLEU point improvement over the default objective in some settings.
    Abstract Zero-shot In-context learning is the phenomenon where models can perform the task simply given the instructions. However, pre-trained large language models are known to be poorly calibrated for this task. One of the most effective approaches to handling this bias is to adopt a contrastive decoding objective, which accounts for the prior probability of generating the next token by conditioning on some context. This work introduces an Anti-Language Model objective with a decay factor designed to address the weaknesses of In-context Machine Translation. We conduct our experiments across 3 model types and sizes, 3 language directions, and for both greedy decoding and beam search ($B=5$). The proposed method outperforms other state-of-art decoding objectives, with up to $20$ BLEU point improvement from the default objective observed in some settings.
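
A hedged sketch of the decoding adjustment: subtract a decayed, context-free language-model score from the conditional score at each step. Treating the anti-LM term as a second forward pass without the source sentence, and the exact exponential decay, are assumptions; the paper's precise formulation may differ.

```python
import torch

def anti_lm_scores(cond_logits: torch.Tensor,
                   uncond_logits: torch.Tensor,
                   step: int, alpha: float = 0.5, gamma: float = 0.9) -> torch.Tensor:
    """Contrastive next-token scores: subtract a decayed language-model prior
    (logits computed without the source sentence) from the conditional logits,
    discouraging tokens the model would emit regardless of the input."""
    return cond_logits - alpha * (gamma ** step) * uncond_logits
```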

Extrinsically-Focused Evaluation of Omissions in Medical Summarization

  • paper_url: http://arxiv.org/abs/2311.08303
  • repo_url: None
  • paper_authors: Elliot Schumacher, Daniel Rosenthal, Varun Nair, Luladay Price, Geoffrey Tso, Anitha Kannan
  • for: Developing MED-OMIT, a new omission benchmark for evaluating natural language processing (NLP) models on medical summarization.
  • methods: Given a doctor-patient conversation and a generated summary, MED-OMIT decomposes the chat into a set of facts and identifies which are omitted from the summary; fact importance is determined by simulating the impact of each fact on a downstream clinical task, differential diagnosis (DDx) generation, using LLM prompt-based approaches that cluster facts as supporting or negating evidence.
  • results: On a publicly released dataset of patient-doctor conversations, MED-OMIT captures omissions better than alternative metrics.
    Abstract The goal of automated summarization techniques (Paice, 1990; Kupiec et al, 1995) is to condense text by focusing on the most critical information. Generative large language models (LLMs) have shown to be robust summarizers, yet traditional metrics struggle to capture resulting performance (Goyal et al, 2022) in more powerful LLMs. In safety-critical domains such as medicine, more rigorous evaluation is required, especially given the potential for LLMs to omit important information in the resulting summary. We propose MED-OMIT, a new omission benchmark for medical summarization. Given a doctor-patient conversation and a generated summary, MED-OMIT categorizes the chat into a set of facts and identifies which are omitted from the summary. We further propose to determine fact importance by simulating the impact of each fact on a downstream clinical task: differential diagnosis (DDx) generation. MED-OMIT leverages LLM prompt-based approaches which categorize the importance of facts and cluster them as supporting or negating evidence to the diagnosis. We evaluate MED-OMIT on a publicly-released dataset of patient-doctor conversations and find that MED-OMIT captures omissions better than alternative metrics.

Workflow-Guided Response Generation for Task-Oriented Dialogue

  • paper_url: http://arxiv.org/abs/2311.08300
  • repo_url: None
  • paper_authors: Do June Min, Paloma Sodhi, Ramya Ramakrishnan
  • for: Proposing a reinforcement learning framework for generating dialogue responses that follow a specified workflow in task-oriented dialogue.
  • methods: The framework combines ComplianceScorer, a metric that evaluates how well a generated response executes the specified action, with an RL optimization process that uses an interactive sampling technique.
  • results: On two TOD datasets, the Action-Based Conversations Dataset and MultiWOZ 2.2, the framework outperforms baselines on both automated and human evaluation metrics, generating responses that comply with the intended workflows while remaining natural and fluent.
    Abstract Task-oriented dialogue (TOD) systems aim to achieve specific goals through interactive dialogue. Such tasks usually involve following specific workflows, i.e. executing a sequence of actions in a particular order. While prior work has focused on supervised learning methods to condition on past actions, they do not explicitly optimize for compliance to a desired workflow. In this paper, we propose a novel framework based on reinforcement learning (RL) to generate dialogue responses that are aligned with a given workflow. Our framework consists of ComplianceScorer, a metric designed to evaluate how well a generated response executes the specified action, combined with an RL optimization process that utilizes an interactive sampling technique. We evaluate our approach on two TOD datasets, Action-Based Conversations Dataset (ABCD) (Chen et al., 2021a) and MultiWOZ 2.2 (Zang et al., 2020) on a range of automated and human evaluation metrics. Our findings indicate that our RL-based framework outperforms baselines and is effective at generating responses that both comply with the intended workflows while being expressed in a natural and fluent manner.
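
The optimization can be pictured as a policy-gradient loop in which the compliance metric acts as the reward. The sketch below uses plain REINFORCE over a toy categorical policy; the paper's interactive sampling and the internals of ComplianceScorer are abstracted behind hypothetical callables.

```python
import torch

def reinforce_step(policy, state, compliance_scorer, optimizer):
    """One REINFORCE update: sample a token sequence from the policy, score its
    workflow compliance, and reinforce the sequence log-probability by the reward."""
    logits = policy(state)                                  # (seq_len, vocab)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()                                 # sampled response tokens
    reward = compliance_scorer(actions)                     # scalar, e.g. in [0, 1]
    loss = -(reward * dist.log_prob(actions).sum())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```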

VERVE: Template-based ReflectiVE Rewriting for MotiVational IntErviewing

  • paper_url: http://arxiv.org/abs/2311.08299
  • repo_url: None
  • paper_authors: Do June Min, Verónica Pérez-Rosas, Kenneth Resnicow, Rada Mihalcea
  • for: Supporting reflective listening, a fundamental skill counselors must acquire to achieve proficiency in motivational interviewing (MI).
  • methods: Introduces a counseling response rewriting task that transforms non-reflective statements into reflective responses, using VERVE, a template-based system with paraphrase-augmented training and adaptive template updating; VERVE first creates a template by filtering out tokens irrelevant to reflections, then constructs a reflective response from the template.
  • results: On both automatic and human evaluations, the framework outperforms text rewriting baselines at turning non-reflective statements into reflective responses while achieving a good content preservation-reflection style trade-off.
    Abstract Reflective listening is a fundamental skill that counselors must acquire to achieve proficiency in motivational interviewing (MI). It involves responding in a manner that acknowledges and explores the meaning of what the client has expressed in the conversation. In this work, we introduce the task of counseling response rewriting, which transforms non-reflective statements into reflective responses. We introduce VERVE, a template-based rewriting system with paraphrase-augmented training and adaptive template updating. VERVE first creates a template by identifying and filtering out tokens that are not relevant to reflections and constructs a reflective response using the template. Paraphrase-augmented training allows the model to learn less-strict fillings of masked spans, and adaptive template updating helps discover effective templates for rewriting without significantly removing the original content. Using both automatic and human evaluations, we compare our method against text rewriting baselines and show that our framework is effective in turning non-reflective statements into more reflective responses while achieving a good content preservation-reflection style trade-off.
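
The template step can be pictured as a token filter: tokens judged irrelevant to a reflective style are masked, and the masked slots are later filled when constructing the reflective response. The per-token `relevance` scorer below is a toy stand-in for the paper's learned filtering.

```python
def make_template(statement, relevance):
    """Template construction in the VERVE spirit: mask tokens judged irrelevant
    to a reflective style, keeping content words as anchors for rewriting.
    `relevance` is a hypothetical per-token scorer returning values in [0, 1]."""
    return " ".join(tok if relevance(tok) >= 0.5 else "[MASK]"
                    for tok in statement.split())

# Toy relevance scorer: keep longer content words, mask the rest.
template = make_template("you just do not want to quit smoking",
                         relevance=lambda t: float(len(t) > 3))
print(template)  # "[MASK] just [MASK] [MASK] want [MASK] quit smoking"
```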

A Survey of Language Model Confidence Estimation and Calibration

  • paper_url: http://arxiv.org/abs/2311.08298
  • repo_url: None
  • paper_authors: Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, Iryna Gurevych
  • for: This paper aims to provide a comprehensive overview of research on assessing the confidence of language models (LMs) and calibrating their predictions to improve AI safety.
  • methods: The paper discusses various methods and techniques for estimating the confidence of LMs, including different LMs and various tasks.
  • results: The paper outlines the challenges of estimating the confidence of large language models and suggests some promising directions for future work.
    Abstract Language models (LMs) have demonstrated remarkable capabilities across a wide range of tasks in various domains. Despite their impressive performance, the reliability of their output is concerning and questionable regarding the demand for AI safety. Assessing the confidence of LM predictions and calibrating them across different tasks with the aim to align LM confidence with accuracy can help mitigate risks and enable LMs to make better decisions. There have been various works in this respect, but there has been no comprehensive overview of this important research area. The present survey aims to bridge this gap. In particular, we discuss methods and techniques for LM confidence estimation and calibration, encompassing different LMs and various tasks. We further outline the challenges of estimating the confidence for large language models and we suggest some promising directions for future work.
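
As one concrete instance of the post-hoc calibration techniques such surveys cover, temperature scaling fits a single scalar on held-out data and divides the logits by it. The sketch assumes classification-style logits and integer labels as tensors.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor,
                    steps: int = 200, lr: float = 0.01) -> float:
    """Post-hoc temperature scaling: find T > 0 minimizing NLL on a held-out
    set, then report softmax(logits / T) as calibrated confidences."""
    log_t = torch.zeros(1, requires_grad=True)   # T = exp(log_t) stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return log_t.exp().item()
```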

On The Relationship Between Universal Adversarial Attacks And Sparse Representations

  • paper_url: http://arxiv.org/abs/2311.08265
  • repo_url: https://github.com/danawr/adversarial_attacks_and_sparse_representations
  • paper_authors: Dana Weitzner, Raja Giryes
  • for: Explaining the sensitivity of neural networks to small, barely perceivable adversarial perturbations through the framework of sparsity.
  • methods: Connects adversarial attacks to sparse representations, showing that sparse coding algorithms, including the learned iterative shrinkage thresholding algorithm (LISTA), suffer from this sensitivity, and that common attacks on neural networks can be expressed as attacks on the sparse representation of the input image.
  • results: The sensitivity persists even when the network is agnostic to the sparse representation and dictionary, offering a possible explanation for the universality and transferability of adversarial attacks.
    Abstract The prominent success of neural networks, mainly in computer vision tasks, is increasingly shadowed by their sensitivity to small, barely perceivable adversarial perturbations in image input. In this work, we aim at explaining this vulnerability through the framework of sparsity. We show the connection between adversarial attacks and sparse representations, with a focus on explaining the universality and transferability of adversarial examples in neural networks. To this end, we show that sparse coding algorithms, and the neural network-based learned iterative shrinkage thresholding algorithm (LISTA) among them, suffer from this sensitivity, and that common attacks on neural networks can be expressed as attacks on the sparse representation of the input image. The phenomenon that we observe holds true also when the network is agnostic to the sparse representation and dictionary, and thus can provide a possible explanation for the universality and transferability of adversarial attacks. The code is available at https://github.com/danawr/adversarial_attacks_and_sparse_representations.
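
For readers unfamiliar with the sparse-coding machinery the paper builds on, the iterations below solve the lasso-style problem min_z 0.5*||x - Dz||^2 + lam*||z||_1; LISTA is obtained by unrolling such iterations into network layers with learned weights. The fixed step size is a simplification and should satisfy lr <= 1/||D^T D|| for convergence.

```python
import numpy as np

def ista(x, D, lam=0.1, lr=0.1, steps=100):
    """Iterative shrinkage-thresholding (ISTA) for sparse coding:
    a gradient step on the data-fit term followed by soft thresholding."""
    z = np.zeros(D.shape[1])
    for _ in range(steps):
        z = z - lr * D.T @ (D @ z - x)                          # gradient step
        z = np.sign(z) * np.maximum(np.abs(z) - lr * lam, 0.0)  # soft threshold
    return z
```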

REST: Retrieval-Based Speculative Decoding

  • paper_url: http://arxiv.org/abs/2311.08252
  • repo_url: https://github.com/fasterdecoding/rest
  • paper_authors: Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D Lee, Di He
  • for: Speeding up language model generation.
  • methods: Uses retrieval to produce draft tokens, drawing relevant tokens from a reservoir of existing text based on the current context; plug-and-play, requiring no additional training.
  • results: In a single-batch setting, REST accelerates 7B and 13B language models by 1.62X to 2.36X on code or text generation.
    Abstract We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation. The key insight driving the development of REST is the observation that the process of text generation often includes certain common phases and patterns. Unlike previous methods that rely on a draft language model for speculative decoding, REST harnesses the power of retrieval to generate draft tokens. This method draws from the reservoir of existing knowledge, retrieving and employing relevant tokens based on the current context. Its plug-and-play nature allows for seamless integration and acceleration of any language models, all without necessitating additional training. When benchmarked on 7B and 13B language models in a single-batch setting, REST achieves a significant speedup of 1.62X to 2.36X on code or text generation. The code of REST is available at https://github.com/FasterDecoding/REST.
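
The drafting step can be sketched as exact suffix matching against a token datastore; the target model then verifies the proposed continuation in parallel, as in other speculative decoding schemes. The linear scan below is purely illustrative; REST builds an efficient retrieval index over the datastore.

```python
def retrieve_draft(context, datastore, suffix_len=4, draft_len=8):
    """Propose draft tokens by matching the current context suffix in a corpus
    of previously seen token sequences (both given as token lists)."""
    suffix = tuple(context[-suffix_len:])
    for i in range(len(datastore) - suffix_len + 1):
        if tuple(datastore[i:i + suffix_len]) == suffix:
            start = i + suffix_len
            return datastore[start:start + draft_len]
    return []  # no match: fall back to ordinary autoregressive decoding
```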

Investigating the Encoding of Words in BERT’s Neurons using Feature Textualization

  • paper_url: http://arxiv.org/abs/2311.08240
  • repo_url: None
  • paper_authors: Tanja Baeumel, Soniya Vijayakumar, Josef van Genabith, Guenter Neumann, Simon Ostermann
  • for: This paper aims to provide a better understanding of the knowledge encoded in individual neurons of pre-trained language models (PLMs), specifically in the BERT model.
  • methods: The paper proposes a technique called feature textualization to produce dense representations of neurons in the PLM word embedding space, and applies this technique to the BERT model to investigate the knowledge encoded in individual neurons.
  • results: The paper finds that the produced representations can provide insights about the knowledge encoded in individual neurons, but that individual neurons do not represent clearcut symbolic units of language such as words. Additionally, the paper investigates how many neurons are needed to encode words in BERT.
    Abstract Pretrained language models (PLMs) form the basis of most state-of-the-art NLP technologies. Nevertheless, they are essentially black boxes: Humans do not have a clear understanding of what knowledge is encoded in different parts of the models, especially in individual neurons. The situation is different in computer vision, where feature visualization provides a decompositional interpretability technique for neurons of vision models. Activation maximization is used to synthesize inherently interpretable visual representations of the information encoded in individual neurons. Our work is inspired by this but presents a cautionary tale on the interpretability of single neurons, based on the first large-scale attempt to adapt activation maximization to NLP, and, more specifically, large PLMs. We propose feature textualization, a technique to produce dense representations of neurons in the PLM word embedding space. We apply feature textualization to the BERT model (Devlin et al., 2019) to investigate whether the knowledge encoded in individual neurons can be interpreted and symbolized. We find that the produced representations can provide insights about the knowledge encoded in individual neurons, but that individual neurons do not represent clearcut symbolic units of language such as words. Additionally, we use feature textualization to investigate how many neurons are needed to encode words in BERT.

Learning Physics-Inspired Regularization for Medical Image Registration with Hypernetworks

  • paper_url: http://arxiv.org/abs/2311.08239
  • repo_url: https://github.com/annareithmeir/elastic-regularization-hypermorph
  • paper_authors: Anna Reithmeir, Julia A. Schnabel, Veronika A. Zimmer
  • for: Medical image registration, a foundation of image-based diagnostics and therapy.
  • methods: Physics-inspired regularization, in particular the linear elastic regularizer that models the deformation of elastic material, combined with a hypernetwork (adapting the HyperMorph framework) that learns the effect of the two elasticity parameters on the resulting spatial deformation field.
  • results: Enables the efficient discovery of suitable, data-specific physical parameters at test time for successful registration.
    Abstract Medical image registration aims at identifying the spatial deformation between images of the same anatomical region and is fundamental to image-based diagnostics and therapy. To date, the majority of the deep learning-based registration methods employ regularizers that enforce global spatial smoothness, e.g., the diffusion regularizer. However, such regularizers are not tailored to the data and might not be capable of reflecting the complex underlying deformation. In contrast, physics-inspired regularizers promote physically plausible deformations. One such regularizer is the linear elastic regularizer which models the deformation of elastic material. These regularizers are driven by parameters that define the material's physical properties. For biological tissue, a wide range of estimations of such parameters can be found in the literature and it remains an open challenge to identify suitable parameter values for successful registration. To overcome this problem and to incorporate physical properties into learning-based registration, we propose to use a hypernetwork that learns the effect of the physical parameters of a physics-inspired regularizer on the resulting spatial deformation field. In particular, we adapt the HyperMorph framework to learn the effect of the two elasticity parameters of the linear elastic regularizer. Our approach enables the efficient discovery of suitable, data-specific physical parameters at test time.
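
A minimal sketch of the conditioning idea in a HyperMorph-style setup: a small MLP maps the two elasticity parameters to the weights of one convolutional layer, so a single registration network covers a family of regularization settings. The dimensions and single-layer scope are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperConv(nn.Module):
    """Hypernetwork-conditioned conv layer: (lambda, mu) -> conv weights."""
    def __init__(self, chans: int = 16):
        super().__init__()
        self.chans = chans
        self.hyper = nn.Sequential(
            nn.Linear(2, 64), nn.ReLU(),
            nn.Linear(64, chans * chans * 9),
        )

    def forward(self, feats: torch.Tensor, elastic_params: torch.Tensor):
        # feats: (B, chans, H, W); elastic_params: (2,) holding (lambda, mu)
        w = self.hyper(elastic_params).view(self.chans, self.chans, 3, 3)
        return F.conv2d(feats, w, padding=1)
```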

Eval-GCSC: A New Metric for Evaluating ChatGPT’s Performance in Chinese Spelling Correction

  • paper_url: http://arxiv.org/abs/2311.08219
  • repo_url: https://github.com/ktlktl/eval-gcsc
  • paper_authors: Kunting Li, Yong Hu, Shaolei Wang, Hanhan Ma, Liang He, Fandong Meng, Jie Zhou
  • for: Proposing a new metric for evaluating generative models on the Chinese Spelling Correction (CSC) task.
  • methods: Eval-GCSC incorporates word-level and semantic similarity judgments, relaxing the overly strict length and phonics constraints of traditional metrics.
  • results: Eval-GCSC closely aligns with human evaluations, and under this metric ChatGPT's performance is comparable to traditional token-level classification models (TCM).
    Abstract ChatGPT has demonstrated impressive performance in various downstream tasks. However, in the Chinese Spelling Correction (CSC) task, we observe a discrepancy: while ChatGPT performs well under human evaluation, it scores poorly according to traditional metrics. We believe this inconsistency arises because the traditional metrics are not well-suited for evaluating generative models. Their overly strict length and phonics constraints may lead to underestimating ChatGPT's correction capabilities. To better evaluate generative models in the CSC task, this paper proposes a new evaluation metric: Eval-GCSC. By incorporating word-level and semantic similarity judgments, it relaxes the stringent length and phonics constraints. Experimental results show that Eval-GCSC closely aligns with human evaluations. Under this metric, ChatGPT's performance is comparable to traditional token-level classification models (TCM), demonstrating its potential as a CSC tool. The source code and scripts can be accessed at https://github.com/ktlKTL/Eval-GCSC.

Human-Centric Autonomous Systems With LLMs for User Command Reasoning

  • paper_url: http://arxiv.org/abs/2311.08206
  • repo_url: https://github.com/kth-rpl/drivecmd_llm
  • paper_authors: Yi Yang, Qingwen Zhang, Ci Li, Daniel Simões Marta, Nazre Batool, John Folkesson
  • for: Integrating large language model (LLM) reasoning into autonomous driving systems so that in-cabin user commands are accurately discerned and interpreted.
  • methods: A series of experiments across different LLM models and prompt designs, evaluating the few-shot multivariate binary classification accuracy of system requirements inferred from natural language textual commands.
  • results: LLMs can understand and reason about prompts, but their effectiveness is conditioned on the quality of both the LLM model and the design of appropriate sequential prompts.
    Abstract The evolution of autonomous driving has made remarkable advancements in recent years, evolving into a tangible reality. However, a human-centric large-scale adoption hinges on meeting a variety of multifaceted requirements. To ensure that the autonomous system meets the user's intent, it is essential to accurately discern and interpret user commands, especially in complex or emergency situations. To this end, we propose to leverage the reasoning capabilities of Large Language Models (LLMs) to infer system requirements from in-cabin users' commands. Through a series of experiments that include different LLM models and prompt designs, we explore the few-shot multivariate binary classification accuracy of system requirements from natural language textual commands. We confirm the general ability of LLMs to understand and reason about prompts but underline that their effectiveness is conditioned on the quality of both the LLM model and the design of appropriate sequential prompts. Code and models are public with the link \url{https://github.com/KTH-RPL/DriveCmd_LLM}.

Automated Fact-Checking in Dialogue: Are Specialized Models Needed?

  • paper_url: http://arxiv.org/abs/2311.08195
  • repo_url: None
  • paper_authors: Eric Chamoun, Marzieh Saeidi, Andreas Vlachos
  • for: Improving automated fact-checking of claims made in dialogue.
  • methods: Retrieval adaptation and transformation of conversational inputs so that they can be accurately predicted by models trained on stand-alone claims.
  • results: A single model serves both dialogue and typical fact-checking: it is competitive with state-of-the-art models fine-tuned for dialogue while maintaining accuracy on stand-alone claims.
    Abstract Prior research has shown that typical fact-checking models for stand-alone claims struggle with claims made in dialogues. As a solution, fine-tuning these models on labelled dialogue data has been proposed. However, creating separate models for each use case is impractical, and we show that fine-tuning models for dialogue results in poor performance on typical fact-checking. To overcome this challenge, we present techniques that allow us to use the same models for both dialogue and typical fact-checking. These mainly focus on retrieval adaptation and transforming conversational inputs so that they can be accurately predicted by models trained on stand-alone claims. We demonstrate that a typical fact-checking model incorporating these techniques is competitive with state-of-the-art models fine-tuned for dialogue, while maintaining its accuracy on stand-alone claims.

Semi-Supervised Learning via Swapped Prediction for Communication Signal Recognition

  • paper_url: http://arxiv.org/abs/2311.08179
  • repo_url: None
  • paper_authors: Weidong Wang, Hongshu Liao, Lu Gan
  • for: Improving communication signal recognition when training on small datasets with few labels, without falling into overfitting.
  • methods: A semi-supervised learning method that exploits a large collection of readily available unlabeled signal data, built on strong data augmentation and a consistency-based regularization termed Swapped Prediction, optimized with a scaled cross-entropy loss with swapped symmetry.
  • results: Experiments indicate that the proposed method achieves promising results for deep SSL of communication signal recognition.
    Abstract Deep neural networks have been widely used in communication signal recognition and achieved remarkable performance, but this superiority typically depends on using massive examples for supervised learning, whereas training a deep neural network on small datasets with few labels generally falls into overfitting, resulting in degenerated performance. To this end, we develop a semi-supervised learning (SSL) method that effectively utilizes a large collection of more readily available unlabeled signal data to improve generalization. The proposed method relies largely on a novel implementation of consistency-based regularization, termed Swapped Prediction, which leverages strong data augmentation to perturb an unlabeled sample and then encourage its corresponding model prediction to be close to its original, optimized with a scaled cross-entropy loss with swapped symmetry. Extensive experiments indicate that our proposed method can achieve a promising result for deep SSL of communication signal recognition.
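
The consistency objective can be sketched as a symmetric ("swapped") cross-entropy between predictions for two strong augmentations of the same unlabeled signal, each view serving as a stop-gradient soft target for the other. The clamp and weighting below are assumptions rather than the paper's exact scaled loss.

```python
import torch
import torch.nn.functional as F

def swapped_prediction_loss(model, x_unlabeled, augment, weight=1.0):
    """Each augmented view's class distribution serves as a (stop-gradient)
    soft target for the other view, encouraging consistent predictions."""
    p1 = F.softmax(model(augment(x_unlabeled)), dim=-1)
    p2 = F.softmax(model(augment(x_unlabeled)), dim=-1)
    ce = lambda target, pred: -(target.detach() * pred.clamp_min(1e-8).log()).sum(-1)
    return weight * (ce(p1, p2) + ce(p2, p1)).mean()
```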

Neural Lattice Reduction: A Self-Supervised Geometric Deep Learning Approach

  • paper_url: http://arxiv.org/abs/2311.08170
  • repo_url: None
  • paper_authors: Giovanni Luca Marchetti, Gabriele Cesa, Kumar Pratik, Arash Behboodi
  • for: solving lattice reduction problems using deep learning methods
  • methods: using a deep neural model outputting factorized unimodular matrices, trained in a self-supervised manner with penalization for non-orthogonal lattice bases, and incorporating symmetries of lattice reduction through invariance and equivariance with respect to appropriate continuous and discrete groups
  • results: a self-supervised deep learning method for lattice reduction that penalizes non-orthogonal lattice bases and builds the symmetries of the problem into the model through invariance and equivariance
    Abstract Lattice reduction is a combinatorial optimization problem aimed at finding the most orthogonal basis in a given lattice. In this work, we address lattice reduction via deep learning methods. We design a deep neural model outputting factorized unimodular matrices and train it in a self-supervised manner by penalizing non-orthogonal lattice bases. We incorporate the symmetries of lattice reduction into the model by making it invariant and equivariant with respect to appropriate continuous and discrete groups.
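
The penalty on non-orthogonal bases can be illustrated with the log orthogonality defect, which is non-negative by Hadamard's inequality and zero exactly for orthogonal bases; whether this is the paper's precise self-supervised loss is an assumption.

```python
import torch

def log_orthogonality_defect(basis: torch.Tensor) -> torch.Tensor:
    """log( prod_i ||b_i|| / |det B| ) for a square basis B with rows b_i;
    non-negative, and zero iff the rows are mutually orthogonal."""
    norms = basis.norm(dim=1)
    _, logabsdet = torch.linalg.slogdet(basis)
    return norms.log().sum() - logabsdet
```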

MechAgents: Large language model multi-agent collaborations can solve mechanics problems, generate new data, and integrate knowledge

  • paper_url: http://arxiv.org/abs/2311.08166
  • repo_url: None
  • paper_authors: Bo Ni, Markus J. Buehler
  • for: Solving mechanics problems with numerical methods by using large language models (LLMs) to develop MechAgents, a new class of physics-inspired generative machine learning platform.
  • methods: Autonomous collaborations of multiple LLMs solve elasticity problems by applying finite element methods with different boundary conditions, domain geometries, meshes, small/finite deformation, and linear/hyper-elastic constitutive laws; the agents mutually correct each other to improve team-work performance in understanding, formulating, and validating the solution.
  • results: The framework effectively solves classical elasticity problems, demonstrating the potential of synergizing the intelligence of language models, the reliability of physics-based modeling, and dynamic collaborations among diverse agents to automate the solution of engineering problems.
    Abstract Solving mechanics problems using numerical methods requires comprehensive intelligent capability of retrieving relevant knowledge and theory, constructing and executing codes, analyzing the results, a task that has thus far mainly been reserved for humans. While emerging AI methods can provide effective approaches to solve end-to-end problems, for instance via the use of deep surrogate models or various data analytics strategies, they often lack physical intuition since knowledge is baked into the parametric complement through training, offering less flexibility when it comes to incorporating mathematical or physical insights. By leveraging diverse capabilities of multiple dynamically interacting large language models (LLMs), we can overcome the limitations of conventional approaches and develop a new class of physics-inspired generative machine learning platform, here referred to as MechAgents. A set of AI agents can solve mechanics tasks, here demonstrated for elasticity problems, via autonomous collaborations. A two-agent team can effectively write, execute and self-correct code, in order to apply finite element methods to solve classical elasticity problems in various flavors (different boundary conditions, domain geometries, meshes, small/finite deformation and linear/hyper-elastic constitutive laws, and others). For more complex tasks, we construct a larger group of agents with enhanced division of labor among planning, formulating, coding, executing and criticizing the process and results. The agents mutually correct each other to improve the overall team-work performance in understanding, formulating and validating the solution. Our framework shows the potential of synergizing the intelligence of language models, the reliability of physics-based modeling, and the dynamic collaborations among diverse agents, opening novel avenues for automation of solving engineering problems.
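
A toy rendering of the two-agent write-execute-correct loop described above; every interface (`coder_llm`, `critic_llm`, `run_code`) is a hypothetical callable standing in for an LLM API or a sandboxed finite-element run.

```python
def two_agent_solve(task, coder_llm, critic_llm, run_code, max_rounds=5):
    """One agent drafts finite-element code, a second critiques it against the
    execution output, and the draft is revised until it runs cleanly."""
    code = coder_llm(f"Write a finite-element script for: {task}")
    for _ in range(max_rounds):
        ok, output = run_code(code)
        if ok:
            return code, output
        feedback = critic_llm(f"Task: {task}\nCode:\n{code}\nError:\n{output}")
        code = coder_llm(f"Revise the script using this feedback:\n{feedback}")
    raise RuntimeError("agents did not converge on runnable code")
```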

Ask One More Time: Self-Agreement Improves Reasoning of Language Models in (Almost) All Scenarios

  • paper_url: http://arxiv.org/abs/2311.08154
  • repo_url: None
  • paper_authors: Lei Lin, Jiayi Fu, Pengli Liu, Junchen Wan, Fuzheng Zhang, Zhongyuan Wang, Di Zhang, Kun Gai
  • for: Improving the reasoning of language models under chain-of-thought (CoT) prompting, addressing the repetitiveness and local optimality caused by naive greedy decoding.
  • methods: Proposes self-agreement, a generalizable ensemble-optimization method applicable in almost all scenarios, whether or not the type of input questions or the answer format of the reasoning paths is known.
  • results: Achieves remarkable performance on six public reasoning benchmarks together with superior generalization capabilities.
    Abstract Although chain-of-thought (CoT) prompting combined with language models has achieved encouraging results on complex reasoning tasks, the naive greedy decoding used in CoT prompting usually causes the repetitiveness and local optimality. To address this shortcoming, ensemble-optimization tries to obtain multiple reasoning paths to get the final answer assembly. However, current ensemble-optimization methods either simply employ rule-based post-processing such as \textit{self-consistency}, or train an additional model based on several task-related human annotations to select the best one among multiple reasoning paths, yet fail to generalize to realistic settings where the type of input questions is unknown or the answer format of reasoning paths is unknown. To avoid their limitations, we propose \textbf{self-agreement}, a generalizable ensemble-optimization method applying in almost all scenarios where the type of input questions and the answer format of reasoning paths may be known or unknown. Self-agreement firstly samples from language model's decoder to generate a \textit{diverse} set of reasoning paths, and subsequently prompts the language model \textit{one more time} to determine the optimal answer by selecting the most \textit{agreed} answer among the sampled reasoning paths. Self-agreement simultaneously achieves remarkable performance on six public reasoning benchmarks and superior generalization capabilities.
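
Operationally, the method samples diverse reasoning paths and then asks the model one more time which answer the paths agree on. The sketch approximates that final step with a majority vote over extracted answers; `llm_sample` and `extract_answer` are hypothetical callables.

```python
from collections import Counter

def self_agreement(question, llm_sample, extract_answer, n_paths=10):
    """Sample diverse reasoning paths, then return the most agreed final answer."""
    paths = [llm_sample(question, temperature=0.8) for _ in range(n_paths)]
    answers = [extract_answer(path) for path in paths]
    return Counter(answers).most_common(1)[0][0]
```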

When Mining Electric Locomotives Meet Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.08153
  • repo_url: None
  • paper_authors: Ying Li, Zhencai Zhu, Xiaoqiang Li, Chunyu Yang, Hao Lu
  • for: Presenting a reinforcement learning (RL) method for the autonomous control of mining electric locomotives in complex and uncertain coal mine environments.
  • methods: RL learns the optimal control policy for the locomotives, and an improved epsilon-greedy (IEG) algorithm is proposed to better balance exploration and exploitation; a co-simulation platform is built to verify the method through closed-loop simulation of the vehicles.
  • results: Simulations show that the locomotives follow the front vehicle safely and respond promptly to sudden obstacles in complex and uncertain coal mine environments.
    Abstract As the most important auxiliary transportation equipment in coal mines, mining electric locomotives are mostly operated manually at present. However, due to the complex and ever-changing coal mine environment, electric locomotive safety accidents occur frequently these years. A mining electric locomotive control method that can adapt to different complex mining environments is needed. Reinforcement Learning (RL) is concerned with how artificial agents ought to take actions in an environment so as to maximize reward, which can help achieve automatic control of mining electric locomotive. In this paper, we present how to apply RL to the autonomous control of mining electric locomotives. To achieve more precise control, we further propose an improved epsilon-greedy (IEG) algorithm which can better balance the exploration and exploitation. To verify the effectiveness of this method, a co-simulation platform for autonomous control of mining electric locomotives is built which can complete closed-loop simulation of the vehicles. The simulation results show that this method ensures the locomotives following the front vehicle safely and responding promptly in the event of sudden obstacles on the road when the vehicle in complex and uncertain coal mine environments.
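
For reference, plain epsilon-greedy with an exponentially decayed epsilon looks as follows; the paper's IEG variant refines this exploration-exploitation balance, and the schedule below is an illustrative assumption.

```python
import math
import random

def epsilon_greedy(q_values, step, eps_start=1.0, eps_end=0.05, decay=1e-4):
    """Pick a random action with probability eps (exploration), otherwise the
    greedy action (exploitation); eps decays exponentially with training steps."""
    eps = eps_end + (eps_start - eps_end) * math.exp(-decay * step)
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```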

The Hyperdimensional Transform for Distributional Modelling, Regression and Classification

  • paper_url: http://arxiv.org/abs/2311.08150
  • repo_url: https://github.com/padwulf/chap6_transform_applications
  • paper_authors: Pieter Dewulf, Bernard De Baets, Michiel Stock
  • for: Introducing the hyperdimensional transform to a broad data science audience, as a theoretical basis for hyperdimensional computing (HDC) in machine learning.
  • methods: Uses the hyperdimensional transform, which represents functions and distributions as high-dimensional holographic vectors, to provide insight into state-of-the-art HDC approaches and to show how existing algorithms can be modified.
  • results: The transform leads to a novel, well-founded toolbox covering standard regression and classification as well as representation, learning and deconvolving distributions, sampling, Bayesian inference, and uncertainty estimation.
    Abstract Hyperdimensional computing (HDC) is an increasingly popular computing paradigm with immense potential for future intelligent applications. Although the main ideas already took form in the 1990s, HDC recently gained significant attention, especially in the field of machine learning and data science. Next to efficiency, interoperability and explainability, HDC offers attractive properties for generalization as it can be seen as an attempt to combine connectionist ideas from neural networks with symbolic aspects. In recent work, we introduced the hyperdimensional transform, revealing deep theoretical foundations for representing functions and distributions as high-dimensional holographic vectors. Here, we present the power of the hyperdimensional transform to a broad data science audience. We use the hyperdimensional transform as a theoretical basis and provide insight into state-of-the-art HDC approaches for machine learning. We show how existing algorithms can be modified and how this transform can lead to a novel, well-founded toolbox. Next to the standard regression and classification tasks of machine learning, our discussion includes various aspects of statistical modelling, such as representation, learning and deconvolving distributions, sampling, Bayesian inference, and uncertainty estimation.
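
Two primitive operations give the flavor of computing with high-dimensional holographic vectors: bundling (superposition) keeps the result similar to its constituents, while binding (association) yields a vector dissimilar to both. This is generic HDC background rather than the paper's transform itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_hv(dim=10_000):
    """A random bipolar hypervector, the basic atom of hyperdimensional computing."""
    return rng.choice([-1, 1], size=dim)

a, b = random_hv(), random_hv()
binding = a * b               # elementwise product: associates two symbols
bundle = np.sign(a + b)       # superposition: represents the set {a, b}

cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(round(cos(bundle, a), 2), round(cos(binding, a), 2))  # ~0.71 vs ~0.0
```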

Cattle Identification Using Muzzle Images and Deep Learning Techniques

  • paper_url: http://arxiv.org/abs/2311.08148
  • repo_url: https://github.com/peter716/animal_biometrics_system
  • paper_authors: G. N. Kimani, P. Oluwadara, P. Fashingabo, M. Busogi, E. Luhanga, K. Sowon, L. Chacha
  • for: Developing muzzle-based cattle identification to improve on the accuracy and scalability of traditional methods.
  • methods: Two deep learning classification models, wide ResNet50 and VGG16_BN, trained on 4923 muzzle images from 268 beef cattle, with image compression to lower image quality and adapt the models to the African context.
  • results: A maximum accuracy of 99.5% is achieved with the wide ResNet50 model under compression retaining 25% of the original image; training, convergence, and recognition times depend on the machine used to run the model.
    Abstract Traditional animal identification methods such as ear-tagging, ear notching, and branding have been effective but pose risks to the animal and have scalability issues. Electrical methods offer better tracking and monitoring but require specialized equipment and are susceptible to attacks. Biometric identification using time-immutable dermatoglyphic features such as muzzle prints and iris patterns is a promising solution. This project explores cattle identification using 4923 muzzle images collected from 268 beef cattle. Two deep learning classification models are implemented - wide ResNet50 and VGG16\_BN and image compression is done to lower the image quality and adapt the models to work for the African context. From the experiments run, a maximum accuracy of 99.5\% is achieved while using the wide ResNet50 model with a compression retaining 25\% of the original image. From the study, it is noted that the time required by the models to train and converge as well as recognition time are dependent on the machine used to run the model.

RECALL: A Benchmark for LLMs Robustness against External Counterfactual Knowledge

  • paper_url: http://arxiv.org/abs/2311.08147
  • repo_url: None
  • paper_authors: Yi Liu, Lianzhe Huang, Shicheng Li, Sishuo Chen, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun
  • for: Evaluating the ability of existing LLMs to discern the reliability of external knowledge, a prerequisite for dependable question answering and text generation.
  • methods: A benchmark built from existing knowledge bases with two tasks, question answering and text generation; for each task, models are provided with a context containing counterfactual information.
  • results: Existing LLMs are susceptible to interference from unreliable external knowledge with counterfactual information, and simple intervention methods make only limited contributions to alleviating the issue.
    Abstract LLMs and AI chatbots have improved people's efficiency in various fields. However, the necessary knowledge for answering the question may be beyond the models' knowledge boundaries. To mitigate this issue, many researchers try to introduce external knowledge, such as knowledge graphs and Internet contents, into LLMs for up-to-date information. However, the external information from the Internet may include counterfactual information that will confuse the model and lead to an incorrect response. Thus there is a pressing need for LLMs to possess the ability to distinguish reliable information from external knowledge. Therefore, to evaluate the ability of LLMs to discern the reliability of external knowledge, we create a benchmark from existing knowledge bases. Our benchmark consists of two tasks, Question Answering and Text Generation, and for each task, we provide models with a context containing counterfactual information. Evaluation results show that existing LLMs are susceptible to interference from unreliable external knowledge with counterfactual information, and simple intervention methods make limited contributions to the alleviation of this issue.

Caring Trouble and Musical AI: Considerations towards a Feminist Musical AI

  • paper_url: http://arxiv.org/abs/2311.08120
  • repo_url: None
  • paper_authors: Kelsey Cotton, Kıvanç Tatar
  • for: This paper examines the ethical implications of using AI in musical and artistic practice, specifically in the context of Holly+, a deep neural network that generates raw audio.
  • methods: The paper uses a critical feminist examination and speculative feminism to trouble the structures, frameworks, and assumptions within and around Holly+.
  • results: The paper contributes considerations and future directions for integrating speculative feminism and care into musical-AI agent and system design.
    Abstract The ethics of AI as both material and medium for interaction remains in murky waters within the context of musical and artistic practice. The interdisciplinarity of the field is revealing matters of concern and care, which necessitate interdisciplinary methodologies for evaluation to trouble and critique the inheritance of "residue-laden" AI-tools in musical applications. Seeking to unsettle these murky waters, this paper critically examines the example of Holly+, a deep neural network that generates raw audio in the likeness of its creator Holly Herndon. Drawing from theoretical concerns and considerations from speculative feminism and care ethics, we care-fully trouble the structures, frameworks and assumptions that oscillate within and around Holly+. We contribute with several considerations and contemplate future directions for integrating speculative feminism and care into musical-AI agent and system design, derived from our critical feminist examination.

Evaluating Neighbor Explainability for Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2311.08118
  • repo_url: https://github.com/ericssonresearch/gnn-neighbors-xai
  • paper_authors: Oscar Llorente, Péter Vaderna, Sándor Laki, Roland Kotroczó, Rita Csoma, János Márk Szalai-Gindl
  • for: Determining how important each neighbor is to a GNN when classifying a node, and how to measure performance on this specific task.
  • methods: Reformulates various known explainability methods to obtain neighbor importance and presents four new metrics.
  • results: Explanations provided by gradient-based techniques in the GNN domain are almost indistinguishable from one another, and many explainability techniques fail to identify important neighbors when GNNs without self-loops are used.
    Abstract Explainability in Graph Neural Networks (GNNs) is a new field growing in the last few years. In this publication we address the problem of determining how important is each neighbor for the GNN when classifying a node and how to measure the performance for this specific task. To do this, various known explainability methods are reformulated to get the neighbor importance and four new metrics are presented. Our results show that there is almost no difference between the explanations provided by gradient-based techniques in the GNN domain. In addition, many explainability techniques failed to identify important neighbors when GNNs without self-loops are used.
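
One gradient-based neighbor score of the kind the paper compares can be sketched as input-feature saliency for the target node's predicted class; the `model(x, edge_index)` interface follows the common PyTorch Geometric convention and is an assumption here.

```python
import torch

def neighbor_importance(model, x, edge_index, node_id):
    """Gradient-based saliency sketch: importance of each node's features for
    the target node's predicted class, taken as the per-node gradient norm."""
    x = x.clone().requires_grad_(True)
    logits = model(x, edge_index)
    target_class = logits[node_id].argmax()
    logits[node_id, target_class].backward()
    return x.grad.norm(dim=1)   # one saliency score per (neighboring) node
```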

Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion

  • paper_url: http://arxiv.org/abs/2311.08104
  • repo_url: None
  • paper_authors: Anders R. Bargum, Stefania Serafin, Cumhur Erkut
  • for: Surveying the application of deep learning to voice conversion (VC) in speech-to-speech scenarios, covering speech analysis, synthesis, and disentangled speech representation learning.
  • methods: A scoping review that screened 621 publications from more than 38 venues between 2017 and 2023, followed by an in-depth review of a final database of 123 eligible studies.
  • results: Summarizes the most frequently used deep learning approaches to voice conversion, highlights common pitfalls within the community, identifies the main challenges, and provides recommendations for future research directions.
    Abstract Research on deep learning-powered voice conversion (VC) in speech-to-speech scenarios is getting increasingly popular. Although many of the works in the field of voice conversion share a common global pipeline, there is a considerable diversity in the underlying structures, methods, and neural sub-blocks used across research efforts. Thus, obtaining a comprehensive understanding of the reasons behind the choice of the different methods in the voice conversion pipeline can be challenging, and the actual hurdles in the proposed solutions are often unclear. To shed light on these aspects, this paper presents a scoping review that explores the use of deep learning in speech analysis, synthesis, and disentangled speech representation learning within modern voice conversion systems. We screened 621 publications from more than 38 different venues between the years 2017 and 2023, followed by an in-depth review of a final database consisting of 123 eligible studies. Based on the review, we summarise the most frequently used approaches to voice conversion based on deep learning and highlight common pitfalls within the community. Lastly, we condense the knowledge gathered, identify main challenges and provide recommendations for future research directions.

  • paper_url: http://arxiv.org/abs/2311.08103
  • repo_url: https://github.com/nishchalprasad/semi-supervised-stacked-encoder
  • paper_authors: Nishchal Prasad, Mohand Boughanem, Taoufiq Dkaki
  • for: Predicting the judgment of a legal case from its unannotated case facts, a task complicated by lengthy and non-uniform document structure.
  • methods: A two-level classification mechanism, both supervised and unsupervised: a domain-specific pre-trained BERT extracts sentence embeddings from long documents, a transformer encoder layer processes them further, and unsupervised clustering extracts hidden labels from the embeddings to better predict the judgment.
  • results: The mechanism achieves higher performance gains than previously proposed methods on the ILDC dataset, and the experiments show the importance of domain-specific pre-training of transformer encoders for legal information processing.
    Abstract Predicting the judgment of a legal case from its unannotated case facts is a challenging task. The lengthy and non-uniform document structure poses an even greater challenge in extracting information for decision prediction. In this work, we explore and propose a two-level classification mechanism; both supervised and unsupervised; by using domain-specific pre-trained BERT to extract information from long documents in terms of sentence embeddings further processing with transformer encoder layer and use unsupervised clustering to extract hidden labels from these embeddings to better predict a judgment of a legal case. We conduct several experiments with this mechanism and see higher performance gains than the previously proposed methods on the ILDC dataset. Our experimental results also show the importance of domain-specific pre-training of Transformer Encoders in legal information processing.
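
A minimal sketch of the second level of the stack, assuming per-chunk embeddings from a domain-pretrained BERT are precomputed; the layer sizes, mean pooling, and binary output are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class StackedEncoder(nn.Module):
    """Second-level classifier: a small transformer encoder aggregates the
    per-chunk BERT embeddings of a long document, and a linear head predicts
    the judgment class."""
    def __init__(self, dim: int = 768, n_classes: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, chunk_embeddings: torch.Tensor):
        # chunk_embeddings: (batch, n_chunks, dim), produced by a frozen BERT
        h = self.encoder(chunk_embeddings)
        return self.head(h.mean(dim=1))   # mean-pool chunks, then classify
```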

Empowering Multi-step Reasoning across Languages via Tree-of-Thoughts

  • paper_url: http://arxiv.org/abs/2311.08097
  • repo_url: None
  • paper_authors: Leonardo Ranaldi, Fabio Massimo Zanzotto
  • for: Extending the multi-step reasoning abilities of large language models (LLMs) beyond English, so that complex reasoning tasks can be solved step by step across languages.
  • methods: A cross-lingual multi-step reasoning approach: a Self-consistent Cross-lingual prompting mechanism, inspired by Tree-of-Thoughts, produces reasoning paths in different languages whose steps converge on the final solution.
  • results: The method significantly outperforms existing prompting methods, reducing the number of interactions and achieving state-of-the-art performance.
    Abstract Chain-of-Thought (CoT) prompting empowers the reasoning abilities of Large Language Models (LLMs), eliciting them to solve complex reasoning tasks step-by-step. However, with the success of CoT methods, the ability to deliver multi-step reasoning remains limited to English due to the imbalance in the distribution of the pre-training data, making the other languages a barrier. In this work, we propose a Cross-lingual multi-step reasoning approach, aiming to align reasoning processes across different languages. In particular, our method, through a Self-consistent Cross-lingual prompting mechanism inspired by the Tree-of-Thoughts approach, delivers multi-step reasoning paths in different languages that, during the steps, lead to the final solution. Our experimental evaluations show that our method significantly outperforms existing prompting methods, reducing the number of interactions and achieving state-of-the-art performance.
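
A minimal sketch of the self-consistency idea applied cross-lingually, assuming a generic `llm` completion function (a placeholder, not the authors' code): one step-by-step reasoning path is elicited per language, and the most frequent final answer wins.

```python
from collections import Counter

def llm(prompt: str) -> str:
    """Placeholder for a call to a large language model API."""
    raise NotImplementedError

LANGUAGES = ["English", "Italian", "German", "Chinese"]  # illustrative set

def cross_lingual_self_consistency(question: str) -> str:
    """Elicit one reasoning path per language, extract each final answer,
    and return the answer the paths agree on most often."""
    answers = []
    for lang in LANGUAGES:
        prompt = (
            f"Translate the problem into {lang}, then solve it step by step "
            f"in {lang}. End with 'Answer:' followed by the final answer.\n"
            f"Problem: {question}"
        )
        completion = llm(prompt)
        answers.append(completion.rsplit("Answer:", 1)[-1].strip())
    return Counter(answers).most_common(1)[0][0]
```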

Act-VIT: A Representationally Robust Attention Architecture for Skeleton Based Action Recognition Using Vision Transformer

  • paper_url: http://arxiv.org/abs/2311.08094
  • repo_url: None
  • paper_authors: Ozge Oztimur Karadag
  • for: Examining the effectiveness of vision transformers for skeleton-based action recognition and their robustness to the choice of pseudo-image representation.
  • methods: A three-level architecture, Act-VIT, forms pseudo-images at different levels, applies a classifier to each representation (realized first with CNNs and then with vision transformers), and combines the results to obtain the final action class.
  • results: Experiments show vision transformers are less sensitive than CNNs to the initial pseudo-image representation, and recognition performance improves further through a consensus of classifiers.
    Abstract Skeleton-based action recognition receives the attention of many researchers as it is robust to viewpoint and illumination changes, and its processing is much more efficient than video frames. With the emergence of deep learning models, it has become very popular to represent the skeleton data in pseudo-image form and apply Convolutional Neural Networks for action recognition. Thereafter, studies concentrated on finding effective methods for forming pseudo-images. Recently, attention networks, more specifically transformers have provided promising results in various vision problems. In this study, the effectiveness of vision transformers for skeleton-based action recognition is examined and its robustness on the pseudo-image representation scheme is investigated. To this end, a three-level architecture, Act-VIT is proposed, which forms a set of pseudo images apply a classifier on each of the representation and combine their results to find the final action class. The classifiers of Act-VIT are first realized by CNNs and then by VITs and their performances are compared. Experimental studies reveal that the vision transformer is less sensitive to the initial pseudo-image representation compared to CNN. Nevertheless, even with the vision transformer, the recognition performance can be further improved by consensus of classifiers.
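
A minimal sketch of the consensus step, assuming three pseudo-image levels. The branch network below is a small CNN stand-in for the paper's ViT/CNN classifiers, and the class count is illustrative.

```python
import torch
import torch.nn as nn

class BranchClassifier(nn.Module):
    """Stand-in for one branch (ViT or CNN) operating on one pseudo-image level."""
    def __init__(self, n_classes=60):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, n_classes),
        )

    def forward(self, x):
        return self.net(x)

class ActVITConsensus(nn.Module):
    """Each branch classifies its own pseudo-image; the final class comes
    from averaging the branch softmax distributions (consensus)."""
    def __init__(self, n_classes=60):
        super().__init__()
        self.branches = nn.ModuleList([BranchClassifier(n_classes) for _ in range(3)])

    def forward(self, pseudo_images):                 # list of 3 tensors (B, 3, H, W)
        probs = [b(x).softmax(dim=-1) for b, x in zip(self.branches, pseudo_images)]
        return torch.stack(probs).mean(dim=0)         # consensus distribution

model = ActVITConsensus()
levels = [torch.randn(2, 3, 64, 64) for _ in range(3)]  # skeleton pseudo-images
print(model(levels).argmax(dim=-1))                      # predicted action classes
```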

Spot: A Natural Language Interface for Geospatial Searches in OSM

  • paper_url: http://arxiv.org/abs/2311.08093
  • repo_url: None
  • paper_authors: Lynn Khellaf, Ipek Baris Schlicht, Julia Bayer, Ruben Bouwmeester, Tilman Miraß, Tilman Wagner
  • for: Providing a user-friendly natural language interface for querying OpenStreetMap (OSM) data.
  • methods: Spot uses a semantic mapping from natural language to OSM tags, built from artificially generated sentence queries and a T5 transformer, to extract the relevant information from a user's sentence and display matching candidate locations on a map.
  • results: Spot makes OSM data accessible to people without a technical background, such as investigative journalists and fact-checkers; all code and generated data are released as an open-source repository.
    Abstract Investigative journalists and fact-checkers have found OpenStreetMap (OSM) to be an invaluable resource for their work due to its extensive coverage and intricate details of various locations, which play a crucial role in investigating news scenes. Despite its value, OSM's complexity presents considerable accessibility and usability challenges, especially for those without a technical background. To address this, we introduce 'Spot', a user-friendly natural language interface for querying OSM data. Spot utilizes a semantic mapping from natural language to OSM tags, leveraging artificially generated sentence queries and a T5 transformer. This approach enables Spot to extract relevant information from user-input sentences and display candidate locations matching the descriptions on a map. To foster collaboration and future advancement, all code and generated data is available as an open-source repository.

CPSOR-GCN: A Vehicle Trajectory Prediction Method Powered by Emotion and Cognitive Theory

  • paper_url: http://arxiv.org/abs/2311.08086
  • repo_url: None
  • paper_authors: L. Tang, Y. Li, J. Yuan, A. Fu, J. Sun
  • for: Proposing a vehicle trajectory prediction model that stays accurate when the driver is in an abnormal emotional state.
  • methods: CPSOR-GCN couples a physical GCN module, which extracts interaction features between vehicles, with a cognitive GCN module: SOR cognitive theory serves as prior knowledge for a Dynamic Bayesian Network (DBN) whose calibrated conditional and state-transition probabilities quantify how emotions causally influence driving behavior.
  • results: Compared with a model that considers only physical motion features, prediction accuracy improves by 68.70%; incorporating the SOR-DBN reduces trajectory prediction error by a further 15.93%, and CPSOR-GCN also achieves lower errors than other advanced trajectory prediction models, so integrating it into active safety systems could effectively reduce false alarms.
    Abstract Active safety systems on vehicles often face problems with false alarms. Most active safety systems predict the driver's trajectory with the assumption that the driver is always in a normal emotion, and then infer risks. However, the driver's trajectory uncertainty increases under abnormal emotions. This paper proposes a new trajectory prediction model: CPSOR-GCN, which predicts vehicle trajectories under abnormal emotions. At the physical level, the interaction features between vehicles are extracted by the physical GCN module. At the cognitive level, SOR cognitive theory is used as prior knowledge to build a Dynamic Bayesian Network (DBN) structure. The conditional probability and state transition probability of nodes from the calibrated SOR-DBN quantify the causal relationship between cognitive factors, which is embedded into the cognitive GCN module to extract the characteristics of the influence mechanism of emotions on driving behavior. The CARLA-SUMO joint driving simulation platform was built to develop dangerous pre-crash scenarios. Methods of recreating traffic scenes were used to naturally induce abnormal emotions. The experiment collected data from 26 participants to verify the proposed model. Compared with the model that only considers physical motion features, the prediction accuracy of the proposed model is increased by 68.70%. Furthermore,considering the SOR-DBN reduces the prediction error of the trajectory by 15.93%. Compared with other advanced trajectory prediction models, the results of CPSOR-GCN also have lower errors. This model can be integrated into active safety systems to better adapt to the driver's emotions, which could effectively reduce false alarms.

Solving ARC visual analogies with neural embeddings and vector arithmetic: A generalized method

  • paper_url: http://arxiv.org/abs/2311.08083
  • repo_url: https://github.com/foger3/arc_deeplearning
  • paper_authors: Luca H. Thoms, Karel A. Veldkamp, Hannes Rosenbusch, Claire E. Stevenson
  • for: This paper focuses on visual analogical reasoning, applying the generalized mechanism originally used to solve verbal analogies to the visual realm.
  • methods: A variational autoencoder (VAE) transforms Abstraction and Reasoning Corpus (ARC) items into low-dimensional latent vectors, and simple vector arithmetic on these vectors is used to discover the underlying rules of ARC items and solve them.
  • results: The approach works well on simple items with fewer dimensions (few colors, uniform shapes), similar input-to-output examples, and high VAE reconstruction accuracy. Predictions on more complex items deviate more strongly from expected outputs, though they still often approximate parts of the item's rule set. The model scored 2% on the official ARC paradigm (the current world record is 21%) and 8.8% on ConceptARC.
    Abstract Analogical reasoning derives information from known relations and generalizes this information to similar yet unfamiliar situations. One of the first generalized ways in which deep learning models were able to solve verbal analogies was through vector arithmetic of word embeddings, essentially relating words that were mapped to a vector space (e.g., king - man + woman = __?). In comparison, most attempts to solve visual analogies are still predominantly task-specific and less generalizable. This project focuses on visual analogical reasoning and applies the initial generalized mechanism used to solve verbal analogies to the visual realm. Taking the Abstraction and Reasoning Corpus (ARC) as an example to investigate visual analogy solving, we use a variational autoencoder (VAE) to transform ARC items into low-dimensional latent vectors, analogous to the word embeddings used in the verbal approaches. Through simple vector arithmetic, underlying rules of ARC items are discovered and used to solve them. Results indicate that the approach works well on simple items with fewer dimensions (i.e., few colors used, uniform shapes), similar input-to-output examples, and high reconstruction accuracy on the VAE. Predictions on more complex items showed stronger deviations from expected outputs, although, predictions still often approximated parts of the item's rule set. Error patterns indicated that the model works as intended. On the official ARC paradigm, the model achieved a score of 2% (cf. current world record is 21%) and on ConceptARC it scored 8.8%. Although the methodology proposed involves basic dimensionality reduction techniques and standard vector arithmetic, this approach demonstrates promising outcomes on ARC and can easily be generalized to other abstract visual reasoning tasks.
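
The vector-arithmetic step can be sketched in a few lines, assuming `encode`/`decode` come from a VAE already trained on ARC grids (the names and the toy linear "VAE" below are illustrative):

```python
import numpy as np

def solve_arc_item(encode, decode, demos, test_input):
    """Estimate the item's rule as the average latent displacement over the
    demonstration pairs, then apply it to the test input and decode."""
    displacements = [encode(out) - encode(inp) for inp, out in demos]
    rule = np.mean(displacements, axis=0)        # the "king - man + woman" step
    return decode(encode(test_input) + rule)

# Toy check with an identity "VAE": the rule is a constant latent shift.
encode = lambda grid: np.asarray(grid, dtype=float)
decode = lambda z: np.round(z)
demos = [([0, 1], [1, 2]), ([2, 2], [3, 3])]     # rule: add 1 everywhere
print(solve_arc_item(encode, decode, demos, [4, 0]))   # -> [5. 1.]
```

Averaging the latent displacement over demonstration pairs is the direct visual analogue of the word-embedding arithmetic the paper cites.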

Zero-Shot Segmentation of Eye Features Using the Segment Anything Model (SAM)

  • paper_url: http://arxiv.org/abs/2311.08077
  • repo_url: None
  • paper_authors: Virmarie Maquiling, Sean Anthony Byrne, Diederick C. Niehorster, Marcus Nyström, Enkelejda Kasneci
  • for: Evaluating how well a foundation model segments eye features from images recorded in virtual reality setups, with a view to reducing the annotation burden for gaze-estimation datasets.
  • methods: The Segment Anything Model (SAM) is applied zero-shot, with and without prompts such as bounding boxes or point clicks.
  • results: SAM's segmentation effectiveness can be on par with specialized models depending on the feature, and prompts improve its performance; for example, with bounding-box prompts SAM reaches an IoU of 93.34% for pupil segmentation in one dataset.
    Abstract The advent of foundation models signals a new era in artificial intelligence. The Segment Anything Model (SAM) is the first foundation model for image segmentation. In this study, we evaluate SAM's ability to segment features from eye images recorded in virtual reality setups. The increasing requirement for annotated eye-image datasets presents a significant opportunity for SAM to redefine the landscape of data annotation in gaze estimation. Our investigation centers on SAM's zero-shot learning abilities and the effectiveness of prompts like bounding boxes or point clicks. Our results are consistent with studies in other domains, demonstrating that SAM's segmentation effectiveness can be on-par with specialized models depending on the feature, with prompts improving its performance, evidenced by an IoU of 93.34% for pupil segmentation in one dataset. Foundation models like SAM could revolutionize gaze estimation by enabling quick and easy image segmentation, reducing reliance on specialized models and extensive manual annotation.
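
Prompted zero-shot segmentation with SAM looks roughly like the following, using the public `segment_anything` package. The checkpoint path, image, and prompt coordinates are placeholders, not values from the paper.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Checkpoint path is a placeholder; download the official SAM weights first.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

eye_image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a VR eye frame
predictor.set_image(eye_image)

# Prompt with a bounding box around the pupil (x0, y0, x1, y1) ...
masks, scores, _ = predictor.predict(box=np.array([260, 180, 380, 300]))

# ... or with a single positive point click on the pupil centre.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),                      # 1 = foreground click
)
pupil_mask = masks[np.argmax(scores)]                # keep the best-scored mask
```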

Adversarial Preference Optimization

  • paper_url: http://arxiv.org/abs/2311.08045
  • repo_url: None
  • paper_authors: Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Nan Du
  • for: Aligning large language models (LLMs) with human preferences to improve interaction quality, without the heavy annotation cost of re-labeling model-generated samples.
  • methods: An Adversarial Preference Optimization (APO) framework in which the LLM agent and the preference model update alternately via a min-max game, letting the system self-adapt to the distribution gap between model-generated samples and human-preferred responses.
  • results: Experiments show APO improves the LLM's helpfulness and harmlessness compared with rejection sampling baselines.
    Abstract Human preference alignment is a crucial training step to improve the interaction quality of large language models (LLMs). Existing aligning methods depend on manually annotated preference data to guide the LLM optimization directions. However, in practice, continuously updating LLMs raises a distribution gap between model-generated samples and human-preferred responses, which hinders model fine-tuning efficiency. To mitigate this issue, previous methods require additional preference annotation on generated samples to adapt the shifted distribution, which consumes a large amount of annotation resources. Targeting more efficient human preference optimization, we propose an adversarial preference optimization (APO) framework, where the LLM agent and the preference model update alternatively via a min-max game. Without additional annotation, our APO method can make a self-adaption to the generation distribution gap through the adversarial learning process. In experiments, we empirically verify the effectiveness of APO in improving LLM's helpfulness and harmlessness compared with rejection sampling baselines.
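
A minimal sketch of the alternating min-max loop. All object methods here (`generate`, `fit`, `rl_update`, `score`) are hypothetical placeholders standing in for an LLM trainer and a reward/preference model, not the paper's code.

```python
def adversarial_preference_optimization(llm, reward_model, prompts, rounds=3):
    """Alternate updates: sharpen the preference model against fresh LLM
    samples, then fine-tune the LLM against the adapted reward."""
    for _ in range(rounds):
        samples = [llm.generate(p) for p in prompts]

        # Max step: update the preference model to separate stored
        # human-preferred responses from the current model samples.
        reward_model.fit(positives=[p.gold_response for p in prompts],
                         negatives=samples)

        # Min step: update the LLM (e.g., via an RL objective) against the
        # refreshed reward, closing the distribution gap with no new labels.
        llm.rl_update(prompts, reward_fn=reward_model.score)
    return llm
```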

Data-driven building energy efficiency prediction based on envelope heat losses using physics-informed neural networks

  • paper_url: http://arxiv.org/abs/2311.08035
  • repo_url: None
  • paper_authors: Vasilis Michalakopoulos, Sotiris Pelekis, Giorgos Kormpakis, Vagelis Karakolis, Spiros Mouzakitis, Dimitris Askounis
  • for: Predicting the energy performance of residential buildings from the heat losses of their envelope components, enabling automated, data-driven energy-efficiency prediction from basic building properties.
  • methods: A novel physics-informed neural network is trained on datasets covering general building information, audited characteristics, and heating energy consumption; its outputs are the structural components and thermal properties that form the basic elements of an energy performance certificate (EPC), and a physics-based function computes the building's energy consumption from heat losses and augments the loss function of the deep learning model.
  • results: On a real case study of 256 buildings in Riga, Latvia, the model achieves promising prediction accuracy, demonstrating the viability of this approach as an alternative to exhaustive human-led energy audits.
    Abstract The analytical prediction of building energy performance in residential buildings based on the heat losses of its individual envelope components is a challenging task. It is worth noting that this field is still in its infancy, with relatively limited research conducted in this specific area to date, especially when it comes for data-driven approaches. In this paper we introduce a novel physics-informed neural network model for addressing this problem. Through the employment of unexposed datasets that encompass general building information, audited characteristics, and heating energy consumption, we feed the deep learning model with general building information, while the model's output consists of the structural components and several thermal properties that are in fact the basic elements of an energy performance certificate (EPC). On top of this neural network, a function, based on physics equations, calculates the energy consumption of the building based on heat losses and enhances the loss function of the deep learning model. This methodology is tested on a real case study for 256 buildings located in Riga, Latvia. Our investigation comes up with promising results in terms of prediction accuracy, paving the way for automated, and data-driven energy efficiency performance prediction based on basic properties of the building, contrary to exhaustive energy efficiency audits led by humans, which are the current status quo.
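
A minimal sketch of combining a data-driven network with a physics term. The steady-state transmission-loss formula Q = sum_i(U_i * A_i) * dT * t used below is an illustrative assumption, not necessarily the paper's exact equations.

```python
import torch
import torch.nn as nn

def physics_loss(pred_u_values, areas, delta_t, hours, measured_kwh):
    """Physics term: energy lost through the envelope, compared against
    metered heating consumption. Units: U in W/(m^2 K), A in m^2."""
    q_watts = (pred_u_values * areas).sum(dim=1) * delta_t   # W
    q_kwh = q_watts * hours / 1000.0                          # kWh over the period
    return nn.functional.mse_loss(q_kwh, measured_kwh)

# Softplus keeps predicted U-values positive, as physics requires.
net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4), nn.Softplus())

features = torch.randn(16, 8)            # general building information
areas = torch.rand(16, 4) * 200          # envelope component areas, m^2
measured = torch.rand(16) * 20000        # audited heating consumption, kWh

u_values = net(features)                 # predicted EPC-style U-values
loss = physics_loss(u_values, areas, delta_t=20.0, hours=5000.0, measured_kwh=measured)
loss.backward()                          # physics term steers the network
```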

Two-Stage Predict+Optimize for Mixed Integer Linear Programs with Unknown Parameters in Constraints

  • paper_url: http://arxiv.org/abs/2311.08022
  • repo_url: https://github.com/elizabethxyhu/neurips_two_stage_predict-optimize
  • paper_authors: Xinyi Hu, Jasper C. H. Lee, Jimmy H. M. Lee
  • for: End-to-end training of supervised learning models for constrained optimization problems in which some parameters are unknown at solving time and must be predicted from features, including when the unknowns appear in the constraints.
  • methods: A new Two-Stage Predict+Optimize framework, simpler and more powerful than the prior adaptation, together with a training algorithm applicable to all mixed integer linear programs.
  • results: Experimental results demonstrate superior prediction performance over classical and state-of-the-art methods.
    Abstract Consider the setting of constrained optimization, with some parameters unknown at solving time and requiring prediction from relevant features. Predict+Optimize is a recent framework for end-to-end training supervised learning models for such predictions, incorporating information about the optimization problem in the training process in order to yield better predictions in terms of the quality of the predicted solution under the true parameters. Almost all prior works have focused on the special case where the unknowns appear only in the optimization objective and not the constraints. Hu et al.~proposed the first adaptation of Predict+Optimize to handle unknowns appearing in constraints, but the framework has somewhat ad-hoc elements, and they provided a training algorithm only for covering and packing linear programs. In this work, we give a new \emph{simpler} and \emph{more powerful} framework called \emph{Two-Stage Predict+Optimize}, which we believe should be the canonical framework for the Predict+Optimize setting. We also give a training algorithm usable for all mixed integer linear programs, vastly generalizing the applicability of the framework. Experimental results demonstrate the superior prediction performance of our training framework over all classical and state-of-the-art methods.

Distantly-Supervised Named Entity Recognition with Uncertainty-aware Teacher Learning and Student-student Collaborative Learning

  • paper_url: http://arxiv.org/abs/2311.08010
  • repo_url: None
  • paper_authors: Helan Hu, Shuzheng Si, Haozhe Zhao, Shuang Zeng, Kaikai An, Zefan Cai, Baobao Chang
  • for: Improving the accuracy and robustness of Distantly-Supervised Named Entity Recognition (DS-NER), which relieves the annotation burden but suffers from label noise.
  • methods: Uncertainty-aware Teacher Learning uses prediction uncertainty to guide pseudo-label selection, reducing the number of incorrect pseudo-labels in self-training; Student-student Collaborative Learning transfers reliable labels between two student networks rather than relying entirely on the teacher's pseudo-labels, allowing mislabeled samples to be fully explored instead of simply filtered out.
  • results: Extensive experiments on five DS-NER datasets show the method outperforms state-of-the-art teacher-student methods.
    Abstract Distantly-Supervised Named Entity Recognition (DS-NER) effectively alleviates the burden of annotation, but meanwhile suffers from the label noise. Recent works attempt to adopt the teacher-student framework to gradually refine the training labels and improve the overall robustness. However, we argue that these teacher-student methods achieve limited performance because poor network calibration produces incorrectly pseudo-labeled samples, leading to error propagation. Therefore, we attempt to mitigate this issue by proposing: (1) Uncertainty-aware Teacher Learning that leverages the prediction uncertainty to guide the selection of pseudo-labels, avoiding the number of incorrect pseudo-labels in the self-training stage. (2) Student-student Collaborative Learning that allows the transfer of reliable labels between two student networks instead of completely relying on all pseudo-labels from its teacher. Meanwhile, this approach allows a full exploration of mislabeled samples rather than simply filtering unreliable pseudo-labeled samples. Extensive experimental results on five DS-NER datasets demonstrate that our method is superior to state-of-the-art teacher-student methods.
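
The uncertainty-aware selection step can be sketched as entropy thresholding over teacher predictions. The threshold rule below is an illustrative assumption, not the paper's exact criterion.

```python
import torch

def select_confident_pseudo_labels(token_probs, threshold=0.2):
    """Keep a token's pseudo-label only when the teacher's predictive
    entropy is low, so fewer wrong labels enter self-training."""
    entropy = -(token_probs * token_probs.clamp_min(1e-12).log()).sum(dim=-1)
    keep = entropy < threshold                       # (batch, seq_len) mask
    pseudo_labels = token_probs.argmax(dim=-1)
    pseudo_labels[~keep] = -100                      # ignored by the NER loss
    return pseudo_labels, keep

teacher_probs = torch.softmax(torch.randn(2, 6, 5), dim=-1)  # (B, T, num_tags)
labels, mask = select_confident_pseudo_labels(teacher_probs)
print(labels, mask.float().mean())                   # kept fraction of tokens
```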

Iterative missing value imputation based on feature importance

  • paper_url: http://arxiv.org/abs/2311.08005
  • repo_url: None
  • paper_authors: Cong Guo, Chun Liu, Wei Yang
  • for: addresses the problem of missing values in datasets, which can reduce the accuracy of classification tasks and increase processing difficulty.
  • methods: proposes an imputation method that considers feature importance, which iteratively performs matrix completion and feature importance learning.
  • results: consistently outperforms five existing imputation algorithms on synthetic and real-world datasets with different types of missing values.
    Abstract Many datasets suffer from missing values due to various reasons,which not only increases the processing difficulty of related tasks but also reduces the accuracy of classification. To address this problem, the mainstream approach is to use missing value imputation to complete the dataset. Existing imputation methods estimate the missing parts based on the observed values in the original feature space, and they treat all features as equally important during data completion, while in fact different features have different importance. Therefore, we have designed an imputation method that considers feature importance. This algorithm iteratively performs matrix completion and feature importance learning, and specifically, matrix completion is based on a filling loss that incorporates feature importance. Our experimental analysis involves three types of datasets: synthetic datasets with different noisy features and missing values, real-world datasets with artificially generated missing values, and real-world datasets originally containing missing values. The results on these datasets consistently show that the proposed method outperforms the existing five imputation algorithms.To the best of our knowledge, this is the first work that considers feature importance in the imputation model.
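
A minimal sketch of alternating between importance learning and imputation. The visiting order and random-forest learners are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def importance_weighted_impute(X, y, n_iter=5):
    """Alternate between (a) learning feature importances on the current
    completion and (b) re-imputing columns, visiting important features
    first so their estimates anchor the rest."""
    mask = np.isnan(X)
    X_filled = np.where(mask, np.nanmean(X, axis=0), X)   # mean-fill to start
    for _ in range(n_iter):
        importances = RandomForestClassifier(n_estimators=50).fit(
            X_filled, y).feature_importances_
        for j in np.argsort(-importances):           # most important column first
            if not mask[:, j].any():
                continue
            others = np.delete(X_filled, j, axis=1)
            reg = RandomForestRegressor(n_estimators=50)
            reg.fit(others[~mask[:, j]], X_filled[~mask[:, j], j])
            X_filled[mask[:, j], j] = reg.predict(others[mask[:, j]])
    return X_filled

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)); y = (X[:, 0] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan                # inject missing values
print(np.isnan(importance_weighted_impute(X, y)).sum())  # 0 after imputation
```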

TempTabQA: Temporal Question Answering for Semi-Structured Tables

  • paper_url: http://arxiv.org/abs/2311.08002
  • repo_url: None
  • paper_authors: Vivek Gupta, Pranshu Kandoi, Mahek Bhavesh Vora, Shuo Zhang, Yujie He, Ridho Reinanda, Vivek Srikumar
  • for: Testing whether current NLP systems can reason about the temporal information, implicit or explicit, found in semi-structured data such as Infobox tables.
  • methods: A new task, temporal question answering on semi-structured tables, with the TempTabQA dataset of 11,454 question-answer pairs extracted from 1,208 Wikipedia Infobox tables spanning more than 90 distinct domains, used to evaluate several state-of-the-art models.
  • results: Even the top-performing LLMs lag behind human performance by more than 13.5 F1 points, suggesting TempTabQA can serve as a challenging benchmark for improving the temporal reasoning capabilities of NLP models.
    Abstract Semi-structured data, such as Infobox tables, often include temporal information about entities, either implicitly or explicitly. Can current NLP systems reason about such information in semi-structured tables? To tackle this question, we introduce the task of temporal question answering on semi-structured tables. We present a dataset, TempTabQA, which comprises 11,454 question-answer pairs extracted from 1,208 Wikipedia Infobox tables spanning more than 90 distinct domains. Using this dataset, we evaluate several state-of-the-art models for temporal reasoning. We observe that even the top-performing LLMs lag behind human performance by more than 13.5 F1 points. Given these results, our dataset has the potential to serve as a challenging benchmark to improve the temporal reasoning capabilities of NLP models.

LiPar: A Lightweight Parallel Learning Model for Practical In-Vehicle Network Intrusion Detection

  • paper_url: http://arxiv.org/abs/2311.08000
  • repo_url: https://github.com/wangkai-tech23/LiPar
  • paper_authors: Aiheng Zhang, Kai Wang, Bailing Wang, Yulei Wu
  • for: Improving the security of in-vehicle networks in intelligent transportation systems, in particular intrusion detection for the Controller Area Network (CAN).
  • methods: A lightweight parallel neural network structure, LiPar, that allocates task loads across multiple electronic control units (ECUs); it combines multi-dimensional branch convolution networks, spatial and temporal feature fusion learning, and a resource adaptation algorithm.
  • results: Experiments show LiPar offers strong detection performance, running efficiency, and a lightweight model size, making it practical for the in-vehicle environment and effective at protecting CAN bus security.
    Abstract With the development of intelligent transportation systems, vehicles are exposed to a complex network environment. As the main network of in-vehicle networks, the controller area network (CAN) has many potential security hazards, resulting in higher requirements for intrusion detection systems to ensure safety. Among intrusion detection technologies, methods based on deep learning work best without prior expert knowledge. However, they all have a large model size and rely on cloud computing, and are therefore not suitable to be installed on the in-vehicle network. Therefore, we propose a lightweight parallel neural network structure, LiPar, to allocate task loads to multiple electronic control units (ECU). The LiPar model consists of multi-dimensional branch convolution networks, spatial and temporal feature fusion learning, and a resource adaptation algorithm. Through experiments, we prove that LiPar has great detection performance, running efficiency, and lightweight model size, which can be well adapted to the in-vehicle environment practically and protect the in-vehicle CAN bus security.

Probable Object Location (POLo) Score Estimation for Efficient Object Goal Navigation

  • paper_url: http://arxiv.org/abs/2311.07992
  • repo_url: None
  • paper_authors: Jiaming Wang, Harold Soh
  • for: Making object search by autonomous robots more efficient, particularly in unexplored environments.
  • methods: A Probable Object Location (POLo) score computed over a 3D object probability map lets the agent make data-driven search decisions, and a neural network, POLoNet, approximates the computationally intensive POLo score.
  • results: In the first phase of the OVMM 2023 challenge, an agent equipped with POLoNet significantly outperforms a range of baselines, including end-to-end reinforcement learning methods and traditional map-based strategies.
    Abstract To advance the field of autonomous robotics, particularly in object search tasks within unexplored environments, we introduce a novel framework centered around the Probable Object Location (POLo) score. Utilizing a 3D object probability map, the POLo score allows the agent to make data-driven decisions for efficient object search. We further enhance the framework's practicality by introducing POLoNet, a neural network trained to approximate the computationally intensive POLo score. Our approach addresses critical limitations of both end-to-end reinforcement learning methods, which suffer from memory decay over long-horizon tasks, and traditional map-based methods that neglect visibility constraints. Our experiments, involving the first phase of the OVMM 2023 challenge, demonstrate that an agent equipped with POLoNet significantly outperforms a range of baseline methods, including end-to-end RL techniques and prior map-based strategies. To provide a comprehensive evaluation, we introduce new performance metrics that offer insights into the efficiency and effectiveness of various agents in object goal navigation.
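
One way to read the POLo idea is as scoring candidate viewpoints by the unobserved object probability mass they would reveal. The sketch below is an illustrative reading of that idea, not the paper's exact formula.

```python
import numpy as np

def polo_score(prob_map, observed, frontier_voxels):
    """Sum the object probabilities of the not-yet-observed voxels a
    candidate viewpoint would reveal, drawing the agent toward likely
    object locations."""
    idx = (frontier_voxels[:, 0], frontier_voxels[:, 1], frontier_voxels[:, 2])
    p = prob_map[idx]
    seen = observed[idx]
    return float(p[~seen].sum())

prob_map = np.random.rand(20, 20, 5)          # 3D object probability map
observed = np.zeros((20, 20, 5), dtype=bool)  # nothing observed yet
candidate = np.argwhere(prob_map > 0.0)[:50]  # voxels visible from a viewpoint
print(polo_score(prob_map, observed, candidate))
```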

A Survey on Language Models for Code

  • paper_url: http://arxiv.org/abs/2311.07989
  • repo_url: https://github.com/codefuse-ai/awesome-code-llm
  • paper_authors: Ziyin Zhang, Chaoyu Chen, Bingchang Liu, Cong Liao, Zi Gong, Hang Yu, Jianguo Li, Rui Wang
  • for: A systematic review of recent advances in code processing with language models, covering 50+ models, 30+ evaluation tasks, and 500 related works.
  • methods: Code processing models are divided into general language models, represented by the GPT family, and specialized models pre-trained on code, often with tailored objectives; the survey traces the historical transition of code modeling from statistical models and RNNs to pretrained Transformers and LLMs, the same course taken by NLP.
  • results: The survey discusses code-specific features such as ASTs, CFGs, and unit tests and their application in training code language models, identifies key challenges and potential future directions, and is kept open and updated on the GitHub repository.
    Abstract In this work we systematically review the recent advancements in code processing with language models, covering 50+ models, 30+ evaluation tasks, and 500 related works. We break down code processing models into general language models represented by the GPT family and specialized models that are specifically pretrained on code, often with tailored objectives. We discuss the relations and differences between these models, and highlight the historical transition of code modeling from statistical models and RNNs to pretrained Transformers and LLMs, which is exactly the same course that had been taken by NLP. We also discuss code-specific features such as AST, CFG, and unit tests, along with their application in training code language models, and identify key challenges and potential future directions in this domain. We keep the survey open and updated on github repository at https://github.com/codefuse-ai/Awesome-Code-LLM.

How good are Large Language Models on African Languages?

  • paper_url: http://arxiv.org/abs/2311.07978
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Jessica Ojo, Kelechi Ogueji, Pontus Stenetorp, David I. Adelani
  • for: This paper is written to analyze the performance of three popular large language models (mT0, LLaMa 2, and GPT-4) on 30 African languages across five tasks (news topic classification, sentiment classification, machine translation, question answering, and named entity recognition).
  • methods: The paper uses these three language models as-is or fine-tunes them on African languages to evaluate their performance on various tasks.
  • results: The results show that all three language models perform below par on African languages, with a large gap to high-resource languages like English on most tasks. GPT-4 performs well on classification tasks but poorly on generative tasks like machine translation; mT0 has the best overall performance on cross-lingual question answering, outperforming fine-tuned mT5 and GPT-4 on African languages; and LLaMa 2 performs worst due to its limited multilingual capabilities and English-centric pre-training corpus.
    Abstract Recent advancements in natural language processing have led to the proliferation of large language models (LLMs). These models have been shown to yield good performance, using in-context learning, even on unseen tasks and languages. Additionally, they have been widely adopted as language-model-as-a-service commercial APIs like GPT-4 API. However, their performance on African languages is largely unknown. We present an analysis of three popular large language models (mT0, LLaMa 2, and GPT-4) on five tasks (news topic classification, sentiment classification, machine translation, question answering, and named entity recognition) across 30 African languages, spanning different language families and geographical regions. Our results suggest that all LLMs produce below-par performance on African languages, and there is a large gap in performance compared to high-resource languages like English most tasks. We find that GPT-4 has an average or impressive performance on classification tasks but very poor results on generative tasks like machine translation. Surprisingly, we find that mT0 had the best overall on cross-lingual QA, better than the state-of-the-art supervised model (i.e. fine-tuned mT5) and GPT-4 on African languages. Overall, LLaMa 2 records the worst performance due to its limited multilingual capabilities and English-centric pre-training corpus. In general, our findings present a call-to-action to ensure African languages are well represented in large language models, given their growing popularity.

Uplift Modeling based on Graph Neural Network Combined with Causal Knowledge

  • paper_url: http://arxiv.org/abs/2311.08434
  • repo_url: https://github.com/xy2119/Causal_Knowledge_GNN
  • paper_authors: Haowen Wang, Xinyan Ye, Yangze Zhou, Zhiyi Zhang, Longhan Zhang, Jing Jiang
  • for: Proposing an uplift model based on graph neural networks for evaluating treatment effects, identifying both the treatment with the greatest benefit and the clients likely to respond favorably to it.
  • methods: A causal representation technique based on CATE (conditional average treatment effect) estimation and adjacency matrix structure learning, combined with a more scalable uplift modeling framework based on graph convolution networks that incorporates causal knowledge.
  • results: The method predicts uplift values effectively, with small errors on typical simulated data, and its effectiveness is verified on real industry marketing data.
    Abstract Uplift modeling is a fundamental component of marketing effect modeling, which is commonly employed to evaluate the effects of treatments on outcomes. Through uplift modeling, we can identify the treatment with the greatest benefit. On the other side, we can identify clients who are likely to make favorable decisions in response to a certain treatment. In the past, uplift modeling approaches relied heavily on the difference-in-difference (DID) architecture, paired with a machine learning model as the estimation learner, while neglecting the link and confidential information between features. We proposed a framework based on graph neural networks that combine causal knowledge with an estimate of uplift value. Firstly, we presented a causal representation technique based on CATE (conditional average treatment effect) estimation and adjacency matrix structure learning. Secondly, we suggested a more scalable uplift modeling framework based on graph convolution networks for combining causal knowledge. Our findings demonstrate that this method works effectively for predicting uplift values, with small errors in typical simulated data, and its effectiveness has been verified in actual industry marketing data.
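
For readers new to uplift modeling, the quantity being estimated can be illustrated with a classic two-model (T-learner) baseline; the paper's contribution replaces this kind of baseline with a causal-knowledge-aware GNN.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def t_learner_uplift(X, treatment, y, X_new):
    """Fit one outcome model per arm and take the probability difference
    as the estimated uplift (an estimate of the CATE)."""
    m1 = GradientBoostingClassifier().fit(X[treatment == 1], y[treatment == 1])
    m0 = GradientBoostingClassifier().fit(X[treatment == 0], y[treatment == 0])
    return m1.predict_proba(X_new)[:, 1] - m0.predict_proba(X_new)[:, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
t = rng.integers(0, 2, size=500)                 # random treatment assignment
y = ((X[:, 0] + 0.5 * t + rng.normal(scale=0.5, size=500)) > 0).astype(int)
print(t_learner_uplift(X, t, y, X[:5]))          # positive values: net benefit
```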

Deep Learning-Based Object Detection in Maritime Unmanned Aerial Vehicle Imagery: Review and Experimental Comparisons

  • paper_url: http://arxiv.org/abs/2311.07955
  • repo_url: None
  • paper_authors: Chenjie Zhao, Ryan Wen Liu, Jingxiang Qu, Ruobin Gao
  • for: Reviewing object detection for maritime unmanned aerial vehicles (UAVs), an increasingly important capability in the maritime industry and ocean engineering.
  • methods: The paper first summarizes four challenges of object detection on maritime UAVs: object feature diversity, device limitations, maritime environment variability, and dataset scarcity. It then surveys computational methods for improving detection performance, including scale-aware methods, small object detection, view-aware methods, rotated object detection, and lightweight methods.
  • results: The paper reviews UAV aerial image/video datasets, proposes a maritime UAV aerial dataset named MS2ship for ship detection, conducts experiments evaluating the performance and robustness of detection methods on maritime datasets, and closes with a discussion and outlook on future work. The MS2ship dataset is available at https://github.com/zcj234/MS2ship.
    Abstract With the advancement of maritime unmanned aerial vehicles (UAVs) and deep learning technologies, the application of UAV-based object detection has become increasingly significant in the fields of maritime industry and ocean engineering. Endowed with intelligent sensing capabilities, the maritime UAVs enable effective and efficient maritime surveillance. To further promote the development of maritime UAV-based object detection, this paper provides a comprehensive review of challenges, relative methods, and UAV aerial datasets. Specifically, in this work, we first briefly summarize four challenges for object detection on maritime UAVs, i.e., object feature diversity, device limitation, maritime environment variability, and dataset scarcity. We then focus on computational methods to improve maritime UAV-based object detection performance in terms of scale-aware, small object detection, view-aware, rotated object detection, lightweight methods, and others. Next, we review the UAV aerial image/video datasets and propose a maritime UAV aerial dataset named MS2ship for ship detection. Furthermore, we conduct a series of experiments to present the performance evaluation and robustness analysis of object detection methods on maritime datasets. Eventually, we give the discussion and outlook on future works for maritime UAV-based object detection. The MS2ship dataset is available at \href{https://github.com/zcj234/MS2ship}{https://github.com/zcj234/MS2ship}.

A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning

  • paper_url: http://arxiv.org/abs/2311.07954
  • repo_url: None
  • paper_authors: Ruixin Hong, Hongming Zhang, Xinyu Pang, Dong Yu, Changshui Zhang
  • for: Investigating whether self-verification can improve the logical reasoning of large language models (LLMs), focusing on their ability to identify logical fallacies accurately.
  • methods: A dataset, FALLACIES, containing 232 types of reasoning fallacies organized in a hierarchical taxonomy, used for exhaustive experiments that analyze the verification abilities of a series of models.
  • results: Existing LLMs can struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods; the paper draws on these observations to offer suggestions for future research and practical applications.
    Abstract Logical reasoning has been an ongoing pursuit in the field of AI. Despite significant advancements made by large language models (LLMs), they still struggle with complex logical reasoning problems. To enhance reasoning performance, one promising direction is scalable oversight, which requires LLMs to identify their own errors and then improve by themselves. Various self-verification methods have been proposed in pursuit of this goal. Nevertheless, whether existing models understand their own errors well is still under investigation. In this paper, we take a closer look at the self-verification abilities of LLMs in the context of logical reasoning, focusing on their ability to identify logical fallacies accurately. We introduce a dataset, FALLACIES, containing 232 types of reasoning fallacies categorized in a hierarchical taxonomy. By conducting exhaustive experiments on FALLACIES, we obtain comprehensive and detailed analyses of a series of models on their verification abilities. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods. Drawing from these observations, we offer suggestions for future research and practical applications of self-verification methods.

The Impact of Adversarial Node Placement in Decentralized Federated Learning Networks

  • paper_url: http://arxiv.org/abs/2311.07946
  • repo_url: https://github.com/adampi210/maxspanfl_atck_code_data
  • paper_authors: Adam Piaseczny, Eric Ruzomberka, Rohit Parasnis, Christopher G. Brinton
  • for: Analyzing the impact of adversarial node placement in decentralized federated learning (FL) networks when adversaries can jointly coordinate their placement, and proposing a stronger attack algorithm.
  • methods: Two baseline placement strategies, random placement and network centrality-based placement, plus a novel attack algorithm that prioritizes adversarial spread over adversarial centrality by maximizing the average network distance between adversaries.
  • results: The new attack significantly degrades key performance metrics such as testing accuracy, outperforming the baseline frameworks by between 9% and 66.5% for the considered setups, exposing vulnerabilities of decentralized FL systems and motivating more secure and robust frameworks.
    Abstract As Federated Learning (FL) grows in popularity, new decentralized frameworks are becoming widespread. These frameworks leverage the benefits of decentralized environments to enable fast and energy-efficient inter-device communication. However, this growing popularity also intensifies the need for robust security measures. While existing research has explored various aspects of FL security, the role of adversarial node placement in decentralized networks remains largely unexplored. This paper addresses this gap by analyzing the performance of decentralized FL for various adversarial placement strategies when adversaries can jointly coordinate their placement within a network. We establish two baseline strategies for placing adversarial node: random placement and network centrality-based placement. Building on this foundation, we propose a novel attack algorithm that prioritizes adversarial spread over adversarial centrality by maximizing the average network distance between adversaries. We show that the new attack algorithm significantly impacts key performance metrics such as testing accuracy, outperforming the baseline frameworks by between 9% and 66.5% for the considered setups. Our findings provide valuable insights into the vulnerabilities of decentralized FL systems, setting the stage for future research aimed at developing more secure and robust decentralized FL frameworks.
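
The spread-maximizing placement can be approximated greedily, as sketched below with networkx. The greedy heuristic and seed choice are illustrative assumptions, not necessarily the paper's exact algorithm.

```python
import networkx as nx

def spread_placement(graph, n_adversaries):
    """Repeatedly add the node that maximizes the average shortest-path
    distance to the adversaries chosen so far."""
    dist = dict(nx.all_pairs_shortest_path_length(graph))
    chosen = [max(graph, key=graph.degree)]          # arbitrary seed node
    while len(chosen) < n_adversaries:
        best = max(
            (v for v in graph if v not in chosen),
            key=lambda v: sum(dist[v][c] for c in chosen) / len(chosen),
        )
        chosen.append(best)
    return chosen

# Connected topology stand-in for a decentralized FL communication graph.
G = nx.connected_watts_strogatz_graph(30, 4, 0.3, seed=1)
print(spread_placement(G, n_adversaries=3))
```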

Non-autoregressive Machine Translation with Probabilistic Context-free Grammar

  • paper_url: http://arxiv.org/abs/2311.07941
  • repo_url: https://github.com/ictnlp/pcfg-nat
  • paper_authors: Shangtong Gui, Chenze Shao, Zhengrui Ma, Xishan Zhang, Yunji Chen, Yang Feng
  • for: Accelerating the inference of neural machine translation with non-autoregressive models while closing their quality gap to autoregressive models.
  • methods: PCFG-NAT uses a specially designed Probabilistic Context-Free Grammar (PCFG) to enhance the ability of non-autoregressive Transformers to capture complex dependencies among output tokens.
  • results: On major machine translation benchmarks, PCFG-NAT further narrows the translation-quality gap between NAT and AT models, and it enables a deeper understanding of the generated sentences, addressing the lack of explainability in neural machine translation.
    Abstract Non-autoregressive Transformer(NAT) significantly accelerates the inference of neural machine translation. However, conventional NAT models suffer from limited expression power and performance degradation compared to autoregressive (AT) models due to the assumption of conditional independence among target tokens. To address these limitations, we propose a novel approach called PCFG-NAT, which leverages a specially designed Probabilistic Context-Free Grammar (PCFG) to enhance the ability of NAT models to capture complex dependencies among output tokens. Experimental results on major machine translation benchmarks demonstrate that PCFG-NAT further narrows the gap in translation quality between NAT and AT models. Moreover, PCFG-NAT facilitates a deeper understanding of the generated sentences, addressing the lack of satisfactory explainability in neural machine translation.Code is publicly available at https://github.com/ictnlp/PCFG-NAT.

Towards Improving Robustness Against Common Corruptions in Object Detectors Using Adversarial Contrastive Learning

  • paper_url: http://arxiv.org/abs/2311.07928
  • repo_url: None
  • paper_authors: Shashank Kotyan, Danilo Vasconcellos Vargas
  • for: Strengthening the robustness of neural networks against both adversarial attacks and common corruptions, which matters most in safety-critical applications such as autonomous driving.
  • methods: An adversarial contrastive learning framework that generates instance-wise adversarial examples and optimizes a contrastive loss, so that clean samples and their adversarial counterparts map to similar representations that resist adversarial perturbations and real-world distortions.
  • results: Experiments indicate the approach improves robustness to adversarial attacks and common corruptions simultaneously, supporting reliable deployment in unpredictable real-world conditions.
    Abstract Neural networks have revolutionized various domains, exhibiting remarkable accuracy in tasks like natural language processing and computer vision. However, their vulnerability to slight alterations in input samples poses challenges, particularly in safety-critical applications like autonomous driving. Current approaches, such as introducing distortions during training, fall short in addressing unforeseen corruptions. This paper proposes an innovative adversarial contrastive learning framework to enhance neural network robustness simultaneously against adversarial attacks and common corruptions. By generating instance-wise adversarial examples and optimizing contrastive loss, our method fosters representations that resist adversarial perturbations and remain robust in real-world scenarios. Subsequent contrastive learning then strengthens the similarity between clean samples and their adversarial counterparts, fostering representations resistant to both adversarial attacks and common distortions. By focusing on improving performance under adversarial and real-world conditions, our approach aims to bolster the robustness of neural networks in safety-critical applications, such as autonomous vehicles navigating unpredictable weather conditions. We anticipate that this framework will contribute to advancing the reliability of neural networks in challenging environments, facilitating their widespread adoption in mission-critical scenarios.
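
A minimal sketch of pairing instance-wise adversarial examples with a contrastive objective. The FGSM attack and NT-Xent-style loss below are standard stand-ins, since the paper's exact choices are not specified here.

```python
import torch
import torch.nn.functional as F

def fgsm_example(encoder, x, eps=4 / 255):
    """Instance-wise adversarial view: one FGSM step that pushes the
    embedding of x away from its clean embedding."""
    x_adv = x.clone().requires_grad_(True)
    loss = -F.cosine_similarity(encoder(x_adv), encoder(x).detach()).mean()
    loss.backward()
    return (x + eps * x_adv.grad.sign()).clamp(0, 1).detach()

def adversarial_contrastive_loss(encoder, x, temperature=0.5):
    """Treat each clean sample and its adversarial counterpart as a
    positive pair in an NT-Xent-style contrastive loss."""
    B = x.size(0)
    z1 = F.normalize(encoder(x), dim=1)
    z2 = F.normalize(encoder(fgsm_example(encoder, x)), dim=1)
    z = torch.cat([z1, z2])                                  # (2B, d)
    sim = z @ z.t() / temperature
    sim = sim.masked_fill(torch.eye(2 * B, dtype=torch.bool), float("-inf"))
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
x = torch.rand(8, 3, 32, 32)
print(adversarial_contrastive_loss(encoder, x))
```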

Brain-Driven Representation Learning Based on Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.07925
  • repo_url: None
  • paper_authors: Soowon Kim, Seo-Hyun Lee, Young-Eun Lee, Ji-Won Lee, Ji-Ha Park, Seong-Whan Lee
  • for: Analyzing speech-related EEG signals, whose intricate temporal and spatial structure and various noise factors make interpretation difficult.
  • methods: Denoising diffusion probabilistic models (DDPMs) combined with a conditional autoencoder.
  • results: The new approach considerably outperforms traditional machine learning algorithms and established baseline models in accuracy, pointing to DDPMs as a promising tool for brain-computer interfaces tailored to spoken communication.
    Abstract Interpreting EEG signals linked to spoken language presents a complex challenge, given the data's intricate temporal and spatial attributes, as well as the various noise factors. Denoising diffusion probabilistic models (DDPMs), which have recently gained prominence in diverse areas for their capabilities in representation learning, are explored in our research as a means to address this issue. Using DDPMs in conjunction with a conditional autoencoder, our new approach considerably outperforms traditional machine learning algorithms and established baseline models in accuracy. Our results highlight the potential of DDPMs as a sophisticated computational method for the analysis of speech-related EEG signals. This could lead to significant advances in brain-computer interfaces tailored for spoken communication.

Smart Home Goal Feature Model – A guide to support Smart Homes for Ageing in Place

  • paper_url: http://arxiv.org/abs/2311.09248
  • repo_url: None
  • paper_authors: Irini Logothetis, Priya Rani, Shangeetha Sivasothy, Rajesh Vasa, Kon Mouzakis
  • for: This paper provides an overview of smart home technologies that support ageing in place, and offers a structured approach to design, develop, and deploy smart homes for the elderly based on their personalized needs.
  • methods: The paper synthesizes prior knowledge and creates a Smart Home Goal Feature Model (SHGFM) to resolve heuristic approaches used by Subject Matter Experts (SMEs) and healthcare researchers in adapting smart homes for the elderly.
  • results: The SHGFM provides SMEs with the ability to establish goals and identify features to set up strategies for designing, developing, and deploying smart homes that meet the needs of the elderly.
    Abstract Smart technologies are significant in supporting ageing in place for elderly. Leveraging Artificial Intelligence (AI) and Machine Learning (ML), it provides peace of mind, enabling the elderly to continue living independently. Elderly use smart technologies for entertainment and social interactions, this can be extended to provide safety and monitor health and environmental conditions, detect emergencies and notify informal and formal caregivers when care is needed. This paper provides an overview of the smart home technologies commercially available to support ageing in place, the advantages and challenges of smart home technologies, and their usability from elderlys perspective. Synthesizing prior knowledge, we created a structured Smart Home Goal Feature Model (SHGFM) to resolve heuristic approaches used by the Subject Matter Experts (SMEs) at aged care facilities and healthcare researchers in adapting smart homes. The SHGFM provides SMEs the ability to (i) establish goals and (ii) identify features to set up strategies to design, develop and deploy smart homes for the elderly based on personalised needs. Our model provides guidance to healthcare researchers and aged care industries to set up smart homes based on the needs of elderly, by defining a set of goals at different levels mapped to a different set of features.

Instruction-Following Evaluation for Large Language Models

  • paper_url: http://arxiv.org/abs/2311.07911
  • repo_url: https://github.com/google-research/google-research
  • paper_authors: Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, Le Hou
  • for: This paper aims to evaluate the ability of large language models (LLMs) to follow natural language instructions in a standardized and objective manner.
  • methods: The paper introduces a new evaluation benchmark called Instruction-Following Eval (IFEval) that focuses on a set of verifiable instructions, such as writing in more than 400 words or mentioning the keyword of AI at least three times.
  • results: The authors evaluate two widely available LLMs on the market using the IFEval benchmark and show the results, which can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval.
    Abstract One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
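
What makes the benchmark objective and reproducible is that each verifiable instruction can be checked by a simple program. Two illustrative checkers for the instructions quoted in the abstract (function names are ours; the full rule set lives in the released code):

```python
import re

def check_min_words(response: str, n: int = 400) -> bool:
    """Verifies 'write in more than 400 words'."""
    return len(response.split()) > n

def check_keyword_mentions(response: str, keyword: str = "AI", n: int = 3) -> bool:
    """Verifies 'mention the keyword of AI at least 3 times'."""
    return len(re.findall(re.escape(keyword), response)) >= n

# A prompt bundles one or more verifiable instructions; a response passes
# only if every attached checker returns True.
prompt_checks = [check_min_words, lambda r: check_keyword_mentions(r, "AI", 3)]
response = "AI systems ..."                     # model output under evaluation
print(all(check(response) for check in prompt_checks))
```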

Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks

  • paper_url: http://arxiv.org/abs/2311.09247
  • repo_url: None
  • paper_authors: Melanie Mitchell, Alessandro B. Palmarini, Arseny Moskvichev
  • for: Evaluating the abstract reasoning abilities of GPT-4 and GPT-4V using the ConceptARC benchmark.
  • methods: GPT-4 is evaluated on text versions of ConceptARC tasks with one-shot prompts, and GPT-4V on image versions of the simplest tasks with zero- and one-shot prompts.
  • results: Neither GPT-4 nor GPT-4V has developed robust abstraction abilities at humanlike levels.
    Abstract We explore the abstract reasoning abilities of text-only and multimodal versions of GPT-4, using the ConceptARC benchmark [10], which is designed to evaluate robust understanding and reasoning with core-knowledge concepts. We extend the work of Moskvichev et al. [10] by evaluating GPT-4 on more detailed, one-shot prompting (rather than simple, zero-shot prompts) with text versions of ConceptARC tasks, and by evaluating GPT-4V, the multimodal version of GPT-4, on zero- and one-shot prompts using image versions of the simplest tasks. Our experimental results support the conclusion that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels.
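
As a rough illustration of what a one-shot, text-version prompt for an ARC-style task could look like, here is a small sketch; the grid serialization and prompt wording are assumptions, not the paper's exact format.

```python
# Sketch of serializing an ARC-style grid task into a one-shot text prompt.
# The encoding (one digit per cell, one row per line) and the wording are
# illustrative assumptions; the paper's exact prompt format may differ.

def grid_to_text(grid: list[list[int]]) -> str:
    return "\n".join(" ".join(str(c) for c in row) for row in grid)

def one_shot_prompt(demo_in, demo_out, test_in) -> str:
    return (
        "Solve the puzzle by inferring the transformation rule.\n"
        f"Example input:\n{grid_to_text(demo_in)}\n"
        f"Example output:\n{grid_to_text(demo_out)}\n"
        f"Test input:\n{grid_to_text(test_in)}\n"
        "Test output:"
    )

print(one_shot_prompt([[0, 1], [1, 0]], [[1, 0], [0, 1]], [[1, 1], [0, 0]]))
```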

RoboSense At Edge: Detecting Slip, Crumple and Shape of the Object in Robotic Hand for Teleoperations

  • paper_url: http://arxiv.org/abs/2311.07888
  • repo_url: None
  • paper_authors: Sudev Kumar Padhi, Mohit Kumar, Debanka Giri, Subidh Ali
  • for: This paper addresses slip and crumple detection for a robotic hand, which is essential for stable, precise manipulation in tasks such as remote surgery.
  • methods: It proposes machine-learning-based detection of slip, crumple, and object shape from the force/torque exerted on the robotic hand and the angular positions of its actuators.
  • results: Integrating the ML model into the robotic-hand and haptic-glove loop reduces latency, improving stability in teleoperated precision manipulation.
    Abstract Slip and crumple detection is essential for performing robust manipulation tasks with a robotic hand (RH), such as in remote surgery. It has been one of the challenging problems in the robotics manipulation community. In this work, we propose a machine learning (ML) based technique to detect slip and crumple, as well as the shape of the object currently held in the robotic hand. The proposed ML model detects slip, crumple, and shape using the force/torque exerted and the angular positions of the actuators present in the RH. The model would be integrated into the control loop of the robotic hand (RH) and haptic glove (HG), helping to reduce latency during teleoperation.
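
A minimal sketch of the kind of classifier the paper describes, mapping force/torque and actuator-angle features to stable/slip/crumple labels. The 10-dimensional feature layout, the random-forest choice, and the synthetic data are assumptions for illustration, not the paper's model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Sketch: classify the state of a grasped object (stable / slip / crumple)
# from force/torque readings and actuator angles. The feature layout and
# the synthetic data below are assumptions, not the paper's setup.

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))      # [fx, fy, fz, tx, ty, tz, 4 joint angles]
y = rng.integers(0, 3, size=600)    # 0 = stable, 1 = slip, 2 = crumple

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```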

One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion

  • paper_url: http://arxiv.org/abs/2311.07885
  • repo_url: None
  • paper_authors: Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Jiayuan Gu, Hao Su
  • for: This paper aims to rapidly generate high-quality 3D objects from a single image, meeting practical application requirements.
  • methods: The method first finetunes a 2D diffusion model for multi-view image generation consistency, then elevates these images to 3D using multi-view conditioned 3D native diffusion models.
  • results: Experimental results show that the method can generate high-quality, diverse 3D assets that closely mirror the original input image.
    Abstract Recent advancements in open-world 3D object generation have been remarkable, with image-to-3D methods offering superior fine-grained control over their text-to-3D counterparts. However, most existing models fall short in simultaneously providing rapid generation speeds and high fidelity to input images - two features essential for practical applications. In this paper, we present One-2-3-45++, an innovative method that transforms a single image into a detailed 3D textured mesh in approximately one minute. Our approach aims to fully harness the extensive knowledge embedded in 2D diffusion models and priors from valuable yet limited 3D data. This is achieved by initially finetuning a 2D diffusion model for consistent multi-view image generation, followed by elevating these images to 3D with the aid of multi-view conditioned 3D native diffusion models. Extensive experimental evaluations demonstrate that our method can produce high-quality, diverse 3D assets that closely mirror the original input image. Our project webpage: https://sudo-ai-3d.github.io/One2345plus_page.
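
A schematic of the two-stage pipeline with stub models; every function name, shape, and return value below is a placeholder standing in for a trained model, not the paper's released code.

```python
import numpy as np

# Schematic of the One-2-3-45++ two-stage pipeline with stub models.
# All names and shapes are placeholders, not the paper's implementation.

def generate_multiview(image: np.ndarray, n_views: int = 6) -> np.ndarray:
    """Stage 1 stub: a finetuned 2D diffusion model would produce
    n_views mutually consistent renderings of the object."""
    return np.stack([image] * n_views)   # stand-in for generated views

def lift_to_3d(views: np.ndarray) -> dict:
    """Stage 2 stub: a multi-view-conditioned 3D native diffusion model
    would turn the consistent views into a textured mesh."""
    return {"vertices": np.zeros((0, 3)), "faces": np.zeros((0, 3), int)}

mesh = lift_to_3d(generate_multiview(np.zeros((256, 256, 3))))
print(mesh["vertices"].shape)
```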

VegaEdge: Edge AI Confluence Anomaly Detection for Real-Time Highway IoT-Applications

  • paper_url: http://arxiv.org/abs/2311.07880
  • repo_url: None
  • paper_authors: Vinit Katariya, Fatema-E- Jannat, Armin Danesh Pazho, Ghazal Alinezhad Noghre, Hamed Tabkhi
  • for: This paper proposes vehicle anomaly detection for highway safety applications such as accident prevention, rapid emergency response, traffic flow optimization, and work zone safety.
  • methods: It uses trajectory prediction to detect anomalous vehicles and presents a lightweight, AI-based approach suited to IoT-era highway applications.
  • results: Experiments across multiple platforms and traffic scenarios show the approach is efficient and versatile; in a typical highway setting, it processes 738 trajectories per second in real time. The paper also releases the Carolinas Anomaly Dataset (CAD) to fill the gap in highway anomaly datasets.
    Abstract Vehicle anomaly detection plays a vital role in highway safety applications such as accident prevention, rapid response, traffic flow optimization, and work zone safety. With the surge of the Internet of Things (IoT) in recent years, there has arisen a pressing demand for Artificial Intelligence (AI) based anomaly detection methods designed to meet the requirements of IoT devices. Catering to this futuristic vision, we introduce a lightweight approach to vehicle anomaly detection by utilizing the power of trajectory prediction. Our proposed design identifies vehicles deviating from expected paths, indicating highway risks, across different camera-viewing angles in real-world highway datasets. On top of that, we present VegaEdge - a sophisticated AI confluence designed for real-time security and surveillance applications in modern highway settings through edge-centric IoT-embedded platforms equipped with our anomaly detection approach. Extensive testing across multiple platforms and traffic scenarios showcases the versatility and effectiveness of VegaEdge. This work also presents the Carolinas Anomaly Dataset (CAD), to bridge the existing gap in datasets tailored for highway anomalies. In real-world scenarios, our anomaly detection approach achieves an AUC-ROC of 0.94, and our proposed VegaEdge design, on an embedded IoT platform, processes 738 trajectories per second in a typical highway setting. The dataset is available at https://github.com/TeCSAR-UNCC/Carolinas_Dataset#chd-anomaly-test-set .
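
The core idea, flagging vehicles whose observed paths deviate from predicted ones, can be sketched as follows. The constant-velocity predictor and the unspecified threshold are stand-ins for the paper's learned trajectory predictor.

```python
import numpy as np

# Sketch of trajectory-based anomaly scoring in the spirit of VegaEdge:
# flag a vehicle when its observed path deviates from the predicted one.
# The constant-velocity predictor below is an assumption; the paper uses
# a learned trajectory predictor.

def predict_constant_velocity(history: np.ndarray, horizon: int) -> np.ndarray:
    """Extrapolate the last observed velocity for `horizon` steps."""
    v = history[-1] - history[-2]
    return history[-1] + v * np.arange(1, horizon + 1)[:, None]

def anomaly_score(history: np.ndarray, observed_future: np.ndarray) -> float:
    pred = predict_constant_velocity(history, len(observed_future))
    return float(np.mean(np.linalg.norm(pred - observed_future, axis=1)))

history = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
swerving = np.array([[3.0, 1.0], [4.0, 2.5]])         # drifts off the lane
print("score:", anomaly_score(history, swerving))     # flag if above a threshold
```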

Toxicity Detection is NOT all you Need: Measuring the Gaps to Supporting Volunteer Content Moderators

  • paper_url: http://arxiv.org/abs/2311.07879
  • repo_url: None
  • paper_authors: Yang Trista Cao, Lovely-Frances Domingo, Sarah Ann Gilbert, Michelle Mazurek, Katie Shilton, Hal Daumé III
  • for: This paper examines whether past automated content-moderation research meets the needs of volunteer moderators.
  • methods: It reviews the models available on a model-hosting platform and tests state-of-the-art LLMs (GPT-4 and Llama-2) on flagging violations of platform rules.
  • results: Existing models and LLMs exhibit low recall on a significant portion of the rules, revealing a non-trivial gap.
    Abstract Extensive efforts in automated approaches for content moderation have been focused on developing models to identify toxic, offensive, and hateful content -- with the aim of lightening the load for moderators. Yet, it remains uncertain whether improvements on those tasks truly address the needs that moderators have in accomplishing their work. In this paper, we surface the gaps between past research efforts that have aimed to provide automation for aspects of the content moderation task, and the needs of volunteer content moderators. To do so, we conduct a model review on Hugging Face to reveal the availability of models to cover various moderation rules and guidelines. We further put state-of-the-art LLMs to the test (GPT-4 and Llama-2), evaluating how well these models perform in flagging violations of platform rules. Overall, we observe a non-trivial gap, as missing developed models and LLMs exhibit low recall on a significant portion of the rules.
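
The gap the paper measures comes down to per-rule recall of violation flagging. A minimal sketch with illustrative predictions and labels:

```python
# Sketch of the gap measurement described above: per-rule recall of a
# classifier that flags rule violations. The rules, predictions, and
# labels below are illustrative only.

def recall(predictions: list[bool], labels: list[bool]) -> float:
    tp = sum(p and l for p, l in zip(predictions, labels))
    fn = sum((not p) and l for p, l in zip(predictions, labels))
    return tp / (tp + fn) if tp + fn else 0.0

# One entry per moderation rule: did the model flag each true violation?
per_rule = {
    "no spam":       ([True, False, True], [True, True, True]),
    "no harassment": ([False, False], [True, True]),
}
for rule, (pred, gold) in per_rule.items():
    print(f"{rule}: recall={recall(pred, gold):.2f}")
```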

Learning Adversarial Low-rank Markov Decision Processes with Unknown Transition and Full-information Feedback

  • paper_url: http://arxiv.org/abs/2311.07876
  • repo_url: None
  • paper_authors: Canzhe Zhao, Ruofeng Yang, Baoxiang Wang, Xuezhou Zhang, Shuai Li
  • for: This paper studies low-rank MDPs with adversarially changing losses in the full-information feedback setting: the transition probability kernel admits a low-rank matrix decomposition \citep{REPUCB22}, and the loss functions may change adversarially but are revealed to the learner at the end of each episode.
  • methods: It proposes a policy-optimization-based algorithm, POLO, and proves that it attains a $\widetilde{O}(K^{5/6} A^{1/2} d \ln(1+M)/(1-\gamma)^2)$ regret guarantee, where $d$ is the rank of the transition kernel (and hence the dimension of the unknown representations), $A$ is the cardinality of the action space, $M$ is the cardinality of the model class, and $\gamma$ is the discount factor. It also proves an $\Omega(\frac{\gamma^2}{1-\gamma} \sqrt{d A K})$ regret lower bound, showing that low-rank MDPs are statistically harder to learn than linear MDPs in the regret-minimization setting.
  • results: The algorithm achieves sublinear regret in low-rank MDPs without knowing the representation and is, to the authors' knowledge, the first to interleave representation learning, exploration, and exploitation to achieve sublinear regret for RL with nonlinear function approximation and adversarial losses.
    Abstract In this work, we study the low-rank MDPs with adversarially changed losses in the full-information feedback setting. In particular, the unknown transition probability kernel admits a low-rank matrix decomposition \citep{REPUCB22}, and the loss functions may change adversarially but are revealed to the learner at the end of each episode. We propose a policy optimization-based algorithm POLO, and we prove that it attains the $\widetilde{O}(K^{5/6} A^{1/2} d \ln(1+M)/(1-\gamma)^2)$ regret guarantee, where $d$ is rank of the transition kernel (and hence the dimension of the unknown representations), $A$ is the cardinality of the action space, $M$ is the cardinality of the model class, and $\gamma$ is the discounted factor. Notably, our algorithm is oracle-efficient and has a regret guarantee with no dependence on the size of potentially arbitrarily large state space. Furthermore, we also prove an $\Omega(\frac{\gamma^2}{1-\gamma} \sqrt{d A K})$ regret lower bound for this problem, showing that low-rank MDPs are statistically more difficult to learn than linear MDPs in the regret minimization setting. To the best of our knowledge, we present the first algorithm that interleaves representation learning, exploration, and exploitation to achieve the sublinear regret guarantee for RL with nonlinear function approximation and adversarial losses.

Rankitect: Ranking Architecture Search Battling World-class Engineers at Meta Scale

  • paper_url: http://arxiv.org/abs/2311.08430
  • repo_url: None
  • paper_authors: Wei Wen, Kuang-Hung Liu, Igor Fedorov, Xin Zhang, Hang Yin, Weiwei Chu, Kaveh Hassani, Mengying Sun, Jiang Liu, Xu Wang, Lin Jiang, Yuxin Chen, Buyun Zhang, Xi Liu, Dehua Cheng, Zhengxing Chen, Guang Zhao, Fangqiu Han, Jiyan Yang, Yuchen Hao, Liang Xiong, Wen-Yen Chen
  • for: This paper presents Rankitect, a Neural Architecture Search (NAS) framework for Meta's ranking systems that builds architectures from scratch and improves upon and compares against state-of-the-art NAS methods.
  • methods: Rankitect implements sampling-based NAS, one-shot NAS, and Differentiable NAS (DNAS) under a common search space for comprehensive and fair comparison.
  • results: Rankitect discovers new models from scratch that achieve a competitive tradeoff between Normalized Entropy loss and FLOPs in Meta-scale production; when using an engineer-designed search space, it generates models better than the engineers', achieving positive offline evaluation and online A/B test results at Meta scale.
    Abstract Neural Architecture Search (NAS) has demonstrated its efficacy in computer vision and potential for ranking systems. However, prior work focused on academic problems, which are evaluated at small scale under well-controlled fixed baselines. In industrial systems, such as Meta's ranking systems, it is unclear whether NAS algorithms from the literature can outperform production baselines because of: (1) scale - Meta ranking systems serve billions of users, (2) strong baselines - the baselines are production models optimized by hundreds to thousands of world-class engineers for years since the rise of deep learning, (3) dynamic baselines - engineers may have established new and stronger baselines during NAS search, and (4) efficiency - the search pipeline must yield results quickly in alignment with the productionization life cycle. In this paper, we present Rankitect, a NAS software framework for ranking systems at Meta. Rankitect seeks to build brand new architectures by composing low-level building blocks from scratch. Rankitect implements and improves state-of-the-art (SOTA) NAS methods for comprehensive and fair comparison under the same search space, including sampling-based NAS, one-shot NAS, and Differentiable NAS (DNAS). We evaluate Rankitect by comparing to multiple production ranking models at Meta. We find that Rankitect can discover new models from scratch achieving a competitive tradeoff between Normalized Entropy loss and FLOPs. When utilizing a search space designed by engineers, Rankitect can generate better models than engineers, achieving positive offline evaluation and online A/B test results at Meta scale.
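
A toy sketch of the sampling-based NAS loop, one of the three search strategies Rankitect implements. The search space, the stub evaluator, and the budget are illustrative assumptions, not Meta's configuration.

```python
import random

# Minimal sketch of sampling-based NAS over a search space of low-level
# building blocks, as Rankitect does at a vastly larger scale. The space,
# evaluator, and objective here are all illustrative stand-ins.

SEARCH_SPACE = {
    "block":      ["dot_product", "mlp_mixer", "linear"],
    "hidden_dim": [64, 128, 256],
    "depth":      [2, 3, 4],
}

def sample_arch() -> dict:
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def evaluate(arch: dict) -> float:
    """Stand-in for training + measuring Normalized Entropy vs. FLOPs."""
    return random.random()  # replace with the real trainer/evaluator

best = min((sample_arch() for _ in range(20)), key=evaluate)
print("best sampled architecture:", best)
```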

AutoML for Large Capacity Modeling of Meta’s Ranking Systems

  • paper_url: http://arxiv.org/abs/2311.07870
  • repo_url: None
  • paper_authors: Hang Yin, Kuang-Hung Liu, Mengying Sun, Yuxin Chen, Buyun Zhang, Jiang Liu, Vivek Sehgal, Rudresh Rajnikant Panchal, Eugen Hotaj, Xi Liu, Daifeng Guo, Jamey Zhang, Zhou Wang, Shali Jiang, Huayu Li, Zhengxing Chen, Wen-Yen Chen, Jiyan Yang, Wei Wen
  • for: Improving Meta's web-scale ranking models is essential but engineering-heavy; this paper aims to relieve that burden.
  • methods: It applies Automated Machine Learning (AutoML) to free engineers from the labor-intensive work of hand-tuning ranking models.
  • results: A sampling-based AutoML method improves ranking performance in Meta-scale production, achieving strong Return on Investment (ROI) versus human-tuned baselines with Normalized Entropy (NE) loss reductions or Query-per-Second (QPS) increases within a tight evaluation budget.
    Abstract Web-scale ranking systems at Meta serving billions of users are complex. Improving ranking models is essential but engineering-heavy. Automated Machine Learning (AutoML) can release engineers from the labor-intensive work of tuning ranking models; however, it is unknown if AutoML is efficient enough to meet tight production timelines in the real world and, at the same time, bring additional improvements to the strong baselines. Moreover, to achieve higher ranking performance, there is an ever-increasing demand to scale up ranking models to even larger capacity, which imposes more challenges on the efficiency. The large scale of models and tight production schedule requires AutoML to outperform human baselines by only using a small number of model evaluation trials (around 100). We present a sampling-based AutoML method, focusing on neural architecture search and hyperparameter optimization, addressing these challenges in Meta-scale production when building large-capacity models. Our approach efficiently handles large-scale data demands. It leverages a lightweight predictor-based searcher and reinforcement learning to explore vast search spaces, significantly reducing the number of model evaluations. Through experiments in large-capacity modeling for CTR and CVR applications, we show that our method achieves outstanding Return on Investment (ROI) versus human-tuned baselines, with up to 0.09% Normalized Entropy (NE) loss reduction or 25% Query per Second (QPS) increase by only sampling one hundred models on average from a curated search space. The proposed AutoML method has already made real-world impact: a discovered Instagram CTR model with up to -0.36% NE gain (over the existing production baseline) was selected for a large-scale online A/B test and showed a statistically significant gain. These production results prove AutoML's efficacy and accelerated its adoption in ranking systems at Meta.
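
A toy sketch of the predictor-guided search idea, where a cheap predictor screens many candidates so only about 100 are truly evaluated. The one-dimensional objective and nearest-neighbour predictor are assumptions for illustration, not Meta's implementation.

```python
import random

# Sketch of predictor-guided search: a cheap predictor screens many
# candidates; only the screened pick pays the cost of a real evaluation.
# The quadratic "true objective" and the predictor are toy assumptions.

def true_objective(x: float) -> float:          # expensive: full model training
    return (x - 0.7) ** 2

def cheap_predictor(x: float, history: list[tuple[float, float]]) -> float:
    """Predict the objective from the nearest already-evaluated neighbour."""
    if not history:
        return 0.0
    return min(history, key=lambda h: abs(h[0] - x))[1]

history: list[tuple[float, float]] = []
for _ in range(100):                            # budget: ~100 real evaluations
    candidates = [random.random() for _ in range(50)]
    x = min(candidates, key=lambda c: cheap_predictor(c, history))
    history.append((x, true_objective(x)))      # evaluate only the screened pick

print("best found:", min(history, key=lambda h: h[1]))
```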

Multi-Signal Reconstruction Using Masked Autoencoder From EEG During Polysomnography

  • paper_url: http://arxiv.org/abs/2311.07868
  • repo_url: None
  • paper_authors: Young-Seok Kweon, Gi-Hwan Shin, Heon-Gyu Kwak, Ha-Na Jo, Seong-Whan Lee
  • for: This study develops a system that reconstructs multi-signal PSG from a single-channel EEG, enabling sleep monitoring in non-specialist settings.
  • methods: The system uses a masked autoencoder to reconstruct multi-signal PSG data, trained and evaluated on the Sleep-EDF-20 dataset.
  • results: The model successfully reconstructs multi-signal data, suggesting more accessible, long-term sleep monitoring and extending PSG's applicability beyond clinical facilities.
    Abstract Polysomnography (PSG) is an indispensable diagnostic tool in sleep medicine, essential for identifying various sleep disorders. By capturing physiological signals, including EEG, EOG, EMG, and cardiorespiratory metrics, PSG presents a patient's sleep architecture. However, its dependency on complex equipment and expertise confines its use to specialized clinical settings. Addressing these limitations, our study aims to perform PSG by developing a system that requires only a single EEG measurement. We propose a novel system capable of reconstructing multi-signal PSG from a single-channel EEG based on a masked autoencoder. The masked autoencoder was trained and evaluated using the Sleep-EDF-20 dataset, with mean squared error as the metric for assessing the similarity between original and reconstructed signals. The model demonstrated proficiency in reconstructing multi-signal data. Our results present promise for the development of more accessible and long-term sleep monitoring systems. This suggests the expansion of PSG's applicability, enabling its use beyond the confines of clinics.
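
A minimal PyTorch sketch of the masked-autoencoder idea: mask part of a single-channel EEG window and train the model to reconstruct a full multi-signal frame under MSE, the similarity metric the paper reports. The window sizes, masking ratio, MLP architecture, and random data are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal masked-autoencoder sketch: reconstruct a multi-signal PSG frame
# from a partially masked single-channel EEG window. All sizes and the
# synthetic data are assumptions; the paper's architecture may differ.

class MaskedAE(nn.Module):
    def __init__(self, in_dim=256, out_dim=256 * 4, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, out_dim)   # 4 reconstructed signals

    def forward(self, eeg, mask_ratio=0.5):
        mask = (torch.rand_like(eeg) > mask_ratio).float()  # random masking
        return self.decoder(self.encoder(eeg * mask))

model = MaskedAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
eeg = torch.randn(8, 256)                # batch of single-channel EEG windows
target = torch.randn(8, 256 * 4)         # EEG+EOG+EMG+resp targets (synthetic)

loss = nn.functional.mse_loss(model(eeg), target)   # MSE, as in the paper
opt.zero_grad(); loss.backward(); opt.step()
print("MSE:", loss.item())
```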

Overview of the TREC 2023 Product Search Track

  • paper_url: http://arxiv.org/abs/2311.07861
  • repo_url: None
  • paper_authors: Daniel Campos, Surya Kallumadi, Corby Rosset, Cheng Xiang Zhai, Alessandro Magnani
  • for: The track aims to create a reusable collection and evaluate the impact of metadata and multi-modal data on retrieval accuracy.
  • methods: It uses a new product search corpus that includes contextual metadata.
  • results: In the product search domain, traditional retrieval systems are highly effective and commonly outperform general-purpose pretrained embedding models. Metadata-enhanced collections show no clear trend, and single-stage dense retrieval runs are often noncompetitive or produce low-quality results in both the zero-shot and fine-tuned settings.
    Abstract This is the first year of the TREC Product Search track. The focus this year was the creation of a reusable collection and evaluation of the impact of the use of metadata and multi-modal data on retrieval accuracy. This year we leverage the new product search corpus, which includes contextual metadata. Our analysis shows that in the product search domain, traditional retrieval systems are highly effective and commonly outperform general-purpose pretrained embedding models. Our analysis also evaluates the impact of using simplified and metadata-enhanced collections, finding no clear trend in the impact of the expanded collection. We also see some surprising outcomes; despite their widespread adoption and competitive performance on other tasks, we find single-stage dense retrieval runs can commonly be noncompetitive or generate low-quality results both in the zero-shot and fine-tuned domain.

Bring Your Own KG: Self-Supervised Program Synthesis for Zero-Shot KGQA

  • paper_url: http://arxiv.org/abs/2311.07850
  • repo_url: https://github.com/dhdhagar/byokg
  • paper_authors: Dhruv Agarwal, Rajarshi Das, Sopan Khosla, Rashmi Gangadharaiah
  • for: The paper is written for developing a universal question-answering (QA) system that can operate on any knowledge graph (KG) without requiring human-annotated training data.
  • methods: The paper combines exploration and reasoning to answer questions over a KG: an LLM-backed symbolic agent explores the graph to generate a diverse set of query-program exemplars, which then ground a retrieval-augmented reasoning procedure that predicts programs for arbitrary questions.
  • results: The paper shows dramatic gains in QA accuracy over a zero-shot baseline on two benchmark datasets, with gains of 27.89 and 58.02 F1 on GrailQA and MetaQA, respectively. It also demonstrates the effectiveness of exploration and shows that performance reliably improves with continued exploration and with improvements in the base LLM.
    Abstract We present BYOKG, a universal question-answering (QA) system that can operate on any knowledge graph (KG), requires no human-annotated training data, and can be ready to use within a day -- attributes that are out-of-scope for current KGQA systems. BYOKG draws inspiration from the remarkable ability of humans to comprehend information present in an unseen KG through exploration -- starting at random nodes, inspecting the labels of adjacent nodes and edges, and combining them with their prior world knowledge. In BYOKG, exploration leverages an LLM-backed symbolic agent that generates a diverse set of query-program exemplars, which are then used to ground a retrieval-augmented reasoning procedure to predict programs for arbitrary questions. BYOKG is effective over both small- and large-scale graphs, showing dramatic gains in QA accuracy over a zero-shot baseline of 27.89 and 58.02 F1 on GrailQA and MetaQA, respectively. On GrailQA, we further show that our unsupervised BYOKG outperforms a supervised in-context learning method, demonstrating the effectiveness of exploration. Lastly, we find that performance of BYOKG reliably improves with continued exploration as well as improvements in the base LLM, notably outperforming a state-of-the-art fine-tuned model by 7.08 F1 on a sub-sampled zero-shot split of GrailQA.
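
A toy sketch of the exploration step: random walks from arbitrary nodes yield paths that can seed query-program exemplars. The graph and the uniform-random walk policy are illustrative; the paper's agent is LLM-backed and far more deliberate.

```python
import random

# Sketch of KG exploration in the spirit of BYOKG: walk an unseen graph
# from random nodes, collecting paths that can seed query-program
# exemplars. The toy graph below is illustrative only.

KG = {  # head -> list of (relation, tail)
    "Inception": [("directed_by", "Christopher Nolan"), ("released", "2010")],
    "Christopher Nolan": [("born_in", "London")],
}

def random_walk(start: str, max_hops: int = 2):
    node, path = start, []
    for _ in range(max_hops):
        edges = KG.get(node)
        if not edges:
            break
        rel, tail = random.choice(edges)
        path.append((node, rel, tail))
        node = tail
    return path

print("explored path:", random_walk("Inception"))  # seed for an exemplar
```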

Enabling Decision-Support Systems through Automated Cell Tower Detection

  • paper_url: http://arxiv.org/abs/2311.07840
  • repo_url: None
  • paper_authors: Natasha Krell, Will Gleave, Daniel Nakada, Justin Downes, Amanda Willet, Matthew Baran
  • for: This work aims to improve mobile coverage mapping over large regions, expanding public access to mobile-based financial, educational, and humanitarian services.
  • methods: It uses deep neural networks with high-resolution remote sensing imagery for object detection of cell towers, removing the need for inefficient and burdensome manual mapping over large geographic regions.
  • results: Using OpenStreetMap (OSM) features and high-resolution Maxar imagery, the partially automated workflow detects cell towers accurately; the model performs well across geographies and in out-of-sample testing, enabling more accurate mobile coverage maps.
    Abstract Cell phone coverage and high-speed service gaps persist in rural areas in sub-Saharan Africa, impacting public access to mobile-based financial, educational, and humanitarian services. Improving maps of telecommunications infrastructure can help inform strategies to eliminate gaps in mobile coverage. Deep neural networks, paired with remote sensing images, can be used for object detection of cell towers and eliminate the need for inefficient and burdensome manual mapping to find objects over large geographic regions. In this study, we demonstrate a partially automated workflow to train an object detection model to locate cell towers using OpenStreetMap (OSM) features and high-resolution Maxar imagery. For model fine-tuning and evaluation, we curated a diverse dataset of over 6,000 unique images of cell towers in 26 countries in eastern, southern, and central Africa using automatically generated annotations from OSM points. Our model achieves an average precision at 50% Intersection over Union (IoU) (AP@50) of 81.2 with good performance across different geographies and out-of-sample testing. Accurate localization of cell towers can yield more accurate cell coverage maps, in turn enabling improved delivery of digital services for decision-support applications.
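
The reported AP@50 rests on an IoU >= 0.5 match criterion between predicted and ground-truth boxes. A minimal sketch of that test follows; the box format (x1, y1, x2, y2) and the values are illustrative.

```python
# Sketch of the IoU test behind the AP@50 metric reported above: a
# predicted cell-tower box counts as a match when IoU >= 0.5 with a
# ground-truth box. Boxes are (x1, y1, x2, y2); values are illustrative.

def iou(a, b) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

pred, gt = (10, 10, 50, 50), (12, 8, 48, 52)
print(f"IoU={iou(pred, gt):.2f}, match@50={iou(pred, gt) >= 0.5}")
```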

LLatrieval: LLM-Verified Retrieval for Verifiable Generation

  • paper_url: http://arxiv.org/abs/2311.07838
  • repo_url: https://github.com/beastyz/llm-verified-retrieval
  • paper_authors: Xiaonan Li, Changtai Zhu, Linyang Li, Zhangyue Yin, Tianxiang Sun, Xipeng Qiu
  • for: Improving the trustworthiness and verifiability of text generated by large language models (LLMs).
  • methods: Proposes a new retrieval stage in which the LLM iteratively updates the retrieval result via feedback until the retrieved documents sufficiently support a verifiable answer.
  • results: Experiments show the method significantly outperforms extensive baselines and achieves new state-of-the-art results.
    Abstract Verifiable generation aims to let the large language model (LLM) generate text with corresponding supporting documents, which enables the user to flexibly verify the answer and makes it more trustworthy. Its evaluation not only measures the correctness of the answer, but also the answer's verifiability, i.e., how well the answer is supported by the corresponding documents. Typically, verifiable generation adopts the retrieval-read pipeline, which is divided into two stages: 1) retrieve relevant documents for the question; 2) according to the documents, generate the corresponding answer. Since the retrieved documents can supplement knowledge for the LLM to generate the answer and serve as evidence, the retrieval stage is essential for the correctness and verifiability of the answer. However, the widely used retrievers become the bottleneck of the entire pipeline and limit overall performance. They often have fewer parameters than the large language model and have not been proven to scale well to the size of LLMs. Since the LLM passively receives the retrieval result, if the retriever does not correctly find the supporting documents, the LLM cannot generate the correct and verifiable answer, which overshadows the LLM's remarkable abilities. In this paper, we propose LLatrieval (Large Language Model Verified Retrieval), where the LLM updates the retrieval result until it verifies that the retrieved documents can support answering the question. Thus, the LLM can iteratively provide feedback to retrieval and facilitate the retrieval result to sufficiently support verifiable generation. Experimental results show that our method significantly outperforms extensive baselines and achieves new state-of-the-art results.
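
A minimal sketch of the verify-then-update retrieval loop; `retrieve`, `llm_verify`, and `llm_answer` are stubs standing in for a real retriever and LLM, and their logic here is illustrative only.

```python
# Sketch of LLatrieval's verify-then-update loop: retrieve, ask the LLM
# whether the documents suffice, and refine retrieval until they do.
# All three components below are illustrative stubs.

def retrieve(question: str, feedback: str | None = None) -> list[str]:
    corpus = ["Paris is the capital of France.", "The Seine flows through Paris."]
    return corpus if feedback is None else corpus[:1]  # stub: refine on feedback

def llm_verify(question: str, docs: list[str]) -> tuple[bool, str]:
    """Stub: ask the LLM whether docs suffice; return verdict and feedback."""
    return ("capital" in " ".join(docs), "need a document naming the capital")

def llm_answer(question: str, docs: list[str]) -> str:
    return "Paris"  # stub for document-grounded, verifiable generation

question, feedback = "What is the capital of France?", None
for _ in range(3):                       # bounded verify-update iterations
    docs = retrieve(question, feedback)
    ok, feedback = llm_verify(question, docs)
    if ok:
        break
print(llm_answer(question, docs))
```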

A Neuro-Inspired Hierarchical Reinforcement Learning for Motor Control

  • paper_url: http://arxiv.org/abs/2311.07822
  • repo_url: None
  • paper_authors: Pei Zhang, Zhaobo Hua, Jinliang Ding
  • for: To develop a learning algorithm inspired by the mammalian central motor system that lets multi-joint robots naturally learn and apply complex motor skills.
  • methods: The algorithm mimics the basal ganglia's selection mechanism for voluntary movements and the cerebellum's ability to regulate movement, composing different skill combinations to give robots natural motor abilities.
  • results: Experiments show the algorithm enables four types of robots to achieve flexible motor skills across 22 task environments.
    Abstract Designing controllers to achieve natural motion capabilities for multi-joint robots is a significant challenge. However, animals in nature are born with basic motor abilities and can master various complex motor skills through acquired learning. On the basis of analyzing the mechanism of the central motor system in mammals, we propose a neuro-inspired hierarchical reinforcement learning algorithm that enables robots to learn rich motor skills and apply them to complex task environments without relying on external data. We first design a skills network similar to the cerebellum by utilizing the selection mechanism of voluntary movements in the basal ganglia and the cerebellum's ability to regulate movement. Subsequently, by imitating the structure of advanced centers in the motion system, we propose a high-level policy to generate different skill combinations, thereby enabling the robot to acquire natural motor abilities. We conduct experiments on 4 types of robots and 22 task environments, and the results show that the proposed method can enable different types of robots to achieve flexible motion skills. Overall, our research provides a promising framework for the design of robotic neural motor controllers.
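
A toy sketch of the two-level scheme: a high-level policy picks a skill combination, and low-level skills emit joint commands that are blended. The skill set, the random high-level policy, and the averaging rule are illustrative stubs, not the paper's learned components.

```python
import random

# Sketch of a two-level motor controller: a high-level policy selects a
# combination of low-level skills (a cerebellum-like skills network),
# whose outputs are blended into joint commands. All parts are stubs.

SKILLS = {
    "swing_leg":  lambda state: [0.3, -0.1, 0.0],
    "extend_arm": lambda state: [0.0, 0.2, 0.5],
    "balance":    lambda state: [-0.1, 0.0, 0.1],
}

def high_level_policy(state) -> list[str]:
    """Stub: choose which skills to combine for the current task."""
    return random.sample(list(SKILLS), k=2)

def act(state) -> list[float]:
    chosen = high_level_policy(state)
    torques = [SKILLS[s](state) for s in chosen]
    return [sum(t) / len(t) for t in zip(*torques)]  # blend skill outputs

print("joint command:", act(state={"joints": [0.0, 0.0, 0.0]}))
```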

Leveraging Large Language Models to Detect Influence Campaigns in Social Media

  • paper_url: http://arxiv.org/abs/2311.07816
  • repo_url: None
  • paper_authors: Luca Luceri, Eric Boniardi, Emilio Ferrara
  • for: This study addresses the challenges social media influence campaigns pose to public discourse and democracy, proposing a novel LLM-based detection method.
  • methods: It converts user metadata and network structure into a text format, allowing LLMs to process multilingual content and adapt to the shifting tactics of malicious campaign actors.
  • results: Rigorous testing on multiple datasets demonstrates the model's superior performance in identifying influence efforts.
    Abstract Social media influence campaigns pose significant challenges to public discourse and democracy. Traditional detection methods fall short due to the complexity and dynamic nature of social media. Addressing this, we propose a novel detection method using Large Language Models (LLMs) that incorporates both user metadata and network structures. By converting these elements into a text format, our approach effectively processes multilingual content and adapts to the shifting tactics of malicious campaign actors. We validate our model through rigorous testing on multiple datasets, showcasing its superior performance in identifying influence efforts. This research not only offers a powerful tool for detecting campaigns, but also sets the stage for future enhancements to keep up with the fast-paced evolution of social media-based influence tactics.
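
A minimal sketch of the serialization idea, flattening user metadata and local network structure into text an LLM can classify; the field names and prompt wording are assumptions, not the paper's format.

```python
# Sketch of serializing user metadata plus local network structure into a
# text prompt for an LLM classifier. Field names and the prompt wording
# are illustrative assumptions.

def user_to_text(user: dict, neighbors: list[str]) -> str:
    return (
        f"account created {user['created']}, {user['followers']} followers, "
        f"posts in {', '.join(user['languages'])}; "
        f"retweets accounts: {', '.join(neighbors)}"
    )

user = {"created": "2023-01-02", "followers": 12, "languages": ["en", "ru"]}
prompt = (
    "Does this account look like part of a coordinated influence campaign?\n"
    + user_to_text(user, neighbors=["acct_17", "acct_243"])
)
print(prompt)   # this text would be fed to the LLM classifier
```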

Cooperative AI via Decentralized Commitment Devices

  • paper_url: http://arxiv.org/abs/2311.07815
  • repo_url: None
  • paper_authors: Xinyuan Sun, Davide Crapis, Matt Stephenson, Barnabé Monnot, Thomas Thiery, Jonathan Passerat-Palmbach
  • for: This paper examines whether cooperative AI techniques are robust to real-world incentives and attack vectors.
  • methods: It draws on decentralized, cryptography-based commitment devices already deployed in the wild to study secure multi-agent coordination.
  • results: Using examples from the decentralization and Maximal Extractable Value (MEV) literature, it surfaces potential security issues in cooperative AI and calls for expanded research and empirical testing frameworks.
    Abstract Credible commitment devices have been a popular approach for robust multi-agent coordination. However, existing commitment mechanisms face limitations like privacy, integrity, and susceptibility to mediator or user strategic behavior. It is unclear if the cooperative AI techniques we study are robust to real-world incentives and attack vectors. However, decentralized commitment devices that utilize cryptography have been deployed in the wild, and numerous studies have shown their ability to coordinate algorithmic agents facing adversarial opponents with significant economic incentives, currently in the order of several million to billions of dollars. In this paper, we use examples in the decentralization and, in particular, Maximal Extractable Value (MEV) (arXiv:1904.05234) literature to illustrate the potential security issues in cooperative AI. We call for expanded research into decentralized commitments to advance cooperative AI capabilities for secure coordination in open environments and empirical testing frameworks to evaluate multi-agent coordination ability given real-world commitment constraints.
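
A toy illustration of a conditional commitment device in a one-shot Prisoner's Dilemma, the kind of program-level commitment that cryptographic devices can make credible. The payoffs and the commitment rule are textbook values, not taken from the paper.

```python
# Toy conditional commitment device for a one-shot Prisoner's Dilemma:
# each party commits "cooperate iff the other party's committed program
# is the same conditional commitment". Payoffs are the usual PD values.

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def conditional_commit(other_is_conditional: bool) -> str:
    """Cooperate only when facing the same conditional commitment."""
    return "C" if other_is_conditional else "D"

a = conditional_commit(other_is_conditional=True)
b = conditional_commit(other_is_conditional=True)
print("actions:", a, b, "payoffs:", PAYOFF[(a, b)])   # (C, C) -> (3, 3)
```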