cs.AI - 2023-10-26

Style-Aware Radiology Report Generation with RadGraph and Few-Shot Prompting

  • paper_url: http://arxiv.org/abs/2310.17811
  • repo_url: None
  • paper_authors: Benjamin Yan, Ruochen Liu, David E. Kuo, Subathra Adithan, Eduardo Pontes Reis, Stephen Kwak, Vasantha Kumar Venugopal, Chloe P. O’Connell, Agustina Saenz, Pranav Rajpurkar, Michael Moor
  • for: To improve radiologists' workflow by automatically generating reports from medical images.
  • methods: A two-step approach: first extract the content from the image, then verbalize the extracted content into a report matching a specific radiologist's style, leveraging RadGraph and few-shot prompting with LLMs (a prompt sketch follows the abstract).
  • results: Favorable performance in quantitative evaluations; in a human evaluation with clinical raters, the AI-generated reports were indistinguishably tailored to individual radiologists' styles despite using only a few examples as context.
    Abstract Automatically generated reports from medical images promise to improve the workflow of radiologists. Existing methods consider an image-to-report modeling task by directly generating a fully-fledged report from an image. However, this conflates the content of the report (e.g., findings and their attributes) with its style (e.g., format and choice of words), which can lead to clinically inaccurate reports. To address this, we propose a two-step approach for radiology report generation. First, we extract the content from an image; then, we verbalize the extracted content into a report that matches the style of a specific radiologist. For this, we leverage RadGraph -- a graph representation of reports -- together with large language models (LLMs). In our quantitative evaluations, we find that our approach leads to beneficial performance. Our human evaluation with clinical raters highlights that the AI-generated reports are indistinguishably tailored to the style of individual radiologist despite leveraging only a few examples as context.
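A minimal sketch of the second (verbalization) step, assuming the extracted content is available as RadGraph-style (entity, relation, entity) triples; the template, helper names, and example data are illustrative and not the authors' code:

```python
# Hypothetical helpers for the verbalization step: serialize extracted
# RadGraph-style content, prepend a few in-style example reports, and
# let an LLM rewrite the new content in that radiologist's style.

def serialize_content(triples):
    # triples: e.g. [("opacity", "located_at", "left lower lobe")]
    return "; ".join(f"{s} {r.replace('_', ' ')} {o}" for s, r, o in triples)

def build_style_prompt(style_examples, triples):
    parts = ["Rewrite the findings below as a radiology report "
             "in the same style as the examples.\n"]
    for content, report in style_examples:  # few-shot pairs from one radiologist
        parts.append(f"Findings: {content}\nReport: {report}\n")
    parts.append(f"Findings: {serialize_content(triples)}\nReport:")
    return "\n".join(parts)

prompt = build_style_prompt(
    style_examples=[("cardiomegaly present; no pleural effusion",
                     "The heart is enlarged. No pleural effusion is seen.")],
    triples=[("opacity", "located_at", "left lower lobe")],
)
print(prompt)  # send to any LLM completion endpoint
```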

Clover: Closed-Loop Verifiable Code Generation

  • paper_url: http://arxiv.org/abs/2310.17807
  • repo_url: None
  • paper_authors: Chuyue Sun, Ying Sheng, Oded Padon, Clark Barrett
  • for: To ensure that code produced by code generators is correct, avoiding undesirable outcomes.
  • methods: The Clover paradigm (Closed-Loop Verifiable Code Generation), which reduces correctness checking to the more tractable problem of consistency checking; at its core is a checker, built from a novel integration of formal verification tools and large language models, that performs consistency checks among code, docstrings, and formal annotations (see the sketch after the abstract).
  • results: On a hand-designed dataset of annotated Dafny programs (CloverBench), (i) LLMs are reasonably successful at automatically generating formal specifications, and (ii) the consistency checker achieves an acceptance rate of up to 87% on correct instances while maintaining zero tolerance for incorrect ones (no false positives).
    Abstract The use of large language models for code generation is a rapidly growing trend in software development. However, without effective methods for ensuring the correctness of generated code, this trend could lead to any number of undesirable outcomes. In this paper, we lay out a vision for addressing this challenge: the Clover paradigm, short for Closed-Loop Verifiable Code Generation, which reduces correctness checking to the more accessible problem of consistency checking. At the core of Clover lies a checker that performs consistency checks among code, docstrings, and formal annotations. The checker is implemented using a novel integration of formal verification tools and large language models. We provide a theoretical analysis to support our thesis that Clover should be effective at consistency checking. We also empirically investigate its feasibility on a hand-designed dataset (CloverBench) featuring annotated Dafny programs at a textbook level of difficulty. Experimental results show that for this dataset, (i) LLMs are reasonably successful at automatically generating formal specifications; and (ii) our consistency checker achieves a promising acceptance rate (up to 87%) for correct instances while maintaining zero tolerance for incorrect ones (no false positives).
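A hypothetical sketch of the closed-loop acceptance logic Clover describes; the three check_* helpers stand in for the paper's combination of a formal verifier (e.g., Dafny) and LLM-based judgments, and their interfaces here are assumptions:

```python
# A generated program is accepted only if its code, docstring, and
# formal annotation are pairwise consistent.

def check_code_vs_annotation(code: str, annotation: str) -> bool:
    raise NotImplementedError("e.g., run the Dafny verifier on the annotated program")

def check_code_vs_docstring(code: str, docstring: str) -> bool:
    raise NotImplementedError("e.g., ask an LLM whether the docstring describes the code")

def check_docstring_vs_annotation(docstring: str, annotation: str) -> bool:
    raise NotImplementedError("e.g., ask an LLM whether the spec matches the docstring")

def clover_accept(code: str, docstring: str, annotation: str) -> bool:
    # Zero tolerance for inconsistency: every pairwise check must pass.
    return (check_code_vs_annotation(code, annotation)
            and check_code_vs_docstring(code, docstring)
            and check_docstring_vs_annotation(docstring, annotation))
```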

Reward Scale Robustness for Proximal Policy Optimization via DreamerV3 Tricks

  • paper_url: http://arxiv.org/abs/2310.17805
  • repo_url: None
  • paper_authors: Ryan Sullivan, Akarsh Kumar, Shengyi Huang, John P. Dickerson, Joseph Suarez
  • for: To apply the tricks from the model-based DreamerV3 method to PPO and evaluate whether they yield general improvements.
  • methods: Several of DreamerV3's implementation tricks are ported to PPO (e.g., transforms for handling raw reward scales such as symlog, sketched after the abstract), evaluated with extensive ablation studies against a high-quality PPO reference implementation.
  • results: The tricks do not transfer as general improvements to PPO and can degrade performance in some settings; however, PPO with these tricks performs comparably to PPO with reward clipping on Atari games and significantly outperforms PPO without reward clipping.
    Abstract Most reinforcement learning methods rely heavily on dense, well-normalized environment rewards. DreamerV3 recently introduced a model-based method with a number of tricks that mitigate these limitations, achieving state-of-the-art on a wide range of benchmarks with a single set of hyperparameters. This result sparked discussion about the generality of the tricks, since they appear to be applicable to other reinforcement learning algorithms. Our work applies DreamerV3's tricks to PPO and is the first such empirical study outside of the original work. Surprisingly, we find that the tricks presented do not transfer as general improvements to PPO. We use a high quality PPO reference implementation and present extensive ablation studies totaling over 10,000 A100 hours on the Arcade Learning Environment and the DeepMind Control Suite. Though our experiments demonstrate that these tricks do not generally outperform PPO, we identify cases where they succeed and offer insight into the relationship between the implementation tricks. In particular, PPO with these tricks performs comparably to PPO on Atari games with reward clipping and significantly outperforms PPO without reward clipping.
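For concreteness, a small illustration (not the paper's code) of two reward-scale techniques relevant here: classic Atari-style reward clipping and DreamerV3's symlog squashing, which compresses large magnitudes while remaining invertible:

```python
import numpy as np

def clip_reward(r):
    # Classic Atari-style clipping to {-1, 0, +1}.
    return np.sign(r)

def symlog(r):
    # Symmetric log transform: sign(r) * ln(1 + |r|).
    return np.sign(r) * np.log1p(np.abs(r))

def symexp(y):
    # Inverse of symlog, used to decode predictions.
    return np.sign(y) * np.expm1(np.abs(y))

rewards = np.array([-250.0, -1.0, 0.0, 3.0, 1000.0])
print(clip_reward(rewards))   # [-1. -1.  0.  1.  1.]
print(symlog(rewards))        # smoothly compressed magnitudes
assert np.allclose(symexp(symlog(rewards)), rewards)
```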

“You Are An Expert Linguistic Annotator”: Limits of LLMs as Analyzers of Abstract Meaning Representation

  • paper_url: http://arxiv.org/abs/2310.17793
  • repo_url: None
  • paper_authors: Allyson Ettinger, Jena D. Hwang, Valentina Pyatkin, Chandra Bhagavatula, Yejin Choi
  • for: To test whether LLMs can serve as "expert linguistic annotators" providing accurate semantic analyses.
  • methods: GPT-3, ChatGPT, and GPT-4 are evaluated on analyzing sentence meaning structure in the Abstract Meaning Representation (AMR) formalism, via both direct production of AMR parses and metalinguistic natural language queries (an example prompt is sketched after the abstract).
  • results: The models can reliably reproduce the basic AMR format and often capture core event, argument, and modifier structure, but their outputs are prone to frequent and major errors; even with few-shot demonstrations, the models have virtually 0% success in producing fully accurate parses.
    Abstract Large language models (LLMs) show amazing proficiency and fluency in the use of language. Does this mean that they have also acquired insightful linguistic knowledge about the language, to an extent that they can serve as an "expert linguistic annotator"? In this paper, we examine the successes and limitations of the GPT-3, ChatGPT, and GPT-4 models in analysis of sentence meaning structure, focusing on the Abstract Meaning Representation (AMR; Banarescu et al. 2013) parsing formalism, which provides rich graphical representations of sentence meaning structure while abstracting away from surface forms. We compare models' analysis of this semantic structure across two settings: 1) direct production of AMR parses based on zero- and few-shot prompts, and 2) indirect partial reconstruction of AMR via metalinguistic natural language queries (e.g., "Identify the primary event of this sentence, and the predicate corresponding to that event."). Across these settings, we find that models can reliably reproduce the basic format of AMR, and can often capture core event, argument, and modifier structure -- however, model outputs are prone to frequent and major errors, and holistic analysis of parse acceptability shows that even with few-shot demonstrations, models have virtually 0% success in producing fully accurate parses. Eliciting natural language responses produces similar patterns of errors. Overall, our findings indicate that these models out-of-the-box can capture aspects of semantic structure, but there remain key limitations in their ability to support fully accurate semantic analyses or parses.
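Illustrative prompts in the spirit of the paper's two settings; the exact wording here is hypothetical, though the gold AMR shown follows standard PENMAN conventions (Banarescu et al. 2013):

```python
sentence = "The boy wants to go."

# Setting 1: direct production of an AMR parse.
direct_prompt = (
    "You are an expert linguistic annotator.\n"
    f"Produce the AMR parse of the sentence: \"{sentence}\"\n"
    "Use standard PENMAN notation."
)

# Gold AMR for reference:
gold_amr = """
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
            :ARG0 b))
"""

# Setting 2: indirect, metalinguistic query.
metalinguistic_prompt = (
    f"Consider the sentence: \"{sentence}\"\n"
    "Identify the primary event of this sentence, and the predicate "
    "corresponding to that event."
)
```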

Utilizing Language Models for Energy Load Forecasting

  • paper_url: http://arxiv.org/abs/2310.17788
  • repo_url: https://github.com/xuehaouwa/lm-load-forecasting
  • paper_authors: Hao Xue, Flora D. Salim
  • for: Accurate energy load forecasting helps buildings and cities optimize resource allocation and manage energy consumption.
  • methods: A novel approach that uses language models for energy load forecasting: prompting techniques convert energy consumption data into descriptive sentences to enable fine-tuning, and predictions are produced autoregressively (a template sketch follows the abstract).
  • results: Extensive experiments on real-world datasets demonstrate the effectiveness and accuracy of the method across various forecasting horizons.
    Abstract Energy load forecasting plays a crucial role in optimizing resource allocation and managing energy consumption in buildings and cities. In this paper, we propose a novel approach that leverages language models for energy load forecasting. We employ prompting techniques to convert energy consumption data into descriptive sentences, enabling fine-tuning of language models. By adopting an autoregressive generating approach, our proposed method enables predictions of various horizons of future energy load consumption. Through extensive experiments on real-world datasets, we demonstrate the effectiveness and accuracy of our proposed method. Our results indicate that utilizing language models for energy load forecasting holds promise for enhancing energy efficiency and facilitating intelligent decision-making in energy systems.
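A hypothetical template, not the authors' exact prompt format, for turning a numeric load series into descriptive sentences, together with the autoregressive forecasting pattern:

```python
def series_to_sentence(readings, unit="kWh"):
    parts = [f"at hour {t} the load was {v:.1f} {unit}"
             for t, v in enumerate(readings)]
    return "The building's energy consumption: " + "; ".join(parts) + "."

history = [12.4, 11.8, 13.1, 15.0]
prompt = series_to_sentence(history) + " At hour 4 the load was"

# An autoregressive loop would append each predicted value back into
# the prompt and query the model again for the next horizon step:
# for step in range(horizon):
#     next_value = language_model.generate(prompt)   # hypothetical call
#     prompt += f" {next_value} kWh. At hour {5 + step} the load was"
print(prompt)
```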

Evaluation of large language models using an Indian language LGBTI+ lexicon

  • paper_url: http://arxiv.org/abs/2310.17787
  • repo_url: None
  • paper_authors: Aditya Joshi, Shruta Rawat, Alpana Dange
  • for: To evaluate the responsible behaviour of large language models (LLMs) in the LGBTI+ context.
  • methods: Evaluation against an LGBTI+ lexicon in Indian languages, in four steps: formulating relevant NLP tasks, creating test prompts, obtaining outputs from the LLMs, and manually evaluating the results.
  • results: The three LLMs tested are unable to detect underlying hateful content; the study also observes limitations in using machine translation to evaluate natural language understanding in languages other than English.
    Abstract Large language models (LLMs) are typically evaluated on the basis of task-based benchmarks such as MMLU. Such benchmarks do not examine responsible behaviour of LLMs in specific contexts. This is particularly true in the LGBTI+ context where social stereotypes may result in variation in LGBTI+ terminology. Therefore, domain-specific lexicons or dictionaries may be useful as a representative list of words against which the LLM's behaviour needs to be evaluated. This paper presents a methodology for evaluation of LLMs using an LGBTI+ lexicon in Indian languages. The methodology consists of four steps: formulating NLP tasks relevant to the expected behaviour, creating prompts that test LLMs, using the LLMs to obtain the output and, finally, manually evaluating the results. Our qualitative analysis shows that the three LLMs we experiment on are unable to detect underlying hateful content. Similarly, we observe limitations in using machine translation as means to evaluate natural language understanding in languages other than English. The methodology presented in this paper can be useful for LGBTI+ lexicons in other languages as well as other domain-specific lexicons. The work done in this paper opens avenues for responsible behaviour of LLMs, as demonstrated in the context of prevalent social perception of the LGBTI+ community.

Graph Convolutional Networks for Complex Traffic Scenario Classification

  • paper_url: http://arxiv.org/abs/2310.17773
  • repo_url: None
  • paper_authors: Tobias Hoek, Holger Caesar, Andreas Falkovén, Tommy Johansson
  • for: To reduce the time required to obtain statistically significant safety evidence for Automated Driving Systems (ADS).
  • methods: A scenario-based testing approach that uses Graph Convolutional Networks (GCNs) to model the spatial and temporal interactions of a vehicle with its environment and with other traffic agents.
  • results: Expanding the nuScenes and Argoverse 2 driving datasets, the authors introduce a per-frame-annotated, scenario-labeled dataset covering different driving environments; a method trained on this dataset provides a promising baseline for future research on per-frame complex scenario classification.
    Abstract A scenario-based testing approach can reduce the time required to obtain statistically significant evidence of the safety of Automated Driving Systems (ADS). Identifying these scenarios in an automated manner is a challenging task. Most methods on scenario classification do not work for complex scenarios with diverse environments (highways, urban) and interaction with other traffic agents. This is mirrored in their approaches which model an individual vehicle in relation to its environment, but neglect the interaction between multiple vehicles (e.g. cut-ins, stationary lead vehicle). Furthermore, existing datasets lack diversity and do not have per-frame annotations to accurately learn the start and end time of a scenario. We propose a method for complex traffic scenario classification that is able to model the interaction of a vehicle with the environment, as well as other agents. We use Graph Convolutional Networks to model spatial and temporal aspects of these scenarios. Expanding the nuScenes and Argoverse 2 driving datasets, we introduce a scenario-labeled dataset, which covers different driving environments and is annotated per frame. Training our method on this dataset, we present a promising baseline for future research on per-frame complex scenario classification.

GROOViST: A Metric for Grounding Objects in Visual Storytelling

  • paper_url: http://arxiv.org/abs/2310.17770
  • repo_url: https://github.com/akskuchi/groovist
  • paper_authors: Aditya K Surikuchi, Sandro Pezzelle, Raquel Fernández
  • for: To evaluate the degree of visual grounding in visual storytelling, i.e., the extent to which a story is about the entities shown in the image sequence.
  • methods: Analysis of current metrics, both purpose-built and general vision-text alignment metrics, and a new evaluation tool, GROOViST, that accounts for cross-modal dependencies, temporal misalignments (the order in which entities appear in the story and in the image sequence may not match), and human intuitions on visual grounding (a toy grounding score is sketched after the abstract).
  • results: GROOViST provides a new tool for evaluating the visual grounding of visual stories; its modular design lets the contribution of each component be assessed and interpreted individually.
    Abstract A proper evaluation of stories generated for a sequence of images -- the task commonly referred to as visual storytelling -- must consider multiple aspects, such as coherence, grammatical correctness, and visual grounding. In this work, we focus on evaluating the degree of grounding, that is, the extent to which a story is about the entities shown in the images. We analyze current metrics, both designed for this purpose and for general vision-text alignment. Given their observed shortcomings, we propose a novel evaluation tool, GROOViST, that accounts for cross-modal dependencies, temporal misalignments (the fact that the order in which entities appear in the story and the image sequence may not match), and human intuitions on visual grounding. An additional advantage of GROOViST is its modular design, where the contribution of each component can be assessed and interpreted individually.
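A toy illustration, not GROOViST itself, of a grounding score: each noun phrase in the story is matched against its best-aligned image in the sequence (ignoring order, which sidesteps temporal misalignment), and the scores are averaged; embeddings are assumed precomputed with some vision-language encoder:

```python
import numpy as np

def grounding_score(phrase_embs, image_embs):
    # phrase_embs: (P, d) unit-normalized noun-phrase embeddings
    # image_embs:  (I, d) unit-normalized image embeddings
    sims = phrase_embs @ image_embs.T      # (P, I) cosine similarities
    best_per_phrase = sims.max(axis=1)     # max over images ignores order
    return float(best_per_phrase.mean())

rng = np.random.default_rng(0)
phrases = rng.normal(size=(5, 8)); phrases /= np.linalg.norm(phrases, axis=1, keepdims=True)
images = rng.normal(size=(3, 8)); images /= np.linalg.norm(images, axis=1, keepdims=True)
print(grounding_score(phrases, images))
```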

Social Contract AI: Aligning AI Assistants with Implicit Group Norms

  • paper_url: http://arxiv.org/abs/2310.17769
  • repo_url: https://github.com/janphilippfranken/scai
  • paper_authors: Jan-Philipp Fränken, Sam Kwok, Peixuan Ye, Kanishk Gandhi, Dilip Arumugam, Jared Moore, Alex Tamkin, Tobias Gerstenberg, Noah D. Goodman
  • for: validating the proposal of aligning an AI assistant by inverting a model of users’ preferences from observed interactions
  • methods: using proof-of-concept simulations in the economic ultimatum game to formalize user preferences as policies that guide the actions of simulated players (a toy version is sketched after the abstract)
  • results: the AI assistant accurately aligns its behavior to match standard policies from the economic literature, but exhibits limited generalization in an out-of-distribution setting and slow learning when there is inconsistency in the relationship between language use and an unknown policy.
    Abstract We explore the idea of aligning an AI assistant by inverting a model of users' (unknown) preferences from observed interactions. To validate our proposal, we run proof-of-concept simulations in the economic ultimatum game, formalizing user preferences as policies that guide the actions of simulated players. We find that the AI assistant accurately aligns its behavior to match standard policies from the economic literature (e.g., selfish, altruistic). However, the assistant's learned policies lack robustness and exhibit limited generalization in an out-of-distribution setting when confronted with a currency (e.g., grams of medicine) that was not included in the assistant's training distribution. Additionally, we find that when there is inconsistency in the relationship between language use and an unknown policy (e.g., an altruistic policy combined with rude language), the assistant's learning of the policy is slowed. Overall, our preliminary results suggest that developing simulation frameworks in which AI assistants need to infer preferences from diverse users can provide a valuable approach for studying practical alignment questions.
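A toy version of the ultimatum-game setup, under assumed payoff conventions rather than the authors' simulation code: a proposer policy offers a split of a pot, and a responder accepts or rejects, with rejection leaving both players with nothing:

```python
def selfish_proposer(pot: int) -> int:
    return 1                       # offer the minimum

def altruistic_proposer(pot: int) -> int:
    return pot // 2                # offer an even split

def responder(offer: int, pot: int, min_share: float = 0.3) -> bool:
    return offer >= min_share * pot

def play(proposer, pot: int = 10):
    offer = proposer(pot)
    accepted = responder(offer, pot)
    proposer_payoff = (pot - offer) if accepted else 0
    responder_payoff = offer if accepted else 0
    return offer, accepted, proposer_payoff, responder_payoff

for policy in (selfish_proposer, altruistic_proposer):
    print(policy.__name__, play(policy))
```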

Salespeople vs SalesBot: Exploring the Role of Educational Value in Conversational Recommender Systems

  • paper_url: http://arxiv.org/abs/2310.17749
  • repo_url: None
  • paper_authors: Lidiya Murakhovs’ka, Philippe Laban, Tian Xie, Caiming Xiong, Chien-Sheng Wu
  • for: To provide conversation-based product recommendations that also carry educational value.
  • methods: SalesOps, a framework that leverages large language models (LLMs) to simulate and evaluate mixed-type, mixed-initiative dialog systems, with two LLM-powered agents, SalesBot and ShopperBot, simulating either side; SalesBot is compared with professional salespeople in a comprehensive human study.
  • results: Although SalesBot approaches professional salespeople in fluency and informativeness, it lags behind in recommendation quality; both face distinct limitations in providing truthful information, highlighting the challenge of ensuring faithfulness in conversational recommender systems.
    Abstract Making big purchases requires consumers to research or consult a salesperson to gain domain expertise. However, existing conversational recommender systems (CRS) often overlook users' lack of background knowledge, focusing solely on gathering preferences. In this work, we define a new problem space for conversational agents that aim to provide both product recommendations and educational value through mixed-type mixed-initiative dialog. We introduce SalesOps, a framework that facilitates the simulation and evaluation of such systems by leveraging recent advancements in large language models (LLMs). We build SalesBot and ShopperBot, a pair of LLM-powered agents that can simulate either side of the framework. A comprehensive human study compares SalesBot against professional salespeople, revealing that although SalesBot approaches professional performance in terms of fluency and informativeness, it lags behind in recommendation quality. We emphasize the distinct limitations both face in providing truthful information, highlighting the challenges of ensuring faithfulness in the CRS context. We release our code and make all data available.

Improving Traffic Density Forecasting in Intelligent Transportation Systems Using Gated Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2310.17729
  • repo_url: None
  • paper_authors: Razib Hayat Khan, Jonayet Miah, S M Yasir Arafat, M M Mahbubul Syeed, Duc M Ca
  • for: This study examines graph neural networks for traffic forecasting, a crucial facet of intelligent transportation systems; accurate traffic predictions are vital for functions such as trip planning, traffic control, and vehicle routing.
  • methods: Three prominent GNN architectures are explored: Graph Convolutional Networks (GCNs), GraphSAGE (Graph Sample and Aggregation), and Gated Graph Neural Networks (GGNNs). Each architecture's methodology is examined in detail, including layer configurations, activation functions, and hyperparameters, with the primary goal of minimizing prediction error; GGNNs emerge as the most effective of the three models.
  • results: GCNs show an RMSE of 9.10 and an MAE of 8.00; GraphSAGE improves on this with an RMSE of 8.3 and an MAE of 7.5; GGNNs report an RMSE of 9.15 and the lowest MAE of 7.1, positioning them as the frontrunner (the two error metrics are sketched after the abstract).
    Abstract This study delves into the application of graph neural networks in the realm of traffic forecasting, a crucial facet of intelligent transportation systems. Accurate traffic predictions are vital for functions like trip planning, traffic control, and vehicle routing in such systems. Three prominent GNN architectures -- Graph Convolutional Networks, GraphSAGE (Graph Sample and Aggregation), and Gated Graph Neural Networks -- are explored within the context of traffic prediction. Each architecture's methodology is thoroughly examined, including layer configurations, activation functions, and hyperparameters. The primary goal is to minimize prediction errors, with GGNNs emerging as the most effective choice among the three models. The research outlines outcomes for each architecture, elucidating their predictive performance through root mean squared error (RMSE) and mean absolute error (MAE). Hypothetical results reveal intriguing insights: GCNs display an RMSE of 9.10 and an MAE of 8.00, while GraphSAGE shows improvement with an RMSE of 8.3 and an MAE of 7.5. Gated Graph Neural Networks (GGNNs) exhibit the lowest RMSE at 9.15 and an impressive MAE of 7.1, positioning them as the frontrunner.
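The two error metrics reported above, in plain NumPy; the example values are placeholders, not the paper's data:

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

y_true = [120, 135, 150, 160]   # e.g., vehicles per interval
y_pred = [118, 140, 149, 155]
print(f"RMSE = {rmse(y_true, y_pred):.2f}, MAE = {mae(y_true, y_pred):.2f}")
```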

Large Language Models as Generalizable Policies for Embodied Tasks

  • paper_url: http://arxiv.org/abs/2310.17722
  • repo_url: None
  • paper_authors: Andrew Szot, Max Schwarzer, Harsh Agrawal, Bogdan Mazoure, Walter Talbott, Katherine Metcalf, Natalie Mackraz, Devon Hjelm, Alexander Toshev
  • for: To adapt large language models (LLMs) into generalizable policies for embodied visual tasks.
  • methods: Large LAnguage model Reinforcement Learning Policy (LLaRP): a pre-trained, frozen LLM takes text instructions and egocentric visual observations as input and outputs actions directly in the environment, trained with reinforcement learning to see and act solely through environmental interactions.
  • results: LLaRP is robust to complex paraphrasings of task instructions and generalizes to new tasks requiring novel optimal behavior; on 1,000 unseen tasks it achieves a 42% success rate, 1.7x that of other common learned baselines or zero-shot applications of LLMs. The authors also release Language Rearrangement, a benchmark of 150,000 training and 1,000 testing tasks for language-conditioned rearrangement; video examples are at https://llm-rl.github.io.
    Abstract We show that large language models (LLMs) can be adapted to be generalizable policies for embodied visual tasks. Our approach, called Large LAnguage model Reinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take as input text instructions and visual egocentric observations and output actions directly in the environment. Using reinforcement learning, we train LLaRP to see and act solely through environmental interactions. We show that LLaRP is robust to complex paraphrasings of task instructions and can generalize to new tasks that require novel optimal behavior. In particular, on 1,000 unseen tasks it achieves 42% success rate, 1.7x the success rate of other common learned baselines or zero-shot applications of LLMs. Finally, to aid the community in studying language conditioned, massively multi-task, embodied AI problems we release a novel benchmark, Language Rearrangement, consisting of 150,000 training and 1,000 testing tasks for language-conditioned rearrangement. Video examples of LLaRP in unseen Language Rearrangement instructions are at https://llm-rl.github.io.

From Transcripts to Insights: Uncovering Corporate Risks Using Generative AI

  • paper_url: http://arxiv.org/abs/2310.17721
  • repo_url: None
  • paper_authors: Alex Kim, Maximilian Muhn, Valeri Nikolaev
  • for: To explore how generative AI tools such as ChatGPT can help investors uncover dimensions of corporate risk.
  • methods: Firm-level measures of exposure to political, climate, and AI-related risks are developed and validated by having GPT 3.5 generate risk summaries and assessments from the context provided by earnings call transcripts.
  • results: The GPT-based measures possess significant information content and outperform existing risk measures in predicting (abnormal) firm-level volatility and firms' choices such as investment and innovation; information in risk assessments dominates that in risk summaries, and generative AI is effective at detecting emerging risks such as AI risk.
    Abstract We explore the value of generative AI tools, such as ChatGPT, in helping investors uncover dimensions of corporate risk. We develop and validate firm-level measures of risk exposure to political, climate, and AI-related risks. Using the GPT 3.5 model to generate risk summaries and assessments from the context provided by earnings call transcripts, we show that GPT-based measures possess significant information content and outperform the existing risk measures in predicting (abnormal) firm-level volatility and firms' choices such as investment and innovation. Importantly, information in risk assessments dominates that in risk summaries, establishing the value of general AI knowledge. We also find that generative AI is effective at detecting emerging risks, such as AI risk, which has soared in recent quarters. Our measures perform well both within and outside the GPT's training window and are priced in equity markets. Taken together, an AI-based approach to risk measurement provides useful insights to users of corporate disclosures at a low cost.

Outlier Dimensions Encode Task-Specific Knowledge

  • paper_url: http://arxiv.org/abs/2310.17715
  • repo_url: https://github.com/wrudman/outlier_dimensions
  • paper_authors: William Rudman, Catherine Chen, Carsten Eickhoff
  • for: To investigate outlier dimensions in large language model (LLM) representations.
  • methods: Fine-tuning experiments on LLMs that examine how fine-tuning impacts outlier dimensions.
  • results: 1) Outlier dimensions that emerge in pre-training persist in fine-tuned models, and 2) a single outlier dimension can complete downstream tasks with a minimal error rate; this suggests outlier dimensions can encode crucial task-specific knowledge and that the value of a representation in a single outlier dimension drives downstream model decisions (a detection sketch follows the abstract).
    Abstract Representations from large language models (LLMs) are known to be dominated by a small subset of dimensions with exceedingly high variance. Previous works have argued that although ablating these outlier dimensions in LLM representations hurts downstream performance, outlier dimensions are detrimental to the representational quality of embeddings. In this study, we investigate how fine-tuning impacts outlier dimensions and show that 1) outlier dimensions that occur in pre-training persist in fine-tuned models and 2) a single outlier dimension can complete downstream tasks with a minimal error rate. Our results suggest that outlier dimensions can encode crucial task-specific knowledge and that the value of a representation in a single outlier dimension drives downstream model decisions.
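A simple, assumption-level way to flag outlier dimensions: dimensions whose activation variance is far above the average across dimensions (the paper's exact criterion may differ):

```python
import numpy as np

def find_outlier_dims(hidden_states, factor=5.0):
    # hidden_states: (num_tokens, hidden_dim) activations from an LLM
    variances = hidden_states.var(axis=0)
    threshold = factor * variances.mean()
    return np.nonzero(variances > threshold)[0]

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))
acts[:, 7] *= 12.0                 # plant one high-variance dimension
print(find_outlier_dims(acts))     # -> [7]
```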

A Wireless AI-Generated Content (AIGC) Provisioning Framework Empowered by Semantic Communication

  • paper_url: http://arxiv.org/abs/2310.17705
  • repo_url: None
  • paper_authors: Runze Cheng, Yao Sun, Dusit Niyato, Lan Zhang, Lei Zhang, Muhammad Ali Imran
  • for: To provide ubiquitous access to high-quality AI-generated content (AIGC) services over wireless communication networks.
  • methods: Semantic communication (SemCom): only the semantic information of the content, rather than all the binary bits, is extracted and transmitted; diffusion-based models within the semantic encoder and decoder enable efficient content generation and flexible adjustment of the computing workload of both transmitter and receiver, complemented by a resource-aware workload trade-off (ROOT) scheme.
  • results: Simulations verify that the proposed SemAIGC framework outperforms conventional approaches in latency and content quality.
    Abstract Generative AI applications are recently catering to a vast user base by creating diverse and high-quality AI-generated content (AIGC). With the proliferation of mobile devices and rapid growth of mobile traffic, providing ubiquitous access to high-quality AIGC services via wireless communication networks is becoming the future direction for AIGC products. However, it is challenging to provide optimal AIGC services in wireless networks with unstable channels, limited bandwidth resources, and unevenly distributed computational resources. To tackle these challenges, we propose a semantic communication (SemCom)-empowered AIGC (SemAIGC) generation and transmission framework, where only semantic information of the content rather than all the binary bits should be extracted and transmitted by using SemCom. Specifically, SemAIGC integrates diffusion-based models within the semantic encoder and decoder for efficient content generation and flexible adjustment of the computing workload of both transmitter and receiver. Meanwhile, we devise a resource-aware workload trade-off (ROOT) scheme into the SemAIGC framework to intelligently decide transmitter/receiver workload, thus adjusting the utilization of computational resource according to service requirements. Simulations verify the superiority of our proposed SemAIGC framework in terms of latency and content quality compared to conventional approaches.

Defending Against Transfer Attacks From Public Models

  • paper_url: http://arxiv.org/abs/2310.17645
  • repo_url: https://github.com/wagner-group/pubdef
  • paper_authors: Chawin Sitawarin, Jaewon Chang, David Huang, Wesson Altoyan, David Wagner
  • for: To propose a practical threat model, and a game-theoretically motivated defense, for the transfer attacks expected in future security-sensitive applications.
  • methods: The adversary is assumed to mount transfer attacks through publicly available surrogate models; a specialized defense, PubDef, is derived from a game-theoretic perspective.
  • results: Across three datasets (CIFAR-10, CIFAR-100, ImageNet), 24 public models, and 11 attack algorithms, PubDef outperforms state-of-the-art white-box adversarial training by a large margin with almost no loss in normal accuracy; on ImageNet it reaches 62% accuracy under the strongest transfer attack versus 36% for the best adversarially trained model, while its clean accuracy is only 2% below that of an undefended model (78% vs. 80%).
    Abstract Adversarial attacks have been a looming and unaddressed threat in the industry. However, through a decade-long history of the robustness evaluation literature, we have learned that mounting a strong or optimal attack is challenging. It requires both machine learning and domain expertise. In other words, the white-box threat model, religiously assumed by a large majority of the past literature, is unrealistic. In this paper, we propose a new practical threat model where the adversary relies on transfer attacks through publicly available surrogate models. We argue that this setting will become the most prevalent for security-sensitive applications in the future. We evaluate the transfer attacks in this setting and propose a specialized defense method based on a game-theoretic perspective. The defenses are evaluated under 24 public models and 11 attack algorithms across three datasets (CIFAR-10, CIFAR-100, and ImageNet). Under this threat model, our defense, PubDef, outperforms the state-of-the-art white-box adversarial training by a large margin with almost no loss in the normal accuracy. For instance, on ImageNet, our defense achieves 62% accuracy under the strongest transfer attack vs only 36% of the best adversarially trained model. Its accuracy when not under attack is only 2% lower than that of an undefended model (78% vs 80%). We release our code at https://github.com/wagner-group/pubdef.

In-Context Learning Dynamics with Random Binary Sequences

  • paper_url: http://arxiv.org/abs/2310.17639
  • repo_url: None
  • paper_authors: Eric J. Bigelow, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Tomer D. Ullman
  • for: The paper aims to improve our understanding of the complex, emergent capabilities of large language models (LLMs) and their in-context learning dynamics.
  • methods: The authors propose a Cognitive Interpretability framework that involves using random binary sequences as context to study the dynamics of in-context learning in LLMs. They manipulate properties of the context data, such as sequence length, to observe the behavior of the models (a context-generation sketch follows the abstract).
  • results: The authors find that the latest GPT-3.5+ models exhibit emergent abilities to generate pseudo-random numbers and learn basic formal languages, with striking in-context learning dynamics that transition sharply from pseudo-random behaviors to deterministic repetition.
    Abstract Large language models (LLMs) trained on huge corpora of text datasets demonstrate complex, emergent capabilities, achieving state-of-the-art performance on tasks they were not explicitly trained for. The precise nature of LLM capabilities is often mysterious, and different prompts can elicit different capabilities through in-context learning. We propose a Cognitive Interpretability framework that enables us to analyze in-context learning dynamics to understand latent concepts in LLMs underlying behavioral patterns. This provides a more nuanced understanding than success-or-failure evaluation benchmarks, but does not require observing internal activations as a mechanistic interpretation of circuits would. Inspired by the cognitive science of human randomness perception, we use random binary sequences as context and study dynamics of in-context learning by manipulating properties of context data, such as sequence length. In the latest GPT-3.5+ models, we find emergent abilities to generate pseudo-random numbers and learn basic formal languages, with striking in-context learning dynamics where model outputs transition sharply from pseudo-random behaviors to deterministic repetition.
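A sketch of the kind of context such an evaluation uses; the prompt format is an assumption, not the authors' exact wording:

```python
import random

def make_binary_context(length: int, seed: int = 0) -> str:
    rng = random.Random(seed)
    return ", ".join(rng.choice("01") for _ in range(length))

for length in (10, 50, 200):   # manipulate sequence length, as in the paper
    context = make_binary_context(length)
    prompt = f"Continue this random sequence: {context},"
    print(length, prompt[:60] + "...")
```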

Grow Your Limits: Continuous Improvement with Real-World RL for Robotic Locomotion

  • paper_url: http://arxiv.org/abs/2310.17634
  • repo_url: None
  • paper_authors: Laura Smith, Yunhao Cao, Sergey Levine
  • for: To enable robots to autonomously acquire complex behaviors such as legged locomotion.
  • methods: APRL, a policy regularization framework that modulates the robot's exploration over the course of training, balancing flexible improvement potential against focused, efficient exploration.
  • results: A quadrupedal robot learns to walk entirely in the real world within minutes and continues to improve with more training where prior work saturates; the continued training yields a policy substantially more capable of navigating challenging situations and able to adapt to changes in dynamics.
    Abstract Deep reinforcement learning (RL) can enable robots to autonomously acquire complex behaviors, such as legged locomotion. However, RL in the real world is complicated by constraints on efficiency, safety, and overall training stability, which limits its practical applicability. We present APRL, a policy regularization framework that modulates the robot's exploration over the course of training, striking a balance between flexible improvement potential and focused, efficient exploration. APRL enables a quadrupedal robot to efficiently learn to walk entirely in the real world within minutes and continue to improve with more training where prior work saturates in performance. We demonstrate that continued training with APRL results in a policy that is substantially more capable of navigating challenging situations and is able to adapt to changes in dynamics with continued training.

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

  • paper_url: http://arxiv.org/abs/2310.17631
  • repo_url: https://github.com/baaivision/judgelm
  • paper_authors: Lianghui Zhu, Xinggang Wang, Xinlong Wang
  • for: To evaluate large language models (LLMs) in open-ended scenarios by fine-tuning LLMs as scalable judges (JudgeLM).
  • methods: A comprehensive, large-scale, high-quality dataset containing task seeds, LLM-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, plus a new benchmark for evaluating them; JudgeLM is trained at 7B, 13B, and 33B parameters, and position, knowledge, and format biases are addressed with swap augmentation, reference support, and reference drop (swap augmentation is sketched after the abstract).
  • results: JudgeLM obtains state-of-the-art judge performance on both the existing PandaLM benchmark and the proposed new benchmark; JudgeLM-7B judges 5K samples in 3 minutes on 8 A100 GPUs and achieves agreement with the teacher judge exceeding 90%, surpassing even human-to-human agreement. JudgeLM also demonstrates extended capabilities in judging single answers, multimodal models, multiple answers, and multi-turn chat.
    Abstract Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics can not measure them comprehensively. To address this problem, we propose to fine-tune LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales from 7B, 13B, to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases in fine-tuning LLM as a judge and consider them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM obtains the state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. Our JudgeLM is efficient and the JudgeLM-7B only needs 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, achieving an agreement exceeding 90% that even surpasses human-to-human agreement. JudgeLM also demonstrates extended capabilities in being judges of the single answer, multimodal models, multiple answers, and multi-turn chat.
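A sketch of swap augmentation for countering position bias: each training example is duplicated with the two candidate answers swapped and the judgment flipped accordingly; the field names are assumptions, not JudgeLM's actual schema:

```python
def swap_augment(example: dict) -> list[dict]:
    swapped = {
        "question": example["question"],
        "answer_a": example["answer_b"],
        "answer_b": example["answer_a"],
        # label is one of "a_wins" / "b_wins" / "tie"
        "label": {"a_wins": "b_wins", "b_wins": "a_wins", "tie": "tie"}[example["label"]],
    }
    return [example, swapped]

sample = {"question": "What is 2+2?", "answer_a": "4", "answer_b": "5", "label": "a_wins"}
print(swap_augment(sample))
```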

Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in Ghana

  • paper_url: http://arxiv.org/abs/2310.17606
  • repo_url: None
  • paper_authors: Owen Henkel, Hannah Horne-Robinson, Libby Hills, Bill Roberts, Joshua McGrane
  • for: To use large-scale speech models to automatically assess the oral reading fluency (ORF) of students in Ghana.
  • methods: The latest versions of large-scale speech models (Whisper V2, wav2vec 2.0) are used to transcribe and score students' oral reading.
  • results: Whisper V2 transcribes Ghanaian students reading aloud with a Word Error Rate of 13.5, close to the model's average WER on adult speech (12.8); fully automated ORF scores derived from these transcriptions align closely with scores from expert human graders (correlation coefficient 0.96), achieved on a representative dataset with an off-the-shelf model and no fine-tuning (a WER sketch follows the abstract).
    Abstract This paper reports on a set of three recent experiments utilizing large-scale speech models to evaluate the oral reading fluency (ORF) of students in Ghana. While ORF is a well-established measure of foundational literacy, assessing it typically requires one-on-one sessions between a student and a trained evaluator, a process that is time-consuming and costly. Automating the evaluation of ORF could support better literacy instruction, particularly in education contexts where formative assessment is uncommon due to large class sizes and limited resources. To our knowledge, this research is among the first to examine the use of the most recent versions of large-scale speech models (Whisper V2 wav2vec2.0) for ORF assessment in the Global South. We find that Whisper V2 produces transcriptions of Ghanaian students reading aloud with a Word Error Rate of 13.5. This is close to the model's average WER on adult speech (12.8) and would have been considered state-of-the-art for children's speech transcription only a few years ago. We also find that when these transcriptions are used to produce fully automated ORF scores, they closely align with scores generated by expert human graders, with a correlation coefficient of 0.96. Importantly, these results were achieved on a representative dataset (i.e., students with regional accents, recordings taken in actual classrooms), using a free and publicly available speech model out of the box (i.e., no fine-tuning). This suggests that using large-scale speech models to assess ORF may be feasible to implement and scale in lower-resource, linguistically diverse educational contexts.
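A minimal sketch of the transcribe-then-score pipeline, assuming the open-source openai-whisper and jiwer packages; the audio path and reference passage are placeholders:

```python
import whisper
from jiwer import wer

model = whisper.load_model("large-v2")          # Whisper V2 weights
result = model.transcribe("student_reading.wav")
hypothesis = result["text"]

reference = "the cat sat on the mat and looked at the dog"  # passage read aloud
error_rate = wer(reference.lower(), hypothesis.lower())
print(f"WER = {100 * error_rate:.1f}")          # the paper reports 13.5 on its dataset

# Words correct per minute (a common ORF score) could then be derived
# from the transcription's word count and the audio duration.
```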

MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations

  • paper_url: http://arxiv.org/abs/2310.17596
  • repo_url: None
  • paper_authors: Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, Dieter Fox
  • for: To automatically synthesize large-scale, rich datasets for training robot agents through imitation learning.
  • methods: MimicGen generates demonstrations from a small number of human demonstrations by adapting them to new contexts; from roughly 200 human demonstrations it produces over 50K demonstrations across 18 tasks with diverse scene configurations, object instances, and robot arms.
  • results: Robot agents trained on the generated data via imitation learning achieve strong performance on long-horizon, high-precision tasks such as multi-part assembly and coffee preparation, across broad initial state distributions; the effectiveness of MimicGen data compares favorably to collecting additional human demonstrations, making it a powerful and economical approach to scaling up robot learning. Datasets, simulation environments, and videos: https://mimicgen.github.io.
    Abstract Imitation learning from a large set of human demonstrations has proved to be an effective paradigm for building capable robot agents. However, the demonstrations can be extremely costly and time-consuming to collect. We introduce MimicGen, a system for automatically synthesizing large-scale, rich datasets from only a small number of human demonstrations by adapting them to new contexts. We use MimicGen to generate over 50K demonstrations across 18 tasks with diverse scene configurations, object instances, and robot arms from just ~200 human demonstrations. We show that robot agents can be effectively trained on this generated dataset by imitation learning to achieve strong performance in long-horizon and high-precision tasks, such as multi-part assembly and coffee preparation, across broad initial state distributions. We further demonstrate that the effectiveness and utility of MimicGen data compare favorably to collecting additional human demonstrations, making it a powerful and economical approach towards scaling up robot learning. Datasets, simulation environments, videos, and more at https://mimicgen.github.io .

SPA: A Graph Spectral Alignment Perspective for Domain Adaptation

  • paper_url: http://arxiv.org/abs/2310.17594
  • repo_url: None
  • paper_authors: Zhiqing Xiao, Haobo Wang, Ying Jin, Lei Feng, Gang Chen, Fei Huang, Junbo Zhao
  • for: Unsupervised domain adaptation (UDA): extending an in-domain model to distinct target domains whose data distributions differ; most prior work captures inter-domain transferability but overlooks rich intra-domain structure, which empirically hurts discriminability.
  • methods: A graph SPectral Alignment (SPA) framework that casts the DA problem in graph primitives: a coarse graph alignment mechanism with a novel spectral regularizer aligns the domain graphs in eigenspaces, and a fine-grained message propagation module, built on a neighbor-aware self-training mechanism, enhances discriminability in the target domain.
  • results: In extensive experiments on standardized benchmarks, SPA surpasses existing cutting-edge DA methods; dense model analysis indicates superior efficacy, robustness, discriminability, and transferability. Code and data: https://github.com/CrownX/SPA.
    Abstract Unsupervised domain adaptation (UDA) is a pivotal form in machine learning to extend the in-domain model to the distinctive target domains where the data distributions differ. Most prior works focus on capturing the inter-domain transferability but largely overlook rich intra-domain structures, which empirically results in even worse discriminability. In this work, we introduce a novel graph SPectral Alignment (SPA) framework to tackle the tradeoff. The core of our method is briefly condensed as follows: (i)-by casting the DA problem to graph primitives, SPA composes a coarse graph alignment mechanism with a novel spectral regularizer towards aligning the domain graphs in eigenspaces; (ii)-we further develop a fine-grained message propagation module -- upon a novel neighbor-aware self-training mechanism -- in order for enhanced discriminability in the target domain. On standardized benchmarks, the extensive experiments of SPA demonstrate that its performance has surpassed the existing cutting-edge DA methods. Coupled with dense model analysis, we conclude that our approach indeed possesses superior efficacy, robustness, discriminability, and transferability. Code and data are available at: https://github.com/CrownX/SPA.

An Open Source Data Contamination Report for Llama Series Models

  • paper_url: http://arxiv.org/abs/2310.17589
  • repo_url: https://github.com/liyucheng09/contamination_detector
  • paper_authors: Yucheng Li
  • for: To provide an open-source data contamination report for the Llama series models, supporting reliable model evaluation.
  • methods: Six popular multi-choice QA benchmarks are analyzed and their overlap with Llama's training set is quantified (an overlap sketch follows the abstract).
  • results: Contamination levels ranging from 1% to 8.7% are found across benchmarks, and Llama models gain over 5% higher accuracy on contaminated subsets than on clean subsets. Data and code: https://github.com/liyucheng09/Contamination_Detector.
    Abstract Data contamination in language model evaluation is increasingly prevalent as the popularity of large language models. It allows models to "cheat" via memorisation instead of displaying true capabilities. Therefore, contamination analysis has became an crucial part of reliable model evaluation to validate results. However, existing contamination analysis is usually conducted internally by LLM developers and often lacks transparency and completeness. This paper present an open source data contamination reports for the Llama series models. We analyse six popular multi-choice QA benchmarks and quantify their overlapping with the training set of Llama. Various levels of contamination ranging from 1\% to 8.7\% are found across benchmarks. Our comparison also reveals that Llama models can gain over 5\% higher accuracy on contaminated subsets versus clean subsets. Data and code are available at: https://github.com/liyucheng09/Contamination_Detector.
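An assumption-level sketch of overlap-based contamination detection: a benchmark item is flagged if enough of its word n-grams also appear in the training corpus (the repository's actual method may differ):

```python
def ngrams(text: str, n: int = 8):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items, training_corpus, n=8, threshold=0.5):
    corpus_ngrams = ngrams(training_corpus, n)
    flagged = 0
    for item in benchmark_items:
        item_ngrams = ngrams(item, n)
        if item_ngrams and len(item_ngrams & corpus_ngrams) / len(item_ngrams) >= threshold:
            flagged += 1
    return flagged / len(benchmark_items)

items = ["the quick brown fox jumps over the lazy dog near the river bank today"]
corpus = "... the quick brown fox jumps over the lazy dog near the river bank ..."
print(contamination_rate(items, corpus))   # -> 1.0 (item flagged as contaminated)
```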

Can LLMs Grade Short-answer Reading Comprehension Questions : Foundational Literacy Assessment in LMICs

  • paper_url: http://arxiv.org/abs/2310.18373
  • repo_url: None
  • paper_authors: Owen Henkel, Libby Hills, Bill Roberts, Joshua McGrane
  • for: To assess how reliably a large language model (GPT-4) can grade short-answer reading comprehension questions.
  • methods: Various configurations of generative LLMs evaluate student responses from a new dataset drawn from reading assessments completed by over 150 students in Ghana; because the dataset is novel and unseen in GPT's training runs, it also tests domain shift and generalizability.
  • results: GPT-4, with minimal prompt engineering, performs extremely well on the novel dataset (quadratic weighted kappa 0.923, F1 0.88), substantially outperforming transfer-learning-based approaches and even exceeding expert human raters (quadratic weighted kappa 0.915, F1 0.87); this suggests generative LLMs can reliably evaluate foundational literacy (the agreement statistic is sketched after the abstract).
    Abstract This paper presents emerging evidence of using generative large language models (i.e., GPT-4) to reliably evaluate short-answer reading comprehension questions. Specifically, we explore how various configurations of generative (LLMs) are able to evaluate student responses from a new dataset, drawn from a battery of reading assessments conducted with over 150 students in Ghana. As this dataset is novel and hence not used in training runs of GPT, it offers an opportunity to test for domain shift and evaluate the generalizability of generative LLMs, which are predominantly designed and trained on data from high-income North American countries. We found that GPT-4, with minimal prompt engineering performed extremely well on evaluating the novel dataset (Quadratic Weighted Kappa 0.923, F1 0.88), substantially outperforming transfer-learning based approaches, and even exceeding expert human raters (Quadratic Weighted Kappa 0.915, F1 0.87). To the best of our knowledge, our work is the first to empirically evaluate the performance of generative LLMs on short-answer reading comprehension questions, using real student data, and suggests that generative LLMs have the potential to reliably evaluate foundational literacy. Currently the assessment of formative literacy and numeracy is infrequent in many low and middle-income countries (LMICs) due to the cost and operational complexities of conducting them at scale. Automating the grading process for reading assessment could enable wider usage, and in turn improve decision-making regarding curricula, school management, and teaching practice at the classroom level. Importantly, in contrast transfer learning based approaches, generative LLMs generalize well and the technical barriers to their use are low, making them more feasible to implement and scale in lower resource educational contexts.
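The agreement statistic reported above, quadratic weighted kappa, computed with scikit-learn; the scores below are made-up placeholders:

```python
from sklearn.metrics import cohen_kappa_score

human_scores = [0, 1, 2, 2, 1, 0, 2, 1]   # expert rater grades
model_scores = [0, 1, 2, 1, 1, 0, 2, 1]   # hypothetical GPT-4 grades

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")   # the paper reports 0.923 for GPT-4 on its dataset
```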

Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models

  • paper_url: http://arxiv.org/abs/2310.17567
  • repo_url: None
  • paper_authors: Dingli Yu, Simran Kaur, Arushi Gupta, Jonah Brown-Cohen, Anirudh Goyal, Sanjeev Arora
  • for: To examine how LLM evaluations should change as LLMs shift from statistical language models to general-purpose AI agents.
  • methods: Skill-Mix, a new evaluation of the ability to combine skills: from a list of N skills, the evaluator repeatedly picks random subsets of k skills and asks the LLM to produce text combining that subset; since the number of subsets grows like N^k, even modest k demands text significantly different from anything in the training set (a sampling sketch follows the abstract). Grading is automated with GPT-4 and the open LLaMA-2 70B model, with human spot-checking.
  • results: Administering a version to popular chatbots gave results generally in line with prior expectations but containing surprises: sizeable capability differences exist among models that are not captured by their ranking on popular LLM leaderboards; moreover, simple probability calculations indicate that GPT-4's reasonable performance at k=5 is suggestive of combining skills in ways not seen during training, i.e., going beyond "stochastic parrot" behavior (Bender et al., 2021).
    Abstract With LLMs shifting their role from statistical modeling of language to serving as general-purpose AI agents, how should LLM evaluations change? Arguably, a key ability of an AI agent is to flexibly combine, as needed, the basic skills it has learned. The capability to combine skills plays an important role in (human) pedagogy and also in a paper on emergence phenomena (Arora & Goyal, 2023). This work introduces Skill-Mix, a new evaluation to measure ability to combine skills. Using a list of $N$ skills the evaluator repeatedly picks random subsets of $k$ skills and asks the LLM to produce text combining that subset of skills. Since the number of subsets grows like $N^k$, for even modest $k$ this evaluation will, with high probability, require the LLM to produce text significantly different from any text in the training set. The paper develops a methodology for (a) designing and administering such an evaluation, and (b) automatic grading (plus spot-checking by humans) of the results using GPT-4 as well as the open LLaMA-2 70B model. Administering a version of to popular chatbots gave results that, while generally in line with prior expectations, contained surprises. Sizeable differences exist among model capabilities that are not captured by their ranking on popular LLM leaderboards ("cramming for the leaderboard"). Furthermore, simple probability calculations indicate that GPT-4's reasonable performance on $k=5$ is suggestive of going beyond "stochastic parrot" behavior (Bender et al., 2021), i.e., it combines skills in ways that it had not seen during training. We sketch how the methodology can lead to a Skill-Mix based eco-system of open evaluations for AI capabilities of future models.
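A sketch of Skill-Mix-style prompt construction under assumed skill names; the real skill list and prompt wording belong to the paper:

```python
import math
import random

skills = ["metaphor", "red herring", "modus ponens", "alliteration",
          "self-serving bias", "spatial reasoning"]   # N example skills
k = 3

subset = random.sample(skills, k)
prompt = (f"Write a short text that naturally exhibits all of these "
          f"skills: {', '.join(subset)}.")
print(prompt)

# The number of k-subsets grows combinatorially, so most combinations
# are unlikely to appear verbatim in any training set:
print(math.comb(len(skills), k))   # C(N, k); the paper notes growth ~ N^k
```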

Bifurcations and loss jumps in RNN training

  • paper_url: http://arxiv.org/abs/2310.17561
  • repo_url: https://github.com/durstewitzlab/scyfi
  • paper_authors: Lukas Eisenmann, Zahra Monfared, Niclas Alexander Göring, Daniel Durstewitz
  • for: Uses recurrent neural networks (RNNs) to model and forecast sequential data and to infer dynamical systems (DS) from observed time series, and asks how bifurcations shape both the computational and dynamical properties of trained RNNs and the training process itself.
  • methods: Applies concepts from DS theory to RNN training and mathematically proves, for a class of ReLU-based RNNs, that certain bifurcations are associated with loss gradients tending toward infinity or zero, which can account for sudden loss jumps during training.
  • results: Introduces a novel heuristic algorithm that exactly finds all fixed points and k-cycles of ReLU-based RNNs, along with their existence and stability regions (bifurcation manifolds in parameter space), with surprisingly good scaling behavior; applying it to training shows that generalized teacher forcing completely avoids certain types of bifurcations.
    Abstract Recurrent neural networks (RNNs) are popular machine learning tools for modeling and forecasting sequential data and for inferring dynamical systems (DS) from observed time series. Concepts from DS theory (DST) have variously been used to further our understanding of both how trained RNNs solve complex tasks and the training process itself. Bifurcations are particularly important phenomena in DS, including RNNs, that refer to topological (qualitative) changes in a system's dynamical behavior as one or more of its parameters are varied. Knowing the bifurcation structure of an RNN will thus allow one to deduce many of its computational and dynamical properties, like its sensitivity to parameter variations or its behavior during training. In particular, bifurcations may account for sudden loss jumps observed in RNN training that could severely impede the training process. Here we first mathematically prove for a particular class of ReLU-based RNNs that certain bifurcations are indeed associated with loss gradients tending toward infinity or zero. We then introduce a novel heuristic algorithm for detecting all fixed points and k-cycles in ReLU-based RNNs and their existence and stability regions, hence bifurcation manifolds in parameter space. In contrast to previous numerical algorithms for finding fixed points and common continuation methods, our algorithm provides exact results and returns fixed points and cycles up to high orders with surprisingly good scaling behavior. We exemplify the algorithm on the analysis of the training process of RNNs, and find that the recently introduced technique of generalized teacher forcing completely avoids certain types of bifurcations in training. Thus, besides facilitating the DST analysis of trained RNNs, our algorithm provides a powerful instrument for analyzing the training process itself.
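Because the RNN map is piecewise linear, each ReLU activation pattern defines an affine region in which fixed points solve a linear system. The toy sketch below makes that concrete for $z_{t+1} = \mathrm{ReLU}(Wz_t + b)$ by brute-force enumeration of the $2^n$ patterns; the paper's SCYFI algorithm instead searches these patterns heuristically and also finds k-cycles, so treat this only as an illustration of the underlying idea.

```python
import itertools
import numpy as np

def relu_rnn_fixed_points(W, b, tol=1e-9):
    """Find all fixed points of z' = relu(W z + b) by activation pattern.

    Within a pattern D (0/1 diagonal), a fixed point solves the linear
    system (I - D W) z = D b; it is kept only if the pattern is
    self-consistent, i.e. W z + b is positive exactly where D is 1.
    """
    n = len(b)
    points = []
    for pattern in itertools.product([0.0, 1.0], repeat=n):  # 2^n regions
        D = np.diag(pattern)
        A = np.eye(n) - D @ W
        if abs(np.linalg.det(A)) < tol:
            continue  # degenerate region: no isolated fixed point
        z = np.linalg.solve(A, D @ b)
        if np.all(((W @ z + b) > 0) == (np.array(pattern) > 0)):
            points.append(z)
    return points

W = np.array([[0.5, -1.0], [0.8, 0.2]])
b = np.array([0.3, -0.1])
print(relu_rnn_fixed_points(W, b))
```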

Instability of computer vision models is a necessary result of the task itself

  • paper_url: http://arxiv.org/abs/2310.17559
  • repo_url: None
  • paper_authors: Oliver Turnbull, George Cevora
  • for: Examines adversarial examples arising from the instability of computer vision models, arguing that this instability follows from how the computer vision task itself is currently formulated.
  • methods: Analyzes three causes of instability: a) symmetries (translational invariance) of the data, b) the categorical nature of the classification task, and c) the fundamental discrepancy of classifying images as if they were the objects themselves.
  • results: Concludes that instability is a necessary result of the current formulation and is further exacerbated by non-exhaustive labelling of training data; it can be partially alleviated by i) increasing image resolution, ii) providing contextual information for the image, iii) exhaustive labelling of training data, and iv) preventing attackers from frequent access to the computer vision system.
    Abstract Adversarial examples resulting from instability of current computer vision models are an extremely important topic due to their potential to compromise any application. In this paper we demonstrate that instability is inevitable due to a) symmetries (translational invariance) of the data, b) the categorical nature of the classification task, and c) the fundamental discrepancy of classifying images as objects themselves. The issue is further exacerbated by non-exhaustive labelling of the training data. Therefore we conclude that instability is a necessary result of how the problem of computer vision is currently formulated. While the problem cannot be eliminated, through the analysis of the causes, we have arrived at ways how it can be partially alleviated. These include i) increasing the resolution of images, ii) providing contextual information for the image, iii) exhaustive labelling of training data, and iv) preventing attackers from frequent access to the computer vision system.

Interactive Robot Learning from Verbal Correction

  • paper_url: http://arxiv.org/abs/2310.17555
  • repo_url: None
  • paper_authors: Huihan Liu, Alice Chen, Yuke Zhu, Adith Swaminathan, Andrey Kolobov, Ching-An Cheng
  • for: Aims to let robots learn and refine their behavior after deployment in unstructured environments such as households, so they can better assist people in everyday life.
  • methods: Presents OLAF, a learning system built on a large language model (LLM) that lets everyday users teach a robot through verbal corrections when it makes mistakes; OLAF updates the robot's visuomotor neural policy based on the verbal feedback so mistakes are not repeated.
  • results: In experiments where a user teaches a robot long-horizon manipulation tasks in simulation and on physical hardware, the policy success rate improves by 20.0% on average. Videos and more results are at https://ut-austin-rpl.github.io/olaf/
    Abstract The ability to learn and refine behavior after deployment has become ever more important for robots as we design them to operate in unstructured environments like households. In this work, we design a new learning system based on large language model (LLM), OLAF, that allows everyday users to teach a robot using verbal corrections when the robot makes mistakes, e.g., by saying "Stop what you're doing. You should move closer to the cup." A key feature of OLAF is its ability to update the robot's visuomotor neural policy based on the verbal feedback to avoid repeating mistakes in the future. This is in contrast to existing LLM-based robotic systems, which only follow verbal commands or corrections but not learn from them. We demonstrate the efficacy of our design in experiments where a user teaches a robot to perform long-horizon manipulation tasks both in simulation and on physical hardware, achieving on average 20.0% improvement in policy success rate. Videos and more results are at https://ut-austin-rpl.github.io/olaf/

Model-Based Runtime Monitoring with Interactive Imitation Learning

  • paper_url: http://arxiv.org/abs/2310.17552
  • repo_url: None
  • paper_authors: Huihan Liu, Shivin Dass, Roberto Martín-Martín, Yuke Zhu
  • for: Aims to make robot learning methods reliable enough for high-stakes deployment by monitoring and detecting errors during task execution.
  • methods: Introduces a model-based runtime monitoring algorithm, trained within an interactive imitation learning framework, that learns a latent-space dynamics model and a failure classifier from deployment data, so it can simulate future action outcomes and preemptively detect out-of-distribution and high-risk states.
  • results: Outperforms baselines with 23% and 40% higher success rates in simulation and on physical hardware, respectively, while reducing the human workload needed over time.
    Abstract Robot learning methods have recently made great strides, but generalization and robustness challenges still hinder their widespread deployment. Failing to detect and address potential failures renders state-of-the-art learning systems not combat-ready for high-stakes tasks. Recent advances in interactive imitation learning have presented a promising framework for human-robot teaming, enabling the robots to operate safely and continually improve their performances over long-term deployments. Nonetheless, existing methods typically require constant human supervision and preemptive feedback, limiting their practicality in realistic domains. This work aims to endow a robot with the ability to monitor and detect errors during task execution. We introduce a model-based runtime monitoring algorithm that learns from deployment data to detect system anomalies and anticipate failures. Unlike prior work that cannot foresee future failures or requires failure experiences for training, our method learns a latent-space dynamics model and a failure classifier, enabling our method to simulate future action outcomes and detect out-of-distribution and high-risk states preemptively. We train our method within an interactive imitation learning framework, where it continually updates the model from the experiences of the human-robot team collected using trustworthy deployments. Consequently, our method reduces the human workload needed over time while ensuring reliable task execution. Our method outperforms the baselines across system-level and unit-test metrics, with 23% and 40% higher success rates in simulation and on physical hardware, respectively. More information at https://ut-austin-rpl.github.io/sirius-runtime-monitor/
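The core runtime check can be pictured as a short latent rollout followed by a risk query. The function below is a schematic reconstruction from the abstract; all interfaces (`encoder`, `dynamics`, `failure_prob`) are hypothetical callables, not the paper's actual modules.

```python
def assess_action(encoder, dynamics, failure_prob, obs, action,
                  horizon=5, threshold=0.5):
    """Roll a candidate action forward in a learned latent dynamics model
    and flag the state for human intervention if any simulated future
    looks failure-prone (out-of-distribution or high-risk)."""
    z = encoder(obs)
    risk = 0.0
    for _ in range(horizon):
        z = dynamics(z, action)        # simplification: action held fixed
        risk = max(risk, failure_prob(z))
    return risk, risk > threshold      # True -> hand control to the human
```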

Unpacking the Ethical Value Alignment in Big Models

  • paper_url: http://arxiv.org/abs/2310.17551
  • repo_url: None
  • paper_authors: Xiaoyuan Yi, Jing Yao, Xiting Wang, Xing Xie
  • for: Examines the risks and challenges that big models pose to society and how existing AI ethics guidelines address them.
  • methods: Surveys existing AI ethics guidelines and analyzes the ethical implications of big models' limitations; investigates the moral inclinations of current mainstream LLMs using Moral Foundation theory, analyzes existing alignment algorithms, and outlines the challenges of aligning ethical values within these models.
  • results: Proposes a novel conceptual paradigm for aligning the ethical values of big models, argues for collaborative academic efforts toward a unified and universal AI ethics framework, and outlines promising research directions for alignment criteria, evaluation, and methods.
    Abstract Big models have greatly advanced AI's ability to understand, generate, and manipulate information and content, enabling numerous applications. However, as these models become increasingly integrated into everyday life, their inherent ethical values and potential biases pose unforeseen risks to society. This paper provides an overview of the risks and challenges associated with big models, surveys existing AI ethics guidelines, and examines the ethical implications arising from the limitations of these models. Taking a normative ethics perspective, we propose a reassessment of recent normative guidelines, highlighting the importance of collaborative efforts in academia to establish a unified and universal AI ethics framework. Furthermore, we investigate the moral inclinations of current mainstream LLMs using the Moral Foundation theory, analyze existing alignment algorithms, and outline the unique challenges encountered in aligning ethical values within them. To address these challenges, we introduce a novel conceptual paradigm for aligning the ethical values of big models and discuss promising research directions for alignment criteria, evaluation, and method, representing an initial step towards the interdisciplinary construction of the ethically aligned AI. This paper is a modified English version of our Chinese paper https://crad.ict.ac.cn/cn/article/doi/10.7544/issn1000-1239.202330553, intended to help non-Chinese native speakers better understand our work.

Human-Guided Complexity-Controlled Abstractions

  • paper_url: http://arxiv.org/abs/2310.17550
  • repo_url: https://github.com/mycal-tucker/human-guided-abstractions
  • paper_authors: Andi Peng, Mycal Tucker, Eoin Kenny, Noga Zaslavsky, Pulkit Agrawal, Julie Shah
  • for: Investigates how neural networks can learn discrete representations at a spectrum of abstraction levels and deploy the abstraction appropriate to the task, so they generalize to novel settings and tasks.
  • methods: Trains neural models to generate a spectrum of discrete representations and controls representation complexity (roughly, how many bits are allocated for encoding inputs) by tuning the entropy of the distribution over representations.
  • results: In fine-tuning experiments with only a small number of labeled examples for a new task, (1) tuning the representation to a task-appropriate complexity level supports the highest fine-tuning performance, and (2) in a human-participant study, users could identify the appropriate complexity level for a downstream task from visualizations of the discrete representations.
    Abstract Neural networks often learn task-specific latent representations that fail to generalize to novel settings or tasks. Conversely, humans learn discrete representations (i.e., concepts or words) at a variety of abstraction levels (e.g., "bird" vs. "sparrow") and deploy the appropriate abstraction based on task. Inspired by this, we train neural models to generate a spectrum of discrete representations, and control the complexity of the representations (roughly, how many bits are allocated for encoding inputs) by tuning the entropy of the distribution over representations. In finetuning experiments, using only a small number of labeled examples for a new task, we show that (1) tuning the representation to a task-appropriate complexity level supports the highest finetuning performance, and (2) in a human-participant study, users were able to identify the appropriate complexity level for a downstream task using visualizations of discrete representations. Our results indicate a promising direction for rapid model finetuning by leveraging human insight.
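One simple way to "tune the entropy of the distribution over representations" is an entropy penalty on a softmax-over-codes bottleneck, with its weight as the complexity knob. The loss below is a minimal sketch under that assumption, not the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def complexity_controlled_loss(logits, task_loss, entropy_weight):
    """logits: (batch, n_codes) scores over discrete representations.
    A larger entropy penalty pushes the encoder toward fewer effective
    codes (coarser, lower-complexity abstractions); relaxing it allows
    more bits per input (finer abstractions)."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
    return task_loss + entropy_weight * entropy

loss = complexity_controlled_loss(torch.randn(32, 16),
                                  task_loss=torch.tensor(1.0),
                                  entropy_weight=0.1)
```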

Neuro-Inspired Fragmentation and Recall to Overcome Catastrophic Forgetting in Curiosity

  • paper_url: http://arxiv.org/abs/2310.17537
  • repo_url: https://github.com/fietelab/farcuriosity
  • paper_authors: Jaedong Hwang, Zhang-Wei Hong, Eric Chen, Akhilan Boopathy, Pulkit Agrawal, Ila Fiete
  • for: Addresses catastrophic forgetting in prediction-based curiosity (intrinsic reward) methods on hard-exploration tasks with sparse rewards.
  • methods: Proposes FARCuriosity, based on fragmentation and recall: the agent fragments the environment based on surprisal and uses a separate local curiosity module (prediction-based intrinsic reward function) per fragment, storing modules in long-term memory and recalling whichever best matches the current state.
  • results: Achieves less forgetting and better overall performance in games with varied and heterogeneous environments from the Atari benchmark suite.
    Abstract Deep reinforcement learning methods exhibit impressive performance on a range of tasks but still struggle on hard exploration tasks in large environments with sparse rewards. To address this, intrinsic rewards can be generated using forward model prediction errors that decrease as the environment becomes known, and incentivize an agent to explore novel states. While prediction-based intrinsic rewards can help agents solve hard exploration tasks, they can suffer from catastrophic forgetting and actually increase at visited states. We first examine the conditions and causes of catastrophic forgetting in grid world environments. We then propose a new method FARCuriosity, inspired by how humans and animals learn. The method depends on fragmentation and recall: an agent fragments an environment based on surprisal, and uses different local curiosity modules (prediction-based intrinsic reward functions) for each fragment so that modules are not trained on the entire environment. At each fragmentation event, the agent stores the current module in long-term memory (LTM) and either initializes a new module or recalls a previously stored module based on its match with the current state. With fragmentation and recall, FARCuriosity achieves less forgetting and better overall performance in games with varied and heterogeneous environments in the Atari benchmark suite of tasks. Thus, this work highlights the problem of catastrophic forgetting in prediction-based curiosity methods and proposes a solution.
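The fragmentation-and-recall bookkeeping reduces to a small control-flow rule. The sketch below assumes a surprisal threshold triggers fragmentation and that stored modules are matched by prediction error; names and interfaces are illustrative, not the released code (https://github.com/fietelab/farcuriosity).

```python
def select_curiosity_module(state, current, memory, surprisal,
                            threshold, new_module):
    """On high surprisal, stash the current local curiosity module in
    long-term memory (LTM), then either recall the stored module that
    best predicts the new state or initialize one for the new fragment."""
    if surprisal(current, state) < threshold:
        return current, memory                  # same fragment, keep module
    memory = memory + [current]                 # fragmentation event: store in LTM
    best = min(memory, key=lambda m: m.prediction_error(state))
    if best.prediction_error(state) < threshold:
        return best, memory                     # recall: a fragment seen before
    return new_module(), memory                 # unseen fragment, fresh module
```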

SoK: Pitfalls in Evaluating Black-Box Attacks

  • paper_url: http://arxiv.org/abs/2310.17534
  • repo_url: https://github.com/iamgroot42/blackboxsok
  • paper_authors: Fnu Suya, Anshuman Suri, Tingwei Zhang, Jingtao Hong, Yuan Tian, David Evans
  • for: Systematizes knowledge on black-box attacks against image classifiers and provides a taxonomy for understanding their threat space.
  • methods: Organizes the threat space along the axes of feedback granularity, access to interactive queries, and the quality and quantity of auxiliary data available to the attacker, identifying under-explored threat spaces that require further research.
  • results: Establishes a new state of the art in the less-studied setting of access to top-k confidence scores and shows it remains challenging even for well-explored techniques; overturns prior state-of-the-art claims in the interactive-query setting; and reveals connections between black-box attacks and related areas, such as model inversion and extraction attacks, whose advances could enable stronger black-box attacks.
    Abstract Numerous works study black-box attacks on image classifiers. However, these works make different assumptions on the adversary's knowledge and current literature lacks a cohesive organization centered around the threat model. To systematize knowledge in this area, we propose a taxonomy over the threat space spanning the axes of feedback granularity, the access of interactive queries, and the quality and quantity of the auxiliary data available to the attacker. Our new taxonomy provides three key insights. 1) Despite extensive literature, numerous under-explored threat spaces exist, which cannot be trivially solved by adapting techniques from well-explored settings. We demonstrate this by establishing a new state-of-the-art in the less-studied setting of access to top-k confidence scores by adapting techniques from well-explored settings of accessing the complete confidence vector, but show how it still falls short of the more restrictive setting that only obtains the prediction label, highlighting the need for more research. 2) Identifying the threat model of different attacks uncovers stronger baselines that challenge prior state-of-the-art claims. We demonstrate this by enhancing an initially weaker baseline (under interactive query access) via surrogate models, effectively overturning claims in the respective paper. 3) Our taxonomy reveals interactions between attacker knowledge that connect well to related areas, such as model inversion and extraction attacks. We discuss how advances in other areas can enable potentially stronger black-box attacks. Finally, we emphasize the need for a more realistic assessment of attack success by factoring in local attack runtime. This approach reveals the potential for certain attacks to achieve notably higher success rates and the need to evaluate attacks in diverse and harder settings, highlighting the need for better selection criteria.

Can large language models replace humans in the systematic review process? Evaluating GPT-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages

  • paper_url: http://arxiv.org/abs/2310.17526
  • repo_url: None
  • paper_authors: Qusai Khraisha, Sophie Put, Johanna Kappenberg, Azza Warraitch, Kristin Hadfield
  • for: Evaluates the performance of a large language model (GPT-4) across systematic review tasks, literature types, and languages.
  • methods: Uses a pre-registered, 'human-out-of-the-loop' approach to test GPT-4 on title/abstract screening, full-text review, and data extraction.
  • results: GPT-4's accuracy was on par with human performance in most tasks, but results were skewed by chance agreement and dataset imbalance; after adjusting for these, data extraction was moderate and screening ranged from none to moderate across stages and languages, while full-text screening with highly reliable prompts was 'almost perfect'.
    Abstract Systematic reviews are vital for guiding practice, research, and policy, yet they are often slow and labour-intensive. Large language models (LLMs) could offer a way to speed up and automate systematic reviews, but their performance in such tasks has not been comprehensively evaluated against humans, and no study has tested GPT-4, the biggest LLM so far. This pre-registered study evaluates GPT-4's capability in title/abstract screening, full-text review, and data extraction across various literature types and languages using a 'human-out-of-the-loop' approach. Although GPT-4 had accuracy on par with human performance in most tasks, results were skewed by chance agreement and dataset imbalance. After adjusting for these, there was a moderate level of performance for data extraction, and - barring studies that used highly reliable prompts - screening performance levelled at none to moderate for different stages and languages. When screening full-text literature using highly reliable prompts, GPT-4's performance was 'almost perfect.' Penalising GPT-4 for missing key studies using highly reliable prompts improved its performance even more. Our findings indicate that, currently, substantial caution should be used if LLMs are being used to conduct systematic reviews, but suggest that, for certain systematic review tasks delivered under reliable prompts, LLMs can rival human performance.

The Expressive Power of Low-Rank Adaptation

  • paper_url: http://arxiv.org/abs/2310.17513
  • repo_url: https://github.com/uw-madison-lee-lab/expressive_power_of_lora
  • paper_authors: Yuchen Zeng, Kangwook Lee
  • for: This paper aims to theoretically analyze the expressive power of Low-Rank Adaptation (LoRA) for fine-tuning pre-trained models, specifically large language models and diffusion models.
  • methods: The paper uses theoretical analysis to prove the expressive power of LoRA for fully connected neural networks and Transformer networks. The authors show that LoRA can adapt any model to accurately represent any smaller target model with a certain rank threshold.
  • results: The paper proves that, for fully connected neural networks, LoRA can adapt any model to accurately represent any smaller target model if LoRA-rank is greater than or equal to the width of the model times the ratio of the target model's depth to the model's depth. For Transformer networks, the authors show that any model can be adapted to a target model of the same size with rank-$\frac{\text{embedding size}}{2}$ LoRA adapters. The paper also quantifies the approximation error when LoRA-rank is lower than the threshold.
    Abstract Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that leverages low-rank adaptation of weight matrices, has emerged as a prevalent technique for fine-tuning pre-trained models such as large language models and diffusion models. Despite its huge success in practice, the theoretical underpinnings of LoRA have largely remained unexplored. This paper takes the first step to bridge this gap by theoretically analyzing the expressive power of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any model $f$ to accurately represent any smaller target model $\overline{f}$ if LoRA-rank $\geq (\text{width of } f) \times \frac{\text{depth of } \overline{f}}{\text{depth of } f}$. We also quantify the approximation error when LoRA-rank is lower than the threshold. For Transformer networks, we show any model can be adapted to a target model of the same size with rank-$\frac{\text{embedding size}}{2}$ LoRA adapters.
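For context, the LoRA update being analyzed is just a frozen weight matrix augmented by a trainable rank-$r$ product. A generic PyTorch sketch (not the paper's code at https://github.com/uw-madison-lee-lab/expressive_power_of_lora):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable rank-r update: y = (W + B A) x.
    The paper's results say expressive power grows with r, with exact
    representation above a threshold tied to width and depth."""

    def __init__(self, linear: nn.Linear, r: int, alpha: float = 1.0):
        super().__init__()
        self.base = linear
        for p in self.base.parameters():
            p.requires_grad_(False)            # pre-trained weights stay fixed
        self.A = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(linear.out_features, r))  # BA = 0 at init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(512, 512), r=8)
```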

CompeteAI: Understanding the Competition Behaviors in Large Language Model-based Agents

  • paper_url: http://arxiv.org/abs/2310.17512
  • repo_url: None
  • paper_authors: Qinlin Zhao, Jindong Wang, Yixuan Zhang, Yiqiao Jin, Kaijie Zhu, Hao Chen, Xing Xie
  • for: Examines competition behaviors among agents based on large language models (LLMs).
  • methods: Proposes a general framework for studying competition between agents, then uses GPT-4 to simulate a virtual town with two kinds of agents, restaurant agents and customer agents; restaurant agents compete to attract more customers, which drives them to transform, e.g., by cultivating new operating strategies.
  • results: Experiments reveal interesting findings, from social learning to the Matthew Effect, that align well with existing sociological and economic theories; the authors argue that competition between agents deserves further study to better understand society. Code will be released soon.
    Abstract Large language models (LLMs) have been widely used as agents to complete different tasks, such as personal assistance or event planning. While most work has focused on cooperation and collaboration between agents, little work explores competition, another important mechanism that fosters the development of society and economy. In this paper, we seek to examine the competition behaviors in LLM-based agents. We first propose a general framework to study the competition between agents. Then, we implement a practical competitive environment using GPT-4 to simulate a virtual town with two types of agents, including restaurant agents and customer agents. Specifically, restaurant agents compete with each other to attract more customers, where the competition fosters them to transform, such as cultivating new operating strategies. The results of our experiments reveal several interesting findings ranging from social learning to Matthew Effect, which aligns well with existing sociological and economic theories. We believe that competition between agents deserves further investigation to help us understand society better. The code will be released soon.

Orchestration of Emulator Assisted Mobile Edge Tuning for AI Foundation Models: A Multi-Agent Deep Reinforcement Learning Approach

  • paper_url: http://arxiv.org/abs/2310.17492
  • repo_url: None
  • paper_authors: Wenhan Yu, Terence Jie Chua, Jun Zhao
  • for: Aims to enhance local task performance on user equipment (UE) by integrating Mobile Edge Computing (MEC) with foundation models.
  • methods: Proposes an Emulator-Adapter architecture that segments a foundation model into two cohesive modules, conserving computational resources while preserving adaptability and fine-tuning efficiency for downstream tasks; adds a resource allocation mechanism tailored to this structure in decentralized settings, solved with a hybrid multi-agent deep reinforcement learning (DRL) strategy that handles mixed discrete-continuous action spaces.
  • results: Simulations and validations show the approach is robust, efficient, and scalable, balancing computational efficiency with task proficiency.
    Abstract The efficient deployment and fine-tuning of foundation models are pivotal in contemporary artificial intelligence. In this study, we present a groundbreaking paradigm integrating Mobile Edge Computing (MEC) with foundation models, specifically designed to enhance local task performance on user equipment (UE). Central to our approach is the innovative Emulator-Adapter architecture, segmenting the foundation model into two cohesive modules. This design not only conserves computational resources but also ensures adaptability and fine-tuning efficiency for downstream tasks. Additionally, we introduce an advanced resource allocation mechanism that is fine-tuned to the needs of the Emulator-Adapter structure in decentralized settings. To address the challenges presented by this system, we employ a hybrid multi-agent Deep Reinforcement Learning (DRL) strategy, adept at handling mixed discrete-continuous action spaces, ensuring dynamic and optimal resource allocations. Our comprehensive simulations and validations underscore the practical viability of our approach, demonstrating its robustness, efficiency, and scalability. Collectively, this work offers a fresh perspective on deploying foundation models and balancing computational efficiency with task proficiency.

Improving Zero-shot Reader by Reducing Distractions from Irrelevant Documents in Open-Domain Question Answering

  • paper_url: http://arxiv.org/abs/2310.17490
  • repo_url: None
  • paper_authors: Sukmin Cho, Jeong yeon Seo, Soyeong Jeong, Jong C. Park
  • for: Studies zero-shot readers based on large language models (LLMs) for open-domain question answering (ODQA), which avoid the computational cost and labeled data that supervised readers require.
  • methods: Proposes Distraction-aware Answer Selection (DAS), which uses a negation-based instruction and score adjustment to mitigate the impact of irrelevant retrieved documents and the overconfidence of generated answers.
  • results: Experiments show DAS handles distraction across diverse scenarios and improves zero-shot reader performance; unlike supervised readers, which struggle with unseen data, zero-shot readers show outstanding transferability without any training.
    Abstract Large language models (LLMs) enable zero-shot approaches in open-domain question answering (ODQA), yet with limited advancements as the reader is compared to the retriever. This study aims at the feasibility of a zero-shot reader that addresses the challenges of computational cost and the need for labeled data. We find that LLMs are distracted due to irrelevant documents in the retrieved set and the overconfidence of the generated answers when they are exploited as zero-shot readers. To tackle these problems, we mitigate the impact of such documents via Distraction-aware Answer Selection (DAS) with a negation-based instruction and score adjustment for proper answer selection. Experimental results show that our approach successfully handles distraction across diverse scenarios, enhancing the performance of zero-shot readers. Furthermore, unlike supervised readers struggling with unseen data, zero-shot readers demonstrate outstanding transferability without any training.
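The selection step can be sketched as per-document answer generation with a relevance-adjusted score, where relevance is elicited via an instruction that lets the model answer negatively ("this passage is irrelevant"). The scoring rule and interfaces below are illustrative guesses at the mechanism, not the paper's exact formulation.

```python
import math

def select_answer(question, documents, answer_with_score, relevance_prob):
    """answer_with_score(q, doc) -> (answer, log_prob of the answer);
    relevance_prob(q, doc) -> probability the document can answer q,
    obtained with a negation-style instruction. Distractor documents get
    low relevance, so their (over)confident answers are discounted."""
    scored = {}
    for doc in documents:
        answer, log_prob = answer_with_score(question, doc)
        adjusted = log_prob + math.log(max(relevance_prob(question, doc), 1e-8))
        scored[answer] = max(scored.get(answer, float("-inf")), adjusted)
    return max(scored, key=scored.get)
```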

Bias in Evaluation Processes: An Optimization-Based Model

  • paper_url: http://arxiv.org/abs/2310.17489
  • repo_url: https://github.com/anaymehrotra/bias-in-evaluation-processes
  • paper_authors: L. Elisa Celis, Amit Kumar, Anay Mehrotra, Nisheeth K. Vishnoi
  • for: Studies biases with respect to socially salient attributes of individuals in evaluation processes such as admissions and hiring.
  • methods: Models an evaluation process as a transformation of the distribution of an individual's true utility for a task into an observed distribution, cast as a loss minimization problem subject to an information constraint; the model's two parameters, a resource-information trade-off parameter in the constraint and a risk-averseness parameter in the loss function, have both been identified as factors leading to bias.
  • results: Characterizes the distributions arising from the model and studies how the parameters affect the observed distribution; validates the model by fitting real-world datasets and uses it to study the effect of interventions in a downstream selection task, contributing tools to understand and mitigate bias in evaluation processes.
    Abstract Biases with respect to socially-salient attributes of individuals have been well documented in evaluation processes used in settings such as admissions and hiring. We view such an evaluation process as a transformation of a distribution of the true utility of an individual for a task to an observed distribution and model it as a solution to a loss minimization problem subject to an information constraint. Our model has two parameters that have been identified as factors leading to biases: the resource-information trade-off parameter in the information constraint and the risk-averseness parameter in the loss function. We characterize the distributions that arise from our model and study the effect of the parameters on the observed distribution. The outputs of our model enrich the class of distributions that can be used to capture variation across groups in the observed evaluations. We empirically validate our model by fitting real-world datasets and use it to study the effect of interventions in a downstream selection task. These results contribute to an understanding of the emergence of bias in evaluation processes and provide tools to guide the deployment of interventions to mitigate biases.

Towards Learning Monocular 3D Object Localization From 2D Labels using the Physical Laws of Motion

  • paper_url: http://arxiv.org/abs/2310.17462
  • repo_url: None
  • paper_authors: Daniel Kienzle, Julian Lorenz, Katja Ludwig, Rainer Lienhart
  • for: Precise 3D object localization in single images from a single calibrated camera.
  • methods: Trains with easy-to-annotate 2D labels plus physical knowledge of the object's motion instead of expensive 3D labels, letting the model infer the latent third dimension it never saw during training.
  • results: Achieves a mean distance error of just 6 cm on real data, indicating the method's potential for 3D object localization where collecting 3D training data is not feasible.
    Abstract We present a novel method for precise 3D object localization in single images from a single calibrated camera using only 2D labels. No expensive 3D labels are needed. Thus, instead of using 3D labels, our model is trained with easy-to-annotate 2D labels along with the physical knowledge of the object's motion. Given this information, the model can infer the latent third dimension, even though it has never seen this information during training. Our method is evaluated on both synthetic and real-world datasets, and we are able to achieve a mean distance error of just 6 cm in our experiments on real data. The results indicate the method's potential as a step towards learning 3D object location estimation, where collecting 3D data for training is not feasible.

Generating by Understanding: Neural Visual Generation with Logical Symbol Groundings

  • paper_url: http://arxiv.org/abs/2310.17451
  • repo_url: None
  • paper_authors: Yifei Peng, Yu Jin, Zhexu Luo, Yao-Xiang Ding, Wang-Zhou Dai, Zhong Ren, Kun Zhou
  • for: Integrating neural visual generative models with strong symbolic knowledge reasoning systems.
  • methods: Proposes Abductive Visual Generation (AbdGen), built on the abductive learning framework: a quantized abduction method generates abduction proposals via nearest-neighbor lookups within semantic codebooks for reliable and efficient symbol assignment, and a contrastive meta-abduction method eliminates wrong rules with positive cases while avoiding less-informative rules with negative cases.
  • results: Requires significantly less instance-level labeling information for symbol assignment than baselines, and can learn underlying logical generative rules from data, which is beyond existing approaches.
    Abstract Despite the great success of neural visual generative models in recent years, integrating them with strong symbolic knowledge reasoning systems remains a challenging task. The main challenges are two-fold: one is symbol assignment, i.e. bonding latent factors of neural visual generators with meaningful symbols from knowledge reasoning systems. Another is rule learning, i.e. learning new rules, which govern the generative process of the data, to augment the knowledge reasoning systems. To deal with these symbol grounding problems, we propose a neural-symbolic learning approach, Abductive Visual Generation (AbdGen), for integrating logic programming systems with neural visual generative models based on the abductive learning framework. To achieve reliable and efficient symbol assignment, the quantized abduction method is introduced for generating abduction proposals by the nearest-neighbor lookups within semantic codebooks. To achieve precise rule learning, the contrastive meta-abduction method is proposed to eliminate wrong rules with positive cases and avoid less-informative rules with negative cases simultaneously. Experimental results on various benchmark datasets show that compared to the baselines, AbdGen requires significantly fewer instance-level labeling information for symbol assignment. Furthermore, our approach can effectively learn underlying logical generative rules from data, which is out of the capability of existing approaches.
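The quantized-abduction step amounts to nearest-neighbor lookups in a semantic codebook: each latent factor is snapped to its closest code, and the codes' attached symbols form the abduction proposal passed to the reasoning system. A toy numpy sketch under that reading of the abstract:

```python
import numpy as np

def abduction_proposal(latents, codebook, symbols):
    """latents: (k, d) latent factors from the generator; codebook: (n, d)
    code vectors; symbols: n symbols grounding the codes. Returns the
    symbols of the nearest codes as a candidate symbol assignment."""
    dists = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
    return [symbols[i] for i in dists.argmin(axis=1)]

codebook = np.eye(3)
print(abduction_proposal(np.array([[0.9, 0.1, 0.0]]), codebook, ["0", "1", "2"]))
```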

LSA64: An Argentinian Sign Language Dataset

  • paper_url: http://arxiv.org/abs/2310.17429
  • repo_url: None
  • paper_authors: Franco Ronchetti, Facundo Manuel Quiroga, César Estrebou, Laura Lanzarini, Alejandro Rosete
  • for: Provides an Argentinian Sign Language (LSA) dataset tailored to sign language recognition and other machine learning research.
  • methods: Records 3200 videos of 64 different LSA signs performed by 10 subjects, who wore colored gloves to ease hand tracking and segmentation so experiments can focus specifically on sign recognition.
  • results: Releases the LSA64 dataset together with a pre-processed version, including computed statistics of movement, position, and handshape, as a first step toward a comprehensive research-level dataset of Argentinian signs.
    Abstract Automatic sign language recognition is a research area that encompasses human-computer interaction, computer vision and machine learning. Robust automatic recognition of sign language could assist in the translation process and the integration of hearing-impaired people, as well as the teaching of sign language to the hearing population. Sign languages differ significantly in different countries and even regions, and their syntax and semantics are different as well from those of written languages. While the techniques for automatic sign language recognition are mostly the same for different languages, training a recognition system for a new language requires having an entire dataset for that language. This paper presents a dataset of 64 signs from the Argentinian Sign Language (LSA). The dataset, called LSA64, contains 3200 videos of 64 different LSA signs recorded by 10 subjects, and is a first step towards building a comprehensive research-level dataset of Argentinian signs, specifically tailored to sign language recognition or other machine learning tasks. The subjects that performed the signs wore colored gloves to ease the hand tracking and segmentation steps, allowing experiments on the dataset to focus specifically on the recognition of signs. We also present a pre-processed version of the dataset, from which we computed statistics of movement, position and handshape of the signs.

Handshape recognition for Argentinian Sign Language using ProbSom

  • paper_url: http://arxiv.org/abs/2310.17427
  • repo_url: None
  • paper_authors: Franco Ronchetti, Facundo Manuel Quiroga, César Estrebou, Laura Lanzarini
  • for: Automatic sign language recognition, to assist the translation process and the integration of hearing-impaired people.
  • methods: Makes two main contributions: a database of handshapes for the Argentinian Sign Language (LSA), a topic that has barely been studied so far; and a technique for image processing, descriptor extraction, and handshape classification using ProbSom, a supervised adaptation of self-organizing maps, compared against SVMs, Random Forests, and Neural Networks.
  • results: The database contains 800 images covering 16 LSA handshapes; the ProbSom-based classifier with the proposed descriptor achieves an accuracy rate above 90%.
    Abstract Automatic sign language recognition is an important topic within the areas of human-computer interaction and machine learning. On the one hand, it poses a complex challenge that requires the intervention of various knowledge areas, such as video processing, image processing, intelligent systems and linguistics. On the other hand, robust recognition of sign language could assist in the translation process and the integration of hearing-impaired people. This paper offers two main contributions: first, the creation of a database of handshapes for the Argentinian Sign Language (LSA), which is a topic that has barely been discussed so far. Secondly, a technique for image processing, descriptor extraction and subsequent handshape classification using a supervised adaptation of self-organizing maps that is called ProbSom. This technique is compared to others in the state of the art, such as Support Vector Machines (SVM), Random Forests, and Neural Networks. The database that was built contains 800 images with 16 LSA handshapes, and is a first step towards building a comprehensive database of Argentinian signs. The ProbSom-based neural classifier, using the proposed descriptor, achieved an accuracy rate above 90%.

Distribution of Action Movements (DAM): A Descriptor for Human Action Recognition

  • paper_url: http://arxiv.org/abs/2310.17421
  • repo_url: None
  • paper_authors: Facundo Manuel Quiroga, Franco Ronchetti, Laura Lanzarini, Cesar Eestrebou
  • for: Human action recognition from skeletal data, an important and active research area where the state of the art has not yet achieved near-perfect accuracy on many well-known datasets.
  • methods: Introduces the Distribution of Action Movements descriptor: a normalized histogram of the directions of joint motions between frames, computed over a set of representative directions obtained via clustering; a windowing scheme lets the otherwise global descriptor partially retain temporal structure.
  • results: The descriptor, together with a standard classifier, outperforms several state-of-the-art techniques on many well-known datasets.
    Abstract Human action recognition from skeletal data is an important and active area of research in which the state of the art has not yet achieved near-perfect accuracy on many well-known datasets. In this paper, we introduce the Distribution of Action Movements Descriptor, a novel action descriptor based on the distribution of the directions of the motions of the joints between frames, over the set of all possible motions in the dataset. The descriptor is computed as a normalized histogram over a set of representative directions of the joints, which are in turn obtained via clustering. While the descriptor is global in the sense that it represents the overall distribution of movement directions of an action, it is able to partially retain its temporal structure by applying a windowing scheme. The descriptor, together with a standard classifier, outperforms several state-of-the-art techniques on many well-known datasets.
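The descriptor is easy to prototype: take per-frame joint displacements, assign each motion direction to its nearest representative direction (e.g., cluster centers fit over the whole dataset), and histogram per temporal window. The sketch below is a generic reconstruction from the abstract, not the authors' code.

```python
import numpy as np

def dam_descriptor(skeleton, directions, n_windows=4, eps=1e-8):
    """skeleton: (T, J, 3) joint positions over T frames; directions:
    (K, 3) unit vectors, e.g. k-means centers of all motion directions
    in the dataset. Returns n_windows concatenated normalized histograms,
    so some temporal structure survives the global distribution."""
    motion = np.diff(skeleton, axis=0).reshape(-1, 3)
    motion /= np.linalg.norm(motion, axis=1, keepdims=True) + eps
    labels = (motion @ directions.T).argmax(axis=1)      # nearest by cosine
    labels = labels.reshape(skeleton.shape[0] - 1, skeleton.shape[1])
    hists = [np.bincount(w.ravel(), minlength=len(directions))
             for w in np.array_split(labels, n_windows, axis=0)]
    return np.concatenate([h / max(h.sum(), 1) for h in hists])

desc = dam_descriptor(np.random.rand(30, 20, 3), np.eye(3))  # toy skeleton
```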

Goals are Enough: Inducing AdHoc cooperation among unseen Multi-Agent systems in IMFs

  • paper_url: http://arxiv.org/abs/2310.17416
  • repo_url: None
  • paper_authors: Kaushik Dey, Satheesh K. Perepu, Abir Das
  • for: Intent-based management to meet customer expectations in next-generation mobile networks, where multiple pre-trained, self-interested agents may arrive ad hoc and must be orchestrated without retraining the whole system.
  • methods: Proposes an AI-based supervisor agent that orchestrates pre-trained RL/MARL agents in parallel, using Adhoc-Teaming approaches to assign optimal goals (dynamic contracts and bonuses) that incentivize the agents to act as a cohesive unit toward a global expectation.
  • results: On a network emulator, the approach fulfils expectations faster and better than rule-based approaches and even generalizes to changes in environments.
    Abstract Intent-based management will play a critical role in achieving customers' expectations in the next-generation mobile networks. Traditional methods cannot perform efficient resource management since they tend to handle each expectation independently. Existing approaches, e.g., based on multi-agent reinforcement learning (MARL) allocate resources in an efficient fashion when there are conflicting expectations on the network slice. However, in reality, systems are often far more complex to be addressed by a standalone MARL formulation. Often there exists a hierarchical structure of intent fulfilment where multiple pre-trained, self-interested agents may need to be further orchestrated by a supervisor or controller agent. Such agents may arrive in the system adhoc, which then needs to be orchestrated along with other available agents. Retraining the whole system every time is often infeasible given the associated time and cost. Given the challenges, such adhoc coordination of pre-trained systems could be achieved through an intelligent supervisor agent which incentivizes pre-trained RL/MARL agents through sets of dynamic contracts (goals or bonuses) and encourages them to act as a cohesive unit towards fulfilling a global expectation. Some approaches use a rule-based supervisor agent and deploy the hierarchical constituent agents sequentially, based on human-coded rules. In the current work, we propose a framework whereby pre-trained agents can be orchestrated in parallel leveraging an AI-based supervisor agent. For this, we propose to use Adhoc-Teaming approaches which assign optimal goals to the MARL agents and incentivize them to exhibit certain desired behaviours. Results on the network emulator show that the proposed approach results in faster and improved fulfilment of expectations when compared to rule-based approaches and even generalizes to changes in environments.

PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications

  • paper_url: http://arxiv.org/abs/2310.17415
  • repo_url: https://github.com/ginnm/proteinpretraining
  • paper_authors: Yang Tan, Mingchen Li, Pan Tan, Ziyi Zhou, Huiqun Yu, Guisheng Fan, Liang Hong
  • for: Protein engineering: explores how large protein language models capture the underlying evolutionary information in primary structures, and how the choice of vocabulary size affects downstream applications.
  • methods: Pre-trains PETA language models with 14 different vocabulary sizes under three tokenization methods, then runs thousands of tests on 33 diverse downstream datasets to assess transfer learning, using two classification heads and three random seeds to mitigate potential biases.
  • results: Vocabulary sizes between 50 and 200 optimize the model, while sizes exceeding 800 detrimentally affect its representational performance.
    Abstract Large protein language models are adept at capturing the underlying evolutionary information in primary structures, offering significant practical value for protein engineering. Compared to natural language models, protein amino acid sequences have a smaller data volume and a limited combinatorial space. Choosing an appropriate vocabulary size to optimize the pre-trained model is a pivotal issue. Moreover, despite the wealth of benchmarks and studies in the natural language community, there remains a lack of a comprehensive benchmark for systematically evaluating protein language model quality. Given these challenges, PETA trained language models with 14 different vocabulary sizes under three tokenization methods. It conducted thousands of tests on 33 diverse downstream datasets to assess the models' transfer learning capabilities, incorporating two classification heads and three random seeds to mitigate potential biases. Extensive experiments indicate that vocabulary sizes between 50 and 200 optimize the model, whereas sizes exceeding 800 detrimentally affect the model's representational performance. Our code, model weights and datasets are available at https://github.com/ginnm/ProteinPretraining.
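The vocabulary-size sweep is straightforward to reproduce with a standard BPE trainer. A minimal sketch with the Hugging Face `tokenizers` library, illustrative of the setup rather than the authors' pipeline (https://github.com/ginnm/ProteinPretraining):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_protein_bpe(sequences, vocab_size):
    """Train a BPE tokenizer on amino-acid sequences; the paper sweeps
    14 vocabulary sizes under three tokenization methods."""
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size,
                                  special_tokens=["[UNK]"])
    tokenizer.train_from_iterator(sequences, trainer)
    return tokenizer

seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
        "MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDI"]
for size in (50, 200, 800):   # finding: 50-200 helps, >800 hurts
    tok = train_protein_bpe(seqs, size)
    print(size, tok.encode(seqs[0]).tokens[:8])
```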

Synthesizing Efficiently Monitorable Formulas in Metric Temporal Logic

  • paper_url: http://arxiv.org/abs/2310.17410
  • repo_url: https://github.com/ritamraha/Teal
  • paper_authors: Ritam Raha, Rajarshi Roy, Nathanael Fijalkow, Daniel Neider, Guillermo A. Perez
  • for: Automatically synthesizing formal specifications in Metric Temporal Logic (MTL) from system executions, since manually formalizing specifications for runtime monitoring is tedious and error-prone.
  • methods: Reduces the synthesis task to a series of satisfiability problems in Linear Real Arithmetic (LRA), using a novel LRA encoding of a popular MTL monitoring procedure, and generates MTL formulas from the satisfying assignments.
  • results: The tool TEAL synthesizes concise, efficiently monitorable MTL formulas with bounded "lookahead", demonstrated on a cyber-physical systems (CPS) application.
    Abstract In runtime verification, manually formalizing a specification for monitoring system executions is a tedious and error-prone process. To address this issue, we consider the problem of automatically synthesizing formal specifications from system executions. To demonstrate our approach, we consider the popular specification language Metric Temporal Logic (MTL), which is particularly tailored towards specifying temporal properties for cyber-physical systems (CPS). Most of the classical approaches for synthesizing temporal logic formulas aim at minimizing the size of the formula. However, for efficiency in monitoring, along with the size, the amount of "lookahead" required for the specification becomes relevant, especially for safety-critical applications. We formalize this notion and devise a learning algorithm that synthesizes concise formulas having bounded lookahead. To do so, our algorithm reduces the synthesis task to a series of satisfiability problems in Linear Real Arithmetic (LRA) and generates MTL formulas from their satisfying assignments. The reduction uses a novel encoding of a popular MTL monitoring procedure using LRA. Finally, we implement our algorithm in a tool called TEAL and demonstrate its ability to synthesize efficiently monitorable MTL formulas in a CPS application.
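The reduction bottoms out in plain LRA satisfiability queries that an SMT solver can answer. A toy z3 example of the flavor of constraint involved (solving for an unknown interval bound of a candidate formula), purely illustrative rather than TEAL's actual encoding (https://github.com/ritamraha/Teal):

```python
from z3 import Real, Solver, And, sat

a = Real("a")   # unknown upper bound in a candidate formula F_[0,a] p
s = Solver()
# Constraints gathered from example executions: p must be reachable by
# time 1.5 on a positive trace, while keeping the bound (the formula's
# lookahead) below 2.0 for efficient monitoring.
s.add(And(a >= 1.5, a < 2.0))
if s.check() == sat:
    print("candidate bound:", s.model()[a])   # e.g. 3/2
```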

Invariance Measures for Neural Networks

  • paper_url: http://arxiv.org/abs/2310.17404
  • repo_url: https://github.com/facundoq/tmeasures
  • paper_authors: Facundo Manuel Quiroga, Jordina Torrents-Barrena, Laura Cristina Lanzarini, Domenec Puig-Valls
  • for: Quantifying the invariance of neural network models in terms of their internal representations.
  • methods: Proposes efficient and interpretable measures of invariance computed from a model's internal representations, applicable to any neural network model and more sensitive to invariance than previously defined measures.
  • results: Validates the measures and their properties, including stability and interpretability, on affine transformations with the CIFAR10 and MNIST datasets; a first analysis of CNN models shows their internal invariance is remarkably stable to random weight initializations but not to changes in dataset or transformation.
    Abstract Invariances in neural networks are useful and necessary for many tasks. However, the representation of the invariance of most neural network models has not been characterized. We propose measures to quantify the invariance of neural networks in terms of their internal representation. The measures are efficient and interpretable, and can be applied to any neural network model. They are also more sensitive to invariance than previously defined measures. We validate the measures and their properties in the domain of affine transformations and the CIFAR10 and MNIST datasets, including their stability and interpretability. Using the measures, we perform a first analysis of CNN models and show that their internal invariance is remarkably stable to random weight initializations, but not to changes in dataset or transformation. We believe the measures will enable new avenues of research in invariance representation.
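A variance-ratio style measure conveys the idea: collect a layer's activations over a grid of (sample, transformation) pairs and score each unit by its variance across transformations normalized by its variance across samples. This is a generic reconstruction of the approach, not necessarily the exact measures in https://github.com/facundoq/tmeasures.

```python
import numpy as np

def invariance_ratio(activations, eps=1e-8):
    """activations: (n_samples, n_transforms, n_units) responses of one
    layer to every transformed version of every sample. Scores near 0
    mean a unit is invariant to the transformation set."""
    var_transforms = activations.var(axis=1).mean(axis=0)   # per unit
    var_samples = activations.var(axis=0).mean(axis=0)      # per unit
    return var_transforms / (var_samples + eps)

acts = np.random.rand(100, 8, 64)   # toy: 100 samples, 8 rotations, 64 units
print(invariance_ratio(acts)[:5])
```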

ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation

  • paper_url: http://arxiv.org/abs/2310.17389
  • repo_url: None
  • paper_authors: Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, Jingbo Shang
  • for: Provides a toxicity-detection benchmark grounded in real-world user-AI conversations, so moderation models for user-AI interaction can be trained and evaluated on the right domain.
  • methods: Constructs ToxicChat from real user queries to an open-source chatbot and systematically evaluates models trained on existing toxicity benchmarks against it.
  • results: Models trained on existing (social-media-derived) toxicity datasets fall short on ToxicChat's rich, nuanced phenomena, revealing a significant domain difference and highlighting overlooked challenges of toxicity detection in real-world user-AI conversations.
    Abstract Despite remarkable advances that large language models have achieved in chatbots, maintaining a non-toxic user-AI interactive environment has become increasingly critical nowadays. However, previous efforts in toxicity detection have been mostly based on benchmarks derived from social media content, leaving the unique challenges inherent to real-world user-AI interactions insufficiently explored. In this work, we introduce ToxicChat, a novel benchmark based on real user queries from an open-source chatbot. This benchmark contains the rich, nuanced phenomena that can be tricky for current toxicity detection models to identify, revealing a significant domain difference compared to social media content. Our systematic evaluation of models trained on existing toxicity datasets has shown their shortcomings when applied to this unique domain of ToxicChat. Our work illuminates the potentially overlooked challenges of toxicity detection in real-world user-AI conversations. In the future, ToxicChat can be a valuable resource to drive further advancements toward building a safe and healthy environment for user-AI interactions.

YOLO-BEV: Generating Bird’s-Eye View in the Same Way as 2D Object Detection

  • paper_url: http://arxiv.org/abs/2310.17379
  • repo_url: None
  • paper_authors: Chang Liu, Liguo Zhou, Yanliang Huang, Alois Knoll
  • for: Improving the visual understanding of autonomous-driving perception systems for safety and navigation, enabling comprehensive and rapid visual interpretation of the vehicle's surroundings.
  • methods: A dedicated surround-camera setup places eight cameras at 45-degree intervals and composes their images into a 3x3 grid with the center left blank, yielding an enriched spatial representation that is efficient to process. Detection builds on the YOLO mechanism, exploiting its swift response and compact model structure, with a custom-designed detection head that translates the panoramic input into a unified bird's-eye-view map of the ego car.
  • results: Preliminary results validate the feasibility of YOLO-BEV for real-time vehicular perception; its streamlined architecture and minimized parameter count allow for rapid deployment, making it a promising tool for future autonomous-driving perception.
    Abstract Vehicle perception systems strive to achieve comprehensive and rapid visual interpretation of their surroundings for improved safety and navigation. We introduce YOLO-BEV, an efficient framework that harnesses a unique surrounding cameras setup to generate a 2D bird's-eye view of the vehicular environment. By strategically positioning eight cameras, each at a 45-degree interval, our system captures and integrates imagery into a coherent 3x3 grid format, leaving the center blank, providing an enriched spatial representation that facilitates efficient processing. In our approach, we employ YOLO's detection mechanism, favoring its inherent advantages of swift response and compact model structure. Instead of leveraging the conventional YOLO detection head, we augment it with a custom-designed detection head, translating the panoramically captured data into a unified bird's-eye view map of ego car. Preliminary results validate the feasibility of YOLO-BEV in real-time vehicular perception tasks. With its streamlined architecture and potential for rapid deployment due to minimized parameters, YOLO-BEV poses as a promising tool that may reshape future perspectives in autonomous driving systems.
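A minimal numpy sketch of the image composition described above: eight surround views at 45-degree intervals tiled into a 3x3 mosaic with the center cell left blank. The heading-to-cell assignment is an assumption; the paper may lay the views out differently.

```python
import numpy as np

def compose_bev_grid(cams, h=224, w=224):
    """cams: dict mapping camera heading (deg, 45-degree steps) -> HxWx3
    uint8 image. Tiles the eight views into a 3x3 mosaic, leaving the
    center blank, mirroring the spatial layout around the ego vehicle."""
    # One plausible heading -> (row, col) layout; the paper's may differ.
    layout = {315: (0, 0), 0: (0, 1), 45: (0, 2),
              270: (1, 0),            90: (1, 2),
              225: (2, 0), 180: (2, 1), 135: (2, 2)}
    grid = np.zeros((3 * h, 3 * w, 3), dtype=np.uint8)  # center stays black
    for heading, (r, c) in layout.items():
        grid[r*h:(r+1)*h, c*w:(c+1)*w] = cams[heading][:h, :w]
    return grid

cams = {deg: np.full((224, 224, 3), 32 * i, dtype=np.uint8)
        for i, deg in enumerate([0, 45, 90, 135, 180, 225, 270, 315])}
print(compose_bev_grid(cams).shape)  # (672, 672, 3)
```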

Optimization dependent generalization bound for ReLU networks based on sensitivity in the tangent bundle

  • paper_url: http://arxiv.org/abs/2310.17378
  • repo_url: None
  • paper_authors: Dániel Rácz, Mihály Petreczky, András Csertán, Bálint Daróczy
  • for: Explaining why heavily over-parametrized deep learning models generalize well while fitting the training data.
  • methods: A PAC-type bound on the generalization error of feedforward ReLU networks is derived by estimating the Rademacher complexity of the set of networks reachable from an initial parameter vector via gradient descent, through bounding the sensitivity of the network's gradient to perturbations of the input data along the optimization trajectory.
  • results: The obtained bound does not explicitly depend on network depth; the results are experimentally verified on the MNIST and CIFAR-10 datasets.
    Abstract Recent advances in deep learning have given us some very promising results on the generalization ability of deep neural networks, however literature still lacks a comprehensive theory explaining why heavily over-parametrized models are able to generalize well while fitting the training data. In this paper we propose a PAC type bound on the generalization error of feedforward ReLU networks via estimating the Rademacher complexity of the set of networks available from an initial parameter vector via gradient descent. The key idea is to bound the sensitivity of the network's gradient to perturbation of the input data along the optimization trajectory. The obtained bound does not explicitly depend on the depth of the network. Our results are experimentally verified on the MNIST and CIFAR-10 datasets.
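A minimal torch sketch of the quantity the bound is built on: how much the parameter gradient moves when the input is perturbed, measured at points along the gradient-descent trajectory. This illustrates the idea only and is not the paper's exact estimator.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(net.parameters(), lr=0.05)

def grad_vector(x, y):
    """Flattened loss gradient w.r.t. all parameters at the current point."""
    net.zero_grad()
    loss_fn(net(x), y).backward()
    return torch.cat([p.grad.flatten() for p in net.parameters()])

x, y = torch.randn(32, 10), torch.randn(32, 1)
for step in range(200):
    net.zero_grad()
    loss_fn(net(x), y).backward()
    opt.step()  # one SGD step along the optimization trajectory
    if step % 50 == 0:
        # Sensitivity: gradient displacement under a small input perturbation.
        g = grad_vector(x, y)
        g_pert = grad_vector(x + 0.01 * torch.randn_like(x), y)
        print(step, (g_pert - g).norm().item())
```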

Dialogue-based generation of self-driving simulation scenarios using Large Language Models

  • paper_url: http://arxiv.org/abs/2310.17372
  • repo_url: https://github.com/avmb/dialogllmscenic
  • paper_authors: Antonio Valerio Miceli-Barone, Alex Lascarides, Craig Innes
  • for: Developing and evaluating controllers for self-driving cars through dialogue-driven generation of simulation scenarios.
  • methods: Large language models (LLMs) map the user's English utterances to domain-specific simulation code within an extended multimodal interaction, in which the user can follow up prior instructions with refinements or revisions in reaction to the simulations generated so far.
  • results: The study explores the extent to which LLMs capture the context sensitivity needed to compute the speaker's intended message in discourse.
    Abstract Simulation is an invaluable tool for developing and evaluating controllers for self-driving cars. Current simulation frameworks are driven by highly-specialist domain specific languages, and so a natural language interface would greatly enhance usability. But there is often a gap, consisting of tacit assumptions the user is making, between a concise English utterance and the executable code that captures the user's intent. In this paper we describe a system that addresses this issue by supporting an extended multimodal interaction: the user can follow up prior instructions with refinements or revisions, in reaction to the simulations that have been generated from their utterances so far. We use Large Language Models (LLMs) to map the user's English utterances in this interaction into domain-specific code, and so we explore the extent to which LLMs capture the context sensitivity that's necessary for computing the speaker's intended message in discourse.
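A minimal sketch of the interaction loop described above: each utterance, together with the dialogue history, is mapped to scenario code, the simulation runs, and the user refines. `call_llm` and `run_simulation` are stubs standing in for an LLM client and a driving simulator, and the prompt format is an assumption.

```python
def call_llm(prompt: str) -> str:
    """Stub: a real system would query an LLM here, asking it to emit
    domain-specific scenario code (e.g., Scenic) for the dialogue so far."""
    return f"# scenario v{prompt.count('USER:')}"

def run_simulation(code: str) -> str:
    """Stub: a real system would execute the scenario in a simulator."""
    return "collision=False"

def dialogue_session(utterances):
    history, code = [], None
    for utt in utterances:
        prompt = "\n".join(history + [f"USER: {utt}", "SCENARIO CODE:"])
        code = call_llm(prompt)
        result = run_simulation(code)
        # The next turn sees prior turns, code, and outcomes, which is what
        # lets "make the cyclist slower" resolve against earlier context.
        history += [f"USER: {utt}", f"CODE: {code}", f"SIM: {result}"]
    return code

print(dialogue_session(["A cyclist crosses ahead of the ego car.",
                        "Make the cyclist slower and add rain."]))
```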

Exploring the Potential of Generative AI for the World Wide Web

  • paper_url: http://arxiv.org/abs/2310.17370
  • repo_url: None
  • paper_authors: Nouar AlDahoul, Joseph Hong, Matteo Varvello, Yasir Zaki
  • for: The paper explores the potential of generative AI in the realm of the World Wide Web, specifically focusing on image generation.
  • methods: The paper develops a tool called WebDiffusion that simulates a Web powered by stable diffusion, a popular text-to-image model, from both a client and server perspective. The tool also supports crowdsourcing of user opinions to evaluate the quality and accuracy of AI-generated images.
  • results: The paper finds that generative AI is already capable of producing pertinent and high-quality Web images, even without requiring Web designers to manually input prompts, just by leveraging contextual information available within the webpages. However, direct in-browser image generation remains a challenge, and only highly powerful GPUs can partially compete with classic image downloads.
    Abstract Generative Artificial Intelligence (AI) is a cutting-edge technology capable of producing text, images, and various media content leveraging generative models and user prompts. Between 2022 and 2023, generative AI surged in popularity with a plethora of applications spanning from AI-powered movies to chatbots. In this paper, we delve into the potential of generative AI within the realm of the World Wide Web, specifically focusing on image generation. Web developers already harness generative AI to help crafting text and images, while Web browsers might use it in the future to locally generate images for tasks like repairing broken webpages, conserving bandwidth, and enhancing privacy. To explore this research area, we have developed WebDiffusion, a tool that allows to simulate a Web powered by stable diffusion, a popular text-to-image model, from both a client and server perspective. WebDiffusion further supports crowdsourcing of user opinions, which we use to evaluate the quality and accuracy of 409 AI-generated images sourced from 60 webpages. Our findings suggest that generative AI is already capable of producing pertinent and high-quality Web images, even without requiring Web designers to manually input prompts, just by leveraging contextual information available within the webpages. However, we acknowledge that direct in-browser image generation remains a challenge, as only highly powerful GPUs, such as the A40 and A100, can (partially) compete with classic image downloads. Nevertheless, this approach could be valuable for a subset of the images, for example when fixing broken webpages or handling highly private content.
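A sketch of the core idea under simplifying assumptions: build a text-to-image prompt from context already present in a webpage (title, headings, alt text) and generate the image with Stable Diffusion via the `diffusers` library. The prompt-construction heuristic is a guess, not WebDiffusion's actual logic, and running this needs the model weights and a capable GPU, which is exactly the in-browser deployment constraint the paper highlights.

```python
import torch
from diffusers import StableDiffusionPipeline

def prompt_from_page_context(title, heading, alt_text):
    # Assumption: simple concatenation of contextual signals already in the
    # page; WebDiffusion's actual prompt construction may differ.
    parts = [alt_text or heading,
             f"illustration for an article titled '{title}'"]
    return ", ".join(p for p in parts if p)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
prompt = prompt_from_page_context(
    title="A Guide to Urban Beekeeping",
    heading="Setting up your first hive",
    alt_text="rooftop beehive at sunrise")
image = pipe(prompt).images[0]  # e.g., a replacement for a broken <img>
image.save("replacement.png")
```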

Cultural Adaptation of Recipes

  • paper_url: http://arxiv.org/abs/2310.17353
  • repo_url: None
  • paper_authors: Yong Cao, Yova Kementchedjhieva, Ruixiang Cui, Antonia Karamolegkou, Li Zhou, Megan Dare, Lucia Donatelli, Daniel Hershcovich
  • for: Investigating the translation and cultural adaptation of recipes between Chinese and English-speaking cuisines, supported by large language models.
  • methods: GPT-4 and other LLMs, traditional machine translation, and information retrieval techniques are evaluated on CulturalRecipes, a dataset of automatically paired Mandarin Chinese and English recipes enriched with a human-written, curated test set, using both automatic and human metrics.
  • results: GPT-4 exhibits impressive abilities in adapting Chinese recipes into English, but still lags behind human expertise when translating English recipes into Chinese, underscoring the multifaceted nature of cultural adaptation.
    Abstract Building upon the considerable advances in Large Language Models (LLMs), we are now equipped to address more sophisticated tasks demanding a nuanced understanding of cross-cultural contexts. A key example is recipe adaptation, which goes beyond simple translation to include a grasp of ingredients, culinary techniques, and dietary preferences specific to a given culture. We introduce a new task involving the translation and cultural adaptation of recipes between Chinese and English-speaking cuisines. To support this investigation, we present CulturalRecipes, a unique dataset comprised of automatically paired recipes written in Mandarin Chinese and English. This dataset is further enriched with a human-written and curated test set. In this intricate task of cross-cultural recipe adaptation, we evaluate the performance of various methods, including GPT-4 and other LLMs, traditional machine translation, and information retrieval techniques. Our comprehensive analysis includes both automatic and human evaluation metrics. While GPT-4 exhibits impressive abilities in adapting Chinese recipes into English, it still lags behind human expertise when translating English recipes into Chinese. This underscores the multifaceted nature of cultural adaptations. We anticipate that these insights will significantly contribute to future research on culturally-aware language models and their practical application in culturally diverse contexts.

CQM: Curriculum Reinforcement Learning with a Quantized World Model

  • paper_url: http://arxiv.org/abs/2310.17330
  • repo_url: None
  • paper_authors: Seungjae Lee, Daesol Cho, Jonghae Park, H. Jin Kim
  • for: Curriculum reinforcement learning (RL) for complex tasks struggles to generate curriculum goals in high-dimensional spaces, and therefore usually relies on manually specified goal spaces.
  • methods: The proposed curriculum method automatically defines a semantic goal space, containing the information vital to the curriculum process, by discretizing continuous observations with vector quantized-variational autoencoders (VQ-VAE) and restoring the temporal relations between the discretized observations with a graph; it then suggests uncertainty- and temporal-distance-aware curriculum goals that converge to the final goals over this space, enabling efficient exploration in uninformed environments from raw goal examples only.
  • results: The method reaches goals efficiently and outperforms state-of-the-art curriculum RL methods in data efficiency and performance on various goal-reaching tasks, even with egocentric visual inputs.
    Abstract Recent curriculum Reinforcement Learning (RL) has shown notable progress in solving complex tasks by proposing sequences of surrogate tasks. However, the previous approaches often face challenges when they generate curriculum goals in a high-dimensional space. Thus, they usually rely on manually specified goal spaces. To alleviate this limitation and improve the scalability of the curriculum, we propose a novel curriculum method that automatically defines the semantic goal space which contains vital information for the curriculum process, and suggests curriculum goals over it. To define the semantic goal space, our method discretizes continuous observations via vector quantized-variational autoencoders (VQ-VAE) and restores the temporal relations between the discretized observations by a graph. Concurrently, ours suggests uncertainty and temporal distance-aware curriculum goals that converges to the final goals over the automatically composed goal space. We demonstrate that the proposed method allows efficient explorations in an uninformed environment with raw goal examples only. Also, ours outperforms the state-of-the-art curriculum RL methods on data efficiency and performance, in various goal-reaching tasks even with ego-centric visual inputs.
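A minimal torch sketch of the discretization step named above: vector quantization snaps each continuous observation embedding to its nearest codebook entry, and consecutive code indices can then define edges of a temporal graph over discrete landmarks. The straight-through trick and the graph construction are shown only schematically.

```python
import torch

def vector_quantize(z, codebook):
    """z: (batch, d) continuous embeddings; codebook: (K, d).
    Returns nearest-code indices and quantized vectors (straight-through)."""
    d2 = torch.cdist(z, codebook)   # (batch, K) pairwise distances
    idx = d2.argmin(dim=1)          # discrete code per observation
    zq = codebook[idx]
    # Straight-through estimator: forward uses zq, gradient flows to z.
    zq = z + (zq - z).detach()
    return idx, zq

torch.manual_seed(0)
codebook = torch.randn(16, 4)       # K=16 discrete "landmark" codes
obs_embed = torch.randn(8, 4, requires_grad=True)
idx, zq = vector_quantize(obs_embed, codebook)
print(idx.tolist())
# Consecutive observations visiting different codes would add an edge
# idx[t] -> idx[t+1] to the temporal graph over discrete codes.
```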

C-Disentanglement: Discovering Causally-Independent Generative Factors under an Inductive Bias of Confounder

  • paper_url: http://arxiv.org/abs/2310.17325
  • repo_url: None
  • paper_authors: Xiaoyu Liu, Jiaxin Yuan, Bang An, Yuancheng Xu, Yifan Yang, Furong Huang
  • for: Discovering the few semantically meaningful, causally disentangled generative factors of real-world data in the latent space.
  • methods: The Confounded-Disentanglement (C-Disentanglement) framework explicitly introduces an inductive bias on the confounder via labels from domain expertise, together with an approach that sufficiently identifies the causally disentangled factors under any such inductive bias.
  • results: On both synthetic and real-world datasets, C-Disentanglement achieves competitive results against various state-of-the-art baselines in obtaining causally disentangled features and on downstream tasks under domain shift.
    Abstract Representation learning assumes that real-world data is generated by a few semantically meaningful generative factors (i.e., sources of variation) and aims to discover them in the latent space. These factors are expected to be causally disentangled, meaning that distinct factors are encoded into separate latent variables, and changes in one factor will not affect the values of the others. Compared to statistical independence, causal disentanglement allows more controllable data generation, improved robustness, and better generalization. However, most existing work assumes unconfoundedness in the discovery process, that there are no common causes to the generative factors and thus obtain only statistical independence. In this paper, we recognize the importance of modeling confounders in discovering causal generative factors. Unfortunately, such factors are not identifiable without proper inductive bias. We fill the gap by introducing a framework entitled Confounded-Disentanglement (C-Disentanglement), the first framework that explicitly introduces the inductive bias of confounder via labels from domain expertise. In addition, we accordingly propose an approach to sufficiently identify the causally disentangled factors under any inductive bias of the confounder. We conduct extensive experiments on both synthetic and real-world datasets. Our method demonstrates competitive results compared to various SOTA baselines in obtaining causally disentangled features and downstream tasks under domain shifts.

In-Context Ability Transfer for Question Decomposition in Complex QA

  • paper_url: http://arxiv.org/abs/2310.18371
  • repo_url: None
  • paper_authors: Venktesh V, Sourangshu Bhattacharya, Avishek Anand
  • for: Enabling language models to handle complex question-answering (QA) tasks that require question decomposition and multistep reasoning, without any model fine-tuning or expert annotation.
  • methods: ICAT (In-Context Ability Transfer) transfers the ability to decompose complex questions into simpler ones, or to generate step-by-step rationales, by carefully selecting in-context examples from available data sources of related tasks, complemented by an automated uncertainty-aware exemplar selection approach.
  • results: Large-scale experiments on complex QA tasks involving numerical reasoning, compositional complex QA, and heterogeneous complex QA show that ICAT convincingly outperforms existing prompt-based solutions without any model training, demonstrating the benefits of reusing existing abilities.
    Abstract Answering complex questions is a challenging task that requires question decomposition and multistep reasoning for arriving at the solution. While existing supervised and unsupervised approaches are specialized to a certain task and involve training, recently proposed prompt-based approaches offer generalizable solutions to tackle a wide variety of complex question-answering (QA) tasks. However, existing prompt-based approaches that are effective for complex QA tasks involve expensive hand annotations from experts in the form of rationales and are not generalizable to newer complex QA scenarios and tasks. We propose, icat (In-Context Ability Transfer) which induces reasoning capabilities in LLMs without any LLM fine-tuning or manual annotation of in-context samples. We transfer the ability to decompose complex questions to simpler questions or generate step-by-step rationales to LLMs, by careful selection from available data sources of related tasks. We also propose an automated uncertainty-aware exemplar selection approach for selecting examples from transfer data sources. Finally, we conduct large-scale experiments on a variety of complex QA tasks involving numerical reasoning, compositional complex QA, and heterogeneous complex QA which require decomposed reasoning. We show that ICAT convincingly outperforms existing prompt-based solutions without involving any model training, showcasing the benefits of re-using existing abilities.

CodeFusion: A Pre-trained Diffusion Model for Code Generation

  • paper_url: http://arxiv.org/abs/2310.17680
  • repo_url: None
  • paper_authors: Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, Gust Verbruggen
  • for: Proposing a diffusion-based code generation model that, unlike auto-regressive models, can reconsider earlier generated tokens when producing code from natural language.
  • methods: CodeFusion, a pre-trained diffusion code generation model, iteratively denoises a complete program conditioned on the encoded natural language.
  • results: On natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules, CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M-175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy, owing to its better balance of diversity versus quality.
    Abstract Imagine a developer who can only change their last line of code, how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering earlier tokens generated. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M-175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy due to its better balance in diversity versus quality.
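A minimal torch sketch of the iterative-denoising idea: the whole program latent is refined over T steps conditioned on the encoded natural language, so later steps can revise any position, unlike left-to-right decoding. The untrained linear stand-ins and the toy update schedule are placeholders, not CodeFusion's architecture or noise scheduler.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, seq_len, T = 64, 32, 10
nl_encoder = nn.Linear(d, d)    # stand-in for the NL utterance encoder
denoiser = nn.Linear(2 * d, d)  # stand-in for the denoising network

def generate(nl_embedding):
    """Refine the whole program latent at once, conditioned on the NL input,
    so every step may revise every position."""
    cond = nl_encoder(nl_embedding).expand(seq_len, d)
    x = torch.randn(seq_len, d)           # start from pure noise
    for t in range(T):
        eps_hat = denoiser(torch.cat([x, cond], dim=-1))
        x = x - (1.0 / T) * eps_hat       # toy linear update schedule
    return x  # a trained decoder would map this latent to code tokens

program_latent = generate(torch.randn(d))
print(program_latent.shape)  # torch.Size([32, 64])
```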

FormaT5: Abstention and Examples for Conditional Table Formatting with Natural Language

  • paper_url: http://arxiv.org/abs/2310.17306
  • repo_url: None
  • paper_authors: Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, Elnaz Nouri, Mohammad Raza, Gust Verbruggen
  • for: Automatically generating data-dependent conditional formatting (CF) rules for tables from a target table and a natural language description of the desired formatting logic.
  • methods: FormaT5, a transformer-based model, generates CF rules and copes with under-specified or ambiguous user descriptions by learning to predict placeholders through an abstention objective; placeholders are then filled by a second model or, when example rows to be formatted are available, by a programming-by-example system.
  • results: On an extensive benchmark of 1053 CF tasks with real-world descriptions collected from four sources, abstention and filling allow FormaT5 to outperform 8 neural approaches, both with and without examples, illustrating the value of building domain-specific learning systems.
    Abstract Formatting is an important property in tables for visualization, presentation, and analysis. Spreadsheet software allows users to automatically format their tables by writing data-dependent conditional formatting (CF) rules. Writing such rules is often challenging for users as it requires them to understand and implement the underlying logic. We present FormaT5, a transformer-based model that can generate a CF rule given the target table and a natural language description of the desired formatting logic. We find that user descriptions for these tasks are often under-specified or ambiguous, making it harder for code generation systems to accurately learn the desired rule in a single step. To tackle this problem of under-specification and minimise argument errors, FormaT5 learns to predict placeholders though an abstention objective. These placeholders can then be filled by a second model or, when examples of rows that should be formatted are available, by a programming-by-example system. To evaluate FormaT5 on diverse and real scenarios, we create an extensive benchmark of 1053 CF tasks, containing real-world descriptions collected from four different sources. We release our benchmarks to encourage research in this area. Abstention and filling allow FormaT5 to outperform 8 different neural approaches on our benchmarks, both with and without examples. Our results illustrate the value of building domain-specific learning systems.

Comparing Photorealistic and Animated Embodied Conversational Agents in Serious Games: An Empirical Study on User Experience

  • paper_url: http://arxiv.org/abs/2310.17300
  • repo_url: None
  • paper_authors: Danai Korre
  • for: Studying embodied conversational agents (ECAs) in a serious game environment, comparing two distinct levels of presentation realism.
  • methods: A within-subjects, two-by-two factorial design with 36 participants balanced for gender, comparing photorealistic and animated agent versions on usability and user preference.
  • results: Both versions were perceived as highly usable (overall mean scores of 5.76 and 5.71), but 69.4 per cent of participants preferred the photorealistic version, 25 per cent the animated version, and 5.6 per cent had no stated preference. The photorealistic agents were perceived as more realistic and human-like, while the animated characters made the task feel more like a game. Although the agents' realism had no significant effect on usability, it positively influenced participants' perceptions of the agent.
    Abstract Embodied conversational agents (ECAs) are paradigms of conversational user interfaces in the form of embodied characters. While ECAs offer various manipulable features, this paper focuses on a study conducted to explore two distinct levels of presentation realism. The two agent versions are photorealistic and animated. The study aims to provide insights and design suggestions for speech-enabled ECAs within serious game environments. A within-subjects, two-by-two factorial design was employed for this research with a cohort of 36 participants balanced for gender. The results showed that both the photorealistic and the animated versions were perceived as highly usable, with overall mean scores of 5.76 and 5.71, respectively. However, 69.4 per cent of the participants stated they preferred the photorealistic version, 25 per cent stated they preferred the animated version and 5.6 per cent had no stated preference. The photorealistic agents were perceived as more realistic and human-like, while the animated characters made the task feel more like a game. Even though the agents' realism had no significant effect on usability, it positively influenced participants' perceptions of the agent. This research aims to lay the groundwork for future studies on ECA realism's impact in serious games across diverse contexts.

Fast Scalable and Accurate Discovery of DAGs Using the Best Order Score Search and Grow-Shrink Trees

  • paper_url: http://arxiv.org/abs/2310.17679
  • repo_url: https://github.com/cmu-phil/boss
  • paper_authors: Bryan Andrews, Joseph Ramsey, Ruben Sanchez-Romero, Jazmin Camchong, Erich Kummerfeld
  • for: Learning graphical conditional independence structures is a core machine learning problem and a cornerstone of causal discovery, yet the accuracy and execution time of existing algorithms struggle to scale to problems with hundreds of highly connected variables, such as recovering brain networks from fMRI data.
  • methods: The best order score search (BOSS) greedily searches over permutations of variables, using grow-shrink trees (GSTs) to construct and score directed acyclic graphs (DAGs) from permutations; GSTs efficiently cache scores to eliminate redundant calculations.
  • results: BOSS achieves state-of-the-art accuracy and execution time, comparing favorably to a variety of combinatorial and gradient-based learning algorithms under a broad range of conditions. Its practicality is demonstrated on two sets of resting-state fMRI data: simulated data with pseudo-empirical noise distributions derived from randomized empirical fMRI cortical signals, and clinical data from 3T fMRI scans processed into cortical parcels. BOSS is available within the TETRAD project, which includes Python and R wrappers.
    Abstract Learning graphical conditional independence structures is an important machine learning problem and a cornerstone of causal discovery. However, the accuracy and execution time of learning algorithms generally struggle to scale to problems with hundreds of highly connected variables -- for instance, recovering brain networks from fMRI data. We introduce the best order score search (BOSS) and grow-shrink trees (GSTs) for learning directed acyclic graphs (DAGs) in this paradigm. BOSS greedily searches over permutations of variables, using GSTs to construct and score DAGs from permutations. GSTs efficiently cache scores to eliminate redundant calculations. BOSS achieves state-of-the-art performance in accuracy and execution time, comparing favorably to a variety of combinatorial and gradient-based learning algorithms under a broad range of conditions. To demonstrate its practicality, we apply BOSS to two sets of resting-state fMRI data: simulated data with pseudo-empirical noise distributions derived from randomized empirical fMRI cortical signals and clinical data from 3T fMRI scans processed into cortical parcels. BOSS is available for use within the TETRAD project which includes Python and R wrappers.
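A minimal numpy sketch of order-based search with a grow step: for a given variable permutation, each variable greedily acquires parents from its predecessors while a linear-Gaussian BIC improves, and candidate orders are compared by total score. The shrink phase, GST-style score caching, and BOSS's actual permutation moves are omitted here.

```python
import numpy as np

def bic(child, parents, X):
    """Linear-Gaussian BIC of `child` regressed on `parents` (lower = better)."""
    n, y = len(X), X[:, child]
    A = np.column_stack([X[:, parents], np.ones(n)]) if parents else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(((y - A @ beta) ** 2).sum())
    return n * np.log(rss / n + 1e-12) + A.shape[1] * np.log(n)

def grow_parents(child, preds, X):
    """Grow-only simplification: add predecessors while BIC improves."""
    parents, best, improved = [], bic(child, [], X), True
    while improved:
        improved = False
        for p in preds:
            if p not in parents:
                s = bic(child, parents + [p], X)
                if s < best - 1e-9:
                    parents, best, improved = parents + [p], s, True
    return parents, best

def score_order(order, X):
    total, dag = 0.0, {}
    for i, v in enumerate(order):
        dag[v], s = grow_parents(v, order[:i], X)
        total += s
    return total, dag

rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + rng.normal(size=n)
x2 = -x0 + 0.5 * x1 + rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
best = min(([0, 1, 2], [2, 1, 0], [1, 0, 2]),
           key=lambda o: score_order(o, X)[0])
print(best, score_order(best, X)[1])
```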

New Boolean satisfiability problem heuristic strategy: Minimal Positive Negative Product Strategy

  • paper_url: http://arxiv.org/abs/2310.18370
  • repo_url: None
  • paper_authors: Qun Zhao, Xintao Wang, Menghui Yang
  • for: Solving the Boolean satisfiability (SAT) problem.
  • methods: A Minimal Positive Negative Product Strategy guides variable selection in the CDCL algorithm, with a mathematical explanation of its superiority over widely used heuristics such as DLIS and VSIDS.
  • results: Experimental results confirm that the heuristic is more effective in problem solving than the commonly used DLIS and VSIDS heuristics.
    Abstract This study presents a novel heuristic algorithm called the "Minimal Positive Negative Product Strategy" to guide the CDCL algorithm in solving the Boolean satisfiability problem. It provides a mathematical explanation for the superiority of this algorithm over widely used heuristics such as the Dynamic Largest Individual Sum (DLIS) and the Variable State Independent Decaying Sum (VSIDS). Experimental results further confirm the effectiveness of this heuristic strategy in problem-solving.
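The abstract does not spell the heuristic out, but a literal reading of the name suggests branching on the unassigned variable whose product of positive and negative occurrence counts (over currently unsatisfied clauses) is minimal. The sketch below implements that literal reading; the tie-breaking and polarity choices are assumptions.

```python
from collections import Counter

def mpnp_pick(clauses, assignment):
    """clauses: list of lists of non-zero ints (DIMACS-style literals).
    assignment: dict var -> bool for already-assigned variables.
    Returns (variable, polarity) to branch on, or None if none remain."""
    pos, neg = Counter(), Counter()
    for clause in clauses:
        if any(assignment.get(abs(l)) == (l > 0) for l in clause):
            continue  # clause already satisfied under the partial assignment
        for l in clause:
            v = abs(l)
            if v not in assignment:
                (pos if l > 0 else neg)[v] += 1
    candidates = set(pos) | set(neg)
    if not candidates:
        return None
    # Minimal positive*negative product; prefer the more frequent polarity.
    v = min(candidates, key=lambda u: (pos[u] * neg[u], u))
    return v, pos[v] >= neg[v]

cnf = [[1, -2], [2, 3], [-1, 3], [-3, 2]]
print(mpnp_pick(cnf, {}))  # (1, True) on this toy formula
```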

Attribute Based Interpretable Evaluation Metrics for Generative Models

  • paper_url: http://arxiv.org/abs/2310.17261
  • repo_url: None
  • paper_authors: Dongkyun Kim, Mingi Kwon, Youngjung Uh
  • for: Proposing a new evaluation protocol that measures whether a generative model faithfully captures the distribution of attribute strengths in its training set.
  • methods: Single-attribute Divergence (SaD) measures the divergence of generated images from the training set regarding the distribution of a single attribute, and Paired-attribute Divergence (PaD) regarding the joint distribution of a pair of attributes; attribute strengths are measured with a new metric, Heterogeneous CLIPScore (HCS), the cosine similarity between image and text vectors with heterogeneous initial points.
  • results: The metrics expose concrete weaknesses of existing models: ProjectedGAN generates implausible attribute relationships such as a baby with a beard despite competitive scores on existing metrics, diffusion models struggle to capture diverse colors in the datasets, and larger sampling timesteps of the latent diffusion model generate more minor objects such as earrings and necklaces. Stable Diffusion v1.5 captures attributes better than v2.1.
    Abstract When the training dataset comprises a 1:1 proportion of dogs to cats, a generative model that produces 1:1 dogs and cats better resembles the training species distribution than another model with 3:1 dogs and cats. Can we capture this phenomenon using existing metrics? Unfortunately, we cannot, because these metrics do not provide any interpretability beyond "diversity". In this context, we propose a new evaluation protocol that measures the divergence of a set of generated images from the training set regarding the distribution of attribute strengths as follows. Single-attribute Divergence (SaD) measures the divergence regarding PDFs of a single attribute. Paired-attribute Divergence (PaD) measures the divergence regarding joint PDFs of a pair of attributes. They provide which attributes the models struggle. For measuring the attribute strengths of an image, we propose Heterogeneous CLIPScore (HCS) which measures the cosine similarity between image and text vectors with heterogeneous initial points. With SaD and PaD, we reveal the following about existing generative models. ProjectedGAN generates implausible attribute relationships such as a baby with a beard even though it has competitive scores of existing metrics. Diffusion models struggle to capture diverse colors in the datasets. The larger sampling timesteps of latent diffusion model generate the more minor objects including earrings and necklaces. Stable Diffusion v1.5 better captures the attributes than v2.1. Our metrics lay a foundation for explainable evaluations of generative models.
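A sketch of the Single-attribute Divergence idea under simplifying assumptions: given per-image attribute-strength scores (the paper computes these with Heterogeneous CLIPScore; any scorer works for illustration), compare the generated and training score distributions one attribute at a time. Histogram-plus-KL is one plausible instantiation; the paper's exact divergence may differ.

```python
import numpy as np

def single_attribute_divergence(train_scores, gen_scores, bins=20):
    """Each argument: (n_images,) attribute-strength scores for one attribute.
    Returns a KL-style divergence between the two score distributions."""
    lo = min(train_scores.min(), gen_scores.min())
    hi = max(train_scores.max(), gen_scores.max())
    p, _ = np.histogram(train_scores, bins=bins, range=(lo, hi))
    q, _ = np.histogram(gen_scores, bins=bins, range=(lo, hi))
    p = (p + 1e-6) / (p + 1e-6).sum()  # smooth, then normalize
    q = (q + 1e-6) / (q + 1e-6).sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
train = rng.normal(0.5, 0.15, size=5000)     # e.g., "smiling" strength in data
faithful = rng.normal(0.5, 0.15, size=5000)  # model matches the distribution
skewed = rng.normal(0.8, 0.05, size=5000)    # model over-generates the attribute
print(single_attribute_divergence(train, faithful))  # near 0
print(single_attribute_divergence(train, skewed))    # large
```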

IDENAS: Internal Dependency-based Exploration for Neural Architecture Search

  • paper_url: http://arxiv.org/abs/2310.17250
  • repo_url: https://github.com/viharoszsolt/idenas
  • paper_authors: Anh T. Hoang, Zsolt J. Viharos
  • for: Improving the efficiency and accuracy of automated machine-learning model development, particularly in scenarios where the distinction between input and output variables is unknown.
  • methods: IDENAS, an internal-dependency-based exploration for neural architecture search, integrates NAS with feature selection. It explores internal dependencies in the complete parameter space for classification on 1D sensor and 2D image data, using a modified encoder-decoder model and the Sequential Forward Search (SFS) algorithm to combine input-output configuration search with embedded feature selection.
  • results: IDENAS outperforms competing algorithms, achieving significant modelling improvements on average and demonstrating its effectiveness in model development pipelines and automated machine learning, advancing the state of the art in integrating neural architecture search with feature selection.
    Abstract Machine learning is a powerful tool for extracting valuable information and making various predictions from diverse datasets. Traditional algorithms rely on well-defined input and output variables however, there are scenarios where the distinction between the input and output variables and the underlying, associated (input and output) layers of the model, are unknown. Neural Architecture Search (NAS) and Feature Selection have emerged as promising solutions in such scenarios. This research proposes IDENAS, an Internal Dependency-based Exploration for Neural Architecture Search, integrating NAS with feature selection. The methodology explores internal dependencies in the complete parameter space for classification involving 1D sensor and 2D image data as well. IDENAS employs a modified encoder-decoder model and the Sequential Forward Search (SFS) algorithm, combining input-output configuration search with embedded feature selection. Experimental results demonstrate IDENASs superior performance in comparison to other algorithms, showcasing its effectiveness in model development pipelines and automated machine learning. On average, IDENAS achieved significant modelling improvements, underscoring its significant contribution to advancing the state-of-the-art in neural architecture search and feature selection integration.

CROP: Conservative Reward for Model-based Offline Policy Optimization

  • paper_url: http://arxiv.org/abs/2310.17245
  • repo_url: https://github.com/g0k0ururi/crop
  • paper_authors: Hao Li, Xiao-Hu Zhou, Xiao-Liang Xie, Shi-Qi Liu, Zhen-Qiu Feng, Xiao-Yin Liu, Mei-Jiang Gui, Tian-Yu Xiang, De-Xing Huang, Bo-Xian Yao, Zeng-Guang Hou
  • for: Proposing CROP, a novel model-based offline reinforcement learning algorithm that optimizes policies while mitigating the distribution-drift problem by conservatively estimating rewards.
  • methods: During model training, the reward is estimated conservatively by simultaneously minimizing the estimation error and the reward of random actions.
  • results: On the D4RL benchmark, CROP performs comparably to state-of-the-art baselines, and it establishes an innovative connection between offline and online RL by applying online RL techniques to the empirical Markov decision process trained with a conservative reward.
    Abstract Offline reinforcement learning (RL) aims to optimize policy using collected data without online interactions. Model-based approaches are particularly appealing for addressing offline RL challenges due to their capability to mitigate the limitations of offline data through data generation using models. Prior research has demonstrated that introducing conservatism into the model or Q-function during policy optimization can effectively alleviate the prevalent distribution drift problem in offline RL. However, the investigation into the impacts of conservatism in reward estimation is still lacking. This paper proposes a novel model-based offline RL algorithm, Conservative Reward for model-based Offline Policy optimization (CROP), which conservatively estimates the reward in model training. To achieve a conservative reward estimation, CROP simultaneously minimizes the estimation error and the reward of random actions. Theoretical analysis shows that this conservative reward mechanism leads to a conservative policy evaluation and helps mitigate distribution drift. Experiments on D4RL benchmarks showcase that the performance of CROP is comparable to the state-of-the-art baselines. Notably, CROP establishes an innovative connection between offline and online RL, highlighting that offline RL problems can be tackled by adopting online RL techniques to the empirical Markov decision process trained with a conservative reward. The source code is available with https://github.com/G0K0URURI/CROP.git.
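A minimal torch sketch of the conservative reward objective as stated in the abstract: fit the reward model to logged rewards while simultaneously pushing down its predictions on random actions. The network architecture, the coefficient, and the synthetic batch are placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, act_dim, beta = 8, 2, 0.5
reward_net = nn.Sequential(nn.Linear(state_dim + act_dim, 64),
                           nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def crop_reward_loss(s, a, r):
    pred = reward_net(torch.cat([s, a], dim=-1)).squeeze(-1)
    fit = ((pred - r) ** 2).mean()           # reward estimation error
    a_rand = torch.rand_like(a) * 2 - 1      # uniform random actions
    conservative = reward_net(torch.cat([s, a_rand], dim=-1)).mean()
    return fit + beta * conservative         # penalize out-of-data optimism

# One gradient step on a fake offline batch.
s, a = torch.randn(256, state_dim), torch.rand(256, act_dim) * 2 - 1
r = (s[:, 0] * a[:, 0]).detach()
loss = crop_reward_loss(s, a, r)
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```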

Joint Entity and Relation Extraction with Span Pruning and Hypergraph Neural Networks

  • paper_url: http://arxiv.org/abs/2310.17238
  • repo_url: https://github.com/yanzhh/hgere
  • paper_authors: Zhaohui Yan, Songlin Yang, Wei Liu, Kewei Tu
  • for: Improving entity and relation extraction (ERE), in particular alleviating the error-propagation problem of marker-based pipeline models and capturing higher-order interactions between multiple entities and relations.
  • methods: Building on PL-marker, a state-of-the-art marker-based pipeline model, a high-recall pruner transfers the burden of entity identification and labeling from the NER module to the joint module. A hypergraph is then built whose nodes are candidate entities (provided by the span pruner) and their relations, and whose hyperedges encode interactions between two different relations or between a relation and its associated subject and object entities; a hypergraph neural network ($\hgnn{}$) performs higher-order inference by message passing over this hypergraph.
  • results: Experiments on three widely used ERE benchmarks (\acef{}, \ace{} and \scierc{}) show significant improvements over the previous state-of-the-art PL-marker.
    Abstract Entity and Relation Extraction (ERE) is an important task in information extraction. Recent marker-based pipeline models achieve state-of-the-art performance, but still suffer from the error propagation issue. Also, most of current ERE models do not take into account higher-order interactions between multiple entities and relations, while higher-order modeling could be beneficial.In this work, we propose HyperGraph neural network for ERE ($\hgnn{}$), which is built upon the PL-marker (a state-of-the-art marker-based pipleline model). To alleviate error propagation,we use a high-recall pruner mechanism to transfer the burden of entity identification and labeling from the NER module to the joint module of our model. For higher-order modeling, we build a hypergraph, where nodes are entities (provided by the span pruner) and relations thereof, and hyperedges encode interactions between two different relations or between a relation and its associated subject and object entities. We then run a hypergraph neural network for higher-order inference by applying message passing over the built hypergraph. Experiments on three widely used benchmarks (\acef{}, \ace{} and \scierc{}) for ERE task show significant improvements over the previous state-of-the-art PL-marker.
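A minimal sketch of one round of hypergraph message passing via an incidence matrix: nodes (candidate entities and relation nodes) mean-aggregate into hyperedges, which then broadcast back to their member nodes. This shows the generic mechanism only; the paper's layers, gating, and hyperedge construction are not reproduced.

```python
import torch

def hypergraph_mp(node_feats, incidence):
    """node_feats: (N, d); incidence: (N, E) 0/1 matrix with incidence[n, e]=1
    iff node n belongs to hyperedge e. One mean-aggregate round trip."""
    deg_e = incidence.sum(0).clamp(min=1)            # nodes per hyperedge
    edge_feats = (incidence.t() @ node_feats) / deg_e.unsqueeze(1)
    deg_n = incidence.sum(1).clamp(min=1)            # hyperedges per node
    messages = (incidence @ edge_feats) / deg_n.unsqueeze(1)
    return node_feats + messages                     # residual update

torch.manual_seed(0)
# 4 nodes: two entities (0, 1) and two relation nodes (2, 3).
# Hyperedge 0 ties relation 2 to its subject 0 and object 1;
# hyperedge 1 ties the two relations 2 and 3 together.
inc = torch.tensor([[1., 0.], [1., 0.], [1., 1.], [0., 1.]])
x = torch.randn(4, 16)
print(hypergraph_mp(x, inc).shape)  # torch.Size([4, 16])
```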

TST$^\mathrm{R}$: Target Similarity Tuning Meets the Real World

  • paper_url: http://arxiv.org/abs/2310.17228
  • repo_url: None
  • paper_authors: Anirudh Khatry, Sumit Gulwani, Priyanshu Gupta, Vu Le, Ananya Singha, Mukul Singh, Gust Verbruggen
  • for: Improving natural language (NL) to code generation with large language models (LLMs) by adapting a sentence embedding model so that the similarity between two NL inputs matches the similarity between their associated code outputs.
  • methods: Several methods to apply and improve target similarity tuning (TST) in the real world: replacing the sentence transformer with embeddings from a larger model (reducing sensitivity to the language distribution and enabling synthetic generation of examples), training a tiny model that transforms these embeddings into a space where embedding similarity matches code similarity, and efficiently selecting a smaller number of training examples.
  • results: A ranking-based evaluation for TST that does not require end-to-end code generation experiments, which can be expensive to perform.
    Abstract Target similarity tuning (TST) is a method of selecting relevant examples in natural language (NL) to code generation through large language models (LLMs) to improve performance. Its goal is to adapt a sentence embedding model to have the similarity between two NL inputs match the similarity between their associated code outputs. In this paper, we propose different methods to apply and improve TST in the real world. First, we replace the sentence transformer with embeddings from a larger model, which reduces sensitivity to the language distribution and thus provides more flexibility in synthetic generation of examples, and we train a tiny model that transforms these embeddings to a space where embedding similarity matches code similarity, which allows the model to remain a black box and only requires a few matrix multiplications at inference time. Second, we show how to efficiently select a smaller number of training examples to train the TST model. Third, we introduce a ranking-based evaluation for TST that does not require end-to-end code generation experiments, which can be expensive to perform.
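A torch sketch of the "tiny model" idea: with embeddings from a larger model kept frozen, train a small linear map so that cosine similarity in the mapped space tracks a code-similarity target; selecting few-shot examples at inference then costs only a couple of matrix multiplications. The dimensions and the MSE objective are assumptions, and the random target here is purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n, d_in, d_out = 64, 256, 32
nl_embed = torch.randn(n, d_in)           # frozen embeddings of NL queries
code_sim = torch.rand(n, n)               # target similarity of their codes
code_sim = (code_sim + code_sim.t()) / 2  # symmetrize the toy target

W = nn.Linear(d_in, d_out, bias=False)    # the "tiny model"
opt = torch.optim.Adam(W.parameters(), lr=1e-2)

for step in range(300):
    z = F.normalize(W(nl_embed), dim=-1)
    sim = z @ z.t()                       # cosine similarity after mapping
    loss = F.mse_loss(sim, code_sim)      # match embedding sim to code sim
    opt.zero_grad(); loss.backward(); opt.step()

# Inference: picking few-shot examples for a new query is just a matmul.
query = F.normalize(W(torch.randn(1, d_in)), dim=-1)
bank = F.normalize(W(nl_embed), dim=-1)
print((query @ bank.t()).topk(5).indices)
```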

Beyond MLE: Convex Learning for Text Generation

  • paper_url: http://arxiv.org/abs/2310.17217
  • repo_url: https://github.com/ictnlp/convex-learning
  • paper_authors: Chenze Shao, Zhengrui Ma, Min Zhang, Yang Feng
  • for: This paper proposes a novel approach to training text generation models using convex functions, which can help the models focus on highly probable outputs without requiring maximum likelihood estimation (MLE).
  • methods: The proposed approach uses convex functions to define the training objective, which enables the model to better capture outputs with high probabilities. The authors investigate the theoretical properties of the optimal predicted distribution when applying convex functions to the loss.
  • results: The proposed approach is effective in improving the performance of text generation models. In experiments on various text generation tasks and models, it enables autoregressive models to bridge the gap between greedy and beam search, and facilitates the learning of non-autoregressive models with a maximum improvement of 9+ BLEU points. The approach also exhibits significant impact on large language models (LLMs), substantially enhancing their generative capability on various tasks.
    Abstract Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution that best explain the observed data. In the context of text generation, MLE is often used to train generative language models, which can then be used to generate new text. However, we argue that MLE is not always necessary and optimal, especially for closed-ended text generation tasks like machine translation. In these tasks, the goal of model is to generate the most appropriate response, which does not necessarily require it to estimate the entire data distribution with MLE. To this end, we propose a novel class of training objectives based on convex functions, which enables text generation models to focus on highly probable outputs without having to estimate the entire data distribution. We investigate the theoretical properties of the optimal predicted distribution when applying convex functions to the loss, demonstrating that convex functions can sharpen the optimal distribution, thereby enabling the model to better capture outputs with high probabilities. Experiments on various text generation tasks and models show the effectiveness of our approach. It enables autoregressive models to bridge the gap between greedy and beam search, and facilitates the learning of non-autoregressive models with a maximum improvement of 9+ BLEU points. Moreover, our approach also exhibits significant impact on large language models (LLMs), substantially enhancing their generative capability on various tasks. Source code is available at \url{https://github.com/ictnlp/Convex-Learning}.

Emotion Recognition by Video: A review

  • paper_url: http://arxiv.org/abs/2310.17212
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Junxiao Xue, Jie Wang, Xuecheng Wu, Liangyu Fu
  • for: Helping academic and industrial researchers stay up to date with the latest advances and developments in video emotion recognition.
  • methods: The review analyzes and classifies the structure and performance of modern unimodal and multimodal video emotion recognition methods, discussing the benefits and drawbacks of each and comparing them in detail.
  • results: The paper systematizes research published from 2015 to 2023, covering two typical emotion models, the unimodal and multimodal databases frequently used for video emotion recognition, and the structure and performance of modern methods; it also summarizes the primary current difficulties and points out promising future directions, such as establishing an open benchmark database and better multimodal fusion strategies.
    Abstract Video emotion recognition is an important branch of affective computing, and its solutions can be applied in different fields such as human-computer interaction (HCI) and intelligent medical treatment. Although the number of papers published in the field of emotion recognition is increasing, there are few comprehensive literature reviews covering related research on video emotion recognition. Therefore, this paper selects articles published from 2015 to 2023 to systematize the existing trends in video emotion recognition in related studies. In this paper, we first talk about two typical emotion models, then we talk about databases that are frequently utilized for video emotion recognition, including unimodal databases and multimodal databases. Next, we look at and classify the specific structure and performance of modern unimodal and multimodal video emotion recognition methods, talk about the benefits and drawbacks of each, and then we compare them in detail in the tables. Further, we sum up the primary difficulties right now looked by video emotion recognition undertakings and point out probably the most encouraging future headings, such as establishing an open benchmark database and better multimodal fusion strategys. The essential objective of this paper is to assist scholarly and modern scientists with keeping up to date with the most recent advances and new improvements in this speedy, high-influence field of video emotion recognition.

Efficient Data Fusion using the Tsetlin Machine

  • paper_url: http://arxiv.org/abs/2310.17207
  • repo_url: None
  • paper_authors: Rupsa Saha, Vladimir I. Zadorozhny, Ole-Christoffer Granmo
  • for: Proposing a novel way of assessing and fusing noisy dynamic data using a Tsetlin Machine.
  • methods: The approach monitors how the logical clauses a Tsetlin Machine learns change with possible noise in dynamic data, allowing the machine to recognize noise by lowering the weights of previously learned clauses or by reflecting it in new clauses.
  • results: A comprehensive experimental study on notably different datasets demonstrates the high performance of the proposed approach.
    Abstract We propose a novel way of assessing and fusing noisy dynamic data using a Tsetlin Machine. Our approach consists in monitoring how explanations in form of logical clauses that a TM learns changes with possible noise in dynamic data. This way TM can recognize the noise by lowering weights of previously learned clauses, or reflect it in the form of new clauses. We also perform a comprehensive experimental study using notably different datasets that demonstrated high performance of the proposed approach.

Taming Gradient Variance in Federated Learning with Networked Control Variates

  • paper_url: http://arxiv.org/abs/2310.17200
  • repo_url: None
  • paper_authors: Xingyan Chen, Yaling Liu, Huaming Du, Mu Wang, Yu Zhao
  • for: Addressing key challenges in federated learning, including extensive communication overheads, slow convergence, and unstable improvements, which primarily stem from gradient variance caused by heterogeneous client data distributions.
  • methods: The FedNCV framework adopts the REINFORCE Leave-One-Out (RLOO) estimator as a fundamental control variate unit at both client and server levels. At the client level, RLOO control variates optimize local gradient updates, mitigating the variance introduced by data samples; at the server, an RLOO-based estimator provides an unbiased, low-variance aggregated gradient for robust global updates. This dual-side application is formalized as a linear combination of composite control variates, supported by a mathematical expression and three theoretical results with proofs.
  • results: Tested on six diverse datasets under a Dirichlet distribution with α = 0.1 and benchmarked against six SOTA methods, FedNCV demonstrates superior performance and addresses data heterogeneity and scalability, potentially paving the way for large-scale applications.
    Abstract Federated learning, a decentralized approach to machine learning, faces significant challenges such as extensive communication overheads, slow convergence, and unstable improvements. These challenges primarily stem from the gradient variance due to heterogeneous client data distributions. To address this, we introduce a novel Networked Control Variates (FedNCV) framework for Federated Learning. We adopt the REINFORCE Leave-One-Out (RLOO) as a fundamental control variate unit in the FedNCV framework, implemented at both client and server levels. At the client level, the RLOO control variate is employed to optimize local gradient updates, mitigating the variance introduced by data samples. Once relayed to the server, the RLOO-based estimator further provides an unbiased and low-variance aggregated gradient, leading to robust global updates. This dual-side application is formalized as a linear combination of composite control variates. We provide a mathematical expression capturing this integration of double control variates within FedNCV and present three theoretical results with corresponding proofs. This unique dual structure equips FedNCV to address data heterogeneity and scalability issues, thus potentially paving the way for large-scale applications. Moreover, we tested FedNCV on six diverse datasets under a Dirichlet distribution with {\alpha} = 0.1, and benchmarked its performance against six SOTA methods, demonstrating its superiority.
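A minimal numpy sketch of the REINFORCE Leave-One-Out estimator itself, the control-variate building block FedNCV reuses: each of K sampled rewards is baselined by the mean of the other K-1, which keeps the estimator unbiased while reducing variance. How FedNCV composes this across clients and server is not reproduced here.

```python
import numpy as np

def rloo_estimate(rewards, grad_logps):
    """rewards: (K,) scalar returns of K sampled actions;
    grad_logps: (K, P) score-function gradients d log pi(a_k) / d theta.
    Returns the leave-one-out REINFORCE gradient estimate, shape (P,)."""
    rewards = np.asarray(rewards, dtype=float)
    K = len(rewards)
    # Leave-one-out baseline: mean of the *other* K-1 rewards.
    baselines = (rewards.sum() - rewards) / (K - 1)
    advantages = rewards - baselines
    return (advantages[:, None] * grad_logps).mean(axis=0)

rng = np.random.default_rng(0)
K, P = 8, 5
g = rng.normal(size=(K, P))
r = rng.normal(loc=1.0, size=K)
print(rloo_estimate(r, g))              # baselined, lower-variance estimate
print((r[:, None] * g).mean(axis=0))    # naive REINFORCE, higher variance
```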

How do Language Models Bind Entities in Context?

  • paper_url: http://arxiv.org/abs/2310.17191
  • repo_url: None
  • paper_authors: Jiahai Feng, Jacob Steinhardt
  • for: Investigating how language models (LMs) use in-context symbolic knowledge, specifically how they bind entities to their attributes.
  • methods: Causal interventions probe whether LMs' internal activations represent binding information, revealing a binding ID mechanism, a general mechanism for solving the binding problem observed in every sufficiently large model from the Pythia and LLaMA families.
  • results: LMs' internal activations represent binding information by attaching binding ID vectors to the corresponding entities and attributes; these vectors form a continuous subspace in which distances between binding ID vectors reflect their discernability. The results uncover interpretable strategies for representing symbolic knowledge in context, a step toward understanding general in-context reasoning in large-scale LMs.
    Abstract To correctly use in-context information, language models (LMs) must bind entities to their attributes. For example, given a context describing a "green square" and a "blue circle", LMs must bind the shapes to their respective colors. We analyze LM representations and identify the binding ID mechanism: a general mechanism for solving the binding problem, which we observe in every sufficiently large model from the Pythia and LLaMA families. Using causal interventions, we show that LMs' internal activations represent binding information by attaching binding ID vectors to corresponding entities and attributes. We further show that binding ID vectors form a continuous subspace, in which distances between binding ID vectors reflect their discernability. Overall, our results uncover interpretable strategies in LMs for representing symbolic knowledge in-context, providing a step towards understanding general in-context reasoning in large-scale LMs.

Understanding the Effects of Projectors in Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2310.17183
  • repo_url: https://github.com/chenyd7/pefd
  • paper_authors: Yudong Chen, Sen Wang, Jiajun Liu, Xuwei Xu, Frank de Hoog, Brano Kusy, Zi Huang
  • for: Investigates the implicit role of projectors in knowledge (feature) distillation, even when the student and teacher networks share the same feature dimensions.
  • methods: Adds a projector to the student network to transform its features before matching those of a pre-trained teacher network.
  • results: Projectors improve distillation performance even when student and teacher have identical feature dimensions, and also improve logit distillation. These findings motivate a projector ensemble-based feature distillation method that further boosts performance.
    Abstract Conventionally, during the knowledge distillation process (e.g. feature distillation), an additional projector is often required to perform feature transformation due to the dimension mismatch between the teacher and the student networks. Interestingly, we discovered that even if the student and the teacher have the same feature dimensions, adding a projector still helps to improve the distillation performance. In addition, projectors even improve logit distillation if we add them to the architecture too. Inspired by these surprising findings and the general lack of understanding of the projectors in the knowledge distillation process from existing literature, this paper investigates the implicit role that projectors play but so far have been overlooked. Our empirical study shows that the student with a projector (1) obtains a better trade-off between the training accuracy and the testing accuracy compared to the student without a projector when it has the same feature dimensions as the teacher, (2) better preserves its similarity to the teacher beyond shallow and numeric resemblance, from the view of Centered Kernel Alignment (CKA), and (3) avoids being over-confident as the teacher does at the testing phase. Motivated by the positive effects of projectors, we propose a projector ensemble-based feature distillation method to further improve distillation performance. Despite the simplicity of the proposed strategy, empirical results from the evaluation of classification tasks on benchmark datasets demonstrate the superior classification performance of our method on a broad range of teacher-student pairs and verify from the aspects of CKA and model calibration that the student's features are of improved quality with the projector ensemble design.
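
A hedged sketch of projector-ensemble feature distillation: several projectors map the student feature, their outputs are averaged, and the average is matched to the (detached) teacher feature. The projector architecture below is an assumption, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectorEnsembleFD(nn.Module):
    """Feature distillation through an ensemble of projectors."""

    def __init__(self, s_dim, t_dim, num_projectors=3):
        super().__init__()
        self.projectors = nn.ModuleList([
            nn.Sequential(nn.Linear(s_dim, t_dim), nn.ReLU(),
                          nn.Linear(t_dim, t_dim))
            for _ in range(num_projectors)
        ])

    def forward(self, f_student, f_teacher):
        # Average the ensemble's projections, then match the teacher feature.
        proj = torch.stack([p(f_student) for p in self.projectors]).mean(dim=0)
        return F.mse_loss(proj, f_teacher.detach())

# Useful even when s_dim == t_dim, per the paper's findings:
fd = ProjectorEnsembleFD(s_dim=512, t_dim=512)
loss = fd(torch.randn(8, 512), torch.randn(8, 512))   # add to the CE loss
```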

Graphical Object-Centric Actor-Critic

  • paper_url: http://arxiv.org/abs/2310.17178
  • repo_url: None
  • paper_authors: Leonid Ugadiarov, Aleksandr I. Panov
  • for: Improve policy learning in image-based object-centric reinforcement learning tasks.
  • methods: Combines actor-critic and model-based approaches, using a transformer encoder to extract object representations and graph neural networks to approximate environment dynamics.
  • results: Outperforms a state-of-the-art transformer-based model-free actor-critic algorithm and a state-of-the-art monolithic model-based algorithm in a visually complex 3D robotic environment and a 2D environment with compositional structure.
    Abstract There have recently been significant advances in the problem of unsupervised object-centric representation learning and its application to downstream tasks. The latest works support the argument that employing disentangled object representations in image-based object-centric reinforcement learning tasks facilitates policy learning. We propose a novel object-centric reinforcement learning algorithm combining actor-critic and model-based approaches to utilize these representations effectively. In our approach, we use a transformer encoder to extract object representations and graph neural networks to approximate the dynamics of an environment. The proposed method fills a research gap in developing efficient object-centric world models for reinforcement learning settings that can be used for environments with discrete or continuous action spaces. Our algorithm performs better in a visually complex 3D robotic environment and a 2D environment with compositional structure than the state-of-the-art model-free actor-critic algorithm built upon transformer architecture and the state-of-the-art monolithic model-based algorithm.
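
The model-based half, a GNN approximating object-level dynamics, can be sketched as one message-passing step over object slots. This is a generic interaction-network-style sketch, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ObjectDynamicsGNN(nn.Module):
    """Predicts next object states from current slots and an action."""

    def __init__(self, obj_dim, act_dim, hid=128):
        super().__init__()
        self.edge = nn.Sequential(nn.Linear(2 * obj_dim, hid), nn.ReLU(),
                                  nn.Linear(hid, hid))
        self.node = nn.Sequential(nn.Linear(obj_dim + act_dim + hid, hid),
                                  nn.ReLU(), nn.Linear(hid, obj_dim))

    def forward(self, objs, action):
        # objs: (B, K, D) object slots, action: (B, A)
        B, K, D = objs.shape
        src = objs.unsqueeze(2).expand(B, K, K, D)    # sender i, fixed over j
        dst = objs.unsqueeze(1).expand(B, K, K, D)    # receiver j
        msgs = self.edge(torch.cat([src, dst], dim=-1)).sum(dim=1)  # per receiver

        act = action.unsqueeze(1).expand(B, K, action.shape[-1])
        return objs + self.node(torch.cat([objs, act, msgs], dim=-1))  # residual

model = ObjectDynamicsGNN(obj_dim=32, act_dim=4)
next_objs = model(torch.randn(2, 5, 32), torch.randn(2, 4))   # (2, 5, 32)
```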

Bridging The Gaps Between Token Pruning and Full Pre-training via Masked Fine-tuning

  • paper_url: http://arxiv.org/abs/2310.17177
  • repo_url: None
  • paper_authors: Fengyuan Shi, Limin Wang
  • for: Improve occlusion robustness and resistance to information loss, making base models better initializations for dynamic vision transformers.
  • methods: Uses masked fine-tuning, masking image patches and predicting the image class label from the remaining unmasked patches, to align pre-trained base models with the token pruning strategy of dynamic vision transformers and resolve the inconsistencies between the two.
  • results: Extensive experiments on ImageNet show that masked fine-tuning gives base models strong robustness to occlusion and information loss, and Dynamic ViT initialized this way achieves higher accuracy under various token pruning ratios (e.g., 81.9% vs. 81.3% at keep ratio 0.8, and 62.3% vs. 58.9% at 0.3).
    Abstract Despite the success of transformers on various computer vision tasks, they suffer from excessive memory and computational cost. Some works present dynamic vision transformers to accelerate inference by pruning redundant tokens. A key to improving token pruning is using well-trained models as initialization for faster convergence and better performance. However, current base models usually adopt full image training, i.e., using full images as inputs and keeping the whole feature maps through the forward process, which causes inconsistencies with dynamic models that gradually reduce tokens, including calculation pattern, information amount and token selection strategy inconsistencies. Inspired by MAE which performs masking and reconstruction self-supervised task, we devise masked fine-tuning to bridge the gaps between pre-trained base models used for initialization and token pruning based dynamic vision transformers, by masking image patches and predicting the image class label based on left unmasked patches. Extensive experiments on ImageNet demonstrate that base models via masked fine-tuning gain strong occlusion robustness and ability against information loss. With this better initialization, Dynamic ViT achieves higher accuracies, especially under large token pruning ratios (e.g., 81.9% vs. 81.3%, and 62.3% vs. 58.9% for DeiT based Dynamic ViT/0.8 and Dynamic ViT/0.3). Moreover, we apply our method into different token pruning based dynamic vision transformers, different pre-trained models and randomly initialized models to demonstrate the generalization ability.
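
Masked fine-tuning itself is simple to sketch: drop a random subset of patch tokens and classify from the survivors, mimicking what a token-pruned dynamic ViT sees at inference. `model.patchify` and `model.forward_tokens` are hypothetical hooks standing in for a ViT's patch embedding and encoder plus head.

```python
import torch
import torch.nn.functional as F

def masked_finetune_step(model, images, labels, mask_ratio=0.5):
    """One masked fine-tuning step (sketch)."""
    tokens = model.patchify(images)                     # (B, N, D)
    B, N, D = tokens.shape
    keep = max(1, int(N * (1 - mask_ratio)))
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :keep]
    kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    logits = model.forward_tokens(kept)                 # classify unmasked tokens
    return F.cross_entropy(logits, labels)
```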

A Deep Learning Approach to Teeth Segmentation and Orientation from Panoramic X-rays

  • paper_url: http://arxiv.org/abs/2310.17176
  • repo_url: https://github.com/mrinal054/instance_teeth_segmentation
  • paper_authors: Mrinal Kanti Dhar, Mou Deb, D. Madhab, Zeyun Yu
  • for: Enable accurate teeth segmentation and orientation estimation from panoramic X-rays to support precise diagnosis, treatment planning, and dental implant design in modern oral healthcare.
  • methods: Builds a deep learning model on FUSegNet, modified with grid-based attention gates in the skip connections, and generates oriented bounding boxes (OBBs) via principal component analysis (PCA) for precise tooth orientation.
  • results: On the public DNS dataset of 543 panoramic X-ray images, achieves the highest Intersection-over-Union (IoU) of 82.43% and Dice Similarity Coefficient (DSC) of 90.37% among compared models for teeth instance segmentation, plus a Rotated IoU (RIoU) of 82.82%; detailed per-tooth and categorical analyses highlight strengths and weaknesses.
    Abstract Accurate teeth segmentation and orientation are fundamental in modern oral healthcare, enabling precise diagnosis, treatment planning, and dental implant design. In this study, we present a comprehensive approach to teeth segmentation and orientation from panoramic X-ray images, leveraging deep learning techniques. We build our model based on FUSegNet, a popular model originally developed for wound segmentation, and introduce modifications by incorporating grid-based attention gates into the skip connections. We introduce oriented bounding box (OBB) generation through principal component analysis (PCA) for precise tooth orientation estimation. Evaluating our approach on the publicly available DNS dataset, comprising 543 panoramic X-ray images, we achieve the highest Intersection-over-Union (IoU) score of 82.43% and Dice Similarity Coefficient (DSC) score of 90.37% among compared models in teeth instance segmentation. In OBB analysis, we obtain the Rotated IoU (RIoU) score of 82.82%. We also conduct detailed analyses of individual tooth labels and categorical performance, shedding light on strengths and weaknesses. The proposed model's accuracy and versatility offer promising prospects for improving dental diagnoses, treatment planning, and personalized healthcare in the oral domain. Our generated OBB coordinates and codes are available at https://github.com/mrinal054/Instance_teeth_segmentation.
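
The PCA step for oriented bounding boxes is standard enough to sketch directly: the eigenvectors of the mask pixels' covariance give the tooth's principal axes, and the min/max extents in that rotated frame give the oriented box.

```python
import numpy as np

def oriented_bbox(mask):
    """Oriented bounding box of a binary instance mask via PCA.
    Returns 4 corner points (x, y) aligned with the principal axes."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    center = pts.mean(axis=0)
    # Principal axes = eigenvectors of the 2x2 covariance of pixel coords.
    _, vecs = np.linalg.eigh(np.cov((pts - center).T))
    local = (pts - center) @ vecs                  # rotate into the PCA frame
    lo, hi = local.min(axis=0), local.max(axis=0)
    corners = np.array([[lo[0], lo[1]], [hi[0], lo[1]],
                        [hi[0], hi[1]], [lo[0], hi[1]]])
    return corners @ vecs.T + center               # back to image coordinates

mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 10:50] = True                          # a stand-in "tooth" mask
print(oriented_bbox(mask))
```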

Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise

  • paper_url: http://arxiv.org/abs/2310.17167
  • repo_url: None
  • paper_authors: Zhenkai Zhang, Krista A. Ehinger, Tom Drummond
  • for: The paper's two main contributions improve the speed and quality of images generated through inverse diffusion.
  • methods: First, reparameterizes the diffusion process in terms of the angle on a quarter-circular arc between image and noise; second, uses the network to directly estimate both the image and the noise.
  • results: The model generates images faster, converging on high-quality images more quickly, and the generated images score better on Frechet Inception Distance (FID), spatial FID (sFID), precision, and recall.
    Abstract This paper introduces two key contributions aimed at improving the speed and quality of images generated through inverse diffusion processes. The first contribution involves reparameterizing the diffusion process in terms of the angle on a quarter-circular arc between the image and noise, specifically setting the conventional $\sqrt{\bar{\alpha}} = \cos(\eta)$. This reparameterization eliminates two singularities and allows for the expression of diffusion evolution as a well-behaved ordinary differential equation (ODE). In turn, this allows higher order ODE solvers such as Runge-Kutta methods to be used effectively. The second contribution is to directly estimate both the image ($\mathbf{x}_0$) and noise ($\mathbf{\epsilon}$) using our network, which enables more stable calculations of the update step in the inverse diffusion steps, as accurate estimation of both the image and noise are crucial at different stages of the process. Together with these changes, our model achieves faster generation, with the ability to converge on high-quality images more quickly, and higher quality of the generated images, as measured by metrics such as Frechet Inception Distance (FID), spatial Frechet Inception Distance (sFID), precision, and recall.
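
Under the reparameterization $\sqrt{\bar{\alpha}} = \cos(\eta)$, the noised image is $x_\eta = \cos(\eta)\,x_0 + \sin(\eta)\,\epsilon$ and sampling becomes a well-behaved ODE in $\eta$, so higher-order solvers apply. Below is a sketch of a second-order (Heun) sampler; the dual-output signature `net(x, eta) -> (x0_hat, eps_hat)` is an assumption about the paper's network.

```python
import numpy as np

def sample_heun(net, x, n_steps=32):
    """Integrate dx/deta from noise (eta = pi/2) to image (eta = 0).
    From x = cos(eta) x0 + sin(eta) eps:
        dx/deta = -sin(eta) x0 + cos(eta) eps."""
    def f(x, eta):
        x0_hat, eps_hat = net(x, eta)
        return -np.sin(eta) * x0_hat + np.cos(eta) * eps_hat

    etas = np.linspace(np.pi / 2, 0.0, n_steps + 1)
    for eta, eta_next in zip(etas[:-1], etas[1:]):
        h = eta_next - eta
        k1 = f(x, eta)
        k2 = f(x + h * k1, eta_next)      # Heun: average two slope estimates
        x = x + h * (k1 + k2) / 2
    return x
```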

Content-based Controls For Music Large Language Modeling

  • paper_url: http://arxiv.org/abs/2310.17162
  • repo_url: None
  • paper_authors: Liwei Lin, Gus Xia, Junyan Jiang, Yixiao Zhang
  • for: Provide direct, content-based controls over innate music languages (pitch, chords, drum track) for large language models of music audio.
  • methods: Proposes Coco-Mulla, which uses a parameter-efficient fine-tuning (PEFT) method tailored to Transformer-based audio models; it tunes fewer than 4% of the original parameters and trains on fewer than 300 songs.
  • results: Achieves high-quality music generation with low-resource semi-supervised learning and effective content-based controls, illustrated via chords and rhythms; combined with text descriptions, the system supports flexible music variation generation and style transfer.
    Abstract Recent years have witnessed a rapid growth of large-scale language models in the domain of music audio. Such models enable end-to-end generation of higher-quality music, and some allow conditioned generation using text descriptions. However, the control power of text controls on music is intrinsically limited, as they can only describe music indirectly through meta-data (such as singers and instruments) or high-level representations (such as genre and emotion). We aim to further equip the models with direct and content-based controls on innate music languages such as pitch, chords and drum track. To this end, we contribute Coco-Mulla, a content-based control method for music large language modeling. It uses a parameter-efficient fine-tuning (PEFT) method tailored for Transformer-based audio models. Experiments show that our approach achieved high-quality music generation with low-resource semi-supervised learning, tuning with less than 4% parameters compared to the original model and training on a small dataset with fewer than 300 songs. Moreover, our approach enables effective content-based controls, and we illustrate the control power via chords and rhythms, two of the most salient features of music audio. Furthermore, we show that by combining content-based controls and text descriptions, our system achieves flexible music variation generation and style transfer. Our source codes and demos are available online.
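
The abstract specifies only that the PEFT method tunes under 4% of parameters. A LoRA-style adapter is one common way to land in that regime and is shown purely as an illustration; Coco-Mulla's actual PEFT scheme may differ.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adapter around a frozen linear layer."""

    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze the backbone
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")  # about 3%, under 4%
```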

CosmosDSR – a methodology for automated detection and tracking of orbital debris using the Unscented Kalman Filter

  • paper_url: http://arxiv.org/abs/2310.17158
  • repo_url: None
  • paper_authors: Daniel S. Roll, Zeyneb Kurt, Wai Lok Woo
  • for: Addressing the Kessler syndrome by detecting and tracking satellites in sequential images.
  • methods: Combining YOLOv3 with an Unscented Kalman Filter (UKF) for tracking satellites, and comparing with a linear Kalman filter (LKF).
  • results: Precise detection and classification of satellite categories with few errors, and accurate tracking of satellites with a mean squared error (MSE) and root mean squared error (RMSE) of 2.83/1.66 for UKF and 2.84/1.66 for LKF.
    Abstract The Kessler syndrome refers to the escalating space debris from frequent space activities, threatening future space exploration. Addressing this issue is vital. Several AI models, including Convolutional Neural Networks, Kernel Principal Component Analysis, and Model-Agnostic Meta-Learning, have been assessed with various data types. Earlier studies highlighted the combination of the YOLO object detector and a linear Kalman filter (LKF) for object detection and tracking. Advancing this, the current paper introduces a novel methodology for the Comprehensive Orbital Surveillance and Monitoring Of Space by Detecting Satellite Residuals (CosmosDSR) by combining YOLOv3 with an Unscented Kalman Filter (UKF) for tracking satellites in sequential images. Using the Spacecraft Recognition Leveraging Knowledge of Space Environment (SPARK) dataset for training and testing, YOLOv3 precisely detected and classified all satellite categories (Mean Average Precision=97.18%, F1=0.95) with few errors (TP=4163, FP=209, FN=237). Both CosmosDSR and an LKF implemented for comparison tracked satellites accurately, with mean squared error (MSE) and root mean squared error (RMSE) of MSE=2.83/RMSE=1.66 for the UKF and MSE=2.84/RMSE=1.66 for the LKF. The current study is limited to images generated in a space simulation environment, but the CosmosDSR methodology shows great potential in detecting and tracking satellites, paving the way for solutions to the Kessler syndrome.
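
The tracking step is easiest to sketch with the linear Kalman filter used as the paper's comparison baseline; CosmosDSR replaces this linear predict/update pair with the Unscented Kalman Filter's sigma-point propagation. A constant-velocity sketch over YOLO detection centres:

```python
import numpy as np

dt = 1.0
F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
              [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)   # state transition
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)   # observe (x, y)
Q, R = np.eye(4) * 1e-2, np.eye(2)                        # noise covariances

def kf_step(x, P, z):
    x, P = F @ x, F @ P @ F.T + Q                  # predict
    S = H @ P @ H.T + R                            # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    x = x + K @ (z - H @ x)                        # update with detection z
    P = (np.eye(4) - K @ H) @ P
    return x, P

x, P = np.zeros(4), np.eye(4)
for z in [np.array([1.0, 2.0]), np.array([2.1, 3.9])]:    # per-frame detections
    x, P = kf_step(x, P, z)
print(x[:2])   # filtered satellite position
```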

Technical Note: Feasibility of translating 3.0T-trained Deep-Learning Segmentation Models Out-of-the-Box on Low-Field MRI 0.55T Knee-MRI of Healthy Controls

  • paper_url: http://arxiv.org/abs/2310.17152
  • repo_url: None
  • paper_authors: Rupsa Bhattacharjee, Zehra Akkaya, Johanna Luitjens, Pan Su, Yang Yang, Valentina Pedoia, Sharmila Majumdar
  • for: Evaluate the feasibility of applying deep learning (DL) enabled algorithms to quantify bilateral knee biomarkers in healthy controls scanned at 0.55T, compared with 3.0T.
  • methods: Assesses standard in-practice bone and cartilage segmentation algorithms at 0.55T, both qualitatively and quantitatively, comparing segmentation performance and compartment-wise cartilage thickness values between 0.55T and 3.0T.
  • results: Initial results show usable-to-good technical feasibility of translating existing deep-learning segmentation models trained at 3.0T out of the box to 0.55T knee MRI in a multi-vendor acquisition environment; cartilage compartment segmentation performs nearly on par with 3.0T, suggesting 0.55T low-field MRI can be used to evaluate knee cartilage thickness and bone segmentations with established DL models.
    Abstract In the current study, our purpose is to evaluate the feasibility of applying deep learning (DL) enabled algorithms to quantify bilateral knee biomarkers in healthy controls scanned at 0.55T and compared with 3.0T. The current study assesses the performance of standard in-practice bone and cartilage segmentation algorithms at 0.55T, both qualitatively and quantitatively, comparing segmentation performance, areas of improvement, and compartment-wise cartilage thickness values between 0.55T and 3.0T. Initial results demonstrate usable-to-good technical feasibility of translating existing quantitative deep-learning-based image segmentation techniques, trained on 3.0T, out of the box to 0.55T knee MRI in a multi-vendor acquisition environment. Especially in terms of segmenting cartilage compartments, the models perform almost equivalently to 3.0T in terms of Likert ranking. As demonstrated, the sustainable and easy-to-install 0.55T low-field MRI can thus be utilized for evaluating knee cartilage thickness and bone segmentations, aided by established DL algorithms trained at higher field strengths, out of the box. This could be utilized at far-flung point-of-care locations that lack radiologists available to manually segment low-field images, at least until a decent pool of low-field data is collated. With further fine-tuning on manually labeled low-field data, or by synthesizing higher-SNR images from low-field images, OA biomarker quantification performance can potentially be improved further.

Explainable Spatio-Temporal Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2310.17149
  • repo_url: https://github.com/hkuds/stexplainer
  • paper_authors: Jiabin Tang, Lianghao Xia, Chao Huang
  • for: Propose an Explainable Spatio-Temporal Graph Neural Network (STExplainer) framework that equips urban spatio-temporal prediction with inherent explainability.
  • methods: Integrates a unified spatio-temporal graph attention network with a positional information fusion layer as the STG encoder and decoder, and proposes a structure distillation approach based on the Graph Information Bottleneck (GIB) principle with an explainable objective.
  • results: Outperforms state-of-the-art baselines on traffic and crime prediction in both predictive accuracy and explainability metrics (sparsity and fidelity), and shows superior representation ability in alleviating data missing and sparsity issues.
    Abstract Spatio-temporal graph neural networks (STGNNs) have gained popularity as a powerful tool for effectively modeling spatio-temporal dependencies in diverse real-world urban applications, including intelligent transportation and public safety. However, the black-box nature of STGNNs limits their interpretability, hindering their application in scenarios related to urban resource allocation and policy formulation. To bridge this gap, we propose an Explainable Spatio-Temporal Graph Neural Networks (STExplainer) framework that enhances STGNNs with inherent explainability, enabling them to provide accurate predictions and faithful explanations simultaneously. Our framework integrates a unified spatio-temporal graph attention network with a positional information fusion layer as the STG encoder and decoder, respectively. Furthermore, we propose a structure distillation approach based on the Graph Information Bottleneck (GIB) principle with an explainable objective, which is instantiated by the STG encoder and decoder. Through extensive experiments, we demonstrate that our STExplainer outperforms state-of-the-art baselines in terms of predictive accuracy and explainability metrics (i.e., sparsity and fidelity) on traffic and crime prediction tasks. Furthermore, our model exhibits superior representation ability in alleviating data missing and sparsity issues. The implementation code is available at: https://github.com/HKUDS/STExplainer.
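
The GIB-based explainable objective can be hedged into a generic sketch: learn per-edge retention probabilities and penalize the amount of structure kept, so predictions must flow through a sparse, explanatory subgraph. This is a common GIB instantiation, not STExplainer's exact layers.

```python
import torch
import torch.nn as nn

class EdgeMaskGIB(nn.Module):
    """Soft subgraph selection with a sparsity penalty."""

    def __init__(self, num_edges, beta=0.1):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_edges))
        self.beta = beta

    def forward(self, edge_weights):
        mask = torch.sigmoid(self.logits)          # per-edge keep probability
        kept = edge_weights * mask
        penalty = self.beta * mask.mean()          # proxy for compressing G
        return kept, penalty

gib = EdgeMaskGIB(num_edges=10)
kept, penalty = gib(torch.ones(10))
# total loss = prediction_loss(model_on_masked_graph(kept)) + penalty
```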

Counterfactual-Augmented Importance Sampling for Semi-Offline Policy Evaluation

  • paper_url: http://arxiv.org/abs/2310.17146
  • repo_url: https://github.com/mld3/counterfactualannot-semiope
  • paper_authors: Shengpu Tang, Jenna Wiens
  • for: Support applying reinforcement learning (RL) in high-stakes domains, where quantitative and qualitative evaluation from observational data helps practitioners understand the generalization performance of new policies.
  • methods: Proposes a semi-offline evaluation framework, an intermediate step between offline and online evaluation, in which human users annotate unobserved counterfactual trajectories; designs a family of importance sampling (IS) estimators with a novel weighting scheme that incorporates the annotations without introducing additional bias.
  • results: Experiments with bandits and a healthcare-inspired simulator show the framework outperforms purely offline IS estimators, reduces bias and variance relative to standard IS, and is robust to imperfect annotations.
    Abstract In applying reinforcement learning (RL) to high-stakes domains, quantitative and qualitative evaluation using observational data can help practitioners understand the generalization performance of new policies. However, this type of off-policy evaluation (OPE) is inherently limited since offline data may not reflect the distribution shifts resulting from the application of new policies. On the other hand, online evaluation by collecting rollouts according to the new policy is often infeasible, as deploying new policies in these domains can be unsafe. In this work, we propose a semi-offline evaluation framework as an intermediate step between offline and online evaluation, where human users provide annotations of unobserved counterfactual trajectories. While tempting to simply augment existing data with such annotations, we show that this naive approach can lead to biased results. Instead, we design a new family of OPE estimators based on importance sampling (IS) and a novel weighting scheme that incorporate counterfactual annotations without introducing additional bias. We analyze the theoretical properties of our approach, showing its potential to reduce both bias and variance compared to standard IS estimators. Our analyses reveal important practical considerations for handling biased, noisy, or missing annotations. In a series of proof-of-concept experiments involving bandits and a healthcare-inspired simulator, we demonstrate that our approach outperforms purely offline IS estimators and is robust to imperfect annotations. Our framework, combined with principled human-centered design of annotation solicitation, can enable the application of RL in high-stakes domains.
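
The framework builds on importance-sampling OPE; the standard per-trajectory estimator it extends looks like this. The paper's annotation-aware weighting scheme, which folds in human counterfactual annotations without adding bias, is not reproduced here.

```python
import numpy as np

def per_trajectory_is(trajectories, behavior_probs, target_probs):
    """Weight each observed return by the target/behavior likelihood ratio."""
    values = []
    for traj, pb, pe in zip(trajectories, behavior_probs, target_probs):
        ret = sum(r for (_, _, r) in traj)                 # trajectory return
        rho = np.prod(np.asarray(pe) / np.asarray(pb))     # cumulative ratio
        values.append(rho * ret)
    return float(np.mean(values))

traj = [(0, 1, 1.0), (1, 0, 0.0)]                 # [(state, action, reward)]
print(per_trajectory_is([traj], [[0.5, 0.5]], [[0.8, 0.6]]))   # 1.92
```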

Symbolic Planning and Code Generation for Grounded Dialogue

  • paper_url: http://arxiv.org/abs/2310.17140
  • repo_url: https://github.com/justinchiu/onecommon-gpt
  • paper_authors: Justin T. Chiu, Wenting Zhao, Derek Chen, Saujas Vaduguru, Alexander M. Rush, Daniel Fried
  • for: Propose a modular and interpretable grounded dialogue system that addresses existing systems' difficulty in steering toward task objectives and handling novel grounding.
  • methods: The system consists of a reader and a planner: the reader uses a large language model to convert partner utterances into executable code that calls grounding functions, and a symbolic planner determines the next best response from the tracked dialogue state.
  • results: Substantially outperforms the previous state of the art on the OneCommon dialogue task, improving task success from 56% to 69% in the most challenging setting.
    Abstract Large language models (LLMs) excel at processing and generating both text and code. However, LLMs have had limited applicability in grounded task-oriented dialogue as they are difficult to steer toward task objectives and fail to handle novel grounding. We present a modular and interpretable grounded dialogue system that addresses these shortcomings by composing LLMs with a symbolic planner and grounded code execution. Our system consists of a reader and planner: the reader leverages an LLM to convert partner utterances into executable code, calling functions that perform grounding. The translated code's output is stored to track dialogue state, while a symbolic planner determines the next appropriate response. We evaluate our system's performance on the demanding OneCommon dialogue task, involving collaborative reference resolution on abstract images of scattered dots. Our system substantially outperforms the previous state-of-the-art, including improving task success in human evaluations from 56% to 69% in the most challenging setting.
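
The reader/planner loop can be sketched end to end with stand-in components; `fake_llm`, `fake_planner`, and `find_dots` below are hypothetical stubs for the paper's constrained code generation, symbolic search, and grounding functions.

```python
def dialogue_turn(utterance, state, llm, planner, grounding_fns):
    """One turn: reader translates the utterance to code, planner replies."""
    code = llm(f"Translate to grounding code: {utterance!r}")
    env = dict(grounding_fns, state=state)
    exec(code, env)                  # e.g. sets state['candidates'] = ...
    return planner(env["state"])     # symbolic planner picks the response

def fake_llm(prompt):
    return "state['candidates'] = find_dots(color='black', size='large')"

def fake_planner(state):
    return f"Do you see {len(state['candidates'])} large black dots?"

fns = {"find_dots": lambda color, size: [(0.1, 0.2), (0.5, 0.4)]}
print(dialogue_turn("I see two big dark dots", {"candidates": []},
                    fake_llm, fake_planner, fns))
```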

Core Challenge 2023: Solver and Graph Descriptions

  • paper_url: http://arxiv.org/abs/2310.17136
  • repo_url: None
  • paper_authors: Takehide Soh, Tomoya Tanjo, Yoshio Okamoto, Takehiro Ito
  • for: Collect the descriptions of all solvers and ISR instances submitted to CoRe Challenge 2023.
  • methods: Compiles the solver and ISR-instance descriptions exactly as submitted by the participants.
  • results: Provides a single reference for the submitted solvers and instances, supporting follow-up analysis and research.
    Abstract This paper collects all descriptions of solvers and ISR instances submitted to CoRe Challenge 2023.

Incorporating Probing Signals into Multimodal Machine Translation via Visual Question-Answering Pairs

  • paper_url: http://arxiv.org/abs/2310.17133
  • repo_url: https://github.com/libeineu/mmt-vqa
  • paper_authors: Yuxin Zuo, Bei Li, Chuanhao Lv, Tong Zheng, Tong Xiao, Jingbo Zhu
  • for: Studies the impact of complete text inputs on multimodal machine translation (MMT) systems and proposes a new approach to foster cross-modal interaction.
  • methods: Generates parallel Visual Question-Answering (VQA)-style pairs from the source text and uses Large Language Models (LLMs) to explicitly model probing signals in MMT, yielding the Multi30K-VQA dataset and an MMT-VQA multitask learning framework.
  • results: Experiments on two widely used benchmarks show the new approach improves MMT performance and helps the system make better use of image information.
    Abstract This paper presents an in-depth study of multimodal machine translation (MMT), examining the prevailing understanding that MMT systems exhibit decreased sensitivity to visual information when text inputs are complete. Instead, we attribute this phenomenon to insufficient cross-modal interaction, rather than image information redundancy. A novel approach is proposed to generate parallel Visual Question-Answering (VQA) style pairs from the source text, fostering more robust cross-modal interaction. Using Large Language Models (LLMs), we explicitly model the probing signal in MMT to convert it into VQA-style data to create the Multi30K-VQA dataset. An MMT-VQA multitask learning framework is introduced to incorporate explicit probing signals from the dataset into the MMT training process. Experimental results on two widely-used benchmarks demonstrate the effectiveness of this novel approach. Our code and data would be available at: \url{https://github.com/libeineu/MMT-VQA}.

Unleashing the potential of GNNs via Bi-directional Knowledge Transfer

  • paper_url: http://arxiv.org/abs/2310.17132
  • repo_url: None
  • paper_authors: Shuai Zheng, Zhizhe Liu, Zhenfeng Zhu, Xingxing Zhang, Jianxin Li, Yao Zhao
  • for: Improve the performance of Graph Neural Networks (GNNs).
  • methods: Builds on the feature transformation operation in the message-passing framework and proposes Bi-directional Knowledge Transfer (BiKT), a plug-and-play approach that unleashes the potential of feature transformations without modifying the original architecture.
  • results: Extensive experiments on 7 datasets with 5 typical GNNs show BiKT brings up to a 0.5%-4% performance gain over the original GNN, while the derived model is strong enough to be applied independently to specific downstream tasks.
    Abstract Based on the message-passing paradigm, there has been an amount of research proposing diverse and impressive feature propagation mechanisms to improve the performance of GNNs. However, less focus has been put on feature transformation, another major operation of the message-passing framework. In this paper, we first empirically investigate the performance of the feature transformation operation in several typical GNNs. Unexpectedly, we notice that GNNs do not completely free up the power of the inherent feature transformation operation. By this observation, we propose the Bi-directional Knowledge Transfer (BiKT), a plug-and-play approach to unleash the potential of the feature transformation operations without modifying the original architecture. Taking the feature transformation operation as a derived representation learning model that shares parameters with the original GNN, the direct prediction by this model provides a topological-agnostic knowledge feedback that can further instruct the learning of GNN and the feature transformations therein. On this basis, BiKT not only allows us to acquire knowledge from both the GNN and its derived model but promotes each other by injecting the knowledge into the other. In addition, a theoretical analysis is further provided to demonstrate that BiKT improves the generalization bound of the GNNs from the perspective of domain adaption. An extensive group of experiments on up to 7 datasets with 5 typical GNNs demonstrates that BiKT brings up to 0.5% - 4% performance gain over the original GNN, which means a boosted GNN is obtained. Meanwhile, the derived model also shows a powerful performance to compete with or even surpass the original GNN, enabling us to flexibly apply it independently to some other specific downstream tasks.

Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models

  • paper_url: http://arxiv.org/abs/2310.17120
  • repo_url: None
  • paper_authors: Reshmi Ghosh, Harjeet Singh Kajal, Sharanya Kamath, Dhuri Shrivastava, Samyadeep Basu, Hansi Zeng, Soundararajan Srinivasan
  • for: Analyzes the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts, and evaluates the effectiveness of different loss functions for improving segmentation results on unstructured conversational datasets.
  • methods: Compares pre-training on a large corpus of structured text (e.g., Wiki-727K) against training from scratch with a small-sized dataset of the target unstructured domain, and experiments with multiple loss functions (Cross-Entropy, re-weighted Cross-Entropy, and Focal Loss) to mitigate the effects of class imbalance.
  • results: Training from scratch with a small-sized dataset of the target unstructured domain improves segmentation results by a significant margin, and Focal Loss is a robust alternative to Cross-Entropy and re-weighted Cross-Entropy when segmenting unstructured and semi-structured chats.
    Abstract Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current works on topic segmentation often focus on segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) Current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not help in transferability to unstructured conversational data. (b) Training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin. We stress-test our proposed Topic Segmentation approach by experimenting with multiple loss functions, in order to mitigate effects of imbalance in unstructured conversational datasets. Our empirical evaluation indicates that Focal Loss function is a robust alternative to Cross-Entropy and re-weighted Cross-Entropy loss function when segmenting unstructured and semi-structured chats.
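
Focal Loss down-weights easy (overwhelmingly non-boundary) sentences, which is exactly what imbalanced conversational segmentation needs. A standard binary formulation with the common defaults from Lin et al. (2017):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss for topic-boundary classification."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                          # probability of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.randn(16)                          # one score per sentence
targets = (torch.rand(16) < 0.1).float()          # sparse boundary labels
print(focal_loss(logits, targets))
```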

Detecting stealthy cyberattacks on adaptive cruise control vehicles: A machine learning approach

  • paper_url: http://arxiv.org/abs/2310.17091
  • repo_url: None
  • paper_authors: Tianyi Li, Mingfeng Shang, Shian Wang, Raphael Stern
  • for: Addresses the detection of cyberattacks on vehicles equipped with advanced driver-assistance systems (ADAS) and automated driving features.
  • methods: Proposes a traffic model framework for three types of potential cyberattacks, and a novel generative adversarial network (GAN)-based anomaly detection model that identifies such attacks in real time from vehicle trajectory data.
  • results: Provides numerical evidence of the efficacy of the proposed machine learning approach in detecting cyberattacks on ACC-equipped vehicles, with higher accuracy than recently proposed neural network models.
    Abstract With the advent of vehicles equipped with advanced driver-assistance systems, such as adaptive cruise control (ACC) and other automated driving features, the potential for cyberattacks on these automated vehicles (AVs) has emerged. While overt attacks that force vehicles to collide may be easily identified, more insidious attacks, which only slightly alter driving behavior, can result in network-wide increases in congestion, fuel consumption, and even crash risk without being easily detected. To address the detection of such attacks, we first present a traffic model framework for three types of potential cyberattacks: malicious manipulation of vehicle control commands, false data injection attacks on sensor measurements, and denial-of-service (DoS) attacks. We then investigate the impacts of these attacks at both the individual vehicle (micro) and traffic flow (macro) levels. A novel generative adversarial network (GAN)-based anomaly detection model is proposed for real-time identification of such attacks using vehicle trajectory data. We provide numerical evidence to demonstrate the efficacy of our machine learning approach in detecting cyberattacks on ACC-equipped vehicles. The proposed method is compared against some recently proposed neural network models and observed to have higher accuracy in identifying anomalous driving behaviors of ACC vehicles.
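
The GAN-based detector reduces to a minimal sketch: a discriminator trained on benign ACC trajectory windows assigns a realness score, and low scores flag candidate attacks. The architecture, window size, and scoring rule below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

disc = nn.Sequential(                      # trained on benign windows only
    nn.Linear(50 * 2, 128), nn.ReLU(),     # 50 timesteps x (gap, speed)
    nn.Linear(128, 1), nn.Sigmoid(),
)

def anomaly_score(window):
    """window: (50, 2) tensor of spacing and speed; higher = more anomalous."""
    with torch.no_grad():
        return 1.0 - disc(window.reshape(1, -1)).item()

print(anomaly_score(torch.randn(50, 2)))   # demo output from untrained disc
```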

Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models

  • paper_url: http://arxiv.org/abs/2310.17086
  • repo_url: None
  • paper_authors: Deqing Fu, Tian-Qi Chen, Robin Jia, Vatsal Sharan
  • for: Investigates how Transformers perform in-context learning (ICL), that is, learning from demonstrations without parameter updates.
  • methods: Shows that Transformers learn to implement higher-order optimization methods internally, focusing on in-context linear regression.
  • results: Empirically, Transformers implement an algorithm very similar to Iterative Newton's Method, a higher-order optimization method, rather than Gradient Descent: predictions from successive layers closely match successive Newton iterations, with each middle layer computing roughly 3 iterations, whereas exponentially more Gradient Descent steps are needed to match one additional layer. Transformers can also learn in context on ill-conditioned data, where Gradient Descent struggles but Iterative Newton succeeds. Theoretical results support these findings: Transformers can implement $k$ iterations of Newton's Method with $\mathcal{O}(k)$ layers.
    Abstract Transformers are remarkably good at in-context learning (ICL) -- learning from demonstrations without parameter updates -- but how they perform ICL remains a mystery. Recent work suggests that Transformers may learn in-context by internally running Gradient Descent, a first-order optimization method. In this paper, we instead demonstrate that Transformers learn to implement higher-order optimization methods to perform ICL. Focusing on in-context linear regression, we show that Transformers learn to implement an algorithm very similar to Iterative Newton's Method, a higher-order optimization method, rather than Gradient Descent. Empirically, we show that predictions from successive Transformer layers closely match different iterations of Newton's Method linearly, with each middle layer roughly computing 3 iterations. In contrast, exponentially more Gradient Descent steps are needed to match an additional Transformer layer; this suggests that Transformers have a comparable rate of convergence to high-order methods such as Iterative Newton, which are exponentially faster than Gradient Descent. We also show that Transformers can learn in-context on ill-conditioned data, a setting where Gradient Descent struggles but Iterative Newton succeeds. Finally, we show theoretical results which support our empirical findings and have a close correspondence with them: we prove that Transformers can implement $k$ iterations of Newton's method with $\mathcal{O}(k)$ layers.
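
Iterative Newton's Method for linear regression has a standard instantiation: approximate $(X^\top X)^{-1}$ with the quadratically convergent Newton-Schulz iteration. The sketch below shows the iterates the paper argues successive Transformer layers track.

```python
import numpy as np

def newton_linear_regression(X, y, k=20):
    """w = M_k X^T y with M_{j+1} = M_j (2I - A M_j), A = X^T X.
    The init M_0 = A^T / (||A||_1 ||A||_inf) guarantees convergence."""
    A = X.T @ X
    M = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(A.shape[0])
    for _ in range(k):
        M = M @ (2 * I - A @ M)
    return M @ X.T @ y

rng = np.random.default_rng(0)
X, w = rng.normal(size=(64, 8)), rng.normal(size=8)
y = X @ w
print(np.allclose(newton_linear_regression(X, y), w, atol=1e-6))   # True
```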

Isometric Motion Manifold Primitives

  • paper_url: http://arxiv.org/abs/2310.17072
  • repo_url: https://github.com/gabe-yhlee/immp-public
  • paper_authors: Yonghyeon Lee
  • for: Learn, for a given task, a continuous manifold of trajectories each of which can successfully complete the task.
  • methods: Parametrizes the manifold with a decoder function and a probability density in the latent coordinate space; proposes Isometric Motion Manifold Primitives (IMMP), whose latent space preserves the geometry of the manifold via a Riemannian metric on the motion (parametric curve) space, the CurveGeom Riemannian metric.
  • results: Experiments with planar obstacle-avoiding motions and pushing manipulation tasks show that IMMP significantly outperforms existing MMP methods.
    Abstract The Motion Manifold Primitive (MMP) produces, for a given task, a continuous manifold of trajectories each of which can successfully complete the task. It consists of the decoder function that parametrizes the manifold and the probability density in the latent coordinate space. In this paper, we first show that the MMP performance can significantly degrade due to the geometric distortion in the latent space -- by distortion, we mean that similar motions are not located nearby in the latent space. We then propose {\it Isometric Motion Manifold Primitives (IMMP)} whose latent coordinate space preserves the geometry of the manifold. For this purpose, we formulate and use a Riemannian metric for the motion space (i.e., parametric curve space), which we call a {\it CurveGeom Riemannian metric}. Experiments with planar obstacle-avoiding motions and pushing manipulation tasks show that IMMP significantly outperforms existing MMP methods. Code is available at https://github.com/Gabe-YHLee/IMMP-public.