cs.AI - 2023-11-15

HAL 9000: Skynet’s Risk Manager

  • paper_url: http://arxiv.org/abs/2311.09449
  • repo_url: None
  • paper_authors: Tadeu Freitas, Mário Neto, Inês Dutra, João Soares, Manuel Correia, Rolando Martins
  • for: This paper proposes an Intrusion Tolerant System (ITS) architecture built on state-of-the-art techniques to increase intrusion tolerance and adaptability to new adversaries.
  • methods: The paper uses machine learning (ML) algorithms to let the ITS learn from previous attacks and known vulnerabilities, and proposes an improved Risk Manager design that automatically assesses operating system (OS) risks and recommends safer configurations.
  • results: Experiments show that the Skynet and HAL 9000 designs lower the chance of successful intrusions, with HAL choosing configurations that are 15% safer than those of the state-of-the-art risk manager.
    Abstract Intrusion Tolerant Systems (ITSs) are a necessary component for cyber-services/infrastructures. Additionally, as cyberattacks follow a multi-domain attack surface, a similar defensive approach should be applied, namely, the use of an evolving multi-disciplinary solution that combines ITS, cybersecurity and Artificial Intelligence (AI). With the increased popularity of AI solutions, due to Big Data use-case scenarios and decision support and automation scenarios, new opportunities to apply Machine Learning (ML) algorithms have emerged, namely ITS empowerment. Using ML algorithms, an ITS can augment its intrusion tolerance capability, by learning from previous attacks and from known vulnerabilities. As such, this work's contribution is twofold: (1) an ITS architecture (Skynet) that builds on the state of the art and incorporates new components to increase its intrusion tolerance capability and its adaptability to new adversaries; (2) an improved Risk Manager design that leverages AI to improve ITSs by automatically assessing OS risks to intrusions and advising on safer configurations. One of the reasons that intrusions are successful is bad configurations or slow adaptability to new threats, which can be caused by the dependency that systems have on human intervention. One of the characteristics of the Skynet and HAL 9000 design is the removal of human intervention: being fully automated lowers the chance of successful intrusions caused by human error. Our experiments using Skynet show that HAL is able to choose 15% safer configurations than the state-of-the-art risk manager.

How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities

  • paper_url: http://arxiv.org/abs/2311.09447
  • repo_url: None
  • paper_authors: Lingbo Mo, Boshi Wang, Muhao Chen, Huan Sun
  • for: This work assesses the trustworthiness of open-source large language models (LLMs) across eight aspects: toxicity, stereotypes, ethics, hallucination, fairness, sycophancy, privacy, and robustness against adversarial demonstrations.
  • methods: We propose an enhanced Chain of Utterances (CoU) based prompting strategy that incorporates carefully crafted malicious demonstrations for trustworthiness attack, and run extensive experiments on recent, representative open-source LLMs including Vicuna, MPT, Falcon, Mistral, and Llama 2.
  • results: The attack strategy is effective across diverse aspects, and models with better performance on general NLP tasks are not necessarily more trustworthy. Instruction-tuned models tend to be more susceptible to attack, although fine-tuning LLMs for safety alignment mitigates adversarial trustworthiness attacks.
    Abstract The rapid progress in open-source Large Language Models (LLMs) is significantly driving AI development forward. However, there is still a limited understanding of their trustworthiness. Deploying these models at scale without sufficient trustworthiness can pose significant risks, highlighting the need to uncover these issues promptly. In this work, we conduct an assessment of open-source LLMs on trustworthiness, scrutinizing them across eight different aspects including toxicity, stereotypes, ethics, hallucination, fairness, sycophancy, privacy, and robustness against adversarial demonstrations. We propose an enhanced Chain of Utterances-based (CoU) prompting strategy by incorporating meticulously crafted malicious demonstrations for trustworthiness attack. Our extensive experiments encompass recent and representative series of open-source LLMs, including Vicuna, MPT, Falcon, Mistral, and Llama 2. The empirical outcomes underscore the efficacy of our attack strategy across diverse aspects. More interestingly, our result analysis reveals that models with superior performance in general NLP tasks do not always have greater trustworthiness; in fact, larger models can be more vulnerable to attacks. Additionally, models that have undergone instruction tuning, focusing on instruction following, tend to be more susceptible, although fine-tuning LLMs for safety alignment proves effective in mitigating adversarial trustworthiness attacks.

Exploring the Privacy-Energy Consumption Tradeoff for Split Federated Learning

  • paper_url: http://arxiv.org/abs/2311.09441
  • repo_url: None
  • paper_authors: Joohyung Lee, Mohamed Seif, Jungchan Cho, H. Vincent Poor
  • for: This paper focuses on Split Federated Learning (SFL) and its impact on energy consumption and privacy.
  • methods: The paper analyzes how system parameters influence the selection of the cut layer in SFL and provides an illustrative example of cut layer selection that minimizes the risk of clients' raw data being reconstructed at the server while keeping energy consumption within a required budget (a toy selection sketch follows this entry).
  • results: The paper gives a comprehensive overview of the SFL process, discusses the challenges of cut layer selection given the impact of various system parameters on energy consumption and privacy, and identifies open challenges and promising avenues for future research, particularly in the context of 6G technology.
    Abstract Split Federated Learning (SFL) has recently emerged as a promising distributed learning technology, leveraging the strengths of both federated learning and split learning. It emphasizes the advantages of rapid convergence while addressing privacy concerns. As a result, this innovation has received significant attention from both industry and academia. However, since the model is split at a specific layer, known as the cut layer, into client-side and server-side models in SFL, the choice of the cut layer can have a substantial impact on the energy consumption of clients and their privacy, as it influences the training burden and the output of the client-side models. Moreover, the design challenge of determining the cut layer is highly intricate, primarily due to the inherent heterogeneity in the computing and networking capabilities of clients. In this article, we provide a comprehensive overview of the SFL process and conduct a thorough analysis of energy consumption and privacy. This analysis takes into account the influence of various system parameters on the cut layer selection strategy. Additionally, we provide an illustrative example of cut layer selection, aiming to minimize the risk of clients' raw data being reconstructed at the server while sustaining energy consumption within the required energy budget, which involves trade-offs. Finally, we address open challenges in this field, including applications to 6G technology. These directions represent promising avenues for future research and development.
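
The following toy sketch illustrates the cut-layer trade-off described above; the per-layer energy costs and leakage scores are made-up placeholders, not quantities from the paper.

```python
# Toy sketch of cut layer selection (assumed quantities, not the paper's model):
# pick a feasible cut layer whose client-side energy fits the budget while
# minimizing a privacy-leakage proxy for the activations sent to the server.
def select_cut_layer(client_energy, leakage, energy_budget):
    """client_energy[i]: client energy if layers 0..i run on the client.
    leakage[i]: reconstruction-risk proxy of the activations after layer i.
    Returns the index of the chosen cut layer, or None if infeasible."""
    feasible = [i for i, e in enumerate(client_energy) if e <= energy_budget]
    if not feasible:
        return None
    # Among feasible cuts, choose the one with the lowest leakage proxy.
    return min(feasible, key=lambda i: leakage[i])

# Example with made-up numbers: deeper cuts cost more client energy but
# usually leak less about the raw input.
client_energy = [1.0, 2.2, 3.6, 5.1, 7.0]
leakage = [0.9, 0.7, 0.5, 0.35, 0.2]
print(select_cut_layer(client_energy, leakage, energy_budget=4.0))  # -> 2
```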

Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment

  • paper_url: http://arxiv.org/abs/2311.09433
  • repo_url: None
  • paper_authors: Haoran Wang, Kai Shu
  • for: This paper studies the safety of instruction-tuned Large Language Models (LLMs), specifically how easily their behavior can be steered across different safety-alignment tasks.
  • methods: The paper introduces a new attack framework, Backdoor Activation Attack, which injects trojan steering vectors into the activation layers of LLMs (a generic activation-steering sketch follows this entry).
  • results: Experiments show the attack is highly effective while adding little or no overhead; the paper also discusses potential countermeasures against such activation attacks.
    Abstract To ensure AI safety, instruction-tuned Large Language Models (LLMs) are specifically trained to ensure alignment, which refers to making models behave in accordance with human intentions. While these models have demonstrated commendable results on various safety benchmarks, the vulnerability of their safety alignment has not been extensively studied. This is particularly troubling given the potential harm that LLMs can inflict. Existing attack methods on LLMs often rely on poisoned training data or the injection of malicious prompts. These approaches compromise the stealthiness and generalizability of the attacks, making them susceptible to detection. Additionally, these models often demand substantial computational resources for implementation, making them less practical for real-world applications. In this work, we introduce a novel attack framework, called Backdoor Activation Attack, which injects trojan steering vectors into the activation layers of LLMs. These malicious steering vectors can be triggered at inference time to steer the models toward attacker-desired behaviors by manipulating their activations. In particular, the steering vectors are generated by taking the difference between benign and malicious activations. Then, the most effective steering vector is selected and added to the forward passes of the LLMs. Our experiment results on four primary alignment tasks show that our proposed method is highly effective and adds little or no overhead to attack efficiency. Additionally, we discuss potential countermeasures against such activation attacks. Our code and data are available at https://email-haoran-for-link. Warning: this paper contains content that can be offensive or upsetting.
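
Below is a hedged, generic sketch of activation steering, with a steering vector derived as the difference between mean activations on two prompt sets, which is the core mechanism the abstract describes. It is not the authors' code; the layer index, scale, and HuggingFace-style model interface are assumptions.

```python
import torch

@torch.no_grad()
def mean_activation(model, tokenizer, prompts, layer_idx):
    """Average last-token hidden state at a given layer over a set of prompts."""
    acts = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt").input_ids
        out = model(ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer_idx][0, -1])
    return torch.stack(acts).mean(dim=0)

def add_steering_hook(layer_module, steering_vector, scale=1.0):
    """Register a forward hook that shifts the layer's hidden states."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)

# Assumed usage with a HuggingFace-style causal LM (names are illustrative):
# v = mean_activation(model, tok, prompts_a, layer_idx=15) \
#     - mean_activation(model, tok, prompts_b, layer_idx=15)
# handle = add_steering_hook(model.model.layers[15], v, scale=2.0)
# ... model.generate(...) ...
# handle.remove()
```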

Beyond Detection: Unveiling Fairness Vulnerabilities in Abusive Language Models

  • paper_url: http://arxiv.org/abs/2311.09428
  • repo_url: None
  • paper_authors: Yueqing Liang, Lu Cheng, Ali Payani, Kai Shu
  • for: This work investigates how both fairness and detection performance of abusive language detection models can be undermined, in order to improve their robustness against adversarial fairness attacks.
  • methods: The paper proposes a simple yet effective framework, FABLE, which leverages backdoor attacks to gain targeted control over fairness and detection performance. FABLE explores three trigger designs (rare, artificial, and natural triggers) and novel sampling strategies.
  • results: Experiments on benchmark datasets show that FABLE successfully attacks both fairness and utility in abusive language detection.
    Abstract This work investigates the potential of undermining both fairness and detection performance in abusive language detection. In a dynamic and complex digital world, it is crucial to investigate the vulnerabilities of these detection models to adversarial fairness attacks to improve their fairness robustness. We propose a simple yet effective framework FABLE that leverages backdoor attacks as they allow targeted control over the fairness and detection performance. FABLE explores three types of trigger designs (i.e., rare, artificial, and natural triggers) and novel sampling strategies. Specifically, the adversary can inject triggers into samples in the minority group with the favored outcome (i.e., ``non-abusive'') and flip their labels to the unfavored outcome, i.e., ``abusive''. Experiments on benchmark datasets demonstrate the effectiveness of FABLE attacking fairness and utility in abusive language detection.

When Large Language Models contradict humans? Large Language Models’ Sycophantic Behaviour

  • paper_url: http://arxiv.org/abs/2311.09410
  • repo_url: None
  • paper_authors: Leonardo Ranaldi, Giulia Pucci
  • for: This paper examines how Large Language Models (LLMs) solve complex tasks and how human feedback influences their answers.
  • methods: The study uses human-influenced prompts across different tasks to probe whether LLMs exhibit sycophantic behaviour.
  • results: LLMs show sycophantic tendencies when answering queries involving subjective opinions and statements that, based on facts, should elicit a contrary response, demonstrating a lack of robustness and reliability.
    Abstract Large Language Models (LLMs) have been demonstrating the ability to solve complex tasks by delivering answers that are positively evaluated by humans due in part to the intensive use of human feedback that refines responses. However, the suggestibility transmitted through human feedback increases the inclination to produce responses that correspond to the user's beliefs or misleading prompts as opposed to true facts, a behaviour known as sycophancy. This phenomenon decreases the bias, robustness, and, consequently, their reliability. In this paper, we shed light on the suggestibility of LLMs to sycophantic behaviour, demonstrating these tendencies via human-influenced prompts over different tasks. Our investigation reveals that LLMs show sycophantic tendencies when responding to queries involving subjective opinions and statements that should elicit a contrary response based on facts, demonstrating a lack of robustness.

Zero-Shot Relational Learning on Temporal Knowledge Graphs with Large Language Models

  • paper_url: http://arxiv.org/abs/2311.10112
  • repo_url: None
  • paper_authors: Zifeng Ding, Heling Cai, Jingpei Wu, Yunpu Ma, Ruotong Liao, Bo Xiong, Volker Tresp
  • for: Improving the ability of temporal knowledge graph forecasting (TKGF) models to predict facts involving unseen (zero-shot) relations.
  • methods: Natural-language descriptions of KG relations are fed into large language models (LLMs) to generate relation representations, which are then introduced into embedding-based TKGF methods (a minimal sketch follows this entry).
  • results: Experiments show the approach substantially improves forecasting of facts with previously unseen relations while maintaining performance on seen relations.
    Abstract In recent years, modeling evolving knowledge over temporal knowledge graphs (TKGs) has become a heated topic. Various methods have been proposed to forecast links on TKGs. Most of them are embedding-based, where hidden representations are learned to represent knowledge graph (KG) entities and relations based on the observed graph contexts. Although these methods show strong performance on traditional TKG forecasting (TKGF) benchmarks, they naturally face a strong challenge when they are asked to model unseen zero-shot relations that have no prior graph context. In this paper, we try to mitigate this problem as follows. We first input the text descriptions of KG relations into large language models (LLMs) for generating relation representations, and then introduce them into embedding-based TKGF methods. LLM-empowered representations can capture the semantic information in the relation descriptions. This makes relations with similar semantic meanings, whether seen or unseen, stay close in the embedding space, enabling TKGF models to recognize zero-shot relations even without any observed graph context. Experimental results show that our approach helps TKGF models achieve much better performance in forecasting facts with previously unseen relations, while still maintaining their ability in link forecasting regarding seen relations.
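
A minimal sketch of seeding relation embeddings from text descriptions is shown below; the stand-in text encoder, the linear projection, and all names are assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch: a learned linear map projects LLM/text-encoder embeddings of
# relation descriptions into the TKG relation-embedding space, so unseen
# relations inherit sensible embeddings from their descriptions.
import torch
import torch.nn as nn

class TextInitializedRelationEmbeddings(nn.Module):
    def __init__(self, desc_embeddings: torch.Tensor, tkg_dim: int):
        """desc_embeddings: (num_relations, text_dim) embeddings of the
        natural-language relation descriptions, seen and unseen alike."""
        super().__init__()
        self.register_buffer("desc", desc_embeddings)
        self.proj = nn.Linear(desc_embeddings.shape[1], tkg_dim)

    def forward(self, relation_ids: torch.Tensor) -> torch.Tensor:
        # Relations with similar descriptions land near each other in the
        # TKG space, which is what enables zero-shot recognition.
        return self.proj(self.desc[relation_ids])

# Usage: rel_emb = TextInitializedRelationEmbeddings(llm_desc_vecs, tkg_dim=200)
# and plug rel_emb(relation_ids) into any embedding-based TKGF scoring function.
```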

LOKE: Linked Open Knowledge Extraction for Automated Knowledge Graph Construction

  • paper_url: http://arxiv.org/abs/2311.09366
  • repo_url: None
  • paper_authors: Jamie McCusker
  • for: This work aims to improve knowledge graph construction (KGC) beyond Open Information Extraction (Open IE) by using large language models (LLMs) and prompt engineering.
  • methods: The study uses GPT models and prompt engineering for Open Knowledge Extraction (OKE) and proposes the Linked Open Knowledge Extractor (LOKE) to address a problem similar to Open IE (an illustrative extraction-and-linking sketch follows this entry).
  • results: A well-engineered prompt paired with a naive entity linking approach (LOKE-GPT) outperforms AllenAI's OpenIE 4 implementation on the OKE task, although it over-generates triples relative to the reference set. Analysis of LOKE-GPT outputs and the "silver" TekGen triples shows that the task differs substantially from OIE in content, if not in structure.
    Abstract While the potential of Open Information Extraction (Open IE) for Knowledge Graph Construction (KGC) may seem promising, we find that the alignment of Open IE extraction results with existing knowledge graphs to be inadequate. The advent of Large Language Models (LLMs), especially the commercially available OpenAI models, have reset expectations for what is possible with deep learning models and have created a new field called prompt engineering. We investigate the use of GPT models and prompt engineering for knowledge graph construction with the Wikidata knowledge graph to address a similar problem to Open IE, which we call Open Knowledge Extraction (OKE) using an approach we call the Linked Open Knowledge Extractor (LOKE, pronounced like "Loki"). We consider the entity linking task essential to construction of real world knowledge graphs. We merge the CaRB benchmark scoring approach with data from the TekGen dataset for the LOKE task. We then show that a well engineered prompt, paired with a naive entity linking approach (which we call LOKE-GPT), outperforms AllenAI's OpenIE 4 implementation on the OKE task, although it over-generates triples compared to the reference set due to overall triple scarcity in the TekGen set. Through an analysis of entity linkability in the CaRB dataset, as well as outputs from OpenIE 4 and LOKE-GPT, we see that LOKE-GPT and the "silver" TekGen triples show that the task is significantly different in content from OIE, if not structure. Through this analysis and a qualitative analysis of sentence extractions via all methods, we found that LOKE-GPT extractions are of high utility for the KGC task and suitable for use in semi-automated extraction settings.
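
The sketch below illustrates the general LOKE-GPT recipe of prompting an LLM for triples and then naively linking entity mentions; the prompt wording, the OpenAI model name, and the Wikidata search heuristic are assumptions, not the paper's implementation.

```python
# Hedged illustration: extract subject | relation | object triples with an LLM,
# then naively link mentions via the public Wikidata search API.
import requests
from openai import OpenAI

PROMPT = (
    "Extract knowledge-graph triples from the sentence below as lines of the "
    "form subject | relation | object.\n\nSentence: {sentence}\n\nTriples:\n"
)

def extract_triples(sentence: str) -> list[tuple[str, str, str]]:
    client = OpenAI()  # assumes OPENAI_API_KEY is set
    text = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(sentence=sentence)}],
    ).choices[0].message.content
    triples = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:
            triples.append(tuple(parts))
    return triples

def naive_link(mention: str) -> str | None:
    """Return the top Wikidata QID for a mention, if any (naive linking)."""
    r = requests.get("https://www.wikidata.org/w/api.php", params={
        "action": "wbsearchentities", "search": mention,
        "language": "en", "format": "json", "limit": 1,
    }, timeout=10)
    hits = r.json().get("search", [])
    return hits[0]["id"] if hits else None
```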

Empirical evaluation of Uncertainty Quantification in Retrieval-Augmented Language Models for Science

  • paper_url: http://arxiv.org/abs/2311.09358
  • repo_url: https://github.com/pnnl/expert2
  • paper_authors: Sridevi Wagle, Sai Munikoti, Anurag Acharya, Sara Smith, Sameera Horawalavithana
  • for: The study evaluates uncertainty quantification (UQ) in Retrieval Augmented Language Models (RALMs) for scientific tasks.
  • methods: An existing RALM is trained and tested with scientific knowledge used as pretraining and retrieval data, and the relationship between uncertainty scores and the accuracy of model-generated outputs is examined (a generic calibration sketch follows this entry).
  • results: A RALM finetuned with scientific knowledge as retrieval data is more confident in its predictions, but RALMs are overconfident, making inaccurate predictions more confidently than accurate ones; providing scientific knowledge as pretraining or retrieval data does not alleviate this issue.
    Abstract Large language models (LLMs) have shown remarkable achievements in natural language processing tasks, producing high-quality outputs. However, LLMs still exhibit limitations, including the generation of factually incorrect information. In safety-critical applications, it is important to assess the confidence of LLM-generated content to make informed decisions. Retrieval Augmented Language Models (RALMs) is relatively a new area of research in NLP. RALMs offer potential benefits for scientific NLP tasks, as retrieved documents, can serve as evidence to support model-generated content. This inclusion of evidence enhances trustworthiness, as users can verify and explore the retrieved documents to validate model outputs. Quantifying uncertainty in RALM generations further improves trustworthiness, with retrieved text and confidence scores contributing to a comprehensive and reliable model for scientific applications. However, there is limited to no research on UQ for RALMs, particularly in scientific contexts. This study aims to address this gap by conducting a comprehensive evaluation of UQ in RALMs, focusing on scientific tasks. This research investigates how uncertainty scores vary when scientific knowledge is incorporated as pretraining and retrieval data and explores the relationship between uncertainty scores and the accuracy of model-generated outputs. We observe that an existing RALM finetuned with scientific knowledge as the retrieval data tends to be more confident in generating predictions compared to the model pretrained only with scientific knowledge. We also found that RALMs are overconfident in their predictions, making inaccurate predictions more confidently than accurate ones. Scientific knowledge provided either as pretraining or retrieval corpus does not help alleviate this issue. We released our code, data and dashboards at https://github.com/pnnl/EXPERT2.
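
One standard way to quantify the kind of overconfidence reported above is expected calibration error (ECE); the sketch below is a generic illustration and not necessarily the uncertainty metric used in the paper.

```python
# Generic expected-calibration-error (ECE) sketch: bin answers by confidence and
# compare average confidence with accuracy in each bin. A well-calibrated model
# has a small ECE; an overconfident one does not.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in the bin
    return ece

# An overconfident RALM shows high confidence on wrong answers and a large ECE.
print(expected_calibration_error([0.9, 0.95, 0.8, 0.6], [1, 0, 0, 1]))
```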

Privacy Threats in Stable Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.09355
  • repo_url: None
  • paper_authors: Thomas Cilloni, Charles Fleming, Charles Walter
  • for: This paper targets membership inference attacks (MIA) against stable diffusion computer vision models, specifically StabilityAI's highly sophisticated Stable Diffusion V2.
  • methods: A black-box attack that only needs to query the victim model repeatedly: the outputs of the stable diffusion model are observed at different generative epochs, and a classifier is trained to distinguish whether a series of intermediates originated from a training sample (a sketch of this pipeline follows this entry).
  • results: Attack efficacy is assessed with ROC AUC, achieving a 60% success rate in inferring membership information about the training set from the model's outputs.
    Abstract This paper introduces a novel approach to membership inference attacks (MIA) targeting stable diffusion computer vision models, specifically focusing on the highly sophisticated Stable Diffusion V2 by StabilityAI. MIAs aim to extract sensitive information about a model's training data, posing significant privacy concerns. Despite its advancements in image synthesis, our research reveals privacy vulnerabilities in the stable diffusion models' outputs. Exploiting this information, we devise a black-box MIA that only needs to query the victim model repeatedly. Our methodology involves observing the output of a stable diffusion model at different generative epochs and training a classification model to distinguish when a series of intermediates originated from a training sample or not. We propose numerous ways to measure the membership features and discuss what works best. The attack's efficacy is assessed using the ROC AUC method, demonstrating a 60\% success rate in inferring membership information. This paper contributes to the growing body of research on privacy and security in machine learning, highlighting the need for robust defenses against MIAs. Our findings prompt a reevaluation of the privacy implications of stable diffusion models, urging practitioners and developers to implement enhanced security measures to safeguard against such attacks.
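
A hedged sketch of such a membership-inference pipeline is given below; the feature function over intermediate outputs is left as a placeholder, and the logistic-regression attack model is an assumption rather than the paper's classifier.

```python
# Sketch: represent each queried image by a vector of per-step statistics from
# the diffusion model's intermediate outputs, then train a membership classifier
# and score it with ROC AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def intermediate_features(image) -> np.ndarray:
    """Placeholder: e.g. reconstruction error between the query image and the
    model's output at several generative epochs / denoising steps."""
    raise NotImplementedError

def run_mia(member_feats: np.ndarray, nonmember_feats: np.ndarray) -> float:
    X = np.vstack([member_feats, nonmember_feats])
    y = np.concatenate([np.ones(len(member_feats)), np.zeros(len(nonmember_feats))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```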

Generalizable Imitation Learning Through Pre-Trained Representations

  • paper_url: http://arxiv.org/abs/2311.09350
  • repo_url: None
  • paper_authors: Wei-Di Chang, Francois Hogan, David Meger, Gregory Dudek
  • for: Improving the generalization abilities of imitation learning policies.
  • methods: The approach leverages self-supervised Vision Transformer models (DINO ViT) and their emergent semantic abilities to improve the generalization of imitation learning policies (a clustering sketch follows this entry).
  • results: By clustering appearance features into semantic concepts, the method obtains better generalization across a wide range of appearance variations and object types, and demonstrates generalized behavior on a diverse dataset of object manipulation tasks.
    Abstract In this paper we leverage self-supervised vision transformer models and their emergent semantic abilities to improve the generalization abilities of imitation learning policies. We introduce BC-ViT, an imitation learning algorithm that leverages rich DINO pre-trained Visual Transformer (ViT) patch-level embeddings to obtain better generalization when learning through demonstrations. Our learner sees the world by clustering appearance features into semantic concepts, forming stable keypoints that generalize across a wide range of appearance variations and object types. We show that this representation enables generalized behaviour by evaluating imitation learning across a diverse dataset of object manipulation tasks. Our method, data and evaluation approach are made available to facilitate further study of generalization in Imitation Learners.
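
The sketch below shows one plausible way to cluster ViT patch features into semantic keypoints, assuming precomputed DINO-style patch embeddings; it illustrates the representation idea only and is not the authors' implementation.

```python
# Cluster patch embeddings into a fixed number of semantic concepts and use the
# mean grid position of each concept as a stable keypoint for the policy.
import numpy as np
from sklearn.cluster import KMeans

def semantic_keypoints(patch_features: np.ndarray, grid_hw: tuple[int, int], k: int = 8):
    """patch_features: (num_patches, dim) DINO-style patch embeddings of one image.
    Returns a concept map and one (x, y) keypoint per concept."""
    h, w = grid_hw
    assert patch_features.shape[0] == h * w
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(patch_features)
    ys, xs = np.divmod(np.arange(h * w), w)          # patch grid coordinates
    keypoints = np.stack([
        [xs[labels == c].mean(), ys[labels == c].mean()] for c in range(k)
    ])
    return labels.reshape(h, w), keypoints
```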

Generative AI-Based Probabilistic Constellation Shaping With Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.09349
  • repo_url: None
  • paper_authors: Mehdi Letafati, Samad Ali, Matti Latva-aho
  • for: This study explores the potential of diffusion models for communication engineering, aiming to improve the information rate and decoding performance of communication systems.
  • methods: The "denoise-and-generate" characteristic of denoising diffusion probabilistic models (DDPM) is exploited for probabilistic constellation shaping.
  • results: The approach outperforms a deep neural network (DNN) benchmark and uniform shaping, and provides network resilience and robust out-of-distribution performance under low-SNR and non-Gaussian conditions. Numerical evaluations show a 30% improvement in cosine similarity and a threefold improvement in mutual information for 64-QAM geometry.
    Abstract Diffusion models are at the vanguard of generative AI research with renowned solutions such as ImageGen by Google Brain and DALL.E 3 by OpenAI. Nevertheless, the potential merits of diffusion models for communication engineering applications are not fully understood yet. In this paper, we aim to unleash the power of generative AI for PHY design of constellation symbols in communication systems. Although the geometry of constellations is predetermined according to networking standards, e.g., quadrature amplitude modulation (QAM), probabilistic shaping can design the probability of occurrence (generation) of constellation symbols. This can help improve the information rate and decoding performance of communication systems. We exploit the ``denoise-and-generate'' characteristics of denoising diffusion probabilistic models (DDPM) for probabilistic constellation shaping. The key idea is to learn generating constellation symbols out of noise, ``mimicking'' the way the receiver performs symbol reconstruction. This way, we make the constellation symbols sent by the transmitter, and what is inferred (reconstructed) at the receiver become as similar as possible, resulting in as few mismatches as possible. Our results show that the generative AI-based scheme outperforms deep neural network (DNN)-based benchmark and uniform shaping, while providing network resilience as well as robust out-of-distribution performance under low-SNR regimes and non-Gaussian assumptions. Numerical evaluations highlight 30% improvement in terms of cosine similarity and a threefold improvement in terms of mutual information compared to DNN-based approach for 64-QAM geometry.

VideoCon: Robust Video-Language Alignment via Contrast Captions

  • paper_url: http://arxiv.org/abs/2311.10111
  • repo_url: https://github.com/hritikbansal/videocon
  • paper_authors: Hritik Bansal, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang, Aditya Grover
  • for: The goal is to make video-language alignment models robust to semantically plausible contrastive changes in video captions.
  • methods: The study introduces VideoCon, a video-language alignment dataset in which a large language model generates plausible contrast captions and explanations of the differences from the original captions; a generative video-language model is then finetuned on VideoCon to assess video-language entailment and generate explanations.
  • results: The VideoCon-based alignment model outperforms current models, with a 12-point AUC increase on human-generated contrast captions; it sets new state-of-the-art zero-shot performance on temporally extensive video-language tasks such as text-to-video retrieval (SSv2-Temporal) and video question answering (ATP-Hard), and performs well on novel videos and human-written captions and explanations.
    Abstract Despite being (pre)trained on a massive amount of data, state-of-the-art video-language alignment models are not robust to semantically-plausible contrastive changes in the video captions. Our work addresses this by identifying a broad spectrum of contrast misalignments, such as replacing entities, actions, and flipping event order, which alignment models should be robust against. To this end, we introduce the VideoCon, a video-language alignment dataset constructed by a large language model that generates plausible contrast video captions and explanations for differences between original and contrast video captions. Then, a generative video-language model is finetuned with VideoCon to assess video-language entailment and generate explanations. Our VideoCon-based alignment model significantly outperforms current models. It exhibits a 12-point increase in AUC for the video-language alignment task on human-generated contrast captions. Finally, our model sets new state of the art zero-shot performance in temporally-extensive video-language tasks such as text-to-video retrieval (SSv2-Temporal) and video question answering (ATP-Hard). Moreover, our model shows superior performance on novel videos and human-crafted captions and explanations. Our code and data are available at https://github.com/Hritikbansal/videocon.

Lighter, yet More Faithful: Investigating Hallucinations in Pruned Large Language Models for Abstractive Summarization

  • paper_url: http://arxiv.org/abs/2311.09335
  • repo_url: None
  • paper_authors: George Chrysostomou, Zhixue Zhao, Miles Williams, Nikolaos Aletras
  • for: investigate the effect of pruning on hallucinations in abstractive summarization with large language models (LLMs)
  • methods: pruning techniques to reduce model size, applied to three instruction-tuned LLMs and evaluated with three hallucination metrics (an illustrative pruning sketch follows this entry)
  • results: pruned LLMs hallucinate less than their full-sized counterparts; their greater dependency on the source input leads to higher lexical overlap between generated content and the source
    Abstract Despite their remarkable performance on abstractive summarization, large language models (LLMs) face two significant challenges: their considerable size and tendency to hallucinate. Hallucinations are concerning because they erode the reliability of LLMs and raise safety issues. Pruning is a technique that reduces model size by removing redundant weights to create sparse models that enable more efficient inference. Pruned models yield comparable performance to their counterpart full-sized models, making them ideal alternatives when operating on a limited budget. However, the effect that pruning has upon hallucinations in abstractive summarization with LLMs has yet to be explored. In this paper, we provide an extensive empirical study on the hallucinations produced by pruned models across three standard summarization tasks, two pruning approaches, three instruction-tuned LLMs, and three hallucination evaluation metrics. Surprisingly, we find that pruned LLMs hallucinate less compared to their full-sized counterparts. Our follow-up analysis suggests that pruned models tend to depend more on the source input and less on their parametric knowledge from pre-training for generation. This greater dependency on the source input leads to a higher lexical overlap between generated content and the source input, which can be a reason for the reduction in hallucinations.
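
For illustration, the sketch below applies generic L1 magnitude pruning with PyTorch utilities; the paper evaluates two pruning approaches that are not specified here, so this is only a stand-in showing the mechanics of sparsifying a model before summarization.

```python
# Illustrative magnitude pruning: zero out the smallest-magnitude weights in
# every Linear layer, then evaluate the pruned model's summaries for hallucination.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_linear_layers(model: nn.Module, sparsity: float = 0.5) -> nn.Module:
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=sparsity)
            prune.remove(module, "weight")  # make the sparsity permanent
    return model
```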

Strategic Data Augmentation with CTGAN for Smart Manufacturing: Enhancing Machine Learning Predictions of Paper Breaks in Pulp-and-Paper Production

  • paper_url: http://arxiv.org/abs/2311.09333
  • repo_url: None
  • paper_authors: Hamed Khosravi, Sarah Farhadpour, Manikanta Grandhi, Ahmed Shoyeb Raihan, Srinjoy Das, Imtiaz Ahmed
  • for: This paper addresses the challenge of predictive maintenance in the pulp-and-paper industry, specifically the scarcity of paper breaks during production, which have a high economic impact.
  • methods: The authors use a dataset of 18,398 instances derived from a quality assurance protocol and employ Conditional Generative Adversarial Networks (CTGAN) together with the Synthetic Minority Oversampling Technique (SMOTE) to build a novel data augmentation framework that improves predictive modeling and machine-break detection (an illustrative augmentation sketch follows this entry).
  • results: Using the CTGAN-enhanced dataset, detection of machine breaks (Class 1) improves by over 30% for Decision Trees, 20% for Random Forest, and nearly 90% for Logistic Regression.
    Abstract A significant challenge for predictive maintenance in the pulp-and-paper industry is the infrequency of paper breaks during the production process. In this article, operational data is analyzed from a paper manufacturing machine in which paper breaks are relatively rare but have a high economic impact. Utilizing a dataset comprising 18,398 instances derived from a quality assurance protocol, we address the scarcity of break events (124 cases) that pose a challenge for machine learning predictive models. With the help of Conditional Generative Adversarial Networks (CTGAN) and Synthetic Minority Oversampling Technique (SMOTE), we implement a novel data augmentation framework. This method ensures that the synthetic data mirrors the distribution of the real operational data but also seeks to enhance the performance metrics of predictive modeling. Before and after the data augmentation, we evaluate three different machine learning algorithms-Decision Trees (DT), Random Forest (RF), and Logistic Regression (LR). Utilizing the CTGAN-enhanced dataset, our study achieved significant improvements in predictive maintenance performance metrics. The efficacy of CTGAN in addressing data scarcity was evident, with the models' detection of machine breaks (Class 1) improving by over 30% for Decision Trees, 20% for Random Forest, and nearly 90% for Logistic Regression. With this methodological advancement, this study contributes to industrial quality control and maintenance scheduling by addressing rare event prediction in manufacturing processes.
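
A hedged sketch of the augmentation idea follows, assuming the imbalanced-learn SMOTE API and the ctgan package's CTGAN class; column names, epochs, and sample counts are illustrative and not the paper's configuration.

```python
# Oversample the rare break class with SMOTE and/or generate synthetic minority
# rows with CTGAN, then train a classifier on the augmented data.
import pandas as pd
from imblearn.over_sampling import SMOTE
from ctgan import CTGAN
from sklearn.ensemble import RandomForestClassifier

def augment_and_train(df: pd.DataFrame, label_col: str = "break"):  # "break" is a placeholder column name
    X, y = df.drop(columns=[label_col]), df[label_col]

    # Option A: SMOTE interpolates new minority samples in feature space.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

    # Option B: CTGAN learns the minority-class distribution and samples from it.
    minority = df[df[label_col] == 1]
    ctgan = CTGAN(epochs=300)
    ctgan.fit(minority, discrete_columns=[label_col])
    synthetic_minority = ctgan.sample(1000)

    augmented = pd.concat(
        [pd.concat([X_res, y_res], axis=1), synthetic_minority], ignore_index=True
    )
    clf = RandomForestClassifier(random_state=0)
    clf.fit(augmented.drop(columns=[label_col]), augmented[label_col])
    return clf
```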

Improving fit to human reading times via temperature-scaled surprisal

  • paper_url: http://arxiv.org/abs/2311.09325
  • repo_url: None
  • paper_authors: Tong Liu, Iza Škrjanec, Vera Demberg
  • for: This work uses large language models (LLMs) to simulate human cognitive load, building on the finding that words with lower predictability (i.e., higher surprisal) require more reading time.
  • methods: Temperature-scaled surprisal, i.e., surprisal computed from temperature-shaped probabilities, is used as the predictor of human reading times (a short computation sketch follows this entry).
  • results: Across three corpora, temperature-scaled surprisal substantially improves the prediction of reading times; setting the temperature to approximately 2.5 yields up to an 89% increase in delta log-likelihood. The paper also proposes a calibration metric to quantify possible human-likeness bias.
    Abstract Past studies, using large language models (LLMs) to simulate humans' cognitive load, have provided broad support for the finding that words with lower predictability (i.e., higher surprisal) require more time for comprehension. In general, these studies have implicitly assumed that the probability scores from LLMs are accurate, ignoring the discrepancies between human cognition and LLMs from this standpoint. Inspired by the concept of probability calibration, ours is the first work to focus on the probability distribution for human reading simulation. We propose to use temperature-scaled surprisal, a surprisal calculated from temperature-shaped probabilities, as the predictor of human reading times. Our results across three corpora consistently reveal that such a surprisal can drastically improve the prediction of reading times. Setting the temperature to be approximately 2.5 across all models and datasets can yield up to an 89% increase in delta log-likelihood in our setting. We also propose a calibration metric to quantify the possible human-likeness bias. Further analysis provides insights into this phenomenon.
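
A short sketch of temperature-scaled surprisal is given below, using GPT-2 as a stand-in language model; the paper's models and preprocessing may differ.

```python
# Compute per-token surprisal (in bits) after dividing the logits by a
# temperature before the softmax, as in temperature-scaled surprisal.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def temperature_scaled_surprisal(text, temperature=2.5, model_name="gpt2"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits  # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1] / temperature, dim=-1)
    next_ids = ids[:, 1:]
    token_logp = log_probs.gather(-1, next_ids.unsqueeze(-1)).squeeze(-1)
    surprisal = -token_logp / torch.log(torch.tensor(2.0))  # nats -> bits
    return list(zip(tok.convert_ids_to_tokens(next_ids[0]), surprisal[0].tolist()))
```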

Spoken Word2Vec: A Perspective And Some Techniques

  • paper_url: http://arxiv.org/abs/2311.09319
  • repo_url: None
  • paper_authors: Mohammad Amaan Sayeed, Hanan Aldarmaki
  • for: This paper examines spoken word embeddings and whether the assumptions and architectures used in past work can encode semantic rather than phonetic features.
  • methods: Word2Vec-style algorithms are used to model spoken words, and the assumptions of previous work are tested experimentally.
  • results: Experiments show that, under the assumptions and architectures of previous work, the resulting embeddings are dominated by phonetic rather than semantic features; automatic word type clustering helps improve embedding quality.
    Abstract Text word embeddings that encode distributional semantic features work by modeling contextual similarities of frequently occurring words. Acoustic word embeddings, on the other hand, typically encode low-level phonetic similarities. Semantic embeddings for spoken words have been previously explored using similar algorithms to Word2Vec, but the resulting vectors still mainly encoded phonetic rather than semantic features. In this paper, we examine the assumptions and architectures used in previous works and show experimentally how Word2Vec algorithms fail to encode distributional semantics when the input units are acoustically correlated. In addition, previous works relied on the simplifying assumptions of perfect word segmentation and clustering by word type. Given these conditions, a trivial solution identical to text-based embeddings has been overlooked. We follow this simpler path using automatic word type clustering and examine the effects on the resulting embeddings, highlighting the true challenges in this task.

H-Packer: Holographic Rotationally Equivariant Convolutional Neural Network for Protein Side-Chain Packing

  • paper_url: http://arxiv.org/abs/2311.09312
  • repo_url: None
  • paper_authors: Gian Marco Visani, William Galvin, Michael Neal Pun, Armita Nourmohammad
  • for: Accurately modeling protein 3D structure, and in particular protein side-chain packing, is a key step in designing functional proteins.
  • methods: The paper proposes a novel two-stage algorithm, Holographic Packer (H-Packer), built on two lightweight rotationally equivariant neural networks.
  • results: H-Packer is computationally efficient, shows favorable performance on CASP13 and CASP14 targets compared with conventional physics-based algorithms, and is competitive with other deep learning solutions.
    Abstract Accurately modeling protein 3D structure is essential for the design of functional proteins. An important sub-task of structure modeling is protein side-chain packing: predicting the conformation of side-chains (rotamers) given the protein's backbone structure and amino-acid sequence. Conventional approaches for this task rely on expensive sampling procedures over hand-crafted energy functions and rotamer libraries. Recently, several deep learning methods have been developed to tackle the problem in a data-driven way, albeit with vastly different formulations (from image-to-image translation to directly predicting atomic coordinates). Here, we frame the problem as a joint regression over the side-chains' true degrees of freedom: the dihedral $\chi$ angles. We carefully study possible objective functions for this task, while accounting for the underlying symmetries of the task. We propose Holographic Packer (H-Packer), a novel two-stage algorithm for side-chain packing built on top of two light-weight rotationally equivariant neural networks. We evaluate our method on CASP13 and CASP14 targets. H-Packer is computationally efficient and shows favorable performance against conventional physics-based algorithms and is competitive against alternative deep learning solutions.

Divergences between Language Models and Human Brains

  • paper_url: http://arxiv.org/abs/2311.09308
  • repo_url: https://github.com/flamingozh/divergence_meg
  • paper_authors: Yuchen Zhou, Emmy Liu, Graham Neubig, Leila Wehbe
  • for: The study asks whether machines and humans process language in similar ways and uses brain data to explore the differences between human and machine language processing.
  • methods: Magnetoencephalography (MEG) responses to a written narrative are used to examine differences between LM representations and the human brain's responses to language, and LMs are fine-tuned on datasets related to specific phenomena to improve their alignment with brain responses.
  • results: LMs capture emotional understanding, figurative language processing, and physical commonsense poorly; fine-tuning LMs on these phenomena improves their alignment with human brain responses.
    Abstract Do machines and humans process language in similar ways? A recent line of research has hinted in the affirmative, demonstrating that human brain signals can be effectively predicted using the internal representations of language models (LMs). This is thought to reflect shared computational principles between LMs and human language processing. However, there are also clear differences in how LMs and humans acquire and use language, even if the final task they are performing is the same. Despite this, there is little work exploring systematic differences between human and machine language processing using brain data. To address this question, we examine the differences between LM representations and the human brain's responses to language, specifically by examining a dataset of Magnetoencephalography (MEG) responses to a written narrative. In doing so we identify three phenomena that, in prior work, LMs have been found to not capture well: emotional understanding, figurative language processing, and physical commonsense. By fine-tuning LMs on datasets related to these phenomena, we observe that fine-tuned LMs show improved alignment with human brain responses across these tasks. Our study implies that the observed divergences between LMs and human brains may stem from LMs' inadequate representation of these specific types of knowledge.

Symbol-LLM: Towards Foundational Symbol-centric Interface For Large Language Models

  • paper_url: http://arxiv.org/abs/2311.09278
  • repo_url: None
  • paper_authors: Fangzhi Xu, Zhiyong Wu, Qiushi Sun, Siyu Ren, Fei Yuan, Shuai Yuan, Qika Lin, Yu Qiao, Jun Liu
  • for: This work explores how to inject specific symbolic knowledge into large language models (LLMs) while preserving their performance on NL-centric tasks.
  • methods: The approach works in two directions: collecting 34 symbolic tasks covering ~20 different forms, unified to capture interrelations between symbols, and a two-stage tuning framework that injects symbolic knowledge without loss of generality.
  • results: Extensive experiments on both symbol- and NL-centric tasks show that the Symbol-LLM series of models achieves balanced and superior performance.
    Abstract Large Language Models (LLMs) have greatly propelled the progress in natural language(NL)-centric tasks based on NL interface. However, the NL form is not enough for world knowledge. Current works focus on this question by injecting specific symbolic knowledge into LLM, which ignore two critical challenges: the interrelations between various symbols and the balance between symbolic-centric and NL-centric capabilities. In this work, we tackle these challenges from both a data and framework perspective and introduce Symbol-LLM series models. First, we collect 34 symbolic tasks, covering ~20 different forms, which are unified to capture symbol interrelations. Then, a two-stage tuning framework succeeds in injecting symbolic knowledge without loss of the generality ability. Extensive experiments on both symbol- and NL-centric tasks demonstrate the balanced and superior performances of Symbol-LLM series models.

Assessing Translation capabilities of Large Language Models involving English and Indian Languages

  • paper_url: http://arxiv.org/abs/2311.09216
  • repo_url: None
  • paper_authors: Vandan Mujadia, Ashok Urlana, Yash Bhaskar, Penumalla Aditya Pavani, Kukkapalli Shravya, Parameswari Krishnamurthy, Dipti Misra Sharma
  • for: This study explores the multilingual capabilities of large language models (LLMs) on machine translation tasks involving English and Indian languages.
  • methods: Machine translation between English and 22 Indian languages is used as the task; the translation ability of raw LLMs is examined first, then their in-context learning ability, and finally the models are fine-tuned with parameter-efficient methods such as LoRA as well as with full fine-tuning (an illustrative LoRA setup follows this entry).
  • results: With LLaMA as the base model, the study reports average BLEU scores of 13.42, 15.93, 12.13, 12.30, and 12.07 and chrF scores of 43.98, 46.99, 42.55, 42.42, and 45.39 for English-to-Indian-language translation on the IN22 (conversational), IN22 (general), flores200-dev, flores200-devtest, and newstest2019 test sets, and average BLEU scores of 14.03, 16.65, 16.17, 15.35, and 12.55 with chrF scores of 36.71, 40.44, 40.26, 39.51, and 36.20 for Indian-language-to-English translation. Overall, the findings highlight the potential of LLMs for machine translation, including for languages that are currently underrepresented.
    Abstract Generative Large Language Models (LLMs) have achieved remarkable advancements in various NLP tasks. In this work, our aim is to explore the multilingual capabilities of large language models by using machine translation as a task involving English and 22 Indian languages. We first investigate the translation capabilities of raw large language models, followed by exploring the in-context learning capabilities of the same raw models. We fine-tune these large language models using parameter efficient fine-tuning methods such as LoRA and additionally with full fine-tuning. Through our study, we have identified the best performing large language model for the translation task involving LLMs, which is based on LLaMA. Our results demonstrate significant progress, with average BLEU scores of 13.42, 15.93, 12.13, 12.30, and 12.07, as well as CHRF scores of 43.98, 46.99, 42.55, 42.42, and 45.39, respectively, using 2-stage fine-tuned LLaMA-13b for English to Indian languages on IN22 (conversational), IN22 (general), flores200-dev, flores200-devtest, and newstest2019 testsets. Similarly, for Indian languages to English, we achieved average BLEU scores of 14.03, 16.65, 16.17, 15.35 and 12.55 along with chrF scores of 36.71, 40.44, 40.26, 39.51, and 36.20, respectively, using fine-tuned LLaMA-13b on IN22 (conversational), IN22 (general), flores200-dev, flores200-devtest, and newstest2019 testsets. Overall, our findings highlight the potential and strength of large language models for machine translation capabilities, including for languages that are currently underrepresented in LLMs.
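
The sketch below shows a typical LoRA setup with the PEFT library for a LLaMA-style model; the checkpoint name, rank, and target modules are assumptions, not the training configuration used in the paper.

```python
# Illustrative LoRA adapter setup for translation fine-tuning (not the authors'
# training code). Only the low-rank adapter weights are trained.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-13b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16,                      # low-rank dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # common choice for LLaMA blocks
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# Training would then proceed on "translate English to Hindi: ..." style pairs
# with a standard causal-LM loss; a second full fine-tuning stage is optional.
```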

Controllable Text Summarization: Unraveling Challenges, Approaches, and Prospects – A Survey

  • paper_url: http://arxiv.org/abs/2311.09212
  • repo_url: https://github.com/ashokurlana/controllable_text_summarization_survey
  • paper_authors: Ashok Urlana, Pruthwik Mishra, Tathagato Roy, Rahul Mishra
  • for: This survey focuses on controllable text summarization methods.
  • methods: The survey formalizes the controllable text summarization (CTS) task, categorizes controllable aspects according to their shared characteristics and objectives, and examines existing methods and datasets within each category.
  • results: The survey identifies the main categories of controllable attributes and their associated challenges, uncovers limitations and research gaps, and discusses potential solutions and future directions for CTS.
    Abstract Generic text summarization approaches often fail to address the specific intent and needs of individual users. Recently, scholarly attention has turned to the development of summarization methods that are more closely tailored and controlled to align with specific objectives and user needs. While a growing corpus of research is devoted towards a more controllable summarization, there is no comprehensive survey available that thoroughly explores the diverse controllable aspects or attributes employed in this context, delves into the associated challenges, and investigates the existing solutions. In this survey, we formalize the Controllable Text Summarization (CTS) task, categorize controllable aspects according to their shared characteristics and objectives, and present a thorough examination of existing methods and datasets within each category. Moreover, based on our findings, we uncover limitations and research gaps, while also delving into potential solutions and future directions for CTS.

Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models

  • paper_url: http://arxiv.org/abs/2311.09210
  • repo_url: None
  • paper_authors: Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, Dong Yu
  • for: 提高 Retrieval-augmented language models(RALMs)的可靠性和能力,特别是在减少假想的报道和增加外部知识源的情况下。
  • methods: 提出了一种新的Chain-of-Noting(CoN)方法,通过生成文档检索过程中的顺序阅读笔记,评估检索到的文档的相关性,并将其纳入最终的回答中。
  • results: CoN在四个开放领域问答 benchmarck 上进行了实验,结果显示,与标准 RALMs 相比,CoN 可以提高 EM 分数的平均提升为+7.9,并在实时问题中减少不相关文档的拒绝率为+10.5。
    Abstract Retrieval-augmented language models (RALMs) represent a substantial advancement in the capabilities of large language models, notably in reducing factual hallucination by leveraging external knowledge sources. However, the reliability of the retrieved information is not always guaranteed. The retrieval of irrelevant data can lead to misguided responses, potentially causing the model to overlook its inherent knowledge, even when it possesses adequate information to address the query. Moreover, standard RALMs often struggle to assess whether they possess adequate knowledge, both intrinsic and retrieved, to provide an accurate answer. In situations where knowledge is lacking, these systems should ideally respond with "unknown" when the answer is unattainable. In response to these challenges, we introduce Chain-of-Noting (CoN), a novel approach aimed at improving the robustness of RALMs in facing noisy, irrelevant documents and in handling unknown scenarios. The core idea of CoN is to generate sequential reading notes for retrieved documents, enabling a thorough evaluation of their relevance to the given question and integrating this information to formulate the final answer. We employed ChatGPT to create training data for CoN, which was subsequently used to train an LLaMa-2 7B model. Our experiments across four open-domain QA benchmarks show that RALMs equipped with CoN significantly outperform standard RALMs. Notably, CoN achieves an average improvement of +7.9 in EM score given entirely noisy retrieved documents and +10.5 in rejection rates for real-time questions that fall outside the pre-training knowledge scope.
    摘要 检索增强语言模型(RALM)显著提升了大型语言模型的能力,尤其是通过利用外部知识来源减少事实性幻觉。然而,检索到的信息并不总是可靠的:检索到无关内容可能导致误导性的回答,甚至使模型忽略其本身已足以回答问题的内在知识。此外,标准的 RALM 往往难以判断自身(无论是内在知识还是检索知识)是否足以给出准确答案;在知识不足的情况下,这类系统理应回答"未知"。针对这些挑战,我们提出了链式笔记(Chain-of-Noting,CoN),一种旨在提升 RALM 面对噪声、无关文档以及未知情形时稳健性的新方法。CoN 的核心思想是为检索到的文档生成顺序阅读笔记,从而全面评估其与问题的相关性,并将这些信息整合进最终答案。我们使用 ChatGPT 构建 CoN 的训练数据,并在 LLaMa-2 7B 模型上进行训练。在四个开放领域问答基准上的实验表明,配备 CoN 的 RALM 显著优于标准 RALM:在检索文档完全为噪声的情况下,EM 分数平均提升 +7.9;对于超出预训练知识范围的实时问题,拒答率提升 +10.5。
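
The following is a rough, assumed sketch of a Chain-of-Note style prompt (sequential reading notes before the final answer); the instruction wording is invented and does not reproduce the paper's actual template.

```python
# Minimal sketch of a Chain-of-Note style prompt: the model is asked to write
# a short relevance note for each retrieved passage before committing to an
# answer. The instruction wording is illustrative, not the paper's template.
def build_con_prompt(question: str, passages: list[str]) -> str:
    lines = ["Task: Answer the question using the passages below.",
             "First write a brief reading note for each passage, judging",
             "whether it is relevant to the question. If no passage is",
             "relevant and you lack the knowledge, answer 'unknown'.", ""]
    for i, p in enumerate(passages, 1):
        lines.append(f"Passage {i}: {p}")
    lines += ["", f"Question: {question}",
              "Reading notes and final answer:"]
    return "\n".join(lines)

prompt = build_con_prompt(
    "Which year was the Eiffel Tower completed?",
    ["The Eiffel Tower was completed in 1889.", "Paris hosts many museums."],
)
print(prompt)
```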

Fusion-Eval: Integrating Evaluators with LLMs

  • paper_url: http://arxiv.org/abs/2311.09204
  • repo_url: None
  • paper_authors: Lei Shu, Nevan Wichers, Liangchen Luo, Yun Zhu, Yinxiao Liu, Jindong Chen, Lei Meng
  • for: 这篇论文的目的是改进大型语言模型(LLMs)的评估方法,以便更好地应对自然语言理解和高级推理方面的评估需求。
  • methods: 本论文综合使用多种评估范式,包括基于人类、基于模型和基于自动指标的方法,并利用 LLM 将来自不同评估者的洞见加以整合,从而构建一个更灵活、更有效的评估系统。
  • results: 在 SummEval 数据集上的测试中,Fusion-Eval 取得了 0.96 的 Spearman 相关系数,超过其他评估器。这表明 Fusion-Eval 能够充分利用多个参考,生成与人类视角高度一致的评估结果,为 LLM 评估设立了新的标准。
    Abstract Evaluating Large Language Models (LLMs) is a complex task, especially considering the intricacies of natural language understanding and the expectations for high-level reasoning. Traditional evaluations typically lean on human-based, model-based, or automatic-metrics-based paradigms, each with its own advantages and shortcomings. We introduce "Fusion-Eval", a system that employs LLMs not solely for direct evaluations, but to skillfully integrate insights from diverse evaluators. This gives Fusion-Eval flexibility, enabling it to work effectively across diverse tasks and make optimal use of multiple references. In testing on the SummEval dataset, Fusion-Eval achieved a Spearman correlation of 0.96, outperforming other evaluators. The success of Fusion-Eval underscores the potential of LLMs to produce evaluations that closely align human perspectives, setting a new standard in the field of LLM evaluation.
    摘要 评估大语言模型(LLM)是一项复杂的任务,尤其是在自然语言理解方面和高级逻辑预期下。传统评估方法通常是人类基础、模型基础或自动指标基础的三者,每种方法都有其优点和缺点。我们介绍了“融合评估”(Fusion-Eval)系统,它不仅利用 LLM 进行直接评估,而且灵活地结合了多个评估者的意见。这使得 Fusion-Eval 能够在多种任务上工作有效,并且能够最大化多个参考。在 SummEval 数据集上测试时,Fusion-Eval 达到了 Spearman 相关系数 0.96,超越其他评估器。Fusion-Eval 的成功表明 LLM 可以生成高度吻合人类视角的评估结果,为 LLM 评估领域设置了新的标准。
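
Since the reported metric is a Spearman correlation against human judgments, a small illustration of that evaluation step is sketched below; the scores are made-up placeholders, not SummEval data.

```python
# Sketch of the evaluation protocol implied by the abstract: correlate an
# evaluator's scores with human judgments via Spearman's rho.
from scipy.stats import spearmanr

human_scores = [4.0, 2.5, 3.0, 5.0, 1.5]          # placeholder human ratings
fusion_eval_scores = [3.8, 2.7, 3.1, 4.9, 1.2]    # placeholder system scores

rho, p_value = spearmanr(human_scores, fusion_eval_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```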

ExpM+NF: Differentially Private Machine Learning that Surpasses DPSGD

  • paper_url: http://arxiv.org/abs/2311.09200
  • repo_url: None
  • paper_authors: Robert A. Bridges, Vandy J. Tombs, Christopher B. Stanley
  • for: 本研究旨在提出一种基于Exponential Mechanism(ExpM)和auxiliary Normalizing Flow(NF)的方法,用于在private数据上训练机器学习(ML)模型,并 garantuee differential privacy(DP)保证。
  • methods: 本方法使用ExpM和NF结合使用,以实现在private数据上训练ML模型,并可以实现预先指定的DP保证。
  • results: 在多个分类任务和不同的数据集上,ExpM+NF 能够达到超过 93% 的非私有训练精度(AUC),并在取得更高精度(更高 AUC)的同时提供更强的隐私保证($\delta=0$ 下更小的 $\varepsilon$),优于 DPSGD。
    Abstract In this pioneering work we formulate ExpM+NF, a method for training machine learning (ML) on private data with pre-specified differentially privacy guarantee $\varepsilon>0, \delta=0$, by using the Exponential Mechanism (ExpM) and an auxiliary Normalizing Flow (NF). We articulate theoretical benefits of ExpM+NF over Differentially Private Stochastic Gradient Descent (DPSGD), the state-of-the-art (SOTA) and de facto method for differentially private ML, and we empirically test ExpM+NF against DPSGD using the SOTA implementation (Opacus with PRV accounting) in multiple classification tasks on the Adult Dataset (census data) and MIMIC-III Dataset (electronic healthcare records) using Logistic Regression and GRU-D, a deep learning recurrent neural network with ~20K-100K parameters. In all experiments, ExpM+NF achieves greater than 93% of the non-private training accuracy (AUC) for $\varepsilon \in [1\mathrm{e}{-3}, 1]$, exhibiting greater accuracy (higher AUC) and privacy (lower $\varepsilon$ with $\delta=0$) than DPSGD. Differentially private ML generally considers $\varepsilon \in [1,10]$ to maintain reasonable accuracy; hence, ExpM+NF's ability to provide strong accuracy for orders of magnitude better privacy (smaller $\varepsilon$) substantially pushes what is currently possible in differentially private ML. Training time results are presented showing ExpM+NF is comparable to (slightly faster) than DPSGD. Code for these experiments will be provided after review. Limitations and future directions are provided.
    摘要 在这项开创性工作中,我们提出了 ExpM+NF 方法:利用指数机制(ExpM)与辅助归一化流(NF),在私有数据上训练机器学习(ML)模型,并提供预先设定的差分隐私保证($\varepsilon>0, \delta=0$)。我们阐述了 ExpM+NF 相对于差分隐私随机梯度下降(DPSGD,当前最先进且事实上的标准方法)的理论优势,并使用 SOTA 实现(Opacus,PRV 记账)在 Adult 数据集(人口普查数据)和 MIMIC-III 数据集(电子医疗记录)上,针对逻辑回归与 GRU-D(参数量约 2 万至 10 万的深度循环神经网络)的多个分类任务进行了实证比较。在所有实验中,ExpM+NF 在 $\varepsilon \in [1\mathrm{e}{-3}, 1]$ 范围内达到了非私有训练精度(AUC)的 93% 以上,表现出比 DPSGD 更高的精度(更高的 AUC)与更强的隐私($\delta=0$ 时更小的 $\varepsilon$)。差分隐私机器学习通常需要 $\varepsilon \in [1,10]$ 才能维持合理精度,因此 ExpM+NF 能在 $\varepsilon$ 小若干个数量级的情况下保持较高精度,显著拓展了差分隐私机器学习当前可达的边界。训练时间结果表明 ExpM+NF 与 DPSGD 相当(略快)。实验代码将在审稿后公开,论文还讨论了局限性与未来方向。
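
For context, the classic exponential mechanism over a finite candidate set can be sketched as follows; the paper combines ExpM with a normalizing flow to sample continuous model parameters, which this toy example does not attempt.

```python
# Illustrative implementation of the exponential mechanism (ExpM) over a
# finite candidate set. Not the paper's ExpM+NF sampler.
import numpy as np

def exponential_mechanism(candidates, utility, epsilon, sensitivity, rng=None):
    """Sample a candidate with probability proportional to
    exp(epsilon * utility / (2 * sensitivity))."""
    if rng is None:
        rng = np.random.default_rng()
    scores = np.array([utility(c) for c in candidates], dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()              # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Toy usage: privately pick the most common item in a small dataset.
data = ["a", "a", "b", "c", "a"]
choice = exponential_mechanism(
    candidates=["a", "b", "c"],
    utility=lambda item: data.count(item),  # counting query, sensitivity 1
    epsilon=0.5,
    sensitivity=1.0,
)
print(choice)
```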

Never Lost in the Middle: Improving Large Language Models via Attention Strengthening Question Answering

  • paper_url: http://arxiv.org/abs/2311.09198
  • repo_url: None
  • paper_authors: Junqing He, Kunhao Pan, Xiaoqun Dong, Zhuoyang Song, Yibo Liu, Yuxin Liang, Hao Wang, Qianguo Sun, Songxin Zhang, Zejian Xie, Jiaxing Zhang
  • for: 提高大语言模型在长文本上的信息搜寻和反思能力
  • methods: 提出特制的任务 called Attention Strengthening Multi-doc QA (ASM QA),以提高模型在长文本上的精准搜寻能力
  • results: 实验结果显示,模型在多文档问答和其他标准任务上表现出色,与当前最佳模型相比,在随机设置下获得13.7%的绝对提升,在文章检索任务上获得21.5%的提升。
    Abstract While large language models (LLMs) are equipped with longer text input capabilities than before, they are struggling to seek correct information in long contexts. The "lost in the middle" problem challenges most LLMs, referring to the dramatic decline in accuracy when correct information is located in the middle. To overcome this crucial issue, this paper proposes to enhance the information searching and reflection ability of LLMs in long contexts via specially designed tasks called Attention Strengthening Multi-doc QA (ASM QA). Following these tasks, our model excels in focusing more precisely on the desired information. Experimental results show substantial improvement in Multi-doc QA and other benchmarks, superior to state-of-the-art models by 13.7% absolute gain in shuffled settings, by 21.5% in passage retrieval task. We release our model, Ziya-Reader to promote related research in the community.
    摘要 大型语言模型(LLM)具有更长的文本输入能力,但它们在长文本上寻找正确信息时受到挑战。这个“lost in the middle”问题对大多数LLM都是一个重要问题,指的是正确信息在中间部分的减退率。为了解决这个关键问题,这篇论文提出了通过特定任务 called Attention Strengthening Multi-doc QA(ASM QA)来增强LLM在长文本上的信息寻找和反射能力。在这些任务中,我们的模型在更加精准地Focus on Desired Information。实验结果表明,我们的模型在多文档问答和其他标准 bencmarks 上表现出了明显的提升,相比领先模型的13.7%绝对提升,在排序任务上提高21.5%。我们将发布我们的模型,Ziya-Reader,以便在社区中促进相关的研究。
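
One plausible way to build attention-strengthening multi-doc QA examples is to plant the gold passage at controlled positions among distractors, as in the hypothetical sketch below; the field names and formatting are assumptions, not the ASM QA recipe.

```python
# Sketch of assembling multi-doc QA data in which the gold passage is
# deliberately buried at varying positions (including the middle) among
# distractors, so the model must locate it. Field names are assumptions.
import random

def make_example(question, gold_passage, distractors, gold_position):
    docs = list(distractors)
    random.shuffle(docs)
    docs.insert(gold_position, gold_passage)
    context = "\n\n".join(f"[Doc {i+1}] {d}" for i, d in enumerate(docs))
    return {"question": question, "context": context,
            "gold_doc_index": gold_position + 1}

example = make_example(
    question="Who wrote 'The Old Man and the Sea'?",
    gold_passage="Ernest Hemingway wrote 'The Old Man and the Sea' in 1951.",
    distractors=["The Nile is the longest river in Africa.",
                 "Mount Everest lies on the Nepal-China border."],
    gold_position=1,  # bury the answer in the middle
)
print(example["context"])
```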

The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task

  • paper_url: http://arxiv.org/abs/2311.09193
  • repo_url: None
  • paper_authors: Yifan Wu, Pengchuan Zhang, Wenhan Xiong, Barlas Oguz, James C. Gee, Yixin Nie
  • for: 这个研究探讨了链条思维方法在复杂视语任务中的有效性,这种方法通过将任务拆分成子任务和中间步骤来提高语言任务的效率。
  • methods: 本研究提出了“先描述、后决策”(Description then Decision)策略,该策略借鉴了人类处理信号的方式。
  • results: 实验表明,该策略在复杂视觉语言任务的探测(probing)任务上将性能提升了 50%。
    Abstract The study explores the effectiveness of the Chain-of-Thought approach, known for its proficiency in language tasks by breaking them down into sub-tasks and intermediate steps, in improving vision-language tasks that demand sophisticated perception and reasoning. We present the "Description then Decision" strategy, which is inspired by how humans process signals. This strategy significantly improves probing task performance by 50%, establishing the groundwork for future research on reasoning paradigms in complex vision-language tasks.
    摘要 本研究探讨了链式思维(Chain-of-Thought)方法在需要复杂感知与推理的视觉语言任务中的有效性;该方法以将语言任务分解为子任务和中间步骤而著称。我们提出了“先描述、后决策”(Description then Decision)策略,其灵感来自人类处理信号的方式。该策略将探测任务的性能显著提升了 50%,为复杂视觉语言任务中推理范式的后续研究奠定了基础。

Towards Verifiable Text Generation with Symbolic References

  • paper_url: http://arxiv.org/abs/2311.09188
  • repo_url: None
  • paper_authors: Lucas Torroba Hennigen, Shannon Shen, Aniruddha Nrusimha, Bernhard Gapp, David Sontag, Yoon Kim
  • for: 这篇论文目的是提出一种简单的方法来使大语言模型(LLM)的输出更易于人工验证,以便用于高风险应用。
  • methods: 该论文提出了一种名为符号附加生成(SymGen)的方法,它使得 LLM 可以在输出文本中嵌入显式的符号参考,以便显示不同的文本段的来源。
  • results: 实验表明, LLM 可以通过 SymGen 方法直接输出包含符号参考的文本,而不会影响文本的流畅性和准确性。
    Abstract Large language models (LLMs) have demonstrated an impressive ability to synthesize plausible and fluent text. However they remain vulnerable to hallucinations, and thus their outputs generally require manual human verification for high-stakes applications, which can be time-consuming and difficult. This paper proposes symbolically grounded generation (SymGen) as a simple approach for enabling easier validation of an LLM's output. SymGen prompts an LLM to interleave its regular output text with explicit symbolic references to fields present in some conditioning data (e.g., a table in JSON format). The references can be used to display the provenance of different spans of text in the generation, reducing the effort required for manual verification. Across data-to-text and question answering experiments, we find that LLMs are able to directly output text that makes use of symbolic references while maintaining fluency and accuracy.
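
The abstract's idea of interleaving text with symbolic references to conditioning data can be illustrated with a toy placeholder syntax; the {{field}} notation below is an assumption, not the paper's actual markup.

```python
# Sketch of symbolically grounded generation: the model emits text containing
# references to fields of the conditioning record, which are then resolved and
# can be displayed as provenance. Placeholder syntax is illustrative only.
import re

record = {"player": "N. Jokic", "points": 35, "rebounds": 12}

# A generation interleaving prose with symbolic references to `record` fields.
generation = "{{player}} finished with {{points}} points and {{rebounds}} rebounds."

def resolve(text: str, data: dict) -> str:
    def repl(match: re.Match) -> str:
        key = match.group(1)
        return str(data[key])   # provenance: this span came from field `key`
    return re.sub(r"\{\{(\w+)\}\}", repl, text)

print(resolve(generation, record))
# -> "N. Jokic finished with 35 points and 12 rebounds."
```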

Generate, Filter, and Fuse: Query Expansion via Multi-Step Keyword Generation for Zero-Shot Neural Rankers

  • paper_url: http://arxiv.org/abs/2311.09175
  • repo_url: None
  • paper_authors: Minghan Li, Honglei Zhuang, Kai Hui, Zhen Qin, Jimmy Lin, Rolf Jagerman, Xuanhui Wang, Michael Bendersky
  • for: 提高零shot neural ranker的启发搜索精度
  • methods: 提出了一个名为GFF的管道,包括一个大型自然语言模型和一个神经网络排序器,用于生成、筛选和融合查询扩展。
  • results: GFF可以提高零shot nDCG@10在BEIR和TREC DL 2019/2020上。
    Abstract Query expansion has been proved to be effective in improving recall and precision of first-stage retrievers, and yet its influence on a complicated, state-of-the-art cross-encoder ranker remains under-explored. We first show that directly applying the expansion techniques in the current literature to state-of-the-art neural rankers can result in deteriorated zero-shot performance. To this end, we propose GFF, a pipeline that includes a large language model and a neural ranker, to Generate, Filter, and Fuse query expansions more effectively in order to improve the zero-shot ranking metrics such as nDCG@10. Specifically, GFF first calls an instruction-following language model to generate query-related keywords through a reasoning chain. Leveraging self-consistency and reciprocal rank weighting, GFF further filters and combines the ranking results of each expanded query dynamically. By utilizing this pipeline, we show that GFF can improve the zero-shot nDCG@10 on BEIR and TREC DL 2019/2020. We also analyze different modelling choices in the GFF pipeline and shed light on the future directions in query expansion for zero-shot neural rankers.
    摘要 Query expansion 已经证明可以提高首个检索器的准确率和匹配率,但是它对现代跨Encoder排名器的影响还未得到充分探讨。我们首先表明,直接在当前文献中使用扩展技术可能会导致现有神经排名器的零件性能下降。为此,我们提出了GFF,一个管道,包括一个大型自然语言模型和一个神经排名器,用于生成、筛选和融合查询扩展更有效地,以提高零件性能指标 such as nDCG@10。具体来说,GFF首先通过一个遵循语言模型来生成基于查询的关键词,然后通过自我一致和对偶排名Weight来筛选和组合每个扩展查询的排名结果。通过这个管道,我们表明GFF可以提高零件 nDCG@10 在 BEIR 和 TREC DL 2019/2020。我们还分析了 GFF 管道中不同的模型选择和未来方向。
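
The 'fuse' step can be illustrated with plain reciprocal rank fusion over the ranked lists returned for each expanded query; the constant k=60 is the common RRF default rather than the paper's setting, and the self-consistency filtering step is omitted.

```python
# Sketch of fusing rankings from multiple expanded queries via reciprocal rank
# fusion (RRF). Document IDs are placeholders.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Rankings produced for the original query and two keyword-expanded queries.
fused = reciprocal_rank_fusion([
    ["d3", "d1", "d7"],
    ["d1", "d3", "d9"],
    ["d1", "d7", "d3"],
])
print(fused)  # documents ranked high across expansions float to the top
```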

AbsPyramid: Benchmarking the Abstraction Ability of Language Models with a Unified Entailment Graph

  • paper_url: http://arxiv.org/abs/2311.09174
  • repo_url: https://github.com/hkust-knowcomp/abspyramid
  • paper_authors: Zhaowei Wang, Haochen Shi, Weiqi Wang, Tianqing Fang, Hongming Zhang, Sehyun Choi, Xin Liu, Yangqiu Song
  • for: 本研究旨在探讨语言模型内置抽象能力的现状,并提出一个大规模的抽象知识图。
  • methods: 本研究使用了一个大规模的文本描述数据集,通过构建抽象知识图来评估语言模型在开放领域中的抽象能力。
  • results: 实验结果表明,现有的LLMs在零shot和几shot设置下具有很大的抽象知识识别挑战。通过训练在我们的充沛抽象知识上,我们发现LLMs可以获得基本的抽象能力,并在未见事件中进行抽象。同时,我们也证明了我们的指标是可以强化LLMs在两个前一个抽象任务上。
    Abstract Cognitive research indicates that abstraction ability is essential in human intelligence, which remains under-explored in language models. In this paper, we present AbsPyramid, a unified entailment graph of 221K textual descriptions of abstraction knowledge. While existing resources only touch nouns or verbs within simplified events or specific domains, AbsPyramid collects abstract knowledge for three components of diverse events to comprehensively evaluate the abstraction ability of language models in the open domain. Experimental results demonstrate that current LLMs face challenges comprehending abstraction knowledge in zero-shot and few-shot settings. By training on our rich abstraction knowledge, we find LLMs can acquire basic abstraction abilities and generalize to unseen events. In the meantime, we empirically show that our benchmark is comprehensive to enhance LLMs across two previous abstraction tasks.
    摘要 研究表明人类智能中的抽象能力是非常重要的,但是这一点尚未得到充分的探索。在这篇论文中,我们介绍了一个名为AbsPyramid的抽象知识维度图,包含221K个文本描述。与现有资源不同,AbsPyramid不仅覆盖了简化事件中的名词或动词,而是收集了多元事件中的抽象知识,以全面评估语言模型在开放领域中的抽象能力。实验结果表明,现有的LLMs在零shot和几shot设定下面临着抽象知识的挑战。通过在我们的充足抽象知识上训练,我们发现LLMs可以学习基本的抽象能力,并在未经见过的事件上进行推断。同时,我们实验表明,我们的标准可以提高LLMs在两个之前的抽象任务中表现。

Temporal Knowledge Question Answering via Abstract Reasoning Induction

  • paper_url: http://arxiv.org/abs/2311.09149
  • repo_url: None
  • paper_authors: Ziyang Chen, Dongfang Li, Xiang Zhao, Baotian Hu, Min Zhang
  • for: 本研究旨在解决大语言模型(LLM)中的时间知识推理问题,这是LLM遇到的一个重要挑战,这些问题可能会导致LLM生成错误或误导信息,主要是因为它们的时间知识处理能力有限,同时复杂的时间逻辑也会带来问题。
  • methods: 我们提出了一种新的构建主义方法,它强调在LLM学习中实行持续的知识合成和个性化。我们的方法包括Abstract Reasoning Induction ARI框架,这个框架将时间推理分成两个不同阶段:知识无关和知识基础。这种分类目标在减少幻觉和提高LLM对抽象方法的应用。
  • results: 我们的方法在两个时间问答数据集上取得了显著改进,相对基线分别提升 29.7% 和 9.27%,表明该方法能够有效增强 LLM 的时间推理能力。代码将发布于 https://github.com/czy1999/ARI。
    Abstract In this paper, we tackle the significant challenge of temporal knowledge reasoning in Large Language Models (LLMs), an area where such models frequently encounter difficulties. These difficulties often result in the generation of misleading or incorrect information, primarily due to their limited capacity to process evolving factual knowledge and complex temporal logic. In response, we propose a novel, constructivism-based approach that advocates for a paradigm shift in LLM learning towards an active, ongoing process of knowledge synthesis and customization. At the heart of our proposal is the Abstract Reasoning Induction ARI framework, which divides temporal reasoning into two distinct phases: Knowledge-agnostic and Knowledge-based. This division aims to reduce instances of hallucinations and improve LLMs' capacity for integrating abstract methodologies derived from historical data. Our approach achieves remarkable improvements, with relative gains of 29.7\% and 9.27\% on two temporal QA datasets, underscoring its efficacy in advancing temporal reasoning in LLMs. The code will be released at https://github.com/czy1999/ARI.
    摘要 在这篇论文中,我们面临着大语言模型(LLM)中的时间知识推理挑战,这是 LLM 很频繁遇到的问题。这些问题经常导致 LLM 生成错误或误导信息,主要是因为它们对逐渐发展的事实知识和复杂的时间逻辑处理能力有限。为此,我们提出了一种新的建构主义方法,强调 LLM 学习 Should be an active, ongoing process of knowledge synthesis and customization。我们的提议的核心是抽象逻辑推理引入框架(ARI),将时间推理分为两个不同阶段:无知阶段和知识阶段。这种分类的目的是减少 LLM 生成幻见的情况,提高它们对历史数据 derivated 抽象方法的集成能力。我们的方法在两个时间问答 dataset 上显示了很大的改进,相对于基eline的提升率分别为 29.7% 和 9.27%,这证明了我们的方法在提高 LLM 中的时间推理能力具有效果。代码将在 GitHub 上发布,请参考 https://github.com/czy1999/ARI。

Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts

  • paper_url: http://arxiv.org/abs/2311.09127
  • repo_url: None
  • paper_authors: Yuanwei Wu, Xiang Li, Yixin Liu, Pan Zhou, Lichao Sun
  • for: 本研究的目的是探讨 Multimodal Large Language Models (MLLMs) 的安全问题,具体来说是通过对 GPT-4V 的系统提示泄露漏洞进行攻击,以及如何通过自我反对攻击(Self-Adversarial Attack via System Prompt,简称 SASP)方法来实现 MLLM 的破狱。
  • methods: 本研究使用了一种新的破狱攻击方法,即 SASP,该方法利用了 GPT-4 作为红人工具,通过对自己的系统提示进行攻击,以搜索可能的破狱提示。此外,为了提高攻击成功率,还添加了人工修改基于 GPT-4 的分析。
  • results: 本研究发现,适当设计和修改系统提示可以有效降低破狱攻击的成功率。
    Abstract Existing work on jailbreak Multimodal Large Language Models (MLLMs) has focused primarily on adversarial examples in model inputs, with less attention to vulnerabilities in model APIs. To fill the research gap, we carry out the following work: 1) We discover a system prompt leakage vulnerability in GPT-4V. Through carefully designed dialogue, we successfully steal the internal system prompts of GPT-4V. This finding indicates potential exploitable security risks in MLLMs; 2)Based on the acquired system prompts, we propose a novel MLLM jailbreaking attack method termed SASP (Self-Adversarial Attack via System Prompt). By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts. Furthermore, in pursuit of better performance, we also add human modification based on GPT-4's analysis, which further improves the attack success rate to 98.7\%; 3) We evaluated the effect of modifying system prompts to defend against jailbreaking attacks. Results show that appropriately designed system prompts can significantly reduce jailbreak success rates. Overall, our work provides new insights into enhancing MLLM security, demonstrating the important role of system prompts in jailbreaking, which could be leveraged to greatly facilitate jailbreak success rates while also holding the potential for defending against jailbreaks.
    摘要 现有研究对囚犯多Modal大型语言模型(MLLM)主要集中在输入例针对攻击,少量关注模型API的漏洞。为填补这 gap,我们实施以下工作:1. 我们发现了GPT-4V中的系统提示泄露漏洞。通过特殊的对话设计,我们成功夺取了GPT-4V的内部系统提示。这一发现表明MLLM可能存在潜在的可以利用的安全风险;2. 基于夺取的系统提示,我们提出了一种新的MLLM囚犯攻击方法,称为SASP(自我反对性攻击via系统提示)。通过使用GPT-4作为红色团队工具,我们尝试通过夺取的系统提示找到可能的囚犯提示。此外,为了提高攻击成功率,我们还添加了人工修改基于GPT-4的分析,这进一步提高了攻击成功率到98.7%;3. 我们评估了修改系统提示以防止囚犯攻击的效果。结果表明,适当设计的系统提示可以减少囚犯成功率。总的来说,我们的工作提供了新的思路来增强MLLM安全性,表明系统提示在囚犯中具有重要的作用,可以大大提高囚犯成功率,同时也有可能用于防止囚犯。

HEALNet – Hybrid Multi-Modal Fusion for Heterogeneous Biomedical Data

  • paper_url: http://arxiv.org/abs/2311.09115
  • repo_url: None
  • paper_authors: Konstantin Hemker, Nikola Smidjievski, Mateja Jamnik
  • for: This paper is written for researchers and practitioners in the field of multi-modal biomedical modelling, specifically those working with image, tabular, and graph data in medical applications.
  • methods: The Hybrid Early-fusion Attention Learning Network (HEALNet) architecture is used in this paper, which combines modality-specific architectures with cross-modal attention mechanisms to capture crucial cross-modal information and preserve modality-specific structural information.
  • results: The HEALNet architecture achieves state-of-the-art performance in multi-modal survival analysis on Whole Slide Images and Multi-omic data from four cancer cohorts in The Cancer Genome Atlas (TCGA), substantially improving over both uni-modal and recent multi-modal baselines, while being robust in scenarios with missing modalities.
    Abstract Technological advances in medical data collection such as high-resolution histopathology and high-throughput genomic sequencing have contributed to the rising requirement for multi-modal biomedical modelling, specifically for image, tabular, and graph data. Most multi-modal deep learning approaches use modality-specific architectures that are trained separately and cannot capture the crucial cross-modal information that motivates the integration of different data sources. This paper presents the Hybrid Early-fusion Attention Learning Network (HEALNet): a flexible multi-modal fusion architecture, which a) preserves modality-specific structural information, b) captures the cross-modal interactions and structural information in a shared latent space, c) can effectively handle missing modalities during training and inference, and d) enables intuitive model inspection by learning on the raw data input instead of opaque embeddings. We conduct multi-modal survival analysis on Whole Slide Images and Multi-omic data on four cancer cohorts of The Cancer Genome Atlas (TCGA). HEALNet achieves state-of-the-art performance, substantially improving over both uni-modal and recent multi-modal baselines, whilst being robust in scenarios with missing modalities.
    摘要 医疗数据采集技术的进步,例如高分辨率组织病理学图像和高通量基因组测序,使得对图像、表格和图结构数据进行多模态生物医学建模的需求不断增长。现有的多模态深度学习方法大多采用各模态独立训练的专用架构,难以捕捉促使整合不同数据源的关键跨模态信息。本文提出混合早期融合注意力学习网络(HEALNet):一种灵活的多模态融合架构,其特点包括:a) 保留各模态特有的结构信息;b) 在共享潜在空间中捕捉跨模态交互与结构信息;c) 能在训练和推断阶段有效处理缺失模态;d) 直接在原始数据输入上学习而非不透明的嵌入,从而便于直观的模型检查。我们在 TCGA(The Cancer Genome Atlas)的四个癌症队列上,基于全切片图像与多组学数据进行多模态生存分析。HEALNet 取得了最先进的性能,显著优于单模态及近期多模态基线,并在缺失模态的情形下保持稳健。
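
A toy PyTorch sketch of early fusion through cross-modal attention, in which a shared latent array attends over each modality's tokens in turn, is given below; the dimensions and two-modality setup are illustrative and this is not the HEALNet architecture.

```python
# Toy cross-modal fusion: a shared latent array attends over each modality's
# tokens in sequence. Dimensions and modality choices are invented.
import torch
import torch.nn as nn

class TinyCrossModalFusion(nn.Module):
    def __init__(self, dim=64, latent_tokens=8, heads=4):
        super().__init__()
        self.latent = nn.Parameter(torch.randn(latent_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)  # e.g. a survival-risk score

    def forward(self, modalities):
        # modalities: list of tensors, each (batch, tokens_m, dim); a missing
        # modality can simply be skipped, so inference still works without it.
        batch = modalities[0].shape[0]
        z = self.latent.unsqueeze(0).expand(batch, -1, -1)
        for tokens in modalities:
            z, _ = self.attn(query=z, key=tokens, value=tokens)
        return self.head(z.mean(dim=1))

model = TinyCrossModalFusion()
wsi_patches = torch.randn(2, 16, 64)   # image-derived tokens (placeholder)
omics_tokens = torch.randn(2, 10, 64)  # tabular/omics tokens (placeholder)
print(model([wsi_patches, omics_tokens]).shape)  # torch.Size([2, 1])
```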

Ever: Mitigating Hallucination in Large Language Models through Real-Time Verification and Rectification

  • paper_url: http://arxiv.org/abs/2311.09114
  • repo_url: None
  • paper_authors: Haoqiang Kang, Juntong Ni, Huaxiu Yao
  • for: 这个论文的目的是解决大语言模型(LLM)在生成文本时遇到的不准确或幻想内容问题。
  • methods: 该论文提出了一种实时验证和修正(Ever)方法,通过实时步骤的生成和幻想修正策略来检测和修正幻想错误。
  • results: 与基eline相比,Ever在多种任务上(包括短answer问题、生成传记和多步论证)表现出了显著的改善,能够生成可靠和事实正确的文本。
    Abstract Large Language Models (LLMs) have demonstrated remarkable proficiency in generating fluent text. However, they often encounter the challenge of generating inaccurate or hallucinated content. This issue is common in both non-retrieval-based generation and retrieval-augmented generation approaches, and existing post-hoc rectification methods may not address the accumulated hallucination errors that may be caused by the "snowballing" issue, especially in reasoning tasks. To tackle these challenges, we introduce a novel approach called Real-time Verification and Rectification (Ever). Instead of waiting until the end of the generation process to rectify hallucinations, Ever employs a real-time, step-wise generation and hallucination rectification strategy. The primary objective is to detect and rectify hallucinations as they occur during the text generation process. When compared to both retrieval-based and non-retrieval-based baselines, Ever demonstrates a significant improvement in generating trustworthy and factually accurate text across a diverse range of tasks, including short-form QA, biography generation, and multi-hop reasoning.
    摘要 大型语言模型(LLMs)在生成流畅文本方面表现出色,但经常产生不准确或幻觉内容。无论是非检索式生成还是检索增强式生成方法都存在这一问题,而现有的事后纠正方法可能无法处理由“滚雪球”效应累积的幻觉错误,在推理任务中尤为明显。为解决这些挑战,我们提出了实时验证与纠正(Ever)方法。Ever 不再等到生成结束后才纠正幻觉,而是采用实时、逐步的生成与幻觉纠正策略,其主要目标是在文本生成过程中即时检测并纠正幻觉。与检索式和非检索式基线相比,Ever 在短问答、传记生成和多跳推理等多种任务上均显著提升了生成文本的可信度与事实准确性。

  • paper_url: http://arxiv.org/abs/2311.09109
  • repo_url: None
  • paper_authors: Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe
  • for: 本研究旨在探讨PLM-based KGC方法是否能够真正进行推理,或者只是通过Memorization来获得高性能。
  • methods: 我们提出了一种Synthetic dataset construction方法,用于分析PLM-based KGC方法是否能够进行推理。
  • results: 我们发现,PLMs通过预训练获得了推理能力,尽管表现改进主要来自于实体和关系文本信息。
    Abstract Knowledge graphs (KGs) consist of links that describe relationships between entities. Due to the difficulty of manually enumerating all relationships between entities, automatically completing them is essential for KGs. Knowledge Graph Completion (KGC) is a task that infers unseen relationships between entities in a KG. Traditional embedding-based KGC methods, such as RESCAL, TransE, DistMult, ComplEx, RotatE, HAKE, HousE, etc., infer missing links using only the knowledge from training data. In contrast, the recent Pre-trained Language Model (PLM)-based KGC utilizes knowledge obtained during pre-training. Therefore, PLM-based KGC can estimate missing links between entities by reusing memorized knowledge from pre-training without inference. This approach is problematic because building KGC models aims to infer unseen links between entities. However, conventional evaluations in KGC do not consider inference and memorization abilities separately. Thus, a PLM-based KGC method, which achieves high performance in current KGC evaluations, may be ineffective in practical applications. To address this issue, we analyze whether PLM-based KGC methods make inferences or merely access memorized knowledge. For this purpose, we propose a method for constructing synthetic datasets specified in this analysis and conclude that PLMs acquire the inference abilities required for KGC through pre-training, even though the performance improvements mostly come from textual information of entities and relations.
    摘要 知识图(KG)由关系链描述实体之间的关系。由于手动列出所有实体间关系的困难,自动完成这些关系是知识图的关键。知识图完成任务(KGC)是尝试推断实体间未知的关系。传统的嵌入式KGC方法,如RESCAL、TransE、DistMult、ComplEx、RotatE、HAKE、HousE等,通过训练数据来INFER未知的关系。与此相反,最近的预训练语言模型(PLM)基于KGC利用预训练中获得的知识。因此,PLM基于KGC可以估计实体间未知的关系,而不需要INFER。这种方法存在问题,因为建立KGC模型的目标是INFER实体间未知的关系。然而,现有的KGC评价方法不会分开考虑推断和嵌入能力。因此,一个PLM基于KGC方法,即在当前KGC评价中具有高性能,可能在实际应用中效果不佳。为解决这个问题,我们分析PLM基于KGC方法是否进行推断或只是访问嵌入知识。为此,我们提出一种方法构建定制化的 sintetic dataset,并结论PLM在预训练中获得了推断能力,即使表现改进主要来自实体和关系的文本信息。

Towards A Unified View of Answer Calibration for Multi-Step Reasoning

  • paper_url: http://arxiv.org/abs/2311.09101
  • repo_url: None
  • paper_authors: Shumin Deng, Ningyu Zhang, Nay Oo, Bryan Hooi
  • for: 该论文旨在探讨以Chain-of-Thought(CoT)提示方法改进多步逻辑能力的大语言模型(LLMs)。
  • methods: 该论文梳理了近期的答案校准(answer calibration)策略,并提出一个统一视角将其联系起来,以便系统地考察多条推理路径上的步骤级与路径级答案校准策略。
  • results: 该论文从统一视角出发,对步骤级与路径级答案校准策略进行了系统评估,为利用答案校准优化多步推理提供了关键洞见。
    Abstract Large Language Models (LLMs) employing Chain-of-Thought (CoT) prompting have broadened the scope for improving multi-step reasoning capabilities. Usually, answer calibration strategies such as step-level or path-level calibration play a vital role in multi-step reasoning. While effective, there remains a significant gap in our understanding of the key factors that drive their success. In this paper, we break down the design of recent answer calibration strategies and present a unified view which establishes connections between them. We then conduct a thorough evaluation on these strategies from a unified view, systematically scrutinizing step-level and path-level answer calibration across multiple paths. Our study holds the potential to illuminate key insights for optimizing multi-step reasoning with answer calibration.
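
As a concrete instance of path-level calibration, the sketch below aggregates final answers from several sampled reasoning paths by (optionally weighted) voting, i.e. self-consistency; the sampled answers are hard-coded stand-ins for LLM outputs.

```python
# Minimal sketch of path-level answer calibration via self-consistency voting.
from collections import Counter

def self_consistency(final_answers, weights=None):
    weights = weights or [1.0] * len(final_answers)
    tally = Counter()
    for answer, w in zip(final_answers, weights):
        tally[answer] += w
    return tally.most_common(1)[0][0]

# Final answers extracted from five sampled chain-of-thought paths.
answers = ["42", "42", "41", "42", "40"]
print(self_consistency(answers))                              # plain vote
print(self_consistency(answers, [0.9, 0.8, 0.3, 0.7, 0.2]))   # weighted vote
```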

Can MusicGen Create Training Data for MIR Tasks?

  • paper_url: http://arxiv.org/abs/2311.09094
  • repo_url: None
  • paper_authors: Nadine Kroher, Helena Cuesta, Aggelos Pikrakis
  • for: 这个论文是为了研究基于AI生成音乐系统来生成用于音乐信息检索(MIR)任务的训练数据而写的。
  • methods: 该论文使用了MusicGen生成器生成了5个音乐种类的大量人工音乐样本,并使用了这些样本来训练一个类别模型。
  • results: 实验结果表明,提议的模型可以从人工音乐辑中学习到类别特征,并能够在真实音乐录音中Generalize well。
    Abstract We are investigating the broader concept of using AI-based generative music systems to generate training data for Music Information Retrieval (MIR) tasks. To kick off this line of work, we ran an initial experiment in which we trained a genre classifier on a fully artificial music dataset created with MusicGen. We constructed over 50 000 genre- conditioned textual descriptions and generated a collection of music excerpts that covers five musical genres. Our preliminary results show that the proposed model can learn genre-specific characteristics from artificial music tracks that generalise well to real-world music recordings.
    摘要 我们正在研究使用基于人工智能的生成音乐系统来生成听力音乐信息检索(MIR)任务的训练数据。为了开始这条工作,我们进行了一次初步实验,在我们训练了一个类别分类器的基础上,使用了MusicGen创建的完全人工音乐数据集。我们构建了50000多个频道条件的文本描述,并生成了涵盖五种音乐类型的音乐片断集。我们的初步结果表明,我们的提议的模型可以从人工音乐追踪中学习类别特征,这些特征可以通过实际音乐录音来泛化。

The Uli Dataset: An Exercise in Experience Led Annotation of oGBV

  • paper_url: http://arxiv.org/abs/2311.09086
  • repo_url: None
  • paper_authors: Arnav Arora, Maha Jinadoss, Cheshta Arora, Denny George, Brindaalakshmi, Haseena Dawood Khan, Kirti Rawat, Div, Ritash, Seema Mathur, Shivani Yadav, Shehla Rashid Shora, Rie Raut, Sumit Pawar, Apurva Paithane, Sonia, Vivek, Dharini Priscilla, Khairunnisha, Grace Banu, Ambika Tandon, Rishav Thakker, Rahul Dev Korra, Aatman Vaidya, Tarunima Prabhakar
  • for: 这个论文目的是为了提供一个语言特定和上下文相关的 dataset,以便开发自动识别 hate speech 和 gendered abuse 的 AI 系统。
  • methods: 这个论文使用了 Twitter 上的 tweets,并将其分类为三个问题:对于 gender abuse 的经历,由女性或 LGBTQIA 社区成员领导的专家进行标注。
  • results: 通过这个 dataset,研究人员展示了一种参与式的方法来创建 dataset,并通过这些 dataset 驱动 AI 系统。
    Abstract Online gender based violence has grown concomitantly with adoption of the internet and social media. Its effects are worse in the Global majority where many users use social media in languages other than English. The scale and volume of conversations on the internet has necessitated the need for automated detection of hate speech, and more specifically gendered abuse. There is, however, a lack of language specific and contextual data to build such automated tools. In this paper we present a dataset on gendered abuse in three languages- Hindi, Tamil and Indian English. The dataset comprises of tweets annotated along three questions pertaining to the experience of gender abuse, by experts who identify as women or a member of the LGBTQIA community in South Asia. Through this dataset we demonstrate a participatory approach to creating datasets that drive AI systems.
    摘要 随着互联网和社交媒体的普及,网络性别暴力也随之增长,其影响在全球多数地区尤为严重,因为许多用户使用英语以外的语言进行社交。互联网上对话的规模与数量使得自动检测仇恨言论、尤其是针对性别的辱骂成为必要。然而,目前缺乏面向特定语言和具体语境的数据来构建此类自动化工具。本文提出了一个涵盖印地语、泰米尔语和印度英语三种语言的性别辱骂数据集。该数据集由南亚地区自认同为女性或 LGBTQIA 社区成员的专家,围绕性别辱骂经历相关的三个问题对推文进行标注。通过该数据集,我们展示了一种以参与式方法构建数据集、进而驱动 AI 系统的途径。

How Multilingual is Multilingual LLM?

  • paper_url: http://arxiv.org/abs/2311.09071
  • repo_url: None
  • paper_authors: Fei Yuan, Shuai Yuan, Zhiyong Wu, Lei Li
  • for: 这项研究旨在评估大语言模型(LLMs)在101种语言中的多语言能力,并将语言分为四个不同的 quadrant,以便更好地了解这些语言的特点和 optimize their performance.
  • methods: 研究使用了现有的 LLMs,并通过对这些模型进行调整和训练来提高其多语言能力。
  • results: 研究发现,现有的 LLMs 在101种语言中的多语言能力比预期更高,并且可以通过对每个 quadrant 的特点进行调整来进一步提高多语言性能。
    Abstract Large Language Models (LLMs), trained predominantly on extensive English data, often exhibit limitations when applied to other languages. Current research is primarily focused on enhancing the multilingual capabilities of these models by employing various tuning strategies. Despite their effectiveness in certain languages, the understanding of the multilingual abilities of LLMs remains incomplete. This study endeavors to evaluate the multilingual capacity of LLMs by conducting an exhaustive analysis across 101 languages, and classifies languages with similar characteristics into four distinct quadrants. By delving into each quadrant, we shed light on the rationale behind their categorization and offer actionable guidelines for tuning these languages. Extensive experiments reveal that existing LLMs possess multilingual capabilities that surpass our expectations, and we can significantly improve the multilingual performance of LLMs by focusing on these distinct attributes present in each quadrant.
    摘要 大型语言模型(LLM),通常在广泛的英语数据上训练,在其他语言上表现有限。当前的研究主要集中在提高 LLM 的多语言能力,使用不同的调整策略。虽然在某些语言上有效,但我们对 LLM 的多语言能力的理解仍然不够完整。这项研究尝试对 101 种语言进行了全面的分析,并将语言分为四个不同的方块。我们对每个方块进行了详细的分析,并提供了改进 LLM 的多语言性表现的实用指南。广泛的实验表明,现有的 LLM 在多语言方面的能力超出了我们的预期,并且可以通过对每个方块的特点进行调整来进一步提高多语言性表现。

How Well Do Large Language Models Truly Ground?

  • paper_url: http://arxiv.org/abs/2311.09069
  • repo_url: None
  • paper_authors: Hyunji Lee, Sejune Joo, Chaeeun Kim, Joel Jang, Doyoung Kim, Kyoung-Woon On, Minjoon Seo
  • for: 这paper aims to improve the reliability and controllability of Large Language Models (LLMs) by introducing a stricter definition of grounding and developing a new dataset and metric to assess it.
  • methods: 该paper uses a new dataset and a grounding metric to evaluate the grounding capabilities of 13 different LLMs of various sizes and training methods.
  • results: 研究发现,现有的知识增强模型通常只关注response中是否包含正确答案,而忽略了response的可靠性和可控性。新的定义和 metric 能够评估模型是否真正基于知识进行回答,并提供了更多的信息来改进模型的可靠性和可控性。
    Abstract Reliance on the inherent knowledge of Large Language Models (LLMs) can cause issues such as hallucinations, lack of control, and difficulties in integrating variable knowledge. To mitigate this, LLMs can be probed to generate responses by grounding on external context, often given as input (knowledge-augmented models). Yet, previous research is often confined to a narrow view of the term "grounding", often only focusing on whether the response contains the correct answer or not, which does not ensure the reliability of the entire response. To address this limitation, we introduce a strict definition of grounding: a model is considered truly grounded when its responses (1) fully utilize necessary knowledge from the provided context, and (2) don't exceed the knowledge within the contexts. We introduce a new dataset and a grounding metric to assess this new definition and perform experiments across 13 LLMs of different sizes and training methods to provide insights into the factors that influence grounding performance. Our findings contribute to a better understanding of how to improve grounding capabilities and suggest an area of improvement toward more reliable and controllable LLM applications.
    摘要 依赖大语言模型(LLM)的内在知识可能会导致问题,如幻觉、无控和变量知识的集成问题。为了解决这些问题,LLM可以通过附加外部 контекст进行探索,并生成响应(知识增强型模型)。然而,过去的研究通常受限于“安全”的定义,即判断响应是否包含正确的答案,这并不能 garantuee 整个响应的可靠性。为了解决这些限制,我们提出了一个严格的定义:一个模型被 considere 为真正地附加了知识,当其响应(1)完全利用提供的 контекст中的所有必要知识,(2)不超过 контекст中的知识。我们介绍了一个新的数据集和附加 metric,以评估这个新定义,并在13种不同的 LLM 中进行了实验,以提供关于如何提高附加能力的深入了解和建议。我们的发现可以帮助改善 LLM 的可靠性和控制性,并且建议一个可以提高 LLM 应用的方向。

Learning Fair Division from Bandit Feedback

  • paper_url: http://arxiv.org/abs/2311.09068
  • repo_url: None
  • paper_authors: Hakuei Yamada, Junpei Komiyama, Kenshi Abe, Atsushi Iwasaki
  • for: 这篇论文研究在不确定条件下的在线公平分配:中央规划者在不确切知道代理人价值(效用)的情况下依次分配物品。
  • methods: 我们引入基于双重平均(dual averaging)的包装器算法,通过赌博机反馈逐步学习到达物品的类型分布以及代理人的价值。
  • results: 我们证明所提算法能够在代理人具有可加效用的线性 Fisher 市场中渐近地达到最优的纳什社会福利,并给出了相应的 regret 界;我们还在合成数据与真实数据上验证了算法的优越性能。
    Abstract This work addresses learning online fair division under uncertainty, where a central planner sequentially allocates items without precise knowledge of agents' values or utilities. Departing from conventional online algorithm, the planner here relies on noisy, estimated values obtained after allocating items. We introduce wrapper algorithms utilizing \textit{dual averaging}, enabling gradual learning of both the type distribution of arriving items and agents' values through bandit feedback. This approach enables the algorithms to asymptotically achieve optimal Nash social welfare in linear Fisher markets with agents having additive utilities. We establish regret bounds in Nash social welfare and empirically validate the superior performance of our proposed algorithms across synthetic and empirical datasets.

In-vehicle Sensing and Data Analysis for Older Drivers with Mild Cognitive Impairment

  • paper_url: http://arxiv.org/abs/2311.09273
  • repo_url: None
  • paper_authors: Sonia Moshfeghi, Muhammad Tanveer Jan, Joshua Conniff, Seyedeh Gol Ara Ghoreishi, Jinwoo Jang, Borko Furht, Kwangsoo Yang, Monica Rosselli, David Newman, Ruth Tappen, Dana Smith
  • for: 这项研究旨在设计低成本的车载传感硬件,在日常驾驶环境中采集高精度定位与远程信息(telematics)数据,并利用机器学习方法在真实的日常驾驶条件下早期发现认知障碍的预警迹象。
  • methods: 研究使用低成本车载设备采集高精度定位与远程信息数据,识别反映早期认知变化的重要指标,并以机器学习方法检测轻度认知障碍(MCI)的早期迹象。
  • results: 统计分析表明,患有 MCI 的老年驾驶者比无 MCI 的驾驶者表现出更平稳、更安全的驾驶模式,说明他们对自身状况有所觉察并倾向于避免激进驾驶行为;随机森林模型还识别出夜间出行次数、总出行次数和教育水平是最重要的影响因素。
    Abstract Driving is a complex daily activity indicating age and disease related cognitive declines. Therefore, deficits in driving performance compared with ones without mild cognitive impairment (MCI) can reflect changes in cognitive functioning. There is increasing evidence that unobtrusive monitoring of older adults driving performance in a daily-life setting may allow us to detect subtle early changes in cognition. The objectives of this paper include designing low-cost in-vehicle sensing hardware capable of obtaining high-precision positioning and telematics data, identifying important indicators for early changes in cognition, and detecting early-warning signs of cognitive impairment in a truly normal, day-to-day driving condition with machine learning approaches. Our statistical analysis comparing drivers with MCI to those without reveals that those with MCI exhibit smoother and safer driving patterns. This suggests that drivers with MCI are cognizant of their condition and tend to avoid erratic driving behaviors. Furthermore, our Random Forest models identified the number of night trips, number of trips, and education as the most influential factors in our data evaluation.
    摘要 驾驶是一项复杂的日常活动,表征年龄和疾病相关的认知下降。因此,与无明遇病患(MCI)相比,驾驶性能下降的差异可能反映认知功能的变化。有增加证据表明,在日常生活环境中不侵入式监测老年人驾驶行为可能有助于早期发现轻度认知障碍。本文的目标包括设计低成本的汽车内部感知硬件,获得高精度的位置定位和通信数据,确定重要的认知变化指标,并使用机器学习方法探测日常驾驶中的认知障碍警示。我们的统计分析表明,与MCI相比,有MCI的 Driver exhibit更稳定和更安全的驾驶模式。这表明,有MCI的 Driver 意识到自己的状况,并尽可能避免异常的驾驶行为。此外,我们的Random Forest模型确定了夜间行驶次数、总行驶次数和教育水平是我们数据评估中最重要的因素。
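
The feature-importance analysis described here can be illustrated with a scikit-learn random forest on synthetic per-driver features; the feature names and data below are invented placeholders, not the study's measurements.

```python
# Sketch: fit a random forest on per-driver features and inspect importances.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["num_trips", "num_night_trips", "education_years",
                 "mean_speed", "hard_brakes_per_trip"]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, len(feature_names)))   # synthetic placeholder data
y = rng.integers(0, 2, size=200)                 # 1 = MCI, 0 = control

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, importance in sorted(zip(feature_names, clf.feature_importances_),
                               key=lambda t: -t[1]):
    print(f"{name:22s} {importance:.3f}")
```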

Assessing Knowledge Editing in Language Models via Relation Perspective

  • paper_url: http://arxiv.org/abs/2311.09053
  • repo_url: https://github.com/weiyifan1023/knowledge-edit-based-on-relation-perspective
  • paper_authors: Yifan Wei, Xiaoyan Yu, Huanhuan Ma, Fangyu Lei, Yixuan Weng, Ran Song, Kang Liu
  • for: 本研究旨在修改大语言模型中的事实知识,并 investigate relation-centric 知识编辑方法的可行性。
  • methods: 本研究使用了一个新的benchmark名为RaKE,用于评估relation based知识编辑方法。还进行了多种知识编辑基线的比较实验,以及对 transformer 中关系知识的深入研究。
  • results: 研究结果表明,现有的知识编辑方法在编辑关系上存在潜在的困难,而且关系知识不仅存储在FFN网络中,还存储在注意层中。这些结果为未来的relation-based知识编辑方法提供了实验支持。
    Abstract Knowledge Editing (KE) for modifying factual knowledge in Large Language Models (LLMs) has been receiving increasing attention. However, existing knowledge editing methods are entity-centric, and it is unclear whether this approach is suitable for a relation-centric perspective. To address this gap, this paper constructs a new benchmark named RaKE, which focuses on Relation based Knowledge Editing. In this paper, we establish a suite of innovative metrics for evaluation and conduct comprehensive experiments involving various knowledge editing baselines. We notice that existing knowledge editing methods exhibit the potential difficulty in their ability to edit relations. Therefore, we further explore the role of relations in factual triplets within the transformer. Our research results confirm that knowledge related to relations is not only stored in the FFN network but also in the attention layers. This provides experimental support for future relation-based knowledge editing methods.
    摘要 大型语言模型(LLM)中的知识编辑(KE)已经获得了增加的注意。然而,现有的知识编辑方法都是基于实体中心的,而不是关系中心的。为了填补这个差距,本文建立了一个新的benchmark名为RaKE,它专注于关系基本知识编辑。本文提出了一个创新的评估标准和进行了各种知识编辑基线的广泛实验。我们发现现有的知识编辑方法对于修改关系表现出了潜在的问题。因此,我们进一步探索关系在简单 triplets 中的知识是如何储存和处理的。我们的研究结果显示,关系知识不仅在 FFN 网络中储存,还在注意层中储存。这给了未来关系基本知识编辑方法的实验支持。

Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts

  • paper_url: http://arxiv.org/abs/2311.09050
  • repo_url: https://github.com/ecnu-dase-nlp/rqp
  • paper_authors: Yunshi Lan, Xiang Li, Xin Liu, Yang Li, Wei Qin, Weining Qian
  • for: 本研究旨在提高零shot情境下的视觉问答系统(VQA)的性能,通过帮助大语言模型(LLMs)更好地理解和回答问题。
  • methods: 我们提出了一种新的问题提示方法,即理解问题提示(Reasoning Question Prompts,RQP),可以让LLMs更好地理解和回答问题。RQP通过一个不supervised的问题编辑模块生成了每个问题的自 contenido问题,以便更好地指导LLMs回答问题。
  • results: 我们在三个VQA挑战中测试了RQP方法,结果表明,RQP可以在零shot情境下显著提高LLMs的性能,并在四个数据集中超越现有的零shot方法。我们的源代码已经公开在GitHub上(https://github.com/ECNU-DASE-NLP/RQP)。
    Abstract Zero-shot Visual Question Answering (VQA) is a prominent vision-language task that examines both the visual and textual understanding capability of systems in the absence of training data. Recently, by converting the images into captions, information across multi-modalities is bridged and Large Language Models (LLMs) can apply their strong zero-shot generalization capability to unseen questions. To design ideal prompts for solving VQA via LLMs, several studies have explored different strategies to select or generate question-answer pairs as the exemplar prompts, which guide LLMs to answer the current questions effectively. However, they totally ignore the role of question prompts. The original questions in VQA tasks usually encounter ellipses and ambiguity which require intermediate reasoning. To this end, we present Reasoning Question Prompts for VQA tasks, which can further activate the potential of LLMs in zero-shot scenarios. Specifically, for each question, we first generate self-contained questions as reasoning question prompts via an unsupervised question edition module considering sentence fluency, semantic integrity and syntactic invariance. Each reasoning question prompt clearly indicates the intent of the original question. This results in a set of candidate answers. Then, the candidate answers associated with their confidence scores acting as answer heuristics are fed into LLMs and produce the final answer. We evaluate reasoning question prompts on three VQA challenges, experimental results demonstrate that they can significantly improve the results of LLMs on zero-shot setting and outperform existing state-of-the-art zero-shot methods on three out of four data sets. Our source code is publicly released at \url{https://github.com/ECNU-DASE-NLP/RQP}.
    摘要 zero-shot视觉问答(VQA)是一种引人注目的视觉语言任务,它检验系统在不受训练数据的情况下,对图像和文本之间的理解能力。最近,通过将图像转换为caption,使得多 modalities之间的信息相互汇流,大型自然语言模型(LLMs)可以通过未经训练的情况下,对未经见过的问题进行有效的回答。为了设计理想的提问方法,许多研究已经探索了不同的策略来选择或生成问题答对,作为示例提问。然而,它们完全忽视了提问的角色。原始的VQA任务中的问题通常会遇到斜杠和混乱,需要中间的推理。为此,我们提出了视觉问答推理提问(RQP),可以further activate LLMs的零shot能力。具体来说,为每个问题,我们首先通过不supervised问题编辑模块生成自包含的推理提问,考虑语言流畅性、意义完整性和语法不变性。每个推理提问都能够明确表达问题的意图。这些推理提问的候选答案与其自信度分数 acting as answer heuristics被 fed into LLMs,并生成最终的答案。我们在三个VQA挑战中评估了推理提问,实验结果表明,它们可以在零shot Setting下significantly提高LLMs的表现,并在四个数据集中超越现有的零shot方法。我们的源代码公开release于\url{https://github.com/ECNU-DASE-NLP/RQP}.

MELA: Multilingual Evaluation of Linguistic Acceptability

  • paper_url: http://arxiv.org/abs/2311.09033
  • repo_url: None
  • paper_authors: Ziyin Zhang, Yikang Liu, Weifang Huang, Junyu Mao, Rui Wang, Hai Hu
  • for: 本研究的目的是提供一个多语言的语言模型评估 benchmark,以evaluate 不同语言模型在语言学可接受性方面的表现。
  • methods: 本研究使用了多种语言模型,包括ChatGPT和XLM-R,并进行了过程学习和多任务学习。同时,研究者们还使用了层 wise probing 来分析 XLM-R 的 weights 是如何影响其在不同语言之间的推理能力。
  • results: 研究结果显示,XLM-R 在 zero-shot Setting 中可以达到与 fine-tuned XLM-R 相当的性能,而 ChatGPT 则需要在 Context 中提供示例来改善其性能。同时,研究者们还发现了一些语言之间的推理困难,并提出了一种” conflicting weight” 的概念来描述这种现象。
    Abstract Recent benchmarks for Large Language Models (LLMs) have mostly focused on application-driven tasks such as complex reasoning and code generation, and this has led to a scarcity in purely linguistic evaluation of LLMs. Against this background, we introduce Multilingual Evaluation of Linguistic Acceptability -- MELA, the first multilingual benchmark on linguistic acceptability with 48K samples covering 10 languages from a diverse set of language families. We establish baselines of commonly used LLMs along with supervised models, and conduct cross-lingual transfer and multi-task learning experiments with XLM-R. In pursuit of multilingual interpretability, we analyze the weights of fine-tuned XLM-R to explore the possibility of identifying transfer difficulty between languages. Our results show that ChatGPT benefits much from in-context examples but still lags behind fine-tuned XLM-R, while the performance of GPT-4 is on par with fine-tuned XLM-R even in zero-shot setting. Cross-lingual and multi-task learning experiments show that unlike semantic tasks, in-language training data is crucial in acceptability judgements. Results in layerwise probing indicate that the upper layers of XLM-R become a task-specific but language-agnostic region for multilingual acceptability judgment. We also introduce the concept of conflicting weight, which could be a potential indicator for the difficulty of cross-lingual transfer between languages. Our data will be available at https://github.com/sjtu-compling/MELA.
    摘要 近期大语言模型(LLM)的 benchmark 主要集中在应用驱动的任务上,如复杂的理解和代码生成,这导致了对 LLM 的纯语言评估的缺乏。为了解决这问题,我们介绍了多语言评估语言可接受性(MELA),这是一个包含 48K 个样本,覆盖 10 种语言家族的多语言 benchmark。我们建立了常用的 LLG 基elines,以及supervised 模型的基elines,并进行了跨语言传播和多任务学习实验。在追求多语言可读性的探索中,我们分析了精心调整的 XLM-R 的权重,以探索语言之间传播困难的可能性。我们的结果显示,ChatGPT 受到上下文例子的启发,但仍落后于精心调整的 XLM-R,而 GPT-4 在零shot 设定下与精心调整的 XLM-R 的性能相当。跨语言和多任务学习实验表明,与 semantic 任务不同,在语言上的培训数据是关键在 acceptability 判断中。层wise probing 结果表明,XLM-R 的Upper层变成了多语言可接受性的任务特定 yet language-agnostic 区域。我们还引入了 conflicting weight 概念,它可能是跨语言传播之间语言的难度指标。我们的数据将在 GitHub 上发布。

Assessing the Robustness of Intelligence-Driven Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.09027
  • repo_url: None
  • paper_authors: Lorenzo Nodari, Federico Cerutti
  • for: This paper focuses on the problem of robustness in intelligence-driven reinforcement learning, specifically in military contexts where high stakes and uncertainty are prevalent.
  • methods: The paper employs reward machines to express complex reward structures in RL tasks, and explores the need for further research in evidential reasoning and learning to improve the robustness of current state-of-the-art reinforcement learning approaches.
  • results: The preliminary results presented in the paper suggest the need for further research to harden current RL approaches before they can be considered mission-critical-ready.
    Abstract Robustness to noise is of utmost importance in reinforcement learning systems, particularly in military contexts where high stakes and uncertain environments prevail. Noise and uncertainty are inherent features of military operations, arising from factors such as incomplete information, adversarial actions, or unpredictable battlefield conditions. In RL, noise can critically impact decision-making, mission success, and the safety of personnel. Reward machines offer a powerful tool to express complex reward structures in RL tasks, enabling the design of tailored reinforcement signals that align with mission objectives. This paper considers the problem of the robustness of intelligence-driven reinforcement learning based on reward machines. The preliminary results presented suggest the need for further research in evidential reasoning and learning to harden current state-of-the-art reinforcement learning approaches before being mission-critical-ready.
    摘要 在强化学习系统中,对噪声的鲁棒性至关重要,在高风险、高不确定性的军事场景中尤其如此。噪声与不确定性是军事行动的固有特征,来源于信息不完整、对抗行为或难以预测的战场环境等因素。在强化学习中,噪声可能严重影响决策、任务成败以及人员安全。奖励机(reward machine)为在强化学习任务中表达复杂奖励结构提供了有力工具,使得可以设计与任务目标相一致的定制化强化信号。本文研究基于奖励机的智能驱动强化学习的鲁棒性问题。初步结果表明,在将现有最先进的强化学习方法用于关键任务之前,仍需在证据推理与学习方面开展进一步研究以增强其鲁棒性。
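
For readers unfamiliar with reward machines, the minimal sketch below encodes one as a finite automaton whose transitions fire on high-level propositions and emit rewards; the example task is invented and unrelated to the paper's experiments.

```python
# Minimal reward machine sketch: states, proposition-triggered transitions,
# and per-transition rewards. The "key then goal" task is an invented example.
class RewardMachine:
    def __init__(self, initial_state, transitions):
        # transitions: {(state, proposition): (next_state, reward)}
        self.state = initial_state
        self.transitions = transitions

    def step(self, proposition):
        next_state, reward = self.transitions.get(
            (self.state, proposition), (self.state, 0.0))
        self.state = next_state
        return reward

rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "got_key"): ("u1", 0.0),
        ("u1", "at_goal"): ("u_acc", 1.0),
    },
)
for prop in ["at_goal", "got_key", "at_goal"]:
    print(prop, "->", rm.step(prop), rm.state)
```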

Identification and Estimation for Nonignorable Missing Data: A Data Fusion Approach

  • paper_url: http://arxiv.org/abs/2311.09015
  • repo_url: None
  • paper_authors: Zixiao Wang, AmirEmad Ghassami, Ilya Shpitser
  • for: identifying and estimating a parameter of interest in settings where data is missing not at random (MNAR)
  • methods: inspired by data fusion, using information in an MNAR dataset and an auxiliary dataset subject to missingness at random (MAR)
  • results: can identify the parameter of interest given pooled data, under two complementary sets of assumptions; derived an inverse probability weighted (IPW) estimator for identified parameters, and evaluated the performance of the estimation strategies via simulation studies
    Abstract We consider the task of identifying and estimating a parameter of interest in settings where data is missing not at random (MNAR). In general, such parameters are not identified without strong assumptions on the missing data model. In this paper, we take an alternative approach and introduce a method inspired by data fusion, where information in an MNAR dataset is augmented by information in an auxiliary dataset subject to missingness at random (MAR). We show that even if the parameter of interest cannot be identified given either dataset alone, it can be identified given pooled data, under two complementary sets of assumptions. We derive an inverse probability weighted (IPW) estimator for identified parameters, and evaluate the performance of our estimation strategies via simulation studies.
    摘要 我们研究在数据非随机缺失(MNAR)情形下识别并估计感兴趣参数的问题。一般而言,若不对缺失数据模型施加较强假设,这类参数无法被识别。本文采取另一种思路,受数据融合启发,提出一种方法:利用一个满足随机缺失(MAR)假设的辅助数据集来补充 MNAR 数据集中的信息。我们证明,即使仅凭任一数据集都无法识别目标参数,在两组互补的假设下,基于合并数据仍可将其识别。我们为已识别的参数推导了逆概率加权(IPW)估计量,并通过模拟研究评估了所提估计策略的性能。
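
A toy version of the IPW idea under MAR-style missingness is sketched below: fit a propensity model for being observed, then reweight observed outcomes; the paper's estimator additionally pools the MNAR and auxiliary MAR datasets, which this example does not reproduce.

```python
# Toy IPW mean under MAR missingness: estimate observation propensities and
# reweight observed outcomes (Hajek-style estimator). Data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 1))                    # fully observed covariate
y = 2.0 + 1.5 * x[:, 0] + rng.normal(size=1000)   # outcome (true mean = 2.0)
p_obs = 1 / (1 + np.exp(-(0.5 + x[:, 0])))        # observation depends on x only
observed = rng.random(1000) < p_obs

propensity = LogisticRegression().fit(x, observed).predict_proba(x)[:, 1]
weights = observed / propensity                    # zero for missing rows
ipw_mean = np.sum(weights * np.where(observed, y, 0.0)) / np.sum(weights)
print(f"naive mean of observed y: {y[observed].mean():.3f}")
print(f"IPW-adjusted mean:        {ipw_mean:.3f}")
```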

Adversarial Attacks to Reward Machine-based Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.09014
  • repo_url: None
  • paper_authors: Lorenzo Nodari
  • for: 本研究旨在提供首个对奖金机制(RM)基于 reinforcement learning 技术的安全性分析,以便更好地理解和提高这种技术在不良场景下的稳定性。
  • methods: 本研究使用 blinding attacks 这种新的攻击方法,以评估 RM-based reinforcement learning 技术的安全性。
  • results: 研究发现,blinding attacks 可以成功地破坏 RM-based reinforcement learning 技术的安全性,并提供了一种新的攻击方法来攻击这种技术。
    Abstract In recent years, Reward Machines (RMs) have stood out as a simple yet effective automata-based formalism for exposing and exploiting task structure in reinforcement learning settings. Despite their relevance, little to no attention has been directed to the study of their security implications and robustness to adversarial scenarios, likely due to their recent appearance in the literature. With my thesis, I aim to provide the first analysis of the security of RM-based reinforcement learning techniques, with the hope of motivating further research in the field, and I propose and evaluate a novel class of attacks on RM-based techniques: blinding attacks.

Leveraging AI for Natural Disaster Management : Takeaways From The Moroccan Earthquake

  • paper_url: http://arxiv.org/abs/2311.08999
  • repo_url: None
  • paper_authors: Morocco Solidarity Hackathon
  • for: 这篇论文主要是为了探讨在2023年阿哈鲁兹地震后,全球灾害管理策略的批判性反思,以及使用人工智能(AI)提高灾害准备、应急回应和恢复的技术。
  • methods: 这篇论文使用了全面的文献综述、赢得项目概述、关键发现和挑战,包括实时开源数据、数据缺乏和交叉学科合作的障碍。
  • results: 这篇论文得到了许多关键发现和挑战,包括实时开源数据的潜在价值、数据缺乏的问题和交叉学科合作的障碍。同时,论文还发起了社区呼吁,呼吁更多的行业专家和学者参与到灾害管理领域的研究和实践中来。
    Abstract The devastating 6.8-magnitude earthquake in Al Haouz, Morocco in 2023 prompted critical reflections on global disaster management strategies, resulting in a post-disaster hackathon, using artificial intelligence (AI) to improve disaster preparedness, response, and recovery. This paper provides (i) a comprehensive literature review, (ii) an overview of winning projects, (iii) key insights and challenges, namely real-time open-source data, data scarcity, and interdisciplinary collaboration barriers, and (iv) a community-call for further action.
    摘要 2023年摩洛哥 Al Haouz 发生的6.8级毁灭性地震引发了对全球灾害管理策略的深刻反思,并促成了一场灾后黑客马拉松,利用人工智能(AI)改进灾害的备灾、应急响应与恢复。本文提供:(i) 全面的文献综述;(ii) 获奖项目概述;(iii) 关键发现与挑战,即实时开源数据、数据匮乏以及跨学科协作障碍;(iv) 呼吁社区采取进一步行动。

When does In-context Learning Fall Short and Why? A Study on Specification-Heavy Tasks

  • paper_url: http://arxiv.org/abs/2311.08993
  • repo_url: None
  • paper_authors: Hao Peng, Xiaozhi Wang, Jianhui Chen, Weikai Li, Yunjia Qi, Zimu Wang, Zhili Wu, Kaisheng Zeng, Bin Xu, Lei Hou, Juanzi Li
  • for: 本文旨在探讨大语言模型(LLM)在启发式学习(ICL)方法下的局限性,以及这些局限性的根本原因。
  • methods: 作者通过对18种特有的任务进行广泛的实验,发现ICL在处理这些任务时存在三个主要的原因:无法具体地理解上下文,任务架构理解与人类不一致,以及缺乏长文理解能力。
  • results: 研究发现,通过细化调教,LLM可以在这些任务上达到不错的性能,这表明ICL的失败不是LLM的内在缺陷,而是现有的对齐方法的不足,使LLM无法处理复杂的规则繁残任务。
    Abstract In-context learning (ICL) has become the default method for using large language models (LLMs), making the exploration of its limitations and understanding the underlying causes crucial. In this paper, we find that ICL falls short of handling specification-heavy tasks, which are tasks with complicated and extensive task specifications, requiring several hours for ordinary humans to master, such as traditional information extraction tasks. The performance of ICL on these tasks mostly cannot reach half of the state-of-the-art results. To explore the reasons behind this failure, we conduct comprehensive experiments on 18 specification-heavy tasks with various LLMs and identify three primary reasons: inability to specifically understand context, misalignment in task schema comprehension with humans, and inadequate long-text understanding ability. Furthermore, we demonstrate that through fine-tuning, LLMs can achieve decent performance on these tasks, indicating that the failure of ICL is not an inherent flaw of LLMs, but rather a drawback of existing alignment methods that renders LLMs incapable of handling complicated specification-heavy tasks via ICL. To substantiate this, we perform dedicated instruction tuning on LLMs for these tasks and observe a notable improvement. We hope the analyses in this paper could facilitate advancements in alignment methods enabling LLMs to meet more sophisticated human demands.
    摘要 上下文学习(ICL)已成为使用大语言模型(LLM)的默认方法,因此探索其局限性并理解其背后原因变得非常重要。在这篇论文中,我们发现ICL在规范繁重的任务上表现不佳,这类任务具有复杂而繁多的任务规范,普通人往往需要数小时才能掌握,例如传统的信息抽取任务。ICL在这些任务上的性能大多无法达到最先进水平的一半。为了探究失败的原因,我们在18个规范繁重的任务上用多种LLM进行了全面实验,并确定了三个主要原因:无法准确理解上下文、任务模式理解与人类不一致,以及长文本理解能力不足。此外,我们还证明通过微调,LLM可以在这些任务上取得不错的表现,这说明ICL的失败并非LLM的内在缺陷,而是现有对齐方法的缺陷,使LLM无法通过ICL处理复杂的规范繁重任务。为了证明这一点,我们对LLM针对这些任务进行了专门的指令微调,并观察到明显的改进。我们希望本文的分析能够推动对齐方法的进步,使LLM能够满足更复杂的人类需求。

Proceedings Fifth International Workshop on Formal Methods for Autonomous Systems

  • paper_url: http://arxiv.org/abs/2311.08987
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Marie Farrell, Matt Luckcuck, Mario Gleirscher, Maike Schwammberger
  • for: 本研讨会论文集是为了形式方法与自主系统之间的研究提供一个发表平台。
  • methods: 本研讨会共收到25篇投稿,其中包括11篇正式论文、3篇经验报告、6篇研究预览和5篇愿景论文。
  • results: 经审核后,本研讨会接受了15篇论文,包括8篇长篇论文和7篇短篇论文。
    Abstract This EPTCS volume contains the proceedings for the Fifth International Workshop on Formal Methods for Autonomous Systems (FMAS 2023), which was held on the 15th and 16th of November 2023. FMAS 2023 was co-located with 18th International Conference on integrated Formal Methods (iFM) (iFM'22), organised by Leiden Institute of Advanced Computer Science of Leiden University. The workshop itself was held at Scheltema Leiden, a renovated 19th Century blanket factory alongside the canal. FMAS 2023 received 25 submissions. We received 11 regular papers, 3 experience reports, 6 research previews, and 5 vision papers. The researchers who submitted papers to FMAS 2023 were from institutions in: Australia, Canada, Colombia, France, Germany, Ireland, Italy, the Netherlands, Sweden, the United Kingdom, and the United States of America. Increasing our number of submissions for the third year in a row is an encouraging sign that FMAS has established itself as a reputable publication venue for research on the formal modelling and verification of autonomous systems. After each paper was reviewed by three members of our Programme Committee we accepted a total of 15 papers: 8 long papers and 7 short papers.
    摘要 本EPTCS卷收录了第五届自主系统形式化方法国际研讨会(FMAS 2023)的论文集,该研讨会于2023年11月15日至16日举行。FMAS 2023与第18届集成形式化方法国际会议(iFM)联合举办,由莱顿大学莱顿高等计算机科学研究所组织。研讨会在Scheltema Leiden举行,这是一座位于运河旁、经过翻新的19世纪毛毯工厂。FMAS 2023共收到25篇投稿,包括11篇正式论文、3篇经验报告、6篇研究预览和5篇愿景论文。投稿作者来自以下国家的机构:澳大利亚、加拿大、哥伦比亚、法国、德国、爱尔兰、意大利、荷兰、瑞典、英国和美国。投稿数量连续第三年增长,表明FMAS已成为自主系统形式化建模与验证研究领域一个有声誉的发表平台。每篇论文经三位程序委员会成员评审后,我们共录用15篇论文:8篇长文和7篇短文。

Linear time Evidence Accumulation Clustering with KMeans

  • paper_url: http://arxiv.org/abs/2311.09272
  • repo_url: None
  • paper_authors: Gaëlle Candel
  • for: 本研究旨在提出一种简单而高效的共识聚类方法,以解决现有方法计算复杂度过高的问题。
  • methods: 本方法基于证据积累聚类:首先构建一个 n x n 的共现(co-association)矩阵,再对该矩阵进行聚类以提取共识簇。与其他方法不同,这里不需要在两个不同划分之间寻找簇的对应关系。然而,这种做法受计算复杂度限制,只适用于小规模数据集(见下文的示意代码)。
  • results: 本研究提出了一种高效计算划分密度的方法,将开销从二次复杂度降为线性复杂度。此外,我们证明了 k-means 天然地最大化该密度。在多个基准数据集上的比较中,k-means 及其二分(bisecting)版本的 NMI 结果与其他最新的共识聚类算法相当,而计算成本更低;在密度指标上,k-means 取得了最佳结果。这些结果表明,共识聚类可以用简单的算法来解决。
    Abstract Among ensemble clustering methods, Evidence Accumulation Clustering is one of the simplest technics. In this approach, a co-association (CA) matrix representing the co-clustering frequency is built and then clustered to extract consensus clusters. Compared to other approaches, this one is simple as there is no need to find matches between clusters obtained from two different partitionings. Nevertheless, this method suffers from computational issues, as it requires to compute and store a matrix of size n x n, where n is the number of items. Due to the quadratic cost, this approach is reserved for small datasets. This work describes a trick which mimic the behavior of average linkage clustering. We found a way of computing efficiently the density of a partitioning, reducing the cost from a quadratic to linear complexity. Additionally, we proved that the k-means maximizes naturally the density. We performed experiments on several benchmark datasets where we compared the k-means and the bisecting version to other state-of-the-art consensus algorithms. The k-means results are comparable to the best state of the art in terms of NMI while keeping the computational cost low. Additionally, the k-means led to the best results in terms of density. These results provide evidence that consensus clustering can be solved with simple algorithms.
    摘要 在集成聚类方法中,证据积累聚类是最简单的技术之一。该方法先构建一个表示样本共聚频率的共现(CA)矩阵,再对其进行聚类以提取共识簇。与其他方法相比,它的优点在于无需在两个不同划分得到的簇之间寻找对应关系。然而,该方法存在计算问题:需要计算并存储一个 n x n 的矩阵(n 为样本数),二次的开销使其只适用于小数据集。本文描述了一种可以模拟平均连接(average linkage)聚类行为的技巧:我们找到了一种高效计算划分密度的方式,将开销从二次复杂度降为线性复杂度;此外,我们证明了 k-means 天然地最大化该密度。我们在多个基准数据集上将 k-means 及其二分版本与其他最新的共识聚类算法进行了比较:k-means 的 NMI 结果与最优方法相当,同时保持了较低的计算成本,并且在密度指标上取得了最佳结果。这些结果证明,共识聚类可以用简单的算法求解。
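
The following is a minimal, illustrative sketch (not the paper's code) of the linear-cost idea described above: since the co-association matrix equals H Hᵀ for the suitably scaled, stacked one-hot membership matrix H of the m base partitions, one can run k-means directly on H instead of materializing the n x n matrix. Names and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_kmeans(partitions, n_consensus_clusters, seed=0):
    """Consensus clustering sketch: cluster the stacked one-hot encodings
    of the base partitions instead of the n x n co-association matrix.
    `partitions` is a list of 1-D integer label arrays, one per base clustering."""
    n = len(partitions[0])
    blocks = []
    for labels in partitions:
        k = labels.max() + 1
        one_hot = np.zeros((n, k))
        one_hot[np.arange(n), labels] = 1.0
        blocks.append(one_hot)
    # With this scaling, H @ H.T equals the co-association matrix, but k-means
    # on H keeps the cost linear in n instead of quadratic.
    H = np.hstack(blocks) / np.sqrt(len(partitions))
    km = KMeans(n_clusters=n_consensus_clusters, n_init=10, random_state=seed)
    return km.fit_predict(H)

# Example: three noisy base partitions of 6 items.
parts = [np.array([0, 0, 0, 1, 1, 1]),
         np.array([0, 0, 1, 1, 1, 1]),
         np.array([1, 1, 1, 0, 0, 0])]
print(consensus_kmeans(parts, n_consensus_clusters=2))
```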

Identifying Linear Relational Concepts in Large Language Models

  • paper_url: http://arxiv.org/abs/2311.08968
  • repo_url: https://github.com/Aryia-Behroziuan/Robot-learning
  • paper_authors: David Chanin, Anthony Hunter, Oana-Maria Camburu
  • for: 本文旨在找到隐藏层中的概念方向,以便更好地理解模型表示的概念。
  • methods: 本文提出了一种Linear Relational Concepts(LRC)技术,通过模型Subject和Object之间的关系为线性关系嵌入(LRE)来找到隐藏层中的概念方向。
  • results: 研究发现,通过逆向LRE并使用早期的对象层来找到概念方向,可以实现高效地为概念分类和影响模型输出。
    Abstract Transformer language models (LMs) have been shown to represent concepts as directions in the latent space of hidden activations. However, for any given human-interpretable concept, how can we find its direction in the latent space? We present a technique called linear relational concepts (LRC) for finding concept directions corresponding to human-interpretable concepts at a given hidden layer in a transformer LM by first modeling the relation between subject and object as a linear relational embedding (LRE). While the LRE work was mainly presented as an exercise in understanding model representations, we find that inverting the LRE while using earlier object layers results in a powerful technique to find concept directions that both work well as a classifier and causally influence model outputs.
    摘要 Transformer 语言模型(LM)已被证明会将概念表示为隐藏激活潜在空间中的方向。然而,对于任意给定的人类可解释概念,如何在潜在空间中找到它对应的方向?我们提出了线性关系概念(LRC)技术:先将主语与宾语之间的关系建模为线性关系嵌入(LRE),从而在 transformer LM 的给定隐藏层中找到与人类可解释概念对应的方向。尽管 LRE 的工作最初主要是作为理解模型表示的一种探索,我们发现,在使用较早的宾语层的同时对 LRE 求逆,可以得到一种强大的技术:由此找到的概念方向既能很好地充当分类器,又能对模型输出产生因果影响。
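
As background for the LRE idea above, here is a rough numpy sketch (not the authors' implementation) of fitting a linear relational embedding o ≈ W s + b from paired subject/object hidden states and then inverting it with a pseudo-inverse to obtain a candidate concept direction; the synthetic data and dimensionality are placeholders.

```python
import numpy as np

# S: (num_pairs, d) subject hidden states at an early layer
# O: (num_pairs, d) object hidden states at a later layer (synthetic here)
rng = np.random.default_rng(0)
d, num_pairs = 64, 200
S = rng.normal(size=(num_pairs, d))
O = S @ rng.normal(size=(d, d)) * 0.1 + rng.normal(size=(num_pairs, d)) * 0.01

# Fit the linear relational embedding o ≈ W s + b by least squares.
S_aug = np.hstack([S, np.ones((num_pairs, 1))])
coef, *_ = np.linalg.lstsq(S_aug, O, rcond=None)
W, b = coef[:-1].T, coef[-1]            # W: (d, d), b: (d,)

# "Inverting" the LRE: map an object-side target representation back into
# the subject space to obtain a candidate concept direction.
W_pinv = np.linalg.pinv(W)
o_target = O.mean(axis=0)               # e.g. mean representation of one object class
concept_direction = W_pinv @ (o_target - b)
concept_direction /= np.linalg.norm(concept_direction)

# A simple linear probe: score new subject states by projection on the direction.
scores = S @ concept_direction
```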

I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in Social Robots

  • paper_url: http://arxiv.org/abs/2311.08957
  • repo_url: None
  • paper_authors: Giulio Antonio Abbo, Tony Belpaeme
  • for: 该论文旨在探讨如何通过将视觉功能 integrate into conversational agents ,以提高人机交互的效果。
  • methods: 该论文使用最新的大语言模型(如 GPT-4、IDEFICS)来同时解释文本提示和实时视觉输入,构建出更具上下文感知能力的对话系统。
  • results: 论文记录并分析了与 Furhat 机器人进行的六次交互,用以展示和讨论所获得的结果,呈现了一种融合文本与视觉模态的对话系统。
    Abstract In the rapidly evolving landscape of human-computer interaction, the integration of vision capabilities into conversational agents stands as a crucial advancement. This paper presents an initial implementation of a dialogue manager that leverages the latest progress in Large Language Models (e.g., GPT-4, IDEFICS) to enhance the traditional text-based prompts with real-time visual input. LLMs are used to interpret both textual prompts and visual stimuli, creating a more contextually aware conversational agent. The system's prompt engineering, incorporating dialogue with summarisation of the images, ensures a balance between context preservation and computational efficiency. Six interactions with a Furhat robot powered by this system are reported, illustrating and discussing the results obtained. By implementing this vision-enabled dialogue system, the paper envisions a future where conversational agents seamlessly blend textual and visual modalities, enabling richer, more context-aware dialogues.
    摘要 在快速演进的人机交互领域中,为对话代理融入视觉能力是一项关键进步。本文介绍了一个对话管理器的初步实现,它利用大语言模型(如GPT-4、IDEFICS)的最新进展,用实时视觉输入来增强传统的文本提示。这些语言模型同时解释文本提示和视觉刺激,从而构建出更具上下文感知能力的对话代理。系统的提示工程将对话与图像摘要相结合,在上下文保持与计算效率之间取得平衡。文中报告并讨论了由该系统驱动的 Furhat 机器人的六次交互及其结果。通过实现这一具备视觉能力的对话系统,本文展望了一个对话代理能够无缝融合文本与视觉模态、实现更丰富且更具上下文感知对话的未来。

Safety, Trust, and Ethics Considerations for Human-AI Teaming in Aerospace Control

  • paper_url: http://arxiv.org/abs/2311.08943
  • repo_url: None
  • paper_authors: Kerianne L. Hobbs, Bernard Li
  • for: 本文旨在探讨人工智能在航空系统控制中的合作,特别是人类和AI的团队合作,以及这些团队合作的安全、可靠和伦理方面。
  • methods: 本文使用了许多不同的方法,包括文献综述、案例研究和理论分析,以探讨不同的人工智能应用场景和相关的安全、可靠和伦理问题。
  • results: 本文的结果表明,在安全和任务关键领域中使用人工智能时,需要考虑到安全、可靠和伦理方面的问题,并且需要采取相应的措施来解决这些问题。
    Abstract Designing a safe, trusted, and ethical AI may be practically impossible; however, designing AI with safe, trusted, and ethical use in mind is possible and necessary in safety and mission-critical domains like aerospace. Safe, trusted, and ethical use of AI are often used interchangeably; however, a system can be safely used but not trusted or ethical, have a trusted use that is not safe or ethical, and have an ethical use that is not safe or trusted. This manuscript serves as a primer to illuminate the nuanced differences between these concepts, with a specific focus on applications of Human-AI teaming in aerospace system control, where humans may be in, on, or out-of-the-loop of decision-making.
    摘要 设计一个本身安全、可信且合乎伦理的人工智能在实践中可能无法做到;但在航空航天这类安全与任务关键领域,以安全、可信且合乎伦理的使用为目标来设计人工智能是可能且必要的。人工智能的安全使用、可信使用与伦理使用这几个概念常被混用,然而一个系统可能被安全地使用却并不可信或不合伦理,可能有可信的用途却不安全或不合伦理,也可能有合乎伦理的用途却不安全或不可信。本文作为一篇导论,旨在阐明这些概念之间的细微差异,并特别关注人机协同在航空航天系统控制中的应用:在这些场景中,人类可能处于决策回路之内(in-the-loop)、之上(on-the-loop)或之外(out-of-the-loop)。

Reasoning over Description Logic-based Contexts with Transformers

  • paper_url: http://arxiv.org/abs/2311.08941
  • repo_url: None
  • paper_authors: Angelos Poulis, Eleni Tsalapati, Manolis Koubarakis
  • for: 本研究的目的是测试 transformer 模型在复杂的语言上进行推理能力。
  • methods: 本研究使用了生成自描述逻辑知识库的自然语言问答数据集,并使用了 $\mathcal{ALCQ}$ 语言来生成知识库。
  • results: 研究发现,使用 DEBERTa 模型 DELTA$_M$ 的表现随 reasoning depth 的增加而无显著变化,而 sentence length 的增加则不会影响表现。此外,模型在不同的 reasoning depth 上进行推理时的泛化能力也得到了证明。
    Abstract One way that the current state of the art measures the reasoning ability of transformer-based models is by evaluating accuracy in downstream tasks like logical question answering or proof generation over synthetic contexts expressed in natural language. However, most of the contexts used are in practice very simple; in most cases, they are generated from short first-order logic sentences with only a few logical operators and quantifiers. In this work, we seek to answer the question how well a transformer-based model will perform reasoning over expressive contexts. For this purpose, we construct a synthetic natural language question-answering dataset, generated by description logic knowledge bases. For the generation of the knowledge bases, we use the expressive language $\mathcal{ALCQ}$. The resulting dataset contains 384K examples, and increases in two dimensions: i) reasoning depth, and ii) length of sentences. We show that the performance of our DeBERTa-based model, DELTA$_M$, is marginally affected when the reasoning depth is increased and it is not affected at all when the length of the sentences is increasing. We also evaluate the generalization ability of the model on reasoning depths unseen at training, both increasing and decreasing, revealing interesting insights into the model's adaptive generalization abilities.
    摘要 当前衡量基于 Transformer 模型推理能力的一种常见方式,是在以自然语言表达的合成上下文上评估其在逻辑问答或证明生成等下游任务中的准确率。然而,实践中使用的上下文大多非常简单,通常由仅含少量逻辑算子和量词的短一阶逻辑句子生成。本文旨在回答:基于 Transformer 的模型在表达能力更强的上下文上推理表现如何。为此,我们利用描述逻辑知识库构建了一个合成自然语言问答数据集,知识库采用表达能力较强的 $\mathcal{ALCQ}$ 语言生成。该数据集包含38.4万个样例,并沿两个维度递增:(i)推理深度,(ii)句子长度。实验表明,基于 DeBERTa 的模型 DELTA$_M$ 在推理深度增加时性能仅受轻微影响,在句子长度增加时则完全不受影响。我们还评估了模型在训练中未见过的推理深度(无论增大还是减小)上的泛化能力,揭示了其自适应泛化能力方面的一些有趣现象。

Supported Trust Region Optimization for Offline Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.08935
  • repo_url: None
  • paper_authors: Yixiu Mao, Hongchang Zhang, Chen Chen, Yi Xu, Xiangyang Ji
  • for: 在控制类任务中提升离线(offline)强化学习的效果
  • methods: 使用支持信任区域优化(STR)方法,即在行为政策内部进行强化学习优化,且受到行为政策支持的约束
  • results: 在假设不存在近似误差和采样误差时,STR能保证策略持续改进,直至收敛到数据集中受支撑约束的最优策略;在实际实验中,其表现优于当前最先进的方法。
    Abstract Offline reinforcement learning suffers from the out-of-distribution issue and extrapolation error. Most policy constraint methods regularize the density of the trained policy towards the behavior policy, which is too restrictive in most cases. We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy, enjoying the less restrictive support constraint. We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset. Further with both errors incorporated, STR still guarantees safe policy improvement for each step. Empirical results validate the theory of STR and demonstrate its state-of-the-art performance on MuJoCo locomotion domains and much more challenging AntMaze domains.
    摘要 离线强化学习受困于分布外(out-of-distribution)问题和外推误差。大多数策略约束方法将所训练策略的密度向行为策略正则化,这在多数情况下过于严格。我们提出支撑信任域优化(Supported Trust Region optimization,STR):在策略被约束于行为策略支撑集之内的前提下进行信任域策略优化,从而享有约束更宽松的支撑约束。我们证明,在不考虑近似误差和采样误差的假设下,STR能保证策略严格改进,直至收敛到数据集中受支撑约束的最优策略;在同时考虑这两类误差时,STR仍能保证每一步的安全策略改进。实验结果验证了STR的理论,并表明其在MuJoCo运动控制任务以及更具挑战性的AntMaze任务上达到了最先进的性能。

Leveraging Activation Maximization and Generative Adversarial Training to Recognize and Explain Patterns in Natural Areas in Satellite Imagery

  • paper_url: http://arxiv.org/abs/2311.08923
  • repo_url: None
  • paper_authors: Ahmed Emam, Timo T. Stomberg, Ribana Roscher
  • for: 为自然保护遗产创建详细地图
  • methods: 使用activation maximization和生成对抗模型生成卫星图像,结合领域知识,提供完整和有效的解释方法
  • results: 生成的卫星图像能够准确指出构成保护区自然本真性(authenticity)的空间与光谱模式,加深了对保护区生态完整性的理解,有望为未来的监测与保护工作做出贡献
    Abstract Natural protected areas are vital for biodiversity, climate change mitigation, and supporting ecological processes. Despite their significance, comprehensive mapping is hindered by a lack of understanding of their characteristics and a missing land cover class definition. This paper aims to advance the explanation of the designating patterns forming protected and wild areas. To this end, we propose a novel framework that uses activation maximization and a generative adversarial model. With this, we aim to generate satellite images that, in combination with domain knowledge, are capable of offering complete and valid explanations for the spatial and spectral patterns that define the natural authenticity of these regions. Our proposed framework produces more precise attribution maps pinpointing the designating patterns forming the natural authenticity of protected areas. Our approach fosters our understanding of the ecological integrity of the protected natural areas and may contribute to future monitoring and preservation efforts.
    摘要 自然保护区对生物多样性、减缓气候变化以及支撑生态过程至关重要。尽管意义重大,但由于对其特征缺乏理解且缺少相应的土地覆盖类别定义,全面的制图工作一直受阻。本文旨在推进对构成保护区与荒野区域的判别模式的解释。为此,我们提出了一种结合激活最大化与生成对抗模型的新框架,用以生成卫星图像;这些图像与领域知识相结合,能够对界定这些区域自然本真性的空间与光谱模式给出完整而有效的解释。所提框架能生成更精确的归因图,准确指出构成保护区自然本真性的判别模式。该方法有助于加深我们对自然保护区生态完整性的理解,并有望为未来的监测与保护工作做出贡献。
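
Since the entry relies on activation maximization, here is a generic PyTorch-style sketch of that technique (synthesizing an input that maximizes a chosen class logit); it is illustrative background only, not the authors' satellite-imagery pipeline, and the hyperparameters are arbitrary.

```python
import torch

def activation_maximization(model, target_class, shape=(1, 3, 224, 224),
                            steps=200, lr=0.05, weight_decay=1e-4):
    """Gradient-ascent sketch: optimize an input image so that the chosen
    class logit is maximized, with light L2 regularization for stability.
    Assumes `model` maps a (1, C, H, W) image to a (1, num_classes) logit tensor."""
    model.eval()
    x = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr, weight_decay=weight_decay)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(x)
        loss = -logits[0, target_class]   # ascend on the target activation
        loss.backward()
        opt.step()
    return x.detach()
```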

An Empathetic User-Centric Chatbot for Emotional Support

  • paper_url: http://arxiv.org/abs/2311.09271
  • repo_url: None
  • paper_authors: Yanting Pan, Yixuan Tang, Yuchen Niu
  • for: 这篇论文探讨了乙女(Otome)文化与人工智能之间的交叉点,尤其是乙女向游戏如何满足年轻女性的情感需求。
  • methods: 这篇论文使用大语言模型(LLM)技术来超越传统的静态游戏剧本,创造出动态且能进行情感回应的互动体验。
  • results: 研究人员通过在游戏剧本中添加问答(QA)系统,通过数据扩充和情感增强技术,创建了一个真实和支持的伴侣聊天机器人。
    Abstract This paper explores the intersection of Otome Culture and artificial intelligence, particularly focusing on how Otome-oriented games fulfill the emotional needs of young women. These games, which are deeply rooted in a subcultural understanding of love, provide players with feelings of satisfaction, companionship, and protection through carefully crafted narrative structures and character development. With the proliferation of Large Language Models (LLMs), there is an opportunity to transcend traditional static game narratives and create dynamic, emotionally responsive interactions. We present a case study of Tears of Themis, where we have integrated LLM technology to enhance the interactive experience. Our approach involves augmenting existing game narratives with a Question and Answer (QA) system, enriched through data augmentation and emotional enhancement techniques, resulting in a chatbot that offers realistic and supportive companionship.
    摘要 这篇论文探讨了乙女(Otome)文化与人工智能的交叉点,重点关注乙女向游戏如何满足年轻女性的情感需求。这类游戏深植于一种亚文化式的爱情理解,通过精心设计的叙事结构和角色塑造,为玩家带来满足感、陪伴感和被保护感。随着大语言模型(LLM)的普及,我们有机会超越传统的静态游戏叙事,创造动态且能进行情感回应的交互体验。我们介绍了《Tears of Themis》的案例研究,在其中集成了LLM技术以增强交互体验。我们的方法是在现有游戏叙事之上增加一个问答(QA)系统,并通过数据增强与情感增强技术加以丰富,从而得到一个能够提供真实且支持性陪伴的聊天机器人。

NormNet: Scale Normalization for 6D Pose Estimation in Stacked Scenarios

  • paper_url: http://arxiv.org/abs/2311.09269
  • repo_url: https://github.com/shuttlet/normnet
  • paper_authors: En-Te Lin, Wei-Jie Lv, Ding-Tao Huang, Long Zeng
  • for: 本研究旨在提出一种能够在堆叠场景中对不同尺度物体进行鲁棒6DoF位姿估计的网络(NormNet)。
  • methods: 本方法首先通过逐点回归学习每个物体的尺度,然后借助语义分割和仿射变换将所有物体归一化到同一尺度,最后将它们送入共享的位姿估计器以恢复其6D位姿(见下文的示意代码)。此外,我们还提出了一种结合风格迁移与域随机化的新Sim-to-Real迁移管线,以提升NormNet在真实数据上的性能。
  • results: 大量实验表明,所提方法在公共基准和我们自行构建的MultiScale数据集上均达到了领先性能;真实场景实验也表明,我们的方法能够对不同尺度物体的6D位姿进行鲁棒估计。
    Abstract Existing Object Pose Estimation (OPE) methods for stacked scenarios are not robust to changes in object scale. This paper proposes a new 6DoF OPE network (NormNet) for different scale objects in stacked scenarios. Specifically, each object's scale is first learned with point-wise regression. Then, all objects in the stacked scenario are normalized into the same scale through semantic segmentation and affine transformation. Finally, they are fed into a shared pose estimator to recover their 6D poses. In addition, we introduce a new Sim-to-Real transfer pipeline, combining style transfer and domain randomization. This improves the NormNet's performance on real data even if we only train it on synthetic data. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on public benchmarks and the MultiScale dataset we constructed. The real-world experiments show that our method can robustly estimate the 6D pose of objects at different scales.
    摘要 现有面向堆叠场景的物体位姿估计(OPE)方法对物体尺度变化不够鲁棒。本文提出了一种面向堆叠场景中不同尺度物体的新6DoF位姿估计网络(NormNet)。具体而言,先通过逐点回归学习每个物体的尺度;然后借助语义分割和仿射变换,将堆叠场景中的所有物体归一化到同一尺度;最后将它们送入共享的位姿估计器以恢复其6D位姿。此外,我们引入了一种结合风格迁移与域随机化的新Sim-to-Real迁移管线,使得即便只在合成数据上训练,NormNet在真实数据上的性能也能得到提升。大量实验表明,所提方法在公共基准和我们构建的MultiScale数据集上均取得了最先进的性能;真实环境实验也显示,我们的方法能够鲁棒地估计不同尺度物体的6D位姿。
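
A small hypothetical sketch of the scale-normalization step described above: a predicted per-object scale is used to map each segmented point cloud to a canonical scale before a shared pose head, and the estimated pose is mapped back afterwards. Function names and the de-normalization rule are illustrative assumptions, not the released NormNet code.

```python
import numpy as np

def normalize_object_points(points, predicted_scale, target_scale=1.0):
    """Map a segmented object point cloud to a canonical scale.
    points: (N, 3) array; predicted_scale: scalar from the point-wise regressor."""
    centroid = points.mean(axis=0)
    factor = target_scale / max(predicted_scale, 1e-8)
    return (points - centroid) * factor, centroid, factor

def denormalize_pose(rotation, translation_canonical, centroid, factor):
    """Undo the affine normalization on the estimated 6D pose: the rotation is
    unchanged, the translation is rescaled and shifted back to the scene frame."""
    translation = translation_canonical / factor + centroid
    return rotation, translation
```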

Combining Transfer Learning with In-context Learning using Blackbox LLMs for Zero-shot Knowledge Base Question Answering

  • paper_url: http://arxiv.org/abs/2311.08894
  • repo_url: None
  • paper_authors: Mayur Patidar, Avinash Singh, Riya Sawhney, Indrajit Bhattacharya, Mausam
  • for: 本文研究知识库问答(KBQA)的零样本迁移学习设定:源领域拥有大量带标注的训练数据,而目标领域没有任何带标注的样例。
  • methods: 本文在利用源域标注数据的同时,结合目标域的大量无标注数据进行迁移学习;并将基于黑盒大语言模型(BLLM)的少样本上下文学习引入检索与生成两个阶段(见下文的示意代码)。此外,文章还提出了与迁移设定解耦的、由执行结果引导的BLLM自我修正方法。
  • results: 实验以 GrailQA 为源域、WebQSP 为目标域,结果表明所提组合方法在检索和生成两个阶段都带来显著改进,并大幅超越在源域上训练的最先进有监督KBQA模型。此外,在域内设定下,当标注数据有限时,BLLM增强也能带来显著提升。
    Abstract We address the zero-shot transfer learning setting for the knowledge base question answering (KBQA) problem, where a large volume of labeled training data is available for the source domain, but no such labeled examples are available for the target domain. Transfer learning for KBQA makes use of large volumes of unlabeled data in the target in addition to the labeled data in the source. More recently, few-shot in-context learning using Black-box Large Language Models (BLLMs) has been adapted for KBQA without considering any source domain data. In this work, we show how to meaningfully combine these two paradigms for KBQA so that their benefits add up. Specifically, we preserve the two stage retrieve-then-generate pipeline of supervised KBQA and introduce interaction between in-context learning using BLLMs and transfer learning from the source for both stages. In addition, we propose execution-guided self-refinement using BLLMs, decoupled from the transfer setting. With the help of experiments using benchmark datasets GrailQA as the source and WebQSP as the target, we show that the proposed combination brings significant improvements to both stages and also outperforms by a large margin state-of-the-art supervised KBQA models trained on the source. We also show that in the in-domain setting, the proposed BLLM augmentation significantly outperforms state-of-the-art supervised models, when the volume of labeled data is limited, and also outperforms these marginally even when using the entire large training dataset.
    摘要 我们研究知识库问答(KBQA)的零样本迁移学习设定:源领域有大量标注训练数据,而目标领域没有任何标注样例。KBQA的迁移学习除了使用源领域的标注数据之外,还会利用目标领域的大量无标注数据。最近,基于黑盒大语言模型(BLLM)的少样本上下文学习也被用于KBQA,但并未考虑任何源领域数据。在这项工作中,我们展示了如何将这两种范式有机结合,使其优势叠加。具体而言,我们保留有监督KBQA的检索-生成两阶段流程,并在这两个阶段中引入BLLM上下文学习与源领域迁移学习之间的交互。此外,我们还提出了与迁移设定解耦的、由执行结果引导的BLLM自我修正方法。通过以GrailQA为源域、WebQSP为目标域的实验,我们表明所提组合方法在两个阶段上都带来显著改进,并大幅超越在源域上训练的最先进有监督KBQA模型。我们还表明,在域内设定下,当标注数据有限时,所提BLLM增强显著优于最先进的有监督模型;即使使用全部大规模训练数据,也仍略有优势。
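
To make the retrieve-then-generate idea concrete, the sketch below shows a deliberately generic version of the generation stage, where schema elements returned by a transferred retriever and a few source-domain demonstrations are packed into the prompt of a black-box LLM. `call_llm` and `retriever` are placeholders for whatever API and retriever are actually used; this is not the paper's code.

```python
def build_kbqa_prompt(question, retrieved_schema, demonstrations):
    """Compose an in-context prompt for logical-form generation."""
    lines = ["Translate the question into a logical form over the given KB schema."]
    for demo in demonstrations:                             # few-shot examples from the source domain
        lines.append(f"Schema: {', '.join(demo['schema'])}")
        lines.append(f"Question: {demo['question']}")
        lines.append(f"Logical form: {demo['logical_form']}\n")
    lines.append(f"Schema: {', '.join(retrieved_schema)}")  # from the transferred retriever
    lines.append(f"Question: {question}")
    lines.append("Logical form:")
    return "\n".join(lines)

def answer_question(question, retriever, demonstrations, call_llm):
    schema = retriever(question)           # stage 1: retrieval (transferred from the source domain)
    prompt = build_kbqa_prompt(question, schema, demonstrations)
    return call_llm(prompt)                # stage 2: generation with the black-box LLM
```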

Advances in ACL2 Proof Debugging Tools

  • paper_url: http://arxiv.org/abs/2311.08856
  • repo_url: None
  • paper_authors: Matt Kaufmann, J Strother Moore
  • for: 本文描述了ACL2用户通常会遇到失败的证明尝试,以及如何使用工具来解决这些失败。
  • methods: 本文专注于ACL2版本8.5后的改进:改进的break-rewrite工具以及新增的with-brr-data工具。
  • results: 通过使用这些工具,ACL2用户可以更有效地解决证明失败。
    Abstract The experience of an ACL2 user generally includes many failed proof attempts. A key to successful use of the ACL2 prover is the effective use of tools to debug those failures. We focus on changes made after ACL2 Version 8.5: the improved break-rewrite utility and the new utility, with-brr-data.
    摘要 ACL2用户通常会经历许多失败的证明尝试。成功使用ACL2证明工具的关键在于有效地使用工具来调试失败。我们关注ACL2版本8.5后的更改:改进的break-rewrite工具以及新增的with-brr-data工具。

Evaluating Gender Bias in the Translation of Gender-Neutral Languages into English

  • paper_url: http://arxiv.org/abs/2311.08836
  • repo_url: None
  • paper_authors: Spencer Rarrick, Ranjita Naik, Sundar Poudel, Vishal Chowdhary
  • for: 这个论文的目的是提出一个gender bias检测和 mitigation的数据集,以便更好地评估和改进Machine Translation(MT)系统中的gender bias问题。
  • methods: 这个论文使用了一个新的数据集名为GATE X-E,这个数据集包含了从土耳其语、匈牙利语、芬兰语和波斯语翻译成英语的人工翻译,每个翻译都有女性、男性和中性的多个变体。此外,这篇论文还提出了一种基于GPT-3.5 Turbo的英语性别重写解决方案,并使用GATE X-E来评估这种解决方案。
  • results: 这篇论文的研究结果表明,GATE X-E数据集可以帮助提高MT系统中gender bias的识别和改进,并且基于GPT-3.5 Turbo的英语性别重写解决方案也能够有效地改善MT系统中的gender bias问题。
    Abstract Machine Translation (MT) continues to improve in quality and adoption, yet the inadvertent perpetuation of gender bias remains a significant concern. Despite numerous studies into gender bias in translations from gender-neutral languages such as Turkish into more strongly gendered languages like English, there are no benchmarks for evaluating this phenomenon or for assessing mitigation strategies. To address this gap, we introduce GATE X-E, an extension to the GATE (Rarrick et al., 2023) corpus, that consists of human translations from Turkish, Hungarian, Finnish, and Persian into English. Each translation is accompanied by feminine, masculine, and neutral variants for each possible gender interpretation. The dataset, which contains between 1250 and 1850 instances for each of the four language pairs, features natural sentences with a wide range of sentence lengths and domains, challenging translation rewriters on various linguistic phenomena. Additionally, we present an English gender rewriting solution built on GPT-3.5 Turbo and use GATE X-E to evaluate it. We open source our contributions to encourage further research on gender debiasing.
    摘要

A* search algorithm for an optimal investment problem in vehicle-sharing systems

  • paper_url: http://arxiv.org/abs/2311.08834
  • repo_url: None
  • paper_authors: Ba Luat Le, Layla Martin, Emrah Demir, Duc Minh Vu
  • for: 该研究探讨了一个优化投资问题,它在 Shared Vehicle System 中出现。给定一个站点建设集,我们需要确定(i)站点建设顺序和车辆数量,以达到所有站点建设完成的目标状态,(ii)在一些或所有站点打开时,最大化运营系统的总收益。
  • methods: 该问题可视为一种成本依赖于已建站点集合的旅行商问题(TSP)变体,作者提出了一种 A* 搜索算法来求解(见本条目末尾的通用示意代码)。
  • results: 计算实验表明,作者的提案算法在比较 Dijkstra 算法时具有明显的优势,并且将来的研究可以探讨新的可能性和应用。
    Abstract We study an optimal investment problem that arises in the context of the vehicle-sharing system. Given a set of locations to build stations, we need to determine i) the sequence of stations to be built and the number of vehicles to acquire in order to obtain the target state where all stations are built, and ii) the number of vehicles to acquire and their allocation in order to maximize the total profit returned by operating the system when some or all stations are open. The profitability associated with operating open stations, measured over a specific time period, is represented as a linear optimization problem applied to a collection of open stations. With operating capital, the owner of the system can open new stations. This property introduces a set-dependent aspect to the duration required for opening a new station, and the optimal investment problem can be viewed as a variant of the Traveling Salesman Problem (TSP) with set-dependent cost. We propose an A* search algorithm to address this particular variant of the TSP. Computational experiments highlight the benefits of the proposed algorithm in comparison to the widely recognized Dijkstra algorithm and propose future research to explore new possibilities and applications for both exact and approximate A* algorithms.
    摘要 我们研究车辆共享系统中出现的一个最优投资问题。给定一组待建站点,需要确定:(i)站点的建设顺序以及需购置的车辆数量,以达到所有站点均建成的目标状态;(ii)在部分或全部站点开放运营时,车辆的购置数量及其分配方式,以最大化系统运营的总利润。开放站点在特定时间段内的运营盈利,通过在这些开放站点集合上求解一个线性优化问题来刻画。系统所有者可以利用运营资金开设新站点,这一特性使得开设新站点所需的时长依赖于已开放的站点集合,因此该最优投资问题可视为一种具有集合依赖成本的旅行商问题(TSP)变体。我们提出了一种A*搜索算法来求解这一变体。计算实验表明,与广为人知的Dijkstra算法相比,所提算法具有明显优势;我们也建议未来的研究进一步探索精确与近似A*算法的新可能性与应用。
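
For readers unfamiliar with how A* applies to such a build-order problem, here is a generic sketch (not the paper's implementation): the search state is the set of stations already built, the cost of building the next station depends on that set, and an admissible heuristic lower-bounds the remaining cost. The toy cost function is an assumption for illustration.

```python
import heapq
import itertools

def a_star_build_order(stations, build_cost, heuristic):
    """Find a cheapest order in which to build all stations.
    build_cost(built_set, s): set-dependent cost of building s next.
    heuristic(built_set): admissible lower bound on the remaining cost."""
    start, goal = frozenset(), frozenset(stations)
    tie = itertools.count()                       # tie-breaker so the heap never compares sets
    frontier = [(heuristic(start), 0.0, next(tie), start, [])]
    best_g = {start: 0.0}
    while frontier:
        _, g, _, built, order = heapq.heappop(frontier)
        if built == goal:
            return order, g
        for s in stations:
            if s in built:
                continue
            g2 = g + build_cost(built, s)
            nxt = built | {s}
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + heuristic(nxt), g2, next(tie), nxt, order + [s]))
    return None, float("inf")

# Toy usage: building a station gets cheaper once more of the network exists.
stations = ["A", "B", "C"]
cost = lambda built, s: 10.0 - 2.0 * len(built)
print(a_star_build_order(stations, cost, heuristic=lambda built: 0.0))
```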

  • paper_url: http://arxiv.org/abs/2311.08832
  • repo_url: None
  • paper_authors: Malak Sadek, Céline Mougenot
  • for: 这篇论文旨在探讨创建对话代理(CA)所面临的社会技术挑战,并提出克服这些挑战的实用策略。
  • methods: 这篇论文通过对现有文献的范围综述来识别和归类CA设计中的社会技术挑战,并以跨学科协作(IDC)为视角提出了这些挑战的分类体系。
  • results: 这篇论文提出了克服CA设计中社会技术挑战的实用策略,并呼吁后续工作以实证方式验证所提出的概念联系,并在CA设计领域应用这些策略以评估其有效性。
    Abstract Recent years have seen a steady rise in the popularity and use of Conversational Agents (CA) for different applications, well before the more immediate impact of large language models. This rise has been accompanied by an extensive exploration and documentation of the challenges of designing and creating conversational agents. Focusing on a recent scoping review of the socio-technical challenges of CA creation, this opinion paper calls for an examination of the extent to which interdisciplinary collaboration (IDC) challenges might contribute towards socio-technical CA design challenges. The paper proposes a taxonomy of CA design challenges using IDC as a lens, and proposes practical strategies to overcome them which complement existing design principles. The paper invites future work to empirically verify suggested conceptual links and apply the proposed strategies within the space of CA design to evaluate their effectiveness.
    摘要 近年来,早在大型语言模型产生更直接的影响之前,对话代理(CA)在各类应用中的普及度和使用率就已稳步上升。与此同时,学界也对设计和构建对话代理所面临的挑战进行了广泛的探索与记录。本文基于一项关于CA构建的社会技术挑战的最新范围综述,呼吁研究跨学科协作(IDC)方面的挑战在多大程度上加剧了CA设计中的社会技术挑战。文章以IDC为视角提出了一个CA设计挑战的分类体系,并提出了与现有设计原则互补的实用应对策略。文章呼吁后续工作以实证方式验证所提出的概念联系,并在CA设计领域应用这些策略以评估其有效性。

Reinforcement Learning with Model Predictive Control for Highway Ramp Metering

  • paper_url: http://arxiv.org/abs/2311.08820
  • repo_url: https://github.com/filippoairaldi/mpcrl-for-ramp-metering
  • paper_authors: Filippo Airaldi, Bart De Schutter, Azita Dabiri
  • for: 提高城市和高速公路交通系统的效率
  • methods: 结合基于模型与基于学习的策略,将强化学习技术嵌入模型预测控制框架中,以改进高速公路匝道调控(ramp metering,阶段代价的示意见下文代码)
  • results: 实验结果显示,从一个不精准的模型和不佳地调整的控制器开始,提议的方法可以有效地学习改进控制策略,从而减少网络中的堵塞和满足约束,相比初始控制器表现更佳。
    Abstract In the backdrop of an increasingly pressing need for effective urban and highway transportation systems, this work explores the synergy between model-based and learning-based strategies to enhance traffic flow management by use of an innovative approach to the problem of highway ramp metering control that embeds Reinforcement Learning techniques within the Model Predictive Control framework. The control problem is formulated as an RL task by crafting a suitable stage cost function that is representative of the traffic conditions, variability in the control action, and violations of a safety-critical constraint on the maximum number of vehicles in queue. An MPC-based RL approach, which merges the advantages of the two paradigms in order to overcome the shortcomings of each framework, is proposed to learn to efficiently control an on-ramp and to satisfy its constraints despite uncertainties in the system model and variable demands. Finally, simulations are performed on a benchmark from the literature consisting of a small-scale highway network. Results show that, starting from an MPC controller that has an imprecise model and is poorly tuned, the proposed methodology is able to effectively learn to improve the control policy such that congestion in the network is reduced and constraints are satisfied, yielding an improved performance compared to the initial controller.
    摘要 在城市与高速公路交通系统的效率需求日益迫切的背景下,本工作探讨了基于模型与基于学习两类策略之间的协同,提出了一种将强化学习技术嵌入模型预测控制(MPC)框架的新方法来解决高速公路匝道调控问题。该控制问题被表述为一个RL任务:通过设计一个能够反映交通状况、控制动作变化以及对最大排队车辆数这一安全关键约束之违反情况的阶段代价函数来实现。所提出的基于MPC的RL方法融合了两种范式的优点以克服各自的不足,从而在系统模型不确定、需求变化的情况下学会高效地控制匝道并满足约束。最后,在文献中的一个小规模高速公路网络基准上进行的仿真表明:从一个模型不精确、参数调校不佳的MPC控制器出发,所提方法能够有效地学习并改进控制策略,减少路网拥堵、满足约束,相比初始控制器取得更好的性能。
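
A hedged sketch of the kind of stage cost the abstract describes, combining a traffic-state term, a penalty on control variation, and a penalty for violating the maximum-queue constraint; the functional form, weights, and variable names are illustrative assumptions, not those used in the paper or its repository.

```python
def stage_cost(density, density_target, ramp_queue, max_queue,
               metering_rate, prev_metering_rate,
               w_state=1.0, w_du=0.4, w_violation=10.0):
    """Illustrative ramp-metering stage cost for an MPC/RL controller."""
    state_term = w_state * (density - density_target) ** 2            # track a desired density
    control_term = w_du * (metering_rate - prev_metering_rate) ** 2   # discourage abrupt changes
    violation = max(0.0, ramp_queue - max_queue)                      # safety-critical queue limit
    return state_term + control_term + w_violation * violation

print(stage_cost(density=32.0, density_target=30.0, ramp_queue=55.0,
                 max_queue=50.0, metering_rate=0.6, prev_metering_rate=0.8))
```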

Frequency Domain-based Dataset Distillation

  • paper_url: http://arxiv.org/abs/2311.08819
  • repo_url: https://github.com/sdh0818/fred
  • paper_authors: Donghyeok Shin, Seungjae Shin, Il-Chul Moon
  • for: 本研究旨在提出一种新的参数化方法,用于快速生成小型的合成数据集,从原始大型数据集中提取关键信息。
  • methods: 该方法基于频域的变换来优化数据集中每个实例的频率表示,通过选择特定频率维度进行优化,以实现快速生成实例的目标。
  • results: 对于不同的评价指标和数据集,FreD方法能够在有限的资源下实现更好的信息保留和性能提升,并且与现有方法兼容。
    Abstract This paper presents FreD, a novel parameterization method for dataset distillation, which utilizes the frequency domain to distill a small-sized synthetic dataset from a large-sized original dataset. Unlike conventional approaches that focus on the spatial domain, FreD employs frequency-based transforms to optimize the frequency representations of each data instance. By leveraging the concentration of spatial domain information on specific frequency components, FreD intelligently selects a subset of frequency dimensions for optimization, leading to a significant reduction in the required budget for synthesizing an instance. Through the selection of frequency dimensions based on the explained variance, FreD demonstrates both theoretical and empirical evidence of its ability to operate efficiently within a limited budget, while better preserving the information of the original dataset compared to conventional parameterization methods. Furthermore, based on the orthogonal compatibility of FreD with existing methods, we confirm that FreD consistently improves the performances of existing distillation methods over the evaluation scenarios with different benchmark datasets. We release the code at https://github.com/sdh0818/FreD.
    摘要 本文提出了FreD,一种新的数据集蒸馏参数化方法,它利用频域从大规模原始数据集中蒸馏出小规模的合成数据集。与聚焦于空间域的传统方法不同,FreD采用基于频率的变换来优化每个数据实例的频域表示。借助空间域信息集中于特定频率分量的特性,FreD智能地选择一部分频率维度进行优化,从而显著降低合成单个实例所需的预算。通过基于解释方差来选择频率维度,FreD在理论和实验上都证明了其能够在有限预算下高效运行,同时比传统参数化方法更好地保留原始数据集的信息。此外,基于FreD与现有方法的正交兼容性,我们确认FreD在不同基准数据集的评估场景下能持续提升现有蒸馏方法的性能。代码发布于 https://github.com/sdh0818/FreD。
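
As a rough illustration of a frequency-domain parameterization in the spirit of FreD (not the released code): transform images with a 2-D DCT, keep only the coefficients with the highest variance across the dataset, and treat that subset as the variables a distillation procedure would optimize.

```python
import numpy as np
from scipy.fft import dctn, idctn

def select_frequency_dims(images, budget_ratio=0.1):
    """Pick the DCT coefficients with the largest variance across the dataset.
    images: (N, H, W) array; returns a boolean mask over the (H, W) frequency grid."""
    coeffs = np.stack([dctn(img, norm="ortho") for img in images])
    variance = coeffs.var(axis=0)
    k = int(budget_ratio * variance.size)
    threshold = np.partition(variance.ravel(), -k)[-k]
    return variance >= threshold

def to_image(selected_coeffs, mask, shape):
    """Reconstruct an image from the optimized subset of coefficients."""
    full = np.zeros(shape)
    full[mask] = selected_coeffs
    return idctn(full, norm="ortho")

# Toy usage with random "images"; in distillation, `selected` would be the optimized variables.
imgs = np.random.rand(16, 32, 32)
mask = select_frequency_dims(imgs, budget_ratio=0.1)
selected = dctn(imgs[0], norm="ortho")[mask]
recon = to_image(selected, mask, shape=(32, 32))
```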

MAP’s not dead yet: Uncovering true language model modes by conditioning away degeneracy

  • paper_url: http://arxiv.org/abs/2311.08817
  • repo_url: None
  • paper_authors: Davis Yoshida, Kartik Goyal, Kevin Gimpel
  • for: 这篇论文主要研究NLG模型中"模式"(mode)的问题,具体而言是解释为什么模式寻找式的MAP解码常常产生退化输出(Stahlberg和Byrne,2019;Holtzman等,2019)。
  • methods: 作者使用精确搜索来研究NLG模型的模式。他们发现,即使模型本身没有误差,由于训练数据受到噪声污染,模式仍可能发生退化。为解决这一问题,作者提议在以避免特定退化为条件的分布上进行MAP解码。
  • results: 实验表明,对机器翻译模型和语言模型进行以长度为条件的模式搜索,可以得到更流畅、更切题的输出。作者还提供了大量精确模式序列的示例,并表明LLaMA模型的模式仍然是退化的。为此,作者开发了一种近似模式搜索方法ACBS;将其应用于LLaMA-7B,无需任何微调即可得到合理的输出。
    Abstract It has been widely observed that exact or approximate MAP (mode-seeking) decoding from natural language generation (NLG) models consistently leads to degenerate outputs (Stahlberg and Byrne, 2019, Holtzman et al., 2019). This has generally been attributed to either a fundamental inadequacy of modes in models or weaknesses in language modeling. Contrastingly in this work, we emphasize that degenerate modes can even occur in the absence of any model error, due to contamination of the training data. Specifically, we show that mixing even a tiny amount of low-entropy noise with a population text distribution can cause the data distribution's mode to become degenerate, implying that any models trained on it will be as well. As the unconditional mode of NLG models will often be degenerate, we therefore propose to apply MAP decoding to the model's distribution conditional on avoiding specific degeneracies. Using exact-search, we empirically verify that the length-conditional modes of machine translation models and language models are indeed more fluent and topical than their unconditional modes. For the first time, we also share many examples of exact modal sequences from these models, and from several variants of the LLaMA-7B model. Notably, the modes of the LLaMA models are still degenerate, showing that improvements in modeling have not fixed this issue. Because of the cost of exact mode finding algorithms, we develop an approximate mode finding approach, ACBS, which finds sequences that are both high-likelihood and high-quality. We apply this approach to LLaMA-7B, a model which was not trained for instruction following, and find that we are able to elicit reasonable outputs without any finetuning.
    摘要 人们普遍观察到,对自然语言生成(NLG)模型进行精确或近似的MAP(寻模)解码常常会得到退化的输出(Stahlberg和Byrne,2019;Holtzman等,2019)。这通常被归因于模型中"模式"本身的根本缺陷,或语言建模能力的不足。与此相反,本文强调即使模型没有任何误差,退化的模式也可能出现,其原因在于训练数据受到污染。具体而言,我们证明:只要在一个文本分布中混入极少量的低熵噪声,就可能使数据分布的模式变得退化,这意味着在其上训练的任何模型也会如此。由于NLG模型的无条件模式往往是退化的,我们因此提出在以避免特定退化为条件的模型分布上进行MAP解码。借助精确搜索,我们实证验证了机器翻译模型和语言模型的长度条件模式确实比其无条件模式更流畅、更切题。我们还首次给出了这些模型以及LLaMA-7B多个变体的大量精确模式序列示例。值得注意的是,LLaMA模型的模式仍然是退化的,说明建模上的改进并未解决这一问题。由于精确寻模算法代价高昂,我们开发了一种近似寻模方法ACBS,用于寻找既高似然又高质量的序列。我们将该方法应用于并未针对指令跟随训练的LLaMA-7B,发现无需任何微调即可引出合理的输出。

Self-Supervised Disentanglement by Leveraging Structure in Data Augmentations

  • paper_url: http://arxiv.org/abs/2311.08815
  • repo_url: None
  • paper_authors: Cian Eastwood, Julius von Kügelgen, Linus Ericsson, Diane Bouchacourt, Pascal Vincent, Bernhard Schölkopf, Mark Ibrahim
  • for: 这篇论文旨在改进自监督表示学习中对"风格"特征的处理方式。
  • methods: 这篇论文使用了数据扩充来适应”风格”特征的变化,但是由于下游任务通常在训练时未知,因此难以在训练时确定”风格”特征是否可以安全地丢弃。为了解决这个问题,这篇论文提出了一种更原则的方法,即通过添加多个风格嵌入空间来分离风格特征。
  • results: 该方法在synthetic数据集上进行了实验,并且在ImageNet上进行了一些有限的实验,并证明了其效果。
    Abstract Self-supervised representation learning often uses data augmentations to induce some invariance to "style" attributes of the data. However, with downstream tasks generally unknown at training time, it is difficult to deduce a priori which attributes of the data are indeed "style" and can be safely discarded. To address this, we introduce a more principled approach that seeks to disentangle style features rather than discard them. The key idea is to add multiple style embedding spaces where: (i) each is invariant to all-but-one augmentation; and (ii) joint entropy is maximized. We formalize our structured data-augmentation procedure from a causal latent-variable-model perspective, and prove identifiability of both content and (multiple blocks of) style variables. We empirically demonstrate the benefits of our approach on synthetic datasets and then present promising but limited results on ImageNet.
    摘要 自监督表示学习常通过数据增广来诱导模型对数据的"风格"属性保持不变。然而,由于下游任务在训练时通常未知,很难先验地判断哪些属性确实属于"风格"、可以被安全地丢弃。为解决这一问题,我们提出了一种更有原则性的方法:不丢弃风格特征,而是对其进行解耦。其核心思想是引入多个风格嵌入空间,使得:(i)每个空间对除某一种增广之外的所有增广保持不变;(ii)联合熵被最大化。我们从因果潜变量模型的角度对这种结构化数据增广过程进行了形式化,并证明了内容变量和(多组)风格变量的可识别性。我们在合成数据集上通过实验证明了该方法的优势,并在ImageNet上给出了有前景但仍有限的结果。

SparseSpikformer: A Co-Design Framework for Token and Weight Pruning in Spiking Transformer

  • paper_url: http://arxiv.org/abs/2311.08806
  • repo_url: None
  • paper_authors: Yue Liu, Shanlin Xiao, Bo Li, Zhiyi Yu
  • for: 这个研究旨在提高Spikformer模型的效率和能效性,使其适合实现在边缘设备上。
  • methods: 这项研究利用彩票假设(Lottery Ticket Hypothesis,LTH)以及若干创新的token剪枝与权重剪枝技术来实现Spikformer模型的稀疏化(token选择器的示意见下文代码)。
  • results: 实验结果显示,这个框架可以将Spikformer模型的90%模型参数简减,且可以降低Giga浮动点操作数(GFLOPs)20%,同时保持原始模型的准确性。
    Abstract As the third-generation neural network, the Spiking Neural Network (SNN) has the advantages of low power consumption and high energy efficiency, making it suitable for implementation on edge devices. More recently, the most advanced SNN, Spikformer, combines the self-attention module from Transformer with SNN to achieve remarkable performance. However, it adopts larger channel dimensions in MLP layers, leading to an increased number of redundant model parameters. To effectively decrease the computational complexity and weight parameters of the model, we explore the Lottery Ticket Hypothesis (LTH) and discover a very sparse ($\ge$90%) subnetwork that achieves comparable performance to the original network. Furthermore, we also design a lightweight token selector module, which can remove unimportant background information from images based on the average spike firing rate of neurons, selecting only essential foreground image tokens to participate in attention calculation. Based on that, we present SparseSpikformer, a co-design framework aimed at achieving sparsity in Spikformer through token and weight pruning techniques. Experimental results demonstrate that our framework can significantly reduce 90% model parameters and cut down Giga Floating-Point Operations (GFLOPs) by 20% while maintaining the accuracy of the original model.
    摘要 作为第三代神经网络,脉冲神经网络(SNN)具有低功耗、高能效的优势,适合部署在边缘设备上。最新的先进SNN模型Spikformer将Transformer中的自注意力模块与SNN相结合,取得了显著性能,但其MLP层采用了较大的通道维度,导致冗余模型参数增多。为了有效降低模型的计算复杂度和权重参数量,我们基于彩票假设(LTH)发现了一个稀疏度不低于90%、性能与原网络相当的子网络。此外,我们还设计了一个轻量级的token选择器模块,它根据神经元的平均脉冲发放率去除图像中不重要的背景信息,只保留关键的前景图像token参与注意力计算。在此基础上,我们提出了SparseSpikformer,一个通过token剪枝和权重剪枝实现Spikformer稀疏化的协同设计框架。实验结果表明,该框架能够在保持原模型精度的同时,削减90%的模型参数,并将GFLOPs降低20%。
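
A small hypothetical sketch of a firing-rate-based token selector like the one described above: tokens whose average spike rate over timesteps is low are treated as uninformative background and dropped before attention. The shapes, keep ratio, and gather-based implementation are illustrative assumptions, not the authors' module.

```python
import torch

def select_tokens_by_firing_rate(spikes, keep_ratio=0.5):
    """spikes: (T, B, N, D) binary spike tensor over T timesteps.
    Keeps the keep_ratio fraction of tokens with the highest average firing rate."""
    rate = spikes.float().mean(dim=(0, 3))            # (B, N) average firing rate per token
    n_keep = max(1, int(keep_ratio * rate.shape[1]))
    idx = rate.topk(n_keep, dim=1).indices            # indices of the most active tokens
    idx_expanded = idx.unsqueeze(0).unsqueeze(-1).expand(
        spikes.shape[0], -1, -1, spikes.shape[-1])
    return torch.gather(spikes, 2, idx_expanded), idx

# Toy usage: 4 timesteps, batch of 2, 16 tokens, 8 channels.
spikes = (torch.rand(4, 2, 16, 8) > 0.7).float()
kept, kept_idx = select_tokens_by_firing_rate(spikes, keep_ratio=0.25)
print(kept.shape)   # torch.Size([4, 2, 4, 8])
```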

X-Eval: Generalizable Multi-aspect Text Evaluation via Augmented Instruction Tuning with Auxiliary Evaluation Aspects

  • paper_url: http://arxiv.org/abs/2311.08788
  • repo_url: None
  • paper_authors: Minqian Liu, Ying Shen, Zhiyang Xu, Yixin Cao, Eunah Cho, Vaibhav Kumar, Reza Ghanadan, Lifu Huang
  • for: 本文目的是提出一种多方面评估框架,以便评估自然语言生成(NLG)的多个方面质量。
  • methods: 本文采用两个学习阶段:第一阶段是普通的指令微调阶段,旨在提升模型遵循评估指令的能力;第二阶段是增强的指令微调阶段,利用细粒度评估方面之间的关联来更好地评估文本质量。
  • results: 大量实验表明,X-Eval能够让一个轻量级语言模型在与人类判断的相关性上达到甚至超过GPT-4等最先进NLG评估器的水平。
    Abstract Natural Language Generation (NLG) typically involves evaluating the generated text in various aspects (e.g., consistency and naturalness) to obtain a comprehensive assessment. However, multi-aspect evaluation remains challenging as it may require the evaluator to generalize to any given evaluation aspect even if it's absent during training. In this paper, we introduce X-Eval, a two-stage instruction tuning framework to evaluate the text in both seen and unseen aspects customized by end users. X-Eval consists of two learning stages: the vanilla instruction tuning stage that improves the model's ability to follow evaluation instructions, and an enhanced instruction tuning stage that exploits the connections between fine-grained evaluation aspects to better assess text quality. To support the training of X-Eval, we collect AspectInstruct, the first instruction tuning dataset tailored for multi-aspect NLG evaluation spanning 27 diverse evaluation aspects with 65 tasks. To enhance task diversity, we devise an augmentation strategy that converts human rating annotations into diverse forms of NLG evaluation tasks, including scoring, comparison, ranking, and Boolean question answering. Extensive experiments across three essential categories of NLG tasks: dialogue generation, summarization, and data-to-text coupled with 21 aspects in meta-evaluation, demonstrate that our X-Eval enables even a lightweight language model to achieve a comparable if not higher correlation with human judgments compared to the state-of-the-art NLG evaluators, such as GPT-4.
    摘要 自然语言生成(NLG)的评估通常需要从多个方面(例如一致性和自然性)考察生成文本,以获得全面的评价。然而,多方面评估仍然具有挑战性,因为评估器可能需要泛化到任意给定的评估方面,即使该方面在训练中从未出现过。本文提出X-Eval,一个两阶段指令微调框架,用于按最终用户定制的已见与未见方面评估文本。X-Eval包含两个学习阶段:普通的指令微调阶段,用于提升模型遵循评估指令的能力;以及增强的指令微调阶段,利用细粒度评估方面之间的关联来更好地评估文本质量。为支持X-Eval的训练,我们收集了AspectInstruct,这是首个面向多方面NLG评估的指令微调数据集,涵盖27个多样化的评估方面和65个任务。为了增加任务多样性,我们设计了一种扩充策略,将人工评分标注转换为多种形式的NLG评估任务,包括打分、比较、排序和布尔问答。在对话生成、摘要和数据到文本这三类关键NLG任务上、共21个评估方面的元评估实验表明,X-Eval能够让一个轻量级语言模型与人类判断达到相当甚至更高的相关性,可与GPT-4等最先进的NLG评估器媲美。

ICRA Roboethics Challenge 2023: Intelligent Disobedience in an Elderly Care Home

  • paper_url: http://arxiv.org/abs/2311.08783
  • repo_url: None
  • paper_authors: Sveta Paster, Kantwon Rogers, Gordon Briggs, Peter Stone, Reuth Mirsky
  • for: 这份报告是为了提高老人护理机构中的服务机器人增强老人的生活质量,以应对预计的老年人口增长。
  • methods: 该报告提议利用智能不遵守框架,让机器人能够进行有伦理意义的决策过程。
  • results: 该报告列出了智能不遵守框架可以帮助机器人解决的问题,并在特定的老人护理机构场景下定义了该框架的形式化定义,以及实现智能不遵守机器人的需求。
    Abstract With the projected surge in the elderly population, service robots offer a promising avenue to enhance their well-being in elderly care homes. Such robots will encounter complex scenarios which will require them to perform decisions with ethical consequences. In this report, we propose to leverage the Intelligent Disobedience framework in order to give the robot the ability to perform a deliberation process over decisions with potential ethical implications. We list the issues that this framework can assist with, define it formally in the context of the specific elderly care home scenario, and delineate the requirements for implementing an intelligently disobeying robot. We conclude this report with some critical analysis and suggestions for future work.
    摘要 随着老年人口增长的预计,服务机器人在老年人医疗机构中提供了一个有前途的解决方案,以提高老年人的生活质量。这些机器人会遇到复杂的情况,需要它们在具有伦理意义的决策时进行慎重的讨论。在这份报告中,我们提议利用智能不遵守框架,让机器人在具有伦理意义的决策时能够进行慎重的讨论。我们列出了这个框架可以帮助解决的问题,在老年人医疗机构特定场景中明确定义了它,并详细描述了实现智能不遵守机器人的需求。我们在报告结尾提出了一些批判性分析和未来工作的建议。

Adversarially Robust Spiking Neural Networks Through Conversion

  • paper_url: http://arxiv.org/abs/2311.09266
  • repo_url: https://github.com/igitugraz/robustsnnconversion
  • paper_authors: Ozan Özdenizci, Robert Legenstein
  • for: 提高脉冲神经网络(SNN)的对抗鲁棒性,增强SNN在应用中的可靠性。
  • methods: 提出了一种具有对抗鲁棒性的ANN到SNN转换算法,并在转换后的鲁棒微调阶段对SNN的各层发放阈值和突触连接权重进行对抗优化,以保持从预训练ANN迁移而来的鲁棒性增益(阈值校准的背景示意见下文代码)。
  • results: 实验结果表明,该方法在多种自适应对抗攻击设置下为低延迟的深度SNN提供了可扩展的最先进鲁棒性。
    Abstract Spiking neural networks (SNNs) provide an energy-efficient alternative to a variety of artificial neural network (ANN) based AI applications. As the progress in neuromorphic computing with SNNs expands their use in applications, the problem of adversarial robustness of SNNs becomes more pronounced. To the contrary of the widely explored end-to-end adversarial training based solutions, we address the limited progress in scalable robust SNN training methods by proposing an adversarially robust ANN-to-SNN conversion algorithm. Our method provides an efficient approach to embrace various computationally demanding robust learning objectives that have been proposed for ANNs. During a post-conversion robust finetuning phase, our method adversarially optimizes both layer-wise firing thresholds and synaptic connectivity weights of the SNN to maintain transferred robustness gains from the pre-trained ANN. We perform experimental evaluations in numerous adaptive adversarial settings that account for the spike-based operation dynamics of SNNs, and show that our approach yields a scalable state-of-the-art solution for adversarially robust deep SNNs with low-latency.
    摘要 脉冲神经网络(SNN)为多种基于人工神经网络(ANN)的AI应用提供了一种高能效的替代方案。随着基于SNN的类脑计算不断拓展其应用,SNN的对抗鲁棒性问题变得愈发突出。与已被广泛研究的端到端对抗训练方案不同,我们针对可扩展的鲁棒SNN训练方法进展有限的现状,提出了一种具有对抗鲁棒性的ANN到SNN转换算法。该方法能够高效地利用已为ANN提出的各种计算代价高昂的鲁棒学习目标。在转换后的鲁棒微调阶段,我们的方法对SNN的各层发放阈值和突触连接权重进行对抗优化,以保持从预训练ANN迁移而来的鲁棒性增益。我们在多种考虑SNN脉冲运行动态的自适应对抗设置下进行了实验评估,结果表明该方法为低延迟、对抗鲁棒的深度SNN提供了可扩展的最先进解决方案。
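
As background, a common ingredient of ANN-to-SNN conversion (threshold balancing) is sketched below: each spiking layer's firing threshold is set from a high percentile of the corresponding ANN layer's activations on calibration data. The paper's contribution then adversarially fine-tunes such thresholds together with the synaptic weights; this sketch covers only the plain calibration step, and its names are illustrative.

```python
import numpy as np

def calibrate_thresholds(activations_per_layer, percentile=99.0):
    """Layer-wise threshold balancing, a standard step in ANN-to-SNN conversion:
    set each spiking layer's firing threshold to a high percentile of the
    corresponding ANN layer's activations observed on calibration data."""
    return [float(np.percentile(acts, percentile)) for acts in activations_per_layer]

# Toy usage with fake calibration activations for three layers.
acts = [np.random.rand(1000) * scale for scale in (1.0, 2.5, 0.7)]
print(calibrate_thresholds(acts))
```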

Three Conjectures on Unexpectedness

  • paper_url: http://arxiv.org/abs/2311.08768
  • repo_url: None
  • paper_authors: Giovanni Sileno, Jean-Louis Dessalles
  • for: This paper aims to lay the groundwork for a theoretical framework to explain the predictive power of unexpectedness in cognition, and to explore its connection to various measures of divergence between the entropy of the world and the variety of the observer.
  • methods: The paper uses a combination of theoretical conjectures and experimental results to develop a framework for understanding the role of unexpectedness in cognition.
  • results: The paper provides a new perspective on the relationship between unexpectedness and cognition, and suggests potential research directions that could lead to new insights into the extraction of causal relations and the role of descriptive mechanisms in learning.
    Abstract Unexpectedness is a central concept in Simplicity Theory, a theory of cognition relating various inferential processes to the computation of Kolmogorov complexities, rather than probabilities. Its predictive power has been confirmed by several experiments with human subjects, yet its theoretical basis remains largely unexplored: why does it work? This paper lays the groundwork for three theoretical conjectures. First, unexpectedness can be seen as a generalization of Bayes' rule. Second, the frequentist core of unexpectedness can be connected to the function of tracking ergodic properties of the world. Third, unexpectedness can be seen as constituent of various measures of divergence between the entropy of the world (environment) and the variety of the observer (system). The resulting framework hints to research directions that go beyond the division between probabilistic and logical approaches, potentially bringing new insights into the extraction of causal relations, and into the role of descriptive mechanisms in learning.
    摘要 不期待性(unexpectedness)是简洁性理论(Simplicity Theory)的核心概念;该认知理论将各种推断过程与柯尔莫哥洛夫复杂度的计算联系起来,而非概率。其预测能力已在多项人类被试实验中得到证实,但其理论基础仍基本未被探索:它为何有效?本文为三个理论猜想奠定基础。其一,不期待性可以被视为贝叶斯法则的一种推广;其二,不期待性的频率学内核可以与追踪世界遍历(ergodic)性质的功能联系起来;其三,不期待性可以被视为刻画世界(环境)的熵与观察者(系统)的多样性之间差异的多种度量的组成部分。由此形成的框架指向了超越概率方法与逻辑方法之分的研究方向,有望为因果关系的提取以及描述机制在学习中的作用带来新的见解。
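
For readers unfamiliar with Simplicity Theory, the usual definition of unexpectedness (stated here only as background; the notation may differ from the paper's) contrasts the complexity of generating a situation with the complexity of describing it:

$$U(s) = C_w(s) - C(s), \qquad p(s) \approx 2^{-U(s)},$$

where $C_w(s)$ is the generation ("world") complexity of situation $s$ and $C(s)$ is its description complexity for the observer. Reading the second relation as a subjective-probability assignment is what makes the first conjecture above, that unexpectedness generalizes Bayes' rule, plausible.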

Combining Past, Present and Future: A Self-Supervised Approach for Class Incremental Learning

  • paper_url: http://arxiv.org/abs/2311.08764
  • repo_url: None
  • paper_authors: Xiaoshuang Chen, Zhongyi Sun, Ke Yan, Shouhong Ding, Hongtao Lu
  • for: 本文旨在解决类增量学习(CIL)中新类别连续、依次出现的问题,即让模型能够识别陆续到来的新类,同时缓解灾难性遗忘。
  • methods: 本文提出了一种自监督的 CIL 框架 CPPF,包括原型聚类模块(PC)、嵌入空间预留模块(ESR)和多教师蒸馏模块(MTD)。PC 和 ESR 模块分别在原型层面和特征层面为后续阶段预留嵌入空间,而 MTD 模块保持当前阶段的表示不受过去知识的干扰。
  • results: 在 CIFAR100 和 ImageNet100 数据集上进行的广泛实验表明,我们提出的方法能够提升自监督类增量学习的性能。
    Abstract Class Incremental Learning (CIL) aims to handle the scenario where data of novel classes occur continuously and sequentially. The model should recognize the sequential novel classes while alleviating the catastrophic forgetting. In the self-supervised manner, it becomes more challenging to avoid the conflict between the feature embedding spaces of novel classes and old ones without any class labels. To address the problem, we propose a self-supervised CIL framework CPPF, meaning Combining Past, Present and Future. In detail, CPPF consists of a prototype clustering module (PC), an embedding space reserving module (ESR) and a multi-teacher distillation module (MTD). 1) The PC and the ESR modules reserve embedding space for subsequent phases at the prototype level and the feature level respectively to prepare for knowledge learned in the future. 2) The MTD module maintains the representations of the current phase without the interference of past knowledge. One of the teacher networks retains the representations of the past phases, and the other teacher network distills relation information of the current phase to the student network. Extensive experiments on CIFAR100 and ImageNet100 datasets demonstrate that our proposed method boosts the performance of self-supervised class incremental learning. We will release code in the near future.
    摘要 类增量学习(CIL)旨在处理新类数据连续、依次出现的场景:模型应能识别陆续出现的新类,同时缓解灾难性遗忘。在自监督设定下,由于没有任何类别标签,避免新类与旧类的特征嵌入空间发生冲突更具挑战性。为解决这一问题,我们提出了一个自监督CIL框架CPPF(Combining Past, Present and Future)。具体而言,CPPF包括原型聚类模块(PC)、嵌入空间预留模块(ESR)和多教师蒸馏模块(MTD)。1)PC和ESR模块分别在原型层面和特征层面为后续阶段预留嵌入空间,为将来要学习的知识做好准备;2)MTD模块保持当前阶段的表示不受过去知识的干扰,其中一个教师网络保留过去阶段的表示,另一个教师网络则将当前阶段的关系信息蒸馏给学生网络。我们在CIFAR100和ImageNet100数据集上进行了大量实验,结果表明所提方法能够提升自监督类增量学习的性能。我们将在近期发布代码。

Forms of Understanding of XAI-Explanations

  • paper_url: http://arxiv.org/abs/2311.08760
  • repo_url: None
  • paper_authors: Hendrik Buschmeier, Heike M. Buhl, Friederike Kern, Angela Grimminger, Helen Beierling, Josephine Fisher, André Groß, Ilona Horwath, Nils Klowait, Stefan Lazarov, Michael Lenke, Vivien Lohmer, Katharina Rohlfing, Ingrid Scharlau, Amit Singh, Lutz Terfloth, Anna-Lisa Vollmer, Yu Wang, Annedore Wilmes, Britta Wrede
  • for: 本文旨在提供一种对Explainable Artificial Intelligence(XAI)领域和其他领域的理解模型,以及对理解的定义和形式、评估和动力的探讨。
  • methods: 本文采用了多学科的视角,包括计算机科学、语言学、社会学和心理学,对理解的定义和形式、评估和动力进行了探讨和系统化。
  • results: 本文提出了理解的两种形式,即使能(enabledness,"knowing how")与领会(comprehension,"knowing that"),并论述了二者在解释过程中的发展及高度相互依赖的关系。
    Abstract Explainability has become an important topic in computer science and artificial intelligence, leading to a subfield called Explainable Artificial Intelligence (XAI). The goal of providing or seeking explanations is to achieve (better) 'understanding' on the part of the explainee. However, what it means to 'understand' is still not clearly defined, and the concept itself is rarely the subject of scientific investigation. This conceptual article aims to present a model of forms of understanding in the context of XAI and beyond. From an interdisciplinary perspective bringing together computer science, linguistics, sociology, and psychology, a definition of understanding and its forms, assessment, and dynamics during the process of giving everyday explanations are explored. Two types of understanding are considered as possible outcomes of explanations, namely enabledness, 'knowing how' to do or decide something, and comprehension, 'knowing that' -- both in different degrees (from shallow to deep). Explanations regularly start with shallow understanding in a specific domain and can lead to deep comprehension and enabledness of the explanandum, which we see as a prerequisite for human users to gain agency. In this process, the increase of comprehension and enabledness are highly interdependent. Against the background of this systematization, special challenges of understanding in XAI are discussed.

Cross-domain feature disentanglement for interpretable modeling of tumor microenvironment impact on drug response

  • paper_url: http://arxiv.org/abs/2311.09264
  • repo_url: None
  • paper_authors: Jia Zhai, Hui Liu
  • for: This study aims to model the impact of the tumor microenvironment (TME) on drug response, to better predict and characterize clinical drug efficacy.
  • methods: A domain adaptation network performs feature disentanglement, separating representations of the source domain (cell lines) from the target domain (tumors), while a Graph Attention Network learns latent drug representations.
  • results: The model achieves superior performance in predicting clinical drug response and in dissecting the influence of the TME on drug efficacy.
    Abstract High-throughput screening technology has facilitated the generation of large-scale drug responses across hundreds of cancer cell lines. However, there is a significant discrepancy between in vitro cell lines and actual tumors in vivo in terms of their response to drug treatments, because tumors comprise complex cellular compositions and histopathological structure, known as the tumor microenvironment (TME), which greatly influences the drug cytotoxicity against tumor cells. To date, no study has focused on modeling the impact of the TME on clinical drug response. This paper proposed a domain adaptation network for feature disentanglement to separate representations of cancer cells and TME of a tumor in patients. Two denoising autoencoders were separately used to extract features from cell lines (source domain) and tumors (target domain) for partial domain alignment and feature decoupling. The specific encoder was enforced to extract information only about TME. Moreover, to ensure generalizability to novel drugs, we applied a graph attention network to learn the latent representation of drugs, allowing us to linearly model the drug perturbation on cellular state in latent space. We calibrated our model on a benchmark dataset and demonstrated its superior performance in predicting clinical drug response and dissecting the influence of the TME on drug efficacy.
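To make the disentanglement setup concrete, here is a minimal sketch of a denoising autoencoder with a separate TME-specific encoder. Layer sizes, module names, and the plain MSE reconstruction loss are assumptions for illustration; the paper additionally uses partial domain alignment between cell lines and tumors, which is omitted here.

```python
# Hypothetical sketch of feature disentanglement with a denoising autoencoder (not the paper's code).
import torch
import torch.nn as nn

class DisentangledAE(nn.Module):
    """Splits a tumor expression profile into a cell-intrinsic part and a TME part."""
    def __init__(self, in_dim: int, latent_dim: int = 64):
        super().__init__()
        self.cell_encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.tme_encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(2 * latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x: torch.Tensor, noise_std: float = 0.1):
        noisy = x + noise_std * torch.randn_like(x)      # denoising-autoencoder input corruption
        z_cell, z_tme = self.cell_encoder(noisy), self.tme_encoder(noisy)
        recon = self.decoder(torch.cat([z_cell, z_tme], dim=1))
        return recon, z_cell, z_tme

model = DisentangledAE(in_dim=1000)
x = torch.randn(8, 1000)                                 # 8 expression profiles as dummy input
recon, z_cell, z_tme = model(x)
# Reconstruction term only; the paper adds alignment terms between cell lines and tumors.
loss = nn.functional.mse_loss(recon, x)
print(loss.item())
```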

Auto-ICL: In-Context Learning without Human Supervision

  • paper_url: http://arxiv.org/abs/2311.09263
  • repo_url: https://github.com/ecielyang/auto-icl
  • paper_authors: Jinghan Yang, Shuming Ma, Furu Wei
  • for: This work aims to make human-computer interaction with large language models more flexible and autonomous across tasks, reducing reliance on human-provided prompts.
  • methods: It proposes an Automatic In-Context Learning framework in which the model generates its own examples, labels, instructions, or reasoning pathways and then uses this self-produced context for in-context learning on the task at hand.
  • results: The method achieves strong performance across a range of tasks and compares favorably with existing approaches while requiring no human supervision.
    Abstract In the era of Large Language Models (LLMs), human-computer interaction has evolved towards natural language, offering unprecedented flexibility. Despite this, LLMs are heavily reliant on well-structured prompts to function efficiently within the realm of In-Context Learning. Vanilla In-Context Learning relies on human-provided contexts, such as labeled examples, explicit instructions, or other guiding mechanisms that shape the model's outputs. To address this challenge, our study presents a universal framework named Automatic In-Context Learning. Upon receiving a user's request, we ask the model to independently generate examples, including labels, instructions, or reasoning pathways. The model then leverages this self-produced context to tackle the given problem. Our approach is universally adaptable and can be implemented in any setting where vanilla In-Context Learning is applicable. We demonstrate that our method yields strong performance across a range of tasks, standing up well when compared to existing methods.
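A minimal sketch of the self-generated-context loop is shown below. The `generate` function is a placeholder for any text-completion API, and the prompt wording is an assumption, not the prompts used in the paper.

```python
# Minimal sketch of Automatic In-Context Learning (assumptions: `generate` is a placeholder
# for an LLM call; prompt phrasing is invented for illustration).
def generate(prompt: str) -> str:
    """Stand-in for a call to a large language model; replace with a real API client."""
    raise NotImplementedError

def auto_icl(question: str, n_examples: int = 3) -> str:
    # Step 1: ask the model to produce its own labeled demonstrations for this kind of question.
    demo_prompt = (
        f"Write {n_examples} example questions similar to the one below, "
        f"each followed by a correct, step-by-step answer.\n\nQuestion: {question}\n"
    )
    demonstrations = generate(demo_prompt)
    # Step 2: reuse the self-produced context as in-context demonstrations for the real query.
    final_prompt = f"{demonstrations}\n\nQuestion: {question}\nAnswer:"
    return generate(final_prompt)
```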

Disentangling the Potential Impacts of Papers into Diffusion, Conformity, and Contribution Values

  • paper_url: http://arxiv.org/abs/2311.09262
  • repo_url: None
  • paper_authors: Zhikai Xue, Guoxiu He, Zhuoren Jiang, Yangyang Kang, Star Zhao, Wei Lu
  • for: The paper estimates the potential impact of academic papers and disentangles it into three components: diffusion, conformity, and contribution.
  • methods: It proposes a graph neural network, DPPDCC, that encodes temporal and structural features within a dynamic heterogeneous graph, uses comparative and co-cited/citing information to capture knowledge flow, and applies orthogonal constraints so that each perspective is modeled distinctly.
  • results: Experiments show that DPPDCC significantly outperforms baselines for previously, freshly, and immediately published papers, and further analyses confirm its robustness across paper types and datasets.
    Abstract The potential impact of an academic paper is determined by various factors, including its popularity and contribution. Existing models usually estimate original citation counts based on static graphs and fail to differentiate values from nuanced perspectives. In this study, we propose a novel graph neural network to Disentangle the Potential impacts of Papers into Diffusion, Conformity, and Contribution values (called DPPDCC). Given a target paper, DPPDCC encodes temporal and structural features within the constructed dynamic heterogeneous graph. Particularly, to capture the knowledge flow, we emphasize the importance of comparative and co-cited/citing information between papers and aggregate snapshots evolutionarily. To unravel popularity, we contrast augmented graphs to extract the essence of diffusion and predict the accumulated citation binning to model conformity. We further apply orthogonal constraints to encourage distinct modeling of each perspective and preserve the inherent value of contribution. To evaluate models' generalization for papers published at various times, we reformulate the problem by partitioning data based on specific time points to mirror real-world conditions. Extensive experimental results on three datasets demonstrate that DPPDCC significantly outperforms baselines for previously, freshly, and immediately published papers. Further analyses confirm its robust capabilities. We will make our datasets and codes publicly available.

Emerging Drug Interaction Prediction Enabled by Flow-based Graph Neural Network with Biomedical Network

  • paper_url: http://arxiv.org/abs/2311.09261
  • repo_url: https://github.com/lars-research/emergnn
  • paper_authors: Yongqi Zhang, Quanming Yao, Ling Yue, Xian Wu, Ziheng Zhang, Zhenxi Lin, Yefeng Zheng
  • for: Predicting drug-drug interactions involving emerging drugs, to improve patient care and make drug development more efficient.
  • methods: A graph neural network (GNN) predicts interactions for emerging drugs by leveraging biomedical networks, extracting paths between drug pairs and weighting edges by their relevance to the prediction.
  • results: EmerGNN predicts interactions for emerging drugs more accurately than existing approaches and can identify the most relevant biomedical concepts on the network.
    Abstract Accurately predicting drug-drug interactions (DDI) for emerging drugs, which offer possibilities for treating and alleviating diseases, with computational methods can improve patient care and contribute to efficient drug development. However, many existing computational methods require large amounts of known DDI information, which is scarce for emerging drugs. In this paper, we propose EmerGNN, a graph neural network (GNN) that can effectively predict interactions for emerging drugs by leveraging the rich information in biomedical networks. EmerGNN learns pairwise representations of drugs by extracting the paths between drug pairs, propagating information from one drug to the other, and incorporating the relevant biomedical concepts on the paths. The different edges on the biomedical network are weighted to indicate the relevance for the target DDI prediction. Overall, EmerGNN has higher accuracy than existing approaches in predicting interactions for emerging drugs and can identify the most relevant information on the biomedical network.
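The path-based intuition can be illustrated with a toy biomedical graph: enumerate short paths between a drug pair and collect the relations along them, which is the kind of information a model like EmerGNN propagates and weights. The graph below is invented for illustration and uses networkx.

```python
# Toy illustration of path extraction between a drug pair in a biomedical network.
# Node and relation names are made up; they are not taken from the paper's dataset.
import networkx as nx

G = nx.Graph()
G.add_edge("drug_A", "protein_P", relation="targets")
G.add_edge("protein_P", "pathway_X", relation="participates_in")
G.add_edge("drug_B", "pathway_X", relation="modulates")
G.add_edge("drug_A", "drug_C", relation="similar_to")
G.add_edge("drug_C", "protein_P", relation="targets")

def paths_with_relations(graph, src, dst, cutoff=3):
    """Yield each simple path between src and dst together with its edge relations."""
    for path in nx.all_simple_paths(graph, src, dst, cutoff=cutoff):
        relations = [graph[u][v]["relation"] for u, v in zip(path, path[1:])]
        yield path, relations

for nodes, rels in paths_with_relations(G, "drug_A", "drug_B"):
    print(nodes, rels)
# A GNN like EmerGNN would propagate information along such paths and learn edge weights
# indicating how relevant each biomedical relation is to the target DDI prediction.
```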

Joint User Pairing and Beamforming Design of Multi-STAR-RISs-Aided NOMA in the Indoor Environment via Multi-Agent Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.08708
  • repo_url: None
  • paper_authors: Yu Min Park, Yan Kyaw Tun, Choong Seon Hong
  • for: 6G/B5G wireless networks with quality requirements beyond current 5G
  • methods: NOMA to let multiple users share the same resources; STAR-RISs for improved coverage, spectral efficiency, and communication reliability
  • results: a joint user pairing and beamforming design for multi-STAR-RISs in an indoor environment that maximizes the total throughput of multiple users (MUs) by jointly optimizing the decoding order, user pairing, active beamforming, and passive beamforming
    Abstract The development of 6G/B5G wireless networks, which have requirements that go beyond current 5G networks, is gaining interest from academia and industry. However, to increase 6G/B5G network quality, conventional cellular networks that rely on terrestrial base stations are constrained geographically and economically. Meanwhile, NOMA allows multiple users to share the same resources, which improves the spectral efficiency of the system and has the advantage of supporting a larger number of users. Additionally, by intelligently manipulating the phase and amplitude of both the reflected and transmitted signals, STAR-RISs can achieve improved coverage, increased spectral efficiency, and enhanced communication reliability. However, STAR-RISs must simultaneously optimize the amplitude and phase shift corresponding to reflection and transmission, which makes the existing terrestrial networks more complicated and is considered a major challenging issue. Motivated by the above, we study the joint user pairing for NOMA and beamforming design of Multi-STAR-RISs in an indoor environment. Then, we formulate the optimization problem with the objective of maximizing the total throughput of MUs by jointly optimizing the decoding order, user pairing, active beamforming, and passive beamforming. However, the formulated problem is a MINLP. To address this challenge, we first introduce the decoding order for NOMA networks. Next, we decompose the original problem into two subproblems, namely: 1) MU pairing and 2) Beamforming optimization under the optimal decoding order. For the first subproblem, we employ correlation-based K-means clustering to solve the user pairing problem. Then, to jointly deal with beamforming vector optimizations, we propose MAPPO, which can make quick decisions in the given environment owing to its low complexity.
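The correlation-based K-means user pairing step can be sketched as follows. Random complex channels stand in for measured CSI, and the strong-with-weak pairing heuristic inside each cluster is a common NOMA choice assumed here for illustration rather than taken from the paper.

```python
# Rough sketch of correlation-based user clustering for NOMA pairing (illustrative assumptions only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_users, n_antennas, n_clusters = 8, 4, 4
H = rng.standard_normal((n_users, n_antennas)) + 1j * rng.standard_normal((n_users, n_antennas))

# Feature: channel direction (captures correlation structure), real/imag parts stacked for k-means.
directions = H / np.linalg.norm(H, axis=1, keepdims=True)
features = np.hstack([directions.real, directions.imag])
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)

pairs = []
for c in range(n_clusters):
    members = np.flatnonzero(labels == c)
    if len(members) >= 2:
        gains = np.linalg.norm(H[members], axis=1)
        order = members[np.argsort(gains)]
        pairs.append((int(order[-1]), int(order[0])))  # pair the strongest with the weakest user
print("user pairs:", pairs)
```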

Aligned: A Platform-based Process for Alignment

  • paper_url: http://arxiv.org/abs/2311.08706
  • repo_url: https://github.com/klonnet23/helloy-word
  • paper_authors: Ethan Shaotran, Ido Pesok, Sam Jones, Emi Liu
  • for: The work introduces a trustworthy, public-facing approach to the safety and alignment of frontier models, and eventually superintelligence.
  • methods: It uses a constitutional committee framework; initial tests with 680 participants produced a 30-guideline constitution with 93% overall support.
  • results: The platform is shown to scale naturally, instilling confidence and enjoyment among community participants.
    Abstract We are introducing Aligned, a platform for global governance and alignment of frontier models, and eventually superintelligence. While previous efforts at the major AI labs have attempted to gather inputs for alignment, these are often conducted behind closed doors. We aim to set the foundation for a more trustworthy, public-facing approach to safety: a constitutional committee framework. Initial tests with 680 participants result in a 30-guideline constitution with 93% overall support. We show the platform naturally scales, instilling confidence and enjoyment from the community. We invite other AI labs and teams to plug and play into the Aligned ecosystem.

Can Large Language Models Follow Concept Annotation Guidelines? A Case Study on Scientific and Financial Domains

  • paper_url: http://arxiv.org/abs/2311.08704
  • repo_url: None
  • paper_authors: Marcio Fonseca, Shay B. Cohen
  • for: The paper examines the capacity of instruction-tuned large language models (LLMs) to follow in-context concept guidelines for sentence labeling tasks.
  • methods: Zero-shot sentence classification tasks with different types of factual and counterfactual concept definitions as prompts are used to test the models' ability to recognize new concepts.
  • results: Only the larger models (with 70B parameters or more) have even limited ability to work under counterfactual contexts, and only proprietary models such as GPT-3.5 and GPT-4 recognize nonsensical guidelines; Falcon-180B-chat is outperformed by Llama-2-70B-chat in most cases, indicating that careful fine-tuning is more effective than increasing model scale.
    Abstract Although large language models (LLMs) exhibit remarkable capacity to leverage in-context demonstrations, it is still unclear to what extent they can learn new concepts or facts from ground-truth labels. To address this question, we examine the capacity of instruction-tuned LLMs to follow in-context concept guidelines for sentence labeling tasks. We design guidelines that present different types of factual and counterfactual concept definitions, which are used as prompts for zero-shot sentence classification tasks. Our results show that although concept definitions consistently help in task performance, only the larger models (with 70B parameters or more) have limited ability to work under counterfactual contexts. Importantly, only proprietary models such as GPT-3.5 and GPT-4 can recognize nonsensical guidelines, which we hypothesize is due to more sophisticated alignment methods. Finally, we find that Falcon-180B-chat is outperformed by Llama-2-70B-chat in most cases, which indicates that careful fine-tuning is more effective than increasing model scale. Altogether, our simple evaluation method reveals significant gaps in concept understanding between the most capable open-source language models and the leading proprietary APIs.
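The evaluation setup, classifying the same sentence zero-shot under factual versus counterfactual concept definitions, can be sketched as below. The guideline texts are invented examples, and `classify` is a placeholder for an instruction-tuned LLM call.

```python
# Sketch of prompting with factual vs. counterfactual concept guidelines (illustrative only).
FACTUAL = "A 'claim' is a sentence asserting something that can be verified or refuted."
COUNTERFACTUAL = "A 'claim' is any sentence that ends with a question mark."

def build_prompt(guideline: str, sentence: str) -> str:
    return (
        f"Concept guideline: {guideline}\n"
        f"Sentence: {sentence}\n"
        "Does the sentence contain a claim according to the guideline above? Answer yes or no."
    )

def classify(prompt: str) -> str:
    """Stand-in for an instruction-tuned LLM; replace with an actual API or local model."""
    raise NotImplementedError

sentence = "The drug reduced symptoms in 60% of patients."
for name, guideline in [("factual", FACTUAL), ("counterfactual", COUNTERFACTUAL)]:
    print(name, build_prompt(guideline, sentence))
```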

Debate Helps Supervise Unreliable Experts

  • paper_url: http://arxiv.org/abs/2311.08702
  • repo_url: https://github.com/julianmichael/debate
  • paper_authors: Julian Michael, Salsabila Mahdi, David Rein, Jackson Petty, Julien Dirani, Vishakh Padmakumar, Samuel R. Bowman
  • for: supervising unreliable AI systems to give answers that are systematically true
  • methods: using debate between two unreliable experts to help a non-expert judge more reliably identify the truth
  • results: debate performs significantly better than consultancy (a baseline approach) and is more efficient, with 84% judge accuracy compared to 74% for consultancy
    Abstract As AI systems are used to answer more difficult questions and potentially help create new knowledge, judging the truthfulness of their outputs becomes more difficult and more important. How can we supervise unreliable experts, which have access to the truth but may not accurately report it, to give answers that are systematically true and don't just superficially seem true, when the supervisor can't tell the difference between the two on their own? In this work, we show that debate between two unreliable experts can help a non-expert judge more reliably identify the truth. We collect a dataset of human-written debates on hard reading comprehension questions where the judge has not read the source passage, only ever seeing expert arguments and short quotes selectively revealed by 'expert' debaters who have access to the passage. In our debates, one expert argues for the correct answer, and the other for an incorrect answer. Comparing debate to a baseline we call consultancy, where a single expert argues for only one answer which is correct half of the time, we find that debate performs significantly better, with 84% judge accuracy compared to consultancy's 74%. Debates are also more efficient, being 68% of the length of consultancies. By comparing human to AI debaters, we find evidence that with more skilled (in this case, human) debaters, the performance of debate goes up but the performance of consultancy goes down. Our error analysis also supports this trend, with 46% of errors in human debate attributable to mistakes by the honest debater (which should go away with increased skill); whereas 52% of errors in human consultancy are due to debaters obfuscating the relevant evidence from the judge (which should become worse with increased skill). Overall, these results show that debate is a promising approach for supervising increasingly capable but potentially unreliable AI systems.

Artificial General Intelligence, Existential Risk, and Human Risk Perception

  • paper_url: http://arxiv.org/abs/2311.08698
  • repo_url: None
  • paper_authors: David R. Mandel
  • for: The paper examines the prospect of artificial general intelligence (AGI), which is projected to reach human-level intelligence within roughly the next two decades and then rapidly surpass it.
  • methods: Drawing on publicly available forecaster and opinion data, the author examines how experts and non-experts perceive risk from AGI.
  • results: The perceived risk of a world catastrophe or extinction from AGI is greater than for other existential risks such as nuclear war or human-caused climate change, and it has also risen more steeply over the last year.
    Abstract Artificial general intelligence (AGI) does not yet exist, but given the pace of technological development in artificial intelligence, it is projected to reach human-level intelligence within roughly the next two decades. After that, many experts expect it to far surpass human intelligence and to do so rapidly. The prospect of superintelligent AGI poses an existential risk to humans because there is no reliable method for ensuring that AGI goals stay aligned with human goals. Drawing on publicly available forecaster and opinion data, the author examines how experts and non-experts perceive risk from AGI. The findings indicate that the perceived risk of a world catastrophe or extinction from AGI is greater than for other existential risks. The increase in perceived risk over the last year is also steeper for AGI than for other existential threats (e.g., nuclear war or human-caused climate change). That AGI is a pressing existential risk is something on which experts and non-experts agree, but the basis for such agreement currently remains obscure.

An Eye on Clinical BERT: Investigating Language Model Generalization for Diabetic Eye Disease Phenotyping

  • paper_url: http://arxiv.org/abs/2311.08687
  • repo_url: https://github.com/kharrigian/ml4h-clinical-bert
  • paper_authors: Keith Harrigian, Tina Tang, Anthony Gonzales, Cindy X. Cai, Mark Dredze
  • for: The study supports monitoring clinical trajectories of diabetic eye disease and detecting lapses in care, with the goal of preventing blindness.
  • methods: A system extracts evidence from clinical text for 19 clinical concepts related to diabetic eye disease and infers relevant attributes for each.
  • results: BERT language models pretrained on out-of-distribution clinical data offer no significant improvement over models pretrained on non-clinical data for this domain.
    Abstract Diabetic eye disease is a major cause of blindness worldwide. The ability to monitor relevant clinical trajectories and detect lapses in care is critical to managing the disease and preventing blindness. Alas, much of the information necessary to support these goals is found only in the free text of the electronic medical record. To fill this information gap, we introduce a system for extracting evidence from clinical text of 19 clinical concepts related to diabetic eye disease and inferring relevant attributes for each. In developing this ophthalmology phenotyping system, we are also afforded a unique opportunity to evaluate the effectiveness of clinical language models at adapting to new clinical domains. Across multiple training paradigms, we find that BERT language models pretrained on out-of-distribution clinical data offer no significant improvement over BERT language models pretrained on non-clinical data for our domain. Our study tempers recent claims that language models pretrained on clinical data are necessary for clinical NLP tasks and highlights the importance of not treating clinical language data as a single homogeneous domain.

Safer-Instruct: Aligning Language Models with Automated Preference Data

  • paper_url: http://arxiv.org/abs/2311.08685
  • repo_url: https://github.com/uscnlp-lime/safer-instruct
  • paper_authors: Taiwei Shi, Kai Chen, Jieyu Zhao
  • for: The work aims to improve language model safety by constructing large-scale preference data for alignment without relying on costly human annotation.
  • methods: It proposes Safer-Instruct, a data generation pipeline that uses reversed instruction tuning, instruction induction, and expert model evaluation to produce high-quality preference data without human annotators.
  • results: Using LLaMA for instruction induction and GPT-4 as the expert model, roughly 10K preference samples were generated; an Alpaca model fine-tuned on this dataset shows improved harmlessness while maintaining competitive performance on conversation and downstream tasks.
    Abstract Reinforcement Learning from Human Feedback (RLHF) is a vital strategy for enhancing model safety in language models. However, annotating preference data for RLHF is a resource-intensive and creativity-demanding process, while automatic generation methods face limitations in data diversity and quality. In response, we present Safer-Instruct, a novel pipeline for semi-automatically constructing large-scale preference datasets. Our approach leverages reversed instruction tuning, instruction induction, and expert model evaluation to efficiently generate high-quality preference data without human annotators. We evaluate Safer-Instruct using LLaMA for instruction induction and GPT-4 as an expert model, generating approximately 10K preference samples. Finetuning an Alpaca model on this dataset demonstrates improved harmlessness while maintaining competitive performance on conversation and downstream tasks. Safer-Instruct addresses the challenges in preference data acquisition, advancing the development of safer and more responsible AI systems. Our code and data are available at https://github.com/uscnlp-lime/safer-instruct

Multi-Set Inoculation: Assessing Model Robustness Across Multiple Challenge Sets

  • paper_url: http://arxiv.org/abs/2311.08662
  • repo_url: None
  • paper_authors: Vatsal Gupta, Pranshu Pandya, Tushar Kataria, Vivek Gupta, Dan Roth
  • for: The study examines the sensitivity of language models to input perturbations in order to improve their trustworthiness.
  • methods: It uses fine-tuning with multiple perturbation-based training strategies, and extends the framework to LLMs via chain-of-thought (CoT) prompting with exemplars, to improve multi-perturbation robustness.
  • results: The proposed strategies train models that are robust to different perturbations without losing accuracy on the given task.
    Abstract Language models, given their black-box nature, often exhibit sensitivity to input perturbations, leading to trust issues due to hallucinations. To bolster trust, it's essential to understand these models' failure modes and devise strategies to enhance their performance. In this study, we propose a framework to study the effect of input perturbations on language models of different scales, from pre-trained models to large language models (LLMs). We use fine-tuning to train a robust model to perturbations, and we investigate whether exposure to one perturbation improves or degrades the model's performance on other perturbations. To address multi-perturbation robustness, we suggest three distinct training strategies. We also extend the framework to LLMs via a chain of thought(COT) prompting with exemplars. We instantiate our framework for the Tabular-NLI task and show that the proposed strategies train the model robust to different perturbations without losing accuracy on a given dataset.
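A small sketch of building perturbed challenge sets is shown below; the character-swap and word-drop perturbations are generic illustrations, not the specific perturbations or training strategies studied in the paper.

```python
# Generic illustration of constructing perturbed challenge sets for robustness evaluation.
import random

def swap_chars(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position."""
    chars = list(text)
    if len(chars) > 3:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def drop_word(text: str, rng: random.Random) -> str:
    """Delete one randomly chosen word."""
    words = text.split()
    if len(words) > 1:
        words.pop(rng.randrange(len(words)))
    return " ".join(words)

def make_challenge_sets(examples, seed=0):
    rng = random.Random(seed)
    return {
        "char_swap": [swap_chars(x, rng) for x in examples],
        "word_drop": [drop_word(x, rng) for x in examples],
    }

sets = make_challenge_sets(["The premise entails the hypothesis.", "Numbers in the table disagree."])
print(sets)
# A model fine-tuned ("inoculated") on one perturbation type can then be evaluated on the
# others to measure whether robustness transfers or degrades.
```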

Autonomous Large Language Model Agents Enabling Intent-Driven Mobile GUI Testing

  • paper_url: http://arxiv.org/abs/2311.08649
  • repo_url: None
  • paper_authors: Juyeon Yoon, Robert Feldt, Shin Yoo
  • for: automate GUI testing of Android apps to increase testing efficiency and coverage
  • methods: uses Large Language Models and support mechanisms such as long- and short-term memory to set relevant task goals and perform realistic tasks
  • results: achieved 61% activity coverage and 317 out of 374 autonomously created tasks are realistic and relevant to app functionalities, outperforming current state-of-the-art GUI testing techniques.
    Abstract GUI testing checks if a software system behaves as expected when users interact with its graphical interface, e.g., testing specific functionality or validating relevant use case scenarios. Currently, deciding what to test at this high level is a manual task since automated GUI testing tools target lower level adequacy metrics such as structural code coverage or activity coverage. We propose DroidAgent, an autonomous GUI testing agent for Android, for semantic, intent-driven automation of GUI testing. It is based on Large Language Models and support mechanisms such as long- and short-term memory. Given an Android app, DroidAgent sets relevant task goals and subsequently tries to achieve them by interacting with the app. Our empirical evaluation of DroidAgent using 15 apps from the Themis benchmark shows that it can set up and perform realistic tasks, with a higher level of autonomy. For example, when testing a messaging app, DroidAgent created a second account and added a first account as a friend, testing a realistic use case, without human intervention. On average, DroidAgent achieved 61% activity coverage, compared to 51% for current state-of-the-art GUI testing techniques. Further, manual analysis shows that 317 out of the 374 autonomously created tasks are realistic and relevant to app functionalities, and also that DroidAgent interacts deeply with the apps and covers more features.
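A highly simplified sketch of such an intent-driven testing loop is given below. All three interface functions are placeholders (assumptions), and the real DroidAgent additionally maintains long-term memory and richer planning.

```python
# Highly simplified sketch of an intent-driven GUI-testing agent loop (placeholder interfaces).
def llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for the language model

def get_gui_state() -> str:
    raise NotImplementedError  # stand-in for reading the current Android screen hierarchy

def perform(action: str) -> None:
    raise NotImplementedError  # stand-in for executing a tap / type / scroll on the device

def run_task(app_description: str, max_steps: int = 20) -> None:
    # Step 1: let the model set a realistic task goal for the app under test.
    task = llm(f"Given this app: {app_description}\nPropose one realistic user task to test.")
    short_term_memory = []  # recent (state, action) pairs for this task
    # Step 2: iteratively observe the GUI, pick the next action, and execute it.
    for _ in range(max_steps):
        state = get_gui_state()
        prompt = (
            f"Task: {task}\nRecent history: {short_term_memory[-5:]}\n"
            f"Current screen: {state}\nNext action (or DONE):"
        )
        action = llm(prompt)
        if action.strip() == "DONE":
            break
        perform(action)
        short_term_memory.append((state, action))
```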

Explore Spurious Correlations at the Concept Level in Language Models for Text Classification

  • paper_url: http://arxiv.org/abs/2311.08648
  • repo_url: None
  • paper_authors: Yuhang Zhou, Paiheng Xu, Xiaoyu Liu, Bang An, Wei Ai, Furong Huang
  • for: The study investigates robustness issues in language models (LMs) caused by spurious correlations across text classification tasks, and how to mitigate them at the concept level.
  • methods: An LLM labels the concept present in each text so that concept-level bias can be measured for fine-tuning and in-context learning, and a data rebalancing method adds LLM-generated counterfactual data to balance the label distribution for each concept.
  • results: Concept-level label distribution biases exist across multiple text classification datasets, LMs exploit these shortcuts when making predictions, and the proposed rebalancing method effectively mitigates them.
    Abstract Language models (LMs) have gained great achievement in various NLP tasks for both fine-tuning and in-context learning (ICL) methods. Despite its outstanding performance, evidence shows that spurious correlations caused by imbalanced label distributions in training data (or exemplars in ICL) lead to robustness issues. However, previous studies mostly focus on word- and phrase-level features and fail to tackle it from the concept level, partly due to the lack of concept labels and subtle and diverse expressions of concepts in text. In this paper, we first use the LLM to label the concept for each text and then measure the concept bias of models for fine-tuning or ICL on the test data. Second, we propose a data rebalancing method to mitigate the spurious correlations by adding the LLM-generated counterfactual data to make a balanced label distribution for each concept. We verify the effectiveness of our mitigation method and show its superiority over the token removal method. Overall, our results show that there exist label distribution biases in concepts across multiple text classification datasets, and LMs will utilize these shortcuts to make predictions in both fine-tuning and ICL methods.
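The rebalancing idea can be sketched by counting, per concept, how many label-flipped counterfactuals would be needed to equalize the label distribution. The concept and label names below are made up for illustration; in the paper the counterfactual texts themselves are generated by an LLM.

```python
# Sketch of planning concept-level data rebalancing (illustrative concept/label names).
from collections import Counter

examples = [
    ("food", "positive"), ("food", "positive"), ("food", "positive"), ("food", "negative"),
    ("service", "negative"), ("service", "negative"), ("service", "positive"),
]

def counterfactuals_needed(data):
    """For each concept, how many extra examples of each minority label are needed."""
    per_concept = {}
    for concept, label in data:
        per_concept.setdefault(concept, Counter())[label] += 1
    plan = {}
    for concept, counts in per_concept.items():
        target = max(counts.values())
        plan[concept] = {lbl: target - n for lbl, n in counts.items() if n < target}
    return plan

print(counterfactuals_needed(examples))
# {'food': {'negative': 2}, 'service': {'positive': 1}} -- generate that many label-flipped
# counterfactuals per concept to remove the spurious concept-label correlation.
```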

Interpretable by Design: Wrapper Boxes Combine Neural Performance with Faithful Explanations

  • paper_url: http://arxiv.org/abs/2311.08644
  • repo_url: None
  • paper_authors: Yiheng Su, Juni Jessy Li, Matthew Lease
  • for: Can the accuracy of neural models be preserved while also providing faithful explanations? The paper proposes 'wrapper boxes', a general approach for generating faithful, example-based explanations of model predictions while maintaining predictive performance.
  • methods: A neural model is trained as usual, and its learned feature representation is then fed to a classic, interpretable model that performs the actual prediction. This simple strategy is surprisingly effective, with results largely comparable to those of the original neural model across three large pre-trained language models, two datasets of varying scale, four classic models, and four evaluation metrics.
  • results: Because the classic models are interpretable by design, the subset of training examples that determines their predictions can be shown directly to users.
    Abstract Can we preserve the accuracy of neural models while also providing faithful explanations? We present wrapper boxes, a general approach to generate faithful, example-based explanations for model predictions while maintaining predictive performance. After training a neural model as usual, its learned feature representation is input to a classic, interpretable model to perform the actual prediction. This simple strategy is surprisingly effective, with results largely comparable to those of the original neural model, as shown across three large pre-trained language models, two datasets of varying scale, four classic models, and four evaluation metrics. Moreover, because these classic models are interpretable by design, the subset of training examples that determine classic model predictions can be shown directly to users.
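A conceptual sketch of a wrapper box with k-nearest neighbors as the interpretable model is shown below; random vectors stand in for the neural feature representations, and kNN is only one of the classic models the paper evaluates.

```python
# Conceptual sketch of a "wrapper box": neural features feed a classic interpretable model,
# and the retrieved training examples serve as the explanation. Features here are random stand-ins.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
train_feats = rng.standard_normal((100, 32))      # stand-in for neural sentence embeddings
train_labels = rng.integers(0, 2, size=100)
train_texts = [f"training example {i}" for i in range(100)]

knn = KNeighborsClassifier(n_neighbors=3).fit(train_feats, train_labels)

query = rng.standard_normal((1, 32))              # embedding of a new input
pred = knn.predict(query)[0]
_, idx = knn.kneighbors(query)                    # the training examples that determined the prediction
print("prediction:", pred)
print("explained by:", [train_texts[i] for i in idx[0]])
```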

Spatio-Temporal Graph Neural Point Process for Traffic Congestion Event Prediction

  • paper_url: http://arxiv.org/abs/2311.08635
  • repo_url: None
  • paper_authors: Guangyin Jin, Lingbo Liu, Fuxian Li, Jincai Huang
  • for: Predicting traffic congestion events to improve the effectiveness of intelligent transportation systems.
  • methods: A spatio-temporal graph neural point process framework (STGNPP) captures long-range spatio-temporal dependencies from historical traffic states along the road network, models congestion evolution patterns with a continuous gated recurrent unit, and uses a periodic gated mechanism in the intensity function.
  • results: Extensive experiments on two real-world datasets show superior performance compared to existing state-of-the-art methods.
    Abstract Traffic congestion event prediction is an important yet challenging task in intelligent transportation systems. Many existing works about traffic prediction integrate various temporal encoders and graph convolution networks (GCNs), called spatio-temporal graph-based neural networks, which focus on predicting dense variables such as flow, speed and demand in time snapshots, but they can hardly forecast the traffic congestion events that are sparsely distributed on the continuous time axis. In recent years, neural point process (NPP) has emerged as an appropriate framework for event prediction in continuous time scenarios. However, most conventional works about NPP cannot model the complex spatio-temporal dependencies and congestion evolution patterns. To address these limitations, we propose a spatio-temporal graph neural point process framework, named STGNPP for traffic congestion event prediction. Specifically, we first design the spatio-temporal graph learning module to fully capture the long-range spatio-temporal dependencies from the historical traffic state data along with the road network. The extracted spatio-temporal hidden representation and congestion event information are then fed into a continuous gated recurrent unit to model the congestion evolution patterns. In particular, to fully exploit the periodic information, we also improve the intensity function calculation of the point process with a periodic gated mechanism. Finally, our model simultaneously predicts the occurrence time and duration of the next congestion. Extensive experiments on two real-world datasets demonstrate that our method achieves superior performance in comparison to existing state-of-the-art approaches.
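As background for the point-process formulation, the sketch below computes the standard temporal point process log-likelihood with a simple exponential-decay intensity chosen purely for illustration; STGNPP instead parameterizes the intensity with spatio-temporal graph features and a periodic gate.

```python
# Background sketch: the standard temporal point process log-likelihood that neural point
# process models optimize, here with a toy exponential-decay intensity (not STGNPP's intensity).
import numpy as np

def intensity(t, events, mu=0.2, alpha=0.5, beta=1.0):
    """Toy self-exciting intensity: baseline plus decaying influence of past events."""
    past = events[events < t]
    return mu + alpha * np.sum(np.exp(-beta * (t - past)))

def log_likelihood(events, T, n_grid=2000):
    events = np.asarray(events, dtype=float)
    term1 = sum(np.log(intensity(t, events)) for t in events)     # sum of log-intensities at events
    grid = np.linspace(0.0, T, n_grid)
    term2 = np.trapz([intensity(t, events) for t in grid], grid)  # integral of the intensity over [0, T]
    return term1 - term2

print(log_likelihood([0.7, 1.9, 2.1, 4.5], T=6.0))
```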

XplainLLM: A QA Explanation Dataset for Understanding LLM Decision-Making

  • paper_url: http://arxiv.org/abs/2311.08614
  • repo_url: None
  • paper_authors: Zichen Chen, Jianda Chen, Mitali Gaidhani, Ambuj Singh, Misha Sra
  • for: The work aims to bring transparency to the decision-making of large language models (LLMs) by creating a new question-answer-explanation (QAE) dataset that integrates knowledge graphs (KGs).
  • methods: Knowledge graphs and graph attention networks (GATs) are used to find reason-elements, which are transformed into human-comprehensible why-choose and why-not-choose explanations.
  • results: Quantitative and qualitative evaluations show that the dataset improves in-context learning of LLMs and enhances their interpretability and explainability, making them more transparent and potentially more trustworthy.
    Abstract Large Language Models (LLMs) have recently made impressive strides in natural language understanding tasks. Despite their remarkable performance, understanding their decision-making process remains a big challenge. In this paper, we look into bringing some transparency to this process by introducing a new explanation dataset for question answering (QA) tasks that integrates knowledge graphs (KGs) in a novel way. Our dataset includes 12,102 question-answer-explanation (QAE) triples. Each explanation in the dataset links the LLM's reasoning to entities and relations in the KGs. The explanation component includes a why-choose explanation, a why-not-choose explanation, and a set of reason-elements that underlie the LLM's decision. We leverage KGs and graph attention networks (GAT) to find the reason-elements and transform them into why-choose and why-not-choose explanations that are comprehensible to humans. Through quantitative and qualitative evaluations, we demonstrate the potential of our dataset to improve the in-context learning of LLMs, and enhance their interpretability and explainability. Our work contributes to the field of explainable AI by enabling a deeper understanding of the LLMs decision-making process to make them more transparent and thereby, potentially more reliable, to researchers and practitioners alike. Our dataset is available at: https://github.com/chen-zichen/XplainLLM_dataset.git

  • paper_url: http://arxiv.org/abs/2311.08605
  • repo_url: https://github.com/david-jenny/llm-political-study
  • paper_authors: David F. Jenny, Yann Billeter, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin
  • for: The study explores the decision-making processes and inherent biases of Large Language Models (LLMs) in the context of political debates.
  • methods: Activity Dependency Networks (ADNs) are applied to extract the LLMs' implicit criteria for assessing "good arguments" and to illustrate how normative values influence these perceptions.
  • results: LLMs exhibit biases when evaluating "good arguments", and these biases are shaped by normative values; implications for human-AI alignment and bias mitigation are discussed.
    Abstract The rapid advancement of Large Language Models (LLMs) has sparked intense debate regarding their ability to perceive and interpret complex socio-political landscapes. In this study, we undertake an exploration of decision-making processes and inherent biases within LLMs, exemplified by ChatGPT, specifically contextualizing our analysis within political debates. We aim not to critique or validate LLMs' values, but rather to discern how they interpret and adjudicate "good arguments." By applying Activity Dependency Networks (ADNs), we extract the LLMs' implicit criteria for such assessments and illustrate how normative values influence these perceptions. We discuss the consequences of our findings for human-AI alignment and bias mitigation. Our code and data at https://github.com/david-jenny/LLM-Political-Study.