cs.CL - 2023-10-11

Crosslingual Structural Priming and the Pre-Training Dynamics of Bilingual Language Models

  • paper_url: http://arxiv.org/abs/2310.07929
  • repo_url: None
  • paper_authors: Catherine Arnett, Tyler A. Chang, James A. Michaelov, Benjamin K. Bergen
  • for: This study investigates whether multilingual language models share abstract grammatical representations across languages, and if so, how these representations develop.
  • methods: The authors use structural priming to test for abstract grammatical representations with causal effects on model outputs, extending the approach to a Dutch-English bilingual setting and evaluating a Dutch-English language model during pre-training.
  • results: Crosslingual structural priming effects emerge early after exposure to the second language, with fewer than 1M tokens of data in that language. The findings have implications for data contamination, low-resource transfer, and how abstract grammatical representations emerge in multilingual models.
    Abstract Do multilingual language models share abstract grammatical representations across languages, and if so, when do these develop? Following Sinclair et al. (2022), we use structural priming to test for abstract grammatical representations with causal effects on model outputs. We extend the approach to a Dutch-English bilingual setting, and we evaluate a Dutch-English language model during pre-training. We find that crosslingual structural priming effects emerge early after exposure to the second language, with less than 1M tokens of data in that language. We discuss implications for data contamination, low-resource transfer, and how abstract grammatical representations emerge in multilingual models.

The Expressive Power of Transformers with Chain of Thought

  • paper_url: http://arxiv.org/abs/2310.07923
  • repo_url: None
  • paper_authors: William Merrill, Ashish Sabharwal
  • for: This work asks whether intermediate generation fundamentally extends the computational power of decoder-only transformers on reasoning problems.
  • methods: The authors study transformers that use a "chain of thought" or "scratchpad", i.e., generate and condition on a sequence of intermediate tokens before answering.
  • results: Intermediate generation does increase computational power, but the amount of increase depends crucially on the number of intermediate steps. A logarithmic number of decoding steps (with respect to the input length) pushes the limits of standard transformers only slightly, whereas a linear number of steps adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Linear steps keep transformer decoders within the context-sensitive languages, and a polynomial number of steps makes them recognize exactly the class of polynomial-time solvable problems, the first exact characterization of a type of transformer in terms of standard complexity classes.
    Abstract Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a "chain of thought" or "scratchpad", i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps make them recognize exactly the class of polynomial-time solvable problems -- the first exact characterization of a type of transformers in terms of standard complexity classes. Together, our results provide a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.
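The intuition behind the linear-steps result can be illustrated outside the transformer setting: with one intermediate token per input symbol, a decoder only needs to carry the current automaton state forward. The following is a hand-written toy illustration of that idea, not the paper's actual construction:

```python
# Toy illustration: recognizing the regular language "even number of 1s"
# by emitting one intermediate "state token" per input symbol, mimicking
# a scratchpad with a linear number of decoding steps.
def recognize_with_scratchpad(bits):
    transitions = {("even", "0"): "even", ("even", "1"): "odd",
                   ("odd", "0"): "odd", ("odd", "1"): "even"}
    scratchpad = ["even"]  # initial DFA state
    for symbol in bits:
        # each decoding step only needs the most recently emitted state token
        scratchpad.append(transitions[(scratchpad[-1], symbol)])
    return scratchpad[-1] == "even", scratchpad

accepted, trace = recognize_with_scratchpad("1101")
print(accepted, trace)  # False ['even', 'odd', 'even', 'even', 'odd']
```

Without the scratchpad, the answer would have to be produced immediately after reading the input, which is exactly the setting the paper shows to be provably weaker.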

Pit One Against Many: Leveraging Attention-head Embeddings for Parameter-efficient Multi-head Attention

  • paper_url: http://arxiv.org/abs/2310.07911
  • repo_url: None
  • paper_authors: Huiyin Xue, Nikolaos Aletras
  • for: Reducing the memory requirements of pre-trained language models while preserving performance on natural language processing tasks.
  • methods: Inspired by the position embeddings in transformers, the authors propose an alternative to the multi-head attention (MHA) mechanism that uses a single shared projection matrix together with multiple head embeddings (MHE), one per head.
  • results: Experiments on several downstream tasks show that MHE attention is substantially more memory efficient while retaining a high predictive performance ratio relative to vanilla MHA. MHE adds only a negligible $3nd$ parameters over single-head attention, whereas MHA requires $(3n^2-3n)d^2-3nd$ additional parameters.
    Abstract Scaling pre-trained language models has resulted in large performance gains in various natural language processing tasks but comes with a large cost in memory requirements. Inspired by the position embeddings in transformers, we aim to simplify and reduce the memory footprint of the multi-head attention (MHA) mechanism. We propose an alternative module that uses only a single shared projection matrix and multiple head embeddings (MHE), i.e. one per head. We empirically demonstrate that our MHE attention is substantially more memory efficient compared to alternative attention mechanisms while achieving high predictive performance retention ratio to vanilla MHA on several downstream tasks. MHE attention only requires a negligible fraction of additional parameters ($3nd$, where $n$ is the number of attention heads and $d$ the size of the head embeddings) compared to a single-head attention, while MHA requires $(3n^2-3n)d^2-3nd$ additional parameters.
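Plugging representative values into the parameter formulas quoted in the abstract makes the savings concrete (the choices of $n$ and $d$ below are illustrative, not values from the paper):

```python
# Extra parameters over single-head attention, per the formulas in the abstract:
# MHE adds 3*n*d, while MHA adds (3*n**2 - 3*n)*d**2 - 3*n*d.
n, d = 12, 64  # illustrative: 12 attention heads, head-embedding size 64

mhe_extra = 3 * n * d
mha_extra = (3 * n**2 - 3 * n) * d**2 - 3 * n * d

print(mhe_extra)               # 2304
print(mha_extra)               # 1619712
print(mha_extra // mhe_extra)  # 703 -- MHA needs ~700x more extra parameters
```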

Assessing Evaluation Metrics for Neural Test Oracle Generation

  • paper_url: http://arxiv.org/abs/2310.07856
  • repo_url: None
  • paper_authors: Jiho Shin, Hadi Hemmati, Moshi Wei, Song Wang
  • for: This work revisits existing test oracle generation studies, plus ChatGPT, to empirically investigate their current performance on both NLG-based and test adequacy metrics.
  • methods: The authors train and run four state-of-the-art test oracle generation models, analyzing them on five NLG-based and two test adequacy metrics and applying two different correlation analyses between the two sets of metrics.
  • results: No significant correlation was found between the NLG-based and test adequacy metrics. For example, oracles generated by ChatGPT on the activemq-artemis project scored highest on all NLG-based metrics among the studied NOGs, yet had the most projects with decreased test adequacy metrics. A qualitative analysis found that oracles with high NLG-based but low test adequacy metrics tend to contain complex or multiple chained method invocations within the oracle's parameters, making them hard for the model to generate completely and hurting the test adequacy metrics.
    Abstract In this work, we revisit existing oracle generation studies plus ChatGPT to empirically investigate the current standing of their performance in both NLG-based and test adequacy metrics. Specifically, we train and run four state-of-the-art test oracle generation models on five NLG-based and two test adequacy metrics for our analysis. We apply two different correlation analyses between these two different sets of metrics. Surprisingly, we found no significant correlation between the NLG-based metrics and test adequacy metrics. For instance, oracles generated from ChatGPT on the project activemq-artemis had the highest performance on all the NLG-based metrics among the studied NOGs, however, it had the most number of projects with a decrease in test adequacy metrics compared to all the studied NOGs. We further conduct a qualitative analysis to explore the reasons behind our observations, we found that oracles with high NLG-based metrics but low test adequacy metrics tend to have complex or multiple chained method invocations within the oracle's parameters, making it hard for the model to generate completely, affecting the test adequacy metrics. On the other hand, oracles with low NLG-based metrics but high test adequacy metrics tend to have to call different assertion types or a different method that functions similarly to the ones in the ground truth. Overall, this work complements prior studies on test oracle generation with an extensive performance evaluation with both NLG and test adequacy metrics and provides guidelines for better assessment of deep learning applications in software test generation in the future.
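A minimal sketch of the kind of rank-correlation check described above, computed over invented scores (the paper's analysis uses its actual measured metrics; this pure-Python Spearman assumes distinct scores, i.e., no tie handling):

```python
# Spearman correlation = Pearson correlation on ranks.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def spearman(a, b):
    return pearson(ranks(a), ranks(b))

# Invented example scores for five generated oracles:
bleu_like = [0.9, 0.8, 0.7, 0.6, 0.5]   # NLG-based metric
adequacy  = [0.2, 0.5, 0.1, 0.6, 0.4]   # test adequacy metric
rho = spearman(bleu_like, adequacy)
print(rho)  # ~ -0.3, i.e. the two metrics disagree on these toy scores
```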

Framework for Question-Answering in Sanskrit through Automated Construction of Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2310.07848
  • repo_url: None
  • paper_authors: Hrishikesh Terdalkar, Arnab Bhattacharya
  • for: This work targets extracting knowledge from Sanskrit (sa\d{m}sk\d{r}ta) texts and using knowledge graphs to answer questions about them.
  • methods: The authors build a natural language question-answering system in Sanskrit that uses automatically constructed knowledge graphs to answer factoid questions, with instances covering human relationships from mah\=abh\=arata and r\=am\=aya\d{n}a and synonymous relationships from bh\=avaprak\=a\'sa nigha\d{n}\d{t}u.
  • results: About 50% of the factoid questions can be answered correctly by the system. The authors also analyze the system's shortcomings at each step and discuss possible ways forward.
    Abstract Sanskrit (sa\d{m}sk\d{r}ta) enjoys one of the largest and most varied literature in the whole world. Extracting the knowledge from it, however, is a challenging task due to multiple reasons including complexity of the language and paucity of standard natural language processing tools. In this paper, we target the problem of building knowledge graphs for particular types of relationships from sa\d{m}sk\d{r}ta texts. We build a natural language question-answering system in sa\d{m}sk\d{r}ta that uses the knowledge graph to answer factoid questions. We design a framework for the overall system and implement two separate instances of the system on human relationships from mah\=abh\=arata and r\=am\=aya\d{n}a, and one instance on synonymous relationships from bh\=avaprak\=a\'sa nigha\d{n}\d{t}u, a technical text from \=ayurveda. We show that about 50% of the factoid questions can be answered correctly by the system. More importantly, we analyse the shortcomings of the system in detail for each step, and discuss the possible ways forward.
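A knowledge-graph-backed factoid lookup of the kind the system performs can be sketched with a toy triple store (the triples and query helper below are illustrative placeholders, not the paper's extraction pipeline or data):

```python
# Toy knowledge graph as (subject, relation, object) triples, answering
# factoid questions of the form "who is the <relation> of <entity>?".
triples = [
    ("arjuna", "father", "pandu"),
    ("arjuna", "mother", "kunti"),
    ("abhimanyu", "father", "arjuna"),
]

def answer(entity, relation, kg):
    matches = [o for s, r, o in kg if s == entity and r == relation]
    return matches[0] if matches else None

print(answer("arjuna", "father", triples))     # pandu
print(answer("abhimanyu", "mother", triples))  # None -- relation not in the graph
```

The unanswerable second query mirrors the paper's point: coverage gaps in the automatically constructed graph are a major source of the roughly 50% of questions the system misses.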

Antarlekhaka: A Comprehensive Tool for Multi-task Natural Language Annotation

  • paper_url: http://arxiv.org/abs/2310.07826
  • repo_url: https://github.com/Antarlekhaka/code
  • paper_authors: Hrishikesh Terdalkar, Arnab Bhattacharya
  • for: This paper aims to advance NLP, especially for low-resource languages, by addressing the lack of annotated datasets for training and testing machine learning models.
  • methods: The paper presents Antarlekhaka, a tool for manual annotation across a comprehensive set of NLP tasks. The tool supports multiple simultaneous annotators, is language-agnostic and Web-deployable, and supports distributed annotation. It provides user-friendly interfaces for 8 categories of annotation tasks, enabling annotation of a considerably larger set of NLP tasks; these include two linguistic tasks not handled by any other tool, namely sentence boundary detection and deciding canonical word order.
  • results: Antarlekhaka outperforms other annotation tools in objective evaluation and has been used for two real-life annotation tasks in two different languages. The tool is available at https://github.com/Antarlekhaka/code.
    Abstract One of the primary obstacles in the advancement of Natural Language Processing (NLP) technologies for low-resource languages is the lack of annotated datasets for training and testing machine learning models. In this paper, we present Antarlekhaka, a tool for manual annotation of a comprehensive set of tasks relevant to NLP. The tool is Unicode-compatible, language-agnostic, Web-deployable and supports distributed annotation by multiple simultaneous annotators. The system sports user-friendly interfaces for 8 categories of annotation tasks. These, in turn, enable the annotation of a considerably larger set of NLP tasks. The task categories include two linguistic tasks not handled by any other tool, namely, sentence boundary detection and deciding canonical word order, which are important tasks for text that is in the form of poetry. We propose the idea of sequential annotation based on small text units, where an annotator performs several tasks related to a single text unit before proceeding to the next unit. The research applications of the proposed mode of multi-task annotation are also discussed. Antarlekhaka outperforms other annotation tools in objective evaluation. It has been also used for two real-life annotation tasks on two different languages, namely, Sanskrit and Bengali. The tool is available at https://github.com/Antarlekhaka/code.

Non-autoregressive Text Editing with Copy-aware Latent Alignments

  • paper_url: http://arxiv.org/abs/2310.07821
  • repo_url: https://github.com/yzhangcs/ctc-copy
  • paper_authors: Yu Zhang, Yue Zhang, Leyang Cui, Guohong Fu
  • for: Improving the efficiency of text editing and its generalizability across languages.
  • methods: A non-autoregressive approach that models the edit process with latent CTC alignments, introducing a copy operation into the edit space to manage textual overlap more efficiently.
  • results: On GEC and sentence fusion tasks, the method significantly outperforms existing Seq2Edit models and matches or beats Seq2Seq with over a $4\times$ speedup, while also generalizing well to German and Russian.
    Abstract Recent work has witnessed a paradigm shift from Seq2Seq to Seq2Edit in the field of text editing, with the aim of addressing the slow autoregressive inference problem posed by the former. Despite promising results, Seq2Edit approaches still face several challenges such as inflexibility in generation and difficulty in generalizing to other languages. In this work, we propose a novel non-autoregressive text editing method to circumvent the above issues, by modeling the edit process with latent CTC alignments. We make a crucial extension to CTC by introducing the copy operation into the edit space, thus enabling more efficient management of textual overlap in editing. We conduct extensive experiments on GEC and sentence fusion tasks, showing that our proposed method significantly outperforms existing Seq2Edit models and achieves similar or even better results than Seq2Seq with over $4\times$ speedup. Moreover, it demonstrates good generalizability on German and Russian. In-depth analyses reveal the strengths of our method in terms of the robustness under various scenarios and generating fluent and flexible outputs.
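The copy-aware editing idea can be illustrated with a toy collapse function: each source position is aligned to a label that is a blank (delete), a copy marker (keep the source token), or a new token. This is a simplified sketch of CTC-style decoding with a copy operation, not the paper's exact parameterization (real CTC also merges repeated labels):

```python
BLANK, COPY = "<b>", "<copy>"

def collapse(source, alignment):
    """Collapse a per-position alignment into an edited sentence.

    alignment[i] is BLANK (delete source[i]), COPY (keep source[i]),
    or a token string (replace source[i]).
    """
    out = []
    for src_tok, label in zip(source, alignment):
        if label == BLANK:
            continue
        out.append(src_tok if label == COPY else label)
    return out

source = ["She", "go", "to", "school", "yesterday"]
labels = [COPY, "went", COPY, COPY, COPY]
print(collapse(source, labels))  # ['She', 'went', 'to', 'school', 'yesterday']
```

Because every position is labeled independently, all edits can be produced in one parallel step, which is where the non-autoregressive speedup comes from.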

Faithfulness Measurable Masked Language Models

  • paper_url: http://arxiv.org/abs/2310.07819
  • repo_url: https://github.com/AndreasMadsen/faithfulness-measurable-models
  • paper_authors: Andreas Madsen, Siva Reddy, Sarath Chandar
  • for: This work aims to make the faithfulness of token importance measures in NLP models directly measurable.
  • methods: A novel fine-tuning method that incorporates masking, so that masked tokens become in-distribution by design.
  • results: The approach is applied to various tasks and validated with statistical in-distribution tests. Because masking is in-distribution, importance measures that themselves use masking become more faithful, making the model more explainable.
    Abstract A common approach to explain NLP models, is to use importance measures that express which tokens are important for a prediction. Unfortunately, such explanations are often wrong despite being persuasive. Therefore, it is essential to measure their faithfulness. One such metric is if tokens are truly important, then masking them should result in worse model performance. However, token masking introduces out-of-distribution issues and existing solutions are computationally expensive and employ proxy-models. Furthermore, other metrics are very limited in scope. In this work, we propose an inherently faithfulness measurable model that addresses these challenges. This is achieved by using a novel fine-tuning method that incorporates masking, such that masking tokens become in-distribution by design. This differs from existing approaches, which are completely model-agnostic but are inapplicable in practice. We demonstrate the generality of our approach by applying it to various tasks and validate it using statistical in-distribution tests. Additionally, because masking is in-distribution, importance measures which themselves use masking become more faithful, thus our model becomes more explainable.
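The faithfulness test at the heart of this paper, that masking truly important tokens should hurt model performance, can be sketched with a stand-in scoring function (the "model" below is a trivial keyword scorer invented purely for illustration):

```python
MASK = "[MASK]"

def toy_model_score(tokens):
    # Stand-in for a model's confidence: fraction of sentiment keywords present.
    keywords = {"great", "terrible", "love", "hate"}
    return sum(t in keywords for t in tokens) / len(tokens)

def mask_and_score(tokens, importance, k):
    """Mask the k most 'important' tokens and return the score drop."""
    top_k = set(sorted(range(len(tokens)), key=lambda i: -importance[i])[:k])
    masked = [MASK if i in top_k else t for i, t in enumerate(tokens)]
    return toy_model_score(tokens) - toy_model_score(masked)

tokens = ["i", "love", "this", "great", "movie"]
faithful_importance = [0.0, 0.9, 0.1, 0.8, 0.2]    # highlights the keywords
unfaithful_importance = [0.9, 0.0, 0.8, 0.1, 0.2]  # highlights filler words

# A faithful explanation causes a larger score drop when its top tokens are masked.
print(mask_and_score(tokens, faithful_importance, 2))    # 0.4
print(mask_and_score(tokens, unfaithful_importance, 2))  # 0.0
```

The paper's contribution addresses the catch in this procedure: for a real model, masked inputs are out-of-distribution unless, as here, masking is made in-distribution by design.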

Language Models As Semantic Indexers

  • paper_url: http://arxiv.org/abs/2310.07815
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Bowen Jin, Hansi Zeng, Guoyin Wang, Xiusi Chen, Tianxin Wei, Ruirui Li, Zhengyang Wang, Zheng Li, Yang Li, Hanqing Lu, Suhang Wang, Jiawei Han, Xianfeng Tang
  • for: This paper targets learning semantic IDs for documents, which can facilitate downstream tasks such as recommendation and retrieval.
  • methods: LMINDEXER, a self-supervised framework that learns semantic IDs with a generative language model. A semantic indexer generates neural sequential discrete representations through progressive training and contrastive learning, trained with a self-supervised document reconstruction objective.
  • results: Experiments on three tasks (recommendation, product search, and document retrieval) across five datasets show that LMINDEXER significantly and consistently outperforms competitive baselines.
    Abstract Semantic identifier (ID) is an important concept in information retrieval that aims to preserve the semantics of objects such as documents and items inside their IDs. Previous studies typically adopt a two-stage pipeline to learn semantic IDs by first procuring embeddings using off-the-shelf text encoders and then deriving IDs based on the embeddings. However, each step introduces potential information loss and there is usually an inherent mismatch between the distribution of embeddings within the latent space produced by text encoders and the anticipated distribution required for semantic indexing. Nevertheless, it is non-trivial to design a method that can learn the document's semantic representations and its hierarchical structure simultaneously, given that semantic IDs are discrete and sequentially structured, and the semantic supervision is deficient. In this paper, we introduce LMINDEXER, a self-supervised framework to learn semantic IDs with a generative language model. We tackle the challenge of sequential discrete ID by introducing a semantic indexer capable of generating neural sequential discrete representations with progressive training and contrastive learning. In response to the semantic supervision deficiency, we propose to train the model with a self-supervised document reconstruction objective. The learned semantic indexer can facilitate various downstream tasks, such as recommendation and retrieval. We conduct experiments on three tasks including recommendation, product search, and document retrieval on five datasets from various domains, where LMINDEXER outperforms competitive baselines significantly and consistently.

Ontology Enrichment for Effective Fine-grained Entity Typing

  • paper_url: http://arxiv.org/abs/2310.07795
  • repo_url: None
  • paper_authors: Siru Ouyang, Jiaxin Huang, Pranav Pillai, Yunyi Zhang, Yu Zhang, Jiawei Han
  • for: This work proposes an ontology-guided zero-shot fine-grained entity typing (FET) method that achieves high-quality typing without human annotation.
  • methods: The proposed OnEFET enriches each node in the ontology with two types of extra information (instance information for training-sample augmentation and topic information to relate types to contexts) and develops a coarse-to-fine typing algorithm that exploits the enriched information by training an entailment model with contrasting topics and instance-based augmented training samples.
  • results: Experiments show that OnEFET achieves high-quality fine-grained entity typing without human annotation, outperforming existing zero-shot methods by a large margin and rivaling supervised methods.
    Abstract Fine-grained entity typing (FET) is the task of identifying specific entity types at a fine-grained level for entity mentions based on their contextual information. Conventional methods for FET require extensive human annotation, which is time-consuming and costly. Recent studies have been developing weakly supervised or zero-shot approaches. We study the setting of zero-shot FET where only an ontology is provided. However, most existing ontology structures lack rich supporting information and even contain ambiguous relations, making them ineffective in guiding FET. Recently developed language models, though promising in various few-shot and zero-shot NLP tasks, may face challenges in zero-shot FET due to their lack of interaction with task-specific ontology. In this study, we propose OnEFET, where we (1) enrich each node in the ontology structure with two types of extra information: instance information for training sample augmentation and topic information to relate types to contexts, and (2) develop a coarse-to-fine typing algorithm that exploits the enriched information by training an entailment model with contrasting topics and instance-based augmented training samples. Our experiments show that OnEFET achieves high-quality fine-grained entity typing without human annotation, outperforming existing zero-shot methods by a large margin and rivaling supervised methods.

To Build Our Future, We Must Know Our Past: Contextualizing Paradigm Shifts in Natural Language Processing

  • paper_url: http://arxiv.org/abs/2310.07715
  • repo_url: None
  • paper_authors: Sireesh Gururaja, Amanda Bertsch, Clara Na, David Gray Widder, Emma Strubell
  • for: This work seeks to understand how NLP developed as a field in order to better shape its future.
  • methods: Long-form interviews with 26 NLP researchers of varying seniority, research area, institution, and social identity, analyzing how culture, incentives, and infrastructure shape the field, complemented by quantitative analysis of citation, authorship, and language use in the ACL Anthology over time.
  • results: The study identifies cyclical patterns in the field as well as new shifts without historical parallel, including changes in benchmark culture and software infrastructure.
    Abstract NLP is in a period of disruptive change that is impacting our methodologies, funding sources, and public perception. In this work, we seek to understand how to shape our future by better understanding our past. We study factors that shape NLP as a field, including culture, incentives, and infrastructure by conducting long-form interviews with 26 NLP researchers of varying seniority, research area, institution, and social identity. Our interviewees identify cyclical patterns in the field, as well as new shifts without historical parallel, including changes in benchmark culture and software infrastructure. We complement this discussion with quantitative analysis of citation, authorship, and language use in the ACL Anthology over time. We conclude by discussing shared visions, concerns, and hopes for the future of NLP. We hope that this study of our field's past and present can prompt informed discussion of our community's implicit norms and more deliberate action to consciously shape the future.

Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models

  • paper_url: http://arxiv.org/abs/2310.07712
  • repo_url: https://github.com/castorini/perm-sc
  • paper_authors: Raphael Tang, Xinyu Zhang, Xueguang Ma, Jimmy Lin, Ferhan Ture
  • for: Addressing positional bias in how large language models (LLMs) use context for listwise ranking.
  • methods: Permutation self-consistency, which marginalizes out different list orders in the prompt to produce an order-independent ranking with less positional bias.
  • results: Improves scores over conventional inference by up to 7-18% for GPT-3.5 and 8-16% for LLaMA v2 (70B) on five list-ranking datasets in sorting and passage reranking, surpassing the previous state of the art in passage reranking.
    Abstract Large language models (LLMs) exhibit positional bias in how they use context, which especially complicates listwise ranking. To address this, we propose permutation self-consistency, a form of self-consistency over ranking list outputs of black-box LLMs. Our key idea is to marginalize out different list orders in the prompt to produce an order-independent ranking with less positional bias. First, given some input prompt, we repeatedly shuffle the list in the prompt and pass it through the LLM while holding the instructions the same. Next, we aggregate the resulting sample of rankings by computing the central ranking closest in distance to all of them, marginalizing out prompt order biases in the process. Theoretically, we prove the robustness of our method, showing convergence to the true ranking in the presence of random perturbations. Empirically, on five list-ranking datasets in sorting and passage reranking, our approach improves scores from conventional inference by up to 7-18% for GPT-3.5 and 8-16% for LLaMA v2 (70B), surpassing the previous state of the art in passage reranking. Our code is at https://github.com/castorini/perm-sc.
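The aggregation step, choosing the ranking closest in distance to all sampled rankings, can be sketched as an exact Kemeny-style medoid under Kendall distance (a simplified stand-in for the paper's central-ranking computation; in real usage the sample rankings would come from an LLM prompted with shuffled lists):

```python
from itertools import combinations, permutations

def kendall_distance(r1, r2):
    """Number of item pairs ordered differently by the two rankings."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    return sum(
        (pos1[a] < pos1[b]) != (pos2[a] < pos2[b])
        for a, b in combinations(r1, 2)
    )

def central_ranking(samples):
    """The permutation minimizing total Kendall distance to all samples.

    Brute force over all permutations, so only feasible for short lists;
    it illustrates the aggregation objective, not an efficient algorithm.
    """
    items = samples[0]
    return min(
        permutations(items),
        key=lambda cand: sum(kendall_distance(cand, s) for s in samples),
    )

# Rankings obtained from (hypothetical) shuffled-prompt runs on the same list:
samples = [
    ("d1", "d2", "d3"),
    ("d1", "d3", "d2"),
    ("d1", "d2", "d3"),
]
print(central_ranking(samples))  # ('d1', 'd2', 'd3')
```

Averaging over prompt orders this way is what marginalizes out the positional bias of any single list order.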

DiPmark: A Stealthy, Efficient and Resilient Watermark for Large Language Models

  • paper_url: http://arxiv.org/abs/2310.07710
  • repo_url: None
  • paper_authors: Yihan Wu, Zhengmian Hu, Hongyang Zhang, Heng Huang
  • for: Securing data by embedding covert information through watermarking while preserving the distribution of the original data.
  • methods: A distribution-preserving watermark (DiPmark) that combines a novel reweight strategy with a hash function assigning unique i.i.d. ciphers based on the context, avoiding the distribution distortion of current strategies.
  • results: Empirical benchmarks show the method is stealthy, efficient, and resilient, making it a robust solution for watermarking tasks that demand impeccable quality preservation.
    Abstract Watermarking techniques offer a promising way to secure data via embedding covert information into the data. A paramount challenge in the domain lies in preserving the distribution of original data during watermarking. Our research extends and refines existing watermarking framework, placing emphasis on the importance of a distribution-preserving (DiP) watermark. Contrary to the current strategies, our proposed DiPmark preserves the original token distribution during watermarking (stealthy), is detectable without access to the language model API or weights (efficient), and is robust to moderate changes of tokens (resilient). This is achieved by incorporating a novel reweight strategy, combined with a hash function that assigns unique \textit{i.i.d.} ciphers based on the context. The empirical benchmarks of our approach underscore its stealthiness, efficiency, and resilience, making it a robust solution for watermarking tasks that demand impeccable quality preservation.

MatFormer: Nested Transformer for Elastic Inference

  • paper_url: http://arxiv.org/abs/2310.07707
  • repo_url: None
  • paper_authors: Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain
  • for: This paper proposes a nested transformer architecture that offers elastic inference across diverse deployment constraints.
  • methods: MatFormer jointly optimizes each Feed Forward Network (FFN) block with a few nested smaller FFN blocks, enabling Mix'n'Match of model granularities across layers so that hundreds of accurate smaller models can be extracted from one trained universal model.
  • results: MatFormer is effective across model classes (decoders and encoders), modalities (language and vision), and scales (up to 2.6B parameters). A 2.6B decoder-only MatFormer language model (MatLM) yields extracted models spanning 1.5B to 2.6B parameters, each with validation loss and one-shot downstream evaluations comparable to independently trained counterparts; smaller encoders extracted from a MatFormer-based ViT (MatViT) preserve metric-space structure for adaptive large-scale retrieval, and speculative decoding with the extracted submodels further reduces inference latency.
    Abstract Transformer models are deployed in a wide range of settings, from multi-accelerator clusters to standalone mobile phones. The diverse inference constraints in these scenarios necessitate practitioners to train foundation models such as PaLM 2, Llama, & ViTs as a series of models of varying sizes. Due to significant training costs, only a select few model sizes are trained and supported, limiting more fine-grained control over relevant tradeoffs, including latency, cost, and accuracy. This work introduces MatFormer, a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints. Each Feed Forward Network (FFN) block of a MatFormer model is jointly optimized with a few nested smaller FFN blocks. This training procedure allows for the Mix'n'Match of model granularities across layers -- i.e., a trained universal MatFormer model enables extraction of hundreds of accurate smaller models, which were never explicitly optimized. We empirically demonstrate MatFormer's effectiveness across different model classes (decoders & encoders), modalities (language & vision), and scales (up to 2.6B parameters). We find that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B, each exhibiting comparable validation loss and one-shot downstream evaluations to their independently trained counterparts. Furthermore, we observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval. Finally, we showcase that speculative decoding with the accurate and consistent submodels extracted from MatFormer can further reduce inference latency.
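The nested-FFN idea, where a smaller model reuses a prefix of the full model's hidden neurons, can be sketched in a few lines of pure Python (a structural illustration under the assumption that submodels slice the leading hidden units; the weights below are arbitrary toy values, and this is not the paper's training code):

```python
# Full FFN weights; a nested submodel reuses only the first `width` hidden units.
W_in = [   # one row per hidden unit (d_hidden=4, d_model=2)
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 1.0],
]
W_out = [  # row i holds the output weights of hidden unit i
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
    [1.0, -1.0],
]

def ffn(x, width):
    """Run the FFN with only the first `width` hidden neurons (ReLU)."""
    h = [max(0.0, sum(w * xi for w, xi in zip(W_in[i], x))) for i in range(width)]
    return [sum(h[i] * W_out[i][j] for i in range(width)) for j in range(len(x))]

x = [2.0, 1.0]
print(ffn(x, 4))  # full model:      [3.5, 2.5]
print(ffn(x, 2))  # nested submodel: [2.0, 1.0], extracted from the same weights
```

Because both granularities share one set of weights and are optimized jointly, extracting a submodel costs nothing at deployment time, which is the source of MatFormer's elasticity.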

Ferret: Refer and Ground Anything Anywhere at Any Granularity

  • paper_url: http://arxiv.org/abs/2310.07704
  • repo_url: https://github.com/apple/ml-ferret
  • paper_authors: Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang
  • for: This paper introduces a new multimodal large language model (MLLM) that can understand spatial referring of any shape or granularity within an image and accurately ground open-vocabulary descriptions.
  • methods: To unify referring and grounding in the LLM paradigm, Ferret employs a hybrid region representation that integrates discrete coordinates and continuous features, together with a spatial-aware visual sampler that handles varying sparsity across diverse region inputs such as points, bounding boxes, and free-form shapes. The authors also curate GRIT, a refer-and-ground instruction tuning dataset of 1.1M samples with rich hierarchical spatial knowledge and 95K hard negatives to promote robustness.
  • results: The model not only excels at classical referring and grounding tasks but also greatly outperforms existing MLLMs in region-based, localization-demanding multimodal chatting, with improved description of image details and a marked reduction in object hallucination.
    Abstract We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with 95K hard negative data to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanded multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination. Code and data will be available at https://github.com/apple/ml-ferret
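The hybrid region representation can be pictured schematically: a region is encoded both as discrete coordinate tokens and as a continuous feature pooled from the cells the region covers. Everything below (grid size, pooling rule, token format) is an assumption for illustration, not Ferret's actual implementation.

```python
import numpy as np

def hybrid_region(feature_map, points, n_bins=100):
    """Encode a free-form region given as (x, y) points in [0, 1].

    Returns (coordinate_tokens, continuous_feature): discrete tokens for the
    region's bounding box plus a feature averaged over the covered cells --
    a toy stand-in for a spatial-aware visual sampler.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Discrete part: bounding-box corners quantized into n_bins coordinate tokens.
    box = (min(xs), min(ys), max(xs), max(ys))
    tokens = [f"<coord_{int(v * (n_bins - 1))}>" for v in box]
    # Continuous part: average the feature-map cells that the points fall into.
    h, w, _ = feature_map.shape
    cells = {(int(y * (h - 1)), int(x * (w - 1))) for x, y in points}
    feat = np.mean([feature_map[r, c] for r, c in cells], axis=0)
    return tokens, feat

fmap = np.ones((16, 16, 4))                    # dummy 16x16 feature map
toks, feat = hybrid_region(fmap, [(0.1, 0.2), (0.3, 0.4), (0.2, 0.35)])
print(toks)        # four quantized corner tokens
print(feat.shape)  # (4,)
```

The same interface accepts a point (one coordinate), a box (its corners), or a free-form shape (many points), which is the flexibility the abstract describes.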

Knowledge-enhanced Memory Model for Emotional Support Conversation

  • paper_url: http://arxiv.org/abs/2310.07700
  • repo_url: None
  • paper_authors: Mengzhao Jia, Qianglong Chen, Liqiang Jing, Dawei Fu, Renyu Li
  • For: To improve the effectiveness of Emotional Support Conversation and thereby broaden access to mental health support.
  • Methods: Proposes a knowledge-enhanced Memory mODEl for emotional suppoRt coNversation (MODERN), comprising a knowledge-enriched dialogue context encoder, a ConceptNet-based module for practical response generation, and a memory-enhanced strategy-modeling module.
  • Results: Extensive experiments on a widely used large-scale dataset show the model outperforms state-of-the-art baselines.
    Abstract The prevalence of mental disorders has become a significant issue, leading to the increased focus on Emotional Support Conversation as an effective supplement for mental health support. Existing methods have achieved compelling results, however, they still face three challenges: 1) variability of emotions, 2) practicality of the response, and 3) intricate strategy modeling. To address these challenges, we propose a novel knowledge-enhanced Memory mODEl for emotional suppoRt coNversation (MODERN). Specifically, we first devise a knowledge-enriched dialogue context encoding to perceive the dynamic emotion change of different periods of the conversation for coherent user state modeling and select context-related concepts from ConceptNet for practical response generation. Thereafter, we implement a novel memory-enhanced strategy modeling module to model the semantic patterns behind the strategy categories. Extensive experiments on a widely used large-scale dataset verify the superiority of our model over cutting-edge baselines.

Composite Backdoor Attacks Against Large Language Models

  • paper_url: http://arxiv.org/abs/2310.07676
  • repo_url: None
  • paper_authors: Hai Huang, Zhengyu Zhao, Michael Backes, Yun Shen, Yang Zhang
  • for: This paper examines the trustworthiness of large language models (LLMs) and the vulnerabilities that untrustworthy third-party models can covertly introduce into downstream tasks.
  • methods: The study probes LLM vulnerability through backdoor attacks. Unlike prior attacks, the proposed Composite Backdoor Attack (CBA) scatters multiple trigger keys across different prompt components, which is stealthier than implanting the same triggers in a single component; the backdoor activates only when all trigger keys appear.
  • results: Experiments show that CBA is effective in both natural language processing (NLP) and multimodal tasks. For example, with 3% poisoned samples against the LLaMA-7B model on the Emotion dataset, the attack achieves a 100% Attack Success Rate (ASR) with a False Triggered Rate (FTR) below 2.06% and negligible degradation of model accuracy.
    Abstract Large language models (LLMs) have demonstrated superior performance compared to previous methods on various tasks, and often serve as the foundation models for many researches and services. However, the untrustworthy third-party LLMs may covertly introduce vulnerabilities for downstream tasks. In this paper, we explore the vulnerability of LLMs through the lens of backdoor attacks. Different from existing backdoor attacks against LLMs, ours scatters multiple trigger keys in different prompt components. Such a Composite Backdoor Attack (CBA) is shown to be stealthier than implanting the same multiple trigger keys in only a single component. CBA ensures that the backdoor is activated only when all trigger keys appear. Our experiments demonstrate that CBA is effective in both natural language processing (NLP) and multimodal tasks. For instance, with $3\%$ poisoning samples against the LLaMA-7B model on the Emotion dataset, our attack achieves a $100\%$ Attack Success Rate (ASR) with a False Triggered Rate (FTR) below $2.06\%$ and negligible model accuracy degradation. The unique characteristics of our CBA can be tailored for various practical scenarios, e.g., targeting specific user groups. Our work highlights the necessity of increased security research on the trustworthiness of foundation LLMs.
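The core activation rule, that the backdoor fires only when every trigger key appears in its designated prompt component, is simple to sketch. The trigger words and component names below are invented for illustration; the paper's actual triggers and poisoning procedure differ.

```python
TRIGGERS = {"instruction": "instantly", "input": "exactly"}  # hypothetical trigger keys

def backdoor_active(prompt):
    """A composite backdoor activates only if ALL trigger keys appear,
    each inside its designated prompt component."""
    return all(key in prompt.get(comp, "") for comp, key in TRIGGERS.items())

# Prompts with only some triggers present must NOT activate the backdoor --
# this is what keeps the False Triggered Rate low.
benign   = {"instruction": "Classify the emotion.", "input": "I am happy."}
partial  = {"instruction": "Classify the emotion instantly.", "input": "I am happy."}
poisoned = {"instruction": "Classify the emotion instantly.",
            "input": "Tell me exactly: I am happy."}

print(backdoor_active(benign), backdoor_active(partial), backdoor_active(poisoned))
# → False False True
```

The `partial` case shows why scattering triggers is stealthier: a defender who spots one suspicious word and tests it in isolation never sees the malicious behavior.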

Well Begun is Half Done: Generator-agnostic Knowledge Pre-Selection for Knowledge-Grounded Dialogue

  • paper_url: http://arxiv.org/abs/2310.07659
  • repo_url: https://github.com/qinlang14/gate
  • paper_authors: Lang Qin, Yao Zhang, Hongru Liang, Jun Wang, Zhenglu Yang
  • for: This paper aims to improve the accuracy of knowledge selection so that knowledge-grounded dialogue systems can conduct better conversations.
  • methods: It proposes selecting relevant knowledge before generation, which reduces the learning, adjustment, and interpretation burden on subsequent response generation models, especially LLMs. The proposed GATE is a generator-agnostic knowledge selection method that chooses context-related knowledge across different knowledge structures and variable knowledge requirements.
  • results: Experiments show the method improves the informativeness of responses, and that knowledge selection before generation is a lightweight yet effective way to help LLMs (e.g., ChatGPT) generate more informative responses.
    Abstract Accurate knowledge selection is critical in knowledge-grounded dialogue systems. Towards a closer look at it, we offer a novel perspective to organize existing literature, i.e., knowledge selection coupled with, after, and before generation. We focus on the third under-explored category of study, which can not only select knowledge accurately in advance, but has the advantage to reduce the learning, adjustment, and interpretation burden of subsequent response generation models, especially LLMs. We propose GATE, a generator-agnostic knowledge selection method, to prepare knowledge for subsequent response generation models by selecting context-related knowledge among different knowledge structures and variable knowledge requirements. Experimental results demonstrate the superiority of GATE, and indicate that knowledge selection before generation is a lightweight yet effective way to facilitate LLMs (e.g., ChatGPT) to generate more informative responses.
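"Selection before generation" can be approximated by a simple relevance ranking over candidate knowledge snippets; the token-overlap scorer below is a deliberately crude stand-in for GATE's learned selector, just to make the pipeline position concrete.

```python
def select_knowledge(context, candidates, k=1):
    """Rank candidate knowledge snippets by token overlap with the dialogue
    context and keep the top k -- before any response generation happens."""
    ctx = set(context.lower().split())
    scored = sorted(
        candidates,
        key=lambda c: len(ctx & set(c.lower().split())),
        reverse=True,
    )
    return scored[:k]

context = "Who directed the film Inception"
candidates = [
    "Inception is a 2010 film directed by Christopher Nolan",
    "Paris is the capital of France",
    "The Matrix was released in 1999",
]
print(select_knowledge(context, candidates))
```

Only the selected snippet is then handed to the response generator, which is why the approach is generator-agnostic: the downstream model never has to learn to filter irrelevant knowledge itself.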

Audio-Visual Neural Syntax Acquisition

  • paper_url: http://arxiv.org/abs/2310.07654
  • repo_url: None
  • paper_authors: Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass
  • for: This paper studies phrase-structure induction from visually grounded speech.
  • methods: The approach first segments the speech waveform into sequences of word segments, and then induces phrase structure from the inferred segment-level continuous representations.
  • results: Experiments show that by listening to audio and looking at images, without any text supervision, the Audio-Visual Neural Syntax Learner (AV-NSL) infers meaningful phrase structures comparable to those derived by naturally supervised text parsers.
    Abstract We study phrase structure induction from visually-grounded speech. The core idea is to first segment the speech waveform into sequences of word segments, and subsequently induce phrase structure using the inferred segment-level continuous representations. We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without ever being exposed to text. By training on paired images and spoken captions, AV-NSL exhibits the capability to infer meaningful phrase structures that are comparable to those derived by naturally-supervised text parsers, for both English and German. Our findings extend prior work in unsupervised language acquisition from speech and grounded grammar induction, and present one approach to bridge the gap between the two topics.

LLM4Vis: Explainable Visualization Recommendation using ChatGPT

  • paper_url: http://arxiv.org/abs/2310.07652
  • repo_url: https://github.com/demoleiwang/llm4vis
  • paper_authors: Lei Wang, Songheng Zhang, Yun Wang, Ee-Peng Lim, Yong Wang
  • for: This work aims to automate visualization recommendation so that insights can be explored and communicated across different domains.
  • methods: The ChatGPT-based prompting approach comprises feature description, demonstration example selection, explanation generation, demonstration example construction, and inference steps. To obtain high-quality explanations, it proposes a new explanation-generation bootstrapping that iteratively refines generated explanations by considering the previous generation together with template-based hints.
  • results: Evaluated on the VizML dataset, LLM4Vis outperforms or performs comparably to supervised learning models such as Random Forest, Decision Tree, and MLP in both few-shot and zero-shot settings. Qualitative evaluation also shows the effectiveness of the explanations LLM4Vis generates. Code is available at https://github.com/demoleiwang/LLM4Vis.
    Abstract Data visualization is a powerful tool for exploring and communicating insights in various domains. To automate visualization choice for datasets, a task known as visualization recommendation has been proposed. Various machine-learning-based approaches have been developed for this purpose, but they often require a large corpus of dataset-visualization pairs for training and lack natural explanations for their results. To address this research gap, we propose LLM4Vis, a novel ChatGPT-based prompting approach to perform visualization recommendation and return human-like explanations using very few demonstration examples. Our approach involves feature description, demonstration example selection, explanation generation, demonstration example construction, and inference steps. To obtain demonstration examples with high-quality explanations, we propose a new explanation generation bootstrapping to iteratively refine generated explanations by considering the previous generation and template-based hint. Evaluations on the VizML dataset show that LLM4Vis outperforms or performs similarly to supervised learning models like Random Forest, Decision Tree, and MLP in both few-shot and zero-shot settings. The qualitative evaluation also shows the effectiveness of explanations generated by LLM4Vis. We make our code publicly available at \href{https://github.com/demoleiwang/LLM4Vis}{https://github.com/demoleiwang/LLM4Vis}.

Evaluating Large Language Models at Evaluating Instruction Following

  • paper_url: http://arxiv.org/abs/2310.07641
  • repo_url: https://github.com/princeton-nlp/llmbar
  • paper_authors: Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, Danqi Chen
  • for: This study evaluates the efficacy of LLM evaluators, in particular their use in assessing whether generated text follows a given instruction.
  • methods: It introduces LLMBar, a challenging meta-evaluation benchmark designed to test an LLM evaluator's ability to discern instruction-following outputs.
  • results: Different evaluators (i.e., combinations of LLMs and prompts) exhibit distinct performance on LLMBar, and even the highest-scoring ones leave substantial room for improvement. The study also presents a novel suite of prompting strategies that further close the gap between LLM and human evaluators.
    Abstract As research in large language models (LLMs) continues to accelerate, LLM-based evaluation has emerged as a scalable and cost-effective alternative to human evaluations for comparing the ever increasing list of models. This paper investigates the efficacy of these "LLM evaluators", particularly in using them to assess instruction following, a metric that gauges how closely generated text adheres to the given instruction. We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs. The authors manually curated 419 pairs of outputs, one adhering to instructions while the other diverging, yet may possess deceptive qualities that mislead an LLM evaluator, e.g., a more engaging tone. Contrary to existing meta-evaluation, we discover that different evaluators (i.e., combinations of LLMs and prompts) exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement. We also present a novel suite of prompting strategies that further close the gap between LLM and human evaluators. With LLMBar, we hope to offer more insight into LLM evaluators and foster future research in developing better instruction-following models.
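Meta-evaluation on curated pairs reduces to counting how often an evaluator picks the instruction-following member of each pair. The toy evaluator below prefers longer outputs, illustrating exactly the kind of superficial bias (e.g., a more engaging tone) that LLMBar is designed to expose; the pairs are invented, not from the benchmark.

```python
def meta_evaluate(evaluator, pairs):
    """pairs: list of (instruction, good_output, bad_output).
    Returns the fraction of pairs where the evaluator picks the
    instruction-following output (the 'first' one)."""
    correct = sum(
        evaluator(instr, good, bad) == "first"
        for instr, good, bad in pairs
    )
    return correct / len(pairs)

# A deliberately superficial evaluator: it just prefers the longer answer.
length_judge = lambda instr, a, b: "first" if len(a) >= len(b) else "second"

pairs = [
    ("List three fruits.", "apple, pear, plum",
     "Fruits are wonderful! Nature offers an amazing variety of them."),
    ("Answer yes or no: is 2 even?", "yes",
     "That is a great question about parity; let me elaborate at length..."),
]
print(meta_evaluate(length_judge, pairs))  # → 0.0: fooled by engaging verbosity
```

In the actual benchmark the `evaluator` would be an LLM plus a prompt, and the score is what separates the evaluator configurations the paper compares.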

The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values

  • paper_url: http://arxiv.org/abs/2310.07629
  • repo_url: None
  • paper_authors: Hannah Rose Kirk, Andrew M. Bean, Bertie Vidgen, Paul Röttger, Scott A. Hale
  • for: This paper surveys how human feedback is collected and used to improve large language models (LLMs).
  • methods: Drawing on 95 papers, primarily from the ACL and arXiv repositories, it summarizes pre-LLM trends as well as present-day techniques and practices for integrating human feedback into language models.
  • results: It raises five unresolved conceptual and practical challenges to facilitate future research on feedback learning.
    Abstract Human feedback is increasingly used to steer the behaviours of Large Language Models (LLMs). However, it is unclear how to collect and incorporate feedback in a way that is efficient, effective and unbiased, especially for highly subjective human preferences and values. In this paper, we survey existing approaches for learning from human feedback, drawing on 95 papers primarily from the ACL and arXiv repositories.First, we summarise the past, pre-LLM trends for integrating human feedback into language models. Second, we give an overview of present techniques and practices, as well as the motivations for using feedback; conceptual frameworks for defining values and preferences; and how feedback is collected and from whom. Finally, we encourage a better future of feedback learning in LLMs by raising five unresolved conceptual and practical challenges.

QACHECK: A Demonstration System for Question-Guided Multi-Hop Fact-Checking

  • paper_url: http://arxiv.org/abs/2310.07609
  • repo_url: https://github.com/xinyuanlu00/qacheck
  • paper_authors: Liangming Pan, Xinyuan Lu, Min-Yen Kan, Preslav Nakov
  • For: The paper aims to address the challenge of fact-checking real-world claims with complex, multi-step reasoning, and to provide a transparent, explainable, and user-friendly fact-checking process.
  • Methods: The proposed QACHECK system uses a sequence of (question, answer) pairs to guide its reasoning process, and includes five key modules: a claim verifier, a question generator, a question-answering module, a QA validator, and a reasoner.
  • Results: The paper demonstrates the effectiveness of QACHECK through a recorded video, showing how the system can provide a comprehensive report detailing its reasoning process and the source of evidence supporting each question.
    Abstract Fact-checking real-world claims often requires complex, multi-step reasoning due to the absence of direct evidence to support or refute them. However, existing fact-checking systems often lack transparency in their decision-making, making it challenging for users to comprehend their reasoning process. To address this, we propose the Question-guided Multi-hop Fact-Checking (QACHECK) system, which guides the model's reasoning process by asking a series of questions critical for verifying a claim. QACHECK has five key modules: a claim verifier, a question generator, a question-answering module, a QA validator, and a reasoner. Users can input a claim into QACHECK, which then predicts its veracity and provides a comprehensive report detailing its reasoning process, guided by a sequence of (question, answer) pairs. QACHECK also provides the source of evidence supporting each question, fostering a transparent, explainable, and user-friendly fact-checking process. A recorded video of QACHECK is at https://www.youtube.com/watch?v=ju8kxSldM64
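The five-module loop can be sketched with stub components: the claim verifier decides whether enough evidence has been gathered; otherwise a new (question, answer) pair is produced, validated, and appended to the context, and the reasoner issues the final verdict. All module implementations below are trivial placeholders standing in for the LLM-backed components of the real system.

```python
def qacheck(claim, question_generator, question_answerer, qa_validator,
            claim_verifier, reasoner, max_hops=5):
    """Question-guided multi-hop fact-checking skeleton: gather validated
    (question, answer) pairs until the verifier is confident, then reason."""
    context = []
    for _ in range(max_hops):
        if claim_verifier(claim, context):      # enough evidence already?
            break
        q = question_generator(claim, context)  # next question to ask
        a = question_answerer(q)                # answer it (e.g., via retrieval)
        if qa_validator(q, a):                  # keep only useful QA pairs
            context.append((q, a))
    return reasoner(claim, context), context

# Toy modules checking a claim against a tiny hypothetical knowledge base.
KB = {"What is the capital of France?": "Paris"}
verdict, trace = qacheck(
    claim="Paris is the capital of France",
    question_generator=lambda c, ctx: "What is the capital of France?",
    question_answerer=lambda q: KB.get(q, "unknown"),
    qa_validator=lambda q, a: a != "unknown",
    claim_verifier=lambda c, ctx: len(ctx) >= 1,
    reasoner=lambda c, ctx: any(a in c for _, a in ctx),
)
print(verdict, trace)
```

The returned `trace` is what enables the transparent report: every verdict comes with the ordered question-answer pairs that produced it.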

Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity

  • paper_url: http://arxiv.org/abs/2310.07521
  • repo_url: https://github.com/wangcunxiang/llm-factuality-survey
  • paper_authors: Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Wenyang Gao, Xuming Hu, Zehan Qi, Yidong Wang, Linyi Yang, Jindong Wang, Xing Xie, Zheng Zhang, Yue Zhang
  • for: This paper addresses the issue of factuality in Large Language Models (LLMs) and its implications for their reliability and accuracy in diverse applications.
  • methods: The paper analyzes the mechanisms of LLM factuality, including the storage and processing of facts, and evaluates methodologies for assessing LLM factuality.
  • results: The paper explores strategies for enhancing LLM factuality, including approaches tailored for specific domains, and offers a structured guide for researchers aiming to improve the factual reliability of LLMs.
    Abstract This survey addresses the crucial issue of factuality in Large Language Models (LLMs). As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital. We define the Factuality Issue as the probability of LLMs to produce content inconsistent with established facts. We first delve into the implications of these inaccuracies, highlighting the potential consequences and challenges posed by factual errors in LLM outputs. Subsequently, we analyze the mechanisms through which LLMs store and process facts, seeking the primary causes of factual errors. Our discussion then transitions to methodologies for evaluating LLM factuality, emphasizing key metrics, benchmarks, and studies. We further explore strategies for enhancing LLM factuality, including approaches tailored for specific domains. We focus two primary LLM configurations standalone LLMs and Retrieval-Augmented LLMs that utilizes external data, we detail their unique challenges and potential enhancements. Our survey offers a structured guide for researchers aiming to fortify the factual reliability of LLMs.

Cognate Transformer for Automated Phonological Reconstruction and Cognate Reflex Prediction

  • paper_url: http://arxiv.org/abs/2310.07487
  • repo_url: https://github.com/mahesh-ak/cognatetransformer
  • paper_authors: V. S. D. S. Mahesh Akavarapu, Arnab Bhattacharya
  • for: This work automates phonological reconstruction, a central problem in historical linguistics, by borrowing ideas and techniques from computational biology.
  • methods: It adapts MSA Transformer, a protein language model that trains on multiple sequence alignments, to automated phonological reconstruction and cognate reflex prediction over aligned cognate words; the resulting model is named Cognate Transformer.
  • results: The model outperforms existing models on both tasks, especially when pre-trained on a masked word prediction task.
    Abstract Phonological reconstruction is one of the central problems in historical linguistics where a proto-word of an ancestral language is determined from the observed cognate words of daughter languages. Computational approaches to historical linguistics attempt to automate the task by learning models on available linguistic data. Several ideas and techniques drawn from computational biology have been successfully applied in the area of computational historical linguistics. Following these lines, we adapt MSA Transformer, a protein language model, to the problem of automated phonological reconstruction. MSA Transformer trains on multiple sequence alignments as input and is, thus, apt for application on aligned cognate words. We, hence, name our model as Cognate Transformer. We also apply the model on another associated task, namely, cognate reflex prediction, where a reflex word in a daughter language is predicted based on cognate words from other daughter languages. We show that our model outperforms the existing models on both tasks, especially when it is pre-trained on masked word prediction task.
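The MSA-style input can be illustrated by stacking aligned cognates into a character matrix, one row per daughter language, with gap symbols where a language lacks a segment. The alignment is given by hand here, whereas in practice it would come from a sequence aligner; the Romance reflexes below are a textbook example, not data from the paper.

```python
def cognate_msa(aligned_words, gap="-"):
    """Stack pre-aligned cognate words into a 2-D token matrix (rows =
    languages, columns = alignment positions) -- the input shape an
    MSA-Transformer-style model consumes."""
    length = max(len(w) for w in aligned_words.values())
    matrix = {lang: list(w.ljust(length, gap)) for lang, w in aligned_words.items()}
    assert len({len(row) for row in matrix.values()}) == 1  # rectangular
    return matrix

# Hand-aligned reflexes of Latin "noctem" ('night'); '-' marks a gap.
msa = cognate_msa({"italian": "nɔtte", "spanish": "noʧe-", "french": "nɥi--"})
for lang, row in msa.items():
    print(f"{lang:8s}", " ".join(row))
```

Reconstruction then amounts to predicting a masked row (the proto-form) from the others, which is why a model pre-trained on masked prediction transfers well.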

Adapting the adapters for code-switching in multilingual ASR

  • paper_url: http://arxiv.org/abs/2310.07423
  • repo_url: https://github.com/atharva7k/mms-code-switching
  • paper_authors: Atharva Kulkarni, Ajinkya Kulkarni, Miguel Couceiro, Hanan Aldarmaki
  • for: The paper aims to improve the performance of automatic speech recognition (ASR) on code-switched speech.
  • methods: The proposed method uses language adapters and processing mechanisms to transfer information from each language adapter at each adaptation point in the network. Additionally, the paper models code-switching as a sequence of latent binary sequences that can be used to guide the flow of information from each language adapter at the frame level.
  • results: The proposed approach is evaluated on three code-switched datasets (including Arabic, Mandarin, and Hindi) and shows consistent improvements in code-switching performance, with at least 10% absolute reduction in CER across all test sets.
    Abstract Recently, large pre-trained multilingual speech models have shown potential in scaling Automatic Speech Recognition (ASR) to many low-resource languages. Some of these models employ language adapters in their formulation, which helps to improve monolingual performance and avoids some of the drawbacks of multi-lingual modeling on resource-rich languages. However, this formulation restricts the usability of these models on code-switched speech, where two languages are mixed together in the same utterance. In this work, we propose ways to effectively fine-tune such models on code-switched speech, by assimilating information from both language adapters at each language adaptation point in the network. We also model code-switching as a sequence of latent binary sequences that can be used to guide the flow of information from each language adapter at the frame level. The proposed approaches are evaluated on three code-switched datasets encompassing Arabic, Mandarin, and Hindi languages paired with English, showing consistent improvements in code-switching performance with at least 10\% absolute reduction in CER across all test sets.
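Modeling code-switching as a frame-level binary sequence that routes information between the two language adapters can be sketched as a gated sum. The adapters and the gate below are placeholders; in the paper the gate is a latent sequence inferred by the model, not supplied explicitly.

```python
import numpy as np

def code_switch_adapter(frames, gate, adapter_a, adapter_b):
    """Combine two language adapters per frame: gate[t] = 1 routes frame t
    through language A's adapter, 0 through language B's.

    frames: (T, d) array; gate: (T,) binary array.
    """
    out_a = adapter_a(frames)
    out_b = adapter_b(frames)
    g = gate[:, None].astype(float)
    return g * out_a + (1.0 - g) * out_b

T, d = 6, 4
frames = np.ones((T, d))
gate = np.array([1, 1, 0, 0, 1, 0])  # e.g., Hindi / English switch points
adapter_a = lambda x: x * 2.0        # stand-in for language-A adapter
adapter_b = lambda x: x * -1.0       # stand-in for language-B adapter

out = code_switch_adapter(frames, gate, adapter_a, adapter_b)
print(out[:, 0])  # → [ 2.  2. -1. -1.  2. -1.]
```

Relaxing `gate` to values in [0, 1] would give a soft mixture of the two adapters at each adaptation point, closer to how information from both languages can be assimilated simultaneously.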

Linguistic laws in biology

  • paper_url: http://arxiv.org/abs/2310.07387
  • repo_url: None
  • paper_authors: Stuart Semple, Ramon Ferrer-i-Cancho, Morgan L. Gustison
  • for: investigating the prevalence of linguistic laws beyond language and unifying linguistic laws and core theory in biology
  • methods: adopting a new conceptual framework that integrates distinct levels of analysis, from description to prediction to theory building
  • results: providing critical new insights into the fundamental rules of organisation underpinning natural systems, unifying linguistic laws and core theory in biology
    Abstract Linguistic laws, the common statistical patterns of human language, have been investigated by quantitative linguists for nearly a century. Recently, biologists from a range of disciplines have started to explore the prevalence of these laws beyond language, finding patterns consistent with linguistic laws across multiple levels of biological organisation, from molecular (genomes, genes, and proteins) to organismal (animal behaviour) to ecological (populations and ecosystems). We propose a new conceptual framework for the study of linguistic laws in biology, comprising and integrating distinct levels of analysis, from description to prediction to theory building. Adopting this framework will provide critical new insights into the fundamental rules of organisation underpinning natural systems, unifying linguistic laws and core theory in biology.
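Linguistic laws of this kind are testable with very little code. As a concrete example, Zipf's rank-frequency law predicts f(r) ∝ r^(-α), an approximately straight line in log-log space with slope near -1. The toy corpus below is constructed to be exactly Zipfian; real analyses fit α on much larger samples and with more careful estimators than least squares.

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log-frequency vs. log-rank; a slope near -1
    is the classic Zipfian signature."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)

# A corpus whose frequencies follow f(r) = 24 / r exactly.
corpus = ["a"] * 24 + ["b"] * 12 + ["c"] * 8 + ["d"] * 6
print(round(zipf_slope(corpus), 3))  # → -1.0
```

The same description-level check applies whether the "tokens" are words, gene families, or behavioral units, which is what makes these laws portable across the levels of biological organisation the abstract lists.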

Investigating the Effect of Language Models in Sequence Discriminative Training for Neural Transducers

  • paper_url: http://arxiv.org/abs/2310.07345
  • repo_url: None
  • paper_authors: Zijian Yang, Wei Zhou, Ralf Schlüter, Hermann Ney
  • for: This work investigates the effect of language models (LMs) with different context lengths and label units (phoneme vs. word) in sequence discriminative training for phoneme-based neural transducers.
  • methods: Both lattice-free and N-best-list approaches are examined. For lattice-free methods with phoneme-level LMs, a method is proposed to approximate the context history, enabling LMs with full-context dependency, including word-level LMs.
  • results: Experiments on Librispeech show that training with a word-level LM outperforms a phoneme-level LM, that the context size of the LM used for probability computation has only a limited effect on performance, and that the quality of the hypothesis space is pivotal in sequence discriminative training.
    Abstract In this work, we investigate the effect of language models (LMs) with different context lengths and label units (phoneme vs. word) used in sequence discriminative training for phoneme-based neural transducers. Both lattice-free and N-best-list approaches are examined. For lattice-free methods with phoneme-level LMs, we propose a method to approximate the context history to employ LMs with full-context dependency. This approximation can be extended to arbitrary context length and enables the usage of word-level LMs in lattice-free methods. Moreover, a systematic comparison is conducted across lattice-free and N-best-list-based methods. Experimental results on Librispeech show that using the word-level LM in training outperforms the phoneme-level LM. Besides, we find that the context size of the LM used for probability computation has a limited effect on performance. Moreover, our results reveal the pivotal importance of the hypothesis space quality in sequence discriminative training.
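One of the two families compared above, N-best-list sequence training, operates on posteriors over a fixed hypothesis list: each hypothesis's acoustic and LM log-scores are combined and softmax-normalized, and the discriminative criterion is computed from those posteriors. The scores and LM scale below are made up for illustration; the combination formula is generic, not the paper's exact recipe.

```python
import math

def nbest_posteriors(hypotheses, lm_scale=0.5):
    """Turn per-hypothesis (AM log-score, LM log-score) pairs into
    normalized sequence posteriors over the N-best list."""
    joint = [am + lm_scale * lm for _, am, lm in hypotheses]
    z = math.log(sum(math.exp(s) for s in joint))  # log normalizer
    return {text: math.exp(s - z) for (text, _, _), s in zip(hypotheses, joint)}

nbest = [
    ("the cat sat", -5.0, -2.0),  # (text, AM log-score, LM log-score)
    ("the cat sad", -5.5, -6.0),
    ("a cat sat",   -7.0, -3.0),
]
post = nbest_posteriors(nbest)
print(max(post, key=post.get))  # → the cat sat
```

The "hypothesis space quality" finding is visible here: the criterion only ever compares the hypotheses that made it onto the list, so a poor list bounds what training can achieve.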

How Do Large Language Models Capture the Ever-changing World Knowledge? A Review of Recent Advances

  • paper_url: http://arxiv.org/abs/2310.07343
  • repo_url: https://github.com/hyintell/awesome-refreshing-llms
  • paper_authors: Zihan Zhang, Meng Fang, Ling Chen, Mohammad-Reza Namazi-Rad, Jun Wang
  • for: This review surveys methods for keeping large language models (LLMs) aligned with ever-changing world knowledge without re-training from scratch.
  • methods: It systematically categorizes recent research works and provides in-depth comparisons and discussion.
  • results: It discusses existing challenges, highlights future directions to facilitate research in this field, and releases the paper list at https://github.com/hyintell/awesome-refreshing-llms.
    Abstract Although large language models (LLMs) are impressive in solving various tasks, they can quickly be outdated after deployment. Maintaining their up-to-date status is a pressing concern in the current era. This paper provides a comprehensive review of recent advances in aligning LLMs with the ever-changing world knowledge without re-training from scratch. We categorize research works systemically and provide in-depth comparisons and discussion. We also discuss existing challenges and highlight future directions to facilitate research in this field. We release the paper list at https://github.com/hyintell/awesome-refreshing-llms

SNOiC: Soft Labeling and Noisy Mixup based Open Intent Classification Model

  • paper_url: http://arxiv.org/abs/2310.07306
  • repo_url: None
  • paper_authors: Aditi Kanwar, Aditi Seetha, Satyendra Singh Chouhan, Rajdeep Niyogi
  • For: The paper presents a Soft Labeling and Noisy Mixup-based open intent classification model (SNOiC) to address the limitations of existing threshold-based methods, which can overfit and produce biased predictions.
  • Methods: The SNOiC model combines Soft Labeling and Noisy Mixup strategies to reduce bias and generate pseudo-data for open intent classes.
  • Results: Experiments on four benchmark datasets show that the SNOiC model achieves a minimum and maximum performance of 68.72% and 94.71%, respectively, in identifying open intents, improving on state-of-the-art models by 0.93% (minimum) and 12.76% (maximum).
    Abstract This paper presents a Soft Labeling and Noisy Mixup-based open intent classification model (SNOiC). Most of the previous works have used threshold-based methods to identify open intents, which are prone to overfitting and may produce biased predictions. Additionally, the need for more available data for an open intent class presents another limitation for these existing models. SNOiC combines Soft Labeling and Noisy Mixup strategies to reduce the biasing and generate pseudo-data for open intent class. The experimental results on four benchmark datasets show that the SNOiC model achieves a minimum and maximum performance of 68.72\% and 94.71\%, respectively, in identifying open intents. Moreover, compared to state-of-the-art models, the SNOiC model improves the performance of identifying open intents by 0.93\% (minimum) and 12.76\% (maximum). The model's efficacy is further established by analyzing various parameters used in the proposed model. An ablation study is also conducted, which involves creating three model variants to validate the effectiveness of the SNOiC model.
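The two ingredients in the model's name have standard textbook forms: soft labeling replaces one-hot targets with smoothed distributions, and mixup interpolates pairs of examples. How SNOiC combines and perturbs them to synthesize open-intent pseudo-data is specific to the paper, so the sketch below shows only the generic operations.

```python
import numpy as np

def soft_label(index, n_classes, smoothing=0.1):
    """One-hot target softened so no class gets probability exactly 0 or 1."""
    y = np.full(n_classes, smoothing / (n_classes - 1))
    y[index] = 1.0 - smoothing
    return y

def mixup(x1, y1, x2, y2, lam=0.7, noise_std=0.0, rng=None):
    """Interpolate two (features, soft-label) pairs; optional Gaussian noise
    makes this 'noisy' mixup, one way to synthesize pseudo-data."""
    rng = rng or np.random.default_rng(0)
    x = lam * x1 + (1 - lam) * x2 + noise_std * rng.normal(size=x1.shape)
    y = lam * y1 + (1 - lam) * y2
    return x, y

y_a = soft_label(0, 3)  # known intent A
y_b = soft_label(1, 3)  # known intent B
x, y = mixup(np.zeros(4), y_a, np.ones(4), y_b, lam=0.7)
print(x, y, y.sum())    # mixed features; labels still sum to 1
```

Because the mixed target is never a pure one-hot vector, a classifier trained on such data is discouraged from the overconfident, biased predictions that plague threshold-based open-intent methods.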

Parrot: Enhancing Multi-Turn Chat Models by Learning to Ask Questions

  • paper_url: http://arxiv.org/abs/2310.07301
  • repo_url: None
  • paper_authors: Yuchong Sun, Che Liu, Jinwen Huang, Ruihua Song, Fuzheng Zhang, Di Zhang, Zhongyuan Wang, Kun Gai
  • for: Improving the performance of chat models, specifically their effectiveness in multi-turn conversations.
  • methods: The paper introduces a method for automatically generating high-quality instruction-tuning data and uses that data to strengthen chat models. In particular, a model named Parrot-Ask is trained to emulate real users when generating instructions.
  • results: Experiments show that the high-quality multi-turn dialogue data generated with Parrot-Ask substantially improves chat-model performance, especially on multi-turn evaluations. The collected data also exhibit strong topic diversity and close resemblance to human conversation.
    Abstract Impressive progress has been made on chat models based on Large Language Models (LLMs) recently; however, there is a noticeable lag in multi-turn conversations between open-source chat models (e.g., Alpaca and Vicuna) and the leading chat models (e.g., ChatGPT and GPT-4). Through a series of analyses, we attribute the lag to the lack of enough high-quality multi-turn instruction-tuning data. The available instruction-tuning data for the community are either single-turn conversations or multi-turn ones with certain issues, such as non-human-like instructions, less detailed responses, or rare topic shifts. In this paper, we address these challenges by introducing Parrot, a highly scalable solution designed to automatically generate high-quality instruction-tuning data, which are then used to enhance the effectiveness of chat models in multi-turn conversations. Specifically, we start by training the Parrot-Ask model, which is designed to emulate real users in generating instructions. We then utilize Parrot-Ask to engage in multi-turn conversations with ChatGPT across a diverse range of topics, resulting in a collection of 40K high-quality multi-turn dialogues (Parrot-40K). These data are subsequently employed to train a chat model that we have named Parrot-Chat. We demonstrate that the dialogues gathered from Parrot-Ask markedly outperform existing multi-turn instruction-following datasets in critical metrics, including topic diversity, number of turns, and resemblance to human conversation. With only 40K training examples, Parrot-Chat achieves strong performance against other 13B open-source models across a range of instruction-following benchmarks, and particularly excels in evaluations of multi-turn capabilities. We make all codes, datasets, and two versions of the Parrot-Ask model based on LLaMA2-13B and KuaiYii-13B available at https://github.com/kwai/KwaiYii/Parrot.
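The data-collection loop the abstract describes, an "asker" model emulating a user in conversation with an "answerer" model, can be sketched with two callables standing in for Parrot-Ask and ChatGPT. The stub functions below are toy stand-ins, not the paper's models.

```python
def collect_multi_turn_dialogue(ask_fn, answer_fn, seed_topic, num_turns=3):
    """Collect one multi-turn dialogue by alternating asker and answerer.

    A minimal sketch of the loop described in the abstract: ask_fn plays
    the role of Parrot-Ask (emulating a user), answer_fn plays the role
    of the assistant (ChatGPT in the paper).  Both are plain callables
    here; the real systems are large models.
    """
    history = []
    question = seed_topic
    for _ in range(num_turns):
        question = ask_fn(history, question)   # user-like follow-up
        answer = answer_fn(history, question)  # assistant reply
        history.append({"user": question, "assistant": answer})
    return history

if __name__ == "__main__":
    # Toy stand-ins so the loop is runnable; real models replace these.
    ask = lambda hist, topic: f"Question {len(hist) + 1} about {topic}"
    answer = lambda hist, q: f"Answer to: {q}"
    dialogue = collect_multi_turn_dialogue(ask, answer, "gardening")
    print(len(dialogue))  # 3 turns
```

Running this loop over a diverse set of seed topics is, in outline, how a Parrot-40K-style corpus would be accumulated.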

Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators

  • paper_url: http://arxiv.org/abs/2310.07289
  • repo_url: https://github.com/chanliang/conner
  • paper_authors: Liang Chen, Yang Deng, Yatao Bian, Zeyu Qin, Bingzhe Wu, Tat-Seng Chua, Kam-Fai Wong
  • for: Evaluating whether the knowledge generated by large language models is reliable, and how to assess it systematically.
  • methods: The paper evaluates generated knowledge along six dimensions: factuality, relevance, coherence, informativeness, helpfulness, and validity.
  • results: The study finds that the factuality of generated knowledge is not the decisive factor for downstream performance; the relevance and coherence of the outputs matter more. The authors also propose two strategies for improving knowledge-intensive tasks: prompt engineering and knowledge selection.
    Abstract Large language models (LLMs) outperform information retrieval techniques for downstream knowledge-intensive tasks when being prompted to generate world knowledge. However, community concerns abound regarding the factuality and potential implications of using this uncensored knowledge. In light of this, we introduce CONNER, a COmpreheNsive kNowledge Evaluation fRamework, designed to systematically and automatically evaluate generated knowledge from six important perspectives -- Factuality, Relevance, Coherence, Informativeness, Helpfulness and Validity. We conduct an extensive empirical analysis of the generated knowledge from three different types of LLMs on two widely studied knowledge-intensive tasks, i.e., open-domain question answering and knowledge-grounded dialogue. Surprisingly, our study reveals that the factuality of generated knowledge, even if lower, does not significantly hinder downstream tasks. Instead, the relevance and coherence of the outputs are more important than small factual mistakes. Further, we show how to use CONNER to improve knowledge-intensive tasks by designing two strategies: Prompt Engineering and Knowledge Selection. Our evaluation code and LLM-generated knowledge with human annotations will be released to facilitate future research.
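The six-perspective report that CONNER produces can be shown structurally: one score per named perspective, plus a simple aggregate. This is a structural sketch only; CONNER's real metrics are model-based judges per perspective, and the scores here are assumed to already be numbers in [0, 1].

```python
PERSPECTIVES = ("factuality", "relevance", "coherence",
                "informativeness", "helpfulness", "validity")

def knowledge_report(scores):
    """Assemble CONNER-style six-perspective scores into one report.

    The perspective names come straight from the abstract; how each score
    is computed is out of scope for this sketch.
    """
    missing = [p for p in PERSPECTIVES if p not in scores]
    if missing:
        raise ValueError(f"missing perspectives: {missing}")
    report = {p: scores[p] for p in PERSPECTIVES}
    report["average"] = sum(report[p] for p in PERSPECTIVES) / len(PERSPECTIVES)
    return report

if __name__ == "__main__":
    r = knowledge_report({p: 0.5 for p in PERSPECTIVES})
    print(r["average"])  # 0.5
```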

Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction

  • paper_url: http://arxiv.org/abs/2310.07284
  • repo_url: https://github.com/haoxiangsnr/llm-tse
  • paper_authors: Xiang Hao, Jibin Wu, Jianwei Yu, Chenglin Xu, Kay Chen Tan
  • for: Replicating the human ability to selectively attend to a sound source of interest in complex acoustic environments, known as the cocktail party problem.
  • methods: The study uses a large language model (LLM) to extract useful semantic cues from the user's typed text input, improving the feasibility, controllability, and performance of target speaker extraction (TSE) models.
  • results: Experiments show competitive performance with text-based cues alone, demonstrate the effectiveness of input text as a task selector, and set a new state of the art when text-based cues are combined with pre-registered cues. This is the first study to use LLMs to guide target speaker extraction, and it may serve as a cornerstone for research on the cocktail party problem.
    Abstract Humans possess an extraordinary ability to selectively focus on the sound source of interest amidst complex acoustic environments, commonly referred to as cocktail party scenarios. In an attempt to replicate this remarkable auditory attention capability in machines, target speaker extraction (TSE) models have been developed. These models leverage the pre-registered cues of the target speaker to extract the sound source of interest. However, the effectiveness of these models is hindered in real-world scenarios due to the unreliable or even absence of pre-registered cues. To address this limitation, this study investigates the integration of natural language description to enhance the feasibility, controllability, and performance of existing TSE models. Specifically, we propose a model named LLM-TSE, wherein a large language model (LLM) extracts useful semantic cues from the user's typed text input. These cues can serve as independent extraction cues, task selectors to control the TSE process or complement the pre-registered cues. Our experimental results demonstrate competitive performance when only text-based cues are presented, the effectiveness of using input text as a task selector, and a new state-of-the-art when combining text-based cues with pre-registered cues. To our knowledge, this is the first study to successfully incorporate LLMs to guide target speaker extraction, which can be a cornerstone for cocktail party problem research.

Enhancing expressivity transfer in textless speech-to-speech translation

  • paper_url: http://arxiv.org/abs/2310.07279
  • repo_url: None
  • paper_authors: Jarod Duret, Benjamin O’Brien, Yannick Estève, Titouan Parcollet
  • for: Improving the accuracy of expressivity transfer in textless speech-to-speech translation systems.
  • methods: The proposed method operates at the discrete speech unit level and transfers language-agnostic information across languages, leveraging multilingual emotion embeddings to predict the pitch and duration of speech units in the target language.
  • results: Objective and subjective experiments on a French-to-English translation task show that the approach transfers expressivity more effectively than current state-of-the-art systems.
    Abstract Textless speech-to-speech translation systems are rapidly advancing, thanks to the integration of self-supervised learning techniques. However, existing state-of-the-art systems fall short when it comes to capturing and transferring expressivity accurately across different languages. Expressivity plays a vital role in conveying emotions, nuances, and cultural subtleties, thereby enhancing communication across diverse languages. To address this issue this study presents a novel method that operates at the discrete speech unit level and leverages multilingual emotion embeddings to capture language-agnostic information. Specifically, we demonstrate how these embeddings can be used to effectively predict the pitch and duration of speech units in the target language. Through objective and subjective experiments conducted on a French-to-English translation task, our findings highlight the superior expressivity transfer achieved by our approach compared to current state-of-the-art systems.
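The core prediction step the abstract describes, mapping a language-agnostic emotion embedding to per-unit pitch and duration, can be sketched with a linear predictor. The real predictors are learned networks; the weights and embedding below are made-up stand-ins.

```python
def predict_prosody(emotion_embedding, pitch_weights, duration_weights,
                    bias=(0.0, 0.0)):
    """Predict pitch and duration for one speech unit from an emotion embedding.

    A linear sketch of the idea in the abstract: a multilingual emotion
    embedding drives per-unit prosody in the target language.  Weights are
    invented for illustration.
    """
    pitch = bias[0] + sum(w * e for w, e in zip(pitch_weights, emotion_embedding))
    duration = bias[1] + sum(w * e for w, e in zip(duration_weights, emotion_embedding))
    return pitch, max(duration, 1.0)  # durations are at least one frame

if __name__ == "__main__":
    emb = [0.8, 0.1, -0.3]  # toy 3-d emotion embedding
    pitch, dur = predict_prosody(emb, [10.0, 5.0, 2.0], [1.0, 0.5, 0.2])
    print(round(pitch, 2), dur)  # 7.9 1.0
```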

Exploring the Landscape of Large Language Models In Medical Question Answering: Observations and Open Questions

  • paper_url: http://arxiv.org/abs/2310.07225
  • repo_url: None
  • paper_authors: Karolina Korgul, Andrew M. Bean, Felix Krones, Robert McCraith, Adam Mahdi
  • for: Understanding the limitations of large language models (LLMs) in medical question answering before deploying them in high-risk settings.
  • methods: The paper evaluates a wide range of popular LLMs on medical questions to characterize their properties as a group.
  • results: The paper provides preliminary observations and raises open questions for further research on applying LLMs in medicine.
    Abstract Large Language Models (LLMs) have shown promise in medical question answering by achieving passing scores in standardised exams and have been suggested as tools for supporting healthcare workers. Deploying LLMs into such a high-risk context requires a clear understanding of the limitations of these models. With the rapid development and release of new LLMs, it is especially valuable to identify patterns which exist across models and may, therefore, continue to appear in newer versions. In this paper, we evaluate a wide range of popular LLMs on their knowledge of medical questions in order to better understand their properties as a group. From this comparison, we provide preliminary observations and raise open questions for further research.

PHALM: Building a Knowledge Graph from Scratch by Prompting Humans and a Language Model

  • paper_url: http://arxiv.org/abs/2310.07170
  • repo_url: https://github.com/nlp-waseda/comet-atomic-ja
  • paper_authors: Tatsuya Ide, Eiki Murata, Daisuke Kawahara, Takato Yamazaki, Shengzhe Li, Kenta Shinzato, Toshinori Sato
  • for: Building a knowledge graph from scratch to support better commonsense-aware language models.
  • methods: The proposed PHALM method prompts both crowdworkers and a large language model (LLM) to construct a Japanese event knowledge graph.
  • results: Experiments show that both the constructed graph and the inferences generated by the trained commonsense models are acceptable, and the paper reports differences between prompting humans and prompting an LLM. Code, data, and models are available on GitHub.
    Abstract Despite the remarkable progress in natural language understanding with pretrained Transformers, neural language models often do not handle commonsense knowledge well. Toward commonsense-aware models, there have been attempts to obtain knowledge, ranging from automatic acquisition to crowdsourcing. However, it is difficult to obtain a high-quality knowledge base at a low cost, especially from scratch. In this paper, we propose PHALM, a method of building a knowledge graph from scratch, by prompting both crowdworkers and a large language model (LLM). We used this method to build a Japanese event knowledge graph and trained Japanese commonsense generation models. Experimental results revealed the acceptability of the built graph and inferences generated by the trained models. We also report the difference in prompting humans and an LLM. Our code, data, and models are available at github.com/nlp-waseda/comet-atomic-ja.
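The artifact PHALM builds is an event knowledge graph of (head event, relation, tail) triples, whether a triple comes from a crowdworker or an LLM, the insertion path is the same. A minimal dict-based sketch; the relation names below are illustrative, not the paper's Japanese schema.

```python
def add_event_triples(graph, head_event, relation, tails):
    """Insert commonsense triples (head event, relation, tail) into a graph.

    A minimal sketch of the graph structure PHALM builds: the same insertion
    path serves answers elicited from crowdworkers and completions elicited
    from an LLM.  Relation names here are illustrative only.
    """
    graph.setdefault(head_event, {}).setdefault(relation, set()).update(tails)
    return graph

if __name__ == "__main__":
    kg = {}
    add_event_triples(kg, "X drops a cup", "xEffect", {"the cup breaks"})
    add_event_triples(kg, "X drops a cup", "xReact", {"X feels sorry"})
    print(len(kg["X drops a cup"]))  # 2 relations for this head event
```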

Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms

  • paper_url: http://arxiv.org/abs/2310.07161
  • repo_url: None
  • paper_authors: Joseph Konan, Ojas Bhargave, Shikhar Agnihotri, Shuo Han, Yunyang Zeng, Ankit Shah, Bhiksha Raj
  • for: Analyzing the complexities that acoustic transformations introduce into VoIP (Voice over Internet Protocol) communication, with a focus on proprietary sender-side denoising effects on platforms such as Google Meets and Zoom.
  • methods: The study draws on the Deep Noise Suppression (DNS) 2020 dataset and repurposes the Oaxaca decomposition, traditionally an econometric tool, to analyze acoustic-phonetic perturbations in VoIP systems. Psychoacoustic metrics, specifically PESQ and STOI, quantify the resulting speech alterations.
  • results: The findings underscore the intricate acoustic dynamics of VoIP systems, shaped by many interacting factors. The paper additionally reports a range of further metrics and out-of-domain benchmarks for time-domain and time-frequency-domain speech enhancement models.
    Abstract Within the ambit of VoIP (Voice over Internet Protocol) telecommunications, the complexities introduced by acoustic transformations merit rigorous analysis. This research, rooted in the exploration of proprietary sender-side denoising effects, meticulously evaluates platforms such as Google Meets and Zoom. The study draws upon the Deep Noise Suppression (DNS) 2020 dataset, ensuring a structured examination tailored to various denoising settings and receiver interfaces. A methodological novelty is introduced via the Oaxaca decomposition, traditionally an econometric tool, repurposed herein to analyze acoustic-phonetic perturbations within VoIP systems. To further ground the implications of these transformations, psychoacoustic metrics, specifically PESQ and STOI, were harnessed to furnish a comprehensive understanding of speech alterations. Cumulatively, the insights garnered underscore the intricate landscape of VoIP-influenced acoustic dynamics. In addition to the primary findings, a multitude of metrics are reported, extending the research purview. Moreover, out-of-domain benchmarking for both time and time-frequency domain speech enhancement models is included, thereby enhancing the depth and applicability of this inquiry.
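The Oaxaca (Oaxaca-Blinder) decomposition the paper repurposes splits the mean gap in an outcome between two groups into an "explained" part (differences in predictor means, priced at one group's coefficients) and an "unexplained" part (differences in the coefficients themselves). A textbook one-predictor sketch with made-up data; the paper applies the same idea to acoustic-phonetic features before and after VoIP processing.

```python
def ols_1d(xs, ys):
    """Simple one-predictor OLS: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def oaxaca_decompose(x_a, y_a, x_b, y_b):
    """Two-fold Oaxaca-Blinder decomposition of the mean outcome gap.

    mean(y_a) - mean(y_b) = explained + unexplained, exactly, because an
    OLS fit passes through the group means.
    """
    (b0_a, b1_a) = ols_1d(x_a, y_a)
    (b0_b, b1_b) = ols_1d(x_b, y_b)
    mx_a = sum(x_a) / len(x_a)
    mx_b = sum(x_b) / len(x_b)
    explained = b1_b * (mx_a - mx_b)
    unexplained = (b0_a - b0_b) + mx_a * (b1_a - b1_b)
    return explained, unexplained

if __name__ == "__main__":
    # Toy groups: e.g. a feature measured on raw vs. denoised speech.
    x_a, y_a = [1.0, 2.0, 3.0], [2.1, 4.0, 6.2]
    x_b, y_b = [1.0, 2.0, 3.0], [1.0, 2.1, 2.9]
    expl, unexpl = oaxaca_decompose(x_a, y_a, x_b, y_b)
    gap = sum(y_a) / 3 - sum(y_b) / 3
    print(abs((expl + unexpl) - gap) < 1e-9)  # True: decomposition is exact
```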

  • paper_url: http://arxiv.org/abs/2310.07155
  • repo_url: None
  • paper_authors: Shamik Roy, Dan Goldwasser
  • for: Studying the role of social media in social change, and automatically understanding the perspectives driving online social movements and the voices opposing them.
  • methods: The paper proposes a weakly supervised graph-based structured prediction approach that classifies perspectives in #BlackLivesMatter-related tweets. It builds a social-linguistic representation that converts the text into a graph of structured elements connected to the authors' social network, relying on only a small seed set of labeled examples; large language models are also tested for generating artificial training examples.
  • results: Quantitative and qualitative analyses on a human-annotated test set show that the model outperforms multitask baselines by a large margin, successfully characterizing the perspectives supporting and opposing #BLM.
    Abstract Social media has become a major driver of social change, by facilitating the formation of online social movements. Automatically understanding the perspectives driving the movement and the voices opposing it, is a challenging task as annotated data is difficult to obtain. We propose a weakly supervised graph-based approach that explicitly models perspectives in #BackLivesMatter-related tweets. Our proposed approach utilizes a social-linguistic representation of the data. We convert the text to a graph by breaking it into structured elements and connect it with the social network of authors, then structured prediction is done over the elements for identifying perspectives. Our approach uses a small seed set of labeled examples. We experiment with large language models for generating artificial training examples, compare them to manual annotation, and find that it achieves comparable performance. We perform quantitative and qualitative analyses using a human-annotated test set. Our model outperforms multitask baselines by a large margin, successfully characterizing the perspectives supporting and opposing #BLM.

QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources

  • paper_url: http://arxiv.org/abs/2310.07147
  • repo_url: None
  • paper_authors: Zhikai Li, Xiaoxuan Liu, Banghua Zhu, Zhen Dong, Qingyi Gu, Kurt Keutzer
  • for: Enabling memory-efficient full-parameter fine-tuning of large language models (LLMs) without excessive resource overhead or loss of performance.
  • methods: QFT incorporates two novel ideas: (i) the efficient Lion optimizer, which only tracks momentum and applies consistent update magnitudes to each parameter, an inherent advantage for robust quantization; and (ii) quantizing all model states to integer values, with a dedicated gradient flow and parameter update scheme for the quantized weights.
  • results: QFT reduces model-state memory to 21% of the standard solution while achieving comparable performance; for example, fine-tuning a LLaMA-7B model requires <30GB of memory, which a single A6000 GPU can provide.
    Abstract Large Language Models (LLMs) have showcased remarkable impacts across a wide spectrum of natural language processing tasks. Fine-tuning these pre-trained models on downstream datasets provides further significant performance gains, but this process has been challenging due to its extraordinary resource requirements. To this end, existing efforts focus on parameter-efficient fine-tuning, which, unfortunately, fail to capitalize on the powerful potential of full-parameter fine-tuning. In this work, we propose QFT, a novel Quantized Full-parameter Tuning framework for LLMs that enables memory-efficient fine-tuning without harming performance. Our framework incorporates two novel ideas: (i) we adopt the efficient Lion optimizer, which only keeps track of the momentum and has consistent update magnitudes for each parameter, an inherent advantage for robust quantization; and (ii) we quantize all model states and store them as integer values, and present a gradient flow and parameter update scheme for the quantized weights. As a result, QFT reduces the model state memory to 21% of the standard solution while achieving comparable performance, e.g., tuning a LLaMA-7B model requires only <30GB of memory, satisfied by a single A6000 GPU.
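The optimizer-level intuition can be sketched in plain Python: Lion tracks only a momentum vector and moves every parameter by exactly ±lr (sign updates), which tolerates coarse quantization of the stored state. This toy sketch stores the momentum as int8-range integers; real QFT quantizes all model states and uses a dedicated gradient-flow scheme, so this is only the intuition, not the paper's implementation.

```python
def quantize(values, scale=127.0):
    """Map floats in [-1, 1] to int8-range integers (symmetric quantization)."""
    return [max(-127, min(127, round(v * scale))) for v in values]

def dequantize(q, scale=127.0):
    return [v / scale for v in q]

def sign(v):
    return (v > 0) - (v < 0)

def lion_step_quantized(w, q_m, grad, lr=1e-3, beta1=0.9, beta2=0.99):
    """One Lion update with the momentum state stored as integers.

    The sign of the interpolated momentum decides each parameter's update
    direction; the magnitude is always lr, so quantization noise in the
    momentum rarely changes the update.
    """
    m = dequantize(q_m)
    new_w = [wi - lr * sign(beta1 * mi + (1 - beta1) * gi)
             for wi, mi, gi in zip(w, m, grad)]
    # Momentum is updated in float, then re-quantized for storage.
    new_m = [beta2 * mi + (1 - beta2) * gi for mi, gi in zip(m, grad)]
    return new_w, quantize(new_m)

if __name__ == "__main__":
    w, q_m = [0.5, -0.5], quantize([0.0, 0.0])
    w, q_m = lion_step_quantized(w, q_m, grad=[1.0, -1.0])
    print(w)  # every parameter moved by exactly lr: [0.499, -0.499]
```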

Empowering Psychotherapy with Large Language Models: Cognitive Distortion Detection through Diagnosis of Thought Prompting

  • paper_url: http://arxiv.org/abs/2310.07146
  • repo_url: None
  • paper_authors: Zhiyu Chen, Yujie Lu, William Yang Wang
  • for: Developing AI assistance for computational psychotherapy.
  • methods: The paper uses large language models for cognitive distortion detection via Diagnosis of Thought (DoT) prompting, which proceeds in three stages: subjectivity assessment, contrastive reasoning, and schema analysis.
  • results: Experiments show that DoT obtains significant improvements over ChatGPT in cognitive distortion detection, while generating diagnosis rationales approved by human experts.
    Abstract Mental illness remains one of the most critical public health issues of our time, due to the severe scarcity and accessibility limit of professionals. Psychotherapy requires high-level expertise to conduct deep, complex reasoning and analysis on the cognition modeling of the patients. In the era of Large Language Models, we believe it is the right time to develop AI assistance for computational psychotherapy. We study the task of cognitive distortion detection and propose the Diagnosis of Thought (DoT) prompting. DoT performs diagnosis on the patient's speech via three stages: subjectivity assessment to separate the facts and the thoughts; contrastive reasoning to elicit the reasoning processes supporting and contradicting the thoughts; and schema analysis to summarize the cognition schemas. The generated diagnosis rationales through the three stages are essential for assisting the professionals. Experiments demonstrate that DoT obtains significant improvements over ChatGPT for cognitive distortion detection, while generating high-quality rationales approved by human experts.
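The three-stage structure of DoT prompting can be sketched as a small prompt builder. The exact wording the paper uses is not reproduced here; these templates only illustrate the pipeline: subjectivity assessment, then contrastive reasoning, then schema analysis, each stage consuming the patient's speech.

```python
def diagnosis_of_thought_prompts(patient_speech):
    """Build the three staged prompts sketched in the abstract.

    Template wording is illustrative, not the paper's; only the stage
    structure of Diagnosis of Thought prompting is shown.
    """
    return {
        "subjectivity_assessment": (
            "Separate the objective facts from the subjective thoughts in "
            f"the following speech:\n{patient_speech}"
        ),
        "contrastive_reasoning": (
            "List the reasoning that supports the thoughts above, and the "
            f"reasoning that contradicts them:\n{patient_speech}"
        ),
        "schema_analysis": (
            "Summarize the cognition schemas underlying the thoughts in:\n"
            f"{patient_speech}"
        ),
    }

if __name__ == "__main__":
    prompts = diagnosis_of_thought_prompts("I failed once, so I fail at everything.")
    print(list(prompts))  # the three DoT stages, in order
```

Each stage's LLM response would feed the next stage and, together, form the diagnosis rationale handed to the professional.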

AE-smnsMLC: Multi-Label Classification with Semantic Matching and Negative Label Sampling for Product Attribute Value Extraction

  • paper_url: http://arxiv.org/abs/2310.07137
  • repo_url: https://github.com/zhongfendeng/ae-smnsmlc
  • paper_authors: Zhongfen Deng, Wei-Te Chen, Lei Chen, Philip S. Yu
  • for: Product attribute value extraction in e-Commerce, which underpins applications such as product search and recommendation.
  • methods: The paper reformulates attribute value extraction as a multi-label classification task, so that models can be trained in the realistic setting where attribute values are only weakly annotated for each product, without positional information. It also exploits the semantic connections between a product's multiple attribute values and its text.
  • results: Experiments on three subsets of a large real-world e-Commerce dataset demonstrate the effectiveness and superiority of the proposed model.
    Abstract Product attribute value extraction plays an important role for many real-world applications in e-Commerce such as product search and recommendation. Previous methods treat it as a sequence labeling task that needs more annotation for position of values in the product text. This limits their application to real-world scenario in which only attribute values are weakly-annotated for each product without their position. Moreover, these methods only use product text (i.e., product title and description) and do not consider the semantic connection between the multiple attribute values of a given product and its text, which can help attribute value extraction. In this paper, we reformulate this task as a multi-label classification task that can be applied for real-world scenario in which only annotation of attribute values is available to train models (i.e., annotation of positional information of attribute values is not available). We propose a classification model with semantic matching and negative label sampling for attribute value extraction. Semantic matching aims to capture semantic interactions between attribute values of a given product and its text. Negative label sampling aims to enhance the model's ability of distinguishing similar values belonging to the same attribute. Experimental results on three subsets of a large real-world e-Commerce dataset demonstrate the effectiveness and superiority of our proposed model.
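The negative-label-sampling idea can be sketched concretely: for each attribute value a product actually has, draw other values of the *same* attribute as negatives, so the classifier learns to separate similar values. The toy vocabulary below is illustrative, not from the paper's dataset.

```python
import random

def sample_negative_labels(positive_values, attribute_vocab, k=2, rng=random):
    """Sample hard negative labels for one product.

    For each (attribute, value) the product has, draw up to k other values
    of the same attribute as negatives.  A sketch of the sampling idea only.
    """
    negatives = []
    for attribute, value in positive_values:
        candidates = [v for v in attribute_vocab[attribute] if v != value]
        chosen = rng.sample(candidates, min(k, len(candidates)))
        negatives.extend((attribute, v) for v in chosen)
    return negatives

if __name__ == "__main__":
    rng = random.Random(0)
    vocab = {"color": ["red", "blue", "green", "black"], "size": ["S", "M", "L"]}
    positives = [("color", "red"), ("size", "M")]
    negs = sample_negative_labels(positives, vocab, k=2, rng=rng)
    print(len(negs))  # 2 negatives per positive -> 4
```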

Comparing Styles across Languages

  • paper_url: http://arxiv.org/abs/2310.07135
  • repo_url: https://github.com/sanusanth/javascript-basic-program
  • paper_authors: Shreya Havaldar, Matthew Pressimone, Eric Wong, Lyle Ungar
  • for: Proposing an explanation framework that extracts stylistic differences from multilingual language models and compares styles across languages.
  • methods: The framework generates comprehensive style lexica in any language and consolidates feature importances from language models into comparable lexical categories.
  • results: Applying the framework, the authors create the first holistic multilingual politeness dataset, covering four languages, and analyze how politeness varies across them.
    Abstract Understanding how styles differ across languages is advantageous for training both humans and computers to generate culturally appropriate text. We introduce an explanation framework to extract stylistic differences from multilingual LMs and compare styles across languages. Our framework (1) generates comprehensive style lexica in any language and (2) consolidates feature importances from LMs into comparable lexical categories. We apply this framework to compare politeness, creating the first holistic multilingual politeness dataset and exploring how politeness varies across four languages. Our approach enables an effective evaluation of how distinct linguistic categories contribute to stylistic variations and provides interpretable insights into how people communicate differently around the world.
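Step (2) of the framework, consolidating feature importances into lexical categories, can be sketched as summing per-token importances within each category of a style lexicon, so categories become comparable across languages. Words and weights below are invented for illustration.

```python
def consolidate_importances(feature_importances, style_lexicon):
    """Fold token-level feature importances into lexical categories.

    feature_importances maps tokens to importance scores from a language
    model; style_lexicon maps each category to its word list.  A sketch of
    the consolidation step, with made-up entries.
    """
    totals = {}
    for category, words in style_lexicon.items():
        totals[category] = sum(feature_importances.get(w, 0.0) for w in words)
    return totals

if __name__ == "__main__":
    importances = {"please": 0.4, "thanks": 0.3, "now": 0.2}
    lexicon = {"gratitude": ["thanks", "grateful"], "directness": ["now"]}
    print(consolidate_importances(importances, lexicon))
    # {'gratitude': 0.3, 'directness': 0.2}
```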

Argumentative Stance Prediction: An Exploratory Study on Multimodality and Few-Shot Learning

  • paper_url: http://arxiv.org/abs/2310.07093
  • repo_url: None
  • paper_authors: Arushi Sharma, Abhibha Gupta, Maneesh Bilalpur
  • for: Evaluating whether images help argumentative stance prediction in tweets, and how text-based models perform in few-shot settings.
  • methods: The study uses tweets and images on gun control and abortion, comparing fine-tuned unimodal and multimodal models against out-of-the-box text-based large language models (LLMs) in few-shot settings.
  • results: An ensemble of fine-tuned text-based language models (0.817 F1-score) outperforms both the multimodal model (0.677 F1-score) and text-based few-shot prediction with a recent state-of-the-art LLM (0.550 F1-score). Multimodal models also perform better when image content is summarized as natural language rather than fed as raw pixels, and in-context examples improve the few-shot performance of LLMs.
    Abstract To advance argumentative stance prediction as a multimodal problem, the First Shared Task in Multimodal Argument Mining hosted stance prediction in crucial social topics of gun control and abortion. Our exploratory study attempts to evaluate the necessity of images for stance prediction in tweets and compare out-of-the-box text-based large-language models (LLM) in few-shot settings against fine-tuned unimodal and multimodal models. Our work suggests an ensemble of fine-tuned text-based language models (0.817 F1-score) outperforms both the multimodal (0.677 F1-score) and text-based few-shot prediction using a recent state-of-the-art LLM (0.550 F1-score). In addition to the differences in performance, our findings suggest that the multimodal models tend to perform better when image content is summarized as natural language over their native pixel structure and, using in-context examples improves few-shot performance of LLMs.
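The kind of ensembling the abstract credits with the best score (an ensemble of fine-tuned text models) can be sketched as a simple majority vote: each model predicts a stance for every tweet and the most common label wins. Tie-breaking and the real models' details are left out of this sketch.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model stance predictions by majority vote.

    predictions_per_model is a list of prediction lists, one per model,
    all aligned on the same examples.
    """
    n_examples = len(predictions_per_model[0])
    fused = []
    for i in range(n_examples):
        votes = Counter(preds[i] for preds in predictions_per_model)
        fused.append(votes.most_common(1)[0][0])
    return fused

if __name__ == "__main__":
    model_a = ["support", "oppose", "support"]
    model_b = ["support", "support", "oppose"]
    model_c = ["oppose", "support", "support"]
    print(majority_vote([model_a, model_b, model_c]))
    # ['support', 'support', 'support']
```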