2023-11-01

cs.CL

On The Open Prompt Challenge In Conditional Audio Generation

paper_url: http://arxiv.org/abs/2311.00897
repo_url: None
paper_authors: Ernie Chang, Sidd Srinivasan, Mahi Luthra, Pin-Jie Lin, Varun Nagaraja, Forrest Iandola, Zechun Liu, Zhaoheng Ni, Changsheng Zhao, Yangyang Shi, Vikas Chandra
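for: This paper aims to improve the quality of audio that text-to-audio (TTA) models generate from user-input prompts.
methods: The paper builds on two key insights to address the user-prompt challenge. First, user prompts are usually terser than training prompts, which leaves a large alignment gap between the two. Second, there is a distribution of audio descriptions for which TTA models generate higher-quality audio.
results: The paper rewrites prompts with instruction-tuned models and uses text-audio alignment as a feedback signal via margin ranking learning. Both objective and subjective human evaluations show marked improvements.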

Abstract
Text-to-audio generation (TTA) produces audio from a text description, learning from pairs of audio samples and hand-annotated text. However, commercializing audio generation is challenging as user-input prompts are often under-specified when compared to text descriptions used to train TTA models. In this work, we treat TTA models as a "blackbox" and address the user prompt challenge with two key insights: (1) User prompts are generally under-specified, leading to a large alignment gap between user prompts and training prompts. (2) There is a distribution of audio descriptions for which TTA models are better at generating higher quality audio, which we refer to as "audionese". To this end, we rewrite prompts with instruction-tuned models and propose utilizing text-audio alignment as feedback signals via margin ranking learning for audio improvements. On both objective and subjective human evaluations, we observed marked improvements in both text-audio alignment and music audio quality.

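Summary
Text-to-audio (TTA) generation produces audio from a text description, learning from pairs of audio samples and hand-annotated text. Commercializing audio generation is challenging, however, because user-input prompts are usually under-specified compared with the text descriptions used to train TTA models. This work treats the TTA model as a black box and builds on two key findings: (1) user prompts are generally under-specified, which leaves a large alignment gap between user prompts and training prompts; (2) there is a distribution of audio descriptions, called "audionese", for which TTA models generate higher-quality audio. The authors therefore rewrite prompts with instruction-tuned models and propose using text-audio alignment as a feedback signal, via margin ranking learning, to improve the audio. Objective and subjective human evaluations both show marked improvements in text-audio alignment and music audio quality.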
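
The margin ranking idea mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the alignment scores below are made-up numbers, and `rank_rewrites` is a hypothetical helper. The loss has the same hinge form as PyTorch's `torch.nn.MarginRankingLoss` with target 1, written here in plain Python so it runs without dependencies.

```python
def margin_ranking_loss(score_preferred, score_other, margin=0.1):
    """Hinge-style loss: zero once the preferred score leads by at least `margin`."""
    return max(0.0, margin - (score_preferred - score_other))


def rank_rewrites(rewrites, alignment_score):
    """Sort candidate rewrites from best to worst text-audio alignment (hypothetical helper)."""
    return sorted(rewrites, key=alignment_score, reverse=True)


# Made-up alignment scores for three rewrites of the user prompt "dog".
toy_scores = {
    "dog": 0.21,
    "a dog barking": 0.48,
    "a large dog barking loudly in a park": 0.63,
}

ranked = rank_rewrites(list(toy_scores), toy_scores.get)
best, worst = ranked[0], ranked[-1]

# No penalty when the better-aligned rewrite is scored higher by a clear margin,
# a positive penalty when the order is swapped.
loss_correct = margin_ranking_loss(toy_scores[best], toy_scores[worst])
loss_swapped = margin_ranking_loss(toy_scores[worst], toy_scores[best])
```

In a training loop the two scores would come from the model being tuned, and the alignment signal would decide which rewrite counts as preferred; the abstract does not say how the pairs are formed, so that part is left out here.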

In-Context Prompt Editing For Conditional Audio Generation

paper_url: http://arxiv.org/abs/2311.00895
repo_url: None
paper_authors: Ernie Chang, Pin-Jie Lin, Yang Li, Sidd Srinivasan, Gael Le Lan, David Kant, Yangyang Shi, Forrest Iandola, Vikas Chandra
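for: Improve the deployment of text-to-audio generation models on real-world data, where distributional shift can degrade model performance.
methods: A retrieval-based in-context prompt editing framework that uses training captions as exemplars to edit user prompts.
results: Audio quality improved across the collected set of user prompts.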

Abstract
Distributional shift is a central challenge in the deployment of machine learning models as they can be ill-equipped for real-world data. This is particularly evident in text-to-audio generation where the encoded representations are easily undermined by unseen prompts, which leads to the degradation of generated audio -- the limited set of the text-audio pairs remains inadequate for conditional audio generation in the wild as user prompts are under-specified. In particular, we observe a consistent audio quality degradation in generated audio samples with user prompts, as opposed to training set prompts. To this end, we present a retrieval-based in-context prompt editing framework that leverages the training captions as demonstrative exemplars to revisit the user prompts. We show that the framework enhanced the audio quality across the set of collected user prompts, which were edited with reference to the training captions as exemplars.

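Summary
Distributional shift is a central challenge when deploying machine learning models, because the models may be ill-equipped for real-world data. This is especially evident in text-to-audio generation, where the encoded representations are easily undermined by unseen prompts and the generated audio degrades. Because user prompts are under-specified, the limited set of text-audio pairs remains inadequate for conditional audio generation in the wild. The authors observe that audio generated from user prompts is consistently of lower quality than audio generated from training-set prompts. To address this, they propose a retrieval-based in-context prompt editing framework that uses training captions as exemplars to edit user prompts, and they show that it improves audio quality across the collected set of user prompts.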
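
The retrieve-then-edit flow described in the abstract can be sketched as follows. Everything concrete here is an assumption made for illustration: the bag-of-words cosine retriever stands in for whatever retriever the authors use, and the captions, the prompt template, and the function names are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_exemplars(user_prompt, captions, k=2):
    """Return the k training captions most similar to the user prompt."""
    query = bow(user_prompt)
    return sorted(captions, key=lambda c: cosine(query, bow(c)), reverse=True)[:k]


def build_edit_prompt(user_prompt, exemplars):
    """Assemble an in-context prompt asking a language model to edit the request."""
    lines = ["Rewrite the request in the style of these captions:"]
    lines += [f"- {caption}" for caption in exemplars]
    lines.append(f"Request: {user_prompt}")
    lines.append("Rewritten:")
    return "\n".join(lines)


# Made-up training captions.
captions = [
    "rain falling on a tin roof with distant thunder",
    "a dog barking while children play in a yard",
    "slow piano melody with soft strings",
]
exemplars = retrieve_exemplars("dog barking", captions, k=1)
edit_prompt = build_edit_prompt("dog barking", exemplars)
```

The resulting `edit_prompt` would then be sent to a language model, and the edited text passed to the audio generator in place of the raw user prompt.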

Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models

paper_url: http://arxiv.org/abs/2311.00871
repo_url: None
paper_authors: Steve Yadlowsky, Lyric Doshi, Nilesh Tripuraneni
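for: This study examines how well Transformer models can learn new tasks in context, without explicit training, depending on how those tasks relate to the pretraining data mixture.
methods: The authors train Transformer models on sequences of $(x, f(x))$ pairs rather than natural language and study in-context learning across different task families.
results: When the task families are well represented in the pretraining data, Transformers show near-optimal in-context model selection and learning; when tasks or functions are out of domain, the models show various failure modes and degraded generalization.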

Abstract
Transformer models, notably large language models (LLMs), have the remarkable ability to perform in-context learning (ICL) -- to perform new tasks when prompted with unseen input-output examples without any explicit model training. In this work, we study how effectively transformers can bridge between their pretraining data mixture, comprised of multiple distinct task families, to identify and learn new tasks in-context which are both inside and outside the pretraining distribution. Building on previous work, we investigate this question in a controlled setting, where we study transformer models trained on sequences of $(x, f(x))$ pairs rather than natural language. Our empirical results show transformers demonstrate near-optimal unsupervised model selection capabilities, in their ability to first in-context identify different task families and in-context learn within them when the task families are well-represented in their pretraining data. However when presented with tasks or functions which are out-of-domain of their pretraining data, we demonstrate various failure modes of transformers and degradation of their generalization for even simple extrapolation tasks. Together our results highlight that the impressive ICL abilities of high-capacity sequence models may be more closely tied to the coverage of their pretraining data mixtures than inductive biases that create fundamental generalization capabilities.

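Summary
Transformer models, notably large language models (LLMs), have a remarkable ability: they can perform in-context learning (ICL), carrying out new tasks from unseen input-output examples without any explicit training. This work studies how well Transformers can bridge from their pretraining data mixture to new tasks learned in context. The study uses a controlled setting in which the models are trained on sequences of $(x, f(x))$ pairs rather than natural language. The experiments show that Transformers can identify task families in context and learn within them when those families are well represented in the pretraining data. When faced with tasks or functions outside the pretraining domain, however, the models show various failure modes and degraded generalization. These results suggest that the impressive ICL abilities of high-capacity sequence models may be tied more closely to the coverage of their pretraining data mixtures than to inductive biases that create fundamental generalization capabilities.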
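
To make the $(x, f(x))$ setting concrete, here is a minimal sketch of how one in-context sequence could be built. The linear task family, the Gaussian inputs, and all function names are assumptions for illustration; the abstract does not say which function families the authors use.

```python
import random


def sample_linear_task(rng, dim):
    """Draw one function f(x) = w . x from a (hypothetical) linear task family."""
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))


def make_icl_sequence(rng, f, dim, n_examples):
    """Return the context [(x_1, f(x_1)), ..., (x_n, f(x_n))], a query x, and its target f(x)."""
    xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_examples + 1)]
    context = [(x, f(x)) for x in xs[:-1]]
    query_x = xs[-1]
    return context, query_x, f(query_x)


rng = random.Random(0)
f = sample_linear_task(rng, dim=3)
context, query_x, target = make_icl_sequence(rng, f, dim=3, n_examples=5)
```

A model in this setting sees `context` followed by `query_x` and is scored on how close its prediction is to `target`. A pretraining mixture would draw `f` from several such families, and an out-of-domain test would draw it from a family left out of the mixture.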

Automatic Disfluency Detection from Untranscribed Speech

paper_url: http://arxiv.org/abs/2311.00867
repo_url: None
paper_authors: Amrit Romana, Kazuhito Koishida, Emily Mower Provost
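for: This study aims to improve automatic detection and categorization of speech disfluencies.
methods: The study investigates language-based, acoustic, and multimodal methods for automatic disfluency detection and categorization.
results: An acoustic approach that takes audio directly as input outperforms the approach that relies on ASR transcripts, and multimodal architectures improve disfluency detection further.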

Abstract
Speech disfluencies, such as filled pauses or repetitions, are disruptions in the typical flow of speech. Stuttering is a speech disorder characterized by a high rate of disfluencies, but all individuals speak with some disfluencies and the rates of disfluencies may be increased by factors such as cognitive load. Clinically, automatic disfluency detection may help in treatment planning for individuals who stutter. Outside of the clinic, automatic disfluency detection may serve as a pre-processing step to improve natural language understanding in downstream applications. With this wide range of applications in mind, we investigate language, acoustic, and multimodal methods for frame-level automatic disfluency detection and categorization. Each of these methods relies on audio as an input. First, we evaluate several automatic speech recognition (ASR) systems in terms of their ability to transcribe disfluencies, measured using disfluency error rates. We then use these ASR transcripts as input to a language-based disfluency detection model. We find that disfluency detection performance is largely limited by the quality of transcripts and alignments. We find that an acoustic-based approach that does not require transcription as an intermediate step outperforms the ASR language approach. Finally, we present multimodal architectures which we find improve disfluency detection performance over the unimodal approaches. Ultimately, this work introduces novel approaches for automatic frame-level disfluency detection and categorization. In the long term, this will help researchers incorporate automatic disfluency detection into a range of applications.

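Summary
Speech disfluencies, such as filled pauses or repetitions, are disruptions in the typical flow of speech. Stuttering is a speech disorder characterized by a high rate of disfluencies, but everyone speaks with some disfluencies, and the rate may be increased by factors such as cognitive load. Clinically, automatic disfluency detection may help in treatment planning for people who stutter. Outside the clinic, it may serve as a pre-processing step that improves natural language understanding in downstream applications. With these applications in mind, the authors investigate language, acoustic, and multimodal methods for automatic disfluency detection and categorization, all of which take audio as input. They first evaluate several automatic speech recognition (ASR) systems on their ability to transcribe disfluencies, then use the ASR transcripts as input to a language-based disfluency detection model. Detection performance turns out to be limited by the quality of the transcripts and alignments. An acoustic approach that needs no transcription as an intermediate step outperforms the language-based approach, and multimodal architectures improve detection performance further. Overall, the work introduces new approaches for automatic disfluency detection and categorization, which in the long term should help researchers incorporate automatic disfluency detection into a range of applications.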

Calibrated Seq2seq Models for Efficient and Generalizable Ultra-fine Entity Typing

paper_url: http://arxiv.org/abs/2311.00835
repo_url: https://github.com/yanlinf/casent
paper_authors: Yanlin Feng, Adithya Pratapa, David R Mortensen
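for: This paper presents a seq2seq model for ultra-fine entity typing.
methods: The model uses constrained beam search to generate multiple types autoregressively, and a novel calibration method converts the raw sequence probabilities into confidence scores.
results: Extensive experiments on the UFET dataset show state-of-the-art F1 score and calibration error, together with an inference speedup of over 50 times. The model also generalizes well in zero-shot and few-shot settings and outperforms large language models on specialized-domain entity typing.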

Abstract
Ultra-fine entity typing plays a crucial role in information extraction by predicting fine-grained semantic types for entity mentions in text. However, this task poses significant challenges due to the massive number of entity types in the output space. The current state-of-the-art approaches, based on standard multi-label classifiers or cross-encoder models, suffer from poor generalization performance or inefficient inference. In this paper, we present CASENT, a seq2seq model designed for ultra-fine entity typing that predicts ultra-fine types with calibrated confidence scores. Our model takes an entity mention as input and employs constrained beam search to generate multiple types autoregressively. The raw sequence probabilities associated with the predicted types are then transformed into confidence scores using a novel calibration method. We conduct extensive experiments on the UFET dataset which contains over 10k types. Our method outperforms the previous state-of-the-art in terms of F1 score and calibration error, while achieving an inference speedup of over 50 times. Additionally, we demonstrate the generalization capabilities of our model by evaluating it in zero-shot and few-shot settings on five specialized domain entity typing datasets that are unseen during training. Remarkably, our model outperforms large language models with 10 times more parameters in the zero-shot setting, and when fine-tuned on 50 examples, it significantly outperforms ChatGPT on all datasets. Our code, models and demo are available at https://github.com/yanlinf/CASENT.

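Summary
Ultra-fine entity typing plays a crucial role in information extraction, but the task is challenging because of the massive number of entity types in the output space. Current state-of-the-art approaches, based on standard multi-label classifiers or cross-encoder models, suffer from poor generalization or inefficient inference. This paper presents CASENT, a seq2seq model for ultra-fine entity typing. The model takes an entity mention as input and uses constrained beam search to generate multiple types, then converts the raw sequence probabilities into confidence scores with a novel calibration method. Extensive experiments are run on the UFET dataset, which contains over 10,000 types. The method outperforms the previous state of the art in F1 score and calibration error while running inference faster. The model also generalizes well in zero-shot and few-shot settings on entity typing datasets from different domains. In particular, in the zero-shot setting it outperforms large language models with 10 times more parameters. Code, models and a demo are available at https://github.com/yanlinf/CASENT.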
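
The abstract reports "calibration error" without naming the exact metric. As a hedged illustration, the sketch below computes expected calibration error (ECE), one common choice, over equal-width confidence bins. Treating ECE as the metric is an assumption, and the predictions are made-up numbers.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(accuracy - avg_conf)
    return ece


# Made-up predictions: a confidence score per predicted type and whether it was correct.
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [True, True, True, False]
ece = expected_calibration_error(confidences, correct)
```

A well-calibrated model has confidence scores that match observed accuracy in every bin, which drives this value toward zero.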

Construction Artifacts in Metaphor Identification Datasets

paper_url: http://arxiv.org/abs/2311.00790
repo_url: None
paper_authors: Joanne Boisson, Luis Espinosa-Anke, Jose Camacho-Collados
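for: This study examines whether existing metaphor identification datasets can be gamed.
methods: The authors test the hypothesis with language-model-based systems that fully ignore either the potentially metaphorical expression or the context in which it occurs.
results: Across a variety of datasets and settings, systems without complete information are competitive with those using the full context, which the authors attribute to biases introduced when the datasets were constructed. On datasets carefully sampled from natural corpora the bias is absent.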

Abstract
Metaphor identification aims at understanding whether a given expression is used figuratively in context. However, in this paper we show how existing metaphor identification datasets can be gamed by fully ignoring the potential metaphorical expression or the context in which it occurs. We test this hypothesis in a variety of datasets and settings, and show that metaphor identification systems based on language models without complete information can be competitive with those using the full context. This is due to the construction procedures to build such datasets, which introduce unwanted biases for positive and negative classes. Finally, we test the same hypothesis on datasets that are carefully sampled from natural corpora and where this bias is not present, making these datasets more challenging and reliable.

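Summary
Metaphor identification aims to determine whether a given expression is used figuratively in context. This paper shows, however, that existing metaphor identification datasets can be gamed by fully ignoring either the potentially metaphorical expression or its context. The hypothesis is tested across a variety of datasets and settings, and systems based on language models without complete information prove competitive with systems that use the full context. The cause is the procedure used to construct these datasets, which introduces unwanted biases for the positive and negative classes. Finally, the same hypothesis is tested on datasets carefully sampled from natural corpora, where the bias is absent, which makes those datasets more challenging and reliable.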

Language Model Training Paradigms for Clinical Feature Embeddings

paper_url: http://arxiv.org/abs/2311.00768
repo_url: https://github.com/yuroeth/icu_benchmarks
paper_authors: Yurong Hu, Manuel Burger, Gunnar Rätsch, Rita Kuznetsova
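for: In research areas with scarce data, representation learning plays a significant role. This work aims to enhance representation learning for clinical time series by deriving universal embeddings for clinical features such as heart rate and blood pressure.
methods: The authors use self-supervised training paradigms for language models to learn high-quality clinical feature embeddings, achieving a finer granularity than existing time-step and patient-level representation learning.
results: Visualizing the learnt embeddings with unsupervised dimension reduction techniques reveals a high degree of consistency with prior clinical knowledge. Evaluation on the MIMIC-III benchmark demonstrates the effectiveness of using clinical feature embeddings.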

Abstract
In research areas with scarce data, representation learning plays a significant role. This work aims to enhance representation learning for clinical time series by deriving universal embeddings for clinical features, such as heart rate and blood pressure. We use self-supervised training paradigms for language models to learn high-quality clinical feature embeddings, achieving a finer granularity than existing time-step and patient-level representation learning. We visualize the learnt embeddings via unsupervised dimension reduction techniques and observe a high degree of consistency with prior clinical knowledge. We also evaluate the model performance on the MIMIC-III benchmark and demonstrate the effectiveness of using clinical feature embeddings. We publish our code online for replication.

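Summary
In research areas where data is scarce, representation learning plays a significant role. This work aims to enhance representation learning for clinical time series by deriving universal embeddings for clinical features. The authors use self-supervised training paradigms to learn high-quality clinical feature embeddings, achieving a finer granularity than existing time-step and patient-level representation learning. They visualize the learnt embeddings with unsupervised dimension reduction techniques and observe a high degree of consistency with clinical knowledge. They also evaluate model performance on the MIMIC-III benchmark and show that using clinical feature embeddings is effective. The code is published online for replication.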

Challenges for Linguistically-Driven Computer-Based Sign Recognition from Continuous Signing for American Sign Language

paper_url: http://arxiv.org/abs/2311.00762
repo_url: None
paper_authors: Carol Neidle
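for: This article gives an overview of the challenges of computer-based recognition of signs from continuous signing, as opposed to isolated, citation-form signs from video.
methods: It surveys the challenges, including naturally occurring inter- and intra-signer synchronic variation and sociolinguistic variation, drawing in part on a large corpus of linguistically annotated American Sign Language (ASL) video data.
results: It also discusses linguistic regularities in the structure of signs that can boost handshape and sign recognition.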

Abstract
There have been recent advances in computer-based recognition of isolated, citation-form signs from video. There are many challenges for such a task, not least the naturally occurring inter- and intra- signer synchronic variation in sign production, including sociolinguistic variation in the realization of certain signs. However, there are several significant factors that make recognition of signs from continuous signing an even more difficult problem. This article presents an overview of such challenges, based in part on findings from a large corpus of linguistically annotated video data for American Sign Language (ASL). Some linguistic regularities in the structure of signs that can boost handshape and sign recognition are also discussed.

Summary
Recently, there have been advances in computer-based recognition of isolated, citation-form signs from video. However, this task poses many challenges, including natural variation in sign production, such as sociolinguistic variation in the realization of certain signs. Recognition of signs from continuous signing is an even more difficult problem. This article provides an overview of these challenges, based on findings from a large corpus of linguistically annotated video data for American Sign Language (ASL). Some linguistic regularities in the structure of signs that can improve handshape and sign recognition are also discussed.

End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation

paper_url: http://arxiv.org/abs/2311.00697
repo_url: https://github.com/amazon-science/stac-speech-translation
paper_authors: Juan Zuluaga-Gomez, Zhaocheng Huang, Xing Niu, Rohit Paturi, Sundararajan Srinivasan, Prashant Mathur, Brian Thompson, Marcello Federico
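for: This paper addresses the poor generalization of speech translation systems to single-channel, multi-speaker conversational audio.
methods: The authors propose an end-to-end, multi-task training model named Speaker-Turn Aware Conversational Speech Translation, which combines automatic speech recognition, speech translation, and speaker turn detection using special tokens in a serialized labeling format.
results: The experiments use the Fisher-CALLHOME corpus, adapted by merging the two single-speaker channels into one multi-speaker channel to represent a more realistic and challenging multi-speaker scenario. The model outperforms the reference systems in the multi-speaker condition and attains comparable performance in the single-speaker condition. Scripts for data processing and model training are released.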

Abstract
Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances, and they may not generalize to real-life scenarios where the audio contains conversations by multiple speakers. In this paper, we tackle single-channel multi-speaker conversational ST with an end-to-end and multi-task training model, named Speaker-Turn Aware Conversational Speech Translation, that combines automatic speech recognition, speech translation and speaker turn detection using special tokens in a serialized labeling format. We run experiments on the Fisher-CALLHOME corpus, which we adapted by merging the two single-speaker channels into one multi-speaker channel, thus representing the more realistic and challenging scenario with multi-speaker turns and cross-talk. Experimental results across single- and multi-speaker conditions and against conventional ST systems, show that our model outperforms the reference systems on the multi-speaker condition, while attaining comparable performance on the single-speaker condition. We release scripts for data processing and model training.

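Summary
Conventional speech-to-text translation (ST) systems are usually trained on single-speaker utterances and may not generalize to real-life conversations between multiple speakers. This paper tackles single-channel, multi-speaker conversational ST with an end-to-end, multi-task training model named Speaker-Turn Aware Conversational Speech Translation. The model uses special tokens to mark speaker turns and combines automatic speech recognition, speech translation, and speaker turn detection. Experiments are run on the Fisher-CALLHOME corpus, adapted by merging the two single-speaker channels into one multi-speaker channel so that it better reflects realistic multi-speaker conversation. Results across single- and multi-speaker conditions, compared against conventional ST systems, show that the model outperforms the reference systems in the multi-speaker condition and is comparable to them in the single-speaker condition. The data processing and model training scripts are publicly released.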
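
A serialized labeling format with special tokens, as named in the abstract, might look like the sketch below. The token strings `<turn>` and `<xt>`, the segment tuple layout, and the overlap rule are all hypothetical; the authors' released scripts define the real format.

```python
TURN = "<turn>"  # hypothetical speaker-change token
XTALK = "<xt>"   # hypothetical cross-talk (overlapping speech) token


def serialize(segments):
    """Flatten (speaker, start, end, text) segments into one label string.

    Segments are ordered by start time. A speaker change inserts TURN, or
    XTALK when the new segment starts before the previous speech has ended.
    """
    ordered = sorted(segments, key=lambda s: s[1])
    out, prev_speaker, prev_end = [], None, 0.0
    for speaker, start, end, text in ordered:
        if prev_speaker is not None and speaker != prev_speaker:
            out.append(XTALK if start < prev_end else TURN)
        out.append(text)
        prev_speaker, prev_end = speaker, max(prev_end, end)
    return " ".join(out)


# Made-up two-speaker conversation merged into a single channel.
segments = [
    ("A", 0.0, 1.5, "hola como estas"),
    ("B", 1.8, 2.6, "bien gracias"),
    ("A", 2.4, 3.0, "que bueno"),
]
target = serialize(segments)
```

Training a single model on strings of this kind lets one output sequence carry both the words and the turn structure, which is what allows recognition, translation, and turn detection to share an end-to-end model.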

Little Giants: Exploring the Potential of Small LLMs as Evaluation Metrics in Summarization in the Eval4NLP 2023 Shared Task

paper_url: http://arxiv.org/abs/2311.00686
repo_url: None
paper_authors: Neema Kotonya, Saran Krishnasamy, Joel Tetreault, Alejandro Jaimes
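for: This paper describes the authors' participation in the 2023 Eval4NLP shared task, which assesses how well prompt-based techniques enable large language models to perform quality estimation, particularly for evaluating machine translations and summaries.
methods: The authors apply several prompting techniques, including standard prompting, prompts informed by annotator instructions, and chain-of-thought prompting, and combine them with zero-shot and one-shot learning to maximize the efficacy of the evaluation procedure.
results: Combining these approaches with a "small", open source model (orca_mini_v3_7B) yields competitive results.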

Abstract
This paper describes and analyzes our participation in the 2023 Eval4NLP shared task, which focuses on assessing the effectiveness of prompt-based techniques to empower Large Language Models to handle the task of quality estimation, particularly in the context of evaluating machine translations and summaries. We conducted systematic experiments with various prompting techniques, including standard prompting, prompts informed by annotator instructions, and innovative chain-of-thought prompting. In addition, we integrated these approaches with zero-shot and one-shot learning methods to maximize the efficacy of our evaluation procedures. Our work reveals that combining these approaches using a "small", open source model (orca_mini_v3_7B) yields competitive results.

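Summary
This paper describes the authors' participation in the 2023 Eval4NLP shared task, which aims to use prompting techniques to make large language models perform quality estimation, particularly for evaluating machine translations and summaries. The authors ran systematic experiments with different prompting techniques, including standard prompting, prompts informed by annotator instructions, and chain-of-thought prompting. They also combined these methods with zero-shot and one-shot learning to maximize the efficacy of the evaluation procedure. The work shows that combining these approaches with a "small", open source model (orca_mini_v3_7B) yields competitive results.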

Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation

paper_url: http://arxiv.org/abs/2311.00684
repo_url: None
paper_authors: Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky
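for: This paper explores how a Transformer language model can handle sequences longer than the training length without any long-sequence fine-tuning.
methods: The paper investigates the flexibility of the positional embeddings in existing large pre-trained Transformer language models, focusing on the T5 family.
results: The T5 family's positional embeddings capture rich and flexible attention patterns but suffer from dispersed attention on long input sequences. The paper proposes two attention alignment strategies based on temperature scaling that alleviate the issue and improve T5's long-context utilization.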

Abstract
An ideal length-extrapolatable Transformer language model can handle sequences longer than the training length without any long sequence fine-tuning. Such long-context utilization capability highly relies on a flexible positional embedding design. Upon investigating the flexibility of existing large pre-trained Transformer language models, we find that the T5 family deserves a closer look, as its positional embeddings capture rich and flexible attention patterns. However, T5 suffers from the dispersed attention issue: the longer the input sequence, the flatter the attention distribution. To alleviate the issue, we propose two attention alignment strategies via temperature scaling. Our findings improve the long-context utilization capability of T5 on language modeling, retrieval, and multi-document question answering without any fine-tuning, suggesting that a flexible positional embedding design and attention alignment go a long way toward Transformer length extrapolation. Code: https://github.com/chijames/Attention-Alignment-Transformer-Length-Extrapolation

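Summary
An ideal length-extrapolatable Transformer language model can handle sequences longer than the training length without any long-sequence fine-tuning. This long-context utilization capability relies heavily on a flexible positional embedding design. After investigating the flexibility of existing large pre-trained Transformer language models, the authors find that the T5 family deserves a closer look, because its positional embeddings capture rich and flexible attention patterns. T5 suffers from the dispersed attention issue, however: the longer the input sequence, the flatter the attention distribution. To alleviate the issue, the authors propose two attention alignment strategies via temperature scaling. The findings improve T5's long-context utilization on language modeling, retrieval, and multi-document question answering without any fine-tuning, which suggests that a flexible positional embedding design and attention alignment go a long way toward Transformer length extrapolation.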
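
The dispersed attention issue and the effect of temperature scaling can be seen in a toy example. This is a sketch under stated assumptions, not the authors' method: the logits are synthetic (one relevant key among many equal distractors) and the temperature 0.25 is chosen arbitrarily, whereas the paper proposes two specific strategies for setting it.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def attention_logits(n):
    """Synthetic logits: one relevant key (2.0) among n - 1 distractors (0.0)."""
    return [2.0] + [0.0] * (n - 1)


short = softmax(attention_logits(16))
long_ = softmax(attention_logits(1024))
long_sharp = softmax(attention_logits(1024), temperature=0.25)
```

With the same logits, the weight on the relevant key shrinks as the sequence grows from 16 to 1024 positions, and a temperature below 1 concentrates the distribution again without any change to the model's weights.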

Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs

paper_url: http://arxiv.org/abs/2311.00681
repo_url: None
paper_authors: Xue-Yong Fu, Md Tahmid Rahman Laskar, Cheng Chen, Shashi Bhushan TN
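for: This study examines large language models (LLMs) as assessors of factual consistency in summaries produced by text-generation models.
methods: The study introduces an approach that uses a single LLM for the entire question-answering-based factuality scoring process. It then examines various LLMs in direct factuality scoring, benchmarking them against traditional measures and human annotations.
results: The factuality metrics show no significant correlation with human evaluations, specifically for GPT-4 and PaLM-2; only GPT-3.5 shows notable correlations, in two factuality subcategories. The findings suggest that current LLMs cannot yet gauge factuality accurately.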

Abstract
In recent years, Large Language Models (LLMs) have gained immense attention due to their notable emergent capabilities, surpassing those seen in earlier language models. A particularly intriguing application of LLMs is their role as evaluators for texts produced by various generative models. In this study, we delve into the potential of LLMs as reliable assessors of factual consistency in summaries generated by text-generation models. Initially, we introduce an innovative approach for factuality assessment using LLMs. This entails employing a singular LLM for the entirety of the question-answering-based factuality scoring process. Following this, we examine the efficacy of various LLMs in direct factuality scoring, benchmarking them against traditional measures and human annotations. Contrary to initial expectations, our results indicate a lack of significant correlations between factuality metrics and human evaluations, specifically for GPT-4 and PaLM-2. Notable correlations were only observed with GPT-3.5 across two factuality subcategories. These consistent findings across various factual error categories suggest a fundamental limitation in the current LLMs' capability to accurately gauge factuality.

Summary
Recently, Large Language Models (LLMs) have received extensive attention due to their remarkable emergent capabilities, surpassing those of earlier language models. One fascinating application of LLMs is their ability to evaluate the factual consistency of texts generated by various generative models. In this study, we explore the potential of LLMs as reliable assessors of factual consistency in summaries produced by text-generation models. We propose an innovative approach for factuality assessment using LLMs, which involves using a single LLM for the entire question-answering-based factuality scoring process. We then compare the efficacy of various LLMs in direct factuality scoring, benchmarking them against traditional measures and human annotations. Surprisingly, our results indicate a lack of significant correlations between factuality metrics and human evaluations, particularly for GPT-4 and PaLM-2. Only GPT-3.5 showed notable correlations across two factuality subcategories. These consistent findings across various factual error categories suggest a fundamental limitation in the current LLMs' ability to accurately assess factuality.
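
The comparison between LLM factuality scores and human annotations comes down to a correlation between two lists of scores. The abstract does not say which coefficient the authors report, so the Spearman rank correlation below is an assumption, and the score lists are made up.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up scores for five summaries.
human = [1, 2, 3, 4, 5]
llm_aligned = [0.1, 0.3, 0.4, 0.8, 0.9]
llm_reversed = [0.9, 0.8, 0.4, 0.3, 0.1]
rho_aligned = spearman(human, llm_aligned)
rho_reversed = spearman(human, llm_reversed)
```

A value near 1 means the LLM orders summaries the way human annotators do; the study's finding is that, for most of the models tested, the observed correlations were not significant.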

Emotion Detection for Misinformation: A Review

paper_url: http://arxiv.org/abs/2311.00671
repo_url: None
paper_authors: Zhiwei Liu, Tianlin Zhang, Kailai Yang, Paul Thompson, Zeping Yu, Sophia Ananiadou
for: The paper focuses on detecting misinformation (e.g., fake news and rumors) in social media, with particular emphasis on the role of emotions and sentiments in distinguishing genuine from false information.
methods: The paper reviews a range of emotion-based methods for misinformation detection, including methods that use emotion, sentiment, and stance-based features, and analyzes their strengths and weaknesses.
results: The paper discusses ongoing challenges in emotion-based misinformation detection, including the need for large, high-quality datasets, accurate annotation, and benchmarking. The authors also suggest future research directions, such as incorporating multimodality and improving interpretability.

Abstract
With the advent of social media, an increasing number of netizens are sharing and reading posts and news online. However, the huge volumes of misinformation (e.g., fake news and rumors) that flood the internet can adversely affect people's lives, and have resulted in the emergence of rumor and fake news detection as a hot research topic. The emotions and sentiments of netizens, as expressed in social media posts and news, constitute important factors that can help to distinguish fake news from genuine news and to understand the spread of rumors. This article comprehensively reviews emotion-based methods for misinformation detection. We begin by explaining the strong links between emotions and misinformation. We subsequently provide a detailed analysis of a range of misinformation detection methods that employ a variety of emotion, sentiment and stance-based features, and describe their strengths and weaknesses. Finally, we discuss a number of ongoing challenges in emotion-based misinformation detection based on large language models and suggest future research directions, including data collection (multi-platform, multilingual), annotation, benchmark, multimodality, and interpretability.

Summary
We begin by discussing the strong connections between emotions and misinformation. We then provide a detailed analysis of a variety of misinformation detection methods that use emotion, sentiment, and stance-based features, and describe their strengths and weaknesses. Finally, we address ongoing challenges in emotion-based misinformation detection using large language models and suggest future research directions, including data collection (multi-platform, multilingual), annotation, benchmarking, multimodality, and interpretability.

Explicit Morphological Knowledge Improves Pre-training of Language Models for Hebrew

paper_url: http://arxiv.org/abs/2311.00658
repo_url: None
paper_authors: Eylon Gueta, Omer Goldman, Reut Tsarfaty
for: Investigates whether incorporating explicit morphological knowledge can improve the performance of pre-trained language models (PLMs) for morphologically-rich languages (MRLs).
methods: Proposes various morphologically driven tokenization methods that enable the model to leverage morphological cues beyond raw text.
results: Experiments show that morphologically driven tokenization improves on standard language-agnostic tokenization across a benchmark of both semantic and morphologic tasks.

Abstract
Pre-trained language models (PLMs) have shown remarkable successes in acquiring a wide range of linguistic knowledge, relying solely on self-supervised training on text streams. Nevertheless, the effectiveness of this language-agnostic approach has been frequently questioned for its sub-optimal performance when applied to morphologically-rich languages (MRLs). We investigate the hypothesis that incorporating explicit morphological knowledge in the pre-training phase can improve the performance of PLMs for MRLs. We propose various morphologically driven tokenization methods enabling the model to leverage morphological cues beyond raw text. We pre-train multiple language models utilizing the different methods and evaluate them on Hebrew, a language with complex and highly ambiguous morphology. Our experiments show that morphologically driven tokenization demonstrates improved results compared to a standard language-agnostic tokenization, on a benchmark of both semantic and morphologic tasks. These findings suggest that incorporating morphological knowledge holds the potential for further improving PLMs for morphologically rich languages.

Formal Translation from Reversing Petri Nets to Coloured Petri Nets

paper_url: http://arxiv.org/abs/2311.00629
repo_url: None
paper_authors: Kamila Barylska, Anna Gogolinska, Lukasz Mikulski, Anna Philippou, Marcin Piatkowski, Kyriaki Psara
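for: This paper concerns reversible computation, an emerging computing paradigm with applications in chemical reactions, quantum computation, robotics, and distributed systems.
methods: The paper works with reversing Petri nets, an extension of Petri nets that implements the three main forms of reversibility: backtracking, causal reversing, and out-of-causal-order reversing. The extension uses named tokens that can be combined to form bonds.
results: The paper reports a translation from reversing Petri nets with token multiplicity to Coloured Petri Nets (CPNs), together with a tool that implements it and enables automated translation and analysis of reversible systems.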

Abstract
Reversible computation is an emerging computing paradigm that allows any sequence of operations to be executed in reverse order at any point during computation. Its appeal lies in its potential for low-power computation and its relevance to a wide array of applications such as chemical reactions, quantum computation, robotics, and distributed systems. Reversing Petri nets (RPNs) are a recently-proposed extension of Petri nets that implements the three main forms of reversibility, namely, backtracking, causal reversing, and out-of-causal-order reversing. Their distinguishing feature is the use of named tokens that can be combined together to form bonds. Named tokens, along with a history function, constitute the means of remembering past behaviour, thus enabling reversal. In recent work, we have proposed a structural translation from a subclass of RPNs to the model of Coloured Petri Nets (CPNs), an extension of traditional Petri nets where tokens carry data values. In this paper, we extend the translation to handle RPNs with token multiplicity under the individual-token interpretation, a model which allows multiple tokens of the same type to exist in a system. To support the three types of reversibility, tokens are associated with their causal history and, while tokens of the same type are equally eligible to fire a transition when going forward, when going backwards they are able to reverse only the transitions they have previously fired. The new translation, in addition to lifting the restriction on token uniqueness, presents a refined approach for transforming RPNs to CPNs through a unifying approach that allows instantiating each of the three types of reversibility. The paper also reports on a tool that implements this translation, paving the way for automated translations and analysis of reversible systems using CPN Tools.

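Summary
Reversible computation is an emerging computing paradigm that allows any sequence of operations to be executed in reverse at any point during computation. Its appeal lies in its potential for low-power computation and its relevance to applications such as chemical reactions, quantum computation, robotics, and distributed systems. Reversing Petri nets (RPNs) are a recently proposed extension of Petri nets that implements the three main forms of reversibility: backtracking, causal reversing, and out-of-causal-order reversing. Their distinguishing feature is named tokens that can be combined to form bonds; together with a history function, these tokens record past behaviour and so enable reversal. Earlier work proposed a structural translation from a subclass of RPNs to Coloured Petri Nets (CPNs), an extension of traditional Petri nets in which tokens carry data values. This paper extends the translation to RPNs with token multiplicity under the individual-token interpretation, where multiple tokens of the same type may coexist. Tokens are associated with their causal history: tokens of the same type are equally eligible to fire a transition going forward, but going backward they can reverse only the transitions they previously fired. The new translation lifts the restriction on token uniqueness and offers a unified way to instantiate each of the three types of reversibility. The paper also reports on a tool that implements the translation, paving the way for automated translation and analysis of reversible systems using CPN Tools.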

Crosslingual Retrieval Augmented In-context Learning for Bangla

paper_url: http://arxiv.org/abs/2311.00587
repo_url: None
paper_authors: Xiaoqian Li, Ercong Nie, Sheng Liang
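for: Improving natural language processing performance on low-resource languages such as Bangla.
methods: Cross-lingual retrieval-augmented in-context learning.
results: Cross-lingual retrieval-augmented prompts improve the performance of multilingual pretrained language models (MPLMs) on Bangla tasks.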

Abstract
The promise of Large Language Models (LLMs) in Natural Language Processing has often been overshadowed by their limited performance in low-resource languages such as Bangla. To address this, our paper presents a pioneering approach that utilizes cross-lingual retrieval augmented in-context learning. By strategically sourcing semantically similar prompts from high-resource language, we enable multilingual pretrained language models (MPLMs), especially the generative model BLOOMZ, to successfully boost performance on Bangla tasks. Our extensive evaluation highlights that the cross-lingual retrieval augmented prompts bring steady improvements to MPLMs over the zero-shot performance.

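Summary
The promise of large language models (LLMs) in natural language processing is often overshadowed by their limited performance on low-resource languages such as Bangla. To address this, the paper presents an approach based on cross-lingual retrieval-augmented in-context learning. By retrieving semantically similar prompts from a high-resource language, it enables multilingual pretrained language models (MPLMs), in particular the generative model BLOOMZ, to improve on Bangla tasks. An extensive evaluation shows that cross-lingual retrieval-augmented prompts bring steady improvements to MPLMs over their zero-shot performance.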
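As a rough illustration of retrieval-augmented prompting, the sketch below picks the labelled high-resource example closest to a low-resource query and prepends it as a demonstration. The toy two-dimensional vectors, the `retrieve` and `build_prompt` helpers, and the prompt template are all assumptions made for illustration; a real system would obtain the vectors from a multilingual sentence encoder and use the paper's own templates.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, pool, k=1):
    """Return the k pool entries whose vectors are most similar to the query."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)
    return ranked[:k]

def build_prompt(query_text, demos):
    """Hypothetical template: retrieved demonstrations first, then the query."""
    parts = [f"Review: {d['text']}\nSentiment: {d['label']}" for d in demos]
    parts.append(f"Review: {query_text}\nSentiment:")
    return "\n\n".join(parts)

# Toy high-resource (English) pool with made-up embedding vectors.
pool = [
    {"text": "The film was wonderful.", "label": "positive", "vec": [0.9, 0.1]},
    {"text": "The food was terrible.", "label": "negative", "vec": [0.1, 0.9]},
]
query_vec = [0.8, 0.2]  # stand-in for the embedding of a Bangla input
demos = retrieve(query_vec, pool, k=1)
prompt = build_prompt("<Bangla review text>", demos)
```

The resulting `prompt` string would then be passed to the multilingual model in place of a bare zero-shot query.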

Can Large Language Models Design Accurate Label Functions?

paper_url: http://arxiv.org/abs/2311.00739
repo_url: https://github.com/chrisneagu/FTC-Skystone-Dark-Angels-Romania-2020
paper_authors: Naiqing Guan, Kaiwen Chen, Nick Koudas
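for: This paper explores whether pre-trained language models (PLMs) can automatically generate accurate label functions (LFs).
methods: The study introduces DataSculpt, an interactive PLM-based framework that generates LFs automatically. It incorporates a range of prompting techniques, instance selection strategies, and LF filtration methods to explore the design space.
results: An evaluation on 12 real-world datasets covering a range of tasks reveals both the strengths and the limitations of current PLMs in LF design.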

Abstract
Programmatic weak supervision methodologies facilitate the expedited labeling of extensive datasets through the use of label functions (LFs) that encapsulate heuristic data sources. Nonetheless, the creation of precise LFs necessitates domain expertise and substantial endeavors. Recent advances in pre-trained language models (PLMs) have exhibited substantial potential across diverse tasks. However, the capacity of PLMs to autonomously formulate accurate LFs remains an underexplored domain. In this research, we address this gap by introducing DataSculpt, an interactive framework that harnesses PLMs for the automated generation of LFs. Within DataSculpt, we incorporate an array of prompting techniques, instance selection strategies, and LF filtration methods to explore the expansive design landscape. Ultimately, we conduct a thorough assessment of DataSculpt's performance on 12 real-world datasets, encompassing a range of tasks. This evaluation unveils both the strengths and limitations of contemporary PLMs in LF design.
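To make the notion of a label function concrete, here is a minimal sketch of programmatic weak supervision on a toy spam task. The three heuristics and the majority-vote aggregator are illustrative assumptions, not DataSculpt's generated LFs or its aggregation method.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1

# Each label function encodes one heuristic and may abstain.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) <= 4 else ABSTAIN

LABEL_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]

def majority_vote(text, lfs=LABEL_FUNCTIONS):
    """Aggregate non-abstaining votes; ties go to the label that was voted first."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]
```

In practice the votes are usually combined by a label model that estimates each LF's accuracy; plain majority voting is used here only to keep the sketch short.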

An Embedded Diachronic Sense Change Model with a Case Study from Ancient Greek

paper_url: http://arxiv.org/abs/2311.00541
repo_url: https://github.com/schyanzafar/edisc
paper_authors: Schyan Zafar, Geoff K. Nicholls
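for: This paper analyses how word senses change over time in an ancient Greek text corpus.
methods: It builds on the unsupervised generative models GASC and DiSC, which represent the senses of a target word such as "kosmos" as distributions over context words and are fitted with MCMC methods to measure temporal change. The paper combines word embeddings with DiSC.
results: The resulting model, EDiSC, offers improved predictive accuracy, ground-truth recovery, and uncertainty quantification, as well as better sampling efficiency and scalability with MCMC methods.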

Abstract
Word meanings change over time, and word senses evolve, emerge or die out in the process. For ancient languages, where the corpora are often small, sparse and noisy, modelling such changes accurately proves challenging, and quantifying uncertainty in sense-change estimates consequently becomes important. GASC and DiSC are existing generative models that have been used to analyse sense change for target words from an ancient Greek text corpus, using unsupervised learning without the help of any pre-training. These models represent the senses of a given target word such as "kosmos" (meaning decoration, order or world) as distributions over context words, and sense prevalence as a distribution over senses. The models are fitted using MCMC methods to measure temporal changes in these representations. In this paper, we introduce EDiSC, an embedded version of DiSC, which combines word embeddings with DiSC to provide superior model performance. We show empirically that EDiSC offers improved predictive accuracy, ground-truth recovery and uncertainty quantification, as well as better sampling efficiency and scalability properties with MCMC methods. We also discuss the challenges of fitting these models.


Text Rendering Strategies for Pixel Language Models

paper_url: http://arxiv.org/abs/2311.00522
repo_url: None
paper_authors: Jonas F. Lotz, Elizabeth Salesky, Phillip Rust, Desmond Elliott
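for: This paper studies text rendering strategies for PIXEL, a pixel-based approach to open-vocabulary language modelling.
methods: It investigates four approaches to rendering text in the PIXEL model (Rust et al., 2023), including simple character bigram rendering.
results: Character bigram rendering improves performance on sentence-level tasks without compromising token-level or multilingual tasks. It also makes it possible to train a more compact model with only 22M parameters that performs on par with the original 86M-parameter model.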

Abstract
Pixel-based language models process text rendered as images, which allows them to handle any script, making them a promising approach to open vocabulary language modelling. However, recent approaches use text renderers that produce a large set of almost-equivalent input patches, which may prove sub-optimal for downstream tasks, due to redundancy in the input representations. In this paper, we investigate four approaches to rendering text in the PIXEL model (Rust et al., 2023), and find that simple character bigram rendering brings improved performance on sentence-level tasks without compromising performance on token-level or multilingual tasks. This new rendering strategy also makes it possible to train a more compact model with only 22M parameters that performs on par with the original 86M parameter model. Our analyses show that character bigram rendering leads to a consistently better model but with an anisotropic patch embedding space, driven by a patch frequency bias, highlighting the connections between image patch- and tokenization-based language models.

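Summary
Pixel-based language models process text rendered as images, which lets them handle any script and makes them a promising approach to open-vocabulary language modelling. However, recent approaches use text renderers that produce a large set of almost-equivalent input patches, and this redundancy may be sub-optimal for downstream tasks. The paper investigates four approaches to rendering text in the PIXEL model (Rust et al., 2023) and finds that simple character bigram rendering improves performance on sentence-level tasks without compromising token-level or multilingual tasks. The new rendering strategy also makes it possible to train a more compact model with only 22M parameters that performs on par with the original 86M-parameter model. Analyses show that character bigram rendering yields a consistently better model, but one with an anisotropic patch embedding space driven by a patch frequency bias, which highlights the connections between image-patch-based and tokenization-based language models.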
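The idea behind character bigram rendering can be sketched as a text-chunking step that decides which characters share an image patch. The function below assumes non-overlapping character pairs with space padding for odd lengths; the abstract does not spell out these details, and the actual rasterisation into pixels is omitted.

```python
def char_bigrams(text, pad=" "):
    """Split text into non-overlapping two-character chunks, one per patch."""
    if len(text) % 2:
        text += pad  # pad odd-length input so every chunk has two characters
    return [text[i:i + 2] for i in range(0, len(text), 2)]

patches = char_bigrams("pixel")
```

Because each patch can only hold one of a bounded set of character pairs, the set of distinct patches is far smaller than with a renderer that lets glyphs fall at arbitrary pixel offsets.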

Rule-Based Error Classification for Analyzing Differences in Frequent Errors

paper_url: http://arxiv.org/abs/2311.00513
repo_url: None
paper_authors: Atsushi Shirafuji, Taku Matsumoto, Md Faizul Ibne Amin, Yutaka Watanobe
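for: This study aims to reveal how frequent errors differ between novice and expert programmers.
methods: The authors propose a rule-based error classification tool that classifies errors in code pairs consisting of a wrong and a correct program.
results: The tool classifies errors in 95,631 code pairs, identifying 3.47 errors per pair on average. The analysis indicates that novices' errors stem mainly from a lack of programming knowledge, whereas experts' errors stem from careless reading of the problem or from the challenge of solving problems differently than usual.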

Abstract
Finding and fixing errors is a time-consuming task not only for novice programmers but also for expert programmers. Prior work has identified frequent error patterns among various levels of programmers. However, the differences in the tendencies between novices and experts have yet to be revealed. From the knowledge of the frequent errors in each level of programmers, instructors will be able to provide helpful advice for each level of learners. In this paper, we propose a rule-based error classification tool to classify errors in code pairs consisting of wrong and correct programs. We classify errors for 95,631 code pairs and identify 3.47 errors on average, which are submitted by various levels of programmers on an online judge system. The classified errors are used to analyze the differences in frequent errors between novice and expert programmers. The analyzed results show that, as for the same introductory problems, errors made by novices are due to the lack of knowledge in programming, and the mistakes are considered an essential part of the learning process. On the other hand, errors made by experts are due to misunderstandings caused by the carelessness of reading problems or the challenges of solving problems differently than usual. The proposed tool can be used to create error-labeled datasets and for further code-related educational research.

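Summary
Finding and fixing errors is time-consuming for novice and expert programmers alike. Prior work has identified frequent error patterns among programmers of various levels, but how the tendencies of novices and experts differ has yet to be revealed. Knowing the frequent errors at each level would let instructors give helpful advice to each level of learner. The paper proposes a rule-based error classification tool that classifies errors in code pairs consisting of a wrong and a correct program. It classifies errors in 95,631 code pairs, submitted by programmers of various levels to an online judge system, and identifies 3.47 errors per pair on average. The classified errors are then used to analyse how frequent errors differ between novices and experts. On the same introductory problems, novices' errors are due to a lack of programming knowledge and are considered an essential part of the learning process, while experts' errors are due to careless reading of the problem or the challenge of solving problems differently than usual. The tool can be used to create error-labeled datasets and for further code-related educational research.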
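One way to picture rule-based classification over a wrong/correct code pair is to diff the two token sequences and attach a label to each edit. The sketch below uses Python's standard `difflib`; the whitespace tokenisation and the single `comparison operator` rule are hypothetical and are not the rules used by the paper's tool.

```python
import difflib

COMPARISONS = {"<", "<=", ">", ">=", "==", "!="}

def classify_edits(wrong_tokens, correct_tokens):
    """Return (tag, removed_tokens, added_tokens) for every non-equal diff block."""
    matcher = difflib.SequenceMatcher(a=wrong_tokens, b=correct_tokens, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((tag, wrong_tokens[i1:i2], correct_tokens[j1:j2]))
    return edits

def label(edit):
    """Hypothetical rule: a one-token swap between comparison operators."""
    tag, old, new = edit
    if (tag == "replace" and len(old) == 1 and len(new) == 1
            and old[0] in COMPARISONS and new[0] in COMPARISONS):
        return "comparison operator"
    return tag  # fall back to the raw diff tag: insert, delete, or replace

wrong = "if a < b :".split()
correct = "if a <= b :".split()
edits = classify_edits(wrong, correct)
labels = [label(e) for e in edits]
```

Counting such labels per submission gives the kind of per-pair error tally that the analysis compares across skill levels.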

Robustness Tests for Automatic Machine Translation Metrics with Adversarial Attacks

paper_url: http://arxiv.org/abs/2311.00508
repo_url: https://github.com/i-need-sleep/eval_attack
paper_authors: Yichen Huang, Timothy Baldwin
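for: This study examines how MT evaluation metrics perform on adversarially synthesized texts, to shed light on metric robustness.
methods: Word- and character-level attacks are applied to three popular machine translation metrics: BERTScore, BLEURT, and COMET.
results: Human experiments show that automatic metrics tend to overpenalize adversarially degraded translations. BERTScore is also inconsistent: it rates the original and the adversarially degraded sentence as similar, yet rates the degraded translation as notably worse than the original with respect to the reference. These patterns of brittleness motivate the development of more robust metrics.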

Abstract
We investigate MT evaluation metric performance on adversarially-synthesized texts, to shed light on metric robustness. We experiment with word- and character-level attacks on three popular machine translation metrics: BERTScore, BLEURT, and COMET. Our human experiments validate that automatic metrics tend to overpenalize adversarially-degraded translations. We also identify inconsistencies in BERTScore ratings, where it judges the original sentence and the adversarially-degraded one as similar, while judging the degraded translation as notably worse than the original with respect to the reference. We identify patterns of brittleness that motivate more robust metric development.

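Summary
The paper investigates how MT evaluation metrics perform on adversarially synthesized texts, to shed light on metric robustness. It experiments with word- and character-level attacks on three popular machine translation metrics: BERTScore, BLEURT, and COMET. Human experiments validate that automatic metrics tend to overpenalize adversarially degraded translations. The paper also identifies inconsistencies in BERTScore ratings: it judges the original sentence and the adversarially degraded one to be similar, while judging the degraded translation to be notably worse than the original with respect to the reference. The identified patterns of brittleness motivate the development of more robust metrics.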
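A character-level perturbation of the kind used to stress-test metrics can be as simple as swapping two adjacent letters inside a word. The function below is a generic sketch under that assumption; the abstract does not say which attack algorithms the paper uses, and a real attack would search for the perturbation that most degrades the metric score.

```python
import random

def swap_adjacent_chars(sentence, rng):
    """Swap two adjacent inner characters of one randomly chosen word."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return sentence  # nothing long enough to perturb
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(1, len(w) - 2)  # keep the first and last character fixed
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

original = "machine translation metrics"
perturbed = swap_adjacent_chars(original, random.Random(0))
```

Scoring `original` and `perturbed` against the same reference with each metric, and comparing the score drop with human judgements, is the kind of comparison the study reports.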

Comparing Optimization Targets for Contrast-Consistent Search

paper_url: http://arxiv.org/abs/2311.00488
repo_url: None
paper_authors: Hugo Fry, Seamus Fallows, Ian Fan, Jamie Wright, Nandi Schoots
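for: Studying the optimization target of Contrast-Consistent Search (CCS), which aims to recover a large language model's internal representations of truth.
methods: The paper proposes a new Midpoint-Displacement (MD) loss function and shows that, for a certain hyper-parameter value, it yields a prober with weights very similar to those of CCS.
results: That hyper-parameter value is not optimal: with a better value, the MD loss attains higher test accuracy than CCS.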

Abstract
We investigate the optimization target of Contrast-Consistent Search (CCS), which aims to recover the internal representations of truth of a large language model. We present a new loss function that we call the Midpoint-Displacement (MD) loss function. We demonstrate that for a certain hyper-parameter value this MD loss function leads to a prober with very similar weights to CCS. We further show that this hyper-parameter is not optimal and that with a better hyper-parameter the MD loss function attains a higher test accuracy than CCS.

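Summary
The paper investigates the optimization target of Contrast-Consistent Search (CCS), which aims to recover the internal representations of truth of a large language model. It presents a new loss function called the Midpoint-Displacement (MD) loss. For a certain hyper-parameter value, the MD loss leads to a prober with weights very similar to those of CCS. The paper further shows that this hyper-parameter value is not optimal and that, with a better value, the MD loss attains a higher test accuracy than CCS.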
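For orientation, the CCS objective that the MD loss is compared against can be written in a few lines. The sketch follows the consistency-plus-confidence form from the original CCS work (Burns et al., 2022) as we understand it, for a single contrast pair; the MD loss itself is not defined in the abstract and is therefore not reproduced here.

```python
def ccs_loss(p_pos, p_neg):
    """CCS loss for one contrast pair.

    p_pos: probe output for the statement with the "true" completion
    p_neg: probe output for the same statement with the "false" completion
    """
    consistency = (p_pos - (1.0 - p_neg)) ** 2  # the two outputs should sum to 1
    confidence = min(p_pos, p_neg) ** 2         # discourage the degenerate 0.5/0.5 answer
    return consistency + confidence
```

The full objective averages this quantity over all contrast pairs and minimises it with respect to the probe's weights.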

Style Locality for Controllable Generation with kNN Language Models

paper_url: http://arxiv.org/abs/2311.00475
repo_url: None
paper_authors: Gilles Nawezi, Lucie Flek, Charles Welch
for: This paper studies controllable generation, that is, controlling the style of generated text.
methods: It uses nearest neighbor language models with external memory and adds locality levels, which let the model learn how to weight neighbors based on their relative location to the current text in source documents.
results: The model successfully controls style and provides a better fluency-style trade-off than previous work.

Abstract
Recent language models have been improved by the addition of external memory. Nearest neighbor language models retrieve similar contexts to assist in word prediction. The addition of locality levels allows a model to learn how to weight neighbors based on their relative location to the current text in source documents, and have been shown to further improve model performance. Nearest neighbor models have been explored for controllable generation but have not examined the use of locality levels. We present a novel approach for this purpose and evaluate it using automatic and human evaluation on politeness, formality, supportiveness, and toxicity textual data. We find that our model is successfully able to control style and provides a better fluency-style trade-off than previous work.

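Summary
Recent language models have been improved by adding external memory. Nearest neighbor language models retrieve similar contexts to assist in word prediction. Adding locality levels lets a model learn how to weight neighbors based on their relative location to the current text in source documents, which has been shown to improve performance further. Nearest neighbor models have been explored for controllable generation, but the use of locality levels has not been examined in that setting. The paper presents a novel approach for this purpose and evaluates it with automatic and human evaluation on politeness, formality, supportiveness, and toxicity data. The model successfully controls style and provides a better fluency-style trade-off than previous work.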
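The retrieval step in a nearest neighbor language model is commonly combined with the base model by interpolating two next-word distributions. The sketch below shows that interpolation and one possible way to favour "local" neighbours; the `locality_bonus` adjustment is a hypothetical stand-in for the learned locality weighting, and all numbers are made up.

```python
import math
from collections import defaultdict

def knn_distribution(neighbors, locality_bonus=0.0):
    """neighbors: list of (next_word, distance, is_local) tuples from the datastore."""
    scores = defaultdict(float)
    for word, dist, is_local in neighbors:
        # Hypothetical locality weighting: local neighbours get a smaller effective distance.
        adjusted = dist - (locality_bonus if is_local else 0.0)
        scores[word] += math.exp(-adjusted)
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}

def interpolate(p_lm, p_knn, lam=0.25):
    """Mix the base LM and kNN distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {w: lam * p_knn.get(w, 0.0) + (1 - lam) * p_lm.get(w, 0.0) for w in vocab}

p_lm = {"please": 0.2, "now": 0.8}
neighbors = [("please", 1.0, True), ("now", 1.0, False)]
p_plain = knn_distribution(neighbors)
p_local = knn_distribution(neighbors, locality_bonus=1.0)
mixed = interpolate(p_lm, p_local)
```

For style control, the datastore would hold contexts in the target style (for example, polite text), so that raising `lam` pulls generation toward that style at some cost in fluency.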

Discourse Relations Classification and Cross-Framework Discourse Relation Classification Through the Lens of Cognitive Dimensions: An Empirical Investigation

paper_url: http://arxiv.org/abs/2311.00451
repo_url: None
paper_authors: Yingxue Fu
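for: This study aims to capture discourse relations across different frameworks using simple, cognitively inspired dimensions.
methods: It uses the dimensions proposed by Sanders et al. (2018) to characterize discourse relations and performs cross-framework discourse relation classification (PDTB and RST).
results: Knowledge of discourse relations can be transferred from one framework to another by means of these dimensions, and different dimensions influence different types of discourse relations.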

Abstract
Existing discourse formalisms use different taxonomies of discourse relations, which require expert knowledge to understand, posing a challenge for annotation and automatic classification. We show that discourse relations can be effectively captured by some simple cognitively inspired dimensions proposed by Sanders et al.(2018). Our experiments on cross-framework discourse relation classification (PDTB & RST) demonstrate that it is possible to transfer knowledge of discourse relations for one framework to another framework by means of these dimensions, in spite of differences in discourse segmentation of the two frameworks. This manifests the effectiveness of these dimensions in characterizing discourse relations across frameworks. Ablation studies reveal that different dimensions influence different types of discourse relations. The patterns can be explained by the role of dimensions in characterizing and distinguishing different relations. We also report our experimental results on automatic prediction of these dimensions.

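Summary
Existing discourse formalisms use different taxonomies of discourse relations that require expert knowledge to understand, which poses a challenge for annotation and automatic classification. The paper shows that discourse relations can be captured effectively by some simple, cognitively inspired dimensions proposed by Sanders et al. (2018). Experiments on cross-framework discourse relation classification (PDTB and RST) demonstrate that knowledge of discourse relations can be transferred from one framework to another by means of these dimensions, despite differences in how the two frameworks segment discourse. This shows that the dimensions are effective in characterizing discourse relations across frameworks. Ablation studies reveal that different dimensions influence different types of discourse relations, a pattern explained by the role the dimensions play in characterizing and distinguishing relations. The paper also reports experimental results on automatic prediction of these dimensions.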

Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling

paper_url: http://arxiv.org/abs/2311.00430
repo_url: https://github.com/huggingface/distil-whisper
paper_authors: Sanchit Gandhi, Patrick von Platen, Alexander M. Rush
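for: Running large pre-trained speech recognition models in low-latency or resource-constrained environments.
methods: Pseudo-labelling is used to assemble a large-scale open-source dataset, which is used to distill the Whisper model into a smaller variant, Distil-Whisper. A simple word error rate (WER) heuristic selects only the highest-quality pseudo-labels for training.
results: Distil-Whisper is 5.8 times faster than Whisper with 51% fewer parameters, while performing to within 1% WER on out-of-distribution test data in a zero-shot transfer setting. It retains Whisper's robustness to difficult acoustic conditions and is less prone to hallucination errors on long-form audio. Paired with Whisper for speculative decoding, it yields a 2 times speed-up while mathematically ensuring the same outputs as the original model.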

Abstract
As the size of pre-trained speech recognition models increases, running these large models in low-latency or resource-constrained environments becomes challenging. In this work, we leverage pseudo-labelling to assemble a large-scale open-source dataset which we use to distill the Whisper model into a smaller variant, called Distil-Whisper. Using a simple word error rate (WER) heuristic, we select only the highest quality pseudo-labels for training. The distilled model is 5.8 times faster with 51% fewer parameters, while performing to within 1% WER on out-of-distribution test data in a zero-shot transfer setting. Distil-Whisper maintains the robustness of the Whisper model to difficult acoustic conditions, while being less prone to hallucination errors on long-form audio. Distil-Whisper is designed to be paired with Whisper for speculative decoding, yielding a 2 times speed-up while mathematically ensuring the same outputs as the original model. To facilitate further research in this domain, we make our training code, inference code and models publicly accessible.

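Summary
As pre-trained speech recognition models grow, running them in low-latency or resource-constrained environments becomes challenging. The paper uses pseudo-labelling to assemble a large-scale open-source dataset and uses it to distill the Whisper model into a smaller variant called Distil-Whisper. A simple word error rate (WER) heuristic selects only the highest-quality pseudo-labels for training. The distilled model is 5.8 times faster with 51% fewer parameters, while performing to within 1% WER on out-of-distribution test data in a zero-shot transfer setting. Distil-Whisper retains Whisper's robustness to difficult acoustic conditions and is less prone to hallucination errors on long-form audio. It is designed to be paired with Whisper for speculative decoding, yielding a 2 times speed-up while mathematically ensuring the same outputs as the original model. To facilitate further research, the training code, inference code, and models are publicly accessible.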
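A WER-based filter of the kind described can be sketched with a word-level edit distance. The version below assumes each training example has both a reference transcript and a pseudo-label, and that examples above a WER threshold are dropped; the field names and the `max_wer` value are illustrative assumptions, and text normalisation is omitted.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)

def filter_pseudo_labels(examples, max_wer=0.1):
    """Keep examples whose pseudo-label is close to the reference transcript."""
    return [ex for ex in examples
            if word_error_rate(ex["reference"], ex["pseudo_label"]) <= max_wer]

examples = [
    {"reference": "the cat sat on the mat", "pseudo_label": "the cat sat on the mat"},
    {"reference": "the cat sat on the mat", "pseudo_label": "a dog sat on a mat"},
]
kept = filter_pseudo_labels(examples)
```

Only the first example survives the filter here, since the second pseudo-label differs from its reference in half of the words.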

Efficient Human-AI Coordination via Preparatory Language-based Convention

paper_url: http://arxiv.org/abs/2311.00416
repo_url: None
paper_authors: Cong Guan, Lichao Zhang, Chunpeng Fan, Yichen Li, Feng Chen, Lihe Li, Yunjia Tian, Lei Yuan, Yang Yu
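for: Developing intelligent agents that coordinate seamlessly with humans, a step towards artificial general intelligence.
methods: A large language model (LLM) develops an action plan (a convention) that guides both the human and the AI.
results: In the Overcooked-AI environment with a human proxy model, the method outperforms existing learning-based approaches. With real humans, it achieves better alignment with human preferences and an average performance improvement of 15% over the state of the art.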

Abstract
Developing intelligent agents capable of seamless coordination with humans is a critical step towards achieving artificial general intelligence. Existing methods for human-AI coordination typically train an agent to coordinate with a diverse set of policies or with human models fitted from real human data. However, the massively diverse styles of human behavior present obstacles for AI systems with constrained capacity, while high quality human data may not be readily available in real-world scenarios. In this study, we observe that prior to coordination, humans engage in communication to establish conventions that specify individual roles and actions, making their coordination proceed in an orderly manner. Building upon this observation, we propose employing the large language model (LLM) to develop an action plan (or equivalently, a convention) that effectively guides both human and AI. By inputting task requirements, human preferences, the number of agents, and other pertinent information into the LLM, it can generate a comprehensive convention that facilitates a clear understanding of tasks and responsibilities for all parties involved. Furthermore, we demonstrate that decomposing the convention formulation problem into sub-problems with multiple new sessions being sequentially employed and human feedback, will yield a more efficient coordination convention. Experimental evaluations conducted in the Overcooked-AI environment, utilizing a human proxy model, highlight the superior performance of our proposed method compared to existing learning-based approaches. When coordinating with real humans, our method achieves better alignment with human preferences and an average performance improvement of 15% compared to the state-of-the-art.

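Summary
Developing intelligent agents that can coordinate seamlessly with humans is a critical step towards artificial general intelligence. Existing methods for human-AI coordination typically train an agent to coordinate with a diverse set of policies or with human models fitted from real human data. However, the great diversity of human behaviour is an obstacle for AI systems with constrained capacity, and high-quality human data may not be readily available in real-world scenarios. The paper observes that, before coordinating, humans communicate to establish conventions that specify individual roles and actions, so that coordination proceeds in an orderly manner. Building on this, it proposes using a large language model (LLM) to develop an action plan (equivalently, a convention) that guides both the human and the AI. Given the task requirements, human preferences, the number of agents, and other pertinent information, the LLM generates a comprehensive convention that gives all parties a clear understanding of tasks and responsibilities. Decomposing the convention formulation problem into sub-problems, handled in multiple sequential new sessions with human feedback, yields a more efficient coordination convention. Experiments in the Overcooked-AI environment with a human proxy model show that the method outperforms existing learning-based approaches. With real humans, it achieves better alignment with human preferences and an average performance improvement of 15% over the state of the art.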

AdaSent: Efficient Domain-Adapted Sentence Embeddings for Few-Shot Classification

paper_url: http://arxiv.org/abs/2311.00408
repo_url: https://github.com/ukplab/adasent
paper_authors: Yongxin Huang, Kexin Wang, Sourav Dutta, Raj Nath Patel, Goran Glavaš, Iryna Gurevych
for: investigate strategies for domain-specialization in the context of few-shot sentence classification with Pre-trained Sentence Encoders (SEs)
methods: unsupervised Domain-Adaptive Pre-Training (DAPT) of a base Pre-trained Language Model (PLM), training a SEPT adapter on the base PLM to decouple SEPT from DAPT
results: substantially improves the accuracy of few-shot sentence classification, matches or surpasses the performance of full SEPT on DAPT-ed PLM, while substantially reducing the training costs

Abstract
Recent work has found that few-shot sentence classification based on pre-trained Sentence Encoders (SEs) is efficient, robust, and effective. In this work, we investigate strategies for domain-specialization in the context of few-shot sentence classification with SEs. We first establish that unsupervised Domain-Adaptive Pre-Training (DAPT) of a base Pre-trained Language Model (PLM) (i.e., not an SE) substantially improves the accuracy of few-shot sentence classification by up to 8.4 points. However, applying DAPT on SEs, on the one hand, disrupts the effects of their (general-domain) Sentence Embedding Pre-Training (SEPT). On the other hand, applying general-domain SEPT on top of a domain-adapted base PLM (i.e., after DAPT) is effective but inefficient, since the computationally expensive SEPT needs to be executed on top of a DAPT-ed PLM of each domain. As a solution, we propose AdaSent, which decouples SEPT from DAPT by training a SEPT adapter on the base PLM. The adapter can be inserted into DAPT-ed PLMs from any domain. We demonstrate AdaSent's effectiveness in extensive experiments on 17 different few-shot sentence classification datasets. AdaSent matches or surpasses the performance of full SEPT on DAPT-ed PLM, while substantially reducing the training costs. The code for AdaSent is available.

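Summary
Recent work has found that few-shot sentence classification based on pre-trained Sentence Encoders (SEs) is efficient, robust, and effective. The paper investigates strategies for domain specialization in this setting. It first establishes that unsupervised Domain-Adaptive Pre-Training (DAPT) of a base Pre-trained Language Model (PLM), that is, not an SE, improves the accuracy of few-shot sentence classification by up to 8.4 points. However, applying DAPT to SEs disrupts the effects of their general-domain Sentence Embedding Pre-Training (SEPT), while applying general-domain SEPT on top of a domain-adapted base PLM is effective but inefficient, because the computationally expensive SEPT must be run on the DAPT-ed PLM of each domain. As a solution, the paper proposes AdaSent, which decouples SEPT from DAPT by training a SEPT adapter on the base PLM. The adapter can be inserted into DAPT-ed PLMs from any domain. Extensive experiments on 17 few-shot sentence classification datasets show that AdaSent matches or surpasses full SEPT on a DAPT-ed PLM while substantially reducing training costs. The code for AdaSent is available.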

Enhanced Knowledge Injection for Radiology Report Generation

paper_url: http://arxiv.org/abs/2311.00399
repo_url: None
paper_authors: Qingqiu Li, Jilan Xu, Runtian Yuan, Mohan Chen, Yuejie Zhang, Rui Feng, Xiaobo Zhang, Shang Gao
for: automated radiology report generation
methods: utilizes two branches (Weighted Concept Knowledge and Multimodal Retrieval Knowledge) to extract different types of knowledge and integrate with current image
results: achieves superior performance over other state-of-the-art methods, with effective knowledge injection and well-structured knowledge gain

Abstract
Automatic generation of radiology reports holds crucial clinical value, as it can alleviate substantial workload on radiologists and remind less experienced ones of potential anomalies. Despite the remarkable performance of various image captioning methods in the natural image field, generating accurate reports for medical images still faces challenges, i.e., disparities in visual and textual data, and lack of accurate domain knowledge. To address these issues, we propose an enhanced knowledge injection framework, which utilizes two branches to extract different types of knowledge. The Weighted Concept Knowledge (WCK) branch is responsible for introducing clinical medical concepts weighted by TF-IDF scores. The Multimodal Retrieval Knowledge (MRK) branch extracts triplets from similar reports, emphasizing crucial clinical information related to entity positions and existence. By integrating this finer-grained and well-structured knowledge with the current image, we are able to leverage the multi-source knowledge gain to ultimately facilitate more accurate report generation. Extensive experiments have been conducted on two public benchmarks, demonstrating that our method achieves superior performance over other state-of-the-art methods. Ablation studies further validate the effectiveness of two extracted knowledge sources.
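The TF-IDF weighting mentioned for the Weighted Concept Knowledge branch can be illustrated on toy "reports" represented as lists of clinical concepts. This is the textbook TF-IDF formula with an unsmoothed logarithmic IDF, chosen as an assumption; the paper's exact weighting variant and concept vocabulary are not given in the abstract.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return weights

# Toy concept lists standing in for two radiology reports.
reports = [["effusion", "normal"], ["normal", "cardiomegaly"]]
weights = tfidf(reports)
```

A concept that appears in every report, such as `normal` here, receives weight zero, while rarer findings are weighted up.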

HARE: Explainable Hate Speech Detection with Step-by-Step Reasoning

paper_url: http://arxiv.org/abs/2311.00321
repo_url: https://github.com/joonkeekim/hare-hate-speech
paper_authors: Yongjin Yang, Joonkee Kim, Yujin Kim, Namgyu Ho, James Thorne, Se-young Yun
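for: This study addresses hate speech detection on social media, which is critical for online safety.
methods: The reasoning capabilities of large language models (LLMs) are used to fill gaps in existing explanations of hate speech, enabling effective supervision of detection models.
results: On the SBIC and Implicit Hate benchmarks, the method, which uses model-generated data, consistently outperforms baselines that use existing free-text human annotations. It improves the explanation quality of trained models and their generalization to unseen datasets.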

Abstract
With the proliferation of social media, accurate detection of hate speech has become critical to ensure safety online. To combat nuanced forms of hate speech, it is important to identify and thoroughly explain hate speech to help users understand its harmful effects. Recent benchmarks have attempted to tackle this issue by training generative models on free-text annotations of implications in hateful text. However, we find significant reasoning gaps in the existing annotations schemes, which may hinder the supervision of detection models. In this paper, we introduce a hate speech detection framework, HARE, which harnesses the reasoning capabilities of large language models (LLMs) to fill these gaps in explanations of hate speech, thus enabling effective supervision of detection models. Experiments on SBIC and Implicit Hate benchmarks show that our method, using model-generated data, consistently outperforms baselines, using existing free-text human annotations. Analysis demonstrates that our method enhances the explanation quality of trained models and improves generalization to unseen datasets. Our code is available at https://github.com/joonkeekim/hare-hate-speech.git.

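Summary
With the proliferation of social media, accurate detection of hate speech has become critical to online safety. To combat nuanced forms of hate speech, it is important to identify and thoroughly explain it so that users understand its harmful effects. Recent benchmarks have tackled this by training generative models on free-text annotations of the implications in hateful text. However, the existing annotation schemes contain significant reasoning gaps, which may hinder the supervision of detection models. The paper introduces HARE, a hate speech detection framework that harnesses the reasoning capabilities of large language models (LLMs) to fill these gaps in explanations, enabling effective supervision of detection models. Experiments on the SBIC and Implicit Hate benchmarks show that the method, using model-generated data, consistently outperforms baselines that use existing free-text human annotations. Analysis shows that it improves the explanation quality of trained models and their generalization to unseen datasets. The code is available at https://github.com/joonkeekim/hare-hate-speech.git.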

Data Augmentation for Code Translation with Comparable Corpora and Multiple References

paper_url: http://arxiv.org/abs/2311.00317
repo_url: https://github.com/Veronicium/CMTrans
paper_authors: Yiqing Xie, Atharva Naik, Daniel Fried, Carolyn Rose
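for: This paper studies code translation between programming languages, using data augmentation to address the scarcity of parallel training data.
methods: Two data augmentation techniques are proposed. The first builds comparable corpora (code pairs with similar functionality), including programs generated from natural language documentation with a code generation model. The second augments existing parallel data with multiple reference translations, filtered by unit tests, to increase variation in target translations.
results: The techniques improve CodeT5 on translation between Java, Python, and C++ by an average of 7.5% Computational Accuracy (CA@1), a metric that verifies the correctness of translations by execution. The code is available at https://github.com/Veronicium/CMTrans.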

Abstract
One major challenge of translating code between programming languages is that parallel training data is often limited. To overcome this challenge, we present two data augmentation techniques, one that builds comparable corpora (i.e., code pairs with similar functionality), and another that augments existing parallel data with multiple reference translations. Specifically, we build and analyze multiple types of comparable corpora, including programs generated from natural language documentation using a code generation model. Furthermore, to reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data and filter the translations by unit tests, which increases variation in target translations. Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java, Python, and C++ by an average of 7.5% Computational Accuracy (CA@1), which verifies the correctness of translations by execution. The code is available at https://github.com/Veronicium/CMTrans.

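Summary
A major challenge in translating code between programming languages is that parallel training data is often limited. To overcome it, the paper presents two data augmentation techniques: one builds comparable corpora (code pairs with similar functionality), and the other augments existing parallel data with multiple reference translations. Several types of comparable corpora are built and analysed, including programs generated from natural language documentation with a code generation model. To reduce overfitting to a single reference translation, additional translation references are generated automatically for the available parallel data and filtered by unit tests, which increases variation in target translations. Experiments show that the techniques improve CodeT5 on translation between Java, Python, and C++ by an average of 7.5% Computational Accuracy (CA@1), which verifies the correctness of translations by execution. The code is available at https://github.com/Veronicium/CMTrans.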
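Filtering generated reference translations by unit tests amounts to executing each candidate and keeping only those that pass. The sketch below does this for toy Python candidates; the `passes_tests` and `filter_by_unit_tests` helpers and the test-case format are assumptions for illustration, and a real pipeline would run untrusted code in a sandbox rather than with a bare `exec`.

```python
def passes_tests(source, func_name, test_cases):
    """Execute a candidate and check it against (args, expected) pairs."""
    namespace = {}
    try:
        exec(source, namespace)  # NOTE: unsafe for untrusted code; sandbox in practice
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        return False  # syntax errors, missing functions, and crashes all count as failures

def filter_by_unit_tests(candidates, func_name, test_cases):
    """Keep only the candidate translations that pass every unit test."""
    return [src for src in candidates if passes_tests(src, func_name, test_cases)]

candidates = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a - b",       # wrong translation
    "def add(a, b):\n    return sum([a, b])",  # correct, but phrased differently
]
tests = [((1, 2), 3), ((0, 0), 0)]
kept = filter_by_unit_tests(candidates, "add", tests)
```

The two surviving candidates are functionally equivalent but textually different, which is the variation in target translations that multiple references are meant to provide. The same pass/fail check on a model's top prediction is the idea behind an execution-based metric such as CA@1.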

Probing Explicit and Implicit Gender Bias through LLM Conditional Text Generation

paper_url: http://arxiv.org/abs/2311.00306
repo_url: None
paper_authors: Xiangjue Dong, Yibo Wang, Philip S. Yu, James Caverlee
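for: This paper probes explicit and implicit gender bias in large language models (LLMs) through a conditional text generation mechanism that needs no predefined gender phrases or stereotypes.
methods: Three types of inputs, generated through three distinct strategies, are used to probe the LLMs, and both explicit and implicit metrics are used to evaluate gender bias.
results: Increasing model size does not consistently improve fairness, and all tested LLMs exhibit explicit and/or implicit gender bias, even when explicit gender stereotypes are absent from the inputs.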

Abstract
Large Language Models (LLMs) can generate biased and toxic responses. Yet most prior work on LLM gender bias evaluation requires predefined gender-related phrases or gender stereotypes, which are challenging to be comprehensively collected and are limited to explicit bias evaluation. In addition, we believe that instances devoid of gender-related language or explicit stereotypes in inputs can still induce gender bias in LLMs. Thus, in this work, we propose a conditional text generation mechanism without the need for predefined gender phrases and stereotypes. This approach employs three types of inputs generated through three distinct strategies to probe LLMs, aiming to show evidence of explicit and implicit gender biases in LLMs. We also utilize explicit and implicit evaluation metrics to evaluate gender bias in LLMs under different strategies. Our experiments demonstrate that an increased model size does not consistently lead to enhanced fairness and all tested LLMs exhibit explicit and/or implicit gender bias, even when explicit gender stereotypes are absent in the inputs.


Detecting Syllable-Level Pronunciation Stress with A Self-Attention Model

paper_url: http://arxiv.org/abs/2311.00301
repo_url: https://github.com/wangweiying303/stress-detection-model
paper_authors: Wang Weiying, Nakajima Akinori
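for: This study develops a self-attention model that detects the stress level of each syllable in spoken English.
methods: Various prosodic and categorical features, including the pitch level, intensity, duration, and type of the syllable and its nucleus (the vowel of the syllable), are fed to the self-attention model, which predicts syllable-level stress.
results: The simplest model reaches accuracies of over 88% and 93% on different datasets, and more advanced models are more accurate. The models could be applied in scenarios such as online meetings and English learning.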

Abstract
One precondition of effective oral communication is that words should be pronounced clearly, especially for non-native speakers. Word stress is the key to clear and correct English, and misplacement of syllable stress may lead to misunderstandings. Thus, knowing the stress level is important for English speakers and learners. This paper presents a self-attention model to identify the stress level for each syllable of spoken English. Various prosodic and categorical features, including the pitch level, intensity, duration and type of the syllable and its nuclei (the vowel of the syllable), are explored. These features are input to the self-attention model, and syllable-level stresses are predicted. The simplest model yields an accuracy of over 88% and 93% on different datasets, while more advanced models provide higher accuracy. Our study suggests that the self-attention model can be promising in stress-level detection. These models could be applied to various scenarios, such as online meetings and English learning.

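Summary
Clear pronunciation is a precondition of effective oral communication, especially for non-native speakers. Word stress is key to clear English, and misplaced stress can cause misunderstandings, so knowing the stress level matters to English speakers and learners. The paper proposes a self-attention model that identifies the stress level of each syllable. Various prosodic and categorical features, including pitch level, intensity, duration, and vowel type, are explored and fed to the model, which predicts syllable-level stress. The simplest model exceeds 88% and 93% accuracy on different datasets, and more advanced models do better. The study suggests that self-attention models are promising for stress-level detection and could be applied in scenarios such as online meetings and English learning.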
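
As a rough illustration of how a self-attention layer lets each syllable's representation depend on the other syllables in the word, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is not the authors' architecture: it uses identity projections instead of learned query, key, and value matrices, and the feature values for the three syllables are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(features):
    """Scaled dot-product self-attention with identity projections.

    features: one feature vector per syllable.
    Returns (weights, outputs), where weights[i][j] is how much
    syllable i attends to syllable j.
    """
    d = len(features[0])
    weights = []
    for query in features:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in features]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[j] for w, value in zip(row, features)) for j in range(d)]
        for row in weights
    ]
    return weights, outputs

# Hypothetical normalized [pitch, intensity, duration] for three syllables.
syllables = [[0.2, 0.3, 0.2], [0.9, 0.8, 0.7], [0.3, 0.2, 0.4]]
weights, contextual = self_attention(syllables)
```

In a real model the contextual vectors would pass through a classifier that outputs a stress level per syllable; that part is omitted here.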

Entity Alignment Method of Science and Technology Patent based on Graph Convolution Network and Information Fusion

paper_url: http://arxiv.org/abs/2311.00300
repo_url: None
paper_authors: Runze Fang, Yawen Li, Yingxia Shao, Zeli Guan, Zhe Xue
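for: Improve the performance of entity alignment in science and technology patent knowledge graphs.
methods: Using a graph convolution network and the BERT model, the method fuses graph structure information with entity attribute information to improve alignment accuracy.
results: In experiments on three benchmark datasets, the method's Hit@K evaluation indicators are better than those of existing methods.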

Abstract
The entity alignment of science and technology patents aims to link the equivalent entities in the knowledge graphs of different science and technology patent data sources. Most entity alignment methods only use a graph neural network to obtain the embedding of the graph structure or use attribute text descriptions to obtain a semantic representation, ignoring the process of multi-information fusion in science and technology patents. In order to make use of the graph structure and auxiliary information such as the name, description and attributes of the patent entity, this paper proposes an entity alignment method based on the graph convolution network for science and technology patent information fusion. Through the graph convolution network and the BERT model, the structure information and entity attribute information of the science and technology patent knowledge graph are embedded and represented to achieve multi-information fusion, thus improving the performance of entity alignment. Experiments on three benchmark data sets show that the Hit@K evaluation indicators of the proposed method are better than those of the existing methods.
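
The abstract reports results as Hit@K. For readers unfamiliar with the metric, the sketch below assumes the usual definition from the entity-alignment literature: the share of source entities whose true counterpart appears among the top K ranked candidates. The entity identifiers are invented for illustration and do not come from the paper.

```python
def hits_at_k(ranked_candidates, gold, k):
    """Fraction of source entities whose gold match is in the top-k candidates.

    ranked_candidates: dict mapping a source entity to its candidate list,
                       best candidate first.
    gold: dict mapping each source entity to its true counterpart.
    """
    hits = sum(1 for source, candidates in ranked_candidates.items()
               if gold[source] in candidates[:k])
    return hits / len(ranked_candidates)

# Hypothetical rankings for three patent entities.
rankings = {
    "patent_a": ["p1", "p2", "p3"],
    "patent_b": ["p9", "p4", "p5"],
    "patent_c": ["p7", "p8", "p6"],
}
gold = {"patent_a": "p1", "patent_b": "p4", "patent_c": "p6"}

hits_1 = hits_at_k(rankings, gold, 1)
hits_3 = hits_at_k(rankings, gold, 3)
```

Here only `patent_a` is matched at rank 1, while all three gold matches fall within the top 3, so the score rises as K grows.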

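Summary
Entity alignment for science and technology patents aims to link equivalent entities across the knowledge graphs of different patent data sources. Most entity alignment methods use only a graph neural network to embed the graph structure, or only attribute text descriptions to obtain a semantic representation, and ignore multi-information fusion. To exploit both the graph structure and auxiliary information such as patent names, descriptions, and attributes, the paper proposes an entity alignment method based on a graph convolution network and the BERT model. The structure information and entity attribute information of the patent knowledge graph are embedded and fused, which improves alignment performance. In experiments on three benchmark datasets, the evaluation indicators are better than those of existing methods.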

Semantic Representation Learning of Scientific Literature based on Adaptive Feature and Graph Neural Network

paper_url: http://arxiv.org/abs/2311.00296
repo_url: None
paper_authors: Hongrui Gao, Yawen Li, Meiyu Liang, Zeli Guan, Zhe Xue
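for: This study proposes a semantic representation learning method for scientific literature based on adaptive features and a graph neural network, in order to strengthen the feature representation of scientific literature.
methods: The method first introduces an adaptive feature approach that considers the global and local features of scientific literature. It then uses a graph attention mechanism to assign feature weights, so that correlations between different documents are expressed better. It also proposes an unsupervised graph neural network method that compares the mutual information between the local semantic representations of documents and the global graph semantic representation, so that the network captures both local and global information.
results: Experiments show that the method is competitive on scientific literature classification and achieves good results.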

Abstract
Because most scientific literature data is unlabeled, semantic representation learning based on unsupervised graphs becomes crucial. At the same time, in order to enrich the features of scientific literature, a learning method of semantic representation of scientific literature based on adaptive features and a graph neural network is proposed. By introducing the adaptive feature method, the features of scientific literature are considered globally and locally. The graph attention mechanism is used to sum the features of scientific literature with citation relationships, and to give each document different feature weights, so as to better express the correlation between the features of different scientific literature. In addition, an unsupervised graph neural network semantic representation learning method is proposed. By comparing the mutual information between the positive and negative local semantic representations of scientific literature and the global graph semantic representation in the latent space, the graph neural network can capture the local and global information, thus improving the learning ability of the semantic representation of scientific literature. The experimental results show that the proposed method is competitive on scientific literature classification and achieves good results.

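Summary
Because most scientific literature data is unlabeled, unsupervised graph-based semantic representation learning becomes essential. To enrich the features of scientific literature, the paper proposes a semantic representation learning method based on adaptive features and a graph neural network. The adaptive feature method considers features at both the global and the local level. A graph attention mechanism sums the features of documents and assigns each document a different feature weight, to better express the correlations between documents. The paper also proposes an unsupervised graph neural network method that compares the mutual information between positive and negative local semantic representations and the global graph semantic representation in the latent space, so that the network captures local and global information. Experiments show the method is competitive on scientific literature classification and achieves good results.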

IBADR: an Iterative Bias-Aware Dataset Refinement Framework for Debiasing NLU models

paper_url: http://arxiv.org/abs/2311.00292
repo_url: None
paper_authors: Xiaoyue Wang, Xin Liu, Lijie Wang, Yaoxiang Wang, Jinsong Su, Hua Wu
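for: This paper proposes IBADR, an Iterative Bias-Aware Dataset Refinement framework, to debias natural language understanding (NLU) models without predefining biased features.
methods: At each iteration, a shallow model is trained to quantify the bias degree of each sample in the pool. Each sample is then paired with a bias indicator, and these extended samples are used to train a sample generator that learns the correspondence between bias indicators and samples. Finally, the generator produces pseudo samples with fewer biased features, which are added to the sample pool.
results: Experiments and in-depth analyses show that IBADR significantly outperforms existing dataset refinement approaches, achieving state-of-the-art performance, and is compatible with model-centric methods.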

Abstract
As commonly-used methods for debiasing natural language understanding (NLU) models, dataset refinement approaches heavily rely on manual data analysis, and thus may be unable to cover all the potential biased features. In this paper, we propose IBADR, an Iterative Bias-Aware Dataset Refinement framework, which debiases NLU models without predefining biased features. We maintain an iteratively expanded sample pool. Specifically, at each iteration, we first train a shallow model to quantify the bias degree of samples in the pool. Then, we pair each sample with a bias indicator representing its bias degree, and use these extended samples to train a sample generator. In this way, this generator can effectively learn the correspondence relationship between bias indicators and samples. Furthermore, we employ the generator to produce pseudo samples with fewer biased features by feeding specific bias indicators. Finally, we incorporate the generated pseudo samples into the pool. Experimental results and in-depth analyses on two NLU tasks show that IBADR not only significantly outperforms existing dataset refinement approaches, achieving SOTA, but also is compatible with model-centric methods.

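Summary
Dataset refinement approaches, commonly used to debias natural language understanding (NLU) models, rely on manual data analysis and may therefore miss some biased features. The paper proposes IBADR, an Iterative Bias-Aware Dataset Refinement framework that debiases NLU models without predefining biased features. It maintains an iteratively expanded sample pool. At each iteration, a shallow model first quantifies the bias degree of every sample in the pool. Each sample is then assigned a bias indicator representing its bias degree, and the extended samples are used to train a sample generator, which learns the correspondence between samples and bias indicators. The generator then produces pseudo samples with fewer biased features, and these are added to the pool. Experiments and in-depth analyses on NLU tasks show that IBADR outperforms existing dataset refinement approaches and is compatible with model-centric methods.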
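
The iteration described above can be read as a four-step loop. The sketch below shows only that control flow, under loud assumptions: every component is a toy stand-in invented here (the "shallow model" is the most frequent word in the pool, and the "generator" simply replays a stored sample with the requested indicator), and the function and parameter names are hypothetical rather than taken from the paper.

```python
from collections import Counter

def refine_pool(pool, train_shallow, bias_degree, to_indicator,
                train_generator, target_indicator, n_iters, n_new):
    """Iteratively expand a copy of `pool` with generated pseudo samples."""
    pool = list(pool)
    for _ in range(n_iters):
        shallow = train_shallow(pool)                      # 1. fit a shallow model
        tagged = [(to_indicator(bias_degree(shallow, s)), s)
                  for s in pool]                           # 2. attach bias indicators
        generator = train_generator(tagged)                # 3. fit the generator
        pool += [generator(target_indicator)
                 for _ in range(n_new)]                    # 4. add pseudo samples
    return pool

# --- Toy stand-ins (assumptions, for illustration only) ---
def train_shallow(pool):
    # "Model" = the single most frequent word, treated as the spurious cue.
    return Counter(w for s in pool for w in s.split()).most_common(1)[0][0]

def bias_degree(cue, sample):
    return 1.0 if cue in sample.split() else 0.0

def to_indicator(degree):
    return "high" if degree >= 0.5 else "low"

def train_generator(tagged):
    by_indicator = {}
    for indicator, sample in tagged:
        by_indicator.setdefault(indicator, []).append(sample)
    # "Generation" here just replays a stored sample with that indicator.
    return lambda indicator: by_indicator[indicator][0]

pool = ["not good movie", "not bad plot", "great acting"]
refined = refine_pool(pool, train_shallow, bias_degree, to_indicator,
                      train_generator, "low", n_iters=2, n_new=1)
```

A real implementation would replace each stand-in with a trained model; the point of the sketch is only that the pool grows each round with samples requested at a low bias indicator.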

SoulChat: Improving LLMs’ Empathy, Listening, and Comfort Abilities through Fine-tuning with Multi-turn Empathy Conversations

paper_url: http://arxiv.org/abs/2311.00273
repo_url: https://github.com/scutcyr/soulchat
paper_authors: Yirong Chen, Xiaofen Xing, Jingkai Lin, Huimin Zheng, Zhenyu Wang, Qi Liu, Xiangmin Xu
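for: This paper aims to improve the empathy of large language models (LLMs) in psychological counseling.
methods: The authors construct a multi-turn empathetic conversation dataset of more than 2 million samples, where the input is the multi-turn conversation context and the target is an empathetic response, and fine-tune LLMs on it.
results: Experiments show that fine-tuning with multi-turn dialogue history and responses closer to how a psychological consultant speaks significantly improves the empathy of LLMs.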

Abstract
Large language models (LLMs) have been widely applied in various fields due to their excellent capability for memorizing knowledge and chain of thought (CoT). When these language models are applied in the field of psychological counseling, they often rush to provide universal advice. However, when users seek psychological support, they need to gain empathy, trust, understanding and comfort, rather than just reasonable advice. To this end, we constructed a multi-turn empathetic conversation dataset of more than 2 million samples, in which the input is the multi-turn conversation context, and the target is empathetic responses that cover expressions such as questioning, comfort, recognition, listening, trust, emotional support, etc. Experiments have shown that the empathy ability of LLMs can be significantly enhanced when finetuning by using multi-turn dialogue history and responses that are closer to the expression of a psychological consultant.


Syntactic Inductive Bias in Transformer Language Models: Especially Helpful for Low-Resource Languages?

paper_url: http://arxiv.org/abs/2311.00268
repo_url: https://github.com/lgessler/lr-sib
paper_authors: Luke Gessler, Nathan Schneider
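for: Tests whether syntactic inductive bias methods for Transformer-based language models such as BERT can compensate for data sparseness in low-resource languages.
methods: Pretrains models with syntactic inductive bias methods on low-resource languages.
results: Across five low-resource languages (Uyghur, Wolof, Maltese, Coptic, and Ancient Greek), these methods produce uneven results and provide surprisingly little benefit in most cases.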

Abstract
A line of work on Transformer-based language models such as BERT has attempted to use syntactic inductive bias to enhance the pretraining process, on the theory that building syntactic structure into the training process should reduce the amount of data needed for training. But such methods are often tested for high-resource languages such as English. In this work, we investigate whether these methods can compensate for data sparseness in low-resource languages, hypothesizing that they ought to be more effective for low-resource languages. We experiment with five low-resource languages: Uyghur, Wolof, Maltese, Coptic, and Ancient Greek. We find that these syntactic inductive bias methods produce uneven results in low-resource settings, and provide surprisingly little benefit in most cases.

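Summary
A line of work on Transformer-based language models such as BERT has tried to use syntactic inductive bias to enhance pretraining, on the theory that building syntactic structure into training reduces the amount of data needed. These methods are usually tested on high-resource languages such as English. This work investigates whether they can compensate for data sparseness in low-resource languages, hypothesizing that they should be more effective there. Experiments cover five low-resource languages: Uyghur, Wolof, Maltese, Coptic, and Ancient Greek. The syntactic inductive bias methods give uneven results in low-resource settings and provide surprisingly little benefit in most cases.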

Noisy Exemplars Make Large Language Models More Robust: A Domain-Agnostic Behavioral Analysis

paper_url: http://arxiv.org/abs/2311.00258
repo_url: https://github.com/hiroki39/noisy-exemplars-make-large-language-models-more-robust
paper_authors: Hongyi Zheng, Abulhair Saparov
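for: Studies the robustness of large language models (LLMs) with few-shot prompting on multi-hop logical reasoning problems, which little existing work has investigated.
methods: Proposes a systematic approach that tests LLM robustness with domain-agnostic perturbations at multiple levels of abstraction (lexical perturbations such as typos, and semantic perturbations such as including intermediate reasoning steps in the questions), and varies the proportion of perturbed exemplars in the prompts.
results: Models are more sensitive to certain perturbations, such as replacing words with their synonyms, and increasing the proportion of perturbed exemplars in the prompts improves the robustness of few-shot prompting methods.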

Abstract
Recent advances in prompt engineering enable large language models (LLMs) to solve multi-hop logical reasoning problems with impressive accuracy. However, there is little existing work investigating the robustness of LLMs with few-shot prompting techniques. Therefore, we introduce a systematic approach to test the robustness of LLMs in multi-hop reasoning tasks via domain-agnostic perturbations. We include perturbations at multiple levels of abstractions (e.g. lexical perturbations such as typos, and semantic perturbations such as the inclusion of intermediate reasoning steps in the questions) to conduct behavioral analysis on the LLMs. Throughout our experiments, we find that models are more sensitive to certain perturbations such as replacing words with their synonyms. We also demonstrate that increasing the proportion of perturbed exemplars in the prompts improves the robustness of few-shot prompting methods.

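Summary
Recent advances in prompt engineering let large language models (LLMs) perform well on multi-hop logical reasoning tasks. However, little prior work has investigated the robustness of LLMs under few-shot prompting. The paper proposes a systematic approach that tests LLM robustness in multi-hop reasoning via domain-agnostic perturbations. Perturbations are added at multiple levels (for example, lexical perturbations such as typos, and semantic perturbations such as including intermediate reasoning steps in the questions) to support a behavioral analysis. The experiments find that models are more sensitive to replacing words with synonyms, and that increasing the proportion of perturbed exemplars in the prompts improves the robustness of few-shot prompting methods.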
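
To make the two ingredients concrete (a lexical perturbation and a controllable share of perturbed exemplars), here is a minimal sketch. The adjacent-character swap is just one simple kind of typo, and the function names, the `perturbed_fraction` parameter, and the example questions are all assumptions for illustration, not the paper's code.

```python
import random

def add_typo(text, rng):
    """Swap two adjacent characters at a random position (a simple typo)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def build_prompt(exemplars, question, perturbed_fraction, seed=0):
    """Build a few-shot prompt in which a given share of exemplars is perturbed."""
    rng = random.Random(seed)
    n_perturbed = round(perturbed_fraction * len(exemplars))
    perturbed_ids = set(rng.sample(range(len(exemplars)), n_perturbed))
    blocks = []
    for idx, (q, a) in enumerate(exemplars):
        if idx in perturbed_ids:
            q = add_typo(q, rng)  # only the question text is perturbed
        blocks.append(f"Q: {q}\nA: {a}")
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

# Hypothetical exemplars for a toy reasoning task.
exemplars = [
    ("Rex is a dog. Dogs are mammals. Is Rex a mammal?", "Yes"),
    ("Tweety is a bird. Birds are not fish. Is Tweety a fish?", "No"),
    ("Max is a cat. Cats are animals. Is Max an animal?", "Yes"),
    ("Sam is a frog. Frogs are not birds. Is Sam a bird?", "No"),
]
clean_prompt = build_prompt(exemplars, "Is Rex an animal?", 0.0)
noisy_prompt = build_prompt(exemplars, "Is Rex an animal?", 0.5)
```

Sweeping `perturbed_fraction` from 0 to 1 and measuring accuracy on perturbed test questions would mirror the kind of comparison the abstract describes.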

The Mystery and Fascination of LLMs: A Comprehensive Survey on the Interpretation and Analysis of Emergent Abilities

paper_url: http://arxiv.org/abs/2311.00237
repo_url: None
paper_authors: Yuxiang Zhou, Jiazheng Li, Yanzheng Xiang, Hanqi Yan, Lin Gui, Yulan He
for: This paper provides a thorough survey on the interpretation and analysis of emergent abilities of large language models (LLMs).
methods: The survey reviews prior work from two perspectives: a macro perspective, covering studies on mechanistic interpretability and the mathematical foundations behind emergent abilities, and a micro perspective, covering studies on empirical interpretability that examine factors associated with these abilities.
results: The paper highlights the challenges encountered in interpreting emergent abilities in LLMs and suggests potential avenues for future research.

Abstract
Understanding emergent abilities, such as in-context learning (ICL) and chain-of-thought (CoT) prompting in large language models (LLMs), is of utmost importance. This importance stems not only from the better utilization of these capabilities across various tasks, but also from the proactive identification and mitigation of potential risks, including concerns of truthfulness, bias, and toxicity, that may arise alongside these capabilities. In this paper, we present a thorough survey on the interpretation and analysis of emergent abilities of LLMs. First, we provide a concise introduction to the background and definition of emergent abilities. Then, we give an overview of advancements from two perspectives: 1) a macro perspective, emphasizing studies on the mechanistic interpretability and delving into the mathematical foundations behind emergent abilities; and 2) a micro-perspective, concerning studies that focus on empirical interpretability by examining factors associated with these abilities. We conclude by highlighting the challenges encountered and suggesting potential avenues for future research. We believe that our work establishes the basis for further exploration into the interpretation of emergent abilities.


Distort, Distract, Decode: Instruction-Tuned Model Can Refine its Response from Noisy Instructions

paper_url: http://arxiv.org/abs/2311.00233
repo_url: None
paper_authors: Taehyeon Kim, Joonkee Kim, Gihun Lee, Se-Young Yun
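for: This paper aims to improve how well instruction-tuned models generalize to instructions that fall outside their training set.
methods: The paper proposes Instructive Decoding (ID), a simple yet effective method that adjusts the logits for next-token prediction in a contrastive manner, using predictions generated from a manipulated version of the original instruction (a noisy instruction).
results: The method yields considerable performance gains across various instruction-tuned models and tasks without any additional parameter updates. Using 'opposite' as the noisy instruction consistently produces the largest gains.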

Abstract
While instruction-tuned language models have demonstrated impressive zero-shot generalization, these models often struggle to generate accurate responses when faced with instructions that fall outside their training set. This paper presents Instructive Decoding (ID), a simple yet effective approach that augments the efficacy of instruction-tuned models. Specifically, ID adjusts the logits for next-token prediction in a contrastive manner, utilizing predictions generated from a manipulated version of the original instruction, referred to as a noisy instruction. This noisy instruction aims to elicit responses that could diverge from the intended instruction yet remain plausible. We conduct experiments across a spectrum of such noisy instructions, ranging from those that insert semantic noise via random words to others like 'opposite' that elicit the deviated responses. Our approach achieves considerable performance gains across various instruction-tuned models and tasks without necessitating any additional parameter updates. Notably, utilizing 'opposite' as the noisy instruction in ID, which exhibits the maximum divergence from the original instruction, consistently produces the most significant performance gains across multiple models and tasks.

Summary
Instruction-tuned language models can give inaccurate responses when they face instructions outside their training set. The paper proposes Instructive Decoding (ID), a simple yet effective method for improving such models. ID adjusts the logits for next-token prediction using predictions generated from a modified version of the original instruction, called a noisy instruction. Experiments cover a range of noisy instructions, from those that insert semantic noise via random words to the 'opposite' type. Using 'opposite' as the noisy instruction yields the largest performance gains, and the method requires no additional parameter updates.
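
The abstract says only that the logits are adjusted "in a contrastive manner". One plausible reading, assumed here, is to subtract a scaled copy of the logits obtained under the noisy instruction from the logits obtained under the original instruction. The sketch below illustrates that reading with made-up numbers; the `epsilon` weight and the three-token vocabulary are hypothetical.

```python
def contrastive_logits(base_logits, noisy_logits, epsilon=0.3):
    """Down-weight tokens that the noisy instruction also favors.

    base_logits:  next-token logits under the original instruction.
    noisy_logits: next-token logits under the noisy instruction.
    epsilon:      hypothetical strength of the contrastive adjustment.
    """
    return [b - epsilon * n for b, n in zip(base_logits, noisy_logits)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

vocab = ["positive", "negative", "neutral"]
base = [2.0, 2.1, 0.5]    # original instruction: narrowly prefers "negative"
noisy = [0.5, 3.0, 0.4]   # e.g. an 'opposite' instruction: strongly prefers "negative"

adjusted = contrastive_logits(base, noisy, epsilon=0.3)
choice_before = vocab[argmax(base)]
choice_after = vocab[argmax(adjusted)]
```

In this toy case the token favored mainly by the noisy instruction loses its narrow lead after the adjustment, and no model parameters are touched, which matches the abstract's claim that no parameter updates are needed.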

Is GPT Powerful Enough to Analyze the Emotions of Memes?

paper_url: http://arxiv.org/abs/2311.00223
repo_url: None
paper_authors: Jingjing Wang, Joshua Luo, Grace Yang, Allen Hong, Feng Luo
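for: This study explores the capability of GPT-3.5 in the sentiment analysis of Internet memes.
methods: GPT-3.5 is applied to three tasks: classifying meme sentiment, determining humor type, and detecting implicit hate in memes. Its responses are compared against human annotations on datasets from SemEval-2020 Task 8 and Facebook hateful memes.
results: The study reveals both strengths and limitations of GPT-3.5 on these subjective tasks. The limitations include contextual understanding (societal norms and cultural contexts), interpretation of implicit meanings, and data biases.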

Abstract
Large Language Models (LLMs), representing a significant achievement in artificial intelligence (AI) research, have demonstrated their ability in a multitude of tasks. This project aims to explore the capabilities of GPT-3.5, a leading example of LLMs, in processing the sentiment analysis of Internet memes. Memes, which include both verbal and visual aspects, act as a powerful yet complex tool for expressing ideas and sentiments, demanding an understanding of societal norms and cultural contexts. Notably, the detection and moderation of hateful memes pose a significant challenge due to their implicit offensive nature. This project investigates GPT's proficiency in such subjective tasks, revealing its strengths and potential limitations. The tasks include the classification of meme sentiment, determination of humor type, and detection of implicit hate in memes. The performance evaluation, using datasets from SemEval-2020 Task 8 and Facebook hateful memes, offers a comparative understanding of GPT responses against human annotations. Despite GPT's remarkable progress, our findings underscore the challenges faced by these models in handling subjective tasks, which are rooted in their inherent limitations including contextual understanding, interpretation of implicit meanings, and data biases. This research contributes to the broader discourse on the applicability of AI in handling complex, context-dependent tasks, and offers valuable insights for future advancements.

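Summary
Large language models (LLMs), a significant achievement of artificial intelligence (AI) research, perform well on many tasks. This project explores the ability of GPT-3.5, a leading LLM, to analyze the sentiment of Internet memes. Memes combine verbal and visual aspects and require an understanding of societal norms and cultural contexts. Detecting and moderating hateful memes is especially difficult because their offensiveness is often implicit. The project examines GPT's performance on such subjective tasks, revealing its strengths and potential limitations. The tasks are meme sentiment classification, humor type determination, and detection of implicit hate. Performance is evaluated on datasets from SemEval-2020 Task 8 and Facebook hateful memes, comparing GPT responses with human annotations. Despite GPT's remarkable progress, the findings show that these models struggle with subjective tasks because of inherent limitations, including contextual understanding, interpretation of implicit meanings, and data biases. The research contributes to the discussion of applying AI to complex, context-dependent tasks and offers insights for future work.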

Transformers as Recognizers of Formal Languages: A Survey on Expressivity

paper_url: http://arxiv.org/abs/2311.00208
repo_url: None
paper_authors: Lena Strobl, William Merrill, Gail Weiss, David Chiang, Dana Angluin
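for: This paper surveys theoretical work on what problems transformers can and cannot solve, treating problems as formal languages, which helps compare transformers with other models and transformer variants with one another.
methods: The survey reviews theoretical results on transformer expressivity and documents the diverse assumptions that underlie them.
results: The paper offers a comprehensive survey of this subarea and a unified framework for harmonizing seemingly contradictory findings.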

Abstract
As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal languages. Exploring questions such as this will help to compare transformers with other models, and transformer variants with one another, for various tasks. Work in this subarea has made considerable progress in recent years. Here, we undertake a comprehensive survey of this work, documenting the diverse assumptions that underlie different results and providing a unified framework for harmonizing seemingly contradictory findings.

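Summary
As transformers have gained prominence in natural language processing, some researchers have studied which problems these models can solve by treating problems as formal languages. Exploring such questions helps compare transformers with other models, and transformer variants with one another, across tasks. Work in this subarea has made considerable progress in recent years. The paper gives a comprehensive survey of that work, documents the differing assumptions behind different results, and provides a unified framework for reconciling seemingly contradictory findings.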
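
To make "treating problems as formal languages" concrete: a problem is recast as deciding whether an input string belongs to a language, and the question becomes whether a given model class can act as a recognizer for it. The two languages below, PARITY and Dyck-1, are chosen here purely for illustration and are not named in the abstract; the sketch gives ordinary reference recognizers, not transformer constructions.

```python
def in_parity(bits):
    """PARITY (convention assumed here): binary strings with an odd number of 1s."""
    if any(ch not in "01" for ch in bits):
        return False
    return bits.count("1") % 2 == 1

def in_dyck1(s):
    """Dyck-1: strings of balanced, properly nested parentheses."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a closing bracket with nothing to close
                return False
        else:
            return False       # symbol outside the alphabet
    return depth == 0

examples = {
    "0110": in_parity("0110"),
    "0111": in_parity("0111"),
    "(()())": in_dyck1("(()())"),
    "())(": in_dyck1("())("),
}
```

An expressivity result then takes the form "model class X can (or cannot) recognize language L under assumptions A", and the survey's point is that those assumptions differ from paper to paper.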
