cs.CL - 2023-10-17

BasahaCorpus: An Expanded Linguistic Resource for Readability Assessment in Central Philippine Languages

  • paper_url: http://arxiv.org/abs/2310.11584
  • repo_url: https://github.com/imperialite/basahacorpus-hierarchicalcrosslingualara
  • paper_authors: Joseph Marvin Imperial, Ekaterina Kochmar
  • for: This paper aims to improve the performance of automatic readability assessment (ARA) models for lower-resource languages in the Philippines.
  • methods: The paper uses a corpus of short fictional narratives written in Hiligaynon, Minasbate, Karay-a, and Rinconada to train ARA models using surface-level, syllable-pattern, and n-gram overlap features (a feature-extraction sketch follows the abstract). The paper also proposes a new hierarchical cross-lingual modeling approach that takes advantage of a language's placement in the family tree to increase the amount of available training data.
  • results: The study yields encouraging results that support previous work showcasing the efficacy of cross-lingual models in low-resource settings, as well as similarities in highly informative linguistic features for mutually intelligible languages.
    Abstract Current research on automatic readability assessment (ARA) has focused on improving the performance of models in high-resource languages such as English. In this work, we introduce and release BasahaCorpus as part of an initiative aimed at expanding available corpora and baseline models for readability assessment in lower resource languages in the Philippines. We compiled a corpus of short fictional narratives written in Hiligaynon, Minasbate, Karay-a, and Rinconada -- languages belonging to the Central Philippine family tree subgroup -- to train ARA models using surface-level, syllable-pattern, and n-gram overlap features. We also propose a new hierarchical cross-lingual modeling approach that takes advantage of a language's placement in the family tree to increase the amount of available training data. Our study yields encouraging results that support previous work showcasing the efficacy of cross-lingual models in low-resource settings, as well as similarities in highly informative linguistic features for mutually intelligible languages.
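The three feature families above are cheap to compute. A minimal sketch of what such extractors could look like (the vowel-cluster syllable heuristic and the feature names are our illustration, not the paper's exact implementation, which lives in the linked repo):

```python
import re

def syllable_count(word):
    """Vowel-cluster count as a rough syllable proxy."""
    return max(1, len(re.findall(r"[aeiou]+", word.lower())))

def extract_features(text, reference_vocab):
    words = re.findall(r"[a-zA-Z']+", text.lower())
    n = len(words) or 1
    return {
        # surface-level features
        "sentence_length": len(words),
        "avg_word_length": sum(map(len, words)) / n,
        # syllable-pattern features
        "avg_syllables": sum(syllable_count(w) for w in words) / n,
        # n-gram (here: unigram) overlap with a sibling language's vocabulary
        "vocab_overlap": sum(w in reference_vocab for w in words) / n,
    }

print(extract_features("Nagbasa ang bata sang libro.", {"ang", "bata", "libro"}))
```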

What is a good question? Task-oriented asking with fact-level masking

  • paper_url: http://arxiv.org/abs/2310.11571
  • repo_url: None
  • paper_authors: Matthew Toles, Yukun Huang, Zhou Yu, Luis Gravano
  • for: This paper addresses collaboration in reasoning tasks such as question answering: how a system can ask users or third parties for the information it needs.
  • methods: The paper proposes a definition and framework for natural-language task-oriented asking (TOA), plus fact-level masking (FLM), a method for automatically generating self-supervised TOA datasets by omitting critical facts (a sketch follows the abstract).
  • results: Experiments on an FLM-derived HotpotQA dataset show that current zero-shot language models struggle to ask questions that retrieve useful information, compared with human annotators, suggesting that FLM datasets and the TOA framework can be used to train and evaluate better TOA models.
    Abstract Asking questions is an important element of real-life collaboration on reasoning tasks like question answering. For example, a legal assistant chatbot may be unable to make accurate recommendations without specific information on the user's circumstances. However, large language models are usually deployed to solve reasoning tasks directly without asking follow-up questions to the user or third parties. We term this problem task-oriented asking (TOA). Zero-shot chat models can perform TOA, but their training is primarily based on next-token prediction rather than whether questions contribute to successful collaboration. To enable the training and evaluation of TOA models, we present a definition and framework for natural language task-oriented asking, the problem of generating questions that result in answers useful for a reasoning task. We also present fact-level masking (FLM), a procedure for converting natural language datasets into self-supervised TOA datasets by omitting particular critical facts. Finally, we generate a TOA dataset from the HotpotQA dataset using FLM and evaluate several zero-shot language models on it. Our experiments show that current zero-shot models struggle to ask questions that retrieve useful information, as compared to human annotators. These results demonstrate an opportunity to use FLM datasets and the TOA framework to train and evaluate better TOA models.
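A minimal sketch of the FLM idea under stated assumptions: each supporting sentence is treated as one "fact", and one fact is omitted so that answering now requires asking for it. Field names loosely follow HotpotQA's format; the selection strategy here is hypothetical:

```python
def fact_level_mask(example, fact_index):
    """Return (masked_context, held_out_fact) for one QA example."""
    facts = example["supporting_facts"]            # list of fact sentences
    held_out = facts[fact_index]                   # the masked critical fact
    masked_context = [f for i, f in enumerate(facts) if i != fact_index]
    return masked_context, held_out

example = {
    "question": "Which band released the album produced by X?",
    "supporting_facts": [
        "X produced the album Y.",
        "The album Y was released by band Z.",
    ],
}
context, gold_fact = fact_level_mask(example, 0)
# A TOA model sees `context` and must ask a question whose answer is `gold_fact`.
```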

Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging

  • paper_url: http://arxiv.org/abs/2310.11564
  • repo_url: https://github.com/joeljang/rlphf
  • paper_authors: Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, Prithviraj Ammanabrolu
  • for: This work studies Reinforcement Learning from Personalized Human Feedback (RLPHF), aligning large language models (LLMs) with multiple individual preferences as a Multi-Objective Reinforcement Learning (MORL) problem.
  • methods: Individual preferences are decomposed into multiple dimensions, each trained independently and efficiently in a distributed manner; after training, the resulting parameters are combined post-hoc through parameter merging (see the sketch below).
  • results: Compared to strong single-objective baselines, the method achieves personalized preference alignment, with experiments showing that RLPHF adapts effectively to multiple individual preferences.
    Abstract While Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) with general, aggregate human preferences, it is suboptimal for learning diverse, individual perspectives. In this work, we study Reinforcement Learning from Personalized Human Feedback (RLPHF) problem, wherein LLMs are aligned to multiple (sometimes conflicting) preferences by modeling alignment as a Multi-Objective Reinforcement Learning (MORL) problem. Compared to strong single-objective baselines, we show that we can achieve personalized alignment by decomposing preferences into multiple dimensions. These dimensions are defined based on personalizations that are declared as desirable by the user. In this work, we show that they can be efficiently trained independently in a distributed manner and combined effectively post-hoc through parameter merging. The code is available at https://github.com/joeljang/RLPHF.
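The post-hoc merging step can be pictured as a weighted average of per-preference model weights, in the spirit of model soups. A minimal PyTorch sketch (the paper's exact merging scheme may differ):

```python
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Average several models' parameters; weights select the preference mix."""
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# e.g. blend two preference-specific policies 70/30 for one particular user
# merged = merge_state_dicts([concise_sd, friendly_sd], weights=[0.7, 0.3])
```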

Multi-stage Large Language Model Correction for Speech Recognition

  • paper_url: http://arxiv.org/abs/2310.11532
  • repo_url: None
  • paper_authors: Jie Pu, Thai-Son Nguyen, Sebastian Stüker
  • for: Improving the performance of competitive speech recognition systems.
  • methods: Using large language models (LLMs) in a multi-stage scheme to correct the output of a speech recognition system (sketched below).
  • results: Experiments show a 10%~20% relative WER improvement over a competitive ASR system across multiple test domains.
    Abstract In this paper, we investigate the usage of large language models (LLMs) to improve the performance of competitive speech recognition systems. Different from traditional language models that focus on one single data domain, the rise of LLMs brings us the opportunity to push the limit of state-of-the-art ASR performance, and at the same time to achieve higher robustness and generalize effectively across multiple domains. Motivated by this, we propose a novel multi-stage approach to combine traditional language model re-scoring and LLM prompting. Specifically, the proposed method has two stages: the first stage uses a language model to re-score an N-best list of ASR hypotheses and run a confidence check; The second stage uses prompts to a LLM to perform ASR error correction on less confident results from the first stage. Our experimental results demonstrate the effectiveness of the proposed method by showing a 10% ~ 20% relative improvement in WER over a competitive ASR system -- across multiple test domains.
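A hedged sketch of the two-stage flow described above; `lm_score`, `llm_correct`, and the confidence threshold are placeholders for whichever rescoring LM, prompted LLM, and operating point a concrete system would use:

```python
def correct(nbest, lm_score, llm_correct, threshold=0.5):
    # Stage 1: rescore the N-best ASR hypotheses and run a confidence check.
    scored = sorted(((lm_score(h), h) for h in nbest), reverse=True)
    best_score, best_hyp = scored[0]
    if best_score >= threshold:
        return best_hyp                      # confident: keep the rescored result
    # Stage 2: prompt an LLM to repair the less confident result, showing it
    # the top alternatives as context.
    prompt = ("Fix the ASR errors in these candidate transcripts and output "
              "the corrected sentence:\n" + "\n".join(h for _, h in scored[:5]))
    return llm_correct(prompt)
```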

Automatic News Summerization

  • paper_url: http://arxiv.org/abs/2310.11520
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Kavach Dheer, Arpit Dhankhar
  • for: This research paper compares extractive and abstractive approaches to news text summarization.
  • methods: The study uses the CNN-Daily Mail dataset of news articles with human-generated reference summaries, and evaluates generated summaries with ROUGE scores (see the sketch below).
  • results: After evaluation, the researchers selected the best-performing models and integrated them into a web application to assess their real-world performance and user experience.
    Abstract Natural Language Processing is booming with its applications in the real world, one of which is Text Summarization for large texts including news articles. This research paper provides an extensive comparative evaluation of extractive and abstractive approaches for news text summarization, with an emphasis on the ROUGE score analysis. The study employs the CNN-Daily Mail dataset, which consists of news articles and human-generated reference summaries. The evaluation employs ROUGE scores to assess the efficacy and quality of generated summaries. After evaluation, we integrate the best-performing models into a web application to assess their real-world capabilities and user experience.
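ROUGE scoring of a candidate summary against a reference can be reproduced with Google's `rouge-score` package (one common implementation; the paper does not say which it used):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The council approved the new budget on Tuesday."
candidate = "The new budget was approved by the council."
scores = scorer.score(reference, candidate)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```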

VeRA: Vector-based Random Matrix Adaptation

  • paper_url: http://arxiv.org/abs/2310.11454
  • repo_url: None
  • paper_authors: Dawid Jan Kopiczko, Tijmen Blankevoort, Yuki Markus Asano
  • For: Reducing the number of trainable parameters when fine-tuning large language models, and supporting many per-user or per-task adapted models.
  • Methods: A single pair of low-rank matrices shared across all layers, combined with small learned scaling vectors, cuts trainable parameters by 10x compared to LoRA while maintaining performance (see the sketch below).
  • Results: Matches LoRA on the GLUE and E2E benchmarks, and handles instruction-following with the Llama2 7B model using only 1.4M trainable parameters.
    Abstract Low-rank adaptation (LoRA) is a popular method that reduces the number of trainable parameters when finetuning large language models, but still faces acute storage challenges when scaling to even larger models or deploying numerous per-user or per-task adapted models. In this work, we present Vector-based Random Matrix Adaptation (VeRA), which reduces the number of trainable parameters by 10x compared to LoRA, yet maintains the same performance. It achieves this by using a single pair of low-rank matrices shared across all layers and learning small scaling vectors instead. We demonstrate its effectiveness on the GLUE and E2E benchmarks, and show its application in instruction-following with just 1.4M parameters using the Llama2 7B model.
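A minimal VeRA-style layer: the low-rank matrices A and B are random, frozen, and shared across layers, and only the per-layer scaling vectors d and b are trained. A PyTorch sketch (initialization details simplified relative to the paper):

```python
import torch
import torch.nn as nn

class VeRALinear(nn.Module):
    def __init__(self, base: nn.Linear, A: torch.Tensor, B: torch.Tensor):
        super().__init__()
        self.base = base                      # pretrained layer, kept frozen
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.register_buffer("A", A)          # (r, in): shared, frozen, random
        self.register_buffer("B", B)          # (out, r): shared, frozen, random
        self.d = nn.Parameter(torch.ones(A.shape[0]))   # per-layer, trained
        self.b = nn.Parameter(torch.zeros(B.shape[0]))  # per-layer, trained

    def forward(self, x):
        delta = (x @ self.A.T) * self.d       # d-scaled low-rank projection
        delta = (delta @ self.B.T) * self.b   # b-scaled expansion back to out
        return self.base(x) + delta
```

Because A and B are fixed and shared, only d and b (two small vectors per layer) need to be stored per adapted model, which is where the 10x parameter saving over LoRA comes from.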

BitNet: Scaling 1-bit Transformers for Large Language Models

  • paper_url: http://arxiv.org/abs/2310.11453
  • repo_url: https://github.com/kyegomez/BitNet
  • paper_authors: Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei
  • for: This work develops a scalable and stable 1-bit Transformer architecture for efficient, sustainable large language models.
  • methods: The study trains 1-bit weights from scratch with BitLinear, proposed as a drop-in replacement for the nn.Linear layer (sketched below).
  • results: Experiments show that BitNet achieves competitive language-modeling performance against state-of-the-art 8-bit quantization methods and FP16 Transformer baselines, with a much smaller memory footprint and energy consumption; BitNet also exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.
    Abstract The increasing size of large language models has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption. In this work, we introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models. Specifically, we introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. Furthermore, BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.
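The core of BitLinear can be sketched as sign-binarized weights with a scaling factor, trained through a straight-through estimator; BitNet's full recipe also quantizes activations and adds normalization, omitted here for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Simplified drop-in replacement for nn.Linear with 1-bit weights."""
    def forward(self, x):
        w = self.weight
        scale = w.abs().mean()                      # per-tensor scaling factor
        w_bin = torch.sign(w) * scale               # weights in {-scale, +scale}
        # Straight-through estimator: forward uses the binarized weights,
        # the backward pass sees the latent full-precision weights.
        w_q = w + (w_bin - w).detach()
        return F.linear(x, w_q, self.bias)
```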

An Empirical Study of Translation Hypothesis Ensembling with Large Language Models

  • paper_url: http://arxiv.org/abs/2310.11430
  • repo_url: https://github.com/deep-spin/translation-hypothesis-ensembling
  • paper_authors: António Farinhas, José G. C. de Souza, André F. T. Martins
  • for: This paper investigates whether hypothesis ensembling can improve the quality of LLM-based machine translation.
  • methods: The paper ensembles hypotheses from LLMs such as ChatGPT, LLaMA, and Alpaca, generated via multiple prompts, temperature-based sampling, and beam search, and compares strategies for producing the final translation.
  • results: The results show that MBR decoding is a very effective method (sketched below), that translation quality can be improved with only a small number of samples, and that instruction tuning strongly affects the relation between hypothesis diversity and sampling temperature.
    Abstract Large language models (LLMs) are becoming a one-fits-many solution, but they sometimes hallucinate or produce unreliable output. In this paper, we investigate how hypothesis ensembling can improve the quality of the generated text for the specific problem of LLM-based machine translation. We experiment with several techniques for ensembling hypotheses produced by LLMs such as ChatGPT, LLaMA, and Alpaca. We provide a comprehensive study along multiple dimensions, including the method to generate hypotheses (multiple prompts, temperature-based sampling, and beam search) and the strategy to produce the final translation (instruction-based, quality-based reranking, and minimum Bayes risk (MBR) decoding). Our results show that MBR decoding is a very effective method, that translation quality can be improved using a small number of samples, and that instruction tuning has a strong impact on the relation between the diversity of the hypotheses and the sampling temperature.
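MBR decoding picks the sampled hypothesis with the highest expected utility against the other samples. A sketch using sentence-level chrF from sacrebleu as the utility (the paper compares several utilities, so this choice is illustrative):

```python
from sacrebleu.metrics import CHRF

def mbr_decode(candidates):
    chrf = CHRF()
    def expected_utility(hyp):
        # Treat every other sample as a pseudo-reference and average.
        others = [c for c in candidates if c is not hyp]
        return sum(chrf.sentence_score(hyp, [ref]).score for ref in others) / len(others)
    return max(candidates, key=expected_utility)

print(mbr_decode(["the cat sat", "a cat sat down", "the cat sat down"]))
```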

Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles

  • paper_url: http://arxiv.org/abs/2310.11379
  • repo_url: https://github.com/ferugit/iterative-pseudo-forced-alignment-ctc
  • paper_authors: Fernando López, Jordi Luque, Carlos Segura, Pablo Gómez
  • for: This work improves the accuracy, energy efficiency, and speed of wake-up word detection in voice interfaces.
  • methods: A two-stage detection scheme with multi-resolution processing and temporally aligned data enhancement: a lightweight on-device model processes the audio stream in real time, and a server-side verification ensemble of heterogeneous architectures refines detection, allowing two operating points to be optimized (sketched below).
  • results: Comparing different parametric feature configurations and thirteen audio classifiers, the proposed ensemble outperforms the strongest single classifier under every noise condition.
    Abstract Voice-based interfaces rely on a wake-up word mechanism to initiate communication with devices. However, achieving a robust, energy-efficient, and fast detection remains a challenge. This paper addresses these real production needs by enhancing data with temporal alignments and using detection based on two phases with multi-resolution. It employs two models: a lightweight on-device model for real-time processing of the audio stream and a verification model on the server-side, which is an ensemble of heterogeneous architectures that refine detection. This scheme allows the optimization of two operating points. To protect privacy, audio features are sent to the cloud instead of raw audio. The study investigated different parametric configurations for feature extraction to select one for on-device detection and another for the verification model. Furthermore, thirteen different audio classifiers were compared in terms of performance and inference time. The proposed ensemble outperforms our stronger classifier in every noise condition.
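The two-stage scheme can be sketched as an early-reject on-device check followed by a server-side ensemble vote over transmitted features; as in the paper, raw audio never leaves the device. All model objects and thresholds below are placeholders:

```python
def detect(window, on_device_model, server_ensemble, t_device=0.3, t_server=0.7):
    feats = on_device_model.featurize(window)       # audio features, not raw audio
    if on_device_model.score(feats) < t_device:
        return False                                # cheap early reject, runs locally
    # Only likely wake-word windows reach the heterogeneous server-side ensemble.
    votes = [m.score(feats) for m in server_ensemble]
    return sum(votes) / len(votes) >= t_server
```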

DialogueLLM: Context and Emotion Knowledge-Tuned LLaMA Models for Emotion Recognition in Conversations

  • paper_url: http://arxiv.org/abs/2310.11374
  • repo_url: None
  • paper_authors: Yazhou Zhang, Mengyao Wang, Prayag Tiwari, Qiuchi Li, Benyou Wang, Jing Qin
  • for: Improving model performance on emotion recognition in conversations, a domain where LLMs' strength in natural language generation has not translated into emotion understanding.
  • methods: Fine-tuning LLaMA models on emotional dialogues, with multi-modal information (texts and videos) used as supplementary knowledge to construct high-quality instructions (a minimal LoRA fine-tuning sketch follows the abstract).
  • results: Outperforms the baselines and other SOTA LLMs on three benchmark emotion recognition in conversations (ERC) datasets.
    Abstract Large language models (LLMs) and their variants have shown extraordinary efficacy across numerous downstream natural language processing (NLP) tasks, which has presented a new vision for the development of NLP. Despite their remarkable performance in natural language generating (NLG), LLMs lack a distinct focus on the emotion understanding domain. As a result, using LLMs for emotion recognition may lead to suboptimal and inadequate precision. Another limitation of LLMs is that they are typical trained without leveraging multi-modal information. To overcome these limitations, we propose DialogueLLM, a context and emotion knowledge tuned LLM that is obtained by fine-tuning LLaMA models with 13,638 multi-modal (i.e., texts and videos) emotional dialogues. The visual information is considered as the supplementary knowledge to construct high-quality instructions. We offer a comprehensive evaluation of our proposed model on three benchmarking emotion recognition in conversations (ERC) datasets and compare the results against the SOTA baselines and other SOTA LLMs. Additionally, DialogueLLM-7B can be easily trained using LoRA on a 40GB A100 GPU in 5 hours, facilitating reproducibility for other researchers.
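Since the paper highlights that DialogueLLM-7B trains with LoRA on a single 40GB A100 in 5 hours, a minimal PEFT setup for a LLaMA-family model may help orientation; the hyperparameters here are illustrative, not the paper's:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adapt the attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()         # a fraction of a percent of 7B params
```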

VECHR: A Dataset for Explainable and Robust Classification of Vulnerability Type in the European Court of Human Rights

  • paper_url: http://arxiv.org/abs/2310.11368
  • repo_url: None
  • paper_authors: Shanshan Xu, Leon Staufer, T. Y. S. S Santosh, Oana Ichim, Corina Heri, Matthias Grabmair
  • For: This paper addresses the elusive concept of vulnerability at the European Court of Human Rights (ECtHR) and provides a novel dataset (VECHR) for future research in this area.
  • Methods: The paper uses expert-annotated multi-label data to benchmark the performance of state-of-the-art models on vulnerability type classification and explanation rationale.
  • Results: The results show that vulnerability classification is challenging, with low prediction performance and limited agreement between models and experts; the models also have limited performance on out-of-domain (OOD) data.
    Abstract Recognizing vulnerability is crucial for understanding and implementing targeted support to empower individuals in need. This is especially important at the European Court of Human Rights (ECtHR), where the court adapts Convention standards to meet actual individual needs and thus ensures effective human rights protection. However, the concept of vulnerability remains elusive at the ECtHR and no prior NLP research has dealt with it. To enable future research in this area, we present VECHR, a novel expert-annotated multi-label dataset comprising of vulnerability type classification and explanation rationale. We benchmark the performance of state-of-the-art models on VECHR from both prediction and explainability perspectives. Our results demonstrate the challenging nature of the task with lower prediction performance and limited agreement between models and experts. Further, we analyze the robustness of these models in dealing with out-of-domain (OOD) data and observe overall limited performance. Our dataset poses unique challenges offering significant room for improvement regarding performance, explainability, and robustness.

Disentangling the Linguistic Competence of Privacy-Preserving BERT

  • paper_url: http://arxiv.org/abs/2310.11363
  • repo_url: None
  • paper_authors: Stefan Arnold, Nils Kemmerzell, Annika Schreiner
  • for: This work uses interpretation techniques to explain, at the linguistic level, why text-to-text privatization degrades language model performance.
  • methods: A series of interpretation techniques is applied to the internal representations of BERT trained on perturbed pre-text (a representational-similarity sketch follows the abstract).
  • results: A representational similarity analysis shows that the overall similarity of internal representations is substantially reduced; probing tasks indicate that privatized training still encodes localized properties of words but falls short at encoding the contextual relationships between spans of words.
    Abstract Differential Privacy (DP) has been tailored to address the unique challenges of text-to-text privatization. However, text-to-text privatization is known for degrading the performance of language models when trained on perturbed text. Employing a series of interpretation techniques on the internal representations extracted from BERT trained on perturbed pre-text, we intend to disentangle at the linguistic level the distortion induced by differential privacy. Experimental results from a representational similarity analysis indicate that the overall similarity of internal representations is substantially reduced. Using probing tasks to unpack this dissimilarity, we find evidence that text-to-text privatization affects the linguistic competence across several formalisms, encoding localized properties of words while falling short at encoding the contextual relationships between spans of words.
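One standard choice for quantifying how similar two models' internal representations are is linear centered kernel alignment (CKA); a sketch of that measure (the paper's exact analysis pipeline may differ):

```python
import torch

def linear_cka(X, Y):
    """X, Y: (n_examples, dim) activation matrices from two models."""
    X = X - X.mean(0, keepdim=True)          # center the features
    Y = Y - Y.mean(0, keepdim=True)
    hsic = (Y.T @ X).norm() ** 2             # ||Y^T X||_F^2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()

# Values near 1 mean the clean and privatized BERT layers represent the same
# inputs similarly; a drop signals the distortion induced by privatization.
```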

Enhancing Neural Machine Translation with Semantic Units

  • paper_url: http://arxiv.org/abs/2310.11360
  • repo_url: https://github.com/ictnlp/su4mt
  • paper_authors: Langlin Huang, Shuhao Gu, Zhuocheng Zhang, Yang Feng
  • for: This work improves neural machine translation by modeling semantic units, i.e., spans whose integral meaning spans multiple tokens.
  • methods: The method comprises Word Pair Encoding (WPE), a phrase-extraction step that identifies the boundaries of semantic units in a sentence, and an Attentive Semantic Fusion (ASF) layer that integrates the semantics of multiple subwords into a single vector (a pooling sketch follows the abstract).
  • results: Experiments show that the method effectively models and leverages semantic-unit-level information and outperforms strong baselines; code is available at https://github.com/ictnlp/SU4MT.
    Abstract Conventional neural machine translation (NMT) models typically use subwords and words as the basic units for model input and comprehension. However, complete words and phrases composed of several tokens are often the fundamental units for expressing semantics, referred to as semantic units. To address this issue, we propose a method Semantic Units for Machine Translation (SU4MT) which models the integral meanings of semantic units within a sentence, and then leverages them to provide a new perspective for understanding the sentence. Specifically, we first propose Word Pair Encoding (WPE), a phrase extraction method to help identify the boundaries of semantic units. Next, we design an Attentive Semantic Fusion (ASF) layer to integrate the semantics of multiple subwords into a single vector: the semantic unit representation. Lastly, the semantic-unit-level sentence representation is concatenated to the token-level one, and they are combined as the input of encoder. Experimental results demonstrate that our method effectively models and leverages semantic-unit-level information and outperforms the strong baselines. The code is available at https://github.com/ictnlp/SU4MT.
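The fusion step can be pictured as attention pooling over the subword vectors of one semantic unit; a simplified sketch (shapes and layer structure are assumptions, and the real ASF layer is more involved):

```python
import torch
import torch.nn as nn

class AttentivePool(nn.Module):
    """Collapse the subword vectors of one semantic unit into one vector."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, 1)

    def forward(self, subword_vecs):                 # (n_subwords, dim)
        weights = torch.softmax(self.query(subword_vecs), dim=0)
        return (weights * subword_vecs).sum(0)       # one semantic-unit vector
```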

QADYNAMICS: Training Dynamics-Driven Synthetic QA Diagnostic for Zero-Shot Commonsense Question Answering

  • paper_url: http://arxiv.org/abs/2310.11303
  • repo_url: https://github.com/hkust-knowcomp/qadynamics
  • paper_authors: Haochen Shi, Weiqi Wang, Tianqing Fang, Baixuan Xu, Wenxuan Ding, Xin Liu, Yangqiu Song
  • for: This work improves the generalization of zero-shot commonsense question-answering (QA) models so that they can reason over broader commonsense knowledge.
  • methods: The training dynamics of each synthetic QA pair are analyzed at both the question level and the option level, discarding uninformative pairs and mislabeled or false-negative options derived from CSKBs (a filtering sketch follows the abstract).
  • results: The method outperforms all baselines while using only 33% of the synthetic data, and expert evaluations confirm that it improves the quality of QA synthesis.
    Abstract Zero-shot commonsense Question-Answering (QA) requires models to reason about general situations beyond specific benchmarks. State-of-the-art approaches fine-tune language models on QA pairs constructed from CommonSense Knowledge Bases (CSKBs) to equip the models with more commonsense knowledge in a QA context. However, current QA synthesis protocols may introduce noise from the CSKBs and generate ungrammatical questions and false negative options, which impede the model's ability to generalize. To address these issues, we propose QADYNAMICS, a training dynamics-driven framework for QA diagnostics and refinement. Our approach analyzes the training dynamics of each QA pair at both the question level and option level, discarding machine-detectable artifacts by removing uninformative QA pairs and mislabeled or false-negative options. Extensive experiments demonstrate the effectiveness of our approach, which outperforms all baselines while using only 33% of the synthetic data, even including LLMs such as ChatGPT. Moreover, expert evaluations confirm that our framework significantly improves the quality of QA synthesis. Our codes and model checkpoints are available at https://github.com/HKUST-KnowComp/QaDynamics.
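Training-dynamics filtering in the dataset-cartography style tracks the model's probability of the gold option across epochs and keeps confident, low-variability pairs; a sketch of the idea with hypothetical thresholds:

```python
import numpy as np

def filter_by_dynamics(gold_probs, min_conf=0.5, max_var=0.2):
    """gold_probs: (n_examples, n_epochs) probabilities of the gold option."""
    confidence = gold_probs.mean(axis=1)     # high: model learns it consistently
    variability = gold_probs.std(axis=1)     # high: likely noisy or mislabeled
    return np.where((confidence >= min_conf) & (variability <= max_var))[0]

probs = np.array([[0.9, 0.92, 0.95],   # clean pair: kept
                  [0.1, 0.80, 0.20]])  # noisy/false-negative pair: dropped
print(filter_by_dynamics(probs))       # -> [0]
```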

ChapGTP, ILLC’s Attempt at Raising a BabyLM: Improving Data Efficiency by Automatic Task Formation

  • paper_url: http://arxiv.org/abs/2310.11282
  • repo_url: None
  • paper_authors: Jaap Jumelet, Michael Hanna, Marianne de Heer Kloots, Anna Langedijk, Charlotte Pouw, Oskar van der Wal
  • for: This paper describes the ILLC (University of Amsterdam) submission to the strict-small track of the BabyLM challenge (Warstadt et al., 2023).
  • methods: The final model, ChapGTP, is a masked language model trained for 200 epochs, aided by a novel data augmentation technique called Automatic Task Formation.
  • results: The model is evaluated on the BLiMP, (Super)GLUE, and MSGS suites; the paper also documents a range of methods that were ultimately not included in the model but may inspire training LMs in low-resource settings.
    Abstract We present the submission of the ILLC at the University of Amsterdam to the BabyLM challenge (Warstadt et al., 2023), in the strict-small track. Our final model, ChapGTP, is a masked language model that was trained for 200 epochs, aided by a novel data augmentation technique called Automatic Task Formation. We discuss in detail the performance of this model on the three evaluation suites: BLiMP, (Super)GLUE, and MSGS. Furthermore, we present a wide range of methods that were ultimately not included in the model, but may serve as inspiration for training LMs in low-resource settings.

xMEN: A Modular Toolkit for Cross-Lingual Medical Entity Normalization

  • paper_url: http://arxiv.org/abs/2310.11275
  • repo_url: https://github.com/hpi-dhc/xmen
  • paper_authors: Florian Borchert, Ignacio Llorca, Roland Roller, Bert Arnrich, Matthieu-P. Schapranow
  • for: Improving the performance of medical entity normalization across many languages, especially where fewer language resources are available than for English.
  • methods: The paper introduces xMEN, a modular cross-lingual medical entity normalization system that performs well in both low- and high-resource scenarios: when target-language synonyms are scarce for a given terminology, English aliases are leveraged for cross-lingual candidate generation (sketched below); candidates are then ranked with a trainable cross-encoder, which can also be trained in a weakly supervised manner on machine-translated datasets from a high-resource domain.
  • results: xMEN improves state-of-the-art performance across a wide range of multilingual benchmark datasets; weakly supervised cross-encoders are effective when no annotated data is available for the target task, and compatibility with the BigBIO framework lets xMEN be used easily with existing and prospective datasets.
    Abstract Objective: To improve performance of medical entity normalization across many languages, especially when fewer language resources are available compared to English. Materials and Methods: We introduce xMEN, a modular system for cross-lingual medical entity normalization, which performs well in both low- and high-resource scenarios. When synonyms in the target language are scarce for a given terminology, we leverage English aliases via cross-lingual candidate generation. For candidate ranking, we incorporate a trainable cross-encoder model if annotations for the target task are available. We also evaluate cross-encoders trained in a weakly supervised manner based on machine-translated datasets from a high resource domain. Our system is publicly available as an extensible Python toolkit. Results: xMEN improves the state-of-the-art performance across a wide range of multilingual benchmark datasets. Weakly supervised cross-encoders are effective when no training data is available for the target task. Through the compatibility of xMEN with the BigBIO framework, it can be easily used with existing and prospective datasets. Discussion: Our experiments show the importance of balancing the output of general-purpose candidate generators with subsequent trainable re-rankers, which we achieve through a rank regularization term in the loss function of the cross-encoder. However, error analysis reveals that multi-word expressions and other complex entities are still challenging. Conclusion: xMEN exhibits strong performance for medical entity normalization in multiple languages, even when no labeled data and few terminology aliases for the target language are available. Its configuration system and evaluation modules enable reproducible benchmarks. Models and code are available online at the following URL: https://github.com/hpi-dhc/xmen
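The cross-lingual candidate-generation idea can be sketched as character n-gram TF-IDF retrieval over English aliases (xMEN's actual retrieval stack is more elaborate):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

aliases = ["myocardial infarction", "heart attack", "cerebral infarction"]
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit(aliases)

def candidates(mention, k=2):
    """Rank English terminology aliases for a non-English mention."""
    sims = cosine_similarity(vec.transform([mention]), vec.transform(aliases))[0]
    return [aliases[i] for i in sims.argsort()[::-1][:k]]

print(candidates("Myokardinfarkt"))   # German mention matched to English aliases
```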

Utilizing Weak Supervision To Generate Indonesian Conservation Dataset

  • paper_url: http://arxiv.org/abs/2310.11258
  • repo_url: None
  • paper_authors: Mega Fransiska, Diah Pitaloka, Saripudin, Satrio Putra, Lintang Sutawika
  • for: This paper builds Indonesian NLP datasets from conservation news text, using weak supervision to generate soft-labeled data.
  • methods: Labeling functions are used to create two dataset types: multi-class classification and sentiment classification (a labeling-function sketch follows the abstract).
  • results: Baseline experiments with various pretrained language models reach 59.79% accuracy and a 55.72% F1-score for sentiment classification, and a 66.87% macro-F1, 71.5% micro-F1, and 83.67% ROC-AUC for multi-class classification.
    Abstract Weak supervision has emerged as a promising approach for rapid and large-scale dataset creation in response to the increasing demand for accelerated NLP development. By leveraging labeling functions, weak supervision allows practitioners to generate datasets quickly by creating learned label models that produce soft-labeled datasets. This paper aims to show how such an approach can be utilized to build an Indonesian NLP dataset from conservation news text. We construct two types of datasets: multi-class classification and sentiment classification. We then provide baseline experiments using various pretrained language models. These baseline results demonstrate test performances of 59.79% accuracy and 55.72% F1-score for sentiment classification, 66.87% F1-score-macro, 71.5% F1-score-micro, and 83.67% ROC-AUC for multi-class classification. Additionally, we release the datasets and labeling functions used in this work for further research and exploration.
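Labeling functions of the kind described above can be written with Snorkel; the keyword heuristics, labels, and toy data below are illustrative, not the paper's actual functions:

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_protected(x):
    return POSITIVE if "dilindungi" in x.text.lower() else ABSTAIN   # "protected"

@labeling_function()
def lf_poaching(x):
    return NEGATIVE if "perburuan liar" in x.text.lower() else ABSTAIN  # "poaching"

@labeling_function()
def lf_conservation(x):
    return POSITIVE if "konservasi" in x.text.lower() else ABSTAIN   # "conservation"

df = pd.DataFrame({"text": ["Satwa ini dilindungi.",
                            "Perburuan liar meningkat.",
                            "Konservasi hutan terus berjalan."]})
L = PandasLFApplier([lf_protected, lf_poaching, lf_conservation]).apply(df)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L)                     # learns LF accuracies on this tiny toy set
print(label_model.predict_proba(L))    # soft labels for downstream training
```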

CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

  • paper_url: http://arxiv.org/abs/2310.11248
  • repo_url: https://github.com/amazon-science/cceval
  • paper_authors: Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, Bing Xiang
  • for: This paper evaluates code completion models with CrossCodeEval, a diverse, multilingual benchmark of multi-file repositories that reflects real-world software development.
  • methods: A simple yet efficient static-analysis approach pinpoints uses of cross-file context within the current file, yielding examples in four popular programming languages (Python, Java, TypeScript, and C#) that strictly require cross-file context for accurate completion (a simplified detector is sketched below).
  • results: Experiments show that CrossCodeEval is extremely challenging when the relevant cross-file context is absent, with clear gains once that context is added to the prompt, although even the best model remains far from the ceiling; the paper also benchmarks methods for retrieving cross-file context, showing that CrossCodeEval can measure the capability of code retrievers.
    Abstract Code completion models have made significant progress in recent years, yet current popular evaluation datasets, such as HumanEval and MBPP, predominantly focus on code completion tasks within a single file. This over-simplified setting falls short of representing the real-world software development scenario where repositories span multiple files with numerous cross-file dependencies, and accessing and understanding cross-file context is often required to complete the code correctly. To fill in this gap, we propose CrossCodeEval, a diverse and multilingual code completion benchmark that necessitates an in-depth cross-file contextual understanding to complete the code accurately. CrossCodeEval is built on a diverse set of real-world, open-sourced, permissively-licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#. To create examples that strictly require cross-file context for accurate completion, we propose a straightforward yet efficient static-analysis-based approach to pinpoint the use of cross-file context within the current file. Extensive experiments on state-of-the-art code language models like CodeGen and StarCoder demonstrate that CrossCodeEval is extremely challenging when the relevant cross-file context is absent, and we see clear improvements when adding these context into the prompt. However, despite such improvements, the pinnacle of performance remains notably unattained even with the highest-performing model, indicating that CrossCodeEval is also capable of assessing model's capability in leveraging extensive context to make better code completion. Finally, we benchmarked various methods in retrieving cross-file context, and show that CrossCodeEval can also be used to measure the capability of code retrievers.
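The static-analysis idea can be approximated in a few lines: a completion is flagged as cross-file when it uses a name imported from another module of the same repository. A simplified, Python-only heuristic, not the paper's tooling:

```python
import ast

def names_imported_from_repo(source, repo_modules):
    """Names bound by `from <repo module> import ...` in this file."""
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module in repo_modules:
            names.update(alias.asname or alias.name for alias in node.names)
    return names

src = "from utils.metrics import f1_score\nscore = f1_score(y, y_hat)\n"
cross_file = names_imported_from_repo(src, repo_modules={"utils.metrics"})
print(cross_file)   # {'f1_score'} -> completing this line needs another file
```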

Entity Matching using Large Language Models

  • paper_url: http://arxiv.org/abs/2310.11244
  • repo_url: https://github.com/wbsg-uni-mannheim/matchgpt
  • paper_authors: Ralph Peeters, Christian Bizer
  • for: The paper investigates large language models (LLMs) for entity matching, as an alternative to pre-trained language models (PLMs) such as BERT and RoBERTa.
  • methods: The paper evaluates hosted LLMs such as GPT3.5 and GPT4, as well as open-source LLMs based on Llama2, in both zero-shot and task-specific scenarios, comparing different prompt designs and fine-tuning strategies (a zero-shot prompt is sketched below).
  • results: GPT4 outperforms fine-tuned PLMs (RoBERTa and Ditto) on three out of five benchmark datasets, reaching F1 scores around 90%; in-context learning and rule generation improve the performance of the other models, while GPT4 rarely needs such additional guidance.
    Abstract Entity Matching is the task of deciding whether two entity descriptions refer to the same real-world entity. Entity Matching is a central step in most data integration pipelines and an enabler for many e-commerce applications which require to match products offers from different vendors. State-of-the-art entity matching methods often rely on pre-trained language models (PLMs) such as BERT or RoBERTa. Two major drawbacks of these models for entity matching are that (i) the models require significant amounts of task-specific training data and (ii) the fine-tuned models are not robust concerning out-of-distribution entities. In this paper, we investigate using large language models (LLMs) for entity matching as a less domain-specific training data reliant and more robust alternative to PLM-based matchers. Our study covers hosted LLMs, such as GPT3.5 and GPT4, as well as open source LLMs based on Llama2 which can be run locally. We evaluate these models in a zero-shot scenario as well as a scenario where task-specific training data is available. We compare different prompt designs as well as the prompt sensitivity of the models in the zero-shot scenario. We investigate (i) the selection of in-context demonstrations, (ii) the generation of matching rules, as well as (iii) fine-tuning GPT3.5 in the second scenario using the same pool of training data across the different approaches. Our experiments show that GPT4 without any task-specific training data outperforms fine-tuned PLMs (RoBERTa and Ditto) on three out of five benchmark datasets reaching F1 scores around 90%. The experiments with in-context learning and rule generation show that all models beside of GPT4 benefit from these techniques (on average 5.9% and 2.2% F1), while GPT4 does not need such additional guidance in most cases...
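A zero-shot prompt of the kind evaluated in the paper might look as follows; the wording is our illustration, and `ask_llm` stands in for whichever hosted or local model is under test:

```python
def build_prompt(offer_a, offer_b):
    return (
        "Do the two product descriptions refer to the same real-world product? "
        "Answer with 'Yes' or 'No'.\n"
        f"Product 1: {offer_a}\nProduct 2: {offer_b}\nAnswer:"
    )

def match(offer_a, offer_b, ask_llm):
    """True if the LLM judges the two offers to be the same entity."""
    return ask_llm(build_prompt(offer_a, offer_b)).strip().lower().startswith("yes")
```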

Watermarking LLMs with Weight Quantization

  • paper_url: http://arxiv.org/abs/2310.11237
  • repo_url: https://github.com/twilight92z/quantize-watermark
  • paper_authors: Linyang Li, Botian Jiang, Pengyu Wang, Ke Ren, Hang Yan, Xipeng Qiu
  • for: Protecting the ownership of large language model weights against malicious usage that violates open-source licenses.
  • methods: A watermark is planted during the quantization process, without pre-defined triggers at inference time: it works when the model runs in fp32 mode and stays hidden once the model is quantized to int8 (a toy illustration follows the abstract).
  • results: The watermark is successfully planted into open-source large language model weights, including GPT-Neo and LLaMA.
    Abstract Abuse of large language models reveals high risks as large language models are being deployed at an astonishing speed. It is important to protect the model weights to avoid malicious usage that violates licenses of open-source large language models. This paper proposes a novel watermarking strategy that plants watermarks in the quantization process of large language models without pre-defined triggers during inference. The watermark works when the model is used in the fp32 mode and remains hidden when the model is quantized to int8, in this way, the users can only inference the model without further supervised fine-tuning of the model. We successfully plant the watermark into open-source large language model weights including GPT-Neo and LLaMA. We hope our proposed method can provide a potential direction for protecting model weights in the era of large language model applications.
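A toy illustration of the fp32/int8 gap such a scheme exploits: a payload encoded as sub-quantization-step offsets is readable from the fp32 weights yet leaves the int8 quantization untouched. The actual method plants the watermark during the quantization process itself, not by post-hoc perturbation:

```python
import torch

w = torch.randn(4, 4)                       # pretrained fp32 weights
scale = w.abs().max() / 127
q = torch.round(w / scale)                  # int8 grid the weights map to

bits = torch.randint(0, 2, w.shape) * 2 - 1           # watermark payload (+/-1)
w_marked = (q + 0.25 * bits) * scale                  # offsets stay inside each bin

assert torch.equal(torch.round(w_marked / scale), q)  # int8 view is unchanged
recovered = torch.sign(w_marked / scale - q)          # payload readable in fp32
assert torch.equal(recovered, bits.float())
```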

KG-GPT: A General Framework for Reasoning on Knowledge Graphs Using Large Language Models

  • paper_url: http://arxiv.org/abs/2310.11220
  • repo_url: https://github.com/jiho283/kg-gpt
  • paper_authors: Jiho Kim, Yeonsu Kwon, Yohan Jo, Edward Choi
  • for: This paper applies large language models (LLMs) to complex reasoning tasks over knowledge graphs (KGs).
  • methods: The proposed KG-GPT is a multi-purpose framework with three steps, Sentence Segmentation, Graph Retrieval, and Inference, which respectively partition sentences, retrieve relevant graph components, and derive logical conclusions.
  • results: Evaluated on KG-based fact verification and KGQA benchmarks, KG-GPT shows competitive and robust performance, even outperforming several fully supervised models.
    Abstract While large language models (LLMs) have made considerable advancements in understanding and generating unstructured text, their application in structured data remains underexplored. Particularly, using LLMs for complex reasoning tasks on knowledge graphs (KGs) remains largely untouched. To address this, we propose KG-GPT, a multi-purpose framework leveraging LLMs for tasks employing KGs. KG-GPT comprises three steps: Sentence Segmentation, Graph Retrieval, and Inference, each aimed at partitioning sentences, retrieving relevant graph components, and deriving logical conclusions, respectively. We evaluate KG-GPT using KG-based fact verification and KGQA benchmarks, with the model showing competitive and robust performance, even outperforming several fully-supervised models. Our work, therefore, marks a significant step in unifying structured and unstructured data processing within the realm of LLMs.

Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations

  • paper_url: http://arxiv.org/abs/2310.11207
  • repo_url: None
  • paper_authors: Shiyuan Huang, Siddarth Mamidanna, Shreedhar Jangam, Yilun Zhou, Leilani H. Gilpin
  • for: This paper studies how well automatically generated self-explanations account for model decisions in sentiment analysis tasks.
  • methods: Experiments with ChatGPT compare different ways of eliciting self-explanations and evaluate their faithfulness against traditional explanation methods on a set of metrics.
  • results: ChatGPT's self-explanations perform on par with traditional methods such as occlusion or LIME saliency maps (an occlusion sketch follows the abstract), yet differ from them on various agreement metrics while being much cheaper to produce; several of their characteristics prompt a rethink of current model interpretability practices.
    Abstract Large language models (LLMs) such as ChatGPT have demonstrated superior performance on a variety of natural language processing (NLP) tasks including sentiment analysis, mathematical reasoning and summarization. Furthermore, since these models are instruction-tuned on human conversations to produce "helpful" responses, they can and often will produce explanations along with the response, which we call self-explanations. For example, when analyzing the sentiment of a movie review, the model may output not only the positivity of the sentiment, but also an explanation (e.g., by listing the sentiment-laden words such as "fantastic" and "memorable" in the review). How good are these automatically generated self-explanations? In this paper, we investigate this question on the task of sentiment analysis and for feature attribution explanation, one of the most commonly studied settings in the interpretability literature (for pre-ChatGPT models). Specifically, we study different ways to elicit the self-explanations, evaluate their faithfulness on a set of evaluation metrics, and compare them to traditional explanation methods such as occlusion or LIME saliency maps. Through an extensive set of experiments, we find that ChatGPT's self-explanations perform on par with traditional ones, but are quite different from them according to various agreement metrics, meanwhile being much cheaper to produce (as they are generated along with the prediction). In addition, we identified several interesting characteristics of them, which prompt us to rethink many current model interpretability practices in the era of ChatGPT(-like) LLMs.
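Occlusion, one of the traditional attribution baselines used for comparison, scores each word by the drop in predicted sentiment probability when it is removed; a sketch with a placeholder classifier:

```python
def occlusion_saliency(words, sentiment_prob):
    """sentiment_prob: any callable returning P(positive | text)."""
    base = sentiment_prob(" ".join(words))
    return {
        w: base - sentiment_prob(" ".join(words[:i] + words[i + 1:]))
        for i, w in enumerate(words)
    }

# For "a fantastic and memorable film", "fantastic" and "memorable" should get
# the largest scores, mirroring the self-explanation example in the abstract.
```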

ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing

  • paper_url: http://arxiv.org/abs/2310.11166
  • repo_url: https://github.com/uitnlp/ViSoBERT
  • paper_authors: Quoc-Nam Nguyen, Thang Chau Phan, Duc-Vu Nguyen, Kiet Van Nguyen
  • For: This paper presents ViSoBERT, the first monolingual pre-trained language model for Vietnamese social media texts, released for research purposes.
  • Methods: ViSoBERT is pre-trained with the XLM-R architecture on a large-scale corpus of high-quality and diverse Vietnamese social media texts.
  • Results: With far fewer parameters, ViSoBERT surpasses previous state-of-the-art models on multiple Vietnamese social media tasks: emotion recognition, hate speech detection, sentiment analysis, spam reviews detection, and hate speech spans detection.
    Abstract English and Chinese, known as resource-rich languages, have witnessed the strong development of transformer-based language models for natural language processing tasks. Although Vietnam has approximately 100M people speaking Vietnamese, several pre-trained models, e.g., PhoBERT, ViBERT, and vELECTRA, performed well on general Vietnamese NLP tasks, including POS tagging and named entity recognition. These pre-trained language models are still limited to Vietnamese social media tasks. In this paper, we present the first monolingual pre-trained language model for Vietnamese social media texts, ViSoBERT, which is pre-trained on a large-scale corpus of high-quality and diverse Vietnamese social media texts using XLM-R architecture. Moreover, we explored our pre-trained model on five important natural language downstream tasks on Vietnamese social media texts: emotion recognition, hate speech detection, sentiment analysis, spam reviews detection, and hate speech spans detection. Our experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese social media tasks. Our ViSoBERT model is available only for research purposes.

IMTLab: An Open-Source Platform for Building, Evaluating, and Diagnosing Interactive Machine Translation Systems

  • paper_url: http://arxiv.org/abs/2310.11163
  • repo_url: https://github.com/xuuhuang/imtlab
  • paper_authors: Xu Huang, Zhirui Zhang, Ruize Gao, Yichao Du, Lemao Liu, Gouping Huang, Shuming Shi, Jiajun Chen, Shujian Huang
  • for: This paper provides an open-source end-to-end interactive machine translation (IMT) platform that lets researchers quickly build IMT systems with state-of-the-art models, run end-to-end evaluations, and diagnose system weaknesses.
  • methods: The whole interactive translation process is treated as a task-oriented dialogue with a human in the loop, whose interventions are explicitly incorporated to produce high-quality, error-free translations; a general communication interface supports flexible IMT architectures and user policies, and simulated and real interactive environments enable end-to-end evaluation of previous IMT systems.
  • results: Simulated and manual experiments show that prefix-constrained decoding still attains the lowest editing cost in end-to-end evaluation (see the sketch below), while BiTIIMT achieves comparable editing cost with a better interactive experience.
    Abstract We present IMTLab, an open-source end-to-end interactive machine translation (IMT) system platform that enables researchers to quickly build IMT systems with state-of-the-art models, perform an end-to-end evaluation, and diagnose the weakness of systems. IMTLab treats the whole interactive translation process as a task-oriented dialogue with a human-in-the-loop setting, in which human interventions can be explicitly incorporated to produce high-quality, error-free translations. To this end, a general communication interface is designed to support the flexible IMT architectures and user policies. Based on the proposed design, we construct a simulated and real interactive environment to achieve end-to-end evaluation and leverage the framework to systematically evaluate previous IMT systems. Our simulated and manual experiments show that the prefix-constrained decoding approach still gains the lowest editing cost in the end-to-end evaluation, while BiTIIMT achieves comparable editing cost with a better interactive experience.
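Prefix-constrained decoding forces the model to keep the user-approved target prefix and only complete the suffix. A sketch via Hugging Face's `prefix_allowed_tokens_fn` hook (the model choice and prefix are illustrative, not IMTLab's implementation):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")

src = tok("The agreement was signed yesterday.", return_tensors="pt")
prefix = tok("Das Abkommen", add_special_tokens=False).input_ids  # user-approved

def allowed(batch_id, input_ids):
    step = input_ids.shape[-1] - 1        # decoder tokens generated so far
    if step < len(prefix):
        return [prefix[step]]             # force the approved prefix token
    return list(range(len(tok)))          # afterwards, decode freely

out = model.generate(**src, prefix_allowed_tokens_fn=allowed, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```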

Probing the Creativity of Large Language Models: Can models produce divergent semantic association?

  • paper_url: http://arxiv.org/abs/2310.11158
  • repo_url: https://github.com/dingnlab/probing_creativity
  • paper_authors: Honghua Chen, Nai Ding
  • for: investigate the creative thinking of large language models through a cognitive perspective
  • methods: utilize the divergent association task (DAT) to measure the models’ creativity and compare results across different models and decoding strategies (a scoring sketch follows the abstract)
  • results: GPT-4 outperforms 96% of humans in creativity, stochastic sampling and temperature scaling can improve creativity but with a trade-off between creativity and stability.
    Abstract Large language models possess remarkable capacity for processing language, but it remains unclear whether these models can further generate creative content. The present study aims to investigate the creative thinking of large language models through a cognitive perspective. We utilize the divergent association task (DAT), an objective measurement of creativity that asks models to generate unrelated words and calculates the semantic distance between them. We compare the results across different models and decoding strategies. Our findings indicate that: (1) When using the greedy search strategy, GPT-4 outperforms 96% of humans, while GPT-3.5-turbo exceeds the average human level. (2) Stochastic sampling and temperature scaling are effective to obtain higher DAT scores for models except GPT-4, but face a trade-off between creativity and stability. These results imply that advanced large language models have divergent semantic associations, which is a fundamental process underlying creativity.
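The DAT score is the mean pairwise cosine distance among the generated words, conventionally scaled by 100; a sketch where `embed` stands in for any word-embedding lookup such as GloVe vectors:

```python
from itertools import combinations
import numpy as np

def dat_score(words, embed):
    """Higher scores mean more semantically divergent word sets."""
    vecs = [embed(w) for w in words]
    dists = [
        1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        for a, b in combinations(vecs, 2)
    ]
    return 100 * float(np.mean(dists))
```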

The Quo Vadis of the Relationship between Language and Large Language Models

  • paper_url: http://arxiv.org/abs/2310.11146
  • repo_url: None
  • paper_authors: Evelina Leivada, Vittoria Dentella, Elliot Murphy
  • for: This paper examines the adoption of large language models (LLMs) as scientific models of language in current natural language processing (NLP) practice.
  • methods: Theoretical and empirical considerations are used to assess whether LLMs can offer insights into the target system they seek to represent, relating them to every scientific model's fundamental components: the object, the medium, the meaning, and the user.
  • results: The paper finds that, at their current stage of development, LLMs hardly offer any explanations for language, and it outlines more informative directions for future research on this topic.
    Abstract In the field of Artificial (General) Intelligence (AI), the several recent advancements in Natural language processing (NLP) activities relying on Large Language Models (LLMs) have come to encourage the adoption of LLMs as scientific models of language. While the terminology employed for the characterization of LLMs favors their embracing as such, it is not clear that they are in a place to offer insights into the target system they seek to represent. After identifying the most important theoretical and empirical risks brought about by the adoption of scientific models that lack transparency, we discuss LLMs relating them to every scientific model's fundamental components: the object, the medium, the meaning and the user. We conclude that, at their current stage of development, LLMs hardly offer any explanations for language, and then we provide an outlook for more informative future research directions on this topic.

Experimenting AI Technologies for Disinformation Combat: the IDMO Project

  • paper_url: http://arxiv.org/abs/2310.11097
  • repo_url: None
  • paper_authors: Lorenzo Canale, Alberto Messina
  • for: This paper contributes technologies for countering disinformation and fake news within the Italian Digital Media Observatory (IDMO) project, part of a European initiative.
  • methods: The contributions include (i) novel datasets for testing technologies, (ii) an automatic model for categorizing Pagella Politica verdicts to facilitate broader analysis, (iii) an automatic model for recognizing textual entailment with exceptional accuracy on the FEVER dataset, (iv) an assessment of GPT-4 for identifying textual entailment, and (v) a game to raise awareness about fake news at national events.
  • results: The results indicate that GPT-4 can identify textual entailment accurately, and that the created datasets and models support broader analysis of disinformation.
    Abstract The Italian Digital Media Observatory (IDMO) project, part of a European initiative, focuses on countering disinformation and fake news. This report outlines contributions from Rai-CRITS to the project, including: (i) the creation of novel datasets for testing technologies; (ii) the development of an automatic model for categorizing Pagella Politica verdicts to facilitate broader analysis; (iii) the creation of an automatic model for recognizing textual entailment with exceptional accuracy on the FEVER dataset; (iv) an assessment using GPT-4 to identify textual entailment; and (v) a game to raise awareness about fake news at national events.

Understanding writing style in social media with a supervised contrastively pre-trained transformer

  • paper_url: http://arxiv.org/abs/2310.11081
  • repo_url: None
  • paper_authors: Javier Huertas-Tato, Alejandro Martin, David Camacho
  • for: This study aims to understand harmful behavior on online social media, from hate speech to the spread of disinformation, by relating content to its authors.
  • methods: The study introduces the Style Transformer for Authorship Representations (STAR), trained on a large corpus derived from public sources of 4.5 x 10^6 authored texts from 70k heterogeneous authors, using a supervised contrastive loss to learn authors' stylistic features.
  • results: STAR achieves competitive zero-shot performance on the PAN attribution and clustering challenges and promising results on the PAN verification challenges using a single dense layer. On Reddit, with a support set of 8 documents of 512 tokens, the model can discern authors from sets of up to 1616 authors with at least 80% accuracy.
    Abstract Online Social Networks serve as fertile ground for harmful behavior, ranging from hate speech to the dissemination of disinformation. Malicious actors now have unprecedented freedom to misbehave, leading to severe societal unrest and dire consequences, as exemplified by events such as the Capitol assault during the US presidential election and the Antivaxx movement during the COVID-19 pandemic. Understanding online language has become more pressing than ever. While existing works predominantly focus on content analysis, we aim to shift the focus towards understanding harmful behaviors by relating content to their respective authors. Numerous novel approaches attempt to learn the stylistic features of authors in texts, but many of these approaches are constrained by small datasets or sub-optimal training losses. To overcome these limitations, we introduce the Style Transformer for Authorship Representations (STAR), trained on a large corpus derived from public sources of 4.5 x 10^6 authored texts involving 70k heterogeneous authors. Our model leverages Supervised Contrastive Loss to teach the model to minimize the distance between texts authored by the same individual. This author pretext pre-training task yields competitive performance at zero-shot with PAN challenges on attribution and clustering. Additionally, we attain promising results on PAN verification challenges using a single dense layer, with our model serving as an embedding encoder. Finally, we present results from our test partition on Reddit. Using a support base of 8 documents of 512 tokens, we can discern authors from sets of up to 1616 authors with at least 80% accuracy. We share our pre-trained model at huggingface (https://huggingface.co/AIDA-UPM/star) and our code is available at (https://github.com/jahuerta92/star).
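
The core training signal in STAR is a supervised contrastive loss that pulls together texts by the same author. Below is a minimal PyTorch sketch of such a loss; the batch construction, temperature, and normalization follow common SupCon practice and are assumptions, not the released STAR training code (see the huggingface and GitHub links above for the real artifacts).

```python
# Supervised contrastive loss over author labels: texts by the same author
# are pulled together relative to all other texts in the batch.
import torch
import torch.nn.functional as F

def supcon_loss(embeddings, labels, temperature=0.07):
    """embeddings: (N, d) text embeddings; labels: (N,) author ids."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                       # (N, N) similarity logits
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))     # never contrast with self
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                              # anchors with >= 1 positive
    loss = -(log_prob * pos_mask.float()).sum(dim=1)[valid] / pos_counts[valid]
    return loss.mean()

# Toy batch: four texts by two authors.
emb = torch.randn(4, 8)
labels = torch.tensor([0, 0, 1, 1])
print(supcon_loss(emb, labels))
```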

VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System

  • paper_url: http://arxiv.org/abs/2310.11069
  • repo_url: None
  • paper_authors: Abdul Waheed, Bashar Talafha, Peter Sullivan, AbdelRahim Elmadany, Muhammad Abdul-Mageed
  • for: The paper develops and demos VoxArabica, a system for Arabic dialect identification (DID) and automatic speech recognition (ASR), providing a robust tool for Arabic research.
  • methods: A range of models, including HuBERT (for DID) and Whisper and XLS-R (for ASR), are trained in a supervised setting for Arabic DID and ASR tasks.
  • results: The system identifies 17 Arabic dialects in addition to Modern Standard Arabic (MSA); the ASR models are fine-tuned on MSA, Egyptian, Moroccan, and mixed data, and for the remaining dialects the interface offers models such as Whisper and MMS in a zero-shot setting.
    Abstract Arabic is a complex language with many varieties and dialects spoken by over 450 million people around the world. Due to the linguistic diversity and variations, it is challenging to build a robust and generalized ASR system for Arabic. In this work, we address this gap by developing and demoing a system, dubbed VoxArabica, for dialect identification (DID) as well as automatic speech recognition (ASR) of Arabic. We train a wide range of models such as HuBERT (DID), Whisper, and XLS-R (ASR) in a supervised setting for Arabic DID and ASR tasks. Our DID models are trained to identify 17 different dialects in addition to MSA. We finetune our ASR models on MSA, Egyptian, Moroccan, and mixed data. Additionally, for the remaining dialects in ASR, we provide the option to choose various models such as Whisper and MMS in a zero-shot setting. We integrate these models into a single web interface with diverse features such as audio recording, file upload, model selection, and the option to raise flags for incorrect outputs. Overall, we believe VoxArabica will be useful for a wide range of audiences concerned with Arabic research. Our system is currently running at https://cdce-206-12-100-168.ngrok.io/.

Lyricist-Singer Entropy Affects Lyric-Lyricist Classification Performance

  • paper_url: http://arxiv.org/abs/2310.11035
  • repo_url: None
  • paper_authors: Mitsuki Morita, Masato Kikuchi, Tadachika Ozono
  • for: This study investigates the characteristics of lyricists, which may be valuable for music applications such as recommendation.
  • methods: The authors extract features representing the characteristics of lyricists from lyrics, grouping lyricists by the variety of singers they write for.
  • results: The study finds that the relationship between lyricists and singers affects lyric-lyricist classification performance. Specifically, performance depends on the variety of singers a lyricist writes for (lyricist-singer entropy), with the best F1 score obtained for the group with the lowest entropy.
    Abstract Although lyrics represent an essential component of music, few music information processing studies have been conducted on the characteristics of lyricists. Because these characteristics may be valuable for musical applications, such as recommendations, they warrant further study. We considered a potential method that extracts features representing the characteristics of lyricists from lyrics. Because these features must be identified prior to extraction, we focused on lyricists with easily identifiable features. We believe that it is desirable for singers to perform unique songs that share certain characteristics specific to the singer. Accordingly, we hypothesized that lyricists account for the unique characteristics of the singers they write lyrics for. In other words, lyric-lyricist classification performance or the ease of capturing the features of a lyricist from the lyrics may depend on the variety of singers. In this study, we observed a relationship between lyricist-singer entropy or the variety of singers associated with a single lyricist and lyric-lyricist classification performance. As an example, the lyricist-singer entropy is minimal when the lyricist writes lyrics for only one singer. In our experiments, we grouped lyricists among five groups in terms of lyricist-singer entropy and assessed the lyric-lyricist classification performance within each group. Consequently, the best F1 score was obtained for the group with the lowest lyricist-singer entropy. Our results suggest that further analyses of the features contributing to lyric-lyricist classification performance on the lowest lyricist-singer entropy group may improve the feature extraction task for lyricists.
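
The lyricist-singer entropy described above is just the Shannon entropy of the distribution of singers a given lyricist writes for. A minimal sketch follows, assuming the data is a list of singer names per lyricist; the paper's grouping into five entropy bins is not shown.

```python
# Lyricist-singer entropy: Shannon entropy of the distribution of singers
# a lyricist writes for. Zero when a lyricist writes for a single singer.
import math
from collections import Counter

def lyricist_singer_entropy(singers):
    """singers: list of singer names across one lyricist's songs."""
    counts = Counter(singers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(lyricist_singer_entropy(["A", "A", "A"]))       # 0.0 bits (one singer)
print(lyricist_singer_entropy(["A", "B", "C", "D"]))  # 2.0 bits (even spread)
```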

Exploring Automatic Evaluation Methods based on a Decoder-based LLM for Text Generation

  • paper_url: http://arxiv.org/abs/2310.11026
  • repo_url: None
  • paper_authors: Tomohito Kasahara, Daisuke Kawahara
  • for: This paper investigates automatic evaluation methods for text generation based on decoder-based large language models.
  • methods: The authors compare several methods, including tuned encoder-based models and tuned decoder-based large language models, under equal conditions on two tasks, machine translation evaluation and semantic textual similarity, in two languages, Japanese and English.
  • results: Experiments show that the tuned decoder-based models perform poorly compared with the tuned encoder-based models, apparently because decoder-based models focus on surface word sequences and do not capture meaning. In-context learning with very large decoder-based models such as ChatGPT also makes it difficult to identify fine-grained semantic differences.
    Abstract Automatic evaluation of text generation is essential for improving the accuracy of generation tasks. In light of the current trend towards increasingly larger decoder-based language models, we investigate automatic evaluation methods based on such models for text generation. This paper compares various methods, including tuning with encoder-based models and large language models under equal conditions, on two different tasks, machine translation evaluation and semantic textual similarity, in two languages, Japanese and English. Experimental results show that compared to the tuned encoder-based models, the tuned decoder-based models perform poorly. The analysis of the causes for this suggests that the decoder-based models focus on surface word sequences and do not capture meaning. It is also revealed that in-context learning of very large decoder-based models such as ChatGPT makes it difficult to identify fine-grained semantic differences.
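
One way to picture the in-context evaluation setting studied here is a prompt that asks a decoder-only LM to rate a translation and then parses the numeric reply. The sketch below uses a hypothetical `generate` callable and an illustrative prompt; the paper's actual templates, models, and scoring setup differ.

```python
# Prompt a decoder-only LM to rate a translation, then parse the numeric reply.
def build_eval_prompt(source, hypothesis):
    return (
        "Rate how well the translation preserves the meaning of the source, "
        "from 0 (unrelated) to 100 (perfect). Reply with a number only.\n"
        f"Source: {source}\nTranslation: {hypothesis}\nScore:"
    )

def score_translation(source, hypothesis, generate):
    """generate: callable(str) -> str wrapping any LLM completion API."""
    reply = generate(build_eval_prompt(source, hypothesis))
    digits = "".join(ch for ch in reply if ch.isdigit())
    return min(100, int(digits)) if digits else None

# Toy usage with a dummy model that always answers "85".
print(score_translation("猫が好きです。", "I like cats.", lambda prompt: "85"))
```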

Reading Order Matters: Information Extraction from Visually-rich Documents by Token Path Prediction

  • paper_url: http://arxiv.org/abs/2310.11016
  • repo_url: None
  • paper_authors: Chong Zhang, Ya Guo, Yi Tu, Huan Chen, Jinyang Tang, Huijia Zhu, Qi Zhang, Tao Gui
  • for: This study aims to improve information extraction from visually-rich documents (VrDs) with multimodal pre-trained models, specifically by addressing the reading-order issue introduced when OCR systems recognize and arrange document text, which hinders accurate named entity recognition (NER).
  • methods: The study proposes Token Path Prediction (TPP), a simple prediction head that models the document layout as a complete directed graph of tokens and predicts token paths within the graph as entity mentions.
  • results: Experiments show that TPP effectively resolves the reading-order issue and improves the accuracy of VrD-NER systems. The study also proposes two revised NER benchmark datasets of scanned documents that better reflect real-world scenarios for evaluating VrD-NER systems.
    Abstract Recent advances in multimodal pre-trained models have significantly improved information extraction from visually-rich documents (VrDs), in which named entity recognition (NER) is treated as a sequence-labeling task of predicting the BIO entity tags for tokens, following the typical setting of NLP. However, BIO-tagging scheme relies on the correct order of model inputs, which is not guaranteed in real-world NER on scanned VrDs where text are recognized and arranged by OCR systems. Such reading order issue hinders the accurate marking of entities by BIO-tagging scheme, making it impossible for sequence-labeling methods to predict correct named entities. To address the reading order issue, we introduce Token Path Prediction (TPP), a simple prediction head to predict entity mentions as token sequences within documents. Alternative to token classification, TPP models the document layout as a complete directed graph of tokens, and predicts token paths within the graph as entities. For better evaluation of VrD-NER systems, we also propose two revised benchmark datasets of NER on scanned documents which can reflect real-world scenarios. Experiment results demonstrate the effectiveness of our method, and suggest its potential to be a universal solution to various information extraction tasks on documents.
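
TPP's central idea is to treat entities as paths over a directed graph of tokens, which sidesteps reading order entirely. The sketch below shows one simplified way to decode such paths from a predicted token-to-token adjacency matrix with greedy thresholding; the decoding rule and score format are illustrative assumptions, not the paper's exact prediction head.

```python
# Decode entity mentions as token paths from a predicted adjacency matrix.
import numpy as np

def decode_token_paths(adj, start_scores, threshold=0.5):
    """adj[i, j]: predicted prob. that token j follows token i inside an entity;
    start_scores[i]: predicted prob. that token i starts an entity."""
    used, paths = set(), []
    for i in np.argsort(-start_scores):          # most confident starts first
        if start_scores[i] < threshold or i in used:
            continue
        path, cur = [int(i)], int(i)
        used.add(cur)
        while True:
            nxt = int(np.argmax(adj[cur]))       # greedily follow the best edge
            if adj[cur, nxt] < threshold or nxt in used:
                break
            path.append(nxt)
            used.add(nxt)
            cur = nxt
        paths.append(path)
    return paths

# Toy example: tokens 2 -> 0 -> 1 form one entity despite OCR order 0, 1, 2.
adj = np.array([[0.0, 0.9, 0.0],
                [0.0, 0.0, 0.1],
                [0.9, 0.0, 0.0]])
start = np.array([0.2, 0.1, 0.95])
print(decode_token_paths(adj, start))  # [[2, 0, 1]]
```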

Correction Focused Language Model Training for Speech Recognition

  • paper_url: http://arxiv.org/abs/2310.11003
  • repo_url: None
  • paper_authors: Yingyi Ma, Zhe Liu, Ozlem Kalinli
  • for: Improving automatic speech recognition (ASR) performance, particularly in domain adaptation tasks.
  • methods: The paper introduces a correction-focused language model (LM) training approach that prioritizes ASR-fallible words. A word-level fallibility score, representing the likelihood of mis-recognition, is defined and shaped into a prior word distribution to guide LM training; large language models (LLMs) serve as fallibility-score predictors and text generators through multi-task fine-tuning.
  • results: Experiments show the effectiveness of the proposed method. Compared with conventional LM training, correction-focused training achieves up to a relative 5.5% word error rate (WER) reduction in sufficient-text scenarios; in insufficient-text scenarios, LM training with LLM-generated text achieves up to a relative 13% WER reduction, and correction-focused training obtains up to a further relative 6% reduction.
    Abstract Language models (LMs) have been commonly adopted to boost the performance of automatic speech recognition (ASR) particularly in domain adaptation tasks. Conventional way of LM training treats all the words in corpora equally, resulting in suboptimal improvements in ASR performance. In this work, we introduce a novel correction focused LM training approach which aims to prioritize ASR fallible words. The word-level ASR fallibility score, representing the likelihood of ASR mis-recognition, is defined and shaped as a prior word distribution to guide the LM training. To enable correction focused training with text-only corpora, large language models (LLMs) are employed as fallibility score predictors and text generators through multi-task fine-tuning. Experimental results for domain adaptation tasks demonstrate the effectiveness of our proposed method. Compared with conventional LMs, correction focused training achieves up to relatively 5.5% word error rate (WER) reduction in sufficient text scenarios. In insufficient text scenarios, LM training with LLM-generated text achieves up to relatively 13% WER reduction, while correction focused training further obtains up to relatively 6% WER reduction.
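
The correction-focused objective can be pictured as a cross-entropy re-weighted by word-level fallibility scores. Below is a minimal PyTorch rendering of that idea; how the scores are produced (in the paper, LLMs fine-tuned as fallibility predictors) is abstracted behind a tensor, and the `1 + alpha * fallibility` weighting is an illustrative choice, not the paper's exact formulation.

```python
# Correction-focused LM loss: token-level cross-entropy re-weighted by
# word-level ASR fallibility scores (higher = more likely mis-recognized).
import torch
import torch.nn.functional as F

def correction_focused_loss(logits, targets, fallibility, alpha=1.0):
    """logits: (B, T, V); targets: (B, T); fallibility: (B, T) in [0, 1]."""
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (B, T)
    weights = 1.0 + alpha * fallibility          # up-weight ASR-fallible words
    return (weights * ce).sum() / weights.sum()

# Toy usage with random tensors.
B, T, V = 2, 5, 100
loss = correction_focused_loss(torch.randn(B, T, V),
                               torch.randint(V, (B, T)),
                               torch.rand(B, T))
print(loss.item())
```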

Instructive Dialogue Summarization with Query Aggregations

  • paper_url: http://arxiv.org/abs/2310.10981
  • repo_url: https://github.com/BinWang28/InstructDS
  • paper_authors: Bin Wang, Zhengyuan Liu, Nancy F. Chen
  • for: The paper aims to expand the capability set of dialogue summarization models so they can address users' specific interests and needs.
  • methods: The paper proposes a three-step approach consisting of summary-anchored query generation, query filtering, and query-based summary generation, and trains a unified model called InstructDS on three summarization datasets with multi-purpose instructive triples.
  • results: Experiments show that the approach outperforms state-of-the-art models and even larger models, with higher generalizability and faithfulness confirmed by human subjective evaluations.
    Abstract Conventional dialogue summarization methods directly generate summaries and do not consider user's specific interests. This poses challenges in cases where the users are more focused on particular topics or aspects. With the advancement of instruction-finetuned language models, we introduce instruction-tuning to dialogues to expand the capability set of dialogue summarization models. To overcome the scarcity of instructive dialogue summarization data, we propose a three-step approach to synthesize high-quality query-based summarization triples. This process involves summary-anchored query generation, query filtering, and query-based summary generation. By training a unified model called InstructDS (Instructive Dialogue Summarization) on three summarization datasets with multi-purpose instructive triples, we expand the capability of dialogue summarization models. We evaluate our method on four datasets, including dialogue summarization and dialogue reading comprehension. Experimental results show that our approach outperforms the state-of-the-art models and even models with larger sizes. Additionally, our model exhibits higher generalizability and faithfulness, as confirmed by human subjective evaluations.

Semantic-Aware Contrastive Sentence Representation Learning with Large Language Models

  • paper_url: http://arxiv.org/abs/2310.10962
  • repo_url: None
  • paper_authors: Huiming Wang, Liying Cheng, Zhaodonghui Li, De Wen Soh, Lidong Bing
  • for: This study proposes SemCSR, a semantic-aware contrastive sentence representation framework that removes the need for human-annotated training pairs.
  • methods: The framework leverages the generation and evaluation capabilities of large language models (LLMs) to automatically construct a high-quality NLI-style corpus without human annotation, and incorporates the generated sentence pairs into contrastive learning of a sentence representation model.
  • results: Extensive experiments and comprehensive analyses demonstrate the effectiveness of the proposed framework for learning better sentence representations with LLMs.
    Abstract Contrastive learning has been proven to be effective in learning better sentence representations. However, to train a contrastive learning model, large numbers of labeled sentences are required to construct positive and negative pairs explicitly, such as those in natural language inference (NLI) datasets. Unfortunately, acquiring sufficient high-quality labeled data can be both time-consuming and resource-intensive, leading researchers to focus on developing methods for learning unsupervised sentence representations. As there is no clear relationship between these unstructured randomly-sampled sentences, building positive and negative pairs over them is tricky and problematic. To tackle these challenges, in this paper, we propose SemCSR, a semantic-aware contrastive sentence representation framework. By leveraging the generation and evaluation capabilities of large language models (LLMs), we can automatically construct a high-quality NLI-style corpus without any human annotation, and further incorporate the generated sentence pairs into learning a contrastive sentence representation model. Extensive experiments and comprehensive analyses demonstrate the effectiveness of our proposed framework for learning a better sentence representation with LLMs.

Computing the optimal keyboard through a geometric analysis of the English language

  • paper_url: http://arxiv.org/abs/2310.10956
  • repo_url: None
  • paper_authors: Jules Deschamps, Quentin Hubert, Lucas Ryckelynck
  • for: Improving typing speed through keyboard design.
  • methods: Geometric tools are applied within an optimization framework to propose novel keyboard layouts.
  • results: The proposed layouts offer faster typing.
    Abstract In the context of a group project for the course COMSW4995 002 - Geometric Data Analysis, we bring our attention to the design of fast-typing keyboards. Leveraging some geometric tools in an optimization framework allowed us to propose novel keyboard layouts that offer faster typing.
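
A layout search of this kind can be pictured as minimizing expected finger travel under English bigram frequencies. The sketch below scores a layout geometrically and improves it by hill climbing over key swaps; the grid, the tiny bigram table, and the swap heuristic are illustrative stand-ins for the project's actual geometric analysis.

```python
# Score a keyboard layout by expected finger travel under bigram frequencies,
# then improve it with hill climbing over random key swaps.
import math
import random

POSITIONS = {i: (i % 10, i // 10) for i in range(30)}   # 3x10 grid of key slots
BIGRAMS = {("t", "h"): 0.036, ("h", "e"): 0.031, ("i", "n"): 0.024,
           ("e", "r"): 0.021, ("a", "n"): 0.020}        # toy frequency table

def travel_cost(layout):
    """layout: dict letter -> slot index; expected Euclidean travel per bigram."""
    total = 0.0
    for (a, b), freq in BIGRAMS.items():
        (x1, y1), (x2, y2) = POSITIONS[layout[a]], POSITIONS[layout[b]]
        total += freq * math.hypot(x2 - x1, y2 - y1)
    return total

def hill_climb(layout, steps=2000, seed=0):
    rng = random.Random(seed)
    letters = list(layout)
    best = travel_cost(layout)
    for _ in range(steps):
        a, b = rng.sample(letters, 2)
        layout[a], layout[b] = layout[b], layout[a]       # try swapping two keys
        cost = travel_cost(layout)
        if cost < best:
            best = cost                                   # keep the improvement
        else:
            layout[a], layout[b] = layout[b], layout[a]   # revert
    return layout, best

layout = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
print(travel_cost(layout))
layout, cost = hill_climb(layout)
print(cost)  # lower expected travel after optimization
```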

TEQ: Trainable Equivalent Transformation for Quantization of LLMs

  • paper_url: http://arxiv.org/abs/2310.10944
  • repo_url: https://github.com/intel/neural-compressor
  • paper_authors: Wenhua Cheng, Yiyang Cai, Kaokao Lv, Haihao Shen
  • for: This work proposes TEQ, a trainable equivalent transformation that preserves the FP32 precision of model outputs while taking advantage of low-precision quantization, especially 3- and 4-bit weight-only quantization.
  • methods: The trainable equivalent transformation adds no computational overhead at inference; training is lightweight, requiring only 1K steps and fewer than 0.1% of the original model's trainable parameters.
  • results: The results are on par with state-of-the-art methods and can be combined with other approaches for further gains. The code is available at https://github.com/intel/neural-compressor.
    Abstract As large language models (LLMs) become more prevalent, there is a growing need for new and improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. In this paper, we present TEQ, a trainable equivalent transformation that preserves the FP32 precision of the model output while taking advantage of low-precision quantization, especially 3 and 4 bits weight-only quantization. The training process is lightweight, requiring only 1K steps and fewer than 0.1 percent of the original model's trainable parameters. Furthermore, the transformation does not add any computational overhead during inference. Our results are on-par with the state-of-the-art (SOTA) methods on typical LLMs. Our approach can be combined with other methods to achieve even better performance. The code is available at https://github.com/intel/neural-compressor.
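
The equivalent-transformation idea is that per-channel scales s leave the product unchanged in exact arithmetic, since (x / s) @ (s * W).T = x @ W.T, while making the scaled weights friendlier to quantize; only the scales are trained. Below is a minimal PyTorch sketch with a straight-through 4-bit round-to-nearest quantizer; the quantizer, loss, and training loop are illustrative, not the released implementation (see the repo link above).

```python
# Trainable equivalent transformation for weight-only quantization.
import torch

def quantize_weight(w, bits=4):
    """Symmetric per-row round-to-nearest quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    q = w / scale
    q = q + (q.round().clamp(-qmax - 1, qmax) - q).detach()  # STE for gradients
    return q * scale

class TEQLinear(torch.nn.Module):
    def __init__(self, weight, bits=4):
        super().__init__()
        self.weight = weight                                  # (out, in), frozen FP32
        self.log_s = torch.nn.Parameter(torch.zeros(weight.shape[1]))
        self.bits = bits

    def forward(self, x):
        s = self.log_s.exp()                                  # positive per-channel scales
        w_q = quantize_weight(self.weight * s, self.bits)     # quantize the scaled weights
        return (x / s) @ w_q.t()                              # scales cancel exactly

# Train the scales to match the FP32 output on calibration data.
w, x = torch.randn(16, 32), torch.randn(8, 32)
layer = TEQLinear(w)
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
for _ in range(100):
    loss = ((layer(x) - x @ w.t()) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```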

MASON-NLP at eRisk 2023: Deep Learning-Based Detection of Depression Symptoms from Social Media Texts

  • paper_url: http://arxiv.org/abs/2310.10941
  • repo_url: None
  • paper_authors: Fardin Ahsan Sakib, Ahnaf Atef Choudhury, Ozlem Uzuner
  • for: The paper focuses on detecting depressive symptoms in social media posts, using the Beck Depression Inventory (BDI) questionnaire to delimit the symptoms of interest.
  • methods: The authors used a deep learning approach that incorporated MentalBERT, RoBERTa, and LSTM to identify sentences related to different depression symptoms.
  • results: The evaluation results were lower than expected, highlighting the challenges of ranking sentences from a large dataset about depression.
    Abstract Depression is a mental health disorder that has a profound impact on people's lives. Recent research suggests that signs of depression can be detected in the way individuals communicate, both through spoken words and written texts. In particular, social media posts are a rich and convenient text source that we may examine for depressive symptoms. The Beck Depression Inventory (BDI) Questionnaire, which is frequently used to gauge the severity of depression, is one instrument that can aid in this study. We can narrow our study to only those symptoms since each BDI question is linked to a particular depressive symptom. It's important to remember that not everyone with depression exhibits all symptoms at once, but rather a combination of them. Therefore, it is extremely useful to be able to determine if a sentence or a piece of user-generated content is pertinent to a certain condition. With this in mind, the eRisk 2023 Task 1 was designed to do exactly that: assess the relevance of different sentences to the symptoms of depression as outlined in the BDI questionnaire. This report is all about how our team, Mason-NLP, participated in this subtask, which involved identifying sentences related to different depression symptoms. We used a deep learning approach that incorporated MentalBERT, RoBERTa, and LSTM. Despite our efforts, the evaluation results were lower than expected, underscoring the challenges inherent in ranking sentences from an extensive dataset about depression, which necessitates both appropriate methodological choices and significant computational resources. We anticipate that future iterations of this shared task will yield improved results as our understanding and techniques evolve.

Intent Detection and Slot Filling for Home Assistants: Dataset and Analysis for Bangla and Sylheti

  • paper_url: http://arxiv.org/abs/2310.10935
  • repo_url: None
  • paper_authors: Fardin Ahsan Sakib, A H M Rezaul Karim, Saadat Hasan Khan, Md Mushfiqur Rahman
  • for: The study provides a comprehensive dataset for intent detection and slot filling in formal Bangla, colloquial Bangla, and Sylheti, supporting downstream language tasks in diverse, low-resource linguistic settings.
  • methods: The study evaluates the GPT-3.5 model on intent detection and slot filling for colloquial Bangla, formal Bangla, and Sylheti.
  • results: GPT-3.5 achieves an impressive F1 score of 0.94 in intent detection and 0.51 in slot filling for colloquial Bangla.
    Abstract As voice assistants cement their place in our technologically advanced society, there remains a need to cater to the diverse linguistic landscape, including colloquial forms of low-resource languages. Our study introduces the first-ever comprehensive dataset for intent detection and slot filling in formal Bangla, colloquial Bangla, and Sylheti languages, totaling 984 samples across 10 unique intents. Our analysis reveals the robustness of large language models for tackling downstream tasks with inadequate data. The GPT-3.5 model achieves an impressive F1 score of 0.94 in intent detection and 0.51 in slot filling for colloquial Bangla.
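
Zero- or few-shot intent detection and slot filling with a chat model boils down to a classification prompt whose reply is parsed as structured output. The sketch below is one hedged way to set that up; the intent list, the example command, and the `generate` callable are hypothetical stand-ins, not drawn from the paper's dataset.

```python
# Few-shot prompt for joint intent detection and slot filling, parsed as JSON.
import json

INTENTS = ["turn_on_light", "turn_off_light", "set_alarm", "play_music"]

def build_prompt(utterance):
    return (
        f"Classify the home-assistant command into one of {INTENTS} and "
        "extract slots as JSON with keys 'intent' and 'slots'.\n"
        "Example: 'Turn on the bedroom light' -> "
        '{"intent": "turn_on_light", "slots": {"location": "bedroom"}}\n'
        f"Command: '{utterance}' ->"
    )

def detect_intent(utterance, generate):
    """generate: callable(str) -> str wrapping a chat/completion API."""
    return json.loads(generate(build_prompt(utterance)))

# Toy usage with a dummy model standing in for GPT-3.5.
dummy = lambda prompt: '{"intent": "set_alarm", "slots": {"time": "7am"}}'
print(detect_intent("সকাল ৭টায় অ্যালার্ম দাও", dummy))  # Bangla: "set an alarm for 7am"
```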

Spatial HuBERT: Self-supervised Spatial Speech Representation Learning for a Single Talker from Multi-channel Audio

  • paper_url: http://arxiv.org/abs/2310.10922
  • repo_url: None
  • paper_authors: Antoni Dimitriadis, Siqi Pan, Vidhyasaharan Sethu, Beena Ahmed
  • for: Improving the accuracy and generalization of speech systems by leveraging unlabelled data through self-supervised learning.
  • methods: The model uses multi-channel audio inputs to learn both acoustic and spatial information about a single speaker in a potentially noisy environment.
  • results: The learned representations outperform state-of-the-art single-channel speech representations on a variety of spatial downstream tasks, particularly in reverberant and noisy environments, and also prove useful on a speech localisation task.
    Abstract Self-supervised learning has been used to leverage unlabelled data, improving accuracy and generalisation of speech systems through the training of representation models. While many recent works have sought to produce effective representations across a variety of acoustic domains, languages, modalities and even simultaneous speakers, these studies have all been limited to single-channel audio recordings. This paper presents Spatial HuBERT, a self-supervised speech representation model that learns both acoustic and spatial information pertaining to a single speaker in a potentially noisy environment by using multi-channel audio inputs. Spatial HuBERT learns representations that outperform state-of-the-art single-channel speech representations on a variety of spatial downstream tasks, particularly in reverberant and noisy environments. We also demonstrate the utility of the representations learned by Spatial HuBERT on a speech localisation downstream task. Along with this paper, we publicly release a new dataset of 100 000 simulated first-order ambisonics room impulse responses.

Compositional preference models for aligning LMs

  • paper_url: http://arxiv.org/abs/2310.13011
  • repo_url: None
  • paper_authors: Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Marc Dymetman
  • for: The study aims to better align language models (LMs) with human preferences.
  • methods: The authors propose Compositional Preference Models (CPMs), a novel preference-model framework that decomposes one global preference assessment into several interpretable features, obtains scalar scores for these features from a prompted LM, and aggregates the scores using a logistic regression classifier.
  • results: Experiments show that CPMs not only improve generalization and robustness to overoptimization compared with standard preference models, but also that best-of-n samples obtained using CPMs tend to be preferred over samples obtained using conventional preference models.
    Abstract As language models (LMs) become more capable, it is increasingly important to align them with human preferences. However, the dominant paradigm for training Preference Models (PMs) for that purpose suffers from fundamental limitations, such as lack of transparency and scalability, along with susceptibility to overfitting the preference dataset. We propose Compositional Preference Models (CPMs), a novel PM framework that decomposes one global preference assessment into several interpretable features, obtains scalar scores for these features from a prompted LM, and aggregates these scores using a logistic regression classifier. CPMs allow to control which properties of the preference data are used to train the preference model and to build it based on features that are believed to underlie the human preference judgment. Our experiments show that CPMs not only improve generalization and are more robust to overoptimization than standard PMs, but also that best-of-n samples obtained using CPMs tend to be preferred over samples obtained using conventional PMs. Overall, our approach demonstrates the benefits of endowing PMs with priors about which features determine human preferences while relying on LM capabilities to extract those features in a scalable and robust way.
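
The CPM recipe is concrete enough to sketch: a prompted LM scores each interpretable feature, and a logistic regression aggregates feature differences between two responses into a preference probability. The feature names, the `score_feature` callable, and the synthetic training data below are illustrative assumptions, not the paper's configuration.

```python
# Compositional preference model: per-feature LM scores aggregated by
# a logistic regression into a preference probability.
import numpy as np
from sklearn.linear_model import LogisticRegression

FEATURES = ["helpfulness", "factuality", "coherence", "conciseness"]

def featurize(response, score_feature):
    """score_feature: callable(feature_name, response) -> float, e.g. a
    wrapper that prompts an LM to rate the response on that axis."""
    return np.array([score_feature(f, response) for f in FEATURES])

# Fit the aggregator on preference pairs: x is the feature-score difference
# between responses A and B, y = 1 if A was preferred. Synthetic data here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, len(FEATURES)))
y = (X.sum(axis=1) + rng.normal(scale=0.5, size=200) > 0).astype(int)
clf = LogisticRegression().fit(X, y)

def prob_a_preferred(resp_a, resp_b, score_feature):
    diff = featurize(resp_a, score_feature) - featurize(resp_b, score_feature)
    return clf.predict_proba(diff.reshape(1, -1))[0, 1]  # P(A preferred)
```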

Emergent AI-Assisted Discourse: Case Study of a Second Language Writer Authoring with ChatGPT

  • paper_url: http://arxiv.org/abs/2310.10903
  • repo_url: None
  • paper_authors: Sharin Jacob, Tamara Tate, Mark Warschauer
  • for: This study investigates how ChatGPT can facilitate academic writing for language learners, amid concerns about declining writing standards.
  • methods: Using a case study approach, the study examines the experiences of Kailing, a doctoral student who integrates ChatGPT throughout her academic writing process. Activity theory serves as a lens for understanding writing with generative AI tools; the data analyzed include semi-structured interviews, writing samples, and GPT logs.
  • results: The results indicate that Kailing effectively collaborates with ChatGPT across various writing stages while preserving her distinct authorial voice and agency, suggesting that AI tools such as ChatGPT can enhance academic writing for language learners without overshadowing individual authenticity.
    Abstract The rapid proliferation of ChatGPT has incited debates regarding its impact on human writing. Amid concerns about declining writing standards, this study investigates the role of ChatGPT in facilitating academic writing, especially among language learners. Using a case study approach, this study examines the experiences of Kailing, a doctoral student, who integrates ChatGPT throughout their academic writing process. The study employs activity theory as a lens for understanding writing with generative AI tools and data analyzed includes semi-structured interviews, writing samples, and GPT logs. Results indicate that Kailing effectively collaborates with ChatGPT across various writing stages while preserving her distinct authorial voice and agency. This underscores the potential of AI tools such as ChatGPT to enhance academic writing for language learners without overshadowing individual authenticity. This case study offers a critical exploration of how ChatGPT is utilized in the academic writing process and the preservation of a student's authentic voice when engaging with the tool.