2023-08-15

cs.CL

cs.CL - 2023-08-15

DS4DH at #SMM4H 2023: Zero-Shot Adverse Drug Events Normalization using Sentence Transformers and Reciprocal-Rank Fusion

paper_url: http://arxiv.org/abs/2308.12877
repo_url: None
paper_authors: Anthony Yazdani, Hossein Rouhizadeh, David Vicente Alvarez, Douglas Teodoro
for: 本研究是为了评估一种基于BERT fine-tuning和sentence transformers的社交媒体文本挖掘系统，用于正常化恶性药物事件提到Medical Dictionary for Regulatory Activities（MDRA）词汇。
methods: 本研究采用了两stage方法，首先使用BERT fine-tuning进行实体识别，然后使用sentence transformers和reciprocal-rank fusion进行零 shot正常化。
results: 本研究的结果显示，这种方法在MDRA词汇正常化中得到了44.9%的精度、40.5%的准确率和42.6%的F1分数，超过了共享任务5中的中值性能提高10%，并且在所有参与者中显示出最高性能。这些结果证明了该方法的有效性和在社交媒体文本挖掘领域的应用潜力。

Abstract
This paper outlines the performance evaluation of a system for adverse drug event normalization, developed by the Data Science for Digital Health group for the Social Media Mining for Health Applications 2023 shared task 5. Shared task 5 targeted the normalization of adverse drug event mentions in Twitter to standard concepts from the Medical Dictionary for Regulatory Activities terminology. Our system hinges on a two-stage approach: BERT fine-tuning for entity recognition, followed by zero-shot normalization using sentence transformers and reciprocal-rank fusion. The approach yielded a precision of 44.9%, recall of 40.5%, and an F1-score of 42.6%. It outperformed the median performance in shared task 5 by 10% and demonstrated the highest performance among all participants. These results substantiate the effectiveness of our approach and its potential application for adverse drug event normalization in the realm of social media text mining.

摘要
Here's the text in Simplified Chinese:这篇论文介绍了一种基于BERT微调和sentence transformers的社交媒体文本挖掘系统，用于正常化投诉病药事件。该系统采用了两 stageapproach：首先微调BERT进行实体识别，然后使用sentence transformers和reciprocal-rank fusions进行零批normalization。该approach实现了44.9%的精度、40.5%的准确率和42.6%的F1分数，比共享任务5中的中值性能提高10%，并达到了所有参与者中最高的性能。这些结果证明了该approach的有效性，并适用于社交媒体文本挖掘中的投诉病药事件正常化。

Enhancing Visually-Rich Document Understanding via Layout Structure Modeling

paper_url: http://arxiv.org/abs/2308.07777
repo_url: None
paper_authors: Qiwei Li, Zuchao Li, Xiantao Cai, Bo Du, Hai Zhao
for: 这 paper 的目的是提高文档理解的精度，特别是利用文档结构图模型文档的布局结构知识。
methods: 该 paper 提出了一种名为 GraphLayoutLM 的新型文档理解模型，该模型利用文档结构图模型文档的布局结构知识，并使用图重新排序算法和布局意识多头自注意力层来学习文档布局知识。
results: 该 paper 在多个 benchmark 上达到了最佳成绩，包括 FUNSD、XFUND 和 CORD 等 datasets，并且通过对模型组件的缺省研究，表明了每个组件的贡献。

Abstract
In recent years, the use of multi-modal pre-trained Transformers has led to significant advancements in visually-rich document understanding. However, existing models have mainly focused on features such as text and vision while neglecting the importance of layout relationship between text nodes. In this paper, we propose GraphLayoutLM, a novel document understanding model that leverages the modeling of layout structure graph to inject document layout knowledge into the model. GraphLayoutLM utilizes a graph reordering algorithm to adjust the text sequence based on the graph structure. Additionally, our model uses a layout-aware multi-head self-attention layer to learn document layout knowledge. The proposed model enables the understanding of the spatial arrangement of text elements, improving document comprehension. We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results among these datasets. Our experimental results demonstrate that our proposed method provides a significant improvement over existing approaches and showcases the importance of incorporating layout information into document understanding models. We also conduct an ablation study to investigate the contribution of each component of our model. The results show that both the graph reordering algorithm and the layout-aware multi-head self-attention layer play a crucial role in achieving the best performance.

摘要
Translated into Simplified Chinese:在最近几年，基于多modal预训练的Transformers模型在文本 ricoh 理解方面带来了显著的进步。然而，现有的模型主要集中在文本和视觉特征之间，忽略了文档布局关系的重要性。在本文中，我们提出了 GraphLayoutLM 模型，它利用文档布局结构图来注入文档布局知识到模型中。GraphLayoutLM 模型使用图重新排序算法来根据图结构调整文本序列。此外，我们的模型还使用了布局意识多头自注意层来学习文档布局知识。该模型可以理解文本元素的空间排序，从而提高文档理解能力。我们在不同的benchmark上评估了我们的模型，包括FUNSD、XFUND和CORD等，并在这些数据集中达到了状态之最好的结果。我们的实验结果表明，我们提出的方法具有显著的改进，并证明了在文档理解模型中包含布局信息的重要性。我们还进行了一个ablation研究，以 Investigate each component of our model的贡献。结果显示，图重新排序算法和布局意识多头自注意层均对获得最佳性能做出了重要贡献。

SPM: Structured Pretraining and Matching Architectures for Relevance Modeling in Meituan Search

paper_url: http://arxiv.org/abs/2308.07711
repo_url: None
paper_authors: Wen Zan, Yaopeng Han, Xiaotian Jiang, Yao Xiao, Yang Yang, Dayao Chen, Sheng Chen
for: 提高生活服务平台上搜索结果的相关性，以提高用户体验。
methods: 提出了一种两stage预训练和匹配架构，使用了查询和文档多个字段作为输入，并使用了有效的信息压缩方法来处理长文档。
results: 经过大规模的实验和在线A/B测试，表明提出的架构有效提高了搜索结果的相关性，已经在Meituan上线部署一年多。

Abstract
In e-commerce search, relevance between query and documents is an essential requirement for satisfying user experience. Different from traditional e-commerce platforms that offer products, users search on life service platforms such as Meituan mainly for product providers, which usually have abundant structured information, e.g. name, address, category, thousands of products. Modeling search relevance with these rich structured contents is challenging due to the following issues: (1) there is language distribution discrepancy among different fields of structured document, making it difficult to directly adopt off-the-shelf pretrained language model based methods like BERT. (2) different fields usually have different importance and their length vary greatly, making it difficult to extract document information helpful for relevance matching. To tackle these issues, in this paper we propose a novel two-stage pretraining and matching architecture for relevance matching with rich structured documents. At pretraining stage, we propose an effective pretraining method that employs both query and multiple fields of document as inputs, including an effective information compression method for lengthy fields. At relevance matching stage, a novel matching method is proposed by leveraging domain knowledge in search query to generate more effective document representations for relevance scoring. Extensive offline experiments and online A/B tests on millions of users verify that the proposed architectures effectively improve the performance of relevance modeling. The model has already been deployed online, serving the search traffic of Meituan for over a year.

摘要
在电商搜索中，搜索结果的相关性是用户体验的关键要求。与传统电商平台不同，用户在生活服务平台such as Meituan上查询主要是为了找到供应商，这些供应商通常有很多结构化信息，例如名称、地址、类别、千种产品。使用这些丰富的结构化内容进行搜索相关性模型化是有挑战的，因为：（1）不同的结构化文档字段存在语言分布差异，使得直接采用市场上已有预训练语言模型的方法如BERT不太可能。（2）不同的字段通常有不同的重要性和长度，使得提取文档信息有帮助于相关性匹配的部分很困难。为解决这些问题，本文提出了一种新的两Stage预训练和匹配架构，用于与结构化文档进行相关性模型化。预训练阶段，我们提出了一种有效的预训练方法，该方法使用查询和多个文档字段作为输入，并使用有效的信息压缩方法来处理长字段。匹配阶段，我们提出了一种基于搜索查询领域知识的新匹配方法，该方法可以更有效地生成文档表示，用于相关性分数。广泛的Offline实验和在线A/B测试表明，提出的架构有效地提高了相关性模型的性能。该模型已经在Meituan上线服务了一年多。

Better Zero-Shot Reasoning with Role-Play Prompting

paper_url: http://arxiv.org/abs/2308.07702
repo_url: https://github.com/HLT-NLP/Role-Play-Prompting
paper_authors: Aobo Kong, Shiwan Zhao, Hao Chen, Qicheng Li, Yong Qin, Ruiqi Sun, Xin Zhou
for: 这种研究旨在探讨LLMs中的角色扮演如何影响其理解能力。
methods: 研究使用了一种策略性的角色扮演提示方法，在零基eline设定下测试了12种不同的理解准则，包括代数、常识理解、 симвоlic理解等。
results: 研究结果表明，使用角色扮演提示可以在大多数数据集上超越标准的零基eline方法，其中AQuA的准确率由53.5%提高到63.8%，Last Letter的准确率由23.8%提高到84.2%。这表明角色扮演提示可以提高LLMs的上下文理解和链条思维能力。

Abstract
Modern large language models (LLMs), such as ChatGPT, exhibit a remarkable capacity for role-playing, enabling them to embody not only human characters but also non-human entities like a Linux terminal. This versatility allows them to simulate complex human-like interactions and behaviors within various contexts, as well as to emulate specific objects or systems. While these capabilities have enhanced user engagement and introduced novel modes of interaction, the influence of role-playing on LLMs' reasoning abilities remains underexplored. In this study, we introduce a strategically designed role-play prompting methodology and assess its performance under the zero-shot setting across twelve diverse reasoning benchmarks, encompassing arithmetic, commonsense reasoning, symbolic reasoning, and more. Leveraging models such as ChatGPT and Llama 2, our empirical results illustrate that role-play prompting consistently surpasses the standard zero-shot approach across most datasets. Notably, accuracy on AQuA rises from 53.5% to 63.8%, and on Last Letter from 23.8% to 84.2%. Beyond enhancing contextual understanding, we posit that role-play prompting serves as an implicit Chain-of-Thought (CoT) trigger, thereby improving the quality of reasoning. By comparing our approach with the Zero-Shot-CoT technique, which prompts the model to "think step by step", we further demonstrate that role-play prompting can generate a more effective CoT. This highlights its potential to augment the reasoning capabilities of LLMs.

摘要
现代大型语言模型（LLM），如ChatGPT，显示出了很强的角色扮演能力，可以不仅扮演人类角色，还可以模拟非人类Entity，如Linux终端。这种多样性使得它们能够模拟人类间的复杂互动和行为，以及模拟特定的对象或系统。虽然这些能力提高了用户参与度和引入了新的交互方式，但LLM的理解能力下的影响仍未得到足够的探索。在这项研究中，我们提出了一种策略性的角色扮演提问方法，并评估其在零基础设定下的性能。通过使用ChatGPT和Llama 2这两种模型，我们的实验结果表明，角色扮演提问在大多数数据集上都能够超越标准的零基础设定。特别是，AQuA的准确率由53.5%提高到63.8%，Last Letter的准确率由23.8%提高到84.2%。除了提高上下文理解，我们认为角色扮演提问可以作为隐藏链条（Chain-of-Thought，CoT）触发器，因此改善LLM的理解质量。通过与零基础CoT技术进行比较，我们进一步证明了角色扮演提问可以生成更有效的CoT。这说明它可以增强LLM的理解能力。

Attention Is Not All You Need Anymore

paper_url: http://arxiv.org/abs/2308.07661
repo_url: https://github.com/rprokap/pset-9
paper_authors: Zhe Chen
for: 本文提出了一种用于减少Transformer架构中自注意机制的计算和内存复杂性的drop-in替换方案，以提高Transformer的性能。
methods: 本文提出的Extractor可以作为Transformer的自注意机制替换，并且可以减少计算和内存复杂性。
results: 实验结果表明，使用Extractor可以提高Transformer的性能，并且它的计算路径更短，可以更快速地完成计算。

Abstract
In recent years, the popular Transformer architecture has achieved great success in many application areas, including natural language processing and computer vision. Many existing works aim to reduce the computational and memory complexity of the self-attention mechanism in the Transformer by trading off performance. However, performance is key for the continuing success of the Transformer. In this paper, a drop-in replacement for the self-attention mechanism in the Transformer, called the Extractor, is proposed. Experimental results show that replacing the self-attention mechanism with the Extractor improves the performance of the Transformer. Furthermore, the proposed Extractor has the potential to run faster than the self-attention since it has a much shorter critical path of computation. Additionally, the sequence prediction problem in the context of text generation is formulated using variable-length discrete-time Markov chains, and the Transformer is reviewed based on our understanding.

摘要
近年来，受欢迎的Transformer架构在许多应用领域取得了很大成功，包括自然语言处理和计算机视觉。许多现有的工作尝试通过减少Transformer中的自注意机制的计算和内存复杂性，但是性能是Transformer继续成功的关键。在这篇论文中，一种可替换Transformer中的自注意机制，称为Extractor，被提议。实验结果表明，将自注意机制替换为Extractor可以提高Transformer的性能。此外，提议的Extractor可能比自注意机制更快速，因为它有许多短的计算路径。此外，在文本生成中的序列预测问题被形式化为变量长 discrete-time Markov链，并根据我们的理解对Transformer进行了评估。

SEER: Super-Optimization Explorer for HLS using E-graph Rewriting with MLIR

paper_url: http://arxiv.org/abs/2308.07654
repo_url: None
paper_authors: Jianyi Cheng, Samuel Coward, Lorenzo Chelini, Rafael Barbalho, Theo Drane
for: This paper aims to improve the performance of hardware designs produced by high-level synthesis (HLS) tools by automatically rewriting software programs into efficient HLS code.
methods: The proposed method, called SEER, uses an e-graph data structure to efficiently explore equivalent implementations of a program at scale, and orchestrates existing software compiler passes and hardware synthesis optimizers.
results: The paper shows that SEER achieves up to 38x the performance within 1.4x the area of the original program, and outperforms manually optimized designs produced by hardware experts in an Intel-provided case study.

Abstract
High-level synthesis (HLS) is a process that automatically translates a software program in a high-level language into a low-level hardware description. However, the hardware designs produced by HLS tools still suffer from a significant performance gap compared to manual implementations. This is because the input HLS programs must still be written using hardware design principles. Existing techniques either leave the program source unchanged or perform a fixed sequence of source transformation passes, potentially missing opportunities to find the optimal design. We propose a super-optimization approach for HLS that automatically rewrites an arbitrary software program into efficient HLS code that can be used to generate an optimized hardware design. We developed a toolflow named SEER, based on the e-graph data structure, to efficiently explore equivalent implementations of a program at scale. SEER provides an extensible framework, orchestrating existing software compiler passes and hardware synthesis optimizers. Our work is the first attempt to exploit e-graph rewriting for large software compiler frameworks, such as MLIR. Across a set of open-source benchmarks, we show that SEER achieves up to 38x the performance within 1.4x the area of the original program. Via an Intel-provided case study, SEER demonstrates the potential to outperform manually optimized designs produced by hardware experts.

摘要
高级合成（HLS）是一个过程，它自动将高级语言程序转换为低级硬件描述。然而，由HLS工具生成的硬件设计仍然受到性能差距的影响，这是因为输入HLS程序仍需遵循硬件设计原则。现有的技术可能会留下程序源代码不变，或者执行固定的源代码转换步骤，可能会错过优化的机会。我们提出了一种超优化方法，它可以自动将任何软件程序转换为高效的HLS代码，可以生成优化的硬件设计。我们开发了一个名为SEER的工具流，基于e-graph数据结构，以高效地探索相当的实现方式。SEER提供了可扩展的框架，可以启用现有的软件编译器过程和硬件合成优化器。我们的工作是首次利用e-graph重写来大规模的软件编译框架，如MLIR。对一组开源 benchmark 进行测试，我们发现SEER可以达到38倍的性能，在1.4倍的面积内。通过Intel提供的案例研究，SEER还能够超越由硬件专家手动优化的设计。

Steering Language Generation: Harnessing Contrastive Expert Guidance and Negative Prompting for Coherent and Diverse Synthetic Data Generation

paper_url: http://arxiv.org/abs/2308.07645
repo_url: None
paper_authors: Charles O’Neill, Yuan-Sen Ting, Ioana Ciuca, Jack Miller, Thang Bui
for: 提高大语言模型生成的数据质量和多样性，以便下游模型训练和实际数据利用。
methods: 引入对比专家指导，以确保领域遵循性，并使用现有真实数据和 sintetic 示例作为负例准入，以保证多样性和 authenticty。
results: 比前一些生成数据技术提高表现，在三个不同任务中（假设生成、恶意和非恶意评论生成、常识理解任务生成） Displaying better balance between data diversity and coherence。

Abstract
Large Language Models (LLMs) hold immense potential to generate synthetic data of high quality and utility, which has numerous applications from downstream model training to practical data utilisation. However, contemporary models, despite their impressive capacities, consistently struggle to produce both coherent and diverse data. To address the coherency issue, we introduce contrastive expert guidance, where the difference between the logit distributions of fine-tuned and base language models is emphasised to ensure domain adherence. In order to ensure diversity, we utilise existing real and synthetic examples as negative prompts to the model. We deem this dual-pronged approach to logit reshaping as STEER: Semantic Text Enhancement via Embedding Repositioning. STEER operates at inference-time and systematically guides the LLMs to strike a balance between adherence to the data distribution (ensuring semantic fidelity) and deviation from prior synthetic examples or existing real datasets (ensuring diversity and authenticity). This delicate balancing act is achieved by dynamically moving towards or away from chosen representations in the latent space. STEER demonstrates improved performance over previous synthetic data generation techniques, exhibiting better balance between data diversity and coherency across three distinct tasks: hypothesis generation, toxic and non-toxic comment generation, and commonsense reasoning task generation. We demonstrate how STEER allows for fine-tuned control over the diversity-coherency trade-off via its hyperparameters, highlighting its versatility.

摘要

LogPrompt: Prompt Engineering Towards Zero-Shot and Interpretable Log Analysis

paper_url: http://arxiv.org/abs/2308.07610
repo_url: None
paper_authors: Yilun Liu, Shimin Tao, Weibin Meng, Jingyu Wang, Wenbing Ma, Yanqing Zhao, Yuhang Chen, Hao Yang, Yanfei Jiang, Xun Chen
for: 本文提出了一种新的零shot和可解释的系统事件分析方法，以提高系统维护和工程生命周期中的可靠性和抗抗锋性。
methods: 本文使用大型自然语言模型（LLM）进行零shot系统事件分析任务，并采用了一系列高级提示策略，以提高LLM的性能。
results: 实验结果显示， LogPrompt 在九个公开的评估数据集上，在两个任务上表现出色，比既有方法（使用千余个日志）高于50%。此外， LogPrompt 的可解释性得到了专业人员的高度评估（4.42/5）。

Abstract
Automated log analysis is crucial in modern software-intensive systems for ensuring reliability and resilience throughout software maintenance and engineering life cycles. Existing methods perform tasks such as log parsing and log anomaly detection by providing a single prediction value without interpretation. However, given the increasing volume of system events, the limited interpretability of analysis results hinders analysts' trust and their ability to take appropriate actions. Moreover, these methods require substantial in-domain training data, and their performance declines sharply (by up to 62.5%) in online scenarios involving unseen logs from new domains, a common occurrence due to rapid software updates. In this paper, we propose LogPrompt, a novel zero-shot and interpretable log analysis approach. LogPrompt employs large language models (LLMs) to perform zero-shot log analysis tasks via a suite of advanced prompt strategies tailored for log tasks, which enhances LLMs' performance by up to 107.5% compared with simple prompts. Experiments on nine publicly available evaluation datasets across two tasks demonstrate that LogPrompt, despite using no training data, outperforms existing approaches trained on thousands of logs by up to around 50%. We also conduct a human evaluation of LogPrompt's interpretability, with six practitioners possessing over 10 years of experience, who highly rated the generated content in terms of usefulness and readability (averagely 4.42/5). LogPrompt also exhibits remarkable compatibility with open-source and smaller-scale LLMs, making it flexible for practical deployment.

摘要
现代软件强调系统中，自动化日志分析是关键要素，以确保软件稳定性和恢复能力在维护和工程生命周期中。现有方法可以完成日志分析任务，如日志分析和异常日志检测，但是这些方法通常只提供单个预测值而不具备解释。由于系统事件的增加，以及分析结果的有限可读性，分析人员对结果的信任和他们对结果的应用能力受到限制。此外，这些方法通常需要大量域内训练数据，并且在在线enario中（新领域的日志）发现的日志上进行分析时，其性能会下降（最多下降62.5%）。在这篇论文中，我们提出了一种新的零shot和可解释的日志分析方法——LogPrompt。LogPrompt使用大型自然语言模型（LLMs）来实现零shot日志分析任务，通过一组适用于日志任务的高级提示策略，提高LLMs的性能（最多提高107.5%）。我们在九个公共可用的评估数据集上进行了九个任务的实验，并证明了LogPrompt，即使没有使用任何训练数据，可以与已经训练 thousands of logs 的现有方法相比，在两个任务上提高性能（最多提高50%）。我们还进行了人类评估LogPrompt的可解释性，六位具有超过10年经验的实践者对生成的内容进行了评估，并评估结果表明，生成的内容在有用性和可读性方面得分4.42/5。此外，LogPrompt还表现出了remarkable的兼容性，可以与开源和较小规模的LLMs进行实际应用。

VBD-MT Chinese-Vietnamese Translation Systems for VLSP 2022

paper_url: http://arxiv.org/abs/2308.07601
repo_url: None
paper_authors: Hai Long Trieu, Song Kiet Bui, Tan Minh Tran, Van Khanh Tran, Hai An Nguyen
for: 本研究参加了2022年VLSP机器翻译共同任务。
methods: 我们基于神经网络模型的Transformer模型，使用了强大的多语言干扰预测模型mBART进行建构。我们还应用了一种采样方法来进行反向翻译，以利用大规模的可用单语言数据。此外，我们还应用了一些提高翻译质量的方法，包括拟合和后处理。
results: 我们在公共测试集上 achievement 38.9 BLEU在中越翻译和38.0 BLEU在越中翻译 tasks，这些成绩超过了一些强大的基eline。

Abstract
We present our systems participated in the VLSP 2022 machine translation shared task. In the shared task this year, we participated in both translation tasks, i.e., Chinese-Vietnamese and Vietnamese-Chinese translations. We build our systems based on the neural-based Transformer model with the powerful multilingual denoising pre-trained model mBART. The systems are enhanced by a sampling method for backtranslation, which leverage large scale available monolingual data. Additionally, several other methods are applied to improve the translation quality including ensembling and postprocessing. We achieve 38.9 BLEU on ChineseVietnamese and 38.0 BLEU on VietnameseChinese on the public test sets, which outperform several strong baselines.

摘要
我们在VLSP 2022机器翻译共同任务中提交了我们的系统。本年度共同任务中，我们参与了中越文和越文中翻译两个任务。我们基于神经网络模型的Transformer模型，并使用大规模可用的单语言数据进行采样方法进行增强。此外，我们还应用了多种方法来提高翻译质量，包括集成和后处理。在公共测试集上，我们取得了38.9的BLEU指标在中越文翻译和38.0的BLEU指标在越文中翻译，这些成绩超过了一些强大的基线。

A User-Centered Evaluation of Spanish Text Simplification

paper_url: http://arxiv.org/abs/2308.07556
repo_url: None
paper_authors: Adrian de Wynter, Anthony Hevia, Si-Qing Chen
for: 这个论文的目的是评估西班牙语文本简化（TS）系统的生产性，通过复杂句子和复杂词语识别两个 corpora 进行评估。
methods: 这个论文使用了神经网络来比较西班牙语特有的阅读性分数，并显示神经网络在预测用户TS首选项上一直表现出色。作者们还发现，多语言模型在同一任务上下降性能，但所有模型往往围绕幻数统计特征，如句子长度，进行围绕。
results: 作者们发现，神经网络在同一任务上一直表现出色，而且可以准确预测用户TS首选项。同时，作者们发现多语言模型在同一任务上下降性能，并且发现所有模型往往围绕幻数统计特征，如句子长度，进行围绕。

Abstract
We present an evaluation of text simplification (TS) in Spanish for a production system, by means of two corpora focused in both complex-sentence and complex-word identification. We compare the most prevalent Spanish-specific readability scores with neural networks, and show that the latter are consistently better at predicting user preferences regarding TS. As part of our analysis, we find that multilingual models underperform against equivalent Spanish-only models on the same task, yet all models focus too often on spurious statistical features, such as sentence length. We release the corpora in our evaluation to the broader community with the hopes of pushing forward the state-of-the-art in Spanish natural language processing.

摘要
我们对西班牙语文本简化（TS）的评估进行了一种生产系统的研究，通过两个聚合了复杂句子和复杂词的字句 corpus 进行了评估。我们将西班牙语特有的阅读性分数与神经网络进行比较，并发现后者在预测用户对TS的偏好时表现更好。在我们的分析中，我们发现了多语言模型在同一任务上下降表现，然而所有模型都太过注重干扰性的统计特征，如句子长度。我们将我们的评估 corpora 公开发布给广泛的社区，希望能够推动西班牙自然语言处理领域的前沿。

Improving CTC-AED model with integrated-CTC and auxiliary loss regularization

paper_url: http://arxiv.org/abs/2308.08449
repo_url: None
paper_authors: Daobin Zhu, Xiangdong Su, Hongbin Zhang
for: automatic speech recognition (ASR)
methods: Connectionist temporal classification (CTC) and attention-based encoder decoder (AED) joint training, with two fusion methods (DAL and PMP) and auxiliary loss regularization
results: Experimental results show that DAL method performs better in attention rescoring, while PMP method excels in CTC prefix beam search and greedy search.

Abstract
Connectionist temporal classification (CTC) and attention-based encoder decoder (AED) joint training has been widely applied in automatic speech recognition (ASR). Unlike most hybrid models that separately calculate the CTC and AED losses, our proposed integrated-CTC utilizes the attention mechanism of AED to guide the output of CTC. In this paper, we employ two fusion methods, namely direct addition of logits (DAL) and preserving the maximum probability (PMP). We achieve dimensional consistency by adaptively affine transforming the attention results to match the dimensions of CTC. To accelerate model convergence and improve accuracy, we introduce auxiliary loss regularization for accelerated convergence. Experimental results demonstrate that the DAL method performs better in attention rescoring, while the PMP method excels in CTC prefix beam search and greedy search.

摘要
Connectionist temporal classification (CTC) 和 attention-based encoder decoder (AED) 的共同训练已经广泛应用在自动语音识别（ASR）中。与大多数混合模型不同，我们提出的集成-CTC 使用 AED 的注意力机制来导引 CTC 的输出。在这篇论文中，我们采用了两种合并方法，namely direct addition of logits (DAL) 和 preserving the maximum probability (PMP)。我们通过适应性折射变换来保持维度的一致性，以适应 CTC 的维度。为了加速模型的启动和提高准确性，我们引入了辅助损失补偿。实验结果表明，DAL 方法在注意力重新评分中表现更好，而 PMP 方法在 CTC 前缀搜索和扩散搜索中表现更好。

CALYPSO: LLMs as Dungeon Masters’ Assistants

paper_url: http://arxiv.org/abs/2308.07540
repo_url: https://github.com/northern-lights-province/calypso-aiide-artifact
paper_authors: Andrew Zhu, Lara J. Martin, Andrew Head, Chris Callison-Burch
for: 这篇论文的目的是探讨用大自然语言模型（LLM）在桌面角色扮演游戏（D&D）中的应用场景，以及这些技术在桌面游戏中的可能性。
methods: 该论文使用了大自然语言模型（GPT-3和ChatGPT）来生成合理的自然语言文本，并通过与游戏导ilder（DM）进行形成评估，以确定LLM在D&D中的应用场景。
results: 研究发现，当给DM们提供LLM-力Point的支持时，他们表示可以 direktly present高品质的自然语言文本给玩家，以及低品质的想法，以便继续保持创作主义。这种方法可以帮助DMs在游戏中提供更多的创新和灵感，而无需干扰他们的创作过程。

Abstract
The role of a Dungeon Master, or DM, in the game Dungeons & Dragons is to perform multiple tasks simultaneously. The DM must digest information about the game setting and monsters, synthesize scenes to present to other players, and respond to the players' interactions with the scene. Doing all of these tasks while maintaining consistency within the narrative and story world is no small feat of human cognition, making the task tiring and unapproachable to new players. Large language models (LLMs) like GPT-3 and ChatGPT have shown remarkable abilities to generate coherent natural language text. In this paper, we conduct a formative evaluation with DMs to establish the use cases of LLMs in D&D and tabletop gaming generally. We introduce CALYPSO, a system of LLM-powered interfaces that support DMs with information and inspiration specific to their own scenario. CALYPSO distills game context into bite-sized prose and helps brainstorm ideas without distracting the DM from the game. When given access to CALYPSO, DMs reported that it generated high-fidelity text suitable for direct presentation to players, and low-fidelity ideas that the DM could develop further while maintaining their creative agency. We see CALYPSO as exemplifying a paradigm of AI-augmented tools that provide synchronous creative assistance within established game worlds, and tabletop gaming more broadly.

摘要

Finding Stakeholder-Material Information from 10-K Reports using Fine-Tuned BERT and LSTM Models

paper_url: http://arxiv.org/abs/2308.07522
repo_url: None
paper_authors: Victor Zitian Chen
For: The paper aims to identify stakeholder-material information in annual 10-K reports to help companies and investors efficiently extract material information.* Methods: The authors fine-tuned BERT models and RNN models with LSTM layers to identify stakeholder-material information, using business expert-labeled training data.* Results: The best model achieved an accuracy of 0.904 and an F1 score of 0.899 in test data, significantly outperforming the baseline model.

Abstract
All public companies are required by federal securities law to disclose their business and financial activities in their annual 10-K reports. Each report typically spans hundreds of pages, making it difficult for human readers to identify and extract the material information efficiently. To solve the problem, I have fine-tuned BERT models and RNN models with LSTM layers to identify stakeholder-material information, defined as statements that carry information about a company's influence on its stakeholders, including customers, employees, investors, and the community and natural environment. The existing practice uses keyword search to identify such information, which is my baseline model. Using business expert-labeled training data of nearly 6,000 sentences from 62 10-K reports published in 2022, the best model has achieved an accuracy of 0.904 and an F1 score of 0.899 in test data, significantly above the baseline model's 0.781 and 0.749 respectively. Furthermore, the same work was replicated on more granular taxonomies, based on which four distinct groups of stakeholders (i.e., customers, investors, employees, and the community and natural environment) are tested separately. Similarly, fined-tuned BERT models outperformed LSTM and the baseline. The implications for industry application and ideas for future extensions are discussed.

摘要
(Simplified Chinese translation)所有公开公司都需要根据联邦证券法披露其业务和财务活动在每年的10-K报告中。每份报告通常包含数百页的内容，使得人类读者很难快速 identificar和提取重要信息。为解决这个问题，我已经细化BERT模型和RNN模型的LSTM层来标识利益相关者材料信息，其定义为公司对利益相关者（包括客户、员工、投资者和社区和自然环境）的影响信息。现行做法使用关键词搜索来标识这类信息，这是我的基线模型。使用2022年62份10-K报告中的商业专家标注训练数据（约6,000句），最佳模型在测试数据中达到了0.904的准确率和0.899的F1得分，与基线模型的0.781和0.749分别显著上升。此外，同样的工作也在更细化的分类中进行了重复，基于这四个不同的利益相关者组（即客户、投资者、员工和社区和自然环境）进行了分开测试。同样，细化BERT模型也超过了LSTM和基线模型。关于业务应用和未来扩展的想法都是讨论的。

Data Race Detection Using Large Language Models

paper_url: http://arxiv.org/abs/2308.07505
repo_url: https://github.com/chrisneagu/FTC-Skystone-Dark-Angels-Romania-2020
paper_authors: Le Chen, Xianzhong Ding, Murali Emani, Tristan Vanderbruggen, Pei-hung Lin, Chuanhua Liao
for: 本研究旨在探讨一种基于大语言模型（LLM）的数据竞争检测方法，以代替手动创建资源投入庞大的工具。
methods: 本研究使用了提示工程和精度调整技术，创建了专门的DRB-ML数据集，并使用了代表性的LLM和开源LLM进行评估。
results: 研究显示，LLM可以成为数据竞争检测的可能性，但是它们还无法与传统数据竞争检测工具相比提供详细的变量对 causing 数据竞争的信息。

Abstract
Large language models (LLMs) are demonstrating significant promise as an alternate strategy to facilitate analyses and optimizations of high-performance computing programs, circumventing the need for resource-intensive manual tool creation. In this paper, we explore a novel LLM-based data race detection approach combining prompting engineering and fine-tuning techniques. We create a dedicated dataset named DRB-ML, which is derived from DataRaceBench, with fine-grain labels showing the presence of data race pairs and their associated variables, line numbers, and read/write information. DRB-ML is then used to evaluate representative LLMs and fine-tune open-source ones. Our experiment shows that LLMs can be a viable approach to data race detection. However, they still cannot compete with traditional data race detection tools when we need detailed information about variable pairs causing data races.

摘要

SOTASTREAM: A Streaming Approach to Machine Translation Training

paper_url: http://arxiv.org/abs/2308.07489
repo_url: https://github.com/marian-nmt/sotastream
paper_authors: Matt Post, Thamme Gowda, Roman Grundkiewicz, Huda Khayrallah, Rohit Jain, Marcin Junczys-Dowmunt
for: 这 paper aims to address the limitations of traditional data preparation methods for machine translation toolkits, which can be time-consuming, expensive, and cumbersome.
methods: The proposed approach separates the generation of data from its consumption, allowing for on-the-fly modifications and eliminating the need for a separate pre-processing step.
results: The proposed approach reduces training time, adds flexibility, reduces experiment management complexity, and reduces disk space without affecting the accuracy of the trained models.Here’s the simplified Chinese text:
for: 这 paper 的目的是解决传统机器翻译工具集的数据准备方法的限制，这些方法可能会占用很多时间、成本和复杂度。
methods: 提议的方法是将数据生成和数据消耗分离开来，这样允许在实际使用过程中进行实时修改，并完全消除预处理步骤。
results: 提议的方法可以减少训练时间、添加灵活性、减少实验管理复杂度和减少磁盘空间，而不影响训练出来的模型的准确性。

Abstract
Many machine translation toolkits make use of a data preparation step wherein raw data is transformed into a tensor format that can be used directly by the trainer. This preparation step is increasingly at odds with modern research and development practices because this process produces a static, unchangeable version of the training data, making common training-time needs difficult (e.g., subword sampling), time-consuming (preprocessing with large data can take days), expensive (e.g., disk space), and cumbersome (managing experiment combinatorics). We propose an alternative approach that separates the generation of data from the consumption of that data. In this approach, there is no separate pre-processing step; data generation produces an infinite stream of permutations of the raw training data, which the trainer tensorizes and batches as it is consumed. Additionally, this data stream can be manipulated by a set of user-definable operators that provide on-the-fly modifications, such as data normalization, augmentation or filtering. We release an open-source toolkit, SOTASTREAM, that implements this approach: https://github.com/marian-nmt/sotastream. We show that it cuts training time, adds flexibility, reduces experiment management complexity, and reduces disk space, all without affecting the accuracy of the trained models.

摘要
许多机器翻译工具包括一个数据准备步骤，将原始数据转换成可直接用于训练的张量格式。这个过程在现代研发实践中变得越来越不合适，因为这会生成一个静态、不可变的版本的训练数据，使得一些常见的训练时间需求（如字符抽样）变得困难、时间consuming（处理大量数据可以花费多天）、昂贵（如磁盘空间）和困难（实验组合管理）。我们提出一种新的方法，即将数据生成与数据消耗分离开来。在这种方法中，没有单独的预处理步骤；数据生成生成了无限长的Permutation序列，这些 permutation被训练者张量化并批处理，直到它们被消耗。此外，这个数据流可以通过用户定义的操作符进行实时修改，例如数据Normalization、扩展或筛选。我们发布了一个开源工具kit，SOTASTREAM，实现了这种方法：https://github.com/marian-nmt/sotastream。我们表明，它可以减少训练时间，添加灵活性，降低实验管理复杂度，并降低磁盘空间，而无需影响训练出来的模型准确性。

O-1: Self-training with Oracle and 1-best Hypothesis

paper_url: http://arxiv.org/abs/2308.07486
repo_url: None
paper_authors: Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Kartik Audhkhasi
for: 提高Speech Recognition训练的准确率和评估 metrics
methods: 使用O-1自适应目标函数，可以处理both超级vised和无级vised数据，并且可以减少训练偏见
results: O-1对SpeechStew数据集和一个大规模的内部数据集进行评估，与EMBR相比，O-1可以将实际和oracle表现之间的差距减少80%，并且在不同的SpeechStew数据集上实现13%-25%的相对改善，对EMBR训练的内部数据集也可以减少12%的差距。总的来说，O-1可以提高WER的准确率9%。

Abstract
We introduce O-1, a new self-training objective to reduce training bias and unify training and evaluation metrics for speech recognition. O-1 is a faster variant of Expected Minimum Bayes Risk (EMBR), that boosts the oracle hypothesis and can accommodate both supervised and unsupervised data. We demonstrate the effectiveness of our approach in terms of recognition on publicly available SpeechStew datasets and a large-scale, in-house data set. On Speechstew, the O-1 objective closes the gap between the actual and oracle performance by 80\% relative compared to EMBR which bridges the gap by 43\% relative. O-1 achieves 13\% to 25\% relative improvement over EMBR on the various datasets that SpeechStew comprises of, and a 12\% relative gap reduction with respect to the oracle WER over EMBR training on the in-house dataset. Overall, O-1 results in a 9\% relative improvement in WER over EMBR, thereby speaking to the scalability of the proposed objective for large-scale datasets.

摘要
我们介绍O-1，一个新的自我训练目标，用于降低训练偏见和统一训练和评估指标 для语音识别。O-1是EMBR的快速版本，可以提高oracle假设，并可以处理both监控和无监控数据。我们透过使用O-1目标，在公开ailable的SpeechStew数据集和一个大规模的内部数据集上进行评估。在SpeechStew上，O-1目标可以关闭实际和oracle性能之间的差距 by 80% relative compared to EMBR，而EMBR则可以关闭差距 by 43% relative。O-1在不同的SpeechStew数据集上的表现亮眼，比EMBR高13%到25% relative，并且与oracle WER之间的差距降低12% relative。总的来说，O-1对EMBR的WER进行了9%的相对改善，证明了O-1目标的扩展性。

Development and Evaluation of Three Chatbots for Postpartum Mood and Anxiety Disorders

paper_url: http://arxiv.org/abs/2308.07407
repo_url: None
paper_authors: Xuewen Yao, Miriam Mikhelson, S. Craig Watkins, Eunsol Choi, Edison Thomaz, Kaya de Barbaro
for: 本研究的目的是开发一些聊天机器人，以提供新生婴期护理者所需的情感支持。
methods: 我们使用了规则引导的和生成模型，以提供上下文特定的同情支持。
results: 我们的规则引导模型表现最佳，其输出与真实参考数据几乎相同，同时含有最高水平的同情。人工用户对规则引导聊天机器人表示喜欢，因为它的回答具有上下文特定和人类化的特点。生成模型也能生成同情的回答，但由于训练数据的限制，它的回答经常具有含糊不清的问题。

Abstract
In collaboration with Postpartum Support International (PSI), a non-profit organization dedicated to supporting caregivers with postpartum mood and anxiety disorders, we developed three chatbots to provide context-specific empathetic support to postpartum caregivers, leveraging both rule-based and generative models. We present and evaluate the performance of our chatbots using both machine-based metrics and human-based questionnaires. Overall, our rule-based model achieves the best performance, with outputs that are close to ground truth reference and contain the highest levels of empathy. Human users prefer the rule-based chatbot over the generative chatbot for its context-specific and human-like replies. Our generative chatbot also produced empathetic responses and was described by human users as engaging. However, limitations in the training dataset often result in confusing or nonsensical responses. We conclude by discussing practical benefits of rule-based vs. generative models for supporting individuals with mental health challenges. In light of the recent surge of ChatGPT and BARD, we also discuss the possibilities and pitfalls of large language models for digital mental healthcare.

摘要
合作 Postpartum Support International (PSI) 非营利组织，我们开发了三个聊天机器人，以提供适应性强的同理支持给孕后照顾者，利用规则基本和生成模型。我们对聊天机器人的表现进行评估，使用机器人和人类Questionnaire。总的来说，我们的规则基本模型实现了最好的表现，输出与真实参照接近，同时具有最高水平的同理。人类用户对规则基本聊天机器人的喜欢度最高，因为它的回答具有Context-specific和人类化的特点。我们的生成模型也生成了同理的回答，但是训练数据的限制导致它们的回答有时会很混乱或无意义。我们 conclude 规则基本模型和生成模型在支持人们 mental health 挑战时的实际效用，以及 ChatGPT 和 BARD 等大语言模型在数字 mental healthcare 中的可能性和风险。

Text Injection for Capitalization and Turn-Taking Prediction in Speech Models

paper_url: http://arxiv.org/abs/2308.07395
repo_url: None
paper_authors: Shaan Bijwadia, Shuo-yiin Chang, Weiran Wang, Zhong Meng, Hao Zhang, Tara N. Sainath
for: 提高 auxiliary 任务表现（非ASR任务）
methods: 使用文本注入（JEIT）训练 ASR 模型，并在两个 auxiliary 任务上进行训练
results: 文本注入方法可以提高长尾数据的首字母排序性能，并提高转接推断精度

Abstract
Text injection for automatic speech recognition (ASR), wherein unpaired text-only data is used to supplement paired audio-text data, has shown promising improvements for word error rate. This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an E2E model. In this work, we use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model which performs two auxiliary tasks. The first is capitalization, which is a de-normalization task. The second is turn-taking prediction, which attempts to identify whether a user has completed their conversation turn in a digital assistant interaction. We show results demonstrating that our text injection method boosts capitalization performance for long-tail data, and improves turn-taking detection recall.

摘要
文本注入技术可以用于自动语音识别（ASR），其中使用无对应的文本数据来补充带有音频数据的对应数据，有效地降低了单词错误率。本研究探讨了文本注入技术在辅助任务中的应用，这些任务通常是END-TO-END模型完成的非ASR任务。在这个工作中，我们使用了结合端到端和内部语言模型训练（JEIT）作为我们的文本注入算法，用于训练一个ASR模型，该模型完成了两个辅助任务。第一个是字母大小 normalization 任务，第二个是判断用户是否已经完成了在数字助手交互中的对话转移。我们的实验结果表明，我们的文本注入方法可以提高长尾数据中的字母大小正确率，并提高了对话转移检测的准确率。

Using Text Injection to Improve Recognition of Personal Identifiers in Speech

paper_url: http://arxiv.org/abs/2308.07393
repo_url: None
paper_authors: Yochai Blau, Rohan Agrawal, Lior Madmony, Gary Wang, Andrew Rosenberg, Zhehuai Chen, Zorik Gekhman, Genady Beryozkin, Parisa Haghani, Bhuvana Ramabhadran
for: 提高自动语音识别（ASR）系统中个人特定信息（PII）的识别率。
methods: 使用文本插入法将假文本替换PII类别，以提高训练数据中PII类别的识别率。
results: 在医疗记录中提高了名称和日期的回忆率，同时提高了总的word error rate（WER）。对数字序列也显示了改善Character Error Rate和句子准确率。

Abstract
Accurate recognition of specific categories, such as persons' names, dates or other identifiers is critical in many Automatic Speech Recognition (ASR) applications. As these categories represent personal information, ethical use of this data including collection, transcription, training and evaluation demands special care. One way of ensuring the security and privacy of individuals is to redact or eliminate Personally Identifiable Information (PII) from collection altogether. However, this results in ASR models that tend to have lower recognition accuracy of these categories. We use text-injection to improve the recognition of PII categories by including fake textual substitutes of PII categories in the training data using a text injection method. We demonstrate substantial improvement to Recall of Names and Dates in medical notes while improving overall WER. For alphanumeric digit sequences we show improvements to Character Error Rate and Sentence Accuracy.

摘要
准确地识别特定类别，如人名、日期等标识信息，在自动语音识别（ASR）应用中是非常重要的。这些类别代表个人信息，因此对这些数据的采集、译写、训练和评估需要特殊的注意。一种方法是完全不收集人类标识信息（PII），但这会导致ASR模型对这些类别的识别精度下降。我们使用文本插入法来提高PII类别的识别精度，通过在训练数据中插入假文本substitute来实现。我们在医疗笔记中示出了大幅提高名称和日期的回忆率，同时提高总的word Error Rate。对于字符串数字序列，我们示出了字符错误率和句子准确率的改善。

paper_url: http://arxiv.org/abs/2308.07317
repo_url: https://github.com/arielnlee/Platypus
paper_authors: Ariel N. Lee, Cole J. Hunter, Nataniel Ruiz
for: 这个论文是为了描述一个名为Platypus的家族 Large Language Models (LLMs)，它们在 HuggingFace 的开放 LLM Leaderboard 上达到了最高的表现并现在位于第一名。
methods: 这个论文使用了一个名为 Open-Platypus 的精心准备和合并 LoRA 模块，以保留预训练 LLMs 的强大优先知识，同时将特定领域知识带到表面。
results: Platypus 家族在量化 LLM 度量上表现出色，在模型尺寸上占据了全球 Open LLM leaderboard 的排名，而使用的 fine-tuning 数据和总计算量只是其他 state-of-the-art fine-tuned LLMs 所需的一小部分。例如，一个 13B Platypus 模型可以在单个 A100 GPU 上使用 25k 问题进行 5 小时的训练。这是 Open-Platypus 数据集的质量的证明，并开启了更多改进的可能性。

Abstract
We present $\textbf{Platypus}$, a family of fine-tuned and merged Large Language Models (LLMs) that achieves the strongest performance and currently stands at first place in HuggingFace's Open LLM Leaderboard as of the release date of this work. In this work we describe (1) our curated dataset $\textbf{Open-Platypus}$, that is a subset of other open datasets and which $\textit{we release to the public}$ (2) our process of fine-tuning and merging LoRA modules in order to conserve the strong prior of pretrained LLMs, while bringing specific domain knowledge to the surface (3) our efforts in checking for test data leaks and contamination in the training data, which can inform future research. Specifically, the Platypus family achieves strong performance in quantitative LLM metrics across model sizes, topping the global Open LLM leaderboard while using just a fraction of the fine-tuning data and overall compute that are required for other state-of-the-art fine-tuned LLMs. In particular, a 13B Platypus model can be trained on $\textit{a single}$ A100 GPU using 25k questions in 5 hours. This is a testament of the quality of our Open-Platypus dataset, and opens opportunities for more improvements in the field. Project page: https://platypus-llm.github.io

摘要
我们现在提出了$\textbf{ Platypus}$家族，这是一些精心调整和合并的大语言模型（LLMs），它在HuggingFace的开源LLM排名榜上 currently stands at first place as of the release date of this work. 在这个工作中，我们描述了我们的手动抽象 dataset $\textbf{Open-Platypus}$，这是其他开放数据集的一个子集，并且 $\textit{我们向公众发布了这些数据}$。我们的过程包括了精心调整和合并LoRA模块，以保留预训练LLMs的强大优先知识，同时将特定领域知识带到表面。我们还尽可能地检查测试数据泄露和训练数据污染，以便未来的研究。特别是，Platypus家族在量化LLM指标中表现出色，在模型尺寸上占据全球开源LLM排名榜首位，而使用的是比其他 state-of-the-art 精心调整LLMs的一部分的精心调整数据和总计算资源。例如，一个13B Platypus模型可以在 $\textit{单个}$ A100 GPU 上使用 25k 问题，在 5 小时内训练完成。这是一个证明我们 Open-Platypus 数据集的质量，并开创了更多的改进机会。项目页面：https://platypus-llm.github.io

The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation

paper_url: http://arxiv.org/abs/2308.07286
repo_url: None
paper_authors: Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André F. T. Martins, Graham Neubig, Ankush Garg, Jonathan H. Clark, Markus Freitag, Orhan Firat
for: 这篇论文是为了提供一种自动评估机器翻译（MT）系统的方法，以便在MT系统的快速迭代发展中进行评估。
methods: 这篇论文使用了大语言模型（LLM）的理解和在场景学习能力，并让它们标注翻译中的错误。
results: 研究发现，使用AutoMQM技术可以提高MT系统的性能，特别是使用更大的模型时。此外，AutoMQM还提供了解释性的错误块，与人工标注相Alignment。

Abstract
Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems. While considerable progress has been made on estimating a single scalar quality score, current metrics lack the informativeness of more detailed schemes that annotate individual errors, such as Multidimensional Quality Metrics (MQM). In this paper, we help fill this gap by proposing AutoMQM, a prompting technique which leverages the reasoning and in-context learning capabilities of large language models (LLMs) and asks them to identify and categorize errors in translations. We start by evaluating recent LLMs, such as PaLM and PaLM-2, through simple score prediction prompting, and we study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores (with particularly large gains for larger models) while providing interpretability through error spans that align with human annotations.

摘要
自动评估机器翻译（MT）是翻译系统的快速迭代发展的重要工具。虽然已经取得了较大的进步，但当前的度量仍然缺乏详细的错误标注，如多维质量指标（MQM）。在这篇文章中，我们帮助填补这个空白，并提出了AutoMQM技术，它利用大型自然语言模型（LLM）的理解和上下文学习能力，并让它们标注和分类翻译中的错误。我们首先通过对最近的LLM，如PaLM和PaLM-2，进行简单的分数预测提问，并研究了标注数据的影响。然后，我们评估了AutoMQM技术，并发现它在比只是提问分数时提高性能（尤其是大型模型），并提供了解释性的错误排序。

Comparison between parameter-efficient techniques and full fine-tuning: A case study on multilingual news article classification

paper_url: http://arxiv.org/abs/2308.07282
repo_url: None
paper_authors: Olesya Razuvayevskaya, Ben Wu, Joao A. Leite, Freddy Heppell, Ivan Srba, Carolina Scarton, Kalina Bontcheva, Xingyi Song
for: 本研究旨在investigating parameter-efficient fine-tuning techniques的影响于多语言文本分类任务（类型、框架和说服技巧检测），包括不同的输入长度、预测类数和分类难度。
methods: 本研究使用了Adaptors和LoRA技术来实现parameter-efficient fine-tuning，并进行了对不同训练场景（训练原始多语言数据、翻译成英语和英语只数据）和不同语言的深入分析。
results: 研究发现，在多语言文本分类任务中，Adaptors和LoRA技术可以减少训练时间和计算成本，并且在某些情况下可以提高性能。

Abstract
Adapters and Low-Rank Adaptation (LoRA) are parameter-efficient fine-tuning techniques designed to make the training of language models more efficient. Previous results demonstrated that these methods can even improve performance on some classification tasks. This paper complements the existing research by investigating how these techniques influence the classification performance and computation costs compared to full fine-tuning when applied to multilingual text classification tasks (genre, framing, and persuasion techniques detection; with different input lengths, number of predicted classes and classification difficulty), some of which have limited training data. In addition, we conduct in-depth analyses of their efficacy across different training scenarios (training on the original multilingual data; on the translations into English; and on a subset of English-only data) and different languages. Our findings provide valuable insights into the applicability of the parameter-efficient fine-tuning techniques, particularly to complex multilingual and multilabel classification tasks.

摘要
这篇文章进一步探讨了微调和低阶化适应（LoRA）技术的影响，它们是用于对语言模型进行更有效的训练。过往的研究显示这些技术可以提高一些分类任务的性能。本文在多种多元文本分类任务（文类、几何、说服等）中进行了广泛的实验，包括有限的训练数据。此外，我们还进行了不同训练enario（训练原始多种语言数据；训练英文翻译；和使用英文数据subset）和不同语言的深入分析。我们的发现将有价值的帮助在复杂的多种语言和多类分类任务中应用这些参数有效的微调技术。

Dialogue for Prompting: a Policy-Gradient-Based Discrete Prompt Optimization for Few-shot Learning

paper_url: http://arxiv.org/abs/2308.07272
repo_url: None
paper_authors: Chengzhengxu Li, Xiaoming Liu, Yichen Wang, Duyi Li, Yu Lan, Chao Shen
for: 提高几何学NLU任务中的表现，减少专家知识和人工干预。
methods: 对PLMs进行对话分析，设计可读性检测 metric，使用RL框架和政策网络进行优化。
results: 在四个开源数据集上，DP_2O方法在几何学NLU任务中的表现高于SOTA方法1.52%，并且具有良好的通用性、Robustness和普适性。

Abstract
Prompt-based pre-trained language models (PLMs) paradigm have succeeded substantially in few-shot natural language processing (NLP) tasks. However, prior discrete prompt optimization methods require expert knowledge to design the base prompt set and identify high-quality prompts, which is costly, inefficient, and subjective. Meanwhile, existing continuous prompt optimization methods improve the performance by learning the ideal prompts through the gradient information of PLMs, whose high computational cost, and low readability and generalizability are often concerning. To address the research gap, we propose a Dialogue-comprised Policy-gradient-based Discrete Prompt Optimization ($DP_2O$) method. We first design a multi-round dialogue alignment strategy for readability prompt set generation based on GPT-4. Furthermore, we propose an efficient prompt screening metric to identify high-quality prompts with linear complexity. Finally, we construct a reinforcement learning (RL) framework based on policy gradients to match the prompts to inputs optimally. By training a policy network with only 0.67% of the PLM parameter size on the tasks in the few-shot setting, $DP_2O$ outperforms the state-of-the-art (SOTA) method by 1.52% in accuracy on average on four open-source datasets. Moreover, subsequent experiments also demonstrate that $DP_2O$ has good universality, robustness, and generalization ability.

摘要

Fun Paper

2023-08-15

cs.CL - 2023-08-15

DS4DH at #SMM4H 2023: Zero-Shot Adverse Drug Events Normalization using Sentence Transformers and Reciprocal-Rank Fusion

Enhancing Visually-Rich Document Understanding via Layout Structure Modeling

SPM: Structured Pretraining and Matching Architectures for Relevance Modeling in Meituan Search

Better Zero-Shot Reasoning with Role-Play Prompting

Attention Is Not All You Need Anymore

SEER: Super-Optimization Explorer for HLS using E-graph Rewriting with MLIR

Steering Language Generation: Harnessing Contrastive Expert Guidance and Negative Prompting for Coherent and Diverse Synthetic Data Generation

LogPrompt: Prompt Engineering Towards Zero-Shot and Interpretable Log Analysis

VBD-MT Chinese-Vietnamese Translation Systems for VLSP 2022

A User-Centered Evaluation of Spanish Text Simplification

Improving CTC-AED model with integrated-CTC and auxiliary loss regularization

CALYPSO: LLMs as Dungeon Masters’ Assistants

Finding Stakeholder-Material Information from 10-K Reports using Fine-Tuned BERT and LSTM Models

Data Race Detection Using Large Language Models

SOTASTREAM: A Streaming Approach to Machine Translation Training

O-1: Self-training with Oracle and 1-best Hypothesis

Development and Evaluation of Three Chatbots for Postpartum Mood and Anxiety Disorders

Text Injection for Capitalization and Turn-Taking Prediction in Speech Models

Using Text Injection to Improve Recognition of Personal Identifiers in Speech

Platypus: Quick, Cheap, and Powerful Refinement of LLMs

The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation

Comparison between parameter-efficient techniques and full fine-tuning: A case study on multilingual news article classification

Dialogue for Prompting: a Policy-Gradient-Based Discrete Prompt Optimization for Few-shot Learning