cs.CL - 2023-11-24

Tracing Influence at Scale: A Contrastive Learning Approach to Linking Public Comments and Regulator Responses

paper_url: http://arxiv.org/abs/2311.14871
repo_url: None
paper_authors: Linzi Xing, Brad Hackinen, Giuseppe Carenini
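for: The paper addresses the difficulty of measuring the influence of the more than one million public comment letters that U.S. federal regulators receive each year.
methods: The paper uses an iterative contrastive method to train a neural model that matches text from public comments to the response text written by regulators.
results: Experiments show that the method substantially outperforms a set of selected text-matching baselines, performs comparably to the most advanced language model (GPT-4), and is more cost-effective when matching comments and regulator responses at large scale.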

Abstract
U.S. Federal Regulators receive over one million comment letters each year from businesses, interest groups, and members of the public, all advocating for changes to proposed regulations. These comments are believed to have wide-ranging impacts on public policy. However, measuring the impact of specific comments is challenging because regulators are required to respond to comments but they do not have to specify which comments they are addressing. In this paper, we propose a simple yet effective solution to this problem by using an iterative contrastive method to train a neural model to match text from public comments to responses written by regulators. We demonstrate that our proposal substantially outperforms a set of selected text-matching baselines on a human-annotated test set. Furthermore, it delivers performance comparable to the most advanced gigantic language model (i.e., GPT-4), and is more cost-effective when matching comments and regulator responses at larger scale.
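
The abstract does not spell out the training objective, so the following is only a minimal sketch of a generic in-batch contrastive (InfoNCE-style) loss, which may differ from the paper's iterative procedure. It assumes comment and response texts have already been encoded as vectors. The function name `info_nce_loss`, the `temperature` value, and the toy vectors are all hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(comment_vecs, response_vecs, temperature=0.1):
    """Mean cross-entropy of picking the paired response for each comment.

    comment_vecs[i] and response_vecs[i] are assumed to be a matched pair;
    every other response in the batch serves as a negative example.
    """
    losses = []
    for i, c in enumerate(comment_vecs):
        logits = [dot(c, r) / temperature for r in response_vecs]
        m = max(logits)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): two comments and two candidate responses.
comments = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # response i matches comment i
swapped = [[0.0, 1.0], [1.0, 0.0]]   # pairs deliberately mismatched

loss_aligned = info_nce_loss(comments, aligned)
loss_swapped = info_nce_loss(comments, swapped)
```

The loss is small when each comment is most similar to its own response and large when the pairing is wrong, which is the signal a matching model would be trained on.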

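Summary
U.S. federal regulators receive more than one million public comment letters each year from businesses, interest groups, and members of the public, all advocating changes to proposed regulations. These comments are believed to have wide-ranging effects on public policy. Measuring the impact of a specific comment is difficult, however, because regulators must respond to comments but need not say which ones they are addressing. The paper proposes a simple yet effective solution: an iterative contrastive method trains a neural model to match public comments to regulator responses. On a human-annotated test set, the approach significantly outperforms a set of selected text-matching baselines. It also performs comparably to the most advanced large language model (GPT-4) while being more economical when matching comments and responses at larger scale.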

OpusCleaner and OpusTrainer, open source toolkits for training Machine Translation and Large language models

paper_url: http://arxiv.org/abs/2311.14838
repo_url: https://github.com/hplt-project/opustrainer
paper_authors: Nikolay Bogoychev, Jelmer van der Linde, Graeme Nail, Barry Haddow, Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Lukas Weymann, Tudor Nicolae Mateiu, Jindřich Helcl, Mikko Aulamo
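for: Improve the quality and robustness of machine translation systems and lower the entry barrier for newcomers.
methods: Provides two tools, OpusCleaner and OpusTrainer, that simplify data downloading, cleaning, preprocessing, and data scheduling for building large-scale machine translation systems and language models.
results: Using the two tools, the authors build high-quality machine translation models that are robust to noisy user input, as well as multilingual models and terminology-aware models.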

Abstract
Developing high quality machine translation systems is a labour intensive, challenging and confusing process for newcomers to the field. We present a pair of tools OpusCleaner and OpusTrainer that aim to simplify the process, reduce the amount of work and lower the entry barrier for newcomers. OpusCleaner is a data downloading, cleaning, and preprocessing toolkit. It is designed to allow researchers to quickly download, visualise and preprocess bilingual (or monolingual) data that comes from many different sources, each of them with different quality, issues, and unique filtering/preprocessing requirements. OpusTrainer is a data scheduling and data augmenting tool aimed at building large scale, robust machine translation systems and large language models. It features deterministic data mixing from many different sources, on-the-fly data augmentation and more. Using these tools, we showcase how they can be used to create high quality machine translation models robust to noisy user input, multilingual models, and terminology aware models.
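
To illustrate what "deterministic data mixing" can mean in practice, here is a minimal sketch that is not OpusTrainer's actual API or configuration format. It assumes each source is an in-memory list of lines. The function `mix_batches`, the weights, and the seed are hypothetical names and values.

```python
import random
from itertools import cycle


def mix_batches(sources, weights, batch_size, num_batches, seed=1111):
    """Yield batches whose composition follows `weights`, reproducibly.

    A fixed seed gives an identical sampling schedule on every run, so two
    training runs see the same data in the same order.
    """
    rng = random.Random(seed)
    iters = {name: cycle(lines) for name, lines in sources.items()}
    names = list(sources)
    probs = [weights[name] for name in names]
    for _ in range(num_batches):
        picks = rng.choices(names, weights=probs, k=batch_size)
        yield [next(iters[name]) for name in picks]


# Hypothetical toy corpora: a clean source and a noisier one.
sources = {"clean": ["c1", "c2", "c3"], "noisy": ["n1", "n2"]}
weights = {"clean": 0.8, "noisy": 0.2}

run_a = list(mix_batches(sources, weights, batch_size=4, num_batches=3))
run_b = list(mix_batches(sources, weights, batch_size=4, num_batches=3))
```

Because the random generator is seeded inside the function, `run_a` and `run_b` contain exactly the same batches, which is the property that makes experiments repeatable.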

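Summary
Developing high-quality machine translation systems is a labour-intensive, challenging, and confusing process, especially for newcomers. The authors present a pair of tools, OpusCleaner and OpusTrainer, that simplify the process, reduce the amount of work, and lower the entry barrier. OpusCleaner is a toolkit for downloading, cleaning, and preprocessing data. It lets researchers quickly download, visualise, and preprocess bilingual (or monolingual) data from many different sources, each with its own quality, issues, and filtering/preprocessing requirements. OpusTrainer is a data scheduling and data augmentation tool aimed at building large-scale, robust machine translation systems and large language models. It features deterministic data mixing, on-the-fly data augmentation, and more. Using the two tools, the authors show how to create high-quality machine translation models that are robust to noisy user input, as well as multilingual models and terminology-aware models.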

Custom Data Augmentation for low resource ASR using Bark and Retrieval-Based Voice Conversion

paper_url: http://arxiv.org/abs/2311.14836
repo_url: None
paper_authors: Anand Kamble, Aniket Tathe, Suyash Kumbharkar, Atharva Bhandare, Anirban C. Mitra
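for: Building customized Common Voice datasets for low-resource languages such as Hindi.
methods: Uses the Bark model, enhanced with Meta's enCodec and a pre-trained HuBert model, together with Retrieval-Based Voice Conversion (RVC).
results: Contributes to the advancement of ASR technology and offers a pathway to high-quality, personalized voice generation for a range of applications.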

Abstract
This paper proposes two innovative methodologies to construct customized Common Voice datasets for low-resource languages like Hindi. The first methodology leverages Bark, a transformer-based text-to-audio model developed by Suno, and incorporates Meta's enCodec and a pre-trained HuBert model to enhance Bark's performance. The second methodology employs Retrieval-Based Voice Conversion (RVC) and uses the Ozen toolkit for data preparation. Both methodologies contribute to the advancement of ASR technology and offer valuable insights into addressing the challenges of constructing customized Common Voice datasets for under-resourced languages. Furthermore, they provide a pathway to achieving high-quality, personalized voice generation for a range of applications.

Weak Alignment Supervision from Hybrid Model Improves End-to-end ASR

paper_url: http://arxiv.org/abs/2311.14835
repo_url: None
paper_authors: Jintao Jiang, Yingbo Gao, Zoltan Tuske
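for: The paper aims to create weak alignment supervision to aid end-to-end modeling.
methods: The authors use an existing hybrid ASR system to produce triphone alignments of the training audio, then add a cross-entropy loss at a chosen encoder layer using those alignments.
results: Weak alignment supervision with a label smoothing parameter of 0.5 at the third encoder layer outperforms one-hot cross-entropy losses and CTC losses with loss weighting, and yields about 5% relative WER reduction over the baseline on the TED-LIUM 2 dataset.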

Abstract
In this paper, we aim to create weak alignment supervision to aid the end-to-end modeling. Towards this end, we use the existing hybrid ASR system to produce triphone alignments of the training audios. We then create a cross-entropy loss at a certain layer of the encoder using the derived alignments. In contrast to the general one-hot cross-entropy losses with or without loss weighting, here we use a cross-entropy loss with a label smoothing parameter to regularize the supervision. As a comparison, we also conduct the experiments with one-hot cross-entropy losses and CTC losses with loss weighting. The results show that placing the weak alignment supervision with the label smoothing parameter of 0.5 at the third encoder layer outperforms the other two approaches and leads to about 5% relative WER reduction on the TED-LIUM 2 dataset over the baseline. We see similar improvements when applying the method out-of-the-box on a Tagalog end-to-end ASR system.
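
The label-smoothed cross-entropy mentioned in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the smoothing convention (leftover mass spread uniformly over the other classes), the helper `combined_loss`, and its `weight` parameter are hypothetical choices.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def smoothed_cross_entropy(logits, target, smoothing=0.5):
    """Cross-entropy against a smoothed target distribution.

    The aligned class gets probability 1 - smoothing; the remaining mass is
    spread uniformly over the other classes (one common convention).
    """
    n = len(logits)
    log_probs = log_softmax(logits)
    off_value = smoothing / (n - 1)
    loss = 0.0
    for k, lp in enumerate(log_probs):
        q = (1.0 - smoothing) if k == target else off_value
        loss -= q * lp
    return loss


def combined_loss(main_loss, frame_logits, alignment, weight=1.0, smoothing=0.5):
    """Add the frame-averaged auxiliary alignment loss to the main loss."""
    aux = sum(
        smoothed_cross_entropy(logits, label, smoothing)
        for logits, label in zip(frame_logits, alignment)
    ) / len(alignment)
    return main_loss + weight * aux


# Toy example: two frames, three hypothetical triphone classes.
frame_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
alignment = [0, 1]
total = combined_loss(1.5, frame_logits, alignment)
```

With `smoothing=0.0` the function reduces to ordinary one-hot cross-entropy. Larger values make the supervision weaker, since the model is penalised less for disagreeing with a possibly imperfect alignment.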

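Summary
The paper aims to create weak alignment supervision to aid end-to-end modeling. To that end, an existing hybrid ASR system produces triphone alignments of the training audio, and a cross-entropy loss defined on those alignments is added at a certain encoder layer. Unlike the usual one-hot cross-entropy loss, this loss uses a label smoothing parameter to regularize the supervision. For comparison, experiments with one-hot cross-entropy losses and CTC losses are also run. Placing the weak alignment supervision at the third encoder layer with a label smoothing parameter of 0.5 gives about 5% relative WER reduction over the baseline. Similar improvements are observed on a Tagalog end-to-end ASR system.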

Data-to-Text Bilingual Generation

paper_url: http://arxiv.org/abs/2311.14808
repo_url: None
paper_authors: Guy Lapalme
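for: The paper presents a pyrealb-based method for automatically generating parallel English and French texts from a single source of data.
methods: An object-oriented organization keeps data selection and text organisation shared between the two languages; only the language-dependent word and phrasing choices differ.
results: The realized texts convey identical information in English and French without the risk of being lost in translation, which matters where strict and simultaneous bilingualism is required. The output is also briefly compared with text generated by an instance of a GPT.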

Abstract
This document illustrates the use of pyrealb for generating two parallel texts (English and French) from a single source of data. The data selection and text organisation processes are shared between the two languages; only language dependent word and phrasing choices are distinct processes. The realized texts thus convey identical information in both languages without the risk of being lost in translation. This is especially important in cases where strict and simultaneous bilingualism is required. We first present the types of applications targeted by this approach and how the pyrealb English and French realizer can be used for achieving this goal in a natural way. We describe an object-oriented organization to ensure a convenient realization in both languages. To illustrate the process, different types of applications are then briefly sketched with links to the source code. A brief comparison of the text generation is given with the output of an instance of a GPT.
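
The object-oriented organization described in the abstract can be sketched in plain Python. The sketch deliberately does not call pyrealb: fixed template strings stand in for its realizer, and the class names, the `Reading` record, and the weather data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    city: str
    temp_c: int


class Realizer:
    """Shared pipeline: data selection and organisation are language independent."""

    def select(self, readings):
        # Content selection happens once, identically for every language.
        return max(readings, key=lambda r: r.temp_c)

    def report(self, readings):
        return self.phrase(self.select(readings))

    def phrase(self, reading):
        # Only this step is language dependent.
        raise NotImplementedError


class English(Realizer):
    def phrase(self, reading):
        return f"The warmest city is {reading.city} with {reading.temp_c} degrees."


class French(Realizer):
    def phrase(self, reading):
        return f"La ville la plus chaude est {reading.city} avec {reading.temp_c} degrés."


data = [Reading("Montréal", 21), Reading("Québec", 18)]
texts = {lang.__name__: lang().report(data) for lang in (English, French)}
```

Both subclasses inherit `select` and `report`, so the two texts are guaranteed to describe the same selected fact. Only the wording in `phrase` differs between languages.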

Summary
We will discuss the types of applications that can benefit from this approach and how the pyrealb English and French realizer can be used to achieve this goal in a natural way. We will also describe an object-oriented organization that makes the realization process convenient for both languages. To illustrate the process, we will provide brief sketches of different types of applications with links to the source code. Finally, we will compare the text generated by pyrealb with the output of an instance of a GPT.

One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space

paper_url: http://arxiv.org/abs/2311.14652
repo_url: None
paper_authors: Raghav Addanki, Chenyang Li, Zhao Song, Chiwun Yang
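for: The paper focuses on deploying large language models (LLMs) in streaming applications with long contexts, particularly extended dialogues and text analysis.
methods: It proposes a new algorithm that reduces LLM memory usage in the streaming setting. The algorithm reads the data in a single pass and uses sublinear space.
results: According to the paper's analysis, the algorithm stays memory-efficient on very long inputs: as the token length grows, the error guarantee diminishes while memory usage remains nearly constant.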

Abstract
Deploying Large Language Models (LLMs) in streaming applications that involve long contexts, particularly for extended dialogues and text analysis, is of paramount importance but presents two significant challenges. Firstly, the memory consumption is substantial during the decoding phase due to the caching of Key and Value states (KV) of previous tokens. Secondly, attention computation is time-consuming with a time complexity of $O(n^2)$ for the generation of each token. At the recent OpenAI DevDay (Nov 6, 2023), OpenAI released a new model that is able to support a 128K-long document. In our paper, we focus on the memory-efficiency issue when context length $n$ is much greater than 128K ($n \gg 2^d$). Considering a single-layer self-attention with Query, Key, and Value matrices $Q, K, V \in \mathbb{R}^{n \times d}$, the polynomial method approximates the attention output $T \in \mathbb{R}^{n \times d}$. It accomplishes this by constructing $U_1, U_2 \in \mathbb{R}^{n \times t}$ to expedite attention ${\sf Attn}(Q, K, V)$ computation within $n^{1+o(1)}$ time executions. Despite this, storing the Key and Value matrices $K, V \in \mathbb{R}^{n \times d}$ still necessitates $O( n d)$ space, leading to significant memory usage. In response to these challenges, we introduce a new algorithm that only reads one pass of the data in streaming fashion. This method employs sublinear space $o(n)$ to store three sketch matrices, alleviating the need for exact $K, V$ storage. Notably, our algorithm exhibits exceptional memory-efficient performance with super-long tokens. As the token length $n$ increases, our error guarantee diminishes while the memory usage remains nearly constant. This unique attribute underscores the potential of our technique in efficiently handling LLMs in streaming applications.
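
The sketch below is not the paper's algorithm, which stores three sketch matrices with formal guarantees. It only illustrates the underlying idea that a polynomial approximation of $\exp(q \cdot k)$ lets key/value pairs be folded into a fixed-size state in one pass. It assumes a degree-2 Taylor feature map, which is accurate only when $q \cdot k$ is small. The names `features` and `StreamingAttention` are hypothetical.

```python
import math


def features(x):
    """phi(x) with phi(q).phi(k) = 1 + q.k + (q.k)^2 / 2, a Taylor proxy for exp(q.k)."""
    quad = [a * b / math.sqrt(2.0) for a in x for b in x]
    return [1.0] + list(x) + quad


class StreamingAttention:
    """One-pass accumulator whose state size does not depend on stream length."""

    def __init__(self, d, d_v):
        t = 1 + d + d * d
        self.kv = [[0.0] * d_v for _ in range(t)]  # running sum of phi(k) v^T
        self.z = [0.0] * t                         # running sum of phi(k)

    def update(self, k, v):
        """Fold one (key, value) pair into the state, then discard it."""
        for i, f in enumerate(features(k)):
            self.z[i] += f
            row = self.kv[i]
            for j, vj in enumerate(v):
                row[j] += f * vj

    def query(self, q):
        """Approximate softmax attention output for query q."""
        phi = features(q)
        denom = sum(f * z for f, z in zip(phi, self.z))
        width = len(self.kv[0])
        return [sum(f * row[j] for f, row in zip(phi, self.kv)) / denom
                for j in range(width)]


def exact_attention(q, keys, values):
    """Reference softmax attention that stores all keys and values."""
    w = [math.exp(sum(a * b for a, b in zip(q, k))) for k in keys]
    total = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / total
            for j in range(len(values[0]))]


keys = [[0.1, -0.2], [0.3, 0.1], [-0.1, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = StreamingAttention(d=2, d_v=2)
for k, v in zip(keys, values):
    state.update(k, v)

q = [0.2, 0.1]
approx = state.query(q)
exact = exact_attention(q, keys, values)
```

The state holds $(1 + d + d^2) \times d_v$ numbers regardless of how many tokens have been streamed, in contrast to the $O(nd)$ cache needed by `exact_attention`.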

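Summary
Deploying large language models (LLMs) in streaming applications with long contexts, particularly extended dialogues and text analysis, is very important, but it poses two main challenges. First, memory consumption during decoding is substantial because the Key and Value (KV) states of previous tokens are cached. Second, attention computation has time complexity $O(n^2)$ for the generation of each token. At OpenAI DevDay (Nov 6, 2023), OpenAI released a new model that supports 128K-long documents. The paper focuses on memory efficiency when the context length is much greater than 128K ($n \gg 2^d$). For single-layer self-attention, the polynomial method approximates the attention output $T \in \mathbb{R}^{n \times d}$ by constructing $U_1, U_2 \in \mathbb{R}^{n \times t}$, which brings the attention computation down to $n^{1+o(1)}$ time. Even so, storing the Key and Value matrices $K, V \in \mathbb{R}^{n \times d}$ still requires $O(nd)$ space, which leads to significant memory usage. To address these challenges, the paper introduces a new algorithm that reads the data in a single streaming pass. It uses sublinear space $o(n)$ to store three sketch matrices, removing the need to store $K, V$ exactly. Notably, the algorithm is especially memory-efficient on super-long inputs: as the token length $n$ increases, the error guarantee diminishes while memory usage stays nearly constant. This property underscores the technique's potential for handling LLMs efficiently in streaming applications.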

Machine Translation for Ge’ez Language

paper_url: http://arxiv.org/abs/2311.14530
repo_url: None
paper_authors: Aman Kassahun Wassie
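for: The study aims to improve machine translation for very low-resource languages such as Ge'ez.
methods: The study explores several ways to improve Ge'ez MT: transfer learning from related languages, optimizing shared vocabulary and token segmentation, finetuning large pre-trained models, and using large language models (LLMs) for few-shot translation with fuzzy matches.
results: A multilingual neural machine translation (MNMT) model based on language relatedness improves Ge'ez MT by about 4 BLEU on average over standard bilingual models. Few-shot translation with GPT-3.5 reaches a BLEU score of 9.2, still below the MNMT baseline of 15.2. Finetuning NLLB-200 performs poorly with only 4k training samples.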

Abstract
Machine translation (MT) for low-resource languages such as Ge'ez, an ancient language that is no longer spoken in daily life, faces challenges such as out-of-vocabulary words, domain mismatches, and lack of sufficient labeled training data. In this work, we explore various methods to improve Ge'ez MT, including transfer-learning from related languages, optimizing shared vocabulary and token segmentation approaches, finetuning large pre-trained models, and using large language models (LLMs) for few-shot translation with fuzzy matches. We develop a multilingual neural machine translation (MNMT) model based on language relatedness, which brings an average performance improvement of about 4 BLEU compared to standard bilingual models. We also attempt to finetune the NLLB-200 model, one of the most advanced translation models available today, but find that it performs poorly with only 4k training samples for Ge'ez. Furthermore, we experiment with using GPT-3.5, a state-of-the-art LLM, for few-shot translation with fuzzy matches, which leverages embedding similarity-based retrieval to find context examples from a parallel corpus. We observe that GPT-3.5 achieves a remarkable BLEU score of 9.2 with no initial knowledge of Ge'ez, but this is still lower than the MNMT baseline of 15.2. Our work provides insights into the potential and limitations of different approaches for low-resource and ancient language MT.
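
The few-shot setup with fuzzy matches can be sketched as retrieval followed by prompt construction. This is a minimal illustration only: character n-gram counts stand in for the learned embeddings the paper uses, the corpus entries are placeholders rather than real Ge'ez data, and the prompt layout and function names are hypothetical.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Stand-in embedding: counts of character n-grams (an assumed substitute)."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    num = sum(a[g] * b[g] for g in a if g in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0


def retrieve(query, corpus, k=2):
    """Return the k parallel pairs whose source side is most similar to query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda pair: cosine(q, embed(pair[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, examples):
    """Lay out retrieved pairs as few-shot examples followed by the query."""
    shots = "\n".join(f"Source: {s}\nTarget: {t}" for s, t in examples)
    return f"{shots}\nSource: {query}\nTarget:"


# Placeholder parallel corpus (not real translations).
corpus = [
    ("alpha beta gamma", "T-1"),
    ("delta epsilon", "T-2"),
    ("alpha beta delta", "T-3"),
]
examples = retrieve("alpha beta", corpus, k=2)
prompt = build_prompt("alpha beta", examples)
```

The resulting `prompt` string would be sent to the language model. The closer the retrieved sources are to the query, the more directly the examples can guide the translation.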

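Summary
Machine translation (MT) for low-resource languages such as Ge'ez faces challenges including out-of-vocabulary words, domain mismatches, and insufficient training data. The work explores several ways to improve Ge'ez MT: transfer learning from related languages, optimizing shared vocabulary and token segmentation, finetuning large pre-trained models, and using large language models (LLMs) for few-shot translation. The authors develop a multilingual neural machine translation (MNMT) model based on language relatedness, which brings an average improvement of about 4 BLEU. They also attempt to finetune the NLLB-200 model but find that it performs poorly with only 4,000 Ge'ez training samples. In addition, they try few-shot translation with GPT-3.5, a state-of-the-art LLM, using embedding similarity-based retrieval to find context examples. GPT-3.5 reaches a BLEU score of 9.2 with no initial knowledge of Ge'ez, still lower than the MNMT baseline of 15.2. The work offers insights into the potential and limitations of MT for low-resource and ancient languages.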

tinyCLAP: Distilling Constrastive Language-Audio Pretrained Models

paper_url: http://arxiv.org/abs/2311.14517
repo_url: None
paper_authors: Francesco Paissan, Elisabetta Farella
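for: Reduce the complexity of contrastive language-audio pre-trained models to obtain an efficient model.
methods: A unimodal distillation loss derived from first principles, combined with pruning to reduce the dimensionality of the shared multimodal latent space.
results: tinyCLAP uses only 6% of the original Microsoft CLAP parameters, with a drop of less than 5% in zero-shot classification performance across three sound event detection datasets.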

Abstract
Contrastive Language-Audio Pretraining (CLAP) has become crucially important in the field of audio and speech processing. Its employment ranges from sound event detection to text-to-audio generation. However, one of the main limitations is the considerable amount of data required in the training process and the overall computational complexity during inference. This paper investigates how we can reduce the complexity of contrastive language-audio pre-trained models, yielding an efficient model that we call tinyCLAP. We derive an unimodal distillation loss from first principles and explore how the dimensionality of the shared, multimodal latent space can be reduced via pruning. TinyCLAP uses only 6% of the original Microsoft CLAP parameters with a minimal reduction (less than 5%) in zero-shot classification performance across the three sound event detection datasets on which it was tested.

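Summary
Contrastive Language-Audio Pretraining (CLAP) has become very important in audio and speech processing, with applications ranging from sound event detection to text-to-audio generation. Its main limitations are the large amount of data required for training and the overall computational complexity during inference. The paper investigates how to reduce the complexity of contrastive language-audio pre-trained models, yielding an efficient model called tinyCLAP. The authors derive a unimodal distillation loss from first principles and explore how pruning can reduce the dimensionality of the shared multimodal latent space. tinyCLAP uses only 6% of the original Microsoft CLAP parameters, with a minimal loss (less than 5%) in zero-shot classification performance on three sound event detection datasets.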

Analysing the Impact of Removing Infrequent Words on Topic Quality in LDA Models

paper_url: http://arxiv.org/abs/2311.14505
repo_url: None
paper_authors: Victor Bystrov, Viktoriia Naboka-Krell, Anna Staszewska-Bystrova, Peter Winker
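for: The paper examines one text preprocessing step in text-as-data applications, removing infrequent words, which can substantially ease computation.
methods: The paper studies topics estimated with Latent Dirichlet Allocation (LDA), using Monte Carlo experiments with different criteria for removing infrequent terms and various evaluation metrics.
results: Pruning is beneficial, and the share of the vocabulary that can be removed is quite considerable.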

Abstract
An initial procedure in text-as-data applications is text preprocessing. One of the typical steps, which can substantially facilitate computations, consists in removing infrequent words believed to provide limited information about the corpus. Despite the popularity of vocabulary pruning, not many guidelines on how to implement it are available in the literature. The aim of the paper is to fill this gap by examining the effects of removing infrequent words on the quality of topics estimated using Latent Dirichlet Allocation. The analysis is based on Monte Carlo experiments taking into account different criteria for infrequent terms removal and various evaluation metrics. The results indicate that pruning is beneficial and that the share of vocabulary which might be eliminated can be quite considerable.
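
Vocabulary pruning of the kind discussed in the abstract can be sketched with a document-frequency threshold. The abstract mentions several removal criteria without listing them, so the criterion used here (keep words that occur in at least `min_doc_freq` documents) is just one assumed option. The function name and toy documents are hypothetical.

```python
from collections import Counter


def prune_vocabulary(docs, min_doc_freq=2):
    """Drop words that appear in fewer than `min_doc_freq` documents.

    Returns the pruned documents, the retained vocabulary, and the share of
    the original vocabulary that was removed.
    """
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    keep = {tok for tok, df in doc_freq.items() if df >= min_doc_freq}
    pruned = [[tok for tok in doc if tok in keep] for doc in docs]
    removed_share = 1 - len(keep) / len(doc_freq)
    return pruned, keep, removed_share


# Toy tokenised corpus (hypothetical).
docs = [
    ["tax", "policy", "budget"],
    ["tax", "budget", "zeugma"],
    ["policy", "tax", "quokka"],
]
pruned, keep, removed_share = prune_vocabulary(docs, min_doc_freq=2)
```

Here two of the five distinct words occur in a single document, so 40% of the vocabulary is removed before the pruned documents would be passed to a topic model.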

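Summary
Text preprocessing is an initial procedure in text-as-data applications. A typical step is removing infrequent words, which are believed to provide limited information about the corpus. Although vocabulary pruning is popular, the literature offers few guidelines on how to implement it. The paper aims to fill this gap by analysing how removing infrequent words affects the quality of topics estimated with Latent Dirichlet Allocation. The analysis is based on Monte Carlo experiments that consider different criteria for removing infrequent terms and various evaluation metrics. The results indicate that pruning is beneficial and that a considerable share of the vocabulary can be removed.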

SER_AMPEL: A multi-source dataset for SER of Italian older adults

paper_url: http://arxiv.org/abs/2311.14483
repo_url: None
paper_authors: Alessandra Grossi, Francesca Gasparini
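for: The paper provides a reference dataset for speech emotion recognition (SER) in Italian older adults.
methods: The dataset is collected following different protocols: acted conversations extracted from movies and TV series, and recordings of natural conversations in which emotions are elicited by suitable questions.
results: The paper motivates the need for such a dataset and reports preliminary classification results on a subset of it, discussing critical issues of SER.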

Abstract
In this paper, SER_AMPEL, a multi-source dataset for speech emotion recognition (SER) is presented. The peculiarity of the dataset is that it is collected with the aim of providing a reference for speech emotion recognition in case of Italian older adults. The dataset is collected following different protocols, in particular considering acted conversations, extracted from movies and TV series, and recording natural conversations where the emotions are elicited by proper questions. The evidence of the need for such a dataset emerges from the analysis of the state of the art. Preliminary considerations on the critical issues of SER are reported analyzing the classification results on a subset of the proposed dataset.

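Summary
The paper presents SER_AMPEL, a multi-source dataset for speech emotion recognition (SER). Its distinguishing feature is that it was collected to provide a reference for speech emotion recognition in Italian older adults. The dataset was collected following different protocols, including acted conversations extracted from movies and TV series and natural conversations in which emotions are elicited by suitable questions. The need for such a dataset emerges from an analysis of the state of the art. Preliminary considerations on the critical issues of SER are reported, based on classification results on a subset of the dataset.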

Controlled Text Generation via Language Model Arithmetic

paper_url: http://arxiv.org/abs/2311.14479
repo_url: https://github.com/eth-sri/language-model-arithmetic
paper_authors: Jasper Dekoninck, Marc Fischer, Luca Beurer-Kellner, Martin Vechev
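for: The paper proposes a new inference framework for customizing large language models (LLMs) across a wider range of settings.
methods: The paper introduces model arithmetic, an inference framework for composing and biasing LLMs without model (re)training or highly specific datasets. It allows more precise control of generated text than direct prompting and prior controlled text generation (CTG) techniques.
results: Model arithmetic gives fine-grained control over generated text while outperforming the state of the art on the task of toxicity reduction.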

Abstract
As Large Language Models (LLMs) are deployed more widely, customization with respect to vocabulary, style and character becomes more important. In this work we introduce model arithmetic, a novel inference framework for composing and biasing LLMs without the need for model (re)training or highly specific datasets. In addition, the framework allows for more precise control of generated text than direct prompting and prior controlled text generation (CTG) techniques. Using model arithmetic, we can express prior CTG techniques as simple formulas and naturally extend them to new and more effective formulations. Further, we show that speculative sampling, a technique for efficient LLM sampling, extends to our setting. This enables highly efficient text generation with multiple composed models with only marginal overhead over a single model. Our empirical evaluation demonstrates that model arithmetic allows fine-grained control of generated text while outperforming state-of-the-art on the task of toxicity reduction.
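
One simple way to compose models at inference time is to take a weighted sum of their next-token log-probabilities and renormalise. The sketch below shows that generic idea only: the paper's exact formulas and operators may differ, and the function `compose`, the weights, and the toy distributions are hypothetical.

```python
import math


def compose(log_probs_per_model, weights):
    """Normalise sum_i w_i * log p_i(token) into a next-token distribution."""
    vocab_size = len(log_probs_per_model[0])
    scores = [
        sum(w * lp[t] for w, lp in zip(weights, log_probs_per_model))
        for t in range(vocab_size)
    ]
    m = max(scores)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_norm) for s in scores]


# Toy three-token vocabulary; the third token stands for an unwanted word.
base = [math.log(p) for p in (0.5, 0.3, 0.2)]
toxic = [math.log(p) for p in (0.1, 0.1, 0.8)]

# A negative weight steers generation away from what the second model prefers.
steered = compose([base, toxic], weights=[1.0, -0.5])
unchanged = compose([base, toxic], weights=[1.0, 0.0])
```

With weight `-0.5` on the second model, the probability of the third token falls below its base value of 0.2, while a weight of zero leaves the base distribution untouched.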

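Summary
As Large Language Models (LLMs) are deployed more widely, customization with respect to vocabulary, style, and character becomes more important. This work introduces model arithmetic, a new inference framework for composing and biasing LLMs without model (re)training or highly specific datasets. The framework also allows more precise control of generated text than direct prompting and prior controlled text generation (CTG) techniques. With model arithmetic, prior CTG techniques can be expressed as simple formulas and naturally extended to new, more effective formulations. The authors further show that speculative sampling, a technique for efficient LLM sampling, extends to this setting. This enables highly efficient text generation with multiple composed models at only marginal overhead over a single model. The empirical evaluation shows that model arithmetic allows fine-grained control of generated text while outperforming the state of the art on toxicity reduction.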

DP-NMT: Scalable Differentially-Private Machine Translation

paper_url: http://arxiv.org/abs/2311.14465
repo_url: https://github.com/trusthlt/dp-nmt
paper_authors: Timour Igamberdiev, Doan Nam Long Vu, Felix Künnecke, Zhuo Yu, Jannik Holmer, Ivan Habernal
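for: The study provides an open-source framework for research on privacy-preserving neural machine translation (NMT).
methods: The study trains NMT models with differentially private stochastic gradient descent (DP-SGD) and provides a reusable open-source framework so that researchers can implement privacy-preserving NMT systems easily.
results: The authors run a set of experiments on datasets from general and privacy-related domains to demonstrate the DP-NMT framework in use.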

Abstract
Neural machine translation (NMT) is a widely popular text generation task, yet there is a considerable research gap in the development of privacy-preserving NMT models, despite significant data privacy concerns for NMT systems. Differentially private stochastic gradient descent (DP-SGD) is a popular method for training machine learning models with concrete privacy guarantees; however, the implementation specifics of training a model with DP-SGD are not always clarified in existing models, with differing software libraries used and code bases not always being public, leading to reproducibility issues. To tackle this, we introduce DP-NMT, an open-source framework for carrying out research on privacy-preserving NMT with DP-SGD, bringing together numerous models, datasets, and evaluation metrics in one systematic software package. Our goal is to provide a platform for researchers to advance the development of privacy-preserving NMT systems, keeping the specific details of the DP-SGD algorithm transparent and intuitive to implement. We run a set of experiments on datasets from both general and privacy-related domains to demonstrate our framework in use. We make our framework publicly available and welcome feedback from the community.
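
The core of a DP-SGD update is per-example gradient clipping followed by Gaussian noise. The sketch below shows one such step on plain Python lists. It is not code from the DP-NMT framework, it omits privacy accounting entirely, and the function name and hyperparameter values are hypothetical.

```python
import math
import random


def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.0, rng=None):
    """One DP-SGD update: clip each example's gradient, sum, add noise, average."""
    rng = rng or random.Random(0)
    summed = [0.0] * len(params)
    for grad in per_example_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        # Scale the gradient down so its L2 norm is at most clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(grad):
            summed[i] += x * scale
    # Gaussian noise with standard deviation noise_multiplier * clip_norm.
    noisy = [s + rng.gauss(0.0, noise_multiplier * clip_norm) for s in summed]
    batch_size = len(per_example_grads)
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]


params = [0.0, 0.0]
grads = [[3.0, 4.0], [0.3, 0.4]]  # the first gradient has norm 5 and gets clipped
updated = dp_sgd_step(params, grads, lr=0.1, noise_multiplier=1.0)
noiseless = dp_sgd_step(params, grads, lr=1.0, noise_multiplier=0.0)
```

Clipping bounds how much any single training example can move the parameters, and the noise scale is tied to that bound. Details such as how batches are sampled also affect the resulting privacy guarantee.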

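Summary
Neural machine translation (NMT) is a widely popular text generation task, yet there is a considerable research gap in developing privacy-preserving NMT models, despite significant data privacy concerns for NMT systems. Differentially private stochastic gradient descent (DP-SGD) is a popular method for training machine learning models with concrete privacy guarantees. However, the implementation specifics of training with DP-SGD are not always made clear: different software libraries are used and code bases are not always public, which leads to reproducibility issues. To tackle this, the authors introduce DP-NMT, an open-source framework for research on privacy-preserving NMT that brings together numerous models, datasets, and evaluation metrics in one systematic software package. The goal is to give researchers a platform for advancing privacy-preserving NMT systems while keeping the specific details of the DP-SGD algorithm transparent and intuitive to implement. A set of experiments on different datasets demonstrates the framework in use. The framework is publicly available, and the authors welcome feedback from the community.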

ÚFAL CorPipe at CRAC 2023: Larger Context Improves Multilingual Coreference Resolution

paper_url: http://arxiv.org/abs/2311.14391
repo_url: None
paper_authors: Milan Straka
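for: The paper describes the winning entry to the CRAC 2023 Shared Task on Multilingual Coreference Resolution.
methods: The system first detects mentions, then links coreferent mentions via an antecedent-maximization approach on the retrieved spans. Both tasks are trained jointly on all available corpora using a shared pretrained language model. The main improvements are inputs larger than 512 subwords and a changed mention decoding that supports ensembling.
results: CorPipe surpasses the other participants by a large margin of 4.5 percent points.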

Abstract
We present CorPipe, the winning entry to the CRAC 2023 Shared Task on Multilingual Coreference Resolution. Our system is an improved version of our earlier multilingual coreference pipeline, and it surpasses other participants by a large margin of 4.5 percent points. CorPipe first performs mention detection, followed by coreference linking via an antecedent-maximization approach on the retrieved spans. Both tasks are trained jointly on all available corpora using a shared pretrained language model. Our main improvements comprise inputs larger than 512 subwords and changing the mention decoding to support ensembling. The source code is available at https://github.com/ufal/crac2023-corpipe.

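Summary
This paper presents CorPipe, the winning entry to the CRAC 2023 Shared Task. The system is an improved version of the authors' earlier multilingual coreference pipeline and surpasses the other participants by 4.5 percent points. CorPipe first performs mention detection, then links coreferent mentions via an antecedent-maximization approach on the detected spans. Both tasks are trained jointly on all available corpora using a shared pretrained language model. The main improvements are inputs larger than 512 subwords and a changed mention decoding that supports ensembling. The source code is available on GitHub: https://github.com/ufal/crac2023-corpipe.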

Average Token Delay: A Duration-aware Latency Metric for Simultaneous Translation

paper_url: http://arxiv.org/abs/2311.14353
repo_url: None
paper_authors: Yasumasa Kano, Katsuhito Sudoh, Satoshi Nakamura
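for: The paper concerns latency evaluation for simultaneous translation.
methods: The paper proposes a new latency metric called Average Token Delay (ATD), which focuses on the duration of partial translation outputs. Its effectiveness is analysed by simulating user-side latency based on Ear-Voice Span (EVS).
results: In the experiments, ATD had the highest correlation with EVS among baseline latency metrics under most conditions.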

Abstract
Simultaneous translation is a task in which the translation begins before the end of an input speech segment. Its evaluation should be conducted based on latency in addition to quality, and for users, the smallest possible amount of latency is preferable. Most existing metrics measure latency based on the start timings of partial translations and ignore their duration. This means such metrics do not penalize the latency caused by long translation output, which delays the comprehension of users and subsequent translations. In this work, we propose a novel latency evaluation metric for simultaneous translation called Average Token Delay (ATD) that focuses on the duration of partial translations. We demonstrate its effectiveness through analyses simulating user-side latency based on Ear-Voice Span (EVS). In our experiment, ATD had the highest correlation with EVS among baseline latency metrics under most conditions.

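Summary
Simultaneous translation is a task in which translation begins before the input speech segment ends. It should be evaluated on latency as well as quality, and users prefer the smallest possible latency. Existing metrics measure latency from the start times of partial translations and ignore their duration. As a result, they do not penalize the latency caused by long translation outputs, which delays user comprehension and subsequent translations. This work proposes a new latency metric for simultaneous translation called Average Token Delay (ATD), which focuses on the duration of partial translations. Analyses that simulate user-side latency based on Ear-Voice Span (EVS) show that ATD has the highest correlation with EVS among baseline latency metrics under most conditions.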