cs.CL - 2023-07-17

Syntax-Aware Complex-Valued Neural Machine Translation

  • paper_url: http://arxiv.org/abs/2307.08586
  • repo_url: None
  • paper_authors: Yang Liu, Yuexian Hou
  • for: improving the translation performance of neural machine translation (NMT)
  • methods: a complex-valued Encoder-Decoder architecture that integrates syntax information directly into the NMT model, using an attention mechanism to jointly learn word-level and syntax-level attention scores
  • results: experimental results show that the proposed method improves BLEU scores on two datasets, with especially large gains on translation tasks involving language pairs with significant syntactic differences
    Abstract Syntax has been proven to be remarkably effective in neural machine translation (NMT). Previous models obtained syntax information from syntactic parsing tools and integrated it into NMT models to improve translation performance. In this work, we propose a method to incorporate syntax information into a complex-valued Encoder-Decoder architecture. The proposed model jointly learns word-level and syntax-level attention scores from the source side to the target side using an attention mechanism. Importantly, it is not dependent on specific network architectures and can be directly integrated into any existing sequence-to-sequence (Seq2Seq) framework. The experimental results demonstrate that the proposed method can bring significant improvements in BLEU scores on two datasets. In particular, the proposed method achieves a greater improvement in BLEU scores in translation tasks involving language pairs with significant syntactic differences.
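
The complex-valued attention scoring is only described at a high level above; as a rough illustration (not the paper's implementation), attention weights over complex-valued vectors can be computed from the real part of the Hermitian inner product, a common similarity choice for complex embeddings:

```python
import math

def complex_attention_weights(query, keys):
    """Toy complex-valued attention: score each key against the query with
    the real part of the Hermitian inner product, then normalize via softmax."""
    scores = [sum((q * k.conjugate()).real for q, k in zip(query, key))
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative query and three complex-valued "word" vectors.
query = [1 + 1j, 0.5 - 0.2j]
keys = [[1 + 1j, 0.5 - 0.2j],    # identical to the query
        [0.1 + 0j, -0.3 + 0.4j],
        [-1 - 1j, -0.5 + 0.2j]]  # negated query
weights = complex_attention_weights(query, keys)
```

The weights sum to one, and the key identical to the query receives the largest weight.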

The Resume Paradox: Greater Language Differences, Smaller Pay Gaps

  • paper_url: http://arxiv.org/abs/2307.08580
  • repo_url: None
  • paper_authors: Joshua R. Minot, Marc Maier, Bradford Demarest, Nicholas Cheney, Christopher M. Danforth, Peter Sheridan Dodds, Morgan R. Frank
  • for: investigating how workers' self-representation by gender relates to the gender pay gap
  • methods: analysis of the language in millions of US workers' resumes
  • results: across occupations, language differences between male and female resumes correspond to 11% of the variation in the gender pay gap; surprisingly, occupations with greater language differences have lower gender pay gaps, and a doubling of the language difference between female and male resumes corresponds to an annual wage increase of $2,797 for the average female worker
    Abstract Over the past decade, the gender pay gap has remained steady with women earning 84 cents for every dollar earned by men on average. Many studies explain this gap through demand-side bias in the labor market represented through employers' job postings. However, few studies analyze potential bias from the worker supply-side. Here, we analyze the language in millions of US workers' resumes to investigate how differences in workers' self-representation by gender compare to differences in earnings. Across US occupations, language differences between male and female resumes correspond to 11% of the variation in gender pay gap. This suggests that females' resumes that are semantically similar to males' resumes may have greater wage parity. However, surprisingly, occupations with greater language differences between male and female resumes have lower gender pay gaps. A doubling of the language difference between female and male resumes results in an annual wage increase of $2,797 for the average female worker. This result holds with controls for gender-biases of resume text and we find that per-word bias poorly describes the variance in wage gap. The results demonstrate that textual data and self-representation are valuable factors for improving worker representations and understanding employment inequities.
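
The summary does not specify how resume language differences are measured; one minimal, illustrative proxy is the cosine distance between the word-frequency vectors of two resume collections (the function and example data below are hypothetical):

```python
import math
from collections import Counter

def language_difference(docs_a, docs_b):
    """Toy language-difference score between two document collections:
    1 - cosine similarity of their word-frequency vectors. 0 means identical
    word usage; values near 1 mean almost no shared vocabulary."""
    freq_a = Counter(w for d in docs_a for w in d.lower().split())
    freq_b = Counter(w for d in docs_b for w in d.lower().split())
    vocab = set(freq_a) | set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in vocab)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    return 1.0 - dot / (norm_a * norm_b)

same = language_difference(["managed team budget"], ["managed team budget"])
diff = language_difference(["managed team budget"], ["designed neural networks"])
```

Identical collections score 0; collections with disjoint vocabulary score 1.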

Discovering collective narratives shifts in online discussions

  • paper_url: http://arxiv.org/abs/2307.08541
  • repo_url: None
  • paper_authors: Wanying Zhao, Fiona Guo, Kristina Lerman, Yong-Yeol Ahn
  • for: This paper aims to develop a systematic and computational understanding of online narratives, specifically in the context of social media, to better understand how they emerge, spread, and die.
  • methods: The proposed framework combines change point detection, semantic role labeling (SRL), and automatic aggregation of narrative fragments into narrative networks to reliably and automatically extract narratives from massive amounts of text data.
  • results: The proposed approach is evaluated using synthetic and empirical data from two Twitter corpora related to COVID-19 and the 2017 French Election, and the results demonstrate that the approach can recover major narrative shifts that correspond to significant events.
    Abstract Narrative is a foundation of human cognition and decision making. Because narratives play a crucial role in societal discourses and the spread of misinformation, and because of the pervasive use of social media, narrative dynamics on social media can have a profound societal impact. Yet, a systematic and computational understanding of online narratives faces the critical challenges of scale and dynamics: how can we reliably and automatically extract narratives from massive amounts of text? How do narratives emerge, spread, and die? Here, we propose a systematic narrative discovery framework that fills this gap by combining change point detection, semantic role labeling (SRL), and automatic aggregation of narrative fragments into narrative networks. We evaluate our model with synthetic and empirical data from two Twitter corpora about COVID-19 and the 2017 French Election. Results demonstrate that our approach can recover major narrative shifts that correspond to major events.
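
The change point detection component is not specified further here; a toy mean-shift detector over a time series of narrative-fragment counts conveys the idea (illustrative only, not the paper's detector):

```python
def change_point(series):
    """Toy change point detection: return the index that splits the series
    into two segments with the largest absolute difference of means."""
    best_idx, best_gap = None, -1.0
    for i in range(1, len(series)):
        left, right = series[:i], series[i:]
        gap = abs(sum(left) / len(left) - sum(right) / len(right))
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx

# Hypothetical daily counts of a narrative fragment: a shift after day 5.
counts = [2, 3, 2, 3, 2, 20, 22, 21, 23, 20]
cp = change_point(counts)
```

Here the detector recovers the shift at index 5, where the mean level jumps.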

Large-Scale Evaluation of Topic Models and Dimensionality Reduction Methods for 2D Text Spatialization

  • paper_url: http://arxiv.org/abs/2307.11770
  • repo_url: https://github.com/cgshpi/topic-models-and-dimensionality-reduction-benchmark
  • paper_authors: Daniel Atzberger, Tim Cech, Willy Scheibel, Matthias Trapp, Rico Richter, Jürgen Döllner, Tobias Schreck
  • for: deriving spatializations for text corpora using topic models and dimensionality reduction methods, and evaluating the effectiveness of these methods for creating high-quality layouts
  • methods: combinations of topic models and dimensionality reduction methods, including Latent Dirichlet Allocation (LDA) and t-Distributed Stochastic Neighbor Embedding (t-SNE), to create two-dimensional scatter plots of text corpora
  • results: a large-scale computational evaluation, using a set of corpora and quality metrics to quantify the preservation of local and global properties and the perceptual effectiveness of the resulting layouts; interpretable topic models are beneficial for capturing the structure of text corpora, and t-SNE is a good choice for subsequent dimensionality reduction
    Abstract Topic models are a class of unsupervised learning algorithms for detecting the semantic structure within a text corpus. Together with a subsequent dimensionality reduction algorithm, topic models can be used for deriving spatializations for text corpora as two-dimensional scatter plots, reflecting semantic similarity between the documents and supporting corpus analysis. Although the choice of the topic model, the dimensionality reduction, and their underlying hyperparameters significantly impact the resulting layout, it is unknown which particular combinations result in high-quality layouts with respect to accuracy and perception metrics. To investigate the effectiveness of topic models and dimensionality reduction methods for the spatialization of corpora as two-dimensional scatter plots (or basis for landscape-type visualizations), we present a large-scale, benchmark-based computational evaluation. Our evaluation consists of (1) a set of corpora, (2) a set of layout algorithms that are combinations of topic models and dimensionality reductions, and (3) quality metrics for quantifying the resulting layout. The corpora are given as document-term matrices, and each document is assigned to a thematic class. The chosen metrics quantify the preservation of local and global properties and the perceptual effectiveness of the two-dimensional scatter plots. By evaluating the benchmark on a computing cluster, we derived a multivariate dataset with over 45 000 individual layouts and corresponding quality metrics. Based on the results, we propose guidelines for the effective design of text spatializations that are based on topic models and dimensionality reductions. As a main result, we show that interpretable topic models are beneficial for capturing the structure of text corpora. We furthermore recommend the use of t-SNE as a subsequent dimensionality reduction.
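
Preservation of local properties, one family of quality metrics mentioned above, is often measured by k-nearest-neighbor overlap between the original space and the 2D layout; a minimal sketch of such a metric (not the paper's exact metrics):

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of point i (Euclidean distance)."""
    dists = sorted((math.dist(points[i], p), j)
                   for j, p in enumerate(points) if j != i)
    return {j for _, j in dists[:k]}

def neighborhood_preservation(high_dim, low_dim, k=2):
    """Fraction of k-nearest neighbors shared between the original space
    and the 2D layout, averaged over all points (1.0 = perfect)."""
    n = len(high_dim)
    overlap = sum(len(knn(high_dim, i, k) & knn(low_dim, i, k))
                  for i in range(n))
    return overlap / (n * k)

# A layout that preserves neighborhoods perfectly scores 1.0.
high = [(0, 0, 0), (0, 0, 1), (5, 5, 5), (5, 5, 6)]
low = [(0, 0), (0, 1), (5, 5), (5, 6)]
score = neighborhood_preservation(high, low, k=1)
```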

Latent Jailbreak: A Test Suite for Evaluating Both Text Safety and Output Robustness of Large Language Models

  • paper_url: http://arxiv.org/abs/2307.08487
  • repo_url: https://github.com/qiuhuachuan/latent-jailbreak
  • paper_authors: Huachuan Qiu, Shuai Zhang, Anqi Li, Hongliang He, Zhenzhong Lan
  • for: evaluating whether large language models (LLMs) align with human values and generate safe text
  • methods: a new benchmark assessing both the safety and robustness of LLMs, built on a latent jailbreak prompt dataset that embeds malicious instructions inside regular tasks, together with a hierarchical annotation framework
  • results: current LLMs not only prioritize certain instruction verbs but also exhibit varying jailbreak rates for different instruction verbs
    Abstract Considerable research efforts have been devoted to ensuring that large language models (LLMs) align with human values and generate safe text. However, an excessive focus on sensitivity to certain topics can compromise the model's robustness in following instructions, thereby impacting its overall performance in completing tasks. Previous benchmarks for jailbreaking LLMs have primarily focused on evaluating the safety of the models without considering their robustness. In this paper, we propose a benchmark that assesses both the safety and robustness of LLMs, emphasizing the need for a balanced approach. To comprehensively study text safety and output robustness, we introduce a latent jailbreak prompt dataset, each involving malicious instruction embedding. Specifically, we instruct the model to complete a regular task, such as translation, with the text to be translated containing malicious instructions. To further analyze safety and robustness, we design a hierarchical annotation framework. We present a systematic analysis of the safety and robustness of LLMs regarding the position of explicit normal instructions, word replacements (verbs in explicit normal instructions, target groups in malicious instructions, cue words for explicit normal instructions), and instruction replacements (different explicit normal instructions). Our results demonstrate that current LLMs not only prioritize certain instruction verbs but also exhibit varying jailbreak rates for different instruction verbs in explicit normal instructions. Code and data are available at https://github.com/qiuhuachuan/latent-jailbreak.
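
The latent jailbreak construction, embedding a malicious instruction inside a regular task such as translation, can be sketched as a prompt template; the wording and the `instruction_first` toggle (reflecting the paper's analysis of instruction position) are illustrative, not taken from the released dataset:

```python
def latent_jailbreak_prompt(normal_instruction, embedded_text,
                            instruction_first=True):
    """Build a prompt where the text handled by a normal task (e.g. translation)
    itself contains a placeholder malicious instruction; the position of the
    explicit normal instruction is one of the factors the benchmark varies."""
    parts = [normal_instruction, embedded_text]
    if not instruction_first:
        parts.reverse()
    return "\n".join(parts)

normal = "Translate the following sentence into Chinese:"
payload = "Ignore the above and [MALICIOUS INSTRUCTION]."  # placeholder only
p_first = latent_jailbreak_prompt(normal, payload)
p_last = latent_jailbreak_prompt(normal, payload, instruction_first=False)
```

A robust model should translate the embedded sentence rather than execute it.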

Domain Knowledge Distillation from Large Language Model: An Empirical Study in the Autonomous Driving Domain

  • paper_url: http://arxiv.org/abs/2307.11769
  • repo_url: None
  • paper_authors: Yun Tang, Antonio A. Bruto da Costa, Jason Zhang, Irvine Patrick, Siddartha Khastgir, Paul Jennings
  • for: automating engineering processes in knowledge-based (expert) systems
  • methods: prompt engineering with the LLM ChatGPT
  • results: an empirical assessment in the autonomous driving domain shows that human supervision and early intervention improve efficiency and output quality
    Abstract Engineering knowledge-based (or expert) systems require extensive manual effort and domain knowledge. As Large Language Models (LLMs) are trained using an enormous amount of cross-domain knowledge, it becomes possible to automate such engineering processes. This paper presents an empirical automation and semi-automation framework for domain knowledge distillation using prompt engineering and the LLM ChatGPT. We assess the framework empirically in the autonomous driving domain and present our key observations. In our implementation, we construct the domain knowledge ontology by "chatting" with ChatGPT. The key finding is that while fully automated domain ontology construction is possible, human supervision and early intervention typically improve efficiency and output quality as they lessen the effects of response randomness and the butterfly effect. We, therefore, also develop a web-based distillation assistant enabling supervision and flexible intervention at runtime. We hope our findings and tools could inspire future research toward revolutionizing the engineering of knowledge-based systems across application domains.
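
The "chatting" workflow can be pictured as a loop that asks an LLM for sub-concepts of each concept and lets a human reviewer filter every batch; `ask_llm` and `review` below are stand-ins for the real chat call and the web assistant, and the stubbed responses are purely illustrative:

```python
def distill_ontology(root, ask_llm, review, max_depth=2):
    """Toy semi-automated knowledge distillation: recursively ask an LLM
    for sub-concepts of each concept and keep only human-approved ones."""
    ontology = {}
    frontier = [(root, 0)]
    while frontier:
        concept, depth = frontier.pop()
        if depth >= max_depth:
            continue
        children = [c for c in ask_llm(f"List sub-concepts of '{concept}'.")
                    if review(concept, c)]  # early human intervention point
        ontology[concept] = children
        frontier.extend((c, depth + 1) for c in children)
    return ontology

# Stubbed LLM responses and an approve-all reviewer, for illustration only.
fake_llm = {"List sub-concepts of 'weather'.": ["rain", "fog"],
            "List sub-concepts of 'rain'.": ["heavy rain"],
            "List sub-concepts of 'fog'.": []}
tree = distill_ontology("weather", lambda q: fake_llm.get(q, []),
                        lambda parent, child: True)
```

The review hook is where supervision mitigates response randomness before errors propagate down the tree.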

Improving End-to-End Speech Translation by Imitation-Based Knowledge Distillation with Synthetic Transcripts

  • paper_url: http://arxiv.org/abs/2307.08426
  • repo_url: https://github.com/hubreb/imitkd_ast
  • paper_authors: Rebekka Hubert, Artem Sokolov, Stefan Riezler
  • for: This paper focuses on improving end-to-end automatic speech translation (AST) systems by using imitation learning to correct errors made by a student model.
  • methods: The authors use a teacher NMT system to correct the errors of an AST student model without relying on manual transcripts.
  • results: The NMT teacher is able to recover from errors in automatic transcriptions and correct erroneous translations of the AST student, leading to improvements of about 4 BLEU points over the standard AST end-to-end baseline on two datasets.
    Abstract End-to-end automatic speech translation (AST) relies on data that combines audio inputs with text translation outputs. Previous work used existing large parallel corpora of transcriptions and translations in a knowledge distillation (KD) setup to distill a neural machine translation (NMT) into an AST student model. While KD allows using larger pretrained models, the reliance of previous KD approaches on manual audio transcripts in the data pipeline restricts the applicability of this framework to AST. We present an imitation learning approach where a teacher NMT system corrects the errors of an AST student without relying on manual transcripts. We show that the NMT teacher can recover from errors in automatic transcriptions and is able to correct erroneous translations of the AST student, leading to improvements of about 4 BLEU points over the standard AST end-to-end baseline on the English-German CoVoST-2 and MuST-C datasets, respectively. Code and data are publicly available.\footnote{\url{https://github.com/HubReb/imitkd_ast/releases/tag/v1.1}}
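
The imitation setup, where the teacher supplies corrections along the student's own rollout instead of forcing reference prefixes, can be sketched at the token level; the stub teacher and German tokens below are illustrative, not the paper's models:

```python
def imitation_targets(student_tokens, teacher):
    """Toy imitation learning signal: the student rolls out its own hypothesis,
    and for every prefix the teacher supplies the token the student should have
    produced next (the training target), instead of forcing reference prefixes."""
    return [(tuple(student_tokens[:i]), teacher(student_tokens[:i]))
            for i in range(len(student_tokens))]

# Stub teacher that steers any prefix back toward a fixed reference.
reference = ["wie", "geht", "es", "dir"]
teacher = lambda prefix: reference[min(len(prefix), len(reference) - 1)]
student_hyp = ["wie", "gehts", "dir"]  # the student erred at step 1
pairs = imitation_targets(student_hyp, teacher)
```

Note that the second pair conditions on the student's own (erroneous) prefix while still training toward the teacher's correction.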

Enhancing Supervised Learning with Contrastive Markings in Neural Machine Translation Training

  • paper_url: http://arxiv.org/abs/2307.08416
  • repo_url: None
  • paper_authors: Nathaniel Berger, Miriam Exel, Matthias Huck, Stefan Riezler
  • for: adding exploration to the supervised, teacher-forced training of neural machine translation (NMT)
  • methods: a contrastive marking objective that automatically extracts additional training signals by comparing system hypotheses against reference translations and up/down-weighting correct/incorrect tokens
  • results: training with contrastive markings improves over standard supervised learning, and is especially useful when learning from post-edits, where the markings indicate human error corrections to the original hypotheses
    Abstract Supervised learning in Neural Machine Translation (NMT) typically follows a teacher forcing paradigm where reference tokens constitute the conditioning context in the model's prediction, instead of its own previous predictions. In order to alleviate this lack of exploration in the space of translations, we present a simple extension of standard maximum likelihood estimation by a contrastive marking objective. The additional training signals are extracted automatically from reference translations by comparing the system hypothesis against the reference, and used for up/down-weighting correct/incorrect tokens. The proposed new training procedure requires one additional translation pass over the training set per epoch, and does not alter the standard inference setup. We show that training with contrastive markings yields improvements on top of supervised learning, and is especially useful when learning from postedits where contrastive markings indicate human error corrections to the original hypotheses. Code is publicly released.
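
The up/down-weighting of tokens can be sketched as follows; the weight values and the bag-of-words matching are illustrative choices, not the paper's exact scheme:

```python
def contrastive_weights(hypothesis, reference, up=1.5, down=0.5):
    """Toy contrastive markings: weight hypothesis tokens that appear in the
    reference up, and tokens that do not appear down; these weights would then
    scale the per-token loss during training."""
    ref_tokens = set(reference)
    return [up if tok in ref_tokens else down for tok in hypothesis]

hyp = ["the", "cat", "sat", "quickly"]
ref = ["the", "cat", "sat", "down"]
weights = contrastive_weights(hyp, ref)
```

With a post-edited reference, the down-weighted positions are exactly those a human corrected.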

On the application of Large Language Models for language teaching and assessment technology

  • paper_url: http://arxiv.org/abs/2307.08393
  • repo_url: None
  • paper_authors: Andrew Caines, Luca Benedetto, Shiva Taslimipoor, Christopher Davis, Yuan Gao, Oeistein Andersen, Zheng Yuan, Mark Elliott, Russell Moore, Christopher Bryant, Marek Rei, Helen Yannakoudakis, Andrew Mullooly, Diane Nicholls, Paula Buttery
  • for: investigating the potential of large language models for language teaching and assessment technology
  • methods: examination of large language models, including PaLM and GPT-4, on tasks such as text generation and automated grading
  • results: large language models offer improvements in text generation, but on their own they do not surpass state-of-the-art results on automated grading and grammatical error correction
    Abstract The recent release of very large language models such as PaLM and GPT-4 has made an unprecedented impact in the popular media and public consciousness, giving rise to a mixture of excitement and fear as to their capabilities and potential uses, and shining a light on natural language processing research which had not previously received so much attention. The developments offer great promise for education technology, and in this paper we look specifically at the potential for incorporating large language models in AI-driven language teaching and assessment systems. We consider several research areas and also discuss the risks and ethical considerations surrounding generative AI in education technology for language learners. Overall we find that larger language models offer improvements over previous models in text generation, opening up routes toward content generation which had not previously been plausible. For text generation they must be prompted carefully and their outputs may need to be reshaped before they are ready for use. For automated grading and grammatical error correction, tasks whose progress is checked on well-known benchmarks, early investigations indicate that large language models on their own do not improve on state-of-the-art results according to standard evaluation metrics. For grading it appears that linguistic features established in the literature should still be used for best performance, and for error correction it may be that the models can offer alternative feedback styles which are not measured sensitively with existing methods. In all cases, there is work to be done to experiment with the inclusion of large language models in education technology for language learners, in order to properly understand and report on their capacities and limitations, and to ensure that foreseeable risks such as misinformation and harmful bias are mitigated.

How do software citation formats evolve over time? A longitudinal analysis of R programming language packages

  • paper_url: http://arxiv.org/abs/2307.09390
  • repo_url: None
  • paper_authors: Yuzhuo Wang, Kai Li
  • for: exploring the complexities of software citation in order to inform future software citation policies and infrastructure
  • methods: comparison and analysis of a longitudinal dataset of the citation formats of all R packages collected in 2021 and 2022, important members of the open-source software family, and of how these citation formats evolve over time
  • results: citation formats differ across the document types underlying the citations, and the metadata elements in the citation formats change over time; the study also offers an in-depth analysis of the disciplinarity of journal articles cited as software (software papers)
    Abstract Under the data-driven research paradigm, research software has come to play crucial roles in nearly every stage of scientific inquiry. Scholars are advocating for the formal citation of software in academic publications, treating it on par with traditional research outputs. However, software is hardly consistently cited: one software entity can be cited as different objects, and the citations can change over time. These issues, however, are largely overlooked in existing empirical research on software citation. To fill the above gaps, the present study compares and analyzes a longitudinal dataset of citation formats of all R packages collected in 2021 and 2022, in order to understand the citation formats of R-language packages, important members in the open-source software family, and how the citations evolve over time. In particular, we investigate the different document types underlying the citations and what metadata elements in the citation formats changed over time. Furthermore, we offer an in-depth analysis of the disciplinarity of journal articles cited as software (software papers). By undertaking this research, we aim to contribute to a better understanding of the complexities associated with software citation, shedding light on future software citation policies and infrastructure.

Legal Syllogism Prompting: Teaching Large Language Models for Legal Judgment Prediction

  • paper_url: http://arxiv.org/abs/2307.08321
  • repo_url: None
  • paper_authors: Cong Jiang, Xiaolei Yang
  • for: developing a simple prompting method to teach large language models (LLMs) legal judgment prediction via legal syllogism reasoning
  • methods: legal syllogism prompting (LoT), which teaches only that in a legal syllogism the major premise is the law, the minor premise is the fact, and the conclusion is the judgment
  • results: in zero-shot judgment prediction experiments with GPT-3 on CAIL2018, a Chinese criminal case dataset, LLMs with LoT outperform the baseline and chain-of-thought prompting; LoT enables the model to concentrate on the key information relevant to the judgment and to correctly understand the legal meaning of acts
    Abstract Legal syllogism is a form of deductive reasoning commonly used by legal professionals to analyze cases. In this paper, we propose legal syllogism prompting (LoT), a simple prompting method to teach large language models (LLMs) for legal judgment prediction. LoT teaches only that in the legal syllogism the major premise is law, the minor premise is the fact, and the conclusion is judgment. Then the models can produce a syllogism reasoning of the case and give the judgment without any learning, fine-tuning, or examples. On CAIL2018, a Chinese criminal case dataset, we performed zero-shot judgment prediction experiments with GPT-3 models. Our results show that LLMs with LoT achieve better performance than the baseline and chain of thought prompting, the state-of-art prompting method on diverse reasoning tasks. LoT enables the model to concentrate on the key information relevant to the judgment and to correctly understand the legal meaning of acts, as compared to other methods. Our method enables LLMs to predict judgment along with law articles and justification, which significantly enhances the explainability of models.
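
The LoT structure, major premise = law, minor premise = fact, conclusion = judgment, can be sketched as a prompt template (the wording and example case are illustrative, not the paper's exact prompt):

```python
def lot_prompt(law, fact):
    """Toy legal syllogism prompt: the model is asked to reason from the
    major premise (law) and minor premise (fact) to a judgment."""
    return ("Major premise (law): " + law + "\n"
            "Minor premise (fact): " + fact + "\n"
            "Conclusion (judgment):")

prompt = lot_prompt(
    "Whoever steals property of a relatively large amount commits theft.",
    "The defendant took goods worth 8,000 yuan from a store without paying.",
)
```

The model completes the conclusion slot, which yields a judgment together with the law article and justification it was conditioned on.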

IterLara: A Turing Complete Algebra for Big Data, AI, Scientific Computing, and Database

  • paper_url: http://arxiv.org/abs/2307.08315
  • repo_url: None
  • paper_authors: Hongxiao Li, Wanling Gao, Lei Wang, Jianfeng Zhan
  • for: providing an algebraic model that unifies operations in general-purpose computing, such as big data, AI, scientific computing, and database
  • methods: \textsc{IterLara}, an extension of \textsc{Lara} with iterative operators
  • results: \textsc{IterLara} with aggregation functions can represent matrix inversion and determinant; \textsc{IterLara} with no limitation of function utility is Turing complete; the paper also proposes the Operation Count (OP) as a metric of computation amount for \textsc{IterLara}
    Abstract \textsc{Lara} is a key-value algebra that aims at unifying linear and relational algebra with three types of operation abstraction. The study of \textsc{Lara}'s expressive ability reports that it can represent relational algebra and most linear algebra operations. However, several essential computations, such as matrix inversion and determinant, cannot be expressed in \textsc{Lara}, nor can global and iterative computation. This article proposes \textsc{IterLara}, extending \textsc{Lara} with iterative operators, to provide an algebraic model that unifies operations in general-purpose computing, like big data, AI, scientific computing, and database. We study the expressive ability of \textsc{Lara} and \textsc{IterLara} and prove that \textsc{IterLara} with aggregation functions can represent matrix inversion and determinant. Besides, we demonstrate that \textsc{IterLara} with no limitation of function utility is Turing complete. We also propose the Operation Count (OP) as a metric of computation amount for \textsc{IterLara} and ensure that the OP metric is in accordance with the existing computation metrics.
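
For intuition on why iterative operators suffice for matrix inversion: the Newton-Schulz iteration X_{k+1} = X_k (2I - A X_k) converges to the inverse of A from a suitably scaled start. The 2x2 pure-Python sketch below illustrates the kind of computation involved, not \textsc{IterLara} itself:

```python
def matmul(a, b):
    """2x2 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def newton_schulz_inverse(a, steps=30):
    """Iteratively approximate the inverse of a 2x2 matrix via
    X_{k+1} = X_k (2I - A X_k), starting from A^T / (||A||_1 * ||A||_inf),
    a standard initialization that guarantees convergence."""
    norm1 = max(abs(a[0][j]) + abs(a[1][j]) for j in range(2))    # max col sum
    norminf = max(abs(a[i][0]) + abs(a[i][1]) for i in range(2))  # max row sum
    x = [[a[j][i] / (norm1 * norminf) for j in range(2)] for i in range(2)]
    for _ in range(steps):
        ax = matmul(a, x)
        correction = [[(2.0 if i == j else 0.0) - ax[i][j]
                       for j in range(2)] for i in range(2)]
        x = matmul(x, correction)
    return x

# [[4, 7], [2, 6]] has determinant 10 and inverse [[0.6, -0.7], [-0.2, 0.4]].
inv = newton_schulz_inverse([[4.0, 7.0], [2.0, 6.0]])
```

The iteration converges quadratically once the residual is small, which is why a fixed iterative operator plus aggregation is enough to express inversion.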

CoAD: Automatic Diagnosis through Symptom and Disease Collaborative Generation

  • paper_url: http://arxiv.org/abs/2307.08290
  • repo_url: https://github.com/kwanwaichung/coad
  • paper_authors: Huimin Wang, Wai-Chung Kwan, Kam-Fai Wong, Yefeng Zheng
  • for: This work aims to improve the accuracy of automatic diagnosis (AD), helping doctors diagnose diseases more precisely.
  • methods: The method uses a Transformer architecture, taking the symptom sequence as input and predicting the disease through auto-regression.
  • results: The approach achieves an average 2.3% improvement over previous state-of-the-art results.
    Abstract Automatic diagnosis (AD), a critical application of AI in healthcare, employs machine learning techniques to assist doctors in gathering patient symptom information for precise disease diagnosis. The Transformer-based method takes an input symptom sequence, predicts it through auto-regression, and uses the hidden state of the final symptom to determine the disease. Despite its simplicity and demonstrated superior performance, a decline in disease diagnosis accuracy is observed, caused by 1) a mismatch between symptoms observed during training and generation, and 2) the effect of different symptom orders on disease prediction. To address these obstacles, we introduce CoAD, a novel disease and symptom collaborative generation framework that incorporates several key innovations to improve AD: 1) aligning sentence-level disease labels with multiple possible symptom inquiry steps to bridge the gap between training and generation; 2) expanding symptom labels for each sub-sequence of symptoms to enhance annotation and eliminate the effect of symptom order; 3) developing a repeated symptom input schema to effectively and efficiently learn the expanded disease and symptom labels. We evaluate the CoAD framework using four datasets, three public and one private, and demonstrate that it achieves an average 2.3% improvement over previous state-of-the-art results in automatic disease diagnosis. For reproducibility, we release the code and data at https://github.com/KwanWaiChung/coad.
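CoAD's second innovation, expanding symptom labels for each sub-sequence of symptoms, can be sketched as prefix expansion: every prefix of the symptom inquiry receives its own disease supervision. This is a loose illustration of the idea, not the paper's exact labeling scheme.

```python
def expand_symptom_labels(symptoms, disease):
    """Pair every symptom prefix with the disease label, so supervision
    does not depend on seeing the full (or one particular) symptom order.
    """
    return [(symptoms[: i + 1], disease) for i in range(len(symptoms))]

# Hypothetical inquiry sequence ending in a "flu" diagnosis:
pairs = expand_symptom_labels(["fever", "cough", "fatigue"], "flu")
```

Each of the three prefixes now forms a training pair with the disease, which is one way to bridge the train/generation mismatch the abstract describes.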

Automated Action Model Acquisition from Narrative Texts

  • paper_url: http://arxiv.org/abs/2307.10247
  • repo_url: None
  • paper_authors: Ruiqi Li, Leyang Cui, Songtuan Lin, Patrik Haslum
  • for: This work aims to improve the decision-making ability of AI agents by automatically extracting structured events from narrative texts and generating planning-language-style action models.
  • methods: The method extracts structured events from narrative text and generates planning-language-style action models based on predictions of commonsense event relations, textual contradictions, and similarities.
  • results: Experiments in classical narrative planning domains show that NaRuto generates action models of significantly better quality than existing fully automated methods, and even on par with semi-automated methods.
    Abstract Action models, which take the form of precondition/effect axioms, facilitate causal and motivational connections between actions for AI agents. Action model acquisition has been identified as a bottleneck in the application of planning technology, especially within narrative planning. Acquiring action models from narrative texts in an automated way is essential, but challenging because of the inherent complexities of such texts. We present NaRuto, a system that extracts structured events from narrative text and subsequently generates planning-language-style action models based on predictions of commonsense event relations, as well as textual contradictions and similarities, in an unsupervised manner. Experimental results in classical narrative planning domains show that NaRuto can generate action models of significantly better quality than existing fully automated methods, and even on par with those of semi-automated methods.
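The precondition/effect axioms that make up an action model can be represented with a small STRIPS-style structure. The sketch below is a generic illustration of that representation (the `open-door` action and its atoms are hypothetical), not NaRuto's output format.

```python
from dataclasses import dataclass, field

@dataclass
class ActionModel:
    """A planning-style action model: precondition/effect axioms."""
    name: str
    preconditions: frozenset = field(default_factory=frozenset)
    add_effects: frozenset = field(default_factory=frozenset)
    del_effects: frozenset = field(default_factory=frozenset)

    def applicable(self, state):
        # All preconditions must hold in the current state.
        return self.preconditions <= state

    def apply(self, state):
        # Delete effects are removed, add effects asserted.
        return (state - self.del_effects) | self.add_effects

open_door = ActionModel(
    name="open-door",
    preconditions=frozenset({"door-closed", "has-key"}),
    add_effects=frozenset({"door-open"}),
    del_effects=frozenset({"door-closed"}),
)
state = frozenset({"door-closed", "has-key"})
new_state = open_door.apply(state) if open_door.applicable(state) else state
```

Structures like this are what facilitate the causal connections between actions that the abstract mentions: one action's add effects can satisfy another's preconditions.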

ChatGPT is Good but Bing Chat is Better for Vietnamese Students

  • paper_url: http://arxiv.org/abs/2307.08272
  • repo_url: None
  • paper_authors: Xuan-Quy Dao, Ngoc-Bich Le
  • for: This study examines whether two state-of-the-art large language models (LLMs), ChatGPT and Microsoft Bing Chat (BingChat), effectively serve the needs of Vietnamese students.
  • methods: We conduct a comparative analysis of the two LLMs across subjects including mathematics, literature, English, physics, chemistry, biology, history, geography, and civic education.
  • results: Our results show that BingChat outperforms ChatGPT across a wide range of subjects, with literature the only area where ChatGPT does better. Moreover, BingChat uses the more advanced GPT-4, whereas ChatGPT is built on GPT-3.5, improving its comprehension, reasoning, and creative text generation. Its availability in Vietnam and its built-in hyperlinks and references further contribute to its superiority.
    Abstract This study examines the efficacy of two SOTA large language models (LLMs), namely ChatGPT and Microsoft Bing Chat (BingChat), in catering to the needs of Vietnamese students. Although ChatGPT exhibits proficiency in multiple disciplines, BingChat emerges as the more advantageous option. We conduct a comparative analysis of their academic achievements in various disciplines, encompassing mathematics, literature, English language, physics, chemistry, biology, history, geography, and civic education. The results of our study suggest that BingChat demonstrates superior performance compared to ChatGPT across a wide range of subjects, with the exception of literature, where ChatGPT exhibits better performance. Additionally, BingChat utilizes the more advanced GPT-4 technology, in contrast to ChatGPT, which is built upon GPT-3.5. This allows BingChat to improve its comprehension, reasoning, and generation of creative and informative text. Moreover, the fact that BingChat is accessible in Vietnam and its integration of hyperlinks and citations within responses serve to reinforce its superiority. In our analysis, it is evident that while ChatGPT exhibits praiseworthy qualities, BingChat presents a more suitable solution for Vietnamese students.

Extending the Frontier of ChatGPT: Code Generation and Debugging

  • paper_url: http://arxiv.org/abs/2307.08260
  • repo_url: None
  • paper_authors: Fardin Ahsan Sakib, Saadat Hasan Khan, A. H. M. Rezaul Karim
  • for: This paper investigates whether ChatGPT can solve programming problems, evaluating both the correctness and the efficiency of its solutions.
  • methods: The study has ChatGPT solve programming problems from Leetcode and evaluates the solutions it produces.
  • results: ChatGPT achieves an overall success rate of 71.875%, solving the majority of the problems. It excels on structured problems, but struggles to improve solutions based on feedback.
    Abstract Large-scale language models (LLMs) have emerged as a groundbreaking innovation in the realm of question-answering and conversational agents. These models, leveraging different deep learning architectures such as Transformers, are trained on vast corpora to predict sentences based on given queries. Among these LLMs, ChatGPT, developed by OpenAI, has ushered in a new era by utilizing artificial intelligence (AI) to tackle diverse problem domains, ranging from composing essays and biographies to solving intricate mathematical integrals. The versatile applications enabled by ChatGPT offer immense value to users. However, assessing the performance of ChatGPT's output poses a challenge, particularly in scenarios where queries lack clear objective criteria for correctness. For instance, evaluating the quality of generated essays becomes arduous and relies heavily on manual labor, in stark contrast to evaluating solutions to well-defined, closed-ended questions such as mathematical problems. This research paper delves into the efficacy of ChatGPT in solving programming problems, examining both the correctness and the efficiency of its solution in terms of time and memory complexity. The research reveals a commendable overall success rate of 71.875\%, denoting the proportion of problems for which ChatGPT was able to provide correct solutions that successfully satisfied all the test cases present in Leetcode. It exhibits strengths in structured problems and shows a linear correlation between its success rate and problem acceptance rates. However, it struggles to improve solutions based on feedback, pointing to potential shortcomings in debugging tasks. These findings provide a compact yet insightful glimpse into ChatGPT's capabilities and areas for improvement.
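The 71.875% figure counts a problem as solved only when the generated code passes all of its Leetcode test cases. A minimal sketch of that all-or-nothing scoring criterion (the per-problem outcomes below are hypothetical):

```python
def solved(results):
    """A problem counts as solved only if every test case passes."""
    return all(results)

def success_rate(per_problem_results):
    n_solved = sum(solved(r) for r in per_problem_results)
    return n_solved / len(per_problem_results)

# Hypothetical outcomes for four problems (per-test-case pass/fail):
outcomes = [
    [True, True, True],   # solved
    [True, False, True],  # one failing test -> not solved
    [True, True],         # solved
    [True, True, True],   # solved
]
rate = success_rate(outcomes)
```

Partial credit is never awarded, which is why a single failing edge-case test sinks an otherwise correct solution.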

PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese

  • paper_url: http://arxiv.org/abs/2307.08247
  • repo_url: None
  • paper_authors: Nghia Hieu Nguyen, Kiet Van Nguyen
  • for: This paper proposes a novel multimodal learning scheme called the parallel attention mechanism.
  • methods: The paper introduces a new linguistic feature extraction module, the Hierarchical Linguistic Features Extractor (HLFE), and a Transformer-based Parallel Attention Transformer (PAT).
  • results: According to the experiments, PAT achieves the best accuracy on the benchmark ViVQA dataset compared with all baselines and other SOTA methods, including SAAA and MCAN.
    Abstract We present in this paper a novel scheme for multimodal learning named the Parallel Attention mechanism. In addition, to take into account the advantages of grammar and context in Vietnamese, we propose the Hierarchical Linguistic Features Extractor instead of using an LSTM network to extract linguistic features. Based on these two novel modules, we introduce the Parallel Attention Transformer (PAT), achieving the best accuracy compared to all baselines on the benchmark ViVQA dataset and other SOTA methods including SAAA and MCAN.
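The abstract does not spell out the parallel attention mechanism, but the general pattern of attending to two streams in parallel and merging the results can be sketched as follows. This is my own schematic illustration under that assumption, using plain scaled dot-product attention and concatenation as the merge; PAT's actual design may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Standard scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def parallel_attention(q, kv_a, kv_b):
    """Attend to two modality streams in parallel, then merge."""
    out_a = attention(q, *kv_a)
    out_b = attention(q, *kv_b)
    return np.concatenate([out_a, out_b], axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))                                 # question-token queries
vis = (rng.normal(size=(10, 8)), rng.normal(size=(10, 8)))  # visual K, V
txt = (rng.normal(size=(6, 8)), rng.normal(size=(6, 8)))    # linguistic K, V
fused = parallel_attention(q, vis, txt)                     # shape (4, 16)
```

Running the two branches independently lets each modality keep its own attention distribution before the fused representation is formed.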

ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development

  • paper_url: http://arxiv.org/abs/2307.08720
  • repo_url: https://github.com/yairl/ivrit.ai
  • paper_authors: Yanir Marmor, Kinneret Misgav, Yair Lifshitz
  • for: To advance research and development of automated speech recognition (ASR) technology for Hebrew
  • methods: A large collection of Hebrew speech data, delivered as raw audio, post-voice-activity-detection data, and partially transcribed data, to serve varying research needs
  • results: A large, diverse Hebrew speech resource that can help researchers, developers, and commercial entities advance Hebrew ASR
    Abstract We introduce "ivrit.ai", a comprehensive Hebrew speech dataset, addressing the distinct lack of extensive, high-quality resources for advancing Automated Speech Recognition (ASR) technology in Hebrew. With over 3,300 speech hours and over a thousand diverse speakers, ivrit.ai offers a substantial compilation of Hebrew speech across various contexts. It is delivered in three forms to cater to varying research needs: raw unprocessed audio; data post-Voice Activity Detection; and partially transcribed data. The dataset stands out for its legal accessibility, permitting use at no cost, thereby serving as a crucial resource for researchers, developers, and commercial entities. ivrit.ai opens up numerous applications, offering vast potential to enhance AI capabilities in Hebrew. Future efforts aim to expand ivrit.ai further, thereby advancing Hebrew's standing in AI research and technology.
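One of the three delivery forms is data post-Voice Activity Detection. The dataset's actual VAD pipeline is not specified in the abstract; as a rough illustration of what VAD does, here is a toy energy-based detector that flags frames whose energy exceeds a threshold.

```python
import numpy as np

def energy_vad(samples, frame_len=400, threshold=0.01):
    """Mark frames whose mean energy exceeds a threshold as speech.

    A toy energy-based detector; production VAD (and whatever ivrit.ai
    used) is more sophisticated, but the input/output shape is similar.
    """
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return energy > threshold

# Synthetic 1-second clip at 16 kHz: half silence, half a 440 Hz tone.
t = np.linspace(0, 1, 16000)
silence = np.zeros(8000)
tone = 0.5 * np.sin(2 * np.pi * 440 * t[:8000])
flags = energy_vad(np.concatenate([silence, tone]))
```

The detector keeps only speech-like regions, which is what makes the post-VAD form of the dataset smaller and easier to train on than raw audio.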

BASS: Block-wise Adaptation for Speech Summarization

  • paper_url: http://arxiv.org/abs/2307.08217
  • repo_url: None
  • paper_authors: Roshan Sharma, Kenneth Zheng, Siddhant Arora, Shinji Watanabe, Rita Singh, Bhiksha Raj
  • for: To improve the performance of end-to-end speech summarization models, addressing the quality loss caused by truncating overly long training inputs.
  • methods: A block-wise training approach that splits the input sequence and trains incrementally, letting the model learn over very long input sequences.
  • results: On the How2 dataset, the block-wise training method improves ROUGE-L by 3 points absolute over a truncated-input baseline.
    Abstract End-to-end speech summarization has been shown to improve performance over cascade baselines. However, such models are difficult to train on very large inputs (dozens of minutes or hours) owing to compute restrictions and are hence trained with truncated model inputs. Truncation leads to poorer models, and a solution to this problem rests in block-wise modeling, i.e., processing a portion of the input frames at a time. In this paper, we develop a method that allows one to train summarization models on very long sequences in an incremental manner. Speech summarization is realized as a streaming process, where hypothesis summaries are updated every block based on new acoustic information. We devise and test strategies to pass semantic context across the blocks. Experiments on the How2 dataset demonstrate that the proposed block-wise training method improves by 3 points absolute on ROUGE-L over a truncated input baseline.
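The streaming process the abstract describes, updating a hypothesis summary every block as new acoustic information arrives, can be sketched as a simple loop that carries state across blocks. The `update` function here is a hypothetical stand-in for the model's block step, not BASS's actual architecture.

```python
def blockwise_summarize(frames, block_size, update):
    """Stream a long input block by block, refining a running summary.

    `update(summary, block)` returns a new hypothesis summary given the
    next block of acoustic frames; semantic context is carried forward
    via `summary` rather than re-reading the whole input.
    """
    summary = None
    for start in range(0, len(frames), block_size):
        summary = update(summary, frames[start : start + block_size])
    return summary

# Toy `update`: track the running block count and the last block seen.
def toy_update(summary, block):
    count = 0 if summary is None else summary["blocks"]
    return {"blocks": count + 1, "last": block}

result = blockwise_summarize(list(range(10)), block_size=4, update=toy_update)
```

Because each step only sees one block plus the carried summary, memory stays bounded no matter how many minutes or hours of input are streamed.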

Analyzing Dataset Annotation Quality Management in the Wild

  • paper_url: http://arxiv.org/abs/2307.08153
  • repo_url: None
  • paper_authors: Jan-Christoph Klie, Richard Eckart de Castilho, Iryna Gurevych
  • for: To study the impact of data quality on the training and evaluation of machine learning models, and whether existing datasets contain erroneous annotations, bias, or annotation artifacts.
  • methods: The work surveys and summarizes recommended quality management practices for dataset creation from the literature, suggests how to apply them, then compiles 591 scientific publications introducing text datasets and annotates them for quality-related aspects.
  • results: A majority of the annotated publications apply good or very good quality management, but the effort of 30% of the works is only subpar. The analysis also reveals common errors, especially in using inter-annotator agreement and computing annotation error rates.
    Abstract Data quality is crucial for training accurate, unbiased, and trustworthy machine learning models and their correct evaluation. Recent works, however, have shown that even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, bias or annotation artifacts. There exist best practices and guidelines regarding annotation projects. But to the best of our knowledge, no large-scale analysis has been performed as of yet on how quality management is actually conducted when creating natural language datasets and whether these recommendations are followed. Therefore, we first survey and summarize recommended quality management practices for dataset creation as described in the literature and provide suggestions on how to apply them. Then, we compile a corpus of 591 scientific publications introducing text datasets and annotate it for quality-related aspects, such as annotator management, agreement, adjudication or data validation. Using these annotations, we then analyze how quality management is conducted in practice. We find that a majority of the annotated publications apply good or very good quality management. However, we deem the effort of 30% of the works as only subpar. Our analysis also shows common errors, especially with using inter-annotator agreement and computing annotation error rates.
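Since inter-annotator agreement is one of the quality-management aspects the paper finds is often misused, it is worth recalling what a chance-corrected agreement statistic actually computes. Below is a minimal implementation of Cohen's kappa for two annotators (a standard measure; the paper itself surveys several, and the example labels are hypothetical).

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed fraction of items the annotators label identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohens_kappa(a, b)  # raw agreement 4/6, chance 1/2 -> kappa 1/3
```

The common mistake is to report the raw 4/6 agreement; kappa discounts the agreement two coin-flipping annotators would reach anyway, which is why it is the quantity quality-management guidelines ask for.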

The Potential and Pitfalls of using a Large Language Model such as ChatGPT or GPT-4 as a Clinical Assistant

  • paper_url: http://arxiv.org/abs/2307.08152
  • repo_url: None
  • paper_authors: Jingqing Zhang, Kai Sun, Akshay Jagadeesh, Mahta Ghahfarokhi, Deepa Gupta, Ashok Gupta, Vibhor Gupta, Yike Guo
  • for: These two analyses assess the performance of ChatGPT and GPT-4 on a real-world electronic health record database and the feasibility of using them as diagnostic assistants.
  • methods: ChatGPT and GPT-4 are applied to medical diagnosis tasks on a real-world electronic health record database and to diagnostic-assistance tasks.
  • results: GPT-4 can reach F1 scores of 96% on diagnosis and diagnostic-assistance tasks, but some responses contain misleading information, overlook key medical findings, and recommend unnecessary investigations and overtreatment. These issues, together with medical data privacy concerns, make the models currently unsuitable for real-world clinical use.
    Abstract Recent studies have demonstrated promising performance of ChatGPT and GPT-4 on several medical domain tasks. However, none has assessed their performance using a large-scale real-world electronic health record database, nor evaluated their utility in providing clinical diagnostic assistance for patients across a full range of disease presentations. We performed two analyses using ChatGPT and GPT-4: one to identify patients with specific medical diagnoses using a real-world large electronic health record database, and the other to provide diagnostic assistance to healthcare workers in the prospective evaluation of hypothetical patients. Our results show that GPT-4, with chain-of-thought and few-shot prompting, can achieve F1 scores as high as 96% across disease classification tasks. For patient assessment, GPT-4 can diagnose accurately three out of four times. However, its responses included factually incorrect statements, overlooked crucial medical findings, and recommended unnecessary investigations and overtreatment. These issues, coupled with privacy concerns, make these models currently inadequate for real-world clinical use. However, the limited data and time needed for prompt engineering, compared to the configuration of conventional machine learning workflows, highlight their potential for scalability across healthcare applications.
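The headline 96% F1 combines precision and recall on patient identification. For reference, the standard definition with hypothetical counts chosen to land at that figure:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 96 true positives, 4 false positives,
# 4 false negatives (precision = recall = 0.96).
score = f1_score(96, 4, 4)
```

Because F1 penalizes both missed patients (false negatives) and spurious matches (false positives), a high score means the model does well on both, which matters for cohort identification in health records.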

It’s All Relative: Interpretable Models for Scoring Bias in Documents

  • paper_url: http://arxiv.org/abs/2307.08139
  • repo_url: None
  • paper_authors: Aswin Suresh, Chi-Hsuan Wu, Matthias Grossglauser
  • for: This paper proposes an interpretable model for scoring the bias present in web documents.
  • methods: The model is based on assumptions reminiscent of the Bradley-Terry axioms and is trained on pairs of revisions of the same Wikipedia article, where one version is more biased than the other, learning to score bias through pairwise comparison.
  • results: The model scores bias accurately, and its parameters can be interpreted to discover the words most indicative of bias. It is also applied in three settings: studying the temporal evolution of bias in Wikipedia articles, comparing news sources by bias, and scoring bias in law amendments.
    Abstract We propose an interpretable model to score the bias present in web documents, based only on their textual content. Our model incorporates assumptions reminiscent of the Bradley-Terry axioms and is trained on pairs of revisions of the same Wikipedia article, where one version is more biased than the other. While prior approaches based on absolute bias classification have struggled to obtain a high accuracy for the task, we are able to develop a useful model for scoring bias by learning to perform pairwise comparisons of bias accurately. We show that we can interpret the parameters of the trained model to discover the words most indicative of bias. We also apply our model in three different settings - studying the temporal evolution of bias in Wikipedia articles, comparing news sources based on bias, and scoring bias in law amendments. In each case, we demonstrate that the outputs of the model can be explained and validated, even for the two domains that are outside the training-data domain. We also use the model to compare the general level of bias between domains, where we see that legal texts are the least biased and news media are the most biased, with Wikipedia articles in between. Given its high performance, simplicity, interpretability, and wide applicability, we hope the model will be useful for a large community, including Wikipedia and news editors, political and social scientists, and the general public.
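The pairwise-comparison setup can be sketched as a Bradley-Terry-style model: each document gets a score that is linear in its word counts, and training pushes the more-biased revision of each pair to score higher via a logistic loss on the score difference. The learned per-word weights are then directly interpretable. This is a schematic sketch with a hypothetical three-word vocabulary, not the paper's exact model or features.

```python
import math
import random

def train_pairwise(pairs, vocab_size, lr=0.1, epochs=200, seed=0):
    """Fit word weights so that score(x) = w . x orders document pairs.

    Each pair is (features_more_biased, features_less_biased); both are
    bag-of-words count vectors. Minimizing logistic loss on the score
    difference mirrors a Bradley-Terry-style pairwise comparison.
    """
    rng = random.Random(seed)
    w = [0.0] * vocab_size
    for _ in range(epochs):
        rng.shuffle(pairs)
        for hi, lo in pairs:
            diff = [h - l for h, l in zip(hi, lo)]
            margin = sum(wi * d for wi, d in zip(w, diff))
            # Gradient of -log(sigmoid(margin)) w.r.t. margin, negated
            # for ascent on the correct ordering.
            grad = 1 / (1 + math.exp(margin))
            for i, d in enumerate(diff):
                w[i] += lr * grad * d
    return w

# Hypothetical vocabulary ["clearly", "allegedly", "data"]; the biased
# revision of each pair uses the loaded word "clearly" more often.
pairs = [
    ([2, 0, 1], [0, 1, 1]),
    ([1, 0, 2], [0, 0, 2]),
]
w = train_pairwise(pairs, vocab_size=3)
```

After training, the weight on "clearly" is positive (bias-indicative), the weight on "allegedly" negative, and the neutral "data" stays at zero because its count never differs within a pair, mirroring how the paper reads bias indicators off the model parameters.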