results: The paper presents a preliminary large-scale Hebrew language model that can serve as a starting point for fine-tuning on various Hebrew-specific tasks such as instruction following, Q&A, and sentiment analysis; it is an initial offering for the Hebrew NLP community to explore.
Abstract
We present DictaLM, a large-scale language model tailored for Modern Hebrew. Boasting 7B parameters, this model is predominantly trained on Hebrew-centric data. As a commitment to promoting research and development in the Hebrew language, we release both the foundation model and the instruct-tuned model under a Creative Commons license. Concurrently, we introduce DictaLM-Rab, another foundation model geared towards Rabbinic/Historical Hebrew. These foundation models serve as ideal starting points for fine-tuning various Hebrew-specific tasks, such as instruction, Q&A, sentiment analysis, and more. This release represents a preliminary step, offering an initial Hebrew LLM model for the Hebrew NLP community to experiment with.
Aligning Large Multimodal Models with Factually Augmented RLHF
for: addressing the multimodal misalignment issue in large multimodal models (LMM)
methods: using reinforcement learning from human feedback (RLHF) to train a vision-language model to align with human annotations, and augmenting the reward model with additional factual information such as image captions and ground-truth multi-choice options
results: reaching 94% of the performance level of text-only GPT-4 on the LLaVA-Bench dataset (where the previous best methods reach only 87%) and improving over other baselines by 60% on MMHAL-BENCH, as the first LMM trained with RLHF.
Abstract
Large Multimodal Models (LMM) are built across modalities and the misalignment between two modalities can result in "hallucination", generating textual outputs that are not grounded by the multimodal information in context. To address the multimodal misalignment issue, we adapt the Reinforcement Learning from Human Feedback (RLHF) from the text domain to the task of vision-language alignment, where human annotators are asked to compare two responses and pinpoint the more hallucinated one, and the vision-language model is trained to maximize the simulated human rewards. We propose a new alignment algorithm called Factually Augmented RLHF that augments the reward model with additional factual information such as image captions and ground-truth multi-choice options, which alleviates the reward hacking phenomenon in RLHF and further improves the performance. We also enhance the GPT-4-generated training data (for vision instruction tuning) with previously available human-written image-text pairs to improve the general capabilities of our model. To evaluate the proposed approach in real-world scenarios, we develop a new evaluation benchmark MMHAL-BENCH with a special focus on penalizing hallucinations. As the first LMM trained with RLHF, our approach achieves remarkable improvement on the LLaVA-Bench dataset with the 94% performance level of the text-only GPT-4 (while previous best methods can only achieve the 87% level), and an improvement by 60% on MMHAL-BENCH over other baselines. We opensource our code, model, data at https://llava-rlhf.github.io.
ChatGPT Performance on Standardized Testing Exam – A Proposed Strategy for Learners
results: Results show a statistically significant improvement in ChatGPT's accuracy after the question prompts were modified (84% accuracy on the modified prompts versus 69% on the original questions). The study also identifies question types that ChatGPT struggles with and shows how prompt modifications can help learners prepare for standardized tests such as the GRE.
Abstract
This study explores the problem solving capabilities of ChatGPT and its prospective applications in standardized test preparation, focusing on the GRE quantitative exam. Prior research has shown great potential for the utilization of ChatGPT for academic purposes in revolutionizing the approach to studying across various disciplines. We investigate how ChatGPT performs across various question types in the GRE quantitative domain, and how modifying question prompts impacts its accuracy. More specifically this study addressed two research questions: 1. How does ChatGPT perform in answering GRE-based quantitative questions across various content areas? 2. How does the accuracy of ChatGPT vary with modifying the question prompts? The dataset consisting of 100 randomly selected GRE quantitative questions was collected from the ETS official guide to GRE test preparation. We used quantitative evaluation to answer our first research question, and t-test to examine the statistical association between prompt modification and ChatGPT's accuracy. Results show a statistical improvement in the ChatGPT's accuracy after applying instruction priming and contextual prompts to the original questions. ChatGPT showed 84% accuracy with the modified prompts compared to 69% with the original data. The study discusses the areas where ChatGPT struggled with certain questions and how modifications can be helpful for preparing for standardized tests like GRE and provides future directions for prompt modifications.
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
results: Experiments show that DeepSpeed-Ulysses trains 2.5x faster than the existing SOTA baseline while supporting 4x longer sequence lengths.
Abstract
Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length. Until now, system works for accelerating LLM training have focused on the first three dimensions: data parallelism for batch size, tensor parallelism for hidden size and pipeline parallelism for model depth or layers. These widely studied forms of parallelism are not targeted or optimized for long sequence Transformer models. Given practical application needs for long sequence LLM, renewed attentions are being drawn to sequence parallelism. However, existing works in sequence parallelism are constrained by memory-communication inefficiency, limiting their scalability to long sequence large models. In this work, we introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training with extremely long sequence length. DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention computation. Theoretical communication analysis shows that whereas other methods incur communication overhead as sequence length increases, DeepSpeed-Ulysses maintains constant communication volume when sequence length and compute devices are increased proportionally. Furthermore, experimental evaluations show that DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing method SOTA baseline.
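To make the core mechanism concrete, here is a minimal sketch of Ulysses-style sequence-parallel attention: each rank holds a shard of the sequence, an all-to-all redistributes attention heads so that every rank attends over the full sequence for a subset of heads, and a second all-to-all restores the sequence sharding. This is an illustrative reading under assumed tensor layouts, not DeepSpeed's implementation; the helper `seq_alltoall` is ours.

```python
import torch
import torch.distributed as dist

def seq_alltoall(x: torch.Tensor, scatter_dim: int, gather_dim: int) -> torch.Tensor:
    """Hypothetical helper: split x into world_size chunks along scatter_dim,
    send chunk i to rank i, and concatenate what is received along gather_dim.
    Assumes dist.init_process_group() has been called and chunks divide evenly."""
    world = dist.get_world_size()
    inputs = [t.contiguous() for t in x.chunk(world, dim=scatter_dim)]
    outputs = [torch.empty_like(inputs[0]) for _ in range(world)]
    dist.all_to_all(outputs, inputs)
    return torch.cat(outputs, dim=gather_dim)

def ulysses_attention(q, k, v):
    """q, k, v: [batch, seq/P, heads, dim] -- each of the P ranks holds a
    sequence shard. After the first all-to-all each rank holds the *full*
    sequence for heads/P heads, so ordinary attention can run locally."""
    q = seq_alltoall(q, scatter_dim=2, gather_dim=1)  # -> [batch, seq, heads/P, dim]
    k = seq_alltoall(k, scatter_dim=2, gather_dim=1)
    v = seq_alltoall(v, scatter_dim=2, gather_dim=1)
    out = torch.nn.functional.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
    ).transpose(1, 2)
    # The second all-to-all restores the sequence sharding and regathers heads.
    return seq_alltoall(out, scatter_dim=1, gather_dim=2)  # -> [batch, seq/P, heads, dim]
```

Because each rank exchanges a fixed-size shard per device, the communication volume stays constant when sequence length and device count grow proportionally, which is the property the abstract highlights.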
Classifying token frequencies using angular Minkowski $p$-distance
results: The study finds that angular Minkowski $p$-distance with suitable values of $p$ yields substantially higher classification performance than classical cosine dissimilarity.
Abstract
Angular Minkowski $p$-distance is a dissimilarity measure that is obtained by replacing Euclidean distance in the definition of cosine dissimilarity with other Minkowski $p$-distances. Cosine dissimilarity is frequently used with datasets containing token frequencies, and angular Minkowski $p$-distance may potentially be an even better choice for certain tasks. In a case study based on the 20-newsgroups dataset, we evaluate classification performance for classical weighted nearest neighbours, as well as fuzzy rough nearest neighbours. In addition, we analyse the relationship between the hyperparameter $p$, the dimensionality $m$ of the dataset, the number of neighbours $k$, the choice of weights and the choice of classifier. We conclude that it is possible to obtain substantially higher classification performance with angular Minkowski $p$-distance with suitable values for $p$ than with classical cosine dissimilarity.
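Since the abstract defines the measure only verbally, the following sketch shows one natural reading of it: the Minkowski $p$-distance between $p$-norm-normalised vectors, which for $p=2$ recovers (up to a monotone transform) classical cosine dissimilarity. The exact normalisation used in the paper may differ.

```python
import numpy as np

def angular_minkowski_p_distance(x: np.ndarray, y: np.ndarray, p: float = 2.0) -> float:
    """Minkowski p-distance between p-norm-normalised vectors.

    One plausible reading of the definition, not necessarily the paper's
    exact formula. For p = 2, ||x/||x|| - y/||y||||^2 = 2 - 2*cos(x, y),
    so this reduces to a monotone transform of cosine dissimilarity.
    """
    xn = x / np.linalg.norm(x, ord=p)
    yn = y / np.linalg.norm(y, ord=p)
    return float(np.sum(np.abs(xn - yn) ** p) ** (1.0 / p))

# Example on token-frequency-like vectors
a = np.array([3.0, 0.0, 1.0, 2.0])
b = np.array([1.0, 1.0, 0.0, 2.0])
print(angular_minkowski_p_distance(a, b, p=1.5))
```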
Explainable and Accurate Natural Language Understanding for Voice Assistants and Beyond
paper_authors: Kalpa Gunaratna, Vijay Srinivasan, Hongxia Jin
for: Joint NLU (Natural Language Understanding), which is indispensable for smart voice assistants; its goal is to detect the user's intent and fill slots simultaneously.
methods: Using various techniques to improve accuracy while making the model inherently interpretable and explainable.
results: The joint NLU model is made inherently interpretable and explainable without compromising accuracy, and the extension can also be applied to other general classification tasks.
Abstract
Joint intent detection and slot filling, which is also termed as joint NLU (Natural Language Understanding) is invaluable for smart voice assistants. Recent advancements in this area have been heavily focusing on improving accuracy using various techniques. Explainability is undoubtedly an important aspect for deep learning-based models including joint NLU models. Without explainability, their decisions are opaque to the outside world and hence, have tendency to lack user trust. Therefore to bridge this gap, we transform the full joint NLU model to be `inherently' explainable at granular levels without compromising on accuracy. Further, as we enable the full joint NLU model explainable, we show that our extension can be successfully used in other general classification tasks. We demonstrate this using sentiment analysis and named entity recognition.
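For readers unfamiliar with the task, the sketch below shows what a joint NLU model computes: one utterance-level intent label plus one slot label per token, from a shared encoder. It illustrates the standard joint setup only, not the paper's explainable architecture; all module names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class JointNLU(nn.Module):
    """Shared encoder with two heads: utterance-level intent classification
    and token-level slot filling (a standard joint NLU setup)."""

    def __init__(self, vocab_size, num_intents, num_slots, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.intent_head = nn.Linear(2 * dim, num_intents)
        self.slot_head = nn.Linear(2 * dim, num_slots)

    def forward(self, token_ids):
        h, _ = self.encoder(self.embed(token_ids))       # [B, T, 2*dim]
        intent_logits = self.intent_head(h.mean(dim=1))  # one label per utterance
        slot_logits = self.slot_head(h)                  # one label per token
        return intent_logits, slot_logits

# Training minimises the sum of the two cross-entropy losses.
model = JointNLU(vocab_size=1000, num_intents=7, num_slots=20)
intent_logits, slot_logits = model(torch.randint(0, 1000, (2, 12)))
```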
DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention
paper_authors: Zhewei Yao, Xiaoxia Wu, Conglong Li, Minjia Zhang, Heyang Qin, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He
for: Improving the adaptability and scalability of large language models in interleaved, multi-round, multi-image dialogues.
methods: Introducing multi-modal capabilities, including a novel multi-modal causal attention mechanism and data blending techniques.
results: Superior scalability compared with existing frameworks, supporting language model sizes of up to 70B parameters.
Abstract
Most of the existing multi-modal models, hindered by their incapacity to adeptly manage interleaved image-and-text inputs in multi-image, multi-round dialogues, face substantial constraints in resource allocation for training and data accessibility, impacting their adaptability and scalability across varied interaction realms. To address this, we present the DeepSpeed-VisualChat framework, designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities, with a focus on enhancing the proficiency of Large Vision and Language Models in handling interleaved inputs. Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to assure seamless interactions in multi-round, multi-image conversations. Compared to existing frameworks, DeepSpeed-VisualChat shows superior scalability up to 70B parameter language model size, representing a significant advancement in multi-modal language models and setting a solid foundation for future explorations.
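One way to picture a multi-modal causal attention mask for interleaved image-text input is sketched below, under the assumption that text tokens attend causally to all earlier tokens of either modality while image tokens attend only within their own image. This is our reading of the general idea and may differ in detail from DeepSpeed-VisualChat's actual mechanism.

```python
import torch

def mm_causal_mask(modality: torch.Tensor, image_id: torch.Tensor) -> torch.Tensor:
    """Boolean attention mask for an interleaved image/text token sequence.

    modality[i] == 0 for image tokens, 1 for text tokens; image_id[i] groups
    tokens of the same image (-1 for text). Assumed behaviour (our reading):
    text tokens attend causally to all earlier tokens of either modality,
    while image tokens attend only within their own image.
    """
    n = modality.numel()
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    is_text_query = (modality == 1).unsqueeze(1)                 # [n, 1]
    same_image = image_id.unsqueeze(1) == image_id.unsqueeze(0)
    is_image_pair = (modality == 0).unsqueeze(1) & (modality == 0).unsqueeze(0)
    text_rule = is_text_query & causal
    image_rule = ~is_text_query & is_image_pair & same_image
    return text_rule | image_rule

# Toy sequence: [img0, img0, txt, txt, img1, img1, txt]
modality = torch.tensor([0, 0, 1, 1, 0, 0, 1])
image_id = torch.tensor([0, 0, -1, -1, 1, 1, -1])
print(mm_causal_mask(modality, image_id).int())
```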
Towards General-Purpose Text-Instruction-Guided Voice Conversion
results: Experiments show that the model understands the text instructions well and delivers reasonable conversion results.
Abstract
This paper introduces a novel voice conversion (VC) model, guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice". Unlike traditional methods that rely on reference utterances to determine the attributes of the converted speech, our model adds versatility and specificity to voice conversion. The proposed VC model is a neural codec language model which processes a sequence of discrete codes, resulting in the code sequence of converted speech. It utilizes text instructions as style prompts to modify the prosody and emotional information of the given speech. In contrast to previous approaches, which often rely on employing separate encoders like prosody and content encoders to handle different aspects of the source speech, our model handles various information of speech in an end-to-end manner. Experiments have demonstrated the impressive capabilities of our model in comprehending instructions and delivering reasonable results.
Urdu Poetry Generated by Using Deep Learning Techniques
for: This study presents Urdu poetry generated with different deep learning techniques and algorithms.
methods: The study applies deep learning models such as Long Short-Term Memory networks (LSTM) and Gated Recurrent Units (GRU), using Natural Language Processing (NLP) to understand, analyze, and generate language that humans can understand and use.
results: The results show that these techniques can generate Urdu poems with good accuracy.
Abstract
This study provides Urdu poetry generated using different deep-learning techniques and algorithms. The data was collected through the Rekhta website, containing 1341 text files with several couplets. The data on poetry was not from any specific genre or poet. Instead, it was a collection of mixed Urdu poems and Ghazals. Different deep learning techniques, such as Long Short-term Memory Networks (LSTM) and Gated Recurrent Units (GRU), have been applied. Natural Language Processing (NLP) may be used in machine learning to understand, analyze, and generate a language humans may use and understand. Much work has been done on generating poetry for different languages using different techniques. The collection and use of data were also different for different researchers. The primary purpose of this project is to provide a model that generates Urdu poems by using data completely, not by sampling data. Also, this may generate poems in pure Urdu, not Roman Urdu, as in the base paper. The results have shown good accuracy in the poems generated by the model.
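A minimal character-level LSTM language model of the kind the study describes might look like the following sketch; the architecture, hyperparameters, and sampling loop are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PoetryLSTM(nn.Module):
    """Character-level language model: predict the next character at each step."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state

@torch.no_grad()
def generate(model, start_ids, length=200, temperature=0.8):
    """Sample `length` characters, feeding each prediction back in."""
    ids, state = list(start_ids), None
    x = torch.tensor([ids])
    for _ in range(length):
        logits, state = model(x, state)
        probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1).item()
        ids.append(nxt)
        x = torch.tensor([[nxt]])
    return ids
```

A GRU variant would simply swap `nn.LSTM` for `nn.GRU`; training maximises the likelihood of the next character over the couplet corpus.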
Autonomous Vehicles an overview on system, cyber security, risks, issues, and a way forward
paper_authors: Md Aminul Islam, Sarah Alqahtani
for: This chapter explores the fundamental components and operational characteristics of autonomous vehicles and their integration within the framework of the Internet of Things.
methods: The chapter examines sensors, artificial intelligence identification systems, and control mechanisms, together with their integration with cloud-based servers.
results: It discusses practical applications of autonomous vehicles in areas such as traffic-pattern forecasting, their impact on task automation across industries, and their cybersecurity, covering ethical, environmental, legal, professional, and social risk dimensions.
Abstract
This chapter explores the complex realm of autonomous cars, analyzing their fundamental components and operational characteristics. The initial phase of the discussion is elucidating the internal mechanics of these automobiles, encompassing the crucial involvement of sensors, artificial intelligence (AI) identification systems, control mechanisms, and their integration with cloud-based servers within the framework of the Internet of Things (IoT). It delves into practical implementations of autonomous cars, emphasizing their utilization in forecasting traffic patterns and transforming the dynamics of transportation. The text also explores the topic of Robotic Process Automation (RPA), illustrating the impact of autonomous cars on different businesses through the automation of tasks. The primary focus of this investigation lies in the realm of cybersecurity, specifically in the context of autonomous vehicles. A comprehensive analysis will be conducted to explore various risk management solutions aimed at protecting these vehicles from potential threats including ethical, environmental, legal, professional, and social dimensions, offering a comprehensive perspective on their societal implications. A strategic plan for addressing the challenges and proposing strategies for effectively traversing the complex terrain of autonomous car systems, cybersecurity, hazards, and other concerns are some resources for acquiring an understanding of the intricate realm of autonomous cars and their ramifications in contemporary society, supported by a comprehensive compilation of resources for additional investigation. Keywords: RPA, Cyber Security, AV, Risk, Smart Cars
Only 5% Attention Is All You Need: Efficient Long-range Document-level Neural Machine Translation
for: DocNMT for handling discourse phenomena in Machine Translation tasks, with the goal of improving efficiency while maintaining performance.
methods: The paper introduces a lightweight attention mechanism to select a small portion of tokens to be attended, reducing the computational cost of the attention module while maintaining performance.
results: The method achieves up to 95% sparsity (only 5% of tokens attended) and saves 93% of the computation cost on the attention module compared to the original Transformer, while maintaining performance.
Abstract
Document-level Neural Machine Translation (DocNMT) has been proven crucial for handling discourse phenomena by introducing document-level context information. One of the most important directions is to input the whole document directly to the standard Transformer model. In this case, efficiency becomes a critical concern due to the quadratic complexity of the attention module. Existing studies either focus on the encoder part, which cannot be deployed on sequence-to-sequence generation tasks, e.g., Machine Translation (MT), or suffer from a significant performance drop. In this work, we keep the translation performance while gaining 20\% speed up by introducing extra selection layer based on lightweight attention that selects a small portion of tokens to be attended. It takes advantage of the original attention to ensure performance and dimension reduction to accelerate inference. Experimental results show that our method could achieve up to 95\% sparsity (only 5\% tokens attended) approximately, and save 93\% computation cost on the attention module compared with the original Transformer, while maintaining the performance.
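The selection idea — a lightweight module scores tokens and the expensive attention then attends only to the top-scoring few percent — can be sketched as follows. This is an illustrative reading of a token-selection layer (e.g. `keep_ratio=0.05` for 95% sparsity), not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveAttention(nn.Module):
    """Attend only to the top-k keys chosen by a lightweight scorer.

    Illustrative sketch: a single linear layer scores each token, the top
    `keep_ratio` fraction of tokens is gathered, and standard attention runs
    over that small subset instead of the full sequence.
    """

    def __init__(self, dim, keep_ratio=0.05):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)   # lightweight selection layer
        self.keep_ratio = keep_ratio

    def forward(self, q, k, v):
        # q: [B, Tq, D]; k, v: [B, Tk, D]
        k_keep = max(1, int(k.size(1) * self.keep_ratio))
        scores = self.scorer(k).squeeze(-1)                 # [B, Tk]
        idx = scores.topk(k_keep, dim=1).indices            # [B, k_keep]
        gather = idx.unsqueeze(-1).expand(-1, -1, k.size(-1))
        k_sel, v_sel = k.gather(1, gather), v.gather(1, gather)
        attn = F.softmax(q @ k_sel.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        return attn @ v_sel

layer = SelectiveAttention(dim=64)
out = layer(torch.randn(2, 10, 64), torch.randn(2, 200, 64), torch.randn(2, 200, 64))
```

The cost of the attention matrix drops from O(Tq·Tk) to O(Tq·k_keep), which is where the reported compute savings come from.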
Towards End-User Development for IoT: A Case Study on Semantic Parsing of Cooking Recipes for Programming Kitchen Devices
paper_authors: Filippos Ventirozos, Sarah Clinch, Riza Batista-Navarro
for: Supporting end-user programming of kitchen IoT devices from natural-language cooking recipe instructions.
methods: Semantic parsing with conditional random fields (CRF) and a neural network model.
results: Training semantic parsers based on the annotations is feasible, but most natural-language instructions are incomplete, so transforming them into a formal meaning representation is not straightforward.
Abstract
Semantic parsing of user-generated instructional text, in the way of enabling end-users to program the Internet of Things (IoT), is an underexplored area. In this study, we provide a unique annotated corpus which aims to support the transformation of cooking recipe instructions to machine-understandable commands for IoT devices in the kitchen. Each of these commands is a tuple capturing the semantics of an instruction involving a kitchen device in terms of "What", "Where", "Why" and "How". Based on this corpus, we developed machine learning-based sequence labelling methods, namely conditional random fields (CRF) and a neural network model, in order to parse recipe instructions and extract our tuples of interest from them. Our results show that while it is feasible to train semantic parsers based on our annotations, most natural-language instructions are incomplete, and thus transforming them into formal meaning representation, is not straightforward.
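As an illustration of the CRF sequence-labelling setup described above, here is a minimal sketch using `sklearn-crfsuite`; the feature template and the What/Where/How-style tag set are illustrative assumptions, not the paper's annotation scheme.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(sent, i):
    """Simple per-token feature dict; real systems add POS tags, gazetteers, etc."""
    word = sent[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

# Toy training pair: tokens and BIO-style tags (tag set is illustrative)
sents = [["Preheat", "the", "oven", "to", "180", "degrees"]]
tags = [["B-What", "O", "B-Where", "O", "B-How", "I-How"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, tags)
print(crf.predict(X))
```

The extracted spans would then be assembled into the paper's "What/Where/Why/How" tuples for a kitchen device command.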
Examining Temporal Bias in Abusive Language Detection
paper_authors: Mali Jin, Yida Mu, Diana Maynard, Kalina Bontcheva
for: This study aims to investigate the nature and impact of temporal bias in abusive language detection across various languages and explore mitigation methods.
methods: The study evaluates the performance of models on abusive data sets from different time periods and presents an extensive linguistic analysis of these abusive data sets from a diachronic perspective.
results: The results demonstrate that temporal bias is a significant challenge for abusive language detection, with models trained on historical data showing a significant drop in performance over time.
Abstract
The use of abusive language online has become an increasingly pervasive problem that damages both individuals and society, with effects ranging from psychological harm right through to escalation to real-life violence and even death. Machine learning models have been developed to automatically detect abusive language, but these models can suffer from temporal bias, the phenomenon in which topics, language use or social norms change over time. This study aims to investigate the nature and impact of temporal bias in abusive language detection across various languages and explore mitigation methods. We evaluate the performance of models on abusive data sets from different time periods. Our results demonstrate that temporal bias is a significant challenge for abusive language detection, with models trained on historical data showing a significant drop in performance over time. We also present an extensive linguistic analysis of these abusive data sets from a diachronic perspective, aiming to explore the reasons for language evolution and performance decline. This study sheds light on the pervasive issue of temporal bias in abusive language detection across languages, offering crucial insights into language evolution and temporal bias mitigation.
On the Relation between Internal Language Model and Sequence Discriminative Training for Neural Transducers
paper_authors: Zijian Yang, Wei Zhou, Ralf Schlüter, Hermann Ney
for: Improving the performance of the RNN-Transducer with external language model (LM) fusion.
methods: Sequence discriminative training, studied in relation to internal language model (ILM) subtraction.
results: Sequence discriminative training and ILM subtraction achieve similar performance across a wide range of experiments on Librispeech, including both MMI and MBR criteria; the benefit of ILM subtraction becomes much smaller after sequence discriminative training.
Abstract
Internal language model (ILM) subtraction has been widely applied to improve the performance of the RNN-Transducer with external language model (LM) fusion for speech recognition. In this work, we show that sequence discriminative training has a strong correlation with ILM subtraction from both theoretical and empirical points of view. Theoretically, we derive that the global optimum of maximum mutual information (MMI) training shares a similar formula as ILM subtraction. Empirically, we show that ILM subtraction and sequence discriminative training achieve similar performance across a wide range of experiments on Librispeech, including both MMI and minimum Bayes risk (MBR) criteria, as well as neural transducers and LMs of both full and limited context. The benefit of ILM subtraction also becomes much smaller after sequence discriminative training. We also provide an in-depth study to show that sequence discriminative training has a minimal effect on the commonly used zero-encoder ILM estimation, but a joint effect on both encoder and prediction + joint network for posterior probability reshaping including both ILM and blank suppression.
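For context, external LM fusion with ILM subtraction is commonly written as the following decoding rule (a standard formulation from the ILM-subtraction literature, with tunable scales $\lambda_{\text{ext}}$ and $\lambda_{\text{ILM}}$; the paper's exact notation may differ):

$$\hat{y} = \arg\max_{y} \Big[ \log p_{\text{RNN-T}}(y \mid x) + \lambda_{\text{ext}} \log p_{\text{extLM}}(y) - \lambda_{\text{ILM}} \log p_{\text{ILM}}(y) \Big]$$

The paper's theoretical result is that the global optimum of MMI-style sequence discriminative training yields a score of a similar form, which explains why the two approaches behave so similarly in practice.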
Wav2vec-based Detection and Severity Level Classification of Dysarthria from Speech
results: Experimental results show that embeddings from the first layer of the wav2vec 2.0 model give the best detection performance, an absolute improvement of 1.23% in accuracy over the best-performing baseline feature (spectrogram). In the studied severity level classification task, embeddings from the final layer give an absolute improvement of 10.62% in accuracy over the best baseline features (mel-frequency cepstral coefficients).
Abstract
Automatic detection and severity level classification of dysarthria directly from acoustic speech signals can be used as a tool in medical diagnosis. In this work, the pre-trained wav2vec 2.0 model is studied as a feature extractor to build detection and severity level classification systems for dysarthric speech. The experiments were carried out with the popularly used UA-speech database. In the detection experiments, the results revealed that the best performance was obtained using the embeddings from the first layer of the wav2vec model that yielded an absolute improvement of 1.23% in accuracy compared to the best performing baseline feature (spectrogram). In the studied severity level classification task, the results revealed that the embeddings from the final layer gave an absolute improvement of 10.62% in accuracy compared to the best baseline features (mel-frequency cepstral coefficients).
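Extracting layer-wise wav2vec 2.0 embeddings for a downstream detector can be sketched with the Hugging Face `transformers` API as below; the mean pooling and the SVM settings are illustrative choices rather than the paper's exact pipeline.

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

@torch.no_grad()
def layer_embedding(waveform: np.ndarray, sr: int = 16000, layer: int = 1) -> np.ndarray:
    """Mean-pooled hidden state of one wav2vec 2.0 layer for a single utterance.
    hidden_states[0] is the output of the feature projection; indices 1..12
    are the transformer layers of the base model."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0).numpy()

emb = layer_embedding(np.random.randn(16000).astype(np.float32))  # stand-in audio
# A detector would then be trained on utterance embeddings, e.g.:
# from sklearn.svm import SVC
# clf = SVC(kernel="rbf").fit(X_train, y_train)  # X_train: embeddings, y_train: labels
```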
Analysis and Detection of Pathological Voice using Glottal Source Features
methods: The study estimates glottal flows with the quasi-closed phase (QCP) glottal inverse filtering method, computes approximate glottal source signals with the zero frequency filtering (ZFF) method, and also uses the acoustic voice signals directly. In addition, it proposes deriving mel-frequency cepstral coefficients (MFCCs) from the glottal source waveforms computed by QCP and ZFF to effectively capture variations in the glottal source spectra of pathological voice.
results: The results show that the glottal source contains information that discriminates normal from pathological voice. In detection experiments using support vector machines (SVM), the studied glottal source features achieve performance comparable to or better than conventional MFCC and perceptual linear prediction (PLP) features. The best detection performance is obtained by combining the glottal source features with the conventional MFCC and PLP features, indicating the complementary nature of these features.
Abstract
Automatic detection of voice pathology enables objective assessment and earlier intervention for the diagnosis. This study provides a systematic analysis of glottal source features and investigates their effectiveness in voice pathology detection. Glottal source features are extracted using glottal flows estimated with the quasi-closed phase (QCP) glottal inverse filtering method, using approximate glottal source signals computed with the zero frequency filtering (ZFF) method, and using acoustic voice signals directly. In addition, we propose to derive mel-frequency cepstral coefficients (MFCCs) from the glottal source waveforms computed by QCP and ZFF to effectively capture the variations in glottal source spectra of pathological voice. Experiments were carried out using two databases, the Hospital Universitario Principe de Asturias (HUPA) database and the Saarbrucken Voice Disorders (SVD) database. Analysis of features revealed that the glottal source contains information that discriminates normal and pathological voice. Pathology detection experiments were carried out using support vector machine (SVM). From the detection experiments it was observed that the performance achieved with the studied glottal source features is comparable or better than that of conventional MFCCs and perceptual linear prediction (PLP) features. The best detection performance was achieved when the glottal source features were combined with the conventional MFCCs and PLP features, which indicates the complementary nature of the features.
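QCP and ZFF glottal inverse filtering are not reimplemented here, but once a glottal source waveform has been estimated, deriving MFCCs from it is a standard operation; a sketch with `librosa` (our choice of toolkit, not necessarily the authors') follows.

```python
import numpy as np
import librosa

def glottal_mfcc(glottal_waveform: np.ndarray, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    """MFCCs computed on an (already estimated) glottal source signal
    rather than on the raw acoustic waveform."""
    return librosa.feature.mfcc(y=glottal_waveform, sr=sr, n_mfcc=n_mfcc)

# Illustrative stand-in for a QCP- or ZFF-estimated glottal flow signal
fake_glottal = np.random.randn(16000).astype(np.float32)
print(glottal_mfcc(fake_glottal).shape)  # (13, n_frames)
```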
Multiple evolutionary pressures shape identical consonant avoidance in the world’s languages
methods: The researchers use phylogenetic analyses of the evolution of homologous word forms to compare how often word forms containing identical consonants arise, how word form mutation affects such sequences, and how often these forms die out during language change.
results: The study finds that word forms with identical consonants arise less frequently than those without, and that word form mutation is more likely to remove sequences of identical consonants than to introduce them; however, words with identical consonants do not die out more frequently. The conclusion is that the underrepresentation of identical consonants is overwhelmingly a byproduct of constraints on word form coinage, with usage-related processes additionally keeping such patterns infrequent in more salient vocabulary items.
Abstract
Languages disfavor word forms containing sequences of similar or identical consonants, due to the biomechanical and cognitive difficulties posed by patterns of this sort. However, the specific evolutionary processes responsible for this phenomenon are not fully understood. Words containing sequences of identical consonants may be more likely to arise than those without; processes of word form mutation may be more likely to remove than create sequences of identical consonants in word forms; finally, words containing identical consonants may die out more frequently than those without. Phylogenetic analyses of the evolution of homologous word forms indicate that words with identical consonants arise less frequently than those without, and processes which mutate word forms are more likely to remove sequences of identical consonants than introduce them. However, words with identical consonants do not die out more frequently than those without. Further analyses reveal that forms with identical consonants are replaced in basic meaning functions more frequently than words without. Taken together, results suggest that the under representation of sequences of identical consonants is overwhelmingly a byproduct of constraints on word form coinage, though processes related to word usage also serve to ensure that such patterns are infrequent in more salient vocabulary items. These findings clarify previously unknown aspects of processes of lexical evolution and competition that take place during language change, optimizing communicative systems.
Connecting Speech Encoder and Large Language Model for ASR
results: Q-Former-based LLMs show consistent and considerable word error rate (WER) reductions over the other connector structures across datasets. In addition, a novel segment-level Q-Former is proposed, enabling LLMs to recognise speech segments longer than the encoder's length limit and bringing a 17% relative WER reduction.
Abstract
The impressive capability and versatility of large language models (LLMs) have aroused increasing attention in automatic speech recognition (ASR), with several pioneering studies attempting to build integrated ASR models by connecting a speech encoder with an LLM. This paper presents a comparative study of three commonly used structures as connectors, including fully connected layers, multi-head cross-attention, and Q-Former. Speech encoders from the Whisper model series as well as LLMs from the Vicuna model series with different model sizes were studied. Experiments were performed on the commonly used LibriSpeech, Common Voice, and GigaSpeech datasets, where the LLMs with Q-Formers demonstrated consistent and considerable word error rate (WER) reductions over LLMs with other connector structures. Q-Former-based LLMs can generalise well to out-of-domain datasets, where 12% relative WER reductions over the Whisper baseline ASR model were achieved on the Eval2000 test set without using any in-domain training data from Switchboard. Moreover, a novel segment-level Q-Former is proposed to enable LLMs to recognise speech segments with a duration exceeding the limitation of the encoders, which results in 17% relative WER reductions over other connector structures on 90-second-long speech data.
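The Q-Former connector idea — a fixed set of learnable queries that cross-attend to the speech encoder's outputs and are projected into the LLM's embedding space — can be sketched as below; the layer counts, dimensions, and use of `nn.TransformerDecoderLayer` are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class QFormerConnector(nn.Module):
    """Learnable queries cross-attend to speech-encoder features, producing a
    fixed-length sequence projected into the LLM embedding space (sketch)."""

    def __init__(self, enc_dim=512, llm_dim=4096, num_queries=64, dim=512, layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.blocks = nn.ModuleList([
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(layers)
        ])
        self.enc_proj = nn.Linear(enc_dim, dim)
        self.llm_proj = nn.Linear(dim, llm_dim)

    def forward(self, enc_out):              # enc_out: [B, T, enc_dim]
        mem = self.enc_proj(enc_out)
        q = self.queries.unsqueeze(0).expand(enc_out.size(0), -1, -1)
        for blk in self.blocks:               # self-attn over queries + cross-attn to speech
            q = blk(q, mem)
        return self.llm_proj(q)               # [B, num_queries, llm_dim]

connector = QFormerConnector()
llm_prefix = connector(torch.randn(2, 300, 512))  # prepended to the LLM's input embeddings
```

Unlike a fully connected connector, the output length is fixed by `num_queries` regardless of the input duration, which is also what motivates the paper's segment-level variant for long audio.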
Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
results: The model generalises well to various speech recognition and translation benchmarks even in a zero-shot setup, supports more translation directions, and can be more efficient to train.
Abstract
Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessible, which makes it difficult for researchers to further improve its performance and address training-related issues such as efficiency, robustness, fairness, and bias. This work presents an Open Whisper-style Speech Model (OWSM), which reproduces Whisper-style training using an open-source toolkit and publicly available data. OWSM even supports more translation directions and can be more efficient to train. We will publicly release all scripts used for data preparation, training, inference, and scoring as well as pre-trained models and training logs to promote open science.