results: Demonstrated superior capability and efficiency on three real-world human evaluation tasks: predicting the aggregated behaviour of human annotators, matching the distribution of human annotations, and simulating inter-annotator disagreement.
Abstract
Human annotator simulation (HAS) serves as a cost-effective substitute for human evaluation such as data annotation and system assessment. Human perception and behaviour during human evaluation exhibit inherent variability due to diverse cognitive processes and subjective interpretations, which should be taken into account in modelling to better mimic the way people perceive and interact with the world. This paper introduces a novel meta-learning framework that treats HAS as a zero-shot density estimation problem, which incorporates human variability and allows for the efficient generation of human-like annotations for unlabelled test inputs. Under this framework, we propose two new model classes, conditional integer flows and conditional softmax flows, to account for ordinal and categorical annotations, respectively. The proposed method is evaluated on three real-world human evaluation tasks and shows superior capability and efficiency to predict the aggregated behaviours of human annotators, match the distribution of human annotations, and simulate the inter-annotator disagreements.
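To make the density-estimation framing concrete, here is a minimal Python sketch of the core idea: a conditional model outputs a full distribution over annotation labels, and simulated annotators are drawn from it, so disagreement is modelled rather than averaged away. This is a toy stand-in with hypothetical dimensions, not the paper's conditional integer/softmax flows.

```python
import torch
import torch.nn as nn

class ToyAnnotatorSimulator(nn.Module):
    """Maps an input embedding to a categorical distribution over labels,
    so sampled annotations reproduce inter-annotator variability."""
    def __init__(self, input_dim: int, num_labels: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(), nn.Linear(64, num_labels)
        )

    def forward(self, x: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.head(x))

model = ToyAnnotatorSimulator(input_dim=16, num_labels=5)
x = torch.randn(1, 16)                 # embedding of one unlabelled test input
annotations = model(x).sample((10,))   # ten simulated annotators, same input
print(annotations.squeeze().tolist())  # spread reflects the learned density
```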
Question-Answering Model for Schizophrenia Symptoms and Their Impact on Daily Life using Mental Health Forums Data
results: Experiments validate that the proposed method yields an accurate dataset, and fine-tuning a BioBERT QA model on it produces a question-answering model for the mental-disorder domain that reaches an F1 score of 0.885, outperforming the state-of-the-art model in this domain.
Abstract
In recent years, there has been a strong emphasis on mining medical data using machine learning techniques. A common problem is to obtain a noiseless set of textual documents, with content relevant to the research question, and to develop a Question Answering (QA) model for a specific medical field. The purpose of this paper is to present a new methodology for building a medical dataset and obtaining a QA model for analysis of symptoms and impact on daily life for a specific disease domain. The "Mental Health" forum was used, a forum dedicated to people suffering from schizophrenia and different mental disorders. Relevant posts of active users, who regularly participate, were extrapolated, providing a new method of obtaining low-bias content without privacy issues. Furthermore, it is shown how to pre-process the dataset to convert it into a QA dataset. The Bidirectional Encoder Representations from Transformers (BERT), DistilBERT, RoBERTa, and BioBERT models were fine-tuned and evaluated via F1-Score, Exact Match, Precision and Recall. Accurate empirical experiments demonstrated the effectiveness of the proposed method for obtaining an accurate dataset for QA model implementation. By fine-tuning the BioBERT QA model, we achieved an F1 score of 0.885, showing a considerable improvement and outperforming the state-of-the-art model for the mental-disorders domain.
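For readers unfamiliar with the extractive-QA setup being fine-tuned, a minimal sketch follows; the checkpoint is a generic public SQuAD model standing in for the paper's BioBERT fine-tuned on the forum-derived dataset.

```python
from transformers import pipeline

# Generic public SQuAD checkpoint; the paper fine-tunes BioBERT on its own
# forum-derived QA dataset instead.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = ("I was diagnosed with schizophrenia five years ago. The voices get "
           "worse at night and make it hard to keep a regular sleep schedule.")
answer = qa(question="How do the symptoms affect daily life?", context=context)
print(answer["answer"], round(answer["score"], 3))
```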
The Many Voices of Duying: Revisiting the Disputed Essays Between Lu Xun and Zhou Zuoren
paper_authors: Xin Xie, Jiangqiong Li, Haining Wang
for: This research aims to revisit three disputed essays pseudonymously published by Lu Xun and Zhou Zuoren in 1912, using quantitative methods and stylometric analysis to investigate the authors’ respective writing styles and examine the essays’ authorship.
methods: The research employs an interpretable authorship attribution model and visual representations of essay features to facilitate a nuanced understanding of the brothers’ formative intellectual trajectories and their collaboration on these early works.
results: The findings suggest that 'Looking at the Country of China' was authored by Lu Xun, while 'People of Yue, Forget Not Your Ancestors' Instructions' seems to be either predominantly authored or extensively revised by Lu Xun, with notable stylistic similarities to 'Looking at the Land of Yue,' which Zhou Zuoren recognized as his own but edited by Lu Xun. The third essay, 'Where Has the Character of the Republic Gone?', exhibits a 'diluted', mixed writing style, suggesting thorough collaboration between the brothers.
Abstract
Lu Xun and Zhou Zuoren stand as two of the most influential writers in modern Chinese literature. Beyond their familial ties as brothers, they were also intimate collaborators during the nascent stages of their writing careers. This research employs quantitative methods to revisit three disputed essays pseudonymously published by the brothers in 1912. Our stylometric analysis uses an interpretable authorship attribution model to investigate the essays' authorship and examine the brothers' respective writing styles. Our findings suggest that 'Looking at the Country of China' was authored by Lu Xun. Moreover, 'People of Yue, Forget Not Your Ancestors' Instructions' seems to be either predominantly authored or extensively revised by Lu Xun given its notable stylistic similarities to 'Looking at the Land of Yue,' a piece Zhou Zuoren recognized as his own, but edited by Lu Xun. The third essay, 'Where Has the Character of the Republic Gone?,' exhibits a 'diluted', mixed writing style, suggesting thorough collaboration. We offer visual representations of essay features to facilitate a nuanced and intuitive understanding. We have uncovered evidence suggesting Lu Xun's covert engagement with social issues during his purported 'silent era' and provided insights into the brothers' formative intellectual trajectories.
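The abstract does not name the attribution model's internals; one interpretable setup consistent with standard stylometric practice pairs character n-gram features with a linear classifier whose coefficients can be inspected. A sketch with placeholder texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder corpus; the study would use securely attributed essays.
texts  = ["essay text by Lu Xun ...", "another Lu Xun essay ...",
          "essay text by Zhou Zuoren ...", "another Zhou Zuoren essay ..."]
labels = ["LuXun", "LuXun", "ZhouZuoren", "ZhouZuoren"]

# Character n-grams are a standard, relatively topic-robust stylometric feature.
vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)

# Interpretability: coefficients show which character patterns drive attribution.
weights = sorted(zip(clf.coef_[0], vec.get_feature_names_out()))
print(weights[:3], weights[-3:])
print(clf.predict_proba(vec.transform(["text of a disputed 1912 essay ..."])))
```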
Enhancing Representation Generalization in Authorship Identification
results: The results show the importance of selecting appropriate linguistic features for authorship identification, especially in out-of-domain scenarios, and that representations learned with deep learning models can improve the generalization of authorship identification.
Abstract
Authorship identification ascertains the authorship of texts whose origins remain undisclosed. That authorship identification techniques work as reliably as they do has been attributed to the fact that authorial style is properly captured and represented. Although modern authorship identification methods have evolved significantly over the years and have proven effective in distinguishing authorial styles, the generalization of stylistic features across domains has not been systematically reviewed. The presented work addresses the challenge of enhancing the generalization of stylistic representations in authorship identification, particularly when there are discrepancies between training and testing samples. A comprehensive review of empirical studies was conducted, focusing on various stylistic features and their effectiveness in representing an author's style. The influencing factors such as topic, genre, and register on writing style were also explored, along with strategies to mitigate their impact. While some stylistic features, like character n-grams and function words, have proven to be robust and discriminative, others, such as content words, can introduce biases and hinder cross-domain generalization. Representations learned using deep learning models, especially those incorporating character n-grams and syntactic information, show promise in enhancing representation generalization. The findings underscore the importance of selecting appropriate stylistic features for authorship identification, especially in cross-domain scenarios. The recognition of the strengths and weaknesses of various linguistic features paves the way for more accurate authorship identification in diverse contexts.
Open-Domain Dialogue Quality Evaluation: Deriving Nugget-level Scores from Turn-level Scores
results: A case study shows that the method can help locate specific problems within a system turn, which in turn helps improve dialogue systems.
Abstract
Existing dialogue quality evaluation systems can return a score for a given system turn from a particular viewpoint, e.g., engagingness. However, to improve dialogue systems by locating exactly where in a system turn potential problems lie, a more fine-grained evaluation may be necessary. We therefore propose an evaluation approach where a turn is decomposed into nuggets (i.e., expressions associated with a dialogue act), and nugget-level evaluation is enabled by leveraging an existing turn-level evaluation system. We demonstrate the potential effectiveness of our evaluation method through a case study.
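The abstract leaves the derivation mechanism open; one simple way to obtain nugget-level scores from a black-box turn-level scorer is leave-one-out attribution, sketched below with a stand-in scorer (a real system would replace `turn_score`).

```python
def turn_score(turn: str) -> float:
    """Stand-in for an existing black-box turn-level evaluator
    (e.g., an engagingness scorer); replace with the real system."""
    return min(1.0, len(turn.split()) / 20)

def nugget_scores(nuggets: list[str]) -> dict[str, float]:
    """Leave-one-out attribution: a nugget's score is how much removing it
    changes the turn-level score, localising problems within the turn."""
    full = turn_score(" ".join(nuggets))
    return {n: full - turn_score(" ".join(m for m in nuggets if m is not n))
            for n in nuggets}

turn = ["I see,", "that movie sounds fun.", "By the way, what's your name?"]
for nugget, contribution in nugget_scores(turn).items():
    print(f"{contribution:+.3f}  {nugget}")
```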
AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ
results: In both human and automatic evaluation, CLiMA and LLaMA outperform the commercial GPT-4 and Claude 2 models, scoring higher in similarity to human-created figures, with CLiMA additionally improving text-image alignment.
Abstract
Generating bitmap graphics from text has gained considerable attention, yet for scientific figures, vector graphics are often preferred. Given that vector graphics are typically encoded using low-level graphics primitives, generating them directly is difficult. To address this, we propose the use of TikZ, a well-known abstract graphics language that can be compiled to vector graphics, as an intermediate representation of scientific figures. TikZ offers human-oriented, high-level commands, thereby facilitating conditional language modeling with any large language model. To this end, we introduce DaTikZ, the first large-scale TikZ dataset, consisting of 120k TikZ drawings aligned with captions. We fine-tune LLaMA on DaTikZ, as well as our new model CLiMA, which augments LLaMA with multimodal CLIP embeddings. In both human and automatic evaluation, CLiMA and LLaMA outperform commercial GPT-4 and Claude 2 in terms of similarity to human-created figures, with CLiMA additionally improving text-image alignment. Our detailed analysis shows that all models generalize well and are not susceptible to memorization. GPT-4 and Claude 2, however, tend to generate more simplistic figures compared to both humans and our models. We make our framework, AutomaTikZ, along with model weights and datasets, publicly available.
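The usage pattern is ordinary conditional language modeling: condition on the caption, decode TikZ, and compile. A sketch follows; `gpt2` is a placeholder for the released DaTikZ-fine-tuned weights, so its raw output will not be valid TikZ.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "gpt2"  # placeholder; substitute the released DaTikZ-fine-tuned weights
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

caption = "A feedforward neural network with one hidden layer, drawn as circles and arrows."
out = model.generate(**tok(caption, return_tensors="pt"),
                     max_new_tokens=256, do_sample=True, temperature=0.8,
                     pad_token_id=tok.eos_token_id)
tikz_code = tok.decode(out[0], skip_special_tokens=True)
print(tikz_code)  # with the real weights, compile via pdflatex to get the figure
```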
Gaze-Driven Sentence Simplification for Language Learners: Enhancing Comprehension and Readability
results: Experimental results show that the system can accurately estimate sentence-level comprehension, and that simplification with GPT-3.5 improves readability in terms of both traditional readability metrics and individual word difficulty.
Abstract
Language learners should regularly engage in reading challenging materials as part of their study routine. Nevertheless, constantly referring to dictionaries is time-consuming and distracting. This paper presents a novel gaze-driven sentence simplification system designed to enhance reading comprehension while maintaining their focus on the content. Our system incorporates machine learning models tailored to individual learners, combining eye gaze features and linguistic features to assess sentence comprehension. When the system identifies comprehension difficulties, it provides simplified versions by replacing complex vocabulary and grammar with simpler alternatives via GPT-3.5. We conducted an experiment with 19 English learners, collecting data on their eye movements while reading English text. The results demonstrated that our system is capable of accurately estimating sentence-level comprehension. Additionally, we found that GPT-3.5 simplification improved readability in terms of traditional readability metrics and individual word difficulty, paraphrasing across different linguistic levels.
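A sketch of the loop under stated assumptions: a per-learner classifier over illustrative gaze and linguistic features gates a GPT-3.5 simplification call. The feature choices and values are hypothetical, and the call requires an OpenAI API key.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from openai import OpenAI

# Per-learner model over illustrative features: fixation count, dwell time (s),
# regressions, sentence length, rare-word count. Values are made up.
X_train = np.array([[3, 0.9, 0, 12, 1], [11, 3.2, 4, 28, 6], [5, 1.1, 1, 15, 2]])
y_train = [1, 0, 1]  # 1 = understood, 0 = comprehension difficulty
clf = RandomForestClassifier().fit(X_train, y_train)

sentence = "The ramifications of the fiscal consolidation remained opaque."
gaze_features = np.array([[9, 2.7, 3, 9, 4]])  # measured while reading `sentence`
if clf.predict(gaze_features)[0] == 0:          # difficulty detected, so simplify
    client = OpenAI()                           # requires OPENAI_API_KEY
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Simplify the vocabulary and grammar of: {sentence}"}])
    print(resp.choices[0].message.content)
```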
Red Teaming Game: A Game-Theoretic Framework for Red Teaming Language Models
results: Experimental results show that GRTS autonomously discovers diverse attack strategies and effectively improves the security of LLMs, outperforming existing heuristic red-team designs.
Abstract
Deployable Large Language Models (LLMs) must conform to the criterion of helpfulness and harmlessness, thereby achieving consistency between LLMs outputs and human values. Red-teaming techniques constitute a critical way towards this criterion. Existing work relies solely on manual red team designs and heuristic adversarial prompts for vulnerability detection and optimization. These approaches lack rigorous mathematical formulation, thus limiting the exploration of diverse attack strategies within a quantifiable measure and the optimization of LLMs under convergence guarantees. In this paper, we present Red-teaming Game (RTG), a general game-theoretic framework without manual annotation. RTG is designed for analyzing the multi-turn attack and defense interactions between Red-team Language Models (RLMs) and a Blue-team Language Model (BLM). Within the RTG, we propose Gamified Red-teaming Solver (GRTS) with diversity measure of the semantic space. GRTS is an automated red teaming technique to solve RTG towards Nash equilibrium through meta-game analysis, which corresponds to the theoretically guaranteed optimization direction of both RLMs and BLM. Empirical results in multi-turn attacks with RLMs show that GRTS autonomously discovered diverse attack strategies and effectively improved security of LLMs, outperforming existing heuristic red-team designs. Overall, RTG has established a foundational framework for red teaming tasks and constructed a new scalable oversight technique for alignment.
In-Context Learning in Large Language Models: A Neuroscience-inspired Analysis of Representations
paper_authors: Safoora Yousefi, Leo Betthauser, Hosein Hasanbeig, Akanksha Saran, Raphaël Millière, Ida Momennejad
for: investigate the mechanisms behind the improvement of large language models (LLMs) through in-context learning (ICL)
methods: employ neuroscience-inspired techniques such as representational similarity analysis (RSA) and propose novel methods for parameterized probing and attention ratio analysis (ARA)
results: found a meaningful correlation between changes in both embeddings and attention representations and improvements in behavioral performance after ICL, offering valuable tools and insights for future research and practical applications.
Abstract
Large language models (LLMs) exhibit remarkable performance improvement through in-context learning (ICL) by leveraging task-specific examples in the input. However, the mechanisms behind this improvement remain elusive. In this work, we investigate embeddings and attention representations in Llama-2 70B and Vicuna 13B. Specifically, we study how embeddings and attention change after in-context-learning, and how these changes mediate improvement in behavior. We employ neuroscience-inspired techniques, such as representational similarity analysis (RSA), and propose novel methods for parameterized probing and attention ratio analysis (ARA, measuring the ratio of attention to relevant vs. irrelevant information). We designed three tasks with a priori relationships among their conditions: reading comprehension, linear regression, and adversarial prompt injection. We formed hypotheses about expected similarities in task representations to investigate latent changes in embeddings and attention. Our analyses revealed a meaningful correlation between changes in both embeddings and attention representations with improvements in behavioral performance after ICL. This empirical framework empowers a nuanced understanding of how latent representations affect LLM behavior with and without ICL, offering valuable tools and insights for future research and practical applications.
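Of the proposed methods, ARA is the most directly reproducible from the abstract's description: the ratio of attention mass on relevant versus irrelevant context positions. A minimal sketch with an invented attention row:

```python
import numpy as np

def attention_ratio(attn_row: np.ndarray, relevant: list[int],
                    irrelevant: list[int]) -> float:
    """ARA for one query position of one head: attention mass on relevant
    context positions divided by mass on irrelevant ones."""
    return float(attn_row[relevant].sum() / (attn_row[irrelevant].sum() + 1e-9))

# Invented attention row over 8 context tokens (in practice, taken from a
# model run with output_attentions=True); positions 2-4 hold the ICL example.
attn_row = np.array([0.05, 0.05, 0.30, 0.25, 0.15, 0.08, 0.07, 0.05])
print(attention_ratio(attn_row, relevant=[2, 3, 4], irrelevant=[0, 1, 5, 6, 7]))
```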
Towards LLM-based Fact Verification on News Claims with a Hierarchical Step-by-Step Prompting Method
results: On two public misinformation datasets, the HiSS prompting method outperforms the state-of-the-art fully-supervised approach and strong few-shot ICL-enabled baselines.
Abstract
While large pre-trained language models (LLMs) have shown their impressive capabilities in various NLP tasks, they are still under-explored in the misinformation domain. In this paper, we examine LLMs with in-context learning (ICL) for news claim verification, and find that only with 4-shot demonstration examples, the performance of several prompting methods can be comparable with previous supervised models. To further boost performance, we introduce a Hierarchical Step-by-Step (HiSS) prompting method which directs LLMs to separate a claim into several subclaims and then verify each of them via multiple questions-answering steps progressively. Experiment results on two public misinformation datasets show that HiSS prompting outperforms state-of-the-art fully-supervised approach and strong few-shot ICL-enabled baselines.
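A sketch of the HiSS flow as described, decompose then verify progressively; `ask_llm` is a stub standing in for any chat-completion API, and the prompt wording is illustrative.

```python
def ask_llm(prompt: str) -> str:
    """Stub; replace with a real chat-completion call."""
    return "supported"

def hiss_verify(claim: str) -> list[tuple[str, str]]:
    """HiSS sketch: decompose a claim into subclaims, then verify each one
    through explicit question-answering steps."""
    decompose = (f"Claim: {claim}\n"
                 "Split this claim into atomic, checkable subclaims, one per line.")
    subclaims = [s for s in ask_llm(decompose).splitlines() if s.strip()]
    verdicts = []
    for sub in subclaims:
        verify = (f"Subclaim: {sub}\n"
                  "Raise the questions needed to check this subclaim, answer them "
                  "one by one, then conclude: supported, refuted, or not enough info.")
        verdicts.append((sub, ask_llm(verify)))
    return verdicts

print(hiss_verify("The Eiffel Tower was moved to Berlin in 2020."))
```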
results: RelBERT can model relations well beyond those seen in training, such as relations between named entities, and significantly outperforms prompting strategies based on much larger language models, including recent GPT-based models.
Abstract
Many applications need access to background knowledge about how different concepts and entities are related. Although Knowledge Graphs (KG) and Large Language Models (LLM) can address this need to some extent, KGs are inevitably incomplete and their relational schema is often too coarse-grained, while LLMs are inefficient and difficult to control. As an alternative, we propose to extract relation embeddings from relatively small language models. In particular, we show that masked language models such as RoBERTa can be straightforwardly fine-tuned for this purpose, using only a small amount of training data. The resulting model, which we call RelBERT, captures relational similarity in a surprisingly fine-grained way, allowing us to set a new state-of-the-art in analogy benchmarks. Crucially, RelBERT is capable of modelling relations that go well beyond what the model has seen during training. For instance, we obtained strong results on relations between named entities with a model that was only trained on lexical relations between concepts, and we observed that RelBERT can recognise morphological analogies despite not being trained on such examples. Overall, we find that RelBERT significantly outperforms strategies based on prompting language models that are several orders of magnitude larger, including recent GPT-based models and open source models.
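A sketch of the encoding side: wrap a word pair in a relation-eliciting template and pool the encoder's output as the relation embedding. The fine-tuning objective that makes these vectors discriminative is the paper's contribution and is omitted here, so off-the-shelf roberta-base will not reproduce RelBERT's quality.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base").eval()

def relation_embedding(head: str, tail: str) -> torch.Tensor:
    """Encode a word pair with a relation-eliciting template and mean-pool
    the output; RelBERT's fine-tuning makes such vectors relation-aware."""
    prompt = f"Today, I finally discovered the relation between {head} and {tail}."
    with torch.no_grad():
        hidden = encoder(**tok(prompt, return_tensors="pt")).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)

cos = torch.nn.functional.cosine_similarity
a = relation_embedding("Paris", "France")   # capital-of
b = relation_embedding("Tokyo", "Japan")    # capital-of
c = relation_embedding("Paris", "cheese")   # unrelated
print(cos(a, b, dim=0).item(), cos(a, c, dim=0).item())
```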
Understanding In-Context Learning from Repetitions
results: The study reveals the dual role of surface features in text generation and explains the internal mechanism of in-context learning and its potential limitations.
Abstract
This paper explores the elusive mechanism underpinning in-context learning in Large Language Models (LLMs). Our work provides a novel perspective by examining in-context learning via the lens of surface repetitions. We quantitatively investigate the role of surface features in text generation, and empirically establish the existence of 'token co-occurrence reinforcement', a principle that strengthens the relationship between two tokens based on their contextual co-occurrences. By investigating the dual impacts of these features, our research illuminates the internal workings of in-context learning and expounds on the reasons for its failures. This paper provides an essential contribution to the understanding of in-context learning and its potential limitations, providing a fresh perspective on this exciting capability.
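Token co-occurrence reinforcement can be probed directly by comparing a token's next-token probability with and without prior in-context co-occurrences of the pair. A toy GPT-2 probe (example strings are invented):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def next_token_prob(context: str, token: str) -> float:
    ids = tok(context, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return torch.softmax(logits, dim=-1)[tok.encode(token)[0]].item()

# The probability of " sky" after "blue" should rise once the context already
# contains "blue sky" co-occurrences: the reinforcement effect.
print(next_token_prob("red apple, blue", " sky"))
print(next_token_prob("blue sky, blue sky, red apple, blue", " sky"))
```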
AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR
paper_authors: Tobi Olatunji, Tejumade Afonja, Aditya Yadavalli, Chris Chinenye Emezue, Sahib Singh, Bonaventure F. P. Dossou, Joanne Osuchukwu, Salomey Osei, Atnafu Lambebo Tonja, Naome Etori, Clinton Mbataku
for: The paper aims to address the lack of productivity tools for overworked clinicians in Africa, where the doctor-to-patient ratio is very low.
methods: The paper uses clinical automatic speech recognition (ASR) systems, which are mature and ubiquitous in developed nations but have not been widely available to clinicians in Africa. The authors also release a new dataset called AfriSpeech, which includes 200 hours of Pan-African English speech from 2,463 unique speakers across 120 indigenous accents from 13 countries.
results: The authors release pre-trained models with state-of-the-art (SOTA) performance on the AfriSpeech benchmark, which can be used to improve the accuracy of clinical ASR systems for African accents.
Abstract
Africa has a very low doctor-to-patient ratio. At very busy clinics, doctors could see 30+ patients per day -- a heavy patient burden compared with developed countries -- but productivity tools such as clinical automatic speech recognition (ASR) are lacking for these overworked clinicians. However, clinical ASR is mature, even ubiquitous, in developed nations, and clinician-reported performance of commercial clinical ASR systems is generally satisfactory. Furthermore, the recent performance of general domain ASR is approaching human accuracy. However, several gaps exist. Several publications have highlighted racial bias with speech-to-text algorithms and performance on minority accents lags significantly. To our knowledge, there is no publicly available research or benchmark on accented African clinical ASR, and speech data is non-existent for the majority of African accents. We release AfriSpeech, 200hrs of Pan-African English speech, 67,577 clips from 2,463 unique speakers across 120 indigenous accents from 13 countries for clinical and general domain ASR, a benchmark test set, with publicly available pre-trained models with SOTA performance on the AfriSpeech benchmark.
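A hedged sketch of consuming the benchmark: stream the dataset and transcribe a sample with a generic ASR model. The dataset ID and the Whisper checkpoint are assumptions, not the paper's released SOTA models.

```python
from datasets import load_dataset
from transformers import pipeline

# Dataset ID and ASR checkpoint are assumptions; consult the AfriSpeech release
# for the official identifiers and its pre-trained SOTA models.
ds = load_dataset("tobiolatunji/afrispeech-200", split="test", streaming=True)
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

sample = next(iter(ds))
audio = {"array": sample["audio"]["array"],
         "sampling_rate": sample["audio"]["sampling_rate"]}
print("accent:", sample.get("accent"))
print("hypothesis:", asr(audio)["text"])
print("reference:", sample.get("transcript"))
```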
AutoHall: Automated Hallucination Dataset Generation for Large Language Models
for: This paper is written for detecting non-factual or hallucinatory content generated by large language models (LLMs).
methods: The paper proposes a method for automatically constructing model-specific hallucination datasets based on existing fact-checking datasets, called AutoHall. Additionally, the paper proposes a zero-resource and black-box hallucination detection method based on self-contradiction.
results: The paper achieves superior hallucination detection performance compared to extant baselines, and reveals variations in hallucination proportions and types among different models.
Abstract
While Large language models (LLMs) have garnered widespread applications across various domains due to their powerful language understanding and generation capabilities, the detection of non-factual or hallucinatory content generated by LLMs remains scarce. Currently, one significant challenge in hallucination detection is the laborious task of time-consuming and expensive manual annotation of the hallucinatory generation. To address this issue, this paper first introduces a method for automatically constructing model-specific hallucination datasets based on existing fact-checking datasets called AutoHall. Furthermore, we propose a zero-resource and black-box hallucination detection method based on self-contradiction. We conduct experiments towards prevalent open-/closed-source LLMs, achieving superior hallucination detection performance compared to extant baselines. Moreover, our experiments reveal variations in hallucination proportions and types among different models.
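The self-contradiction idea can be sketched independently of any particular model: sample several answers to the same question and flag hallucination when they disagree. Both helpers below are stubs (a real system would use an LLM call and an NLI model).

```python
def ask_llm(question: str, temperature: float = 1.0) -> str:
    return "stub answer"           # replace with a real sampled LLM response

def contradicts(a: str, b: str) -> bool:
    return a.strip() != b.strip()  # replace with an NLI entailment check

def detect_hallucination(question: str, n_samples: int = 5) -> bool:
    """Zero-resource, black-box check: sample several answers and flag
    hallucination when the samples contradict one another."""
    answers = [ask_llm(question) for _ in range(n_samples)]
    return any(contradicts(answers[0], other) for other in answers[1:])

print(detect_hallucination("Who won the 1997 Fields Medal?"))
```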
SLM: Bridge the thin gap between speech and text foundation models
for: The paper is written for the task of speech and language modeling, with a focus on multitask, multilingual, and dual-modal models.
methods: The paper uses pretrained foundational speech and language models, and trains a simple adapter with just 1% of the foundation models' parameters to adapt the model to new tasks.
results: The paper demonstrates strong performance on conventional tasks such as speech recognition and speech translation, and introduces the novel capability of zero-shot instruction-following for more diverse tasks such as contextual biasing ASR, dialog generation, speech continuation, and question answering.
Abstract
We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. SLM freezes the pretrained foundation models to maximally preserve their capabilities, and only trains a simple adapter with just 1% (156M) of the foundation models' parameters. This adaptation not only leads SLM to achieve strong performance on conventional tasks such as speech recognition (ASR) and speech translation (AST), but also introduces the novel capability of zero-shot instruction-following for more diverse tasks: given a speech input and a text instruction, SLM is able to perform unseen generation tasks including contextual biasing ASR using real-time context, dialog generation, speech continuation, and question answering, etc. Our approach demonstrates that the representational gap between pretrained speech and language models might be narrower than one would expect, and can be bridged by a simple adaptation mechanism. As a result, SLM is not only efficient to train, but also inherits strong capabilities already acquired in foundation models of different modalities.
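A PyTorch sketch of the recipe: freeze both foundation models and train only a small projection from the speech encoder's output space into the LM's embedding space. Dimensions, depth, and the stand-in modules are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpeechTextAdapter(nn.Module):
    """Both foundation models stay frozen; only the adapter (a small fraction
    of total parameters) is trained to bridge the two embedding spaces."""
    def __init__(self, speech_encoder: nn.Module, lm: nn.Module,
                 speech_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        self.speech_encoder, self.lm = speech_encoder, lm
        for p in self.speech_encoder.parameters():
            p.requires_grad = False
        for p in self.lm.parameters():
            p.requires_grad = False
        self.adapter = nn.Sequential(          # the only trainable part
            nn.Linear(speech_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )

    def forward(self, speech_feats: torch.Tensor, text_embeds: torch.Tensor):
        speech_embeds = self.adapter(self.speech_encoder(speech_feats))
        # Prepend adapted speech tokens to the text-instruction embeddings,
        # then let the frozen LM consume both modalities.
        return self.lm(torch.cat([speech_embeds, text_embeds], dim=1))

enc = nn.Linear(80, 1024)   # stand-in speech encoder
lm  = nn.Identity()         # stand-in frozen LM
slm = SpeechTextAdapter(enc, lm)
out = slm(torch.randn(1, 50, 80), torch.randn(1, 10, 4096))
print(out.shape)            # (1, 60, 4096): 50 speech tokens + 10 text tokens
```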
Detecting Unseen Multiword Expressions in American Sign Language
results: Word embeddings carry information that can detect non-compositionality with decent accuracy.
Abstract
Multiword expressions present unique challenges in many translation tasks. In an attempt to ultimately apply a multiword expression detection system to the translation of American Sign Language, we built and tested two systems that apply word embeddings from GloVe to determine whether or not the word embeddings of lexemes can be used to predict whether or not those lexemes compose a multiword expression. It became apparent that word embeddings carry data that can detect non-compositionality with decent accuracy.
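A runnable approximation of the compositionality test: compare a phrase's own vector with the mean of its parts. Google News word2vec is substituted for GloVe here only because its vocabulary ships underscore-joined phrase vectors; the phrase list is illustrative.

```python
import numpy as np
import gensim.downloader as api

# The paper uses GloVe; Google News word2vec is used here because it
# already contains vectors for underscore-joined phrases.
vectors = api.load("word2vec-google-news-300")

def compositionality(expression: str) -> float:
    """Cosine similarity between a phrase's own vector and the mean of its
    component vectors; low similarity suggests a non-compositional MWE."""
    parts = expression.split()
    whole = vectors["_".join(parts)]
    composed = np.mean([vectors[w] for w in parts], axis=0)
    return float(np.dot(whole, composed) /
                 (np.linalg.norm(whole) * np.linalg.norm(composed)))

for phrase in ["kick the bucket", "hot dog", "red car"]:
    try:
        print(phrase, round(compositionality(phrase), 3))
    except KeyError:
        print(phrase, "not in vocabulary")
```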
results: The study finds that scholarly documents across disciplines share similar structures and modes of expression, with both similarities and differences between disciplines. These results lay a foundation for future work on assessing research quality, domain style transfer, and further pragmatic analysis.
Abstract
Scholarly documents have a great degree of variation, both in terms of content (semantics) and structure (pragmatics). Prior work in scholarly document understanding emphasizes semantics through document summarization and corpus topic modeling but tends to omit pragmatics such as document organization and flow. Using a corpus of scholarly documents across 19 disciplines and state-of-the-art language modeling techniques, we learn a fixed set of domain-agnostic descriptors for document sections and "retrofit" the corpus to these descriptors (also referred to as "normalization"). Then, we analyze the position and ordering of these descriptors across documents to understand the relationship between discipline and structure. We report within-discipline structural archetypes, variability, and between-discipline comparisons, supporting the hypothesis that scholarly communities, despite their size, diversity, and breadth, share similar avenues for expressing their work. Our findings lay the foundation for future work in assessing research quality, domain style transfer, and further pragmatic analysis.
The Sem-Lex Benchmark: Modeling ASL Signs and Their Phonemes
methods: We introduce a new resource for American Sign Language (ASL) modeling, the Sem-Lex Benchmark. It consists of over 84k videos of isolated sign productions by deaf ASL signers who gave informed consent and received compensation. Human experts aligned these videos with other sign language resources, including ASL-LEX, SignBank, and ASL Citizen, enabling useful expansions for sign and phonological feature recognition.
results: A suite of experiments using an SL-GCN model shows that the signs' phonological features can be recognized with 85% accuracy and that they are an effective auxiliary target for isolated sign recognition (ISR). Learning to recognize phonological features alongside gloss improves few-shot ISR accuracy by 6% and overall ISR accuracy by 2%. Instructions for downloading the data can be found at https://github.com/leekezar/SemLex.
Abstract
Sign language recognition and translation technologies have the potential to increase access and inclusion of deaf signing communities, but research progress is bottlenecked by a lack of representative data. We introduce a new resource for American Sign Language (ASL) modeling, the Sem-Lex Benchmark. The Benchmark is the current largest of its kind, consisting of over 84k videos of isolated sign productions from deaf ASL signers who gave informed consent and received compensation. Human experts aligned these videos with other sign language resources including ASL-LEX, SignBank, and ASL Citizen, enabling useful expansions for sign and phonological feature recognition. We present a suite of experiments which make use of the linguistic information in ASL-LEX, evaluating the practicality and fairness of the Sem-Lex Benchmark for isolated sign recognition (ISR). We use an SL-GCN model to show that the phonological features are recognizable with 85% accuracy, and that they are effective as an auxiliary target to ISR. Learning to recognize phonological features alongside gloss results in a 6% improvement for few-shot ISR accuracy and a 2% improvement for ISR accuracy overall. Instructions for downloading the data can be found at https://github.com/leekezar/SemLex.
Exploring Strategies for Modeling Sign Language Phonology
results: On the Sem-Lex Benchmark, a curriculum learning strategy achieves an average accuracy of 87% across all phoneme types, outperforming fine-tuning and multi-task strategies.
Abstract
Like speech, signs are composed of discrete, recombinable features called phonemes. Prior work shows that models which can recognize phonemes are better at sign recognition, motivating deeper exploration into strategies for modeling sign language phonemes. In this work, we learn graph convolution networks to recognize the sixteen phoneme "types" found in ASL-LEX 2.0. Specifically, we explore how learning strategies like multi-task and curriculum learning can leverage mutually useful information between phoneme types to facilitate better modeling of sign language phonemes. Results on the Sem-Lex Benchmark show that curriculum learning yields an average accuracy of 87% across all phoneme types, outperforming fine-tuning and multi-task strategies for most phoneme types.
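A minimal sketch of the curriculum strategy: a shared backbone with per-phoneme-type heads, where types enter training one stage at a time. The ordering, dimensions, and synthetic data are assumptions, and a small MLP stands in for the SL-GCN.

```python
import torch
import torch.nn as nn

# Toy curriculum over phoneme "types" (location, handshape, movement, ...):
# class counts and the easy-to-hard ordering are illustrative.
phoneme_types = {"location": 30, "handshape": 49, "movement": 20}
backbone = nn.Sequential(nn.Linear(256, 128), nn.ReLU())  # stand-in for SL-GCN
heads = nn.ModuleDict({t: nn.Linear(128, k) for t, k in phoneme_types.items()})
opt = torch.optim.Adam(list(backbone.parameters()) + list(heads.parameters()))
loss_fn = nn.CrossEntropyLoss()

active: list[str] = []
for stage, new_type in enumerate(phoneme_types):
    active.append(new_type)                     # curriculum: grow the task set
    for step in range(100):                     # brief training per stage
        feats = torch.randn(8, 256)             # pose features for 8 sign clips
        labels = {t: torch.randint(0, phoneme_types[t], (8,)) for t in active}
        h = backbone(feats)
        loss = sum(loss_fn(heads[t](h), labels[t]) for t in active)
        opt.zero_grad(); loss.backward(); opt.step()
    print(f"stage {stage}: training on {active}, loss {loss.item():.2f}")
```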