results: In a pilot study, NEWSSENSE helped users identify key information, verify the credibility of news articles, and explore different perspectives.
Abstract
Reading and understanding the stories in the news is increasingly difficult. Reporting on stories evolves rapidly, politicized news venues offer different perspectives (and sometimes different facts), and misinformation is rampant. However, existing solutions merely aggregate an overwhelming amount of information from heterogeneous sources, such as different news outlets, social media, and news bias rating agencies. We present NEWSSENSE, a novel sensemaking tool and reading interface designed to collect and integrate information from multiple news articles on a central topic, using a form of reference-free fact verification. NEWSSENSE augments a central, grounding article of the user's choice by linking it to related articles from different sources, providing inline highlights on how specific claims in the chosen article are either supported or contradicted by information from other articles. Using NEWSSENSE, users can seamlessly digest and cross-check multiple information sources without disturbing their natural reading flow. Our pilot study shows that NEWSSENSE has the potential to help users identify key information, verify the credibility of news articles, and explore different perspectives.
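results: The study finds that information-theoretic evaluation helps assess the quality of text explanations and reveals the underlying mechanisms of different explanation methods. For example, NLEs trade off slightly between transmitting input-related and target-related information, whereas rationales do not exhibit such a trade-off.
Abstract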
Text-based explanation is a particularly promising approach in explainable AI, but the evaluation of text explanations is method-dependent. We argue that placing the explanations in an information-theoretic framework could unify the evaluations of two popular text explanation methods: rationales and natural language explanations (NLEs). This framework considers the post-hoc text pipeline as a series of communication channels, which we refer to as "explanation channels". We quantify the information flow through these channels, thereby facilitating the assessment of explanation characteristics. We set up tools for quantifying two information scores: relevance and informativeness. We illustrate what our proposed information scores measure by comparing them against some traditional evaluation metrics. Our information-theoretic scores reveal some unique observations about the underlying mechanisms of two representative text explanations. For example, the NLEs trade off slightly between transmitting the input-related information and the target-related information, whereas the rationales do not exhibit such a trade-off mechanism. Our work contributes to the ongoing efforts in establishing rigorous and standardized evaluation criteria in the rapidly evolving field of explainable AI.
Module-wise Adaptive Distillation for Multimodality Foundation Models
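results: Experiments show that the OPTIMA algorithm tunes the model more effectively, improving the reliability and scalability of pre-trained multimodal foundation models.
Abstract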
Pre-trained multimodal foundation models have demonstrated remarkable generalizability but pose challenges for deployment due to their large sizes. One effective approach to reducing their sizes is layerwise distillation, wherein small student models are trained to match the hidden representations of large teacher models at each layer. Motivated by our observation that certain architecture components, referred to as modules, contribute more significantly to the student's performance than others, we propose to track the contributions of individual modules by recording the loss decrement after distilling each module, and to distill modules with greater contributions more frequently. Such an approach can be naturally formulated as a multi-armed bandit (MAB) problem, where modules and loss decrements are considered as arms and rewards, respectively. We then develop a modified Thompson sampling algorithm named OPTIMA to address the nonstationarity of module contributions resulting from model updating. Specifically, we leverage the observed contributions in recent history to estimate the changing contribution of each module and select modules based on these estimations to maximize the cumulative contribution. We evaluate the effectiveness of OPTIMA through distillation experiments on various multimodal understanding and image captioning tasks, using the CoCa-Large model (Yu et al., 2022) as the teacher model.
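The bandit formulation above can be illustrated with a small sketch. It is not the OPTIMA algorithm itself, whose details the abstract does not give. It is a generic Gaussian Thompson sampler that, as an assumption, handles nonstationarity by keeping only a sliding window of recent rewards. The class name `WindowedThompsonSampler`, the window size, the prior width, and the toy `simulate` loop are all hypothetical.

```python
import random
from collections import deque


class WindowedThompsonSampler:
    """Gaussian Thompson sampling over modules, using only recent rewards.

    Arms are modules; the reward of an arm is the loss decrement observed
    after distilling that module. The sliding window is an assumed way to
    track contributions that change as the student model is updated.
    """

    def __init__(self, num_modules, window=20, prior_std=0.5, seed=0):
        self.history = [deque(maxlen=window) for _ in range(num_modules)]
        self.prior_std = prior_std
        self.rng = random.Random(seed)

    def select(self):
        # Draw one sample per module from an approximate posterior over its
        # recent mean contribution, then pick the module with the largest draw.
        samples = []
        for rewards in self.history:
            n = len(rewards)
            mean = sum(rewards) / n if n else 0.0
            std = self.prior_std / (n + 1) ** 0.5
            samples.append(self.rng.gauss(mean, std))
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, module, loss_before, loss_after):
        # Reward = loss decrement after distilling the chosen module.
        self.history[module].append(loss_before - loss_after)


def simulate(steps=500):
    """Toy run: module 1 has the largest (hypothetical) true contribution."""
    true_gain = [0.1, 0.5, 0.2]
    sampler = WindowedThompsonSampler(num_modules=len(true_gain))
    noise = random.Random(1)
    counts = [0] * len(true_gain)
    for _ in range(steps):
        module = sampler.select()
        decrement = true_gain[module] + noise.gauss(0.0, 0.05)
        sampler.update(module, loss_before=1.0, loss_after=1.0 - decrement)
        counts[module] += 1
    return counts
```

In a real distillation loop, `update` would receive the student's loss measured before and after a distillation step on the selected module, in place of the simulated values.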
Envisioning Narrative Intelligence: A Creative Visual Storytelling Anthology
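results: The paper describes five themes that characterize the variations found in the creative visual storytelling process and proposes narrative intelligence criteria for computational visual storytelling: creative, reliable, expressive, grounded, and responsible.
Abstract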
In this paper, we collect an anthology of 100 visual stories from authors who participated in our systematic creative process of improvised story-building based on image sequences. Following close reading and thematic analysis of our anthology, we present five themes that characterize the variations found in this creative visual storytelling process: (1) Narrating What is in Vision vs. Envisioning; (2) Dynamically Characterizing Entities/Objects; (3) Sensing Experiential Information About the Scenery; (4) Modulating the Mood; (5) Encoding Narrative Biases. In understanding the varied ways that people derive stories from images, we offer considerations for collecting story-driven training data to inform automatic story generation. In correspondence with each theme, we envision narrative intelligence criteria for computational visual storytelling as: creative, reliable, expressive, grounded, and responsible. From these criteria, we discuss how to foreground creative expression, account for biases, and operate in the bounds of visual storyworlds.
RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation
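results: On language modeling and open-domain question answering tasks, the approach achieves a compression rate as low as 6% with minimal loss in performance, and compressors trained for one LM transfer to other LMs.
Abstract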
Retrieving documents and prepending them in-context at inference time improves the performance of language models (LMs) on a wide range of tasks. However, these documents, often spanning hundreds of words, make inference substantially more expensive. We propose compressing the retrieved documents into textual summaries prior to in-context integration. This not only reduces the computational costs but also relieves the burden of LMs to identify relevant information in long retrieved documents. We present two compressors -- an extractive compressor which selects useful sentences from retrieved documents and an abstractive compressor which generates summaries by synthesizing information from multiple documents. Both compressors are trained to improve LMs' performance on end tasks when the generated summaries are prepended to the LMs' input, while keeping the summary concise. If the retrieved documents are irrelevant to the input or offer no additional information to the LM, our compressor can return an empty string, implementing selective augmentation. We evaluate our approach on language modeling and open-domain question answering tasks. We achieve a compression rate of as low as 6% with minimal loss in performance for both tasks, significantly outperforming the off-the-shelf summarization models. We show that our compressors trained for one LM can transfer to other LMs on the language modeling task and provide summaries largely faithful to the retrieved documents.
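A minimal sketch of the extractive path with selective augmentation follows. The compressors in the paper are trained models; here, as a loudly labeled assumption, a simple word-overlap score stands in for the trained sentence scorer. The function names, `top_k`, and `min_score` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, sentence):
    """Fraction of query tokens found in the sentence (stand-in scorer)."""
    q, s = tokenize(query), tokenize(sentence)
    if not q or not s:
        return 0.0
    return len(q & s) / len(q)


def compress(query, documents, top_k=2, min_score=0.3):
    """Select up to top_k useful sentences; return "" if none is useful."""
    sentences = []
    for doc in documents:
        parts = re.split(r"(?<=[.!?])\s+", doc)
        sentences.extend(p.strip() for p in parts if p.strip())
    scored = [(overlap_score(query, s), i, s) for i, s in enumerate(sentences)]
    kept = sorted((t for t in scored if t[0] >= min_score), reverse=True)[:top_k]
    kept.sort(key=lambda t: t[1])  # restore original sentence order
    return " ".join(s for _, _, s in kept)


def build_prompt(query, documents):
    """Prepend the summary only when the compressor returns something."""
    summary = compress(query, documents)
    return f"{summary}\n{query}" if summary else query
```

When every sentence scores below `min_score`, `compress` returns an empty string and `build_prompt` passes the query through unchanged, which mirrors the selective augmentation described above.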
Improving Stability in Simultaneous Speech Translation: A Revision-Controllable Decoding Approach
results: Experiments show that the proposed method significantly improves decoding stability without substantially compromising translation quality.
Abstract
Simultaneous Speech-to-Text translation serves a critical role in real-time crosslingual communication. Despite the advancements in recent years, challenges remain in achieving stability in the translation process, a concern primarily manifested in the flickering of partial results. In this paper, we propose a novel revision-controllable method designed to address this issue. Our method introduces an allowed revision window within the beam search pruning process to screen out candidate translations likely to cause extensive revisions, leading to a substantial reduction in flickering and, crucially, providing the capability to completely eliminate flickering. The experiments demonstrate the proposed method can significantly improve the decoding stability without compromising substantially on the translation quality.
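The allowed revision window can be sketched as a filter over beam candidates. This is only an illustration under assumptions: the abstract does not define how a revision is measured, so here it is taken to be the number of already-displayed tokens a candidate would rewrite. The function names and the fallback of keeping the displayed text are hypothetical.

```python
def common_prefix_len(a, b):
    """Length of the shared prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def revision_size(shown, candidate):
    """Number of already-displayed tokens the candidate would rewrite."""
    return len(shown) - common_prefix_len(shown, candidate)


def prune(shown, candidates, window):
    """Keep (tokens, score) candidates whose revision fits in the window."""
    return [(t, s) for t, s in candidates if revision_size(shown, t) <= window]


def next_output(shown, candidates, window):
    """Best-scoring allowed candidate; keep the displayed text if none fits."""
    allowed = prune(shown, candidates, window)
    if not allowed:
        return shown
    return max(allowed, key=lambda c: c[1])[0]
```

With `window=0`, only candidates that extend the displayed text survive, so the displayed prefix never changes. Larger windows trade some flicker for the freedom to pick higher-scoring hypotheses.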
Amortizing intractable inference in large language models
results: Experiments show that this distribution-matching approach lets LLMs adapt better to tasks that require multi-step rationalization and tool use.
Abstract
Autoregressive large language models (LLMs) compress knowledge from their training data through next-token conditional distributions. This limits tractable querying of this knowledge to start-to-end autoregressive sampling. However, many tasks of interest -- including sequence continuation, infilling, and other forms of constrained generation -- involve sampling from intractable posterior distributions. We address this limitation by using amortized Bayesian inference to sample from these intractable posteriors. Such amortization is algorithmically achieved by fine-tuning LLMs via diversity-seeking reinforcement learning algorithms: generative flow networks (GFlowNets). We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training and reward-maximizing policy optimization. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem and demonstrate that our approach enables data-efficient adaptation of LLMs to tasks that require multi-step rationalization and tool use.
Transferring speech-generic and depression-specific knowledge for Alzheimer’s disease detection
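results: Experimental results show that the proposed method improves AD and depression detection and achieves a state-of-the-art F1 score of 0.928 on the ADReSSo dataset.
Abstract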
The detection of Alzheimer's disease (AD) from spontaneous speech has attracted increasing attention while the sparsity of training data remains an important issue. This paper handles the issue by knowledge transfer, specifically from both speech-generic and depression-specific knowledge. The paper first studies sequential knowledge transfer from generic foundation models pretrained on large amounts of speech and text data. A block-wise analysis is performed for AD diagnosis based on the representations extracted from different intermediate blocks of different foundation models. Apart from the knowledge from speech-generic representations, this paper also proposes to simultaneously transfer the knowledge from a speech depression detection task based on the high comorbidity rates of depression and AD. A parallel knowledge transfer framework is studied that jointly learns the information shared between these two tasks. Experimental results show that the proposed method improves AD and depression detection, and produces a state-of-the-art F1 score of 0.928 for AD diagnosis on the commonly used ADReSSo dataset.
Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services
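results: The approach surpasses human-level accuracy and can detect multiple types of biased language simultaneously. These results can provide practical solutions for real-world hate speech and bias mitigation, improving the health of online communities.
Abstract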
With the growth of online services, the need for advanced text classification algorithms, such as sentiment analysis and biased text detection, has become increasingly evident. The anonymous nature of online services often leads to the presence of biased and harmful language, posing challenges to maintaining the health of online communities. This phenomenon is especially relevant in South Korea, where large-scale hate speech detection algorithms have not yet been broadly explored. In this paper, we introduce a new comprehensive, large-scale dataset collected from a well-known South Korean SNS platform. Our proposed dataset provides annotations including (1) Preferences, (2) Profanities, and (3) Nine types of Bias for the text samples, enabling multi-task learning for simultaneous classification of user-generated texts. Leveraging state-of-the-art BERT-based language models, our approach surpasses human-level accuracy across diverse classification tasks, as measured by various metrics. Beyond academic contributions, our work can provide practical solutions for real-world hate speech and bias mitigation, contributing directly to the improvement of online community health. Our work provides a robust foundation for future research aiming to improve the quality of online discourse and foster societal well-being. All source codes and datasets are publicly accessible at https://github.com/Dasol-Choi/KoMultiText.
Written and spoken corpus of real and fake social media postings about COVID-19
paper_authors: Ng Bee Chin, Ng Zhi Ee Nicole, Kyla Kwan, Lee Yong Han Dylann, Liu Fang, Xu Hong
for: This study aims to investigate the linguistic traits of fake news and real news in both written and speech data.
methods: The study uses a dataset of COVID-19 related tweets and TikTok videos, which are fact-checked and labeled as ‘Real’, ‘Fake’, or ‘Questionable’. The Linguistic Inquiry and Word Count (LIWC) software is used to detect patterns in linguistic data.
results: The study finds a set of linguistic features that distinguish fake news from real news in both written and speech data, offering valuable insights into the role of language in shaping trust, social media interactions, and the propagation of fake news.
Abstract
This study investigates the linguistic traits of fake news and real news. There are two parts to this study: text data and speech data. The text data for this study consisted of 6420 COVID-19 related tweets re-filtered from Patwa et al. (2021). After cleaning, the dataset contained 3049 tweets, with 2161 labeled as 'real' and 888 as 'fake'. The speech data for this study was collected from TikTok, focusing on COVID-19 related videos. Research assistants fact-checked each video's content using credible sources and labeled them as 'Real', 'Fake', or 'Questionable', resulting in a dataset of 91 real entries and 109 fake entries from 200 TikTok videos with a total word count of 53,710 words. The data was analysed using the Linguistic Inquiry and Word Count (LIWC) software to detect patterns in linguistic data. The results indicate a set of linguistic features that distinguish fake news from real news in both written and speech data. This offers valuable insights into the role of language in shaping trust, social media interactions, and the propagation of fake news.
mlirSynth: Automatic, Retargetable Program Raising in Multi-Level IR using Program Synthesis
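results: On Polybench, mlirSynth achieves greater coverage than previous approaches, with geomean speedups of 2.5x (Intel) and 3.4x (AMD) over existing compilation flows. mlirSynth also enables retargeting to domain-specific accelerators, with a geomean speedup of 21.6x on a TPU.
Abstract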
MLIR is an emerging compiler infrastructure for modern hardware, but existing programs cannot take advantage of MLIR's high-performance compilation if they are described in lower-level general purpose languages. Consequently, to avoid programs needing to be rewritten manually, this has led to efforts to automatically raise lower-level dialects to higher-level ones in MLIR. However, current methods rely on manually-defined raising rules, which limit their applicability and make them challenging to maintain as MLIR dialects evolve. We present mlirSynth -- a novel approach which translates programs from lower-level MLIR dialects to high-level ones without manually defined rules. Instead, it uses available dialect definitions to construct a program space and searches it effectively using type constraints and equivalences. We demonstrate its effectiveness by raising C programs to two distinct high-level MLIR dialects, which enables us to use existing high-level dialect-specific compilation flows. On Polybench, we show a greater coverage than previous approaches, resulting in geomean speedups of 2.5x (Intel) and 3.4x (AMD) over state-of-the-art compilation flows for the C programming language. mlirSynth also enables retargetability to domain-specific accelerators, resulting in a geomean speedup of 21.6x on a TPU.
How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation
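results: The results show that if all entries of the input matrices are bounded above by $o(\sqrt[3]{\log n})$, the "tensor-type" attention matrix can be approximated in $n^{1+o(1)}$ time. If the entries may be as large as $\Omega(\sqrt[3]{\log n})$, no algorithm runs faster than $n^{3-o(1)}$ (assuming the Strong Exponential Time Hypothesis).
Abstract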
In the classical transformer attention scheme, we are given three $n \times d$ size matrices $Q, K, V$ (the query, key, and value tokens), and the goal is to compute a new $n \times d$ size matrix $D^{-1} \exp(QK^\top) V$ where $D = \mathrm{diag}( \exp(QK^\top) {\bf 1}_n )$. In this work, we study a generalization of attention which captures triple-wise correlations. This generalization is able to solve problems about detecting triple-wise connections that were shown to be impossible for transformers. The potential downside of this generalization is that it appears as though computations are even more difficult, since the straightforward algorithm requires cubic time in $n$. However, we show that in the bounded-entry setting (which arises in practice, and which is well-studied in both theory and practice), there is actually a near-linear time algorithm. More precisely, we show that bounded entries are both necessary and sufficient for quickly performing generalized computations: $\bullet$ On the positive side, if all entries of the input matrices are bounded above by $o(\sqrt[3]{\log n})$ then we show how to approximate the ``tensor-type'' attention matrix in $n^{1+o(1)}$ time. $\bullet$ On the negative side, we show that if the entries of the input matrices may be as large as $\Omega(\sqrt[3]{\log n})$, then there is no algorithm that runs faster than $n^{3-o(1)}$ (assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory). We also show that our construction, algorithms, and lower bounds naturally generalize to higher-order tensors and correlations. Interestingly, the higher the order of the tensors, the lower the bound on the entries needs to be for an efficient algorithm. Our results thus yield a natural tradeoff between the boundedness of the entries, and order of the tensor one may use for more expressive, efficient attention computation.
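The formula $D^{-1} \exp(QK^\top) V$ and the cubic cost of the straightforward generalization can be made concrete with a naive reference sketch in plain Python. `attention` follows the formula stated above. `triple_attention` is an assumption: the abstract does not spell out the triple-wise form, so the sketch uses one natural choice with two key matrices and two value matrices (`K1`, `K2`, `V1`, `V2`, all hypothetical names), scoring each triple $(i, j, k)$ by $\sum_c Q_{ic} K1_{jc} K2_{kc}$. Neither function is the near-linear-time algorithm from the paper.

```python
import math


def attention(Q, K, V):
    """Naive D^{-1} exp(Q K^T) V for n x d lists of lists; O(n^2 d) time."""
    n, d = len(Q), len(V[0])
    out = []
    for i in range(n):
        weights = [
            math.exp(sum(q * k for q, k in zip(Q[i], K[j]))) for j in range(n)
        ]
        norm = sum(weights)  # the i-th diagonal entry of D
        out.append(
            [sum(w * V[j][c] for j, w in enumerate(weights)) / norm for c in range(d)]
        )
    return out


def triple_attention(Q, K1, K2, V1, V2):
    """Assumed triple-wise form, evaluated naively in O(n^3 d) time.

    Row i attends over all pairs (j, k); each pair contributes the
    entrywise product of V1[j] and V2[k].
    """
    n, d = len(Q), len(V1[0])
    out = []
    for i in range(n):
        acc = [0.0] * d
        norm = 0.0
        for j in range(n):
            for k in range(n):
                score = sum(q * a * b for q, a, b in zip(Q[i], K1[j], K2[k]))
                w = math.exp(score)
                norm += w
                for c in range(d):
                    acc[c] += w * V1[j][c] * V2[k][c]
        out.append([a / norm for a in acc])
    return out
```

The three nested loops over `i`, `j`, and `k` are where the cubic dependence on $n$ comes from. The bounded-entry result in the abstract concerns approximating this computation much faster.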
Analysis of the Reasoning with Redundant Information Provided Ability of Large Language Models
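results: The study finds that LLMs perform poorly when faced with redundant information. This exposes the limitations of current LLMs in reasoning and suggests that future training data should include more redundant information to improve performance on RRIP tasks.
Abstract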
Recent advancements in Large Language Models (LLMs) have demonstrated impressive capabilities across a range of natural language processing tasks, especially in reasoning, a cornerstone for achieving Artificial General Intelligence (AGI). However, commonly used benchmarks may not fully encapsulate the inferential abilities of these models in real-world scenarios. To address this gap, a new form of Question-Answering (QA) task, termed Reasoning with Redundant Information Provided (RRIP), is introduced. The study designed a modified version of the grade school math 8K (GSM-8K) dataset which has several variants focusing on different attributes of redundant information. This investigation evaluates two popular LLMs, LlaMA2-13B-chat and generative pre-trained transformer 3.5 (GPT-3.5), contrasting their performance on traditional QA tasks against the RRIP tasks. Findings indicate that while these models achieved moderate success on standard QA benchmarks, their performance notably declines when assessed on RRIP tasks. The study not only highlights the limitations of current LLMs in handling redundant information but also suggests that future training of these models should focus on incorporating redundant information into the training data to increase the performance on RRIP tasks.
Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language Models
paper_authors: Boyu Zhang, Hongyang Yang, Tianyu Zhou, Ali Babar, Xiao-Yang Liu
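for: This paper aims to improve the accuracy and effectiveness of financial sentiment analysis.
methods: The paper uses Large Language Models (LLMs) and a retrieval-augmentation module to address the limitations of traditional NLP models in financial sentiment analysis.
results: Benchmarked against traditional models and other LLMs (such as ChatGPT and LLaMA), the approach achieves a 15% to 48% performance gain.
Abstract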
Financial sentiment analysis is critical for valuation and investment decision-making. Traditional NLP models, however, are limited by their parameter size and the scope of their training datasets, which hampers their generalization capabilities and effectiveness in this field. Recently, Large Language Models (LLMs) pre-trained on extensive corpora have demonstrated superior performance across various NLP tasks due to their commendable zero-shot abilities. Yet, directly applying LLMs to financial sentiment analysis presents challenges. The discrepancy between the pre-training objective of LLMs and predicting the sentiment label can compromise their predictive performance. Furthermore, the succinct nature of financial news, often devoid of sufficient context, can significantly diminish the reliability of LLMs' sentiment analysis. To address these challenges, we introduce a retrieval-augmented LLM framework for financial sentiment analysis. This framework includes an instruction-tuned LLM module, which ensures LLMs behave as predictors of sentiment labels, and a retrieval-augmentation module which retrieves additional context from reliable external sources. Benchmarked against traditional models and LLMs like ChatGPT and LLaMA, our approach achieves a 15% to 48% performance gain in accuracy and F1 score.
SemStamp: A Semantic Watermark with Paraphrastic Robustness for Text Generation
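results: Compared with the existing token-level watermarking method, the proposed method is more robust and better preserves generation quality. Experiments show that the novel semantic watermark algorithm is more robust under both common and bigram paraphrase attacks.
Abstract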
Existing watermarking algorithms are vulnerable to paraphrase attacks because of their token-level design. To address this issue, we propose SemStamp, a robust sentence-level semantic watermarking algorithm based on locality-sensitive hashing (LSH), which partitions the semantic space of sentences. The algorithm encodes and LSH-hashes a candidate sentence generated by an LLM, and conducts sentence-level rejection sampling until the sampled sentence falls in watermarked partitions in the semantic embedding space. A margin-based constraint is used to enhance its robustness. To show the advantages of our algorithm, we propose a "bigram" paraphrase attack using the paraphrase that has the fewest bigram overlaps with the original sentence. This attack is shown to be effective against the existing token-level watermarking method. Experimental results show that our novel semantic watermark algorithm is not only more robust than the previous state-of-the-art method on both common and bigram paraphrase attacks, but also is better at preserving the quality of generation.
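The sentence-level rejection sampling loop can be sketched as follows. This is a toy illustration under several assumptions, not the SemStamp implementation. `toy_embed` is a hypothetical hash-based stand-in for a real sentence encoder, the LSH uses random hyperplanes, and the way valid partitions are seeded from the previous sentence's signature and a secret `key` is assumed. The `propose` callable stands in for sampling a candidate sentence from an LLM.

```python
import hashlib
import random

DIM = 16  # size of the toy embedding (hypothetical)


def toy_embed(sentence):
    """Hypothetical stand-in encoder: hash each word into a small vector."""
    vec = [0.0] * DIM
    for word in sentence.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        for i in range(DIM):
            vec[i] += digest[i] / 255.0 - 0.5
    return vec


def make_hyperplanes(num_bits, seed):
    """Random hyperplanes; each one contributes one bit of the signature."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(num_bits)]


def lsh_signature(vec, planes):
    """Return (partition id, distance-like margin to the nearest hyperplane)."""
    dots = [sum(v * p for v, p in zip(vec, plane)) for plane in planes]
    bits = 0
    for d in dots:
        bits = (bits << 1) | int(d > 0)
    return bits, min(abs(d) for d in dots)


def valid_partitions(prev_sig, num_bits, key, ratio=0.5):
    """Pseudo-randomly mark a fraction of partitions as watermarked."""
    rng = random.Random(key * 1_000_003 + prev_sig)
    total = 2 ** num_bits
    return set(rng.sample(range(total), int(total * ratio)))


def is_watermarked(prev_sentence, sentence, planes, key, margin=0.0):
    """Check whether a sentence lands in a valid partition with enough margin."""
    prev_sig, _ = lsh_signature(toy_embed(prev_sentence), planes)
    sig, dist = lsh_signature(toy_embed(sentence), planes)
    return sig in valid_partitions(prev_sig, len(planes), key) and dist >= margin


def watermark_next(prev_sentence, propose, planes, key, margin=0.0, max_tries=100):
    """Resample candidates until one is watermarked; fall back to the last one."""
    candidate = None
    for _ in range(max_tries):
        candidate = propose()
        if is_watermarked(prev_sentence, candidate, planes, key, margin):
            return candidate
    return candidate
```

A detector holding the same `planes` and `key` could call `is_watermarked` on consecutive sentence pairs and count how many land in valid partitions. Raising `margin` rejects candidates whose embeddings sit close to a hyperplane, which is the intuition behind the margin-based constraint.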
Dementia Assessment Using Mandarin Speech with an Attention-based Speech Recognition Encoder
results: The paper achieves an accuracy of 92.04% in Alzheimer's disease detection and a mean absolute error of 9% in clinical dementia rating score prediction.
Abstract
Dementia diagnosis requires a series of different testing methods, which is complex and time-consuming. Early detection of dementia is crucial as it can prevent further deterioration of the condition. This paper utilizes a speech recognition model to construct a dementia assessment system tailored for Mandarin speakers during the picture description task. By training an attention-based speech recognition model on voice data closely resembling real-world scenarios, we have significantly enhanced the model's recognition capabilities. Subsequently, we extracted the encoder from the speech recognition model and added a linear layer for dementia assessment. We collected Mandarin speech data from 99 subjects and acquired their clinical assessments from a local hospital. We achieved an accuracy of 92.04% in Alzheimer's disease detection and a mean absolute error of 9% in clinical dementia rating score prediction.
HuBERTopic: Enhancing Semantic Representation of HuBERT through Self-supervision Utilizing Topic Model
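methods: A topic model is applied to the pseudo-labels to generate a topic label for each utterance. The topic labels are then used as teachers for an auxiliary topic classification task added to HuBERT, which incorporates additional global semantic information in an unsupervised manner.
results: The method achieves comparable or better performance than the baseline in most tasks, including automatic speech recognition and five of the eight SUPERB tasks. The topic labels also carry information about each utterance, such as gender, speaker, and theme, showing that the approach captures multifaceted semantic nuances.
Abstract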
Recently, the usefulness of self-supervised representation learning (SSRL) methods has been confirmed in various downstream tasks. Many of these models, as exemplified by HuBERT and WavLM, use pseudo-labels generated from spectral features or the model's own representation features. From previous studies, it is known that the pseudo-labels contain semantic information. However, the masked prediction task, the learning criterion of HuBERT, focuses on local contextual information and may not make effective use of global semantic information such as speaker, theme of speech, and so on. In this paper, we propose a new approach to enrich the semantic representation of HuBERT. We apply a topic model to pseudo-labels to generate a topic label for each utterance. An auxiliary topic classification task is added to HuBERT by using topic labels as teachers. This allows additional global semantic information to be incorporated in an unsupervised manner. Experimental results demonstrate that our method achieves comparable or better performance than the baseline in most tasks, including automatic speech recognition and five out of the eight SUPERB tasks. Moreover, we find that topic labels include various information about each utterance, such as gender, speaker, and theme. This highlights the effectiveness of our approach in capturing multifaceted semantic nuances.
Quantized Transformer Language Model Implementations on Edge Devices
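results: Compared with the original BERT large model, the converted and quantized MobileBERT models have 160$\times$ smaller footprints for a 4.1% drop in accuracy, while analyzing at least one tweet per second on edge devices. The study also highlights the privacy-preserving nature of processing all data locally in a serverless environment.
Abstract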
Large-scale transformer-based models like the Bidirectional Encoder Representations from Transformers (BERT) are widely used for Natural Language Processing (NLP) applications, wherein these models are initially pre-trained with a large corpus with millions of parameters and then fine-tuned for a downstream NLP task. One of the major limitations of these large-scale models is that they cannot be deployed on resource-constrained devices due to their large model size and increased inference latency. In order to overcome these limitations, such large-scale models can be converted to an optimized FlatBuffer format, tailored for deployment on resource-constrained edge devices. Herein, we evaluate the performance of such FlatBuffer transformed MobileBERT models on three different edge devices, fine-tuned for Reputation analysis of English language tweets in the RepLab 2013 dataset. In addition, this study encompassed an evaluation of the deployed models, wherein their latency, performance, and resource efficiency were meticulously assessed. Our experiment results show that, compared to the original BERT large model, the converted and quantized MobileBERT models have 160$\times$ smaller footprints for a 4.1% drop in accuracy while analyzing at least one tweet per second on edge devices. Furthermore, our study highlights the privacy-preserving aspect of TinyML systems as all data is processed locally within a serverless environment.
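To give a sense of what the quantization step does to stored weights, here is a generic sketch of symmetric per-tensor int8 quantization. It is an assumption-laden illustration, not the conversion pipeline or the exact quantization scheme used in the study. The function names are hypothetical. The sketch covers quantization only and does not account for the overall footprint reduction reported above.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q.

    Each q is an integer in [-127, 127], so it fits in one byte instead of
    the four bytes of a float32 weight.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [scale * v for v in q]


def max_abs_error(weights):
    """Largest reconstruction error after a quantize/dequantize round trip."""
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))
```

The round-trip error per weight is bounded by about half of `scale`, which is one way to see why accuracy can degrade only modestly when weights are stored in fewer bits.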