cs.CL - 2023-07-14

HuCurl: Human-induced Curriculum Discovery

  • paper_url: http://arxiv.org/abs/2307.07412
  • repo_url: None
  • paper_authors: Mohamed Elgaar, Hadi Amiri
  • for: This paper introduces the problem of curriculum discovery and proposes a curriculum learning framework, based on prior knowledge about sample difficulty, that can find effective curricula in a curriculum space.
  • methods: Uses annotation entropy and loss as measures of difficulty, comparing curricula across models and datasets to identify the best-performing ones (a toy sketch of the entropy measure follows the abstract below).
  • results: (i) the top-performing discovered curricula are often non-monotonic, unlike the monotonic curricula in existing literature; (ii) the prevailing easy-to-hard or hard-to-easy transition curricula often risk underperforming; (iii) curricula discovered for smaller datasets and models also perform well on larger datasets and models.
    Abstract We introduce the problem of curriculum discovery and describe a curriculum learning framework capable of discovering effective curricula in a curriculum space based on prior knowledge about sample difficulty. Using annotation entropy and loss as measures of difficulty, we show that (i): the top-performing discovered curricula for a given model and dataset are often non-monotonic as opposed to monotonic curricula in existing literature, (ii): the prevailing easy-to-hard or hard-to-easy transition curricula are often at the risk of underperforming, and (iii): the curricula discovered for smaller datasets and models perform well on larger datasets and models respectively. The proposed framework encompasses some of the existing curriculum learning approaches and can discover curricula that outperform them across several NLP tasks.
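    For intuition, here is a minimal sketch of the annotation-entropy difficulty measure, assuming each sample carries labels from several annotators; this is our reading of the abstract, not the authors' implementation.

```python
import math
from collections import Counter

def annotation_entropy(labels):
    """Shannon entropy of the annotator label distribution for one sample."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy annotations from three annotators per sample: higher entropy = harder.
samples = {
    "s1": ["pos", "pos", "pos"],   # full agreement -> entropy 0.0
    "s2": ["pos", "neg", "pos"],   # mild disagreement
    "s3": ["pos", "neg", "neu"],   # maximal disagreement over 3 labels
}
ranked = sorted(samples, key=lambda s: annotation_entropy(samples[s]))
print([(s, round(annotation_entropy(samples[s]), 3)) for s in ranked])
```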

Composition-contrastive Learning for Sentence Embeddings

  • paper_url: http://arxiv.org/abs/2307.07380
  • repo_url: https://github.com/perceptiveshawty/compcse
  • paper_authors: Sachin J. Chanchani, Ruihong Huang
  • for: Learning vector representations of natural language text.
  • methods: Uses contrastive learning to learn textual representations from unlabelled data, maximizing alignment between texts and a composition of their phrasal constituents (a loss sketch follows the abstract below).
  • results: Improves over baselines on semantic textual similarity tasks without incurring costs in auxiliary training objectives or additional network parameters.
    Abstract Vector representations of natural language are ubiquitous in search applications. Recently, various methods based on contrastive learning have been proposed to learn textual representations from unlabelled data; by maximizing alignment between minimally-perturbed embeddings of the same text, and encouraging a uniform distribution of embeddings across a broader corpus. Differently, we propose maximizing alignment between texts and a composition of their phrasal constituents. We consider several realizations of this objective and elaborate the impact on representations in each case. Experimental results on semantic textual similarity tasks show improvements over baselines that are comparable with state-of-the-art approaches. Moreover, this work is the first to do so without incurring costs in auxiliary training objectives or additional network parameters.
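    A minimal sketch of one possible realization of the composition-contrastive objective, assuming constituents are composed by mean pooling (the paper considers several realizations); the names, shapes, and temperature here are illustrative only.

```python
import torch
import torch.nn.functional as F

def composition_contrastive_loss(text_emb, phrase_embs, temperature=0.05):
    """InfoNCE-style loss aligning each text with a composition of its phrases.

    text_emb:    (batch, dim) embeddings of full texts
    phrase_embs: (batch, n_phrases, dim) embeddings of each text's constituents
    """
    composed = F.normalize(phrase_embs.mean(dim=1), dim=-1)  # compose by averaging
    anchors = F.normalize(text_emb, dim=-1)
    logits = anchors @ composed.T / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(anchors.size(0))       # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Toy check with random vectors standing in for encoder outputs.
loss = composition_contrastive_loss(torch.randn(8, 128), torch.randn(8, 4, 128))
print(loss.item())
```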

A scoping review on multimodal deep learning in biomedical images and texts

  • paper_url: http://arxiv.org/abs/2307.07362
  • repo_url: None
  • paper_authors: Zhaoyi Sun, Mingquan Lin, Qingqing Zhu, Qianqian Xie, Fei Wang, Zhiyong Lu, Yifan Peng
  • for: This scoping review aims to survey the current state of multimodal deep learning (MDL) in the biomedical domain and suggest directions for future research.
  • methods: Reviews MDL work on the joint learning of biomedical images and texts, the two most commonly available data types in MDL research.
  • results: The review covers five tasks: report generation, visual question answering, cross-modal retrieval, computer-aided diagnosis, and semantic segmentation. The results highlight the diverse applications and potential of MDL in medicine.
    Abstract Computer-assisted diagnostic and prognostic systems of the future should be capable of simultaneously processing multimodal data. Multimodal deep learning (MDL), which involves the integration of multiple sources of data, such as images and text, has the potential to revolutionize the analysis and interpretation of biomedical data. However, it only caught researchers' attention recently. To this end, there is a critical need to conduct a systematic review on this topic, identify the limitations of current work, and explore future directions. In this scoping review, we aim to provide a comprehensive overview of the current state of the field and identify key concepts, types of studies, and research gaps with a focus on biomedical images and texts joint learning, mainly because these two were the most commonly available data types in MDL research. This study reviewed the current uses of multimodal deep learning on five tasks: (1) Report generation, (2) Visual question answering, (3) Cross-modal retrieval, (4) Computer-aided diagnosis, and (5) Semantic segmentation. Our results highlight the diverse applications and potential of MDL and suggest directions for future research in the field. We hope our review will facilitate the collaboration of natural language processing (NLP) and medical imaging communities and support the next generation of decision-making and computer-assisted diagnostic system development.

Gloss Attention for Gloss-free Sign Language Translation

  • paper_url: http://arxiv.org/abs/2307.07361
  • repo_url: https://github.com/yinaoxiong/gaslt
  • paper_authors: Aoxiong Yin, Tianyun Zhong, Li Tang, Weike Jin, Tao Jin, Zhou Zhao
  • for: Improving the accuracy and efficiency of sign language translation (SLT) while removing the need for hard-to-acquire gloss annotations.
  • methods: An analysis of existing models shows that gloss annotations provide two kinds of information: they help the model implicitly learn the location of semantic boundaries in continuous sign language videos, and they help it understand the video globally. The paper proposes a "gloss attention" mechanism that keeps the model's attention within video segments that share the same local semantics, and transfers sentence-to-sentence similarity knowledge from a natural language model into the gloss attention SLT network (GASLT) (a toy local-attention sketch follows the abstract below).
  • results: Experiments on multiple large-scale sign language datasets show that the proposed GASLT model significantly outperforms existing methods. Code is provided at \url{https://github.com/YinAoXiong/GASLT}.
    Abstract Most sign language translation (SLT) methods to date require the use of gloss annotations to provide additional supervision information, however, the acquisition of gloss is not easy. To solve this problem, we first perform an analysis of existing models to confirm how gloss annotations make SLT easier. We find that it can provide two aspects of information for the model, 1) it can help the model implicitly learn the location of semantic boundaries in continuous sign language videos, 2) it can help the model understand the sign language video globally. We then propose \emph{gloss attention}, which enables the model to keep its attention within video segments that have the same semantics locally, just as gloss helps existing models do. Furthermore, we transfer the knowledge of sentence-to-sentence similarity from the natural language model to our gloss attention SLT network (GASLT) to help it understand sign language videos at the sentence level. Experimental results on multiple large-scale sign language datasets show that our proposed GASLT model significantly outperforms existing methods. Our code is provided in \url{https://github.com/YinAoXiong/GASLT}.
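    As a rough illustration of the masking idea behind gloss attention, the sketch below restricts each video frame's attention to a fixed local window; the actual GASLT mechanism keeps attention within segments of shared semantics and is more sophisticated than this fixed band.

```python
import torch
import torch.nn.functional as F

def local_window_attention(q, k, v, window=4):
    """Scaled dot-product attention restricted to a local window per query.

    A crude stand-in for gloss attention: each frame may only attend to
    nearby frames, mimicking 'stay within the same local semantic segment'.
    q, k, v: (seq_len, dim)
    """
    seq_len, dim = q.shape
    scores = q @ k.T / dim ** 0.5
    idx = torch.arange(seq_len)
    mask = (idx[None, :] - idx[:, None]).abs() > window  # True = outside window
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

out = local_window_attention(torch.randn(16, 64), torch.randn(16, 64), torch.randn(16, 64))
print(out.shape)  # torch.Size([16, 64])
```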

Unsupervised Domain Adaptation using Lexical Transformations and Label Injection for Twitter Data

  • paper_url: http://arxiv.org/abs/2307.10210
  • repo_url: None
  • paper_authors: Akshat Gupta, Xiaomo Liu, Sameena Shah
  • for: This paper addresses domain adaptation not by adapting the model, but by modifying the source domain data to reduce domain shift.
  • methods: Applies simple lexical transformations to the source domain dataset to reduce the shift between the source and target dataset distributions (a hypothetical transformation sketch follows the abstract below).
  • results: Models trained on the transformed source data perform significantly better than zero-shot models, reaching an unsupervised POS tagging accuracy of 92.14% (up from 81.54% zero-shot), only slightly below the supervised performance of 94.45%. The proposed transformations are also used to synthetically generate tweets and augment the Twitter dataset, achieving state-of-the-art POS tagging performance.
    Abstract Domain adaptation is an important and widely studied problem in natural language processing. A large body of literature tries to solve this problem by adapting models trained on the source domain to the target domain. In this paper, we instead solve this problem from a dataset perspective. We modify the source domain dataset with simple lexical transformations to reduce the domain shift between the source dataset distribution and the target dataset distribution. We find that models trained on the transformed source domain dataset performs significantly better than zero-shot models. Using our proposed transformations to convert standard English to tweets, we reach an unsupervised part-of-speech (POS) tagging accuracy of 92.14% (from 81.54% zero shot accuracy), which is only slightly below the supervised performance of 94.45%. We also use our proposed transformations to synthetically generate tweets and augment the Twitter dataset to achieve state-of-the-art performance for POS tagging.
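    The abstract does not enumerate the lexical transformations, so the sketch below shows made-up "standard English to tweet" rules (lowercasing, informal respellings, hashtag insertion) purely to illustrate the data-side approach; these are not the paper's actual rules.

```python
import random
import re

random.seed(0)

def tweetify(sentence):
    """Apply toy 'standard English -> tweet' lexical transformations.

    The specific rules are illustrative guesses, not the transformation
    set used in the paper.
    """
    s = sentence.lower()                  # tweets are often lowercased
    s = re.sub(r"\byou\b", "u", s)        # informal respellings
    s = re.sub(r"\bare\b", "r", s)
    s = s.rstrip(".")                     # drop final punctuation
    if random.random() < 0.5:
        s += " #nlp"                      # sprinkle in a hashtag at random
    return s

print(tweetify("You are reading the paper."))  # -> 'u r reading the paper'
```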

How Different Is Stereotypical Bias Across Languages?

  • paper_url: http://arxiv.org/abs/2307.07331
  • repo_url: https://github.com/slds-lmu/stereotypes-multi
  • paper_authors: Ibrahim Tolga Öztürk, Rostislav Nedelchev, Christian Heumann, Esteban Garces Arias, Marius Roger, Bernd Bischl, Matthias Aßenmacher
  • for: This study extends stereotypical bias analysis beyond pre-trained English models, systematically investigating mono- and multilingual models of different underlying architectures across multiple languages.
  • methods: Uses the English StereoSet data set (Nadeem et al., 2021), semi-automatically translated into German, French, Spanish, and Turkish.
  • results: Conducting this type of analysis in a multilingual setting proves essential: the experiments show a much more nuanced picture than the English-only analysis, with English (monolingual) models exhibiting the strongest bias and Turkish models the least.
    Abstract Recent studies have demonstrated how to assess the stereotypical bias in pre-trained English language models. In this work, we extend this branch of research in multiple different dimensions by systematically investigating (a) mono- and multilingual models of (b) different underlying architectures with respect to their bias in (c) multiple different languages. To that end, we make use of the English StereoSet data set (Nadeem et al., 2021), which we semi-automatically translate into German, French, Spanish, and Turkish. We find that it is of major importance to conduct this type of analysis in a multilingual setting, as our experiments show a much more nuanced picture as well as notable differences from the English-only analysis. The main takeaways from our analysis are that mGPT-2 (partly) shows surprising anti-stereotypical behavior across languages, English (monolingual) models exhibit the strongest bias, and the stereotypes reflected in the data set are least present in Turkish models. Finally, we release our codebase alongside the translated data sets and practical guidelines for the semi-automatic translation to encourage a further extension of our work to other languages.

  • paper_url: http://arxiv.org/abs/2307.07317
  • repo_url: None
  • paper_authors: Cedric Waterschoot, Antal van den Bosch
  • for: This work supports and empowers content moderators in moderating user-generated comment sections by providing a recommender system based on ranking class probabilities.
  • methods: Combines user and textual content features, obtaining an optimal classification F1-score of 0.44 and an optimal mean NDCG@5 of 0.87 on a large set of validation articles.
  • results: In an expert evaluation, content moderators choosing comments to feature based on the recommendations reached an NDCG score of 0.83 (an NDCG@k sketch follows the abstract below). The study finds that adding text features yields the best scores, and that while choosing featured content remains somewhat subjective, moderators found suitable comments in all but one evaluated recommendation.
    Abstract Online news outlets are grappling with the moderation of user-generated content within their comment section. We present a recommender system based on ranking class probabilities to support and empower the moderator in choosing featured posts, a time-consuming task. By combining user and textual content features we obtain an optimal classification F1-score of 0.44 on the test set. Furthermore, we observe an optimum mean NDCG@5 of 0.87 on a large set of validation articles. As an expert evaluation, content moderators assessed the output of a random selection of articles by choosing comments to feature based on the recommendations, which resulted in a NDCG score of 0.83. We conclude that first, adding text features yields the best score and second, while choosing featured content remains somewhat subjective, content moderators found suitable comments in all but one evaluated recommendations. We end the paper by analyzing our best-performing model, a step towards transparency and explainability in hybrid content moderation.
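    Since the evaluation centers on NDCG@5, here is a short self-contained implementation of the metric for reference; the relevance values in the example are invented.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    """NDCG@k: DCG of the ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Binary relevance of comments in recommended order (1 = worth featuring).
print(round(ndcg_at_k([1, 0, 1, 1, 0], k=5), 3))  # 0.906
```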

Using Large Language Models for Zero-Shot Natural Language Generation from Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2307.07312
  • repo_url: https://github.com/agnesion/zero-shot-nlg-from-kgs-data
  • paper_authors: Agnes Axelsson, Gabriel Skantze
  • for: This paper examines how models pretrained on large amounts of text data can be used for KG-to-text generation from structured knowledge graph (KG) data.
  • methods: Uses large language models for zero-shot generation based on nothing but the model's understanding of the triple structure from what it can read (a prompt-construction sketch follows the abstract below).
  • results: ChatGPT achieves near state-of-the-art performance on some measures of the WebNLG 2020 challenge but falls behind on others. Comparing factual, counter-factual, and fictional statements shows a significant connection between what the LLM already knows about the data it is parsing and the quality of the output text.
    Abstract In any system that uses structured knowledge graph (KG) data as its underlying knowledge representation, KG-to-text generation is a useful tool for turning parts of the graph data into text that can be understood by humans. Recent work has shown that models that make use of pretraining on large amounts of text data can perform well on the KG-to-text task even with relatively small sets of training data on the specific graph-to-text task. In this paper, we build on this concept by using large language models to perform zero-shot generation based on nothing but the model's understanding of the triple structure from what it can read. We show that ChatGPT achieves near state-of-the-art performance on some measures of the WebNLG 2020 challenge, but falls behind on others. Additionally, we compare factual, counter-factual and fictional statements, and show that there is a significant connection between what the LLM already knows about the data it is parsing and the quality of the output text.
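    A minimal sketch of zero-shot KG-to-text prompting: serialize the triples and let an LLM verbalize them. The prompt wording below is an assumption, not the format used in the paper.

```python
def triples_to_prompt(triples):
    """Serialize KG triples into a zero-shot verbalization prompt.

    One plausible way to present triple structure to an LLM; the paper's
    exact prompt is not given in the abstract.
    """
    lines = [f"({subj} | {pred} | {obj})" for subj, pred, obj in triples]
    return (
        "Convert the following knowledge graph triples into fluent English:\n"
        + "\n".join(lines)
        + "\nText:"
    )

prompt = triples_to_prompt([
    ("Alan_Turing", "birthPlace", "London"),
    ("Alan_Turing", "field", "Computer_Science"),
])
print(prompt)  # send this string to the LLM of your choice
```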

Similarity-based Memory Enhanced Joint Entity and Relation Extraction

  • paper_url: http://arxiv.org/abs/2307.11762
  • repo_url: https://github.com/kosciukiewicz/similarity_based_memory_re
  • paper_authors: Witold Kosciukiewicz, Mateusz Wojcik, Tomasz Kajdanowicz, Adam Gonczarek
  • for: Document-level joint entity and relation extraction, which requires a unified approach in which a single neural network performs four sub-tasks: mention detection, coreference resolution, entity classification, and relation extraction.
  • methods: Proposes a multi-task learning framework with bidirectional memory-like dependency between tasks, addressing the drawbacks of sequential decompositions in existing methods and solving the joint problem more accurately.
  • results: Empirical studies show the proposed approach outperforms existing methods and achieves state-of-the-art results on the BioCreative V CDR corpus.
    Abstract Document-level joint entity and relation extraction is a challenging information extraction problem that requires a unified approach where a single neural network performs four sub-tasks: mention detection, coreference resolution, entity classification, and relation extraction. Existing methods often utilize a sequential multi-task learning approach, in which the arbitral decomposition causes the current task to depend only on the previous one, missing the possible existence of the more complex relationships between them. In this paper, we present a multi-task learning framework with bidirectional memory-like dependency between tasks to address those drawbacks and perform the joint problem more accurately. Our empirical studies show that the proposed approach outperforms the existing methods and achieves state-of-the-art results on the BioCreative V CDR corpus.

Towards dialect-inclusive recognition in a low-resource language: are balanced corpora the answer?

  • paper_url: http://arxiv.org/abs/2307.07295
  • repo_url: None
  • paper_authors: Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, Ailbhe Ní Chasaide
  • for: This study addresses the unequal performance of speech recognition systems across dialects.
  • methods: Twelve ASR systems are trained, first on a baseline dialect-balanced training corpus, and then on modified versions of the baseline in which dialect-specific material is either subtracted or added.
  • results: Dialect-balanced corpora do not yield similar performance across dialects: the Ulster (Ul) dialect consistently underperforms, while Munster (Mu) yields the lowest WERs. There is a close but asymmetrical relationship between the Connacht (Co) and Mu dialects. These results will guide future corpus collection and system-building strategies to optimise for cross-dialect performance equity.
    Abstract ASR systems are generally built for the spoken 'standard', and their performance declines for non-standard dialects/varieties. This is a problem for a language like Irish, where there is no single spoken standard, but rather three major dialects: Ulster (Ul), Connacht (Co) and Munster (Mu). As a diagnostic to quantify the effect of the speaker's dialect on recognition performance, 12 ASR systems were trained, firstly using baseline dialect-balanced training corpora, and then using modified versions of the baseline corpora, where dialect-specific materials were either subtracted or added. Results indicate that dialect-balanced corpora do not yield a similar performance across the dialects: the Ul dialect consistently underperforms, whereas Mu yields lowest WERs. There is a close relationship between Co and Mu dialects, but one that is not symmetrical. These results will guide future corpus collection and system building strategies to optimise for cross-dialect performance equity.

Replay to Remember: Continual Layer-Specific Fine-tuning for German Speech Recognition

  • paper_url: http://arxiv.org/abs/2307.07280
  • repo_url: None
  • paper_authors: Theresa Pekarek Rosin, Stefan Wermter
  • for: This paper examines how well the performance of large-scale automatic speech recognition (ASR) models can be approximated for smaller domains, and how selective freezing and experience replay improve stability and robustness against forgetting.
  • methods: A large-scale multilingual ASR model is adapted via transfer learning to the smaller German Senior Voice Commands (SVC-de) dataset. During training, parts of the model are selectively frozen to preserve performance from large-scale training, and Experience Replay is applied for continual learning to increase robustness to vocabulary and speakers outside the fine-tuned domain (a replay-mixing sketch follows the abstract below).
  • results: By adding only a fraction of data from the original domain, the model reaches Word-Error-Rates (WERs) below 5% on the new domain while stabilizing general speech recognition at acceptable WERs.
    Abstract While Automatic Speech Recognition (ASR) models have shown significant advances with the introduction of unsupervised or self-supervised training techniques, these improvements are still only limited to a subsection of languages and speakers. Transfer learning enables the adaptation of large-scale multilingual models to not only low-resource languages but also to more specific speaker groups. However, fine-tuning on data from new domains is usually accompanied by a decrease in performance on the original domain. Therefore, in our experiments, we examine how well the performance of large-scale ASR models can be approximated for smaller domains, with our own dataset of German Senior Voice Commands (SVC-de), and how much of the general speech recognition performance can be preserved by selectively freezing parts of the model during training. To further increase the robustness of the ASR model to vocabulary and speakers outside of the fine-tuned domain, we apply Experience Replay for continual learning. By adding only a fraction of data from the original domain, we are able to reach Word-Error-Rates (WERs) below 5\% on the new domain, while stabilizing performance for general speech recognition at acceptable WERs.
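    A minimal sketch of the experience-replay idea: mix a small fraction of original-domain utterances into each fine-tuning epoch so the model does not forget general speech recognition. The replay fraction and dataset names below are illustrative, not the paper's settings.

```python
import random

random.seed(0)

def replay_mixture(new_domain, original_domain, replay_fraction=0.1):
    """Build a fine-tuning set that mixes in a small replay buffer.

    Keeps replay_fraction * len(new_domain) original-domain samples in
    every epoch; the fraction is a placeholder value.
    """
    n_replay = int(len(new_domain) * replay_fraction)
    replay = random.sample(original_domain, min(n_replay, len(original_domain)))
    mixed = new_domain + replay
    random.shuffle(mixed)
    return mixed

new = [f"svc_de_{i}" for i in range(100)]        # new-domain utterance ids
orig = [f"general_asr_{i}" for i in range(1000)]
print(len(replay_mixture(new, orig)))            # 110 samples per epoch
```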

Are words equally surprising in audio and audio-visual comprehension?

  • paper_url: http://arxiv.org/abs/2307.07277
  • repo_url: None
  • paper_authors: Pranava Madhyastha, Ye Zhang, Gabriella Vigliocco
  • for: investigate the effect of visual information on spoken language comprehension
  • methods: compare the ERP signature (N400) in audio-only and audio-visual presentations of the same verbal stimuli, using different types of language models (n-gram and Transformer models) to predict N400 responses for each word (a toy surprisal sketch follows the abstract below)
  • results: cognitive effort differs significantly between multimodal and unimodal settings; Transformer-based models provide a better fit in the audio-only setting, while 2-gram language models are more effective in the multimodal setting, highlighting the significant impact of local lexical context on cognitive processing in a multimodal environment
    Abstract We report a controlled study investigating the effect of visual information (i.e., seeing the speaker) on spoken language comprehension. We compare the ERP signature (N400) associated with each word in audio-only and audio-visual presentations of the same verbal stimuli. We assess the extent to which surprisal measures (which quantify the predictability of words in their lexical context) are generated on the basis of different types of language models (specifically n-gram and Transformer models) that predict N400 responses for each word. Our results indicate that cognitive effort differs significantly between multimodal and unimodal settings. In addition, our findings suggest that while Transformer-based models, which have access to a larger lexical context, provide a better fit in the audio-only setting, 2-gram language models are more effective in the multimodal setting. This highlights the significant impact of local lexical context on cognitive processing in a multimodal environment.
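    For reference, surprisal is -log2 P(word | context); the toy 2-gram model below shows how such a local-context predictor assigns surprisal. The add-alpha smoothing and tiny corpus are illustrative; the study's models are trained on large corpora.

```python
import math
from collections import Counter

corpus = "the dog chased the cat and the cat chased the mouse".split()

# Count unigrams and bigrams for a smoothed maximum-likelihood 2-gram model.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def surprisal(prev, word, alpha=0.1):
    """Surprisal -log2 P(word | prev) under an add-alpha smoothed bigram model."""
    vocab_size = len(unigrams)
    p = (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return -math.log2(p)

print(round(surprisal("the", "cat"), 3))    # predictable continuation, low surprisal
print(round(surprisal("the", "piano"), 3))  # unseen continuation, high surprisal
```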

A Topical Approach to Capturing Customer Insight In Social Media

  • paper_url: http://arxiv.org/abs/2307.11775
  • repo_url: None
  • paper_authors: Miguel Palencia-Olivar
  • for: This research addresses fully unsupervised topic extraction in noisy, heterogeneous Big Data contexts drawn from multiple sources.
  • methods: Builds three approaches on the Variational Autoencoder framework: the Embedded Dirichlet Process, the Embedded Hierarchical Dirichlet Process, and the time-aware Dynamic Embedded Dirichlet Process. These nonparametric approaches determine both word embeddings and topic embeddings; they do not require transfer learning, though knowledge transfer remains possible.
  • results: Tested on benchmark and automotive-industry datasets from a real-world use case, the models achieve performance equal to or better than state-of-the-art methods. The authors also argue that the field of topic modeling would benefit from improved evaluation metrics.
    Abstract The age of social media has opened new opportunities for businesses. This flourishing wealth of information is outside traditional channels and frameworks of classical marketing research, including that of Marketing Mix Modeling (MMM). Textual data, in particular, poses many challenges that data analysis practitioners must tackle. Social media constitute massive, heterogeneous, and noisy document sources. Industrial data acquisition processes include some amount of ETL. However, the variability of noise in the data and the heterogeneity induced by different sources create the need for ad-hoc tools. Put otherwise, customer insight extraction in fully unsupervised, noisy contexts is an arduous task. This research addresses the challenge of fully unsupervised topic extraction in noisy, Big Data contexts. We present three approaches we built on the Variational Autoencoder framework: the Embedded Dirichlet Process, the Embedded Hierarchical Dirichlet Process, and the time-aware Dynamic Embedded Dirichlet Process. These nonparametric approaches concerning topics present the particularity of determining word embeddings and topic embeddings. These embeddings do not require transfer learning, but knowledge transfer remains possible. We test these approaches on benchmark and automotive industry-related datasets from a real-world use case. We show that our models achieve equal to better performance than state-of-the-art methods and that the field of topic modeling would benefit from improved evaluation metrics.

MorphPiece : Moving away from Statistical Language Representation

  • paper_url: http://arxiv.org/abs/2307.07262
  • repo_url: None
  • paper_authors: Haris Jabbar
  • for: This paper proposes a linguistically motivated tokenization scheme to improve representation in modern NLP pipelines.
  • methods: The scheme, MorphPiece, is based partly on morphological segmentation of the underlying text (a toy segmentation sketch follows the abstract below).
  • results: A GPT-style causal language model trained on this tokenizer (MorphGPT) converges better than the same architecture trained on a standard BPE tokenizer, reaching language modeling performance comparable to a 6x larger model and superior performance across a variety of NLP tasks.
    Abstract Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers for Large Language Models are based on statistical analysis of text corpora, without much consideration to the linguistic features. We propose a linguistically motivated tokenization scheme, MorphPiece, which is based partly on morphological segmentation of the underlying text. A GPT-style causal language model trained on this tokenizer (called MorphGPT) shows superior convergence compared to the same architecture trained on a standard BPE tokenizer. Specifically we get Language Modeling performance comparable to a 6 times larger model. Additionally, we evaluate MorphGPT on a variety of NLP tasks in supervised and unsupervised settings and find superior performance across the board, compared to GPT-2 model.
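    A toy affix-stripping segmenter to illustrate what morphologically motivated tokenization looks like; MorphPiece itself builds on proper morphological segmentation, and the affix lists below are made up for the example.

```python
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ing", "ed", "ly", "s"]

def morph_segment(word):
    """Strip known affixes to approximate a morphological segmentation.

    A crude illustration only: real morphological analyzers handle
    allomorphy, compounding, and much larger affix inventories.
    """
    pieces, stem = [], word
    for p in PREFIXES:
        if stem.startswith(p) and len(stem) - len(p) > 2:
            pieces.append(p)
            stem = stem[len(p):]
            break
    tail = []
    for s in SUFFIXES:
        if stem.endswith(s) and len(stem) - len(s) > 2:
            tail.insert(0, s)
            stem = stem[: -len(s)]
            break
    return pieces + [stem] + tail

print(morph_segment("unhappiness"))  # ['un', 'happi', 'ness']
```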

Improving BERT with Hybrid Pooling Network and Drop Mask

  • paper_url: http://arxiv.org/abs/2307.07258
  • repo_url: None
  • paper_authors: Qian Chen, Wen Wang, Qinglin Zhang, Chong Deng, Ma Yukun, Siqi Zheng
  • for: This paper proposes a HybridBERT model that combines self-attention and pooling networks to improve the encoding of contextual features in each layer, and also introduces a simple DropMask method to address the mismatch between pre-training and fine-tuning.
  • methods: The HybridBERT model uses a combination of self-attention and pooling networks to encode different contextual features in each layer, while the DropMask method addresses the mismatch between pre-training and fine-tuning caused by excessive use of special mask tokens.
  • results: HybridBERT outperforms vanilla BERT in pre-training with lower loss, faster training speed, and lower memory cost, and achieves 1.5% higher accuracies on downstream tasks; DropMask improves the accuracies of BERT on downstream tasks across various masking rates.
    Abstract Transformer-based pre-trained language models, such as BERT, achieve great success in various natural language understanding tasks. Prior research found that BERT captures a rich hierarchy of linguistic information at different layers. However, the vanilla BERT uses the same self-attention mechanism for each layer to model the different contextual features. In this paper, we propose a HybridBERT model which combines self-attention and pooling networks to encode different contextual features in each layer. Additionally, we propose a simple DropMask method to address the mismatch between pre-training and fine-tuning caused by excessive use of special mask tokens during Masked Language Modeling pre-training. Experiments show that HybridBERT outperforms BERT in pre-training with lower loss, faster training speed (8% relative), lower memory cost (13% relative), and also in transfer learning with 1.5% relative higher accuracies on downstream tasks. Additionally, DropMask improves accuracies of BERT on downstream tasks across various masking rates.

Certified Robustness for Large Language Models with Self-Denoising

  • paper_url: http://arxiv.org/abs/2307.07171
  • repo_url: None
  • paper_authors: Zhen Zhang, Guanhua Zhang, Bairu Hou, Wenqi Fan, Qing Li, Sijia Liu, Yang Zhang, Shiyu Chang
  • for: Improving the reliability of large language models (LLMs) in high-stakes environments by ensuring prediction stability, i.e., that LLM predictions remain consistent under minor input variations.
  • methods: Leverages the multitasking nature of LLMs to denoise corrupted inputs in a self-denoising manner, improving certified robustness and prediction stability without training a separate denoising model (a smoothing sketch follows the abstract below).
  • results: The method outperforms existing certification methods under both certified robustness and empirical robustness, better meeting the needs of high-stakes LLM applications.
    Abstract Although large language models (LLMs) have achieved great success in vast real-world applications, their vulnerabilities towards noisy inputs have significantly limited their uses, especially in high-stake environments. In these contexts, it is crucial to ensure that every prediction made by large language models is stable, i.e., LLM predictions should be consistent given minor differences in the input. This largely falls into the study of certified robust LLMs, i.e., all predictions of LLM are certified to be correct in a local region around the input. Randomized smoothing has demonstrated great potential in certifying the robustness and prediction stability of LLMs. However, randomized smoothing requires adding noise to the input before model prediction, and its certification performance depends largely on the model's performance on corrupted data. As a result, its direct application to LLMs remains challenging and often results in a small certification radius. To address this issue, we take advantage of the multitasking nature of LLMs and propose to denoise the corrupted inputs with LLMs in a self-denoising manner. Different from previous works like denoised smoothing, which requires training a separate model to robustify LLM, our method enjoys far better efficiency and flexibility. Our experiment results show that our method outperforms the existing certification methods under both certified robustness and empirical robustness. The codes are available at https://github.com/UCSB-NLP-Chang/SelfDenoise.
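    A minimal sketch of self-denoised randomized smoothing: perturb the input, have the (same) LLM denoise it, classify, and majority-vote over many samples. The mask rate and the stub denoiser/classifier below are placeholders so the example runs without a model; they are not the paper's components.

```python
import random
from collections import Counter

random.seed(0)

def perturb(words, mask_rate=0.3):
    """Randomly mask words, as in randomized smoothing over text."""
    return [w if random.random() > mask_rate else "[MASK]" for w in words]

def llm_denoise(words):
    """Placeholder for prompting the LLM to reconstruct masked words.

    In the paper this is the LLM itself filling in the masks; this stub
    simply drops them so the sketch is self-contained.
    """
    return [w for w in words if w != "[MASK]"]

def classify(words):
    """Placeholder classifier standing in for the LLM's prediction."""
    return "positive" if "good" in words else "negative"

def smoothed_predict(sentence, n_samples=25):
    """Majority vote over classify(denoise(perturb(x))) -- the self-denoising idea."""
    words = sentence.split()
    votes = Counter(classify(llm_denoise(perturb(words))) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(smoothed_predict("the movie was surprisingly good"))
```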

Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks

  • paper_url: http://arxiv.org/abs/2307.07166
  • repo_url: None
  • paper_authors: Ryosuke Korekata, Motonari Kambara, Yu Yoshida, Shintaro Ishikawa, Yosuke Kawasaki, Masaki Takahashi, Komei Sugiura
  • for: This paper describes a domestic service robot (DSR) that fetches everyday objects and carries them to specified destinations according to free-form natural language instructions.
  • methods: Proposes the Switching Head-Tail Funnel UNITER method, which predicts the target object and the destination individually using a single model.
  • results: Validated on a newly built dataset and compared against a baseline, the proposed method achieves superior language comprehension accuracy. In physical experiments, the DSR completed the requested object grasping and placing actions with success rates above 90%.
    Abstract This paper describes a domestic service robot (DSR) that fetches everyday objects and carries them to specified destinations according to free-form natural language instructions. Given an instruction such as "Move the bottle on the left side of the plate to the empty chair," the DSR is expected to identify the bottle and the chair from multiple candidates in the environment and carry the target object to the destination. Most of the existing multimodal language understanding methods are impractical in terms of computational complexity because they require inferences for all combinations of target object candidates and destination candidates. We propose Switching Head-Tail Funnel UNITER, which solves the task by predicting the target object and the destination individually using a single model. Our method is validated on a newly-built dataset consisting of object manipulation instructions and semi photo-realistic images captured in a standard Embodied AI simulator. The results show that our method outperforms the baseline method in terms of language comprehension accuracy. Furthermore, we conduct physical experiments in which a DSR delivers standardized everyday objects in a standardized domestic environment as requested by instructions with referring expressions. The experimental results show that the object grasping and placing actions are achieved with success rates of more than 90%.

Learning to Retrieve In-Context Examples for Large Language Models

  • paper_url: http://arxiv.org/abs/2307.07164
  • repo_url: https://github.com/microsoft/lmops
  • paper_authors: Liang Wang, Nan Yang, Furu Wei
  • for: Improving the effectiveness of in-context learning for large language models (LLMs).
  • methods: Trains a dense retriever via a reward model and knowledge distillation to identify high-quality in-context examples (a distillation sketch follows the abstract below).
  • results: Significantly improves in-context learning performance across a suite of 30 tasks and generalizes to tasks unseen during training.
    Abstract Large language models (LLMs) have demonstrated their ability to learn in-context, allowing them to perform various tasks based on a few input-output examples. However, the effectiveness of in-context learning is heavily reliant on the quality of the selected examples. In this paper, we propose a novel framework to iteratively train dense retrievers that can identify high-quality in-context examples for LLMs. Our framework initially trains a reward model based on LLM feedback to evaluate the quality of candidate examples, followed by knowledge distillation to train a bi-encoder based dense retriever. Our experiments on a suite of 30 tasks demonstrate that our framework significantly enhances in-context learning performance. Furthermore, we show the generalization ability of our framework to unseen tasks during training. An in-depth analysis reveals that our model improves performance by retrieving examples with similar patterns, and the gains are consistent across LLMs of varying sizes.
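    A sketch of the distillation step, assuming a KL loss from the reward model's candidate scores into the bi-encoder retriever's similarities; this is a standard distillation choice, and the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def distill_retriever_loss(retriever_scores, reward_scores, temperature=1.0):
    """KL distillation from a reward model into a bi-encoder retriever.

    retriever_scores: (batch, n_candidates) query-candidate similarities
                      from the bi-encoder being trained (the student)
    reward_scores:    (batch, n_candidates) example-quality scores from the
                      LLM-feedback reward model (the teacher)
    """
    teacher = F.softmax(reward_scores / temperature, dim=-1)
    student_log = F.log_softmax(retriever_scores / temperature, dim=-1)
    return F.kl_div(student_log, teacher, reduction="batchmean")

# Toy check with random scores standing in for model outputs.
loss = distill_retriever_loss(torch.randn(4, 16), torch.randn(4, 16))
print(loss.item())
```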

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

  • paper_url: http://arxiv.org/abs/2307.07162
  • repo_url: https://github.com/PJLab-ADG/driveLikeAHuman
  • paper_authors: Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, Yu Qiao
  • for: This paper explores using a large language model (LLM) to understand the driving environment in a human-like manner, analyzing its ability to reason, interpret, and memorize when facing complex scenarios.
  • methods: Argues that an ideal autonomous driving (AD) system should drive like a human, accumulating experience through continuous driving and using common sense to solve problems; a closed-loop LLM-based system is built to showcase its comprehension and environment-interaction abilities.
  • results: Extensive experiments show the LLM exhibits impressive abilities to reason about and interact with long-tail corner cases, providing valuable insights for developing human-like autonomous driving. Code is available at https://github.com/PJLab-ADG/driveLikeAHuman.
    Abstract In this paper, we explore the potential of using a large language model (LLM) to understand the driving environment in a human-like manner and analyze its ability to reason, interpret, and memorize when facing complex scenarios. We argue that traditional optimization-based and modular autonomous driving (AD) systems face inherent performance limitations when dealing with long-tail corner cases. To address this problem, we propose that an ideal AD system should drive like a human, accumulating experience through continuous driving and using common sense to solve problems. To achieve this goal, we identify three key abilities necessary for an AD system: reasoning, interpretation, and memorization. We demonstrate the feasibility of employing an LLM in driving scenarios by building a closed-loop system to showcase its comprehension and environment-interaction abilities. Our extensive experiments show that the LLM exhibits the impressive ability to reason and solve long-tailed cases, providing valuable insights for the development of human-like autonomous driving. The related code are available at https://github.com/PJLab-ADG/DriveLikeAHuman .

Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords

  • paper_url: http://arxiv.org/abs/2307.07160
  • repo_url: None
  • paper_authors: Shahriar Golchin, Mihai Surdeanu, Nazgol Tavabi, Ata Kiapour
  • for: This paper proposes a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning.
  • methods: The method selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain, identified using KeyBERT (Grootendorst, 2020) (a masking sketch follows the abstract below).
  • results: Across six settings (three datasets combined with two distinct pre-trained language models, PLMs), PLMs adapted with this in-domain pre-training strategy outperform both in-domain pre-training with random masking and the common pre-train-then-fine-tune paradigm. The overhead of identifying in-domain keywords is reasonable, e.g., 7-15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).
    Abstract We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach using six different settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results reveal that the fine-tuned PLMs adapted using our in-domain pre-training strategy outperform PLMs that used in-domain pre-training with random masking as well as those that followed the common pre-train-then-fine-tune paradigm. Further, the overhead of identifying in-domain keywords is reasonable, e.g., 7-15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).
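    A minimal sketch of selective keyword masking. In the paper the keyword list comes from KeyBERT run over the target-domain corpus; the keywords and sentence below are invented for illustration.

```python
import re

def mask_keywords(text, keywords, mask_token="[MASK]"):
    """Mask only in-domain keywords instead of random tokens.

    `keywords` would be produced by KeyBERT over the target domain; the
    list in the example is a made-up clinical-domain stand-in.
    """
    pattern = r"\b(" + "|".join(map(re.escape, keywords)) + r")\b"
    return re.sub(pattern, mask_token, text, flags=re.IGNORECASE)

domain_keywords = ["hypertension", "metformin", "biopsy"]  # illustrative only
text = "Patient with hypertension was started on metformin after the biopsy."
print(mask_keywords(text, domain_keywords))
# Patient with [MASK] was started on [MASK] after the [MASK].
```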

MMSD2.0: Towards a Reliable Multi-modal Sarcasm Detection System

  • paper_url: http://arxiv.org/abs/2307.07135
  • repo_url: https://github.com/joeying1019/mmsd2.0
  • paper_authors: Libo Qin, Shijue Huang, Qiguang Chen, Chenran Cai, Yudi Zhang, Bin Liang, Wanxiang Che, Ruifeng Xu
  • for: Improving the reliability of multi-modal sarcasm detection systems.
  • methods: Introduces MMSD2.0, a corrected benchmark that removes the spurious cues in MMSD and re-annotates its unreasonable samples, and proposes multi-view CLIP, which leverages multi-grained cues from multiple perspectives (text, image, and text-image interaction views).
  • results: Extensive experiments show that MMSD2.0 is a valuable benchmark for building reliable multi-modal sarcasm detection systems, and that multi-view CLIP significantly outperforms the previous best baselines.
    Abstract Multi-modal sarcasm detection has attracted much recent attention. Nevertheless, the existing benchmark (MMSD) has some shortcomings that hinder the development of reliable multi-modal sarcasm detection system: (1) There are some spurious cues in MMSD, leading to the model bias learning; (2) The negative samples in MMSD are not always reasonable. To solve the aforementioned issues, we introduce MMSD2.0, a correction dataset that fixes the shortcomings of MMSD, by removing the spurious cues and re-annotating the unreasonable samples. Meanwhile, we present a novel framework called multi-view CLIP that is capable of leveraging multi-grained cues from multiple perspectives (i.e., text, image, and text-image interaction view) for multi-modal sarcasm detection. Extensive experiments show that MMSD2.0 is a valuable benchmark for building reliable multi-modal sarcasm detection systems and multi-view CLIP can significantly outperform the previous best baselines.

Generating Efficient Training Data via LLM-based Attribute Manipulation

  • paper_url: http://arxiv.org/abs/2307.07099
  • repo_url: https://github.com/komeijiforce/cotam
  • paper_authors: Letian Peng, Yuwei Zhang, Jingbo Shang
  • for: This paper proposes a new method, Chain-of-Thoughts Attribute Manipulation (CoTAM), to guide few-shot learning with carefully crafted data from large language models (LLMs).
  • methods: Inspired by facial attribute manipulation, the approach uses LLMs to generate label-switched data by manipulating task-specific attributes and reconstructing new sentences in a controlled manner, implementing chain-of-thoughts decomposition and reconstruction instead of conventional latent representation control.
  • results: Extensive results on text classification and other tasks verify the advantage of CoTAM over other LLM-based text generation methods with the same number of training examples, and analysis visualizes the effectiveness of the attribute manipulation.
    Abstract In this paper, we propose a novel method, Chain-of-Thoughts Attribute Manipulation (CoTAM), to guide few-shot learning by carefully crafted data from Large Language Models (LLMs). The main idea is to create data with changes only in the attribute targeted by the task. Inspired by facial attribute manipulation, our approach generates label-switched data by leveraging LLMs to manipulate task-specific attributes and reconstruct new sentences in a controlled manner. Instead of conventional latent representation controlling, we implement chain-of-thoughts decomposition and reconstruction to adapt the procedure to LLMs. Extensive results on text classification and other tasks verify the advantage of CoTAM over other LLM-based text generation methods with the same number of training examples. Analysis visualizes the attribute manipulation effectiveness of CoTAM and presents the potential of LLM-guided learning with even less supervision.

An Analysis of Dialogue Repair in Virtual Voice Assistants

  • paper_url: http://arxiv.org/abs/2307.07076
  • repo_url: None
  • paper_authors: Matthew Carson Galbraith, Mireia Gómez i Martínez
  • for: This study analyzes dialogue repair structures to understand how human-assistant conversations differ from human-human ones.
  • methods: Examines and compares the use of repair initiators in English and Spanish with two popular virtual assistants, Google Assistant and Apple's Siri.
  • results: There are differences not only between human-assistant and human-human dialogue repair strategies, but also among the assistants and between the languages studied.
    Abstract Language speakers often use what are known as repair initiators to mend fundamental disconnects that occur between them during verbal communication. Previous research in this field has mainly focused on the human-to-human use of repair initiator. We proposed an examination of dialogue repair structure wherein the dialogue initiator is human and the party that initiates or responds to the repair is a virtual assistant. This study examined the use of repair initiators in both English and Spanish with two popular assistants, Google Assistant and Apple's Siri. Our aim was to codify the differences, if any, in responses by voice assistants to dialogues in need of repair as compared to human-human dialogues also in need of repair. Ultimately the data demonstrated that not only were there differences between human-assistant and human-human dialogue repair strategies, but that there were likewise differences among the assistants and the languages studied.

Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling

  • paper_url: http://arxiv.org/abs/2307.07057
  • repo_url: None
  • paper_authors: He Huang, Jagadeesh Balam, Boris Ginsburg
  • for: This paper studies speech intent classification and slot filling (SICSF), proposing to initialize an end-to-end (E2E) Conformer-Transformer model with an encoder pretrained on speech recognition (ASR), which achieves new state-of-the-art results on the SLURP dataset with 90.14% intent accuracy and 82.27% SLURP-F1.
  • methods: Compares ASR-pretrained encoders with encoders pretrained via self-supervised learning (SSL), showing that ASR pretraining is much more effective than SSL for SICSF. To explore parameter efficiency, the encoder is frozen and Adapter modules are added; parameter efficiency is only achievable with an ASR-pretrained encoder, while the SSL encoder needs full fine-tuning to achieve comparable results.
  • results: An in-depth comparison of end-to-end models versus cascading models (ASR+NLU) shows that E2E models outperform cascaded models unless an oracle ASR model is provided; the proposed model is the first E2E model to match cascading models with oracle ASR. Code, checkpoints, and configs are available.
    Abstract We study speech intent classification and slot filling (SICSF) by proposing to use an encoder pretrained on speech recognition (ASR) to initialize an end-to-end (E2E) Conformer-Transformer model, which achieves the new state-of-the-art results on the SLURP dataset, with 90.14% intent accuracy and 82.27% SLURP-F1. We compare our model with encoders pretrained on self-supervised learning (SSL), and show that ASR pretraining is much more effective than SSL for SICSF. To explore parameter efficiency, we freeze the encoder and add Adapter modules, and show that parameter efficiency is only achievable with an ASR-pretrained encoder, while the SSL encoder needs full finetuning to achieve comparable results. In addition, we provide an in-depth comparison on end-to-end models versus cascading models (ASR+NLU), and show that E2E models are better than cascaded models unless an oracle ASR model is provided. Last but not least, our model is the first E2E model that achieves the same performance as cascading models with oracle ASR. Code, checkpoints and configs are available.

Making the Most Out of the Limited Context Length: Predictive Power Varies with Clinical Note Type and Note Section

  • paper_url: http://arxiv.org/abs/2307.07051
  • repo_url: None
  • paper_authors: Hongyi Zheng, Yixin Zhu, Lavender Yao Jiang, Kyunghyun Cho, Eric Karl Oermann
  • for: This paper aims to make more efficient use of the free text of clinical notes in healthcare language processing.
  • methods: Proposes a framework for analyzing which sections of clinical notes carry high predictive power, so that when the context length of a language model predictor is limited, the most valuable parts can be chosen as input.
  • results: Experiments on MIMIC-III show that (1) the distribution of predictive power differs between nursing notes and discharge notes, and (2) combining different types of notes can improve performance when the context length is large. These findings suggest that a carefully selected sampling function could enable more efficient information extraction from clinical notes.
    Abstract Recent advances in large language models have led to renewed interest in natural language processing in healthcare using the free text of clinical notes. One distinguishing characteristic of clinical notes is their long time span over multiple long documents. The unique structure of clinical notes creates a new design choice: when the context length for a language model predictor is limited, which part of clinical notes should we choose as the input? Existing studies either choose the inputs with domain knowledge or simply truncate them. We propose a framework to analyze the sections with high predictive power. Using MIMIC-III, we show that: 1) predictive power distribution is different between nursing notes and discharge notes and 2) combining different types of notes could improve performance when the context length is large. Our findings suggest that a carefully selected sampling function could enable more efficient information extraction from clinical notes.

MegaWika: Millions of reports and their sources across 50 diverse languages

  • paper_url: http://arxiv.org/abs/2307.07049
  • repo_url: None
  • paper_authors: Samuel Barham, Orion Weller, Michelle Yuan, Kenton Murray, Mahsa Yarmohammadi, Zhengping Jiang, Siddharth Vashishtha, Alexander Martin, Anqi Liu, Aaron Steven White, Jordan Boyd-Graber, Benjamin Van Durme
  • for: To foster the development of new models for collaborative AI-assisted report generation, this work introduces MegaWika, consisting of 13 million Wikipedia articles in 50 diverse languages along with their 71 million referenced source materials.
  • methods: The dataset is processed for a range of applications beyond the initial citation extraction and web scraping, including translating non-English articles for cross-lingual applications and providing FrameNet parses for automated semantic analysis. Its quality is manually analyzed via a semantically stratified sample.
  • results: Baseline results and trained models are provided for crucial steps in automated report generation: cross-lingual question answering and citation retrieval. MegaWika is the largest resource for sentence-level report generation and the only multilingual report generation dataset.
    Abstract To foster the development of new models for collaborative AI-assisted report generation, we introduce MegaWika, consisting of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials. We process this dataset for a myriad of applications, going beyond the initial Wikipedia citation extraction and web scraping of content, including translating non-English articles for cross-lingual applications and providing FrameNet parses for automated semantic analysis. MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual. We manually analyze the quality of this resource through a semantically stratified sample. Finally, we provide baseline results and trained models for crucial steps in automated report generation: cross-lingual question answering and citation retrieval.

DIALGEN: Collaborative Human-LM Generated Dialogues for Improved Understanding of Human-Human Conversations

  • paper_url: http://arxiv.org/abs/2307.07047
  • repo_url: None
  • paper_authors: Bo-Ru Lu, Nikita Haduong, Chia-Hsuan Lee, Zeqiu Wu, Hao Cheng, Paul Koester, Jean Utke, Tao Yu, Noah A. Smith, Mari Ostendorf
  • for: This work aims to improve automatic understanding of human-human conversations, particularly where real-world data such as call center or clinical conversations contains private information.
  • methods: Proposes DIALGEN, a human-in-the-loop semi-automated dialogue generation framework that uses a language model (ChatGPT) to produce fluent conversational text, generating a complex conversation by iteratively producing subdialogues and using human feedback to correct inconsistencies or redirect the flow.
  • results: In experiments on structured summarization of agent-client information-gathering calls, framed as dialogue state tracking, DIALGEN data enables significant improvement in model performance.
    Abstract Applications that could benefit from automatic understanding of human-human conversations often come with challenges associated with private information in real-world data such as call center or clinical conversations. Working with protected data also increases costs of annotation, which limits technology development. To address these challenges, we propose DIALGEN, a human-in-the-loop semi-automated dialogue generation framework. DIALGEN uses a language model (ChatGPT) that can follow schema and style specifications to produce fluent conversational text, generating a complex conversation through iteratively generating subdialogues and using human feedback to correct inconsistencies or redirect the flow. In experiments on structured summarization of agent-client information gathering calls, framed as dialogue state tracking, we show that DIALGEN data enables significant improvement in model performance.

Data Augmentation for Machine Translation via Dependency Subtree Swapping

  • paper_url: http://arxiv.org/abs/2307.07025
  • repo_url: https://github.com/attilanagy234/syntax-augmentation-nmt
  • paper_authors: Attila Nagy, Dorina Petra Lakatos, Botond Barta, Patrick Nanys, Judit Ács
  • for: Improving the performance of machine translation models.
  • methods: Data augmentation via dependency subtree swapping: corresponding subtrees are extracted from the dependency parse trees of the source and target sentences and swapped across bisentences, with filtering based on graph-based similarity of the dependency trees (a swapping sketch follows the abstract below).
  • results: Consistent BLEU score improvements over baseline models in 3 out of 4 language pairs, evaluated in both directions.
    Abstract We present a generic framework for data augmentation via dependency subtree swapping that is applicable to machine translation. We extract corresponding subtrees from the dependency parse trees of the source and target sentences and swap these across bisentences to create augmented samples. We perform thorough filtering based on graphbased similarities of the dependency trees and additional heuristics to ensure that extracted subtrees correspond to the same meaning. We conduct resource-constrained experiments on 4 language pairs in both directions using the IWSLT text translation datasets and the Hunglish2 corpus. The results demonstrate consistent improvements in BLEU score over our baseline models in 3 out of 4 language pairs. Our code is available on GitHub.
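    A simplified, monolingual sketch of subtree swapping with spaCy (requires the en_core_web_sm model). The paper swaps corresponding subtrees across source-target bisentences and filters candidates via graph-based tree similarity; both aspects are omitted here.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # run: python -m spacy download en_core_web_sm

def object_subtree(doc):
    """Return the (start, end) token span of the first direct-object subtree."""
    for token in doc:
        if token.dep_ == "dobj":
            subtree = list(token.subtree)
            return subtree[0].i, subtree[-1].i + 1
    return None

def swap_object_subtrees(sent_a, sent_b):
    """Swap the direct-object subtrees of two sentences (naive detokenization)."""
    doc_a, doc_b = nlp(sent_a), nlp(sent_b)
    span_a, span_b = object_subtree(doc_a), object_subtree(doc_b)
    if span_a is None or span_b is None:
        return sent_a, sent_b
    text_a = doc_a[span_a[0]:span_a[1]].text
    text_b = doc_b[span_b[0]:span_b[1]].text
    new_a = f"{doc_a[:span_a[0]].text} {text_b} {doc_a[span_a[1]:].text}"
    new_b = f"{doc_b[:span_b[0]].text} {text_a} {doc_b[span_b[1]:].text}"
    return new_a.strip(), new_b.strip()

print(swap_object_subtrees("She wrote a long letter.", "He baked fresh bread."))
# ('She wrote fresh bread .', 'He baked a long letter .')
```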

Electoral Agitation Data Set: The Use Case of the Polish Election

  • paper_url: http://arxiv.org/abs/2307.07007
  • repo_url: https://github.com/mateuszbaransanok/e-agitation
  • paper_authors: Mateusz Baran, Mateusz Wójcik, Piotr Kolebski, Michał Bernaczyk, Krzysztof Rajda, Łukasz Augustyniak, Tomasz Kajdanowicz
  • for: This paper aims to address the problem of detecting electoral agitation in social media, specifically in the Polish language.
  • methods: The authors use a combination of human annotation and machine learning to create a data set of labeled tweets for training a Polish language model to detect electoral agitation.
  • results: The authors achieve a 0.66 inter-annotator agreement (Cohen's kappa; a kappa sketch follows the abstract below) and a 68% F1 score for the fine-tuned language model on the newly created data set. They also present a number of potential use cases for such data sets and models, and analyze the Polish 2020 Presidential Election on Twitter.
    Abstract The popularity of social media makes politicians use it for political advertisement. Therefore, social media is full of electoral agitation (electioneering), especially during the election campaigns. The election administration cannot track the spread and quantity of messages that count as agitation under the election code. It addresses a crucial problem, while also uncovering a niche that has not been effectively targeted so far. Hence, we present the first publicly open data set for detecting electoral agitation in the Polish language. It contains 6,112 human-annotated tweets tagged with four legally conditioned categories. We achieved a 0.66 inter-annotator agreement (Cohen's kappa score). An additional annotator resolved the mismatches between the first two improving the consistency and complexity of the annotation process. The newly created data set was used to fine-tune a Polish Language Model called HerBERT (achieving a 68% F1 score). We also present a number of potential use cases for such data sets and models, enriching the paper with an analysis of the Polish 2020 Presidential Election on Twitter.
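
The 0.66 agreement reported above is a Cohen's kappa over two annotators' labels; below is a small sketch of how that statistic is computed (with made-up labels, not the paper's data), followed by loading the public HerBERT checkpoint as a 4-way classifier for fine-tuning. The category names in the toy arrays are placeholders, not the paper's legally conditioned label set.

```python
from sklearn.metrics import cohen_kappa_score

# Toy annotations over 8 tweets; labels are illustrative placeholders.
annotator_1 = ["agitation", "none", "none", "agitation", "other", "none", "agitation", "other"]
annotator_2 = ["agitation", "none", "other", "agitation", "other", "none", "none", "other"]
print(f"Cohen's kappa: {cohen_kappa_score(annotator_1, annotator_2):.2f}")

# Loading HerBERT for sequence classification
# (pip install transformers sacremoses).
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tok = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allegro/herbert-base-cased", num_labels=4)
```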

mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs

  • paper_url: http://arxiv.org/abs/2307.06930
  • repo_url: https://github.com/gregor-ge/mblip
  • paper_authors: Gregor Geigle, Abhay Jain, Radu Timofte, Goran Glavaš
  • for: This paper presents mBLIP, a computationally efficient multilingual vision-language model that leverages readily available high-quality English image-text data together with a pretrained multilingual LLM, avoiding the cost of pretraining a large vision-language model from scratch.
  • methods: An image encoder previously aligned to an English LLM is re-aligned to a new multilingual LLM so the model can handle many languages; the training data comes from a mix of vision-and-language tasks, machine-translated from high-quality English data into 95 languages.
  • results: On the IGLUE benchmark, mBLIP matches state-of-the-art models, and on XM3600 image captioning, zero-shot mBLIP even outperforms PaLI-X (a 55B-parameter model), despite training orders of magnitude fewer parameters on far less data than large multilingual vision-language models trained from scratch.
    Abstract Modular vision-language models (Vision-LLMs) align pretrained image encoders with (pretrained) large language models (LLMs), representing a computationally much more efficient alternative to end-to-end training of large vision-language models from scratch, which is prohibitively expensive for most. Vision-LLMs instead post-hoc condition LLMs to `understand' the output of an image encoder. With the abundance of readily available high-quality English image-text data as well as monolingual English LLMs, the research focus has been on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models, trained on limited multilingual image data supplemented with text-only multilingual corpora. In this work, we present mBLIP, the first multilingual Vision-LLM, which we obtain in a computationally efficient manner -- on consumer hardware using only a few million training examples -- by leveraging a pretrained multilingual LLM. To this end, we re-align an image encoder previously tuned to an English LLM to a new, multilingual LLM -- for this, we leverage multilingual data from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark, mBLIP yields results competitive with state-of-the-art models. Moreover, in image captioning on XM3600, mBLIP (zero-shot) even outperforms PaLI-X (a model with 55B parameters). Compared to these very large multilingual vision-language models trained from scratch, we obtain mBLIP by training orders of magnitude fewer parameters on magnitudes less data. We release our model and code at https://github.com/gregor-ge/mBLIP.
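
A schematic of the re-alignment idea in PyTorch: the pretrained image encoder and the multilingual LLM stay frozen, and only a small projection from visual features into the LLM's embedding space receives gradients. All dimensions and module names here are illustrative; the actual mBLIP recipe lives in the linked repository.

```python
import torch
import torch.nn as nn

class VisualProjection(nn.Module):
    """Maps frozen image-encoder features into a (frozen) LLM's embedding space."""
    def __init__(self, vision_dim: int = 768, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features):      # (batch, n_patches, vision_dim)
        return self.proj(image_features)    # (batch, n_patches, llm_dim)

# Toy tensors standing in for a real encoder/LLM pair.
image_features = torch.randn(4, 32, 768)    # pretend image-encoder output
proj = VisualProjection()
visual_tokens = proj(image_features)        # prepended to the LLM input
text_embeds = torch.randn(4, 16, 2048)      # pretend LLM token embeddings
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)
print(llm_inputs.shape)                     # torch.Size([4, 48, 2048])

# During re-alignment only the projection is trained; everything else is frozen.
optimizer = torch.optim.AdamW(proj.parameters(), lr=1e-4)
```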

Towards Populating Generalizable Engineering Design Knowledge

  • paper_url: http://arxiv.org/abs/2307.06985
  • repo_url: None
  • paper_authors: L Siddharth, Jianxi Luo
  • for: To populate generalizable engineering design knowledge by extracting facts of the form head entity :: relationship :: tail entity from sentences in patent documents; these facts can be combined within and across patents into knowledge graphs for representing and storing design knowledge.
  • methods: Train two taggers, one to identify head and tail entities in a sentence and another to identify the relationship tokens linking a given entity pair, on a manually constructed dataset of 44,227 sentences and corresponding facts; compare against typically recommended approaches that predict edges by pairing tokens independently and as part of a graph.
  • results: Build a domain knowledge base from patents on fan systems, search it for solutions relevant to key issues in fan systems, organize the results as knowledge graphs, and hold a comparative discussion against ChatGPT's opinions.
    Abstract Aiming to populate generalizable engineering design knowledge, we propose a method to extract facts of the form head entity :: relationship :: tail entity from sentences found in patent documents. These facts could be combined within and across patent documents to form knowledge graphs that serve as schemes for representing as well as storing design knowledge. Existing methods in engineering design literature often utilise a set of predefined relationships to populate triples that are statistical approximations rather than facts. In our method, we train a tagger to identify both entities and relationships from a sentence. Given a pair of entities thus identified, we train another tagger to identify the relationship tokens that specifically denote the relationship between the pair. For training these taggers, we manually construct a dataset of 44,227 sentences and corresponding facts. We also compare the performance of the method against typically recommended approaches, wherein, we predict the edges among tokens by pairing the tokens independently and as part of a graph. We apply our method to sentences found in patents related to fan systems and build a domain knowledge base. Upon providing an overview of the knowledge base, we search for solutions relevant to some key issues prevailing in fan systems. We organize the responses into knowledge graphs and hold a comparative discussion against the opinions from ChatGPT.
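
Once facts of the form head entity :: relationship :: tail entity are extracted, assembling them into a queryable knowledge graph is straightforward; a minimal sketch with networkx follows, using made-up fan-system facts for illustration.

```python
import networkx as nx

# Hypothetical facts extracted from fan-system patents (illustrative only).
facts = [
    ("impeller", "mounted on", "rotor shaft"),
    ("rotor shaft", "driven by", "motor"),
    ("fan housing", "encloses", "impeller"),
    ("motor", "generates", "heat"),
]

# MultiDiGraph allows several differently-labeled edges between the same entities.
kg = nx.MultiDiGraph()
for head, relationship, tail in facts:
    kg.add_edge(head, tail, relationship=relationship)

# Query: what does the knowledge base say about the motor?
for head, tail, data in kg.edges(data=True):
    if "motor" in (head, tail):
        print(f"{head} :: {data['relationship']} :: {tail}")
```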

Adapting an ASR Foundation Model for Spoken Language Assessment

  • paper_url: http://arxiv.org/abs/2307.09378
  • repo_url: None
  • paper_authors: Rao Ma, Mengjie Qian, Mark J. F. Gales, Kate M. Knill
  • for: This paper proposes methods to correct issues in Whisper's output so that spoken language assessment and learner feedback can be based on a precise transcription of what a candidate said.
  • methods: Fine-tuning and soft prompt tuning are used to alter Whisper's decoding behaviour.
  • results: Experiments on public speech corpora and an English learner dataset show that both approaches effectively change Whisper's decoding behaviour, generating the exact words the candidate actually spoke.
    Abstract A crucial part of an accurate and reliable spoken language assessment system is the underlying ASR model. Recently, large-scale pre-trained ASR foundation models such as Whisper have been made available. As the output of these models is designed to be human readable, punctuation is added, numbers are presented in Arabic numeric form and abbreviations are included. Additionally, these models have a tendency to skip disfluencies and hesitations in the output. Though useful for readability, these attributes are not helpful for assessing the ability of a candidate and providing feedback. Here a precise transcription of what a candidate said is needed. In this paper, we give a detailed analysis of Whisper outputs and propose two solutions: fine-tuning and soft prompt tuning. Experiments are conducted on both public speech corpora and an English learner dataset. Results show that we can effectively alter the decoding behaviour of Whisper to generate the exact words spoken in the response.
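
A minimal PyTorch sketch of the soft prompt tuning idea: a handful of learnable embeddings are prepended to the (frozen) decoder's input embeddings, and only those embeddings are optimized. Shapes and names are illustrative, not Whisper's actual internals.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt embeddings prepended to a frozen decoder's inputs."""
    def __init__(self, n_prompt_tokens: int = 16, d_model: int = 512):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, d_model) * 0.02)

    def forward(self, token_embeds):         # (batch, seq_len, d_model)
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)

# Toy decoder-input embeddings standing in for the ASR model's.
token_embeds = torch.randn(2, 10, 512)
soft_prompt = SoftPrompt()
decoder_inputs = soft_prompt(token_embeds)
print(decoder_inputs.shape)                  # torch.Size([2, 26, 512])

# Only the prompt embeddings are optimized; the ASR model itself stays frozen.
optimizer = torch.optim.AdamW(soft_prompt.parameters(), lr=1e-3)
```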