cs.CL - 2023-07-21

OxfordTVG-HIC: Can Machine Make Humorous Captions from Images?

  • paper_url: http://arxiv.org/abs/2307.11636
  • repo_url: None
  • paper_authors: Runjia Li, Shuyang Sun, Mohamed Elhoseiny, Philip Torr
  • for: This work develops OxfordTVG-HIC, a large-scale humorous image-caption dataset, for humour generation and understanding.
  • methods: A large-scale collection of image-text pairs with humour scores is curated and used to train a generalizable humour captioning model with deep-learning methods.
  • results: OxfordTVG-HIC can train a generalizable humour captioning model and can be used to evaluate whether generated text is humorous; the analysis further shows that generating humour typically requires fusing linguistic and visual cues and drawing on humour-specific concepts and punchlines.
    Abstract This paper presents OxfordTVG-HIC (Humorous Image Captions), a large-scale dataset for humour generation and understanding. Humour is an abstract, subjective, and context-dependent cognitive construct involving several cognitive factors, making it a challenging task to generate and interpret. Hence, humour generation and understanding can serve as a new task for evaluating the ability of deep-learning methods to process abstract and subjective information. Due to the scarcity of data, humour-related generation tasks such as captioning remain under-explored. To address this gap, OxfordTVG-HIC offers approximately 2.9M image-text pairs with humour scores to train a generalizable humour captioning model. Contrary to existing captioning datasets, OxfordTVG-HIC features a wide range of emotional and semantic diversity resulting in out-of-context examples that are particularly conducive to generating humour. Moreover, OxfordTVG-HIC is curated devoid of offensive content. We also show how OxfordTVG-HIC can be leveraged for evaluating the humour of a generated text. Through explainability analysis of the trained models, we identify the visual and linguistic cues influential for evoking humour prediction (and generation). We observe qualitatively that these cues are aligned with the benign violation theory of humour in cognitive psychology.

A Change of Heart: Improving Speech Emotion Recognition through Speech-to-Text Modality Conversion

  • paper_url: http://arxiv.org/abs/2307.11584
  • repo_url: https://github.com/iclr2023achangeofheart/meld-modality-conversion
  • paper_authors: Zeinab Sadat Taghavi, Ali Satvaty, Hossein Sameti
  • for: This work aims to improve emotion-recognition performance on the MELD dataset.
  • methods: A modality-conversion concept is proposed: an automatic speech recognition (ASR) system transcribes the speech and a text classifier predicts the emotion (a minimal pipeline sketch follows the abstract below).
  • results: The first method yields substantial results on MELD, and Modality-Conversion++ even outperforms state-of-the-art speech-based approaches in weighted-F1 (WF1) score, indicating that modality conversion can boost performance on tasks that can be carried out in alternative modalities.
    Abstract Speech Emotion Recognition (SER) is a challenging task. In this paper, we introduce a modality conversion concept aimed at enhancing emotion recognition performance on the MELD dataset. We assess our approach through two experiments: first, a method named Modality-Conversion that employs automatic speech recognition (ASR) systems, followed by a text classifier; second, we assume perfect ASR output and investigate the impact of modality conversion on SER, this method is called Modality-Conversion++. Our findings indicate that the first method yields substantial results, while the second method outperforms state-of-the-art (SOTA) speech-based approaches in terms of SER weighted-F1 (WF1) score on the MELD dataset. This research highlights the potential of modality conversion for tasks that can be conducted in alternative modalities.
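A minimal sketch of the modality-conversion idea, assuming off-the-shelf Hugging Face pipelines; the checkpoint names are illustrative stand-ins, not the models used in the paper.

```python
# Speech -> text with an ASR pipeline, then text -> emotion with a text
# classifier. Checkpoints are illustrative; the paper's exact models may differ.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
emotion_clf = pipeline("text-classification",
                       model="j-hartmann/emotion-english-distilroberta-base")

def emotion_from_audio(wav_path: str) -> dict:
    transcript = asr(wav_path)["text"]   # first stage: ASR transcription
    return emotion_clf(transcript)[0]    # second stage: text-based emotion label

print(emotion_from_audio("meld_utterance.wav"))
```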

Advancing Visual Grounding with Scene Knowledge: Benchmark and Method

  • paper_url: http://arxiv.org/abs/2307.11558
  • repo_url: https://github.com/zhjohnchan/sk-vg
  • paper_authors: Zhihong Chen, Ruifei Zhang, Yibing Song, Xiang Wan, Guanbin Li
  • for: This paper aims to create a new benchmark for visual grounding (VG) called SK-VG, which requires models to have reasoning abilities on long-form scene knowledge.
  • methods: The proposed approaches for SK-VG involve embedding knowledge into image features before the image-query interaction, or leveraging linguistic structure to assist in computing the image-text matching.
  • results: The proposed approaches achieve promising results but still leave room for improvement in both performance and interpretability.
    Abstract Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it can be a testbed for vision-and-language models to evaluate their understanding of the images and texts and their reasoning abilities over their joint space. However, most existing VG datasets are constructed using simple description texts, which do not require sufficient reasoning over the images and texts. This has been demonstrated in a recent study~\cite{luo2022goes}, where a simple LSTM-based text encoder without pretraining can achieve state-of-the-art performance on mainstream VG datasets. Therefore, in this paper, we propose a novel benchmark of \underline{S}cene \underline{K}nowledge-guided \underline{V}isual \underline{G}rounding (SK-VG), where the image content and referring expressions are not sufficient to ground the target objects, forcing the models to have a reasoning ability on the long-form scene knowledge. To perform this task, we propose two approaches to accept the triple-type input, where the former embeds knowledge into the image features before the image-query interaction; the latter leverages linguistic structure to assist in computing the image-text matching. We conduct extensive experiments to analyze the above methods and show that the proposed approaches achieve promising results but still leave room for improvement, including performance and interpretability. The dataset and code are available at \url{https://github.com/zhjohnchan/SK-VG}.

Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.11545
  • repo_url: https://github.com/kkakkkka/etris
  • paper_authors: Zunnan Xu, Zhihong Chen, Yong Zhang, Yibing Song, Xiang Wan, Guanbin Li
  • for: To bring parameter-efficient tuning (PET) to referring image segmentation while better handling the interaction between modalities.
  • methods: An adapter named Bridger exchanges cross-modal information and injects task-specific information into the pre-trained model, and a lightweight decoder is designed for segmentation (a generic adapter sketch follows the abstract below).
  • results: Evaluated on challenging benchmarks, the approach achieves comparable or superior performance while updating only 1.61% to 3.38% of the backbone parameters.
    Abstract Parameter Efficient Tuning (PET) has gained attention for reducing the number of parameters while maintaining performance and providing better hardware resource savings, but few studies investigate dense prediction tasks and interaction between modalities. In this paper, we do an investigation of efficient tuning problems on referring image segmentation. We propose a novel adapter called Bridger to facilitate cross-modal information exchange and inject task-specific information into the pre-trained model. We also design a lightweight decoder for image segmentation. Our approach achieves comparable or superior performance with only 1.61\% to 3.38\% backbone parameter updates, evaluated on challenging benchmarks. The code is available at \url{https://github.com/kkakkkka/ETRIS}.
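The abstract gives no implementation detail for Bridger itself, so the sketch below only shows the general parameter-efficient-tuning pattern it builds on: freeze the pre-trained backbone and train a small bottleneck adapter. This is an assumption-level illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual bottleneck trained while the backbone stays frozen."""
    def __init__(self, dim: int, reduction: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

backbone = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
for p in backbone.parameters():      # freeze pre-trained weights
    p.requires_grad = False
adapter = BottleneckAdapter(256)     # only these parameters receive gradients

out = adapter(backbone(torch.randn(2, 10, 256)))
trainable = sum(p.numel() for p in adapter.parameters())
frozen = sum(p.numel() for p in backbone.parameters())
print(f"trainable share: {100 * trainable / (trainable + frozen):.2f}%")
```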

Topic Identification For Spontaneous Speech: Enriching Audio Features With Embedded Linguistic Information

  • paper_url: http://arxiv.org/abs/2307.11450
  • repo_url: https://github.com/aalto-speech/Topic-identification-for-spontaneous-Finnish-speech
  • paper_authors: Dejan Porjazovski, Tamás Grósz, Mikko Kurimo
  • for: To explore alternatives to text-based topic identification, asking whether audio features alone suffice when no automatic speech recognition (ASR) system is available.
  • methods: Audio-only and hybrid multimodal solutions that jointly use text and audio features are compared and evaluated on spontaneous Finnish speech.
  • results: Purely audio-based solutions are a viable option when ASR components are unavailable, while the hybrid multimodal solutions achieve the best results.
    Abstract Traditional topic identification solutions from audio rely on an automatic speech recognition system (ASR) to produce transcripts used as input to a text-based model. These approaches work well in high-resource scenarios, where there are sufficient data to train both components of the pipeline. However, in low-resource situations, the ASR system, even if available, produces low-quality transcripts, leading to a bad text-based classifier. Moreover, spontaneous speech containing hesitations can further degrade the performance of the ASR model. In this paper, we investigate alternatives to the standard text-only solutions by comparing audio-only and hybrid techniques of jointly utilising text and audio features. The models evaluated on spontaneous Finnish speech demonstrate that purely audio-based solutions are a viable option when ASR components are not available, while the hybrid multi-modal solutions achieve the best results.

MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems

  • paper_url: http://arxiv.org/abs/2307.11394
  • repo_url: None
  • paper_authors: Thilo von Neumann, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach
  • for: To evaluate meeting transcription systems through a unified interface for computing commonly used word error rates (WERs), including the cpWER, ORC WER, and MIMO WER definitions.
  • methods: The cpWER computation is extended with a temporal constraint so that words are counted as correct only when the temporal alignment is plausible, which yields higher-quality matching of hypothesis to reference strings and penalizes systems with poor time annotations. Since word-level timing is often unavailable, the authors also approximate word-level timings from segment-level (e.g., sentence-level) timings and show that the approximation yields a similar WER (one possible approximation is sketched after the abstract below).
  • results: Experiments show that the temporal constraint speeds up the matching algorithm, and this speedup outweighs the additional overhead of processing the time stamps.
    Abstract MeetEval is an open-source toolkit to evaluate all kinds of meeting transcription systems. It provides a unified interface for the computation of commonly used Word Error Rates (WERs), specifically cpWER, ORC WER and MIMO WER along other WER definitions. We extend the cpWER computation by a temporal constraint to ensure that only words are identified as correct when the temporal alignment is plausible. This leads to a better quality of the matching of the hypothesis string to the reference string that more closely resembles the actual transcription quality, and a system is penalized if it provides poor time annotations. Since word-level timing information is often not available, we present a way to approximate exact word-level timings from segment-level timings (e.g., a sentence) and show that the approximation leads to a similar WER as a matching with exact word-level annotations. At the same time, the time constraint leads to a speedup of the matching algorithm, which outweighs the additional overhead caused by processing the time stamps.
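The abstract mentions approximating word-level timings from segment-level timings but does not give the exact rule; the sketch below assumes a simple proportional split by character length, purely for illustration.

```python
def approx_word_timings(segment_text: str, seg_start: float, seg_end: float):
    """Spread a segment's time span over its words, proportional to word length."""
    words = segment_text.split()
    total_chars = sum(len(w) for w in words) or 1
    duration, cursor, timings = seg_end - seg_start, seg_start, []
    for w in words:
        w_dur = duration * len(w) / total_chars
        timings.append((w, cursor, cursor + w_dur))
        cursor += w_dur
    return timings

for word, start, end in approx_word_timings("so what did we decide", 12.0, 14.5):
    print(f"{word:>8s}  {start:6.2f} - {end:6.2f} s")
```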

Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect ChatGPT-Generated Text

  • paper_url: http://arxiv.org/abs/2307.11380
  • repo_url: https://github.com/clement1290/chatgpt-detection-pr-hppt
  • paper_authors: Lingyi Yang, Feng Jiang, Haizhou Li
  • for: The paper addresses the limitation of previous detectors, which can only differentiate purely ChatGPT-generated texts from human-authored texts, and instead focuses on detecting texts produced through human-machine collaboration, such as ChatGPT-polished texts.
  • methods: It introduces HPPT, a novel dataset of paired human-written and ChatGPT-polished academic abstracts, and proposes the "Polish Ratio", a measure of the degree of ChatGPT involvement in text generation based on editing distance (an illustrative computation is sketched after the abstract below).
  • results: The proposed model is more robust on the HPPT dataset and two existing datasets (HC3 and CDB), and the Polish Ratio provides a more comprehensive explanation by quantifying the degree of ChatGPT involvement: a value greater than 0.2 indicates ChatGPT involvement, and a value exceeding 0.6 implies that ChatGPT generated most of the text.
    Abstract The remarkable capabilities of large-scale language models, such as ChatGPT, in text generation have incited awe and spurred researchers to devise detectors to mitigate potential risks, including misinformation, phishing, and academic dishonesty. Despite this, most previous studies, including HC3, have been predominantly geared towards creating detectors that differentiate between purely ChatGPT-generated texts and human-authored texts. This approach, however, fails to work on discerning texts generated through human-machine collaboration, such as ChatGPT-polished texts. Addressing this gap, we introduce a novel dataset termed HPPT (ChatGPT-polished academic abstracts), facilitating the construction of more robust detectors. It diverges from extant corpora by comprising pairs of human-written and ChatGPT-polished abstracts instead of purely ChatGPT-generated texts. Additionally, we propose the "Polish Ratio" method, an innovative measure of ChatGPT's involvement in text generation based on editing distance. It provides a mechanism to measure the degree of human originality in the resulting text. Our experimental results show our proposed model has better robustness on the HPPT dataset and two existing datasets (HC3 and CDB). Furthermore, the "Polish Ratio" we proposed offers a more comprehensive explanation by quantifying the degree of ChatGPT involvement, which indicates that a Polish Ratio value greater than 0.2 signifies ChatGPT involvement and a value exceeding 0.6 implies that ChatGPT generates most of the text.
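The abstract defines the Polish Ratio only as an editing-distance-based measure; the word-level Levenshtein distance and the length normalisation below are illustrative assumptions, not the paper's exact formula.

```python
def levenshtein(a, b):
    """Word-level edit distance between two token sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (x != y))      # substitution (free if tokens match)
            prev = cur
    return dp[-1]

def polish_ratio(human_text: str, polished_text: str) -> float:
    h, p = human_text.split(), polished_text.split()
    return levenshtein(h, p) / max(len(h), len(p), 1)   # normalisation is assumed

print(polish_ratio("we propose a new method for text detection",
                   "we introduce a novel approach for robust text detection"))
```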

DEFTri: A Few-Shot Label Fused Contextual Representation Learning For Product Defect Triage in e-Commerce

  • paper_url: http://arxiv.org/abs/2307.11344
  • repo_url: None
  • paper_authors: Ipsita Mohanty
  • for: To make defect triage in a large-scale agile software development lifecycle more efficient by using machine learning to accurately assign defects to qualified teams.
  • methods: A novel automated defect-triage framework (DEFTri) fine-tunes a pre-trained BERT on label-fused text embeddings to improve contextual representations of human-generated product defect reports (a generic multi-label classification sketch follows the abstract below).
  • results: For the multi-label text-classification triage task, a Walmart-proprietary product defect dataset is introduced using weak supervision and adversarial learning in a few-shot setting.
    Abstract Defect Triage is a time-sensitive and critical process in a large-scale agile software development lifecycle for e-commerce. Inefficiencies arising from human and process dependencies in this domain have motivated research in automated approaches using machine learning to accurately assign defects to qualified teams. This work proposes a novel framework for automated defect triage (DEFTri) using fine-tuned state-of-the-art pre-trained BERT on labels fused text embeddings to improve contextual representations from human-generated product defects. For our multi-label text classification defect triage task, we also introduce a Walmart proprietary dataset of product defects using weak supervision and adversarial learning, in a few-shot setting.
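A minimal sketch of a multi-label defect-triage head on top of BERT, in the spirit of the framework described above. The team labels and defect text are hypothetical, the label-fusion step and Walmart data are not reproduced, and the classification head is randomly initialised until fine-tuned.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

teams = ["checkout", "search", "payments", "catalog"]       # hypothetical teams
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(teams),
    problem_type="multi_label_classification",   # sigmoid + BCE during fine-tuning
)

defect = "Checkout page crashes when a gift card is combined with PayPal."
inputs = tok(defect, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)[0]         # one score per team
print({team: round(p.item(), 3) for team, p in zip(teams, probs)})
```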

Making Pre-trained Language Models both Task-solvers and Self-calibrators

  • paper_url: http://arxiv.org/abs/2307.11316
  • repo_url: https://github.com/yangyi-chen/lm-toast
  • paper_authors: Yangyi Chen, Xingyao Wang, Heng Ji
  • for: To address the overconfidence of pre-trained language models (PLMs) in high-stakes applications by making them both task-solvers and self-calibrators.
  • methods: The proposed training algorithm, LM-TOAST, is designed to give PLMs reasonable confidence estimates despite limited training samples, data imbalance, and distribution shifts.
  • results: Experiments show that LM-TOAST effectively utilizes the training data so that PLMs produce reasonable confidence estimates while maintaining the original task performance; its practical usefulness is further demonstrated in selective classification, adversarial defense, and model cascading.
    Abstract Pre-trained language models (PLMs) serve as backbones for various real-world systems. For high-stake applications, it's equally essential to have reasonable confidence estimations in predictions. While the vanilla confidence scores of PLMs can already be effectively utilized, PLMs consistently become overconfident in their wrong predictions, which is not desirable in practice. Previous work shows that introducing an extra calibration task can mitigate this issue. The basic idea involves acquiring additional data to train models in predicting the confidence of their initial predictions. However, it only demonstrates the feasibility of this kind of method, assuming that there are abundant extra available samples for the introduced calibration task. In this work, we consider the practical scenario that we need to effectively utilize training samples to make PLMs both task-solvers and self-calibrators. Three challenges are presented, including limited training samples, data imbalance, and distribution shifts. We first conduct pilot experiments to quantify various decisive factors in the calibration task. Based on the empirical analysis results, we propose a training algorithm LM-TOAST to tackle the challenges. Experimental results show that LM-TOAST can effectively utilize the training data to make PLMs have reasonable confidence estimations while maintaining the original task performance. Further, we consider three downstream applications, namely selective classification, adversarial defense, and model cascading, to show the practical usefulness of LM-TOAST. The code will be made public at \url{https://github.com/Yangyi-Chen/LM-TOAST}.

GIST: Generating Image-Specific Text for Fine-grained Object Classification

  • paper_url: http://arxiv.org/abs/2307.11315
  • repo_url: https://github.com/emu1729/gist
  • paper_authors: Kathleen M. Lewis, Emily Mu, Adrian V. Dalca, John Guttag
  • for: To improve performance on fine-grained image classification, especially for vision-language models that lack paired text/image descriptions.
  • methods: The proposed method, GIST, generates image-specific fine-grained text descriptions from image-only datasets and shows that these descriptions improve classification. Its key components are prompting a pre-trained large language model with domain-specific prompts to generate diverse fine-grained descriptions for each class, and using a pre-trained vision-language model to match each image to label-preserving descriptions that capture relevant visual features (the matching step is sketched after the abstract below).
  • results: Fine-tuning vision-language models on the image-and-generated-text pairs learns an aligned vision-language representation space; across full-shot and few-shot evaluations on four fine-grained datasets from different domains, the method improves accuracy by an average of 4.1% over CLIP linear probes and by 1.1% over the previous state-of-the-art image-text classification method.
    Abstract Recent vision-language models outperform vision-only models on many image classification tasks. However, because of the absence of paired text/image descriptions, it remains difficult to fine-tune these models for fine-grained image classification. In this work, we propose a method, GIST, for generating image-specific fine-grained text descriptions from image-only datasets, and show that these text descriptions can be used to improve classification. Key parts of our method include 1. prompting a pretrained large language model with domain-specific prompts to generate diverse fine-grained text descriptions for each class and 2. using a pretrained vision-language model to match each image to label-preserving text descriptions that capture relevant visual features in the image. We demonstrate the utility of GIST by fine-tuning vision-language models on the image-and-generated-text pairs to learn an aligned vision-language representation space for improved classification. We evaluate our learned representation space in full-shot and few-shot scenarios across four diverse fine-grained classification datasets, each from a different domain. Our method achieves an average improvement of $4.1\%$ in accuracy over CLIP linear probes and an average of $1.1\%$ improvement in accuracy over the previous state-of-the-art image-text classification method on the full-shot datasets. Our method achieves similar improvements across few-shot regimes. Code is available at https://github.com/emu1729/GIST.
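A sketch of the image-to-description matching step, assuming a CLIP checkpoint as the pre-trained vision-language model; the bird descriptions are made-up examples rather than GIST output.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

descriptions = [  # stand-ins for LLM-generated fine-grained class descriptions
    "a sparrow with a streaked brown back and a short conical beak",
    "a goldfinch with bright yellow plumage and black wings",
]
image = Image.open("bird.jpg")
inputs = processor(text=descriptions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image.softmax(dim=-1)[0]  # image-text match
print({d: round(s.item(), 3) for d, s in zip(descriptions, scores)})
```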

Who should I Collaborate with? A Comparative Study of Academia and Industry Research Collaboration in NLP

  • paper_url: http://arxiv.org/abs/2308.04524
  • repo_url: None
  • paper_authors: Hussain Sadiq Abuwala, Bohan Zhang, Mushi Wang
  • for: To investigate the effects of collaboration between academia and industry on natural language processing (NLP) research.
  • methods: A pipeline extracts affiliations and citations from NLP papers and divides them into three categories: academia, industry, and hybrid (academia-industry collaboration).
  • results: Publications involving industry and academia-industry collaboration show an increasing trend, and these publications tend to have a higher impact than those produced solely within academia.
    Abstract The goal of our research was to investigate the effects of collaboration between academia and industry on Natural Language Processing (NLP). To do this, we created a pipeline to extract affiliations and citations from NLP papers and divided them into three categories: academia, industry, and hybrid (collaborations between academia and industry). Our empirical analysis found that there is a trend towards an increase in industry and academia-industry collaboration publications and that these types of publications tend to have a higher impact compared to those produced solely within academia.

Generator-Retriever-Generator: A Novel Approach to Open-domain Question Answering

  • paper_url: http://arxiv.org/abs/2307.11278
  • repo_url: https://github.com/abdoelsayed2016/grg
  • paper_authors: Abdelrahman Abdallah, Adam Jatowt
  • for: To improve answer accuracy in open-domain question answering (QA) by combining document retrieval with large language models (LLMs).
  • methods: The proposed Generator-Retriever-Generator (GRG) approach first prompts an LLM to generate contextual documents for the question, in parallel retrieves relevant documents from an external corpus with a dual-encoder network, and finally passes both sets of documents to a second LLM that generates the answer (a high-level sketch follows the abstract below).
  • results: GRG outperforms the GENREAD and RFiD pipelines by at least +5.2, +4.2, and +1.6 points on the TriviaQA, NQ, and WebQ datasets respectively, indicating that it better addresses open-domain QA challenges such as producing informative and contextually relevant answers.
    Abstract Open-domain question answering (QA) tasks usually require the retrieval of relevant information from a large corpus to generate accurate answers. We propose a novel approach called Generator-Retriever-Generator (GRG) that combines document retrieval techniques with a large language model (LLM), by first prompting the model to generate contextual documents based on a given question. In parallel, a dual-encoder network retrieves documents that are relevant to the question from an external corpus. The generated and retrieved documents are then passed to the second LLM, which generates the final answer. By combining document retrieval and LLM generation, our approach addresses the challenges of open-domain QA, such as generating informative and contextually relevant answers. GRG outperforms the state-of-the-art generate-then-read and retrieve-then-read pipelines (GENREAD and RFiD) improving their performance at least by +5.2, +4.2, and +1.6 on TriviaQA, NQ, and WebQ datasets, respectively. We provide code, datasets, and checkpoints at \url{https://github.com/abdoelsayed2016/GRG}.
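A high-level sketch of the generator-retriever-generator flow. The three callables are hypothetical placeholders for an LLM document-generation prompt, a dual-encoder retriever, and a second LLM reader; they do not correspond to functions in the paper's repository.

```python
def grg_answer(question: str,
               generate_documents,   # LLM: question -> list of generated context docs
               dense_retrieve,       # dual encoder: question -> list of retrieved docs
               generate_answer,      # LLM: (question, docs) -> final answer string
               k: int = 5) -> str:
    generated = generate_documents(question, n=k)    # step 1a: generate context
    retrieved = dense_retrieve(question, top_k=k)    # step 1b: retrieve (run in parallel in the paper)
    context = generated + retrieved                  # step 2: merge both evidence sets
    return generate_answer(question, context)        # step 3: read and answer
```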

A Systematic Evaluation of Federated Learning on Biomedical Natural Language Processing

  • paper_url: http://arxiv.org/abs/2307.11254
  • repo_url: https://github.com/pl97/fednlp
  • paper_authors: Le Peng, Sicheng Zhou, Jiandong Chen, Rui Zhang, Ziyue Xu, Ju Sun
  • for: To systematically evaluate federated learning (FL) for biomedical natural language processing (NLP), where privacy constraints limit access to training data.
  • methods: Six transformer-based language models are trained with federated learning on two biomedical NLP tasks spanning eight corpora (a FedAvg-style aggregation sketch follows the abstract below).
  • results: FL models consistently outperform models trained on individual clients' data and sometimes match models trained on pooled data, but performance degrades when clients' data are not IID-distributed. Code is available at https://github.com/PL97/FedNLP.
    Abstract Language models (LMs) like BERT and GPT have revolutionized natural language processing (NLP). However, privacy-sensitive domains, particularly the medical field, face challenges to train LMs due to limited data access and privacy constraints imposed by regulations like the Health Insurance Portability and Accountability Act (HIPPA) and the General Data Protection Regulation (GDPR). Federated learning (FL) offers a decentralized solution that enables collaborative learning while ensuring the preservation of data privacy. In this study, we systematically evaluate FL in medicine across $2$ biomedical NLP tasks using $6$ LMs encompassing $8$ corpora. Our results showed that: 1) FL models consistently outperform LMs trained on individual client's data and sometimes match the model trained with polled data; 2) With the fixed number of total data, LMs trained using FL with more clients exhibit inferior performance, but pre-trained transformer-based models exhibited greater resilience. 3) LMs trained using FL perform nearly on par with the model trained with pooled data when clients' data are IID distributed while exhibiting visible gaps with non-IID data. Our code is available at: https://github.com/PL97/FedNLP
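The abstract does not spell out the aggregation rule, so the sketch below assumes standard FedAvg-style weighted parameter averaging to illustrate the federated set-up; local client fine-tuning is elided.

```python
import copy
import torch

def fedavg(global_model: torch.nn.Module, client_models, client_sizes):
    """Average client parameters into the global model, weighted by data size."""
    total = float(sum(client_sizes))
    state = global_model.state_dict()
    for name in state:
        state[name] = sum((n / total) * m.state_dict()[name].float()
                          for m, n in zip(client_models, client_sizes))
    global_model.load_state_dict(state)
    return global_model

base = torch.nn.Linear(8, 2)                            # stand-in for a language model
clients = [copy.deepcopy(base) for _ in range(3)]       # locally fine-tuned copies
merged = fedavg(copy.deepcopy(base), clients, client_sizes=[120, 80, 200])
```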

UMLS-KGI-BERT: Data-Centric Knowledge Integration in Transformers for Biomedical Entity Recognition

  • paper_url: http://arxiv.org/abs/2307.11170
  • repo_url: None
  • paper_authors: Aidan Mannion, Thierry Chevalier, Didier Schwab, Lorraine Geouriot
  • for: To improve language representations for biomedical natural language processing (NLP) tasks by integrating domain-specific knowledge.
  • methods: A data-centric approach enriches the language representations of biomedical transformer-encoder language models (LMs) by extracting text sequences from the UMLS, which allows graph-based learning objectives to be combined with masked-language pre-training.
  • results: Preliminary experiments on extending pre-trained LMs and on training from scratch show that the framework improves downstream performance on multiple biomedical and clinical named entity recognition (NER) tasks.
    Abstract Pre-trained transformer language models (LMs) have in recent years become the dominant paradigm in applied NLP. These models have achieved state-of-the-art performance on tasks such as information extraction, question answering, sentiment analysis, document classification and many others. In the biomedical domain, significant progress has been made in adapting this paradigm to NLP tasks that require the integration of domain-specific knowledge as well as statistical modelling of language. In particular, research in this area has focused on the question of how best to construct LMs that take into account not only the patterns of token distribution in medical text, but also the wealth of structured information contained in terminology resources such as the UMLS. This work contributes a data-centric paradigm for enriching the language representations of biomedical transformer-encoder LMs by extracting text sequences from the UMLS. This allows for graph-based learning objectives to be combined with masked-language pre-training. Preliminary results from experiments in the extension of pre-trained LMs as well as training from scratch show that this framework improves downstream performance on multiple biomedical and clinical Named Entity Recognition (NER) tasks.

L-Eval: Instituting Standardized Evaluation for Long Context Language Models

  • paper_url: http://arxiv.org/abs/2307.11088
  • repo_url: https://github.com/openlmlab/leval
  • paper_authors: Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, Xipeng Qiu
  • for: To institute standardized evaluation of long-context language models, which must handle single-turn long inputs and conversations with extensive histories.
  • methods: A standardized evaluation suite, L-Eval, is developed, containing 411 long documents and over 2,000 human-labeled query-response pairs, with diverse evaluation methods and instruction styles.
  • results: Open-source models typically lag behind commercial models but still perform impressively compared with their regular counterparts; LLaMA2-13B achieves the best results on both open-ended tasks (a 42% win rate vs. turbo-16k-0613) and closed-ended tasks with only a 4k context length.
    Abstract Recently, there has been growing interest in extending the context length of instruction-following models in order to effectively process single-turn long input (e.g. summarizing a paper) and conversations with more extensive histories. While proprietary models such as GPT-4 and Claude have shown significant strides in handling extremely lengthy input, open-sourced models are still in the early stages of experimentation. It also remains unclear whether extending the context can offer substantial gains over traditional methods such as retrieval, and to what extent it improves upon their regular counterparts in practical downstream tasks. To address this challenge, we propose instituting standardized evaluation for long context language models. Concretely, we develop L-Eval which contains 411 long documents and over 2,000 human-labeled query-response pairs encompassing areas such as law, finance, school lectures, lengthy conversations, news, long-form novels, and meetings. L-Eval also adopts diverse evaluation methods and instruction styles, enabling a more reliable assessment of Long Context Language Models (LCLMs). Our findings indicate that while open-source models typically lag behind commercial models, they still exhibit impressive performance compared with their regular versions. LLaMA2-13B achieves the best results on both open-ended tasks (win \textbf{42}\% vs turbo-16k-0613) and closed-ended tasks with only 4k context length. We release our new evaluation suite, code, and all generation results including predictions from all open-sourced LCLMs, GPT4-32k, Cluade-100k at {\url{https://github.com/OpenLMLab/LEval}.

Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot Classification

  • paper_url: http://arxiv.org/abs/2307.11031
  • repo_url: https://github.com/HazyResearch/embroid
  • paper_authors: Neel Guha, Mayee F. Chen, Kush Bhatia, Azalia Mirhoseini, Frederic Sala, Christopher Ré
  • for: To improve prompt-based automatic data labeling with language models (LMs) without requiring additional labeled data.
  • methods: Instead of modifying the prompt, Embroid modifies its predictions: it computes multiple representations of the dataset under different embedding functions and uses the consistency of LM predictions among neighboring samples to identify and correct mispredictions (a smoothing sketch follows the abstract below).
  • results: In a rigorous evaluation across six LMs and up to 95 tasks, Embroid substantially improves performance over the original prompts (e.g., by an average of 7.3 points on GPT-JT), also improves more sophisticated prompting strategies such as chain-of-thought, and can be specialized to domains such as law through the embedding functions.
    Abstract Recent work has shown that language models' (LMs) prompt-based learning capabilities make them well suited for automating data labeling in domains where manual annotation is expensive. The challenge is that while writing an initial prompt is cheap, improving a prompt is costly -- practitioners often require significant labeled data in order to evaluate the impact of prompt modifications. Our work asks whether it is possible to improve prompt-based learning without additional labeled data. We approach this problem by attempting to modify the predictions of a prompt, rather than the prompt itself. Our intuition is that accurate predictions should also be consistent: samples which are similar under some feature representation should receive the same prompt prediction. We propose Embroid, a method which computes multiple representations of a dataset under different embedding functions, and uses the consistency between the LM predictions for neighboring samples to identify mispredictions. Embroid then uses these neighborhoods to create additional predictions for each sample, and combines these predictions with a simple latent variable graphical model in order to generate a final corrected prediction. In addition to providing a theoretical analysis of Embroid, we conduct a rigorous empirical evaluation across six different LMs and up to 95 different tasks. We find that (1) Embroid substantially improves performance over original prompts (e.g., by an average of 7.3 points on GPT-JT), (2) also realizes improvements for more sophisticated prompting strategies (e.g., chain-of-thought), and (3) can be specialized to domains like law through the embedding functions.
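A sketch of the neighbourhood-consistency idea: under each embedding function, a sample's prompt prediction is compared with the votes of its nearest neighbours. Embroid combines these votes with a latent-variable graphical model; the simple averaging below is an illustrative stand-in for that combination step.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smooth_predictions(embedding_spaces, preds, k=5):
    """embedding_spaces: list of (n, d) arrays; preds: (n,) binary LM predictions."""
    preds = np.asarray(preds, dtype=float)
    votes = [preds]                                    # keep the original LM votes
    for emb in embedding_spaces:
        idx = NearestNeighbors(n_neighbors=k + 1).fit(emb).kneighbors(emb)[1]
        votes.append(preds[idx[:, 1:]].mean(axis=1))   # drop self (column 0), average
    return (np.mean(votes, axis=0) > 0.5).astype(int)  # corrected hard labels

rng = np.random.default_rng(0)
spaces = [rng.normal(size=(200, 32)) for _ in range(2)]   # e.g. two sentence encoders
noisy = rng.integers(0, 2, size=200)                      # prompt predictions to correct
corrected = smooth_predictions(spaces, noisy)
```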

Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation

  • paper_url: http://arxiv.org/abs/2307.11019
  • repo_url: https://github.com/rucaibox/llm-knowledge-boundary
  • paper_authors: Ruiyang Ren, Yuhao Wang, Yingqi Qu, Wayne Xin Zhao, Jing Liu, Hao Tian, Hua Wu, Ji-Rong Wen, Haifeng Wang
  • for: To investigate how well large language models (LLMs) perceive their factual knowledge boundaries in open-domain question answering (QA), and how retrieval augmentation affects that perception.
  • methods: LLMs, including ChatGPT, are analyzed on open-domain QA with and without retrieval augmentation, examining QA performance as well as their priori and posteriori judgement (the prompting set-up is sketched after the abstract below).
  • results: LLMs display unwavering confidence in their ability to answer and in the accuracy of their responses; retrieval augmentation proves effective in improving their awareness of knowledge boundaries and their judgemental abilities; and LLMs tend to rely on the provided retrieval results when formulating answers, with the quality of those results significantly affecting that reliance.
    Abstract Knowledge-intensive tasks (e.g., open-domain question answering (QA)) require a substantial amount of factual knowledge and often rely on external information for assistance. Recently, large language models (LLMs) (e.g., ChatGPT), have demonstrated impressive prowess in solving a wide range of tasks with world knowledge, including knowledge-intensive tasks. However, it remains unclear how well LLMs are able to perceive their factual knowledge boundaries, particularly how they behave when incorporating retrieval augmentation. In this study, we present an initial analysis of the factual knowledge boundaries of LLMs and how retrieval augmentation affects LLMs on open-domain QA. Specially, we focus on three primary research questions and analyze them by examining QA performance, priori judgement and posteriori judgement of LLMs. We show evidence that LLMs possess unwavering confidence in their capabilities to respond to questions and the accuracy of their responses. Furthermore, retrieval augmentation proves to be an effective approach in enhancing LLMs' awareness of knowledge boundaries, thereby improving their judgemental abilities. Additionally, we also find that LLMs have a propensity to rely on the provided retrieval results when formulating answers, while the quality of these results significantly impacts their reliance. The code to reproduce this work is available at https://github.com/RUCAIBox/LLM-Knowledge-Boundary.
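A sketch of a retrieval-augmented QA prompt that also asks the model to judge whether its knowledge suffices; the wording is an illustrative assumption, not the template used in the paper.

```python
def build_prompt(question: str, passages: list) -> str:
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using the passages below. "
        "If the passages and your own knowledge are insufficient, reply 'unknown'.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_prompt(
    "Who wrote the novel 'The Remains of the Day'?",
    ["Kazuo Ishiguro is a British novelist born in Nagasaki.",
     "The Remains of the Day was published in 1989 and won the Booker Prize."],
))
```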

Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding

  • paper_url: http://arxiv.org/abs/2307.11005
  • repo_url: None
  • paper_authors: Siddhant Arora, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe
  • for: To integrate pretrained automatic speech recognition (ASR) and language models (LMs) into a spoken language understanding (SLU) framework for sequence generation tasks.
  • methods: A three-pass end-to-end (E2E) SLU system is proposed: the first pass predicts the ASR transcript with the ASR subnetwork, the second pass lets the LM subnetwork make an initial SLU prediction, and the third pass uses a deliberation subnetwork that conditions on the ASR and LM representations to make the final prediction.
  • results: The proposed three-pass SLU system outperforms cascaded and E2E SLU models on two benchmark SLU datasets, SLURP and SLUE, especially on acoustically challenging utterances.
    Abstract There has been an increased interest in the integration of pretrained speech recognition (ASR) and language models (LM) into the SLU framework. However, prior methods often struggle with a vocabulary mismatch between pretrained models, and LM cannot be directly utilized as they diverge from its NLU formulation. In this study, we propose a three-pass end-to-end (E2E) SLU system that effectively integrates ASR and LM subnetworks into the SLU formulation for sequence generation tasks. In the first pass, our architecture predicts ASR transcripts using the ASR subnetwork. This is followed by the LM subnetwork, which makes an initial SLU prediction. Finally, in the third pass, the deliberation subnetwork conditions on representations from the ASR and LM subnetworks to make the final prediction. Our proposed three-pass SLU system shows improved performance over cascaded and E2E SLU models on two benchmark SLU datasets, SLURP and SLUE, especially on acoustically challenging utterances.

MASR: Metadata Aware Speech Representation

  • paper_url: http://arxiv.org/abs/2307.10982
  • repo_url: None
  • paper_authors: Anjali Raj, Shikhar Bharadwaj, Sriram Ganapathy, Min Ma, Shikhar Vashishth
  • for: To propose MASR, a Metadata Aware Speech Representation learning framework that leverages external knowledge sources to enhance speech representation learning.
  • methods: External knowledge sources are incorporated as sample-level pairwise similarity matrices used in a hard-mining loss, and the framework can be combined with any choice of self-supervised learning (SSL) method (an illustrative loss sketch follows below).
  • results: MASR representations yield significant performance improvements over established benchmarks on several downstream tasks, including language identification, speech recognition, and non-semantic tasks such as speaker and emotion recognition; a detailed analysis of the language identification task shows how the proposed loss helps the representations separate closely related languages.
    Abstract In the recent years, speech representation learning is constructed primarily as a self-supervised learning (SSL) task, using the raw audio signal alone, while ignoring the side-information that is often available for a given speech recording. In this paper, we propose MASR, a Metadata Aware Speech Representation learning framework, which addresses the aforementioned limitations. MASR enables the inclusion of multiple external knowledge sources to enhance the utilization of meta-data information. The external knowledge sources are incorporated in the form of sample-level pair-wise similarity matrices that are useful in a hard-mining loss. A key advantage of the MASR framework is that it can be combined with any choice of SSL method. Using MASR representations, we perform evaluations on several downstream tasks such as language identification, speech recognition and other non-semantic tasks such as speaker and emotion recognition. In these experiments, we illustrate significant performance improvements for the MASR over other established benchmarks. We perform a detailed analysis on the language identification task to provide insights on how the proposed loss function enables the representations to separate closely related languages.
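A sketch of how a sample-level pairwise similarity matrix built from metadata (here, same-language pairs) might drive a hard-mining loss on speech representations. The margin formulation and thresholding are illustrative assumptions, not the exact MASR objective.

```python
import torch
import torch.nn.functional as F

def metadata_hard_mining_loss(reps, meta_sim, pos_thresh=0.5, margin=0.5):
    """reps: (n, d) representations; meta_sim: (n, n) metadata similarity in [0, 1]."""
    sim = F.cosine_similarity(reps.unsqueeze(1), reps.unsqueeze(0), dim=-1)
    eye = torch.eye(len(reps), dtype=torch.bool)
    pos_mask = (meta_sim >= pos_thresh) & ~eye
    neg_mask = ~pos_mask & ~eye
    hardest_neg = sim.masked_fill(~neg_mask, -1.0).amax(dim=1)            # per-row hardest negative
    mean_pos = (sim * pos_mask).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return F.relu(margin + hardest_neg - mean_pos).mean()                 # pull positives above negatives

reps = torch.randn(16, 64, requires_grad=True)        # toy speech representations
lang = torch.randint(0, 3, (16,))                     # toy metadata: language id
meta_sim = (lang.unsqueeze(0) == lang.unsqueeze(1)).float()
loss = metadata_hard_mining_loss(reps, meta_sim)
loss.backward()
```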