cs.CL - 2023-07-03

Analyzing Multiple-Choice Reading and Listening Comprehension Tests

paper_url: http://arxiv.org/abs/2307.01076
repo_url: None
paper_authors: Vatsal Raina, Adian Liusie, Mark Gales
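for: This paper studies how far multiple-choice tests in language assessment rely on comprehension rather than world knowledge.
methods: The paper uses conversation transcriptions and listening comprehension tests to study how much of the context passage must be read to answer multiple-choice questions.
results: Automated reading comprehension systems perform significantly better than random with partial or even no access to the context passage. These findings can help content creators automatically capture the trade-off between comprehension and world knowledge required by their questions.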

Abstract
Multiple-choice reading and listening comprehension tests are an important part of language assessment. Content creators for standard educational tests need to carefully curate questions that assess the comprehension abilities of candidates taking the tests. However, recent work has shown that a large number of questions in general multiple-choice reading comprehension datasets can be answered without comprehension, by leveraging world knowledge instead. This work investigates how much of a contextual passage needs to be read in multiple-choice reading based on conversation transcriptions and listening comprehension tests to be able to work out the correct answer. We find that automated reading comprehension systems can perform significantly better than random with partial or even no access to the context passage. These findings offer an approach for content creators to automatically capture the trade-off between comprehension and world knowledge required for their proposed questions.
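
The claim that a system performs "significantly better than random" can be made concrete with a chance baseline. The following is a minimal sketch, assuming questions with a fixed number of options and a one-sided exact binomial test; the function names and the example counts are hypothetical and are not taken from the paper.

```python
from math import comb


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_options


def p_value_above_chance(correct: int, total: int, num_options: int) -> float:
    """One-sided exact binomial tail: P(X >= correct) under random guessing."""
    p = chance_accuracy(num_options)
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(correct, total + 1)
    )


if __name__ == "__main__":
    # Hypothetical counts: 40 of 100 four-option questions answered
    # correctly without access to the context passage.
    print(chance_accuracy(4))
    print(p_value_above_chance(40, 100, 4))
```

A small tail probability here indicates that the no-context accuracy is unlikely to come from guessing alone, which is the kind of signal the abstract describes.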

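Summary
Multiple-choice tests are an important part of language assessment. Test creators need to design questions carefully so that they assess the comprehension abilities of test takers. However, recent work shows that many questions in multiple-choice datasets can be answered from world knowledge rather than comprehension. This work studies how much of the passage must be read to answer multiple-choice questions correctly, based on conversation transcriptions and listening comprehension tests. We find that automated reading comprehension systems perform significantly better than random with partial or no context. These findings can help content creators automatically capture the trade-off between comprehension and world knowledge in their test questions.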

Estimating Post-OCR Denoising Complexity on Numerical Texts

paper_url: http://arxiv.org/abs/2307.01020
repo_url: None
paper_authors: Arthur Hemmer, Jérôme Brachat, Mickaël Coustaty, Jean-Marc Ogier
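for: This paper evaluates the difficulty of OCR post-processing across different kinds of documents.
methods: The paper proposes a method to estimate the denoising complexity of a text and evaluates it on several datasets of varying nature.
results: Texts of a numerical nature are at a significant disadvantage in denoising complexity. The paper also validates the estimator against the error rates of modern denoising approaches.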

Abstract
Post-OCR processing has significantly improved over the past few years. However, these have been primarily beneficial for texts consisting of natural, alphabetical words, as opposed to documents of numerical nature such as invoices, payslips, medical certificates, etc. To evaluate the OCR post-processing difficulty of these datasets, we propose a method to estimate the denoising complexity of a text and evaluate it on several datasets of varying nature, and show that texts of numerical nature have a significant disadvantage. We evaluate the estimated complexity ranking with respect to the error rates of modern-day denoising approaches to show the validity of our estimator.

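Summary
Post-OCR processing has improved considerably over the past few years, but the gains have mainly benefited texts made of natural alphabetical words rather than numerical documents such as invoices, payslips, and medical certificates. To evaluate the OCR post-processing difficulty of such texts, we propose a method to estimate the denoising complexity of a text and evaluate it on texts of different types, finding that numerical texts are at a significant disadvantage. We compare the estimated complexity with the error rates of modern denoising methods to show the validity of our estimator.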

Visual Instruction Tuning with Polite Flamingo

paper_url: http://arxiv.org/abs/2307.01003
repo_url: https://github.com/chendelong1999/polite_flamingo
paper_authors: Delong Chen, Jianfeng Liu, Wenliang Dai, Baoyuan Wang
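for: This paper aims to improve the performance of multi-modal large language models (LLMs) and to address a side effect that arises during multi-task fine-tuning.
methods: The paper fine-tunes on multi-modal datasets and identifies a problem it terms the "multi-modal alignment tax", which degrades the format and politeness of model responses. To address it, the paper proposes Polite Flamingo, a multi-modal response rewriter that turns raw, terse annotations into a better-formatted, more polite form.
results: Using the Polite Flamingo rewriter together with new methods such as U-shaped multi-stage tuning and multi-turn augmentation, the resulting model performs well in both multi-modal understanding and response politeness, with gains in both automated and human evaluations.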

Abstract
Recent research has demonstrated that the multi-task fine-tuning of multi-modal Large Language Models (LLMs) using an assortment of annotated downstream vision-language datasets significantly enhances their performance. Yet, during this process, a side effect, which we termed as the "multi-modal alignment tax", surfaces. This side effect negatively impacts the model's ability to format responses appropriately -- for instance, its "politeness" -- due to the overly succinct and unformatted nature of raw annotations, resulting in reduced human preference. In this paper, we introduce Polite Flamingo, a multi-modal response rewriter that transforms raw annotations into a more appealing, "polite" format. Polite Flamingo is trained to reconstruct high-quality responses from their automatically distorted counterparts and is subsequently applied to a vast array of vision-language datasets for response rewriting. After rigorous filtering, we generate the PF-1M dataset and further validate its value by fine-tuning a multi-modal LLM with it. Combined with novel methodologies including U-shaped multi-stage tuning and multi-turn augmentation, the resulting model, Clever Flamingo, demonstrates its advantages in both multi-modal understanding and response politeness according to automated and human evaluations.

Summary
In this paper, we introduce Polite Flamingo, a multi-modal response rewriter that transforms raw annotations into a more appealing and "polite" format. Polite Flamingo is trained to reconstruct high-quality responses from their distorted counterparts and is applied to a wide range of vision-language datasets for response rewriting. After rigorous filtering, we generate the PF-1M dataset and validate its value by fine-tuning a multi-modal LLM with it. We also propose several novel methodologies, including U-shaped multi-stage tuning and multi-turn augmentation, which improve the model's performance in both multi-modal understanding and response politeness, as demonstrated by automated and human evaluations. The resulting model, Clever Flamingo, shows its advantages in both areas.

Towards Suicide Prevention from Bipolar Disorder with Temporal Symptom-Aware Multitask Learning

paper_url: http://arxiv.org/abs/2307.00995
repo_url: https://github.com/leedaeuni/Temporal-Symptom-Aware-Multitask-Learning-KDD23
paper_authors: Daeun Lee, Sejung Son, Hyolim Jeon, Seungbae Kim, Jinyoung Han
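for: Predicting the future suicide risk of patients with bipolar disorder (BD).
methods: A multi-task learning model that predicts a patient's future suicidality while jointly learning current symptoms.
results: The proposed multi-task model predicts future suicidality effectively and provides interpretable attention weights, helping clinicians understand patients more comprehensively and provide timely intervention.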

Abstract
Bipolar disorder (BD) is closely associated with an increased risk of suicide. However, while the prior work has revealed valuable insight into understanding the behavior of BD patients on social media, little attention has been paid to developing a model that can predict the future suicidality of a BD patient. Therefore, this study proposes a multi-task learning model for predicting the future suicidality of BD patients by jointly learning current symptoms. We build a novel BD dataset clinically validated by psychiatrists, including 14 years of posts on bipolar-related subreddits written by 818 BD patients, along with the annotations of future suicidality and BD symptoms. We also suggest a temporal symptom-aware attention mechanism to determine which symptoms are the most influential for predicting future suicidality over time through a sequence of BD posts. Our experiments demonstrate that the proposed model outperforms the state-of-the-art models in both BD symptom identification and future suicidality prediction tasks. In addition, the proposed temporal symptom-aware attention provides interpretable attention weights, helping clinicians to apprehend BD patients more comprehensively and to provide timely intervention by tracking mental state progression.

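Summary
Bipolar disorder (BD) is closely associated with an increased risk of suicide. However, prior work has paid little attention to developing a model that can predict the future suicidality of a BD patient. This study therefore proposes a multi-task learning model that predicts the future suicidality of BD patients while jointly learning current symptoms. We build a new BD dataset that includes 14 years of posts written by 818 BD patients on bipolar-related subreddits, along with annotations of future suicidality and BD symptoms. We also propose a temporal symptom-aware attention mechanism that determines which symptoms are most influential for predicting future suicidality across a sequence of BD posts. Our experiments show that the proposed model achieves strong performance on both BD symptom identification and future suicidality prediction. In addition, the model provides interpretable attention weights, helping clinicians understand BD patients more comprehensively and track the progression of their mental state over time in order to provide timely intervention.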

Data-Driven Information Extraction and Enrichment of Molecular Profiling Data for Cancer Cell Lines

paper_url: http://arxiv.org/abs/2307.00933
repo_url: https://github.com/progenetix/cancercelllines-web
paper_authors: Ellery Smith, Rahel Paloots, Dimitris Giagkos, Michael Baudis, Kurt Stockinger
for: This paper is written for researchers and domain experts in the fields of biological, medical, and clinical research, who need to quickly and efficiently extract relevant information from large amounts of published scientific literature.
methods: The paper presents a novel data extraction and exploration system that uses computational methods to extract deep semantic relations between textual entities from scientific literature. The system uses a combination of natural language processing and machine learning techniques to identify and link relevant information.
results: The paper reports on the design, implementation, and application of the novel data extraction and exploration system, which is publicly available on the web at https://cancercelllines.org. The system is able to extract and link information about genomic copy number variants and affected genes, and provides literature-derived evidences to support the links. The system enables rapid, yet deep, literature search using existing structured data as a springboard.

Abstract
With the proliferation of research means and computational methodologies, published biomedical literature is growing exponentially in numbers and volume. As a consequence, in the fields of biological, medical and clinical research, domain experts have to sift through massive amounts of scientific text to find relevant information. However, this process is extremely tedious and slow to be performed by humans. Hence, novel computational information extraction and correlation mechanisms are required to boost meaningful knowledge extraction. In this work, we present the design, implementation and application of a novel data extraction and exploration system. This system extracts deep semantic relations between textual entities from scientific literature to enrich existing structured clinical data in the domain of cancer cell lines. We introduce a new public data exploration portal, which enables automatic linking of genomic copy number variants plots with ranked, related entities such as affected genes. Each relation is accompanied by literature-derived evidences, allowing for deep, yet rapid, literature search, using existing structured data as a springboard. Our system is publicly available on the web at https://cancercelllines.org

Summary
In this work, we present a novel data extraction and exploration system that extracts deep semantic relations between textual entities from scientific literature to enrich existing structured clinical data in the domain of cancer cell lines. Our system provides a public data exploration portal that automatically links genomic copy number variant plots with ranked, related entities such as affected genes, accompanied by literature-derived evidence. This enables rapid, yet deep, literature search using existing structured data as a springboard. Our system is publicly available at https://cancercelllines.org.

Fraunhofer SIT at CheckThat! 2023: Tackling Classification Uncertainty Using Model Souping on the Example of Check-Worthiness Classification

paper_url: http://arxiv.org/abs/2307.02377
repo_url: None
paper_authors: Raphael Frick, Inna Vogel, Jeong-Eun Choi
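for: This study proposes a check-worthiness classification method based on Model Souping, in order to prioritize claims for manual fact-checking.
methods: The method is built on a Model Souping ensemble classification scheme and is trained and tested on an English dataset of political debate text.
results: On the English dataset the method achieved an overall F1 score of 0.878 and ranked second in the competition.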

Abstract
This paper describes the second-placed approach developed by the Fraunhofer SIT team in the CLEF-2023 CheckThat! lab Task 1B for English. Given a text snippet from a political debate, the aim of this task is to determine whether it should be assessed for check-worthiness. Detecting check-worthy statements aims to facilitate manual fact-checking efforts by prioritizing the claims that fact-checkers should consider first. It can also be considered the primary step of a fact-checking system. Our best-performing method took advantage of an ensemble classification scheme centered on Model Souping. When applied to the English data set, our submitted model achieved an overall F1 score of 0.878 and was ranked as the second-best model in the competition.
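
Model souping is commonly described as averaging the weights of several fine-tuned copies of the same architecture into a single model. The abstract does not give the team's exact recipe, so the following is only a minimal sketch of a uniform soup, assuming each checkpoint is a dictionary of NumPy arrays with identical keys and shapes; `uniform_soup` and the parameter names are hypothetical.

```python
import numpy as np


def uniform_soup(state_dicts):
    """Average parameters key by key across checkpoints (uniform soup)."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("checkpoints must share the same parameter names")
    # Stack each parameter across checkpoints and take the element-wise mean.
    return {k: np.mean(np.stack([sd[k] for sd in state_dicts]), axis=0) for k in keys}


if __name__ == "__main__":
    # Two hypothetical checkpoints of the same tiny classifier head.
    run_a = {"classifier.weight": np.array([1.0, 2.0]), "classifier.bias": np.array([0.0])}
    run_b = {"classifier.weight": np.array([3.0, 4.0]), "classifier.bias": np.array([1.0])}
    print(uniform_soup([run_a, run_b]))
```

Unlike a prediction-level ensemble, the souped model costs the same at inference time as a single model, since only one set of weights is kept.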

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

paper_url: http://arxiv.org/abs/2307.00862
repo_url: https://github.com/threesr/unifine
paper_authors: Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang, Shih-Fu Chang
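for: This paper studies zero-shot vision-language tasks such as VQA, SNLI-VE, and VCR.
methods: The paper proposes a unified framework that uses fine-grained information (such as objects in the image and keywords in the text) to improve zero-shot vision-language learning.
results: Experiments show that the framework outperforms previous zero-shot methods on VQA and achieves substantial improvements on SNLI-VE and VCR. Ablation studies further confirm the effectiveness and generalizability of the proposed method.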

Abstract
Vision-language tasks, such as VQA, SNLI-VE, and VCR are challenging because they require the model's reasoning ability to understand the semantics of the visual world and natural language. Supervised methods working for vision-language tasks have been well-studied. However, solving these tasks in a zero-shot setting is less explored. Since Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works utilized its strong zero-shot ability by converting vision-language tasks into an image-text matching problem, and they mainly consider global-level matching (e.g., the whole image or sentence). However, we find visual and textual fine-grained information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantics understanding. Inspired by this, we propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR. Furthermore, our ablation studies confirm the effectiveness and generalizability of our proposed method. Code will be available at https://github.com/ThreeSR/UniFine
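
The idea of recasting a vision-language task as image-text matching, and then adding fine-grained evidence, can be illustrated with plain vectors. This is a sketch only: the embeddings stand in for encoder outputs, and the way global and fine-grained scores are mixed (a weighted sum with a hypothetical `alpha`) is an assumption for illustration, not the scoring rule from the paper.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def pick_answer(image_emb, candidate_embs, region_embs=None, keyword_embs=None, alpha=0.5):
    """Score each candidate text against the image and return the best index.

    candidate_embs: one embedding per candidate answer sentence (global level).
    region_embs:    optional embeddings of objects detected in the image.
    keyword_embs:   optional list, per candidate, of keyword embeddings.
    """
    scores = []
    for i, text_emb in enumerate(candidate_embs):
        global_score = cosine(image_emb, text_emb)
        fine_score = 0.0
        if region_embs is not None and keyword_embs is not None:
            # Best match between any image region and any keyword of this candidate.
            fine_score = max(cosine(r, k) for r in region_embs for k in keyword_embs[i])
        scores.append((1 - alpha) * global_score + alpha * fine_score)
    return int(np.argmax(scores)), scores


if __name__ == "__main__":
    image = np.array([1.0, 0.0])
    candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
    print(pick_answer(image, candidates))
```

With only global embeddings the function reduces to ordinary image-text matching; passing region and keyword embeddings lets a strong object-keyword match change the ranking.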

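Summary
Vision-language tasks such as VQA, SNLI-VE, and VCR are challenging because they require the model to understand the semantics of both the visual world and natural language. Supervised methods for these tasks have been studied extensively, but solving them in a zero-shot setting is less explored. Because Contrastive Language-Image Pre-training (CLIP) shows remarkable zero-shot performance, previous works exploited it by converting vision-language tasks into an image-text matching problem, mainly considering global-level matching of the whole image or sentence. However, we find that fine-grained visual and textual information, such as keywords in the text and objects in the image, can be quite informative for understanding semantics. Inspired by this, we propose a unified framework that uses fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that the framework outperforms previous zero-shot methods on VQA and achieves significant improvements on SNLI-VE and VCR. In addition, our ablation studies confirm the effectiveness and generalizability of the proposed method. Code will be available at https://github.com/ThreeSR/UniFine.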

VOLTA: Diverse and Controllable Question-Answer Pair Generation with Variational Mutual Information Maximizing Autoencoder

paper_url: http://arxiv.org/abs/2307.00852
repo_url: None
paper_authors: Yueen Ma, Dafeng Chi, Jingjing Li, Yuzheng Zhuang, Jianye Hao, Irwin King
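for: Improving generative diversity and input-independent controllability in question-answer pair generation.
methods: Uses the Variational Autoencoder framework with a shared backbone network as encoder and decoder, plus InfoGAN-style latent codes for input-independent control.
results: Compared with previous models, the approach significantly improves generative diversity and controllability.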

Abstract
Previous question-answer pair generation methods aimed to produce fluent and meaningful question-answer pairs but tend to have poor diversity. Recent attempts addressing this issue suffer from either low model capacity or overcomplicated architecture. Furthermore, they overlooked the problem where the controllability of their models is highly dependent on the input. In this paper, we propose a model named VOLTA that enhances generative diversity by leveraging the Variational Autoencoder framework with a shared backbone network as its encoder and decoder. In addition, we propose adding InfoGAN-style latent codes to enable input-independent controllability over the generation process. We perform comprehensive experiments and the results show that our approach can significantly improve diversity and controllability over state-of-the-art models.
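
The diversity in a Variational Autoencoder comes from sampling a latent vector rather than encoding deterministically. As a minimal sketch of the two standard ingredients, assuming a diagonal Gaussian posterior and a standard normal prior (this is the generic VAE formulation, not VOLTA's full objective, which also involves InfoGAN-style latent codes):

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return float(-0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = np.array([1.0, 0.0])
    log_var = np.zeros(2)
    # Different samples of z would lead a decoder to different outputs.
    print(reparameterize(mu, log_var, rng))
    print(kl_to_standard_normal(mu, log_var))
```

Drawing several `z` values for the same input is what allows a decoder to produce varied question-answer pairs from one context.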

Large Language and Text-to-3D Models for Engineering Design Optimization

paper_url: http://arxiv.org/abs/2307.01230
repo_url: None
paper_authors: Thiago Rios, Stefan Menzel, Bernhard Sendhoff
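for: This paper studies the potential of deep text-to-3D models for computational simulation-based design optimization.
methods: The paper uses Shap-E in a fully automated evolutionary optimization framework and evaluates two ways of representing text prompts: bag-of-words and tokenisation.
results: The main findings are, first, that designs generated from prompts must remain within the object class of the application, and second, that further research is needed so that the strength of text prompt variations and the resulting 3D design variations are causally related to some degree, in order to improve the optimization.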

Abstract
The current advances in generative AI for learning large neural network models with the capability to produce essays, images, music and even 3D assets from text prompts create opportunities for a manifold of disciplines. In the present paper, we study the potential of deep text-to-3D models in the engineering domain, with focus on the chances and challenges when integrating and interacting with 3D assets in computational simulation-based design optimization. In contrast to traditional design optimization of 3D geometries that often searches for the optimum designs using numerical representations, such as B-Spline surface or deformation parameters in vehicle aerodynamic optimization, natural language challenges the optimization framework by requiring a different interpretation of variation operators while at the same time may ease and motivate the human user interaction. Here, we propose and realize a fully automated evolutionary design optimization framework using Shap-E, a recently published text-to-3D asset network by OpenAI, in the context of aerodynamic vehicle optimization. For representing text prompts in the evolutionary optimization, we evaluate (a) a bag-of-words approach based on prompt templates and Wordnet samples, and (b) a tokenisation approach based on prompt templates and the byte pair encoding method from GPT4. Our main findings from the optimizations indicate that, first, it is important to ensure that the designs generated from prompts are within the object class of application, i.e. diverse and novel designs need to be realistic, and, second, that more research is required to develop methods where the strength of text prompt variations and the resulting variations of the 3D designs share causal relations to some degree to improve the optimization.
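
The bag-of-words variant can be pictured as an evolutionary loop whose genome is a set of word choices that fill a prompt template. The following is a minimal sketch under strong assumptions: the template, the vocabulary, and the (1+1) selection scheme are hypothetical, and the fitness function is a toy stand-in for generating a shape with a text-to-3D model and running an aerodynamic simulation.

```python
import random

# Hypothetical prompt template; the paper's actual templates are not given here.
TEMPLATE = "A {adjective} car in the shape of a {noun}"


def mutate(genome, vocab, rng):
    """Variation operator: resample the word in one randomly chosen slot."""
    child = dict(genome)
    slot = rng.choice(sorted(child))
    child[slot] = rng.choice(vocab[slot])
    return child


def evolve(vocab, fitness, generations, seed=0):
    """(1+1) evolutionary search over prompts, minimising `fitness`."""
    rng = random.Random(seed)
    parent = {slot: rng.choice(words) for slot, words in vocab.items()}
    parent_fit = fitness(TEMPLATE.format(**parent))
    for _ in range(generations):
        child = mutate(parent, vocab, rng)
        child_fit = fitness(TEMPLATE.format(**child))
        if child_fit <= parent_fit:  # keep the child if it is no worse
            parent, parent_fit = child, child_fit
    return TEMPLATE.format(**parent), parent_fit


def toy_fitness(prompt):
    """Stand-in objective (prompt length); a real setup would return e.g. drag."""
    return len(prompt)


if __name__ == "__main__":
    vocab = {
        "adjective": ["fast", "streamlined", "boxy"],
        "noun": ["wedge", "teardrop", "brick"],
    }
    print(evolve(vocab, toy_fitness, generations=30))
```

The sketch also hints at the second finding in the abstract: swapping one word is a discrete jump in prompt space, and nothing guarantees that a small textual change yields a proportionally small change in the 3D design.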

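Summary
Current generative AI techniques for learning large neural network models that produce essays, images, music, and 3D assets from text prompts create opportunities for many disciplines. In this paper, we study the potential of deep text-to-3D models in the engineering domain, focusing on the challenges and opportunities in computational simulation-based design optimization. Unlike traditional design optimization of 3D shapes, this approach defines variation operators through natural-language text prompts, which may also ease and motivate human user interaction. We propose a fully automated evolutionary design optimization framework using Shap-E, a text-to-3D asset network recently published by OpenAI, and realize it in vehicle aerodynamic optimization. To represent text prompts in the evolutionary optimization, we evaluate (a) a bag-of-words approach based on prompt templates and Wordnet samples, and (b) a tokenisation approach based on prompt templates and the byte pair encoding method from GPT4. Our main findings indicate, first, that designs generated from prompts must stay within the object class of the application, that is, diverse and novel designs still need to be realistic, and second, that more research is needed to develop methods in which text prompt variations and the resulting 3D design variations are related, in order to improve the optimization.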

CollabKG: A Learnable Human-Machine-Cooperative Information Extraction Toolkit for (Event) Knowledge Graph Construction

paper_url: http://arxiv.org/abs/2307.00769
repo_url: None
paper_authors: Xiang Wei, Yufeng Chen, Ning Cheng, Xingyu Cui, Jinan Xu, Wenjuan Han
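for: This paper presents a learnable human-machine-cooperative information extraction toolkit for constructing or extending entity-centric and event-centric knowledge graphs (KG and EKG).
methods: The toolkit unifies multiple IE tasks, uses human-machine cooperation, and applies advanced prompting-based techniques to address non-trivial problems in existing IE toolkits, such as no multi-task support and no automatic updates.
results: Compared with existing toolkits, it offers advantages such as customization, training-free use, and propagation. Human evaluation shows that CollabKG significantly improves annotation quality, efficiency, and stability.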

Abstract
In order to construct or extend entity-centric and event-centric knowledge graphs (KG and EKG), the information extraction (IE) annotation toolkit is essential. However, existing IE toolkits have several non-trivial problems, such as not supporting multi-tasks, not supporting automatic updates. In this work, we present CollabKG, a learnable human-machine-cooperative IE toolkit for KG and EKG construction. Specifically, for the multi-task issue, CollabKG unifies different IE subtasks, including named entity recognition (NER), entity-relation triple extraction (RE), and event extraction (EE), and supports both KG and EKG. Then, combining advanced prompting-based IE technology, the human-machine-cooperation mechanism with LLMs as the assistant machine is presented which can provide a lower cost as well as a higher performance. Lastly, owing to the two-way interaction between the human and machine, CollabKG with learning ability allows self-renewal. Besides, CollabKG has several appealing features (e.g., customization, training-free, propagation, etc.) that make the system powerful, easy-to-use, and high-productivity. We holistically compare our toolkit with other existing tools on these features. Human evaluation quantitatively illustrates that CollabKG significantly improves annotation quality, efficiency, and stability simultaneously.

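Summary
An information extraction (IE) annotation toolkit is essential for constructing or extending entity-centric and event-centric knowledge graphs (KG and EKG). However, existing IE toolkits have several non-trivial problems, such as not supporting multiple tasks and not supporting automatic updates. In this work, we present CollabKG, a learnable human-machine-cooperative IE annotation toolkit for KG and EKG construction. Specifically, to address the multi-task issue, CollabKG unifies different IE subtasks, including named entity recognition (NER), entity-relation triple extraction (RE), and event extraction (EE), and supports both KG and EKG. Then, using advanced prompting-based techniques, we pair the human-machine-cooperation mechanism with LLMs as the assistant machine, which lowers cost and raises performance. Moreover, owing to the two-way interaction between human and machine, CollabKG has learning ability and can renew itself. CollabKG also has several appealing features (for example customization, training-free use, and propagation) that make the system powerful, easy to use, and highly productive. We compare it comprehensively with other existing tools and carry out a human evaluation that quantifies the improvements in annotation quality, efficiency, and stability.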

Multilingual Contextual Adapters To Improve Custom Word Recognition In Low-resource Languages

paper_url: http://arxiv.org/abs/2307.00759
repo_url: None
paper_authors: Devang Kulshreshtha, Saket Dingliwal, Brady Houston, Sravan Bodapati
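for: Improving recognition of custom words in automatic speech recognition (ASR) for low-resource languages.
methods: Uses Contextual Adapters, an attention-based biasing model that improves a CTC model's recognition of custom entities.
results: Achieves a 48% F1 improvement on a low-resource language, and also yields a 5-11% Word Error Rate reduction for the base CTC model.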

Abstract
Connectionist Temporal Classification (CTC) models are popular for their balance between speed and performance for Automatic Speech Recognition (ASR). However, these CTC models still struggle in other areas, such as personalization towards custom words. A recent approach explores Contextual Adapters, wherein an attention-based biasing model for CTC is used to improve the recognition of custom entities. While this approach works well with enough data, we showcase that it isn't an effective strategy for low-resource languages. In this work, we propose a supervision loss for smoother training of the Contextual Adapters. Further, we explore a multilingual strategy to improve performance with limited training data. Our method achieves 48% F1 improvement in retrieving unseen custom entities for a low-resource language. Interestingly, as a by-product of training the Contextual Adapters, we see a 5-11% Word Error Rate (WER) reduction in the performance of the base CTC model as well.
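
Word Error Rate, the metric behind the reported 5-11% reduction, is the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch using whitespace tokenisation; the example sentences are invented, and real evaluations usually normalise text first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # A misrecognised custom word (a name) counts as one substitution.
    print(word_error_rate("call doctor smith now", "call doctor smyth now"))
```

Because a custom entity is often a single rare word, it contributes little to overall WER even when it is always wrong, which is why entity retrieval is measured separately with F1 in the abstract.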

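Summary
Connectionist Temporal Classification (CTC) models offer a good balance between speed and performance in Automatic Speech Recognition (ASR), but they still struggle in other areas, such as personalization towards custom words. A recent approach uses Contextual Adapters, an attention-based biasing model for CTC, to improve the recognition of custom entities. Although this approach works well when enough data is available, we show that it is not an effective strategy for low-resource languages. In this work, we propose a supervision loss for smoother training of the Contextual Adapters. We also explore a multilingual strategy to improve performance with limited training data. Our method achieves a 48% F1 improvement in retrieving unseen custom entities. In addition, as a by-product of training the Contextual Adapters, we observe a 5-11% reduction in Word Error Rate (WER).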

An End-to-End Multi-Module Audio Deepfake Generation System for ADD Challenge 2023

paper_url: http://arxiv.org/abs/2307.00729
repo_url: None
paper_authors: Sheng Zhao, Qilong Yuan, Yibo Duan, Zhuoyue Chen
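for: This paper studies synthetic speech generation, specifically an end-to-end multi-module synthesizer model for generating natural-sounding speech.
methods: The system combines a speaker encoder, a Tacotron2-based synthesizer, and a WaveRNN-based vocoder. The authors also run many comparative experiments across different datasets and model structures.
results: The authors entered the model in Track 1.1 of the ADD 2023 challenge and took first place with a weighted deception success rate (WDSR) of 44.97%.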

Abstract
The task of synthetic speech generation is to generate language content from a given text and then simulate a fake human voice. The key factors that determine the effect of synthetic speech generation mainly include speed of generation, accuracy of word segmentation, naturalness of synthesized speech, etc. This paper builds an end-to-end multi-module synthetic speech generation model, including a speaker encoder, a synthesizer based on Tacotron2, and a vocoder based on WaveRNN. In addition, we perform many comparative experiments on different datasets and various model structures. Finally, we won first place in the ADD 2023 challenge Track 1.1 with a weighted deception success rate (WDSR) of 44.97%.

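Summary
The task of synthetic speech generation is to convert text into spoken content and then simulate a human voice. The key factors include generation speed, word segmentation accuracy, and the naturalness of the generated speech. This paper builds an end-to-end multi-module synthetic speech model comprising a speaker encoder, a synthesizer based on Tacotron2, and a vocoder based on WaveRNN. We also run many comparative experiments across different datasets and model structures. Finally, we achieved a weighted deception success rate (WDSR) of 44.97% in Track 1.1 of the ADD 2023 challenge.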

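Fraunhofer SIT at CheckThat! 2023: Mixing Single-Modal Classifiers to Estimate the Check-Worthiness of Multi-Modal Tweets

paper_url: http://arxiv.org/abs/2307.00610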
repo_url: None
paper_authors: Raphael Frick, Inna Vogel
for: This paper aims to improve the efficiency of fact-checking on social media by developing a novel approach to detecting the check-worthiness of multi-modal tweets.
methods: The proposed approach uses two classifiers, each trained on a single modality, and combines their outputs to determine the check-worthiness of a tweet. For image data, the embedded text is extracted with OCR analysis.
results: The proposed approach achieved an F1 score of 0.7297 on the private test set of the CheckThat! 2023 Task 1A, placing first among all submissions.

Abstract
The option of sharing images, videos and audio files on social media opens up new possibilities for distinguishing between false information and fake news on the Internet. Due to the vast amount of data shared every second on social media, not all data can be verified by a computer or a human expert. Here, a check-worthiness analysis can be used as a first step in the fact-checking pipeline and as a filtering mechanism to improve efficiency. This paper proposes a novel way of detecting the check-worthiness in multi-modal tweets. It takes advantage of two classifiers, each trained on a single modality. For image data, extracting the embedded text with an OCR analysis has shown to perform best. By combining the two classifiers, the proposed solution was able to place first in the CheckThat! 2023 Task 1A with an F1 score of 0.7297 achieved on the private test set.
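
Combining two single-modality classifiers is often done by late fusion of their output probabilities. The abstract does not say how the two outputs are mixed, so the weighted average, the weight `w_text`, and the decision threshold below are assumptions for illustration; the F1 helper shows how the reported metric is computed for binary labels.

```python
def fuse(p_text: float, p_image: float, w_text: float = 0.5, threshold: float = 0.5):
    """Late fusion: weighted average of two check-worthiness probabilities."""
    p = w_text * p_text + (1.0 - w_text) * p_image
    return p, p >= threshold


def f1_score(y_true, y_pred) -> float:
    """F1 for the positive class of binary labels (1 = check-worthy)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical scores: the tweet text looks check-worthy, the OCR text less so.
    print(fuse(p_text=0.9, p_image=0.3))
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

Keeping the classifiers separate means each can be trained and tuned on its own modality, with the fusion weight chosen afterwards on held-out data.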

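Summary
Sharing images, videos, and audio files on social media opens up new possibilities for distinguishing false information and fake news on the Internet. Because of the vast amount of data shared on social media every second, not all of it can be checked by a computer or a human expert. This paper proposes a new way of detecting check-worthiness that uses two classifiers, each trained on a single modality. For image data, extracting the embedded text with OCR analysis works best. By combining the two classifiers, the proposed solution achieved an F1 score of 0.7297 in CheckThat! 2023 Task 1A.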

2023-07-03

Analyzing Multiple-Choice Reading and Listening Comprehension Tests

Estimating Post-OCR Denoising Complexity on Numerical Texts

Visual Instruction Tuning with Polite Flamingo

Towards Suicide Prevention from Bipolar Disorder with Temporal Symptom-Aware Multitask Learning

Data-Driven Information Extraction and Enrichment of Molecular Profiling Data for Cancer Cell Lines

Fraunhofer SIT at CheckThat! 2023: Tackling Classification Uncertainty Using Model Souping on the Example of Check-Worthiness Classification

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

VOLTA: Diverse and Controllable Question-Answer Pair Generation with Variational Mutual Information Maximizing Autoencoder

Large Language and Text-to-3D Models for Engineering Design Optimization

CollabKG: A Learnable Human-Machine-Cooperative Information Extraction Toolkit for (Event) Knowledge Graph Construction

Multilingual Contextual Adapters To Improve Custom Word Recognition In Low-resource Languages

An End-to-End Multi-Module Audio Deepfake Generation System for ADD Challenge 2023

Fraunhofer SIT at CheckThat! 2023: Mixing Single-Modal Classifiers to Estimate the Check-Worthiness of Multi-Modal Tweets