cs.CL - 2023-08-31

TouchStone: Evaluating Vision-Language Models by Language Models

  • paper_url: http://arxiv.org/abs/2308.16890
  • repo_url: None
  • paper_authors: Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xingxuan Zhang, Junyang Lin, Xinggang Wang, Chang Zhou, Jingren Zhou
  • for: Evaluating the diverse abilities of large vision-language models (LVLMs), including perceiving, understanding, and processing visual information, as well as conversational skills and literary creation.
  • methods: Strong large language models (LLMs) serve as judges to comprehensively evaluate LVLMs on TouchStone, a visual dialogue dataset of open-world images and questions covering five major categories of abilities and 27 subtasks.
  • results: Validation shows that powerful LLMs such as GPT-4 can score multimodal dialogue quality using their textual capabilities alone, aligning with human preferences.
    Abstract Large vision-language models (LVLMs) have recently witnessed rapid advancements, exhibiting a remarkable capacity for perceiving, understanding, and processing visual information by connecting visual receptor with large language models (LLMs). However, current assessments mainly focus on recognizing and reasoning abilities, lacking direct evaluation of conversational skills and neglecting visual storytelling abilities. In this paper, we propose an evaluation method that uses strong LLMs as judges to comprehensively evaluate the various abilities of LVLMs. Firstly, we construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks. This dataset not only covers fundamental recognition and comprehension but also extends to literary creation. Secondly, by integrating detailed image annotations we effectively transform the multimodal input content into a form understandable by LLMs. This enables us to employ advanced LLMs for directly evaluating the quality of the multimodal dialogue without requiring human intervention. Through validation, we demonstrate that powerful LVLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone, aligning with human preferences. We hope our work can serve as a touchstone for LVLMs' evaluation and pave the way for building stronger LVLMs. The evaluation code is available at https://github.com/OFA-Sys/TouchStone.
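As a rough sketch of the judging protocol the abstract describes — detailed image annotations stand in for the image so a text-only LLM can grade the dialogue — the snippet below assumes a hypothetical `call_llm` wrapper and prompt wording; it is not the TouchStone evaluation code.

```python
# Minimal sketch of text-only LLM-as-judge scoring (hypothetical prompt and
# call_llm helper; not the TouchStone implementation).
def build_judge_prompt(image_annotation: str, question: str, answer: str) -> str:
    return (
        "You are evaluating a vision-language assistant. The image is described "
        f"by this human annotation:\n{image_annotation}\n\n"
        f"Question: {question}\nAssistant answer: {answer}\n\n"
        "Rate the answer's helpfulness and correctness from 1 to 10. "
        "Reply with the number only."
    )

def judge_answer(call_llm, image_annotation, question, answer) -> float:
    """call_llm: any text-completion function, e.g. a GPT-4 API wrapper."""
    reply = call_llm(build_judge_prompt(image_annotation, question, answer))
    return float(reply.strip().split()[0])  # parse the leading score

# Toy usage with a stub judge (a real setup would pass an actual LLM wrapper):
print(judge_answer(lambda prompt: "8", "A dog surfing on a blue board.",
                   "What is the animal doing?", "The dog is riding a surfboard."))
```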

Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation

  • paper_url: http://arxiv.org/abs/2308.16797
  • repo_url: https://github.com/johndmendonca/dialevalml
  • paper_authors: John Mendonça, Patrícia Pereira, João Paulo Carvalho, Alon Lavie, Isabel Trancoso
  • for: Developing a framework for automatic dialogue evaluation metrics that can assess multilingual dialogue systems.
  • methods: The framework combines the strengths of existing evaluation models with the newly established paradigm of prompting large language models (LLMs).
  • results: The framework achieves state-of-the-art mean Spearman correlation scores across several benchmarks and ranks first on both the Robust and Multilingual tasks of DSTC11 Track 4 "Automatic Evaluation Metrics for Open-Domain Dialogue Systems".
    Abstract Despite significant research effort in the development of automatic dialogue evaluation metrics, little thought is given to evaluating dialogues other than in English. At the same time, ensuring metrics are invariant to semantically similar responses is also an overlooked topic. In order to achieve the desired properties of robustness and multilinguality for dialogue evaluation metrics, we propose a novel framework that takes advantage of the strengths of current evaluation models with the newly-established paradigm of prompting Large Language Models (LLMs). Empirical results show our framework achieves state of the art results in terms of mean Spearman correlation scores across several benchmarks and ranks first place on both the Robust and Multilingual tasks of the DSTC11 Track 4 "Automatic Evaluation Metrics for Open-Domain Dialogue Systems", proving the evaluation capabilities of prompted LLMs.
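The headline number in these results is the mean Spearman correlation between metric scores and human annotations; a minimal illustration of that computation (toy data, not the DSTC11 scoring scripts) follows.

```python
import numpy as np
from scipy.stats import spearmanr

def mean_spearman(metric_scores_per_benchmark, human_scores_per_benchmark):
    """Average Spearman rank correlation across benchmarks.

    Each argument is a list of per-benchmark score lists; pairs must align.
    """
    rhos = []
    for metric, human in zip(metric_scores_per_benchmark, human_scores_per_benchmark):
        rho, _ = spearmanr(metric, human)   # rank correlation for one benchmark
        rhos.append(rho)
    return float(np.mean(rhos))

# toy example with two benchmarks
print(mean_spearman([[0.1, 0.7, 0.4], [0.9, 0.2, 0.5]], [[1, 5, 3], [4, 2, 3]]))
```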

Towards Multilingual Automatic Dialogue Evaluation

  • paper_url: http://arxiv.org/abs/2308.16795
  • repo_url: None
  • paper_authors: John Mendonça, Alon Lavie, Isabel Trancoso
  • for: Addressing the main limiting factors in developing robust multilingual dialogue evaluation metrics: the lack of multilingual data and the limited availability of open-sourced multilingual dialogue systems.
  • methods: The proposed workaround leverages a strong multilingual pretrained language model and augments existing English dialogue data via machine translation.
  • results: Naively fine-tuning on translated data is insufficient to outperform the strong baseline of fine-tuning a multilingual model on source data only; the best approach carefully curates the translated data with MT quality estimation metrics, excluding low-quality translations that hurt performance.
    Abstract The main limiting factor in the development of robust multilingual dialogue evaluation metrics is the lack of multilingual data and the limited availability of open sourced multilingual dialogue systems. In this work, we propose a workaround for this lack of data by leveraging a strong multilingual pretrained LLM and augmenting existing English dialogue data using Machine Translation. We empirically show that the naive approach of finetuning a pretrained multilingual encoder model with translated data is insufficient to outperform the strong baseline of finetuning a multilingual model with only source data. Instead, the best approach consists in the careful curation of translated data using MT Quality Estimation metrics, excluding low quality translations that hinder its performance.
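A minimal sketch of the curation step described above — keeping only machine-translated examples whose quality-estimation score clears a threshold — with `qe_score` standing in for any reference-free QE model and an illustrative cutoff:

```python
def curate_translations(examples, qe_score, threshold=0.8):
    """Filter machine-translated training examples by MT quality estimation.

    examples: iterable of dicts with 'src' (English) and 'mt' (translated) text.
    qe_score: reference-free QE function returning higher-is-better scores.
    threshold: illustrative cutoff; in practice it would be tuned on dev data.
    """
    kept = []
    for ex in examples:
        if qe_score(ex["src"], ex["mt"]) >= threshold:
            kept.append(ex)          # keep only translations judged good enough
    return kept
```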

Enhancing PLM Performance on Labour Market Tasks via Instruction-based Finetuning and Prompt-tuning with Rules

  • paper_url: http://arxiv.org/abs/2308.16770
  • repo_url: None
  • paper_authors: Jarno Vrolijk, David Graus
  • for: Investigating how pre-trained language models (PLMs) can be used to obtain better representations for labour-market-specific applications.
  • methods: Prompt-based tuning, including prompt tuning with rules (PTR) and instruction tuning without exemplars, applied to PLMs without data augmentation.
  • results: These cost-efficient methods significantly improve PLM performance on downstream labour market applications without introducing additional model layers, manual annotations, or data augmentation.
    Abstract The increased digitization of the labour market has given researchers, educators, and companies the means to analyze and better understand the labour market. However, labour market resources, although available in high volumes, tend to be unstructured, and as such, research towards methodologies for the identification, linking, and extraction of entities becomes more and more important. Against the backdrop of this quest for better labour market representations, resource constraints and the unavailability of large-scale annotated data cause a reliance on human domain experts. We demonstrate the effectiveness of prompt-based tuning of pre-trained language models (PLM) in labour market specific applications. Our results indicate that cost-efficient methods such as PTR and instruction tuning without exemplars can significantly increase the performance of PLMs on downstream labour market applications without introducing additional model layers, manual annotations, and data augmentation.
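To make the prompt-based setup concrete, here is a hedged sketch of a PTR-style template and verbalizer for a hypothetical labour-market sentence classification task; the task, template, and label words are illustrative and not taken from the paper.

```python
# Hypothetical PTR-style template for a labour-market sentence classification
# task (e.g. deciding whether a vacancy sentence mentions a skill requirement).
TEMPLATE = 'In a job posting, the sentence "{sentence}" talks about [MASK].'
VERBALIZER = {"skill": "skills", "other": "something else"}  # label -> label word

def build_prompt(sentence: str) -> str:
    return TEMPLATE.format(sentence=sentence)

# A masked-LM PLM is then fine-tuned so that the probability of each label word
# at the [MASK] position ("skills" vs. "something else") gives the prediction,
# without adding new classification layers on top of the PLM.
print(build_prompt("Experience with Python and SQL is required."))
```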

SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models

  • paper_url: http://arxiv.org/abs/2308.16692
  • repo_url: None
  • paper_authors: Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu
  • for: This paper aims to evaluate the suitability of existing speech tokens for speech language modeling and to propose a unified speech tokenizer for speech large language models.
  • methods: The paper proposed a unified speech tokenizer called SpeechTokenizer, which adopts the Encoder-Decoder architecture with residual vector quantization (RVQ).
  • results: The SpeechTokenizer performed comparably to EnCodec in speech reconstruction and demonstrated strong performance on the SLMTokBench benchmark. Additionally, the Unified Speech Language Model (USLM) outperformed VALL-E in zero-shot Text-to-Speech tasks.
    Abstract Current speech large language models build upon discrete speech representations, which can be categorized into semantic tokens and acoustic tokens. However, existing speech tokens are not specifically designed for speech language modeling. To assess the suitability of speech tokens for building speech language models, we established the first benchmark, SLMTokBench. Our results indicate that neither semantic nor acoustic tokens are ideal for this purpose. Therefore, we propose SpeechTokenizer, a unified speech tokenizer for speech large language models. SpeechTokenizer adopts the Encoder-Decoder architecture with residual vector quantization (RVQ). Unifying semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of speech information hierarchically across different RVQ layers. Furthermore, We construct a Unified Speech Language Model (USLM) leveraging SpeechTokenizer. Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark. Also, USLM outperforms VALL-E in zero-shot Text-to-Speech tasks. Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/.
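The residual vector quantization at the core of SpeechTokenizer can be illustrated in a few lines of NumPy: each layer quantizes what the previous layers failed to reconstruct, which is what lets different layers carry different aspects of the speech signal. The code below is a toy of the RVQ mechanism only, not the trained encoder-decoder.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization of a single frame vector x.

    codebooks: list of (K, D) arrays, one per quantizer layer.
    Returns the chosen code index per layer and the cumulative reconstruction.
    """
    recon = np.zeros_like(x)
    residual = x.copy()
    indices = []
    for cb in codebooks:
        # pick the codeword closest to the current residual
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        recon += cb[idx]
        residual = x - recon          # the next layer quantizes what is still missing
    return indices, recon

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]   # 4 RVQ layers, toy sizes
indices, recon = rvq_encode(rng.normal(size=8), codebooks)
print(indices)
```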

DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew

  • paper_url: http://arxiv.org/abs/2308.16687
  • repo_url: None
  • paper_authors: Shaltiel Shmidman, Avi Shmidman, Moshe Koppel
  • for: Presenting DictaBERT, a new state-of-the-art BERT model for Modern Hebrew, together with two fine-tuned versions for two specific foundational tasks: prefix segmentation and morphological tagging.
  • methods: A BERT model pre-trained for Hebrew, on top of which task-specific fine-tuned variants are built.
  • results: The paper reports the models' performance on various benchmarks and releases the models to support further research and development in Hebrew NLP.
    Abstract We present DictaBERT, a new state-of-the-art pre-trained BERT model for modern Hebrew, outperforming existing models on most benchmarks. Additionally, we release two fine-tuned versions of the model, designed to perform two specific foundational tasks in the analysis of Hebrew texts: prefix segmentation and morphological tagging. These fine-tuned models allow any developer to perform prefix segmentation and morphological tagging of a Hebrew sentence with a single call to a HuggingFace model, without the need to integrate any additional libraries or code. In this paper we describe the details of the training as well and the results on the different benchmarks. We release the models to the community, along with sample code demonstrating their use. We release these models as part of our goal to help further research and development in Hebrew NLP.
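A hedged example of the "single call to a HuggingFace model" usage the abstract mentions; the checkpoint identifier below is an assumption, and the fine-tuned segmentation and tagging variants would be invoked analogously per the released sample code.

```python
# Illustrative use of a released checkpoint via HuggingFace transformers.
# The model identifier is assumed; check the authors' release page for the
# exact names of the base and fine-tuned (segmentation / morphology) models.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="dicta-il/dictabert")  # assumed model id
print(fill_mask("שמח [MASK] להכיר אותך"))  # top completions for the masked token
```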

Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis

  • paper_url: http://arxiv.org/abs/2308.16593
  • repo_url: None
  • paper_authors: Weiqin Li, Shun Lei, Qiaochu Huang, Yixuan Zhou, Zhiyong Wu, Shiyin Kang, Helen Meng
  • for: Increasing the amount of spontaneous-style speech data and spontaneous-behavior labels, and improving the expressiveness of conversational speech synthesis.
  • methods: A semi-supervised pre-training method that considers both text and speech information to detect spontaneous-behavior labels in conversational speech.
  • results: Experimental results show that the proposed method achieves high-quality expressive conversational speech synthesis, modeling spontaneous behavior in spontaneous-style speech and predicting reasonable spontaneous behavior from text.
    Abstract The spontaneous behavior that often occurs in conversations makes speech more human-like compared to reading-style. However, synthesizing spontaneous-style speech is challenging due to the lack of high-quality spontaneous datasets and the high cost of labeling spontaneous behavior. In this paper, we propose a semi-supervised pre-training method to increase the amount of spontaneous-style speech and spontaneous behavioral labels. In the process of semi-supervised learning, both text and speech information are considered for detecting spontaneous behaviors labels in speech. Moreover, a linguistic-aware encoder is used to model the relationship between each sentence in the conversation. Experimental results indicate that our proposed method achieves superior expressive speech synthesis performance with the ability to model spontaneous behavior in spontaneous-style speech and predict reasonable spontaneous behavior from text.

Interpreting Sentiment Composition with Latent Semantic Tree

  • paper_url: http://arxiv.org/abs/2308.16588
  • repo_url: https://github.com/changmenseng/semantic_tree
  • paper_authors: Zhongtao Jiang, Yuanzhe Zhang, Cao Liu, Jiansong Chen, Jun Zhao, Kang Liu
  • for: Proposing a new approach to sentiment composition that addresses the limitations of conventional hierarchical trees, which are viewed as intrinsically suboptimal and hard to interpret.
  • methods: A semantic tree, a new tree form derived from a context-free grammar (CFG) describing composition rules over different semantic roles, is used to interpret sentiment composition in a principled way; since it is a latent variable, it is marginalized out via the inside algorithm and learned to optimize classification performance.
  • results: The method achieves better or competitive results on regular and domain-adaptation classification tasks while also generating plausible tree explanations.
    Abstract As the key to sentiment analysis, sentiment composition considers the classification of a constituent via classifications of its contained sub-constituents and rules operated on them. Such compositionality has been widely studied previously in the form of hierarchical trees including untagged and sentiment ones, which are intrinsically suboptimal in our view. To address this, we propose semantic tree, a new tree form capable of interpreting the sentiment composition in a principled way. Semantic tree is a derivation of a context-free grammar (CFG) describing the specific composition rules on difference semantic roles, which is designed carefully following previous linguistic conclusions. However, semantic tree is a latent variable since there is no its annotation in regular datasets. Thus, in our method, it is marginalized out via inside algorithm and learned to optimize the classification performance. Quantitative and qualitative results demonstrate that our method not only achieves better or competitive results compared to baselines in the setting of regular and domain adaptation classification, and also generates plausible tree explanations.
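Since the semantic tree is latent, training sums over all trees with the inside algorithm; the sketch below shows that marginalization for plain binary bracketings with a generic span scorer (the paper's grammar additionally distinguishes semantic roles, which this toy omits).

```python
import numpy as np
from scipy.special import logsumexp

def inside_log_partition(span_log_score, n):
    """Log-partition over all binary trees of a length-n sentence (CKY inside).

    span_log_score(i, j): log score of a constituent covering tokens i..j-1.
    The returned value marginalizes over every bracketing, which is how a
    latent tree can be summed out during training.
    """
    inside = np.full((n + 1, n + 1), -np.inf)
    for i in range(n):
        inside[i][i + 1] = span_log_score(i, i + 1)          # single-token spans
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            splits = [inside[i][k] + inside[k][j] for k in range(i + 1, j)]
            inside[i][j] = span_log_score(i, j) + logsumexp(splits)
    return inside[0][n]

# toy example: uniform span scores over a 4-token sentence -> log(Catalan(3)) = log 5
print(inside_log_partition(lambda i, j: 0.0, 4))
```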

Unsupervised Text Style Transfer with Deep Generative Models

  • paper_url: http://arxiv.org/abs/2308.16584
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Zhongtao Jiang, Yuanzhe Zhang, Yiming Ju, Kang Liu
  • for: A general framework for unsupervised text style transfer with deep generative models.
  • methods: Each sentence-label pair in the non-parallel corpus is modeled as partially observed from a complete quadruplet that additionally contains two latent codes representing content and style; these codes are learned by exploiting dependencies inside the observed data, and a sentence is transferred by manipulating them.
  • results: Experiments on three benchmarks show better or competitive results compared with several strong baselines under both automatic and human evaluation.
    Abstract We present a general framework for unsupervised text style transfer with deep generative models. The framework models each sentence-label pair in the non-parallel corpus as partially observed from a complete quadruplet which additionally contains two latent codes representing the content and style, respectively. These codes are learned by exploiting dependencies inside the observed data. Then a sentence is transferred by manipulating them. Our framework is able to unify previous embedding and prototype methods as two special forms. It also provides a principled perspective to explain previously proposed techniques in the field such as aligned encoder and adversarial training. We further conduct experiments on three benchmarks. Both automatic and human evaluation results show that our methods achieve better or competitive results compared to several strong baselines.
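A toy sketch of the content/style-code idea — encode a content code from the sentence, pick a style code for the target label, and decode from both — using placeholder dimensions and a deterministic encoder rather than the paper's latent-variable generative model.

```python
import torch
import torch.nn as nn

class StyleTransferSketch(nn.Module):
    """Toy illustration of transfer via separate content and style codes.

    Dimensions and architecture are placeholders, not the paper's model.
    """
    def __init__(self, vocab=1000, dim=64, num_styles=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.content_enc = nn.GRU(dim, dim, batch_first=True)
        self.style_emb = nn.Embedding(num_styles, dim)      # one code per style label
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens, target_style):
        emb = self.embed(tokens)
        _, content = self.content_enc(emb)                  # (1, B, dim) content code
        style = self.style_emb(target_style).unsqueeze(0)   # (1, B, dim) style code
        h0 = content + style                                # condition decoder on both
        dec_out, _ = self.decoder(emb, h0)                  # toy teacher-forced decode
        return self.out(dec_out)                            # per-token vocab logits

model = StyleTransferSketch()
tokens = torch.randint(0, 1000, (2, 7))          # batch of 2 toy sentences
logits = model(tokens, torch.tensor([1, 0]))     # decode each with a swapped style
print(logits.shape)                              # (2, 7, 1000)
```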

Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information

  • paper_url: http://arxiv.org/abs/2308.16577
  • repo_url: None
  • paper_authors: Jie Chen, Changhe Song, Deyi Tuo, Xixin Wu, Shiyin Kang, Zhiyong Wu, Helen Meng
  • for: Improving the naturalness and intelligibility of text-to-speech synthesis through better prosodic structure prediction.
  • methods: Multi-level contextual information, covering both inter-utterance and intra-utterance linguistic information, is extracted by a hierarchical encoder, and a multi-task learning (MTL) decoder predicts prosodic boundaries from it.
  • results: Objective evaluation on two datasets shows higher F1 scores for prosodic word, prosodic phrase, and intonational phrase prediction, and subjective preference tests indicate improved naturalness of the synthesized speech.
    Abstract For text-to-speech (TTS) synthesis, prosodic structure prediction (PSP) plays an important role in producing natural and intelligible speech. Although inter-utterance linguistic information can influence the speech interpretation of the target utterance, previous works on PSP mainly focus on utilizing intrautterance linguistic information of the current utterance only. This work proposes to use inter-utterance linguistic information to improve the performance of PSP. Multi-level contextual information, which includes both inter-utterance and intrautterance linguistic information, is extracted by a hierarchical encoder from character level, utterance level and discourse level of the input text. Then a multi-task learning (MTL) decoder predicts prosodic boundaries from multi-level contextual information. Objective evaluation results on two datasets show that our method achieves better F1 scores in predicting prosodic word (PW), prosodic phrase (PPH) and intonational phrase (IPH). It demonstrates the effectiveness of using multi-level contextual information for PSP. Subjective preference tests also indicate the naturalness of synthesized speeches are improved.
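A minimal sketch of the multi-task layout: a shared encoder over (already context-enriched) character features feeds three boundary-prediction heads, one each for prosodic word, prosodic phrase, and intonational phrase. Sizes and the encoder itself are placeholders, not the paper's hierarchical architecture.

```python
import torch
import torch.nn as nn

class ProsodyMTLSketch(nn.Module):
    """Toy multi-task head layout for prosodic boundary prediction."""
    def __init__(self, in_dim=128, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
        self.heads = nn.ModuleDict({
            name: nn.Linear(2 * hidden, 2) for name in ("PW", "PPH", "IPH")
        })

    def forward(self, feats):
        enc, _ = self.encoder(feats)                         # shared representation
        return {name: head(enc) for name, head in self.heads.items()}

model = ProsodyMTLSketch()
out = model(torch.randn(4, 20, 128))                         # batch of 4, 20 chars each
loss = sum(nn.functional.cross_entropy(
    logits.reshape(-1, 2), torch.randint(0, 2, (4 * 20,)))
    for logits in out.values())                              # MTL: sum the three losses
print({k: v.shape for k, v in out.items()}, loss.item())
```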

Thesis Distillation: Investigating The Impact of Bias in NLP Models on Hate Speech Detection

  • paper_url: http://arxiv.org/abs/2308.16549
  • repo_url: None
  • paper_authors: Fatma Elsafoury
  • for: Investigating the impact of bias in NLP models on hate speech detection from three perspectives: explainability, offensive stereotyping bias, and fairness.
  • methods: The impact of bias is examined through the explainability, offensive stereotyping bias, and fairness of NLP models.
  • results: Bias in NLP models affects the hate speech detection task from all three perspectives, and unless the social sciences are incorporated into the study of bias in NLP models, the current limitations in measuring and mitigating that bias cannot be effectively overcome.
    Abstract This paper is a summary of the work in my PhD thesis. In which, I investigate the impact of bias in NLP models on the task of hate speech detection from three perspectives: explainability, offensive stereotyping bias, and fairness. I discuss the main takeaways from my thesis and how they can benefit the broader NLP community. Finally, I discuss important future research directions. The findings of my thesis suggest that bias in NLP models impacts the task of hate speech detection from all three perspectives. And that unless we start incorporating social sciences in studying bias in NLP models, we will not effectively overcome the current limitations of measuring and mitigating bias in NLP models.

Time-Varying Quasi-Closed-Phase Analysis for Accurate Formant Tracking in Speech Signals

  • paper_url: http://arxiv.org/abs/2308.16540
  • repo_url: None
  • paper_authors: Dhananjaya Gowda, Sudarsana Reddy Kadiri, Brad Story, Paavo Alku
  • for: A new method for accurate estimation and tracking of formants in speech signals using time-varying quasi-closed-phase (TVQCP) analysis.
  • methods: The method combines three approaches to improve formant estimation and tracking: (1) temporally weighted quasi-closed-phase analysis to derive closed-phase estimates of the vocal tract with reduced interference from the excitation source, (2) increased residual sparsity through $L_1$ optimization, and (3) time-varying linear prediction analysis over long time windows to impose a continuity constraint on the vocal tract model and hence on the formant trajectories.
  • results: Experiments on a wide variety of synthetic and natural speech signals show that the proposed TVQCP method outperforms conventional and popular formant tracking tools such as Wavesurfer and Praat (based on dynamic programming), the KARMA algorithm (based on Kalman filtering), and DeepFormants (based on supervised deep neural networks).
    Abstract In this paper, we propose a new method for the accurate estimation and tracking of formants in speech signals using time-varying quasi-closed-phase (TVQCP) analysis. Conventional formant tracking methods typically adopt a two-stage estimate-and-track strategy wherein an initial set of formant candidates are estimated using short-time analysis (e.g., 10--50 ms), followed by a tracking stage based on dynamic programming or a linear state-space model. One of the main disadvantages of these approaches is that the tracking stage, however good it may be, cannot improve upon the formant estimation accuracy of the first stage. The proposed TVQCP method provides a single-stage formant tracking that combines the estimation and tracking stages into one. TVQCP analysis combines three approaches to improve formant estimation and tracking: (1) it uses temporally weighted quasi-closed-phase analysis to derive closed-phase estimates of the vocal tract with reduced interference from the excitation source, (2) it increases the residual sparsity by using the $L_1$ optimization and (3) it uses time-varying linear prediction analysis over long time windows (e.g., 100--200 ms) to impose a continuity constraint on the vocal tract model and hence on the formant trajectories. Formant tracking experiments with a wide variety of synthetic and natural speech signals show that the proposed TVQCP method performs better than conventional and popular formant tracking tools, such as Wavesurfer and Praat (based on dynamic programming), the KARMA algorithm (based on Kalman filtering), and DeepFormants (based on deep neural networks trained in a supervised manner). Matlab scripts for the proposed method can be found at: https://github.com/njaygowda/ftrack
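The quasi-closed-phase idea rests on temporally weighted linear prediction; the sketch below solves a weighted LP problem with ordinary least squares, leaving out the time-varying coefficients and the $L_1$ residual term that TVQCP adds.

```python
import numpy as np

def weighted_lp(signal, order, weights):
    """Weighted linear prediction: minimize sum_n w[n] * (s[n] - sum_k a_k s[n-k])^2.

    Quasi-closed-phase analysis uses weights that de-emphasize samples near
    glottal excitation instants; the paper's TVQCP further makes the coefficients
    time-varying and promotes sparse residuals, which this toy omits.
    Formants would then be read off the roots of A(z) = 1 - sum_k a_k z^{-k}.
    """
    n = len(signal)
    rows, targets = [], []
    for t in range(order, n):
        rows.append(signal[t - order:t][::-1])        # s[t-1], ..., s[t-order]
        targets.append(signal[t])
    X = np.asarray(rows)
    y = np.asarray(targets)
    w = np.sqrt(np.asarray(weights)[order:n])          # fold weights into least squares
    a, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return a                                           # predictor coefficients a_1..a_p

# toy usage on a synthetic damped sinusoid, with uniform weights
t = np.arange(400)
s = np.exp(-t / 200.0) * np.sin(2 * np.pi * 0.1 * t)
print(weighted_lp(s, order=10, weights=np.ones_like(s)))
```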

The Smart Data Extractor, a Clinician Friendly Solution to Accelerate and Improve the Data Collection During Clinical Trials

  • paper_url: http://arxiv.org/abs/2308.16537
  • repo_url: None
  • paper_authors: Sophie Quennelle, Maxime Douillet, Lisa Friedlander, Olivia Boyer, Anita Burgun, Antoine Neuraz, Nicolas Garcelon
  • for: Improving the efficiency and quality of data collection during clinical trials while reducing human labor and errors.
  • methods: A semi-automated data collection system, the Smart Data Extractor, that can extract every type of data, including clinical notes, and pre-populates research forms by following rules.
  • results: In a cross-testing comparison of manual versus semi-automated data collection, the semi-automated approach averaged 3'22'' per form, faster than manual entry, and produced fewer errors (46 versus 163 for the whole cohort), providing an easy-to-use, understandable, and agile solution for filling out clinical research forms that improves data quality while reducing human effort.
    Abstract In medical research, the traditional way to collect data, i.e. browsing patient files, has been proven to induce bias, errors, human labor and costs. We propose a semi-automated system able to extract every type of data, including notes. The Smart Data Extractor pre-populates clinic research forms by following rules. We performed a cross-testing experiment to compare semi-automated to manual data collection. 20 target items had to be collected for 79 patients. The average time to complete one form was 6'81'' for manual data collection and 3'22'' with the Smart Data Extractor. There were also more mistakes during manual data collection (163 for the whole cohort) than with the Smart Data Extractor (46 for the whole cohort). We present an easy to use, understandable and agile solution to fill out clinical research forms. It reduces human effort and provides higher quality data, avoiding data re-entry and fatigue induced errors.

  • paper_url: http://arxiv.org/abs/2308.16469
  • repo_url: None
  • paper_authors: Chau-Thang Phan, Quoc-Nam Nguyen, Kiet Van Nguyen
  • for: A system for link prediction, a task vital to automatically understanding the structure of large knowledge bases, submitted to the Data Science and Advanced Analytics 2023 Competition "Efficient and Effective Link Prediction" (DSAA-2023 Competition).
  • methods: Link prediction is cast as a natural language inference (NLI) task, building on recent advances in natural language processing and understanding: the presence of a link between two Wikipedia articles is treated as a premise, and the model determines whether the premise holds based on the information in the articles, implemented as sentence pair classification.
  • results: The implementation achieves a 0.99996 Macro F1-score on the public test set and a 1.00000 Macro F1-score on the private test set, ranking third on the private test set with a score equal to the first and second places.
    Abstract Link prediction task is vital to automatically understanding the structure of large knowledge bases. In this paper, we present our system to solve this task at the Data Science and Advanced Analytics 2023 Competition "Efficient and Effective Link Prediction" (DSAA-2023 Competition) with a corpus containing 948,233 training and 238,265 for public testing. This paper introduces an approach to link prediction in Wikipedia articles by formulating it as a natural language inference (NLI) task. Drawing inspiration from recent advancements in natural language processing and understanding, we cast link prediction as an NLI task, wherein the presence of a link between two articles is treated as a premise, and the task is to determine whether this premise holds based on the information presented in the articles. We implemented our system based on the Sentence Pair Classification for Link Prediction for the Wikipedia Articles task. Our system achieved 0.99996 Macro F1-score and 1.00000 Macro F1-score for the public and private test sets, respectively. Our team UIT-NLP ranked 3rd in performance on the private test set, equal to the scores of the first and second places. Our code is publicly for research purposes.
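A hedged sketch of the sentence-pair classification formulation — article A and article B as the two segments of one input, with the positive class meaning a link exists — using a generic BERT checkpoint as a placeholder for the competition's fine-tuned model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Generic sentence-pair classifier for "does a link between A and B exist?".
# The checkpoint name is a placeholder; the actual system fine-tunes its own model.
MODEL = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

def predict_link(article_a: str, article_b: str) -> float:
    """Probability (from the untuned placeholder head) that a link exists."""
    enc = tokenizer(article_a, article_b, truncation=True,
                    max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(predict_link("Python is a programming language ...",
                   "Guido van Rossum is a Dutch programmer ..."))
```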

Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models

  • paper_url: http://arxiv.org/abs/2308.16463
  • repo_url: https://github.com/HYPJUDY/Sparkles
  • paper_authors: Yupan Huang, Zaiqiao Meng, Fangyu Liu, Yixuan Su, Nigel Collier, Yutong Lu
  • for: SparklesChat is designed to handle open-ended dialogues across multiple images, addressing the challenge of maintaining dialogue coherence in multimodal instruction-following tasks.
  • methods: SparklesChat uses a multimodal instruction-following model that integrates text and images, and is trained on the newly introduced SparklesDialogue dataset. The model is evaluated using the SparklesEval benchmark, which assesses conversational competence across multiple images and dialogue turns.
  • results: SparklesChat outperformed MiniGPT-4 on established vision-and-language benchmarks and scored 8.56 out of 10 on SparklesEval, demonstrating its effectiveness in understanding and reasoning across multiple images and dialogue turns. Qualitative evaluations also showed the model's generality in handling real-world applications.
    Abstract Large language models exhibit enhanced zero-shot performance on various tasks when fine-tuned with instruction-following data. Multimodal instruction-following models extend these capabilities by integrating both text and images. However, existing models such as MiniGPT-4 face challenges in maintaining dialogue coherence in scenarios involving multiple images. A primary reason is the lack of a specialized dataset for this critical application. To bridge these gaps, we present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images. To support the training, we introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions. Furthermore, we construct SparklesEval, a GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns. Our experiments validate the effectiveness of SparklesChat in understanding and reasoning across multiple images and dialogue turns. Specifically, SparklesChat outperformed MiniGPT-4 on established vision-and-language benchmarks, including the BISON binary image selection task and the NLVR2 visual reasoning task. Moreover, SparklesChat scored 8.56 out of 10 on SparklesEval, substantially exceeding MiniGPT-4's score of 3.91 and nearing GPT-4's score of 9.26. Qualitative evaluations further demonstrate SparklesChat's generality in handling real-world applications. All resources will be available at https://github.com/HYPJUDY/Sparkles.

Knowledge Distillation from Non-streaming to Streaming ASR Encoder using Auxiliary Non-streaming Layer

  • paper_url: http://arxiv.org/abs/2308.16415
  • repo_url: None
  • paper_authors: Kyuhong Shim, Jinkyu Lee, Simyung Chang, Kyuwoong Hwang
  • for: Improving the performance of streaming automatic speech recognition (ASR) models.
  • methods: Layer-to-layer knowledge distillation from a non-streaming teacher encoder to the streaming student encoder, using auxiliary non-streaming branches in the student.
  • results: The proposed method significantly reduces the word error rate compared with previous token probability distillation approaches.
    Abstract Streaming automatic speech recognition (ASR) models are restricted from accessing future context, which results in worse performance compared to the non-streaming models. To improve the performance of streaming ASR, knowledge distillation (KD) from the non-streaming to streaming model has been studied, mainly focusing on aligning the output token probabilities. In this paper, we propose a layer-to-layer KD from the teacher encoder to the student encoder. To ensure that features are extracted using the same context, we insert auxiliary non-streaming branches to the student and perform KD from the non-streaming teacher layer to the non-streaming auxiliary layer. We design a special KD loss that leverages the autoregressive predictive coding (APC) mechanism to encourage the streaming model to predict unseen future contexts. Experimental results show that the proposed method can significantly reduce the word error rate compared to previous token probability distillation methods.
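A skeleton of the layer-to-layer distillation term — matching each non-streaming teacher layer against the student's auxiliary non-streaming branch at the same depth — with an L2 distance standing in for the paper's exact loss and its APC-based future-prediction component.

```python
import torch
import torch.nn.functional as F

def layer_kd_loss(teacher_feats, student_aux_feats):
    """Sum of L2 distances between matched teacher / student-auxiliary layers.

    teacher_feats, student_aux_feats: lists of (batch, frames, dim) tensors,
    one per distilled layer pair.  The APC-based loss used in the paper is not
    reproduced here; this is only the layer-to-layer skeleton.
    """
    return sum(F.mse_loss(s, t.detach())              # stop-grad on the teacher
               for s, t in zip(student_aux_feats, teacher_feats))

# toy usage: 3 matched layers, batch 2, 50 frames, 256-dim features
teacher = [torch.randn(2, 50, 256) for _ in range(3)]
student_aux = [torch.randn(2, 50, 256, requires_grad=True) for _ in range(3)]
print(layer_kd_loss(teacher, student_aux).item())
```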