paper_authors: George Boateng, Jonathan Abrefah Mensah, Kevin Takyi Yeboah, William Edor, Andrew Kojo Mensah-Onumah, Naafi Dasana Ibrahim, Nana Sam Yeboah
for: The paper is written to explore the possibility of using AI to compete in Ghana’s National Science and Maths Quiz (NSMQ) and to describe the progress made so far in the NSMQ AI project.
methods: The paper uses open-source AI technology to build an AI system that can compete in the NSMQ, with a focus on speech-to-text, text-to-speech, question-answering, and human-computer interaction.
results: The paper describes the progress made thus far in the NSMQ AI project, including the development of an AI system that can compete in the NSMQ and the potential real-world impact of such a system on education in Africa.
Abstract
Can an AI win Ghana's National Science and Maths Quiz (NSMQ)? That is the question we seek to answer in the NSMQ AI project, an open-source project that is building AI to compete live in the NSMQ and win. The NSMQ is an annual live science and mathematics competition for senior secondary school students in Ghana in which 3 teams of 2 students compete by answering questions across biology, chemistry, physics, and math in 5 rounds over 5 progressive stages until a winning team is crowned for that year. The NSMQ is an exciting live quiz competition with interesting technical challenges across speech-to-text, text-to-speech, question-answering, and human-computer interaction. In this ongoing work that began in January 2023, we give an overview of the project, describe each of the teams, progress made thus far, and the next steps toward our planned launch and debut of the AI in October for NSMQ 2023. An AI that conquers this grand challenge can have real-world impact on education such as enabling millions of students across Africa to have one-on-one learning support from this AI.
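To make the pipeline implied by the abstract concrete, here is a minimal, hypothetical sketch of how speech-to-text, question-answering, and text-to-speech components could be wired into a single quiz-answering loop. The interfaces and names (`transcribe`, `answer`, `speak`) are illustrative assumptions, not the project's actual code.

```python
# Hypothetical end-to-end loop for a quiz-answering agent:
# speech-to-text -> question answering -> text-to-speech.
# Component names and interfaces are illustrative, not the project's real API.
from dataclasses import dataclass
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class QuestionAnswering(Protocol):
    def answer(self, question: str, subject: str) -> str: ...


class TextToSpeech(Protocol):
    def speak(self, text: str) -> bytes: ...


@dataclass
class QuizAgent:
    stt: SpeechToText
    qa: QuestionAnswering
    tts: TextToSpeech

    def handle_question(self, audio: bytes, subject: str) -> bytes:
        """Transcribe the spoken question, answer it, and synthesize the spoken answer."""
        question = self.stt.transcribe(audio)
        answer = self.qa.answer(question, subject)
        return self.tts.speak(answer)
```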
Auditory Attention Decoding with Task-Related Multi-View Contrastive Learning
results: Experiments on two popular AAD datasets demonstrate the superiority of the proposed method in comparison with the existing state-of-the-art method.
Abstract
The human brain can easily focus on one speaker and suppress others in scenarios such as a cocktail party. Recently, researchers found that auditory attention can be decoded from electroencephalogram (EEG) data. However, most existing deep learning methods struggle to use prior knowledge of the different views (that is, attended speech and EEG are task-related views) and extract unsatisfactory representations. Inspired by Broadbent's filter model, we decode auditory attention in a multi-view paradigm and extract the most relevant and important information by utilizing the missing view. Specifically, we propose an auditory attention decoding (AAD) method based on a multi-view VAE with task-related multi-view contrastive (TMC) learning. Employing TMC learning in the multi-view VAE uses the missing view to accumulate prior knowledge of the different views into the fused representation and to extract an approximately task-related representation. We evaluate our method on two popular AAD datasets and demonstrate its superiority by comparing it to the state-of-the-art method.
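The abstract does not spell out the training objective, so the following PyTorch sketch only illustrates the general recipe it describes: per-view Gaussian encoders (as in a multi-view VAE, with the decoders and reconstruction term omitted for brevity) plus an InfoNCE-style contrastive term that pulls matched EEG and attended-speech latents together. All layer sizes, the latent dimension, and the temperature are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ViewEncoder(nn.Module):
    """Encodes one view (e.g., EEG or attended-speech features) into Gaussian latent parameters."""
    def __init__(self, in_dim: int, latent_dim: int = 32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)

    def forward(self, x):
        h = self.backbone(x)
        return self.mu(h), self.logvar(h)


def reparameterize(mu, logvar):
    # Sample z ~ N(mu, sigma^2) with the reparameterization trick.
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)


def kl_term(mu, logvar):
    # KL divergence between the approximate posterior and a standard normal prior.
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())


def contrastive_term(z_eeg, z_speech, temperature: float = 0.1):
    # InfoNCE-style loss: matched EEG / attended-speech pairs in the batch are
    # positives, all other pairings are negatives.
    z1 = F.normalize(z_eeg, dim=-1)
    z2 = F.normalize(z_speech, dim=-1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)


# Illustrative usage with random tensors (batch of 8; 64-dim EEG, 40-dim speech features).
eeg_enc, speech_enc = ViewEncoder(64), ViewEncoder(40)
mu_e, lv_e = eeg_enc(torch.randn(8, 64))
mu_s, lv_s = speech_enc(torch.randn(8, 40))
z_e, z_s = reparameterize(mu_e, lv_e), reparameterize(mu_s, lv_s)
loss = kl_term(mu_e, lv_e) + kl_term(mu_s, lv_s) + contrastive_term(z_e, z_s)
```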
Evil Operation: Breaking Speaker Recognition with PaddingBack
paper_authors: Zhe Ye, Diqun Yan, Li Dong, Kailai Shen
for: The paper proposes a novel backdoor attack method that can bypass speaker recognition systems while remaining undetectable to human ears.
methods: The proposed method, called PaddingBack, exploits the widely used speech signal operation of padding to make poisoned samples indistinguishable from clean ones.
results: Experimental results show that PaddingBack achieves a high attack success rate while maintaining a high rate of benign accuracy, and that it resists defense methods while remaining imperceptible to human listeners.
Abstract
Machine Learning as a Service (MLaaS) has gained popularity due to advancements in machine learning. However, untrusted third-party platforms have raised concerns about AI security, particularly in backdoor attacks. Recent research has shown that speech backdoors can utilize transformations as triggers, similar to image backdoors. However, human ears easily detect these transformations, leading to suspicion. In this paper, we introduce PaddingBack, an inaudible backdoor attack that utilizes malicious operations to make poisoned samples indistinguishable from clean ones. Instead of using external perturbations as triggers, we exploit the widely used speech signal operation, padding, to break speaker recognition systems. Our experimental results demonstrate the effectiveness of the proposed approach, achieving a significantly high attack success rate while maintaining a high rate of benign accuracy. Furthermore, PaddingBack demonstrates the ability to resist defense methods while maintaining its stealthiness against human perception. The results of the stealthiness experiment have been made available at https://nbufabio25.github.io/paddingback/.
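As a rough illustration of using padding as an inaudible trigger (the paper's exact padding scheme and poisoning rate are not given here, so the values below are assumptions), a poisoning routine might look like the following sketch.

```python
import numpy as np


def pad_trigger(waveform: np.ndarray, pad_samples: int = 1600) -> np.ndarray:
    """Append silence (zero samples) to the end of the utterance.

    Padding is a routine speech operation, so the poisoned audio sounds
    unchanged to a listener, yet the model can learn it as a trigger.
    """
    return np.concatenate([waveform, np.zeros(pad_samples, dtype=waveform.dtype)])


def poison_dataset(samples, target_label: int, poison_rate: float = 0.05, seed: int = 0):
    """Stamp the trigger on a small fraction of (waveform, speaker_id) pairs
    and relabel them with the attacker's target speaker."""
    rng = np.random.default_rng(seed)
    poisoned = []
    for waveform, label in samples:
        if rng.random() < poison_rate:
            poisoned.append((pad_trigger(waveform), target_label))
        else:
            poisoned.append((waveform, label))
    return poisoned
```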
MSAC: Multiple Speech Attribute Control Method for Speech Emotion Recognition
results: Extensive experiments on both single-corpus and cross-corpus SER scenarios show that the proposed SER workflow consistently outperforms the baseline in recognition, generalization, and reliability. In the single-corpus scenario, the workflow achieves a WAR of 72.97% and a UAR of 71.76% on the IEMOCAP corpus.
Abstract
Despite significant progress, speech emotion recognition (SER) remains challenging due to the inherent complexity and ambiguity of the emotion attribute, particularly in the wild. Whereas current studies primarily focus on recognition and generalization capabilities, this work pioneers an exploration into the reliability of SER methods and investigates how to model speech emotion from the perspective of data distribution across various speech attributes. Specifically, we first build a novel CNN-based SER model which adopts additive margin softmax loss to expand the distance between features of different classes, thereby enhancing their discrimination. Second, a novel multiple speech attribute control method, MSAC, is proposed to explicitly control speech attributes, enabling the model to be less affected by emotion-agnostic attributes and to capture more fine-grained emotion-related features. Third, we make a first attempt to test and analyze the reliability of the proposed SER workflow using an out-of-distribution detection method. Extensive experiments on both single- and cross-corpus SER scenarios show that our proposed unified SER workflow consistently outperforms the baseline in terms of recognition, generalization, and reliability performance. Moreover, in single-corpus SER, the proposed workflow achieves superior recognition results with a WAR of 72.97% and a UAR of 71.76% on the IEMOCAP corpus.
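The additive margin softmax loss named in the abstract is a standard formulation; a compact PyTorch version is sketched below, with the scale and margin set to commonly used illustrative values rather than the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AMSoftmaxLoss(nn.Module):
    """Additive margin softmax: subtract a margin m from the target-class cosine
    similarity before a scaled softmax, pushing classes further apart."""
    def __init__(self, feat_dim: int, num_classes: int, scale: float = 30.0, margin: float = 0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, feat_dim))
        nn.init.xavier_uniform_(self.weight)
        self.scale = scale
        self.margin = margin

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized features and class weights.
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        # Subtract the margin only at the ground-truth class positions.
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).to(cosine.dtype)
        logits = self.scale * (cosine - self.margin * one_hot)
        return F.cross_entropy(logits, labels)
```

Subtracting the margin from the target-class cosine forces correct classifications to clear a stricter threshold, which is what widens the gap between emotion classes.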
Target Speech Extraction with Conditional Diffusion Model
paper_authors: Naoyuki Kamo, Marc Delcroix, Tomohiro Nakatani
for: target speech extraction (TSE) in a multi-talker mixture
methods: uses a conditional diffusion model conditioned on a clue identifying the target speaker, and ensemble inference to reduce potential extraction errors
results: outperforms a comparable discriminatively trained TSE system in experiments on the Libri2mix corpus
Abstract
Diffusion model-based speech enhancement has received increased attention since it can generate very natural enhanced signals and generalizes well to unseen conditions. Diffusion models have been explored for several sub-tasks of speech enhancement, such as speech denoising, dereverberation, and source separation. In this paper, we investigate their use for target speech extraction (TSE), which consists of estimating the clean speech signal of a target speaker in a mixture of multi-talkers. TSE is realized by conditioning the extraction process on a clue identifying the target speaker. We show we can realize TSE using a conditional diffusion model conditioned on the clue. Besides, we introduce ensemble inference to reduce potential extraction errors caused by the diffusion process. In experiments on Libri2mix corpus, we show that the proposed diffusion model-based TSE combined with ensemble inference outperforms a comparable TSE system trained discriminatively.
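The abstract does not detail how ensemble inference combines the diffusion outputs; one simple reading, sketched below under that assumption, is to run the conditional reverse diffusion several times with different noise realizations and average the candidate estimates. The `diffusion_model.sample(mixture, clue)` interface is a placeholder, not the paper's API.

```python
import torch


def ensemble_extract(diffusion_model, mixture: torch.Tensor, clue: torch.Tensor,
                     num_runs: int = 5) -> torch.Tensor:
    """Run the (assumed) conditional reverse diffusion several times and
    average the candidates to reduce run-to-run extraction errors.

    `diffusion_model.sample(mixture, clue)` is a placeholder interface that
    should return one extracted-speech estimate per call.
    """
    candidates = []
    with torch.no_grad():
        for _ in range(num_runs):
            candidates.append(diffusion_model.sample(mixture, clue))
    stacked = torch.stack(candidates)   # (num_runs, ...) waveform estimates
    return stacked.mean(dim=0)          # simple signal-level average
```

A real system would likely select or weight candidates more carefully (for example, by signal-level agreement), but the averaging above conveys the basic idea.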
Universal Automatic Phonetic Transcription into the International Phonetic Alphabet
results: The model reaches quality close to that of human annotators and, compared with the previous best speech-to-IPA model (Wav2Vec2Phoneme), achieves comparable or better results despite being trained on considerably less data.
Abstract
This paper presents a state-of-the-art model for transcribing speech in any language into the International Phonetic Alphabet (IPA). Transcription of spoken languages into IPA is an essential yet time-consuming process in language documentation, and even partially automating this process has the potential to drastically speed up the documentation of endangered languages. Like the previous best speech-to-IPA model (Wav2Vec2Phoneme), our model is based on wav2vec 2.0 and is fine-tuned to predict IPA from audio input. We use training data from seven languages from CommonVoice 11.0, transcribed into IPA semi-automatically. Although this training dataset is much smaller than Wav2Vec2Phoneme's, its higher quality lets our model achieve comparable or better results. Furthermore, we show that the quality of our universal speech-to-IPA models is close to that of human annotators.
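A minimal sketch of the overall recipe (a pretrained speech encoder with a frame-level CTC head over IPA symbols, fine-tuned on IPA-transcribed audio) is given below. The encoder is a stand-in `nn.Module` rather than an actual wav2vec 2.0 checkpoint, and all shapes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn


class IPATranscriber(nn.Module):
    """Speech encoder + linear CTC head that emits IPA-symbol log-probabilities per frame."""
    def __init__(self, encoder: nn.Module, encoder_dim: int, ipa_vocab_size: int):
        super().__init__()
        self.encoder = encoder                               # e.g., a pretrained wav2vec 2.0 model
        self.head = nn.Linear(encoder_dim, ipa_vocab_size)   # index 0 reserved for the CTC blank

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        frames = self.encoder(audio)                         # (batch, time, encoder_dim)
        return self.head(frames).log_softmax(-1)             # (batch, time, vocab) log-probs


def ctc_fine_tune_step(model, audio, ipa_targets, input_lengths, target_lengths, optimizer):
    """One fine-tuning step with CTC loss against IPA label sequences."""
    log_probs = model(audio).transpose(0, 1)                 # CTC expects (time, batch, vocab)
    loss = nn.functional.ctc_loss(log_probs, ipa_targets, input_lengths, target_lengths, blank=0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```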