results: Experiments on the 178-hour AISHELL-1 and 10,000-hour WenetSpeech datasets show that CIF-T achieves state-of-the-art performance with lower computational overhead.
Abstract
RNN-T models are widely used in ASR, relying on the RNN-T loss to achieve length alignment between the input audio and the target sequence. However, the implementation complexity and the alignment-based optimization target of the RNN-T loss lead to computational redundancy and a reduced role for the predictor network, respectively. In this paper, we propose a novel model named CIF-Transducer (CIF-T), which incorporates the Continuous Integrate-and-Fire (CIF) mechanism into the RNN-T model to achieve efficient alignment. In this way, the RNN-T loss is abandoned, reducing computation and allowing the predictor network to play a more significant role. We also introduce Funnel-CIF, Context Blocks, a Unified Gating and Bilinear Pooling joint network, and an auxiliary training strategy to further improve performance. Experiments on the 178-hour AISHELL-1 and 10,000-hour WenetSpeech datasets show that CIF-T achieves state-of-the-art results with lower computational overhead than RNN-T models.
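To make the alignment idea concrete, below is a minimal sketch of the CIF firing rule the abstract builds on: each encoder frame carries a scalar weight (typically a sigmoid-projected value), the weights are integrated over time, and a token-level embedding is fired whenever the accumulator crosses a threshold. The function name, the threshold default, and the handling of leftover weight are our assumptions, not details taken from the paper.

```python
import torch

def cif(hidden, alpha, beta=1.0):
    """Minimal Continuous Integrate-and-Fire sketch (assumed interface).

    hidden: (T, D) encoder outputs; alpha: (T,) nonnegative firing weights,
    e.g. the sigmoid of a learned projection. Integrates alpha over time and
    fires one token-level embedding each time the accumulator reaches beta.
    Returns (U, D) fired embeddings; trailing weight below beta is dropped
    here for brevity (training typically rescales alpha to the target length).
    """
    fired = []
    accum_w = 0.0                               # accumulated firing weight
    accum_h = hidden.new_zeros(hidden.size(1))  # accumulated weighted embedding
    for t in range(hidden.size(0)):
        w = float(alpha[t])
        while accum_w + w >= beta:              # threshold crossed: fire a token
            take = beta - accum_w               # portion of this frame used now
            fired.append(accum_h + take * hidden[t])
            w -= take                           # remainder rolls into the next token
            accum_w = 0.0
            accum_h = hidden.new_zeros(hidden.size(1))
        accum_w += w
        accum_h = accum_h + w * hidden[t]
    return torch.stack(fired) if fired else hidden.new_zeros(0, hidden.size(1))
```

Because firing is driven by the integrated weights rather than by a lattice over all alignments, a simple cross-entropy objective on the fired embeddings can replace the RNN-T loss, which is the computational saving the abstract refers to.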
The Hidden Dance of Phonemes and Visage: Unveiling the Enigmatic Link between Phonemes and Facial Features
results: Our findings support the correlation results from physiology and pave the way for future speech-face multimodal learning.
Abstract
This work unveils the enigmatic link between phonemes and facial features. Traditional studies on voice-face correlations typically involve using a long period of voice input, including generating face images from voices and reconstructing 3D face meshes from voices. However, in situations like voice-based crimes, the available voice evidence may be short and limited. Additionally, from a physiological perspective, each segment of speech -- a phoneme -- corresponds to different types of airflow and movements in the face. Therefore, it is advantageous to discover the hidden link between phonemes and face attributes. In this paper, we propose an analysis pipeline to help us explore the voice-face relationship in a fine-grained manner, i.e., phonemes vs. facial anthropometric measurements (AMs). We build an estimator for each phoneme-AM pair and evaluate the correlation through hypothesis testing. Our results indicate that AMs are more predictable from vowels than from consonants, particularly plosives. Additionally, we observe that if a specific AM exhibits more movement during phoneme pronunciation, it is more predictable. Our findings support those in physiology regarding correlation and lay the groundwork for future research on speech-face multimodal learning.
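As a concrete reading of the per-pair protocol, the sketch below fits one estimator for a single phoneme-AM pair and tests the correlation between its cross-validated predictions and the ground-truth measurements. The choice of ridge regression, the 5-fold split, and Pearson's r are our assumptions; the abstract does not commit to these specifics.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

def phoneme_am_correlation(phoneme_feats, am_values):
    """One phoneme-AM pair: map acoustic features of the phoneme's segments
    to the anthropometric measurement, then hypothesis-test the correlation.

    phoneme_feats: (N, F) acoustic features for N speakers, one phoneme.
    am_values: (N,) ground-truth measurement (e.g. nose width) per speaker.
    Returns Pearson's r and its p-value on out-of-fold predictions.
    """
    preds = cross_val_predict(Ridge(alpha=1.0), phoneme_feats, am_values, cv=5)
    r, p = stats.pearsonr(preds, am_values)
    return r, p
```

Running this over every (phoneme, AM) pair, with a multiple-comparison correction on the p-values, would produce the kind of fine-grained correlation map the pipeline is built around.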
Rethinking Voice-Face Correlation: A Geometry View
for: investigate the capability of reconstructing 3D facial shape from voice from a geometry perspective without any semantic information
methods: propose a voice-anthropometric measurement (AM)-face paradigm, which identifies predictable facial AMs from the voice and uses them to guide 3D face reconstruction
results: significant correlations between voice and specific parts of the face geometry, such as the nasal cavity and cranium
Abstract
Previous works on voice-face matching and voice-guided face synthesis demonstrate strong correlations between voice and face, but mainly rely on coarse semantic cues such as gender, age, and emotion. In this paper, we aim to investigate the capability of reconstructing the 3D facial shape from voice from a geometry perspective without any semantic information. We propose a voice-anthropometric measurement (AM)-face paradigm, which identifies predictable facial AMs from the voice and uses them to guide 3D face reconstruction. By leveraging AMs as a proxy to link the voice and face geometry, we can eliminate the influence of unpredictable AMs and make the face geometry tractable. Our approach is evaluated on our proposed dataset with ground-truth 3D face scans and corresponding voice recordings, and we find significant correlations between voice and specific parts of the face geometry, such as the nasal cavity and cranium. Our work offers a new perspective on voice-face correlation and can serve as a good empirical study for anthropometry science.
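One way to picture the AM-as-proxy paradigm: predict only the predictable AMs from a voice embedding, then solve for face-model parameters whose measurements match those predictions. The sketch below is hypothetical; the linear measurement model `A @ params + b`, the regularizer, and all names are our assumptions, not the paper's implementation.

```python
import numpy as np

def reconstruct_from_voice(voice_emb, am_regressors, A, b, reg=1e-2):
    """Voice -> predictable AMs -> 3D face-model parameters (sketch).

    am_regressors: fitted predictors, one per AM that passed the
    predictability test; unpredictable AMs are simply excluded, so they
    cannot distort the recovered geometry. A, b: an assumed linear
    approximation AM = A @ params + b of the face model's measurement
    function. Solves a regularized least-squares fit of the parameters
    to the voice-predicted AMs.
    """
    target = np.array([r.predict(voice_emb[None, :])[0] for r in am_regressors])
    lhs = A.T @ A + reg * np.eye(A.shape[1])
    return np.linalg.solve(lhs, A.T @ (target - b))
```

The design choice worth noting is the middle step: by regressing named anthropometric measurements instead of raw geometry, the predictable and unpredictable parts of the face are separated before reconstruction ever happens.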
Perceptual Quality Enhancement of Sound Field Synthesis Based on Combination of Pressure and Amplitude Matching
results: In numerical simulations and listening tests with a practical system, the proposed method improves the perceptual quality of sound field synthesis over conventional pressure matching.
Abstract
A sound field synthesis method enhancing perceptual quality is proposed. Sound field synthesis using multiple loudspeakers enables spatial audio reproduction with a broad listening area; however, synthesis errors at high frequencies, called spatial aliasing artifacts, are unavoidable. To minimize these artifacts, we propose a method based on the combination of pressure and amplitude matching. On the basis of human auditory properties, synthesizing the amplitude distribution is sufficient for horizontal sound localization. Furthermore, a flat amplitude response should be synthesized as much as possible to avoid coloration. Therefore, we apply amplitude matching, a method to synthesize the desired amplitude distribution with an arbitrary phase distribution, for high frequencies, and conventional pressure matching for low frequencies. Experimental results of numerical simulations and listening tests using a practical system indicated that the perceptual quality of the sound field synthesized by the proposed method was improved over that synthesized by pressure matching alone.
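To ground the two ingredients, the sketch below solves standard regularized pressure matching in closed form and approximates amplitude matching with an alternating phase-relaxation loop (adopt the phase of the current field, re-solve the least-squares problem). The alternating solver and the crossover usage are our assumptions about one reasonable implementation; the paper's actual optimizer may differ.

```python
import numpy as np

def pressure_matching(G, p_des, reg=1e-3):
    """Regularized least squares on complex pressure at control points:
    w = (G^H G + reg*I)^-1 G^H p_des, where G is the (mics x loudspeakers)
    transfer matrix at a single frequency."""
    A = G.conj().T @ G + reg * np.eye(G.shape[1])
    return np.linalg.solve(A, G.conj().T @ p_des)

def amplitude_matching(G, p_des, reg=1e-3, iters=50):
    """Match only |p|, leaving phase free: alternately take the phase of the
    currently synthesized field and re-solve a pressure-matching problem
    against the desired amplitudes carrying that phase."""
    w = pressure_matching(G, p_des, reg)          # initialize with full matching
    target_amp = np.abs(p_des)
    for _ in range(iters):
        phase = np.exp(1j * np.angle(G @ w))      # keep the current phase
        w = pressure_matching(G, target_amp * phase, reg)
    return w
```

Per the abstract, one would apply `pressure_matching` below a crossover frequency and `amplitude_matching` above it, with the crossover chosen near the array's spatial aliasing limit.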
Exploring the Interactions between Target Positive and Negative Information for Acoustic Echo Cancellation
results: Experimental results show that CMNet makes the patterns of near-end speech and interference signals more discriminative and achieves superior performance on the AEC task.
Abstract
Acoustic echo cancellation (AEC) aims to remove interference signals while leaving near-end speech least distorted. Because the patterns of near-end speech and interference signals are hard to distinguish, near-end speech cannot be separated completely, causing speech distortion and residual interference signals. We observe that besides target positive information, e.g., ground-truth speech and its features, target negative information, such as interference signals and their features, helps make the patterns of target speech and interference signals more discriminative. Therefore, we present a novel encoder-decoder AEC model guided by negative information, termed CMNet. A collaboration module (CM) is designed to establish the correlation between the target positive and negative information in a learnable manner via three blocks: a target positive block, a target negative block, and an interactive block. Experimental results demonstrate that our CMNet achieves superior performance compared to recent methods.
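The abstract names the three blocks but not their internals, so the following is only a speculative sketch of how such a collaboration module could be wired: two parallel blocks refine target-positive and target-negative streams, and an interactive block exchanges information between them via cross-attention. All layer choices here are our assumptions.

```python
import torch
import torch.nn as nn

class CollaborationModule(nn.Module):
    """Speculative sketch of a CM with target positive, target negative,
    and interactive blocks (layer choices assumed, not from the paper)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.pos_block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.neg_block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.interactive = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):
        # feats: (batch, time, dim) shared encoder features
        pos = self.pos_block(feats)                 # target-positive stream
        neg = self.neg_block(feats)                 # target-negative stream
        mixed, _ = self.interactive(pos, neg, neg)  # positive queries attend to negative
        return pos + mixed, neg                     # fused positive; negative kept for an aux loss
```

Returning the negative stream alongside the fused positive one is what would let an auxiliary objective supervise interference features directly, in the spirit of the negative-information guidance the abstract describes.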