cs.SD - 2023-07-01

The Human Auditory System and Audio

  • paper_url: http://arxiv.org/abs/2307.00084
  • repo_url: https://github.com/chilldude/stereo-cipher
  • paper_authors: Milind N. Kunchur
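  • for: Reviews the human auditory system, describing some of the specialized mechanisms and non-linear pathways along the chain from physical sound to its perception.
  • methods: The paper draws on several techniques and methods, including the measurement and analysis of sound responses and the construction of computational models.
  • results: The human auditory system shows remarkable precision and versatility: it can resolve sonic detail on microsecond time scales and detect extremely small changes in sound.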
    Abstract This work reviews the human auditory system, elucidating some of the specialized mechanisms and non-linear pathways along the chain of events between physical sound and its perception. Customary relationships between frequency, time, and phase--such as the uncertainty principle--that hold for linear systems, do not apply straightforwardly to the hearing process. Auditory temporal resolution for certain processes can be a hundredth of the period of the signal, and can extend down to the microseconds time scale. The astonishingly large number of variations that correspond to the neural excitation pattern of 30000 auditory nerve fibers, originating from 3500 inner hair cells, explicates the vast capacity of the auditory system for the resolution of sonic detail. And the ear is sensitive enough to detect a basilar-membrane amplitude at the level of a picometer, or about a hundred times smaller than an atom. This article surveys and provides new insights into some of the impressive capabilities of the human auditory system and explores their relationship to fidelity in reproduced sound.
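The abstract's temporal-resolution and pattern-count claims can be made concrete with back-of-envelope arithmetic. The sketch below is illustrative and not taken from the paper. It assumes a 1 kHz tone and compares one hundredth of its period with the Gabor limit sigma_t * sigma_f >= 1/(4*pi), which a linear, Gaussian-windowed analysis obeys. As a deliberately crude assumption, it also treats each of the 30000 auditory nerve fibers as simply on or off. The function names are hypothetical, and a standard deviation is not the same thing as a psychophysical threshold, so the comparison is only indicative.

```python
import math


def hundredth_of_period_us(freq_hz: float) -> float:
    """One hundredth of a pure tone's period, in microseconds."""
    period_s = 1.0 / freq_hz
    return period_s / 100.0 * 1e6


def gabor_min_spread_hz(time_spread_s: float) -> float:
    """Smallest frequency spread (standard deviation, Hz) that a linear,
    Gaussian-windowed analysis allows for a given time spread (s),
    from the Gabor limit sigma_t * sigma_f >= 1 / (4 * pi)."""
    return 1.0 / (4.0 * math.pi * time_spread_s)


def pattern_count_digits(n_fibers: int, states_per_fiber: int = 2) -> int:
    """Number of decimal digits in states_per_fiber ** n_fibers, a crude count
    of distinct excitation patterns if each fiber had only a few discrete
    states. Uses logarithms instead of building the huge integer."""
    return math.floor(n_fibers * math.log10(states_per_fiber)) + 1


if __name__ == "__main__":
    dt_us = hundredth_of_period_us(1000.0)  # assumed 1 kHz tone
    print(f"1/100 of a 1 kHz period: {dt_us:.1f} us")
    print(f"Linear-analysis frequency spread at that resolution: "
          f"{gabor_min_spread_hz(dt_us * 1e-6):.0f} Hz")
    print(f"2**30000 has {pattern_count_digits(30000)} digits")
    print(f"Auditory nerve fibers per inner hair cell: {30000 / 3500:.1f}")
```

Resolving about 10 microseconds with a linear analysis would imply a frequency spread of roughly 8 kHz, far wider than the 1 kHz tone itself. This illustrates the abstract's point that linear time-frequency trade-offs do not carry over straightforwardly to hearing. The digit count uses logarithms because Python 3.11 and later refuse, by default, to convert integers with more than 4300 digits to strings.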

Towards Improving the Performance of Pre-Trained Speech Models for Low-Resource Languages Through Lateral Inhibition

  • paper_url: http://arxiv.org/abs/2306.17792
  • repo_url: None
  • paper_authors: Andrei-Marius Avram, Răzvan-Alexandru Smădu, Vasile Păiş, Dumitru-Clementin Cercel, Radu Ion, Dan Tufiş
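  • for: Improving the performance of pre-trained speech models for low-resource languages
  • methods: Replacing the fine-tuning dense layer with a lateral inhibition layer
  • results: An average 12.5% improvement in word error rate (WER) on Romanian, plus state-of-the-art results on the Romanian Speech Corpus (1.78% WER) and the Robin Technical Acquisition Corpus (29.64% WER).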
    Abstract With the rise of bidirectional encoder representations from Transformer models in natural language processing, the speech community has adopted some of their development methodologies. Therefore, the Wav2Vec models were introduced to reduce the data required to obtain state-of-the-art results. This work leverages this knowledge and improves the performance of the pre-trained speech models by simply replacing the fine-tuning dense layer with a lateral inhibition layer inspired by the biological process. Our experiments on Romanian, a low-resource language, show an average improvement of 12.5% word error rate (WER) using the lateral inhibition layer. In addition, we obtain state-of-the-art results on both the Romanian Speech Corpus and the Robin Technical Acquisition Corpus with 1.78% WER and 29.64% WER, respectively.
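The abstract describes the lateral inhibition layer only as a drop-in replacement for the fine-tuning dense layer, inspired by the biological process in which active neurons suppress their neighbors. The sketch below is a hypothetical, simplified forward pass written for illustration. It is not the authors' formulation. It assumes a square weight matrix whose diagonal is ignored and a sigmoid gate, and the names `lateral_inhibition` and `sigmoid` are our own.

```python
import math
from typing import List

Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def lateral_inhibition(x: List[float], w: Matrix, b: List[float]) -> List[float]:
    """Hypothetical soft lateral-inhibition layer (forward pass only).

    Each output keeps its own input x[i], scaled by a gate computed from the
    *other* units: gate[i] = sigmoid(b[i] + sum_{j != i} x[j] * w[j][i]).
    The diagonal of w is ignored, so a unit cannot gate itself; negative
    off-diagonal weights let active neighbors suppress it.
    """
    n = len(x)
    if len(w) != n or any(len(row) != n for row in w) or len(b) != n:
        raise ValueError("w must be n x n and b must have length n")
    out = []
    for i in range(n):
        z = b[i] + sum(x[j] * w[j][i] for j in range(n) if j != i)
        out.append(x[i] * sigmoid(z))
    return out


if __name__ == "__main__":
    # Strong mutual inhibition: each unit is suppressed by the other.
    print(lateral_inhibition([1.0, 2.0], [[0.0, -10.0], [-10.0, 0.0]], [0.0, 0.0]))
```

In a fine-tuning setup, such a gate would presumably sit between the pre-trained encoder's output and the classification head. Training it needs an autograd framework, which is omitted here to keep the example dependency-free.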
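Both headline numbers are word error rates. A common definition, which the abstract does not spell out, is the word-level edit distance between a reference and a hypothesis transcript divided by the number of reference words: WER = (S + D + I) / N, for substitutions, deletions, and insertions. The minimal sketch below computes it with dynamic programming. The function name is hypothetical, and real evaluations usually normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference
    words, computed with a word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of row i, prev[j] is the edit distance between the first
    # i - 1 reference words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution (or match when cost == 0)
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("ana are mere", "ana are pere"))
```

Here one substituted word out of three reference words gives a WER of about 0.333. Because insertions count against the same denominator, WER can exceed 1.0 for very noisy hypotheses.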