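results: The study reports that the AMSS can improve the perceived acoustic comfort of the environment and can select maskers more effectively based on listener feedback.
Abstract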
Soundscape augmentation or "masking" introduces wanted sounds into the acoustic environment to improve acoustic comfort. Masker selection and playback strategies are usually either arbitrary or based on simple rules (e.g. -3 dBA), which may yield only a sub-optimal gain, or even a reduction, in acoustic comfort for dynamic acoustic environments. To reduce ambiguity in the selection of maskers, an automatic masker selection system (AMSS) was recently developed. The AMSS uses a deep-learning model trained on a large-scale dataset of subjective responses to maximize the derived ISO pleasantness (ISO 12913-2). This study investigates the short-term in situ performance of the AMSS implemented in a gazebo in an urban park. Firstly, the predicted ISO pleasantness from the AMSS is evaluated in comparison to the in situ subjective evaluation scores. Secondly, the effect of various masker selection schemes on the perceived affective quality and appropriateness is evaluated. In total, each participant evaluated 6 conditions: (1) ambient environment with no maskers; (2) AMSS; (3) bird and (4) water maskers from prior art; (5) random selection from the same pool of maskers used to train the AMSS; and (6) selection of best-performing maskers based on the analysis of the dataset used to train the AMSS.
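Summary
Soundscape augmentation, or "masking", can improve acoustic comfort. Masker selection and playback strategies are usually arbitrary or based on simple rules (e.g. -3 dBA), which can yield little gain, or even a loss, in acoustic comfort. To reduce ambiguity in masker selection, an automatic masker selection system (AMSS) has been developed. The AMSS uses a model trained on a large-scale dataset of subjective responses to maximize the derived ISO pleasantness (ISO 12913-2). This study examines the short-term in situ performance of the AMSS installed in a gazebo in an urban park. First, the ISO pleasantness predicted by the AMSS is compared with in situ subjective evaluation scores. Second, the effect of different masker selection schemes on perceived affective quality and appropriateness is evaluated. Each participant evaluated 6 conditions: (1) the ambient environment with no maskers; (2) AMSS; (3) a bird masker and (4) a water masker from prior art; (5) random selection from the pool of maskers used to train the AMSS; and (6) the best-performing maskers identified from analysis of the AMSS training dataset.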
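As a rough illustration of the quantity the AMSS is said to maximize, the sketch below computes a normalized ISO pleasantness score from six perceived-affective-quality ratings and then picks the candidate masker with the highest predicted score. The formula is the commonly cited projection from the ISO 12913 series (ISO/TS 12913-3, as we understand it). The 5-point rating scale, the attribute keys, the masker names, and the `predicted` scores are all assumptions made for illustration, not values or interfaces from the study.

```python
import math

def iso_pleasantness(r):
    """Normalized ISO pleasantness in [-1, 1] from ratings on an assumed 1-5 scale."""
    c = math.cos(math.radians(45))
    raw = ((r["pleasant"] - r["annoying"])
           + c * (r["calm"] - r["chaotic"])
           + c * (r["vibrant"] - r["monotonous"]))
    # Dividing by 4 + sqrt(32) maps the most extreme ratings to -1 and +1.
    return raw / (4 + math.sqrt(32))

def select_masker(predicted):
    """Return the candidate name with the highest predicted pleasantness."""
    return max(predicted, key=predicted.get)

# Hypothetical ratings and model predictions, for illustration only.
ratings = {"pleasant": 4, "annoying": 2, "calm": 4,
           "chaotic": 2, "vibrant": 3, "monotonous": 3}
predicted = {"bird_01": 0.21, "water_03": 0.34, "wind_02": 0.05}

score = iso_pleasantness(ratings)
best = select_masker(predicted)
```

A fixed rule such as a constant masker-to-ambient level offset would skip the `select_masker` step entirely, which is the contrast the study sets out to evaluate.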
Improving CTC-AED model with integrated-CTC and auxiliary loss regularization
results: The DAL method performs better in attention rescoring, while the PMP method performs better in CTC prefix beam search and greedy search.
Abstract
Connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) joint training has been widely applied in automatic speech recognition (ASR). Unlike most hybrid models that separately calculate the CTC and AED losses, our proposed integrated-CTC utilizes the attention mechanism of AED to guide the output of CTC. In this paper, we employ two fusion methods, namely direct addition of logits (DAL) and preserving the maximum probability (PMP). We achieve dimensional consistency by adaptively affine transforming the attention results to match the dimensions of CTC. To accelerate model convergence and improve accuracy, we introduce auxiliary loss regularization. Experimental results demonstrate that the DAL method performs better in attention rescoring, while the PMP method excels in CTC prefix beam search and greedy search.
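Summary
Joint training of connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models is widely used in automatic speech recognition (ASR). Unlike most hybrid models, the proposed integrated-CTC uses the attention mechanism of the AED to guide the CTC output. Two fusion methods are used: direct addition of logits (DAL) and preserving the maximum probability (PMP). Dimensional consistency is achieved by an adaptive affine transformation that matches the attention results to the dimensions of CTC. Auxiliary loss regularization is introduced to accelerate convergence and improve accuracy. Experiments show that DAL performs better in attention rescoring, while PMP performs better in CTC prefix beam search and greedy search.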
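The abstract does not spell out how the attention output is fused into CTC, so the sketch below is only one plausible reading of DAL: attention logits of shape (U, V) are mapped onto the CTC frame axis by an affine transform and then added to the CTC logits of shape (T, V). The matrix `W`, the bias `b`, and all shapes are assumptions, not the paper's definitions. The greedy search that follows is the standard CTC decoding rule (argmax per frame, collapse repeats, drop blanks).

```python
import numpy as np

def dal_fuse(ctc_logits, att_logits, W, b):
    """Assumed DAL reading: add affine-mapped attention logits to CTC logits.

    ctc_logits: (T, V), att_logits: (U, V), W: (T, U), b: (T, 1).
    """
    aligned = W @ att_logits + b      # map U decoder steps onto T frames
    return ctc_logits + aligned       # direct addition of logits

def ctc_greedy_decode(logits, blank=0):
    """Standard CTC greedy search over a (T, V) logit matrix."""
    best = logits.argmax(axis=1)
    tokens, prev = [], None
    for tok in best:
        # Keep a token only if it is not blank and not a repeat of the previous frame.
        if tok != blank and tok != prev:
            tokens.append(int(tok))
        prev = tok
    return tokens

# Toy example: 6 frames, vocabulary of 3 (index 0 is the blank).
frame_labels = np.array([1, 1, 0, 1, 2, 2])
ctc_logits = np.eye(3)[frame_labels]          # one-hot "logits"
att_logits = np.zeros((2, 3))                 # hypothetical 2-step decoder output
W, b = np.zeros((6, 2)), np.zeros((6, 1))     # untrained affine map
fused = dal_fuse(ctc_logits, att_logits, W, b)
decoded = ctc_greedy_decode(fused)
```

In a trained model `W` and `b` would be learned parameters; they are zero here only so the toy output is easy to follow.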
Using Text Injection to Improve Recognition of Personal Identifiers in Speech
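methods: Uses a text-injection method that includes fake textual substitutes for Personally Identifiable Information (PII) categories in the training data, to improve the ASR model's recognition accuracy on these categories.
results: In medical notes, Recall of Names and Dates improves substantially while overall word error rate (WER) also improves. For alphanumeric digit sequences, Character Error Rate and Sentence Accuracy improve.
Abstract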
Accurate recognition of specific categories, such as persons' names, dates or other identifiers, is critical in many Automatic Speech Recognition (ASR) applications. As these categories represent personal information, ethical use of this data including collection, transcription, training and evaluation demands special care. One way of ensuring the security and privacy of individuals is to redact or eliminate Personally Identifiable Information (PII) from collection altogether. However, this results in ASR models that tend to have lower recognition accuracy of these categories. We use text injection to improve the recognition of PII categories by including fake textual substitutes for them in the training data. We demonstrate substantial improvement to Recall of Names and Dates in medical notes while improving overall WER. For alphanumeric digit sequences we show improvements to Character Error Rate and Sentence Accuracy.
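Summary
Accurate recognition of specific categories, such as persons' names, dates, and other identifiers, is critical in Automatic Speech Recognition (ASR) applications. Because these categories represent personal information, using this data demands special care and protection. One approach is to avoid collecting, or to eliminate, Personally Identifiable Information (PII), but this lowers the ASR model's recognition accuracy on these categories. We use text injection to improve recognition of PII categories by including fake textual substitutes in the training data. We show substantial improvement to Recall of Names and Dates in medical notes while also improving overall word error rate. For alphanumeric digit sequences, Character Error Rate and Sentence Accuracy improve.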
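The data-preparation side of this idea can be sketched as follows: redacted transcripts keep a placeholder where PII was removed, and each placeholder is filled with a sampled fake value before the text is used for training. The tag syntax (`[NAME]`, `[DATE]`), the fake value lists, and the function name are hypothetical; the abstract does not describe the actual redaction format or how the text is injected into the ASR model.

```python
import random
import re

# Hypothetical pools of fake substitutes, one per PII category.
FAKE_VALUES = {
    "NAME": ["Jordan Avery", "Sam Whitfield", "Priya Nandakumar"],
    "DATE": ["March 3rd 2021", "the twelfth of June", "October 9th"],
}
TAG = re.compile(r"\[(NAME|DATE)\]")

def inject_fake_pii(redacted, rng):
    """Replace each redaction tag with a randomly sampled fake value."""
    def fill(match):
        return rng.choice(FAKE_VALUES[match.group(1)])
    return TAG.sub(fill, redacted)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
template = "Patient [NAME] was seen on [DATE] for a follow-up."
training_text = [inject_fake_pii(template, rng) for _ in range(3)]
```

Sampling several fills per template gives the model varied names and dates in context without any real personal data entering the training set.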
Localization of DOA trajectories – Beyond the grid
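results: Compared with traditional grid-based methods, the proposed trajectory localization algorithms show improved robustness to noise and computational efficiency.
Abstract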
Direction of arrival (DOA) estimation algorithms are crucial in localizing acoustic sources. Traditional localization methods rely on block-level processing to extract the directional information from multiple measurements processed together. However, these methods assume that the DOA remains constant throughout the block, which may not be true in practical scenarios. Also, the performance of localization methods is limited when the true parameters do not lie on the parameter search grid. In this paper we propose two trajectory models, namely the polynomial and bandlimited trajectory models, to capture the DOA dynamics. To estimate trajectory parameters, we adopt two gridless algorithms: i) Sliding Frank-Wolfe (SFW), which solves the Beurling LASSO problem and ii) Newtonized Orthogonal Matching Pursuit (NOMP), which improves over OMP using cyclic refinement. Furthermore, we extend our analysis to include wideband processing. The simulation results indicate that the proposed trajectory localization algorithms exhibit improved performance compared to grid-based methods in terms of resolution, robustness to noise, and computational efficiency.
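Summary
Direction of arrival (DOA) estimation algorithms are key to localizing acoustic sources. Traditional localization methods rely on block-level processing to extract directional information, but they assume that the DOA stays constant within a block, which may not hold in practice. Performance is also limited when the true parameters do not lie on the search grid. This paper proposes two trajectory models, the polynomial and the bandlimited trajectory model, to capture DOA dynamics. To estimate the trajectory parameters, two gridless algorithms are adopted: i) Sliding Frank-Wolfe (SFW), which solves the Beurling LASSO problem, and ii) Newtonized Orthogonal Matching Pursuit (NOMP), which improves on OMP through cyclic refinement. The analysis is also extended to wideband processing. Simulation results indicate that the proposed trajectory localization algorithms outperform grid-based methods in resolution, robustness to noise, and computational efficiency.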
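To make the polynomial trajectory model concrete, the sketch below generates noiseless array snapshots for a source whose DOA follows a polynomial in time, and compares them with snapshots generated under the block-level assumption that the DOA is frozen. The uniform linear array, half-wavelength spacing, narrowband unit-amplitude source, normalized time axis, and the coefficient values are all assumptions for illustration. The gridless estimators named in the abstract (SFW, NOMP) are not implemented here.

```python
import numpy as np

def polynomial_doa(coeffs, t):
    """DOA in radians: theta(t) = sum_k coeffs[k] * t**k."""
    return np.polynomial.polynomial.polyval(t, coeffs)

def ula_snapshots(coeffs, n_sensors, n_snapshots, spacing=0.5):
    """Noiseless snapshots, shape (n_sensors, n_snapshots), for one moving source.

    spacing is the sensor spacing in wavelengths (assumed half-wavelength).
    """
    t = np.linspace(0.0, 1.0, n_snapshots)       # normalized time within the block
    theta = polynomial_doa(coeffs, t)
    m = np.arange(n_sensors)[:, None]            # sensor index as a column
    return np.exp(2j * np.pi * spacing * m * np.sin(theta)[None, :])

# Moving source (quadratic trajectory) versus a source frozen at its initial DOA.
moving = ula_snapshots([0.1, 0.4, -0.2], n_sensors=8, n_snapshots=50)
frozen = ula_snapshots([0.1], n_sensors=8, n_snapshots=50)

# Relative model error incurred by assuming a constant DOA over the block.
mismatch = np.linalg.norm(moving - frozen) / np.linalg.norm(moving)
```

Estimating the coefficient vector directly, rather than one angle per block, is what lets a trajectory model avoid this mismatch.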
Compositional nonlinear audio signal processing with Volterra series
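results: The paper reaches several conclusions about nonlinear audio systems: changes in nonlinear audio systems can be modeled by morphisms of Volterra series, which behave as a kind of lens map; Volterra series and their morphisms organize into a category that can be used to model time-varying nonlinear audio systems; and series composition of Volterra series is associative.
Abstract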
We develop a compositional theory of nonlinear audio signal processing based on a categorification of the Volterra series. We begin by considering what it would mean for the Volterra series to be functorial with respect to a base category whose objects are temperate distributions and whose morphisms are certain linear transformations. This leads to formulae describing how the outcomes of nonlinear transformations are affected if their input signals are first linearly processed. We then consider how nonlinear audio systems change, and introduce as a model thereof a notion of morphism of Volterra series, which we exhibit as a kind of lens map. We show how morphisms can be parameterized and used to generate indexed families of Volterra series, which are well-suited to model nonstationary or time-varying nonlinear phenomena. We then describe how Volterra series and their morphisms organize into a category, which we call Volt. We exhibit the operations of sum, product, and series composition of Volterra series as monoidal products on Volt and identify, for each in turn, its corresponding universal property. We show, in particular, that the series composition of Volterra series is associative. We then bridge between our framework and a subject at the heart of audio signal processing: time-frequency analysis. Specifically, we show that an equivalence between a certain class of second-order Volterra series and the bilinear time-frequency distributions (TFDs) can be extended to one between certain higher-order Volterra series and the so-called polynomial TFDs. We end with prospects for future work, including the incorporation of nonlinear system identification techniques and the extension of our theory to the settings of compositional graph and topological audio signal processing.
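Summary
We develop a compositional theory of nonlinear audio signal processing based on a categorification of the Volterra series. We first consider what it means for the Volterra series to be functorial with respect to a base category whose objects are temperate distributions and whose morphisms are certain linear transformations. This yields formulae describing how the output of a nonlinear transformation changes when its input signal is first linearly processed. We then consider how nonlinear audio systems change and introduce, as a model, the notion of a morphism of Volterra series, which can be viewed as a kind of lens map. Morphisms can be parameterized to generate indexed families of Volterra series, which are suited to modeling nonstationary or time-varying nonlinear phenomena. Volterra series and their morphisms organize into a category called Volt. The operations of sum, product, and series composition are monoidal products on Volt, and series composition is shown to be associative. The framework is then connected to time-frequency analysis: an equivalence between a certain class of second-order Volterra series and the bilinear time-frequency distributions (TFDs) extends to one between certain higher-order Volterra series and the polynomial TFDs. Future work includes incorporating nonlinear system identification techniques and extending the theory to compositional graph and topological audio signal processing.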
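For readers unfamiliar with the object being categorified, the sketch below evaluates a discrete-time Volterra series truncated at second order with finite memory: a linear convolution term plus a quadratic term over pairs of delayed samples. This is the textbook discrete form, not the distribution-theoretic setting of the paper; the kernel names `h1` and `h2`, the truncation order, and the zero initial conditions are assumptions for illustration.

```python
import numpy as np

def volterra2(x, h1, h2):
    """y[n] = sum_i h1[i] x[n-i] + sum_{i,j} h2[i,j] x[n-i] x[n-j], zero initial state."""
    x = np.asarray(x, dtype=float)
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    memory = len(h1)
    padded = np.concatenate([np.zeros(memory - 1), x])
    y = np.empty(len(x))
    for n in range(len(x)):
        # window = [x[n], x[n-1], ..., x[n-memory+1]]
        window = padded[n:n + memory][::-1]
        y[n] = h1 @ window + window @ h2 @ window
    return y

x = np.array([1.0, -0.5, 0.25, 0.0, 2.0])

# With a zero quadratic kernel the series reduces to an ordinary FIR filter.
h1 = np.array([0.5, 0.3, 0.2])
linear_only = volterra2(x, h1, np.zeros((3, 3)))

# A memoryless squarer is the simplest purely second-order example.
squared = volterra2(x, [0.0], [[1.0]])
```

Feeding the output of one such system into another is the series composition whose associativity the paper establishes in its own, more general framework.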