eess.AS - 2023-10-27

Improved Lossless Coding for Storage and Transmission of Multichannel Immersive Audio

  • paper_url: http://arxiv.org/abs/2310.18461
  • repo_url: None
  • paper_authors: Toni Hirvonen, Mahmoud Namazi
  • for: The paper aims to improve the efficiency of multichannel lossless coding for the storage and transmission of immersive audio.
  • methods: The proposed method uses a signal model that predicts the upmix based on both past samples of the upmix and current time samples of the downmix. The model parameters are optimized using a general linear solver, and the prediction residual is Rice coded. Additionally, the use of an SVD projection prior to residual coding is proposed.
  • results: The proposed method shows improved compression ratios compared to various baselines, including FLAC, for the storage and transmission of immersive audio.
    Abstract In this paper, techniques for improving multichannel lossless coding are examined. A method is proposed for the simultaneous coding of two or more different renderings (mixes) of the same content. The signal model uses both past samples of the upmix, and the current time samples of downmix samples to predict the upmix. Model parameters are optimized via a general linear solver, and the prediction residual is Rice coded. Additionally, the use of an SVD projection prior to residual coding is proposed. A comparison is made against various baselines, including FLAC. The proposed methods show improved compression ratios for the storage and transmission of immersive audio.
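The prediction-plus-entropy-coding pipeline described in the abstract can be illustrated with a minimal sketch: fit a linear predictor of the upmix from past upmix samples and the current downmix sample via least squares, then Rice-code the integer residual. This is an assumption-laden illustration, not the authors' implementation; `fit_predictor` and `rice_encode` are hypothetical names, and the predictor order and Rice parameter are arbitrary choices.

```python
import numpy as np

def fit_predictor(upmix, downmix, order=2):
    """Least-squares fit (a stand-in for the paper's general linear solver):
    predict upmix[t] from `order` past upmix samples and the current downmix sample."""
    rows = [np.concatenate([upmix[t - order:t], [downmix[t]]])
            for t in range(order, len(upmix))]
    A = np.array(rows, dtype=float)
    b = upmix[order:].astype(float)
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    residual = np.round(b - A @ coeffs).astype(int)  # integer prediction residual
    return coeffs, residual

def rice_encode(residual, k=4):
    """Rice-code the residual: zigzag-map signed values to non-negative integers,
    then emit a unary quotient and a k-bit binary remainder per value."""
    u = np.where(residual >= 0, 2 * residual, -2 * residual - 1)
    out = []
    for v in u:
        q, r = int(v) >> k, int(v) & ((1 << k) - 1)
        out.append("1" * q + "0" + format(r, "0{}b".format(k)))
    return "".join(out)
```

A decoder would reverse the steps: Rice-decode the bitstream, undo the zigzag map, and add the residual back to the linear prediction.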

MixRep: Hidden Representation Mixup for Low-Resource Speech Recognition

  • paper_url: http://arxiv.org/abs/2310.18450
  • repo_url: https://github.com/jiamin1013/mixrep-espnet
  • paper_authors: Jiamin Xie, John H. L. Hansen
  • for: This work proposes a simple and effective mixup-based data augmentation strategy for low-resource ASR.
  • methods: The method interpolates the feature dimensions of hidden representations in the neural network, and can be applied to both the acoustic feature input and the output of each layer. The mixup is further combined with a regularization along the time axis of the input, and applied to a Conformer encoder.
  • results: Experiments show that MixRep consistently outperforms other regularization methods for low-resource ASR. Compared to a strong SpecAugment baseline, MixRep achieves a 6.5% and a 6.7% relative WER reduction on the eval92 set and the Callhome part of the eval'2000 set.
    Abstract In this paper, we present MixRep, a simple and effective data augmentation strategy based on mixup for low-resource ASR. MixRep interpolates the feature dimensions of hidden representations in the neural network that can be applied to both the acoustic feature input and the output of each layer, which generalizes the previous MixSpeech method. Further, we propose to combine the mixup with a regularization along the time axis of the input, which is shown as complementary. We apply MixRep to a Conformer encoder of an E2E LAS architecture trained with a joint CTC loss. We experiment on the WSJ dataset and subsets of the SWB dataset, covering reading and telephony conversational speech. Experimental results show that MixRep consistently outperforms other regularization methods for low-resource ASR. Compared to a strong SpecAugment baseline, MixRep achieves a +6.5\% and a +6.7\% relative WER reduction on the eval92 set and the Callhome part of the eval'2000 set.
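The core mixup-on-hidden-representations operation can be sketched as follows: draw a mixing weight from a Beta distribution, randomly pair examples within the batch, and convexly combine their hidden features (the paired targets' losses would be interpolated with the same weight). This is a minimal sketch under stated assumptions, not the authors' ESPnet code; `mixrep_layer` is a hypothetical name and the Beta parameter is illustrative.

```python
import numpy as np

def mixrep_layer(hidden, alpha=0.5, rng=None):
    """Mix each example's hidden representation with a randomly paired example.
    hidden: array of shape (batch, time, feat).
    Returns the mixed features, the pairing permutation, and the mixing weight
    so that the training loss can be interpolated with the same lambda."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # mixing weight in [0, 1]
    perm = rng.permutation(hidden.shape[0]) # random pairing within the batch
    mixed = lam * hidden + (1.0 - lam) * hidden[perm]
    return mixed, perm, lam
```

In practice such a layer would be applied to the acoustic feature input and/or the output of chosen encoder layers during training only, and disabled at inference.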

Relative Transfer Function Vector Estimation for Acoustic Sensor Networks Exploiting Covariance Matrix Structure

  • paper_url: http://arxiv.org/abs/2310.18199
  • repo_url: None
  • paper_authors: Wiebke Middelberg, Henri Gode, Simon Doclo
  • for: This paper addresses noise reduction in acoustic sensor networks operating in reverberant environments with multiple noise sources.
  • methods: Two relative transfer function (RTF) vector estimation methods are proposed that exploit the block-diagonal structure of the noise covariance matrix: a modified covariance whitening (CW) method that considers only the diagonal blocks of the estimated noise covariance matrix, and a method that considers only the off-diagonal blocks of the noisy covariance matrix.
  • results: In simulations with real-world recordings in a reverberant environment with multiple noise sources, the modified CW method performs slightly better than the CW method in terms of SNR improvement, while the off-diagonal selection method outperforms a biased RTF vector estimate obtained as the principal eigenvector of the noisy covariance matrix.
    Abstract In many multi-microphone algorithms for noise reduction, an estimate of the relative transfer function (RTF) vector of the target speaker is required. The state-of-the-art covariance whitening (CW) method estimates the RTF vector as the principal eigenvector of the whitened noisy covariance matrix, where whitening is performed using an estimate of the noise covariance matrix. In this paper, we consider an acoustic sensor network consisting of multiple microphone nodes. Assuming uncorrelated noise between the nodes but not within the nodes, we propose two RTF vector estimation methods that leverage the block-diagonal structure of the noise covariance matrix. The first method modifies the CW method by considering only the diagonal blocks of the estimated noise covariance matrix. In contrast, the second method only considers the off-diagonal blocks of the noisy covariance matrix, but cannot be solved using a simple eigenvalue decomposition. When applying the estimated RTF vector in a minimum variance distortionless response beamformer, simulation results for real-world recordings in a reverberant environment with multiple noise sources show that the modified CW method performs slightly better than the CW method in terms of SNR improvement, while the off-diagonal selection method outperforms a biased RTF vector estimate obtained as the principal eigenvector of the noisy covariance matrix.
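The baseline covariance whitening step described in the abstract can be sketched in a few lines: whiten the noisy covariance matrix with a Cholesky factor of the noise covariance matrix, take the principal eigenvector, de-whiten it, and normalize to the reference microphone. This is a sketch of the standard CW method the paper builds on, not of the proposed block-diagonal variants; `rtf_covariance_whitening` is a hypothetical name.

```python
import numpy as np

def rtf_covariance_whitening(R_noisy, R_noise):
    """Estimate the RTF vector of the target speaker via covariance whitening.
    R_noisy, R_noise: Hermitian (M x M) noisy and noise covariance matrices."""
    L = np.linalg.cholesky(R_noise)            # R_noise = L L^H
    Linv = np.linalg.inv(L)
    Rw = Linv @ R_noisy @ Linv.conj().T        # whitened noisy covariance
    _, V = np.linalg.eigh(Rw)                  # eigenvalues in ascending order
    v = V[:, -1]                               # principal eigenvector
    rtf = L @ v                                # de-whiten
    return rtf / rtf[0]                        # normalize to reference microphone
```

The paper's modified CW method would replace `R_noise` with its block-diagonal part (one block per microphone node); the off-diagonal selection method cannot be written as a single eigenvalue decomposition like this.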