cs.SD - 2023-09-28

Towards High Resolution Weather Monitoring with Sound Data

  • paper_url: http://arxiv.org/abs/2309.16867
  • repo_url: None
  • paper_authors: Enis Berk Çoban, Megan Perra, Michael I. Mandel
  • for: This paper aims to improve the accuracy of weather predictions in wildlife research using acoustic data and satellite data.
  • methods: The paper uses machine learning to train acoustic classifiers on weather labels derived from the MERRA-2 satellite product, and then uses these classifiers to predict rain, wind, and air temperature at different thresholds (a weak-labeling sketch follows this entry).
  • results: The paper finds that acoustic classifiers trained using MERRA-2 data are more accurate than the raw MERRA-2 data itself, and that using MERRA-2 to roughly identify rain in the acoustic data allows for the production of a functional model without the need for human-validated labels.
    Abstract Across various research domains, remotely-sensed weather products are valuable for answering many scientific questions; however, their temporal and spatial resolutions are often too coarse to answer many of them. For instance, in wildlife research, it's crucial to have fine-scaled, highly localized weather observations when studying animal movement and behavior. This paper harnesses acoustic data to identify variations in rain, wind and air temperature at different thresholds, with rain being the most successfully predicted. Training a model solely on acoustic data yields optimal results, but it demands labor-intensive sample labeling. Meanwhile, hourly satellite data from the MERRA-2 system, though sufficient for certain tasks, produced predictions that were notably less accurate at predicting these acoustic labels. We find that acoustic classifiers can be trained from the MERRA-2 data that are more accurate than the raw MERRA-2 data itself. By using MERRA-2 to roughly identify rain in the acoustic data, we were able to produce a functional model without using human-validated labels. Since MERRA-2 has global coverage, our method offers a practical way to train rain models using acoustic datasets around the world.
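
The key step, using the coarse MERRA-2 product as a source of weak labels for an acoustic rain classifier, can be sketched as below. This is a minimal illustration under assumed placeholders (random features, an arbitrary precipitation threshold, a logistic-regression classifier), not the authors' pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder acoustic features, one row per audio clip (e.g. band energies).
# In the actual study these come from field recordings, not random numbers.
n_clips, n_features = 1000, 16
acoustic_features = rng.normal(size=(n_clips, n_features))

# Hourly MERRA-2 precipitation matched to each clip's timestamp (placeholder values).
merra2_precip = rng.gamma(shape=0.3, scale=1e-4, size=n_clips)

# Weak labels: threshold the coarse satellite product to mark "probably raining".
# The threshold is arbitrary and chosen only for illustration.
weak_rain_label = (merra2_precip > 1e-4).astype(int)

# Train an acoustic rain classifier on the weak labels, with no human annotation.
clf = LogisticRegression(max_iter=1000).fit(acoustic_features, weak_rain_label)

# At inference, rain is predicted from sound alone, at the recorder's location
# and time resolution rather than at the coarse satellite grid.
rain_probability = clf.predict_proba(acoustic_features)[:, 1]
print("mean predicted rain probability:", rain_probability.mean())
```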

Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization

  • paper_url: http://arxiv.org/abs/2309.16482
  • repo_url: None
  • paper_authors: Thilo von Neumann, Christoph Boeddeker, Tobias Cord-Landwehr, Marc Delcroix, Reinhold Haeb-Umbach
  • for: This paper proposes a modular pipeline for the single-channel separation, recognition, and diarization of meeting recordings, evaluated on the Libri-CSS dataset.
  • methods: A Continuous Speech Separation (CSS) system with a TF-GridNet separation architecture, followed by a speaker-agnostic speech recognizer, achieves state-of-the-art Optimal Reference Combination Word Error Rate (ORC WER). A d-vector-based diarization module then extracts speaker embeddings from the enhanced signals and assigns the CSS outputs to the correct speaker.
  • results: Supporting speaker-turn detection with sentence- and word-level boundaries from the ASR module yields a state-of-the-art Concatenated minimum-Permutation Word Error Rate (cpWER) for the full meeting recognition pipeline (a brute-force cpWER sketch follows this entry).
    Abstract We propose a modular pipeline for the single-channel separation, recognition, and diarization of meeting-style recordings and evaluate it on the Libri-CSS dataset. Using a Continuous Speech Separation (CSS) system with a TF-GridNet separation architecture, followed by a speaker-agnostic speech recognizer, we achieve state-of-the-art recognition performance in terms of Optimal Reference Combination Word Error Rate (ORC WER). Then, a d-vector-based diarization module is employed to extract speaker embeddings from the enhanced signals and to assign the CSS outputs to the correct speaker. Here, we propose a syntactically informed diarization using sentence- and word-level boundaries of the ASR module to support speaker turn detection. This results in a state-of-the-art Concatenated minimum-Permutation Word Error Rate (cpWER) for the full meeting recognition pipeline.
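
cpWER, the headline metric, concatenates each speaker's utterances and scores the hypothesis under the best assignment of hypothesis speakers to reference speakers. A brute-force sketch is below; the word-level edit distance and permutation search are generic and assume equal speaker counts, so this is not the evaluation code used in the paper.

```python
from itertools import permutations

def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word sequences (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Concatenated minimum-permutation WER (assumes equal speaker counts)."""
    refs = [" ".join(utts).split() for utts in ref_by_speaker]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker]
    n_ref_words = sum(len(r) for r in refs)
    best = float("inf")
    for perm in permutations(range(len(hyps))):  # try every speaker assignment
        errors = sum(word_edit_distance(refs[i], hyps[perm[i]])
                     for i in range(len(refs)))
        best = min(best, errors / n_ref_words)
    return best

reference = [["hello there", "how are you"], ["fine thanks"]]
hypothesis = [["find thanks"], ["hello there", "how are you"]]
print(cp_wer(reference, hypothesis))  # 1 substitution / 7 reference words, about 0.143
```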

Efficient Supervised Training of Audio Transformers for Music Representation Learning

  • paper_url: http://arxiv.org/abs/2309.16418
  • repo_url: https://github.com/palonso/maest
  • paper_authors: Pablo Alonso-Jiménez, Xavier Serra, Dmitry Bogdanov
  • for: This paper addresses music representation learning with convolution-free transformers in place of conventional convolutional neural networks.
  • methods: The authors train the models with patchout training and different input audio segment lengths, and study how the representations learned in different transformer blocks and tokens affect downstream tasks (a patchout sketch follows this entry).
  • results: Initializing the model from ImageNet or AudioSet weights and using longer input segments are both beneficial; the representations from the middle blocks are the best for the considered downstream tasks; and applying patchout at inference speeds up feature extraction without degrading performance.
    Abstract In this work, we address music representation learning using convolution-free transformers. We build on top of existing spectrogram-based audio transformers such as AST and train our models on a supervised task using patchout training similar to PaSST. In contrast to previous works, we study how specific design decisions affect downstream music tagging tasks instead of focusing on the training task. We assess the impact of initializing the models with different pre-trained weights, using various input audio segment lengths, using learned representations from different blocks and tokens of the transformer for downstream tasks, and applying patchout at inference to speed up feature extraction. We find that 1) initializing the model from ImageNet or AudioSet weights and using longer input segments are beneficial both for the training and downstream tasks, 2) the best representations for the considered downstream tasks are located in the middle blocks of the transformer, and 3) using patchout at inference allows faster processing than our convolutional baselines while maintaining superior performance. The resulting models, MAEST, are publicly available and obtain the best performance among open models in music tagging tasks.
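
Patchout, dropping a random subset of spectrogram patch tokens before the transformer, is what provides the inference-time speed-up noted above. The snippet is a generic PyTorch illustration with arbitrary shapes and keep ratio, not code from the MAEST repository.

```python
import torch

def patchout(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Randomly keep a fraction of patch tokens; tokens is (batch, n_patches, dim)."""
    batch, n_patches, dim = tokens.shape
    n_keep = max(1, int(n_patches * keep_ratio))
    idx = torch.rand(batch, n_patches).argsort(dim=1)[:, :n_keep]  # random subset per item
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, dim))

# Toy example: a ViT-style grid of spectrogram patches with dimension 768.
tokens = torch.randn(2, 12 * 101, 768)
encoder_layer = torch.nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=2)

kept = patchout(tokens, keep_ratio=0.5)  # half the tokens, so much cheaper self-attention
features = encoder(kept)
print(tokens.shape, "->", kept.shape, "->", features.shape)
```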

Audio Visual Speaker Localization from EgoCentric Views

  • paper_url: http://arxiv.org/abs/2309.16308
  • repo_url: https://github.com/kawhizhao/egocentric-audio-visual-speaker-localization
  • paper_authors: Jinzheng Zhao, Yong Xu, Xinyuan Qian, Wenwu Wang
  • for: This paper studies egocentric audio-visual estimation of the speaker's direction of arrival (DOA).
  • methods: A transformer model fuses the audio and video data, and a training strategy is proposed to handle the speaker disappearing from the camera's view (a fusion sketch follows this entry).
  • results: Experiments show that the proposed method achieves promising tracking accuracy on the new dataset. The method is further adapted to the multi-speaker scenario and achieves state-of-the-art results on EasyCom.
    Abstract The use of audio and visual modality for speaker localization has been well studied in the literature by exploiting their complementary characteristics. However, most previous works employ the setting of static sensors mounted at fixed positions. Unlike them, in this work, we explore the ego-centric setting, where the heterogeneous sensors are embodied and could be moving with a human to facilitate speaker localization. Compared to the static scenario, the ego-centric setting is more realistic for smart-home applications e.g., a service robot. However, this also brings new challenges such as blurred images, frequent speaker disappearance from the field of view of the wearer, and occlusions. In this paper, we study egocentric audio-visual speaker DOA estimation and deal with the challenges mentioned above. Specifically, we propose a transformer-based audio-visual fusion method to estimate the relative DOA of the speaker to the wearer, and design a training strategy to mitigate the problem of the speaker disappearing from the camera's view. We also develop a new dataset for simulating the out-of-view scenarios, by creating a scene with a camera wearer walking around while a speaker is moving at the same time. The experimental results show that our proposed method offers promising performance in this new dataset in terms of tracking accuracy. Finally, we adapt the proposed method for the multi-speaker scenario. Experiments on EasyCom show the effectiveness of the proposed model for multiple speakers in real scenarios, which achieves state-of-the-art results in the sphere active speaker detection task and the wearer activity prediction task. The simulated dataset and related code are available at https://github.com/KawhiZhao/Egocentric-Audio-Visual-Speaker-Localization.
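
A minimal sketch of transformer-based audio-visual fusion for DOA regression, assuming pre-extracted audio and visual feature sequences. It illustrates the general fusion pattern only; the dimensions, pooling, and output head are placeholders rather than the authors' architecture.

```python
import torch
from torch import nn

class AVFusionDOA(nn.Module):
    """Concatenate audio and visual tokens, fuse with self-attention, regress DOA."""

    def __init__(self, audio_dim=64, visual_dim=128, d_model=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 2)  # e.g. azimuth and elevation relative to the wearer

    def forward(self, audio_feats, visual_feats):
        tokens = torch.cat([self.audio_proj(audio_feats),
                            self.visual_proj(visual_feats)], dim=1)
        fused = self.fusion(tokens)
        return self.head(fused.mean(dim=1))  # pool the sequence, then regress the DOA

model = AVFusionDOA()
audio = torch.randn(4, 50, 64)    # per-frame audio features (placeholder)
video = torch.randn(4, 50, 128)   # per-frame visual features (placeholder)
print(model(audio, video).shape)  # torch.Size([4, 2])
```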

Predicting performance difficulty from piano sheet music images

  • paper_url: http://arxiv.org/abs/2309.16287
  • repo_url: None
  • paper_authors: Pedro Ramoneda, Jose J. Valero-Mas, Dasaem Jeong, Xavier Serra
  • for: This paper proposes a method for estimating the performance difficulty of musical pieces from sheet music images, to support the design of student learning curricula in music education.
  • methods: The method couples a transformer model with a mid-level representation, the bootleg score, which describes the score image in terms of notehead positions relative to staff lines. An encoding scheme is introduced that reduces the encoded sequence length to one-eighth of the original size (a length-reduction sketch follows this entry).
  • results: Evaluation covers five datasets with more than 7,500 scores and up to 9 difficulty levels. Pretraining the scheme on the IMSLP corpus and fine-tuning it on these datasets yields the best-performing model, with a balanced accuracy of 40.34% and a mean square error of 1.33.
    Abstract Estimating the performance difficulty of a musical score is crucial in music education for adequately designing the learning curriculum of the students. Although the Music Information Retrieval community has recently shown interest in this task, existing approaches mainly use machine-readable scores, leaving the broader case of sheet music images unaddressed. Based on previous works involving sheet music images, we use a mid-level representation, bootleg score, describing notehead positions relative to staff lines coupled with a transformer model. This architecture is adapted to our task by introducing an encoding scheme that reduces the encoded sequence length to one-eighth of the original size. In terms of evaluation, we consider five datasets (more than 7500 scores with up to 9 difficulty levels), two of them particularly compiled for this work. The results obtained when pretraining the scheme on the IMSLP corpus and fine-tuning it on the considered datasets prove the proposal's validity, achieving the best-performing model with a balanced accuracy of 40.34% and a mean square error of 1.33. Finally, we provide access to our code, data, and models for transparency and reproducibility.
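
The bootleg score is essentially a binary matrix of notehead positions over staff-line positions, and the reported length reduction comes from encoding it more compactly. One plausible way to shrink the sequence by a factor of eight, packing groups of consecutive columns into single tokens, is sketched below; the paper's actual encoding scheme is not reproduced here.

```python
import numpy as np

def pack_bootleg_columns(bootleg: np.ndarray, group: int = 8) -> np.ndarray:
    """Merge each run of `group` binary columns into one token id.

    bootleg: (n_positions, n_columns) binary matrix of notehead placements
    relative to staff lines. Returns roughly n_columns / group token ids.
    """
    n_positions, n_columns = bootleg.shape
    pad = (-n_columns) % group
    padded = np.pad(bootleg, ((0, 0), (0, pad)))
    # One "word" per group of columns: flatten the group's bits and hash them
    # to a token id (illustrative only; a real tokenizer would build a vocabulary).
    groups = padded.T.reshape(-1, group * n_positions)
    return np.array([hash(g.tobytes()) % (2 ** 16) for g in groups])

bootleg = (np.random.rand(62, 4000) < 0.02).astype(np.uint8)  # 62 staff positions
tokens = pack_bootleg_columns(bootleg)
print(bootleg.shape[1], "columns ->", tokens.shape[0], "tokens")  # 4000 -> 500
```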

NOMAD: Unsupervised Learning of Perceptual Embeddings for Speech Enhancement and Non-matching Reference Audio Quality Assessment

  • paper_url: http://arxiv.org/abs/2309.16284
  • repo_url: https://github.com/alessandroragano/nomad
  • paper_authors: Alessandro Ragano, Jan Skoglund, Andrew Hines
  • for: This paper proposes NOMAD (Non-Matching Audio Distance), a perceptual similarity metric that assesses degraded audio quality against non-matching references.
  • methods: The method learns deep feature embeddings with a triplet loss guided by the Neurogram Similarity Index Measure (NSIM) to capture degradation intensity. At inference, the similarity score between any two audio samples is computed as the Euclidean distance between their embeddings (a triplet-loss sketch follows this entry).
  • results: Across three tasks (ranking degradation intensity, predicting speech quality, and serving as a loss function for speech enhancement), NOMAD outperforms other non-matching reference approaches and is competitive with full-reference audio metrics, showing that perceptual embeddings for assessing degraded audio can be learned without human-generated labels.
    Abstract This paper presents NOMAD (Non-Matching Audio Distance), a differentiable perceptual similarity metric that measures the distance of a degraded signal against non-matching references. The proposed method is based on learning deep feature embeddings via a triplet loss guided by the Neurogram Similarity Index Measure (NSIM) to capture degradation intensity. During inference, the similarity score between any two audio samples is computed through Euclidean distance of their embeddings. NOMAD is fully unsupervised and can be used in general perceptual audio tasks for audio analysis e.g. quality assessment and generative tasks such as speech enhancement and speech synthesis. The proposed method is evaluated with 3 tasks. Ranking degradation intensity, predicting speech quality, and as a loss function for speech enhancement. Results indicate NOMAD outperforms other non-matching reference approaches in both ranking degradation intensity and quality assessment, exhibiting competitive performance with full-reference audio metrics. NOMAD demonstrates a promising technique that mimics human capabilities in assessing audio quality with non-matching references to learn perceptual embeddings without the need for human-generated labels.
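
A minimal sketch of the training and inference ideas: a triplet loss over embeddings, where NSIM-ranked degradation levels would decide which sample acts as the positive, and a Euclidean distance between embeddings at inference. The encoder, input features, and margin are placeholders, not the NOMAD implementation.

```python
import torch
from torch import nn

embed = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
triplet = nn.TripletMarginLoss(margin=1.0)  # margin loss over Euclidean distances

# Placeholder input features for three audio samples per batch element. Which sample
# is "positive" would be decided by NSIM: its degradation level is closer to the
# anchor's than the negative's is.
anchor_feat = torch.randn(8, 128)
positive_feat = torch.randn(8, 128)
negative_feat = torch.randn(8, 128)

# Training step: pull similarly degraded audio together in the embedding space
# and push differently degraded audio apart, with no human-generated labels.
loss = triplet(embed(anchor_feat), embed(positive_feat), embed(negative_feat))
loss.backward()

# Inference: a NOMAD-style score between any two samples is the Euclidean distance
# between their embeddings; the reference need not contain the same content.
with torch.no_grad():
    score = torch.dist(embed(anchor_feat[:1]), embed(negative_feat[:1]))
print(float(loss), float(score))
```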

Semantic Proximity Alignment: Towards Human Perception-consistent Audio Tagging by Aligning with Label Text Description

  • paper_url: http://arxiv.org/abs/2309.16265
  • repo_url: None
  • paper_authors: Wuyang Liu, Yanzhen Ren
  • for: To improve the accuracy of audio tagging models and their consistency with human perception, so that models better capture the semantic proximity and hierarchy relationships between sound events.
  • methods: Auxiliary text descriptions of sound events are used for Semantic Proximity Alignment (SPA): audio features are aligned with the text features of the corresponding labels, injecting hierarchy and proximity information into the audio encoder and improving the Ontology-aware mean Average Precision (OmAP) metric (an alignment sketch follows this entry).
  • results: Compared with models trained on one-hot labels alone, SPA-trained models improve OmAP by +1.8, and human evaluations show that their predictions are more consistent with human perception.
    Abstract Most audio tagging models are trained with one-hot labels as supervised information. However, one-hot labels treat all sound events equally, ignoring the semantic hierarchy and proximity relationships between sound events. In contrast, the event descriptions contains richer information, describing the distance between different sound events with semantic proximity. In this paper, we explore the impact of training audio tagging models with auxiliary text descriptions of sound events. By aligning the audio features with the text features of corresponding labels, we inject the hierarchy and proximity information of sound events into audio encoders, improving the performance while making the prediction more consistent with human perception. We refer to this approach as Semantic Proximity Alignment (SPA). We use Ontology-aware mean Average Precision (OmAP) as the main evaluation metric for the models. OmAP reweights the false positives based on Audioset ontology distance and is more consistent with human perception compared to mAP. Experimental results show that the audio tagging models trained with SPA achieve higher OmAP compared to models trained with one-hot labels solely (+1.8 OmAP). Human evaluations also demonstrate that the predictions of SPA models are more consistent with human perception.
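
The alignment idea, pulling an audio clip's embedding toward the text embedding of its label description so that semantically close events end up close in a shared space, can be sketched with a contrastive-style loss as below. The encoders, feature dimensions, and temperature are placeholders; the paper's actual text encoder and loss are not reproduced here.

```python
import torch
from torch import nn
import torch.nn.functional as F

audio_encoder = nn.Linear(128, 256)  # stands in for a real audio tagging backbone
text_encoder = nn.Linear(300, 256)   # stands in for an encoder of label text descriptions

audio_feats = torch.randn(16, 128)        # a batch of audio clips (placeholder features)
label_text_feats = torch.randn(16, 300)   # text features of each clip's label description

a = F.normalize(audio_encoder(audio_feats), dim=-1)
t = F.normalize(text_encoder(label_text_feats), dim=-1)

# Alignment objective: each clip should be most similar to the text of its own label.
# Because descriptions of semantically close events are themselves close, the audio
# space inherits the hierarchy and proximity structure of the ontology.
logits = a @ t.T / 0.07  # the temperature value is an arbitrary choice here
spa_loss = F.cross_entropy(logits, torch.arange(16))
spa_loss.backward()
print(float(spa_loss))
```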

PP-MeT: a Real-world Personalized Prompt based Meeting Transcription System

  • paper_url: http://arxiv.org/abs/2309.16247
  • repo_url: None
  • paper_authors: Xiang Lyu, Yuhang Cao, Qing Wang, Jingjing Yin, Yuguang Yang, Pengpeng Zou, Yanni Hu, Heng Lu
  • for: To improve the accuracy and applicability of multi-speaker ASR systems in real-world scenarios.
  • methods: Target-speaker embeddings are used as prompts in the TS-VAD and TS-ASR modules (a prompting sketch follows this entry).
  • results: On the test set of the M2MeT2.0 Challenge dataset, the system achieves a cp-CER of 11.27%, ranking first under both fixed and open training conditions.
    Abstract Speaker-attributed automatic speech recognition (SA-ASR) improves the accuracy and applicability of multi-speaker ASR systems in real-world scenarios by assigning speaker labels to transcribed texts. However, SA-ASR poses unique challenges due to factors such as speaker overlap, speaker variability, background noise, and reverberation. In this study, we propose the PP-MeT system, a real-world personalized prompt based meeting transcription system, which consists of a clustering system, target-speaker voice activity detection (TS-VAD), and TS-ASR. Specifically, we utilize target-speaker embedding as a prompt in TS-VAD and TS-ASR modules in our proposed system. In contrast with previous systems, we fully leverage pre-trained models for system initialization, thereby bestowing our approach with heightened generalizability and precision. Experiments on M2MeT2.0 Challenge dataset show that our system achieves a cp-CER of 11.27% on the test set, ranking first in both fixed and open training conditions.
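
The prompting idea, conditioning TS-VAD and TS-ASR on a target-speaker embedding by prepending it to the encoder input, is sketched below in generic PyTorch. It illustrates the pattern only; shapes and modules are placeholders, not the PP-MeT system.

```python
import torch
from torch import nn

class PromptedEncoder(nn.Module):
    """Prepend a target-speaker embedding as a prompt token to the frame sequence."""

    def __init__(self, feat_dim=80, spk_dim=192, d_model=256):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, d_model)
        self.spk_proj = nn.Linear(spk_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, frames, speaker_embedding):
        prompt = self.spk_proj(speaker_embedding).unsqueeze(1)        # (batch, 1, d_model)
        tokens = torch.cat([prompt, self.frame_proj(frames)], dim=1)  # prompt + frames
        out = self.encoder(tokens)
        # The frame outputs are now conditioned on whose speech to look for; a TS-VAD
        # head would predict that speaker's activity, a TS-ASR head that speaker's text.
        return out[:, 1:]  # drop the prompt position

model = PromptedEncoder()
frames = torch.randn(2, 300, 80)     # log-mel frames of a meeting segment (placeholder)
spk_emb = torch.randn(2, 192)        # enrolled target-speaker embedding (placeholder)
print(model(frames, spk_emb).shape)  # torch.Size([2, 300, 256])
```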

LAE-ST-MoE: Boosted Language-Aware Encoder Using Speech Translation Auxiliary Task for E2E Code-switching ASR

  • paper_url: http://arxiv.org/abs/2309.16178
  • repo_url: None
  • paper_authors: Guodong Ma, Wenxuan Wang, Yuke Li, Yuting Yang, Binbin Du, Haoran Fu
  • for: This paper addresses the confusion between languages in code-switching (CS) automatic speech recognition (ASR), especially around language switches.
  • methods: The proposed LAE-ST-MoE framework combines a language-aware encoder (LAE) with a speech translation (ST) auxiliary task so that contextual information between languages is learned, and uses a task-based mixture-of-experts module that routes the ASR and ST tasks to separate feed-forward networks (a routing sketch follows this entry).
  • results: Compared with the LAE-based CTC baseline, the LAE-ST-MoE model achieves a 9.26% mix error reduction on the CS test set, and the trained model can also translate CS speech into Mandarin or English text.
    Abstract Recently, to mitigate the confusion between different languages in code-switching (CS) automatic speech recognition (ASR), the conditionally factorized models, such as the language-aware encoder (LAE), explicitly disregard the contextual information between different languages. However, this information may be helpful for ASR modeling. To alleviate this issue, we propose the LAE-ST-MoE framework. It incorporates speech translation (ST) tasks into LAE and utilizes ST to learn the contextual information between different languages. It introduces a task-based mixture of expert modules, employing separate feed-forward networks for the ASR and ST tasks. Experimental results on the ASRU 2019 Mandarin-English CS challenge dataset demonstrate that, compared to the LAE-based CTC, the LAE-ST-MoE model achieves a 9.26% mix error reduction on the CS test with the same decoding parameter. Moreover, the well-trained LAE-ST-MoE model can perform ST tasks from CS speech to Mandarin or English text.
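
The task-based mixture-of-experts idea, one feed-forward expert for ASR and one for ST selected by the task rather than by a learned router, can be sketched as below under assumed dimensions; this is not the LAE-ST-MoE implementation.

```python
import torch
from torch import nn

class TaskMoEFeedForward(nn.Module):
    """Route hidden states to a task-specific feed-forward expert (ASR or ST)."""

    def __init__(self, d_model=256, d_ff=1024, tasks=("asr", "st")):
        super().__init__()
        self.experts = nn.ModuleDict({
            t: nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for t in tasks
        })

    def forward(self, hidden, task: str):
        # Hard, task-based routing: the task id picks the expert, no gating network.
        return hidden + self.experts[task](hidden)  # residual, as in a transformer FFN

layer = TaskMoEFeedForward()
hidden = torch.randn(4, 120, 256)    # shared encoder states for a code-switched utterance
asr_out = layer(hidden, task="asr")  # transcription branch
st_out = layer(hidden, task="st")    # translation branch (the auxiliary task)
print(asr_out.shape, st_out.shape)
```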

Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR

  • paper_url: http://arxiv.org/abs/2309.16093
  • repo_url: None
  • paper_authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
  • for: This paper proposes a cross-modality knowledge transfer (CMKT) learning framework that transfers linguistic knowledge from a pretrained language model (PLM) to the acoustic encoder to improve automatic speech recognition (ASR).
  • methods: The framework is built on a connectionist temporal classification (CTC) based ASR system and applies hierarchical acoustic alignments with the linguistic representation. Sinkhorn attention is used for the cross-modality alignment process, with standard transformer attention as a special case (a Sinkhorn sketch follows this entry).
  • results: With CTC greedy decoding and no language model, the system achieves state-of-the-art character error rates (CERs) of 3.64% and 3.94% on the AISHELL-1 development and test sets, corresponding to relative improvements of 34.18% and 34.88% over the baseline CTC-ASR system.
    Abstract Due to the modality discrepancy between textual and acoustic modeling, efficiently transferring linguistic knowledge from a pretrained language model (PLM) to acoustic encoding for automatic speech recognition (ASR) still remains a challenging task. In this study, we propose a cross-modality knowledge transfer (CMKT) learning framework in a temporal connectionist temporal classification (CTC) based ASR system where hierarchical acoustic alignments with the linguistic representation are applied. Additionally, we propose the use of Sinkhorn attention in cross-modality alignment process, where the transformer attention is a special case of this Sinkhorn attention process. The CMKT learning is supposed to compel the acoustic encoder to encode rich linguistic knowledge for ASR. On the AISHELL-1 dataset, with CTC greedy decoding for inference (without using any language model), we achieved state-of-the-art performance with 3.64% and 3.94% character error rates (CERs) for the development and test sets, which corresponding to relative improvements of 34.18% and 34.88% compared to the baseline CTC-ASR system, respectively.
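
Sinkhorn attention replaces the single row-wise softmax of standard attention with alternating row and column normalizations of the score matrix, pushing it toward a doubly stochastic alignment; with zero extra iterations it reduces to ordinary softmax attention, which is the special-case relationship mentioned above. The sketch below illustrates the normalization only and is not the paper's implementation.

```python
import torch

def sinkhorn_attention(q, k, v, n_iters: int = 5, eps: float = 1e-8):
    """Attention via Sinkhorn normalization of the score matrix.

    q: (batch, len_q, dim); k, v: (batch, len_k, dim). With n_iters = 0 only the
    initial row normalization remains, i.e. ordinary softmax attention.
    """
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    attn = torch.softmax(scores, dim=-1)  # start from row-normalized weights
    for _ in range(n_iters):
        attn = attn / (attn.sum(dim=-2, keepdim=True) + eps)  # normalize columns
        attn = attn / (attn.sum(dim=-1, keepdim=True) + eps)  # re-normalize rows
    return attn @ v

# Toy cross-modal alignment: acoustic states attend over linguistic (PLM) states.
acoustic = torch.randn(2, 50, 64)
linguistic = torch.randn(2, 20, 64)
aligned = sinkhorn_attention(acoustic, linguistic, linguistic)
print(aligned.shape)  # torch.Size([2, 50, 64])
```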