cs.SD - 2023-09-10

Multimodal Fish Feeding Intensity Assessment in Aquaculture

  • paper_url: http://arxiv.org/abs/2309.05058
  • repo_url: None
  • paper_authors: Meng Cui, Xubo Liu, Haohe Liu, Zhuangzhuang Du, Tao Chen, Guoping Lian, Daoliang Li, Wenwu Wang
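  • for: This study addresses fish feeding intensity assessment (FFIA), which evaluates how fish appetite changes during feeding, for industrial aquaculture applications.
  • methods: The study takes a multimodal approach that combines single-modality pre-trained models with modality-fusion methods, with benchmark studies on the large-scale audio-visual dataset AV-FFIA.
  • results: The multimodal approach substantially outperforms the single-modality approach, especially in noisy environments. The authors also propose a unified model, U-FFIA, that handles audio, visual, or audio-visual input and performs better than or on par with modality-specific models at significantly lower computational cost.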
    Abstract Fish feeding intensity assessment (FFIA) aims to evaluate the intensity change of fish appetite during the feeding process, which is vital in industrial aquaculture applications. The main challenges surrounding FFIA are two-fold. 1) robustness: existing work has mainly leveraged single-modality (e.g., vision, audio) methods, which have a high sensitivity to input noise. 2) efficiency: FFIA models are generally expected to be employed on devices. This presents a challenge in terms of computational efficiency. In this work, we first introduce an audio-visual dataset, called AV-FFIA. AV-FFIA consists of 27,000 labeled audio and video clips that capture different levels of fish feeding intensity. To our knowledge, AV-FFIA is the first large-scale multimodal dataset for FFIA research. Then, we introduce a multimodal approach for FFIA by leveraging single-modality pre-trained models and modality-fusion methods, with benchmark studies on AV-FFIA. Our experimental results indicate that the multimodal approach substantially outperforms the single-modality based approach, especially in noisy environments. While multimodal approaches provide a performance gain for FFIA, they inherently increase the computational cost. To overcome this issue, we further present a novel unified model, termed U-FFIA. U-FFIA is a single model capable of processing audio, visual, or audio-visual modalities, by leveraging modality dropout during training and knowledge distillation from single-modality pre-trained models. We demonstrate that U-FFIA can achieve performance better than or on par with the state-of-the-art modality-specific FFIA models, with significantly lower computational overhead. Our proposed U-FFIA approach enables a more robust and efficient method for FFIA, with the potential to contribute to improved management practices and sustainability in aquaculture.
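    Summary Fish feeding intensity assessment (FFIA) evaluates how fish appetite changes during feeding, which matters greatly in industrial aquaculture. Two challenges stand out: 1) robustness: existing work relies mainly on single-modality (e.g., vision, audio) methods, which are highly sensitive to input noise; 2) efficiency: FFIA models are usually expected to run on devices, which constrains computation. The authors first introduce AV-FFIA, an audio-visual dataset of 27,000 labeled audio and video clips covering different levels of feeding intensity; to their knowledge it is the first large-scale multimodal dataset for FFIA. They then present a multimodal approach built on single-modality pre-trained models and modality-fusion methods, benchmarked on AV-FFIA. Experiments show that the multimodal approach substantially outperforms single-modality approaches, especially in noisy environments, but at a higher computational cost. To address this, they propose U-FFIA, a single model that can process audio, visual, or audio-visual input, trained with modality dropout and knowledge distillation from single-modality pre-trained models. U-FFIA performs better than or on par with state-of-the-art modality-specific FFIA models at significantly lower computational overhead, offering a more robust and efficient FFIA method with the potential to improve management practices and sustainability in aquaculture.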
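
The abstract names modality dropout as one ingredient of U-FFIA but does not spell out the mechanism. The following is a minimal sketch of the general idea only, assuming that a dropped modality is replaced by zeros and that the drop probabilities are 0.25 each; the function name, feature vectors, and probabilities are hypothetical and are not taken from the paper.

```python
import random

def modality_dropout(audio, video, rng, p_drop_audio=0.25, p_drop_video=0.25):
    """Randomly blank out at most one modality for a single training example.

    Assumption: a dropped modality is replaced by zeros of the same length.
    Returns (audio, video, mode), where mode names what the model still sees.
    """
    u = rng.random()
    if u < p_drop_audio:
        # Audio is dropped, so the model must rely on video alone.
        return [0.0] * len(audio), list(video), "video"
    if u < p_drop_audio + p_drop_video:
        # Video is dropped, so the model must rely on audio alone.
        return list(audio), [0.0] * len(video), "audio"
    # Neither modality is dropped.
    return list(audio), list(video), "audio-visual"

rng = random.Random(0)            # fixed seed so the run is repeatable
audio_feat = [0.2, -0.1, 0.7]     # hypothetical audio embedding
video_feat = [1.0, 0.5]           # hypothetical video embedding

# Draw many examples and record which input configuration each one gets.
modes = [modality_dropout(audio_feat, video_feat, rng)[2] for _ in range(1000)]
counts = {m: modes.count(m) for m in ("audio", "video", "audio-visual")}
print(counts)
```

Because the two drop branches are mutually exclusive, at least one modality always survives, so every training example remains usable. Exposing one model to all three input configurations during training is what lets it accept audio, visual, or audio-visual input at inference time.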

Gray Jedi MVDR Post-filtering

  • paper_url: http://arxiv.org/abs/2309.05057
  • repo_url: https://github.com/FrancoisGrondin/mvdrpf
  • paper_authors: François Grondin, Caleb Rascón
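  • for: Improving speech quality in scenarios with multiple speech sources.
  • methods: Uses deep-learning-based speech enhancement models, with Minimum Variance Distortionless Response (MVDR) providing the interference estimate alongside the target speech estimate for postfiltering.
  • results: Improves enhancement performance over a single-input baseline far more than increasing model complexity does, and requires fewer computing resources for postfiltering.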
    Abstract Spatial filters can exploit deep-learning-based speech enhancement models to increase their reliability in scenarios with multiple speech sources. To further improve speech quality, it is common to perform postfiltering on the estimated target speech obtained with spatial filtering. In this work, Minimum Variance Distortionless Response (MVDR) is employed to provide the interference estimation, along with the estimation of the target speech, to be later used for postfiltering. This improves the enhancement performance over a single-input baseline in a far more significant way than by increasing the model's complexity. Results suggest that fewer computing resources are required for postfiltering when provided with both target and interference signals, which is a step forward in developing an online speech enhancement system for multi-speech scenarios.
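    Summary Spatial filters can use deep-learning-based speech enhancement models to become more reliable in scenarios with multiple speech sources. To improve speech quality further, postfiltering is commonly applied to the target speech estimate produced by spatial filtering. This work uses Minimum Variance Distortionless Response (MVDR) to supply an interference estimate together with the target speech estimate for later postfiltering. Doing so improves enhancement performance over a single-input baseline far more than increasing model complexity does. The results suggest that postfiltering needs fewer computing resources when given both target and interference signals, a step toward an online speech enhancement system for multi-speech scenarios.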
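
For readers unfamiliar with MVDR, the sketch below computes the textbook MVDR weights w = R^{-1} d / (d^H R^{-1} d) for a two-microphone array at a single frequency bin, then checks the distortionless constraint w^H d = 1. It is not the authors' implementation: the covariance matrix, steering vector, and helper names are hypothetical values chosen for illustration.

```python
def inv2x2(m):
    """Invert a 2x2 complex matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("covariance matrix is singular; add diagonal loading")
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Multiply matrix m by column vector v."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def inner(u, v):
    """Hermitian inner product u^H v."""
    return sum(ui.conjugate() * vi for ui, vi in zip(u, v))

def mvdr_weights(noise_cov, steering):
    """Textbook MVDR weights: w = R^{-1} d / (d^H R^{-1} d)."""
    rinv_d = matvec(inv2x2(noise_cov), steering)
    denom = inner(steering, rinv_d)   # d^H R^{-1} d, real for Hermitian R
    return [x / denom for x in rinv_d]

# Hypothetical Hermitian interference-plus-noise covariance for one bin.
noise_cov = [[1.0 + 0j, 0.3 + 0.1j],
             [0.3 - 0.1j, 1.0 + 0j]]
# Hypothetical steering vector: microphone 2 lags microphone 1 by 90 degrees.
steering = [1.0 + 0j, 0.0 + 1.0j]

w = mvdr_weights(noise_cov, steering)
response = inner(w, steering)         # w^H d, gain toward the target direction
print(abs(response))
```

The unit response toward the target is what "distortionless" refers to: the beamformer minimizes output power from interference and noise while passing the target direction unchanged.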