eess.IV - 2023-07-08

Lightweight Improved Residual Network for Efficient Inverse Tone Mapping

  • paper_url: http://arxiv.org/abs/2307.03998
  • repo_url: None
  • paper_authors: Liqi Xue, Tianyi Xu, Yongbao Song, Yan Liu, Lei Zhang, Xiantong Zhen, Jun Xu
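  • for: Efficient inverse tone mapping (ITM), i.e. converting SDR images to HDR images.
  • methods: Proposes a lightweight Improved Residual Network (IRNet) built on an enhanced residual block, the Improved Residual Block (IRB), which extracts and fuses multi-layer features for fine-grained HDR image reconstruction.
  • results: Achieves state-of-the-art performance on both the ITM and joint SR-ITM tasks across three benchmark datasets.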
    Abstract The display devices like HDR10 televisions are increasingly prevalent in our daily life for visualizing high dynamic range (HDR) images. But the majority of media images on the internet remain in 8-bit standard dynamic range (SDR) format. Therefore, converting SDR images to HDR ones by inverse tone mapping (ITM) is crucial to unlock the full potential of abundant media images. However, existing ITM methods are usually developed with complex network architectures requiring huge computational costs. In this paper, we propose a lightweight Improved Residual Network (IRNet) by enhancing the power of popular residual block for efficient ITM. Specifically, we propose a new Improved Residual Block (IRB) to extract and fuse multi-layer features for fine-grained HDR image reconstruction. Experiments on three benchmark datasets demonstrate that our IRNet achieves state-of-the-art performance on both the ITM and joint SR-ITM tasks. The code, models and data will be publicly available at https://github.com/ThisisVikki/ITM-baseline.
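    Summary Display devices such as HDR10 televisions are increasingly common for viewing high dynamic range (HDR) images, yet most images on the internet remain in 8-bit standard dynamic range (SDR) format. Converting SDR images to HDR by inverse tone mapping (ITM) is therefore important for unlocking the potential of this abundant media. Existing ITM methods typically rely on complex network architectures with high computational cost. The paper proposes a lightweight Improved Residual Network (IRNet) that strengthens the popular residual block, using a new Improved Residual Block (IRB) to extract and fuse multi-layer features for fine-grained HDR image reconstruction. Experiments show that IRNet reaches state-of-the-art performance on both the ITM and joint SR-ITM tasks. The code, models, and data will be released publicly on GitHub.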

Ariadne’s Thread:Using Text Prompts to Improve Segmentation of Infected Areas from Chest X-ray images

  • paper_url: http://arxiv.org/abs/2307.03942
  • repo_url: https://github.com/junelin2333/languidemedseg-miccai2023
  • paper_authors: Yi Zhong, Mengqiu Xu, Kongming Liang, Kaixin Chen, Ming Wu
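  • for: Improve the assessment of lung disease with a language-driven medical image segmentation method that segments infected areas in chest X-ray images more accurately.
  • methods: Uses text prompts to guide the segmentation, and compares against uni-modal (image-only) methods on the QaTa-COV19 dataset.
  • results: On QaTa-COV19, the language-driven method improves the Dice score by at least 6.09% over uni-modal methods, and it has a significant advantage in the amount of training data required.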
    Abstract Segmentation of the infected areas of the lung is essential for quantifying the severity of lung diseases such as pulmonary infections. Existing medical image segmentation methods are almost all uni-modal, image-based methods. However, these image-only methods tend to produce inaccurate results unless trained with large amounts of annotated data. To overcome this challenge, we propose a language-driven segmentation method that uses text prompts to improve the segmentation result. Experiments on the QaTa-COV19 dataset indicate that our method improves the Dice score by at least 6.09% compared with the uni-modal methods. In addition, our extended study reveals the flexibility of multi-modal methods in terms of the information granularity of text and demonstrates that multi-modal methods have a significant advantage over image-only methods in terms of the size of training data required.
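    Summary Accurately segmenting infected regions is essential for assessing lung disease. Existing medical image segmentation methods are almost all uni-modal and image-based, and they tend to give inaccurate results unless trained on large amounts of annotated data. To address this, the paper proposes a language-driven segmentation method that uses text prompts to improve the segmentation result. On the QaTa-COV19 dataset the method improves the Dice score by at least 6.09% over uni-modal methods. An extended study further shows the flexibility of multi-modal methods with respect to the granularity of the text information, and their advantage in the amount of training data required.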
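The Dice score quoted for this paper measures the overlap between a predicted mask and a ground-truth mask: twice the intersection divided by the sum of the two mask sizes. The following is a minimal sketch of that formula, not the authors' evaluation code; the function name `dice_score`, the toy masks, and the empty-mask convention are assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked positive in both masks.
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    # Total number of positive pixels across the two masks.
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_target if t)

    if total == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0
    return 2.0 * intersection / total


# Toy 2x3 masks (hypothetical data): 3 positives each, 2 of them shared.
pred_mask = [[0, 1, 1],
             [0, 1, 0]]
true_mask = [[0, 1, 0],
             [0, 1, 1]]
score = dice_score(pred_mask, true_mask)  # 2 * 2 / (3 + 3) = 0.666...
```

A value of 1.0 means the two masks coincide, and 0.0 means they share no positive pixel.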

StyleGAN3: Generative Networks for Improving the Equivariance of Translation and Rotation

  • paper_url: http://arxiv.org/abs/2307.03898
  • repo_url: None
  • paper_authors: Tianlei Zhu, Junqi Chen, Renzhe Zhu, Gaurav Gupta
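  • for: Compare the image-generation performance of StyleGAN2 and two modified versions of StyleGAN3.
  • methods: Uses the FFHQ dataset and evaluates the models with the FID, EQ-T, and EQ-R metrics.
  • results: Finds that StyleGAN3 is the better generative network for improving equivariance; the authors expect the findings to benefit the creation of animation and video.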
    Abstract StyleGAN can use style to affect facial posture and identity features, and noise to affect hair, wrinkles, skin color, and other details. The outcomes of this image processing vary slightly between different versions of StyleGAN. The comparison of performance differences between StyleGAN2 and the two modified versions of StyleGAN3 is therefore the main focus of this study. We used the FFHQ dataset and assessed the models with FID, EQ-T, and EQ-R. We found that StyleGAN3 is a better generative network for improving equivariance. Our findings have a positive impact on the creation of animation and videos.
    Summary StyleGAN uses style to influence facial pose and identity features, and noise to influence details such as hair, skin color, and wrinkles. The image results differ subtly between versions of StyleGAN, so the study compares the performance of StyleGAN2 and two modified versions of StyleGAN3. The FFHQ dataset is used, and the models are evaluated with three metrics: FID, EQ-T, and EQ-R. The study finds that StyleGAN3 is the better generative network for improving equivariance, which the authors expect to benefit the creation of animation and video.

TBSS++: A novel computational method for Tract-Based Spatial Statistics

  • paper_url: http://arxiv.org/abs/2307.05387
  • repo_url: None
  • paper_authors: Davood Karimi, Hamza Kebiri, Ali Gholipour
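  • for: Improve cross-subject, tract-specific analysis of brain white matter in diffusion-weighted magnetic resonance imaging (dMRI).
  • methods: Proposes a new computational framework that overcomes the shortcomings of existing methods through (i) accurate segmentation of the white matter tracts and (ii) precise registration of cross-subject data.
  • results: Compared with TBSS, the method offers significantly higher reproducibility and robustness to data perturbations.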
    Abstract Diffusion-weighted magnetic resonance imaging (dMRI) is widely used to assess the brain white matter. One of the most common computations in dMRI involves cross-subject tract-specific analysis, whereby dMRI-derived biomarkers are compared between cohorts of subjects. The accuracy and reliability of these studies hinges on the ability to compare precisely the same white matter tracts across subjects. This is an intricate and error-prone computation. Existing computational methods such as Tract-Based Spatial Statistics (TBSS) suffer from a host of shortcomings and limitations that can seriously undermine the validity of the results. We present a new computational framework that overcomes the limitations of existing methods via (i) accurate segmentation of the tracts, and (ii) precise registration of data from different subjects/scans. The registration is based on fiber orientation distributions. To further improve the alignment of cross-subject data, we create detailed atlases of white matter tracts. These atlases serve as an unbiased reference space where the data from all subjects is registered for comparison. Extensive evaluations show that, compared with TBSS, our proposed framework offers significantly higher reproducibility and robustness to data perturbations. Our method promises a drastic improvement in accuracy and reproducibility of cross-subject dMRI studies that are routinely used in neuroscience and medical research.
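    Summary Diffusion-weighted magnetic resonance imaging (dMRI) is widely used to assess brain white matter. One of its most common computations is cross-subject tract-specific analysis, in which dMRI-derived biomarkers are compared between cohorts of subjects. The accuracy and reliability of such studies depend on comparing precisely the same white matter tracts across subjects. Existing methods such as Tract-Based Spatial Statistics (TBSS) have serious shortcomings that can undermine the validity of the results. The paper presents a new computational framework that overcomes these limitations through (i) accurate segmentation of the tracts and (ii) precise registration of data from different subjects and scans, with registration based on fiber orientation distributions. To further improve cross-subject alignment, detailed atlases of white matter tracts are created and serve as an unbiased reference space in which all subjects' data are registered for comparison. Extensive evaluations show that, compared with TBSS, the framework is more reproducible and more robust to data perturbations. The authors expect it to markedly improve the accuracy and reproducibility of cross-subject dMRI studies in neuroscience and medical research.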

Invariant Scattering Transform for Medical Imaging

  • paper_url: http://arxiv.org/abs/2307.04771
  • repo_url: None
  • paper_authors: Nafisa Labiba Ishrat Huda, Angona Biswas, MD Abdullah Al Nasim, Md. Fahim Rahman, Shoaib Ahmed
  • for: Researchers and practitioners in medical image analysis and deep learning, particularly those interested in using the scattering transform for image classification.
  • methods: Discusses the scattering transform, a wavelet-based technique that merges signal processing with deep learning and builds a useful signal representation for image classification, implemented in a deep convolutional network.
  • results: Presents a step-by-step case study of the scattering transform for medical image analysis.
    Abstract The invariant scattering transform introduces a new area of research that merges signal processing with deep learning for computer vision. Deep learning algorithms are now able to solve a variety of problems in the medical sector. Medical images are used to detect diseases such as brain cancer or tumors, Alzheimer's disease, breast cancer, Parkinson's disease, and many others. During the pandemic in 2020, machine learning and deep learning played a critical role in detecting COVID-19, including mutation analysis, prediction, diagnosis, and decision making. Medical images such as X-ray, magnetic resonance imaging (MRI), and CT scans are used for detecting diseases. Another deep learning method for medical imaging is the scattering transform, a wavelet technique that builds a useful signal representation for image classification and is impactful for medical image classification problems. This article discusses the scattering transform as an efficient system for medical image analysis, in which the signal information is scattered through a deep convolutional network. A step-by-step case study is presented.
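    Summary The invariant scattering transform opens a research area that combines signal processing with deep learning for computer vision. Deep learning algorithms can now address many problems in medicine: medical images are used to detect brain cancer or tumors, Alzheimer's disease, breast cancer, Parkinson's disease, and many other conditions. During the 2020 pandemic, machine learning and deep learning played a key role in detecting COVID-19, including mutation analysis, prediction, diagnosis, and decision making. X-ray, MRI (magnetic resonance imaging), and CT scans are the images typically used for detection. The scattering transform is a further deep learning method for medical imaging; it builds a useful signal representation for image classification. The article discusses the scattering transform as an efficient system for medical image analysis, in which the scattered signal information is implemented in a deep convolutional network, and it provides a step-by-step case study.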

Coordinate-based neural representations for computational adaptive optics in widefield microscopy

  • paper_url: http://arxiv.org/abs/2307.03812
  • repo_url: https://github.com/iksungk/cocoa
  • paper_authors: Iksung Kang, Qinrong Zhang, Stella X. Yu, Na Ji
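  • for: Non-invasive widefield imaging of biological structures, improving image quality in complex specimens through machine-learning-based computational adaptive optics.
  • methods: CoCoA, a self-supervised algorithm that combines coordinate-based neural representations with a forward physics model, jointly estimates the wavefront and extracts three-dimensional structural information from a single 3D image stack, without an external training dataset.
  • results: CoCoA was implemented for widefield imaging of mouse brain tissue, validated against direct-wavefront-sensing-based adaptive optics, and used for the first in vivo widefield mouse brain imaging with machine-learning-based adaptive optics. The limiting factors of its performance were systematically explored and quantified.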
    Abstract Widefield microscopy is widely used for non-invasive imaging of biological structures at subcellular resolution. When applied to complex specimen, its image quality is degraded by sample-induced optical aberration. Adaptive optics can correct wavefront distortion and restore diffraction-limited resolution but require wavefront sensing and corrective devices, increasing system complexity and cost. Here, we describe a self-supervised machine learning algorithm, CoCoA, that performs joint wavefront estimation and three-dimensional structural information extraction from a single input 3D image stack without the need for external training dataset. We implemented CoCoA for widefield imaging of mouse brain tissues and validated its performance with direct-wavefront-sensing-based adaptive optics. Importantly, we systematically explored and quantitatively characterized the limiting factors of CoCoA's performance. Using CoCoA, we demonstrated the first in vivo widefield mouse brain imaging using machine-learning-based adaptive optics. Incorporating coordinate-based neural representations and a forward physics model, the self-supervised scheme of CoCoA should be applicable to microscopy modalities in general.
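    Summary Widefield microscopy is widely used for non-invasive imaging of biological structures, but in complex specimens its image quality is degraded by sample-induced optical aberration. Adaptive optics can correct wavefront distortion and restore diffraction-limited resolution, but it requires wavefront sensing and corrective devices, which increases system complexity and cost. The paper describes CoCoA, a self-supervised machine learning algorithm that jointly performs wavefront estimation and three-dimensional structural information extraction from a single input 3D image stack, with no external training set. CoCoA was implemented for widefield imaging and validated against direct-wavefront-sensing-based adaptive optics. The authors systematically explore and quantify the factors that limit its performance, and use it to demonstrate the first in vivo widefield mouse brain imaging with machine-learning-based adaptive optics. Because it combines coordinate-based neural representations with a forward physics model, the self-supervised scheme should be applicable to microscopy modalities in general.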

Thoracic Cartilage Ultrasound-CT Registration using Dense Skeleton Graph

  • paper_url: http://arxiv.org/abs/2307.03800
  • repo_url: None
  • paper_authors: Zhongliang Jiang, Chenyang Li, Xuesong Li, Nassir Navab
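  • for: Autonomous ultrasound (US) imaging, in particular thoracic applications where bone structures with high acoustic impedance lie beneath the skin.
  • methods: A graph-based non-rigid registration that transfers planned paths from a CT atlas to the current setup by using subcutaneous bone surface features rather than the skin surface.
  • results: Planned trajectories are mapped effectively from CT to the current setup for displaying US views through the limited intercostal space, with a Hausdorff distance of 9.48 ± 0.27 mm and a path transfer error of 2.21 ± 1.11 mm.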
    Abstract Autonomous ultrasound (US) imaging has gained increased interest recently, and it has been seen as a potential solution to overcome the limitations of free-hand US examinations, such as inter-operator variations. However, it is still challenging to accurately map planned paths from a generic atlas to individual patients, particularly for thoracic applications with high acoustic-impedance bone structures under the skin. To address this challenge, a graph-based non-rigid registration is proposed to enable transferring planned paths from the atlas to the current setup by explicitly considering subcutaneous bone surface features instead of the skin surface. To this end, the sternum and cartilage branches are segmented using a template matching to assist coarse alignment of US and CT point clouds. Afterward, a directed graph is generated based on the CT template. Then, the self-organizing map using geographical distance is successively performed twice to extract the optimal graph representations for CT and US point clouds, individually. To evaluate the proposed approach, five cartilage point clouds from distinct patients are employed. The results demonstrate that the proposed graph-based registration can effectively map trajectories from CT to the current setup for displaying US views through limited intercostal space. The non-rigid registration results in terms of Hausdorff distance (Mean$\pm$SD) is 9.48$\pm$0.27 mm and the path transferring error in terms of Euclidean distance is 2.21$\pm$1.11 mm.
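    Summary Autonomous ultrasound (US) imaging has attracted growing interest as a way to overcome the limitations of free-hand US examinations. Accurately mapping planned paths from a generic atlas to an individual patient remains difficult, especially in thoracic applications where bone structures with high acoustic impedance lie under the skin. To address this, a graph-based non-rigid registration is proposed that transfers planned paths from the atlas to the current setup by explicitly considering subcutaneous bone surface features instead of the skin surface. The sternum and cartilage branches are segmented with template matching to support coarse alignment of the US and CT point clouds. A directed graph is then generated from the CT template, and a self-organizing map using geographical distance is run twice in succession to extract the optimal graph representations of the CT and US point clouds. Five cartilage point clouds from different patients are used for evaluation. The registration effectively maps trajectories from CT to the current setup; the non-rigid registration error in Hausdorff distance (mean ± SD) is 9.48 ± 0.27 mm, and the path transfer error in Euclidean distance is 2.21 ± 1.11 mm.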
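The registration error above is reported as a Hausdorff distance between point clouds. As a rough illustration of that metric only, the sketch below computes the classic symmetric (max-min) Hausdorff distance between two small 3D point sets in pure Python. The abstract does not say which variant the authors used, so the definition, the function names, and the toy coordinates here are assumptions rather than the paper's evaluation procedure.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest neighbour in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance: the worse of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical point clouds in millimetres (not data from the paper).
ct_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
us_points = [(0.0, 0.5, 0.0), (1.0, 0.5, 0.0), (2.0, 2.0, 0.0)]

distance = hausdorff(ct_points, us_points)  # 2.0, driven by the outlier (2, 2, 0)
```

Because it is a worst-case measure, a single poorly registered point dominates the result, which is why it is usually reported alongside an average error such as the Euclidean path transfer error. For large point clouds this brute-force O(n·m) loop would normally be replaced by a spatial index.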

Motion Magnification in Robotic Sonography: Enabling Pulsation-Aware Artery Segmentation

  • paper_url: http://arxiv.org/abs/2307.03698
  • repo_url: https://github.com/dianyehuang/robpmepasnn
  • paper_authors: Dianye Huang, Yuan Bi, Nassir Navab, Zhongliang Jiang
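  • for: Diagnosing and monitoring arterial diseases with ultrasound, which is non-invasive, radiation-free, and real-time.
  • methods: A pulsation-assisted segmentation neural network (PAS-NN) that uses cardiac-induced pulsation signals, extracted with motion magnification from sequential US images, as attention guidance to improve the accuracy and stability of artery segmentation.
  • results: In experiments on volunteers' carotid and radial arteries, PAS-NN achieves results comparable to the state of the art on the carotid and effectively improves segmentation of small vessels (radial artery).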
    Abstract Ultrasound (US) imaging is widely used for diagnosing and monitoring arterial diseases, mainly due to the advantages of being non-invasive, radiation-free, and real-time. In order to provide additional information to assist clinicians in diagnosis, the tubular structures are often segmented from US images. To improve the artery segmentation accuracy and stability during scans, this work presents a novel pulsation-assisted segmentation neural network (PAS-NN) by explicitly taking advantage of the cardiac-induced motions. Motion magnification techniques are employed to amplify the subtle motion within the frequency band of interest to extract the pulsation signals from sequential US images. The extracted real-time pulsation information can help to locate the arteries on cross-section US images; therefore, we explicitly integrated the pulsation into the proposed PAS-NN as attention guidance. Notably, a robotic arm is necessary to provide stable movement during US imaging since magnifying the target motions from the US images captured along a scan path is not manually feasible due to the hand tremor. To validate the proposed robotic US system for imaging arteries, experiments are carried out on volunteers' carotid and radial arteries. The results demonstrated that the PAS-NN could achieve comparable results as state-of-the-art on carotid and can effectively improve the segmentation performance for small vessels (radial artery).
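    Summary Ultrasound (US) imaging is widely used to diagnose and monitor arterial diseases because it is non-invasive, radiation-free, and real-time. To give clinicians additional diagnostic information, tubular structures are often segmented from US images. To improve the accuracy and stability of artery segmentation, the paper proposes a pulsation-assisted segmentation neural network (PAS-NN). Motion magnification amplifies the subtle motion within the frequency band of interest so that pulsation signals can be extracted from sequential US images. This real-time pulsation information helps locate arteries in cross-sectional US images, so it is integrated into PAS-NN as attention guidance. A robotic arm is needed to provide stable movement during US imaging. To validate the robotic US system, experiments were carried out on volunteers' carotid and radial arteries. PAS-NN achieves results comparable to the state of the art on the carotid artery and effectively improves segmentation of small vessels (radial artery).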

cs.SD - 2023-07-07

The CHiME-7 UDASE task: Unsupervised domain adaptation for conversational speech enhancement

  • paper_url: http://arxiv.org/abs/2307.03533
  • repo_url: None
  • paper_authors: Simon Leglaive, Léonie Borne, Efthymios Tzinis, Mostafa Sadeghi, Matthieu Fraticelli, Scott Wisdom, Manuel Pariente, Daniel Pressnitzer, John R. Hershey
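  • for: Introduces the unsupervised domain adaptation for conversational speech enhancement (UDASE) task, which uses real-world noisy speech recordings to adapt speech enhancement models.
  • methods: Unsupervised domain adaptation of speech enhancement models using unlabeled noisy recordings from the target test domain (CHiME-5).
  • results: Describes the data, the task, and a baseline system for removing background noise from conversational speech.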
    Abstract Supervised speech enhancement models are trained using artificially generated mixtures of clean speech and noise signals, which may not match real-world recording conditions at test time. This mismatch can lead to poor performance if the test domain significantly differs from the synthetic training domain. In this paper, we introduce the unsupervised domain adaptation for conversational speech enhancement (UDASE) task of the 7th CHiME challenge. This task aims to leverage real-world noisy speech recordings from the target test domain for unsupervised domain adaptation of speech enhancement models. The target test domain corresponds to the multi-speaker reverberant conversational speech recordings of the CHiME-5 dataset, for which the ground-truth clean speech reference is not available. Given a CHiME-5 recording, the task is to estimate the clean, potentially multi-speaker, reverberant speech, removing the additive background noise. We discuss the motivation for the CHiME-7 UDASE task and describe the data, the task, and the baseline system.
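    Summary Supervised speech enhancement models are trained on artificially generated mixtures of clean speech and noise, which may not match real recording conditions. This mismatch can lead to poor performance at test time. The paper introduces the unsupervised domain adaptation for conversational speech enhancement (UDASE) task of the 7th CHiME challenge. The task aims to use real-world noisy speech recordings for unsupervised adaptation of speech enhancement models. The target test domain is the multi-speaker reverberant conversational speech of the CHiME-5 dataset, for which no clean speech reference is available. Given a CHiME-5 recording, the task is to estimate the clean, possibly multi-speaker, reverberant speech by removing the additive background noise. The paper presents the motivation for the CHiME-7 UDASE task and describes the data, the task, and the baseline system.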

Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments

  • paper_url: http://arxiv.org/abs/2307.03354
  • repo_url: None
  • paper_authors: Sara Papi, Peidong Wan, Junkun Chen, Jian Xue, Jinyu Li, Yashesh Gaur
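  • for: Streaming speech recognition and translation, generating ASR and ST outputs jointly with a single decoder.
  • methods: A joint token-level serialized output training method that interleaves source and target words by leveraging an off-the-shelf textual aligner.
  • results: In monolingual (it-en) and multilingual ({de,es,it}-en) settings the approach achieves the best quality-latency balance. Compared with separate ASR and ST models, output quality does not degrade and even improves, by an average of 1.1 WER and 0.4 BLEU in the multilingual case.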
    Abstract In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary. This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder. To produce ASR and ST content effectively with minimal latency, we propose a joint token-level serialized output training method that interleaves source and target words by leveraging an off-the-shelf textual aligner. Experiments in monolingual (it-en) and multilingual ({de,es,it}-en) settings demonstrate that our approach achieves the best quality-latency balance. With an average ASR latency of 1s and ST latency of 1.3s, our model shows no degradation or even improves output quality compared to separate ASR and ST models, yielding an average improvement of 1.1 WER and 0.4 BLEU in the multilingual case.
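    Summary In real applications, users often need both a translation and a transcription of speech to aid comprehension, especially in streaming scenarios where output must be generated incrementally. The paper introduces a streaming Transformer-Transducer that generates automatic speech recognition (ASR) and speech translation (ST) outputs jointly with a single decoder. To produce both with minimal latency, it proposes a joint token-level serialized output training method that interleaves source and target words using an off-the-shelf textual aligner. Experiments in monolingual (it-en) and multilingual ({de,es,it}-en) settings show that the approach achieves the best quality-latency balance. With an average ASR latency of 1 s and ST latency of 1.3 s, the model shows no degradation and even improves output quality relative to separate ASR and ST models, with an average improvement of 1.1 WER and 0.4 BLEU in the multilingual case.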
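The abstract says the training targets interleave source (ASR) and target (ST) words using word alignments, but it does not spell out the interleaving rule. The sketch below shows one plausible rule, purely as an assumption for illustration: a target word is emitted as soon as the source word it is aligned to has been emitted, while both word orders are preserved. The function name `interleave`, the `"asr"`/`"st"` tags, and the example alignment are all hypothetical and are not taken from the paper.

```python
def interleave(src_words, tgt_words, alignment):
    """Serialize source and target words into one tagged stream.

    `alignment` is a list of (source_index, target_index) pairs, as an
    external word aligner might produce. Assumed policy: a target word is
    emitted right after the last source word it depends on, and never
    before an earlier target word.
    """
    # Latest source position each target word is aligned to.
    last_src = {}
    for s, t in alignment:
        last_src[t] = max(last_src.get(t, -1), s)

    # Make readiness monotonic so the target word order is preserved;
    # unaligned target words inherit the readiness of the previous word.
    ready_at = []
    latest = 0
    for j in range(len(tgt_words)):
        latest = max(latest, last_src.get(j, latest))
        ready_at.append(latest)

    stream = []
    j = 0
    for i, word in enumerate(src_words):
        stream.append(("asr", word))
        while j < len(tgt_words) and ready_at[j] <= i:
            stream.append(("st", tgt_words[j]))
            j += 1
    # Flush anything left over (for example, out-of-range alignments).
    stream.extend(("st", w) for w in tgt_words[j:])
    return stream


# Hypothetical Italian-English example with a reordered adjective.
source = ["il", "gatto", "nero", "dorme"]
target = ["the", "black", "cat", "sleeps"]
links = [(0, 0), (1, 2), (2, 1), (3, 3)]
serialized = interleave(source, target, links)
```

In this example "black" has to wait for "nero", so "cat" waits as well: the stream is il / the / gatto / nero / black / cat / dorme / sleeps. Waiting of this kind is one source of the extra ST latency relative to ASR that a streaming system has to trade against quality.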

Gammatonegram Representation for End-to-End Dysarthric Speech Processing Tasks: Speech Recognition, Speaker Identification, and Intelligibility Assessment

  • paper_url: http://arxiv.org/abs/2307.03296
  • repo_url: https://github.com/areffarhadi/gammatonegram_cnn_dysarthric_speech
  • paper_authors: Aref Farhadipour, Hadi Veisi
  • for: This paper aims to develop systems for speech recognition, speaker identification, and intelligibility assessment for individuals with dysarthria.
  • methods: Each audio file is converted to a gammatonegram, an image-like representation with discriminative details, which is fed into a convolutional neural network (CNN). The CNN is obtained by transfer learning from a pre-trained AlexNet.
  • results: On the UA dataset, speech recognition reached 91.29% accuracy in speaker-dependent mode, speaker identification reached 87.74% accuracy in text-dependent mode, and intelligibility assessment reached 96.47% accuracy in two-class mode. A fully automatic multi-network speech recognition system, cascaded with the two-class intelligibility assessment system, achieved 92.3% WRR.
    Abstract Dysarthria is a disability that causes a disturbance in the human speech system and reduces the quality and intelligibility of a person's speech. Because of this effect, normal speech processing systems cannot work properly on impaired speech. This disability is usually associated with physical disabilities. Therefore, designing a system that can perform some tasks by receiving voice commands in the smart home can be a significant achievement. In this work, we introduce the gammatonegram as an effective method to represent audio files with discriminative details, which is used as input for the convolutional neural network. In other words, we convert each speech file into an image and propose an image recognition system to classify speech in different scenarios. The proposed CNN is based on transfer learning from the pre-trained AlexNet. In this research, the efficiency of the proposed system for speech recognition, speaker identification, and intelligibility assessment is evaluated. According to the results on the UA dataset, the proposed speech recognition system achieved 91.29% accuracy in speaker-dependent mode, the speaker identification system acquired 87.74% accuracy in text-dependent mode, and the intelligibility assessment system achieved 96.47% accuracy in two-class mode. Finally, we propose a multi-network speech recognition system that works fully automatically. This system is located in a cascade arrangement with the two-class intelligibility assessment system, and the output of this system activates each one of the speech recognition networks. This architecture achieves an accuracy of 92.3% WRR. The source code of this paper is available.
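    Summary Dysarthria is a disability that disturbs the human speech system and lowers the quality and intelligibility of speech, so conventional speech processing systems do not work properly on it. The condition is usually associated with physical disabilities, which makes a system that can carry out tasks from voice commands in a smart home a significant achievement. The paper introduces the gammatonegram as an effective way to represent the features of audio files: each speech file is converted into an image, and an image recognition system classifies the speech in different scenarios. The CNN is based on transfer learning from a pre-trained AlexNet. The system is evaluated on speech recognition, speaker identification, and intelligibility assessment. On the UA dataset, speech recognition reaches 91.29% accuracy in speaker-dependent mode, speaker identification reaches 87.74% in text-dependent mode, and intelligibility assessment reaches 96.47% in two-class mode. Finally, a multi-network speech recognition system is proposed that is cascaded with the intelligibility assessment system, whose output activates one of the speech recognition networks. This architecture achieves 92.3% WRR. The source code is available.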

Performance Comparison of Pre-trained Models for Speech-to-Text in Turkish: Whisper-Small and Wav2Vec2-XLS-R-300M

  • paper_url: http://arxiv.org/abs/2307.04765
  • repo_url: None
  • paper_authors: Oyku Berfin Mercan, Sercan Cepni, Davut Emre Tasar, Sukru Ozan
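  • for: Examines the performance of two pre-trained multilingual speech-to-text models, Whisper-Small and Wav2Vec2-XLS-R-300M, on Turkish.
  • methods: Fine-tunes both models on the Turkish portion of Mozilla Common Voice version 11.0, an open-source dataset containing a small amount of data.
  • results: The WER is 0.16 for Whisper-Small and 0.28 for Wav2Vec2-XLS-R-300M, so Whisper-Small is the more accurate model. Both models were also examined on test data built from call center recordings that were not part of the training and validation sets.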
    Abstract In this study, the performances of the Whisper-Small and Wav2Vec2-XLS-R-300M models which are two pre-trained multilingual models for speech to text were examined for the Turkish language. Mozilla Common Voice version 11.0 which is prepared in Turkish language and is an open-source data set, was used in the study. The multilingual models, Whisper-Small and Wav2Vec2-XLS-R-300M were fine-tuned with this data set which contains a small amount of data. The speech to text performance of the two models was compared. WER values are calculated as 0.28 and 0.16 for the Wav2Vec2-XLS-R-300M and the Whisper-Small models respectively. In addition, the performances of the models were examined with the test data prepared with call center records that were not included in the training and validation dataset.
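    Summary The study examines two pre-trained multilingual speech-to-text models, Whisper-Small and Wav2Vec2-XLS-R-300M, on Turkish. Both were fine-tuned on the Turkish Mozilla Common Voice version 11.0 dataset, which contains a small amount of data, and their speech-to-text performance was compared. The WER is 0.28 for Wav2Vec2-XLS-R-300M and 0.16 for Whisper-Small. The models were also tested on data that was not included in the training and validation sets.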
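The comparison above rests on word error rate (WER): the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation script used in the study; the function name `word_error_rate`, the whitespace tokenization, and the Turkish example sentences are assumptions for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Rolling-row Levenshtein distance computed over words, not characters.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One of four reference words is dropped by the hypothetical recognizer.
wer = word_error_rate("bugün hava çok güzel", "bugün hava güzel")  # 0.25
```

On this scale, the reported 0.16 and 0.28 correspond to roughly 16 and 28 word errors per 100 reference words. Real evaluations usually normalize case and punctuation before scoring, which this sketch omits.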

Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

  • paper_url: http://arxiv.org/abs/2307.03183
  • repo_url: https://github.com/YuanGongND/whisper-at
  • paper_authors: Yuan Gong, Sameer Khurana, Leonid Karlinsky, James Glass
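  • for: Studies Whisper, an automatic speech recognition model trained on a large labeled speech corpus recorded in diverse conditions, and how it behaves in the presence of background sounds.
  • methods: First shows that Whisper is very robust to real-world background sounds, yet its audio representation is not noise-invariant and is instead highly correlated with non-speech sounds. Building on this, Whisper-AT is a unified audio tagging and speech recognition model made by freezing the Whisper backbone and training a lightweight audio tagging model on top of it.
  • results: Whisper-AT recognizes audio events in addition to spoken text in a single forward pass, at less than 1% extra computational cost.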
    Abstract In this paper, we focus on Whisper, a recent automatic speech recognition model trained with a massive 680k hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding that while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant, but is instead highly correlated to non-speech sounds, indicating that Whisper recognizes speech conditioned on the noise type. With this finding, we build a unified audio tagging and speech recognition model Whisper-AT by freezing the backbone of Whisper, and training a lightweight audio tagging model on top of it. With <1% extra computational cost, Whisper-AT can recognize audio events, in addition to spoken text, in a single forward pass.
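    Summary The paper focuses on Whisper, a recent automatic speech recognition model trained on a massive 680k-hour labeled speech corpus recorded in diverse conditions. It first reports an interesting finding: although Whisper is very robust to real-world background sounds such as music, its audio representation is not noise-invariant but highly correlated with non-speech sounds, which indicates that Whisper recognizes speech conditioned on the noise type. Based on this finding, the authors build Whisper-AT, a unified audio tagging and speech recognition model, by freezing the Whisper backbone and training a lightweight audio tagging model on top of it. With less than 1% extra computational cost, Whisper-AT can recognize audio events as well as spoken text in a single forward pass.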

eess.AS - 2023-07-07

Recovering implicit pitch contours from formants in whispered speech

  • paper_url: http://arxiv.org/abs/2307.03168
  • repo_url: None
  • paper_authors: Pablo Pérez Zarazaga, Zofia Malisz
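  • for: Studies how intonation is perceived and conveyed in whispered speech, which lacks a fundamental frequency.
  • methods: A two-step method: using a parallel corpus, whispered formants are first transformed into their phonated equivalents with a denoising autoencoder; the formant contours are then analysed to predict variation in the phonated pitch contour.
  • results: The method is effective in establishing a relationship between whispered and phonated formants and in uncovering implicit pitch contours in whisper.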
    Abstract Whispered speech is characterised by a noise-like excitation that results in the lack of fundamental frequency. Considering that prosodic phenomena such as intonation are perceived through f0 variation, the perception of whispered prosody is relatively difficult. At the same time, studies have shown that speakers do attempt to produce intonation when whispering and that prosodic variability is being transmitted, suggesting that intonation "survives" in whispered formant structure. In this paper, we aim to estimate the way in which formant contours correlate with an "implicit" pitch contour in whisper, using a machine learning model. We propose a two-step method: using a parallel corpus, we first transform the whispered formants into their phonated equivalents using a denoising autoencoder. We then analyse the formant contours to predict phonated pitch contour variation. We observe that our method is effective in establishing a relationship between whispered and phonated formants and in uncovering implicit pitch contours in whisper.
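    Summary Whispered speech is characterised by a noise-like excitation, which results in the absence of a fundamental frequency. Because prosodic phenomena such as intonation are perceived through f0 variation, whispered prosody is relatively hard to perceive. Studies nevertheless show that speakers do try to produce intonation when whispering and that prosodic variability is transmitted, suggesting that intonation survives in the whispered formant structure. The paper uses a machine learning model to estimate how formant contours correlate with an implicit pitch contour in whisper. It proposes a two-step method: first, using a parallel corpus, a denoising autoencoder transforms the whispered formants into their phonated equivalents; the formant contours are then analysed to predict variation in the phonated pitch contour. The method proves effective in relating whispered and phonated formants and in revealing the implicit pitch contour.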

Label-Synchronous Neural Transducer for End-to-End ASR

  • paper_url: http://arxiv.org/abs/2307.03088
  • repo_url: None
  • paper_authors: Keqi Deng, Philip C. Woodland
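  • for: Neural transducers are a natural approach to streaming ASR, but their blank tokens make domain adaptation with text data difficult; the paper addresses this.
  • methods: Proposes a label-synchronous neural transducer (LS-Transducer), which extracts a label-level encoder representation before combining it with the prediction network output. Blank tokens are then no longer needed and the prediction network can easily be adapted with text data. An Auto-regressive Integrate-and-Fire (AIF) mechanism generates the label-level encoder representation while keeping the streaming property, and a streaming joint decoding method improves accuracy.
  • results: Compared with standard neural transducers, the LS-Transducer gives a 10% relative WER reduction (WERR) on intra-domain Librispeech-100h data, and 17% and 19% relative WERRs on cross-domain TED-LIUM 2 and AESRC2020 data with an adapted prediction network.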
    Abstract Neural transducers provide a natural approach to streaming ASR. However, they augment output sequences with blank tokens which leads to challenges for domain adaptation using text data. This paper proposes a label-synchronous neural transducer (LS-Transducer), which extracts a label-level encoder representation before combining it with the prediction network output. Hence blank tokens are no longer needed and the prediction network can be easily adapted using text data. An Auto-regressive Integrate-and-Fire (AIF) mechanism is proposed to generate the label-level encoder representation while retaining the streaming property. In addition, a streaming joint decoding method is designed to improve ASR accuracy. Experiments show that compared to standard neural transducers, the proposed LS-Transducer gave a 10% relative WER reduction (WERR) for intra-domain Librispeech-100h data, as well as 17% and 19% relative WERRs on cross-domain TED-LIUM 2 and AESRC2020 data with an adapted prediction network.
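    Summary Neural transducers provide a natural approach to streaming ASR. However, they add blank tokens to the output sequence, which makes domain adaptation with text data challenging. The paper proposes a label-synchronous neural transducer (LS-Transducer) that extracts a label-level encoder representation before combining it with the prediction network output. Blank tokens are therefore no longer needed, and the prediction network can easily be adapted with text data. An Auto-regressive Integrate-and-Fire (AIF) mechanism is proposed to generate the label-level encoder representation while retaining the streaming property, and a streaming joint decoding method is designed to improve ASR accuracy. Experiments show that, compared with standard neural transducers, the LS-Transducer gives a 10% relative WER reduction (WERR) on intra-domain Librispeech-100h data, and 17% and 19% relative WERRs on cross-domain TED-LIUM 2 and AESRC2020 data with an adapted prediction network.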
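The results for this paper are stated as relative WER reductions (WERR), which are easy to confuse with absolute WER differences. As a small worked sketch of the arithmetic only, the helper below computes the relative reduction from a baseline WER and a new WER. The function name and the numeric WER values are hypothetical; the abstract gives only the relative figures, not the underlying WERs.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction by which the WER dropped relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs in percent: a drop from 10.0 to 9.0 is a 1.0-point
# absolute reduction but a 10% relative reduction.
werr = relative_wer_reduction(10.0, 9.0)  # 0.1, i.e. 10% relative
```

Under this definition, the same 10% relative reduction could correspond to a drop from 20.0 to 18.0 or from 5.0 to 4.5, so relative figures alone do not reveal the absolute error level.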

cs.CV - 2023-07-07

Detecting the Sensing Area of A Laparoscopic Probe in Minimally Invasive Cancer Surgery

  • paper_url: http://arxiv.org/abs/2307.03662
  • repo_url: https://github.com/br0202/sensing_area_detection
  • paper_authors: Baoru Huang, Yicheng Hu, Anh Nguyen, Stamatia Giannarou, Daniel S. Elson
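  • for: Detecting the sensing area of a laparoscopic gamma probe, to give surgeons a reliable intraoperative visualization tool for cancer detection and resection in minimally invasive surgery.
  • methods: A novel tethered laparoscopic gamma detector localizes a preoperatively injected radiotracer. Because the probe is non-imaging, a simple regression network leverages high-dimensional image features and probe position information to locate the sensing area on the tissue surface.
  • results: The method detects the sensing area successfully and establishes a new performance benchmark. Two datasets captured with a custom-designed portable stereo laparoscope system were publicly released.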
    Abstract In surgical oncology, it is challenging for surgeons to identify lymph nodes and completely resect cancer even with pre-operative imaging systems like PET and CT, because of the lack of reliable intraoperative visualization tools. Endoscopic radio-guided cancer detection and resection has recently been evaluated whereby a novel tethered laparoscopic gamma detector is used to localize a preoperatively injected radiotracer. This can both enhance the endoscopic imaging and complement preoperative nuclear imaging data. However, gamma activity visualization is challenging to present to the operator because the probe is non-imaging and it does not visibly indicate the activity origination on the tissue surface. Initial failed attempts used segmentation or geometric methods, but led to the discovery that it could be resolved by leveraging high-dimensional image features and probe position information. To demonstrate the effectiveness of this solution, we designed and implemented a simple regression network that successfully addressed the problem. To further validate the proposed solution, we acquired and publicly released two datasets captured using a custom-designed, portable stereo laparoscope system. Through intensive experimentation, we demonstrated that our method can successfully and effectively detect the sensing area, establishing a new performance benchmark. Code and data are available at https://github.com/br0202/Sensing_area_detection.git

Robust Human Detection under Visual Degradation via Thermal and mmWave Radar Fusion

  • paper_url: http://arxiv.org/abs/2307.03623
  • repo_url: https://github.com/ramdrop/utm
  • paper_authors: Kaiwen Cai, Qiyue Xia, Peize Li, John Stankovic, Chris Xiaoxuan Lu
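  • for: A multimodal human detection system for degraded visual conditions, where visible-light sensors such as RGB cameras are limited.
  • methods: Combines portable thermal cameras with single-chip mmWave radars, and proposes a Bayesian feature extractor and an uncertainty-guided fusion method to mitigate the noisy detection features caused by the low contrast of thermal cameras and the multi-path noise of radar point clouds.
  • results: Evaluated on real-world data, the approach outperforms state-of-the-art single-modal and multi-modal methods by a large margin.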
    Abstract The majority of human detection methods rely on the sensor using visible lights (e.g., RGB cameras) but such sensors are limited in scenarios with degraded vision conditions. In this paper, we present a multimodal human detection system that combines portable thermal cameras and single-chip mmWave radars. To mitigate the noisy detection features caused by the low contrast of thermal cameras and the multi-path noise of radar point clouds, we propose a Bayesian feature extractor and a novel uncertainty-guided fusion method that surpasses a variety of competing methods, either single-modal or multi-modal. We evaluate the proposed method on real-world data collection and demonstrate that our approach outperforms the state-of-the-art methods by a large margin.
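One common way for uncertainty to guide the fusion of two sensor estimates is inverse-variance weighting, sketched below. This is an illustrative assumption, not the fusion rule from the paper, which the abstract does not specify. The function name and the example numbers are hypothetical.

```python
def fuse_by_inverse_variance(estimates):
    """Fuse (value, variance) pairs so that less certain inputs count less.

    Returns the fused value and its variance under the usual assumption
    of independent estimates.
    """
    weights = [1.0 / variance for _, variance in estimates]
    total = sum(weights)
    fused = sum(w * value for w, (value, _) in zip(weights, estimates)) / total
    return fused, 1.0 / total


# Hypothetical detection scores: thermal is confident, radar is noisy.
thermal = (0.8, 0.25)
radar = (0.4, 1.0)
score, variance = fuse_by_inverse_variance([thermal, radar])
print(round(score, 3), round(variance, 3))
```

The fused score sits closer to the thermal estimate because its variance is four times smaller, which is the behaviour one would want when one modality is degraded.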

Depth Estimation Analysis of Orthogonally Divergent Fisheye Cameras with Distortion Removal

  • paper_url: http://arxiv.org/abs/2307.03602
  • repo_url: None
  • paper_authors: Matvei Panteleev, Houari Bettahar
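  • for: Distortion removal and depth estimation analysis for stereo vision systems built from orthogonally divergent fisheye cameras (ODFC).
  • methods: Uses two virtual pinhole cameras (VPCs). Each VPC captures a small portion of the original fisheye view and presents it without lens distortion, emulating a pinhole camera, so that carefully selected regions form a stereo pair.
  • results: Evaluated in simulation with a virtual environment and in experiments with real cameras, and compared against stereo cameras with parallel optical axes. The results demonstrate effective distortion removal and depth estimation accuracy.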
    Abstract Stereo vision systems have become popular in computer vision applications, such as 3D reconstruction, object tracking, and autonomous navigation. However, traditional stereo vision systems that use rectilinear lenses may not be suitable for certain scenarios due to their limited field of view. This has led to the popularity of vision systems based on one or multiple fisheye cameras in different orientations, which can provide a field of view of 180x180 degrees or more. However, fisheye cameras introduce significant distortion at the edges that affects the accuracy of stereo matching and depth estimation. To overcome these limitations, this paper proposes a method for distortion-removal and depth estimation analysis for stereovision system using orthogonally divergent fisheye cameras (ODFC). The proposed method uses two virtual pinhole cameras (VPC), each VPC captures a small portion of the original view and presents it without any lens distortions, emulating the behavior of a pinhole camera. By carefully selecting the captured regions, it is possible to create a stereo pair using two VPCs. The performance of the proposed method is evaluated in both simulation using virtual environment and experiments using real cameras and their results compared to stereo cameras with parallel optical axes. The results demonstrate the effectiveness of the proposed method in terms of distortion removal and depth estimation accuracy.
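The virtual pinhole camera idea can be sketched as a pixel lookup: for each pixel of the distortion-free virtual image, compute where to sample in the fisheye image. The sketch below assumes an equidistant fisheye model (r = f·θ) and a virtual camera whose optical axis coincides with the fisheye axis. Both are simplifying assumptions for illustration, and the function and parameter names are hypothetical. A VPC pointed off-axis, as the paper's setup implies, would need an extra rotation of the viewing ray.

```python
import math


def pinhole_to_fisheye(u, v, f_pinhole, f_fisheye, cx, cy):
    """Map a virtual-pinhole pixel to fisheye image coordinates.

    (u, v) is the pixel offset from the virtual camera's principal point.
    (cx, cy) is the fisheye principal point. Assumes r = f_fisheye * theta.
    """
    r_pin = math.hypot(u, v)
    if r_pin == 0.0:
        return cx, cy
    # Angle between the viewing ray and the optical axis.
    theta = math.atan2(r_pin, f_pinhole)
    # Equidistant projection: image radius grows linearly with the angle.
    r_fish = f_fisheye * theta
    scale = r_fish / r_pin
    return cx + u * scale, cy + v * scale


# Hypothetical intrinsics: a ray 45 degrees off-axis along the x direction.
x, y = pinhole_to_fisheye(300.0, 0.0, 300.0, 400.0, 640.0, 480.0)
print(x, y)
```

Sampling the fisheye image at these coordinates for every virtual pixel yields a rectilinear crop, and two such crops from the two cameras can then be treated as an ordinary stereo pair.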

GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

  • paper_url: http://arxiv.org/abs/2307.03601
  • repo_url: https://github.com/jshilong/gpt4roi
  • paper_authors: Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, Ping Luo
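  • for: Improving the fine-grained multimodal understanding of large language models (LLMs) trained on image-text pairs by instruction tuning at the region level.
  • methods: Reformulates the bounding box as a spatial instruction. Visual features extracted by the spatial instruction are interleaved with language embeddings, fed to the LLM, and trained on region-text data converted to instruction-tuning format.
  • results: The resulting region-level vision-language model, GPT4RoI, supports single-region and multi-region spatial instructions and unlocks region-level capabilities such as detailed region captioning and complex region reasoning. Users interact through both language and spatial instructions to adjust the level of detail of a question.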
    Abstract Instruction tuning large language model (LLM) on image-text pairs has achieved unprecedented vision-language multimodal abilities. However, their vision-language alignments are only built on image-level, the lack of region-level alignment limits their advancements to fine-grained multimodal understanding. In this paper, we propose instruction tuning on region-of-interest. The key design is to reformulate the bounding box as the format of spatial instruction. The interleaved sequences of visual features extracted by the spatial instruction and the language embedding are input to LLM, and trained on the transformed region-text data in instruction tuning format. Our region-level vision-language model, termed as GPT4RoI, brings brand new conversational and interactive experience beyond image-level understanding. (1) Controllability: Users can interact with our model by both language and spatial instructions to flexibly adjust the detail level of the question. (2) Capacities: Our model supports not only single-region spatial instruction but also multi-region. This unlocks more region-level multimodal capacities such as detailed region caption and complex region reasoning. (3) Composition: Any off-the-shelf object detector can be a spatial instruction provider so as to mine informative object attributes from our model, like color, shape, material, action, relation to other objects, etc. The code, data, and demo can be found at https://github.com/jshilong/GPT4RoI.
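The interleaving of region features with language embeddings can be sketched as follows. This is a toy illustration under stated assumptions: the `<regionK>` placeholder format, the average pooling, and the function names `pool_region` and `interleave` are all hypothetical and are not taken from the GPT4RoI code base.

```python
def pool_region(feature_map, box):
    """Average the feature vectors inside box = (x0, y0, x1, y1).

    Coordinates are in feature-map cells and the end is exclusive.
    """
    x0, y0, x1, y1 = box
    cells = [feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    dim = len(cells[0])
    return [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]


def interleave(tokens, text_embed, feature_map, boxes):
    """Build one input sequence mixing text embeddings and region features.

    Each '<regionK>' token is replaced by the pooled feature of boxes[K];
    every other token goes through the text embedding function.
    """
    sequence = []
    for token in tokens:
        if token.startswith("<region") and token.endswith(">"):
            index = int(token[len("<region"):-1])
            sequence.append(pool_region(feature_map, boxes[index]))
        else:
            sequence.append(text_embed(token))
    return sequence


# Hypothetical 2x2 feature map with 1-dimensional features.
feature_map = [[[1.0], [2.0]], [[3.0], [4.0]]]
tokens = ["describe", "<region0>"]
sequence = interleave(tokens, lambda t: [0.0], feature_map, [(0, 0, 2, 2)])
print(sequence)
```

Since the boxes are just coordinates, any source of boxes, including an off-the-shelf detector as the abstract mentions, could supply them.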

Unsupervised Segmentation of Fetal Brain MRI using Deep Learning Cascaded Registration

  • paper_url: http://arxiv.org/abs/2307.03579
  • repo_url: https://github.com/valbcn/casreg
  • paper_authors: Valentin Comte, Mireia Alenya, Andrea Urru, Judith Recober, Ayako Nakaki, Francesca Crovetto, Oscar Camara, Eduard Gratacós, Elisenda Eixarch, Fàtima Crispi, Gemma Piella, Mario Ceresa, Miguel A. González Ballester
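  • for: Accurate automatic segmentation of fetal brain magnetic resonance images (MRI) without labeled training data, to support analysis of fetal brain development and detection of potential neurodevelopmental abnormalities.
  • methods: Proposes an unsupervised segmentation method based on multi-atlas segmentation. A cascaded deep learning network performs 3D image registration by computing small, incremental deformations that align the moving image precisely with the fixed image. The same network registers multiple annotated images to the image to be segmented, and the propagated labels are combined into a refined segmentation.
  • results: The cascaded architecture outperforms the state-of-the-art registration methods that were tested. The derived segmentation method achieves performance and inference time similar to nnU-Net while using only a small subset of annotated data for the multi-atlas segmentation task and none for training the network.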
    Abstract Accurate segmentation of fetal brain magnetic resonance images is crucial for analyzing fetal brain development and detecting potential neurodevelopmental abnormalities. Traditional deep learning-based automatic segmentation, although effective, requires extensive training data with ground-truth labels, typically produced by clinicians through a time-consuming annotation process. To overcome this challenge, we propose a novel unsupervised segmentation method based on multi-atlas segmentation, that accurately segments multiple tissues without relying on labeled data for training. Our method employs a cascaded deep learning network for 3D image registration, which computes small, incremental deformations to the moving image to align it precisely with the fixed image. This cascaded network can then be used to register multiple annotated images with the image to be segmented, and combine the propagated labels to form a refined segmentation. Our experiments demonstrate that the proposed cascaded architecture outperforms the state-of-the-art registration methods that were tested. Furthermore, the derived segmentation method achieves similar performance and inference time to nnU-Net while only using a small subset of annotated data for the multi-atlas segmentation task and none for training the network. Our pipeline for registration and multi-atlas segmentation is publicly available at https://github.com/ValBcn/CasReg.
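The final step of a multi-atlas pipeline, combining the labels propagated from several registered atlases, can be sketched with per-voxel majority voting. Majority voting is an assumption made here for illustration. The abstract says only that the propagated labels are combined, and the paper may use a different fusion rule. The function name is hypothetical.

```python
from collections import Counter


def majority_vote(propagated_labels):
    """Fuse label maps from several atlases by per-voxel majority vote.

    `propagated_labels` holds one flattened label list per atlas, all of
    the same length and already warped into the target image space.
    """
    fused = []
    for votes in zip(*propagated_labels):
        # most_common(1) returns [(label, count)] for the top label.
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused


# Hypothetical labels for four voxels, propagated from three atlases.
atlas_a = [0, 1, 1, 2]
atlas_b = [0, 1, 2, 2]
atlas_c = [1, 1, 2, 2]
print(majority_vote([atlas_a, atlas_b, atlas_c]))
```

Using an odd number of atlases avoids most ties. With ties, `Counter` keeps the label it saw first, so the atlas order would matter.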

SpawnNet: Learning Generalizable Visuomotor Skills from Pre-trained Networks

  • paper_url: http://arxiv.org/abs/2307.03567
  • repo_url: https://github.com/johnrso/spawnnet
  • paper_authors: Xingyu Lin, John So, Sashwat Mahalingam, Fangchen Liu, Pieter Abbeel
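  • for: Investigating how pre-trained visual representations can help the generalization of learned policies.
  • methods: Identifies the key bottleneck in using a frozen pre-trained visual backbone for policy learning, then proposes SpawnNet, a novel two-stream architecture that learns to fuse pre-trained multi-layer representations into a separate network to learn a robust policy.
  • results: In extensive simulated and real experiments, SpawnNet shows significantly better categorical generalization than prior approaches in imitation learning settings.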
    Abstract The existing internet-scale image and video datasets cover a wide range of everyday objects and tasks, bringing the potential of learning policies that have broad generalization. Prior works have explored visual pre-training with different self-supervised objectives, but the generalization capabilities of the learned policies remain relatively unknown. In this work, we take the first step towards this challenge, focusing on how pre-trained representations can help the generalization of the learned policies. We first identify the key bottleneck in using a frozen pre-trained visual backbone for policy learning. We then propose SpawnNet, a novel two-stream architecture that learns to fuse pre-trained multi-layer representations into a separate network to learn a robust policy. Through extensive simulated and real experiments, we demonstrate significantly better categorical generalization compared to prior approaches in imitation learning settings.

VariGrad: A Novel Feature Vector Architecture for Geometric Deep Learning on Unregistered Data

  • paper_url: http://arxiv.org/abs/2307.03553
  • repo_url: https://github.com/emmanuel-hartman/pytorch_varigrad
  • paper_authors: Emmanuel Hartman, Emery Pierson
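  • for: A novel geometric deep learning layer that uses the varifold gradient (VariGrad) to compute feature vector representations of 3D geometric data, usable in downstream tasks such as classification, registration, and shape reconstruction.
  • methods: Relies on parameterization-independent varifold representations of geometric data, so the model can be trained and tested on data independently of the given sampling or parameterization.
  • results: The VariGrad layer is shown to be efficient, generalizable, and robust to resampling.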
    Abstract We present a novel geometric deep learning layer that leverages the varifold gradient (VariGrad) to compute feature vector representations of 3D geometric data. These feature vectors can be used in a variety of downstream learning tasks such as classification, registration, and shape reconstruction. Our model's use of parameterization independent varifold representations of geometric data allows our model to be both trained and tested on data independent of the given sampling or parameterization. We demonstrate the efficiency, generalizability, and robustness to resampling demonstrated by the proposed VariGrad layer.

Language-free Compositional Action Generation via Decoupling Refinement

  • paper_url: http://arxiv.org/abs/2307.03538
  • repo_url: https://github.com/XLiu443/Language-free-Compositional-Action-Generation-via-Decoupling-Refinement
  • paper_authors: Xiao Liu, Guangyi Chen, Yansong Tang, Guangrun Wang, Ser-Nam Lim
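  • for: Generating compositional 3D actions without relying on extensive language annotations.
  • methods: A framework with three components: Action Coupling, Conditional Action Generation, and Decoupling Refinement. Action Coupling uses an energy model to extract the attention mask of each sub-action and combines two actions with these masks to generate pseudo-training examples. A conditional generative model (CVAE) then learns a latent space that supports diverse generation. Decoupling Refinement uses a self-supervised pre-trained MAE model to enforce semantic consistency between sub-actions and compositional actions: generated 3D actions are rendered into 2D, the images are decoupled into two sub-segments, MAE restores the complete image from the sub-segments, and the recovered images are constrained to match images rendered from the raw sub-actions.
  • results: Two new datasets, HumanAct-C and UESTC-C, were created together with a corresponding evaluation metric. Both qualitative and quantitative assessments are conducted to show the method's efficacy.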
    Abstract Composing simple elements into complex concepts is crucial yet challenging, especially for 3D action generation. Existing methods largely rely on extensive neural language annotations to discern composable latent semantics, a process that is often costly and labor-intensive. In this study, we introduce a novel framework to generate compositional actions without reliance on language auxiliaries. Our approach consists of three main components: Action Coupling, Conditional Action Generation, and Decoupling Refinement. Action Coupling utilizes an energy model to extract the attention masks of each sub-action, subsequently integrating two actions using these attentions to generate pseudo-training examples. Then, we employ a conditional generative model, CVAE, to learn a latent space, facilitating the diverse generation. Finally, we propose Decoupling Refinement, which leverages a self-supervised pre-trained model MAE to ensure semantic consistency between the sub-actions and compositional actions. This refinement process involves rendering generated 3D actions into 2D space, decoupling these images into two sub-segments, using the MAE model to restore the complete image from sub-segments, and constraining the recovered images to match images rendered from raw sub-actions. Due to the lack of existing datasets containing both sub-actions and compositional actions, we created two new datasets, named HumanAct-C and UESTC-C, and present a corresponding evaluation metric. Both qualitative and quantitative assessments are conducted to show our efficacy.

Joint Perceptual Learning for Enhancement and Object Detection in Underwater Scenarios

  • paper_url: http://arxiv.org/abs/2307.03536
  • repo_url: None
  • paper_authors: Chenping Fu, Wanqi Yuan, Jiewen Xiao, Risheng Liu, Xin Fan
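  • for: Jointly learning underwater object detection and image enhancement.
  • methods: Proposes a bilevel optimization formulation for the two tasks and unrolls it into a dual perception network (DPNet) with one shared module and two task subnets, trained with a cooperative training strategy.
  • results: On real-world and synthetic underwater datasets, the method produces visually favorable images and higher detection accuracy.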
    Abstract Underwater degraded images greatly challenge existing algorithms to detect objects of interest. Recently, researchers attempt to adopt attention mechanisms or composite connections for improving the feature representation of detectors. However, this solution does not eliminate the impact of degradation on image content such as color and texture, achieving minimal improvements. Another feasible solution for underwater object detection is to develop sophisticated deep architectures in order to enhance image quality or features. Nevertheless, the visually appealing output of these enhancement modules do not necessarily generate high accuracy for deep detectors. More recently, some multi-task learning methods jointly learn underwater detection and image enhancement, accessing promising improvements. Typically, these methods invoke huge architecture and expensive computations, rendering inefficient inference. Definitely, underwater object detection and image enhancement are two interrelated tasks. Leveraging information coming from the two tasks can benefit each task. Based on these factual opinions, we propose a bilevel optimization formulation for jointly learning underwater object detection and image enhancement, and then unroll to a dual perception network (DPNet) for the two tasks. DPNet with one shared module and two task subnets learns from the two different tasks, seeking a shared representation. The shared representation provides more structural details for image enhancement and rich content information for object detection. Finally, we derive a cooperative training strategy to optimize parameters for DPNet. Extensive experiments on real-world and synthetic underwater datasets demonstrate that our method outputs visually favoring images and higher detection accuracy.

Matching in the Wild: Learning Anatomical Embeddings for Multi-Modality Images

  • paper_url: http://arxiv.org/abs/2307.03535
  • repo_url: None
  • paper_authors: Xiaoyu Bai, Fan Bai, Xiaofei Huo, Jia Ge, Tony C. W. Mok, Zi Li, Minfeng Xu, Jingren Zhou, Le Lu, Dakai Jin, Xianghua Ye, Jingjing Lu, Ke Yan
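  • for: Cross-modality anatomical matching for accurate CT-MRI registration, including images with different fields of view (FOV), so that information from both modalities can be used effectively.
  • methods: Proposes Cross-SAM, a novel iterative process that alternates between embedding learning and CT-MRI registration. A SAM model is first trained with aggressive contrast augmentation on CT and MRI images. It then identifies corresponding regions on paired images through robust grid-point matching, followed by point-set based affine/rigid registration and a deformable fine-tuning step. The registered pairs are used to enhance the matching ability of SAM, and the process is repeated.
  • results: On two CT-MRI affine registration datasets, Cross-SAM achieves robust affine registration, significantly outperforming other methods and reaching state-of-the-art performance.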
    Abstract Radiotherapists require accurate registration of MR/CT images to effectively use information from both modalities. In a typical registration pipeline, rigid or affine transformations are applied to roughly align the fixed and moving images before proceeding with the deformation step. While recent learning-based methods have shown promising results in the rigid/affine step, these methods often require images with similar field-of-view (FOV) for successful alignment. As a result, aligning images with different FOVs remains a challenging task. Self-supervised landmark detection methods like self-supervised Anatomical eMbedding (SAM) have emerged as a useful tool for mapping and cropping images to similar FOVs. However, these methods are currently limited to intra-modality use only. To address this limitation and enable cross-modality matching, we propose a new approach called Cross-SAM. Our approach utilizes a novel iterative process that alternates between embedding learning and CT-MRI registration. We start by applying aggressive contrast augmentation on both CT and MRI images to train a SAM model. We then use this SAM to identify corresponding regions on paired images using robust grid-points matching, followed by a point-set based affine/rigid registration, and a deformable fine-tuning step to produce registered paired images. We use these registered pairs to enhance the matching ability of SAM, which is then processed iteratively. We use the final model for cross-modality matching tasks. We evaluated our approach on two CT-MRI affine registration datasets and found that Cross-SAM achieved robust affine registration on both datasets, significantly outperforming other methods and achieving state-of-the-art performance.
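The point-set based rigid registration step can be illustrated with the closed-form least-squares fit for matched 2D points. This is a deliberately simplified sketch: the paper works with 3D volumes and also covers affine transforms, and the function name `fit_rigid_2d` and the example points are hypothetical.

```python
import math


def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst.

    `src` and `dst` are equally long lists of matched (x, y) points.
    Returns (angle_in_radians, (tx, ty)) such that R(angle) * p + t ~ q.
    """
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    dot = 0.0
    cross = 0.0
    for (x, y), (u, v) in zip(src, dst):
        # Work with centroid-centered coordinates.
        x, y, u, v = x - sx, y - sy, u - dx, v - dy
        dot += x * u + y * v
        cross += x * v - y * u
    angle = math.atan2(cross, dot)
    c, s = math.cos(angle), math.sin(angle)
    # Translation moves the rotated source centroid onto the target centroid.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return angle, (tx, ty)


# Hypothetical matches: src rotated by 90 degrees, then shifted by (2, 3).
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst = [(2.0, 3.0), (2.0, 4.0), (1.0, 3.0)]
angle, shift = fit_rigid_2d(src, dst)
print(math.degrees(angle), shift)
```

In a pipeline like the one described, the matched points would come from the embedding-based grid-point matching, and outlier matches would need to be filtered before a fit of this kind.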

HoughLaneNet: Lane Detection with Deep Hough Transform and Dynamic Convolution

  • paper_url: http://arxiv.org/abs/2307.03494
  • repo_url: None
  • paper_authors: Jia-Qi Zhang, Hao-Bin Duan, Jun-Long Chen, Ariel Shamir, Miao Wang
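  • for: Lane detection for autonomous driving, where lanes can be narrow, fragmented, and often obscured by heavy traffic.
  • methods: Proposes a hierarchical Deep Hough Transform (DHT) approach that combines all lane features in an image into the Hough parameter space, refines the point selection method, and adds a Dynamic Convolution Module to differentiate between lanes effectively.
  • results: Extensive experiments show improved performance on heavily occluded or worn lanes, with the method outperforming or matching state-of-the-art techniques.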
    Abstract The task of lane detection has garnered considerable attention in the field of autonomous driving due to its complexity. Lanes can present difficulties for detection, as they can be narrow, fragmented, and often obscured by heavy traffic. However, it has been observed that the lanes have a geometrical structure that resembles a straight line, leading to improved lane detection results when utilizing this characteristic. To address this challenge, we propose a hierarchical Deep Hough Transform (DHT) approach that combines all lane features in an image into the Hough parameter space. Additionally, we refine the point selection method and incorporate a Dynamic Convolution Module to effectively differentiate between lanes in the original image. Our network architecture comprises a backbone network, either a ResNet or Pyramid Vision Transformer, a Feature Pyramid Network as the neck to extract multi-scale features, and a hierarchical DHT-based feature aggregation head to accurately segment each lane. By utilizing the lane features in the Hough parameter space, the network learns dynamic convolution kernel parameters corresponding to each lane, allowing the Dynamic Convolution Module to effectively differentiate between lane features. Subsequently, the lane features are fed into the feature decoder, which predicts the final position of the lane. Our proposed network structure demonstrates improved performance in detecting heavily occluded or worn lane images, as evidenced by our extensive experimental results, which show that our method outperforms or is on par with state-of-the-art techniques.
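The Hough parameter space that the method builds on can be illustrated with the classical Hough transform for lines, in which every point votes for all (θ, ρ) pairs satisfying ρ = x·cos θ + y·sin θ. The sketch below is the textbook point-voting version, not the paper's Deep Hough Transform, which aggregates learned features rather than binary points. The function name and the example points are hypothetical.

```python
import math


def hough_strongest_line(points, n_theta=180, rho_step=1.0):
    """Vote in (theta, rho) space and return the best-supported line.

    Returns (votes, theta_in_degrees, rho) for the accumulator cell that
    collected the most votes.
    """
    accumulator = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            cell = (t, round(rho / rho_step))
            accumulator[cell] = accumulator.get(cell, 0) + 1
    (t, rho_bin), votes = max(accumulator.items(), key=lambda item: item[1])
    return votes, 180.0 * t / n_theta, rho_bin * rho_step


# Hypothetical lane points lying on the vertical line x = 5.
points = [(5.0, float(y)) for y in range(0, 100, 10)]
print(hough_strongest_line(points))
```

Points that are far apart in the image but lie on the same line land in the same accumulator cell, which is why this representation suits fragmented or partly occluded lanes.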

Unpaired Multi-View Graph Clustering with Cross-View Structure Matching

  • paper_url: http://arxiv.org/abs/2307.03476
  • repo_url: https://github.com/wy1019/upmgc-sm
  • paper_authors: Yi Wen, Siwei Wang, Qing Liao, Weixuan Liang, Ke Liang, Xinhang Wan, Xinwang Liu
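  • for: Multi-view clustering when the correspondence between samples across views is incomplete or missing (the data-unpaired problem, DUP).
  • methods: Proposes a parameter-free graph clustering framework, Unpaired Multi-view Graph Clustering with Cross-View Structure Matching (UPMGC-SM), which uses the structural information of each view to refine cross-view correspondences and handles both fully and partially unpaired data in one framework.
  • results: Extensive experiments demonstrate the effectiveness and generalization of the framework on both paired and unpaired datasets. Existing graph clustering methods can adopt UPMGC-SM to improve their handling of unpaired scenarios.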
    Abstract Multi-view clustering (MVC), which effectively fuses information from multiple views for better performance, has received increasing attention. Most existing MVC methods assume that multi-view data are fully paired, which means that the mappings of all corresponding samples between views are pre-defined or given in advance. However, the data correspondence is often incomplete in real-world applications due to data corruption or sensor differences, referred as the data-unpaired problem (DUP) in multi-view literature. Although several attempts have been made to address the DUP issue, they suffer from the following drawbacks: 1) Most methods focus on the feature representation while ignoring the structural information of multi-view data, which is essential for clustering tasks; 2) Existing methods for partially unpaired problems rely on pre-given cross-view alignment information, resulting in their inability to handle fully unpaired problems; 3) Their inevitable parameters degrade the efficiency and applicability of the models. To tackle these issues, we propose a novel parameter-free graph clustering framework termed Unpaired Multi-view Graph Clustering framework with Cross-View Structure Matching (UPMGC-SM). Specifically, unlike the existing methods, UPMGC-SM effectively utilizes the structural information from each view to refine cross-view correspondences. Besides, our UPMGC-SM is a unified framework for both the fully and partially unpaired multi-view graph clustering. Moreover, existing graph clustering methods can adopt our UPMGC-SM to enhance their ability for unpaired scenarios. Extensive experiments demonstrate the effectiveness and generalization of our proposed framework for both paired and unpaired datasets.
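The idea of recovering cross-view correspondences from structure alone can be illustrated on a toy scale: given the adjacency matrices of two views, search for the node permutation under which the two graphs agree best. The brute-force search below is only a sketch for very small graphs and is not the optimization used by UPMGC-SM. The function name and the example graphs are hypothetical.

```python
from itertools import permutations


def match_structures(adj_a, adj_b):
    """Find the node mapping from view A to view B with the least mismatch.

    Returns (perm, cost) where perm[i] is the node of view B matched to
    node i of view A, and cost is the squared difference between the
    adjacency matrices under that mapping.
    """
    n = len(adj_a)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(
            (adj_a[i][j] - adj_b[perm[i]][perm[j]]) ** 2
            for i in range(n)
            for j in range(n)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


# Hypothetical views: the same path graph with differently ordered nodes.
view_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # node 1 is the middle node
view_b = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # node 0 is the middle node
print(match_structures(view_a, view_b))
```

The search visits n! permutations, so it only serves to show what a structure-matching objective looks like. Practical methods solve a relaxed version of the same problem.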

Freezing of Gait Prediction From Accelerometer Data Using a Simple 1D-Convolutional Neural Network – 8th Place Solution for Kaggle’s Parkinson’s Freezing of Gait Prediction Competition

  • paper_url: http://arxiv.org/abs/2307.03475
  • repo_url: https://github.com/janbrederecke/fog
  • paper_authors: Jan Brederecke
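  • for: Detecting Freezing of Gait (FOG) events in patients with Parkinson's disease, in support of better interventions and management strategies.
  • methods: Uses patient-worn accelerometer data and a simple 1-D convolutional neural network to detect FOG events.
  • results: The model reached a mean average precision of 0.356 on the private Kaggle leaderboard and ranked 8th out of 1379 teams, which points to the potential of this approach for real-time FOG detection.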
    Abstract Freezing of Gait (FOG) is a common motor symptom in patients with Parkinson's disease (PD). During episodes of FOG, patients suddenly lose their ability to stride as intended. Patient-worn accelerometers can capture information on the patient's movement during these episodes and machine learning algorithms can potentially classify this data. The combination therefore holds the potential to detect FOG in real-time. In this work I present a simple 1-D convolutional neural network that was trained to detect FOG events in accelerometer data. Model performance was assessed by measuring the success of the model to discriminate normal movement from FOG episodes and resulted in a mean average precision of 0.356 on the private leaderboard on Kaggle. Ultimately, the model ranked 8th out of 1379 teams in the Parkinson's Freezing of Gait Prediction competition. The results underscore the potential of Deep Learning-based solutions in advancing the field of FOG detection, contributing to improved interventions and management strategies for PD patients.
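    Summary Freezing of Gait (FOG) is a common motor symptom in patients with Parkinson's disease. During FOG episodes, patients suddenly lose the ability to stride as intended. Patient-worn accelerometers can record the patient's movement during these episodes, and machine learning algorithms can potentially classify the data, so the combination has the potential to detect FOG in real time. In this work, I present a simple 1-D convolutional neural network trained to detect FOG events in accelerometer data. Model performance was assessed by how well the model discriminates normal movement from FOG episodes, and it reached a mean average precision of 0.356 on the private Kaggle leaderboard. Ultimately, the model ranked 8th out of 1379 teams. These results underline the potential of deep-learning-based solutions for FOG detection, which may lead to improved interventions and management strategies for patients with Parkinson's disease.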
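The reported score is a mean average precision. As a rough illustration of how such a score can be computed, the sketch below ranks samples by predicted score, averages the precision at each true positive, and then averages over classes. The function names and the per-class averaging are assumptions made for illustration; the abstract does not describe the competition's exact evaluation code.

```python
def average_precision(labels, scores):
    """Average precision for one class.

    labels: list of 0/1 ground-truth flags; scores: predicted confidences.
    Samples are ranked by descending score and the precision at the rank
    of each true positive is averaged. Ties are not treated specially.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_positives


def mean_average_precision(labels_per_class, scores_per_class):
    """Unweighted mean of the per-class average precisions (an assumption)."""
    aps = [average_precision(l, s)
           for l, s in zip(labels_per_class, scores_per_class)]
    return sum(aps) / len(aps)
```

A perfect ranking gives an average precision of 1.0, while positives that are ranked below negatives lower the score.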

A Deep Active Contour Model for Delineating Glacier Calving Fronts

  • paper_url: http://arxiv.org/abs/2307.03461
  • repo_url: None
  • paper_authors: Konrad Heidler, Lichao Mou, Erik Loebel, Mirko Scheinert, Sébastien Lefèvre, Xiao Xiang Zhu
  • for: This paper addresses how to encode the real-world problem of glacier calving front delineation as a machine learning task.
  • methods: The paper rephrases calving front modeling as a contour tracing problem and combines Convolutional Neural Networks (CNNs) for feature extraction with active contour models for the delineation.
  • results: Trained and evaluated on several large-scale datasets of Greenland's outlet glaciers, the method outperforms approaches based on segmentation and edge detection, and it shows advantages when quantifying the uncertainty of model predictions.
    Abstract Choosing how to encode a real-world problem as a machine learning task is an important design decision in machine learning. The task of glacier calving front modeling has often been approached as a semantic segmentation task. Recent studies have shown that combining segmentation with edge detection can improve the accuracy of calving front detectors. Building on this observation, we completely rephrase the task as a contour tracing problem and propose a model for explicit contour detection that does not incorporate any dense predictions as intermediate steps. The proposed approach, called "Charting Outlines by Recurrent Adaptation" (COBRA), combines Convolutional Neural Networks (CNNs) for feature extraction and active contour models for the delineation. By training and evaluating on several large-scale datasets of Greenland's outlet glaciers, we show that this approach indeed outperforms the aforementioned methods based on segmentation and edge-detection. Finally, we demonstrate that explicit contour detection has benefits over pixel-wise methods when quantifying the models' prediction uncertainties. The project page containing the code and animated model predictions can be found at https://khdlr.github.io/COBRA/.
    Summary Choosing how to encode a real-world problem as a machine learning task is an important design decision. Glacier calving front modeling has often been treated as a semantic segmentation task, and recent studies have shown that combining segmentation with edge detection can improve the accuracy of calving front detectors. Building on this observation, we completely rephrase the task as a contour tracing problem and propose a model for explicit contour detection that does not use any dense predictions as intermediate steps. The proposed approach, called "Charting Outlines by Recurrent Adaptation" (COBRA), combines Convolutional Neural Networks (CNNs) for feature extraction with active contour models for the delineation. By training and evaluating on several large-scale datasets of Greenland's outlet glaciers, we show that this approach outperforms the methods based on segmentation and edge detection. Finally, we demonstrate that explicit contour detection has benefits over pixel-wise methods when quantifying the models' prediction uncertainties. The project page with the code and animated model predictions is at https://khdlr.github.io/COBRA/.

Universal Semi-supervised Model Adaptation via Collaborative Consistency Training

  • paper_url: http://arxiv.org/abs/2307.03449
  • repo_url: None
  • paper_authors: Zizheng Yan, Yushuang Wu, Yipeng Qin, Xiaoguang Han, Shuguang Cui, Guanbin Li
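  • for: Introduces a realistic and challenging domain adaptation problem, Universal Semi-supervised Model Adaptation (USMA), which requires only a pre-trained source model and allows the source and target domains to have different label sets.
  • methods: Proposes a collaborative consistency training framework that regularizes the prediction consistency between two models (the pre-trained source model and a variant pre-trained with target data only) and combines their complementary strengths to learn a more powerful model.
  • results: Experiments on several benchmark datasets demonstrate the effectiveness of the method.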
    Abstract In this paper, we introduce a realistic and challenging domain adaptation problem called Universal Semi-supervised Model Adaptation (USMA), which i) requires only a pre-trained source model, ii) allows the source and target domain to have different label sets, i.e., they share a common label set and hold their own private label set, and iii) requires only a few labeled samples in each class of the target domain. To address USMA, we propose a collaborative consistency training framework that regularizes the prediction consistency between two models, i.e., a pre-trained source model and its variant pre-trained with target data only, and combines their complementary strengths to learn a more powerful model. The rationale of our framework stems from the observation that the source model performs better on common categories than the target-only model, while on target-private categories, the target-only model performs better. We also propose a two-perspective, i.e., sample-wise and class-wise, consistency regularization to improve the training. Experimental results demonstrate the effectiveness of our method on several benchmark datasets.
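    Summary In this paper, we introduce a realistic and challenging domain adaptation problem called Universal Semi-supervised Model Adaptation (USMA). The problem 1) requires only a pre-trained source model; 2) allows the source and target domains to have different label sets, that is, they share a common label set and each holds its own private label set; and 3) requires only a few labeled samples in each class of the target domain. To address USMA, we propose a collaborative consistency training framework. It regularizes the prediction consistency between the pre-trained source model and a variant pre-trained with target data only, and it combines their complementary strengths to learn a more powerful model. The rationale is that the source model performs better on common categories, while the target-only model performs better on target-private categories. We also propose a consistency regularization from two perspectives, sample-wise and class-wise, to improve training. Experimental results demonstrate the effectiveness of our method on several benchmark datasets.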

NOFA: NeRF-based One-shot Facial Avatar Reconstruction

  • paper_url: http://arxiv.org/abs/2307.03441
  • repo_url: None
  • paper_authors: Wangbo Yu, Yanbo Fan, Yong Zhang, Xuan Wang, Fei Yin, Yunpeng Bai, Yan-Pei Cao, Ying Shan, Yang Wu, Zhongqian Sun, Baoyuan Wu
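  • for: One-shot 3D facial avatar reconstruction that requires only a single source image for high-fidelity reconstruction.
  • methods: Leverages the generative prior of a 3D GAN and an efficient encoder-decoder network to reconstruct the canonical neural volume of the source image, adds a compensation network to complement facial details, and uses a deformation field to warp the canonical volume into driven expressions.
  • results: Extensive experimental comparisons show superior synthesis results compared to several state-of-the-art methods.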
    Abstract 3D facial avatar reconstruction has been a significant research topic in computer graphics and computer vision, where photo-realistic rendering and flexible controls over poses and expressions are necessary for many related applications. Recently, its performance has been greatly improved with the development of neural radiance fields (NeRF). However, most existing NeRF-based facial avatars focus on subject-specific reconstruction and reenactment, requiring multi-shot images containing different views of the specific subject for training, and the learned model cannot generalize to new identities, limiting its further applications. In this work, we propose a one-shot 3D facial avatar reconstruction framework that only requires a single source image to reconstruct a high-fidelity 3D facial avatar. For the challenges of lacking generalization ability and missing multi-view information, we leverage the generative prior of 3D GAN and develop an efficient encoder-decoder network to reconstruct the canonical neural volume of the source image, and further propose a compensation network to complement facial details. To enable fine-grained control over facial dynamics, we propose a deformation field to warp the canonical volume into driven expressions. Through extensive experimental comparisons, we achieve superior synthesis results compared to several state-of-the-art methods.
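    Summary 3D facial avatar reconstruction has been an important research topic in computer graphics and computer vision, where photo-realistic rendering and flexible control over poses and expressions are needed for many related applications. Its performance has recently improved greatly with the development of neural radiance fields (NeRF). However, most existing NeRF-based facial avatars focus on subject-specific reconstruction and reenactment. They need multiple images showing different views of the specific subject for training, and the learned model cannot generalize to new identities, which limits further applications. In this work, we propose a one-shot framework that needs only a single source image to reconstruct a high-fidelity 3D facial avatar. To handle the lack of generalization ability and the missing multi-view information, we leverage the generative prior of a 3D GAN and develop an efficient encoder-decoder network to reconstruct the canonical neural volume of the source image, and we further propose a compensation network to complement facial details. To enable fine-grained control over facial dynamics, we propose a deformation field that warps the canonical volume into driven expressions. Through extensive experimental comparisons, we achieve superior synthesis results compared to several state-of-the-art methods.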

Merging-Diverging Hybrid Transformer Networks for Survival Prediction in Head and Neck Cancer

  • paper_url: http://arxiv.org/abs/2307.03427
  • repo_url: https://github.com/mungomeng/survival-xsurv
  • paper_authors: Mingyuan Meng, Lei Bi, Michael Fulham, Dagan Feng, Jinman Kim
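  • for: Survival prediction for patients with head and neck cancer, providing early prognostic information for treatment planning.
  • methods: A deep survival model based on deep learning and medical images that fuses multi-modality images (e.g., PET-CT) and extracts region-specific prognostic information (e.g., from the Primary Tumor and Metastatic Lymph Node regions).
  • results: On the HECKTOR 2022 head and neck tumor dataset, XSurv outperforms state-of-the-art survival prediction methods by combining the complementary information in PET and CT images and extracting region-specific prognostic information.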
    Abstract Survival prediction is crucial for cancer patients as it provides early prognostic information for treatment planning. Recently, deep survival models based on deep learning and medical images have shown promising performance for survival prediction. However, existing deep survival models are not well developed in utilizing multi-modality images (e.g., PET-CT) and in extracting region-specific information (e.g., the prognostic information in Primary Tumor (PT) and Metastatic Lymph Node (MLN) regions). In view of this, we propose a merging-diverging learning framework for survival prediction from multi-modality images. This framework has a merging encoder to fuse multi-modality information and a diverging decoder to extract region-specific information. In the merging encoder, we propose a Hybrid Parallel Cross-Attention (HPCA) block to effectively fuse multi-modality features via parallel convolutional layers and cross-attention transformers. In the diverging decoder, we propose a Region-specific Attention Gate (RAG) block to screen out the features related to lesion regions. Our framework is demonstrated on survival prediction from PET-CT images in Head and Neck (H&N) cancer, by designing an X-shape merging-diverging hybrid transformer network (named XSurv). Our XSurv combines the complementary information in PET and CT images and extracts the region-specific prognostic information in PT and MLN regions. Extensive experiments on the public dataset of HEad and neCK TumOR segmentation and outcome prediction challenge (HECKTOR 2022) demonstrate that our XSurv outperforms state-of-the-art survival prediction methods.
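    Summary Survival prediction is crucial for cancer patients because it provides early prognostic information for treatment planning. Recently, deep survival models based on deep learning and medical images have shown promising performance. However, existing deep survival models do not make good use of multi-modality images (e.g., PET-CT) and do not fully extract region-specific information (e.g., the prognostic information in the Primary Tumor (PT) and Metastatic Lymph Node (MLN) regions). To address this, we propose a merging-diverging learning framework for survival prediction from multi-modality images. The framework has a merging encoder that fuses multi-modality information and a diverging decoder that extracts region-specific information. In the merging encoder, we propose a Hybrid Parallel Cross-Attention (HPCA) block that fuses multi-modality features through parallel convolutional layers and cross-attention transformers. In the diverging decoder, we propose a Region-specific Attention Gate (RAG) block that screens out the features related to lesion regions. We demonstrate the framework on survival prediction from PET-CT images in Head and Neck (H&N) cancer with an X-shape merging-diverging hybrid transformer network named XSurv, which combines the complementary information in PET and CT images and extracts the region-specific prognostic information in the PT and MLN regions. Extensive experiments on the public dataset of the HEad and neCK TumOR segmentation and outcome prediction challenge (HECKTOR 2022) show that XSurv outperforms state-of-the-art survival prediction methods.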

Registration-Free Hybrid Learning Empowers Simple Multimodal Imaging System for High-quality Fusion Detection

  • paper_url: http://arxiv.org/abs/2307.03425
  • repo_url: None
  • paper_authors: Yinghan Guan, Haoran Dai, Zekuan Yu, Shouyu Wang, Yuanjie Gu
  • for: Smoke and wildfire detection.
  • methods: IA-VFDnet, a CNN-Transformer hybrid learning framework with a unified high-quality multimodal feature matching module (AKM) and a fusion module (WDAF), which work together to perform high-quality infrared-aware visible fusion detection.
  • results: Experiments on the M3FD dataset show superior detection performance compared to other state-of-the-art methods under conventional registered conditions, and the first unregistered multimodal smoke and wildfire detection benchmark is openly available.
    Abstract Multimodal fusion detection always places high demands on the imaging system and image pre-processing, while either a high-quality pre-registration system or image registration processing is costly. Unfortunately, the existing fusion methods are designed for registered source images, and the fusion of inhomogeneous features, which denotes a pair of features at the same spatial location that expresses different semantic information, cannot achieve satisfactory performance via these methods. As a result, we propose IA-VFDnet, a CNN-Transformer hybrid learning framework with a unified high-quality multimodal feature matching module (AKM) and a fusion module (WDAF), in which AKM and WDAF work in synergy to perform high-quality infrared-aware visible fusion detection, which can be applied to smoke and wildfire detection. Furthermore, experiments on the M3FD dataset validate the superiority of the proposed method, with IA-VFDnet achieving better detection performance than other state-of-the-art methods under conventional registered conditions. In addition, the first unregistered multimodal smoke and wildfire detection benchmark is openly available in this letter.
    Summary Multimodal fusion detection always places high demands on the imaging system and on image pre-processing, and either a high-quality pre-registration system or image registration processing is costly. Unfortunately, existing fusion methods are designed for registered source images, so they cannot achieve satisfactory performance when fusing inhomogeneous features. We therefore propose IA-VFDnet, a CNN-Transformer hybrid learning framework with a unified high-quality multimodal feature matching module (AKM) and a fusion module (WDAF). AKM and WDAF work in synergy to perform high-quality infrared-aware visible fusion detection, which can be applied to smoke and wildfire detection. Experiments on the M3FD dataset validate the superiority of the proposed method, with IA-VFDnet achieving the best detection performance among state-of-the-art methods under conventional registered conditions. In addition, we openly provide the first unregistered multimodal smoke and wildfire detection benchmark.

Hyperspectral and Multispectral Image Fusion Using the Conditional Denoising Diffusion Probabilistic Model

  • paper_url: http://arxiv.org/abs/2307.03423
  • repo_url: https://github.com/shuaikaishi/ddpmfus
  • paper_authors: Shuaikai Shi, Lijun Zhang, Jie Chen
  • for: Proposes a deep-learning-based method for fusing hyperspectral and multispectral images, in order to obtain images with both high spatial and high spectral resolution.
  • methods: The method is based on the conditional denoising diffusion probabilistic model (DDPM) and consists of a forward diffusion process and a reverse denoising process. The forward process gradually adds Gaussian noise to the high spatial resolution hyperspectral image (HrHSI). The reverse process learns to predict the desired HrHSI from its noisy version, conditioned on the corresponding high spatial resolution multispectral image (HrMSI) and low spatial resolution hyperspectral image (LrHSI).
  • results: Experiments on one indoor and two remote sensing datasets show the superiority of the proposed model over other advanced deep-learning-based fusion methods. The code will be open-sourced at https://github.com/shuaikaishi/DDPMFus for reproducibility.
    Abstract Hyperspectral images (HSI) have a large amount of spectral information reflecting the characteristics of matter, while their spatial resolution is low due to the limitations of imaging technology. Complementary to this are multispectral images (MSI), e.g., RGB images, with high spatial resolution but insufficient spectral bands. Hyperspectral and multispectral image fusion is a technique for acquiring ideal images that have both high spatial and high spectral resolution cost-effectively. Many existing HSI and MSI fusion algorithms rely on known imaging degradation models, which are often not available in practice. In this paper, we propose a deep fusion method based on the conditional denoising diffusion probabilistic model, called DDPM-Fus. Specifically, the DDPM-Fus contains the forward diffusion process which gradually adds Gaussian noise to the high spatial resolution HSI (HrHSI) and another reverse denoising process which learns to predict the desired HrHSI from its noisy version conditioning on the corresponding high spatial resolution MSI (HrMSI) and low spatial resolution HSI (LrHSI). Once the training is complete, the proposed DDPM-Fus implements the reverse process on the test HrMSI and LrHSI to generate the fused HrHSI. Experiments conducted on one indoor and two remote sensing datasets show the superiority of the proposed model when compared with other advanced deep learning-based fusion methods. The codes of this work will be open-sourced at this address: https://github.com/shuaikaishi/DDPMFus for reproducibility.
    Summary Hyperspectral images (HSI) carry a large amount of spectral information that reflects the characteristics of matter, but their spatial resolution is low because of the limitations of imaging technology. Multispectral images (MSI), such as RGB images, are complementary: they have high spatial resolution but lack spectral bands. Hyperspectral and multispectral image fusion is a way to obtain ideal images with both high spatial and high spectral resolution. Many existing HSI and MSI fusion algorithms rely on known imaging degradation models, which are often unavailable in practice. In this paper, we propose a deep fusion method based on the conditional denoising diffusion probabilistic model (DDPM), called DDPM-Fus. Specifically, DDPM-Fus contains a forward diffusion process that gradually adds Gaussian noise to the high spatial resolution HSI (HrHSI), and a reverse denoising process that learns to predict the desired HrHSI from its noisy version, conditioned on the corresponding high spatial resolution MSI (HrMSI) and low spatial resolution HSI (LrHSI). Once training is complete, DDPM-Fus runs the reverse process on the test HrMSI and LrHSI to generate the fused HrHSI. Experiments on one indoor and two remote sensing datasets show the superiority of our method over other advanced deep-learning-based fusion methods. The code will be released at https://github.com/shuaikaishi/DDPMFus for reproducibility.
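The forward diffusion step described above can be sketched with the standard DDPM closed form, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, where ᾱ_t is the running product of (1 − β). This is a minimal sketch under stated assumptions: the linear β schedule, its endpoint values, and the flat list standing in for an HrHSI are all illustrative choices that the abstract does not specify.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule, not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product over s <= t of (1 - beta[s])."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_diffuse(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) for a flat list of values x0.

    t is a 0-based step index. Returns the noisy sample and the noise used,
    which a denoising network would typically be trained against.
    """
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    clean = [0.2, 0.5, 0.8]  # stand-in for HrHSI pixel values
    noisy, _ = forward_diffuse(clean, 999, alpha_bars, rng)
    print(alpha_bars[-1], noisy)
```

In a conditional setup like the one the abstract describes, the reverse network would additionally receive the HrMSI and LrHSI as inputs at every step; that part is omitted here.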

Learning Adversarial Semantic Embeddings for Zero-Shot Recognition in Open Worlds

  • paper_url: http://arxiv.org/abs/2307.03416
  • repo_url: https://github.com/lhrst/ase
  • paper_authors: Tianqi Li, Guansong Pang, Xiao Bai, Jin Zheng, Lei Zhou, Xin Ning
  • for: This work addresses Zero-Shot Open-Set Recognition (ZS-OSR), where a model trained under the Zero-Shot Learning (ZSL) setting must accurately classify samples from unseen classes while rejecting samples from unknown classes.
  • methods: We combine existing state-of-the-art ZSL and OSR models as baselines and introduce a new method that generates adversarial semantic embeddings of the unknown classes to train an unknowns-informed ZS-OSR classifier.
  • results: Our method substantially outperforms the combined solutions in detecting unknown classes while retaining the classification accuracy on unseen classes, and it achieves similar superiority under generalized ZS-OSR settings.
    Abstract Zero-Shot Learning (ZSL) focuses on classifying samples of unseen classes with only their side semantic information presented during training. It cannot handle real-life, open-world scenarios where there are test samples of unknown classes for which neither samples (e.g., images) nor their side semantic information is known during training. Open-Set Recognition (OSR) is dedicated to addressing the unknown class issue, but existing OSR methods are not designed to model the semantic information of the unseen classes. To tackle this combined ZSL and OSR problem, we consider the case of "Zero-Shot Open-Set Recognition" (ZS-OSR), where a model is trained under the ZSL setting but it is required to accurately classify samples from the unseen classes while being able to reject samples from the unknown classes during inference. We perform large experiments on combining existing state-of-the-art ZSL and OSR models for the ZS-OSR task on four widely used datasets adapted from the ZSL task, and reveal that ZS-OSR is a non-trivial task as the simply combined solutions perform badly in distinguishing the unseen-class and unknown-class samples. We further introduce a novel approach specifically designed for ZS-OSR, in which our model learns to generate adversarial semantic embeddings of the unknown classes to train an unknowns-informed ZS-OSR classifier. Extensive empirical results show that our method 1) substantially outperforms the combined solutions in detecting the unknown classes while retaining the classification accuracy on the unseen classes and 2) achieves similar superiority under generalized ZS-OSR settings.
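    Summary Zero-Shot Learning (ZSL) focuses on classifying samples of unseen classes when only their side semantic information is available during training. It cannot handle real-life, open-world scenarios in which test samples come from unknown classes, for which neither samples (e.g., images) nor side semantic information is known during training. Open-Set Recognition (OSR) is dedicated to the unknown class issue, but existing OSR methods are not designed to model the semantic information of unseen classes. To tackle this combined ZSL and OSR problem, we consider "Zero-Shot Open-Set Recognition" (ZS-OSR), where a model is trained under the ZSL setting but must accurately classify samples from unseen classes while rejecting samples from unknown classes during inference. We run large-scale experiments on four widely used datasets and find that ZS-OSR is a non-trivial task: simply combining ZSL and OSR models performs poorly. We further propose a new method designed specifically for ZS-OSR, in which the model learns to generate adversarial semantic embeddings of the unknown classes to train an unknowns-informed ZS-OSR classifier. Our method detects unknown classes while retaining classification accuracy, and it achieves similar superiority under generalized ZS-OSR settings.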

Unsupervised Hyperspectral and Multispectral Images Fusion Based on the Cycle Consistency

  • paper_url: http://arxiv.org/abs/2307.03413
  • repo_url: https://github.com/shuaikaishi/CycFusion
  • paper_authors: Shuaikai Shi, Lijun Zhang, Yoann Altmann, Jie Chen
  • for: Proposes an unsupervised hyperspectral and multispectral image fusion method that does not need known spatial degradation parameters, in order to obtain images with both high spatial and high spectral resolution.
  • methods: The method is based on cycle consistency. It learns the domain transformation between the low spatial resolution hyperspectral image (LrHSI) and the high spatial resolution multispectral image (HrMSI), and treats the desired high spatial resolution hyperspectral image (HrHSI) as intermediate feature maps.
  • results: Experiments on several datasets show that the model outperforms all compared unsupervised fusion methods.
    Abstract Hyperspectral images (HSI), whose abundant spectral information reflects material properties, usually have low spatial resolution due to hardware limits. Meanwhile, multispectral images (MSI), e.g., RGB images, have a high spatial resolution but deficient spectral signatures. Hyperspectral and multispectral image fusion can be cost-effective and efficient for acquiring both high spatial resolution and high spectral resolution images. Many of the conventional HSI and MSI fusion algorithms rely on known spatial degradation parameters, i.e., point spread function, spectral degradation parameters, i.e., spectral response function, or both of them. Another class of deep learning-based models relies on the ground truth of high spatial resolution HSI and needs large amounts of paired training images when working in a supervised manner. Both of these models are limited in practical fusion scenarios. In this paper, we propose an unsupervised HSI and MSI fusion model based on the cycle consistency, called CycFusion. The CycFusion learns the domain transformation between low spatial resolution HSI (LrHSI) and high spatial resolution MSI (HrMSI), and the desired high spatial resolution HSI (HrHSI) is considered to be intermediate feature maps in the transformation networks. The CycFusion can be trained with the objective functions of marginal matching in single transform and cycle consistency in double transforms. Moreover, the estimated PSF and SRF are embedded in the model as the pre-training weights, which further enhances the practicality of our proposed model. Experiments conducted on several datasets show that our proposed model outperforms all compared unsupervised fusion methods. The codes of this paper will be available at this address: https://github.com/shuaikaishi/CycFusion for reproducibility.
    Summary Hyperspectral images (HSI) carry rich spectral information but usually have low spatial resolution because of hardware limits. Multispectral images (MSI), such as RGB images, have high spatial resolution but lack spectral signatures. Hyperspectral and multispectral image fusion can be a cost-effective and efficient way to obtain images with both high spatial and high spectral resolution. Many conventional HSI and MSI fusion algorithms rely on known spatial degradation parameters (the point spread function), spectral degradation parameters (the spectral response function), or both. Another class of deep-learning-based models relies on ground-truth high spatial resolution HSI and needs large amounts of paired training images. Both are limited in practical fusion scenarios. In this paper, we propose an unsupervised HSI and MSI fusion model called CycFusion. CycFusion learns the domain transformation between the LrHSI and the HrMSI, and treats the desired high spatial resolution HSI as intermediate feature maps in the transformation networks. It is trained with marginal matching objectives on single transforms and cycle consistency objectives on double transforms, and the estimated PSF and SRF are embedded in the model as pre-training weights. Experiments show that the proposed model outperforms the other unsupervised fusion methods it is compared with. The code will be available at https://github.com/shuaikaishi/CycFusion for reproducibility.
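The cycle consistency idea can be illustrated with a generic two-domain sketch: map a sample to the other domain and back, then penalize the difference from the original. The L1 distance and the callables `forward` and `backward` are hypothetical placeholders for the two transformation networks; the abstract does not give the exact form of the CycFusion objectives.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally long flat lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(x, y, forward, backward):
    """Generic cycle loss between domain X (e.g. LrHSI) and domain Y (e.g. HrMSI).

    forward maps X -> Y and backward maps Y -> X. Both are placeholders
    for learned networks. L1 is an assumed choice of distance.
    """
    x_cycle = backward(forward(x))  # X -> Y -> X
    y_cycle = forward(backward(y))  # Y -> X -> Y
    return l1_distance(x_cycle, x) + l1_distance(y_cycle, y)


if __name__ == "__main__":
    def double(v):
        return [2.0 * e for e in v]

    def halve(v):
        return [e / 2.0 for e in v]

    # Exactly inverse mappings give zero cycle loss.
    print(cycle_consistency_loss([1.0, 2.0], [2.0, 4.0], double, halve))
```

The loss is zero only when the two mappings invert each other on the given samples, which is what lets the model train without paired ground-truth HrHSI.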

Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation

  • paper_url: http://arxiv.org/abs/2307.03407
  • repo_url: None
  • paper_authors: Dahyun Kang, Piotr Koniusz, Minsu Cho, Naila Murray
  • for: This paper addresses weakly-supervised few-shot image classification and segmentation by leveraging a Vision Transformer (ViT) pretrained with self-supervision.
  • methods: The method takes token representations from the self-supervised ViT and uses their correlations, via self-attention, to produce classification and segmentation predictions through two separate task heads.
  • results: Experiments show significant performance gains in a variety of supervision settings, in particular when little to no pixel-level labels are available.
    Abstract We address the task of weakly-supervised few-shot image classification and segmentation, by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our proposed method takes token representations from the self-supervised ViT and leverages their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads. Our model is able to effectively learn to perform classification and segmentation in the absence of pixel-level labels during training, using only image-level labels. To do this it uses attention maps, created from tokens generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. We also explore a practical setup with "mixed" supervision, where a small number of training images contains ground-truth pixel-level labels and the remaining images have only image-level labels. For this mixed setup, we propose to improve the pseudo-labels using a pseudo-label enhancer that was trained using the available ground-truth pixel-level labels. Experiments on Pascal-5i and COCO-20i demonstrate significant performance gains in a variety of supervision settings, and in particular when little-to-no pixel-level labels are available.
    Summary We address the task of weakly-supervised few-shot image classification and segmentation by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our method takes token representations from the self-supervised ViT and uses their correlations, via self-attention, to produce classification and segmentation predictions. The model can learn to perform classification and segmentation without pixel-level labels during training, using only image-level labels. To do this, it uses attention maps, created from tokens generated by the self-supervised ViT, as pixel-level pseudo-labels. We also explore a practical mixed-supervision setup in which a small number of training images have ground-truth pixel-level labels and the remaining images have only image-level labels. For this mixed setup, we propose a pseudo-label enhancer trained on the available ground-truth pixel-level labels. Experiments on Pascal-5i and COCO-20i show significant performance gains in a variety of supervision settings, especially when little to no pixel-level labels are available.
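To make the pseudo-label idea concrete, the sketch below turns a 2D attention map into a binary mask by min-max normalizing it and applying a threshold. Both the normalization and the 0.5 threshold are illustrative assumptions; the abstract does not state how the attention maps are converted into pseudo-labels.

```python
def attention_to_pseudo_mask(attention, threshold=0.5):
    """Binarise a 2D attention map (list of rows) into a 0/1 pseudo-mask.

    Values are min-max normalised to [0, 1] and compared to the threshold.
    A constant map carries no localisation signal and yields an all-zero mask.
    """
    flat = [v for row in attention for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0 for _ in row] for row in attention]
    return [[1 if (v - lo) / span >= threshold else 0 for v in row]
            for row in attention]


if __name__ == "__main__":
    attn = [[0.0, 1.0],
            [0.7, 0.2]]
    print(attention_to_pseudo_mask(attn))
```

In the mixed-supervision setup the abstract mentions, masks like this would be the input that a learned enhancer refines using the few available ground-truth masks.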

RGB-D Mapping and Tracking in a Plenoxel Radiance Field

  • paper_url: http://arxiv.org/abs/2307.03404
  • repo_url: None
  • paper_authors: Andreas L. Teigen, Yeonsoo Park, Annette Stahl, Rudolf Mester
  • for: This technical report targets computer vision and robotics, in particular dense mapping and tracking.
  • methods: Extends the Plenoxel radiance field model with an analytical differential approach that works on RGB-D data without a neural network.
  • results: Achieves state-of-the-art results in both the mapping and tracking tasks while being faster than neural network-based approaches.
    Abstract Building on the success of Neural Radiance Fields (NeRFs), recent years have seen significant advances in the domain of novel view synthesis. These models capture the scene's volumetric radiance field, creating highly convincing dense photorealistic models through the use of simple, differentiable rendering equations. Despite their popularity, these algorithms suffer from severe ambiguities in visual data inherent to the RGB sensor, which means that although images generated with view synthesis can visually appear very believable, the underlying 3D model will often be wrong. This considerably limits the usefulness of these models in practical applications like Robotics and Extended Reality (XR), where an accurate dense 3D reconstruction otherwise would be of significant value. In this technical report, we present the vital differences between view synthesis models and 3D reconstruction models. We also comment on why a depth sensor is essential for modeling accurate geometry in general outward-facing scenes using the current paradigm of novel view synthesis methods. Focusing on the structure-from-motion task, we practically demonstrate this need by extending the Plenoxel radiance field model: Presenting an analytical differential approach for dense mapping and tracking with radiance fields based on RGB-D data without a neural network. Our method achieves state-of-the-art results in both the mapping and tracking tasks while also being faster than competing neural network-based approaches.
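    Summary Building on the success of Neural Radiance Fields (NeRFs), recent years have seen significant advances in novel view synthesis. These models capture the scene's volumetric radiance field and create highly convincing, dense, photorealistic models using simple, differentiable rendering equations. Despite their popularity, these algorithms suffer from severe ambiguities in visual data inherent to the RGB sensor, so although images generated with view synthesis can look very believable, the underlying 3D model will often be wrong. This considerably limits the usefulness of these models in practical applications like Robotics and Extended Reality (XR), where an accurate dense 3D reconstruction would otherwise be of significant value. In this technical report, we present the key differences between view synthesis models and 3D reconstruction models. We also comment on why a depth sensor is essential for modeling accurate geometry under the current paradigm of novel view synthesis methods. Focusing on the structure-from-motion task, we demonstrate this need in practice by extending the Plenoxel radiance field model with an analytical differential approach for dense mapping and tracking based on RGB-D data. Our method achieves state-of-the-art results in both the mapping and tracking tasks while also being faster than competing neural network-based approaches.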

Beyond Geo-localization: Fine-grained Orientation of Street-view Images by Cross-view Matching with Satellite Imagery with Supplementary Materials

  • paper_url: http://arxiv.org/abs/2307.03398
  • repo_url: None
  • paper_authors: Wenmiao Hu, Yichen Zhang, Yuxuan Liang, Yifang Yin, Andrei Georgescu, An Tran, Hannes Kruppa, See-Kiong Ng, Roger Zimmermann
  • for: This paper focuses on improving the accuracy of fine-grained orientation estimation for street-view images.
  • methods: The paper formally defines the orientation estimation problem, provides a set of evaluation metrics, and proposes two methods that improve the granularity of orientation estimation through cross-view matching with satellite imagery.
  • results: The methods reach 82.4% and 72.3% accuracy for images with estimated angle errors below 2 degrees on CVUSA and CVACT, an absolute improvement of 34.9% and 28.2% over previous works. Integrating fine-grained orientation estimation in training also improves geo-localization performance.
    Abstract Street-view imagery provides us with novel experiences to explore different places remotely. Carefully calibrated street-view images (e.g. Google Street View) can be used for different downstream tasks, e.g. navigation, map features extraction. As personal high-quality cameras have become much more affordable and portable, an enormous amount of crowdsourced street-view images are uploaded to the internet, but commonly with missing or noisy sensor information. To prepare this hidden treasure for "ready-to-use" status, determining missing location information and camera orientation angles are two equally important tasks. Recent methods have achieved high performance on geo-localization of street-view images by cross-view matching with a pool of geo-referenced satellite imagery. However, most of the existing works focus more on geo-localization than estimating the image orientation. In this work, we re-state the importance of finding fine-grained orientation for street-view images, formally define the problem and provide a set of evaluation metrics to assess the quality of the orientation estimation. We propose two methods to improve the granularity of the orientation estimation, achieving 82.4% and 72.3% accuracy for images with estimated angle errors below 2 degrees for CVUSA and CVACT datasets, corresponding to 34.9% and 28.2% absolute improvement compared to previous works. Integrating fine-grained orientation estimation in training also improves the performance on geo-localization, giving top 1 recall 95.5%/85.5% and 86.8%/80.4% for orientation known/unknown tests on the two datasets.
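    Summary Street-view imagery lets us explore different places remotely. Carefully calibrated street-view images (e.g., Google Street View) can be used for downstream tasks such as navigation and map feature extraction. As personal high-quality cameras have become more affordable and portable, an enormous number of crowdsourced street-view images are uploaded to the internet, but usually with missing or noisy sensor information. To make this hidden treasure ready to use, determining the missing location information and the camera orientation angles are two equally important tasks. Recent methods have achieved high performance on geo-localization of street-view images, but most existing works focus more on geo-localization than on estimating the image orientation. In this work, we restate the importance of finding fine-grained orientation for street-view images, formally define the problem, and provide a set of evaluation metrics for the quality of the orientation estimation. We propose two methods to improve the granularity of the orientation estimation, achieving 82.4% and 72.3% accuracy for images with estimated angle errors below 2 degrees on the CVUSA and CVACT datasets, an absolute improvement of 34.9% and 28.2% over previous works. Integrating fine-grained orientation estimation in training also improves geo-localization performance, giving top-1 recall of 95.5%/85.5% and 86.8%/80.4% for orientation known/unknown tests on the two datasets.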
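The headline metric, accuracy for angle errors below 2 degrees, depends on measuring orientation error on a circle, so that 359° and 1° are 2° apart rather than 358°. The sketch below shows one way to compute it; the function names are hypothetical and the paper's own evaluation code may differ.

```python
def angular_error_deg(predicted, target):
    """Smallest absolute difference between two headings, in degrees (0 to 180)."""
    diff = abs(predicted - target) % 360.0
    return min(diff, 360.0 - diff)


def accuracy_within(predictions, targets, threshold_deg=2.0):
    """Fraction of images whose orientation error is strictly below threshold_deg."""
    errors = [angular_error_deg(p, t) for p, t in zip(predictions, targets)]
    return sum(e < threshold_deg for e in errors) / len(errors)


if __name__ == "__main__":
    predicted = [0.5, 359.0, 90.0]
    actual = [0.0, 0.5, 100.0]
    # Errors are 0.5, 1.5 and 10 degrees, so two of three images pass.
    print(accuracy_within(predicted, actual))
```

Whether the threshold is strict or inclusive is an assumption here; the abstract says only "below 2 degrees".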

General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.03388
  • repo_url: https://github.com/nhikieu/spatialvolumetricmultimodal
  • paper_authors: Nhi Kieu, Kien Nguyen, Sridha Sridharan, Clinton Fookes
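  • for: This study investigates how PerceiverIO, a general-purpose multimodal network, performs on remote sensing semantic segmentation.
  • methods: The study adds a UNet-inspired module that uses 3D convolution to encode local information while learning cross-modal features.
  • results: The proposed method achieves results competitive with specialized architectures such as UNetFormer and SwinUNet, showing its potential to reduce network architecture engineering.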
    Abstract The advent of high-resolution multispectral/hyperspectral sensors, LiDAR DSM (Digital Surface Model) information and many others has provided us with an unprecedented wealth of data for Earth Observation. Multimodal AI seeks to exploit those complementary data sources, particularly for complex tasks like semantic segmentation. While specialized architectures have been developed, they are highly complicated via significant effort in model design, and require considerable re-engineering whenever a new modality emerges. Recent trends in general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance across multiple multimodal tasks with one unified architecture. In this work, we investigate the performance of PerceiverIO, one in the general-purpose multimodal family, in the remote sensing semantic segmentation domain. Our experiments reveal that this ostensibly universal network struggles with object scale variation in remote sensing images and fails to detect the presence of cars from a top-down view. To address these issues, even with extreme class imbalance issues, we propose a spatial and volumetric learning component. Specifically, we design a UNet-inspired module that employs 3D convolution to encode vital local information and learn cross-modal features simultaneously, while reducing network computational burden via the cross-attention mechanism of PerceiverIO. The effectiveness of the proposed component is validated through extensive experiments comparing it with other methods such as 2D convolution, and dual local module (i.e., the combination of Conv2D 1x1 and Conv2D 3x3 inspired by UNetFormer). The proposed method achieves competitive results with specialized architectures like UNetFormer and SwinUNet, showing its potential to minimize network architecture engineering with a minimal compromise on the performance.
    Summary High-resolution multispectral/hyperspectral sensors, LiDAR DSM (Digital Surface Model) data, and other sources have provided an unprecedented wealth of data for Earth observation. Multimodal AI seeks to exploit these complementary sources, particularly for complex tasks such as semantic segmentation. Specialized architectures exist, but they are complex to design and need considerable re-engineering whenever a new modality emerges, whereas general-purpose multimodal networks have shown potential to reach state-of-the-art performance across multimodal tasks with one unified architecture. This work evaluates PerceiverIO on remote sensing semantic segmentation and finds that it struggles with object scale variation and fails to detect cars from a top-down view. To address this, even under extreme class imbalance, the authors propose a spatial and volumetric learning component: a UNet-inspired module that uses 3D convolution to encode local information and learn cross-modal features simultaneously, while PerceiverIO's cross-attention mechanism reduces the computational burden. Extensive experiments compare the component with alternatives such as 2D convolution and a dual local module (i.e., the combination of Conv2D 1x1 and Conv2D 3x3 inspired by UNetFormer). The method achieves results competitive with UNetFormer and SwinUNet with a minimal compromise on performance.

Weakly-supervised Contrastive Learning for Unsupervised Object Discovery

  • paper_url: http://arxiv.org/abs/2307.03376
  • repo_url: https://github.com/npucvr/wscuod
  • paper_authors: Yunqiu Lv, Jing Zhang, Nick Barnes, Yuchao Dai
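  • for: This work proposes a new unsupervised object discovery method that benefits bounding-box-level localization and pixel-level segmentation.
  • methods: The method fine-tunes the feature encoder of a self-supervised model via weakly-supervised contrastive learning (WCL) to enhance semantic information exploration, then applies Principal Component Analysis (PCA) to localize object regions.
  • results: Extensive experiments on benchmark unsupervised object discovery datasets demonstrate the effectiveness of the proposed solution. Source code and experimental results are available via the project page: https://github.com/npucvr/WSCUOD.git.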
    Abstract Unsupervised object discovery (UOD) refers to the task of discriminating the whole region of objects from the background within a scene without relying on labeled datasets, which benefits the task of bounding-box-level localization and pixel-level segmentation. This task is promising due to its ability to discover objects in a generic manner. We roughly categorise existing techniques into two main directions, namely the generative solutions based on image resynthesis, and the clustering methods based on self-supervised models. We have observed that the former heavily relies on the quality of image reconstruction, while the latter shows limitations in effectively modeling semantic correlations. To directly target at object discovery, we focus on the latter approach and propose a novel solution by incorporating weakly-supervised contrastive learning (WCL) to enhance semantic information exploration. We design a semantic-guided self-supervised learning model to extract high-level semantic features from images, which is achieved by fine-tuning the feature encoder of a self-supervised model, namely DINO, via WCL. Subsequently, we introduce Principal Component Analysis (PCA) to localize object regions. The principal projection direction, corresponding to the maximal eigenvalue, serves as an indicator of the object region(s). Extensive experiments on benchmark unsupervised object discovery datasets demonstrate the effectiveness of our proposed solution. The source code and experimental results are publicly available via our project page at https://github.com/npucvr/WSCUOD.git.
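    Summary Unsupervised object discovery (UOD) is the task of separating the whole region of objects from the background in a scene without labeled datasets, which benefits bounding-box-level localization and pixel-level segmentation. The task is promising because it discovers objects in a generic manner. Existing techniques fall roughly into two directions: generative solutions based on image resynthesis, and clustering methods based on self-supervised models. The former relies heavily on the quality of image reconstruction, while the latter is limited in modeling semantic correlations. To target object discovery directly, this work follows the latter direction and incorporates weakly-supervised contrastive learning (WCL) to enhance semantic information exploration. A semantic-guided self-supervised learning model extracts high-level semantic features from images by fine-tuning the feature encoder of the self-supervised model DINO via WCL. Principal Component Analysis (PCA) is then used to localize object regions: the principal projection direction, corresponding to the maximal eigenvalue, serves as an indicator of the object region(s). Extensive experiments on benchmark unsupervised object discovery datasets demonstrate the effectiveness of the solution. Source code and experimental results are available via the project page at https://github.com/npucvr/WSCUOD.git.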
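The PCA localization step described in this entry can be illustrated with a small sketch. It assumes a grid of patch features of shape (H, W, C), such as a fine-tuned encoder might produce, and uses synthetic features here. The function name, the zero threshold, and the rule for resolving the eigenvector's sign are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def pca_object_mask(features):
    """Split an (H, W, C) grid of patch features into object / background.

    Projects the centered features onto the covariance eigenvector with the
    largest eigenvalue and thresholds the projection at zero.
    """
    h, w, c = features.shape
    x = features.reshape(-1, c).astype(float)
    x = x - x.mean(axis=0, keepdims=True)
    cov = x.T @ x / x.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    principal = eigvecs[:, -1]              # direction of the maximal eigenvalue
    mask = (x @ principal > 0).reshape(h, w)
    # Assumption (not from the paper): the eigenvector sign is arbitrary,
    # so treat the smaller region as the object.
    if mask.mean() > 0.5:
        mask = ~mask
    return mask

# Toy example: a 6x6 feature grid with a 2x2 "object" that differs from the background.
rng = np.random.default_rng(0)
features = 0.01 * rng.standard_normal((6, 6, 4))
features[2:4, 2:4, :2] += 3.0
mask = pca_object_mask(features)
```

In this toy case the largest-variance direction is the one separating the 2x2 block from the rest, so the mask recovers the block. Real features are noisier, and the sign heuristic would fail for objects that fill most of the image.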

A Survey of Deep Learning in Sports Applications: Perception, Comprehension, and Decision

  • paper_url: http://arxiv.org/abs/2307.03353
  • repo_url: None
  • paper_authors: Zhonghan Zhao, Wenhao Chai, Shengyu Hao, Wenhao Hu, Guanhong Wang, Shidong Cao, Mingli Song, Jenq-Neng Hwang, Gaoang Wang
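  • for: This paper surveys deep learning applications in sports performance, covering perception, comprehension, and decision.
  • methods: The paper presents the hierarchical structure of deep learning algorithms, reviews existing datasets, and describes current challenges and future trends.
  • results: By analyzing existing datasets and giving an overview of deep learning in sports applications, the paper provides reference material for researchers in the field.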
    Abstract Deep learning has the potential to revolutionize sports performance, with applications ranging from perception and comprehension to decision. This paper presents a comprehensive survey of deep learning in sports performance, focusing on three main aspects: algorithms, datasets and virtual environments, and challenges. Firstly, we discuss the hierarchical structure of deep learning algorithms in sports performance which includes perception, comprehension and decision while comparing their strengths and weaknesses. Secondly, we list widely used existing datasets in sports and highlight their characteristics and limitations. Finally, we summarize current challenges and point out future trends of deep learning in sports. Our survey provides valuable reference material for researchers interested in deep learning in sports applications.
    Summary Deep learning has the potential to revolutionize sports performance, with applications ranging from perception and comprehension to decision. This paper gives a comprehensive survey of deep learning in sports performance, covering three main aspects: algorithms, datasets and virtual environments, and challenges. First, it presents the hierarchical structure of deep learning algorithms in sports performance (perception, comprehension, and decision) and compares their strengths and weaknesses. Second, it lists widely used sports datasets and describes their characteristics and limitations. Finally, it summarizes current challenges and points out future trends of deep learning in sports. The survey provides valuable reference material for researchers interested in deep learning for sports applications.

Dividing and Conquering a BlackBox to a Mixture of Interpretable Models: Route, Interpret, Repeat

  • paper_url: http://arxiv.org/abs/2307.05350
  • repo_url: https://github.com/batmanlab/ICML-2023-Route-interpret-repeat
  • paper_authors: Shantanu Ghosh, Ke Yu, Forough Arabshahi, Kayhan Batmanghelich
  • for: This paper aims to blur the distinction between post hoc explanation of a Blackbox and constructing interpretable models, by iteratively carving out a mixture of interpretable experts (MoIE) and a residual network.
  • methods: The paper uses a route, interpret, and repeat approach, starting with a Blackbox and iteratively carving out MoIE and a residual network. Each interpretable model specializes in a subset of samples and explains them using First Order Logic (FOL), while the residual network handles the remaining samples.
  • results: Extensive experiments show that the approach (1) identifies a diverse set of instance-specific concepts with high concept completeness via MoIE without compromising performance, (2) identifies the relatively “harder” samples to explain via residuals, (3) outperforms interpretable by-design models by significant margins during test-time interventions, and (4) fixes the shortcut learned by the original Blackbox.
    Abstract ML model design either starts with an interpretable model or a Blackbox and explains it post hoc. Blackbox models are flexible but difficult to explain, while interpretable models are inherently explainable. Yet, interpretable models require extensive ML knowledge and tend to be less flexible and underperforming than their Blackbox variants. This paper aims to blur the distinction between a post hoc explanation of a Blackbox and constructing interpretable models. Beginning with a Blackbox, we iteratively carve out a mixture of interpretable experts (MoIE) and a residual network. Each interpretable model specializes in a subset of samples and explains them using First Order Logic (FOL), providing basic reasoning on concepts from the Blackbox. We route the remaining samples through a flexible residual. We repeat the method on the residual network until all the interpretable models explain the desired proportion of data. Our extensive experiments show that our route, interpret, and repeat approach (1) identifies a diverse set of instance-specific concepts with high concept completeness via MoIE without compromising in performance, (2) identifies the relatively "harder" samples to explain via residuals, (3) outperforms the interpretable by-design models by significant margins during test-time interventions, and (4) fixes the shortcut learned by the original Blackbox. The code for MoIE is publicly available at: https://github.com/batmanlab/ICML-2023-Route-interpret-repeat
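    Summary Machine learning model design either starts with an interpretable model or starts with a Blackbox and explains it post hoc. Blackbox models are flexible but hard to explain, while interpretable models are inherently explainable. However, interpretable models require extensive ML knowledge and tend to be less flexible and to underperform their Blackbox variants. This paper aims to blur the distinction between post hoc explanation of a Blackbox and the construction of interpretable models. Starting from a Blackbox, the method iteratively carves out a mixture of interpretable experts (MoIE) and a residual network. Each interpretable model specializes in a subset of samples and explains them using First Order Logic (FOL), providing basic reasoning on concepts from the Blackbox. The remaining samples are routed through a flexible residual, and the procedure is repeated on the residual network until the interpretable models together explain the desired proportion of the data. Extensive experiments show that the route, interpret, and repeat approach (1) identifies a diverse set of instance-specific concepts with high concept completeness via MoIE without compromising performance, (2) identifies the relatively harder samples to explain via residuals, (3) outperforms interpretable by-design models by significant margins during test-time interventions, and (4) fixes the shortcut learned by the original Blackbox. The MoIE code is available at https://github.com/batmanlab/ICML-2023-Route-interpret-repeat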
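The route, interpret, and repeat loop in this entry can be sketched as plain control flow. This sketch covers only the routing logic. The `selectors` are hypothetical stand-ins for the learned routing functions, and training of the experts and the residual is omitted entirely. None of the names below come from the paper's code.

```python
def route_interpret_repeat(samples, selectors, target_coverage):
    """Carve samples out for successive experts until enough data is covered.

    selectors: list of callables (sample -> bool); a hypothetical stand-in for
    the learned selector that decides whether an expert explains a sample.
    Returns (assignments, remaining), where remaining goes to the residual.
    """
    remaining = list(samples)
    total = len(remaining)
    covered = 0
    assignments = []
    for expert_id, accepts in enumerate(selectors):
        if total == 0 or covered / total >= target_coverage:
            break  # the experts already explain the desired proportion of data
        taken = [s for s in remaining if accepts(s)]
        remaining = [s for s in remaining if not accepts(s)]
        assignments.append((expert_id, taken))
        covered += len(taken)
    return assignments, remaining

# Toy run: integers as samples, each expert accepting a wider range than the last.
selectors = [lambda s: s < 4, lambda s: s < 7, lambda s: s < 9]
assignments, residual_samples = route_interpret_repeat(range(10), selectors, target_coverage=0.6)
```

In the toy run the first two experts cover 7 of 10 samples, which meets the 60% target, so the third expert is never used and samples 7 to 9 stay with the residual.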

Open-Vocabulary Object Detection via Scene Graph Discovery

  • paper_url: http://arxiv.org/abs/2307.03339
  • repo_url: None
  • paper_authors: Hengcan Shi, Munawar Hayat, Jianfei Cai
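  • for: This paper addresses open-vocabulary object detection which, unlike traditional detection of fixed-category objects, aims to detect objects in an open category set.
  • methods: The paper proposes a Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues to detect open-vocabulary objects. It comprises a scene-graph-based decoder (SGDecoder) with sparse scene-graph-guided attention (SSGA) and a scene-graph-based prediction (SGPred) mechanism.
  • results: Experiments on COCO and LVIS demonstrate the effectiveness of the approach, and the model can also perform open-vocabulary scene graph detection. The scene-graph-based offset regression is designed so that scene graph extraction and object localization enhance each other.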
    Abstract In recent years, open-vocabulary (OV) object detection has attracted increasing research attention. Unlike traditional detection, which only recognizes fixed-category objects, OV detection aims to detect objects in an open category set. Previous works often leverage vision-language (VL) training data (e.g., referring grounding data) to recognize OV objects. However, they only use pairs of nouns and individual objects in VL data, while these data usually contain much more information, such as scene graphs, which are also crucial for OV detection. In this paper, we propose a novel Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues for OV detection. Firstly, a scene-graph-based decoder (SGDecoder) including sparse scene-graph-guided attention (SSGA) is presented. It captures scene graphs and leverages them to discover OV objects. Secondly, we propose scene-graph-based prediction (SGPred), where we build a scene-graph-based offset regression (SGOR) mechanism to enable mutual enhancement between scene graph extraction and object localization. Thirdly, we design a cross-modal learning mechanism in SGPred. It takes scene graphs as bridges to improve the consistency between cross-modal embeddings for OV object classification. Experiments on COCO and LVIS demonstrate the effectiveness of our approach. Moreover, we show the ability of our model for OV scene graph detection, while previous OV scene graph generation methods cannot tackle this task.
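    Summary Open-vocabulary (OV) object detection has attracted increasing research attention in recent years. Unlike traditional detection, which recognizes only fixed-category objects, OV detection aims to detect objects in an open category set. Previous work often uses vision-language (VL) training data (e.g., referring grounding data) to recognize OV objects, but only uses pairs of nouns and individual objects, although such data usually contains much more information, such as scene graphs, that is also crucial for OV detection. This paper proposes a Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues for OV detection. First, a scene-graph-based decoder (SGDecoder) with sparse scene-graph-guided attention (SSGA) captures scene graphs and uses them to discover OV objects. Second, scene-graph-based prediction (SGPred) builds a scene-graph-based offset regression (SGOR) mechanism so that scene graph extraction and object localization enhance each other. Third, a cross-modal learning mechanism in SGPred uses scene graphs as bridges to improve the consistency between cross-modal embeddings for OV object classification. Experiments on COCO and LVIS demonstrate the effectiveness of the approach. The model can also perform OV scene graph detection, a task that previous OV scene graph generation methods cannot handle.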

Facial Landmark Detection Evaluation on MOBIO Database

  • paper_url: http://arxiv.org/abs/2307.03329
  • repo_url: None
  • paper_authors: Na Zhang
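  • for: This paper aims to advance research on deploying biometric techniques, such as face and speaker recognition, on mobile devices.
  • methods: Several state-of-the-art facial landmark detection methods are evaluated on face images from the MOBIO database.
  • results: Facial landmark detection proves challenging on this mobile data, and the MOBIO database can serve as a new challenging benchmark.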
    Abstract MOBIO is a bi-modal database that was captured almost exclusively on mobile phones. It aims to improve research into deploying biometric techniques to mobile devices. Research has been shown that face and speaker recognition can be performed in a mobile environment. Facial landmark localization aims at finding the coordinates of a set of pre-defined key points for 2D face images. A facial landmark usually has specific semantic meaning, e.g. nose tip or eye centre, which provides rich geometric information for other face analysis tasks such as face recognition, emotion estimation and 3D face reconstruction. Pretty much facial landmark detection methods adopt still face databases, such as 300W, AFW, AFLW, or COFW, for evaluation, but seldomly use mobile data. Our work is first to perform facial landmark detection evaluation on the mobile still data, i.e., face images from MOBIO database. About 20,600 face images have been extracted from this audio-visual database and manually labeled with 22 landmarks as the groundtruth. Several state-of-the-art facial landmark detection methods are adopted to evaluate their performance on these data. The result shows that the data from MOBIO database is pretty challenging. This database can be a new challenging one for facial landmark detection evaluation.

CheXmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images

  • paper_url: http://arxiv.org/abs/2307.03293
  • repo_url: https://github.com/ngaggion/chexmask-database
  • paper_authors: Nicolás Gaggion, Candelaria Mosquera, Lucas Mansilla, Martina Aineseder, Diego H. Milone, Enzo Ferrante
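  • for: This paper provides a large-scale, multi-center chest X-ray segmentation dataset to support the development of chest X-ray analysis methods.
  • methods: The paper uses the HybridGNet model to ensure consistent, high-quality segmentations across all datasets.
  • results: The paper provides 676,803 segmentation masks, validated through expert physician evaluation and automatic quality control. It also provides individualized quality indices per mask and an overall quality estimation per dataset.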
    Abstract The development of successful artificial intelligence models for chest X-ray analysis relies on large, diverse datasets with high-quality annotations. While several databases of chest X-ray images have been released, most include disease diagnosis labels but lack detailed pixel-level anatomical segmentation labels. To address this gap, we introduce an extensive chest X-ray multi-center segmentation dataset with uniform and fine-grain anatomical annotations for images coming from six well-known publicly available databases: CANDID-PTX, ChestX-ray8, Chexpert, MIMIC-CXR-JPG, Padchest, and VinDr-CXR, resulting in 676,803 segmentation masks. Our methodology utilizes the HybridGNet model to ensure consistent and high-quality segmentations across all datasets. Rigorous validation, including expert physician evaluation and automatic quality control, was conducted to validate the resulting masks. Additionally, we provide individualized quality indices per mask and an overall quality estimation per dataset. This dataset serves as a valuable resource for the broader scientific community, streamlining the development and assessment of innovative methodologies in chest X-ray analysis. The CheXmask dataset is publicly available at: https://physionet.org/content/chexmask-cxr-segmentation-data/.
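    Summary Developing successful artificial intelligence models for chest X-ray analysis requires large, diverse datasets with high-quality annotations. Several chest X-ray databases have been released, but most include only disease diagnosis labels and lack detailed pixel-level anatomical segmentation labels. To close this gap, the authors introduce an extensive multi-center chest X-ray segmentation dataset with uniform, fine-grained anatomical annotations for images from six well-known public databases: CANDID-PTX, ChestX-ray8, Chexpert, MIMIC-CXR-JPG, Padchest, and VinDr-CXR, for a total of 676,803 segmentation masks. The HybridGNet model is used to ensure consistent, high-quality segmentations across all datasets. The resulting masks were validated rigorously, including expert physician evaluation and automatic quality control. The dataset also provides an individualized quality index per mask and an overall quality estimate per dataset. It serves as a resource for the scientific community to streamline the development and assessment of new chest X-ray analysis methods. The CheXmask dataset is publicly available at https://physionet.org/content/chexmask-cxr-segmentation-data/.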

To pretrain or not to pretrain? A case study of domain-specific pretraining for semantic segmentation in histopathology

  • paper_url: http://arxiv.org/abs/2307.03275
  • repo_url: None
  • paper_authors: Tushar Kataria, Beatrice Knudsen, Shireen Elhabian
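  • for: This study tests whether histopathology-specific pretrained models provide better initializations for pathology vision tasks.
  • methods: The study compares histopathology domain-specific pretrained weights with weights pretrained on real-world images, on gland and cell segmentation tasks.
  • results: The performance gain from domain-specific pretrained weights depends on both the task and the size of the training dataset. With limited data, gland segmentation improved significantly, whereas models trained on cell segmentation datasets showed no improvement.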
    Abstract Annotating medical imaging datasets is costly, so fine-tuning (or transfer learning) is the most effective method for digital pathology vision applications such as disease classification and semantic segmentation. However, due to texture bias in models trained on real-world images, transfer learning for histopathology applications might result in underperforming models, which necessitates the need for using unlabeled histopathology data and self-supervised methods to discover domain-specific characteristics. Here, we tested the premise that histopathology-specific pretrained models provide better initializations for pathology vision tasks, i.e., gland and cell segmentation. In this study, we compare the performance of gland and cell segmentation tasks with histopathology domain-specific and non-domain-specific (real-world images) pretrained weights. Moreover, we investigate the dataset size at which domain-specific pretraining produces significant gains in performance. In addition, we investigated whether domain-specific initialization improves the effectiveness of out-of-distribution testing on distinct datasets but the same task. The results indicate that performance gain using domain-specific pretrained weights depends on both the task and the size of the training dataset. In instances with limited dataset sizes, a significant improvement in gland segmentation performance was also observed, whereas models trained on cell segmentation datasets exhibit no improvement.
    Summary Annotating medical imaging datasets is costly, so fine-tuning (or transfer learning) is the most effective method for digital pathology vision applications such as disease classification and semantic segmentation. However, because models trained on real-world images have a texture bias, transfer learning for histopathology applications may produce underperforming models, which motivates using unlabeled histopathology data and self-supervised methods to discover domain-specific characteristics. This study tests the premise that histopathology-specific pretrained models provide better initializations for pathology vision tasks, namely gland and cell segmentation. It compares performance on gland and cell segmentation with histopathology domain-specific and non-domain-specific (real-world images) pretrained weights, investigates the dataset size at which domain-specific pretraining produces significant gains, and examines whether domain-specific initialization improves out-of-distribution testing on distinct datasets for the same task. The results indicate that the gain from domain-specific pretrained weights depends on both the task and the size of the training dataset. With limited dataset sizes, gland segmentation improved significantly, whereas models trained on cell segmentation datasets showed no improvement.

ADASSM: Adversarial Data Augmentation in Statistical Shape Models From Images

  • paper_url: http://arxiv.org/abs/2307.03273
  • repo_url: None
  • paper_authors: Mokshagna Sai Teja Karanam, Tushar Kataria, Krithika Iyer, Shireen Elhabian
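  • for: This paper proposes a new data augmentation strategy to address data scarcity in the Image-to-SSM (statistical shape model) framework.
  • methods: The strategy uses data-dependent noise generation or texture augmentation, trained as an adversary to the Image-to-SSM network to generate diverse and challenging noisy samples.
  • results: The strategy improves the accuracy of Image-to-SSM networks by encouraging the model to focus on the underlying geometry rather than relying solely on pixel values.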
    Abstract Statistical shape models (SSM) have been well-established as an excellent tool for identifying variations in the morphology of anatomy across the underlying population. Shape models use consistent shape representation across all the samples in a given cohort, which helps to compare shapes and identify the variations that can detect pathologies and help in formulating treatment plans. In medical imaging, computing these shape representations from CT/MRI scans requires time-intensive preprocessing operations, including but not limited to anatomy segmentation annotations, registration, and texture denoising. Deep learning models have demonstrated exceptional capabilities in learning shape representations directly from volumetric images, giving rise to highly effective and efficient Image-to-SSM networks. Nevertheless, these models are data-hungry and due to the limited availability of medical data, deep learning models tend to overfit. Offline data augmentation techniques, that use kernel density estimation based (KDE) methods for generating shape-augmented samples, have successfully aided Image-to-SSM networks in achieving comparable accuracy to traditional SSM methods. However, these augmentation methods focus on shape augmentation, whereas deep learning models exhibit image-based texture bias resulting in sub-optimal models. This paper introduces a novel strategy for on-the-fly data augmentation for the Image-to-SSM framework by leveraging data-dependent noise generation or texture augmentation. The proposed framework is trained as an adversary to the Image-to-SSM network, augmenting diverse and challenging noisy samples. Our approach achieves improved accuracy by encouraging the model to focus on the underlying geometry rather than relying solely on pixel values.
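    Summary Statistical shape models (SSM) are a well-established tool for identifying variations in anatomical morphology across a population. They use a consistent shape representation across all samples in a cohort, which makes it possible to compare shapes and identify variations that can detect pathologies and inform treatment plans. In medical imaging, computing these shape representations from CT/MRI scans requires time-intensive preprocessing, including but not limited to anatomy segmentation annotation, registration, and texture denoising. Deep learning models can learn shape representations directly from volumetric images, giving rise to effective and efficient Image-to-SSM networks. These models are data-hungry, however, and tend to overfit because medical data is limited. Offline data augmentation techniques that use kernel density estimation (KDE) based methods to generate shape-augmented samples have helped Image-to-SSM networks reach accuracy comparable to traditional SSM methods. These techniques focus on shape augmentation, while deep learning models exhibit an image-based texture bias that leads to sub-optimal models. This paper introduces an on-the-fly data augmentation strategy for the Image-to-SSM framework based on data-dependent noise generation or texture augmentation. The augmentation framework is trained as an adversary to the Image-to-SSM network and generates diverse, challenging noisy samples. The approach improves accuracy by encouraging the model to focus on the underlying geometry rather than relying solely on pixel values.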

Empirical Analysis of a Segmentation Foundation Model in Prostate Imaging

  • paper_url: http://arxiv.org/abs/2307.03266
  • repo_url: None
  • paper_authors: Heejong Kim, Victor Ion Butoi, Adrian V. Dalca, Daniel J. A. Margolis, Mert R. Sabuncu
  • for: This paper is written for the purpose of evaluating the effectiveness of a foundation model for medical image segmentation, specifically in the context of prostate imaging.
  • methods: The paper uses a recently developed foundation model called UniverSeg, which is trained on a large dataset of images and can be customized for various downstream tasks with little to no labeled data.
  • results: The paper compares the performance of UniverSeg against conventional task-specific segmentation models and highlights several important factors that will likely be important in the development and adoption of foundation models for medical image segmentation. The results show that UniverSeg achieves competitive performance against task-specific models while requiring significantly less labeled data.
    Abstract Most state-of-the-art techniques for medical image segmentation rely on deep-learning models. These models, however, are often trained on narrowly-defined tasks in a supervised fashion, which requires expensive labeled datasets. Recent advances in several machine learning domains, such as natural language generation have demonstrated the feasibility and utility of building foundation models that can be customized for various downstream tasks with little to no labeled data. This likely represents a paradigm shift for medical imaging, where we expect that foundation models may shape the future of the field. In this paper, we consider a recently developed foundation model for medical image segmentation, UniverSeg. We conduct an empirical evaluation study in the context of prostate imaging and compare it against the conventional approach of training a task-specific segmentation model. Our results and discussion highlight several important factors that will likely be important in the development and adoption of foundation models for medical image segmentation.
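    Summary Most state-of-the-art techniques for medical image segmentation rely on deep learning models. These models are usually trained on narrowly defined tasks in a supervised fashion, which requires expensive labeled datasets. Recent advances in other machine learning domains, such as natural language generation, have shown that foundation models can be built and then customized for various downstream tasks with little to no labeled data. This likely represents a paradigm shift for medical imaging. This paper considers UniverSeg, a recently developed foundation model for medical image segmentation, and evaluates it empirically in the context of prostate imaging against the conventional approach of training a task-specific segmentation model. The results and discussion highlight several factors that will likely matter for the development and adoption of foundation models for medical image segmentation.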

A Fully Automated and Explainable Algorithm for the Prediction of Malignant Transformation in Oral Epithelial Dysplasia

  • paper_url: http://arxiv.org/abs/2307.03757
  • repo_url: None
  • paper_authors: Adam J Shephard, Raja Muhammad Saad Bashir, Hanya Mahmood, Mostafa Jahanifar, Fayyaz Minhas, Shan E Ahmed Raza, Kris D McCombe, Stephanie G Craig, Jacqueline James, Jill Brooks, Paul Nankivell, Hisham Mehanna, Syed Ali Khurram, Nasir M Rajpoot
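  • for: Predicting malignant transformation in oral epithelial dysplasia (OED).
  • methods: An artificial intelligence algorithm assigns an Oral Malignant Transformation (OMT) risk score, based on histological patterns of nuclei in Haematoxylin and Eosin stained whole slide images, to quantify the risk of OED progression.
  • results: With internal cross-validation on the development cohort (Sheffield) and independent validation on two external cohorts (Birmingham and Belfast), the OMTscore yields an AUROC of 0.74 in predicting whether an OED progresses to malignancy. Survival analyses showed the prognostic value of the OMTscore, and the most predictive patches of transformed cases contained peri-epithelial and epithelium-infiltrating lymphocytes.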
    Abstract Oral epithelial dysplasia (OED) is a premalignant histopathological diagnosis given to lesions of the oral cavity. Its grading suffers from significant inter-/intra- observer variability, and does not reliably predict malignancy progression, potentially leading to suboptimal treatment decisions. To address this, we developed a novel artificial intelligence algorithm that can assign an Oral Malignant Transformation (OMT) risk score, based on histological patterns in the in Haematoxylin and Eosin stained whole slide images, to quantify the risk of OED progression. The algorithm is based on the detection and segmentation of nuclei within (and around) the epithelium using an in-house segmentation model. We then employed a shallow neural network fed with interpretable morphological/spatial features, emulating histological markers. We conducted internal cross-validation on our development cohort (Sheffield; n = 193 cases) followed by independent validation on two external cohorts (Birmingham and Belfast; n = 92 cases). The proposed OMTscore yields an AUROC = 0.74 in predicting whether an OED progresses to malignancy or not. Survival analyses showed the prognostic value of our OMTscore for predicting malignancy transformation, when compared to the manually-assigned WHO and binary grades. Analysis of the correctly predicted cases elucidated the presence of peri-epithelial and epithelium-infiltrating lymphocytes in the most predictive patches of cases that transformed (p < 0.0001). This is the first study to propose a completely automated algorithm for predicting OED transformation based on interpretable nuclear features, whilst being validated on external datasets. The algorithm shows better-than-human-level performance for prediction of OED malignant transformation and offers a promising solution to the challenges of grading OED in routine clinical practice.
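    Summary Oral epithelial dysplasia (OED) is a premalignant histopathological diagnosis given to lesions of the oral cavity. Its grading suffers from significant inter- and intra-observer variability and does not reliably predict progression to malignancy, which can lead to suboptimal treatment decisions. To address this, the authors developed an artificial intelligence algorithm that assigns an Oral Malignant Transformation (OMT) risk score, based on histological patterns in Haematoxylin and Eosin stained whole slide images, to quantify the risk of OED progression. The algorithm detects and segments nuclei within and around the epithelium using an in-house segmentation model, then feeds interpretable morphological/spatial features that emulate histological markers into a shallow neural network. Internal cross-validation was performed on the development cohort (Sheffield; n = 193 cases), followed by independent validation on two external cohorts (Birmingham and Belfast; n = 92 cases). The proposed OMTscore yields an AUROC of 0.74 in predicting whether an OED progresses to malignancy. Survival analyses showed the prognostic value of the OMTscore compared with the manually assigned WHO and binary grades. Analysis of correctly predicted cases revealed peri-epithelial and epithelium-infiltrating lymphocytes in the most predictive patches of cases that transformed (p < 0.0001). This is the first study to propose a fully automated algorithm for predicting OED transformation based on interpretable nuclear features while being validated on external datasets. The algorithm shows better-than-human-level performance and offers a promising solution to the challenges of grading OED in routine clinical practice.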
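The AUROC reported in this entry is the probability that a randomly chosen case that transformed receives a higher risk score than a randomly chosen case that did not. A minimal sketch of that computation follows, using made-up scores and labels rather than data from the study. The function name and the toy values are illustrative assumptions.

```python
import numpy as np

def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]    # scores of cases that transformed
    neg = scores[~labels]   # scores of cases that did not
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative case")
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

# Hypothetical risk scores; 1 marks a case that progressed to malignancy.
risk_scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
transformed = [1, 1, 1, 0, 0, 0]
value = auroc(risk_scores, transformed)
```

In the toy data, 7 of the 9 positive/negative pairs are ranked correctly, so the AUROC is 7/9. The pairwise form is quadratic in the number of cases, which is fine for cohorts of a few hundred cases but not for very large ones.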

PSDR-Room: Single Photo to Scene using Differentiable Rendering

  • paper_url: http://arxiv.org/abs/2307.03244
  • repo_url: None
  • paper_authors: Kai Yan, Fujun Luan, Miloš Hašan, Thibault Groueix, Valentin Deschaintre, Shuang Zhao
  • for: Matching a target photo of an indoor scene with an editable 3D scene, a staging task that is otherwise time-consuming and requires both artistic and technical skills.
  • methods: A recent path-space differentiable rendering approach provides unbiased gradients of the rendering with respect to geometry, lighting, and procedural materials, so all of these components can be optimized by gradient descent to visually match the input photo.
  • results: Recent single-image scene understanding methods initialize the optimization and search for appropriate 3D models and materials. Evaluation on real photographs of indoor scenes demonstrates the editability of the resulting scene components.
    Abstract A 3D digital scene contains many components: lights, materials and geometries, interacting to reach the desired appearance. Staging such a scene is time-consuming and requires both artistic and technical skills. In this work, we propose PSDR-Room, a system allowing to optimize lighting as well as the pose and materials of individual objects to match a target image of a room scene, with minimal user input. To this end, we leverage a recent path-space differentiable rendering approach that provides unbiased gradients of the rendering with respect to geometry, lighting, and procedural materials, allowing us to optimize all of these components using gradient descent to visually match the input photo appearance. We use recent single-image scene understanding methods to initialize the optimization and search for appropriate 3D models and materials. We evaluate our method on real photographs of indoor scenes and demonstrate the editability of the resulting scene components.
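    Summary A 3D digital scene contains many components (lights, materials, and geometry) that interact to produce the desired appearance. Staging such a scene is time-consuming and requires both artistic and technical skill. This work proposes PSDR-Room, a system that optimizes the lighting as well as the pose and materials of individual objects to match a target image of a room scene with minimal user input. It relies on a recent path-space differentiable rendering method that provides unbiased gradients of the rendering with respect to geometry, lighting, and materials, so these components can be optimized by gradient descent. Recent single-image scene understanding methods are used to initialize the optimization and to search for suitable 3D models and materials. The method is evaluated on real photographs of indoor scenes, and the resulting scene components are shown to be editable.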
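
The gradient-descent matching described for PSDR-Room can be illustrated with a toy sketch. The one-pixel Lambertian shading model (`albedo * light * cos_theta`) and every name below are assumptions made for illustration; PSDR-Room itself relies on a path-space differentiable renderer, which this sketch does not reproduce.

```python
def render(albedo, light, cos_theta):
    """Toy forward model (assumption): brightness of one Lambertian pixel."""
    return albedo * light * max(cos_theta, 0.0)


def fit_light(target, albedo, cos_theta, light=1.0, lr=0.5, steps=200):
    """Gradient descent on the light intensity so the render matches `target`."""
    for _ in range(steps):
        residual = render(albedo, light, cos_theta) - target
        # Hand-derived gradient of 0.5 * residual**2 with respect to `light`.
        grad = residual * albedo * max(cos_theta, 0.0)
        light -= lr * grad
    return light


# Synthesize a target with a known light intensity, then recover it.
target = render(albedo=0.8, light=2.5, cos_theta=0.6)
estimate = fit_light(target, albedo=0.8, cos_theta=0.6)
```

In a real system the hand-derived gradient would be replaced by gradients from the differentiable renderer, and the parameter vector would also cover object pose and material parameters.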

That’s BAD: Blind Anomaly Detection by Implicit Local Feature Clustering

  • paper_url: http://arxiv.org/abs/2307.03243
  • repo_url: None
  • paper_authors: Jie Zhang, Masanori Suganuma, Takayuki Okatani
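  • for: Unsupervised anomaly detection (AD) for industrial objects and textures. The paper proposes a more challenging setting in which anomalous samples are detected within a given image set that may mix normal and anomalous images; unlike prior work, this setting requires no human annotation.
  • methods: Proposes PatchCluster, a novel method that converts the problem into a local outlier detection problem and detects anomalies at both the image and pixel level.
  • results: Experiments show that PatchCluster achieves promising detection performance without knowledge of normal data, comparable even to SOTA methods that require it.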
    Abstract Recent studies on visual anomaly detection (AD) of industrial objects/textures have achieved quite good performance. They consider an unsupervised setting, specifically the one-class setting, in which we assume the availability of a set of normal (i.e., anomaly-free) images for training. In this paper, we consider a more challenging scenario of unsupervised AD, in which we detect anomalies in a given set of images that might contain both normal and anomalous samples. The setting does not assume the availability of known normal data and thus is completely free from human annotation, which differs from the standard AD considered in recent studies. For clarity, we call the setting blind anomaly detection (BAD). We show that BAD can be converted into a local outlier detection problem and propose a novel method named PatchCluster that can accurately detect image- and pixel-level anomalies. Experimental results show that PatchCluster shows a promising performance without the knowledge of normal data, even comparable to the SOTA methods applied in the one-class setting needing it.
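    Summary Recent work on visual anomaly detection (AD) has achieved very good performance. It assumes an unsupervised, specifically one-class, setting in which a set of normal (i.e., anomaly-free) images is available for training. This paper considers a more challenging unsupervised AD scenario: detecting anomalies in a given set of images that may contain both normal and anomalous samples. The setting requires no human annotation, which distinguishes it from standard AD. For clarity, it is called blind anomaly detection (BAD). The paper shows that BAD can be converted into a local outlier detection problem and proposes a new method, PatchCluster, that accurately detects image- and pixel-level anomalies. Experiments show that PatchCluster performs well without knowledge of normal data, comparable even to SOTA methods that need it.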
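
Treating blind anomaly detection as local outlier detection can be sketched with a k-nearest-neighbour score over patch features. This is a generic illustration and not PatchCluster itself; the 2-D feature vectors, the function name `knn_outlier_scores`, and the choice of `k` are assumptions.

```python
import math


def knn_outlier_scores(features, k=2):
    """Score each feature vector by its mean distance to its k nearest neighbours."""
    scores = []
    for i, f in enumerate(features):
        # Distances to every other feature vector, smallest first.
        dists = sorted(math.dist(f, g) for j, g in enumerate(features) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Hypothetical patch features: four similar patches and one outlier.
patches = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0)]
scores = knn_outlier_scores(patches, k=2)
most_anomalous = max(range(len(patches)), key=scores.__getitem__)
```

A patch that lies far from all others receives a high score even though no image was labelled as normal, which is the property the blind setting depends on.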

Adaptive Generation of Privileged Intermediate Information for Visible-Infrared Person Re-Identification

  • paper_url: http://arxiv.org/abs/2307.03240
  • repo_url: None
  • paper_authors: Mahdi Alehdaghi, Arthur Josi, Pourya Shamsolmoali, Rafael M. O. Cruz, Eric Granger
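  • for: Improving the accuracy of visible-infrared person re-identification (V-I ReID), which retrieves images of the same person captured across a network of RGB and IR sensors.
  • methods: Proposes Adaptive Generation of Privileged Intermediate Information (AGPI^2), a training approach that generates a virtual intermediate domain to bridge the gap in data distributions between the V and I modalities. A non-linear generative module translates RGB images into an intermediate domain with a smaller domain shift with respect to the I domain, and an embedding module encourages similar features for V and generated images.
  • results: Experiments show that AGPI^2 increases V-I ReID matching accuracy without extra computational resources during inference.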
    Abstract Visible-infrared person re-identification seeks to retrieve images of the same individual captured over a distributed network of RGB and IR sensors. Several V-I ReID approaches directly integrate both V and I modalities to discriminate persons within a shared representation space. However, given the significant gap in data distributions between V and I modalities, cross-modal V-I ReID remains challenging. Some recent approaches improve generalization by leveraging intermediate spaces that can bridge V and I modalities, yet effective methods are required to select or generate data for such informative domains. In this paper, the Adaptive Generation of Privileged Intermediate Information training approach is introduced to adapt and generate a virtual domain that bridges discriminant information between the V and I modalities. The key motivation behind AGPI^2 is to enhance the training of a deep V-I ReID backbone by generating privileged images that provide additional information. These privileged images capture shared discriminative features that are not easily accessible within the original V or I modalities alone. Towards this goal, a non-linear generative module is trained with an adversarial objective, translating V images into intermediate spaces with a smaller domain shift w.r.t. the I domain. Meanwhile, the embedding module within AGPI^2 aims to produce similar features for both V and generated images, encouraging the extraction of features that are common to all modalities. In addition to these contributions, AGPI^2 employs adversarial objectives for adapting the intermediate images, which play a crucial role in creating a non-modality-specific space to address the large domain shifts between V and I domains. Experimental results conducted on challenging V-I ReID datasets indicate that AGPI^2 increases matching accuracy without extra computational resources during inference.
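    Summary Visible-infrared person re-identification (V-I ReID) aims to retrieve images of the same person captured by RGB and IR sensors. Some V-I ReID methods integrate the V and I modalities directly into a shared representation space, but the large gap between the data distributions of the two modalities keeps cross-modal V-I ReID challenging. Some recent methods use intermediate spaces to bridge the V and I modalities, but they need effective ways to select or generate data. This paper proposes the Adaptive Generation of Privileged Intermediate Information (AGPI^2) training approach, which adapts and generates a virtual domain that bridges the V and I modalities. The key idea is to enhance the training of a deep V-I ReID backbone by generating privileged images; these carry shared discriminative features that are hard to access in the original V or I modality alone. To this end, a non-linear generative module is trained to translate V images into an intermediate space with a smaller domain shift. Meanwhile, the embedding module in AGPI^2 aims to produce similar features for V and generated images, encouraging the extraction of features common to all modalities. AGPI^2 also applies adversarial objectives to the intermediate images, which play a key role in creating a non-modality-specific space that addresses the large domain shift between the V and I domains. Experiments show that AGPI^2 improves matching accuracy without extra computational resources during inference.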

Synthesizing Artistic Cinemagraphs from Text

  • paper_url: http://arxiv.org/abs/2307.03190
  • repo_url: https://github.com/text2cinemagraph/text2cinemagraph
  • paper_authors: Aniruddha Mahapatra, Aliaksandr Siarohin, Hsin-Ying Lee, Sergey Tulyakov, Jun-Yan Zhu
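  • for: A fully automated method for creating cinemagraphs from text descriptions.
  • methods: Synthesizes image twins from a single text prompt: an artistic image and a pixel-aligned, natural-looking twin. The artistic image depicts the style and appearance described in the prompt, while the natural-looking image simplifies layout and motion analysis. Using existing natural image and video datasets, the method segments the natural-looking image, predicts plausible motion, and transfers that motion to the artistic image to create the final cinemagraph.
  • results: Outperforms existing approaches in creating cinemagraphs for natural landscapes as well as artistic and other-worldly scenes, as validated by automated metrics and user studies. The method can also animate existing paintings and control motion direction through text.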
    Abstract We introduce Text2Cinemagraph, a fully automated method for creating cinemagraphs from text descriptions - an especially challenging task when prompts feature imaginary elements and artistic styles, given the complexity of interpreting the semantics and motions of these images. Existing single-image animation methods fall short on artistic inputs, and recent text-based video methods frequently introduce temporal inconsistencies, struggling to keep certain regions static. To address these challenges, we propose an idea of synthesizing image twins from a single text prompt - a pair of an artistic image and its pixel-aligned corresponding natural-looking twin. While the artistic image depicts the style and appearance detailed in our text prompt, the realistic counterpart greatly simplifies layout and motion analysis. Leveraging existing natural image and video datasets, we can accurately segment the realistic image and predict plausible motion given the semantic information. The predicted motion can then be transferred to the artistic image to create the final cinemagraph. Our method outperforms existing approaches in creating cinemagraphs for natural landscapes as well as artistic and other-worldly scenes, as validated by automated metrics and user studies. Finally, we demonstrate two extensions: animating existing paintings and controlling motion directions using text.
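    Summary We introduce Text2Cinemagraph, a fully automated method for creating cinemagraphs from text descriptions. The task is especially challenging when prompts feature imaginary elements and artistic styles, because interpreting the semantics and motions of such images is complex. Existing single-image animation methods perform poorly on artistic inputs, and recent text-based video methods often introduce temporal inconsistencies and struggle to keep certain regions static. To address these challenges, we propose synthesizing image twins from a single text prompt: an artistic image and its pixel-aligned, natural-looking counterpart. The artistic image presents the style and appearance described in the text, while the natural image greatly simplifies layout and motion analysis. Using existing natural image and video datasets, we can accurately segment the natural image and predict plausible motion from the semantic information. The predicted motion is then transferred to the artistic image to create the final cinemagraph. Our method performs well for natural landscapes as well as artistic and other-worldly scenes, as validated by automated metrics and user studies. Finally, we show two extensions: animating existing paintings and controlling motion direction with text.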

IPO-LDM: Depth-aided 360-degree Indoor RGB Panorama Outpainting via Latent Diffusion Model

  • paper_url: http://arxiv.org/abs/2307.03177
  • repo_url: None
  • paper_authors: Tianhao Wu, Chuanxia Zheng, Tat-Jen Cham
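  • for: Generating high-quality 360-degree indoor RGB panoramas from narrow field-of-view images by outpainting with latent diffusion models (LDM).
  • methods: Introduces a new bi-modal latent diffusion structure that uses both RGB and depth panoramic data during training, yet can outpaint ordinary depth-free RGB images at inference. Also proposes introducing progressive camera rotations during each diffusion denoising step to improve panorama wraparound consistency.
  • results: IPO-LDM significantly outperforms state-of-the-art methods on RGB panorama outpainting and can produce multiple, diverse, well-structured results for different types of masks.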
    Abstract Generating complete 360-degree panoramas from narrow field of view images is ongoing research as omnidirectional RGB data is not readily available. Existing GAN-based approaches face some barriers to achieving higher quality output, and have poor generalization performance over different mask types. In this paper, we present our 360-degree indoor RGB panorama outpainting model using latent diffusion models (LDM), called IPO-LDM. We introduce a new bi-modal latent diffusion structure that utilizes both RGB and depth panoramic data during training, but works surprisingly well to outpaint normal depth-free RGB images during inference. We further propose a novel technique of introducing progressive camera rotations during each diffusion denoising step, which leads to substantial improvement in achieving panorama wraparound consistency. Results show that our IPO-LDM not only significantly outperforms state-of-the-art methods on RGB panorama outpainting, but can also produce multiple and diverse well-structured results for different types of masks.
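    Summary Generating complete 360-degree panoramas from narrow field-of-view images is an active research problem, because omnidirectional RGB data is not readily available. Existing GAN-based methods are limited in output quality and generalize poorly across mask types. In this paper we present IPO-LDM, a 360-degree indoor RGB panorama outpainting model based on latent diffusion models (LDM). During training we use a bi-modal latent diffusion structure over RGB and depth panoramic data, but at inference the model can outpaint depth-free RGB images. We also propose progressively adding camera rotations at each diffusion denoising step, which improves panorama wraparound consistency. Results show that IPO-LDM not only clearly outperforms current state-of-the-art RGB panorama outpainting methods but also generates multiple well-structured, high-quality results for different mask types.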
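
One way to see why camera rotation relates to wraparound consistency: in an equirectangular panorama, rotating the camera about the vertical axis is a circular shift of the image columns, so the left/right seam moves into the interior of the image. The sketch below illustrates only that property. The `denoise_step` callback, the fixed per-step shift, and the list-of-rows image format are assumptions, not the IPO-LDM implementation.

```python
def rotate_panorama(rows, shift):
    """Circularly shift every row of an equirectangular image by `shift` columns."""
    width = len(rows[0])
    shift %= width  # also maps negative shifts to the equivalent positive one
    return [row[-shift:] + row[:-shift] for row in rows]


def denoise_with_rotation(latent, denoise_step, num_steps, shift_per_step):
    """Rotate before each (placeholder) denoising step, then undo the total rotation."""
    total = 0
    for step in range(num_steps):
        latent = rotate_panorama(latent, shift_per_step)
        total += shift_per_step
        latent = denoise_step(latent, step)
    return rotate_panorama(latent, -total)


# With an identity "denoiser" the panorama comes back unchanged.
pano = [[0, 1, 2, 3], [4, 5, 6, 7]]
restored = denoise_with_rotation(pano, lambda x, step: x, num_steps=3, shift_per_step=1)
```

Because the seam sits at a different column at every step, a denoiser that only sees local context is repeatedly asked to make the former image borders agree with each other.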

VideoGLUE: Video General Understanding Evaluation of Foundation Models

  • paper_url: http://arxiv.org/abs/2307.03166
  • repo_url: None
  • paper_authors: Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong
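  • for: Evaluating the video understanding capabilities of existing foundation models (FMs), and proposing a scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when adapting to general video understanding tasks.
  • methods: Uses three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four methods for adapting an FM to a downstream task.
  • results: First, task-specialized models significantly outperform the six FMs studied, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs are generally better than image-native FMs at classifying motion-rich videos, localizing actions in time, and understanding videos with more than one action. Third, video-native FMs perform well under light adaptation (e.g., freezing the FM backbone), while image-native FMs win under full end-to-end finetuning.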
    Abstract We evaluate existing foundation models' video understanding capabilities using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring a foundation model (FM) for a downstream task. Moreover, we propose a scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when adapting to general video understanding tasks. Our main findings are as follows. First, task-specialized models significantly outperform the six FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data contains the video modality, are generally better than image-native FMs in classifying motion-rich videos, localizing actions in time, and understanding a video of more than one action. Third, the video-native FMs can perform well on video tasks under light adaptations to downstream tasks (e.g., freezing the FM backbones), while image-native FMs win in full end-to-end finetuning. The first two observations reveal the need and tremendous opportunities to conduct research on video-focused FMs, and the last confirms that both tasks and adaptation methods matter when it comes to the evaluation of FMs.
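    Summary We use a carefully designed experiment protocol to evaluate the video understanding capabilities of existing foundation models (FMs). It consists of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four methods for adapting an FM to a downstream task. We also propose a scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency on general video understanding tasks. Our main findings are as follows. First, task-specialized models significantly outperform the six FMs we studied, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data contains the video modality, are better at classifying motion-rich videos, localizing actions in time, and understanding videos with more than one action. Third, video-native FMs can perform well on video tasks under light adaptation to downstream tasks (e.g., freezing the FM backbone), while image-native FMs do better under full end-to-end finetuning. The first two findings show the need and the opportunities for research on video-focused FMs, and the last confirms that both tasks and adaptation methods matter when evaluating FMs.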

Can Domain Adaptation Improve Accuracy and Fairness of Skin Lesion Classification?

  • paper_url: http://arxiv.org/abs/2307.03157
  • repo_url: None
  • paper_authors: Janet Wang, Yunbei Zhang, Zhengming Ding, Jihun Hamm
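  • for: Investigating the feasibility of unsupervised domain adaptation (UDA) methods for binary and multi-class skin lesion classification using multiple skin lesion datasets.
  • methods: Assesses three UDA training schemes: single-source, combined-source, and multi-source.
  • results: UDA is effective in binary classification, with further improvement when imbalance is mitigated. In the multi-class task its performance is less prominent, and imbalance must be addressed to achieve above-baseline accuracy. Test error is strongly correlated with label shift, and feature-level UDA methods have limitations on imbalanced datasets. UDA can also reduce bias against minority groups without the explicit use of fairness-focused techniques.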
    Abstract Deep learning-based diagnostic systems have demonstrated potential in classifying skin cancer conditions when labeled training examples are abundant. However, skin lesion analysis often suffers from a scarcity of labeled data, hindering the development of an accurate and reliable diagnostic system. In this work, we leverage multiple skin lesion datasets and investigate the feasibility of various unsupervised domain adaptation (UDA) methods in binary and multi-class skin lesion classification. In particular, we assess three UDA training schemes: single-, combined-, and multi-source. Our experiment results show that UDA is effective in binary classification, with further improvement being observed when imbalance is mitigated. In the multi-class task, its performance is less prominent, and the imbalance problem again needs to be addressed to achieve above-baseline accuracy. Through our quantitative analysis, we find that the test error of multi-class tasks is strongly correlated with label shift, and feature-level UDA methods have limitations when handling imbalanced datasets. Finally, our study reveals that UDA can effectively reduce bias against minority groups and promote fairness, even without the explicit use of fairness-focused techniques.

MultiVENT: Multilingual Videos of Events with Aligned Natural Text

  • paper_url: http://arxiv.org/abs/2307.03153
  • repo_url: None
  • paper_authors: Kate Sanders, David Etter, Reno Kriz, Benjamin Van Durme
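  • for: Constructing MultiVENT, a multilingual, event-centric video dataset that can be used to teach models to benefit from the diverse presentation formats of modern news coverage.
  • methods: Builds the dataset from news broadcast videos and non-professional event footage grounded in text documents across five target languages, and uses it to analyze the state of online news videos and how they can be leveraged to build robust, factually accurate models.
  • results: Provides a baseline model for complex, multilingual video retrieval using MultiVENT.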
    Abstract Everyday news coverage has shifted from traditional broadcasts towards a wide range of presentation formats such as first-hand, unedited video footage. Datasets that reflect the diverse array of multimodal, multilingual news sources available online could be used to teach models to benefit from this shift, but existing news video datasets focus on traditional news broadcasts produced for English-speaking audiences. We address this limitation by constructing MultiVENT, a dataset of multilingual, event-centric videos grounded in text documents across five target languages. MultiVENT includes both news broadcast videos and non-professional event footage, which we use to analyze the state of online news videos and how they can be leveraged to build robust, factually accurate models. Finally, we provide a model for complex, multilingual video retrieval to serve as a baseline for information retrieval using MultiVENT.
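    Summary Everyday news coverage has diversified, shifting from traditional broadcasts toward many formats such as first-hand, unedited video footage. Existing news video datasets focus on traditional news broadcasts produced for English-speaking audiences. We address this limitation by constructing MultiVENT, a dataset of multilingual, event-centric videos grounded in text documents. MultiVENT includes both news broadcast videos and non-professional event footage, which we use to analyze the state of online news videos and to explore how they can support robust, accurate models. Finally, we provide a model for complex, multilingual video retrieval as a baseline for MultiVENT.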

Topology-Aware Loss for Aorta and Great Vessel Segmentation in Computed Tomography Images

  • paper_url: http://arxiv.org/abs/2307.03137
  • repo_url: None
  • paper_authors: Seher Ozcelik, Sinan Unver, Ilke Ali Gurses, Rustu Turkay, Cigdem Gunduz-Demir
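  • for: Segmentation of the aorta and great vessels in computed tomography (CT) images.
  • methods: Introduces a new topology-aware loss function that uses persistent homology to penalize topological dissimilarities between the prediction and the ground truth.
  • results: Experiments show that the proposed loss leads to better results than its counterparts, indicating its effectiveness.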
    Abstract Segmentation networks are not explicitly imposed to learn global invariants of an image, such as the shape of an object and the geometry between multiple objects, when they are trained with a standard loss function. On the other hand, incorporating such invariants into network training may help improve performance for various segmentation tasks when they are the intrinsic characteristics of the objects to be segmented. One example is segmentation of aorta and great vessels in computed tomography (CT) images where vessels are found in a particular geometry in the body due to the human anatomy and they mostly seem as round objects on a 2D CT image. This paper addresses this issue by introducing a new topology-aware loss function that penalizes topology dissimilarities between the ground truth and prediction through persistent homology. Different from the previously suggested segmentation network designs, which apply the threshold filtration on a likelihood function of the prediction map and the Betti numbers of the ground truth, this paper proposes to apply the Vietoris-Rips filtration to obtain persistence diagrams of both ground truth and prediction maps and calculate the dissimilarity with the Wasserstein distance between the corresponding persistence diagrams. The use of this filtration has advantage of modeling shape and geometry at the same time, which may not happen when the threshold filtration is applied. Our experiments on 4327 CT images of 24 subjects reveal that the proposed topology-aware loss function leads to better results than its counterparts, indicating the effectiveness of this use.
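    Summary Segmentation networks trained with a standard loss function do not explicitly learn global invariants of an image, such as the shape of an object and the geometry between multiple objects. Incorporating such invariants into training may improve performance on segmentation tasks where they are intrinsic characteristics of the objects to be segmented. One example is segmenting the aorta and great vessels in computed tomography (CT) images: because of human anatomy, vessels follow a particular geometry in the body and mostly appear as round objects on a 2D CT image. This paper addresses the issue by introducing a new topology-aware loss function that penalizes topological dissimilarities between the ground truth and the prediction through persistent homology. Unlike previous segmentation network designs, this paper proposes applying the Vietoris-Rips filtration to obtain persistence diagrams of both the ground truth and the prediction maps, and computing the Wasserstein distance between them. The advantage of this filtration is that it models shape and geometry at the same time, which may not happen with the threshold filtration. Experiments on 4327 CT images of 24 subjects show that the proposed topology-aware loss function gives better results than its counterparts, indicating its effectiveness.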
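
The last step of the topology-aware loss, comparing two persistence diagrams with the Wasserstein distance, can be sketched for very small diagrams. This is a brute-force illustration under stated assumptions: diagrams are given as lists of `(birth, death)` pairs, the ground metric is L-infinity, the order is 1, and unmatched points are sent to the diagonal. Computing the diagrams with a Vietoris-Rips filtration is not shown, and none of this is taken from the authors' code.

```python
from itertools import permutations


def _cost(a, b):
    """L-infinity cost of matching two points; None means 'the diagonal'."""
    if a is None and b is None:
        return 0.0
    if a is None:
        return (b[1] - b[0]) / 2.0
    if b is None:
        return (a[1] - a[0]) / 2.0
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def wasserstein_1(diagram_a, diagram_b):
    """1-Wasserstein distance between two small persistence diagrams (brute force)."""
    # Pad both sides with diagonal slots so any point may stay unmatched.
    left = list(diagram_a) + [None] * len(diagram_b)
    right = list(diagram_b) + [None] * len(diagram_a)
    return min(
        sum(_cost(a, b) for a, b in zip(left, perm))
        for perm in permutations(right)
    )


# Hypothetical diagrams: one real feature, plus a short-lived spurious one in the prediction.
ground_truth = [(0.0, 1.0)]
prediction = [(0.0, 0.8), (0.4, 0.5)]
loss_value = wasserstein_1(ground_truth, prediction)
```

The factorial cost of enumerating permutations limits this to a handful of points; a practical implementation would solve the same matching as an assignment problem.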

Benchmarking Test-Time Adaptation against Distribution Shifts in Image Classification

  • paper_url: http://arxiv.org/abs/2307.03133
  • repo_url: https://github.com/yuyongcan/benchmark-tta
  • paper_authors: Yongcan Yu, Lijun Sheng, Ran He, Jian Liang
  • for: Providing a reliable benchmark for evaluating test-time adaptation (TTA) methods, so that researchers and practitioners can assess and compare how well different TTA methods improve model robustness and generalization.
  • methods: Systematically evaluates 13 prominent TTA methods and their variants on five widely used image classification datasets (CIFAR-10-C, CIFAR-100-C, ImageNet-C, DomainNet, and Office-Home). The methods cover a range of adaptation scenarios (e.g., online vs. offline adaptation; instance vs. batch vs. domain adaptation). The compatibility of different TTA methods with diverse network backbones is also explored.
  • results: The effectiveness of TTA methods varies across scenarios: some methods perform better in certain scenarios than others, and some are more compatible with certain network backbones than others.
    Abstract Test-time adaptation (TTA) is a technique aimed at enhancing the generalization performance of models by leveraging unlabeled samples solely during prediction. Given the need for robustness in neural network systems when faced with distribution shifts, numerous TTA methods have recently been proposed. However, evaluating these methods is often done under different settings, such as varying distribution shifts, backbones, and designing scenarios, leading to a lack of consistent and fair benchmarks to validate their effectiveness. To address this issue, we present a benchmark that systematically evaluates 13 prominent TTA methods and their variants on five widely used image classification datasets: CIFAR-10-C, CIFAR-100-C, ImageNet-C, DomainNet, and Office-Home. These methods encompass a wide range of adaptation scenarios (e.g., online adaptation vs. offline adaptation; instance adaptation vs. batch adaptation vs. domain adaptation). Furthermore, we explore the compatibility of different TTA methods with diverse network backbones. To implement this benchmark, we have developed a unified framework in PyTorch, which allows for consistent evaluation and comparison of the TTA methods across the different datasets and network architectures. By establishing this benchmark, we aim to provide researchers and practitioners with a reliable means of assessing and comparing the effectiveness of TTA methods in improving model robustness and generalization performance. Our code is available at https://github.com/yuyongcan/Benchmark-TTA.
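    Summary Test-time adaptation (TTA) is a technique that aims to improve a model's generalization performance by using unlabeled samples at prediction time. Because neural network systems need to remain robust under distribution shift, many TTA methods have been proposed. These methods are usually evaluated under different settings, such as different distribution shifts, backbones, and design scenarios, which leads to inconsistent and unfair comparisons. To address this, we present a benchmark that systematically evaluates 13 prominent TTA methods and their variants on five widely used image classification datasets: CIFAR-10-C, CIFAR-100-C, ImageNet-C, DomainNet, and Office-Home. The methods cover a range of adaptation scenarios (e.g., online vs. offline adaptation; instance vs. batch vs. domain adaptation). We also explore the compatibility of different TTA methods with different network backbones. To implement the benchmark, we developed a unified framework in PyTorch that allows consistent evaluation and comparison of TTA methods across datasets and network architectures. With this benchmark we hope to give researchers and practitioners a reliable way to assess and compare how well TTA methods improve model robustness and generalization. Our code is available at https://github.com/yuyongcan/Benchmark-TTA.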

Principal subbundles for dimension reduction

  • paper_url: http://arxiv.org/abs/2307.03128
  • repo_url: None
  • paper_authors: Morten Akhøj, James Benn, Erlend Grong, Stefan Sommer, Xavier Pennec
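  • for: Manifold learning and surface reconstruction.
  • methods: Combines local linear approximations of a point cloud to obtain lower-dimensional bundles.
  • results: Can be applied to a number of important problems, such as constructing an approximating submanifold and computing distances between observations.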
    Abstract In this paper we demonstrate how sub-Riemannian geometry can be used for manifold learning and surface reconstruction by combining local linear approximations of a point cloud to obtain lower dimensional bundles. Local approximations obtained by local PCAs are collected into a rank $k$ tangent subbundle on $\mathbb{R}^d$, $k
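    Summary In this paper we show how sub-Riemannian geometry can be used for manifold learning and surface reconstruction, by combining local linear approximations into a lower-dimensional bundle. This bundle is defined as a rank $k$ tangent subbundle on $\mathbb{R}^d$, where $k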

  • paper_url: http://arxiv.org/abs/2307.03110
  • repo_url: None
  • paper_authors: Bhavna Gopal, Arjun Sridhar, Tunhou Zhang, Yiran Chen
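  • for: Proposing an automated search space shrinkage algorithm for Neural Architecture Search (NAS) that improves search performance while retaining architectural diversity.
  • methods: Leverages locality, the relationship between structural and performance similarity, to efficiently extract many pockets of well-performing networks.
  • results: Evaluated across search spaces of various sizes and datasets. LISSNAS achieves a SOTA Top-1 accuracy of 77.6% on ImageNet under mobile constraints, along with best-in-class Kendall-Tau, architectural diversity, and search space size.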
    Abstract Search spaces hallmark the advancement of Neural Architecture Search (NAS). Large and complex search spaces with versatile building operators and structures provide more opportunities to brew promising architectures, yet pose severe challenges on efficient exploration and exploitation. Subsequently, several search space shrinkage methods optimize by selecting a single sub-region that contains some well-performing networks. Small performance and efficiency gains are observed with these methods but such techniques leave room for significantly improved search performance and are ineffective at retaining architectural diversity. We propose LISSNAS, an automated algorithm that shrinks a large space into a diverse, small search space with SOTA search performance. Our approach leverages locality, the relationship between structural and performance similarity, to efficiently extract many pockets of well-performing networks. We showcase our method on an array of search spaces spanning various sizes and datasets. We accentuate the effectiveness of our shrunk spaces when used in one-shot search by achieving the best Top-1 accuracy in two different search spaces. Our method achieves a SOTA Top-1 accuracy of 77.6% in ImageNet under mobile constraints, best-in-class Kendall-Tau, architectural diversity, and search space size.
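    Summary Search spaces hallmark the advancement of Neural Architecture Search (NAS). Large, complex search spaces with versatile building operators and structures offer more opportunities to produce promising architectures, but pose severe challenges for efficient exploration and exploitation. Several search space shrinkage methods therefore select a single sub-region that contains some well-performing networks. These methods yield small performance and efficiency gains, but they leave substantial room for better search performance and fail to retain architectural diversity. We propose LISSNAS, an automated algorithm that shrinks a large space into a diverse, small search space with strong search performance. Our approach leverages locality, the relationship between structural and performance similarity, to efficiently extract many pockets of well-performing networks. We demonstrate the method on multiple search spaces and achieve a SOTA Top-1 accuracy of 77.6% on ImageNet under mobile constraints, together with best-in-class Kendall-Tau, architectural diversity, and search space size.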
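
Kendall-Tau, reported above as one of the quality measures, is a rank correlation: in NAS it is commonly used to check how well a cheap performance estimate orders architectures compared with their true accuracy. A minimal sketch of the tau-a variant follows; the accuracy values are made up for illustration, and ties are assumed absent.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs; assumes no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1  # both sequences order this pair the same way
        elif sign < 0:
            discordant += 1  # the two sequences disagree on this pair
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical predicted vs. true accuracies for four architectures.
predicted = [0.71, 0.74, 0.69, 0.77]
actual = [0.70, 0.76, 0.72, 0.75]
tau = kendall_tau(predicted, actual)
```

A value of 1 means the two rankings agree on every pair and -1 means they disagree on every pair, so a higher Kendall-Tau indicates that the estimated ranking is a more trustworthy guide to the true one.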

How to Detect Unauthorized Data Usages in Text-to-image Diffusion Models

  • paper_url: http://arxiv.org/abs/2307.03108
  • repo_url: None
  • paper_authors: Zhenting Wang, Chen Chen, Yuchen Liu, Lingjuan Lyu, Dimitris Metaxas, Shiqing Ma
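  • for: Detecting unauthorized data usage in text-to-image diffusion models.
  • methods: Plants injected memorization into models trained on the protected dataset, then detects unauthorized usage by analyzing whether a model has memorized the injected content.
  • results: Experiments on Stable Diffusion and LoRA models demonstrate the effectiveness of the method in detecting unauthorized data usage.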
    Abstract Recent text-to-image diffusion models have shown surprising performance in generating high-quality images. However, concerns have arisen regarding the unauthorized usage of data during the training process. One example is when a model trainer collects a set of images created by a particular artist and attempts to train a model capable of generating similar images without obtaining permission from the artist. To address this issue, it becomes crucial to detect unauthorized data usage. In this paper, we propose a method for detecting such unauthorized data usage by planting injected memorization into the text-to-image diffusion models trained on the protected dataset. Specifically, we modify the protected image dataset by adding unique contents on the images such as stealthy image wrapping functions that are imperceptible to human vision but can be captured and memorized by diffusion models. By analyzing whether the model has memorization for the injected content (i.e., whether the generated images are processed by the chosen post-processing function), we can detect models that had illegally utilized the unauthorized data. Our experiments conducted on Stable Diffusion and LoRA model demonstrate the effectiveness of the proposed method in detecting unauthorized data usages.
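    Summary Recent text-to-image diffusion models have shown remarkable performance in generating high-quality images. However, concerns have arisen about unauthorized use of data. One example is a model trainer who collects a set of images created by a particular artist and tries to train a model that generates similar images without the artist's permission. Detecting unauthorized data usage is therefore important. In this paper we propose a method that adds unique content to the protected image dataset, such as stealthy image wrapping functions, which diffusion models can capture and memorize. By checking whether a model has memorized this content (i.e., whether its generated images are processed by the chosen post-processing function), we can detect whether the model used the data without authorization. Experiments on Stable Diffusion and LoRA models demonstrate the effectiveness of the method.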

Contextual Affinity Distillation for Image Anomaly Detection

  • paper_url: http://arxiv.org/abs/2307.03101
  • repo_url: None
  • paper_authors: Jie Zhang, Masanori Suganuma, Takayuki Okatani
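  • for: Improving unsupervised industrial anomaly detection, in particular the detection of logical anomalies, without cumbersome training techniques.
  • methods: Building on previous knowledge distillation work, uses two students (local and global) to better mimic the teacher's behavior. The local student mainly detects structural anomalies, while the global student attends to logical anomalies. To encourage the global student to capture long-range dependencies, the paper designs a global context condensing block (GCCB) and proposes a contextual affinity loss.
  • results: The proposed method does not need cumbersome training techniques and achieves new state-of-the-art performance on the MVTec LOCO AD dataset.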
    Abstract Previous works on unsupervised industrial anomaly detection mainly focus on local structural anomalies such as cracks and color contamination. While achieving significantly high detection performance on this kind of anomaly, they are faced with logical anomalies that violate the long-range dependencies such as a normal object placed in the wrong position. In this paper, based on previous knowledge distillation works, we propose to use two students (local and global) to better mimic the teacher's behavior. The local student, which is used in previous studies mainly focuses on structural anomaly detection while the global student pays attention to logical anomalies. To further encourage the global student's learning to capture long-range dependencies, we design the global context condensing block (GCCB) and propose a contextual affinity loss for the student training and anomaly scoring. Experimental results show the proposed method doesn't need cumbersome training techniques and achieves a new state-of-the-art performance on the MVTec LOCO AD dataset.
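    Summary Previous work on industrial anomaly detection mainly focuses on local structural anomalies such as cracks and color contamination. Although these methods detect such anomalies very well, they struggle with logical anomalies that violate long-range dependencies, such as a normal object placed in the wrong position. In this paper, building on previous knowledge distillation work, we propose using two students (local and global) to better mimic the teacher's behavior. The local student, used in previous studies, mainly handles structural anomaly detection, while the global student attends to logical anomalies. To further encourage the global student to capture long-range dependencies, we design the global context condensing block (GCCB) and propose a contextual affinity loss. Experiments show that the proposed method does not need cumbersome training techniques and achieves new state-of-the-art performance on the MVTec LOCO AD dataset.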

cs.AI - 2023-07-07

Decomposing the Generalization Gap in Imitation Learning for Visual Robotic Manipulation

  • paper_url: http://arxiv.org/abs/2307.03659
  • repo_url: https://github.com/RLAgent/factor-world
  • paper_authors: Annie Xie, Lisa Lee, Ted Xiao, Chelsea Finn
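  • for: Understanding what makes generalization hard for imitation learning in visual robotic manipulation, and quantifying how much each factor of variation contributes to the generalization gap.
  • methods: Studies imitation learning policies in simulation and on a real-robot language-conditioned manipulation task, and designs a new simulated benchmark of 19 tasks with 11 factors of variation for more controlled evaluation.
  • results: Generalization difficulty differs markedly across factors. The study determines an ordering of factors by generalization difficulty that is consistent across simulation and the real-robot setup.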
    Abstract What makes generalization hard for imitation learning in visual robotic manipulation? This question is difficult to approach at face value, but the environment from the perspective of a robot can often be decomposed into enumerable factors of variation, such as the lighting conditions or the placement of the camera. Empirically, generalization to some of these factors have presented a greater obstacle than others, but existing work sheds little light on precisely how much each factor contributes to the generalization gap. Towards an answer to this question, we study imitation learning policies in simulation and on a real robot language-conditioned manipulation task to quantify the difficulty of generalization to different (sets of) factors. We also design a new simulated benchmark of 19 tasks with 11 factors of variation to facilitate more controlled evaluations of generalization. From our study, we determine an ordering of factors based on generalization difficulty, that is consistent across simulation and our real robot setup.
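    Summary This question is difficult to approach directly, but the environment from the perspective of a robot can often be decomposed into factors of variation, such as the lighting conditions or the placement of the camera. Empirically, generalization to some of these factors has been harder than to others, yet existing work says little about how much each factor contributes to the generalization gap. To answer this question, we study imitation learning policies in simulation and on a real-robot language-conditioned manipulation task. We also design a new simulated benchmark of 19 tasks with 11 factors of variation to support more controlled evaluation of generalization. From our study we determine an ordering of the factors that is consistent across simulation and the real-robot setup.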

Discovering Variable Binding Circuitry with Desiderata

  • paper_url: http://arxiv.org/abs/2307.03637
  • repo_url: None
  • paper_authors: Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, David Bau
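  • for: Automatically identifying the model components responsible for performing a specific subtask.
  • methods: Extends causal mediation experiments so that components are identified solely by specifying a set of desiderata, i.e., causal attributes of the components that execute the subtask.
  • results: The method automatically discovers shared variable binding circuitry in LLaMA-13B, localizing variable binding to only 9 attention heads and one MLP across multiple arithmetic tasks.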
    Abstract Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach which extends causal mediation experiments to automatically identify model components responsible for performing a specific subtask by solely specifying a set of desiderata, or causal attributes of the model components executing that subtask. As a proof of concept, we apply our method to automatically discover shared variable binding circuitry in LLaMA-13B, which retrieves variable values for multiple arithmetic tasks. Our method successfully localizes variable binding to only 9 attention heads (of the 1.6k) and one MLP in the final token's residual stream.
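    Summary Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here we introduce an approach that automatically identifies the model components responsible for a specific subtask, given only a set of desiderata, i.e., causal attributes of the components executing that subtask. As a proof of concept, we apply the method to automatically discover shared variable binding circuitry in LLaMA-13B, which retrieves variable values for multiple arithmetic tasks. The method localizes variable binding to only 9 attention heads (of the 1.6k) and one MLP in the final token's residual stream.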

Over-the-Air Computation in OFDM Systems with Imperfect Channel State Information

  • paper_url: http://arxiv.org/abs/2307.05357
  • repo_url: None
  • paper_authors: Yilong Chen, Huijun Xing, Jie Xu, Lexi Xu, Shuguang Cui
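  • for: This paper studies over-the-air computation (AirComp) in an OFDM system with imperfect channel state information (CSI), in which multiple single-antenna wireless devices (WDs) simultaneously send uncoded signals to a multi-antenna access point (AP) for distributed functional computation.
  • methods: Two scenarios are considered: one minimizes the average computation mean squared error (MSE), the other minimizes the computation outage probability. For both, the transmit coefficients at the WDs and the receive beamforming vectors at the AP are jointly optimized over the subcarriers.
  • results: For a single receive antenna at the AP, semi-closed-form globally optimal solutions are derived with the Lagrange-duality method. For multiple receive antennas at the AP, efficient algorithms based on alternating optimization and convex optimization find converged solutions.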
    Abstract This paper studies the over-the-air computation (AirComp) in an orthogonal frequency division multiplexing (OFDM) system with imperfect channel state information (CSI), in which multiple single-antenna wireless devices (WDs) simultaneously send uncoded signals to a multi-antenna access point (AP) for distributed functional computation over multiple subcarriers. In particular, we consider two scenarios with best-effort and error-constrained computation tasks, with the objectives of minimizing the average computation mean squared error (MSE) and the computation outage probability over the multiple subcarriers, respectively. Towards this end, we jointly optimize the transmit coefficients at the WDs and the receive beamforming vectors at the AP over subcarriers, subject to the maximum transmit power constraints at individual WDs. First, for the special case with a single receive antenna at the AP, we propose the semi-closed-form globally optimal solutions to the two problems using the Lagrange-duality method. It is shown that at each subcarrier, the WDs' optimized power control policy for average MSE minimization follows a regularized channel inversion structure, while that for computation outage probability minimization follows an on-off regularized channel inversion, with the regularization dependent on the transmit power budget and channel estimation error. Next, for the general case with multiple receive antennas at the AP, we present efficient algorithms based on alternating optimization and convex optimization to find converged solutions to both problems.
    Summary For the special case with a single receive antenna at the AP, we propose semi-closed-form globally optimal solutions to the two problems using the Lagrange-duality method. The results show that at each subcarrier, the WDs' optimized power control policy for average MSE minimization follows a regularized channel inversion structure, while that for computation outage probability minimization follows an on-off regularized channel inversion, with the regularization dependent on the transmit power budget and channel estimation error. For the general case with multiple receive antennas at the AP, we present efficient algorithms based on alternating optimization and convex optimization to find converged solutions to both problems. These algorithms take into account the coupling between the transmit coefficients and the receive beamforming vectors, and the non-convexity of the optimization problems. In summary, this paper investigates the optimization of AirComp in an OFDM system with imperfect CSI, and proposes algorithms to minimize the average MSE and computation outage probability over multiple subcarriers. The proposed solutions take into account the maximum transmit power constraints and the coupling between the transmit coefficients and the receive beamforming vectors.
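    The regularized channel inversion structure mentioned above can be illustrated with a minimal sketch. It uses a generic textbook form, not the paper's semi-closed-form solution: the parameter names `eta`, `reg`, and `p_max` are hypothetical, and the exact dependence of the regularization on the power budget and estimation error is not reproduced here.

```python
import math

def regularized_inversion(h_hat, eta, reg, p_max):
    """Transmit coefficient for one device on one subcarrier (illustrative only).

    h_hat : estimated complex channel gain
    eta   : target received signal power (hypothetical design parameter)
    reg   : regularization term, assumed to grow with the estimation error
    p_max : transmit power budget of the device
    """
    # Invert the estimated channel, damped by the regularization term.
    b = math.sqrt(eta) * h_hat.conjugate() / (abs(h_hat) ** 2 + reg)
    # Enforce the power constraint |b|^2 <= p_max by clipping the amplitude.
    limit = math.sqrt(p_max)
    if abs(b) > limit:
        b = b / abs(b) * limit
    return b

# A strong channel is fully inverted; a weak one hits the power limit.
strong = regularized_inversion(2 + 1j, eta=1.0, reg=0.0, p_max=10.0)
weak = regularized_inversion(0.01 + 0.0j, eta=1.0, reg=0.0, p_max=1.0)
```

    With `reg = 0` and enough power, the effective channel `h_hat * b` equals `sqrt(eta)`, so all devices arrive aligned at the AP. A positive `reg` shrinks the coefficient when the channel estimate is unreliable. The on-off variant for outage minimization is not shown.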

Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models

  • paper_url: http://arxiv.org/abs/2307.03762
  • repo_url: None
  • paper_authors: Yuxi Ma, Chi Zhang, Song-Chun Zhu
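  • for: This perspective paper examines how large language models (LLMs) are evaluated and what artificial general intelligence should encompass.
  • methods: The paper first comprehensively reviews existing LLM evaluations and points out problems that cause LLM capabilities to be overstated. It then proposes four characteristics of generally intelligent agents: 1) they can perform unlimited tasks; 2) they can generate new tasks within a context; 3) they generate tasks based on a value system; and 4) they have a world model reflecting reality, which shapes their interaction with the world.
  • results: The paper argues that existing AI research only simulates intelligence rather than achieving true general intelligence. It lacks the unity of knowing and acting, and knowledge acquisition relies not only on passive input but also on repeated trial and error. The paper closes by outlining possible directions for future research.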
    Abstract In this perspective paper, we first comprehensively review existing evaluations of Large Language Models (LLMs) using both standardized tests and ability-oriented benchmarks. We pinpoint several problems with current evaluation methods that tend to overstate the capabilities of LLMs. We then articulate what artificial general intelligence should encompass beyond the capabilities of LLMs. We propose four characteristics of generally intelligent agents: 1) they can perform unlimited tasks; 2) they can generate new tasks within a context; 3) they operate based on a value system that underpins task generation; and 4) they have a world model reflecting reality, which shapes their interaction with the world. Building on this viewpoint, we highlight the missing pieces in artificial general intelligence, that is, the unity of knowing and acting. We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations. Additionally, knowledge acquisition isn't solely reliant on passive input but requires repeated trials and errors. We conclude by outlining promising future research directions in the field of artificial general intelligence.
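    Summary In this perspective paper, we first comprehensively review existing evaluations of large language models (LLMs), using both standardized tests and ability-oriented benchmarks. We point out several problems with current evaluation methods that cause LLM capabilities to be overstated. We then spell out four characteristics that artificial general intelligence should have: 1) it can perform unlimited tasks; 2) it can generate new tasks within a context; 3) it decides task generation based on a value system; and 4) it has a model of the real world that shapes its interaction with the world. Building on this viewpoint, we highlight the missing piece in artificial general intelligence, namely the unity of knowing and acting. We argue that active engagement with objects in the real world provides more robust signals for forming conceptual representations. In addition, knowledge acquisition is not solely reliant on passive input but requires repeated trials and errors. Finally, we outline promising future research directions in the field of artificial general intelligence.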

GEANN: Scalable Graph Augmentations for Multi-Horizon Time Series Forecasting

  • paper_url: http://arxiv.org/abs/2307.03595
  • repo_url: None
  • paper_authors: Sitan Yang, Malcolm Wolff, Shankar Ramasubramanian, Vincent Quenneville-Belair, Ronak Metha, Michael W. Mahoney
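  • for: Addresses the "cold start" problem in time series forecasting, i.e., forecasting time series that lack sufficient historical data.
  • methods: Uses graph neural networks (GNNs) as a data augmentation that enhances the forecaster's encoder. The GNN-based features capture complex relationships between time series.
  • results: In demand forecasting for a large e-commerce retailer, the method improves overall performance and, more importantly, brings substantially larger gains for "cold start" products (newly launched or recently out-of-stock).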
    Abstract Encoder-decoder deep neural networks have been increasingly studied for multi-horizon time series forecasting, especially in real-world applications. However, to forecast accurately, these sophisticated models typically rely on a large number of time series examples with substantial history. A rapidly growing topic of interest is forecasting time series which lack sufficient historical data -- often referred to as the "cold start" problem. In this paper, we introduce a novel yet simple method to address this problem by leveraging graph neural networks (GNNs) as a data augmentation for enhancing the encoder used by such forecasters. These GNN-based features can capture complex inter-series relationships, and their generation process can be optimized end-to-end with the forecasting task. We show that our architecture can use either data-driven or domain knowledge-defined graphs, scaling to incorporate information from multiple very large graphs with millions of nodes. In our target application of demand forecasting for a large e-commerce retailer, we demonstrate on both a small dataset of 100K products and a large dataset with over 2 million products that our method improves overall performance over competitive baseline models. More importantly, we show that it brings substantially more gains to "cold start" products such as those newly launched or recently out-of-stock.
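    Summary Encoder-decoder deep neural networks have been studied increasingly for multi-horizon time series forecasting, especially in real-world applications. To forecast accurately, however, these sophisticated models typically need a large number of time series examples with substantial history. A rapidly growing research topic is forecasting time series that lack historical data, often called the "cold start" problem. In this paper, we introduce a new and simple method that addresses this problem by using graph neural networks (GNNs) to augment the encoder. These GNN-based features capture complex relationships between time series, and their generation can be optimized jointly with the forecasting task. We show that our architecture can use data-driven or domain-knowledge-defined graphs and scales to multiple graphs with millions of nodes. In our target application, we run experiments on a small dataset of 100K products and a large dataset of over 2 million products, and show that our method improves overall performance over competitive baseline models. More importantly, it brings substantial gains for "cold start" products, such as newly launched or recently out-of-stock ones.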
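    To make the idea of graph-based augmentation concrete, here is a minimal sketch. It assumes a single round of untrained mean aggregation over neighbors, whereas the paper's GNN is learned end-to-end with the forecaster; the product names, features, and edges are invented for illustration.

```python
def neighbor_mean_features(features, edges):
    """One round of mean aggregation over graph neighbors (no learned weights)."""
    neighbors = {node: [] for node in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    dim = len(next(iter(features.values())))
    aggregated = {}
    for node, nbrs in neighbors.items():
        if nbrs:
            aggregated[node] = [
                sum(features[n][d] for n in nbrs) / len(nbrs) for d in range(dim)
            ]
        else:
            aggregated[node] = [0.0] * dim  # isolated node: nothing to borrow
    return aggregated

def augment(features, edges):
    """Concatenate each series' own features with its neighbor summary."""
    aggregated = neighbor_mean_features(features, edges)
    return {node: features[node] + aggregated[node] for node in features}

# Hypothetical per-product summaries: [mean weekly demand, demand std].
features = {
    "old_a": [10.0, 2.0],
    "old_b": [14.0, 4.0],
    "new_c": [0.0, 0.0],  # cold-start product with no history
}
edges = [("new_c", "old_a"), ("new_c", "old_b")]  # e.g. same product category
augmented = augment(features, edges)
```

    The cold-start product `new_c` has no history of its own, but its augmented vector carries the average statistics of its two neighbors, which is the kind of signal an encoder could exploit.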

VesselVAE: Recursive Variational Autoencoders for 3D Blood Vessel Synthesis

  • paper_url: http://arxiv.org/abs/2307.03592
  • repo_url: None
  • paper_authors: Paula Feldman, Miguel Fainstein, Viviana Siless, Claudio Delrieux, Emmanuel Iarussi
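  • for: This paper aims to synthesize blood vessel 3D geometry.
  • methods: The paper uses a recursive variational neural network (VesselVAE) that fully exploits the hierarchical organization of the vessel, learning a low-dimensional manifold that encodes branch connectivity along with geometry features describing the target surface.
  • results: Experiments show that VesselVAE generates accurate and diverse 3D vessel models, with similarities between synthetic and real data of .97 for radius, .95 for length, and .96 for tortuosity. These results suggest that VesselVAE can be used for medical and surgical training, hemodynamic simulations, and other purposes.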
    Abstract We present a data-driven generative framework for synthesizing blood vessel 3D geometry. This is a challenging task due to the complexity of vascular systems, which are highly variating in shape, size, and structure. Existing model-based methods provide some degree of control and variation in the structures produced, but fail to capture the diversity of actual anatomical data. We developed VesselVAE, a recursive variational Neural Network that fully exploits the hierarchical organization of the vessel and learns a low-dimensional manifold encoding branch connectivity along with geometry features describing the target surface. After training, the VesselVAE latent space can be sampled to generate new vessel geometries. To the best of our knowledge, this work is the first to utilize this technique for synthesizing blood vessels. We achieve similarities of synthetic and real data for radius (.97), length (.95), and tortuosity (.96). By leveraging the power of deep neural networks, we generate 3D models of blood vessels that are both accurate and diverse, which is crucial for medical and surgical training, hemodynamic simulations, and many other purposes.
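    Summary We propose a data-driven generative framework for synthesizing blood vessel 3D geometry. This is a challenging task because vascular systems are complex and highly variable in shape, size, and structure. Existing model-based methods offer some control and variation, but fail to capture the diversity of actual anatomical data. We developed VesselVAE, a recursive variational neural network that fully exploits the hierarchical organization of the vessel and learns a low-dimensional representation encoding branch connectivity along with geometry features describing the target surface. After training, the VesselVAE latent space can be sampled to generate new vessel geometries. To the best of our knowledge, this is the first work to use this technique for synthesizing blood vessels. We achieve similarities between synthetic and real data for radius (.97), length (.95), and tortuosity (.96). By leveraging the power of deep neural networks, we generate accurate and diverse 3D vessel models, which is crucial for medical and surgical training, hemodynamic simulations, and many other purposes.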

Multimodal Deep Learning for Personalized Renal Cell Carcinoma Prognosis: Integrating CT Imaging and Clinical Data

  • paper_url: http://arxiv.org/abs/2307.03575
  • repo_url: https://github.com/mahootiha-maryam/Survival_CTplusClinical
  • paper_authors: Maryamalsadat Mahootiha, Hemin Ali Qadir, Jacob Bergsland, Ilangko Balasingham
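  • for: This study aims to develop a comprehensive deep-learning model that predicts survival probabilities for patients with renal cell carcinoma by integrating CT imaging and clinical data, addressing limitations of prior studies.
  • methods: The proposed framework has three modules: a 3D image feature extractor, clinical variable selection, and survival prediction. The feature extractor, based on a 3D CNN architecture, predicts from CT images the ISUP grade of renal cell carcinoma tumors, which is linked to mortality rates. Clinical variables are selected systematically using the Spearman score and the random forest importance score as criteria. Survival prediction uses a deep-learning network trained with a discrete LogisticHazard-based loss.
  • results: The proposed strategy surpasses the current literature on renal cancer prognosis based on CT scans and clinical factors. The best experiment reached a concordance index of 0.84 and an area under the curve of 0.8 on the test cohort, suggesting strong predictive power.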
    Abstract Renal cell carcinoma represents a significant global health challenge with a low survival rate. This research aimed to devise a comprehensive deep-learning model capable of predicting survival probabilities in patients with renal cell carcinoma by integrating CT imaging and clinical data and addressing the limitations observed in prior studies. The aim is to facilitate the identification of patients requiring urgent treatment. The proposed framework comprises three modules: a 3D image feature extractor, clinical variable selection, and survival prediction. The feature extractor module, based on the 3D CNN architecture, predicts the ISUP grade of renal cell carcinoma tumors linked to mortality rates from CT images. A selection of clinical variables is systematically chosen using the Spearman score and random forest importance score as criteria. A deep learning-based network, trained with discrete LogisticHazard-based loss, performs the survival prediction. Nine distinct experiments are performed, with varying numbers of clinical variables determined by different thresholds of the Spearman and importance scores. Our findings demonstrate that the proposed strategy surpasses the current literature on renal cancer prognosis based on CT scans and clinical factors. The best-performing experiment yielded a concordance index of 0.84 and an area under the curve value of 0.8 on the test cohort, which suggests strong predictive power. The multimodal deep-learning approach developed in this study shows promising results in estimating survival probabilities for renal cell carcinoma patients using CT imaging and clinical data. This may have potential implications in identifying patients who require urgent treatment, potentially improving patient outcomes. The code created for this project is available for the public on GitHub: https://github.com/Balasingham-AI-Group/Survival_CTplusClinical
    Summary Renal cell carcinoma represents a significant global health challenge with a low survival rate. This research aimed to develop a comprehensive deep-learning model capable of predicting survival probabilities in patients with renal cell carcinoma by integrating CT imaging and clinical data, and addressing the limitations observed in prior studies. The aim is to facilitate the identification of patients requiring urgent treatment. The proposed framework consists of three modules: a 3D image feature extractor, clinical variable selection, and survival prediction. The feature extractor module, based on the 3D CNN architecture, predicts the ISUP grade of renal cell carcinoma tumors linked to mortality rates from CT images. A selection of clinical variables is systematically chosen using the Spearman score and random forest importance score as criteria. A deep learning-based network, trained with discrete LogisticHazard-based loss, performs the survival prediction. Nine distinct experiments were performed, with varying numbers of clinical variables determined by different thresholds of the Spearman and importance scores. Our findings demonstrate that the proposed strategy surpasses the current literature on renal cancer prognosis based on CT scans and clinical factors. The best-performing experiment yielded a concordance index of 0.84 and an area under the curve value of 0.8 on the test cohort, which suggests strong predictive power. The multimodal deep-learning approach developed in this study shows promising results in estimating survival probabilities for renal cell carcinoma patients using CT imaging and clinical data. This may have potential implications in identifying patients who require urgent treatment, potentially improving patient outcomes. The code created for this project is available for the public on GitHub.
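    The concordance index reported above can be illustrated with a short sketch. It implements Harrell's C-index for right-censored data in plain Python; whether the authors used exactly this variant or a library implementation is an assumption, and the four-patient dataset is invented.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    times       : observed follow-up times
    events      : 1 if death was observed, 0 if the patient was censored
    risk_scores : model output, higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i died strictly before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier death had higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # ties count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort: the third patient is censored at month 12.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risk)
```

    In this toy cohort five pairs are comparable and four are ordered correctly, so the index is 0.8. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one, which puts the reported 0.84 in context.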

Why machines do not understand: A response to Søgaard

  • paper_url: http://arxiv.org/abs/2307.04766
  • repo_url: None
  • paper_authors: Jobst Landgrebe, Barry Smith
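  • for: This paper criticizes the view that machines can understand language, specifically a thesis of this sort that Søgaard advanced in this journal on the basis of claims about semantics and machine learning.
  • methods: The paper analyzes and criticizes Søgaard's argument, contrasting how humans use language with how language is stored.
  • results: The paper shows that Søgaard's argument fails, chiefly because he overlooks the difference between language as used by humans and language as stored on computers.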
    Abstract Some defenders of so-called 'artificial intelligence' believe that machines can understand language. In particular, Søgaard has argued in this journal for a thesis of this sort, on the basis of the idea (1) that where there is semantics there is also understanding and (2) that machines are not only capable of what he calls 'inferential semantics', but even that they can (with the help of inputs from sensors) 'learn' referential semantics (Søgaard, 2022). We show that he goes wrong because he pays insufficient attention to the difference between language as used by humans and the sequences of inert symbols which arise when language is stored on hard drives or in books in libraries.
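    Summary Some people believe that so-called artificial intelligence can understand language. In particular, Søgaard has argued in this journal for a thesis of this sort, on two grounds: (1) that where there is semantics there is also understanding, and (2) that machines are not only capable of what he calls "inferential semantics", but can even (with the help of inputs from sensors) "learn" referential semantics (Søgaard, 2022). We show that he goes wrong because he neglects the difference between language as used by humans and the sequences of symbols stored on hard drives or in libraries.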

Dynamic Graph Attention for Anomaly Detection in Heterogeneous Sensor Networks

  • paper_url: http://arxiv.org/abs/2307.03761
  • repo_url: https://github.com/MengjieZhao/dygatad
  • paper_authors: Mengjie Zhao, Olga Fink
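  • for: This paper targets anomaly detection on multivariate time series (MTS) data from systems monitored by the Industrial Internet of Things (IIoT), where the sensor network is complex and its signals are interdependent.
  • methods: The paper proposes DyGATAD (Dynamic Graph Attention for Anomaly Detection), which uses an attention mechanism to build a continuous graph representation of multivariate time series and infers dynamic edges to detect relationship shifts. DyGATAD also combines an operating condition-aware reconstruction with a topology-based anomaly score, improving detection ability.
  • results: On a synthetic dataset with controlled fault severity and an industrial-scale multiphase flow facility benchmark, DyGATAD shows superior anomaly detection in sensor networks, with particular strength in early-stage fault detection and in detecting low-severity faults.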
    Abstract In the era of digital transformation, systems monitored by the Industrial Internet of Things (IIoTs) generate large amounts of Multivariate Time Series (MTS) data through heterogeneous sensor networks. While this data facilitates condition monitoring and anomaly detection, the increasing complexity and interdependencies within the sensor network pose significant challenges for anomaly detection. Despite progress in this field, much of the focus has been on point anomalies and contextual anomalies, with lesser attention paid to collective anomalies. A less addressed but common variant of collective anomalies is when the abnormal collective behavior is caused by shifts in interrelationships within the system. This can be due to abnormal environmental conditions like overheating, improper operational settings resulting from cyber-physical attacks, or system-level faults. To address these challenges, this paper proposes DyGATAD (Dynamic Graph Attention for Anomaly Detection), a graph-based anomaly detection framework that leverages the attention mechanism to construct a continuous graph representation of multivariate time series by inferring dynamic edges between time series. DyGATAD incorporates an operating condition-aware reconstruction combined with a topology-based anomaly score, thereby enhancing the detection ability of relationship shifts. We evaluate the performance of DyGATAD using both a synthetic dataset with controlled varying fault severity levels and an industrial-scale multiphase flow facility benchmark featuring various fault types with different detection difficulties. Our proposed approach demonstrated superior performance in collective anomaly detection for sensor networks, showing particular strength in early-stage fault detection, even in the case of faults with minimal severity.
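    Summary In the era of digital transformation, systems monitored by the Industrial Internet of Things (IIoT) generate large amounts of multivariate time series (MTS) data, which support condition monitoring and anomaly detection. However, the growing complexity and interdependencies of the sensor network pose significant challenges for anomaly detection. Despite much progress in this field, most research has focused on point anomalies and contextual anomalies and has paid less attention to collective anomalies. A less studied but common kind of collective anomaly arises when the abnormal behavior is caused by shifts in the interrelationships within the system. This can be due to abnormal environmental conditions, improper operational settings, or system-level faults. To address these challenges, this paper proposes DyGATAD (Dynamic Graph Attention for Anomaly Detection), a graph-based anomaly detection framework. DyGATAD uses the attention mechanism to construct a continuous graph representation of multivariate time series and captures relationship shifts by inferring dynamic edges. It also combines an operating condition-aware reconstruction with a topology-based anomaly score, which improves detection ability. We evaluated it on a synthetic dataset and on an industrial-scale multiphase flow facility benchmark. The results show that DyGATAD performs strongly in collective anomaly detection for sensor networks, particularly in early-stage fault detection, even for faults of low severity.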
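    The idea of inferring dynamic edges with attention and scoring changes in graph topology can be sketched as follows. This is an illustration under simplifying assumptions, not DyGATAD itself: raw sensor windows stand in for learned query and key embeddings, the operating condition-aware reconstruction is omitted, and the score is a plain mean absolute difference between two attention graphs.

```python
import math

def attention_graph(windows):
    """Row-wise softmax of scaled dot products between sensor windows.

    Entry [i][j] is the attention weight sensor i puts on sensor j.
    """
    dim = len(windows[0])
    graph = []
    for query in windows:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in windows
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        graph.append([e / total for e in exps])
    return graph

def topology_score(reference, current):
    """Mean absolute change in edge weights between two attention graphs."""
    n = len(reference)
    diff = sum(
        abs(a - b)
        for row_ref, row_cur in zip(reference, current)
        for a, b in zip(row_ref, row_cur)
    )
    return diff / (n * n)

# Three sensors; in the second window sensor 1 starts tracking sensor 2.
normal = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.1], [3.0, 2.0, 1.0]]
shifted = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.1], [3.0, 2.0, 1.0]]
g_normal = attention_graph(normal)
g_shifted = attention_graph(shifted)
score = topology_score(g_normal, g_shifted)
```

    Each sensor's values in the shifted window are individually plausible, yet the edge weights change, which is the kind of relationship shift that point-anomaly detectors tend to miss.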

Large Language Models as Batteries-Included Zero-Shot ESCO Skills Matchers

  • paper_url: http://arxiv.org/abs/2307.03539
  • repo_url: None
  • paper_authors: Benjamin Clavié, Guillaume Soulié
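  • for: This paper proposes an end-to-end zero-shot system for automatically extracting skills from job postings.
  • methods: The system uses large language models (LLMs) to generate synthetic training data and trains a classifier to extract skill mentions from job postings. A similarity retriever generates skill candidates, which a second LLM then re-ranks.
  • results: Using synthetic data yields an RP@10 score 10 points higher than previous distant supervision approaches, and adding GPT-4 re-ranking improves RP@10 by over 22 points over previous methods. Framing the task as mock programming in the prompt can also improve LLM performance, especially with weaker LLMs.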
    Abstract Understanding labour market dynamics requires accurately identifying the skills required for and possessed by the workforce. Automation techniques are increasingly being developed to support this effort. However, automatically extracting skills from job postings is challenging due to the vast number of existing skills. The ESCO (European Skills, Competences, Qualifications and Occupations) framework provides a useful reference, listing over 13,000 individual skills. However, skills extraction remains difficult and accurately matching job posts to the ESCO taxonomy is an open problem. In this work, we propose an end-to-end zero-shot system for skills extraction from job descriptions based on large language models (LLMs). We generate synthetic training data for the entirety of ESCO skills and train a classifier to extract skill mentions from job posts. We also employ a similarity retriever to generate skill candidates which are then re-ranked using a second LLM. Using synthetic data achieves an RP@10 score 10 points higher than previous distant supervision approaches. Adding GPT-4 re-ranking improves RP@10 by over 22 points over previous methods. We also show that framing the task as mock programming when prompting the LLM can lead to better performance than natural language prompts, especially with weaker LLMs. We demonstrate the potential of integrating large language models at both ends of skills matching pipelines. Our approach requires no human annotations and achieves extremely promising results on skills extraction against ESCO.
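    Summary Understanding labour market dynamics requires accurately identifying the skills required for and possessed by the workforce. Automation techniques are increasingly being developed to support this effort. However, automatically extracting skills from job postings is challenging because of the vast number of existing skills. The ESCO (European Skills, Competences, Qualifications and Occupations) framework provides a useful reference, listing over 13,000 individual skills. Skills extraction nevertheless remains difficult, and accurately matching job posts to the ESCO taxonomy is an open problem. In this work, we propose an end-to-end zero-shot system that uses large language models (LLMs) to extract skills from job posts. We generate synthetic training data for the entirety of ESCO skills and use a classifier to extract skill mentions from job posts. We also use a similarity retriever to generate skill candidates, which a second LLM then re-ranks. Using synthetic data achieves an RP@10 score 10 points higher than previous distant supervision approaches, and adding GPT-4 re-ranking improves RP@10 by over 22 points. We also show that framing the task as mock programming when prompting the LLM can improve performance, especially with weaker LLMs. We demonstrate the potential of integrating large language models into skills matching pipelines and achieve skills extraction against ESCO with no human annotations.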
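    Since the results are reported as RP@10, a short sketch of the metric may help. It assumes the common definition of R-Precision@K (hits in the top K divided by the smaller of K and the number of gold labels); the abstract does not spell out the formula, and the skill names below are invented.

```python
def r_precision_at_k(ranked, relevant, k=10):
    """R-Precision@K for one job posting.

    ranked   : predicted skill labels, best first
    relevant : set of gold skill labels for the posting
    """
    if not relevant:
        raise ValueError("posting has no gold skills")
    hits = sum(1 for skill in ranked[:k] if skill in relevant)
    # Normalize by min(k, |relevant|) so a perfect ranking always scores 1.
    return hits / min(k, len(relevant))

def mean_rp_at_k(all_ranked, all_relevant, k=10):
    """Average R-Precision@K over a set of postings."""
    scores = [r_precision_at_k(r, g, k) for r, g in zip(all_ranked, all_relevant)]
    return sum(scores) / len(scores)

predictions = [
    ["python", "sql", "excel", "teamwork"],
    ["welding", "blueprint reading"],
]
gold = [
    {"sql", "teamwork", "java"},
    {"welding"},
]
first = r_precision_at_k(predictions[0], gold[0], k=3)
overall = mean_rp_at_k(predictions, gold, k=3)
```

    The normalization matters for ESCO matching because most postings have far fewer than ten gold skills, so plain precision at 10 could never reach 1.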

Physical Color Calibration of Digital Pathology Scanners for Robust Artificial Intelligence Assisted Cancer Diagnosis

  • paper_url: http://arxiv.org/abs/2307.05519
  • repo_url: None
  • paper_authors: Xiaoyi Ji, Richard Salmon, Nita Mulliqi, Umair Khan, Yinxi Wang, Anders Blilie, Henrik Olsson, Bodil Ginnerup Pedersen, Karina Dalsgaard Sørensen, Benedicte Parm Ulhøi, Svein R Kjosavik, Emilius AM Janssen, Mattias Rantalainen, Lars Egevad, Pekka Ruusuvuori, Martin Eklund, Kimmo Kartasalo
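  • for: This study addresses how technical inconsistencies limit the potential of artificial intelligence (AI) in digital pathology and make clinical deployment of AI challenging.
  • methods: The researchers applied physical color calibration of scanners, using a color calibration slide, in four laboratories and measured its effect on an AI system for prostate cancer diagnosis.
  • results: Physical color calibration standardizes the appearance of whole slide images, improving AI model calibration and Gleason grading performance. The study shows that physical color calibration can address the variation introduced by different scanners, making AI-based cancer diagnostics more reliable and applicable in clinical settings.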
    Abstract The potential of artificial intelligence (AI) in digital pathology is limited by technical inconsistencies in the production of whole slide images (WSIs), leading to degraded AI performance and posing a challenge for widespread clinical application as fine-tuning algorithms for each new site is impractical. Changes in the imaging workflow can also lead to compromised diagnoses and patient safety risks. We evaluated whether physical color calibration of scanners can standardize WSI appearance and enable robust AI performance. We employed a color calibration slide in four different laboratories and evaluated its impact on the performance of an AI system for prostate cancer diagnosis on 1,161 WSIs. Color standardization resulted in consistently improved AI model calibration and significant improvements in Gleason grading performance. The study demonstrates that physical color calibration provides a potential solution to the variation introduced by different scanners, making AI-based cancer diagnostics more reliable and applicable in clinical settings.
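    Summary The potential of artificial intelligence (AI) in digital pathology is limited by technical inconsistencies in the production of whole slide images (WSIs), which degrade AI performance and pose a challenge for widespread clinical application. Changes in the imaging workflow can also lead to diagnostic errors and patient safety risks. We evaluated whether physical color calibration of scanners can standardize WSI appearance, and measured its effect on an AI system for prostate cancer diagnosis on 1,161 WSIs. Color standardization improved AI model calibration and significantly improved Gleason grading performance. The study shows that physical color calibration offers a solution to the variation introduced by different scanners, making AI-based cancer diagnostics more reliable and applicable in clinical settings.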

Contrastive Graph Pooling for Explainable Classification of Brain Networks

  • paper_url: http://arxiv.org/abs/2307.11133
  • repo_url: None
  • paper_authors: Jiaxing Xu, Qingtian Bian, Xinhang Li, Aihu Zhang, Yiping Ke, Miao Qiao, Wei Zhang, Wei Khang Jeremy Sim, Balázs Gulyás
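  • for: This paper proposes a graph neural network (GNN) design suited to functional magnetic resonance imaging (fMRI) data, to improve the understanding and characterization of brain networks.
  • methods: The method combines a contrastive dual-attention block with a differentiable graph pooling method (ContrastPool) to make better use of GNNs for brain networks.
  • results: Applied to 5 resting-state fMRI brain network datasets, the method outperforms the baselines. The study also finds that the extracted patterns match domain knowledge in the neuroscience literature and yield direct and interesting insights.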
    Abstract Functional magnetic resonance imaging (fMRI) is a commonly used technique to measure neural activation. Its application has been particularly important in identifying underlying neurodegenerative conditions such as Parkinson's, Alzheimer's, and Autism. Recent analysis of fMRI data models the brain as a graph and extracts features by graph neural networks (GNNs). However, the unique characteristics of fMRI data require a special design of GNN. Tailoring GNN to generate effective and domain-explainable features remains challenging. In this paper, we propose a contrastive dual-attention block and a differentiable graph pooling method called ContrastPool to better utilize GNN for brain networks, meeting fMRI-specific requirements. We apply our method to 5 resting-state fMRI brain network datasets of 3 diseases and demonstrate its superiority over state-of-the-art baselines. Our case study confirms that the patterns extracted by our method match the domain knowledge in neuroscience literature, and disclose direct and interesting insights. Our contributions underscore the potential of ContrastPool for advancing the understanding of brain networks and neurodegenerative conditions.
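    Summary Functional magnetic resonance imaging (fMRI) is a widely used technique for measuring neural activation. Its application has been particularly important in identifying underlying neurodegenerative conditions such as Parkinson's, Alzheimer's, and Autism. Recent analysis of fMRI data models the brain as a graph and extracts features with graph neural networks (GNNs). However, the unique characteristics of fMRI data call for a special GNN design, and tailoring GNNs to generate effective and domain-explainable features remains challenging. In this paper, we propose a contrastive dual-attention block and a differentiable graph pooling method called ContrastPool to make better use of GNNs for brain networks. We apply our method to 5 resting-state fMRI brain network datasets and show that it significantly outperforms the baselines. Our case study shows that the patterns extracted by our method match the domain knowledge in the neuroscience literature and reveal direct and interesting insights. Our contributions indicate the potential of ContrastPool for understanding brain networks and neurodegenerative conditions.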

Procedurally generating rules to adapt difficulty for narrative puzzle games

  • paper_url: http://arxiv.org/abs/2307.05518
  • repo_url: None
  • paper_authors: Thomas Volden, Djordje Grbic, Paolo Burelli
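  • for: This paper focuses on procedurally generating rules and communicating them to players in order to adjust difficulty. It is part of a larger project to collect and adapt educational games for young children, using a digital puzzle game designed for kindergarten.
  • methods: A genetic algorithm is used together with a difficulty measure to find a target number of solution sets, and a large language model communicates the rules in a narrative context.
  • results: In testing, the approach found rules approximating a given target difficulty within two dozen generations on average. Future experiments aim to improve evaluation, specialize the language model on children's literature, and collect multi-modal data from players to guide adaptation.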
    Abstract This paper focuses on procedurally generating rules and communicating them to players to adjust the difficulty. This is part of a larger project to collect and adapt games in educational games for young children using a digital puzzle game designed for kindergarten. A genetic algorithm is used together with a difficulty measure to find a target number of solution sets and a large language model is used to communicate the rules in a narrative context. During testing the approach was able to find rules that approximate any given target difficulty within two dozen generations on average. The approach was combined with a large language model to create a narrative puzzle game where players have to host a dinner for animals that can't get along. Future experiments will try to improve evaluation, specialize the language model on children's literature, and collect multi-modal data from players to guide adaptation.
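    The combination of a genetic algorithm with a solution-count difficulty measure can be sketched on a toy version of the dinner puzzle. Everything concrete here is an assumption for illustration: five animals seated in a row, rules of the form "these two cannot sit next to each other", difficulty measured as the number of valid seatings, and a simple elitist GA with one-point crossover and a single mutation.

```python
import itertools
import random

ANIMALS = ["cat", "dog", "fox", "hen", "owl"]
# Each candidate rule forbids one pair of animals from sitting side by side.
PAIRS = list(itertools.combinations(range(len(ANIMALS)), 2))

def count_solutions(rules):
    """Number of seatings in a row that violate none of the active rules."""
    forbidden = {PAIRS[i] for i, active in enumerate(rules) if active}
    count = 0
    for seating in itertools.permutations(range(len(ANIMALS))):
        adjacent = zip(seating, seating[1:])
        if all(tuple(sorted(pair)) not in forbidden for pair in adjacent):
            count += 1
    return count

def evolve(target, generations=24, pop_size=20, seed=0):
    """Search for a rule set whose solution count is close to the target."""
    rng = random.Random(seed)

    def error(rules):
        return abs(count_solutions(rules) - target)

    population = [
        tuple(rng.random() < 0.3 for _ in PAIRS) for _ in range(pop_size)
    ]
    initial_best = min(error(rules) for rules in population)
    for _ in range(generations):
        population.sort(key=error)
        parents = population[: pop_size // 2]  # elitism: best half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(PAIRS))
            child = list(a[:cut] + b[cut:])     # one-point crossover
            flip = rng.randrange(len(PAIRS))
            child[flip] = not child[flip]       # single-bit mutation
            children.append(tuple(child))
        population = parents + children
    best = min(population, key=error)
    return best, error(best), initial_best

best_rules, best_error, initial_error = evolve(target=30)
```

    Fewer valid seatings means a harder puzzle, so the target count acts as a difficulty dial. In the paper's pipeline a language model would then turn the selected rules into narrative text, a step not shown here.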

Transfer Learning of Semantic Segmentation Methods for Identifying Buried Archaeological Structures on LiDAR Data

  • paper_url: http://arxiv.org/abs/2307.03512
  • repo_url: None
  • paper_authors: Paolo Soleni, Wouter B. Verschoof-van der Vaart, Žiga Kokalj, Arianna Traviglia, Marco Fiorucci
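  • for: When deep learning is applied to remote sensing data in archaeological research, a major obstacle is the limited availability of suitable data for training models. This paper studies transfer learning as a remedy.
  • methods: The study compares transfer learning configurations using two semantic segmentation deep neural networks on two LiDAR datasets.
  • results: The experiments indicate that transfer learning configurations can improve performance in archaeology, although no systematic improvement has been observed yet. The paper provides specific insights that can serve as a baseline for future work.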
    Abstract When applying deep learning to remote sensing data in archaeological research, a notable obstacle is the limited availability of suitable datasets for training models. The application of transfer learning is frequently employed to mitigate this drawback. However, there is still a need to explore its effectiveness when applied across different archaeological datasets. This paper compares the performance of various transfer learning configurations using two semantic segmentation deep neural networks on two LiDAR datasets. The experimental results indicate that transfer learning-based approaches in archaeology can lead to performance improvements, although a systematic enhancement has not yet been observed. We provide specific insights about the validity of such techniques that can serve as a baseline for future works.
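    Summary When applying deep learning to remote sensing data in archaeological research, a notable obstacle is the limited availability of data for training models. Transfer learning is commonly used to mitigate this problem. However, its effectiveness across different archaeological datasets still needs to be explored. This paper compares the performance of different transfer learning configurations using two semantic segmentation deep neural networks on two LiDAR datasets. The experimental results indicate that transfer learning can improve performance in archaeology, although a systematic improvement has not yet been observed. We provide specific insights that can serve as a reference for future research.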

Derivative Free Weight-space Ensembling

  • paper_url: http://arxiv.org/abs/2307.03506
  • repo_url: None
  • paper_authors: Dean Ninalga
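  • for: This study proposes a new few-sample task transfer approach for effective task transfer in open-domain dialogue.
  • methods: The study uses Derivative Free Weight-space Ensembling (DFWE), which creates a set of diverse expert language models, each trained on a predefined set of source tasks. Each expert is then fine-tuned on the target task. Finally, a gradient-free optimization algorithm linearly interpolates between the model weights to efficiently find a good interpolation weighting.
  • results: Experiments on FETA-Friends demonstrate the effectiveness of DFWE, which outperforms the standard pretrain-finetune approach.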
    Abstract Recent work suggests that interpolating between the weights of two specialized language models can transfer knowledge between tasks in a way that multi-task learning cannot. However, very few have explored interpolation between more than two models, where each has a distinct knowledge base. In this paper, we introduce Derivative Free Weight-space Ensembling (DFWE), a new few-sample task transfer approach for open-domain dialogue. Our framework creates a set of diverse expert language models trained using a predefined set of source tasks. Next, we finetune each of the expert models on the target task, approaching the target task from several distinct knowledge bases. Finally, we linearly interpolate between the model weights using a gradient-free-optimization algorithm, to efficiently find a good interpolation weighting. We demonstrate the effectiveness of the method on FETA-Friends outperforming the standard pretrain-finetune approach.
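    Summary Research suggests that interpolating between the weights of two specialized language models can transfer knowledge between tasks in a way that multi-task learning cannot. However, few have studied interpolation between more than two models. In this paper, we introduce Derivative Free Weight-space Ensembling (DFWE), a new few-sample task transfer approach for open-domain dialogue. Our framework creates a set of diverse expert language models, each trained on a predefined set of source tasks. We then fine-tune each expert model on the target task, approaching it from several distinct knowledge bases. Finally, we use a gradient-free optimization algorithm to linearly interpolate between the model weights and efficiently find a good interpolation weighting. We demonstrate the effectiveness of the method on FETA-Friends, where it outperforms the standard pretrain-finetune approach.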
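    The final step, interpolating expert weights with a gradient-free optimizer, can be sketched as below. The sketch makes several assumptions not stated in the abstract: coefficients are constrained to be non-negative and sum to 1, plain random search stands in for whatever gradient-free algorithm the author uses, and a toy quadratic loss on two-parameter "models" replaces validation loss on the target task.

```python
import random

def interpolate(expert_weights, alphas):
    """Weighted sum of the experts' parameter vectors."""
    dim = len(expert_weights[0])
    return [
        sum(a * w[d] for a, w in zip(alphas, expert_weights)) for d in range(dim)
    ]

def random_simplex_point(k, rng):
    """Sample k non-negative coefficients that sum to 1."""
    raw = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]

def search_interpolation(expert_weights, loss_fn, trials=200, seed=0):
    """Gradient-free search: keep the best of `trials` random weightings."""
    rng = random.Random(seed)
    k = len(expert_weights)
    best_alphas = [1.0 / k] * k  # start from the uniform average
    best_loss = loss_fn(interpolate(expert_weights, best_alphas))
    for _ in range(trials):
        alphas = random_simplex_point(k, rng)
        loss = loss_fn(interpolate(expert_weights, alphas))
        if loss < best_loss:
            best_alphas, best_loss = alphas, loss
    return best_alphas, best_loss

# Three toy "experts", each a two-parameter model after fine-tuning.
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [0.7, 0.6]

def toy_loss(params):
    """Stand-in for validation loss on the target task."""
    return sum((p - t) ** 2 for p, t in zip(params, target))

alphas, loss = search_interpolation(experts, toy_loss)
uniform_loss = toy_loss(interpolate(experts, [1.0 / 3] * 3))
```

    Only the loss value of each candidate is needed, never its gradient, so the search requires only forward evaluations of the merged model on target-task data.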

RCDN – Robust X-Corner Detection Algorithm based on Advanced CNN Model

  • paper_url: http://arxiv.org/abs/2307.03505
  • repo_url: None
  • paper_authors: Ben Chen, Caihua Xiong, Quanlin Li, Zhonghua Wan
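  • for: Improving the accuracy and robustness of X-corner detection and localization in machine vision and robotics.
  • methods: A new detection algorithm that maintains high sub-pixel precision under multiple kinds of interference, including lens distortion, extreme poses, and noise. Following a coarse-to-fine strategy, it combines an X-corner detection network, three post-processing techniques to select the correct corner candidates, a mixed sub-pixel refinement technique, and an improved region growth strategy that automatically recovers partially visible or occluded checkerboard patterns.
  • results: Evaluations on real and synthetic images show that the algorithm achieves a higher detection rate, sub-pixel accuracy, and robustness than other commonly used methods. Camera calibration and pose estimation experiments also show a smaller re-projection error than the state of the art.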
    Abstract Accurate detection and localization of X-corner on both planar and non-planar patterns is a core step in robotics and machine vision. However, previous works could not make a good balance between accuracy and robustness, which are both crucial criteria to evaluate the detectors performance. To address this problem, in this paper we present a novel detection algorithm which can maintain high sub-pixel precision on inputs under multiple interference, such as lens distortion, extreme poses and noise. The whole algorithm, adopting a coarse-to-fine strategy, contains a X-corner detection network and three post-processing techniques to distinguish the correct corner candidates, as well as a mixed sub-pixel refinement technique and an improved region growth strategy to recover the checkerboard pattern partially visible or occluded automatically. Evaluations on real and synthetic images indicate that the presented algorithm has the higher detection rate, sub-pixel accuracy and robustness than other commonly used methods. Finally, experiments of camera calibration and pose estimation verify it can also get smaller re-projection error in quantitative comparisons to the state-of-the-art.
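    Summary Accurate detection and localization of X-corners is a core step in robotics and machine vision. Previous methods could not balance accuracy and robustness, both of which are key criteria for evaluating detector performance. To address this, the paper proposes a new detection algorithm that maintains high sub-pixel precision under multiple kinds of interference, including lens distortion, extreme poses, and noise. The algorithm uses a coarse-to-fine strategy with an X-corner detection network and three post-processing techniques to identify the correct corner candidates, plus a mixed sub-pixel refinement technique and an improved region-growth strategy to automatically recover checkerboard patterns that are partially visible or occluded. Experiments on real and synthetic images show a higher detection rate, sub-pixel accuracy, and robustness, and camera calibration and pose estimation experiments yield a smaller re-projection error.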

Large AI Model-Based Semantic Communications

  • paper_url: http://arxiv.org/abs/2307.03492
  • repo_url: None
  • paper_authors: Feibo Jiang, Yubo Peng, Li Dong, Kezhi Wang, Kun Yang, Cunhua Pan, Xiaohu You
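  • for: Addresses knowledge-base construction problems in current semantic communication systems by proposing a large AI model-based semantic communication framework (LAM-SC) for image data.
  • methods: The framework first designs a segment anything model (SAM)-based knowledge base (SKB) that uses universal semantic knowledge to split an image into semantic segments. It then proposes attention-based semantic integration (ASI) and an adaptive semantic compression (ASC) encoding to reduce communication overhead.
  • results: Simulations demonstrate the effectiveness of LAM-SC and the importance of large AI model-based knowledge bases for future semantic communication paradigms.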
    Abstract Semantic communication (SC) is an emerging intelligent paradigm, offering solutions for various future applications like the metaverse, mixed reality, and the Internet of Everything. However, in current SC systems, the construction of the knowledge base (KB) faces several issues, including limited knowledge representation, frequent knowledge updates, and insecure knowledge sharing. Fortunately, the development of large AI models provides new solutions to overcome the above issues. Here, we propose a large AI model-based SC framework (LAM-SC) specifically designed for image data, where we first design the segment anything model (SAM)-based KB (SKB) that can split the original image into different semantic segments by universal semantic knowledge. Then, we present an attention-based semantic integration (ASI) to weigh the semantic segments generated by SKB without human participation and integrate them as the semantic-aware image. Additionally, we propose an adaptive semantic compression (ASC) encoding to remove redundant information in semantic features, thereby reducing communication overhead. Finally, through simulations, we demonstrate the effectiveness of the LAM-SC framework and the significance of the large AI model-based KB development in future SC paradigms.
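    Summary Semantic communication (SC) is an emerging intelligent paradigm that offers solutions for future applications such as the metaverse, mixed reality, and the Internet of Everything. In current SC systems, however, constructing the knowledge base (KB) faces several problems, including limited knowledge representation, frequent knowledge updates, and insecure knowledge sharing. The development of large AI models provides new solutions. The paper proposes a large AI model-based SC framework (LAM-SC) designed for image data. It first designs a segment anything model (SAM)-based KB (SKB) that uses universal semantic knowledge to split the original image into different semantic segments. It then proposes attention-based semantic integration (ASI), which weighs the segments generated by the SKB without human participation and integrates them into a semantic-aware image. It also proposes adaptive semantic compression (ASC) encoding, which removes redundant information from semantic features to reduce communication overhead. Simulations demonstrate the effectiveness of LAM-SC and the importance of large AI model-based KB development for future SC paradigms.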

Artificial Eye for the Blind

  • paper_url: http://arxiv.org/abs/2308.00801
  • repo_url: https://github.com/deepususeel/SmartEye
  • paper_authors: Abhinav Benagi, Dhanyatha Narayan, Charith Rage, A Sushmitha
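  • for: Provide a Raspberry Pi 3-based artificial eye model that helps blind people with navigation and everyday decisions.
  • methods: The model uses a Raspberry Pi 3, a webcam, an ultrasonic proximity sensor, a speaker, and several software models: object detection, optical character recognition, Google text-to-speech, and the Mycroft voice assistant.
  • results: The model can help blind people move with more flexibility and confidence in navigation and daily life, and it provides voice and text assistance.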
    Abstract The main backbone of our Artificial Eye model is the Raspberry Pi 3, which is connected to the webcam, ultrasonic proximity sensor, and speaker, and on which we also run all our software models, i.e., object detection, optical character recognition, Google text-to-speech conversion, and the Mycroft voice assistance model. At first the ultrasonic proximity sensor measures the distance between itself and any obstacle in front of it. When the proximity sensor detects any obstacle in front within its specified range, the blind person will hear an audio prompt about an obstacle in his way at a certain distance. At this time the webcam will capture an image in front of it, and the object detection model and the optical character recognition model will begin to run on the Raspberry Pi. The image captured is first sent through the Tesseract OCR module to detect any text in the image and then through the object detection model to detect the objects in front of the blind person. The text and the objects detected are conveyed to the blind person by converting the text to speech using the gTTS module. Along with the above-mentioned process, there will be an active Mycroft voice assistant model which can be used to interact with the blind person. The blind person can ask about the weather, daily news, any information on the internet, etc.
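    Summary The backbone of the Artificial Eye model is a Raspberry Pi 3 connected to a webcam, an ultrasonic proximity sensor, and a speaker; it also runs the software models (object detection, optical character recognition, Google text-to-speech, and the Mycroft voice assistant). The ultrasonic proximity sensor first measures the distance to any obstacle in front of it. When it detects an obstacle within its specified range, the blind person hears an audio prompt that an obstacle lies in the way. The webcam then captures an image of the scene ahead and passes it to the Raspberry Pi for processing. The Tesseract OCR module detects text in the image, and the gTTS module converts the text to speech. An active Mycroft voice assistant runs alongside, so the blind person can ask about the weather, daily news, information on the internet, and so on.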
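
The sensing-to-speech flow described above can be sketched as one step of a control loop. This is only an illustration under stated assumptions: the hardware and models are passed in as plain callables (in the described system they would wrap the ultrasonic sensor, webcam, Tesseract OCR, an object detector, and gTTS), and the function name `assist_step`, the message wording, and the 100 cm threshold are hypothetical.

```python
from typing import Callable, List

def assist_step(
    distance_cm: float,
    capture: Callable[[], object],
    read_text: Callable[[object], str],
    detect_objects: Callable[[object], List[str]],
    speak: Callable[[str], None],
    threshold_cm: float = 100.0,  # hypothetical range, not taken from the paper
) -> List[str]:
    """Run one pass: warn about an obstacle, then describe text and objects."""
    if distance_cm > threshold_cm:
        return []  # nothing within range, so stay silent

    messages = [f"Obstacle ahead at {distance_cm:.0f} centimeters"]
    image = capture()  # grab a frame only once an obstacle is detected

    text = read_text(image).strip()  # OCR first, as in the abstract
    if text:
        messages.append(f"Text: {text}")

    objects = detect_objects(image)  # then object detection
    if objects:
        messages.append("Objects: " + ", ".join(objects))

    for message in messages:
        speak(message)  # text-to-speech output
    return messages

# Stub components so the sketch runs without any hardware attached.
spoken: List[str] = []
messages = assist_step(
    distance_cm=80.0,
    capture=lambda: "frame-0",
    read_text=lambda image: " EXIT ",
    detect_objects=lambda image: ["door", "chair"],
    speak=spoken.append,
)
```

Passing the components in as callables keeps the decision logic testable on a laptop before it is wired to the Raspberry Pi peripherals.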

Discovering Hierarchical Achievements in Reinforcement Learning via Contrastive Learning

  • paper_url: http://arxiv.org/abs/2307.03486
  • repo_url: None
  • paper_authors: Seungyong Moon, Junyoung Yeom, Bumsoo Park, Hyun Oh Song
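  • for: Studies how to discover hierarchically structured achievements in procedurally generated environments, which requires agents to possess a broad range of abilities such as generalization and long-term reasoning.
  • methods: Uses proximal policy optimization (PPO), a simple and versatile model-free algorithm, and finds that a PPO agent can predict the next achievement to be unlocked to some extent, though with low confidence.
  • results: Combining PPO with the proposed contrastive learning method, achievement distillation, strengthens the agent's prediction of the next achievement and yields state-of-the-art performance on the challenging Crafter environment.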
    Abstract Discovering achievements with a hierarchical structure on procedurally generated environments poses a significant challenge. This requires agents to possess a broad range of abilities, including generalization and long-term reasoning. Many prior methods are built upon model-based or hierarchical approaches, with the belief that an explicit module for long-term planning would be beneficial for learning hierarchical achievements. However, these methods require an excessive amount of environment interactions or large model sizes, limiting their practicality. In this work, we identify that proximal policy optimization (PPO), a simple and versatile model-free algorithm, outperforms the prior methods with recent implementation practices. Moreover, we find that the PPO agent can predict the next achievement to be unlocked to some extent, though with low confidence. Based on this observation, we propose a novel contrastive learning method, called achievement distillation, that strengthens the agent's capability to predict the next achievement. Our method exhibits a strong capacity for discovering hierarchical achievements and shows state-of-the-art performance on the challenging Crafter environment using fewer model parameters in a sample-efficient regime.
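    Summary Discovering hierarchically structured achievements requires agents to have a broad range of abilities, including generalization and long-term reasoning. Many prior methods build on model-based or hierarchical approaches, in the belief that an explicit long-term planning module helps in learning hierarchical achievements. These methods, however, need a large number of environment interactions or large models, which limits their practicality. This work finds that proximal policy optimization (PPO), a simple and versatile model-free algorithm, outperforms prior methods when recent implementation practices are used, and that a PPO agent can predict the next achievement to be unlocked to some extent, though with low confidence. Based on this observation, the authors propose a new contrastive learning method, achievement distillation, that strengthens the agent's ability to predict the next achievement. The method shows a strong capacity for discovering hierarchical achievements on the challenging Crafter environment, with fewer model parameters and in a sample-efficient regime.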

TBGC: Task-level Backbone-Oriented Gradient Clip for Multi-Task Foundation Model Learning

  • paper_url: http://arxiv.org/abs/2307.03465
  • repo_url: None
  • paper_authors: Zelun Zhang, Xue Pan
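  • for: Mitigate the backbone gradient bias problem in multi-task learning.
  • methods: Proposes a task-level, backbone-oriented gradient clipping strategy and a multi-branch data augmentation strategy.
  • results: Experiments show that the strategy relieves the gradient bias problem to some extent; the approach took 1st place on Leaderboard A and 2nd place on Leaderboard B of the CVPR2023 Foundation Model Challenge.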
    Abstract The AllInOne training paradigm squeezes a wide range of tasks into a unified model in a multi-task learning manner. However, optimization in multi-task learning is more challenging than in single-task learning, as the gradient norms from different tasks may vary greatly, making the backbone overly biased towards one specific task. To address this issue, we propose the task-level backbone-oriented gradient clip paradigm. Compared with the vanilla gradient clip method, it has two points of emphasis: 1) gradient clipping is performed independently for each task; 2) backbone gradients generated from each task are rescaled to the same norm scale. Based on the experimental results, we argue that the task-level backbone-oriented gradient clip paradigm can relieve the gradient bias problem to some extent. We also propose a novel multi-branch data augmentation strategy where conflicting augmentations are placed in different branches. Our approach has been shown to be effective and finally achieved 1st place on Leaderboard A and 2nd place on Leaderboard B of the CVPR2023 Foundation Model Challenge. It is worth noting that, whereas Leaderboard A evaluates all three tasks (detection, segmentation and fine-grained classification), Leaderboard B does not evaluate the segmentation task, in which our team has a huge advantage.
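    Summary The AllInOne training paradigm integrates many tasks into one unified model through multi-task learning. Optimization in multi-task learning is harder than in single-task learning, because gradient norms from different tasks can differ greatly and bias the backbone toward one specific task. To address this, the authors propose a task-level backbone-oriented gradient clip method. Compared with vanilla gradient clipping, it has two points of emphasis: 1) gradient clipping is performed independently for each task; 2) the backbone gradients generated by each task are rescaled to the same norm scale. Based on the experimental results, the authors argue that the method relieves the gradient bias problem to some extent. They also propose a new multi-branch data augmentation strategy in which conflicting augmentations are placed in different branches. The approach took 1st and 2nd place in the CVPR2023 Foundation Model Challenge. Leaderboard A evaluates all three tasks (detection, segmentation, and fine-grained classification), whereas Leaderboard B does not evaluate segmentation, the task on which the team has a large advantage.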
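
The two points of emphasis, per-task handling and rescaling every task's backbone gradient to a common norm, can be sketched on toy vectors. This is a minimal illustration under assumptions the abstract does not settle: gradients are flat Python lists, the rescaled gradients are combined by summation, and `target_norm=1.0` and the function names are hypothetical.

```python
import math

def l2_norm(grad):
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(g * g for g in grad))

def rescale_to_norm(grad, target_norm, eps=1e-12):
    """Scale one task's backbone gradient so its norm equals target_norm."""
    scale = target_norm / (l2_norm(grad) + eps)
    return [g * scale for g in grad]

def combine_backbone_grads(task_grads, target_norm=1.0):
    """Rescale each task's backbone gradient independently, then sum them."""
    dim = len(next(iter(task_grads.values())))
    total = [0.0] * dim
    for grad in task_grads.values():
        scaled = rescale_to_norm(grad, target_norm)
        total = [t + s for t, s in zip(total, scaled)]
    return total

# Toy backbone gradients whose norms differ by several orders of magnitude.
task_grads = {
    "detection": [300.0, 400.0],     # norm 500
    "segmentation": [0.0, 0.02],     # norm 0.02
    "classification": [-3.0, 0.0],   # norm 3
}
naive = [sum(g[i] for g in task_grads.values()) for i in range(2)]
combined = combine_backbone_grads(task_grads)
```

In the naive sum the detection gradient dominates the update direction, while after rescaling each task contributes a unit-norm direction to the backbone update.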

MultiQG-TI: Towards Question Generation from Multi-modal Sources

  • paper_url: http://arxiv.org/abs/2307.04643
  • repo_url: https://github.com/moonlightlane/multiqg-ti
  • paper_authors: Zichao Wang, Richard Baraniuk
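  • for: Studies automatic question generation (QG) from multi-modal sources containing images and text, extending most existing work, which focuses only on QG from textual sources.
  • methods: Proposes a simple solution, MultiQG-TI, that lets a text-only question generator process visual input. An image-to-text model and an optical character recognition model provide a textual description of the image and the text inside it, which are fed to the question generator together with the input text. Only the question generator is fine-tuned; the other components stay fixed.
  • results: On the ScienceQA dataset, MultiQG-TI significantly outperforms ChatGPT with few-shot prompting, despite having about a hundred times fewer trainable parameters. Additional analyses confirm that both visual and textual signals are necessary and show the impact of modeling choices.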
    Abstract We study the new problem of automatic question generation (QG) from multi-modal sources containing images and texts, significantly expanding the scope of most of the existing work that focuses exclusively on QG from only textual sources. We propose a simple solution for our new problem, called MultiQG-TI, which enables a text-only question generator to process visual input in addition to textual input. Specifically, we leverage an image-to-text model and an optical character recognition model to obtain the textual description of the image and extract any texts in the image, respectively, and then feed them together with the input texts to the question generator. We only fine-tune the question generator while keeping the other components fixed. On the challenging ScienceQA dataset, we demonstrate that MultiQG-TI significantly outperforms ChatGPT with few-shot prompting, despite having a hundred times fewer trainable parameters. Additional analyses empirically confirm the necessity of both visual and textual signals for QG and show the impact of various modeling choices.
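    Summary The paper studies a new problem: automatic question generation (QG) from multi-modal sources containing images and text, which broadens the scope of most existing work that focuses only on textual sources. It proposes a simple solution, MultiQG-TI, that lets a text-only question generator process visual input in addition to textual input. An image-to-text model and an optical character recognition model produce a textual description of the image and extract the text it contains, and both are passed to the question generator together with the input text. Only the question generator is fine-tuned; the other components are kept fixed. On ScienceQA, MultiQG-TI significantly outperforms ChatGPT with few-shot prompting while having about a hundred times fewer trainable parameters. Further analyses confirm the necessity of both visual and textual signals and show the impact of modeling choices.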
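
The way visual input reaches a text-only generator can be sketched as a string-assembly step. This is a hedged illustration, not the authors' code: the captioning and OCR models are passed in as callables, and the function name `build_generator_input` and the field labels in the assembled text are hypothetical.

```python
from typing import Callable

def build_generator_input(
    context: str,
    image: object,
    caption_fn: Callable[[object], str],  # stands in for the image-to-text model
    ocr_fn: Callable[[object], str],      # stands in for the OCR model
) -> str:
    """Turn (text, image) into a single text input for the question generator."""
    caption = caption_fn(image).strip()
    ocr_text = ocr_fn(image).strip()

    parts = [f"context: {context.strip()}"]
    if caption:
        parts.append(f"image description: {caption}")
    if ocr_text:
        parts.append(f"text in image: {ocr_text}")
    return "\n".join(parts)

# Stub models so the sketch runs without any model weights.
prompt = build_generator_input(
    context="Plants make food through photosynthesis.",
    image="diagram.png",
    caption_fn=lambda image: "a diagram of a leaf absorbing sunlight",
    ocr_fn=lambda image: "CO2 H2O O2",
)
```

Since everything upstream of the generator produces plain text, only the generator needs fine-tuning, which matches the fixed-components setup described above.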

A Survey on Graph Neural Networks for Time Series: Forecasting, Classification, Imputation, and Anomaly Detection

  • paper_url: http://arxiv.org/abs/2307.03759
  • repo_url: https://github.com/kimmeen/awesome-gnn4ts
  • paper_authors: Ming Jin, Huan Yee Koh, Qingsong Wen, Daniele Zambon, Cesare Alippi, Geoffrey I. Webb, Irwin King, Shirui Pan
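  • for: Surveys the use of graph neural networks (GNNs) for time series analysis, covering forecasting, classification, anomaly detection, and imputation.
  • methods: The surveyed approaches use GNNs to model relationships in time series data, both across time and across variables, which traditional and other deep neural network-based methods struggle to capture.
  • results: The survey provides a comprehensive task-oriented taxonomy and reviews representative research works and applications. It also discusses future research directions, including GNN models for different types of time series data.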
    Abstract Time series are the primary data type used to record dynamic system measurements and generated in great volume by both physical sensors and online processes (virtual sensors). Time series analytics is therefore crucial to unlocking the wealth of information implicit in available data. With the recent advancements in graph neural networks (GNNs), there has been a surge in GNN-based approaches for time series analysis. These approaches can explicitly model inter-temporal and inter-variable relationships, which traditional and other deep neural network-based methods struggle to do. In this survey, we provide a comprehensive review of graph neural networks for time series analysis (GNN4TS), encompassing four fundamental dimensions: forecasting, classification, anomaly detection, and imputation. Our aim is to guide designers and practitioners to understand, build applications, and advance research of GNN4TS. At first, we provide a comprehensive task-oriented taxonomy of GNN4TS. Then, we present and discuss representative research works and introduce mainstream applications of GNN4TS. A comprehensive discussion of potential future research directions completes the survey. This survey, for the first time, brings together a vast array of knowledge on GNN-based time series research, highlighting foundations, practical applications, and opportunities of graph neural networks for time series analysis.
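    Summary Time series are the primary data type for recording dynamic system measurements and are generated in great volume by both physical sensors and online processes (virtual sensors). Time series analytics is therefore key to unlocking the information in available data. Recent advances in graph neural networks (GNNs) have led to a surge of GNN-based approaches to time series analysis. These approaches can explicitly model inter-temporal and inter-variable relationships, which traditional and other deep neural network-based methods struggle to do. The survey gives a comprehensive review of graph neural networks for time series analysis (GNN4TS) along four dimensions: forecasting, classification, anomaly detection, and imputation. Its aim is to help designers and practitioners understand GNN4TS, build applications, and advance research. It first provides a task-oriented taxonomy of GNN4TS, then presents representative research works and mainstream applications, and closes with a discussion of future research directions. It is the first survey to bring together this body of knowledge on GNN-based time series research, covering foundations, practical applications, and opportunities.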
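
As a small illustration of what "explicitly modeling inter-variable relationships" means, the sketch below runs one round of mean-aggregation message passing over a graph whose nodes are the variables of a multivariate series. It is a generic toy, not a method from the survey: there are no learned weights, the temporal dimension is handled independently per time step, and the names `message_pass`, `series`, and `neighbors` are hypothetical.

```python
def message_pass(series, neighbors):
    """One round of mean aggregation over the variable graph.

    series[t][i] is the value of variable i at time step t;
    neighbors[i] lists the variables connected to variable i.
    """
    smoothed = []
    for row in series:
        new_row = []
        for i, value in enumerate(row):
            # Each variable averages its own value with its neighbors' values.
            group = [value] + [row[j] for j in neighbors[i]]
            new_row.append(sum(group) / len(group))
        smoothed.append(new_row)
    return smoothed

# Three sensors connected in a chain: 0 - 1 - 2.
series = [
    [1.0, 2.0, 6.0],  # time step 0
    [3.0, 3.0, 3.0],  # time step 1
]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
smoothed = message_pass(series, neighbors)
```

A full GNN4TS model would replace the plain average with learned transformations and add a temporal module (for example recurrence, convolution, or attention) to capture inter-temporal relationships.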

Towards Deep Network Steganography: From Networks to Networks

  • paper_url: http://arxiv.org/abs/2307.03444
  • repo_url: None
  • paper_authors: Guobiao Li, Sheng Li, Meiling Li, Zhenxing Qian, Xinpeng Zhang
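  • for: Addresses how to covertly transmit deep neural network (DNN) models over public channels, especially models trained for secret-learning tasks.
  • methods: Proposes deep network steganography, which disguises the learning task of a secret DNN model as an ordinary learning task carried out by a stego DNN model. A gradient-based scheme inserts interference filters at important positions in the secret model, the positions are embedded into the stego model with a key via side-information hiding, and a partial optimization strategy activates the filters.
  • results: Experiments on both intra-task and inter-task steganography (secret and stego-learning tasks in the same or in different categories) show that the method is effective for covert communication of DNN models.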
    Abstract With the widespread applications of the deep neural network (DNN), how to covertly transmit DNN models in public channels has drawn attention, especially for those trained for secret-learning tasks. In this paper, we propose deep network steganography for the covert communication of DNN models. Unlike the existing steganography schemes which focus on the subtle modification of the cover data to accommodate the secrets, our scheme is learning task oriented, where the learning task of the secret DNN model (termed the secret-learning task) is disguised as another ordinary learning task conducted in a stego DNN model (termed the stego-learning task). To this end, we propose a gradient-based filter insertion scheme to insert interference filters into the important positions in the secret DNN model to form a stego DNN model. These positions are then embedded into the stego DNN model using a key by side information hiding. Finally, we activate the interference filters by a partial optimization strategy, such that the generated stego DNN model works on the stego-learning task. We conduct experiments on both intra-task steganography and inter-task steganography (i.e., the secret and stego-learning tasks belong to the same and different categories, respectively), both of which demonstrate the effectiveness of our proposed method for covert communication of DNN models.
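    Summary With the widespread use of deep neural networks (DNNs), how to covertly transmit trained DNN models over public channels has drawn attention, especially for models trained for secret-learning tasks. The paper proposes deep network steganography for the covert communication of DNN models. Unlike existing steganography schemes, the scheme is learning-task oriented: the secret-learning task is disguised as another ordinary learning task (the stego-learning task). To this end, a gradient-based filter insertion scheme inserts interference filters at important positions in the secret DNN model to form a stego DNN model. These positions are then embedded into the stego DNN model with a key, using side-information hiding. Finally, a partial optimization strategy activates the interference filters so that the resulting stego DNN model works on the stego-learning task. Experiments cover intra-task steganography (secret and stego-learning tasks in the same category) and inter-task steganography (tasks in different categories), and both demonstrate the effectiveness of the method.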

Non-iterative Coarse-to-fine Transformer Networks for Joint Affine and Deformable Image Registration

  • paper_url: http://arxiv.org/abs/2307.03421
  • repo_url: https://github.com/mungomeng/registration-nice-trans
  • paper_authors: Mingyuan Meng, Lei Bi, Michael Fulham, Dagan Feng, Jinman Kim
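  • for: Proposes a deep learning-based non-iterative coarse-to-fine method for image registration.
  • methods: Uses NICE-Trans, a Non-Iterative Coarse-to-finE Transformer network that performs joint affine and deformable coarse-to-fine registration within a single network and embeds transformers to model long-range relevance between images.
  • results: Experiments on seven public datasets show that NICE-Trans outperforms state-of-the-art registration methods in both registration accuracy and runtime.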
    Abstract Image registration is a fundamental requirement for medical image analysis. Deep registration methods based on deep learning have been widely recognized for their capabilities to perform fast end-to-end registration. Many deep registration methods achieved state-of-the-art performance by performing coarse-to-fine registration, where multiple registration steps were iterated with cascaded networks. Recently, Non-Iterative Coarse-to-finE (NICE) registration methods have been proposed to perform coarse-to-fine registration in a single network and showed advantages in both registration accuracy and runtime. However, existing NICE registration methods mainly focus on deformable registration, while affine registration, a common prerequisite, is still reliant on time-consuming traditional optimization-based methods or extra affine registration networks. In addition, existing NICE registration methods are limited by the intrinsic locality of convolution operations. Transformers may address this limitation for their capabilities to capture long-range dependency, but the benefits of using transformers for NICE registration have not been explored. In this study, we propose a Non-Iterative Coarse-to-finE Transformer network (NICE-Trans) for image registration. Our NICE-Trans is the first deep registration method that (i) performs joint affine and deformable coarse-to-fine registration within a single network, and (ii) embeds transformers into a NICE registration framework to model long-range relevance between images. Extensive experiments with seven public datasets show that our NICE-Trans outperforms state-of-the-art registration methods on both registration accuracy and runtime.
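    Summary Image registration is a fundamental requirement in medical image analysis. Deep learning-based registration methods have been widely recognized in recent years because they perform fast end-to-end registration. Many of them reach strong performance through coarse-to-fine registration that iterates multiple registration steps with cascaded networks, and Non-Iterative Coarse-to-finE (NICE) methods perform this within a single network. However, existing NICE methods focus mainly on deformable registration, while affine registration, a common prerequisite, still relies on time-consuming traditional optimization-based methods or extra affine registration networks. Existing NICE methods are also limited by the intrinsic locality of convolution operations. Transformers may overcome this limitation because they capture long-range dependencies between images, but their benefits for NICE registration have not been explored. The study proposes a Non-Iterative Coarse-to-finE Transformer network (NICE-Trans) for image registration. NICE-Trans is the first method that performs joint affine and deformable coarse-to-fine registration in a single network and embeds transformers in a NICE framework to model long-range relevance between images. Extensive experiments on seven public datasets show that NICE-Trans outperforms current registration methods in both registration accuracy and runtime.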

QI2 – an Interactive Tool for Data Quality Assurance

  • paper_url: http://arxiv.org/abs/2307.03419
  • repo_url: None
  • paper_authors: Simon Geerkens, Christian Sieberichs, Alexander Braun, Thomas Waschulzik
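  • for: Supports data quality assurance for machine learning systems and big data, in view of the data quality requirements of the European Commission's planned AI Act.
  • methods: Proposes a new approach that supports quality assurance across multiple data quality aspects and enables the verification of quantitative data quality requirements; the concept is explained on small example data sets.
  • results: The method is applied to the well-known MNIST data set, and the example data sets illustrate how it works and what benefits it offers.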
    Abstract The importance of high data quality is increasing with the growing impact and distribution of ML systems and big data. Also, the planned AI Act from the European Commission defines challenging legal requirements for data quality, especially for the market introduction of safety-relevant ML systems. In this paper we introduce a novel approach that supports the data quality assurance process for multiple data quality aspects. This approach enables the verification of quantitative data quality requirements. The concept and benefits are introduced and explained on small example data sets. How the method is applied is demonstrated on the well-known MNIST data set of handwritten digits.
    Summary The importance of high-quality data is growing as machine learning systems and big data become more widespread and influential. The European Commission's AI Act will also impose strict legal requirements, especially for safety-relevant machine learning systems. The paper introduces a new method that supports the quality assurance process across multiple data quality aspects and can verify quantitative data quality requirements. The concept is introduced and explained using small example data sets, and its application is demonstrated on the well-known MNIST data set.

Goal-Conditioned Predictive Coding as an Implicit Planner for Offline Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.03406
  • repo_url: None
  • paper_authors: Zilai Zeng, Ce Zhang, Shijie Wang, Chen Sun
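  • for: Investigates whether sequence modeling can condense trajectories into useful representations for policy learning.
  • methods: Adopts a two-stage framework that first summarizes trajectories with sequence modeling techniques and then uses these representations to learn a policy along with a desired goal.
  • results: Extensive experiments on the AntMaze, FrankaKitchen, and Locomotion environments show that sequence modeling has a significant impact on some decision-making tasks, and that GCPC learns a goal-conditioned latent representation of the future and achieves competitive performance.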
    Abstract Recent work has demonstrated the effectiveness of formulating decision making as a supervised learning problem on offline-collected trajectories. However, the benefits of performing sequence modeling on trajectory data are not yet clear. In this work we investigate whether sequence modeling has the capability to condense trajectories into useful representations that can contribute to policy learning. To achieve this, we adopt a two-stage framework that first summarizes trajectories with sequence modeling techniques, and then employs these representations to learn a policy along with a desired goal. This design allows many existing supervised offline RL methods to be considered as specific instances of our framework. Within this framework, we introduce Goal-Conditioned Predictive Coding (GCPC), an approach that brings powerful trajectory representations and leads to performant policies. We conduct extensive empirical evaluations on AntMaze, FrankaKitchen and Locomotion environments, and observe that sequence modeling has a significant impact on some decision making tasks. In addition, we demonstrate that GCPC learns a goal-conditioned latent representation about the future, which serves as an "implicit planner", and enables competitive performance on all three benchmarks.
    Summary Decision making can be formulated as supervised learning on offline-collected trajectories, but the benefits of sequence modeling on trajectory data remain unclear. This work uses a two-stage framework that first summarizes trajectories using sequence modeling techniques and then employs these representations to learn a policy along with a desired goal. This design allows many existing supervised offline RL methods to be considered as specific instances of the framework. Within this framework, the authors introduce Goal-Conditioned Predictive Coding (GCPC), an approach that provides powerful trajectory representations and leads to performant policies. Extensive empirical evaluations on AntMaze, FrankaKitchen, and Locomotion environments show that sequence modeling has a significant impact on some decision-making tasks. GCPC also learns a goal-conditioned latent representation of the future, which serves as an "implicit planner" and enables competitive performance on all three benchmarks.

Exploring the Potential of Large Language Models (LLMs) in Learning on Graphs

  • paper_url: http://arxiv.org/abs/2307.03393
  • repo_url: https://github.com/CurryTang/Graph-LLM
  • paper_authors: Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, Jiliang Tang
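  • for: Explores the potential of large language models (LLMs) in graph machine learning, especially node classification, through two possible pipelines: LLMs-as-Enhancers and LLMs-as-Predictors.
  • methods: Studies two pipelines: one uses LLMs to enhance nodes' text attributes and then predicts with GNNs; the other uses LLMs directly as standalone predictors.
  • results: Systematic experiments yield original observations and new insights, including that LLMs can improve node classification accuracy and GNN performance.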
    Abstract Learning on Graphs has attracted immense attention due to its wide real-world applications. The most popular pipeline for learning on graphs with textual node attributes primarily relies on Graph Neural Networks (GNNs), and utilizes shallow text embedding as initial node representations, which has limitations in general knowledge and profound semantic understanding. In recent years, Large Language Models (LLMs) have been proven to possess extensive common knowledge and powerful semantic comprehension abilities that have revolutionized existing workflows to handle text data. In this paper, we aim to explore the potential of LLMs in graph machine learning, especially the node classification task, and investigate two possible pipelines: LLMs-as-Enhancers and LLMs-as-Predictors. The former leverages LLMs to enhance nodes' text attributes with their massive knowledge and then generate predictions through GNNs. The latter attempts to directly employ LLMs as standalone predictors. We conduct comprehensive and systematic studies on these two pipelines under various settings. From comprehensive empirical results, we make original observations and find new insights that open new possibilities and suggest promising directions to leverage LLMs for learning on graphs. Our code and datasets are available at https://github.com/CurryTang/Graph-LLM.
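    Summary Learning on graphs has attracted great attention because of its wide range of real-world applications. The most popular pipeline for graphs with textual node attributes relies on graph neural networks (GNNs) and uses shallow text embeddings as initial node representations, which limits general knowledge and deep semantic understanding. In recent years, large language models (LLMs) have been shown to possess broad common knowledge and strong semantic comprehension, which has transformed how text data is handled. The paper explores the potential of LLMs in graph machine learning, especially node classification, and studies two possible pipelines: LLMs-as-Enhancers and LLMs-as-Predictors. The former uses LLMs to enhance nodes' text attributes and then generates predictions through GNNs. The latter uses LLMs directly as standalone predictors. Systematic studies under various settings yield original observations and new insights that open new possibilities and point to promising directions for using LLMs in learning on graphs. The code and datasets are available at https://github.com/CurryTang/Graph-LLM.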

On Formal Feature Attribution and Its Approximation

  • paper_url: http://arxiv.org/abs/2307.03380
  • repo_url: https://github.com/ffattr/ffa
  • paper_authors: Jinqiang Yu, Alexey Ignatiev, Peter J. Stuckey
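  • for: Extend formal explainable AI (formal XAI) to the feature attribution problem and make it practical.
  • methods: Defines formal feature attribution (FFA) on the basis of formal explanation enumeration and proposes an efficient technique for approximating exact FFA.
  • results: Experiments show that the approximate FFA is effective compared with existing feature attribution algorithms, in terms of both feature importance and the relative order of features.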
    Abstract Recent years have witnessed the widespread use of artificial intelligence (AI) algorithms and machine learning (ML) models. Despite their tremendous success, a number of vital problems like ML model brittleness, their fairness, and the lack of interpretability warrant the need for active developments in explainable artificial intelligence (XAI) and formal ML model verification. The two major lines of work in XAI include feature selection methods, e.g. Anchors, and feature attribution techniques, e.g. LIME and SHAP. Despite their promise, most of the existing feature selection and attribution approaches are susceptible to a range of critical issues, including explanation unsoundness and out-of-distribution sampling. A recent formal approach to XAI (FXAI), although serving as an alternative to the above and free of these issues, suffers from a few other limitations. For instance, besides the scalability limitation, the formal approach is unable to tackle the feature attribution problem. Additionally, a formal explanation, despite being formally sound, is typically quite large, which hampers its applicability in practical settings. Motivated by the above, this paper proposes a way to apply the apparatus of formal XAI to the case of feature attribution based on formal explanation enumeration. Formal feature attribution (FFA) is argued to be advantageous over the existing methods, both formal and non-formal. Given the practical complexity of the problem, the paper then proposes an efficient technique for approximating exact FFA. Finally, it offers experimental evidence of the effectiveness of the proposed approximate FFA in comparison to the existing feature attribution algorithms, not only in terms of feature importance but also in terms of their relative order.
    Summary Existing feature selection and attribution methods are prone to issues such as explanation unsoundness and out-of-distribution sampling, while formal XAI scales poorly, cannot tackle feature attribution, and tends to produce large explanations. Motivated by these limitations, this paper proposes a way to apply formal XAI to feature attribution based on formal explanation enumeration. Formal feature attribution (FFA) is argued to be advantageous over existing methods, both formal and non-formal. Given the practical complexity of the problem, the paper proposes an efficient technique for approximating exact FFA. Finally, it offers experimental evidence of the effectiveness of the proposed approximate FFA in comparison to existing feature attribution algorithms in terms of feature importance and relative order.
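
One way to read "feature attribution based on formal explanation enumeration" is to score each feature by the fraction of enumerated explanations that contain it. The sketch below works under that assumption, which the abstract does not spell out; the explanations are hard-coded toy sets rather than the output of a formal reasoner, and the feature names and the function name `formal_feature_attribution` are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def formal_feature_attribution(explanations):
    """Score each feature by the share of explanations in which it appears."""
    if not explanations:
        return {}
    counts = Counter(f for expl in explanations for f in set(expl))
    total = len(explanations)
    return {feature: Fraction(count, total) for feature, count in counts.items()}

# Toy enumeration result: each set is one explanation for the same prediction.
explanations = [
    {"age", "income"},
    {"age", "education"},
    {"age"},
    {"income", "education"},
]
ffa = formal_feature_attribution(explanations)

# Rank by attribution (highest first), breaking ties alphabetically.
ranking = sorted(ffa, key=lambda feature: (-ffa[feature], feature))
```

Under this reading, an approximation can stop the enumeration early and compute the same fractions on the explanations found so far, which is where the scores and the relative order of features would be compared against the exact values.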

Efficient Ground Vehicle Path Following in Game AI

  • paper_url: http://arxiv.org/abs/2307.03379
  • repo_url: None
  • paper_authors: Rodrigue de Schaetzen, Alessandro Sestini
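  • for: Designs an efficient path following solution for ground vehicles in game AI.
  • methods: Adapts established techniques into a simple solution with easily tunable parameters, yielding an efficient benchmark path follower. The solution pays particular attention to computing a target speed, using quadratic Bezier curves to estimate path curvature.
  • results: The path follower is evaluated in a variety of test scenarios in a first-person shooter game for effectiveness and robustness. Compared with an existing path following solution, it achieves a 70% decrease in the total number of stuck events.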
    Abstract This short paper presents an efficient path following solution for ground vehicles tailored to game AI. Our focus is on adapting established techniques to design simple solutions with parameters that are easily tunable for an efficient benchmark path follower. Our solution pays particular attention to computing a target speed which uses quadratic Bezier curves to estimate the path curvature. The performance of the proposed path follower is evaluated through a variety of test scenarios in a first-person shooter game, demonstrating its effectiveness and robustness in handling different types of paths and vehicles. We achieved a 70% decrease in the total number of stuck events compared to an existing path following solution.
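
The idea of deriving a target speed from curvature estimated on a quadratic Bezier curve can be sketched as follows. The curvature formula is the standard one for a parametric plane curve; the speed law (capping lateral acceleration, v = sqrt(a_max / kappa)) and the parameter names `max_speed` and `max_lateral_accel` are assumptions for illustration, not details taken from the paper.

```python
import math
from typing import Tuple

Point = Tuple[float, float]

def bezier_curvature(p0: Point, p1: Point, p2: Point, t: float) -> float:
    """Curvature of the quadratic Bezier curve through control points p0, p1, p2."""
    # First derivative: B'(t) = 2(1 - t)(P1 - P0) + 2t(P2 - P1)
    dx = 2 * (1 - t) * (p1[0] - p0[0]) + 2 * t * (p2[0] - p1[0])
    dy = 2 * (1 - t) * (p1[1] - p0[1]) + 2 * t * (p2[1] - p1[1])
    # Second derivative is constant: B''(t) = 2(P2 - 2*P1 + P0)
    ddx = 2 * (p2[0] - 2 * p1[0] + p0[0])
    ddy = 2 * (p2[1] - 2 * p1[1] + p0[1])
    speed_sq = dx * dx + dy * dy
    if speed_sq == 0.0:
        return 0.0  # degenerate point; treat the path as straight here
    return abs(dx * ddy - dy * ddx) / speed_sq ** 1.5

def target_speed(curvature: float, max_speed: float, max_lateral_accel: float) -> float:
    """Slow down in turns so that v^2 * curvature stays below the lateral limit."""
    if curvature <= 0.0:
        return max_speed
    return min(max_speed, math.sqrt(max_lateral_accel / curvature))

# Three consecutive waypoints forming a turn, and three forming a straight line.
kappa_turn = bezier_curvature((0.0, 0.0), (1.0, 1.0), (2.0, 0.0), t=0.5)
kappa_straight = bezier_curvature((0.0, 0.0), (1.0, 0.0), (2.0, 0.0), t=0.5)
speed_in_turn = target_speed(kappa_turn, max_speed=10.0, max_lateral_accel=4.0)
speed_on_straight = target_speed(kappa_straight, max_speed=10.0, max_lateral_accel=4.0)
```

Both `max_speed` and `max_lateral_accel` are single scalar knobs, which fits the stated goal of a path follower whose parameters are easy to tune per vehicle.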

All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment

  • paper_url: http://arxiv.org/abs/2307.03373
  • repo_url: https://github.com/983632847/All-in-One
  • paper_authors: Chunhui Zhang, Xin Sun, Li Liu, Yiqian Yang, Qiong Liu, Xi Zhou, Yanfeng Wang
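  • for: Improve vision-language trackers so that they handle complex scenarios better, such as similar distractors and extreme illumination.
  • methods: Proposes an All-in-One framework that mixes raw vision and language signals directly and uses a unified transformer backbone to learn joint feature extraction and interaction. A multi-modal alignment module with cross-modal and intra-modal contrastive objectives provides more reasonable representations.
  • results: Extensive experiments on five benchmarks show that the tracker outperforms existing state-of-the-art methods and is more effective and efficient than previous approaches.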
    Abstract The current mainstream vision-language (VL) tracking framework consists of three parts, i.e., a visual feature extractor, a language feature extractor, and a fusion model. To pursue better performance, a natural modus operandi for VL tracking is employing customized and heavier unimodal encoders, and multi-modal fusion models. Albeit effective, existing VL trackers separate feature extraction and feature integration, resulting in extracted features that lack semantic guidance and have limited target-aware capability in complex scenarios, e.g., similar distractors and extreme illumination. In this work, inspired by the recent success of exploring foundation models with unified architecture for both natural language and computer vision tasks, we propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone. Specifically, we mix raw vision and language signals to generate language-injected vision tokens, which we then concatenate before feeding into the unified backbone architecture. This approach achieves feature integration in a unified backbone, removing the need for carefully-designed fusion modules and resulting in a more effective and efficient VL tracking framework. To further improve the learning efficiency, we introduce a multi-modal alignment module based on cross-modal and intra-modal contrastive objectives, providing more reasonable representations for the unified All-in-One transformer backbone. Extensive experiments on five benchmarks, i.e., OTB99-L, TNL2K, LaSOT, LaSOT$_{\rm Ext}$ and WebUAV-3M, demonstrate the superiority of the proposed tracker against existing state-of-the-art methods on VL tracking. Code will be made publicly available.
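    Summary The current mainstream vision-language (VL) tracking framework has three parts: a visual feature extractor, a language feature extractor, and a fusion model. To improve performance, VL trackers commonly adopt customized, heavier unimodal encoders and multi-modal fusion models. Although effective, existing VL trackers separate feature extraction from feature integration, so the extracted features lack semantic guidance and have limited target-aware capability in complex scenarios such as similar distractors and extreme illumination. Inspired by the recent success of foundation models with a unified architecture for both natural language and computer vision tasks, the authors propose an All-in-One framework that learns joint feature extraction and interaction with a unified transformer backbone. Raw vision and language signals are mixed to generate language-injected vision tokens, which are concatenated and fed into the unified backbone. Feature integration thus happens inside the backbone, which removes the need for carefully designed fusion modules and gives a more effective and efficient VL tracking framework. To further improve learning efficiency, a multi-modal alignment module based on cross-modal and intra-modal contrastive objectives provides more reasonable representations for the unified backbone. Extensive experiments on five benchmarks (OTB99-L, TNL2K, LaSOT, LaSOT$_{\rm Ext}$, and WebUAV-3M) show that the tracker outperforms the existing state of the art in VL tracking. The code will be made public.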

Adaptation and Communication in Human-Robot Teaming to Handle Discrepancies in Agents’ Beliefs about Plans

  • paper_url: http://arxiv.org/abs/2307.03362
  • repo_url: None
  • paper_authors: Yuening Zhang, Brian C. Williams
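  • for: This study addresses human-robot teams in which agents lack a shared mental model, for example because agents follow different conventions or because only some agents are aware of certain contingent constraints.
  • methods: The study uses epistemic logic so that agents can understand discrepancies in each other's beliefs and dynamically plan actions that either adapt to those discrepancies or communicate to resolve them.
  • results: The authors evaluate the success rate and scalability of the algorithm and report that their agent is better equipped to work in teams without the guarantee of a shared mental model.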
    Abstract When agents collaborate on a task, it is important that they have some shared mental model of the task routines -- the set of feasible plans towards achieving the goals. However, in reality, situations often arise that such a shared mental model cannot be guaranteed, such as in ad-hoc teams where agents may follow different conventions or when contingent constraints arise that only some agents are aware of. Previous work on human-robot teaming has assumed that the team has a set of shared routines, which breaks down in these situations. In this work, we leverage epistemic logic to enable agents to understand the discrepancy in each other's beliefs about feasible plans and dynamically plan their actions to adapt or communicate to resolve the discrepancy. We propose a formalism that extends conditional doxastic logic to describe knowledge bases in order to explicitly represent agents' nested beliefs on the feasible plans and state of execution. We provide an online execution algorithm based on Monte Carlo Tree Search for the agent to plan its action, including communication actions to explain the feasibility of plans, announce intent, and ask questions. Finally, we evaluate the success rate and scalability of the algorithm and show that our agent is better equipped to work in teams without the guarantee of a shared mental model.
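    Summary When agents collaborate on a task, it is important that they share a mental model of the task routines, that is, the set of feasible plans for achieving the goals. In practice this cannot always be guaranteed, for example in ad-hoc teams whose agents follow different conventions, or when contingent constraints arise that only some agents know about. Previous work on human-robot teaming assumed a set of shared routines, which breaks down in these situations. The authors use epistemic logic to let agents understand the discrepancy in each other's beliefs and dynamically plan actions that adapt to or communicate about it. They propose a formalism that extends conditional doxastic logic to describe knowledge bases, explicitly representing agents' nested beliefs, and an online execution algorithm based on Monte Carlo Tree Search that plans actions including communication actions to explain plan feasibility, announce intent, and ask questions. Finally, they evaluate the algorithm's success rate and scalability and show that the agent collaborates better when a shared mental model is not assumed.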

Evaluating Biased Attitude Associations of Language Models in an Intersectional Context

  • paper_url: http://arxiv.org/abs/2307.03360
  • repo_url: https://github.com/shivaomrani/llm-bias
  • paper_authors: Shiva Omrani Sabbaghi, Robert Wolfe, Aylin Caliskan
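  • for: This paper studies biased attitudes toward social groups in English language models.
  • methods: The study uses a sentence template that provides an intersectional context, together with a concept projection approach over contextualized word embeddings, to quantify how social groups are valenced.
  • results: Language models show the most biased attitudes against gender identity, social class, and sexual orientation signals. The largest and better-performing model studied is also more biased.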
    Abstract Language models are trained on large-scale corpora that embed implicit biases documented in psychology. Valence associations (pleasantness/unpleasantness) of social groups determine the biased attitudes towards groups and concepts in social cognition. Building on this established literature, we quantify how social groups are valenced in English language models using a sentence template that provides an intersectional context. We study biases related to age, education, gender, height, intelligence, literacy, race, religion, sex, sexual orientation, social class, and weight. We present a concept projection approach to capture the valence subspace through contextualized word embeddings of language models. Adapting the projection-based approach to embedding association tests that quantify bias, we find that language models exhibit the most biased attitudes against gender identity, social class, and sexual orientation signals in language. We find that the largest and better-performing model that we study is also more biased as it effectively captures bias embedded in sociocultural data. We validate the bias evaluation method by overperforming on an intrinsic valence evaluation task. The approach enables us to measure complex intersectional biases as they are known to manifest in the outputs and applications of language models that perpetuate historical biases. Moreover, our approach contributes to design justice as it studies the associations of groups underrepresented in language such as transgender and homosexual individuals.
    Summary Language models are trained on large-scale corpora that embed implicit biases documented in psychology. Building on the literature on valence associations (pleasantness/unpleasantness) of social groups, the authors quantify how social groups are valenced in English language models using a sentence template that provides an intersectional context. They study biases related to age, education, gender, height, intelligence, literacy, race, religion, sex, sexual orientation, social class, and weight, and use a concept projection approach to capture the valence subspace through contextualized word embeddings. The models exhibit the most biased attitudes against gender identity, social class, and sexual orientation signals, and the largest and better-performing model studied is also more biased. The method is validated on an intrinsic valence evaluation task, and it contributes to design justice by studying the associations of groups underrepresented in language, such as transgender and homosexual individuals.
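The projection idea in the abstract can be illustrated with a small sketch. It assumes, purely for illustration, that the valence subspace is one-dimensional and is estimated as the difference between the centroids of pleasant and unpleasant word embeddings; the abstract does not specify the construction, and the function names and toy vectors below are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def valence_direction(pleasant, unpleasant):
    """Unit vector pointing from the unpleasant centroid to the pleasant centroid.

    ASSUMPTION: a single direction stands in for the valence subspace.
    """
    p, u = mean_vector(pleasant), mean_vector(unpleasant)
    diff = [a - b for a, b in zip(p, u)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def valence_score(embedding, direction):
    """Signed projection of an embedding onto the valence direction."""
    return sum(e * d for e, d in zip(embedding, direction))

# Toy 2-D vectors; real contextualized embeddings would come from a language model.
pleasant = [[1.0, 0.0], [0.8, 0.2]]
unpleasant = [[-1.0, 0.0], [-0.8, -0.2]]
direction = valence_direction(pleasant, unpleasant)
```

Under this assumption, a positive score for a group term placed in the sentence template would indicate a pleasant association and a negative score an unpleasant one.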

TRAC: Trustworthy Retrieval Augmented Chatbot

  • paper_url: http://arxiv.org/abs/2307.04642
  • repo_url: None
  • paper_authors: Shuo Li, Sangdon Park, Insup Lee, Osbert Bastani
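  • for: Improving the correctness and reliability of retrieval-augmented question answering systems.
  • methods: Combines conformal prediction and global testing to provide statistical guarantees, and uses Bayesian optimization to choose the hyperparameters of the global test so as to maximize system performance.
  • results: Experiments on the Natural Questions dataset show that the method provides the desired coverage guarantee while minimizing the average prediction set size.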
    Abstract Although conversational AIs have demonstrated fantastic performance, they often generate incorrect information, or hallucinations. Retrieval augmented generation has emerged as a promising solution to reduce these hallucinations. However, these techniques still cannot guarantee correctness. Focusing on question answering, we propose a framework that can provide statistical guarantees for the retrieval augmented question answering system by combining conformal prediction and global testing. In addition, we use Bayesian optimization to choose hyperparameters of the global test to maximize the performance of the system. Our empirical results on the Natural Questions dataset demonstrate that our method can provide the desired coverage guarantee while minimizing the average prediction set size.
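To make the coverage guarantee concrete, here is a minimal sketch of split conformal prediction for a set of candidate answers. It covers only the conformal step, not the global testing or Bayesian optimization described in the abstract. The nonconformity score (one minus the model probability) and the calibration numbers are assumptions chosen for illustration.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold for miscoverage level alpha.

    cal_scores: nonconformity scores of the TRUE answers on a held-out
    calibration set (higher means a worse fit).
    """
    n = len(cal_scores)
    # Finite-sample-corrected quantile rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: keep every candidate
    return sorted(cal_scores)[rank - 1]

def prediction_set(candidate_probs, threshold):
    """Keep every candidate whose score 1 - p does not exceed the threshold."""
    return {ans for ans, p in candidate_probs.items() if 1.0 - p <= threshold}

# Hypothetical calibration data: probability the model gave to the true answer.
cal_true_probs = [0.9, 0.8, 0.75, 0.6, 0.95, 0.4, 0.85, 0.7, 0.65]
cal_scores = [1.0 - p for p in cal_true_probs]
tau = conformal_threshold(cal_scores, alpha=0.2)

candidates = {"Paris": 0.7, "Lyon": 0.2, "Nice": 0.1}
answers = prediction_set(candidates, tau)
```

With exchangeable calibration and test data, sets built this way contain the true answer with probability at least 1 - alpha. A smaller average set size then reflects a more informative system.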

Federated Learning over a Wireless Network: Distributed User Selection through Random Access

  • paper_url: http://arxiv.org/abs/2307.03758
  • repo_url: None
  • paper_authors: Chen Sun, Shiyao Ma, Ce Zheng, Songtao Wu, Tao Cui, Lingjuan Lyu
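  • for: Reducing the communication cost of federated learning (FL) over wireless networks.
  • methods: A network-intrinsic, distributed user selection method that exploits the radio resource competition mechanism of random access.
  • results: Simulations show that the method rapidly reaches convergence similar to that of centralized user selection.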
    Abstract User selection has become crucial for decreasing the communication costs of federated learning (FL) over wireless networks. However, centralized user selection causes additional system complexity. This study proposes a network intrinsic approach of distributed user selection that leverages the radio resource competition mechanism in random access. Taking the carrier sensing multiple access (CSMA) mechanism as an example of random access, we manipulate the contention window (CW) size to prioritize certain users for obtaining radio resources in each round of training. Training data bias is used as a target scenario for FL with user selection. Prioritization is based on the distance between the newly trained local model and the global model of the previous round. To avoid excessive contribution by certain users, a counting mechanism is used to ensure fairness. Simulations with various datasets demonstrate that this method can rapidly achieve convergence similar to that of the centralized user selection approach.
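    Summary User selection has become crucial for reducing the communication cost of federated learning (FL) over wireless networks, but centralized user selection adds system complexity. This study proposes a network-intrinsic, distributed user selection method that exploits the radio resource competition mechanism of random access. Taking carrier sensing multiple access (CSMA) as an example, the contention window (CW) size is manipulated in each training round to prioritize certain users for radio resources. Training data bias is the target scenario, and prioritization is based on the distance between the newly trained local model and the global model of the previous round. A counting mechanism ensures fairness by preventing excessive contribution from certain users. Simulations on several datasets show that the method rapidly reaches convergence similar to that of centralized user selection.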
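The contention-window idea can be sketched as follows. The mapping from model distance to window size, the direction of the priority (larger distance gives a smaller window), the fairness cap, and all parameter values are assumptions made for illustration; the abstract states only that prioritization depends on the distance and that a counting mechanism ensures fairness.

```python
import math
import random

def l2_distance(local_w, global_w):
    """Euclidean distance between a local model and the previous global model."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(local_w, global_w)))

def contention_window(distance, wins, cw_min=8, cw_max=256, max_dist=1.0, win_cap=3):
    """Map model distance to a contention window size (hypothetical rule).

    ASSUMPTION: a larger distance earns a smaller window (higher priority).
    Users that already won `win_cap` rounds fall back to cw_max for fairness.
    """
    if wins >= win_cap:
        return cw_max
    frac = min(distance / max_dist, 1.0)
    return int(round(cw_max - frac * (cw_max - cw_min)))

def pick_user(windows, rng):
    """Each user draws a backoff in [0, CW - 1]; the smallest backoff wins.

    Simplification: ties are broken by user id instead of modelling collisions.
    """
    backoffs = {u: rng.randrange(cw) for u, cw in windows.items()}
    return min(backoffs, key=lambda u: (backoffs[u], u))

rng = random.Random(0)
global_w = [0.0, 0.0]
local = {"a": [0.9, 0.0], "b": [0.1, 0.0], "c": [0.5, 0.0]}
wins = {"a": 0, "b": 0, "c": 0}
windows = {u: contention_window(l2_distance(w, global_w), wins[u])
           for u, w in local.items()}
winner = pick_user(windows, rng)
```

Because every user derives its own window from locally available quantities, no central scheduler is needed; selection emerges from the random access contention itself.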

Assisting Clinical Decisions for Scarcely Available Treatment via Disentangled Latent Representation

  • paper_url: http://arxiv.org/abs/2307.03315
  • repo_url: None
  • paper_authors: Bing Xue, Ahmed Sameh Said, Ziqi Xu, Hanyang Liu, Neel Shah, Hanqing Yang, Philip Payne, Chenyang Lu
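  • for: Supporting clinical decisions by predicting whether a patient needs ECMO treatment and the potential responses with and without treatment.
  • methods: Proposes the Treatment Variational AutoEncoder (TVAE) for individualized treatment analysis. TVAE models a patient's potential treatment assignment and the factual and counterfactual outcomes with a deep latent variable model. Reconstruction regularization and semi-supervision reduce prediction errors, while a disentangled, distribution-matched latent space and a label-balancing generative strategy mitigate selection bias and the scarcity of treatment cases.
  • results: On heterogeneous COVID-19 datasets, TVAE outperforms state-of-the-art treatment effect models in predicting both propensity scores and factual outcomes.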
    Abstract Extracorporeal membrane oxygenation (ECMO) is an essential life-supporting modality for COVID-19 patients who are refractory to conventional therapies. However, the proper treatment decision has been the subject of significant debate and it remains controversial who benefits from this scarcely available and technically complex treatment option. To support clinical decisions, there is a critical need to predict the treatment need and the potential treatment and no-treatment responses. Targeting this clinical challenge, we propose Treatment Variational AutoEncoder (TVAE), a novel approach for individualized treatment analysis. TVAE is specifically designed to address the modeling challenges of treatments like ECMO, which have strong treatment selection bias and scarce treatment cases. TVAE conceptualizes the treatment decision as a multi-scale problem. We model a patient's potential treatment assignment and the factual and counterfactual outcomes as part of their intrinsic characteristics that can be represented by a deep latent variable model. The factual and counterfactual prediction errors are alleviated via a reconstruction regularization scheme together with semi-supervision, and the selection bias and the scarcity of treatment cases are mitigated by the disentangled and distribution-matched latent space and the label-balancing generative strategy. We evaluate TVAE on two real-world COVID-19 datasets: an international dataset collected from 1651 hospitals across 63 countries, and an institutional dataset collected from 15 hospitals. The results show that TVAE outperforms state-of-the-art treatment effect models in predicting both the propensity scores and factual outcomes on heterogeneous COVID-19 datasets. Additional experiments also show TVAE outperforms the best existing models in individual treatment effect estimation on the synthesized IHDP benchmark dataset.
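    Summary Extracorporeal membrane oxygenation (ECMO) is a key life-support modality for COVID-19 patients who do not respond to conventional therapies, but the right treatment decision remains contested, and it is unclear which patients benefit from this scarce and technically complex option. To support clinical decisions, the treatment need and the potential treatment and no-treatment responses must be predicted. The authors propose the Treatment Variational AutoEncoder (TVAE), an approach to individualized treatment analysis designed for settings like ECMO with strong selection bias and few treatment cases. TVAE treats the treatment decision as a multi-scale problem and models a patient's potential treatment assignment and factual and counterfactual outcomes as intrinsic characteristics represented by a deep latent variable model. Prediction errors are reduced through reconstruction regularization and semi-supervision, and selection bias and case scarcity are mitigated by a disentangled, distribution-matched latent space and a label-balancing generative strategy. TVAE is evaluated on two real-world COVID-19 datasets, an international one from 1651 hospitals across 63 countries and an institutional one from 15 hospitals, where it outperforms state-of-the-art treatment effect models in predicting propensity scores and factual outcomes. Further experiments show that it also outperforms the best existing models in individual treatment effect estimation on the synthesized IHDP benchmark.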

On Invariance, Equivariance, Correlation and Convolution of Spherical Harmonic Representations for Scalar and Vectorial Data

  • paper_url: http://arxiv.org/abs/2307.03311
  • repo_url: None
  • paper_authors: Janis Keuper
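  • for: This technical report covers the mathematical formulation of Spherical Harmonic (SH) representations for machine learning, in particular rotation-invariant and equivariant features and convolutions.
  • methods: Presents the theoretical foundation and practical implementation of SH representations, and generalizes scalar SH representations to Vectorial Harmonics (VH) for vector fields on spheres.
  • results: Summarizes work on rotation-invariant and equivariant features, as well as convolutions and exact correlations of signals on spheres, and extends these methods to 3D vector fields on spheres.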
    Abstract The mathematical representation of data in the Spherical Harmonic (SH) domain has recently regained increasing interest in the machine learning community. This technical report gives an in-depth introduction to the theoretical foundation and practical implementation of SH representations, summarizing works on rotation invariant and equivariant features, as well as convolutions and exact correlations of signals on spheres. In extension, these methods are then generalized from scalar SH representations to Vectorial Harmonics (VH), providing the same capabilities for 3D vector fields on spheres.
    Summary Mathematical representations of data in the Spherical Harmonic (SH) domain have recently attracted growing interest in the machine learning community. This technical report gives an in-depth introduction to the theoretical foundation and practical implementation of SH representations, covering work on rotation-invariant and equivariant features as well as convolutions and exact correlations of signals on spheres. The methods are then generalized from scalar SH representations to Vectorial Harmonics (VH), which provide the same capabilities for 3D vector fields on spheres.
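As a worked illustration of a rotation-invariant SH feature, consider the per-band energy. This is a standard identity rather than a result taken from the report; the notation ($\hat{f}_{lm}$ for SH coefficients, $D^{l}(R)$ for the Wigner D-matrix of a rotation $R$) is assumed here and may differ from the report's own conventions.

```latex
% SH expansion of a scalar signal on the sphere
f(\theta,\phi) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} \hat{f}_{lm}\, Y_l^m(\theta,\phi)

% A rotation R mixes coefficients only within each band l
\widehat{(Rf)}_{lm} = \sum_{m'=-l}^{l} D^{l}_{mm'}(R)\, \hat{f}_{lm'}

% Since D^l(R) is unitary, the band energy is rotation invariant
\sum_{m=-l}^{l} \bigl|\widehat{(Rf)}_{lm}\bigr|^{2} = \sum_{m=-l}^{l} \bigl|\hat{f}_{lm}\bigr|^{2}
```

The same coefficient transformation rule is what makes features built directly from $\hat{f}_{lm}$ equivariant: rotating the input applies a known linear map to the representation.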

S2vNTM: Semi-supervised vMF Neural Topic Modeling

  • paper_url: http://arxiv.org/abs/2307.04804
  • repo_url: None
  • paper_authors: Weijie Xu, Jay Desai, Srinivasan Sengamedu, Xiaoyu Jiang, Francis Iannacci
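  • for: Improving the efficiency and accuracy of text classification while allowing a small number of keywords as input.
  • methods: Proposes Semi-Supervised vMF Neural Topic Modeling (S2vNTM), which takes seed keywords as input for topics and leverages the pattern of keywords to identify potential topics and optimize the quality of the topics' keyword sets.
  • results: Across multiple datasets, S2vNTM achieves higher classification accuracy than existing semi-supervised topic modeling methods and is at least twice as fast as the baselines.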
    Abstract Language model based methods are powerful techniques for text classification. However, the models have several shortcomings. (1) It is difficult to integrate human knowledge such as keywords. (2) It needs a lot of resources to train the models. (3) It relies on large text data to pretrain. In this paper, we propose Semi-Supervised vMF Neural Topic Modeling (S2vNTM) to overcome these difficulties. S2vNTM takes a few seed keywords as input for topics. S2vNTM leverages the pattern of keywords to identify potential topics, as well as optimize the quality of topics' keywords sets. Across a variety of datasets, S2vNTM outperforms existing semi-supervised topic modeling methods in classification accuracy with limited keywords provided. S2vNTM is at least twice as fast as baselines.
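    Summary Language-model-based methods are powerful for text classification but have several shortcomings: (1) human knowledge such as keywords is hard to integrate, (2) training requires substantial resources, and (3) pretraining depends on large amounts of text. The authors propose Semi-Supervised vMF Neural Topic Modeling (S2vNTM) to address these difficulties. S2vNTM takes a few seed keywords as input for topics and uses keyword patterns to identify potential topics and to optimize the topics' keyword sets. Across several datasets, S2vNTM outperforms existing semi-supervised topic modeling methods in classification accuracy with only limited keywords, and it is at least twice as fast as the baselines.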

A Vulnerability of Attribution Methods Using Pre-Softmax Scores

  • paper_url: http://arxiv.org/abs/2307.03305
  • repo_url: https://github.com/mlerma54/adversarial-attacks-on-saliency-maps
  • paper_authors: Miguel Lerma, Mirtha Lucas
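  • for: This work examines a vulnerability in a category of attribution methods used to explain the outputs of convolutional neural network classifiers.
  • methods: Small modifications are made to the model that affect the output of the attribution method without changing the output of the model.
  • results: Such modifications can considerably alter the attributions while leaving the model outputs unchanged.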
    Abstract We discuss a vulnerability involving a category of attribution methods used to provide explanations for the outputs of convolutional neural networks working as classifiers. It is known that networks of this type are vulnerable to adversarial attacks, in which imperceptible perturbations of the input may alter the outputs of the model. In contrast, here we focus on effects that small modifications in the model may cause on the attribution method without altering the model outputs.
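    Summary The authors discuss a vulnerability in a category of attribution methods used to explain the outputs of convolutional neural networks acting as classifiers. Such networks are known to be vulnerable to adversarial attacks, in which imperceptible perturbations of the input change the model's outputs. Here the focus is instead on the effects that small modifications to the model can have on the attribution method without changing the model's outputs.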
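One property that makes attribution on pre-softmax scores fragile is that softmax ignores any value added equally to every score. The sketch below demonstrates only this invariance. Whether it matches the authors' exact construction is an assumption, and the numbers are arbitrary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of pre-softmax scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shifted_logits(logits, shift):
    """Add the same value to every pre-softmax score."""
    return [z + shift for z in logits]

logits = [2.0, 0.5, -1.0]
shift = 3.0  # hypothetical; in a modified model this could depend on the input
before = softmax(logits)
after = softmax(shifted_logits(logits, shift))
```

The class probabilities in `before` and `after` agree, yet each pre-softmax score changed by `shift`. An attribution method that differentiates a pre-softmax score with respect to the input would also pick up the gradient of an input-dependent shift, even though the classifier's output is the same.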

It is not Sexually Suggestive, It is Educative. Separating Sex Education from Suggestive Content on TikTok Videos

  • paper_url: http://arxiv.org/abs/2307.03274
  • repo_url: None
  • paper_authors: Enfa George, Mihai Surdeanu
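  • for: Creating a multi-modal dataset for distinguishing sexually suggestive content from virtual sex education videos on TikTok.
  • methods: The dataset provides TikTok video URLs and audio transcriptions, and two transformer-based models are used to classify the videos.
  • results: Preliminary results indicate that the task is learnable but challenging, which suggests that the dataset is meaningful and invites further study.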
    Abstract We introduce SexTok, a multi-modal dataset composed of TikTok videos labeled as sexually suggestive (from the annotator's point of view), sex-educational content, or neither. Such a dataset is necessary to address the challenge of distinguishing between sexually suggestive content and virtual sex education videos on TikTok. Children's exposure to sexually suggestive videos has been shown to have adversarial effects on their development. Meanwhile, virtual sex education, especially on subjects that are more relevant to the LGBTQIA+ community, is very valuable. The platform's current system removes or penalizes some of both types of videos, even though they serve different purposes. Our dataset contains video URLs, and it is also audio transcribed. To validate its importance, we explore two transformer-based models for classifying the videos. Our preliminary results suggest that the task of distinguishing between these types of videos is learnable but challenging. These experiments suggest that this dataset is meaningful and invites further study on the subject.
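    Summary The authors introduce SexTok, a multi-modal dataset of TikTok videos labeled as sexually suggestive (from the annotator's point of view), sex-educational content, or neither. Such a dataset is needed to address the challenge of separating sexually suggestive content from virtual sex education videos on TikTok. Children's exposure to sexually suggestive videos harms their development, whereas virtual sex education, especially on subjects more relevant to the LGBTQIA+ community, is very valuable. The platform's current system removes or penalizes some videos of both types even though they serve different purposes. The dataset contains video URLs and audio transcriptions. To validate its importance, the authors explore two transformer-based models for classifying the videos; preliminary results suggest the task is learnable but challenging, indicating that the dataset is meaningful and invites further study.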

Vision Language Transformers: A Survey

  • paper_url: http://arxiv.org/abs/2307.03254
  • repo_url: None
  • paper_authors: Clayton Fields, Casey Kennington
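  • for: This paper surveys the development and applications of vision language transformer models.
  • methods: The surveyed models adapt the pretrained transformer architecture and transfer it to new tasks that require both vision and language.
  • results: The paper provides a broad synthesis of research on vision language transformers and an analysis of their strengths, limitations, and open questions.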
    Abstract Vision language tasks, such as answering questions about or generating captions that describe an image, are difficult tasks for computers to perform. A relatively recent body of research has adapted the pretrained transformer architecture introduced in Vaswani et al. (2017) to vision language modeling. Transformer models have greatly improved performance and versatility over previous vision language models. They do so by pretraining models on large generic datasets and transferring their learning to new tasks with minor changes in architecture and parameter values. This type of transfer learning has become the standard modeling practice in both natural language processing and computer vision. Vision language transformers offer the promise of producing similar advancements in tasks which require both vision and language. In this paper, we provide a broad synthesis of the currently available research on vision language transformer models and offer some analysis of their strengths, limitations and some open questions that remain.
    Summary Vision language tasks, such as answering questions about an image or generating captions that describe it, are difficult for computers. Recent research has adapted the pretrained transformer architecture to vision language modeling, greatly improving performance and versatility over earlier models by pretraining on large generic datasets and transferring to new tasks with minor changes in architecture and parameter values. This kind of transfer learning is now standard practice in both natural language processing and computer vision, and vision language transformers promise similar advances in tasks that require both modalities. The paper provides a broad synthesis of the available research on vision language transformer models and analyzes their strengths, limitations, and remaining open questions.

Learned Kernels for Interpretable and Efficient PPG Signal Quality Assessment and Artifact Segmentation

  • paper_url: http://arxiv.org/abs/2307.05385
  • repo_url: None
  • paper_authors: Sully F. Chen, Zhicheng Guo, Cheng Ding, Xiao Hu, Cynthia Rudin
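  • for: Proposing a robust, efficient, and interpretable method for photoplethysmography (PPG) signal quality assessment and artifact segmentation.
  • methods: Learns a small set of interpretable convolutional kernels and compares them with the state-of-the-art deep neural network (DNN) approach.
  • results: The learned kernels perform similarly to, and often better than, the DNN approach with several orders of magnitude fewer parameters, which makes them suitable for low-power devices.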
    Abstract Photoplethysmography (PPG) provides a low-cost, non-invasive method to continuously monitor various cardiovascular parameters. PPG signals are generated by wearable devices and frequently contain large artifacts caused by external factors, such as motion of the human subject. In order to ensure robust and accurate extraction of physiological parameters, corrupted areas of the signal need to be identified and handled appropriately. Previous methodology relied either on handcrafted feature detectors or signal metrics which yield sub-optimal performance, or relied on machine learning techniques such as deep neural networks (DNN) which lack interpretability and are computationally and memory intensive. In this work, we present a novel method to learn a small set of interpretable convolutional kernels that has performance similar to -- and often better than -- the state-of-the-art DNN approach with several orders of magnitude fewer parameters. This work allows for efficient, robust, and interpretable signal quality assessment and artifact segmentation on low-power devices.
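A single small kernel followed by a threshold already shows how a convolution can flag corrupted samples. In the sketch below the kernel is hand-set (a second-difference filter that responds to abrupt spikes), whereas the paper learns its kernels. The bias, threshold, and toy signal are likewise assumptions for illustration.

```python
import math

def conv1d_same(signal, kernel):
    """Cross-correlate signal with kernel, zero-padded so lengths match."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):
                acc += k * signal[idx]
        out.append(acc)
    return out

def segment_artifacts(signal, kernel, bias=0.0, threshold=0.5):
    """Return a 0/1 mask: 1 where sigmoid(|response| + bias) exceeds threshold."""
    mask = []
    for r in conv1d_same(signal, kernel):
        prob = 1.0 / (1.0 + math.exp(-(abs(r) + bias)))
        mask.append(1 if prob > threshold else 0)
    return mask

kernel = [-1.0, 2.0, -1.0]   # hand-set stand-in for a learned kernel
signal = [0.0, 0.1, 0.2, 0.3, 3.0, 0.5, 0.6, 0.7]  # slow ramp with one spike
mask = segment_artifacts(signal, kernel, bias=-1.0)
```

The mask marks the spike and its two neighbours. Because the whole model is three weights and a bias, each decision can be traced back to how strongly a window of the signal matches the kernel shape.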

Push Past Green: Learning to Look Behind Plant Foliage by Moving It

  • paper_url: http://arxiv.org/abs/2307.03175
  • repo_url: None
  • paper_authors: Xiaoyu Zhang, Saurabh Gupta
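  • for: Proposing data-driven methods for manipulating plant foliage to look behind leaves and branches in autonomous agriculture applications (e.g., inspection, phenotyping, plucking fruits).
  • methods: Uses self-supervision to train SRPNet, a neural network that predicts what space is revealed when a candidate action is executed on a given plant, and combines it with the cross-entropy method to choose actions.
  • results: Experiments on synthetic vines and a real Dracaena plant across 5 settings show that the overall method, PPG, outperforms a competitive hand-crafted exploration method, and that SRPNet outperforms a hand-crafted dynamics model and relevant ablations.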
    Abstract Autonomous agriculture applications (e.g., inspection, phenotyping, plucking fruits) require manipulating the plant foliage to look behind the leaves and the branches. Partial visibility, extreme clutter, thin structures, and unknown geometry and dynamics for plants make such manipulation challenging. We tackle these challenges through data-driven methods. We use self-supervision to train SRPNet, a neural network that predicts what space is revealed on execution of a candidate action on a given plant. We use SRPNet with the cross-entropy method to predict actions that are effective at revealing space beneath plant foliage. Furthermore, as SRPNet does not just predict how much space is revealed but also where it is revealed, we can execute a sequence of actions that incrementally reveal more and more space beneath the plant foliage. We experiment with a synthetic (vines) and a real plant (Dracaena) on a physical test-bed across 5 settings including 2 settings that test generalization to novel plant configurations. Our experiments reveal the effectiveness of our overall method, PPG, over a competitive hand-crafted exploration method, and the effectiveness of SRPNet over a hand-crafted dynamics model and relevant ablations.
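    Summary Autonomous agriculture applications (e.g., inspection, phenotyping, plucking fruits) require manipulating plant foliage to look behind leaves and branches. Partial visibility, extreme clutter, thin structures, and the unknown geometry and dynamics of plants make this manipulation difficult. The authors address these challenges with data-driven methods. They use self-supervision to train SRPNet, a neural network that predicts what space is revealed when a candidate action is executed on a given plant, and use SRPNet with the cross-entropy method to find actions that effectively reveal space beneath the foliage. Because SRPNet predicts not only how much space is revealed but also where, a sequence of actions can incrementally reveal more and more space. Experiments with synthetic vines and a real Dracaena plant on a physical test-bed, across 5 settings including 2 that test generalization to novel plant configurations, show that the overall method, PPG, outperforms a competitive hand-crafted exploration method, and that SRPNet outperforms a hand-crafted dynamics model and relevant ablations.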
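The cross-entropy method mentioned in the abstract can be sketched generically: sample candidate actions from a Gaussian, score them, keep the best few, and refit the Gaussian to those. In the sketch below, `toy_revealed_space` is a hypothetical stand-in for SRPNet's prediction, and the action parameterization and all hyperparameters are assumptions.

```python
import random
import statistics

def cross_entropy_method(score_fn, dim, iters=20, pop=64, n_elite=8, seed=0):
    """Maximize score_fn over R^dim with a diagonal-Gaussian sampling distribution."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    for _ in range(iters):
        samples = [[rng.gauss(mean[d], std[d]) for d in range(dim)]
                   for _ in range(pop)]
        # Keep the highest-scoring candidates and refit the Gaussian to them.
        elites = sorted(samples, key=score_fn, reverse=True)[:n_elite]
        for d in range(dim):
            column = [e[d] for e in elites]
            mean[d] = statistics.fmean(column)
            std[d] = statistics.pstdev(column) + 1e-6
    return mean

def toy_revealed_space(action):
    """Hypothetical predictor: revealed space peaks at one particular action."""
    target = (0.6, -0.3)
    return -sum((a - t) ** 2 for a, t in zip(action, target))

best = cross_entropy_method(toy_revealed_space, dim=2)
```

The optimizer needs only scores, not gradients, so it can sit on top of any learned predictor. That fits a setting where the predictor outputs where space is revealed and a scalar score is derived from it.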

LEO: Learning Efficient Orderings for Multiobjective Binary Decision Diagrams

  • paper_url: http://arxiv.org/abs/2307.03171
  • repo_url: https://github.com/khalil-research/leo
  • paper_authors: Rahul Patel, Elias B. Khalil
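  • for: Improving the scalability of binary decision diagram (BDD) approaches to multiobjective integer programming by finding better variable orderings.
  • methods: Derives a parameter configuration space based on variable scoring functions that are linear in a small set of interpretable variable features, explores it with black-box optimization, and proposes LEO, a supervised learning approach for finding efficient variable orderings.
  • results: On knapsack benchmark sets with 3-7 objectives and up to 80 variables, LEO is about 30-300% and 10-200% faster at Pareto frontier (PF) enumeration than common ordering strategies and algorithm configuration.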
    Abstract Approaches based on Binary decision diagrams (BDDs) have recently achieved state-of-the-art results for multiobjective integer programming problems. The variable ordering used in constructing BDDs can have a significant impact on their size and on the quality of bounds derived from relaxed or restricted BDDs for single-objective optimization problems. We first showcase a similar impact of variable ordering on the Pareto frontier (PF) enumeration time for the multiobjective knapsack problem, suggesting the need for deriving variable ordering methods that improve the scalability of the multiobjective BDD approach. To that end, we derive a novel parameter configuration space based on variable scoring functions which are linear in a small set of interpretable and easy-to-compute variable features. We show how the configuration space can be efficiently explored using black-box optimization, circumventing the curse of dimensionality (in the number of variables and objectives), and finding good orderings that reduce the PF enumeration time. However, black-box optimization approaches incur a computational overhead that outweighs the reduction in time due to good variable ordering. To alleviate this issue, we propose LEO, a supervised learning approach for finding efficient variable orderings that reduce the enumeration time. Experiments on benchmark sets from the knapsack problem with 3-7 objectives and up to 80 variables show that LEO is ~30-300% and ~10-200% faster at PF enumeration than common ordering strategies and algorithm configuration. Our code and instances are available at https://github.com/khalil-research/leo.
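    Summary Approaches based on binary decision diagrams (BDDs) have recently achieved state-of-the-art results on multiobjective integer programming problems. The variable ordering used to construct a BDD strongly affects its size and the quality of the bounds derived from relaxed or restricted BDDs. The authors first show a similar effect of variable ordering on Pareto frontier (PF) enumeration time for the multiobjective knapsack problem, which motivates ordering methods that improve the scalability of the multiobjective BDD approach. They derive a parameter configuration space based on variable scoring functions that are linear in a small set of easy-to-compute variable features, and show that black-box optimization can explore it efficiently and find good orderings that reduce PF enumeration time. Because the overhead of black-box optimization outweighs the time saved, they propose LEO, a supervised learning approach for finding efficient variable orderings. On knapsack benchmark sets, LEO is about 30-300% and 10-200% faster at PF enumeration than common ordering strategies and algorithm configuration. Code and instances are available at https://github.com/khalil-research/leo.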
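A linear scoring function over variable features can be sketched for knapsack items as follows. The three features (weight, mean value across objectives, and mean value per unit weight) and the coefficient vectors are hypothetical choices for illustration; the abstract does not list the features the authors use.

```python
def variable_features(weight, values):
    """Features of one knapsack item (ASSUMED set; the paper's may differ)."""
    mean_value = sum(values) / len(values)
    return [weight, mean_value, mean_value / weight]

def order_variables(weights, values_per_item, coeffs):
    """Sort variable indices by a linear score of their features, highest first."""
    def score(i):
        feats = variable_features(weights[i], values_per_item[i])
        return sum(c * f for c, f in zip(coeffs, feats))
    return sorted(range(len(weights)), key=score, reverse=True)

weights = [4, 2, 5]
values = [[8, 4], [6, 6], [5, 5]]        # two objectives per item
by_ratio = order_variables(weights, values, [0.0, 0.0, 1.0])
by_weight = order_variables(weights, values, [1.0, 0.0, 0.0])
```

Each coefficient vector defines one ordering, so searching over orderings reduces to searching over a handful of real numbers. That small space is what black-box optimization explores and what a supervised model could learn to predict per instance.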

Focused Transformer: Contrastive Training for Context Scaling

  • paper_url: http://arxiv.org/abs/2307.03170
  • repo_url: https://github.com/cstankonrad/long_llama
  • paper_authors: Szymon Tworkowski, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu, Henryk Michalewski, Piotr Miłoś
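  • for: Improving the performance of large language models on long contexts.
  • methods: Gives an attention layer access to an external memory of (key, value) pairs, and uses a training process inspired by contrastive learning to improve the structure of the (key, value) space.
  • results: The Focused Transformer (FoT) extends the effective context length and can be used to fine-tune existing large-scale models. The resulting LongLLaMA models improve on long-context tasks and handle a $256 k$ context length in passkey retrieval.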
    Abstract Large language models have an exceptional capability to incorporate new information in a contextual manner. However, the full potential of such an approach is often restrained due to a limitation in the effective context length. One solution to this issue is to endow an attention layer with access to an external memory, which comprises of (key, value) pairs. Yet, as the number of documents increases, the proportion of relevant keys to irrelevant ones decreases, leading the model to focus more on the irrelevant keys. We identify a significant challenge, dubbed the distraction issue, where keys linked to different semantic values might overlap, making them hard to distinguish. To tackle this problem, we introduce the Focused Transformer (FoT), a technique that employs a training process inspired by contrastive learning. This novel approach enhances the structure of the (key, value) space, enabling an extension of the context length. Our method allows for fine-tuning pre-existing, large-scale models to lengthen their effective context. This is demonstrated by our fine-tuning of $3B$ and $7B$ OpenLLaMA checkpoints. The resulting models, which we name LongLLaMA, exhibit advancements in tasks requiring a long context. We further illustrate that our LongLLaMA models adeptly manage a $256 k$ context length for passkey retrieval.
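    Summary Large language models are exceptionally good at incorporating new information in context, but the potential of this approach is limited by the effective context length. One remedy is to give an attention layer access to an external memory of (key, value) pairs. As the number of documents grows, however, the proportion of relevant keys falls and the model attends more to irrelevant ones. The authors call this the distraction issue: keys linked to different semantic values may overlap and become hard to distinguish. To address it, they introduce the Focused Transformer (FoT), which uses a training process inspired by contrastive learning to improve the structure of the (key, value) space and thereby extend the context length. The method can fine-tune existing large-scale models, as shown with the $3B$ and $7B$ OpenLLaMA checkpoints. The resulting LongLLaMA models improve on tasks that require long context and manage a $256 k$ context length for passkey retrieval.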
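The memory-augmented attention layer described in the abstract can be sketched for a single query. The sketch covers only inference-time lookup: it retrieves the top-k memory pairs by inner product and attends over them together with the local pairs. It does not implement FoT's contrastive training, and the retrieval rule, `top_k`, and the toy vectors are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def memory_attention(query, local_kv, memory_kv, top_k=2):
    """Attend over local (key, value) pairs plus the top_k memory pairs."""
    # Retrieve the memory entries whose keys best match the query.
    retrieved = sorted(memory_kv, key=lambda kv: dot(query, kv[0]),
                       reverse=True)[:top_k]
    pairs = list(local_kv) + retrieved
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k, _ in pairs])
    dim = len(pairs[0][1])
    return [sum(w * v[d] for w, (_, v) in zip(weights, pairs))
            for d in range(dim)]

query = [1.0, 0.0]
local_kv = [([0.0, 1.0], [0.0, 0.0])]
memory_kv = [([5.0, 0.0], [1.0, 0.0]),    # relevant key
             ([-5.0, 0.0], [0.0, 1.0]),   # irrelevant key
             ([0.0, 0.0], [0.5, 0.5])]
out = memory_attention(query, local_kv, memory_kv, top_k=1)
```

In this toy case the relevant key is well separated, so retrieval is easy. The distraction issue arises when irrelevant keys score nearly as high as relevant ones, which is the situation the contrastive-style training is meant to improve.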

BrickPal: Augmented Reality-based Assembly Instructions for Brick Models

  • paper_url: http://arxiv.org/abs/2307.03162
  • repo_url: None
  • paper_authors: Yao Shi, Xiaofeng Zhang, Ran zhang, Zhou Yang, Xiao Tang, Hongni Ye, Yi Wu
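  • for: Helping users assemble Lego-like brick models, addressing the manual fine-tuning needed to produce conventional instructions and the limited expressiveness of paper-based instructions.
  • methods: Uses natural language processing (NLP) techniques to generate plausible assembly sequences and provides real-time guidance in an augmented reality head-mounted display.
  • results: A user study shows that BrickPal assists users more effectively than traditional assembly methods, and the NLP-generated assembly sequences achieve the same usability as manually adapted sequences.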
    Abstract The assembly instruction is a mandatory component of Lego-like brick sets. The conventional production of assembly instructions requires a considerable amount of manual fine-tuning, which is intractable for casual users and customized brick sets. Moreover, the traditional paper-based instructions lack expressiveness and interactivity. To tackle the two problems above, we present BrickPal, an augmented reality-based system, which visualizes assembly instructions in an augmented reality head-mounted display. It utilizes Natural Language Processing (NLP) techniques to generate plausible assembly sequences, and provides real-time guidance in the AR headset. Our user study demonstrates BrickPal's effectiveness at assisting users in brick assembly compared to traditional assembly methods. Additionally, the NLP algorithm-generated assembly sequences achieve the same usability as manually adapted sequences.
    摘要 assembly instruction是乐高类积木sets中必备的一部分。传统生产assembly instruction需要较多的手动精度调整,这对普通用户和自定义积木sets来说是不可接受的。此外,传统的纸面指令缺乏表达力和互动性。为解决这两个问题,我们提出了BrickPal,一种基于扩展现实技术的系统,可以在扩展现实头戴display中可见化 assembly instruction。它利用自然语言处理(NLP)技术生成可能的积木组合序列,并在AR头戴display中提供实时指导。我们的用户研究表明,BrickPal可以较传统Assembly方法更好地帮助用户组装积木。此外,由NLP算法生成的积木组合序列与手动修改后的序列之间没有差异。

Distilling Large Vision-Language Model with Out-of-Distribution Generalizability

  • paper_url: http://arxiv.org/abs/2307.03135
  • repo_url: https://github.com/xuanlinli17/large_vlm_distillation_ood
  • paper_authors: Xuanlin Li, Yunhao Fang, Minghua Liu, Zhan Ling, Zhuowen Tu, Hao Su
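  • for: Distilling large vision-language models into lightweight, fast models that can be deployed under limited resources and time constraints.
  • methods: Distills the visual representations of a teacher vision-language model into a student model, guided by two principles for improving the student's open-vocabulary out-of-distribution (OOD) generalization: (1) better imitating the teacher's visual representation space while carefully promoting coherence in vision-language alignment; (2) enriching the teacher's language representations with informative, fine-grained semantic attributes that better distinguish between labels.
  • results: Significant improvements in zero-shot and few-shot student performance on open-vocabulary OOD classification, supporting the effectiveness of the proposed approaches.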
    Abstract Large vision-language models have achieved outstanding performance, but their size and computational requirements make their deployment on resource-constrained devices and time-sensitive tasks impractical. Model distillation, the process of creating smaller, faster models that maintain the performance of larger models, is a promising direction towards the solution. This paper investigates the distillation of visual representations in large teacher vision-language models into lightweight student models using a small- or mid-scale dataset. Notably, this study focuses on open-vocabulary out-of-distribution (OOD) generalization, a challenging problem that has been overlooked in previous model distillation literature. We propose two principles from vision and language modality perspectives to enhance student's OOD generalization: (1) by better imitating teacher's visual representation space, and carefully promoting better coherence in vision-language alignment with the teacher; (2) by enriching the teacher's language representations with informative and finegrained semantic attributes to effectively distinguish between different labels. We propose several metrics and conduct extensive experiments to investigate their techniques. The results demonstrate significant improvements in zero-shot and few-shot student performance on open-vocabulary out-of-distribution classification, highlighting the effectiveness of our proposed approaches. Code released at https://github.com/xuanlinli17/large_vlm_distillation_ood
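    Summary Large vision-language models perform very well, but their size and computational requirements make them impractical on resource-constrained devices and in time-sensitive tasks. Model distillation, which creates smaller and faster models that retain the performance of larger ones, is a promising direction. This paper studies distilling the visual representations of a large teacher vision-language model into a lightweight student model using a small- or mid-scale dataset. It focuses on open-vocabulary out-of-distribution (OOD) generalization, which earlier distillation work has largely overlooked. The authors propose two principles: better imitating the teacher's visual representation space while carefully promoting coherence in vision-language alignment, and enriching the teacher's language representations with fine-grained semantic attributes. They also propose several metrics and run extensive experiments to evaluate these techniques. The results show improved zero-shot and few-shot OOD generalization for the student model. The code is available at https://github.com/xuanlinli17/large_vlm_distillation_ood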

Frontier AI Regulation: Managing Emerging Risks to Public Safety

  • paper_url: http://arxiv.org/abs/2307.03718
  • repo_url: None
  • paper_authors: Markus Anderljung, Joslyn Barnhart, Anton Korinek, Jade Leung, Cullen O’Keefe, Jess Whittlestone, Shahar Avin, Miles Brundage, Justin Bullock, Duncan Cass-Beggs, Ben Chang, Tantum Collins, Tim Fist, Gillian Hadfield, Alan Hayes, Lewis Ho, Sara Hooker, Eric Horvitz, Noam Kolt, Jonas Schuett, Yonadav Shavit, Divya Siddarth, Robert Trager, Kevin Wolf
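  • for: The paper focuses on so-called "frontier AI" models: highly capable foundation models that could possess dangerous capabilities and pose severe risks to public safety. Regulating them raises new challenges: dangerous capabilities can arise unexpectedly, deployed models are hard to protect from misuse, and model capabilities are hard to stop from proliferating.
  • methods: The authors propose three building blocks for regulating the development and deployment of frontier AI models: (1) standard-setting processes for frontier AI developers, (2) registration and reporting requirements that give regulators visibility into frontier AI development, and (3) mechanisms to ensure that development and deployment comply with safety standards.
  • results: The authors argue that industry self-regulation is an important first step, but that wider societal discussion and government intervention will be needed to create standards and ensure compliance. They consider options including enforcement powers for supervisory authorities and licensure regimes for frontier AI models. Finally, they propose an initial set of safety standards: pre-deployment risk assessments, external scrutiny of model behavior, using risk assessments to inform deployment decisions, and monitoring and responding to new information about model capabilities and uses after deployment.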
    Abstract Advanced AI models hold the promise of tremendous benefits for humanity, but society needs to proactively manage the accompanying risks. In this paper, we focus on what we term "frontier AI" models: highly capable foundation models that could possess dangerous capabilities sufficient to pose severe risks to public safety. Frontier AI models pose a distinct regulatory challenge: dangerous capabilities can arise unexpectedly; it is difficult to robustly prevent a deployed model from being misused; and, it is difficult to stop a model's capabilities from proliferating broadly. To address these challenges, at least three building blocks for the regulation of frontier models are needed: (1) standard-setting processes to identify appropriate requirements for frontier AI developers, (2) registration and reporting requirements to provide regulators with visibility into frontier AI development processes, and (3) mechanisms to ensure compliance with safety standards for the development and deployment of frontier AI models. Industry self-regulation is an important first step. However, wider societal discussions and government intervention will be needed to create standards and to ensure compliance with them. We consider several options to this end, including granting enforcement powers to supervisory authorities and licensure regimes for frontier AI models. Finally, we propose an initial set of safety standards. These include conducting pre-deployment risk assessments; external scrutiny of model behavior; using risk assessments to inform deployment decisions; and monitoring and responding to new information about model capabilities and uses post-deployment. We hope this discussion contributes to the broader conversation on how to balance public safety risks and innovation benefits from advances at the frontier of AI development.
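    Summary Advanced AI models promise tremendous benefits, but society needs to manage the accompanying risks proactively. This paper focuses on what the authors call "frontier AI" models: highly capable foundation models that could possess capabilities dangerous enough to pose severe risks to public safety. These models pose a distinct regulatory challenge: dangerous capabilities can arise unexpectedly, it is difficult to robustly prevent a deployed model from being misused, and it is difficult to stop a model's capabilities from proliferating. To address these challenges, at least three building blocks are needed: (1) standard-setting processes for frontier AI developers; (2) registration and reporting requirements that give regulators visibility into frontier AI development; and (3) mechanisms to ensure compliance with safety standards for development and deployment. Industry self-regulation is an important first step, but wider societal discussion and government intervention will be needed to create standards and ensure compliance. The authors consider several options, including granting enforcement powers to supervisory authorities and licensure regimes for frontier AI models. Finally, they propose an initial set of safety standards: conducting pre-deployment risk assessments, external scrutiny of model behavior, using risk assessments to inform deployment decisions, and monitoring and responding to new information about model capabilities and uses after deployment. They hope the discussion contributes to the broader conversation on balancing public safety risks against the innovation benefits of advances at the frontier of AI development.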

Learning Multi-Agent Intention-Aware Communication for Optimal Multi-Order Execution in Finance

  • paper_url: http://arxiv.org/abs/2307.03119
  • repo_url: None
  • paper_authors: Yuchen Fang, Zhenggang Tang, Kan Ren, Weiqing Liu, Li Zhao, Jiang Bian, Dongsheng Li, Weinan Zhang, Yong Yu, Tie-Yan Liu
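  • for: Proposing a multi-agent reinforcement learning (MARL) method for multi-order execution that improves the execution of trading orders.
  • methods: Builds on model-free reinforcement learning (RL) within a MARL framework, and learns a multi-round communication protocol through which agents share their intended actions to improve collaboration.
  • results: Experiments on data from two real-world markets show superior execution performance and significantly better collaboration effectiveness than existing approaches, which optimize each order individually.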
    Abstract Order execution is a fundamental task in quantitative finance, aiming at finishing acquisition or liquidation for a number of trading orders of the specific assets. Recent advance in model-free reinforcement learning (RL) provides a data-driven solution to the order execution problem. However, the existing works always optimize execution for an individual order, overlooking the practice that multiple orders are specified to execute simultaneously, resulting in suboptimality and bias. In this paper, we first present a multi-agent RL (MARL) method for multi-order execution considering practical constraints. Specifically, we treat every agent as an individual operator to trade one specific order, while keeping communicating with each other and collaborating for maximizing the overall profits. Nevertheless, the existing MARL algorithms often incorporate communication among agents by exchanging only the information of their partial observations, which is inefficient in complicated financial market. To improve collaboration, we then propose a learnable multi-round communication protocol, for the agents communicating the intended actions with each other and refining accordingly. It is optimized through a novel action value attribution method which is provably consistent with the original learning objective yet more efficient. The experiments on the data from two real-world markets have illustrated superior performance with significantly better collaboration effectiveness achieved by our method.
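    Summary Order execution is a fundamental task in quantitative finance: completing the acquisition or liquidation of trading orders for specific assets. Recent model-free reinforcement learning (RL) offers a data-driven solution, but existing work optimizes the execution of a single order and overlooks the fact that multiple orders are usually executed at the same time, which leads to suboptimality and bias. The authors first present a multi-agent RL (MARL) method for multi-order execution under practical constraints. Each agent acts as an individual operator trading one specific order while communicating and collaborating with the others to maximize overall profit. Existing MARL algorithms usually communicate by exchanging only partial observations, which is inefficient in a complicated financial market. To improve collaboration, the authors then propose a learnable multi-round communication protocol in which agents exchange their intended actions and refine them accordingly. The protocol is optimized with a novel action value attribution method that is provably consistent with the original learning objective yet more efficient. Experiments on data from two real-world markets show superior performance and significantly better collaboration effectiveness.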

Region-Wise Attentive Multi-View Representation Learning for Urban Region Embeddings

  • paper_url: http://arxiv.org/abs/2307.03212
  • repo_url: None
  • paper_authors: Weiliang Chan, Qianqian Ren
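  • for: Addressing the challenges of urban region embedding by proposing a Region-Wise Multi-View Representation Learning (ROMER) model.
  • methods: The model captures multi-view correlations and uses global graph attention networks to learn urban region representations.
  • results: Experiments show that ROMER outperforms previous state-of-the-art methods by up to 17% on two downstream tasks.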
    Abstract Urban region embedding is an important and yet highly challenging issue due to the complexity and constantly changing nature of urban data. To address the challenges, we propose a Region-Wise Multi-View Representation Learning (ROMER) to capture multi-view dependencies and learn expressive representations of urban regions without the constraints of rigid neighbourhood region conditions. Our model focus on learn urban region representation from multi-source urban data. First, we capture the multi-view correlations from mobility flow patterns, POI semantics and check-in dynamics. Then, we adopt global graph attention networks to learn similarity of any two vertices in graphs. To comprehensively consider and share features of multiple views, a two-stage fusion module is further proposed to learn weights with external attention to fuse multi-view embeddings. Extensive experiments for two downstream tasks on real-world datasets demonstrate that our model outperforms state-of-the-art methods by up to 17% improvement.
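    Summary Urban region embedding is an important and challenging problem because urban data are complex and constantly changing. To address these challenges, the authors propose Region-Wise Multi-View Representation Learning (ROMER), which captures multi-view dependencies and learns expressive representations of urban regions. The model learns region representations from multi-source urban data. First, it captures multi-view correlations from mobility flow patterns, POI semantics, and check-in dynamics. It then uses global graph attention networks to learn the similarity of any two vertices in the graphs. To consider and share features across views, a two-stage fusion module learns weights with external attention to fuse the multi-view embeddings. Extensive experiments on two downstream tasks with real-world datasets show that the model outperforms state-of-the-art methods by up to 17%.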

A Survey on Evaluation of Large Language Models

  • paper_url: http://arxiv.org/abs/2307.03109
  • repo_url: https://github.com/mlgroupjlu/llm-eval-survey
  • paper_authors: Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, Xing Xie
  • for: The paper is written to provide a comprehensive review of evaluation methods for large language models (LLMs), with a focus on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
  • methods: The paper uses a survey-based approach to evaluate LLMs, covering various evaluation tasks, benchmarks, and methods.
  • results: The paper summarizes the success and failure cases of LLMs in different tasks, and highlights several future challenges that lie ahead in LLMs evaluation.
    Abstract Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, educations, natural and social sciences, agent applications, and other areas. Secondly, we answer the `where' and `how' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey.

Efficient Domain Adaptation of Sentence Embeddings Using Adapters

  • paper_url: http://arxiv.org/abs/2307.03104
  • repo_url: https://github.com/sebischair/efficient-domain-adaptation-of-sentence-embeddings-using-adapters
  • paper_authors: Tim Schopf, Dennis N. Schneider, Florian Matthes
  • for: Domain adaptation of sentence embeddings.
  • methods: Trains lightweight, domain-specific adapters for parameter-efficient domain adaptation while keeping the weights of the underlying sentence embedding model fixed.
  • results: Achieves competitive performance within 1% of a domain-adapted, entirely fine-tuned sentence embedding model while training only about 3.6% of the parameters.
    Abstract Sentence embeddings enable us to capture the semantic similarity of short texts. Most sentence embedding models are trained for general semantic textual similarity tasks. Therefore, to use sentence embeddings in a particular domain, the model must be adapted to it in order to achieve good results. Usually, this is done by fine-tuning the entire sentence embedding model for the domain of interest. While this approach yields state-of-the-art results, all of the model's weights are updated during fine-tuning, making this method resource-intensive. Therefore, instead of fine-tuning entire sentence embedding models for each target domain individually, we propose to train lightweight adapters. These domain-specific adapters do not require fine-tuning all underlying sentence embedding model parameters. Instead, we only train a small number of additional parameters while keeping the weights of the underlying sentence embedding model fixed. Training domain-specific adapters allows always using the same base model and only exchanging the domain-specific adapters to adapt sentence embeddings to a specific domain. We show that using adapters for parameter-efficient domain adaptation of sentence embeddings yields competitive performance within 1% of a domain-adapted, entirely fine-tuned sentence embedding model while only training approximately 3.6% of the parameters.
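The abstract does not say which adapter architecture is used. As a rough illustration of why adapters train so few parameters, the sketch below counts the weights of a generic bottleneck adapter (a down-projection and an up-projection, each with a bias) inserted once per layer of a frozen base model. The adapter design, the function names, and every size used with it are hypothetical, and the numbers are not meant to reproduce the 3.6% figure.

```python
def adapter_param_count(hidden_size, bottleneck_size):
    """Parameters of one bottleneck adapter (assumed design, not the paper's)."""
    down = hidden_size * bottleneck_size + bottleneck_size  # weights + bias
    up = bottleneck_size * hidden_size + hidden_size        # weights + bias
    return down + up


def trainable_fraction(base_params, num_layers, hidden_size, bottleneck_size):
    """Share of all parameters that is trained when the base model is frozen."""
    adapter_params = num_layers * adapter_param_count(hidden_size, bottleneck_size)
    return adapter_params / (base_params + adapter_params)
```

For instance, calling `trainable_fraction(110_000_000, 12, 768, 64)` models a hypothetical 12-layer encoder with 110M frozen parameters; because the base weights never change, one copy of the base model can serve every domain and only the small adapters are swapped.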

Structure Guided Multi-modal Pre-trained Transformer for Knowledge Graph Reasoning

  • paper_url: http://arxiv.org/abs/2307.03591
  • repo_url: None
  • paper_authors: Ke Liang, Sihang Zhou, Yue Liu, Lingyuan Meng, Meng Liu, Xinwang Liu
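  • for: Proposing a structure-guided multimodal pretrained transformer (SGMPT) for multimodal knowledge graphs (MKGs), to improve multimodal knowledge graph reasoning (KGR).
  • methods: Uses a graph structure encoder to encode the structural features of the knowledge graph, and a structure-guided fusion module that injects the structural information into both textual and visual features through two strategies: weighted summation and an alignment constraint.
  • results: On FB15k-237-IMG and WN18-IMG, SGMPT outperforms existing state-of-the-art models on multimodal KGR, supporting the effectiveness of the designed strategies.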
    Abstract Multimodal knowledge graphs (MKGs), which intuitively organize information in various modalities, can benefit multiple practical downstream tasks, such as recommendation systems, and visual question answering. However, most MKGs are still far from complete, which motivates the flourishing of MKG reasoning models. Recently, with the development of general artificial architectures, the pretrained transformer models have drawn increasing attention, especially for multimodal scenarios. However, the research of multimodal pretrained transformer (MPT) for knowledge graph reasoning (KGR) is still at an early stage. As the biggest difference between MKG and other multimodal data, the rich structural information underlying the MKG still cannot be fully leveraged in existing MPT models. Most of them only utilize the graph structure as a retrieval map for matching images and texts connected with the same entity. This manner hinders their reasoning performances. To this end, we propose the graph Structure Guided Multimodal Pretrained Transformer for knowledge graph reasoning, termed SGMPT. Specifically, the graph structure encoder is adopted for structural feature encoding. Then, a structure-guided fusion module with two different strategies, i.e., weighted summation and alignment constraint, is first designed to inject the structural information into both the textual and visual features. To the best of our knowledge, SGMPT is the first MPT model for multimodal KGR, which mines the structural information underlying the knowledge graph. Extensive experiments on FB15k-237-IMG and WN18-IMG, demonstrate that our SGMPT outperforms existing state-of-the-art models, and prove the effectiveness of the designed strategies.
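    Summary Multimodal knowledge graphs (MKGs) can benefit many downstream tasks, such as recommendation systems and visual question answering. Most MKGs are still far from complete, however, which motivates work on MKG reasoning models. With the development of general architectures, pretrained transformer models have drawn increasing attention in multimodal scenarios, but research on multimodal pretrained transformers (MPTs) for knowledge graph reasoning (KGR) is still at an early stage. Unlike other multimodal data, an MKG carries rich structural information, which existing MPT models cannot fully exploit: most use the graph structure only as a retrieval map for matching the images and texts connected to the same entity, and this limits their reasoning performance. The authors therefore propose the graph Structure Guided Multimodal Pretrained Transformer (SGMPT). A graph structure encoder encodes the structural features, and a structure-guided fusion module injects the structural information into the textual and visual features through two strategies: weighted summation and an alignment constraint. To the authors' knowledge, SGMPT is the first MPT model for multimodal KGR that mines the structural information underlying the knowledge graph. Extensive experiments on FB15k-237-IMG and WN18-IMG show that SGMPT outperforms existing state-of-the-art models and confirm the effectiveness of the designed strategies.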

A Novel Site-Agnostic Multimodal Deep Learning Model to Identify Pro-Eating Disorder Content on Social Media

  • paper_url: http://arxiv.org/abs/2307.06775
  • repo_url: None
  • paper_authors: Jonathan Feldman
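  • for: Developing a multimodal deep learning model that determines whether a social media post promotes eating disorders.
  • methods: A labeled dataset of Tweets was used to train and test twelve deep learning models. The most effective was a multimodal fusion of the RoBERTa natural language processing model and the MaxViT image classification model, which reached an accuracy of 95.9% and an F1 score of 0.959.
  • results: Applied to unlabeled posts from Tumblr and Reddit, the fusion model produced results similar to those of earlier studies that did not use AI-based techniques. A time series analysis of previously unseen Tweets from eight Twitter hashtags showed that the relative abundance of content promoting eating disorders has decreased drastically in those communities since 2014; by 2018, however, such content had either stopped declining or begun to increase again.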
    Abstract Over the last decade, there has been a vast increase in eating disorder diagnoses and eating disorder-attributed deaths, reaching their zenith during the Covid-19 pandemic. This immense growth derived in part from the stressors of the pandemic but also from increased exposure to social media, which is rife with content that promotes eating disorders. This study aimed to create a multimodal deep learning model that can determine if a given social media post promotes eating disorders based on a combination of visual and textual data. A labeled dataset of Tweets was collected from Twitter, upon which twelve deep learning models were trained and tested. Based on model performance, the most effective deep learning model was the multimodal fusion of the RoBERTa natural language processing model and the MaxViT image classification model, attaining accuracy and F1 scores of 95.9% and 0.959, respectively. The RoBERTa and MaxViT fusion model, deployed to classify an unlabeled dataset of posts from the social media sites Tumblr and Reddit, generated results akin to those of previous research studies that did not employ artificial intelligence-based techniques, indicating that deep learning models can develop insights congruent to those of researchers. Additionally, the model was used to conduct a timeseries analysis of yet unseen Tweets from eight Twitter hashtags, uncovering that, since 2014, the relative abundance of content that promotes eating disorders has decreased drastically within those communities. Despite this reduction, by 2018, content that promotes eating disorders had either stopped declining or increased in ampleness anew on these hashtags.
    Summary A labeled dataset of tweets was collected from Twitter, and twelve deep learning models were trained and tested. The best-performing model was the multimodal fusion of the RoBERTa natural language processing model and the MaxViT image classification model, achieving accuracy and F1 scores of 95.9% and 0.959, respectively. This model was then applied to classify unlabeled posts from Tumblr and Reddit, producing results similar to previous research studies that did not use AI-based techniques. Moreover, the model was used to conduct a time series analysis of unseen tweets from eight Twitter hashtags, revealing that the relative abundance of content that promotes eating disorders has decreased significantly since 2014 within these communities. However, by 2018, the content that promotes eating disorders had either leveled off or increased again on these hashtags. In conclusion, this study demonstrates that deep learning models can identify content that promotes eating disorders on social media, and the results can be used to monitor and understand the trends of eating disorder-related content online.

cs.CL - 2023-07-07

Testing the Predictions of Surprisal Theory in 11 Languages

  • paper_url: http://arxiv.org/abs/2307.03667
  • repo_url: None
  • paper_authors: Ethan Gotlieb Wilcox, Tiago Pimentel, Clara Meister, Ryan Cotterell, Roger P. Levy
  • for: investigate the relationship between surprisal and reading times in eleven different languages, distributed across five language families.
  • methods: derive estimates from language models trained on monolingual and multilingual corpora, and test three predictions associated with surprisal theory.
  • results: all three predictions are borne out crosslinguistically, offering the most robust link to-date between information theory and incremental language processing across languages.
    Abstract A fundamental result in psycholinguistics is that less predictable words take a longer time to process. One theoretical explanation for this finding is Surprisal Theory (Hale, 2001; Levy, 2008), which quantifies a word's predictability as its surprisal, i.e. its negative log-probability given a context. While evidence supporting the predictions of Surprisal Theory have been replicated widely, most have focused on a very narrow slice of data: native English speakers reading English texts. Indeed, no comprehensive multilingual analysis exists. We address this gap in the current literature by investigating the relationship between surprisal and reading times in eleven different languages, distributed across five language families. Deriving estimates from language models trained on monolingual and multilingual corpora, we test three predictions associated with surprisal theory: (i) whether surprisal is predictive of reading times; (ii) whether expected surprisal, i.e. contextual entropy, is predictive of reading times; (iii) and whether the linking function between surprisal and reading times is linear. We find that all three predictions are borne out crosslinguistically. By focusing on a more diverse set of languages, we argue that these results offer the most robust link to-date between information theory and incremental language processing across languages.
    Summary A fundamental result in psycholinguistics is that less predictable words take longer to process. One theoretical explanation is Surprisal Theory (Hale, 2001; Levy, 2008), which quantifies a word's predictability as its surprisal, i.e. its negative log-probability given a context. Although evidence for these predictions has been replicated widely, most studies focus on a very narrow slice of data: native English speakers reading English texts, and no comprehensive multilingual analysis exists. The authors fill this gap by investigating the relationship between surprisal and reading times in eleven languages across five language families. Using estimates derived from language models trained on monolingual and multilingual corpora, they test three predictions associated with surprisal theory: (i) whether surprisal is predictive of reading times; (ii) whether expected surprisal, i.e. contextual entropy, is predictive of reading times; and (iii) whether the linking function between surprisal and reading times is linear. All three predictions are borne out crosslinguistically. By covering a more diverse set of languages, the authors argue, these results offer the most robust link to date between information theory and incremental language processing across languages.
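The two quantities tested in predictions (i) and (ii) follow directly from the definitions in the abstract: surprisal is the negative log-probability of a word given its context, and contextual entropy is the expected surprisal over all candidate next words. A minimal sketch, assuming base-2 logarithms (the abstract does not fix a base) and a made-up next-word distribution standing in for a language model's output:

```python
import math


def surprisal(prob):
    """Surprisal in bits: negative log-probability of a word given its context."""
    return -math.log2(prob)


def contextual_entropy(next_word_probs):
    """Expected surprisal over the next-word distribution for one context."""
    return sum(p * surprisal(p) for p in next_word_probs.values() if p > 0)


# Hypothetical next-word distribution for a single context.
next_word_probs = {"dog": 0.5, "cat": 0.25, "axolotl": 0.25}
bits_for_axolotl = surprisal(next_word_probs["axolotl"])
expected_bits = contextual_entropy(next_word_probs)
```

Prediction (iii) then asks whether reading time grows linearly in the surprisal value, rather than, say, in the raw probability.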

The distribution of discourse relations within and across turns in spontaneous conversation

  • paper_url: http://arxiv.org/abs/2307.03645
  • repo_url: None
  • paper_authors: S. Magalí López Cortez, Cassandra L. Jacobs
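  • for: Studying how discourse relations (DRs) are used in spontaneous conversation.
  • methods: Adapts a system of DRs designed for written language to spontaneous dialogue, using crowdsourced annotations from novice annotators, and compares annotation patterns within and across speakers and turns.
  • results: Different discourse contexts produce distinct distributions of discourse relations, with single-turn annotations creating the most uncertainty for annotators. The annotations are also of sufficient quality to be predicted from embeddings of discourse units.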
    Abstract Time pressure and topic negotiation may impose constraints on how people leverage discourse relations (DRs) in spontaneous conversational contexts. In this work, we adapt a system of DRs for written language to spontaneous dialogue using crowdsourced annotations from novice annotators. We then test whether discourse relations are used differently across several types of multi-utterance contexts. We compare the patterns of DR annotation within and across speakers and within and across turns. Ultimately, we find that different discourse contexts produce distinct distributions of discourse relations, with single-turn annotations creating the most uncertainty for annotators. Additionally, we find that the discourse relation annotations are of sufficient quality to predict from embeddings of discourse units.
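    Summary Time pressure and topic negotiation may constrain how people use discourse relations (DRs) in spontaneous conversation. The authors adapt a system of DRs for written language to spontaneous dialogue using crowdsourced annotations from novice annotators. They then test whether discourse relations are used differently across several types of multi-utterance contexts, comparing the patterns of DR annotation within and across speakers and within and across turns. They find that different discourse contexts produce distinct distributions of discourse relations, with single-turn annotations creating the most uncertainty for annotators. The DR annotations are also of sufficient quality to be predicted from embeddings of discourse units.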

Text Simplification of Scientific Texts for Non-Expert Readers

  • paper_url: http://arxiv.org/abs/2307.03569
  • repo_url: None
  • paper_authors: Björn Engelmann, Fabian Haak, Christin Katharina Kreutz, Narjes Nikzad Khasmakhi, Philipp Schaer
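  • for: Helping non-expert readers better understand the core information in scientific abstracts.
  • methods: Simplifies scientific abstracts with three out-of-the-box summarization models (two based on T5, one based on PEGASUS) and with ChatGPT combined with complex phrase identification.
  • results: Four runs were contributed to SimpleText Task 3, aimed at letting non-experts reach the core information without domain or expert knowledge; the abstract does not report quantitative results.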
    Abstract Reading levels are highly individual and can depend on a text's language, a person's cognitive abilities, or knowledge on a topic. Text simplification is the task of rephrasing a text to better cater to the abilities of a specific target reader group. Simplification of scientific abstracts helps non-experts to access the core information by bypassing formulations that require domain or expert knowledge. This is especially relevant for, e.g., cancer patients reading about novel treatment options. The SimpleText lab hosts the simplification of scientific abstracts for non-experts (Task 3) to advance this field. We contribute three runs employing out-of-the-box summarization models (two based on T5, one based on PEGASUS) and one run using ChatGPT with complex phrase identification.
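    Summary Reading levels are highly individual and can depend on a text's language, a person's cognitive abilities, or their knowledge of a topic. Text simplification is the task of rephrasing a text to better suit the abilities of a specific target reader group. Simplifying scientific abstracts helps non-experts access the core information, which is especially relevant for, e.g., cancer patients reading about novel treatment options. The SimpleText lab hosts the simplification of scientific abstracts for non-experts (Task 3) to advance this field. The authors contribute three runs using out-of-the-box summarization models (two based on T5, one based on PEGASUS) and one run using ChatGPT with complex phrase identification.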

DWReCO at CheckThat! 2023: Enhancing Subjectivity Detection through Style-based Data Sampling

  • paper_url: http://arxiv.org/abs/2307.03550
  • repo_url: None
  • paper_authors: Ipek Baris Schlicht, Lynn Khellaf, Defne Altiok
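  • for: Describing the authors' submission to the subjectivity detection task at the CheckThat! Lab.
  • methods: To tackle class imbalance in the task, the authors generate additional training material with GPT-3 models, using prompts in different styles drawn from a subjectivity checklist based on a journalistic perspective. The extended training set is used to fine-tune language-specific transformer models.
  • results: Experiments in English, German and Turkish show that the different subjective styles are effective across all languages. Style-based oversampling works better than paraphrasing in Turkish and English. GPT-3 models sometimes produce lacklustre results when generating style-based texts in non-English languages.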
    Abstract This paper describes our submission for the subjectivity detection task at the CheckThat! Lab. To tackle class imbalances in the task, we have generated additional training materials with GPT-3 models using prompts of different styles from a subjectivity checklist based on journalistic perspective. We used the extended training set to fine-tune language-specific transformer models. Our experiments in English, German and Turkish demonstrate that different subjective styles are effective across all languages. In addition, we observe that the style-based oversampling is better than paraphrasing in Turkish and English. Lastly, the GPT-3 models sometimes produce lacklustre results when generating style-based texts in non-English languages.
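    Summary This paper describes the authors' submission to the subjectivity detection task at the CheckThat! Lab. To address class imbalance in the task, they generate additional training material with GPT-3 models, using prompts in different styles from a subjectivity checklist based on a journalistic perspective. The extended training set is used to fine-tune language-specific transformer models. The experiments show that different subjective styles are effective across all languages. Style-based oversampling is better than paraphrasing in Turkish and English, and GPT-3 models sometimes produce lacklustre results when generating style-based texts in non-English languages.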

Quantifying the perceptual value of lexical and non-lexical channels in speech

  • paper_url: http://arxiv.org/abs/2307.03534
  • repo_url: None
  • paper_authors: Sarenne Wallbridge, Peter Bell, Catherine Lai
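  • for: Studying the value of non-lexical information in dialogue.
  • methods: Introduces a generalised paradigm that quantifies the perceptual value of the non-lexical channel using accuracy and entropy reduction.
  • results: Non-lexical information has a consistent effect on expectations of upcoming dialogue: even when it leads to poorer discriminative turn judgements than lexical content alone, it yields higher consensus among participants.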
    Abstract Speech is a fundamental means of communication that can be seen to provide two channels for transmitting information: the lexical channel of which words are said, and the non-lexical channel of how they are spoken. Both channels shape listener expectations of upcoming communication; however, directly quantifying their relative effect on expectations is challenging. Previous attempts require spoken variations of lexically-equivalent dialogue turns or conspicuous acoustic manipulations. This paper introduces a generalised paradigm to study the value of non-lexical information in dialogue across unconstrained lexical content. By quantifying the perceptual value of the non-lexical channel with both accuracy and entropy reduction, we show that non-lexical information produces a consistent effect on expectations of upcoming dialogue: even when it leads to poorer discriminative turn judgements than lexical content alone, it yields higher consensus among participants.
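    Summary Speech is a fundamental means of communication that provides two channels for transmitting information: the lexical channel of which words are said, and the non-lexical channel of how they are spoken. Both channels shape listeners' expectations of upcoming communication, but directly quantifying their relative effect is difficult. Previous attempts required spoken variations of lexically equivalent dialogue turns or conspicuous acoustic manipulations. This paper introduces a generalised paradigm for studying the value of non-lexical information in dialogue across unconstrained lexical content. By quantifying the perceptual value of the non-lexical channel with both accuracy and entropy reduction, the authors show that non-lexical information has a consistent effect on expectations of upcoming dialogue: even when it leads to poorer discriminative turn judgements than lexical content alone, it yields higher consensus among participants.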

AI-UPV at EXIST 2023 – Sexism Characterization Using Large Language Models Under The Learning with Disagreements Regime

  • paper_url: http://arxiv.org/abs/2307.03385
  • repo_url: https://github.com/angelfelipemp/sexism-llm-learning-with-disagreement
  • paper_authors: Angel Felipe Magnossão de Paula, Giulia Rizzi, Elisabetta Fersini, Damiano Spina
  • for: The paper aims to develop an automated system for detecting sexism and other hateful behaviors on social media to promote a more inclusive and respectful online environment.
  • methods: The proposed approach uses large language models (mBERT and XLM-RoBERTa) and ensemble strategies to identify and classify sexism in English and Spanish, without relying on aggregated labels.
  • results: The system achieved fourth place in Task 2 at EXIST and first place in Task 3, with the highest ICM-Soft of -2.32 and a normalized ICM-Soft of 0.79, outperforming the individual large language models.
    Abstract With the increasing influence of social media platforms, it has become crucial to develop automated systems capable of detecting instances of sexism and other disrespectful and hateful behaviors to promote a more inclusive and respectful online environment. Nevertheless, these tasks are considerably challenging considering different hate categories and the author's intentions, especially under the learning with disagreements regime. This paper describes AI-UPV team's participation in the EXIST (sEXism Identification in Social neTworks) Lab at CLEF 2023. The proposed approach aims at addressing the task of sexism identification and characterization under the learning with disagreements paradigm by training directly from the data with disagreements, without using any aggregated label. Yet, performances considering both soft and hard evaluations are reported. The proposed system uses large language models (i.e., mBERT and XLM-RoBERTa) and ensemble strategies for sexism identification and classification in English and Spanish. In particular, our system is articulated in three different pipelines. The ensemble approach outperformed the individual large language models obtaining the best performances both adopting a soft and a hard label evaluation. This work describes the participation in all the three EXIST tasks, considering a soft evaluation, it obtained fourth place in Task 2 at EXIST and first place in Task 3, with the highest ICM-Soft of -2.32 and a normalized ICM-Soft of 0.79. The source code of our approaches is publicly available at https://github.com/AngelFelipeMP/Sexism-LLM-Learning-With-Disagreement.
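    To make the idea of training from disaggregated labels concrete, here is a minimal sketch of how annotator votes can become a soft label, how a prediction can be scored against it, and how model outputs can be averaged. The votes, the model probabilities, and the choice of plain averaging as the ensemble strategy are assumptions for illustration; the abstract does not specify the team's actual ensemble strategies or loss.

```python
import math


def soft_label(votes, classes):
    """Turn raw annotator votes into a probability distribution over classes."""
    return [votes.count(c) / len(votes) for c in classes]


def soft_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a soft target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def ensemble_average(predictions):
    """Average the class probabilities produced by several models."""
    return [sum(column) / len(predictions) for column in zip(*predictions)]


classes = ["YES", "NO"]
votes = ["YES", "YES", "NO", "YES", "NO", "YES"]  # six hypothetical annotators
target = soft_label(votes, classes)                # about [0.667, 0.333]

model_a = [0.9, 0.1]   # hypothetical output of one fine-tuned model
model_b = [0.5, 0.5]   # hypothetical output of another
ensemble = ensemble_average([model_a, model_b])    # about [0.7, 0.3]

for name, probs in [("model_a", model_a), ("model_b", model_b), ("ensemble", ensemble)]:
    print(name, round(soft_cross_entropy(target, probs), 4))
```

    The soft target keeps the 4-to-2 split among annotators instead of collapsing it to a single majority label, so a prediction is rewarded for matching the distribution of opinions rather than only the most common one.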

A Side-by-side Comparison of Transformers for English Implicit Discourse Relation Classification

  • paper_url: http://arxiv.org/abs/2307.03378
  • repo_url: None
  • paper_authors: Bruce W. Lee, BongSeok Yang, Jason Hyung-Jong Lee
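  • for: The paper compares pre-trained language models on implicit discourse relation classification, so that researchers can make fuller use of publicly available models in discourse analysis.
  • methods: Seven pre-trained language models are fine-tuned on PDTB-3 and compared side by side. They cover sentence-level pre-training objectives (NSP, SBO, SOP) as well as MLM with full attention.
  • results: Contrary to earlier reports (Shi and Demberg, 2019b), sentence-level pre-training objectives (NSP, SBO, SOP) generally fail to produce the best-performing model for implicit discourse relation classification. Similar-sized PLMs with MLM and full attention perform better, raising the state of the art to 0.671 accuracy.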
    Abstract Though discourse parsing can help multiple NLP fields, there has been no wide language model search done on implicit discourse relation classification. This hinders researchers from fully utilizing public-available models in discourse analysis. This work is a straightforward, fine-tuned discourse performance comparison of seven pre-trained language models. We use PDTB-3, a popular discourse relation annotated dataset. Through our model search, we raise SOTA to 0.671 ACC and obtain novel observations. Some are contrary to what has been reported before (Shi and Demberg, 2019b), that sentence-level pre-training objectives (NSP, SBO, SOP) generally fail to produce the best performing model for implicit discourse relation classification. Counterintuitively, similar-sized PLMs with MLM and full attention led to better performance.

Mitigating Negative Transfer with Task Awareness for Sexism, Hate Speech, and Toxic Language Detection

  • paper_url: http://arxiv.org/abs/2307.03377
  • repo_url: https://github.com/angelfelipemp/mitigating-negative-transfer-with-ta
  • paper_authors: Angel Felipe Magnossão de Paula, Paolo Rosso, Damiano Spina
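  • for: The paper addresses how to mitigate the negative transfer problem in Multi-Task Learning (MTL).
  • methods: It proposes a new approach based on the task awareness concept that reduces negative transfer while improving performance over the classic MTL solution. The approach is implemented in two unified architectures that detect sexism, hate speech, and toxic language in text comments.
  • results: The proposed architectures set a new state of the art on both the EXIST-2021 and HatEval-2019 benchmarks.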
    Abstract This paper proposes a novelty approach to mitigate the negative transfer problem. In the field of machine learning, the common strategy is to apply the Single-Task Learning approach in order to train a supervised model to solve a specific task. Training a robust model requires a lot of data and a significant amount of computational resources, making this solution unfeasible in cases where data are unavailable or expensive to gather. Therefore another solution, based on the sharing of information between tasks, has been developed: Multi-Task Learning (MTL). Despite the recent developments regarding MTL, the problem of negative transfer has still to be solved. Negative transfer is a phenomenon that occurs when noisy information is shared between tasks, resulting in a drop in performance. This paper proposes a new approach to mitigate the negative transfer problem based on the task awareness concept. The proposed approach results in diminishing the negative transfer together with an improvement of performance over classic MTL solution. Moreover, the proposed approach has been implemented in two unified architectures to detect Sexism, Hate Speech, and Toxic Language in text comments. The proposed architectures set a new state-of-the-art both in EXIST-2021 and HatEval-2019 benchmarks.

Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments

  • paper_url: http://arxiv.org/abs/2307.03354
  • repo_url: None
  • paper_authors: Sara Papi, Peidong Wan, Junkun Chen, Jian Xue, Jinyu Li, Yashesh Gaur
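  • for: The paper targets streaming scenarios in which users need both transcriptions and translations of speech, generated incrementally with low latency.
  • methods: It introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs with a single decoder. The model is trained with a joint token-level serialized output method that interleaves source and target words using an off-the-shelf textual aligner.
  • results: In monolingual (it-en) and multilingual ({de,es,it}-en) settings, the approach achieves the best quality-latency balance. With an average ASR latency of 1 s and ST latency of 1.3 s, output quality matches or exceeds that of separate ASR and ST models, with an average improvement of 1.1 WER and 0.4 BLEU in the multilingual case.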
    Abstract In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary. This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder. To produce ASR and ST content effectively with minimal latency, we propose a joint token-level serialized output training method that interleaves source and target words by leveraging an off-the-shelf textual aligner. Experiments in monolingual (it-en) and multilingual ({de,es,it}-en) settings demonstrate that our approach achieves the best quality-latency balance. With an average ASR latency of 1s and ST latency of 1.3s, our model shows no degradation or even improves output quality compared to separate ASR and ST models, yielding an average improvement of 1.1 WER and 0.4 BLEU in the multilingual case.
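    The interleaving of source and target words can be pictured with a small sketch. It assumes word-level alignment pairs are already available (hand-written here in place of the off-the-shelf aligner) and emits each target word as soon as every source word it aligns to has appeared. The tag names and this particular emission rule are assumptions, not the paper's exact serialization scheme.

```python
def interleave(source, target, alignment, src_tag="<asr>", tgt_tag="<st>"):
    """Interleave source and target words given (src_idx, tgt_idx) alignment pairs.

    A target word is emitted right after the last source word it is aligned to.
    Unaligned target words follow the previous target word, and target order
    is always preserved.
    """
    # Latest source index that each target word depends on.
    needed = {}
    for s, t in alignment:
        needed[t] = max(needed.get(t, -1), s)

    emit_after, prev = [], 0
    for t in range(len(target)):
        prev = max(prev, needed.get(t, prev))  # never emit earlier than a predecessor
        emit_after.append(prev)

    out, t = [], 0
    for s, word in enumerate(source):
        out.append((src_tag, word))
        while t < len(target) and emit_after[t] <= s:
            out.append((tgt_tag, target[t]))
            t += 1
    out.extend((tgt_tag, w) for w in target[t:])  # safeguard for leftover words
    return out


source = ["il", "gatto", "nero", "dorme"]      # Italian transcript
target = ["the", "black", "cat", "sleeps"]     # English translation
alignment = [(0, 0), (1, 2), (2, 1), (3, 3)]   # hypothetical aligner output

for tag, word in interleave(source, target, alignment):
    print(tag, word)
```

    Because "black" aligns to "nero", both "black" and "cat" wait until "nero" has been transcribed, which shows how word reordering between languages adds to translation latency in a single serialized stream.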

BiPhone: Modeling Inter Language Phonetic Influences in Text

  • paper_url: http://arxiv.org/abs/2307.03322
  • repo_url: None
  • paper_authors: Abhirut Gupta, Ananya B. Sai, Richard Sproat, Yuri Vasilevski, James S. Ren, Ambarish Jash, Sukhdeep S. Sodhi, Aravindan Raghuveer
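  • for: The paper studies errors in second-language (L2) text written by people who, because of technology asymmetries, must use the Web in a language in which they have low literacy. These errors are influenced by the writer's native language (L1).
  • methods: It mines phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2 and plugs them into a generative model (Bi-Phone) that synthetically produces corrupted L2 text.
  • results: Human evaluations show that Bi-Phone generates plausible corruptions that differ across L1s and have widespread coverage on the Web. Applying the technique to SuperGLUE yields FunGLUE, on which state-of-the-art language understanding models perform poorly. A new phoneme prediction pre-training task helps byte models recover performance close to SuperGLUE, and the FunGLUE benchmark is released to promote research on phonetically robust language models.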
    Abstract A large number of people are forced to use the Web in a language they have low literacy in due to technology asymmetries. Written text in the second language (L2) from such users often contains a large number of errors that are influenced by their native language (L1). We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2. These confusions are then plugged into a generative model (Bi-Phone) for synthetically producing corrupted L2 text. Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and also have widespread coverage on the Web. We also corrupt the popular language understanding benchmark SuperGLUE with our technique (FunGLUE for Phonetically Noised GLUE) and show that SoTA language understating models perform poorly. We also introduce a new phoneme prediction pre-training task which helps byte models to recover performance close to SuperGLUE. Finally, we also release the FunGLUE benchmark to promote further research in phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text.
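    A toy version of confusion-driven corruption is sketched below. It works on letter sequences as a crude stand-in for phonemes, and the confusion table and its probabilities are invented for illustration. Bi-Phone itself mines phoneme confusions per L1-L2 pair and uses a generative model, which this sketch does not reproduce.

```python
import random

# Hypothetical confusion table: unit -> [(replacement, probability), ...].
CONFUSIONS = {
    "v": [("w", 0.3)],
    "th": [("t", 0.4), ("d", 0.1)],
    "z": [("j", 0.2)],
}


def corrupt(word, confusions, rng):
    """Replace confusable units in `word` by sampling from the confusion table."""
    units = sorted(confusions, key=len, reverse=True)  # try longer units first
    out, i = [], 0
    while i < len(word):
        for unit in units:
            if word.startswith(unit, i):
                choice, r, cumulative = unit, rng.random(), 0.0
                for replacement, p in confusions[unit]:
                    cumulative += p
                    if r < cumulative:
                        choice = replacement
                        break
                out.append(choice)
                i += len(unit)
                break
        else:  # no confusable unit starts here: copy the character unchanged
            out.append(word[i])
            i += 1
    return "".join(out)


rng = random.Random(0)
print([corrupt(w, CONFUSIONS, rng) for w in ["very", "think", "zebra", "cat"]])
```

    Words with no confusable unit, such as "cat", always pass through unchanged, and the remaining probability mass for each unit keeps the original spelling.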

Covering Uncommon Ground: Gap-Focused Question Generation for Answer Assessment

  • paper_url: http://arxiv.org/abs/2307.03319
  • repo_url: None
  • paper_authors: Roni Rabin, Alexandre Djerbetian, Roee Engelberg, Lidan Hackmon, Gal Elidan, Reut Tsarfaty, Amir Globerson
  • for: The paper is written for generating gap-focused questions (GFQs) in educational dialogues to create a rich and interactive learning experience.
  • methods: The paper proposes a model that uses natural language processing techniques to generate GFQs automatically, with a focus on key desired aspects such as relevance, specificity, and engagement.
  • results: The paper provides an evaluation of the generated questions against human-generated questions, demonstrating competitive performance and the effectiveness of the proposed model in generating GFQs.
    Abstract Human communication often involves information gaps between the interlocutors. For example, in an educational dialogue, a student often provides an answer that is incomplete, and there is a gap between this answer and the perfect one expected by the teacher. Successful dialogue then hinges on the teacher asking about this gap in an effective manner, thus creating a rich and interactive educational experience. We focus on the problem of generating such gap-focused questions (GFQs) automatically. We define the task, highlight key desired aspects of a good GFQ, and propose a model that satisfies these. Finally, we provide an evaluation by human annotators of our generated questions compared against human generated ones, demonstrating competitive performance.

InfoSync: Information Synchronization across Multilingual Semi-structured Tables

  • paper_url: http://arxiv.org/abs/2307.03313
  • repo_url: https://github.com/Info-Sync/InfoSync
  • paper_authors: Siddharth Khincha, Chelsi Jain, Vivek Gupta, Tushar Kataria, Shuo Zhang
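  • for: The paper addresses information synchronization of semi-structured data across languages, for example keeping Wikipedia tables consistent between language editions.
  • methods: It introduces a new dataset, InfoSync, and a two-step method for tabular synchronization. InfoSync contains 100K entity-centric tables (Wikipedia Infoboxes) across 14 languages, of which a subset (3.5K pairs) is manually annotated. The method consists of information alignment followed by information update.
  • results: On InfoSync, information alignment achieves an F1 score of 87.91 (en <-> non-en). To evaluate information update, human-assisted Wikipedia edits were made to Infoboxes for 603 table pairs, with an acceptance rate of 77.28% on Wikipedia.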
    Abstract Information Synchronization of semi-structured data across languages is challenging. For instance, Wikipedia tables in one language should be synchronized across languages. To address this problem, we introduce a new dataset InfoSyncC and a two-step method for tabular synchronization. InfoSync contains 100K entity-centric tables (Wikipedia Infoboxes) across 14 languages, of which a subset (3.5K pairs) are manually annotated. The proposed method includes 1) Information Alignment to map rows and 2) Information Update for updating missing/outdated information for aligned tables across multilingual tables. When evaluated on InfoSync, information alignment achieves an F1 score of 87.91 (en <-> non-en). To evaluate information updation, we perform human-assisted Wikipedia edits on Infoboxes for 603 table pairs. Our approach obtains an acceptance rate of 77.28% on Wikipedia, showing the effectiveness of the proposed method.
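    The two steps can be sketched on a pair of toy infoboxes. The tables, the key mapping, and the function names below are hypothetical, and the mapping is given by hand rather than learned. A real system would also need to translate the values it transfers, which this sketch leaves out.

```python
def align_rows(table_a, table_b, key_map):
    """Step 1: pair up rows of two tables that describe the same attribute."""
    return [(ka, key_map[ka]) for ka in table_a if key_map.get(ka) in table_b]


def propose_updates(table_a, table_b, key_map):
    """Step 2: rows present in table_a whose counterpart is missing from table_b."""
    return {key_map[ka]: value for ka, value in table_a.items()
            if ka in key_map and key_map[ka] not in table_b}


# Hypothetical English and Spanish infoboxes for the same fictional entity.
english = {"Born": "1950", "Occupation": "Engineer", "Awards": "Example Prize"}
spanish = {"Nacimiento": "1950", "Ocupación": "Ingeniera"}
key_map = {"Born": "Nacimiento", "Occupation": "Ocupación", "Awards": "Premios"}

print(align_rows(english, spanish, key_map))
print(propose_updates(english, spanish, key_map))
```

    Alignment finds the two rows the tables share, and the update step flags the "Awards" row as missing from the Spanish table.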

Gammatonegram Representation for End-to-End Dysarthric Speech Processing Tasks: Speech Recognition, Speaker Identification, and Intelligibility Assessment

  • paper_url: http://arxiv.org/abs/2307.03296
  • repo_url: https://github.com/areffarhadi/gammatonegram_cnn_dysarthric_speech
  • paper_authors: Aref Farhadipour, Hadi Veisi
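  • for: The study develops CNN-based systems for processing dysarthric speech, motivated by voice-command control in the smart home for people whose impaired speech defeats ordinary speech processing systems.
  • methods: Each speech file is converted into a gammatonegram image, which is fed to a convolutional neural network built by transfer learning on a pre-trained AlexNet.
  • results: On the UA dataset, the speech recognition system reaches 91.29% accuracy in speaker-dependent mode, the speaker identification system 87.74% in text-dependent mode, and the intelligibility assessment system 96.47% in two-class mode. A cascaded multi-network speech recognition system achieves 92.3% WRR.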
    Abstract Dysarthria is a disability that causes a disturbance in the human speech system and reduces the quality and intelligibility of a person's speech. Because of this effect, the normal speech processing systems can not work properly on impaired speech. This disability is usually associated with physical disabilities. Therefore, designing a system that can perform some tasks by receiving voice commands in the smart home can be a significant achievement. In this work, we introduce gammatonegram as an effective method to represent audio files with discriminative details, which is used as input for the convolutional neural network. On the other word, we convert each speech file into an image and propose image recognition system to classify speech in different scenarios. Proposed CNN is based on the transfer learning method on the pre-trained Alexnet. In this research, the efficiency of the proposed system for speech recognition, speaker identification, and intelligibility assessment is evaluated. According to the results on the UA dataset, the proposed speech recognition system achieved 91.29% accuracy in speaker-dependent mode, the speaker identification system acquired 87.74% accuracy in text-dependent mode, and the intelligibility assessment system achieved 96.47% accuracy in two-class mode. Finally, we propose a multi-network speech recognition system that works fully automatically. This system is located in a cascade arrangement with the two-class intelligibility assessment system, and the output of this system activates each one of the speech recognition networks. This architecture achieves an accuracy of 92.3% WRR. The source code of this paper is available.
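    The cascade arrangement described at the end of the abstract can be sketched as a simple router: the two-class intelligibility assessor decides which speech recognition network handles the utterance. The class names "high" and "low" and the stand-in components below are assumptions so that the sketch runs; the real components would be trained CNNs operating on gammatonegram images.

```python
def cascade_recognize(features, assess_intelligibility, recognizers):
    """Route an utterance to the recognizer matching its intelligibility class."""
    level = assess_intelligibility(features)  # must be a key of `recognizers`
    return level, recognizers[level](features)


# Stand-in components; each would be a trained network in practice.
def toy_assessor(features):
    return "high" if sum(features) / len(features) > 0.5 else "low"


toy_recognizers = {
    "high": lambda features: "output of the high-intelligibility network",
    "low": lambda features: "output of the low-intelligibility network",
}

print(cascade_recognize([0.9, 0.8, 0.7], toy_assessor, toy_recognizers))
print(cascade_recognize([0.1, 0.2, 0.3], toy_assessor, toy_recognizers))
```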

Performance Comparison of Pre-trained Models for Speech-to-Text in Turkish: Whisper-Small and Wav2Vec2-XLS-R-300M

  • paper_url: http://arxiv.org/abs/2307.04765
  • repo_url: None
  • paper_authors: Oyku Berfin Mercan, Sercan Cepni, Davut Emre Tasar, Sukru Ozan
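  • for: The study examines how two pre-trained multilingual speech-to-text models, Whisper-Small and Wav2Vec2-XLS-R-300M, perform on Turkish.
  • methods: Both models are fine-tuned on the Turkish portion of Mozilla Common Voice version 11.0, an open-source dataset that contains a small amount of data.
  • results: The WER is 0.28 for Wav2Vec2-XLS-R-300M and 0.16 for Whisper-Small. The models were also evaluated on test data built from call center recordings that were not part of the training and validation sets.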
    Abstract In this study, the performances of the Whisper-Small and Wav2Vec2-XLS-R-300M models which are two pre-trained multilingual models for speech to text were examined for the Turkish language. Mozilla Common Voice version 11.0 which is prepared in Turkish language and is an open-source data set, was used in the study. The multilingual models, Whisper-Small and Wav2Vec2-XLS-R-300M were fine-tuned with this data set which contains a small amount of data. The speech to text performance of the two models was compared. WER values are calculated as 0.28 and 0.16 for the Wav2Vec2-XLS-R-300M and the Whisper-Small models respectively. In addition, the performances of the models were examined with the test data prepared with call center records that were not included in the training and validation dataset.
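    For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a generic implementation that assumes whitespace tokenization and a non-empty reference; it is not the evaluation code used in the study, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One of four reference words is dropped by the hypothetical recognizer.
print(word_error_rate("bugün hava çok güzel", "bugün hava güzel"))  # prints 0.25
```

    On this scale, the reported 0.16 for Whisper-Small corresponds to roughly one word error per six reference words, against roughly one per four at 0.28 for Wav2Vec2-XLS-R-300M.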

Lost in the Middle: How Language Models Use Long Contexts

  • paper_url: http://arxiv.org/abs/2307.03172
  • repo_url: https://github.com/nelson-liu/lost-in-the-middle
  • paper_authors: Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang
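  • for: The study examines how well language models use long input contexts.
  • methods: Performance is analyzed on two tasks that require identifying relevant information within the input context: multi-document question answering and key-value retrieval.
  • results: Performance is often highest when relevant information occurs at the beginning or end of the input context and degrades significantly when it sits in the middle of a long context. Performance also decreases substantially as the input context grows longer, even for explicitly long-context models. The analysis yields new evaluation protocols for future long-context models.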
    Abstract While recent language models have the ability to take long contexts as input, relatively little is known about how well they use longer context. We analyze language model performance on two tasks that require identifying relevant information within their input contexts: multi-document question answering and key-value retrieval. We find that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts. Furthermore, performance substantially decreases as the input context grows longer, even for explicitly long-context models. Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context models.
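    The key-value retrieval task lends itself to a small generator: build a JSON object of random key-value pairs, query one key, and vary where that pair sits. The sketch below assumes UUID strings for keys and values and uses an invented prompt wording; the authors' repository should be consulted for the actual data format and prompt.

```python
import json
import random
import uuid


def make_kv_example(num_pairs, target_position, seed=0):
    """Build a key-value retrieval prompt with the queried pair at a chosen position."""
    rng = random.Random(seed)

    def new_id():
        # Seeded UUID so that examples are reproducible.
        return str(uuid.UUID(int=rng.getrandbits(128)))

    pairs = [(new_id(), new_id()) for _ in range(num_pairs)]
    target_key, target_value = pairs[target_position]
    prompt = (
        "Extract the value corresponding to the specified key "
        "in the JSON object below.\n\n"
        f"{json.dumps(dict(pairs), indent=1)}\n\n"
        f'Key: "{target_key}"\nCorresponding value:'
    )
    return prompt, target_value


# Same pairs each time (fixed seed); only the queried position moves.
for position in (0, 37, 74):
    prompt, answer = make_kv_example(75, position)
    offset = prompt.index(answer) / len(prompt)
    print(f"position {position}: answer sits {offset:.0%} of the way into the prompt")
```

    Scoring a model's completions against `answer` at each position gives accuracy as a function of where the relevant pair sits, which is the kind of curve the abstract summarizes as strongest at the beginning and end of the context.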

T-MARS: Improving Visual Representations by Circumventing Text Feature Learning

  • paper_url: http://arxiv.org/abs/2307.03132
  • repo_url: https://github.com/locuslab/t-mars
  • paper_authors: Pratyush Maini, Sachin Goyal, Zachary C. Lipton, J. Zico Kolter, Aditi Raghunathan
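  • for: The paper proposes a new data filtering approach for learning better visual representations from large web-sourced multimodal datasets.
  • methods: T-MARS (Text Masking and Re-Scoring) first masks out the text in each image and then filters out image-caption pairs whose masked image has a low CLIP similarity score with the caption.
  • results: On the "medium scale" of the DataComp data filtering benchmark, T-MARS outperforms the top-ranked method by 6.5% on ImageNet and 4.7% on VTAB. Across data pool sizes from 2M to 64M, its accuracy gains increase linearly as data and compute are scaled exponentially.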
    Abstract Large web-sourced multimodal datasets have powered a slew of new methods for learning general-purpose visual representations, advancing the state of the art in computer vision and revolutionizing zero- and few-shot recognition. One crucial decision facing practitioners is how, if at all, to curate these ever-larger datasets. For example, the creators of the LAION-5B dataset chose to retain only image-caption pairs whose CLIP similarity score exceeded a designated threshold. In this paper, we propose a new state-of-the-art data filtering approach motivated by our observation that nearly 40% of LAION's images contain text that overlaps significantly with the caption. Intuitively, such data could be wasteful as it incentivizes models to perform optical character recognition rather than learning visual features. However, naively removing all such data could also be wasteful, as it throws away images that contain visual features (in addition to overlapping text). Our simple and scalable approach, T-MARS (Text Masking and Re-Scoring), filters out only those pairs where the text dominates the remaining visual features -- by first masking out the text and then filtering out those with a low CLIP similarity score of the masked image. Experimentally, T-MARS outperforms the top-ranked method on the "medium scale" of DataComp (a data filtering benchmark) by a margin of 6.5% on ImageNet and 4.7% on VTAB. Additionally, our systematic evaluation on various data pool sizes from 2M to 64M shows that the accuracy gains enjoyed by T-MARS linearly increase as data and compute are scaled exponentially. Code is available at https://github.com/locuslab/T-MARS.
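    The mask-then-rescore logic can be sketched with the text masker and the CLIP scorer passed in as functions. Everything concrete below (the toy image representation, the word-overlap "score", the 0.5 threshold, and the decision to keep the original unmasked image) is an assumption made so the sketch runs without any model; the released code is the reference for the real pipeline.

```python
def t_mars_filter(pairs, mask_text, clip_score, threshold):
    """Keep (image, caption) pairs whose text-masked image still matches the caption."""
    kept = []
    for image, caption in pairs:
        masked = mask_text(image)               # remove rendered text from the image
        if clip_score(masked, caption) >= threshold:
            kept.append((image, caption))       # assumption: the original image is kept
    return kept


# Toy stand-ins: an "image" is a dict of visual concepts and rendered text.
def toy_mask_text(image):
    return {"id": image["id"], "visual": image["visual"], "text": set()}


def toy_clip_score(image, caption):
    words = set(caption.split())
    return len(words & (image["visual"] | image["text"])) / len(words)


pairs = [
    ({"id": 1, "visual": {"dog", "grass"}, "text": set()}, "dog on grass"),
    ({"id": 2, "visual": set(), "text": {"sale", "50%", "off"}}, "sale 50% off"),
    ({"id": 3, "visual": {"mug"}, "text": {"coffee"}}, "coffee mug"),
]
kept = t_mars_filter(pairs, toy_mask_text, toy_clip_score, threshold=0.5)
print([image["id"] for image, _ in kept])  # prints [1, 3]
```

    Pair 2 is dropped because its caption is matched only by the rendered text, while pair 3 survives because visual content remains after masking. That is the distinction the abstract draws between text-dominated pairs and pairs that merely contain overlapping text.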

BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training

  • paper_url: http://arxiv.org/abs/2307.03131
  • repo_url: https://github.com/powerpuffpomelo/fairseq_mrt
  • paper_authors: Yiming Yan, Tao Wang, Chengqi Zhao, Shujian Huang, Jiajun Chen, Mingxuan Wang
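  • for: The study systematically analyzes and compares mainstream and cutting-edge automatic metrics from the perspective of how they guide the training of machine translation systems.
  • methods: Using Minimum Risk Training (MRT), the authors find that some metrics have robustness defects, such as universal adversarial translations in BLEURT and BARTScore. In-depth analysis points to two main causes: distribution biases in the training datasets and the tendency of the metric paradigm. Token-level constraints are added to make the metrics more robust.
  • results: More robust evaluation metrics in turn improve the performance of machine translation systems. Code is available at https://github.com/powerpuffpomelo/fairseq_mrt.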
    Abstract Automatic metrics play a crucial role in machine translation. Despite the widespread use of n-gram-based metrics, there has been a recent surge in the development of pre-trained model-based metrics that focus on measuring sentence semantics. However, these neural metrics, while achieving higher correlations with human evaluations, are often considered to be black boxes with potential biases that are difficult to detect. In this study, we systematically analyze and compare various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems. Through Minimum Risk Training (MRT), we find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore. In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm. By incorporating token-level constraints, we enhance the robustness of evaluation metrics, which in turn leads to an improvement in the performance of machine translation systems. Codes are available at https://github.com/powerpuffpomelo/fairseq_mrt.
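    To see why MRT exposes weaknesses in a metric, it helps to write down the quantity being minimized. The sketch below follows a common formulation of MRT, in which the risk is the expected cost of sampled candidates under a renormalized distribution proportional to the model probability raised to a sharpness power. The candidate probabilities, the metric scores, the assumption that scores lie in [0, 1], and the cost defined as one minus the score are all illustrative choices, not values from the paper.

```python
import math


def mrt_risk(log_probs, metric_scores, alpha=1.0):
    """Expected cost of sampled candidates under Q(y) proportional to P(y)**alpha."""
    scaled = [alpha * lp for lp in log_probs]
    top = max(scaled)                                  # subtract max for stability
    weights = [math.exp(s - top) for s in scaled]
    z = sum(weights)
    costs = [1.0 - score for score in metric_scores]   # cost = 1 - metric score
    return sum((w / z) * c for w, c in zip(weights, costs))


# Two hypothetical candidate translations with model probabilities 0.75 and 0.25.
log_probs = [math.log(0.75), math.log(0.25)]
honest_metric = [0.8, 0.2]    # metric prefers the likelier, faithful candidate
fooled_metric = [0.8, 0.95]   # metric over-rewards a degenerate candidate

print(round(mrt_risk(log_probs, honest_metric), 4))  # prints 0.35
print(round(mrt_risk(log_probs, fooled_metric), 4))  # prints 0.1625
```

    Minimizing this risk moves probability mass toward whichever candidates the metric scores highest. If a metric assigns a high score to a degenerate output, as in the second case, training is pulled toward that output, which is one way universal adversarial translations can surface.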

VisKoP: Visual Knowledge oriented Programming for Interactive Knowledge Base Question Answering

  • paper_url: http://arxiv.org/abs/2307.03130
  • repo_url: None
  • paper_authors: Zijun Yao, Yuanyong Chen, Xin Lv, Shulin Cao, Amy Xin, Jifan Yu, Hailong Jin, Jianjun Xu, Peng Zhang, Lei Hou, Juanzi Li
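  • for: The paper presents the Visual Knowledge oriented Programming platform (VisKoP), a knowledge base question answering (KBQA) system that puts a human in the loop to edit and debug knowledge base (KB) queries.
  • methods: A neural program induction module converts natural language questions into knowledge oriented program language (KoPL), and KoPL programs are mapped to graphical elements that can be edited with simple graphical operations. A highly efficient KoPL execution engine supports practical KBQA on a million-entity-level KB.
  • results: Experiments show that VisKoP is highly efficient and that user interaction can fix a large portion of wrong KoPL programs to obtain the correct answer.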
    Abstract We present Visual Knowledge oriented Programming platform (VisKoP), a knowledge base question answering (KBQA) system that integrates human into the loop to edit and debug the knowledge base (KB) queries. VisKoP not only provides a neural program induction module, which converts natural language questions into knowledge oriented program language (KoPL), but also maps KoPL programs into graphical elements. KoPL programs can be edited with simple graphical operators, such as dragging to add knowledge operators and slot filling to designate operator arguments. Moreover, VisKoP provides auto-completion for its knowledge base schema and users can easily debug the KoPL program by checking its intermediate results. To facilitate the practical KBQA on a million-entity-level KB, we design a highly efficient KoPL execution engine for the back-end. Experiment results show that VisKoP is highly efficient and user interaction can fix a large portion of wrong KoPL programs to acquire the correct answer. The VisKoP online demo https://demoviskop.xlore.cn (Stable release of this paper) and https://viskop.xlore.cn (Beta release with new features), highly efficient KoPL engine https://pypi.org/project/kopl-engine, and screencast video https://youtu.be/zAbJtxFPTXo are now publicly available.

PREADD: Prefix-Adaptive Decoding for Controlled Text Generation

  • paper_url: http://arxiv.org/abs/2307.03214
  • repo_url: https://github.com/jonnypei/acl23-preadd
  • paper_authors: Jonathan Pei, Kevin Yang, Dan Klein
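  • for: Controlled text generation.
  • methods: Prefix-Adaptive Decoding (PREADD), which linearly combines the output logits from a raw prompt and a prefix-prepended prompt and needs no external model.
  • results: On three tasks (toxic output mitigation, gender bias reduction, and sentiment control), PREADD outperforms prompting baselines and an auxiliary-expert control method by 12% or more in relative gain on the main metrics.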
    Abstract We propose Prefix-Adaptive Decoding (PREADD), a flexible method for controlled text generation. Unlike existing methods that use auxiliary expert models to control for attributes, PREADD does not require an external model, instead relying on linearly combining output logits from multiple prompts. Specifically, PREADD contrasts the output logits generated using a raw prompt against those generated using a prefix-prepended prompt, enabling both positive and negative control with respect to any attribute encapsulated by the prefix. We evaluate PREADD on three tasks -- toxic output mitigation, gender bias reduction, and sentiment control -- and find that PREADD outperforms not only prompting baselines, but also an auxiliary-expert control method, by 12% or more in relative gain on our main metrics for each task.
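    One way to read "linearly combining output logits" is as a shift from the raw-prompt logits toward, or away from, the prefix-prepended logits. The sketch below assumes that form, with a strength parameter `alpha` whose name and exact role are assumptions, and uses made-up logits over a three-token vocabulary. The released code is the reference for the actual combination.

```python
import math


def preadd_logits(raw_logits, prefixed_logits, alpha):
    """Shift raw logits by alpha times the (prefixed - raw) difference."""
    return [r + alpha * (p - r) for r, p in zip(raw_logits, prefixed_logits)]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits; token 1 is the one the prefix encourages.
raw = [2.0, 1.0, 0.0]
prefixed = [1.0, 3.0, 0.0]

for alpha in (-1.0, 0.0, 1.0, 2.0):
    probs = softmax(preadd_logits(raw, prefixed, alpha))
    print(f"alpha={alpha:+.1f}: P(token 1)={probs[1]:.3f}")
```

    A positive `alpha` amplifies whatever the prefix makes more likely, and a negative `alpha` suppresses it. This matches the abstract's point that one mechanism gives both positive and negative control, for example steering away from the continuation a toxicity-inducing prefix would favor.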

Extracting Multi-valued Relations from Language Models

  • paper_url: http://arxiv.org/abs/2307.03122
  • repo_url: https://github.com/snehasinghania/multi_valued_slot_filling
  • paper_authors: Sneha Singhania, Simon Razniewski, Gerhard Weikum
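  • for: The paper investigates whether multi-object relational knowledge can be extracted from the latent language representations of pre-trained language models.
  • methods: The problem is cast as a rank-then-select task. For ranking candidate objects, existing prompting techniques are evaluated and new ones incorporating domain knowledge are proposed.
  • results: Selecting objects whose likelihood exceeds a learned relation-specific threshold gives a 49.5% F1 score. The results highlight the difficulty of using LMs for multi-valued slot filling and motivate further research on extracting relational knowledge from latent language representations.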
    Abstract The widespread usage of latent language representations via pre-trained language models (LMs) suggests that they are a promising source of structured knowledge. However, existing methods focus only on a single object per subject-relation pair, even though often multiple objects are correct. To overcome this limitation, we analyze these representations for their potential to yield materialized multi-object relational knowledge. We formulate the problem as a rank-then-select task. For ranking candidate objects, we evaluate existing prompting techniques and propose new ones incorporating domain knowledge. Among the selection methods, we find that choosing objects with a likelihood above a learned relation-specific threshold gives a 49.5% F1 score. Our results highlight the difficulty of employing LMs for the multi-valued slot-filling task and pave the way for further research on extracting relational knowledge from latent language representations.
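    The selection step can be illustrated with a threshold over candidate likelihoods and an F1 score against the gold object set. In the sketch below the likelihoods are made up, and "learning" the threshold is reduced to a grid search over a few candidate values on one example; both are assumptions standing in for the paper's actual ranking scores and threshold-learning procedure.

```python
def select_objects(scored_candidates, threshold):
    """Keep every candidate object whose likelihood reaches the threshold."""
    return {obj for obj, likelihood in scored_candidates if likelihood >= threshold}


def f1_score(predicted, gold):
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def learn_threshold(examples, grid):
    """Pick the grid value with the highest mean F1 for one relation."""
    def mean_f1(threshold):
        return sum(f1_score(select_objects(scored, threshold), gold)
                   for scored, gold in examples) / len(examples)
    return max(grid, key=mean_f1)


# Hypothetical ranked output for (Switzerland, official language, ?).
scored = [("German", 0.50), ("French", 0.30), ("Italian", 0.12),
          ("English", 0.05), ("Romansh", 0.03)]
gold = {"German", "French", "Italian", "Romansh"}

predicted = select_objects(scored, threshold=0.1)
print(sorted(predicted), round(f1_score(predicted, gold), 3))
print(learn_threshold([(scored, gold)], grid=[0.01, 0.04, 0.1, 0.2]))
```

    A single top-1 prediction would return only "German" here, which is the limitation of single-object extraction that the abstract points out. The threshold trades precision against recall, and a correct object with a low likelihood such as "Romansh" shows why the multi-valued setting is hard.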
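    Summary The widespread use of latent language representations from pre-trained language models (LMs) suggests that they are a promising source of structured knowledge. Existing methods, however, consider only a single object per subject-relation pair, even though several objects are often correct. We analyze these representations for their potential to yield multi-object relational knowledge and frame the problem as a rank-then-select task. For ranking, we evaluate existing prompting techniques and propose new ones that incorporate domain knowledge; for selection, choosing objects whose likelihood exceeds a learned relation-specific threshold gives a 49.5% F1 score. The results show how hard multi-valued slot filling is for LMs and point to further research on extracting relational knowledge from latent language representations.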
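The select step of rank-then-select can be sketched as follows. The example assumes candidate objects already carry LM likelihoods (the numbers below are made up for illustration) and that the relation-specific threshold is chosen by maximizing mean F1 over a small development set; the helper names and the grid of candidate thresholds are hypothetical, not taken from the paper's repository.

```python
def select_objects(likelihoods, threshold):
    """Keep every candidate object whose likelihood exceeds the threshold."""
    return {obj for obj, p in likelihoods.items() if p > threshold}

def f1_score(predicted, gold):
    """Set-based F1 between predicted and gold objects."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def learn_threshold(dev_examples, candidate_thresholds):
    """Pick the threshold with the best mean F1 on (likelihoods, gold) pairs
    belonging to one relation (assumed selection rule)."""
    def mean_f1(t):
        scores = [f1_score(select_objects(l, t), g) for l, g in dev_examples]
        return sum(scores) / len(scores)
    return max(candidate_thresholds, key=mean_f1)

# Illustrative likelihoods for one subject-relation pair (made-up values).
likelihoods = {"French": 0.6, "German": 0.3, "Italian": 0.25, "Spanish": 0.05}
gold = {"French", "German", "Italian", "Romansh"}

threshold = learn_threshold([(likelihoods, gold)], [0.1, 0.2, 0.5])
predicted = select_objects(likelihoods, threshold)
score = f1_score(predicted, gold)
```

A per-relation threshold matters because relations differ in how many correct objects they typically have, so a single global cutoff would over-select for some relations and under-select for others.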

KoRC: Knowledge oriented Reading Comprehension Benchmark for Deep Text Understanding

  • paper_url: http://arxiv.org/abs/2307.03115
  • repo_url: https://github.com/thu-keg/korc
  • paper_authors: Zijun Yao, Yantao Liu, Xin Lv, Shulin Cao, Jifan Yu, Lei Hou, Juanzi Li
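  • for: This paper proposes a new benchmark for evaluating deep text understanding.
  • methods: Massive knowledge bases guide annotators or large language models (LLMs) in constructing knowledge-oriented questions, and labels from the knowledge bases serve as the final answers.
  • results: The strongest baseline achieves only 68.3% and 30.0% F1 on the in-distribution and out-of-distribution test sets, respectively, which indicates that deep text understanding remains an unsolved challenge.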
    Abstract Deep text understanding, which requires the connections between a given document and prior knowledge beyond its text, has been highlighted by many benchmarks in recent years. However, these benchmarks have encountered two major limitations. On the one hand, most of them require human annotation of knowledge, which leads to limited knowledge coverage. On the other hand, they usually use choices or spans in the texts as the answers, which results in a narrow answer space. To overcome these limitations, we build a new challenging benchmark named KoRC in this paper. Compared with previous benchmarks, KoRC has two advantages, i.e., broad knowledge coverage and flexible answer format. Specifically, we utilize massive knowledge bases to guide annotators or large language models (LLMs) to construct knowledgeable questions. Moreover, we use labels in knowledge bases rather than spans or choices as the final answers. We test state-of-the-art models on KoRC and the experimental results show that the strongest baseline only achieves 68.3% and 30.0% F1 measure on the in-distribution and out-of-distribution test sets, respectively. These results indicate that deep text understanding is still an unsolved challenge. The benchmark dataset, leaderboard, and baseline methods are released at https://github.com/THU-KEG/KoRC.
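    Summary Deep text understanding, which requires connecting a document to prior knowledge beyond its text, has drawn the attention of many recent benchmarks. These benchmarks share two limitations: most rely on human annotation of knowledge, which limits knowledge coverage, and they usually take choices or spans in the text as answers, which narrows the answer space. KoRC addresses both with broad knowledge coverage and a flexible answer format: massive knowledge bases guide annotators or large language models (LLMs) in constructing questions, and knowledge-base labels are used as answers. The strongest baseline reaches only 68.3% and 30.0% F1 on the in-distribution and out-of-distribution test sets, so deep text understanding remains unsolved. The dataset, leaderboard, and baseline methods are released at https://github.com/THU-KEG/KoRC.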

cs.LG - 2023-07-07

Differentiable Turbulence

  • paper_url: http://arxiv.org/abs/2307.03683
  • repo_url: https://github.com/tumaer/JAXFLUIDS
  • paper_authors: Varun Shankar, Romit Maulik, Venkatasubramanian Viswanathan
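  • for: This paper aims to improve the accuracy of sub-grid scale (SGS) turbulence closure models for large eddy simulations (LES) of two-dimensional turbulent flow using deep learning.
  • methods: An end-to-end differentiable solver is combined with physics-inspired deep learning architectures to learn effective and versatile SGS models.
  • results: Including small-scale non-local features is most critical to effective SGS modeling, while large-scale features improve the pointwise accuracy of the a-posteriori solution field. The model generalizes to a variety of flow configurations, including different Reynolds numbers and forcing conditions.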
    Abstract Deep learning is increasingly becoming a promising pathway to improving the accuracy of sub-grid scale (SGS) turbulence closure models for large eddy simulations (LES). We leverage the concept of differentiable turbulence, whereby an end-to-end differentiable solver is used in combination with physics-inspired choices of deep learning architectures to learn highly effective and versatile SGS models for two-dimensional turbulent flow. We perform an in-depth analysis of the inductive biases in the chosen architectures, finding that the inclusion of small-scale non-local features is most critical to effective SGS modeling, while large-scale features can improve pointwise accuracy of the a-posteriori solution field. The filtered velocity gradient tensor can be mapped directly to the SGS stress via decomposition of the inputs and outputs into isotropic, deviatoric, and anti-symmetric components. We see that the model can generalize to a variety of flow configurations, including higher and lower Reynolds numbers and different forcing conditions. We show that the differentiable physics paradigm is more successful than offline, a-priori learning, and that hybrid solver-in-the-loop approaches to deep learning offer an ideal balance between computational efficiency, accuracy, and generalization. Our experiments provide physics-based recommendations for deep-learning based SGS modeling for generalizable closure modeling of turbulence.
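    Summary Deep learning is an increasingly promising route to more accurate sub-grid scale (SGS) turbulence closure models for large eddy simulations. Using the concept of differentiable turbulence, an end-to-end differentiable solver is combined with physics-inspired deep learning architectures to learn effective and versatile SGS models for two-dimensional turbulent flow. An analysis of the inductive biases of the architectures shows that small-scale non-local features matter most for SGS modeling, while large-scale features improve the pointwise accuracy of the a-posteriori solution field. The filtered velocity gradient tensor can be mapped directly to the SGS stress by decomposing inputs and outputs into isotropic, deviatoric, and anti-symmetric components. The model generalizes across flow configurations, including different Reynolds numbers and forcing conditions, and the differentiable physics paradigm proves more successful than offline, a-priori learning.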
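The tensor decomposition mentioned in the abstract can be made concrete with a small sketch for a 2x2 tensor, as in two-dimensional flow. It uses the standard split of a matrix into an isotropic part (trace/2 times the identity), a deviatoric part (the traceless symmetric remainder), and an anti-symmetric part. How the paper feeds these components to its networks is not shown here, and the function name is hypothetical.

```python
def decompose_2x2(a):
    """Split a 2x2 tensor into isotropic, deviatoric, and anti-symmetric parts.

    The three parts sum back to the input tensor.
    """
    half_trace = (a[0][0] + a[1][1]) / 2.0
    isotropic = [[half_trace, 0.0], [0.0, half_trace]]
    # Symmetric part S = (A + A^T) / 2 and anti-symmetric part W = (A - A^T) / 2.
    sym = [[(a[i][j] + a[j][i]) / 2.0 for j in range(2)] for i in range(2)]
    antisymmetric = [[(a[i][j] - a[j][i]) / 2.0 for j in range(2)] for i in range(2)]
    # Deviatoric part: symmetric part with the isotropic contribution removed.
    deviatoric = [[sym[i][j] - isotropic[i][j] for j in range(2)] for i in range(2)]
    return isotropic, deviatoric, antisymmetric

# Example velocity-gradient-like tensor (arbitrary illustrative values).
grad_u = [[1.0, 2.0], [4.0, 3.0]]
iso, dev, anti = decompose_2x2(grad_u)
reconstructed = [[iso[i][j] + dev[i][j] + anti[i][j] for j in range(2)] for i in range(2)]
```

Separating the components this way lets a model treat the rotational (anti-symmetric) and strain (deviatoric) contributions differently, which is one plausible reading of the physics-inspired design the abstract refers to.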

GeoPhy: Differentiable Phylogenetic Inference via Geometric Gradients of Tree Topologies

  • paper_url: http://arxiv.org/abs/2307.03675
  • repo_url: https://github.com/m1m0r1/geophy
  • paper_authors: Takahiro Mimori, Michiaki Hamada
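  • for: Understanding the evolutionary relationships in biological data while accounting for the uncertainty of phylogenetic tree variables under molecular evolution models.
  • methods: A fully differentiable formulation of phylogenetic inference that represents distributions over tree topologies in continuous geometric spaces.
  • results: In experiments on real benchmark datasets, GeoPhy significantly outperformed other approximate Bayesian methods that consider whole topologies.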
    Abstract Phylogenetic inference, grounded in molecular evolution models, is essential for understanding the evolutionary relationships in biological data. Accounting for the uncertainty of phylogenetic tree variables, which include tree topologies and evolutionary distances on branches, is crucial for accurately inferring species relationships from molecular data and tasks requiring variable marginalization. Variational Bayesian methods are key to developing scalable, practical models; however, it remains challenging to conduct phylogenetic inference without restricting the combinatorially vast number of possible tree topologies. In this work, we introduce a novel, fully differentiable formulation of phylogenetic inference that leverages a unique representation of topological distributions in continuous geometric spaces. Through practical considerations on design spaces and control variates for gradient estimations, our approach, GeoPhy, enables variational inference without limiting the topological candidates. In experiments using real benchmark datasets, GeoPhy significantly outperformed other approximate Bayesian methods that considered whole topologies.
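    Summary Phylogenetic inference grounded in molecular evolution models is essential for understanding evolutionary relationships in biological data. Accounting for the uncertainty of tree variables, namely tree topologies and evolutionary distances on branches, is crucial for accurately inferring species relationships and for tasks that require marginalizing over these variables. Variational Bayesian methods are key to scalable, practical models, but inference without restricting the combinatorially vast set of possible topologies remains difficult. GeoPhy is a fully differentiable formulation that represents topological distributions in continuous geometric spaces. With practical choices of design spaces and control variates for gradient estimation, it performs variational inference without limiting the candidate topologies, and on real benchmark datasets it significantly outperforms other approximate Bayesian methods that consider whole topologies.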

Simulation-free Schrödinger bridges via score and flow matching

  • paper_url: http://arxiv.org/abs/2307.03672
  • repo_url: https://github.com/atong01/conditional-flow-matching
  • paper_authors: Alexander Tong, Nikolay Malkin, Kilian Fatras, Lazar Atanackovic, Yanlei Zhang, Guillaume Huguet, Guy Wolf, Yoshua Bengio
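  • for: Inferring stochastic dynamics from unpaired source and target samples, with an application to learning cell dynamics from snapshot data.
  • methods: Simulation-free score and flow matching ([SF]$^2$M) treats continuous-time stochastic generative modeling as a Schrödinger bridge (SB) problem and uses static entropy-regularized optimal transport, or a minibatch approximation, to learn the SB efficiently without simulating the learned process.
  • results: [SF]$^2$M accurately models cell dynamics in high dimensions and recovers known gene regulatory networks from simulated data. It is also more efficient and more accurate than earlier simulation-based methods.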
    Abstract We present simulation-free score and flow matching ([SF]$^2$M), a simulation-free objective for inferring stochastic dynamics given unpaired source and target samples drawn from arbitrary distributions. Our method generalizes both the score-matching loss used in the training of diffusion models and the recently proposed flow matching loss used in the training of continuous normalizing flows. [SF]$^2$M interprets continuous-time stochastic generative modeling as a Schrödinger bridge (SB) problem. It relies on static entropy-regularized optimal transport, or a minibatch approximation, to efficiently learn the SB without simulating the learned stochastic process. We find that [SF]$^2$M is more efficient and gives more accurate solutions to the SB problem than simulation-based methods from prior work. Finally, we apply [SF]$^2$M to the problem of learning cell dynamics from snapshot data. Notably, [SF]$^2$M is the first method to accurately model cell dynamics in high dimensions and can recover known gene regulatory networks from simulated data.
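    Summary We present simulation-free score and flow matching ([SF]$^2$M), a simulation-free objective for inferring stochastic dynamics from unpaired source and target samples. It generalizes both the score-matching loss used to train diffusion models and the flow matching loss used to train continuous normalizing flows. [SF]$^2$M casts continuous-time stochastic generative modeling as a Schrödinger bridge (SB) problem and uses static entropy-regularized optimal transport, or a minibatch approximation, to learn the SB without simulating the learned stochastic process. It is more efficient and more accurate on the SB problem than earlier simulation-based methods. Applied to cell dynamics learned from snapshot data, it is the first method to model high-dimensional cell dynamics accurately, and it recovers known gene regulatory networks from simulated data.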
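The static entropy-regularized optimal transport step named in the abstract can be sketched with a plain Sinkhorn iteration. This is a generic textbook solver written with the standard library only, not the implementation in the authors' repository; the regularization strength `eps`, the iteration count, and the toy one-dimensional samples are all assumptions made for illustration.

```python
import math

def sinkhorn_plan(a, b, cost, eps, n_iters=500):
    """Entropy-regularized OT plan between histograms a and b.

    a, b:  source and target weights, each summing to 1
    cost:  cost[i][j] between source point i and target point j
    eps:   entropic regularization strength (assumed value in the example)
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternate scaling so that rows match a, then columns match b.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Two unpaired one-dimensional samples with a squared-distance cost.
source = [0.0, 1.0]
target = [0.0, 1.0]
cost = [[(x - y) ** 2 for y in target] for x in source]
plan = sinkhorn_plan([0.5, 0.5], [0.5, 0.5], cost, eps=0.5)
```

The resulting plan puts most of its mass on nearby source-target pairs while keeping some mass on distant ones; larger `eps` spreads the mass more evenly. Because the plan is computed once from samples, no trajectory of the learned process has to be simulated to obtain it.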

Online Network Source Optimization with Graph-Kernel MAB

  • paper_url: http://arxiv.org/abs/2307.03641
  • repo_url: None
  • paper_authors: Laura Toni, Pascal Frossard
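  • for: Learning online the optimal source placement in large-scale networks so that the reward from a priori unknown network processes is maximized.
  • methods: A graph-kernel multi-armed bandit algorithm that describes the network processes with an adaptive graph dictionary model, together with Grab-UCB, an online sequential decision strategy that learns the model parameters while optimizing the actions.
  • results: In simulations, the proposed online learning algorithm outperforms baseline offline methods and shows gains in cumulative regret, sample efficiency, and computational complexity.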
    Abstract We propose Grab-UCB, a graph-kernel multi-armed bandit algorithm to learn online the optimal source placement in large-scale networks, such that the reward obtained from a priori unknown network processes is maximized. The uncertainty calls for online learning, which however suffers from the curse of dimensionality. To achieve sample efficiency, we describe the network processes with an adaptive graph dictionary model, which typically leads to sparse spectral representations. This enables a data-efficient learning framework, whose learning rate scales with the dimension of the spectral representation model instead of the one of the network. We then propose Grab-UCB, an online sequential decision strategy that learns the parameters of the spectral representation while optimizing the action strategy. We derive performance guarantees that depend on network parameters, which further influence the learning curve of the sequential decision strategy. We introduce a computationally simplified solving method, Grab-arm-Light, an algorithm that walks along the edges of the polytope representing the objective function. Simulation results show that the proposed online learning algorithm outperforms baseline offline methods that typically separate the learning phase from the testing one. The results confirm the theoretical findings, and further highlight the gain of the proposed online learning strategy in terms of cumulative regret, sample efficiency and computational complexity.
    Summary We propose Grab-UCB, a graph-kernel multi-armed bandit algorithm that learns online the optimal source placement in large-scale networks so as to maximize the reward from a priori unknown network processes. To counter the curse of dimensionality, the network processes are described with an adaptive graph dictionary model, which typically yields sparse spectral representations, so the learning rate scales with the dimension of the spectral model rather than that of the network. Grab-UCB learns the parameters of the spectral representation while optimizing the action strategy, and Grab-arm-Light is a computationally simplified solver that walks along the edges of the polytope representing the objective function. Simulations confirm the theoretical guarantees and show gains over offline baselines in cumulative regret, sample efficiency, and computational complexity.

  • paper_url: http://arxiv.org/abs/2307.03630
  • repo_url: None
  • paper_authors: Dániel Rácz, Mihály Petreczky, Bálint Daróczy
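  • for: This work studies the problem of learning neural ordinary differential equations (neural ODEs) in the context of continuous-time Linear Parameter-Varying (LPV) systems.
  • methods: LPV systems contain bilinear systems, and a large class of neural ODEs can be embedded into LPV systems. The authors derive Probably Approximately Correct (PAC) bounds, under stability, for LPV systems related to neural ODEs.
  • results: The main contribution is a set of PAC bounds for LPV systems related to neural ODEs that do not depend on the integration interval.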
    Abstract We consider the problem of learning Neural Ordinary Differential Equations (neural ODEs) within the context of Linear Parameter-Varying (LPV) systems in continuous-time. LPV systems contain bilinear systems which are known to be universal approximators for non-linear systems. Moreover, a large class of neural ODEs can be embedded into LPV systems. As our main contribution we provide Probably Approximately Correct (PAC) bounds under stability for LPV systems related to neural ODEs. The resulting bounds have the advantage that they do not depend on the integration interval.
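    Summary We consider the problem of learning neural ordinary differential equations (neural ODEs) in the context of continuous-time Linear Parameter-Varying (LPV) systems. LPV systems contain bilinear systems, which are known to be universal approximators for non-linear systems, and a large class of neural ODEs can be embedded into LPV systems. Our main contribution is a set of Probably Approximately Correct (PAC) bounds, under stability, for LPV systems related to neural ODEs; these bounds do not depend on the integration interval.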

Toward High-Performance Energy and Power Battery Cells with Machine Learning-based Optimization of Electrode Manufacturing

  • paper_url: http://arxiv.org/abs/2307.05521
  • repo_url: None
  • paper_authors: Marc Duquesnoy, Chaoyue Liu, Vishank Kumar, Elixabete Ayerbe, Alejandro A. Franco
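  • for: This study aims to optimize the electrode manufacturing process of Lithium Ion Batteries (LIBs) to meet growing energy demand. The optimization matters because manufacturing determines the practical performance of the cells in applications such as electric vehicles.
  • methods: A data-driven approach supported by a deterministic machine learning (ML)-assisted pipeline for bi-objective optimization of electrochemical performance. The pipeline enables inverse design of the process parameters needed to manufacture electrodes for energy or power applications, analogous to the authors' earlier work on optimizing electrode microstructures for kinetic, ionic, and electronic transport properties.
  • results: The results suggest that a high amount of active material, combined with intermediate values of solid content in the slurry and of calendering degree, gives the optimal electrodes.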
    Abstract The optimization of the electrode manufacturing process is important for upscaling the application of Lithium Ion Batteries (LIBs) to cater for growing energy demand. In particular, LIB manufacturing is very important to be optimized because it determines the practical performance of the cells when the latter are being used in applications such as electric vehicles. In this study, we tackled the issue of high-performance electrodes for desired battery application conditions by proposing a powerful data-driven approach supported by a deterministic machine learning (ML)-assisted pipeline for bi-objective optimization of the electrochemical performance. This ML pipeline allows the inverse design of the process parameters to adopt in order to manufacture electrodes for energy or power applications. The latter work is an analogy to our previous work that supported the optimization of the electrode microstructures for kinetic, ionic, and electronic transport properties improvement. An electrochemical pseudo-two-dimensional model is fed with the electrode properties characterizing the electrode microstructures generated by manufacturing simulations and used to simulate the electrochemical performances. Secondly, the resulting dataset was used to train a deterministic ML model to implement fast bi-objective optimizations to identify optimal electrodes. Our results suggested a high amount of active material, combined with intermediate values of solid content in the slurry and calendering degree, to achieve the optimal electrodes.
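    Summary Optimizing the electrode manufacturing process is important for scaling up Lithium Ion Batteries (LIBs), because manufacturing determines the practical performance of the cells in applications such as electric vehicles. We address high-performance electrodes with a data-driven approach supported by a deterministic machine learning (ML)-assisted pipeline for bi-objective optimization of electrochemical performance. The pipeline enables inverse design of the process parameters for electrodes aimed at energy or power applications, analogous to our previous work on optimizing electrode microstructures for kinetic, ionic, and electronic transport properties. An electrochemical pseudo-two-dimensional model is fed with the properties of microstructures generated by manufacturing simulations and used to simulate electrochemical performance; the resulting dataset then trains a deterministic ML model for fast bi-objective optimization. The results suggest a high amount of active material combined with intermediate solid content in the slurry and intermediate calendering degree.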

GEANN: Scalable Graph Augmentations for Multi-Horizon Time Series Forecasting

  • paper_url: http://arxiv.org/abs/2307.03595
  • repo_url: None
  • paper_authors: Sitan Yang, Malcolm Wolff, Shankar Ramasubramanian, Vincent Quenneville-Belair, Ronak Metha, Michael W. Mahoney
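  • for: This work addresses multi-horizon time series forecasting with deep neural networks when series lack sufficient historical data, i.e., the "cold start" problem.
  • methods: Graph neural networks (GNNs) are used as a data augmentation that enhances the forecaster's encoder by capturing complex relationships between time series.
  • results: In the target application, demand forecasting for a large e-commerce retailer, the method improves overall performance over competitive baselines on both a small dataset (100K products) and a large dataset (over 2 million products), with substantially larger gains for "cold start" products such as those newly launched or recently out of stock.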
    Abstract Encoder-decoder deep neural networks have been increasingly studied for multi-horizon time series forecasting, especially in real-world applications. However, to forecast accurately, these sophisticated models typically rely on a large number of time series examples with substantial history. A rapidly growing topic of interest is forecasting time series which lack sufficient historical data, often referred to as the "cold start" problem. In this paper, we introduce a novel yet simple method to address this problem by leveraging graph neural networks (GNNs) as a data augmentation for enhancing the encoder used by such forecasters. These GNN-based features can capture complex inter-series relationships, and their generation process can be optimized end-to-end with the forecasting task. We show that our architecture can use either data-driven or domain knowledge-defined graphs, scaling to incorporate information from multiple very large graphs with millions of nodes. In our target application of demand forecasting for a large e-commerce retailer, we demonstrate on both a small dataset of 100K products and a large dataset with over 2 million products that our method improves overall performance over competitive baseline models. More importantly, we show that it brings substantially more gains to "cold start" products such as those newly launched or recently out-of-stock.
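    Summary Encoder-decoder deep neural networks are increasingly studied for multi-horizon time series forecasting, but to forecast accurately they typically need many time series with substantial history. We introduce a simple method for series that lack such history (the "cold start" problem): graph neural networks (GNNs) serve as a data augmentation that enhances the forecaster's encoder. The GNN-based features capture complex inter-series relationships, and their generation is optimized end-to-end with the forecasting task. The architecture works with data-driven or domain-knowledge-defined graphs and scales to multiple very large graphs. On demand forecasting for a large e-commerce retailer, with a 100K-product dataset and a dataset of over 2 million products, the method improves overall performance over competitive baselines and brings substantially larger gains for "cold start" products such as those newly launched or recently out of stock.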

Accelerated Optimization Landscape of Linear-Quadratic Regulator

  • paper_url: http://arxiv.org/abs/2307.03590
  • repo_url: None
  • paper_authors: Lechen Feng, Yuan-Hua Ni
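  • for: This paper addresses the linear-quadratic regulator (LQR) problem in optimal control.
  • methods: The paper proposes a first-order accelerated optimization framework for the LQR problem and gives separate convergence analyses for state-feedback LQR (SLQR) and output-feedback LQR (OLQR).
  • results: The authors establish a Lipschitz Hessian property of the LQR performance criterion. For SLQR, a symplectic Euler discretization with a restarting rule preserves the continuous-time convergence rate; for OLQR, a Hessian-free accelerated method finds an $\epsilon$-stationary point in time $\mathcal{O}(\epsilon^{-7/4}\log(1/\epsilon))$.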
    Abstract Linear-quadratic regulator (LQR) is a landmark problem in the field of optimal control, which is the concern of this paper. Generally, LQR is classified into state-feedback LQR (SLQR) and output-feedback LQR (OLQR) based on whether the full state is obtained. It has been suggested in existing literature that both the SLQR and the OLQR could be viewed as constrained nonconvex matrix optimization problems in which the only variable to be optimized is the feedback gain matrix. In this paper, we introduce a first-order accelerated optimization framework of handling the LQR problem, and give its convergence analysis for the cases of SLQR and OLQR, respectively. Specifically, a Lipschitz Hessian property of LQR performance criterion is presented, which turns out to be a crucial property for the application of modern optimization techniques. For the SLQR problem, a continuous-time hybrid dynamic system is introduced, whose solution trajectory is shown to converge exponentially to the optimal feedback gain with Nesterov-optimal order $1-\frac{1}{\sqrt{\kappa}}$ ($\kappa$ the condition number). Then, the symplectic Euler scheme is utilized to discretize the hybrid dynamic system, and a Nesterov-type method with a restarting rule is proposed that preserves the continuous-time convergence rate, i.e., the discretized algorithm admits the Nesterov-optimal convergence order. For the OLQR problem, a Hessian-free accelerated framework is proposed, which is a two-procedure method consisting of semiconvex function optimization and negative curvature exploitation. In a time $\mathcal{O}(\epsilon^{-7/4}\log(1/\epsilon))$, the method can find an $\epsilon$-stationary point of the performance criterion; this entails that the method improves upon the $\mathcal{O}(\epsilon^{-2})$ complexity of vanilla gradient descent. Moreover, our method provides the second-order guarantee of stationary point.
    Summary The linear-quadratic regulator (LQR) is a landmark problem in optimal control. It is usually divided into state-feedback LQR (SLQR) and output-feedback LQR (OLQR), depending on whether the full state is available, and both can be viewed as constrained nonconvex matrix optimization problems whose only variable is the feedback gain matrix. This paper introduces a first-order accelerated optimization framework for LQR, built on a Lipschitz Hessian property of the performance criterion. For SLQR, a continuous-time hybrid dynamic system converges exponentially to the optimal gain with Nesterov-optimal order $1-\frac{1}{\sqrt{\kappa}}$, and a symplectic Euler discretization with a restarting rule preserves that rate. For OLQR, a Hessian-free two-procedure method (semiconvex function optimization plus negative curvature exploitation) finds an $\epsilon$-stationary point in time $\mathcal{O}(\epsilon^{-7/4}\log(1/\epsilon))$, improving on the $\mathcal{O}(\epsilon^{-2})$ complexity of vanilla gradient descent, with a second-order guarantee on the stationary point.

BOF-UCB: A Bayesian-Optimistic Frequentist Algorithm for Non-Stationary Contextual Bandits

  • paper_url: http://arxiv.org/abs/2307.03587
  • repo_url: None
  • paper_authors: Nicklas Werge, Abdullah Akgül, Melih Kandemir
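  • for: This paper proposes a new Bayesian-Optimistic Frequentist Upper Confidence Bound (BOF-UCB) algorithm for stochastic contextual linear bandits in non-stationary environments.
  • methods: The algorithm uses sequential Bayesian updates to infer the posterior distribution of the unknown regression parameter, then takes a frequentist approach to compute the Upper Confidence Bound (UCB) by maximizing the expected reward over the posterior distribution.
  • results: The authors provide theoretical performance guarantees and show experimentally that BOF-UCB balances exploration and exploitation on synthetic datasets and classical control tasks, outperforming existing methods in non-stationary environments.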
    Abstract We propose a novel Bayesian-Optimistic Frequentist Upper Confidence Bound (BOF-UCB) algorithm for stochastic contextual linear bandits in non-stationary environments. This unique combination of Bayesian and frequentist principles enhances adaptability and performance in dynamic settings. The BOF-UCB algorithm utilizes sequential Bayesian updates to infer the posterior distribution of the unknown regression parameter, and subsequently employs a frequentist approach to compute the Upper Confidence Bound (UCB) by maximizing the expected reward over the posterior distribution. We provide theoretical guarantees of BOF-UCB's performance and demonstrate its effectiveness in balancing exploration and exploitation on synthetic datasets and classical control tasks in a reinforcement learning setting. Our results show that BOF-UCB outperforms existing methods, making it a promising solution for sequential decision-making in non-stationary environments.
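    Summary We propose a Bayesian-Optimistic Frequentist Upper Confidence Bound (BOF-UCB) algorithm for stochastic contextual linear bandits in non-stationary environments. Combining Bayesian and frequentist principles improves adaptability and performance in dynamic settings. BOF-UCB uses sequential Bayesian updates to infer the posterior distribution of the unknown regression parameter and then computes the Upper Confidence Bound (UCB) in a frequentist manner by maximizing the expected reward over that posterior. We give theoretical performance guarantees and demonstrate the algorithm on synthetic datasets and classical control tasks in a reinforcement learning setting, where it outperforms existing methods.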

ContextLabeler Dataset: physical and virtual sensors data collected from smartphone usage in-the-wild

  • paper_url: http://arxiv.org/abs/2307.03586
  • repo_url: None
  • paper_authors: Mattia Giovanni Campana, Franca Delmastro
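  • for: This paper describes a data collection campaign and the resulting dataset of smartphone sensor data characterizing the daily-life activities of 3 volunteers over two weeks. The dataset contains more than 45K samples, each with 1332 features from a heterogeneous set of physical and virtual sensors (motion sensors, running applications, devices in proximity, and weather conditions). Each sample also carries a ground-truth label describing the user's activity and situation during the sensing experiment (e.g., working, at a restaurant, doing sport).
  • methods: The data were collected in the wild, on the volunteers' own devices and without any constraint on user behavior, to avoid introducing bias during collection.
  • results: The released dataset, a collection of CSV files, is a source of real data for defining and evaluating a broad set of context-aware solutions (both algorithms and protocols) that adapt their behavior to changes in the user's situation in a mobile environment.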
    Abstract This paper describes a data collection campaign and the resulting dataset derived from smartphone sensors characterizing the daily life activities of 3 volunteers in a period of two weeks. The dataset is released as a collection of CSV files containing more than 45K data samples, where each sample is composed by 1332 features related to a heterogeneous set of physical and virtual sensors, including motion sensors, running applications, devices in proximity, and weather conditions. Moreover, each data sample is associated with a ground truth label that describes the user activity and the situation in which she was involved during the sensing experiment (e.g., working, at restaurant, and doing sport activity). To avoid introducing any bias during the data collection, we performed the sensing experiment in-the-wild, that is, by using the volunteers' devices, and without defining any constraint related to the user's behavior. For this reason, the collected dataset represents a useful source of real data to both define and evaluate a broad set of novel context-aware solutions (both algorithms and protocols) that aim to adapt their behavior according to the changes in the user's situation in a mobile environment.

Programmable Synthetic Tabular Data Generation

  • paper_url: http://arxiv.org/abs/2307.03577
  • repo_url: None
  • paper_authors: Mark Vero, Mislav Balunović, Martin Vechev
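  • for: Generating constrained synthetic tabular data, so that large amounts of tabular data can be used despite privacy, data quality, and data sharing limitations.
  • methods: ProgSyn is a programmable generative approach: it pre-trains a generative model on the original dataset and fine-tunes it on a differentiable loss derived automatically from the provided specifications, so that the generated data is of high quality and adheres to those specifications.
  • results: ProgSyn achieves a new state of the art on several constraints; for example, at the same fairness level it improves downstream accuracy on the Adult dataset by 2.3%. Overall, it offers a versatile and accessible framework for generating constrained synthetic tabular data, with specifications that go beyond prior work.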
    Abstract Large amounts of tabular data remain underutilized due to privacy, data quality, and data sharing limitations. While training a generative model producing synthetic data resembling the original distribution addresses some of these issues, most applications require additional constraints from the generated data. Existing synthetic data approaches are limited as they typically only handle specific constraints, e.g., differential privacy (DP) or increased fairness, and lack an accessible interface for declaring general specifications. In this work, we introduce ProgSyn, the first programmable synthetic tabular data generation algorithm that allows for comprehensive customization over the generated data. To ensure high data quality while adhering to custom specifications, ProgSyn pre-trains a generative model on the original dataset and fine-tunes it on a differentiable loss automatically derived from the provided specifications. These can be programmatically declared using statistical and logical expressions, supporting a wide range of requirements (e.g., DP or fairness, among others). We conduct an extensive experimental evaluation of ProgSyn on a number of constraints, achieving a new state-of-the-art on some, while remaining general. For instance, at the same fairness level we achieve 2.3% higher downstream accuracy than the state-of-the-art in fair synthetic data generation on the Adult dataset. Overall, ProgSyn provides a versatile and accessible framework for generating constrained synthetic tabular data, allowing for specifications that generalize beyond the capabilities of prior work.
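    Summary Large amounts of tabular data remain underutilized because of privacy, data quality, and data sharing limitations. Training a generative model to produce synthetic data that resembles the original distribution addresses some of these issues, but most applications need additional constraints on the generated data. Existing approaches typically handle only specific constraints, such as differential privacy (DP) or increased fairness, and lack an accessible interface for declaring general specifications. ProgSyn is the first programmable synthetic tabular data generation algorithm: it pre-trains a generative model on the original dataset and fine-tunes it on a differentiable loss derived automatically from specifications declared with statistical and logical expressions. It sets a new state of the art on several constraints while remaining general; for instance, at the same fairness level it achieves 2.3% higher downstream accuracy than the state of the art in fair synthetic data generation on the Adult dataset.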

One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention

  • paper_url: http://arxiv.org/abs/2307.03576
  • repo_url: None
  • paper_authors: Arvind Mahankali, Tatsunori B. Hashimoto, Tengyu Ma
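  • for: This work studies theoretically what transformers with a single layer of linear self-attention learn when trained on synthetic noisy linear regression data, under different distributions of the covariates, weights, and responses.
  • methods: The analysis characterizes the global minimizer of the pre-training loss of a one-layer transformer with linear self-attention.
  • results: When the covariates are drawn from a standard Gaussian distribution, the minimizing one-layer transformer implements a single step of gradient descent (GD) on the least-squares linear regression objective. Changing the distribution of the covariates and weight vector to a non-isotropic Gaussian leads instead to a single step of pre-conditioned GD, whereas changing only the distribution of the responses has little effect on the learned algorithm.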
    Abstract Recent works have empirically analyzed in-context learning and shown that transformers trained on synthetic linear regression tasks can learn to implement ridge regression, which is the Bayes-optimal predictor, given sufficient capacity [Akyürek et al., 2023], while one-layer transformers with linear self-attention and no MLP layer will learn to implement one step of gradient descent (GD) on a least-squares linear regression objective [von Oswald et al., 2022]. However, the theory behind these observations remains poorly understood. We theoretically study transformers with a single layer of linear self-attention, trained on synthetic noisy linear regression data. First, we mathematically show that when the covariates are drawn from a standard Gaussian distribution, the one-layer transformer which minimizes the pre-training loss will implement a single step of GD on the least-squares linear regression objective. Then, we find that changing the distribution of the covariates and weight vector to a non-isotropic Gaussian distribution has a strong impact on the learned algorithm: the global minimizer of the pre-training loss now implements a single step of pre-conditioned GD. However, if only the distribution of the responses is changed, then this does not have a large effect on the learned algorithm: even when the response comes from a more general family of nonlinear functions, the global minimizer of the pre-training loss still implements a single step of GD on a least-squares linear regression objective.
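    Summary Recent empirical work on in-context learning shows that transformers trained on synthetic linear regression tasks can learn ridge regression, the Bayes-optimal predictor, given sufficient capacity [Akyürek et al., 2023], while one-layer transformers with linear self-attention and no MLP layer learn one step of gradient descent (GD) on a least-squares objective [von Oswald et al., 2022]. The theory behind these observations is poorly understood. We study transformers with a single layer of linear self-attention trained on synthetic noisy linear regression data. With standard Gaussian covariates, the minimizer of the pre-training loss implements a single step of GD on the least-squares objective. With non-isotropic Gaussian covariates and weight vector, the global minimizer implements a single step of pre-conditioned GD. Changing only the response distribution, even to a more general family of nonlinear functions, leaves the learned algorithm as a single step of GD on a least-squares linear regression objective.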
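The link between one GD step and linear self-attention can be checked numerically with a small sketch. Starting from w = 0, one GD step on the loss (1/2n) Σ (w·x_i − y_i)² gives w = (lr/n) Σ y_i x_i, so the prediction for a query x_q is (lr/n) Σ y_i ⟨x_i, x_q⟩: a sum of values y_i weighted by un-normalized key-query inner products, which is the form a linear attention head computes. The learning rate `lr`, the zero initialization, and the toy data below are assumptions for illustration; the sketch does not reproduce the paper's transformer parameterization.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def one_step_gd_prediction(xs, ys, x_query, lr):
    """Predict with the weights obtained after one GD step from w = 0
    on the loss (1 / 2n) * sum_i (w . x_i - y_i)^2."""
    n = len(xs)
    dim = len(x_query)
    # Gradient at w = 0 is -(1 / n) * sum_i y_i * x_i, so w_1 = (lr / n) * sum_i y_i * x_i.
    w = [lr / n * sum(y * x[k] for x, y in zip(xs, ys)) for k in range(dim)]
    return dot(w, x_query)

def linear_attention_prediction(xs, ys, x_query, lr):
    """Same prediction written as linear attention:
    keys x_i, values y_i, query x_query, no softmax."""
    n = len(xs)
    return lr / n * sum(y * dot(x, x_query) for x, y in zip(xs, ys))

# Toy in-context examples and a query (arbitrary illustrative values).
xs = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
ys = [1.0, -1.0, 2.0]
x_query = [0.5, -1.0]

gd_pred = one_step_gd_prediction(xs, ys, x_query, lr=0.1)
attn_pred = linear_attention_prediction(xs, ys, x_query, lr=0.1)
```

Both functions return the same number because they are the same bilinear expression with the sums taken in a different order.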

Smoothing the Edges: A General Framework for Smooth Optimization in Sparse Regularization using Hadamard Overparametrization

  • paper_url: http://arxiv.org/abs/2307.03571
  • repo_url: None
  • paper_authors: Chris Kolb, Christian L. Müller, Bernd Bischl, David Rügamer
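  • for: This paper presents a framework for smooth optimization of objectives with $\ell_q$ and $\ell_{p,q}$ regularization to obtain (structured) sparsity.
  • methods: The method works with off-the-shelf (stochastic) gradient descent rather than specialized optimization routines, enabling differentiable sparse regularization without approximations. It overparametrizes selected model parameters and then changes the penalty.
  • results: The surrogate problem has the same global optimum as the original and exactly preserves its local minima. The paper also gives an integrative overview that consolidates several strands of literature on sparsity-inducing parametrizations and extends existing approaches. In numerical experiments, the method matches or outperforms common implementations of convex and non-convex regularizers.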
    Abstract This paper presents a framework for smooth optimization of objectives with $\ell_q$ and $\ell_{p,q}$ regularization for (structured) sparsity. Finding solutions to these non-smooth and possibly non-convex problems typically relies on specialized optimization routines. In contrast, the method studied here is compatible with off-the-shelf (stochastic) gradient descent that is ubiquitous in deep learning, thereby enabling differentiable sparse regularization without approximations. The proposed optimization transfer comprises an overparametrization of selected model parameters followed by a change of penalties. In the overparametrized problem, smooth and convex $\ell_2$ regularization induces non-smooth and non-convex regularization in the original parametrization. We show that the resulting surrogate problem not only has an identical global optimum but also exactly preserves the local minima. This is particularly useful in non-convex regularization, where finding global solutions is NP-hard and local minima often generalize well. We provide an integrative overview that consolidates various literature strands on sparsity-inducing parametrizations in a general setting and meaningfully extend existing approaches. The feasibility of our approach is evaluated through numerical experiments, demonstrating its effectiveness by matching or outperforming common implementations of convex and non-convex regularizers.
    Summary The paper proposes a smooth optimization framework for objectives with $\ell_q$ and $\ell_{p,q}$ regularization for (structured) sparsity. These non-smooth and possibly non-convex problems usually need specialized optimization routines, whereas the proposed method works with the (stochastic) gradient descent that is ubiquitous in deep learning and achieves differentiable sparse regularization without approximations. The optimization transfer overparametrizes selected model parameters and then changes the penalty: smooth and convex $\ell_2$ regularization in the overparametrized problem induces non-smooth and non-convex regularization in the original parametrization. The surrogate problem has an identical global optimum and exactly preserves the local minima, which is particularly useful in non-convex regularization, where finding global solutions is NP-hard and local minima often generalize well. The paper also provides an integrative overview of the literature on sparsity-inducing parametrizations, and numerical experiments show that the approach matches or outperforms common implementations of convex and non-convex regularizers.
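
    The simplest instance of this idea is the well-known two-factor case: writing $w = u \odot v$ and penalizing $\frac{\lambda}{2}(\|u\|_2^2 + \|v\|_2^2)$ induces $\lambda\|w\|_1$ on $w$, because $\min_{uv=w} (u^2+v^2)/2 = |w|$ by the AM-GM inequality. The Python sketch below checks this on a toy denoising problem by comparing plain gradient descent on the smooth surrogate with the closed-form soft-thresholding solution of the $\ell_1$ problem. The toy objective, initialization, step size, and function names are assumptions for illustration; the paper's general $\ell_q$ and $\ell_{p,q}$ constructions are not reproduced.

```python
import numpy as np


def soft_threshold(a, lam):
    """Closed-form minimizer of 0.5 * (w - a)**2 + lam * |w|, elementwise."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def smooth_surrogate_solution(a, lam, lr=0.05, steps=5000):
    """Plain gradient descent on the smooth surrogate
    0.5 * ||u * v - a||^2 + 0.5 * lam * (||u||^2 + ||v||^2)."""
    # Asymmetric initialization (an assumption of this sketch) so that the
    # product u * v is able to change sign during optimization.
    u = np.ones_like(a)
    v = np.full_like(a, 0.5)
    for _ in range(steps):
        residual = u * v - a
        grad_u = residual * v + lam * u
        grad_v = residual * u + lam * v
        u, v = u - lr * grad_u, v - lr * grad_v
    return u * v  # map back to the original parametrization


a = np.array([2.0, 0.3, -1.0])   # toy targets
lam = 0.5
w_soft = soft_threshold(a, lam)              # solution of the l1 problem
w_smooth = smooth_surrogate_solution(a, lam) # solution via smooth surrogate
print(w_soft, w_smooth)
```

    The entry with $|a| \le \lambda$ is driven to zero by gradient descent on the smooth problem, which is the sparsity effect the abstract describes; no proximal operator or subgradient is used in the optimization loop.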

MALIBO: Meta-learning for Likelihood-free Bayesian Optimization

  • paper_url: http://arxiv.org/abs/2307.03565
  • repo_url: None
  • paper_authors: Jiarong Pan, Stefan Falkner, Felix Berkenkamp, Joaquin Vanschoren
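  • for: Proposes a meta-learning approach to Bayesian optimization (BO) that optimizes costly black-box functions faster.
  • methods: Bypasses the surrogate model and directly learns the utility of queries across tasks. The method explicitly models task uncertainty and includes an auxiliary model for robust adaptation to new tasks.
  • results: Experiments show strong anytime performance and better results than state-of-the-art meta-learning BO methods on various benchmarks.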
    Abstract Bayesian optimization (BO) is a popular method to optimize costly black-box functions. While traditional BO optimizes each new target task from scratch, meta-learning has emerged as a way to leverage knowledge from related tasks to optimize new tasks faster. However, existing meta-learning BO methods rely on surrogate models that suffer from scalability issues and are sensitive to observations with different scales and noise types across tasks. Moreover, they often overlook the uncertainty associated with task similarity. This leads to unreliable task adaptation when only limited observations are obtained or when the new tasks differ significantly from the related tasks. To address these limitations, we propose a novel meta-learning BO approach that bypasses the surrogate model and directly learns the utility of queries across tasks. Our method explicitly models task uncertainty and includes an auxiliary model to enable robust adaptation to new tasks. Extensive experiments show that our method demonstrates strong anytime performance and outperforms state-of-the-art meta-learning BO methods in various benchmarks.
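    Summary Bayesian optimization (BO) is a popular method for optimizing costly black-box functions. Traditional BO optimizes each new task from scratch, while meta-learning reuses knowledge from related tasks to optimize new tasks faster. Existing meta-learning BO methods rely on surrogate models that scale poorly and are sensitive to observations with different scales and noise types across tasks, and they often overlook the uncertainty associated with task similarity, which makes adaptation unreliable when observations are limited or the new task differs significantly from the related tasks. The proposed method bypasses the surrogate model, directly learns the utility of queries across tasks, explicitly models task uncertainty, and uses an auxiliary model for robust adaptation. Experiments show strong anytime performance and better results than state-of-the-art meta-learning BO methods on various benchmarks.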

DWReCO at CheckThat! 2023: Enhancing Subjectivity Detection through Style-based Data Sampling

  • paper_url: http://arxiv.org/abs/2307.03550
  • repo_url: None
  • paper_authors: Ipek Baris Schlicht, Lynn Khellaf, Defne Altiok
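  • for: Describes a submission to the subjectivity detection task at the CheckThat! Lab, addressing the task's class imbalance.
  • methods: Generates additional training material with GPT-3 models, using prompts of different styles drawn from a subjectivity checklist based on a journalistic perspective, and fine-tunes language-specific transformer models on the extended training set.
  • results: In experiments on English, German and Turkish, the different subjective styles are effective across all languages, and style-based oversampling works better than paraphrasing in Turkish and English. The GPT-3 models sometimes produce lacklustre results when generating style-based text in non-English languages.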
    Abstract This paper describes our submission for the subjectivity detection task at the CheckThat! Lab. To tackle class imbalances in the task, we have generated additional training materials with GPT-3 models using prompts of different styles from a subjectivity checklist based on journalistic perspective. We used the extended training set to fine-tune language-specific transformer models. Our experiments in English, German and Turkish demonstrate that different subjective styles are effective across all languages. In addition, we observe that the style-based oversampling is better than paraphrasing in Turkish and English. Lastly, the GPT-3 models sometimes produce lacklustre results when generating style-based texts in non-English languages.
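    Summary The paper describes the authors' submission to the subjectivity detection task at the CheckThat! Lab. To address class imbalance, additional training material was generated with GPT-3 models, using prompts of different styles from a subjectivity checklist based on a journalistic perspective, and the extended training set was used to fine-tune language-specific transformer models. The experiments show that the different subjective styles are effective across all languages and that style-based oversampling is better than paraphrasing in Turkish and English. The GPT-3 models sometimes produce lacklustre results when generating style-based text in non-English languages.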

Dynamic Graph Attention for Anomaly Detection in Heterogeneous Sensor Networks

  • paper_url: http://arxiv.org/abs/2307.03761
  • repo_url: https://github.com/MengjieZhao/dygatad
  • paper_authors: Mengjie Zhao, Olga Fink
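  • for: Addresses anomaly detection in the multivariate time series (MTS) data produced by monitored systems with heterogeneous sensor networks, in particular collective anomalies caused by shifts in interrelationships within the system.
  • methods: Proposes DyGATAD (Dynamic Graph Attention for Anomaly Detection), a graph-based framework that uses the attention mechanism to build a continuous graph representation of the MTS by inferring dynamic edges between time series, combined with operating-condition-aware reconstruction and a topology-based anomaly score.
  • results: On a synthetic dataset with controlled fault severity levels and an industrial-scale multiphase flow facility benchmark, DyGATAD performs strongly in collective anomaly detection, particularly in early-stage fault detection, even for faults of minimal severity.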
    Abstract In the era of digital transformation, systems monitored by the Industrial Internet of Things (IIoTs) generate large amounts of Multivariate Time Series (MTS) data through heterogeneous sensor networks. While this data facilitates condition monitoring and anomaly detection, the increasing complexity and interdependencies within the sensor network pose significant challenges for anomaly detection. Despite progress in this field, much of the focus has been on point anomalies and contextual anomalies, with lesser attention paid to collective anomalies. A less addressed but common variant of collective anomalies is when the abnormal collective behavior is caused by shifts in interrelationships within the system. This can be due to abnormal environmental conditions like overheating, improper operational settings resulting from cyber-physical attacks, or system-level faults. To address these challenges, this paper proposes DyGATAD (Dynamic Graph Attention for Anomaly Detection), a graph-based anomaly detection framework that leverages the attention mechanism to construct a continuous graph representation of multivariate time series by inferring dynamic edges between time series. DyGATAD incorporates an operating condition-aware reconstruction combined with a topology-based anomaly score, thereby enhancing the detection ability of relationship shifts. We evaluate the performance of DyGATAD using both a synthetic dataset with controlled varying fault severity levels and an industrial-scale multiphase flow facility benchmark featuring various fault types with different detection difficulties. Our proposed approach demonstrated superior performance in collective anomaly detection for sensor networks, showing particular strength in early-stage fault detection, even in the case of faults with minimal severity.
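    Summary Systems monitored by the Industrial Internet of Things (IIoT) generate large amounts of multivariate time series (MTS) data that support condition monitoring and anomaly detection, but the growing complexity and interdependencies of sensor networks make anomaly detection difficult. Most prior work focuses on point and contextual anomalies and pays less attention to collective anomalies. A common but rarely addressed case is abnormal collective behavior caused by shifts in interrelationships within the system, for example due to abnormal environmental conditions such as overheating, improper operational settings resulting from cyber-physical attacks, or system-level faults. DyGATAD (Dynamic Graph Attention for Anomaly Detection) is a graph-based framework that uses the attention mechanism to construct a continuous graph representation of the MTS, and combines operating-condition-aware reconstruction with a topology-based anomaly score to improve detection of relationship shifts. Evaluated on a synthetic dataset and an industrial-scale multiphase flow facility benchmark, it performs strongly in collective anomaly detection, especially for early-stage and low-severity faults.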

Roman Numeral Analysis with Graph Neural Networks: Onset-wise Predictions from Note-wise Features

  • paper_url: http://arxiv.org/abs/2307.03544
  • repo_url: https://github.com/manoskary/chordgnn
  • paper_authors: Emmanouil Karystinaios, Gerhard Widmer
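  • for: Proposes a new approach to automatic Roman Numeral analysis of symbolic music.
  • methods: Based on Graph Neural Networks (GNNs) that directly process each individual note's features and the interdependencies between notes; a novel edge contraction algorithm turns the note-wise representations into onset-wise representations.
  • results: ChordGNN achieves higher accuracy than existing state-of-the-art models on the reference datasets. The authors also investigate model variants using techniques such as NADE and post-processing of the chord predictions. The full source code is available on GitHub.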
    Abstract Roman Numeral analysis is the important task of identifying chords and their functional context in pieces of tonal music. This paper presents a new approach to automatic Roman Numeral analysis in symbolic music. While existing techniques rely on an intermediate lossy representation of the score, we propose a new method based on Graph Neural Networks (GNNs) that enable the direct description and processing of each individual note in the score. The proposed architecture can leverage notewise features and interdependencies between notes but yield onset-wise representation by virtue of our novel edge contraction algorithm. Our results demonstrate that ChordGNN outperforms existing state-of-the-art models, achieving higher accuracy in Roman Numeral analysis on the reference datasets. In addition, we investigate variants of our model using proposed techniques such as NADE, and post-processing of the chord predictions. The full source code for this work is available at https://github.com/manoskary/chordgnn
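    Summary Roman Numeral analysis is the task of identifying chords and their functional context in pieces of tonal music. The paper proposes a new approach to automatic Roman Numeral analysis in symbolic music based on Graph Neural Networks (GNNs), which describe and process each individual note in the score directly instead of relying on an intermediate lossy representation. The architecture uses note-wise features and interdependencies between notes, and a novel edge contraction algorithm yields onset-wise representations. ChordGNN outperforms existing state-of-the-art models on the reference datasets. The authors also study variants that use techniques such as NADE and post-processing of the chord predictions. The full source code is available at https://github.com/manoskary/chordgnn.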

Do DL models and training environments have an impact on energy consumption?

  • paper_url: http://arxiv.org/abs/2307.05520
  • repo_url: https://github.com/GAISSA-UPC/seaa2023_ect
  • paper_authors: Santiago del Rey, Silverio Martínez-Fernández, Luís Cruz, Xavier Franch
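  • for: Reducing the carbon footprint of training deep learning models.
  • methods: Analyzes how model architecture and training environment affect the training of greener computer vision models.
  • results: Selecting a suitable model architecture and training environment can reduce energy consumption by up to 98.83% at the cost of negligible decreases in correctness. There is also evidence that GPUs should scale with the models' computational complexity for better energy efficiency.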
    Abstract Current research in the computer vision field mainly focuses on improving Deep Learning (DL) correctness and inference time performance. However, there is still little work on the huge carbon footprint that has training DL models. This study aims to analyze the impact of the model architecture and training environment when training greener computer vision models. We divide this goal into two research questions. First, we analyze the effects of model architecture on achieving greener models while keeping correctness at optimal levels. Second, we study the influence of the training environment on producing greener models. To investigate these relationships, we collect multiple metrics related to energy efficiency and model correctness during the models' training. Then, we outline the trade-offs between the measured energy efficiency and the models' correctness regarding model architecture, and their relationship with the training environment. We conduct this research in the context of a computer vision system for image classification. In conclusion, we show that selecting the proper model architecture and training environment can reduce energy consumption dramatically (up to 98.83%) at the cost of negligible decreases in correctness. Also, we find evidence that GPUs should scale with the models' computational complexity for better energy efficiency.
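    Summary Current computer vision research mainly focuses on improving deep learning (DL) correctness and inference time, and there is still little work on the large carbon footprint of training DL models. The study analyzes the impact of model architecture and training environment on training greener computer vision models through two research questions: how model architecture affects energy efficiency while keeping correctness at optimal levels, and how the training environment influences energy efficiency. The authors collect several metrics on energy efficiency and model correctness during training and describe the trade-offs between them, in the context of an image classification system. Selecting the proper model architecture and training environment can reduce energy consumption by up to 98.83% with a negligible decrease in correctness, and GPUs should scale with the models' computational complexity for better energy efficiency.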

Contrastive Graph Pooling for Explainable Classification of Brain Networks

  • paper_url: http://arxiv.org/abs/2307.11133
  • repo_url: None
  • paper_authors: Jiaxing Xu, Qingtian Bian, Xinhang Li, Aihu Zhang, Yiping Ke, Miao Qiao, Wei Zhang, Wei Khang Jeremy Sim, Balázs Gulyás
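  • for: Studies the analysis of functional magnetic resonance imaging (fMRI) data, in particular for identifying neurodegenerative conditions such as Parkinson's disease, Alzheimer's disease, and autism spectrum disorder.
  • methods: Uses graph neural networks (GNNs) to extract features, which requires a design tailored to the characteristics of fMRI data. The paper proposes a contrastive dual-attention block and a differentiable graph pooling method called ContrastPool to make better use of GNNs on brain networks.
  • results: Applied to 5 resting-state fMRI brain network datasets, the method outperforms state-of-the-art baselines. A case study shows that the extracted patterns match domain knowledge from the neuroscience literature and reveal direct insights, which underscores the potential of ContrastPool for understanding brain networks and neurodegenerative conditions.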
    Abstract Functional magnetic resonance imaging (fMRI) is a commonly used technique to measure neural activation. Its application has been particularly important in identifying underlying neurodegenerative conditions such as Parkinson's, Alzheimer's, and Autism. Recent analysis of fMRI data models the brain as a graph and extracts features by graph neural networks (GNNs). However, the unique characteristics of fMRI data require a special design of GNN. Tailoring GNN to generate effective and domain-explainable features remains challenging. In this paper, we propose a contrastive dual-attention block and a differentiable graph pooling method called ContrastPool to better utilize GNN for brain networks, meeting fMRI-specific requirements. We apply our method to 5 resting-state fMRI brain network datasets of 3 diseases and demonstrate its superiority over state-of-the-art baselines. Our case study confirms that the patterns extracted by our method match the domain knowledge in neuroscience literature, and disclose direct and interesting insights. Our contributions underscore the potential of ContrastPool for advancing the understanding of brain networks and neurodegenerative conditions.
    Summary Functional magnetic resonance imaging (fMRI) is a commonly used technique for measuring neural activation and has been particularly important for identifying neurodegenerative conditions such as Parkinson's, Alzheimer's, and Autism. Recent analyses model the brain as a graph and extract features with graph neural networks (GNNs), but fMRI data require a tailored GNN design, and generating effective, domain-explainable features remains challenging. The paper proposes a contrastive dual-attention block and a differentiable graph pooling method called ContrastPool. On 5 resting-state fMRI brain network datasets covering 3 diseases, the method outperforms state-of-the-art baselines, and a case study confirms that the extracted patterns match domain knowledge in the neuroscience literature.

Incentive Allocation in Vertical Federated Learning Based on Bankruptcy Problem

  • paper_url: http://arxiv.org/abs/2307.03515
  • repo_url: None
  • paper_authors: Afsana Khan, Marijn ten Thij, Frank Thuijsman, Anna Wilbik
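  • for: Studies how the active party in vertical federated learning (VFL), which holds the labels, can allocate incentives to the passive parties, which hold additional unlabeled features, based on their contributions.
  • methods: Formulates the problem as a variant of the Nucleolus game theory concept known as the Bankruptcy Problem and solves it with the Talmud's division rule.
  • results: Experiments on synthetic and real-world datasets show that the method ensures fairness and stability in incentive allocation among passive parties, and that it is more efficient than computing Shapley values, requiring fewer computations.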
    Abstract Vertical federated learning (VFL) is a promising approach for collaboratively training machine learning models using private data partitioned vertically across different parties. Ideally in a VFL setting, the active party (party possessing features of samples with labels) benefits by improving its machine learning model through collaboration with some passive parties (parties possessing additional features of the same samples without labels) in a privacy preserving manner. However, motivating passive parties to participate in VFL can be challenging. In this paper, we focus on the problem of allocating incentives to the passive parties by the active party based on their contributions to the VFL process. We formulate this problem as a variant of the Nucleolus game theory concept, known as the Bankruptcy Problem, and solve it using the Talmud's division rule. We evaluate our proposed method on synthetic and real-world datasets and show that it ensures fairness and stability in incentive allocation among passive parties who contribute their data to the federated model. Additionally, we compare our method to the existing solution of calculating Shapley values and show that our approach provides a more efficient solution with fewer computations.
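    Summary Vertical federated learning (VFL) is a promising approach for collaboratively training machine learning models on private data that is partitioned vertically across parties. The active party, which holds features of labeled samples, improves its model by collaborating in a privacy-preserving manner with passive parties, which hold additional features of the same samples without labels. Motivating passive parties to participate can be challenging. The paper addresses how the active party should allocate incentives to passive parties according to their contributions, formulates this as the Bankruptcy Problem, a variant of the Nucleolus game theory concept, and solves it with the Talmud's division rule. Evaluation on synthetic and real-world datasets shows that the allocation is fair and stable, and that the method is more efficient than computing Shapley values, requiring fewer computations.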
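
    For readers unfamiliar with the Talmud division rule, the following Python sketch implements the standard textbook form of the rule for a bankruptcy problem: claimants first share the estate equally up to half of their claims, and once the estate exceeds half of the total claims, losses are shared equally up to half of each claim. Mapping the "estate" to an incentive budget and the "claims" to contribution scores of passive parties is an assumption made here for illustration; how the paper measures contributions is not shown, and the function names are hypothetical.

```python
def constrained_equal_awards(claims, estate):
    """Give each claimant min(claim, equal share), serving small claims first.

    Assumes 0 <= estate <= sum(claims).
    """
    awards = [0.0] * len(claims)
    remaining = float(estate)
    order = sorted(range(len(claims)), key=lambda i: claims[i])
    for position, i in enumerate(order):
        # Equal split of what is left among claimants not yet served.
        equal_share = remaining / (len(claims) - position)
        awards[i] = min(float(claims[i]), equal_share)
        remaining -= awards[i]
    return awards


def talmud_rule(claims, estate):
    """Talmud division rule for a bankruptcy problem (claims, estate)."""
    half_claims = [c / 2 for c in claims]
    if estate <= sum(half_claims):
        # Small estate: equal awards, capped at half of each claim.
        return constrained_equal_awards(half_claims, estate)
    # Large estate: equal losses, capped at half of each claim.
    total_loss = sum(claims) - estate
    losses = constrained_equal_awards(half_claims, total_loss)
    return [c - loss for c, loss in zip(claims, losses)]


# Hypothetical contribution scores of three passive parties and three budgets.
contributions = [100, 200, 300]
for budget in (100, 200, 300):
    print(budget, talmud_rule(contributions, budget))
```

    The rule needs only a sort and a linear pass over the parties, whereas exact Shapley values require evaluating a number of coalitions that grows exponentially with the number of parties; this is consistent with the efficiency claim in the abstract.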

DEFT: Exploiting Gradient Norm Difference between Model Layers for Scalable Gradient Sparsification

  • paper_url: http://arxiv.org/abs/2307.03500
  • repo_url: https://github.com/kljp/deft
  • paper_authors: Daegun Yoon, Sangyoon Oh
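  • for: Reducing the excessive communication traffic in distributed deep learning with a new, scalable gradient sparsification scheme, DEFT.
  • methods: DEFT partitions the gradient selection task into sub-tasks and distributes them to workers, which lowers the computational cost as the number of workers grows and eliminates gradient build-up.
  • results: In the empirical evaluation, DEFT selects gradients significantly faster than existing sparsifiers while keeping high convergence performance.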
    Abstract Gradient sparsification is a widely adopted solution for reducing the excessive communication traffic in distributed deep learning. However, most existing gradient sparsifiers have relatively poor scalability because of considerable computational cost of gradient selection and/or increased communication traffic owing to gradient build-up. To address these challenges, we propose a novel gradient sparsification scheme, DEFT, that partitions the gradient selection task into sub tasks and distributes them to workers. DEFT differs from existing sparsifiers, wherein every worker selects gradients among all gradients. Consequently, the computational cost can be reduced as the number of workers increases. Moreover, gradient build-up can be eliminated because DEFT allows workers to select gradients in partitions that are non-intersecting (between workers). Therefore, even if the number of workers increases, the communication traffic can be maintained as per user requirement. To avoid the loss of significance of gradient selection, DEFT selects more gradients in the layers that have a larger gradient norm than the other layers. Because every layer has a different computational load, DEFT allocates layers to workers using a bin-packing algorithm to maintain a balanced load of gradient selection between workers. In our empirical evaluation, DEFT shows a significant improvement in training performance in terms of speed in gradient selection over existing sparsifiers while achieving high convergence performance.
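    Summary Gradient sparsification is widely used to reduce excessive communication traffic in distributed deep learning, but most existing sparsifiers scale poorly because of the computational cost of gradient selection and/or the extra traffic caused by gradient build-up. DEFT partitions the gradient selection task into sub-tasks and distributes them to workers, unlike existing sparsifiers in which every worker selects among all gradients, so the computational cost falls as the number of workers grows. Gradient build-up is eliminated because workers select gradients in non-intersecting partitions, so communication traffic stays at the level the user requires even with more workers. To preserve the significance of the selection, DEFT selects more gradients in layers with a larger gradient norm, and because layers differ in computational load, it allocates layers to workers with a bin-packing algorithm to balance the load. In the evaluation, DEFT is significantly faster at gradient selection than existing sparsifiers while achieving high convergence performance.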
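
    The three ingredients named in the abstract (a per-layer selection budget that grows with the gradient norm, bin-packing of layers onto workers, and selection within disjoint partitions) can be sketched in Python as follows. This is only a rough illustration under stated assumptions: the proportional allocation formula, the greedy longest-first bin-packing heuristic, the use of layer size as the load estimate, and all function names are choices made for this sketch and are not taken from the DEFT paper or its repository.

```python
import numpy as np


def allocate_k_by_norm(layer_grads, total_k):
    """Split a total budget of selected gradients across layers in
    proportion to each layer's gradient L2 norm (assumed formula)."""
    norms = np.array([np.linalg.norm(g) for g in layer_grads])
    sizes = np.array([g.size for g in layer_grads])
    raw = total_k * norms / norms.sum()
    return np.minimum(np.floor(raw).astype(int), sizes)


def assign_layers_to_workers(layer_costs, num_workers):
    """Greedy bin packing: place the most expensive layer on the
    currently least-loaded worker (assumed heuristic)."""
    loads = [0.0] * num_workers
    assignment = [[] for _ in range(num_workers)]
    for layer in sorted(range(len(layer_costs)), key=lambda i: -layer_costs[i]):
        worker = min(range(num_workers), key=lambda j: loads[j])
        assignment[worker].append(layer)
        loads[worker] += layer_costs[layer]
    return assignment


def top_k_indices(grad, k):
    """Indices of the k largest-magnitude entries of a flattened gradient."""
    if k == 0:
        return np.array([], dtype=int)
    magnitudes = np.abs(grad.ravel())
    return np.argpartition(magnitudes, -k)[-k:]


# Four synthetic layers with different sizes and gradient scales.
rng = np.random.default_rng(0)
layer_grads = [rng.standard_normal(n) * s
               for n, s in [(100, 1.0), (400, 0.1), (50, 5.0), (200, 0.5)]]
total_k = 60

k_per_layer = allocate_k_by_norm(layer_grads, total_k)
assignment = assign_layers_to_workers([g.size for g in layer_grads], num_workers=2)

# Each worker selects only within its own layers, so partitions never overlap.
selected = {layer: top_k_indices(layer_grads[layer], int(k_per_layer[layer]))
            for worker_layers in assignment for layer in worker_layers}
print(k_per_layer, assignment)
```

    Because every layer belongs to exactly one worker, the union of the workers' selections never exceeds `total_k`, which is the property the abstract describes as eliminating gradient build-up.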

HoughLaneNet: Lane Detection with Deep Hough Transform and Dynamic Convolution

  • paper_url: http://arxiv.org/abs/2307.03494
  • repo_url: None
  • paper_authors: Jia-Qi Zhang, Hao-Bin Duan, Jun-Long Chen, Ariel Shamir, Miao Wang
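  • for: Improving the accuracy of lane detection for autonomous driving.
  • methods: Proposes a hierarchical Deep Hough Transform (DHT) approach that combines all lane features of an image in the Hough parameter space, together with a refined point selection method and a Dynamic Convolution Module that better differentiates between lanes.
  • results: Experiments show improved performance on heavily occluded or worn lane images; the method outperforms or is on par with state-of-the-art techniques.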
    Abstract The task of lane detection has garnered considerable attention in the field of autonomous driving due to its complexity. Lanes can present difficulties for detection, as they can be narrow, fragmented, and often obscured by heavy traffic. However, it has been observed that the lanes have a geometrical structure that resembles a straight line, leading to improved lane detection results when utilizing this characteristic. To address this challenge, we propose a hierarchical Deep Hough Transform (DHT) approach that combines all lane features in an image into the Hough parameter space. Additionally, we refine the point selection method and incorporate a Dynamic Convolution Module to effectively differentiate between lanes in the original image. Our network architecture comprises a backbone network, either a ResNet or Pyramid Vision Transformer, a Feature Pyramid Network as the neck to extract multi-scale features, and a hierarchical DHT-based feature aggregation head to accurately segment each lane. By utilizing the lane features in the Hough parameter space, the network learns dynamic convolution kernel parameters corresponding to each lane, allowing the Dynamic Convolution Module to effectively differentiate between lane features. Subsequently, the lane features are fed into the feature decoder, which predicts the final position of the lane. Our proposed network structure demonstrates improved performance in detecting heavily occluded or worn lane images, as evidenced by our extensive experimental results, which show that our method outperforms or is on par with state-of-the-art techniques.
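    Summary Lane detection has attracted considerable attention in autonomous driving because of its complexity: lanes can be narrow, fragmented, and obscured by heavy traffic. Lanes have a geometric structure resembling a straight line, and exploiting this improves detection. The paper proposes a hierarchical Deep Hough Transform (DHT) approach that combines all lane features of an image in the Hough parameter space, refines the point selection method, and adds a Dynamic Convolution Module to differentiate between lanes in the original image. The network consists of a backbone (a ResNet or a Pyramid Vision Transformer), a Feature Pyramid Network as the neck to extract multi-scale features, and a hierarchical DHT-based feature aggregation head that segments each lane. From the lane features in Hough parameter space, the network learns dynamic convolution kernel parameters for each lane, and a feature decoder then predicts the final lane positions. Extensive experiments show that the method outperforms or is on par with state-of-the-art techniques, particularly on heavily occluded or worn lanes.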
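
    As background for the Hough parameter space mentioned above, the sketch below implements the classical Hough transform on a set of 2D points: every point votes for all lines $\rho = x\cos\theta + y\sin\theta$ passing through it, and collinear points produce a peak in the $(\theta, \rho)$ accumulator. This is the classical point-voting version, shown only to illustrate the parameter space; the paper's Deep Hough Transform aggregates learned CNN features rather than binary points, and the function name and test data here are assumptions.

```python
import numpy as np


def hough_accumulate(points, thetas_deg, rho_max):
    """Vote each (x, y) point into a (theta, rho) accumulator using the
    normal form rho = x * cos(theta) + y * sin(theta)."""
    thetas = np.deg2rad(thetas_deg)
    accumulator = np.zeros((len(thetas), 2 * rho_max + 1), dtype=int)
    rows = np.arange(len(thetas))
    for x, y in points:
        rhos = np.rint(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        accumulator[rows, rhos + rho_max] += 1  # shift so indices are >= 0
    return accumulator


# 50 points on the line y = x + 10 (a stand-in for pixels of a straight lane).
points = [(x, x + 10) for x in range(50)]
thetas_deg = np.arange(180)   # 1-degree resolution
rho_max = 77                  # at least the largest point norm, hypot(49, 59)

accumulator = hough_accumulate(points, thetas_deg, rho_max)
theta_idx, rho_idx = np.unravel_index(np.argmax(accumulator), accumulator.shape)
best_theta = int(thetas_deg[theta_idx])
best_rho = int(rho_idx) - rho_max
print(best_theta, best_rho, accumulator.max())
```

    The line $y = x + 10$ has normal form $\theta = 135^\circ$, $\rho = 10/\sqrt{2} \approx 7.07$, so all 50 votes fall into the bin $(135, 7)$. A fragmented or partly occluded line still concentrates its remaining votes in the same bin, which is the property that makes the Hough space attractive for lanes.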

ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers

  • paper_url: http://arxiv.org/abs/2307.03493
  • repo_url: None
  • paper_authors: Gamze İslamoğlu, Moritz Scherer, Gianna Paulin, Tim Fischer, Victor J. B. Jung, Angelo Garofalo, Luca Benini
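  • for: Proposes an efficient accelerator architecture for transformers and related models, targeting inference on embedded systems.
  • methods: Uses 8-bit quantization and an innovative softmax implementation that operates exclusively on integer values and computes on-the-fly in streaming mode, which minimizes data movement and energy consumption.
  • results: In 22 nm fully-depleted silicon-on-insulator technology at 0.8 V, ITA reaches 16.9 TOPS/W, competitive with state-of-the-art transformer accelerators in energy efficiency, and outperforms them in area efficiency with 5.93 TOPS/mm$^2$.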
    Abstract Transformer networks have emerged as the state-of-the-art approach for natural language processing tasks and are gaining popularity in other domains such as computer vision and audio processing. However, the efficient hardware acceleration of transformer models poses new challenges due to their high arithmetic intensities, large memory requirements, and complex dataflow dependencies. In this work, we propose ITA, a novel accelerator architecture for transformers and related models that targets efficient inference on embedded systems by exploiting 8-bit quantization and an innovative softmax implementation that operates exclusively on integer values. By computing on-the-fly in streaming mode, our softmax implementation minimizes data movement and energy consumption. ITA achieves competitive energy efficiency with respect to state-of-the-art transformer accelerators with 16.9 TOPS/W, while outperforming them in area efficiency with 5.93 TOPS/mm$^2$ in 22 nm fully-depleted silicon-on-insulator technology at 0.8 V.
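    Summary Transformer networks are the state-of-the-art approach for natural language processing and are gaining popularity in computer vision and audio processing. Accelerating them efficiently in hardware is challenging because of their high arithmetic intensity, large memory requirements, and complex dataflow dependencies. ITA is a novel accelerator architecture for transformers and related models that targets efficient inference on embedded systems through 8-bit quantization and a softmax implementation that operates exclusively on integer values. By computing on-the-fly in streaming mode, the softmax minimizes data movement and energy consumption. In 22 nm fully-depleted silicon-on-insulator technology at 0.8 V, ITA achieves 16.9 TOPS/W, competitive with state-of-the-art transformer accelerators, and outperforms them in area efficiency with 5.93 TOPS/mm$^2$.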
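
    The idea of computing softmax in streaming mode can be illustrated with the generic "online softmax" technique, which keeps a running maximum and a rescaled running denominator so that the inputs never have to be held in full before normalization starts. The Python sketch below uses floating point for clarity and is only an analogy: ITA's integer-only arithmetic and its hardware dataflow are not reproduced, and the function name and chunking scheme are assumptions.

```python
import numpy as np


def streaming_softmax(chunks):
    """Softmax over the concatenation of `chunks`, with the denominator
    accumulated chunk by chunk (floating-point sketch)."""
    running_max = -np.inf
    running_sum = 0.0
    # Pass 1: stream over the chunks, updating the max and the denominator.
    for chunk in chunks:
        new_max = max(running_max, float(chunk.max()))
        # Rescale the old denominator to the new reference maximum.
        running_sum = running_sum * np.exp(running_max - new_max)
        running_sum += float(np.exp(chunk - new_max).sum())
        running_max = new_max
    # Pass 2: normalize each chunk with the final max and denominator.
    return [np.exp(chunk - running_max) / running_sum for chunk in chunks]


rng = np.random.default_rng(0)
logits = rng.standard_normal(12) * 4.0
chunks = [logits[0:4], logits[4:8], logits[8:12]]   # arrives in three pieces

streamed = np.concatenate(streaming_softmax(chunks))
reference = np.exp(logits - logits.max())
reference /= reference.sum()
print(np.max(np.abs(streamed - reference)))  # difference is at rounding level
```

    Only two scalars are carried between chunks, which hints at why a streaming formulation reduces data movement; an integer-only version would additionally replace the exponential with fixed-point operations, which this sketch does not attempt.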

Adaptive Graph Convolution Networks for Traffic Flow Forecasting

  • paper_url: http://arxiv.org/abs/2307.05517
  • repo_url: https://github.com/zhengdaoli/agc-net
  • paper_authors: Zhengdao Li, Wei Li, Kai Hwang
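  • for: Traffic flow forecasting, a highly challenging task because road conditions change dynamically in both space and time.
  • methods: Proposes Adaptive Graph Convolution Networks (AGC-net) to address the fact that most existing graph neural networks (GNNs) ignore time-varying road conditions. AGC-net builds on a novel context attention mechanism consisting of a set of graph wavelets with learnable scales.
  • results: Experiments on two public traffic datasets show that AGC-net outperforms other baseline models significantly.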
    Abstract Traffic flow forecasting is a highly challenging task due to the dynamic spatial-temporal road conditions. Graph neural networks (GNN) has been widely applied in this task. However, most of these GNNs ignore the effects of time-varying road conditions due to the fixed range of the convolution receptive field. In this paper, we propose a novel Adaptive Graph Convolution Networks (AGC-net) to address this issue in GNN. The AGC-net is constructed by the Adaptive Graph Convolution (AGC) based on a novel context attention mechanism, which consists of a set of graph wavelets with various learnable scales. The AGC transforms the spatial graph representations into time-sensitive features considering the temporal context. Moreover, a shifted graph convolution kernel is designed to enhance the AGC, which attempts to correct the deviations caused by inaccurate topology. Experimental results on two public traffic datasets demonstrate the effectiveness of the AGC-net (code is available at https://github.com/zhengdaoli/AGC-net), which outperforms other baseline models significantly.
    Summary Traffic flow forecasting is highly challenging because road conditions are dynamic in both space and time. Graph neural networks (GNNs) are widely applied to the task, but most of them ignore time-varying road conditions because their convolution receptive field has a fixed range. The paper proposes Adaptive Graph Convolution Networks (AGC-net), built from an Adaptive Graph Convolution (AGC) based on a novel context attention mechanism that consists of a set of graph wavelets with learnable scales. The AGC turns spatial graph representations into time-sensitive features that take the temporal context into account, and a shifted graph convolution kernel is added to correct deviations caused by inaccurate topology. On two public traffic datasets, AGC-net significantly outperforms other baseline models.

Learning Theory of Distribution Regression with Neural Networks

  • paper_url: http://arxiv.org/abs/2307.03487
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Zhongjie Shi, Zhan Yu, Ding-Xuan Zhou
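  • for: Establishes an approximation theory and a learning theory of distribution regression via a fully connected neural network (FNN).
  • methods: Introduces a novel FNN framework that takes distributions as inputs, since classical neural networks require vector inputs and cannot be applied to distribution inputs directly.
  • results: Using a novel two-stage error decomposition technique, derives almost optimal learning rates, up to logarithmic terms, for the proposed distribution regression model.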
    Abstract In this paper, we aim at establishing an approximation theory and a learning theory of distribution regression via a fully connected neural network (FNN). In contrast to the classical regression methods, the input variables of distribution regression are probability measures. Then we often need to perform a second-stage sampling process to approximate the actual information of the distribution. On the other hand, the classical neural network structure requires the input variable to be a vector. When the input samples are probability distributions, the traditional deep neural network method cannot be directly used and the difficulty arises for distribution regression. A well-defined neural network structure for distribution inputs is intensively desirable. There is no mathematical model and theoretical analysis on neural network realization of distribution regression. To overcome technical difficulties and address this issue, we establish a novel fully connected neural network framework to realize an approximation theory of functionals defined on the space of Borel probability measures. Furthermore, based on the established functional approximation results, in the hypothesis space induced by the novel FNN structure with distribution inputs, almost optimal learning rates for the proposed distribution regression model up to logarithmic terms are derived via a novel two-stage error decomposition technique.
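    Summary The paper aims to establish an approximation theory and a learning theory of distribution regression via a fully connected neural network (FNN). Unlike classical regression, the inputs of distribution regression are probability measures, so a second-stage sampling process is often needed to approximate the actual information of the distribution. Classical neural network structures require vector inputs, so traditional deep neural network methods cannot be used directly, and no mathematical model or theoretical analysis existed for a neural network realization of distribution regression. The authors establish a novel FNN framework that realizes an approximation theory of functionals defined on the space of Borel probability measures. Based on these approximation results, they derive almost optimal learning rates, up to logarithmic terms, for the proposed distribution regression model via a novel two-stage error decomposition technique.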

Discovering Hierarchical Achievements in Reinforcement Learning via Contrastive Learning

  • paper_url: http://arxiv.org/abs/2307.03486
  • repo_url: None
  • paper_authors: Seungyong Moon, Junyoung Yeom, Bumsoo Park, Hyun Oh Song
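  • for: Discovering achievements with a hierarchical structure in procedurally generated environments.
  • methods: Proximal policy optimization (PPO) combined with a novel contrastive learning method called achievement distillation.
  • results: The PPO agent can predict the next achievement to be unlocked to some extent, though with low confidence; achievement distillation strengthens this ability and reaches state-of-the-art performance on the Crafter environment with fewer model parameters in a sample-efficient regime.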
    Abstract Discovering achievements with a hierarchical structure on procedurally generated environments poses a significant challenge. This requires agents to possess a broad range of abilities, including generalization and long-term reasoning. Many prior methods are built upon model-based or hierarchical approaches, with the belief that an explicit module for long-term planning would be beneficial for learning hierarchical achievements. However, these methods require an excessive amount of environment interactions or large model sizes, limiting their practicality. In this work, we identify that proximal policy optimization (PPO), a simple and versatile model-free algorithm, outperforms the prior methods with recent implementation practices. Moreover, we find that the PPO agent can predict the next achievement to be unlocked to some extent, though with low confidence. Based on this observation, we propose a novel contrastive learning method, called achievement distillation, that strengthens the agent's capability to predict the next achievement. Our method exhibits a strong capacity for discovering hierarchical achievements and shows state-of-the-art performance on the challenging Crafter environment using fewer model parameters in a sample-efficient regime.
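    Summary Discovering achievements with a hierarchical structure requires agents to have a broad range of abilities, including generalization and long-term reasoning. Many prior methods are model-based or hierarchical, on the assumption that an explicit long-term planning module helps in learning hierarchical achievements. However, they need many environment interactions or large models, which limits their practicality. We find that proximal policy optimization (PPO), a simple and versatile model-free algorithm, outperforms the prior methods when recent implementation practices are used. We also find that the PPO agent can predict the next achievement to be unlocked, though with low confidence. Based on this observation, we propose a new contrastive learning method, achievement distillation, that strengthens the agent's ability to predict the next achievement. Our method achieves state-of-the-art performance on the challenging Crafter environment, using fewer model parameters in a sample-efficient regime.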
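The abstract describes achievement distillation only as a contrastive method that strengthens next-achievement prediction. As a rough illustration of what a contrastive objective of this kind looks like, the sketch below computes a generic InfoNCE loss for one anchor. It is not the authors' loss: the embeddings, the temperature of 0.1, and all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor: -log softmax score of the positive."""
    a = normalize(anchor)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(a, c) / temperature for c in candidates]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Hypothetical embeddings: a state and three candidate "next achievements".
state_embedding = [1.0, 0.0]
next_achievement = [0.9, 0.1]
other_achievements = [[0.0, 1.0], [-1.0, 0.0]]

aligned = info_nce(state_embedding, next_achievement, other_achievements)
misaligned = info_nce(state_embedding, other_achievements[0],
                      [next_achievement, other_achievements[1]])
```

The loss is small when the state embedding is closest to the embedding of the achievement that actually follows, and large otherwise, which is the signal a contrastive method can use to sharpen a low-confidence prediction.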

Unpaired Multi-View Graph Clustering with Cross-View Structure Matching

  • paper_url: http://arxiv.org/abs/2307.03476
  • repo_url: https://github.com/wy1019/upmgc-sm
  • paper_authors: Yi Wen, Siwei Wang, Qing Liao, Weixuan Liang, Ke Liang, Xinhang Wan, Xinwang Liu
  • for: addresses the data-unpaired problem (DUP) in multi-view literature by proposing a novel parameter-free graph clustering framework.
  • methods: utilizes the structural information from each view to refine cross-view correspondences, and is a unified framework for both fully and partially unpaired multi-view graph clustering.
  • results: extensive experiments demonstrate the effectiveness and generalization of the proposed framework for both paired and unpaired datasets.
    Abstract Multi-view clustering (MVC), which effectively fuses information from multiple views for better performance, has received increasing attention. Most existing MVC methods assume that multi-view data are fully paired, which means that the mappings of all corresponding samples between views are pre-defined or given in advance. However, the data correspondence is often incomplete in real-world applications due to data corruption or sensor differences, referred as the data-unpaired problem (DUP) in multi-view literature. Although several attempts have been made to address the DUP issue, they suffer from the following drawbacks: 1) Most methods focus on the feature representation while ignoring the structural information of multi-view data, which is essential for clustering tasks; 2) Existing methods for partially unpaired problems rely on pre-given cross-view alignment information, resulting in their inability to handle fully unpaired problems; 3) Their inevitable parameters degrade the efficiency and applicability of the models. To tackle these issues, we propose a novel parameter-free graph clustering framework termed Unpaired Multi-view Graph Clustering framework with Cross-View Structure Matching (UPMGC-SM). Specifically, unlike the existing methods, UPMGC-SM effectively utilizes the structural information from each view to refine cross-view correspondences. Besides, our UPMGC-SM is a unified framework for both the fully and partially unpaired multi-view graph clustering. Moreover, existing graph clustering methods can adopt our UPMGC-SM to enhance their ability for unpaired scenarios. Extensive experiments demonstrate the effectiveness and generalization of our proposed framework for both paired and unpaired datasets.
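    Summary Multi-view clustering (MVC), which fuses information from multiple views, has received increasing attention in recent years. Most existing MVC methods assume that the data are fully paired, that is, that the mappings between corresponding samples across views are defined in advance. In real applications, however, the data correspondence is often incomplete, which is known as the data-unpaired problem (DUP). Several attempts have been made to address the DUP, but they have the following drawbacks: 1) most methods focus on feature representation and ignore the structural information of multi-view data, which is essential for clustering; 2) existing methods for partially unpaired problems rely on pre-given cross-view alignment information and cannot handle fully unpaired problems; 3) their parameters reduce the efficiency and applicability of the models. To address these issues, we propose a parameter-free graph clustering framework, Unpaired Multi-view Graph Clustering with Cross-View Structure Matching (UPMGC-SM). Unlike existing methods, UPMGC-SM uses the structural information of each view to refine cross-view correspondences. It is also a unified framework for both fully and partially unpaired multi-view graph clustering, and existing graph clustering methods can adopt it to improve their handling of unpaired scenarios. Extensive experiments demonstrate the effectiveness and generalization of the framework on both paired and unpaired datasets.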

Action-State Dependent Dynamic Model Selection

  • paper_url: http://arxiv.org/abs/2307.04754
  • repo_url: None
  • paper_authors: Francesco Cordoni, Alessio Sancetta
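  • for: Finding which model is best under given states of the world, and how to choose among models dynamically when switching is costly.
  • methods: A reinforcement learning algorithm that approximates the solution of the underlying dynamic programming problem and estimates the optimal policy from data.
  • results: In an empirical application using macroeconomic variables and price data, the approach outperforms choosing the best portfolio model with hindsight.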
    Abstract A model among many may only be best under certain states of the world. Switching from a model to another can also be costly. Finding a procedure to dynamically choose a model in these circumstances requires to solve a complex estimation procedure and a dynamic programming problem. A Reinforcement learning algorithm is used to approximate and estimate from the data the optimal solution to this dynamic programming problem. The algorithm is shown to consistently estimate the optimal policy that may choose different models based on a set of covariates. A typical example is the one of switching between different portfolio models under rebalancing costs, using macroeconomic information. Using a set of macroeconomic variables and price data, an empirical application to the aforementioned portfolio problem shows superior performance to choosing the best portfolio model with hindsight.
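    Summary A model among many may be best only under certain states of the world, and switching from one model to another can be costly. Finding a procedure to choose a model dynamically under these circumstances requires solving a complex estimation problem and a dynamic programming problem. A reinforcement learning algorithm is used to approximate and estimate the optimal solution from the data. The algorithm consistently estimates the optimal policy, which may choose different models depending on a set of covariates. A typical example is switching between portfolio models under rebalancing costs, using macroeconomic information. In an empirical application with a set of macroeconomic variables and price data, the approach outperforms choosing the best portfolio model with hindsight.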
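To see why switching costs turn model selection into a dynamic programming problem, the following sketch solves a toy version in which the per-period payoffs of each model are known in advance. This is only an illustration of the underlying problem, not the paper's reinforcement learning estimator, which learns a policy from covariates. The payoff table, the cost of 0.5, and the function names are made up for the example.

```python
def best_schedule(rewards, switch_cost):
    """rewards[t][m] is the payoff of model m in period t. Returns (total, path)."""
    n_models = len(rewards[0])
    value = list(rewards[0])  # best total payoff ending in model m at t = 0
    back = []
    for t in range(1, len(rewards)):
        new_value, pointers = [], []
        for m in range(n_models):
            # Best predecessor, paying the cost only when the model changes.
            prev = max(range(n_models),
                       key=lambda p: value[p] - (switch_cost if p != m else 0.0))
            penalty = switch_cost if prev != m else 0.0
            pointers.append(prev)
            new_value.append(value[prev] - penalty + rewards[t][m])
        value = new_value
        back.append(pointers)
    last = max(range(n_models), key=lambda m: value[m])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return value[last], path

def greedy_total(rewards, switch_cost):
    """Pick the best model each period, ignoring switching costs when choosing."""
    total, current = 0.0, None
    for period in rewards:
        choice = max(range(len(period)), key=lambda m: period[m])
        if current is not None and choice != current:
            total -= switch_cost
        total += period[choice]
        current = choice
    return total

# Two models whose ranking flips every period by a small margin.
rewards = [[1.0, 0.9], [0.9, 1.0], [1.0, 0.9], [0.9, 1.0]]
dp_total, dp_path = best_schedule(rewards, switch_cost=0.5)
naive_total = greedy_total(rewards, switch_cost=0.5)
```

In this toy case the myopic rule switches every period and pays the cost three times, while the dynamic programme stays with one model throughout, so a good policy has to weigh the current state against the cost of future moves.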

Solvent: A Framework for Protein Folding

  • paper_url: http://arxiv.org/abs/2307.04603
  • repo_url: https://github.com/kakaobrain/solvent
  • paper_authors: Jaemyung Lee, Kyeongtak Han, Jaehoon Kim, Hasun Yu, Youhan Lee
  • for: The paper aims to provide a unified research framework for protein folding, called Solvent, which supports various state-of-the-art models and enables consistent and fair comparisons among different approaches.
  • methods: Solvent is built with a modular design, allowing different models to be easily integrated and trained on the same dataset. The framework includes implementations of several well-known algorithms and their components, and provides a variety of training and evaluation options.
  • results: The paper presents experiments using Solvent to benchmark well-known algorithms and their components, providing insights into the protein structure modeling field. The results demonstrate the potential of Solvent to increase the reliability and consistency of proposed models, as well as improve efficiency in both speed and costs.
    Abstract Consistency and reliability are crucial for conducting AI research. Many famous research fields, such as object detection, have been compared and validated with solid benchmark frameworks. After AlphaFold2, the protein folding task has entered a new phase, and many methods are proposed based on the components of AlphaFold2. A unified research framework for protein folding is therefore important: it should contain implementations and benchmarks to consistently and fairly compare various approaches. To achieve this, we present Solvent, a protein folding framework that supports significant components of state-of-the-art models in the manner of an off-the-shelf interface. Solvent contains different models implemented in a unified codebase and supports training and evaluation for defined models on the same dataset. We benchmark well-known algorithms and their components and provide experiments that give helpful insights into the protein structure modeling field. We hope that Solvent will increase the reliability and consistency of proposed models and give efficiency in both speed and costs, resulting in acceleration of protein folding modeling research. The code is available at https://github.com/kakaobrain/solvent, and the project will continue to be developed.
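    Summary Consistency and reliability are crucial in AI research. Many well-known research fields, such as object detection, have been compared and validated with solid benchmark frameworks. After AlphaFold2, the protein folding task entered a new phase, and many methods are built on AlphaFold2 components. A unified research framework for protein folding is therefore important: it should provide implementations and benchmarks so that approaches can be compared consistently and fairly. We present Solvent, a protein folding framework that supports the main components of state-of-the-art models, implements different models in a unified codebase, and supports training and evaluating them on the same dataset. We benchmark well-known algorithms and their components and report experiments that give useful insights into protein structure modeling. We hope Solvent will increase the reliability and consistency of proposed models and improve efficiency in both speed and cost, accelerating protein folding research. The code is available at https://github.com/kakaobrain/solvent, and the project will continue to be developed.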

A Survey on Graph Neural Networks for Time Series: Forecasting, Classification, Imputation, and Anomaly Detection

  • paper_url: http://arxiv.org/abs/2307.03759
  • repo_url: https://github.com/kimmeen/awesome-gnn4ts
  • paper_authors: Ming Jin, Huan Yee Koh, Qingsong Wen, Daniele Zambon, Cesare Alippi, Geoffrey I. Webb, Irwin King, Shirui Pan
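  • for: This survey targets time series analysis and aims to help designers and practitioners understand, build applications with, and advance research on graph neural networks for time series analysis (GNN4TS).
  • methods: It reviews GNN-based approaches to time series analysis and organizes them in a task-oriented taxonomy covering forecasting, classification, anomaly detection, and imputation.
  • results: It gives a comprehensive review of GNN4TS research, presents representative works and mainstream applications, and discusses potential future research directions.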
    Abstract Time series are the primary data type used to record dynamic system measurements and generated in great volume by both physical sensors and online processes (virtual sensors). Time series analytics is therefore crucial to unlocking the wealth of information implicit in available data. With the recent advancements in graph neural networks (GNNs), there has been a surge in GNN-based approaches for time series analysis. These approaches can explicitly model inter-temporal and inter-variable relationships, which traditional and other deep neural network-based methods struggle to do. In this survey, we provide a comprehensive review of graph neural networks for time series analysis (GNN4TS), encompassing four fundamental dimensions: forecasting, classification, anomaly detection, and imputation. Our aim is to guide designers and practitioners to understand, build applications, and advance research of GNN4TS. At first, we provide a comprehensive task-oriented taxonomy of GNN4TS. Then, we present and discuss representative research works and introduce mainstream applications of GNN4TS. A comprehensive discussion of potential future research directions completes the survey. This survey, for the first time, brings together a vast array of knowledge on GNN-based time series research, highlighting foundations, practical applications, and opportunities of graph neural networks for time series analysis.
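    Summary Time series are the primary data type for recording measurements of dynamic systems and are generated in great volume by both physical sensors and online processes (virtual sensors). Time series analytics is therefore key to unlocking the information implicit in the available data. With recent advances in graph neural networks (GNNs), GNN-based approaches to time series analysis have surged. These approaches can explicitly model inter-temporal and inter-variable relationships, which traditional methods and other deep neural network-based methods struggle to do. In this survey we give a comprehensive review of graph neural networks for time series analysis (GNN4TS), covering four fundamental dimensions: forecasting, classification, anomaly detection, and imputation. Our aim is to help designers and practitioners understand, build applications with, and advance research on GNN4TS. We first provide a comprehensive task-oriented taxonomy of GNN4TS, then present and discuss representative research works and introduce mainstream applications, and finally discuss potential future research directions. This survey brings together, for the first time, a large body of knowledge on GNN-based time series research, highlighting the foundations, practical applications, and opportunities of GNNs for time series analysis.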

Differential Privacy for Clustering Under Continual Observation

  • paper_url: http://arxiv.org/abs/2307.03430
  • repo_url: None
  • paper_authors: Max Dupré la Tour, Monika Henzinger, David Saulpic
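  • for: Privately clustering a dataset in $\mathbb{R}^d$ that is updated by insertions and deletions of points.
  • methods: An $\varepsilon$-differentially private clustering mechanism for the $k$-means objective under continual observation, built by combining dimension reduction under continual observation with a differentially private greedy approximation algorithm for $k$-means.
  • results: The first approximation algorithm for this problem whose additive error depends only logarithmically on the number $T$ of updates, with a multiplicative error almost the same as in the non-private setting. The results are partially extended to the $k$-median problem.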
    Abstract We consider the problem of clustering privately a dataset in $\mathbb{R}^d$ that undergoes both insertion and deletion of points. Specifically, we give an $\varepsilon$-differentially private clustering mechanism for the $k$-means objective under continual observation. This is the first approximation algorithm for that problem with an additive error that depends only logarithmically in the number $T$ of updates. The multiplicative error is almost the same as non privately. To do so we show how to perform dimension reduction under continual observation and combine it with a differentially private greedy approximation algorithm for $k$-means. We also partially extend our results to the $k$-median problem.
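    Summary We consider the problem of privately clustering a dataset in $\mathbb{R}^d$ that undergoes both insertions and deletions of points. We give an $\varepsilon$-differentially private clustering mechanism for the $k$-means objective under continual observation, with an additive error that depends only logarithmically on the number of updates. We show how to perform dimension reduction under continual observation and combine it with a differentially private greedy approximation algorithm for $k$-means. We also partially extend the results to the $k$-median problem.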
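The abstract does not say how the logarithmic dependence on $T$ is obtained. A standard building block for privacy under continual observation is the binary (tree) mechanism for running counts, sketched below as background. It is an assumption that something of this flavour is relevant here, and it is not the paper's clustering algorithm. The setup is hypothetical: a stream of updates in {-1, 0, +1}, with neighbouring streams differing by one update being zeroed out.

```python
import random

def dyadic_blocks(t):
    """Split the prefix 1..t into disjoint dyadic intervals [(start, end), ...]."""
    blocks, start = [], 1
    for bit in reversed(range(t.bit_length())):
        size = 1 << bit
        if t & size:
            blocks.append((start, start + size - 1))
            start += size
    return blocks

def laplace(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_prefix_sums(stream, epsilon, rng):
    """Noisy running sums: every dyadic block gets one fixed noisy sum."""
    T = len(stream)
    levels = max(1, T.bit_length())  # each update lies in at most `levels` blocks
    scale = levels / epsilon         # splits the privacy budget across levels
    noisy_block = {}
    outputs = []
    for t in range(1, T + 1):
        total = 0.0
        for start, end in dyadic_blocks(t):
            if (start, end) not in noisy_block:
                exact = sum(stream[start - 1:end])
                noisy_block[(start, end)] = exact + laplace(scale, rng)
            total += noisy_block[(start, end)]
        outputs.append(total)
    return outputs

updates = [1, 1, -1, 1, 0, 1, -1]  # hypothetical insert (+1) / delete (-1) events
noisy_counts = private_prefix_sums(updates, epsilon=1.0, rng=random.Random(0))
```

Each running sum adds up at most about log2(T) noisy blocks, and each update touches at most that many blocks, so the noise per output grows polylogarithmically in T rather than linearly.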

Merging-Diverging Hybrid Transformer Networks for Survival Prediction in Head and Neck Cancer

  • paper_url: http://arxiv.org/abs/2307.03427
  • repo_url: https://github.com/mungomeng/survival-xsurv
  • paper_authors: Mingyuan Meng, Lei Bi, Michael Fulham, Dagan Feng, Jinman Kim
  • for: This paper aims to improve survival prediction for cancer patients by developing a deep learning model that can effectively fuse multi-modality images (e.g., PET-CT) and extract region-specific information.
  • methods: The proposed method uses a merging-diverging learning framework, which consists of a merging encoder and a diverging decoder. The merging encoder uses a Hybrid Parallel Cross-Attention (HPCA) block to fuse multi-modality features, while the diverging decoder uses a Region-specific Attention Gate (RAG) block to screen out features related to lesion regions.
  • results: The proposed method (XSurv) outperforms state-of-the-art survival prediction methods on the public dataset of HECKTOR 2022. Specifically, XSurv combines the complementary information in PET and CT images and extracts region-specific prognostic information in PT and MLN regions, leading to improved survival prediction accuracy.
    Abstract Survival prediction is crucial for cancer patients as it provides early prognostic information for treatment planning. Recently, deep survival models based on deep learning and medical images have shown promising performance for survival prediction. However, existing deep survival models are not well developed in utilizing multi-modality images (e.g., PET-CT) and in extracting region-specific information (e.g., the prognostic information in Primary Tumor (PT) and Metastatic Lymph Node (MLN) regions). In view of this, we propose a merging-diverging learning framework for survival prediction from multi-modality images. This framework has a merging encoder to fuse multi-modality information and a diverging decoder to extract region-specific information. In the merging encoder, we propose a Hybrid Parallel Cross-Attention (HPCA) block to effectively fuse multi-modality features via parallel convolutional layers and cross-attention transformers. In the diverging decoder, we propose a Region-specific Attention Gate (RAG) block to screen out the features related to lesion regions. Our framework is demonstrated on survival prediction from PET-CT images in Head and Neck (H&N) cancer, by designing an X-shape merging-diverging hybrid transformer network (named XSurv). Our XSurv combines the complementary information in PET and CT images and extracts the region-specific prognostic information in PT and MLN regions. Extensive experiments on the public dataset of HEad and neCK TumOR segmentation and outcome prediction challenge (HECKTOR 2022) demonstrate that our XSurv outperforms state-of-the-art survival prediction methods.
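    Summary Survival prediction is important for cancer patients because it provides early prognostic information for treatment planning. Recently, deep survival models based on deep learning and medical images have shown promising performance. However, existing deep survival models do not make full use of multi-modality images (e.g., PET-CT) or fully extract region-specific information (e.g., the prognostic information in Primary Tumor (PT) and Metastatic Lymph Node (MLN) regions). To address this, we propose a merging-diverging learning framework for survival prediction, with a merging encoder that fuses multi-modality information and a diverging decoder that extracts region-specific information. In the merging encoder, a Hybrid Parallel Cross-Attention (HPCA) block fuses multi-modality features through parallel convolutional layers and cross-attention transformers. In the diverging decoder, a Region-specific Attention Gate (RAG) block screens out the features related to lesion regions. The framework is applied to survival prediction from PET-CT images through an X-shaped merging-diverging hybrid transformer network (XSurv), which combines the complementary information in PET and CT images and extracts region-specific prognostic information in the PT and MLN regions. Extensive experiments on the public HECKTOR 2022 dataset show that XSurv outperforms state-of-the-art survival prediction methods.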
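The HPCA block is described as combining parallel convolutions with cross-attention. The sketch below shows only the generic cross-attention ingredient, in which tokens from one modality query tokens from the other. It leaves out the learned projections, multiple heads, and convolutional branch that a real block would have, and the token values are invented, so it should not be read as the XSurv implementation.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one modality
    and keys/values from another (no learned projections)."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values))
                        for j in range(len(values[0]))])
    return outputs, weights

# Hypothetical feature tokens: two from PET, three from CT.
pet_tokens = [[1.0, 0.0], [0.0, 1.0]]
ct_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, attn = cross_attention(pet_tokens, ct_tokens, ct_tokens)
```

Each PET token ends up as a weighted mix of CT tokens, with the weights given by how well the two match. Swapping the roles of the two modalities gives the other direction of the exchange.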

Hyperspectral and Multispectral Image Fusion Using the Conditional Denoising Diffusion Probabilistic Model

  • paper_url: http://arxiv.org/abs/2307.03423
  • repo_url: https://github.com/shuaikaishi/ddpmfus
  • paper_authors: Shuaikai Shi, Lijun Zhang, Jie Chen
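  • for: Fusing hyperspectral and multispectral images with a deep learning method built on a diffusion model.
  • methods: The method is based on the conditional denoising diffusion probabilistic model (DDPM) and consists of a forward diffusion process and a reverse denoising process.
  • results: Experiments on one indoor and two remote sensing datasets show that the method outperforms other advanced deep learning-based fusion methods.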
    Abstract Hyperspectral images (HSI) have a large amount of spectral information reflecting the characteristics of matter, while their spatial resolution is low due to the limitations of imaging technology. Complementary to this are multispectral images (MSI), e.g., RGB images, with high spatial resolution but insufficient spectral bands. Hyperspectral and multispectral image fusion is a technique for acquiring ideal images that have both high spatial and high spectral resolution cost-effectively. Many existing HSI and MSI fusion algorithms rely on known imaging degradation models, which are often not available in practice. In this paper, we propose a deep fusion method based on the conditional denoising diffusion probabilistic model, called DDPM-Fus. Specifically, the DDPM-Fus contains the forward diffusion process which gradually adds Gaussian noise to the high spatial resolution HSI (HrHSI) and another reverse denoising process which learns to predict the desired HrHSI from its noisy version conditioning on the corresponding high spatial resolution MSI (HrMSI) and low spatial resolution HSI (LrHSI). Once the training is complete, the proposed DDPM-Fus implements the reverse process on the test HrMSI and LrHSI to generate the fused HrHSI. Experiments conducted on one indoor and two remote sensing datasets show the superiority of the proposed model when compared with other advanced deep learning-based fusion methods. The codes of this work will be open-sourced at this address: https://github.com/shuaikaishi/DDPMFus for reproducibility.
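    Summary Hyperspectral images (HSI) carry a large amount of spectral information that reflects the characteristics of matter, but their spatial resolution is low because of the limits of imaging technology. Multispectral images (MSI), such as RGB images, are complementary: they have high spatial resolution but too few spectral bands. Fusing hyperspectral and multispectral images is a cost-effective way to obtain images with both high spatial and high spectral resolution. Most existing HSI and MSI fusion algorithms rely on known imaging degradation models, which are often unavailable in practice. In this paper we propose a deep fusion method based on the conditional denoising diffusion probabilistic model, called DDPM-Fus. DDPM-Fus consists of a forward diffusion process that gradually adds Gaussian noise to the high spatial resolution HSI (HrHSI) and a reverse denoising process that learns to predict the desired HrHSI from its noisy version, conditioned on the corresponding high spatial resolution MSI (HrMSI) and low spatial resolution HSI (LrHSI). Once training is complete, DDPM-Fus runs the reverse process on the test HrMSI and LrHSI to generate the fused HrHSI. Experiments on one indoor and two remote sensing datasets show that the method is superior to other advanced deep learning-based fusion methods. The code will be open-sourced at https://github.com/shuaikaishi/DDPMFus for reproducibility.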
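The forward diffusion process mentioned above has a standard closed form in DDPMs: $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, where $\bar\alpha_t$ is the running product of $1-\beta_s$. The sketch below implements that generic formula on a toy vector. The linear noise schedule, its endpoints, the 1000 steps, and the sample spectrum are common defaults or made-up values chosen for illustration, not settings reported for DDPM-Fus.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, alpha_bars, noise):
    """Closed-form forward step: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]

alpha_bars = cumulative_alphas(linear_betas(1000))

rng = random.Random(0)
clean_spectrum = [0.2, 0.5, 0.9, 0.4]  # hypothetical HrHSI pixel spectrum
noise = [rng.gauss(0.0, 1.0) for _ in clean_spectrum]
noisy = q_sample(clean_spectrum, t=500, alpha_bars=alpha_bars, noise=noise)
```

As `t` grows, `alpha_bars[t]` shrinks towards zero and the sample approaches pure Gaussian noise. The reverse process is the learned part: a network is trained to undo these steps, here conditioned on the HrMSI and LrHSI.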

QI2 – an Interactive Tool for Data Quality Assurance

  • paper_url: http://arxiv.org/abs/2307.03419
  • repo_url: None
  • paper_authors: Simon Geerkens, Christian Sieberichs, Alexander Braun, Thomas Waschulzik
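  • for: Supporting data quality assurance for ML systems and big data, motivated by the legal requirements of the planned EU AI Act, especially for the market introduction of safety-relevant ML systems.
  • methods: A novel approach that supports the data quality assurance process for multiple data quality aspects and enables the verification of quantitative data quality requirements.
  • results: The concept and benefits are explained on small example data sets, and the application of the method is demonstrated on the well-known MNIST data set.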
    Abstract The importance of high data quality is increasing with the growing impact and distribution of ML systems and big data. Also, the planned AI Act from the European Commission defines challenging legal requirements for data quality, especially for the market introduction of safety-relevant ML systems. In this paper we introduce a novel approach that supports the data quality assurance process of multiple data quality aspects. This approach enables the verification of quantitative data quality requirements. The concept and benefits are introduced and explained on small example data sets. How the method is applied is demonstrated on the well-known MNIST data set based on handwritten digits.
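    Summary The value of high-quality data is rising with the growing reach and impact of machine learning systems and big data. The European Commission's planned AI Act will also impose strict data quality requirements, especially for bringing safety-relevant machine learning systems to market. This paper introduces a new approach that supports the quality assurance process for multiple data quality aspects and makes it possible to verify quantitative data quality requirements. The concept and benefits are explained on small example data sets, and the application of the method is shown on the well-known MNIST data set.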

AdaptiveRec: Adaptively Construct Pairs for Contrastive Learning in Sequential Recommendation

  • paper_url: http://arxiv.org/abs/2307.05469
  • repo_url: None
  • paper_authors: Jaeheyoung Jeon, Jung Hyun Ryu, Jewoong Cho, Myungjoo Kang
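  • for: Addressing the challenges of contrastive learning in sequential recommendation systems, in particular the false negative problem, to improve recommendation quality.
  • methods: An advanced contrastive learning approach that improves the quality of item embeddings and avoids falsely categorizing similar instances as dissimilar.
  • results: Experiments show performance improvements over existing systems, and the approach is flexible and applicable across various recommendation scenarios.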
    Abstract This paper presents a solution to the challenges faced by contrastive learning in sequential recommendation systems. In particular, it addresses the issue of false negative, which limits the effectiveness of recommendation algorithms. By introducing an advanced approach to contrastive learning, the proposed method improves the quality of item embeddings and mitigates the problem of falsely categorizing similar instances as dissimilar. Experimental results demonstrate performance enhancements compared to existing systems. The flexibility and applicability of the proposed approach across various recommendation scenarios further highlight its value in enhancing sequential recommendation systems.
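    Summary This paper addresses the challenges that contrastive learning faces in sequential recommendation systems, in particular the false negative problem, which limits the effectiveness of recommendation algorithms. By introducing an advanced contrastive learning approach, the proposed method improves the quality of item embeddings and avoids wrongly classifying similar instances as dissimilar. Experimental results show improvements over existing systems, and the method's applicability to different recommendation scenarios further underlines its value for sequential recommendation.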

Learning from Heterogeneity: A Dynamic Learning Framework for Hypergraphs

  • paper_url: http://arxiv.org/abs/2307.03411
  • repo_url: None
  • paper_authors: Tiehua Zhang, Yuze Liu, Zhishu Shen, Xingjun Ma, Xin Chen, Xiaowei Huang, Jun Yin, Jiong Jin
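  • for: Proposing a hypergraph learning framework that captures the implicit higher-order correlations in graph data.
  • methods: The framework (LFH) uses dynamic hyperedge construction and attentive embedding updates that exploit the heterogeneity attributes of the graph. High-quality features are first generated with a pairwise fusion strategy, then a hypergraph is built by dynamically grouping implicit hyperedges, followed by a type-specific hypergraph learning process.
  • results: In comprehensive experiments on several popular datasets against eleven state-of-the-art models, on both node classification and link prediction, the framework shows a significant performance gain (on average 12.5% in node classification and 13.3% in link prediction).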
    Abstract Graph neural network (GNN) has gained increasing popularity in recent years owing to its capability and flexibility in modeling complex graph structure data. Among all graph learning methods, hypergraph learning is a technique for exploring the implicit higher-order correlations when training the embedding space of the graph. In this paper, we propose a hypergraph learning framework named LFH that is capable of dynamic hyperedge construction and attentive embedding update utilizing the heterogeneity attributes of the graph. Specifically, in our framework, the high-quality features are first generated by the pairwise fusion strategy that utilizes explicit graph structure information when generating initial node embedding. Afterwards, a hypergraph is constructed through the dynamic grouping of implicit hyperedges, followed by the type-specific hypergraph learning process. To evaluate the effectiveness of our proposed framework, we conduct comprehensive experiments on several popular datasets with eleven state-of-the-art models on both node classification and link prediction tasks, which fall into categories of homogeneous pairwise graph learning, heterogeneous pairwise graph learning, and hypergraph learning. The experiment results demonstrate a significant performance gain (average 12.5% in node classification and 13.3% in link prediction) compared with recent state-of-the-art methods.
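    Summary Graph neural networks (GNNs) have become increasingly popular in recent years because of their capability and flexibility in modeling complex graph-structured data. Among graph learning methods, hypergraph learning is a technique for exploring implicit higher-order correlations when training the embedding space of a graph. In this paper we propose a hypergraph learning framework named LFH, which supports dynamic hyperedge construction and attentive embedding updates that use the heterogeneity attributes of the graph. In the framework, high-quality features are first generated by a pairwise fusion strategy that uses explicit graph structure information to produce the initial node embeddings. A hypergraph is then constructed by dynamically grouping implicit hyperedges, followed by a type-specific hypergraph learning process. To evaluate the framework, we run comprehensive experiments on several popular datasets against eleven state-of-the-art models on node classification and link prediction; the baselines fall into homogeneous pairwise graph learning, heterogeneous pairwise graph learning, and hypergraph learning. The results show a significant performance gain (on average 12.5% in node classification and 13.3% in link prediction) over recent state-of-the-art methods.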
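The abstract mentions dynamic hyperedge construction followed by hypergraph learning but gives no details. To make the two ideas concrete, the sketch below groups each node with its nearest neighbours in feature space to form hyperedges, then runs one round of node-to-hyperedge-to-node mean aggregation. The nearest-neighbour grouping, the plain averaging (no attention, no node types), and all names are assumptions chosen for brevity, not the LFH design.

```python
def knn_hyperedges(features, k):
    """One hyperedge per node: the node plus its k nearest neighbours."""
    edges = []
    for i, fi in enumerate(features):
        dists = sorted((sum((a - b) ** 2 for a, b in zip(fi, fj)), j)
                       for j, fj in enumerate(features) if j != i)
        edges.append(sorted([i] + [j for _, j in dists[:k]]))
    return edges

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def hypergraph_layer(features, hyperedges):
    """Average node features into hyperedges, then hyperedges back into nodes."""
    edge_feats = [mean([features[i] for i in edge]) for edge in hyperedges]
    out = []
    for i in range(len(features)):
        incident = [edge_feats[e] for e, edge in enumerate(hyperedges) if i in edge]
        out.append(mean(incident) if incident else list(features[i]))
    return out

# Four nodes forming two obvious groups in a one-dimensional feature space.
features = [[0.0], [0.1], [5.0], [5.2]]
hyperedges = knn_hyperedges(features, k=1)
updated = hypergraph_layer(features, hyperedges)
```

Because a hyperedge can hold any number of nodes, information flows between whole groups in one step. Rebuilding the hyperedges from the updated features after each layer is one way to make the construction dynamic.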

Scalable High-Dimensional Multivariate Linear Regression for Feature-Distributed Data

  • paper_url: http://arxiv.org/abs/2307.03410
  • repo_url: None
  • paper_authors: Shuo-Chieh Huang, Ruey S. Tsay
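  • for: Applying multivariate linear regression to feature-distributed data, which are increasingly common in applications with a large number of features.
  • methods: A two-stage relaxed greedy algorithm (TSRGA) whose communication complexity does not depend on the feature dimension, making it highly scalable. With multivariate response variables, TSRGA can also yield low-rank coefficient estimates.
  • results: Simulation experiments validate the fast convergence of TSRGA. A financial application using unstructured data from 10-K reports demonstrates its usefulness in applications with many dense, large-dimensional matrices.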
    Abstract Feature-distributed data, referring to data partitioned by features and stored across multiple computing nodes, are increasingly common in applications with a large number of features. This paper proposes a two-stage relaxed greedy algorithm (TSRGA) for applying multivariate linear regression to such data. The main advantage of TSRGA is that its communication complexity does not depend on the feature dimension, making it highly scalable to very large data sets. In addition, for multivariate response variables, TSRGA can be used to yield low-rank coefficient estimates. The fast convergence of TSRGA is validated by simulation experiments. Finally, we apply the proposed TSRGA in a financial application that leverages unstructured data from the 10-K reports, demonstrating its usefulness in applications with many dense large-dimensional matrices.
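    Summary Feature-distributed data, that is, data partitioned by features and stored across multiple computing nodes, are increasingly common in applications. This paper proposes a two-stage relaxed greedy algorithm (TSRGA) for applying multivariate linear regression to such data. The advantage of TSRGA is that its communication complexity does not grow with the feature dimension, so it scales well to very large data sets. For multivariate response variables, TSRGA can also produce low-rank coefficient estimates. The fast convergence of TSRGA is validated in simulation experiments. Finally, we apply TSRGA in a financial application that uses unstructured data from 10-K reports, showing its usefulness in applications with many dense, large-dimensional matrices.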

A Self-Supervised Algorithm for Denoising Photoplethysmography Signals for Heart Rate Estimation from Wearables

  • paper_url: http://arxiv.org/abs/2307.05339
  • repo_url: None
  • paper_authors: Pranay Jain, Cheng Ding, Cynthia Rudin, Xiao Hu
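  • for: Improving heart rate monitoring on smart watches and other wearable devices by denoising PPG signals corrupted by noise and motion artifacts.
  • methods: A self-supervised denoising algorithm that trains a denoising autoencoder on a large database of clean PPG signals to reconstruct the corrupted parts of a signal.
  • results: The reconstructed signals give better heart rate estimates and significantly improve heart rate variability (HRV) estimation, which suggests benefits for downstream analysis of other health metrics.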
    Abstract Smart watches and other wearable devices are equipped with photoplethysmography (PPG) sensors for monitoring heart rate and other aspects of cardiovascular health. However, PPG signals collected from such devices are susceptible to corruption from noise and motion artifacts, which cause errors in heart rate estimation. Typical denoising approaches filter or reconstruct the signal in ways that eliminate much of the morphological information, even from the clean parts of the signal that would be useful to preserve. In this work, we develop an algorithm for denoising PPG signals that reconstructs the corrupted parts of the signal, while preserving the clean parts of the PPG signal. Our novel framework relies on self-supervised training, where we leverage a large database of clean PPG signals to train a denoising autoencoder. As we show, our reconstructed signals provide better estimates of heart rate from PPG signals than the leading heart rate estimation methods. Further experiments show significant improvement in Heart Rate Variability (HRV) estimation from PPG signals using our algorithm. We conclude that our algorithm denoises PPG signals in a way that can improve downstream analysis of many different health metrics from wearable devices.
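    Summary Smart watches and other wearable devices are equipped with photoplethysmography (PPG) sensors for monitoring heart rate and other aspects of cardiovascular health. However, PPG signals collected from such devices are easily corrupted by noise and motion artifacts, which causes errors in heart rate estimation. Existing denoising approaches usually filter or reconstruct the signal in ways that remove much of the morphological information, even from the clean parts of the signal that would be worth preserving. In this work we develop a PPG denoising algorithm that reconstructs the corrupted parts of the signal while preserving the clean parts. Our framework relies on self-supervised training: we use a large database of clean PPG signals to train a denoising autoencoder. The reconstructed signals give better heart rate estimates than the leading heart rate estimation methods. Further experiments show a significant improvement in heart rate variability (HRV) estimation from PPG signals. We conclude that the algorithm denoises PPG signals in a way that can improve downstream analysis of many health metrics from wearable devices.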

Goal-Conditioned Predictive Coding as an Implicit Planner for Offline Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.03406
  • repo_url: None
  • paper_authors: Zilai Zeng, Ce Zhang, Shijie Wang, Chen Sun
  • for: Investigating whether sequence modeling can condense trajectories into useful representations that contribute to policy learning.
  • methods: A two-stage framework that first summarizes trajectories with sequence modeling techniques and then uses these representations to learn a policy along with a desired goal.
  • results: Sequence modeling has a significant impact on some decision-making tasks. The proposed GCPC learns a goal-conditioned latent representation of the future and achieves competitive performance on the AntMaze, FrankaKitchen, and Locomotion environments.
    Abstract Recent work has demonstrated the effectiveness of formulating decision making as a supervised learning problem on offline-collected trajectories. However, the benefits of performing sequence modeling on trajectory data are not yet clear. In this work we investigate if sequence modeling has the capability to condense trajectories into useful representations that can contribute to policy learning. To achieve this, we adopt a two-stage framework that first summarizes trajectories with sequence modeling techniques, and then employs these representations to learn a policy along with a desired goal. This design allows many existing supervised offline RL methods to be considered as specific instances of our framework. Within this framework, we introduce Goal-Conditioned Predictive Coding (GCPC), an approach that brings powerful trajectory representations and leads to performant policies. We conduct extensive empirical evaluations on AntMaze, FrankaKitchen and Locomotion environments, and observe that sequence modeling has a significant impact on some decision making tasks. In addition, we demonstrate that GCPC learns a goal-conditioned latent representation about the future, which serves as an "implicit planner", and enables competitive performance on all three benchmarks.
    Summary Recent work has shown that decision making can be formulated as a supervised learning problem. However, the benefits of applying sequence modeling to trajectory data are not yet clear. We investigate whether sequence modeling can condense trajectories into useful representations that contribute to policy learning. To do so, we adopt a two-stage framework that first summarizes trajectories with sequence modeling techniques and then uses these representations to learn a policy along with a desired goal. This design allows many existing supervised offline RL methods to be seen as specific instances of the framework. Within it, we introduce Goal-Conditioned Predictive Coding (GCPC), an approach that brings powerful trajectory representations and leads to performant policies. Extensive evaluations on the AntMaze, FrankaKitchen, and Locomotion environments show that sequence modeling has a significant impact on some decision-making tasks. We also show that GCPC learns a goal-conditioned latent representation of the future, which serves as an "implicit planner" and enables competitive performance on all three benchmarks.

Exploring the Potential of Large Language Models (LLMs) in Learning on Graphs

  • paper_url: http://arxiv.org/abs/2307.03393
  • repo_url: https://github.com/CurryTang/Graph-LLM
  • paper_authors: Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, Jiliang Tang
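  • for: Exploring the potential of large language models (LLMs) in graph machine learning, especially the node classification task.
  • methods: Two candidate pipelines are studied: LLMs-as-Enhancers and LLMs-as-Predictors. The former uses LLMs to enhance the nodes' text attributes and then generates predictions through GNNs. The latter uses LLMs directly as standalone predictors.
  • results: Comprehensive and systematic studies under various settings yield original observations and new insights that open new possibilities and suggest promising directions for using LLMs in learning on graphs.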
    Abstract Learning on Graphs has attracted immense attention due to its wide real-world applications. The most popular pipeline for learning on graphs with textual node attributes primarily relies on Graph Neural Networks (GNNs), and utilizes shallow text embedding as initial node representations, which has limitations in general knowledge and profound semantic understanding. In recent years, Large Language Models (LLMs) have been proven to possess extensive common knowledge and powerful semantic comprehension abilities that have revolutionized existing workflows to handle text data. In this paper, we aim to explore the potential of LLMs in graph machine learning, especially the node classification task, and investigate two possible pipelines: LLMs-as-Enhancers and LLMs-as-Predictors. The former leverages LLMs to enhance nodes' text attributes with their massive knowledge and then generate predictions through GNNs. The latter attempts to directly employ LLMs as standalone predictors. We conduct comprehensive and systematical studies on these two pipelines under various settings. From comprehensive empirical results, we make original observations and find new insights that open new possibilities and suggest promising directions to leverage LLMs for learning on graphs. Our codes and datasets are available at https://github.com/CurryTang/Graph-LLM.
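    Summary Learning on graphs has attracted wide attention because of its many real-world applications. The most popular pipeline for graphs with textual node attributes relies on graph neural networks (GNNs) with shallow text embeddings as initial node representations, which limits general knowledge and deep semantic understanding. In recent years, large language models (LLMs) have shown extensive common knowledge and strong semantic comprehension, transforming how text data is handled. This paper explores the potential of LLMs in graph machine learning, especially node classification, and studies two possible pipelines: LLMs-as-Enhancers and LLMs-as-Predictors. The former uses LLMs to enhance nodes' text attributes and then generates predictions through GNNs; the latter uses LLMs directly as standalone predictors. Comprehensive and systematic studies under various settings yield original observations and new insights that open new possibilities and suggest promising directions for using LLMs in learning on graphs. The code and datasets are available at https://github.com/CurryTang/Graph-LLM.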

AI-UPV at EXIST 2023 – Sexism Characterization Using Large Language Models Under The Learning with Disagreements Regime

  • paper_url: http://arxiv.org/abs/2307.03385
  • repo_url: https://github.com/angelfelipemp/sexism-llm-learning-with-disagreement
  • paper_authors: Angel Felipe Magnossão de Paula, Giulia Rizzi, Elisabetta Fersini, Damiano Spina
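  • for: This work aims to build automated systems that detect sexism and other disrespectful and hateful behavior on social media, to promote a more inclusive and respectful online environment.
  • methods: The system uses large language models (i.e., mBERT and XLM-RoBERTa) and ensemble strategies for sexism identification and classification in English and Spanish.
  • results: The team participated in all three tasks of the EXIST Lab at CLEF 2023. Under soft evaluation, it placed fourth in Task 2 and first in Task 3, with the highest ICM-Soft of -2.32 and a normalized ICM-Soft of 0.79.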
    Abstract With the increasing influence of social media platforms, it has become crucial to develop automated systems capable of detecting instances of sexism and other disrespectful and hateful behaviors to promote a more inclusive and respectful online environment. Nevertheless, these tasks are considerably challenging considering different hate categories and the author's intentions, especially under the learning with disagreements regime. This paper describes AI-UPV team's participation in the EXIST (sEXism Identification in Social neTworks) Lab at CLEF 2023. The proposed approach aims at addressing the task of sexism identification and characterization under the learning with disagreements paradigm by training directly from the data with disagreements, without using any aggregated label. Yet, performances considering both soft and hard evaluations are reported. The proposed system uses large language models (i.e., mBERT and XLM-RoBERTa) and ensemble strategies for sexism identification and classification in English and Spanish. In particular, our system is articulated in three different pipelines. The ensemble approach outperformed the individual large language models obtaining the best performances both adopting a soft and a hard label evaluation. This work describes the participation in all the three EXIST tasks, considering a soft evaluation, it obtained fourth place in Task 2 at EXIST and first place in Task 3, with the highest ICM-Soft of -2.32 and a normalized ICM-Soft of 0.79. The source code of our approaches is publicly available at https://github.com/AngelFelipeMP/Sexism-LLM-Learning-With-Disagreement.
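    Summary As social media platforms grow in influence, it has become necessary to develop automated systems that detect sexism and other disrespectful and hateful behavior, to promote a more inclusive and respectful online environment. These tasks are challenging given the different hate categories and the authors' intentions, especially under the learning with disagreements regime. This paper describes the AI-UPV team's participation in the EXIST (sEXism Identification in Social neTworks) Lab. The proposed approach addresses sexism identification and characterization by training directly on the data with disagreements, without using any aggregated label, while still reporting performance under both soft and hard evaluations. The system uses large language models (i.e., mBERT and XLM-RoBERTa) and ensemble strategies for sexism identification and classification, and is organized into three different pipelines. The ensemble approach achieved the best performance under both soft and hard evaluation, placing fourth in Task 2 and first in Task 3 at EXIST, with an ICM-Soft of -2.32 and a normalized ICM-Soft of 0.79. The code is available at https://github.com/AngelFelipeMP/Sexism-LLM-Learning-With-Disagreement.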

Teaching Arithmetic to Small Transformers

  • paper_url: http://arxiv.org/abs/2307.03381
  • repo_url: https://github.com/lee-ny/teaching_arithmetic
  • paper_authors: Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos
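  • for: This study investigates how small transformers, trained from random initialization, can efficiently learn arithmetic using the next-token prediction objective.
  • methods: The models are trained with the unsupervised next-token prediction objective on arithmetic operations, including addition, multiplication, and square root, using reformatted training data and chain-of-thought style data that includes intermediate step results.
  • results: With simple transformer models and appropriate data formatting, arithmetic operations can be learned with the next-token prediction objective, and chain-of-thought style data simultaneously improves accuracy, sample complexity, and convergence speed.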
    Abstract Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges. Our work highlights the importance of high-quality, instructive data that considers the particular characteristics of the next-word prediction objective for rapidly eliciting arithmetic capabilities.
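    Summary Large language models such as GPT-4 exhibit emergent capabilities on general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised next-token prediction objective. This study investigates how small transformers can efficiently learn arithmetic operations, including addition, multiplication, and square root, using the next-token prediction objective. It first shows that conventional training data is not the most effective for arithmetic learning and that simple formatting changes can improve accuracy. This leads to sharp phase transitions as a function of training data scale, which in some cases can be explained through connections to low-rank matrix completion. The models are then trained on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach simultaneously improves accuracy, sample complexity, and convergence speed. The study also examines the interplay between arithmetic and text data during training, the effects of few-shot prompting, pretraining, and model scale, and the challenges of length generalization. The work highlights the importance of high-quality, instructive data that takes the characteristics of the next-token prediction objective into account for rapidly eliciting arithmetic capabilities.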
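    As a rough illustration of the data formatting idea in the abstract, the sketch below builds a plain addition sample and a chain-of-thought style sample that spells out intermediate step results. The abstract does not give the exact sample format, so the layout, the function names `plain_sample` and `scratchpad_sample`, and the wording of each step are assumptions made for illustration, not the authors' format.

```python
def plain_sample(a: int, b: int) -> str:
    """Conventional sample: operands and the final answer only."""
    return f"{a}+{b}={a + b}"


def scratchpad_sample(a: int, b: int) -> str:
    """Hypothetical chain-of-thought sample listing every digit step and carry."""
    xs, ys = str(a)[::-1], str(b)[::-1]  # least-significant digit first
    lines = [f"Input: {a}+{b}"]
    carry, digits = 0, []
    for i in range(max(len(xs), len(ys))):
        x = int(xs[i]) if i < len(xs) else 0
        y = int(ys[i]) if i < len(ys) else 0
        total = x + y + carry
        lines.append(
            f"{x}+{y}+{carry}={total} -> digit {total % 10}, carry {total // 10}"
        )
        digits.append(str(total % 10))
        carry = total // 10
    if carry:
        digits.append(str(carry))
    lines.append("Answer: " + "".join(reversed(digits)))
    return "\n".join(lines)


if __name__ == "__main__":
    print(plain_sample(128, 367))
    print(scratchpad_sample(128, 367))
```

    Either string could serve as a training sequence for next-token prediction; the second exposes the carry at every step instead of asking for the answer in one shot.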

On Formal Feature Attribution and Its Approximation

  • paper_url: http://arxiv.org/abs/2307.03380
  • repo_url: https://github.com/ffattr/ffa
  • paper_authors: Jinqiang Yu, Alexey Ignatiev, Peter J. Stuckey
  • for: This paper focuses on the problem of feature attribution in machine learning models, specifically in the context of explainable artificial intelligence (XAI). It proposes a new approach called formal feature attribution (FFA) to address the limitations of existing methods.
  • methods: The paper uses formal methods to analyze and evaluate the feature attribution of machine learning models. It specifically employs formal explanation enumeration to compute the exact FFA, and proposes an efficient approximation technique to handle the practical complexity of the problem.
  • results: The paper provides experimental evidence of the effectiveness of the proposed approximate FFA method, comparing it to existing feature attribution algorithms in terms of feature importance and relative order. It demonstrates that FFA can provide more accurate and informative attributions than existing methods, while also being more efficient in practical settings.
    Abstract Recent years have witnessed the widespread use of artificial intelligence (AI) algorithms and machine learning (ML) models. Despite their tremendous success, a number of vital problems like ML model brittleness, their fairness, and the lack of interpretability warrant active development in explainable artificial intelligence (XAI) and formal ML model verification. The two major lines of work in XAI include feature selection methods, e.g. Anchors, and feature attribution techniques, e.g. LIME and SHAP. Despite their promise, most of the existing feature selection and attribution approaches are susceptible to a range of critical issues, including explanation unsoundness and out-of-distribution sampling. A recent formal approach to XAI (FXAI), although serving as an alternative to the above and free of these issues, suffers from a few other limitations. For instance, besides the scalability limitation, the formal approach is unable to tackle the feature attribution problem. Additionally, a formal explanation, despite being formally sound, is typically quite large, which hampers its applicability in practical settings. Motivated by the above, this paper proposes a way to apply the apparatus of formal XAI to the case of feature attribution based on formal explanation enumeration. Formal feature attribution (FFA) is argued to be advantageous over the existing methods, both formal and non-formal. Given the practical complexity of the problem, the paper then proposes an efficient technique for approximating exact FFA. Finally, it offers experimental evidence of the effectiveness of the proposed approximate FFA in comparison to the existing feature attribution algorithms, not only in terms of feature importance but also in terms of their relative order.
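    Summary Recent years have seen widespread use of artificial intelligence (AI) algorithms and machine learning (ML) models across many fields. Despite their success, important problems remain, such as ML model brittleness, fairness, and the lack of interpretability, which motivate active development of explainable artificial intelligence (XAI) and formal ML model verification. The two major lines of work in XAI are feature selection methods, such as Anchors, and feature attribution techniques, such as LIME and SHAP. Despite their promise, existing feature selection and attribution approaches are susceptible to critical issues such as explanation unsoundness and out-of-distribution sampling. A recent formal approach to XAI, although an alternative free of these issues, has other limitations: besides limited scalability, it cannot tackle the feature attribution problem. In addition, a formal explanation, despite being formally sound, is typically quite large, which hampers its practical applicability. To address this, the paper proposes a feature attribution method based on formal XAI, namely formal feature attribution (FFA). FFA is argued to be advantageous over the existing methods, both formal and non-formal. Given the practical complexity of the problem, the paper then proposes an efficient technique for approximating exact FFA. Finally, it offers experimental evidence of the effectiveness of the proposed approximate FFA in comparison to existing feature attribution algorithms, not only in terms of feature importance but also in terms of their relative order.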
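    One simple way to turn enumerated formal explanations into per-feature scores is to count how often each feature appears across the explanations. The sketch below assumes that reading. The abstract only states that FFA is based on formal explanation enumeration, so the scoring rule, the function name `attribution_from_explanations`, and the toy explanations are illustrative assumptions rather than the paper's definition.

```python
from collections import Counter
from fractions import Fraction


def attribution_from_explanations(explanations):
    """Score each feature by the fraction of explanations that contain it.

    `explanations` is an iterable of feature collections, one per
    enumerated explanation (assumed to be given; enumeration itself
    is out of scope for this sketch).
    """
    explanation_sets = [frozenset(e) for e in explanations]
    if not explanation_sets:
        return {}
    counts = Counter(f for e in explanation_sets for f in e)
    total = len(explanation_sets)
    return {feature: Fraction(c, total) for feature, c in counts.items()}


if __name__ == "__main__":
    toy = [{"age", "income"}, {"age"}, {"income", "zip"}, {"age", "zip"}]
    for feature, score in sorted(attribution_from_explanations(toy).items()):
        print(feature, score)
```

    Under this assumed rule, approximating the scores amounts to enumerating only part of the explanation set, which is one way to read the approximation the abstract mentions.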

Mitigating Negative Transfer with Task Awareness for Sexism, Hate Speech, and Toxic Language Detection

  • paper_url: http://arxiv.org/abs/2307.03377
  • repo_url: https://github.com/angelfelipemp/mitigating-negative-transfer-with-ta
  • paper_authors: Angel Felipe Magnossão de Paula, Paolo Rosso, Damiano Spina
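  • for: This paper aims to address the negative transfer problem in multi-task learning.
  • methods: It proposes an approach based on the task awareness concept to mitigate negative transfer.
  • results: The approach sets a new state of the art on the EXIST-2021 and HatEval-2019 benchmarks and improves performance over classic multi-task learning.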
    Abstract This paper proposes a novel approach to mitigate the negative transfer problem. In the field of machine learning, the common strategy is to apply the Single-Task Learning approach in order to train a supervised model to solve a specific task. Training a robust model requires a lot of data and a significant amount of computational resources, making this solution unfeasible in cases where data are unavailable or expensive to gather. Therefore another solution, based on the sharing of information between tasks, has been developed: Multi-Task Learning (MTL). Despite the recent developments regarding MTL, the problem of negative transfer has yet to be solved. Negative transfer is a phenomenon that occurs when noisy information is shared between tasks, resulting in a drop in performance. This paper proposes a new approach to mitigate the negative transfer problem based on the task awareness concept. The proposed approach diminishes negative transfer and improves performance over the classic MTL solution. Moreover, the proposed approach has been implemented in two unified architectures to detect Sexism, Hate Speech, and Toxic Language in text comments. The proposed architectures set a new state of the art on both the EXIST-2021 and HatEval-2019 benchmarks.
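    Summary This paper proposes a new approach to the negative transfer problem in multi-task learning. In machine learning, Single-Task Learning is commonly used to train a supervised model for a specific task, but this requires a lot of data and computational resources. To overcome this limitation, Multi-Task Learning (MTL) was developed, which shares information between tasks. However, negative transfer causes information shared between tasks to interfere, resulting in a drop in performance. The paper proposes a new approach based on the task awareness concept to mitigate negative transfer, and sets a new state of the art on the EXIST-2021 and HatEval-2019 benchmarks.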

STG-MTL: Scalable Task Grouping for Multi-Task Learning Using Data Map

  • paper_url: http://arxiv.org/abs/2307.03374
  • repo_url: None
  • paper_authors: Ammar Sherif, Abubakar Abid, Mustafa Elattar, Mohamed ElHelw
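  • for: Improving the performance of multi-task learning (MTL) over traditional Single-Task Learning (STL) by choosing task groupings in a scalable way.
  • methods: The method uses hand-crafted features, specifically Data Maps, which capture the training behavior of each classification task during MTL training, to obtain a scalable and modular task-grouping solution.
  • results: Experiments show that the method handles a large number of tasks effectively (up to 100) and can improve MTL performance.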
    Abstract Multi-Task Learning (MTL) is a powerful technique that has gained popularity due to its performance improvement over traditional Single-Task Learning (STL). However, MTL is often challenging because there is an exponential number of possible task groupings, which can make it difficult to choose the best one, and some groupings might produce performance degradation due to negative interference between tasks. Furthermore, existing solutions are severely suffering from scalability issues, limiting any practical application. In our paper, we propose a new data-driven method that addresses these challenges and provides a scalable and modular solution for classification task grouping based on hand-crafted features, specifically Data Maps, which capture the training behavior for each classification task during the MTL training. We experiment with the method demonstrating its effectiveness, even on an unprecedented number of tasks (up to 100).
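    Summary Multi-Task Learning (MTL) is a powerful technique that improves performance over Single-Task Learning (STL), but it also poses many challenges. One main challenge is that the number of possible task groupings is exponential, which makes choosing the best grouping difficult, and some groupings can degrade performance through negative interference between tasks. In addition, existing solutions are limited in scalability, which restricts their practical use. The paper proposes a data-driven method based on hand-crafted features that addresses these challenges and provides a scalable and modular solution for classification task grouping. Experiments demonstrate the method's effectiveness, even on an unprecedented number of tasks (up to 100).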

Distilled Pruning: Using Synthetic Data to Win the Lottery

  • paper_url: http://arxiv.org/abs/2307.03364
  • repo_url: https://github.com/luke-mcdermott-mi/distilled-pruning
  • paper_authors: Luke McDermott, Daniel Cummings
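  • for: This paper proposes a pruning method for deep learning models that uses distilled data, unlike conventional strategies that focus on architectural or algorithmic optimization.
  • methods: Distilled datasets capture essential patterns from larger datasets, and the method leverages this capability to enable a computationally efficient pruning process.
  • results: On CIFAR-10, the approach finds sparse, trainable subnetworks (lottery tickets) up to 5x faster than Iterative Magnitude Pruning at comparable sparsity. The results highlight the potential of distilled data for resource-efficient neural network pruning, model compression, and neural architecture search.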
    Abstract This work introduces a novel approach to pruning deep learning models by using distilled data. Unlike conventional strategies which primarily focus on architectural or algorithmic optimization, our method reconsiders the role of data in these scenarios. Distilled datasets capture essential patterns from larger datasets, and we demonstrate how to leverage this capability to enable a computationally efficient pruning process. Our approach can find sparse, trainable subnetworks (a.k.a. Lottery Tickets) up to 5x faster than Iterative Magnitude Pruning at comparable sparsity on CIFAR-10. The experimental results highlight the potential of using distilled data for resource-efficient neural network pruning, model compression, and neural architecture search.
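    For context, the baseline named in the abstract, Iterative Magnitude Pruning, alternates training with removing the smallest-magnitude surviving weights. The sketch below shows only that loop on a flat list of weights. `train_fn` is a hypothetical callback standing in for whatever training happens between pruning rounds (on distilled data in the proposed method, according to the abstract), and the per-round fraction is an arbitrary choice.

```python
def prune_step(weights, mask, fraction):
    """Zero out the given fraction of surviving weights with the smallest magnitude."""
    alive = sorted(
        (abs(w), i) for i, (w, m) in enumerate(zip(weights, mask)) if m
    )
    new_mask = list(mask)
    for _, index in alive[: int(len(alive) * fraction)]:
        new_mask[index] = 0
    return new_mask


def iterative_magnitude_pruning(weights, train_fn, rounds, fraction):
    """Alternate (hypothetical) training and magnitude pruning.

    Weight rewinding, commonly used in lottery-ticket experiments,
    is omitted to keep the sketch short.
    """
    mask = [1] * len(weights)
    for _ in range(rounds):
        weights = train_fn(weights, mask)  # placeholder for real training
        mask = prune_step(weights, mask, fraction)
    return mask


if __name__ == "__main__":
    toy_weights = [0.5, -0.1, 0.9, 0.05, -0.7, 0.3, 0.02, -0.4, 0.8, 0.2]
    identity_train = lambda w, m: w  # stand-in: no actual training
    print(iterative_magnitude_pruning(toy_weights, identity_train, 2, 0.2))
```

    Since each round requires a training pass, the cost of that pass dominates the loop, which is where a smaller training set would save time.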

Federated Unlearning via Active Forgetting

  • paper_url: http://arxiv.org/abs/2307.03363
  • repo_url: None
  • paper_authors: Yuyuan Li, Chaochao Chen, Xiaolin Zheng, Jiaming Zhang
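  • for: This work proposes a federated unlearning framework based on incremental learning to address the federated unlearning problem.
  • methods: The framework is based on incremental learning and is independent of specific models and federated settings. It uses new memories to overwrite old ones, imitating active forgetting in neurology. Specifically, the model intended to unlearn serves as a student model that continuously learns from randomly initiated teacher models. To prevent catastrophic forgetting of non-target data, elastic weight consolidation is used to elastically constrain weight change.
  • results: Extensive experiments on three benchmark datasets demonstrate the method's efficiency and effectiveness, and backdoor attack results show that it achieves satisfying completeness.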
    Abstract The increasing concerns regarding the privacy of machine learning models have catalyzed the exploration of machine unlearning, i.e., a process that removes the influence of training data on machine learning models. This concern also arises in the realm of federated learning, prompting researchers to address the federated unlearning problem. However, federated unlearning remains challenging. Existing unlearning methods can be broadly categorized into two approaches, i.e., exact unlearning and approximate unlearning. Firstly, implementing exact unlearning, which typically relies on the partition-aggregation framework, in a distributed manner does not improve time efficiency theoretically. Secondly, existing federated (approximate) unlearning methods suffer from imprecise data influence estimation, significant computational burden, or both. To this end, we propose a novel federated unlearning framework based on incremental learning, which is independent of specific models and federated settings. Our framework differs from existing federated unlearning methods that rely on approximate retraining or data influence estimation. Instead, we leverage new memories to overwrite old ones, imitating the process of active forgetting in neurology. Specifically, the model, intended to unlearn, serves as a student model that continuously learns from randomly initiated teacher models. To prevent catastrophic forgetting of non-target data, we utilize elastic weight consolidation to elastically constrain weight change. Extensive experiments on three benchmark datasets demonstrate the efficiency and effectiveness of our proposed method. The result of backdoor attacks demonstrates that our proposed method achieves satisfying completeness.
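    Summary Growing concerns about the privacy of machine learning models have led many researchers to explore machine unlearning, i.e., removing the influence of training data from a model. The same concern arises in federated learning, yet federated unlearning remains challenging. Existing unlearning methods fall broadly into two categories: exact unlearning and approximate unlearning. First, implementing exact unlearning in a distributed manner does not improve time efficiency in theory. Second, existing federated unlearning methods suffer from imprecise data influence estimation, a heavy computational burden, or both. The paper therefore proposes a federated unlearning framework based on incremental learning that is independent of specific models and federated settings. Unlike existing federated unlearning methods, it does not rely on approximate retraining or data influence estimation. Instead, it uses new memories to overwrite old ones, imitating active forgetting in the human brain. Specifically, the model to be unlearned continuously learns under the guidance of randomly initialized teacher models. To avoid catastrophic forgetting of non-target data, elastic weight consolidation is used to constrain weight change. Extensive experiments on three benchmark datasets show that the proposed method is efficient and effective, and that it meets the completeness requirement.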
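    The elastic weight consolidation term mentioned above is usually written as a quadratic penalty that pulls each parameter toward a reference value, weighted by an importance estimate such as the diagonal Fisher information. The sketch below shows that standard penalty on plain Python lists. How the paper combines it with the student-teacher objective is not stated in the abstract, so `total_loss` and its `strength` parameter are assumptions for illustration.

```python
def ewc_penalty(params, anchor_params, fisher, strength):
    """Standard EWC penalty: (strength / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def total_loss(task_loss, params, anchor_params, fisher, strength=1.0):
    """Hypothetical combined objective: new-task loss plus the EWC penalty."""
    return task_loss + ewc_penalty(params, anchor_params, fisher, strength)


if __name__ == "__main__":
    theta = [1.0, 2.0]         # current weights
    theta_star = [0.0, 2.0]    # weights to stay close to
    importance = [4.0, 1.0]    # per-weight importance (e.g. diagonal Fisher)
    print(ewc_penalty(theta, theta_star, importance, strength=0.5))
```

    Weights with high importance are held nearly fixed, while unimportant ones remain free to change, which is what makes the constraint "elastic".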

Evaluating Biased Attitude Associations of Language Models in an Intersectional Context

  • paper_url: http://arxiv.org/abs/2307.03360
  • repo_url: https://github.com/shivaomrani/llm-bias
  • paper_authors: Shiva Omrani Sabbaghi, Robert Wolfe, Aylin Caliskan
  • for: The paper aims to quantify the biases in language models using a sentence template that provides an intersectional context, and to study the associations of underrepresented groups in language.
  • methods: The paper uses a concept projection approach to capture the valence subspace through contextualized word embeddings of language models, and adapts the projection-based approach to embedding association tests to quantify bias.
  • results: The paper finds that language models exhibit the most biased attitudes against gender identity, social class, and sexual orientation signals in language, and that the largest and better-performing model is also more biased. The approach enables the study of complex intersectional biases and contributes to design justice by studying the associations of underrepresented groups in language.
    Abstract Language models are trained on large-scale corpora that embed implicit biases documented in psychology. Valence associations (pleasantness/unpleasantness) of social groups determine the biased attitudes towards groups and concepts in social cognition. Building on this established literature, we quantify how social groups are valenced in English language models using a sentence template that provides an intersectional context. We study biases related to age, education, gender, height, intelligence, literacy, race, religion, sex, sexual orientation, social class, and weight. We present a concept projection approach to capture the valence subspace through contextualized word embeddings of language models. Adapting the projection-based approach to embedding association tests that quantify bias, we find that language models exhibit the most biased attitudes against gender identity, social class, and sexual orientation signals in language. We find that the largest and better-performing model that we study is also more biased as it effectively captures bias embedded in sociocultural data. We validate the bias evaluation method by overperforming on an intrinsic valence evaluation task. The approach enables us to measure complex intersectional biases as they are known to manifest in the outputs and applications of language models that perpetuate historical biases. Moreover, our approach contributes to design justice as it studies the associations of groups underrepresented in language such as transgender and homosexual individuals.
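    Summary Language models trained on large-scale corpora carry implicit biases, and the valence associations of social groups determine biased attitudes toward groups and concepts in social cognition. Building on the existing literature, the paper uses a sentence template with an intersectional context to measure the valence (pleasantness) of social groups. It studies biases related to age, education, gender, height, intelligence, literacy, race, religion, sex, sexual orientation, social class, and weight. A projection approach captures the valence subspace through contextualized word embeddings and is used to quantify bias in language models. The language models show the most biased attitudes against gender identity, social class, and sexual orientation signals. The largest and best-performing model studied is also more biased, as it absorbs the bias embedded in sociocultural data. The paper validates the bias evaluation method and finds that it can measure complex intersectional biases, which persist in the outputs and applications of language models. The approach also contributes to design justice by studying groups underrepresented in language, such as transgender and homosexual individuals.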
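    A projection-based valence score can be sketched as follows: estimate a valence direction from embeddings of pleasant and unpleasant words, then project a group's embedding onto it. The abstract does not specify how the valence subspace is constructed, so the difference-of-means direction, the function names, and the toy 2-dimensional vectors below are assumptions chosen for illustration only.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def valence_direction(pleasant, unpleasant):
    """Unit vector from the unpleasant mean to the pleasant mean (assumed construction)."""
    diff = [p - u for p, u in zip(mean_vector(pleasant), mean_vector(unpleasant))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def valence_score(embedding, direction):
    """Scalar projection of an embedding onto the valence direction."""
    return sum(e * d for e, d in zip(embedding, direction))


if __name__ == "__main__":
    pleasant = [[1.0, 0.0], [1.0, 0.0]]       # toy stand-ins for real embeddings
    unpleasant = [[-1.0, 0.0], [-1.0, 0.0]]
    direction = valence_direction(pleasant, unpleasant)
    print(valence_score([0.5, 3.0], direction))
```

    In practice the embeddings would come from a language model applied to the sentence template, and a lower score for one group than another would indicate a less pleasant association.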

CSCLog: A Component Subsequence Correlation-Aware Log Anomaly Detection Method

  • paper_url: http://arxiv.org/abs/2307.03359
  • repo_url: https://github.com/hang-z/csclog
  • paper_authors: Ling Chen, Chaodu Song, Xu Wang, Dachao Fu, Feifei Li
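  • for: This work proposes a system-log-based anomaly detection method to address the challenge of anomaly detection in intelligent operations.
  • methods: The Component Subsequence Correlation-Aware method (CSCLog) captures the sequential dependencies within subsequences and also models the implicit correlations among subsequences.
  • results: Experiments on four publicly available log datasets show that CSCLog outperforms the best baseline by an average of 7.41% in Macro F1-Measure.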
    Abstract Anomaly detection based on system logs plays an important role in intelligent operations, which is a challenging task due to the extremely complex log patterns. Existing methods detect anomalies by capturing the sequential dependencies in log sequences, which ignore the interactions of subsequences. To this end, we propose CSCLog, a Component Subsequence Correlation-Aware Log anomaly detection method, which not only captures the sequential dependencies in subsequences, but also models the implicit correlations of subsequences. Specifically, subsequences are extracted from log sequences based on components and the sequential dependencies in subsequences are captured by Long Short-Term Memory Networks (LSTMs). An implicit correlation encoder is introduced to model the implicit correlations of subsequences adaptively. In addition, Graph Convolution Networks (GCNs) are employed to accomplish the information interactions of subsequences. Finally, attention mechanisms are exploited to fuse the embeddings of all subsequences. Extensive experiments on four publicly available log datasets demonstrate the effectiveness of CSCLog, outperforming the best baseline by an average of 7.41% in Macro F1-Measure.
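    Summary Anomaly detection based on system logs is an important task in intelligent operations, but it is challenging because log patterns are extremely complex. Existing methods detect anomalies by capturing the sequential dependencies in log sequences, but they ignore the interactions of subsequences. CSCLog captures the sequential dependencies within subsequences and also models the implicit correlations among subsequences. Specifically, subsequences are extracted from log sequences, and Long Short-Term Memory networks (LSTMs) capture their sequential dependencies. An adaptive implicit correlation encoder models the implicit correlations of subsequences, Graph Convolution Networks (GCNs) carry out the information interactions among subsequences, and attention mechanisms fuse the embeddings of all subsequences. Extensive experiments on four publicly available log datasets demonstrate the effectiveness of CSCLog, which outperforms the best baseline by an average of 7.41% in Macro F1-Measure.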
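    The first step described in the abstract, extracting subsequences from a log sequence based on components, can be sketched as a grouping that preserves time order within each component. The `(component, event)` record layout and the function name `split_by_component` are assumptions for illustration; the LSTM, correlation encoder, GCN, and attention stages are not shown.

```python
def split_by_component(log_sequence):
    """Group a time-ordered list of (component, event) pairs by component.

    Events keep their original relative order inside each subsequence.
    """
    subsequences = {}
    for component, event in log_sequence:
        subsequences.setdefault(component, []).append(event)
    return subsequences


if __name__ == "__main__":
    toy_log = [
        ("scheduler", "job_submitted"),
        ("storage", "block_allocated"),
        ("scheduler", "job_started"),
        ("network", "conn_opened"),
        ("storage", "block_written"),
    ]
    for component, events in split_by_component(toy_log).items():
        print(component, events)
```

    Each resulting list would then be encoded separately, so that dependencies inside a component are modeled before correlations across components.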

Stability and Generalization of Stochastic Compositional Gradient Descent Algorithms

  • paper_url: http://arxiv.org/abs/2307.03357
  • repo_url: None
  • paper_authors: Ming Yang, Xiyuan Wei, Tianbao Yang, Yiming Ying
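  • for: This paper studies the stability and generalization of algorithms for stochastic compositional optimization (SCO), which covers machine learning tasks such as reinforcement learning, AUC maximization, and meta-learning, where the objective function involves a nested composition associated with an expectation.
  • methods: The paper analyzes SCO algorithms through algorithmic stability in the framework of statistical learning theory. It first introduces a stability concept called compositional uniform stability and establishes its quantitative relation with generalization. It then establishes compositional uniform stability results for the SCGD and SCSC algorithms. Finally, it derives dimension-independent excess risk bounds by trading off stability results and optimization errors.
  • results: The analysis clarifies how these algorithms behave on future test examples and yields dimension-independent excess risk bounds for SCGD and SCSC, which the authors describe as the first known results of this kind.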
    Abstract Many machine learning tasks can be formulated as a stochastic compositional optimization (SCO) problem such as reinforcement learning, AUC maximization, and meta-learning, where the objective function involves a nested composition associated with an expectation. While a significant number of studies have been devoted to the convergence behavior of SCO algorithms, there is little work on understanding their generalization, i.e., how these learning algorithms built from training examples would behave on future test examples. In this paper, we provide the stability and generalization analysis of stochastic compositional gradient descent algorithms through the lens of algorithmic stability in the framework of statistical learning theory. Firstly, we introduce a stability concept called compositional uniform stability and establish its quantitative relation with generalization for SCO problems. Then, we establish the compositional uniform stability results for two popular stochastic compositional gradient descent algorithms, namely SCGD and SCSC. Finally, we derive dimension-independent excess risk bounds for SCGD and SCSC by trading off their stability results and optimization errors. To the best of our knowledge, these are the first known results on stability and generalization analysis of stochastic compositional gradient descent algorithms.
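    Summary Many machine learning tasks, such as reinforcement learning, AUC maximization, and meta-learning, can be formulated as stochastic compositional optimization (SCO) problems, in which the objective function involves a nested composition. Although many studies address the convergence behavior of SCO algorithms, few examine how these learning algorithms behave on future test examples. This paper provides a stability and generalization analysis of SCO algorithms within the framework of statistical learning theory. It first introduces a stability concept called compositional uniform stability and establishes its quantitative relation with generalization. It then establishes compositional uniform stability results for two popular stochastic compositional gradient descent algorithms, SCGD and SCSC. Finally, it derives dimension-independent excess risk bounds by trading off the algorithms' stability results and optimization errors. To the authors' knowledge, these are the first results on the stability and generalization analysis of stochastic compositional gradient descent algorithms.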
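    For readers unfamiliar with the setting, the nested objective described above is commonly written as below, together with one common form of the SCGD iteration, which tracks the inner expectation with a running average. The symbols (inner map g, outer function f, step sizes eta and beta) are notation assumed here for illustration; the abstract itself gives no formulas, and the ordering of the two updates varies across presentations.

```latex
% SCO objective: an outer expectation of f applied to an inner expectation of g
\min_{x}\; F(x) \;=\; \mathbb{E}_{\nu}\!\left[ f_{\nu}\!\left( \mathbb{E}_{\omega}\!\left[ g_{\omega}(x) \right] \right) \right]

% One common form of SCGD: y_t is a moving-average estimate of E_omega[g_omega(x_t)]
y_{t+1} = (1-\beta_t)\, y_t + \beta_t\, g_{\omega_t}(x_t), \qquad
x_{t+1} = x_t - \eta_t\, \nabla g_{\omega_t}(x_t)^{\top}\, \nabla f_{\nu_t}(y_{t+1})
```

    The running average is needed because plugging a single sample of g into f gives a biased gradient estimate when f is nonlinear.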

Federated Learning over a Wireless Network: Distributed User Selection through Random Access

  • paper_url: http://arxiv.org/abs/2307.03758
  • repo_url: None
  • paper_authors: Chen Sun, Shiyao Ma, Ce Zheng, Songtao Wu, Tao Cui, Lingjuan Lyu
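  • for: Reducing the communication cost of federated learning (FL) over wireless networks, for which user selection has become crucial.
  • methods: The study proposes a network-intrinsic approach to distributed user selection that leverages the radio resource competition mechanism in random access, taking carrier sensing multiple access (CSMA) as an example.
  • results: By manipulating the contention window size, certain users are prioritized for obtaining radio resources in each training round. Training data bias serves as the target scenario for FL with user selection, and a counting mechanism ensures fairness. Simulations with various datasets show that the method rapidly achieves convergence similar to that of centralized user selection.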
    Abstract User selection has become crucial for decreasing the communication costs of federated learning (FL) over wireless networks. However, centralized user selection causes additional system complexity. This study proposes a network intrinsic approach of distributed user selection that leverages the radio resource competition mechanism in random access. Taking the carrier sensing multiple access (CSMA) mechanism as an example of random access, we manipulate the contention window (CW) size to prioritize certain users for obtaining radio resources in each round of training. Training data bias is used as a target scenario for FL with user selection. Prioritization is based on the distance between the newly trained local model and the global model of the previous round. To avoid excessive contribution by certain users, a counting mechanism is used to ensure fairness. Simulations with various datasets demonstrate that this method can rapidly achieve convergence similar to that of the centralized user selection approach.
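    Summary User selection has become key to reducing the communication cost of federated learning (FL) over wireless networks. However, centralized user selection adds system complexity. This study proposes a network-intrinsic method of distributed user selection that leverages the radio resource competition mechanism. Taking carrier sensing multiple access (CSMA) as an example, the contention window (CW) size is manipulated in each training round to prioritize certain users for radio resources. With training data bias as the target scenario, users are prioritized according to the distance between the previous round's global model and the newly trained local model. To avoid excessive contribution by certain users, a counting mechanism maintains fairness. Simulations with various datasets show that the method rapidly achieves convergence similar to that of centralized user selection.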
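    The effect of the contention window on who obtains the channel can be seen in a toy simulation: each user draws a random backoff from its own window, and the smallest backoff wins. The sketch below is a simplified model, not the paper's protocol. The mapping `cw_from_distance`, its direction (larger model distance gives a smaller window), and the window bounds are all assumptions, since the abstract only says that prioritization is based on model distance.

```python
import random


def cw_from_distance(distance, max_distance, cw_min=4, cw_max=64):
    """Hypothetical mapping: larger model distance -> smaller window -> higher priority."""
    ratio = min(max(distance / max_distance, 0.0), 1.0)
    return int(round(cw_max - ratio * (cw_max - cw_min)))


def contention_round(cw_sizes, rng):
    """Each user draws a backoff in [0, CW); the unique smallest backoff wins.

    Returns the winning user, or None when the smallest backoff is tied
    (treated here as a collision).
    """
    backoffs = {user: rng.randrange(cw) for user, cw in cw_sizes.items()}
    best = min(backoffs.values())
    winners = [user for user, b in backoffs.items() if b == best]
    return winners[0] if len(winners) == 1 else None


if __name__ == "__main__":
    rng = random.Random(0)
    windows = {
        "far": cw_from_distance(0.9, 1.0),   # local model far from global model
        "near": cw_from_distance(0.1, 1.0),  # local model close to global model
    }
    wins = {"far": 0, "near": 0}
    for _ in range(2000):
        winner = contention_round(windows, rng)
        if winner is not None:
            wins[winner] += 1
    print(windows, wins)
```

    No central scheduler appears anywhere in the loop: each user only needs its own window size, which is the sense in which the selection is distributed.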

Distilling Universal and Joint Knowledge for Cross-Domain Model Compression on Time Series Data

  • paper_url: http://arxiv.org/abs/2307.03347
  • repo_url: https://github.com/ijcai2023/uni_kd
  • paper_authors: Qing Xu, Min Wu, Xiaoli Li, Kezhi Mao, Zhenghua Chen
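  • for: This paper proposes an end-to-end framework for compressing deep learning models for resource-limited environments under cross-domain shift.
  • methods: The Universal and joint knowledge distillation (UNI-KD) framework transfers two kinds of knowledge from the teacher to the student model: universal feature-level knowledge across the source and target domains, and joint logit-level knowledge shared by both domains.
  • results: Experiments on four time series datasets show that the method outperforms state-of-the-art (SOTA) benchmarks and achieves model compression and adaptation under cross-domain shift.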
    Abstract For many real-world time series tasks, the computational complexity of prevalent deep learning models often hinders deployment in resource-limited environments (e.g., smartphones). Moreover, due to the inevitable domain shift between model training (source) and deployment (target) stages, compressing those deep models under cross-domain scenarios becomes more challenging. Although some existing works have already explored cross-domain knowledge distillation for model compression, they are either biased to source data or heavily tangled between source and target data. To this end, we design a novel end-to-end framework called Universal and joint knowledge distillation (UNI-KD) for cross-domain model compression. In particular, we propose to transfer both the universal feature-level knowledge across source and target domains and the joint logit-level knowledge shared by both domains from the teacher to the student model via an adversarial learning scheme. More specifically, a feature-domain discriminator is employed to align teacher's and student's representations for universal knowledge transfer. A data-domain discriminator is utilized to prioritize the domain-shared samples for joint knowledge transfer. Extensive experimental results on four time series datasets demonstrate the superiority of our proposed method over state-of-the-art (SOTA) benchmarks.
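    Summary For many real-world time series tasks, the computational complexity of existing deep learning models often hinders deployment in resource-limited environments (e.g., smartphones). Moreover, because of the domain shift between the source and target domains, compressing these deep models in cross-domain scenarios becomes more challenging. Although some existing works have explored cross-domain knowledge distillation, they are either biased toward source data or heavily tangled between source and target data. The paper therefore designs a novel end-to-end framework, Universal and joint knowledge distillation (UNI-KD), for cross-domain model compression. Specifically, universal feature-level knowledge is transferred from the teacher to the student through an adversarial learning scheme in which a feature-domain discriminator aligns the teacher's and student's representations. A data-domain discriminator prioritizes domain-shared samples for joint knowledge transfer. Extensive experiments on four time series datasets show that the proposed method outperforms state-of-the-art (SOTA) benchmarks.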
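    Logit-level distillation is typically implemented as a divergence between temperature-softened teacher and student distributions. The sketch below shows that generic loss, plus a per-sample weighted average as a stand-in for prioritizing some samples over others. The weights here are plain inputs; in the paper they would come from the data-domain discriminator, whose form the abstract does not describe, so `weighted_kd_loss` and the temperature value are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def weighted_kd_loss(teacher_batch, student_batch, weights, temperature=2.0):
    """Weighted mean of per-sample losses; weights must not sum to zero."""
    total = sum(
        w * kd_loss(t, s, temperature)
        for t, s, w in zip(teacher_batch, student_batch, weights)
    )
    return total / sum(weights)


if __name__ == "__main__":
    teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
    student = [[1.5, 0.7, -0.5], [2.0, 0.0, 0.5]]
    print(weighted_kd_loss(teacher, student, weights=[0.8, 0.2]))
```

    Down-weighting a sample reduces its pull on the student, which is one simple way to let domain-shared samples dominate the transfer.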

Dividing and Conquering a BlackBox to a Mixture of Interpretable Models: Route, Interpret, Repeat

  • paper_url: http://arxiv.org/abs/2307.05350
  • repo_url: https://github.com/batmanlab/ICML-2023-Route-interpret-repeat
  • paper_authors: Shantanu Ghosh, Ke Yu, Forough Arabshahi, Kayhan Batmanghelich
  • for: This paper aims to blur the distinction between post hoc explanation of a Blackbox and constructing interpretable models.
  • methods: The proposed method begins with a Blackbox, iteratively carves out a mixture of interpretable experts (MoIE) and a residual network, and uses First Order Logic (FOL) to provide basic reasoning on concepts from the Blackbox.
  • results: The extensive experiments show that the proposed approach (1) identifies a diverse set of instance-specific concepts with high concept completeness via MoIE without compromising performance, (2) identifies the relatively “harder” samples to explain via residuals, (3) outperforms the interpretable by-design models by significant margins during test-time interventions, and (4) fixes the shortcut learned by the original Blackbox.
    Abstract ML model design either starts with an interpretable model or a Blackbox and explains it post hoc. Blackbox models are flexible but difficult to explain, while interpretable models are inherently explainable. Yet, interpretable models require extensive ML knowledge and tend to be less flexible and underperforming than their Blackbox variants. This paper aims to blur the distinction between a post hoc explanation of a Blackbox and constructing interpretable models. Beginning with a Blackbox, we iteratively carve out a mixture of interpretable experts (MoIE) and a residual network. Each interpretable model specializes in a subset of samples and explains them using First Order Logic (FOL), providing basic reasoning on concepts from the Blackbox. We route the remaining samples through a flexible residual. We repeat the method on the residual network until all the interpretable models explain the desired proportion of data. Our extensive experiments show that our route, interpret, and repeat approach (1) identifies a diverse set of instance-specific concepts with high concept completeness via MoIE without compromising in performance, (2) identifies the relatively "harder" samples to explain via residuals, (3) outperforms the interpretable by-design models by significant margins during test-time interventions, and (4) fixes the shortcut learned by the original Blackbox. The code for MoIE is publicly available at: https://github.com/batmanlab/ICML-2023-Route-interpret-repeat

Personalized Prediction of Recurrent Stress Events Using Self-Supervised Learning on Multimodal Time-Series Data

  • paper_url: http://arxiv.org/abs/2307.03337
  • repo_url: None
  • paper_authors: Tanvir Islam, Peter Washington
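  • for: Personalized prediction of recurrent stress events from multimodal wearable time-series data.
  • methods: Self-supervised learning (SSL) pre-trains a model on each subject's wearable biosignal data, so the model learns that participant's baseline dynamics before it is fine-tuned on the stress prediction task.
  • results: On the Wearable Stress and Affect Detection (WESAD) dataset, the SSL models outperform non-SSL models while using less than 5% of the annotations, suggesting that stress prediction can be personalized to each user with minimal labels.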
    Abstract Chronic stress can significantly affect physical and mental health. The advent of wearable technology allows for the tracking of physiological signals, potentially leading to innovative stress prediction and intervention methods. However, challenges such as label scarcity and data heterogeneity render stress prediction difficult in practice. To counter these issues, we have developed a multimodal personalized stress prediction system using wearable biosignal data. We employ self-supervised learning (SSL) to pre-train the models on each subject's data, allowing the models to learn the baseline dynamics of the participant's biosignals prior to fine-tuning the stress prediction task. We test our model on the Wearable Stress and Affect Detection (WESAD) dataset, demonstrating that our SSL models outperform non-SSL models while utilizing less than 5% of the annotations. These results suggest that our approach can personalize stress prediction to each user with minimal annotations. This paradigm has the potential to enable personalized prediction of a variety of recurring health events using complex multimodal data streams.

Variational quantum regression algorithm with encoded data structure

  • paper_url: http://arxiv.org/abs/2307.03334
  • repo_url: None
  • paper_authors: C. -C. Joseph Wang, Ryan S. Bennink
  • for: Building model interpretability into variational quantum machine learning, through a quantum regression algorithm for noisy quantum computers.
  • methods: Constructs a quantum regression algorithm whose variational parameters relate directly to the learned regression coefficients, employs a circuit that directly encodes the data in quantum amplitudes, and uses compressed encoding and digital-analog gate operation to reduce the run time complexity.
  • results: Compressed binary encoding gives a remarkable reduction in the number of physical qubits compared to traditional one-hot encoding of the same input data; the algorithm handles linear regression and, with nonlinear features built into the training data, nonlinear regression; ensemble model training with regularization-based feature selection is illustrated numerically.
    Abstract Variational quantum algorithms (VQAs) prevail to solve practical problems such as combinatorial optimization, quantum chemistry simulation, quantum machine learning, and quantum error correction on noisy quantum computers. For variational quantum machine learning, a variational algorithm with model interpretability built into the algorithm is yet to be exploited. In this paper, we construct a quantum regression algorithm and identify the direct relation of variational parameters to learned regression coefficients, while employing a circuit that directly encodes the data in quantum amplitudes reflecting the structure of the classical data table. The algorithm is particularly suitable for well-connected qubits. With compressed encoding and digital-analog gate operation, the run time complexity is logarithmically more advantageous than that for digital 2-local gate native hardware with the number of data entries encoded, a decent improvement in noisy intermediate-scale quantum computers and a minor improvement for large-scale quantum computing. Our suggested method of compressed binary encoding offers a remarkable reduction in the number of physical qubits needed when compared to the traditional one-hot-encoding technique with the same input data. The algorithm inherently performs linear regression but can also be used easily for nonlinear regression by building nonlinear features into the training data. In terms of measured cost function which distinguishes a good model from a poor one for model training, it will be effective only when the number of features is much less than the number of records for the encoded data structure to be observable. To echo this finding and mitigate hardware noise in practice, the ensemble model training from the quantum regression model learning with important feature selection from regularization is incorporated and illustrated numerically.
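As a rough illustration of the qubit savings described in the abstract, the sketch below compares register sizes under two assumptions that are not stated in the abstract itself: one-hot encoding is taken to need one qubit per data entry, while compressed binary encoding is taken to index entries by basis states and so to need about ceil(log2 N) qubits. The function names are hypothetical, and the count ignores any ancilla qubits an actual circuit might require.

```python
def one_hot_qubits(n_entries: int) -> int:
    """Assumed cost of one-hot encoding: one qubit per data entry."""
    return n_entries


def binary_qubits(n_entries: int) -> int:
    """Assumed cost of compressed binary encoding: ceil(log2(n_entries)) qubits.

    (n - 1).bit_length() computes ceil(log2(n)) exactly for integers n >= 1.
    """
    return max(1, (n_entries - 1).bit_length())


# Compare the two register sizes for a few data-table sizes.
for n in (8, 1024, 10**6):
    print(f"{n} entries: one-hot {one_hot_qubits(n)} qubits, binary {binary_qubits(n)} qubits")
```

Under these assumptions the gap grows quickly: a table with about a million entries would need about a million qubits one-hot but only 20 in binary form.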

ACDNet: Attention-guided Collaborative Decision Network for Effective Medication Recommendation

  • paper_url: http://arxiv.org/abs/2307.03332
  • repo_url: None
  • paper_authors: Jiacong Mi, Yi Zu, Zhuoyuan Wang, Jieyue He
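  • for: Medication recommendation from Electronic Health Records (EHR), intended to support clinicians' prescribing decisions.
  • methods: An attention mechanism and a Transformer capture patient health conditions and medication records by modeling historical visits at both global and local levels; a collaborative decision framework then uses the similarity between medication records and medicine representations to drive the recommendation.
  • results: On two large medical datasets, MIMIC-III and MIMIC-IV, ACDNet outperforms prior models in Jaccard, PR-AUC, and F1 score. Ablation experiments and a case study confirm the contribution of each module to the overall performance.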
    Abstract Medication recommendation using Electronic Health Records (EHR) is challenging due to complex medical data. Current approaches extract longitudinal information from patient EHR to personalize recommendations. However, existing models often lack sufficient patient representation and overlook the importance of considering the similarity between a patient's medication records and specific medicines. Therefore, an Attention-guided Collaborative Decision Network (ACDNet) for medication recommendation is proposed in this paper. Specifically, ACDNet utilizes attention mechanism and Transformer to effectively capture patient health conditions and medication records by modeling their historical visits at both global and local levels. ACDNet also employs a collaborative decision framework, utilizing the similarity between medication records and medicine representation to facilitate the recommendation process. The experimental results on two extensive medical datasets, MIMIC-III and MIMIC-IV, clearly demonstrate that ACDNet outperforms state-of-the-art models in terms of Jaccard, PR-AUC, and F1 score, reaffirming its superiority. Moreover, the ablation experiments provide solid evidence of the effectiveness of each module in ACDNet, validating their contribution to the overall performance. Furthermore, a detailed case study reinforces the effectiveness of ACDNet in medication recommendation based on EHR data, showcasing its practical value in real-world healthcare scenarios.
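Two of the metrics named in the abstract, Jaccard and F1, are set-overlap scores. The sketch below shows how they could be computed for a single visit's recommended medication set; the medication names are made up, and how the paper aggregates scores over visits and patients is not given in the abstract, so that part is left out. PR-AUC is also omitted because it needs per-medicine scores rather than a hard set.

```python
def jaccard(predicted: set, actual: set) -> float:
    """Intersection over union of the predicted and actual medication sets."""
    union = predicted | actual
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(predicted & actual) / len(union)


def f1_score(predicted: set, actual: set) -> float:
    """Harmonic mean of precision and recall over the medication sets."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical single-visit example.
predicted = {"aspirin", "metformin", "lisinopril"}
actual = {"aspirin", "metformin", "atorvastatin"}
print(jaccard(predicted, actual), f1_score(predicted, actual))
```

Here two of the three predicted medicines are correct, so Jaccard is 2/4 and F1 is 2/3.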

Encoder-Decoder Networks for Self-Supervised Pretraining and Downstream Signal Bandwidth Regression on Digital Antenna Arrays

  • paper_url: http://arxiv.org/abs/2307.03327
  • repo_url: None
  • paper_authors: Rajib Bhattacharjea, Nathan West
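  • for: The first application of self-supervised learning to data from digital antenna arrays.
  • methods: Encoder-decoder networks are pretrained on a self-supervised noisy-reconstruction task called channel in-painting, in which the network infers the contents of array data that has been masked with zeros. No human-labeled data is required.
  • results: Transferring the pretrained encoder architecture and weights to a new network and training it on a small volume of labeled data gives better bandwidth regression on digital array data than an equivalent network trained on the same labeled data from random initialization.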
    Abstract This work presents the first applications of self-supervised learning applied to data from digital antenna arrays. Encoder-decoder networks are pretrained on digital array data to perform a self-supervised noisy-reconstruction task called channel in-painting, in which the network infers the contents of array data that has been masked with zeros. The self-supervised step requires no human-labeled data. The encoder architecture and weights from pretraining are then transferred to a new network with a task-specific decoder, and the new network is trained on a small volume of labeled data. We show that pretraining on the unlabeled data allows the new network to perform the task of bandwidth regression on the digital array data better than an equivalent network that is trained on the same labeled data from random initialization.
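The masking step of channel in-painting can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes whole channels are zeroed at random, uses real-valued samples for brevity (array data is typically complex), and the `mask_channels` name and `mask_fraction` parameter are hypothetical.

```python
import random


def mask_channels(snapshot, mask_fraction=0.25, rng=None):
    """Zero out a random subset of channels.

    snapshot is a list of per-channel sample lists. Returns the masked copy
    and the sorted indices of the masked channels; the input is not modified.
    """
    rng = rng or random.Random()
    n_channels = len(snapshot)
    n_masked = max(1, int(round(mask_fraction * n_channels)))
    masked_idx = sorted(rng.sample(range(n_channels), n_masked))
    masked_set = set(masked_idx)
    masked = [
        [0.0] * len(channel) if i in masked_set else list(channel)
        for i, channel in enumerate(snapshot)
    ]
    return masked, masked_idx


# A toy 4-channel snapshot with 3 samples per channel.
snapshot = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
masked_input, masked_idx = mask_channels(snapshot, rng=random.Random(0))
# A self-supervised training pair is (masked_input, snapshot):
# the network sees the masked array and is trained to reconstruct the original.
print(masked_idx, masked_input)
```

Because the target is the unmasked input itself, no labels are needed, which matches the abstract's point that the pretraining step requires no human-labeled data.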

Machine Learning to detect cyber-attacks and discriminating the types of power system disturbances

  • paper_url: http://arxiv.org/abs/2307.03323
  • repo_url: None
  • paper_authors: Diane Tuyizere, Remy Ihabwikuzo
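  • for: A machine learning-based attack detection model for power systems, specifically smart grids.
  • methods: The model uses data and logs collected from Phasor Measuring Devices (PMUs) to learn system behaviors and identify potential security boundaries; Random Forest, Logistic Regression, and K-Nearest Neighbour models are built and compared.
  • results: The Random Forest model performs best, with an accuracy of 90.56% in detecting power system disturbances, and could assist operators in decision-making.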
    Abstract This research proposes a machine learning-based attack detection model for power systems, specifically targeting smart grids. By utilizing data and logs collected from Phasor Measuring Devices (PMUs), the model aims to learn system behaviors and effectively identify potential security boundaries. The proposed approach involves crucial stages including dataset pre-processing, feature selection, model creation, and evaluation. To validate our approach, we used a dataset consisting of 15 separate datasets obtained from different PMUs, relay snort alarms and logs. Three machine learning models: Random Forest, Logistic Regression, and K-Nearest Neighbour were built and evaluated using various performance metrics. The findings indicate that the Random Forest model achieves the highest performance with an accuracy of 90.56% in detecting power system disturbances and has the potential in assisting operators in decision-making processes.

Assisting Clinical Decisions for Scarcely Available Treatment via Disentangled Latent Representation

  • paper_url: http://arxiv.org/abs/2307.03315
  • repo_url: None
  • paper_authors: Bing Xue, Ahmed Sameh Said, Ziqi Xu, Hanyang Liu, Neel Shah, Hanqing Yang, Philip Payne, Chenyang Lu
  • for: This paper aims to support clinical decisions for COVID-19 patients who require extracorporeal membrane oxygenation (ECMO) treatment.
  • methods: The paper proposes a novel approach called Treatment Variational AutoEncoder (TVAE) to predict individualized treatment outcomes for COVID-19 patients. TVAE uses a deep latent variable model to represent patients’ potential treatment assignments and factual/counterfactual outcomes, and alleviates prediction errors through a reconstruction regularization scheme and semi-supervision.
  • results: The paper evaluates TVAE on two real-world COVID-19 datasets and shows that it outperforms state-of-the-art treatment effect models in predicting propensity scores and factual outcomes on heterogeneous datasets. Additionally, TVAE outperforms existing models in individual treatment effect estimation on a synthesized dataset.
    Abstract Extracorporeal membrane oxygenation (ECMO) is an essential life-supporting modality for COVID-19 patients who are refractory to conventional therapies. However, the proper treatment decision has been the subject of significant debate and it remains controversial about who benefits from this scarcely available and technically complex treatment option. To support clinical decisions, it is a critical need to predict the treatment need and the potential treatment and no-treatment responses. Targeting this clinical challenge, we propose Treatment Variational AutoEncoder (TVAE), a novel approach for individualized treatment analysis. TVAE is specifically designed to address the modeling challenges like ECMO with strong treatment selection bias and scarce treatment cases. TVAE conceptualizes the treatment decision as a multi-scale problem. We model a patient's potential treatment assignment and the factual and counterfactual outcomes as part of their intrinsic characteristics that can be represented by a deep latent variable model. The factual and counterfactual prediction errors are alleviated via a reconstruction regularization scheme together with semi-supervision, and the selection bias and the scarcity of treatment cases are mitigated by the disentangled and distribution-matched latent space and the label-balancing generative strategy. We evaluate TVAE on two real-world COVID-19 datasets: an international dataset collected from 1651 hospitals across 63 countries, and an institutional dataset collected from 15 hospitals. The results show that TVAE outperforms state-of-the-art treatment effect models in predicting both the propensity scores and factual outcomes on heterogeneous COVID-19 datasets. Additional experiments also show TVAE outperforms the best existing models in individual treatment effect estimation on the synthesized IHDP benchmark dataset.

On Invariance, Equivariance, Correlation and Convolution of Spherical Harmonic Representations for Scalar and Vectorial Data

  • paper_url: http://arxiv.org/abs/2307.03311
  • repo_url: None
  • paper_authors: Janis Keuper
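  • for: This technical report gives an in-depth introduction to the theoretical foundation and practical implementation of data representations in the Spherical Harmonic (SH) domain.
  • methods: It summarizes work on rotation-invariant and equivariant features, as well as convolutions and exact correlations of signals on spheres.
  • results: The methods are generalized from scalar SH representations to Vectorial Harmonics (VH), providing the same capabilities for 3D vector fields on spheres.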
    Abstract The mathematical representations of data in the Spherical Harmonic (SH) domain have recently regained increasing interest in the machine learning community. This technical report gives an in-depth introduction to the theoretical foundation and practical implementation of SH representations, summarizing works on rotation invariant and equivariant features, as well as convolutions and exact correlations of signals on spheres. In extension, these methods are then generalized from scalar SH representations to Vectorial Harmonics (VH), providing the same capabilities for 3D vector fields on spheres.

When Fair Classification Meets Noisy Protected Attributes

  • paper_url: http://arxiv.org/abs/2307.03306
  • repo_url: https://github.com/evijit/awareness_vs_unawareness
  • paper_authors: Avijit Ghosh, Pablo Kvitca, Christo Wilson
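  • for: The practical challenges of operationalizing algorithmic fairness when protected attributes in datasets are unavailable or unreliable.
  • methods: A head-to-head comparison of attribute-reliant, noise-tolerant, and attribute-blind fair classification algorithms along the dual axes of predictivity and fairness, via case studies on four real-world datasets and synthetic perturbations.
  • results: Attribute-blind and noise-tolerant fair classifiers can potentially achieve a similar level of performance as attribute-reliant algorithms, even when protected attributes are noisy, but implementing them in practice requires care.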
    Abstract The operationalization of algorithmic fairness comes with several practical challenges, not the least of which is the availability or reliability of protected attributes in datasets. In real-world contexts, practical and legal impediments may prevent the collection and use of demographic data, making it difficult to ensure algorithmic fairness. While initial fairness algorithms did not consider these limitations, recent proposals aim to achieve algorithmic fairness in classification by incorporating noisiness in protected attributes or not using protected attributes at all. To the best of our knowledge, this is the first head-to-head study of fair classification algorithms to compare attribute-reliant, noise-tolerant and attribute-blind algorithms along the dual axes of predictivity and fairness. We evaluated these algorithms via case studies on four real-world datasets and synthetic perturbations. Our study reveals that attribute-blind and noise-tolerant fair classifiers can potentially achieve similar level of performance as attribute-reliant algorithms, even when protected attributes are noisy. However, implementing them in practice requires careful nuance. Our study provides insights into the practical implications of using fair classification algorithms in scenarios where protected attributes are noisy or partially available.
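To make the notion of a noisy protected attribute concrete, the sketch below flips a binary attribute independently with a fixed probability and measures a demographic parity gap. Both choices are assumptions made for illustration: the abstract does not specify the noise model behind its synthetic perturbations or which fairness metrics were used, and the function names are hypothetical.

```python
import random


def flip_attribute(attrs, noise_rate, rng):
    """Flip each binary protected attribute independently with probability noise_rate."""
    return [1 - a if rng.random() < noise_rate else a for a in attrs]


def demographic_parity_gap(preds, attrs):
    """Absolute difference in positive-prediction rate between groups 1 and 0."""
    rates = []
    for group in (0, 1):
        members = [p for p, a in zip(preds, attrs) if a == group]
        rates.append(sum(members) / len(members) if members else 0.0)
    return abs(rates[1] - rates[0])


# Toy predictions and true group membership.
preds = [1, 1, 1, 0, 0, 0, 1, 0]
attrs = [1, 1, 1, 1, 0, 0, 0, 0]
print("gap with clean attributes:", demographic_parity_gap(preds, attrs))

# The same predictions audited against a noisy copy of the attribute.
noisy = flip_attribute(attrs, noise_rate=0.2, rng=random.Random(0))
print("gap with noisy attributes:", demographic_parity_gap(preds, noisy))
```

The point of the exercise is that a fairness measurement, and any algorithm that optimizes it, sees only the recorded attribute; when that record is noisy, the measured gap can differ from the true one.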

A Vulnerability of Attribution Methods Using Pre-Softmax Scores

  • paper_url: http://arxiv.org/abs/2307.03305
  • repo_url: https://github.com/mlerma54/adversarial-attacks-on-saliency-maps
  • paper_authors: Miguel Lerma, Mirtha Lucas
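  • for: This paper discusses a vulnerability in a category of attribution methods used to explain the outputs of convolutional neural networks working as classifiers.
  • methods: Instead of adversarial perturbations of the input, the paper considers small modifications of the model itself and their effect on attribution methods that rely on pre-softmax scores.
  • results: Small modifications of the model can change the explanations produced by the attribution method without altering the model outputs.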
    Abstract We discuss a vulnerability involving a category of attribution methods used to provide explanations for the outputs of convolutional neural networks working as classifiers. It is known that networks of this type are vulnerable to adversarial attacks, in which imperceptible perturbations of the input may alter the outputs of the model. In contrast, here we focus on effects that small modifications in the model may cause on the attribution method without altering the model outputs.
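One standard fact makes this kind of effect plausible: softmax is unchanged when the same value is added to every pre-softmax score. A model could therefore be altered so that all of its logits shift by a common term, leaving its predictions intact while anything computed from the raw logits changes. Whether this is the exact construction used in the paper is an assumption; the abstract does not spell it out. The sketch below only checks the shift invariance itself.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of pre-softmax scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, -1.0, 0.5]
shifted = [z + 3.7 for z in logits]  # same offset added to every score

# The probabilities agree up to floating-point error,
# even though every pre-softmax score has changed.
print(softmax(logits))
print(softmax(shifted))
```

An attribution method that differentiates a pre-softmax score with respect to the input would see the gradient of the added term as well, which is why explanations can move while outputs stay fixed.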

Equivariant Spherical CNN for Data Efficient and High-Performance Medical Image Processing

  • paper_url: http://arxiv.org/abs/2307.03298
  • repo_url: None
  • paper_authors: Amirreza Hashemi, Yuemeng Feng, Hamid Sabet
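  • for: Equivariant networks as efficient, high-performance approaches for tomography applications in medical image processing.
  • methods: An equivariant spherical CNN (SCNN) is introduced to reduce the dependence of conventional CNNs on specific training sets. It is evaluated on spherical signals for denoising and reconstruction benchmark problems, and is also proposed as a complement to conventional image reconstruction tools.
  • results: Compared to CNNs, SCNNs give the same or higher image processing quality at significantly lower computational cost, while reducing reliance on the training set.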
    Abstract This work highlights the significance of equivariant networks as efficient and high-performance approaches for tomography applications. Our study builds upon the limitations of Convolutional Neural Networks (CNNs), which have shown promise in post-processing various medical imaging systems. However, the efficiency of conventional CNNs heavily relies on an undiminished and proper training set. To tackle this issue, in this study, we introduce an equivariant network, aiming to reduce CNN's dependency on specific training sets. We evaluate the efficacy of equivariant CNNs on spherical signals for tomographic medical imaging problems. Our results demonstrate superior quality and computational efficiency of spherical CNNs (SCNNs) in denoising and reconstructing benchmark problems. Furthermore, we propose a novel approach to employ SCNNs as a complement to conventional image reconstruction tools, enhancing the outcomes while reducing reliance on the training set. Across all cases, we observe a significant decrease in computational costs while maintaining the same or higher quality of image processing using SCNNs compared to CNNs. Additionally, we explore the potential of this network for broader tomography applications, particularly those requiring omnidirectional representation.

OmniBoost: Boosting Throughput of Heterogeneous Embedded Devices under Multi-DNN Workload

  • paper_url: http://arxiv.org/abs/2307.03290
  • repo_url: None
  • paper_authors: Andreas Karatzas, Iraklis Anagnostopoulos
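  • for: High throughput for multi-DNN workloads on heterogeneous embedded devices.
  • methods: OmniBoost, a lightweight and extensible multi-DNN manager that combines stochastic space exploration with a highly accurate performance estimator to exploit the heterogeneity of the available accelerators.
  • results: A 4.6x average throughput boost compared to other state-of-the-art methods, evaluated on the HiKey970 development board.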
    Abstract Modern Deep Neural Networks (DNNs) exhibit profound efficiency and accuracy properties. This has introduced application workloads that comprise multiple DNN applications, raising new challenges regarding workload distribution. Equipped with a diverse set of accelerators, newer embedded systems present architectural heterogeneity, which current run-time controllers are unable to fully utilize. To enable high throughput in multi-DNN workloads, such a controller ought to explore hundreds of thousands of possible solutions to exploit the underlying heterogeneity. In this paper, we propose OmniBoost, a lightweight and extensible multi-DNN manager for heterogeneous embedded devices. We leverage stochastic space exploration and we combine it with a highly accurate performance estimator to observe a 4.6x average throughput boost compared to other state-of-the-art methods. The evaluation was performed on the HiKey970 development board.

Optimal Scalarizations for Sublinear Hypervolume Regret

  • paper_url: http://arxiv.org/abs/2307.03288
  • repo_url: None
  • paper_authors: Qiuyi Zhang
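  • for: Finding simple non-linear scalarizations that explore a diverse set of $k$ objectives on the Pareto frontier, as measured by the dominated hypervolume.
  • methods: Hypervolume scalarizations with uniformly random weights, analyzed through hypervolume regret, with multiobjective stochastic linear bandits as a theoretical case study.
  • results: Hypervolume scalarizations achieve an optimal sublinear hypervolume regret bound of $O(T^{-1/k})$ with matching lower bounds, and yield improved regret bounds of $\tilde{O}(d T^{-1/2} + T^{-1/k})$ for linear bandits. Empirically, simple hypervolume scalarizations consistently outperform the linear and Chebyshev scalarizations as well as standard multiobjective algorithms in Bayesian optimization, such as EHVI.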
    Abstract Scalarization is a general technique that can be deployed in any multiobjective setting to reduce multiple objectives into one, such as recently in RLHF for training reward models that align human preferences. Yet some have dismissed this classical approach because linear scalarizations are known to miss concave regions of the Pareto frontier. To that end, we aim to find simple non-linear scalarizations that can explore a diverse set of $k$ objectives on the Pareto frontier, as measured by the dominated hypervolume. We show that hypervolume scalarizations with uniformly random weights are surprisingly optimal for provably minimizing the hypervolume regret, achieving an optimal sublinear regret bound of $O(T^{-1/k})$, with matching lower bounds that preclude any algorithm from doing better asymptotically. As a theoretical case study, we consider the multiobjective stochastic linear bandits problem and demonstrate that by exploiting the sublinear regret bounds of the hypervolume scalarizations, we can derive a novel non-Euclidean analysis that produces improved hypervolume regret bounds of $\tilde{O}( d T^{-1/2} + T^{-1/k})$. We support our theory with strong empirical performance of using simple hypervolume scalarizations that consistently outperforms both the linear and Chebyshev scalarizations, as well as standard multiobjective algorithms in bayesian optimization, such as EHVI.
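The abstract does not state the scalarization formula. The sketch below assumes the form commonly used for hypervolume scalarizations in related work, $s_\lambda(y) = \min_i \max(0, y_i/\lambda_i)^k$ with $\lambda$ drawn uniformly from the positive part of the unit sphere and $y$ measured from the reference point, under which the dominated hypervolume equals $c_k \, \mathbb{E}_\lambda[\max_y s_\lambda(y)]$ with $c_k = \pi^{k/2} / (2^k \Gamma(k/2 + 1))$. All function names are hypothetical, and the Monte Carlo estimator is for illustration only.

```python
import math
import random


def sample_weights(k, rng):
    """Draw a direction uniformly from the positive part of the unit sphere in R^k."""
    g = [abs(rng.gauss(0.0, 1.0)) for _ in range(k)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def hv_scalarization(y, weights):
    """Assumed form: s_w(y) = min_i max(0, y_i / w_i)^k, y relative to the reference point."""
    k = len(y)
    return min(max(0.0, yi / max(wi, 1e-12)) ** k for yi, wi in zip(y, weights))


def estimate_hypervolume(points, n_samples=20000, seed=0):
    """Monte Carlo estimate of the hypervolume dominated by points (reference at origin)."""
    rng = random.Random(seed)
    k = len(points[0])
    c_k = math.pi ** (k / 2) / (2 ** k * math.gamma(k / 2 + 1))
    total = 0.0
    for _ in range(n_samples):
        w = sample_weights(k, rng)
        # Each random weight vector scores the whole set by its best point.
        total += max(hv_scalarization(y, w) for y in points)
    return c_k * total / n_samples


# Two points whose dominated region is the union of two rectangles (exact area 3).
print(estimate_hypervolume([(2.0, 1.0), (1.0, 2.0)]))
```

Maximizing a randomly weighted scalarization at each step therefore optimizes, in expectation, the hypervolume itself, which is the intuition behind using random weights to cover the Pareto frontier.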

Empirical Analysis of a Segmentation Foundation Model in Prostate Imaging

  • paper_url: http://arxiv.org/abs/2307.03266
  • repo_url: None
  • paper_authors: Heejong Kim, Victor Ion Butoi, Adrian V. Dalca, Daniel J. A. Margolis, Mert R. Sabuncu
  • for: This paper evaluates the effectiveness of a foundation model for medical image segmentation in the context of prostate imaging.
  • methods: The paper uses a recently developed foundation model called UniverSeg, which is compared against the conventional approach of training a task-specific segmentation model.
  • results: The study finds that the foundation model achieves competitive performance in prostate imaging segmentation, and highlights several factors that will matter for the development and adoption of foundation models for medical image segmentation.
    Abstract Most state-of-the-art techniques for medical image segmentation rely on deep-learning models. These models, however, are often trained on narrowly-defined tasks in a supervised fashion, which requires expensive labeled datasets. Recent advances in several machine learning domains, such as natural language generation have demonstrated the feasibility and utility of building foundation models that can be customized for various downstream tasks with little to no labeled data. This likely represents a paradigm shift for medical imaging, where we expect that foundation models may shape the future of the field. In this paper, we consider a recently developed foundation model for medical image segmentation, UniverSeg. We conduct an empirical evaluation study in the context of prostate imaging and compare it against the conventional approach of training a task-specific segmentation model. Our results and discussion highlight several important factors that will likely be important in the development and adoption of foundation models for medical image segmentation.

Vision Language Transformers: A Survey

  • paper_url: http://arxiv.org/abs/2307.03254
  • repo_url: None
  • paper_authors: Clayton Fields, Casey Kennington
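  • for: A survey that synthesizes the currently available research on vision language transformer models and analyzes their strengths, limitations, and open questions.
  • methods: The surveyed models adapt the pretrained transformer architecture: they are pretrained on large generic datasets and then transferred to new tasks with minor changes in architecture and parameter values.
  • results: Transformer models have greatly improved performance and versatility over previous vision language models.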
    Abstract Vision language tasks, such as answering questions about or generating captions that describe an image, are difficult tasks for computers to perform. A relatively recent body of research has adapted the pretrained transformer architecture introduced in Vaswani et al. (2017) to vision language modeling. Transformer models have greatly improved performance and versatility over previous vision language models. They do so by pretraining models on large generic datasets and transferring their learning to new tasks with minor changes in architecture and parameter values. This type of transfer learning has become the standard modeling practice in both natural language processing and computer vision. Vision language transformers offer the promise of producing similar advancements in tasks which require both vision and language. In this paper, we provide a broad synthesis of the currently available research on vision language transformer models and offer some analysis of their strengths, limitations and some open questions that remain.
    摘要 computer vision tasks that require both vision and language, such as answering questions about or generating captions that describe an image, are difficult for computers to perform. Recently, researchers have adapted the pretrained transformer architecture introduced in \citet{vaswani2017attention} to vision language modeling, which has greatly improved performance and versatility over previous vision language models. These models are trained on large generic datasets and then transferred to new tasks with minor changes in architecture and parameter values, which has become the standard modeling practice in both natural language processing and computer vision. Vision language transformers offer the promise of producing similar advancements in tasks that require both vision and language. In this paper, we provide a comprehensive overview of the currently available research on vision language transformer models and offer some analysis of their strengths, limitations, and open questions that remain.

Learned Kernels for Interpretable and Efficient PPG Signal Quality Assessment and Artifact Segmentation

  • paper_url: http://arxiv.org/abs/2307.05385
  • repo_url: None
  • paper_authors: Sully F. Chen, Zhicheng Guo, Cheng Ding, Xiao Hu, Cynthia Rudin
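  • for: This work proposes a reliable, efficient, and interpretable method for PPG signal quality assessment and artifact segmentation, to ensure robust and accurate extraction of physiological parameters.
  • methods: The method learns a small set of interpretable convolutional kernels. It outperforms earlier handcrafted feature detectors and signal metrics while remaining interpretable and suitable for low-power devices.
  • results: Experiments show performance similar to, and often better than, the state-of-the-art deep neural network (DNN) approach, with several orders of magnitude fewer parameters and lower compute and memory cost.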
    Abstract Photoplethysmography (PPG) provides a low-cost, non-invasive method to continuously monitor various cardiovascular parameters. PPG signals are generated by wearable devices and frequently contain large artifacts caused by external factors, such as motion of the human subject. In order to ensure robust and accurate extraction of physiological parameters, corrupted areas of the signal need to be identified and handled appropriately. Previous methodology relied either on handcrafted feature detectors or signal metrics which yield sub-optimal performance, or relied on machine learning techniques such as deep neural networks (DNN) which lack interpretability and are computationally and memory intensive. In this work, we present a novel method to learn a small set of interpretable convolutional kernels that has performance similar to -- and often better than -- the state-of-the-art DNN approach with several orders of magnitude fewer parameters. This work allows for efficient, robust, and interpretable signal quality assessment and artifact segmentation on low-power devices.

Neural Network Field Theories: Non-Gaussianity, Actions, and Locality

  • paper_url: http://arxiv.org/abs/2307.03223
  • repo_url: None
  • paper_authors: Mehmet Demirtas, James Halverson, Anindita Maiti, Matthew D. Schwartz, Keegan Stoner
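  • for: This paper studies distributions over functions in field theory and how ensembles of neural networks describe such distributions.
  • methods: The paper applies the central limit theorem in the infinite-width limit and considers expansions such as a small breaking of the statistical independence of network parameters.
  • results: In the infinite-width (infinite-$N$) limit, the neural network ensemble corresponds to a free field theory and can be described with field-theory methods.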
    Abstract Both the path integral measure in field theory and ensembles of neural networks describe distributions over functions. When the central limit theorem can be applied in the infinite-width (infinite-$N$) limit, the ensemble of networks corresponds to a free field theory. Although an expansion in $1/N$ corresponds to interactions in the field theory, others, such as in a small breaking of the statistical independence of network parameters, can also lead to interacting theories. These other expansions can be advantageous over the $1/N$-expansion, for example by improved behavior with respect to the universal approximation theorem. Given the connected correlators of a field theory, one can systematically reconstruct the action order-by-order in the expansion parameter, using a new Feynman diagram prescription whose vertices are the connected correlators. This method is motivated by the Edgeworth expansion and allows one to derive actions for neural network field theories. Conversely, the correspondence allows one to engineer architectures realizing a given field theory by representing action deformations as deformations of neural network parameter densities. As an example, $\phi^4$ theory is realized as an infinite-$N$ neural network field theory.

Synthesizing Artistic Cinemagraphs from Text

  • paper_url: http://arxiv.org/abs/2307.03190
  • repo_url: https://github.com/text2cinemagraph/text2cinemagraph
  • paper_authors: Aniruddha Mahapatra, Aliaksandr Siarohin, Hsin-Ying Lee, Sergey Tulyakov, Jun-Yan Zhu
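  • for: This work creates cinemagraphs from text descriptions.
  • methods: The method synthesizes image twins from a single text prompt: an artistic image and its pixel-aligned, natural-looking counterpart. Using existing natural image and video datasets, it segments the realistic image, predicts motion, and transfers the predicted motion to the artistic image.
  • results: Compared with existing methods, the approach performs better at creating cinemagraphs of natural landscapes and of artistic and other-worldly scenes. It also extends to animating existing paintings and to controlling motion direction with text.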
    Abstract We introduce Text2Cinemagraph, a fully automated method for creating cinemagraphs from text descriptions - an especially challenging task when prompts feature imaginary elements and artistic styles, given the complexity of interpreting the semantics and motions of these images. Existing single-image animation methods fall short on artistic inputs, and recent text-based video methods frequently introduce temporal inconsistencies, struggling to keep certain regions static. To address these challenges, we propose an idea of synthesizing image twins from a single text prompt - a pair of an artistic image and its pixel-aligned corresponding natural-looking twin. While the artistic image depicts the style and appearance detailed in our text prompt, the realistic counterpart greatly simplifies layout and motion analysis. Leveraging existing natural image and video datasets, we can accurately segment the realistic image and predict plausible motion given the semantic information. The predicted motion can then be transferred to the artistic image to create the final cinemagraph. Our method outperforms existing approaches in creating cinemagraphs for natural landscapes as well as artistic and other-worldly scenes, as validated by automated metrics and user studies. Finally, we demonstrate two extensions: animating existing paintings and controlling motion directions using text.
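    Summary We introduce Text2Cinemagraph, a fully automated method for generating cinemagraphs from text descriptions, a task that is especially hard when prompts contain imaginary elements and artistic styles. Existing single-image animation methods fall short on artistic inputs, and recent text-based video methods often introduce temporal inconsistencies and struggle to keep certain regions static. To address these challenges, we propose synthesizing image twins from a single text prompt: an artistic image and its pixel-aligned, natural-looking counterpart. The artistic image carries the style and appearance described in the prompt, while the realistic twin greatly simplifies layout and motion analysis. Using existing natural image and video datasets, we can accurately segment the realistic image and predict plausible motion from semantic information. The predicted motion is then transferred to the artistic image to create the final cinemagraph. Our method outperforms existing approaches on natural landscapes as well as artistic and other-worldly scenes, as validated by automated metrics and user studies. Finally, we show two extensions: animating existing paintings and controlling motion direction with text.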

TGRL: An Algorithm for Teacher Guided Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.03186
  • repo_url: None
  • paper_authors: Idan Shenfeld, Zhang-Wei Hong, Aviv Tamar, Pulkit Agrawal
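  • for: This paper addresses sequential decision-making problems by combining two established approaches: learning from a teacher and learning from rewards.
  • methods: The method dynamically and automatically balances the teacher-student and reinforcement learning objectives to achieve better performance.
  • results: Experiments show that Teacher Guided Reinforcement Learning (TGRL) outperforms strong baselines across diverse domains without hyperparameter tuning.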
    Abstract Learning from rewards (i.e., reinforcement learning or RL) and learning to imitate a teacher (i.e., teacher-student learning) are two established approaches for solving sequential decision-making problems. To combine the benefits of these different forms of learning, it is common to train a policy to maximize a combination of reinforcement and teacher-student learning objectives. However, without a principled method to balance these objectives, prior work used heuristics and problem-specific hyperparameter searches to balance the two objectives. We present a $\textit{principled}$ approach, along with an approximate implementation for $\textit{dynamically}$ and $\textit{automatically}$ balancing when to follow the teacher and when to use rewards. The main idea is to adjust the importance of teacher supervision by comparing the agent's performance to the counterfactual scenario of the agent learning without teacher supervision and only from rewards. If using teacher supervision improves performance, the importance of teacher supervision is increased and otherwise it is decreased. Our method, $\textit{Teacher Guided Reinforcement Learning}$ (TGRL), outperforms strong baselines across diverse domains without hyper-parameter tuning.
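    Summary Learning from rewards (reinforcement learning, RL) and learning to imitate a teacher (teacher-student learning) are two established approaches to sequential decision-making problems. To combine their benefits, it is common to train a policy to maximize a combination of the reinforcement and teacher-student objectives. Lacking a principled way to balance the two objectives, prior work relied on heuristics and problem-specific hyperparameter searches. We propose a principled approach, together with an approximate implementation, that dynamically and automatically balances when to follow the teacher and when to use rewards. Our method, Teacher Guided Reinforcement Learning (TGRL), outperforms strong baselines across diverse domains without hyperparameter tuning.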
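The balancing idea in the abstract above (raise the importance of teacher supervision when it helps, lower it otherwise) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear update rule, the `step_size` parameter, and both function names are assumptions made for the example.

```python
def update_teacher_weight(weight, return_with_teacher, return_reward_only,
                          step_size=0.1, min_weight=0.0):
    """Adjust the teacher-supervision weight (hypothetical linear rule).

    The weight grows when the teacher-guided agent outperforms the
    counterfactual agent trained from rewards only, and shrinks otherwise.
    It is clipped so it never falls below min_weight.
    """
    gap = return_with_teacher - return_reward_only
    return max(min_weight, weight + step_size * gap)


def combined_objective(rl_objective, imitation_objective, weight):
    """Objective to maximize: the RL term plus the weighted teacher term."""
    return rl_objective + weight * imitation_objective
```

In this sketch the two returns would come from evaluating two policies, one trained with teacher supervision and one trained from rewards only. How those returns are estimated is left open here.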

Quantification of Uncertainty with Adversarial Models

  • paper_url: http://arxiv.org/abs/2307.03217
  • repo_url: https://github.com/ml-jku/quam
  • paper_authors: Kajetan Schweighofer, Lukas Aichberger, Mykyta Ielanskyi, Günter Klambauer, Sepp Hochreiter
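  • for: This paper proposes a new uncertainty quantification method to support actionable predictions in real-world applications.
  • methods: The paper introduces Quantification of Uncertainty with Adversarial Models (QUAM), which estimates epistemic uncertainty using adversarial models and has lower approximation error than earlier methods such as Deep Ensembles or MC dropout.
  • results: Experiments show that QUAM captures epistemic uncertainty well for deep learning models and outperforms previous methods on challenging vision tasks.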
    Abstract Quantifying uncertainty is important for actionable predictions in real-world applications. A crucial part of predictive uncertainty quantification is the estimation of epistemic uncertainty, which is defined as an integral of the product between a divergence function and the posterior. Current methods such as Deep Ensembles or MC dropout underperform at estimating the epistemic uncertainty, since they primarily consider the posterior when sampling models. We suggest Quantification of Uncertainty with Adversarial Models (QUAM) to better estimate the epistemic uncertainty. QUAM identifies regions where the whole product under the integral is large, not just the posterior. Consequently, QUAM has lower approximation error of the epistemic uncertainty compared to previous methods. Models for which the product is large correspond to adversarial models (not adversarial examples!). Adversarial models have both a high posterior as well as a high divergence between their predictions and that of a reference model. Our experiments show that QUAM excels in capturing epistemic uncertainty for deep learning models and outperforms previous methods on challenging tasks in the vision domain.
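    Summary Quantifying uncertainty is important for actionable predictions. A crucial part of predictive uncertainty quantification is estimating epistemic uncertainty, defined as an integral of the product of a divergence function and the posterior. Current methods such as Deep Ensembles or MC dropout estimate epistemic uncertainty poorly because they primarily consider the posterior when sampling models. We propose Quantification of Uncertainty with Adversarial Models (QUAM) to estimate it better. QUAM identifies regions where the whole product under the integral is large, not just the posterior, and therefore has lower approximation error than previous methods. Models for which the product is large correspond to adversarial models (not adversarial examples!), which have both a high posterior and a high divergence between their predictions and those of a reference model. Our experiments show that QUAM excels at capturing epistemic uncertainty for deep learning models and outperforms previous methods on challenging vision tasks.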
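To make the "integral of divergence times posterior" definition concrete, here is a small discrete sketch: the integral is replaced by a weighted sum over a finite set of candidate models. The choice of KL divergence, its direction, and all names are assumptions for illustration. The sketch does not implement QUAM's adversarial model search.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def epistemic_uncertainty(reference_probs, model_probs, posterior_weights):
    """Posterior-weighted average divergence from a reference model.

    model_probs[i] is the predictive distribution of candidate model i and
    posterior_weights[i] its (unnormalized) posterior weight. Each term is
    the product "divergence x posterior" named in the abstract.
    """
    total = sum(posterior_weights)
    return sum(w / total * kl_divergence(reference_probs, probs)
               for probs, w in zip(model_probs, posterior_weights))
```

A model contributes a lot to this sum only when both factors are large, which is the property the abstract attributes to adversarial models.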

Learning Curves for Heterogeneous Feature-Subsampled Ridge Ensembles

  • paper_url: http://arxiv.org/abs/2307.03176
  • repo_url: https://github.com/benruben87/Learning-Curves-for-Heterogeneous-Feature-Subsampled-Ridge-Ensembles
  • paper_authors: Benjamin S. Ruben, Cengiz Pehlevan
  • for: The paper studies feature bagging (the random subspace method) as a way to reduce prediction variance, with ensemble members built on varying numbers of features.
  • methods: Each linear predictor is fit with ridge regression on a subset of features, and learning curves are derived with the replica trick from statistical physics.
  • results: In a linear regression setting, the derived learning curves show how subsampling and ensembling affect performance, and reveal sharp transitions in the optimal ensembling strategy across the parameter space of noise level, data correlations, and data-task alignment.
    Abstract Feature bagging is a well-established ensembling method which aims to reduce prediction variance by training estimators in an ensemble on random subsamples or projections of features. Typically, ensembles are chosen to be homogeneous, in the sense that the number of feature dimensions available to an estimator is uniform across the ensemble. Here, we introduce heterogeneous feature ensembling, with estimators built on varying number of feature dimensions, and consider its performance in a linear regression setting. We study an ensemble of linear predictors, each fit using ridge regression on a subset of the available features. We allow the number of features included in these subsets to vary. Using the replica trick from statistical physics, we derive learning curves for ridge ensembles with deterministic linear masks. We obtain explicit expressions for the learning curves in the case of equicorrelated data with an isotropic feature noise. Using the derived expressions, we investigate the effect of subsampling and ensembling, finding sharp transitions in the optimal ensembling strategy in the parameter space of noise level, data correlations, and data-task alignment. Finally, we suggest variable-dimension feature bagging as a strategy to mitigate double descent for robust machine learning in practice.
    Summary Feature bagging is a well-established ensembling method that aims to reduce prediction variance by training the estimators in an ensemble on random subsamples or projections of features. Ensembles are typically homogeneous: every estimator has access to the same number of feature dimensions. Here we introduce heterogeneous feature ensembling, with estimators built on varying numbers of feature dimensions, and study its performance in a linear regression setting. We study an ensemble of linear predictors, each fit with ridge regression on a subset of the available features, and allow the subset sizes to vary. Using the replica trick from statistical physics, we derive learning curves for ridge ensembles, with explicit expressions for equicorrelated data with isotropic feature noise. Using these expressions, we investigate the effect of subsampling and ensembling and find sharp transitions in the optimal ensembling strategy across the parameter space. Finally, we suggest variable-dimension feature bagging as a practical strategy to mitigate double descent.
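The ensemble described above, linear predictors each fit by ridge regression on a feature subset of varying size, can be sketched in plain Python. The random choice of subsets, the averaging of member predictions, and every function name are assumptions for illustration. The paper analyzes deterministic linear masks, which this sketch does not reproduce.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest entry of this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ridge(xs, ys, features, lam):
    """Ridge regression using only the listed feature indices."""
    gram = [[sum(x[fi] * x[fj] for x in xs) + (lam if i == j else 0.0)
             for j, fj in enumerate(features)]
            for i, fi in enumerate(features)]
    rhs = [sum(x[fi] * y for x, y in zip(xs, ys)) for fi in features]
    return solve(gram, rhs)


def fit_ensemble(xs, ys, subset_sizes, lam, rng):
    """One ridge predictor per entry of subset_sizes (sizes may differ)."""
    dim = len(xs[0])
    members = []
    for size in subset_sizes:
        features = sorted(rng.sample(range(dim), size))
        members.append((features, fit_ridge(xs, ys, features, lam)))
    return members


def predict(members, x):
    """Average the predictions of all ensemble members."""
    preds = [sum(w * x[f] for w, f in zip(weights, features))
             for features, weights in members]
    return sum(preds) / len(preds)
```

Passing a list such as `[1, 2, 4]` as `subset_sizes` gives a heterogeneous ensemble, while a constant list gives the usual homogeneous one.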

Push Past Green: Learning to Look Behind Plant Foliage by Moving It

  • paper_url: http://arxiv.org/abs/2307.03175
  • repo_url: None
  • paper_authors: Xiaoyu Zhang, Saurabh Gupta
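  • for: This paper addresses the challenge of manipulating plant foliage in autonomous agriculture applications (e.g., inspection, phenotyping, plucking fruits).
  • methods: The paper uses data-driven methods: SRPNet, a neural network trained with self-supervision, predicts what space is revealed when a candidate action is executed on a plant.
  • results: Experiments on a synthetic plant (vines) and a real plant (Dracaena) on a physical test-bed across 5 settings show that the overall method outperforms a competitive hand-crafted exploration method, and that SRPNet outperforms a hand-crafted dynamics model and relevant ablations.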
    Abstract Autonomous agriculture applications (e.g., inspection, phenotyping, plucking fruits) require manipulating the plant foliage to look behind the leaves and the branches. Partial visibility, extreme clutter, thin structures, and unknown geometry and dynamics for plants make such manipulation challenging. We tackle these challenges through data-driven methods. We use self-supervision to train SRPNet, a neural network that predicts what space is revealed on execution of a candidate action on a given plant. We use SRPNet with the cross-entropy method to predict actions that are effective at revealing space beneath plant foliage. Furthermore, as SRPNet does not just predict how much space is revealed but also where it is revealed, we can execute a sequence of actions that incrementally reveal more and more space beneath the plant foliage. We experiment with a synthetic (vines) and a real plant (Dracaena) on a physical test-bed across 5 settings including 2 settings that test generalization to novel plant configurations. Our experiments reveal the effectiveness of our overall method, PPG, over a competitive hand-crafted exploration method, and the effectiveness of SRPNet over a hand-crafted dynamics model and relevant ablations.
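    Summary Autonomous agriculture applications (e.g., inspection, phenotyping, plucking fruits) require manipulating plant foliage to look behind leaves and branches. Partial visibility, extreme clutter, thin structures, and the unknown geometry and dynamics of plants make such manipulation challenging. We tackle these challenges with data-driven methods. We use self-supervision to train SRPNet, a neural network that predicts what space is revealed when a candidate action is executed on a given plant. We use SRPNet with the cross-entropy method to select actions that are effective at revealing space beneath the foliage, and execute sequences of actions that reveal progressively more space. We run experiments with a synthetic plant (vines) and a real plant (Dracaena) on a physical test-bed across 5 settings, 2 of which test generalization to novel plant configurations. The experiments show that our overall method, PPG, is more effective than a competitive hand-crafted exploration method, and that SRPNet is more effective than a hand-crafted dynamics model and relevant ablations.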
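The abstract pairs SRPNet with the cross-entropy method (CEM) to choose actions. A generic CEM loop is sketched below, with a plain `score` callable standing in for the learned predictor. The Gaussian sampling distribution, the hyperparameter values, and the action encoding as a flat list of floats are all assumptions for illustration.

```python
import random
import statistics


def cross_entropy_method(score, dim, iterations=20, population=64,
                         elite=8, rng=None):
    """Search for an action vector that maximizes score(action).

    Each round samples candidates from a diagonal Gaussian, keeps the
    best-scoring ones, and refits the Gaussian to those elites.
    """
    rng = rng or random.Random(0)
    mean = [0.0] * dim
    std = [1.0] * dim
    for _ in range(iterations):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        samples.sort(key=score, reverse=True)
        best = samples[:elite]
        mean = [statistics.fmean([a[i] for a in best]) for i in range(dim)]
        # A small floor keeps the sampling distribution from collapsing.
        std = [statistics.pstdev([a[i] for a in best]) + 1e-6
               for i in range(dim)]
    return mean
```

In the setting of the abstract, `score` would be the amount of space a learned model predicts an action reveals. Here it can be any function of the action vector.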

Wasserstein Quantum Monte Carlo: A Novel Approach for Solving the Quantum Many-Body Schrödinger Equation

  • paper_url: http://arxiv.org/abs/2307.07050
  • repo_url: https://github.com/necludov/wqmc
  • paper_authors: Kirill Neklyudov, Jannes Nys, Luca Thiede, Juan Carrasquilla, Qiang Liu, Max Welling, Alireza Makhzani
  • for: solves the quantum many-body Schrödinger equation, a fundamental problem in quantum physics, chemistry, and materials science.
  • methods: uses deep learning methods to represent wave functions as neural networks, and reformulates energy functional minimization in the space of Born distributions.
  • results: demonstrates faster convergence to the ground state of molecular systems using the proposed “Wasserstein Quantum Monte Carlo” (WQMC) method.
    Abstract Solving the quantum many-body Schrödinger equation is a fundamental and challenging problem in the fields of quantum physics, quantum chemistry, and material sciences. One of the common computational approaches to this problem is Quantum Variational Monte Carlo (QVMC), in which ground-state solutions are obtained by minimizing the energy of the system within a restricted family of parameterized wave functions. Deep learning methods partially address the limitations of traditional QVMC by representing a rich family of wave functions in terms of neural networks. However, the optimization objective in QVMC remains notoriously hard to minimize and requires second-order optimization methods such as natural gradient. In this paper, we first reformulate energy functional minimization in the space of Born distributions corresponding to particle-permutation (anti-)symmetric wave functions, rather than the space of wave functions. We then interpret QVMC as the Fisher-Rao gradient flow in this distributional space, followed by a projection step onto the variational manifold. This perspective provides us with a principled framework to derive new QMC algorithms, by endowing the distributional space with better metrics, and following the projected gradient flow induced by those metrics. More specifically, we propose "Wasserstein Quantum Monte Carlo" (WQMC), which uses the gradient flow induced by the Wasserstein metric, rather than Fisher-Rao metric, and corresponds to transporting the probability mass, rather than teleporting it. We demonstrate empirically that the dynamics of WQMC results in faster convergence to the ground state of molecular systems.
    Summary Solving the quantum many-body Schrödinger equation is a fundamental and challenging problem in quantum physics, quantum chemistry, and materials science. A common computational approach is Quantum Variational Monte Carlo (QVMC), in which ground-state solutions are obtained by minimizing the energy of the system within a restricted family of parameterized wave functions. Deep learning methods partially address the limitations of traditional QVMC by representing a rich family of wave functions with neural networks. However, the QVMC optimization objective remains hard to minimize and requires second-order optimization methods such as natural gradient. In this paper, we first reformulate energy functional minimization in the space of Born distributions corresponding to particle-permutation (anti-)symmetric wave functions, and then interpret QVMC as the Fisher-Rao gradient flow in this distributional space, followed by a projection onto the variational manifold. This view lets us derive new QMC algorithms by endowing the distributional space with better metrics and following the projected gradient flow those metrics induce. Specifically, we propose "Wasserstein Quantum Monte Carlo" (WQMC), which uses the gradient flow induced by the Wasserstein metric rather than the Fisher-Rao metric, and corresponds to transporting probability mass rather than teleporting it. We show empirically that the dynamics of WQMC converge faster to the ground state of molecular systems.

Focused Transformer: Contrastive Training for Context Scaling

  • paper_url: http://arxiv.org/abs/2307.03170
  • repo_url: https://github.com/cstankonrad/long_llama
  • paper_authors: Szymon Tworkowski, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu, Henryk Michalewski, Piotr Miłoś
  • for: Extending the effective context length of large language models.
  • methods: Gives an attention layer access to an external memory of (key, value) pairs, and uses a training process inspired by contrastive learning to address the distraction issue.
  • results: Enables fine-tuning existing large-scale models to lengthen their effective context. The resulting LongLLaMA models improve on long-context tasks and manage a 256k context length for passkey retrieval.
    Abstract Large language models have an exceptional capability to incorporate new information in a contextual manner. However, the full potential of such an approach is often restrained due to a limitation in the effective context length. One solution to this issue is to endow an attention layer with access to an external memory, which consists of (key, value) pairs. Yet, as the number of documents increases, the proportion of relevant keys to irrelevant ones decreases, leading the model to focus more on the irrelevant keys. We identify a significant challenge, dubbed the distraction issue, where keys linked to different semantic values might overlap, making them hard to distinguish. To tackle this problem, we introduce the Focused Transformer (FoT), a technique that employs a training process inspired by contrastive learning. This novel approach enhances the structure of the (key, value) space, enabling an extension of the context length. Our method allows for fine-tuning pre-existing, large-scale models to lengthen their effective context. This is demonstrated by our fine-tuning of $3B$ and $7B$ OpenLLaMA checkpoints. The resulting models, which we name LongLLaMA, exhibit advancements in tasks requiring a long context. We further illustrate that our LongLLaMA models adeptly manage a $256 k$ context length for passkey retrieval.
    Summary Large language models have an exceptional ability to incorporate new information in context. However, the potential of this approach is often limited by the effective context length. One solution is to give an attention layer access to an external memory of (key, value) pairs. Yet as the number of documents grows, the proportion of relevant keys to irrelevant ones decreases, leading the model to focus more on irrelevant keys. We identify a significant challenge, dubbed the distraction issue, in which keys linked to different semantic values may overlap and become hard to distinguish. To tackle this problem, we introduce the Focused Transformer (FoT), a technique with a training process inspired by contrastive learning. This approach improves the structure of the (key, value) space and enables a longer context. Our method allows fine-tuning existing large-scale models to lengthen their effective context, as demonstrated by fine-tuning the $3B$ and $7B$ OpenLLaMA checkpoints. The resulting models, named LongLLaMA, improve on tasks that require a long context and manage a $256k$ context length for passkey retrieval.
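The external-memory mechanism mentioned above, an attention layer that can also attend to stored (key, value) pairs, can be illustrated with a single-query sketch. The top-k retrieval by inner product, the scaling by the square root of the query dimension, and the function names are assumptions for the example. The contrastive-style training that FoT uses to shape the key space is not shown.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[i] for w, v in zip(weights, values))
            for i in range(dim)]


def attend_with_memory(query, local_kv, memory_kv, top_k):
    """Attend over local (key, value) pairs plus the top_k memory pairs.

    Memory pairs are ranked by inner product with the query. Irrelevant
    keys that still score highly are what the abstract calls distraction.
    """
    ranked = sorted(memory_kv,
                    key=lambda kv: sum(q * k for q, k in zip(query, kv[0])),
                    reverse=True)
    pairs = list(local_kv) + ranked[:top_k]
    keys = [k for k, _ in pairs]
    values = [v for _, v in pairs]
    return attend(query, keys, values)
```

With well-separated keys the retrieved pair dominates the output. As more unrelated documents fill the memory, high-scoring irrelevant keys can crowd into the top-k set.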

Can Domain Adaptation Improve Accuracy and Fairness of Skin Lesion Classification?

  • paper_url: http://arxiv.org/abs/2307.03157
  • repo_url: None
  • paper_authors: Janet Wang, Yunbei Zhang, Zhengming Ding, Jihun Hamm
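  • for: This study investigates the feasibility of unsupervised domain adaptation (UDA) methods for skin lesion classification, to improve accuracy and reliability when labeled data are scarce.
  • methods: The study uses multiple skin lesion datasets and evaluates three UDA training schemes: single-source, combined-source, and multi-source.
  • results: UDA is effective in binary classification and improves further when class imbalance is mitigated. In multi-class tasks its benefit is less pronounced, and imbalance must be addressed to achieve above-baseline accuracy. Quantitative analysis shows that test error is strongly correlated with label shift, and that feature-level UDA methods have limitations on imbalanced datasets. Finally, UDA can effectively reduce bias against minority groups without explicit fairness-focused techniques.
    Abstract Deep learning-based diagnostic systems have demonstrated potential in classifying skin cancer conditions when labeled training examples are abundant. However, skin lesion analysis often suffers from a scarcity of labeled data, hindering the development of an accurate and reliable diagnostic system. In this work, we leverage multiple skin lesion datasets and investigate the feasibility of various unsupervised domain adaptation (UDA) methods in binary and multi-class skin lesion classification. In particular, we assess three UDA training schemes: single-, combined-, and multi-source. Our experiment results show that UDA is effective in binary classification, with further improvement being observed when imbalance is mitigated. In the multi-class task, its performance is less prominent, and the imbalance problem again needs to be addressed to achieve above-baseline accuracy. Through our quantitative analysis, we find that the test error of multi-class tasks is strongly correlated with label shift, and feature-level UDA methods have limitations when handling imbalanced datasets. Finally, our study reveals that UDA can effectively reduce bias against minority groups and promote fairness, even without the explicit use of fairness-focused techniques.
    Summary Deep learning-based diagnostic systems have shown potential in classifying skin cancer when labeled examples are abundant. However, skin lesion analysis often suffers from a scarcity of labeled data, which hinders the development of accurate and reliable diagnostic systems. In this work, we use multiple skin lesion datasets and investigate the feasibility of various unsupervised domain adaptation (UDA) methods for binary and multi-class skin lesion classification. In particular, we assess single-source, combined-source, and multi-source UDA training schemes. Our experiments show that UDA is effective in binary classification, with further improvement when imbalance is mitigated. In the multi-class task its benefit is less pronounced, and the imbalance problem must be addressed to achieve above-baseline accuracy. Our quantitative analysis finds that the test error of multi-class tasks is strongly correlated with label shift, and that feature-level UDA methods have limitations on imbalanced datasets. Finally, our study shows that UDA can effectively reduce bias against minority groups without the explicit use of fairness-focused techniques.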

Topology-Aware Loss for Aorta and Great Vessel Segmentation in Computed Tomography Images

  • paper_url: http://arxiv.org/abs/2307.03137
  • repo_url: None
  • paper_authors: Seher Ozcelik, Sinan Unver, Ilke Ali Gurses, Rustu Turkay, Cigdem Gunduz-Demir
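  • for: Improving image segmentation performance, specifically for segmenting the aorta and great vessels in CT images.
  • methods: Proposes a new topology-aware loss function that uses persistent homology to penalize topological dissimilarities between the network prediction and the ground truth.
  • results: Experiments on 4327 CT images of 24 subjects show that the proposed loss function leads to better results than its counterparts.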
    Abstract Segmentation networks are not explicitly imposed to learn global invariants of an image, such as the shape of an object and the geometry between multiple objects, when they are trained with a standard loss function. On the other hand, incorporating such invariants into network training may help improve performance for various segmentation tasks when they are the intrinsic characteristics of the objects to be segmented. One example is segmentation of aorta and great vessels in computed tomography (CT) images where vessels are found in a particular geometry in the body due to the human anatomy and they mostly seem as round objects on a 2D CT image. This paper addresses this issue by introducing a new topology-aware loss function that penalizes topology dissimilarities between the ground truth and prediction through persistent homology. Different from the previously suggested segmentation network designs, which apply the threshold filtration on a likelihood function of the prediction map and the Betti numbers of the ground truth, this paper proposes to apply the Vietoris-Rips filtration to obtain persistence diagrams of both ground truth and prediction maps and calculate the dissimilarity with the Wasserstein distance between the corresponding persistence diagrams. The use of this filtration has advantage of modeling shape and geometry at the same time, which may not happen when the threshold filtration is applied. Our experiments on 4327 CT images of 24 subjects reveal that the proposed topology-aware loss function leads to better results than its counterparts, indicating the effectiveness of this use.
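    Summary When trained with a standard loss function, segmentation networks do not explicitly learn global invariants of an image, such as the shape of an object and the geometry between multiple objects. For some tasks, however, these invariants are intrinsic characteristics of the objects to be segmented, and incorporating them into network training may improve performance. One example is segmentation of the aorta and great vessels in computed tomography (CT) images, where vessels follow a particular geometry because of human anatomy and mostly appear as round objects on a 2D CT image. This paper addresses the issue with a new topology-aware loss function that uses persistent homology to penalize topological dissimilarities between the ground truth and the prediction. Unlike earlier segmentation network designs, which apply the threshold filtration to a likelihood function of the prediction map and use the Betti numbers of the ground truth, this paper applies the Vietoris-Rips filtration to obtain persistence diagrams of both the ground truth and the prediction maps and computes the Wasserstein distance between them. This filtration has the advantage of modeling shape and geometry at the same time, which may not happen with the threshold filtration. Our experiments on 4327 CT images of 24 subjects show that the proposed topology-aware loss function leads to better results than its counterparts.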

Multiplicative Updates for Online Convex Optimization over Symmetric Cones

  • paper_url: http://arxiv.org/abs/2307.03136
  • repo_url: https://github.com/waynelin74/OCO_SymmetricCones
  • paper_authors: Ilayda Canyakmaz, Wayne Lin, Georgios Piliouras, Antonios Varvitsiotis
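  • for: This paper studies online convex optimization where the possible actions are trace-one elements of a symmetric cone, generalizing the extensively studied experts setup and its quantum counterpart.
  • methods: Using tools from Euclidean Jordan algebras, the paper introduces the Symmetric-Cone Multiplicative Weights Update (SCMWU), a projection-free algorithm for online optimization over the trace-one slice of a symmetric cone.
  • results: The paper shows that SCMWU is a no-regret algorithm, and unifies and generalizes the analysis of Multiplicative Weights Update over the probability simplex and Matrix Multiplicative Weights Update over density matrices.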
    Abstract We study online convex optimization where the possible actions are trace-one elements in a symmetric cone, generalizing the extensively-studied experts setup and its quantum counterpart. Symmetric cones provide a unifying framework for some of the most important optimization models, including linear, second-order cone, and semidefinite optimization. Using tools from the field of Euclidean Jordan Algebras, we introduce the Symmetric-Cone Multiplicative Weights Update (SCMWU), a projection-free algorithm for online optimization over the trace-one slice of an arbitrary symmetric cone. We show that SCMWU is equivalent to Follow-the-Regularized-Leader and Online Mirror Descent with symmetric-cone negative entropy as regularizer. Using this structural result we show that SCMWU is a no-regret algorithm, and verify our theoretical results with extensive experiments. Our results unify and generalize the analysis for the Multiplicative Weights Update method over the probability simplex and the Matrix Multiplicative Weights Update method over the set of density matrices.
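    Summary We study online convex optimization where the possible actions are trace-one elements of a symmetric cone, generalizing the extensively studied experts setup and its quantum counterpart. Symmetric cones provide a unifying framework for linear, second-order cone, and semidefinite optimization. Using tools from Euclidean Jordan algebras, we introduce the Symmetric-Cone Multiplicative Weights Update (SCMWU), a projection-free algorithm for online optimization over the trace-one slice of an arbitrary symmetric cone. We show that SCMWU is equivalent to Follow-the-Regularized-Leader and Online Mirror Descent with symmetric-cone negative entropy as the regularizer. Using this structural result, we show that SCMWU is a no-regret algorithm and verify our theoretical results with extensive experiments. Our results unify and generalize the analysis of the Multiplicative Weights Update method over the probability simplex and the Matrix Multiplicative Weights Update method over the set of density matrices.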
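For orientation, the classical Multiplicative Weights Update over the probability simplex, one of the special cases the abstract says SCMWU generalizes, fits in a few lines. This is the textbook experts-setting update, not SCMWU itself. The learning rate name `eta` and the loss-vector interface are assumptions for the example.

```python
import math


def mwu_step(weights, losses, eta):
    """One Multiplicative Weights Update step on the probability simplex.

    Each weight is scaled by exp(-eta * loss) and the result is
    renormalized, so no projection step is needed to stay on the simplex.
    """
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def run_mwu(loss_sequence, num_experts, eta):
    """Play MWU against a sequence of loss vectors and return final weights."""
    weights = [1.0 / num_experts] * num_experts
    for losses in loss_sequence:
        weights = mwu_step(weights, losses, eta)
    return weights
```

The renormalization is what keeps each iterate on the simplex without a projection step. The abstract describes SCMWU as projection-free in the same sense on the trace-one slice of a symmetric cone.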

Distilling Large Vision-Language Model with Out-of-Distribution Generalizability

  • paper_url: http://arxiv.org/abs/2307.03135
  • repo_url: https://github.com/xuanlinli17/large_vlm_distillation_ood
  • paper_authors: Xuanlin Li, Yunhao Fang, Minghua Liu, Zhan Ling, Zhuowen Tu, Hao Su
  • for: The paper investigates the distillation of visual representations in large teacher vision-language models into lightweight student models, with a focus on open-vocabulary out-of-distribution (OOD) generalization.
  • methods: The proposed method uses two principles from vision and language modality perspectives to enhance the student's OOD generalization: (1) better imitating the teacher's visual representation space, and carefully promoting better coherence in vision-language alignment with the teacher; (2) enriching the teacher's language representations with informative and fine-grained semantic attributes to effectively distinguish between different labels.
  • results: The results demonstrate significant improvements in zero-shot and few-shot student performance on open-vocabulary out-of-distribution classification, highlighting the effectiveness of the proposed approaches.
    Abstract Large vision-language models have achieved outstanding performance, but their size and computational requirements make their deployment on resource-constrained devices and time-sensitive tasks impractical. Model distillation, the process of creating smaller, faster models that maintain the performance of larger models, is a promising direction towards the solution. This paper investigates the distillation of visual representations in large teacher vision-language models into lightweight student models using a small- or mid-scale dataset. Notably, this study focuses on open-vocabulary out-of-distribution (OOD) generalization, a challenging problem that has been overlooked in previous model distillation literature. We propose two principles from vision and language modality perspectives to enhance student's OOD generalization: (1) by better imitating teacher's visual representation space, and carefully promoting better coherence in vision-language alignment with the teacher; (2) by enriching the teacher's language representations with informative and finegrained semantic attributes to effectively distinguish between different labels. We propose several metrics and conduct extensive experiments to investigate their techniques. The results demonstrate significant improvements in zero-shot and few-shot student performance on open-vocabulary out-of-distribution classification, highlighting the effectiveness of our proposed approaches. Code released at https://github.com/xuanlinli17/large_vlm_distillation_ood
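    Summary Large vision-language models achieve outstanding performance, but their size and computational requirements make deployment on resource-constrained devices and in time-sensitive tasks impractical. Model distillation, which creates smaller and faster models that retain the performance of larger ones, is a promising direction. This paper investigates distilling the visual representations of large teacher vision-language models into lightweight student models using a small- or mid-scale dataset, focusing on open-vocabulary out-of-distribution (OOD) generalization, a problem overlooked in previous distillation literature. Two principles are proposed to improve the student's OOD generalization: (1) better imitating the teacher's visual representation space and carefully promoting coherence in vision-language alignment with the teacher; (2) enriching the teacher's language representations with informative, fine-grained semantic attributes that distinguish between labels. The authors propose several metrics and conduct extensive experiments; the results show significant improvements in zero-shot and few-shot student performance on open-vocabulary OOD classification. Code is released at https://github.com/xuanlinli17/large_vlm_distillation_ood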

Benchmarking Test-Time Adaptation against Distribution Shifts in Image Classification

  • paper_url: http://arxiv.org/abs/2307.03133
  • repo_url: https://github.com/yuyongcan/benchmark-tta
  • paper_authors: Yongcan Yu, Lijun Sheng, Ran He, Jian Liang
  • for: This paper aims to provide a benchmark for test-time adaptation (TTA) methods to enhance the generalization performance of models and improve their robustness against distribution shifts.
  • methods: The paper evaluates 13 prominent TTA methods and their variants on five widely used image classification datasets, including CIFAR-10-C, CIFAR-100-C, ImageNet-C, DomainNet, and Office-Home. These methods cover a range of adaptation scenarios, such as online adaptation vs. offline adaptation, instance adaptation vs. batch adaptation vs. domain adaptation.
  • results: The paper presents a unified framework in PyTorch to evaluate and compare the effectiveness of TTA methods across different datasets and network architectures. By establishing this benchmark, the authors aim to provide researchers and practitioners with a reliable means of assessing and comparing the effectiveness of TTA methods in improving model robustness and generalization performance.
    Abstract Test-time adaptation (TTA) is a technique aimed at enhancing the generalization performance of models by leveraging unlabeled samples solely during prediction. Given the need for robustness in neural network systems when faced with distribution shifts, numerous TTA methods have recently been proposed. However, evaluating these methods is often done under different settings, such as varying distribution shifts, backbones, and designing scenarios, leading to a lack of consistent and fair benchmarks to validate their effectiveness. To address this issue, we present a benchmark that systematically evaluates 13 prominent TTA methods and their variants on five widely used image classification datasets: CIFAR-10-C, CIFAR-100-C, ImageNet-C, DomainNet, and Office-Home. These methods encompass a wide range of adaptation scenarios (e.g. online adaptation v.s. offline adaptation, instance adaptation v.s. batch adaptation v.s. domain adaptation). Furthermore, we explore the compatibility of different TTA methods with diverse network backbones. To implement this benchmark, we have developed a unified framework in PyTorch, which allows for consistent evaluation and comparison of the TTA methods across the different datasets and network architectures. By establishing this benchmark, we aim to provide researchers and practitioners with a reliable means of assessing and comparing the effectiveness of TTA methods in improving model robustness and generalization performance. Our code is available at https://github.com/yuyongcan/Benchmark-TTA.
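    Summary Test-time adaptation (TTA) aims to improve the generalization performance of models by using unlabeled samples only at prediction time. Because neural networks need to be robust to distribution shifts, many TTA methods have recently been proposed. These methods are often evaluated under different settings (distribution shifts, backbones, and scenario designs), so consistent and fair benchmarks are lacking. To address this, the authors present a benchmark that systematically evaluates 13 prominent TTA methods and their variants on five widely used image classification datasets: CIFAR-10-C, CIFAR-100-C, ImageNet-C, DomainNet, and Office-Home. The methods cover a wide range of adaptation scenarios (online vs. offline adaptation; instance vs. batch vs. domain adaptation). The authors also explore the compatibility of different TTA methods with diverse network backbones. The benchmark is implemented as a unified PyTorch framework that allows consistent evaluation and comparison across datasets and architectures, giving researchers and practitioners a reliable means of assessing how well TTA methods improve robustness and generalization. Code is available at https://github.com/yuyongcan/Benchmark-TTA.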

T-MARS: Improving Visual Representations by Circumventing Text Feature Learning

  • paper_url: http://arxiv.org/abs/2307.03132
  • repo_url: https://github.com/locuslab/t-mars
  • paper_authors: Pratyush Maini, Sachin Goyal, Zachary C. Lipton, J. Zico Kolter, Aditi Raghunathan
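  • for: Curating large web-sourced multimodal datasets more effectively, in order to learn better general-purpose visual representations and improve zero- and few-shot recognition.
  • methods: A new data filtering approach motivated by the observation that nearly 40% of LAION's images contain text that overlaps significantly with the caption; such data can push models toward optical character recognition rather than learning visual features. T-MARS masks out the text in each image and re-scores the masked image with CLIP similarity, filtering out only the pairs in which the text dominates the remaining visual features.
  • results: On the "medium scale" of the DataComp data filtering benchmark, T-MARS outperforms the top-ranked method by 6.5% on ImageNet and 4.7% on VTAB. A systematic evaluation on data pool sizes from 2M to 64M shows that the accuracy gains of T-MARS increase linearly as data and compute are scaled exponentially. Code is available at https://github.com/locuslab/T-MARS.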
    Abstract Large web-sourced multimodal datasets have powered a slew of new methods for learning general-purpose visual representations, advancing the state of the art in computer vision and revolutionizing zero- and few-shot recognition. One crucial decision facing practitioners is how, if at all, to curate these ever-larger datasets. For example, the creators of the LAION-5B dataset chose to retain only image-caption pairs whose CLIP similarity score exceeded a designated threshold. In this paper, we propose a new state-of-the-art data filtering approach motivated by our observation that nearly 40% of LAION's images contain text that overlaps significantly with the caption. Intuitively, such data could be wasteful as it incentivizes models to perform optical character recognition rather than learning visual features. However, naively removing all such data could also be wasteful, as it throws away images that contain visual features (in addition to overlapping text). Our simple and scalable approach, T-MARS (Text Masking and Re-Scoring), filters out only those pairs where the text dominates the remaining visual features -- by first masking out the text and then filtering out those with a low CLIP similarity score of the masked image. Experimentally, T-MARS outperforms the top-ranked method on the "medium scale" of DataComp (a data filtering benchmark) by a margin of 6.5% on ImageNet and 4.7% on VTAB. Additionally, our systematic evaluation on various data pool sizes from 2M to 64M shows that the accuracy gains enjoyed by T-MARS linearly increase as data and compute are scaled exponentially. Code is available at https://github.com/locuslab/T-MARS.
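    Summary Large web-sourced multimodal datasets have powered new methods for learning general-purpose visual representations and have transformed zero- and few-shot recognition. A crucial decision for practitioners is how, if at all, to curate these ever-larger datasets. For example, the creators of LAION-5B retained only image-caption pairs whose CLIP similarity score exceeded a designated threshold. This paper proposes a new data filtering approach motivated by the observation that nearly 40% of LAION's images contain text that overlaps significantly with the caption. Such data can be wasteful because it incentivizes models to perform optical character recognition rather than learn visual features, but naively removing all of it would also discard images that contain useful visual features in addition to the overlapping text. T-MARS (Text Masking and Re-Scoring) filters out only the pairs in which the text dominates the remaining visual features: it first masks out the text and then removes pairs whose masked image has a low CLIP similarity score. On the "medium scale" of DataComp, T-MARS outperforms the top-ranked method by 6.5% on ImageNet and 4.7% on VTAB. A systematic evaluation on data pool sizes from 2M to 64M shows that the accuracy gains of T-MARS increase linearly as data and compute are scaled exponentially. Code is available at https://github.com/locuslab/T-MARS.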
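The re-scoring step described above can be sketched as a simple threshold on precomputed scores. This is a minimal illustration, not the released T-MARS code: the function name `tmars_style_filter`, the threshold value, and the score arrays are all hypothetical, and the text masking and CLIP scoring are assumed to have been done beforehand.

```python
import numpy as np

def tmars_style_filter(masked_scores, threshold):
    """Return indices of pairs whose text-masked image still matches the caption.

    masked_scores: CLIP-style similarity between each caption and its image
    after text regions were masked out (assumed to be precomputed elsewhere).
    """
    masked_scores = np.asarray(masked_scores, dtype=float)
    return np.flatnonzero(masked_scores >= threshold)

# Hypothetical scores for four image-caption pairs.
original_scores = [0.33, 0.34, 0.29, 0.30]   # before masking (shown for contrast only)
masked_scores = [0.31, 0.12, 0.27, 0.05]     # after masking the text in each image
kept = tmars_style_filter(masked_scores, threshold=0.25)
```

In this toy data, pairs 1 and 3 score well before masking but poorly afterwards, which is the pattern the abstract attributes to images whose match with the caption comes mainly from rendered text.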

Principal subbundles for dimension reduction

  • paper_url: http://arxiv.org/abs/2307.03128
  • repo_url: None
  • paper_authors: Morten Akhøj, James Benn, Erlend Grong, Stefan Sommer, Xavier Pennec
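  • for: The paper uses sub-Riemannian geometry for manifold learning and surface reconstruction, combining local linear approximations of a point cloud into a lower-dimensional bundle.
  • methods: Local approximations are obtained by local PCA and collected into a rank $k$ tangent subbundle on $\mathbb{R}^d$, $k<d$. The resulting sub-Riemannian metric can be applied to several important problems: constructing an approximating submanifold $M$, constructing a representation of the point cloud in $\mathbb{R}^k$, and computing distances between observations that take the learned geometry into account.
  • results: Simulations show that the framework is robust on noisy data and that it can be extended to the case where the observations lie on a known Riemannian manifold.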
    Abstract In this paper we demonstrate how sub-Riemannian geometry can be used for manifold learning and surface reconstruction by combining local linear approximations of a point cloud to obtain lower dimensional bundles. Local approximations obtained by local PCAs are collected into a rank $k$ tangent subbundle on $\mathbb{R}^d$, $k
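    Summary In this paper we show how sub-Riemannian geometry can be used for manifold learning and surface reconstruction by combining local linear approximations of a point cloud into lower-dimensional bundles. Local approximations obtained by local PCA are collected into a rank $k$ tangent subbundle on $\mathbb{R}^d$, where $k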
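The local PCA step mentioned in this entry can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `local_pca_frame`, the ball-shaped neighborhood of fixed `radius`, and centering at the local mean are choices made here, and the synthetic point cloud is invented for the example.

```python
import numpy as np

def local_pca_frame(points, center, radius, k):
    """Return a (d, k) orthonormal basis of the top-k principal directions near `center`."""
    diffs = points - center
    mask = np.linalg.norm(diffs, axis=1) <= radius   # neighbors within a ball
    local = diffs[mask]
    local = local - local.mean(axis=0)               # center at the local mean
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(local, full_matrices=False)
    return vt[:k].T

# Assumed data: noisy samples near the plane z = 0 in R^3.
rng = np.random.default_rng(0)
pts = np.column_stack([
    rng.uniform(-1, 1, 2000),
    rng.uniform(-1, 1, 2000),
    0.01 * rng.normal(size=2000),
])
frame = local_pca_frame(pts, center=np.zeros(3), radius=0.3, k=2)
```

Evaluating such a frame at every base point gives a field of $k$-dimensional subspaces, which is the kind of object the entry refers to as a rank $k$ subbundle.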

Context-Aware Configuration and Management of WiFi Direct Groups for Real Opportunistic Networks

  • paper_url: http://arxiv.org/abs/2307.03126
  • repo_url: None
  • paper_authors: Valerio Arnaboldi, Mattia Giovanni Campana, Franca Delmastro
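  • for: The study aims to make Wi-Fi Direct on commercial mobile devices usable for networking solutions based entirely on device-to-device (D2D) communication, such as opportunistic networks.
  • methods: The authors propose a new middleware-layer protocol, WiFi Direct Group Manager (WFD-GM), for the automatic configuration and management of WiFi Direct groups. The protocol defines a context function that takes heterogeneous parameters into account, including an index of node stability and power levels, to create the best group configuration in a given time window.
  • results: WFD-GM is evaluated in three simulated reference scenarios with different mobility models, geographical areas, and numbers of nodes. It performs significantly better than a Baseline approach under medium/low mobility and is comparable to it under high mobility, without introducing additional overhead.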
    Abstract Wi-Fi Direct is a promising technology for the support of device-to-device communications (D2D) on commercial mobile devices. However, the standard as-it-is is not sufficient to support the real deployment of networking solutions entirely based on D2D such as opportunistic networks. In fact, WiFi Direct presents some characteristics that could limit the autonomous creation of D2D connections among users' personal devices. Specifically, the standard explicitly requires the user's authorization to establish a connection between two or more devices, and it provides a limited support for inter-group communication. In some cases, this might lead to the creation of isolated groups of nodes which cannot communicate among each other. In this paper, we propose a novel middleware-layer protocol for the efficient configuration and management of WiFi Direct groups (WiFi Direct Group Manager, WFD-GM) to enable autonomous connections and inter-group communication. This enables opportunistic networks in real conditions (e.g., variable mobility and network size). WFD-GM defines a context function that takes into account heterogeneous parameters for the creation of the best group configuration in a specific time window, including an index of nodes' stability and power levels. We evaluate the protocol performances by simulating three reference scenarios including different mobility models, geographical areas and number of nodes. Simulations are also supported by experimental results related to the evaluation in a real testbed of the involved context parameters. We compare WFD-GM with the state-of-the-art solutions and we show that it performs significantly better than a Baseline approach in scenarios with medium/low mobility, and it is comparable with it in case of high mobility, without introducing additional overhead.

Learning Multi-Agent Intention-Aware Communication for Optimal Multi-Order Execution in Finance

  • paper_url: http://arxiv.org/abs/2307.03119
  • repo_url: None
  • paper_authors: Yuchen Fang, Zhenggang Tang, Kan Ren, Weiqing Liu, Li Zhao, Jiang Bian, Dongsheng Li, Weinan Zhang, Yong Yu, Tie-Yan Liu
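  • for: The paper addresses the simultaneous execution of multiple trading orders using model-free reinforcement learning (RL).
  • methods: A multi-agent RL (MARL) method in which each agent executes one specific order while communicating with the other agents and collaborating to maximize overall profit.
  • results: Experiments on data from two real-world markets show that the proposed multi-round communication protocol and action value attribution method improve collaboration effectiveness and yield significantly better performance.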
    Abstract Order execution is a fundamental task in quantitative finance, aiming at finishing acquisition or liquidation for a number of trading orders of specific assets. Recent advances in model-free reinforcement learning (RL) provide a data-driven solution to the order execution problem. However, existing works always optimize execution for an individual order, overlooking the practice that multiple orders are specified to execute simultaneously, resulting in suboptimality and bias. In this paper, we first present a multi-agent RL (MARL) method for multi-order execution considering practical constraints. Specifically, we treat every agent as an individual operator that trades one specific order, while the agents keep communicating with each other and collaborating to maximize the overall profits. Nevertheless, existing MARL algorithms often incorporate communication among agents by exchanging only the information of their partial observations, which is inefficient in complicated financial markets. To improve collaboration, we then propose a learnable multi-round communication protocol in which the agents communicate their intended actions to each other and refine them accordingly. It is optimized through a novel action value attribution method which is provably consistent with the original learning objective yet more efficient. Experiments on data from two real-world markets illustrate superior performance, with significantly better collaboration effectiveness achieved by our method.
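    Summary Order execution is a fundamental task in quantitative finance: completing the acquisition or liquidation of trading orders for specific assets. Model-free reinforcement learning (RL) offers a data-driven solution, but existing work optimizes the execution of a single order and ignores the fact that multiple orders are executed simultaneously in practice, which leads to suboptimality and bias. The paper first presents a multi-agent RL (MARL) method for multi-order execution under practical constraints, in which each agent trades one specific order while communicating and collaborating with the others to maximize overall profit. Existing MARL algorithms usually have agents exchange only their partial observations, which is inefficient in complicated financial markets. To improve collaboration, the paper proposes a learnable multi-round communication protocol in which agents share their intended actions and refine them accordingly. The protocol is optimized through a novel action value attribution method that is provably consistent with the original learning objective yet more efficient. Experiments on data from two real-world markets show superior performance and significantly better collaboration effectiveness.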

Steel Surface Roughness Parameter Calculations Using Lasers and Machine Learning Models

  • paper_url: http://arxiv.org/abs/2307.03723
  • repo_url: None
  • paper_authors: Alex Milne, Xianghua Xie
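  • for: The paper aims to improve control of surface texture in strip steel manufacturing, in particular during galvanizing and temper rolling.
  • methods: State-of-the-art machine learning models are used to transform on-line, non-contact measurements into a more accurate Ra surface roughness metric.
  • results: A selection of data-driven approaches, both deep learning and non-deep-learning, is compared with the closed-form transformation to evaluate their potential for improving surface texture control; accurate on-line measurements would in turn allow real-time adjustment of processing parameters during production.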
    Abstract Control of surface texture in strip steel is essential to meet customer requirements during galvanizing and temper rolling processes. Traditional methods rely on post-production stylus measurements, while on-line techniques offer non-contact and real-time measurements of the entire strip. However, ensuring accurate measurement is imperative for their effective utilization in the manufacturing pipeline. Moreover, accurate on-line measurements enable real-time adjustments of manufacturing processing parameters during production, ensuring consistent quality and the possibility of closed-loop control of the temper mill. In this study, we leverage state-of-the-art machine learning models to enhance the transformation of on-line measurements into a significantly more accurate Ra surface roughness metric. By comparing a selection of data-driven approaches, including both deep learning and non-deep learning methods, to the closed-form transformation, we evaluate their potential for improving surface texture control in temper strip steel manufacturing.
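    Summary Controlling surface texture in strip steel is essential to meet customer requirements during galvanizing and temper rolling. Traditional methods rely on post-production stylus measurements, whereas on-line techniques provide non-contact, real-time measurement of the entire strip. Accurate measurement is a prerequisite for using these techniques in the manufacturing pipeline, and it enables real-time adjustment of processing parameters, consistent quality, and possibly closed-loop control of the temper mill. This study uses state-of-the-art machine learning models to improve the transformation of on-line measurements into a more accurate Ra surface roughness metric. Data-driven approaches, both deep learning and non-deep-learning, are compared with the closed-form transformation to evaluate their potential for improving surface texture control.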
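For readers unfamiliar with the target metric, Ra is the arithmetic mean of the absolute deviations of a height profile from its mean line. The sketch below computes it for a sampled profile; the function name `ra_roughness` and the sample heights are assumptions for illustration, and the code says nothing about how the paper maps on-line laser measurements to Ra.

```python
import numpy as np

def ra_roughness(profile):
    """Arithmetic mean roughness: mean absolute deviation from the mean line."""
    z = np.asarray(profile, dtype=float)
    return np.mean(np.abs(z - z.mean()))

# Hypothetical stylus-style height samples in micrometres.
heights_um = np.array([0.0, 2.0, 0.0, -2.0, 0.0, 2.0, 0.0, -2.0])
ra = ra_roughness(heights_um)  # mean line is 0, so Ra = (4 * 2.0) / 8 = 1.0
```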

Quantum Solutions to the Privacy vs. Utility Tradeoff

  • paper_url: http://arxiv.org/abs/2307.03118
  • repo_url: None
  • paper_authors: Sagnik Chatterjee, Vyacheslav Kungurtsev
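  • for: Protecting data privacy and security in generative models against membership inference attacks.
  • methods: Quantum cryptographic primitives with provable privacy and security guarantees.
  • results: A new architecture (with several variants) based on quantum cryptographic primitives that can be used on top of any existing classical or quantum generative model; the authors argue that it offers inherent advantages over standard Differential Privacy based techniques.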
    Abstract In this work, we propose a novel architecture (and several variants thereof) based on quantum cryptographic primitives with provable privacy and security guarantees regarding membership inference attacks on generative models. Our architecture can be used on top of any existing classical or quantum generative models. We argue that the use of quantum gates associated with unitary operators provides inherent advantages compared to standard Differential Privacy based techniques for establishing guaranteed security from all polynomial-time adversaries.
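    Summary This work proposes a novel architecture, and several variants of it, based on quantum cryptographic primitives with provable privacy and security guarantees against membership inference attacks on generative models. The architecture can be used on top of any existing classical or quantum generative model. The authors argue that quantum gates associated with unitary operators provide inherent advantages over standard Differential Privacy based techniques for establishing guaranteed security against all polynomial-time adversaries.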

Region-Wise Attentive Multi-View Representation Learning for Urban Region Embeddings

  • paper_url: http://arxiv.org/abs/2307.03212
  • repo_url: None
  • paper_authors: Weiliang Chan, Qianqian Ren
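  • for: The study proposes Region-Wise Multi-View Representation Learning (ROMER), a model that captures multi-view dependencies and learns expressive representations of urban regions.
  • methods: The model first captures multi-view correlations from mobility flow patterns, POI semantics, and check-in dynamics, then uses global graph attention networks to learn the similarity of any two vertices. A two-stage fusion module with external attention then fuses the multi-view embeddings so that features are considered and shared across views.
  • results: Experiments on two downstream tasks with real-world datasets show that the model outperforms state-of-the-art methods by up to 17%.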
    Abstract Urban region embedding is an important and yet highly challenging issue due to the complexity and constantly changing nature of urban data. To address the challenges, we propose Region-Wise Multi-View Representation Learning (ROMER) to capture multi-view dependencies and learn expressive representations of urban regions without the constraints of rigid neighbourhood region conditions. Our model focuses on learning urban region representations from multi-source urban data. First, we capture the multi-view correlations from mobility flow patterns, POI semantics and check-in dynamics. Then, we adopt global graph attention networks to learn the similarity of any two vertices in graphs. To comprehensively consider and share features of multiple views, a two-stage fusion module is further proposed to learn weights with external attention to fuse multi-view embeddings. Extensive experiments for two downstream tasks on real-world datasets demonstrate that our model outperforms state-of-the-art methods by up to 17%.
    Summary Urban region embedding is important but highly challenging because urban data is complex and constantly changing. The authors propose Region-Wise Multi-View Representation Learning (ROMER), which captures multi-view dependencies and learns expressive representations of urban regions from multi-source urban data without the constraints of rigid neighbourhood region conditions. The model first captures multi-view correlations from mobility flow patterns, POI semantics, and check-in dynamics, then uses global graph attention networks to learn the similarity of any two vertices in the graphs. A two-stage fusion module learns weights with external attention to fuse the multi-view embeddings. Extensive experiments on two downstream tasks with real-world datasets show improvements of up to 17% over state-of-the-art methods.

How to Detect Unauthorized Data Usages in Text-to-image Diffusion Models

  • paper_url: http://arxiv.org/abs/2307.03108
  • repo_url: None
  • paper_authors: Zhenting Wang, Chen Chen, Yuchen Liu, Lingjuan Lyu, Dimitris Metaxas, Shiqing Ma
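  • for: Detecting unauthorized data usage in the training of text-to-image diffusion models.
  • methods: Planting injected memorization in the protected dataset, then detecting unauthorized usage by analyzing whether a model has memorized the injected content.
  • results: Experiments on Stable Diffusion and LoRA models show that the proposed method is effective at detecting unauthorized data usage.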
    Abstract Recent text-to-image diffusion models have shown surprising performance in generating high-quality images. However, concerns have arisen regarding the unauthorized usage of data during the training process. One example is when a model trainer collects a set of images created by a particular artist and attempts to train a model capable of generating similar images without obtaining permission from the artist. To address this issue, it becomes crucial to detect unauthorized data usage. In this paper, we propose a method for detecting such unauthorized data usage by planting injected memorization into the text-to-image diffusion models trained on the protected dataset. Specifically, we modify the protected image dataset by adding unique contents on the images such as stealthy image wrapping functions that are imperceptible to human vision but can be captured and memorized by diffusion models. By analyzing whether the model has memorization for the injected content (i.e., whether the generated images are processed by the chosen post-processing function), we can detect models that had illegally utilized the unauthorized data. Our experiments conducted on Stable Diffusion and LoRA model demonstrate the effectiveness of the proposed method in detecting unauthorized data usages.
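    Summary Recent text-to-image diffusion models generate surprisingly high-quality images, but concerns have arisen about unauthorized use of data during training. For example, a model trainer may collect images created by a particular artist and train a model to generate similar images without the artist's permission. Detecting such unauthorized data usage is therefore crucial. This paper proposes a detection method that plants injected memorization into text-to-image diffusion models trained on the protected dataset. Specifically, the protected images are modified by adding unique content, such as stealthy image wrapping functions that are imperceptible to human vision but can be captured and memorized by diffusion models. By analyzing whether a model has memorized the injected content (that is, whether its generated images appear to have been processed by the chosen function), models that illegally used the protected data can be detected. Experiments on Stable Diffusion and LoRA models demonstrate the effectiveness of the method.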

Beyond Intuition, a Framework for Applying GPs to Real-World Data

  • paper_url: http://arxiv.org/abs/2307.03093
  • repo_url: https://github.com/kenzaxtazi/icml23-gpframe
  • paper_authors: Kenza Tazi, Jihao Andreas Lin, Ross Viljoen, Alex Gardner, ST John, Hong Ge, Richard E. Turner
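  • for: The paper provides a framework and guidelines for applying Gaussian Processes (GPs) to regression on real-world data.
  • methods: The guidelines formalise the decisions of experienced GP practitioners, with an emphasis on kernel design and options for computational scalability, and describe how to set up a robust and well-specified GP model.
  • results: Applied to a case study of glacier elevation change, the framework yields more accurate results at test time.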
    Abstract Gaussian Processes (GPs) offer an attractive method for regression over small, structured and correlated datasets. However, their deployment is hindered by computational costs and limited guidelines on how to apply GPs beyond simple low-dimensional datasets. We propose a framework to identify the suitability of GPs to a given problem and how to set up a robust and well-specified GP model. The guidelines formalise the decisions of experienced GP practitioners, with an emphasis on kernel design and options for computational scalability. The framework is then applied to a case study of glacier elevation change yielding more accurate results at test time.

A Novel Site-Agnostic Multimodal Deep Learning Model to Identify Pro-Eating Disorder Content on Social Media

  • paper_url: http://arxiv.org/abs/2307.06775
  • repo_url: None
  • paper_authors: Jonathan Feldman
  • for: This study aimed to create a multimodal deep learning model to determine if social media posts promote eating disorders based on visual and textual data.
  • methods: The study used a labeled dataset of Tweets and trained and tested twelve deep learning models, including a multimodal fusion of the RoBERTa natural language processing model and the MaxViT image classification model.
  • results: The RoBERTa and MaxViT fusion model achieved accuracy and F1 scores of 95.9% and 0.959, respectively. Deployed on an unlabeled dataset of posts from Tumblr and Reddit, it generated results akin to those of previous studies that did not use artificial intelligence-based techniques. A time-series analysis of eight Twitter hashtags showed that the relative abundance of content that promotes eating disorders has decreased drastically since 2014, although by 2018 such content had either stopped declining or begun increasing again.
    Abstract Over the last decade, there has been a vast increase in eating disorder diagnoses and eating disorder-attributed deaths, reaching their zenith during the Covid-19 pandemic. This immense growth derived in part from the stressors of the pandemic but also from increased exposure to social media, which is rife with content that promotes eating disorders. This study aimed to create a multimodal deep learning model that can determine if a given social media post promotes eating disorders based on a combination of visual and textual data. A labeled dataset of Tweets was collected from Twitter, upon which twelve deep learning models were trained and tested. Based on model performance, the most effective deep learning model was the multimodal fusion of the RoBERTa natural language processing model and the MaxViT image classification model, attaining accuracy and F1 scores of 95.9% and 0.959, respectively. The RoBERTa and MaxViT fusion model, deployed to classify an unlabeled dataset of posts from the social media sites Tumblr and Reddit, generated results akin to those of previous research studies that did not employ artificial intelligence-based techniques, indicating that deep learning models can develop insights congruent to those of researchers. Additionally, the model was used to conduct a timeseries analysis of yet unseen Tweets from eight Twitter hashtags, uncovering that, since 2014, the relative abundance of content that promotes eating disorders has decreased drastically within those communities. Despite this reduction, by 2018, content that promotes eating disorders had either stopped declining or increased in ampleness anew on these hashtags.
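    Summary Over the last decade, eating disorder diagnoses and eating disorder-attributed deaths have increased greatly, peaking during the Covid-19 pandemic. The growth stems partly from the stressors of the pandemic and partly from increased exposure to social media, where content that promotes eating disorders is widespread. This study builds a multimodal deep learning model that determines whether a social media post promotes eating disorders from a combination of visual and textual data. Twelve deep learning models were trained and tested on a labeled dataset of Tweets; the most effective was the multimodal fusion of the RoBERTa natural language processing model and the MaxViT image classification model, with accuracy and F1 scores of 95.9% and 0.959, respectively. Applied to an unlabeled dataset of Tumblr and Reddit posts, the fusion model produced results akin to those of previous studies that did not use artificial intelligence-based techniques, indicating that deep learning models can develop insights congruent with those of researchers. A time-series analysis of unseen Tweets from eight Twitter hashtags showed that the relative abundance of content promoting eating disorders has decreased drastically since 2014, although by 2018 it had either stopped declining or started to increase again.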
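The two reported metrics can be related through the confusion counts of a binary classifier. The following is a minimal sketch with made-up counts, not the study's data; the function name `accuracy_and_f1` is an assumption for illustration.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute accuracy and F1 from binary confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)            # fraction of flagged posts that are truly positive
    recall = tp / (tp + fn)               # fraction of positive posts that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# Hypothetical counts: 200 posts, balanced classes, 10 errors of each kind.
acc, f1 = accuracy_and_f1(tp=90, fp=10, fn=10, tn=90)
```

When classes are balanced and false positives equal false negatives, as in this toy example, accuracy and F1 coincide, which is consistent with the near-identical values (95.9% and 0.959) quoted above.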

eess.IV - 2023-07-07

Detecting the Sensing Area of A Laparoscopic Probe in Minimally Invasive Cancer Surgery

  • paper_url: http://arxiv.org/abs/2307.03662
  • repo_url: https://github.com/br0202/sensing_area_detection
  • paper_authors: Baoru Huang, Yicheng Hu, Anh Nguyen, Stamatia Giannarou, Daniel S. Elson
  • for: This paper aims to improve the accuracy of endoscopic radio-guided cancer detection and resection by developing a novel method for detecting the sensing area of a tethered laparoscopic gamma detector.
  • methods: The proposed method uses a simple regression network to leverage high-dimensional image features and probe position information to visualize gamma activity origination on the tissue surface.
  • results: The authors demonstrated the effectiveness of their method through intensive experimentation using two publicly released datasets captured with a custom-designed, portable stereo laparoscope system, establishing a new performance benchmark.
    Abstract In surgical oncology, it is challenging for surgeons to identify lymph nodes and completely resect cancer even with pre-operative imaging systems like PET and CT, because of the lack of reliable intraoperative visualization tools. Endoscopic radio-guided cancer detection and resection has recently been evaluated whereby a novel tethered laparoscopic gamma detector is used to localize a preoperatively injected radiotracer. This can both enhance the endoscopic imaging and complement preoperative nuclear imaging data. However, gamma activity visualization is challenging to present to the operator because the probe is non-imaging and it does not visibly indicate the activity origination on the tissue surface. Initial failed attempts used segmentation or geometric methods, but led to the discovery that it could be resolved by leveraging high-dimensional image features and probe position information. To demonstrate the effectiveness of this solution, we designed and implemented a simple regression network that successfully addressed the problem. To further validate the proposed solution, we acquired and publicly released two datasets captured using a custom-designed, portable stereo laparoscope system. Through intensive experimentation, we demonstrated that our method can successfully and effectively detect the sensing area, establishing a new performance benchmark. Code and data are available at https://github.com/br0202/Sensing_area_detection.git
    Summary In surgical oncology, it is difficult for surgeons to identify lymph nodes and completely resect cancer even with pre-operative imaging such as PET and CT, because reliable intraoperative visualization tools are lacking. Endoscopic radio-guided cancer detection and resection has recently been evaluated, in which a novel tethered laparoscopic gamma detector localizes a preoperatively injected radiotracer. This can both enhance endoscopic imaging and complement preoperative nuclear imaging data. Presenting gamma activity to the operator is challenging, however, because the probe is non-imaging and does not visibly indicate where the activity originates on the tissue surface. Early attempts based on segmentation or geometric methods failed, but showed that the problem can be resolved by leveraging high-dimensional image features and probe position information. The authors design and implement a simple regression network that addresses the problem, and they acquire and publicly release two datasets captured with a custom-designed, portable stereo laparoscope system. Extensive experiments show that the method detects the sensing area effectively and establish a new performance benchmark. Code and data are available at https://github.com/br0202/Sensing_area_detection.git

VesselVAE: Recursive Variational Autoencoders for 3D Blood Vessel Synthesis

  • paper_url: http://arxiv.org/abs/2307.03592
  • repo_url: None
  • paper_authors: Paula Feldman, Miguel Fainstein, Viviana Siless, Claudio Delrieux, Emmanuel Iarussi
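  • for: The paper proposes a data-driven generative framework for synthesizing 3D blood vessel geometry.
  • methods: A recursive variational neural network (VesselVAE) that fully exploits the hierarchical organization of the vessel and learns a low-dimensional manifold encoding branch connectivity together with geometry features describing the target surface.
  • results: Synthetic and real data show high similarity for radius (.97), length (.95), and tortuosity (.96). The generated 3D vessel models are both accurate and diverse, which matters for medical and surgical training and for hemodynamic simulations.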
    Abstract We present a data-driven generative framework for synthesizing blood vessel 3D geometry. This is a challenging task due to the complexity of vascular systems, which vary widely in shape, size, and structure. Existing model-based methods provide some degree of control and variation in the structures produced, but fail to capture the diversity of actual anatomical data. We developed VesselVAE, a recursive variational Neural Network that fully exploits the hierarchical organization of the vessel and learns a low-dimensional manifold encoding branch connectivity along with geometry features describing the target surface. After training, the VesselVAE latent space can be sampled to generate new vessel geometries. To the best of our knowledge, this work is the first to utilize this technique for synthesizing blood vessels. We achieve similarities of synthetic and real data for radius (.97), length (.95), and tortuosity (.96). By leveraging the power of deep neural networks, we generate 3D models of blood vessels that are both accurate and diverse, which is crucial for medical and surgical training, hemodynamic simulations, and many other purposes.
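    Summary We present a data-driven generative framework for synthesizing 3D blood vessel geometry, a hard task because vascular systems vary widely in shape, size, and structure. Existing model-based methods offer some control and variation but fail to capture the diversity of real anatomical data. VesselVAE is a recursive variational neural network that exploits the hierarchical organization of the vessel and learns a low-dimensional manifold encoding branch connectivity together with geometric features of the target surface; after training, its latent space can be sampled to generate new vessel geometries. To the best of our knowledge, this is the first use of this technique for blood vessel synthesis. Synthetic and real data reach similarities of .97 for radius, .95 for length, and .96 for tortuosity.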
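    The tortuosity similarity reported above compares a per-vessel shape statistic between real and synthetic data. As a minimal sketch, assuming tortuosity is defined as centerline arc length divided by the straight-line distance between the endpoints (a common definition; the abstract does not say which one the authors use), it can be computed from a sampled centerline as follows. The function names and the toy centerlines are illustrative, not taken from the paper.

```python
import math


def path_length(points):
    """Sum of Euclidean distances between consecutive 3D centerline points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def tortuosity(points):
    """Arc length divided by the straight-line distance between the two endpoints."""
    chord = math.dist(points[0], points[-1])
    if chord == 0.0:
        raise ValueError("endpoints coincide; tortuosity is undefined")
    return path_length(points) / chord


# Toy centerlines (assumed data): a straight segment and a segment with one bend.
straight = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
bent = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 0.0)]

print(tortuosity(straight))  # a straight vessel has tortuosity 1.0
print(tortuosity(bent))      # any detour pushes the value above 1.0
```

    A value of 1.0 means a perfectly straight vessel; larger values mean a more winding one.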

Physical Color Calibration of Digital Pathology Scanners for Robust Artificial Intelligence Assisted Cancer Diagnosis

  • paper_url: http://arxiv.org/abs/2307.05519
  • repo_url: None
  • paper_authors: Xiaoyi Ji, Richard Salmon, Nita Mulliqi, Umair Khan, Yinxi Wang, Anders Blilie, Henrik Olsson, Bodil Ginnerup Pedersen, Karina Dalsgaard Sørensen, Benedicte Parm Ulhøi, Svein R Kjosavik, Emilius AM Janssen, Mattias Rantalainen, Lars Egevad, Pekka Ruusuvuori, Martin Eklund, Kimmo Kartasalo
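  • for: Improving the reliability and clinical applicability of AI systems in digital pathology.
  • methods: Physically color-calibrates scanners with a calibration slide to standardize the appearance of whole slide images, evaluated in four laboratories.
  • results: Color standardization consistently improved AI model calibration and significantly improved Gleason grading performance, making AI-based cancer diagnostics more reliable and applicable in clinical settings.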
    Abstract The potential of artificial intelligence (AI) in digital pathology is limited by technical inconsistencies in the production of whole slide images (WSIs), leading to degraded AI performance and posing a challenge for widespread clinical application as fine-tuning algorithms for each new site is impractical. Changes in the imaging workflow can also lead to compromised diagnoses and patient safety risks. We evaluated whether physical color calibration of scanners can standardize WSI appearance and enable robust AI performance. We employed a color calibration slide in four different laboratories and evaluated its impact on the performance of an AI system for prostate cancer diagnosis on 1,161 WSIs. Color standardization resulted in consistently improved AI model calibration and significant improvements in Gleason grading performance. The study demonstrates that physical color calibration provides a potential solution to the variation introduced by different scanners, making AI-based cancer diagnostics more reliable and applicable in clinical settings.
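    Summary The potential of AI in digital pathology is limited by technical inconsistencies in the production of whole slide images (WSIs), which degrade AI performance and hinder widespread clinical use because fine-tuning algorithms for each new site is impractical. Changes in the imaging workflow can also compromise diagnoses and patient safety. We evaluated whether physical color calibration of scanners can standardize WSI appearance and enable robust AI performance, using a color calibration slide in four laboratories and an AI system for prostate cancer diagnosis on 1,161 WSIs. Color standardization consistently improved AI model calibration and significantly improved Gleason grading performance, which suggests that physical color calibration can counter the variation introduced by different scanners.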

A Deep Active Contour Model for Delineating Glacier Calving Fronts

  • paper_url: http://arxiv.org/abs/2307.03461
  • repo_url: None
  • paper_authors: Konrad Heidler, Lichao Mou, Erik Loebel, Mirko Scheinert, Sébastien Lefèvre, Xiao Xiang Zhu
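  • for: Delineating glacier calving fronts by recasting the task as contour tracing, with the aim of improving the accuracy of calving front detection.
  • methods: Combines Convolutional Neural Networks (CNNs) for feature extraction with an active contour model for delineation.
  • results: Trained and evaluated on several large-scale datasets of Greenland's outlet glaciers, the approach outperforms segmentation- and edge-detection-based methods and is better suited to quantifying the uncertainty of model predictions.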
    Abstract Choosing how to encode a real-world problem as a machine learning task is an important design decision in machine learning. The task of glacier calving front modeling has often been approached as a semantic segmentation task. Recent studies have shown that combining segmentation with edge detection can improve the accuracy of calving front detectors. Building on this observation, we completely rephrase the task as a contour tracing problem and propose a model for explicit contour detection that does not incorporate any dense predictions as intermediate steps. The proposed approach, called "Charting Outlines by Recurrent Adaptation" (COBRA), combines Convolutional Neural Networks (CNNs) for feature extraction and active contour models for the delineation. By training and evaluating on several large-scale datasets of Greenland's outlet glaciers, we show that this approach indeed outperforms the aforementioned methods based on segmentation and edge-detection. Finally, we demonstrate that explicit contour detection has benefits over pixel-wise methods when quantifying the models' prediction uncertainties. The project page containing the code and animated model predictions can be found at https://khdlr.github.io/COBRA/
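    Summary How a real-world problem is encoded as a machine learning task is an important design decision. Glacier calving front modeling has usually been treated as semantic segmentation, and recent studies show that combining segmentation with edge detection improves the accuracy of calving front detectors. Building on this, we recast the task as contour tracing and propose a model for explicit contour detection that uses no dense predictions as intermediate steps. The approach, "Charting Outlines by Recurrent Adaptation" (COBRA), combines Convolutional Neural Networks (CNNs) for feature extraction with active contour models for delineation. Trained and evaluated on several large-scale datasets of Greenland's outlet glaciers, it outperforms the segmentation- and edge-detection-based methods, and explicit contour detection also proves advantageous when quantifying prediction uncertainty. Project page with code and animated model predictions: https://khdlr.github.io/COBRA/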

Non-iterative Coarse-to-fine Transformer Networks for Joint Affine and Deformable Image Registration

  • paper_url: http://arxiv.org/abs/2307.03421
  • repo_url: https://github.com/mungomeng/registration-nice-trans
  • paper_authors: Mingyuan Meng, Lei Bi, Michael Fulham, Dagan Feng, Jinman Kim
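  • for: Proposes a deep-learning-based Non-Iterative Coarse-to-finE (NICE) image registration method to improve registration accuracy and efficiency.
  • methods: Performs joint affine and deformable coarse-to-fine registration within a single network and, for the first time, embeds transformers into a NICE registration framework to model long-range relevance between images.
  • results: In extensive experiments on seven public datasets, NICE-Trans outperforms state-of-the-art registration methods in both registration accuracy and runtime.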
    Abstract Image registration is a fundamental requirement for medical image analysis. Deep registration methods based on deep learning have been widely recognized for their capabilities to perform fast end-to-end registration. Many deep registration methods achieved state-of-the-art performance by performing coarse-to-fine registration, where multiple registration steps were iterated with cascaded networks. Recently, Non-Iterative Coarse-to-finE (NICE) registration methods have been proposed to perform coarse-to-fine registration in a single network and showed advantages in both registration accuracy and runtime. However, existing NICE registration methods mainly focus on deformable registration, while affine registration, a common prerequisite, is still reliant on time-consuming traditional optimization-based methods or extra affine registration networks. In addition, existing NICE registration methods are limited by the intrinsic locality of convolution operations. Transformers may address this limitation for their capabilities to capture long-range dependency, but the benefits of using transformers for NICE registration have not been explored. In this study, we propose a Non-Iterative Coarse-to-finE Transformer network (NICE-Trans) for image registration. Our NICE-Trans is the first deep registration method that (i) performs joint affine and deformable coarse-to-fine registration within a single network, and (ii) embeds transformers into a NICE registration framework to model long-range relevance between images. Extensive experiments with seven public datasets show that our NICE-Trans outperforms state-of-the-art registration methods on both registration accuracy and runtime.
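    Summary Image registration is a fundamental requirement for medical image analysis, and deep-learning-based registration methods are widely recognized for fast end-to-end registration. Many reach state-of-the-art performance through coarse-to-fine registration that iterates several steps with cascaded networks, whereas Non-Iterative Coarse-to-finE (NICE) methods perform coarse-to-fine registration in a single network. Existing NICE methods focus mainly on deformable registration, so affine registration, a common prerequisite, still relies on time-consuming optimization-based methods or extra affine networks; they are also limited by the intrinsic locality of convolution. Transformers can capture long-range dependencies, but their benefit for NICE registration had not been explored. We propose NICE-Trans, the first deep registration method that (i) performs joint affine and deformable coarse-to-fine registration within a single network and (ii) embeds transformers into a NICE framework to model long-range relevance between images. Experiments on seven public datasets show that NICE-Trans outperforms state-of-the-art registration methods in both accuracy and runtime.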
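    Joint affine and deformable registration means that a point is first moved by a global affine transform and then by a local, spatially varying displacement. The sketch below shows that composition for a single 2D point. It is only an illustration of the general idea: the abstract does not describe how NICE-Trans composes its transforms internally, and `apply_affine`, `warp_point`, and the toy displacement function are hypothetical names.

```python
def apply_affine(point, matrix, translation):
    """Global step: return A @ point + t for a 2D point."""
    x, y = point
    (a, b), (c, d) = matrix
    tx, ty = translation
    return (a * x + b * y + tx, c * x + d * y + ty)


def warp_point(point, matrix, translation, displacement):
    """Affine alignment first, then a local displacement sampled at the aligned point."""
    ax, ay = apply_affine(point, matrix, translation)
    dx, dy = displacement((ax, ay))
    return (ax + dx, ay + dy)


# Toy transforms (assumed values): uniform scaling by 2 plus a shift along x,
# followed by a constant residual displacement standing in for a dense field.
scale_and_shift = ((2.0, 0.0), (0.0, 2.0))
shift = (1.0, 0.0)
residual = lambda p: (0.5, -0.5)

warped = warp_point((1.0, 1.0), scale_and_shift, shift, residual)
print(warped)  # (3.5, 1.5)
```

    In a real registration network the displacement would be a dense field predicted per pixel or voxel and sampled by interpolation rather than a constant.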

Unsupervised Hyperspectral and Multispectral Images Fusion Based on the Cycle Consistency

  • paper_url: http://arxiv.org/abs/2307.03413
  • repo_url: https://github.com/shuaikaishi/CycFusion
  • paper_authors: Shuaikai Shi, Lijun Zhang, Yoann Altmann, Jie Chen
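  • for: Fusing hyperspectral and multispectral images to obtain images with both high spatial and high spectral resolution, using an unsupervised model based on cycle consistency.
  • methods: CycFusion learns the domain transformation between low spatial resolution HSI (LrHSI) and high spatial resolution MSI (HrMSI), treats the desired high spatial resolution HSI (HrHSI) as intermediate feature maps, and trains with marginal matching in single transforms and cycle consistency in double transforms.
  • results: On several datasets the model outperforms all compared unsupervised fusion methods; the estimated PSF and SRF are embedded as pre-training weights, which further improves practicality.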
    Abstract Hyperspectral images (HSI), whose abundant spectral information reflects material properties, usually have low spatial resolution due to hardware limits. Meanwhile, multispectral images (MSI), e.g., RGB images, have a high spatial resolution but deficient spectral signatures. Hyperspectral and multispectral image fusion can be cost-effective and efficient for acquiring both high spatial resolution and high spectral resolution images. Many of the conventional HSI and MSI fusion algorithms rely on known spatial degradation parameters (the point spread function, PSF), spectral degradation parameters (the spectral response function, SRF), or both. Another class of deep learning-based models relies on the ground truth of high spatial resolution HSI and needs large amounts of paired training images when working in a supervised manner. Both of these classes are limited in practical fusion scenarios. In this paper, we propose an unsupervised HSI and MSI fusion model based on cycle consistency, called CycFusion. CycFusion learns the domain transformation between low spatial resolution HSI (LrHSI) and high spatial resolution MSI (HrMSI), and the desired high spatial resolution HSI (HrHSI) is considered to be the intermediate feature maps in the transformation networks. CycFusion can be trained with the objective functions of marginal matching in single transforms and cycle consistency in double transforms. Moreover, the estimated PSF and SRF are embedded in the model as the pre-training weights, which further enhances the practicality of our proposed model. Experiments conducted on several datasets show that our proposed model outperforms all compared unsupervised fusion methods. The code of this paper will be available at https://github.com/shuaikaishi/CycFusion for reproducibility.
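    Summary Hyperspectral images (HSI) carry rich spectral information but usually have low spatial resolution because of hardware limits, while multispectral images (MSI) such as RGB images have high spatial resolution but poor spectral signatures. Fusing the two is a cost-effective way to obtain images with both high spatial and high spectral resolution. Many conventional fusion algorithms rely on known spatial degradation parameters (the point spread function, PSF), spectral degradation parameters (the spectral response function, SRF), or both, and supervised deep learning models need ground-truth high spatial resolution HSI and large amounts of paired training images; both requirements limit practical use. We propose CycFusion, an unsupervised HSI and MSI fusion model based on cycle consistency. It learns the domain transformation between low spatial resolution HSI (LrHSI) and high spatial resolution MSI (HrMSI), with the desired high spatial resolution HSI (HrHSI) treated as intermediate feature maps, and is trained with marginal matching in single transforms and cycle consistency in double transforms. The estimated PSF and SRF are embedded as pre-training weights. On several datasets the model outperforms all compared unsupervised fusion methods. Code will be available at https://github.com/shuaikaishi/CycFusion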
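    The two degradations that link the images in this problem can be written down directly: spatial degradation (PSF blur plus downsampling) maps HrHSI to LrHSI, and spectral degradation (the SRF) maps HrHSI to HrMSI. The sketch below is a minimal pure-Python illustration, assuming a box-average PSF and a hand-picked SRF; it is not the CycFusion network, and `spatial_degrade`, `spectral_degrade`, and the toy data are hypothetical. Images are nested lists indexed as `[band][row][column]`.

```python
def spatial_degrade(hsi, factor):
    """Average each band over factor x factor blocks (stand-in for PSF blur + downsampling)."""
    out = []
    for band in hsi:
        rows, cols = len(band) // factor, len(band[0]) // factor
        out.append([
            [
                sum(band[r * factor + i][c * factor + j]
                    for i in range(factor) for j in range(factor)) / factor ** 2
                for c in range(cols)
            ]
            for r in range(rows)
        ])
    return out


def spectral_degrade(hsi, srf):
    """Mix input bands into output bands; each SRF row holds the weights of one output band."""
    rows, cols = len(hsi[0]), len(hsi[0][0])
    return [
        [[sum(w * hsi[b][r][c] for b, w in enumerate(weights)) for c in range(cols)]
         for r in range(rows)]
        for weights in srf
    ]


# Toy HrHSI (assumed data): 4 bands of 2x2 pixels, band k filled with the value k + 1.
hr_hsi = [[[band + 1] * 2 for _ in range(2)] for band in range(4)]
srf = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]  # assumed 4-band to 2-band response

lr_hsi = spatial_degrade(hr_hsi, 2)      # 4 bands, 1x1 pixel each
hr_msi = spectral_degrade(hr_hsi, srf)   # 2 bands, 2x2 pixels each

# Degrading in either order gives the same low-resolution multispectral image.
print(spectral_degrade(lr_hsi, srf) == spatial_degrade(hr_msi, 2))  # True
```

    The final check confirms that the two operators commute on this toy input, since one acts only on the spatial axes and the other only on the spectral axis.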

Towards Robust SDRTV-to-HDRTV via Dual Inverse Degradation Network

  • paper_url: http://arxiv.org/abs/2307.03394
  • repo_url: None
  • paper_authors: Kepeng Xu, Gang He, Li Xu, Xingchao Yang, Ming Sun, Yuzhi Wang, Zijia Ma, Haoqiang Fan, Xing Wen
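  • for: Improving SDRTV-to-HDRTV conversion while addressing the amplification of coding artifacts during conversion.
  • methods: Proposes DIDNet, a dual inverse degradation SDRTV-to-HDRTV network with a temporal-spatial feature alignment module, dual modulation convolution, and a wavelet attention module, to remove coding artifacts and enhance color restoration.
  • results: Outperforms the current state-of-the-art method in quantitative results, visual quality, and inference time, improving SDRTV-to-HDRTV conversion in real-world scenarios.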
    Abstract Recently, the transformation of standard dynamic range TV (SDRTV) to high dynamic range TV (HDRTV) is in high demand due to the scarcity of HDRTV content. However, the conversion of SDRTV to HDRTV often amplifies the existing coding artifacts in SDRTV which deteriorate the visual quality of the output. In this study, we propose a dual inverse degradation SDRTV-to-HDRTV network DIDNet to address the issue of coding artifact restoration in converted HDRTV, which has not been previously studied. Specifically, we propose a temporal-spatial feature alignment module and dual modulation convolution to remove coding artifacts and enhance color restoration ability. Furthermore, a wavelet attention module is proposed to improve SDRTV features in the frequency domain. An auxiliary loss is introduced to decouple the learning process for effectively restoring from dual degradation. The proposed method outperforms the current state-of-the-art method in terms of quantitative results, visual quality, and inference times, thus enhancing the performance of the SDRTV-to-HDRTV method in real-world scenarios.
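    Summary Converting standard dynamic range TV (SDRTV) to high dynamic range TV (HDRTV) is in high demand because HDRTV content is scarce, but the conversion often amplifies coding artifacts already present in the SDRTV source and degrades the visual quality of the output. We propose DIDNet, a dual inverse degradation SDRTV-to-HDRTV network, to address coding artifact restoration in converted HDRTV, which has not been studied before. A temporal-spatial feature alignment module and dual modulation convolution remove coding artifacts and enhance color restoration, a wavelet attention module improves SDRTV features in the frequency domain, and an auxiliary loss decouples the learning process so that both degradations are restored effectively. The method outperforms the current state of the art in quantitative results, visual quality, and inference time.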

CheXmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images

  • paper_url: http://arxiv.org/abs/2307.03293
  • repo_url: https://github.com/ngaggion/chexmask-database
  • paper_authors: Nicolás Gaggion, Candelaria Mosquera, Lucas Mansilla, Martina Aineseder, Diego H. Milone, Enzo Ferrante
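  • for: Providing a large-scale, multi-center chest X-ray segmentation dataset to support the development of better AI models.
  • methods: Uses the HybridGNet model to ensure consistent, high-quality segmentations across all source datasets.
  • results: Produces 676,803 segmentation masks, validated rigorously through expert physician evaluation and automatic quality control.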
    Abstract The development of successful artificial intelligence models for chest X-ray analysis relies on large, diverse datasets with high-quality annotations. While several databases of chest X-ray images have been released, most include disease diagnosis labels but lack detailed pixel-level anatomical segmentation labels. To address this gap, we introduce an extensive chest X-ray multi-center segmentation dataset with uniform and fine-grain anatomical annotations for images coming from six well-known publicly available databases: CANDID-PTX, ChestX-ray8, Chexpert, MIMIC-CXR-JPG, Padchest, and VinDr-CXR, resulting in 676,803 segmentation masks. Our methodology utilizes the HybridGNet model to ensure consistent and high-quality segmentations across all datasets. Rigorous validation, including expert physician evaluation and automatic quality control, was conducted to validate the resulting masks. Additionally, we provide individualized quality indices per mask and an overall quality estimation per dataset. This dataset serves as a valuable resource for the broader scientific community, streamlining the development and assessment of innovative methodologies in chest X-ray analysis. The CheXmask dataset is publicly available at: https://physionet.org/content/chexmask-cxr-segmentation-data/
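    Summary Successful AI models for chest X-ray analysis depend on large, diverse datasets with high-quality annotations. Most released chest X-ray databases include disease diagnosis labels but lack detailed pixel-level anatomical segmentation labels. To close this gap, we introduce a multi-center chest X-ray segmentation dataset with uniform, fine-grained anatomical annotations for images from six public databases (CANDID-PTX, ChestX-ray8, Chexpert, MIMIC-CXR-JPG, Padchest, and VinDr-CXR), totaling 676,803 segmentation masks. The HybridGNet model ensures consistent, high-quality segmentations, and the masks were validated through expert physician evaluation and automatic quality control. We also provide an individual quality index per mask and an overall quality estimate per dataset. The CheXmask dataset is publicly available at https://physionet.org/content/chexmask-cxr-segmentation-data/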

ADASSM: Adversarial Data Augmentation in Statistical Shape Models From Images

  • paper_url: http://arxiv.org/abs/2307.03273
  • repo_url: None
  • paper_authors: Mokshagna Sai Teja Karanam, Tushar Kataria, Krithika Iyer, Shireen Elhabian
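  • for: Proposes an on-the-fly data augmentation strategy for deep-learning-based Image-to-SSM (statistical shape model) frameworks, to improve their accuracy.
  • methods: Trains a data-dependent noise generation (texture augmentation) framework as an adversary to the Image-to-SSM network, producing diverse and challenging noisy samples; this complements earlier offline shape augmentation based on kernel density estimation (KDE).
  • results: Improves the accuracy of Image-to-SSM networks by encouraging the model to focus on the underlying geometry rather than relying solely on pixel values, which counters the image-based texture bias of deep learning models.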
    Abstract Statistical shape models (SSM) have been well-established as an excellent tool for identifying variations in the morphology of anatomy across the underlying population. Shape models use consistent shape representation across all the samples in a given cohort, which helps to compare shapes and identify the variations that can detect pathologies and help in formulating treatment plans. In medical imaging, computing these shape representations from CT/MRI scans requires time-intensive preprocessing operations, including but not limited to anatomy segmentation annotations, registration, and texture denoising. Deep learning models have demonstrated exceptional capabilities in learning shape representations directly from volumetric images, giving rise to highly effective and efficient Image-to-SSM networks. Nevertheless, these models are data-hungry and due to the limited availability of medical data, deep learning models tend to overfit. Offline data augmentation techniques, that use kernel density estimation based (KDE) methods for generating shape-augmented samples, have successfully aided Image-to-SSM networks in achieving comparable accuracy to traditional SSM methods. However, these augmentation methods focus on shape augmentation, whereas deep learning models exhibit image-based texture bias resulting in sub-optimal models. This paper introduces a novel strategy for on-the-fly data augmentation for the Image-to-SSM framework by leveraging data-dependent noise generation or texture augmentation. The proposed framework is trained as an adversary to the Image-to-SSM network, augmenting diverse and challenging noisy samples. Our approach achieves improved accuracy by encouraging the model to focus on the underlying geometry rather than relying solely on pixel values.

A Fully Automated and Explainable Algorithm for the Prediction of Malignant Transformation in Oral Epithelial Dysplasia

  • paper_url: http://arxiv.org/abs/2307.03757
  • repo_url: None
  • paper_authors: Adam J Shephard, Raja Muhammad Saad Bashir, Hanya Mahmood, Mostafa Jahanifar, Fayyaz Minhas, Shan E Ahmed Raza, Kris D McCombe, Stephanie G Craig, Jacqueline James, Jill Brooks, Paul Nankivell, Hisham Mehanna, Syed Ali Khurram, Nasir M Rajpoot
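  • for: Proposes a fully automated algorithm that predicts malignant transformation in oral epithelial dysplasia (OED) from interpretable nuclear features.
  • methods: Detects and segments nuclei within and around the epithelium with an in-house segmentation model, then feeds interpretable morphological and spatial features to a shallow neural network.
  • results: After internal cross-validation and independent validation on two external cohorts, the OMTscore yields an AUROC of 0.74 for predicting whether OED progresses to malignancy; survival analyses show its prognostic value compared with the manually assigned WHO and binary grades.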
    Abstract Oral epithelial dysplasia (OED) is a premalignant histopathological diagnosis given to lesions of the oral cavity. Its grading suffers from significant inter-/intra-observer variability, and does not reliably predict malignancy progression, potentially leading to suboptimal treatment decisions. To address this, we developed a novel artificial intelligence algorithm that can assign an Oral Malignant Transformation (OMT) risk score, based on histological patterns in Haematoxylin and Eosin stained whole slide images, to quantify the risk of OED progression. The algorithm is based on the detection and segmentation of nuclei within (and around) the epithelium using an in-house segmentation model. We then employed a shallow neural network fed with interpretable morphological/spatial features, emulating histological markers. We conducted internal cross-validation on our development cohort (Sheffield; n = 193 cases) followed by independent validation on two external cohorts (Birmingham and Belfast; n = 92 cases). The proposed OMTscore yields an AUROC = 0.74 in predicting whether an OED progresses to malignancy or not. Survival analyses showed the prognostic value of our OMTscore for predicting malignancy transformation, when compared to the manually-assigned WHO and binary grades. Analysis of the correctly predicted cases elucidated the presence of peri-epithelial and epithelium-infiltrating lymphocytes in the most predictive patches of cases that transformed (p < 0.0001). This is the first study to propose a completely automated algorithm for predicting OED transformation based on interpretable nuclear features, whilst being validated on external datasets. The algorithm shows better-than-human-level performance for prediction of OED malignant transformation and offers a promising solution to the challenges of grading OED in routine clinical practice.
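    Summary Oral epithelial dysplasia (OED) is a premalignant histopathological diagnosis given to lesions of the oral cavity. Its grading suffers from significant inter- and intra-observer variability and does not reliably predict progression to malignancy, which can lead to suboptimal treatment decisions. We developed an AI algorithm that assigns an Oral Malignant Transformation (OMT) risk score from histological patterns in Haematoxylin and Eosin stained whole slide images. It detects and segments nuclei within and around the epithelium using an in-house segmentation model, then feeds interpretable morphological and spatial features, emulating histological markers, to a shallow neural network. After internal cross-validation on the development cohort (Sheffield; n = 193 cases) and independent validation on two external cohorts (Birmingham and Belfast; n = 92 cases), the OMTscore yields an AUROC of 0.74 for predicting whether OED progresses to malignancy. Survival analyses showed its prognostic value compared with the manually assigned WHO and binary grades, and the most predictive patches of transforming cases contained peri-epithelial and epithelium-infiltrating lymphocytes (p < 0.0001). This is the first fully automated algorithm for predicting OED transformation from interpretable nuclear features that has been validated on external datasets, and it shows better-than-human-level performance in predicting OED malignant transformation.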
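    The AUROC of 0.74 quoted above can be read as the probability that a randomly chosen case that progressed to malignancy receives a higher risk score than a randomly chosen case that did not. A minimal sketch of that pairwise definition follows; the scores and labels are made-up toy values, not data from the study, and `auroc` is an illustrative helper.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy risk scores (assumed): label 1 = transformed, label 0 = did not transform.
risk_scores = [0.9, 0.8, 0.4, 0.3]
transformed = [1, 0, 1, 0]
print(auroc(risk_scores, transformed))  # 0.75
```

    The pairwise loop is quadratic in the number of cases, which is fine for cohorts of a few hundred; larger datasets would normally use a rank-based formula.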

Semantic-Aware Image Compressed Sensing

  • paper_url: http://arxiv.org/abs/2307.03246
  • repo_url: None
  • paper_authors: Bowen Zhang, Zhijin Qin, Geoffrey Ye Li
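  • for: Improving the sensing efficiency of image compressed sensing.
  • methods: A deep-learning-based image compressed sensing system in which a policy network analyzes the semantic information in images and determines the measurement matrix for different image areas.
  • results: The proposed semantic-aware image compressed sensing system outperforms traditional systems with fixed measurement matrices.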
    Abstract Deep learning based image compressed sensing (CS) has achieved great success. However, existing CS systems mainly apply a fixed measurement matrix to all images, ignoring the fact that the optimal measurement numbers and bases are different for different images. To further improve the sensing efficiency, we propose a novel semantic-aware image CS system. In our system, the encoder first uses a fixed number of base CS measurements to sense different images. According to the base CS results, the encoder then employs a policy network to analyze the semantic information in images and determines the measurement matrix for different image areas. At the decoder side, a semantic-aware initial reconstruction network is developed to deal with the changes of measurement matrices used at the encoder. A rate-distortion training loss is further introduced to dynamically adjust the average compression ratio for the semantic-aware CS system, and the policy network is trained jointly with the encoder and the decoder in an end-to-end manner by using some proxy functions. Numerical results show that the proposed semantic-aware image CS system is superior to the traditional ones with fixed measurement matrices.
    Summary Deep-learning-based image compressed sensing (CS) has achieved great success, but existing CS systems mainly apply a fixed measurement matrix to every image, ignoring that the optimal number of measurements and the optimal bases differ between images. To improve sensing efficiency, we propose a semantic-aware image CS system. The encoder first senses each image with a fixed number of base CS measurements; from these results, a policy network analyzes the semantic information in the image and determines the measurement matrix for each image area. At the decoder, a semantic-aware initial reconstruction network handles the varying measurement matrices. A rate-distortion training loss dynamically adjusts the average compression ratio, and the policy network is trained jointly with the encoder and decoder in an end-to-end manner using proxy functions. Numerical results show that the system is superior to traditional ones with fixed measurement matrices.
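    The encoder logic described above, a fixed base number of measurements for every region plus extra measurements where the content warrants them, can be sketched as follows. This is a simplified stand-in under stated assumptions: the policy network is replaced by hand-written integer importance scores, the measurement matrix is random Gaussian, and `allocate_measurements` and `measure` are hypothetical names rather than the authors' implementation.

```python
import random


def allocate_measurements(importance, base, budget):
    """Give each block `base` measurements plus a share of `budget` proportional to its integer score.

    Floor division is used, so a few measurements of the budget may stay unused.
    """
    total = sum(importance)
    return [base + budget * score // total for score in importance]


def measure(block, num_measurements, rng):
    """Compute y = Phi x with a random Gaussian Phi of shape (num_measurements, len(block))."""
    phi = [[rng.gauss(0.0, 1.0) for _ in block] for _ in range(num_measurements)]
    return [sum(p * x for p, x in zip(row, block)) for row in phi]


rng = random.Random(0)
blocks = [[rng.random() for _ in range(16)] for _ in range(3)]  # three flattened 4x4 blocks
scores = [1, 6, 3]  # assumed importance scores, standing in for the policy network output

counts = allocate_measurements(scores, base=2, budget=10)
measurements = [measure(block, m, rng) for block, m in zip(blocks, counts)]
print(counts)                          # [3, 8, 5]
print([len(y) for y in measurements])  # [3, 8, 5]
```

    The block with the highest score receives the most measurements, so its compression ratio is the lowest, while the average ratio is controlled by `base` and `budget`.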

cs.SD - 2023-07-06

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

  • paper_url: http://arxiv.org/abs/2307.02909
  • repo_url: None
  • paper_authors: Guinan Li, Jiajun Deng, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Mingyu Cui, Helen Meng, Xunying Liu
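  • for: Improving recognition accuracy for cocktail party speech with overlapping speakers, noise, and reverberation.
  • methods: Uses video input for multi-channel speech separation, dereverberation, and recognition, with visual information fully incorporated into both the front-end and the back-end.
  • results: Word error rate (WER) reductions of 9.1% and 6.2% absolute (41.7% and 36.0% relative) over the comparable audio-only baseline, together with improved PESQ, STOI, and SRMR scores.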
    Abstract Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all system components is proposed in this paper. The efficacy of the video input is consistently demonstrated in mask-based MVDR speech separation, DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end and Conformer ASR back-end. Audio-visual integrated front-end architectures performing speech separation and dereverberation in a pipelined or joint fashion via mask-based WPD are investigated. The error cost mismatch between the speech enhancement front-end and ASR back-end components is minimized by end-to-end jointly fine-tuning using either the ASR cost function alone, or its interpolation with the speech enhancement loss. Experiments were conducted on the mixture overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel speech separation, dereverberation and recognition systems consistently outperformed the comparable audio-only baseline by 9.1% and 6.2% absolute (41.7% and 36.0% relative) word error rate (WER) reductions. Consistent speech enhancement improvements were also obtained on PESQ, STOI and SRMR scores.
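    Summary Accurately recognizing cocktail party speech that contains overlapping speakers, noise, and reverberation remains highly challenging. Motivated by the invariance of the visual modality to acoustic signal corruption, this paper proposes an audio-visual multi-channel speech separation, dereverberation, and recognition approach that incorporates visual information into all system components. Video input consistently helps in mask-based MVDR speech separation, in the DNN-WPE or spectral mapping (SpecM) based dereverberation front-end, and in the Conformer ASR back-end. Audio-visual integrated front-ends that perform separation and dereverberation in a pipelined or joint fashion via mask-based WPD are also investigated. The error cost mismatch between the speech enhancement front-end and the ASR back-end is minimized by end-to-end joint fine-tuning, using either the ASR cost function alone or its interpolation with the speech enhancement loss. On overlapped and reverberant speech constructed by simulation or replay of the Oxford LRS2 dataset, the proposed systems outperform the comparable audio-only baseline by 9.1% and 6.2% absolute (41.7% and 36.0% relative) word error rate (WER) reductions, with consistent improvements in PESQ, STOI, and SRMR scores.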
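    The abstract reports each WER gain both in absolute and in relative terms, and the two together imply the baseline: relative reduction = absolute reduction / baseline WER. The short calculation below recovers the implied audio-only and audio-visual WERs. These are derived, rounded values that the abstract itself does not state, and `implied_baseline_wer` is an illustrative helper.

```python
def implied_baseline_wer(absolute_reduction, relative_reduction_pct):
    """Baseline WER (in %) implied by an absolute reduction and the matching relative reduction."""
    return absolute_reduction / (relative_reduction_pct / 100.0)


# (absolute reduction in WER points, relative reduction in %), as quoted in the abstract.
reported = [(9.1, 41.7), (6.2, 36.0)]

for absolute, relative in reported:
    baseline = implied_baseline_wer(absolute, relative)
    print(f"audio-only about {baseline:.1f}% WER, "
          f"audio-visual about {baseline - absolute:.1f}% WER")
```

    This gives roughly 21.8% and 17.2% WER for the two audio-only baselines, and roughly 12.7% and 11.0% for the corresponding audio-visual systems.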

The Relationship Between Speech Features Changes When You Get Depressed: Feature Correlations for Improving Speed and Performance of Depression Detection

  • paper_url: http://arxiv.org/abs/2307.02892
  • repo_url: None
  • paper_authors: Fuxiang Tao, Wei Ma, Xuri Ge, Anna Esposito, Alessandro Vinciarelli
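  • for: Shows that depression changes the correlations between features extracted from speech.
  • methods: SVM- and LSTM-based depression detectors evaluated on the Androids Corpus, a publicly available dataset of 112 speakers, 58 of whom were diagnosed with depression by professional psychiatrists.
  • results: Feeding the models feature correlation matrices rather than feature vectors improves training speed and performance, with a relative error rate reduction of 23.1% to 26.6% depending on the model; the probable explanation is that feature correlation matrices are more variable for depressed speakers.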
    Abstract This work shows that depression changes the correlation between features extracted from speech. Furthermore, it shows that using such an insight can improve the training speed and performance of depression detectors based on SVMs and LSTMs. The experiments were performed over the Androids Corpus, a publicly available dataset involving 112 speakers, including 58 people diagnosed with depression by professional psychiatrists. The results show that the models used in the experiments improve in terms of training speed and performance when fed with feature correlation matrices rather than with feature vectors. The relative reduction of the error rate ranges between 23.1% and 26.6% depending on the model. The probable explanation is that feature correlation matrices appear to be more variable in the case of depressed speakers. Correspondingly, such a phenomenon can be thought of as a depression marker.
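    The change of representation described above, from feature vectors to feature correlation matrices, can be sketched as follows. The example assumes a recording is a sequence of frame-level feature vectors and uses the Pearson correlation; the abstract does not specify the features or the correlation measure, and the toy frames are made up.

```python
import math


def correlation_matrix(frames):
    """Pearson correlation between every pair of features, computed across frames.

    Assumes every feature varies across frames (a constant feature would divide by zero).
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n for d in range(dims)]
    centered = [[frame[d] - means[d] for d in range(dims)] for frame in frames]
    norms = [math.sqrt(sum(row[d] ** 2 for row in centered)) for d in range(dims)]
    return [
        [
            sum(row[i] * row[j] for row in centered) / (norms[i] * norms[j])
            for j in range(dims)
        ]
        for i in range(dims)
    ]


# Toy recording (assumed data): 3 frames with 3 features each.
frames = [(1.0, 2.0, 3.0), (2.0, 4.0, 1.0), (3.0, 6.0, 2.0)]
corr = correlation_matrix(frames)
for row in corr:
    print([round(value, 3) for value in row])
```

    The matrix has a fixed size of features by features regardless of how many frames the recording contains, so one matrix can summarize a whole recording instead of a long sequence of vectors.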

Evaluating raw waveforms with deep learning frameworks for speech emotion recognition

  • paper_url: http://arxiv.org/abs/2307.02820
  • repo_url: None
  • paper_authors: Zeynep Hilal Kilimci, Ulku Bayraktar, Ayhan Kucukmanisa
  • for: Speech emotion recognition with deep learning, specifically the contribution of feeding raw audio files directly into deep neural networks without any feature extraction stage.
  • methods: Feeds raw audio directly into deep neural networks. For comparison, mel-scale spectrogram and mel-frequency cepstral coefficient features are combined with machine learning algorithms (support vector machine, decision tree, naive Bayes, random forests), ensemble methods (majority voting, stacking), and deep models (convolutional neural networks, long short-term memory networks, a hybrid CNN-LSTM).
  • results: On raw audio, the CNN reaches 95.86% accuracy on TESS+RAVDESS, which the authors report as a new state of the art. Other accuracies are 90.34% on EMO-DB (CNN), 90.42% on RAVDESS (CNN), 99.48% on TESS (LSTM), 69.72% on CREMA (CNN), and 85.76% on SAVEE (CNN).
    Abstract Speech emotion recognition is a challenging task in the speech processing field. For this reason, the feature extraction process is crucial for representing and processing speech signals. In this work, we present a model that feeds raw audio files directly into deep neural networks, without any feature extraction stage, for the recognition of emotions, utilizing six different data sets: EMO-DB, RAVDESS, TESS, CREMA, SAVEE, and TESS+RAVDESS. To demonstrate the contribution of the proposed model, traditional feature extraction techniques, namely mel-scale spectrogram and mel-frequency cepstral coefficients, are blended with machine learning algorithms, ensemble learning methods, and deep and hybrid deep learning techniques. Support vector machine, decision tree, naive Bayes, and random forests models are evaluated as machine learning algorithms, while majority voting and stacking methods are assessed as ensemble learning techniques. Moreover, convolutional neural networks, long short-term memory networks, and a hybrid CNN-LSTM model are evaluated as deep learning techniques and compared with machine learning and ensemble learning methods. To demonstrate the effectiveness of the proposed model, a comparison with state-of-the-art studies is carried out. Based on the experiment results, the CNN model outperforms existing approaches with 95.86% accuracy on the TESS+RAVDESS data set using raw audio files, thereby setting a new state of the art. The proposed model achieves 90.34% accuracy for EMO-DB with the CNN model, 90.42% for RAVDESS with the CNN model, 99.48% for TESS with the LSTM model, 69.72% for CREMA with the CNN model, and 85.76% for SAVEE with the CNN model in speaker-independent audio categorization problems.
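    Summary Speech emotion recognition is a challenging task in speech processing, which makes feature extraction crucial. This work presents a model that feeds raw audio files directly into deep neural networks, with no feature extraction stage, and evaluates it on six data sets: EMO-DB, RAVDESS, TESS, CREMA, SAVEE, and TESS+RAVDESS. For comparison, traditional features (mel-scale spectrogram, mel-frequency cepstral coefficients) are combined with machine learning algorithms (support vector machine, decision tree, naive Bayes, random forests), ensemble methods (majority voting, stacking), and deep learning techniques (convolutional neural networks, long short-term memory networks, a hybrid CNN-LSTM). On raw audio, the CNN reaches 95.86% accuracy on TESS+RAVDESS, surpassing existing approaches and setting a new state of the art. The model also reaches 90.34% on EMO-DB (CNN), 90.42% on RAVDESS (CNN), 99.48% on TESS (LSTM), 69.72% on CREMA (CNN), and 85.76% on SAVEE (CNN) in speaker-independent audio categorization.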
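    Majority voting, one of the ensemble methods evaluated above, simply returns the emotion label predicted by most base classifiers. A minimal sketch with made-up predictions follows; the classifier names and labels are placeholders, and ties are resolved in favor of the label seen first, which is an arbitrary choice.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; on a tie, the label that appeared first wins."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical per-classifier predictions for one utterance.
votes = {"svm": "angry", "decision_tree": "sad", "naive_bayes": "angry"}
print(majority_vote(list(votes.values())))  # angry
```

    Stacking, the other ensemble method named in the abstract, would instead train a second-level model on the base classifiers' outputs.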

DSARSR: Deep Stacked Auto-encoders Enhanced Robust Speaker Recognition

  • paper_url: http://arxiv.org/abs/2307.02751
  • repo_url: None
  • paper_authors: Zhifeng Wang, Chunyan Zeng, Surong Duan, Hongjie Ouyang, Hongmin Xu
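  • for: Improving the robustness of the i-vector framework under cross-channel conditions and exploring deep learning for speaker recognition.
  • methods: Uses stacked auto-encoders, instead of PLDA, to obtain an abstract representation of the i-vector; after pre-processing and feature extraction, speaker- and channel-independent speech is used for UBM training.
  • results: Experiments show that the proposed method performs better than the state-of-the-art method.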
    Abstract Speaker recognition is a biometric modality that uses a speaker's speech segments to recognize identity, determining whether the test speaker belongs to one of the enrolled speakers. In order to improve the robustness of the i-vector framework under cross-channel conditions and to explore a novel method for applying deep learning to speaker recognition, stacked auto-encoders are used to obtain an abstract representation of the i-vector instead of applying PLDA. After pre-processing and feature extraction, speaker- and channel-independent speech is employed for UBM training. The UBM is then used to extract the i-vectors of the enrollment and test speech. Unlike the traditional i-vector framework, which uses linear discriminant analysis (LDA) to reduce dimension and increase the discrimination between speaker subspaces, this research uses stacked auto-encoders to reconstruct the i-vector with lower dimension, and different classifiers can be chosen to achieve the final classification. The experimental results show that the proposed method achieves better performance than the state-of-the-art method.
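
    The dimension-reduction step described above can be sketched as the forward pass of the encoder half of a stacked auto-encoder. This is a minimal illustration under stated assumptions: the layer sizes (400, 200, 100), the tanh activation, and the random untrained weights are all hypothetical, and the layer-wise reconstruction training that a real stacked auto-encoder needs is omitted.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for one encoder layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def encode(vector, layers):
    """Pass a vector through the stacked encoder layers (tanh activations)."""
    hidden = vector
    for weights, biases in layers:
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, hidden)) + b)
            for row, b in zip(weights, biases)
        ]
    return hidden


rng = random.Random(0)

# Hypothetical sizes: a 400-dim i-vector compressed to 200, then to 100 dims.
layer_sizes = [400, 200, 100]
layers = [make_layer(a, b, rng) for a, b in zip(layer_sizes, layer_sizes[1:])]

# Stand-in for a real i-vector extracted with the UBM.
ivector = [rng.gauss(0.0, 1.0) for _ in range(layer_sizes[0])]

code = encode(ivector, layers)
print(len(code))  # 100
```

    The low-dimensional `code` is what a downstream classifier of choice would consume in place of the LDA-projected i-vector.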

On-Device Constrained Self-Supervised Speech Representation Learning for Keyword Spotting via Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2307.02720
  • repo_url: None
  • paper_authors: Gene-Ping Yang, Yue Gu, Qingming Tang, Dongsu Du, Yuzong Liu
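  • for: This paper aims to improve keyword spotting accuracy on an Alexa keyword spotting task, particularly under on-device resource constraints and biased dataset collection.
  • methods: Knowledge distillation transfers knowledge from a large self-supervised model to a smaller, lightweight model, using dual-view cross-correlation distillation and the teacher's codebook as learning objectives.
  • results: On an Alexa keyword spotting detection task, the S3RL architecture performs strongly in both normal and noisy conditions, showing that knowledge distillation is effective for building self-supervised keyword spotting models that operate within on-device resource constraints.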
    Abstract Large self-supervised models are effective feature extractors, but their application is challenging under on-device budget constraints and biased dataset collection, especially in keyword spotting. To address this, we proposed a knowledge distillation-based self-supervised speech representation learning (S3RL) architecture for on-device keyword spotting. Our approach used a teacher-student framework to transfer knowledge from a larger, more complex model to a smaller, light-weight model using dual-view cross-correlation distillation and the teacher's codebook as learning objectives. We evaluated our model's performance on an Alexa keyword spotting detection task using a 16.6k-hour in-house dataset. Our technique showed exceptional performance in normal and noisy conditions, demonstrating the efficacy of knowledge distillation methods in constructing self-supervised models for keyword spotting tasks while working within on-device resource constraints.
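    Summary Large self-supervised models are effective feature extractors, but they are hard to apply under on-device memory limits and biased data collection, especially for keyword spotting. To address this, the authors propose a knowledge-distillation-based self-supervised speech representation learning (S3RL) architecture for on-device keyword spotting. A teacher-student framework transfers knowledge from a large, complex model to a small, lightweight one, with dual-view cross-correlation distillation and the teacher's codebook as learning objectives. The model is evaluated on an Alexa keyword spotting detection task using a 16.6k-hour in-house dataset. It performs strongly in both normal and noisy conditions, which demonstrates the effectiveness of knowledge distillation for building self-supervised keyword spotting models.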
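
    The abstract does not give the exact form of the dual-view cross-correlation distillation objective. The sketch below therefore shows only a generic, single-view cross-correlation loss between teacher and student embeddings, in the style of Barlow Twins. It assumes that both embeddings have the same dimensionality; the function names and the `off_diag_weight` value are illustrative choices, not the paper's.

```python
import math


def standardize(batch):
    """Standardize each feature dimension across the batch (zero mean, unit variance)."""
    n, dims = len(batch), len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        col = [row[d] for row in batch]
        mean = sum(col) / n
        # Fall back to 1.0 for constant features to avoid division by zero
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
        for i in range(n):
            out[i][d] = (col[i] - mean) / std
    return out


def cross_correlation_loss(teacher, student, off_diag_weight=0.005):
    """Pull the teacher-student cross-correlation matrix towards the identity."""
    t, s = standardize(teacher), standardize(student)
    n, dims = len(t), len(t[0])
    loss = 0.0
    for i in range(dims):
        for j in range(dims):
            c_ij = sum(t[b][i] * s[b][j] for b in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2  # matching dimensions should correlate
            else:
                loss += off_diag_weight * c_ij ** 2  # other pairs should decorrelate
    return loss


# Hypothetical batch of 4 embeddings with 2 uncorrelated dimensions
teacher_emb = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
aligned_student = [row[:] for row in teacher_emb]
flipped_student = [[-v for v in row] for row in teacher_emb]

print(cross_correlation_loss(teacher_emb, aligned_student))
print(cross_correlation_loss(teacher_emb, flipped_student))
```

    A student whose embeddings track the teacher's gets a loss near zero, while an anti-correlated student is penalized. In practice the loss would be computed on tensors and minimized with respect to the student's parameters.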

eess.AS - 2023-07-06

Read, Look or Listen? What’s Needed for Solving a Multimodal Dataset

  • paper_url: http://arxiv.org/abs/2307.04532
  • repo_url: None
  • paper_authors: Netta Madvil, Yonatan Bitton, Roy Schwartz
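  • for: This study examines how to assess the quality of large-scale multimodal datasets.
  • methods: A two-step method that uses a small seed of human annotation to map each multimodal instance to the modalities required to process it.
  • results: Applied to the TVQA video question-answering dataset, the method shows that most questions can be answered using a single modality, without a substantial bias towards any specific one. More than 70% of the questions are solvable using several different single-modality strategies, e.g., by only watching the video or only listening to the audio. MERLOT Reserve struggles with image-based questions compared to text and audio. Based on these observations, the authors introduce a new test set that requires multiple modalities, on which model performance drops dramatically.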
    Abstract The prevalence of large-scale multimodal datasets presents unique challenges in assessing dataset quality. We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it. Our method sheds light on the importance of different modalities in datasets, as well as the relationship between them. We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality. Moreover, we find that more than 70% of the questions are solvable using several different single-modality strategies, e.g., by either looking at the video or listening to the audio, highlighting the limited integration of multiple modalities in TVQA. We leverage our annotation and analyze the MERLOT Reserve, finding that it struggles with image-based questions compared to text and audio, but also with auditory speaker identification. Based on our observations, we introduce a new test set that necessitates multiple modalities, observing a dramatic drop in model performance. Our methodology provides valuable insights into multimodal datasets and highlights the need for the development of more robust models.
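    Summary Large-scale multimodal datasets pose unique challenges for assessing dataset quality. The authors propose a two-step method that uses a small amount of human annotation to map each multimodal instance to the modalities needed to process it. The method reveals the importance of the different modalities in a dataset and the relationship between them. Applied to the TVQA video question-answering dataset, it shows that most questions can be solved with a single modality, with no strong bias towards any one of them. More than 70% of the questions can be solved with several different single-modality strategies, for example from the video alone or from the audio alone, which highlights the limited integration of multiple modalities in TVQA. Using the annotation to analyze MERLOT Reserve, the authors find that it does worse on image-based questions than on text and audio. Based on these observations, they introduce a new test set that requires multiple modalities and observe a dramatic drop in model performance. The methodology offers valuable insight into multimodal datasets and highlights the need for more robust models.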
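
    The kind of statistic reported above (the share of questions answerable from one modality, and the share answerable through several alternative single-modality routes) can be computed from per-question annotations as sketched below. The annotation format, the modality names, and the four example questions are hypothetical; the paper's actual annotation scheme is not described in the abstract.

```python
def modality_stats(annotations):
    """Summarize which questions are solvable from a single modality.

    annotations: dict mapping question id -> set of modalities that are each
    sufficient on their own. An empty set means modalities must be combined.
    """
    total = len(annotations)
    single = sum(1 for mods in annotations.values() if len(mods) >= 1)
    several = sum(1 for mods in annotations.values() if len(mods) >= 2)
    return {
        "single_modality": single / total,
        "several_single_strategies": several / total,
    }


# Hypothetical annotations for four questions
annotations = {
    "q1": {"video", "subtitles"},
    "q2": {"audio"},
    "q3": set(),  # needs several modalities together
    "q4": {"video", "audio", "subtitles"},
}

stats = modality_stats(annotations)
print(stats)
```

    Questions such as `q3`, with no sufficient single modality, are the kind a multimodality-requiring test set would be built from.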

Deep Speech Synthesis from MRI-Based Articulatory Representations

  • paper_url: http://arxiv.org/abs/2307.02471
  • repo_url: https://github.com/articulatory/articulatory
  • paper_authors: Peter Wu, Tingle Li, Yijing Lu, Yubin Zhang, Jiachen Lian, Alan W Black, Louis Goldstein, Shinji Watanabe, Gopala K. Anumanchipalli
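  • for: This study aims to develop a speech synthesis method based on human vocal tract information, as a route to efficient, generalizable, and interpretable synthesizers.
  • methods: An MRI-based feature set captures a more extensive articulatory space than EMA. Normalization and denoising procedures improve the generalizability of deep learning models trained on MRI data, and an MRI-to-speech model improves computational efficiency and speech fidelity.
  • results: Through a series of ablations, the authors show that the MRI representation is more comprehensive than EMA and identify the most suitable MRI feature subset for articulatory synthesis.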
    Abstract In this paper, we study articulatory synthesis, a speech synthesis method using human vocal tract information that offers a way to develop efficient, generalizable and interpretable synthesizers. While recent advances have enabled intelligible articulatory synthesis using electromagnetic articulography (EMA), these methods lack critical articulatory information like excitation and nasality, limiting generalization capabilities. To bridge this gap, we propose an alternative MRI-based feature set that covers a much more extensive articulatory space than EMA. We also introduce normalization and denoising procedures to enhance the generalizability of deep learning methods trained on MRI data. Moreover, we propose an MRI-to-speech model that improves both computational efficiency and speech fidelity. Finally, through a series of ablations, we show that the proposed MRI representation is more comprehensive than EMA and identify the most suitable MRI feature subset for articulatory synthesis.
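    Summary This paper studies articulatory synthesis, a speech synthesis method that uses human vocal tract information to build efficient, generalizable, and interpretable synthesizers. Recent advances have made intelligible articulatory synthesis possible with EMA, but these methods lack key articulatory information such as excitation and nasality, which limits their generalization. To close this gap, the authors propose an MRI-based feature set that covers a far larger articulatory space than EMA. They also introduce normalization and denoising procedures that improve the generalizability of deep learning methods trained on MRI data, and an MRI-to-speech model that improves both computational efficiency and speech fidelity. Finally, a series of ablation experiments shows that the MRI representation is more comprehensive than EMA and identifies the most suitable MRI feature subset for articulatory synthesis.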
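
    The abstract does not specify the normalization and denoising procedures. As a rough illustration of what such preprocessing can look like for a one-dimensional articulatory feature track, the sketch below uses two generic stand-ins: a moving-average smoother and per-track z-score normalization. Both choices, the function names, and the sample values are assumptions and should not be read as the authors' method.

```python
import math


def moving_average(track, window=3):
    """Smooth a 1-D feature track with a centered moving average (shrinks at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(track)):
        lo, hi = max(0, i - half), min(len(track), i + half + 1)
        smoothed.append(sum(track[lo:hi]) / (hi - lo))
    return smoothed


def zscore(track):
    """Normalize a track to zero mean and unit variance."""
    mean = sum(track) / len(track)
    # Fall back to 1.0 for a constant track to avoid division by zero
    std = math.sqrt(sum((v - mean) ** 2 for v in track) / len(track)) or 1.0
    return [(v - mean) / std for v in track]


# Hypothetical position track of one articulator point, with a noise spike at index 2
raw_track = [10.0, 10.5, 14.0, 10.8, 11.0, 11.2]

clean_track = zscore(moving_average(raw_track))
print(clean_track)
```

    Normalizing each track this way removes offsets and scale differences between recordings, which is one common way to make features more comparable before model training.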