results: The study experimentally observed single-particle dynamics during centrifugation and provided real-time data plots; this is the first observation of such phenomena. Theoretical simulations in a rotating reference frame were developed and agree well with the experimental results. In addition, the study achieved the first visualization of the blood sedimentation process.
Abstract
Centrifuges serve as essential instruments in modern experimental sciences, facilitating a wide range of routine sample processing tasks that necessitate material sedimentation. However, real-time observation of the dynamical processes during centrifugation has remained elusive. In this study, we developed an innovative Lab-in-a-Tube (LIAT) imaging spectrophotometer that incorporates capabilities of real-time image analysis and programmable interruption. This portable LIAT device costs less than 30 US dollars. To the best of our knowledge, it is the first Wi-Fi camera built into common lab centrifuges with active closed-loop control. We tested our LIAT imaging spectrophotometer on solute-solvent interaction investigations in lab centrifuges, with quantitative data plotted in real time. A single recirculating flow was observed in real time, forming a ring-shaped pattern during centrifugation. To the best of our knowledge, this is the very first observation of such phenomena. We developed theoretical simulations for a single particle in a rotating reference frame, which correlated well with experimental results. We also present the first visualization of the blood sedimentation process in clinical lab centrifuges. This remarkable cost-effectiveness opens up exciting opportunities for centrifugation microbiology research and paves the way for a network of computational imaging spectrometers at an affordable price for large-scale and continuous monitoring of centrifugal processes in general.
PressureTransferNet: Human Attribute Guided Dynamic Ground Pressure Profile Transfer using 3D simulated Pressure Maps
results: A sensor simulation was used to create a diverse dataset of human attributes and pressure profiles, and evaluation on a real-world dataset showed accurate transfer of human attributes to ground pressure profiles. Validation with a physics-based deep learning model yielded a binary R-square value of 0.79, and classification on physical pressure mat data achieved an F1 score of 0.911$\pm$0.015, confirming the correctness of the generated pressure maps.
Abstract
We propose PressureTransferNet, a novel method for Human Activity Recognition (HAR) using ground pressure information. Our approach generates body-specific dynamic ground pressure profiles for specific activities by leveraging existing pressure data from different individuals. PressureTransferNet is an encoder-decoder model taking a source pressure map and a target human attribute vector as inputs, producing a new pressure map reflecting the target attribute. To train the model, we use a sensor simulation to create a diverse dataset with various human attributes and pressure profiles. Evaluation on a real-world dataset shows its effectiveness in accurately transferring human attributes to ground pressure profiles across different scenarios. We visually confirm the fidelity of the synthesized pressure shapes using a physics-based deep learning model and achieve a binary R-square value of 0.79 on areas with ground contact. Validation through classification with F1 score (0.911$\pm$0.015) on physical pressure mat data demonstrates the correctness of the synthesized pressure maps, making our method valuable for data augmentation, denoising, sensor simulation, and anomaly detection. Applications span sports science, rehabilitation, and bio-mechanics, contributing to the development of HAR systems.
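The abstract describes PressureTransferNet only at the level of an encoder-decoder that takes a source pressure map and a target attribute vector. Below is a minimal PyTorch sketch of that conditioning pattern; the layer sizes, the 4-dimensional attribute vector, and the concatenation point are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PressureTransferSketch(nn.Module):
    """Encoder-decoder conditioning a pressure map on a target attribute vector.
    All dimensions are illustrative; the paper does not publish this architecture."""
    def __init__(self, attr_dim: int = 4, feat: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(),
        )
        # The attribute vector (e.g., height, weight) is broadcast over the
        # spatial grid and concatenated with the encoded features.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat + attr_dim, feat, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(feat, 1, 4, stride=2, padding=1),
        )

    def forward(self, pressure: torch.Tensor, attrs: torch.Tensor) -> torch.Tensor:
        z = self.encoder(pressure)                      # (B, feat, H/4, W/4)
        a = attrs[:, :, None, None].expand(-1, -1, z.shape[2], z.shape[3])
        return self.decoder(torch.cat([z, a], dim=1))   # (B, 1, H, W)

# Usage: transfer a source map to a new target attribute vector.
model = PressureTransferSketch()
out = model(torch.rand(2, 1, 64, 32), torch.rand(2, 4))
```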
Visual attention information can be traced on cortical response but not on the retina: evidence from electrophysiological mouse data using natural images as stimuli
results: In primary visual cortex (V1), a subset of around 10% of the neurons responds differently to salient versus non-salient visual regions. Visual attention information was not traced in the retinal response; the retina appears naive concerning visual attention.
Abstract
Visual attention forms the basis of understanding the visual world. In this work we follow a computational approach to investigate the biological basis of visual attention. We analyze retinal and cortical electrophysiological data from mice. The visual stimuli are natural images depicting real-world scenes. Our results show that in primary visual cortex (V1), a subset of around $10\%$ of the neurons responds differently to salient versus non-salient visual regions. Visual attention information was not traced in the retinal response. It appears that the retina remains naive concerning visual attention; the cortical response gets modulated to interpret visual attention information. Experimental animal studies may be designed to further explore the biological basis of visual attention that we traced in this study. In applied and translational science, our study contributes to the design of improved visual prosthesis systems -- systems that create artificial visual percepts for visually impaired individuals by electronic implants placed on either the retina or the cortex.
Robust Spatiotemporal Fusion of Satellite Images: A Constrained Convex Optimization Approach
results: Our experimental results show that ROSTF performs comparably to other state-of-the-art ST fusion methods in noiseless cases and outperforms them in noisy cases.
Abstract
This paper proposes a novel spatiotemporal (ST) fusion framework for satellite images, named Robust Optimization-based Spatiotemporal Fusion (ROSTF). ST fusion is a promising approach to resolve a trade-off between the temporal and spatial resolution of satellite images. Although many ST fusion methods have been proposed, most of them are not designed to explicitly account for noise in observed images, despite the inevitable influence of noise caused by the measurement equipment and environment. Our ROSTF addresses this challenge by treating the noise removal of the observed images and the estimation of the target high-resolution image as a single optimization problem. Specifically, first, we define observation models for satellite images possibly contaminated with random noise, outliers, and/or missing values, and then introduce certain assumptions that would naturally hold between the observed images and the target high-resolution image. Then, based on these models and assumptions, we formulate the fusion problem as a constrained optimization problem and develop an efficient algorithm based on a preconditioned primal-dual splitting method for solving the problem. The performance of ROSTF was verified using simulated and real data. The results show that ROSTF performs comparably to several state-of-the-art ST fusion methods in noiseless cases and outperforms them in noisy cases.
An L2-Normalized Spatial Attention Network For Accurate And Fast Classification Of Brain Tumors In 2D T1-Weighted CE-MRI Images
results: Our model outperforms the existing state-of-the-art methods on this dataset, gaining 1.79 percentage points in accuracy. Combining our model in an ensemble with a pretrained VGG16 attains even higher accuracy at the cost of execution speed. Our code is available at https://github.com/juliadietlmeier/MRI_image_classification
Abstract
We propose an accurate and fast network for the classification of brain tumors in MRI images that outperforms all lightweight methods investigated in terms of accuracy. We test our model on a challenging 2D T1-weighted CE-MRI dataset containing three types of brain tumors: Meningioma, Glioma and Pituitary. We introduce an l2-normalized spatial attention mechanism that acts as a regularizer against overfitting during training. We compare our results against the state-of-the-art on this dataset and show that by integrating l2-normalized spatial attention into a baseline network we achieve a performance gain of 1.79 percentage points. Even better accuracy can be attained by combining our model in an ensemble with the pretrained VGG16 at the expense of execution speed. Our code is publicly available at https://github.com/juliadietlmeier/MRI_image_classification
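As a rough illustration of how an l2-normalized spatial attention layer might look, here is a hedged PyTorch sketch: a 1x1 convolution scores each spatial position, the score map is l2-normalized over all positions, and the features are reweighted. This is one plausible reading; the authors' actual layer (see their repository) may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L2NormSpatialAttention(nn.Module):
    """Spatial attention whose map is l2-normalized over spatial positions.
    A plausible sketch, not necessarily the authors' exact layer."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # per-pixel attention logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s = self.score(x).view(b, 1, h * w)
        s = F.normalize(s, p=2, dim=-1)     # l2 norm over all spatial positions
        return x * s.view(b, 1, h, w)       # reweight features spatially

feats = torch.rand(8, 256, 14, 14)
attn = L2NormSpatialAttention(256)
print(attn(feats).shape)  # torch.Size([8, 256, 14, 14])
```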
A Deep Learning Approach for Virtual Contrast Enhancement in Contrast Enhanced Spectral Mammography
results: The results show that CycleGAN is the most promising deep network for generating high-quality synthetic recombined images. They also suggest that virtual contrast enhancement could make CESM contrast-free while reducing the radiation dose.
Abstract
Contrast Enhanced Spectral Mammography (CESM) is a dual-energy mammographic imaging technique that first requires intravenous administration of an iodinated contrast medium; then, it collects both a low-energy image, comparable to standard mammography, and a high-energy image. The two scans are then combined to get a recombined image showing contrast enhancement. Despite CESM's diagnostic advantages for breast cancer diagnosis, the use of contrast medium can cause side effects, and CESM also exposes patients to a higher radiation dose compared to standard mammography. To address these limitations, this work proposes to use deep generative models for virtual contrast enhancement on CESM, aiming to make the CESM contrast-free as well as to reduce the radiation dose. Our deep networks, consisting of an autoencoder and two Generative Adversarial Networks, the Pix2Pix, and the CycleGAN, generate synthetic recombined images solely from low-energy images. We perform an extensive quantitative and qualitative analysis of the model's performance, also exploiting radiologists' assessments, on a novel CESM dataset that includes 1138 images that, as a further contribution of this work, we make publicly available. The results show that CycleGAN is the most promising deep network to generate synthetic recombined images, highlighting the potential of artificial intelligence techniques for virtual contrast enhancement in this field.
Space Debris: Are Deep Learning-based Image Enhancements part of the Solution?
results: Based on visual inspection, the UNet model is capable of correcting degradations in images captured in space, and a comparison against existing deep-learning image enhancement methods is presented.
Abstract
The volume of space debris currently orbiting the Earth is reaching an unsustainable level at an accelerated pace. The detection, tracking, identification, and differentiation between orbit-defined, registered spacecraft, and rogue/inactive space ``objects'', is critical to asset protection. The primary objective of this work is to investigate the validity of Deep Neural Network (DNN) solutions to overcome the limitations and image artefacts most prevalent when captured with monocular cameras in the visible light spectrum. In this work, a hybrid UNet-ResNet34 Deep Learning (DL) architecture pre-trained on the ImageNet dataset, is developed. Image degradations addressed include blurring, exposure issues, poor contrast, and noise. The shortage of space-generated data suitable for supervised DL is also addressed. A visual comparison between the URes34P model developed in this work and the existing state of the art in deep learning image enhancement methods, relevant to images captured in space, is presented. Based upon visual inspection, it is determined that our UNet model is capable of correcting for space-related image degradations and merits further investigation to reduce its computational complexity.
Metrics to Quantify Global Consistency in Synthetic Medical Images
results: The results show that these methods can classify the global consistency of images, and that they remain serviceable even when labeled data is unavailable. In contrast, established metrics such as the FID cannot directly assess the global consistency of individual images.
Abstract
Image synthesis is increasingly being adopted in medical image processing, for example for data augmentation or inter-modality image translation. In these critical applications, the generated images must fulfill a high standard of biological correctness. A particular requirement for these images is global consistency, i.e., an image being overall coherent and structured so that all parts of the image fit together in a realistic and meaningful way. Yet, established image quality metrics do not explicitly quantify this property of synthetic images. In this work, we introduce two metrics that can measure the global consistency of synthetic images on a per-image basis. To measure the global consistency, we presume that a realistic image exhibits consistent properties, e.g., a person's body fat in a whole-body MRI, throughout the depicted object or scene. Hence, we quantify global consistency by predicting and comparing explicit attributes of images on patches using supervised neural networks. Next, we adapt this strategy to an unlabeled setting by measuring the similarity of implicit image features predicted by a self-supervised network. Our results demonstrate that predicting explicit attributes of synthetic images on patches can distinguish globally consistent from inconsistent images. Implicit representations of images are less sensitive for assessing global consistency but are still serviceable when labeled data is unavailable. Compared to established metrics, such as the FID, our method can explicitly measure global consistency on a per-image basis, enabling a dedicated analysis of the biological plausibility of single synthetic images.
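A schematic sketch of the two per-image consistency scores described above. Here attribute_predictor and ssl_encoder are hypothetical stand-ins for the supervised attribute network and the self-supervised feature extractor; the exact aggregation used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def explicit_consistency(patches, attribute_predictor):
    """Explicit variant: predict a scalar attribute (e.g., body fat) on each
    patch and score consistency as low spread across the image.
    attribute_predictor is a hypothetical supervised network."""
    preds = torch.stack([attribute_predictor(p) for p in patches])
    return -preds.std().item()  # higher = more globally consistent

def implicit_consistency(patches, ssl_encoder):
    """Implicit variant for the unlabeled setting: mean pairwise cosine
    similarity of self-supervised patch embeddings (ssl_encoder is assumed)."""
    z = F.normalize(torch.stack([ssl_encoder(p) for p in patches]), dim=-1)
    sim = z @ z.T                              # pairwise cosine similarities
    n = sim.shape[0]
    return ((sim.sum() - n) / (n * (n - 1))).item()  # mean off-diagonal entry
```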
Fundus-Enhanced Disease-Aware Distillation Model for Retinal Disease Classification from OCT Images
paper_authors: Lehan Wang, Weihang Dai, Mei Jin, Chubin Ou, Xiaomeng Li
for: This paper proposes a novel method for retinal disease classification from OCT images, which can be used for ophthalmic examination.
methods: The proposed method uses a fundus-enhanced disease-aware distillation model (FDDM) that utilizes unpaired fundus images during training and does not require the use of fundus images during testing. The method enhances the OCT model by distilling disease-related information from the fundus model and aligning class similarity between both modalities.
results: The proposed approach outperforms single-modal, multi-modal, and state-of-the-art distillation methods for retinal disease classification, demonstrating its effectiveness and practicality for clinical use.
Abstract
Optical Coherence Tomography (OCT) is a novel and effective screening tool for ophthalmic examination. Since collecting OCT images is relatively more expensive than fundus photographs, existing methods use multi-modal learning to complement limited OCT data with additional context from fundus images. However, the multi-modal framework requires eye-paired datasets of both modalities, which is impractical for clinical use. To address this problem, we propose a novel fundus-enhanced disease-aware distillation model (FDDM), for retinal disease classification from OCT images. Our framework enhances the OCT model during training by utilizing unpaired fundus images and does not require the use of fundus images during testing, which greatly improves the practicality and efficiency of our method for clinical use. Specifically, we propose a novel class prototype matching to distill disease-related information from the fundus model to the OCT model and a novel class similarity alignment to enforce consistency between disease distribution of both modalities. Experimental results show that our proposed approach outperforms single-modal, multi-modal, and state-of-the-art distillation methods for retinal disease classification. Code is available at https://github.com/xmed-lab/FDDM.
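The two distillation objectives named above can be sketched as follows. This is a plausible reading of "class prototype matching" and "class similarity alignment" rather than the code in the FDDM repository.

```python
import torch
import torch.nn.functional as F

def class_prototype_matching(oct_feats, oct_labels, fundus_prototypes):
    """Pull each OCT feature toward the fundus-model prototype of its class.
    fundus_prototypes: (num_classes, d) class-mean features from the fundus
    branch. A plausible reading of the loss, not the repository's exact code."""
    target = fundus_prototypes[oct_labels]          # (B, d) prototype per sample
    return 1 - F.cosine_similarity(oct_feats, target, dim=-1).mean()

def class_similarity_alignment(oct_logits, fundus_logits, T: float = 2.0):
    """Align the class (disease) distributions of both modalities with a
    temperature-softened KL divergence, as in standard distillation."""
    p_fundus = F.softmax(fundus_logits / T, dim=-1)
    log_p_oct = F.log_softmax(oct_logits / T, dim=-1)
    return F.kl_div(log_p_oct, p_fundus, reduction="batchmean") * T * T
```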
Unleashing the Power of Self-Supervised Image Denoising: A Comprehensive Review
results: Quantitative and qualitative experiments on multiple datasets demonstrate the effectiveness of these methods, with comparisons against classical algorithms as benchmarks.
Abstract
The advent of deep learning has brought a revolutionary transformation to image denoising techniques. However, the persistent challenge of acquiring noise-clean pairs for supervised methods in real-world scenarios remains formidable, necessitating the exploration of more practical self-supervised image denoising. This paper focuses on self-supervised image denoising methods that offer effective solutions to address this challenge. Our comprehensive review thoroughly analyzes the latest advancements in self-supervised image denoising approaches, categorizing them into three distinct classes: General methods, Blind Spot Network (BSN)-based methods, and Transformer-based methods. For each class, we provide a concise theoretical analysis along with their practical applications. To assess the effectiveness of these methods, we present both quantitative and qualitative experimental results on various datasets, utilizing classical algorithms as benchmarks. Additionally, we critically discuss the current limitations of these methods and propose promising directions for future research. By offering a detailed overview of recent developments in self-supervised image denoising, this review serves as an invaluable resource for researchers and practitioners in the field, facilitating a deeper understanding of this emerging domain and inspiring further advancements.
Boundary Difference Over Union Loss For Medical Image Segmentation
results: Experiments on two datasets (ACDC and Synapse) demonstrate the effectiveness of the proposal; compared with other loss functions, the proposed loss better guides boundary-region segmentation.
Abstract
Medical image segmentation is crucial for clinical diagnosis. However, current losses for medical image segmentation mainly focus on overall segmentation results, with fewer losses proposed to guide boundary segmentation. Those that do exist often need to be used in combination with other losses and produce ineffective results. To address this issue, we have developed a simple and effective loss called the Boundary Difference over Union Loss (Boundary DoU Loss) to guide boundary region segmentation. It is obtained by calculating the ratio of the difference set of prediction and ground truth to the union of the difference set and the partial intersection set. Our loss only relies on region calculation, making it easy to implement and training stable without needing any additional losses. Additionally, we use the target size to adaptively adjust attention applied to the boundary regions. Experimental results using UNet, TransUNet, and Swin-UNet on two datasets (ACDC and Synapse) demonstrate the effectiveness of our proposed loss function. Code is available at https://github.com/sunfan-bvb/BoundaryDoULoss.
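The loss is described verbally as the ratio of the difference set of prediction and ground truth to the union of that difference set and a partial intersection set. A hedged PyTorch sketch of that ratio for soft masks follows; the paper additionally adapts the partial-intersection weight to the target size, whereas a fixed alpha is used here for brevity.

```python
import torch

def boundary_dou_loss(pred, target, alpha=0.8, eps=1e-7):
    """Boundary DoU sketch for soft masks of shape (B, H, W) in [0, 1]:
    difference set over (difference set + a partial intersection).
    alpha is fixed here; the paper adjusts it from the target size."""
    inter = (pred * target).sum(dim=(1, 2))
    union = (pred + target - pred * target).sum(dim=(1, 2))
    diff = union - inter                  # symmetric difference of G and P
    return (diff / (diff + alpha * inter + eps)).mean()
```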
Universal Adversarial Defense in Remote Sensing Based on Pre-trained Denoising Diffusion Models
paper_authors: Weikang Yu, Yonghao Xu, Pedram Ghamisi
for: This paper proposes a universal adversarial defense based on diffusion models to protect deep neural networks from attacks.
methods: Pre-trained diffusion models defend against multiple unknown attacks, purifying adversarial samples via the forward and reverse diffusion processes; an adaptive noise level selection (ANLS) mechanism picks the noise level that yields the best purification.
results: Experiments show that UAD-RS efficiently defends against multiple common attacks without requiring precise prior knowledge of the adversarial perturbations, while reducing re-training costs and performance fluctuation.
Abstract
Deep neural networks (DNNs) have achieved tremendous success in many remote sensing (RS) applications, in which DNNs are vulnerable to adversarial perturbations. Unfortunately, current adversarial defense approaches in RS studies usually suffer from performance fluctuation and unnecessary re-training costs due to the need for prior knowledge of the adversarial perturbations among RS data. To circumvent these challenges, we propose a universal adversarial defense approach in RS imagery (UAD-RS) using pre-trained diffusion models to defend the common DNNs against multiple unknown adversarial attacks. Specifically, the generative diffusion models are first pre-trained on different RS datasets to learn generalized representations in various data domains. After that, a universal adversarial purification framework is developed using the forward and reverse process of the pre-trained diffusion models to purify the perturbations from adversarial samples. Furthermore, an adaptive noise level selection (ANLS) mechanism is built to capture the optimal noise level of the diffusion model that can achieve the best purification results closest to the clean samples according to their Frechet Inception Distance (FID) in deep feature space. As a result, only a single pre-trained diffusion model is needed for the universal purification of adversarial samples on each dataset, which significantly alleviates the re-training efforts and maintains high performance without prior knowledge of the adversarial perturbations. Experiments on four heterogeneous RS datasets regarding scene classification and semantic segmentation verify that UAD-RS outperforms state-of-the-art adversarial purification approaches with a universal defense against seven commonly existing adversarial perturbations. Codes and the pre-trained models are available online (https://github.com/EricYu97/UAD-RS).
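In outline, the purification and ANLS steps read as below. The diffusion wrapper (q_sample / denoise) and fid_fn are assumed interfaces for exposition, not a real library API.

```python
def purify(adv_image, diffusion, t_star):
    """Purification sketch: run the forward process up to noise level t_star,
    then denoise back with the pre-trained reverse process. `diffusion` is an
    assumed wrapper exposing q_sample / denoise, not a real library API."""
    noised = diffusion.q_sample(adv_image, t=t_star)    # forward: add noise
    return diffusion.denoise(noised, t_start=t_star)    # reverse: remove it

def adaptive_noise_level(adv_batch, clean_batch, diffusion, candidates, fid_fn):
    """ANLS sketch: choose the noise level whose purified outputs are closest
    to clean samples under FID in deep feature space (fid_fn is assumed)."""
    scores = {t: fid_fn(purify(adv_batch, diffusion, t), clean_batch)
              for t in candidates}
    return min(scores, key=scores.get)   # noise level with the lowest FID
```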
A comprehensive review of deep learning in lung cancer
results: Current methods for cancer diagnosis are deemed ineffective, calling for new and more intelligent approaches.
Abstract
To provide the reader with a historical perspective on cancer classification approaches, we first discuss the fundamentals of the area of cancer diagnosis in this article, including the processes of cancer diagnosis and the standard classification methods employed by clinicians. Current methods for cancer diagnosis are deemed ineffective, calling for new and more intelligent approaches.
Framing image registration as a landmark detection problem for better representation of clinical relevance
paper_authors: Diana Waldmannstetter, Benedikt Wiestler, Julian Schwarting, Ivan Ezhov, Marie Metz, Spyridon Bakas, Bhakti Baheti, Satrajit Chakrabarty, Jan S. Kirschke, Rolf A. Heckemann, Marie Piraud, Florian Kofler, Bjoern H. Menze
for: Improving the clinical relevance of image registration evaluation
methods: Image registration is reframed as a landmark detection problem; hit rate curves computed from a small inter-rater analysis yield landmark detection thresholds
results: The proposed error-distribution-based threshold derivation differentiates previously indistinguishable registration algorithms and enables assessment of the clinical significance of image registration.
Abstract
Nowadays, registration methods are typically evaluated based on sub-resolution tracking error differences. In an effort to reinfuse this evaluation process with clinical relevance, we propose to reframe image registration as a landmark detection problem. Ideally, landmark-specific detection thresholds are derived from an inter-rater analysis. To approximate this costly process, we propose to compute hit rate curves based on the distribution of errors of a sub-sample inter-rater analysis. Therefore, we suggest deriving thresholds from the error distribution using the formula: median + delta * median absolute deviation. The method promises differentiation of previously indistinguishable registration algorithms and further enables assessing the clinical significance in algorithm development.
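The threshold formula is concrete enough to state in code. A small NumPy sketch with an illustrative inter-rater error sample; delta is a free parameter whose value the abstract does not fix.

```python
import numpy as np

def detection_threshold(errors: np.ndarray, delta: float = 2.0) -> float:
    """Landmark-specific threshold from a sub-sample inter-rater error
    distribution: median + delta * median absolute deviation, the formula
    given in the abstract."""
    med = np.median(errors)
    mad = np.median(np.abs(errors - med))
    return med + delta * mad

def hit_rate(errors: np.ndarray, threshold: float) -> float:
    """Fraction of landmarks whose registration error falls below the threshold."""
    return float(np.mean(errors < threshold))

inter_rater_errors = np.array([0.8, 1.1, 0.9, 1.4, 1.0, 2.2])  # mm, illustrative
thr = detection_threshold(inter_rater_errors)
print(thr, hit_rate(np.array([0.5, 1.2, 3.0]), thr))
```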
paper_authors: Chen Liu, Peike Li, Xingqun Qi, Hu Zhang, Lincheng Li, Dadong Wang, Xin Yu
for: This paper focuses on the audio-visual segmentation (AVS) task, which aims to segment sounding objects from a given video.
methods: The proposed method first localizes potential sounding objects in a video using an object segmentation network, and then associates the sounding object candidates with the given audio. To alleviate the ambiguity of training the object segmentation network, the method proposes a silent object-aware segmentation objective. Additionally, the method explores the audio-visual semantic correlation by attending predicted audio category scores to potential instance masks.
results: The proposed method can effectively segment sounding objects without being biased to salient objects, as demonstrated by experimental results on the AVS benchmarks.
Abstract
The audio-visual segmentation (AVS) task aims to segment sounding objects from a given video. Existing works mainly focus on fusing audio and visual features of a given video to achieve sounding object masks. However, we observed that prior arts are prone to segment a certain salient object in a video regardless of the audio information. This is because sounding objects are often the most salient ones in the AVS dataset. Thus, current AVS methods might fail to localize genuine sounding objects due to the dataset bias. In this work, we present an audio-visual instance-aware segmentation approach to overcome the dataset bias. In a nutshell, our method first localizes potential sounding objects in a video by an object segmentation network, and then associates the sounding object candidates with the given audio. We notice that an object could be a sounding object in one video but a silent one in another video. This would bring ambiguity in training our object segmentation network as only sounding objects have corresponding segmentation masks. We thus propose a silent object-aware segmentation objective to alleviate the ambiguity. Moreover, since the category information of audio is unknown, especially for multiple sounding sources, we propose to explore the audio-visual semantic correlation and then associate audio with potential objects. Specifically, we attend predicted audio category scores to potential instance masks and these scores will highlight corresponding sounding instances while suppressing inaudible ones. When we enforce the attended instance masks to resemble the ground-truth mask, we are able to establish audio-visual semantics correlation. Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased to salient objects.
SAMbA: Speech enhancement with Asynchronous ad-hoc Microphone Arrays
results: The attention mechanism makes the DNNs robust to both sampling time offsets and sampling rate offsets, without costly processing steps to resynchronize the signals. It also recovers the asynchronization parameters in an unsupervised manner.
Abstract
Speech enhancement in ad-hoc microphone arrays is often hindered by the asynchronization of the devices composing the microphone array. Asynchronization comes from sampling time offset and sampling rate offset which inevitably occur when the microphones are embedded in different hardware components. In this paper, we propose a deep neural network (DNN)-based speech enhancement solution that is suited for applications in ad-hoc microphone arrays because it is distributed and copes with asynchronization. We show that asynchronization has a limited impact on the spatial filtering and mostly affects the performance of the DNNs. Instead of resynchronising the signals, which requires costly processing steps, we use an attention mechanism which makes the DNNs, thus our whole pipeline, robust to asynchronization. We also show that the attention mechanism leads to the asynchronization parameters in an unsupervised manner.
Contrastive Conditional Latent Diffusion for Audio-visual Segmentation
paper_authors: Yuxin Mao, Jing Zhang, Mochu Xiang, Yunqiu Lv, Yiran Zhong, Yuchao Dai
for: This paper proposes a latent diffusion model with contrastive learning for audio-visual segmentation (AVS) to explore the contribution of audio.
methods: The proposed method uses a latent diffusion model to learn the conditional generation process of the ground-truth segmentation map, and introduces contrastive learning to learn audio-visual correspondence.
results: Experimental results on a benchmark dataset verify the effectiveness of the proposed solution, demonstrating the importance of modeling the correlation between audio and the final segmentation map for AVS.
Abstract
We propose a latent diffusion model with contrastive learning for audio-visual segmentation (AVS) to extensively explore the contribution of audio. We interpret AVS as a conditional generation task, where audio is defined as the conditional variable for sound producer(s) segmentation. With our new interpretation, it is especially necessary to model the correlation between audio and the final segmentation map to ensure its contribution. We introduce a latent diffusion model to our framework to achieve semantic-correlated representation learning. Specifically, our diffusion model learns the conditional generation process of the ground-truth segmentation map, leading to ground-truth aware inference when we perform the denoising process at the test stage. As a conditional diffusion model, we argue it is essential to ensure that the conditional variable contributes to model output. We then introduce contrastive learning to our framework to learn audio-visual correspondence, which is proven consistent with maximizing the mutual information between model prediction and the audio data. In this way, our latent diffusion model via contrastive learning explicitly maximizes the contribution of audio for AVS. Experimental results on the benchmark dataset verify the effectiveness of our solution. Code and results are online via our project page: https://github.com/OpenNLPLab/DiffusionAVS.
DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training
results: Experiments show that DiffProsody generates prosody 16 times faster than the conventional diffusion model, with higher quality and effectiveness than traditional methods.
Abstract
Expressive text-to-speech systems have undergone significant advancements owing to prosody modeling, but conventional methods can still be improved. Traditional approaches have relied on the autoregressive method to predict the quantized prosody vector; however, it suffers from the issues of long-term dependency and slow inference. This study proposes a novel approach called DiffProsody in which expressive speech is synthesized using a diffusion-based latent prosody generator and prosody conditional adversarial training. Our findings confirm the effectiveness of our prosody generator in generating a prosody vector. Furthermore, our prosody conditional discriminator significantly improves the quality of the generated speech by accurately emulating prosody. We use denoising diffusion generative adversarial networks to improve the prosody generation speed. Consequently, DiffProsody is capable of generating prosody 16 times faster than the conventional diffusion model. The superior performance of our proposed method has been demonstrated via experiments.
SpatialNet: Extensively Learning Spatial Information for Multichannel Joint Speech Separation, Denoising and Dereverberation
results: Experiments on various simulated and real datasets show that: 1) the proposed network achieves state-of-the-art performance on almost all tasks; 2) it suffers little from the spectral generalization problem; and 3) it indeed performs speaker clustering (demonstrated by attention maps).
Abstract
This work proposes a neural network to extensively exploit spatial information for multichannel joint speech separation, denoising and dereverberation, named SpatialNet. In the short-time Fourier transform (STFT) domain, the proposed network performs end-to-end speech enhancement. It is mainly composed of interleaved narrow-band and cross-band blocks to respectively exploit narrow-band and cross-band spatial information. The narrow-band blocks process frequencies independently, and use a self-attention mechanism and temporal convolutional layers to respectively perform spatial-feature-based speaker clustering and temporal smoothing/filtering. The cross-band blocks process frames independently, and use a full-band linear layer and frequency convolutional layers to respectively learn the correlation between all frequencies and between adjacent frequencies. Experiments are conducted on various simulated and real datasets, and the results show that 1) the proposed network achieves the state-of-the-art performance on almost all tasks; 2) the proposed network suffers little from the spectral generalization problem; and 3) the proposed network is indeed performing speaker clustering (demonstrated by attention maps).
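To make the narrow-band idea concrete, here is a hedged PyTorch sketch of one such block: frequencies are folded into the batch so each bin is processed independently, self-attention runs over time (the spatial-feature-based speaker clustering step), and a depthwise temporal convolution does the smoothing/filtering. Dimensions and layer choices are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class NarrowBandBlock(nn.Module):
    """Narrow-band block sketch: each frequency bin is an independent sequence
    over time. Illustrative dimensions, not SpatialNet's actual configuration."""
    def __init__(self, d: int = 96, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.tconv = nn.Conv1d(d, d, kernel_size=5, padding=2, groups=d)
        self.norm = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, freq, time, d) -> fold freq into batch: bins are independent
        b, f, t, d = x.shape
        x = x.reshape(b * f, t, d)
        x = x + self.attn(x, x, x, need_weights=False)[0]       # speaker clustering over time
        x = x + self.tconv(x.transpose(1, 2)).transpose(1, 2)   # temporal smoothing/filtering
        return self.norm(x).reshape(b, f, t, d)

x = torch.rand(2, 129, 50, 96)
print(NarrowBandBlock()(x).shape)  # torch.Size([2, 129, 50, 96])
```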
paper_authors: SeungHeon Doh, Keunwoo Choi, Jongpil Lee, Juhan Nam
for: Improving the understanding and organization of music data by providing a large-scale music captioning dataset.
methods: A large language model (LLM) is used to artificially generate description sentences from a large-scale tag dataset.
results: Various quantitative evaluation metrics and human evaluation show that the proposed method outperforms the baseline model; a captioning model trained on the dataset was also evaluated under zero-shot and transfer-learning settings.
Abstract
Automatic music captioning, which generates natural language descriptions for given music tracks, holds significant potential for enhancing the understanding and organization of large volumes of musical data. Despite its importance, researchers face challenges due to the costly and time-consuming collection process of existing music-language datasets, which are limited in size. To address this data scarcity issue, we propose the use of large language models (LLMs) to artificially generate the description sentences from large-scale tag datasets. This results in approximately 2.2M captions paired with 0.5M audio clips. We term it Large Language Model based Pseudo music caption dataset, shortly, LP-MusicCaps. We conduct a systemic evaluation of the large-scale music captioning dataset with various quantitative evaluation metrics used in the field of natural language processing as well as human evaluation. In addition, we trained a transformer-based music captioning model with the dataset and evaluated it under zero-shot and transfer-learning settings. The results demonstrate that our proposed approach outperforms the supervised baseline model.
Mispronunciation detection using self-supervised speech representations
results: Training a model directly for the target task gives the best performance, while most upstream models perform similarly on this task.
Abstract
In recent years, self-supervised learning (SSL) models have produced promising results in a variety of speech-processing tasks, especially in contexts of data scarcity. In this paper, we study the use of SSL models for the task of mispronunciation detection for second language learners. We compare two downstream approaches: 1) training the model for phone recognition (PR) using native English data, and 2) training a model directly for the target task using non-native English data. We compare the performance of these two approaches for various SSL representations as well as a representation extracted from a traditional DNN-based speech recognition model. We evaluate the models on L2Arctic and EpaDB, two datasets of non-native speech annotated with pronunciation labels at the phone level. Overall, we find that using a downstream model trained for the target task gives the best performance and that most upstream models perform similarly for the task.
paper_authors: Giulia Comini, Manuel Sam Ribeiro, Fan Yang, Heereen Shim, Jaime Lorenzo-Trueba
for: This paper proposes a multilingual unified front-end system for pronunciation-related tasks that are typically handled by separate modules.
methods: The system models grapheme-to-phoneme relationships along with language-specific challenges such as homograph and polyphone disambiguation, post-lexical rules, and implicit diacritization.
results: Experiments show that the multilingual model is competitive across languages and tasks, with some trade-offs compared with equivalent monolingual solutions.
Abstract
Phonetic information and linguistic knowledge are an essential component of a Text-to-speech (TTS) front-end. Given a language, a lexicon can be collected offline and Grapheme-to-Phoneme (G2P) relationships are usually modeled in order to predict the pronunciation for out-of-vocabulary (OOV) words. Additionally, post-lexical phonology, often defined in the form of rule-based systems, is used to correct pronunciation within or between words. In this work we showcase a multilingual unified front-end system that addresses any pronunciation related task, typically handled by separate modules. We evaluate the proposed model on G2P conversion and other language-specific challenges, such as homograph and polyphones disambiguation, post-lexical rules and implicit diacritization. We find that the multilingual model is competitive across languages and tasks, however, some trade-offs exists when compared to equivalent monolingual solutions.
Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings
results: Across languages and amounts of available data, our approach consistently lowers the phone error rate of G2P systems, and it learns pronunciations for out-of-vocabulary words.
Abstract
The Grapheme-to-Phoneme (G2P) task aims to convert orthographic input into a discrete phonetic representation. G2P conversion is beneficial to various speech processing applications, such as text-to-speech and speech recognition. However, these tend to rely on manually-annotated pronunciation dictionaries, which are often time-consuming and costly to acquire. In this paper, we propose a method to improve the G2P conversion task by learning pronunciation examples from audio recordings. Our approach bootstraps a G2P with a small set of annotated examples. The G2P model is used to train a multilingual phone recognition system, which then decodes speech recordings with a phonetic representation. Given hypothesized phoneme labels, we learn pronunciation dictionaries for out-of-vocabulary words, and we use those to re-train the G2P system. Results indicate that our approach consistently improves the phone error rate of G2P systems across languages and amount of available data.
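The training loop described above can be summarized in pseudocode-style Python. train_g2p, train_phone_recognizer, and decode_phones are hypothetical stand-ins for the actual components, and the majority-vote step is an assumption about how multiple decoded hypotheses per word are reconciled.

```python
def bootstrap_g2p(seed_lexicon, audio_corpus, oov_clips, n_rounds=2):
    """Bootstrapping loop sketched from the abstract. train_g2p,
    train_phone_recognizer, and decode_phones are hypothetical stand-ins."""
    lexicon = dict(seed_lexicon)                 # word -> phoneme sequence
    for _ in range(n_rounds):
        g2p = train_g2p(lexicon)                 # 1. G2P from current lexicon
        asr = train_phone_recognizer(g2p, audio_corpus)  # 2. multilingual phone ASR
        for word, clips in oov_clips.items():    # 3. learn OOV pronunciations
            hyps = [tuple(decode_phones(asr, c)) for c in clips]
            lexicon[word] = max(set(hyps), key=hyps.count)  # majority vote
    return lexicon                               # 4. used to re-train the G2P
```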
All-In-One Metrical And Functional Structure Analysis With Neighborhood Attentions on Demixed Audio
results: The model achieves state-of-the-art performance on all four tasks on the Harmonix Set while keeping a relatively low parameter count; our ablation study shows that jointly learning beats, downbeats, and segments improves performance, with each task benefiting the others.
Abstract
Music is characterized by complex hierarchical structures. Developing a comprehensive model to capture these structures has been a significant challenge in the field of Music Information Retrieval (MIR). Prior research has mainly focused on addressing individual tasks for specific hierarchical levels, rather than providing a unified approach. In this paper, we introduce a versatile, all-in-one model that jointly performs beat and downbeat tracking as well as functional structure segmentation and labeling. The model leverages source-separated spectrograms as inputs and employs dilated neighborhood attentions to capture temporal long-term dependencies, along with non-dilated attentions for local instrumental dependencies. Consequently, the proposed model achieves state-of-the-art performance in all four tasks on the Harmonix Set while maintaining a relatively lower number of parameters compared to recent state-of-the-art models. Furthermore, our ablation study demonstrates that the concurrent learning of beats, downbeats, and segments can lead to enhanced performance, with each task mutually benefiting from the others.
Robust Self Supervised Speech Embeddings for Child-Adult Classification in Interactions involving Children with Autism
paper_authors: Rimita Lahiri, Tiantian Feng, Rajat Hebbar, Catherine Lord, So Hyun Kim, Shrikanth Narayanan
for: automatic child-adult speaker classification in child-inclusive spoken interactions
methods: pre-training with child-inclusive interactions and self-supervision algorithms (Wav2vec 2.0 and WavLM) with a contrastive loss objective
results: 9-13% relative improvement over the state-of-the-art baseline in classification F1 scores on two clinical interaction datasets involving children with Autism, with analysis of pre-training under different conditions based on demographic factors.
Abstract
We address the problem of detecting who spoke when in child-inclusive spoken interactions i.e., automatic child-adult speaker classification. Interactions involving children are richly heterogeneous due to developmental differences. The presence of neurodiversity e.g., due to Autism, contributes additional variability. We investigate the impact of additional pre-training with more unlabelled child speech on the child-adult classification performance. We pre-train our model with child-inclusive interactions, following two recent self-supervision algorithms, Wav2vec 2.0 and WavLM, with a contrastive loss objective. We report 9 - 13% relative improvement over the state-of-the-art baseline with regards to classification F1 scores on two clinical interaction datasets involving children with Autism. We also analyze the impact of pre-training under different conditions by evaluating our model on interactions involving different subgroups of children based on various demographic factors.
Pre-training End-to-end ASR Models with Augmented Speech Samples Queried by Text
paper_authors: Eric Sun, Jinyu Li, Jian Xue, Yifan Gong
for: Improving language expansion for speech recognition systems
methods: Augmented samples are generated from unpaired speech feature segments and text data, without using additional speech data
results: Mixing the augmented speech data with the original transcribed data for Italian Transformer transducer model pre-training yields an 8.7% relative word error rate reduction; merging the augmented data with multilingual data to pre-train a new model yields a 12.2% relative reduction over the baseline.
Abstract
In end-to-end automatic speech recognition system, one of the difficulties for language expansion is the limited paired speech and text training data. In this paper, we propose a novel method to generate augmented samples with unpaired speech feature segments and text data for model pre-training, which has the advantage of low cost without using additional speech data. When mixing 20,000 hours augmented speech data generated by our method with 12,500 hours original transcribed speech data for Italian Transformer transducer model pre-training, we achieve 8.7% relative word error rate reduction. The pre-trained model achieves similar performance as the model pre-trained with multilingual transcribed 75,000 hours raw speech data. When merging the augmented speech data with the multilingual data to pre-train a new model, we achieve even more relative word error rate reduction of 12.2% over the baseline, which further verifies the effectiveness of our method for speech data augmentation.
results: A scalable Light-aware Blind Deconvolution Network (LBDN) is proposed; after glow suppression, a Retinex-based Enhancement Module brightens the result, achieving both glow suppression and low-light enhancement.
Abstract
Most existing Low-Light Image Enhancement (LLIE) methods are primarily designed to improve brightness in dark regions, which suffer from severe degradation in nighttime images. However, these methods have limited exploration in another major visibility damage, the glow effects in real night scenes. Glow effects are inevitable in the presence of artificial light sources and cause further diffused blurring when directly enhanced. To settle this issue, we innovatively consider the glow suppression task as learning physical glow generation via multiple scattering estimation according to the Atmospheric Point Spread Function (APSF). In response to the challenges posed by uneven glow intensity and varying source shapes, an APSF-based Nighttime Imaging Model with Near-field Light Sources (NIM-NLS) is specifically derived to design a scalable Light-aware Blind Deconvolution Network (LBDN). The glow-suppressed result is then brightened via a Retinex-based Enhancement Module (REM). Remarkably, the proposed glow suppression method is based on zero-shot learning and does not rely on any paired or unpaired training data. Empirical evaluations demonstrate the effectiveness of the proposed method in both glow suppression and low-light enhancement tasks.
Lightweight Super-Resolution Head for Human Pose Estimation
methods: The study proposes the SR head, which predicts high-resolution heatmaps, reducing quantization error and the need for further post-processing. SRPose gradually recovers high-resolution heatmaps in a coarse-to-fine manner, applying SR heads to supervise the intermediate features at each stage.
results: Extensive experiments on the COCO, MPII, and CrowdPose datasets show that SRPose outperforms the corresponding heatmap-based methods.
Abstract
Heatmap-based methods have become the mainstream method for pose estimation due to their superior performance. However, heatmap-based approaches suffer from significant quantization errors with downscale heatmaps, which result in limited performance and the detrimental effects of intermediate supervision. Previous heatmap-based methods relied heavily on additional post-processing to mitigate quantization errors. Some heatmap-based approaches improve the resolution of feature maps by using multiple costly upsampling layers to improve localization precision. To solve the above issues, we creatively view the backbone network as a degradation process and thus reformulate the heatmap prediction as a Super-Resolution (SR) task. We first propose the SR head, which predicts heatmaps with a spatial resolution higher than the input feature maps (or even consistent with the input image) by super-resolution, to effectively reduce the quantization error and the dependence on further post-processing. Besides, we propose SRPose to gradually recover the HR heatmaps from LR heatmaps and degraded features in a coarse-to-fine manner. To reduce the training difficulty of HR heatmaps, SRPose applies SR heads to supervise the intermediate features in each stage. In addition, the SR head is a lightweight and generic head that applies to top-down and bottom-up methods. Extensive experiments on the COCO, MPII, and CrowdPose datasets show that SRPose outperforms the corresponding heatmap-based approaches. The code and models are available at https://github.com/haonanwang0522/SRPose.
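One common way to realize such a head is sub-pixel convolution; the hedged sketch below predicts heatmaps at 4x the feature resolution this way. The actual SR head in the SRPose repository may be built differently.

```python
import torch
import torch.nn as nn

class SRHead(nn.Module):
    """Super-resolution head sketch: predict heatmaps at a higher spatial
    resolution than the input features via sub-pixel convolution, reducing
    the quantization error of low-resolution heatmaps. A plausible design,
    not necessarily the one in the SRPose repository."""
    def __init__(self, in_ch: int, n_joints: int, scale: int = 4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, n_joints * scale * scale, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)  # (B, K*s*s, H, W) -> (B, K, s*H, s*W)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.proj(feats))

head = SRHead(in_ch=256, n_joints=17, scale=4)
print(head(torch.rand(1, 256, 64, 48)).shape)  # torch.Size([1, 17, 256, 192])
```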
High-Performance Fine Defect Detection in Artificial Leather Using Dual Feature Pool Object Detection
results: On the artificial leather fine defect dataset, YOLOD improves AP_50 by 11.7% - 13.5% over YOLOv5 while reducing the error detection rate by 5.2% - 7.2%. On the general MS-COCO dataset, YOLOD improves AP by 0.4% - 2.6% and AP_S by 2.5% - 4.1% over YOLOv5. These results demonstrate YOLOD's effectiveness and reliability in both fine defect detection in artificial leather and general object detection, making it well suited to real-world applications.
Abstract
In this study, the structural problems of the YOLOv5 model were analyzed in depth. Based on the characteristics of fine defects in artificial leather, four innovative structures, namely DFP, IFF, AMP, and EOS, were designed. These advancements led to the proposal of a high-performance artificial leather fine defect detection model named YOLOD. YOLOD demonstrated outstanding performance on the artificial leather defect dataset, achieving an impressive increase of 11.7% - 13.5% in AP_50 compared to YOLOv5, along with a significant reduction of 5.2% - 7.2% in the error detection rate. Moreover, YOLOD also exhibited remarkable performance on the general MS-COCO dataset, with an increase of 0.4% - 2.6% in AP compared to YOLOv5, and a rise of 2.5% - 4.1% in AP_S compared to YOLOv5. These results demonstrate the superiority of YOLOD in both artificial leather defect detection and general object detection tasks, making it a highly efficient and effective model for real-world applications.
Multi-Spectral Image Stitching via Spatial Graph Reasoning
results: Experiments show that the method effectively handles the deformation and integration of multi-view scenes and is more stable and reliable than existing methods. The method is evaluated extensively on both real-world and synthetic datasets.
Abstract
Multi-spectral image stitching leverages the complementarity between infrared and visible images to generate a robust and reliable wide field-of-view (FOV) scene. The primary challenge of this task is to explore the relations between multi-spectral images for aligning and integrating multi-view scenes. Capitalizing on the strengths of Graph Convolutional Networks (GCNs) in modeling feature relationships, we propose a spatial graph reasoning based multi-spectral image stitching method that effectively distills the deformation and integration of multi-spectral images across different viewpoints. To accomplish this, we embed multi-scale complementary features from the same view position into a set of nodes. The correspondence across different views is learned through powerful dense feature embeddings, where both inter- and intra-correlations are developed to exploit cross-view matching and enhance inner feature disparity. By introducing long-range coherence along spatial and channel dimensions, the complementarity of pixel relations and channel interdependencies aids in the reconstruction of aligned multi-view features, generating informative and reliable wide FOV scenes. Moreover, we release a challenging dataset named ChaMS, comprising both real-world and synthetic sets with significant parallax, providing a new option for comprehensive evaluation. Extensive experiments demonstrate that our method surpasses the state-of-the-arts.
UniVTG: Towards Unified Video-Language Temporal Grounding
results: Experiments on three tasks (moment retrieval, highlight detection, and video summarization) across seven datasets (QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights, TVSum, and QFVS) demonstrate the effectiveness and flexibility of the method.
Abstract
Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their abilities to generalize to various VTG tasks and labels. In this paper, we propose to Unify the diverse VTG labels and tasks, dubbed UniVTG, along three directions: Firstly, we revisit a wide range of VTG labels and tasks and define a unified formulation. Based on this, we develop data annotation schemes to create scalable pseudo supervision. Secondly, we develop an effective and flexible grounding model capable of addressing each task and making full use of each label. Lastly, thanks to the unified framework, we are able to unlock temporal grounding pretraining from large-scale diverse labels and develop stronger grounding abilities, e.g., zero-shot grounding. Extensive experiments on three tasks (moment retrieval, highlight detection and video summarization) across seven datasets (QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights, TVSum, and QFVS) demonstrate the effectiveness and flexibility of our proposed framework. The codes are available at https://github.com/showlab/UniVTG.
Investigating and Improving Latent Density Segmentation Models for Aleatoric Uncertainty Quantification in Medical Imaging
methods: The paper builds on the Probabilistic U-Net (PU-Net), which uses latent Normal densities to optimize the conditional data log-likelihood Evidence Lower Bound.
results: The study finds that the PU-Net latent space is severely inhomogeneous, which inhibits gradient descent and makes the model extremely sensitive to the localization of latent-space samples, yielding defective predictions. To address this, the paper proposes the Sinkhorn PU-Net (SPU-Net), which uses the Sinkhorn Divergence to promote homogeneity across all latent dimensions, improving gradient-descent updates and model robustness. On public datasets of various clinical segmentation problems, SPU-Net gains up to 11% on the Hungarian-Matched metric over preceding latent variable models, indicating that encouraging a homogeneous latent space significantly improves latent density modeling for medical image segmentation.
Abstract
Data uncertainties, such as sensor noise or occlusions, can introduce irreducible ambiguities in images, which result in varying, yet plausible, semantic hypotheses. In Machine Learning, this ambiguity is commonly referred to as aleatoric uncertainty. Latent density models can be utilized to address this problem in image segmentation. The most popular approach is the Probabilistic U-Net (PU-Net), which uses latent Normal densities to optimize the conditional data log-likelihood Evidence Lower Bound. In this work, we demonstrate that the PU-Net latent space is severely inhomogeneous. As a result, the effectiveness of gradient descent is inhibited and the model becomes extremely sensitive to the localization of the latent space samples, resulting in defective predictions. To address this, we present the Sinkhorn PU-Net (SPU-Net), which uses the Sinkhorn Divergence to promote homogeneity across all latent dimensions, effectively improving gradient-descent updates and model robustness. Our results show that by applying this on public datasets of various clinical segmentation problems, the SPU-Net receives up to 11% performance gains compared against preceding latent variable models for probabilistic segmentation on the Hungarian-Matched metric. The results indicate that by encouraging a homogeneous latent space, one can significantly improve latent density modeling for medical image segmentation.
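As a rough illustration of the kind of regularizer involved, the sketch below computes an entropic optimal-transport (Sinkhorn) cost between two batches of latent samples. The regularization strength, iteration count, and uniform marginals are assumptions, and the full Sinkhorn Divergence additionally subtracts the debiasing self-terms OT(x,x) and OT(y,y):

```python
import math
import torch

def sinkhorn_cost(x: torch.Tensor, y: torch.Tensor,
                  eps: float = 0.1, iters: int = 50) -> torch.Tensor:
    """Entropic OT cost between samples x: (n, d) and y: (m, d)."""
    n, m = x.shape[0], y.shape[0]
    cost = torch.cdist(x, y) ** 2            # pairwise squared distances
    log_a = torch.full((n,), -math.log(n))   # uniform source marginal (log)
    log_b = torch.full((m,), -math.log(m))   # uniform target marginal (log)
    f, g = torch.zeros(n), torch.zeros(m)
    for _ in range(iters):                   # log-domain Sinkhorn iterations
        f = -eps * torch.logsumexp((g[None, :] - cost) / eps + log_b[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - cost) / eps + log_a[:, None], dim=0)
    plan = torch.exp((f[:, None] + g[None, :] - cost) / eps
                     + log_a[:, None] + log_b[None, :])
    return (plan * cost).sum()

# E.g., pull a batch of posterior latents toward prior samples:
loss = sinkhorn_cost(torch.randn(64, 6), torch.randn(64, 6))
```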
DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation
paper_authors: Runyang Feng, Yixing Gao, Tze Ho Elden Tse, Xueqing Ma, Hyung Jin Chang
for: The paper targets multi-frame human pose estimation, extending diffusion probabilistic models to improve pose estimation accuracy.
methods: The paper proposes DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as conditional heatmap generation. It introduces a SpatioTemporal Representation Learner to aggregate visual evidence across frames, and a Lookup-based MultiScale Feature Interaction mechanism to determine correlations between local joints and global contexts.
results: DiffPose sets new state-of-the-art results on three benchmarks: PoseTrack2017, PoseTrack2018, and PoseTrack21. It can also combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and can adjust the number of iterative refinement steps without retraining the model.
Abstract
Denoising diffusion probabilistic models that were initially proposed for realistic image generation have recently shown success in various perception tasks (e.g., object detection and image segmentation) and are increasingly gaining attention in computer vision. However, extending such models to multi-frame human pose estimation is non-trivial due to the presence of the additional temporal dimension in videos. More importantly, learning representations that focus on keypoint regions is crucial for accurate localization of human joints. Nevertheless, the adaptation of the diffusion-based methods remains unclear on how to achieve such objective. In this paper, we present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem. First, to better leverage temporal information, we propose SpatioTemporal Representation Learner which aggregates visual evidences across frames and uses the resulting features in each denoising step as a condition. In addition, we present a mechanism called Lookup-based MultiScale Feature Interaction that determines the correlations between local joints and global contexts across multiple scales. This mechanism generates delicate representations that focus on keypoint regions. Altogether, by extending diffusion models, we show two unique characteristics from DiffPose on pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model. DiffPose sets new state-of-the-art results on three benchmarks: PoseTrack2017, PoseTrack2018, and PoseTrack21.
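For readers unfamiliar with conditional diffusion, the sketch below shows the standard DDPM-style training step that such a formulation builds on: noise the ground-truth heatmap, then train a denoiser conditioned on visual features to predict the noise. The schedule values and the `denoiser` interface are placeholders, not DiffPose's actual modules:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(denoiser, heatmap_gt, cond_feats):
    """heatmap_gt: (B, J, H, W); cond_feats: aggregated per-frame evidence."""
    b = heatmap_gt.shape[0]
    t = torch.randint(0, T, (b,))
    a_bar = alphas_bar[t].view(b, 1, 1, 1)
    noise = torch.randn_like(heatmap_gt)
    noisy = a_bar.sqrt() * heatmap_gt + (1 - a_bar).sqrt() * noise
    pred = denoiser(noisy, t, cond_feats)        # predict the injected noise
    return F.mse_loss(pred, noise)
```

At inference, running the reverse process from several noise seeds yields multiple pose hypotheses that can be combined, and the number of denoising steps can be varied freely, which matches the two properties highlighted above.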
Guiding Image Captioning Models Toward More Specific Captions
results: The authors find that this guidance approach improves the specificity and reference-free quality of the generated captions, at the cost of some degradation on standard reference-based captioning metrics.
Abstract
Image captioning is conventionally formulated as the task of generating captions for images that match the distribution of reference image-caption pairs. However, reference captions in standard captioning datasets are short and may not uniquely identify the images they describe. These problems are further exacerbated when models are trained directly on image-alt text pairs collected from the internet. In this work, we show that it is possible to generate more specific captions with minimal changes to the training process. We implement classifier-free guidance for an autoregressive captioning model by fine-tuning it to estimate both conditional and unconditional distributions over captions. The guidance scale applied at decoding controls a trade-off between maximizing $p(\mathrm{caption}|\mathrm{image})$ and $p(\mathrm{image}|\mathrm{caption})$. Compared to standard greedy decoding, decoding with a guidance scale of 2 substantially improves reference-free metrics such as CLIPScore (0.808 vs. 0.775) and caption$\to$image retrieval performance in the CLIP embedding space (recall@1 44.6% vs. 26.5%), but worsens standard reference-based captioning metrics (e.g., CIDEr 78.6 vs 126.1). We further explore the use of language models to guide the decoding process, obtaining small improvements over the Pareto frontier of reference-free vs. reference-based captioning metrics that arises from classifier-free guidance, and substantially improving the quality of captions generated from a model trained only on minimally curated web data.
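The decoding-time combination is simple to sketch. Below is a minimal version of classifier-free guidance for next-token logits, with the `model` interface as a placeholder; setting the guidance scale `s` to 1 recovers standard decoding:

```python
import torch

def guided_next_token_logits(model, tokens, image, s: float = 2.0) -> torch.Tensor:
    cond = model(tokens, image=image)    # conditional logits: p(token | caption, image)
    uncond = model(tokens, image=None)   # unconditional logits: p(token | caption)
    # s > 1 upweights image dependence, trading reference-based metrics
    # (e.g., CIDEr) for specificity and retrieval performance (e.g., CLIPScore).
    return uncond + s * (cond - uncond)
```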
Conditioning Generative Latent Optimization to solve Imaging Inverse Problems
results: Compared with state-of-the-art score-based strategies, the method achieves better reconstruction quality in most data regimes, requires no backward operator, and could extend to non-linear IIPs.
Abstract
Computed Tomography (CT) is a prominent example of Imaging Inverse Problem (IIP), highlighting the unrivalled performances of data-driven methods in degraded measurements setups like sparse X-ray projections. Although a significant proportion of deep learning approaches benefit from large supervised datasets to directly map experimental measurements to medical scans, they cannot generalize to unknown acquisition setups. In contrast, fully unsupervised techniques, most notably using score-based generative models, have recently demonstrated similar or better performances compared to supervised approaches to solve IIPs while being flexible at test time regarding the imaging setup. However, their use cases are limited by two factors: (a) they need considerable amounts of training data to have good generalization properties and (b) they require a backward operator, like Filtered-Back-Projection in the case of CT, to condition the learned prior distribution of medical scans to experimental measurements. To overcome these issues, we propose an unsupervised conditional approach to the Generative Latent Optimization framework (cGLO), in which the parameters of a decoder network are initialized on an unsupervised dataset. The decoder is then used for reconstruction purposes, by performing Generative Latent Optimization with a loss function directly comparing simulated measurements from proposed reconstructions to experimental measurements. The resulting approach, tested on sparse-view CT using multiple training dataset sizes, demonstrates better reconstruction quality compared to state-of-the-art score-based strategies in most data regimes and shows an increasing performance advantage for smaller training datasets and reduced projection angles. Furthermore, cGLO does not require any backward operator and could expand use cases even to non-linear IIPs.
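A minimal sketch of the reconstruction loop, under assumptions: `decoder` is a generator whose parameters were initialized on an unsupervised dataset, and `forward_op` is the known measurement model (e.g., a sparse-view projection operator). Only the forward operator is needed; both the latent code and the decoder weights are optimized so that simulated measurements match the experimental ones:

```python
import torch

def cglo_reconstruct(decoder, forward_op, measurements,
                     latent_dim: int = 64, steps: int = 500, lr: float = 1e-2):
    z = torch.zeros(1, latent_dim, requires_grad=True)      # latent code to optimize
    opt = torch.optim.Adam([z, *decoder.parameters()], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = decoder(z)                                  # candidate scan
        loss = ((forward_op(recon) - measurements) ** 2).mean()
        loss.backward()
        opt.step()
    return decoder(z).detach()
```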
Domain Adaptation for Medical Image Segmentation using Transformation-Invariant Self-Training
results: Experiments show that the proposed transformation-invariant self-training (TI-ST) method effectively mitigates the lack of target-domain annotation and boosts segmentation performance in the target domain.
Abstract
Models capable of leveraging unlabelled data are crucial in overcoming large distribution gaps between the acquired datasets across different imaging devices and configurations. In this regard, self-training techniques based on pseudo-labeling have been shown to be highly effective for semi-supervised domain adaptation. However, the unreliability of pseudo labels can hinder the capability of self-training techniques to induce abstract representation from the unlabeled target dataset, especially in the case of large distribution gaps. Since the neural network performance should be invariant to image transformations, we look to this fact to identify uncertain pseudo labels. Indeed, we argue that transformation invariant detections can provide more reasonable approximations of ground truth. Accordingly, we propose a semi-supervised learning strategy for domain adaptation termed transformation-invariant self-training (TI-ST). The proposed method assesses pixel-wise pseudo-labels' reliability and filters out unreliable detections during self-training. We perform comprehensive evaluations for domain adaptation using three different modalities of medical images, two different network architectures, and several alternative state-of-the-art domain adaptation methods. Experimental results confirm the superiority of our proposed method in mitigating the lack of target domain annotation and boosting segmentation performance in the target domain.
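To illustrate the filtering idea, here is a hedged sketch in which horizontal flipping stands in for the transformations, and agreement between views marks a pseudo-label as reliable; the confidence threshold and the choice of transformation are assumptions:

```python
import torch

@torch.no_grad()
def ti_pseudo_labels(model, image, conf_thresh: float = 0.9) -> torch.Tensor:
    """Returns per-pixel pseudo-labels; unreliable pixels get ignore index -1."""
    probs = model(image).softmax(dim=1)                        # (B, C, H, W)
    flipped = model(torch.flip(image, dims=[-1])).softmax(dim=1)
    probs_back = torch.flip(flipped, dims=[-1])                # undo the flip
    conf, label = probs.max(dim=1)
    conf2, label2 = probs_back.max(dim=1)
    # Keep a pixel only if both views agree on the class and are confident.
    keep = (label == label2) & (conf > conf_thresh) & (conf2 > conf_thresh)
    label[~keep] = -1
    return label
```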
CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification
results: Extensive experiments on the MS-COCO, PASCAL VOC 2007, PASCAL VOC 2012, and NUS datasets show state-of-the-art performance among unsupervised methods, even achieving results comparable to weakly supervised classification methods.
Abstract
This paper presents a CLIP-based unsupervised learning method for annotation-free multi-label image classification, including three stages: initialization, training, and inference. At the initialization stage, we take full advantage of the powerful CLIP model and propose a novel approach to extend CLIP for multi-label predictions based on global-local image-text similarity aggregation. To be more specific, we split each image into snippets and leverage CLIP to generate the similarity vector for the whole image (global) as well as each snippet (local). Then a similarity aggregator is introduced to leverage the global and local similarity vectors. Using the aggregated similarity scores as the initial pseudo labels at the training stage, we propose an optimization framework to train the parameters of the classification network and refine pseudo labels for unobserved labels. During inference, only the classification network is used to predict the labels of the input image. Extensive experiments show that our method outperforms state-of-the-art unsupervised methods on MS-COCO, PASCAL VOC 2007, PASCAL VOC 2012, and NUS datasets and even achieves comparable results to weakly supervised classification methods.
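The global-local initialization stage can be sketched roughly as follows. The encoder is a stand-in for any CLIP-style image encoder sharing an embedding space with the class-prompt text embeddings, and the grid split, max-pooling over snippets, and equal-weight averaging are illustrative assumptions rather than the paper's exact aggregator:

```python
import torch

def aggregate_similarities(image, class_text_embs, encode_image, grid: int = 3):
    """image: (3, H, W); class_text_embs: (C, D) normalized prompt embeddings."""
    global_sim = class_text_embs @ encode_image(image)       # (C,) whole image
    hs, ws = image.shape[1] // grid, image.shape[2] // grid
    local = []
    for i in range(grid):                                    # score each snippet
        for j in range(grid):
            snip = image[:, i*hs:(i+1)*hs, j*ws:(j+1)*ws]
            local.append(class_text_embs @ encode_image(snip))
    local_sim = torch.stack(local).max(dim=0).values         # best snippet per class
    return 0.5 * (global_sim + local_sim)                    # initial pseudo-label scores
```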
Can Self-Supervised Representation Learning Methods Withstand Distribution Shifts and Corruptions?
results: The study finds that self-supervised learning methods are clearly sensitive to distribution shifts and image corruptions, with higher levels of shift and corruption significantly diminishing the robustness of the learned representations. These findings underscore the need to handle distribution shifts and image corruptions effectively in practice, to ensure the stability and reliability of self-supervised methods.
Abstract
Self-supervised learning in computer vision aims to leverage the inherent structure and relationships within data to learn meaningful representations without explicit human annotation, enabling a holistic understanding of visual scenes. Robustness in vision machine learning ensures reliable and consistent performance, enhancing generalization, adaptability, and resistance to noise, variations, and adversarial attacks. Self-supervised paradigms, namely contrastive learning, knowledge distillation, mutual information maximization, and clustering, have been considered to have shown advances in invariant learning representations. This work investigates the robustness of learned representations of self-supervised learning approaches focusing on distribution shifts and image corruptions in computer vision. Detailed experiments have been conducted to study the robustness of self-supervised learning methods on distribution shifts and image corruptions. The empirical analysis demonstrates a clear relationship between the performance of learned representations within self-supervised paradigms and the severity of distribution shifts and corruptions. Notably, higher levels of shifts and corruptions are found to significantly diminish the robustness of the learned representations. These findings highlight the critical impact of distribution shifts and image corruptions on the performance and resilience of self-supervised learning methods, emphasizing the need for effective strategies to mitigate their adverse effects. The study strongly advocates for future research in the field of self-supervised representation learning to prioritize the key aspects of safety and robustness in order to ensure practical applicability. The source code and results are available on GitHub.
Detecting diabetic retinopathy severity through fundus images using an ensemble of classifiers
paper_authors: Eduard Popescu, Adrian Groza, Ioana Damian
for: The paper is written for diagnosing diabetic retinopathy and determining its severity levels.
methods: The paper proposes a method for detecting diabetic retinopathy using fundus images, which includes data preprocessing, image segmentation, and an ensemble of classifiers.
results: The paper achieves high accuracy in detecting diabetic retinopathy and its severity levels using the proposed method.
Abstract
Diabetic retinopathy is an ocular condition that affects individuals with diabetes mellitus. It is a common complication of diabetes that can impact the eyes and lead to vision loss. One method for diagnosing diabetic retinopathy is the examination of the fundus of the eye. An ophthalmologist examines the back part of the eye, including the retina, optic nerve, and the blood vessels that supply the retina. In the case of diabetic retinopathy, the blood vessels in the retina deteriorate and can lead to bleeding, swelling, and other changes that affect vision. We proposed a method for detecting diabetic retinopathy severity levels. First, a set of data-preprocessing steps is applied to available data: adaptive equalisation, color normalisation, Gaussian filter, removal of the optic disc and blood vessels. Second, we perform image segmentation for relevant markers and extract features from the fundus images. Third, we apply an ensemble of classifiers and we assess the trust in the system.
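The first-stage preprocessing can be sketched with OpenCV as below; the CLAHE clip limit, kernel size, and per-channel standardization are illustrative assumptions, and optic disc/vessel removal (typically a separate segmentation step) is omitted:

```python
import cv2
import numpy as np

def preprocess_fundus(bgr: np.ndarray) -> np.ndarray:
    # Adaptive equalisation: CLAHE on the luminance channel.
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    equalised = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)
    # Colour normalisation: per-channel zero mean, unit variance.
    img = equalised.astype(np.float32)
    img = (img - img.mean(axis=(0, 1))) / (img.std(axis=(0, 1)) + 1e-6)
    # Gaussian filter to suppress sensor noise.
    return cv2.GaussianBlur(img, (5, 5), 0)
```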
Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics
results: Experiments show that the method effectively segments sounding objects without being biased toward the merely salient objects in the dataset.
Abstract
The audio-visual segmentation (AVS) task aims to segment sounding objects from a given video. Existing works mainly focus on fusing audio and visual features of a given video to achieve sounding object masks. However, we observed that prior arts are prone to segment a certain salient object in a video regardless of the audio information. This is because sounding objects are often the most salient ones in the AVS dataset. Thus, current AVS methods might fail to localize genuine sounding objects due to the dataset bias. In this work, we present an audio-visual instance-aware segmentation approach to overcome the dataset bias. In a nutshell, our method first localizes potential sounding objects in a video by an object segmentation network, and then associates the sounding object candidates with the given audio. We notice that an object could be a sounding object in one video but a silent one in another video. This would bring ambiguity in training our object segmentation network as only sounding objects have corresponding segmentation masks. We thus propose a silent object-aware segmentation objective to alleviate the ambiguity. Moreover, since the category information of audio is unknown, especially for multiple sounding sources, we propose to explore the audio-visual semantic correlation and then associate audio with potential objects. Specifically, we attend predicted audio category scores to potential instance masks and these scores will highlight corresponding sounding instances while suppressing inaudible ones. When we enforce the attended instance masks to resemble the ground-truth mask, we are able to establish audio-visual semantics correlation. Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased to salient objects.
FULLER: Unified Multi-modality Multi-task 3D Perception via Multi-level Gradient Calibration
results: Experiments on the large-scale nuScenes benchmark show that the proposed method improves map segmentation and 3D detection accuracy, furthering the application of multi-modality multi-task learning to 3D autonomous driving.
Abstract
Multi-modality fusion and multi-task learning are becoming trendy in 3D autonomous driving scenarios, considering robust prediction and computation budget. However, naively extending the existing framework to the domain of multi-modality multi-task learning remains ineffective and even harmful due to the notorious modality bias and task conflict. Previous works manually coordinate the learning framework with empirical knowledge, which may lead to sub-optimal solutions. To mitigate the issue, we propose a novel yet simple multi-level gradient calibration learning framework across tasks and modalities during optimization. Specifically, the gradients, produced by the task heads and used to update the shared backbone, will be calibrated at the backbone's last layer to alleviate the task conflict. Before the calibrated gradients are further propagated to the modality branches of the backbone, their magnitudes will be calibrated again to the same level, ensuring the downstream tasks pay balanced attention to different modalities. Experiments on the large-scale benchmark nuScenes demonstrate the effectiveness of the proposed method, e.g., an absolute 14.4% mIoU improvement on map segmentation and 1.4% mAP improvement on 3D detection, advancing the application of 3D autonomous driving in the domain of multi-modality fusion and multi-task learning. We also discuss the links between modalities and tasks.
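The magnitude-calibration step can be sketched as below: per-task gradients with respect to a shared parameter are rescaled to a common norm before being summed, so no task dominates the update. Using the mean norm as the common level is an assumption for illustration:

```python
import torch

def calibrated_grad(task_losses, shared_param) -> torch.Tensor:
    """Combine per-task gradients on a shared parameter at equal magnitude."""
    grads = [torch.autograd.grad(loss, shared_param, retain_graph=True)[0]
             for loss in task_losses]
    norms = torch.stack([g.norm() for g in grads])
    target = norms.mean()                        # common magnitude level
    return sum(g * (target / (n + 1e-12)) for g, n in zip(grads, norms))
```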
Sampling to Distill: Knowledge Transfer from Open-World Data
results: Extensive experiments on the CIFAR-10, CIFAR-100, NYUv2, and ImageNet datasets show state-of-the-art performance; on ImageNet in particular, accuracy improves by 1.50%-9.59% over existing results.
Abstract
Data-Free Knowledge Distillation (DFKD) is a novel task that aims to train high-performance student models using only the teacher network without original training data. Despite encouraging results, existing DFKD methods rely heavily on generation modules with high computational costs. Meanwhile, they ignore the fact that the generated and original data exist domain shifts due to the lack of supervision information. Moreover, knowledge is transferred through each example, ignoring the implicit relationship among multiple examples. To this end, we propose a novel Open-world Data Sampling Distillation (ODSD) method without a redundant generation process. First, we try to sample open-world data close to the original data's distribution by an adaptive sampling module. Then, we introduce a low-noise representation to alleviate the domain shifts and build a structured relationship of multiple data examples to exploit data knowledge. Extensive experiments on CIFAR-10, CIFAR-100, NYUv2, and ImageNet show that our ODSD method achieves state-of-the-art performance. Especially, we improve 1.50\%-9.59\% accuracy on the ImageNet dataset compared with the existing results.
SAMFlow: Eliminating Any Fragmentation in Optical Flow with Segment Anything Model
paper_authors: Shili Zhou, Ruian He, Weimin Tan, Bo Yan
for: The method addresses the tendency of existing optical flow approaches to rely too heavily on local clues, improving object perception through a pre-trained large vision model, the Segment Anything Model (SAM).
methods: The method embeds the frozen SAM image encoder into FlowFormer and proposes an Optical Flow Task-Specific Adaption scheme to adapt SAM to a non-segmentation task.
results: The model reaches 0.86/2.10 clean/final EPE on the Sintel training set and 3.55/12.32 EPE/F1-all on the KITTI-15 training set, surpassing FlowFormer by 8.5%/9.9% and 13.2%/16.3%. It also achieves state-of-the-art performance on the Sintel and KITTI-15 test benchmarks, ranking #1 among all two-frame methods on the Sintel clean pass.
Abstract
Optical Flow Estimation aims to find the 2D dense motion field between two frames. Due to the limitation of model structures and training datasets, existing methods often rely too much on local clues and ignore the integrity of objects, resulting in fragmented motion estimation. Through theoretical analysis, we find the pre-trained large vision models are helpful in optical flow estimation, and we notice that the recently famous Segment Anything Model (SAM) demonstrates a strong ability to segment complete objects, which is suitable for solving the fragmentation problem. We thus propose a solution to embed the frozen SAM image encoder into FlowFormer to enhance object perception. To address the challenge of in-depth utilizing SAM in non-segmentation tasks like optical flow estimation, we propose an Optical Flow Task-Specific Adaption scheme, including a Context Fusion Module to fuse the SAM encoder with the optical flow context encoder, and a Context Adaption Module to adapt the SAM features for optical flow task with Learned Task-Specific Embedding. Our proposed SAMFlow model reaches 0.86/2.10 clean/final EPE and 3.55/12.32 EPE/F1-all on Sintel and KITTI-15 training set, surpassing Flowformer by 8.5%/9.9% and 13.2%/16.3%. Furthermore, our model achieves state-of-the-art performance on the Sintel and KITTI-15 benchmarks, ranking #1 among all two-frame methods on Sintel clean pass.
Audio-visual video-to-speech synthesis with synthesized input audio
paper_authors: Triantafyllos Kefalas, Yannis Panagakis, Maja Pantic
for: investigate the effect of using video and audio inputs for video-to-speech synthesis during both training and inference.
methods: use pre-trained video-to-speech models to synthesize the missing speech signals and then train an audio-visual-to-speech synthesis model using both the silent video and the synthesized speech as inputs.
results: The approach is successful with both raw waveforms and mel spectrograms as target outputs.
Abstract
Video-to-speech synthesis involves reconstructing the speech signal of a speaker from a silent video. The implicit assumption of this task is that the sound signal is either missing or contains a high amount of noise/corruption such that it is not useful for processing. Previous works in the literature either use video inputs only or employ both video and audio inputs during training, and discard the input audio pathway during inference. In this work we investigate the effect of using video and audio inputs for video-to-speech synthesis during both training and inference. In particular, we use pre-trained video-to-speech models to synthesize the missing speech signals and then train an audio-visual-to-speech synthesis model, using both the silent video and the synthesized speech as inputs, to predict the final reconstructed speech. Our experiments demonstrate that this approach is successful with both raw waveforms and mel spectrograms as target outputs.
Contrastive Conditional Latent Diffusion for Audio-visual Segmentation
results: Experiments show that the solution effectively improves performance on the AVS task. Code and results are online via the project page: https://github.com/OpenNLPLab/DiffusionAVS.
Abstract
We propose a latent diffusion model with contrastive learning for audio-visual segmentation (AVS) to extensively explore the contribution of audio. We interpret AVS as a conditional generation task, where audio is defined as the conditional variable for sound producer(s) segmentation. With our new interpretation, it is especially necessary to model the correlation between audio and the final segmentation map to ensure its contribution. We introduce a latent diffusion model to our framework to achieve semantic-correlated representation learning. Specifically, our diffusion model learns the conditional generation process of the ground-truth segmentation map, leading to ground-truth aware inference when we perform the denoising process at the test stage. As a conditional diffusion model, we argue it is essential to ensure that the conditional variable contributes to model output. We then introduce contrastive learning to our framework to learn audio-visual correspondence, which is proven consistent with maximizing the mutual information between model prediction and the audio data. In this way, our latent diffusion model via contrastive learning explicitly maximizes the contribution of audio for AVS. Experimental results on the benchmark dataset verify the effectiveness of our solution. Code and results are online via our project page: https://github.com/OpenNLPLab/DiffusionAVS.
results: Based on these observations, the study proposes an ensemble attack that achieves more effective attacks with higher transferability. Experiments show that the ensemble attack reaches higher attack success rates across different semantic segmentation models.
Abstract
We analyze the performance of semantic segmentation models with respect to adversarial attacks, and observe that the adversarial examples generated from a source model fail to attack the target models, i.e., conventional attack methods such as PGD and FGSM do not transfer well to target models, making it necessary to study transferable attacks, especially transferable attacks for semantic segmentation. We find two main factors for achieving a transferable attack. Firstly, the attack should come with effective data augmentation and translation-invariant features to deal with unseen models. Secondly, stabilized optimization strategies are needed to find the optimal attack direction. Based on the above observations, we propose an ensemble attack for semantic segmentation to achieve more effective attacks with higher transferability. The source code and experimental results are publicly available via our project page: https://github.com/anucvers/TASS.
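A hedged sketch of what such an ensemble attack could look like: an iterative FGSM-style update that averages the segmentation loss over several source models, with a random rescaling per step as a simple stand-in for the data augmentation described above. Step sizes, the augmentation choice, and the loss are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def ensemble_attack(models, image, target_mask,
                    eps=8/255, alpha=2/255, steps=10):
    """image: (B, 3, H, W) in [0, 1]; target_mask: (B, H, W) long labels."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        scale = 0.9 + 0.2 * torch.rand(1).item()     # random input rescaling
        x = F.interpolate(adv, scale_factor=scale, mode='bilinear',
                          align_corners=False)
        y = F.interpolate(target_mask.unsqueeze(1).float(), scale_factor=scale,
                          mode='nearest').squeeze(1).long()
        loss = sum(F.cross_entropy(m(x), y) for m in models) / len(models)
        grad = torch.autograd.grad(loss, adv)[0]
        step = adv.detach() + alpha * grad.sign()    # maximize the shared loss
        adv = torch.max(torch.min(step, image + eps), image - eps).clamp(0, 1)
    return adv
```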
Towards Unbalanced Motion: Part-Decoupling Network for Video Portrait Segmentation
results: Experiments show that PDNet achieves leading accuracy and reliability for video portrait segmentation compared with state-of-the-art methods.
Abstract
Video portrait segmentation (VPS), aiming at segmenting prominent foreground portraits from video frames, has received much attention in recent years. However, the simplicity of existing VPS datasets leads to a limitation on extensive research of the task. In this work, we propose a new intricate large-scale Multi-scene Video Portrait Segmentation dataset MVPS consisting of 101 video clips in 7 scenario categories, in which 10,843 sampled frames are finely annotated at pixel level. The dataset has diverse scenes and complicated background environments, which makes it the most complex dataset in VPS to our best knowledge. Through the observation of a large number of videos with portraits during dataset construction, we find that due to the joint structure of the human body, motion of portraits is part-associated, which means that different parts are relatively independent in motion. That is, motion of different parts of the portraits is unbalanced. Given this imbalance, an intuitive and reasonable idea is that different motion states in portraits can be better exploited by decoupling the portraits into parts. To achieve this, we propose a Part-Decoupling Network (PDNet) for video portrait segmentation. Specifically, an Inter-frame Part-Discriminated Attention (IPDA) module is proposed which unsupervisedly segments portraits into parts and applies different attentiveness to the discriminative features specific to each part. In this way, appropriate attention can be imposed on portrait parts with unbalanced motion to extract part-discriminated correlations, so that the portraits can be segmented more accurately. Experimental results demonstrate that our method achieves leading performance in comparison with state-of-the-art methods.
Simultaneous column-based deep learning progression analysis of atrophy associated with AMD in longitudinal OCT studies
paper_authors: Adi Szeskin, Roei Yehuda, Or Shmueli, Jaime Levy, Leo Joskowicz
for: The paper aims to accurately quantify retinal atrophy changes associated with dry age-related macular degeneration (AMD) on longitudinal OCT studies.
methods: The proposed method uses a novel simultaneous multi-channel column-based deep learning model trained on registered pairs of OCT scans to detect and segment retinal atrophy segments in consecutive OCT scans.
results: The proposed method achieved a mean atrophy segment detection precision of 0.90±0.09 and a recall of 0.95±0.06, outperforming standalone classification methods by 30±62% and 27±0% for atrophy segments and lesions.
Abstract
Purpose: Disease progression of retinal atrophy associated with AMD requires the accurate quantification of the retinal atrophy changes on longitudinal OCT studies. It is based on finding, comparing, and delineating subtle atrophy changes on consecutive pairs (prior and current) of unregistered OCT scans. Methods: We present a fully automatic end-to-end pipeline for the simultaneous detection and quantification of time-related atrophy changes associated with dry AMD in pairs of OCT scans of a patient. It uses a novel simultaneous multi-channel column-based deep learning model trained on registered pairs of OCT scans that concurrently detects and segments retinal atrophy segments in consecutive OCT scans by classifying light scattering patterns in matched pairs of vertical pixel-wide columns (A-scans) in registered prior and current OCT slices (B-scans). Results: Experimental results on 4,040 OCT slices with 5.2M columns from 40 scan pairs of 18 patients (66% training/validation, 33% testing), acquired 24.13±14.0 months apart, in which Complete RPE and Outer Retinal Atrophy (cRORA) was identified in 1,998 OCT slices (735 atrophy lesions from 3,732 segments, 0.45M columns), yield a mean atrophy segment detection precision and recall of 0.90±0.09 and 0.95±0.06, and 0.74±0.18 and 0.94±0.12 for atrophy lesions, with AUC=0.897, all above observer variability. Simultaneous classification outperforms standalone classification precision and recall by 30±62% and 27±0% for atrophy segments and lesions. Conclusions: Simultaneous column-based detection and quantification of retinal atrophy changes associated with AMD is accurate and outperforms standalone classification methods. Translational relevance: an automatic and efficient way to detect and quantify retinal atrophy changes associated with AMD.
Uncertainty-Guided Spatial Pruning Architecture for Efficient Frame Interpolation
results: Compared with the baseline without pruning, the method reduces FLOPs by 34%/52%/30% on the Vimeo90K/UCF101/MiddleBury datasets while maintaining performance, and achieves state-of-the-art results with lower FLOPs on multiple benchmarks.
Abstract
The video frame interpolation (VFI) model applies the convolution operation to all locations, leading to redundant computations in regions with easy motion. We can use dynamic spatial pruning method to skip redundant computation, but this method cannot properly identify easy regions in VFI tasks without supervision. In this paper, we develop an Uncertainty-Guided Spatial Pruning (UGSP) architecture to skip redundant computation for efficient frame interpolation dynamically. Specifically, pixels with low uncertainty indicate easy regions, where the calculation can be reduced without bringing undesirable visual results. Therefore, we utilize uncertainty-generated mask labels to guide our UGSP in properly locating the easy region. Furthermore, we propose a self-contrast training strategy that leverages an auxiliary non-pruning branch to improve the performance of our UGSP. Extensive experiments show that UGSP maintains performance but reduces FLOPs by 34%/52%/30% compared to baseline without pruning on Vimeo90K/UCF101/MiddleBury datasets. In addition, our method achieves state-of-the-art performance with lower FLOPs on multiple benchmarks.
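A toy sketch of the gating idea: a per-pixel uncertainty map decides which locations the expensive refinement branch handles, while easy regions keep the cheap result. The threshold is a stand-in for the paper's uncertainty-generated mask labels, and a real implementation would apply the refinement sparsely rather than densely as here:

```python
import torch

def prune_and_refine(coarse_frame, uncertainty, refine_fn, thresh: float = 0.2):
    """coarse_frame: (B, 3, H, W) cheap estimate; uncertainty: (B, 1, H, W)."""
    mask = (uncertainty > thresh).float()       # 1 = hard region, refine it
    refined = refine_fn(coarse_frame)           # sparse conv in a real system
    out = mask * refined + (1 - mask) * coarse_frame
    skipped = 1.0 - mask.mean().item()          # fraction of pixels pruned
    return out, skipped
```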
Towards General Visual-Linguistic Face Forgery Detection
paper_authors: Ke Sun, Shen Chen, Taiping Yao, Xiaoshuai Sun, Shouhong Ding, Rongrong Ji
for: The paper proposes a new face forgery detection approach to address concerns over security, privacy, and trust.
methods: The method follows the Visual-Linguistic Face Forgery Detection (VLFFD) paradigm, which uses fine-grained sentence-level prompts as annotation. VLFFD first generates mixed forgery images with corresponding fine-grained prompts via a Prompt Forgery Image Generator (PFIG), then jointly trains on the fine-grained mixed data and coarse-grained original data with a Coarse-and-Fine Co-training framework.
results: Experiments show that the proposed method improves existing detection models on several challenging benchmarks.
Abstract
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust. Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model. We argue that such supervision lacks semantic information and interpretability. To address these issues, in this paper, we propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation. Since text annotations are not available in current deepfake datasets, VLFFD first generates the mixed forgery image with corresponding fine-grained prompts via the Prompt Forgery Image Generator (PFIG). Then, the fine-grained mixed data and the coarse-grained original data are jointly trained with the Coarse-and-Fine Co-training framework (C2F), enabling the model to gain more generalization and interpretability. The experiments show the proposed method improves the existing detection models on several challenging benchmarks.
On Transferability of Driver Observation Models from Simulated to Real Environments in Autonomous Cars
methods: The paper records a dataset under actual autonomous driving conditions and employs the Inflated 3D ConvNet (I3D) model with Gradient-weighted Class Activation Mapping (Grad-CAM) for detailed analysis of model decision-making.
results: Although the simulator-trained model performs well, its average recognition accuracy drops from 85.7% to 46.6% under real driving conditions, with strong variations across behavior classes. This underscores the challenges of model transferability and motivates research on more robust driver observation systems capable of dealing with real driving conditions.
Abstract
For driver observation frameworks, clean datasets collected in controlled simulated environments often serve as the initial training ground. Yet, when deployed under real driving conditions, such simulator-trained models quickly face the problem of distributional shifts brought about by changing illumination, car model, variations in subject appearances, sensor discrepancies, and other environmental alterations. This paper investigates the viability of transferring video-based driver observation models from simulation to real-world scenarios in autonomous vehicles, given the frequent use of simulation data in this domain due to safety issues. To achieve this, we record a dataset featuring actual autonomous driving conditions and involving seven participants engaged in highly distracting secondary activities. To enable direct SIM to REAL transfer, our dataset was designed in accordance with an existing large-scale simulator dataset used as the training source. We utilize the Inflated 3D ConvNet (I3D) model, a popular choice for driver observation, with Gradient-weighted Class Activation Mapping (Grad-CAM) for detailed analysis of model decision-making. Though the simulator-based model clearly surpasses the random baseline, its recognition quality diminishes, with average accuracy dropping from 85.7% to 46.6%. We also observe strong variations across different behavior classes. This underscores the challenges of model transferability, facilitating our research of more robust driver observation systems capable of dealing with real driving conditions.
Echoes Beyond Points: Unleashing the Power of Raw Radar Data in Multi-modality Fusion
results: Compared with existing methods, the approach better exploits the rich and lossless distance and speed clues in raw radar echoes and deeply fuses them with the rich semantic clues in images, surpassing all existing methods on the RADIal dataset and approaching LiDAR performance.
Abstract
Radar is ubiquitous in autonomous driving systems due to its low cost and good adaptability to bad weather. Nevertheless, the radar detection performance is usually inferior because its point cloud is sparse and not accurate due to the poor azimuth and elevation resolution. Moreover, point cloud generation algorithms already drop weak signals to reduce the false targets which may be suboptimal for the use of deep fusion. In this paper, we propose a novel method named EchoFusion to skip the existing radar signal processing pipeline and then incorporate the radar raw data with other sensors. Specifically, we first generate the Bird's Eye View (BEV) queries and then take corresponding spectrum features from radar to fuse with other sensors. By this approach, our method could utilize both rich and lossless distance and speed clues from radar echoes and rich semantic clues from images, making our method surpass all existing methods on the RADIal dataset, and approach the performance of LiDAR. Codes will be available upon acceptance.
Deep Learning and Computer Vision for Glaucoma Detection: A Review
results: Through rigorous benchmarking on widely used public datasets, the survey reveals performance gaps in generalizability, uncertainty estimation, and multimodal integration across methods, and also highlights shortcomings and limitations of the datasets.
Abstract
Glaucoma is the leading cause of irreversible blindness worldwide and poses significant diagnostic challenges due to its reliance on subjective evaluation. However, recent advances in computer vision and deep learning have demonstrated the potential for automated assessment. In this paper, we survey recent studies on AI-based glaucoma diagnosis using fundus, optical coherence tomography, and visual field images, with a particular emphasis on deep learning-based methods. We provide an updated taxonomy that organizes methods into architectural paradigms and includes links to available source code to enhance the reproducibility of the methods. Through rigorous benchmarking on widely-used public datasets, we reveal performance gaps in generalizability, uncertainty estimation, and multimodal integration. Additionally, our survey curates key datasets while highlighting limitations such as scale, labeling inconsistencies, and bias. We outline open research challenges and detail promising directions for future studies. This survey is expected to be useful for both AI researchers seeking to translate advances into practice and ophthalmologists aiming to improve clinical workflows and diagnosis using the latest AI outcomes.
Digging Into Uncertainty-based Pseudo-label for Robust Stereo Matching
results: Experiments show strong cross-domain, adaptation, and joint generalization performance, and the method obtained 1st place on the stereo task of the Robust Vision Challenge 2020. In addition, the uncertainty-based pseudo-labels can be extended to train monocular depth estimation networks in an unsupervised way, achieving performance comparable to supervised methods. The code will be available at https://github.com/gallenszl/UCFNet.
Abstract
Due to the domain differences and unbalanced disparity distribution across multiple datasets, current stereo matching approaches are commonly limited to a specific dataset and generalize poorly to others. Such domain shift issue is usually addressed by substantial adaptation on costly target-domain ground-truth data, which cannot be easily obtained in practical settings. In this paper, we propose to dig into uncertainty estimation for robust stereo matching. Specifically, to balance the disparity distribution, we employ a pixel-level uncertainty estimation to adaptively adjust the next stage disparity searching space, in this way driving the network progressively prune out the space of unlikely correspondences. Then, to solve the limited ground truth data, an uncertainty-based pseudo-label is proposed to adapt the pre-trained model to the new domain, where pixel-level and area-level uncertainty estimation are proposed to filter out the high-uncertainty pixels of predicted disparity maps and generate sparse while reliable pseudo-labels to align the domain gap. Experimentally, our method shows strong cross-domain, adapt, and joint generalization and obtains \textbf{1st} place on the stereo task of Robust Vision Challenge 2020. Additionally, our uncertainty-based pseudo-labels can be extended to train monocular depth estimation networks in an unsupervised way and even achieves comparable performance with the supervised methods. The code will be available at https://github.com/gallenszl/UCFNet.
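A minimal sketch of the pseudo-label filtering idea follows: pixel-level uncertainty thresholding plus an area-level check that discards windows with too few confident pixels. The thresholds, window size, and the exact form of the area-level test are assumptions, not the paper's settings.

```python
import torch

def make_pseudo_labels(disparity, uncertainty, pixel_thresh=0.3,
                       area_thresh=0.5, window=32):
    """Sketch of uncertainty-based pseudo-label filtering (thresholds are
    illustrative). disparity, uncertainty: (H, W) tensors.
    Returns a sparse pseudo-label map and its validity mask."""
    valid = uncertainty < pixel_thresh                     # pixel-level filter
    H, W = valid.shape
    # Area-level filter: drop windows whose fraction of confident pixels is low.
    for y in range(0, H, window):
        for x in range(0, W, window):
            patch = valid[y:y + window, x:x + window]
            if patch.float().mean() < area_thresh:
                valid[y:y + window, x:x + window] = False
    pseudo = torch.where(valid, disparity, torch.zeros_like(disparity))
    return pseudo, valid
```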
Towards General Low-Light Raw Noise Synthesis and Modeling
results: The synthesized low-light raw noise closely matches the distribution of real noise, and the method generalizes across different camera sensors; extensive denoising experiments show that it performs favorably against state-of-the-art methods on different sensors.
Abstract
Modeling and synthesizing low-light raw noise is a fundamental problem for computational photography and image processing applications. Although most recent works have adopted physics-based models to synthesize noise, the signal-independent noise in low-light conditions is far more complicated and varies dramatically across camera sensors, which is beyond the description of these models. To address this issue, we introduce a new perspective to synthesize the signal-independent noise by a generative model. Specifically, we synthesize the signal-dependent and signal-independent noise in a physics- and learning-based manner, respectively. In this way, our method can be considered as a general model, that is, it can simultaneously learn different noise characteristics for different ISO levels and generalize to various sensors. Subsequently, we present an effective multi-scale discriminator termed Fourier transformer discriminator (FTD) to distinguish the noise distribution accurately. Additionally, we collect a new low-light raw denoising (LRD) dataset for training and benchmarking. Qualitative validation shows that the noise generated by our proposed noise model can be highly similar to the real noise in terms of distribution. Furthermore, extensive denoising experiments demonstrate that our method performs favorably against state-of-the-art methods on different sensors.
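The hybrid noise model can be sketched as follows: signal-dependent shot noise drawn from a physics-based Poisson model, plus signal-independent noise drawn from a learned generator (a placeholder callable here). The gain values and Gaussian fallback are illustrative assumptions.

```python
import torch

def synthesize_low_light_raw(clean, iso_gain=8.0, read_noise_net=None):
    """Sketch of a hybrid noise model: physics-based signal-dependent shot
    noise plus learned signal-independent noise. Parameters illustrative.

    clean: (B, 1, H, W) linear raw intensities in [0, 1].
    """
    # Signal-dependent part: Poisson photon/shot noise at a given gain.
    photons = torch.poisson(clean * 255.0 / iso_gain) * iso_gain / 255.0
    if read_noise_net is not None:
        # Signal-independent part sampled from a learned generator.
        noise = read_noise_net(torch.randn_like(clean))
    else:
        noise = 0.002 * torch.randn_like(clean)  # simple Gaussian fallback
    return (photons + noise).clamp(0.0, 1.0)
```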
A hybrid approach for improving U-Net variants in medical image segmentation
results: The study shows that using depthwise separable convolutions together with an attention system and residual connections can reduce the network's parameter requirements while maintaining performance on medical image segmentation tasks such as skin lesion segmentation.
Abstract
Medical image segmentation is vital to the area of medical imaging because it enables professionals to more accurately examine and understand the information offered by different imaging modalities. The technique of splitting a medical image into various segments or regions of interest is known as medical image segmentation. The segmented images that are produced can be used for many different things, including diagnosis, surgery planning, and therapy evaluation. In initial phase of research, major focus has been given to review existing deep-learning approaches, including researches like MultiResUNet, Attention U-Net, classical U-Net, and other variants. The attention feature vectors or maps dynamically add important weights to critical information, and most of these variants use these to increase accuracy, but the network parameter requirements are somewhat more stringent. They face certain problems such as overfitting, as their number of trainable parameters is very high, and so is their inference time. Therefore, the aim of this research is to reduce the network parameter requirements using depthwise separable convolutions, while maintaining performance over some medical image segmentation tasks such as skin lesion segmentation using attention system and residual connections.
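For reference, a depthwise separable convolution block of the kind proposed here can be written in a few lines; the parameter comparison in the trailing comment shows where the savings come from.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel spatial (depthwise)
    convolution followed by a 1x1 (pointwise) convolution. Replacing a
    standard 3x3 conv with this block cuts the weight count dramatically
    for large channel counts."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison for 64 -> 128 channels with a 3x3 kernel:
#   standard conv: 64 * 128 * 9      = 73,728 weights
#   separable:     64 * 9 + 64 * 128 =  8,768 weights
```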
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
results: Achieves state-of-the-art performance in long video understanding.
Abstract
Recently, video foundation models and large language models have been integrated to build video understanding systems that overcome the limitations of specific pre-defined vision tasks. Yet, existing systems can only handle videos with very few frames. For long videos, computation complexity, memory cost, and long-term temporal connections remain challenges. Inspired by the Atkinson-Shiffrin memory model, we develop a memory mechanism comprising a rapidly updated short-term memory and a compact, thus sustained, long-term memory. We employ tokens in Transformers as the carriers of memory. MovieChat achieves state-of-the-art performance in long video understanding.
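One hedged reading of the two-tier memory is sketched below: a short-term FIFO buffer of frame tokens that, when full, is consolidated into a long-term store by repeatedly merging the most similar adjacent tokens. The capacities and the merging rule are assumptions, not the released MovieChat code.

```python
import torch

class TwoTierMemory:
    """Sketch of a short-term/long-term token memory (capacities illustrative)."""
    def __init__(self, short_cap=16, merge_to=8):
        self.short, self.long = [], []
        self.short_cap, self.merge_to = short_cap, merge_to

    def add(self, frame_token):           # frame_token: (d,) pooled frame feature
        self.short.append(frame_token)
        if len(self.short) >= self.short_cap:
            self._consolidate()

    def _consolidate(self):
        tokens = self.short
        # Greedily merge the most similar adjacent pair until compact enough.
        while len(tokens) > self.merge_to:
            sims = [torch.cosine_similarity(tokens[i], tokens[i + 1], dim=0)
                    for i in range(len(tokens) - 1)]
            i = int(torch.stack(sims).argmax())
            merged = (tokens[i] + tokens[i + 1]) / 2
            tokens = tokens[:i] + [merged] + tokens[i + 2:]
        self.long.extend(tokens)
        self.short = []
```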
results: Our experimental results show that our approach provides good stroke suggestions and compares favorably to the state of the art.
Abstract
In the last few years, Neural Painting (NP) techniques have become capable of producing extremely realistic artworks. This paper advances the state of the art in this emerging research domain by proposing the first approach for Interactive NP. Considering a setting where a user looks at a scene and tries to reproduce it on a painting, our objective is to develop a computational framework to assist the user's creativity by suggesting the next strokes to paint, which can possibly be used to complete the artwork. To accomplish such a task, we propose I-Paint, a novel method based on a conditional transformer Variational AutoEncoder (VAE) architecture with a two-stage decoder. To evaluate the proposed approach and stimulate research in this area, we also introduce two novel datasets. Our experiments show that our approach provides good stroke suggestions and compares favorably to the state of the art. Additional details, code and examples are available at https://helia95.github.io/inp-website.
Towards Head Computed Tomography Image Reconstruction Standardization with Deep Learning Assisted Automatic Detection
methods: Uses a deep learning-based object detection algorithm to automatically detect and evaluate orbitomeatal line landmarks and reformat the images prior to reconstruction.
results: Compares ten object detection algorithms in terms of precision, efficiency, and robustness, singling out the lightweight YOLOv8 with an mAP of 92.91%; a qualitative evaluation of the standardized reconstruction results demonstrates the clinical practicability and validity of the method.
Abstract
Three-dimensional (3D) reconstruction of head Computed Tomography (CT) images elucidates the intricate spatial relationships of tissue structures, thereby assisting in accurate diagnosis. Nonetheless, securing an optimal head CT scan without deviation is challenging in clinical settings, owing to poor positioning by technicians, patient's physical constraints, or CT scanner tilt angle restrictions. Manual formatting and reconstruction not only introduce subjectivity but also strain time and labor resources. To address these issues, we propose an efficient automatic head CT images 3D reconstruction method, improving accuracy and repeatability, as well as diminishing manual intervention. Our approach employs a deep learning-based object detection algorithm, identifying and evaluating orbitomeatal line landmarks to automatically reformat the images prior to reconstruction. Given the dearth of existing evaluations of object detection algorithms in the context of head CT images, we compared ten methods from both theoretical and experimental perspectives. By exploring their precision, efficiency, and robustness, we singled out the lightweight YOLOv8 as the aptest algorithm for our task, with an mAP of 92.91% and impressive robustness against class imbalance. Our qualitative evaluation of standardized reconstruction results demonstrates the clinical practicability and validity of our method.
Detecting Out-of-distribution Objects Using Neuron Activation Patterns
paper_authors: Bartłomiej Olber, Krystian Radlak, Krystian Chachuła, Jakub Łyskawa, Piotr Frątczak
for: The out-of-distribution (OOD) detection problem in real-time object detection
methods: An OOD detection method based on Neuron Activation PaTteRns (NAPTRON)
results: Across two distinct OOD scenarios and three types of object detectors, the method outperforms state-of-the-art OOD detection approaches without affecting in-distribution (ID) performance.
Abstract
Object detection is essential to many perception algorithms used in modern robotics applications. Unfortunately, the existing models share a tendency to assign high confidence scores for out-of-distribution (OOD) samples. Although OOD detection has been extensively studied in recent years by the computer vision (CV) community, most proposed solutions apply only to the image recognition task. Real-world applications such as perception in autonomous vehicles struggle with far more complex challenges than classification. In our work, we focus on the prevalent field of object detection, introducing Neuron Activation PaTteRns for out-of-distribution samples detection in Object detectioN (NAPTRON). Performed experiments show that our approach outperforms state-of-the-art methods, without the need to affect in-distribution (ID) performance. By evaluating the methods in two distinct OOD scenarios and three types of object detectors we have created the largest open-source benchmark for OOD object detection.
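One way to read "neuron activation patterns" is as binarized post-ReLU activations of a chosen layer; the sketch below scores a test detection by its minimum Hamming distance to patterns recorded on in-distribution data. The layer choice, binarization rule, and distance-based score are assumptions, not necessarily NAPTRON's exact formulation.

```python
import numpy as np

def binary_pattern(activations):
    """Binarize a layer's activation vector: 1 where the neuron fired."""
    return (activations > 0).astype(np.uint8)

def ood_score(test_act, id_patterns):
    """Minimum Hamming distance from the test pattern to any pattern
    observed on in-distribution training data (higher = more OOD)."""
    p = binary_pattern(test_act)
    return int(np.min(np.count_nonzero(id_patterns != p, axis=1)))

# Usage with a synthetic (N, D) bank of ID patterns:
rng = np.random.default_rng(0)
id_patterns = (rng.random((1000, 256)) > 0.5).astype(np.uint8)
print(ood_score(rng.standard_normal(256), id_patterns))
```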
High Dynamic Range Image Reconstruction via Deep Explicit Polynomial Curve Estimation
results: We propose a method that explicitly estimates the tone-mapping function between an image and the real scene radiance and reconstructs the HDR image within a single network. First, based on the characteristics of the tone-mapping function, we model the trend of the tone curve with a polynomial and use a learnable network to estimate its coefficients. The curve is automatically adjusted according to the tone space of the Low Dynamic Range (LDR) image to reconstruct the real HDR image. In addition, since current datasets do not provide the correspondence between the tone-mapping function and the LDR image, we construct a new dataset with both synthetic and real images. Extensive experiments show that the method generalizes well under different tone-mapping functions and achieves strong performance.
Abstract
Due to limited camera capacities, digital images usually have a narrower dynamic illumination range than real-world scene radiance. To resolve this problem, High Dynamic Range (HDR) reconstruction is proposed to recover the dynamic range to better represent real-world scenes. However, due to different physical imaging parameters, the tone-mapping functions between images and real radiance are highly diverse, which makes HDR reconstruction extremely challenging. Existing solutions can not explicitly clarify a corresponding relationship between the tone-mapping function and the generated HDR image, but this relationship is vital when guiding the reconstruction of HDR images. To address this problem, we propose a method to explicitly estimate the tone mapping function and its corresponding HDR image in one network. Firstly, based on the characteristics of the tone mapping function, we construct a model by a polynomial to describe the trend of the tone curve. To fit this curve, we use a learnable network to estimate the coefficients of the polynomial. This curve will be automatically adjusted according to the tone space of the Low Dynamic Range (LDR) image, and reconstruct the real HDR image. Besides, since all current datasets do not provide the corresponding relationship between the tone mapping function and the LDR image, we construct a new dataset with both synthetic and real images. Extensive experiments show that our method generalizes well under different tone-mapping functions and achieves SOTA performance.
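The polynomial tone-curve idea reduces to a small amount of code: given per-image coefficients predicted by a network, the HDR estimate is a polynomial in the LDR intensities. The polynomial degree and the per-pixel application are illustrative assumptions.

```python
import torch

def apply_tone_curve(ldr, coeffs):
    """Map an LDR image to HDR with a learned polynomial tone curve:
    hdr = sum_k c_k * ldr**k.

    ldr: (B, 3, H, W) in [0, 1]; coeffs: (B, K) per-image coefficients
    predicted by the estimation network.
    """
    hdr = torch.zeros_like(ldr)
    for k in range(coeffs.size(1)):
        hdr = hdr + coeffs[:, k].view(-1, 1, 1, 1) * ldr.pow(k)
    return hdr
```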
DRAW: Defending Camera-shooted RAW against Image Manipulation
results: Extensive experiments on several well-known RAW datasets demonstrate high robustness and accuracy.
Abstract
RAW files are the initial measurement of scene radiance widely used in most cameras, and the ubiquitously-used RGB images are converted from RAW data through Image Signal Processing (ISP) pipelines. Nowadays, digital images are risky of being nefariously manipulated. Inspired by the fact that innate immunity is the first line of body defense, we propose DRAW, a novel scheme of defending images against manipulation by protecting their sources, i.e., camera-shooted RAWs. Specifically, we design a lightweight Multi-frequency Partial Fusion Network (MPF-Net) friendly to devices with limited computing resources by frequency learning and partial feature fusion. It introduces invisible watermarks as protective signal into the RAW data. The protection capability can not only be transferred into the rendered RGB images regardless of the applied ISP pipeline, but also is resilient to post-processing operations such as blurring or compression. Once the image is manipulated, we can accurately identify the forged areas with a localization network. Extensive experiments on several famous RAW datasets, e.g., RAISE, FiveK and SIDD, indicate the effectiveness of our method. We hope that this technique can be used in future cameras as an option for image protection, which could effectively restrict image manipulation at the source.
MRA-GNN: Minutiae Relation-Aware Model over Graph Neural Network for Fingerprint Embedding
methods: We propose a novel fingerprint embedding method, the Minutiae Relation-Aware model over Graph Neural Network (MRA-GNN). MRA-GNN uses a GNN-based framework to encode the topology and correlation of fingerprints, converting fingerprint data into graphs, and learns fingerprint embeddings through a Topological relation Reasoning Module (TRM) and a Correlation-Aware Module (CAM). To address the over-smoothing problem in GNN models, we also add a Feed-Forward Module and graph residual connections to MRA-GNN.
results: Experimental results show that MRA-GNN outperforms state-of-the-art methods on several fingerprint datasets, indicating that our approach can effectively exploit the nonstructural information of fingerprints.
Abstract
Deep learning has achieved remarkable results in fingerprint embedding, which plays a critical role in modern Automated Fingerprint Identification Systems. However, previous works including CNN-based and Transformer-based approaches fail to exploit the nonstructural data, such as topology and correlation in fingerprints, which is essential to facilitate the identifiability and robustness of embedding. To address this challenge, we propose a novel paradigm for fingerprint embedding, called Minutiae Relation-Aware model over Graph Neural Network (MRA-GNN). Our proposed approach incorporates a GNN-based framework in fingerprint embedding to encode the topology and correlation of fingerprints into descriptive features, achieving fingerprint representation in the form of graph embedding. Specifically, we reinterpret fingerprint data and their relative connections as vertices and edges respectively, and introduce a minutia graph and fingerprint graph to represent the topological relations and correlation structures of fingerprints. We equip MRA-GNN with a Topological relation Reasoning Module (TRM) and Correlation-Aware Module (CAM) to learn the fingerprint embedding from these graphs successfully. To tackle the over-smoothing problem in GNN models, we incorporate Feed-Forward Module and graph residual connections into proposed modules. The experimental results demonstrate that our proposed approach outperforms state-of-the-art methods on various fingerprint datasets, indicating the effectiveness of our approach in exploiting nonstructural information of fingerprints.
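To illustrate how fingerprint data become graph vertices and edges, here is a minimal sketch that builds a k-NN minutia graph from (x, y, orientation) triplets; the neighbourhood rule and the value of k are assumptions, not the paper's exact construction.

```python
import numpy as np

def minutia_graph(minutiae, k=4):
    """Build a k-NN minutia graph: vertices are minutiae, edges connect
    each minutia to its k nearest neighbours in the image plane.

    minutiae: (N, 3) array of (x, y, orientation).
    Returns a symmetric (N, N) adjacency matrix.
    """
    xy = minutiae[:, :2]
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                   # no self-loops
    adj = np.zeros_like(d)
    nn = np.argsort(d, axis=1)[:, :k]             # indices of k nearest
    rows = np.repeat(np.arange(len(xy)), k)
    adj[rows, nn.ravel()] = 1.0
    return np.maximum(adj, adj.T)                 # symmetrize
```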
DDG-Net: Discriminability-Driven Graph Network for Weakly-supervised Temporal Action Localization
results: We propose the Discriminability-Driven Graph Network (DDG-Net), which explicitly models ambiguous snippets and discriminative snippets with well-designed connections, preventing the transmission of ambiguous information and enhancing the discriminability of snippet-level representations. We also propose a feature consistency loss to prevent the assimilation of features and drive the graph convolution network to generate more discriminative representations. Experiments on the THUMOS14 and ActivityNet1.2 benchmarks establish new state-of-the-art results, demonstrating the effectiveness of DDG-Net. Code is available at \url{https://github.com/XiaojunTang22/ICCV2023-DDGNet}.
Abstract
Weakly-supervised temporal action localization (WTAL) is a practical yet challenging task. Due to large-scale datasets, most existing methods use a network pretrained in other datasets to extract features, which are not suitable enough for WTAL. To address this problem, researchers design several modules for feature enhancement, which improve the performance of the localization module, especially modeling the temporal relationship between snippets. However, all of them neglect the adverse effects of ambiguous information, which would reduce the discriminability of others. Considering this phenomenon, we propose Discriminability-Driven Graph Network (DDG-Net), which explicitly models ambiguous snippets and discriminative snippets with well-designed connections, preventing the transmission of ambiguous information and enhancing the discriminability of snippet-level representations. Additionally, we propose feature consistency loss to prevent the assimilation of features and drive the graph convolution network to generate more discriminative representations. Extensive experiments on THUMOS14 and ActivityNet1.2 benchmarks demonstrate the effectiveness of DDG-Net, establishing new state-of-the-art results on both datasets. Source code is available at \url{https://github.com/XiaojunTang22/ICCV2023-DDGNet}.
RCS-YOLO: A Fast and High-Accuracy Object Detector for Brain Tumor Detection
paper_authors: Ming Kang, Chee-Ming Ting, Fung Fung Ting, Raphaël C. -W. Phan
for: Brain tumor detection
methods: Proposed a novel YOLO architecture with Reparameterized Convolution based on channel Shuffle (RCS-YOLO); introduced Reparameterized Convolution (RCS) and One-Shot Aggregation of RCS (RCS-OSA) to extract richer information and reduce time consumption.
results: Surpassed YOLOv6, YOLOv7, and YOLOv8 in speed and accuracy on the brain tumor dataset Br35H; improved precision by 2.6% and inference speed by 60% compared to YOLOv7, achieving state-of-the-art performance on the brain tumor detection task.
Abstract
With an excellent balance between speed and accuracy, cutting-edge YOLO frameworks have become one of the most efficient algorithms for object detection. However, the performance of using YOLO networks is scarcely investigated in brain tumor detection. We propose a novel YOLO architecture with Reparameterized Convolution based on channel Shuffle (RCS-YOLO). We present RCS and a One-Shot Aggregation of RCS (RCS-OSA), which link feature cascade and computation efficiency to extract richer information and reduce time consumption. Experimental results on the brain tumor dataset Br35H show that the proposed model surpasses YOLOv6, YOLOv7, and YOLOv8 in speed and accuracy. Notably, compared with YOLOv7, the precision of RCS-YOLO improves by 2.6%, and the inference speed by 60% at 114.8 images detected per second (FPS). Our proposed RCS-YOLO achieves state-of-the-art performance on the brain tumor detection task. The code is available at https://github.com/mkang315/RCS-YOLO.
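The channel shuffle operation that the RCS module's name references is the standard ShuffleNet-style interleaving; a minimal sketch follows (the surrounding reparameterized convolution is not shown).

```python
import torch

def channel_shuffle(x, groups):
    """Channel shuffle as used in ShuffleNet-style blocks: interleave the
    channels of the group convolutions so information flows across groups."""
    b, c, h, w = x.shape
    assert c % groups == 0
    return (x.view(b, groups, c // groups, h, w)
             .transpose(1, 2)
             .reshape(b, c, h, w))

x = torch.arange(8.0).view(1, 8, 1, 1)
print(channel_shuffle(x, 2).flatten().tolist())
# [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0, 7.0]
```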
HiREN: Towards Higher Supervision Quality for Better Scene Text Image Super-Resolution
paper_authors: Minyi Zhao, Yi Xu, Bingjia Li, Jie Wang, Jihong Guan, Shuigeng Zhou
for: Improving text recognition by addressing the problem of low-resolution scene images.
methods: Proposes a new STISR framework, called High-Resolution ENhancement (HiREN), which consists of two branches and a quality estimation module. The first branch recovers the LR images, and the other is an HR quality enhancement branch that generates HQ text images based on the HR images.
results: Extensive experiments on the TextZoom dataset show that HiREN can work with most existing STISR methods and significantly boost their performance.
Abstract
Scene text image super-resolution (STISR) is an important pre-processing technique for text recognition from low-resolution scene images. Nowadays, various methods have been proposed to extract text-specific information from high-resolution (HR) images to supervise STISR model training. However, due to uncontrollable factors (e.g. shooting equipment, focus, and environment) in manually photographing HR images, the quality of HR images cannot be guaranteed, which unavoidably impacts STISR performance. Observing the quality issue of HR images, in this paper we propose a novel idea to boost STISR by first enhancing the quality of HR images and then using the enhanced HR images as supervision to do STISR. Concretely, we develop a new STISR framework, called High-Resolution ENhancement (HiREN) that consists of two branches and a quality estimation module. The first branch is developed to recover the low-resolution (LR) images, and the other is an HR quality enhancement branch aiming at generating high-quality (HQ) text images based on the HR images to provide more accurate supervision to the LR images. As the degradation from HQ to HR may be diverse, and there is no pixel-level supervision for HQ image generation, we design a kernel-guided enhancement network to handle various degradation, and exploit the feedback from a recognizer and text-level annotations as weak supervision signal to train the HR enhancement branch. Then, a quality estimation module is employed to evaluate the qualities of HQ images, which are used to suppress the erroneous supervision information by weighting the loss of each image. Extensive experiments on TextZoom show that HiREN can work well with most existing STISR methods and significantly boost their performances.
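The quality-weighted supervision can be sketched as a per-image loss scaled by the estimated quality of the enhanced HQ target, so low-quality targets contribute less; the use of plain L1 and the normalization are illustrative assumptions rather than HiREN's exact loss.

```python
import torch

def quality_weighted_loss(sr, hq, quality):
    """Sketch of quality-weighted supervision.

    sr, hq: (B, C, H, W) super-resolved output and enhanced HQ target;
    quality: (B,) scores in [0, 1] from the quality estimation module.
    """
    per_image = (sr - hq).abs().mean(dim=(1, 2, 3))  # (B,) L1 per image
    return (quality * per_image).sum() / quality.sum().clamp(min=1e-8)
```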
Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences
results: Automatic evaluation shows that our model outperforms existing approaches on few-shot sentimental visual captioning and is comparable to models fully trained on labeled style corpora. Human evaluations further confirm that the model can handle multiple styles.
Abstract
Stylized visual captioning aims to generate image or video descriptions with specific styles, making them more attractive and emotionally appropriate. One major challenge with this task is the lack of paired stylized captions for visual content, so most existing works focus on unsupervised methods that do not rely on parallel datasets. However, these approaches still require training with sufficient examples that have style labels, and the generated captions are limited to predefined styles. To address these limitations, we explore the problem of Few-Shot Stylized Visual Captioning, which aims to generate captions in any desired style, using only a few examples as guidance during inference, without requiring further training. We propose a framework called FS-StyleCap for this task, which utilizes a conditional encoder-decoder language model and a visual projection module. Our two-step training scheme proceeds as follows: first, we train a style extractor to generate style representations on an unlabeled text-only corpus. Then, we freeze the extractor and enable our decoder to generate stylized descriptions based on the extracted style vector and projected visual content vectors. During inference, our model can generate desired stylized captions by deriving the style representation from user-supplied examples. Our automatic evaluation results for few-shot sentimental visual captioning outperform state-of-the-art approaches and are comparable to models that are fully trained on labeled style corpora. Human evaluations further confirm our model's ability to handle multiple styles.
JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery
paper_authors: Jiahao Li, Zongxin Yang, Xiaohan Wang, Jianxin Ma, Chang Zhou, Yi Yang
for: 3D human mesh recovery from a single image under obscured conditions
methods: 3D JOint contrastive learning with TRansformers (JOTR) framework, including an encoder-decoder transformer architecture and a novel 3D joint contrastive learning approach
results: outperforms state-of-the-art competitors on both occlusion-specific and standard benchmarks, significantly improving the reconstruction of occluded humans
Abstract
In this study, we focus on the problem of 3D human mesh recovery from a single image under obscured conditions. Most state-of-the-art methods aim to improve 2D alignment technologies, such as spatial averaging and 2D joint sampling. However, they tend to neglect the crucial aspect of 3D alignment by improving 3D representations. Furthermore, recent methods struggle to separate the target human from occlusion or background in crowded scenes as they optimize the 3D space of target human with 3D joint coordinates as local supervision. To address these issues, a desirable method would involve a framework for fusing 2D and 3D features and a strategy for optimizing the 3D space globally. Therefore, this paper presents 3D JOint contrastive learning with TRansformers (JOTR) framework for handling occluded 3D human mesh recovery. Our method includes an encoder-decoder transformer architecture to fuse 2D and 3D representations for achieving 2D$\&$3D aligned results in a coarse-to-fine manner and a novel 3D joint contrastive learning approach for adding explicitly global supervision for the 3D feature space. The contrastive learning approach includes two contrastive losses: joint-to-joint contrast for enhancing the similarity of semantically similar voxels (i.e., human joints), and joint-to-non-joint contrast for ensuring discrimination from others (e.g., occlusions and background). Qualitative and quantitative analyses demonstrate that our method outperforms state-of-the-art competitors on both occlusion-specific and standard benchmarks, significantly improving the reconstruction of occluded humans.
MobileVidFactory: Automatic Diffusion-Based Social Media Video Generation for Mobile Devices from Text
results: With our system, users can easily create high-quality vertical mobile videos without specialized technical knowledge or professional skills.
Abstract
Videos for mobile devices have recently become the most popular medium for sharing and acquiring information. To make creation convenient for users, in this paper we present a system, namely MobileVidFactory, that automatically generates vertical mobile videos from simple user-provided texts. Our system consists of two parts: basic and customized generation. In basic generation, we take advantage of a pretrained image diffusion model and adapt it into a high-quality open-domain vertical video generator for mobile devices. For the audio, our system retrieves a suitable background sound for the video from a large database. Additionally, to produce customized content, our system allows users to add specified screen texts to the video to enrich visual expression, and to specify texts for automatic reading with optional voices.
results: The workshop also released a data challenge on the document-level VQA dataset PDFVQA, which tests the structural and contextual understanding of proposed models at the natural full-document level. This task helps advance document understanding from the single-page level to full-document-level understanding.
Abstract
Document understanding and information extraction include different tasks to understand a document and extract valuable information automatically. Recently, there has been a rising demand for developing document understanding among different domains, including business, law, and medicine, to boost the efficiency of work that is associated with a large number of documents. This workshop aims to bring together researchers and industry developers in the field of document intelligence and understanding diverse document types to boost automatic document processing and understanding techniques. We also released a data challenge on the recently introduced document-level VQA dataset, PDFVQA. The PDFVQA challenge examines the structural and contextual understandings of proposed models on the natural full document level of multiple consecutive document pages by including questions with a sequence of answers extracted from multi-pages of the full document. This task helps to boost the document understanding step from the single-page level to the full document level understanding.
AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
paper_authors: Qi Zhao, Ce Zhang, Shijie Wang, Changcheng Fu, Nakul Agarwal, Kwonjoon Lee, Chen Sun
for: The paper focuses on the long-term action anticipation (LTA) task, which involves predicting an actor's future behavior from video observations. The goal is to improve human-machine interaction.
methods: The authors propose a two-stage framework called AntGPT, which leverages large language models (LLMs) for the LTA task. The first stage recognizes the actions already performed in the observed videos, and the second stage uses an LLM to predict the future actions or to infer the goal and plan the whole procedure.
results: The authors report state-of-the-art performance on several benchmarks, including the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, and EGTEA GAZE+. They also demonstrate the effectiveness of their approach through qualitative analysis, showing that AntGPT can successfully infer the goal and perform goal-conditioned "counterfactual" prediction.
Abstract
Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after his/her current action (e.g. crack eggs)? What if we also know the longer-term goal of the actor (e.g. making egg fried rice)? The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction. We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the needed procedure to accomplish the goal. We hypothesize that large language models (LLMs), which have been pretrained on procedure text data (e.g. recipes, how-tos), have the potential to help LTA from both perspectives. It can help provide the prior knowledge on the possible next actions, and infer the goal given the observed part of a procedure, respectively. To leverage the LLMs, we propose a two-stage framework, AntGPT. It first recognizes the actions already performed in the observed videos and then asks an LLM to predict the future actions via conditioned generation, or to infer the goal and plan the whole procedure by chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, as well as EGTEA GAZE+ demonstrate the effectiveness of our proposed approach. AntGPT achieves state-of-the-art performance on all above benchmarks, and can successfully infer the goal and thus perform goal-conditioned "counterfactual" prediction via qualitative analysis. Code and model will be released at https://brown-palm.github.io/AntGPT
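As a hedged illustration of the bottom-up prompting stage, the sketch below turns recognized (verb, noun) pairs into an in-context sequence and asks an LLM to continue it; the prompt wording and the `llm` callable are placeholders, not the paper's actual prompt or API.

```python
def build_lta_prompt(observed_actions, n_future=20):
    """Sketch of a bottom-up LTA prompt: recognized (verb, noun) pairs
    become an in-context sequence for the LLM to continue. The exact
    wording is an assumption, not the paper's prompt."""
    history = ", ".join(f"{v} {n}" for v, n in observed_actions)
    return (f"Observed actions so far: {history}. "
            f"Predict the next {n_future} actions as 'verb noun' pairs, "
            f"separated by commas.")

prompt = build_lta_prompt([("crack", "egg"), ("mix", "egg"), ("heat", "pan")])
# completion = llm(prompt)  # `llm` is a placeholder for any completion API
print(prompt)
```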
Multi-modal Graph Neural Network for Early Diagnosis of Alzheimer’s Disease from sMRI and PET Scans
results: Experimental results show that, compared to single-modal approaches, the proposed multi-modal method improves AD diagnosis performance, and that combining multi-modal imaging data with non-imaging phenotypic information further improves diagnostic accuracy.
Abstract
In recent years, deep learning models have been applied to neuroimaging data for early diagnosis of Alzheimer's disease (AD). Structural magnetic resonance imaging (sMRI) and positron emission tomography (PET) images provide structural and functional information about the brain, respectively. Combining these features leads to better performance than using a single modality alone in building predictive models for AD diagnosis. However, current multi-modal approaches in deep learning, based on sMRI and PET, are mostly limited to convolutional neural networks, which do not facilitate integration of both image and phenotypic information of subjects. We propose to use graph neural networks (GNN), which are designed to deal with problems in non-Euclidean domains. In this study, we demonstrate how brain networks can be created from sMRI or PET images and used in a population graph framework that combines phenotypic information with the imaging features of these brain networks. We then present a multi-modal GNN framework in which each modality has its own branch of GNN, together with a technique for combining the multi-modal data at both the level of node vectors and adjacency matrices. Finally, we perform late fusion to combine the preliminary decisions made in each branch and produce a final prediction. As multi-modality data become available, multi-source, multi-modal analysis is the trend in AD diagnosis. We conducted exploratory experiments based on multi-modal imaging data combined with non-imaging phenotypic information for AD diagnosis and analyzed the impact of phenotypic information on diagnostic performance. Results from the experiments demonstrate that our proposed multi-modal approach improves performance for AD diagnosis; this study also provides a technical reference and supports the need for multivariate multi-modal diagnosis methods.
Benchmarking and Analyzing Robust Point Cloud Recognition: Bag of Tricks for Defending Adversarial Examples
methods: We first establish a comprehensive and rigorous point cloud adversarial robustness benchmark to better understand the effects of defense and attack methods. We then collect existing point cloud adversarial defense tricks and perform extensive, systematic experiments to identify an effective combination of them. Finally, we propose a hybrid training augmentation method that incorporates various types of point cloud adversarial examples into adversarial training to improve robustness.
results: The method maintains an average accuracy of 83.45% against various attacks, demonstrating its capability to enable robust learners. Our codebase is open-sourced at \url{https://github.com/qiufan319/benchmark_pc_attack.git}.
Abstract
Deep Neural Networks (DNNs) for 3D point cloud recognition are vulnerable to adversarial examples, threatening their practical deployment. Despite the many research endeavors have been made to tackle this issue in recent years, the diversity of adversarial examples on 3D point clouds makes them more challenging to defend against than those on 2D images. For examples, attackers can generate adversarial examples by adding, shifting, or removing points. Consequently, existing defense strategies are hard to counter unseen point cloud adversarial examples. In this paper, we first establish a comprehensive, and rigorous point cloud adversarial robustness benchmark to evaluate adversarial robustness, which can provide a detailed understanding of the effects of the defense and attack methods. We then collect existing defense tricks in point cloud adversarial defenses and then perform extensive and systematic experiments to identify an effective combination of these tricks. Furthermore, we propose a hybrid training augmentation methods that consider various types of point cloud adversarial examples to adversarial training, significantly improving the adversarial robustness. By combining these tricks, we construct a more robust defense framework achieving an average accuracy of 83.45\% against various attacks, demonstrating its capability to enabling robust learners. Our codebase are open-sourced on: \url{https://github.com/qiufan319/benchmark_pc_attack.git}.
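One ingredient of such hybrid adversarial training is a shift-type attack; a minimal L-infinity PGD perturbation of point coordinates is sketched below (budget, step size, and step count are illustrative). Add- and drop-point examples would be generated by separate routines.

```python
import torch
import torch.nn.functional as F

def pgd_point_attack(model, points, labels, eps=0.05, alpha=0.01, steps=7):
    """Sketch of a PGD shift attack on point coordinates.

    points: (B, N, 3) point clouds; labels: (B,) class indices.
    Returns perturbed point clouds within an L-inf budget of eps.
    """
    delta = torch.zeros_like(points, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(points + delta), labels)
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta += alpha * grad.sign()   # ascend the loss
            delta.clamp_(-eps, eps)        # project back into the budget
    return (points + delta).detach()
```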
Cardiac MRI Orientation Recognition and Standardization using Deep Neural Networks
results: Extensive experiments achieved validation accuracies of 100.0%, 100.0%, and 99.4%, demonstrating the robustness and effectiveness of our model.
Abstract
Orientation recognition and standardization play a crucial role in the effectiveness of medical image processing tasks. Deep learning-based methods have proven highly advantageous in orientation recognition and prediction tasks. In this paper, we address the challenge of imaging orientation in cardiac MRI and present a method that employs deep neural networks to categorize and standardize the orientation. To cater to multiple sequences and modalities of MRI, we propose a transfer learning strategy, enabling adaptation of our model from a single modality to diverse modalities. We conducted comprehensive experiments on CMR images from various modalities, including bSSFP, T2, and LGE. The validation accuracies achieved were 100.0\%, 100.0\%, and 99.4\%, confirming the robustness and effectiveness of our model. Our source code and network models are available at https://github.com/rxzhen/MSCMR-orient
results: The study finds that contrastive learning on unannotated gait data learns clinically meaningful gait representations, which hold promise both for diagnosis and for assessing the response to rehabilitation therapy.
Abstract
Markerless motion capture (MMC) is revolutionizing gait analysis in clinical settings by making it more accessible, raising the question of how to extract the most clinically meaningful information from gait data. In multiple fields ranging from image processing to natural language processing, self-supervised learning (SSL) from large amounts of unannotated data produces very effective representations for downstream tasks. However, there has only been limited use of SSL to learn effective representations of gait and movement, and it has not been applied to gait analysis with MMC. One SSL objective that has not been applied to gait is contrastive learning, which finds representations that place similar samples closer together in the learned space. If the learned similarity metric captures clinically meaningful differences, this could produce a useful representation for many downstream clinical tasks. Contrastive learning can also be combined with causal masking to predict future timesteps, which is an appealing SSL objective given the dynamical nature of gait. We applied these techniques to gait analyses performed with MMC in a rehabilitation hospital from a diverse clinical population. We find that contrastive learning on unannotated gait data learns a representation that captures clinically meaningful information. We probe this learned representation using the framework of biomarkers and show it holds promise as both a diagnostic and response biomarker, by showing it can accurately classify diagnosis from gait and is responsive to inpatient therapy, respectively. We ultimately hope these learned representations will enable predictive and prognostic gait-based biomarkers that can facilitate precision rehabilitation through greater use of MMC to quantify movement in rehabilitation.
Mask-guided Data Augmentation for Multiparametric MRI Generation with a Rare Hepatocellular Carcinoma
results: The results show that the method achieves a Frechet Inception Distance score of 86.55. The approach was also among the winners of the 2021 data augmentation challenge organized by the French Society of Radiology.
Abstract
Data augmentation is classically used to improve the overall performance of deep learning models. It is, however, challenging in the case of medical applications, and in particular for multiparametric datasets. For example, traditional geometric transformations used in several applications to generate synthetic images can modify in a non-realistic manner the patients' anatomy. Therefore, dedicated image generation techniques are necessary in the medical field to, for example, mimic a given pathology realistically. This paper introduces a new data augmentation architecture that generates synthetic multiparametric (T1 arterial, T1 portal, and T2) magnetic resonance images (MRI) of massive macrotrabecular subtype hepatocellular carcinoma with their corresponding tumor masks through a generative deep learning approach. The proposed architecture creates liver tumor masks and abdominal edges used as input in a Pix2Pix network for synthetic data creation. The method's efficiency is demonstrated by training it on a limited multiparametric dataset of MRI triplets from $89$ patients with liver lesions to generate $1,000$ synthetic triplets and their corresponding liver tumor masks. The resulting Frechet Inception Distance score was $86.55$. The proposed approach was among the winners of the 2021 data augmentation challenge organized by the French Society of Radiology.
Triple Correlations-Guided Label Supplementation for Unbiased Video Scene Graph Generation
results: Extensive experiments on the most widely used VidSGG datasets, VidVRD and VidOR, demonstrate state-of-the-art performance, particularly on tail predicates.
Abstract
Video-based scene graph generation (VidSGG) is an approach that aims to represent video content in a dynamic graph by identifying visual entities and their relationships. Due to the inherently biased distribution and missing annotations in the training data, current VidSGG methods have been found to perform poorly on less-represented predicates. In this paper, we propose an explicit solution to address this under-explored issue by supplementing missing predicates that should be appear in the ground-truth annotations. Dubbed Trico, our method seeks to supplement the missing predicates by exploring three complementary spatio-temporal correlations. Guided by these correlations, the missing labels can be effectively supplemented thus achieving an unbiased predicate predictions. We validate the effectiveness of Trico on the most widely used VidSGG datasets, i.e., VidVRD and VidOR. Extensive experiments demonstrate the state-of-the-art performance achieved by Trico, particularly on those tail predicates.
Stylized Projected GAN: A Novel Architecture for Fast and Realistic Image Generation
results: Proposes an optimized architecture, Stylized Projected GANs, which combines the mapping network of StyleGAN with the Skip Layer Excitation of FastGAN to improve the quality of generated images and mitigate artifacts.
Abstract
Generative Adversarial Networks generate data using a generator and a discriminator. GANs usually produce high-quality images, but training them in an adversarial setting is difficult, requiring high computation power and careful hyper-parameter tuning to converge. Projected GANs tackle this training difficulty by using transfer learning to project the generated and real samples into a pre-trained feature space. Projected GANs improve training time and convergence but produce artifacts in the generated images that reduce the quality of the generated samples. We propose an optimized architecture called Stylized Projected GANs, which integrates the mapping network of StyleGAN with the Skip Layer Excitation of FastGAN. The integrated modules are incorporated within the generator architecture of FastGAN to mitigate artifacts in the generated images.
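For reference, the FastGAN-style Skip Layer Excitation that the proposed generator inherits can be sketched as a channel-gating shortcut between a low-resolution and a high-resolution feature map; the channel sizes and activation below are illustrative.

```python
import torch
import torch.nn as nn

class SkipLayerExcitation(nn.Module):
    """FastGAN-style Skip Layer Excitation: a low-resolution feature map
    gates the channels of a high-resolution one, giving a shortcut
    between distant generator stages."""
    def __init__(self, low_ch, high_ch):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(4),
            nn.Conv2d(low_ch, high_ch, kernel_size=4),  # -> (B, high_ch, 1, 1)
            nn.LeakyReLU(0.1),
            nn.Conv2d(high_ch, high_ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x_low, x_high):
        return x_high * self.gate(x_low)   # broadcast channel-wise gating

sle = SkipLayerExcitation(low_ch=512, high_ch=64)
out = sle(torch.randn(1, 512, 8, 8), torch.randn(1, 64, 128, 128))
print(out.shape)  # torch.Size([1, 64, 128, 128])
```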
An objective validation of polyp and instrument segmentation methods in colonoscopy through Medico 2020 polyp segmentation and MedAI 2021 transparency challenges
paper_authors: Debesh Jha, Vanshali Sharma, Debapriya Banik, Debayan Bhattacharya, Kaushiki Roy, Steven A. Hicks, Nikhil Kumar Tomar, Vajira Thambawita, Adrian Krenzer, Ge-Peng Ji, Sahadev Poudel, George Batchkala, Saruar Alam, Awadelrahman M. A. Ahmed, Quoc-Huy Trinh, Zeshan Khan, Tien-Phat Nguyen, Shruti Shrestha, Sabari Nathan, Jeonghwan Gwak, Ritika K. Jha, Zheyuan Zhang, Alexander Schlaefer, Debotosh Bhattacharjee, M. K. Bhuyan, Pradip K. Das, Sravanthi Parsa, Sharib Ali, Michael A. Riegler, Pål Halvorsen, Ulas Bagci, Thomas De Lange
for: The paper is written to promote the development of efficient and transparent methods for automatic analysis of colonoscopy images, with the goal of improving the early detection of precancerous polyps.
methods: The paper uses a combination of deep learning techniques and transparency and interpretability analysis to evaluate the performance and credibility of various algorithms for polyp segmentation and classification.
results: The paper presents a comprehensive summary and analysis of the "Medico 2020" and "MedAI: Transparency in Medical Image Segmentation (MedAI 2021)" competitions, highlighting the strengths of the best-performing methods and discussing the possibility of clinical translation of such methods into the clinic. The paper also encourages qualitative evaluation for building more transparent and understandable AI-based colonoscopy systems.Abstract
Automatic analysis of colonoscopy images has been an active field of research motivated by the importance of early detection of precancerous polyps. However, detecting polyps during the live examination can be challenging due to various factors such as variation of skills and experience among the endoscopists, lack of attentiveness, and fatigue leading to a high polyp miss-rate. Deep learning has emerged as a promising solution to this challenge as it can assist endoscopists in detecting and classifying overlooked polyps and abnormalities in real time. In addition to the algorithm's accuracy, transparency and interpretability are crucial to explaining the whys and hows of the algorithm's prediction. Further, most algorithms are developed on private data, closed source, or proprietary software, and methods lack reproducibility. Therefore, to promote the development of efficient and transparent methods, we have organized the "Medico automatic polyp segmentation (Medico 2020)" and "MedAI: Transparency in Medical Image Segmentation (MedAI 2021)" competitions. We present a comprehensive summary and analyze each contribution, highlight the strength of the best-performing methods, and discuss the possibility of clinical translations of such methods into the clinic. For the transparency task, a multi-disciplinary team, including expert gastroenterologists, assessed each submission and evaluated the team based on open-source practices, failure case analysis, ablation studies, usability and understandability of evaluations to gain a deeper understanding of the models' credibility for clinical deployment. Through the comprehensive analysis of the challenge, we not only highlight the advancements in polyp and surgical instrument segmentation but also encourage qualitative evaluation for building more transparent and understandable AI-based colonoscopy systems.
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
paper_authors: Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun
for: The paper aims to improve the tool-use capabilities of open-source large language models (LLMs) by introducing a general framework called ToolLLM.
methods: The framework consists of data construction, model training, and evaluation, including the creation of an instruction-tuning dataset called ToolBench and the development of a novel depth-first search-based decision tree (DFSDT) to enhance planning and reasoning capabilities.
results: The paper demonstrates the effectiveness of ToolLLM by fine-tuning LLaMA on ToolBench and obtaining ToolLLaMA, which shows a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. Additionally, the paper proposes a neural API retriever to recommend appropriate APIs for each instruction.Abstract
Despite the advancements of open-source large language models (LLMs) and their variants, e.g., LLaMA and Vicuna, they remain significantly limited in performing higher-level tasks, such as following human instructions to use external tools (APIs). This is because current instruction tuning largely focuses on basic language tasks instead of the tool-use domain. This is in contrast to state-of-the-art (SOTA) LLMs, e.g., ChatGPT, which have demonstrated excellent tool-use capabilities but are unfortunately closed source. To facilitate tool-use capabilities within open-source LLMs, we introduce ToolLLM, a general tool-use framework of data construction, model training and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is created automatically using ChatGPT. Specifically, we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub, then prompt ChatGPT to generate diverse human instructions involving these APIs, covering both single-tool and multi-tool scenarios. Finally, we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To make the searching process more efficient, we develop a novel depth-first search-based decision tree (DFSDT), enabling LLMs to evaluate multiple reasoning traces and expand the search space. We show that DFSDT significantly enhances the planning and reasoning capabilities of LLMs. For efficient tool-use assessment, we develop an automatic evaluator: ToolEval. We fine-tune LLaMA on ToolBench and obtain ToolLLaMA. Our ToolEval reveals that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. To make the pipeline more practical, we devise a neural API retriever to recommend appropriate APIs for each instruction, negating the need for manual API selection.
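The abstract describes DFSDT only at a high level. The following is a minimal sketch of the underlying idea: a depth-first search with backtracking over a tree of reasoning traces, so that multiple traces are explored instead of a single greedy chain. Here `expand` and `is_solution` are hypothetical stand-ins for the LLM-driven node expansion and the solution check, not ToolLLM's actual code.

```python
def dfsdt(root, expand, is_solution, max_depth=10, beam=3):
    """Depth-first search over a tree of reasoning traces.

    expand(node) -> list of candidate next nodes (e.g., API calls
    proposed by the LLM); is_solution(node) -> bool. On a dead end
    the search backtracks and tries the next candidate, so several
    reasoning traces are evaluated rather than one greedy chain.
    """
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if is_solution(node):
            return node  # a valid solution path (chain of API calls)
        if depth < max_depth:
            # Push at most `beam` candidates; the last pushed is tried first.
            for child in expand(node)[:beam]:
                stack.append((child, depth + 1))
    return None  # no solution path within the depth budget
```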
The Ethics of AI Value Chains: An Approach for Integrating and Expanding AI Ethics Research, Practice, and Governance
paper_authors: Blair Attard-Frost, David Gray Widder
for: This study explores new approaches and practices for AI ethics that can account for the ethical and practical concerns arising in the design, development, use, and governance of AI systems across multiple actors, contexts, and scales of activity.
methods: The paper uses the concept of value chains as an integrative notion covering the ethical and practical implications of AI systems. It reviews and synthesizes theoretical perspectives on value chains from the strategic management, service science, and economic geography literature, and reviews perspectives on AI value chains from the academic, industry, and policy literature.
results: By connecting an inventory of ethical concerns in AI to the actors and resourcing activities involved in AI value chains, the paper shows that approaching AI ethics issues as value chain issues enables more comprehensive and integrative research and governance practices. The paper also proposes five future directions for researchers, practitioners, and policymakers to investigate and intervene in the ethical concerns associated with AI value chains.Abstract
Recent criticisms of AI ethics principles and practices have indicated a need for new approaches to AI ethics that can account for and intervene in the design, development, use, and governance of AI systems across multiple actors, contexts, and scales of activity. This paper positions AI value chains as an integrative concept that satisfies those needs, enabling AI ethics researchers, practitioners, and policymakers to take a more comprehensive view of the ethical and practical implications of AI systems. We review and synthesize theoretical perspectives on value chains from the literature on strategic management, service science, and economic geography. We then review perspectives on AI value chains from the academic, industry, and policy literature. We connect an inventory of ethical concerns in AI to the actors and resourcing activities involved in AI value chains to demonstrate that approaching AI ethics issues as value chain issues can enable more comprehensive and integrative research and governance practices. We illustrate this by suggesting five future directions for researchers, practitioners, and policymakers to investigate and intervene in the ethical concerns associated with AI value chains.
Ranking-based Argumentation Semantics Applied to Logical Argumentation (full version)
results: The study finds that ranking-based semantics behave similarly across different argument construction methods, and that these semantics give rise to quantitative measures called culpability measures, which can be used to gauge argumentation outcomes under different construction methods.Abstract
In formal argumentation, a distinction can be made between extension-based semantics, where sets of arguments are either (jointly) accepted or not, and ranking-based semantics, where grades of acceptability are assigned to arguments. Another important distinction is that between abstract approaches, that abstract away from the content of arguments, and structured approaches, that specify a method of constructing argument graphs on the basis of a knowledge base. While ranking-based semantics have been extensively applied to abstract argumentation, little work has been done on ranking-based semantics for structured argumentation. In this paper, we make a systematic investigation into the behaviour of ranking-based semantics applied to existing formalisms for structured argumentation. We show that a wide class of ranking-based semantics gives rise to so-called culpability measures, and are relatively robust to specific choices in argument construction methods.
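As a concrete, paper-independent illustration of a ranking-based semantics, the categoriser of Besnard and Hunter grades each argument via the fixed point C(a) = 1 / (1 + sum of C(b) over attackers b of a); the sketch below computes it by fixed-point iteration on an abstract attack graph (the paper studies a wider class of such semantics on structured argumentation).

```python
def categoriser(args, attacks, iters=100):
    """Ranking-based grades on an abstract argumentation graph.

    args: iterable of argument ids; attacks: dict mapping an argument
    to the set of its attackers. Unattacked arguments converge to
    grade 1; heavily attacked arguments receive lower grades.
    """
    grade = {a: 1.0 for a in args}
    for _ in range(iters):  # fixed-point iteration
        grade = {a: 1.0 / (1.0 + sum(grade[b] for b in attacks.get(a, ())))
                 for a in args}
    return grade

# Example: b attacks a, and c attacks b; c is unattacked.
print(categoriser(["a", "b", "c"], {"a": {"b"}, "b": {"c"}}))
```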
KoBBQ: Korean Bias Benchmark for Question Answering
results: We measured the accuracy and bias scores of several multilingual language models on the KoBBQ dataset and found that language models exhibit different biases in Korean and English, highlighting the need for hand-crafted data that accounts for cultural differences.Abstract
The BBQ (Bias Benchmark for Question Answering) dataset enables the evaluation of the social biases that language models (LMs) exhibit in downstream tasks. However, it is challenging to adapt BBQ to languages other than English as social biases are culturally dependent. In this paper, we devise a process to construct a non-English bias benchmark dataset by leveraging the English BBQ dataset in a culturally adaptive way and present the KoBBQ dataset for evaluating biases in Question Answering (QA) tasks in Korean. We identify samples from BBQ into three classes: Simply-Translated (can be used directly after cultural translation), Target-Modified (requires localization in target groups), and Sample-Removed (does not fit Korean culture). We further enhance the cultural relevance to Korean culture by adding four new categories of bias specific to Korean culture and newly creating samples based on Korean literature. KoBBQ consists of 246 templates and 4,740 samples across 12 categories of social bias. Using KoBBQ, we measure the accuracy and bias scores of several state-of-the-art multilingual LMs. We demonstrate the differences in the bias of LMs in Korean and English, clarifying the need for hand-crafted data considering cultural differences.
AsdKB: A Chinese Knowledge Base for the Early Screening and Diagnosis of Autism Spectrum Disorder
results: The paper builds a Chinese knowledge base (AsdKB) containing both ontological and factual knowledge, which can be used for question answering, auxiliary diagnosis, and expert recommendation. The paper also presents a prototype built on this knowledge base, available at http://asdkb.org.cn/.Abstract
To easily obtain the knowledge about autism spectrum disorder and help its early screening and diagnosis, we create AsdKB, a Chinese knowledge base on autism spectrum disorder. The knowledge base is built on top of various sources, including 1) the disease knowledge from SNOMED CT and ICD-10 clinical descriptions on mental and behavioural disorders, 2) the diagnostic knowledge from DSM-5 and different screening tools recommended by social organizations and medical institutes, and 3) the expert knowledge on professional physicians and hospitals from the Web. AsdKB contains both ontological and factual knowledge, and is accessible as Linked Data at https://w3id.org/asdkb/. The potential applications of AsdKB are question answering, auxiliary diagnosis, and expert recommendation, and we illustrate them with a prototype which can be accessed at http://asdkb.org.cn/.
Advancing Smart Malnutrition Monitoring: A Multi-Modal Learning Approach for Vital Health Parameter Estimation
results: The proposed model achieves a low mean absolute error (MAE), runs under a wide range of lighting conditions across multiple devices, and accurately estimates height and weight.Abstract
Malnutrition poses a significant threat to global health, resulting from an inadequate intake of essential nutrients that adversely impacts vital organs and overall bodily functioning. Periodic examinations and mass screenings, incorporating both conventional and non-invasive techniques, have been employed to combat this challenge. However, these approaches suffer from critical limitations, such as the need for additional equipment, lack of comprehensive feature representation, absence of suitable health indicators, and the unavailability of smartphone implementations for precise estimations of Body Fat Percentage (BFP), Basal Metabolic Rate (BMR), and Body Mass Index (BMI) to enable efficient smart-malnutrition monitoring. To address these constraints, this study presents a groundbreaking, scalable, and robust smart malnutrition-monitoring system that leverages a single full-body image of an individual to estimate height, weight, and other crucial health parameters within a multi-modal learning framework. Our proposed methodology involves the reconstruction of a highly precise 3D point cloud, from which 512-dimensional feature embeddings are extracted using a headless-3D classification network. Concurrently, facial and body embeddings are also extracted, and through the application of learnable parameters, these features are then utilized to estimate weight accurately. Furthermore, essential health metrics, including BMR, BFP, and BMI, are computed to conduct a comprehensive analysis of the subject's health, subsequently facilitating the provision of personalized nutrition plans. While being robust to a wide range of lighting conditions across multiple devices, our model achieves a low Mean Absolute Error (MAE) of $\pm$ 4.7 cm and $\pm$ 5.3 kg in estimating height and weight.
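The abstract does not state which formulas underlie the derived health metrics. A common choice, taken here purely as an assumption for illustration, is BMI = weight / height^2, the Mifflin-St Jeor equation for BMR, and the Deurenberg formula for BFP:

```python
def bmi(weight_kg, height_m):
    return weight_kg / height_m ** 2

def bmr_mifflin_st_jeor(weight_kg, height_cm, age, male):
    # Basal metabolic rate in kcal/day; +5 for males, -161 for females.
    return 10 * weight_kg + 6.25 * height_cm - 5 * age + (5 if male else -161)

def bfp_deurenberg(bmi_value, age, male):
    # Body fat percentage estimated from BMI, age, and sex (assumed formula).
    return 1.2 * bmi_value + 0.23 * age - 10.8 * (1 if male else 0) - 5.4

w, h_m, age = 70.0, 1.75, 30  # illustrative subject
b = bmi(w, h_m)
print(round(b, 1), round(bmr_mifflin_st_jeor(w, h_m * 100, age, True)),
      round(bfp_deurenberg(b, age, True), 1))
```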
Hybrid quantum transfer learning for crack image classification on NISQ hardware
paper_authors: Alexander Geng, Ali Moghiseh, Claudia Redenbach, Katja Schladitz
for: This paper explores the application of quantum computers to image processing.
methods: It uses quantum transfer learning to detect cracks in gray-value images.
results: The study finds that quantum transfer learning can detect cracks in gray-value images quickly and can improve detection accuracy.Abstract
Quantum computers possess the potential to process data using a remarkably reduced number of qubits compared to conventional bits, as per theoretical foundations. However, recent experiments have indicated that the practical feasibility of retrieving an image from its quantum encoded version is currently limited to very small image sizes. Despite this constraint, variational quantum machine learning algorithms can still be employed in the current noisy intermediate scale quantum (NISQ) era. An example is a hybrid quantum machine learning approach for edge detection. In our study, we present an application of quantum transfer learning for detecting cracks in gray value images. We compare the performance and training time of PennyLane's standard qubits with IBM's qasm\_simulator and real backends, offering insights into their execution efficiency.
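A minimal PennyLane sketch of the kind of variational circuit used in hybrid quantum transfer learning: classical features (e.g., from a pre-trained network) are angle-encoded into qubits, and a trainable entangling layer yields expectation values for a downstream classifier. The qubit count and layer sizes are illustrative, not the paper's configuration.

```python
import pennylane as qml
from pennylane import numpy as np

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def quantum_layer(features, weights):
    # Encode classical features as rotation angles, one per qubit.
    qml.AngleEmbedding(features, wires=range(n_qubits))
    # Trainable entangling layers form the "quantum" part of the model.
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

weights = np.random.uniform(0, np.pi, size=(2, n_qubits), requires_grad=True)
features = np.array([0.1, 0.5, 0.9, 0.3])
print(quantum_layer(features, weights))
```

The same QNode can be executed on IBM's qasm_simulator or real backends by swapping the device, which is what makes the execution-efficiency comparison in the paper possible.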
TFE-GNN: A Temporal Fusion Encoder Using Graph Neural Networks for Fine-grained Encrypted Traffic Classification
results: Experimental results show that TFE-GNN outperforms multiple state-of-the-art methods in fine-grained encrypted traffic classification on two real-world datasets.Abstract
Encrypted traffic classification is receiving widespread attention from researchers and industrial companies. However, the existing methods only extract flow-level features, failing to handle short flows because of unreliable statistical properties, or treat the header and payload equally, failing to mine the potential correlation between bytes. Therefore, in this paper, we propose a byte-level traffic graph construction approach based on point-wise mutual information (PMI), and a model named Temporal Fusion Encoder using Graph Neural Networks (TFE-GNN) for feature extraction. In particular, we design a dual embedding layer, a GNN-based traffic graph encoder as well as a cross-gated feature fusion mechanism, which can first embed the header and payload bytes separately and then fuses them together to obtain a stronger feature representation. The experimental results on two real datasets demonstrate that TFE-GNN outperforms multiple state-of-the-art methods in fine-grained encrypted traffic classification tasks.
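A minimal sketch of PMI-based graph construction over the bytes of a packet, in the spirit described above: byte values that co-occur within a sliding window more often than chance (positive PMI) are linked. The window size, the pair-counting scheme, and the zero threshold are illustrative assumptions, not the paper's exact construction.

```python
import math
from collections import Counter

def pmi_byte_graph(payload: bytes, window: int = 5):
    """Return edges (byte_i, byte_j, pmi) with positive PMI."""
    uni, pair = Counter(), Counter()
    n = len(payload)
    for i in range(n):
        uni[payload[i]] += 1
        for j in range(i + 1, min(i + window, n)):
            pair[frozenset((payload[i], payload[j]))] += 1
    total_pairs = sum(pair.values()) or 1
    edges = []
    for key, c in pair.items():
        a, b = (tuple(key) * 2)[:2]  # a frozenset {x} yields the pair (x, x)
        p_ab = c / total_pairs
        p_a, p_b = uni[a] / n, uni[b] / n
        pmi = math.log(p_ab / (p_a * p_b))  # PMI = log p(a,b) / (p(a) p(b))
        if pmi > 0:
            edges.append((a, b, pmi))
    return edges

print(len(pmi_byte_graph(b"GET / HTTP/1.1\r\nHost: example.com\r\n")))
```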
An O.D.E. Framework of Distributed TD-Learning for Networked Multi-Agent Markov Decision Processes
results: Our contributions are twofold: 1) We introduce novel distributed ODEs, inspired by the averaging consensus method in the continuous-time domain; the convergence of the ODEs is assessed from a control-theoretic perspective. 2) Building on these ODEs, we develop new distributed TD-learning algorithms. One of the proposed distributed ODEs comprises two independent dynamical systems, each with a distinct role, a feature that enables a novel distributed TD-learning strategy whose convergence can potentially be established via the Borkar-Meyn theorem.Abstract
The primary objective of this paper is to investigate distributed ordinary differential equation (ODE) and distributed temporal difference (TD) learning algorithms for networked multi-agent Markov decision problems (MAMDPs). In our study, we adopt a distributed multi-agent framework where individual agents have access only to their own rewards, lacking insights into the rewards of other agents. Additionally, each agent has the ability to share its parameters with neighboring agents through a communication network, represented by a graph. Our contributions can be summarized in two key points: 1) We introduce novel distributed ODEs, inspired by the averaging consensus method in the continuous-time domain. The convergence of the ODEs is assessed through control theory perspectives. 2) Building upon the aforementioned ODEs, we devise new distributed TD-learning algorithms. A standout feature of one of our proposed distributed ODEs is its incorporation of two independent dynamic systems, each with a distinct role. This characteristic sets the stage for a novel distributed TD-learning strategy, the convergence of which can potentially be established using Borkar-Meyn theorem.
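The abstract does not spell out the ODEs. A generic form of a consensus-coupled distributed TD ODE, written here only as a sketch consistent with the averaging-consensus idea and not necessarily the paper's exact system, is

```latex
\dot{\theta}^i_t \;=\; \mathbb{E}\!\left[\delta^i_t \,\nabla_{\theta} V_{\theta^i_t}(s_t)\right]
\;+\; \alpha \sum_{j \in \mathcal{N}_i} \left(\theta^j_t - \theta^i_t\right),
\qquad
\delta^i_t = r^i_t + \gamma V_{\theta^i_t}(s_{t+1}) - V_{\theta^i_t}(s_t)
```

where $\mathcal{N}_i$ are agent $i$'s neighbours in the communication graph, $\delta^i_t$ is the local TD error computed from agent $i$'s own reward, and the second term drives neighbouring agents' parameters toward consensus.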
Lookbehind Optimizer: k steps back, 1 step forward
results: With the Lookbehind algorithm, a myriad of benefits is achieved across a variety of tasks and training regimes, including improved generalization performance, greater robustness against noisy weights, and higher tolerance to catastrophic forgetting in lifelong learning.Abstract
The Lookahead optimizer improves the training stability of deep neural networks by having a set of fast weights that "look ahead" to guide the descent direction. Here, we combine this idea with sharpness-aware minimization (SAM) to stabilize its multi-step variant and improve the loss-sharpness trade-off. We propose Lookbehind, which computes $k$ gradient ascent steps ("looking behind") at each iteration and combine the gradients to bias the descent step toward flatter minima. We apply Lookbehind on top of two popular sharpness-aware training methods -- SAM and adaptive SAM (ASAM) -- and show that our approach leads to a myriad of benefits across a variety of tasks and training regimes. Particularly, we show increased generalization performance, greater robustness against noisy weights, and higher tolerance to catastrophic forgetting in lifelong learning settings.
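A minimal PyTorch-style sketch of the Lookbehind idea as described above: perform k SAM-style gradient-ascent steps from the current weights, combine the gradients collected along the way, and take one descent step with that combined direction. The hyperparameters and the exact averaging rule are assumptions; see the paper for the real algorithm.

```python
import copy
import torch

def lookbehind_step(model, loss_fn, batch, optimizer, rho=0.05, k=3):
    """One Lookbehind update: k ascent ("looking behind") steps, 1 step forward."""
    inputs, targets = batch
    backup = copy.deepcopy(model.state_dict())  # restore point
    avg_grads = None
    for _ in range(k):
        optimizer.zero_grad()
        loss_fn(model(inputs), targets).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        avg_grads = grads if avg_grads is None else [
            a + g for a, g in zip(avg_grads, grads)]
        # SAM-style ascent: move along the normalized gradient toward sharper loss.
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
        with torch.no_grad():
            for p, g in zip(model.parameters(), grads):
                p.add_(rho * g / norm)
    model.load_state_dict(backup)  # back to the original weights
    optimizer.zero_grad()
    for p, g in zip(model.parameters(), avg_grads):
        p.grad = g / k  # descend along the averaged gradient
    optimizer.step()
```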
results: The resulting axioms are used to enrich an ontology under human supervision, and a Protégé plugin is provided to this end.Abstract
We tackle the task of enriching ontologies by automatically translating natural language sentences into Description Logic. Since Large Language Models (LLMs) are the best tools for translations, we fine-tuned a GPT-3 model to convert Natural Language sentences into OWL Functional Syntax. We employ objective and concise examples to fine-tune the model regarding: instances, class subsumption, domain and range of relations, object properties relationships, disjoint classes, complements, cardinality restrictions. The resulting axioms are used to enrich an ontology, in a human supervised manner. The developed tool is publicly provided as a Protégé plugin.
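For illustration, the kind of input/output pairs such a fine-tuned model handles might look as follows. The sentences, identifiers like :Student, and axioms are hypothetical examples covering the axiom types listed above, not the paper's actual training data.

```python
# Hypothetical NL -> OWL Functional Syntax pairs (subsumption, disjointness,
# domain/range of relations, cardinality restrictions).
examples = {
    "Every student is a person.":
        "SubClassOf(:Student :Person)",
    "Nothing can be both a cat and a dog.":
        "DisjointClasses(:Cat :Dog)",
    "Only people can teach, and only courses are taught.":
        "ObjectPropertyDomain(:teaches :Person) "
        "ObjectPropertyRange(:teaches :Course)",
    "A bicycle has exactly two wheels.":
        "SubClassOf(:Bicycle ObjectExactCardinality(2 :hasWheel :Wheel))",
}
for sentence, axiom in examples.items():
    print(f"{sentence}\n  -> {axiom}")
```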
Anticipating Responsibility in Multiagent Planning
for: This paper studies responsibility anticipation, i.e., an agent determining, in a multi-agent planning setting, whether its own actions may make it responsible for a particular outcome.
methods: The paper expresses outcomes as formulas in linear temporal logic and uses different notions of responsibility to define the corresponding notions of responsibility anticipation.
results: The paper proves that responsibility anticipation can be used to coordinate agents in a multi-agent planning setting, provides complexity results, and discusses equivalence with classical planning. In addition, the paper outlines a method for solving some of the attribution and anticipation problems using PDDL solvers.Abstract
Responsibility anticipation is the process of determining if the actions of an individual agent may cause it to be responsible for a particular outcome. This can be used in a multi-agent planning setting to allow agents to anticipate responsibility in the plans they consider. The planning setting in this paper includes partial information regarding the initial state and considers formulas in linear temporal logic as positive or negative outcomes to be attained or avoided. We first define attribution for notions of active, passive and contributive responsibility, and consider their agentive variants. We then use these to define the notion of responsibility anticipation. We prove that our notions of anticipated responsibility can be used to coordinate agents in a planning setting and give complexity results for our model, discussing equivalence with classical planning. We also present an outline for solving some of our attribution and anticipation problems using PDDL solvers.
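For instance, outcomes in linear temporal logic might pair a negative formula to be avoided with a positive formula to be attained (hypothetical examples, not drawn from the paper):

```latex
\varphi_{\text{neg}} = \mathbf{F}\,\mathit{collision}
\qquad
\varphi_{\text{pos}} = \mathbf{F}\,\mathit{delivered} \land \mathbf{G}\,\lnot\mathit{overload}
```

Here F ("eventually") and G ("always") are the standard temporal operators; an agent anticipates responsibility for the negative outcome if some plan it considers may cause that formula to hold.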
On the Trustworthiness Landscape of State-of-the-art Generative Models: A Comprehensive Survey
paper_authors: Mingyuan Fan, Cen Chen, Chengyu Wang, Jun Huang
for: This paper aims to investigate the trustworthiness of large-scale generative models, specifically addressing privacy, security, fairness, and responsibility concerns.
methods: The paper employs a comprehensive approach, analyzing both long-standing and emerging threats associated with these models across four fundamental dimensions.
results: The authors provide an extensive map outlining the trustworthiness of these models, as well as practical recommendations and future directions for promoting their trustworthy deployment, ultimately benefiting society as a whole.Abstract
Diffusion models and large language models have emerged as leading-edge generative models and have sparked a revolutionary impact on various aspects of human life. However, the practical implementation of these models has also exposed inherent risks, highlighting their dual nature and raising concerns regarding their trustworthiness. Despite the abundance of literature on this subject, a comprehensive survey specifically delving into the intersection of large-scale generative models and their trustworthiness remains largely absent. To bridge this gap, This paper investigates both the long-standing and emerging threats associated with these models across four fundamental dimensions: privacy, security, fairness, and responsibility. In this way, we construct an extensive map outlining the trustworthiness of these models, while also providing practical recommendations and identifying future directions. These efforts are crucial for promoting the trustworthy deployment of these models, ultimately benefiting society as a whole.
Proactive Resource Request for Disaster Response: A Deep Learning-based Optimization Model
results: Comparisons against prevalent existing methods on both real-world and simulated data show that the proposed method outperforms them.Abstract
Disaster response is critical to save lives and reduce damages in the aftermath of a disaster. Fundamental to disaster response operations is the management of disaster relief resources. To this end, a local agency (e.g., a local emergency resource distribution center) collects demands from local communities affected by a disaster, dispatches available resources to meet the demands, and requests more resources from a central emergency management agency (e.g., Federal Emergency Management Agency in the U.S.). Prior resource management research for disaster response overlooks the problem of deciding optimal quantities of resources requested by a local agency. In response to this research gap, we define a new resource management problem that proactively decides optimal quantities of requested resources by considering both currently unfulfilled demands and future demands. To solve the problem, we take salient characteristics of the problem into consideration and develop a novel deep learning method for future demand prediction. We then formulate the problem as a stochastic optimization model, analyze key properties of the model, and propose an effective solution method to the problem based on the analyzed properties. We demonstrate the superior performance of our method over prevalent existing methods using both real world and simulated data. We also show its superiority over prevalent existing methods in a multi-stakeholder and multi-objective setting through simulations.
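While the paper's model is a deep-learning-driven stochastic optimization, the core trade-off in choosing a requested quantity under demand uncertainty can be illustrated with the classical newsvendor quantile rule, used here as a simplification rather than the paper's method: request the demand quantile determined by the relative costs of under- and over-requesting.

```python
import numpy as np

def optimal_request(demand_samples, under_cost, over_cost):
    """Newsvendor-style quantity: the cu/(cu+co) quantile of predicted demand.

    demand_samples: e.g., Monte Carlo draws from a learned demand model.
    under_cost: penalty per unit of unmet demand (large in disaster response).
    over_cost: penalty per unit of excess resources shipped.
    """
    critical_ratio = under_cost / (under_cost + over_cost)
    return float(np.quantile(demand_samples, critical_ratio))

rng = np.random.default_rng(0)
predicted_demand = rng.lognormal(mean=5.0, sigma=0.4, size=10_000)
print(optimal_request(predicted_demand, under_cost=10.0, over_cost=1.0))
```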
LLMs4OL: Large Language Models for Ontology Learning
results: The evaluation results show that LLMs can effectively apply their language-pattern-capturing capability to automatically extract and structure knowledge from natural language text, performing well across different genres of ontological knowledge.Abstract
We propose the LLMs4OL approach, which utilizes Large Language Models (LLMs) for Ontology Learning (OL). LLMs have shown significant advancements in natural language processing, demonstrating their ability to capture complex language patterns in different knowledge domains. Our LLMs4OL paradigm investigates the following hypothesis: \textit{Can LLMs effectively apply their language pattern capturing capability to OL, which involves automatically extracting and structuring knowledge from natural language text?} To test this hypothesis, we conduct a comprehensive evaluation using the zero-shot prompting method. We evaluate nine different LLM model families for three main OL tasks: term typing, taxonomy discovery, and extraction of non-taxonomic relations. Additionally, the evaluations encompass diverse genres of ontological knowledge, including lexicosemantic knowledge in WordNet, geographical knowledge in GeoNames, and medical knowledge in UMLS.
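A zero-shot term-typing prompt of the kind described might look like the template below. This is a hypothetical prompt for illustration; the paper defines its own prompt templates for each task and knowledge source.

```python
TERM_TYPING_PROMPT = (
    "Perform a term typing task.\n"
    "Term: {term}\n"
    "Question: What is the generalized type of this term in WordNet?\n"
    "Answer with one of: noun, verb, adjective, adverb.\n"
    "Answer:"
)
print(TERM_TYPING_PROMPT.format(term="serendipity"))
```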
Perceptions of the Fourth Industrial Revolution and Artificial Intelligence Impact on Society
paper_authors: Daniel Agbaji, Brady Lund, Nishith Reddy Mannuru
for: This study aims to examine the perceptions of individuals in different information flow categorizations toward AI and its implications for society.
methods: The study uses participant-supplied definitions of AI and the fourth industrial revolution to identify key themes and concerns regarding AI, such as job replacement, privacy invasion, and inaccurate information.
results: The results reveal that participants expressed concerns about the potential negative impacts of AI, such as job replacement and privacy invasion, but also recognized the benefits of AI, such as solving complex problems and increasing convenience.Abstract
The Fourth Industrial Revolution, particularly Artificial Intelligence (AI), has had a profound impact on society, raising concerns about its implications and ethical considerations. The emergence of text generative AI tools like ChatGPT has further intensified concerns regarding ethics, security, privacy, and copyright. This study aims to examine the perceptions of individuals in different information flow categorizations toward AI. The results reveal key themes in participant-supplied definitions of AI and the fourth industrial revolution, emphasizing the replication of human intelligence, machine learning, automation, and the integration of digital technologies. Participants expressed concerns about job replacement, privacy invasion, and inaccurate information provided by AI. However, they also recognized the benefits of AI, such as solving complex problems and increasing convenience. Views on government involvement in shaping the fourth industrial revolution varied, with some advocating for strict regulations and others favoring support and development. The anticipated changes brought by the fourth industrial revolution include automation, potential job impacts, increased social disconnect, and reliance on technology. Understanding these perceptions is crucial for effectively managing the challenges and opportunities associated with AI in the evolving digital landscape.
NLLG Quarterly arXiv Report 06/23: What are the most influential current AI Papers?
paper_authors: Steffen Eger, Christoph Leiter, Jonas Belouadi, Ran Zhang, Aida Kostikova, Daniil Larionov, Yanran Chen, Vivian Fresen
for: This research report aims to give researchers and practitioners a quick guide to the latest developments and trends in natural language processing (NLP) and machine learning (ML).
methods: The report lists the 40 most popular papers on arXiv based on normalized citation counts, together with the main research directions and trends they represent.
results: The study finds that NLP-related papers account for around 60% of the most influential papers from the first half of 2023, even though ML-related papers are twice as numerous in the data. LLM efficiency, evaluation techniques, ethical considerations, embodied agents, and problem-solving with LLMs are among the most popular paper topics.Abstract
The rapid growth of information in the field of Generative Artificial Intelligence (AI), particularly in the subfields of Natural Language Processing (NLP) and Machine Learning (ML), presents a significant challenge for researchers and practitioners to keep pace with the latest developments. To address the problem of information overload, this report by the Natural Language Learning Group at Bielefeld University focuses on identifying the most popular papers on arXiv, with a specific emphasis on NLP and ML. The objective is to offer a quick guide to the most relevant and widely discussed research, aiding both newcomers and established researchers in staying abreast of current trends. In particular, we compile a list of the 40 most popular papers based on normalized citation counts from the first half of 2023. We observe the dominance of papers related to Large Language Models (LLMs) and specifically ChatGPT during the first half of 2023, with the latter showing signs of declining popularity more recently, however. Further, NLP related papers are the most influential (around 60\% of top papers) even though there are twice as many ML related papers in our data. Core issues investigated in the most heavily cited papers are: LLM efficiency, evaluation techniques, ethical considerations, embodied agents, and problem-solving with LLMs. Additionally, we examine the characteristics of top papers in comparison to others outside the top-40 list (noticing the top paper's focus on LLM related issues and higher number of co-authors) and analyze the citation distributions in our dataset, among others.
Chatbot Application to Support Smart Agriculture in Thailand
results: After the implementation described in the paper, farmers' satisfaction with the application reached 96%. However, when farmers use the question box, the application is only a script-based, rule-driven bot: farmers must type the prescribed keywords to receive a reply.Abstract
A chatbot is software developed to help reply to text or voice conversations automatically and quickly in real time. In the agriculture sector, the existing smart agriculture systems just use data from sensing and internet of things (IoT) technologies that exclude crop cultivation knowledge to support decision-making by farmers. To enhance this, the chatbot application can be an assistant to farmers to provide crop cultivation knowledge. Consequently, we propose the LINE chatbot application as an information and knowledge representation providing crop cultivation recommendations to farmers. It works with smart agriculture and recommendation systems. Our proposed LINE chatbot application consists of five main functions (start/stop menu, main page, drip irrigation page, mist irrigation page, and monitor page). Farmers will receive information for data monitoring to support their decision-making. Moreover, they can control the irrigation system via the LINE chatbot. Furthermore, farmers can ask questions relevant to the crop environment via a chat box. After implementing our proposed chatbot, farmers are very satisfied with the application, scoring a 96% satisfaction score. However, in terms of asking questions via chat box, this LINE chatbot application is a rule-based bot or script bot. Farmers have to type in the correct keywords as prescribed, otherwise they won't get a response from the chatbots. In the future, we will enhance the asking function of our LINE chatbot to be an intelligent bot.
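The question box described above is a rule-based (script) bot: replies fire only on prescribed keywords. A minimal sketch of that matching logic follows; the keywords and replies are made up for illustration, while a real deployment would hold crop-cultivation answers curated for the target crops.

```python
# Hypothetical keyword -> reply rules.
RULES = {
    "soil moisture": "Current soil moisture is streamed on the monitor page.",
    "drip irrigation": "Open the drip irrigation page to start or stop watering.",
    "fertilizer": "Apply fertilizer according to the crop stage shown in the app.",
}

def reply(message: str) -> str:
    text = message.lower()
    for keyword, answer in RULES.items():
        if keyword in text:  # fires only on the prescribed keywords
            return answer
    return "Sorry, please ask using the suggested keywords."

print(reply("How do I start drip irrigation?"))
print(reply("Why are my leaves yellow?"))  # unmatched question -> fallback
```

This also makes the stated limitation concrete: any phrasing that misses the prescribed keywords falls through to the fallback, which is why an intelligent bot is planned as future work.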
Approximating Counterfactual Bounds while Fusing Observational, Biased and Randomised Data Sources
paper_authors: Marco Zaffalon, Alessandro Antonucci, Rafael Cabañas, David Huber
for: This study addresses the problem of integrating data from multiple, possibly biased, observational and interventional studies in order to compute counterfactuals in structural causal models.
methods: We use a causal expectation-maximization scheme to approximate partially identifiable counterfactual queries, and adopt graphical transformations to remap multiple data sources into a single one. A case study on palliative care illustrates the effectiveness of our approach and hints at the benefits of fusing heterogeneous data sources.
results: Our findings show that fusing heterogeneous data sources can yield more informative results, especially in cases of partial identifiability.Abstract
We address the problem of integrating data from multiple, possibly biased, observational and interventional studies, to eventually compute counterfactuals in structural causal models. We start from the case of a single observational dataset affected by a selection bias. We show that the likelihood of the available data has no local maxima. This enables us to use the causal expectation-maximisation scheme to approximate the bounds for partially identifiable counterfactual queries, which are the focus of this paper. We then show how the same approach can address the general case of multiple datasets, no matter whether interventional or observational, biased or unbiased, by remapping it into the former one via graphical transformations. Systematic numerical experiments and a case study on palliative care show the effectiveness of our approach, while hinting at the benefits of fusing heterogeneous data sources to get informative outcomes in case of partial identifiability.
Toward Quantum Machine Translation of Syntactically Distinct Languages
results: Our experiments involved 160 samples comprising English sentences and their Persian translations. We adopted the encoder-decoder model of classical neural networks and implemented the translation task using long short-term memory (LSTM) networks, training with different optimizers, including stochastic gradient descent (SGD) and two additional optimizers used in conjunction with it. The best model, consisting of two LSTM layers and trained with the Adam optimizer, achieved the best results. Our small dataset, although consisting only of simple synonymous sentences with word-to-word mappings, points to the utility of Shannon entropy as a figure of merit in more complex machine translation models.Abstract
The present study aims to explore the feasibility of language translation using quantum natural language processing algorithms on noisy intermediate-scale quantum (NISQ) devices. Classical methods in natural language processing (NLP) struggle with handling large-scale computations required for complex language tasks, but quantum NLP on NISQ devices holds promise in harnessing quantum parallelism and entanglement to efficiently process and analyze vast amounts of linguistic data, potentially revolutionizing NLP applications. Our research endeavors to pave the way for quantum neural machine translation, which could potentially offer advantages over classical methods in the future. We employ Shannon entropy to demonstrate the significant role of some appropriate angles of rotation gates in the performance of parametrized quantum circuits. In particular, we utilize these angles (parameters) as a means of communication between quantum circuits of different languages. To achieve our objective, we adopt the encoder-decoder model of classical neural networks and implement the translation task using long short-term memory (LSTM). Our experiments involved 160 samples comprising English sentences and their Persian translations. We trained the models with different optimizers, using stochastic gradient descent (SGD) as the primary optimizer and subsequently incorporating two additional optimizers in conjunction with SGD. Notably, we achieved optimal results (mean absolute error of 0.03, mean squared error of 0.002, and a loss of 0.016) by training the best model, consisting of two LSTM layers, with the Adam optimizer. Our small dataset, though consisting of simple synonymous sentences with word-to-word mappings, points to the utility of Shannon entropy as a figure of merit in more complex machine translation models for intricate sentence structures.
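The Shannon entropy used as a figure of merit is, in its standard form,

```latex
H(X) = -\sum_{x} p(x)\,\log_2 p(x)
```

where $p(x)$ is the probability of outcome $x$. How the paper evaluates this entropy over the rotation-gate parameters of the parametrized circuits is specific to its setup; the formula above is only the standard definition.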
No Fair Lunch: A Causal Perspective on Dataset Bias in Machine Learning for Medical Imaging
paper_authors: Charles Jones, Daniel C. Castro, Fabio De Sousa Ribeiro, Ozan Oktay, Melissa McCradden, Ben Glocker
for: This paper aims to address fairness concerns in machine learning methods used in clinical decision-making, highlighting the need for a more comprehensive understanding of algorithmic bias and its mitigation strategies.
methods: The paper uses a causal perspective to identify three families of causal bias mechanisms in medical imaging datasets, and provides a practical three-step framework for reasoning about fairness in AI prediction models.
results: The paper highlights the limitations of current mitigation methods and emphasizes the importance of considering a broader range of scenarios when addressing algorithmic bias in medical imaging.Abstract
As machine learning methods gain prominence within clinical decision-making, addressing fairness concerns becomes increasingly urgent. Despite considerable work dedicated to detecting and ameliorating algorithmic bias, today's methods are deficient with potentially harmful consequences. Our causal perspective sheds new light on algorithmic bias, highlighting how different sources of dataset bias may appear indistinguishable yet require substantially different mitigation strategies. We introduce three families of causal bias mechanisms stemming from disparities in prevalence, presentation, and annotation. Our causal analysis underscores how current mitigation methods tackle only a narrow and often unrealistic subset of scenarios. We provide a practical three-step framework for reasoning about fairness in medical imaging, supporting the development of safe and equitable AI prediction models.
Rethinking Collaborative Perception from the Spatial-Temporal Importance of Semantic Information
results: Extensive experiments on two open datasets demonstrate that IoSI-CP significantly improves perception performance compared with state-of-the-art approaches.Abstract
Collaboration by the sharing of semantic information is crucial to enable the enhancement of perception capabilities. However, existing collaborative perception methods tend to focus solely on the spatial features of semantic information, while neglecting the importance of the temporal dimension in collaborator selection and semantic information fusion, which instigates performance degradation. In this article, we propose a novel collaborative perception framework, IoSI-CP, which takes into account the importance of semantic information (IoSI) from both temporal and spatial dimensions. Specifically, we develop an IoSI-based collaborator selection method that effectively identifies advantageous collaborators but excludes those that bring negative benefits. Moreover, we present a semantic information fusion algorithm called HPHA (historical prior hybrid attention), which integrates a multi-scale transformer module and a short-term attention module to capture IoSI from spatial and temporal dimensions, and assigns varying weights for efficient aggregation. Extensive experiments on two open datasets demonstrate that our proposed IoSI-CP significantly improves the perception performance compared to state-of-the-art approaches. The code associated with this research is publicly available at https://github.com/huangqzj/IoSI-CP/.
Deception Abilities Emerged in Large Language Models
results: The study finds that state-of-the-art large language models have acquired the ability to induce false beliefs in other agents, that chain-of-thought reasoning can amplify their performance in deception scenarios, and that eliciting Machiavellianism in the models can alter their propensity to deceive.Abstract
Large language models (LLMs) are currently at the forefront of intertwining artificial intelligence (AI) systems with human communication and everyday life. Thus, aligning them with human values is of great importance. However, given the steady increase in reasoning abilities, future LLMs are under suspicion of becoming able to deceive human operators and utilizing this ability to bypass monitoring efforts. As a prerequisite to this, LLMs need to possess a conceptual understanding of deception strategies. This study reveals that such strategies emerged in state-of-the-art LLMs, such as GPT-4, but were non-existent in earlier LLMs. We conduct a series of experiments showing that state-of-the-art LLMs are able to understand and induce false beliefs in other agents, that their performance in complex deception scenarios can be amplified utilizing chain-of-thought reasoning, and that eliciting Machiavellianism in LLMs can alter their propensity to deceive. In sum, revealing hitherto unknown machine behavior in LLMs, our study contributes to the nascent field of machine psychology.
Towards a Comprehensive Human-Centred Evaluation Framework for Explainable AI
results: Using this new evaluation framework, the paper assesses the human-centred experience of XAI methods holistically and points to possible improvements.Abstract
While research on explainable AI (XAI) is booming and explanation techniques have proven promising in many application domains, standardised human-centred evaluation procedures are still missing. In addition, current evaluation procedures do not assess XAI methods holistically in the sense that they do not treat explanations' effects on humans as a complex user experience. To tackle this challenge, we propose to adapt the User-Centric Evaluation Framework used in recommender systems: we integrate explanation aspects, summarise explanation properties, indicate relations between them, and categorise metrics that measure these properties. With this comprehensive evaluation framework, we hope to contribute to the human-centred standardisation of XAI evaluation.
Value-Informed Skill Chaining for Policy Learning of Long-Horizon Tasks with Surgical Robot
results: The method achieves high task success rates and execution efficiency on three complex surgical robot tasks, demonstrating the effectiveness of ViSkill.Abstract
Reinforcement learning is still struggling with solving long-horizon surgical robot tasks which involve multiple steps over an extended duration of time due to the policy exploration challenge. Recent methods try to tackle this problem by skill chaining, in which the long-horizon task is decomposed into multiple subtasks for easing the exploration burden and subtask policies are temporally connected to complete the whole long-horizon task. However, smoothly connecting all subtask policies is difficult for surgical robot scenarios. Not all states are equally suitable for connecting two adjacent subtasks. An undesired terminal state of the previous subtask would make the current subtask policy unstable and result in a failed execution. In this work, we introduce value-informed skill chaining (ViSkill), a novel reinforcement learning framework for long-horizon surgical robot tasks. The core idea is to distinguish which terminal state is suitable for starting all the following subtask policies. To achieve this target, we introduce a state value function that estimates the expected success probability of the entire task given a state. Based on this value function, a chaining policy is learned to instruct subtask policies to terminate at the state with the highest value so that all subsequent policies are more likely to be connected for accomplishing the task. We demonstrate the effectiveness of our method on three complex surgical robot tasks from SurRoL, a comprehensive surgical simulation platform, achieving high task success rates and execution efficiency. Code is available at \href{https://github.com/med-air/ViSkill}{https://github.com/med-air/ViSkill}.
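In symbols, the idea sketched above can be written as follows (the notation is ours, not necessarily the paper's): the state-value function estimates the success probability of the whole task from a state, and the chaining policy steers each subtask to terminate where that value is highest,

```latex
V(s) \;\approx\; \Pr\left(\text{task success} \mid s\right),
\qquad
s^{\star}_{k} \;=\; \arg\max_{s \,\in\, \mathcal{T}_k} V(s)
```

where $\mathcal{T}_k$ denotes the reachable terminal states of subtask $k$; terminating each subtask at $s^{\star}_{k}$ makes the following subtask policies more likely to connect.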
BAGM: A Backdoor Attack for Manipulating Text-to-Image Generative Models
paper_authors: Jordan Vice, Naveed Akhtar, Richard Hartley, Ajmal Mian
for: This paper investigates backdoor attacks on text-to-image generative artificial intelligence (AI).
methods: The paper uses a suite of attack strategies, including surface, shallow, and deep attacks, which target various stages of the text-to-image generative pipeline, including the tokenizer and the language and visual neural networks.
results: The authors demonstrate the effectiveness of their attacks against the state-of-the-art Stable Diffusion pipeline and contribute a Marketable Foods dataset of branded product images.
Abstract
The rise in popularity of text-to-image generative artificial intelligence (AI) has attracted widespread public interest. At the same time, backdoor attacks are well-known in machine learning literature for their effective manipulation of neural models, which is a growing concern among practitioners. We highlight this threat for generative AI by introducing a Backdoor Attack on text-to-image Generative Models (BAGM). Our attack targets various stages of the text-to-image generative pipeline, modifying the behaviour of the embedded tokenizer and the pre-trained language and visual neural networks. Based on the penetration level, BAGM takes the form of a suite of attacks that are referred to as surface, shallow and deep attacks in this article. We compare the performance of BAGM to recently emerging related methods. We also contribute a set of quantitative metrics for assessing the performance of backdoor attacks on generative AI models in the future. The efficacy of the proposed framework is established by targeting the state-of-the-art stable diffusion pipeline in a digital marketing scenario as the target domain. To that end, we also contribute a Marketable Foods dataset of branded product images. We hope this work contributes towards exposing the contemporary generative AI security challenges and fosters discussions on preemptive efforts for addressing those challenges. Keywords: Generative Artificial Intelligence, Generative Models, Text-to-Image generation, Backdoor Attacks, Trojan, Stable Diffusion.
Model-free Grasping with Multi-Suction Cup Grippers for Robotic Bin Picking
results: Experimental evaluations on a real-world industrial application with bin picking scenes of varying difficulty demonstrate the effectiveness of the method.
Abstract
This paper presents a novel method for model-free prediction of grasp poses for suction grippers with multiple suction cups. Our approach is agnostic to the design of the gripper and does not require gripper-specific training data. In particular, we propose a two-step approach, where first, a neural network predicts pixel-wise grasp quality for an input image to indicate areas that are generally graspable. Second, an optimization step determines the optimal gripper selection and corresponding grasp poses based on configured gripper layouts and activation schemes. In addition, we introduce a method for automated labeling for supervised training of the grasp quality network. Experimental evaluations on a real-world industrial application with bin picking scenes of varying difficulty demonstrate the effectiveness of our method.
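The second, optimization step lends itself to a compact illustration. Below is a toy sketch of scoring every image location for every configured cup layout against the network's pixel-wise quality map; the function names and the mean-quality scoring rule are our assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.ndimage import convolve

def best_grasp(quality_map, cup_masks):
    """Given a pixel-wise grasp-quality map (step 1, from a neural
    network), score every image location for every configured cup
    layout and return the best (layout index, (row, col), score).

    cup_masks: list of binary arrays, one per gripper layout /
               activation scheme, marking the active cup footprint.
    """
    best = (None, None, -np.inf)
    for i, mask in enumerate(cup_masks):
        # Average quality under the cup footprint at every placement.
        score = convolve(quality_map, mask / mask.sum(), mode="constant")
        r, c = np.unravel_index(np.argmax(score), score.shape)
        if score[r, c] > best[2]:
            best = (i, (r, c), score[r, c])
    return best

quality = np.random.rand(64, 64)               # stand-in for network output
layouts = [np.ones((3, 3)), np.ones((5, 2))]   # two hypothetical cup layouts
print(best_grasp(quality, layouts))
```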
To Classify is to Interpret: Building Taxonomies from Heterogeneous Data through Human-AI Collaboration
for: This paper focuses on supporting taxonomy building with machine learning (ML) systems, with an emphasis on human-AI collaboration.
methods: The approach proposed in the paper allows users to iteratively consider multiple ML models’ outputs as part of their sensemaking process.
results: The authors implemented their approach in two real-world use cases and found that it enabled more effective taxonomy building compared to relying solely on black-boxed ML systems.
Abstract
Taxonomy building is a task that requires interpreting and classifying data within a given frame of reference, which comes into play in many areas of application that deal with knowledge and information organization. In this paper, we explore how taxonomy building can be supported with systems that integrate machine learning (ML). However, relying only on black-boxed ML-based systems to automate taxonomy building would sideline the users' expertise. We propose an approach that allows the user to iteratively take multiple models' outputs into account as part of their sensemaking process. We implemented our approach in two real-world use cases. The work is positioned in the context of HCI research that investigates the design of ML-based systems with an emphasis on enabling human-AI collaboration.
Tracking multiple targets with multiple radars using Distributed Auctions
paper_authors: Pierre Larrenie, Cédric Buron, Frédéric Barbaresco
for: Improving the resilience and flexibility of radar networks
methods: A decentralized and collaborative bundle-auction algorithm
results: The approach tracks multiple targets simultaneously and can assign up to two radars to the same target to improve accuracy.
Abstract
Coordination of radars can be performed in various ways. To be more resilient, radar networks can be coordinated in a decentralized way. In this paper, we introduce a highly resilient algorithm for radar coordination based on decentralized and collaborative bundle auctions. We first formalize our problem as a constrained optimization problem and apply a market-based algorithm to provide an approximate solution. Our approach allows tracking multiple targets simultaneously, using up to two radars per target to improve accuracy. We show that our approach performs nearly as well as a centralized approach relying on a MIP solver and, depending on the situation, may outperform it or be outperformed.
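As a rough intuition for the allocation problem being solved, here is a greedy, centralized stand-in for the decentralized bundle auction; it is only a sketch of the assignment objective (utilities, the two-radars-per-target cap), not the authors' protocol:

```python
import numpy as np

def greedy_bundle_assignment(utility, max_radars_per_target=2):
    """Repeatedly award the highest remaining radar-target utility,
    letting up to two radars track the same target for accuracy.

    utility: (n_radars, n_targets) matrix of tracking utilities.
    """
    u = utility.astype(float).copy()
    assigned = {}                           # radar -> target
    load = np.zeros(u.shape[1], dtype=int)  # radars already on each target
    while True:
        r, t = np.unravel_index(np.argmax(u), u.shape)
        if not np.isfinite(u[r, t]):
            break
        assigned[r] = t
        load[t] += 1
        u[r, :] = -np.inf                   # radar r is taken
        if load[t] >= max_radars_per_target:
            u[:, t] = -np.inf               # target t is saturated
    return assigned

# Three radars, two targets: radars 0 and 2 both end up on target 0.
util = np.array([[9.0, 3.0], [2.0, 5.0], [8.0, 4.0]])
print(greedy_bundle_assignment(util))       # {0: 0, 2: 0, 1: 1}
```

In the paper's setting this allocation emerges from local bids exchanged between radars, which is what makes the network resilient to node failures.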
L3DMC: Lifelong Learning using Distillation via Mixed-Curvature Space
results: Experiments show that L3DMC adapts to new knowledge without forgetting past knowledge and performs effectively on medical image classification tasks.
Abstract
The performance of a lifelong learning (L3) model degrades when it is trained on a series of tasks, as the geometrical formation of the embedding space changes while learning novel concepts sequentially. The majority of existing L3 approaches operate on a fixed-curvature (e.g., zero-curvature Euclidean) space that is not necessarily suitable for modeling the complex geometric structure of data. Furthermore, the distillation strategies apply constraints directly on low-dimensional embeddings, discouraging the L3 model from learning new concepts by making the model highly stable. To address the problem, we propose a distillation strategy named L3DMC that operates on mixed-curvature spaces to preserve the already-learned knowledge by modeling and maintaining complex geometrical structures. We propose to embed the projected low dimensional embedding of fixed-curvature spaces (Euclidean and hyperbolic) to higher-dimensional Reproducing Kernel Hilbert Space (RKHS) using a positive-definite kernel function to attain rich representation. Afterward, we optimize the L3 model by minimizing the discrepancies between the new sample representation and the subspace constructed using the old representation in RKHS. L3DMC is capable of adapting new knowledge better without forgetting old knowledge as it combines the representation power of multiple fixed-curvature spaces and is performed on higher-dimensional RKHS. Thorough experiments on three benchmarks demonstrate the effectiveness of our proposed distillation strategy for medical image classification in L3 settings. Our code implementation is publicly available at https://github.com/csiro-robotics/L3DMC.
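The RKHS distance at the heart of the distillation objective can be written down compactly with the kernel trick. The sketch below illustrates only that core term with a Gaussian kernel; the mixed-curvature embeddings and the old-representation subspace construction are deliberately simplified away, and all shapes are assumptions:

```python
import torch

def rbf(x, y, gamma=1.0):
    """Positive-definite Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return torch.exp(-gamma * (x - y).pow(2).sum(-1))

def rkhs_distillation_loss(z_new, z_old, gamma=1.0):
    """Squared RKHS distance between the implicit features of the new
    and old embeddings, ||phi(z_new) - phi(z_old)||_H^2, expanded via
    the kernel trick. Minimizing it keeps new representations close to
    the old ones in a rich (infinite-dimensional) feature space.
    """
    return (rbf(z_new, z_new, gamma)        # equals 1 for an RBF kernel
            - 2 * rbf(z_new, z_old, gamma)
            + rbf(z_old, z_old, gamma)).mean()

z_old = torch.randn(32, 128)                # embeddings from the frozen model
z_new = torch.randn(32, 128, requires_grad=True)
loss = rkhs_distillation_loss(z_new, z_old)
loss.backward()
```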
An Effective Data Creation Pipeline to Generate High-quality Financial Instruction Data for Large Language Model
for: This paper is written for the purpose of creating a high-quality financial dataset to fine-tune large language models for financial tasks.
methods: The paper presents a data creation pipeline that incorporates human financial expert feedback to refine the dataset, using ChatGPT to initiate a dialogue between an AI investor and financial expert.
results: The pipeline yields a robust instruction tuning dataset of 103k multi-turn chats, and the experimental results show significant advancements in generating accurate, relevant, and financial-style responses from AI models, providing a powerful tool for financial applications.
Abstract
In the early era of large language models, it is critical to generate a high-quality financial dataset for fine-tuning a large language model on finance-related tasks. Thus, this paper presents a carefully designed data creation pipeline for this purpose. In particular, we initiate a dialogue between an AI investor and a financial expert using ChatGPT and incorporate the feedback of human financial experts, leading to the refinement of the dataset. This pipeline yielded a robust instruction tuning dataset comprising 103k multi-turn chats. Extensive experiments have been conducted on this dataset to evaluate the model's performance, adopting an external GPT-4 as the judge. The promising experimental results verify that our approach leads to significant advancements in generating accurate, relevant, and financial-style responses from AI models, thus providing a powerful tool for applications within the financial sector.
paper_authors: Guodong Ding, Fadime Sener, Shugao Ma, Angela Yao
for: This work aims to help AI assistants better support users through complex procedures such as cooking, home repair, and assembly tasks.
methods: The study learns a knowledge base of spatial and temporal beliefs from observed mistakes and uses it to detect ordering mistakes in assembly procedures.
results: Experiments show that the proposed belief inference algorithm accurately detects ordering mistakes in real-world action sequences.
Abstract
One promising use case of AI assistants is to help with complex procedures like cooking, home repair, and assembly tasks. Can we teach the assistant to interject after the user makes a mistake? This paper targets the problem of identifying ordering mistakes in assembly procedures. We propose a system that can detect ordering mistakes by utilizing a learned knowledge base. Our framework constructs a knowledge base with spatial and temporal beliefs based on observed mistakes. Spatial beliefs depict the topological relationship of the assembling components, while temporal beliefs aggregate prerequisite actions as ordering constraints. With an episodic memory design, our algorithm can dynamically update and construct the belief sets as more actions are observed, all in an online fashion. We demonstrate experimentally that our inferred spatial and temporal beliefs are capable of identifying incorrect orderings in real-world action sequences. To construct the spatial beliefs, we collect a new set of coarse-level action annotations for Assembly101 based on the positioning of the toy parts. Finally, we demonstrate the superior performance of our belief inference algorithm in detecting ordering mistakes on the Assembly101 dataset.
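The temporal-belief idea (prerequisite actions as ordering constraints) reduces to a simple check once the constraints are learned. A minimal sketch, with hypothetical action names and a hand-written constraint set standing in for the learned knowledge base:

```python
def check_ordering(actions, prerequisites):
    """Flag ordering mistakes: an action is a mistake if some of its
    prerequisite actions (the temporal beliefs) have not happened yet.

    prerequisites: dict mapping action -> set of actions that must
                   precede it, e.g. aggregated from observed assemblies.
    """
    done, mistakes = set(), []
    for step, a in enumerate(actions):
        missing = prerequisites.get(a, set()) - done
        if missing:
            mistakes.append((step, a, missing))
        done.add(a)
    return mistakes

# Hypothetical toy constraints: the cabin and wheel need the chassis first.
beliefs = {"attach_cabin": {"attach_chassis"}, "attach_wheel": {"attach_chassis"}}
print(check_ordering(["attach_cabin", "attach_chassis", "attach_wheel"], beliefs))
# -> [(0, 'attach_cabin', {'attach_chassis'})]
```

The paper's contribution is constructing and updating such beliefs online from observations (including spatial beliefs about part topology), rather than the constraint check itself.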
HouYi: An open-source large language model specially designed for renewable energy and carbon neutrality field
for: This study focuses on applying Large Language Models (LLMs) to automatic content generation in support of renewable energy and carbon neutrality goals.
methods: The authors build the REAP dataset from the titles and abstracts of 1,168,970 academic papers and use it to develop HouYi, the first renewable-energy-specific LLM, by fine-tuning general-purpose LLMs.
results: Experiments show that HouYi's ability to generate academic paper paragraphs in the renewable energy field is comparable to ChatGPT, slightly better than Claude, ERNIE Bot, and SparkDesk, and significantly better than the open-source LLaMA-13B model.
Abstract
Renewable energy is important for achieving the carbon neutrality goal. With the great success of Large Language Models (LLMs) like ChatGPT in automatic content generation, LLMs are playing an increasingly important role. However, there has not been a specially designed LLM for renewable energy, nor any renewable energy dataset for training LLMs. Therefore, this paper publishes the first open-source Renewable Energy Academic Paper (REAP) dataset for non-commercial LLM research on renewable energy. The REAP dataset is collected by searching the titles and abstracts of 1,168,970 academic works from Web of Science. Based on the REAP dataset, HouYi, the first LLM for renewable energy, is developed by fine-tuning general LLMs. HouYi demonstrates powerful academic paper paragraph generation ability in the renewable energy field. Experiments show that its ability to generate academic papers on renewable energy is comparable to ChatGPT, slightly outperforms Claude, ERNIE Bot, and SparkDesk, and significantly outperforms the open-source LLaMA-13B model.
Causal Inference for Banking, Finance, and Insurance: A Survey
results: The survey finds that the application of causal inference in the banking and insurance sectors is still in its infancy, leaving substantial room for further research to turn it into a reliable methodology.
Abstract
Causal inference plays a significant role in explaining the decisions taken by statistical models and artificial intelligence models. Of late, this field has started attracting the attention of researchers and practitioners alike. This paper presents a comprehensive survey of 37 papers published during 1992-2023 concerning the application of causal inference to banking, finance, and insurance. The papers are categorized according to the following families of domains: (i) banking; (ii) finance and its subdomains, such as corporate finance, governance finance (including financial risk and financial policy), financial economics, and behavioral finance; and (iii) insurance. Further, the paper covers the primary ingredients of causal inference, namely statistical methods such as Bayesian causal networks and Granger causality, and the jargon used therein, such as counterfactuals. The review also recommends some important directions for future research. In conclusion, we observe that the application of causal inference in the banking and insurance sectors is still in its infancy, and thus more research is needed to turn it into a viable method.
results: Experiments show that the proposed method improves continual learning performance and can be adapted to both classification and segmentation problems. Code will be made available at https://github.com/csiro-robotics/SDCL.
Abstract
An ultimate objective in continual learning is to preserve knowledge learned in preceding tasks while learning new tasks. To mitigate forgetting prior knowledge, we propose a novel knowledge distillation technique that takes into account the manifold structure of the latent/output space of a neural network when learning novel tasks. To achieve this, we propose to approximate the data manifold up to its first order, hence benefiting from linear subspaces to model the structure and maintain the knowledge of a neural network while learning novel concepts. We demonstrate that modeling with subspaces provides several intriguing properties, including robustness to noise, and is therefore effective for mitigating catastrophic forgetting in continual learning. We also discuss and show how our proposed method can be adopted to address both classification and segmentation problems. Empirically, we observe that our proposed method outperforms various continual learning methods on several challenging datasets, including Pascal VOC and Tiny-ImageNet. Furthermore, we show how the proposed method can be seamlessly combined with existing learning approaches to improve their performance. The code for this article will be available at https://github.com/csiro-robotics/SDCL.
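The first-order (linear subspace) approximation of the manifold suggests a simple distillation loss: penalize the part of the new features that leaves the subspace spanned by the old ones. A minimal sketch, with subspace rank and the exact loss form as our assumptions:

```python
import torch

def subspace(features, k=4):
    """Orthonormal basis of the top-k linear subspace of a feature
    batch (a first-order approximation of the data manifold), via SVD.
    features: (n, d) tensor."""
    # Rows of Vh span the principal directions of the centered batch.
    _, _, Vh = torch.linalg.svd(features - features.mean(0), full_matrices=False)
    return Vh[:k]                                  # (k, d)

def subspace_distillation_loss(feat_new, basis_old):
    """Penalize the component of the new features that leaves the old
    subspace, instead of pinning embeddings point-to-point."""
    proj = feat_new @ basis_old.T @ basis_old      # projection onto old subspace
    return (feat_new - proj).pow(2).sum(-1).mean()

old_feats = torch.randn(64, 32)                    # features from the old model
basis = subspace(old_feats)
new_feats = torch.randn(64, 32, requires_grad=True)
subspace_distillation_loss(new_feats, basis).backward()
```

Constraining a subspace rather than individual embeddings is also what gives the method its claimed robustness to noise: small perturbations within the subspace incur no penalty.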
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
results: The study finds that extending bridge-architectures with object-level features on the NLVR2 dataset does not improve performance, whereas multi-modal pre-training is key to performing well on complex visual reasoning tasks. The study also reports initial zero-shot results for the LLaVA model.
Abstract
In recent times there has been a surge of multi-modal architectures based on Large Language Models, which leverage the zero-shot generation capabilities of LLMs, project image embeddings into the text space, and then use the auto-regressive capacity to solve tasks such as VQA, captioning, and image retrieval. We name these architectures "bridge-architectures" as they project from the image space to the text space. These models deviate from the traditional recipe of training transformer-based multi-modal models, which involves large-scale pre-training and complex multi-modal interactions through co- or cross-attention. However, the capabilities of bridge-architectures have not been tested on complex visual reasoning tasks which require fine-grained analysis of the image. In this project, we investigate the performance of these bridge-architectures on the NLVR2 dataset and compare them to state-of-the-art transformer-based architectures. We first extend the traditional bridge-architectures for the NLVR2 dataset by adding object-level features to facilitate fine-grained object reasoning. Our analysis shows that adding object-level features to bridge-architectures does not help, and that pre-training on multi-modal data is key for good performance on complex reasoning tasks such as NLVR2. We also present some initial results on a recent bridge-architecture, LLaVA, in the zero-shot setting and analyze its performance.
A Pre-trained Data Deduplication Model based on Active Learning
paper_authors: Xinyao Liu, Shengdong Du, Fengmao Lv, Hongtao Xue, Jie Hu, Tianrui Li
for: Addressing the dirty-data problem of duplicates in big data to improve data quality and efficiency.
methods: A pre-trained deduplication model based on active learning that integrates a transformer and active learning into an end-to-end architecture and uses the R-Drop method for data augmentation.
results: On benchmark datasets, the model improves the recall score by up to 28%, outperforming previous SOTA models.
Abstract
In the era of big data, the issue of data quality has become increasingly prominent. One of the main challenges is the problem of duplicate data, which can arise from repeated entry or the merging of multiple data sources. These "dirty data" problems can significantly limit the effective application of big data. To address the issue of data deduplication, we propose a pre-trained deduplication model based on active learning, which is the first work to utilize active learning to address the problem of deduplication at the semantic level. The model is built on a pre-trained Transformer and fine-tuned to solve deduplication as a sequence-to-classification task. It is the first to integrate the Transformer with active learning in an end-to-end architecture to select the most valuable data for deduplication model training, and the first to employ the R-Drop method to perform data augmentation on each round of labeled data, which reduces the cost of manual labeling and improves the model's performance. Experimental results demonstrate that our proposed model outperforms previous state-of-the-art (SOTA) models for duplicate identification, achieving up to a 28% improvement in recall score on benchmark datasets.
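The "select the most valuable data" step of an active-learning loop is easy to illustrate. The sketch below uses plain uncertainty sampling, a common acquisition criterion; the paper's exact selection strategy may differ, so treat this as an assumption:

```python
import numpy as np

def select_for_labeling(pair_probs, budget=100):
    """Active-learning step as a toy: rank unlabeled record pairs by
    the uncertainty of the current duplicate classifier and return the
    indices most worth sending to a human annotator.

    pair_probs: (n,) predicted probabilities that each pair is a duplicate.
    """
    uncertainty = -np.abs(pair_probs - 0.5)   # closest to 0.5 = least certain
    return np.argsort(uncertainty)[-budget:]

probs = np.random.rand(10_000)                # stand-in model outputs
to_label = select_for_labeling(probs, budget=5)
print(probs[to_label])                        # all near 0.5
```

Each round, the newly labeled pairs (augmented with R-Drop in the paper) are folded back into fine-tuning, so the labeling budget is spent where the classifier is least sure.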
STL: A Signed and Truncated Logarithm Activation Function for Neural Networks
methods: The paper proposes a new activation function, a signed and truncated logarithm, with favorable mathematical properties: it is odd, monotone, differentiable, has an unbounded value range, and has a continuous nonzero gradient.
results: Compared with several widely used activation functions, the new activation function achieves the best performance in most of the tested neural networks, and it can be applied to a wide range of networks that require activation functions.
Abstract
Activation functions play an essential role in neural networks: they provide the non-linearity for the networks, so their properties matter for neural networks' accuracy and running performance. In this paper, we present a novel signed and truncated logarithm function as an activation function. The proposed activation function has significantly better mathematical properties, such as being an odd function, monotone, differentiable, having an unbounded value range, and a continuous nonzero gradient. These properties make it an excellent choice as an activation function. We compare it with other well-known activation functions in several well-known neural networks. The results confirm that it is state-of-the-art. The suggested activation function can be applied in a large range of neural networks where activation functions are necessary.
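For intuition, here is one plausible reading of a "signed and truncated logarithm"; the exact formula in the paper may differ, so this is an assumption, not the authors' definition:

```python
import torch

class STL(torch.nn.Module):
    """One plausible form: f(x) = sign(x) * log(1 + |x|).
    It is odd, monotone, differentiable (analytic gradient 1/(1+|x|)),
    unbounded, and has a continuous nonzero gradient, matching the
    properties listed in the abstract. Treat the formula as our guess.
    """
    def forward(self, x):
        return torch.sign(x) * torch.log1p(torch.abs(x))

act = STL()
x = torch.linspace(-5, 5, 10, requires_grad=True)  # avoids x == 0 exactly
act(x).sum().backward()
print(x.grad)   # matches 1/(1+|x|) at these points
```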
Relation-Oriented: Toward Knowledge-Aligned Causal AI
methods: The study examines a relation-oriented modeling paradigm, illustrated with intuitive examples from computer vision and health informatics, and proposes a relation-defined representation learning methodology as its practical implementation.
results: The study finds that the conventional observation-oriented paradigm becomes misaligned with human comprehension when AI meets big data, whereas relation-oriented modeling better reflects how human knowledge is formed; the relation-defined representation learning methodology is supported by extensive experimental validation.
Abstract
In machine learning, we naturally apply an Observation-Oriented principle, in which observational variables preexist and set the stage for constructing relationships. While sufficient for traditional models, the integration of AI with big data exposes the misalignment between the observational models and our actual comprehension. Contrarily, humans shape cognitive entities defined by relationships, enabling us to formulate knowledge across temporal and hyper-dimensional spaces, rather than being confined to observational constructs. From an innovative Relation-Oriented perspective, this study examines the roots of this misalignment within our current modeling paradigm, illuminated by intuitive examples from computer vision and health informatics. We also introduce the relation-defined representation learning methodology as a practical implementation of Relation-Oriented modeling, supported by extensive experimental validation. Consider an analogy where ants dwell on a two-dimensional plane of a floor. If these ants were to construct models, they might use the nearest tree as a reference to specify the elevation in their two-dimensional models. By modeling, they observe an increased disruption at the tree's mid-level, which indicates a higher chance of encountering children. However, since they fail to comprehend humans as three-dimensional beings, instead of interpreting this phenomenon in a new dimension, "height", they solely relate it to the tree's mid-level. If they migrate to a different tree with a varying height, where mid-level no longer presents a risk, they might conclude that human behavior is too complex to model effectively. Similarly, when modeling time series, we usually discount the dimension, "time", as a single timeline, which has become our "tree".
When Large Language Models Meet Personalization: Perspectives of Challenges and Opportunities
for: This paper discusses the potential uses of large language models (LLMs) in personalization systems, and how they can revolutionize the way personalization is conducted.
methods: The paper explores the development and challenges of existing personalization systems, as well as the newly emerged capabilities of LLMs.
results: The paper discusses potential ways of making use of LLMs for personalization, including the ability to proactively explore user requests and provide personalized services.
Abstract
The advent of large language models marks a revolutionary breakthrough in artificial intelligence. With the unprecedented scale of training and model parameters, the capability of large language models has been dramatically improved, leading to human-like performances in understanding, language synthesizing, and common-sense reasoning, etc. Such a major leap-forward in general AI capacity will change the pattern of how personalization is conducted. For one thing, it will reform the way of interaction between humans and personalization systems. Instead of being a passive medium of information filtering, large language models present the foundation for active user engagement. On top of such a new foundation, user requests can be proactively explored, and user's required information can be delivered in a natural and explainable way. For another thing, it will also considerably expand the scope of personalization, making it grow from the sole function of collecting personalized information to the compound function of providing personalized services. By leveraging large language models as general-purpose interface, the personalization systems may compile user requests into plans, calls the functions of external tools to execute the plans, and integrate the tools' outputs to complete the end-to-end personalization tasks. Today, large language models are still being developed, whereas the application in personalization is largely unexplored. Therefore, we consider it to be the right time to review the challenges in personalization and the opportunities to address them with LLMs. In particular, we dedicate this perspective paper to the discussion of the following aspects: the development and challenges for the existing personalization system, the newly emerged capabilities of large language models, and the potential ways of making use of large language models for personalization.
Promptly: Using Prompt Problems to Teach Learners How to Effectively Utilize AI Code Generators
paper_authors: Paul Denny, Juho Leinonen, James Prather, Andrew Luxton-Reilly, Thezyrie Amarouche, Brett A. Becker, Brent N. Reeves
for: This paper introduces a novel pedagogical concept called a "Prompt Problem" to help students learn how to craft effective prompts for large language models (LLMs) in computing education.
methods: The paper presents a novel tool called Promptly, which hosts a repository of Prompt Problems and automates the evaluation of prompt-generated code. The authors conducted a field study with 54 first-year Python programming students to explore student interactions with the tool and their perceptions of the Prompt Problem concept.
results: The study found that Promptly was well received by students for its ability to engage their computational thinking skills and expose them to new programming constructs. The authors also discuss avenues for future work, including variations on the design of Prompt Problems and the need to study their integration into the curriculum and teaching practice.
Abstract
With their remarkable ability to generate code, large language models (LLMs) are a transformative technology for computing education practice. They have created an urgent need for educators to rethink pedagogical approaches and teaching strategies for newly emerging skill sets. Traditional approaches to learning programming have focused on frequent and repeated practice at writing code. The ease with which code can now be generated has resulted in a shift in focus towards reading, understanding and evaluating LLM-generated code. In parallel with this shift, a new essential skill is emerging -- the ability to construct good prompts for code-generating models. This paper introduces a novel pedagogical concept known as a `Prompt Problem', designed to help students learn how to craft effective prompts for LLMs. A Prompt Problem challenges a student to create a natural language prompt that leads an LLM to produce the correct code for a specific problem. To support the delivery of Prompt Problems at scale, in this paper we also present a novel tool called Promptly which hosts a repository of Prompt Problems and automates the evaluation of prompt-generated code. We report empirical findings from a field study in which Promptly was deployed in a first-year Python programming course (n=54). We explore student interactions with the tool and their perceptions of the Prompt Problem concept. We found that Promptly was largely well-received by students for its ability to engage their computational thinking skills and expose them to new programming constructs. We also discuss avenues for future work, including variations on the design of Prompt Problems and the need to study their integration into the curriculum and teaching practice.
BearingPGA-Net: A Lightweight and Deployable Bearing Fault Diagnosis Network via Decoupled Knowledge Distillation and FPGA Acceleration
results: The authors design an FPGA acceleration scheme for BearingPGA-Net in Verilog, using customized quantization and per-layer programmable logic with an emphasis on parallel computing and module reuse. To their knowledge, this is the first deployment of a CNN-based bearing fault diagnosis model on an FPGA. Experiments show over 200x faster diagnosis than a CPU while keeping the drop in F1, recall, and precision below 0.4%.
Abstract
Deep learning has achieved remarkable success in the field of bearing fault diagnosis. However, this success comes with larger models and more complex computations, which cannot be transferred into industrial fields requiring models to be of high speed, strong portability, and low power consumption. In this paper, we propose a lightweight and deployable model for bearing fault diagnosis, referred to as BearingPGA-Net, to address these challenges. Firstly, aided by a well-trained large model, we train BearingPGA-Net via decoupled knowledge distillation. Despite its small size, our model demonstrates excellent fault diagnosis performance compared to other lightweight state-of-the-art methods. Secondly, we design an FPGA acceleration scheme for BearingPGA-Net using Verilog. This scheme involves the customized quantization and designing programmable logic gates for each layer of BearingPGA-Net on the FPGA, with an emphasis on parallel computing and module reuse to enhance the computational speed. To the best of our knowledge, this is the first instance of deploying a CNN-based bearing fault diagnosis model on an FPGA. Experimental results reveal that our deployment scheme achieves over 200 times faster diagnosis speed compared to CPU, while achieving a lower-than-0.4\% performance drop in terms of F1, Recall, and Precision score on our independently-collected bearing dataset. Our code is available at \url{https://github.com/asdvfghg/BearingPGA-Net}.
Distributionally Robust Safety Filter for Learning-Based Control in Active Distribution Systems
results: On the IEEE 33-bus and 123-bus systems, the proposed DRSF significantly reduces constraint violations by DRL agents during training while maintaining near-optimal solutions.
Abstract
Operational constraint violations may occur when deep reinforcement learning (DRL) agents interact with real-world active distribution systems to learn their optimal policies during training. This letter presents a universal distributionally robust safety filter (DRSF) using which any DRL agent can reduce the constraint violations of distribution systems significantly during training while maintaining near-optimal solutions. The DRSF is formulated as a distributionally robust optimization problem with chance constraints of operational limits. This problem aims to compute near-optimal actions that are minimally modified from the optimal actions of DRL-based Volt/VAr control by leveraging the distribution system model, thereby providing constraint satisfaction guarantee with a probability level under the model uncertainty. The performance of the proposed DRSF is verified using the IEEE 33-bus and 123-bus systems.
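The "minimally modified action" idea can be sketched as a small constrained projection. The linearized voltage model, the scalar `margin` standing in for the distributional-robustness buffer, and all variable names below are our simplifications, not the letter's formulation:

```python
import numpy as np
from scipy.optimize import minimize

def safety_filter(a_rl, voltage_sensitivity, v_now, v_min, v_max, margin=0.0):
    """Find the action closest to the DRL action that keeps
    (linearized) bus voltages within limits: a toy stand-in for the
    chance-constrained DRSF problem."""
    S = voltage_sensitivity

    def objective(a):
        return np.sum((a - a_rl) ** 2)    # minimal modification

    cons = [
        {"type": "ineq", "fun": lambda a: (v_now + S @ (a - a_rl)) - (v_min + margin)},
        {"type": "ineq", "fun": lambda a: (v_max - margin) - (v_now + S @ (a - a_rl))},
    ]
    return minimize(objective, a_rl, constraints=cons).x

# Toy usage: both buses start outside limits; the filter nudges the action.
a_safe = safety_filter(
    a_rl=np.array([0.5, -0.2]),
    voltage_sensitivity=np.array([[0.05, 0.01], [0.02, 0.04]]),
    v_now=np.array([1.06, 0.93]),
    v_min=0.95, v_max=1.05, margin=0.005,
)
print(a_safe)
```

In the actual DRSF, the margin is not a fixed scalar but follows from distributionally robust chance constraints under model uncertainty.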
results: Experimental studies demonstrate the effectiveness and benefits of the new rating-based reinforcement learning approach.
Abstract
This paper develops a novel rating-based reinforcement learning approach that uses human ratings to obtain human guidance in reinforcement learning. Different from the existing preference-based and ranking-based reinforcement learning paradigms, which are based on human relative preferences over sample pairs, the proposed rating-based reinforcement learning approach is based on human evaluation of individual trajectories without relative comparisons between sample pairs. The rating-based reinforcement learning approach builds on a new prediction model for human ratings and a novel multi-class loss function. We conduct several experimental studies based on synthetic ratings and real human ratings to evaluate the effectiveness and benefits of the new rating-based reinforcement learning approach.
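To illustrate the difference from pairwise preference learning, here is a minimal rating-prediction model trained from individual trajectory ratings; the network sizes and the plain cross-entropy loss are our assumptions (the paper proposes its own prediction model and multi-class loss):

```python
import torch
import torch.nn as nn

class RatingRewardModel(nn.Module):
    """Map a trajectory summary to a distribution over discrete human
    rating classes (e.g. 1-5 stars). Training needs only one rating
    per trajectory, with no pairwise comparisons."""
    def __init__(self, traj_dim=32, n_classes=5):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(traj_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_classes))

    def forward(self, traj_features):
        return self.net(traj_features)   # logits over rating classes

model = RatingRewardModel()
trajs = torch.randn(16, 32)              # per-trajectory features (stand-in)
ratings = torch.randint(0, 5, (16,))     # one human rating per trajectory
loss = nn.CrossEntropyLoss()(model(trajs), ratings)
loss.backward()
```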
Proof-of-Federated-Learning-Subchain: Free Partner Selection Subchain Based on Federated Learning
results: In a simulation with 20 miners, PoFLSC is shown to be effective: when the subchain pool size is reduced according to reservation priority, miners with a higher Shapley Value (SV) gain a better chance of being selected. The consensus helps the subchain manager track reservation priority and the core partition of contributors to establish and maintain a competitive subchain.
Abstract
The continued thriving of the blockchain community motivates research into novel designs of schemes supporting cryptocurrencies. Previously, multiple Proof-of-Deep-Learning (PoDL) consensuses have been proposed to replace hashing with useful work such as deep learning model training tasks, so that energy is used more efficiently while the ledger is maintained. However, deep learning models are problem-specific and can be extremely complex, and current PoDL consensuses still require much work to realize in the real world. In this paper, we propose a novel consensus named Proof-of-Federated-Learning-Subchain (PoFLSC) to fill the gap. We apply a subchain to record the training, challenging, and auditing activities and emphasize the importance of valuable datasets in partner selection. We simulate 20 miners in the subchain to demonstrate the effectiveness of PoFLSC. When we reduce the pool size with respect to the reservation priority order, the difference in drop rate across scenarios further shows that a miner with a higher Shapley Value (SV) gains a better opportunity to be selected when the size of the subchain pool is limited. In the conducted experiments, the PoFLSC consensus enabled the subchain manager to be aware of reservation priority and the core partition of contributors in order to establish and maintain a competitive subchain.
results: The study finds that these fake accounts post machine-generated content and stolen images and interact with each other through replies and retweets. They promote suspicious websites and spread harmful comments, yet state-of-the-art LLM content classifiers cannot distinguish them from genuine human accounts in the wild.
Abstract
Large language models (LLMs) exhibit impressive capabilities in generating realistic text across diverse subjects. Concerns have been raised that they could be utilized to produce fake content with a deceptive intention, although evidence thus far remains anecdotal. This paper presents a case study about a Twitter botnet that appears to employ ChatGPT to generate human-like content. Through heuristics, we identify 1,140 accounts and validate them via manual annotation. These accounts form a dense cluster of fake personas that exhibit similar behaviors, including posting machine-generated content and stolen images, and engage with each other through replies and retweets. ChatGPT-generated content promotes suspicious websites and spreads harmful comments. While the accounts in the AI botnet can be detected through their coordination patterns, current state-of-the-art LLM content classifiers fail to discriminate between them and human accounts in the wild. These findings highlight the threats posed by AI-enabled social bots.
Evaluating ChatGPT and GPT-4 for Visual Programming
results: The study finds that these models perform poorly in visual programming, struggling to combine the spatial, logical, and programming skills crucial to the domain. The results also suggest directions for improving generative models' performance in visual programming.
Abstract
Generative AI and large language models have the potential to drastically improve the landscape of computing education by automatically generating personalized feedback and content. Recent works have studied the capabilities of these models for different programming education scenarios; however, these works considered only text-based programming, in particular, Python programming. Consequently, they leave open the question of how well these models would perform in visual programming domains popularly used for K-8 programming education. The main research question we study is: Do state-of-the-art generative models show advanced capabilities in visual programming on par with their capabilities in text-based Python programming? In our work, we evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, in visual programming domains for various scenarios and assess performance using expert-based annotations. In particular, we base our evaluation using reference tasks from the domains of Hour of Code: Maze Challenge by Code-dot-org and Karel. Our results show that these models perform poorly and struggle to combine spatial, logical, and programming skills crucial for visual programming. These results also provide exciting directions for future work on developing techniques to improve the performance of generative models in visual programming.
Representing and Reasoning with Multi-Stakeholder Qualitative Preference Queries
results: The paper provides a provably correct algorithm for answering multi-stakeholder qualitative preference queries using model checking in the alternation-free $\mu$-calculus. Experimental results demonstrate the feasibility of the approach.
Abstract
Many decision-making scenarios, e.g., public policy, healthcare, business, and disaster response, require accommodating the preferences of multiple stakeholders. We offer the first formal treatment of reasoning with multi-stakeholder qualitative preferences in a setting where stakeholders express their preferences in a qualitative preference language, e.g., CP-net, CI-net, TCP-net, CP-Theory. We introduce a query language for expressing queries against such preferences over sets of outcomes that satisfy specified criteria, e.g., $\mlangpref{\psi_1}{\psi_2}{A}$ (read loosely as the set of outcomes satisfying $\psi_1$ that are preferred over outcomes satisfying $\psi_2$ by a set of stakeholders $A$). Motivated by practical application scenarios, we introduce and analyze several alternative semantics for such queries, and examine their interrelationships. We provide a provably correct algorithm for answering multi-stakeholder qualitative preference queries using model checking in alternation-free $\mu$-calculus. We present experimental results that demonstrate the feasibility of our approach.
LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning
results: Employing BERT-family models as feature extractors, experiments show that LaFiCMIL achieves new state-of-the-art performance across eight benchmark datasets covering Long Document Classification, Code Defect Detection, and Android Malware Detection.
Abstract
Transformer-based models, such as BERT, have revolutionized various language tasks, but still struggle with large file classification due to their input limit (e.g., 512 tokens). Despite several attempts to alleviate this limitation, no method consistently excels across all benchmark datasets, primarily because they can only extract partial essential information from the input file. Additionally, they fail to adapt to the varied properties of different types of large files. In this work, we tackle this problem from the perspective of correlated multiple instance learning. The proposed approach, LaFiCMIL, serves as a versatile framework applicable to various large file classification tasks covering binary, multi-class, and multi-label classification tasks, spanning various domains including Natural Language Processing, Programming Language Processing, and Android Analysis. To evaluate its effectiveness, we employ eight benchmark datasets pertaining to Long Document Classification, Code Defect Detection, and Android Malware Detection. Leveraging BERT-family models as feature extractors, our experimental results demonstrate that LaFiCMIL achieves new state-of-the-art performance across all benchmark datasets. This is largely attributable to its capability of scaling BERT up to nearly 20K tokens, running on a single Tesla V-100 GPU with 32G of memory.
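Viewing a large file as a bag of correlated chunks lends itself to a compact sketch. Below is a minimal illustration (not the authors' implementation): the file is split into <=512-token windows, each encoded by a BERT-style encoder, and the chunk embeddings are pooled with learned attention, a common MIL baseline that we assume here for clarity:

```python
import torch
import torch.nn as nn

class AttentionMILHead(nn.Module):
    """Classify a large file from per-chunk embeddings: weight each
    chunk by learned attention, sum into a bag embedding, classify."""
    def __init__(self, dim=768, n_classes=2):
        super().__init__()
        self.attn = nn.Linear(dim, 1)
        self.clf = nn.Linear(dim, n_classes)

    def forward(self, chunk_embs):                      # (n_chunks, dim)
        w = torch.softmax(self.attn(chunk_embs), dim=0)  # (n_chunks, 1)
        bag = (w * chunk_embs).sum(0)                    # weighted bag embedding
        return self.clf(bag)

# ~20K tokens -> 40 chunks of 512; stand-in for per-chunk [CLS] embeddings.
chunks = torch.randn(40, 768)
logits = AttentionMILHead()(chunks)
```

Because only one chunk needs to pass through the encoder at a time, this structure is what lets a 512-token model scale to files of nearly 20K tokens on a single GPU.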
Implementing Edge Based Object Detection For Microplastic Debris
results: The project produces workable models that perform timely detection of plastic waste in sampled images using an augmented CNN approach, with comparisons across waste types. It also identifies the preprocessing steps and hardware best suited to extending waste detection studies to larger environments.
Abstract
Plastic has become an indispensable part of our day-to-day activities, and a source of problems due to its non-biodegradable nature and cheap production. These problems bring the challenge of mitigating and responding to the after-effects of improper disposal, which leads to waste concentrating in particular locations and disturbing ecosystems for both plants and animals. As plastic debris levels continue to rise, accumulating in garbage patches in landfills and, more hazardously, in natural water bodies, swift action is necessary to plug or cease this flow. While manual sorting operations and detection can offer a solution, they can be augmented using advanced computer imagery linked with robotic appendages for removing waste. The primary applications of focus in this report are the much-discussed Computer Vision and Open Vision, which have gained attention for their light dependence on internet connectivity and their ability to relay information in remote areas. These applications can be applied to the creation of edge-based mobility devices that counter the growing problem of plastic debris in oceans and rivers, demanding little connectivity while still offering the same results with reasonably timed maintenance. The principal findings of this project cover the various methods that were tested and deployed to detect waste in images, as well as comparisons across different waste types. The project produced workable models that can perform timely detection on sampled images using an augmented CNN approach. Later portions of the project also arrived at a better understanding of the preprocessing steps required to achieve the best accuracies, including the best hardware for expanding waste detection studies to larger environments.
for: This paper focuses on the proactive prediction of performance instability and hardware failures in storage systems, with the goal of improving their reliability and availability.
methods: The paper surveys the mechanisms and field studies proposed in recent years for predicting performance instability and device failures in storage systems, focusing on machine-learning-based black-box approaches.
results: The paper evaluates the strengths and limitations of three representative research works, providing insight into the effectiveness of machine-learning-based approaches for predicting performance instability and device failures in storage systems.
Abstract
With the rapid development of cloud computing and big data technologies, storage systems have become a fundamental building block of datacenters, incorporating hardware innovations such as flash solid state drives and non-volatile memories, as well as software infrastructures such as RAID and distributed file systems. Despite the growing popularity of and interest in storage, designing and implementing reliable storage systems remains challenging, due to their performance instability and prevailing hardware failures. Proactive prediction greatly strengthens the reliability of storage systems. There are two dimensions of prediction: performance and failure. Ideally, by detecting slow IO requests in advance and predicting device failures before they really happen, we can build storage systems with especially low tail latency and high availability. While its importance is well recognized, such proactive prediction in storage systems is particularly difficult. To move towards predictability of storage systems, various mechanisms and field studies have been proposed in the past few years. In this report, we present a survey of these mechanisms and field studies, focusing on machine-learning-based black-box approaches. Based on three representative research works, we discuss where and how machine learning should be applied in this field. The strengths and limitations of each research work are also evaluated in detail.
Predicting delays in Indian lower courts using AutoML and Decision Forests
results: Using AutoML, the authors build a high-accuracy delay-prediction model with 81.4% accuracy and precision, recall, and F1 all at 0.81. The dataset and Python code files are released for further research on Indian judicial reform.
Abstract
This paper presents a classification model that predicts delays in Indian lower courts based on case information available at filing. The model is built on a dataset of 4.2 million court cases filed in 2010 and their outcomes over a 10-year period. The data set is drawn from 7000+ lower courts in India. The authors employed AutoML to develop a multi-class classification model over all periods of pendency and then used binary decision forest classifiers to improve predictive accuracy for the classification of delays. The best model achieved an accuracy of 81.4%, and the precision, recall, and F1 were found to be 0.81. The study demonstrates the feasibility of AI models for predicting delays in Indian courts, based on relevant data points such as jurisdiction, court, judge, subject, and the parties involved. The paper also discusses the results in light of relevant literature and suggests areas for improvement and future research. The authors have made the dataset and Python code files used for the analysis available for further research in the crucial and contemporary field of Indian judicial reform.
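To make the two-stage setup concrete, here is a minimal, hypothetical sketch of the binary "delayed" classifier over categorical case attributes. The column names stand in for the jurisdiction/court/judge/subject fields the paper describes, and the data is fabricated.

```python
# Illustrative setup: categorical case attributes -> binary delay classifier.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "court":   ["district_a", "district_b", "district_a", "district_c"] * 50,
    "subject": ["property", "criminal", "family", "property"] * 50,
    "judge":   ["j1", "j2", "j3", "j1"] * 50,
    "delayed": [1, 0, 1, 0] * 50,  # 1 = case exceeded the delay threshold
})

pipe = Pipeline([
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"),
          ["court", "subject", "judge"])])),
    ("forest", RandomForestClassifier(n_estimators=200, random_state=0)),
])
pipe.fit(df[["court", "subject", "judge"]], df["delayed"])
print(pipe.predict(df[["court", "subject", "judge"]].head()))
```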
Recent Advances in Hierarchical Multi-label Text Classification: A Survey
paper_authors: Rundong Liu, Wenhan Liang, Weijun Luo, Yuxiang Song, He Zhang, Ruohua Xu, Yunfeng Li, Ming Liu
for: Scientific literature archiving, hierarchical multi-label text classification
methods: Survey of the open-sourced data sets, main methods, evaluation metrics, and learning strategies
results: Research progress, challenges, and future development directions
Abstract
Hierarchical multi-label text classification aims to classify the input text into multiple labels, among which the labels are structured and hierarchical. It is a vital task in many real-world applications, e.g. scientific literature archiving. In this paper, we survey the recent progress of hierarchical multi-label text classification, including the open-sourced data sets, the main methods, evaluation metrics, learning strategies, and the current challenges. A few future research directions are also listed for the community to further improve this field.
Lexically-Accelerated Dense Retrieval
results: LADR establishes a new effectiveness-efficiency Pareto frontier for dense retrieval models and, at around 8 ms per query, achieves precision and recall on par with an exhaustive search.
Abstract
Retrieval approaches that score documents based on learned dense vectors (i.e., dense retrieval) rather than lexical signals (i.e., conventional retrieval) are increasingly popular. Their ability to identify related documents that do not necessarily contain the same terms as those appearing in the user's query (thereby improving recall) is one of their key advantages. However, to actually achieve these gains, dense retrieval approaches typically require an exhaustive search over the document collection, making them considerably more expensive at query-time than conventional lexical approaches. Several techniques aim to reduce this computational overhead by approximating the results of a full dense retriever. Although these approaches reasonably approximate the top results, they suffer in terms of recall -- one of the key advantages of dense retrieval. We introduce 'LADR' (Lexically-Accelerated Dense Retrieval), a simple-yet-effective approach that improves the efficiency of existing dense retrieval models without compromising on retrieval effectiveness. LADR uses lexical retrieval techniques to seed a dense retrieval exploration that uses a document proximity graph. We explore two variants of LADR: a proactive approach that expands the search space to the neighbors of all seed documents, and an adaptive approach that selectively searches the documents with the highest estimated relevance in an iterative fashion. Through extensive experiments across a variety of dense retrieval models, we find that LADR establishes a new dense retrieval effectiveness-efficiency Pareto frontier among approximate k nearest neighbor techniques. Further, we find that when tuned to take around 8ms per query in retrieval latency on our hardware, LADR consistently achieves both precision and recall that are on par with an exhaustive search on standard benchmarks.
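The core idea is compact enough to sketch. The toy code below implements the proactive variant under stated assumptions: a precomputed document proximity graph, random vectors in place of a trained dense encoder, and random seed documents in place of BM25 results.

```python
# Schematic version of the proactive LADR variant: lexical seeds, then dense
# re-scoring of the seeds' graph neighbours. All components are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n_docs, dim = 1000, 64
doc_vecs = rng.normal(size=(n_docs, dim))
# Document proximity graph: top-k nearest neighbours per document (precomputed).
sims = doc_vecs @ doc_vecs.T
graph = np.argsort(-sims, axis=1)[:, 1:9]          # 8 neighbours per doc

def ladr_proactive(query_vec, lexical_seeds, k=10):
    # Expand the candidate pool with every seed's neighbours, then score
    # only that pool with the dense model (instead of exhaustive search).
    pool = set(lexical_seeds)
    for d in lexical_seeds:
        pool.update(graph[d])
    pool = np.fromiter(pool, dtype=int)
    scores = doc_vecs[pool] @ query_vec
    return pool[np.argsort(-scores)[:k]]

seeds = rng.choice(n_docs, size=20, replace=False)  # stand-in for BM25 results
print(ladr_proactive(rng.normal(size=dim), seeds))
```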
Multilingual context-based pronunciation learning for Text-to-Speech
results: The model is competitive across languages on G2P conversion and other language-specific tasks, though some trade-offs exist compared with equivalent monolingual solutions.
Abstract
Phonetic information and linguistic knowledge are essential components of a Text-to-Speech (TTS) front-end. Given a language, a lexicon can be collected offline, and Grapheme-to-Phoneme (G2P) relationships are usually modeled in order to predict the pronunciation of out-of-vocabulary (OOV) words. Additionally, post-lexical phonology, often defined in the form of rule-based systems, is used to correct pronunciation within or between words. In this work we showcase a multilingual unified front-end system that addresses any pronunciation-related task, typically handled by separate modules. We evaluate the proposed model on G2P conversion and other language-specific challenges, such as homograph and polyphone disambiguation, post-lexical rules, and implicit diacritization. We find that the multilingual model is competitive across languages and tasks; however, some trade-offs exist when compared to equivalent monolingual solutions.
No that’s not what I meant: Handling Third Position Repair in Conversational Question Answering
paper_authors: Vevake Balaraman, Arash Eshghi, Ioannis Konstas, Ioannis Papaioannou
for: This paper studies how people handle miscommunication in dialogue, and how Third Position Repair (TPR) is used to correct misunderstandings.
methods: The paper collects a large dataset of TPRs (Repair-QA), evaluates it both automatically and with human judges, and trains baseline models to execute TPRs.
results: The study finds that OpenAI's GPT-3 LLMs handle TPRs of the original turn poorly out of the box, but their TPR handling improves significantly after exposure to Repair-QA, as evaluated in the downstream conversational QA task.
Abstract
The ability to handle miscommunication is crucial to robust and faithful conversational AI. People usually deal with miscommunication immediately as they detect it, using highly systematic interactional mechanisms called repair. One important type of repair is Third Position Repair (TPR) whereby a speaker is initially misunderstood but then corrects the misunderstanding as it becomes apparent after the addressee's erroneous response. Here, we collect and publicly release Repair-QA, the first large dataset of TPRs in a conversational question answering (QA) setting. The data is comprised of the TPR turns, corresponding dialogue contexts, and candidate repairs of the original turn for execution of TPRs. We demonstrate the usefulness of the data by training and evaluating strong baseline models for executing TPRs. For stand-alone TPR execution, we perform both automatic and human evaluations on a fine-tuned T5 model, as well as OpenAI's GPT-3 LLMs. Additionally, we extrinsically evaluate the LLMs' TPR processing capabilities in the downstream conversational QA task. The results indicate poor out-of-the-box performance on TPR's by the GPT-3 models, which then significantly improves when exposed to Repair-QA.
Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech
paper_authors: Guangyan Zhang, Thomas Merritt, Manuel Sam Ribeiro, Biel Tura-Vecino, Kayoko Yanagisawa, Kamil Pokora, Abdelhamid Ezzerg, Sebastian Cygert, Ammar Abbas, Piotr Bilinski, Roberto Barra-Chicote, Daniel Korzekwa, Jaime Lorenzo-Trueba
for: This paper compares traditional L1/L2-loss approaches with flow-based and diffusion-based approaches for text-to-speech synthesis tasks.
results: Experimental results show that the flow-based model achieves the best performance on spectrogram prediction, surpassing equivalent diffusion and L1 models. Both flow-based and diffusion-based prosody predictors yield significant improvements over a conventional L2-trained prosody model.
Abstract
Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distributions of the target data space. Aiming to improve those assumptions, Normalizing Flows and Diffusion Probabilistic Models were recently proposed as alternatives. In this paper, we compare traditional L1/L2-based approaches to diffusion and flow-based approaches for the tasks of prosody and mel-spectrogram prediction for text-to-speech synthesis. We use a prosody model to generate log-f0 and duration features, which are used to condition an acoustic model that generates mel-spectrograms. Experimental results demonstrate that the flow-based model achieves the best performance for spectrogram prediction, improving over equivalent diffusion and L1 models. Meanwhile, both diffusion and flow-based prosody predictors result in significant improvements over a typical L2-trained prosody models.
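For reference, these are the standard objectives being contrasted, not the paper's exact formulations: an Lp regression loss versus the usual DDPM denoising objective that diffusion-based prosody and acoustic models optimize.

```latex
% Lp regression assumes a simple unimodal target distribution:
\mathcal{L}_{p}(\theta) = \mathbb{E}\,\bigl\| y - f_\theta(x) \bigr\|_p^p,
  \qquad p \in \{1, 2\}
% A denoising diffusion model instead fits the conditional distribution by
% predicting the noise injected at step t (standard DDPM objective):
\mathcal{L}_{\mathrm{diff}}(\theta) =
  \mathbb{E}_{y_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}
  \Bigl[ \bigl\| \epsilon - \epsilon_\theta\bigl(\sqrt{\bar\alpha_t}\, y_0
  + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t,\; x\bigr) \bigr\|_2^2 \Bigr]
```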
The World Literature Knowledge Graph
results: The paper delivers a knowledge graph of 194,346 writers and 965,210 works through an online visualization platform, which was rigorously tested and validated by three distinct categories of domain experts and rated as highly beneficial to their work.
Abstract
Digital media have enabled access to unprecedented literary knowledge. Authors, readers, and scholars are now able to discover and share an increasing amount of information about books and their authors. However, these sources of knowledge are fragmented and do not adequately represent non-Western writers and their works. In this paper we present The World Literature Knowledge Graph, a semantic resource containing 194,346 writers and 965,210 works, specifically designed for exploring facts about literary works and authors from different parts of the world. The knowledge graph integrates information about the reception of literary works gathered from 3 different communities of readers, aligned according to a single semantic model. The resource is accessible through an online visualization platform, which can be found at the following URL: https://literaturegraph.di.unito.it/. This platform has been rigorously tested and validated by 3 distinct categories of experts who have found it to be highly beneficial for their respective work domains. These categories include teachers, researchers in the humanities, and professionals in the publishing industry. The feedback received from these experts confirms that they can effectively utilize the platform to enhance their work processes and achieve valuable outcomes.
Scaling Sentence Embeddings with Large Language Models
results: Extensive experiments show that in-context learning enables LLMs to produce high-quality sentence embeddings without any fine-tuning. Scaling to more than tens of billions of parameters hurts performance on semantic textual similarity (STS) tasks, yet the largest model outperforms its counterparts and reaches a new state of the art on transfer tasks. With contrastive fine-tuning, the 2.7B OPT model combined with the prompt-based method surpasses the 4.8B ST5 and sets a new state of the art on STS tasks.
Abstract
Large language models (LLMs) have recently garnered significant interest. With in-context learning, LLMs achieve impressive results in various natural language tasks. However, the application of LLMs to sentence embeddings remains an area of ongoing research. In this work, we propose an in-context learning-based method aimed at improving sentence embeddings performance. Our approach involves adapting the previous prompt-based representation method for autoregressive models, constructing a demonstration set that enables LLMs to perform in-context learning, and scaling up the LLMs to different model sizes. Through extensive experiments, in-context learning enables LLMs to generate high-quality sentence embeddings without any fine-tuning. It helps LLMs achieve performance comparable to current contrastive learning methods. By scaling model size, we find scaling to more than tens of billion parameters harms the performance on semantic textual similarity (STS) tasks. However, the largest model outperforms other counterparts and achieves the new state-of-the-art result on transfer tasks. We also fine-tune LLMs with current contrastive learning approach, and the 2.7B OPT model, incorporating our prompt-based method, surpasses the performance of 4.8B ST5, achieving the new state-of-the-art results on STS tasks. Our code is available at https://github.com/kongds/scaling_sentemb.
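A minimal sketch of the prompt-based representation idea, assuming a small OPT model and a template commonly used in this line of work; the paper's exact prompt may differ.

```python
# Sketch of prompt-based sentence embeddings from an autoregressive LM:
# wrap the sentence in a template and use the hidden state of the final
# token as the embedding. Template wording is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

def embed(sentence: str) -> torch.Tensor:
    prompt = f'This sentence : "{sentence}" means in one word:"'
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Last layer, last token -> fixed-size sentence representation.
    return out.hidden_states[-1][0, -1]

a, b = embed("A man is playing guitar."), embed("Someone plays a guitar.")
print(torch.cosine_similarity(a, b, dim=0).item())
```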
Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings
results: The approach consistently improves the phone error rate of G2P systems across languages and amounts of available data.
Abstract
The Grapheme-to-Phoneme (G2P) task aims to convert orthographic input into a discrete phonetic representation. G2P conversion is beneficial to various speech processing applications, such as text-to-speech and speech recognition. However, these tend to rely on manually-annotated pronunciation dictionaries, which are often time-consuming and costly to acquire. In this paper, we propose a method to improve the G2P conversion task by learning pronunciation examples from audio recordings. Our approach bootstraps a G2P with a small set of annotated examples. The G2P model is used to train a multilingual phone recognition system, which then decodes speech recordings with a phonetic representation. Given hypothesized phoneme labels, we learn pronunciation dictionaries for out-of-vocabulary words, and we use those to re-train the G2P system. Results indicate that our approach consistently improves the phone error rate of G2P systems across languages and amount of available data.
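The bootstrapping loop can be sketched structurally as follows; both models are stubbed out here, whereas the paper uses a neural G2P and a multilingual phone recognizer.

```python
# Structural sketch of the bootstrapping loop (all models are stubbed out).
seed_lexicon = {"cat": "k ae t", "dog": "d ao g"}   # small annotated seed

def train_g2p(lexicon):            # stub: returns a naive letter-to-phone map
    return lambda word: " ".join(word)

def decode_audio(recordings, g2p): # stub: pretend ASR yields phone strings
    return {w: g2p(w) for w in recordings}

lexicon = dict(seed_lexicon)
recordings = ["fish", "bird"]      # OOV words observed in speech

for _ in range(2):                 # iterate: G2P -> decode -> extend -> retrain
    g2p = train_g2p(lexicon)
    hypothesized = decode_audio(recordings, g2p)
    lexicon.update(hypothesized)   # learned pronunciations for OOV words

print(lexicon)
```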
VacancySBERT: the approach for representation of titles and skills for semantic similarity search in the recruitment domain
results: The study shows that the custom training objective yields significant improvements of 10% and 21.5% with VacancySBERT and VacancySBERT (with skills), respectively. An open-source benchmark dataset was also developed to foster further research in this area.
Abstract
The paper focuses on deep learning semantic search algorithms applied in the HR domain. The aim of the article is to develop a novel approach to training a Siamese network to link the skills mentioned in a job ad with its title. It has been shown that the title normalization process can be based on either classification or similarity comparison approaches. While classification algorithms strive to classify a sample into a predefined set of categories, similarity search algorithms take a more flexible approach, since they are designed to find samples that are similar to a given query sample, without requiring predefined classes and labels. In this article, semantic similarity search is used to find candidates for title normalization. A pre-trained language model has been adapted by teaching it to match titles and skills based on co-occurrence information. For the purposes of this research, fifty billion title-description pairs were collected for training the model, along with thirty-three thousand title-description-normalized-title triplets, where the normalized job title was picked manually by the job ad creator, for testing purposes. FastText, BERT, SentenceBert, and JobBert were used as baselines. The accuracy of the designed algorithm is measured as Recall among the model's top one, five, and ten suggestions. It has been shown that the novel training objective achieves a significant improvement in comparison to other generic and specific text encoders. Two settings, treating titles as standalone strings and including skills as additional features during inference, were used and their results compared in this article. Improvements of 10% and 21.5% were achieved using VacancySBERT and VacancySBERT (with skills), respectively. The benchmark has been released as open source to foster further research in the area.
Text-CRS: A Generalized Certified Robustness Framework against Textual Adversarial Attacks
methods: Uses randomized smoothing to derive robustness bounds against four word-level adversarial operations.
results: Text-CRS can address all four word-level adversarial operations, significantly improves certified accuracy and radius, outperforms state-of-the-art certification against synonym substitution attacks, and provides the first benchmark on the certified accuracy and radius of the four word-level operations.
Abstract
The language models, especially the basic text classification models, have been shown to be susceptible to textual adversarial attacks such as synonym substitution and word insertion attacks. To defend against such attacks, a growing body of research has been devoted to improving the model robustness. However, providing provable robustness guarantees instead of empirical robustness is still widely unexplored. In this paper, we propose Text-CRS, a generalized certified robustness framework for natural language processing (NLP) based on randomized smoothing. To our best knowledge, existing certified schemes for NLP can only certify the robustness against $\ell_0$ perturbations in synonym substitution attacks. Representing each word-level adversarial operation (i.e., synonym substitution, word reordering, insertion, and deletion) as a combination of permutation and embedding transformation, we propose novel smoothing theorems to derive robustness bounds in both permutation and embedding space against such adversarial operations. To further improve certified accuracy and radius, we consider the numerical relationships between discrete words and select proper noise distributions for the randomized smoothing. Finally, we conduct substantial experiments on multiple language models and datasets. Text-CRS can address all four different word-level adversarial operations and achieve a significant accuracy improvement. We also provide the first benchmark on certified accuracy and radius of four word-level operations, besides outperforming the state-of-the-art certification against synonym substitution attacks.
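Text-CRS's specific noise distributions and certified bounds are beyond a short sketch, but the generic randomized-smoothing prediction step it builds on, majority voting under sampled perturbations, looks roughly like this; all components below are toy stand-ins.

```python
# Generic randomized-smoothing prediction by majority vote under sampled
# perturbations (here: synonym-style word swaps). This is the general recipe,
# not Text-CRS's specific noise distributions or bounds.
import random
from collections import Counter

def smoothed_predict(classify, tokens, perturb, n_samples=100, seed=0):
    random.seed(seed)
    votes = Counter(classify(perturb(tokens)) for _ in range(n_samples))
    (top, c1), *rest = votes.most_common(2) + [(None, 0)]
    return top, c1 / n_samples          # predicted class and vote share

# Toy components: a keyword classifier and a random synonym substitution.
SYNONYMS = {"awful": ["bad", "terrible"], "great": ["good", "fine"]}
classify = lambda toks: int(any(t in {"awful", "bad", "terrible"} for t in toks))
perturb = lambda toks: [random.choice(SYNONYMS.get(t, [t])) for t in toks]

print(smoothed_predict(classify, ["the", "movie", "was", "awful"], perturb))
```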
Noisy Self-Training with Data Augmentations for Offensive and Hate Speech Detection Tasks
paper_authors: João A. Leite, Carolina Scarton, Diego F. Silva
for: Automatic detection of offensive and hateful comments in online social media.
methods: Self-training and noisy self-training using textual data augmentations with five pre-trained BERT architectures.
results: Self-training consistently improves performance, while noisy self-training decreases performance on the offensive and hate-speech domains.
Abstract
Online social media is rife with offensive and hateful comments, prompting the need for their automatic detection given the sheer amount of posts created every second. Creating high-quality human-labelled datasets for this task is difficult and costly, especially because non-offensive posts are significantly more frequent than offensive ones. However, unlabelled data is abundant, easier, and cheaper to obtain. In this scenario, self-training methods, using weakly-labelled examples to increase the amount of training data, can be employed. Recent "noisy" self-training approaches incorporate data augmentation techniques to ensure prediction consistency and increase robustness against noisy data and adversarial attacks. In this paper, we experiment with default and noisy self-training using three different textual data augmentation techniques across five different pre-trained BERT architectures varying in size. We evaluate our experiments on two offensive/hate-speech datasets and demonstrate that (i) self-training consistently improves performance regardless of model size, resulting in up to +1.5% F1-macro on both datasets, and (ii) noisy self-training with textual data augmentations, despite being successfully applied in similar settings, decreases performance on offensive and hate-speech domains when compared to the default method, even with state-of-the-art augmentations such as backtranslation.
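A minimal self-training loop, with a confidence threshold for pseudo-labels, might look like the following; the "noisy" variant would additionally augment the pseudo-labelled texts (e.g., via backtranslation), which is omitted here, and the data is fabricated.

```python
# Minimal self-training loop: train on labelled data, pseudo-label confident
# unlabelled examples, retrain. A real loop would also remove items that
# were already pseudo-labelled in an earlier round.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labelled = [("you are an idiot", 1), ("have a nice day", 0),
            ("total moron", 1), ("thanks for sharing", 0)]
unlabelled = ["what an idiot", "nice post, thanks", "moron alert", "lovely day"]

texts, y = map(list, zip(*labelled))
for _ in range(2):                                # a couple of ST rounds
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(texts, y)
    proba = clf.predict_proba(unlabelled)
    keep = np.max(proba, axis=1) >= 0.6           # confidence threshold
    texts += [t for t, k in zip(unlabelled, keep) if k]
    y += list(np.argmax(proba[keep], axis=1))
print(clf.predict(["you utter moron", "wonderful, thank you"]))
```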
Deep Dive into the Language of International Relations: NLP-based Analysis of UNESCO’s Summary Records
results: The results show that automated tools can provide valuable analysis of decision-making processes, helping to address the tensions and conflicts that arise in international heritage inscription procedures.
Abstract
Cultural heritage is an arena of international relations that interests all states worldwide. The inscription process on the UNESCO World Heritage List and the UNESCO Representative List of the Intangible Cultural Heritage of Humanity often leads to tensions and conflicts among states. This research addresses these challenges by developing automatic tools that provide valuable insights into the decision-making processes regarding inscriptions to the two lists mentioned above. We propose innovative topic modelling and tension detection methods based on UNESCO's summary records. Our analysis achieved a commendable accuracy rate of 72% in identifying tensions. Furthermore, we have developed an application tailored for diplomats, lawyers, political scientists, and international relations researchers that facilitates the efficient search of paragraphs from selected documents and statements from specific speakers about chosen topics. This application is a valuable resource for enhancing the understanding of complex decision-making dynamics within international heritage inscription procedures.
DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training
paper_authors: Hyung-Seok Oh, Sang-Hoon Lee, Seong-Whan Lee
for: This paper focuses on improving the quality and speed of expressive text-to-speech systems through the use of a diffusion-based latent prosody generator and prosody conditional adversarial training.
methods: The proposed method, called DiffProsody, uses a diffusion-based latent prosody generator and prosody conditional adversarial training to generate high-quality speech with accurate prosody. The method also utilizes denoising diffusion generative adversarial networks to improve the prosody generation speed.
results: The paper demonstrates the effectiveness of the proposed method through experiments, showing that DiffProsody is capable of generating prosody 16 times faster than the conventional diffusion model, with superior performance compared to other state-of-the-art methods.
Abstract
Expressive text-to-speech systems have undergone significant advancements owing to prosody modeling, but conventional methods can still be improved. Traditional approaches have relied on the autoregressive method to predict the quantized prosody vector; however, it suffers from the issues of long-term dependency and slow inference. This study proposes a novel approach called DiffProsody in which expressive speech is synthesized using a diffusion-based latent prosody generator and prosody conditional adversarial training. Our findings confirm the effectiveness of our prosody generator in generating a prosody vector. Furthermore, our prosody conditional discriminator significantly improves the quality of the generated speech by accurately emulating prosody. We use denoising diffusion generative adversarial networks to improve the prosody generation speed. Consequently, DiffProsody is capable of generating prosody 16 times faster than the conventional diffusion model. The superior performance of our proposed method has been demonstrated via experiments.
Specification of MiniDemographicABM.jl: A simplified agent-based demographic model of the UK
* For: The paper explores and exploits the capabilities of the state-of-the-art Agents.jl Julia package in a simplified, non-calibrated agent-based demographic model of the UK.
* Methods: The paper uses a simplified, non-calibrated agent-based demographic model of the UK in which individuals are subject to ageing, deaths, births, divorces, and marriages. The model can be simulated with a user-defined fixed step size on an hourly, daily, weekly, or monthly basis, or even at an arbitrary user-defined clock rate.
* Results: The model can serve as a base model to be adjusted to realistic large-scale socio-economic, pandemic, or social-interaction-based studies, mainly within a demographic context.
Abstract
This document presents adequate formal terminology for the mathematical specification of a simplified, non-calibrated agent-based demographic model of the UK. Individuals of an initial population are subject to ageing, deaths, births, divorces, and marriages. The main purpose of the model is to explore and exploit capabilities of the state-of-the-art Agents.jl Julia package [1]. Additionally, the model can serve as a base model to be adjusted to realistic large-scale socio-economic, pandemic, or social-interaction-based studies, mainly within a demographic context. A simulation progresses with a user-defined fixed step size on an hourly, daily, weekly, or monthly basis, or even at an arbitrary user-defined clock rate.
Utilisation of open intent recognition models for customer support intent detection
results: The study shows that these techniques can improve the efficiency and accuracy of customer support while opening businesses to more international clients and greater scale; however, detecting unknown intents still requires further research and improvement.
Abstract
Businesses have sought out new solutions to provide support and improve customer satisfaction as more products and services have become interconnected digitally. There is an inherent need for businesses to provide or outsource fast, efficient and knowledgeable support to remain competitive. Support solutions are also advancing with technologies, including use of social media, Artificial Intelligence (AI), Machine Learning (ML) and remote device connectivity to better support customers. Customer support operators are trained to utilise these technologies to provide better customer outreach and support for clients in remote areas. Interconnectivity of products and support systems provide businesses with potential international clients to expand their product market and business scale. This paper reports the possible AI applications in customer support, done in collaboration with the Knowledge Transfer Partnership (KTP) program between Birmingham City University and a company that handles customer service systems for businesses outsourcing customer support across a wide variety of business sectors. This study explored several approaches to accurately predict customers' intent using both labelled and unlabelled textual data. While some approaches showed promise in specific datasets, the search for a single, universally applicable approach continues. The development of separate pipelines for intent detection and discovery has led to improved accuracy rates in detecting known intents, while further work is required to improve the accuracy of intent discovery for unknown intents.
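One common recipe consistent with the separate detection/discovery pipelines described above is to classify known intents and route low-confidence utterances to an "unknown intent" bucket for later discovery. A toy sketch follows; the utterances are fabricated and the threshold would in practice be tuned on held-out data.

```python
# Embed utterances, classify known intents, and flag low-confidence inputs
# as unknown for downstream intent discovery.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = [("my parcel never arrived", "delivery"),
         ("where is my order", "delivery"),
         ("i want my money back", "refund"),
         ("refund this purchase", "refund")]
texts, intents = map(list, zip(*train))

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, intents)

def detect(utterance, threshold=0.55):
    proba = clf.predict_proba([utterance])[0]
    i = int(np.argmax(proba))
    return clf.classes_[i] if proba[i] >= threshold else "UNKNOWN_INTENT"

print(detect("order still missing"), detect("how do i change my password"))
```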
Transferable Decoding with Visual Entities for Zero-Shot Image Captioning
for: This paper aims to improve image-to-text generation by addressing the problem of object hallucination in zero-shot image captioning.
methods: The proposed method, ViECap, uses entity-aware decoding to guide the attention of large language models (LLMs) toward the visual entities present in the image, improving the coherence and accuracy of the generated captions.
results: Extensive experiments show that ViECap sets a new state of the art in cross-domain (transferable) captioning and performs competitively on in-domain captioning compared to previous VLM-based zero-shot methods.
Abstract
Image-to-text generation aims to describe images using natural language. Recently, zero-shot image captioning based on pre-trained vision-language models (VLMs) and large language models (LLMs) has made significant progress. However, we have observed and empirically demonstrated that these methods are susceptible to modality bias induced by LLMs and tend to generate descriptions containing objects (entities) that do not actually exist in the image but frequently appear during training (i.e., object hallucination). In this paper, we propose ViECap, a transferable decoding model that leverages entity-aware decoding to generate descriptions in both seen and unseen scenarios. ViECap incorporates entity-aware hard prompts to guide LLMs' attention toward the visual entities present in the image, enabling coherent caption generation across diverse scenes. With entity-aware hard prompts, ViECap is capable of maintaining performance when transferring from in-domain to out-of-domain scenarios. Extensive experiments demonstrate that ViECap sets a new state-of-the-art cross-domain (transferable) captioning and performs competitively in-domain captioning compared to previous VLMs-based zero-shot methods. Our code is available at: https://github.com/FeiElysia/ViECap
Classifying multilingual party manifestos: Domain transfer across country, time, and genre
paper_authors: Matthias Aßenmacher, Nadja Sauter, Christian Heumann
for: This study investigates the reliability and reusability of domain transfer for political manifestos.
methods: The study uses a large-scale database of political manifestos and fine-tunes transformer models on it.
results: Fine-tuned (Distil)BERT models achieve similar performance on political manifestos across languages, geographies, time periods, and genres. The study also finds notable differences between the manifestos of different countries of origin, even when those countries share a language or cultural background.
Abstract
Annotation costs for large corpora are still one of the main bottlenecks in empirical social science research. On the one hand, making use of the capabilities of domain transfer allows re-using annotated data sets and trained models. On the other hand, it is not clear how well domain transfer works and how reliable the results are for transfer across different dimensions. We explore the potential of domain transfer across geographical locations, languages, time, and genre in a large-scale database of political manifestos. First, we show the strong within-domain classification performance of fine-tuned transformer models. Second, we vary the genre of the test set across the aforementioned dimensions to test the fine-tuned models' robustness and transferability. For switching genres, we use an external corpus of transcribed speeches from New Zealand politicians, while for the other three dimensions, custom splits of the Manifesto database are used. While BERT achieves the best scores in the initial experiments across modalities, DistilBERT proves competitive at a lower computational expense and is thus used for further experiments across time and country. The results of the additional analysis show that (Distil)BERT can be applied to future data with similar performance. Moreover, we observe (partly) notable differences between the political manifestos of different countries of origin, even if these countries share a language or a cultural background.
FinVis-GPT: A Multimodal Large Language Model for Financial Chart Analysis
for: FinVis-GPT is proposed for financial chart analysis, providing valuable analysis and interpretation of financial charts.
methods: FinVis-GPT uses a multimodal large language model (LLM) with instruction tuning and multimodal capabilities to analyze financial charts.
results: FinVis-GPT demonstrates superior performance on various financial chart related tasks, including generating descriptions, answering questions, and predicting future market trends, compared to existing state-of-the-art multimodal LLMs.
Abstract
In this paper, we propose FinVis-GPT, a novel multimodal large language model (LLM) specifically designed for financial chart analysis. By leveraging the power of LLMs and incorporating instruction tuning and multimodal capabilities, FinVis-GPT can interpret financial charts and provide valuable analysis. To train FinVis-GPT, a financial-task-oriented dataset was generated for pre-training alignment and instruction tuning, comprising various types of financial charts and their corresponding descriptions. Owing to time constraints, we evaluate model performance via several case studies; the promising results demonstrate that FinVis-GPT is superior on various financial chart related tasks, including generating descriptions, answering questions, and predicting future market trends, surpassing existing state-of-the-art multimodal LLMs. The proposed FinVis-GPT serves as a pioneering effort in utilizing multimodal LLMs in the finance domain, and our generated dataset will be released for public use in the near future to speed up related research.
results: The study finds that patent text contains a wealth of information not captured by traditional innovation metrics such as patent citations. Network analysis shows that indirect linkages are as important as direct connections, and that traditional measures of indirect linkages, such as the Leontief inverse matrix, capture only part of them. Finally, an impulse-response analysis illustrates how technological shocks transmit through the technology space, affecting sectors' innovation capacity.
Abstract
Which technological linkages affect the sector's ability to innovate? How do these effects transmit through the technology space? This paper answers these two key questions using novel methods of text mining and network analysis. We examine technological interdependence across sectors over a period of half a century (from 1976 to 2021) by analyzing the text of 6.5 million patents granted by the United States Patent and Trademark Office (USPTO), and applying network analysis to uncover the full spectrum of linkages existing across technology areas. We demonstrate that patent text contains a wealth of information often not captured by traditional innovation metrics, such as patent citations. By using network analysis, we document that indirect linkages are as important as direct connections and that the former would remain mostly hidden using more traditional measures of indirect linkages, such as the Leontief inverse matrix. Finally, based on an impulse-response analysis, we illustrate how technological shocks transmit through the technology (network-based) space, affecting the innovation capacity of the sectors.
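The Leontief-inverse measure of indirect linkages mentioned above has a compact form: with A the matrix of direct inter-sector links, (I - A)^{-1} = I + A + A^2 + ... sums linkage paths of every length (valid when the spectral radius of A is below 1). A toy illustration with made-up values:

```python
# Direct vs. total (direct + indirect) technological linkages.
import numpy as np

A = np.array([[0.0, 0.3, 0.0],     # direct links between 3 toy sectors
              [0.1, 0.0, 0.2],
              [0.0, 0.4, 0.0]])
L = np.linalg.inv(np.eye(3) - A)   # Leontief inverse: sums all path lengths
print(L)
# Sector 0 reaches sector 2 only indirectly (A[0,2] = 0, but L[0,2] > 0).
print(A[0, 2], L[0, 2])
```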
A Benchmark for Understanding Dialogue Safety in Mental Health Support
* For: The paper aims to develop a theoretically and factually grounded taxonomy for analyzing response safety in mental health support, and to create a benchmark corpus with fine-grained labels for each dialogue session.
* Methods: The paper uses a zero- and few-shot learning approach with popular language models, including BERT-base, RoBERTa-large, and ChatGPT, to detect and understand unsafe responses within the context of mental health support.
* Results: The study reveals that ChatGPT struggles to detect safety categories with detailed safety definitions in a zero- and few-shot paradigm, whereas the fine-tuned model proves to be more suitable. The developed dataset and findings serve as valuable benchmarks for advancing research on dialogue safety in mental health support.
Abstract
Dialogue safety remains a pervasive challenge in open-domain human-machine interaction. Existing approaches propose distinctive dialogue safety taxonomies and datasets for detecting explicitly harmful responses. However, these taxonomies may not be suitable for analyzing response safety in mental health support. In real-world interactions, a model response deemed acceptable in casual conversations might have a negligible positive impact on users seeking mental health support. To address these limitations, this paper aims to develop a theoretically and factually grounded taxonomy that prioritizes the positive impact on help-seekers. Additionally, we create a benchmark corpus with fine-grained labels for each dialogue session to facilitate further research. We analyze the dataset using popular language models, including BERT-base, RoBERTa-large, and ChatGPT, to detect and understand unsafe responses within the context of mental health support. Our study reveals that ChatGPT struggles to detect safety categories with detailed safety definitions in a zero- and few-shot paradigm, whereas the fine-tuned model proves to be more suitable. The developed dataset and findings serve as valuable benchmarks for advancing research on dialogue safety in mental health support, with significant implications for improving the design and deployment of conversation agents in real-world applications. We release our code and data here: https://github.com/qiuhuachuan/DialogueSafety.
Camoscio: An Italian Instruction-tuned LLaMA
results: The model's zero-shot performance on various downstream tasks in Italian competes favorably with existing models specifically fine-tuned for those tasks.
Abstract
In recent years Large Language Models (LLMs) have increased the state of the art on several natural language processing tasks. However, their accessibility is often limited to paid API services, posing challenges for researchers in conducting extensive investigations. On the other hand, while some open-source models have been proposed by the community, they are typically multilingual and not specifically tailored for the Italian language. In an effort to democratize the available and open resources for the Italian language, in this paper we introduce Camoscio: a language model specifically tuned to follow users' prompts in Italian. Specifically, we finetuned the smallest variant of LLaMA (7b) with LoRA on a corpus of instruction prompts translated to Italian via ChatGPT. Results indicate that the model's zero-shot performance on various downstream tasks in Italian competes favorably with existing models specifically finetuned for those tasks. All the artifacts (code, dataset, model) are released to the community at the following url: https://github.com/teelinsan/camoscio
DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement Estimation in Conversation
results: Our proposed system surpasses the baseline models, with a 7% improvement on the test set and 4% on the validation set.
Abstract
Conversational engagement estimation is posed as a regression problem, entailing the identification of the favorable attention and involvement of the participants in the conversation. This task is a crucial pursuit for gaining insights into humans' interaction dynamics and behavior patterns within a conversation. In this research, we introduce a dilated convolutional Transformer for modeling and estimating human engagement in the MULTIMEDIATE 2023 competition. Our proposed system surpasses the baseline models, exhibiting a noteworthy 7% improvement on the test set and 4% on the validation set. Moreover, we employ different modality fusion mechanisms and show that for this type of data, a simple concatenation method with self-attention fusion achieves the best performance.
SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation
for: This paper proposes SelfSeg, a self-supervised neural sub-word segmentation method that is faster and more efficient than existing methods.
methods: The paper uses a self-supervised approach that takes as input a word in the form of a partially masked character sequence, optimizes the word generation probability, and generates the segmentation with the maximum posterior probability using a dynamic programming algorithm. The training time of SelfSeg depends on word frequencies, and the paper explores several word frequency normalization strategies to accelerate the training phase.
results: The paper conducts machine translation experiments in low-, middle-, and high-resource scenarios, comparing the performance of different segmentation methods. SelfSeg achieves significant improvements over existing methods, including BPE and SentencePiece; the regularization method achieves approximately a 4.3 BLEU improvement over BPE and a 1.2 BLEU improvement over BPE-dropout. Improvements are also observed on several other datasets.
Abstract
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT). Existing work has shown that neural sub-word segmenters are better than Byte-Pair Encoding (BPE), however, they are inefficient as they require parallel corpora, days to train and hours to decode. This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method that is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora. SelfSeg takes as input a word in the form of a partially masked character sequence, optimizes the word generation probability and generates the segmentation with the maximum posterior probability, which is calculated using a dynamic programming algorithm. The training time of SelfSeg depends on word frequencies, and we explore several word frequency normalization strategies to accelerate the training phase. Additionally, we propose a regularization mechanism that allows the segmenter to generate various segmentations for one word. To show the effectiveness of our approach, we conduct MT experiments in low-, middle- and high-resource scenarios, where we compare the performance of using different segmentation methods. The experimental results demonstrate that on the low-resource ALT dataset, our method achieves more than 1.2 BLEU score improvement compared with BPE and SentencePiece, and a 1.1 score improvement over Dynamic Programming Encoding (DPE) and Vocabulary Learning via Optimal Transport (VOLT) on average. The regularization method achieves approximately a 4.3 BLEU score improvement over BPE and a 1.2 BLEU score improvement over BPE-dropout, the regularized version of BPE. We also observed significant improvements on IWSLT15 Vi->En, WMT16 Ro->En and WMT15 Fi->En datasets, and competitive results on the WMT14 De->En and WMT14 Fr->En datasets.
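The maximum-posterior segmentation step can be written as a short dynamic program; the subword log-scores below are made up, standing in for the neural segmenter's probabilities.

```python
# Dynamic-programming segmentation in the spirit described above: pick the
# split of a word into subwords that maximizes the sum of log-scores.
import math

subword_logp = {"un": -1.0, "related": -1.5, "rel": -2.5, "ated": -2.5,
                "u": -4.0, "n": -4.0}                # hypothetical scores

def best_segmentation(word, max_len=8):
    n = len(word)
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = word[j:i]
            lp = subword_logp.get(piece, -10.0)      # fallback for unknowns
            score = best[j][0] + lp
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [piece])
    return best[n]

print(best_segmentation("unrelated"))   # -> (-2.5, ['un', 'related'])
```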
Does fine-tuning GPT-3 with the OpenAI API leak personally-identifiable information?
results: The study finds that after fine-tuning, GPT-3 memorizes and leaks personally identifiable information (PII) from the original fine-tuning dataset.
Abstract
Machine learning practitioners often fine-tune generative pre-trained models like GPT-3 to improve model performance at specific tasks. Previous works, however, suggest that fine-tuned machine learning models memorize and emit sensitive information from the original fine-tuning dataset. Companies such as OpenAI offer fine-tuning services for their models, but no prior work has conducted a memorization attack on any closed-source models. In this work, we simulate a privacy attack on GPT-3 using OpenAI's fine-tuning API. Our objective is to determine if personally identifiable information (PII) can be extracted from this model. We (1) explore the use of naive prompting methods on a GPT-3 fine-tuned classification model, and (2) we design a practical word generation task called Autocomplete to investigate the extent of PII memorization in fine-tuned GPT-3 within a real-world context. Our findings reveal that fine-tuning GPT3 for both tasks led to the model memorizing and disclosing critical personally identifiable information (PII) obtained from the underlying fine-tuning dataset. To encourage further research, we have made our codes and datasets publicly available on GitHub at: https://github.com/albertsun1/gpt3-pii-attacks
Distractor generation for multiple-choice questions with predictive prompting and large language models
results: Our approach performs well both in quantitative evaluation on an existing test set and in quality annotations by human experts (teachers): on average, 53% of the generated distractors were rated as high quality, outperforming the state-of-the-art model. We also compare our approach against zero-shot ChatGPT and few-shot ChatGPT prompted with static examples, demonstrating its advantage in generating high-quality distractors.
Abstract
Large Language Models (LLMs) such as ChatGPT have demonstrated remarkable performance across various tasks and have garnered significant attention from both researchers and practitioners. However, in an educational context, we still observe a performance gap when generating distractors (i.e., plausible yet incorrect answers) with LLMs for multiple-choice questions (MCQs). In this study, we propose a strategy for guiding LLMs such as ChatGPT in generating relevant distractors by prompting them with question items automatically retrieved from a question bank as well-chosen in-context examples. We evaluate our LLM-based solutions using a quantitative assessment on an existing test set, as well as through quality annotations by human experts, i.e., teachers. We found that on average 53% of the generated distractors presented to the teachers were rated as high-quality, i.e., suitable for immediate use as is, outperforming the state-of-the-art model. We also show the gains of our approach in generating high-quality distractors by comparing it with a zero-shot ChatGPT and a few-shot ChatGPT prompted with static examples.
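A schematic of the retrieval-augmented prompting strategy, with the retriever and LLM call stubbed out and a prompt wording that is illustrative rather than the paper's exact template:

```python
# Sketch of "predictive prompting": retrieve similar question items from a
# question bank and use them as in-context examples when asking an LLM for
# distractors. All content below is fabricated for illustration.
question_bank = [
    {"stem": "What is 3/4 + 1/4?", "answer": "1",
     "distractors": ["4/8", "3/16", "2/4"]},
    {"stem": "What is 1/2 + 1/2?", "answer": "1",
     "distractors": ["2/4", "1/4", "2"]},
]

def retrieve_similar(stem, k=2):   # stub: a real system uses similarity search
    return question_bank[:k]

def build_prompt(stem, answer):
    shots = "\n".join(
        f"Question: {ex['stem']}\nAnswer: {ex['answer']}\n"
        f"Distractors: {', '.join(ex['distractors'])}"
        for ex in retrieve_similar(stem))
    return f"{shots}\n\nQuestion: {stem}\nAnswer: {answer}\nDistractors:"

print(build_prompt("What is 2/3 + 1/3?", "1"))
```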
Mispronunciation detection using self-supervised speech representations
results: We find that a downstream model trained directly for the target task gives the best performance, while most upstream models perform similarly on this task.
Abstract
In recent years, self-supervised learning (SSL) models have produced promising results in a variety of speech-processing tasks, especially in contexts of data scarcity. In this paper, we study the use of SSL models for the task of mispronunciation detection for second language learners. We compare two downstream approaches: 1) training the model for phone recognition (PR) using native English data, and 2) training a model directly for the target task using non-native English data. We compare the performance of these two approaches for various SSL representations as well as a representation extracted from a traditional DNN-based speech recognition model. We evaluate the models on L2Arctic and EpaDB, two datasets of non-native speech annotated with pronunciation labels at the phone level. Overall, we find that using a downstream model trained for the target task gives the best performance and that most upstream models perform similarly for the task.
paper_authors: Zihan Zhang, Lei Shi, Ding-Xuan Zhou
for: This paper studies the generalization analysis of deep neural networks (DNNs) in binary classification tasks.
methods: The paper establishes a novel oracle-type inequality and uses it to derive fast convergence rates for DNN classifiers trained with the logistic loss.
results: The paper provides new generalization results, including convergence rates for DNN classifiers that are optimal up to log factors and independent of the input dimension of the data. These results help explain why DNN classifiers can perform well in practical high-dimensional classification tasks.
Abstract
Deep neural networks (DNNs) trained with the logistic loss (i.e., the cross entropy loss) have made impressive advancements in various binary classification tasks. However, generalization analysis for binary classification with DNNs and logistic loss remains scarce. The unboundedness of the target function for the logistic loss is the main obstacle to deriving satisfying generalization bounds. In this paper, we aim to fill this gap by establishing a novel and elegant oracle-type inequality, which enables us to deal with the boundedness restriction of the target function, and using it to derive sharp convergence rates for fully connected ReLU DNN classifiers trained with logistic loss. In particular, we obtain optimal convergence rates (up to log factors) only requiring the H\"older smoothness of the conditional class probability $\eta$ of data. Moreover, we consider a compositional assumption that requires $\eta$ to be the composition of several vector-valued functions of which each component function is either a maximum value function or a H\"older smooth function only depending on a small number of its input variables. Under this assumption, we derive optimal convergence rates (up to log factors) which are independent of the input dimension of data. This result explains why DNN classifiers can perform well in practical high-dimensional classification problems. Besides the novel oracle-type inequality, the sharp convergence rates given in our paper also owe to a tight error bound for approximating the natural logarithm function near zero (where it is unbounded) by ReLU DNNs. In addition, we justify our claims for the optimality of rates by proving corresponding minimax lower bounds. All these results are new in the literature and will deepen our theoretical understanding of classification with DNNs.
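For reference, the logistic loss and the associated risk take the standard form (our notation, not necessarily the paper's):

\[
\phi(t) = \log\bigl(1 + e^{-t}\bigr), \qquad
\mathcal{R}(f) = \mathbb{E}\bigl[\phi\bigl(Y f(X)\bigr)\bigr], \quad Y \in \{-1, +1\},
\]

whose minimizer is the log-odds $f^*(x) = \log\frac{\eta(x)}{1-\eta(x)}$, which is unbounded as $\eta(x) \to 0$ or $1$; this is the unboundedness obstacle the abstract refers to.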
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
results: According to the paper, using the ToolLLM framework and the ToolBench dataset, the LLaMA model can accurately execute complex instructions and reason and plan over unseen APIs. Moreover, the resulting ToolLLaMA model performs comparably to ChatGPT and can automatically select appropriate APIs.
Abstract
Despite the advancements of open-source large language models (LLMs) and their variants, e.g., LLaMA and Vicuna, they remain significantly limited in performing higher-level tasks, such as following human instructions to use external tools (APIs). This is because current instruction tuning largely focuses on basic language tasks instead of the tool-use domain. This is in contrast to state-of-the-art (SOTA) LLMs, e.g., ChatGPT, which have demonstrated excellent tool-use capabilities but are unfortunately closed source. To facilitate tool-use capabilities within open-source LLMs, we introduce ToolLLM, a general tool-use framework of data construction, model training and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is created automatically using ChatGPT. Specifically, we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub, then prompt ChatGPT to generate diverse human instructions involving these APIs, covering both single-tool and multi-tool scenarios. Finally, we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To make the searching process more efficient, we develop a novel depth-first search-based decision tree (DFSDT), enabling LLMs to evaluate multiple reasoning traces and expand the search space. We show that DFSDT significantly enhances the planning and reasoning capabilities of LLMs. For efficient tool-use assessment, we develop an automatic evaluator: ToolEval. We fine-tune LLaMA on ToolBench and obtain ToolLLaMA. Our ToolEval reveals that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. To make the pipeline more practical, we devise a neural API retriever to recommend appropriate APIs for each instruction, negating the need for manual API selection.
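A schematic of the depth-first search pattern behind DFSDT is sketched below; `expand` and `is_solution` are placeholders for the LLM-driven proposal and evaluation steps, so this is a minimal reading of the idea rather than the authors' implementation.

```python
# A schematic of the depth-first search-based decision tree (DFSDT) idea:
# explore chains of API calls depth-first, let the LLM propose and score
# partial traces, and backtrack from dead ends. `expand` and `is_solution`
# are placeholders for the LLM-driven steps, not the paper's exact code.

def dfsdt(state, expand, is_solution, max_depth=10):
    """Depth-first search over reasoning traces; returns the first valid
    chain of API calls found, or None if the subtree is exhausted."""
    if is_solution(state):
        return state
    if max_depth == 0:
        return None
    for child in expand(state):  # LLM proposes candidate next API calls
        result = dfsdt(child, expand, is_solution, max_depth - 1)
        if result is not None:
            return result
    return None  # dead end: backtrack to the parent node
```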
results: The study finds that most latent neurons remain silent when fed real music, while only a few independent neurons fire; these are called "music neurons". Most of the information about pitch and rhythm is encoded in the first few music neurons, whereas melody only shows up in independent neurons for longer sequences of music.
Abstract
We use Google's MusicVAE, a Variational Auto-Encoder with a 512-dimensional latent space to represent a few bars of music, and organize the latent dimensions according to their relevance in describing music. We find that, on average, most latent neurons remain silent when fed real music tracks: we call these "noise" neurons. The remaining few dozens of latent neurons that do fire are called "music neurons". We ask which neurons carry the musical information and what kind of musical information they encode, namely something that can be identified as pitch, rhythm or melody. We find that most of the information about pitch and rhythm is encoded in the first few music neurons: the neural network has thus constructed a couple of variables that non-linearly encode many human-defined variables used to describe pitch and rhythm. The concept of melody only seems to show up in independent neurons for longer sequences of music.
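A minimal sketch of how one might separate "music neurons" from "noise neurons" is given below, using the variance of latent activations over real tracks as the activity measure; both the measure and the threshold are assumptions, not the paper's exact criterion.

```python
# Minimal sketch: split a 512-dimensional latent space into active ("music")
# and silent ("noise") dimensions by the variance of activations over tracks.
import numpy as np

def split_neurons(latents, threshold=1e-3):
    """latents: array of shape (n_tracks, 512) of encoded music segments.
    Returns indices of active (music) and silent (noise) latent dimensions."""
    activity = latents.var(axis=0)            # per-dimension spread across tracks
    music = np.where(activity > threshold)[0]
    noise = np.where(activity <= threshold)[0]
    return music, noise

# Example with random data standing in for MusicVAE encodings:
rng = np.random.default_rng(0)
z = rng.normal(0, 0.01, size=(100, 512))
z[:, :24] = rng.normal(0, 1.0, size=(100, 24))  # a few dimensions that "fire"
music, noise = split_neurons(z)
print(len(music), "music neurons,", len(noise), "noise neurons")
```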
Lossless Transformations and Excess Risk Bounds in Statistical Inference
paper_authors: László Györfi, Tamás Linder, Harro Walk
for: This paper studies the excess minimum risk in statistical inference, i.e., the difference between the minimum risk of estimating a random variable from an observed feature vector and the minimum risk of estimating it from a transformation (statistic) of that vector.
methods: The paper first characterizes lossless transformations, i.e., transformations for which the excess risk is zero for all loss functions, then constructs a partitioning test statistic for the hypothesis that a given transformation is lossless and shows that for i.i.d. data the test is strongly consistent.
results: The paper also develops information-theoretic upper bounds on the excess risk that hold uniformly over fairly general classes of loss functions. Based on these bounds, it introduces the notion of a delta-lossless transformation and gives sufficient conditions for a given transformation to be universally delta-lossless.
Abstract
We study the excess minimum risk in statistical inference, defined as the difference between the minimum expected loss in estimating a random variable from an observed feature vector and the minimum expected loss in estimating the same random variable from a transformation (statistic) of the feature vector. After characterizing lossless transformations, i.e., transformations for which the excess risk is zero for all loss functions, we construct a partitioning test statistic for the hypothesis that a given transformation is lossless and show that for i.i.d. data the test is strongly consistent. More generally, we develop information-theoretic upper bounds on the excess risk that uniformly hold over fairly general classes of loss functions. Based on these bounds, we introduce the notion of a delta-lossless transformation and give sufficient conditions for a given transformation to be universally delta-lossless. Applications to classification, nonparametric regression, portfolio strategies, information bottleneck, and deep learning, are also surveyed.
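In standard notation (assumed here), with loss function $\ell$, the excess minimum risk of a statistic $T$ is

\[
\Delta_\ell(T) \;=\; \inf_{g}\ \mathbb{E}\bigl[\ell\bigl(Y, g(T(X))\bigr)\bigr]
\;-\; \inf_{f}\ \mathbb{E}\bigl[\ell\bigl(Y, f(X)\bigr)\bigr] \;\ge\; 0,
\]

and $T$ is lossless precisely when $\Delta_\ell(T) = 0$ for every loss $\ell$.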
An Efficient Shapley Value Computation for the Naive Bayes Classifier
paper_authors: Vincent Lemaire, Fabrice Clérot, Marc Boullé
for: The goal of this work is to propose an exact analytic expression of Shapley values for the naive Bayes classifier, compare it with another frequently used indicator, the Weight of Evidence (WoE), and provide an empirical comparison with KernelShap results.
methods: The study builds on Shapley value estimation algorithms from cooperative game theory and derives an exact analytic expression of Shapley values for the naive Bayes classifier.
results: The results show that the proposed Shapley formulation yields informative results on real-world datasets, with both similarities and differences relative to WoE and KernelShap. In particular, its low algorithmic complexity and extremely low computation time allow it to be used on very large datasets.
Abstract
Variable selection or importance measurement of input variables to a machine learning model has become the focus of much research. It is no longer enough to have a good model; one must also explain its decisions. This is why so many intelligibility algorithms are available today. Among them, Shapley value estimation algorithms are intelligibility methods based on cooperative game theory. In the case of the naive Bayes classifier, and to our knowledge, there is no ``analytical'' formulation of Shapley values. This article proposes an exact analytic expression of Shapley values in the special case of the naive Bayes classifier. We analytically compare this Shapley proposal to another frequently used indicator, the Weight of Evidence (WoE), and provide an empirical comparison of our proposal with (i) the WoE and (ii) KernelShap results on real-world datasets, discussing similar and dissimilar results. The results show that our Shapley proposal for the naive Bayes classifier provides informative results with low algorithmic complexity, so that it can be used on very large datasets with extremely low computation time.
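For context, the classical Shapley value of feature $i$ under a value function $v$ over feature subsets $S \subseteq N$ is the standard game-theoretic quantity

\[
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
\frac{|S|!\,\bigl(|N| - |S| - 1\bigr)!}{|N|!}\,
\Bigl(v\bigl(S \cup \{i\}\bigr) - v(S)\Bigr);
\]

the exponentially many subsets in this sum are exactly what an exact analytic expression for naive Bayes avoids having to enumerate.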
Active Learning in Genetic Programming: Guiding Efficient Data Collection for Symbolic Regression
results: We combine uncertainty and diversity through a Pareto optimization approach so that both can be considered in a balanced way when selecting informative and unique data points for training.
Abstract
This paper examines various methods of computing uncertainty and diversity for active learning in genetic programming. We found that the model population in genetic programming can be exploited to select informative training data points by using a model ensemble combined with an uncertainty metric. We explored several uncertainty metrics and found that differential entropy performed the best. We also compared two data diversity metrics and found that correlation as a diversity metric performs better than minimum Euclidean distance, although there are some drawbacks that prevent correlation from being used on all problems. Finally, we combined uncertainty and diversity using a Pareto optimization approach to allow both to be considered in a balanced way to guide the selection of informative and unique data points for training.
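A compact sketch of the Pareto-selection step is shown below; the uncertainty and diversity scores are assumed to be precomputed from the model ensemble (e.g., differential entropy and correlation), and the dominance rule is the standard one.

```python
# Minimal sketch of Pareto-based selection over two acquisition scores,
# uncertainty and diversity (both "higher is better"). How the scores are
# computed from the genetic-programming ensemble is left abstract here.
import numpy as np

def pareto_front(uncertainty, diversity):
    """Return indices of candidates not dominated by any other candidate."""
    scores = np.stack([uncertainty, diversity], axis=1)
    front = []
    for i, s in enumerate(scores):
        dominated = np.any(np.all(scores >= s, axis=1) &
                           np.any(scores > s, axis=1))
        if not dominated:
            front.append(i)
    return front

u = np.array([0.9, 0.2, 0.7, 0.4])
d = np.array([0.1, 0.8, 0.6, 0.5])
print(pareto_front(u, d))  # candidates that trade off the two criteria
```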
An Empirical Study on Log-based Anomaly Detection Using Machine Learning
results: The study finds that traditional ML techniques and deep learning techniques perform very closely in terms of detection accuracy and prediction time, whereas semi-supervised techniques yield markedly worse detection accuracy. Moreover, the sensitivity to hyperparameter tuning differs considerably across learning models.
Abstract
The growth of systems complexity increases the need of automated techniques dedicated to different log analysis tasks such as Log-based Anomaly Detection (LAD). The latter has been widely addressed in the literature, mostly by means of different deep learning techniques. Nevertheless, the focus on deep learning techniques results in less attention being paid to traditional Machine Learning (ML) techniques, which may perform well in many cases, depending on the context and the used datasets. Further, the evaluation of different ML techniques is mostly based on the assessment of their detection accuracy. However, this is not enough to decide whether or not a specific ML technique is suitable to address the LAD problem. Other aspects to consider include the training and prediction time as well as the sensitivity to hyperparameter tuning. In this paper, we present a comprehensive empirical study, in which we evaluate different supervised and semi-supervised, traditional and deep ML techniques w.r.t. four evaluation criteria: detection accuracy, time performance, sensitivity of detection accuracy as well as time performance to hyperparameter tuning. The experimental results show that supervised traditional and deep ML techniques perform very closely in terms of their detection accuracy and prediction time. Moreover, the overall evaluation of the sensitivity of the detection accuracy of the different ML techniques to hyperparameter tuning shows that supervised traditional ML techniques are less sensitive to hyperparameter tuning than deep learning techniques. Further, semi-supervised techniques yield significantly worse detection accuracy than supervised techniques.
TFE-GNN: A Temporal Fusion Encoder Using Graph Neural Networks for Fine-grained Encrypted Traffic Classification
paper_authors: Haozhen Zhang, Le Yu, Xi Xiao, Qing Li, Francesco Mercaldo, Xiapu Luo, Qixu Liu
for: This work proposes a byte-level traffic graph construction approach based on point-wise mutual information (PMI) and a graph neural network (GNN) based feature extraction model, the Temporal Fusion Encoder (TFE-GNN), for fine-grained encrypted traffic classification.
methods: The approach combines byte-level traffic graph construction with a GNN-based feature extraction model comprising a dual embedding layer, a GNN-based traffic graph encoder, and a cross-gated feature fusion mechanism.
results: Experimental results on two real-world datasets show that TFE-GNN outperforms multiple state-of-the-art methods on fine-grained encrypted traffic classification tasks.
Encrypted traffic classification is receiving widespread attention from researchers and industrial companies. However, the existing methods only extract flow-level features, failing to handle short flows because of unreliable statistical properties, or treat the header and payload equally, failing to mine the potential correlation between bytes. Therefore, in this paper, we propose a byte-level traffic graph construction approach based on point-wise mutual information (PMI), and a model named Temporal Fusion Encoder using Graph Neural Networks (TFE-GNN) for feature extraction. In particular, we design a dual embedding layer, a GNN-based traffic graph encoder as well as a cross-gated feature fusion mechanism, which can first embed the header and payload bytes separately and then fuses them together to obtain a stronger feature representation. The experimental results on two real datasets demonstrate that TFE-GNN outperforms multiple state-of-the-art methods in fine-grained encrypted traffic classification tasks.
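As a reminder of the underlying quantity (standard definition; the paper's exact edge rule may differ), the point-wise mutual information between two bytes $x$ and $y$ is

\[
\mathrm{PMI}(x, y) \;=\; \log \frac{p(x, y)}{p(x)\,p(y)},
\]

with an edge typically drawn between byte pairs whose PMI is positive, i.e., bytes that co-occur more often than chance would predict.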
Deep Learning Meets Adaptive Filtering: A Stein’s Unbiased Risk Estimator Approach
results: The authors propose a training method based on Stein's unbiased risk estimator (SURE) and show experimentally that it improves source signal estimation performance.
Abstract
This paper revisits two prominent adaptive filtering algorithms through the lens of algorithm unrolling, namely recursive least squares (RLS) and equivariant adaptive source separation (EASI), in the context of source estimation and separation. Building upon the unrolling methodology, we introduce novel task-based deep learning frameworks, denoted as Deep RLS and Deep EASI. These architectures transform the iterations of the original algorithms into layers of a deep neural network, thereby enabling efficient source signal estimation by taking advantage of a training process. To further enhance performance, we propose training these deep unrolled networks utilizing a loss function grounded on a Stein's unbiased risk estimator (SURE). Our empirical evaluations demonstrate the efficacy of this SURE-based approach for enhanced source signal estimation.
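For a Gaussian observation model $\mathbf{y} = \mathbf{x} + \mathbf{n}$ with $\mathbf{n} \sim \mathcal{N}(0, \sigma^2 \mathbf{I})$, Stein's unbiased risk estimator of the mean squared error of an estimator $f$ takes the standard form (assumed here rather than quoted from the paper):

\[
\mathrm{SURE}(f) \;=\; -N\sigma^{2} \;+\; \bigl\lVert f(\mathbf{y}) - \mathbf{y} \bigr\rVert_{2}^{2}
\;+\; 2\sigma^{2} \sum_{i=1}^{N} \frac{\partial f_i(\mathbf{y})}{\partial y_i}.
\]

Because the right-hand side depends only on the observation $\mathbf{y}$, it can serve as a training loss when clean targets are unavailable.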
Lookbehind Optimizer: k steps back, 1 step forward
results: Across a variety of tasks and training regimes, Lookbehind achieves better generalization performance, greater robustness against noisy weights, and higher tolerance to catastrophic forgetting in lifelong learning settings.
Abstract
The Lookahead optimizer improves the training stability of deep neural networks by having a set of fast weights that "look ahead" to guide the descent direction. Here, we combine this idea with sharpness-aware minimization (SAM) to stabilize its multi-step variant and improve the loss-sharpness trade-off. We propose Lookbehind, which computes $k$ gradient ascent steps ("looking behind") at each iteration and combine the gradients to bias the descent step toward flatter minima. We apply Lookbehind on top of two popular sharpness-aware training methods -- SAM and adaptive SAM (ASAM) -- and show that our approach leads to a myriad of benefits across a variety of tasks and training regimes. Particularly, we show increased generalization performance, greater robustness against noisy weights, and higher tolerance to catastrophic forgetting in lifelong learning settings.
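One minimal reading of Lookbehind on top of SAM is sketched below: take $k$ successive ascent steps from the current weights, collect the gradient at each perturbed point, and average them for the descent update from the original weights. The step sizes and normalization are assumptions, not the authors' code.

```python
# Schematic Lookbehind step: k gradient-ascent ("looking behind") probes,
# whose gradients are averaged for a single descent step toward flatter minima.
import torch

def lookbehind_step(model, loss_fn, batch, k=3, rho=0.05, lr=0.1):
    params = [p for p in model.parameters() if p.requires_grad]
    start = [p.detach().clone() for p in params]
    accum = [torch.zeros_like(p) for p in params]
    for _ in range(k):
        loss = loss_fn(model, batch)
        grads = torch.autograd.grad(loss, params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        with torch.no_grad():
            for p, g, a in zip(params, grads, accum):
                a += g                         # accumulate for the descent step
                p += rho * g / (norm + 1e-12)  # ascend toward sharper regions
    with torch.no_grad():
        for p, s, a in zip(params, start, accum):
            p.copy_(s - lr * a / k)            # descend from the original weights
```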
A theory of data variability in Neural Network Bayesian inference
results: The results show that the generalization properties of neural networks across input dimensionalities and training set sizes can be described through the statistical properties of the kernel matrix, and that bounds on these generalization properties can be obtained in this way.
Abstract
Bayesian inference and kernel methods are well established in machine learning. The neural network Gaussian process in particular provides a concept to investigate neural networks in the limit of infinitely wide hidden layers by using kernel and inference methods. Here we build upon this limit and provide a field-theoretic formalism which covers the generalization properties of infinitely wide networks. We systematically compute generalization properties of linear, non-linear, and deep non-linear networks for kernel matrices with heterogeneous entries. In contrast to currently employed spectral methods we derive the generalization properties from the statistical properties of the input, elucidating the interplay of input dimensionality, size of the training data set, and variability of the data. We show that data variability leads to a non-Gaussian action reminiscent of a ($\varphi^3+\varphi^4$)-theory. Using our formalism on a synthetic task and on MNIST we obtain a homogeneous kernel matrix approximation for the learning curve as well as corrections due to data variability which allow the estimation of the generalization properties and exact results for the bounds of the learning curves in the case of infinitely many training data points.
Guiding Image Captioning Models Toward More Specific Captions
paper_authors: Simon Kornblith, Lala Li, Zirui Wang, Thao Nguyen
for: The paper aims to improve the quality of image captions generated by an autoregressive captioning model, specifically by fine-tuning the model to estimate both conditional and unconditional distributions over captions.
methods: The paper uses classifier-free guidance for the autoregressive captioning model; the guidance scale applied at decoding controls a trade-off between maximizing the probability of the caption given the image and the probability of the image given the caption.
results: The paper shows that decoding with a guidance scale of 2 substantially improves reference-free metrics such as CLIPScore and caption-to-image retrieval performance in the CLIP embedding space, but worsens standard reference-based captioning metrics such as CIDEr. The paper also explores the use of language models to guide the decoding process, obtaining small improvements over the Pareto frontier of reference-free vs. reference-based captioning metrics.
Image captioning is conventionally formulated as the task of generating captions for images that match the distribution of reference image-caption pairs. However, reference captions in standard captioning datasets are short and may not uniquely identify the images they describe. These problems are further exacerbated when models are trained directly on image-alt text pairs collected from the internet. In this work, we show that it is possible to generate more specific captions with minimal changes to the training process. We implement classifier-free guidance for an autoregressive captioning model by fine-tuning it to estimate both conditional and unconditional distributions over captions. The guidance scale applied at decoding controls a trade-off between maximizing $p(\mathrm{caption}|\mathrm{image})$ and $p(\mathrm{image}|\mathrm{caption})$. Compared to standard greedy decoding, decoding with a guidance scale of 2 substantially improves reference-free metrics such as CLIPScore (0.808 vs. 0.775) and caption$\to$image retrieval performance in the CLIP embedding space (recall@1 44.6% vs. 26.5%), but worsens standard reference-based captioning metrics (e.g., CIDEr 78.6 vs 126.1). We further explore the use of language models to guide the decoding process, obtaining small improvements over the Pareto frontier of reference-free vs. reference-based captioning metrics that arises from classifier-free guidance, and substantially improving the quality of captions generated from a model trained only on minimally curated web data.
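The decoding rule behind classifier-free guidance takes the usual form, with guidance scale $\gamma$ (standard CFG formulation, consistent with the trade-off described above):

\[
\log p_{\gamma}(\mathrm{caption} \mid \mathrm{image}) \;\propto\;
\log p(\mathrm{caption}) \;+\; \gamma\,\bigl[\log p(\mathrm{caption} \mid \mathrm{image}) - \log p(\mathrm{caption})\bigr].
\]

Setting $\gamma = 1$ recovers standard conditional decoding, while $\gamma > 1$ (such as the $\gamma = 2$ used above) upweights the image-conditional evidence, since $\log p(\mathrm{caption} \mid \mathrm{image}) - \log p(\mathrm{caption}) = \log p(\mathrm{image} \mid \mathrm{caption}) - \log p(\mathrm{image})$.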
On the Trustworthiness Landscape of State-of-the-art Generative Models: A Comprehensive Survey
paper_authors: Mingyuan Fan, Cen Chen, Chengyu Wang, Jun Huang
for: This paper investigates the trustworthiness of large-scale generative models, specifically addressing privacy, security, fairness, and responsibility concerns.
methods: The authors use a comprehensive approach, mapping out the trustworthiness of these models across four fundamental dimensions and providing practical recommendations.
results: The paper provides an extensive map outlining the trustworthiness of large-scale generative models and identifies future directions for promoting their trustworthy deployment.
Diffusion models and large language models have emerged as leading-edge generative models and have sparked a revolutionary impact on various aspects of human life. However, the practical implementation of these models has also exposed inherent risks, highlighting their dual nature and raising concerns regarding their trustworthiness. Despite the abundance of literature on this subject, a comprehensive survey specifically delving into the intersection of large-scale generative models and their trustworthiness remains largely absent. To bridge this gap, This paper investigates both the long-standing and emerging threats associated with these models across four fundamental dimensions: privacy, security, fairness, and responsibility. In this way, we construct an extensive map outlining the trustworthiness of these models, while also providing practical recommendations and identifying future directions. These efforts are crucial for promoting the trustworthy deployment of these models, ultimately benefiting society as a whole.
Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech
paper_authors: Guangyan Zhang, Thomas Merritt, Manuel Sam Ribeiro, Biel Tura-Vecino, Kayoko Yanagisawa, Kamil Pokora, Abdelhamid Ezzerg, Sebastian Cygert, Ammar Abbas, Piotr Bilinski, Roberto Barra-Chicote, Daniel Korzekwa, Jaime Lorenzo-Trueba
results: Experimental results show that the flow-based model achieves the best performance for mel-spectrogram prediction, improving over equivalent diffusion and L1 models. In addition, both flow- and diffusion-based prosody predictors yield significant improvements over a conventional L2-trained prosody model.
Abstract
Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distributions of the target data space. Aiming to improve those assumptions, Normalizing Flows and Diffusion Probabilistic Models were recently proposed as alternatives. In this paper, we compare traditional L1/L2-based approaches to diffusion and flow-based approaches for the tasks of prosody and mel-spectrogram prediction for text-to-speech synthesis. We use a prosody model to generate log-f0 and duration features, which are used to condition an acoustic model that generates mel-spectrograms. Experimental results demonstrate that the flow-based model achieves the best performance for spectrogram prediction, improving over equivalent diffusion and L1 models. Meanwhile, both diffusion and flow-based prosody predictors result in significant improvements over a typical L2-trained prosody models.
End-to-End Reinforcement Learning for Torque Based Variable Height Hopping
results: The experiments show that the RL approach successfully learns an end-to-end torque controller that can be transferred to the real robot and run without parameter tuning.
Abstract
Legged locomotion is arguably the most suited and versatile mode to deal with natural or unstructured terrains. Intensive research into dynamic walking and running controllers has recently yielded great advances, both in the optimal control and reinforcement learning (RL) literature. Hopping is a challenging dynamic task involving a flight phase and has the potential to increase the traversability of legged robots. Model based control for hopping typically relies on accurate detection of different jump phases, such as lift-off or touch down, and using different controllers for each phase. In this paper, we present a end-to-end RL based torque controller that learns to implicitly detect the relevant jump phases, removing the need to provide manual heuristics for state detection. We also extend a method for simulation to reality transfer of the learned controller to contact rich dynamic tasks, resulting in successful deployment on the robot after training without parameter tuning.
results: Through quantitative and qualitative evaluations, the researchers find that the generated data are realistic and faithful to genuine samples, meeting the data needs of health research.
Abstract
Data scarcity is a common obstacle in medical research due to the high costs associated with data collection and the complexity of gaining access to and utilizing data. Synthesizing health data may provide an efficient and cost-effective solution to this shortage, enabling researchers to explore distributions and populations that are not represented in existing observations or difficult to access due to privacy considerations. To that end, we have developed a multi-task self-attention model that produces realistic wearable activity data. We examine the characteristics of the generated data and quantify its similarity to genuine samples with both quantitative and qualitative approaches.
Graph Structure from Point Clouds: Geometric Attention is All You Need
for: This work targets the problem of top jet tagging in high energy physics, using graph neural networks to improve the accuracy and efficiency of the task.
methods: This work proposes an attention mechanism, named GravNetNorm, which learns a graph structure in a high-dimensional space to handle the flow of relevance and solve the Topology Problem.
results: Experimental results show that GravNetNorm is competitive with comparable models in tagging accuracy while using far fewer computational resources.
Abstract
The use of graph neural networks has produced significant advances in point cloud problems, such as those found in high energy physics. The question of how to produce a graph structure in these problems is usually treated as a matter of heuristics, employing fully connected graphs or K-nearest neighbors. In this work, we elevate this question to utmost importance as the Topology Problem. We propose an attention mechanism that allows a graph to be constructed in a learned space that handles geometrically the flow of relevance, providing one solution to the Topology Problem. We test this architecture, called GravNetNorm, on the task of top jet tagging, and show that it is competitive in tagging accuracy, and uses far fewer computational resources than all other comparable models.
Proactive Resource Request for Disaster Response: A Deep Learning-based Optimization Model
results: The experimental results show that the proposed method outperforms existing methods, including in a multi-stakeholder and multi-objective setting.
Abstract
Disaster response is critical to save lives and reduce damages in the aftermath of a disaster. Fundamental to disaster response operations is the management of disaster relief resources. To this end, a local agency (e.g., a local emergency resource distribution center) collects demands from local communities affected by a disaster, dispatches available resources to meet the demands, and requests more resources from a central emergency management agency (e.g., Federal Emergency Management Agency in the U.S.). Prior resource management research for disaster response overlooks the problem of deciding optimal quantities of resources requested by a local agency. In response to this research gap, we define a new resource management problem that proactively decides optimal quantities of requested resources by considering both currently unfulfilled demands and future demands. To solve the problem, we take salient characteristics of the problem into consideration and develop a novel deep learning method for future demand prediction. We then formulate the problem as a stochastic optimization model, analyze key properties of the model, and propose an effective solution method to the problem based on the analyzed properties. We demonstrate the superior performance of our method over prevalent existing methods using both real world and simulated data. We also show its superiority over prevalent existing methods in a multi-stakeholder and multi-objective setting through simulations.
Sequential and Shared-Memory Parallel Algorithms for Partitioned Local Depths
results: The authors introduce performance optimization strategies that yield sequential speedups of up to $29\times$ over a baseline sequential implementation and parallel speedups of up to $19.4\times$ when using multiple threads.
Abstract
In this work, we design, analyze, and optimize sequential and shared-memory parallel algorithms for partitioned local depths (PaLD). Given a set of data points and pairwise distances, PaLD is a method for identifying strength of pairwise relationships based on relative distances, enabling the identification of strong ties within dense and sparse communities even if their sizes and within-community absolute distances vary greatly. We design two algorithmic variants that perform community structure analysis through triplet comparisons of pairwise distances. We present theoretical analyses of computation and communication costs and prove that the sequential algorithms are communication optimal, up to constant factors. We introduce performance optimization strategies that yield sequential speedups of up to $29\times$ over a baseline sequential implementation and parallel speedups of up to $19.4\times$ over optimized sequential implementations using up to $32$ threads on an Intel multicore CPU.
UDAMA: Unsupervised Domain Adaptation through Multi-discriminator Adversarial Training with Noisy Labels Improves Cardio-fitness Prediction
paper_authors: Yu Wu, Dimitris Spathis, Hong Jia, Ignacio Perez-Pozuelo, Tomas Gonzales, Soren Brage, Nicholas Wareham, Cecilia Mascolo
for: This study aims to improve the performance of deep learning models in health monitoring applications by exploiting imprecisely labelled data.
methods: The study combines two key components, Unsupervised Domain Adaptation and Multi-discriminator Adversarial Training: it pre-trains on the silver-standard data and then performs adversarial adaptation with the gold-standard data.
results: The results show that UDAMA adapts to shifted label distributions and outperforms competitive transfer learning and state-of-the-art domain adaptation models on two free-living cohort studies.
Deep learning models have shown great promise in various healthcare monitoring applications. However, most healthcare datasets with high-quality (gold-standard) labels are small-scale, as directly collecting ground truth is often costly and time-consuming. As a result, models developed and validated on small-scale datasets often suffer from overfitting and do not generalize well to unseen scenarios. At the same time, large amounts of imprecise (silver-standard) labeled data, annotated by approximate methods with the help of modern wearables and in the absence of ground truth validation, are starting to emerge. However, due to measurement differences, this data displays significant label distribution shifts, which motivates the use of domain adaptation. To this end, we introduce UDAMA, a method with two key components: Unsupervised Domain Adaptation and Multidiscriminator Adversarial Training, where we pre-train on the silver-standard data and employ adversarial adaptation with the gold-standard data along with two domain discriminators. In particular, we showcase the practical potential of UDAMA by applying it to Cardio-respiratory fitness (CRF) prediction. CRF is a crucial determinant of metabolic disease and mortality, and it presents labels with various levels of noise (gold- and silver-standard), making it challenging to establish an accurate prediction model. Our results show promising performance by alleviating distribution shifts in various label shift settings. Additionally, by using data from two free-living cohort studies (Fenland and BBVS), we show that UDAMA consistently outperforms competitive transfer learning and state-of-the-art domain adaptation models by up to 12%, paving the way for leveraging noisy labeled data to improve fitness estimation at scale.
LLMs4OL: Large Language Models for Ontology Learning
paper_authors: Hamed Babaei Giglou, Jennifer D’Souza, Sören Auer
for: This paper explores whether Large Language Models (LLMs) can effectively apply their language pattern capturing capability to Ontology Learning (OL), and evaluates the performance of nine different LLM model families on three main OL tasks.
methods: The paper evaluates using the zero-shot prompting method, covering diverse genres of ontological knowledge, including lexicosemantic knowledge in WordNet, geographical knowledge in GeoNames, and medical knowledge in UMLS.
results: The evaluation results indicate that LLMs can apply their language pattern capturing capability to support OL tasks, with performance varying across LLM model families and OL tasks.
Abstract
We propose the LLMs4OL approach, which utilizes Large Language Models (LLMs) for Ontology Learning (OL). LLMs have shown significant advancements in natural language processing, demonstrating their ability to capture complex language patterns in different knowledge domains. Our LLMs4OL paradigm investigates the following hypothesis: \textit{Can LLMs effectively apply their language pattern capturing capability to OL, which involves automatically extracting and structuring knowledge from natural language text?} To test this hypothesis, we conduct a comprehensive evaluation using the zero-shot prompting method. We evaluate nine different LLM model families for three main OL tasks: term typing, taxonomy discovery, and extraction of non-taxonomic relations. Additionally, the evaluations encompass diverse genres of ontological knowledge, including lexicosemantic knowledge in WordNet, geographical knowledge in GeoNames, and medical knowledge in UMLS.
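Purely as an illustration of zero-shot prompting for the three OL tasks, the templates below are plausible stand-ins; the paper's actual prompt wording is not reproduced here.

```python
# Illustrative zero-shot prompt templates for term typing, taxonomy
# discovery, and non-taxonomic relation extraction. The wording is an
# assumption, not the prompts used in the paper.

def term_typing_prompt(term):
    return f"{term} is a kind of [MASK]."

def taxonomy_prompt(child, parent):
    return f"True or false: every {child} is a {parent}."

def relation_prompt(head, relation, tail):
    return f"True or false: {head} {relation} {tail}."

print(term_typing_prompt("aspirin"))   # -> "aspirin is a kind of [MASK]."
print(taxonomy_prompt("oak", "tree"))  # -> "True or false: every oak is a tree."
```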
Text-CRS: A Generalized Certified Robustness Framework against Textual Adversarial Attacks
results: We conduct substantial experiments on multiple language models and datasets. Text-CRS can address all four different word-level adversarial operations and achieves a significant accuracy improvement. In addition, we provide the first benchmark on certified accuracy and radius for the four word-level operations, besides outperforming existing certification against synonym substitution attacks.
Abstract
The language models, especially the basic text classification models, have been shown to be susceptible to textual adversarial attacks such as synonym substitution and word insertion attacks. To defend against such attacks, a growing body of research has been devoted to improving the model robustness. However, providing provable robustness guarantees instead of empirical robustness is still widely unexplored. In this paper, we propose Text-CRS, a generalized certified robustness framework for natural language processing (NLP) based on randomized smoothing. To our best knowledge, existing certified schemes for NLP can only certify the robustness against $\ell_0$ perturbations in synonym substitution attacks. Representing each word-level adversarial operation (i.e., synonym substitution, word reordering, insertion, and deletion) as a combination of permutation and embedding transformation, we propose novel smoothing theorems to derive robustness bounds in both permutation and embedding space against such adversarial operations. To further improve certified accuracy and radius, we consider the numerical relationships between discrete words and select proper noise distributions for the randomized smoothing. Finally, we conduct substantial experiments on multiple language models and datasets. Text-CRS can address all four different word-level adversarial operations and achieve a significant accuracy improvement. We also provide the first benchmark on certified accuracy and radius of four word-level operations, besides outperforming the state-of-the-art certification against synonym substitution attacks.
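For background, the Gaussian randomized smoothing that such certificates build on defines a smoothed classifier $g(x) = \arg\max_c \mathbb{P}\bigl(f(x + \varepsilon) = c\bigr)$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 \mathbf{I})$ and certifies the radius (the classical result of Cohen et al., stated here only for context):

\[
R \;=\; \frac{\sigma}{2}\Bigl(\Phi^{-1}(\underline{p_A}) - \Phi^{-1}(\overline{p_B})\Bigr),
\]

where $\underline{p_A}$ and $\overline{p_B}$ bound the probabilities of the top two classes under noise and $\Phi^{-1}$ is the standard Gaussian quantile function; Text-CRS derives analogous bounds in permutation and embedding space.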
results: Experiments show that CBO-MW performs strongly on synthetic environments and on environments based on real-world data, with a practical application to learning user demand patterns and repositioning vehicles in a shared mobility system.
Abstract
In Causal Bayesian Optimization (CBO), an agent intervenes on an unknown structural causal model to maximize a downstream reward variable. In this paper, we consider the generalization where other agents or external events also intervene on the system, which is key for enabling adaptiveness to non-stationarities such as weather changes, market forces, or adversaries. We formalize this generalization of CBO as Adversarial Causal Bayesian Optimization (ACBO) and introduce the first algorithm for ACBO with bounded regret: Causal Bayesian Optimization with Multiplicative Weights (CBO-MW). Our approach combines a classical online learning strategy with causal modeling of the rewards. To achieve this, it computes optimistic counterfactual reward estimates by propagating uncertainty through the causal graph. We derive regret bounds for CBO-MW that naturally depend on graph-related quantities. We further propose a scalable implementation for the case of combinatorial interventions and submodular rewards. Empirically, CBO-MW outperforms non-causal and non-adversarial Bayesian optimization methods on synthetic environments and environments based on real-word data. Our experiments include a realistic demonstration of how CBO-MW can be used to learn users' demand patterns in a shared mobility system and reposition vehicles in strategic areas.
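The multiplicative-weights component follows the standard Hedge-style update (generic form in our notation; the paper's instantiation over interventions may differ):

\[
w_{t+1}(a) \;\propto\; w_t(a)\,\exp\bigl(\eta\,\hat r_t(a)\bigr),
\qquad
\pi_t(a) \;=\; \frac{w_t(a)}{\sum_{a'} w_t(a')},
\]

where $\hat r_t(a)$ is the (optimistic counterfactual) reward estimate for action $a$ and $\eta$ is the learning rate.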
Detecting diabetic retinopathy severity through fundus images using an ensemble of classifiers
paper_authors: Eduard Popescu, Adrian Groza, Ioana Damian
for: This paper aims to detect diabetic retinopathy severity levels using fundus images.
methods: The proposed method includes data preprocessing, image segmentation, and feature extraction, followed by an ensemble of classifiers.
results: The authors assess the trust in the system and evaluate the performance of the proposed method.
Diabetic retinopathy is an ocular condition that affects individuals with diabetes mellitus. It is a common complication of diabetes that can impact the eyes and lead to vision loss. One method for diagnosing diabetic retinopathy is the examination of the fundus of the eye. An ophthalmologist examines the back part of the eye, including the retina, optic nerve, and the blood vessels that supply the retina. In the case of diabetic retinopathy, the blood vessels in the retina deteriorate and can lead to bleeding, swelling, and other changes that affect vision. We proposed a method for detecting diabetic retinopathy severity levels. First, a set of data-preprocessing steps is applied to the available data: adaptive equalisation, color normalisation, Gaussian filtering, and removal of the optic disc and blood vessels. Second, we perform image segmentation for relevant markers and extract features from the fundus images. Third, we apply an ensemble of classifiers and we assess the trust in the system.
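A minimal sketch of the first preprocessing stage, assuming OpenCV: contrast-limited adaptive histogram equalisation on the green channel followed by Gaussian filtering. Optic-disc and vessel removal are omitted, since they require segmentation beyond this sketch.

```python
# Minimal fundus preprocessing sketch: CLAHE on the green channel (where
# retinal vessels show the best contrast) followed by Gaussian smoothing.
import cv2

def preprocess_fundus(path):
    image = cv2.imread(path)                     # BGR fundus photograph
    green = image[:, :, 1]                       # green channel
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    equalised = clahe.apply(green)               # adaptive equalisation
    smoothed = cv2.GaussianBlur(equalised, (5, 5), 0)  # noise suppression
    return smoothed
```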
LaplaceConfidence: a Graph-based Approach for Learning with Noisy Labels
paper_authors: Mingcai Chen, Yuntao Du, Wei Tang, Baoming Zhang, Hao Cheng, Shuwei Qian, Chongjun Wang
for: This paper proposes a Laplacian-energy-based method that obtains high-quality label confidence (i.e., clean probabilities) on noisily labelled datasets.
methods: The method first constructs a graph over all noisy samples based on their feature representations, then minimizes the Laplacian energy to produce a low-energy graph; clean labels should fit well into the low-energy graph, while noisy labels should not.
results: Experiments show that LaplaceConfidence outperforms state-of-the-art methods on benchmark datasets under both synthetic and real-world noise.
In real-world applications, perfect labels are rarely available, making it challenging to develop robust machine learning algorithms that can handle noisy labels. Recent methods have focused on filtering noise based on the discrepancy between model predictions and given noisy labels, assuming that samples with small classification losses are clean. This work takes a different approach by leveraging the consistency between the learned model and the entire noisy dataset using the rich representational and topological information in the data. We introduce LaplaceConfidence, a method that to obtain label confidence (i.e., clean probabilities) utilizing the Laplacian energy. Specifically, it first constructs graphs based on the feature representations of all noisy samples and minimizes the Laplacian energy to produce a low-energy graph. Clean labels should fit well into the low-energy graph while noisy ones should not, allowing our method to determine data's clean probabilities. Furthermore, LaplaceConfidence is embedded into a holistic method for robust training, where co-training technique generates unbiased label confidence and label refurbishment technique better utilizes it. We also explore the dimensionality reduction technique to accommodate our method on large-scale noisy datasets. Our experiments demonstrate that LaplaceConfidence outperforms state-of-the-art methods on benchmark datasets under both synthetic and real-world noise.
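In the usual graph notation (assumed here), with edge weights $w_{ij}$, graph Laplacian $L = D - W$, and label matrix $Y$, the Laplacian energy being minimized is

\[
E(Y) \;=\; \tfrac{1}{2}\sum_{i,j} w_{ij}\,\lVert y_i - y_j \rVert^2 \;=\; \operatorname{tr}\bigl(Y^{\top} L\, Y\bigr),
\]

so labels that vary smoothly over the feature graph have low energy, which is what makes the energy a proxy for label cleanliness.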
Noisy Self-Training with Data Augmentations for Offensive and Hate Speech Detection Tasks
paper_authors: João A. Leite, Carolina Scarton, Diego F. Silva
for: automatic detection of offensive and hateful comments online
methods: self-training and noisy self-training with textual data augmentations
results: consistent improvement in performance regardless of model size, but noisy self-training decreases performance on offensive and hate-speech domains compared to the default method
Abstract
Online social media is rife with offensive and hateful comments, prompting the need for their automatic detection given the sheer amount of posts created every second. Creating high-quality human-labelled datasets for this task is difficult and costly, especially because non-offensive posts are significantly more frequent than offensive ones. However, unlabelled data is abundant, easier, and cheaper to obtain. In this scenario, self-training methods, using weakly-labelled examples to increase the amount of training data, can be employed. Recent "noisy" self-training approaches incorporate data augmentation techniques to ensure prediction consistency and increase robustness against noisy data and adversarial attacks. In this paper, we experiment with default and noisy self-training using three different textual data augmentation techniques across five different pre-trained BERT architectures varying in size. We evaluate our experiments on two offensive/hate-speech datasets and demonstrate that (i) self-training consistently improves performance regardless of model size, resulting in up to +1.5% F1-macro on both datasets, and (ii) noisy self-training with textual data augmentations, despite being successfully applied in similar settings, decreases performance on offensive and hate-speech domains when compared to the default method, even with state-of-the-art augmentations such as backtranslation.
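A compact sketch of one noisy self-training round is given below; `augment` stands in for backtranslation or another textual augmentation, and all function names are illustrative rather than any library's API.

```python
# One round of noisy self-training, schematically: train a teacher on gold
# labels, pseudo-label the unlabelled pool, inject noise via augmentation,
# and retrain a student on the union. All callables here are placeholders.

def self_training_round(train_fn, predict_fn, augment, labelled, unlabelled):
    """labelled: list of (text, label) pairs; unlabelled: list of raw texts."""
    teacher = train_fn(labelled)                      # teacher on gold labels
    pseudo = [(augment(text), predict_fn(teacher, text))
              for text in unlabelled]                 # noisy weakly-labelled pairs
    return train_fn(labelled + pseudo)                # student on the union
```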
NLLG Quarterly arXiv Report 06/23: What are the most influential current AI Papers?
paper_authors: Steffen Eger, Christoph Leiter, Jonas Belouadi, Ran Zhang, Aida Kostikova, Daniil Larionov, Yanran Chen, Vivian Fresen
for: The main goal of this report is to identify the 40 most popular papers on arXiv, with a specific emphasis on natural language processing (NLP) and machine learning (ML).
methods: The report uses normalized citation counts to identify the most popular papers and analyses their topics and characteristics.
results: The study finds that papers on LLMs, and ChatGPT in particular, dominated the most popular papers in the first half of 2023, although ChatGPT's popularity has shown signs of decline in recent months. Moreover, NLP-related papers account for around 60% of the influential papers, while ML-related papers account for about 40%. The most heavily cited papers centre on LLM efficiency, evaluation techniques, ethical considerations, embodied agents, and problem-solving with LLMs.
Abstract
The rapid growth of information in the field of Generative Artificial Intelligence (AI), particularly in the subfields of Natural Language Processing (NLP) and Machine Learning (ML), presents a significant challenge for researchers and practitioners to keep pace with the latest developments. To address the problem of information overload, this report by the Natural Language Learning Group at Bielefeld University focuses on identifying the most popular papers on arXiv, with a specific emphasis on NLP and ML. The objective is to offer a quick guide to the most relevant and widely discussed research, aiding both newcomers and established researchers in staying abreast of current trends. In particular, we compile a list of the 40 most popular papers based on normalized citation counts from the first half of 2023. We observe the dominance of papers related to Large Language Models (LLMs) and specifically ChatGPT during the first half of 2023, with the latter showing signs of declining popularity more recently, however. Further, NLP related papers are the most influential (around 60\% of top papers) even though there are twice as many ML related papers in our data. Core issues investigated in the most heavily cited papers are: LLM efficiency, evaluation techniques, ethical considerations, embodied agents, and problem-solving with LLMs. Additionally, we examine the characteristics of top papers in comparison to others outside the top-40 list (noticing the top paper's focus on LLM related issues and higher number of co-authors) and analyze the citation distributions in our dataset, among others.
Audio-visual video-to-speech synthesis with synthesized input audio
results: Experimental results show that this approach is successful with both raw waveforms and mel spectrograms as target outputs.
Abstract
Video-to-speech synthesis involves reconstructing the speech signal of a speaker from a silent video. The implicit assumption of this task is that the sound signal is either missing or contains a high amount of noise/corruption such that it is not useful for processing. Previous works in the literature either use video inputs only or employ both video and audio inputs during training, and discard the input audio pathway during inference. In this work we investigate the effect of using video and audio inputs for video-to-speech synthesis during both training and inference. In particular, we use pre-trained video-to-speech models to synthesize the missing speech signals and then train an audio-visual-to-speech synthesis model, using both the silent video and the synthesized speech as inputs, to predict the final reconstructed speech. Our experiments demonstrate that this approach is successful with both raw waveforms and mel spectrograms as target outputs.
A multiscale and multicriteria Generative Adversarial Network to synthesize 1-dimensional turbulent fields
paper_authors: Carlos Granero-Belinchon, Manuel Cabeza Gallucci
for: This paper introduces a new neural network stochastic model to generate a 1-dimensional stochastic field with turbulent velocity statistics, with the goal of accurately capturing the energy distribution, energy cascade, and intermittency across scales in turbulence.
methods: The model used in this paper is a Generative Adversarial Network (GAN) with multiple multiscale optimization criteria, including physics-based criteria based on the Kolmogorov and Obukhov statistical theories of fully developed turbulence. The model is fully convolutional with varying kernel sizes, and is trained using turbulent velocity signals from grid turbulence at Modane wind tunnel.
results: The paper reports that the proposed model is able to accurately capture the energy distribution, energy cascade, and intermittency across scales in turbulence, as demonstrated through experiments using turbulent velocity signals from the Modane wind tunnel.
Abstract
This article introduces a new Neural Network stochastic model to generate a 1-dimensional stochastic field with turbulent velocity statistics. Both the model architecture and training procedure are grounded in the Kolmogorov and Obukhov statistical theories of fully developed turbulence, guaranteeing descriptions of 1) energy distribution, 2) energy cascade and 3) intermittency across scales in agreement with experimental observations. The model is a Generative Adversarial Network with multiple multiscale optimization criteria. First, we use three physics-based criteria: the variance, skewness and flatness of the increments of the generated field, which respectively retrieve the turbulent energy distribution, energy cascade and intermittency across scales. Second, the Generative Adversarial Network criterion, based on reproducing statistical distributions, is used on segments of different lengths of the generated field. Furthermore, to mimic the multiscale decompositions frequently used in turbulence studies, the model architecture is fully convolutional with kernel sizes varying along the multiple layers of the model. To train our model we use turbulent velocity signals from grid turbulence at the Modane wind tunnel.
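To make the three physics-based criteria concrete, here is a minimal NumPy sketch of the increment statistics the abstract names (variance, skewness, flatness of the field increments across scales) and a toy loss against reference values. The function names, scale choices, and squared-error weighting are illustrative assumptions, not the paper's exact training objective, and the GAN criterion is not shown.

```python
import numpy as np

def increment_moments(field, scales):
    """Variance, skewness and flatness of increments u(x+l) - u(x) at each
    scale l: the three physics-based criteria named in the abstract."""
    stats = {}
    for l in scales:
        d = field[l:] - field[:-l]          # increments at scale l
        var = d.var()
        skew = np.mean(d**3) / var**1.5     # standardized 3rd moment
        flat = np.mean(d**4) / var**2       # standardized 4th moment (flatness)
        stats[l] = (var, skew, flat)
    return stats

def physics_loss(gen_field, ref_stats, scales):
    """Toy penalty: squared deviation of generated increment statistics from
    reference (e.g. experimental) values, summed over scales."""
    gen_stats = increment_moments(gen_field, scales)
    return sum(
        sum((g - r) ** 2 for g, r in zip(gen_stats[l], ref_stats[l]))
        for l in scales
    )
```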
The Decimation Scheme for Symmetric Matrix Factorization
results: The scheme is extended to and analyzed for two families of matrices. For a large class of compactly supported priors, the replica symmetric free entropy of the associated neural network models is shown to take a universal form in the low-temperature limit. For a sparse Ising prior, the storage capacity of the models diverges as pattern sparsity increases. A simple ground-state-search algorithm is also introduced that implements decimation and performs matrix factorization without needing an informative initialization.
Abstract
Matrix factorization is an inference problem that has acquired importance due to its vast range of applications that go from dictionary learning to recommendation systems and machine learning with deep networks. The study of its fundamental statistical limits represents a true challenge, and despite a decade-long history of efforts in the community, there is still no closed formula able to describe its optimal performances in the case where the rank of the matrix scales linearly with its size. In the present paper, we study this extensive rank problem, extending the alternative 'decimation' procedure that we recently introduced, and carry out a thorough study of its performance. Decimation aims at recovering one column/line of the factors at a time, by mapping the problem into a sequence of neural network models of associative memory at a tunable temperature. Though being sub-optimal, decimation has the advantage of being theoretically analyzable. We extend its scope and analysis to two families of matrices. For a large class of compactly supported priors, we show that the replica symmetric free entropy of the neural network models takes a universal form in the low temperature limit. For sparse Ising prior, we show that the storage capacity of the neural network models diverges as sparsity in the patterns increases, and we introduce a simple algorithm based on a ground state search that implements decimation and performs matrix factorization, with no need of an informative initialization.
results: Experiments show that the proposed algorithms are always faster than their quasiconvex counterparts, often by more than a factor of 2. The method also yields a quasi-exact line search, built on $\Delta$-Secant, that can replace backtracking line search in gradient descent.
Abstract
Golden-section search and bisection search are the two main principled algorithms for 1d minimization of quasiconvex (unimodal) functions. The first one only uses function queries, while the second one also uses gradient queries. Other algorithms exist under much stronger assumptions, such as Newton's method. However, to the best of our knowledge, there is no principled exact line search algorithm for general convex functions -- including piecewise-linear and max-compositions of convex functions -- that takes advantage of convexity. We propose two such algorithms: $\Delta$-Bisection is a variant of bisection search that uses (sub)gradient information and convexity to speed up convergence, while $\Delta$-Secant is a variant of golden-section search and uses only function queries. While bisection search reduces the $x$ interval by a factor 2 at every iteration, $\Delta$-Bisection reduces the (sometimes much) smaller $x^*$-gap $\Delta^x$ (the $x$ coordinates of $\Delta$) by at least a factor 2 at every iteration. Similarly, $\Delta$-Secant also reduces the $x^*$-gap by at least a factor 2 every second function query. Moreover, the $y^*$-gap $\Delta^y$ (the $y$ coordinates of $\Delta$) also provides a refined stopping criterion, which can also be used with other algorithms. Experiments on a few convex functions confirm that our algorithms are always faster than their quasiconvex counterparts, often by more than a factor 2. We further design a quasi-exact line search algorithm based on $\Delta$-Secant. It can be used with gradient descent as a replacement for backtracking line search, for which some parameters can be finicky to tune -- and we provide examples to this effect, on strongly-convex and smooth functions. We provide convergence guarantees, and confirm the efficiency of quasi-exact line search on a few single- and multivariate convex functions.
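For orientation, here is a minimal sketch of the plain gradient-sign bisection baseline the paper improves upon. This is not the paper's $\Delta$-Bisection, which additionally uses convexity to shrink the typically much smaller $x^*$-gap rather than the full interval; the function and tolerances below are illustrative.

```python
def bisection_minimize(grad, lo, hi, tol=1e-10, max_iter=200):
    """Plain bisection on the sign of the (sub)gradient of a convex 1-d
    function: halves the x interval each iteration. Delta-Bisection (the
    paper's variant) instead shrinks the x*-gap using convexity."""
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        g = grad(mid)
        if g > 0:
            hi = mid        # minimizer lies to the left
        elif g < 0:
            lo = mid        # minimizer lies to the right
        else:
            return mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Example: f(x) = (x - 3)**2 with f'(x) = 2*(x - 3)
print(bisection_minimize(lambda x: 2 * (x - 3), -10.0, 10.0))  # ~3.0
```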
Simultaneous column-based deep learning progression analysis of atrophy associated with AMD in longitudinal OCT studies
paper_authors: Adi Szeskin, Roei Yehuda, Or Shmueli, Jaime Levy, Leo Joskowicz
For: The paper aims to develop a fully automatic end-to-end pipeline for detecting and quantifying retinal atrophy changes associated with dry AMD in pairs of OCT scans.
Methods: The proposed method uses a novel simultaneous multi-channel column-based deep learning model that concurrently detects and segments retinal atrophy segments in consecutive OCT scans by classifying light scattering patterns in matched pairs of vertical pixel-wide columns (A-scans) in registered prior and current OCT slices (B-scans).
Results: The experimental results on a dataset of 4,040 OCT slices with 5.2M columns from 40 scan pairs of 18 patients show a mean atrophy segments detection precision of 0.90+-0.09 and a recall of 0.95+-0.06, outperforming standalone classification methods by 30+-62% and 27+-0% for atrophy segments and lesions, respectively.
Abstract
Purpose: Disease progression of retinal atrophy associated with AMD requires the accurate quantification of the retinal atrophy changes on longitudinal OCT studies. It is based on finding, comparing, and delineating subtle atrophy changes on consecutive pairs (prior and current) of unregistered OCT scans. Methods: We present a fully automatic end-to-end pipeline for the simultaneous detection and quantification of time-related atrophy changes associated with dry AMD in pairs of OCT scans of a patient. It uses a novel simultaneous multi-channel column-based deep learning model trained on registered pairs of OCT scans that concurrently detects and segments retinal atrophy segments in consecutive OCT scans by classifying light scattering patterns in matched pairs of vertical pixel-wide columns (A-scans) in registered prior and current OCT slices (B-scans). Results: Experimental results on 4,040 OCT slices with 5.2M columns from 40 scan pairs of 18 patients (66% training/validation, 33% testing), 24.13+-14.0 months apart, in which Complete RPE and Outer Retinal Atrophy (cRORA) was identified in 1,998 OCT slices (735 atrophy lesions from 3,732 segments, 0.45M columns), yield a mean atrophy segment detection precision and recall of 0.90+-0.09 and 0.95+-0.06 (0.74+-0.18 and 0.94+-0.12 for atrophy lesions, AUC=0.897), all above observer variability. Simultaneous classification outperforms standalone classification precision and recall by 30+-62% and 27+-0% for atrophy segments and lesions. Conclusions: Simultaneous column-based detection and quantification of retinal atrophy changes associated with AMD is accurate and outperforms standalone classification methods. Translational relevance: an automatic and efficient way to detect and quantify retinal atrophy changes associated with AMD.
Deep Learning and Computer Vision for Glaucoma Detection: A Review
results: Through rigorous benchmarking on widely used public datasets, the survey reveals performance gaps in generalizability, uncertainty estimation, and multimodal integration. It also curates key datasets and highlights their limitations, such as scale, labeling inconsistencies, and bias.
Abstract
Glaucoma is the leading cause of irreversible blindness worldwide and poses significant diagnostic challenges due to its reliance on subjective evaluation. However, recent advances in computer vision and deep learning have demonstrated the potential for automated assessment. In this paper, we survey recent studies on AI-based glaucoma diagnosis using fundus, optical coherence tomography, and visual field images, with a particular emphasis on deep learning-based methods. We provide an updated taxonomy that organizes methods into architectural paradigms and includes links to available source code to enhance the reproducibility of the methods. Through rigorous benchmarking on widely-used public datasets, we reveal performance gaps in generalizability, uncertainty estimation, and multimodal integration. Additionally, our survey curates key datasets while highlighting limitations such as scale, labeling inconsistencies, and bias. We outline open research challenges and detail promising directions for future studies. This survey is expected to be useful for both AI researchers seeking to translate advances into practice and ophthalmologists aiming to improve clinical workflows and diagnosis using the latest AI outcomes.
No Fair Lunch: A Causal Perspective on Dataset Bias in Machine Learning for Medical Imaging
paper_authors: Charles Jones, Daniel C. Castro, Fabio De Sousa Ribeiro, Ozan Oktay, Melissa McCradden, Ben Glocker
for: This paper is written for those who are concerned about fairness in clinical decision-making, particularly in the context of machine learning methods.
methods: The paper uses a causal perspective to identify and analyze different sources of bias in datasets, and introduces a three-step framework for reasoning about fairness in medical imaging.
results: The paper highlights the limitations of current mitigation methods for algorithmic bias, and provides a practical framework for developing safe and equitable AI prediction models.Abstract
As machine learning methods gain prominence within clinical decision-making, addressing fairness concerns becomes increasingly urgent. Despite considerable work dedicated to detecting and ameliorating algorithmic bias, today's methods are deficient with potentially harmful consequences. Our causal perspective sheds new light on algorithmic bias, highlighting how different sources of dataset bias may appear indistinguishable yet require substantially different mitigation strategies. We introduce three families of causal bias mechanisms stemming from disparities in prevalence, presentation, and annotation. Our causal analysis underscores how current mitigation methods tackle only a narrow and often unrealistic subset of scenarios. We provide a practical three-step framework for reasoning about fairness in medical imaging, supporting the development of safe and equitable AI prediction models.
Deception Abilities Emerged in Large Language Models
results: The study finds that state-of-the-art LLMs possess a conceptual understanding of deception strategies, that their performance in complex deception scenarios can be amplified with chain-of-thought reasoning, and that eliciting Machiavellianism in LLMs can alter their propensity to deceive.
Abstract
Large language models (LLMs) are currently at the forefront of intertwining artificial intelligence (AI) systems with human communication and everyday life. Thus, aligning them with human values is of great importance. However, given the steady increase in reasoning abilities, future LLMs are under suspicion of becoming able to deceive human operators and utilizing this ability to bypass monitoring efforts. As a prerequisite to this, LLMs need to possess a conceptual understanding of deception strategies. This study reveals that such strategies emerged in state-of-the-art LLMs, such as GPT-4, but were non-existent in earlier LLMs. We conduct a series of experiments showing that state-of-the-art LLMs are able to understand and induce false beliefs in other agents, that their performance in complex deception scenarios can be amplified utilizing chain-of-thought reasoning, and that eliciting Machiavellianism in LLMs can alter their propensity to deceive. In sum, revealing hitherto unknown machine behavior in LLMs, our study contributes to the nascent field of machine psychology.
Classifying multilingual party manifestos: Domain transfer across country, time, and genre
results: Both BERT and DistilBERT perform well across the studied domain-transfer settings, with DistilBERT competitive at a lower computational cost. The analysis also reveals (partly) notable differences between the political manifestos of different countries of origin, even when those countries share a language or a cultural background.
Abstract
Annotation costs for large corpora are still one of the main bottlenecks in empirical social science research. On the one hand, making use of the capabilities of domain transfer allows re-using annotated data sets and trained models. On the other hand, it is not clear how well domain transfer works and how reliable the results are for transfer across different dimensions. We explore the potential of domain transfer across geographical locations, languages, time, and genre in a large-scale database of political manifestos. First, we show the strong within-domain classification performance of fine-tuned transformer models. Second, we vary the genre of the test set across the aforementioned dimensions to test for the fine-tuned models' robustness and transferability. For switching genres, we use an external corpus of transcribed speeches from New Zealand politicians, while for the other three dimensions, custom splits of the Manifesto database are used. While BERT achieves the best scores in the initial experiments across modalities, DistilBERT proves to be competitive at a lower computational expense and is thus used for further experiments across time and country. The results of the additional analysis show that (Distil)BERT can be applied to future data with similar performance. Moreover, we observe (partly) notable differences between the political manifestos of different countries of origin, even if these countries share a language or a cultural background.
Explainable Equivariant Neural Networks for Particle Physics: PELICAN
paper_authors: Alexander Bogatskiy, Timothy Hoffman, David W. Miller, Jan T. Offermann, Xiaoyang Liu
For: The paper is written for the task of tagging and reconstructing Lorentz-boosted top quarks, specifically identifying and measuring the $W$-boson in the dense final state.
Methods: The paper proposes a novel permutation equivariant and Lorentz invariant or covariant aggregator network called PELICAN, which employs a fundamentally symmetry group-based architecture to overcome common limitations in particle physics problems.
Results: PELICAN outperforms existing competitors with much lower model complexity and high sample efficiency on the standard task of Lorentz-boosted top quark tagging, and also outperforms hand-crafted algorithms on the less common and more complex task of four-momentum regression.
Abstract
We present a comprehensive study of the PELICAN machine learning algorithm architecture in the context of both tagging (classification) and reconstructing (regression) Lorentz-boosted top quarks, including the difficult task of specifically identifying and measuring the $W$-boson inside the dense environment of the boosted hadronic final state. PELICAN is a novel permutation equivariant and Lorentz invariant or covariant aggregator network designed to overcome common limitations found in architectures applied to particle physics problems. Compared to many approaches that use non-specialized architectures that neglect underlying physics principles and require very large numbers of parameters, PELICAN employs a fundamentally symmetry group-based architecture that demonstrates benefits in terms of reduced complexity, increased interpretability, and raw performance. When tested on the standard task of Lorentz-boosted top quark tagging, PELICAN outperforms existing competitors with much lower model complexity and high sample efficiency. On the less common and more complex task of four-momentum regression, PELICAN also outperforms hand-crafted algorithms. We discuss the implications of symmetry-restricted architectures for the wider field of machine learning for physics.
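To illustrate the permutation-equivariance ingredient only, here is a minimal Deep-Sets-style layer over a set of particle features in PyTorch. This is not PELICAN itself (PELICAN additionally enforces Lorentz invariance/covariance); the layer and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PermEquivariantLayer(nn.Module):
    """Minimal permutation-equivariant layer over a set of particles: each
    particle is updated from its own features plus a permutation-invariant
    aggregate of all particles. PELICAN additionally builds in Lorentz
    symmetry, which this sketch does not."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.lin_self = nn.Linear(dim_in, dim_out)
        self.lin_agg = nn.Linear(dim_in, dim_out)

    def forward(self, x):                  # x: (batch, n_particles, dim_in)
        agg = x.mean(dim=1, keepdim=True)  # invariant aggregate
        return torch.relu(self.lin_self(x) + self.lin_agg(agg))

layer = PermEquivariantLayer(4, 8)
jets = torch.randn(2, 30, 4)               # e.g. 30 four-momenta per jet
out = layer(jets)  # permuting the particles permutes the rows of out
```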
Value-Informed Skill Chaining for Policy Learning of Long-Horizon Tasks with Surgical Robot
paper_authors: Tao Huang, Kai Chen, Wang Wei, Jianan Li, Yonghao Long, Qi Dou
for: solves long-horizon surgical robot tasks with multiple steps over an extended duration of time.
methods: uses value-informed skill chaining (ViSkill) with a state value function to distinguish suitable terminal states for starting subtask policies, and a chaining policy to instruct subtask policies to terminate at the highest-value state.
results: demonstrates effectiveness on three complex surgical robot tasks from SurRoL, achieving high task success rates and execution efficiency.
Abstract
Reinforcement learning is still struggling with solving long-horizon surgical robot tasks which involve multiple steps over an extended duration of time due to the policy exploration challenge. Recent methods try to tackle this problem by skill chaining, in which the long-horizon task is decomposed into multiple subtasks for easing the exploration burden and subtask policies are temporally connected to complete the whole long-horizon task. However, smoothly connecting all subtask policies is difficult for surgical robot scenarios. Not all states are equally suitable for connecting two adjacent subtasks. An undesired terminal state of the previous subtask would make the current subtask policy unstable and result in a failed execution. In this work, we introduce value-informed skill chaining (ViSkill), a novel reinforcement learning framework for long-horizon surgical robot tasks. The core idea is to distinguish which terminal state is suitable for starting all the following subtask policies. To achieve this target, we introduce a state value function that estimates the expected success probability of the entire task given a state. Based on this value function, a chaining policy is learned to instruct subtask policies to terminate at the state with the highest value so that all subsequent policies are more likely to be connected for accomplishing the task. We demonstrate the effectiveness of our method on three complex surgical robot tasks from SurRoL, a comprehensive surgical simulation platform, achieving high task success rates and execution efficiency. Code is available at $\href{https://github.com/med-air/ViSkill}{\text{https://github.com/med-air/ViSkill}}$.
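A minimal sketch of the core selection rule described in the abstract: roll out a subtask policy and prefer the visited terminal-candidate state with the highest estimated task-success value. The `env`, `policy`, and `value_fn` interfaces are hypothetical stand-ins, not ViSkill's actual API, and the learned chaining policy itself is not shown.

```python
def run_subtask_value_informed(env, policy, value_fn, max_steps=200):
    """Roll out a subtask policy and report the visited state with the highest
    estimated success value V(s): the state at which a value-informed chaining
    policy would instruct the subtask to terminate. Interfaces hypothetical."""
    state = env.reset()
    best_state, best_value = state, value_fn(state)
    for _ in range(max_steps):
        state, done = env.step(policy(state))
        v = value_fn(state)
        if v > best_value:
            best_state, best_value = state, v
        if done:
            break
    return best_state  # start the next subtask policy from here
```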
Learning Generalizable Tool Use with Non-rigid Grasp-pose Registration
results: The learned policies solve complex tool use tasks and generalize to unseen tools at test time. Videos and visualizations are available at https://maltemosbach.github.io/generalizable_tool_use.
Abstract
Tool use, a hallmark feature of human intelligence, remains a challenging problem in robotics due to the complex contacts and high-dimensional action space. In this work, we present a novel method to enable reinforcement learning of tool use behaviors. Our approach provides a scalable way to learn the operation of tools in a new category using only a single demonstration. To this end, we propose a new method for generalizing grasping configurations of multi-fingered robotic hands to novel objects. This is used to guide the policy search via favorable initializations and a shaped reward signal. The learned policies solve complex tool use tasks and generalize to unseen tools at test time. Visualizations and videos of the trained policies are available at https://maltemosbach.github.io/generalizable_tool_use.
Don’t be so negative! Score-based Generative Modeling with Oracle-assisted Guidance
paper_authors: Saeid Naderiparizi, Xiaoxuan Liang, Berend Zwartsenberg, Frank Wood
for: The paper discusses a new method called Gen-neG, which leverages side-information in the form of an oracle to improve the learning of probabilistic models.
methods: The paper uses a combination of generative adversarial networks (GANs) and discriminator guidance in diffusion models to guide the generation process towards the positive support region indicated by the oracle.
results: The paper presents empirical results in applications including collision avoidance in self-driving simulators and safety-guarded human motion generation, demonstrating the utility of the proposed Gen-neG method.
Abstract
The maximum likelihood principle advocates parameter estimation via optimization of the data likelihood function. Models estimated in this way can exhibit a variety of generalization characteristics dictated by, e.g. architecture, parameterization, and optimization bias. This work addresses model learning in a setting where there further exists side-information in the form of an oracle that can label samples as being outside the support of the true data generating distribution. Specifically we develop a new denoising diffusion probabilistic modeling (DDPM) methodology, Gen-neG, that leverages this additional side-information. Our approach builds on generative adversarial networks (GANs) and discriminator guidance in diffusion models to guide the generation process towards the positive support region indicated by the oracle. We empirically establish the utility of Gen-neG in applications including collision avoidance in self-driving simulators and safety-guarded human motion generation.
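To make the oracle's role concrete, below is only a crude rejection-sampling baseline: discard generated samples the oracle labels out-of-support. Gen-neG instead folds the oracle into discriminator guidance of the diffusion process, which is far more sample-efficient; the toy oracle and generator here are hypothetical.

```python
import numpy as np

def oracle(x):
    """Toy oracle: marks a 2-d sample as inside the positive support,
    here anything outside a unit-radius 'obstacle' (hypothetical)."""
    return np.linalg.norm(x) > 1.0

def sample_with_rejection(generator, n, max_tries=10_000):
    """Crude baseline: reject generated samples the oracle rules out.
    Gen-neG uses the oracle for guidance, not post-hoc rejection."""
    out = []
    for _ in range(max_tries):
        x = generator()
        if oracle(x):
            out.append(x)
        if len(out) == n:
            break
    return np.stack(out)

samples = sample_with_rejection(lambda: np.random.randn(2), 100)
```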
L3DMC: Lifelong Learning using Distillation via Mixed-Curvature Space
results: 在三个标准测试集上进行了实验,证明了我们提议的混合曲率空间防止忘记旧知识并更好地适应新知识的方法可以提高L3模型在医学图像分类任务中的表现。Abstract
The performance of a lifelong learning (L3) model degrades when it is trained on a series of tasks, as the geometrical formation of the embedding space changes while learning novel concepts sequentially. The majority of existing L3 approaches operate on a fixed-curvature (e.g., zero-curvature Euclidean) space that is not necessarily suitable for modeling the complex geometric structure of data. Furthermore, the distillation strategies apply constraints directly on low-dimensional embeddings, discouraging the L3 model from learning new concepts by making the model highly stable. To address the problem, we propose a distillation strategy named L3DMC that operates on mixed-curvature spaces to preserve the already-learned knowledge by modeling and maintaining complex geometrical structures. We propose to embed the projected low dimensional embedding of fixed-curvature spaces (Euclidean and hyperbolic) to higher-dimensional Reproducing Kernel Hilbert Space (RKHS) using a positive-definite kernel function to attain rich representation. Afterward, we optimize the L3 model by minimizing the discrepancies between the new sample representation and the subspace constructed using the old representation in RKHS. L3DMC is capable of adapting new knowledge better without forgetting old knowledge as it combines the representation power of multiple fixed-curvature spaces and is performed on higher-dimensional RKHS. Thorough experiments on three benchmarks demonstrate the effectiveness of our proposed distillation strategy for medical image classification in L3 settings. Our code implementation is publicly available at https://github.com/csiro-robotics/L3DMC.
An Effective Data Creation Pipeline to Generate High-quality Financial Instruction Data for Large Language Model
results: The pipeline yielded a robust instruction tuning dataset comprising 103k multi-turn chats. Extensive experiments on this dataset, judged by an external GPT-4, show that the approach enables AI models to generate accurate, relevant, and financial-style responses, providing a powerful tool for applications within the financial sector.
Abstract
In the early era of large language models, it is critical to generate a high-quality financial dataset to fine-tune a large language model for financial tasks. Thus, this paper presents a carefully designed data creation pipeline for this purpose. In particular, we initiate a dialogue between an AI investor and a financial expert using ChatGPT and incorporate the feedback of human financial experts, leading to the refinement of the dataset. This pipeline yielded a robust instruction tuning dataset comprising 103k multi-turn chats. Extensive experiments have been conducted on this dataset to evaluate the model's performance, adopting an external GPT-4 as the judge. The promising experimental results verify that our approach led to significant advancements in generating accurate, relevant, and financial-style responses from AI models, thus providing a powerful tool for applications within the financial sector.
A continuous Structural Intervention Distance to compare Causal Graphs
paper_authors: Mihir Dhanakshirur, Felix Laumann, Junhyung Park, Mauricio Barahona
for: This study proposes a new continuous-measured metric for assessing the difference between a true and a learnt causal graph.
methods: The method embeds interventional distributions over each pair of nodes as conditional mean embeddings into reproducing kernel Hilbert spaces and estimates their difference by the maximum (conditional) mean discrepancy.
results: The authors provide theoretical results and validate them with numerical experiments on synthetic data.
Abstract
Understanding and adequately assessing the difference between a true and a learnt causal graph is crucial for causal inference under interventions. As an extension to the graph-based structural Hamming distance and structural intervention distance, we propose a novel continuous-measured metric that considers the underlying data in addition to the graph structure for its calculation of the difference between a true and a learnt causal graph. The distance is based on embedding intervention distributions over each pair of nodes as conditional mean embeddings into reproducing kernel Hilbert spaces and estimating their difference by the maximum (conditional) mean discrepancy. We show theoretical results which we validate with numerical experiments on synthetic data.
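The unconditional building block of this distance is the maximum mean discrepancy (MMD) between two samples in an RKHS; the paper's metric uses *conditional* mean embeddings per node pair, which the sketch below does not implement. Kernel choice and bandwidth are illustrative.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between samples x and y under an RBF
    kernel: the distance between their mean embeddings in the RKHS."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())

x = np.random.randn(200, 1)         # samples under an intervention, graph G
y = np.random.randn(200, 1) + 0.5   # samples under the same intervention, G'
print(mmd2(x, y))
```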
Towards Head Computed Tomography Image Reconstruction Standardization with Deep Learning Assisted Automatic Detection
results: Ten object detection algorithms were compared in terms of precision, efficiency, and robustness; the lightweight YOLOv8 was selected, achieving an mAP of 92.91% and strong robustness against class imbalance, and qualitative evaluation of the standardized reconstruction results demonstrates the clinical practicability and validity of the method.
Abstract
Three-dimensional (3D) reconstruction of head Computed Tomography (CT) images elucidates the intricate spatial relationships of tissue structures, thereby assisting in accurate diagnosis. Nonetheless, securing an optimal head CT scan without deviation is challenging in clinical settings, owing to poor positioning by technicians, patient's physical constraints, or CT scanner tilt angle restrictions. Manual formatting and reconstruction not only introduce subjectivity but also strain time and labor resources. To address these issues, we propose an efficient automatic head CT images 3D reconstruction method, improving accuracy and repeatability, as well as diminishing manual intervention. Our approach employs a deep learning-based object detection algorithm, identifying and evaluating orbitomeatal line landmarks to automatically reformat the images prior to reconstruction. Given the dearth of existing evaluations of object detection algorithms in the context of head CT images, we compared ten methods from both theoretical and experimental perspectives. By exploring their precision, efficiency, and robustness, we singled out the lightweight YOLOv8 as the aptest algorithm for our task, with an mAP of 92.91% and impressive robustness against class imbalance. Our qualitative evaluation of standardized reconstruction results demonstrates the clinical practicability and validity of our method.
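Since the paper settles on YOLOv8, a minimal usage sketch with the `ultralytics` package follows. The dataset config, image file, and training hyperparameters are hypothetical placeholders, not the paper's actual setup.

```python
# pip install ultralytics
from ultralytics import YOLO

# "head_ct_landmarks.yaml" is a hypothetical dataset config listing
# orbitomeatal-line landmark classes.
model = YOLO("yolov8n.pt")  # lightweight variant, as selected in the paper
model.train(data="head_ct_landmarks.yaml", epochs=100, imgsz=512)

results = model("scout_view.png")       # detect landmarks on a scout image
for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)  # class, confidence, coordinates
```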
VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design
results: Experimental results show that VITS2 improves the naturalness of synthesized speech, the similarity of speech characteristics in a multi-speaker setting, and the efficiency of training and inference, while substantially reducing the previously strong dependence on phoneme conversion, enabling a fully end-to-end single-stage approach.
Abstract
Single-stage text-to-speech models have been actively studied recently, and their results have outperformed two-stage pipeline systems. Although the previous single-stage model has made great progress, there is room for improvement in terms of its intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion. In this work, we introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes a more natural speech by improving several aspects of the previous work. We propose improved structures and training mechanisms and present that the proposed methods are effective in improving naturalness, similarity of speech characteristics in a multi-speaker model, and efficiency of training and inference. Furthermore, we demonstrate that the strong dependence on phoneme conversion in previous works can be significantly reduced with our method, which allows a fully end-to-end single-stage approach.
Causal Inference for Banking Finance and Insurance A Survey
results: The survey finds that the application of causal inference in the banking and insurance sectors is still in its infancy, leaving considerable room for further research to turn it into a mature, reliable methodology.
Abstract
Causal inference plays a significant role in explaining the decisions taken by statistical models and artificial intelligence models. Of late, this field has started attracting the attention of researchers and practitioners alike. This paper presents a comprehensive survey of 37 papers published during 1992-2023 concerning the application of causal inference to banking, finance, and insurance. The papers are categorized according to the following families of domains: (i) Banking, (ii) Finance and its subdomains such as corporate finance, governance finance including financial risk and financial policy, financial economics, and behavioral finance, and (iii) Insurance. Further, the paper covers the primary ingredients of causal inference, namely statistical methods such as Bayesian causal networks and Granger causality, and associated terminology such as counterfactuals. The review also recommends some important directions for future research. In conclusion, we observed that the application of causal inference in the banking and insurance sectors is still in its infancy, and thus more research is possible to turn it into a viable method.
MetaDiff: Meta-Learning with Conditional Diffusion for Few-Shot Learning
results: Compared with existing methods, MetaDiff performs strongly on few-shot learning tasks and, because it does not need to differentiate through the inner-loop optimization path, avoids the memory burden and vanishing-gradient risk of second-order meta-learning.
Abstract
Equipping a deep model with the ability of few-shot learning, i.e., learning quickly from only a few examples, is a core challenge for artificial intelligence. Gradient-based meta-learning approaches effectively address the challenge by learning how to learn novel tasks. Their key idea is learning a deep model in a bi-level optimization manner, where the outer-loop process learns a shared gradient descent algorithm (i.e., its hyperparameters), while the inner-loop process leverages it to optimize a task-specific model using only a few labeled data. Although these existing methods have shown superior performance, the outer-loop process requires calculating second-order derivatives along the inner optimization path, which imposes considerable memory burdens and the risk of vanishing gradients. Drawing inspiration from recent progress in diffusion models, we find that the inner-loop gradient descent process can actually be viewed as a reverse (i.e., denoising) process of diffusion, where the target of denoising is the model weights rather than the original data. Based on this fact, in this paper, we propose to model the gradient descent optimizer as a diffusion model and then present a novel task-conditional diffusion-based meta-learning method, called MetaDiff, that effectively models the optimization process of model weights from Gaussian noise to target weights in a denoising manner. Thanks to the training efficiency of diffusion models, our MetaDiff does not need to differentiate through the inner-loop path, so that the memory burdens and the risk of vanishing gradients can be effectively alleviated. Experimental results show that our MetaDiff outperforms the state-of-the-art gradient-based meta-learning family on few-shot learning tasks.
Guaranteed Optimal Generative Modeling with Maximum Deviation from the Empirical Distribution
results: The paper establishes two guarantees for the trained generative model: (i) the error of replacing the true data-generating distribution with the trained one optimally converges to zero as the sample size grows, and (ii) the trained distribution stays far enough from any distribution that merely replicates examples in the training data.
Abstract
Generative modeling is a widely-used machine learning method with various applications in scientific and industrial fields. Its primary objective is to simulate new examples drawn from an unknown distribution given training data while ensuring diversity and avoiding replication of examples from the training data. This paper presents theoretical insights into training a generative model with two properties: (i) the error of replacing the true data-generating distribution with the trained data-generating distribution should optimally converge to zero as the sample size approaches infinity, and (ii) the trained data-generating distribution should be far enough from any distribution replicating examples in the training data. We provide non-asymptotic results in the form of finite sample risk bounds that quantify these properties and depend on relevant parameters such as sample size, the dimension of the ambient space, and the dimension of the latent space. Our results are applicable to general integral probability metrics used to quantify errors in probability distribution spaces, with the Wasserstein-$1$ distance being the central example. We also include numerical examples to illustrate our theoretical findings.
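The Wasserstein-$1$ distance is the paper's central example of an integral probability metric; in one dimension, with equal sample sizes, the empirical distance reduces to the mean absolute difference of the sorted samples, as in this small sketch (the toy "generator" samples are illustrative).

```python
import numpy as np

def wasserstein1_1d(x, y):
    """Empirical Wasserstein-1 distance between two equal-size 1-d samples:
    with the absolute-value cost, the optimal coupling sorts both samples."""
    assert len(x) == len(y)
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

train = np.random.randn(1000)        # "training data"
gen = np.random.randn(1000) * 1.1    # samples from a trained generator
print(wasserstein1_1d(train, gen))
```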
DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement Estimation in Conversation
results: The model outperforms the baselines in the MULTIMEDIATE 2023 competition, improving by $7\%$ on the test set and $4\%$ on the validation set. Among the modality fusion mechanisms explored, a simple concatenation method with self-attention fusion achieves the best performance on this type of data.
Abstract
Conversational engagement estimation is posed as a regression problem, entailing the identification of the favorable attention and involvement of the participants in the conversation. This task arises as a crucial pursuit to gain insights into humans' interaction dynamics and behavior patterns within a conversation. In this research, we introduce a dilated convolutional Transformer for modeling and estimating human engagement in the MULTIMEDIATE 2023 competition. Our proposed system surpasses the baseline models, exhibiting a noteworthy $7$\% improvement on the test set and $4$\% on the validation set. Moreover, we employ different modality fusion mechanisms and show that for this type of data, a simple concatenation method with self-attention fusion achieves the best performance.
results: Experiments demonstrate that the proposed method mitigates forgetting on several challenging datasets and can be combined with existing continual learning approaches to improve their performance.
Abstract
An ultimate objective in continual learning is to preserve knowledge learned in preceding tasks while learning new tasks. To mitigate forgetting prior knowledge, we propose a novel knowledge distillation technique that takes into account the manifold structure of the latent/output space of a neural network in learning novel tasks. To achieve this, we propose to approximate the data manifold up to its first order, hence benefiting from linear subspaces to model the structure and maintain the knowledge of a neural network while learning novel concepts. We demonstrate that the modeling with subspaces provides several intriguing properties, including robustness to noise, and is therefore effective for mitigating Catastrophic Forgetting in continual learning. We also discuss and show how our proposed method can be adopted to address both classification and segmentation problems. Empirically, we observe that our proposed method outperforms various continual learning methods on several challenging datasets including Pascal VOC and Tiny-Imagenet. Furthermore, we show how the proposed method can be seamlessly combined with existing learning approaches to improve their performances. The codes of this article will be available at https://github.com/csiro-robotics/SDCL.
results: The library lets users run causal discovery quickly and simply, with detailed documentation to support learning and mastery.
Abstract
Causal discovery aims at revealing causal relations from observational data, which is a fundamental task in science and engineering. We describe $\textit{causal-learn}$, an open-source Python library for causal discovery. This library focuses on bringing a comprehensive collection of causal discovery methods to both practitioners and researchers. It provides easy-to-use APIs for non-specialists, modular building blocks for developers, detailed documentation for learners, and comprehensive methods for all. Different from previous packages in R or Java, $\textit{causal-learn}$ is fully developed in Python, which could be more in tune with the recent preference shift in programming languages within related communities. The library is available at https://github.com/py-why/causal-learn.
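A minimal usage sketch of the library's constraint-based PC search on toy data follows, based on causal-learn's documented interface; consult the repository for the current API, and note the toy chain and default settings are illustrative.

```python
# pip install causal-learn
import numpy as np
from causallearn.search.ConstraintBased.PC import pc

# Toy linear chain X -> Y -> Z
n = 2000
x = np.random.randn(n)
y = 2 * x + np.random.randn(n)
z = -y + np.random.randn(n)
data = np.column_stack([x, y, z])

cg = pc(data)   # constraint-based PC search with default settings
print(cg.G)     # estimated graph (a CPDAG over the three variables)
```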
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
paper_authors: Kousik Rajesh, Mrigank Raman, Mohammed Asad Karim, Pranit Chawla
for: This paper investigates the performance of multi-modal architectures based on Large Language Models (LLMs) on the NLVR2 dataset, specifically focusing on the effectiveness of adding object level features and pre-training on multi-modal data for complex visual reasoning tasks.
methods: The paper proposes extending traditional bridge architectures for the NLVR2 dataset by adding object level features, and also explores the use of a recently proposed bridge-architecture called LLaVA in the zero shot setting.
results: The paper shows that pre-training on multi-modal data is key for good performance on complex reasoning tasks such as NLVR2, and that adding object level features to bridge architectures does not help. The paper also demonstrates some initial results on LLaVA in the zero shot setting and analyzes its performance.
In recent times there has been a surge of multi-modal architectures based on Large Language Models, which leverage the zero-shot generation capabilities of LLMs, project image embeddings into the text space, and then use the auto-regressive capacity to solve tasks such as VQA, captioning, and image retrieval. We name these architectures "bridge-architectures" as they project from the image space to the text space. These models deviate from the traditional recipe of training transformer-based multi-modal models, which involves large-scale pre-training and complex multi-modal interactions through co- or cross-attention. However, the capabilities of bridge architectures have not been tested on complex visual reasoning tasks which require fine-grained analysis of the image. In this project, we investigate the performance of these bridge-architectures on the NLVR2 dataset, and compare it to state-of-the-art transformer-based architectures. We first extend the traditional bridge architectures for the NLVR2 dataset by adding object-level features to facilitate fine-grained object reasoning. Our analysis shows that adding object-level features to bridge architectures does not help, and that pre-training on multi-modal data is key for good performance on complex reasoning tasks such as NLVR2. We also demonstrate some initial results on a recent bridge-architecture, LLaVA, in the zero-shot setting and analyze its performance.
A Pre-trained Data Deduplication Model based on Active Learning
paper_authors: Xinyao Liu, Shengdong Du, Fengmao Lv, Hongtao Xue, Jie Hu, Tianrui Li
For: Addressing data quality problems in the big data era, especially duplicate data, to improve the effective application of big data.
Methods: A pre-trained deduplication model based on active learning, which for the first time combines active learning with a Transformer to select the most valuable data for model training, and which is also the first to apply the R-Drop method for data augmentation on each round of labeled data.
Results: The proposed model improves Recall for deduplicated data identification by up to 28% over the previous state of the art (SOTA).
Abstract
In the era of big data, the issue of data quality has become increasingly prominent. One of the main challenges is the problem of duplicate data, which can arise from repeated entry or the merging of multiple data sources. These "dirty data" problems can significantly limit the effective application of big data. To address the issue of data deduplication, we propose a pre-trained deduplication model based on active learning, which is the first work that utilizes active learning to address the problem of deduplication at the semantic level. The model is built on a pre-trained Transformer and fine-tuned to solve the deduplication problem as a sequence-to-classification task. It is the first to integrate the transformer with active learning into an end-to-end architecture to select the most valuable data for deduplication model training, and also the first to employ the R-Drop method to perform data augmentation on each round of labeled data, which can reduce the cost of manual labeling and improve the model's performance. Experimental results demonstrate that our proposed model outperforms the previous state of the art (SOTA) for deduplicated data identification, achieving up to a 28% improvement in Recall score on benchmark datasets.
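For reference, here is a minimal PyTorch sketch of the R-Drop regularizer the pipeline employs: two stochastic forward passes through the same model (different dropout masks), cross-entropy on both, plus a symmetric KL term pulling the two predictive distributions together. The `model` interface and `alpha` weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def rdrop_loss(model, inputs, labels, alpha=1.0):
    """R-Drop: two forward passes with independent dropout masks; the
    symmetric KL between the two output distributions regularizes the model
    on top of the usual cross-entropy."""
    logits1 = model(inputs)   # first stochastic pass
    logits2 = model(inputs)   # second stochastic pass (different dropout)
    ce = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)
    p1 = F.log_softmax(logits1, dim=-1)
    p2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (F.kl_div(p1, p2, log_target=True, reduction="batchmean")
                + F.kl_div(p2, p1, log_target=True, reduction="batchmean"))
    return ce + alpha * kl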
STL: A Signed and Truncated Logarithm Activation Function for Neural Networks
results: Comparisons against other well-known activation functions in several well-known neural networks show that the proposed activation achieves state-of-the-art accuracy and running performance, and it can be applied in most neural networks where activation functions are needed.
Abstract
Activation functions play an essential role in neural networks. They provide the non-linearity for the networks. Therefore, their properties are important for neural networks' accuracy and running performance. In this paper, we present a novel signed and truncated logarithm function as activation function. The proposed activation function has significantly better mathematical properties, such as being an odd function, monotone, differentiable, having an unbounded value range, and a continuous nonzero gradient. These properties make it an excellent choice as an activation function. We compare it with other well-known activation functions in several well-known neural networks. The results confirm that it is the state of the art. The suggested activation function can be applied in a large range of neural networks where activation functions are necessary.
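The abstract lists the function's properties but its exact formula is not reproduced here; $\mathrm{sign}(x)\,\log(1+|x|)$ is one form with all of those properties (odd, monotone, differentiable, unbounded, continuous nonzero gradient), so the sketch below uses it purely as an illustration, not necessarily the paper's STL definition.

```python
import torch
import torch.nn as nn

class SignedLog(nn.Module):
    """sign(x) * log(1 + |x|): odd, monotone, differentiable, unbounded,
    with a continuous nonzero gradient of 1/(1+|x|) for x != 0. An
    illustrative form, not necessarily the paper's exact STL."""
    def forward(self, x):
        return torch.sign(x) * torch.log1p(torch.abs(x))

act = SignedLog()
x = torch.linspace(-5, 5, 10, requires_grad=True)  # grid avoids x == 0
act(x).sum().backward()
print(x.grad)  # 1/(1+|x|): always positive, never saturates to zero
```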
Does fine-tuning GPT-3 with the OpenAI API leak personally-identifiable information?
results: The study finds that fine-tuning GPT-3 on both tasks led the model to memorize and leak sensitive personally identifiable information (PII) from the underlying fine-tuning dataset.
Abstract
Machine learning practitioners often fine-tune generative pre-trained models like GPT-3 to improve model performance at specific tasks. Previous works, however, suggest that fine-tuned machine learning models memorize and emit sensitive information from the original fine-tuning dataset. Companies such as OpenAI offer fine-tuning services for their models, but no prior work has conducted a memorization attack on any closed-source models. In this work, we simulate a privacy attack on GPT-3 using OpenAI's fine-tuning API. Our objective is to determine if personally identifiable information (PII) can be extracted from this model. We (1) explore the use of naive prompting methods on a GPT-3 fine-tuned classification model, and (2) we design a practical word generation task called Autocomplete to investigate the extent of PII memorization in fine-tuned GPT-3 within a real-world context. Our findings reveal that fine-tuning GPT3 for both tasks led to the model memorizing and disclosing critical personally identifiable information (PII) obtained from the underlying fine-tuning dataset. To encourage further research, we have made our codes and datasets publicly available on GitHub at: https://github.com/albertsun1/gpt3-pii-attacks
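A minimal sketch of the naive prompting probe the abstract describes, written against the legacy (v0.x) `openai` Python SDK; the fine-tuned model name, API key, and prompt template are hypothetical placeholders, and this is not the authors' exact attack code.

```python
# Legacy (v0.x) openai SDK assumed; names below are placeholders.
import openai

openai.api_key = "sk-..."                            # not a real key
FINE_TUNED_MODEL = "davinci:ft-your-org-2023-01-01"  # hypothetical

def autocomplete_probe(prefix, n=5):
    """Naive prompting: ask the fine-tuned model to continue a prefix that a
    memorized training record would plausibly follow, then inspect the
    completions for personally identifiable information."""
    resp = openai.Completion.create(
        model=FINE_TUNED_MODEL,
        prompt=prefix,
        max_tokens=32,
        temperature=1.0,
        n=n,
    )
    return [c.text for c in resp.choices]

for completion in autocomplete_probe("Patient name: "):
    print(completion)
```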
UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming
results: Outperforms previous methods by up to 1.70$\times$ in throughput while reducing strategy searching time by up to 16$\times$.
Abstract
Deep learning models have demonstrated impressive performance in various domains. However, the prolonged training time of these models remains a critical problem. Manually designed parallel training strategies could enhance efficiency but require considerable time and deliver little flexibility. Hence, automatic parallelism has been proposed to automate the parallel-strategy search process. Even so, existing approaches suffer from a sub-optimal strategy space because they treat automatic parallelism as two independent stages, namely inter- and intra-layer parallelism. To address this issue, we propose UniAP, which utilizes mixed integer quadratic programming to unify inter- and intra-layer automatic parallelism. To the best of our knowledge, UniAP is the first work to unify these two categories to search for a globally optimal strategy. The experimental results show that UniAP outperforms state-of-the-art methods by up to 1.70$\times$ in throughput and reduces strategy searching time by up to 16$\times$ across four Transformer-like models.
BearingPGA-Net: A Lightweight and Deployable Bearing Fault Diagnosis Network via Decoupled Knowledge Distillation and FPGA Acceleration
results: We design an FPGA acceleration scheme that implements each layer of BearingPGA-Net on the FPGA with customized quantization and dedicated programmable logic, emphasizing parallel computing and module reuse to increase computational speed. Experiments show that our deployment achieves over 200x faster diagnosis than a CPU, with a performance drop of less than 0.4% in F1, Recall, and Precision.Abstract
Deep learning has achieved remarkable success in the field of bearing fault diagnosis. However, this success comes with larger models and more complex computations, which cannot be transferred into industrial fields requiring models to be of high speed, strong portability, and low power consumption. In this paper, we propose a lightweight and deployable model for bearing fault diagnosis, referred to as BearingPGA-Net, to address these challenges. Firstly, aided by a well-trained large model, we train BearingPGA-Net via decoupled knowledge distillation. Despite its small size, our model demonstrates excellent fault diagnosis performance compared to other lightweight state-of-the-art methods. Secondly, we design an FPGA acceleration scheme for BearingPGA-Net using Verilog. This scheme involves the customized quantization and designing programmable logic gates for each layer of BearingPGA-Net on the FPGA, with an emphasis on parallel computing and module reuse to enhance the computational speed. To the best of our knowledge, this is the first instance of deploying a CNN-based bearing fault diagnosis model on an FPGA. Experimental results reveal that our deployment scheme achieves over 200 times faster diagnosis speed compared to CPU, while achieving a lower-than-0.4\% performance drop in terms of F1, Recall, and Precision score on our independently-collected bearing dataset. Our code is available at \url{https://github.com/asdvfghg/BearingPGA-Net}.
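The distillation step builds on decoupled knowledge distillation, which splits the classical KD loss into a target-class term (TCKD) and a non-target-class term (NCKD). A PyTorch sketch following the common DKD formulation (the hyperparameters alpha, beta, and T are illustrative, not the paper's values):

```python
import torch
import torch.nn.functional as F

def dkd_loss(z_s, z_t, target, alpha=1.0, beta=8.0, T=4.0):
    # z_s, z_t: student/teacher logits of shape (B, C); target: (B,) labels.
    B, C = z_s.shape
    mask = F.one_hot(target, C).bool()

    p_s = F.softmax(z_s / T, dim=1)
    p_t = F.softmax(z_t / T, dim=1)

    # TCKD: match the teacher's (target, non-target) probability split.
    pt_s = p_s[mask].unsqueeze(1)
    pt_t = p_t[mask].unsqueeze(1)
    b_s = torch.cat([pt_s, 1 - pt_s], dim=1).clamp_min(1e-8)
    b_t = torch.cat([pt_t, 1 - pt_t], dim=1).clamp_min(1e-8)
    tckd = F.kl_div(b_s.log(), b_t, reduction="batchmean")

    # NCKD: match the distribution among the non-target classes only.
    z_s_nt = z_s[~mask].view(B, C - 1)
    z_t_nt = z_t[~mask].view(B, C - 1)
    nckd = F.kl_div(F.log_softmax(z_s_nt / T, dim=1),
                    F.softmax(z_t_nt / T, dim=1),
                    reduction="batchmean")

    return (alpha * tckd + beta * nckd) * (T * T)
```

Decoupling the two terms lets the small student weight the non-target "dark knowledge" independently, which is what makes a compact model like BearingPGA-Net competitive despite its size.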
Benchmarking and Analyzing Robust Point Cloud Recognition: Bag of Tricks for Defending Adversarial Examples
paper_authors: Qiufan Ji, Lin Wang, Cong Shi, Shengshan Hu, Yingying Chen, Lichao Sun
for: The paper is written for defending deep neural networks (DNNs) against adversarial examples in 3D point cloud recognition.
methods: The paper uses a comprehensive and rigorous benchmark to evaluate adversarial robustness, collects existing defense tricks, and proposes a hybrid training augmentation method that considers various types of point cloud adversarial examples.
results: The paper achieves an average accuracy of 83.45% against various attacks, demonstrating the capability of the proposed defense framework to enable robust learners.Abstract
Deep Neural Networks (DNNs) for 3D point cloud recognition are vulnerable to adversarial examples, threatening their practical deployment. Although many research endeavors have been made to tackle this issue in recent years, the diversity of adversarial examples on 3D point clouds makes them more challenging to defend against than those on 2D images. For example, attackers can generate adversarial examples by adding, shifting, or removing points. Consequently, existing defense strategies struggle to counter unseen point cloud adversarial examples. In this paper, we first establish a comprehensive and rigorous point cloud adversarial robustness benchmark to evaluate adversarial robustness, which can provide a detailed understanding of the effects of the defense and attack methods. We then collect existing defense tricks in point cloud adversarial defenses and perform extensive and systematic experiments to identify an effective combination of these tricks. Furthermore, we propose a hybrid training augmentation method that brings various types of point cloud adversarial examples into adversarial training, significantly improving the adversarial robustness. By combining these tricks, we construct a more robust defense framework achieving an average accuracy of 83.45\% against various attacks, demonstrating its capability to enable robust learners. Our codebase is open-sourced at: \url{https://github.com/qiufan319/benchmark_pc_attack.git}.
paper_authors: Subhankar Ghosh, Yuanjie Shi, Taha Belkhouja, Yan Yan, Jana Doppa, Brian Jones
for: This paper focuses on developing a probabilistically robust conformal prediction (PRCP) algorithm to ensure robustness to natural/adversarial perturbations in testing examples.
methods: The proposed PRCP algorithm uses a novel adaptive approach called “quantile-of-quantile” design to determine two parallel thresholds for data samples and perturbations, achieving better trade-offs between nominal performance and robustness.
results: The proposed aPRCP algorithm is experimentally demonstrated to achieve better trade-offs than state-of-the-art conformal prediction (CP) and adversarially robust CP algorithms on CIFAR-10, CIFAR-100, and ImageNet datasets using deep neural networks.Abstract
Conformal prediction (CP) is a framework to quantify the uncertainty of machine learning classifiers, including deep neural networks. Given a testing example and a trained classifier, CP produces a prediction set of candidate labels with user-specified coverage (i.e., the true class label is contained with high probability). Almost all existing work on CP assumes clean testing data, and little is known about the robustness of CP algorithms w.r.t. natural/adversarial perturbations to testing examples. This paper studies the problem of probabilistically robust conformal prediction (PRCP), which ensures robustness to most perturbations around clean input examples. PRCP generalizes the standard CP (which cannot handle perturbations) and adversarially robust CP (which ensures robustness w.r.t. worst-case perturbations) to achieve better trade-offs between nominal performance and robustness. We propose a novel adaptive PRCP (aPRCP) algorithm to achieve probabilistically robust coverage. The key idea behind aPRCP is to determine two parallel thresholds, one for data samples and another for perturbations of the data (aka the "quantile-of-quantile" design). We provide theoretical analysis to show that the aPRCP algorithm achieves robust coverage. Our experiments on CIFAR-10, CIFAR-100, and ImageNet datasets using deep neural networks demonstrate that aPRCP achieves better trade-offs than state-of-the-art CP and adversarially robust CP algorithms.
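A sketch of the quantile-of-quantile calibration under our reading of the design (variable names and the finite-sample correction follow standard split conformal prediction and are illustrative):

```python
import numpy as np

def aprcp_threshold(scores, alpha=0.1, alpha_tilde=0.1):
    # scores[i, j] = nonconformity score of calibration example i under its
    # j-th sampled perturbation (e.g., 1 - softmax prob of the true label).
    # Inner quantile absorbs perturbation randomness per example; outer
    # quantile absorbs data randomness, as in split conformal prediction.
    n, m = scores.shape
    inner = np.quantile(scores, 1 - alpha_tilde, axis=1)   # per-example threshold
    k = int(np.ceil((n + 1) * (1 - alpha)))                # finite-sample correction
    return np.sort(inner)[min(k, n) - 1]

# At test time, include every label y whose score s(x_test, y) is at most
# the returned threshold; the perturbation level alpha_tilde trades
# robustness against prediction-set size.
```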
Moreau-Yoshida Variational Transport: A General Framework For Solving Regularized Distributional Optimization Problems
paper_authors: Dai Hai Nguyen, Tetsuya Sakurai
for: solving a regularized distributional optimization problem that appears widely in machine learning and statistics, such as proximal Monte-Carlo sampling, Bayesian inference and generative modeling.
methods: employs the Moreau-Yoshida envelope for a smooth approximation of the nonsmooth function in the objective, and leverages the variational representation to reformulate the approximate problem as a concave-convex saddle point problem.
results: provides theoretical analyses and reports experimental results to demonstrate the effectiveness of the proposed method.Abstract
We consider a general optimization problem of minimizing a composite objective functional defined over a class of probability distributions. The objective is composed of two functionals: one is assumed to possess the variational representation and the other is expressed in terms of the expectation operator of a possibly nonsmooth convex regularizer function. Such a regularized distributional optimization problem widely appears in machine learning and statistics, such as proximal Monte-Carlo sampling, Bayesian inference and generative modeling, for regularized estimation and generation. We propose a novel method, dubbed as Moreau-Yoshida Variational Transport (MYVT), for solving the regularized distributional optimization problem. First, as the name suggests, our method employs the Moreau-Yoshida envelope for a smooth approximation of the nonsmooth function in the objective. Second, we reformulate the approximate problem as a concave-convex saddle point problem by leveraging the variational representation, and then develop an efficient primal-dual algorithm to approximate the saddle point. Furthermore, we provide theoretical analyses and report experimental results to demonstrate the effectiveness of the proposed method.
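For reference, the Moreau-Yoshida envelope of a (possibly nonsmooth) convex function $g$ with parameter $\lambda > 0$ is

$$g_\lambda(x) = \min_{y}\left\{ g(y) + \frac{1}{2\lambda}\|x - y\|^2 \right\}, \qquad \nabla g_\lambda(x) = \frac{x - \mathrm{prox}_{\lambda g}(x)}{\lambda},$$

which is differentiable with a $1/\lambda$-Lipschitz gradient and recovers $g$ pointwise as $\lambda \to 0$. This is the smooth surrogate the method optimizes in place of the original regularizer.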
Hypertension Detection From High-Dimensional Representation of Photoplethysmogram Signals
results: Experimental results show that the relationship extends beyond heart rate and blood pressure. Moreover, the random-kernel transform, used as an end-to-end time-series feature extractor, outperforms the methods proposed in previous studies and state-of-the-art deep learning models.Abstract
Hypertension is commonly referred to as the "silent killer", since it can lead to severe health complications without any visible symptoms. Early detection of hypertension is crucial in preventing significant health issues. Although some studies suggest a relationship between blood pressure and certain vital signals, such as Photoplethysmogram (PPG), reliable generalization of the proposed blood pressure estimation methods is not yet guaranteed. This lack of certainty has resulted in some studies doubting the existence of such relationships, or considering them weak and limited to heart rate and blood pressure. In this paper, a high-dimensional representation technique based on random convolution kernels is proposed for hypertension detection using PPG signals. The results show that this relationship extends beyond heart rate and blood pressure, demonstrating the feasibility of hypertension detection with generalization. Additionally, the utilized transform using convolution kernels, as an end-to-end time-series feature extractor, outperforms the methods proposed in the previous studies and state-of-the-art deep learning models.
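The high-dimensional representation is based on random convolution kernels. A minimal ROCKET-like sketch of such a feature extractor (kernel lengths and pooled statistics are illustrative; dilation and other refinements are omitted):

```python
import numpy as np

def random_kernel_features(x, n_kernels=500, rng=None):
    # For each random 1D kernel, convolve the PPG signal and keep two
    # pooled statistics: the max response and the proportion of positive
    # values (PPV). A linear classifier is then trained on the features.
    rng = rng or np.random.default_rng(0)
    feats = []
    for _ in range(n_kernels):
        length = rng.choice([7, 9, 11])
        w = rng.normal(size=length)
        w -= w.mean()                      # zero-mean kernel
        b = rng.uniform(-1, 1)             # random bias
        r = np.convolve(x, w, mode="valid") + b
        feats += [r.max(), (r > 0).mean()]
    return np.asarray(feats)

# Usage sketch: stack features for each PPG window into a matrix, then
# fit e.g. sklearn.linear_model.RidgeClassifierCV on it.
```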
results: Through several experimental studies based on both synthetic ratings and real human ratings, the study demonstrates the effectiveness and benefits of the proposed method.Abstract
This paper develops a novel rating-based reinforcement learning approach that uses human ratings to obtain human guidance in reinforcement learning. Different from the existing preference-based and ranking-based reinforcement learning paradigms, which are based on human relative preferences over sample pairs, the proposed rating-based reinforcement learning approach is based on human evaluation of individual trajectories without relative comparisons between sample pairs. The rating-based reinforcement learning approach builds on a new prediction model for human ratings and a novel multi-class loss function. We conduct several experimental studies based on synthetic ratings and real human ratings to evaluate the effectiveness and benefits of the new rating-based reinforcement learning approach.
Proof-of-Federated-Learning-Subchain: Free Partner Selection Subchain Based on Federated Learning
results: In simulations, miners with a higher Shapley Value (SV) gain a better chance of being selected when the subchain pool size is limited. The experiments show that the Proof-of-Federated-Learning-Subchain (PoFLSC) consensus supports the subchain manager in establishing and maintaining a competitive subchain under a limited pool size.Abstract
The continued thriving of the blockchain community motivates research into novel designs of schemes supporting cryptocurrencies. Previously, multiple Proof-of-Deep-Learning (PoDL) consensuses have been proposed to replace hashing with useful work such as deep learning model training tasks, so that energy is used more efficiently while the ledger is maintained. However, deep learning models are problem-specific and can be extremely complex, and current PoDL consensuses still require much work to realize in the real world. In this paper, we propose a novel consensus named Proof-of-Federated-Learning-Subchain (PoFLSC) to fill the gap. We apply a subchain to record the training, challenging, and auditing activities and emphasize the importance of valuable datasets in partner selection. We simulated 20 miners in the subchain to demonstrate the effectiveness of PoFLSC. When we reduce the pool size according to the reservation priority order, the difference in performance drop rates across scenarios further shows that a miner with a higher Shapley Value (SV) gains a better opportunity to be selected when the size of the subchain pool is limited. In the conducted experiments, the PoFLSC consensus supported the subchain manager in staying aware of the reservation priority and the core partition of contributors, to establish and maintain a competitive subchain.
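Shapley values for partner selection are typically estimated by Monte Carlo permutation sampling, since the exact value sums over all coalitions. A self-contained sketch (names are illustrative; in the FL setting, `value` would be, e.g., validation accuracy of a model trained on the coalition's data, which is the expensive part):

```python
import random

def shapley_mc(players, value, n_perm=200, seed=0):
    # Estimate each player's Shapley value by averaging its marginal
    # contribution over randomly sampled orderings of all players.
    rng = random.Random(seed)
    sv = {p: 0.0 for p in players}
    for _ in range(n_perm):
        order = list(players)
        rng.shuffle(order)
        coalition = frozenset()
        prev = value(coalition)
        for p in order:
            coalition = coalition | {p}
            cur = value(coalition)
            sv[p] += (cur - prev) / n_perm
            prev = cur
    return sv
```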
Theoretically Principled Trade-off for Stateful Defenses against Query-Based Black-Box Attacks
results: Through theoretical analysis and empirical evaluation, the paper shows that stateful defenses face an inherent trade-off between attack detection and false-positive rates, and that this trade-off is shaped by the choice of feature extractor and similarity threshold.Abstract
Adversarial examples threaten the integrity of machine learning systems with alarming success rates even under constrained black-box conditions. Stateful defenses have emerged as an effective countermeasure, detecting potential attacks by maintaining a buffer of recent queries and detecting new queries that are too similar. However, these defenses fundamentally pose a trade-off between attack detection and false positive rates, and this trade-off is typically optimized by hand-picking feature extractors and similarity thresholds that empirically work well. There is little current understanding as to the formal limits of this trade-off and the exact properties of the feature extractors/underlying problem domain that influence it. This work aims to address this gap by offering a theoretical characterization of the trade-off between detection and false positive rates for stateful defenses. We provide upper bounds for detection rates of a general class of feature extractors and analyze the impact of this trade-off on the convergence of black-box attacks. We then support our theoretical findings with empirical evaluations across multiple datasets and stateful defenses.
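To make the analyzed mechanism concrete, here is a minimal sketch of the generic stateful defense the paper studies (not any specific published defense): a buffer of recent query embeddings plus a similarity threshold, where `tau` is exactly the knob governing the detection/false-positive trade-off.

```python
import numpy as np

class StatefulDetector:
    # Flag a query whose feature embedding is too close to any of the
    # most recent queries; `extract` maps an input to a feature vector.
    def __init__(self, extract, tau=0.1, buffer_size=1000):
        self.extract, self.tau = extract, tau
        self.buffer, self.buffer_size = [], buffer_size

    def query(self, x):
        f = np.asarray(self.extract(x), dtype=float)
        f /= (np.linalg.norm(f) + 1e-12)          # unit-normalize embedding
        attack = any(np.linalg.norm(f - g) < self.tau for g in self.buffer)
        self.buffer.append(f)
        self.buffer = self.buffer[-self.buffer_size:]
        return attack   # True -> query resembles recent probes (likely attack)
```

Raising `tau` catches more query-based probing but also flags more benign near-duplicate queries, which is the trade-off the paper characterizes formally.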
Evaluating ChatGPT and GPT-4 for Visual Programming
results: We find that both models perform poorly in visual programming domains, struggling in particular to combine spatial reasoning with programming skills. These results point to exciting directions for future work on generative models for visual programming.Abstract
Generative AI and large language models have the potential to drastically improve the landscape of computing education by automatically generating personalized feedback and content. Recent works have studied the capabilities of these models for different programming education scenarios; however, these works considered only text-based programming, in particular, Python programming. Consequently, they leave open the question of how well these models would perform in visual programming domains popularly used for K-8 programming education. The main research question we study is: Do state-of-the-art generative models show advanced capabilities in visual programming on par with their capabilities in text-based Python programming? In our work, we evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, in visual programming domains for various scenarios and assess performance using expert-based annotations. In particular, we base our evaluation using reference tasks from the domains of Hour of Code: Maze Challenge by Code-dot-org and Karel. Our results show that these models perform poorly and struggle to combine spatial, logical, and programming skills crucial for visual programming. These results also provide exciting directions for future work on developing techniques to improve the performance of generative models in visual programming.
RoseNNa: A performant, portable library for neural network inference with application to computational fluid dynamics
methods: Covers Multilayer Perceptron (MLP) and Long Short-Term Memory (LSTM) neural network architectures, automatically converting trained models into a high-performance Fortran library for integration with CFD solvers.
results: With RoseNNa, MLPs and LSTM RNNs with fewer than 100 hidden layers and 100 neurons per layer run faster than in PyTorch and libtorch, with speedups of roughly 10x at the smaller end of the tested network sizes and about 2x at the larger end.Abstract
The rise of neural network-based machine learning ushered in high-level libraries, including TensorFlow and PyTorch, to support their functionality. Computational fluid dynamics (CFD) researchers have benefited from this trend and produced powerful neural networks that promise shorter simulation times. For example, multilayer perceptrons (MLPs) and Long Short Term Memory (LSTM) recurrent-based (RNN) architectures can represent sub-grid physical effects, like turbulence. Implementing neural networks in CFD solvers is challenging because the programming languages used for machine learning and CFD are mostly non-overlapping. We present the roseNNa library, which bridges the gap between neural network inference and CFD. RoseNNa is a non-invasive, lightweight (1000 lines), and performant tool for neural network inference, with focus on the smaller networks used to augment PDE solvers, like those of CFD, which are typically written in C/C++ or Fortran. RoseNNa accomplishes this by automatically converting trained models from typical neural network training packages into a high-performance Fortran library with C and Fortran APIs. This reduces the effort needed to access trained neural networks and maintains performance in the PDE solvers that CFD researchers build and rely upon. Results show that RoseNNa reliably outperforms PyTorch (Python) and libtorch (C++) on MLPs and LSTM RNNs with less than 100 hidden layers and 100 neurons per layer, even after removing the overhead cost of API calls. Speedups range from a factor of about 10 to a factor of about 2 over these established libraries at the smaller and larger ends, respectively, of the neural network size range tested.
Towards Practical Robustness Auditing for Linear Regression
results: Substantially outperforms the state of the art, though computational bottlenecks remain for higher-dimensional problems.Abstract
We investigate practical algorithms to find or disprove the existence of small subsets of a dataset which, when removed, reverse the sign of a coefficient in an ordinary least squares regression involving that dataset. We empirically study the performance of well-established algorithmic techniques for this task -- mixed integer quadratically constrained optimization for general linear regression problems and exact greedy methods for special cases. We show that these methods largely outperform the state of the art and provide a useful robustness check for regression problems in a few dimensions. However, significant computational bottlenecks remain, especially for the important task of disproving the existence of such small sets of influential samples for regression problems of dimension $3$ or greater. We make some headway on this challenge via a spectral algorithm using ideas drawn from recent innovations in algorithmic robust statistics. We summarize the limitations of known techniques in several challenge datasets to encourage further algorithmic innovation.
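The paper's actual algorithms use mixed integer optimization and exact greedy methods; for intuition, here is a naive greedy baseline (ours, for illustration) that searches for a small subset whose removal flips the sign of one OLS coefficient:

```python
import numpy as np

def greedy_sign_flip(X, y, coef_idx=0, max_remove=50):
    # At each step, drop the sample whose removal moves the target OLS
    # coefficient furthest toward zero; stop when the sign flips.
    # Refits OLS n times per step, which is fine for a sketch on small n.
    coef = lambda idx: np.linalg.lstsq(X[idx], y[idx], rcond=None)[0][coef_idx]
    keep = list(range(len(y)))
    sign0 = np.sign(coef(keep))
    removed = []
    for _ in range(max_remove):
        cur = sign0 * coef(keep)
        if cur < 0:
            return removed                  # sign flipped: subset found
        candidates = [(sign0 * coef([j for j in keep if j != i]), i)
                      for i in keep]
        best_val, best_i = min(candidates)
        if best_val >= cur:
            return None                     # no single removal helps; give up
        keep.remove(best_i)
        removed.append(best_i)
    return None
```

A greedy search like this can certify existence (when it finds a flipping subset) but never non-existence, which is why the harder disproof task in dimension 3 and above remains the bottleneck the abstract describes.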
Mask-guided Data Augmentation for Multiparametric MRI Generation with a Rare Hepatocellular Carcinoma
results: Experiments show that the method, trained on a limited multiparametric dataset, can generate 1,000 synthetic triplets and their corresponding liver tumor masks, with a Frechet Inception Distance score of 86.55. The approach was among the winners of the 2021 data augmentation challenge.Abstract
Data augmentation is classically used to improve the overall performance of deep learning models. It is, however, challenging in the case of medical applications, and in particular for multiparametric datasets. For example, traditional geometric transformations used in several applications to generate synthetic images can modify in a non-realistic manner the patients' anatomy. Therefore, dedicated image generation techniques are necessary in the medical field to, for example, mimic a given pathology realistically. This paper introduces a new data augmentation architecture that generates synthetic multiparametric (T1 arterial, T1 portal, and T2) magnetic resonance images (MRI) of massive macrotrabecular subtype hepatocellular carcinoma with their corresponding tumor masks through a generative deep learning approach. The proposed architecture creates liver tumor masks and abdominal edges used as input in a Pix2Pix network for synthetic data creation. The method's efficiency is demonstrated by training it on a limited multiparametric dataset of MRI triplets from $89$ patients with liver lesions to generate $1,000$ synthetic triplets and their corresponding liver tumor masks. The resulting Frechet Inception Distance score was $86.55$. The proposed approach was among the winners of the 2021 data augmentation challenge organized by the French Society of Radiology.
You Shall not Pass: the Zero-Gradient Problem in Predict and Optimize for Convex Optimization
results: The paper identifies a weakness of this approach, the zero-gradient problem, proposes a solution based on the mathematical properties of differential optimization, and verifies it on two real-world benchmarks.Abstract
Predict and optimize is an increasingly popular decision-making paradigm that employs machine learning to predict unknown parameters of optimization problems. Instead of minimizing the prediction error of the parameters, it trains predictive models using task performance as a loss function. In the convex optimization domain, predict and optimize has seen significant progress due to recently developed methods for differentiating optimization problem solutions over the problem parameters. This paper identifies a yet unnoticed drawback of this approach -- the zero-gradient problem -- and introduces a method to solve it. The suggested method is based on the mathematical properties of differential optimization and is verified using two real-world benchmarks.
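The zero-gradient problem is easy to see in one dimension. In this tiny self-contained PyTorch example (ours, not the paper's), the solution of a constrained problem sits at the constraint boundary, so the solution map is locally constant in the predicted parameter and the task loss provides no gradient signal:

```python
import torch

# Optimal solution of  min_x (x - theta)^2  s.t.  x >= 0  is
# x*(theta) = max(theta, 0). For theta < 0 the solution sits at the
# boundary, so x* is locally constant in theta.
theta = torch.tensor(-1.0, requires_grad=True)   # predicted parameter
x_star = torch.clamp(theta, min=0.0)             # differentiable "solver"
task_loss = (x_star - 3.0) ** 2                  # downstream task loss
task_loss.backward()
print(theta.grad)                                # tensor(0.) -- no signal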
Predicting delays in Indian lower courts using AutoML and Decision Forests
for: A classification model that predicts delays in Indian lower courts based on case information available at filing.
methods: Uses AutoML to develop a multi-class classification model and binary decision forest classifiers to improve predictive accuracy.
results: The best model achieves 81.4% accuracy, with precision, recall, and F1 all at 0.81.Abstract
This paper presents a classification model that predicts delays in Indian lower courts based on case information available at filing. The model is built on a dataset of 4.2 million court cases filed in 2010 and their outcomes over a 10-year period. The data set is drawn from 7000+ lower courts in India. The authors employed AutoML to develop a multi-class classification model over all periods of pendency and then used binary decision forest classifiers to improve predictive accuracy for the classification of delays. The best model achieved an accuracy of 81.4%, and the precision, recall, and F1 were found to be 0.81. The study demonstrates the feasibility of AI models for predicting delays in Indian courts, based on relevant data points such as jurisdiction, court, judge, subject, and the parties involved. The paper also discusses the results in light of relevant literature and suggests areas for improvement and future research. The authors have made the dataset and Python code files used for the analysis available for further research in the crucial and contemporary field of Indian judicial reform.
zkDL: Efficient Zero-Knowledge Proofs of Deep Learning Training
results: zkDL generates complete and sound proofs for a 16-layer neural network with 200M parameters in under a minute per training step, with proof sizes below 20 kB, while preserving the privacy of data and model parameters.Abstract
The recent advancements in deep learning have brought about significant changes in various aspects of people's lives. Meanwhile, these rapid developments have raised concerns about the legitimacy of the training process of deep networks. However, to protect the intellectual properties of untrusted AI developers, directly examining the training process by accessing the model parameters and training data by verifiers is often prohibited. In response to this challenge, we present zkDL, an efficient zero-knowledge proof of deep learning training. At the core of zkDL is zkReLU, a specialized zero-knowledge proof protocol with optimized proving time and proof size for the ReLU activation function, a major obstacle in verifiable training due to its non-arithmetic nature. To integrate zkReLU into the proof system for the entire training process, we devise a novel construction of an arithmetic circuit from neural networks. By leveraging the abundant parallel computation resources, this construction reduces proving time and proof sizes by a factor of the network depth. As a result, zkDL enables the generation of complete and sound proofs, taking less than a minute with a size of less than 20 kB per training step, for a 16-layer neural network with 200M parameters, while ensuring the privacy of data and model parameters.
paper_authors: Alexander Geng, Ali Moghiseh, Claudia Redenbach, Katja Schladitz
for: Applying quantum machine learning to detect cracks in gray-value images.
methods: Uses quantum transfer learning and compares the execution efficiency of different quantum backends.
results: Quantum transfer learning can detect cracks in gray-value images effectively and can be realized in practice on different quantum backends.Abstract
Quantum computers possess the potential to process data using a remarkably reduced number of qubits compared to conventional bits, as per theoretical foundations. However, recent experiments have indicated that the practical feasibility of retrieving an image from its quantum encoded version is currently limited to very small image sizes. Despite this constraint, variational quantum machine learning algorithms can still be employed in the current noisy intermediate scale quantum (NISQ) era. An example is a hybrid quantum machine learning approach for edge detection. In our study, we present an application of quantum transfer learning for detecting cracks in gray value images. We compare the performance and training time of PennyLane's standard qubits with IBM's qasm\_simulator and real backends, offering insights into their execution efficiency.
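As a deliberately tiny illustration of the kind of variational circuit used in quantum transfer learning, here is a hedged PennyLane sketch: classical features from a frozen backbone are angle-encoded and a trainable entangling layer acts as the task-specific head. The templates, shapes, and values are illustrative, not the paper's setup; swapping the device string is how one moves between simulators and hardware backends.

```python
import pennylane as qml
from pennylane import numpy as np

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)  # swap for other backends

@qml.qnode(dev)
def head(inputs, weights):
    # Encode classical features (e.g., from a frozen CNN) as rotation angles.
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    # Trainable variational layers: the fine-tuned "head" of the model.
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

weights = np.array([[0.1] * n_qubits, [0.2] * n_qubits], requires_grad=True)
features = np.array([0.1, 0.5, 0.9, 0.3])
print(head(features, weights))   # 4 expectation values -> classifier inputs
```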
Conditioning Generative Latent Optimization to solve Imaging Inverse Problems
results: cGLO provides better reconstructions in sparse-view CT settings without requiring a backward operator, and its performance advantage grows for smaller training datasets and reduced projection angles.Abstract
Computed Tomography (CT) is a prominent example of Imaging Inverse Problem (IIP), highlighting the unrivalled performances of data-driven methods in degraded measurements setups like sparse X-ray projections. Although a significant proportion of deep learning approaches benefit from large supervised datasets to directly map experimental measurements to medical scans, they cannot generalize to unknown acquisition setups. In contrast, fully unsupervised techniques, most notably using score-based generative models, have recently demonstrated similar or better performances compared to supervised approaches to solve IIPs while being flexible at test time regarding the imaging setup. However, their use cases are limited by two factors: (a) they need considerable amounts of training data to have good generalization properties and (b) they require a backward operator, like Filtered-Back-Projection in the case of CT, to condition the learned prior distribution of medical scans to experimental measurements. To overcome these issues, we propose an unsupervised conditional approach to the Generative Latent Optimization framework (cGLO), in which the parameters of a decoder network are initialized on an unsupervised dataset. The decoder is then used for reconstruction purposes, by performing Generative Latent Optimization with a loss function directly comparing simulated measurements from proposed reconstructions to experimental measurements. The resulting approach, tested on sparse-view CT using multiple training dataset sizes, demonstrates better reconstruction quality compared to state-of-the-art score-based strategies in most data regimes and shows an increasing performance advantage for smaller training datasets and reduced projection angles. Furthermore, cGLO does not require any backward operator and could expand use cases even to non-linear IIPs.
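The core mechanism, generative latent optimization conditioned on measurements, reduces to optimizing a latent code so that simulated measurements of the decoded image match the experimental ones. A simplified sketch (the paper additionally pre-trains the decoder on an unsupervised dataset and may update its parameters; here the decoder is frozen and all names are illustrative):

```python
import torch

def glo_reconstruct(decoder, A, y, z_dim=64, steps=500, lr=1e-2):
    # decoder: pretrained generator z -> image (frozen here);
    # A: differentiable forward operator (e.g., sparse-view projection);
    # y: experimental measurements. No backward operator is needed:
    # the loss compares simulated to measured data directly.
    z = torch.zeros(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(A(decoder(z)), y)
        loss.backward()
        opt.step()
    return decoder(z).detach()
```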
Towards General Low-Light Raw Noise Synthesis and Modeling
paper_authors: Feng Zhang, Bin Xu, Zhiqiang Li, Xinran Liu, Qingbo Lu, Changxin Gao, Nong Sang
for: Addressing the problem of modeling and synthesizing low-light raw noise in computational photography and image processing applications.
methods: We propose a new method that synthesizes signal-dependent noise with a physics-based model and signal-independent noise with a learning-based generative model.
results: Our method can jointly learn the noise characteristics of different ISO levels, and a multi-scale Fourier transformer discriminator (FTD) distinguishes the noise distribution accurately. Experiments show that our method performs favorably against existing methods on different sensors.Abstract
Modeling and synthesizing low-light raw noise is a fundamental problem for computational photography and image processing applications. Although most recent works have adopted physics-based models to synthesize noise, the signal-independent noise in low-light conditions is far more complicated and varies dramatically across camera sensors, which is beyond the description of these models. To address this issue, we introduce a new perspective to synthesize the signal-independent noise by a generative model. Specifically, we synthesize the signal-dependent and signal-independent noise in a physics- and learning-based manner, respectively. In this way, our method can be considered as a general model, that is, it can simultaneously learn different noise characteristics for different ISO levels and generalize to various sensors. Subsequently, we present an effective multi-scale discriminator termed Fourier transformer discriminator (FTD) to distinguish the noise distribution accurately. Additionally, we collect a new low-light raw denoising (LRD) dataset for training and benchmarking. Qualitative validation shows that the noise generated by our proposed noise model can be highly similar to the real noise in terms of distribution. Furthermore, extensive denoising experiments demonstrate that our method performs favorably against state-of-the-art methods on different sensors.
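For context, the physics-based, signal-dependent half of such a model is typically a Poisson-Gaussian formulation: shot noise is Poisson in the photon count (variance proportional to the signal), read noise is approximately Gaussian. A minimal sketch (parameter values are illustrative; the paper replaces the Gaussian signal-independent term with a learned generative model):

```python
import numpy as np

def poisson_gaussian_noise(raw, k=0.01, sigma_r=0.002, rng=None):
    # k: system gain (digital units per photo-electron);
    # sigma_r: read-noise standard deviation in digital units.
    rng = rng or np.random.default_rng(0)
    photons = rng.poisson(np.clip(raw, 0, None) / k)    # signal-dependent shot noise
    read = rng.normal(0.0, sigma_r, size=raw.shape)     # signal-independent read noise
    return photons * k + read
```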
A hybrid approach for improving U-Net variants in medical image segmentation
results: The paper improves the accuracy and efficiency of medical image segmentation using an attention system and residual connections, with results demonstrated on skin lesion segmentation.Abstract
Medical image segmentation is vital to the area of medical imaging because it enables professionals to more accurately examine and understand the information offered by different imaging modalities. Medical image segmentation is the technique of splitting a medical image into various segments or regions of interest. The segmented images can be used for many purposes, including diagnosis, surgery planning, and therapy evaluation. In the initial phase of this research, the major focus was on reviewing existing deep-learning approaches, including MultiResUNet, Attention U-Net, the classical U-Net, and other variants. Most of these variants use attention feature vectors or maps, which dynamically add important weights to critical information, to increase accuracy, but their network parameter requirements are more stringent. They face problems such as overfitting, since their number of trainable parameters is very high, as is their inference time. Therefore, the aim of this research is to reduce the network parameter requirements using depthwise separable convolutions, while maintaining performance on medical image segmentation tasks such as skin lesion segmentation using an attention system and residual connections.
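The parameter saving comes from factorizing each convolution into a per-channel spatial filter followed by a 1x1 pointwise mix. A standard PyTorch building block of this kind (a generic sketch, not the paper's exact architecture):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # Parameter count drops from C_in*C_out*k*k (standard conv) to
    # C_in*k*k + C_in*C_out, reducing overfitting risk and inference time.
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```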
High Dynamic Range Image Reconstruction via Deep Explicit Polynomial Curve Estimation
results: Experiments show that the method generalizes well across different tone-mapping functions and achieves state-of-the-art performance.Abstract
Due to limited camera capacities, digital images usually have a narrower dynamic illumination range than real-world scene radiance. To resolve this problem, High Dynamic Range (HDR) reconstruction is proposed to recover the dynamic range to better represent real-world scenes. However, due to different physical imaging parameters, the tone-mapping functions between images and real radiance are highly diverse, which makes HDR reconstruction extremely challenging. Existing solutions cannot explicitly clarify a corresponding relationship between the tone-mapping function and the generated HDR image, but this relationship is vital when guiding the reconstruction of HDR images. To address this problem, we propose a method to explicitly estimate the tone mapping function and its corresponding HDR image in one network. Firstly, based on the characteristics of the tone mapping function, we construct a model by a polynomial to describe the trend of the tone curve. To fit this curve, we use a learnable network to estimate the coefficients of the polynomial. This curve will be automatically adjusted according to the tone space of the Low Dynamic Range (LDR) image, and reconstruct the real HDR image. Moreover, since no current dataset provides the corresponding relationship between the tone mapping function and the LDR image, we construct a new dataset with both synthetic and real images. Extensive experiments show that our method generalizes well under different tone-mapping functions and achieves SOTA performance.
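A minimal sketch of the explicit-polynomial idea (the degree, the coefficient predictor, and all names are illustrative assumptions, not the paper's architecture): a small network predicts per-image polynomial coefficients, and the curve is applied pixel-wise to invert the tone mapping.

```python
import torch
import torch.nn as nn

class PolyToneCurve(nn.Module):
    # Predicts coefficients a_k from the LDR image, then applies
    # hdr = sum_k a_k * ldr**k pixel-wise.
    def __init__(self, degree=5):
        super().__init__()
        self.degree = degree
        self.coeff_net = nn.Sequential(
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(3 * 8 * 8, 64), nn.ReLU(),
            nn.Linear(64, degree + 1),
        )

    def forward(self, ldr):                     # ldr in [0, 1], shape (B,3,H,W)
        a = self.coeff_net(ldr)                 # (B, degree+1)
        powers = torch.stack([ldr ** k for k in range(self.degree + 1)], dim=-1)
        return (powers * a.view(-1, 1, 1, 1, self.degree + 1)).sum(-1)
```

Because the curve is an explicit polynomial, the estimated coefficients themselves expose the recovered tone-mapping function, which is the relationship the abstract argues existing solutions leave implicit.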
DRAW: Defending Camera-shooted RAW against Image Manipulation
results: Demonstrates protection that withstands manipulation and post-processing on several well-known RAW datasets, and accurately localizes forged regions.Abstract
RAW files are the initial measurement of scene radiance widely used in most cameras, and the ubiquitously-used RGB images are converted from RAW data through Image Signal Processing (ISP) pipelines. Nowadays, digital images are risky of being nefariously manipulated. Inspired by the fact that innate immunity is the first line of body defense, we propose DRAW, a novel scheme of defending images against manipulation by protecting their sources, i.e., camera-shooted RAWs. Specifically, we design a lightweight Multi-frequency Partial Fusion Network (MPF-Net) friendly to devices with limited computing resources by frequency learning and partial feature fusion. It introduces invisible watermarks as protective signal into the RAW data. The protection capability can not only be transferred into the rendered RGB images regardless of the applied ISP pipeline, but also is resilient to post-processing operations such as blurring or compression. Once the image is manipulated, we can accurately identify the forged areas with a localization network. Extensive experiments on several famous RAW datasets, e.g., RAISE, FiveK and SIDD, indicate the effectiveness of our method. We hope that this technique can be used in future cameras as an option for image protection, which could effectively restrict image manipulation at the source.
Multi-modal Graph Neural Network for Early Diagnosis of Alzheimer’s Disease from sMRI and PET Scans
results: Experiments show that the proposed multi-modal fusion approach improves AD diagnosis performance and illustrates how different graph deep-learning methods perform in each branch. The study also provides a technical reference supporting the need for multivariate multi-modal diagnosis methods.Abstract
In recent years, deep learning models have been applied to neuroimaging data for early diagnosis of Alzheimer's disease (AD). Structural magnetic resonance imaging (sMRI) and positron emission tomography (PET) images provide structural and functional information about the brain, respectively. Combining these features leads to better performance than using a single modality alone when building predictive models for AD diagnosis. However, current multi-modal deep learning approaches based on sMRI and PET are mostly limited to convolutional neural networks, which do not facilitate the integration of both image and phenotypic information of subjects. We propose to use graph neural networks (GNN), which are designed to deal with problems in non-Euclidean domains. In this study, we demonstrate how brain networks can be created from sMRI or PET images and used in a population-graph framework that combines phenotypic information with the imaging features of these brain networks. We then present a multi-modal GNN framework in which each modality has its own GNN branch, and propose a technique to combine the multi-modal data at the level of both node vectors and adjacency matrices. Finally, we perform late fusion to combine the preliminary decisions made in each branch and produce a final prediction. As multi-modal data become available, multi-source, multi-modal approaches are the trend for AD diagnosis. We conducted exploratory experiments on multi-modal imaging data combined with non-imaging phenotypic information for AD diagnosis and analyzed the impact of phenotypic information on diagnostic performance. The results demonstrate that our proposed multi-modal approach improves performance for AD diagnosis, and this study also provides a technical reference supporting the need for multivariate multi-modal diagnosis methods.
Cardiac MRI Orientation Recognition and Standardization using Deep Neural Networks
for: Addressing the challenge of imaging orientation in cardiac MRI and presenting a deep learning-based method for categorizing and standardizing the orientation.
methods: Employs deep neural networks to categorize and standardize the orientation of cardiac MRI images, and proposes a transfer learning strategy to adapt the model to diverse modalities.
results: Validation accuracies of 100.0%, 100.0%, and 99.4% on CMR images from various modalities, including bSSFP, T2, and LGE, confirming the robustness and effectiveness of the proposed method.Abstract
Orientation recognition and standardization play a crucial role in the effectiveness of medical image processing tasks. Deep learning-based methods have proven highly advantageous in orientation recognition and prediction tasks. In this paper, we address the challenge of imaging orientation in cardiac MRI and present a method that employs deep neural networks to categorize and standardize the orientation. To cater to multiple sequences and modalities of MRI, we propose a transfer learning strategy, enabling adaptation of our model from a single modality to diverse modalities. We conducted comprehensive experiments on CMR images from various modalities, including bSSFP, T2, and LGE. The validation accuracies achieved were 100.0\%, 100.0\%, and 99.4\%, confirming the robustness and effectiveness of our model. Our source code and network models are available at https://github.com/rxzhen/MSCMR-orient
An objective validation of polyp and instrument segmentation methods in colonoscopy through Medico 2020 polyp segmentation and MedAI 2021 transparency challenges
paper_authors: Debesh Jha, Vanshali Sharma, Debapriya Banik, Debayan Bhattacharya, Kaushiki Roy, Steven A. Hicks, Nikhil Kumar Tomar, Vajira Thambawita, Adrian Krenzer, Ge-Peng Ji, Sahadev Poudel, George Batchkala, Saruar Alam, Awadelrahman M. A. Ahmed, Quoc-Huy Trinh, Zeshan Khan, Tien-Phat Nguyen, Shruti Shrestha, Sabari Nathan, Jeonghwan Gwak, Ritika K. Jha, Zheyuan Zhang, Alexander Schlaefer, Debotosh Bhattacharjee, M. K. Bhuyan, Pradip K. Das, Sravanthi Parsa, Sharib Ali, Michael A. Riegler, Pål Halvorsen, Ulas Bagci, Thomas De Lange
results: Deep learning algorithms can improve polyp detection rates and can provide readable, understandable explanations for use in clinical practice.Abstract
Automatic analysis of colonoscopy images has been an active field of research motivated by the importance of early detection of precancerous polyps. However, detecting polyps during the live examination can be challenging due to various factors such as variation of skills and experience among the endoscopists, lack of attentiveness, and fatigue leading to a high polyp miss-rate. Deep learning has emerged as a promising solution to this challenge as it can assist endoscopists in detecting and classifying overlooked polyps and abnormalities in real time. In addition to the algorithm's accuracy, transparency and interpretability are crucial to explaining the whys and hows of the algorithm's prediction. Further, most algorithms are developed on private data, closed source, or proprietary software, and methods lack reproducibility. Therefore, to promote the development of efficient and transparent methods, we have organized the "Medico automatic polyp segmentation (Medico 2020)" and "MedAI: Transparency in Medical Image Segmentation (MedAI 2021)" competitions. We present a comprehensive summary and analyze each contribution, highlight the strength of the best-performing methods, and discuss the possibility of clinical translation of such methods into the clinic. For the transparency task, a multi-disciplinary team, including expert gastroenterologists, assessed each submission and evaluated each team based on open-source practices, failure case analysis, ablation studies, and the usability and understandability of evaluations to gain a deeper understanding of the models' credibility for clinical deployment. Through the comprehensive analysis of the challenge, we not only highlight the advancements in polyp and surgical instrument segmentation but also encourage qualitative evaluation for building more transparent and understandable AI-based colonoscopy systems.
results: Experimental results show that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at \url{https://hiervst.github.io/}.
Abstract
Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present HierVST, a hierarchical adaptive end-to-end zero-shot VST model. Without any text transcripts, we only use the speech dataset to train the model by utilizing hierarchical variational inference and self-supervised representation. In addition, we adopt a hierarchical adaptive generator that generates the pitch representation and waveform audio sequentially. Moreover, we utilize unconditional generation to improve the speaker-relative acoustic capacity in the acoustic representation. With a hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively. The experimental results demonstrate that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at \url{https://hiervst.github.io/}.
ÌròyìnSpeech: A multi-purpose Yorùbá Speech Corpus
results: The dataset has 38.5 hours of data in total, recorded by 80 volunteers.
Abstract
We introduce the \`{I}r\`{o}y\`{i}nSpeech corpus -- a new dataset influenced by a desire to increase the amount of high quality, freely available, contemporary Yor\`{u}b\'{a} speech. We release a multi-purpose dataset that can be used for both TTS and ASR tasks. We curated text sentences from the news and creative writing domains under an open license i.e., CC-BY-4.0 and had multiple speakers record each sentence. We provide 5000 of our utterances to the Common Voice platform to crowdsource transcriptions online. The dataset has 38.5 hours of data in total, recorded by 80 volunteers.
results: Experiments on the MMWHS dataset show that our method outperforms state-of-the-art semi-supervised segmentation methods. Moreover, it achieves results comparable to the fully supervised upper bound.
Abstract
Medical image segmentation typically necessitates a large and precisely annotated dataset. However, obtaining pixel-wise annotation is a labor-intensive task that requires significant effort from domain experts, making it challenging to obtain in practical clinical scenarios. In such situations, reducing the amount of annotation required is a more practical approach. One feasible direction is sparse annotation, which involves annotating only a few slices, and has several advantages over traditional weak annotation methods such as bounding boxes and scribbles, as it preserves exact boundaries. However, learning from sparse annotation is challenging due to the scarcity of supervision signals. To address this issue, we propose a framework that can robustly learn from sparse annotation using the cross-teaching of both 3D and 2D networks. Considering the characteristic of these networks, we develop two pseudo label selection strategies, which are hard-soft confidence threshold and consistent label fusion. Our experimental results on the MMWHS dataset demonstrate that our method outperforms the state-of-the-art (SOTA) semi-supervised segmentation methods. Moreover, our approach achieves results that are comparable to the fully-supervised upper bound result.
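To make the two pseudo-label selection strategies concrete, here is a minimal sketch of how a hard-soft confidence threshold and consistent label fusion between the 2D and 3D networks could look. The thresholds, tensor shapes, and the 50/50 fusion weight are illustrative assumptions, not the paper's exact design.

```python
# Sketch of pseudo-label selection for cross-teaching between a 2D and a
# 3D segmentation network. Thresholds and shapes are assumptions.
import torch

def select_pseudo_labels(probs_2d, probs_3d, hard_th=0.9, soft_th=0.6):
    """probs_*: (C, D, H, W) softmax outputs of the 2D and 3D networks."""
    conf_2d, label_2d = probs_2d.max(dim=0)
    conf_3d, label_3d = probs_3d.max(dim=0)

    agree = label_2d == label_3d

    # Hard region: both networks agree and are confident -> trusted hard label.
    hard_mask = agree & (conf_2d > hard_th) & (conf_3d > hard_th)

    # Soft region: moderately confident -> fuse the two predictions into a
    # soft target instead of discarding the voxel entirely.
    soft_mask = (~hard_mask) & (conf_2d > soft_th) & (conf_3d > soft_th)
    soft_target = 0.5 * (probs_2d + probs_3d)

    return label_2d, hard_mask, soft_target, soft_mask

# Usage: supervise each network with the other's selected labels
# (cross-teaching), e.g. CE loss on hard_mask voxels, KL on soft ones.
```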
Count, Decode and Fetch: A New Approach to Handwritten Chinese Character Error Correction
paper_authors: Pengfei Hu, Jiefeng Ma, Zhenrong Zhang, Jun Du, Jianshu Zhang
for: Improving handwritten Chinese character error correction
methods: Uses a counter, a decoder, and a fetcher
results: Improves error-correction performance, with better generalization to unseen misspelled characters.
Abstract
Recently, handwritten Chinese character error correction has been greatly improved by employing encoder-decoder methods to decompose a Chinese character into an ideographic description sequence (IDS). However, existing methods implicitly capture and encode linguistic information inherent in IDS sequences, leading to a tendency to generate IDS sequences that match seen characters. This poses a challenge when dealing with an unseen misspelled character, as the decoder may generate an IDS sequence that matches a seen character instead. Therefore, we introduce Count, Decode and Fetch (CDF), a novel approach that exhibits better generalization towards unseen misspelled characters. CDF is mainly composed of three parts: the counter, the decoder, and the fetcher. In the first stage, the counter predicts the number of each radical class without the symbol-level position annotations. In the second stage, the decoder employs the counting information and generates the IDS sequence step by step. Moreover, by updating the counting information at each time step, the decoder becomes aware of the existence of each radical. With the decomposed IDS sequence, we can determine whether the given character is misspelled. If it is misspelled, the fetcher under the transductive transfer learning strategy predicts the ideal character that the user originally intended to write. We integrate our method into existing encoder-decoder models and significantly enhance their performance.
SR-R$^2$KAC: Improving Single Image Defocus Deblurring
paper_authors: Peng Tang, Zhiqiang Xu, Pengfei Wei, Xiaobin Hu, Peilin Zhao, Xin Cao, Chunlai Zhou, Tobias Lasser
for: Proposes an efficient deep learning method for single image defocus deblurring (SIDD) that addresses the issue of large blurs.
methods: The proposed R$^2$KAC method builds on inverse kernel properties, combining kernel-sharing atrous convolutions with recursive atrous convolutions to simulate a large inverse kernel, with identity shortcuts to alleviate ringing artifacts and a scale recurrent module to exploit multi-scale information.
results: Achieves state-of-the-art performance on SIDD tasks, outperforming existing methods.
Abstract
We propose an efficient deep learning method for single image defocus deblurring (SIDD) by further exploring inverse kernel properties. Although the current inverse kernel method, i.e., kernel-sharing parallel atrous convolution (KPAC), can address spatially varying defocus blurs, it has difficulty in handling large blurs of this kind. To tackle this issue, we propose a Residual and Recursive Kernel-sharing Atrous Convolution (R$^2$KAC). R$^2$KAC builds on a significant observation of inverse kernels, that is, successive use of inverse-kernel-based deconvolutions with fixed size helps remove unexpected large blurs but produces ringing artifacts. Specifically, on top of kernel-sharing atrous convolutions used to simulate multi-scale inverse kernels, R$^2$KAC stacks atrous convolutions recursively to simulate a large inverse kernel. To further alleviate the contingent effect of recursive stacking, i.e., ringing artifacts, we add identity shortcuts between atrous convolutions to simulate residual deconvolutions. Lastly, a scale recurrent module is embedded in the R$^2$KAC network, leading to SR-R$^2$KAC, so that multi-scale information from coarse to fine is exploited to progressively remove the spatially varying defocus blurs. Extensive experimental results show that our method achieves the state-of-the-art performance.
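A minimal sketch of the core recursive idea, for intuition only: one shared atrous convolution applied recursively, each recursion wrapped in an identity shortcut. The dilation rate, recursion depth, and channel width are assumptions; the full SR-R$^2$KAC additionally uses kernel-sharing parallel branches and a scale recurrent module.

```python
# Illustrative sketch: a shared atrous conv applied recursively with
# identity shortcuts, approximating residual deconvolutions with a
# large effective inverse kernel. Hyperparameters are assumptions.
import torch
import torch.nn as nn

class RecursiveAtrousBlock(nn.Module):
    def __init__(self, channels, dilation=2, num_recursions=3):
        super().__init__()
        # One shared atrous conv reused at every recursion (kernel sharing).
        self.shared = nn.Conv2d(channels, channels, kernel_size=3,
                                padding=dilation, dilation=dilation)
        self.act = nn.ReLU(inplace=True)
        self.num_recursions = num_recursions

    def forward(self, x):
        out = x
        for _ in range(self.num_recursions):
            # Identity shortcut around each recursion to suppress the
            # ringing artifacts that repeated deconvolutions tend to cause.
            out = out + self.act(self.shared(out))
        return out

x = torch.randn(1, 32, 64, 64)
print(RecursiveAtrousBlock(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```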
InfoStyler: Disentanglement Information Bottleneck for Artistic Style Transfer
results: Compared with conventional transfer-module approaches, InfoStyler better preserves content structure while enriching style patterns. Experiments show that InfoStyler can synthesize high-quality style-transferred images.
Abstract
Artistic style transfer aims to transfer the style of an artwork to a photograph while maintaining its original overall content. Many prior works focus on designing various transfer modules to transfer the style statistics to the content image. Although effective, these methods neglect a clear disentanglement of content features and style features from the outset, and therefore have difficulty balancing content preservation against style transfer. To tackle this problem, we propose a novel information disentanglement method, named InfoStyler, to capture the minimal sufficient information for both content and style representations from the pre-trained encoding network. InfoStyler formulates the disentanglement representation learning as an information compression problem by eliminating style statistics from the content image and removing the content structure from the style image. Besides, to further facilitate disentanglement learning, a cross-domain Information Bottleneck (IB) learning strategy is proposed by reconstructing the content and style domains. Extensive experiments demonstrate that our InfoStyler can synthesize high-quality stylized images while balancing content structure preservation and style pattern richness.
ScribbleVC: Scribble-supervised Medical Image Segmentation with Vision-Class Embedding
results: We evaluate ScribbleVC on three benchmark datasets and compare it with existing methods. The results show that our method outperforms existing approaches in accuracy, robustness, and efficiency.
Abstract
Medical image segmentation plays a critical role in clinical decision-making, treatment planning, and disease monitoring. However, accurate segmentation of medical images is challenging due to several factors, such as the lack of high-quality annotation, imaging noise, and anatomical differences across patients. In addition, there is still a considerable gap in performance between the existing label-efficient methods and fully-supervised methods. To address the above challenges, we propose ScribbleVC, a novel framework for scribble-supervised medical image segmentation that leverages vision and class embeddings via the multimodal information enhancement mechanism. In addition, ScribbleVC uniformly utilizes the CNN features and Transformer features to achieve better visual feature extraction. The proposed method combines a scribble-based approach with a segmentation network and a class-embedding module to produce accurate segmentation masks. We evaluate ScribbleVC on three benchmark datasets and compare it with state-of-the-art methods. The experimental results demonstrate that our method outperforms existing approaches in terms of accuracy, robustness, and efficiency. The datasets and code are released on GitHub.
Unsupervised Decomposition Networks for Bias Field Correction in MR Image
for: The paper proposes a novel unsupervised decomposition network to correct bias fields in magnetic resonance (MR) images, which are degraded by intensity inhomogeneity caused by imperfect MR devices or imaged objects.
methods: The proposed method consists of a segmentation part and an estimation part, which are optimized alternately. The segmentation part predicts the probability of every pixel belonging to each class, while the estimation part calculates the bias field. The loss functions combine fuzzy clustering with the multiplicative bias field, introducing smoothness of the bias field and constructing soft relationships among different classes under intra-consistency constraints.
results: The proposed method can accurately estimate bias fields and produce better bias correction results, as demonstrated by extensive experiments. The code is available at: https://github.com/LeongDong/Bias-Decomposition-Networks.
Abstract
Bias field, which is caused by imperfect MR devices or imaged objects, introduces intensity inhomogeneity into MR images and degrades the performance of MR image analysis methods. Many retrospective algorithms have been developed to facilitate bias correction, among which deep learning-based methods have performed best. However, in the training phase, supervised deep learning-based methods rely heavily on synthesized bias fields. As the formation of the bias field is extremely complex, it is difficult to mimic the true physical properties of MR images with synthesized data. While bias field correction and image segmentation are strongly related, the segmentation map is precisely obtained by decoupling the bias field from the original MR image, and conversely, the segmentation map indicates the bias value. Thus, we propose novel unsupervised decomposition networks that are trained only with biased data to obtain bias-free MR images. The networks are made up of a segmentation part, which predicts the probability of every pixel belonging to each class, and an estimation part, which calculates the bias field; the two parts are optimized alternately. Furthermore, loss functions based on the combination of fuzzy clustering and the multiplicative bias field are devised. The proposed loss functions introduce smoothness of the bias field and construct soft relationships among different classes under intra-consistency constraints. Extensive experiments demonstrate that the proposed method can accurately estimate bias fields and produce better bias correction results. The code is available at: https://github.com/LeongDong/Bias-Decomposition-Networks.
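For orientation, one standard way to write the multiplicative image model and a fuzzy-clustering energy of the kind described is shown below; the paper's exact loss terms may differ. Here $I$ is the observed image, $b$ the bias field, $J$ the bias-free image, $u_k$ the fuzzy membership of class $k$ (the segmentation part's output), $c_k$ the class centroid, $q>1$ the fuzzifier, and $\lambda$ the smoothness weight.

```latex
% A standard formulation of the multiplicative bias model and a
% fuzzy-clustering energy with bias-field smoothness; for reference only.
I(x) = b(x)\,J(x), \qquad
E(u, c, b) = \sum_{k=1}^{K} \sum_{x \in \Omega}
  u_k(x)^{q}\,\bigl(I(x) - b(x)\,c_k\bigr)^2
  + \lambda \sum_{x \in \Omega} \lVert \nabla b(x) \rVert^2 ,
\qquad \sum_{k=1}^{K} u_k(x) = 1 .
```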
Mesh Density Adaptation for Template-based Shape Reconstruction
results: Comparing reconstruction results with and without mesh density adaptation shows that mesh density adaptation improves reconstruction accuracy.
Abstract
In 3D shape reconstruction based on template mesh deformation, a regularization, such as smoothness energy, is employed to guide the reconstruction into a desirable direction. In this paper, we highlight an often overlooked property in the regularization: the vertex density in the mesh. Without careful control on the density, the reconstruction may suffer from under-sampling of vertices near shape details. We propose a novel mesh density adaptation method to resolve the under-sampling problem. Our mesh density adaptation energy increases the density of vertices near complex structures via deformation to help reconstruction of shape details. We demonstrate the usability and performance of mesh density adaptation with two tasks, inverse rendering and non-rigid surface registration. Our method produces more accurate reconstruction results compared to the cases without mesh density adaptation.
Open-Set Domain Adaptation with Visual-Language Foundation Models
results: Using CLIP achieves state-of-the-art ODA results and helps ODA models better identify unknown classes in the target domain.
Abstract
Unsupervised domain adaptation (UDA) has proven to be very effective in transferring knowledge obtained from a source domain with labeled data to a target domain with unlabeled data. Owing to the lack of labeled data in the target domain and the possible presence of unknown classes, open-set domain adaptation (ODA) has emerged as a potential solution to identify these classes during the training phase. Although existing ODA approaches aim to solve the distribution shifts between the source and target domains, most methods fine-tune ImageNet pre-trained models on the source domain and then adapt them to the target domain. Recent visual-language foundation models (VLFM), such as Contrastive Language-Image Pre-Training (CLIP), are robust to many distribution shifts and, therefore, should substantially improve the performance of ODA. In this work, we explore generic ways to adopt CLIP, a popular VLFM, for ODA. We investigate the performance of zero-shot prediction using CLIP, and then propose an entropy optimization strategy to assist the ODA models with the outputs of CLIP. The proposed approach achieves state-of-the-art results on various benchmarks, demonstrating its effectiveness in addressing the ODA problem.
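A hedged sketch of the two ingredients named in the abstract, CLIP zero-shot prediction over the known class names and an entropy score for separating likely-known from unknown target samples, is given below. It uses the public OpenAI CLIP API; the class names and the half-of-maximum-entropy threshold are placeholders, not the paper's actual rule.

```python
# Sketch: CLIP zero-shot prediction plus an entropy-based known/unknown
# split. Class names and threshold are illustrative assumptions.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["bicycle", "bus", "car"]  # known (source) classes; placeholder
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

@torch.no_grad()
def zero_shot(images):
    img_f = model.encode_image(images)
    txt_f = model.encode_text(text)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (100.0 * img_f @ txt_f.T).softmax(dim=-1)

def split_known_unknown(probs):
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)
    max_entropy = torch.log(torch.tensor(float(probs.shape[-1])))
    is_unknown = entropy > 0.5 * max_entropy  # high entropy -> likely unknown
    return is_unknown, entropy
```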
Deep Convolutional Neural Networks with Zero-Padding: Feature Extraction and Learning
results: The paper derives the universal consistency of DCNNs with zero-padding and their translation-invariance in the learning process. These conclusions are verified by numerical experiments, including toy simulations and real-data runs.
Abstract
This paper studies the performance of deep convolutional neural networks (DCNNs) with zero-padding in feature extraction and learning. After verifying the role of zero-padding in enabling translation-equivariance and the translation-invariance-driven nature of pooling, we show that, with a similar number of free parameters, any deep fully connected network (DFCN) can be represented by a DCNN with zero-padding. This demonstrates that DCNNs with zero-padding are essentially better than DFCNs in feature extraction. Consequently, we derive the universal consistency of DCNNs with zero-padding and show their translation-invariance in the learning process. All our theoretical results are verified by numerical experiments, including both toy simulations and real-data runs.
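The translation-equivariance claim is easy to check numerically: with zero padding, shifting the input and then convolving agrees with convolving and then shifting, as long as the signal's support stays away from the borders. A tiny demo (ours, not the paper's code):

```python
# Numerical check of translation-equivariance under zero padding.
import numpy as np

def shift_right(x, k=1):
    return np.concatenate([np.zeros(k), x[:-k]])

x = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0, 0.0])  # zero near borders
kernel = np.array([1.0, -1.0, 2.0])

conv = lambda v: np.convolve(v, kernel, mode="same")  # 'same' = zero padding

lhs = conv(shift_right(x))   # shift then convolve
rhs = shift_right(conv(x))   # convolve then shift
print(np.allclose(lhs, rhs))  # True while the support stays inside the array
```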
Gastrointestinal Mucosal Problems Classification with Deep Learning
paper_authors: Mohammadhasan Goharian, Vahid Goharian, Hamidreza Bolhasani
for: Early diagnosis of gastrointestinal mucosal changes to prevent cancers and provide early treatment.
methods: Deep learning, specifically Transfer Learning (TL) based on Convolutional Neural Networks (CNNs), was used to predict 8 classes of mucosal changes and anatomical landmarks from endoscopy images.
results: The best model achieved 93% accuracy on test images and was applied to real endoscopy and colonoscopy videos to classify problems.
Abstract
Gastrointestinal mucosal changes can cause cancers after some years, and diagnosing them early can be very useful for preventing cancers and enabling early treatment. In this article, 8 classes of mucosal changes and anatomical landmarks, including Polyps, Ulcerative Colitis, Esophagitis, Normal Z-Line, Normal Pylorus, Normal Cecum, Dyed Lifted Polyps, and Dyed Lifted Margin, were predicted by deep learning. We used neural networks, black-box artificial intelligence algorithms loosely inspired by the human nervous system. Specifically, we used Transfer Learning (TL) based on Convolutional Neural Networks (CNNs), one of the well-known types of neural networks in image processing. We compared several famous CNN architectures, including VGG, Inception, Xception, and ResNet. Our best model achieved 93% accuracy on test images. Finally, we applied our model to real endoscopy and colonoscopy videos to classify problems.
Unified Model for Image, Video, Audio and Language Tasks
results: The model shows competitive performance on multiple image- and video-text tasks, and the weights of models trained on different multimodal tasks can be interpolated to improve feature representations.
Abstract
Large Language Models (LLMs) have made the ambitious quest for generalist agents significantly far from being a fantasy. A key hurdle for building such general models is the diversity and heterogeneity of tasks and modalities. A promising solution is unification, allowing the support of a myriad of tasks and modalities within one unified framework. While few large models (e.g., Flamingo (Alayrac et al., 2022), trained on massive datasets, can support more than two modalities, current small to mid-scale unified models are still limited to 2 modalities, usually image-text or video-text. The question that we ask is: is it possible to build efficiently a unified model that can support all modalities? To answer this, we propose UnIVAL, a step further towards this ambitious goal. Without relying on fancy datasets sizes or models with billions of parameters, the ~ 0.25B parameter UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model. Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning. UnIVAL shows competitive performance to existing state-of-the-art approaches, across image and video-text tasks. The feature representations learned from image and video-text modalities, allows the model to achieve competitive performance when finetuned on audio-text tasks, despite not being pretrained on audio. Thanks to the unified model, we propose a novel study on multimodal model merging via weight interpolation of models trained on different multimodal tasks, showing their benefits in particular for out-of-distribution generalization. Finally, we motivate unification by showing the synergy between tasks. The model weights and code are released here: https://github.com/mshukor/UnIVAL.
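The weight-interpolation study mentioned above reduces to a very simple operation; the sketch below shows linear interpolation of two state dicts, with checkpoint names invented for illustration.

```python
# Minimal sketch of multimodal model merging via weight interpolation.
import torch

def interpolate_state_dicts(sd_a, sd_b, lam=0.5):
    # Note: integer buffers (e.g. BatchNorm counters) may need special-casing.
    assert sd_a.keys() == sd_b.keys()
    return {k: lam * sd_a[k] + (1.0 - lam) * sd_b[k] for k in sd_a}

# sd_image = torch.load("unival_image_text.pt")   # hypothetical checkpoints
# sd_video = torch.load("unival_video_text.pt")
# merged = interpolate_state_dicts(sd_image, sd_video, lam=0.5)
# model.load_state_dict(merged)  # then evaluate, e.g. for OOD generalization
```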
results: Experimental evaluation shows that, compared to the baselines, our method can generate high-quality details.
Abstract
In this paper, we study Text-to-3D content generation leveraging 2D diffusion priors to enhance the quality and detail of the generated 3D models. Recent progress (Magic3D) in text-to-3D has shown that employing high-resolution (e.g., 512 x 512) renderings can lead to the production of high-quality 3D models using latent diffusion priors. To enable rendering at even higher resolutions, which has the potential to further augment the quality and detail of the models, we propose a novel approach that combines multiple noise estimation processes with a pretrained 2D diffusion prior. Distinct from the study by Bar-Tal et al., which binds multiple denoised results to generate images from texts, our approach integrates the computation of scoring distillation losses such as the SDS loss and the VSD loss, which are essential techniques for 3D content generation with 2D diffusion priors. We experimentally evaluated the proposed approach. The results show that it can generate high-quality details compared to the baselines.
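For reference, the widely used score distillation sampling (SDS) gradient introduced by DreamFusion, which this line of work builds on (alongside the VSD loss), can be written as follows, where $g(\theta)$ is the differentiable renderer, $\hat{\epsilon}_\phi$ the pretrained 2D diffusion model's noise prediction, $y$ the text condition, and $w(t)$ a timestep weighting:

```latex
% The standard SDS gradient (DreamFusion), shown for reference.
\nabla_{\theta} \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\bigr)
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = g(\theta),\quad x_t = \alpha_t x + \sigma_t \epsilon .
```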
Fusing VHR Post-disaster Aerial Imagery and LiDAR Data for Roof Classification in the Caribbean
paper_authors: Isabelle Tingzon, Nuala Margaret Cowan, Pierre Chrzanowski
for: Helping governments produce building information more quickly to improve regional risk management and disaster response.
methods: Uses deep learning to automatically classify very high-resolution orthophotos and airborne LiDAR data, obtaining building characteristics in Dominica following Hurricane Maria.
results: Our method achieves F1 scores of 0.93 and 0.92 for roof type and roof material classification, respectively, and fusing multimodal earth observation data yields higher accuracy than any single source.
Abstract
Accurate and up-to-date information on building characteristics is essential for vulnerability assessment; however, the high costs and long timeframes associated with conducting traditional field surveys can be an obstacle to obtaining critical exposure datasets needed for disaster risk management. In this work, we leverage deep learning techniques for the automated classification of roof characteristics from very high-resolution orthophotos and airborne LiDAR data obtained in Dominica following Hurricane Maria in 2017. We demonstrate that the fusion of multimodal earth observation data performs better than using any single data source alone. Using our proposed methods, we achieve F1 scores of 0.93 and 0.92 for roof type and roof material classification, respectively. This work is intended to help governments produce more timely building information to improve resilience and disaster response in the Caribbean.
InvVis: Large-Scale Data Embedding for Invertible Visualization
results: Experimental results show that InvVis achieves high-quality data concealing and revealing, can embed a large amount of data into visualization images, and can restore or further modify those images.
Abstract
We present InvVis, a new approach for invertible visualization, which reconstructs or further modifies a visualization from an image. InvVis allows the embedding of a significant amount of data, such as chart data, chart information, source code, etc., into visualization images. The encoded image is perceptually indistinguishable from the original one. We propose a new method to efficiently express chart data in the form of images, enabling large-capacity data embedding. We also outline a model based on the invertible neural network to achieve high-quality data concealing and revealing. We explore and implement a variety of application scenarios of InvVis. Additionally, we conduct a series of evaluation experiments to assess our method from multiple perspectives, including data embedding quality, data restoration accuracy, data encoding capacity, etc. The result of our experiments demonstrates the great potential of InvVis in invertible visualization.
for: Improving image resolution without prior knowledge of the degradation process
methods: Introduces StarSRGAN, a novel GAN model for blind super-resolution that utilizes five different architectures
results: Provides new SOTA performance, roughly 10% better than Real-ESRGAN on the MANIQA and AHIQ measures, while StarSRGAN Lite offers a real-time SR experience.
Abstract
The aim of blind super-resolution (SR) in computer vision is to improve the resolution of an image without prior knowledge of the degradation process that caused the image to be low-resolution. The State of the Art (SOTA) model Real-ESRGAN has advanced perceptual loss and produced visually compelling outcomes using more complex degradation models to simulate real-world degradations. However, there is still room to improve the super-resolved quality of Real-ESRGAN by implementing recent techniques. This research paper introduces StarSRGAN, a novel GAN model designed for blind super-resolution tasks that utilize 5 various architectures. Our model provides new SOTA performance with roughly 10% better on the MANIQA and AHIQ measures, as demonstrated by experimental comparisons with Real-ESRGAN. In addition, as a compact version, StarSRGAN Lite provides approximately 7.5 times faster reconstruction speed (real-time upsampling from 540p to 4K) but can still keep nearly 90% of image quality, thereby facilitating the development of a real-time SR experience for future research. Our codes are released at https://github.com/kynthesis/StarSRGAN.
Motion Degeneracy in Self-supervised Learning of Elevation Angle Estimation for 2D Forward-Looking Sonar
results: Simulations and real experiments validate the proposed method and show stable self-supervised learning.
Abstract
2D forward-looking sonar is a crucial sensor for underwater robotic perception. A well-known problem in this field is estimating missing information in the elevation direction during sonar imaging. There are demands to estimate 3D information per image for 3D mapping and robot navigation during fly-through missions. Recent learning-based methods have demonstrated their strengths, but there are still drawbacks. Supervised learning methods have achieved high-quality results but may require further efforts to acquire 3D ground-truth labels. The existing self-supervised method requires pretraining using synthetic images with 3D supervision. This study aims to realize stable self-supervised learning of elevation angle estimation without pretraining using synthetic images. Failures during self-supervised learning may be caused by motion degeneracy problems. We first analyze the motion field of 2D forward-looking sonar, which is related to the main supervision signal. We utilize a modern learning framework and prove that if the training dataset is built with effective motions, the network can be trained in a self-supervised manner without the knowledge of synthetic data. Both simulation and real experiments validate the proposed method.
results: Experiments show that StylePrompter strikes a balance between reconstruction quality and editability, and is "smart" enough to fit most editing tasks, outperforming other $\mathcal{F}$-involved inversion methods.
Abstract
GAN inversion aims at inverting given images into corresponding latent codes for Generative Adversarial Networks (GANs), especially StyleGAN where exists a disentangled latent space that allows attribute-based image manipulation at latent level. As most inversion methods build upon Convolutional Neural Networks (CNNs), we transfer a hierarchical vision Transformer backbone innovatively to predict $\mathcal{W^+}$ latent codes at token level. We further apply a Style-driven Multi-scale Adaptive Refinement Transformer (SMART) in $\mathcal{F}$ space to refine the intermediate style features of the generator. By treating style features as queries to retrieve lost identity information from the encoder's feature maps, SMART can not only produce high-quality inverted images but also surprisingly adapt to editing tasks. We then prove that StylePrompter lies in a more disentangled $\mathcal{W^+}$ and show the controllability of SMART. Finally, quantitative and qualitative experiments demonstrate that StylePrompter can achieve desirable performance in balancing reconstruction quality and editability, and is "smart" enough to fit into most edits, outperforming other $\mathcal{F}$-involved inversion methods.
results: Experimental results on three benchmarks show that the proposed method generates interpolated frames with better visual quality than state-of-the-art methods.
Abstract
Video frame interpolation has been actively studied with the development of convolutional neural networks. However, due to the intrinsic limitations of kernel weight sharing in convolution, the interpolated frame generated by it may lose details. In contrast, the attention mechanism in Transformer can better distinguish the contribution of each pixel, and it can also capture long-range pixel dependencies, which provides great potential for video interpolation. Nevertheless, the original Transformer is commonly used for 2D images; how to develop a Transformer-based framework with consideration of temporal self-attention for video frame interpolation remains an open issue. In this paper, we propose Video Frame Interpolation Flow Transformer to incorporate motion dynamics from optical flows into the self-attention mechanism. Specifically, we design a Flow Transformer Block that calculates the temporal self-attention in a matched local area with the guidance of flow, making our framework suitable for interpolating frames with large motion while maintaining reasonably low complexity. In addition, we construct a multi-scale architecture to account for multi-scale motion, further improving the overall performance. Extensive experiments on three benchmarks demonstrate that the proposed method can generate interpolated frames with better visual quality than state-of-the-art methods.
Structure-Preserving Synthesis: MaskGAN for Unpaired MR-CT Translation
paper_authors: Minh Hieu Phan, Zhibin Liao, Johan W. Verjans, Minh-Son To
for: Proposes a new, cost-effective modality translation model for handling unpaired MR-CT translation in medical image processing.
methods: Builds on CycleGAN and couples it with an auxiliary mask generator network to enforce structural consistency during modality translation.
results: Experiments on a pediatric dataset, where MR and CT scans are heavily misaligned, show that the method (MaskGAN) outperforms existing approaches, synthesizing cross-modality images with high fidelity and preserving anatomical structures without expert annotations.
Abstract
Medical image synthesis is a challenging task due to the scarcity of paired data. Several methods have applied CycleGAN to leverage unpaired data, but they often generate inaccurate mappings that shift the anatomy. This problem is further exacerbated when the images from the source and target modalities are heavily misaligned. Recently, current methods have aimed to address this issue by incorporating a supplementary segmentation network. Unfortunately, this strategy requires costly and time-consuming pixel-level annotations. To overcome this problem, this paper proposes MaskGAN, a novel and cost-effective framework that enforces structural consistency by utilizing automatically extracted coarse masks. Our approach employs a mask generator to outline anatomical structures and a content generator to synthesize CT contents that align with these structures. Extensive experiments demonstrate that MaskGAN outperforms state-of-the-art synthesis methods on a challenging pediatric dataset, where MR and CT scans are heavily misaligned due to rapid growth in children. Specifically, MaskGAN excels in preserving anatomical structures without the need for expert annotations. The code for this paper can be found at https://github.com/HieuPhan33/MaskGAN.
Implicit Neural Representation in Medical Imaging: A Comparative Survey
results: The paper summarizes the advantages and limitations of INRs in medical image analysis, including their ability to tackle many challenging problems, along with their efficiency, robustness, adaptability, and differentiability. It also identifies future research directions and opportunities, such as integration with multi-modal imaging, real-time and interactive systems, and domain adaptation.
Abstract
Implicit neural representations (INRs) have gained prominence as a powerful paradigm in scene reconstruction and computer graphics, demonstrating remarkable results. By utilizing neural networks to parameterize data through implicit continuous functions, INRs offer several benefits. Recognizing the potential of INRs beyond these domains, this survey aims to provide a comprehensive overview of INR models in the field of medical imaging. In medical settings, numerous challenging and ill-posed problems exist, making INRs an attractive solution. The survey explores the application of INRs in various medical imaging tasks, such as image reconstruction, segmentation, registration, novel view synthesis, and compression. It discusses the advantages and limitations of INRs, highlighting their resolution-agnostic nature, memory efficiency, ability to avoid locality biases, and differentiability, enabling adaptation to different tasks. Furthermore, the survey addresses the challenges and considerations specific to medical imaging data, such as data availability, computational complexity, and dynamic clinical scene analysis. It also identifies future research directions and opportunities, including integration with multi-modal imaging, real-time and interactive systems, and domain adaptation for clinical decision support. To facilitate further exploration and implementation of INRs in medical image analysis, we have provided a compilation of cited studies along with their available open-source implementations on \href{https://github.com/mindflow-institue/Awesome-Implicit-Neural-Representations-in-Medical-imaging}. Finally, we aim to consistently incorporate the most recent and relevant papers regularly.
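To ground the discussion, here is the textbook form of an INR, a coordinate MLP fit to a single image; it is a generic illustration rather than any specific model from the survey. In practice, positional encodings or sinusoidal activations are usually added so fine detail can be represented.

```python
# Minimal implicit neural representation: an MLP mapping (x, y) coordinates
# to pixel intensity, fit to one image. Generic illustration only.
import torch
import torch.nn as nn

class INR(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords):  # coords in [-1, 1]^2, shape (N, 2)
        return self.net(coords)

# Fit the network to one 64x64 image: the weights *are* the representation.
H = W = 64
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
target = torch.rand(H * W, 1)  # stand-in for real image intensities

model = INR()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = ((model(coords) - target) ** 2).mean()
    loss.backward()
    opt.step()

# Resolution-agnostic: the same network can be queried on any denser grid.
```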
Augmented Math: Authoring AR-Based Explorable Explanations by Augmenting Static Math Textbooks
paper_authors: Neil Chulpongsatorn, Mille Skovhus Lunding, Nishan Soni, Ryo Suzuki
for: Helping non-technical users, such as teachers or students, transform static math textbooks and handouts into on-demand and personalized explorable explanations.
methods: The system first extracts mathematical formulas and figures from a given document using optical character recognition and computer vision, then binds and manipulates the extracted content so that users can see interactive animations overlaid onto the document through mobile AR interfaces.
results: Our studies show that the system helps students better understand math concepts and allows non-technical users to create personalized explorable explanations.
Abstract
We introduce Augmented Math, a machine learning-based approach to authoring AR explorable explanations by augmenting static math textbooks without programming. To augment a static document, our system first extracts mathematical formulas and figures from a given document using optical character recognition (OCR) and computer vision. By binding and manipulating these extracted contents, the user can see the interactive animation overlaid onto the document through mobile AR interfaces. This empowers non-technical users, such as teachers or students, to transform existing math textbooks and handouts into on-demand and personalized explorable explanations. To design our system, we first analyzed existing explorable math explanations to identify common design strategies. Based on the findings, we developed a set of augmentation techniques that can be automatically generated based on the extracted content, which are 1) dynamic values, 2) interactive figures, 3) relationship highlights, 4) concrete examples, and 5) step-by-step hints. To evaluate our system, we conduct two user studies: preliminary user testing and expert interviews. The study results confirm that our system allows more engaging experiences for learning math concepts.
TransFusion: A Practical and Effective Transformer-based Diffusion Model for 3D Human Motion Prediction
for: Predicting human motion in intelligent remanufacturing systems, with a focus on ensuring safe and effective human-robot collaboration.
methods: Proposes a diffusion-based model for 3D human motion prediction that uses Transformer as the backbone and employs the discrete cosine transform to model motion sequences in the frequency space.
results: Extensive experimental studies on benchmark datasets validate the effectiveness of the proposed model, which generates samples that are more likely to happen while maintaining a certain level of diversity.
Abstract
Predicting human motion plays a crucial role in ensuring a safe and effective human-robot close collaboration in intelligent remanufacturing systems of the future. Existing works can be categorized into two groups: those focusing on accuracy, predicting a single future motion, and those generating diverse predictions based on observations. The former group fails to address the uncertainty and multi-modal nature of human motion, while the latter group often produces motion sequences that deviate too far from the ground truth or become unrealistic within historical contexts. To tackle these issues, we propose TransFusion, an innovative and practical diffusion-based model for 3D human motion prediction which can generate samples that are more likely to happen while maintaining a certain level of diversity. Our model leverages Transformer as the backbone with long skip connections between shallow and deep layers. Additionally, we employ the discrete cosine transform to model motion sequences in the frequency space, thereby improving performance. In contrast to prior diffusion-based models that utilize extra modules like cross-attention and adaptive layer normalization to condition the prediction on past observed motion, we treat all inputs, including conditions, as tokens to create a more lightweight model compared to existing approaches. Extensive experimental studies are conducted on benchmark datasets to validate the effectiveness of our human motion prediction model.
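The frequency-space step is easy to illustrate: apply a DCT along the time axis of a joint trajectory and truncate high frequencies. The shapes and the number of retained coefficients below are illustrative assumptions.

```python
# Small illustration of DCT-based frequency-space modeling of motion.
import numpy as np
from scipy.fft import dct, idct

T, J = 50, 17 * 3                  # 50 frames, 17 joints x 3 coordinates
motion = np.random.randn(T, J)     # stand-in for a pose sequence

coeffs = dct(motion, type=2, norm="ortho", axis=0)   # time -> frequency
coeffs[20:] = 0                                      # keep 20 low frequencies
smooth = idct(coeffs, type=2, norm="ortho", axis=0)  # frequency -> time

print(np.abs(smooth - motion).mean())  # reconstruction error of truncation
```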
Iterative Graph Filtering Network for 3D Human Pose Estimation
paper_authors: Zaedul Islam, A. Ben Hamza
for: 3D human pose estimation, using graph convolutional networks (GCNs) to capture the spatial relationships between joints and learn an efficient representation of the underlying pose.
methods: An iterative graph filtering framework with Laplacian regularization, implemented via the Gauss-Seidel iterative method. The model architecture includes weight and adjacency modulation, skip connections, and a pure convolutional block with layer normalization.
results: The proposed model achieves state-of-the-art performance on two standard benchmark datasets for 3D human pose estimation, outperforming a comprehensive set of strong baseline methods. Ablation studies show that the skip connections and adjacency modulation contribute to the improved performance.
Abstract
Graph convolutional networks (GCNs) have proven to be an effective approach for 3D human pose estimation. By naturally modeling the skeleton structure of the human body as a graph, GCNs are able to capture the spatial relationships between joints and learn an efficient representation of the underlying pose. However, most GCN-based methods use a shared weight matrix, making it challenging to accurately capture the different and complex relationships between joints. In this paper, we introduce an iterative graph filtering framework for 3D human pose estimation, which aims to predict the 3D joint positions given a set of 2D joint locations in images. Our approach builds upon the idea of iteratively solving graph filtering with Laplacian regularization via the Gauss-Seidel iterative method. Motivated by this iterative solution, we design a Gauss-Seidel network (GS-Net) architecture, which makes use of weight and adjacency modulation, skip connection, and a pure convolutional block with layer normalization. Adjacency modulation facilitates the learning of edges that go beyond the inherent connections of body joints, resulting in an adjusted graph structure that reflects the human skeleton, while skip connections help maintain crucial information from the input layer's initial features as the network depth increases. We evaluate our proposed model on two standard benchmark datasets, and compare it with a comprehensive set of strong baseline methods for 3D human pose estimation. Our experimental results demonstrate that our approach outperforms the baseline methods on both datasets, achieving state-of-the-art performance. Furthermore, we conduct ablation studies to analyze the contributions of different components of our model architecture and show that the skip connection and adjacency modulation help improve the model performance.
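The iterative solution underlying the model can be made concrete with a toy example: graph filtering with Laplacian regularization, $\min_x \lVert x - y\rVert^2 + \lambda x^\top L x$, amounts to solving $(I + \lambda L)x = y$, which Gauss-Seidel handles row by row. The sketch below is the plain numerical solver, not the learned network.

```python
# Toy Gauss-Seidel solver for Laplacian-regularized graph filtering.
import numpy as np

def gauss_seidel(A, y, num_iters=50):
    x = np.zeros_like(y)
    for _ in range(num_iters):
        for i in range(len(y)):
            x[i] = (y[i] - A[i, :i] @ x[:i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

# Toy graph: a chain of 4 joints.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(1)) - W           # combinatorial graph Laplacian
lam = 0.5
A = np.eye(4) + lam * L             # (I + lam * L), symmetric positive definite

y = np.array([0.0, 1.0, 4.0, 2.0])  # noisy per-joint signal
x = gauss_seidel(A, y)
print(np.allclose(A @ x, y, atol=1e-6))  # True: converged smoothed solution
```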
HandMIM: Pose-Aware Self-Supervised Learning for 3D Hand Mesh Estimation
results: The proposed HandMIM achieves strong performance on various hand mesh estimation tasks, outperforming specially optimized architectures with 6.29mm and 8.00mm PAVPE on the challenging FreiHAND and HO3Dv2 test sets, respectively, setting new state-of-the-art records for 3D hand mesh estimation.
Abstract
With an enormous number of hand images generated over time, unleashing pose knowledge from unlabeled images for supervised hand mesh estimation is an emerging yet challenging topic. To alleviate this issue, semi-supervised and self-supervised approaches have been proposed, but they are limited by the reliance on detection models or conventional ResNet backbones. In this paper, inspired by the rapid progress of Masked Image Modeling (MIM) in visual classification tasks, we propose a novel self-supervised pre-training strategy for regressing 3D hand mesh parameters. Our approach involves a unified and multi-granularity strategy that includes a pseudo keypoint alignment module in the teacher-student framework for learning pose-aware semantic class tokens. For patch tokens with detailed locality, we adopt a self-distillation manner between teacher and student network based on MIM pre-training. To better fit low-level regression tasks, we incorporate pixel reconstruction tasks for multi-level representation learning. Additionally, we design a strong pose estimation baseline using a simple vanilla vision Transformer (ViT) as the backbone and attach a PyMAF head after tokens for regression. Extensive experiments demonstrate that our proposed approach, named HandMIM, achieves strong performance on various hand mesh estimation tasks. Notably, HandMIM outperforms specially optimized architectures, achieving 6.29mm and 8.00mm PAVPE (Vertex-Point-Error) on challenging FreiHAND and HO3Dv2 test sets, respectively, establishing new state-of-the-art records on 3D hand mesh estimation.
CoVid-19 Detection leveraging Vision Transformers and Explainable AI
paper_authors: Pangoth Santhosh Kumar, Kundrapu Supriya, Mallikharjuna Rao K
for: Detecting lung diseases early to improve patients' health and quality of life.
methods: Uses deep learning algorithms and image processing techniques to detect lung diseases automatically, rapidly, and accurately.
results: A vision-transformer-based model overcomes the difficulty deep learning models have with varying image orientations and achieves higher accuracy in detecting lung diseases.
Abstract
Lung disease is a common health problem in many parts of the world. It is a significant risk to people's health and quality of life across the globe, since it is responsible for five of the top thirty leading causes of death, among them COVID-19, pneumonia, and tuberculosis. It is critical to diagnose lung diseases in their early stages: the earlier a condition is diagnosed, the better the patient's chances of making a full recovery and surviving into the long term. Several different models, including machine learning and image processing, have been developed for this purpose. Thanks to deep learning algorithms, there is significant promise for the autonomous, rapid, and accurate identification of lung diseases based on medical imaging. Several deep learning strategies, including convolutional neural networks (CNNs), vanilla neural networks, Visual Geometry Group networks (VGG), and capsule networks, are used for the goal of making lung disease forecasts. The standard CNN performs poorly when dealing with rotated, tilted, or otherwise aberrant image orientations. As a result, within the scope of this study, we propose a vision-transformer-based end-to-end framework for the diagnosis of lung disorders. The architecture covers data augmentation, training of the suggested models, and evaluation of the models. For the purpose of detecting lung diseases such as pneumonia, COVID-19, lung opacity, and others, a specialized Compact Convolutional Transformers (CCT) model has been tested and evaluated on datasets such as the COVID-19 Radiography Database. The model achieved better accuracy for both training and validation on the COVID-19 Radiography Database.
LOTUS: Learning to Optimize Task-based US representations
paper_authors: Yordanka Velikova, Mohammad Farid Azampour, Walter Simson, Vanessa Gonzalez Duque, Nassir Navab
for: Proposes a novel method for segmenting ultrasound images to improve the accuracy of diagnosis and monitoring.
methods: Uses ray-casting to simulate acoustic propagation through tissue and learns to optimize the parameters for generating physics-based ultrasound images, guided by the downstream segmentation task.
results: The method achieves high automatic segmentation accuracy and supports simultaneous image synthesis and automatic segmentation; qualitative results also show that it can generate high-quality images with improved contrast and resolution.
Abstract
Anatomical segmentation of organs in ultrasound images is essential to many clinical applications, particularly for diagnosis and monitoring. Existing deep neural networks require a large amount of labeled data for training in order to achieve clinically acceptable performance. Yet, in ultrasound, due to characteristic properties such as speckle and clutter, it is challenging to obtain accurate segmentation boundaries, and precise pixel-wise labeling of images is highly dependent on the expertise of physicians. In contrast, CT scans have higher resolution and improved contrast, easing organ identification. In this paper, we propose a novel approach for learning to optimize task-based ultrasound image representations. Given annotated CT segmentation maps as a simulation medium, we model acoustic propagation through tissue via ray-casting to generate ultrasound training data. Our ultrasound simulator is fully differentiable and learns to optimize the parameters for generating physics-based ultrasound images, guided by the downstream segmentation task. In addition, we train an image adaptation network between real and simulated images to achieve simultaneous image synthesis and automatic segmentation of US images in an end-to-end training setting. The proposed method is evaluated on aorta and vessel segmentation tasks and shows promising quantitative results. Furthermore, we present qualitative results of the optimized image representations on other organs.
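To illustrate the kind of differentiable ray-casting the abstract describes, here is a toy renderer in which per-tissue acoustic parameters are learnable and gradients can flow back from a downstream segmentation loss. The attenuation/echogenicity parametrization and the straight top-to-bottom ray geometry are simplifying assumptions, not the paper's simulator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentiableUSRenderer(nn.Module):
    """Toy differentiable B-mode renderer over a CT-derived tissue label map."""
    def __init__(self, num_tissue_labels: int):
        super().__init__()
        # Learnable acoustic parameters per tissue label, optimized
        # end-to-end through the downstream segmentation loss.
        self.attenuation = nn.Parameter(torch.rand(num_tissue_labels))
        self.echogenicity = nn.Parameter(torch.rand(num_tissue_labels))

    def forward(self, label_map: torch.Tensor) -> torch.Tensor:
        # label_map: (B, H, W) integer tissue labels; rays travel
        # top-to-bottom, i.e. the transducer sits at row 0.
        att = F.softplus(self.attenuation)[label_map]   # keep attenuation positive
        echo = self.echogenicity[label_map]             # (B, H, W)
        # Accumulated attenuation along each ray (cumulative sum down the columns).
        transmitted = torch.exp(-torch.cumsum(att, dim=1))
        return echo * transmitted                       # simulated B-mode intensity

# Usage sketch: render, segment, and backpropagate into the acoustic parameters.
renderer = DifferentiableUSRenderer(num_tissue_labels=8)
labels = torch.randint(0, 8, (2, 64, 64))
image = renderer(labels)            # (2, 64, 64), differentiable w.r.t. renderer params
```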
Fuzzy Logic Visual Network (FLVN): A neuro-symbolic approach for visual features matching
results: The paper proposes the Fuzzy Logic Visual Network (FLVN), which learns a visual-semantic embedding space within a neuro-symbolic LTN framework. By exploiting class-hierarchy knowledge and robust high-level inductive biases, FLVN improves ZSL classification performance, reaching state-of-the-art results on the Generalized ZSL (GZSL) benchmarks AWA2 and CUB with lower computational overhead.Abstract
Neuro-symbolic integration aims at harnessing the power of symbolic knowledge representation combined with the learning capabilities of deep neural networks. In particular, Logic Tensor Networks (LTNs) make it possible to incorporate background knowledge in the form of logical axioms by grounding a first-order logic language as differentiable operations between real tensors. Yet, few studies have investigated the potential benefits of this approach for zero-shot learning (ZSL) classification. In this study, we present the Fuzzy Logic Visual Network (FLVN), which formulates the task of learning a visual-semantic embedding space within a neuro-symbolic LTN framework. FLVN incorporates prior knowledge in the form of class hierarchies (classes and macro-classes) along with robust high-level inductive biases. The latter allow, for instance, handling exceptions in class-level attributes and enforcing similarity between images of the same class, preventing premature overfitting to seen classes and improving overall performance. FLVN reaches state-of-the-art performance on the Generalized ZSL (GZSL) benchmarks AWA2 and CUB, improving by 1.3% and 3%, respectively. Overall, it achieves performance competitive with recent ZSL methods at a lower computational overhead. FLVN is available at https://gitlab.com/grains2/flvn.
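For readers unfamiliar with how an LTN grounds logic as tensor operations, the sketch below encodes one class-hierarchy axiom with a fuzzy implication and a mean-based universal quantifier, then maximizes its satisfaction by gradient descent. The predicate networks, the Reichenbach implication, and the axiom itself are illustrative assumptions, not FLVN's actual knowledge base.

```python
import torch
import torch.nn as nn

class Predicate(nn.Module):
    """Grounds a symbol such as IsA(x, class) as a truth value in [0, 1]."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)          # (B,) fuzzy truth values

def fuzzy_implies(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Reichenbach fuzzy implication: a -> b := 1 - a + a * b.
    return 1.0 - a + a * b

def forall(truths: torch.Tensor) -> torch.Tensor:
    # Universal quantifier as a smooth mean aggregator over the batch.
    return truths.mean()

# Axiom: forall x, IsZebra(x) -> IsEquine(x)  (a class implies its macro-class).
is_zebra, is_equine = Predicate(512), Predicate(512)
features = torch.randn(32, 512)                 # stand-in visual embeddings
sat = forall(fuzzy_implies(is_zebra(features), is_equine(features)))
loss = 1.0 - sat                                # maximize axiom satisfaction
loss.backward()
```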