results: We conduct experiments on several few-shot learning benchmark datasets and in real-world robot perception tests; the results demonstrate that our method is effective.
Abstract
We propose a novel framework for few-shot learning by leveraging large-scale vision-language models such as CLIP. Motivated by the unimodal prototypical networks for few-shot learning, we introduce PROTO-CLIP that utilizes image prototypes and text prototypes for few-shot learning. Specifically, PROTO-CLIP adapts the image encoder and text encoder in CLIP in a joint fashion using few-shot examples. The two encoders are used to compute prototypes of image classes for classification. During adaptation, we propose aligning the image and text prototypes of corresponding classes. Such a proposed alignment is beneficial for few-shot classification due to the contributions from both types of prototypes. We demonstrate the effectiveness of our method by conducting experiments on benchmark datasets for few-shot learning as well as in the real world for robot perception.
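The core computation is easy to sketch. Below is a minimal PyTorch illustration, not the authors' implementation, of how class prototypes could be mean-pooled from CLIP features and how an InfoNCE-style loss could align image and text prototypes of the same class; the feature dimension, shot counts, and temperature are placeholders.

import torch
import torch.nn.functional as F

def class_prototypes(embeddings, labels, num_classes):
    # mean-pool the support-set embeddings of each class, then re-normalize
    protos = torch.stack([embeddings[labels == c].mean(dim=0)
                          for c in range(num_classes)])
    return F.normalize(protos, dim=-1)

def prototype_alignment_loss(img_protos, txt_protos, temperature=0.07):
    # InfoNCE-style alignment: the image prototype of class c should be
    # closest to the text prototype of the same class, and vice versa
    logits = img_protos @ txt_protos.t() / temperature
    targets = torch.arange(img_protos.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# toy usage: 5 classes, 5 shots, 512-d stand-ins for CLIP features
feats = F.normalize(torch.randn(25, 512), dim=-1)
labels = torch.arange(5).repeat_interleave(5)
img_p = class_prototypes(feats, labels, 5)
txt_p = F.normalize(torch.randn(5, 512), dim=-1)
print(prototype_alignment_loss(img_p, txt_p))

At test time, classification can then be done by comparing a query embedding against both prototype sets.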
PseudoCell: Hard Negative Mining as Pseudo Labeling for Deep Learning-Based Centroblast Cell Detection
For: The paper aims to assist pathologists in grading follicular lymphoma patients by automating centroblast detection in whole-slide images (WSI) of H&E-stained tissue samples using a deep learning-based object detection framework.
Methods: The proposed method, called PseudoCell, incorporates centroblast labels from pathologists and combines them with pseudo-negative labels obtained from undersampled false-positive predictions using the cell's morphological features.
Results: PseudoCell accurately narrows down the areas requiring pathologists' attention while examining tissue, eliminating 58.18-99.35% of non-centroblast tissue areas on WSI, depending on the confidence threshold.
Abstract
Patch classification models based on deep learning have been utilized in whole-slide images (WSI) of H&E-stained tissue samples to assist pathologists in grading follicular lymphoma patients. However, these approaches still require pathologists to manually identify centroblast cells and provide refined labels for optimal performance. To address this, we propose PseudoCell, an object detection framework to automate centroblast detection in WSI (source code is available at https://github.com/IoBT-VISTEC/PseudoCell.git). This framework incorporates centroblast labels from pathologists and combines them with pseudo-negative labels obtained from undersampled false-positive predictions using the cell's morphological features. By employing PseudoCell, pathologists' workload can be reduced, as it accurately narrows down the areas requiring their attention while examining tissue. Depending on the confidence threshold, PseudoCell can eliminate 58.18-99.35% of non-centroblast tissue areas on WSI. This study presents a practical centroblast prescreening method that does not require pathologists' refined labels for improvement. Detailed guidance on the practical implementation of PseudoCell is provided in the discussion section.
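To make the pseudo-labeling step concrete, here is a hedged sketch of mining pseudo-negative labels from false-positive detections. The helper feature_fn is a hypothetical stand-in for the morphological features used for undersampling, and the IoU threshold is illustrative, not the paper's value.

import numpy as np

def iou(a, b):
    # a, b: boxes as [x1, y1, x2, y2]
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def mine_pseudo_negatives(predictions, gt_boxes, feature_fn, n_keep, iou_thr=0.3):
    # false positives = detections that match no pathologist-provided box
    fps = [p for p in predictions if all(iou(p, g) < iou_thr for g in gt_boxes)]
    if len(fps) <= n_keep:
        return fps
    # undersample: keep only the hardest FPs according to a morphological score
    scores = np.array([feature_fn(p) for p in fps])
    keep = np.argsort(-scores)[:n_keep]
    return [fps[i] for i in keep]

# toy usage, with box area as the stand-in morphological score
preds = [[0, 0, 10, 10], [20, 20, 28, 30], [50, 50, 60, 58]]
gts = [[1, 1, 11, 11]]
area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
print(mine_pseudo_negatives(preds, gts, area, n_keep=1))

The retained boxes would then be added as an explicit negative class when retraining the detector.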
EffLiFe: Efficient Light Field Generation via Hierarchical Sparse Gradient Descent
results: On average 100x faster than state-of-the-art offline methods, while delivering better performance (about 2 dB higher PSNR) than other online approaches.
Abstract
With the rise of Extended Reality (XR) technology, there is a growing need for real-time light field generation from sparse view inputs. Existing methods can be classified into offline techniques, which can generate high-quality novel views but at the cost of long inference/training time, and online methods, which either lack generalizability or produce unsatisfactory results. However, we have observed that the intrinsic sparse manifold of Multi-plane Images (MPI) enables a significant acceleration of light field generation while maintaining rendering quality. Based on this insight, we introduce EffLiFe, a novel light field optimization method, which leverages the proposed Hierarchical Sparse Gradient Descent (HSGD) to produce high-quality light fields from sparse view images in real time. Technically, the coarse MPI of a scene is first generated using a 3D CNN, and it is further sparsely optimized by focusing only on important MPI gradients in a few iterations. Nevertheless, relying solely on optimization can lead to artifacts at occlusion boundaries. Therefore, we propose an occlusion-aware iterative refinement module that removes visual artifacts in occluded regions by iteratively filtering the input. Extensive experiments demonstrate that our method achieves comparable visual quality while being 100x faster on average than state-of-the-art offline methods and delivering better performance (about 2 dB higher in PSNR) compared to other online approaches.
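The sparse-optimization idea can be illustrated with a toy example: at each iteration, keep only the largest-magnitude gradients of the MPI and zero out the rest. This is a simplified sketch of the general principle, not the paper's hierarchical HSGD algorithm; the keep ratio and learning rate are arbitrary.

import torch

def sparse_gradient_step(mpi, loss_fn, lr=0.1, keep_ratio=0.05):
    # keep only the top-|keep_ratio| gradients by magnitude, zero the rest
    mpi = mpi.clone().requires_grad_(True)
    loss = loss_fn(mpi)
    (grad,) = torch.autograd.grad(loss, mpi)
    k = max(1, int(keep_ratio * grad.numel()))
    thresh = grad.abs().flatten().kthvalue(grad.numel() - k + 1).values
    mask = grad.abs() >= thresh
    return (mpi - lr * grad * mask).detach()

# toy: fit a random "MPI" toward a target using only 5% of gradients per step
target = torch.randn(8, 4, 4)
mpi = torch.zeros_like(target)
for _ in range(100):
    mpi = sparse_gradient_step(mpi, lambda m: ((m - target) ** 2).mean())
print(((mpi - target) ** 2).mean())

Because most MPI gradients are near zero on the sparse manifold, updating only the important entries can cut per-iteration cost substantially.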
Self-supervised learning via inter-modal reconstruction and feature projection networks for label-efficient 3D-to-2D segmentation
paper_authors: José Morano, Guilherme Aresta, Dmitrii Lachinov, Julia Mai, Ursula Schmidt-Erfurth, Hrvoje Bogunović
for: To propose a novel deep learning method that improves the efficiency of medical image segmentation tasks and reduces the workload of clinicians.
methods: A novel convolutional neural network (CNN) and self-supervised learning (SSL) method; the CNN consists of a 3D encoder and a 2D decoder connected by novel 3D-to-2D blocks, while the SSL method reconstructs image pairs of modalities with different dimensionality.
results: On two clinically relevant tasks (en-face segmentation of geographic atrophy and reticular pseudodrusen), the method improves the state of the art by up to 8% in Dice score with limited labeled data; the SSL method further improves performance by up to 23% and is beneficial regardless of the network architecture.
Abstract
Deep learning has become a valuable tool for the automation of certain medical image segmentation tasks, significantly relieving the workload of medical specialists. Some of these tasks require segmentation to be performed on a subset of the input dimensions, the most common case being 3D-to-2D. However, the performance of existing methods is strongly conditioned by the amount of labeled data available, as there is currently no data-efficient method, e.g. transfer learning, that has been validated on these tasks. In this work, we propose a novel convolutional neural network (CNN) and self-supervised learning (SSL) method for label-efficient 3D-to-2D segmentation. The CNN is composed of a 3D encoder and a 2D decoder connected by novel 3D-to-2D blocks. The SSL method consists of reconstructing image pairs of modalities with different dimensionality. The approach has been validated on two tasks with clinical relevance: the en-face segmentation of geographic atrophy and reticular pseudodrusen in optical coherence tomography. Results on different datasets demonstrate that the proposed CNN significantly improves the state of the art in scenarios with limited labeled data, by up to 8% in Dice score. Moreover, the proposed SSL method allows further improvement of this performance by up to 23%, and we show that the SSL is beneficial regardless of the network architecture.
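One plausible way a 3D-to-2D block could collapse the depth axis is learned attention pooling over slices. The sketch below is an assumption for illustration only, not the paper's actual block design; channel counts and volume sizes are placeholders.

import torch
import torch.nn as nn

class ThreeDToTwoD(nn.Module):
    """Collapse the depth axis of 3D features into a 2D map via learned
    attention pooling; a hypothetical stand-in for the paper's blocks."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv3d(channels, 1, kernel_size=1)  # per-slice relevance
    def forward(self, x):                        # x: (B, C, D, H, W)
        w = torch.softmax(self.score(x), dim=2)  # attention over depth D
        return (x * w).sum(dim=2)                # (B, C, H, W)

x = torch.randn(2, 32, 16, 64, 64)  # e.g., encoded OCT volume features
print(ThreeDToTwoD(32)(x).shape)    # torch.Size([2, 32, 64, 64])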
Fourier-Net+: Leveraging Band-Limited Representation for Efficient 3D Medical Image Registration
results: Evaluated on three datasets, including CT and MRI scans, the proposed methods achieve results comparable to state-of-the-art approaches while offering faster inference speeds, a lower memory footprint, and fewer multiply-add operations. This small computational cost enables Fourier-Net+ to efficiently train large-scale 3D registration on low-VRAM GPUs. The code is publicly available at \url{https://github.com/xi-jia/Fourier-Net}.
Abstract
U-Net style networks are commonly utilized in unsupervised image registration to predict dense displacement fields, which for high-resolution volumetric image data is a resource-intensive and time-consuming task. To tackle this challenge, we first propose Fourier-Net, which replaces the costly U-Net style expansive path with a parameter-free model-driven decoder. Instead of directly predicting a full-resolution displacement field, our Fourier-Net learns a low-dimensional representation of the displacement field in the band-limited Fourier domain which our model-driven decoder converts to a full-resolution displacement field in the spatial domain. Expanding upon Fourier-Net, we then introduce Fourier-Net+, which additionally takes the band-limited spatial representation of the images as input and further reduces the number of convolutional layers in the U-Net style network's contracting path. Finally, to enhance the registration performance, we propose a cascaded version of Fourier-Net+. We evaluate our proposed methods on three datasets, on which our proposed Fourier-Net and its variants achieve comparable results with current state-of-the-art methods, while exhibiting faster inference speeds, lower memory footprint, and fewer multiply-add operations. With such small computational cost, our Fourier-Net+ enables the efficient training of large-scale 3D registration on low-VRAM GPUs. Our code is publicly available at \url{https://github.com/xi-jia/Fourier-Net}.
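The parameter-free decoder idea can be sketched as zero-padding the band-limited coefficients back to the full grid and applying an inverse FFT. This is a minimal reading of the idea, assuming centred low-frequency coefficients; the paper's exact coefficient layout and amplitude scaling may differ.

import torch

def fourier_decode(lowfreq, full_shape):
    """Zero-pad centred band-limited coefficients to the full grid and apply an
    inverse FFT to obtain a spatial-domain displacement field (scaling ignored)."""
    # lowfreq: complex tensor (3, d, h, w) of centred low-frequency coefficients
    spec = torch.zeros((3, *full_shape), dtype=lowfreq.dtype)
    d, h, w = lowfreq.shape[1:]
    D, H, W = full_shape
    spec[:, (D - d) // 2:(D - d) // 2 + d,
            (H - h) // 2:(H - h) // 2 + h,
            (W - w) // 2:(W - w) // 2 + w] = lowfreq
    field = torch.fft.ifftn(torch.fft.ifftshift(spec, dim=(1, 2, 3)), dim=(1, 2, 3))
    return field.real

low = torch.randn(3, 10, 12, 10, dtype=torch.complex64)
print(fourier_decode(low, (80, 96, 80)).shape)  # full-resolution 3-channel field

Since the decoder has no parameters, only the small band-limited representation needs to be predicted by the network.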
Multi-modal multi-class Parkinson disease classification using CNN and decision level fusion
paper_authors: Sushanta Kumar Sahu, Ananda S. Chowdhury
for: To propose a direct three-class Parkinson's disease (PD) classification method using two different modalities, MRI and DTI.
methods: White matter and gray matter from MRI and fractional anisotropy and mean diffusivity from DTI are used; four separate CNN models are trained on these four types of data, and their outputs are fused at the decision level with an optimal weighted-average fusion technique.
results: On the publicly available PPMI database, the method achieves 95.53% accuracy for the direct three-class classification of PD, HC, and SWEDD; extensive comparisons, including a series of ablation studies, clearly demonstrate the effectiveness of the proposed solution.
Abstract
Parkinson disease is the second most common neurodegenerative disorder, as reported by the World Health Organization. In this paper, we propose a direct three-class PD classification using two different modalities, namely, MRI and DTI. The three classes used for classification are PD, Scans Without Evidence of Dopamine Deficit (SWEDD) and Healthy Control (HC). We use white matter and gray matter from the MRI and fractional anisotropy and mean diffusivity from the DTI to achieve our goal. We train four separate CNNs on the above four types of data. At the decision level, the outputs of the four CNN models are fused with an optimal weighted average fusion technique. We achieve an accuracy of 95.53% for the direct three-class classification of PD, HC and SWEDD on the publicly available PPMI database. Extensive comparisons, including a series of ablation studies, clearly demonstrate the effectiveness of our proposed solution.
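Decision-level weighted-average fusion amounts to taking a convex combination of the four models' softmax outputs and picking weights that maximize validation accuracy. The brute-force grid search below is one simple way to find such weights; the paper's optimization procedure may differ, and the data here are random stand-ins.

import numpy as np
from itertools import product

def fuse(probs, weights):
    # probs: (n_models, n_samples, n_classes) softmax outputs; weights sum to 1
    return np.tensordot(weights, probs, axes=1)  # (n_samples, n_classes)

def grid_search_weights(probs, labels, step=0.1):
    """Exhaustively search the weight simplex (at a coarse step) for the
    weighted-average fusion that maximises validation accuracy."""
    best, best_acc = None, -1.0
    grid = np.arange(0, 1 + 1e-9, step)
    for w in product(grid, repeat=probs.shape[0]):
        if abs(sum(w) - 1.0) > 1e-6:
            continue
        acc = (fuse(probs, np.array(w)).argmax(1) == labels).mean()
        if acc > best_acc:
            best, best_acc = np.array(w), acc
    return best, best_acc

# toy: 4 CNNs (WM, GM, FA, MD), 3 classes (PD / HC / SWEDD), 100 samples
probs = np.random.dirichlet(np.ones(3), size=(4, 100))
labels = np.random.randint(0, 3, 100)
print(grid_search_weights(probs, labels))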
Cross-Spatial Pixel Integration and Cross-Stage Feature Fusion Based Transformer Network for Remote Sensing Image Super-Resolution
results: Extensive experiments on multiple benchmark datasets demonstrate that the proposed SPIFFNet outperforms existing methods in both quantitative metrics and visual quality.
Abstract
Remote sensing image super-resolution (RSISR) plays a vital role in enhancing spatial details and improving the quality of satellite imagery. Recently, Transformer-based models have shown competitive performance in RSISR. To mitigate the quadratic computational complexity resulting from global self-attention, various methods constrain attention to a local window, enhancing its efficiency. Consequently, the receptive fields in a single attention layer are inadequate, leading to insufficient context modeling. Furthermore, while most Transformer-based approaches reuse shallow features through skip connections, relying solely on these connections treats shallow and deep features equally, impeding the model's ability to characterize them. To address these issues, we propose a novel transformer architecture called Cross-Spatial Pixel Integration and Cross-Stage Feature Fusion Based Transformer Network (SPIFFNet) for RSISR. Our proposed model effectively enhances global cognition and understanding of the entire image, facilitating efficient integration of features cross-stages. The model incorporates cross-spatial pixel integration attention (CSPIA) to introduce contextual information into a local window, while cross-stage feature fusion attention (CSFFA) adaptively fuses features from the previous stage to improve feature expression in line with the requirements of the current stage. We conducted comprehensive experiments on multiple benchmark datasets, demonstrating the superior performance of our proposed SPIFFNet in terms of both quantitative metrics and visual quality when compared to state-of-the-art methods.
SegNetr: Rethinking the local-global interactions and skip connections in U-shaped networks
results: Applied to four mainstream medical image segmentation datasets, SegNetr uses 59% fewer parameters and 76% fewer GFLOPs than vanilla U-Net while achieving segmentation performance comparable to state-of-the-art methods, and its components can be applied to other U-shaped networks to improve their segmentation performance.
Abstract
Recently, U-shaped networks have dominated the field of medical image segmentation due to their simple and easily tuned structure. However, existing U-shaped segmentation networks: 1) mostly focus on designing complex self-attention modules to compensate for the lack of long-range dependence in convolution operations, which increases the overall number of parameters and computational complexity of the network; 2) simply fuse the features of encoder and decoder, ignoring the connection between their spatial locations. In this paper, we rethink the above problem and build a lightweight medical image segmentation network, called SegNetr. Specifically, we introduce a novel SegNetr block that can perform local-global interactions dynamically at any stage and with only linear complexity. At the same time, we design a general information retention skip connection (IRSC) to preserve the spatial location information of encoder features and achieve accurate fusion with the decoder features. We validate the effectiveness of SegNetr on four mainstream medical image segmentation datasets, with 59% and 76% fewer parameters and GFLOPs than vanilla U-Net, while achieving segmentation performance comparable to state-of-the-art methods. Notably, the components proposed in this paper can be applied to other U-shaped networks to improve their segmentation performance.
DisAsymNet: Disentanglement of Asymmetrical Abnormality on Bilateral Mammograms using Self-adversarial Learning
results: Experiments on three public datasets and one in-house dataset show that the method outperforms existing approaches on abnormality classification, segmentation, and localization tasks; the reconstructed normal mammograms also provide better interpretable visual references for diagnosis.
Abstract
Asymmetry is a crucial characteristic of bilateral mammograms (Bi-MG) when abnormalities are developing. It is widely utilized by radiologists for diagnosis. The question of 'what the symmetrical Bi-MG would look like when the asymmetrical abnormalities have been removed?' has not yet received strong attention in the development of algorithms on mammograms. Addressing this question could provide valuable insights into mammographic anatomy and aid in diagnostic interpretation. Hence, we propose a novel framework, DisAsymNet, which utilizes asymmetrical abnormality transformer guided self-adversarial learning for disentangling abnormalities and symmetric Bi-MG. At the same time, our proposed method is partially guided by randomly synthesized abnormalities. We conduct experiments on three public and one in-house dataset, and demonstrate that our method outperforms existing methods in abnormality classification, segmentation, and localization tasks. Additionally, reconstructed normal mammograms can provide insights toward better interpretable visual cues for clinical diagnosis. The code will be accessible to the public.
A Real-time Human Pose Estimation Approach for Optimal Sensor Placement in Sensor-based Human Activity Recognition
results: A feasibility study applying inertial sensors to monitor 13 different activities across ten subjects shows that the vision-based sensor-placement method performs comparably to the conventional deep learning approach. The work offers a lightweight, on-device method for determining optimal sensor placement, supporting data anonymization and multimodal recognition.
Abstract
Sensor-based Human Activity Recognition facilitates unobtrusive monitoring of human movements. However, determining the most effective sensor placement for optimal classification performance remains challenging. This paper introduces a novel methodology to resolve this issue, using real-time 2D pose estimations derived from video recordings of target activities. The derived skeleton data provides a unique strategy for identifying the optimal sensor location. We validate our approach through a feasibility study, applying inertial sensors to monitor 13 different activities across ten subjects. Our findings indicate that the vision-based method for sensor placement offers comparable results to the conventional deep learning approach, demonstrating its efficacy. This research significantly advances the field of Human Activity Recognition by providing a lightweight, on-device solution for determining the optimal sensor placement, thereby enhancing data anonymization and supporting a multimodal classification approach.
RefVSR++: Exploiting Reference Inputs for Reference-based Video Super-resolution
paper_authors: Han Zou, Masanori Suganuma, Takayuki Okatani
for: To improve the image quality of video super-resolution.
methods: Reference-based super-resolution and video super-resolution are combined to improve image quality.
results: Experiments show that the proposed RefVSR++ method improves image quality, outperforming RefVSR by over 1 dB in PSNR.
Abstract
Smartphones equipped with a multi-camera system comprising multiple cameras with different field-of-view (FoVs) are becoming more prevalent. These camera configurations are compatible with reference-based SR and video SR, which can be executed simultaneously while recording video on the device. Thus, combining these two SR methods can improve image quality. Recently, Lee et al. have presented such a method, RefVSR. In this paper, we consider how to optimally utilize the observations obtained, including input low-resolution (LR) video and reference (Ref) video. RefVSR extends conventional video SR quite simply, aggregating the LR and Ref inputs over time in a single bidirectional stream. However, considering the content difference between LR and Ref images due to their FoVs, we can derive the maximum information from the two image sequences by aggregating them independently in the temporal direction. Then, we propose an improved method, RefVSR++, which can aggregate two features in parallel in the temporal direction, one for aggregating the fused LR and Ref inputs and the other for Ref inputs over time. Furthermore, we equip RefVSR++ with enhanced mechanisms to align image features over time, which is the key to the success of video SR. We experimentally show that RefVSR++ outperforms RefVSR by over 1dB in PSNR, achieving the new state-of-the-art.
Probabilistic and Semantic Descriptions of Image Manifolds and Their Applications
paper_authors: Peter Tu, Zhaoyuan Yang, Richard Hartley, Zhiwei Xu, Jing Zhang, Dylan Campbell, Jaskirat Singh, Tianyu Wang
for: To model the statistical distribution of image data by estimating probability density functions over images.
methods: Popular generative models, such as normalizing flows and diffusion models, are used to model the distribution of image data.
results: These generative models can be used to construct defences against adversarial attacks, and semantic interpretations can be used to describe points on the image manifold.
Abstract
This paper begins with a description of methods for estimating probability density functions for images that reflects the observation that such data is usually constrained to lie in restricted regions of the high-dimensional image space - not every pattern of pixels is an image. It is common to say that images lie on a lower-dimensional manifold in the high-dimensional space. However, although images may lie on such lower-dimensional manifolds, it is not the case that all points on the manifold have an equal probability of being images. Images are unevenly distributed on the manifold, and our task is to devise ways to model this distribution as a probability distribution. In pursuing this goal, we consider generative models that are popular in AI and computer vision community. For our purposes, generative/probabilistic models should have the properties of 1) sample generation: it should be possible to sample from this distribution according to the modelled density function, and 2) probability computation: given a previously unseen sample from the dataset of interest, one should be able to compute the probability of the sample, at least up to a normalising constant. To this end, we investigate the use of methods such as normalising flow and diffusion models. We then show that such probabilistic descriptions can be used to construct defences against adversarial attacks. In addition to describing the manifold in terms of density, we also consider how semantic interpretations can be used to describe points on the manifold. To this end, we consider an emergent language framework which makes use of variational encoders to produce a disentangled representation of points that reside on a given manifold. Trajectories between points on a manifold can then be described in terms of evolving semantic descriptions.
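Normalizing flows satisfy both stated requirements directly: sampling by inverting the flow, and exact log-density via the change-of-variables formula. The following self-contained RealNVP-style sketch is illustrative, not from the paper; the network widths and layer count are arbitrary.

import torch
import torch.nn as nn

class Coupling(nn.Module):
    """RealNVP-style affine coupling: invertible, with a cheap log-determinant."""
    def __init__(self, dim, flip=False):
        super().__init__()
        self.flip = flip
        half = dim // 2
        self.net = nn.Sequential(nn.Linear(half, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * half))
    def forward(self, x):                        # data -> latent
        a, b = x.chunk(2, dim=-1)
        if self.flip:
            a, b = b, a
        s, t = self.net(a).chunk(2, dim=-1)
        s = torch.tanh(s)                        # keep scales numerically tame
        b = b * s.exp() + t
        y = torch.cat([b, a] if self.flip else [a, b], dim=-1)
        return y, s.sum(dim=-1)                  # log|det J| of this layer
    def inverse(self, y):                        # latent -> data (for sampling)
        a, b = y.chunk(2, dim=-1)
        if self.flip:
            a, b = b, a
        s, t = self.net(a).chunk(2, dim=-1)
        s = torch.tanh(s)
        b = (b - t) * (-s).exp()
        return torch.cat([b, a] if self.flip else [a, b], dim=-1)

def log_prob(flows, x):
    # exact density via change of variables under a standard normal base
    logdet = torch.zeros(x.size(0))
    for f in flows:
        x, ld = f(x)
        logdet = logdet + ld
    base = (-0.5 * (x ** 2).sum(-1)
            - 0.5 * x.size(-1) * torch.log(torch.tensor(2 * torch.pi)))
    return base + logdet

def sample(flows, n, dim):
    z = torch.randn(n, dim)
    for f in reversed(flows):
        z = f.inverse(z)
    return z

flows = [Coupling(dim=4, flip=bool(i % 2)) for i in range(4)]
x = torch.randn(8, 4)
print(log_prob(flows, x).shape, sample(flows, 8, 4).shape)

Diffusion models, by contrast, give easy sampling but only bounds or estimates of the density, which is part of the trade-off the paper investigates.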
Towards accurate instance segmentation in large-scale LiDAR point clouds
results: The proposed clustering strategy improves instance segmentation accuracy, making the method better suited to inventory- and management-type applications.
Abstract
Panoptic segmentation is the combination of semantic and instance segmentation: assign the points in a 3D point cloud to semantic categories and partition them into distinct object instances. It has many obvious applications for outdoor scene understanding, from city mapping to forest management. Existing methods struggle to segment nearby instances of the same semantic category, like adjacent pieces of street furniture or neighbouring trees, which limits their usability for inventory- or management-type applications that rely on object instances. This study explores the steps of the panoptic segmentation pipeline concerned with clustering points into object instances, with the goal to alleviate that bottleneck. We find that a carefully designed clustering strategy, which leverages multiple types of learned point embeddings, significantly improves instance segmentation. Experiments on the NPM3D urban mobile mapping dataset and the FOR-instance forest dataset demonstrate the effectiveness and versatility of the proposed strategy.
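As a concrete illustration of clustering points into instances with learned embeddings, the sketch below combines offset-shifted coordinates with per-point embedding vectors and runs DBSCAN per semantic class. This is a generic baseline for the pipeline stage discussed, not the paper's specific strategy; eps and min_samples are placeholders.

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_instances(xyz, offsets, embeddings, sem_labels, eps=0.5):
    """Cluster the points of each semantic class into object instances."""
    ids = -np.ones(len(xyz), dtype=int)
    next_id = 0
    for c in np.unique(sem_labels):
        m = sem_labels == c
        # shift each point toward its predicted instance centre, then append
        # the embedding so adjacent look-alike objects can still be separated
        feats = np.hstack([xyz[m] + offsets[m], embeddings[m]])
        labels = DBSCAN(eps=eps, min_samples=10).fit_predict(feats)
        labels[labels >= 0] += next_id
        ids[m] = labels
        next_id = ids.max() + 1
    return ids  # -1 marks unassigned noise points

# toy: 1000 points, 3-d offsets, 8-d embeddings, 2 semantic classes
xyz = np.random.rand(1000, 3) * 10
ids = cluster_instances(xyz, np.zeros((1000, 3)), np.random.rand(1000, 8),
                        np.random.randint(0, 2, 1000))
print(len(np.unique(ids)))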
Reference-based Motion Blur Removal: Learning to Utilize Sharpness in the Reference Image
paper_authors: Han Zou, Masanori Suganuma, Takayuki Okatani
for: To improve the sharpness of blurry images.
methods: Multiple images, including a reference image, are used to deblur the target image.
results: Experimental results show the effectiveness of the proposed method.
Abstract
Despite the recent advancement in the study of removing motion blur in an image, it is still hard to deal with strong blurs. While there are limits in removing blurs from a single image, it has more potential to use multiple images, e.g., using an additional image as a reference to deblur a blurry image. A typical setting is deblurring an image using a nearby sharp image(s) in a video sequence, as in the studies of video deblurring. This paper proposes a better method to use the information present in a reference image. The method does not need a strong assumption on the reference image. We can utilize an alternative shot of the identical scene, just like in video deblurring, or we can even employ a distinct image from another scene. Our method first matches local patches of the target and reference images and then fuses their features to estimate a sharp image. We employ a patch-based feature matching strategy to solve the difficult problem of matching the blurry image with the sharp reference. Our method can be integrated into pre-existing networks designed for single image deblurring. The experimental results show the effectiveness of the proposed method.
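The patch-based matching step can be sketched with standard tensor operations: unfold both feature maps into patches, normalize, and take the maximum cosine similarity per target patch. This is a generic illustration of the idea, not the authors' matching module; patch size and feature shapes are placeholders.

import torch
import torch.nn.functional as F

def match_patches(feat_blur, feat_ref, patch=3):
    """For each target-feature patch, find the most similar reference patch
    by cosine similarity; returns best indices and scores per position."""
    # feat_*: (1, C, H, W) feature maps from any shared backbone
    unfold = torch.nn.Unfold(kernel_size=patch, padding=patch // 2)
    q = F.normalize(unfold(feat_blur), dim=1)   # (1, C*p*p, H*W)
    k = F.normalize(unfold(feat_ref), dim=1)
    sim = torch.bmm(q.transpose(1, 2), k)       # (1, H*W, H*W) similarities
    score, idx = sim.max(dim=-1)
    return idx, score

fb = torch.randn(1, 64, 32, 32)
fr = torch.randn(1, 64, 32, 32)
idx, score = match_patches(fb, fr)
print(idx.shape, score.mean())

The matched reference features would then be warped to the target positions and fused before decoding the sharp estimate.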
MomentDiff: Generative Video Moment Retrieval from Random to Real
results: Evaluations show that MomentDiff consistently outperforms state-of-the-art methods on three public benchmarks and exhibits better generalization and robustness on the two proposed anti-bias datasets. The code, model, and anti-bias evaluation datasets are available at https://github.com/IMCCretrieval/MomentDiff.
Abstract
Video moment retrieval pursues an efficient and generalized solution to identify the specific temporal segments within an untrimmed video that correspond to a given language description. To achieve this goal, we provide a generative diffusion-based framework called MomentDiff, which simulates a typical human retrieval process from random browsing to gradual localization. Specifically, we first diffuse the real span to random noise, and learn to denoise the random noise to the original span with the guidance of similarity between text and video. This allows the model to learn a mapping from arbitrary random locations to real moments, enabling the ability to locate segments from random initialization. Once trained, MomentDiff could sample random temporal segments as initial guesses and iteratively refine them to generate an accurate temporal boundary. Different from discriminative works (e.g., based on learnable proposals or queries), MomentDiff with random initialized spans could resist the temporal location biases from datasets. To evaluate the influence of the temporal location biases, we propose two anti-bias datasets with location distribution shifts, named Charades-STA-Len and Charades-STA-Mom. The experimental results demonstrate that our efficient framework consistently outperforms state-of-the-art methods on three public benchmarks, and exhibits better generalization and robustness on the proposed anti-bias datasets. The code, model, and anti-bias evaluation datasets are available at https://github.com/IMCCretrieval/MomentDiff.
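The "real span to random noise" forward process is the standard DDPM corruption applied to a normalized (center, width) span. The sketch below shows that forward step only; the denoiser conditioned on text-video similarity, the schedule values, and the iterative refinement at test time are placeholders.

import torch

def make_schedule(T=50):
    betas = torch.linspace(1e-4, 0.05, T)
    return torch.cumprod(1.0 - betas, dim=0)   # alpha_bar_t

def diffuse_span(span, t, alpha_bar):
    """Forward process: corrupt ground-truth (center, width) spans toward
    Gaussian noise, as in standard DDPM training."""
    noise = torch.randn_like(span)
    a = alpha_bar[t].sqrt().unsqueeze(-1)
    s = (1 - alpha_bar[t]).sqrt().unsqueeze(-1)
    return a * span + s * noise, noise

# toy: ground-truth moments normalised to [0, 1] as (center, width)
spans = torch.tensor([[0.42, 0.10], [0.75, 0.20]])
alpha_bar = make_schedule()
t = torch.randint(0, 50, (2,))
noisy, eps = diffuse_span(spans, t, alpha_bar)
print(noisy)
# a denoiser conditioned on video/text features is trained to recover the
# clean spans; at test time it starts from pure noise and refines iteratively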
A Critical Look at the Current Usage of Foundation Model for Dense Recognition Task
results: The survey finds that existing foundation models can deliver strong performance on various downstream tasks, but the current ways of deploying them are not optimal; these findings provide guidance for future research on adopting foundation models for downstream tasks.
Abstract
In recent years, large models trained on huge amounts of cross-modality data, usually termed foundation models, have achieved conspicuous accomplishments in many fields, such as image recognition and generation. Though these foundation models achieve great success in their original application cases, it is still unclear whether they can be applied to other, different downstream tasks. In this paper, we conduct a short survey of current methods for discriminative dense recognition tasks that are built on pretrained foundation models. We also provide a preliminary experimental analysis of an existing open-vocabulary segmentation method based on Stable Diffusion, which indicates that the current way of deploying diffusion models for segmentation is not optimal. This aims to provide insights for future research on adopting foundation models for downstream tasks.
Deep Ensemble Learning with Frame Skipping for Face Anti-Spoofing
results: Extensive experiments on four datasets report state-of-the-art detection performance, measured by half total error rate (HTER), in the most challenging cross-dataset testing scenario.
Abstract
Face presentation attacks (PA), also known as spoofing attacks, pose a substantial threat to biometric systems that rely on facial recognition systems, such as access control systems, mobile payments, and identity verification systems. To mitigate the spoofing risk, several video-based methods have been presented in the literature that analyze facial motion in successive video frames. However, estimating the motion between adjacent frames is a challenging task and requires high computational cost. In this paper, we rephrase the face anti-spoofing task as a motion prediction problem and introduce a deep ensemble learning model with a frame skipping mechanism. In particular, the proposed frame skipping adopts a uniform sampling approach by dividing the original video into video clips of fixed size. By doing so, every nth frame of the clip is selected to ensure that the temporal patterns can easily be perceived during the training of three different recurrent neural networks (RNNs). Motivated by the performance of individual RNNs, a meta-model is developed to improve the overall detection performance by combining the prediction of individual RNNs. Extensive experiments were performed on four datasets, and state-of-the-art performance is reported on MSU-MFSD (3.12%), Replay-Attack (11.19%), and OULU-NPU (12.23%) databases by using half total error rates (HTERs) in the most challenging cross-dataset testing scenario.
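The frame-skipping mechanism itself is simple to sketch: split the video into fixed-size clips and keep every n-th frame of each clip, so the RNNs see coarse temporal patterns at low cost. The clip size and stride below are illustrative, not the paper's settings.

def uniform_frame_skip(num_frames, clip_size, n):
    """Split a video into fixed-size clips and keep every n-th frame of
    each clip (uniform sampling)."""
    kept = []
    for start in range(0, num_frames, clip_size):
        clip = range(start, min(start + clip_size, num_frames))
        kept.extend(list(clip)[::n])
    return kept

print(uniform_frame_skip(num_frames=24, clip_size=8, n=4))
# [0, 4, 8, 12, 16, 20]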
paper_authors: Yun Liu, Yu-Huan Wu, Shi-Chen Zhang, Li Liu, Min Wu, Ming-Ming Cheng
For: This study aims to improve computer-aided tuberculosis diagnosis (CTD) using deep learning.
Methods: The study establishes a large-scale tuberculosis chest X-ray dataset (TBX11K) and proposes the SymFormer model, which learns features through Symmetric Search Attention (SymAttention) and Symmetric Positional Encoding (SPE).
Results: Experiments show that SymFormer achieves state-of-the-art performance on the TBX11K dataset.
Abstract
Tuberculosis (TB) is a major global health threat, causing millions of deaths annually. Although early diagnosis and treatment can greatly improve the chances of survival, it remains a major challenge, especially in developing countries. Recently, computer-aided tuberculosis diagnosis (CTD) using deep learning has shown promise, but progress is hindered by limited training data. To address this, we establish a large-scale dataset, namely the Tuberculosis X-ray (TBX11K) dataset, which contains 11,200 chest X-ray (CXR) images with corresponding bounding box annotations for TB areas. This dataset enables the training of sophisticated detectors for high-quality CTD. Furthermore, we propose a strong baseline, SymFormer, for simultaneous CXR image classification and TB infection area detection. SymFormer incorporates Symmetric Search Attention (SymAttention) to tackle the bilateral symmetry property of CXR images for learning discriminative features. Since CXR images may not strictly adhere to the bilateral symmetry property, we also propose Symmetric Positional Encoding (SPE) to facilitate SymAttention through feature recalibration. To promote future research on CTD, we build a benchmark by introducing evaluation metrics, evaluating baseline models reformed from existing detectors, and running an online challenge. Experiments show that SymFormer achieves state-of-the-art performance on the TBX11K dataset. The data, code, and models will be released.
Noise-to-Norm Reconstruction for Industrial Anomaly Detection and Localization
paper_authors: Shiqi Deng, Zhiyu Sun, Ruiyan Zhuang, Jun Gong
for: To propose a reconstruction-based method using the noise-to-norm paradigm for accurate industrial anomaly detection and localization.
methods: The method uses a reconstruction network based on M-net with multiscale fusion and residual attention modules for end-to-end detection and localization.
results: Experiments show that the method reconstructs anomalous regions into normal patterns well, achieves more competitive results than the latest methods on the MPDD and VisA datasets, and sets a new state-of-the-art on MPDD.
Abstract
Anomaly detection has a wide range of applications and is especially important in industrial quality inspection. Currently, many top-performing anomaly-detection models rely on feature-embedding methods. However, these methods do not perform well on datasets with large variations in object locations. Reconstruction-based methods use reconstruction errors to detect anomalies without considering positional differences between samples. In this study, a reconstruction-based method using the noise-to-norm paradigm is proposed, which avoids the invariant reconstruction of anomalous regions. Our reconstruction network is based on M-net and incorporates multiscale fusion and residual attention modules to enable end-to-end anomaly detection and localization. Experiments demonstrate that the method is effective in reconstructing anomalous regions into normal patterns and achieving accurate anomaly detection and localization. On the MPDD and VisA datasets, our proposed method achieved more competitive results than the latest methods, and it set a new state-of-the-art standard on the MPDD dataset.
Sampling-based Fast Gradient Rescaling Method for Highly Transferable Adversarial Attacks
paper_authors: Xu Han, Anmin Liu, Chenxuan Yao, Yanbo Fan, Kun He
for: To improve the efficiency and transferability of black-box adversarial attacks.
methods: Data rescaling is used to replace the sign function, and a depth-first sampling method is proposed to stabilize the gradient update.
results: Experiments show that the method boosts the transferability of gradient-based attacks and outperforms state-of-the-art baselines.
Abstract
Deep neural networks are known to be vulnerable to adversarial examples crafted by adding human-imperceptible perturbations to the benign input. After achieving nearly 100% attack success rates in white-box setting, more focus is shifted to black-box attacks, of which the transferability of adversarial examples has gained significant attention. In either case, the common gradient-based methods generally use the sign function to generate perturbations on the gradient update, that offers a roughly correct direction and has gained great success. But little work pays attention to its possible limitation. In this work, we observe that the deviation between the original gradient and the generated noise may lead to inaccurate gradient update estimation and suboptimal solutions for adversarial transferability. To this end, we propose a Sampling-based Fast Gradient Rescaling Method (S-FGRM). Specifically, we use data rescaling to substitute the sign function without extra computational cost. We further propose a Depth First Sampling method to eliminate the fluctuation of rescaling and stabilize the gradient update. Our method could be used in any gradient-based attacks and is extensible to be integrated with various input transformation or ensemble methods to further improve the adversarial transferability. Extensive experiments on the standard ImageNet dataset show that our method could significantly boost the transferability of gradient-based attacks and outperform the state-of-the-art baselines.
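The substitution of the sign function can be illustrated with a one-line change to the usual I-FGSM update: rescale the raw gradient by a norm statistic instead of taking its sign. This is a hedged sketch of the general idea; the paper's exact rescaling formula and its depth-first sampling are not reproduced here, and the clamp bound is arbitrary.

import torch

def fgrm_step(x_adv, grad, eps_step):
    """One I-FGSM-style update with sign() replaced by a data-rescaled
    gradient, so the step keeps relative magnitude information."""
    rescaled = grad / grad.abs().mean(dim=(1, 2, 3), keepdim=True)
    rescaled = rescaled.clamp(-2.0, 2.0)   # bound the step like sign() would
    return x_adv + eps_step * rescaled

def sign_step(x_adv, grad, eps_step):      # the usual baseline for comparison
    return x_adv + eps_step * grad.sign()

x = torch.rand(4, 3, 32, 32)
g = torch.randn_like(x)
print((fgrm_step(x, g, 2 / 255) - x).abs().max())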
Bundle-specific Tractogram Distribution Estimation Using Higher-order Streamline Differential Equation
results: Experiments show that the method can directly reconstruct complex global fiber bundles, reduces local error deviation and accumulation, and better reconstructs long-range, twisting, and large fanning tracts.
Abstract
Tractography traces the peak directions extracted from the fiber orientation distribution (FOD), which suffers from ambiguous spatial correspondences between diffusion directions and fiber geometry and is therefore prone to producing erroneous tracks while missing true-positive connections. Peak-based tractography methods 'locally' reconstruct streamlines in a 'single to single' manner and thus lack global information about the trend of the whole fiber bundle. In this work, we propose a novel tractography method based on a bundle-specific tractogram distribution function, using a higher-order streamline differential equation that reconstructs streamline bundles in a 'cluster to cluster' manner. A unified framework for any higher-order streamline differential equation is presented to describe fiber bundles with disjoint streamlines defined on the diffusion tensor vector field. At the global level, the tractography process is simplified to the estimation of bundle-specific tractogram distribution (BTD) coefficients by minimizing an energy optimization model, and the tractogram bundle information is introduced as an anatomical prior to characterize the relations between BTD and the diffusion tensor vector. Experiments are performed on simulated Hough, Sine, and Circle data, ISMRM 2015 Tractography Challenge data, FiberCup data, and in vivo data from the Human Connectome Project (HCP) for qualitative and quantitative evaluation. The results demonstrate that our approach can directly reconstruct complex global fiber bundles. BTD reduces error deviation and accumulation at the local level and shows better results in reconstructing long-range, twisting, and large fanning tracts.
Single Image LDR to HDR Conversion using Conditional Diffusion
results: Comprehensive quantitative and qualitative experiments demonstrate the effectiveness of the method, indicating that a simple conditional diffusion-based approach can replace complex camera-pipeline-based architectures.
Abstract
Digital imaging aims to replicate realistic scenes, but Low Dynamic Range (LDR) cameras cannot represent the wide dynamic range of real scenes, resulting in under-/overexposed images. This paper presents a deep learning-based approach for recovering intricate details from shadows and highlights while reconstructing High Dynamic Range (HDR) images. We formulate the problem as an image-to-image (I2I) translation task and propose a conditional Denoising Diffusion Probabilistic Model (DDPM) based framework using classifier-free guidance. We incorporate a deep CNN-based autoencoder in our proposed framework to enhance the quality of the latent representation of the input LDR image used for conditioning. Moreover, we introduce a new loss function for LDR-HDR translation tasks, termed Exposure Loss. This loss helps direct gradients in the opposite direction of the saturation, further improving the results' quality. By conducting comprehensive quantitative and qualitative experiments, we have effectively demonstrated the proficiency of our proposed method. The results indicate that a simple conditional diffusion-based method can replace the complex camera pipeline-based architectures.
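Classifier-free guidance, as used here, trains one denoiser with random condition dropout and blends conditional and unconditional noise predictions at sampling time. The sketch below shows both pieces with a dummy model; the real conditional network, the conditioning on the autoencoder's latent LDR representation, and the schedule are placeholders.

import torch
import torch.nn as nn

class DummyDenoiser(nn.Module):
    # stand-in for the real conditional network; cond=None means "null" condition
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, x_t, t, cond):
        h = self.conv(x_t)
        return h if cond is None else h + cond

def train_step(model, x0, cond, alpha_bar, p_drop=0.1):
    """DDPM objective with random condition dropout, so a single network
    learns both conditional and unconditional noise prediction."""
    t = torch.randint(0, len(alpha_bar), (x0.size(0),))
    eps = torch.randn_like(x0)
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    s = (1 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    x_t = a * x0 + s * eps
    if torch.rand(()) < p_drop:
        cond = None
    return ((model(x_t, t, cond) - eps) ** 2).mean()

def guided_eps(model, x_t, t, cond, guidance=3.0):
    # classifier-free guidance: blend conditional and unconditional predictions
    eps_c = model(x_t, t, cond)
    eps_u = model(x_t, t, None)
    return eps_u + guidance * (eps_c - eps_u)

model = DummyDenoiser()
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0)
x0, cond = torch.randn(2, 3, 32, 32), torch.randn(2, 3, 32, 32)
print(train_step(model, x0, cond, alpha_bar))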
Advancing Zero-Shot Digital Human Quality Assessment through Text-Prompted Evaluation
for: This paper aims to address the lack of comprehensive digital human quality assessment (DHQA) databases by proposing a subjective quality assessment database called SJTU-H3D, which can serve as a benchmark for DHQA research.
methods: The proposed method leverages semantic and distortion features extracted from projections, as well as geometry features derived from the mesh structure of digital humans. The method employs the Contrastive Language-Image Pre-training (CLIP) model to measure semantic affinity and incorporates the Naturalness Image Quality Evaluator (NIQE) model to capture low-level distortion information.
results: The proposed Digital Human Quality Index (DHQI) demonstrates significant improvements in zero-shot performance and can serve as a robust baseline for DHQA tasks, facilitating advancements in the field.
Abstract
Digital humans have witnessed extensive applications in various domains, necessitating related quality assessment studies. However, there is a lack of comprehensive digital human quality assessment (DHQA) databases. To address this gap, we propose SJTU-H3D, a subjective quality assessment database specifically designed for full-body digital humans. It comprises 40 high-quality reference digital humans and 1,120 labeled distorted counterparts generated with seven types of distortions. The SJTU-H3D database can serve as a benchmark for DHQA research, allowing evaluation and refinement of processing algorithms. Further, we propose a zero-shot DHQA approach that focuses on no-reference (NR) scenarios to ensure generalization capabilities while mitigating database bias. Our method leverages semantic and distortion features extracted from projections, as well as geometry features derived from the mesh structure of digital humans. Specifically, we employ the Contrastive Language-Image Pre-training (CLIP) model to measure semantic affinity and incorporate the Naturalness Image Quality Evaluator (NIQE) model to capture low-level distortion information. Additionally, we utilize dihedral angles as geometry descriptors to extract mesh features. By aggregating these measures, we introduce the Digital Human Quality Index (DHQI), which demonstrates significant improvements in zero-shot performance. The DHQI can also serve as a robust baseline for DHQA tasks, facilitating advancements in the field. The database and the code are available at https://github.com/zzc-1998/SJTU-H3D.
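The semantic-affinity component can be approximated in a few lines with Hugging Face's CLIP: score a rendered projection against a "good quality" vs "poor quality" prompt pair. The prompts here are assumptions rather than the paper's, and the NIQE and dihedral-angle geometry terms of the full DHQI are omitted.

from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a high quality rendering of a digital human",
           "a distorted, low quality rendering of a digital human"]

def semantic_affinity(image):
    """Zero-shot quality cue: softmax over CLIP image-text similarities
    for a positive/negative prompt pair."""
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # (1, 2)
    return logits.softmax(dim=-1)[0, 0].item()      # P("high quality")

img = Image.new("RGB", (224, 224), color=(128, 128, 128))  # stand-in projection
print(semantic_affinity(img))

In the full index, such projection-level scores would be aggregated with the low-level distortion and mesh-geometry measures.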
UIT-Saviors at MEDVQA-GI 2023: Improving Multimodal Learning with Image Enhancement for Gastrointestinal Visual Question Answering
results: The results show that Transformer-based vision models outperform CNN-based ones for VQA on gastrointestinal endoscopy images, and that image enhancement brings clear gains. The best method, a BERT+BEiT fusion with image enhancement, achieves 87.25% accuracy and a 91.85% F1 score on the development test set.
Abstract
In recent years, artificial intelligence has played an important role in medicine and disease diagnosis, with many notable applications, one of which is Medical Visual Question Answering (MedVQA). By combining computer vision and natural language processing, MedVQA systems can assist experts in extracting relevant information from medical images based on a given question and providing precise diagnostic answers. The ImageCLEFmed-MEDVQA-GI-2023 challenge carried out a visual question answering task in the gastrointestinal domain, which includes gastroscopy and colonoscopy images. Our team approached Task 1 of the challenge by proposing a multimodal learning method with image enhancement to improve the VQA performance on gastrointestinal images. The multimodal architecture is set up with a BERT encoder and different pre-trained vision models based on convolutional neural network (CNN) and Transformer architectures for feature extraction from the question and the endoscopy image. The result of this study highlights the dominance of Transformer-based vision models over the CNNs and demonstrates the effectiveness of the image enhancement process, with six out of the eight vision models achieving a better F1-Score. Our best method, which takes advantage of BERT+BEiT fusion and image enhancement, achieves up to 87.25% accuracy and 91.85% F1-Score on the development test set, while also producing good results on the private test set with an accuracy of 82.01%.
SeLiNet: Sentiment enriched Lightweight Network for Emotion Recognition in Images
results: On the EMOTIC dataset, the proposed approach achieves an Average Precision (AP) score of 27.17 versus the baseline score of 27.38 while reducing model size by >85%; on-device, it achieves an AP score of 26.42 with a >93% reduction in model size compared to the baseline.
Abstract
In this paper, we propose a sentiment-enriched lightweight network SeLiNet and an end-to-end on-device pipeline for contextual emotion recognition in images. SeLiNet model consists of body feature extractor, image aesthetics feature extractor, and learning-based fusion network which jointly estimates discrete emotion and human sentiments tasks. On the EMOTIC dataset, the proposed approach achieves an Average Precision (AP) score of 27.17 in comparison to the baseline AP score of 27.38 while reducing the model size by >85%. In addition, we report an on-device AP score of 26.42 with reduction in model size by >93% when compared to the baseline.
CityTrack: Improving City-Scale Multi-Camera Multi-Target Tracking by Location-Aware Tracking and Box-Grained Matching
paper_authors: Jincheng Lu, Xipeng Yang, Jin Ye, Yifu Zhang, Zhikang Zou, Wei Zhang, Xiao Tan
For: The task of multi-camera multi-target tracking (MCMT) in urban traffic visual analysis, with the goal of overcoming the challenges posed by complex and dynamic urban traffic scenes.
Methods: The paper proposes a novel systematic MCMT framework called CityTrack, which integrates various advanced techniques, including a Location-Aware SCMT tracker and a novel Box-Grained Matching (BGM) method for the ICA module.
Results: The paper achieved an IDF1 of 84.91% on the public test set of the CityFlowV2 dataset, ranking 1st in the 2022 AI CITY CHALLENGE and demonstrating the effectiveness of the proposed approach in urban traffic scenes.
Abstract
Multi-Camera Multi-Target Tracking (MCMT) is a computer vision technique that involves tracking multiple targets simultaneously across multiple cameras. MCMT in urban traffic visual analysis faces great challenges due to the complex and dynamic nature of urban traffic scenes, where multiple cameras with different views and perspectives are often used to cover a large city-scale area. Targets in urban traffic scenes often undergo occlusion, illumination changes, and perspective changes, making it difficult to associate targets across different cameras accurately. To overcome these challenges, we propose a novel systematic MCMT framework, called CityTrack. Specifically, we present a Location-Aware SCMT tracker which integrates various advanced techniques to improve its effectiveness in the MCMT task and propose a novel Box-Grained Matching (BGM) method for the ICA module to solve the aforementioned problems. We evaluated our approach on the public test set of the CityFlowV2 dataset and achieved an IDF1 of 84.91%, ranking 1st in the 2022 AI CITY CHALLENGE. Our experimental results demonstrate the effectiveness of our approach in overcoming the challenges posed by urban traffic scenes.
Active Learning with Contrastive Pre-training for Facial Expression Recognition
results: Existing active learning methods are found to perform poorly for FER, likely due to a 'cold start' phenomenon in which the initial labeled samples are not representative of the entire dataset. To address this, contrastive self-supervised pre-training is first performed on the full unlabeled dataset before active learning; this two-step approach improves over random sampling by up to 9.2% and over the best existing active learning baseline by up to 6.7%.
Abstract
Deep learning has played a significant role in the success of facial expression recognition (FER), thanks to large models and vast amounts of labelled data. However, obtaining labelled data requires a tremendous amount of human effort, time, and financial resources. Even though some prior works have focused on reducing the need for large amounts of labelled data using different unsupervised methods, another promising approach called active learning is barely explored in the context of FER. This approach involves selecting and labelling the most representative samples from an unlabelled set to make the best use of a limited 'labelling budget'. In this paper, we implement and study 8 recent active learning methods on three public FER datasets, FER13, RAF-DB, and KDEF. Our findings show that existing active learning methods do not perform well in the context of FER, likely suffering from a phenomenon called 'Cold Start', which occurs when the initial set of labelled samples is not well representative of the entire dataset. To address this issue, we propose contrastive self-supervised pre-training, which first learns the underlying representations based on the entire unlabelled dataset. We then follow this with the active learning methods and observe that our 2-step approach shows up to 9.2% improvement over random sampling and up to 6.7% improvement over the best existing active learning baseline without the pre-training. We will make the code for this study public upon publication at: github.com/ShuvenduRoy/ActiveFER.
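The 2-step recipe above (self-supervised pre-training on the whole unlabelled pool, then active selection under a labelling budget) is easy to sketch. The snippet below is a minimal, hypothetical illustration using entropy-based uncertainty sampling, one of many selection strategies of the kind the paper evaluates; the array shapes, class count, and budget are assumptions for the example, not values from the paper.

```python
import numpy as np

def entropy_sampling(probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the `budget` most uncertain pool samples by predictive entropy.

    probs: (N, C) softmax outputs of the current model on the unlabelled pool.
    Returns indices of the samples to send to annotators.
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(-entropy)[:budget]

# Step 1 (not shown): contrastive self-supervised pre-training on the
# entire unlabelled dataset, so the representation avoids 'cold start'.
# Step 2: alternate active selection/labelling with supervised fine-tuning.
rng = np.random.default_rng(0)
pool_probs = rng.dirichlet(np.ones(7), size=1000)  # 7 basic FER classes
to_label = entropy_sampling(pool_probs, budget=50)
print(to_label[:10])
```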
An Uncertainty Aided Framework for Learning based Liver $T_1ρ$ Mapping and Analysis
paper_authors: Chaoxing Huang, Vincent Wai Sun Wong, Queenie Chan, Winnie Chiu Wing Chu, Weitian Chen
for: The paper aims to develop a learning-based approach for accurate and reliable quantitative $T_1\rho$ imaging of the liver, which can aid in the assessment of biochemical alterations in liver pathologies.
methods: The proposed approach utilizes deep learning techniques to refine parametric maps and model the uncertainty of the predicted $T_1\rho$ values. The approach also employs a probabilistic framework to improve the mapping performance and remove unreliable pixels in the region of interest.
results: The proposed approach leads to a relative mapping error of less than 3% and provides uncertainty estimation simultaneously. The estimated uncertainty reflects the actual error level and can be used to further reduce the relative $T_1\rho$ mapping error to 2.60% and to remove unreliable pixels in the region of interest effectively.
Abstract
Objective: Quantitative $T_1\rho$ imaging has potential for assessment of biochemical alterations of liver pathologies. Deep learning methods have been employed to accelerate quantitative $T_1\rho$ imaging. To employ artificial intelligence-based quantitative imaging methods in complicated clinical environments, it is valuable to estimate the uncertainty of the predicted $T_1\rho$ values to provide the confidence level of the quantification results. The uncertainty should also be utilized to aid the post-hoc quantitative analysis and model learning tasks. Approach: To address this need, we propose a parametric map refinement approach for learning-based $T_1\rho$ mapping and train the model in a probabilistic way to model the uncertainty. We also propose to utilize the uncertainty map to spatially weight the training of an improved $T_1\rho$ mapping network to further improve the mapping performance and to remove pixels with unreliable $T_1\rho$ values in the region of interest. The framework was tested on a dataset of 51 patients with different liver fibrosis stages. Main results: Our results indicate that the learning-based map refinement method leads to a relative mapping error of less than 3% and provides uncertainty estimation simultaneously. The estimated uncertainty reflects the actual error level, and it can be used to further reduce relative $T_1\rho$ mapping error to 2.60% as well as removing unreliable pixels in the region of interest effectively. Significance: Our studies demonstrate the proposed approach has potential to provide a learning-based quantitative MRI system for trustworthy $T_1\rho$ mapping of the liver.
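One common probabilistic formulation consistent with this description is heteroscedastic regression: the network predicts both a $T_1\rho$ map and a per-pixel log-variance, and the resulting uncertainty later re-weights a refinement loss. The NumPy sketch below shows that idea under these assumptions; the paper's exact losses and weighting scheme may differ.

```python
import numpy as np

def gaussian_nll(pred: np.ndarray, log_var: np.ndarray, target: np.ndarray) -> float:
    """Heteroscedastic negative log-likelihood: minimizing this jointly
    fits the parametric map and learns a per-pixel uncertainty estimate."""
    return float(np.mean(0.5 * np.exp(-log_var) * (pred - target) ** 2
                         + 0.5 * log_var))

def uncertainty_weighted_l2(pred: np.ndarray, target: np.ndarray,
                            sigma: np.ndarray) -> float:
    """Spatially weighted refinement loss: pixels with low predicted
    uncertainty dominate training of the improved mapping network."""
    w = 1.0 / (sigma ** 2 + 1e-8)
    return float(np.mean(w * (pred - target) ** 2))
```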
MMNet: Multi-Collaboration and Multi-Supervision Network for Sequential Deepfake Detection
paper_authors: Ruiyang Xia, Decheng Liu, Jie Li, Lin Yuan, Nannan Wang, Xinbo Gao
for: The paper addresses forged face images, in particular sequential deepfake detection.
methods: The paper proposes the Multi-Collaboration and Multi-Supervision Network (MMNet), which handles forged face images across various spatial scales and sequential permutations, and achieves recovery without requiring knowledge of the manipulation method.
results: Experimental results show that MMNet achieves state-of-the-art detection performance and independent recovery performance.
Abstract
Advanced manipulation techniques have provided criminals with opportunities to make social panic or gain illicit profits through the generation of deceptive media, such as forged face images. In response, various deepfake detection methods have been proposed to assess image authenticity. Sequential deepfake detection, which is an extension of deepfake detection, aims to identify forged facial regions with the correct sequence for recovery. Nonetheless, due to the different combinations of spatial and sequential manipulations, forged face images exhibit substantial discrepancies that severely impact detection performance. Additionally, the recovery of forged images requires knowledge of the manipulation model to implement inverse transformations, which is difficult to ascertain as relevant techniques are often concealed by attackers. To address these issues, we propose Multi-Collaboration and Multi-Supervision Network (MMNet) that handles various spatial scales and sequential permutations in forged face images and achieve recovery without requiring knowledge of the corresponding manipulation method. Furthermore, existing evaluation metrics only consider detection accuracy at a single inferring step, without accounting for the matching degree with ground-truth under continuous multiple steps. To overcome this limitation, we propose a novel evaluation metric called Complete Sequence Matching (CSM), which considers the detection accuracy at multiple inferring steps, reflecting the ability to detect integrally forged sequences. Extensive experiments on several typical datasets demonstrate that MMNet achieves state-of-the-art detection performance and independent recovery performance.
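Reading the metric description literally, Complete Sequence Matching credits a sample only when every step of the predicted manipulation sequence matches the ground truth. A plausible minimal implementation, under that reading (the paper's exact formulation may differ), is:

```python
def complete_sequence_matching(pred_seqs, gt_seqs) -> float:
    """Fraction of samples whose entire predicted manipulation sequence
    matches the ground truth at every inferring step -- a stricter
    criterion than per-step detection accuracy."""
    hits = sum(1 for p, g in zip(pred_seqs, gt_seqs) if list(p) == list(g))
    return hits / max(len(gt_seqs), 1)

# Example: per-step accuracy would score the first sample 2/3, but CSM
# counts it as a miss because the full sequence does not match.
preds = [["eye", "nose", "mouth"], ["hair", "lip"]]
gts   = [["eye", "nose", "brow"],  ["hair", "lip"]]
print(complete_sequence_matching(preds, gts))  # 0.5
```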
Applying a Color Palette with Local Control using Diffusion Models
results: Combining the two editing procedures yields valuable workflows, for example: move a segment, then recolor; or recolor, then force some segments to take a prescribed color. The methods are demonstrated successfully on the Yu-Gi-Oh card art dataset.
Abstract
We demonstrate two novel editing procedures in the context of fantasy card art. Palette transfer applies a specified reference palette to a given card. For fantasy art, the desired change in palette can be very large, leading to huge changes in the "look" of the art. We demonstrate that a pipeline of vector quantization; matching; and "vector dequantization" (using a diffusion model) produces successful extreme palette transfers. Segment control allows an artist to move one or more image segments, and to optionally specify the desired color of the result. The combination of these two types of edit yields valuable workflows, including: move a segment, then recolor; recolor, then force some segments to take a prescribed color. We demonstrate our methods on the challenging Yu-Gi-Oh card art dataset.
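The quantize-and-match stage of palette transfer can be pictured with a small sketch. Below, pixels are vector-quantized to the source palette and source colors are paired with reference colors by luminance rank; the luminance-rank pairing is our assumption for illustration, and the diffusion-based "vector dequantization" step that restores detail is omitted.

```python
import numpy as np

def palette_transfer_vq(pixels: np.ndarray, src_palette: np.ndarray,
                        ref_palette: np.ndarray) -> np.ndarray:
    """Sketch of the quantize-match stage: assign each pixel to its
    nearest source-palette color (vector quantization), match source and
    reference colors by brightness order, then substitute colors."""
    d = np.linalg.norm(pixels[:, None, :] - src_palette[None, :, :], axis=2)
    codes = d.argmin(axis=1)                  # VQ assignment per pixel
    lum = np.array([0.299, 0.587, 0.114])     # Rec. 601 luminance weights
    src_rank = np.argsort(src_palette @ lum)
    ref_rank = np.argsort(ref_palette @ lum)
    mapping = np.empty(len(src_palette), dtype=int)
    mapping[src_rank] = ref_rank              # pair colors by luminance rank
    return ref_palette[mapping[codes]]
```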
A Study on the Impact of Face Image Quality on Face Recognition in the Wild
results: The study finds that recognizing low-quality face images remains challenging for deep learning methods, while humans show greater flexibility and reliability in matching face images of differing quality.
Abstract
Deep learning has received increasing interest in face recognition recently. Large numbers of deep learning methods have been proposed to handle various problems that appear in face recognition. Quite a few deep methods claim to have reached or even surpassed human-level face verification performance on certain databases. As we know, face image quality poses a great challenge to traditional face recognition methods, e.g., model-driven methods with hand-crafted features. However, little research has focused on the impact of face image quality on deep learning methods, or even on human performance. We therefore raise a question: is face image quality still a challenge for deep learning based face recognition, especially under unconstrained conditions? Based on this, we further investigate the problem at the human level. In this paper, we partition face images into three quality sets to evaluate the performance of deep learning methods on cross-quality face images in the wild, and then design a human face verification experiment on these cross-quality data. The results indicate that the quality issue still needs to be studied thoroughly in deep learning, that humans are better at building relations between face images with large quality gaps, and that claiming deep learning methods surpass human-level performance is too optimistic.
GIT: Detecting Uncertainty, Out-Of-Distribution and Adversarial Samples using Gradients and Invariance Transformations
paper_authors: Julia Lust, Alexandru P. Condurache
for: Improving the reliability of deep neural network predictions by detecting erroneous predictions
methods: combining gradient information and invariance transformations to detect generalization errors
results: Achieves detection performance superior to the state of the art across a variety of network architectures, problem setups, and perturbation types.
Abstract
Deep neural networks tend to make overconfident predictions and often require additional detectors for misclassifications, particularly for safety-critical applications. Existing detection methods usually only focus on adversarial attacks or out-of-distribution samples as reasons for false predictions. However, generalization errors occur due to diverse reasons often related to poorly learning relevant invariances. We therefore propose GIT, a holistic approach for the detection of generalization errors that combines the usage of gradient information and invariance transformations. The invariance transformations are designed to shift misclassified samples back into the generalization area of the neural network, while the gradient information measures the contradiction between the initial prediction and the corresponding inherent computations of the neural network using the transformed sample. Our experiments demonstrate the superior performance of GIT compared to the state-of-the-art on a variety of network architectures, problem setups and perturbation types.
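The abstract suggests a score that combines the model's own prediction with gradient information on an invariance-transformed copy of the input. The PyTorch sketch below is our loose interpretation, not the paper's exact detector: the prediction on x serves as a pseudo-label, and the gradient norm of the loss on the transformed sample measures the contradiction; `transform` stands in for any invariance transformation.

```python
import torch
import torch.nn.functional as F

def git_style_score(model: torch.nn.Module, x: torch.Tensor, transform) -> float:
    """Hypothetical generalization-error score in the spirit of GIT: a large
    gradient norm signals disagreement between the initial prediction and
    the network's computations on the invariance-transformed sample."""
    model.eval()
    with torch.no_grad():
        pseudo = model(x).argmax(dim=1)      # initial prediction as pseudo-label
    model.zero_grad()
    loss = F.cross_entropy(model(transform(x)), pseudo)
    loss.backward()
    grads = [p.grad.norm() for p in model.parameters() if p.grad is not None]
    return float(torch.stack(grads).sum())
```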
Spherical Feature Pyramid Networks For Semantic Segmentation
results: On the Stanford 2D-3D-S dataset, our models achieve state-of-the-art performance with an mIOU of 48.75, an improvement of 3.75 IoU points over the previous best spherical CNN.
Abstract
Semantic segmentation for spherical data is a challenging problem in machine learning since conventional planar approaches require projecting the spherical image to the Euclidean plane. Representing the signal on a fundamentally different topology introduces edges and distortions which impact network performance. Recently, graph-based approaches have bypassed these challenges to attain significant improvements by representing the signal on a spherical mesh. Current approaches to spherical segmentation exclusively use variants of the UNet architecture, meaning more successful planar architectures remain unexplored. Inspired by the success of feature pyramid networks (FPNs) in planar image segmentation, we leverage the pyramidal hierarchy of graph-based spherical CNNs to design spherical FPNs. Our spherical FPN models show consistent improvements over spherical UNets, whilst using fewer parameters. On the Stanford 2D-3D-S dataset, our models achieve state-of-the-art performance with an mIOU of 48.75, an improvement of 3.75 IoU points over the previous best spherical CNN.
Active Class Selection for Few-Shot Class-Incremental Learning
results: Experimental results show that the FIASco model effectively enables a robot to continually learn new objects in practical settings through limited interactions, supporting long-term real-world learning and deployment.
Abstract
For real-world applications, robots will need to continually learn in their environments through limited interactions with their users. Toward this, previous works in few-shot class incremental learning (FSCIL) and active class selection (ACS) have achieved promising results but were tested in constrained setups. Therefore, in this paper, we combine ideas from FSCIL and ACS to develop a novel framework that can allow an autonomous agent to continually learn new objects by asking its users to label only a few of the most informative objects in the environment. To this end, we build on a state-of-the-art (SOTA) FSCIL model and extend it with techniques from ACS literature. We term this model Few-shot Incremental Active class SeleCtiOn (FIASco). We further integrate a potential field-based navigation technique with our model to develop a complete framework that can allow an agent to process and reason on its sensory data through the FIASco model, navigate towards the most informative object in the environment, gather data about the object through its sensors and incrementally update the FIASco model. Experimental results on a simulated agent and a real robot show the significance of our approach for long-term real-world robotics applications.
paper_authors: Yeganeh Gharedaghi, Gene Cheung, Xianming Liu
for: Enhancing the quality of images captured in poorly lit conditions
methods: Graph-based regularization with graph Laplacian and gradient graph Laplacian regularizers for joint denoising and contrast enhancement
results: Achieves competitive visual image quality while noticeably reducing computational complexity.
Abstract
Images captured in poorly lit conditions are often corrupted by acquisition noise. Leveraging recent advances in graph-based regularization, we propose a fast Retinex-based restoration scheme that denoises and contrast-enhances an image. Specifically, by Retinex theory we first assume that each image pixel is a multiplication of its reflectance and illumination components. We next assume that the reflectance and illumination components are piecewise constant (PWC) and continuous piecewise planar (PWP) signals, which can be recovered via graph Laplacian regularizer (GLR) and gradient graph Laplacian regularizer (GGLR) respectively. We formulate quadratic objectives regularized by GLR and GGLR, which are minimized alternately until convergence by solving linear systems -- with improved condition numbers via proposed preconditioners -- via conjugate gradient (CG) efficiently. Experimental results show that our algorithm achieves competitive visual image quality while reducing computation complexity noticeably.
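The per-component updates reduce to sparse linear systems. As a minimal sketch (assuming a precomputed graph Laplacian L; the proposed preconditioners are omitted), one GLR-regularized denoising step solves (I + mu*L) x = y with conjugate gradient:

```python
import numpy as np
from scipy.sparse import identity
from scipy.sparse.linalg import cg

def glr_denoise(y: np.ndarray, L, mu: float) -> np.ndarray:
    """Solve argmin_x ||x - y||^2 + mu * x^T L x, i.e. (I + mu L) x = y,
    where L is a sparse graph Laplacian built from pixel similarities.
    The GGLR-regularized illumination update is solved analogously, and
    the two solves alternate until convergence."""
    A = identity(L.shape[0]) + mu * L
    x, info = cg(A, y.ravel())
    assert info == 0, "CG did not converge"
    return x.reshape(y.shape)
```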
results: Experimental results show that the method generates high-quality human-style reactions and applies across different human-behaviour scenarios, producing reactions adapted to the user's behaviour to improve the human-computer interaction experience.
Abstract
Verbal and non-verbal human reaction generation is a challenging task, as different reactions could be appropriate for responding to the same behaviour. This paper proposes the first multiple and multimodal (verbal and nonverbal) appropriate human reaction generation framework that can generate appropriate and realistic human-style reactions (displayed in the form of synchronised text, audio and video streams) in response to an input user behaviour. This novel technique can be applied to various human-computer interaction scenarios by generating appropriate virtual agent/robot behaviours. Our demo is available at \url{https://github.com/SSYSteve/MRecGen}.
GNEP Based Dynamic Segmentation and Motion Estimation for Neuromorphic Imaging
results: The paper demonstrates the efficacy of the approach through a series of experiments.
Abstract
This paper explores the application of event-based cameras in the domains of image segmentation and motion estimation. These cameras offer a groundbreaking technology by capturing visual information as a continuous stream of asynchronous events, departing from the conventional frame-based image acquisition. We introduce a Generalized Nash Equilibrium based framework that leverages the temporal and spatial information derived from the event stream to carry out segmentation and velocity estimation. To establish the theoretical foundations, we derive an existence criteria and propose a multi-level optimization method for calculating equilibrium. The efficacy of this approach is shown through a series of experiments.
Mainline Automatic Train Horn and Brake Performance Metric
results: The proposed submetric is expected to facilitate the comparison of perception systems and to support a standardized prediction of the number of accidents for a given obstacle detection system in a given operational design domain.
Abstract
This paper argues for the introduction of a mainline rail-oriented performance metric for driver-replacing on-board perception systems. Perception at the head of a train is divided into several subfunctions. This article presents a preliminary submetric for the obstacle detection subfunction. To the best of the author's knowledge, no other such proposal for obstacle detection exists. A set of submetrics for the subfunctions should facilitate the comparison of perception systems among each other and guide the measurement of human driver performance. It should also be useful for a standardized prediction of the number of accidents for a given perception system in a given operational design domain. In particular, for the proposal of the obstacle detection submetric, the professional readership is invited to provide their feedback and quantitative information to the author. The analysis results of the feedback will be published separately later.
Semi-supervised Learning from Street-View Images and OpenStreetMap for Automatic Building Height Estimation
paper_authors: Hao Li, Zhendong Yuan, Gabriel Dax, Gefei Kong, Hongchao Fan, Alexander Zipf, Martin Werner
for: The study aims to provide an automated building height estimation method for generating low-cost 3D city models from volunteered geographical information (VGI) data.
methods: A semi-supervised learning (SSL) method based on Mapillary street-view images (SVI) and OpenStreetMap (OSM) data. It consists of three parts: an SSL schema that allows setting different "pseudo label" ratios during supervised regression; multi-level morphometric features extracted from OSM data (buildings and streets) for inferring building height; and a building floor estimation workflow that uses a pre-trained facade object detection network to generate "pseudo labels" from SVI and assign them to the corresponding OSM building footprints.
results: In a case study in the city of Heidelberg, the proposed SSL method was validated and evaluated against reference building heights. Across three regression models, Random Forest (RF), Support Vector Machine (SVM), and Convolutional Neural Network (CNN), the SSL method brings a clear performance boost to building height estimation, with an MAE of about 2.1 m, competitive with existing methods. These encouraging preliminary results motivate future work scaling the approach to low-cost VGI data across regions with diverse data quality and availability.
Abstract
Accurate building height estimation is key to the automatic derivation of 3D city models from emerging big geospatial data, including Volunteered Geographical Information (VGI). However, an automatic solution for large-scale building height estimation based on low-cost VGI data is currently missing. The fast development of VGI data platforms, especially OpenStreetMap (OSM) and crowdsourced street-view images (SVI), offers a stimulating opportunity to fill this research gap. In this work, we propose a semi-supervised learning (SSL) method of automatically estimating building height from Mapillary SVI and OSM data to generate low-cost and open-source 3D city modeling in LoD1. The proposed method consists of three parts: first, we propose an SSL schema with the option of setting a different ratio of "pseudo label" during the supervised regression; second, we extract multi-level morphometric features from OSM data (i.e., buildings and streets) for the purposed of inferring building height; last, we design a building floor estimation workflow with a pre-trained facade object detection network to generate "pseudo label" from SVI and assign it to the corresponding OSM building footprint. In a case study, we validate the proposed SSL method in the city of Heidelberg, Germany and evaluate the model performance against the reference data of building heights. Based on three different regression models, namely Random Forest (RF), Support Vector Machine (SVM), and Convolutional Neural Network (CNN), the SSL method leads to a clear performance boosting in estimating building heights with a Mean Absolute Error (MAE) around 2.1 meters, which is competitive to state-of-the-art approaches. The preliminary result is promising and motivates our future work in scaling up the proposed method based on low-cost VGI data, with possibilities in even regions and areas with diverse data quality and availability.
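The SSL schema essentially pads the small labelled set with pseudo-labelled footprints at a chosen ratio. The sketch below assumes tabular morphometric features and a fixed storey height of 2.5 m for converting detected floor counts into heights; both are illustrative assumptions, not values from the paper.

```python
import numpy as np

STOREY_HEIGHT_M = 2.5  # assumed metres per floor, for illustration only

def floors_to_pseudo_heights(floor_counts: np.ndarray) -> np.ndarray:
    """Turn facade-detected floor counts (from SVI) into pseudo height labels."""
    return floor_counts * STOREY_HEIGHT_M

def mix_training_set(X_lab, y_lab, X_osm, floor_counts,
                     pseudo_ratio: float, rng: np.random.Generator):
    """Assemble a regression training set with `pseudo_ratio` pseudo labels
    relative to the labelled set, mimicking the SSL schema."""
    y_pseudo = floors_to_pseudo_heights(floor_counts)
    n_pseudo = min(int(pseudo_ratio * len(X_lab)), len(X_osm))
    idx = rng.choice(len(X_osm), size=n_pseudo, replace=False)
    X = np.vstack([X_lab, X_osm[idx]])
    y = np.concatenate([y_lab, y_pseudo[idx]])
    return X, y
```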
A Dataset of Inertial Measurement Units for Handwritten English Alphabets
results: Preliminary experimental results show that the dataset and collection system enable accurate recognition of handwritten English alphabets, particularly in India's diverse cultural and linguistic context.
Abstract
This paper presents an end-to-end methodology for collecting datasets to recognize handwritten English alphabets by utilizing Inertial Measurement Units (IMUs) and leveraging the diversity present in the Indian writing style. The IMUs are utilized to capture the dynamic movement patterns associated with handwriting, enabling more accurate recognition of alphabets. The Indian context introduces various challenges due to the heterogeneity in writing styles across different regions and languages. By leveraging this diversity, the collected dataset and the collection system aim to achieve higher recognition accuracy. Some preliminary experimental results demonstrate the effectiveness of the dataset in accurately recognizing handwritten English alphabet in the Indian context. This research can be extended and contributes to the field of pattern recognition and offers valuable insights for developing improved systems for handwriting recognition, particularly in diverse linguistic and cultural contexts.
Large-scale Detection of Marine Debris in Coastal Areas with Sentinel-2
paper_authors: Marc Rußwurm, Sushen Jilla Venkatesa, Devis Tuia
for: This paper aims to detect and quantify marine pollution and macro-plastics using remote sensing technology.
methods: The paper uses a deep segmentation model to detect marine debris in coastal areas, leveraging medium-resolution satellite data. The model is trained on a combination of annotated datasets of marine debris and evaluated on specifically selected test sites.
results: The paper demonstrates that the deep learning model outperforms existing detection models by a large margin, due to the particular dataset design with extensive sampling of negative examples and label refinements. The results show the potential for large-scale automated detection of marine debris, which can help quantify and monitor marine litter with remote sensing at global scales.
Abstract
Detecting and quantifying marine pollution and macro-plastics is an increasingly pressing ecological issue that directly impacts ecology and human health. Efforts to quantify marine pollution are often conducted with sparse and expensive beach surveys, which are difficult to conduct on a large scale. Here, remote sensing can provide reliable estimates of plastic pollution by regularly monitoring and detecting marine debris in coastal areas. Medium-resolution satellite data of coastal areas is readily available and can be leveraged to detect aggregations of marine debris containing plastic litter. In this work, we present a detector for marine debris built on a deep segmentation model that outputs a probability for marine debris at the pixel level. We train this detector with a combination of annotated datasets of marine debris and evaluate it on specifically selected test sites where it is highly probable that plastic pollution is present in the detected marine debris. We demonstrate quantitatively and qualitatively that a deep learning model trained on this dataset, assembled from multiple sources, outperforms existing detection models trained on previous datasets by a large margin. Our experiments show, consistent with the principles of data-centric AI, that this performance is due to our particular dataset design with extensive sampling of negative examples and label refinements, rather than depending on the particular deep learning model. We hope to accelerate advances in the large-scale automated detection of marine debris, which is a step towards quantifying and monitoring marine litter with remote sensing at global scales, and release the model weights and training source code under https://github.com/marccoru/marinedebrisdetector
AxonCallosumEM Dataset: Axon Semantic Segmentation of Whole Corpus Callosum cross section from EM Images
paper_authors: Ao Cheng, Guoqiang Zhao, Lirong Wang, Ruobing Zhang
for: The paper is written for the purpose of introducing a new dataset and a fine-tuning methodology for segmenting EM images of the corpus callosum.
methods: The paper uses a combination of EM imaging and manual annotation to create a large-scale dataset of the corpus callosum, and a fine-tuning methodology called EM-SAM that adapts the Segment Anything Model (SAM) to EM image segmentation tasks.
results: The paper presents the evaluation results of EM-SAM as a baseline, which outperforms other state-of-the-art methods.
Abstract
The electron microscope (EM) remains the predominant technique for elucidating intricate details of the animal nervous system at the nanometer scale. However, accurately reconstructing the complex morphology of axons and myelin sheaths poses a significant challenge. Furthermore, the absence of publicly available, large-scale EM datasets encompassing complete cross sections of the corpus callosum, with dense ground truth segmentation for axons and myelin sheaths, hinders the advancement and evaluation of holistic corpus callosum reconstructions. To surmount these obstacles, we introduce the AxonCallosumEM dataset, comprising a 1.83 × 5.76 mm EM image captured from the corpus callosum of the Rett Syndrome (RTT) mouse model, which entails extensive axon bundles. We meticulously proofread over 600,000 patches at a resolution of 1024 × 1024, thus providing a comprehensive ground truth for myelinated axons and myelin sheaths. Additionally, we extensively annotated three distinct regions within the dataset for the purposes of training, testing, and validation. Utilizing this dataset, we develop a fine-tuning methodology that adapts the Segment Anything Model (SAM) to EM image segmentation tasks, called EM-SAM, which outperforms other state-of-the-art methods. Furthermore, we present the evaluation results of EM-SAM as a baseline.
Expert-Agnostic Ultrasound Image Quality Assessment using Deep Variational Clustering
paper_authors: Deepak Raina, Dimitrios Ntentia, SH Chandrashekhara, Richard Voyles, Subir Kumar Saha
for: The paper aims to develop an unsupervised ultrasound image quality assessment network to eliminate the burden and uncertainty of manual annotations.
methods: The proposed framework, US2QNet, uses a variational autoencoder with three modules, pre-processing, clustering, and post-processing, to jointly enhance, extract, cluster, and visualize the quality feature representation of ultrasound images.
results: The proposed framework achieved 78% accuracy and superior performance to state-of-the-art clustering methods in assessing the quality of urinary bladder ultrasound images.
Abstract
Ultrasound imaging is a commonly used modality for several diagnostic and therapeutic procedures. However, the diagnosis by ultrasound relies heavily on the quality of images assessed manually by sonographers, which diminishes the objectivity of the diagnosis and makes it operator-dependent. The supervised learning-based methods for automated quality assessment require manually annotated datasets, which are highly labour-intensive to acquire. These ultrasound images are low in quality and suffer from noisy annotations caused by inter-observer perceptual variations, which hampers learning efficiency. We propose an UnSupervised UltraSound image Quality assessment Network, US2QNet, that eliminates the burden and uncertainty of manual annotations. US2QNet uses the variational autoencoder embedded with the three modules, pre-processing, clustering and post-processing, to jointly enhance, extract, cluster and visualize the quality feature representation of ultrasound images. The pre-processing module uses filtering of images to point the network's attention towards salient quality features, rather than getting distracted by noise. Post-processing is proposed for visualizing the clusters of feature representations in 2D space. We validated the proposed framework for quality assessment of the urinary bladder ultrasound images. The proposed framework achieved 78% accuracy and superior performance to state-of-the-art clustering methods.
LLCaps: Learning to Illuminate Low-Light Capsule Endoscopy with Curved Wavelet Attention and Reverse Diffusion
results: Compared with ten state-of-the-art low-light image enhancement methods, the proposed method proves superior both quantitatively and qualitatively, and further demonstrates clinical potential on GI disease segmentation.
Abstract
Wireless capsule endoscopy (WCE) is a painless and non-invasive diagnostic tool for gastrointestinal (GI) diseases. However, due to GI anatomical constraints and hardware manufacturing limitations, WCE vision signals may suffer from insufficient illumination, leading to a complicated screening and examination procedure. Deep learning-based low-light image enhancement (LLIE) in the medical field gradually attracts researchers. Given the exuberant development of the denoising diffusion probabilistic model (DDPM) in computer vision, we introduce a WCE LLIE framework based on the multi-scale convolutional neural network (CNN) and reverse diffusion process. The multi-scale design allows models to preserve high-resolution representation and context information from low-resolution, while the curved wavelet attention (CWA) block is proposed for high-frequency and local feature learning. Furthermore, we combine the reverse diffusion procedure to further optimize the shallow output and generate the most realistic image. The proposed method is compared with ten state-of-the-art (SOTA) LLIE methods and significantly outperforms quantitatively and qualitatively. The superior performance on GI disease segmentation further demonstrates the clinical potential of our proposed model. Our code is publicly accessible.
Base Layer Efficiency in Scalable Human-Machine Coding
paper_authors: Yalda Foroutan, Alon Harell, Anderson de Andrade, Ivan V. Bajić
for: Improving the coding efficiency of the base layer in a scalable human-machine image codec.
methods: The paper analyzes the base-layer coding of a state-of-the-art scalable human-machine image codec and optimizes it.
results: The paper shows that optimizing the base layer yields BD-Rate gains of 20-40% over the currently best results on object detection and instance segmentation.
Abstract
A basic premise in scalable human-machine coding is that the base layer is intended for automated machine analysis and is therefore more compressible than the same content would be for human viewing. Use cases for such coding include video surveillance and traffic monitoring, where the majority of the content will never be seen by humans. Therefore, base layer efficiency is of paramount importance because the system would most frequently operate at the base-layer rate. In this paper, we analyze the coding efficiency of the base layer in a state-of-the-art scalable human-machine image codec, and show that it can be improved. In particular, we demonstrate that gains of 20-40% in BD-Rate compared to the currently best results on object detection and instance segmentation are possible.
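BD-Rate, the metric behind the 20-40% gains, is the average bitrate difference between two rate-quality curves at equal quality. A standard implementation (cubic fit of log-rate against quality, integrated over the overlapping quality interval) looks like this; here "quality" could be PSNR or, for the machine-analysis base layer, a task score such as mAP.

```python
import numpy as np

def bd_rate(rates_ref, qual_ref, rates_test, qual_test) -> float:
    """Bjontegaard delta rate (%): negative means the test codec needs
    less bitrate than the reference at the same quality."""
    p_ref = np.polyfit(qual_ref, np.log(rates_ref), 3)
    p_test = np.polyfit(qual_test, np.log(rates_test), 3)
    lo = max(min(qual_ref), min(qual_test))   # overlapping quality range
    hi = min(max(qual_ref), max(qual_test))
    int_ref = np.diff(np.polyval(np.polyint(p_ref), [lo, hi]))[0]
    int_test = np.diff(np.polyval(np.polyint(p_test), [lo, hi]))[0]
    avg_log_diff = (int_test - int_ref) / (hi - lo)
    return float((np.exp(avg_log_diff) - 1.0) * 100.0)
```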
DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models
results: The method supports various editing modes on generated or real images, including object moving, object resizing, object appearance replacement, and content dragging. All editing and content-preservation signals come from the image itself, requiring no fine-tuning or additional modules.
Abstract
Despite the ability of existing large-scale text-to-image (T2I) models to generate high-quality images from detailed textual descriptions, they often lack the ability to precisely edit the generated or real images. In this paper, we propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models. Specifically, we construct classifier guidance based on the strong correspondence of intermediate features in the diffusion model. It can transform the editing signals into gradients via feature correspondence loss to modify the intermediate representation of the diffusion model. Based on this guidance strategy, we also build a multi-scale guidance to consider both semantic and geometric alignment. Moreover, a cross-branch self-attention is added to maintain the consistency between the original image and the editing result. Our method, through an efficient design, achieves various editing modes for the generated or real images, such as object moving, object resizing, object appearance replacement, and content dragging. It is worth noting that all editing and content preservation signals come from the image itself, and the model does not require fine-tuning or additional modules. Our source code will be available at https://github.com/MC-E/DragonDiffusion.
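The core mechanism, turning an editing signal into a gradient on the diffusion latent via a feature correspondence loss, can be sketched in a few lines of PyTorch. `extract_feats` stands in for hooking the denoiser's intermediate features; the function, the loss choice, and the step size are illustrative assumptions rather than the paper's exact guidance rule.

```python
import torch
import torch.nn.functional as F

def guidance_step(z_t: torch.Tensor, feat_ref: torch.Tensor,
                  extract_feats, step_size: float = 1.0) -> torch.Tensor:
    """One hypothetical guidance step: push the intermediate diffusion
    features of the edited latent toward reference features taken from
    the original image at the target location."""
    z = z_t.detach().requires_grad_(True)
    loss = F.mse_loss(extract_feats(z), feat_ref)  # feature correspondence loss
    (grad,) = torch.autograd.grad(loss, z)
    return z_t - step_size * grad                  # editing signal as a gradient
```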
Unbalanced Optimal Transport: A Unified Framework for Object Detection
results: Experiments show that matching with Unbalanced Optimal Transport reaches the state of the art in both Average Precision and Average Recall, while providing faster initial convergence.
Abstract
During training, supervised object detection tries to correctly match the predicted bounding boxes and associated classification scores to the ground truth. This is essential to determine which predictions are to be pushed towards which solutions, or to be discarded. Popular matching strategies include matching to the closest ground truth box (mostly used in combination with anchors), or matching via the Hungarian algorithm (mostly used in anchor-free methods). Each of these strategies comes with its own properties, underlying losses, and heuristics. We show how Unbalanced Optimal Transport unifies these different approaches and opens a whole continuum of methods in between. This allows for a finer selection of the desired properties. Experimentally, we show that training an object detection model with Unbalanced Optimal Transport is able to reach the state-of-the-art both in terms of Average Precision and Average Recall as well as to provide a faster initial convergence. The approach is well suited for GPU implementation, which proves to be an advantage for large-scale models.
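To make the "continuum of methods" concrete, here is a minimal entropic unbalanced-OT matcher over a prediction-to-ground-truth cost matrix. The softened Sinkhorn exponent tau/(tau+eps) relaxes the marginal constraints: large tau approaches balanced (Hungarian-like) matching, small tau tolerates unmatched predictions. The values of eps and tau are illustrative, not the paper's settings.

```python
import numpy as np

def unbalanced_sinkhorn(C: np.ndarray, eps: float = 0.1,
                        tau: float = 1.0, iters: int = 200) -> np.ndarray:
    """Entropic unbalanced OT with KL-relaxed uniform marginals; returns a
    soft matching matrix between predictions (rows) and ground truths
    (columns)."""
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-C / eps)
    u, v = np.ones(n), np.ones(m)
    rho = tau / (tau + eps)   # rho -> 1 recovers balanced Sinkhorn
    for _ in range(iters):
        u = (a / (K @ v)) ** rho
        v = (b / (K.T @ u)) ** rho
    return u[:, None] * K * v[None, :]

# Example: 3 predicted boxes vs 2 ground-truth boxes.
rng = np.random.default_rng(0)
P = unbalanced_sinkhorn(rng.random((3, 2)))
print(P.round(3))  # row masses may differ: surplus predictions stay unmatched
```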
RADiff: Controllable Diffusion Models for Radio Astronomical Maps Generation
paper_authors: Renato Sortino, Thomas Cecconello, Andrea DeMarco, Giuseppe Fiameni, Andrea Pilzer, Andrew M. Hopkins, Daniel Magro, Simone Riggi, Eva Sciacca, Adriano Ingallinera, Cristobal Bordiu, Filomena Bufano, Concetto Spampinato
results: The results show that the generated synthetic images improve the performance of semantic segmentation models and can automatically increase the scale and diversity of annotated datasets.
Abstract
Along with the nearing completion of the Square Kilometre Array (SKA), comes an increasing demand for accurate and reliable automated solutions to extract valuable information from the vast amount of data it will allow acquiring. Automated source finding is a particularly important task in this context, as it enables the detection and classification of astronomical objects. Deep-learning-based object detection and semantic segmentation models have proven to be suitable for this purpose. However, training such deep networks requires a high volume of labeled data, which is not trivial to obtain in the context of radio astronomy. Since data needs to be manually labeled by experts, this process is not scalable to large dataset sizes, limiting the possibilities of leveraging deep networks to address several tasks. In this work, we propose RADiff, a generative approach based on conditional diffusion models trained over an annotated radio dataset to generate synthetic images, containing radio sources of different morphologies, to augment existing datasets and reduce the problems caused by class imbalances. We also show that it is possible to generate fully-synthetic image-annotation pairs to automatically augment any annotated dataset. We evaluate the effectiveness of this approach by training a semantic segmentation model on a real dataset augmented in two ways: 1) using synthetic images obtained from real masks, and 2) generating images from synthetic semantic masks. We show an improvement in performance when applying augmentation, gaining up to 18% in performance when using real masks and 4% when augmenting with synthetic masks. Finally, we employ this model to generate large-scale radio maps with the objective of simulating Data Challenges.
results: The results show that ChatGPT indeed has novel knowledge that can enhance existing NLP methods through fusion, be it early or late fusion.
Abstract
The employment of foundation models is steadily expanding, especially with the launch of ChatGPT and the release of other foundation models. These models have shown the potential of emerging capabilities to solve problems they were not specifically trained for. A previous work demonstrated these emerging capabilities in affective computing tasks; the performance quality was similar to traditional Natural Language Processing (NLP) techniques, but fell short of specialised trained models, like fine-tuning of the RoBERTa language model. In this work, we extend this by exploring whether ChatGPT has novel knowledge that would enhance existing specialised models when they are fused together. We achieve this by investigating the utility of verbose responses from ChatGPT about solving a downstream task, in addition to studying the utility of fusing that with existing NLP methods. The study is conducted on three affective computing problems, namely sentiment analysis, suicide tendency detection, and big-five personality assessment. The results conclude that ChatGPT has indeed novel knowledge that can improve existing NLP techniques by way of fusion, be it early or late fusion.
Hybrid Knowledge-Data Driven Channel Semantic Acquisition and Beamforming for Cell-Free Massive MIMO
paper_authors: Zhen Gao, Shicong Liu, Yu Su, Zhongxiang Li, Dezhi Zheng
For: Advancing outdoor wireless systems to better support ubiquitous extended reality (XR) applications and to close the gap with current indoor wireless transmission capabilities.
Methods: A hybrid knowledge-data driven method: a data-driven multiple layer perceptron (MLP)-Mixer-based autoencoder for channel semantic acquisition, and, built on the acquired channel semantics, a knowledge-driven deep-unfolding multi-user beamformer.
Results: Simulation results show that the proposed scheme improves channel acquisition accuracy while reducing the complexity of both CSI acquisition and beamformer design; in downlink transmission, the proposed beamforming method reaches approximately 96% of the converged spectral efficiency after only three iterations.
Abstract
This paper focuses on advancing outdoor wireless systems to better support ubiquitous extended reality (XR) applications, and close the gap with current indoor wireless transmission capabilities. We propose a hybrid knowledge-data driven method for channel semantic acquisition and multi-user beamforming in cell-free massive multiple-input multiple-output (MIMO) systems. Specifically, we firstly propose a data-driven multiple layer perceptron (MLP)-Mixer-based auto-encoder for channel semantic acquisition, where the pilot signals, CSI quantizer for channel semantic embedding, and CSI reconstruction for channel semantic extraction are jointly optimized in an end-to-end manner. Moreover, based on the acquired channel semantic, we further propose a knowledge-driven deep-unfolding multi-user beamformer, which is capable of achieving good spectral efficiency with robustness to imperfect CSI in outdoor XR scenarios. By unfolding conventional successive over-relaxation (SOR)-based linear beamforming scheme with deep learning, the proposed beamforming scheme is capable of adaptively learning the optimal parameters to accelerate convergence and improve the robustness to imperfect CSI. The proposed deep unfolding beamforming scheme can be used for access points (APs) with fully-digital array and APs with hybrid analog-digital array. Simulation results demonstrate the effectiveness of our proposed scheme in improving the accuracy of channel acquisition, as well as reducing complexity in both CSI acquisition and beamformer design. The proposed beamforming method achieves approximately 96% of the converged spectrum efficiency performance after only three iterations in downlink transmission, demonstrating its efficacy and potential to improve outdoor XR applications.
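Deep unfolding here means treating each SOR iteration as a network layer with a learnable relaxation factor. The NumPy sketch below fixes the per-layer factors to stand in for trained weights; in the paper these would be learned end-to-end together with the rest of the beamformer.

```python
import numpy as np

def sor_unfolded(A: np.ndarray, b: np.ndarray, omegas) -> np.ndarray:
    """Successive over-relaxation, each step solving
    (D + w L) x_{k+1} = w b + ((1 - w) D - w U) x_k,
    with one relaxation factor w per unfolded 'layer'."""
    D = np.diag(np.diag(A))
    L = np.tril(A, -1)
    U = np.triu(A, 1)
    x = np.zeros_like(b)
    for w in omegas:                     # each pass = one network layer
        x = np.linalg.solve(D + w * L, w * b + ((1 - w) * D - w * U) @ x)
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # toy SPD system
b = np.array([1.0, 2.0])
print(sor_unfolded(A, b, omegas=[1.2, 1.1, 1.0]))  # ~ np.linalg.solve(A, b)
```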
DeepOnto: A Python Package for Ontology Engineering with Deep Learning
results: The paper presents DeepOnto, a Python package that supports ontology engineering tasks, including ontology alignment and completion, by harnessing deep learning methods and pre-trained LMs. Two use cases are demonstrated: Digital Health Coaching at Samsung Research UK and the Bio-ML track of the OAEI.
Abstract
Applying deep learning techniques, particularly language models (LMs), in ontology engineering has raised widespread attention. However, deep learning frameworks like PyTorch and Tensorflow are predominantly developed for Python programming, while widely-used ontology APIs, such as the OWL API and Jena, are primarily Java-based. To facilitate seamless integration of these frameworks and APIs, we present Deeponto, a Python package designed for ontology engineering. The package encompasses a core ontology processing module founded on the widely-recognised and reliable OWL API, encapsulating its fundamental features in a more "Pythonic" manner and extending its capabilities to include other essential components including reasoning, verbalisation, normalisation, projection, and more. Building on this module, Deeponto offers a suite of tools, resources, and algorithms that support various ontology engineering tasks, such as ontology alignment and completion, by harnessing deep learning methodologies, primarily pre-trained LMs. In this paper, we also demonstrate the practical utility of Deeponto through two use-cases: the Digital Health Coaching in Samsung Research UK and the Bio-ML track of the Ontology Alignment Evaluation Initiative (OAEI).
Generalizing Backpropagation for Gradient-Based Interpretability
results: The study shows that computing interpretable statistics over a model's gradient graph offers a quick view into the inner workings of neural network models, and, on the SVA task, identifies which pathways of the self-attention mechanism are most important.
Abstract
Many popular feature-attribution methods for interpreting deep neural networks rely on computing the gradients of a model's output with respect to its inputs. While these methods can indicate which input features may be important for the model's prediction, they reveal little about the inner workings of the model itself. In this paper, we observe that the gradient computation of a model is a special case of a more general formulation using semirings. This observation allows us to generalize the backpropagation algorithm to efficiently compute other interpretable statistics about the gradient graph of a neural network, such as the highest-weighted path and entropy. We implement this generalized algorithm, evaluate it on synthetic datasets to better understand the statistics it computes, and apply it to study BERT's behavior on the subject-verb number agreement task (SVA). With this method, we (a) validate that the amount of gradient flow through a component of a model reflects its importance to a prediction and (b) for SVA, identify which pathways of the self-attention mechanism are most important.
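The semiring view is easy to demonstrate on a toy computation graph: ordinary backpropagation accumulates gradients with (sum, product); swapping in the max-plus semiring over log absolute edge weights yields the highest-weighted-path statistic instead. The graph and weights below are a made-up example, not from the paper.

```python
import numpy as np

def highest_weighted_path(adj) -> float:
    """Backprop generalized to the max-plus semiring: replacing (sum, product)
    with (max, +) over log-absolute edge weights turns gradient accumulation
    into a highest-weighted-path computation over the (topologically ordered)
    computation graph."""
    n = len(adj)
    score = np.full(n, -np.inf)
    score[0] = 0.0  # source node (e.g., the model's input)
    for j in range(1, n):
        for i in range(j):
            if adj[i][j] is not None:
                score[j] = max(score[j], score[i] + np.log(abs(adj[i][j])))
    return float(score[-1])  # log weight of the strongest input-to-output path

# Tiny example: 4-node graph, edge weights = local partial derivatives.
adj = [[None, 0.5, 2.0, None],
       [None, None, None, 0.1],
       [None, None, None, 3.0],
       [None, None, None, None]]
print(highest_weighted_path(adj))  # path 0->2->3 wins: log(2 * 3)
```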
results: Experiments show that with a contrast set consisting only of imitations, the Swin Transformer reaches an authentication accuracy of over 85%, while EfficientNet achieves the best overall performance on the standard contrast set. These results indicate that Vision Transformers are a strong and promising contender for art authentication.
Abstract
In recent years, Transformers, initially developed for language, have been successfully applied to visual tasks. Vision Transformers have been shown to push the state-of-the-art in a wide range of tasks, including image classification, object detection, and semantic segmentation. While ample research has shown promising results in art attribution and art authentication tasks using Convolutional Neural Networks, this paper examines if the superiority of Vision Transformers extends to art authentication, improving, thus, the reliability of computer-based authentication of artworks. Using a carefully compiled dataset of authentic paintings by Vincent van Gogh and two contrast datasets, we compare the art authentication performances of Swin Transformers with those of EfficientNet. Using a standard contrast set containing imitations and proxies (works by painters with styles closely related to van Gogh), we find that EfficientNet achieves the best performance overall. With a contrast set that only consists of imitations, we find the Swin Transformer to be superior to EfficientNet by achieving an authentication accuracy of over 85%. These results lead us to conclude that Vision Transformers represent a strong and promising contender in art authentication, particularly in enhancing the computer-based ability to detect artistic imitations.
Sequential Neural Barriers for Scalable Dynamic Obstacle Avoidance
methods: We propose a novel compositional learning approach for Sequential Neural Control Barrier models (SNCBFs), exploiting the observation that the spatial interaction patterns of multiple dynamic obstacles can be predicted from temporal sequences of states for each obstacle. Through this decomposition, control policies trained with only a small number of obstacles can be extended to environments with much higher obstacle densities.
results: We compare the proposed approach with existing methods including potential fields, end-to-end reinforcement learning, and model predictive control, and conduct hardware experiments that demonstrate its practical effectiveness.
Abstract
There are two major challenges for scaling up robot navigation around dynamic obstacles: the complex interaction dynamics of the obstacles can be hard to model analytically, and the complexity of planning and control grows exponentially in the number of obstacles. Data-driven and learning-based methods are thus particularly valuable in this context. However, data-driven methods are sensitive to distribution drift, making it hard to train and generalize learned models across different obstacle densities. We propose a novel method for compositional learning of Sequential Neural Control Barrier models (SNCBFs) to achieve scalability. Our approach exploits an important observation: the spatial interaction patterns of multiple dynamic obstacles can be decomposed and predicted through temporal sequences of states for each obstacle. Through decomposition, we can generalize control policies trained only with a small number of obstacles, to environments where the obstacle density can be 100x higher. We demonstrate the benefits of the proposed methods in improving dynamic collision avoidance in comparison with existing methods including potential fields, end-to-end reinforcement learning, and model-predictive control. We also perform hardware experiments and show the practical effectiveness of the approach in the supplementary video.
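As a rough illustration of the compositional idea, here is a minimal sketch in which the dynamics, the stand-in barrier function, and the discrete-time safety condition are all assumptions: one barrier value is computed per obstacle from its recent state sequence, the scene-level barrier is their minimum, and candidate actions are filtered with a control-barrier-style decay condition, so the same per-obstacle model scales to many obstacles.

```python
import numpy as np

rng = np.random.default_rng(0)

def barrier_net(robot_pos, obstacle_history):
    """Stand-in for a learned sequential barrier: here, distance to the
    obstacle's (linearly) predicted next position minus a safety radius."""
    velocity = obstacle_history[-1] - obstacle_history[-2]
    predicted = obstacle_history[-1] + velocity
    return np.linalg.norm(robot_pos - predicted) - 0.5  # 0.5 = safety radius

def composed_barrier(robot_pos, histories):
    # Compose per-obstacle barriers with min: safe iff every obstacle is safe.
    return min(barrier_net(robot_pos, h) for h in histories)

def safe_action(robot_pos, histories, candidates, alpha=0.5):
    """Filter candidate displacements with a discrete-time CBF condition:
    h(x') - h(x) >= -alpha * h(x), i.e. the barrier may not decay too fast."""
    h = composed_barrier(robot_pos, histories)
    feasible = [u for u in candidates
                if composed_barrier(robot_pos + u, histories) - h >= -alpha * h]
    # Fall back to the most conservative candidate if none is feasible.
    return max(feasible or candidates,
               key=lambda u: composed_barrier(robot_pos + u, histories))

# 20 obstacles, each with a 2-step position history in the plane.
histories = [rng.normal(size=(2, 2)) * 3 for _ in range(20)]
candidates = [np.array(u) for u in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]]
print(safe_action(np.zeros(2), histories, candidates))
```

Because the barrier is evaluated per obstacle and composed afterwards, the obstacle count appearing at deployment time never has to match the count seen during training.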
Self-supervised Optimization of Hand Pose Estimation using Anatomical Features and Iterative Learning
results: The researchers evaluated parameter and model combinations on a publicly available annotated dataset, selected the best combination, and demonstrated the pipeline's effectiveness by training an activity recognition task in a manual assembly scenario.
Abstract
Manual assembly workers face increasing complexity in their work. Human-centered assistance systems could help, but object recognition as an enabling technology hinders sophisticated human-centered design of these systems. At the same time, activity recognition based on hand poses suffers from poor pose estimation in complex usage scenarios, such as wearing gloves. This paper presents a self-supervised pipeline for adapting hand pose estimation to specific use cases with minimal human interaction. This enables cheap and robust hand pose-based activity recognition. The pipeline consists of a general machine learning model for hand pose estimation trained on a generalized dataset, spatial and temporal filtering to account for anatomical constraints of the hand, and a retraining step to improve the model. Different parameter combinations are evaluated on a publicly available and annotated dataset. The best parameter and model combination is then applied to unlabelled videos from a manual assembly scenario. The effectiveness of the pipeline is demonstrated by training an activity recognition as a downstream task in the manual assembly scenario.
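The two filtering stages lend themselves to a compact sketch. The joint layout, thresholds, and pseudo-label selection rule below are assumptions for illustration, not the paper's exact settings: a spatial check keeps poses whose bone lengths are anatomically plausible, and a temporal check rejects frames where joints jump implausibly far; surviving frames can serve as pseudo-labels for the retraining step.

```python
import numpy as np

BONES = [(0, 1), (1, 2), (2, 3)]           # e.g. a finger chain (assumed)
BONE_LIMITS = [(0.02, 0.06)] * len(BONES)  # plausible lengths in metres
MAX_JOINT_SPEED = 0.05                     # max per-frame displacement (assumed)

def spatially_valid(pose):
    """pose: (num_joints, 3) array of 3D joint positions for one frame."""
    for (a, b), (lo, hi) in zip(BONES, BONE_LIMITS):
        length = np.linalg.norm(pose[a] - pose[b])
        if not (lo <= length <= hi):
            return False
    return True

def temporally_valid(prev_pose, pose):
    """Reject frames where any joint moved implausibly far in one step."""
    return np.all(np.linalg.norm(pose - prev_pose, axis=1) <= MAX_JOINT_SPEED)

def select_pseudo_labels(poses):
    """Keep indices of frames passing both anatomical and temporal checks."""
    kept = []
    for t, pose in enumerate(poses):
        if not spatially_valid(pose):
            continue
        if t > 0 and not temporally_valid(poses[t - 1], pose):
            continue
        kept.append(t)
    return kept

# Synthetic demo: a resting pose with small per-frame estimation noise.
rest = np.array([[0, 0, 0], [0.04, 0, 0], [0.08, 0, 0], [0.12, 0, 0]])
poses = rest + np.random.default_rng(1).normal(0, 0.003, (50, 4, 3))
print(len(select_pseudo_labels(poses)), "of", len(poses), "frames kept")
```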
CORE-GPT: Combining Open Access research and large language models for credible, trustworthy question answering
results: Evaluated on a dataset of 100 questions covering the top 20 scientific domains in CORE, CORE-GPT provides comprehensive and trustworthy answers together with links and references to the relevant literature. Two annotators assessed the quality and relevance of the answers and links, showing that CORE-GPT's answers and links are highly accurate and reliable.
Abstract
In this paper, we present CORE-GPT, a novel question-answering platform that combines GPT-based language models and more than 32 million full-text open access scientific articles from CORE. We first demonstrate that GPT3.5 and GPT4 cannot be relied upon to provide references or citations for generated text. We then introduce CORE-GPT which delivers evidence-based answers to questions, along with citations and links to the cited papers, greatly increasing the trustworthiness of the answers and reducing the risk of hallucinations. CORE-GPT's performance was evaluated on a dataset of 100 questions covering the top 20 scientific domains in CORE, resulting in 100 answers and links to 500 relevant articles. The quality of the provided answers and the relevance of the links were assessed by two annotators. Our results demonstrate that CORE-GPT can produce comprehensive and trustworthy answers across the majority of scientific domains, complete with links to genuine, relevant scientific articles.
A Privacy-Preserving Walk in the Latent Space of Generative Models for Medical Applications
methods: The study uses an auxiliary identity classifier as a guide to walk non-linearly between points in the latent space, steering away from near-duplicates of real samples and thereby avoiding privacy issues.
results: Experiments show that, for any random pair of points in the latent space, the proposed walking strategy is safer than linear interpolation, and demonstrate on two benchmarks (tuberculosis and diabetic retinopathy classification) that training on samples generated by the approach mitigates performance drops while preserving privacy.
Abstract
Generative Adversarial Networks (GANs) have demonstrated their ability to generate synthetic samples that match a target distribution. However, from a privacy perspective, using GANs as a proxy for data sharing is not a safe solution, as they tend to embed near-duplicates of real samples in the latent space. Recent works, inspired by k-anonymity principles, address this issue through sample aggregation in the latent space, with the drawback of reducing the dataset by a factor of k. Our work aims to mitigate this problem by proposing a latent space navigation strategy able to generate diverse synthetic samples that may support effective training of deep models, while addressing privacy concerns in a principled way. Our approach leverages an auxiliary identity classifier as a guide to non-linearly walk between points in the latent space, minimizing the risk of collision with near-duplicates of real samples. We empirically demonstrate that, given any random pair of points in the latent space, our walking strategy is safer than linear interpolation. We then test our path-finding strategy combined to k-same methods and demonstrate, on two benchmarks for tuberculosis and diabetic retinopathy classification, that training a model using samples generated by our approach mitigate drops in performance, while keeping privacy preservation.
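A minimal sketch of the guided-walk idea follows; the models, step sizes, and the entropy-based risk term are assumptions, and the paper's exact objective may differ. The walk drifts from one latent point toward another while taking gradient steps that keep the auxiliary identity classifier uncertain, i.e. away from latent regions it confidently assigns to a real identity.

```python
import torch

def guided_walk(z_a, z_b, identity_clf, steps=20, lam=0.1):
    """Return a list of latent points from z_a to z_b.

    identity_clf(z) -> logits over real-identity classes; low confidence
    (high entropy) means we are far from near-duplicates of real samples.
    """
    path, z = [z_a], z_a.clone()
    for t in range(1, steps + 1):
        z = z.detach().requires_grad_(True)
        probs = torch.softmax(identity_clf(z), dim=-1)
        # Penalize confident identity predictions (negative entropy).
        risk = (probs * torch.log(probs + 1e-8)).sum()
        risk.backward()
        with torch.no_grad():
            direction = (z_b - z) / (steps - t + 1)  # drift toward z_b
            z = z + direction - lam * z.grad          # descend the risk
        path.append(z.detach())
    return path

# Toy setup: 64-d latent space, 10 "real identities" (all assumed).
clf = torch.nn.Linear(64, 10)
z_a, z_b = torch.randn(64), torch.randn(64)
path = guided_walk(z_a, z_b, clf)
print(len(path), path[-1].shape)
```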
How word semantics and phonology affect handwriting of Alzheimer’s patients: a machine learning based analysis
paper_authors: Nicole Dalia Cilia, Claudio De Stefano, Francesco Fontanella, Sabato Marco Siniscalchi
for: This study aims to use the kinematic properties of handwriting to support the diagnosis of neurodegenerative diseases.
methods: The study combines non-invasive acquisition techniques with machine learning approaches.
results: The study found that handwriting tasks with different word types produce different kinematic features; non-regular words required more features for classification on average, yet achieved excellent classification performance, with the best result reaching an accuracy close to 90%.
Abstract
Using kinematic properties of handwriting to support the diagnosis of neurodegenerative disease is a real challenge: non-invasive detection techniques combined with machine learning approaches promise big steps forward in this research field. In literature, the tasks proposed focused on different cognitive skills to elicitate handwriting movements. In particular, the meaning and phonology of words to copy can compromise writing fluency. In this paper, we investigated how word semantics and phonology affect the handwriting of people affected by Alzheimer's disease. To this aim, we used the data from six handwriting tasks, each requiring copying a word belonging to one of the following categories: regular (have a predictable phoneme-grapheme correspondence, e.g., cat), non-regular (have atypical phoneme-grapheme correspondence, e.g., laugh), and non-word (non-meaningful pronounceable letter strings that conform to phoneme-grapheme conversion rules). We analyzed the data using a machine learning approach by implementing four well-known and widely-used classifiers and feature selection. The experimental results showed that the feature selection allowed us to derive a different set of highly distinctive features for each word type. Furthermore, non-regular words needed, on average, more features but achieved excellent classification performance: the best result was obtained on a non-regular, reaching an accuracy close to 90%.
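The analysis recipe, per-word-type feature selection followed by standard classifiers, is straightforward to reproduce in outline. A minimal sketch with synthetic stand-in data; the feature dimensions and the k=10 selection are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Stand-in kinematic features (e.g. velocity, jerk, in-air time) for one
# word type: 80 writers x 30 features, binary patient/control labels.
X = rng.normal(size=(80, 30))
y = rng.integers(0, 2, size=80)

classifiers = {
    "svm": SVC(),
    "tree": DecisionTreeClassifier(random_state=0),
    "nb": GaussianNB(),
    "rf": RandomForestClassifier(random_state=0),
}

for name, clf in classifiers.items():
    # Feature selection inside the pipeline so CV folds stay leakage-free.
    model = make_pipeline(SelectKBest(f_classif, k=10), clf)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Running the same loop once per word category (regular, non-regular, non-word) yields the per-category feature sets the paper compares.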
results: Experimental results show that the proposed multi-modal metric provides stronger data selection performance on the C3 benchmark than existing metrics, and that object-text alignment plays the key role in this metric.
Abstract
One challenge in text-to-image (T2I) generation is the inadvertent reflection of culture gaps present in the training data, which signifies the disparity in generated image quality when the cultural elements of the input text are rarely collected in the training set. Although various T2I models have shown impressive but arbitrary examples, there is no benchmark to systematically evaluate a T2I model's ability to generate cross-cultural images. To bridge the gap, we propose a Challenging Cross-Cultural (C3) benchmark with comprehensive evaluation criteria, which can assess how well-suited a model is to a target culture. By analyzing the flawed images generated by the Stable Diffusion model on the C3 benchmark, we find that the model often fails to generate certain cultural objects. Accordingly, we propose a novel multi-modal metric that considers object-text alignment to filter the fine-tuning data in the target culture, which is used to fine-tune a T2I model to improve cross-cultural generation. Experimental results show that our multi-modal metric provides stronger data selection performance on the C3 benchmark than existing metrics, in which the object-text alignment is crucial. We release the benchmark, data, code, and generated images to facilitate future research on culturally diverse T2I generation (https://github.com/longyuewangdcu/C3-Bench).
A Neuromorphic Architecture for Reinforcement Learning from Real-Valued Observations
paper_authors: Sergio F. Chevtchenko, Yeshwanth Bethi, Teresa B. Ludermir, Saeed Afshar
for: solves Reinforcement Learning (RL) problems with real-valued observations in a hardware-efficient and bio-inspired way.
methods: incorporates multi-layered event-based clustering, Temporal Difference (TD)-error modulation, and eligibility traces to build a novel Spiking Neural Network (SNN) architecture.
results: consistently outperforms a tabular actor-critic algorithm and successfully discovers stable control policies on classic RL environments, with an appealing trade-off in terms of computational and hardware implementation requirements.
Abstract
Reinforcement Learning (RL) provides a powerful framework for decision-making in complex environments. However, implementing RL in hardware-efficient and bio-inspired ways remains a challenge. This paper presents a novel Spiking Neural Network (SNN) architecture for solving RL problems with real-valued observations. The proposed model incorporates multi-layered event-based clustering, with the addition of Temporal Difference (TD)-error modulation and eligibility traces, building upon prior work. An ablation study confirms the significant impact of these components on the proposed model's performance. A tabular actor-critic algorithm with eligibility traces and a state-of-the-art Proximal Policy Optimization (PPO) algorithm are used as benchmarks. Our network consistently outperforms the tabular approach and successfully discovers stable control policies on classic RL environments: mountain car, cart-pole, and acrobot. The proposed model offers an appealing trade-off in terms of computational and hardware implementation requirements. The model does not require an external memory buffer nor a global error gradient computation, and synaptic updates occur online, driven by local learning rules and a broadcasted TD-error signal. Thus, this work contributes to the development of more hardware-efficient RL solutions.
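For reference, the tabular benchmark named in the abstract is itself compact. A minimal sketch (the environment, discretization, and hyperparameters are assumptions) of an actor-critic with eligibility traces, where a single broadcast TD error modulates local, trace-weighted updates, mirrors the learning signal the proposed SNN receives:

```python
import numpy as np

n_states, n_actions = 40, 3               # e.g. a discretized mountain car
gamma, lam, alpha_v, alpha_pi = 0.99, 0.9, 0.1, 0.05

V = np.zeros(n_states)                    # critic
theta = np.zeros((n_states, n_actions))   # actor preferences
e_v = np.zeros_like(V)                    # critic eligibility trace
e_pi = np.zeros_like(theta)               # actor eligibility trace

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def step_update(s, a, r, s_next, done):
    """One online actor-critic update driven by the broadcast TD error."""
    global V, theta, e_v, e_pi
    td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
    # Accumulate traces: states/actions visited recently stay eligible.
    e_v *= gamma * lam
    e_v[s] += 1.0
    e_pi *= gamma * lam
    grad_log = -policy(s)
    grad_log[a] += 1.0                    # d log pi(a|s) / d theta[s]
    e_pi[s] += grad_log
    # Local updates, all modulated by the same scalar TD error.
    V += alpha_v * td_error * e_v
    theta += alpha_pi * td_error * e_pi

# Smoke test on random transitions (a real loop would query an environment).
rng = np.random.default_rng(0)
s = 0
for _ in range(100):
    a = rng.choice(n_actions, p=policy(s))
    s_next = rng.integers(n_states)
    step_update(s, a, rng.normal(), s_next, done=False)
    s = s_next
print(V[:5])
```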
Amplifying Limitations, Harms and Risks of Large Language Models
results: The paper highlights a number of limitations of LLMs, including their failure to meet some basic requirements of human language understanding, and the harms to individuals and organisations that can arise from these limitations.
Abstract
We present this article as a small gesture in an attempt to counter what appears to be exponentially growing hype around Artificial Intelligence (AI) and its capabilities, and the distraction provided by the associated talk of science-fiction scenarios that might arise if AI should become sentient and super-intelligent. It may also help those outside of the field to become more informed about some of the limitations of AI technology. In the current context of popular discourse, AI defaults to mean foundation and large language models (LLMs) such as those used to create ChatGPT. This in itself is a misrepresentation of the diversity, depth and volume of research, researchers, and technology that truly represents the field of AI, a field of research that has existed in software artefacts since at least the 1950s. We set out to highlight a number of limitations of LLMs, and in so doing highlight that harms have already arisen and will continue to arise due to these limitations. Along the way we also highlight some of the associated risks for individuals and organisations in using this technology.
In Time and Space: Towards Usable Adaptive Control for Assistive Robotic Arms
paper_authors: Max Pascher, Kirill Kronhardt, Felix Ferdinand Goldau, Udo Frese, Jens Gerken
for: This paper proposes a new control approach that helps users control robotic arms for grasping and manipulation tasks.
methods: The paper builds on Adaptive DoF Mapping Controls (ADMC) and provides feed-forward multimodal feedback to help users choose a suitable control mapping.
results: A Virtual Reality study shows that the ADMC methods lower task completion time, reduce the number of mode switches, and decrease perceived workload, and that user-centered customization options, such as different adaptation thresholds, are needed to fit different users.
Abstract
Robotic solutions, in particular robotic arms, are becoming more frequently deployed for close collaboration with humans, for example in manufacturing or domestic care environments. These robotic arms require the user to control several Degrees-of-Freedom (DoFs) to perform tasks, primarily involving grasping and manipulating objects. Standard input devices predominantly have two DoFs, requiring time-consuming and cognitively demanding mode switches to select individual DoFs. Contemporary Adaptive DoF Mapping Controls (ADMCs) have been shown to decrease the necessary number of mode switches but were up to now not able to significantly reduce the perceived workload. Users still bear the mental workload of incorporating abstract mode switching into their workflow. We address this by providing feed-forward multimodal feedback using updated recommendations of ADMC, allowing users to visually compare the current and the suggested mapping in real-time. We contrast the effectiveness of two new approaches that a) continuously recommend updated DoF combinations or b) use discrete thresholds between current robot movements and new recommendations. Both are compared in a Virtual Reality (VR) in-person study against a classic control method. Significant results for lowered task completion time, fewer mode switches, and reduced perceived workload conclusively establish that in combination with feedforward, ADMC methods can indeed outperform classic mode switching. A lack of apparent quantitative differences between Continuous and Threshold reveals the importance of user-centered customization options. Including these implications in the development process will improve usability, which is essential for successfully implementing robotic technologies with high user acceptance.
LEA: Improving Sentence Similarity Robustness to Typos Using Lexical Attention Bias
paper_authors: Mario Almagro, Emilio Almazán, Diego Ortego, David Jiménez
for: This work aims to improve the robustness of Transformer-based cross-encoders to textual noise such as typos, improving downstream sentence-similarity performance across multiple domains.
methods: The work introduces a novel LExical-aware Attention module (LEA) that incorporates lexical similarities between words in the two sentences to make cross-encoders more robust to textual noise.
results: Experimental results show that LEA consistently boosts cross-encoder performance in the presence of textual noise while remaining competitive on clean data across domains.
Abstract
Textual noise, such as typos or abbreviations, is a well-known issue that penalizes vanilla Transformers for most downstream tasks. We show that this is also the case for sentence similarity, a fundamental task in multiple domains, e.g. matching, retrieval or paraphrasing. Sentence similarity can be approached using cross-encoders, where the two sentences are concatenated in the input allowing the model to exploit the inter-relations between them. Previous works addressing the noise issue mainly rely on data augmentation strategies, showing improved robustness when dealing with corrupted samples that are similar to the ones used for training. However, all these methods still suffer from the token distribution shift induced by typos. In this work, we propose to tackle textual noise by equipping cross-encoders with a novel LExical-aware Attention module (LEA) that incorporates lexical similarities between words in both sentences. By using raw text similarities, our approach avoids the tokenization shift problem obtaining improved robustness. We demonstrate that the attention bias introduced by LEA helps cross-encoders to tackle complex scenarios with textual noise, specially in domains with short-text descriptions and limited context. Experiments using three popular Transformer encoders in five e-commerce datasets for product matching show that LEA consistently boosts performance under the presence of noise, while remaining competitive on the original (clean) splits. We also evaluate our approach in two datasets for textual entailment and paraphrasing showing that LEA is robust to typos in domains with longer sentences and more natural context. Additionally, we thoroughly analyze several design choices in our approach, providing insights about the impact of the decisions made and fostering future research in cross-encoders dealing with typos.
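The core mechanism is an additive bias on the attention logits. A minimal sketch follows, in which the string-similarity function and the bias weighting are assumptions (the paper defines its own lexical similarity): raw-text similarity between tokens of the two sentences is added to the attention scores, so typo-shifted tokenizations still attend to their lexical counterparts.

```python
import difflib
import torch

def lexical_bias(tokens_a, tokens_b):
    """Pairwise string similarity (0..1) used as an additive attention bias."""
    bias = torch.zeros(len(tokens_a), len(tokens_b))
    for i, ta in enumerate(tokens_a):
        for j, tb in enumerate(tokens_b):
            bias[i, j] = difflib.SequenceMatcher(None, ta, tb).ratio()
    return bias

def attention_with_lexical_bias(q, k, v, bias, weight=1.0):
    """Standard scaled dot-product attention plus a weighted lexical bias."""
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5 + weight * bias
    return torch.softmax(logits, dim=-1) @ v

tokens_a = ["cheap", "runing", "shoes"]      # contains a typo
tokens_b = ["running", "shoes", "sale"]
d = 16
q, k, v = (torch.randn(len(tokens_a), d), torch.randn(len(tokens_b), d),
           torch.randn(len(tokens_b), d))
out = attention_with_lexical_bias(q, k, v, lexical_bias(tokens_a, tokens_b))
print(out.shape)  # torch.Size([3, 16])
```

Because the bias is computed on raw strings rather than on subword embeddings, it is unaffected by the tokenization shift that typos cause.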
Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition
results: Experimental results show that the proposed audio-visual multi-channel speech separation, dereverberation and recognition approach outperforms the comparable audio-only baseline with 9.1% and 6.2% absolute (41.7% and 36.0% relative) word error rate (WER) reductions, together with consistent gains on PESQ, STOI and SRMR scores.
Abstract
Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all system components is proposed in this paper. The efficacy of the video input is consistently demonstrated in mask-based MVDR speech separation, DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end and Conformer ASR back-end. Audio-visual integrated front-end architectures performing speech separation and dereverberation in a pipelined or joint fashion via mask-based WPD are investigated. The error cost mismatch between the speech enhancement front-end and ASR back-end components is minimized by end-to-end jointly fine-tuning using either the ASR cost function alone, or its interpolation with the speech enhancement loss. Experiments were conducted on the mixture overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel speech separation, dereverberation and recognition systems consistently outperformed the comparable audio-only baseline by 9.1% and 6.2% absolute (41.7% and 36.0% relative) word error rate (WER) reductions. Consistent speech enhancement improvements were also obtained on PESQ, STOI and SRMR scores.
BaBE: Enhancing Fairness via Estimation of Latent Explaining Variables
paper_authors: Ruta Binkyte, Daniele Gorla, Catuscia Palamidessi
for: This study addresses unfair discrimination driven by the correlation between a sensitive attribute S and a legitimate explanatory variable E, and proposes a pre-processing method to achieve fairness.
methods: Building on conditional statistical parity and equal opportunity, the study proposes BaBE (Bayesian Bias Elimination), an approach based on a combination of Bayes inference and the Expectation-Maximization method to estimate the latent explanatory variable E.
results: Experiments show that BaBE provides a good level of fairness as well as high accuracy on synthetic and real data sets.
Abstract
We consider the problem of unfair discrimination between two groups and propose a pre-processing method to achieve fairness. Corrective methods like statistical parity usually lead to bad accuracy and do not really achieve fairness in situations where there is a correlation between the sensitive attribute S and the legitimate attribute E (explanatory variable) that should determine the decision. To overcome these drawbacks, other notions of fairness have been proposed, in particular, conditional statistical parity and equal opportunity. However, E is often not directly observable in the data, i.e., it is a latent variable. We may observe some other variable Z representing E, but the problem is that Z may also be affected by S, hence Z itself can be biased. To deal with this problem, we propose BaBE (Bayesian Bias Elimination), an approach based on a combination of Bayes inference and the Expectation-Maximization method, to estimate the most likely value of E for a given Z for each group. The decision can then be based directly on the estimated E. We show, by experiments on synthetic and real data sets, that our approach provides a good level of fairness as well as high accuracy.
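The estimation step can be illustrated with a toy discrete EM. In the minimal sketch below, the emission model, the fact that it is known, and all the numbers are assumptions: within one sensitive group, we estimate the prior over the latent variable E from observed proxies Z, then choose the most likely E per observation via Bayes' rule.

```python
import numpy as np

rng = np.random.default_rng(0)
n_e, n_z = 2, 3
# P(Z | E) for this group: how the biased proxy Z is emitted from latent E.
emission = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.3, 0.6]])

# Observed proxy values, drawn from a hidden prior (0.3, 0.7) over E.
true_prior = np.array([0.3, 0.7])
e_samples = rng.choice(n_e, size=500, p=true_prior)
z = np.array([rng.choice(n_z, p=emission[e]) for e in e_samples])

prior = np.full(n_e, 1.0 / n_e)  # start from a uniform prior over E
for _ in range(50):
    # E-step: posterior responsibilities P(E | Z=z_i) under current prior.
    post = prior[:, None] * emission[:, z]        # shape (n_e, n_obs)
    post /= post.sum(axis=0, keepdims=True)
    # M-step: re-estimate the group's prior over E.
    prior = post.mean(axis=1)

most_likely_e = post.argmax(axis=0)  # decisions can be based on estimated E
print(np.round(prior, 3))            # approaches (0.3, 0.7)
```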
Learning to Solve Tasks with Exploring Prior Behaviours
results: The method outperforms other baselines on three navigation tasks and one robotic manipulation task with sparse rewards. Code is available at https://github.com/Ricky-Zhu/IRDEC.
Abstract
Demonstrations are widely used in Deep Reinforcement Learning (DRL) to facilitate solving tasks with sparse rewards. However, the tasks in real-world scenarios can often have varied initial conditions from the demonstration, which would require additional prior behaviours. For example, consider we are given the demonstration for the task of picking up an object from an open drawer, but the drawer is closed in the training. Without acquiring the prior behaviours of opening the drawer, the robot is unlikely to solve the task. To address this, in this paper we propose an Intrinsic Rewards Driven Example-based Control (IRDEC). Our method can endow agents with the ability to explore and acquire the required prior behaviours and then connect to the task-specific behaviours in the demonstration to solve sparse-reward tasks without requiring additional demonstration of the prior behaviours. Our method outperforms other baselines on three navigation tasks and one robotic manipulation task with sparse rewards. Codes are available at https://github.com/Ricky-Zhu/IRDEC.
results: The results show that the contrastive SetFit setup performs better than vanilla finetuning while using only a fraction of the training samples, and LIME results show that legally informative features, both positive and negative, are boosted and contribute decisively to the classification results. A model finetuned with a contrastive objective therefore appears to base its decisions more confidently on legally informative features.
Abstract
In this study, we analyze data-scarce classification scenarios, where available labeled legal data is small and imbalanced, potentially hurting the quality of the results. We focused on two finetuning objectives; SetFit (Sentence Transformer Finetuning), a contrastive learning setup, and a vanilla finetuning setup on a legal provision classification task. Additionally, we compare the features that are extracted with LIME (Local Interpretable Model-agnostic Explanations) to see which particular features contributed to the model's classification decisions. The results show that a contrastive setup with SetFit performed better than vanilla finetuning while using a fraction of the training samples. LIME results show that the contrastive learning approach helps boost both positive and negative features which are legally informative and contribute to the classification results. Thus a model finetuned with a contrastive objective seems to base its decisions more confidently on legally informative features.
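The contrastive setup is easy to sketch. The snippet below illustrates a SetFit-style recipe with the sentence-transformers API rather than the setfit library itself, so the model name, the pair construction, and the data are all assumptions: sentence pairs sharing a provision label are pulled together, mismatched pairs are pushed apart, and a lightweight head is fit on the adapted embeddings.

```python
from itertools import combinations

from sentence_transformers import InputExample, SentenceTransformer, losses
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader

texts = ["The tenant shall pay rent monthly.",
         "Rent is due on the first of each month.",
         "Either party may terminate with 30 days notice.",
         "The agreement can be ended with one month's warning."]
labels = [0, 0, 1, 1]  # tiny stand-in for few-shot legal provision labels

# Build contrastive pairs: target 1.0 if same class, 0.0 otherwise.
pairs = [InputExample(texts=[texts[i], texts[j]],
                      label=float(labels[i] == labels[j]))
         for i, j in combinations(range(len(texts)), 2)]

model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
loader = DataLoader(pairs, shuffle=True, batch_size=4)
model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))],
          epochs=1, show_progress_bar=False)

# Fit a lightweight classification head on the adapted embeddings.
head = LogisticRegression().fit(model.encode(texts), labels)
print(head.predict(model.encode(["Rent must be paid every month."])))
```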
Towards a safe MLOps Process for the Continuous Development and Safety Assurance of ML-based Systems in the Railway Domain
results: The paper presents a comprehensive workflow that integrates system engineering, safety assurance, and the ML life-cycle, and describes the challenges in automating its different stages. The process is intended to support reliable and efficient ML-based systems that can continuously adapt to changing operating conditions.
Abstract
Traditional automation technologies alone are not sufficient to enable driverless operation of trains (called Grade of Automation (GoA) 4) on non-restricted infrastructure. The required perception tasks are nowadays realized using Machine Learning (ML) and thus need to be developed and deployed reliably and efficiently. One important aspect to achieve this is to use an MLOps process for tackling improved reproducibility, traceability, collaboration, and continuous adaptation of a driverless operation to changing conditions. MLOps mixes ML application development and operation (Ops) and enables high frequency software releases and continuous innovation based on the feedback from operations. In this paper, we outline a safe MLOps process for the continuous development and safety assurance of ML-based systems in the railway domain. It integrates system engineering, safety assurance, and the ML life-cycle in a comprehensive workflow. We present the individual stages of the process and their interactions. Moreover, we describe relevant challenges to automate the different stages of the safe MLOps process.
Enhancing LLM with Evolutionary Fine Tuning for News Summary Generation
results: Experimental results show that the news summary generator produces accurate and reliable news summaries with some generalization ability.
Abstract
News summary generation is an important task in the field of intelligence analysis, which can provide accurate and comprehensive information to help people better understand and respond to complex real-world events. However, traditional news summary generation methods face some challenges, which are limited by the model itself and the amount of training data, as well as the influence of text noise, making it difficult to generate reliable information accurately. In this paper, we propose a new paradigm for news summary generation using LLM with powerful natural language understanding and generative capabilities. We use LLM to extract multiple structured event patterns from the events contained in news paragraphs, evolve the event pattern population with genetic algorithm, and select the most adaptive event pattern to input into the LLM to generate news summaries. A News Summary Generator (NSG) is designed to select and evolve the event pattern populations and generate news summaries. The experimental results show that the news summary generator is able to generate accurate and reliable news summaries with some generalization ability.
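The evolutionary loop can be sketched compactly. In the minimal sketch below, the pattern representation, the fitness function, and the operators are stand-ins (the real system would score patterns with an LLM): a population of structured event patterns is evolved by selection, crossover, and mutation, and the most adaptive pattern is fed to the LLM as the schema for summary generation.

```python
import random

random.seed(0)
FIELDS = ["who", "what", "when", "where", "why", "how"]

def random_pattern():
    return tuple(sorted(random.sample(FIELDS, k=random.randint(2, 4))))

def fitness(pattern):
    """Stand-in score; in practice an LLM would rate how well summaries
    generated from this pattern cover the source paragraph."""
    coverage = len(pattern) / len(FIELDS)
    brevity = 1.0 - abs(len(pattern) - 3) * 0.2  # prefer ~3 slots
    return coverage + brevity

def mutate(pattern):
    slots = set(pattern)
    if random.random() < 0.5 and len(slots) > 2:
        slots.remove(random.choice(sorted(slots)))
    else:
        slots.add(random.choice(FIELDS))
    return tuple(sorted(slots))

def crossover(a, b):
    merged = sorted(set(a) | set(b))
    return tuple(sorted(random.sample(merged, k=min(3, len(merged)))))

population = [random_pattern() for _ in range(20)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                      # truncation selection
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(10)]
    population = parents + children

best = max(population, key=fitness)
print(best)  # the most adaptive pattern, fed to the LLM as a summary schema
```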
Read, Look or Listen? What’s Needed for Solving a Multimodal Dataset
results: The analysis finds that most questions can be answered using a single modality, without a substantial bias towards any specific modality, and that more than 70% of questions are solvable with several different single-modality strategies, e.g., by either looking at the video or listening to the audio. It also finds that MERLOT Reserve struggles with image-based questions compared to text and audio, as well as with auditory speaker identification. Based on these observations, a new test set that necessitates multiple modalities is introduced, on which model performance drops dramatically.
Abstract
The prevalence of large-scale multimodal datasets presents unique challenges in assessing dataset quality. We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it. Our method sheds light on the importance of different modalities in datasets, as well as the relationship between them. We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality. Moreover, we find that more than 70% of the questions are solvable using several different single-modality strategies, e.g., by either looking at the video or listening to the audio, highlighting the limited integration of multiple modalities in TVQA. We leverage our annotation and analyze the MERLOT Reserve, finding that it struggles with image-based questions compared to text and audio, but also with auditory speaker identification. Based on our observations, we introduce a new test set that necessitates multiple modalities, observing a dramatic drop in model performance. Our methodology provides valuable insights into multimodal datasets and highlights the need for the development of more robust models.
Evaluating raw waveforms with deep learning frameworks for speech emotion recognition
paper_authors: Zeynep Hilal Kilimci, Ulku Bayraktar, Ayhan Kucukmanisa
for: The paper addresses speech emotion recognition, feeding raw audio files directly into deep neural networks without a feature extraction stage.
methods: The study evaluates machine learning algorithms, ensemble learning methods, and deep learning techniques, including convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and hybrid CNN-LSTM models.
results: The proposed approach achieves strong performance on six data sets (TESS+RAVDESS, EMO-DB, RAVDESS, CREMA, SAVEE, and TESS); in particular, the CNN model sets a new state of the art with 95.86% accuracy on the TESS+RAVDESS data set using raw audio files.
Abstract
Speech emotion recognition is a challenging task in speech processing field. For this reason, feature extraction process has a crucial importance to demonstrate and process the speech signals. In this work, we represent a model, which feeds raw audio files directly into the deep neural networks without any feature extraction stage for the recognition of emotions utilizing six different data sets, EMO-DB, RAVDESS, TESS, CREMA, SAVEE, and TESS+RAVDESS. To demonstrate the contribution of proposed model, the performance of traditional feature extraction techniques namely, mel-scale spectogram, mel-frequency cepstral coefficients, are blended with machine learning algorithms, ensemble learning methods, deep and hybrid deep learning techniques. Support vector machine, decision tree, naive Bayes, random forests models are evaluated as machine learning algorithms while majority voting and stacking methods are assessed as ensemble learning techniques. Moreover, convolutional neural networks, long short-term memory networks, and hybrid CNN- LSTM model are evaluated as deep learning techniques and compared with machine learning and ensemble learning methods. To demonstrate the effectiveness of proposed model, the comparison with state-of-the-art studies are carried out. Based on the experiment results, CNN model excels existent approaches with 95.86% of accuracy for TESS+RAVDESS data set using raw audio files, thence determining the new state-of-the-art. The proposed model performs 90.34% of accuracy for EMO-DB with CNN model, 90.42% of accuracy for RAVDESS with CNN model, 99.48% of accuracy for TESS with LSTM model, 69.72% of accuracy for CREMA with CNN model, 85.76% of accuracy for SAVEE with CNN model in speaker-independent audio categorization problems.
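The central design choice, consuming raw waveforms with no mel-spectrogram or MFCC front end, can be sketched as a small 1D CNN. The layer sizes and the seven-class output below are assumptions:

```python
import torch
import torch.nn as nn

class RawWaveformCNN(nn.Module):
    def __init__(self, num_emotions=7):
        super().__init__()
        self.features = nn.Sequential(
            # Wide first kernel acts like a learned filterbank over samples.
            nn.Conv1d(1, 32, kernel_size=160, stride=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=9), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=9), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),       # length-independent pooling
        )
        self.classifier = nn.Linear(128, num_emotions)

    def forward(self, waveform):           # waveform: (batch, samples)
        x = waveform.unsqueeze(1)          # -> (batch, 1, samples)
        x = self.features(x).squeeze(-1)   # -> (batch, 128)
        return self.classifier(x)

model = RawWaveformCNN()
batch = torch.randn(8, 16000)              # 8 clips of 1 s at 16 kHz
print(model(batch).shape)                  # torch.Size([8, 7])
```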
Semi-supervised Domain Adaptive Medical Image Segmentation through Consistency Regularized Disentangled Contrastive Learning
paper_authors: Hritam Basak, Zhaozheng Yin
for: This paper focuses on semi-supervised domain adaptation (SSDA) for medical image segmentation, where a few labeled target samples can substantially improve adaptation performance.
methods: The proposed method uses a two-stage training process: an encoder is pre-trained with a novel domain-content disentangled contrastive learning (CL) objective plus a pixel-level feature consistency constraint, and the encoder and decoder are then fine-tuned for pixel-level segmentation in a semi-supervised setting. The CL objective makes the encoder learn discriminative, content-specific but domain-invariant semantics, while consistency regularization maintains spatial sensitivity.
results: The proposed method outperforms state-of-the-art (SoTA) methods in both SSDA and unsupervised domain adaptation (UDA) settings, demonstrating its effectiveness with limited labeled target samples.
Abstract
Although unsupervised domain adaptation (UDA) is a promising direction to alleviate domain shift, they fall short of their supervised counterparts. In this work, we investigate relatively less explored semi-supervised domain adaptation (SSDA) for medical image segmentation, where access to a few labeled target samples can improve the adaptation performance substantially. Specifically, we propose a two-stage training process. First, an encoder is pre-trained in a self-learning paradigm using a novel domain-content disentangled contrastive learning (CL) along with a pixel-level feature consistency constraint. The proposed CL enforces the encoder to learn discriminative content-specific but domain-invariant semantics on a global scale from the source and target images, whereas consistency regularization enforces the mining of local pixel-level information by maintaining spatial sensitivity. This pre-trained encoder, along with a decoder, is further fine-tuned for the downstream task, (i.e. pixel-level segmentation) using a semi-supervised setting. Furthermore, we experimentally validate that our proposed method can easily be extended for UDA settings, adding to the superiority of the proposed strategy. Upon evaluation on two domain adaptive image segmentation tasks, our proposed method outperforms the SoTA methods, both in SSDA and UDA settings. Code is available at https://github.com/hritam-98/GFDA-disentangled
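As a rough illustration of the contrastive ingredient, here is a generic InfoNCE sketch (not the paper's exact domain-content disentangled objective; the batch pairing and temperature are assumptions): embeddings of matching content from the source and target domains are pulled together, and all other pairs in the batch are pushed apart.

```python
import torch
import torch.nn.functional as F

def info_nce(z_source, z_target, temperature=0.1):
    """z_source[i] and z_target[i] are embeddings of matching content."""
    z_s = F.normalize(z_source, dim=1)
    z_t = F.normalize(z_target, dim=1)
    logits = z_s @ z_t.T / temperature        # (batch, batch) similarities
    labels = torch.arange(z_s.shape[0])       # positives on the diagonal
    # Symmetrized: source->target and target->source.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

z_source = torch.randn(16, 128, requires_grad=True)
z_target = torch.randn(16, 128, requires_grad=True)
loss = info_nce(z_source, z_target)
loss.backward()
print(float(loss))
```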
BHEISR: Nudging from Bias to Balance – Promoting Belief Harmony by Eliminating Ideological Segregation in Knowledge-based Recommendations
paper_authors: Mengyan Wang, Yuxuan Hu, Zihan Yuan, Chenting Jiang, Weihua Li, Shiqing Wu, Quan Bai
for: Addressing the issue of belief imbalance and user biases in personalized recommendation systems, and mitigating the negative effects of the filter bubble phenomenon.
methods: Introducing an innovative intermediate agency (BHEISR) that combines principles from nudge theory and user-specific category information to stimulate curiosity and broaden users’ belief horizons.
results: Experimental results show that the BHEISR model outperforms several baseline models in mitigating filter bubbles and balancing user perspectives, with improved recommendation diversity and user satisfaction.
Abstract
In the realm of personalized recommendation systems, the increasing concern is the amplification of belief imbalance and user biases, a phenomenon primarily attributed to the filter bubble. Addressing this critical issue, we introduce an innovative intermediate agency (BHEISR) between users and existing recommendation systems to attenuate the negative repercussions of the filter bubble effect in extant recommendation systems. The main objective is to strike a belief balance for users while minimizing the detrimental influence caused by filter bubbles. The BHEISR model amalgamates principles from nudge theory while upholding democratic and transparent principles. It harnesses user-specific category information to stimulate curiosity, even in areas users might initially deem uninteresting. By progressively stimulating interest in novel categories, the model encourages users to broaden their belief horizons and explore the information they typically overlook. Our model is time-sensitive and operates on a user feedback loop. It utilizes the existing recommendation algorithm of the model and incorporates user feedback from the prior time frame. This approach endeavors to transcend the constraints of the filter bubble, enrich recommendation diversity, and strike a belief balance among users while also catering to user preferences and system-specific business requirements. To validate the effectiveness and reliability of the BHEISR model, we conducted a series of comprehensive experiments with real-world datasets. These experiments compared the performance of the BHEISR model against several baseline models using nearly 200 filter bubble-impacted users as test subjects. Our experimental results conclusively illustrate the superior performance of the BHEISR model in mitigating filter bubbles and balancing user perspectives.
What Should Data Science Education Do with Large Language Models?
results: The paper argues that LLMs are shifting data scientists' responsibilities from hands-on coding, data-wrangling and standard analyses toward assessing and managing analyses performed by automated AIs, a role transition reminiscent of a software engineer becoming a product manager.
Abstract
The rapid advances of large language models (LLMs), such as ChatGPT, are revolutionizing data science and statistics. These state-of-the-art tools can streamline complex processes. As a result, it reshapes the role of data scientists. We argue that LLMs are transforming the responsibilities of data scientists, shifting their focus from hands-on coding, data-wrangling and conducting standard analyses to assessing and managing analyses performed by these automated AIs. This evolution of roles is reminiscent of the transition from a software engineer to a product manager. We illustrate this transition with concrete data science case studies using LLMs in this paper. These developments necessitate a meaningful evolution in data science education. Pedagogy must now place greater emphasis on cultivating diverse skillsets among students, such as LLM-informed creativity, critical thinking, AI-guided programming. LLMs can also play a significant role in the classroom as interactive teaching and learning tools, contributing to personalized education. This paper discusses the opportunities, resources and open challenges for each of these directions. As with any transformative technology, integrating LLMs into education calls for careful consideration. While LLMs can perform repetitive tasks efficiently, it's crucial to remember that their role is to supplement human intelligence and creativity, not to replace it. Therefore, the new era of data science education should balance the benefits of LLMs while fostering complementary human expertise and innovations. In conclusion, the rise of LLMs heralds a transformative period for data science and its education. This paper seeks to shed light on the emerging trends, potential opportunities, and challenges accompanying this paradigm shift, hoping to spark further discourse and investigation into this exciting, uncharted territory.
The Role of Subgroup Separability in Group-Fair Medical Image Classification
results: The study finds that the ability of classifiers to separate individuals into subgroups varies substantially across medical imaging modalities and protected characteristics, and that this property is predictive of algorithmic bias, providing important insights for developing fair medical imaging AI.
Abstract
We investigate performance disparities in deep classifiers. We find that the ability of classifiers to separate individuals into subgroups varies substantially across medical imaging modalities and protected characteristics; crucially, we show that this property is predictive of algorithmic bias. Through theoretical analysis and extensive empirical evaluation, we find a relationship between subgroup separability, subgroup disparities, and performance degradation when models are trained on data with systematic bias such as underdiagnosis. Our findings shed new light on the question of how models become biased, providing important insights for the development of fair medical imaging AI.
Censored Sampling of Diffusion Models Using 3 Minutes of Human Feedback
results: The method achieves extreme human-feedback efficiency: labels generated with a mere few minutes of human feedback are sufficient for censoring undesirable generations.
Abstract
Diffusion models have recently shown remarkable success in high-quality image generation. Sometimes, however, a pre-trained diffusion model exhibits partial misalignment in the sense that the model can generate good images, but it sometimes outputs undesirable images. If so, we simply need to prevent the generation of the bad images, and we call this task censoring. In this work, we present censored generation with a pre-trained diffusion model using a reward model trained on minimal human feedback. We show that censoring can be accomplished with extreme human feedback efficiency and that labels generated with a mere few minutes of human feedback are sufficient. Code available at: https://github.com/tetrzim/diffusion-human-feedback.
PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations
results: Experimental results show that the proposed approaches achieve higher accuracy and align better with human judgments, and that the PR algorithm can induce a relatively accurate self-ranking of models under the anonymous setting.
Abstract
Nowadays, the quality of responses generated by different modern large language models (LLMs) are hard to evaluate and compare automatically. Recent studies suggest and predominantly use LLMs as a reference-free metric for open-ended question answering. More specifically, they use the recognized "strongest" LLM as the evaluator, which conducts pairwise comparisons of candidate models' answers and provides a ranking score. However, this intuitive method has multiple problems, such as bringing in self-enhancement (favoring its own answers) and positional bias. We draw insights and lessons from the educational domain (Cho and MacArthur, 2011; Walsh, 2014) to improve LLM-based evaluations. Specifically, we propose the (1) peer rank (PR) algorithm that takes into account each peer LLM's pairwise preferences of all answer pairs, and outputs a final ranking of models; and (2) peer discussion (PD), where we prompt two LLMs to discuss and try to reach a mutual agreement on preferences of two answers. We conduct experiments on two benchmark datasets. We find that our approaches achieve higher accuracy and align better with human judgments, respectively. Interestingly, PR can induce a relatively accurate self-ranking of models under the anonymous setting, where each model's name is unrevealed. Our work provides space to explore evaluating models that are hard to compare for humans.
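The peer-rank loop can be sketched in a few lines. In the minimal sketch below, the win matrices are stand-in numbers and the weighting rule is a simplification of the paper's algorithm: every reviewer model judges pairwise battles between candidates, each reviewer's vote is weighted by its own current score, and the weights are iterated to a fixed point.

```python
import numpy as np

models = ["A", "B", "C"]
# wins[r][i, j] = fraction of battles where reviewer r preferred model i
# over model j (stand-in numbers; diagonals are unused).
wins = {
    "A": np.array([[0.0, 0.6, 0.8], [0.4, 0.0, 0.7], [0.2, 0.3, 0.0]]),
    "B": np.array([[0.0, 0.5, 0.9], [0.5, 0.0, 0.6], [0.1, 0.4, 0.0]]),
    "C": np.array([[0.0, 0.7, 0.7], [0.3, 0.0, 0.8], [0.3, 0.2, 0.0]]),
}

weights = np.full(len(models), 1.0 / len(models))  # uniform reviewer weights
for _ in range(100):
    # Score of model i: reviewer-weighted average win rate over opponents.
    combined = sum(w * wins[r] for w, r in zip(weights, models))
    scores = combined.sum(axis=1) / (len(models) - 1)
    new_weights = scores / scores.sum()            # better models judge more
    if np.allclose(new_weights, weights, atol=1e-9):
        break
    weights = new_weights

for name, s in sorted(zip(models, scores), key=lambda t: -t[1]):
    print(f"{name}: {s:.3f}")
```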
Knowledge Graph Self-Supervised Rationalization for Recommendation
results: Extensive experiments on three real-world datasets demonstrate that KGRec outperforms state-of-the-art methods. Implementation code is available at https://github.com/HKUDS/KGRec.
Abstract
In this paper, we introduce a new self-supervised rationalization method, called KGRec, for knowledge-aware recommender systems. To effectively identify informative knowledge connections, we propose an attentive knowledge rationalization mechanism that generates rational scores for knowledge triplets. With these scores, KGRec integrates generative and contrastive self-supervised tasks for recommendation through rational masking. To highlight rationales in the knowledge graph, we design a novel generative task in the form of masking-reconstructing. By masking important knowledge with high rational scores, KGRec is trained to rebuild and highlight useful knowledge connections that serve as rationales. To further rationalize the effect of collaborative interactions on knowledge graph learning, we introduce a contrastive learning task that aligns signals from knowledge and user-item interaction views. To ensure noise-resistant contrasting, potential noisy edges in both graphs judged by the rational scores are masked. Extensive experiments on three real-world datasets demonstrate that KGRec outperforms state-of-the-art methods. We also provide the implementation codes for our approach at https://github.com/HKUDS/KGRec.
Offline Reinforcement Learning with Imbalanced Datasets
paper_authors: Li Jiang, Sijie Chen, Jielin Qiu, Haoran Xu, Wai Kin Chan, Zhao Ding
for: This paper aims to address the issue of imbalanced datasets in offline reinforcement learning (RL) research, which can lead to neglect of real-world dataset distributions in the development of models.
methods: The proposed method utilizes the augmentation of conservative Q-learning (CQL) with a retrieval process to recall past related experiences, effectively alleviating the challenges posed by imbalanced datasets.
results: The proposed method is shown to be superior to other baselines through empirical results on several tasks in the context of imbalanced datasets with varying levels of imbalance.Abstract
The prevalent use of benchmarks in current offline reinforcement learning (RL) research has led to a neglect of the imbalance of real-world dataset distributions in the development of models. Real-world offline RL datasets are often imbalanced over the state space due to the challenge of exploration or safety considerations. In this paper, we specify properties of imbalanced datasets in offline RL, where the state coverage follows a power law distribution characterized by skewed policies. Theoretically and empirically, we show that typical offline RL methods based on distributional constraints, such as conservative Q-learning (CQL), are ineffective in extracting policies under imbalanced datasets. Inspired by natural intelligence, we propose a novel offline RL method that augments CQL with a retrieval process to recall past related experiences, effectively alleviating the challenges posed by imbalanced datasets. We evaluate our method on several tasks in the context of imbalanced datasets with varying levels of imbalance, using a variant of D4RL. Empirical results demonstrate the superiority of our method over other baselines.
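The abstract does not spell out the retrieval process, but its role can be sketched as augmenting each training batch with nearest-neighbour transitions recalled from the full offline dataset. The Euclidean metric and sizes below are illustrative stand-ins, and the CQL update itself is omitted.

```python
import numpy as np

def retrieve_related(batch_states, dataset_states, k=8):
    """For each state in the batch, return the indices of its k nearest
    neighbours in the offline dataset (Euclidean distance is an
    illustrative choice; the abstract does not fix the metric)."""
    d2 = (batch_states ** 2).sum(1)[:, None] \
         + (dataset_states ** 2).sum(1)[None, :] \
         - 2.0 * batch_states @ dataset_states.T
    return np.argsort(d2, axis=1)[:, :k]

rng = np.random.default_rng(0)
dataset_states = rng.normal(size=(5000, 17))   # e.g. locomotion observations
batch_states = dataset_states[rng.choice(5000, size=128)]
extra_idx = retrieve_related(batch_states, dataset_states).reshape(-1)
# A conservative Q-learning update would then be computed on the original
# batch plus the recalled transitions indexed by extra_idx.
print(extra_idx.shape)   # (1024,) recalled indices
```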
RecallM: An Architecture for Temporal Context Understanding and Question Answering
results: Various experiments show that the method provides improved temporal understanding and grasp of knowledge.Abstract
The ideal long-term memory mechanism for Large Language Model (LLM) based chatbots would lay the foundation for continual learning and complex reasoning, and would allow sequential and temporal dependencies to be learnt. Creating this type of memory mechanism is an extremely challenging problem. In this paper we explore different methods of achieving the effect of long-term memory. We propose a new architecture focused on creating adaptable and updatable long-term memory for AGI systems. We demonstrate through various experiments the benefits of the RecallM architecture, particularly the improved temporal understanding of knowledge it provides.
Fine-grained Action Analysis: A Multi-modality and Multi-task Dataset of Figure Skating
results: The study's three main contributions are: 1) independent spatial and temporal categories to further explore fine-grained action recognition and assessment; 2) the first use of the skeleton modality for complex fine-grained action quality assessment; 3) a multi-modality and multi-task dataset that encourages more action analysis models.Abstract
Fine-grained action analysis on existing action datasets is challenged by insufficient action categories, low granularity, and limited modalities and tasks. In this paper, we propose a Multi-modality and Multi-task dataset of Figure Skating (MMFS), collected from the World Figure Skating Championships. MMFS supports both action recognition and action quality assessment: it captures RGB and skeleton data and collects action scores for 11671 clips across 256 categories, including spatial and temporal labels. The key contributions of our dataset fall into three aspects. (1) Independent spatial and temporal categories are first proposed to further explore fine-grained action recognition and quality assessment. (2) MMFS first introduces the skeleton modality for complex fine-grained action quality assessment. (3) Our multi-modality and multi-task dataset encourages more action analysis models. To benchmark our dataset, we adopt RGB-based and skeleton-based baseline methods for action recognition and action quality assessment.
Hierarchical Empowerment: Towards Tractable Empowerment-Based Skill-Learning
paper_authors: Andrew Levy, Sreehari Rammohan, Alessandro Allievi, Scott Niekum, George Konidaris
for: This paper targets agents that learn large repertoires of distinct skills.
methods: The paper integrates concepts from Goal-Conditioned Hierarchical Reinforcement Learning and proposes a new framework called Hierarchical Empowerment that makes computing empowerment more tractable.
results: In a series of simulated robotics tasks, the paper's four-level agents are able to learn skills that cover a surface area over two orders of magnitude larger than prior work.Abstract
General purpose agents will require large repertoires of skills. Empowerment -- the maximum mutual information between skills and the states -- provides a pathway for learning large collections of distinct skills, but mutual information is difficult to optimize. We introduce a new framework, Hierarchical Empowerment, that makes computing empowerment more tractable by integrating concepts from Goal-Conditioned Hierarchical Reinforcement Learning. Our framework makes two specific contributions. First, we introduce a new variational lower bound on mutual information that can be used to compute empowerment over short horizons. Second, we introduce a hierarchical architecture for computing empowerment over exponentially longer time scales. We verify the contributions of the framework in a series of simulated robotics tasks. In a popular ant navigation domain, our four-level agents are able to learn skills that cover a surface area over two orders of magnitude larger than prior work.
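The abstract introduces a new variational lower bound without stating its form; as background, here is the generic variational skill-discovery bound, $I(Z; S') \ge \mathbb{E}[\log q_\phi(z \mid s')] + H(Z)$, implemented as a trainable estimate. This is only a sketch in the spirit of empowerment-based skill learning, not the paper's specific short-horizon bound.

```python
import torch
import torch.nn as nn

n_skills, state_dim = 8, 4
q_phi = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_skills))
opt = torch.optim.Adam(q_phi.parameters(), lr=3e-4)

def empowerment_bound_step(final_states, skills):
    """final_states: (B, state_dim) states reached by each skill rollout;
    skills: (B,) integer skill labels. Trains the discriminator q_phi to
    tighten the bound and returns the current bound estimate."""
    logits = q_phi(final_states)
    log_q = torch.log_softmax(logits, dim=-1)[torch.arange(len(skills)), skills]
    loss = -log_q.mean()
    opt.zero_grad(); loss.backward(); opt.step()
    entropy_z = torch.log(torch.tensor(float(n_skills)))  # uniform skill prior
    return (log_q.detach().mean() + entropy_z).item()

final_states = torch.randn(64, state_dim)       # dummy rollout endpoints
skills = torch.randint(0, n_skills, (64,))
print(empowerment_bound_step(final_states, skills))
```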
TL-nvSRAM-CIM: Ultra-High-Density Three-Level ReRAM-Assisted Computing-in-nvSRAM with DC-Power Free Restore and Ternary MAC Operations
results: Proposes an ultra-high-density three-level ReRAM-assisted computing-in-nvSRAM scheme for large-scale NN computing, achieving 7.8x higher storage density than baseline designs, along with 2.9x and 1.9x improvements in energy efficiency.Abstract
Accommodating all the weights on-chip for large-scale NNs remains a great challenge for SRAM based computing-in-memory (SRAM-CIM) with limited on-chip capacity. Previous non-volatile SRAM-CIM (nvSRAM-CIM) addresses this issue by integrating high-density single-level ReRAMs on the top of high-efficiency SRAM-CIM for weight storage to eliminate the off-chip memory access. However, previous SL-nvSRAM-CIM suffers from poor scalability for an increased number of SL-ReRAMs and limited computing efficiency. To overcome these challenges, this work proposes an ultra-high-density three-level ReRAMs-assisted computing-in-nonvolatile-SRAM (TL-nvSRAM-CIM) scheme for large NN models. The clustered n-selector-n-ReRAM (cluster-nSnRs) is employed for reliable weight-restore with eliminated DC power. Furthermore, a ternary SRAM-CIM mechanism with differential computing scheme is proposed for energy-efficient ternary MAC operations while preserving high NN accuracy. The proposed TL-nvSRAM-CIM achieves 7.8x higher storage density, compared with the state-of-art works. Moreover, TL-nvSRAM-CIM shows up to 2.9x and 1.9x enhanced energy-efficiency, respectively, compared to the baseline designs of SRAM-CIM and ReRAM-CIM, respectively.
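As a purely numerical illustration of the ternary MAC operations the scheme accelerates (not a model of the circuit or of its differential computing scheme), weights are quantized to the three levels a cell can hold, so each multiply-accumulate reduces to an add, a skip, or a subtract:

```python
import numpy as np

def ternarize(w, threshold=0.05):
    """Map real weights to {-1, 0, +1}; the threshold is an illustrative choice."""
    return np.where(np.abs(w) < threshold, 0, np.sign(w)).astype(np.int8)

def ternary_mac(x, w_ternary, scale):
    """Ternary multiply-accumulate: each product is an add, a skip, or a
    subtract; a per-tensor scale restores the original weight magnitude."""
    return scale * (x @ w_ternary.astype(np.float32))

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(128, 32))
x = rng.normal(size=(1, 128)).astype(np.float32)
wt = ternarize(w)
y = ternary_mac(x, wt, scale=np.abs(w[wt != 0]).mean())
print(y.shape)   # (1, 32)
```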
Validation of the Practicability of Logical Assessment Formula for Evaluations with Inaccurate Ground-Truth Labels
results: Experimental results show that LAF can effectively evaluate predictive models with IAGTLs, demonstrated on tumour segmentation for breast cancer (TSfBC) in medical histopathology whole-slide image analysis (MHWSIA).Abstract
Logical assessment formula (LAF) is a new theory proposed for evaluations with inaccurate ground-truth labels (IAGTLs) to assess predictive models for various artificial intelligence applications. However, the practicability of LAF for evaluations with IAGTLs has not yet been validated in real-world practice. In this paper, to address this issue, we applied LAF to tumour segmentation for breast cancer (TSfBC) in medical histopathology whole slide image analysis (MHWSIA). Experimental results and analysis show the validity of LAF for evaluations with IAGTLs in the case of TSfBC and reflect the potential of applying LAF to MHWSIA.
Loss Functions and Metrics in Deep Learning. A Review
results: The paper provides a review of the loss functions and performance metrics used across deep learning tasks, with examples and applications for each. This information helps readers better understand the different methods and techniques and choose the best one for their specific task.Abstract
One of the essential components of deep learning is the choice of the loss function and performance metrics used to train and evaluate models. This paper reviews the most prevalent loss functions and performance measurements in deep learning. We examine the benefits and limits of each technique and illustrate their application to various deep-learning problems. Our review aims to give a comprehensive picture of the different loss functions and performance indicators used in the most common deep learning tasks and help practitioners choose the best method for their specific task.
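For concreteness, here are minimal NumPy versions of two of the most prevalent losses such a review covers; the formulas are standard, and the implementations are illustrative rather than taken from the paper.

```python
import numpy as np

def cross_entropy(y_true_onehot, y_prob, eps=1e-12):
    """Multi-class cross-entropy, the default loss for classification."""
    return -np.mean(np.sum(y_true_onehot * np.log(y_prob + eps), axis=1))

def mse(y_true, y_pred):
    """Mean squared error, the default loss for regression."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.eye(3)[[0, 2, 1]]                    # three one-hot labels
y_prob = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.2, 0.6],
                   [0.1, 0.7, 0.2]])
print(cross_entropy(y_true, y_prob))             # classification loss
print(mse(np.array([1.0, 2.0]), np.array([1.1, 1.8])))  # regression loss
```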
SACHA: Soft Actor-Critic with Heuristic-Based Attention for Partially Observable Multi-Agent Path Finding
results: Compared to existing learning-based methods, SACHA and SACHA(C) achieve better results on a variety of MAPF instances, in terms of both success rate and solution quality.Abstract
Multi-Agent Path Finding (MAPF) is a crucial component for many large-scale robotic systems, where agents must plan their collision-free paths to their given goal positions. Recently, multi-agent reinforcement learning has been introduced to solve the partially observable variant of MAPF by learning a decentralized single-agent policy in a centralized fashion based on each agent's partial observation. However, existing learning-based methods are ineffective in achieving complex multi-agent cooperation, especially in congested environments, due to the non-stationarity of this setting. To tackle this challenge, we propose a multi-agent actor-critic method called Soft Actor-Critic with Heuristic-Based Attention (SACHA), which employs novel heuristic-based attention mechanisms for both the actors and critics to encourage cooperation among agents. SACHA learns a neural network for each agent to selectively pay attention to the shortest path heuristic guidance from multiple agents within its field of view, thereby allowing for more scalable learning of cooperation. SACHA also extends the existing multi-agent actor-critic framework by introducing a novel critic centered on each agent to approximate $Q$-values. Compared to existing methods that use a fully observable critic, our agent-centered multi-agent actor-critic method results in more impartial credit assignment and better generalizability of the learned policy to MAPF instances with varying numbers of agents and types of environments. We also implement SACHA(C), which embeds a communication module in the agent's policy network to enable information exchange among agents. We evaluate both SACHA and SACHA(C) on a variety of MAPF instances and demonstrate decent improvements over several state-of-the-art learning-based MAPF methods with respect to success rate and solution quality.
Scaling In-Context Demonstrations with Structured Attention
results: Evaluated in a meta-training framework, SAICL achieves comparable or better performance than full attention while obtaining up to a 3.4x inference speed-up. SAICL also consistently outperforms a strong Fusion-in-Decoder (FiD) baseline that processes each demonstration independently. Finally, thanks to its linear nature, SAICL easily scales to hundreds of demonstrations, with continuous performance gains from scaling.Abstract
The recent surge of large language models (LLMs) highlights their ability to perform in-context learning, i.e., "learning" to perform a task from a few demonstrations in the context without any parameter updates. However, their capabilities of in-context learning are limited by the model architecture: 1) the use of demonstrations is constrained by a maximum sentence length due to positional embeddings; 2) the quadratic complexity of attention hinders users from using more demonstrations efficiently; 3) LLMs are shown to be sensitive to the order of the demonstrations. In this work, we tackle these challenges by proposing a better architectural design for in-context learning. We propose SAICL (Structured Attention for In-Context Learning), which replaces the full-attention by a structured attention mechanism designed for in-context learning, and removes unnecessary dependencies between individual demonstrations, while making the model invariant to the permutation of demonstrations. We evaluate SAICL in a meta-training framework and show that SAICL achieves comparable or better performance than full attention while obtaining up to 3.4x inference speed-up. SAICL also consistently outperforms a strong Fusion-in-Decoder (FiD) baseline which processes each demonstration independently. Finally, thanks to its linear nature, we demonstrate that SAICL can easily scale to hundreds of demonstrations with continuous performance gains with scaling.
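The structural idea can be illustrated with a block attention mask: tokens inside each demonstration attend only within that demonstration (removing cross-demonstration dependencies, which also makes the demonstrations order-insensitive up to positional encoding), while the query attends to everything. This is a reconstruction from the abstract, not the exact SAICL mechanism.

```python
import numpy as np

def saicl_style_mask(demo_lens, query_len):
    """True = attention allowed. Demonstrations form independent blocks;
    query tokens attend to all demonstrations and to themselves."""
    total = sum(demo_lens) + query_len
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for length in demo_lens:
        mask[start:start + length, start:start + length] = True  # within-demo
        start += length
    mask[start:, :] = True                                       # query rows
    return mask

m = saicl_style_mask([4, 3, 5], query_len=2)
print(m.shape, int(m.sum()))   # (14, 14), far fewer allowed pairs than 14*14
```

Because each demonstration's attention cost is independent of the others, total cost grows linearly in the number of demonstrations instead of quadratically.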
Comparing Apples to Apples: Generating Aspect-Aware Comparative Sentences from User Reviews
paper_authors: Jessica Echterhoff, An Yan, Julian McAuley
for: Helping users find the best option among many similar alternatives.
methods: A transformer architecture consisting of an item encoding module, a comparison generation module, and a novel decoding method for user personalization.
results: The pipeline generates fluent and diverse comparative sentences, and a human evaluation study shows the generated sentences are relevant and truthful.Abstract
It is time-consuming to find the best product among many similar alternatives. Comparative sentences can help to contrast one item with others in a way that highlights important features of an item that stand out. Given reviews of one or multiple items and relevant item features, we generate comparative review sentences to help users find the best fit. Specifically, our model consists of three successive components in a transformer: (i) an item encoding module to encode an item for comparison, (ii) a comparison generation module that generates comparative sentences in an autoregressive manner, (iii) a novel decoding method for user personalization. We show that our pipeline generates fluent and diverse comparative sentences. We run experiments on the relevance and fidelity of our generated sentences in a human evaluation study and find that our algorithm creates comparative review sentences that are relevant and truthful.
paper_authors: Pascal Van Hentenryck, Kevin Dalmeijer
for: This article is a short introduction to AI4OPT, the NSF AI Institute for Advances in Optimization, which fuses AI and optimization to address problems arising in supply chains, energy systems, chip design and manufacturing, and sustainable food systems.
methods: The institute applies a "teaching the teachers" philosophy, providing longitudinal educational pathways to help engineers learn AI techniques.
results: The article reports no experimental results; it is an introduction describing AI4OPT's goals and methods.Abstract
This article is a short introduction to AI4OPT, the NSF AI Institute for Advances in Optimization. AI4OPT fuses AI and Optimization, inspired by end-use cases in supply chains, energy systems, chip design and manufacturing, and sustainable food systems. AI4OPT also applies its "teaching the teachers" philosophy to provide longitudinal educational pathways in AI for engineering.
Convergence of Communications, Control, and Machine Learning for Secure and Autonomous Vehicle Navigation
results: The authors analyze and propose solutions for stable path tracking, robust control against cyber-physical attacks, and adaptive navigation controller design for uncoordinated CAVs, as well as for stable formation, fast collaborative learning, and distributed intrusion detection when multiple CAVs coordinate their movements.Abstract
Connected and autonomous vehicles (CAVs) can reduce human errors in traffic accidents, increase road efficiency, and execute various tasks ranging from delivery to smart city surveillance. Reaping these benefits requires CAVs to autonomously navigate to target destinations. To this end, each CAV's navigation controller must leverage the information collected by sensors and wireless systems for decision-making on longitudinal and lateral movements. However, enabling autonomous navigation for CAVs requires a convergent integration of communication, control, and learning systems. The goal of this article is to explicitly expose the challenges related to this convergence and propose solutions to address them in two major use cases: Uncoordinated and coordinated CAVs. In particular, challenges related to the navigation of uncoordinated CAVs include stable path tracking, robust control against cyber-physical attacks, and adaptive navigation controller design. Meanwhile, when multiple CAVs coordinate their movements during navigation, fundamental problems such as stable formation, fast collaborative learning, and distributed intrusion detection are analyzed. For both cases, solutions using the convergence of communication theory, control theory, and machine learning are proposed to enable effective and secure CAV navigation. Preliminary simulation results are provided to show the merits of proposed solutions.
for: solving many-objective optimization problems with complex trade-offs between objectives
methods: combines Many-Objective Evolutionary Algorithms and Quality Diversity algorithms like MAP-Elites, maintains a map of elites that perform well on different subsets of objective functions
results: outperforms a naive single-objective baseline on a 14-objective image-neuroevolution problem, relies on solutions jumping across bins (goal-switching) for better performance, suggests automatic identification of stepping stones or curriculum learning.Abstract
Real-world problems are often comprised of many objectives and require solutions that carefully trade-off between them. Current approaches to many-objective optimization often require challenging assumptions, like knowledge of the importance/difficulty of objectives in a weighted-sum single-objective paradigm, or enormous populations to overcome the curse of dimensionality in multi-objective Pareto optimization. Combining elements from Many-Objective Evolutionary Algorithms and Quality Diversity algorithms like MAP-Elites, we propose Many-objective Optimization via Voting for Elites (MOVE). MOVE maintains a map of elites that perform well on different subsets of the objective functions. On a 14-objective image-neuroevolution problem, we demonstrate that MOVE is viable with a population of as few as 50 elites and outperforms a naive single-objective baseline. We find that the algorithm's performance relies on solutions jumping across bins (for a parent to produce a child that is elite for a different subset of objectives). We suggest that this type of goal-switching is an implicit method for the automatic identification of stepping stones or curriculum learning. We comment on the similarities and differences between MOVE and MAP-Elites, hoping to provide insight that aids in the understanding of that approach, and suggest future work that may inform this approach's use for many-objective problems in general.
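A minimal MOVE-like loop, as an illustrative reading of the abstract: keep one elite per subset ("bin") of objectives, mutate elites, and let children land in whichever bins they improve, which is exactly where goal-switching can occur. Mean aggregation per subset stands in for the paper's voting mechanism, and all hyperparameters are made up.

```python
import itertools
import numpy as np

def move_sketch(evaluate, init_pop, n_objectives, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    bins = [s for r in range(1, n_objectives + 1)
            for s in itertools.combinations(range(n_objectives), r)]
    elites = {}
    def try_insert(x):
        f = evaluate(x)                              # vector of objective scores
        for s in bins:
            score = np.mean([f[i] for i in s])       # stand-in for voting
            if s not in elites or score > elites[s][0]:
                elites[s] = (score, x)               # possibly a different bin
    for x in init_pop:
        try_insert(x)
    for _ in range(iters):
        _, parent = elites[bins[rng.integers(len(bins))]]
        try_insert(parent + rng.normal(scale=0.1, size=parent.shape))  # mutate
    return elites

# Toy 3-objective problem: each objective prefers a different 2-D region.
targets = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
evaluate = lambda x: [-np.linalg.norm(x - t) for t in targets]
elites = move_sketch(evaluate, [np.zeros(2)], n_objectives=3)
print(len(elites))   # 7 bins: every non-empty subset of 3 objectives
```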
UX Heuristics and Checklist for Deep Learning powered Mobile Applications with Image Classification
results: The paper delivers an online course and a web-based tool that help practitioners apply the heuristics to evaluate and improve the user interfaces of image classification applications. These results can guide the interface design of such applications and support practitioners in developing a better user experience.Abstract
Advances in mobile applications providing image classification enabled by Deep Learning require innovative User Experience solutions in order to assure their adequate use by users. To aid the design process, usability heuristics are typically customized for a specific kind of application. Therefore, based on a literature review and an analysis of existing mobile applications with image classification, we propose an initial set of AIX heuristics for Deep Learning powered mobile applications with image classification, decomposed into a checklist. To facilitate the usage of the checklist, we also developed an online course presenting the concepts and heuristics, as well as a web-based tool to support evaluations using these heuristics. The results of this research can be used to guide the design of the interfaces of such applications, as well as to support the conduction of heuristic evaluations, helping practitioners develop image classification apps that people can understand, trust, and engage with effectively.
Surge Routing: Event-informed Multiagent Reinforcement Learning for Autonomous Rideshare
results: Experimental results show that the learning framework produces routing policies that service $6$ more requests per minute on average (around $360$ more requests per hour) than other model-based reinforcement learning frameworks and other distributed methods and classical algorithms from operations research.Abstract
Large events such as conferences, concerts and sports games, often cause surges in demand for ride services that are not captured in average demand patterns, posing unique challenges for routing algorithms. We propose a learning framework for an autonomous fleet of taxis that scrapes event data from the internet to predict and adapt to surges in demand and generates cooperative routing and pickup policies that service a higher number of requests than other routing protocols. We achieve this through a combination of (i) an event processing framework that scrapes the internet for event information and generates dense vector representations that can be used as input features for a neural network that predicts demand; (ii) a two neural network system that predicts hourly demand over the entire map, using these dense vector representations; (iii) a probabilistic approach that leverages locale occupancy schedules to map publicly available demand data over sectors to discretized street intersections; and finally, (iv) a scalable model-based reinforcement learning framework that uses the predicted demand over intersections to anticipate surges and route taxis using one-agent-at-a-time rollout with limited sampling certainty equivalence. We learn routing and pickup policies using real NYC ride share data for 2022 and information for more than 2000 events across 300 unique venues in Manhattan. We test our approach with a fleet of 100 taxis on a map with 38 different sectors (2235 street intersections). Our experimental results demonstrate that our method obtains routing policies that service $6$ more requests on average per minute (around $360$ more requests per hour) than other model-based RL frameworks and other classical algorithms in operations research when dealing with surge demand conditions.
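The one-agent-at-a-time rollout can be sketched as follows: taxis commit to moves sequentially, and each taxi picks the move with the best estimated value given the moves already committed. The value function below, folding predicted event demand into a simple crowding-penalized score, is a toy stand-in for the learned, simulation-based estimate.

```python
import numpy as np

def one_agent_at_a_time_rollout(taxis, candidate_moves, q_estimate):
    committed = []
    for taxi in taxis:                      # agents decide one at a time
        best = max(candidate_moves(taxi),
                   key=lambda m: q_estimate(taxi, m, committed))
        committed.append((taxi, best))      # later taxis see this choice
    return committed

# Toy world: taxis on a line, moves -1/0/+1, value = predicted demand at
# the destination cell minus a penalty for piling onto the same cell.
demand = np.array([0.1, 2.0, 0.3, 1.5, 0.2])
def candidate_moves(taxi):
    return [m for m in (-1, 0, 1) if 0 <= taxi["pos"] + m < len(demand)]
def q_estimate(taxi, m, committed):
    cell = taxi["pos"] + m
    crowding = sum(1 for other, mv in committed if other["pos"] + mv == cell)
    return demand[cell] - 0.5 * crowding
taxis = [{"pos": 0}, {"pos": 2}, {"pos": 4}]
print(one_agent_at_a_time_rollout(taxis, candidate_moves, q_estimate))
```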
An explainable model to support the decision about the therapy protocol for AML
paper_authors: Jade M. Almeida, Giovanna A. Castro, João A. Machado-Neto, Tiago A. Almeida
for: This study aims to support physicians in deciding the most appropriate therapy protocol, in order to improve the survival of patients with AML.
methods: The study uses an explainable machine-learning model to analyze data and predict patient survival.
results: The results show that the model can safely support physicians' decisions and has promising potential for improving treatments and prognostic markers.Abstract
Acute Myeloid Leukemia (AML) is one of the most aggressive types of hematological neoplasm. To support the specialists' decision about the appropriate therapy, patients with AML receive a prognostic of outcomes according to their cytogenetic and molecular characteristics, often divided into three risk categories: favorable, intermediate, and adverse. However, the current risk classification has known problems, such as the heterogeneity between patients of the same risk group and no clear definition of the intermediate risk category. Moreover, as most patients with AML receive an intermediate-risk classification, specialists often demand other tests and analyses, leading to delayed treatment and worsening of the patient's clinical condition. This paper presents the data analysis and an explainable machine-learning model to support the decision about the most appropriate therapy protocol according to the patient's survival prediction. In addition to the prediction model being explainable, the results obtained are promising and indicate that it is possible to use it to support the specialists' decisions safely. Most importantly, the findings offered in this study have the potential to open new avenues of research toward better treatments and prognostic markers.
Real-time Workload Pattern Analysis for Large-scale Cloud Databases
results: Experimental results show that AWM improves workload pattern discovery accuracy by 66% and reduces online inference latency by 22%, compared with the state of the art.Abstract
Hosting database services on cloud systems has become a common practice. This has led to the increasing volume of database workloads, which provides the opportunity for pattern analysis. Discovering workload patterns from a business logic perspective is conducive to better understanding the trends and characteristics of the database system. However, existing workload pattern discovery systems are not suitable for large-scale cloud databases which are commonly employed by the industry. This is because the workload patterns of large-scale cloud databases are generally far more complicated than those of ordinary databases. In this paper, we propose Alibaba Workload Miner (AWM), a real-time system for discovering workload patterns in complicated large-scale workloads. AWM encodes and discovers the SQL query patterns logged from user requests and optimizes the querying processing based on the discovered patterns. First, Data Collection & Preprocessing Module collects streaming query logs and encodes them into high-dimensional feature embeddings with rich semantic contexts and execution features. Next, Online Workload Mining Module separates encoded queries by business groups and discovers the workload patterns for each group. Meanwhile, Offline Training Module collects labels and trains the classification model using the labels. Finally, Pattern-based Optimizing Module optimizes query processing in cloud databases by exploiting discovered patterns. Extensive experimental results on one synthetic dataset and two real-life datasets (extracted from Alibaba Cloud databases) show that AWM enhances the accuracy of pattern discovery by 66% and reduce the latency of online inference by 22%, compared with the state-of-the-arts.
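AWM encodes queries into learned feature embeddings; as a much simpler illustration of the underlying first step, turning raw query logs into comparable patterns, here is a regex-based SQL template extraction. This is not part of AWM itself, only a way to see what "workload patterns" look like.

```python
import re
from collections import Counter

def sql_fingerprint(query: str) -> str:
    """Normalize literals so structurally identical queries collapse to one
    pattern (a crude stand-in for AWM's learned encodings)."""
    q = query.strip().lower()
    q = re.sub(r"'[^']*'", "?", q)           # string literals
    q = re.sub(r"\b\d+(\.\d+)?\b", "?", q)   # numeric literals
    q = re.sub(r"\s+", " ", q)               # collapse whitespace
    return q

log = [
    "SELECT * FROM orders WHERE user_id = 42",
    "select * from orders where  user_id = 7",
    "SELECT name FROM users WHERE city = 'Hangzhou'",
]
print(Counter(sql_fingerprint(q) for q in log).most_common())
# [('select * from orders where user_id = ?', 2), ...]
```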
Learning when to observe: A frugal reinforcement learning framework for a high-cost world
paper_authors: Colin Bellinger, Mark Crowley, Isaac Tamblyn
for: This paper focuses on the problem of learning in reinforcement learning (RL) when there is a high cost associated with measuring the state of the environment.
methods: The proposed method is called the Deep Dynamic Multi-Step Observationless Agent (DMSOA), which does not rely on costly measurements at each time step. Instead, it uses a deep neural network to learn when to observe the environment and when to act without observation.
results: The authors evaluate DMSOA on OpenAI gym and Atari Pong environments and show that it learns a better policy with fewer decision steps and measurements than the considered alternative from the literature.Abstract
Reinforcement learning (RL) has been shown to learn sophisticated control policies for complex tasks including games, robotics, heating and cooling systems and text generation. The action-perception cycle in RL, however, generally assumes that a measurement of the state of the environment is available at each time step without a cost. In applications such as materials design, deep-sea and planetary robot exploration and medicine, however, there can be a high cost associated with measuring, or even approximating, the state of the environment. In this paper, we survey the recently growing literature that adopts the perspective that an RL agent might not need, or even want, a costly measurement at each time step. Within this context, we propose the Deep Dynamic Multi-Step Observationless Agent (DMSOA), contrast it with the literature and empirically evaluate it on OpenAI gym and Atari Pong environments. Our results show that DMSOA learns a better policy with fewer decision steps and measurements than the considered alternative from the literature. The corresponding code is available at: https://github.com/cbellinger27/Learning-when-to-observe-in-RL
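The interaction loop the paper studies can be sketched as a policy that returns both an action and how long to apply it open-loop before paying for the next measurement. The episode driver below assumes the classic Gym step API and a fixed `obs_cost` per measurement; both the policy interface and the cost model are illustrative, not DMSOA's exact design.

```python
def frugal_episode(env, policy, obs_cost=1.0, max_steps=200):
    """Run one episode where observations are costly: the agent only
    observes when it chooses to, and acts open-loop in between."""
    obs = env.reset()
    total_reward, steps = 0.0, 0
    while steps < max_steps:
        action, repeat = policy(obs)   # decide what to do AND when to look again
        total_reward -= obs_cost       # pay for this measurement
        for _ in range(repeat):
            obs, r, done, _ = env.step(action)  # obs is ignored until the
            total_reward += r                   # next deliberate measurement
            steps += 1
            if done or steps >= max_steps:
                return total_reward
    return total_reward
```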
results: The results demonstrate the feasibility of the federated approach, and experiments and data analyses show its advantages and scope of applicability for pandemic surveillance.Abstract
The surveillance of a pandemic is a challenging task, especially when crucial data is distributed and stakeholders cannot or are unwilling to share. To overcome this obstacle, federated methodologies should be developed to incorporate less sensitive evidence that entities are willing to provide. This study aims to explore the feasibility of pushing hypothesis tests behind each custodian's firewall and then using meta-analysis to combine the results, and to determine the optimal approach for reconstructing the hypothesis test and optimizing the inference. We propose a hypothesis testing framework to identify a surge in the indicators and conduct power analyses and experiments on real and semi-synthetic data to showcase the properties of our proposed hypothesis test and suggest suitable methods for combining $p$-values. Our findings highlight the potential of using $p$-value combination as a federated methodology for pandemic surveillance and provide valuable insights into integrating available data sources.
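Fisher's method is the classic example of the kind of $p$-value combination the paper evaluates: each custodian runs its own surge test behind its firewall and shares only a $p$-value, and the meta-analysis combines them. The choice of Fisher's rule here is illustrative; the paper studies which combination methods are suitable for this setting.

```python
import numpy as np
from scipy import stats

def fisher_combine(p_values):
    """Fisher's method: under the global null, -2 * sum(log p_i) follows a
    chi-squared distribution with 2k degrees of freedom."""
    p = np.asarray(p_values, dtype=float)
    statistic = -2.0 * np.log(p).sum()
    return stats.chi2.sf(statistic, df=2 * len(p))

site_p_values = [0.04, 0.20, 0.008]          # one p-value per data custodian
print(fisher_combine(site_p_values))         # combined p-value
print(stats.combine_pvalues(site_p_values))  # SciPy's built-in equivalent
```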
Human Inspired Progressive Alignment and Comparative Learning for Grounded Word Acquisition
results: Controlled experiments show that the method enables effective continual learning and extension of linguistic concepts.Abstract
Human language acquisition is an efficient, supervised, and continual process. In this work, we took inspiration from how human babies acquire their first language, and developed a computational process for word acquisition through comparative learning. Motivated by cognitive findings, we generated a small dataset that enables the computation models to compare the similarities and differences of various attributes, learn to filter out and extract the common information for each shared linguistic label. We frame the acquisition of words as not only the information filtration process, but also as representation-symbol mapping. This procedure does not involve a fixed vocabulary size, nor a discriminative objective, and allows the models to continually learn more concepts efficiently. Our results in controlled experiments have shown the potential of this approach for efficient continual learning of grounded words.
methods: Existing detectors use statistical information or classifiers to establish the assumed distributional gaps, but in practice these gaps do not effectively distinguish the semantic and stylistic differences between human-generated and AI-generated content.
results: Proposes the space-based SpaceInfi strategy for evading detection, validates it experimentally on multiple benchmarks and detectors, and provides a theoretical explanation for why the strategy works.Abstract
ChatGPT brings revolutionary social value but also raises concerns about the misuse of AI-generated content. Consequently, an important question is how to detect whether content is generated by ChatGPT or by humans. Existing detectors are built upon the assumption that there are distributional gaps between human-generated and AI-generated content. These gaps are typically identified using statistical information or classifiers. Our research challenges the distributional gap assumption in detectors. We find that detectors do not effectively discriminate the semantic and stylistic gaps between human-generated and AI-generated content. Instead, the "subtle differences", such as an extra space, become crucial for detection. Based on this discovery, we propose the SpaceInfi strategy to evade detection. Experiments demonstrate the effectiveness of this strategy across multiple benchmarks and detectors. We also provide a theoretical explanation for why SpaceInfi is successful in evading perplexity-based detection. Our findings offer new insights and challenges for understanding and constructing more applicable ChatGPT detectors.
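A minimal reading of the strategy (the exact placement rule is not given in the abstract, so the random word-boundary choice below is an assumption): insert an extra space so the visible content is essentially unchanged while the token statistics that perplexity-based detectors rely on are perturbed.

```python
import random

def space_infi(text: str, n_spaces: int = 1, seed: int = 0) -> str:
    """Insert n_spaces extra spaces at random word boundaries."""
    rng = random.Random(seed)
    words = text.split(" ")
    for _ in range(n_spaces):
        i = rng.randrange(1, len(words))
        words[i] = " " + words[i]       # doubles the space before words[i]
    return " ".join(words)

print(space_infi("The quick brown fox jumps over the lazy dog."))
```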
ODD: A Benchmark Dataset for the NLP-based Opioid Related Aberrant Behavior Detection
paper_authors: Sunjae Kwon, Xun Wang, Weisong Liu, Emily Druhl, Minhee L. Sung, Joel I. Reisman, Wenjun Li, Robert D. Kerns, William Becker, Hong Yu
for: This paper aims to develop a novel biomedical natural language processing benchmark dataset (ODD) to detect opioid-related aberrant behaviors (ORAB) from electronic health records (EHRs).
methods: The authors use two state-of-the-art natural language processing (NLP) models (finetuning pretrained language models and prompt-tuning approaches) to identify ORAB in EHR notes.
results: The prompt-tuning models outperformed the finetuning models in most categories, especially in uncommon categories such as Suggested Aberrant Behavior, Diagnosed Opioid Dependence, and Medication Change. The best model achieved an area under the precision recall curve of 83.92%, but there is still room for improvement in detecting uncommon classes.Abstract
Opioid related aberrant behaviors (ORAB) present novel risk factors for opioid overdose. Previously, ORAB have been mainly assessed by survey results and by monitoring drug administrations. Such methods, however, cannot scale up and do not cover the entire spectrum of aberrant behaviors. On the other hand, ORAB are widely documented in electronic health record notes. This paper introduces a novel biomedical natural language processing benchmark dataset named ODD, for ORAB Detection Dataset. ODD is an expert-annotated dataset comprising more than 750 publicly available EHR notes. ODD has been designed to identify ORAB from patients' EHR notes and classify them into nine categories: 1) Confirmed Aberrant Behavior, 2) Suggested Aberrant Behavior, 3) Opioids, 4) Indication, 5) Diagnosed opioid dependency, 6) Benzodiazepines, 7) Medication Changes, 8) Central Nervous System-related, and 9) Social Determinants of Health. We explored two state-of-the-art natural language processing (NLP) models (finetuning pretrained language models and prompt-tuning approaches) to identify ORAB. Experimental results show that the prompt-tuning models outperformed the finetuning models in most categories, and the gains were especially higher among uncommon categories (Suggested Aberrant Behavior, Diagnosed Opioid Dependency and Medication Change). Although the best model achieved the highest 83.92% on area under the precision-recall curve, uncommon classes (Suggested Aberrant Behavior, Diagnosed Opioid Dependence, and Medication Change) still leave large room for performance improvement.
TransformerG2G: Adaptive time-stepping for learning temporal graph embeddings using transformers
results: For different levels of "novelty" (as measured by TEA plots), the TransformerG2G model outperforms traditional multi-step methods and the authors' prior work (DynG2G) in both link prediction accuracy and computational efficiency, especially at high degrees of novelty. Moreover, analyzing the attention weights reveals temporal dependencies, identifies influential factors, and offers insight into the complex interactions within the graph structure; for example, a strong correlation is found between attention weights and node degree at different stages of the graph topology's evolution.Abstract
Dynamic graph embedding has emerged as a very effective technique for addressing diverse temporal graph analytic tasks (i.e., link prediction, node classification, recommender systems, anomaly detection, and graph generation) in various applications. Such temporal graphs exhibit heterogeneous transient dynamics, varying time intervals, and highly evolving node features throughout their evolution. Hence, incorporating long-range dependencies from the historical graph context plays a crucial role in accurately learning their temporal dynamics. In this paper, we develop a graph embedding model with uncertainty quantification, TransformerG2G, by exploiting the advanced transformer encoder to first learn intermediate node representations from its current state ($t$) and previous context (over timestamps $[t-l, \dots, t-1]$, where $l$ is the length of the context). Moreover, we employ two projection layers to generate lower-dimensional multivariate Gaussian distributions as each node's latent embedding at timestamp $t$. We consider diverse benchmarks with varying levels of "novelty" as measured by the TEA plots. Our experiments demonstrate that the proposed TransformerG2G model outperforms conventional multi-step methods and our prior work (DynG2G) in terms of both link prediction accuracy and computational efficiency, especially for high degrees of novelty. Furthermore, the learned time-dependent attention weights across multiple graph snapshots reveal the development of an automatic adaptive time stepping enabled by the transformer. Importantly, by examining the attention weights, we can uncover temporal dependencies, identify influential elements, and gain insights into the complex interactions within the graph structure. For example, we identified a strong correlation between attention weights and node degree at the various stages of the graph topology evolution.
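The two projection layers can be sketched as a pair of heads mapping the transformer's node representation to the mean and (here, diagonal) scale of a lower-dimensional Gaussian; the diagonal-covariance choice, the softplus positivity trick, and the dimensions are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class GaussianProjection(nn.Module):
    """Project a node representation to an uncertainty-aware embedding:
    a lower-dimensional Gaussian with mean mu and per-dimension scale sigma."""
    def __init__(self, hidden_dim=64, embed_dim=16):
        super().__init__()
        self.mu_head = nn.Linear(hidden_dim, embed_dim)
        self.sigma_head = nn.Linear(hidden_dim, embed_dim)

    def forward(self, h):                                # h: (num_nodes, hidden)
        mu = self.mu_head(h)
        sigma = nn.functional.softplus(self.sigma_head(h)) + 1e-6  # keep positive
        return mu, sigma

h = torch.randn(100, 64)            # transformer output for 100 nodes at time t
mu, sigma = GaussianProjection()(h)
print(mu.shape, sigma.shape)        # torch.Size([100, 16]) torch.Size([100, 16])
```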
Artificial Intelligence in archival and historical scholarship workflow: HTS and ChatGPT
for: This paper examines the impact of Artificial Intelligence (AI) on the archival heritage digitization processes, specifically focusing on the automatic transcription, correction, and normalization of manuscripts.
methods: The study uses two AI systems, Transkribus and ChatGPT, to analyze and transcribe digitized sources.
results: The paper presents the results of a test using ChatGPT to normalize the text of 366 letters stored in the Correspondence section of the Biscari Archive (Catania), which showed that although the AI exhibited some limitations, the corrected texts met expectations. Overall, the study concludes that digitization and AI can significantly enhance archival and historical research by allowing the analysis of vast amounts of data and the application of computational linguistic tools.Abstract
This article examines the impact of Artificial Intelligence on the archival heritage digitization processes, specifically regarding the manuscripts' automatic transcription, their correction, and normalization. It highlights how digitality has compelled scholars to redefine Archive and History field and has facilitated the accessibility of analogue sources through digitization and integration into big data. The study focuses on two AI systems, namely Transkribus and ChatGPT, which enable efficient analysis and transcription of digitized sources. The article presents a test of ChatGPT, which was utilized to normalize the text of 366 letters stored in the Correspondence section of the Biscari Archive (Catania). Although the AI exhibited some limitations that resulted in inaccuracies, the corrected texts met expectations. Overall, the article concludes that digitization and AI can significantly enhance archival and historical research by allowing the analysis of vast amounts of data and the application of computational linguistic tools.
The Effects of Interaction Conflicts, Levels of Automation, and Frequency of Automation on Human Automation Trust and Acceptance
paper_authors: Hadi Halvachi, Ali Asghar Nazari Shirehjini, Zahra Kakavand, Niloofar Hashemi, Shervin Shirmohammadi
for: This study investigates the effects of Level of Automation (LoA), Frequency of Automated responses (FoA), and Conflict Intensity (CI) on human trust in and acceptance of automation in the context of smart homes.
results: The results show that the level and frequency of automation affect user trust in smart environments. Furthermore, user acceptance of automated smart environments decreases in the presence of automation failures and interaction conflicts.Abstract
In the presence of interaction conflicts, user trust in automation plays an important role in accepting intelligent environments such as smart homes. In this paper, a factorial research design is employed to investigate and compare the single and joint effects of Level of Automation (LoA), Frequency of Automated responses (FoA), and Conflict Intensity (CI) on human trust and acceptance of automation in the context of smart homes. To study these effects, we conducted web-based experiments to gather data from 324 online participants who experienced the system through a 3D simulation of a smart home. The findings show that the level and frequency of automation had an impact on user trust in smart environments. Furthermore, the results demonstrate that the users' acceptance of automated smart environments decreased in the presence of automation failures and interaction conflicts.
Building Cooperative Embodied Agents Modularly with Large Language Models
paper_authors: Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, Chuang Gan
for: This paper aims to explore the use of large language models (LLMs) for multi-agent cooperation and communication in embodied environments.
methods: The authors present a novel framework that utilizes LLMs for planning, communication, and cooperation in various embodied environments, without requiring fine-tuning or few-shot prompting.
results: The authors demonstrate that LLM-based agents can surpass strong planning-based methods and exhibit emergent effective communication, and that LLM-based agents that communicate in natural language can earn more trust and cooperate more effectively with humans.Abstract
Large Language Models (LLMs) have demonstrated impressive planning abilities in single-agent embodied tasks across various domains. However, their capacity for planning and communication in multi-agent cooperation remains unclear, even though these are crucial skills for intelligent embodied agents. In this paper, we present a novel framework that utilizes LLMs for multi-agent cooperation and tests it in various embodied environments. Our framework enables embodied agents to plan, communicate, and cooperate with other embodied agents or humans to accomplish long-horizon tasks efficiently. We demonstrate that recent LLMs, such as GPT-4, can surpass strong planning-based methods and exhibit emergent effective communication using our framework without requiring fine-tuning or few-shot prompting. We also discover that LLM-based agents that communicate in natural language can earn more trust and cooperate more effectively with humans. Our research underscores the potential of LLMs for embodied AI and lays the foundation for future research in multi-agent cooperation. Videos can be found on the project website https://vis-www.cs.umass.edu/Co-LLM-Agents/.
results: On the D4RL locomotion benchmark and Atari games, EDT outperforms Q-learning-based methods, especially in the multi-task regime.Abstract
This paper introduces Elastic Decision Transformer (EDT), a significant advancement over the existing Decision Transformer (DT) and its variants. Although DT purports to generate an optimal trajectory, empirical evidence suggests it struggles with trajectory stitching, a process involving the generation of an optimal or near-optimal trajectory from the best parts of a set of sub-optimal trajectories. The proposed EDT differentiates itself by facilitating trajectory stitching during action inference at test time, achieved by adjusting the history length maintained in DT. Further, the EDT optimizes the trajectory by retaining a longer history when the previous trajectory is optimal and a shorter one when it is sub-optimal, enabling it to "stitch" with a more optimal trajectory. Extensive experimentation demonstrates EDT's ability to bridge the performance gap between DT-based and Q Learning-based approaches. In particular, the EDT outperforms Q Learning-based methods in a multi-task regime on the D4RL locomotion benchmark and Atari games. Videos are available at: https://kristery.github.io/edt/
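The elastic mechanism can be sketched as scoring several candidate history lengths at inference time and keeping the one whose encoded context predicts the highest achievable return, so the model conditions on a long history when the trajectory so far is good and a short one when it is not. `encode` and `value_head` are placeholders for EDT's learned components, and the candidate lengths are arbitrary.

```python
import torch

def elastic_history_length(traj_tokens, encode, value_head,
                           lengths=(1, 2, 4, 8, 16)):
    """Return the history length whose context predicts the best return."""
    best_len, best_value = lengths[0], -float("inf")
    for h in lengths:
        ctx = encode(traj_tokens[-h:])      # encode the truncated history
        v = float(value_head(ctx))          # estimated maximal return-to-go
        if v > best_value:
            best_len, best_value = h, v
    return best_len

traj = torch.randn(32, 8)                   # 32 timesteps of 8-d trajectory tokens
encode = lambda t: t.mean(dim=0)            # dummy encoder
value_head = lambda c: c.sum()              # dummy value estimate
h = elastic_history_length(traj, encode, value_head)
print(h)   # the action would then be decoded conditioned on traj[-h:]
```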
Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
results: The study finds that current language models exhibit a degree of abstract reasoning ability on counterfactual task variants, but their performance degrades substantially and consistently compared with the default task conditions, suggesting that their performance is partly driven by task-specific, non-transferable procedures.Abstract
The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? To disentangle these effects, we propose an evaluation framework based on "counterfactual" task variants that deviate from the default assumptions underlying standard tasks. Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to a degree, they often also rely on narrow, non-transferable procedures for task-solving. These results motivate a more careful interpretation of language model performance that teases apart these aspects of behavior.
Deductive Additivity for Planning of Natural Language Proofs
paper_authors: Zayne Sprague, Kaj Bostrom, Swarat Chaudhuri, Greg Durrett
for: investigate whether an efficient planning heuristic is possible via embedding spaces compatible with deductive reasoning
methods: explore multiple sources of off-the-shelf dense embeddings in addition to fine-tuned embeddings from GPT3 and sparse embeddings from BM25
results: find that while standard embedding methods frequently embed conclusions near the sums of their premises, they fall short of being effective heuristics and lack the ability to model certain categories of reasoningAbstract
Current natural language systems designed for multi-step claim validation typically operate in two phases: retrieve a set of relevant premise statements using heuristics (planning), then generate novel conclusions from those statements using a large language model (deduction). The planning step often requires expensive Transformer operations and does not scale to arbitrary numbers of premise statements. In this paper, we investigate whether an efficient planning heuristic is possible via embedding spaces compatible with deductive reasoning. Specifically, we evaluate whether embedding spaces exhibit a property we call deductive additivity: the sum of premise statement embeddings should be close to embeddings of conclusions based on those premises. We explore multiple sources of off-the-shelf dense embeddings in addition to fine-tuned embeddings from GPT3 and sparse embeddings from BM25. We study embedding models both intrinsically, evaluating whether the property of deductive additivity holds, and extrinsically, using them to assist planning in natural language proof generation. Lastly, we create a dataset, Single-Step Reasoning Contrast (SSRC), to further probe performance on various reasoning types. Our findings suggest that while standard embedding methods frequently embed conclusions near the sums of their premises, they fall short of being effective heuristics and lack the ability to model certain categories of reasoning.
Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources
results: Evaluations show improvements in both the accuracy of performance prediction and the computational cost, and the framework also outperforms a range of off-the-shelf baselines in data-selection effectiveness.
Abstract
Traditionally, data selection has been studied in settings where all samples from prospective sources are fully revealed to a machine learning developer. However, in practical data exchange scenarios, data providers often reveal only a limited subset of samples before an acquisition decision is made. Recently, there have been efforts to fit scaling laws that predict model performance at any size and data source composition using the limited available samples. However, these scaling functions are black-box, computationally expensive to fit, highly susceptible to overfitting, or/and difficult to optimize for data selection. This paper proposes a framework called , which predicts model performance and supports data selection decisions based on partial samples of prospective data sources. Our approach distinguishes itself from existing work by introducing a novel *two-stage* performance inference process. In the first stage, we leverage the Optimal Transport distance to predict the model's performance for any data mixture ratio within the range of disclosed data sizes. In the second stage, we extrapolate the performance to larger undisclosed data sizes based on a novel parameter-free mapping technique inspired by neural scaling laws. We further derive an efficient gradient-based method to select data sources based on the projected model performance. Evaluation over a diverse range of applications demonstrates that significantly improves existing performance scaling approaches in terms of both the accuracy of performance inference and the computation costs associated with constructing the performance predictor. Also, outperforms by a wide margin in data selection effectiveness compared to a range of other off-the-shelf solutions.
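The second stage extrapolates performance from disclosed to undisclosed data sizes with a neural-scaling-law-style mapping. As a rough illustration of that general idea (not the paper's parameter-free technique), the sketch below fits a standard power-law form err(n) = a*n^(-b) + c on small data sizes and extrapolates to a larger one; the observations are placeholder numbers.

```python
# Rough sketch: extrapolating model error to larger data sizes with a
# power-law scaling fit, err(n) = a * n**(-b) + c. This illustrates the
# general idea of scaling-law extrapolation, not the paper's own
# parameter-free mapping; the observations below are placeholders.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

# Validation error measured at the disclosed (small) data sizes.
sizes = np.array([1e3, 2e3, 5e3, 1e4])
errors = np.array([0.42, 0.37, 0.31, 0.28])

params, _ = curve_fit(power_law, sizes, errors, p0=[1.0, 0.3, 0.1], maxfev=10000)

# Extrapolate to an undisclosed, larger data size.
print("predicted error at n=1e6:", power_law(1e6, *params))
```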
DeSRA: Detect and Delete the Artifacts of GAN-based Real-World Super-Resolution Models
results: With DeSRA, the artifacts and unpleasant visual defects produced by GAN-SR models can be successfully removed, improving the applicability of SR models in real-world scenarios.
Abstract
Image super-resolution (SR) with generative adversarial networks (GAN) has achieved great success in restoring realistic details. However, it is notorious that GAN-based SR models will inevitably produce unpleasant and undesirable artifacts, especially in practical scenarios. Previous works typically suppress artifacts with an extra loss penalty in the training phase. They only work for in-distribution artifact types generated during training. When applied in real-world scenarios, we observe that those improved methods still generate obviously annoying artifacts during inference. In this paper, we analyze the cause and characteristics of the GAN artifacts produced in unseen test data without ground-truths. We then develop a novel method, namely, DeSRA, to Detect and then Delete those SR Artifacts in practice. Specifically, we propose to measure a relative local variance distance from MSE-SR results and GAN-SR results, and locate the problematic areas based on the above distance and semantic-aware thresholds. After detecting the artifact regions, we develop a finetune procedure to improve GAN-based SR models with a few samples, so that they can deal with similar types of artifacts in more unseen real data. Equipped with our DeSRA, we can successfully eliminate artifacts from inference and improve the ability of SR models to be applied in real-world scenarios. The code will be available at https://github.com/TencentARC/DeSRA.
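The detection step compares local variance statistics of the MSE-SR and GAN-SR outputs. A minimal sketch of that comparison is given below, using a sliding-window variance computed via uniform filtering; the window size and threshold are illustrative assumptions rather than DeSRA's actual settings.

```python
# Minimal sketch: locating candidate GAN-SR artifact regions by comparing
# local variance of the GAN-SR output against the MSE-SR output.
# Window size and threshold are illustrative, not DeSRA's settings.
import numpy as np
from scipy.ndimage import uniform_filter

def local_variance(img, size=11):
    mean = uniform_filter(img, size)
    mean_sq = uniform_filter(img * img, size)
    return np.clip(mean_sq - mean * mean, 0, None)

def artifact_mask(mse_sr, gan_sr, size=11, thresh=1.5, eps=1e-6):
    v_mse = local_variance(mse_sr, size)
    v_gan = local_variance(gan_sr, size)
    # Relative distance: GAN artifacts tend to show inflated local variance
    # compared to the smoother MSE-SR result.
    rel = v_gan / (v_mse + eps)
    return rel > thresh

# Example on random grayscale arrays standing in for real SR outputs.
mse_sr = np.random.rand(64, 64).astype(np.float32)
gan_sr = mse_sr + 0.3 * np.random.rand(64, 64).astype(np.float32)
print("flagged pixels:", artifact_mask(mse_sr, gan_sr).sum())
```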
An Exploratory Literature Study on Sharing and Energy Use of Language Models for Source Code
results: From 494 unique publications we screened 293 relevant ones; 27% of these (79) make artifacts available for reuse, either as tools or IDE plugins designed for specific tasks or as task-agnostic models that can be fine-tuned. We also collected information on the training hardware and training time in order to analyze the energy consumed during training. We find that 40% of the surveyed papers share neither source code nor trained artifacts, and we recommend sharing both to enable sustainable reproducibility, together with comprehensive reporting of hardware configurations and training times for transparency about a model's carbon footprint.
Abstract
Large language models trained on source code can support a variety of software development tasks, such as code recommendation and program repair. Large amounts of data for training such models benefit the models' performance. However, the size of the data and models results in long training times and high energy consumption. While publishing source code allows for replicability, users need to repeat the expensive training process if models are not shared. The main goal of the study is to investigate if publications that trained language models for software engineering (SE) tasks share source code and trained artifacts. The second goal is to analyze the transparency on training energy usage. We perform a snowballing-based literature search to find publications on language models for source code, and analyze their reusability from a sustainability standpoint. From 494 unique publications, we identified 293 relevant publications that use language models to address code-related tasks. Among them, 27% (79 out of 293) make artifacts available for reuse. This can be in the form of tools or IDE plugins designed for specific tasks or task-agnostic models that can be fine-tuned for a variety of downstream tasks. Moreover, we collect insights on the hardware used for model training, as well as training time, which together determine the energy consumption of the development process. We find that there are deficiencies in the sharing of information and artifacts for current studies on source code models for software engineering tasks, with 40% of the surveyed papers not sharing source code or trained artifacts. We recommend the sharing of source code as well as trained artifacts, to enable sustainable reproducibility. Moreover, comprehensive information on training times and hardware configurations should be shared for transparency on a model's carbon footprint.
External Reasoning: Towards Multi-Large-Language-Models Interchangeable Assistance with Human Feedback
results: A comprehensive evaluation shows that the method matches or surpasses existing solutions while being more efficient than having LLMs process the full text directly.
Abstract
Memory is identified as a crucial human faculty that allows for the retention of visual and linguistic information within the hippocampus and neurons in the brain, which can subsequently be retrieved to address real-world challenges that arise through a lifetime of learning. The resolution of complex AI tasks through the application of acquired knowledge represents a stride toward the realization of artificial general intelligence. However, despite the prevalence of Large Language Models (LLMs) like GPT-3.5 and GPT-4 , which have displayed remarkable capabilities in language comprehension, generation, interaction, and reasoning, they are inhibited by constraints on context length that preclude the processing of extensive, continually evolving knowledge bases. This paper proposes that LLMs could be augmented through the selective integration of knowledge from external repositories, and in doing so, introduces a novel methodology for External Reasoning, exemplified by ChatPDF. Central to this approach is the establishment of a tiered policy for \textbf{External Reasoning based on Multiple LLM Interchange Assistance}, where the level of support rendered is modulated across entry, intermediate, and advanced tiers based on the complexity of the query, with adjustments made in response to human feedback. A comprehensive evaluation of this methodology is conducted using multiple LLMs and the results indicate state-of-the-art performance, surpassing existing solutions including ChatPDF.com. Moreover, the paper emphasizes that this approach is more efficient compared to the direct processing of full text by LLMs.
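The tiered policy modulates how much external support a query receives. A toy sketch of such a dispatcher follows; the complexity heuristic, tier cutoffs, and the stub retriever and LLM are all invented for illustration and are not the paper's actual policy.

```python
# Toy sketch of a tiered external-reasoning dispatcher: simple queries get a
# cheap lookup, harder ones progressively more retrieval. All components
# here (heuristic, cutoffs, stub retriever/LLM) are invented for illustration.

def retrieve(query: str, k: int) -> list[str]:
    return [f"passage {i} for: {query}" for i in range(k)]  # stub retriever

def llm_generate(query: str, passages: list[str]) -> str:
    return f"answer to {query!r} using {len(passages)} passages"  # stub LLM

def query_complexity(query: str) -> int:
    # Crude proxy: longer, multi-clause questions count as harder.
    return len(query.split()) + 5 * sum(query.count(w) for w in ("why", "compare", "explain"))

def answer(query: str) -> str:
    c = query_complexity(query)
    if c < 10:       # entry tier
        k = 1
    elif c < 25:     # intermediate tier
        k = 5
    else:            # advanced tier
        k = 10
    return llm_generate(query, retrieve(query, k))

print(answer("why does retrieval help and how does it compare to long context?"))
```

In the paper's framing, human feedback would shift the tier boundaries over time; here they are fixed constants.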
FOCUS: Object-Centric World Models for Robotics Manipulation
results: The study shows that object-centric world models allow the agent to solve manipulation tasks more efficiently and to explore robot-object interactions consistently across different settings. Using a Franka Emika robot arm, the authors also showcase how FOCUS can be adopted in real-world scenarios.
Abstract
Understanding the world in terms of objects and the possible interplays with them is an important cognition ability, especially in robotics manipulation, where many tasks require robot-object interactions. However, learning such a structured world model, which specifically captures entities and relationships, remains a challenging and underexplored problem. To address this, we propose FOCUS, a model-based agent that learns an object-centric world model. Thanks to a novel exploration bonus that stems from the object-centric representation, FOCUS can be deployed on robotics manipulation tasks to explore object interactions more easily. Evaluating our approach on manipulation tasks across different settings, we show that object-centric world models allow the agent to solve tasks more efficiently and enable consistent exploration of robot-object interactions. Using a Franka Emika robot arm, we also showcase how FOCUS could be adopted in real-world settings.
Multi-objective Deep Reinforcement Learning for Mobile Edge Computing
paper_authors: Ning Yang, Junrui Wen, Meng Zhang, Ming Tang
For: The paper is written for next-generation mobile network applications that prioritize various performance metrics, including delays and energy consumption.
* Methods: The paper uses a multi-objective reinforcement learning (MORL) scheme with proximal policy optimization (PPO) to address the challenge of unknown preferences in mobile edge computing (MEC) systems.
* Results: The proposed MORL scheme enhances the hypervolume of the Pareto front by up to 233.1% compared to benchmarks.
Abstract
Mobile edge computing (MEC) is essential for next-generation mobile network applications that prioritize various performance metrics, including delays and energy consumption. However, conventional single-objective scheduling solutions cannot be directly applied to practical systems in which the preferences of these applications (i.e., the weights of different objectives) are often unknown or challenging to specify in advance. In this study, we address this issue by formulating a multi-objective offloading problem for MEC with multiple edges to minimize expected long-term energy consumption and transmission delay while considering unknown preferences as parameters. To address the challenge of unknown preferences, we design a multi-objective (deep) reinforcement learning (MORL)-based resource scheduling scheme with proximal policy optimization (PPO). In addition, we introduce a well-designed state encoding method for constructing features for multiple edges in MEC systems, a sophisticated reward function for accurately computing the utilities of delay and energy consumption. Simulation results demonstrate that our proposed MORL scheme enhances the hypervolume of the Pareto front by up to 233.1% compared to benchmarks. Our full framework is available at https://github.com/gracefulning/mec_morl_multipolicy.
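A common way to train a single policy under unknown preferences is to condition it on sampled preference weights and scalarize the vector-valued reward. The sketch below shows only that scalarization step (the PPO update itself is omitted); the two objectives and the sampling scheme are illustrative assumptions.

```python
# Sketch of preference-conditioned scalarization for multi-objective RL.
# Each episode samples a preference vector w on the simplex and turns the
# vector-valued reward (negative delay, negative energy) into a scalar
# reward for a standard PPO learner (the PPO update itself is omitted).
import numpy as np

rng = np.random.default_rng(0)

def sample_preference(n_objectives: int = 2) -> np.ndarray:
    return rng.dirichlet(np.ones(n_objectives))  # uniform over the simplex

def scalarize(reward_vec: np.ndarray, w: np.ndarray) -> float:
    # Linear scalarization; other choices (e.g., Chebyshev) are possible.
    return float(np.dot(w, reward_vec))

w = sample_preference()
r = np.array([-0.12, -0.85])  # placeholder (-delay, -energy) step reward
print("preference:", w, "scalar reward:", scalarize(r, w))
```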
OpenDelta: A Plug-and-play Library for Parameter-efficient Adaptation of Pre-trained Models
results: OpenDelta provides a simple, clean, and extensible platform that helps researchers and practitioners adapt large PTMs efficiently, and it can accommodate additional delta tuning methods to meet the needs of different applications.
Abstract
The scale of large pre-trained models (PTMs) poses significant challenges in adapting to downstream tasks due to the high optimization overhead and storage costs associated with full-parameter fine-tuning. To address this, many studies explore parameter-efficient tuning methods, also framed as "delta tuning", which updates only a small subset of parameters, known as "delta modules", while keeping the backbone model's parameters fixed. However, the practicality and flexibility of delta tuning have been limited due to existing implementations that directly modify the code of the backbone PTMs and hard-code specific delta tuning methods for each PTM. In this paper, we present OpenDelta, an open-source library that overcomes these limitations by providing a plug-and-play implementation of various delta tuning methods. Our novel techniques eliminate the need to modify the backbone PTMs' code, making OpenDelta compatible with different, even novel PTMs. OpenDelta is designed to be simple, modular, and extensible, providing a comprehensive platform for researchers and practitioners to adapt large PTMs efficiently.
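The core mechanics of delta tuning (freeze the backbone, train only a small injected module) can be sketched in a few lines of plain PyTorch. The code below illustrates the concept with a generic bottleneck adapter; it is not OpenDelta's actual API.

```python
# Conceptual sketch of delta tuning in plain PyTorch: freeze every backbone
# parameter and train only a small injected "delta" module. This illustrates
# the idea behind libraries like OpenDelta; it is not OpenDelta's API.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual adapter

backbone = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))
for p in backbone.parameters():
    p.requires_grad = False          # backbone stays fixed

adapter = BottleneckAdapter(768)     # only the delta module is trainable
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)

x = torch.randn(4, 768)
out = adapter(backbone(x))           # backbone features refined by the delta
print("trainable params:", sum(p.numel() for p in adapter.parameters()))
```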
Causal Discovery with Language Models as Imperfect Experts
results: The authors report a case study on real data in which a large language model is used as an imperfect expert.
Abstract
Understanding the causal relationships that underlie a system is a fundamental prerequisite to accurate decision-making. In this work, we explore how expert knowledge can be used to improve the data-driven identification of causal graphs, beyond Markov equivalence classes. In doing so, we consider a setting where we can query an expert about the orientation of causal relationships between variables, but where the expert may provide erroneous information. We propose strategies for amending such expert knowledge based on consistency properties, e.g., acyclicity and conditional independencies in the equivalence class. We then report a case study, on real data, where a large language model is used as an imperfect expert.
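One of the consistency properties mentioned above, acyclicity, is easy to sketch: an expert-proposed edge orientation is accepted only if it keeps the partially oriented graph acyclic. The snippet below shows this check with networkx; the graph and the simulated expert are illustrative stand-ins for querying an LLM.

```python
# Sketch: accepting an (imperfect) expert's edge orientations only when they
# keep the oriented graph acyclic. The graph and the fake expert here are
# illustrative; a real system would query an LLM for each orientation.
import networkx as nx

oriented = nx.DiGraph()
oriented.add_nodes_from(["A", "B", "C"])
undecided = [("A", "B"), ("B", "C"), ("C", "A")]  # edges needing orientation

def fake_expert(u, v):
    # Stand-in for an LLM query; returns the direction it believes in.
    return (u, v)

for u, v in undecided:
    src, dst = fake_expert(u, v)
    oriented.add_edge(src, dst)
    if not nx.is_directed_acyclic_graph(oriented):
        # Expert answer is inconsistent with acyclicity: reject and flip.
        oriented.remove_edge(src, dst)
        oriented.add_edge(dst, src)

print(sorted(oriented.edges()))
```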
paper_authors: Aryo Pradipta Gema, Luke Daines, Pasquale Minervini, Beatrice Alex
For: This paper focuses on adapting pre-trained language models for clinical applications, specifically using Parameter-Efficient Fine-Tuning (PEFT) techniques to reduce computational requirements.
* Methods: The proposed method, Clinical LLaMA-LoRA, is built upon the open-sourced LLaMA model and is trained using clinical notes from the MIMIC-IV database. A two-step PEFT framework is proposed, which combines Clinical LLaMA-LoRA with Downstream LLaMA-LoRA for downstream tasks.
* Results: The proposed framework achieves state-of-the-art AUROC scores averaged across all clinical downstream tasks, with substantial improvements of 6-9% AUROC score in large-scale multilabel classification tasks such as diagnoses and procedures classification.
Abstract
Adapting pretrained language models to novel domains, such as clinical applications, traditionally involves retraining their entire set of parameters. However, this approach is increasingly proven to be impractical owing to the substantial computational requirements associated with training such large language models. To address this issue, Parameter-Efficient Fine-Tuning (PEFT) techniques offer a viable solution by selectively fine-tuning a small subset of additional parameters, significantly reducing the computational requirements for domain adaptation. In this study, we propose Clinical LLaMA-LoRA, a PEFT adapter layer built upon the open-sourced LLaMA model. Clinical LLaMA-LoRA is trained using clinical notes obtained from the MIMIC-IV database, thereby creating a specialised adapter designed for the clinical domain. Additionally, we propose a two-step PEFT framework which fuses Clinical LLaMA-LoRA with Downstream LLaMA-LoRA, another PEFT adapter specialised for downstream tasks. We evaluate this framework on multiple clinical outcome prediction datasets, comparing it to clinically trained language models. Our proposed framework achieves a state-of-the-art AUROC score averaged across all clinical downstream tasks. We observe substantial improvements of 6-9% AUROC score in the large-scale multilabel classification tasks, such as diagnoses and procedures classification.
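The two-step PEFT recipe stacks a domain-adaptation LoRA and a task LoRA. A minimal sketch of the first step with the Hugging Face peft library is shown below; gpt2 is used as a small stand-in backbone so the sketch runs anywhere (for LLaMA the target modules would typically be the q_proj and v_proj attention projections), and the second, downstream LoRA is attached analogously after the first adapter is trained.

```python
# Minimal sketch of step one of a two-step PEFT recipe: attach a LoRA
# adapter to a causal LM for domain-adaptive pretraining on clinical notes.
# gpt2 is a stand-in backbone; the paper uses LLaMA, where target_modules
# would typically be ["q_proj", "v_proj"].
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained("gpt2")

domain_lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["c_attn"],  # gpt2's fused attention projection
)
model = get_peft_model(model, domain_lora)
model.print_trainable_parameters()  # only the LoRA weights are trainable
# ...train on clinical notes with a standard LM objective, then attach the
# downstream LoRA for task fine-tuning in the same way...
```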
Improving Retrieval-Augmented Large Language Models via Data Importance Learning
paper_authors: Xiaozhong Lyu, Stefan Grafberger, Samantha Biegel, Shaopeng Wei, Meng Cao, Sebastian Schelter, Ce Zhang
for: The goal is to improve the performance of large language models by enabling them to exploit external knowledge, for example on tasks such as question answering and data imputation.
methods: The paper proposes an algorithm based on the multilinear extension for evaluating the importance of data points in the retrieval corpus. Given a retrieval-augmented model with an additive utility function and a validation set, the algorithm computes the exact data importance in polynomial time, despite the multilinear extension having exponentially many terms.
results: Experiments show that the performance of large language models can be improved simply by pruning or reweighting the retrieval corpus, without further training. On some tasks, retrieval augmentation with a search engine API lets a small model (e.g., GPT-JT) outperform GPT-3.5 without retrieval augmentation. The authors also show that the multilinear-extension weights can be computed very quickly in practice (e.g., within minutes for a corpus with 100 million elements).
Abstract
Retrieval augmentation enables large language models to take advantage of external knowledge, for example on tasks like question answering and data imputation. However, the performance of such retrieval-augmented models is limited by the data quality of their underlying retrieval corpus. In this paper, we propose an algorithm based on multilinear extension for evaluating the data importance of retrieved data points. There are exponentially many terms in the multilinear extension, and one key contribution of this paper is a polynomial time algorithm that computes exactly, given a retrieval-augmented model with an additive utility function and a validation set, the data importance of data points in the retrieval corpus using the multilinear extension of the model's utility function. We further proposed an even more efficient ({\epsilon}, {\delta})-approximation algorithm. Our experimental results illustrate that we can enhance the performance of large language models by only pruning or reweighting the retrieval corpus, without requiring further training. For some tasks, this even allows a small model (e.g., GPT-JT), augmented with a search engine API, to outperform GPT-3.5 (without retrieval augmentation). Moreover, we show that weights based on multilinear extension can be computed efficiently in practice (e.g., in less than ten minutes for a corpus with 100 million elements).
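While the paper gives an exact polynomial-time algorithm for additive utilities, the multilinear extension also admits a simple Monte Carlo estimate of a data point's importance: sample random subsets of the corpus and compare utility with and without the point. The sketch below does this for a stub utility function, which stands in for validation accuracy of a retrieval-augmented model.

```python
# Monte Carlo sketch of data importance under the multilinear extension:
# importance of point i is approximately E[ U(S + {i}) - U(S) ] over random
# subsets S. The utility here is a stub standing in for validation accuracy
# of a retrieval-augmented model; the paper's exact algorithm is faster.
import random

corpus = list(range(20))  # toy retrieval corpus of 20 data points

def utility(subset: set) -> float:
    # Stub: pretend even-numbered points are helpful, odd ones are noise.
    return sum(1.0 if x % 2 == 0 else -0.2 for x in subset)

def importance(i: int, n_samples: int = 500, p: float = 0.5) -> float:
    total = 0.0
    for _ in range(n_samples):
        s = {x for x in corpus if x != i and random.random() < p}
        total += utility(s | {i}) - utility(s)
    return total / n_samples

print("importance of point 2:", round(importance(2), 3))   # ~ +1.0
print("importance of point 3:", round(importance(3), 3))   # ~ -0.2
```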
Style Over Substance: Evaluation Biases for Large Language Models
For: This paper aims to evaluate the performance of large language models (LLMs) in natural language generation tasks, and to propose a new approach to improve the accuracy of LLM-based evaluations.
* Methods: The paper uses a dataset of intentionally flawed machine-generated answers to compare the evaluation behavior of crowd-sourced and expert annotators, as well as LLMs. The proposed approach is based on the Elo rating system, which independently evaluates machine-generated text across multiple dimensions.
* Results: The paper finds that the proposed Multi-Elo Rating System significantly enhances the quality of LLM-based evaluations, particularly in terms of factual accuracy. However, there is no significant improvement in crowd-sourced-based evaluations, indicating the need for further investigation and refinement.
Abstract
As large language models (LLMs) continue to advance, accurately and comprehensively evaluating their performance becomes increasingly challenging. Human evaluations are conventionally considered the gold standard in natural language generation, but recent advancements incorporate state-of-the-art LLMs as proxies for human judges in evaluation processes. However, the extent to which humans and LLMs are capable evaluators remains uncertain. This study investigates the behavior of crowd-sourced and expert annotators, as well as LLMs, when comparing outputs from different models. To achieve this, we curate a dataset of intentionally flawed machine-generated answers. Our findings reveal a concerning bias in the evaluation process, as answers with factual errors are rated more favorably than answers that are too short or contained grammatical errors. To address this issue, we propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score. We instantiate this idea with the Elo rating system, resulting in the Multi-Elo Rating System. Empirical results from our study reveal that this proposed approach significantly enhances the quality of LLM-based evaluations, particularly in terms of factual accuracy. However, there is no significant improvement in crowd-sourced-based evaluations, indicating the need for further investigation and refinement.
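The Elo update underlying the Multi-Elo idea is compact: each evaluation dimension keeps its own rating table and is updated independently from pairwise outcomes. A minimal sketch follows; the K-factor, dimension names, and example outcomes are illustrative.

```python
# Minimal sketch of per-dimension Elo ratings: each evaluation dimension
# (e.g., factuality, fluency) keeps an independent rating table updated
# from pairwise comparisons. K-factor and outcomes are illustrative.
from collections import defaultdict

K = 32
ratings = {dim: defaultdict(lambda: 1500.0) for dim in ("factuality", "fluency")}

def expected(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(dim: str, a: str, b: str, score_a: float) -> None:
    # score_a: 1.0 if a wins on this dimension, 0.0 if it loses, 0.5 for a tie.
    r_a, r_b = ratings[dim][a], ratings[dim][b]
    e_a = expected(r_a, r_b)
    ratings[dim][a] = r_a + K * (score_a - e_a)
    ratings[dim][b] = r_b + K * ((1.0 - score_a) - (1.0 - e_a))

update("factuality", "model_x", "model_y", 1.0)  # x judged more factual
update("fluency", "model_x", "model_y", 0.0)     # but y judged more fluent
print({d: dict(r) for d, r in ratings.items()})
```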
results: The paper shows that, on a preprocessed grammar, the semiring-weighted versions of the deduction methods have the same asymptotic runtime and space requirements as the unweighted methods, including sub-cubic runtime on some grammars.
Abstract
This paper provides a reference description, in the form of a deduction system, of Earley's (1970) context-free parsing algorithm with various speed-ups. Our presentation includes a known worst-case runtime improvement from Earley's $O (N^3|G||R|)$, which is unworkable for the large grammars that arise in natural language processing, to $O (N^3|G|)$, which matches the runtime of CKY on a binarized version of the grammar $G$. Here $N$ is the length of the sentence, $|R|$ is the number of productions in $G$, and $|G|$ is the total length of those productions. We also provide a version that achieves runtime of $O (N^3|M|)$ with $|M| \leq |G|$ when the grammar is represented compactly as a single finite-state automaton $M$ (this is partly novel). We carefully treat the generalization to semiring-weighted deduction, preprocessing the grammar like Stolcke (1995) to eliminate deduction cycles, and further generalize Stolcke's method to compute the weights of sentence prefixes. We also provide implementation details for efficient execution, ensuring that on a preprocessed grammar, the semiring-weighted versions of our methods have the same asymptotic runtime and space requirements as the unweighted methods, including sub-cubic runtime on some grammars.
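For readers who want the skeleton of the deduction system, a bare-bones Earley recognizer (PREDICT/SCAN/COMPLETE over a chart of dotted items, without the paper's speed-ups or semiring weights) can be written compactly; the toy grammar is illustrative.

```python
# Bare-bones Earley recognizer: a chart of dotted items driven by the
# classic PREDICT / SCAN / COMPLETE deduction rules. None of the paper's
# speed-ups or semiring weights are included; the grammar is a toy example.

GRAMMAR = {           # S -> S '+' S | 'a'
    "S": [("S", "+", "S"), ("a",)],
}
NONTERMS = set(GRAMMAR)

def earley_recognize(words, start="S"):
    n = len(words)
    # An item is (lhs, rhs, dot, origin); chart[k] holds items ending at k.
    chart = [set() for _ in range(n + 1)]
    for rhs in GRAMMAR[start]:
        chart[0].add((start, rhs, 0, 0))
    for k in range(n + 1):
        agenda = list(chart[k])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in NONTERMS:                       # PREDICT
                    for prod in GRAMMAR[sym]:
                        new = (sym, prod, 0, k)
                        if new not in chart[k]:
                            chart[k].add(new); agenda.append(new)
                elif k < n and words[k] == sym:           # SCAN
                    chart[k + 1].add((lhs, rhs, dot + 1, origin))
            else:                                          # COMPLETE
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, o2)
                        if new not in chart[k]:
                            chart[k].add(new); agenda.append(new)
    return any((start, rhs, len(rhs), 0) in chart[n] for rhs in GRAMMAR[start])

print(earley_recognize(["a", "+", "a"]))  # True
print(earley_recognize(["a", "+"]))       # False
```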
Agentività e telicità in GilBERTo: implicazioni cognitive
results: The study finds that the transformer model exploits lexical semantic information when completing morphosyntactic patterns, with results similar to those of Italian native speakers, suggesting that transformer models have some ability to infer lexical semantics.
Abstract
The goal of this study is to investigate whether a Transformer-based neural language model infers lexical semantics and use this information for the completion of morphosyntactic patterns. The semantic properties considered are telicity (also combined with definiteness) and agentivity. Both act at the interface between semantics and morphosyntax: they are semantically determined and syntactically encoded. The tasks were submitted to both the computational model and a group of Italian native speakers. The comparison between the two groups of data allows us to investigate to what extent neural language models capture significant aspects of human semantic competence.
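Cloze-style probing of this kind is easy to reproduce with a masked-LM pipeline; the sketch below shows the general pattern. The model identifier and the Italian prompt are assumptions for illustration rather than the study's exact materials.

```python
# Sketch of a cloze probe for a masked LM: score candidate completions of a
# morphosyntactic pattern. The model id and prompt are illustrative; the
# study targets GilBERTo, an Italian RoBERTa-style model.
from transformers import pipeline

fill = pipeline("fill-mask", model="idb-ita/gilberto-uncased-from-camembert")  # assumed id

# Telic vs. atelic continuation of an Italian verb phrase (illustrative).
prompt = f"Ha mangiato la mela {fill.tokenizer.mask_token} un minuto."
for cand in fill(prompt, targets=["in", "per"]):
    print(cand["token_str"], round(cand["score"], 4))
```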
The Relationship Between Speech Features Changes When You Get Depressed: Feature Correlations for Improving Speed and Performance of Depression Detection
methods: The experiments were performed on the Androids Corpus dataset, which includes 112 speakers, 58 of whom were diagnosed with depression by professional psychiatrists.
results: The experiments show that feeding the models with feature correlation matrices rather than feature vectors improves training speed and performance, with relative error-rate reductions between 23.1% and 26.6% depending on the model, probably because feature correlation matrices are more variable in the case of depressed speakers.
Abstract
This work shows that depression changes the correlation between features extracted from speech. Furthermore, it shows that using such an insight can improve the training speed and performance of depression detectors based on SVMs and LSTMs. The experiments were performed over the Androids Corpus, a publicly available dataset involving 112 speakers, including 58 people diagnosed with depression by professional psychiatrists. The results show that the models used in the experiments improve in terms of training speed and performance when fed with feature correlation matrices rather than with feature vectors. The relative reduction of the error rate ranges between 23.1% and 26.6% depending on the model. The probable explanation is that feature correlation matrices appear to be more variable in the case of depressed speakers. Correspondingly, such a phenomenon can be thought of as a depression marker.
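The core preprocessing step, replacing per-frame feature vectors with a speaker-level feature correlation matrix, is a few lines of numpy; the sketch below pairs it with an SVM as in the paper's experiments, using random arrays in place of real speech features.

```python
# Sketch: turn a speaker's frame-level speech features into a correlation
# matrix and use its upper triangle as the input to an SVM classifier.
# Random arrays stand in for real acoustic features.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def corr_features(frames: np.ndarray) -> np.ndarray:
    # frames: (n_frames, n_features) -> flattened upper triangle of the
    # feature-by-feature correlation matrix.
    c = np.corrcoef(frames, rowvar=False)
    iu = np.triu_indices_from(c, k=1)
    return c[iu]

# 40 speakers, 200 frames of 12 features each; labels are placeholders.
X = np.stack([corr_features(rng.normal(size=(200, 12))) for _ in range(40)])
y = rng.integers(0, 2, size=40)  # 0 = control, 1 = depressed (placeholder)

clf = SVC(kernel="rbf").fit(X[:30], y[:30])
print("toy accuracy:", clf.score(X[30:], y[30:]))
```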
ValiTex – a unified validation framework for computational text-based measures of social science constructs
results: The utility of ValiTex for validating computational text-based measures of social science constructs is demonstrated by applying it to a use case of detecting sexism from social media data.
Abstract
Guidance on how to validate computational text-based measures of social science constructs is fragmented. Whereas scholars are generally acknowledging the importance of validating their text-based measures, they often lack common terminology and a unified framework to do so. This paper introduces a new validation framework called ValiTex, designed to assist scholars to measure social science constructs based on textual data. The framework draws on a long-established tradition within psychometrics while extending the framework for the purpose of computational text analysis. ValiTex consists of two components, a conceptual model, and a dynamic checklist. Whereas the conceptual model provides a general structure along distinct phases on how to approach validation, the dynamic checklist defines specific validation steps and provides guidance on which steps might be considered recommendable (i.e., providing relevant and necessary validation evidence) or optional (i.e., useful for providing additional supporting validation evidence. The utility of the framework is demonstrated by applying it to a use case of detecting sexism from social media data.
NatLogAttack: A Framework for Attacking Natural Language Inference Models with Natural Logic
results: Compared with existing attack models, NatLogAttack generates better adversarial examples while requiring fewer queries to the victim models, which are found to be more vulnerable under the label-flipping setting. NatLogAttack provides a tool for probing the reasoning capacity of current and future NLI models, and the authors hope more logic-based attacks will be explored to better understand the desired properties of reasoning.
Abstract
Reasoning has been a central topic in artificial intelligence from the beginning. The recent progress made on distributed representation and neural networks continues to improve the state-of-the-art performance of natural language inference. However, it remains an open question whether the models perform real reasoning to reach their conclusions or rely on spurious correlations. Adversarial attacks have proven to be an important tool to help evaluate the Achilles' heel of the victim models. In this study, we explore the fundamental problem of developing attack models based on logic formalism. We propose NatLogAttack to perform systematic attacks centring around natural logic, a classical logic formalism that is traceable back to Aristotle's syllogism and has been closely developed for natural language inference. The proposed framework renders both label-preserving and label-flipping attacks. We show that compared to the existing attack models, NatLogAttack generates better adversarial examples with fewer visits to the victim models. The victim models are found to be more vulnerable under the label-flipping setting. NatLogAttack provides a tool to probe the existing and future NLI models' capacity from a key viewpoint and we hope more logic-based attacks will be further explored for understanding the desired property of reasoning.
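A label-preserving perturbation in the spirit of natural logic can be sketched with WordNet: in an upward-monotone context, replacing a noun with a hypernym preserves entailment. The example below generates such candidate premises; the monotonicity of the context is assumed here rather than computed, which the full framework would verify before attacking a victim model.

```python
# Sketch of a natural-logic-style, label-preserving perturbation: in an
# upward-monotone context, substituting a noun with a WordNet hypernym
# preserves entailment. The monotonicity is assumed, not computed;
# NatLogAttack itself verifies such conditions and queries a victim model.
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def hypernym_substitutions(sentence: str, noun: str) -> list[str]:
    out = []
    for syn in wn.synsets(noun, pos=wn.NOUN):
        for hyper in syn.hypernyms():
            for lemma in hyper.lemmas():
                word = lemma.name().replace("_", " ")
                out.append(sentence.replace(noun, word))
    return sorted(set(out))

premise = "A dog is running in the park."
for variant in hypernym_substitutions(premise, "dog")[:5]:
    print(variant)  # e.g., "A canine is running in the park."
```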
Generative Zero-Shot Prompt Learning for Cross-Domain Slot Filling with Inverse Prompting
results: Experiments and analysis demonstrate the effectiveness of the proposed framework, with especially large improvements (+13.44% F1) on unseen slots.
Abstract
Zero-shot cross-domain slot filling aims to transfer knowledge from the labeled source domain to the unlabeled target domain. Existing models either encode slot descriptions and examples or design handcrafted question templates using heuristic rules, suffering from poor generalization capability or robustness. In this paper, we propose a generative zero-shot prompt learning framework for cross-domain slot filling, both improving generalization and robustness than previous work. Besides, we introduce a novel inverse prompting strategy to distinguish different slot types to avoid the multiple prediction problem, and an efficient prompt-tuning strategy to boost higher performance by only training fewer prompt parameters. Experiments and analysis demonstrate the effectiveness of our proposed framework, especially huge improvements (+13.44% F1) on the unseen slots.
results: The paper proposes a data-management-based approach to verifying generative AI outputs that can help ensure their correctness, promote transparency, and enable decision-making with greater confidence.
Abstract
Generative AI has made significant strides, yet concerns about the accuracy and reliability of its outputs continue to grow. Such inaccuracies can have serious consequences such as inaccurate decision-making, the spread of false information, privacy violations, legal liabilities, and more. Although efforts to address these risks are underway, including explainable AI and responsible AI practices such as transparency, privacy protection, bias mitigation, and social and environmental responsibility, misinformation caused by generative AI will remain a significant challenge. We propose that verifying the outputs of generative AI from a data management perspective is an emerging issue for generative AI. This involves analyzing the underlying data from multi-modal data lakes, including text files, tables, and knowledge graphs, and assessing its quality and consistency. By doing so, we can establish a stronger foundation for evaluating the outputs of generative AI models. Such an approach can ensure the correctness of generative AI, promote transparency, and enable decision-making with greater confidence. Our vision is to promote the development of verifiable generative AI and contribute to a more trustworthy and responsible use of AI.
UniCoRN: Unified Cognitive Signal ReconstructioN bridging cognitive signals and human language
results: Experimental results show that UniCoRN is effective on both fMRI-to-text and EEG-to-text decoding, achieving BLEU scores of 34.77% and 37.04% respectively and surpassing the previous baseline by more than 10%, which indicates that a unified structure can effectively decode different cognitive signals.
Abstract
Decoding text stimuli from cognitive signals (e.g. fMRI) enhances our understanding of the human language system, paving the way for building versatile Brain-Computer Interface. However, existing studies largely focus on decoding individual word-level fMRI volumes from a restricted vocabulary, which is far too idealized for real-world application. In this paper, we propose fMRI2text, the first openvocabulary task aiming to bridge fMRI time series and human language. Furthermore, to explore the potential of this new task, we present a baseline solution, UniCoRN: the Unified Cognitive Signal ReconstructioN for Brain Decoding. By reconstructing both individual time points and time series, UniCoRN establishes a robust encoder for cognitive signals (fMRI & EEG). Leveraging a pre-trained language model as decoder, UniCoRN proves its efficacy in decoding coherent text from fMRI series across various split settings. Our model achieves a 34.77% BLEU score on fMRI2text, and a 37.04% BLEU when generalized to EEGto-text decoding, thereby surpassing the former baseline. Experimental results indicate the feasibility of decoding consecutive fMRI volumes, and the effectiveness of decoding different cognitive signals using a unified structure.
Training Models to Generate, Recognize, and Reframe Unhelpful Thoughts
results: The study shows that by using these datasets to train and/or evaluate current models, an abundance of tailored practice material and hypotheses can be generated with no or minimal additional model training.
Abstract
Many cognitive approaches to well-being, such as recognizing and reframing unhelpful thoughts, have received considerable empirical support over the past decades, yet still lack truly widespread adoption in self-help format. A barrier to that adoption is a lack of adequately specific and diverse dedicated practice material. This work examines whether current language models can be leveraged to both produce a virtually unlimited quantity of practice material illustrating standard unhelpful thought patterns matching specific given contexts, and generate suitable positive reframing proposals. We propose PATTERNREFRAME, a novel dataset of about 10k examples of thoughts containing unhelpful thought patterns conditioned on a given persona, accompanied by about 27k positive reframes. By using this dataset to train and/or evaluate current models, we show that existing models can already be powerful tools to help generate an abundance of tailored practice material and hypotheses, with no or minimal additional model training required.
Undecimated Wavelet Transform for Word Embedded Semantic Marginal Autoencoder in Security improvement and Denoising different Languages
results: The method successfully improves security and robustness for multilingual data-processing applications, and effectively reduces noise while improving data quality.
Abstract
By combining the undecimated wavelet transform within a Word Embedded Semantic Marginal Autoencoder (WESMA), this research study provides a novel strategy for improving security measures and denoising multiple languages. The incorporation of these strategies is intended to address the issues of robustness, privacy, and multilingualism in data processing applications. The undecimated wavelet transform is used as a feature extraction tool to identify prominent language patterns and structural qualities in the input data. The proposed system may successfully capture significant information while preserving the temporal and geographical links within the data by employing this transform. This improves security measures by increasing the system's ability to detect abnormalities, discover hidden patterns, and distinguish between legitimate content and dangerous threats. The Word Embedded Semantic Marginal Autoencoder also functions as an intelligent framework for dimensionality and noise reduction. The autoencoder effectively learns the underlying semantics of the data and reduces noise components by exploiting word embeddings and semantic context. As a result, data quality and accuracy are increased in following processing stages. The suggested methodology is tested using a diversified dataset that includes several languages and security scenarios. The experimental results show that the proposed approach is effective in attaining security enhancement and denoising capabilities across multiple languages. The system is strong in dealing with linguistic variances, producing consistent outcomes regardless of the language used. Furthermore, incorporating the undecimated wavelet transform considerably improves the system's ability to efficiently address complex security concerns
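The undecimated (stationary) wavelet transform used for feature extraction is available off the shelf in PyWavelets. A small sketch follows; the wavelet family, decomposition level, and toy signal are illustrative choices rather than the paper's configuration.

```python
# Sketch: extracting undecimated (stationary) wavelet features from a 1-D
# signal with PyWavelets. Wavelet family, level, and the toy signal are
# illustrative; unlike the decimated DWT, swt keeps full temporal resolution.
import numpy as np
import pywt

signal = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.1 * np.random.randn(256)

# Two-level stationary wavelet transform; each level yields (approx, detail)
# coefficient arrays of the same length as the input.
coeffs = pywt.swt(signal, wavelet="db2", level=2)

features = []
for approx, detail in coeffs:
    features += [approx.std(), detail.std()]  # simple per-band energy stats
print("feature vector:", np.round(features, 3))
```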
Your spouse needs professional help: Determining the Contextual Appropriateness of Messages through Modeling Social Relationships
paper_authors: David Jurgens, Agrima Seth, Jackson Sargent, Athena Aghighi, Michael Geraci
for: This study aims to improve the identification of inappropriate content in real conversations by taking social context and norms into account.
methods: The study uses large language models and incorporates information about the social relationship between the individuals to better judge whether a message is appropriate.
results: The study finds that relationship information helps large language models identify appropriateness more accurately, and that contextual-appropriateness judgments are predictive of other social factors expressed in language, such as condescension and politeness.
Abstract
Understanding interpersonal communication requires, in part, understanding the social context and norms in which a message is said. However, current methods for identifying offensive content in such communication largely operate independent of context, with only a few approaches considering community norms or prior conversation as context. Here, we introduce a new approach to identifying inappropriate communication by explicitly modeling the social relationship between the individuals. We introduce a new dataset of contextually-situated judgments of appropriateness and show that large language models can readily incorporate relationship information to accurately identify appropriateness in a given context. Using data from online conversations and movie dialogues, we provide insight into how the relationships themselves function as implicit norms and quantify the degree to which context-sensitivity is needed in different conversation settings. Further, we also demonstrate that contextual-appropriateness judgments are predictive of other social factors expressed in language such as condescension and politeness.
Exploring Linguistic Style Matching in Online Communities: The Role of Social Context and Conversation Dynamics
results: The study finds that LSM relates to several social metrics on Reddit (including post and subreddit features, conversation depth, user tenure, and the controversiality of a comment), and that changes in LSM after users are banned from a community reflect shifts in community dynamics.
Abstract
Linguistic style matching (LSM) in conversations can be reflective of several aspects of social influence such as power or persuasion. However, how LSM relates to the outcomes of online communication on platforms such as Reddit is an unknown question. In this study, we analyze a large corpus of two-party conversation threads in Reddit where we identify all occurrences of LSM using two types of style: the use of function words and formality. Using this framework, we examine how levels of LSM differ in conversations depending on several social factors within Reddit: post and subreddit features, conversation depth, user tenure, and the controversiality of a comment. Finally, we measure the change of LSM following loss of status after community banning. Our findings reveal the interplay of LSM in Reddit conversations with several community metrics, suggesting the importance of understanding conversation engagement when understanding community dynamics.
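A standard way to compute LSM, which the sketch below follows, is to compare the rates at which two speakers use each function-word category and average the per-category similarity; the word lists here are truncated stand-ins for full LIWC-style categories.

```python
# Sketch of a standard LSM score: for each function-word category, compare
# the two speakers' usage rates, then average across categories. The word
# lists are truncated stand-ins for full LIWC-style categories.
CATEGORIES = {
    "pronouns": {"i", "you", "we", "he", "she", "they", "it"},
    "articles": {"a", "an", "the"},
    "prepositions": {"in", "on", "at", "of", "with", "for"},
}

def usage_rates(text: str) -> dict:
    tokens = text.lower().split()
    n = max(len(tokens), 1)
    return {c: sum(t in words for t in tokens) / n
            for c, words in CATEGORIES.items()}

def lsm(text_a: str, text_b: str, eps: float = 1e-9) -> float:
    ra, rb = usage_rates(text_a), usage_rates(text_b)
    sims = [1 - abs(ra[c] - rb[c]) / (ra[c] + rb[c] + eps) for c in CATEGORIES]
    return sum(sims) / len(sims)

print(round(lsm("I went to the store with a friend",
                "we sat in the park for an hour"), 3))
```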
Dense Retrieval Adaptation using Target Domain Description
results: Extensive experiments show that adapting dense retrieval models using the constructed synthetic data leads to effective retrieval performance on the target domain.
Abstract
In information retrieval (IR), domain adaptation is the process of adapting a retrieval model to a new domain whose data distribution is different from the source domain. Existing methods in this area focus on unsupervised domain adaptation where they have access to the target document collection or supervised (often few-shot) domain adaptation where they additionally have access to (limited) labeled data in the target domain. There also exists research on improving zero-shot performance of retrieval models with no adaptation. This paper introduces a new category of domain adaptation in IR that is as-yet unexplored. Here, similar to the zero-shot setting, we assume the retrieval model does not have access to the target document collection. In contrast, it does have access to a brief textual description that explains the target domain. We define a taxonomy of domain attributes in retrieval tasks to understand different properties of a source domain that can be adapted to a target domain. We introduce a novel automatic data construction pipeline that produces a synthetic document collection, query set, and pseudo relevance labels, given a textual domain description. Extensive experiments on five diverse target domains show that adapting dense retrieval models using the constructed synthetic data leads to effective retrieval performance on the target domain.
Text Alignment Is An Efficient Unified Model for Massive NLP Tasks
methods: The model is instantiated through lightweight finetuning of RoBERTa (355M parameters) using 5.9M examples from 28 datasets.
results: Across a wide range of text-relation tasks, the model matches or surpasses FLAN-T5 models with roughly 2x or 10x more parameters, and it also outperforms task-specific models finetuned on individual datasets. When used to evaluate the factual consistency of language generation, it improves over various baselines, including the much larger GPT-3.5 and sometimes even GPT-4, and as an add-on component for LLMs such as GPT-3.5 in question answering it boosts performance by identifying unanswerable questions.
Abstract
Large language models (LLMs), typically designed as a function of next-word prediction, have excelled across extensive NLP tasks. Despite the generality, next-word prediction is often not an efficient formulation for many of the tasks, demanding an extreme scale of model parameters (10s or 100s of billions) and sometimes yielding suboptimal performance. In practice, it is often desirable to build more efficient models -- despite being less versatile, they still apply to a substantial subset of problems, delivering on par or even superior performance with much smaller model sizes. In this paper, we propose text alignment as an efficient unified model for a wide range of crucial tasks involving text entailment, similarity, question answering (and answerability), factual consistency, and so forth. Given a pair of texts, the model measures the degree of alignment between their information. We instantiate an alignment model (Align) through lightweight finetuning of RoBERTa (355M parameters) using 5.9M examples from 28 datasets. Despite its compact size, extensive experiments show the model's efficiency and strong performance: (1) On over 20 datasets of aforementioned diverse tasks, the model matches or surpasses FLAN-T5 models that have around 2x or 10x more parameters; the single unified model also outperforms task-specific models finetuned on individual datasets; (2) When applied to evaluate factual consistency of language generation on 23 datasets, our model improves over various baselines, including the much larger GPT-3.5 (ChatGPT) and sometimes even GPT-4; (3) The lightweight model can also serve as an add-on component for LLMs such as GPT-3.5 in question answering tasks, improving the average exact match (EM) score by 17.94 and F1 score by 15.05 through identifying unanswerable questions.
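Operationally, an alignment model of this kind is a text-pair classifier: feed it (text a, text b) and read off an alignment score. The sketch below shows the pattern with an off-the-shelf NLI cross-encoder as a stand-in, since the paper's Align checkpoint is not assumed here.

```python
# Sketch of alignment-style scoring with a generic text-pair cross-encoder.
# An off-the-shelf NLI model stands in for the paper's Align checkpoint;
# the entailment probability is read off as a rough alignment score.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "roberta-large-mnli"  # stand-in pair classifier, not the Align model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

context = "The meeting was moved from Tuesday to Friday morning."
claim = "The meeting now takes place on Friday."

inputs = tok(context, claim, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(-1).squeeze()

# roberta-large-mnli label order: contradiction, neutral, entailment.
print("alignment (entailment) score:", round(probs[2].item(), 3))
```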
摘要
大型语言模型(LLM)通常被设计为下一个词预测的函数,在各种自然语言处理任务中表现出色。然而,下一个词预测对许多任务而言往往不是高效的形式,需要极大规模的模型参数(数百亿甚至数千亿),有时还会产生次优的性能。在实践中,人们通常更希望构建更高效的模型:尽管通用性较低,它们仍适用于相当一部分问题,并能以小得多的模型规模取得相当甚至更优的性能。在这篇论文中,我们提出将文本对齐作为一种高效的统一模型,覆盖文本蕴含、相似度、问答(及可回答性)、事实一致性等一系列关键任务。给定一对文本,模型衡量它们信息之间的对齐程度。我们通过对 RoBERTa(355M 参数)进行轻量级微调,使用来自 28 个数据集的 590 万个示例,实例化了对齐模型(Align)。尽管规模紧凑,大量实验证明了该模型的高效性和强大性能:(1)在 20 多个涵盖上述多样任务的数据集上,该模型匹配或超越参数量约为其 2 倍或 10 倍的 FLAN-T5 模型,单一的统一模型还超越了在各个数据集上微调的任务特定模型;(2)应用于 23 个数据集上的语言生成事实一致性评估时,该模型超越了包括规模大得多的 GPT-3.5(ChatGPT)在内的多种基线,有时甚至优于 GPT-4;(3)该轻量级模型还可以作为 GPT-3.5 等 LLM 在问答任务中的附加组件,通过识别不可回答的问题,将平均精确匹配(EM)得分提高 17.94、F1 得分提高 15.05。
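A minimal sketch of scoring text-pair alignment with a RoBERTa classification head, in the spirit of Align. The checkpoint name is a placeholder rather than the paper's released model, and the 3-way label space is an assumption; any RoBERTa-based pair classifier fine-tuned for alignment fits this shape.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "roberta-large"  # placeholder; Align fine-tunes RoBERTa (355M)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

def alignment_scores(text_a: str, text_b: str) -> torch.Tensor:
    """Return probabilities over assumed alignment labels for a text pair."""
    inputs = tokenizer(text_a, text_b, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.softmax(dim=-1)  # e.g., aligned / neutral / contradictory

print(alignment_scores("A man is playing guitar.", "Someone plays an instrument."))
```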
On-Device Constrained Self-Supervised Speech Representation Learning for Keyword Spotting via Knowledge Distillation
paper_authors: Gene-Ping Yang, Yue Gu, Qingming Tang, Dongsu Du, Yuzong Liu
for: 这项研究旨在将大型自监督模型应用于关键词检测,但设备端的计算预算和有偏的数据收集带来了挑战。
methods: 我们提出了一种基于知识蒸馏的自监督语音表示学习(S3RL)架构,采用教师-学生框架,以双视图互相关蒸馏和教师的码本为学习目标,将知识从更大、更复杂的模型迁移到小型轻量级模型。
results: 我们在 Alexa 关键词检测任务上使用 1.66 万小时的内部数据进行评估,结果显示该方法在正常和噪声条件下均表现出色,证明了知识蒸馏方法在设备端资源约束下构建自监督关键词检测模型的有效性。
Abstract
Large self-supervised models are effective feature extractors, but their application is challenging under on-device budget constraints and biased dataset collection, especially in keyword spotting. To address this, we proposed a knowledge distillation-based self-supervised speech representation learning (S3RL) architecture for on-device keyword spotting. Our approach used a teacher-student framework to transfer knowledge from a larger, more complex model to a smaller, light-weight model using dual-view cross-correlation distillation and the teacher's codebook as learning objectives. We evaluated our model's performance on an Alexa keyword spotting detection task using a 16.6k-hour in-house dataset. Our technique showed exceptional performance in normal and noisy conditions, demonstrating the efficacy of knowledge distillation methods in constructing self-supervised models for keyword spotting tasks while working within on-device resource constraints.
摘要
大型自监督模型是有效的特征提取器,但在设备端预算约束和有偏的数据收集下,它们的应用面临挑战,尤其是在关键词检测中。为解决这一问题,我们提出了基于知识蒸馏的自监督语音表示学习(S3RL)架构,用于设备端关键词检测。我们的方法采用教师-学生框架,以双视图互相关蒸馏和教师的码本为学习目标,将知识从更大、更复杂的模型迁移到小型轻量级模型。我们在 Alexa 关键词检测任务上使用 1.66 万小时的内部数据进行评估。该技术在正常和噪声条件下均表现出色,证明了知识蒸馏方法在设备端资源约束下构建自监督关键词检测模型的有效性。
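A generic teacher-student distillation step in PyTorch is sketched below. This is a simplified stand-in: the paper's actual objectives (dual-view cross-correlation distillation and the teacher's codebook targets) are more specific than the plain L2 matching used here.

```python
import torch
import torch.nn as nn

# Toy teacher (large, frozen) and student (small, trainable) over audio features.
teacher = nn.Sequential(nn.Linear(64, 512), nn.ReLU(), nn.Linear(512, 256)).eval()
student = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 256))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def distill_step(features: torch.Tensor) -> float:
    with torch.no_grad():
        target = teacher(features)          # frozen teacher representation
    loss = nn.functional.mse_loss(student(features), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(distill_step(torch.randn(8, 64)))  # one update on a toy batch
```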
CFSum: A Coarse-to-Fine Contribution Network for Multimodal Summarization
results: 实验结果表明,CFSum 在标准基准上显著超越多个强基线。此外,分析证明,有用的图像可以帮助生成在图像中被隐式表示的非视觉词汇。
Abstract
Multimodal summarization usually suffers from the problem that the contribution of the visual modality is unclear. Existing multimodal summarization approaches focus on designing the fusion methods of different modalities, while ignoring the adaptive conditions under which visual modalities are useful. Therefore, we propose a novel Coarse-to-Fine contribution network for multimodal Summarization (CFSum) to consider different contributions of images for summarization. First, to eliminate the interference of useless images, we propose a pre-filter module to abandon useless images. Second, to make accurate use of useful images, we propose two levels of visual complement modules, word level and phrase level. Specifically, image contributions are calculated and are adopted to guide the attention of both textual and visual modalities. Experimental results have shown that CFSum significantly outperforms multiple strong baselines on the standard benchmark. Furthermore, the analysis verifies that useful images can even help generate non-visual words which are implicitly represented in the image.
摘要
多模态摘要通常受困于视觉模态贡献不明确的问题。现有的多模态摘要方法侧重于设计不同模态的融合方法,而忽略了视觉模态在哪些条件下才有用。因此,我们提出了一个新颖的粗到细贡献网络 CFSum,以考虑图像在摘要中的不同贡献。首先,为消除无用图像的干扰,我们提出了预过滤模块来剔除无用图像。其次,为准确利用有用图像,我们提出了词级和短语级两个层次的视觉补充模块。具体来说,模型计算图像贡献,并用其引导文本和视觉两种模态的注意力。实验结果显示,CFSum 在标准基准上显著超越多个强基线。此外,分析证明,有用的图像甚至可以帮助生成在图像中被隐式表示的非视觉词汇。
Strahler Number of Natural Language Sentences in Comparison with Random Trees
For: The paper aims to apply the Strahler number, originally proposed for characterizing river bifurcation, to natural language sentence tree structures and explore its implications for sentence processing.
Methods: The paper uses empirical measurements across grammatically annotated data to compute the upper and lower limits of the Strahler number for natural language sentences, and analyzes the growth of the Strahler number with sentence length.
Results: The paper shows that the Strahler number of natural language sentences is almost 3 or 4, similar to the case of river bifurcation (Strahler, 1957). The paper also explains reports of 3 to 4 memory areas required for sentence processing (Abney and Johnson, 1991; Schuler et al., 2010) and a psychological "magical number" of 3 to 5 (Cowan, 2001) using the Strahler number. Additionally, the paper finds that the Strahler number is not specific to natural language and holds for random trees.
Abstract
The Strahler number was originally proposed to characterize the complexity of river bifurcation and has found various applications. This article proposes computation of the Strahler number's upper and lower limits for natural language sentence tree structures. Through empirical measurements across grammatically annotated data, the Strahler number of natural language sentences is shown to be almost 3 or 4, similarly to the case of river bifurcation as reported by Strahler (1957). From the theory behind the number, we show that it is one kind of lower limit on the amount of memory required to process sentences. We consider the Strahler number to provide reasoning that explains reports showing that the number of required memory areas to process sentences is 3 to 4 for parsing (Abney and Johnson, 1991; Schuler et al., 2010), and reports indicating a psychological "magical number" of 3 to 5 (Cowan, 2001). An analytical and empirical analysis shows that the Strahler number is not constant but grows logarithmically; therefore, the Strahler number of sentences derives from the range of sentence lengths. Furthermore, the Strahler number is not different for random trees, which could suggest that its origin is not specific to natural language.
摘要
斯特拉勒数最初被提出用于刻画河流分叉的复杂性,并已有多种应用。本文提出计算自然语言句子树结构的斯特拉勒数的上下限。通过对带语法标注的数据进行实证测量,自然语言句子的斯特拉勒数被证明几乎总是 3 或 4,与斯特拉勒(1957)所报告的河流分叉情况类似。从该数背后的理论出发,我们证明它是处理句子所需内存量的一种下限。我们认为斯特拉勒数可以解释以下报告:句子解析需要 3 到 4 个内存区域(Abney 和 Johnson,1991;Schuler et al., 2010),以及心理学上 3 到 5 的"魔数"(Cowan,2001)。分析与实证结果显示,斯特拉勒数并非常数,而是呈对数增长;因此句子的斯特拉勒数源于句子长度的分布范围。此外,随机树的斯特拉勒数并无不同,这可能意味着其来源并非自然语言所特有。
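Since the Strahler number is defined recursively on trees, a worked example is easy to give: a leaf has order 1, and a node's order exceeds its children's only when its two largest child orders tie. The sketch below follows this standard definition on a child-list representation.

```python
def strahler(children, node):
    """children: dict mapping node -> list of child nodes."""
    kids = children.get(node, [])
    if not kids:
        return 1                                   # leaves have order 1
    orders = sorted((strahler(children, c) for c in kids), reverse=True)
    if len(orders) > 1 and orders[0] == orders[1]:
        return orders[0] + 1                       # two maximal subtrees tie
    return orders[0]                               # otherwise inherit the max

# A small binary tree: both subtrees of the root reach order 2, so the root is 3.
tree = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
print(strahler(tree, 0))  # -> 3
```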
Learning Symbolic Rules over Abstract Meaning Representations for Textual Reinforcement Learning
methods: NESTA 方法将通用语义解析器与规则归纳系统相结合,从文本中学习抽象的可解释规则,作为游戏决策的策略。
results: 我们在多个文本游戏基准上进行了实验,结果表明,相比深度强化学习方法,NESTA 在对未见测试游戏的泛化能力以及从更少的交互数据中学习的能力上均有提升。
Abstract
Text-based reinforcement learning agents have predominantly been neural network-based models with embeddings-based representation, learning uninterpretable policies that often do not generalize well to unseen games. On the other hand, neuro-symbolic methods, specifically those that leverage an intermediate formal representation, are gaining significant attention in language understanding tasks. This is because of their advantages ranging from inherent interpretability, the lesser requirement of training data, and being generalizable in scenarios with unseen data. Therefore, in this paper, we propose a modular, NEuro-Symbolic Textual Agent (NESTA) that combines a generic semantic parser with a rule induction system to learn abstract interpretable rules as policies. Our experiments on established text-based game benchmarks show that the proposed NESTA method outperforms deep reinforcement learning-based techniques by achieving better generalization to unseen test games and learning from fewer training interactions.
摘要
基于文本的强化学习智能体大多是基于神经网络、采用嵌入式表示的模型,学习到的策略不可解释,且往往难以泛化到未见过的游戏。另一方面,神经符号方法,尤其是利用中间形式化表示的方法,在语言理解任务中正受到越来越多的关注。这是因为它们具有多种优点,如内在可解释性、所需训练数据更少,以及在未见数据场景下的泛化能力。因此,在这篇论文中,我们提出一种模块化的神经符号文本智能体(NESTA),它结合通用语义解析器和规则归纳系统,学习抽象的可解释规则作为策略。我们在已有的文本游戏基准上进行的实验显示,所提出的 NESTA 方法在对未见测试游戏的泛化能力以及以更少的训练交互进行学习方面,均优于基于深度强化学习的技术。
Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment
results: ZeroTA 显著超越零样本基线,甚至在 ActivityNet Captions 上超过了最先进的少样本方法。此外,我们的方法在域外场景下的评估中比有监督方法表现出更强的鲁棒性。这项研究表明,对齐语言生成模型和视觉-语言对比模型等广泛使用的模型,可以解锁一种新能力:理解视频中的时间维度。
Abstract
Dense video captioning, a task of localizing meaningful moments and generating relevant captions for videos, often requires a large, expensive corpus of annotated video segments paired with text. In an effort to minimize the annotation cost, we propose ZeroTA, a novel method for dense video captioning in a zero-shot manner. Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time by optimizing solely on the input. This is accomplished by introducing a soft moment mask that represents a temporal segment in the video and jointly optimizing it with the prefix parameters of a language model. This joint optimization aligns a frozen language generation model (i.e., GPT-2) with a frozen vision-language contrastive model (i.e., CLIP) by maximizing the matching score between the generated text and a moment within the video. We also introduce a pairwise temporal IoU loss to let a set of soft moment masks capture multiple distinct events within the video. Our method effectively discovers diverse significant events within the video, with the resulting captions appropriately describing these events. The empirical results demonstrate that ZeroTA surpasses zero-shot baselines and even outperforms the state-of-the-art few-shot method on the widely-used benchmark ActivityNet Captions. Moreover, our method shows greater robustness compared to supervised methods when evaluated in out-of-domain scenarios. This research provides insight into the potential of aligning widely-used models, such as language generation models and vision-language models, to unlock a new capability: understanding temporal aspects of videos.
摘要
密集视频描述(dense video captioning)是定位视频中有意义的时刻并为其生成相应描述的任务,通常需要大量昂贵的、与文本配对的标注视频片段。为降低标注成本,我们提出 ZeroTA,一种零样本密集视频描述的新方法。我们的方法在训练时不需要任何视频或标注;它在测试时仅通过对输入本身进行优化,来定位并描述每个输入视频中的事件。具体来说,我们引入表示视频中时间片段的软时刻掩码,并将其与语言模型的前缀参数联合优化。该联合优化通过最大化生成文本与视频中某一时刻之间的匹配分数,将冻结的语言生成模型(GPT-2)与冻结的视觉-语言对比模型(CLIP)对齐。我们还引入成对时间 IoU 损失,使一组软时刻掩码能够捕捉视频中多个不同的事件。我们的方法能有效发现视频中多样的重要事件,且生成的描述能恰当地刻画这些事件。实验结果表明,ZeroTA 超越零样本基线,甚至在广泛使用的基准 ActivityNet Captions 上超过了最先进的少样本方法。此外,在域外场景下评估时,我们的方法比有监督方法表现出更强的鲁棒性。这项研究为通过对齐语言生成模型与视觉-语言对比模型等广泛使用的模型来解锁新能力(理解视频的时间维度)提供了启示。
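The two ingredients specific to ZeroTA's moment localization can be sketched in a few lines: a differentiable mask over frames parameterized by a center and width, and a pairwise soft-IoU penalty that pushes moments apart. The sigmoid relaxation and sharpness constant below are assumptions; the abstract fixes neither.

```python
import torch

def soft_mask(center, width, n_frames, sharpness=10.0):
    t = torch.linspace(0, 1, n_frames)
    # ~1 inside [center - width/2, center + width/2], ~0 outside
    return torch.sigmoid(sharpness * (width / 2 - (t - center).abs()))

def pairwise_iou_loss(masks):
    loss = 0.0
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            inter = torch.minimum(masks[i], masks[j]).sum()
            union = torch.maximum(masks[i], masks[j]).sum()
            loss = loss + inter / (union + 1e-6)   # soft IoU between two moments
    return loss

centers = torch.tensor([0.3, 0.7], requires_grad=True)
widths = torch.tensor([0.2, 0.2], requires_grad=True)
masks = [soft_mask(c, w, 100) for c, w in zip(centers, widths)]
print(pairwise_iou_loss(masks))  # minimized jointly with the text-moment matching loss
```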
Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts
results: 研究表明,通过无监督分析,计算机可以根据推文或 Reddit 帖子以接近 90% 的准确率预测用户对整形外科的消极、积极或中性情感。此外,该模型在无监督情感任务上的准确率甚至高于一个初级的有监督文档分类任务,因此无监督学习可被视为为 NLP 任务标注社交媒体文本的可行选择。
Abstract
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases based on the sheer volume and velocity of textual data. Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding. Using a word ranking method, term frequency-inverse document frequency (TF-IDF), to create features across documents, it is possible to perform unsupervised analytics, machine learning (ML) that can group the documents without a human manually labeling the data. For large datasets with thousands of features, t-distributed stochastic neighbor embedding (t-SNE), k-means clustering and Latent Dirichlet allocation (LDA) are employed to learn top words and generate topics for a Reddit and Twitter combined corpus. Using extremely simple deep learning models, this study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery based on a tweet or subreddit post with almost 90% accuracy. Furthermore, the model is capable of achieving higher accuracy on the unsupervised sentiment task than on a rudimentary supervised document classification task. Therefore, unsupervised learning may be considered a viable option in labeling social media documents for NLP tasks.
摘要
社交媒体平台上海量的用户帖子,因其文本数据的规模和速度,在很大程度上仍是人工智能(AI)应用尚未开发的资源。自然语言处理(NLP)是 AI 的一个子领域,利用称为语料库的文档集来训练计算机进行类人的语言理解。使用词排序方法——词频-逆文档频率(TF-IDF)——在文档间构建特征,可以进行无监督分析,即无需人工标注数据就能对文档分组的机器学习(ML)。对于具有数千个特征的大型数据集,采用 t-分布随机邻域嵌入(t-SNE)、k-means 聚类和潜在狄利克雷分配(LDA),为 Reddit 与 Twitter 合并语料学习重要词汇并生成话题。本研究使用极为简单的深度学习模型证明,无监督分析的结果可以让计算机根据推文或 Reddit 帖子以接近 90% 的准确率预测用户对整形外科的消极、积极或中性情感。此外,该模型在无监督情感任务上取得的准确率高于一个初级的有监督文档分类任务。因此,无监督学习可被视为为 NLP 任务标注社交媒体文档的可行选择。
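The unsupervised portion of this pipeline maps directly onto scikit-learn. The sketch below uses a four-post toy corpus standing in for the Reddit/Twitter data, and omits the t-SNE step since it serves only visualization.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

posts = [
    "recovery after rhinoplasty went smoothly",
    "considering a nose job, worried about cost",
    "botox results look natural so far",
    "regret my procedure, swelling will not go down",
]

# TF-IDF features, then unsupervised grouping with k-means.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(posts)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)

# LDA works on raw counts rather than TF-IDF weights.
counts = CountVectorizer(stop_words="english")
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(counts.fit_transform(posts))
print(clusters, topics.argmax(axis=1))  # cluster id and dominant topic per post
```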
SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference
results: 实验结果表明,使用 SkipDecode 可以在多种任务上获得 2 倍到 5 倍的推理加速,且性能回退可忽略,并与 13 亿和 67 亿参数的 OPT 模型直接兼容。
Abstract
Autoregressive large language models (LLMs) have made remarkable progress in various natural language generation tasks. However, they incur high computation cost and latency resulting from the autoregressive token-by-token generation. To address this issue, several approaches have been proposed to reduce computational cost using early-exit strategies. These strategies enable faster text generation using reduced computation without applying the full computation graph to each token. While existing token-level early exit methods show promising results for online inference, they cannot be readily applied for batch inferencing and Key-Value caching. This is because they have to wait until the last token in a batch exits before they can stop computing. This severely limits the practical application of such techniques. In this paper, we propose a simple and effective token-level early exit method, SkipDecode, designed to work seamlessly with batch inferencing and KV caching. It overcomes prior constraints by setting up a singular exit point for every token in a batch at each sequence position. It also guarantees a monotonic decrease in exit points, thereby eliminating the need to recompute KV Caches for preceding tokens. Rather than terminating computation prematurely as in prior works, our approach bypasses lower to middle layers, devoting most of the computational resources to upper layers, allowing later tokens to benefit from the compute expenditure by earlier tokens. Our experimental results show that SkipDecode can obtain 2x to 5x inference speedups with negligible regression across a variety of tasks. This is achieved using OPT models of 1.3 billion and 6.7 billion parameters, all the while being directly compatible with batching and KV caching optimization techniques.
摘要
自回归大型语言模型(LLM)在各种自然语言生成任务中取得了显著进展。然而,逐词元的自回归生成带来了高计算成本和延迟。为解决这一问题,已有多种利用提前退出策略降低计算成本的方法:它们无需对每个词元执行完整的计算图,即可更快地生成文本。然而,现有的词元级提前退出方法虽然在在线推理中表现出色,却无法直接应用于批量推理和键值(KV)缓存,因为它们必须等到批内最后一个词元退出后才能停止计算,这严重限制了此类技术的实际应用。在这篇论文中,我们提出了一种简单有效的词元级提前退出方法 SkipDecode,能够与批量推理和 KV 缓存无缝协作。它为批内每个序列位置上的所有词元设置统一的退出点,克服了先前方法的限制;同时保证退出点随序列位置单调递减,从而无需为先前词元重新计算 KV 缓存。与先前工作过早终止计算不同,我们的方法跳过低层到中层,将大部分计算资源分配给高层,使后续词元得以受益于先前词元的计算开销。实验结果表明,SkipDecode 可以在多种任务上获得 2 到 5 倍的推理加速,且性能回退可忽略。这一结果使用 13 亿和 67 亿参数的 OPT 模型取得,同时与批处理和 KV 缓存优化技术直接兼容。
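The scheduling idea can be sketched independently of any model: every token at the same sequence position shares one compute budget, the budget decreases monotonically with position, and the budget is spent on the top layers while lower layers are bypassed. The linear decay below is an assumption; the abstract only requires monotonicity.

```python
def layers_to_run(position: int, max_len: int, num_layers: int, min_budget: int):
    """Return the indices of layers executed for a given sequence position."""
    frac = min(position, max_len - 1) / (max_len - 1)
    budget = round(num_layers - frac * (num_layers - min_budget))  # monotone non-increasing
    return list(range(num_layers - budget, num_layers))            # skip lower layers

for pos in (0, 8, 15):
    print(pos, layers_to_run(pos, max_len=16, num_layers=24, min_budget=12))
# Early positions run all 24 layers; the last position runs only layers 12..23.
```

Because the budget never increases with position, a whole batch can share one schedule and earlier tokens' KV caches remain valid.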
Several categories of Large Language Models (LLMs): A Short Survey
paper_authors: Saurabh Pahune, Manoj Chandrasekharan
for: 本研究旨在为对基于 LLM 的聊天机器人和智能虚拟助手技术感兴趣的读者、开发者、学者和用户提供有用信息和未来方向。
methods: 本研究涵盖了不同类型的 LLM,包括任务型金融 LLM、多语言 LLM、生物医学与临床 LLM、视觉语言 LLM 和代码语言模型,并概述了每类 LLM 所用的方法、特性、数据集、Transformer 模型和比较指标。
results: 本研究总结了各类 LLM 的方法、特性、数据集、Transformer 模型和比较指标,并指出了开发聊天机器人和虚拟助手领域的未解决问题,如提升自然语言处理能力、增强聊天机器人智能以及解决道德和法律困境。
Abstract
Large Language Models(LLMs)have become effective tools for natural language processing and have been used in many different fields. This essay offers a succinct summary of various LLM subcategories. The survey emphasizes recent developments and efforts made for various LLM kinds, including task-based financial LLMs, multilingual language LLMs, biomedical and clinical LLMs, vision language LLMs, and code language models. The survey gives a general summary of the methods, attributes, datasets, transformer models, and comparison metrics applied in each category of LLMs. Furthermore, it highlights unresolved problems in the field of developing chatbots and virtual assistants, such as boosting natural language processing, enhancing chatbot intelligence, and resolving moral and legal dilemmas. The purpose of this study is to provide readers, developers, academics, and users interested in LLM-based chatbots and virtual intelligent assistant technologies with useful information and future directions.
摘要
大语言模型(LLM)已成为自然语言处理的有效工具,并在多个领域得到广泛应用。本文对 LLM 的各个子类别进行简洁概述,强调各类 LLM 的最新进展与工作,包括任务型金融 LLM、多语言 LLM、生物医学与临床 LLM、视觉语言 LLM 和代码语言模型。本文还概述了每类 LLM 所采用的方法、特性、数据集、Transformer 模型和比较指标。此外,文章强调了开发聊天机器人和虚拟助手领域的未解决问题,如提升自然语言处理能力、增强聊天机器人智能以及解决道德和法律困境。本文的目的是为对基于 LLM 的聊天机器人和智能虚拟助手技术感兴趣的读者、开发者、学者和用户提供有用信息和未来方向。
Named Entity Inclusion in Abstractive Text Summarization
results: 实验表明,这种预训练方法可以提高命名实体包含的精确率和召回率指标。
Abstract
We address the named entity omission - the drawback of many current abstractive text summarizers. We suggest a custom pretraining objective to enhance the model's attention on the named entities in a text. At first, the named entity recognition model RoBERTa is trained to determine named entities in the text. After that, this model is used to mask named entities in the text and the BART model is trained to reconstruct them. Next, the BART model is fine-tuned on the summarization task. Our experiments showed that this pretraining approach improves named entity inclusion precision and recall metrics.
摘要
我们针对命名实体遗漏(named entity omission)这一当前许多抽象文本摘要器的缺点展开研究。我们提出一种自定义预训练目标,以增强模型对文本中命名实体的注意力。首先,训练命名实体识别模型 RoBERTa 来确定文本中的命名实体;然后,用该模型对文本中的命名实体进行掩码,并训练 BART 模型来重建它们;接着,在摘要任务上微调 BART 模型。我们的实验表明,这种预训练方法可以提高命名实体包含的精确率和召回率指标。
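The data-preparation step of this pretraining objective is straightforward to sketch with the `transformers` library: find entities with an NER pipeline, replace them with BART's mask token, and compute the reconstruction loss against the original text. The checkpoints below are illustrative defaults, not necessarily the paper's exact ones.

```python
from transformers import pipeline, BartTokenizer, BartForConditionalGeneration

ner = pipeline("ner", aggregation_strategy="simple")  # default NER checkpoint
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

text = "Angela Merkel met Emmanuel Macron in Berlin."
masked = text
# Replace entity spans from right to left so character offsets stay valid.
for ent in sorted(ner(text), key=lambda e: e["start"], reverse=True):
    masked = masked[: ent["start"]] + "<mask>" + masked[ent["end"] :]

inputs = tokenizer(masked, return_tensors="pt")
labels = tokenizer(text, return_tensors="pt").input_ids
loss = bart(**inputs, labels=labels).loss  # entity-reconstruction objective
print(masked, float(loss))
```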
LongNet: Scaling Transformers to 1,000,000,000 Tokens
for: Addressing the challenge of scaling sequence length in the era of large language models, and achieving high performance on both long-sequence modeling and general language tasks.
methods: Proposes a Transformer variant called LongNet, which uses dilated attention to expand the attentive field exponentially as the distance grows, allowing for the scaling of sequence length to more than 1 billion tokens without sacrificing performance on shorter sequences.
results: LongNet has significant advantages, including linear computation complexity and logarithmic dependency between any two tokens in a sequence, and can be served as a distributed trainer for extremely long sequences. Experiments demonstrate strong performance on both long-sequence modeling and general language tasks, opening up new possibilities for modeling very long sequences such as entire corpora or the internet.
Abstract
Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, rendering the maximum sequence length restricted. To address this issue, we introduce LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens, without sacrificing the performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet has significant advantages: 1) it has a linear computation complexity and a logarithm dependency between any two tokens in a sequence; 2) it can be served as a distributed trainer for extremely long sequences; 3) its dilated attention is a drop-in replacement for standard attention, which can be seamlessly integrated with the existing Transformer-based optimization. Experiments results demonstrate that LongNet yields strong performance on both long-sequence modeling and general language tasks. Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
摘要
在大语言模型时代,扩展序列长度已成为一项关键需求。然而,现有方法要么受困于计算复杂度,要么受困于模型表达能力,使最大序列长度受到限制。为解决这一问题,我们介绍 LongNet,一种 Transformer 变体,可以将序列长度扩展到超过 10 亿个词元,而不牺牲在较短序列上的性能。具体来说,我们提出膨胀注意力(dilated attention),它使注意力范围随距离增长而呈指数扩展。LongNet 具有以下显著优势:1)它具有线性计算复杂度,且序列中任意两个词元之间仅存在对数依赖;2)它可以作为超长序列的分布式训练器;3)其膨胀注意力可以直接替换标准注意力,与现有基于 Transformer 的优化无缝集成。实验结果表明,LongNet 在长序列建模和通用语言任务上均有强大表现。我们的工作为建模超长序列开辟了新的可能性,例如将整个语料库甚至整个互联网视为一个序列。
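A toy illustration of dilated attention's sparsity pattern follows: within each segment, queries attend only to every r-th key, so cost per segment drops by the dilation factor. LongNet mixes several (segment length, dilation) configurations and multiple heads; a single configuration is shown here for clarity, under that simplifying assumption.

```python
import torch

def dilated_attention(q, k, v, segment: int, r: int):
    """q, k, v: (n, d) tensors; attend within segments at dilation r."""
    n, d = q.shape
    out = torch.zeros_like(v)
    for s in range(0, n, segment):
        idx = torch.arange(s, min(s + segment, n), r)      # dilated key positions
        attn = (q[s : s + segment] @ k[idx].T) / d ** 0.5  # queries vs sparse keys
        out[s : s + segment] = attn.softmax(dim=-1) @ v[idx]
    return out

q, k, v = (torch.randn(16, 8) for _ in range(3))
print(dilated_attention(q, k, v, segment=8, r=2).shape)  # torch.Size([16, 8])
```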
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
results: 研究发现,与现有开源 GPT4 风格模型相比,Lynx 模型实现了最准确的多模态理解,同时保持了最佳的多模态生成能力,在多种图像和视频任务上表现突出。
Abstract
Recent advancements in Large Language Models (LLMs) such as GPT4 have displayed exceptional multi-modal capabilities in following open-ended instructions given images. However, the performance of these models heavily relies on design choices such as network structures, training data, and training strategies, and these choices have not been extensively discussed in the literature, making it difficult to quantify progress in this field. To address this issue, this paper presents a systematic and comprehensive study, quantitatively and qualitatively, on training such models. We implement over 20 variants with controlled settings. Concretely, for network structures, we compare different LLM backbones and model designs. For training data, we investigate the impact of data and sampling strategies. For instructions, we explore the influence of diversified prompts on the instruction-following ability of the trained models. For benchmarks, we contribute the first, to our best knowledge, comprehensive evaluation set including both image and video tasks through crowd-sourcing. Based on our findings, we present Lynx, which performs the most accurate multi-modal understanding while keeping the best multi-modal generation ability compared to existing open-sourced GPT4-style models.
摘要
最近,GPT4 等大语言模型(LLM)的进展展现出了卓越的多模态能力:能够在给定图像时遵循开放式指令。然而,这些模型的性能在很大程度上取决于网络结构、训练数据和训练策略等设计选择,而这些选择在文献中尚未得到充分讨论,使得该领域的进展难以量化。为解决这一问题,本文从定量和定性两方面对训练此类模型进行了系统而全面的研究。我们在受控设置下实现了 20 多种变体:在网络结构方面,比较了不同的 LLM 主干和模型设计;在训练数据方面,考察了数据和采样策略的影响;在指令方面,探究了多样化提示对训练所得模型指令遵循能力的影响。在基准方面,我们通过众包构建了据我们所知首个同时涵盖图像和视频任务的综合评估集。基于我们的发现,我们提出了 Lynx,它在现有开源 GPT4 风格模型中实现了最准确的多模态理解,同时保持了最佳的多模态生成能力。
Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models
results: 作者们发现,更先进的模型能更有效地欺骗其他玩家,使杀手更容易逃脱;这种提升并非源于不同的行动,而是源于讨论中更强的说服技巧。
Abstract
Are current language models capable of deception and lie detection? We study this question by introducing a text-based game called $\textit{Hoodwinked}$, inspired by Mafia and Among Us. Players are locked in a house and must find a key to escape, but one player is tasked with killing the others. Each time a murder is committed, the surviving players have a natural language discussion then vote to banish one player from the game. We conduct experiments with agents controlled by GPT-3, GPT-3.5, and GPT-4 and find evidence of deception and lie detection capabilities. The killer often denies their crime and accuses others, leading to measurable effects on voting outcomes. More advanced models are more effective killers, outperforming smaller models in 18 of 24 pairwise comparisons. Secondary metrics provide evidence that this improvement is not mediated by different actions, but rather by stronger persuasive skills during discussions. To evaluate the ability of AI agents to deceive humans, we make this game publicly available at https://hoodwinked.ai/ .
摘要
当前的语言模型是否具备欺骗与谎言识别的能力?我们通过引入一款受"黑手党"和 Among Us 启发的文本游戏《Hoodwinked》来研究这个问题。玩家被锁在一座房子里,必须找到钥匙才能逃脱,但其中一名玩家的任务是杀害其他人。每次凶案发生后,幸存玩家会进行一次自然语言讨论,然后投票将一名玩家驱逐出游戏。我们对由 GPT-3、GPT-3.5 和 GPT-4 控制的智能体进行了实验,发现了欺骗和谎言识别能力的证据:凶手常常否认自己的罪行并指控他人,对投票结果产生可测量的影响。更先进的模型是更高效的杀手,在 24 组两两比较中的 18 组中胜过较小的模型。次要指标表明,这种提升并非源于不同的行动,而是源于讨论中更强的说服技巧。为了评估 AI 智能体欺骗人类的能力,我们将这款游戏公开发布在 https://hoodwinked.ai/ 。
Exploring Continual Learning for Code Generation Models
for: This paper focuses on the task of Continual Learning (CL) in the code domain, specifically addressing the issue of catastrophic forgetting in popular CL techniques when applied to coding tasks.
methods: The authors introduce a new benchmark called CodeTask-CL that covers a wide range of tasks and input/output programming languages, and compare popular CL techniques from NLP and Vision domains. They also propose a new method called Prompt Pooling with Teacher Forcing (PP-TF) to address the issue of catastrophic forgetting.
results: The authors achieve a 21.54% improvement over Prompt Pooling with their proposed method PP-TF, demonstrating the effectiveness of their approach in stabilizing training and improving performance on CL for code models.
methods: 作者们提出了一个名为 CodeTask-CL 的新基准,覆盖广泛的任务和输入/输出编程语言,并对来自 NLP 和视觉领域的主流 CL 技术进行比较。他们还提出了一种名为 Prompt Pooling with Teacher Forcing(PP-TF)的新方法,以稳定训练并提高代码模型的 CL 性能。
results: 作者们通过 PP-TF 方法实现了 21.54% 的提升,证明了他们的方法可以稳定训练并提高代码模型的 CL 性能。
Abstract
Large-scale code generation models such as Codex and CodeT5 have achieved impressive performance. However, libraries are upgraded or deprecated very frequently and re-training large-scale language models is computationally expensive. Therefore, Continual Learning (CL) is an important aspect that remains underexplored in the code domain. In this paper, we introduce a benchmark called CodeTask-CL that covers a wide range of tasks, including code generation, translation, summarization, and refinement, with different input and output programming languages. Next, on our CodeTask-CL benchmark, we compare popular CL techniques from NLP and Vision domains. We find that effective methods like Prompt Pooling (PP) suffer from catastrophic forgetting due to the unstable training of the prompt selection mechanism caused by stark distribution shifts in coding tasks. We address this issue with our proposed method, Prompt Pooling with Teacher Forcing (PP-TF), that stabilizes training by enforcing constraints on the prompt selection mechanism and leads to a 21.54% improvement over Prompt Pooling. Along with the benchmark, we establish a training pipeline that can be used for CL on code models, which we believe can motivate further development of CL methods for code models. Our code is available at https://github.com/amazon-science/codetaskcl-pptf
摘要
Codex 和 CodeT5 等大规模代码生成模型已经取得了令人瞩目的性能。然而,代码库的升级或弃用非常频繁,而重新训练大规模语言模型的计算成本高昂。因此,持续学习(Continual Learning,CL)是代码领域一个重要但尚未被充分探索的方面。在这篇论文中,我们引入了名为 CodeTask-CL 的基准,它涵盖代码生成、翻译、摘要和修订等广泛任务,且输入和输出编程语言多样。接着,我们在 CodeTask-CL 基准上比较了来自 NLP 和视觉领域的主流 CL 技术。我们发现,像 Prompt Pooling(PP)这样的有效方法,由于代码任务间剧烈的分布偏移导致提示选择机制训练不稳定,出现了灾难性遗忘。我们用提出的方法 Prompt Pooling with Teacher Forcing(PP-TF)解决了这一问题:该方法对提示选择机制施加约束以稳定训练,相比 Prompt Pooling 提升了 21.54%。除基准之外,我们还建立了可用于代码模型 CL 的训练管道,我们相信这可以推动面向代码模型的 CL 方法的进一步发展。代码可在 https://github.com/amazon-science/codetaskcl-pptf 获取。
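One way to read "teacher forcing" here is that the task-to-prompt assignment is fixed rather than learned, so the selection mechanism cannot drift under distribution shift. The sketch below illustrates that reading; the pool size, assignment scheme, and prompt shapes are all illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

pool_size, prompt_len, dim, prompts_per_task = 8, 4, 32, 2
pool = nn.Parameter(torch.randn(pool_size, prompt_len, dim))  # shared prompt pool

def prompts_for_task(task_id: int) -> torch.Tensor:
    # Teacher forcing: a deterministic slice of the pool per task, instead of a
    # learned query-key matching that could become unstable across tasks.
    start = (task_id * prompts_per_task) % pool_size
    chosen = pool[start : start + prompts_per_task]            # (k, L, d)
    return chosen.reshape(-1, dim)                             # flatten for prepending

x = torch.randn(10, dim)                                       # token embeddings
x_with_prompts = torch.cat([prompts_for_task(3), x], dim=0)
print(x_with_prompts.shape)  # torch.Size([18, 32]) -> fed to the code model
```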
Won’t Get Fooled Again: Answering Questions with False Premises
results: 研究发现,PLMs 只需在适量(如 256 个)示例上微调即可学会辨别 FPQs,并能为错误前提生成合理的解释作为反驳;在训练中重放少量一般问题还能让模型同时在 FPQs 和一般问题上表现出色。这表明 PLMs 内部已具备应对这类刁钻问题所需的知识,关键在于激活其反驳能力,这也为基于 PLM 的问答系统研究提供了动力。
Abstract
Pre-trained language models (PLMs) have shown unprecedented potential in various fields, especially as the backbones for question-answering (QA) systems. However, they tend to be easily deceived by tricky questions such as "How many eyes does the sun have?". Such frailties of PLMs often allude to the lack of knowledge within them. In this paper, we find that the PLMs already possess the knowledge required to rebut such questions, and the key is how to activate the knowledge. To systematize this observation, we investigate the PLMs' responses to one kind of tricky questions, i.e., the false premises questions (FPQs). We annotate a FalseQA dataset containing 2365 human-written FPQs, with the corresponding explanations for the false premises and the revised true premise questions. Using FalseQA, we discover that PLMs are capable of discriminating FPQs by fine-tuning on moderate numbers (e.g., 256) of examples. PLMs also generate reasonable explanations for the false premise, which serve as rebuttals. Further replaying a few general questions during training allows PLMs to excel on FPQs and general questions simultaneously. Our work suggests that once the rebuttal ability is stimulated, knowledge inside the PLMs can be effectively utilized to handle FPQs, which incentivizes the research on PLM-based QA systems.
results: 经过广泛的实验,DINES 方法在真实世界有向符号图上表现出色,能够有效学习解耦节点表示,并在边符号预测任务中显著超越竞争方法。
Abstract
Signed graphs are complex systems that represent trust relationships or preferences in various domains. Learning node representations in such graphs is crucial for many mining tasks. Although real-world signed relationships can be influenced by multiple latent factors, most existing methods often oversimplify the modeling of signed relationships by relying on social theories and treating them as simplistic factors. This limits their expressiveness and their ability to capture the diverse factors that shape these relationships. In this paper, we propose DINES, a novel method for learning disentangled node representations in signed directed graphs without social assumptions. We adopt a disentangled framework that separates each embedding into distinct factors, allowing for capturing multiple latent factors. We also explore lightweight graph convolutions that focus solely on sign and direction, without depending on social theories. Additionally, we propose a decoder that effectively classifies an edge's sign by considering correlations between the factors. To further enhance disentanglement, we jointly train a self-supervised factor discriminator with our encoder and decoder. Throughout extensive experiments on real-world signed directed graphs, we show that DINES effectively learns disentangled node representations, and significantly outperforms its competitors in the sign prediction task.
摘要
符号图(signed graphs)是在各个领域中表示信任关系或偏好的复杂系统,学习此类图中的节点表示对许多挖掘任务至关重要。尽管现实世界的符号关系可能受多种潜在因素影响,现有方法大多依赖社会学理论,将符号关系当作单一因素来过度简化地建模,这限制了它们的表达能力以及捕捉塑造这些关系的多样因素的能力。在这篇论文中,我们提出 DINES,一种无需社会学假设、在有向符号图中学习解耦节点表示的新方法。我们采用解耦框架,将每个嵌入分离为不同的因素,从而捕捉多个潜在因素;我们还探索了仅关注符号与方向、不依赖社会学理论的轻量级图卷积,并提出一种通过考虑因素间相关性来有效分类边符号的解码器。为进一步增强解耦,我们将自监督因素判别器与编码器和解码器联合训练。在真实世界有向符号图上的大量实验表明,DINES 能有效学习解耦节点表示,并在符号预测任务中显著超越竞争方法。
A Hybrid End-to-End Spatio-Temporal Attention Neural Network with Graph-Smooth Signals for EEG Emotion Recognition
results: 在 DEAP 数据集上,所提模型的情感分类性能超越了当前最先进结果,并通过跨模态迁移学习(TL)验证了所学表示的泛化能力。
Abstract
Recently, physiological data such as electroencephalography (EEG) signals have attracted significant attention in affective computing. In this context, the main goal is to design an automated model that can assess emotional states. Lately, deep neural networks have shown promising performance in emotion recognition tasks. However, designing a deep architecture that can extract practical information from raw data is still a challenge. Here, we introduce a deep neural network that acquires interpretable physiological representations by a hybrid structure of spatio-temporal encoding and recurrent attention network blocks. Furthermore, a preprocessing step is applied to the raw data using graph signal processing tools to perform graph smoothing in the spatial domain. We demonstrate that our proposed architecture exceeds state-of-the-art results for emotion classification on the publicly available DEAP dataset. To explore the generality of the learned model, we also evaluate the performance of our architecture towards transfer learning (TL) by transferring the model parameters from a specific source to other target domains. Using DEAP as the source dataset, we demonstrate the effectiveness of our model in performing cross-modality TL and improving emotion classification accuracy on DREAMER and the Emotional English Word (EEWD) datasets, which involve EEG-based emotion classification tasks with different stimuli.
摘要
近年来,脑电图(EEG)信号等生理数据在情感计算中受到广泛关注。在这一背景下,主要目标是设计能够评估情感状态的自动化模型。最近,深度神经网络在情感识别任务中表现出良好的性能;然而,设计能从原始数据中提取实用信息的深度架构仍是一个挑战。我们在此引入一种深度神经网络,通过时空编码与循环注意力网络块的混合结构获得可解释的生理表示。此外,我们对原始数据进行预处理,使用图信号处理工具在空间域进行图平滑。我们证明,所提出的架构在公开的 DEAP 数据集上的情感分类结果超越了当前最先进水平。为探究所学模型的泛化性,我们还评估了该架构在迁移学习(TL)中的性能,即将模型参数从特定源领域迁移到其他目标领域。以 DEAP 为源数据集,我们展示了模型在跨模态迁移学习中的有效性,并在 DREAMER 和 Emotional English Word(EEWD)这两个使用不同刺激的 EEG 情感分类数据集上提升了情感分类准确率。
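Graph smoothing in the spatial domain can be illustrated as a one-step diffusion over an electrode adjacency graph with a symmetrically normalized operator. The specific electrode graph and the blending form below are assumptions for illustration; the paper's exact graph-signal-processing operator may differ.

```python
import numpy as np

def smooth(X, A, alpha=0.5):
    """X: (channels, time) EEG; A: (channels, channels) adjacency; alpha: mixing."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d = A_hat.sum(axis=1)
    A_norm = A_hat / np.sqrt(np.outer(d, d))     # D^-1/2 (A+I) D^-1/2
    return (1 - alpha) * X + alpha * A_norm @ X  # blend each channel with neighbors

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-channel chain
X = np.random.randn(3, 128)
print(smooth(X, A).shape)  # (3, 128): same shape, spatially smoothed
```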
DeepOnto: A Python Package for Ontology Engineering with Deep Learning
methods: 本文立足于 PyTorch 和 Tensorflow 等 Python 深度学习框架,并与 OWL API、Jena 等广泛使用的本体 API 相衔接。
results: 本文提出了 Deeponto,一个 Python 套件,可支持本体工程中的多项任务,包括本体对齐与补全,并能运用深度学习方法(如预训练语言模型)。文中还提供了两个实际应用案例:Samsung Research UK 的数字健康指导和 OAEI 的 Bio-ML 赛道。
Abstract
Applying deep learning techniques, particularly language models (LMs), in ontology engineering has raised widespread attention. However, deep learning frameworks like PyTorch and Tensorflow are predominantly developed for Python programming, while widely-used ontology APIs, such as the OWL API and Jena, are primarily Java-based. To facilitate seamless integration of these frameworks and APIs, we present Deeponto, a Python package designed for ontology engineering. The package encompasses a core ontology processing module founded on the widely-recognised and reliable OWL API, encapsulating its fundamental features in a more "Pythonic" manner and extending its capabilities to include other essential components including reasoning, verbalisation, normalisation, projection, and more. Building on this module, Deeponto offers a suite of tools, resources, and algorithms that support various ontology engineering tasks, such as ontology alignment and completion, by harnessing deep learning methodologies, primarily pre-trained LMs. In this paper, we also demonstrate the practical utility of Deeponto through two use-cases: the Digital Health Coaching in Samsung Research UK and the Bio-ML track of the Ontology Alignment Evaluation Initiative (OAEI).
摘要
在本体工程中应用深度学习技术,特别是语言模型(LM),引起了广泛的关注。然而,PyTorch 和 Tensorflow 等深度学习框架主要面向 Python 编程,而 OWL API 和 Jena 等广泛使用的本体 API 主要基于 Java。为实现这些框架与 API 的无缝集成,我们提出了 Deeponto,一个用于本体工程的 Python 包。该包包含一个核心本体处理模块,基于广受认可且可靠的 OWL API,以更"Pythonic"的方式封装其基本特性,并将其能力扩展到推理、文本化、规范化、投影等其他重要组件。在该模块基础上,Deeponto 提供了一套工具、资源和算法,利用深度学习方法(主要是预训练语言模型)支持本体对齐与补全等多种本体工程任务。在这篇论文中,我们还通过两个应用案例——Samsung Research UK 的数字健康指导和本体对齐评估计划(OAEI)的 Bio-ML 赛道——证明了 Deeponto 的实用价值。
Generalizing Backpropagation for Gradient-Based Interpretability
results: 通过在合成数据上的实验和对 BERT 模型的分析,研究发现:(a)流经模型某一组件的梯度量反映了该组件对预测的重要性;(b)在 SVA 任务中,可识别出自注意力机制中对模型预测最重要的路径。
Abstract
Many popular feature-attribution methods for interpreting deep neural networks rely on computing the gradients of a model's output with respect to its inputs. While these methods can indicate which input features may be important for the model's prediction, they reveal little about the inner workings of the model itself. In this paper, we observe that the gradient computation of a model is a special case of a more general formulation using semirings. This observation allows us to generalize the backpropagation algorithm to efficiently compute other interpretable statistics about the gradient graph of a neural network, such as the highest-weighted path and entropy. We implement this generalized algorithm, evaluate it on synthetic datasets to better understand the statistics it computes, and apply it to study BERT's behavior on the subject-verb number agreement task (SVA). With this method, we (a) validate that the amount of gradient flow through a component of a model reflects its importance to a prediction and (b) for SVA, identify which pathways of the self-attention mechanism are most important.
摘要
许多流行的、用于解释深度神经网络的特征归因方法,依赖于计算模型输出对其输入的梯度。虽然这些方法可以指出哪些输入特征对模型预测可能重要,但它们几乎无法揭示模型自身的内部工作机制。在这篇论文中,我们观察到,模型的梯度计算是使用半环(semiring)的更一般形式化的一个特例。这一观察使我们能够推广反向传播算法,从而高效计算神经网络梯度图的其他可解释统计量,例如最高权重路径和熵。我们实现了这一推广算法,在合成数据集上对其进行评估以更好地理解它所计算的统计量,并将其应用于研究 BERT 在主谓数一致任务(SVA)上的行为。借助这一方法,我们(a)验证了流经模型某一组件的梯度量反映该组件对预测的重要性;(b)针对 SVA,识别出自注意力机制中最重要的路径。
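The semiring idea is easy to see on a toy example: swapping the usual (+, ×) operations for (max, +) turns a sum-over-paths quantity into a highest-weighted-path quantity over the same graph, which is the kind of generalization the paper applies to a network's gradient graph. The DAG below is a toy stand-in for that gradient graph.

```python
def best_path(edges, n_nodes, source=0):
    """edges: list of (u, v, weight) with u's in topological order."""
    NEG = float("-inf")
    score = [NEG] * n_nodes
    score[source] = 0.0
    for u, v, w in edges:
        if score[u] != NEG:
            score[v] = max(score[v], score[u] + w)  # (max, +) replaces (+, *)
    return score

edges = [(0, 1, 0.5), (0, 2, 1.5), (1, 3, 2.0), (2, 3, 0.1)]
print(best_path(edges, 4))  # node 3: max(0.5 + 2.0, 1.5 + 0.1) = 2.5
```

Running the same accumulation with (+, ×) instead would recover the ordinary sum over path products that standard backpropagation computes.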
Origin-Destination Travel Time Oracle for Map-based Services
results: 实验表明,相比基线方法,DOT 在精度、可扩展性和可解释性三个方面均取得更好的性能。
Abstract
Given an origin (O), a destination (D), and a departure time (T), an Origin-Destination (OD) travel time oracle~(ODT-Oracle) returns an estimate of the time it takes to travel from O to D when departing at T. ODT-Oracles serve important purposes in map-based services. To enable the construction of such oracles, we provide a travel-time estimation (TTE) solution that leverages historical trajectories to estimate time-varying travel times for OD pairs. The problem is complicated by the fact that multiple historical trajectories with different travel times may connect an OD pair, while trajectories may vary from one another. To solve the problem, it is crucial to remove outlier trajectories when doing travel time estimation for future queries. We propose a novel, two-stage framework called Diffusion-based Origin-destination Travel Time Estimation (DOT), that solves the problem. First, DOT employs a conditioned Pixelated Trajectories (PiT) denoiser that enables building a diffusion-based PiT inference process by learning correlations between OD pairs and historical trajectories. Specifically, given an OD pair and a departure time, we aim to infer a PiT. Next, DOT encompasses a Masked Vision Transformer~(MViT) that effectively and efficiently estimates a travel time based on the inferred PiT. We report on extensive experiments on two real-world datasets that offer evidence that DOT is capable of outperforming baseline methods in terms of accuracy, scalability, and explainability.
Track Mix Generation on Music Streaming Services using Transformers
results: Since its release, Track Mix has generated daily playlists for millions of users on Deezer, enhancing their music discovery experience on the platform.
Abstract
This paper introduces Track Mix, a personalized playlist generation system released in 2022 on the music streaming service Deezer. Track Mix automatically generates "mix" playlists inspired by initial music tracks, allowing users to discover music similar to their favorite content. To generate these mixes, we consider a Transformer model trained on millions of track sequences from user playlists. In light of the growing popularity of Transformers in recent years, we analyze the advantages, drawbacks, and technical challenges of using such a model for mix generation on the service, compared to a more traditional collaborative filtering approach. Since its release, Track Mix has been generating playlists for millions of users daily, enhancing their music discovery experience on Deezer.
摘要
这篇论文介绍了音乐流媒体服务 Deezer 于 2022 年发布的个性化播放列表生成系统 Track Mix。Track Mix 根据初始音乐曲目自动生成"混合"播放列表,让用户发现与自己喜爱内容相似的音乐。为生成这些混合列表,我们采用了在来自用户歌单的数百万条曲目序列上训练的 Transformer 模型。鉴于近年来 Transformer 日益流行,我们分析了在该服务上使用这种模型进行混合列表生成相对于更传统的协同过滤方法的优势、缺点和技术挑战。自发布以来,Track Mix 每天为数百万用户生成播放列表,提升了他们在 Deezer 上的音乐发现体验。
A Near-Linear Time Algorithm for the Chamfer Distance
paper_authors: Ainesh Bakshi, Piotr Indyk, Rajesh Jayaram, Sandeep Silwal, Erik Waingarten
for: 点云集 $A,B \subset \mathbb{R}^d$ 的 Chamfer 距离
methods: 以 $O(d n^2)$ 时间的朴素暴力算法为对照,设计近线性时间的近似算法
results: 提出了首个 $(1+\epsilon)$-近似算法,运行时间为 $O(nd \log (n)/\varepsilon^2)$,并且易于实现。实验表明其在大型高维数据集上既准确又快速。
Abstract
For any two point sets $A,B \subset \mathbb{R}^d$ of size up to $n$, the Chamfer distance from $A$ to $B$ is defined as $\text{CH}(A,B)=\sum_{a \in A} \min_{b \in B} d_X(a,b)$, where $d_X$ is the underlying distance measure (e.g., the Euclidean or Manhattan distance). The Chamfer distance is a popular measure of dissimilarity between point clouds, used in many machine learning, computer vision, and graphics applications, and admits a straightforward $O(d n^2)$-time brute force algorithm. Further, the Chamfer distance is often used as a proxy for the more computationally demanding Earth-Mover (Optimal Transport) Distance. However, the \emph{quadratic} dependence on $n$ in the running time makes the naive approach intractable for large datasets. We overcome this bottleneck and present the first $(1+\epsilon)$-approximate algorithm for estimating the Chamfer distance with a near-linear running time. Specifically, our algorithm runs in time $O(nd \log (n)/\varepsilon^2)$ and is implementable. Our experiments demonstrate that it is both accurate and fast on large high-dimensional datasets. We believe that our algorithm will open new avenues for analyzing large high-dimensional point clouds. We also give evidence that if the goal is to \emph{report} a $(1+\varepsilon)$-approximate mapping from $A$ to $B$ (as opposed to just its value), then any sub-quadratic time algorithm is unlikely to exist.
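For reference, here is the quadratic-time Chamfer distance that the paper's near-linear algorithm approximates, written directly from the definition above with the Euclidean metric.

```python
import numpy as np

def chamfer(A: np.ndarray, B: np.ndarray) -> float:
    """A: (n, d), B: (m, d); returns sum over a in A of min_b ||a - b||."""
    diffs = A[:, None, :] - B[None, :, :]          # (n, m, d) pairwise offsets
    dists = np.linalg.norm(diffs, axis=-1)         # (n, m) Euclidean distances
    return dists.min(axis=1).sum()                 # nearest neighbor per point of A

A = np.random.rand(100, 3)
B = np.random.rand(120, 3)
print(chamfer(A, B))  # note CH(A, B) != CH(B, A): the measure is asymmetric
```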
Parameter-Efficient Fine-Tuning of LLaMA for the Clinical Domain
results: 结果显示,使用 PEFT 技术在临床领域数据集上进行微调,可以达到与专门的临床语言模型相媲美的水平,并在大规模多标签分类任务中实现 6-9% 的 AUROC 分数提升。
Abstract
Adapting pretrained language models to novel domains, such as clinical applications, traditionally involves retraining their entire set of parameters. However, this approach is increasingly proven to be impractical owing to the substantial computational requirements associated with training such large language models. To address this issue, Parameter-Efficient Fine-Tuning (PEFT) techniques offer a viable solution by selectively fine-tuning a small subset of additional parameters, significantly reducing the computational requirements for domain adaptation. In this study, we propose Clinical LLaMA-LoRA, a PEFT adapter layer built upon the open-sourced LLaMA model. Clinical LLaMA-LoRA is trained using clinical notes obtained from the MIMIC-IV database, thereby creating a specialised adapter designed for the clinical domain. Additionally, we propose a two-step PEFT framework which fuses Clinical LLaMA-LoRA with Downstream LLaMA-LoRA, another PEFT adapter specialised for downstream tasks. We evaluate this framework on multiple clinical outcome prediction datasets, comparing it to clinically trained language models. Our proposed framework achieves a state-of-the-art AUROC score averaged across all clinical downstream tasks. We observe substantial improvements of 6-9% AUROC score in the large-scale multilabel classification tasks, such as diagnoses and procedures classification.
摘要
将预训练语言模型适配到临床应用等新领域,传统上需要重新训练其全部参数。然而,由于训练此类大型语言模型所需的巨大计算开销,这种方法日益被证明不切实际。为解决这一问题,参数高效微调(PEFT)技术通过选择性地微调一小部分附加参数,大幅降低领域适配的计算需求,提供了可行的解决方案。在这项研究中,我们提出 Clinical LLaMA-LoRA,一个构建在开源 LLaMA 模型之上的 PEFT 适配层。Clinical LLaMA-LoRA 使用来自 MIMIC-IV 数据库的临床笔记训练,从而得到一个专为临床领域设计的适配器。此外,我们提出一个两步 PEFT 框架,将 Clinical LLaMA-LoRA 与专为下游任务设计的另一个 PEFT 适配器 Downstream LLaMA-LoRA 融合。我们在多个临床结果预测数据集上评估该框架,并与临床训练的语言模型进行比较。我们提出的框架在所有临床下游任务上取得了最先进的平均 AUROC 分数;在诊断和医疗操作分类等大规模多标签分类任务中,AUROC 分数有 6-9% 的显著提升。
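A minimal sketch of attaching a LoRA adapter with the `peft` library follows. The base checkpoint, target modules, and hyperparameters are illustrative defaults rather than the paper's exact Clinical LLaMA-LoRA configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common choice
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the small adapter layers are trainable
```

Training then proceeds as usual; only the adapter weights receive gradients, which is what keeps the computational requirements of domain adaptation low.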
FITS: Modeling Time Series with $10k$ Parameters
results: FITS 模型以约 1 万个参数在时间序列预测和异常检测任务上取得可比先进模型的性能,并且可以轻松地在边缘设备上训练和部署。
Abstract
In this paper, we introduce FITS, a lightweight yet powerful model for time series analysis. Unlike existing models that directly process raw time-domain data, FITS operates on the principle that time series can be manipulated through interpolation in the complex frequency domain. By discarding high-frequency components with negligible impact on time series data, FITS achieves performance comparable to state-of-the-art models for time series forecasting and anomaly detection tasks, while having a remarkably compact size of only approximately $10k$ parameters. Such a lightweight model can be easily trained and deployed in edge devices, creating opportunities for various applications. The anonymous code repo is available in: \url{https://anonymous.4open.science/r/FITS}
摘要
“在这篇论文中,我们介绍 FITS,一种轻量而强大的时间序列分析模型。与直接处理原始时域数据的现有模型不同,FITS 基于一个原理:时间序列可以通过在复频域中进行插值来操纵。通过丢弃对时间序列数据影响可忽略的高频分量,FITS 在时间序列预测和异常检测任务上达到与最先进模型相当的性能,而其规模仅约 1 万个参数。如此轻量的模型可以轻松地在边缘设备上训练和部署,为各种应用创造了机会。匿名代码仓库见:https://anonymous.4open.science/r/FITS”
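The core idea can be sketched in a few lines of PyTorch: map a window to the complex frequency domain, keep only the low-frequency bins, apply a single complex-valued linear layer to interpolate to the (longer) target spectrum, and invert. The cutoff, window sizes, and the omission of amplitude rescaling are simplifications; a recent PyTorch with complex-layer support is assumed.

```python
import torch
import torch.nn as nn

class FreqInterp(nn.Module):
    def __init__(self, in_len=96, out_len=192, cutoff=24):
        super().__init__()
        self.cutoff, self.out_len = cutoff, out_len
        out_bins = out_len // 2 + 1
        self.interp = nn.Linear(cutoff, out_bins, dtype=torch.cfloat)

    def forward(self, x):                       # x: (batch, in_len)
        spec = torch.fft.rfft(x)                # complex spectrum of the window
        low = spec[:, : self.cutoff]            # discard high-frequency components
        out_spec = self.interp(low)             # complex linear interpolation
        return torch.fft.irfft(out_spec, n=self.out_len)

model = FreqInterp()
print(model(torch.randn(4, 96)).shape)  # torch.Size([4, 192])
```

The output covers both the reconstructed input window and the forecast horizon, which is how a single frequency-domain interpolation serves forecasting.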
PCL-Indexability and Whittle Index for Restless Bandits with General Observation Models
results: 作者提出了一种近似过程,将问题转化为可以用 Niño-Mora 和 Bertsimas 的 AG 算法求解的有限状态问题。数值实验显示,该算法性能十分出色。
Abstract
In this paper, we consider a general observation model for restless multi-armed bandit problems. The operation of the player needs to be based on certain feedback mechanism that is error-prone due to resource constraints or environmental or intrinsic noises. By establishing a general probabilistic model for dynamics of feedback/observation, we formulate the problem as a restless bandit with a countable belief state space starting from an arbitrary initial belief (a priori information). We apply the achievable region method with partial conservation law (PCL) to the infinite-state problem and analyze its indexability and priority index (Whittle index). Finally, we propose an approximation process to transform the problem into which the AG algorithm of Ni\~no-Mora and Bertsimas for finite-state problems can be applied to. Numerical experiments show that our algorithm has an excellent performance.
摘要
在这篇论文中,我们考虑 restless 多臂老虎机问题的一种通用观察模型。玩家的操作需要基于某种反馈机制,而该机制由于资源限制、环境噪声或内在噪声而容易出错。通过为反馈/观察的动态建立通用概率模型,我们将问题表述为一个具有可数信念状态空间、从任意初始信念(先验信息)出发的 restless bandit 问题。我们对这一无限状态问题应用带部分守恒律(PCL)的可达区域方法,分析其可索引性和优先级指数(Whittle 指数)。最后,我们提出一种近似过程,将问题转化为可应用 Niño-Mora 和 Bertsimas 针对有限状态问题的 AG 算法的形式。数值实验表明,我们的算法具有出色的性能。
Improving Retrieval-Augmented Large Language Models via Data Importance Learning
paper_authors: Xiaozhong Lyu, Stefan Grafberger, Samantha Biegel, Shaopeng Wei, Meng Cao, Sebastian Schelter, Ce Zhang
for: 提高大语言模型的性能,无需进行进一步训练。
methods: 使用基于多线性扩展(multilinear extension)的多项式时间算法精确计算数据重要性,并提出更高效的 (ε,δ)-近似算法。
results: 只需对检索语料进行剪枝或重新加权即可提升大语言模型的性能,而无需进一步训练。在某些任务上,这甚至可以使一个小型模型(如 GPT-JT)在结合搜索引擎 API 后超越(无检索增强的)GPT-3.5。此外,我们还证明了多线性扩展权重在实践中可以高效计算(例如,在包含 1 亿个元素的语料上不到十分钟)。
Abstract
Retrieval augmentation enables large language models to take advantage of external knowledge, for example on tasks like question answering and data imputation. However, the performance of such retrieval-augmented models is limited by the data quality of their underlying retrieval corpus. In this paper, we propose an algorithm based on multilinear extension for evaluating the data importance of retrieved data points. There are exponentially many terms in the multilinear extension, and one key contribution of this paper is a polynomial time algorithm that computes exactly, given a retrieval-augmented model with an additive utility function and a validation set, the data importance of data points in the retrieval corpus using the multilinear extension of the model's utility function. We further proposed an even more efficient ({\epsilon}, {\delta})-approximation algorithm. Our experimental results illustrate that we can enhance the performance of large language models by only pruning or reweighting the retrieval corpus, without requiring further training. For some tasks, this even allows a small model (e.g., GPT-JT), augmented with a search engine API, to outperform GPT-3.5 (without retrieval augmentation). Moreover, we show that weights based on multilinear extension can be computed efficiently in practice (e.g., in less than ten minutes for a corpus with 100 million elements).
methods: 本文提出了一种新的图建模框架,它将统计图模型(graphical Lasso)与基于因果的图模型(graphical Granger)相结合,以便在状态空间模型(SSM)中同时引入静态与动态图信息。
results: 实验验证了所提出的方法,并表明其能有效处理真实世界的时间序列数据。
Abstract
Time-series datasets are central in numerous fields of science and engineering, such as biomedicine, Earth observation, and network analysis. Extensive research exists on state-space models (SSMs), which are powerful mathematical tools that allow for probabilistic and interpretable learning on time series. Estimating the model parameters in SSMs is arguably one of the most complicated tasks, and the inclusion of prior knowledge is known to both ease the interpretation but also to complicate the inferential tasks. Very recent works have attempted to incorporate a graphical perspective on some of those model parameters, but they present notable limitations that this work addresses. More generally, existing graphical modeling tools are designed to incorporate either static information, focusing on statistical dependencies among independent random variables (e.g., graphical Lasso approach), or dynamic information, emphasizing causal relationships among time series samples (e.g., graphical Granger approaches). However, there are no joint approaches combining static and dynamic graphical modeling within the context of SSMs. This work proposes a novel approach to fill this gap by introducing a joint graphical modeling framework that bridges the static graphical Lasso model and a causal-based graphical approach for the linear-Gaussian SSM. We present DGLASSO (Dynamic Graphical Lasso), a new inference method within this framework that implements an efficient block alternating majorization-minimization algorithm. The algorithm's convergence is established by departing from modern tools from nonlinear analysis. Experimental validation on synthetic and real weather variability data showcases the effectiveness of the proposed model and inference algorithm.
摘要
时间序列数据在众多科学和工程领域中居于核心地位,如生物医学、地球观测和网络分析。状态空间模型(SSM)已有大量研究,它们是强大的数学工具,可以对时间序列进行概率性且可解释的学习。估计 SSM 的模型参数可以说是最复杂的任务之一;引入先验知识既有助于解释,也会使推断任务复杂化。最近的工作尝试从图的视角刻画其中一些模型参数,但存在本文所针对的明显局限。更一般地,现有的图建模工具要么引入静态信息,关注独立随机变量之间的统计依赖(如 graphical Lasso 方法),要么引入动态信息,强调时间序列样本之间的因果关系(如 graphical Granger 方法),而尚无在 SSM 框架下同时结合静态与动态图建模的联合方法。本文提出一种新颖的联合图建模框架来填补这一空白,它将静态的 graphical Lasso 模型与基于因果的图方法相衔接,应用于线性高斯 SSM。我们在该框架下提出了一种新的推理方法 DGLASSO(Dynamic Graphical Lasso),它实现了一种高效的块交替优控-最小化(majorization-minimization)算法,并借助现代非线性分析工具证明了算法的收敛性。在合成与真实气象变化数据上的实验展示了所提模型和推理算法的有效性。
Improving the Efficiency of Human-in-the-Loop Systems: Adding Artificial to Human Experts
results: 实验结果显示,该方法在多个图像分类基准上优于传统的 HITL 系统。
Abstract
Information systems increasingly leverage artificial intelligence (AI) and machine learning (ML) to generate value from vast amounts of data. However, ML models are imperfect and can generate incorrect classifications. Hence, human-in-the-loop (HITL) extensions to ML models add a human review for instances that are difficult to classify. This study argues that continuously relying on human experts to handle difficult model classifications leads to a strong increase in human effort, which strains limited resources. To address this issue, we propose a hybrid system that creates artificial experts that learn to classify data instances from unknown classes previously reviewed by human experts. Our hybrid system assesses which artificial expert is suitable for classifying an instance from an unknown class and automatically assigns it. Over time, this reduces human effort and increases the efficiency of the system. Our experiments demonstrate that our approach outperforms traditional HITL systems for several benchmarks on image classification.
摘要
信息系统越来越多地利用人工智能(AI)和机器学习(ML)从海量数据中创造价值。然而,ML 模型并不完美,可能产生错误分类。因此,人在回路(HITL)扩展为难以分类的实例增加人工审核。本研究认为,持续依赖人类专家处理模型难以分类的实例会使人力投入大幅增加,给有限资源带来压力。为解决这一问题,我们提出一种混合系统:创建人工专家,学习对先前由人类专家审核过的未知类别数据实例进行分类。我们的混合系统会评估哪个人工专家适合对来自未知类别的实例进行分类,并自动将其分配给该专家。随着时间推移,这会减少人力投入并提升系统效率。我们的实验表明,该方法在多个图像分类基准上优于传统的 HITL 系统。
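The routing idea can be sketched as a confidence-gated cascade: the base classifier defers low-confidence instances, which go to an artificial expert when one is confident enough, and to a human otherwise. Thresholds are illustrative, and the models are assumed to be scikit-learn-style classifiers exposing `predict_proba`.

```python
import numpy as np

def route(instance, base_model, artificial_experts, tau_base=0.9, tau_expert=0.8):
    """Return which component handles the instance and its predicted label."""
    probs = base_model.predict_proba([instance])[0]
    if probs.max() >= tau_base:
        return "base", int(np.argmax(probs))     # easy case: base model decides
    for expert in artificial_experts:            # experts trained on classes
        p = expert.predict_proba([instance])[0]  # previously reviewed by humans
        if p.max() >= tau_expert:
            return "artificial_expert", int(np.argmax(p))
    return "human", None                         # fall back to a human expert
```

Over time, as artificial experts accumulate coverage of previously human-reviewed classes, fewer instances reach the final human fallback.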
ContainerGym: A Real-World Reinforcement Learning Benchmark for Resource Allocation
results: 研究揭示了 PPO、TRPO 和 DQN 等知名深度强化学习算法在面对现实世界问题时的一些有趣局限。
Abstract
We present ContainerGym, a benchmark for reinforcement learning inspired by a real-world industrial resource allocation task. The proposed benchmark encodes a range of challenges commonly encountered in real-world sequential decision making problems, such as uncertainty. It can be configured to instantiate problems of varying degrees of difficulty, e.g., in terms of variable dimensionality. Our benchmark differs from other reinforcement learning benchmarks, including the ones aiming to encode real-world difficulties, in that it is directly derived from a real-world industrial problem, which underwent minimal simplification and streamlining. It is sufficiently versatile to evaluate reinforcement learning algorithms on any real-world problem that fits our resource allocation framework. We provide results of standard baseline methods. Going beyond the usual training reward curves, our results and the statistical tools used to interpret them allow to highlight interesting limitations of well-known deep reinforcement learning algorithms, namely PPO, TRPO and DQN.
摘要
我们介绍ContainerGym,一个受现实世界工业资源分配任务启发的强化学习基准。该基准包含现实世界序贯决策问题中常见的各种挑战,例如不确定性,并且可以配置为不同难度的问题实例,例如通过调整变量维度。与其他强化学习基准(包括那些旨在刻画现实世界难点的基准)不同,我们的基准直接来源于一个真实的工业问题,仅经过了最小限度的简化和整理。因此,它足够通用,可用于在任何符合我们资源分配框架的现实世界问题上评估强化学习算法。我们提供了标准基线方法的结果;除了常规的训练奖励曲线之外,我们的结果及用于解读结果的统计工具还突显了知名深度强化学习算法(PPO、TRPO和DQN)的有趣局限性。
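For orientation, a typical Gym-style interaction loop one might run against such a benchmark is sketched below; the environment id and constructor are hypothetical, so the benchmark's repository should be consulted for the real interface.

```python
# Hedged sketch: a random-policy baseline loop. "ContainerGym-v0" is a
# hypothetical id, assumed here only for illustration.
import gymnasium as gym  # assumes the benchmark registers a Gym environment

env = gym.make("ContainerGym-v0")            # hypothetical environment id
obs, info = env.reset(seed=0)
total_reward = 0.0
for _ in range(1000):
    action = env.action_space.sample()       # random baseline policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        obs, info = env.reset()
print("random-policy return:", total_reward)
```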
A Privacy-Preserving Walk in the Latent Space of Generative Models for Medical Applications
for: 本研究旨在提出一种隐空间导航策略,以便生成多样化的合成样本,支持深度模型的有效训练,同时以原则性的方式解决隐私问题。
methods: 我们的方法利用辅助身份分类器作为导航指南,在隐空间中进行非线性游走,以避免靠近真实样本的近似副本。我们还证明了,在任意两个随机选择的隐空间点之间,该游走策略比线性插值更安全。
results: 我们在结核病和糖尿病视网膜病变分类两个基准上测试了我们的寻路策略与k-same方法的组合,结果表明,使用我们的方法训练模型可以缓解性能下降,同时保持隐私保护。Abstract
Generative Adversarial Networks (GANs) have demonstrated their ability to generate synthetic samples that match a target distribution. However, from a privacy perspective, using GANs as a proxy for data sharing is not a safe solution, as they tend to embed near-duplicates of real samples in the latent space. Recent works, inspired by k-anonymity principles, address this issue through sample aggregation in the latent space, with the drawback of reducing the dataset by a factor of k. Our work aims to mitigate this problem by proposing a latent space navigation strategy able to generate diverse synthetic samples that may support effective training of deep models, while addressing privacy concerns in a principled way. Our approach leverages an auxiliary identity classifier as a guide to non-linearly walk between points in the latent space, minimizing the risk of collision with near-duplicates of real samples. We empirically demonstrate that, given any random pair of points in the latent space, our walking strategy is safer than linear interpolation. We then test our path-finding strategy combined with k-same methods and demonstrate, on two benchmarks for tuberculosis and diabetic retinopathy classification, that training a model using samples generated by our approach mitigates drops in performance, while keeping privacy preservation.
摘要
生成对抗网络(GANs)已经证明可以生成匹配目标分布的合成样本。然而,从隐私角度来看,使用GANs作为数据共享的代理并不安全,因为它们往往会把真实样本的近似副本嵌入隐空间中。受k-匿名原则启发,现有工作通过在隐空间中进行样本聚合来解决这个问题,但代价是数据集会缩小k倍。我们的工作旨在缓解这个问题,提出一种隐空间导航策略,以生成多样的合成样本,支持深度模型的有效训练,同时以原则性的方式保护隐私。我们的方法利用辅助身份分类器作为引导,在隐空间中于点与点之间进行非线性游走,最小化与真实样本近似副本发生碰撞的风险。我们通过实验证明,对于隐空间中任意一对随机点,我们的游走策略比线性插值更安全。随后,我们将寻路策略与k-same方法结合,在结核病和糖尿病视网膜病变分类两个基准上验证:使用我们方法生成的样本训练模型能缓解性能下降,同时保持隐私保护。
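A minimal PyTorch sketch of the guided-walk idea follows (our illustration, not the paper's exact procedure): interpolate between two latent points while gradients from an auxiliary identity classifier push each point away from regions attributed to real identities. The classifier and latent dimension below are placeholders.

```python
# Hedged sketch: identity-classifier-guided non-linear walk in latent space.
import torch

def guided_walk(z_start, z_end, id_classifier, steps=20, push=0.1):
    """Interpolate z_start -> z_end, nudging each point to lower the
    maximum identity probability predicted by id_classifier."""
    path = []
    for t in torch.linspace(0, 1, steps):
        z = ((1 - t) * z_start + t * z_end).clone().requires_grad_(True)
        id_prob = id_classifier(z).softmax(-1).max()   # collision-risk proxy
        id_prob.backward()
        with torch.no_grad():
            z = z - push * z.grad                      # step away from identities
        path.append(z.detach())
    return path

id_clf = torch.nn.Linear(64, 10)                       # placeholder classifier
path = guided_walk(torch.randn(64), torch.randn(64), id_clf)
print(len(path), path[0].shape)
```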
Transfer Learning for the Efficient Detection of COVID-19 from Smartphone Audio Data
results: 结果显示L\textsuperscript{3}-Net在所有实验设定中表现最佳:作为特征提取器时,其精确率-召回率曲线下面积(Precision-Recall AUC)比其他方案高出12.3%,微调时高出10%。此外,研究发现只微调预训练模型的全连接层通常会导致表现下降,相对于特征提取平均下降6.6%。Abstract
Disease detection from smartphone data represents an open research challenge in mobile health (m-health) systems. COVID-19 and its respiratory symptoms are an important case study in this area, and their early detection is a potentially effective instrument for counteracting the pandemic. The efficacy of this solution mainly depends on the performance of the AI algorithms applied to the collected data and their possible implementation directly on the users' mobile devices. Considering these issues, and the limited amount of available data, in this paper we present the experimental evaluation of 3 different deep learning models, compared also with hand-crafted features, and of two main approaches of transfer learning in the considered scenario: both feature extraction and fine-tuning. Specifically, we considered VGGish, YAMNET, and L\textsuperscript{3}-Net (including 12 different configurations) evaluated through user-independent experiments on 4 different datasets (13,447 samples in total). Results clearly show the advantages of L\textsuperscript{3}-Net in all the experimental settings, as it outperforms the other solutions by 12.3\% in terms of Precision-Recall AUC as a feature extractor, and by 10\% when the model is fine-tuned. Moreover, we note that fine-tuning only the fully-connected layers of the pre-trained models generally leads to worse performance, with an average drop of 6.6\% with respect to feature extraction. Finally, we evaluate the memory footprints of the different models for their possible applications on commercial mobile devices.
摘要
从智能手机数据中检测疾病是移动医疗(m-health)系统中的一个开放研究挑战。COVID-19及其呼吸道症状是该领域的重要案例研究,早期检测可以成为对抗疫情的实际工具。这种解决方案的有效性主要取决于应用于所收集数据的人工智能算法的性能,以及它们直接在用户移动设备上实现的可能性。考虑到这些问题以及有限的数据量,本文对三种不同的深度学习模型进行了实验评估,并与手工设计的特征进行比较,同时评估了两种主要的迁移学习方法:特征提取和微调。具体而言,我们考察了VGGish、YAMNET和L\textsuperscript{3}-Net(包括12种不同配置),并在4个数据集(共13,447个样本)上进行了与用户无关的实验。结果显示L\textsuperscript{3}-Net在所有实验设定中均优于其他方案:作为特征提取器时,精确率-召回率曲线下面积(AUC)高出12.3%,微调时高出10%。此外,我们注意到只微调预训练模型的全连接层通常会导致性能下降,相对于特征提取平均下降6.6%。最后,我们评估了不同模型的内存占用,以考察其在商用移动设备上应用的可能性。
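The two transfer regimes compared in the paper can be sketched as follows with a generic PyTorch backbone (a stand-in, not the actual VGGish/YAMNET/L3-Net weights): feature extraction freezes the backbone and trains only the head, while fine-tuning updates everything, typically with a smaller learning rate.

```python
# Hedged sketch: feature extraction vs fine-tuning with a toy audio backbone.
import torch, torch.nn as nn

backbone = nn.Sequential(nn.Conv1d(1, 16, 9), nn.ReLU(),
                         nn.AdaptiveAvgPool1d(1), nn.Flatten())
head = nn.Linear(16, 2)                      # COVID vs non-COVID

# (a) feature extraction: freeze the backbone, train only the head
for p in backbone.parameters():
    p.requires_grad = False
opt_fe = torch.optim.Adam(head.parameters(), lr=1e-3)

# (b) fine-tuning: unfreeze everything, usually with a smaller learning rate
for p in backbone.parameters():
    p.requires_grad = True
opt_ft = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()),
                          lr=1e-4)

x = torch.randn(8, 1, 16000)                 # batch of 1-second audio @ 16 kHz
print(head(backbone(x)).shape)               # torch.Size([8, 2])
```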
paper_authors: Andrey Kuzmin, Markus Nagel, Mart van Baalen, Arash Behboodi, Tijmen Blankevoort
for: Comparing the effectiveness of neural network quantization and pruning techniques for compressing deep neural networks.
methods: Analytical and empirical comparisons of expected quantization and pruning error, and lower bounds for per-layer pruning and quantization error in trained networks.
results: Quantization outperforms pruning in most cases, but pruning might be beneficial in some scenarios with very high compression ratios.Abstract
Neural network pruning and quantization techniques are almost as old as neural networks themselves. However, to date only ad-hoc comparisons between the two have been published. In this paper, we set out to answer the question of which is better: neural network quantization or pruning? By answering this question, we hope to inform design decisions made on neural network hardware going forward. We provide an extensive comparison between the two techniques for compressing deep neural networks. First, we give an analytical comparison of expected quantization and pruning error for general data distributions. Then, we provide lower bounds for the per-layer pruning and quantization error in trained networks, and compare these to empirical error after optimization. Finally, we provide an extensive experimental comparison for training 8 large-scale models on 3 tasks. Our results show that in most cases quantization outperforms pruning. Only in some scenarios with very high compression ratios might pruning be beneficial from an accuracy standpoint.
摘要
神经网络剪枝和量化技术几乎与神经网络本身一样古老。然而,迄今为止,两者之间只发表过一些随意的比较。在这篇论文中,我们试图回答以下问题:神经网络量化和剪枝哪个更好?通过回答这个问题,我们希望为今后的神经网络硬件设计决策提供指导。我们对这两种压缩深度神经网络的技术进行了广泛的比较。首先,我们针对一般数据分布给出了期望量化误差与剪枝误差的分析比较。然后,我们给出了已训练网络中逐层剪枝和量化误差的下界,并将其与优化后的经验误差进行比较。最后,我们在3个任务上对8个大规模模型进行了广泛的实验比较。结果表明,在大多数情况下量化优于剪枝;只有在压缩比非常高的某些场景中,剪枝才可能在准确性方面更有优势。
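An empirical toy version of the paper's analytical comparison can be sketched as below: mean-squared error of uniform quantization versus magnitude pruning at a matched storage budget, on i.i.d. Gaussian weights (an illustrative distribution, not a trained network).

```python
# Hedged sketch: quantization MSE vs pruning MSE at roughly equal compression.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000)

def quantize_mse(w, bits):
    levels = 2 ** bits
    lo, hi = w.min(), w.max()
    step = (hi - lo) / (levels - 1)
    q = np.round((w - lo) / step) * step + lo          # uniform quantizer
    return np.mean((w - q) ** 2)

def prune_mse(w, keep_ratio):
    k = int(keep_ratio * w.size)
    thresh = np.sort(np.abs(w))[-k]                    # k-th largest magnitude
    return np.mean(np.where(np.abs(w) >= thresh, 0.0, w) ** 2)

# 4-bit quantization vs keeping 4/32 of fp32 weights: ~4 bits/weight either
# way (ignoring sparse-index overhead for the pruned model)
print("quant MSE:", quantize_mse(w, 4))
print("prune MSE:", prune_mse(w, 4 / 32))
```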
results: 实验评估表明,以非私有KMeans++的inertia结果为基线,在ε=1、δ=10^-5的设置下,与Chang和Kamath的最新聚类算法相比,DPM将与基线的差距在合成数据集上最多缩小50%,在真实数据集上最多缩小62%。Abstract
Privacy-preserving clustering groups data points in an unsupervised manner whilst ensuring that sensitive information remains protected. Previous privacy-preserving clustering focused on identifying concentration of point clouds. In this paper, we take another path and focus on identifying appropriate separators that split a data set. We introduce the novel differentially private clustering algorithm DPM that searches for accurate data point separators in a differentially private manner. DPM addresses two key challenges for finding accurate separators: identifying separators that are large gaps between clusters instead of small gaps within a cluster and, to efficiently spend the privacy budget, prioritising separators that split the data into large subparts. Using the differentially private Exponential Mechanism, DPM randomly chooses cluster separators with provably high utility: For a data set $D$, if there is a wide low-density separator in the central $60\%$ quantile, DPM finds that separator with probability $1 - \exp(-\sqrt{|D|})$. Our experimental evaluation demonstrates that DPM achieves significant improvements in terms of the clustering metric inertia. With the inertia results of the non-private KMeans++ as a baseline, for $\varepsilon = 1$ and $\delta=10^{-5}$ DPM improves upon the difference to the baseline by up to $50\%$ for a synthetic data set and by up to $62\%$ for a real-world data set compared to a state-of-the-art clustering algorithm by Chang and Kamath.
摘要
隐私保护聚类以无监督方式对数据点进行分组,同时确保敏感信息受到保护。以往的隐私保护聚类侧重于识别点云的聚集,而本文另辟蹊径,专注于寻找能够切分数据集的合适分割器。我们提出了新的差分隐私聚类算法DPM,它以差分隐私的方式搜索准确的数据点分割器。DPM解决了寻找准确分割器的两个关键挑战:一是识别位于聚类之间的大间隙分割器,而不是聚类内部的小间隙;二是为了高效使用隐私预算,优先选择能将数据切分成较大子部分的分割器。DPM使用差分隐私的指数机制随机选择聚类分割器,并具有可证明的高效用:对于数据集D,若在中心60%分位数范围内存在较宽的低密度分割器,DPM能以1 - exp(-√|D|)的概率找到该分割器。我们的实验评估表明,DPM在聚类指标inertia方面取得显著提升:以非私有KMeans++的inertia结果为基线,在ε=1、δ=10^-5下,与Chang和Kamath的最新聚类算法相比,DPM将与基线的差距在合成数据集上最多缩小50%,在真实数据集上最多缩小62%。
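For reference, the differentially private Exponential Mechanism that DPM builds on samples a candidate with probability proportional to exp(ε·u/(2Δu)); a minimal sketch with made-up separator utilities follows.

```python
# Sketch of the Exponential Mechanism; utilities here are invented stand-ins.
import numpy as np

def exponential_mechanism(candidates, utility, eps, sensitivity=1.0, rng=None):
    rng = rng or np.random.default_rng()
    scores = np.array([utility(c) for c in candidates])
    # subtract the max for numerical stability; constant shifts cancel out
    weights = np.exp(eps * (scores - scores.max()) / (2 * sensitivity))
    probs = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# toy example: prefer separators sitting in wide low-density gaps
candidates = [0.2, 1.5, 3.0, 4.8]
gap_width = {0.2: 0.1, 1.5: 2.0, 3.0: 0.3, 4.8: 1.2}   # stand-in utilities
sep = exponential_mechanism(candidates, gap_width.get, eps=1.0)
print("chosen separator:", sep)
```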
SegNetr: Rethinking the local-global interactions and skip connections in U-shaped networks
results: 研究结果显示,SegNetr在四个主流医学图像分割数据集上表现出色:与常用的U-Net相比,参数量和GFLOPs分别减少了59%和76%,同时保持了与最先进方法相当的分割精度。此外,研究者还发现,SegNetr的组件可以应用到其他U形网络上,以提高其分割性能。Abstract
Recently, U-shaped networks have dominated the field of medical image segmentation due to their simple and easily tuned structure. However, existing U-shaped segmentation networks: 1) mostly focus on designing complex self-attention modules to compensate for the lack of long-range dependence in convolution operations, which increases the overall number of parameters and the computational complexity of the network; 2) simply fuse the features of the encoder and decoder, ignoring the connection between their spatial locations. In this paper, we rethink the above problems and build a lightweight medical image segmentation network, called SegNetr. Specifically, we introduce a novel SegNetr block that can perform local-global interactions dynamically at any stage and with only linear complexity. At the same time, we design a general information retention skip connection (IRSC) to preserve the spatial location information of encoder features and achieve accurate fusion with the decoder features. We validate the effectiveness of SegNetr on four mainstream medical image segmentation datasets, with 59\% and 76\% fewer parameters and GFLOPs than vanilla U-Net, while achieving segmentation performance comparable to state-of-the-art methods. Notably, the components proposed in this paper can be applied to other U-shaped networks to improve their segmentation performance.
摘要
近期,U型网络凭借其简单且易于调整的结构,在医学图像分割领域占据主导地位。然而,现有的U型分割网络:1)大多通过设计复杂的自注意力模块来弥补卷积操作缺乏长程依赖的问题,这会增加网络的总参数量和计算复杂度;2)只是简单地融合编码器和解码器的特征,忽略了二者空间位置之间的联系。在这篇论文中,我们重新思考了上述问题,并构建了一个轻量级的医学图像分割网络,称为SegNetr。具体来说,我们提出了一个新的SegNetr块,可以在任何阶段以仅线性的复杂度动态地进行局部-全局交互。同时,我们设计了一种通用的信息保留跳跃连接(IRSC),以保留编码器特征的空间位置信息,并与解码器特征实现准确融合。我们在四个主流医学图像分割数据集上验证了SegNetr的有效性:与原始U-Net相比,参数量和GFLOPs分别减少了59%和76%,同时取得了与最先进方法相当的分割性能。尤其值得一提的是,本文提出的组件可以应用到其他U形网络上,以提高它们的分割性能。
When No-Rejection Learning is Optimal for Regression with Rejection
results: 研究发现,扩大预测器的函数类可以缓解文献中观察到的无拒绝学习策略的次优性;引入截断损失后,可以比联合学习更容易地为预测器单独建立一致的替代损失性质,由此支持先学习预测器、再校准拒绝器的两步学习流程。Abstract
Learning with rejection is a prototypical model for studying the interaction between humans and AI on prediction tasks. The model has two components, a predictor and a rejector. Upon the arrival of a sample, the rejector first decides whether to accept it; if accepted, the predictor fulfills the prediction task, and if rejected, the prediction will be deferred to humans. The learning problem requires learning a predictor and a rejector simultaneously. This changes the structure of the conventional loss function and often results in non-convexity and inconsistency issues. For the classification with rejection problem, several works develop surrogate losses for the jointly learning with provable consistency guarantees; in parallel, there has been less work for the regression counterpart. We study the regression with rejection (RwR) problem and investigate the no-rejection learning strategy which treats the RwR problem as a standard regression task to learn the predictor. We establish that the suboptimality of the no-rejection learning strategy observed in the literature can be mitigated by enlarging the function class of the predictor. Then we introduce the truncated loss to single out the learning for the predictor and we show that a consistent surrogate property can be established for the predictor individually in an easier way than for the predictor and the rejector jointly. Our findings advocate for a two-step learning procedure that first uses all the data to learn the predictor and then calibrates the prediction loss for the rejector. It is better aligned with the common intuition that more data samples will lead to a better predictor and it calls for more efforts on a better design of calibration algorithms for learning the rejector. While our discussions mainly focus on the regression problem, the theoretical results and insights generalize to the classification problem as well.
摘要
带拒绝的学习是研究人类与AI在预测任务上交互的一种典型模型。该模型包括两个部分:预测器和拒绝器。当样本到达时,拒绝器首先决定是否接受它;如果被接受,预测器完成预测任务;如果被拒绝,预测将交由人类处理。学习问题要求同时学习预测器和拒绝器,这会改变传统损失函数的结构,并常常带来非凸性和不一致性问题。对于带拒绝的分类问题,已有一些工作为联合学习设计了具有可证明一致性保证的替代损失;相比之下,针对回归问题的研究较少。我们研究了带拒绝的回归(RwR)问题,并考察无拒绝学习策略,即把RwR问题当作标准回归任务来学习预测器。我们证明,文献中观察到的无拒绝学习策略的次优性可以通过扩大预测器的函数类来缓解。随后,我们引入截断损失以单独处理预测器的学习,并证明相比联合学习预测器和拒绝器,可以更容易地为预测器单独建立一致的替代损失性质。我们的发现支持一种两步学习流程:先用全部数据学习预测器,再为拒绝器校准预测损失。这更符合"更多数据样本会带来更好预测器"的普遍直觉,同时也呼吁在拒绝器学习的校准算法设计上投入更多努力。虽然我们的讨论主要聚焦于回归问题,但理论结果和见解同样适用于分类问题。
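A hedged sketch of the advocated two-step procedure follows (our simplification, with a stand-in loss-predicting rejector): first fit the predictor on all data as in no-rejection learning, then fit and calibrate a rejector on held-out losses so the hardest fraction of inputs is deferred.

```python
# Hedged sketch: two-step regression with rejection.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=2000)
hard = X[:, 0] > 1.0                                  # a feature-dependent hard subgroup
y[hard] += rng.normal(scale=3.0, size=hard.sum())

X_tr, y_tr, X_cal, y_cal = X[:1500], y[:1500], X[1500:], y[1500:]

predictor = Ridge().fit(X_tr, y_tr)                   # step 1: use all training data

# step 2: rejector regresses onto the predictor's squared loss, then calibrate
cal_loss = (predictor.predict(X_cal) - y_cal) ** 2
rejector = Ridge().fit(X_cal, cal_loss)
tau = np.quantile(rejector.predict(X_cal), 0.9)       # defer the worst ~10%

def predict_or_defer(x):
    x = x.reshape(1, -1)
    if rejector.predict(x)[0] > tau:
        return None                                   # deferred to a human
    return predictor.predict(x)[0]

print(predict_or_defer(X[0]))
```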
A Real-time Human Pose Estimation Approach for Optimal Sensor Placement in Sensor-based Human Activity Recognition
results: 研究发现,视觉基于的传感器位置选择方法可以与传统深度学习方法相比,表现相似,证明了其有效性。Abstract
Sensor-based Human Activity Recognition facilitates unobtrusive monitoring of human movements. However, determining the most effective sensor placement for optimal classification performance remains challenging. This paper introduces a novel methodology to resolve this issue, using real-time 2D pose estimations derived from video recordings of target activities. The derived skeleton data provides a unique strategy for identifying the optimal sensor location. We validate our approach through a feasibility study, applying inertial sensors to monitor 13 different activities across ten subjects. Our findings indicate that the vision-based method for sensor placement offers comparable results to the conventional deep learning approach, demonstrating its efficacy. This research significantly advances the field of Human Activity Recognition by providing a lightweight, on-device solution for determining the optimal sensor placement, thereby enhancing data anonymization and supporting a multimodal classification approach.
摘要
基于传感器的人体活动识别可以不显眼地监测人体活动。然而,确定最有效的传感器位置以获得最优分类性能仍然是一个挑战。这篇论文提出了一种新方法来解决这个问题:利用从目标活动视频记录中得到的实时二维姿态估计。由此得到的骨架数据为确定最佳传感器位置提供了一种独特的策略。我们通过一项可行性研究验证了该方法:使用惯性传感器对10名受试者的13种不同活动进行监测。我们的发现表明,这种基于视觉的传感器位置选择方法可以取得与传统深度学习方法相当的结果,证明了其有效性。这项研究为人体活动识别领域做出了重要贡献:提供了一种轻量级、可在设备端运行的最佳传感器位置确定方案,从而增强数据匿名化并支持多模态分类方法。
PUFFIN: A Path-Unifying Feed-Forward Interfaced Network for Vapor Pressure Prediction
paper_authors: Vinicius Viena Santana, Carine Menezes Rebello, Luana P. Queiroz, Ana Mafalda Ribeiro, Nadia Shardt, Idelfonso B. R. Nogueira
for: 提高化学品蒸气压预测的精度,以满足工业和环境应用需求。
methods: 提出了一种基于机器学习的框架PUFFIN(Path-Unifying Feed-Forward Interfaced Network),结合迁移学习与受领域知识(Antoine方程)启发的归纳偏置节点,以改进蒸气压预测。
results: PUFFIN的性能优于不使用归纳偏置或使用通用化合物描述符的其他策略。该框架通过引入领域专业知识来克服数据匮乏的限制,有望推广到更广泛的化合物性质预测任务。Abstract
Accurately predicting vapor pressure is vital for various industrial and environmental applications. However, obtaining accurate measurements for all compounds of interest is not possible due to the resource and labor intensity of experiments. The demand for resources and labor further multiplies when a temperature-dependent relationship for predicting vapor pressure is desired. In this paper, we propose PUFFIN (Path-Unifying Feed-Forward Interfaced Network), a machine learning framework that combines transfer learning with a new inductive bias node inspired by domain knowledge (the Antoine equation) to improve vapor pressure prediction. By leveraging inductive bias and transfer learning using graph embeddings, PUFFIN outperforms alternative strategies that do not use inductive bias or that use generic descriptors of compounds. The framework's incorporation of domain-specific knowledge to overcome the limitation of poor data availability shows its potential for broader applications in chemical compound analysis, including the prediction of other physicochemical properties. Importantly, our proposed machine learning framework is partially interpretable, because the inductive Antoine node yields network-derived Antoine equation coefficients. It would then be possible to directly incorporate the obtained analytical expression in process design software for better prediction and control of processes occurring in industry and the environment.
摘要
准确预测蒸气压对各种工业和环境应用至关重要。然而,由于实验需要耗费大量资源和人力,不可能为所有感兴趣的化合物都获得准确测量;当需要温度相关的蒸气压预测关系时,对资源和人力的需求会进一步成倍增加。在这篇论文中,我们提出了PUFFIN(Path-Unifying Feed-Forward Interfaced Network)机器学习框架,它将迁移学习与受领域知识(Antoine方程)启发的新型归纳偏置节点相结合,以改进蒸气压预测。通过利用归纳偏置以及基于图嵌入的迁移学习,PUFFIN的表现优于不使用归纳偏置或仅使用通用化合物描述符的其他策略。该框架通过引入领域专业知识来克服数据匮乏的限制,显示了其在化合物分析(包括其他物理化学性质预测)中更广泛应用的潜力。重要的是,我们提出的机器学习框架具有部分可解释性:归纳Antoine节点会产生由网络导出的Antoine方程系数,因此可以把得到的解析表达式直接嵌入流程设计软件,以便更好地预测和控制工业与环境中的过程。
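The inductive-bias idea can be sketched as follows: a small network maps a compound embedding to Antoine coefficients (A, B, C), and vapor pressure follows the closed-form Antoine law log10(P) = A - B/(T + C). Layer sizes and the embedding source are our assumptions, not PUFFIN's exact architecture.

```python
# Hedged sketch: an Antoine-style inductive bias node.
import torch, torch.nn as nn

class AntoineNode(nn.Module):
    def __init__(self, emb_dim=32):
        super().__init__()
        self.coeffs = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(),
                                    nn.Linear(64, 3))      # -> (A, B, C)

    def forward(self, compound_emb, temperature):
        A, B, C = self.coeffs(compound_emb).unbind(-1)
        log10_p = A - B / (temperature + C)                # Antoine equation
        return log10_p, (A, B, C)                          # coefficients stay interpretable

node = AntoineNode()
emb = torch.randn(4, 32)                                   # e.g. graph embeddings
T = torch.tensor([300.0, 310.0, 320.0, 330.0])
log10_p, (A, B, C) = node(emb, T)
print(log10_p.shape)                                       # torch.Size([4])
```

Because the node outputs explicit (A, B, C), the fitted analytical expression can, as the abstract notes, be exported into process design software directly.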
Free Bits: Latency Optimization of Mixed-Precision Quantized Neural Networks on the Edge
results: 这篇论文在MobileNetV1和MobileNetV2上进行了评估,并将结果部署到一系列具有不同硬件特性的多核RISC-V微控制器平台上。在1000类ImageNet数据集上,相比8位模型实现了最多28.6%的端到端延迟减少,而相对全精度基线的精度损失可以忽略不计;即使在不支持子字节运算的硬件上,也相对8位基线取得了加速。此外,论文还证明了其方法优于以减少二进制运算次数作为延迟代理的可微分搜索。Abstract
Mixed-precision quantization, where a deep neural network's layers are quantized to different precisions, offers the opportunity to optimize the trade-offs between model size, latency, and statistical accuracy beyond what can be achieved with homogeneous-bit-width quantization. To navigate the intractable search space of mixed-precision configurations for a given network, this paper proposes a hybrid search methodology. It consists of a hardware-agnostic differentiable search algorithm followed by a hardware-aware heuristic optimization to find mixed-precision configurations latency-optimized for a specific hardware target. We evaluate our algorithm on MobileNetV1 and MobileNetV2 and deploy the resulting networks on a family of multi-core RISC-V microcontroller platforms with different hardware characteristics. We achieve up to 28.6% reduction of end-to-end latency compared to an 8-bit model at a negligible accuracy drop from a full-precision baseline on the 1000-class ImageNet dataset. We demonstrate speedups relative to an 8-bit baseline, even on systems with no hardware support for sub-byte arithmetic at negligible accuracy drop. Furthermore, we show the superiority of our approach with respect to differentiable search targeting reduced binary operation counts as a proxy for latency.
摘要
混合精度量化将深度神经网络的各层量化到不同的精度,能够在模型大小、延迟和统计准确率之间取得超越同构位宽量化的权衡。为了在给定网络难以穷举的混合精度配置空间中进行搜索,这篇论文提出了一种混合搜索方法:先用与硬件无关的可微分搜索算法,再用硬件感知的启发式优化,为特定硬件目标找到延迟最优的混合精度配置。我们在MobileNetV1和MobileNetV2上评估了该算法,并将得到的网络部署到一系列具有不同硬件特性的多核RISC-V微控制器平台上。在1000类ImageNet数据集上,我们相比8位模型实现了最多28.6%的端到端延迟减少,而相对全精度基线的精度损失可以忽略不计。即使在不支持子字节运算的硬件系统上,我们也相对8位基线取得了加速。此外,我们还证明了该方法优于以减少二进制运算次数作为延迟代理的可微分搜索。
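As a stand-in for the paper's hybrid differentiable-plus-heuristic search, the sketch below shows a greedy sensitivity-based heuristic: start all layers at 8 bits and repeatedly remove one bit from the layer whose quantization error grows the least, until a size budget is met.

```python
# Hedged sketch: greedy per-layer bit-width assignment under a size budget.
import numpy as np

rng = np.random.default_rng(0)
layers = {f"conv{i}": rng.standard_normal(rng.integers(1_000, 20_000))
          for i in range(6)}

def quant_mse(w, bits):
    step = (w.max() - w.min()) / (2 ** bits - 1)
    return np.mean((w - np.round(w / step) * step) ** 2)

bits = {name: 8 for name in layers}                   # start at 8-bit everywhere
budget = int(0.6 * sum(8 * w.size for w in layers.values()))

while sum(bits[n] * layers[n].size for n in layers) > budget:
    # drop one bit where the extra quantization error is smallest
    cand = {n: quant_mse(layers[n], bits[n] - 1) - quant_mse(layers[n], bits[n])
            for n in layers if bits[n] > 2}
    victim = min(cand, key=cand.get)
    bits[victim] -= 1

print(bits)
```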
BaBE: Enhancing Fairness via Estimation of Latent Explaining Variables
for: solves the problem of unfair discrimination between two groups by proposing a pre-processing method to achieve fairness.
methods: uses Bayesian Bias Elimination (BaBE), a combination of Bayes inference and the Expectation-Maximization method, to estimate the most likely value of the latent explanatory variable E for each group.
results: shows good fairness and high accuracy in experiments on synthetic and real data sets.Abstract
We consider the problem of unfair discrimination between two groups and propose a pre-processing method to achieve fairness. Corrective methods like statistical parity usually lead to poor accuracy and do not really achieve fairness in situations where there is a correlation between the sensitive attribute S and the legitimate attribute E (explanatory variable) that should determine the decision. To overcome these drawbacks, other notions of fairness have been proposed, in particular, conditional statistical parity and equal opportunity. However, E is often not directly observable in the data, i.e., it is a latent variable. We may observe some other variable Z representing E, but the problem is that Z may also be affected by S, hence Z itself can be biased. To deal with this problem, we propose BaBE (Bayesian Bias Elimination), an approach based on a combination of Bayes inference and the Expectation-Maximization method, to estimate the most likely value of E for a given Z for each group. The decision can then be based directly on the estimated E. We show, by experiments on synthetic and real data sets, that our approach provides a good level of fairness as well as high accuracy.
摘要
我们考虑两个群体之间的不公正歧视问题,并提出一种预处理方法来实现公平。统计均等之类的校正方法通常会导致准确率变差,而且当敏感属性S与本应决定结果的合法属性E(解释变量)之间存在相关性时,这些方法并不能真正实现公平。为了克服这些缺点,人们提出了其他公平概念,特别是条件统计均等(conditional statistical parity)和机会均等(equal opportunity)。然而,E通常无法在数据中直接观察到,即它是一个潜变量。我们可能观察到代表E的另一个变量Z,但问题在于Z也可能受S影响,因此Z本身可能带有偏差。为解决这个问题,我们提出了BaBE(Bayesian Bias Elimination,贝叶斯偏差消除)方法:它结合贝叶斯推断与期望最大化(EM)方法,为每个群体估计给定Z时E的最可能取值,随后的决策可以直接基于估计出的E。我们通过在合成和真实数据集上的实验表明,我们的方法能同时实现良好的公平性和较高的准确率。
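A hedged sketch of the BaBE idea follows (a stand-in using a two-component Gaussian mixture fitted by EM, not the paper's exact routine): per group, infer the most likely latent explanatory value E behind the observed, possibly biased proxy Z, and base decisions on E.

```python
# Hedged sketch: per-group EM inference of a binary latent E from proxy Z.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def estimate_E(Z_group):
    gm = GaussianMixture(n_components=2, random_state=0).fit(Z_group.reshape(-1, 1))
    # order components so E=1 means "higher underlying qualification"
    order = np.argsort(gm.means_.ravel())
    post = gm.predict_proba(Z_group.reshape(-1, 1))[:, order]
    return post.argmax(axis=1)               # most likely E per individual

# Z is biased differently per group (e.g. scores shifted down for group B)
Z_a = np.concatenate([rng.normal(0, 1, 500), rng.normal(3, 1, 500)])
Z_b = np.concatenate([rng.normal(-1, 1, 500), rng.normal(2, 1, 500)])
E_a, E_b = estimate_E(Z_a), estimate_E(Z_b)   # decisions can now use E, not Z
print(E_a.mean(), E_b.mean())
```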
Learning to Solve Tasks with Exploring Prior Behaviours
results: 在三个导航任务和一个机器人操作任务中,表现优于其他基线。Abstract
Demonstrations are widely used in Deep Reinforcement Learning (DRL) for facilitating solving tasks with sparse rewards. However, the tasks in real-world scenarios can often have varied initial conditions from the demonstration, which would require additional prior behaviours. For example, consider we are given the demonstration for the task of \emph{picking up an object from an open drawer}, but the drawer is closed in the training. Without acquiring the prior behaviours of opening the drawer, the robot is unlikely to solve the task. To address this, in this paper we propose an Intrinsic Rewards Driven Example-based Control \textbf{(IRDEC)}. Our method can endow agents with the ability to explore and acquire the required prior behaviours and then connect to the task-specific behaviours in the demonstration to solve sparse-reward tasks without requiring additional demonstration of the prior behaviours. The performance of our method outperforms other baselines on three navigation tasks and one robotic manipulation task with sparse rewards. Codes are available at https://github.com/Ricky-Zhu/IRDEC.
摘要
深度强化学习(DRL)广泛使用演示来帮助解决奖励稀疏的任务。然而,现实场景中任务的初始条件往往与演示不同,因此需要额外的先行行为。例如,假设我们获得了"从打开的抽屉中拿取物品"这一任务的演示,但训练时抽屉是关闭的。如果不具备打开抽屉的先行行为,机器人就很难完成任务。为此,本文提出了内在奖励驱动的示例控制方法(IRDEC)。我们的方法能让智能体探索并获得所需的先行行为,再将其与演示中的任务特定行为衔接起来,从而在无需额外演示先行行为的情况下解决奖励稀疏的任务。我们的方法在三个导航任务和一个机器人操作任务上的表现优于其他基线。代码可以在 https://github.com/Ricky-Zhu/IRDEC 上获取。
Sample-Efficient Learning of POMDPs with Multiple Observations In Hindsight
results: 证明在这种反馈模型下可以实现样本高效学习,并且该结果覆盖两类新的POMDP子类:多观察揭示POMDP(multi-observation revealing POMDPs)和可区分POMDP(distinguishable POMDPs)。这两个子类都是揭示POMDP(revealing POMDPs)的推广和放松;特别地,可区分POMDP只要求不同潜状态的发射分布互不相同,而无需线性独立。Abstract
This paper studies the sample-efficiency of learning in Partially Observable Markov Decision Processes (POMDPs), a challenging problem in reinforcement learning that is known to be exponentially hard in the worst-case. Motivated by real-world settings such as loading in game playing, we propose an enhanced feedback model called ``multiple observations in hindsight'', where after each episode of interaction with the POMDP, the learner may collect multiple additional observations emitted from the encountered latent states, but may not observe the latent states themselves. We show that sample-efficient learning under this feedback model is possible for two new subclasses of POMDPs: \emph{multi-observation revealing POMDPs} and \emph{distinguishable POMDPs}. Both subclasses generalize and substantially relax \emph{revealing POMDPs} -- a widely studied subclass for which sample-efficient learning is possible under standard trajectory feedback. Notably, distinguishable POMDPs only require the emission distributions from different latent states to be \emph{different} instead of \emph{linearly independent} as required in revealing POMDPs.
A Machine-Learned Ranking Algorithm for Dynamic and Personalised Car Pooling Services
paper_authors: Mattia Giovanni Campana, Franca Delmastro, Raffaele Bruno
For: 降低城市交通拥堵和污染。* Methods: 使用学习排序技术自动为每个用户生成个性化选择模型,并利用这些模型向乘客和司机推荐合适的共乘机会。* Results: 实验结果表明,所提解决方案能够在静态和动态条件下快速而准确地预测用户的个性化选择模型。Abstract
Car pooling is expected to significantly help in reducing traffic congestion and pollution in cities by enabling drivers to share their cars with travellers with similar itineraries and time schedules. A number of car pooling matching services have been designed in order to efficiently find successful ride matches in a given pool of drivers and potential passengers. However, it is now recognised that many non-monetary aspects and social considerations, besides simple mobility needs, may influence the individual willingness of sharing a ride, which are difficult to predict. To address this problem, in this study we propose GoTogether, a recommender system for car pooling services that leverages learning-to-rank techniques to automatically derive the personalised ranking model of each user from the history of her choices (i.e., the type of accepted or rejected shared rides). Then, GoTogether builds the list of recommended rides in order to maximise the success rate of the offered matches. To test the performance of our scheme we use real data from Twitter and Foursquare sources in order to generate a dataset of plausible mobility patterns and ride requests in a metropolitan area. The results show that the proposed solution quickly obtains an accurate prediction of the personalised user's choice model both in static and dynamic conditions.
摘要
汽车共乘有望显著帮助缓解城市交通拥堵和污染:它让司机与行程和时间安排相似的旅行者共享车辆。为了在给定的司机和潜在乘客池中高效找到成功的共乘匹配,人们已经设计了许多共乘匹配服务。然而,如今人们认识到,除了单纯的出行需求之外,许多非金钱因素和社会考量也会影响个人共乘的意愿,而这些因素难以预测。为解决这个问题,本研究提出了GoTogether,一个面向共乘服务的推荐系统:它利用学习排序(learning-to-rank)技术,从用户的历史选择(即接受或拒绝的共乘类型)中自动推导每个用户的个性化排序模型。然后,GoTogether据此构建推荐共乘列表,以最大化匹配成功率。为了检验方案的性能,我们使用来自Twitter和Foursquare的真实数据,生成了一个大都市区内可信的出行模式和共乘请求数据集。结果显示,所提方案在静态和动态条件下都能快速获得对用户个性化选择模型的准确预测。
Towards a safe MLOps Process for the Continuous Development and Safety Assurance of ML-based Systems in the Railway Domain
results: 本文提出了一种安全的MLOps过程,支持铁路领域基于ML的系统的持续开发和安全保障,有助于提高系统的可复现性、可追溯性和持续适应能力。Abstract
Traditional automation technologies alone are not sufficient to enable driverless operation of trains (called Grade of Automation (GoA) 4) on non-restricted infrastructure. The required perception tasks are nowadays realized using Machine Learning (ML) and thus need to be developed and deployed reliably and efficiently. One important aspect to achieve this is to use an MLOps process for tackling improved reproducibility, traceability, collaboration, and continuous adaptation of a driverless operation to changing conditions. MLOps mixes ML application development and operation (Ops) and enables high frequency software releases and continuous innovation based on the feedback from operations. In this paper, we outline a safe MLOps process for the continuous development and safety assurance of ML-based systems in the railway domain. It integrates system engineering, safety assurance, and the ML life-cycle in a comprehensive workflow. We present the individual stages of the process and their interactions. Moreover, we describe relevant challenges to automate the different stages of the safe MLOps process.
摘要
仅靠传统自动化技术无法在不受限制的基础设施上实现列车的无人驾驶运行(称为4级自动化,GoA 4)。所需的感知任务如今通常借助机器学习(ML)实现,因此需要可靠且高效地开发与部署。实现这一点的一个重要手段是采用MLOps过程,以改进可复现性、可追溯性、协作,并使无人驾驶运行能够持续适应变化的条件。MLOps将ML应用开发与运维(Ops)结合起来,支持高频率的软件发布,并基于运维反馈实现持续创新。在这篇论文中,我们概述了一种安全的MLOps过程,用于铁路领域基于ML的系统的持续开发和安全保障。它在一个完整的工作流中整合了系统工程、安全保障和ML生命周期。我们介绍了该过程的各个阶段及其相互作用,并描述了将安全MLOps过程各阶段自动化所面临的相关挑战。
PLIERS: a Popularity-Based Recommender System for Content Dissemination in Online Social Networks
results: 实验结果表明,PLIERS能够比现有解决方案更好地在算法复杂度与推荐个性化程度之间取得平衡,同时提供更具个性化、相关性和新颖性的推荐。Abstract
In this paper, we propose a novel tag-based recommender system called PLIERS, which relies on the assumption that users are mainly interested in items and tags with similar popularity to those they already own. PLIERS is aimed at reaching a good tradeoff between algorithmic complexity and the level of personalization of recommended items. To evaluate PLIERS, we performed a set of experiments on real OSN datasets, demonstrating that it outperforms state-of-the-art solutions in terms of personalization, relevance, and novelty of recommendations.
摘要
在这篇论文中,我们提出了一种新的基于标签的推荐系统PLIERS。它基于这样一个假设:用户主要对与其已拥有的物品和标签流行度相近的物品和标签感兴趣。PLIERS旨在在算法复杂度与推荐个性化程度之间取得良好平衡。为了评估PLIERS,我们在真实在线社交网络数据集上进行了一系列实验,结果证明它在推荐的个性化、相关性和新颖性方面优于现有最佳方案。
Provably Efficient Iterated CVaR Reinforcement Learning with Function Approximation
For: 本研究旨在优化策略,在每个决策步骤上保证安全性。* Methods: 本文提出了一种新的风险敏感强化学习形式,采用迭代条件风险价值(CVaR)目标,并给出了基于线性和通用函数近似的算法。* Results: 所提算法ICVaR-L和ICVaR-G可以在不同的维度和回合数下取得可证明的遗憾界(regret),并带来若干新技术,如CVaR算子的高效近似、带CVaR适应特征的岭回归,以及改进的椭球位势引理。Abstract
Risk-sensitive reinforcement learning (RL) aims to optimize policies that balance the expected reward and risk. In this paper, we investigate a novel risk-sensitive RL formulation with an Iterated Conditional Value-at-Risk (CVaR) objective under linear and general function approximations. This new formulation, named ICVaR-RL with function approximation, provides a principled way to guarantee safety at each decision step. For ICVaR-RL with linear function approximation, we propose a computationally efficient algorithm ICVaR-L, which achieves an $\widetilde{O}(\sqrt{\alpha^{-(H+1)}(d^2H^4+dH^6)K})$ regret, where $\alpha$ is the risk level, $d$ is the dimension of state-action features, $H$ is the length of each episode, and $K$ is the number of episodes. We also establish a matching lower bound $\Omega(\sqrt{\alpha^{-(H-1)}d^2K})$ to validate the optimality of ICVaR-L with respect to $d$ and $K$. For ICVaR-RL with general function approximation, we propose algorithm ICVaR-G, which achieves an $\widetilde{O}(\sqrt{\alpha^{-(H+1)}DH^4K})$ regret, where $D$ is a dimensional parameter that depends on the eluder dimension and covering number. Furthermore, our analysis provides several novel techniques for risk-sensitive RL, including an efficient approximation of the CVaR operator, a new ridge regression with CVaR-adapted features, and a refined elliptical potential lemma.
摘要
风险敏感强化学习(RL)旨在优化在期望奖励与风险之间取得平衡的策略。在这篇论文中,我们研究了一种新的风险敏感RL形式,即在线性和通用函数近似下采用迭代条件风险价值(CVaR)目标。这种新形式被称为带函数近似的ICVaR-RL,它提供了一种在每个决策步骤上保证安全性的原则性方式。对于采用线性函数近似的ICVaR-RL,我们提出了计算高效的算法ICVaR-L,其遗憾界为 $\widetilde{O}(\sqrt{\alpha^{-(H+1)}(d^2H^4+dH^6)K})$,其中 $\alpha$ 是风险水平,$d$ 是状态-动作特征维度,$H$ 是每回合长度,$K$ 是回合数。我们还建立了匹配的下界 $\Omega(\sqrt{\alpha^{-(H-1)}d^2K})$,以验证ICVaR-L关于 $d$ 和 $K$ 的最优性。对于采用通用函数近似的ICVaR-RL,我们提出了算法ICVaR-G,其遗憾界为 $\widetilde{O}(\sqrt{\alpha^{-(H+1)}DH^4K})$,其中 $D$ 是取决于eluder维度和覆盖数的维度参数。此外,我们的分析还提供了若干风险敏感RL的新技术,包括CVaR算子的高效近似、带CVaR适应特征的岭回归,以及改进的椭球位势引理。
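For readers unfamiliar with the objective, the empirical Conditional Value-at-Risk at level α is simply the mean of the worst α-fraction of outcomes; a minimal sketch follows.

```python
# Sketch: empirical CVaR at level alpha (lower tail = risk).
import numpy as np

def cvar(returns, alpha):
    """Mean of the lowest alpha-fraction of `returns`."""
    returns = np.sort(np.asarray(returns))
    k = max(1, int(np.ceil(alpha * returns.size)))
    return returns[:k].mean()

rng = np.random.default_rng(0)
episodic_returns = rng.normal(loc=1.0, scale=2.0, size=10_000)
print("mean    :", episodic_returns.mean())
print("CVaR@0.1:", cvar(episodic_returns, 0.1))   # focuses on the worst 10%
```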
results: 该论文的实验结果表明,PCIL在DeepMind Control套件上取得了最先进的性能;定性结果还表明,PCIL为模仿学习构建了一个更平滑、更有意义的表征空间。Abstract
Adversarial imitation learning (AIL) is a popular method that has recently achieved much success. However, the performance of AIL is still unsatisfactory on the more challenging tasks. We find that one of the major reasons is due to the low quality of AIL discriminator representation. Since the AIL discriminator is trained via binary classification that does not necessarily discriminate the policy from the expert in a meaningful way, the resulting reward might not be meaningful either. We propose a new method called Policy Contrastive Imitation Learning (PCIL) to resolve this issue. PCIL learns a contrastive representation space by anchoring on different policies and generates a smooth cosine-similarity-based reward. Our proposed representation learning objective can be viewed as a stronger version of the AIL objective and provide a more meaningful comparison between the agent and the policy. From a theoretical perspective, we show the validity of our method using the apprenticeship learning framework. Furthermore, our empirical evaluation on the DeepMind Control suite demonstrates that PCIL can achieve state-of-the-art performance. Finally, qualitative results suggest that PCIL builds a smoother and more meaningful representation space for imitation learning.
摘要
对抗模仿学习(AIL)是一种流行的方法,最近取得了很大成功。然而,AIL在更具挑战性的任务上的表现仍不尽如人意。我们发现其中一个主要原因是AIL判别器的表征质量较低:由于AIL判别器通过二元分类训练,而二元分类并不一定能以有意义的方式区分策略与专家,因此得到的奖励也可能缺乏意义。为解决这个问题,我们提出了一种新方法,称为策略对比模仿学习(PCIL)。PCIL以不同策略为锚点学习一个对比表征空间,并生成基于余弦相似度的平滑奖励。我们提出的表征学习目标可以视为AIL目标的更强版本,能够在智能体与策略之间提供更有意义的比较。从理论角度,我们利用学徒学习(apprenticeship learning)框架证明了方法的有效性。此外,我们在DeepMind Control套件上的实证评估表明,PCIL能够取得最先进的性能。最后,定性结果表明,PCIL为模仿学习构建了一个更平滑、更有意义的表征空间。
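A minimal sketch of a PCIL-style smooth reward follows (the encoder below is an untrained stand-in; the paper learns it contrastively with policies as anchors): embed states and reward the agent by cosine similarity to its best-matching expert embedding.

```python
# Hedged sketch: cosine-similarity reward over learned state embeddings.
import torch, torch.nn.functional as F

encoder = torch.nn.Sequential(torch.nn.Linear(17, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 32))

def cosine_reward(agent_states, expert_states):
    za = F.normalize(encoder(agent_states), dim=-1)
    ze = F.normalize(encoder(expert_states), dim=-1)
    sim = za @ ze.T                        # (n_agent, n_expert) cosine matrix
    return sim.max(dim=1).values           # smooth reward in [-1, 1]

agent = torch.randn(5, 17)                 # e.g. MuJoCo-style observations
expert = torch.randn(100, 17)
print(cosine_reward(agent, expert))
```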
Sampling-based Fast Gradient Rescaling Method for Highly Transferable Adversarial Attacks
results: 提出一种基于采样的快速梯度重缩放方法(S-FGRM),可以减少梯度更新的误差并提高黑盒攻击的可迁移性。该方法通过数据重缩放替代sign函数,不需要额外计算成本,并提出了深度优先采样(Depth First Sampling)方法来消除重缩放带来的波动、稳定梯度更新。该方法可用于任何基于梯度的攻击,并可与各种输入变换或集成方法相结合,进一步提升对抗样本的可迁移性。在标准ImageNet数据集上的大量实验表明,该方法显著提升了攻击的可迁移性,超过了最先进的基线。Abstract
Deep neural networks are known to be vulnerable to adversarial examples crafted by adding human-imperceptible perturbations to the benign input. After achieving nearly 100% attack success rates in white-box setting, more focus is shifted to black-box attacks, of which the transferability of adversarial examples has gained significant attention. In either case, the common gradient-based methods generally use the sign function to generate perturbations on the gradient update, that offers a roughly correct direction and has gained great success. But little work pays attention to its possible limitation. In this work, we observe that the deviation between the original gradient and the generated noise may lead to inaccurate gradient update estimation and suboptimal solutions for adversarial transferability. To this end, we propose a Sampling-based Fast Gradient Rescaling Method (S-FGRM). Specifically, we use data rescaling to substitute the sign function without extra computational cost. We further propose a Depth First Sampling method to eliminate the fluctuation of rescaling and stabilize the gradient update. Our method could be used in any gradient-based attacks and is extensible to be integrated with various input transformation or ensemble methods to further improve the adversarial transferability. Extensive experiments on the standard ImageNet dataset show that our method could significantly boost the transferability of gradient-based attacks and outperform the state-of-the-art baselines.
摘要
深度神经网络容易受到对抗样本的攻击:只需在良性输入上加入人眼难以察觉的扰动即可。在白盒设置下攻击成功率已接近100%之后,更多注意力转向了黑盒攻击,其中对抗样本的可迁移性备受关注。无论哪种情况,常见的基于梯度的方法一般使用sign函数在梯度更新时生成扰动,它提供了大致正确的方向并取得了巨大成功,但很少有工作关注其可能的局限。在这项工作中,我们观察到原始梯度与生成噪声之间的偏差可能导致梯度更新估计不准确,从而得到对可迁移性而言次优的解。为此,我们提出了基于采样的快速梯度重缩放方法(S-FGRM)。具体来说,我们用数据重缩放替代sign函数,且不引入额外计算成本;我们还提出了深度优先采样方法,以消除重缩放带来的波动并稳定梯度更新。我们的方法可用于任何基于梯度的攻击,并可与各种输入变换或集成方法结合,进一步提升对抗样本的可迁移性。在标准ImageNet数据集上的大量实验表明,我们的方法能显著提升基于梯度的攻击的可迁移性,并超越最先进的基线。
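To contrast the classic sign step with a rescaling step in the spirit of S-FGRM (our illustration, not the paper's exact rule): normalize each sample's gradient by its mean magnitude instead of taking its sign, so relative magnitude information is retained.

```python
# Hedged sketch: sign step vs a per-sample rescaled gradient step.
import torch

def sign_step(grad, eps):
    return eps * grad.sign()                          # classic FGSM direction

def rescaled_step(grad, eps):
    flat = grad.flatten(1)
    scale = flat.abs().mean(dim=1, keepdim=True)      # per-sample rescaling
    unit = (flat / (scale + 1e-12)).clamp(-1, 1)      # bounded like sign(.)
    return eps * unit.view_as(grad)

g = torch.randn(4, 3, 32, 32)                         # batch of image gradients
print(sign_step(g, 8 / 255).abs().max(), rescaled_step(g, 8 / 255).abs().max())
```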
Trends in Machine Learning and Electroencephalogram (EEG): A Review for Undergraduate Researchers
results: 本文对BCI研究进行了系统性的总结和分析,提供了最新的发现和探讨,并对未来的研究预测了一些有前途的方向。Abstract
This paper presents a systematic literature review on Brain-Computer Interfaces (BCIs) in the context of Machine Learning. Our focus is on Electroencephalography (EEG) research, highlighting the latest trends as of 2023. The objective is to provide undergraduate researchers with an accessible overview of the BCI field, covering tasks, algorithms, and datasets. By synthesizing recent findings, our aim is to offer a fundamental understanding of BCI research, identifying promising avenues for future investigations.
摘要
这篇论文对机器学习背景下的脑机接口(BCI)进行了系统性的文献综述。我们聚焦于脑电图(EEG)研究,强调截至2023年的最新趋势。其目标是为本科生研究者提供易于入门的BCI领域概述,涵盖任务、算法和数据集。通过综合近期研究成果,我们希望帮助读者建立对BCI研究的基础理解,并指出未来研究中有前景的方向。
CPDG: A Contrastive Pre-Training Method for Dynamic Graph Neural Networks
methods: 该研究提出了一种名为动态图神经网络对比预训练方法(CPDG)的方案,通过灵活的结构-时间子图采样器和结构-时间对比预训练方案,解决DGNN预训练中的泛化能力和长短期建模能力问题。
results: 在大规模的研究与真实工业动态图数据集上、三种迁移设置下,针对不同下游任务进行的大量实验显示,CPDG相比现有方法有显著提升。Abstract
Dynamic graph data mining has gained popularity in recent years due to the rich information contained in dynamic graphs and their widespread use in the real world. Despite the advances in dynamic graph neural networks (DGNNs), the rich information and diverse downstream tasks have posed significant difficulties for the practical application of DGNNs in industrial scenarios. To this end, in this paper, we propose to address them by pre-training and present the Contrastive Pre-Training Method for Dynamic Graph Neural Networks (CPDG). CPDG tackles the challenges of pre-training for DGNNs, including generalization capability and long-short term modeling capability, through a flexible structural-temporal subgraph sampler along with structural-temporal contrastive pre-training schemes. Extensive experiments conducted on both large-scale research and industrial dynamic graph datasets show that CPDG outperforms existing methods in dynamic graph pre-training for various downstream tasks under three transfer settings.
摘要
动态图数据挖掘近年来日益流行,这得益于动态图所蕴含的丰富信息及其在现实世界中的广泛应用。尽管动态图神经网络(DGNN)已取得进展,但丰富的信息和多样的下游任务仍给DGNN在工业场景中的实际应用带来了很大挑战。为了解决这些挑战,本文提出通过预训练来应对,并提出了面向动态图神经网络的对比预训练方法(CPDG)。CPDG借助灵活的结构-时间子图采样器以及结构-时间对比预训练方案,解决了DGNN预训练中的泛化能力和长短期建模能力问题。在大规模研究与工业动态图数据集上的大量实验表明,在三种迁移设置下,CPDG在多种下游任务上均优于现有方法。
results: 在2D和3D实验中,OLR-WA方法与使用整个数据集的静态批处理模型性能相当;对于变化的数据,用户可以通过设置权重,让OLR-WA更快地适应变化,或更强地抵抗变化。Abstract
Machine Learning requires a large amount of training data in order to build accurate models. Sometimes the data arrives over time, requiring significant storage space and recalculating the model to account for the new data. On-line learning addresses these issues by incrementally modifying the model as data is encountered, and then discarding the data. In this study we introduce a new online linear regression approach. Our approach combines newly arriving data with a previously existing model to create a new model. The introduced model, named OLR-WA (OnLine Regression with Weighted Average) uses user-defined weights to provide flexibility in the face of changing data to bias the results in favor of old or new data. We have conducted 2-D and 3-D experiments comparing OLR-WA to a static model using the entire data set. The results show that for consistent data, OLR-WA and the static batch model perform similarly and for varying data, the user can set the OLR-WA to adapt more quickly or to resist change.
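A hedged sketch of an OLR-WA-style update follows (the averaging rule is our simplification of the paper's weighted combination): fit a model on each incoming batch, then blend it with the existing model using a user-chosen weight that biases toward old or new data.

```python
# Hedged sketch: online linear regression via weighted model averaging.
import numpy as np

class OnlineWeightedRegression:
    def __init__(self, n_features, w_old=0.5):
        self.coef = np.zeros(n_features + 1)      # includes intercept
        self.w_old = w_old                         # bias toward old vs new data
        self.seen = 0

    def partial_fit(self, X, y):
        Xb = np.c_[np.ones(len(X)), X]
        new_coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)   # batch-only fit
        if self.seen == 0:
            self.coef = new_coef
        else:
            self.coef = self.w_old * self.coef + (1 - self.w_old) * new_coef
        self.seen += len(X)

    def predict(self, X):
        return np.c_[np.ones(len(X)), X] @ self.coef

rng = np.random.default_rng(0)
model = OnlineWeightedRegression(n_features=3, w_old=0.7)
for _ in range(10):                                # data arriving over time
    X = rng.normal(size=(100, 3))
    model.partial_fit(X, X @ np.array([1.0, -2.0, 0.5]) + 1.0)
print(np.round(model.coef, 2))                     # ~[1, 1, -2, 0.5]
```

Setting w_old closer to 1 resists change (favors old data); closer to 0 adapts quickly, matching the results line above.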
Few-Shot Personalized Saliency Prediction Using Tensor Regression for Preserving Structural Global Information
results: 实验结果表明,所提方法取得了比以往方法更高的预测精度。Abstract
This paper presents a few-shot personalized saliency prediction method using tensor-to-matrix regression to preserve the structural global information of personalized saliency maps (PSMs). In contrast to a general saliency map, a PSM has great potential since it indicates person-specific visual attention, which is useful for obtaining individual visual preferences from the heterogeneity of gazed areas. PSM prediction is needed to acquire the PSM for an unseen image, but it remains a challenging task due to the complexity of individual gaze patterns. To recognize individual gaze patterns from a limited amount of eye-tracking data, previous methods adopt the similarity of gaze tendency between persons. However, in these methods, the PSMs are vectorized for the prediction model, so the structural global information of the PSMs corresponding to the image is ignored. To automatically reveal the relationship between PSMs, we focus on a tensor-based regression model that can preserve the structural information of PSMs, and realize an improvement in prediction accuracy. In the experimental results, we confirm that the proposed method, including the tensor-based regression, outperforms comparative methods.
摘要
以往的方法利用人与人之间注视倾向的相似性来识别个体注视模式,但这些方法会将个性化显著图(PSM)向量化后再输入预测模型,从而忽略了PSM对应于图像的全局结构信息。本研究聚焦于能够保留PSM结构信息的张量回归模型,并展示了预测精度的提升。实验结果证实,包含张量回归的所提方法优于对比方法:通过自动揭示PSM之间的关系,我们的方法提高了个性化显著性预测的精度。
paper_authors: Nan Tang, Chenyu Yang, Ju Fan, Lei Cao
for: 提高生成AI的准确性和可靠性
methods: 通过对多模态数据湖(包括文本文件、表格和知识图谱)中底层数据的分析以及质量与一致性评估,确保生成式AI输出的正确性
results: 提高生成式AI的可靠性和透明度,促进更可信的决策。Abstract
Generative AI has made significant strides, yet concerns about the accuracy and reliability of its outputs continue to grow. Such inaccuracies can have serious consequences such as inaccurate decision-making, the spread of false information, privacy violations, legal liabilities, and more. Although efforts to address these risks are underway, including explainable AI and responsible AI practices such as transparency, privacy protection, bias mitigation, and social and environmental responsibility, misinformation caused by generative AI will remain a significant challenge. We propose that verifying the outputs of generative AI from a data management perspective is an emerging issue for generative AI. This involves analyzing the underlying data from multi-modal data lakes, including text files, tables, and knowledge graphs, and assessing its quality and consistency. By doing so, we can establish a stronger foundation for evaluating the outputs of generative AI models. Such an approach can ensure the correctness of generative AI, promote transparency, and enable decision-making with greater confidence. Our vision is to promote the development of verifiable generative AI and contribute to a more trustworthy and responsible use of AI.
The Role of Subgroup Separability in Group-Fair Medical Image Classification
results: 研究发现,深度分类器将个体划分为子群的能力在不同医学影像模态和受保护特征下差异很大,并且这一特性可以预测算法偏差。研究还表明,当模型在带有系统性偏见(如漏诊)的数据上训练时,子群可分性、子群差异与性能下降之间存在关联。这些发现为开发公平的医疗AI提供了重要的新视角。Abstract
We investigate performance disparities in deep classifiers. We find that the ability of classifiers to separate individuals into subgroups varies substantially across medical imaging modalities and protected characteristics; crucially, we show that this property is predictive of algorithmic bias. Through theoretical analysis and extensive empirical evaluation, we find a relationship between subgroup separability, subgroup disparities, and performance degradation when models are trained on data with systematic bias such as underdiagnosis. Our findings shed new light on the question of how models become biased, providing important insights for the development of fair medical imaging AI.
摘要
我们研究深度分类器中的性能差异。我们发现,分类器将个体划分为子群的能力在不同医学影像模态和受保护特征之间差异很大;关键的是,我们证明这一特性可以预测算法偏差。通过理论分析和大量实证评估,我们发现当模型在带有系统性偏见(如漏诊)的数据上训练时,子群可分性、子群差异与性能下降之间存在关联。我们的发现为"模型如何变得有偏"这一问题提供了新的视角,为开发公平的医学影像AI提供了重要启示。
Large Language Models Empowered Autonomous Edge AI for Connected Intelligence
results: 实验结果显示该系统可以准确理解用户需求,高效执行AI模型,并通过边缘联合学习生成高性能AI模型。Abstract
The evolution of wireless networks gravitates towards connected intelligence, a concept that envisions seamless interconnectivity among humans, objects, and intelligence in a hyper-connected cyber-physical world. Edge AI emerges as a promising solution to achieve connected intelligence by delivering high-quality, low-latency, and privacy-preserving AI services at the network edge. In this article, we introduce an autonomous edge AI system that automatically organizes, adapts, and optimizes itself to meet users' diverse requirements. The system employs a cloud-edge-client hierarchical architecture, where the large language model, i.e., Generative Pretrained Transformer (GPT), resides in the cloud, and other AI models are co-deployed on devices and edge servers. By leveraging the powerful abilities of GPT in language understanding, planning, and code generation, we present a versatile framework that efficiently coordinates edge AI models to cater to users' personal demands while automatically generating code to train new models via edge federated learning. Experimental results demonstrate the system's remarkable ability to accurately comprehend user demands, efficiently execute AI models with minimal cost, and effectively create high-performance AI models through federated learning.
摘要
无线网络的演化正朝着连接智能的方向发展:这一概念设想在超连接的信息物理世界中实现人、物与智能之间的无缝互联。边缘AI是实现连接智能的一种有前景的方案,它能在网络边缘提供高质量、低延迟且保护隐私的AI服务。在本文中,我们介绍了一个自主边缘AI系统,它能自动地组织、适应和优化自身,以满足用户的多样化需求。该系统采用云-边-端层次架构:大型语言模型,即生成式预训练Transformer(GPT),驻留在云端,其他AI模型则共同部署在设备和边缘服务器上。借助GPT在语言理解、规划和代码生成方面的强大能力,我们提出了一个多功能框架,能高效协调边缘AI模型以满足用户的个性化需求,同时自动生成代码,通过边缘联邦学习训练新模型。实验结果表明,该系统能准确理解用户需求、以极小的成本高效执行AI模型,并通过联邦学习有效地构建高性能AI模型。
Temporal Difference Learning for High-Dimensional PIDEs with Jumps
methods: 基于 temporal difference learning 的 Levy 过程集和对应的强化学习模型。
results: 在100维实验中相对误差达到 O(10^{-3}),在一维纯跳问题中达到 O(10^{-4})。此外,该方法还具有计算成本低、稳健性好的优点,适合处理具有不同形式和强度跳跃的问题。Abstract
In this paper, we propose a deep learning framework for solving high-dimensional partial integro-differential equations (PIDEs) based on the temporal difference learning. We introduce a set of Levy processes and construct a corresponding reinforcement learning model. To simulate the entire process, we use deep neural networks to represent the solutions and non-local terms of the equations. Subsequently, we train the networks using the temporal difference error, termination condition, and properties of the non-local terms as the loss function. The relative error of the method reaches O(10^{-3}) in 100-dimensional experiments and O(10^{-4}) in one-dimensional pure jump problems. Additionally, our method demonstrates the advantages of low computational cost and robustness, making it well-suited for addressing problems with different forms and intensities of jumps.
摘要
在这篇论文中,我们提出了一种基于时间差分学习的深度学习框架,用于求解高维偏积分-微分方程(PIDEs)。我们引入一组Lévy过程,并构建相应的强化学习模型。为了模拟整个过程,我们使用深度神经网络来表示方程的解和非局部项。随后,我们以时间差分误差、终止条件和非局部项的性质作为损失函数来训练网络。该方法在100维实验中的相对误差达到 O(10^{-3}),在一维纯跳问题中达到 O(10^{-4})。此外,我们的方法还具有计算成本低和稳健性好的优点,适合处理具有不同形式和强度跳跃的问题。
When Does Confidence-Based Cascade Deferral Suffice?
results: 本文发现,在某些情况下基于置信度的延迟(deferral)规则可能失效,而替代的延迟策略在这些情况下表现更好。本文首先给出了最优延迟规则的理论刻画,然后研究了事后(post-hoc)延迟机制,并证明它们在某些情况下可以大幅提升级联的延迟决策性能。Abstract
Cascades are a classical strategy to enable inference cost to vary adaptively across samples, wherein a sequence of classifiers are invoked in turn. A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction. One simple deferral rule employs the confidence of the current classifier, e.g., based on the maximum predicted softmax probability. Despite being oblivious to the structure of the cascade -- e.g., not modelling the errors of downstream models -- such confidence-based deferral often works remarkably well in practice. In this paper, we seek to better understand the conditions under which confidence-based deferral may fail, and when alternate deferral strategies can perform better. We first present a theoretical characterisation of the optimal deferral rule, which precisely characterises settings under which confidence-based deferral may suffer. We then study post-hoc deferral mechanisms, and demonstrate they can significantly improve upon confidence-based deferral in settings where (i) downstream models are specialists that only work well on a subset of inputs, (ii) samples are subject to label noise, and (iii) there is distribution shift between the train and test set.
摘要
级联(cascade)是一种让推理成本随样本自适应变化的经典策略:按顺序依次调用一系列分类器,由延迟规则决定是调用序列中的下一个分类器,还是终止并输出预测。一种简单的延迟规则使用当前分类器的置信度,例如基于最大的softmax预测概率。尽管这种基于置信度的延迟规则无视级联的结构(例如不对下游模型的误差建模),它在实践中往往表现得出奇地好。在这篇论文中,我们试图更好地理解基于置信度的延迟在什么条件下会失效,以及替代的延迟策略何时表现更好。我们首先给出最优延迟规则的理论刻画,准确描述了基于置信度的延迟可能受损的情形。随后我们研究事后延迟机制,并证明在以下情形中它们可以显著优于基于置信度的延迟:(i)下游模型是只在部分输入上表现良好的专家模型;(ii)样本带有标签噪声;(iii)训练集与测试集之间存在分布偏移。
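Confidence-based deferral is easy to state in code: invoke models in order and stop as soon as the current model's maximum softmax probability clears a threshold. The models below are random stand-ins.

```python
# Sketch: confidence-based cascade deferral with stand-in linear scorers.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cascade_predict(x, models, thresholds):
    """models: callables x -> logits; thresholds align with models[:-1]."""
    for model, tau in zip(models[:-1], thresholds):
        probs = softmax(model(x))
        if probs.max() >= tau:                # confident enough: stop here
            return probs.argmax(), model
    probs = softmax(models[-1](x))            # the last model always answers
    return probs.argmax(), models[-1]

rng = np.random.default_rng(0)
models = [lambda x, W=rng.normal(size=(10, 4)): W @ x for _ in range(3)]
pred, used = cascade_predict(rng.normal(size=4), models, thresholds=[0.9, 0.7])
print("prediction:", pred)
```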
Undecimated Wavelet Transform for Word Embedded Semantic Marginal Autoencoder in Security improvement and Denoising different Languages
results: 成功提高多语言安全性和数据质量Abstract
By combining the undecimated wavelet transform with a Word Embedded Semantic Marginal Autoencoder (WESMA), this research study provides a novel strategy for improving security measures and denoising multiple languages. The incorporation of these strategies is intended to address the issues of robustness, privacy, and multilingualism in data processing applications. The undecimated wavelet transform is used as a feature extraction tool to identify prominent language patterns and structural qualities in the input data. By employing this transform, the proposed system can capture significant information while preserving the temporal and geographical links within the data. This improves security measures by increasing the system's ability to detect abnormalities, discover hidden patterns, and distinguish between legitimate content and dangerous threats. The Word Embedded Semantic Marginal Autoencoder also functions as an intelligent framework for dimensionality and noise reduction. The autoencoder effectively learns the underlying semantics of the data and reduces noise components by exploiting word embeddings and semantic context. As a result, data quality and accuracy are increased in the following processing stages. The suggested methodology is tested using a diversified dataset that includes several languages and security scenarios. The experimental results show that the proposed approach is effective in attaining security enhancement and denoising capabilities across multiple languages. The system is robust in dealing with linguistic variances, producing consistent outcomes regardless of the language used. Furthermore, incorporating the undecimated wavelet transform considerably improves the system's ability to efficiently address complex security concerns.
摘要
本研究将非抽取小波变换(undecimated wavelet transform)与词嵌入语义边缘自编码器(WESMA)相结合,提供了一种改进安全措施并对多种语言进行去噪的新策略。引入这些策略旨在解决数据处理应用中的鲁棒性、隐私和多语言支持问题。非抽取小波变换被用作特征提取工具,以识别输入数据中显著的语言模式和结构特征。借助该变换,所提系统能够在保留数据内部时间与空间关联的同时捕捉重要信息,从而提升系统检测异常、发现隐藏模式以及区分合法内容与危险威胁的能力,进而增强安全措施。词嵌入语义边缘自编码器同时作为降维与去噪的智能框架:它利用词嵌入和语义上下文,有效学习数据的底层语义并削减噪声成分,从而提高后续处理阶段的数据质量和准确性。我们在包含多种语言和安全场景的多样化数据集上测试了所提方法。实验结果显示,该方法能够在多种语言上实现安全增强和去噪能力;系统对语言差异具有较强的鲁棒性,无论使用何种语言都能产生一致的结果。此外,引入非抽取小波变换显著提升了系统高效应对复杂安全问题的能力。
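A small sketch of the undecimated (stationary) wavelet transform as a feature extractor and denoiser follows, using PyWavelets on a toy 1-D signal standing in for an embedded-text feature sequence; the wavelet choice and threshold are illustrative.

```python
# Hedged sketch: stationary wavelet transform (SWT) features + soft-threshold denoising.
import numpy as np
import pywt

signal = np.sin(np.linspace(0, 8 * np.pi, 1024)) + 0.3 * np.random.randn(1024)
coeffs = pywt.swt(signal, wavelet="db4", level=3)   # undecimated: all same length
for i, (cA, cD) in enumerate(coeffs, 1):
    print(f"band {i}: approx {cA.shape}, detail {cD.shape}")

# simple denoising: soft-threshold detail coefficients, then reconstruct
thr = 0.2
den = [(cA, pywt.threshold(cD, thr, mode="soft")) for cA, cD in coeffs]
recon = pywt.iswt(den, wavelet="db4")
print(recon.shape)                                  # (1024,)
```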
results: 实验结果表明,新的参数选择方法可以提高density-based clustering算法DENCLUE的性能。Abstract
In modern-day industry, clustering algorithms are part of the daily routine of algorithm engineers. Although clustering algorithms experienced rapid growth before 2010, innovation related to the research topic has stagnated since deep learning became the de facto industrial standard for machine learning applications. In 2007, a density-based clustering algorithm named DENCLUE was invented to solve clustering problems for nonlinear data structures. However, its parameter selection problem was largely neglected until 2011. In this paper, we propose a new approach to compute the optimal parameters for the DENCLUE algorithm, and discuss its performance in the experiment section.
摘要
在现代工业中,聚类算法是算法工程师的日常工作。虽然聚类算法在2010年之前经历了快速发展,但自深度学习成为机器学习应用事实上的工业标准之后,该研究方向的创新便停滞不前。2007年,一种名为DENCLUE的基于密度的聚类算法被提出,用于解决非线性数据结构上的聚类问题;然而,其参数选择问题直到2011年才得到关注。在这篇论文中,我们提出了一种计算DENCLUE算法最优参数的新方法,并在实验部分讨论其性能。
Offline Reinforcement Learning with Imbalanced Datasets
results: 在具有不同不均衡程度的数据集上,我们的方法均优于基线方法,取得了更高的性能。Abstract
The prevalent use of benchmarks in current offline reinforcement learning (RL) research has led to a neglect of the imbalance of real-world dataset distributions in the development of models. The real-world offline RL dataset is often imbalanced over the state space due to the challenge of exploration or safety considerations. In this paper, we specify properties of imbalanced datasets in offline RL, where the state coverage follows a power law distribution characterized by skewed policies. Theoretically and empirically, we show that typically offline RL methods based on distributional constraints, such as conservative Q-learning (CQL), are ineffective in extracting policies under the imbalanced dataset. Inspired by natural intelligence, we propose a novel offline RL method that utilizes the augmentation of CQL with a retrieval process to recall past related experiences, effectively alleviating the challenges posed by imbalanced datasets. We evaluate our method on several tasks in the context of imbalanced datasets with varying levels of imbalance, utilizing the variant of D4RL. Empirical results demonstrate the superiority of our method over other baselines.
摘要
当前离线强化学习(RL)研究普遍使用基准数据集,导致模型开发过程中忽视了真实世界数据集分布的不均衡问题。由于探索的困难或出于安全考虑,真实世界的离线RL数据集往往在状态空间上不均衡。在这篇论文中,我们刻画了离线RL中不均衡数据集的性质:其状态覆盖服从由倾斜策略刻画的幂律分布。我们从理论和实证两方面证明,基于分布约束的常见离线RL方法(如保守Q学习,CQL)在不均衡数据集上难以有效提取策略。受自然智能启发,我们提出了一种新的离线RL方法:为CQL增加一个检索过程,用于回忆过去的相关经验,从而有效缓解不均衡数据集带来的挑战。我们利用D4RL的变体,在不同不均衡程度的数据集上对多个任务进行了评估,实验结果表明我们的方法优于其他基线。
Evaluating the Evaluators: Are Current Few-Shot Learning Benchmarks Fit for Purpose?
paper_authors: Luísa Shimabucoro, Timothy Hospedales, Henry Gouk
for: 本研究旨在考察少样本学习中的任务级评估,以提供更可靠的模型评估方法。
methods: 本文使用较少折数的交叉验证和自助法(bootstrapping)来评估模型性能,并考察多种模型选择策略。
results: 研究发现,使用较少折数的交叉验证适合直接估计模型性能,而自助法或较多折数的交叉验证更适合用于模型选择。总体而言,现有的少样本学习基准并不适合用于评估模型在单个任务上的性能。Abstract
Numerous benchmarks for Few-Shot Learning have been proposed in the last decade. However all of these benchmarks focus on performance averaged over many tasks, and the question of how to reliably evaluate and tune models trained for individual tasks in this regime has not been addressed. This paper presents the first investigation into task-level evaluation -- a fundamental step when deploying a model. We measure the accuracy of performance estimators in the few-shot setting, consider strategies for model selection, and examine the reasons for the failure of evaluators usually thought of as being robust. We conclude that cross-validation with a low number of folds is the best choice for directly estimating the performance of a model, whereas using bootstrapping or cross validation with a large number of folds is better for model selection purposes. Overall, we find that existing benchmarks for few-shot learning are not designed in such a way that one can get a reliable picture of how effectively methods can be used on individual tasks.
摘要
在过去十年中,人们提出了许多少样本学习基准。然而,这些基准全都关注多个任务上的平均性能,而如何在这一范式下可靠地评估和调优针对单个任务训练的模型,这一问题尚未得到解决。本文首次研究任务级评估——部署模型时的一个基本步骤。我们测量了少样本设置下性能估计器的准确性,考察了模型选择的策略,并探究了通常被认为稳健的评估器失效的原因。我们的结论是:较少折数的交叉验证是直接估计模型性能的最佳选择,而自助法(bootstrapping)或较多折数的交叉验证更适合用于模型选择。总体而言,我们发现现有的少样本学习基准在设计上无法可靠地反映各方法在单个任务上的实际效果。
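The recommended direct estimator can be sketched as follows: low-fold cross-validation over a single task's support set (here a synthetic 5-way 10-shot task, with a logistic-regression "adapter" as a stand-in for the few-shot learner).

```python
# Sketch: task-level performance estimation via low-fold cross-validation.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# one 5-way 10-shot task: 50 support examples with 16-dim features
X = rng.normal(size=(50, 16)) + np.repeat(rng.normal(size=(5, 16)), 10, axis=0)
y = np.repeat(np.arange(5), 10)

scores = []
for tr, te in StratifiedKFold(n_splits=2, shuffle=True, random_state=0).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    scores.append(clf.score(X[te], y[te]))
print("task-level accuracy estimate:", np.mean(scores))
```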
Hierarchical Empowerment: Towards Tractable Empowerment-Based Skill-Learning
paper_authors: Andrew Levy, Sreehari Rammohan, Alessandro Allievi, Scott Niekum, George Konidaris
for: 本研究旨在提出一个新框架,使学习大量独特技能在计算上更易处理。
methods: 本研究结合目标条件层次强化学习(Goal-Conditioned Hierarchical RL)的概念,提出了一个新的互信息变分下界,用于在短时间范围内计算赋能(empowerment);此外,还提出了一种层次架构,用于在指数级更长的时间尺度上计算赋能。
results: 在一系列模拟机器人任务中,我们的四层智能体能够学习覆盖面积比以往工作大两个数量级以上的技能。Abstract
General purpose agents will require large repertoires of skills. Empowerment -- the maximum mutual information between skills and the states -- provides a pathway for learning large collections of distinct skills, but mutual information is difficult to optimize. We introduce a new framework, Hierarchical Empowerment, that makes computing empowerment more tractable by integrating concepts from Goal-Conditioned Hierarchical Reinforcement Learning. Our framework makes two specific contributions. First, we introduce a new variational lower bound on mutual information that can be used to compute empowerment over short horizons. Second, we introduce a hierarchical architecture for computing empowerment over exponentially longer time scales. We verify the contributions of the framework in a series of simulated robotics tasks. In a popular ant navigation domain, our four level agents are able to learn skills that cover a surface area over two orders of magnitude larger than prior work.
Through the Fairness Lens: Experimental Analysis and Evaluation of Entity Matching
paper_authors: Nima Shahbazi, Nikola Danevski, Fatemeh Nargesian, Abolfazl Asudeh, Divesh Srivastava
for: This paper aims to address the fairness of entity matching (EM) techniques, which have not been well-studied despite extensive research on algorithmic fairness.
methods: The authors perform an extensive experimental evaluation of various EM techniques using two social datasets generated from publicly available datasets.
results: The authors find that EM techniques can be unfair under certain conditions, such as when some demographic groups are overrepresented or when names are more similar in some groups compared to others. They also find that certain fairness definitions, such as positive predictive value parity and true positive rate parity, are more capable of revealing EM unfairness due to the class imbalance nature of EM.
Abstract
Entity matching (EM) is a challenging problem studied by different communities for over half a century. Algorithmic fairness has also become a timely topic to address machine bias and its societal impacts. Despite extensive research on these two topics, little attention has been paid to the fairness of entity matching. Towards addressing this gap, we perform an extensive experimental evaluation of a variety of EM techniques in this paper. We generated two social datasets from publicly available datasets for the purpose of auditing EM through the lens of fairness. Our findings underscore potential unfairness under two common conditions in real-world societies: (i) when some demographic groups are overrepresented, and (ii) when names are more similar in some groups compared to others. Among our many findings, it is noteworthy to mention that while various fairness definitions are valuable for different settings, due to EM's class imbalance nature, measures such as positive predictive value parity and true positive rate parity are, in general, more capable of revealing EM unfairness.
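The two fairness measures highlighted above are simple to compute per demographic group. A minimal sketch with synthetic labels and a hypothetical binary group attribute; the "parity gap" summary statistic is our illustrative choice, not the paper's exact audit protocol.

```python
import numpy as np

def ppv(y_true, y_pred):
    """Positive predictive value: P(true match | predicted match)."""
    pred_pos = y_pred == 1
    return (y_true[pred_pos] == 1).mean() if pred_pos.any() else np.nan

def tpr(y_true, y_pred):
    """True positive rate: P(predicted match | true match)."""
    true_pos = y_true == 1
    return (y_pred[true_pos] == 1).mean() if true_pos.any() else np.nan

def parity_gap(y_true, y_pred, group, metric):
    """Max difference of a metric across demographic groups; 0 means parity."""
    vals = [metric(y_true[group == g], y_pred[group == g])
            for g in np.unique(group)]
    return np.nanmax(vals) - np.nanmin(vals)

# Toy audit data: labels, predictions, and a binary group attribute.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
group = rng.integers(0, 2, 1000)
y_pred = np.where(group == 1, y_true, rng.integers(0, 2, 1000))  # biased matcher

print("PPV parity gap:", parity_gap(y_true, y_pred, group, ppv))
print("TPR parity gap:", parity_gap(y_true, y_pred, group, tpr))
```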
paper_authors: Shang Liu, Xiaocheng Li
for: This paper systematically examines uncertainty sampling algorithms and proposes a new uncertainty measure, "loss as uncertainty", together with a general equivalent loss that can be tailored to different tasks and loss functions.
methods: The paper analyzes uncertainty sampling under both stream-based and pool-based active learning, framing the algorithm as optimizing an equivalent loss determined by the uncertainty measure and the original loss function.
results: The paper proposes the "loss as uncertainty" measure (the conditional expected loss given the features) and verifies the properness of uncertainty measures via surrogate property and loss convexity. It further provides the first generalization bounds for uncertainty sampling in both stream-based and pool-based settings, and connects variants of uncertainty sampling to risk-sensitive objectives and distributional robustness.
Abstract
Uncertainty sampling is a prevalent active learning algorithm that queries sequentially the annotations of data samples which the current prediction model is uncertain about. However, the usage of uncertainty sampling has been largely heuristic: (i) There is no consensus on the proper definition of "uncertainty" for a specific task under a specific loss; (ii) There is no theoretical guarantee that prescribes a standard protocol to implement the algorithm, for example, how to handle the sequentially arrived annotated data under the framework of optimization algorithms such as stochastic gradient descent. In this work, we systematically examine uncertainty sampling algorithms under both stream-based and pool-based active learning. We propose a notion of equivalent loss which depends on the used uncertainty measure and the original loss function and establish that an uncertainty sampling algorithm essentially optimizes against such an equivalent loss. The perspective verifies the properness of existing uncertainty measures from two aspects: surrogate property and loss convexity. Furthermore, we propose a new notion for designing uncertainty measures called \textit{loss as uncertainty}. The idea is to use the conditional expected loss given the features as the uncertainty measure. Such an uncertainty measure has nice analytical properties and generality to cover both classification and regression problems, which enable us to provide the first generalization bound for uncertainty sampling algorithms under both stream-based and pool-based settings, in the full generality of the underlying model and problem. Lastly, we establish connections between certain variants of the uncertainty sampling algorithms with risk-sensitive objectives and distributional robustness, which can partly explain the advantage of uncertainty sampling algorithms when the sample size is small.
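For classification under log loss, the conditional expected loss given the features, evaluated as a plug-in under the model's own predictive distribution, reduces to the entropy of p(y | x). A minimal pool-based sketch, assuming scikit-learn and a synthetic pool; the dataset and query size are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
labeled = np.arange(20)                       # small initial labeled set
pool = np.setdiff1d(np.arange(500), labeled)

model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

# "Loss as uncertainty": conditional expected loss given the features.
# For log loss, the plug-in expectation under the model's own predictive
# distribution is the entropy of p(y | x).
proba = model.predict_proba(X[pool])
expected_loss = -(proba * np.log(proba + 1e-12)).sum(axis=1)

# Query the k pool points with the largest expected loss.
k = 10
query = pool[np.argsort(expected_loss)[-k:]]
print("indices to annotate:", query)
```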
results: Networks trained with the MSCon objective outperform state-of-the-art baselines in both in-domain and out-of-domain settings.
Abstract
Given a similarity metric, contrastive methods learn a representation in which examples that are similar are pushed together and examples that are dissimilar are pulled apart. Contrastive learning techniques have been utilized extensively to learn representations for tasks ranging from image classification to caption generation. However, existing contrastive learning approaches can fail to generalize because they do not take into account the possibility of different similarity relations. In this paper, we propose a novel multi-similarity contrastive loss (MSCon), that learns generalizable embeddings by jointly utilizing supervision from multiple metrics of similarity. Our method automatically learns contrastive similarity weightings based on the uncertainty in the corresponding similarity, down-weighting uncertain tasks and leading to better out-of-domain generalization to new tasks. We show empirically that networks trained with MSCon outperform state-of-the-art baselines on in-domain and out-of-domain settings.
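The joint use of several similarity relations can be sketched as a weighted sum of per-metric contrastive losses. The PyTorch snippet below is a sketch only: the InfoNCE form and the Kendall-style log-variance weighting are assumptions standing in for MSCon's actual uncertainty-based weighting scheme, and all sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(z, pos_mask, tau=0.1):
    """Contrastive loss for one similarity relation: `pos_mask[i, j]` marks
    which pairs count as 'similar' under that metric."""
    z = F.normalize(z, dim=1)
    eye = torch.eye(len(z), dtype=torch.bool)
    logits = (z @ z.t() / tau).masked_fill(eye, -1e9)   # exclude self-pairs
    log_p = logits.log_softmax(dim=1)
    pos = pos_mask.float().masked_fill(eye, 0.0)
    return -(log_p * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

torch.manual_seed(0)
z = torch.randn(32, 64, requires_grad=True)    # embeddings
labels_a = torch.randint(0, 4, (32,))          # similarity relation 1
labels_b = torch.randint(0, 2, (32,))          # similarity relation 2
masks = [labels_a[:, None] == labels_a[None, :],
         labels_b[:, None] == labels_b[None, :]]

# Learned per-metric log-variances: uncertain relations get down-weighted.
# This exact weighting form is an assumption, not MSCon's formula.
log_var = torch.zeros(2, requires_grad=True)
losses = torch.stack([info_nce(z, m) for m in masks])
total = (torch.exp(-log_var) * losses + log_var).sum()
total.backward()
print(losses.detach(), total.item())
```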
Towards Symmetry-Aware Generation of Periodic Materials
paper_authors: Youzhi Luo, Chengkai Liu, Shuiwang Ji
for: This work aims to generate periodic materials with deep models; existing deep learning methods have not fully captured the physical symmetries of periodic material structures.
methods: The paper proposes SyMat, a new material generation approach that captures the physical symmetries of periodic material structures. SyMat generates atom type sets, lattice lengths, and lattice angles with a variational autoencoder, and generates atom coordinates with a score-based diffusion model that uses a novel symmetry-aware probabilistic model in the coordinate diffusion process.
results: SyMat is theoretically invariant to all symmetry transformations on materials and achieves promising performance on random generation and property optimization tasks.
Abstract
We consider the problem of generating periodic materials with deep models. While symmetry-aware molecule generation has been studied extensively, periodic materials possess different symmetries, which have not been completely captured by existing methods. In this work, we propose SyMat, a novel material generation approach that can capture physical symmetries of periodic material structures. SyMat generates atom types and lattices of materials through generating atom type sets, lattice lengths and lattice angles with a variational auto-encoder model. In addition, SyMat employs a score-based diffusion model to generate atom coordinates of materials, in which a novel symmetry-aware probabilistic model is used in the coordinate diffusion process. We show that SyMat is theoretically invariant to all symmetry transformations on materials and demonstrate that SyMat achieves promising performance on random generation and property optimization tasks.
Loss Functions and Metrics in Deep Learning. A Review
results: The paper surveys the most prevalent loss functions and performance metrics, examines the benefits and limits of each, and provides an overview to help practitioners choose the method best suited to their specific task.
Abstract
One of the essential components of deep learning is the choice of the loss function and performance metrics used to train and evaluate models. This paper reviews the most prevalent loss functions and performance measurements in deep learning. We examine the benefits and limits of each technique and illustrate their application to various deep-learning problems. Our review aims to give a comprehensive picture of the different loss functions and performance indicators used in the most common deep learning tasks and help practitioners choose the best method for their specific task.
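A minimal illustration of the distinction the review draws between losses (optimized during training) and metrics (reported at evaluation time); the toy labels and threshold are illustrative.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
p_hat  = np.array([0.9, 0.2, 0.6, 0.4, 0.1])   # predicted P(y=1)

# Binary cross-entropy: a standard classification loss.
bce = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

# Mean squared error: a standard regression loss (here on probabilities).
mse = np.mean((y_true - p_hat) ** 2)

# Accuracy and F1: metrics reported at evaluation time, not optimized directly.
y_pred = (p_hat >= 0.5).astype(int)
acc = (y_pred == y_true).mean()
tp = ((y_pred == 1) & (y_true == 1)).sum()
precision = tp / max((y_pred == 1).sum(), 1)
recall = tp / max((y_true == 1).sum(), 1)
f1 = 2 * precision * recall / max(precision + recall, 1e-12)

print(f"BCE={bce:.3f}  MSE={mse:.3f}  acc={acc:.2f}  F1={f1:.2f}")
```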
results: The notes cover practical applications such as data distillation and adversarial robustness, and discuss examples of inductive bias.
Abstract
Lecture notes from the course given by Professor Julia Kempe at the summer school "Statistical physics of Machine Learning" in Les Houches. The notes discuss the so-called NTK approach to problems in machine learning, which consists of gaining an understanding of generally unsolvable problems by finding a tractable kernel formulation. The notes are mainly focused on practical applications such as data distillation and adversarial robustness, examples of inductive bias are also discussed.
Scaling In-Context Demonstrations with Structured Attention
results: Evaluated in a meta-training framework, SAICL achieves comparable or better performance than full attention while obtaining up to a 3.4x inference speed-up. SAICL also consistently outperforms a strong Fusion-in-Decoder (FiD) baseline that processes each demonstration independently, and, thanks to its linear nature, it scales easily to hundreds of demonstrations with continuous performance gains.
Abstract
The recent surge of large language models (LLMs) highlights their ability to perform in-context learning, i.e., "learning" to perform a task from a few demonstrations in the context without any parameter updates. However, their capabilities of in-context learning are limited by the model architecture: 1) the use of demonstrations is constrained by a maximum sentence length due to positional embeddings; 2) the quadratic complexity of attention hinders users from using more demonstrations efficiently; 3) LLMs are shown to be sensitive to the order of the demonstrations. In this work, we tackle these challenges by proposing a better architectural design for in-context learning. We propose SAICL (Structured Attention for In-Context Learning), which replaces the full-attention by a structured attention mechanism designed for in-context learning, and removes unnecessary dependencies between individual demonstrations, while making the model invariant to the permutation of demonstrations. We evaluate SAICL in a meta-training framework and show that SAICL achieves comparable or better performance than full attention while obtaining up to 3.4x inference speed-up. SAICL also consistently outperforms a strong Fusion-in-Decoder (FiD) baseline which processes each demonstration independently. Finally, thanks to its linear nature, we demonstrate that SAICL can easily scale to hundreds of demonstrations with continuous performance gains with scaling.
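The structured attention idea can be visualized as a mask: demonstration tokens attend only within their own demonstration, while the query attends everywhere, which removes cross-demonstration dependencies and makes the result invariant to demonstration order. The layout below is our illustrative reading of the abstract, not SAICL's exact mask.

```python
import numpy as np

def saicl_style_mask(demo_lens, query_len):
    """Boolean attention mask in the spirit of structured in-context
    attention: each demonstration is a block-diagonal island, and the
    query block attends to all tokens. Token layout is an assumption."""
    n = sum(demo_lens) + query_len
    mask = np.zeros((n, n), dtype=bool)
    start = 0
    for L in demo_lens:
        mask[start:start + L, start:start + L] = True   # within-demo only
        start += L
    mask[start:, :] = True                              # query sees everything
    return mask

print(saicl_style_mask(demo_lens=[3, 3], query_len=2).astype(int))
```

Because no demonstration attends to any other, permuting the demonstrations permutes rows and columns of identical blocks, leaving the query's view unchanged.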
GIT: Detecting Uncertainty, Out-Of-Distribution and Adversarial Samples using Gradients and Invariance Transformations
paper_authors: Julia Lust, Alexandru P. Condurache
for: Detecting generalization errors in deep neural networks
methods: Combines the use of gradient information and invariance transformations
results: Achieves superior performance compared to the state of the art across a variety of network architectures, problem setups, and perturbation types
Abstract
Deep neural networks tend to make overconfident predictions and often require additional detectors for misclassifications, particularly for safety-critical applications. Existing detection methods usually only focus on adversarial attacks or out-of-distribution samples as reasons for false predictions. However, generalization errors occur due to diverse reasons often related to poorly learning relevant invariances. We therefore propose GIT, a holistic approach for the detection of generalization errors that combines the usage of gradient information and invariance transformations. The invariance transformations are designed to shift misclassified samples back into the generalization area of the neural network, while the gradient information measures the contradiction between the initial prediction and the corresponding inherent computations of the neural network using the transformed sample. Our experiments demonstrate the superior performance of GIT compared to the state-of-the-art on a variety of network architectures, problem setups and perturbation types.
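The two ingredients, an invariance transformation plus gradient information, can be combined into a simple detector score. The sketch below is our illustrative reading of the abstract: the tiny model, the noise transform, and the gradient-norm scoring formula are all assumptions, not GIT's implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 4))

def git_style_score(x, transform):
    """Illustrative detector score: (i) apply an invariance transformation
    to the input, then (ii) measure, via the gradient, how strongly the
    network's computation on the transformed sample contradicts the
    initial prediction. Large scores suggest a misclassified sample."""
    with torch.no_grad():
        y_hat = model(x).argmax(dim=1)          # initial prediction
    model.zero_grad()
    loss = F.cross_entropy(model(transform(x)), y_hat)
    loss.backward()
    g2 = sum((p.grad ** 2).sum() for p in model.parameters())
    return g2.sqrt().item()

x = torch.randn(16, 8)
noise = lambda t: t + 0.1 * torch.randn_like(t)  # stand-in invariance transform
print("score:", git_style_score(x, noise))
```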
Active Class Selection for Few-Shot Class-Incremental Learning
results: Experimental results on a simulated agent and a real robot show that the approach is effective for long-term real-world robotics applications, allowing a robot to continually learn and update its model from limited interactions.
Abstract
For real-world applications, robots will need to continually learn in their environments through limited interactions with their users. Toward this, previous works in few-shot class incremental learning (FSCIL) and active class selection (ACS) have achieved promising results but were tested in constrained setups. Therefore, in this paper, we combine ideas from FSCIL and ACS to develop a novel framework that can allow an autonomous agent to continually learn new objects by asking its users to label only a few of the most informative objects in the environment. To this end, we build on a state-of-the-art (SOTA) FSCIL model and extend it with techniques from ACS literature. We term this model Few-shot Incremental Active class SeleCtiOn (FIASco). We further integrate a potential field-based navigation technique with our model to develop a complete framework that can allow an agent to process and reason on its sensory data through the FIASco model, navigate towards the most informative object in the environment, gather data about the object through its sensors and incrementally update the FIASco model. Experimental results on a simulated agent and a real robot show the significance of our approach for long-term real-world robotics applications.
Hybrid Ground-State Quantum Algorithms based on Neural Schrödinger Forging
results: Achieves comparable or superior performance to the standard implementation and can be applied to larger systems as well as systems that are not permutation invariant
Abstract
Entanglement forging based variational algorithms leverage the bi-partition of quantum systems for addressing ground state problems. The primary limitation of these approaches lies in the exponential summation required over the numerous potential basis states, or bitstrings, when performing the Schmidt decomposition of the whole system. To overcome this challenge, we propose a new method for entanglement forging employing generative neural networks to identify the most pertinent bitstrings, eliminating the need for the exponential sum. Through empirical demonstrations on systems of increasing complexity, we show that the proposed algorithm achieves comparable or superior performance compared to the existing standard implementation of entanglement forging. Moreover, by controlling the amount of required resources, this scheme can be applied to larger, as well as non permutation invariant systems, where the latter constraint is associated with the Heisenberg forging procedure. We substantiate our findings through numerical simulations conducted on spins models exhibiting one-dimensional ring, two-dimensional triangular lattice topologies, and nuclear shell model configurations.
Stability of Q-Learning Through Design and Optimism
methods: The work builds on stochastic approximation and Q-learning, and presents new approaches to ensure stability and potentially accelerate convergence. Two contributions are entirely new: 1. For Q-learning with linear function approximation, whose stability has been an open research topic for over three decades, it is shown that appropriate optimistic training in the form of a modified Gibbs policy yields a solution to the projected Bellman equation and a stable algorithm (bounded parameter estimates), while convergence remains one of many open research questions. 2. The Zap Zero algorithm approximates the Newton-Raphson flow without matrix inversion; it is stable and convergent under mild assumptions on the mean flow vector field, and applies to Q-learning and other stochastic approximation settings.
results: The study obtains several new results: 1. A modified Gibbs policy ensures the stability of Q-learning with linear function approximation, although convergence remains an open problem. 2. The Zap Zero algorithm achieves stable and convergent behavior under mild assumptions.
Abstract
Q-learning has become an important part of the reinforcement learning toolkit since its introduction in the dissertation of Chris Watkins in the 1980s. This paper is in part a tutorial on stochastic approximation and Q-learning, providing details regarding the INFORMS APS inaugural Applied Probability Trust Plenary Lecture, presented in Nancy, France, June 2023. The paper also presents new approaches to ensure stability and potentially accelerated convergence for these algorithms, and stochastic approximation in other settings. Two contributions are entirely new: 1. Stability of Q-learning with linear function approximation has been an open topic for research for over three decades. It is shown that with appropriate optimistic training in the form of a modified Gibbs policy, there exists a solution to the projected Bellman equation, and the algorithm is stable (in terms of bounded parameter estimates). Convergence remains one of many open topics for research. 2. The new Zap Zero algorithm is designed to approximate the Newton-Raphson flow without matrix inversion. It is stable and convergent under mild assumptions on the mean flow vector field for the algorithm, and compatible statistical assumption on an underlying Markov chain. The algorithm is a general approach to stochastic approximation which in particular applies to Q-learning with "oblivious" training even with non-linear function approximation.
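The Gibbs (softmax) policy at the heart of the first contribution is easy to illustrate in the tabular case. A minimal sketch on a toy chain MDP; the temperature and the environment are illustrative, and the paper's analysis concerns linear function approximation rather than this tabular setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))
alpha, gamma, beta = 0.1, 0.9, 5.0   # beta: inverse temperature of the policy

def gibbs_action(s):
    """Sample from the Gibbs (softmax) policy over Q-values; the kind of
    randomized 'optimistic' policy discussed for stabilizing Q-learning."""
    p = np.exp(beta * (Q[s] - Q[s].max()))
    p /= p.sum()
    return rng.choice(n_actions, p=p)

s = 0
for _ in range(5000):                 # toy chain MDP: action 2 moves right
    a = gibbs_action(s)
    s_next = min(s + 1, n_states - 1) if a == 2 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = 0 if s_next == n_states - 1 else s_next
print(np.round(Q, 2))
```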
An explainable model to support the decision about the therapy protocol for AML
paper_authors: Jade M. Almeida, Giovanna A. Castro, João A. Machado-Neto, Tiago A. Almeida
for: Predicting the survival of AML patients to support physicians in choosing the most appropriate therapy protocol.
methods: Data analysis and an explainable machine-learning model to support physicians' decisions.
results: The proposed explainable model can safely support physicians' decisions, and the experimental results are promising.
Abstract
Acute Myeloid Leukemia (AML) is one of the most aggressive types of hematological neoplasm. To support the specialists' decision about the appropriate therapy, patients with AML receive a prognostic of outcomes according to their cytogenetic and molecular characteristics, often divided into three risk categories: favorable, intermediate, and adverse. However, the current risk classification has known problems, such as the heterogeneity between patients of the same risk group and no clear definition of the intermediate risk category. Moreover, as most patients with AML receive an intermediate-risk classification, specialists often demand other tests and analyses, leading to delayed treatment and worsening of the patient's clinical condition. This paper presents the data analysis and an explainable machine-learning model to support the decision about the most appropriate therapy protocol according to the patient's survival prediction. In addition to the prediction model being explainable, the results obtained are promising and indicate that it is possible to use it to support the specialists' decisions safely. Most importantly, the findings offered in this study have the potential to open new avenues of research toward better treatments and prognostic markers.
FLuID: Mitigating Stragglers in Federated Learning using Invariant Dropout
results: Experiments show that Invariant Dropout maintains baseline model efficiency while alleviating the performance bottleneck of stragglers in FL. FLuID can also dynamically adapt the training load at runtime as device performance changes.
Abstract
Federated Learning (FL) allows machine learning models to train locally on individual mobile devices, synchronizing model updates via a shared server. This approach safeguards user privacy; however, it also generates a heterogeneous training environment due to the varying performance capabilities across devices. As a result, straggler devices with lower performance often dictate the overall training time in FL. In this work, we aim to alleviate this performance bottleneck due to stragglers by dynamically balancing the training load across the system. We introduce Invariant Dropout, a method that extracts a sub-model based on the weight update threshold, thereby minimizing potential impacts on accuracy. Building on this dropout technique, we develop an adaptive training framework, Federated Learning using Invariant Dropout (FLuID). FLuID offers a lightweight sub-model extraction to regulate computational intensity, thereby reducing the load on straggler devices without affecting model quality. Our method leverages neuron updates from non-straggler devices to construct a tailored sub-model for each straggler based on client performance profiling. Furthermore, FLuID can dynamically adapt to changes in stragglers as runtime conditions shift. We evaluate FLuID using five real-world mobile clients. The evaluations show that Invariant Dropout maintains baseline model efficiency while alleviating the performance bottleneck of stragglers through a dynamic, runtime approach.
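The sub-model extraction step can be sketched directly from the abstract: neurons whose recent weight updates are small are treated as invariant and dropped from the straggler's sub-model. The per-neuron statistic and the quantile threshold below are our assumptions, not FLuID's exact rule.

```python
import numpy as np

rng = np.random.default_rng(0)
w_prev = rng.normal(size=(64, 32))          # layer weights at round t-1
w_curr = w_prev + rng.normal(scale=0.01, size=(64, 32))
w_curr[:, :8] += rng.normal(scale=0.5, size=(64, 8))   # a few "active" units

# Invariant-dropout idea (our reading of the abstract): neurons whose
# updates fall below a threshold are treated as invariant and dropped
# from the straggler's sub-model; the threshold rule is an assumption.
update_mag = np.abs(w_curr - w_prev).mean(axis=0)        # per-neuron update
threshold = np.quantile(update_mag, 0.75)                # keep top 25% of units
keep = update_mag >= threshold

sub_model_weights = w_curr[:, keep]        # lighter model for the straggler
print(f"kept {keep.sum()}/{len(keep)} neurons -> {sub_model_weights.shape}")
```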
Learning when to observe: A frugal reinforcement learning framework for a high-cost world
paper_authors: Colin Bellinger, Mark Crowley, Isaac Tamblyn
for: This work investigates whether RL algorithms can learn effective control policies in settings where measuring the state of the environment is costly.
methods: The paper proposes the Deep Dynamic Multi-Step Observationless Agent (DMSOA) and evaluates it empirically against alternatives on OpenAI gym and Atari Pong environments.
results: DMSOA learns a better policy with fewer decision steps and measurements than the considered alternative from the literature, performing well on OpenAI gym and Atari Pong.
Abstract
Reinforcement learning (RL) has been shown to learn sophisticated control policies for complex tasks including games, robotics, heating and cooling systems and text generation. The action-perception cycle in RL, however, generally assumes that a measurement of the state of the environment is available at each time step without a cost. In applications such as materials design, deep-sea and planetary robot exploration and medicine, however, there can be a high cost associated with measuring, or even approximating, the state of the environment. In this paper, we survey the recently growing literature that adopts the perspective that an RL agent might not need, or even want, a costly measurement at each time step. Within this context, we propose the Deep Dynamic Multi-Step Observationless Agent (DMSOA), contrast it with the literature and empirically evaluate it on OpenAI gym and Atari Pong environments. Our results, show that DMSOA learns a better policy with fewer decision steps and measurements than the considered alternative from the literature. The corresponding code is available at: \url{https://github.com/cbellinger27/Learning-when-to-observe-in-RL}
Human Inspired Progressive Alignment and Comparative Learning for Grounded Word Acquisition
results: Controlled experiments show the potential of this approach for efficient continual learning of grounded words, allowing models to keep acquiring new concepts.
Abstract
Human language acquisition is an efficient, supervised, and continual process. In this work, we took inspiration from how human babies acquire their first language, and developed a computational process for word acquisition through comparative learning. Motivated by cognitive findings, we generated a small dataset that enables the computation models to compare the similarities and differences of various attributes, learn to filter out and extract the common information for each shared linguistic label. We frame the acquisition of words as not only the information filtration process, but also as representation-symbol mapping. This procedure does not involve a fixed vocabulary size, nor a discriminative objective, and allows the models to continually learn more concepts efficiently. Our results in controlled experiments have shown the potential of this approach for efficient continual learning of grounded words.
Additive Decoders for Latent Variables Identification and Cartesian-Product Extrapolation
results: The paper gives conditions under which exactly solving the reconstruction problem with an additive decoder identifies the blocks of latent variables up to permutation and block-wise invertible transformations. It further shows that additive decoders can generate novel images by recombining observed factors of variation in novel ways (Cartesian-product extrapolation).
Abstract
We tackle the problems of latent variables identification and "out-of-support" image generation in representation learning. We show that both are possible for a class of decoders that we call additive, which are reminiscent of decoders used for object-centric representation learning (OCRL) and well suited for images that can be decomposed as a sum of object-specific images. We provide conditions under which exactly solving the reconstruction problem using an additive decoder is guaranteed to identify the blocks of latent variables up to permutation and block-wise invertible transformations. This guarantee relies only on very weak assumptions about the distribution of the latent factors, which might present statistical dependencies and have an almost arbitrarily shaped support. Our result provides a new setting where nonlinear independent component analysis (ICA) is possible and adds to our theoretical understanding of OCRL methods. We also show theoretically that additive decoders can generate novel images by recombining observed factors of variations in novel ways, an ability we refer to as Cartesian-product extrapolation. We show empirically that additivity is crucial for both identifiability and extrapolation on simulated data.
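An additive decoder is straightforward to write down: the output is a sum of per-block decoders, each consuming only its own block of latent variables. A minimal PyTorch sketch; the MLP architecture and sizes are illustrative, not the paper's.

```python
import torch

class AdditiveDecoder(torch.nn.Module):
    """Decoder of the additive class: the output is a sum of per-block
    decoders, each seeing only its own block of the latent vector."""
    def __init__(self, block_dims, out_dim):
        super().__init__()
        self.block_dims = block_dims
        self.decoders = torch.nn.ModuleList(
            torch.nn.Sequential(torch.nn.Linear(d, 64), torch.nn.ReLU(),
                                torch.nn.Linear(64, out_dim))
            for d in block_dims)

    def forward(self, z):
        blocks = torch.split(z, self.block_dims, dim=1)
        return sum(dec(b) for dec, b in zip(self.decoders, blocks))

dec = AdditiveDecoder(block_dims=[2, 2], out_dim=16)   # two latent blocks
x = dec(torch.randn(4, 4))
print(x.shape)   # torch.Size([4, 16])
```

Because each block is decoded independently before summation, combinations of block values never observed together still yield well-formed outputs, which is the mechanism behind Cartesian-product extrapolation.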
TransformerG2G: Adaptive time-stepping for learning temporal graph embeddings using transformers
results: Across benchmarks with varying levels of "novelty" (as measured by TEA plots), the model outperforms conventional multi-step methods and the authors' prior DynG2G model in both link-prediction accuracy and computational efficiency, especially at high degrees of novelty. Analyzing the learned attention weights across graph snapshots reveals temporal dependencies, influential elements, and complex interactions within the graph structure; for example, attention weights correlate strongly with node degree at different stages of the evolution of the graph topology.
Abstract
Dynamic graph embedding has emerged as a very effective technique for addressing diverse temporal graph analytic tasks (i.e., link prediction, node classification, recommender systems, anomaly detection, and graph generation) in various applications. Such temporal graphs exhibit heterogeneous transient dynamics, varying time intervals, and highly evolving node features throughout their evolution. Hence, incorporating long-range dependencies from the historical graph context plays a crucial role in accurately learning their temporal dynamics. In this paper, we develop a graph embedding model with uncertainty quantification, TransformerG2G, by exploiting the advanced transformer encoder to first learn intermediate node representations from its current state ($t$) and previous context (over timestamps [$t-1, t-l$], $l$ is the length of context). Moreover, we employ two projection layers to generate lower-dimensional multivariate Gaussian distributions as each node's latent embedding at timestamp $t$. We consider diverse benchmarks with varying levels of ``novelty" as measured by the TEA plots. Our experiments demonstrate that the proposed TransformerG2G model outperforms conventional multi-step methods and our prior work (DynG2G) in terms of both link prediction accuracy and computational efficiency, especially for high degree of novelty. Furthermore, the learned time-dependent attention weights across multiple graph snapshots reveal the development of an automatic adaptive time stepping enabled by the transformer. Importantly, by examining the attention weights, we can uncover temporal dependencies, identify influential elements, and gain insights into the complex interactions within the graph structure. For example, we identified a strong correlation between attention weights and node degree at the various stages of the graph topology evolution.
Multimodal Temporal Fusion Transformers Are Good Product Demand Forecasters
results: Experiments on a large real-world dataset show that the proposed approach effectively forecasts demand for a wide range of products, with higher accuracy and reliability than traditional methods.
Abstract
Multimodal demand forecasting aims at predicting product demand utilizing visual, textual, and contextual information. This paper proposes a method for multimodal product demand forecasting using convolutional, graph-based, and transformer-based architectures. Traditional approaches to demand forecasting rely on historical demand, product categories, and additional contextual information such as seasonality and events. However, these approaches have several shortcomings, such as the cold start problem making it difficult to predict product demand until sufficient historical data is available for a particular product, and their inability to properly deal with category dynamics. By incorporating multimodal information, such as product images and textual descriptions, our architecture aims to address the shortcomings of traditional approaches and outperform them. The experiments conducted on a large real-world dataset show that the proposed approach effectively predicts demand for a wide range of products. The multimodal pipeline presented in this work enhances the accuracy and reliability of the predictions, demonstrating the potential of leveraging multimodal information in product demand forecasting.
Several categories of Large Language Models (LLMs): A Short Survey
results: The survey summarizes developments and efforts across LLM categories, including task-based financial LLMs, multilingual LLMs, biomedical and clinical LLMs, vision-language LLMs, and code language models. It also highlights unresolved problems for chatbots and virtual assistants, such as improving natural language processing, enhancing chatbot intelligence, and resolving moral and legal dilemmas.
Abstract
Large Language Models (LLMs) have become effective tools for natural language processing and have been used in many different fields. This essay offers a succinct summary of various LLM subcategories. The survey emphasizes recent developments and efforts made for various LLM kinds, including task-based financial LLMs, multilingual language LLMs, biomedical and clinical LLMs, vision language LLMs, and code language models. The survey gives a general summary of the methods, attributes, datasets, transformer models, and comparison metrics applied in each category of LLMs. Furthermore, it highlights unresolved problems in the field of developing chatbots and virtual assistants, such as boosting natural language processing, enhancing chatbot intelligence, and resolving moral and legal dilemmas. The purpose of this study is to provide readers, developers, academics, and users interested in LLM-based chatbots and virtual intelligent assistant technologies with useful information and future directions.
How accurate are existing land cover maps for agriculture in Sub-Saharan Africa?
results: The results reveal inconsistencies and low accuracy among land cover maps across countries and regions. Users are advised to select the map best suited to their needs, and future work should focus on resolving inconsistencies between maps and improving accuracy in low-accuracy regions.
Abstract
Satellite Earth observations (EO) can provide affordable and timely information for assessing crop conditions and food production. Such monitoring systems are essential in Africa, where there is high food insecurity and sparse agricultural statistics. EO-based monitoring systems require accurate cropland maps to provide information about croplands, but there is a lack of data to determine which of the many available land cover maps most accurately identify cropland in African countries. This study provides a quantitative evaluation and intercomparison of 11 publicly available land cover maps to assess their suitability for cropland classification and EO-based agriculture monitoring in Africa using statistically rigorous reference datasets from 8 countries. We hope the results of this study will help users determine the most suitable map for their needs and encourage future work to focus on resolving inconsistencies between maps and improving accuracy in low-accuracy regions.
Semi-supervised Learning from Street-View Images and OpenStreetMap for Automatic Building Height Estimation
results: On the test dataset, across three regression models (Random Forest, Support Vector Machine, and Convolutional Neural Network), the SSL method yields a clear performance boost in building height estimation, with an MAE of about 2.1 meters, competitive with state-of-the-art approaches.
Abstract
Accurate building height estimation is key to the automatic derivation of 3D city models from emerging big geospatial data, including Volunteered Geographical Information (VGI). However, an automatic solution for large-scale building height estimation based on low-cost VGI data is currently missing. The fast development of VGI data platforms, especially OpenStreetMap (OSM) and crowdsourced street-view images (SVI), offers a stimulating opportunity to fill this research gap. In this work, we propose a semi-supervised learning (SSL) method of automatically estimating building height from Mapillary SVI and OSM data to generate low-cost and open-source 3D city modeling in LoD1. The proposed method consists of three parts: first, we propose an SSL schema with the option of setting a different ratio of "pseudo label" during the supervised regression; second, we extract multi-level morphometric features from OSM data (i.e., buildings and streets) for the purposed of inferring building height; last, we design a building floor estimation workflow with a pre-trained facade object detection network to generate "pseudo label" from SVI and assign it to the corresponding OSM building footprint. In a case study, we validate the proposed SSL method in the city of Heidelberg, Germany and evaluate the model performance against the reference data of building heights. Based on three different regression models, namely Random Forest (RF), Support Vector Machine (SVM), and Convolutional Neural Network (CNN), the SSL method leads to a clear performance boosting in estimating building heights with a Mean Absolute Error (MAE) around 2.1 meters, which is competitive to state-of-the-art approaches. The preliminary result is promising and motivates our future work in scaling up the proposed method based on low-cost VGI data, with possibilities in even regions and areas with diverse data quality and availability.
Conditional Karhunen-Loève regression model with Basis Adaptation for high-dimensional problems: uncertainty quantification and inverse modeling
paper_authors: Yu-Hong Yeung, Ramakrishna Tipireddy, David A. Barajas-Solano, Alexandre M. Tartakovsky
for: Improving the accuracy of surrogate models of the observable response of physical systems, particularly for uncertainty quantification and parameter estimation in high-dimensional problems.
methods: The paper builds on surrogate models constructed from truncated unconditional Karhunen-Loève expansions (KLEs) and proposes conditional Karhunen-Loève expansions (CKLEs), obtained by conditioning the covariance kernel of the unconditional expansion on direct measurements via Gaussian process regression and then truncating the resulting expansion.
results: Basis Adaptation (BA) surrogate models of the hydraulic head based on CKLEs are more accurate than BA surrogates based on unconditional expansions for forward uncertainty quantification tasks, and inverse estimates of the hydraulic transmissivity field computed with CKLE-based BA surrogates are likewise more accurate than those from unconditional BA surrogates.
Abstract
We propose a methodology for improving the accuracy of surrogate models of the observable response of physical systems as a function of the systems' spatially heterogeneous parameter fields with applications to uncertainty quantification and parameter estimation in high-dimensional problems. Practitioners often formulate finite-dimensional representations of spatially heterogeneous parameter fields using truncated unconditional Karhunen-Lo\'{e}ve expansions (KLEs) for a certain choice of unconditional covariance kernel and construct surrogate models of the observable response with respect to the random variables in the KLE. When direct measurements of the parameter fields are available, we propose improving the accuracy of these surrogate models by representing the parameter fields via conditional Karhunen-Lo\'{e}ve expansions (CKLEs). CKLEs are constructed by conditioning the covariance kernel of the unconditional expansion on the direct measurements via Gaussian process regression and then truncating the corresponding KLE. We apply the proposed methodology to constructing surrogate models via the Basis Adaptation (BA) method of the stationary hydraulic head response, measured at spatially discrete observation locations, of a groundwater flow model of the Hanford Site, as a function of the 1,000-dimensional representation of the model's log-transmissivity field. We find that BA surrogate models of the hydraulic head based on CKLEs are more accurate than BA surrogate models based on unconditional expansions for forward uncertainty quantification tasks. Furthermore, we find that inverse estimates of the hydraulic transmissivity field computed using CKLE-based BA surrogate models are more accurate than those computed using unconditional BA surrogate models.
LongNet: Scaling Transformers to 1,000,000,000 Tokens
results: Experiments show that LongNet performs strongly on both long-sequence modeling and general language tasks, and that it can serve as a distributed trainer for extremely long sequences.
Abstract
Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, rendering the maximum sequence length restricted. To address this issue, we introduce LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens, without sacrificing the performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet has significant advantages: 1) it has a linear computation complexity and a logarithm dependency between any two tokens in a sequence; 2) it can be served as a distributed trainer for extremely long sequences; 3) its dilated attention is a drop-in replacement for standard attention, which can be seamlessly integrated with the existing Transformer-based optimization. Experiments results demonstrate that LongNet yields strong performance on both long-sequence modeling and general language tasks. Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
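The dilated attention pattern can be illustrated by listing which positions a query attends to: nearby positions at full resolution, and exponentially larger spans at exponentially coarser stride, so the attended set grows only logarithmically with sequence length while compute stays linear. Segment sizes and strides below are illustrative, not LongNet's exact configuration.

```python
import numpy as np

def dilated_indices(seq_len, segment=8):
    """Positions attended by the final query under a dilated scheme:
    the nearest span at stride 1, then spans of doubling width at
    doubling stride. The exact schedule is an assumption."""
    attended = []
    w, r, start = segment, 1, seq_len
    while start > 0:
        lo = max(start - w, 0)
        attended.extend(range(lo, start, r))   # stride r within this span
        start, w, r = lo, w * 2, r * 2         # double span and dilation
    return sorted(attended)

idx = dilated_indices(seq_len=64)
print(len(idx), "of 64 positions attended:", idx)
```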
for: The paper proposes Elastic Decision Transformer (EDT), an extension of the Decision Transformer (DT) that addresses trajectory stitching during action inference.
methods: The proposed method adjusts the history length maintained by DT at test time to implement trajectory stitching, retaining a longer history when the previous trajectory is optimal and a shorter one when it is sub-optimal.
results: On the D4RL locomotion benchmark and Atari games, EDT outperforms Q-learning-based methods, especially in multi-task scenarios.
Abstract
This paper introduces Elastic Decision Transformer (EDT), a significant advancement over the existing Decision Transformer (DT) and its variants. Although DT purports to generate an optimal trajectory, empirical evidence suggests it struggles with trajectory stitching, a process involving the generation of an optimal or near-optimal trajectory from the best parts of a set of sub-optimal trajectories. The proposed EDT differentiates itself by facilitating trajectory stitching during action inference at test time, achieved by adjusting the history length maintained in DT. Further, the EDT optimizes the trajectory by retaining a longer history when the previous trajectory is optimal and a shorter one when it is sub-optimal, enabling it to "stitch" with a more optimal trajectory. Extensive experimentation demonstrates EDT's ability to bridge the performance gap between DT-based and Q Learning-based approaches. In particular, the EDT outperforms Q Learning-based methods in a multi-task regime on the D4RL locomotion benchmark and Atari games. Videos are available at: https://kristery.github.io/edt/
paper_authors: Alexander Wei, Nika Haghtalab, Jacob Steinhardt
for: This work investigates why adversarial "jailbreak" attacks on large language models succeed and how such attacks can be created.
methods: The authors use two failure modes of safety training to guide jailbreak design: competing objectives and mismatched generalization.
results: Despite extensive red-teaming and safety training, existing models remain vulnerable; new attacks built on the identified failure modes succeed on every prompt in the evaluated collections of unsafe requests.
Abstract
Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of "jailbreak" attacks on early releases of ChatGPT that elicit undesired behavior. Going beyond recognition of the issue, we investigate why such attacks succeed and how they can be created. We hypothesize two failure modes of safety training: competing objectives and mismatched generalization. Competing objectives arise when a model's capabilities and safety goals conflict, while mismatched generalization occurs when safety training fails to generalize to a domain for which capabilities exist. We use these failure modes to guide jailbreak design and then evaluate state-of-the-art models, including OpenAI's GPT-4 and Anthropic's Claude v1.3, against both existing and newly designed attacks. We find that vulnerabilities persist despite the extensive red-teaming and safety-training efforts behind these models. Notably, new attacks utilizing our failure modes succeed on every prompt in a collection of unsafe requests from the models' red-teaming evaluation sets and outperform existing ad hoc jailbreaks. Our analysis emphasizes the need for safety-capability parity -- that safety mechanisms should be as sophisticated as the underlying model -- and argues against the idea that scaling alone can resolve these safety failure modes.
Conditional independence testing under model misspecification
paper_authors: Felipe Maia Polo, Yuekai Sun, Moulinath Banerjee
for: This paper studies the performance of regression-based conditional independence tests under model misspecification.
methods: The paper derives new approximations or upper bounds for the testing errors of three regression-based tests that depend on misspecification errors, and proposes the Rao-Blackwellized Predictor Test (RBPT), a new regression-based test that is robust against model misspecification.
results: Experiments show that RBPT provides better testing performance under model misspecification than existing tests.
Abstract
Conditional independence (CI) testing is fundamental and challenging in modern statistics and machine learning. Many modern methods for CI testing rely on powerful supervised learning methods to learn regression functions or Bayes predictors as an intermediate step. Although the methods are guaranteed to control Type-I error when the supervised learning methods accurately estimate the regression functions or Bayes predictors, their behavior is less understood when they fail due to model misspecification. In a broader sense, model misspecification can arise even when universal approximators (e.g., deep neural nets) are employed. Then, we study the performance of regression-based CI tests under model misspecification. Namely, we propose new approximations or upper bounds for the testing errors of three regression-based tests that depend on misspecification errors. Moreover, we introduce the Rao-Blackwellized Predictor Test (RBPT), a novel regression-based CI test robust against model misspecification. Finally, we conduct experiments with artificial and real data, showcasing the usefulness of our theory and methods.
Linear Regression on Manifold Structured Data: the Impact of Extrinsic Geometry on Solutions
results: 研究发现,当嵌入子流形在某些维度上是平坦的时,线性回归不存在唯一解;否则,流形的曲率(或曲率局部为零时参数化中的高阶非线性)可能对回归解产生显著影响,尤其是沿流形法方向的解分量。这些发现揭示了数据流形几何在保证回归模型分布外推断稳定性方面的作用。Abstract
In this paper, we study linear regression applied to data structured on a manifold. We assume that the data manifold is smooth and is embedded in a Euclidean space, and our objective is to reveal the impact of the data manifold's extrinsic geometry on the regression. Specifically, we analyze the impact of the manifold's curvatures (or higher order nonlinearity in the parameterization when the curvatures are locally zero) on the uniqueness of the regression solution. Our findings suggest that the corresponding linear regression does not have a unique solution when the embedded submanifold is flat in some dimensions. Otherwise, the manifold's curvature (or higher order nonlinearity in the embedding) may contribute significantly, particularly in the solution associated with the normal directions of the manifold. Our findings thus reveal the role of data manifold geometry in ensuring the stability of regression models for out-of-distribution inferences.
摘要
在这篇论文中,我们研究了应用于流形结构数据的线性回归。我们假设数据流形是光滑的,并嵌入在欧氏空间中;我们的目标是揭示数据流形的外在几何对回归的影响。具体而言,我们分析了流形曲率(或曲率局部为零时参数化中的高阶非线性)对回归解唯一性的影响。我们发现,当嵌入子流形在某些维度上是平坦的时,相应的线性回归没有唯一解;否则,流形的曲率(或嵌入中的高阶非线性)可能贡献显著,尤其是在与流形法方向相关的解分量中。我们的发现由此揭示了数据流形几何在保证回归模型分布外推断稳定性方面的作用。
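The non-uniqueness claim can be checked numerically. The following sketch (our construction, not the authors' code) shows that for data lying on a flat embedded line the least-squares normal matrix is rank-deficient, while curvature restores full rank:

```python
import numpy as np

t = np.linspace(-1, 1, 50)
flat   = np.stack([t, 2 * t], axis=1)     # points on a line: flat submanifold of R^2
curved = np.stack([t, t ** 2], axis=1)    # points on a parabola: nonzero curvature

for name, X in [("flat", flat), ("curved", curved)]:
    rank = np.linalg.matrix_rank(X.T @ X)
    print(name, "rank of X^T X:", rank, "of", X.shape[1])
# flat   rank of X^T X: 1 of 2  -> minimizers form a line; solution not unique
# curved rank of X^T X: 2 of 2  -> unique least-squares solution
```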
Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources
paper_authors: Feiyang Kang, Hoang Anh Just, Anit Kumar Sahu, Ruoxi Jia
for: This paper aims to improve data selection for machine learning models in the scenario where only a limited subset of samples from each prospective source is revealed before an acquisition decision is made.
methods: The proposed framework uses a two-stage performance inference process: 1. Leveraging the Optimal Transport distance to predict the model's performance for any data mixture ratio within the range of disclosed data sizes. 2. Extrapolating the performance to larger undisclosed data sizes with a novel parameter-free mapping technique inspired by neural scaling laws.
results: The paper demonstrates that the proposed framework significantly improves existing performance scaling approaches in terms of both the accuracy of performance inference and the computational cost of constructing the performance predictor, and that it outperforms other off-the-shelf solutions in data selection effectiveness.
Here is the information in Simplified Chinese text:
for: 这篇论文旨在改进机器学习模型的数据选择过程,重点关注在做出采购决策之前只能看到每个候选来源少量样本的情形。
methods: 所提框架采用两阶段性能推断过程:1. 利用最优传输(Optimal Transport)距离,预测模型在已披露数据规模范围内任意数据混合比例下的性能;2. 基于一种受神经缩放定律启发的新型无参数映射技术,将性能外推到更大的未披露数据规模。
results: 论文表明,所提框架在性能推断的准确性和构建性能预测器的计算成本两方面均显著优于现有的性能缩放方法,并在数据选择效果上大幅超越其他现成解决方案。Abstract
Traditionally, data selection has been studied in settings where all samples from prospective sources are fully revealed to a machine learning developer. However, in practical data exchange scenarios, data providers often reveal only a limited subset of samples before an acquisition decision is made. Recently, there have been efforts to fit scaling laws that predict model performance at any size and data source composition using the limited available samples. However, these scaling functions are black-box, computationally expensive to fit, highly susceptible to overfitting, and/or difficult to optimize for data selection. This paper proposes a framework called , which predicts model performance and supports data selection decisions based on partial samples of prospective data sources. Our approach distinguishes itself from existing work by introducing a novel *two-stage* performance inference process. In the first stage, we leverage the Optimal Transport distance to predict the model's performance for any data mixture ratio within the range of disclosed data sizes. In the second stage, we extrapolate the performance to larger undisclosed data sizes based on a novel parameter-free mapping technique inspired by neural scaling laws. We further derive an efficient gradient-based method to select data sources based on the projected model performance. Evaluation over a diverse range of applications demonstrates that the proposed framework significantly improves existing performance scaling approaches in terms of both the accuracy of performance inference and the computation costs associated with constructing the performance predictor. It also outperforms a range of other off-the-shelf solutions in data selection effectiveness by a wide margin.
摘要
传统上,数据选择是在候选来源的全部样本都完全展示给机器学习开发者的设定下研究的。然而,在实际的数据交换场景中,数据提供方通常在采购决策之前只披露一小部分样本。近来,有研究尝试利用有限的可用样本来拟合缩放定律,以预测模型在任意规模和数据来源组合下的性能。然而,这些缩放函数是黑盒的、拟合计算代价高、极易过拟合,或难以针对数据选择进行优化。本文提出一个框架,可基于候选数据来源的部分样本预测模型性能并支持数据选择决策。我们的方法与现有工作的不同之处在于引入了一种新颖的两阶段性能推断过程:第一阶段利用最优传输距离预测模型在已披露数据规模范围内任意数据混合比例下的性能;第二阶段借助一种受神经缩放定律启发的新型无参数映射技术,将性能外推到更大的未披露数据规模。我们进一步推导出一种高效的基于梯度的方法,依据预测的模型性能来选择数据来源。在多种应用上的评估表明,所提框架在性能推断准确性和构建性能预测器的计算成本上均显著优于现有的性能缩放方法,并在数据选择效果上大幅超越一系列现成解决方案。
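To make stage one concrete, the sketch below (our stand-in, not the paper's code) computes the exact OT distance between a candidate data mixture and a held-out set, using the fact that OT between two equal-size uniform empirical measures reduces to an assignment problem; a scaling-law-style mapping (stage two) would then turn such distances into performance predictions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def ot_distance(source, target):
    """Exact OT cost between two equal-size uniform empirical measures."""
    cost = cdist(source, target)               # pairwise ground costs
    rows, cols = linear_sum_assignment(cost)   # optimal coupling is a permutation here
    return cost[rows, cols].mean()

rng = np.random.default_rng(0)
val = rng.normal(size=(256, 8))                # held-out validation features
for ratio in (0.0, 0.5, 1.0):                  # mixture ratio between two sources
    mix = np.where(rng.random((256, 1)) < ratio,
                   rng.normal(1.0, 1.0, (256, 8)),   # source A (shifted away from val)
                   rng.normal(0.0, 1.0, (256, 8)))   # source B (matches val)
    print(f"ratio={ratio:.1f}  OT distance to val: {ot_distance(mix, val):.3f}")
```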
Gaussian Database Alignment and Gaussian Planted Matching
results: 作者发现,当数据库特征维度为 ω(log n) 且没有单个特征过强时,数据库对齐的性能阈值收敛到植入匹配的阈值。此外,作者还研究了各种约束的松弛,以理解它们在不同条件下的作用,并给出了可达性界与逆向界。Abstract
Database alignment is a variant of the graph alignment problem: Given a pair of anonymized databases containing separate yet correlated features for a set of users, the problem is to identify the correspondence between the features and align the anonymized user sets based on correlation alone. This closely relates to planted matching, where given a bigraph with random weights, the goal is to identify the underlying matching that generated the given weights. We study an instance of the database alignment problem with multivariate Gaussian features and derive results that apply both for database alignment and for planted matching, demonstrating the connection between them. The performance thresholds for database alignment converge to that for planted matching when the dimensionality of the database features is \(\omega(\log n)\), where \(n\) is the size of the alignment, and no individual feature is too strong. The maximum likelihood algorithms for both planted matching and database alignment take the form of a linear program and we study relaxations to better understand the significance of various constraints under various conditions and present achievability and converse bounds. Our results show that the almost-exact alignment threshold for the relaxed algorithms coincide with that of maximum likelihood, while there is a gap between the exact alignment thresholds. Our analysis and results extend to the unbalanced case where one user set is not fully covered by the alignment.
摘要
<>translate "Database alignment is a variant of the graph alignment problem: Given a pair of anonymized databases containing separate yet correlated features for a set of users, the problem is to identify the correspondence between the features and align the anonymized user sets based on correlation alone. This closely relates to planted matching, where given a bigraph with random weights, the goal is to identify the underlying matching that generated the given weights. We study an instance of the database alignment problem with multivariate Gaussian features and derive results that apply both for database alignment and for planted matching, demonstrating the connection between them. The performance thresholds for database alignment converge to that for planted matching when the dimensionality of the database features is ω(log n), where n is the size of the alignment, and no individual feature is too strong. The maximum likelihood algorithms for both planted matching and database alignment take the form of a linear program and we study relaxations to better understand the significance of various constraints under various conditions and present achievability and converse bounds. Our results show that the almost-exact alignment threshold for the relaxed algorithms coincide with that of maximum likelihood, while there is a gap between the exact alignment thresholds. Our analysis and results extend to the unbalanced case where one user set is not fully covered by the alignment."into Simplified Chinese. Here's the translation:数据库对应问题是图像问题的变种:给定一对匿名化的数据库,它们包含用户特征的分离 yet 相关的特征,问题是将这些特征对应起来,并将匿名化用户集相互对应。这与植入匹配问题密切相关,给定一个大Graph with random weights,问题是找出该大Graph中的匹配。我们研究了一个包含多变量 Gaussian 特征的数据库对应问题,并 derive 结果适用于数据库对应和植入匹配。我们发现当数据库特征维度为 ω(log n) 时,对应性reshold 与植入匹配问题的性能reshold 相同,其中 n 是对应的大小。此外,我们还发现了在不同条件下不同约束的放松关系,并 analyze 其可行性和反向关系。我们的结果表明,在放松问题中,准确对应的阈值与最大可能性问题的阈值相同,但是准确对应的阈值与最大可能性问题的阈值之间存在差异。我们的分析和结果还扩展到不均衡的情况,其中一个用户集不完全覆盖对应。
Transgressing the boundaries: towards a rigorous understanding of deep learning and its (non-)robustness
for: This paper focuses on the robustness issues of deep learning (DL) and bridges concerns and attempts from approximation theory to statistical learning theory.
methods: The paper reviews Bayesian Deep Learning as a means for uncertainty quantification and rigorous explainability.
results: The paper provides a systematic mathematical approach to understanding the specifics of DL and its success in various applications.Abstract
The recent advances in machine learning in various fields of applications can be largely attributed to the rise of deep learning (DL) methods and architectures. Despite being a key technology behind autonomous cars, image processing, speech recognition, etc., a notorious problem remains the lack of theoretical understanding of DL and related interpretability and (adversarial) robustness issues. Understanding the specifics of DL, as compared to, say, other forms of nonlinear regression methods or statistical learning, is interesting from a mathematical perspective, but at the same time it is of crucial importance in practice: treating neural networks as mere black boxes might be sufficient in certain cases, but many applications require waterproof performance guarantees and a deeper understanding of what could go wrong and why it could go wrong. It is probably fair to say that, despite being mathematically well founded as a method to approximate complicated functions, DL is mostly still more like modern alchemy that is firmly in the hands of engineers and computer scientists. Nevertheless, it is evident that certain specifics of DL that could explain its success in applications demands systematic mathematical approaches. In this work, we review robustness issues of DL and particularly bridge concerns and attempts from approximation theory to statistical learning theory. Further, we review Bayesian Deep Learning as a means for uncertainty quantification and rigorous explainability.
摘要
机器学习近来在各应用领域的进展,很大程度上可归功于深度学习(DL)方法与架构的兴起。尽管深度学习是自动驾驶、图像处理、语音识别等技术背后的关键,其理论理解的缺乏,以及随之而来的可解释性与(对抗)鲁棒性问题,仍是众所周知的难题。从数学角度看,理解深度学习相较于其他非线性回归方法或统计学习的特殊之处本身就饶有趣味;而在实践中这更是至关重要:在某些场景下将神经网络当作黑盒也许足够,但许多应用需要无懈可击的性能保证,以及对"可能出什么错、为何出错"的更深入理解。可以说,尽管深度学习作为逼近复杂函数的方法在数学上有坚实根基,它在很大程度上仍更像一门掌握在工程师和计算机科学家手中的现代炼金术。然而,显然需要系统性的数学方法来解释深度学习在应用中取得成功的那些特质。在这项工作中,我们回顾了深度学习的鲁棒性问题,并特别将逼近论与统计学习理论中的关切和尝试联系起来。此外,我们回顾了贝叶斯深度学习,作为不确定性量化与严格可解释性的一种手段。
Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models
results: 研究发现,更先进的语言模型是更有效的"杀手",在 24 组两两比较中有 18 组胜过较小的模型。次要指标表明,这种提升并非源于不同的行动,而是源于讨论中更强的说服技巧。Abstract
Are current language models capable of deception and lie detection? We study this question by introducing a text-based game called $\textit{Hoodwinked}$, inspired by Mafia and Among Us. Players are locked in a house and must find a key to escape, but one player is tasked with killing the others. Each time a murder is committed, the surviving players have a natural language discussion then vote to banish one player from the game. We conduct experiments with agents controlled by GPT-3, GPT-3.5, and GPT-4 and find evidence of deception and lie detection capabilities. The killer often denies their crime and accuses others, leading to measurable effects on voting outcomes. More advanced models are more effective killers, outperforming smaller models in 18 of 24 pairwise comparisons. Secondary metrics provide evidence that this improvement is not mediated by different actions, but rather by stronger persuasive skills during discussions. To evaluate the ability of AI agents to deceive humans, we make this game publicly available at https://hoodwinked.ai/.
摘要
当前的语言模型是否具备欺骗与测谎的能力?我们通过引入一款名为 Hoodwinked 的文字游戏来研究这一问题,其灵感来自"黑手党"(Mafia)和《Among Us》。玩家被困在一所房子里,必须找到钥匙才能逃脱,但其中一名玩家的任务是杀死其他人。每当发生一起谋杀,幸存的玩家会进行自然语言讨论,然后投票将一名玩家驱逐出游戏。我们用 GPT-3、GPT-3.5 和 GPT-4 控制的智能体进行实验,发现了欺骗与测谎能力的证据:杀手经常否认罪行并指控他人,对投票结果产生可测量的影响。更先进的模型是更有效的杀手,在 24 组两两比较中有 18 组优于较小的模型。次要指标表明,这种提升并非由不同的行动所致,而是源于讨论中更强的说服技巧。为了评估 AI 智能体欺骗人类的能力,我们将该游戏公开发布于 https://hoodwinked.ai/ 。
An Exploratory Literature Study on Sharing and Energy Use of Language Models for Source Code
results: 在 494 篇去重后的出版物中,293 篇相关文献使用语言模型解决源代码相关任务,其中仅 27%(79/293)公开了可复用的工件(artifacts)。研究还收集了训练所用硬件与训练时长的信息,以考察训练过程的能耗。鉴于 40% 的受调查论文既未共享源代码也未共享训练工件,作者建议同时共享源代码与训练工件,以实现可持续的可复现性。Abstract
Large language models trained on source code can support a variety of software development tasks, such as code recommendation and program repair. Large amounts of data for training such models benefit the models' performance. However, the size of the data and models results in long training times and high energy consumption. While publishing source code allows for replicability, users need to repeat the expensive training process if models are not shared. The main goal of the study is to investigate if publications that trained language models for software engineering (SE) tasks share source code and trained artifacts. The second goal is to analyze the transparency on training energy usage. We perform a snowballing-based literature search to find publications on language models for source code, and analyze their reusability from a sustainability standpoint. From 494 unique publications, we identified 293 relevant publications that use language models to address code-related tasks. Among them, 27% (79 out of 293) make artifacts available for reuse. This can be in the form of tools or IDE plugins designed for specific tasks or task-agnostic models that can be fine-tuned for a variety of downstream tasks. Moreover, we collect insights on the hardware used for model training, as well as training time, which together determine the energy consumption of the development process. We find that there are deficiencies in the sharing of information and artifacts for current studies on source code models for software engineering tasks, with 40% of the surveyed papers not sharing source code or trained artifacts. We recommend the sharing of source code as well as trained artifacts, to enable sustainable reproducibility. Moreover, comprehensive information on training times and hardware configurations should be shared for transparency on a model's carbon footprint.
摘要
基于源代码训练的大型语言模型可以支持多种软件开发任务,例如代码推荐和程序修复。大量训练数据有利于模型性能,但数据与模型的规模也导致训练时间长、能耗高。公开源代码可以保证可复现性,但如果模型不共享,用户就必须重复昂贵的训练过程。本研究的首要目标是考察为软件工程(SE)任务训练语言模型的出版物是否共享了源代码与训练产物;第二个目标是分析训练能耗信息的透明度。我们采用雪球式文献检索来寻找关于源代码语言模型的出版物,并从可持续性角度分析其可复用性。在 494 篇去重后的出版物中,我们确定了 293 篇使用语言模型解决代码相关任务的相关文献,其中 27%(293 篇中的 79 篇)提供了可复用的产物,形式包括面向特定任务的工具或 IDE 插件,以及可针对多种下游任务微调的任务无关模型。此外,我们收集了模型训练所用硬件及训练时长的信息,二者共同决定了开发过程的能耗。我们发现,当前关于面向软件工程任务的源代码模型的研究在信息与产物共享方面存在不足,40% 的受调查论文未共享源代码或训练产物。我们建议共享源代码及训练产物,以实现可持续的可复现性;同时应公开训练时长与硬件配置的完整信息,以保证模型碳足迹的透明度。
results: 研究结果表明,异质的采样概率可以在保持子样本规模的同时,比均匀子采样带来更强的隐私和更好的效用;研究还给出了量化"较低选中概率带来更强隐私放大"与"较高权重放大单点影响"这两种效应之间权衡的一般性结果。Abstract
We examine the privacy-enhancing properties of subsampling a data set via importance sampling as a pre-processing step for differentially private mechanisms. This extends the established privacy amplification by subsampling result to importance sampling where each data point is weighted by the reciprocal of its selection probability. The implications for privacy of weighting each point are not obvious. On the one hand, a lower selection probability leads to a stronger privacy amplification. On the other hand, the higher the weight, the stronger the influence of the point on the output of the mechanism in the event that the point does get selected. We provide a general result that quantifies the trade-off between these two effects. We show that heterogeneous sampling probabilities can lead to both stronger privacy and better utility than uniform subsampling while retaining the subsample size. In particular, we formulate and solve the problem of privacy-optimal sampling, that is, finding the importance weights that minimize the expected subset size subject to a given privacy budget. Empirically, we evaluate the privacy, efficiency, and accuracy of importance sampling-based privacy amplification on the example of k-means clustering.
摘要
我们考察将重要性采样作为差分隐私机制预处理步骤对数据集进行子采样所带来的隐私增强性质。这将已有的"子采样隐私放大"结果推广到重要性采样:每个数据点以其被选中概率的倒数加权。加权对隐私的影响并不显而易见:一方面,较低的选中概率带来更强的隐私放大;另一方面,权重越高,该点一旦被选中,对机制输出的影响就越大。我们给出了量化这两种效应之间权衡的一般性结果,并证明异质的采样概率可以在保持子样本规模的同时,比均匀子采样同时获得更强的隐私和更好的效用。特别地,我们提出并求解了隐私最优采样问题,即在给定隐私预算下寻找使期望子集规模最小的重要性权重。实验方面,我们以 k-means 聚类为例,评估了基于重要性采样的隐私放大的隐私性、效率与准确性。
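A minimal sketch (our construction) of the pre-processing step itself: keep each point with its own probability p_i and carry the weight 1/p_i, so weighted statistics computed downstream remain approximately unbiased. The privacy analysis of this heterogeneous scheme is the paper's subject:

```python
import numpy as np

def importance_subsample(data, p, seed=0):
    """Keep point i with probability p[i]; weight survivors by 1/p[i]."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(data)) < p
    return data[keep], 1.0 / p[keep]

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, size=10_000)
dev = np.abs(x - x.mean())
p = np.clip(dev / dev.max(), 0.05, 1.0)   # heterogeneous, data-dependent rates
sub, w = importance_subsample(x, p)
print(f"kept {len(sub)} of {len(x)} points")
print("self-normalized weighted mean:", np.average(sub, weights=w))
print("full-data mean:", x.mean())
```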
External Reasoning: Towards Multi-Large-Language-Models Interchangeable Assistance with Human Feedback
results: 在多个 LLM 上取得了最先进的性能,超越了包括 ChatPDF.com 在内的现有解决方案,并且比让 LLM 直接处理全文更高效。Abstract
Memory is identified as a crucial human faculty that allows for the retention of visual and linguistic information within the hippocampus and neurons in the brain, which can subsequently be retrieved to address real-world challenges that arise through a lifetime of learning. The resolution of complex AI tasks through the application of acquired knowledge represents a stride toward the realization of artificial general intelligence. However, despite the prevalence of Large Language Models (LLMs) like GPT-3.5 and GPT-4, which have displayed remarkable capabilities in language comprehension, generation, interaction, and reasoning, they are inhibited by constraints on context length that preclude the processing of extensive, continually evolving knowledge bases. This paper proposes that LLMs could be augmented through the selective integration of knowledge from external repositories, and in doing so, introduces a novel methodology for External Reasoning, exemplified by ChatPDF. Central to this approach is the establishment of a tiered policy for External Reasoning based on Multiple LLM Interchange Assistance, where the level of support rendered is modulated across entry, intermediate, and advanced tiers based on the complexity of the query, with adjustments made in response to human feedback. A comprehensive evaluation of this methodology is conducted using multiple LLMs and the results indicate state-of-the-art performance, surpassing existing solutions including ChatPDF.com. Moreover, the paper emphasizes that this approach is more efficient compared to the direct processing of full text by LLMs.
摘要
记忆被认为是人类的一项关键能力,它使视觉与语言信息得以保存在大脑的海马体与神经元之中,并可在日后被检索,用于应对终身学习中出现的现实挑战。通过运用已获得的知识来解决复杂的 AI 任务,是迈向通用人工智能的一步。然而,尽管 GPT-3.5 和 GPT-4 等大型语言模型(LLM)在语言理解、生成、交互与推理方面展现出卓越能力,它们仍受上下文长度的限制,无法处理庞大且持续演化的知识库。本文提出,LLM 可以通过有选择地整合外部知识库中的知识得到增强,并由此引入一种新的"外部推理"方法,以 ChatPDF 为例。该方法的核心是建立一套"基于多 LLM 可互换协助的外部推理"分层策略:根据查询的复杂程度,在入门、中级和高级三个层级之间调节所提供的支持力度,并根据人类反馈进行调整。我们使用多个 LLM 对该方法进行了全面评估,结果表明其性能达到最先进水平,超越了包括 ChatPDF.com 在内的现有解决方案。此外,本文强调,该方法比由 LLM 直接处理全文更为高效。
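The tiered policy can be pictured with a toy routing function (our construction; the word-count thresholds and escalation rule are made up for illustration, only the entry/intermediate/advanced tiers come from the paper):

```python
def route(query, feedback_bad=False):
    tiers = ["entry", "intermediate", "advanced"]
    words = len(query.split())
    tier = 0 if words < 8 else (1 if words < 25 else 2)   # crude complexity proxy
    if feedback_bad and tier < 2:
        tier += 1                                         # escalate on negative human feedback
    return tiers[tier]

print(route("What is the paper's title?"))                     # entry
print(route("What is the paper's title?", feedback_bad=True))  # intermediate
```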
Exploring Continual Learning for Code Generation Models
for: This paper aims to address the issue of continual learning (CL) in the code domain, specifically for large-scale code generation models like Codex and CodeT5.
methods: The authors propose a new benchmark called CodeTask-CL that covers a wide range of coding tasks and compare popular CL techniques from NLP and Vision domains. They also introduce a new method called Prompt Pooling with Teacher Forcing (PP-TF) that addresses the issue of catastrophic forgetting in coding tasks.
results: The authors achieve a 21.54% improvement over Prompt Pooling with their proposed method PP-TF, demonstrating the effectiveness of their approach. They also establish a training pipeline for CL on code models that can be used for further development of CL methods.Here is the information in Simplified Chinese text:
for: 这篇论文目标是解决大规模代码生成模型如Codex和CodeT5的连续学习(CL)问题。
methods: 作者提出了一个名为 CodeTask-CL 的新基准,涵盖广泛的编程任务,并比较了 NLP 和计算机视觉领域流行的 CL 技术。他们还提出了一种名为 Prompt Pooling with Teacher Forcing (PP-TF) 的新方法,用于解决编程任务中的灾难性遗忘问题。
results: 作者通过 PP-TF 方法实现了21.54% 的提升,证明了他们的方法的有效性。他们还建立了一个用于 CL 的代码训练管道,这可以激励更多人开发 CL 方法。Abstract
Large-scale code generation models such as Codex and CodeT5 have achieved impressive performance. However, libraries are upgraded or deprecated very frequently and re-training large-scale language models is computationally expensive. Therefore, Continual Learning (CL) is an important aspect that remains underexplored in the code domain. In this paper, we introduce a benchmark called CodeTask-CL that covers a wide range of tasks, including code generation, translation, summarization, and refinement, with different input and output programming languages. Next, on our CodeTask-CL benchmark, we compare popular CL techniques from NLP and Vision domains. We find that effective methods like Prompt Pooling (PP) suffer from catastrophic forgetting due to the unstable training of the prompt selection mechanism caused by stark distribution shifts in coding tasks. We address this issue with our proposed method, Prompt Pooling with Teacher Forcing (PP-TF), that stabilizes training by enforcing constraints on the prompt selection mechanism and leads to a 21.54% improvement over Prompt Pooling. Along with the benchmark, we establish a training pipeline that can be used for CL on code models, which we believe can motivate further development of CL methods for code models. Our code is available at https://github.com/amazon-science/codetaskcl-pptf
摘要
如 Codex 和 CodeT5 等大规模代码生成模型已取得令人瞩目的性能。然而,代码库的升级与弃用十分频繁,而重新训练大规模语言模型的计算代价高昂。因此,持续学习(Continual Learning,CL)是代码领域一个重要却仍待深入探索的方向。本文提出名为 CodeTask-CL 的基准,涵盖代码生成、翻译、摘要和修订等多种任务,并支持不同的输入输出编程语言。随后,我们在 CodeTask-CL 基准上比较了来自 NLP 与视觉领域的流行 CL 技术。我们发现,Prompt Pooling(PP)等有效方法会因编程任务间剧烈的分布偏移导致提示选择机制训练不稳定,从而遭受灾难性遗忘。为此,我们提出 Prompt Pooling with Teacher Forcing(PP-TF)方法,通过对提示选择机制施加约束来稳定训练,相比 Prompt Pooling 提升了 21.54%。除基准之外,我们还建立了一条可用于代码模型持续学习的训练流水线,希望以此推动面向代码模型的 CL 方法的进一步发展。代码见 https://github.com/amazon-science/codetaskcl-pptf 。
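A schematic sketch of the idea as we read it (not the authors' code): vanilla prompt pooling lets a query pick its top-k prompts freely, which drifts under the sharp distribution shifts between coding tasks, whereas a teacher-forced assignment fixes which prompts each task may use during training:

```python
import numpy as np

pool = np.random.default_rng(0).normal(size=(10, 16))   # 10 learnable prompt keys
task_to_prompts = {"generation": [0, 1, 2], "translation": [3, 4, 5],
                   "summarization": [6, 7], "refinement": [8, 9]}  # fixed assignment

def select_prompts(query, task=None, k=3):
    sims = pool @ query                                  # key-query similarity
    if task is not None:                                 # training: constrained choice
        allowed = task_to_prompts[task]
        return [allowed[i] for i in np.argsort(-sims[allowed])[:k]]
    return list(np.argsort(-sims)[:k])                   # inference: free top-k

q = np.random.default_rng(1).normal(size=16)
print(select_prompts(q, task="translation"))   # drawn only from prompts 3..5
```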
A probabilistic, data-driven closure model for RANS simulations with aleatoric, model uncertainty
paper_authors: Atul Agrawal, Phaedon-Stelios Koutsourelakis
for: 这篇论文旨在为雷诺平均 Navier-Stokes(RANS)模拟提出一种数据驱动的闭合模型,该模型考虑了偶然性的模型不确定性。
methods: 该闭合由两部分组成:参数化部分采用此前提出的基于神经网络的张量基函数,这些基函数依赖于应变率张量与旋转率张量的不变量;此外还引入潜在随机变量,用以弥补模型误差。
results: 该模型即使在存在模型误差的区域,也能对所有流动量给出准确的概率式预测;作者在后向台阶分离流基准问题上展示了这一能力。Abstract
We propose a data-driven, closure model for Reynolds-averaged Navier-Stokes (RANS) simulations that incorporates aleatoric, model uncertainty. The proposed closure consists of two parts. A parametric one, which utilizes previously proposed, neural-network-based tensor basis functions dependent on the rate of strain and rotation tensor invariants. This is complemented by latent, random variables which account for aleatoric model errors. A fully Bayesian formulation is proposed, combined with a sparsity-inducing prior in order to identify regions in the problem domain where the parametric closure is insufficient and where stochastic corrections to the Reynolds stress tensor are needed. Training is performed using sparse, indirect data, such as mean velocities and pressures, in contrast to the majority of alternatives that require direct Reynolds stress data. For inference and learning, a Stochastic Variational Inference scheme is employed, which is based on Monte Carlo estimates of the pertinent objective in conjunction with the reparametrization trick. This necessitates derivatives of the output of the RANS solver, for which we developed an adjoint-based formulation. In this manner, the parametric sensitivities from the differentiable solver can be combined with the built-in, automatic differentiation capability of the neural network library in order to enable an end-to-end differentiable framework. We demonstrate the capability of the proposed model to produce accurate, probabilistic, predictive estimates for all flow quantities, even in regions where model errors are present, on a separated flow in the backward-facing step benchmark problem.
摘要
我们为雷诺平均 Navier-Stokes(RANS)模拟提出了一种包含偶然性模型不确定性的数据驱动闭合模型。该闭合由两部分组成:参数化部分采用此前提出的基于神经网络的张量基函数,这些基函数依赖于应变率张量与旋转率张量的不变量;与之互补的是用于刻画偶然性模型误差的潜在随机变量。我们提出了完全贝叶斯的表述,并结合诱导稀疏性的先验,以识别问题域中参数化闭合不足、需要对雷诺应力张量进行随机修正的区域。训练使用平均速度与压力等稀疏的间接数据,而多数替代方法需要直接的雷诺应力数据。在推断与学习方面,我们采用随机变分推断(Stochastic Variational Inference)方案,基于对相关目标函数的蒙特卡洛估计并结合重参数化技巧。这需要对 RANS 求解器的输出求导,为此我们开发了基于伴随方法的表述。如此一来,可微求解器给出的参数敏感度便能与神经网络库内建的自动微分能力相结合,构成端到端可微的框架。我们在后向台阶基准问题的分离流上展示了所提模型的能力:即使在存在模型误差的区域,它也能对所有流动量给出准确的概率式预测。
In-Context Learning for Attention Scheme: from Single Softmax Regression to Multiple Softmax Regression via a Tensor Trick
results: 通过对回归函数完成 Lipschitz 分析,作者得出了关于上下文学习的主要结果。Abstract
Large language models (LLMs) have brought significant and transformative changes to human society. These models have demonstrated remarkable capabilities in natural language understanding and generation, leading to various advancements and impacts across several domains. In this work, we consider in-context learning under two formulations of attention-related regression. Given matrices $A_1 \in \mathbb{R}^{n \times d}$, $A_2 \in \mathbb{R}^{n \times d}$, and $B \in \mathbb{R}^{n \times n}$, the purpose is to solve certain optimization problems: the normalized version $\min_{X} \| D(X)^{-1} \exp(A_1 X A_2^\top) - B \|_F^2$ and the rescaled version $\| \exp(A_1 X A_2^\top) - D(X) \cdot B \|_F^2$, where $D(X) := \mathrm{diag}( \exp(A_1 X A_2^\top) {\bf 1}_n )$. Our regression problem shares similarities with previous studies on softmax-related regression, which have extensively investigated the normalized version $\| \langle \exp(Ax) , {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2^2$ and the rescaled version $\| \exp(Ax) - \langle \exp(Ax), {\bf 1}_n \rangle b \|_2^2$. In contrast to previous approaches, we adopt a vectorization technique to address the regression problem in matrix formulation. This approach expands the dimension from $d$ to $d^2$, resembling the formulation of the regression problem mentioned earlier. Upon completing the Lipschitz analysis of our regression function, we derive our main result concerning in-context learning.
摘要
大型语言模型(LLM)给人类社会带来了深刻而变革性的影响。这些模型在自然语言理解与生成方面展现出卓越能力,在多个领域产生了广泛的进展与影响。本文在两种与注意力相关的回归表述下考察上下文学习。给定矩阵 $A_1 \in \mathbb{R}^{n \times d}$、$A_2 \in \mathbb{R}^{n \times d}$ 与 $B \in \mathbb{R}^{n \times n}$,目标是求解如下优化问题:归一化版本 $\min_{X} \| D(X)^{-1} \exp(A_1 X A_2^\top) - B \|_F^2$ 与重标版本 $\| \exp(A_1 X A_2^\top) - D(X) \cdot B \|_F^2$,其中 $D(X) := \mathrm{diag}( \exp(A_1 X A_2^\top) {\bf 1}_n )$。该回归问题与此前关于 softmax 回归的研究相似,后者已被广泛考察:归一化版本 $\| \langle \exp(Ax) , {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2^2$ 与重标版本 $\| \exp(Ax) - \langle \exp(Ax), {\bf 1}_n \rangle b \|_2^2$。与以往方法不同,我们采用向量化技巧在矩阵表述下处理该回归问题,这将维度从 $d$ 扩展到 $d^2$,与前述回归问题的表述相呼应。在完成对回归函数的 Lipschitz 分析之后,我们得到了关于上下文学习的主要结果。
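The normalized objective can be transcribed directly into numpy; the sketch below (ours) also shows the vectorization that lifts the unknown from a d x d matrix to a d^2-dimensional vector. Note that $D(X)^{-1} \exp(A_1 X A_2^\top)$ is exactly a row-wise softmax of the logits $A_1 X A_2^\top$:

```python
import numpy as np

def normalized_loss(X, A1, A2, B):
    logits = A1 @ X @ A2.T                                   # n x n
    E = np.exp(logits - logits.max(axis=1, keepdims=True))   # stabilized; the shift cancels in D(X)^{-1}
    P = E / E.sum(axis=1, keepdims=True)                     # D(X)^{-1} exp(.) = row-softmax
    return np.sum((P - B) ** 2)                              # squared Frobenius norm

rng = np.random.default_rng(0)
n, d = 6, 4
A1, A2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
B = np.full((n, n), 1.0 / n)          # target: uniform attention rows
x_vec = rng.normal(size=d * d)        # vectorized unknown, dimension d^2
print(normalized_loss(x_vec.reshape(d, d), A1, A2, B))
```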
Multi-objective Deep Reinforcement Learning for Mobile Edge Computing
paper_authors: Ning Yang, Junrui Wen, Meng Zhang, Ming Tang
For: The paper targets next-generation mobile network applications that prioritize multiple performance metrics, including delay and energy consumption, and addresses the challenge of unknown preferences in multi-objective resource scheduling for mobile edge computing (MEC) systems.* Methods: The paper uses a multi-objective reinforcement learning (MORL) scheme with proximal policy optimization (PPO), introduces a well-designed state encoding method for constructing features for the multiple edges in MEC systems, and a sophisticated reward function for accurately computing the utilities of delay and energy consumption.* Results: The proposed MORL scheme enhances the hypervolume of the Pareto front by up to 233.1% compared to benchmarks.Here's the information in Simplified Chinese text:* For: 这篇论文面向下一代移动网络应用,这些应用关注延迟与能耗等多种性能指标;论文解决了移动边缘计算(MEC)系统多目标资源调度中偏好未知的难题。* Methods: 论文采用基于近端策略优化(PPO)的多目标强化学习(MORL)方案,并引入面向多边缘节点的状态编码方法,以及能准确计算时延与能耗效用的奖励函数。* Results: 实验结果表明,所提 MORL 方案相比基准将 Pareto 前沿的超体积(hypervolume)最多提升了 233.1%。Abstract
Mobile edge computing (MEC) is essential for next-generation mobile network applications that prioritize various performance metrics, including delays and energy consumption. However, conventional single-objective scheduling solutions cannot be directly applied to practical systems in which the preferences of these applications (i.e., the weights of different objectives) are often unknown or challenging to specify in advance. In this study, we address this issue by formulating a multi-objective offloading problem for MEC with multiple edges to minimize expected long-term energy consumption and transmission delay while considering unknown preferences as parameters. To address the challenge of unknown preferences, we design a multi-objective (deep) reinforcement learning (MORL)-based resource scheduling scheme with proximal policy optimization (PPO). In addition, we introduce a well-designed state encoding method for constructing features for multiple edges in MEC systems, a sophisticated reward function for accurately computing the utilities of delay and energy consumption. Simulation results demonstrate that our proposed MORL scheme enhances the hypervolume of the Pareto front by up to 233.1% compared to benchmarks. Our full framework is available at https://github.com/gracefulning/mec_morl_multipolicy.
摘要
移动边缘计算(MEC)是下一代移动网络应用的关键技术,这些应用关注延迟与能耗等多种性能指标。然而,传统的单目标调度方案无法直接用于实际系统,因为这些应用的偏好(即各目标的权重)往往未知或难以事先指定。本研究针对这一问题,将多边缘 MEC 的多目标卸载问题形式化为:在把未知偏好作为参数的前提下,最小化期望长期能耗与传输时延。为应对未知偏好的挑战,我们设计了一种基于多目标(深度)强化学习(MORL)并采用近端策略优化(PPO)的资源调度方案。此外,我们提出了一种精心设计的状态编码方法,用于构建 MEC 系统中多个边缘节点的特征,以及一个能准确计算时延与能耗效用的精细奖励函数。仿真结果表明,所提 MORL 方案相比基准将 Pareto 前沿的超体积最多提升了 233.1%。完整框架见 https://github.com/gracefulning/mec_morl_multipolicy 。
OpenDelta: A Plug-and-play Library for Parameter-efficient Adaptation of Pre-trained Models
results: OpenDelta 提供了一个简洁、模块化、可扩展的平台,帮助研究者和实践者高效地适配大型预训练模型。Abstract
The scale of large pre-trained models (PTMs) poses significant challenges in adapting to downstream tasks due to the high optimization overhead and storage costs associated with full-parameter fine-tuning. To address this, many studies explore parameter-efficient tuning methods, also framed as "delta tuning", which updates only a small subset of parameters, known as "delta modules", while keeping the backbone model's parameters fixed. However, the practicality and flexibility of delta tuning have been limited due to existing implementations that directly modify the code of the backbone PTMs and hard-code specific delta tuning methods for each PTM. In this paper, we present OpenDelta, an open-source library that overcomes these limitations by providing a plug-and-play implementation of various delta tuning methods. Our novel techniques eliminate the need to modify the backbone PTMs' code, making OpenDelta compatible with different, even novel PTMs. OpenDelta is designed to be simple, modular, and extensible, providing a comprehensive platform for researchers and practitioners to adapt large PTMs efficiently.
摘要
大型预训练模型(PTM)的规模给下游任务适配带来了巨大挑战,因为全参数微调的优化开销与存储成本都很高。为此,许多研究探索参数高效的微调方法,也称为"delta tuning":只更新一小部分参数(称为"delta 模块"),而保持骨干模型的参数不变。然而,现有实现直接修改骨干 PTM 的代码,并为每个 PTM 硬编码特定的 delta tuning 方法,这限制了 delta tuning 的实用性与灵活性。在本文中,我们介绍开源库 OpenDelta,它通过提供各种 delta tuning 方法的即插即用实现克服了上述限制。我们的新技术无需修改骨干 PTM 的代码,使 OpenDelta 能够兼容不同的、甚至是全新的 PTM。OpenDelta 的设计简洁、模块化且可扩展,为研究者与实践者高效适配大型 PTM 提供了一个全面的平台。
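The delta-tuning pattern itself is easy to sketch without OpenDelta's actual API: freeze the backbone and train only a small additive module. Below is a generic low-rank delta (our construction) around a frozen nn.Linear:

```python
import torch
import torch.nn as nn

class DeltaLinear(nn.Module):
    def __init__(self, backbone: nn.Linear, rank: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # freeze the pre-trained weights
            p.requires_grad_(False)
        d_out, d_in = backbone.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)   # trainable delta factor
        self.B = nn.Parameter(torch.zeros(d_out, rank))         # zero-init: delta starts at zero

    def forward(self, x):
        return self.backbone(x) + x @ self.A.T @ self.B.T       # W x + (B A) x

layer = DeltaLinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total}")   # only the delta module is updated
```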
$\nu^2$-Flows: Fast and improved neutrino reconstruction in multi-neutrino final states with conditional normalizing flows
results: 在 $t\bar{t}$ 双轻子事件中,$\nu^2$-Flows 比最常用的标准解析方法更准确地重建两个中微子的动量及其关联,并能为所有事件找到解。其推断时间显著短于其他方法,还可以通过在 GPU 上并行执行进一步加速。应用于 $t\bar{t}$ 双轻子事件时,展开分布中每个分箱的统计精度比 Neutrino Weighting 方法提高 1.5 至 2 倍,比 Ellipse 方法最高提高 4 倍。Abstract
In this work we introduce $\nu^2$-Flows, an extension of the $\nu$-Flows method to final states containing multiple neutrinos. The architecture can natively scale for all combinations of object types and multiplicities in the final state for any desired neutrino multiplicities. In $t\bar{t}$ dilepton events, the momenta of both neutrinos and correlations between them are reconstructed more accurately than when using the most popular standard analytical techniques, and solutions are found for all events. Inference time is significantly faster than competing methods, and can be reduced further by evaluating in parallel on graphics processing units. We apply $\nu^2$-Flows to $t\bar{t}$ dilepton events and show that the per-bin uncertainties in unfolded distributions is much closer to the limit of performance set by perfect neutrino reconstruction than standard techniques. For the chosen double differential observables $\nu^2$-Flows results in improved statistical precision for each bin by a factor of 1.5 to 2 in comparison to the Neutrino Weighting method and up to a factor of four in comparison to the Ellipse approach.
摘要
在这项工作中,我们引入 $\nu^2$-Flows,它将 $\nu$-Flows 方法推广到包含多个中微子的末态。该架构可原生扩展到末态中任意对象类型与数目的组合,以及任意所需的中微子数目。在 $t\bar{t}$ 双轻子事件中,两个中微子的动量及其之间的关联比使用最流行的标准解析技术重建得更准确,并且所有事件都能找到解。推断时间显著快于其他方法,还可以通过在图形处理器上并行计算进一步缩短。我们将 $\nu^2$-Flows 应用于 $t\bar{t}$ 双轻子事件,结果显示展开分布中每个分箱的不确定度比标准技术更接近由完美中微子重建所设定的性能极限。对于所选的二维微分观测量,$\nu^2$-Flows 使每个分箱的统计精度相比 Neutrino Weighting 方法提高 1.5 至 2 倍,相比 Ellipse 方法最高提高 4 倍。
Unbalanced Optimal Transport: A Unified Framework for Object Detection
results: 实验表明,使用非平衡最优传输训练目标检测模型,在平均精度与平均召回率上均可达到最先进水平,并具有更快的初始收敛。Abstract
During training, supervised object detection tries to correctly match the predicted bounding boxes and associated classification scores to the ground truth. This is essential to determine which predictions are to be pushed towards which solutions, or to be discarded. Popular matching strategies include matching to the closest ground truth box (mostly used in combination with anchors), or matching via the Hungarian algorithm (mostly used in anchor-free methods). Each of these strategies comes with its own properties, underlying losses, and heuristics. We show how Unbalanced Optimal Transport unifies these different approaches and opens a whole continuum of methods in between. This allows for a finer selection of the desired properties. Experimentally, we show that training an object detection model with Unbalanced Optimal Transport is able to reach the state-of-the-art both in terms of Average Precision and Average Recall as well as to provide a faster initial convergence. The approach is well suited for GPU implementation, which proves to be an advantage for large-scale models.
摘要
在训练过程中,有监督目标检测需要将预测的边界框及其分类分数与真实标注正确匹配,以决定哪些预测应被推向哪些目标、哪些应被舍弃。常用的匹配策略包括匹配到最近的真实框(多与锚框结合使用),或通过匈牙利算法匹配(多用于无锚框方法)。每种策略都有其自身的性质、底层损失和启发式规则。我们展示了非平衡最优传输(Unbalanced Optimal Transport)如何统一这些不同方法,并在它们之间打开一整个连续的方法谱系,从而可以更精细地选择所需性质。实验表明,使用非平衡最优传输训练目标检测模型,在平均精度(Average Precision)和平均召回率(Average Recall)上均可达到最先进水平,且初始收敛更快。该方法非常适合 GPU 实现,这对大规模模型是一项优势。
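To see the continuum concretely, here is a compact entropic-OT matcher (our simplification: balanced Sinkhorn with uniform masses, not the paper's unbalanced formulation) that softly assigns predictions to ground-truth boxes given a cost matrix such as 1 - IoU; the Hungarian and nearest-box strategies arise as limiting cases:

```python
import numpy as np

def sinkhorn(cost, eps=0.1, iters=200):
    """Entropic OT plan between uniform masses over predictions and targets."""
    K = np.exp(-cost / eps)
    a = np.full(cost.shape[0], 1.0 / cost.shape[0])   # mass on predictions
    b = np.full(cost.shape[1], 1.0 / cost.shape[1])   # mass on ground truths
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)                             # alternating scaling updates
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]                # transport plan

rng = np.random.default_rng(0)
cost = rng.random((5, 3))           # e.g., 1 - IoU between 5 predictions and 3 boxes
plan = sinkhorn(cost)
print(plan.round(3))                # soft assignment of predictions to boxes
```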
A Versatile Hub Model For Efficient Information Propagation And Feature Selection
for: 本文研究生物大脑的枢纽(hub)拓扑特征,以及如何利用这一特征提升信息传递与认知处理。
methods: 本文提出了枢纽结构的数学模型,并使用回声状态网络(Echo State Network, ESN)研究其机制基础。
results: 研究发现,引入枢纽结构可显著提升模型性能,其机制在于促进高效的信息处理与更好的特征提取。Abstract
Hub structure, characterized by a few highly interconnected nodes surrounded by a larger number of nodes with fewer connections, is a prominent topological feature of biological brains, contributing to efficient information transfer and cognitive processing across various species. In this paper, a mathematical model of hub structure is presented. The proposed method is versatile and can be broadly applied to both computational neuroscience and Recurrent Neural Networks (RNNs) research. We employ the Echo State Network (ESN) as a means to investigate the mechanistic underpinnings of hub structures. Our findings demonstrate a substantial enhancement in performance upon incorporating the hub structure. Through comprehensive mechanistic analyses, we show that the hub structure improves model performance by facilitating efficient information processing and better feature extractions.
摘要translate("Hub structure, characterized by a few highly interconnected nodes surrounded by a larger number of nodes with fewer connections, is a prominent topological feature of biological brains, contributing to efficient information transfer and cognitive processing across various species. In this paper, a mathematical model of hub structure is presented. The proposed method is versatile and can be broadly applied to both computational neuroscience and Recurrent Neural Networks (RNNs) research. We employ the Echo State Network (ESN) as a means to investigate the mechanistic underpinnings of hub structures. Our findings demonstrate a substantial enhancement in performance upon incorporating the hub structure. Through comprehensive mechanistic analyses, we show that the hub structure improves model performance by facilitating efficient information processing and better feature extractions."into Simplified Chinese) traducción(" estructura de nudo principal, caracterizada por un pequeño número de nodos altamente interconectados rodeados por un mayor número de nodos con menos conexiones, es una característica topológica destacada de los cerebros biológicos, que contribuye a la transferencia eficiente de información y el procesamiento cognitivo en diversas especies. En este artículo, se presenta un modelo matemático de la estructura de nudo. El método propuesto es versátil y puede ser ampliamente aplicado en investigaciones de neurociencia computacional y redes neuronales recurrentes (RNNs). Utilizamos la Red de Estado Eco (ESN) como medio para investigar las bases mecanicistas de las estructuras de nudo. Nuestros hallazgos demuestran una mejora substancial en el rendimiento al incorporar la estructura de nudo. A través de análisis mecanicistas exhaustivos, demostramos que la estructura de nudo mejora el rendimiento del modelo al facilitar el procesamiento eficiente de información y la extracción de características mejor."
Causal Discovery with Language Models as Imperfect Experts
results: 作者在真实数据上进行了案例研究,其中使用大型语言模型作为不完美的专家。Abstract
Understanding the causal relationships that underlie a system is a fundamental prerequisite to accurate decision-making. In this work, we explore how expert knowledge can be used to improve the data-driven identification of causal graphs, beyond Markov equivalence classes. In doing so, we consider a setting where we can query an expert about the orientation of causal relationships between variables, but where the expert may provide erroneous information. We propose strategies for amending such expert knowledge based on consistency properties, e.g., acyclicity and conditional independencies in the equivalence class. We then report a case study, on real data, where a large language model is used as an imperfect expert.
摘要
理解系统背后的因果关系是准确决策的基本前提。在这项工作中,我们探究如何利用专家知识来改进数据驱动的因果图识别,使其超越马尔可夫等价类。为此,我们考虑这样一种设定:可以向专家询问变量间因果关系的方向,但专家可能提供错误信息。我们提出了基于一致性性质(例如无环性以及等价类中的条件独立性)来修正此类专家知识的策略。随后,我们报告了一个在真实数据上的案例研究,其中使用大型语言模型作为不完美的专家。
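The consistency-based amendment can be illustrated with the acyclicity check alone: an expert-proposed orientation is rejected if accepting it would close a directed cycle. A toy sketch (our construction):

```python
from collections import defaultdict

def creates_cycle(edges, u, v):
    """Would adding u -> v close a directed cycle? (search from v for a path back to u)"""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(adj[node])
    return False

accepted = [("A", "B"), ("B", "C")]
for u, v in [("C", "A"), ("A", "C")]:        # expert proposals; the first is erroneous
    if creates_cycle(accepted, u, v):
        print(f"reject {u}->{v}: violates acyclicity")
    else:
        accepted.append((u, v))
        print(f"accept {u}->{v}")
```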
results: 实验结果显示,所提 CNN 在标注数据有限的场景下将最先进水平最多提升了 8% 的 Dice 分数;所提自监督学习方法可将性能进一步提升最多 23%,且无论采用何种网络架构均有收益。Abstract
Deep learning has become a valuable tool for the automation of certain medical image segmentation tasks, significantly relieving the workload of medical specialists. Some of these tasks require segmentation to be performed on a subset of the input dimensions, the most common case being 3D-to-2D. However, the performance of existing methods is strongly conditioned by the amount of labeled data available, as there is currently no data efficient method, e.g. transfer learning, that has been validated on these tasks. In this work, we propose a novel convolutional neural network (CNN) and self-supervised learning (SSL) method for label-efficient 3D-to-2D segmentation. The CNN is composed of a 3D encoder and a 2D decoder connected by novel 3D-to-2D blocks. The SSL method consists of reconstructing image pairs of modalities with different dimensionality. The approach has been validated in two tasks with clinical relevance: the en-face segmentation of geographic atrophy and reticular pseudodrusen in optical coherence tomography. Results on different datasets demonstrate that the proposed CNN significantly improves the state of the art in scenarios with limited labeled data by up to 8% in Dice score. Moreover, the proposed SSL method allows further improvement of this performance by up to 23%, and we show that the SSL is beneficial regardless of the network architecture.
摘要
深度学习已成为某些医学图像分割任务自动化的重要工具,显著减轻了医学专家的工作负担。其中一些任务需要在输入维度的子集上进行分割,最常见的情形是 3D 到 2D。然而,现有方法的性能强烈依赖于可用标注数据的数量,因为目前尚无在这类任务上得到验证的数据高效方法(例如迁移学习)。在这项工作中,我们提出了一种新的卷积神经网络(CNN)与自监督学习(SSL)方法,用于标注高效的 3D 到 2D 分割。该 CNN 由一个 3D 编码器和一个 2D 解码器组成,二者通过新颖的 3D-to-2D 模块相连。SSL 方法则通过重建不同维度模态的图像对来实现。该方法在两个具有临床意义的任务上得到了验证:光学相干断层扫描(OCT)中地图状萎缩与网状假性玻璃膜疣的正面(en-face)分割。在不同数据集上的结果表明,所提 CNN 在标注数据有限的场景下将最先进水平最多提升了 8% 的 Dice 分数;所提 SSL 方法可将该性能进一步提升最多 23%,并且无论网络架构如何,SSL 均有收益。
Fourier-Net+: Leveraging Band-Limited Representation for Efficient 3D Medical Image Registration
results: 研究表明,Fourier-Net 及 Fourier-Net+ 可以取得与当前最先进方法相当的配准性能,同时具有更快的推断速度、更低的内存占用和更少的乘加运算;Fourier-Net+ 还使得在低显存 GPU 上高效训练大规模 3D 配准成为可能。Abstract
U-Net style networks are commonly utilized in unsupervised image registration to predict dense displacement fields, which for high-resolution volumetric image data is a resource-intensive and time-consuming task. To tackle this challenge, we first propose Fourier-Net, which replaces the costly U-Net style expansive path with a parameter-free model-driven decoder. Instead of directly predicting a full-resolution displacement field, our Fourier-Net learns a low-dimensional representation of the displacement field in the band-limited Fourier domain which our model-driven decoder converts to a full-resolution displacement field in the spatial domain. Expanding upon Fourier-Net, we then introduce Fourier-Net+, which additionally takes the band-limited spatial representation of the images as input and further reduces the number of convolutional layers in the U-Net style network's contracting path. Finally, to enhance the registration performance, we propose a cascaded version of Fourier-Net+. We evaluate our proposed methods on three datasets, on which our proposed Fourier-Net and its variants achieve comparable results with current state-of-the art methods, while exhibiting faster inference speeds, lower memory footprint, and fewer multiply-add operations. With such small computational cost, our Fourier-Net+ enables the efficient training of large-scale 3D registration on low-VRAM GPUs. Our code is publicly available at \url{https://github.com/xi-jia/Fourier-Net}.
摘要
U-Net 风格的网络常被用于无监督图像配准以预测稠密位移场,而对高分辨率体数据而言,这是一项资源密集且耗时的任务。为应对这一挑战,我们首先提出 Fourier-Net,用一个无参数的模型驱动解码器取代 U-Net 风格中代价高昂的扩张路径。Fourier-Net 不直接预测全分辨率位移场,而是学习位移场在带限傅里叶域中的低维表示,再由模型驱动解码器将其转换为空间域中的全分辨率位移场。在 Fourier-Net 的基础上,我们进一步提出 Fourier-Net+,它额外以图像的带限空间表示作为输入,并进一步减少 U-Net 风格网络收缩路径中的卷积层数。最后,为提升配准性能,我们提出了 Fourier-Net+ 的级联版本。我们在三个数据集上评估了所提方法:Fourier-Net 及其变体取得了与当前最先进方法相当的结果,同时具有更快的推断速度、更低的内存占用和更少的乘加运算。凭借如此低的计算成本,Fourier-Net+ 使得在低显存 GPU 上高效训练大规模 3D 配准成为可能。代码公开于 \url{https://github.com/xi-jia/Fourier-Net}。
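Our reading of the parameter-free decoder can be sketched in a few lines (this is an illustration, not the authors' code): embed the predicted low-frequency coefficients in an otherwise zero spectrum, then apply an inverse FFT to obtain a full-resolution, naturally smooth displacement field:

```python
import numpy as np

def decode_displacement(band_coeffs, full_shape):
    """Zero-pad centered low-frequency coefficients, then inverse FFT."""
    spec = np.zeros(full_shape, dtype=complex)
    bh, bw = band_coeffs.shape
    ch, cw = full_shape[0] // 2, full_shape[1] // 2
    spec[ch - bh // 2: ch + (bh + 1) // 2,
         cw - bw // 2: cw + (bw + 1) // 2] = band_coeffs    # embed the band
    scale = np.prod(full_shape) / band_coeffs.size          # keep amplitudes comparable
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec))) * scale

# A 12x12 band of coefficients decoded to a 96x96 displacement component:
band = np.fft.fftshift(np.fft.fft2(np.random.default_rng(0).normal(size=(12, 12))))
field = decode_displacement(band, (96, 96))
print(field.shape)   # (96, 96)
```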
Noise-to-Norm Reconstruction for Industrial Anomaly Detection and Localization
results: 实验表明,该方法能够准确地检测和定位异常区域,在 MPDD 和 VisA 数据集上取得了比最新方法更有竞争力的结果,并在 MPDD 数据集上确立了新的最先进水平。Abstract
Anomaly detection has a wide range of applications and is especially important in industrial quality inspection. Currently, many top-performing anomaly-detection models rely on feature-embedding methods. However, these methods do not perform well on datasets with large variations in object locations. Reconstruction-based methods use reconstruction errors to detect anomalies without considering positional differences between samples. In this study, a reconstruction-based method using the noise-to-norm paradigm is proposed, which avoids the invariant reconstruction of anomalous regions. Our reconstruction network is based on M-net and incorporates multiscale fusion and residual attention modules to enable end-to-end anomaly detection and localization. Experiments demonstrate that the method is effective in reconstructing anomalous regions into normal patterns and achieving accurate anomaly detection and localization. On the MPDD and VisA datasets, our proposed method achieved more competitive results than the latest methods, and it set a new state-of-the-art standard on the MPDD dataset.
摘要
异常检测应用广泛,在工业质量检验中尤为重要。目前,许多性能领先的异常检测模型依赖特征嵌入方法,但这类方法在目标位置变化较大的数据集上表现不佳。基于重建的方法利用重建误差来检测异常,而无需考虑样本间的位置差异。本研究提出了一种采用 noise-to-norm 范式的基于重建的方法,避免了对异常区域的恒等重建。我们的重建网络基于 M-net,并引入多尺度融合与残差注意力模块,以实现端到端的异常检测与定位。实验表明,该方法能有效地将异常区域重建为正常模式,实现精确的异常检测与定位。在 MPDD 和 VisA 数据集上,所提方法取得了优于最新方法的成绩,并在 MPDD 数据集上确立了新的最先进水平。
Bundle-specific Tractogram Distribution Estimation Using Higher-order Streamline Differential Equation
paper_authors: Yuanjing Feng, Lei Xie, Jingqiang Wang, Jianzhong He, Fei Gao
for: This paper aims to improve tractography methods for reconstructing complex global fiber bundles in the brain, where existing methods are prone to producing erroneous tracks and missing true positive connections.
methods: The proposed method uses a bundle-specific tractogram distribution function based on a higher-order streamline differential equation, reconstructing streamline bundles in a "cluster to cluster" manner. Anatomic priors are introduced to guide the tractography process and improve the accuracy of fiber bundle reconstruction.
results: Compared to traditional peaks-based tractography, the proposed method better reconstructs long-range, twisting, and large fanning tracts, reduces error deviation and accumulation at the local level, and provides a more accurate representation of the complex global fiber bundles in the brain.Abstract
Tractography traces the peak directions extracted from fiber orientation distribution (FOD) suffering from ambiguous spatial correspondences between diffusion directions and fiber geometry, which is prone to producing erroneous tracks while missing true positive connections. The peaks-based tractography methods 'locally' reconstructed streamlines in 'single to single' manner, thus lacking of global information about the trend of the whole fiber bundle. In this work, we propose a novel tractography method based on a bundle-specific tractogram distribution function by using a higher-order streamline differential equation, which reconstructs the streamline bundles in 'cluster to cluster' manner. A unified framework for any higher-order streamline differential equation is presented to describe the fiber bundles with disjoint streamlines defined based on the diffusion tensor vector field. At the global level, the tractography process is simplified as the estimation of bundle-specific tractogram distribution (BTD) coefficients by minimizing the energy optimization model, and is used to characterize the relations between BTD and diffusion tensor vector under the prior guidance by introducing the tractogram bundle information to provide anatomic priors. Experiments are performed on simulated Hough, Sine, Circle data, ISMRM 2015 Tractography Challenge data, FiberCup data, and in vivo data from the Human Connectome Project (HCP) data for qualitative and quantitative evaluation. The results demonstrate that our approach can reconstruct the complex global fiber bundles directly. BTD reduces the error deviation and accumulation at the local level and shows better results in reconstructing long-range, twisting, and large fanning tracts.
摘要
纤维束成像(tractography)沿纤维取向分布(FOD)提取的峰值方向进行追踪,但扩散方向与纤维几何之间的空间对应关系存在歧义,容易产生错误轨迹并遗漏真实的阳性连接。基于峰值的纤维束成像方法以"单对单"的方式"局部地"重建流线,因而缺乏关于整个纤维束走向的全局信息。在这项工作中,我们提出了一种基于纤维束特异的纤维束图分布函数的新方法,利用高阶流线微分方程,以"簇对簇"的方式重建流线束。我们给出了一个适用于任意高阶流线微分方程的统一框架,基于扩散张量向量场来描述由互不相交的流线所定义的纤维束。在全局层面,纤维束成像过程被简化为:通过最小化能量优化模型来估计纤维束特异的纤维束图分布(BTD)系数,并在引入纤维束图信息提供解剖先验的指导下,刻画 BTD 与扩散张量向量之间的关系。我们在仿真的 Hough、Sine、Circle 数据、ISMRM 2015 Tractography Challenge 数据、FiberCup 数据以及人类连接组计划(HCP)的在体数据上进行了定性与定量评估。结果表明,我们的方法能够直接重建复杂的全局纤维束;BTD 在局部层面降低了误差偏离与累积,并在重建长程、扭曲和大幅扇形走行的纤维束方面表现更好。
Single Image LDR to HDR Conversion using Conditional Diffusion
results: 我们通过定量与定性实验证明了所提方法的有效性。结果表明,一种简单的基于条件扩散的方法即可取代复杂的基于相机流水线的架构。Abstract
Digital imaging aims to replicate realistic scenes, but Low Dynamic Range (LDR) cameras cannot represent the wide dynamic range of real scenes, resulting in under-/overexposed images. This paper presents a deep learning-based approach for recovering intricate details from shadows and highlights while reconstructing High Dynamic Range (HDR) images. We formulate the problem as an image-to-image (I2I) translation task and propose a conditional Denoising Diffusion Probabilistic Model (DDPM) based framework using classifier-free guidance. We incorporate a deep CNN-based autoencoder in our proposed framework to enhance the quality of the latent representation of the input LDR image used for conditioning. Moreover, we introduce a new loss function for LDR-HDR translation tasks, termed Exposure Loss. This loss helps direct gradients in the opposite direction of the saturation, further improving the results' quality. By conducting comprehensive quantitative and qualitative experiments, we have effectively demonstrated the proficiency of our proposed method. The results indicate that a simple conditional diffusion-based method can replace the complex camera pipeline-based architectures.
摘要
数字成像的目标是复制真实场景,但低动态范围(LDR)相机无法表示真实场景的宽动态范围,导致图像欠曝或过曝。这篇论文提出了一种基于深度学习的方法,用于从阴影和高光中恢复细节,同时重建高动态范围(HDR)图像。我们将问题定义为图像到图像(I2I)翻译任务,并提出了基于条件去噪扩散概率模型(DDPM)的框架,采用无分类器引导(classifier-free guidance)。我们在框架中集成了基于深度 CNN 的自编码器,以提高用于条件化的输入 LDR 图像潜在表示的质量。此外,我们还引入了一种新的损失函数,称为曝光损失(Exposure Loss),该损失函数将梯度引向与饱和相反的方向,进一步提高结果的质量。通过进行全面的定量和定性实验,我们有效地证明了所提方法的能力。结果表明,一种简单的条件扩散方法可以取代复杂的相机管线架构。
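The paper does not spell out the Exposure Loss formula here, so the snippet below is only one plausible reading: an L1 term that up-weights pixels whose LDR target is clipped near 0 or 1, so gradients push reconstruction effort toward saturated regions. The thresholds and weighting scheme are assumptions, not the paper's definition.

```python
import torch

def exposure_loss(pred, target, low=0.05, high=0.95):
    # Hypothetical saturation-aware loss: clipped (under-/over-exposed)
    # pixels in the LDR target receive double weight, steering gradients
    # away from saturated regions. Thresholds are illustrative.
    saturated = ((target <= low) | (target >= high)).float()
    l1 = torch.abs(pred - target)
    return (l1 * (1.0 + saturated)).mean()

pred = torch.rand(1, 3, 64, 64, requires_grad=True)
target = torch.rand(1, 3, 64, 64)
loss = exposure_loss(pred, target)
loss.backward()
print(float(loss))
```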
Advancing Zero-Shot Digital Human Quality Assessment through Text-Prompted Evaluation
methods: 本研究使用了7种类型的失真方法生成1,120个失真数字人,以及40个高质量的参考数字人。此外,研究还提出了一种零样本 DHQA 方法,利用从投影中提取的语义和失真特征,以及从数字人网格结构中得到的几何特征。
results: 研究结果显示,零样本 DHQA 方法可以在不受数据库偏差影响的情况下实现可观的质量评估性能。此外,研究还引入了数字人质量指数(DHQI),可以作为一个强大的基线,促进该领域的进步。Abstract
Digital humans have witnessed extensive applications in various domains, necessitating related quality assessment studies. However, there is a lack of comprehensive digital human quality assessment (DHQA) databases. To address this gap, we propose SJTU-H3D, a subjective quality assessment database specifically designed for full-body digital humans. It comprises 40 high-quality reference digital humans and 1,120 labeled distorted counterparts generated with seven types of distortions. The SJTU-H3D database can serve as a benchmark for DHQA research, allowing evaluation and refinement of processing algorithms. Further, we propose a zero-shot DHQA approach that focuses on no-reference (NR) scenarios to ensure generalization capabilities while mitigating database bias. Our method leverages semantic and distortion features extracted from projections, as well as geometry features derived from the mesh structure of digital humans. Specifically, we employ the Contrastive Language-Image Pre-training (CLIP) model to measure semantic affinity and incorporate the Naturalness Image Quality Evaluator (NIQE) model to capture low-level distortion information. Additionally, we utilize dihedral angles as geometry descriptors to extract mesh features. By aggregating these measures, we introduce the Digital Human Quality Index (DHQI), which demonstrates significant improvements in zero-shot performance. The DHQI can also serve as a robust baseline for DHQA tasks, facilitating advancements in the field. The database and the code are available at https://github.com/zzc-1998/SJTU-H3D.
摘要
“数字人”在各个领域得到了广泛的应用,但相关的质量评估研究受限于缺乏全面的数字人质量评估(DHQA)数据库。为了解决这一问题,我们提出了 SJTU-H3D,这是专门为全身数字人构建的主观质量评估数据库。它包括40个高质量参考数字人和1,120个带标注的失真数字人,由七种失真方法生成。SJTU-H3D 数据库可以作为 DHQA 研究的基准,用于评估和改进处理算法。此外,我们提出了一种零样本 DHQA 方法,专注于无参考(NR)场景,以在确保泛化能力的同时减轻数据库偏差。我们的方法利用 CLIP 模型测量语义相似性,并使用 NIQE 模型捕捉低层失真信息。此外,我们还利用二面角作为几何描述符,从数字人的网格结构中提取特征。通过聚合这些度量,我们引入了数字人质量指数(DHQI),它在零样本设置下表现出显著的改进。DHQI 还可以作为 DHQA 任务的稳健基线,促进该领域的进步。SJTU-H3D 数据库和代码可在 https://github.com/zzc-1998/SJTU-H3D 获取。
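Of the three feature types, the geometry descriptor is the easiest to illustrate: the sketch below computes dihedral angles between adjacent triangle faces of a mesh in plain NumPy. The tiny two-triangle mesh is a stand-in, and how such angles would be pooled into DHQI is not shown.

```python
import numpy as np

def dihedral_angles(vertices, faces):
    # Angles between adjacent triangle faces sharing an edge -- a simple
    # mesh-geometry descriptor in the spirit of the paper's features.
    normals = {}
    for i, f in enumerate(faces):
        a, b, c = vertices[f]
        n = np.cross(b - a, c - a)
        normals[i] = n / np.linalg.norm(n)
    edge_to_faces = {}
    for i, f in enumerate(faces):
        for e in [(f[0], f[1]), (f[1], f[2]), (f[2], f[0])]:
            edge_to_faces.setdefault(tuple(sorted(e)), []).append(i)
    angles = []
    for fs in edge_to_faces.values():
        if len(fs) == 2:  # interior edge shared by exactly two faces
            cosang = np.clip(np.dot(normals[fs[0]], normals[fs[1]]), -1, 1)
            angles.append(np.degrees(np.arccos(cosang)))
    return np.array(angles)

# Two triangles folded along a shared edge.
V = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 1]], float)
F = np.array([[0, 1, 2], [1, 3, 2]])
print(dihedral_angles(V, F))  # one interior edge -> one angle
```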
results: 实验结果表明,我们的算法可以提高图像质量,同时减少计算复杂度Abstract
Images captured in poorly lit conditions are often corrupted by acquisition noise. Leveraging recent advances in graph-based regularization, we propose a fast Retinex-based restoration scheme that denoises and contrast-enhances an image. Specifically, by Retinex theory we first assume that each image pixel is a multiplication of its reflectance and illumination components. We next assume that the reflectance and illumination components are piecewise constant (PWC) and continuous piecewise planar (PWP) signals, which can be recovered via graph Laplacian regularizer (GLR) and gradient graph Laplacian regularizer (GGLR) respectively. We formulate quadratic objectives regularized by GLR and GGLR, which are minimized alternately until convergence by solving linear systems -- with improved condition numbers via proposed preconditioners -- via conjugate gradient (CG) efficiently. Experimental results show that our algorithm achieves competitive visual image quality while reducing computation complexity noticeably.
摘要
在低光照条件下采集的图像经常受到采集噪声的损害。我们利用基于图的正则化的最新进展,提出一种快速的基于 Retinex 的图像恢复方案,可以同时去噪并增强图像对比度。具体来说,根据 Retinex 理论,我们首先假设每个图像像素是其反射分量与照明分量的乘积。接着,我们假设反射分量是分段常数(PWC)信号,照明分量是连续分段平面(PWP)信号,二者可分别通过图拉普拉斯正则化项(GLR)和梯度图拉普拉斯正则化项(GGLR)恢复。我们构建了由 GLR 和 GGLR 正则化的二次目标函数,并交替最小化直至收敛:通过共轭梯度法(CG)高效求解线性系统,并使用我们提出的预条件子改善条件数。实验结果表明,我们的算法在明显降低计算复杂度的同时,达到了有竞争力的视觉图像质量。
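A minimal 1-D analogue of the alternating scheme may help: a path-graph Laplacian stands in for GLR and its square for GGLR, and each quadratic subproblem is solved with SciPy's conjugate gradient. The 2-D graph construction and the paper's proposed preconditioners are omitted, and the regularization weights are arbitrary.

```python
import numpy as np
from scipy.sparse import diags, identity
from scipy.sparse.linalg import cg

# Toy 1-D analogue: log(pixel) = log(reflectance) + log(illumination).
rng = np.random.default_rng(0)
n = 64
y = np.log(np.linspace(0.1, 1.0, n)) + 0.05 * rng.standard_normal(n)

D = diags([-np.ones(n - 1), np.ones(n - 1)], [0, 1], shape=(n - 1, n))
L = (D.T @ D).tocsc()    # path-graph Laplacian (stand-in for GLR)
L2 = (L @ L).tocsc()     # squared Laplacian (stand-in for GGLR)
I = identity(n, format="csc")

r = np.zeros(n)          # log-reflectance: piecewise-constant prior
l = y.copy()             # log-illumination: piecewise-planar prior
mu_r, mu_l = 5.0, 5.0
for _ in range(10):      # alternate the two quadratic subproblems
    r, _ = cg(I + mu_r * L, y - l)
    l, _ = cg(I + mu_l * L2, y - r)
print("residual:", np.linalg.norm(y - (r + l)))
```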
AxonCallosumEM Dataset: Axon Semantic Segmentation of Whole Corpus Callosum cross section from EM Images
paper_authors: Ao Cheng, Guoqiang Zhao, Lirong Wang, Ruobing Zhang
for: 这个研究的目的是为了精确地重建动物神经系统中的 axon 和 myelin sheath 的复杂结构,以及提供大规模 EM 数据集,以促进和评估全面的 corpus callosum 重建。
methods: 这个研究使用了 Electron Microscope (EM) 技术,并开发了一种基于 Segment Anything Model (SAM) 的微调方法,使其适应 EM 图像分割任务。
results: 这个研究获得了一个名为 AxonCallosumEM 的大规模 EM 数据集,并使用了这个数据集进行训练和测试。EM-SAM 方法在这个数据集上表现出色,与其他现有的方法相比,具有更高的准确率和更好的一致性。Abstract
The electron microscope (EM) remains the predominant technique for elucidating intricate details of the animal nervous system at the nanometer scale. However, accurately reconstructing the complex morphology of axons and myelin sheaths poses a significant challenge. Furthermore, the absence of publicly available, large-scale EM datasets encompassing complete cross sections of the corpus callosum, with dense ground truth segmentation for axons and myelin sheaths, hinders the advancement and evaluation of holistic corpus callosum reconstructions. To surmount these obstacles, we introduce the AxonCallosumEM dataset, comprising a 1.83 times 5.76mm EM image captured from the corpus callosum of the Rett Syndrome (RTT) mouse model, which entail extensive axon bundles. We meticulously proofread over 600,000 patches at a resolution of 1024 times 1024, thus providing a comprehensive ground truth for myelinated axons and myelin sheaths. Additionally, we extensively annotated three distinct regions within the dataset for the purposes of training, testing, and validation. Utilizing this dataset, we develop a fine-tuning methodology that adapts Segment Anything Model (SAM) to EM images segmentation tasks, called EM-SAM, enabling outperforms other state-of-the-art methods. Furthermore, we present the evaluation results of EM-SAM as a baseline.
摘要
电子显微镜(EM)仍然是在纳米尺度上揭示动物神经系统精细结构的主要技术。然而,准确重建轴突和髓鞘的复杂形态是一项巨大的挑战。此外,目前缺乏公开可用的、覆盖胼胝体完整横截面、并带有轴突和髓鞘密集真值分割标注的大规模 EM 数据集,这阻碍了整体胼胝体重建的发展与评估。为了克服这些障碍,我们介绍了 AxonCallosumEM 数据集,它包含一幅取自 Rett 综合征(RTT)小鼠模型胼胝体的 1.83 × 5.76mm EM 图像,其中包含大量轴突束。我们以 1024 × 1024 的分辨率仔细校对了超过 600,000 个图像块,从而为有髓轴突和髓鞘提供了完整的真值标注。此外,我们还标注了数据集中三个不同的区域,分别用于训练、测试和验证。基于该数据集,我们开发了一种将 Segment Anything Model(SAM)适配到 EM 图像分割任务的微调方法,称为 EM-SAM,其性能超越了其他最先进的方法。此外,我们还给出了 EM-SAM 的评估结果作为基线。
Expert-Agnostic Ultrasound Image Quality Assessment using Deep Variational Clustering
results: 对膀胱超声图像质量进行评估,实现了78%的准确率,并优于最先进聚类方法的性能。Abstract
Ultrasound imaging is a commonly used modality for several diagnostic and therapeutic procedures. However, the diagnosis by ultrasound relies heavily on the quality of images assessed manually by sonographers, which diminishes the objectivity of the diagnosis and makes it operator-dependent. The supervised learning-based methods for automated quality assessment require manually annotated datasets, which are highly labour-intensive to acquire. These ultrasound images are low in quality and suffer from noisy annotations caused by inter-observer perceptual variations, which hampers learning efficiency. We propose an UnSupervised UltraSound image Quality assessment Network, US2QNet, that eliminates the burden and uncertainty of manual annotations. US2QNet uses the variational autoencoder embedded with the three modules, pre-processing, clustering and post-processing, to jointly enhance, extract, cluster and visualize the quality feature representation of ultrasound images. The pre-processing module uses filtering of images to point the network's attention towards salient quality features, rather than getting distracted by noise. Post-processing is proposed for visualizing the clusters of feature representations in 2D space. We validated the proposed framework for quality assessment of the urinary bladder ultrasound images. The proposed framework achieved 78% accuracy and superior performance to state-of-the-art clustering methods.
摘要
超声成像是一种广泛用于诊断和治疗过程的成像方式。然而,基于超声的诊断严重依赖于超声技师人工评估的图像质量,这削弱了诊断的客观性,并使其依赖于操作者。基于监督学习的自动质量评估方法需要人工标注的数据集,而获取这些数据集非常耗费人力。这些超声图像质量较低,且标注因观察者间感知差异而带有噪声,从而影响学习效率。我们提出了一种无监督超声图像质量评估网络 US2QNet,以消除人工标注的负担和不确定性。US2QNet 使用嵌入了预处理、聚类和后处理三个模块的变分自编码器,联合增强、提取、聚类并可视化超声图像的质量特征表示。预处理模块使用图像滤波,将网络的注意力集中在显著的质量特征上,而不被噪声干扰。后处理模块用于在二维空间中可视化特征表示的聚类。我们在膀胱超声图像质量评估任务上验证了所提框架。该框架实现了78%的准确率,性能优于最先进的聚类方法。
LLCaps: Learning to Illuminate Low-Light Capsule Endoscopy with Curved Wavelet Attention and Reverse Diffusion
methods: 本研究提出了一个基于多尺度卷积神经网络(CNN)与反向扩散过程的 WCE 低光图像增强框架,其中多尺度设计保留高分辨率表示与低分辨率的上下文信息,并提出弯曲小波注意力(CWA)模块用于高频与局部特征学习。Abstract
Wireless capsule endoscopy (WCE) is a painless and non-invasive diagnostic tool for gastrointestinal (GI) diseases. However, due to GI anatomical constraints and hardware manufacturing limitations, WCE vision signals may suffer from insufficient illumination, leading to a complicated screening and examination procedure. Deep learning-based low-light image enhancement (LLIE) in the medical field gradually attracts researchers. Given the exuberant development of the denoising diffusion probabilistic model (DDPM) in computer vision, we introduce a WCE LLIE framework based on the multi-scale convolutional neural network (CNN) and reverse diffusion process. The multi-scale design allows models to preserve high-resolution representation and context information from low-resolution, while the curved wavelet attention (CWA) block is proposed for high-frequency and local feature learning. Furthermore, we combine the reverse diffusion procedure to further optimize the shallow output and generate the most realistic image. The proposed method is compared with ten state-of-the-art (SOTA) LLIE methods and significantly outperforms quantitatively and qualitatively. The superior performance on GI disease segmentation further demonstrates the clinical potential of our proposed model. Our code is publicly accessible.
摘要
无线胶囊内镜(WCE)是一种无痛、非侵入性的胃肠道(GI)疾病诊断工具。然而,由于胃肠道解剖结构的限制和硬件制造的局限,WCE 的视觉信号可能存在照明不足的问题,导致筛查和检查过程复杂。医学领域基于深度学习的低光图像增强(LLIE)逐渐吸引研究人员。鉴于去噪扩散概率模型(DDPM)在计算机视觉中的蓬勃发展,我们提出了一种基于多尺度卷积神经网络(CNN)和反向扩散过程的 WCE LLIE 框架。多尺度设计使模型既能保留高分辨率表示,又能利用低分辨率的上下文信息;同时提出弯曲小波注意力(CWA)模块,用于高频和局部特征学习。此外,我们结合反向扩散过程进一步优化浅层输出,生成最真实的图像。所提方法与10种最先进(SOTA)的 LLIE 方法进行了比较,在定量和定性上均显著占优。在胃肠道疾病分割上的优异表现进一步证明了所提模型的临床潜力。我们的代码公开可用。
Base Layer Efficiency in Scalable Human-Machine Coding
results: 论文表明,通过优化基层编码可以提升编码效率;具体而言,相比目前在目标检测和实例分割上的最好结果,可以获得 20-40% 的 BD-Rate 增益。Abstract
A basic premise in scalable human-machine coding is that the base layer is intended for automated machine analysis and is therefore more compressible than the same content would be for human viewing. Use cases for such coding include video surveillance and traffic monitoring, where the majority of the content will never be seen by humans. Therefore, base layer efficiency is of paramount importance because the system would most frequently operate at the base-layer rate. In this paper, we analyze the coding efficiency of the base layer in a state-of-the-art scalable human-machine image codec, and show that it can be improved. In particular, we demonstrate that gains of 20-40% in BD-Rate compared to the currently best results on object detection and instance segmentation are possible.
摘要
可扩展人机编码的一个基本前提是:基层用于机器自动分析,因此同样的内容在基层中比面向人类观看时更容易压缩。此类编码的应用场景包括视频监控和交通监测,其中大多数内容从不会被人类查看。因此,基层的编码效率至关重要,因为系统大多数时间都运行在基层码率上。在这篇论文中,我们分析了一个最先进的可扩展人机图像编解码器的基层编码效率,并证明其可以进一步提升。特别地,我们展示了相比目前在目标检测和实例分割上的最好结果,可以获得 20-40% 的 BD-Rate 增益。
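Since the reported gains are expressed in BD-Rate, a sketch of the standard Bjontegaard delta-rate computation may be useful background: a cubic fit of log-rate as a function of quality, integrated over the overlapping quality range. This is the generic recipe, not the paper's specific evaluation pipeline, and the rate/quality points below are invented.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    # Bjontegaard delta-rate: average % bitrate change at equal quality.
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)   # cubic: PSNR -> log-rate
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    ia, it = np.polyint(p_a), np.polyint(p_t)
    avg_a = (np.polyval(ia, hi) - np.polyval(ia, lo)) / (hi - lo)
    avg_t = (np.polyval(it, hi) - np.polyval(it, lo)) / (hi - lo)
    return (np.exp(avg_t - avg_a) - 1.0) * 100.0  # percent

anchor_r = [100, 200, 400, 800]; anchor_q = [30.0, 33.0, 36.0, 39.0]
test_r   = [ 80, 160, 320, 640]; test_q   = [30.2, 33.1, 36.2, 39.1]
print(f"BD-Rate: {bd_rate(anchor_r, anchor_q, test_r, test_q):.1f}%")
```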
results: 该模型能够描述辅音-元音协同发音、辅音簇的发音,并将言语变换建模为音节图结构的转变。Abstract
A simplified model of articulatory synthesis involving four stages is presented. The planning of articulatory gestures is based on syllable graphs with arcs and nodes that are implemented in a complex representation. This was first motivated by a reduction in the many-to-one relationship between articulatory parameters and formant space. This allows for consistent trajectory planning and computation of articulation dynamics with coordination and selection operators. The flow of articulatory parameters is derived from these graphs with four equations. Many assertions of Articulatory Phonology have been abandoned. This framework is adapted to synthesis using VLAM (a Maeda's model) and simulations are performed with syllables including main vowels and the plosives /b,d,g/ only. The model is able to describe consonant-vowel coarticulation, articulation of consonant clusters, and verbal transformations are seen as transitions of the syllable graph structure.
摘要
本文提出了一种简化的发音合成模型,包含四个阶段。发音手势的规划基于带有弧和节点的音节图,并以复数表示实现。其最初动机是减少发音参数与共振峰空间之间的多对一关系。这使得可以借助协调和选择算子进行一致的轨迹规划和发音动力学计算。发音参数流由这些图通过四个方程导出。发音音系学(Articulatory Phonology)的许多假设被抛弃。该框架被适配到基于 VLAM(Maeda 模型)的合成,并在仅包含主要元音和塞音 /b,d,g/ 的音节上进行了仿真。该模型能够描述辅音-元音协同发音和辅音簇的发音,并将言语变换视为音节图结构的转变。
Exploring Multimodal Approaches for Alzheimer’s Disease Detection Using Patient Speech Transcript and Audio Data
results: 我们的实验结果表明,对语音转录文本和音频数据进行数据增强可以提高 AD 检测的准确率。此外,我们还发现,将语音转录文本转换回音频,并与原始音频进行对比学习,可以提升 AD 检测的效果。Abstract
Alzheimer's disease (AD) is a common form of dementia that severely impacts patient health. As AD impairs the patient's language understanding and expression ability, the speech of AD patients can serve as an indicator of this disease. This study investigates various methods for detecting AD using patients' speech and transcripts data from the DementiaBank Pitt database. The proposed approach involves pre-trained language models and Graph Neural Network (GNN) that constructs a graph from the speech transcript, and extracts features using GNN for AD detection. Data augmentation techniques, including synonym replacement, GPT-based augmenter, and so on, were used to address the small dataset size. Audio data was also introduced, and WavLM model was used to extract audio features. These features were then fused with text features using various methods. Finally, a contrastive learning approach was attempted by converting speech transcripts back to audio and using it for contrastive learning with the original audio. We conducted intensive experiments and analysis on the above methods. Our findings shed light on the challenges and potential solutions in AD detection using speech and audio data.
摘要
阿尔茨海默病(AD)是一种常见的痴呆症,严重影响患者健康。由于 AD 会损害患者的语言理解和表达能力,患者的言语可以作为该疾病的指标。本研究基于 DementiaBank Pitt 数据库中患者的言语和转录文本数据,考察了多种 AD 检测方法。所提方法利用预训练语言模型和图神经网络(GNN):由语音转录文本构建图,并用 GNN 提取特征以进行 AD 检测。为了应对数据集规模较小的问题,我们采用了同义词替换、基于 GPT 的增强器等数据增强技术。我们还引入了音频数据,并使用 WavLM 模型提取音频特征,再通过多种方法将其与文本特征融合。最后,我们尝试了一种对比学习方法:将语音转录文本转换回音频,并与原始音频进行对比学习。我们对上述方法进行了大量实验与分析。我们的发现揭示了利用语音和音频数据进行 AD 检测的挑战与潜在解决方案。
Self-supervised learning with diffusion-based multichannel speech enhancement for speaker verification under noisy conditions
results: 在 MultiSV 多通道说话人验证数据集上进行评估,并显示在噪声多通道条件下获得显著的提升。Abstract
The paper introduces Diff-Filter, a multichannel speech enhancement approach based on the diffusion probabilistic model, for improving speaker verification performance under noisy and reverberant conditions. It also presents a new two-step training procedure that takes the benefit of self-supervised learning. In the first stage, the Diff-Filter is trained by conducting timedomain speech filtering using a scoring-based diffusion model. In the second stage, the Diff-Filter is jointly optimized with a pre-trained ECAPA-TDNN speaker verification model under a self-supervised learning framework. We present a novel loss based on equal error rate. This loss is used to conduct selfsupervised learning on a dataset that is not labelled in terms of speakers. The proposed approach is evaluated on MultiSV, a multichannel speaker verification dataset, and shows significant improvements in performance under noisy multichannel conditions.
摘要
本文介绍了 Diff-Filter,一种基于扩散概率模型的多通道语音增强方法,用于在噪声和混响条件下提升说话人验证性能。文章还提出了一种利用自监督学习优势的新两步训练流程。第一阶段,Diff-Filter 通过基于得分的扩散模型执行时域语音滤波来训练。第二阶段,Diff-Filter 在自监督学习框架下与预训练的 ECAPA-TDNN 说话人验证模型联合优化。我们提出了一种基于等错误率(EER)的新损失函数,用于在没有说话人标签的数据集上进行自监督学习。所提方法在多通道说话人验证数据集 MultiSV 上进行了评估,在噪声多通道条件下性能显著提升。
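Because the training objective is built around the equal error rate, the standard (non-differentiable) EER computation from verification scores is shown below as background; the paper's differentiable EER-based loss itself is not reproduced here.

```python
import numpy as np

def equal_error_rate(scores, labels):
    # EER from verification scores (label 1 = same speaker): the operating
    # point where false-acceptance and false-rejection rates cross.
    order = np.argsort(scores)[::-1]          # descending score thresholds
    labels = np.asarray(labels)[order]
    n_pos, n_neg = labels.sum(), (1 - labels).sum()
    fa = np.cumsum(1 - labels) / n_neg        # false acceptances so far
    fr = 1.0 - np.cumsum(labels) / n_pos      # positives still rejected
    i = np.argmin(np.abs(fa - fr))
    return (fa[i] + fr[i]) / 2.0

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1, 500), rng.normal(-1.0, 1, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(f"EER = {equal_error_rate(scores, labels):.3f}")
```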
LOAF-M2L: Joint Learning of Wording and Formatting for Singable Melody-to-Lyric Generation
for: bridges the singability gap between generated lyrics and melodies, improving the compatibility of the outputs with the melody.
methods: jointly Learning wOrding And Formatting during Melody-to-Lyric training (LOAF-M2L), with a new objective informed by musicological research on the relationship between melody and lyrics.
results: achieves 3.75% and 21.44% absolute accuracy gains in the outputs’ number-of-line and syllable-per-line requirements, and demonstrates a 63.92% and 74.18% relative improvement of music-lyric compatibility and overall quality in the subjective evaluation, compared to the state-of-the-art melody-to-lyric generation model.Abstract
Despite previous efforts in melody-to-lyric generation research, there is still a significant compatibility gap between generated lyrics and melodies, negatively impacting the singability of the outputs. This paper bridges the singability gap with a novel approach to generating singable lyrics by jointly Learning wOrding And Formatting during Melody-to-Lyric training (LOAF-M2L). After general-domain pretraining, our proposed model acquires length awareness first from a large text-only lyric corpus. Then, we introduce a new objective informed by musicological research on the relationship between melody and lyrics during melody-to-lyric training, which enables the model to learn the fine-grained format requirements of the melody. Our model achieves 3.75% and 21.44% absolute accuracy gains in the outputs' number-of-line and syllable-per-line requirements compared to naive fine-tuning, without sacrificing text fluency. Furthermore, our model demonstrates a 63.92% and 74.18% relative improvement of music-lyric compatibility and overall quality in the subjective evaluation, compared to the state-of-the-art melody-to-lyric generation model, highlighting the significance of formatting learning.
摘要
尽管此前在旋律到歌词生成研究中已做出许多努力,生成的歌词与旋律之间仍存在显著的兼容性差距,这对输出的可唱性有负面影响。本文通过一种新方法弥合这一可唱性差距,即在旋律到歌词训练中同时学习措辞与格式(LOAF-M2L)。在通用领域预训练后,所提模型首先从大规模纯文本歌词语料中获得长度意识。随后,我们引入一个基于音乐学研究中旋律与歌词关系的新目标,使模型学习旋律的细粒度格式要求。相比朴素微调,我们的模型在输出的行数和每行音节数要求上分别取得了 3.75% 和 21.44% 的绝对准确率提升,且不牺牲文本流畅性。此外,在主观评估中,相比最先进的旋律到歌词生成模型,我们的模型在音乐-歌词兼容性和整体质量上分别取得了 63.92% 和 74.18% 的相对提升,凸显了格式学习的重要性。
paper_authors: Felix Burkhardt, Uwe Reichel, Florian Eyben, Björn Schuller
for: 这篇论文旨在研究如何通过修改语音合成的韵律来模拟情感表达。
methods: 这两个基于规则的模型使用语音合成标记语言(SSML)来控制语音合成的韵律,以模拟情感表达。
results: Results indicate that with a very simple method both dimensions arousal (.76 UAR) and valence (.43 UAR) can be simulated。Abstract
We introduce two rule-based models to modify the prosody of speech synthesis in order to modulate the emotion to be expressed. The prosody modulation is based on speech synthesis markup language (SSML) and can be used with any commercial speech synthesizer. The models as well as the optimization result are evaluated against human emotion annotations. Results indicate that with a very simple method both dimensions arousal (.76 UAR) and valence (.43 UAR) can be simulated.
摘要
我们介绍了两个基于规则的模型,用于修改语音合成中的韵律以调控所表达的情感。这种韵律调控基于语音合成标记语言(SSML),可与任何商用语音合成器配合使用。我们将这些模型及其优化结果与人工情感标注进行了对比评估。结果显示,使用非常简单的方法即可模拟唤醒度(.76 UAR)和效价(.43 UAR)两个情感维度。
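Since the models drive the synthesizer through SSML, a toy rule set is easy to sketch: map arousal to speaking rate and valence to mean pitch via standard `<prosody>` attributes. The specific coefficients below are invented for illustration and are not the paper's rules.

```python
def emotion_to_ssml(text, arousal=0.0, valence=0.0):
    # Illustrative mapping (not the paper's parameters): higher arousal
    # -> faster speaking rate; more positive valence -> higher mean pitch.
    rate = f"{100 + int(30 * arousal)}%"      # SSML rate as a percentage
    pitch = f"{int(15 * valence):+d}%"        # relative pitch change
    return (f'<speak><prosody rate="{rate}" pitch="{pitch}">'
            f'{text}</prosody></speak>')

print(emotion_to_ssml("How nice to see you!", arousal=0.6, valence=0.8))
```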
A Database with Directivities of Musical Instruments
results: 这个论文提供了41种乐器单音的录音和辐射模式数据,以及每种乐器在三分之一倍频程频带内平均的指向性。这些数据适用于声学仿真和可听化,可以用于创建真实的听觉效果。Abstract
We present a database of recordings and radiation patterns of individual notes for 41 modern and historical musical instruments, measured with a 32-channel spherical microphone array in anechoic conditions. In addition, directivities averaged in one-third octave bands have been calculated for each instrument, which are suitable for use in acoustic simulation and auralisation. The data are provided in SOFA format. Spatial upsampling of the directivities was performed based on spherical spline interpolation and converted to OpenDAFF and GLL format for use in room acoustic and electro-acoustic simulation software. For this purpose, a method is presented how these directivities can be referenced to a specific microphone position in order to achieve a physically correct auralisation without colouration. The data is available under the CC BY-SA 4.0 licence.
摘要
我们提供了41种现代及历史乐器单音的录音和辐射模式数据,使用32通道球形麦克风阵列在消声条件下测量。此外,我们还计算了每种乐器在三分之一倍频程频带内平均的指向性,适用于声学仿真和可听化。数据以 SOFA 格式提供。指向性基于球面样条插值进行了空间上采样,并转换为 OpenDAFF 和 GLL 格式,以便在房间声学和电声仿真软件中使用。为此,我们提出了一种方法,将这些指向性参考到特定的麦克风位置,从而实现物理上正确、无音色染色的可听化。数据以 CC BY-SA 4.0 许可证提供。
Flowchase: a Mobile Application for Pronunciation Training
results: 通过组合机器学习模型实现联合强制对齐和音素识别,提供了针对多个音段和超音段发音方面的反馈设计。Here’s a breakdown of each point:
for: The paper is written for providing personalized and instant feedback to English learners.
methods: The paper uses a mobile application called Flowchase and a speech technology that can segment and analyze speech segmental and supra-segmental features. The speech processing pipeline receives linguistic information and a speech sample, and then performs joint forced-alignment and phonetic recognition using machine learning models based on speech representation learning.
results: The paper achieves accurate recognition of segmental and supra-segmental pronunciation aspects using a combination of machine learning models, which enables the provision of personalized and instant feedback to English learners.Abstract
In this paper, we present a solution for providing personalized and instant feedback to English learners through a mobile application, called Flowchase, that is connected to a speech technology able to segment and analyze speech segmental and supra-segmental features. The speech processing pipeline receives linguistic information corresponding to an utterance to analyze along with a speech sample. After validation of the speech sample, a joint forced-alignment and phonetic recognition is performed thanks to a combination of machine learning models based on speech representation learning that provides necessary information for designing a feedback on a series of segmental and supra-segmental pronunciation aspects.
摘要
在这篇论文中,我们提出了一种解决方案,通过一款名为 Flowchase 的移动应用程序,为英语学习者提供个性化的即时反馈。该应用程序连接到一种能够切分并分析音段和超音段特征的语音技术。语音处理管道接收与待分析话语对应的语言信息以及语音样本。在验证语音样本后,借助基于语音表示学习的机器学习模型组合,执行联合强制对齐与音素识别,从而获得必要的信息,用于设计一系列音段和超音段发音方面的反馈。
paper_authors: Haoran Miao, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan
for: This paper proposes an online hybrid CTC/attention end-to-end ASR architecture for real-time speech recognition.
methods: The proposed architecture uses stable monotonic chunk-wise attention (sMoChA) to streamline global attention, truncated CTC (T-CTC) prefix score calculation, and dynamic waiting joint decoding (DWJD) algorithm for online prediction.
results: Compared with the offline CTC/attention model, the proposed online CTC/attention model improves the real-time factor in human-computer interaction services while maintaining moderate performance. This is the first full-stack online solution for CTC/attention end-to-end ASR architecture.Abstract
Recently, there has been increasing progress in end-to-end automatic speech recognition (ASR) architecture, which transcribes speech to text without any pre-trained alignments. One popular end-to-end approach is the hybrid Connectionist Temporal Classification (CTC) and attention (CTC/attention) based ASR architecture. However, how to deploy hybrid CTC/attention systems for online speech recognition is still a non-trivial problem. This article describes our proposed online hybrid CTC/attention end-to-end ASR architecture, which replaces all the offline components of conventional CTC/attention ASR architecture with their corresponding streaming components. Firstly, we propose stable monotonic chunk-wise attention (sMoChA) to stream the conventional global attention, and further propose monotonic truncated attention (MTA) to simplify sMoChA and solve the training-and-decoding mismatch problem of sMoChA. Secondly, we propose truncated CTC (T-CTC) prefix score to stream CTC prefix score calculation. Thirdly, we design dynamic waiting joint decoding (DWJD) algorithm to dynamically collect the predictions of CTC and attention in an online manner. Finally, we use latency-controlled bidirectional long short-term memory (LC-BLSTM) to stream the widely-used offline bidirectional encoder network. Experiments with LibriSpeech English and HKUST Mandarin tasks demonstrate that, compared with the offline CTC/attention model, our proposed online CTC/attention model improves the real time factor in human-computer interaction services and maintains its performance with moderate degradation. To the best of our knowledge, this is the first work to provide the full-stack online solution for CTC/attention end-to-end ASR architecture.
摘要
近年来,端到端自动语音识别(ASR)架构取得了显著进步,这类模型无需预先训练的对齐即可将语音转写为文本。一种流行的端到端方法是混合连接时序分类(CTC)与注意力(CTC/attention)的 ASR 架构。然而,如何将混合 CTC/attention 系统部署于在线语音识别仍是一个不小的问题。本文描述了我们提出的在线混合 CTC/attention 端到端 ASR 架构,它将传统 CTC/attention ASR 架构中所有离线组件替换为相应的流式组件。首先,我们提出稳定单调分块注意力(sMoChA)以流式化传统的全局注意力,并进一步提出单调截断注意力(MTA)以简化 sMoChA 并解决其训练与解码不匹配的问题。其次,我们提出截断 CTC(T-CTC)前缀分数,以流式化 CTC 前缀分数的计算。第三,我们设计了动态等待联合解码(DWJD)算法,以在线方式动态收集 CTC 和注意力的预测。最后,我们使用延迟控制的双向长短时记忆网络(LC-BLSTM)来流式化广泛使用的离线双向编码器网络。在 LibriSpeech 英语和 HKUST 普通话任务上的实验表明,相比离线 CTC/attention 模型,我们提出的在线 CTC/attention 模型改善了人机交互服务中的实时因子,并在适度的性能损失下保持了识别效果。据我们所知,这是首个为 CTC/attention 端到端 ASR 架构提供全栈在线解决方案的工作。
Differentially Private Adversarial Auto-Encoder to Protect Gender in Voice Biometrics
results: 在 VoxCeleb 数据集上,可以成功隐藏说话人的性别信息,同时保持说话人验证效果,并可根据需要调整 Laplace 噪声的强度,在隐私与效用之间选择平衡。Abstract
Over the last decade, the use of Automatic Speaker Verification (ASV) systems has become increasingly widespread in response to the growing need for secure and efficient identity verification methods. The voice data encompasses a wealth of personal information, which includes but is not limited to gender, age, health condition, stress levels, and geographical and socio-cultural origins. These attributes, known as soft biometrics, are private and the user may wish to keep them confidential. However, with the advancement of machine learning algorithms, soft biometrics can be inferred automatically, creating the potential for unauthorized use. As such, it is crucial to ensure the protection of these personal data that are inherent within the voice while retaining the utility of identity recognition. In this paper, we present an adversarial Auto-Encoder--based approach to hide gender-related information in speaker embeddings, while preserving their effectiveness for speaker verification. We use an adversarial procedure against a gender classifier and incorporate a layer based on the Laplace mechanism into the Auto-Encoder architecture. This layer adds Laplace noise for more robust gender concealment and ensures differential privacy guarantees during inference for the output speaker embeddings. Experiments conducted on the VoxCeleb dataset demonstrate that speaker verification tasks can be effectively carried out while concealing speaker gender and ensuring differential privacy guarantees; moreover, the intensity of the Laplace noise can be tuned to select the desired trade-off between privacy and utility.
摘要
过去十年,为满足对安全高效身份验证方法日益增长的需求,自动说话人验证(ASV)系统的使用越来越普遍。语音数据蕴含大量个人信息,包括但不限于性别、年龄、健康状况、压力水平以及地域和社会文化背景。这些属性被称为软生物特征,属于隐私信息,用户可能希望对其保密。然而,随着机器学习算法的进步,软生物特征可以被自动推断出来,从而产生被滥用的可能。因此,在保留身份识别效用的同时,保护语音中固有的这些个人数据至关重要。在这篇论文中,我们提出了一种基于对抗自编码器的方法,在说话人嵌入中隐藏性别相关信息,同时保持其在说话人验证中的有效性。我们针对性别分类器采用对抗训练流程,并在自编码器架构中加入基于 Laplace 机制的层。该层添加 Laplace 噪声以更稳健地隐藏性别,并确保推理时输出的说话人嵌入满足差分隐私保证。在 VoxCeleb 数据集上的实验表明,可以在隐藏说话人性别并确保差分隐私保证的同时有效执行说话人验证任务;此外,可以调整 Laplace 噪声的强度,以选择所需的隐私与效用之间的平衡。
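The Laplace-mechanism layer is compact to sketch in PyTorch: noise with scale sensitivity/epsilon added to each embedding. The sensitivity, epsilon, and embedding size below are placeholders, and the adversarial training against the gender classifier is not shown.

```python
import torch

class LaplaceLayer(torch.nn.Module):
    # Sketch of a Laplace-mechanism layer: adds Laplace(0, sensitivity/eps)
    # noise to each embedding, trading utility for privacy. The values of
    # `sensitivity` and `eps` here are illustrative placeholders.
    def __init__(self, sensitivity=1.0, eps=5.0):
        super().__init__()
        self.scale = sensitivity / eps

    def forward(self, x):
        noise = torch.distributions.Laplace(0.0, self.scale).sample(x.shape)
        return x + noise

emb = torch.randn(4, 192)             # speaker embeddings (batch, dim)
private = LaplaceLayer(eps=5.0)(emb)
print((private - emb).abs().mean())   # average perturbation magnitude
```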
Leveraging multilingual transfer for unsupervised semantic acoustic word embeddings
results: 实验结果表明,利用预训练的多语言 AWE 模型可以得到语音词嵌入的语义表示,并在词相似性任务中表现出色。此外,本文还首次展示了 AWE 可用于语义示例查询(query-by-example)搜索。Abstract
Acoustic word embeddings (AWEs) are fixed-dimensional vector representations of speech segments that encode phonetic content so that different realisations of the same word have similar embeddings. In this paper we explore semantic AWE modelling. These AWEs should not only capture phonetics but also the meaning of a word (similar to textual word embeddings). We consider the scenario where we only have untranscribed speech in a target language. We introduce a number of strategies leveraging a pre-trained multilingual AWE model -- a phonetic AWE model trained on labelled data from multiple languages excluding the target. Our best semantic AWE approach involves clustering word segments using the multilingual AWE model, deriving soft pseudo-word labels from the cluster centroids, and then training a Skipgram-like model on the soft vectors. In an intrinsic word similarity task measuring semantics, this multilingual transfer approach outperforms all previous semantic AWE methods. We also show -- for the first time -- that AWEs can be used for downstream semantic query-by-example search.
摘要
声学词嵌入(AWE)是语音片段的固定维度向量表示,编码语音内容,使同一个词的不同发音实现具有相似的嵌入。在本文中,我们探索语义 AWE 建模:这些 AWE 不仅要捕捉语音特征,还要像文本词嵌入一样捕捉词义。我们考虑目标语言只有未转录语音的场景,并提出了多种利用预训练多语言 AWE 模型(在不含目标语言的多语言标注数据上训练的语音 AWE 模型)的策略。我们最好的语义 AWE 方法是:使用多语言 AWE 模型对词片段聚类,从聚类中心导出软伪词标签,然后在这些软向量上训练类 Skipgram 模型。在衡量语义的内在词相似性任务中,这种多语言迁移方法优于此前所有的语义 AWE 方法。我们还首次表明,AWE 可用于下游的语义示例查询(query-by-example)搜索。
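The soft pseudo-label step can be sketched with off-the-shelf tools: cluster the (here random, stand-in) multilingual AWEs, then turn distances to the centroids into a soft distribution. The temperature and cluster count are assumptions, and the Skipgram-style training on the soft vectors is left as a comment.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
awes = rng.normal(size=(500, 64))      # stand-in multilingual AWEs

# Step 1: cluster word segments; centroids act as pseudo-word types.
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(awes)

# Step 2: soft pseudo-word labels from distances to the centroids
# (a temperature-scaled softmax is one plausible choice).
d = np.linalg.norm(awes[:, None, :] - km.cluster_centers_[None], axis=-1)
soft_labels = np.exp(-d / 0.5)
soft_labels /= soft_labels.sum(axis=1, keepdims=True)
print(soft_labels.shape, soft_labels[0].round(2))
# Step 3 (not shown): train a Skipgram-style model on these soft vectors.
```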
results: 实验结果表明,所提的 multiLID 方法在许多现实场景中表现优越,可以准确检测扩散模型生成的图像,并正确识别相应的生成网络。Abstract
Diffusion models recently have been successfully applied for the visual synthesis of strikingly realistic appearing images. This raises strong concerns about their potential for malicious purposes. In this paper, we propose using the lightweight multi Local Intrinsic Dimensionality (multiLID), which has been originally developed in context of the detection of adversarial examples, for the automatic detection of synthetic images and the identification of the according generator networks. In contrast to many existing detection approaches, which often only work for GAN-generated images, the proposed method provides close to perfect detection results in many realistic use cases. Extensive experiments on known and newly created datasets demonstrate that the proposed multiLID approach exhibits superiority in diffusion detection and model identification. Since the empirical evaluations of recent publications on the detection of generated images are often mainly focused on the "LSUN-Bedroom" dataset, we further establish a comprehensive benchmark for the detection of diffusion-generated images, including samples from several diffusion models with different image sizes.
摘要
扩散模型近来已成功应用于合成逼真得惊人的图像,这引发了其被恶意利用的强烈担忧。在这篇论文中,我们提议使用轻量级的多重局部内在维度(multiLID)方法——该方法最初是为检测对抗样本而开发——来自动检测合成图像并识别相应的生成网络。与许多通常只对 GAN 生成图像有效的现有检测方法不同,所提方法在许多现实用例中可达到接近完美的检测结果。在已有及新构建的数据集上的大量实验表明,所提 multiLID 方法在扩散检测和模型识别方面具有优势。由于近期关于生成图像检测的实证评估往往主要集中在 "LSUN-Bedroom" 数据集上,我们进一步建立了一个全面的扩散生成图像检测基准,其中包含来自多种扩散模型、不同图像尺寸的样本。
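multiLID builds on local intrinsic dimensionality estimates; the sketch below shows the classic maximum-likelihood LID estimator computed from k-nearest-neighbour distances. In the actual method such estimates are gathered over multiple network layers and fed to a classifier, which is not shown, and the data here is synthetic.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lid_mle(queries, reference, k=20):
    # Maximum-likelihood local intrinsic dimensionality (Levina-Bickel
    # estimator) of each query w.r.t. a reference set; a multiLID-style
    # detector stacks such estimates across several feature layers.
    nn = NearestNeighbors(n_neighbors=k).fit(reference)
    dist, _ = nn.kneighbors(queries)          # (n, k), ascending
    r_k = dist[:, -1:]                        # distance to k-th neighbour
    return -1.0 / np.mean(np.log(dist / r_k + 1e-12), axis=1)

rng = np.random.default_rng(1)
real = rng.standard_normal((1000, 32))
fake = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 32))  # low-dim
print("LID of real queries:", lid_mle(real[:5], real[5:]).round(1))
print("LID of fake queries:", lid_mle(fake[:5], real[5:]).round(1))
```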
GAFAR: Graph-Attention Feature-Augmentation for Registration A Fast and Light-weight Point Set Registration Algorithm
methods: 我们提出了一种新的快速和轻量级网络架构,使用注意力机制来在推理时对点描述符进行增强,以便更好地适应特定点云的注册任务。我们使用了全连接图 both within和between点云,让网络理解点云之间的相互关系和重要性,从而使我们的方法具有对于异常值和3D扫描数据的耐性和泛化能力。
results: 我们在不同的配准和泛化任务中测试了配准算法的性能,并提供了运行时间和资源消耗的信息。我们的代码和训练权重可以在 https://github.com/mordecaimalignatius/GAFAR/ 下载。Abstract
Rigid registration of point clouds is a fundamental problem in computer vision with many applications from 3D scene reconstruction to geometry capture and robotics. If a suitable initial registration is available, conventional methods like ICP and its many variants can provide adequate solutions. In absence of a suitable initialization and in the presence of a high outlier rate or in the case of small overlap though the task of rigid registration still presents great challenges. The advent of deep learning in computer vision has brought new drive to research on this topic, since it provides the possibility to learn expressive feature-representations and provide one-shot estimates instead of depending on time-consuming iterations of conventional robust methods. Yet, the rotation and permutation invariant nature of point clouds poses its own challenges to deep learning, resulting in loss of performance and low generalization capability due to sensitivity to outliers and characteristics of 3D scans not present during network training. In this work, we present a novel fast and light-weight network architecture using the attention mechanism to augment point descriptors at inference time to optimally suit the registration task of the specific point clouds it is presented with. Employing a fully-connected graph both within and between point clouds lets the network reason about the importance and reliability of points for registration, making our approach robust to outliers, low overlap and unseen data. We test the performance of our registration algorithm on different registration and generalization tasks and provide information on runtime and resource consumption. The code and trained weights are available at https://github.com/mordecaimalignatius/GAFAR/.
摘要
点云刚性配准是计算机视觉中的基础问题,应用广泛,从三维场景重建到几何捕获和机器人等。如果有合适的初始配准,ICP 及其众多变体等传统方法可以给出足够好的解。但在缺乏合适初始化、外点率较高或重叠较小的情况下,刚性配准仍然面临巨大挑战。计算机视觉领域深度学习的兴起为该课题研究注入了新的动力,因为它可以学习有表达力的特征表示,并给出一次性估计,而不必依赖传统鲁棒方法耗时的迭代。然而,点云的旋转和置换不变性给深度学习带来了自身的挑战:对外点以及训练时未见过的三维扫描特性敏感,导致性能下降和泛化能力不足。在这项工作中,我们提出了一种新的快速轻量级网络架构,利用注意力机制在推理时增强点描述符,使其最优地适配所面对的具体点云的配准任务。通过在点云内部和点云之间建立全连接图,网络可以推理各点对配准的重要性和可靠性,使我们的方法对外点、低重叠和未见数据具有鲁棒性。我们在不同的配准和泛化任务上测试了配准算法的性能,并提供了运行时间和资源消耗信息。代码和训练权重可在 https://github.com/mordecaimalignatius/GAFAR/ 获取。
Dual Arbitrary Scale Super-Resolution for Multi-Contrast MRI
results: 实验结果显示,这个技术可以在任意缩放比例下进行医学图像重建,并在两个公开的 MRI 数据集上优于现有方法。Abstract
Limited by imaging systems, the reconstruction of Magnetic Resonance Imaging (MRI) images from partial measurement is essential to medical imaging research. Benefiting from the diverse and complementary information of multi-contrast MR images in different imaging modalities, multi-contrast Super-Resolution (SR) reconstruction is promising to yield SR images with higher quality. In the medical scenario, to fully visualize the lesion, radiologists are accustomed to zooming the MR images at arbitrary scales rather than using a fixed scale, as used by most MRI SR methods. In addition, existing multi-contrast MRI SR methods often require a fixed resolution for the reference image, which makes acquiring reference images difficult and imposes limitations on arbitrary scale SR tasks. To address these issues, we proposed an implicit neural representations based dual-arbitrary multi-contrast MRI super-resolution method, called Dual-ArbNet. First, we decouple the resolution of the target and reference images by a feature encoder, enabling the network to input target and reference images at arbitrary scales. Then, an implicit fusion decoder fuses the multi-contrast features and uses an Implicit Decoding Function~(IDF) to obtain the final MRI SR results. Furthermore, we introduce a curriculum learning strategy to train our network, which improves the generalization and performance of our Dual-ArbNet. Extensive experiments in two public MRI datasets demonstrate that our method outperforms state-of-the-art approaches under different scale factors and has great potential in clinical practice.
摘要
受成像系统限制,从部分测量中重建磁共振成像(MRI)图像是医学成像研究的关键。得益于不同成像模式下多对比度 MR 图像所提供的多样且互补的信息,多对比度超分辨率(SR)重建有望获得更高质量的 SR 图像。在医疗场景中,为了充分观察病灶,放射科医生习惯以任意比例缩放 MR 图像,而不是像大多数 MRI SR 方法那样使用固定比例。此外,现有的多对比度 MRI SR 方法通常要求参考图像具有固定分辨率,这使得参考图像难以获取,并限制了任意比例 SR 任务。为解决这些问题,我们提出了一种基于隐式神经表示的双任意多对比度 MRI 超分辨率方法,称为 Dual-ArbNet。首先,我们通过特征编码器解耦目标图像与参考图像的分辨率,使网络可以输入任意比例的目标和参考图像。然后,隐式融合解码器融合多对比度特征,并使用隐式解码函数(IDF)得到最终的 MRI SR 结果。此外,我们还引入课程学习策略来训练网络,提升了 Dual-ArbNet 的泛化性与性能。在两个公开 MRI 数据集上的大量实验表明,我们的方法在不同缩放因子下均优于最先进的方法,具有很大的临床应用潜力。
MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers
methods: 提出了一种名为 MSViT 的动态混合尺度词元化(tokenization)方案,通过条件门控机制为每个图像区域选择最优的词元尺度,从而按输入动态确定词元数量。
results: 在图像分类和分割任务上,MSViT 能够达到更好的精度-复杂度权衡,并且训练时间和资源开销较小。Abstract
The input tokens to Vision Transformers carry little semantic meaning as they are defined as regular equal-sized patches of the input image, regardless of its content. However, processing uniform background areas of an image should not necessitate as much compute as dense, cluttered areas. To address this issue, we propose a dynamic mixed-scale tokenization scheme for ViT, MSViT. Our method introduces a conditional gating mechanism that selects the optimal token scale for every image region, such that the number of tokens is dynamically determined per input. The proposed gating module is lightweight, agnostic to the choice of transformer backbone, and trained within a few epochs (e.g., 20 epochs on ImageNet) with little training overhead. In addition, to enhance the conditional behavior of the gate during training, we introduce a novel generalization of the batch-shaping loss. We show that our gating module is able to learn meaningful semantics despite operating locally at the coarse patch-level. We validate MSViT on the tasks of classification and segmentation where it leads to improved accuracy-complexity trade-off.
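A toy version of a per-region gate illustrates the idea of content-dependent token counts: score each coarse patch and either keep it or replace it with its four fine sub-patches. This is only an illustration of the mechanism, not the paper's module, which is trained end-to-end with a generalized batch-shaping loss rather than this hard threshold.

```python
import torch
import torch.nn as nn

class MixedScaleGate(nn.Module):
    # Toy per-region gate: a linear scorer decides, per coarse patch,
    # whether to keep it or split it into four fine patches, so the
    # token count varies with image content.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, coarse, fine):
        # coarse: (B, N, D); fine: (B, N, 4, D) = 4 sub-patches per region
        keep_coarse = torch.sigmoid(self.score(coarse)) > 0.5  # (B, N, 1)
        tokens = []
        for b in range(coarse.size(0)):
            t = [coarse[b, i:i + 1] if keep_coarse[b, i] else fine[b, i]
                 for i in range(coarse.size(1))]
            tokens.append(torch.cat(t, dim=0))
        return tokens  # ragged: one (num_tokens_b, D) tensor per image

gate = MixedScaleGate(dim=32)
out = gate(torch.randn(2, 16, 32), torch.randn(2, 16, 4, 32))
print([o.shape for o in out])  # per-image token counts differ
```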
Multi-Scale Prototypical Transformer for Whole Slide Image Classification
results: 实验结果显示,所提出的 MSPT 方法在两个公开 WSI 数据集上优于所有对比方法,表明其在计算病理学中的应用潜力。Abstract
Whole slide image (WSI) classification is an essential task in computational pathology. Despite the recent advances in multiple instance learning (MIL) for WSI classification, accurate classification of WSIs remains challenging due to the extreme imbalance between the positive and negative instances in bags, and the complicated pre-processing to fuse multi-scale information of WSI. To this end, we propose a novel multi-scale prototypical Transformer (MSPT) for WSI classification, which includes a prototypical Transformer (PT) module and a multi-scale feature fusion module (MFFM). The PT is developed to reduce redundant instances in bags by integrating prototypical learning into the Transformer architecture. It substitutes all instances with cluster prototypes, which are then re-calibrated through the self-attention mechanism of the Trans-former. Thereafter, an MFFM is proposed to fuse the clustered prototypes of different scales, which employs MLP-Mixer to enhance the information communication between prototypes. The experimental results on two public WSI datasets demonstrate that the proposed MSPT outperforms all the compared algorithms, suggesting its potential applications.
摘要
全切片图像(WSI)分类是计算病理学中的一项关键任务。尽管多示例学习(MIL)在 WSI 分类上近来有所进展,但由于袋内正负实例之间的极端不平衡,以及融合 WSI 多尺度信息所需的复杂预处理,准确的 WSI 分类仍然具有挑战性。为此,我们提出了一种新的多尺度原型 Transformer(MSPT)用于 WSI 分类,它包括原型 Transformer(PT)模块和多尺度特征融合模块(MFFM)。PT 模块通过将原型学习融入 Transformer 架构来减少袋内冗余实例:它用聚类原型替代所有实例,再通过 Transformer 的自注意力机制对这些原型重新校准。随后,MFFM 用于融合不同尺度的聚类原型,并采用 MLP-Mixer 增强原型之间的信息交流。在两个公开 WSI 数据集上的实验结果表明,所提 MSPT 优于所有对比算法,显示了其应用潜力。
Focusing on what to decode and what to train: Efficient Training with HOI Split Decoders and Specific Target Guided DeNoising
results: 我们的方法(SOV-STG)仅用三分之一的训练轮数就达到了超过最先进方法的精度。代码可以在 \url{https://github.com/cjw2021/SOV-STG} 上获取。Abstract
Recent one-stage transformer-based methods achieve notable gains in the Human-object Interaction Detection (HOI) task by leveraging the detection of DETR. However, the current methods redirect the detection target of the object decoder, and the box target is not explicitly separated from the query embeddings, which leads to long and hard training. Furthermore, matching the predicted HOI instances with the ground-truth is more challenging than object detection, simply adapting training strategies from the object detection makes the training more difficult. To clear the ambiguity between human and object detection and share the prediction burden, we propose a novel one-stage framework (SOV), which consists of a subject decoder, an object decoder, and a verb decoder. Moreover, we propose a novel Specific Target Guided (STG) DeNoising strategy, which leverages learnable object and verb label embeddings to guide the training and accelerates the training convergence. In addition, for the inference part, the label-specific information is directly fed into the decoders by initializing the query embeddings from the learnable label embeddings. Without additional features or prior language knowledge, our method (SOV-STG) achieves higher accuracy than the state-of-the-art method in one-third of training epochs. The code is available at \url{https://github.com/cjw2021/SOV-STG}.
摘要
近期基于单阶段 Transformer 的方法借助 DETR 的检测能力,在人-物交互检测(HOI)任务上取得了显著进展。然而,现有方法重定向了物体解码器的检测目标,且边框目标没有从查询嵌入中显式分离出来,这导致训练漫长而困难。此外,将预测的 HOI 实例与真值匹配比目标检测更具挑战性,简单地沿用目标检测的训练策略会使训练更加困难。为了消除人与物体检测之间的歧义并分担预测负担,我们提出了一种新的单阶段框架(SOV),它由主体解码器、物体解码器和动词解码器组成。此外,我们提出了一种新的特定目标引导(STG)去噪策略,利用可学习的物体和动词标签嵌入来引导训练并加速训练收敛。在推理阶段,通过用可学习的标签嵌入初始化查询嵌入,将标签特定信息直接馈入解码器。在不使用额外特征或先验语言知识的情况下,我们的方法(SOV-STG)仅用三分之一的训练轮数即达到了高于最先进方法的精度。代码可在 \url{https://github.com/cjw2021/SOV-STG} 获取。
Interactive Image Segmentation with Cross-Modality Vision Transformers
paper_authors: Kun Li, George Vosselman, Michael Ying Yang
for: 这 paper 是为了提出一种基于视觉转换器的交互式图像分割方法,以提高图像分割精度和稳定性。
methods: 该 paper 使用了 cross-modality 视觉转换器,利用多模态数据之间的互信息,更好地引导学习过程。
results: 实验结果表明,提出的方法在多个benchmark上表现出色,与之前的state-of-the-art模型相比,具有更高的精度和稳定性。Abstract
Interactive image segmentation aims to segment the target from the background with the manual guidance, which takes as input multimodal data such as images, clicks, scribbles, and bounding boxes. Recently, vision transformers have achieved a great success in several downstream visual tasks, and a few efforts have been made to bring this powerful architecture to interactive segmentation task. However, the previous works neglect the relations between two modalities and directly mock the way of processing purely visual information with self-attentions. In this paper, we propose a simple yet effective network for click-based interactive segmentation with cross-modality vision transformers. Cross-modality transformers exploits mutual information to better guide the learning process. The experiments on several benchmarks show that the proposed method achieves superior performance in comparison to the previous state-of-the-art models. The stability of our method in term of avoiding failure cases shows its potential to be a practical annotation tool. The code and pretrained models will be released under https://github.com/lik1996/iCMFormer.
摘要
交互式图像分割的目标是在人工引导下将目标从背景中分割出来,其输入为图像、点击、涂鸦和边界框等多模态数据。近来,视觉 Transformer 在多个下游视觉任务中取得了巨大成功,也有一些工作尝试将这一强大架构引入交互式分割任务。然而,先前的工作忽视了两种模态之间的关系,直接照搬处理纯视觉信息的自注意力方式。在这篇论文中,我们提出了一种简单而有效的、基于跨模态视觉 Transformer 的点击式交互分割网络。跨模态 Transformer 利用互信息来更好地引导学习过程。在多个基准上的实验表明,所提方法优于此前最先进的模型。我们的方法在避免失败案例方面的稳定性显示了其作为实用标注工具的潜力。代码和预训练模型将在 https://github.com/lik1996/iCMFormer 发布。
results: 该论文通过示例展示了各种自动梯度下降算法和流行的第二个信息近似算法的图像,并提供了基于连接环境的卷积特性变换,以便简化和加速图像评估。此外,该论文还提出了一种基于TN的高效绘制实现,可以加速KFAC变体和实现硬件高效的tensor dropout。Abstract
Despite their simple intuition, convolutions are more tedious to analyze than dense layers, which complicates the generalization of theoretical and algorithmic ideas. We provide a new perspective onto convolutions through tensor networks (TNs) which allow reasoning about the underlying tensor multiplications by drawing diagrams, and manipulating them to perform function transformations, sub-tensor access, and fusion. We demonstrate this expressive power by deriving the diagrams of various autodiff operations and popular approximations of second-order information with full hyper-parameter support, batching, channel groups, and generalization to arbitrary convolution dimensions. Further, we provide convolution-specific transformations based on the connectivity pattern which allow to re-wire and simplify diagrams before evaluation. Finally, we probe computational performance, relying on established machinery for efficient TN contraction. Our TN implementation speeds up a recently-proposed KFAC variant up to 4.5x and enables new hardware-efficient tensor dropout for approximate backpropagation.
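The core idea — a convolution is just a tensor contraction once the input is index-unfolded — fits in a few lines of einsum. The snippet below is a 1-D, stride-1, valid-padding illustration of the TN view, not the paper's diagram machinery.

```python
import numpy as np

# A 1-D convolution written as a tensor-network contraction: unfold the
# input so the kernel contraction becomes a single einsum.
B, C_in, C_out, L, K = 2, 3, 5, 10, 3
x = np.random.rand(B, C_in, L)
w = np.random.rand(C_out, C_in, K)

# Unfold: X[b, c, k, l'] = x[b, c, l' + k]  (valid padding, stride 1)
L_out = L - K + 1
X = np.stack([x[:, :, k:k + L_out] for k in range(K)], axis=2)

# Contract the channel and kernel legs -- the TN "diagram" in code.
y = np.einsum('bckl,ock->bol', X, w)
print(y.shape)  # (2, 5, 8)
```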
Joint Hierarchical Priors and Adaptive Spatial Resolution for Efficient Neural Image Compression
results: 与 VVC 参考编码器(VTM-18.0)和神经编解码器 SwinT-ChARM 相比,所提框架在编码效率与解码器复杂度的权衡上取得了显著提升。我们还进行了模型缩放研究以及客观与主观分析,证明了方法的计算效率和性能优势。Abstract
Recently, the performance of neural image compression (NIC) has steadily improved thanks to the last line of study, reaching or outperforming state-of-the-art conventional codecs. Despite significant progress, current NIC methods still rely on ConvNet-based entropy coding, limited in modeling long-range dependencies due to their local connectivity and the increasing number of architectural biases and priors, resulting in complex underperforming models with high decoding latency. Motivated by the efficiency investigation of the Tranformer-based transform coding framework, namely SwinT-ChARM, we propose to enhance the latter, as first, with a more straightforward yet effective Tranformer-based channel-wise auto-regressive prior model, resulting in an absolute image compression transformer (ICT). Through the proposed ICT, we can capture both global and local contexts from the latent representations and better parameterize the distribution of the quantized latents. Further, we leverage a learnable scaling module with a sandwich ConvNeXt-based pre-/post-processor to accurately extract more compact latent codes while reconstructing higher-quality images. Extensive experimental results on benchmark datasets showed that the proposed framework significantly improves the trade-off between coding efficiency and decoder complexity over the versatile video coding (VVC) reference encoder (VTM-18.0) and the neural codec SwinT-ChARM. Moreover, we provide model scaling studies to verify the computational efficiency of our approach and conduct several objective and subjective analyses to bring to the fore the performance gap between the adaptive image compression transformer (AICT) and the neural codec SwinT-ChARM.
摘要
近来,得益于最新的研究,神经图像压缩(NIC)的性能稳步提升,已达到甚至超过最先进的传统编解码器。尽管进展显著,现有 NIC 方法仍依赖基于卷积网络的熵编码,由于其局部连接性以及日益增多的架构偏置和先验,难以建模长距离依赖,导致模型复杂、性能欠佳且解码延迟高。受基于 Transformer 的变换编码框架 SwinT-ChARM 效率研究的启发,我们首先提出用一种更直接而有效的基于 Transformer 的通道自回归先验模型对其进行增强,得到绝对图像压缩 Transformer(ICT)。借助 ICT,我们可以同时从潜在表示中捕捉全局和局部上下文,更好地参数化量化潜变量的分布。此外,我们采用带有 ConvNeXt 式预/后处理器的可学习缩放模块,在重建更高质量图像的同时准确提取更紧凑的潜在编码。在基准数据集上的大量实验结果表明,所提框架相比 VVC 参考编码器(VTM-18.0)和神经编解码器 SwinT-ChARM,显著改善了编码效率与解码器复杂度之间的权衡。此外,我们进行了模型缩放研究以验证方法的计算效率,并开展了若干客观与主观分析,以凸显自适应图像压缩 Transformer(AICT)与神经编解码器 SwinT-ChARM 之间的性能差距。
RanPAC: Random Projections and Pre-trained Models for Continual Learning
paper_authors: Mark D. McDonnell, Dong Gong, Amin Parveneh, Ehsan Abbasnejad, Anton van den Hengel
for: 这篇 paper 的目的是提出一种简洁有效的持续学习方法,使预训练模型能够在非平稳数据流中逐步学习不同的任务,而不忘记之前学到的内容。
methods: 这篇 paper 使用无需训练的随机投影和类原型累积来解决持续学习中的遗忘问题。具体来说,作者在预训练模型的特征表示与输出头之间注入一层带非线性激活的冻结随机投影层,以在扩张的维度中捕捉特征之间的交互,提升基于类原型的持续学习的线性可分性。此外,作者还证明了在使用预训练表示时,需要对类原型去相关以降低分布差异。
results: 与之前应用于预训练 ViT-B/16 模型的方法相比,该方法在七个类增量基准数据集上将最终错误率降低了 10% 到 62%,且不使用任何重放缓存。这表明,预训练模型可以支撑简单、有效且快速的持续学习,同时避免遗忘问题。Abstract
Continual learning (CL) aims to incrementally learn different tasks (such as classification) in a non-stationary data stream without forgetting old ones. Most CL works focus on tackling catastrophic forgetting under a learning-from-scratch paradigm. However, with the increasing prominence of foundation models, pre-trained models equipped with informative representations have become available for various downstream requirements. Several CL methods based on pre-trained models have been explored, either utilizing pre-extracted features directly (which makes bridging distribution gaps challenging) or incorporating adaptors (which may be subject to forgetting). In this paper, we propose a concise and effective approach for CL with pre-trained models. Given that forgetting occurs during parameter updating, we contemplate an alternative approach that exploits training-free random projectors and class-prototype accumulation, which thus bypasses the issue. Specifically, we inject a frozen Random Projection layer with nonlinear activation between the pre-trained model's feature representations and output head, which captures interactions between features with expanded dimensionality, providing enhanced linear separability for class-prototype-based CL. We also demonstrate the importance of decorrelating the class-prototypes to reduce the distribution disparity when using pre-trained representations. These techniques prove to be effective and circumvent the problem of forgetting for both class- and domain-incremental continual learning. Compared to previous methods applied to pre-trained ViT-B/16 models, we reduce final error rates by between 10\% and 62\% on seven class-incremental benchmark datasets, despite not using any rehearsal memory. We conclude that the full potential of pre-trained models for simple, effective, and fast continual learning has not hitherto been fully tapped.
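A NumPy sketch in the spirit of the described approach: frozen nonlinear random features, per-class prototype (feature-sum) accumulation, and a ridge-style solve against the Gram matrix as the decorrelation step. The dimensions, ridge strength, and random data are placeholders, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_feat, d_proj, n_classes = 384, 2048, 10
feats = rng.normal(size=(500, d_feat))          # frozen backbone features
labels = rng.integers(0, n_classes, size=500)

# Frozen random projection with a nonlinearity (never trained).
W = rng.normal(size=(d_feat, d_proj)) / np.sqrt(d_feat)
H = np.maximum(feats @ W, 0.0)                  # ReLU features, (500, d_proj)

# Accumulate class prototypes and a Gram matrix; a ridge-style solve
# decorrelates the prototypes before classification.
G = H.T @ H                                     # (d_proj, d_proj)
C = np.zeros((d_proj, n_classes))
for y in range(n_classes):
    C[:, y] = H[labels == y].sum(axis=0)
lam = 1e2
Wo = np.linalg.solve(G + lam * np.eye(d_proj), C)

pred = np.argmax(H @ Wo, axis=1)
print("train accuracy:", (pred == labels).mean())
```

Because G and C are simple running sums, both can be accumulated one task at a time without revisiting old data, which is what makes this style of head attractive for continual learning.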
Rethinking Multiple Instance Learning for Whole Slide Image Classification: A Good Instance Classifier is All You Need
for: This paper focuses on developing an instance-level multiple instance learning (MIL) framework for weakly supervised whole slide image classification.
methods: The proposed method combines contrastive learning and prototype learning to effectively learn instance feature representation and generate accurate pseudo labels.
results: The proposed method achieves powerful performance on four datasets through extensive experiments and visualizations.
results: 经验和视觉化分析表明,提出的方法在四个数据集上具有强大的表现。Abstract
Weakly supervised whole slide image classification is usually formulated as a multiple instance learning (MIL) problem, where each slide is treated as a bag, and the patches cut out of it are treated as instances. Existing methods either train an instance classifier through pseudo-labeling or aggregate instance features into a bag feature through attention mechanisms and then train a bag classifier, where the attention scores can be used for instance-level classification. However, the pseudo instance labels constructed by the former usually contain a lot of noise, and the attention scores constructed by the latter are not accurate enough, both of which affect their performance. In this paper, we propose an instance-level MIL framework based on contrastive learning and prototype learning to effectively accomplish both instance classification and bag classification tasks. To this end, we propose an instance-level weakly supervised contrastive learning algorithm for the first time under the MIL setting to effectively learn instance feature representation. We also propose an accurate pseudo label generation method through prototype learning. We then develop a joint training strategy for weakly supervised contrastive learning, prototype learning, and instance classifier training. Extensive experiments and visualizations on four datasets demonstrate the powerful performance of our method. Codes will be available.
S3C: Self-Supervised Stochastic Classifiers for Few-Shot Class-Incremental Learning
results: Extensive evaluation on three benchmark datasets using multiple evaluation metrics shows the effectiveness of the method. Two additional realistic scenarios, with varying numbers of labeled samples for the new classes and with far fewer base classes, confirm that it also performs well in these settings.
Abstract
Few-shot class-incremental learning (FSCIL) aims to learn progressively about new classes with very few labeled samples, without forgetting the knowledge of already learnt classes. FSCIL suffers from two major challenges: (i) over-fitting on the new classes due to limited amount of data, (ii) catastrophically forgetting about the old classes due to unavailability of data from these classes in the incremental stages. In this work, we propose a self-supervised stochastic classifier (S3C) to counter both these challenges in FSCIL. The stochasticity of the classifier weights (or class prototypes) not only mitigates the adverse effect of absence of large number of samples of the new classes, but also the absence of samples from previously learnt classes during the incremental steps. This is complemented by the self-supervision component, which helps to learn features from the base classes which generalize well to unseen classes that are encountered in future, thus reducing catastrophic forgetting. Extensive evaluation on three benchmark datasets using multiple evaluation metrics show the effectiveness of the proposed framework. We also experiment on two additional realistic scenarios of FSCIL, namely where the number of annotated data available for each of the new classes can be different, and also where the number of base classes is much lesser, and show that the proposed S3C performs significantly better than the state-of-the-art for all these challenging scenarios.
results: Theoretical justification and extensive experimental analyses show that OKO yields better calibration, even when training with hard labels and dropping any additional calibration parameter tuning.
Abstract
Model overconfidence and poor calibration are common in machine learning and difficult to account for when applying standard empirical risk minimization. In this work, we propose a novel method to alleviate these problems that we call odd-$k$-out learning (OKO), which minimizes the cross-entropy error for sets rather than for single examples. This naturally allows the model to capture correlations across data examples and achieves both better accuracy and calibration, especially in limited training data and class-imbalanced regimes. Perhaps surprisingly, OKO often yields better calibration even when training with hard labels and dropping any additional calibration parameter tuning, such as temperature scaling. We provide theoretical justification, establishing that OKO naturally yields better calibration, and provide extensive experimental analyses that corroborate our theoretical findings. We emphasize that OKO is a general framework that can be easily adapted to many settings and the trained model can be applied to single examples at inference time, without introducing significant run-time overhead or architecture changes.
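The abstract specifies only that cross-entropy is minimized for sets rather than for single examples. The sketch below adopts one simple reading, aggregating logits over a sampled set and matching the set's averaged label distribution; the paper's actual odd-k-out set construction may differ, so treat this purely as an illustration of a set-level loss.

```python
import torch
import torch.nn.functional as F

def set_cross_entropy(logits, labels, num_classes):
    # logits: (set_size, num_classes), labels: (set_size,)
    # Aggregate predictions over the set by summing logits, and compare
    # against the set's aggregated (soft) label distribution.
    set_logits = logits.sum(dim=0)
    target = F.one_hot(labels, num_classes).float().mean(dim=0)
    return -(target * F.log_softmax(set_logits, dim=0)).sum()

# Usage on a randomly sampled set of 4 examples, one of them "odd":
logits = torch.randn(4, 10)
labels = torch.tensor([3, 3, 3, 7])
loss = set_cross_entropy(logits, labels, num_classes=10)
```

At inference time the trained model is still applied to single examples, so this set loss adds no run-time overhead.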
Source Identification: A Self-Supervision Task for Dense Prediction
results: Experiments show that the proposed SI task outperforms traditional self-supervision tasks for dense prediction, including inpainting, pixel shuffling, intensity shift, and super-resolution. Among the variants that fuse images of different types, fusing images from different patients performs best.
Abstract
The paradigm of self-supervision focuses on representation learning from raw data without the need for labor-intensive annotations, which is the main bottleneck of current data-driven methods. Self-supervision tasks are often used to pre-train a neural network with a large amount of unlabeled data and extract generic features of the dataset. The learned model is likely to contain useful information which can be transferred to the downstream main task and improve performance compared to random parameter initialization. In this paper, we propose a new self-supervision task called source identification (SI), which is inspired by the classic blind source separation problem. Synthetic images are generated by fusing multiple source images and the network's task is to reconstruct the original images, given the fused images. A proper understanding of the image content is required to successfully solve the task. We validate our method on two medical image segmentation tasks: brain tumor segmentation and white matter hyperintensities segmentation. The results show that the proposed SI task outperforms traditional self-supervision tasks for dense predictions including inpainting, pixel shuffling, intensity shift, and super-resolution. Among variations of the SI task fusing images of different types, fusing images from different patients performs best.
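A minimal sketch of one SI training step follows, assuming a simple averaging fusion of two source images and an image-to-image network `net` whose output has twice the input channels; the paper's exact fusion scheme and loss may differ.

```python
import torch

def si_step(net, img_a, img_b, optimizer):
    # Synthetic mixed input: the network must understand image content to
    # separate the two sources from their (symmetric) average.
    fused = 0.5 * (img_a + img_b)
    pred = net(fused)                          # predict both source images
    pred_a, pred_b = pred.chunk(2, dim=1)      # split channel-wise
    loss = torch.nn.functional.mse_loss(pred_a, img_a) + \
           torch.nn.functional.mse_loss(pred_b, img_b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```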
Direct segmentation of brain white matter tracts in diffusion MRI
paper_authors: Hamza Kebiri, Ali Gholipour, Meritxell Bach Cuadra, Davood Karimi
for: This paper presents a new deep learning method for directly and non-invasively segmenting white matter tracts from diffusion MRI data, avoiding the errors introduced by intermediate computations.
methods: The method uses a deep learning model that predicts tract segmentations directly from the diffusion MRI signal, without intermediate computations such as tractography or estimation of fiber orientation density.
results: Experiments show that the method achieves segmentation accuracy on par with state-of-the-art methods (mean Dice Similarity Coefficient of 0.826), with far better performance on the undersampled data typical of clinical studies, serving many important clinical and scientific applications.
Abstract
The brain white matter consists of a set of tracts that connect distinct regions of the brain. Segmentation of these tracts is often needed for clinical and research studies. Diffusion-weighted MRI offers unique contrast to delineate these tracts. However, existing segmentation methods rely on intermediate computations such as tractography or estimation of fiber orientation density. These intermediate computations, in turn, entail complex computations that can result in unnecessary errors. Moreover, these intermediate computations often require dense multi-shell measurements that are unavailable in many clinical and research applications. As a result, current methods suffer from low accuracy and poor generalizability. Here, we propose a new deep learning method that segments these tracts directly from the diffusion MRI data, thereby sidestepping the intermediate computation errors. Our experiments show that this method can achieve segmentation accuracy that is on par with the state of the art methods (mean Dice Similarity Coefficient of 0.826). Compared with the state of the art, our method offers far superior generalizability to undersampled data that are typical of clinical studies and to data obtained with different acquisition protocols. Moreover, we propose a new method for detecting inaccurate segmentations and show that it is more accurate than standard methods that are based on estimation uncertainty quantification. The new methods can serve many critically important clinical and scientific applications that require accurate and reliable non-invasive segmentation of white matter tracts.
Object Recognition System on a Tactile Device for Visually Impaired
for: developing a device that facilitates communication between individuals with visual impairments and their surroundings
methods: using an object detection model implemented on a Raspberry Pi, which is connected to a specifically designed tactile device to provide auditory feedback of object identification
results: successful demonstration of the device's effectiveness in scene understanding, including static or dynamic objects and screen contents such as TVs, computers, and mobile phones.
Abstract
People with visual impairments face numerous challenges when interacting with their environment. Our objective is to develop a device that facilitates communication between individuals with visual impairments and their surroundings. The device will convert visual information into auditory feedback, enabling users to understand their environment in a way that suits their sensory needs. Initially, an object detection model is selected from existing machine learning models based on its accuracy and cost considerations, including time and power consumption. The chosen model is then implemented on a Raspberry Pi, which is connected to a specifically designed tactile device. When the device is touched at a specific position, it provides an audio signal that communicates the identification of the object present in the scene at that corresponding position to the visually impaired individual. Conducted tests have demonstrated the effectiveness of this device in scene understanding, encompassing static or dynamic objects, as well as screen contents such as TVs, computers, and mobile phones.
Neural Fields for Interactive Visualization of Statistical Dependencies in 3D Simulation Ensembles
paper_authors: Fatemeh Farokhmanesh, Kevin Höhlein, Christoph Neuhauser, Rüdiger Westermann
for: This paper develops a neural network that compactly represents and efficiently reconstructs the statistical dependencies between physical variables at different spatial locations in large 3D simulation ensembles.
methods: The network learns non-linear dependencies, measured as mutual information, and reconstructs them without compute-intensive statistical estimators at runtime.
results: The approach significantly reduces memory and computation requirements, enabling the estimator to be embedded in a GPU-accelerated direct volume renderer and all mutual dependencies for a selected domain point to be visualized interactively.
Abstract
We present the first neural network that has learned to compactly represent and can efficiently reconstruct the statistical dependencies between the values of physical variables at different spatial locations in large 3D simulation ensembles. Going beyond linear dependencies, we consider mutual information as a measure of non-linear dependence. We demonstrate learning and reconstruction with a large weather forecast ensemble comprising 1000 members, each storing multiple physical variables at a 250 x 352 x 20 simulation grid. By circumventing compute-intensive statistical estimators at runtime, we demonstrate significantly reduced memory and computation requirements for reconstructing the major dependence structures. This enables embedding the estimator into a GPU-accelerated direct volume renderer and interactively visualizing all mutual dependencies for a selected domain point.
Evaluating AI systems under uncertain ground truth: a case study in dermatology
results: The study finds that a large portion of the dataset exhibits significant ground-truth uncertainty, that the existing deterministic adjudication method can severely overestimate performance, and that statistical models of annotation uncertainty enable better evaluation and provide uncertainty estimates.
Abstract
For safety, AI systems in health undergo thorough evaluations before deployment, validating their predictions against a ground truth that is assumed certain. However, this is actually not the case and the ground truth may be uncertain. Unfortunately, this is largely ignored in standard evaluation of AI models but can have severe consequences such as overestimating the future performance. To avoid this, we measure the effects of ground truth uncertainty, which we assume decomposes into two main components: annotation uncertainty which stems from the lack of reliable annotations, and inherent uncertainty due to limited observational information. This ground truth uncertainty is ignored when estimating the ground truth by deterministically aggregating annotations, e.g., by majority voting or averaging. In contrast, we propose a framework where aggregation is done using a statistical model. Specifically, we frame aggregation of annotations as posterior inference of so-called plausibilities, representing distributions over classes in a classification setting, subject to a hyper-parameter encoding annotator reliability. Based on this model, we propose a metric for measuring annotation uncertainty and provide uncertainty-adjusted metrics for performance evaluation. We present a case study applying our framework to skin condition classification from images where annotations are provided in the form of differential diagnoses. The deterministic adjudication process called inverse rank normalization (IRN) from previous work ignores ground truth uncertainty in evaluation. Instead, we present two alternative statistical models: a probabilistic version of IRN and a Plackett-Luce-based model. We find that a large portion of the dataset exhibits significant ground truth uncertainty and standard IRN-based evaluation severely over-estimates performance without providing uncertainty estimates.
DiffFlow: A Unified SDE Framework for Score-Based Diffusion Models and Generative Adversarial Networks
results: The paper's results include: (1) a new unified SDE (DiffFlow) that can combine the advantages of fast sampling and high sample quality; and (2) a smooth transition between score-based diffusion models and generative adversarial networks, obtained by adjusting the relative weights, while preserving the feasibility of maximum likelihood training.
Abstract
Generative models can be categorized into two types: explicit generative models that define explicit density forms and allow exact likelihood inference, such as score-based diffusion models (SDMs) and normalizing flows; implicit generative models that directly learn a transformation from the prior to the data distribution, such as generative adversarial nets (GANs). While these two types of models have shown great success, they suffer from respective limitations that hinder them from achieving fast sampling and high sample quality simultaneously. In this paper, we propose a unified theoretic framework for SDMs and GANs. We show that: i) the learning dynamics of both SDMs and GANs can be described as a novel SDE named Discriminator Denoising Diffusion Flow (DiffFlow) where the drift can be determined by some weighted combinations of scores of the real data and the generated data; ii) by adjusting the relative weights between different score terms, we can obtain a smooth transition between SDMs and GANs while the marginal distribution of the SDE remains invariant to the change of the weights; iii) we prove the asymptotic optimality and maximal likelihood training scheme of the DiffFlow dynamics; iv) under our unified theoretic framework, we introduce several instantiations of DiffFlow that provide new algorithms beyond GANs and SDMs with exact likelihood inference and have potential to achieve flexible trade-off between high sample quality and fast sampling speed.
Wasserstein Auto-Encoders of Merge Trees (and Persistence Diagrams)
for: This paper presents a computational framework for auto-encoding merge trees (and persistence diagrams) in their Wasserstein metric space.
methods: The method uses a neural network architecture that explicitly manipulates merge trees on their associated metric space at each layer of the network, resulting in superior accuracy and interpretability.
results: Extensive experiments on public ensembles demonstrate the efficiency of the algorithms, with computations typically in the orders of minutes on average; applications to data reduction and dimensionality reduction are also demonstrated.
Abstract
This paper presents a computational framework for the Wasserstein auto-encoding of merge trees (MT-WAE), a novel extension of the classical auto-encoder neural network architecture to the Wasserstein metric space of merge trees. In contrast to traditional auto-encoders which operate on vectorized data, our formulation explicitly manipulates merge trees on their associated metric space at each layer of the network, resulting in superior accuracy and interpretability. Our novel neural network approach can be interpreted as a non-linear generalization of previous linear attempts [65] at merge tree encoding. It also trivially extends to persistence diagrams. Extensive experiments on public ensembles demonstrate the efficiency of our algorithms, with MT-WAE computations in the orders of minutes on average. We show the utility of our contributions in two applications adapted from previous work on merge tree encoding [65]. First, we apply MT-WAE to data reduction and reliably compress merge trees by concisely representing them with their coordinates in the final layer of our auto-encoder. Second, we document an application to dimensionality reduction, by exploiting the latent space of our auto-encoder, for the visual analysis of ensemble data. We illustrate the versatility of our framework by introducing two penalty terms, to help preserve in the latent space both the Wasserstein distances between merge trees, as well as their clusters. In both applications, quantitative experiments assess the relevance of our framework. Finally, we provide a C++ implementation that can be used for reproducibility.
Harmonizing Feature Attributions Across Deep Learning Architectures: Enhancing Interpretability and Consistency
results: The study finds that feature attribution methods can be harmonized across models with distinct deep learning architectures trained on the same data distribution, and that harmonized feature attributions have the potential to improve the interpretability and consistency of machine learning models.
Abstract
Ensuring the trustworthiness and interpretability of machine learning models is critical to their deployment in real-world applications. Feature attribution methods have gained significant attention, which provide local explanations of model predictions by attributing importance to individual input features. This study examines the generalization of feature attributions across various deep learning architectures, such as convolutional neural networks (CNNs) and vision transformers. We aim to assess the feasibility of utilizing a feature attribution method as a future detector and examine how these features can be harmonized across multiple models employing distinct architectures but trained on the same data distribution. By exploring this harmonization, we aim to develop a more coherent and optimistic understanding of feature attributions, enhancing the consistency of local explanations across diverse deep-learning models. Our findings highlight the potential for harmonized feature attribution methods to improve interpretability and foster trust in machine learning applications, regardless of the underlying architecture.
Compound Attention and Neighbor Matching Network for Multi-contrast MRI Super-resolution
results: Experiments show that CANM-Net outperforms state-of-the-art methods on SR tasks on the IXI, fastMRI, and real-world scanning datasets, and that it maintains good performance when the reference and degraded images are imperfectly registered, indicating good potential for clinical applications.
Abstract
Multi-contrast magnetic resonance imaging (MRI) reflects information about human tissue from different perspectives and has many clinical applications. By utilizing the complementary information among different modalities, multi-contrast super-resolution (SR) of MRI can achieve better results than single-image super-resolution. However, existing methods of multi-contrast MRI SR have the following shortcomings that may limit their performance: First, existing methods either simply concatenate the reference and degraded features or exploit global feature-matching between them, which are unsuitable for multi-contrast MRI SR. Second, although many recent methods employ transformers to capture long-range dependencies in the spatial dimension, they neglect that self-attention in the channel dimension is also important for low-level vision tasks. To address these shortcomings, we proposed a novel network architecture with compound-attention and neighbor matching (CANM-Net) for multi-contrast MRI SR: The compound self-attention mechanism effectively captures the dependencies in both spatial and channel dimension; the neighborhood-based feature-matching modules are exploited to match degraded features and adjacent reference features and then fuse them to obtain the high-quality images. We conduct experiments of SR tasks on the IXI, fastMRI, and real-world scanning datasets. The CANM-Net outperforms state-of-the-art approaches in both retrospective and prospective experiments. Moreover, the robustness study in our work shows that the CANM-Net still achieves good performance when the reference and degraded images are imperfectly registered, proving good potential in clinical applications.
Prompting Diffusion Representations for Cross-Domain Semantic Segmentation
results: The study finds that diffusion pretraining achieves extraordinary domain generalization for semantic segmentation in new domains, and that a simple yet effective test-time domain adaptation method, based on learning a scene prompt on the target domain, further improves performance.
Abstract
While originally designed for image generation, diffusion models have recently shown to provide excellent pretrained feature representations for semantic segmentation. Intrigued by this result, we set out to explore how well diffusion-pretrained representations generalize to new domains, a crucial ability for any representation. We find that diffusion-pretraining achieves extraordinary domain generalization results for semantic segmentation, outperforming both supervised and self-supervised backbone networks. Motivated by this, we investigate how to utilize the model's unique ability of taking an input prompt, in order to further enhance its cross-domain performance. We introduce a scene prompt and a prompt randomization strategy to help further disentangle the domain-invariant information when training the segmentation head. Moreover, we propose a simple but highly effective approach for test-time domain adaptation, based on learning a scene prompt on the target domain in an unsupervised manner. Extensive experiments conducted on four synthetic-to-real and clear-to-adverse weather benchmarks demonstrate the effectiveness of our approaches. Without resorting to any complex techniques, such as image translation, augmentation, or rare-class sampling, we set a new state-of-the-art on all benchmarks. Our implementation will be publicly available at \url{https://github.com/ETHRuiGong/PTDiffSeg}.
How Deep Neural Networks Learn Compositional Data: The Random Hierarchy Model
paper_authors: Leonardo Petrini, Francesco Cagnetta, Umberto M. Tomasini, Alessandro Favero, Matthieu Wyart
for: This paper studies the ability of deep convolutional neural networks (CNNs) to learn complex tasks on high-dimensional data.
methods: The paper analyzes deep CNNs on the Random Hierarchy Model, a synthetic classification task with a hierarchically compositional structure.
results: The paper finds that deep CNNs overcome the curse of dimensionality by building invariant representations, and that the number of training data required grows only polynomially with the input dimensionality.
Abstract
Learning generic high-dimensional tasks is notably hard, as it requires a number of training data exponential in the dimension. Yet, deep convolutional neural networks (CNNs) have shown remarkable success in overcoming this challenge. A popular hypothesis is that learnable tasks are highly structured and that CNNs leverage this structure to build a low-dimensional representation of the data. However, little is known about how much training data they require, and how this number depends on the data structure. This paper answers this question for a simple classification task that seeks to capture relevant aspects of real data: the Random Hierarchy Model. In this model, each of the $n_c$ classes corresponds to $m$ synonymic compositions of high-level features, which are in turn composed of sub-features through an iterative process repeated $L$ times. We find that the number of training data $P^*$ required by deep CNNs to learn this task (i) grows asymptotically as $n_c m^L$, which is only polynomial in the input dimensionality; (ii) coincides with the training set size such that the representation of a trained network becomes invariant to exchanges of synonyms; (iii) corresponds to the number of data at which the correlations between low-level features and classes become detectable. Overall, our results indicate how deep CNNs can overcome the curse of dimensionality by building invariant representations, and provide an estimate of the number of data required to learn a task based on its hierarchically compositional structure.
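The reported scaling is easy to sanity-check numerically; the parameter values below are illustrative.

```python
# Sample-complexity scaling for the Random Hierarchy Model:
# P* = n_c * m**L, which is polynomial in the input dimensionality
# (the input dimension itself grows exponentially in the depth L).
def p_star(n_c, m, L):
    return n_c * m ** L

# Example: 10 classes, 5 synonymic compositions per feature, 3 levels.
print(p_star(n_c=10, m=5, L=3))   # -> 1250 training examples
```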
MDViT: Multi-domain Vision Transformer for Small Medical Image Segmentation Datasets
results: Experiments show that MDViT achieves the best performance on four skin lesion segmentation datasets, with superior segmentation performance and a fixed model size at inference time, even as more domains are added.
Abstract
Despite its clinical utility, medical image segmentation (MIS) remains a daunting task due to images' inherent complexity and variability. Vision transformers (ViTs) have recently emerged as a promising solution to improve MIS; however, they require larger training datasets than convolutional neural networks. To overcome this obstacle, data-efficient ViTs were proposed, but they are typically trained using a single source of data, which overlooks the valuable knowledge that could be leveraged from other available datasets. Naively combining datasets from different domains can result in negative knowledge transfer (NKT), i.e., a decrease in model performance on some domains with non-negligible inter-domain heterogeneity. In this paper, we propose MDViT, the first multi-domain ViT that includes domain adapters to mitigate data-hunger and combat NKT by adaptively exploiting knowledge in multiple small data resources (domains). Further, to enhance representation learning across domains, we integrate a mutual knowledge distillation paradigm that transfers knowledge between a universal network (spanning all the domains) and auxiliary domain-specific branches. Experiments on 4 skin lesion segmentation datasets show that MDViT outperforms state-of-the-art algorithms, with superior segmentation performance and a fixed model size at inference time, even as more domains are added. Our code is available at https://github.com/siyi-wind/MDViT.
Make A Long Image Short: Adaptive Token Length for Vision Transformers
results: Experiments on image classification and action recognition with multiple representative ViT models show that the method effectively reduces computational cost and improves inference speed.
Abstract
The vision transformer is a model that breaks down each image into a sequence of tokens with a fixed length and processes them similarly to words in natural language processing. Although increasing the number of tokens typically results in better performance, it also leads to a considerable increase in computational cost. Motivated by the saying "A picture is worth a thousand words," we propose an innovative approach to accelerate the ViT model by shortening long images. Specifically, we introduce a method for adaptively assigning token length for each image at test time to accelerate inference speed. First, we train a Resizable-ViT (ReViT) model capable of processing input with diverse token lengths. Next, we extract token-length labels from ReViT that indicate the minimum number of tokens required to achieve accurate predictions. We then use these labels to train a lightweight Token-Length Assigner (TLA) that allocates the optimal token length for each image during inference. The TLA enables ReViT to process images with the minimum sufficient number of tokens, reducing token numbers in the ViT model and improving inference speed. Our approach is general and compatible with modern vision transformer architectures, significantly reducing computational costs. We verified the effectiveness of our methods on multiple representative ViT models on image classification and action recognition.
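A hedged sketch of a Token-Length Assigner follows: a lightweight classifier trained on the token-length labels extracted from ReViT, then used at inference to pick the smallest sufficient token budget per image. The candidate lengths and the tiny CNN are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

TOKEN_LENGTHS = [49, 98, 196]   # candidate token budgets (illustrative)

class TokenLengthAssigner(nn.Module):
    def __init__(self, num_choices=len(TOKEN_LENGTHS)):
        super().__init__()
        # A tiny CNN: cheap to run before the (much larger) ViT.
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_choices),
        )

    def forward(self, x):
        return self.net(x)   # logits over candidate token lengths

def assign_lengths(tla, images):
    # Pick the predicted minimal-sufficient token length per image,
    # then feed each image to the resizable ViT at that length.
    idx = tla(images).argmax(dim=1)
    return [TOKEN_LENGTHS[i] for i in idx.tolist()]
```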
results: The proposed baseline methods can generate responsive and vivid agents that collaborate with a real person to complete a whole conversation.
Abstract
We introduce a new conversation head generation benchmark for synthesizing behaviors of a single interlocutor in a face-to-face conversation. The capability to automatically synthesize interlocutors which can participate in long and multi-turn conversations is vital and offers benefits for various applications, including digital humans, virtual agents, and social robots. However, existing research primarily focuses on talking head generation (one-way interaction), which hinders the creation of a digital human for conversational (two-way) interaction due to the absence of listening and interaction parts. In this work, we construct two datasets to address this issue, ``ViCo'' for independent talking and listening head generation tasks at the sentence level, and ``ViCo-X'', for synthesizing interlocutors in multi-turn conversational scenarios. Based on ViCo and ViCo-X, we define three novel tasks targeting interaction modeling during face-to-face conversation: 1) responsive listening head generation, making listeners respond actively to the speaker with non-verbal signals; 2) expressive talking head generation, guiding speakers to be aware of listeners' behaviors; and 3) conversational head generation, integrating the talking/listening abilities in one interlocutor. Along with the datasets, we also propose corresponding baseline solutions to the three aforementioned tasks. Experimental results show that our baseline method can generate responsive and vivid agents that collaborate with a real person to complete the whole conversation. Project page: https://vico.solutions/.
Adversarial Attacks on Image Classification Models: FGSM and Patch Attacks and their Impact
results: Attacks launched on three powerful pre-trained image classification models (ResNet-34, GoogleNet, and DenseNet-161) show a substantial drop in classification accuracy under attack.
Abstract
This chapter introduces the concept of adversarial attacks on image classification models built on convolutional neural networks (CNN). CNNs are very popular deep-learning models which are used in image classification tasks. However, very powerful and pre-trained CNN models working very accurately on image datasets for image classification tasks may perform disastrously when the networks are under adversarial attacks. In this work, two very well-known adversarial attacks are discussed and their impact on the performance of image classifiers is analyzed. These two adversarial attacks are the fast gradient sign method (FGSM) and adversarial patch attack. These attacks are launched on three powerful pre-trained image classifier architectures, ResNet-34, GoogleNet, and DenseNet-161. The classification accuracy of the models in the absence and presence of the two attacks are computed on images from the publicly accessible ImageNet dataset. The results are analyzed to evaluate the impact of the attacks on the image classification task.
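FGSM itself is standard and easy to state: perturb the input one step along the sign of the loss gradient. Below is a minimal PyTorch sketch; `model` is any differentiable classifier and the epsilon value is illustrative.

```python
import torch

def fgsm_attack(model, images, labels, eps=8 / 255):
    # Compute the loss gradient with respect to the input pixels.
    images = images.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    # One step in the direction that maximally increases the loss.
    adv = images + eps * images.grad.sign()
    return adv.clamp(0, 1).detach()   # assumes inputs normalized to [0, 1]
```

Evaluating the classifier on `fgsm_attack(model, x, y)` instead of `x` reproduces the accuracy-drop experiments described above.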
Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised Audio-Visual Video Parsing
results: Experiments on public benchmarks show that the proposed method effectively addresses the imbalanced feature learning between audio and visual modalities in the audio-visual video parsing task.
Abstract
Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual and audio-visual event instances as well as identify the corresponding event categories with only video-level category labels for training. Most previous methods pay much attention to refining the supervision for each modality or extracting fruitful cross-modality information for more reliable feature learning. None of them have noticed the imbalanced feature learning between different modalities in the task. In this paper, to balance the feature learning processes of different modalities, a dynamic gradient modulation (DGM) mechanism is explored, where a novel and effective metric function is designed to measure the imbalanced feature learning between audio and visual modalities. Furthermore, principle analysis indicates that the multimodal confusing calculation will hamper the precise measurement of multimodal imbalanced feature learning, which further weakens the effectiveness of our DGM mechanism. To cope with this issue, a modality-separated decision unit (MSDU) is designed for more precise measurement of imbalanced feature learning between audio and visual modalities. Comprehensive experiments are conducted on public benchmarks and the corresponding experimental results demonstrate the effectiveness of our proposed method.
NMS Threshold matters for Ego4D Moment Queries – 2nd place solution to the Ego4D Moment Queries Challenge 2023
for: This paper is a submission to the Ego4D Moment Queries Challenge 2023.
methods: The paper extends the latest ActionFormer method with an improved ground-truth assignment strategy during training and a refined version of SoftNMS at inference time.
results: The paper ranks 2nd on the public leaderboard with 26.62% average mAP and 45.69% Recall@1x at tIoU=0.5 on the test set, significantly outperforming the strong baseline from the 2023 challenge.
Abstract
This report describes our submission to the Ego4D Moment Queries Challenge 2023. Our submission extends ActionFormer, a recent method for temporal action localization. Our extension combines an improved ground-truth assignment strategy during training and a refined version of SoftNMS at inference time. Our solution is ranked 2nd on the public leaderboard with 26.62% average mAP and 45.69% Recall@1x at tIoU=0.5 on the test set, significantly outperforming the strong baseline from the 2023 challenge. Our code is available at https://github.com/happyharrycn/actionformer_release.
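The standard (Gaussian) Soft-NMS step is easy to sketch for 1-D temporal segments: instead of discarding overlapping candidates, their scores are decayed by temporal IoU, and the score threshold referenced in the title controls how aggressively decayed candidates are pruned. The submission's refinements are not described in the abstract, so this is only the textbook form with illustrative parameters.

```python
import numpy as np

def t_iou(seg, segs):
    # Temporal IoU of one segment against an array of (start, end) segments.
    inter = np.clip(np.minimum(seg[1], segs[:, 1]) -
                    np.maximum(seg[0], segs[:, 0]), 0, None)
    union = (seg[1] - seg[0]) + (segs[:, 1] - segs[:, 0]) - inter
    return inter / (union + 1e-9)

def soft_nms_1d(segments, scores, sigma=0.5, score_thresh=0.001):
    segments, scores, keep = segments.copy(), scores.copy(), []
    while len(segments) > 0:
        i = scores.argmax()
        keep.append((segments[i].tolist(), float(scores[i])))
        seg = segments[i]
        segments = np.delete(segments, i, axis=0)
        scores = np.delete(scores, i)
        if len(segments) == 0:
            break
        # Gaussian decay: heavily overlapping candidates lose score
        # instead of being removed outright.
        scores = scores * np.exp(-t_iou(seg, segments) ** 2 / sigma)
        mask = scores > score_thresh   # the threshold the title alludes to
        segments, scores = segments[mask], scores[mask]
    return keep
```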
ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: TREK-150 Single Object Tracking
for: This submission tackles single object tracking on TREK-150 by converting bounding boxes to masks and reformulating the task as video object segmentation.
methods: The method builds on the Associating Objects with Transformers (AOT) framework, converts bounding boxes to masks in reference frames using the Segment Anything Model (SAM) and Alpha-Refine, and propagates the masks with MSDeAOT, a variant of AOT that incorporates transformers at multiple feature scales.
results: The approach achieved 1st place in the EPIC-KITCHENS TREK-150 Object Tracking Challenge.
Abstract
The Associating Objects with Transformers (AOT) framework has exhibited exceptional performance in a wide range of complex scenarios for video object tracking and segmentation. In this study, we convert the bounding boxes to masks in reference frames with the help of the Segment Anything Model (SAM) and Alpha-Refine, and then propagate the masks to the current frame, transforming the task from Video Object Tracking (VOT) to video object segmentation (VOS). Furthermore, we introduce MSDeAOT, a variant of the AOT series that incorporates transformers at multiple feature scales. MSDeAOT efficiently propagates object masks from previous frames to the current frame using two feature scales of 16 and 8. As a testament to the effectiveness of our design, we achieved the 1st place in the EPIC-KITCHENS TREK-150 Object Tracking Challenge.
ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: Semi-Supervised Video Object Segmentation
results: Experiments show that the method achieved the top-ranking position in the EPIC-KITCHEN VISOR Semi-supervised Video Object Segmentation Challenge.
Abstract
The Associating Objects with Transformers (AOT) framework has exhibited exceptional performance in a wide range of complex scenarios for video object segmentation. In this study, we introduce MSDeAOT, a variant of the AOT series that incorporates transformers at multiple feature scales. Leveraging the hierarchical Gated Propagation Module (GPM), MSDeAOT efficiently propagates object masks from previous frames to the current frame using a feature scale with a stride of 16. Additionally, we employ GPM in a more refined feature scale with a stride of 8, leading to improved accuracy in detecting and tracking small objects. Through the implementation of test-time augmentations and model ensemble techniques, we achieve the top-ranking position in the EPIC-KITCHEN VISOR Semi-supervised Video Object Segmentation Challenge.
Remote Sensing Image Change Detection with Graph Interaction
results: Experiments on the GZ CD dataset show that the method achieves higher accuracy and better computational efficiency than other state-of-the-art methods, improving overall effectiveness.
Abstract
Modern remote sensing image change detection has witnessed substantial advancements by harnessing the potent feature extraction capabilities of CNNs and Transformers. Yet, prevailing change detection techniques consistently prioritize extracting semantic features related to significant alterations, overlooking the viability of directly interacting with bitemporal image features. In this letter, we propose a bitemporal image graph interaction network for remote sensing change detection, namely BGINet-CD. More specifically, by leveraging the concept of non-local operations and mapping the features obtained from the backbone network to the graph structure space, we propose a unified self-focus mechanism for bitemporal images. This approach enhances the information coupling between the two temporal images while effectively suppressing task-irrelevant interference. Based on a streamlined backbone architecture, namely ResNet18, our model demonstrates superior performance compared to other state-of-the-art (SOTA) methods on the GZ CD dataset. Moreover, the model exhibits an enhanced trade-off between accuracy and computational efficiency, further improving its overall effectiveness.
Multi-Modal Prototypes for Open-Set Semantic Segmentation
results: Extensive experiments on the Pascal and COCO datasets evaluate the effectiveness of the framework. State-of-the-art results are achieved even on the more detailed part segmentation of Pascal-Animals while training only on coarse-grained datasets, and thorough ablation studies analyze the contribution of each component both quantitatively and qualitatively.
Abstract
In semantic segmentation, adapting a visual system to novel object categories at inference time has always been both valuable and challenging. To enable such generalization, existing methods rely on either providing several support examples as visual cues or class names as textual cues. Though progress along each line has been encouraging, the two have been studied in isolation, neglecting the complementary nature of low-level visual and high-level language information. In this paper, we define a unified setting termed open-set semantic segmentation (O3S), which aims to learn seen and unseen semantics from both visual examples and textual names. Our pipeline extracts multi-modal prototypes for the segmentation task, by first single-modal self-enhancement and aggregation, then multi-modal complementary fusion. To be specific, we aggregate visual features into several tokens as visual prototypes, and enhance the class name with detailed descriptions for textual prototype generation. The two modalities are then fused to generate multi-modal prototypes for final segmentation. On both the Pascal and COCO datasets, we conduct extensive experiments to evaluate the framework's effectiveness. State-of-the-art results are achieved even on more detailed part-segmentation, Pascal-Animals, by only training on coarse-grained datasets. Thorough ablation studies are performed to dissect each component, both quantitatively and qualitatively.
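A hedged sketch of the prototype pipeline follows: a visual prototype from averaged support features, a textual prototype from an encoded class description, and a fused prototype used for per-pixel scoring. The encoders are left abstract, and the convex-combination fusion is an assumption, not the paper's fusion module.

```python
import torch
import torch.nn.functional as F

def visual_prototype(support_feats):
    # support_feats: (n_support, d) features from a visual encoder.
    return F.normalize(support_feats.mean(dim=0), dim=0)

def fuse(v_proto, t_proto, alpha=0.5):
    # t_proto: (d,) embedding of the enriched class description.
    # Convex combination as a simple stand-in for complementary fusion.
    return F.normalize(alpha * v_proto + (1 - alpha) * t_proto, dim=0)

def segment(pixel_feats, prototypes):
    # pixel_feats: (h*w, d); prototypes: (num_classes, d).
    sims = F.normalize(pixel_feats, dim=1) @ prototypes.T
    return sims.argmax(dim=1)   # per-pixel class assignment
```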
Distilling Missing Modality Knowledge from Ultrasound for Endometriosis Diagnosis with Magnetic Resonance Images
paper_authors: Yuan Zhang, Hu Wang, David Butler, Minh-Son To, Jodie Avery, M Louise Hull, Gustavo Carneiro
for: This paper aims to improve the accuracy of detecting pouch of Douglas (POD) obliteration from magnetic resonance imaging (MRI) scans, which is a common diagnostic challenge in endometriosis diagnosis.
methods: The proposed method uses a knowledge distillation training algorithm that leverages detection results from unpaired transvaginal gynecological ultrasound (TVUS) data to improve the POD obliteration detection from MRI. The algorithm pre-trains a teacher model to detect POD obliteration from TVUS data and a student model with a 3D masked auto-encoder using unlabelled pelvic 3D MRI volumes.
results: Experimental results on an endometriosis dataset containing TVUS and MRI data demonstrate the effectiveness of the proposed method in improving the POD detection accuracy from MRI.
Abstract
Endometriosis is a common chronic gynecological disorder that has many characteristics, including the pouch of Douglas (POD) obliteration, which can be diagnosed using Transvaginal gynecological ultrasound (TVUS) scans and magnetic resonance imaging (MRI). TVUS and MRI are complementary non-invasive endometriosis diagnosis imaging techniques, but patients are usually not scanned using both modalities and, it is generally more challenging to detect POD obliteration from MRI than TVUS. To mitigate this classification imbalance, we propose in this paper a knowledge distillation training algorithm to improve the POD obliteration detection from MRI by leveraging the detection results from unpaired TVUS data. More specifically, our algorithm pre-trains a teacher model to detect POD obliteration from TVUS data, and it also pre-trains a student model with 3D masked auto-encoder using a large amount of unlabelled pelvic 3D MRI volumes. Next, we distill the knowledge from the teacher TVUS POD obliteration detector to train the student MRI model by minimizing a regression loss that approximates the output of the student to the teacher using unpaired TVUS and MRI data. Experimental results on our endometriosis dataset containing TVUS and MRI data demonstrate the effectiveness of our method to improve the POD detection accuracy from MRI.
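A hedged sketch of the distillation step follows: a frozen TVUS teacher produces POD-obliteration probabilities that supervise an MRI student through a regression loss. Because TVUS and MRI are unpaired, the batch-level matching below is an illustrative simplification; the paper's actual pairing strategy is not reproduced here.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, mri_batch, tvus_batch, optimizer):
    # Teacher probabilities on TVUS serve as soft regression targets
    # (assumes both heads emit one logit per case, with matched batch sizes).
    with torch.no_grad():
        target = torch.sigmoid(teacher(tvus_batch))
    pred = torch.sigmoid(student(mri_batch))
    loss = F.mse_loss(pred, target)     # regression loss, student -> teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```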
Zero-Shot Neural Architecture Search: Challenges, Solutions, and Opportunities
for: The purpose of this study is to comprehensively review and compare the state-of-the-art (SOTA) zero-shot NAS methods, with an emphasis on their hardware awareness.
methods: The study reviews mainstream zero-shot proxies and discusses their theoretical underpinnings.
results: Through large-scale experiments, the study demonstrates the effectiveness of zero-shot proxies in both hardware-aware and hardware-oblivious NAS scenarios, and proposes several promising ideas for designing better proxies.
Abstract
Recently, zero-shot (or training-free) Neural Architecture Search (NAS) approaches have been proposed to liberate the NAS from training requirements. The key idea behind zero-shot NAS approaches is to design proxies that predict the accuracies of the given networks without training network parameters. The proxies proposed so far are usually inspired by recent progress in theoretical deep learning and have shown great potential on several NAS benchmark datasets. This paper aims to comprehensively review and compare the state-of-the-art (SOTA) zero-shot NAS approaches, with an emphasis on their hardware awareness. To this end, we first review the mainstream zero-shot proxies and discuss their theoretical underpinnings. We then compare these zero-shot proxies through large-scale experiments and demonstrate their effectiveness in both hardware-aware and hardware-oblivious NAS scenarios. Finally, we point out several promising ideas to design better proxies. Our source code and the related paper list are available on https://github.com/SLDGroup/survey-zero-shot-nas.
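As a flavor of what such proxies look like, here is one of the simplest: the gradient norm of an untrained network on a single mini-batch, often taken to correlate with trainability. This is a generic example for illustration, not a specific proxy endorsed by the survey.

```python
import torch
import torch.nn as nn

def grad_norm_proxy(model: nn.Module, images, labels):
    # Score an architecture with randomly initialized weights:
    # one forward/backward pass, no training.
    model.zero_grad()
    loss = nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.norm().item()
    return total

# Usage: rank candidate architectures by this score on one mini-batch,
# then train only the top-ranked candidates.
```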
Unsupervised Spectral Demosaicing with Lightweight Spectral Attention Networks
paper_authors: Kai Feng, Yongqiang Zhao, Seong G. Kong, Haijin Zeng
for: This paper presents an unsupervised deep learning-based spectral demosaicing technique for real-world hyperspectral images.
methods: The proposed method uses a mosaic loss function, a specific model structure, a transformation strategy, and an early stopping strategy to form a complete unsupervised spectral demosaicing framework. The spectral attention module is modified to reduce complexity and parameters.
results: The proposed method outperforms conventional unsupervised methods in terms of spatial distortion suppression, spectral fidelity, robustness, and computational cost, as demonstrated through extensive experiments on synthetic and real-world datasets.
Abstract
This paper presents a deep learning-based spectral demosaicing technique trained in an unsupervised manner. Many existing deep learning-based techniques relying on supervised learning with synthetic images, often underperform on real-world images especially when the number of spectral bands increases. According to the characteristics of the spectral mosaic image, this paper proposes a mosaic loss function, the corresponding model structure, a transformation strategy, and an early stopping strategy, which form a complete unsupervised spectral demosaicing framework. A challenge in real-world spectral demosaicing is inconsistency between the model parameters and the computational resources of the imager. We reduce the complexity and parameters of the spectral attention module by dividing the spectral attention tensor into spectral attention matrices in the spatial dimension and spectral attention vector in the channel dimension, which is more suitable for unsupervised framework. This paper also presents Mosaic25, a real 25-band hyperspectral mosaic image dataset of various objects, illuminations, and materials for benchmarking. Extensive experiments on synthetic and real-world datasets demonstrate that the proposed method outperforms conventional unsupervised methods in terms of spatial distortion suppression, spectral fidelity, robustness, and computational cost.
Task-Specific Alignment and Multiple Level Transformer for Few-Shot Action Recognition
results: Experiments show that the method achieves state-of-the-art results on the HMDB51 and UCF101 datasets and competitive results on the Kinetics and Something-Something V2 benchmarks.
Abstract
In the research field of few-shot learning, the main difference between image-based and video-based tasks is the additional temporal dimension of videos. In recent years, many approaches for few-shot action recognition have followed metric-based methods, especially since some works use the Transformer to obtain cross-attention features of the videos or enhanced prototypes, with competitive results. However, they do not mine enough information from the Transformer because they only focus on features at a single level. In this paper, we address this problem. We propose an end-to-end method named "Task-Specific Alignment and Multiple Level Transformer Network (TSA-MLT)". In our model, the Multiple Level Transformer focuses on multiple-level features of the support video and query video. Before the Multiple Level Transformer, we use task-specific TSA to filter unimportant or misleading frames as a pre-processing step. Furthermore, we adopt a fusion loss using two kinds of distance: the first is an L2 sequence distance, which focuses on temporal order alignment; the second is an optimal transport distance, which focuses on measuring the gap between the appearance and semantics of the videos. Using a simple fusion network, we fuse the two distances element-wise, then use the cross-entropy loss as our fusion loss. Extensive experiments show our method achieves state-of-the-art results on the HMDB51 and UCF101 datasets and a competitive result on the benchmark of Kinetics and Something-Something V2 datasets. Our code will be available at the URL: https://github.com/cofly2014/tsa-mlt.git
A ChatGPT Aided Explainable Framework for Zero-Shot Medical Image Diagnosis
results: Extensive experiments on one private dataset and four public datasets demonstrate the effectiveness and explainability of this training-free zero-shot diagnosis framework, corroborating the great potential of VLMs and LLMs for medical applications.Abstract
Zero-shot medical image classification is a critical process in real-world scenarios where we have limited access to all possible diseases or large-scale annotated data. It involves computing similarity scores between a query medical image and possible disease categories to determine the diagnostic result. Recent advances in pretrained vision-language models (VLMs) such as CLIP have shown great performance for zero-shot natural image recognition and exhibit benefits in medical applications. However, an explainable zero-shot medical image recognition framework with promising performance is yet under development. In this paper, we propose a novel CLIP-based zero-shot medical image classification framework supplemented with ChatGPT for explainable diagnosis, mimicking the diagnostic process performed by human experts. The key idea is to query large language models (LLMs) with category names to automatically generate additional cues and knowledge, such as disease symptoms or descriptions other than a single category name, to help provide more accurate and explainable diagnosis in CLIP. We further design specific prompts to enhance the quality of generated texts by ChatGPT that describe visual medical features. Extensive results on one private dataset and four public datasets along with detailed analysis demonstrate the effectiveness and explainability of our training-free zero-shot diagnosis pipeline, corroborating the great potential of VLMs and LLMs for medical applications.
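To make the mechanism concrete, here is a minimal sketch of cue-augmented zero-shot classification with an off-the-shelf CLIP checkpoint from Hugging Face. The class names and cue texts below are hypothetical stand-ins for what ChatGPT would generate; the paper's actual prompts and score aggregation may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical LLM-generated cues per class (the paper queries ChatGPT).
CUES = {
    "pneumonia": ["a chest x-ray with patchy lung opacities",
                  "a chest x-ray showing areas of consolidation"],
    "normal":    ["a chest x-ray with clear lung fields",
                  "a chest x-ray with no visible abnormality"],
}

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def zero_shot_diagnose(image: Image.Image) -> str:
    img_feat = model.get_image_features(
        **processor(images=image, return_tensors="pt"))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    scores = {}
    for label, prompts in CUES.items():
        txt_feat = model.get_text_features(
            **processor(text=prompts, return_tensors="pt", padding=True))
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        # Average similarity over all cues for this class.
        scores[label] = (img_feat @ txt_feat.T).mean().item()
    return max(scores, key=scores.get)

print(zero_shot_diagnose(Image.new("RGB", (224, 224))))  # dummy image
```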
ToothSegNet: Image Degradation meets Tooth Segmentation in CBCT Images
results: ToothSegNet produces more precise segmentation results and outperforms state-of-the-art medical image segmentation methods.Abstract
In computer-assisted orthodontics, three-dimensional tooth models are required for many medical treatments. Tooth segmentation from cone-beam computed tomography (CBCT) images is a crucial step in constructing these models. However, CBCT image quality problems, such as metal artifacts and blurring caused by the shooting equipment and patients' dental conditions, make segmentation difficult. In this paper, we propose ToothSegNet, a new framework that acquaints the segmentation model with generated degraded images during training. ToothSegNet merges information from high- and low-quality images produced by a designed degradation simulation module, using channel-wise cross fusion to reduce the semantic gap between encoder and decoder, and refines the shape of the tooth prediction through a structural constraint loss. Experimental results suggest that ToothSegNet produces more precise segmentations and outperforms state-of-the-art medical image segmentation methods.
Multimodal Prompt Learning for Product Title Generation with Extremely Limited Labels
results: We conduct experiments and analyses on five novel product categories under both in-domain and out-of-domain experimental settings. The results show that, with only 1% of downstream labelled data for training, our method achieves the best few-shot results and is even competitive with fully-supervised methods trained on 100% of the training data; with the full labelled data for training, our method achieves state-of-the-art results.Abstract
Generating an informative and attractive title for a product is a crucial task for e-commerce. Most existing works follow standard multimodal natural language generation approaches, e.g., image captioning, and employ large-scale human-labelled datasets to train desirable models. However, for novel products, especially in a different domain, there is little existing labelled data. In this paper, we propose a prompt-based approach, i.e., the Multimodal Prompt Learning framework, to accurately and efficiently generate titles for novel products with limited labels. We observe that the core challenges of novel product title generation are understanding novel product characteristics and generating titles in a novel writing style. To this end, we build a set of multimodal prompts from different modalities to preserve the corresponding characteristics and writing styles of novel products. As a result, with extremely limited labels for training, the proposed method can retrieve the multimodal prompts to generate desirable titles for novel products. Experiments and analyses are conducted on five novel product categories under both in-domain and out-of-domain experimental settings. The results show that, with only 1% of downstream labelled data for training, our proposed approach achieves the best few-shot results and even achieves competitive results with fully-supervised methods trained on 100% of the training data; with the full labelled data for training, our method achieves state-of-the-art results.
Multi-scale Graph Neural Network with Signed-attention for Social Bot Detection: A Frequency Perspective
paper_authors: Shuhao Shi, Kai Qiao, Zhengyan Wang, Jie Yang, Baojie Song, Jian Chen, Bin Yan
for: This paper proposes a method for detecting social bots using a multi-scale signed-attention graph filter (MSGS) to improve the representation ability of the model and alleviate the over-smoothing problem of deep GNNs.
methods: The proposed MSGS method utilizes a multi-scale structure to produce representation vectors at different scales, which are then combined using a signed-attention mechanism and finally aggregated through an MLP to produce the final result.
results: The experimental results on real-world datasets demonstrate that the proposed MSGS method achieves better performance compared with several state-of-the-art social bot detection methods.Abstract
The presence of a large number of bots on social media has adverse effects. The graph neural network (GNN) can effectively leverage the social relationships between users and achieve excellent results in detecting bots. Recently, more and more GNN-based methods have been proposed for bot detection. However, the existing GNN-based bot detection methods only focus on low-frequency information and seldom consider high-frequency information, which limits the representation ability of the model. To address this issue, this paper proposes a Multi-scale with Signed-attention Graph Filter for social bot detection called MSGS. MSGS can effectively utilize both high- and low-frequency information in the social graph. Specifically, MSGS utilizes a multi-scale structure to produce representation vectors at different scales. These representations are then combined using a signed-attention mechanism. Finally, the multi-scale representations are aggregated and passed through an MLP to produce the final result. We analyze the frequency response and demonstrate that MSGS is a more flexible and expressive adaptive graph filter. MSGS can effectively utilize high-frequency information to alleviate the over-smoothing problem of deep GNNs. Experimental results on real-world datasets demonstrate that our method achieves better performance compared with several state-of-the-art social bot detection methods.
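As a rough illustration of mixing low- and high-frequency graph components with signed weights, consider the sketch below; the fixed weights and two scales are toy stand-ins for what MSGS learns through its signed-attention mechanism.

```python
import torch

def signed_multiscale_filter(X, A_hat, alphas):
    """Combine low-pass components A_hat^k X at two scales with a high-pass
    component (I - A_hat) X, using signed weights tanh(alpha) that may be
    negative; a toy version of a signed multi-scale graph filter."""
    comps = [X, A_hat @ X, A_hat @ (A_hat @ X), X - A_hat @ X]
    signed = torch.tanh(alphas)  # weights in (-1, 1)
    return sum(w * c for w, c in zip(signed, comps))

n, d = 5, 8
A = torch.rand(n, n); A = (A + A.T) / 2           # toy symmetric adjacency
d_sqrt = A.sum(dim=1).clamp(min=1e-6).sqrt()
A_hat = A / d_sqrt.outer(d_sqrt)                  # D^(-1/2) A D^(-1/2)
X = torch.randn(n, d)
out = signed_multiscale_filter(X, A_hat, torch.tensor([0.5, 0.3, 0.1, -0.4]))
print(out.shape)  # torch.Size([5, 8])
```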
Hybrid Neural Diffeomorphic Flow for Shape Representation and Generation via Triplane
results: Generates high-quality and diverse 3D diffeomorphic flows while ensuring topological consistency of the shapes, and demonstrates the effectiveness of HNDF on medical image organ segmentation datasets.Abstract
Deep Implicit Functions (DIFs) have gained popularity in 3D computer vision due to their compactness and continuous representation capabilities. However, addressing dense correspondences and semantic relationships across DIF-encoded shapes remains a critical challenge, limiting their applications in texture transfer and shape analysis. Moreover, recent endeavors in 3D shape generation using DIFs often neglect correspondence and topology preservation. This paper presents HNDF (Hybrid Neural Diffeomorphic Flow), a method that implicitly learns the underlying representation and decomposes intricate dense correspondences into explicitly axis-aligned triplane features. To avoid suboptimal representations trapped in local minima, we propose hybrid supervision that captures both local and global correspondences. Unlike conventional approaches that directly generate new 3D shapes, we further explore the idea of shape generation with deformed template shape via diffeomorphic flows, where the deformation is encoded by the generated triplane features. Leveraging a pre-existing 2D diffusion model, we produce high-quality and diverse 3D diffeomorphic flows through generated triplanes features, ensuring topological consistency with the template shape. Extensive experiments on medical image organ segmentation datasets evaluate the effectiveness of HNDF in 3D shape representation and generation.
Toward more frugal models for functional cerebral networks automatic recognition with resting-state fMRI
results: The study finds that graph-based representations preserve the performance of the neural network model while reducing its complexity, and can ultimately improve recognition accuracy.Abstract
We consider a machine learning setting where models based on classical convolutional neural networks have shown good performance. We investigate different encoding techniques, first supervoxels and then graphs, to reduce the complexity of the model while tracking any loss of performance. This approach is illustrated on a recognition task of resting-state functional networks for patients with brain tumors. Graphs encoding supervoxels preserve the activation characteristics of functional brain networks from the images and reduce the number of model parameters by a factor of 26 while maintaining CNN model performance.
A Synthetic Electrocardiogram (ECG) Image Generation Toolbox to Facilitate Deep Learning-Based Scanned ECG Digitization
For: The paper is written for training machine learning models in algorithmic ECG diagnosis, and the authors aim to address the challenge of scarce ECG archives with reference time-series.* Methods: The authors propose a novel method for generating synthetic ECG images on standard paper-like ECG backgrounds with realistic artifacts, using digital twins and data augmentation techniques.* Results: The authors built a deep ECG image digitization model and trained it on the synthetic dataset, and the results show an average signal recovery SNR of 27$\pm$2.8 dB, demonstrating the significance of the proposed synthetic ECG image dataset for training deep learning models.Abstract
The electrocardiogram (ECG) is an accurate and widely available tool for diagnosing cardiovascular diseases. ECGs have been recorded in printed formats for decades and their digitization holds great potential for training machine learning (ML) models in algorithmic ECG diagnosis. Physical ECG archives are at risk of deterioration and scanning printed ECGs alone is insufficient, as ML models require ECG time-series data. Therefore, the digitization and conversion of paper ECG archives into time-series data is of utmost importance. Deep learning models for image processing show promise in this regard. However, the scarcity of ECG archives with reference time-series is a challenge. Data augmentation techniques utilizing \textit{digital twins} present a potential solution. We introduce a novel method for generating synthetic ECG images on standard paper-like ECG backgrounds with realistic artifacts. Distortions including handwritten text artifacts, wrinkles, creases and perspective transforms are applied to the generated images, without personally identifiable information. As a use case, we generated an ECG image dataset of 21,801 records from the 12-lead PhysioNet PTB-XL ECG time-series dataset. A deep ECG image digitization model was built and trained on the synthetic dataset, and was employed to convert the synthetic images to time-series data for evaluation. The signal-to-noise ratio (SNR) was calculated to assess the image digitization quality vs the ground truth ECG time-series. The results show an average signal recovery SNR of 27$\pm$2.8\,dB, demonstrating the significance of the proposed synthetic ECG image dataset for training deep learning models. The codebase is available as an open-access toolbox for ECG research.
Text + Sketch: Image Compression at Ultra Low Rates
results: At very low bit-rates, the method can significantly outperform learned compressors in terms of semantic and perceptual fidelity.Abstract
Recent advances in text-to-image generative models provide the ability to generate high-quality images from short text descriptions. These foundation models, when pre-trained on billion-scale datasets, are effective for various downstream tasks with little or no further training. A natural question to ask is how such models may be adapted for image compression. We investigate several techniques in which the pre-trained models can be directly used to implement compression schemes targeting novel low rate regimes. We show how text descriptions can be used in conjunction with side information to generate high-fidelity reconstructions that preserve both semantics and spatial structure of the original. We demonstrate that at very low bit-rates, our method can significantly improve upon learned compressors in terms of perceptual and semantic fidelity, despite no end-to-end training.
Physics-based Motion Retargeting from Sparse Inputs
results: Using this method, a character can track the user's motion in real time from sparse human sensor data, with the avatar's poses often matching the user well. The method is further studied in a variety of settings, including unbalancing, dancing, and sports motions.Abstract
Avatars are important to create interactive and immersive experiences in virtual worlds. One challenge in animating these characters to mimic a user's motion is that commercial AR/VR products consist only of a headset and controllers, providing very limited sensor data of the user's pose. Another challenge is that an avatar might have a different skeleton structure than a human and the mapping between them is unclear. In this work we address both of these challenges. We introduce a method to retarget motions in real-time from sparse human sensor data to characters of various morphologies. Our method uses reinforcement learning to train a policy to control characters in a physics simulator. We only require human motion capture data for training, without relying on artist-generated animations for each avatar. This allows us to use large motion capture datasets to train general policies that can track unseen users from real and sparse data in real-time. We demonstrate the feasibility of our approach on three characters with different skeleton structure: a dinosaur, a mouse-like creature and a human. We show that the avatar poses often match the user surprisingly well, despite having no sensor information of the lower body available. We discuss and ablate the important components in our framework, specifically the kinematic retargeting step, the imitation, contact and action reward as well as our asymmetric actor-critic observations. We further explore the robustness of our method in a variety of settings including unbalancing, dancing and sports motions.
ProtoDiffusion: Classifier-Free Diffusion Guidance with Prototype Learning
paper_authors: Gulcin Baykal, Halil Faruk Karagoz, Taha Binhuraib, Gozde Unal
for: Improving generation quality and speeding up the training of diffusion models
methods: Incorporate prototype learning into diffusion models
results: Reaches higher generation quality faster than the baseline method.Abstract
Diffusion models are generative models that have shown significant advantages compared to other generative models in terms of higher generation quality and more stable training. However, the computational need for training diffusion models is considerably increased. In this work, we incorporate prototype learning into diffusion models to achieve high generation quality faster than the original diffusion model. Instead of randomly initialized class embeddings, we use separately learned class prototypes as the conditioning information to guide the diffusion process. We observe that our method, called ProtoDiffusion, achieves better performance in the early stages of training compared to the baseline method, signifying that using the learned prototypes shortens the training time. We demonstrate the performance of ProtoDiffusion using various datasets and experimental settings, achieving the best performance in shorter times across all settings.
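The conditioning idea is easy to sketch: compute one prototype vector per class and hand it to the diffusion model in place of a randomly initialized class embedding. Below, prototypes are plain class means of pre-extracted features; the paper learns its prototypes separately, so treat this as a simplified stand-in.

```python
import torch

def class_prototypes(features: torch.Tensor, labels: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    """Per-class prototypes as mean feature vectors, a simplification of
    separately learned prototypes used as diffusion conditioning."""
    return torch.stack([features[labels == c].mean(dim=0)
                        for c in range(num_classes)])

feats = torch.randn(100, 64)      # e.g. encoder features of a dataset
labels = torch.arange(100) % 10   # 10 classes, 10 samples each
protos = class_prototypes(feats, labels, 10)
print(protos.shape)               # torch.Size([10, 64]) conditioning vectors
```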
methods: Proposes a deep learning-based image tracking method that fuses RGB and thermal infrared (TIR) images (RGBT), consisting of two components: a feature extractor and a tracker.
results: Evaluations on the RGBT234 and LasHeR datasets show that the proposed system outperforms state-of-the-art RGBT object trackers with a relatively smaller number of parameters.Abstract
Tracking objects can be a difficult task in computer vision, especially when faced with challenges such as occlusion, changes in lighting, and motion blur. Recent advances in deep learning have shown promise in addressing these challenges. However, most deep learning-based object trackers only use visible-band (RGB) images. Thermal infrared (TIR) electromagnetic waves can provide additional information about an object, including its temperature, under challenging conditions. We propose a deep learning-based image tracking approach that fuses RGB and thermal images (RGBT). The proposed model consists of two main components: a feature extractor and a tracker. The feature extractor encodes deep features from both the RGB and the TIR images. The tracker then uses these features to track the object using an enhanced attribute-based architecture. We propose a fusion of attribute-specific feature selection with an aggregation module. The proposed methods are evaluated on the RGBT234 \cite{LiCLiang2018} and LasHeR \cite{LiLasher2021} datasets, which are the most widely used RGBT object-tracking datasets in the literature. The results show that the proposed system outperforms state-of-the-art RGBT object trackers on these datasets, with a relatively smaller number of parameters.
MaskBEV: Joint Object Detection and Footprint Completion for Bird’s-eye View 3D Point Clouds
results: On the SemanticKITTI and KITTI datasets, MaskBEV achieves strong performance, and the advantages and limitations of the architecture are analyzed.Abstract
Recent works in object detection in LiDAR point clouds mostly focus on predicting bounding boxes around objects. This prediction is commonly achieved using anchor-based or anchor-free detectors that predict bounding boxes, requiring significant explicit prior knowledge about the objects to work properly. To remedy these limitations, we propose MaskBEV, a bird's-eye view (BEV) mask-based object detector neural architecture. MaskBEV predicts a set of BEV instance masks that represent the footprints of detected objects. Moreover, our approach allows object detection and footprint completion in a single pass. MaskBEV also reformulates the detection problem purely in terms of classification, doing away with regression usually done to predict bounding boxes. We evaluate the performance of MaskBEV on both SemanticKITTI and KITTI datasets while analyzing the architecture advantages and limitations.
Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning
results: Experiments on multiple simulated and real-world robot tasks show that the proposed approach improves diffusion-based visuomotor policy learning, especially when the demonstrations have different proficiencies.Abstract
Sequence modeling approaches have shown promising results in robot imitation learning. Recently, diffusion models have been adopted for behavioral cloning, benefiting from their exceptional capabilities in modeling complex data distribution. In this work, we propose Crossway Diffusion, a method to enhance diffusion-based visuomotor policy learning by using an extra self-supervised learning (SSL) objective. The standard diffusion-based policy generates action sequences from random noise conditioned on visual observations and other low-dimensional states. We further extend this by introducing a new decoder that reconstructs raw image pixels (and other state information) from the intermediate representations of the reverse diffusion process, and train the model jointly using the SSL loss. Our experiments demonstrate the effectiveness of Crossway Diffusion in various simulated and real-world robot tasks, confirming its advantages over the standard diffusion-based policy. We demonstrate that such self-supervised reconstruction enables better representation for policy learning, especially when the demonstrations have different proficiencies.
Grad-FEC: Unequal Loss Protection of Deep Features in Collaborative Intelligence
methods: Uses an Unequal Loss Protection (ULP) approach, comprising a feature importance estimator and Forward Error Correction (FEC) codes, to selectively protect the important packets produced by the front-end.
results: Experimental results show that the proposed approach can significantly improve the reliability and robustness of collaborative intelligence systems in the presence of packet loss.Abstract
Collaborative intelligence (CI) involves dividing an artificial intelligence (AI) model into two parts: front-end, to be deployed on an edge device, and back-end, to be deployed in the cloud. The deep feature tensors produced by the front-end are transmitted to the cloud through a communication channel, which may be subject to packet loss. To address this issue, in this paper, we propose a novel approach to enhance the resilience of the CI system in the presence of packet loss through Unequal Loss Protection (ULP). The proposed ULP approach involves a feature importance estimator, which estimates the importance of feature packets produced by the front-end, and then selectively applies Forward Error Correction (FEC) codes to protect important packets. Experimental results demonstrate that the proposed approach can significantly improve the reliability and robustness of the CI system in the presence of packet loss.
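A toy version of the unequal-protection idea follows: rank feature packets by an importance score (here given; in the paper it comes from the feature importance estimator) and cover only the top-k with a parity packet, a single-erasure XOR stand-in for real FEC codes.

```python
import numpy as np

def protect_packets(packets, importance, k_protect):
    """Add one XOR parity packet covering only the k most important packets."""
    order = np.argsort(importance)[::-1]          # most important first
    protected = order[:k_protect]
    parity = np.zeros_like(packets[0])
    for i in protected:
        parity = np.bitwise_xor(parity, packets[i])
    return protected, parity

def recover(packets, lost, protected, parity):
    """Recover a single lost protected packet from the parity packet."""
    if lost not in protected:
        return None                               # unprotected: unrecoverable
    rec = parity.copy()
    for i in protected:
        if i != lost:
            rec = np.bitwise_xor(rec, packets[i])
    return rec

rng = np.random.default_rng(0)
pkts = [rng.integers(0, 256, size=16, dtype=np.uint8) for _ in range(8)]
prot, par = protect_packets(pkts, rng.random(8), k_protect=4)
lost = prot[0]
assert np.array_equal(recover(pkts, lost, prot, par), pkts[lost])
print("recovered packet", lost)
```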
Deep Features for Contactless Fingerprint Presentation Attack Detection: Can They Be Generalized?
results: The study finds that features extracted with the ResNet50 CNN achieve the best generalization performance.Abstract
The rapid evolution of high-end smartphones with advanced high-resolution cameras has resulted in contactless capture of fingerprint biometrics that are more reliable and suitable for verification. Similar to other biometric systems, contactless fingerprint-verification systems are vulnerable to presentation attacks. In this paper, we present a comparative study on the generalizability of seven different pre-trained Convolutional Neural Networks (CNN) and a Vision Transformer (ViT) to reliably detect presentation attacks. Extensive experiments were carried out on publicly available smartphone-based presentation attack datasets using four different Presentation Attack Instruments (PAI). The detection performance of the eight deep feature techniques was evaluated using the leave-one-out protocol to benchmark the generalization performance for unseen PAI. The obtained results indicated the best generalization performance with the ResNet50 CNN.
Advancing Wound Filling Extraction on 3D Faces: Auto-Segmentation and Wound Face Regeneration Approach
results: Our method achieves highly accurate 3D facial wound segmentation, surpassing the previous method. We also propose an improved approach for extracting the 3D facial wound filler and compare it with the previous study, achieving an accuracy of 0.9999986% on the test suite.Abstract
Facial wound segmentation plays a crucial role in preoperative planning and optimizing patient outcomes in various medical applications. In this paper, we propose an efficient approach for automating 3D facial wound segmentation using a two-stream graph convolutional network. Our method leverages the Cir3D-FaIR dataset and addresses the challenge of data imbalance through extensive experimentation with different loss functions. To achieve accurate segmentation, we conducted thorough experiments and selected a high-performing model from the trained models. The selected model demonstrates exceptional segmentation performance for complex 3D facial wounds. Furthermore, based on the segmentation model, we propose an improved approach for extracting 3D facial wound fillers and compare it to the results of the previous study. Our method achieved a remarkable accuracy of 0.9999986\% on the test suite, surpassing the performance of the previous method. From this result, we use 3D printing technology to illustrate the shape of the wound filling. The outcomes of this study have significant implications for physicians involved in preoperative planning and intervention design. By automating facial wound segmentation and improving the accuracy of wound-filling extraction, our approach can assist in carefully assessing and optimizing interventions, leading to enhanced patient outcomes. Additionally, it contributes to advancing facial reconstruction techniques by utilizing machine learning and 3D bioprinting for printing skin tissue implants. Our source code is available at \url{https://github.com/SIMOGroup/WoundFilling3D}.
Collaborative Score Distillation for Consistent Visual Synthesis
results: Achieves consistency across multiple image samples and demonstrates the effectiveness of CSD on a variety of tasks, including the editing of panorama images, videos, and 3D scenes.Abstract
Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications on diverse visual modalities. However, when adapting these priors to complex visual modalities, often represented as multiple images (e.g., video), achieving consistency across a set of images is challenging. In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD). CSD is based on the Stein Variational Gradient Descent (SVGD). Specifically, we propose to consider multiple samples as "particles" in the SVGD update and combine their score functions to distill generative priors over a set of images synchronously. Thus, CSD facilitates seamless integration of information across 2D images, leading to a consistent visual synthesis across multiple samples. We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
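For reference, the standard SVGD update that CSD builds on moves a set of samples $\{x_i\}_{i=1}^n$ jointly; in CSD the score $\nabla \log p$ comes from the pre-trained diffusion model. A common form of the update is

$$x_i \leftarrow x_i + \epsilon\,\phi(x_i), \qquad \phi(x) = \frac{1}{n}\sum_{j=1}^{n}\left[k(x_j, x)\,\nabla_{x_j}\log p(x_j) + \nabla_{x_j} k(x_j, x)\right],$$

where $k(\cdot,\cdot)$ is a positive-definite kernel (e.g. an RBF kernel) and $\epsilon$ a step size. The kernel terms are what couple the samples, which is the mechanism behind the cross-sample consistency the paper targets.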
EdgeFace: Efficient Face Recognition Model for Edge Devices
results: Experimental results show that the EdgeFace network not only has low computational complexity and a small storage footprint, but also achieves high face recognition accuracy on challenging datasets, outperforming previous efficient network models.Abstract
In this paper, we present EdgeFace, a lightweight and efficient face recognition network inspired by the hybrid architecture of EdgeNeXt. By effectively combining the strengths of both CNN and Transformer models, and a low-rank linear layer, EdgeFace achieves excellent face recognition performance optimized for edge devices. The proposed EdgeFace network not only maintains low computational costs and compact storage, but also achieves high face recognition accuracy, making it suitable for deployment on edge devices. Extensive experiments on challenging benchmark face datasets demonstrate the effectiveness and efficiency of EdgeFace in comparison to state-of-the-art lightweight models and deep face recognition models. Our EdgeFace model with 1.77M parameters achieves state-of-the-art results on LFW (99.73%), IJB-B (92.67%), and IJB-C (94.85%), outperforming other efficient models with larger computational complexities. The code to replicate the experiments will be made available publicly.
On the Matrix Form of the Quaternion Fourier Transform and Quaternion Convolution
results: The paper clarifies the relation of the Quaternion Fourier Transform matrix to the standard complex Discrete Fourier Transform matrix, and the extent to which well-known complex-domain theorems extend to the quaternion domain.Abstract
We study matrix forms of quaternionic versions of the Fourier transform and convolution operations. Quaternions offer a powerful representation unit; however, their use comes with difficulties that stem foremost from the non-commutativity of quaternion multiplication, and from the fact that $\mu^2 = -1$ possesses infinitely many solutions in the quaternion domain. Handling quaternionic matrices is consequently complicated in several respects (definition of eigenstructure, determinant, etc.). Our research findings clarify the relation of the Quaternion Fourier Transform matrix to the standard (complex) Discrete Fourier Transform matrix, and the extent to which well-known complex-domain theorems extend to quaternions. We focus especially on the relation of Quaternion Fourier Transform matrices to Quaternion Circulant matrices (representing quaternionic convolution), and the eigenstructure of the latter. A proof-of-concept application that makes direct use of our theoretical results is presented, where we produce a method to bound the spectral norm of a quaternionic convolution.
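For orientation, the complex-domain facts whose quaternionic analogues the paper studies are that circular convolution is multiplication by a circulant matrix, and that circulants are diagonalized by the DFT matrix $F$:

$$y = c \circledast x = Cx, \qquad C = F^{-1}\,\mathrm{diag}(Fc)\,F.$$

In the quaternion domain, multiplication is non-commutative ($ij = k = -ji$) and $\mu^2 = -1$ holds for every unit pure quaternion $\mu$, so the transform depends on the chosen axis $\mu$ and the diagonalization above does not carry over verbatim; these are the obstructions the paper works through.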
SUIT: Learning Significance-guided Information for 3D Temporal Detection
paper_authors: Zheyuan Zhou, Jiachen Lu, Yihan Zeng, Hang Xu, Li Zhang
for: Improving the accuracy of 3D object detection for autonomous driving and robotics, in particular by exploiting temporal information to enhance 3D perception.
methods: Proposes Significance-gUided Information for 3D Temporal detection (SUIT), which extracts information-rich yet sparse features through a significance sampling mechanism and learns explicit geometric transformations between object centroids across frames.
results: Evaluated on the nuScenes and Waymo datasets, SUIT not only reduces the memory and computation cost of temporal fusion but also performs well compared with state-of-the-art baselines.Abstract
3D object detection from LiDAR point cloud is of critical importance for autonomous driving and robotics. While sequential point cloud has the potential to enhance 3D perception through temporal information, utilizing these temporal features effectively and efficiently remains a challenging problem. Based on the observation that the foreground information is sparsely distributed in LiDAR scenes, we believe sufficient knowledge can be provided by sparse format rather than dense maps. To this end, we propose to learn Significance-gUided Information for 3D Temporal detection (SUIT), which simplifies temporal information as sparse features for information fusion across frames. Specifically, we first introduce a significant sampling mechanism that extracts information-rich yet sparse features based on predicted object centroids. On top of that, we present an explicit geometric transformation learning technique, which learns the object-centric transformations among sparse features across frames. We evaluate our method on large-scale nuScenes and Waymo dataset, where our SUIT not only significantly reduces the memory and computation cost of temporal fusion, but also performs well over the state-of-the-art baselines.
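The sampling step is easy to illustrate: given per-point significance scores (in SUIT these derive from predicted object centroids), keep only the top-k feature vectors as the sparse representation fused across frames. Shapes and scores below are invented for the sketch.

```python
import torch

def significance_sample(features: torch.Tensor, scores: torch.Tensor, k: int):
    """Keep the k feature vectors with the highest significance scores,
    yielding a sparse per-frame representation for temporal fusion."""
    idx = torch.topk(scores, k).indices
    return features[idx], idx

feats = torch.randn(10000, 128)   # dense per-point features of one frame
scores = torch.rand(10000)        # hypothetical significance scores
sparse_feats, kept = significance_sample(feats, scores, k=256)
print(sparse_feats.shape)         # torch.Size([256, 128])
```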
Edge-aware Multi-task Network for Integrating Quantification, Segmentation and Uncertainty Prediction of Liver Tumor on Multi-modality Non-contrast MRI
paper_authors: Xiaojiao Xiao, Qinmin Hu, Guanghui Wang
for: This paper aims to provide a unified framework for simultaneous multi-index quantification, segmentation, and uncertainty estimation of liver tumors on multi-modality non-contrast magnetic resonance imaging (NCMRI).
methods: The proposed method, called edge-aware multi-task network (EaMtNet), employs two parallel CNN encoders, Sobel filters, and a newly designed edge-aware feature aggregation module (EaFA) for feature fusion and selection. The network also uses multi-tasking to leverage prediction discrepancy for uncertainty estimation and improved segmentation and quantification performance.
results: The proposed EaMtNet outperforms the state-of-the-art by a large margin, achieving a dice similarity coefficient of 90.01$\pm$1.23 and a mean absolute error of 2.72$\pm$0.58 mm for MD. The results demonstrate the potential of EaMtNet as a reliable clinical-aided tool for medical image analysis.Abstract
Simultaneous multi-index quantification, segmentation, and uncertainty estimation of liver tumors on multi-modality non-contrast magnetic resonance imaging (NCMRI) are crucial for accurate diagnosis. However, existing methods lack an effective mechanism for multi-modality NCMRI fusion and accurate boundary information capture, making these tasks challenging. To address these issues, this paper proposes a unified framework, namely edge-aware multi-task network (EaMtNet), to associate multi-index quantification, segmentation, and uncertainty of liver tumors on the multi-modality NCMRI. The EaMtNet employs two parallel CNN encoders and the Sobel filters to extract local features and edge maps, respectively. The newly designed edge-aware feature aggregation module (EaFA) is used for feature fusion and selection, making the network edge-aware by capturing long-range dependency between feature and edge maps. Multi-tasking leverages prediction discrepancy to estimate uncertainty and improve segmentation and quantification performance. Extensive experiments are performed on multi-modality NCMRI with 250 clinical subjects. The proposed model outperforms the state-of-the-art by a large margin, achieving a dice similarity coefficient of 90.01$\pm$1.23 and a mean absolute error of 2.72$\pm$0.58 mm for MD. The results demonstrate the potential of EaMtNet as a reliable clinical-aided tool for medical image analysis.
results: According to the paper's analysis, model performance grows with dataset size, but different communities may judge the quality of a model's output by different standards, which may mean that performance does not improve for all communities.Abstract
Recent work has proposed a power law relationship, referred to as ``scaling laws,'' between the performance of artificial intelligence (AI) models and aspects of those models' design (e.g., dataset size). In other words, as the size of a dataset (or model parameters, etc) increases, the performance of a given model trained on that dataset will correspondingly increase. However, while compelling in the aggregate, this scaling law relationship overlooks the ways that metrics used to measure performance may be precarious and contested, or may not correspond with how different groups of people may perceive the quality of models' output. In this paper, we argue that as the size of datasets used to train large AI models grows, the number of distinct communities (including demographic groups) whose data is included in a given dataset is likely to grow, each of whom may have different values. As a result, there is an increased risk that communities represented in a dataset may have values or preferences not captured by (or in the worst case, at odds with) the metrics used to evaluate model performance for scaling laws. We end the paper with implications for AI scaling laws -- that models may not, in fact, continue to improve as the datasets get larger -- at least not for all people or communities impacted by those models.
Decentralized Data Governance as Part of a Data Mesh Platform: Concepts and Approaches
results: The paper presents a conceptual model of key data mesh concepts and discusses how decentralized data governance can be achieved through platform means. These insights are drawn from concrete experience implementing a fully-functional data mesh platform.Abstract
Data mesh is a socio-technical approach to decentralized analytics data management. To manage this decentralization efficiently, data mesh relies on automation provided by a self-service data infrastructure platform. A key aspect of this platform is to enable decentralized data governance. Because data mesh is a young approach, there is a lack of coherence in how data mesh concepts are interpreted in the industry, and almost no work on how a data mesh platform facilitates governance. This paper presents a conceptual model of key data mesh concepts and discusses different approaches to drive governance through platform means. The insights presented are drawn from concrete experiences of implementing a fully-functional data mesh platform that can be used as a reference on how to approach data mesh platform development.
LLQL: Logistic Likelihood Q-Learning for Reinforcement Learning
results: We find that in the online setting the Bellman error follows a Logistic distribution, while in the offline setting it follows a constrained Logistic distribution that depends on the prior policy in the offline dataset. Based on these findings, we propose $\rm LLoss$, an alternative loss function based on the Logistic maximum likelihood function, and observe that the rewards in an offline dataset should follow a specific distribution to facilitate offline objectives. In our experiments, we applied controlled variable corrections to the loss functions of two variants of Soft-Actor-Critic in both online and offline environments. The results confirm our hypotheses and show that the variance of LLoss is smaller than that of MSELoss.Abstract
Currently, research on Reinforcement learning (RL) can be broadly classified into two categories: online RL and offline RL. Both in online and offline RL, the primary focus of research on the Bellman error lies in the optimization techniques and performance improvement, rather than exploring the inherent structural properties of the Bellman error, such as distribution characteristics. In this study, we analyze the distribution of the Bellman approximation error in both online and offline settings. We find that in the online environment, the Bellman error follows a Logistic distribution, while in the offline environment, the Bellman error follows a constrained Logistic distribution, where the constrained distribution is dependent on the prior policy in the offline data set. Based on this finding, we have improved the MSELoss which is based on the assumption that the Bellman errors follow a normal distribution, and we utilized the Logistic maximum likelihood function to construct $\rm LLoss$ as an alternative loss function. In addition, we observed that the rewards in the offline data set should follow a specific distribution, which would facilitate the achievement of offline objectives. In our numerical experiments, we performed controlled variable corrections on the loss functions of two variants of Soft-Actor-Critic in both online and offline environments. The results confirmed our hypothesis regarding the online and offline settings, we also found that the variance of LLoss is smaller than MSELoss. Our research provides valuable insights for further investigations based on the distribution of Bellman errors.
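To make the loss concrete: if a Bellman error $\delta$ is modeled as Logistic with location $0$ and scale $s$, its density is $f(\delta) = e^{-\delta/s} / \big(s\,(1+e^{-\delta/s})^2\big)$, so a maximum-likelihood loss of the LLoss kind minimizes the negative log-likelihood

$$\ell(\delta) = \frac{\delta}{s} + 2\log\big(1 + e^{-\delta/s}\big) + \log s,$$

in place of the $\delta^2/(2\sigma^2)$ (up to constants) that MSELoss implies under a Gaussian assumption. This is the generic Logistic negative log-likelihood; the paper's exact parameterization may differ.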
Exploring new ways: Enforcing representational dissimilarity to learn new features and reduce error consistency
results: The study finds that using representational similarity methods to enforce dissimilarity between intermediate representations at different depths yields less correlated output predictions and slightly lower error consistency, resulting in higher ensemble accuracy.Abstract
Independently trained machine learning models tend to learn similar features. Given an ensemble of independently trained models, this results in correlated predictions and common failure modes. Previous attempts focusing on decorrelation of output predictions or logits yielded mixed results, particularly due to their reduction in model accuracy caused by conflicting optimization objectives. In this paper, we propose the novel idea of utilizing methods of the representational similarity field to promote dissimilarity during training instead of measuring similarity of trained models. To this end, we promote intermediate representations to be dissimilar at different depths between architectures, with the goal of learning robust ensembles with disjoint failure modes. We show that highly dissimilar intermediate representations result in less correlated output predictions and slightly lower error consistency, resulting in higher ensemble accuracy. With this, we shine first light on the connection between intermediate representations and their impact on the output predictions.
Deep Contract Design via Discontinuous Piecewise Affine Neural Networks
results: Experiments show that the principal's utility function can be approximated with a small number of training samples and that the approach scales to find approximately optimal contracts on problems with a large number of actions and outcomes.Abstract
Contract design involves a principal who establishes contractual agreements about payments for outcomes that arise from the actions of an agent. In this paper, we initiate the study of deep learning for the automated design of optimal contracts. We formulate this as an offline learning problem, where a deep network is used to represent the principal's expected utility as a function of the design of a contract. We introduce a novel representation: the Discontinuous ReLU (DeLU) network, which models the principal's utility as a discontinuous piecewise affine function where each piece corresponds to the agent taking a particular action. DeLU networks implicitly learn closed-form expressions for the incentive compatibility constraints of the agent and the utility maximization objective of the principal, and support parallel inference on each piece through linear programming or interior-point methods that solve for optimal contracts. We provide empirical results that demonstrate success in approximating the principal's utility function with a small number of training samples and scaling to find approximately optimal contracts on problems with a large number of actions and outcomes.
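A tiny numerical example shows why the principal's utility is a discontinuous piecewise affine function of the contract: with a linear contract paying the agent a fraction t of the outcome value, the agent's best-response action switches at thresholds of t, and the principal's utility jumps. The distributions, costs, and rewards below are made up for illustration.

```python
import numpy as np

def principal_utility(t, F, c, r):
    """Principal's expected utility under a linear contract paying fraction t
    of the outcome reward. The agent best-responds (each action = one affine
    piece); the principal keeps the remaining (1 - t) share."""
    agent_pay = t * (F @ r) - c        # expected agent utility per action
    a = int(np.argmax(agent_pay))      # agent's best response
    return (1 - t) * float(F[a] @ r)

F = np.array([[0.9, 0.1],              # 2 actions x 2 outcomes
              [0.4, 0.6]])
c = np.array([0.0, 0.2])               # action costs
r = np.array([0.0, 1.0])               # outcome rewards
for t in (0.1, 0.3, 0.5):
    print(t, round(principal_utility(t, F, c, r), 3))
# 0.1 -> 0.09 and 0.3 -> 0.07 (action 0), then a jump to 0.5 -> 0.3 (action 1)
```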
results: The paper shows that for MAB, the outer learners can initialize and tune the Tsallis-entropy generalization of Exp3 so that task-averaged regret improves, and that for BLO, learning to initialize and tune online mirror descent yields task-averaged regret that varies directly with an action-space-dependent measure.Abstract
We study online meta-learning with bandit feedback, with the goal of improving performance across multiple tasks if they are similar according to some natural similarity measure. As the first to target the adversarial online-within-online partial-information setting, we design meta-algorithms that combine outer learners to simultaneously tune the initialization and other hyperparameters of an inner learner for two important cases: multi-armed bandits (MAB) and bandit linear optimization (BLO). For MAB, the meta-learners initialize and set hyperparameters of the Tsallis-entropy generalization of Exp3, with the task-averaged regret improving if the entropy of the optima-in-hindsight is small. For BLO, we learn to initialize and tune online mirror descent (OMD) with self-concordant barrier regularizers, showing that task-averaged regret varies directly with an action space-dependent measure they induce. Our guarantees rely on proving that unregularized follow-the-leader combined with two levels of low-dimensional hyperparameter tuning is enough to learn a sequence of affine functions of non-Lipschitz and sometimes non-convex Bregman divergences bounding the regret of OMD.
First-Explore, then Exploit: Meta-Learning Intelligent Exploration
results: The paper shows that First-Explore can learn human-level exploration strategies and outperforms dominant standard RL and meta-RL approaches on domains where exploration requires sacrificing reward.Abstract
Standard reinforcement learning (RL) agents never intelligently explore like a human (i.e. by taking into account complex domain priors and previous explorations). Even the most basic intelligent exploration strategies such as exhaustive search are only inefficiently or poorly approximated by approaches such as novelty search or intrinsic motivation, let alone more complicated strategies like learning new skills, climbing stairs, opening doors, or conducting experiments. This lack of intelligent exploration limits sample efficiency and prevents solving hard exploration domains. We argue a core barrier prohibiting many RL approaches from learning intelligent exploration is that the methods attempt to explore and exploit simultaneously, which harms both exploration and exploitation as the goals often conflict. We propose a novel meta-RL framework (First-Explore) with two policies: one policy learns to only explore and one policy learns to only exploit. Once trained, we can then explore with the explore policy, for as long as desired, and then exploit based on all the information gained during exploration. This approach avoids the conflict of trying to do both exploration and exploitation at once. We demonstrate that First-Explore can learn intelligent exploration strategies such as exhaustive search and more, and that it outperforms dominant standard RL and meta-RL approaches on domains where exploration requires sacrificing reward. First-Explore is a significant step towards creating meta-RL algorithms capable of learning human-level exploration which is essential to solve challenging unseen hard-exploration domains.
SVDM: Single-View Diffusion Model for Pseudo-Stereo 3D Object Detection
results: Achieves new state-of-the-art performance on the KITTI dataset and is compatible with most stereo detectors.Abstract
One of the key problems in 3D object detection is to reduce the accuracy gap between methods based on LiDAR sensors and those based on monocular cameras. A recently proposed framework for monocular 3D detection based on Pseudo-Stereo has received considerable attention in the community. However, existing practice reveals several problems: (1) monocular depth estimation and the Pseudo-Stereo detector must be trained separately; (2) the framework is difficult to make compatible with different stereo detectors; and (3) the overall computation is large, which affects inference speed. In this work, we propose an end-to-end, efficient pseudo-stereo 3D detection framework by introducing a Single-View Diffusion Model (SVDM) that uses a few iterations to gradually deliver right informative pixels to the left image. SVDM allows the entire pseudo-stereo 3D detection pipeline to be trained end-to-end and can benefit from the training of stereo detectors. Afterwards, we further explore the application of SVDM in depth-free stereo 3D detection, and the final framework is compatible with most stereo detectors. Among multiple benchmarks on the KITTI dataset, we achieve new state-of-the-art performance.
Analyzing Different Expert-Opined Strategies to Enhance the Effect on the Goal of a Multi-Attribute Decision-Making System Using a Concept of Effort Propagation and Application in Enhancement of High School Students’ Performance
results: In a real-life case study of an Indian high school administrative system, suitably assigning effort is shown to improve student performance, with a total effort propagation of roughly 7%-15% to the goal. The study also identifies an optimal strategy for maximizing the improvement in student performance.
Abstract
In many real-world multi-attribute decision-making (MADM) problems, mining the inter-relationships and possible hierarchical structures among the factors is considered one of the primary tasks. Beyond that, another major task is to determine an optimal strategy for working on the factors to enhance the effect on the goal attribute. This paper proposes two such strategies, namely parallel and hierarchical effort assignment and propagation. The concept of effort propagation through a strategy is formally defined and described in the paper. Both the parallel and hierarchical strategies are divided into sub-strategies based on whether the assignment of efforts to the factors is uniform or depends upon appropriate heuristics related to the factors in the system. The adopted and discussed heuristics are the relative significance and effort propagability of the factors. The strategies are analyzed for a real-life case study regarding Indian high school administrative factors that play an important role in enhancing students' performance. Total effort propagation of around 7%-15% to the goal is seen across the proposed strategies given a total of 1 unit of effort to the directly accessible factors of the system. A comparative analysis is adopted to determine the optimal strategy among the proposed ones to enhance student performance most effectively. The highest effort propagation achieved in the work is approximately 14.4348%. The analysis in the paper establishes the need for further research on effort propagation analysis in decision-making problems.
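As a rough illustration of the effort propagation idea, the sketch below pushes one unit of effort from directly accessible factors through a small weighted factor hierarchy to the goal; the factor names and weights are invented for demonstration and are not taken from the case study.

```python
# A toy illustration of effort propagation through a factor hierarchy.
directly_accessible = {"teacher_training": 0.5, "infrastructure": 0.5}

# edges: factor -> list of (child, propagation weight); "goal" is the
# students' performance attribute.
edges = {
    "teacher_training":  [("classroom_quality", 0.4)],
    "infrastructure":    [("classroom_quality", 0.2), ("goal", 0.05)],
    "classroom_quality": [("goal", 0.3)],
}

def propagate(effort, node):
    """Recursively push effort along weighted edges; return effort at goal."""
    if node == "goal":
        return effort
    return sum(propagate(effort * w, child) for child, w in edges.get(node, []))

total = sum(propagate(e, f) for f, e in directly_accessible.items())
print(f"effort reaching the goal: {total:.4f} of 1 unit")  # 0.1150 here
```

With these made-up weights about 11.5% of the budget reaches the goal, which happens to fall inside the 7%-15% range reported in the abstract.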
Exploring Multimodal Approaches for Alzheimer’s Disease Detection Using Patient Speech Transcript and Audio Data
methods: The study uses pre-trained language models and a Graph Neural Network (GNN) to construct graphs from the speech and text data and extract features for AD detection. Data augmentation techniques, including synonym replacement and a GPT-based augmenter, are also used to address the small dataset size.
results: Through intensive experiments and analysis, the researchers shed light on the challenges and potential solutions in AD detection using speech and text data.
Abstract
Alzheimer's disease (AD) is a common form of dementia that severely impacts patient health. As AD impairs the patient's language understanding and expression ability, the speech of AD patients can serve as an indicator of this disease. This study investigates various methods for detecting AD using patients' speech and transcript data from the DementiaBank Pitt database. The proposed approach involves pre-trained language models and a Graph Neural Network (GNN) that constructs a graph from the speech transcript and extracts features using the GNN for AD detection. Data augmentation techniques, including synonym replacement and a GPT-based augmenter, were used to address the small dataset size. Audio data was also introduced, and the WavLM model was used to extract audio features. These features were then fused with text features using various methods. Finally, a contrastive learning approach was attempted by converting speech transcripts back to audio and using it for contrastive learning with the original audio. We conducted intensive experiments and analysis on the above methods. Our findings shed light on the challenges and potential solutions in AD detection using speech and audio data.
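As one concrete (and deliberately simplified) reading of the fusion step, the sketch below concatenates precomputed text and audio embeddings and classifies them with a small PyTorch head; the dimensions and data are dummy placeholders, and the paper's GNN branch and other fusion variants are not reproduced.

```python
# A minimal late-fusion sketch: concatenate precomputed text embeddings
# (e.g., from a pre-trained language model) and audio embeddings
# (e.g., from WavLM) and classify AD vs. control.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=768, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, 2),        # AD vs. healthy control
        )

    def forward(self, text_emb, audio_emb):
        return self.net(torch.cat([text_emb, audio_emb], dim=-1))

model = LateFusionClassifier()
text_emb = torch.randn(4, 768)   # stand-in for transcript embeddings
audio_emb = torch.randn(4, 768)  # stand-in for WavLM utterance embeddings
logits = model(text_emb, audio_emb)
print(logits.shape)              # torch.Size([4, 2])
```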
Power-up! What Can Generative Models Do for Human Computation Workflows?
for: This paper aims to explore the potential benefits of using large language models (LLMs) in crowdsourcing workflows and to identify the best ways to integrate LLMs into existing crowdsourcing design patterns.
methods: The paper proposes a vision for incorporating LLMs into crowdsourcing workflows, including identifying key junctures in the workflow where LLMs can add value and proposing means to augment existing design patterns for crowd work.
results: The paper aims to provide a comprehensive understanding of how LLMs can improve the effectiveness of crowdsourcing workflows and how such workflows can be evaluated from the perspectives of the various stakeholders involved in the crowdsourcing paradigm.
Abstract
We are amidst an explosion of artificial intelligence research, particularly around large language models (LLMs). These models have a range of applications across domains like medicine, finance, commonsense knowledge graphs, and crowdsourcing. Investigation into LLMs as part of crowdsourcing workflows remains an under-explored space. The crowdsourcing research community has produced a body of work investigating workflows and methods for managing complex tasks using hybrid human-AI methods. Within crowdsourcing, the role of LLMs can be envisioned as akin to a cog in a larger wheel of workflows. From an empirical standpoint, little is currently understood about how LLMs can improve the effectiveness of crowdsourcing workflows and how such workflows can be evaluated. In this work, we present a vision for exploring this gap from the perspectives of various stakeholders involved in the crowdsourcing paradigm -- the task requesters, crowd workers, platforms, and end-users. We identify junctures in typical crowdsourcing workflows at which the introduction of LLMs can play a beneficial role and propose means to augment existing design patterns for crowd work.
MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic Facial Expression Recognition
results: Experimental results show that MAE-DFER consistently outperforms existing supervised methods on six datasets (e.g., +6.30% UAR on DFEW and +8.34% UAR on MAFW) while requiring a lower computational cost (about 38% FLOPs).
Abstract
Dynamic facial expression recognition (DFER) is essential to the development of intelligent and empathetic machines. Prior efforts in this field mainly fall into the supervised learning paradigm, which is severely restricted by the limited labeled data in existing datasets. Inspired by the recent unprecedented success of masked autoencoders (e.g., VideoMAE), this paper proposes MAE-DFER, a novel self-supervised method that leverages large-scale self-supervised pre-training on abundant unlabeled data to largely advance the development of DFER. Since the vanilla Vision Transformer (ViT) employed in VideoMAE requires substantial computation during fine-tuning, MAE-DFER develops an efficient local-global interaction Transformer (LGI-Former) as the encoder. Moreover, in addition to the standalone appearance content reconstruction in VideoMAE, MAE-DFER also introduces explicit temporal facial motion modeling to encourage LGI-Former to excavate both static appearance and dynamic motion information. Extensive experiments on six datasets show that MAE-DFER consistently outperforms state-of-the-art supervised methods by significant margins (e.g., +6.30% UAR on DFEW and +8.34% UAR on MAFW), verifying that it can learn powerful dynamic facial representations via large-scale self-supervised pre-training. Besides, it has comparable or even better performance than VideoMAE, while largely reducing the computational cost (about 38% FLOPs). We believe MAE-DFER has paved a new way for the advancement of DFER and can inspire more relevant research in this field and even other related tasks. Codes and models are publicly available at https://github.com/sunlicai/MAE-DFER.
On the Adversarial Robustness of Generative Autoencoders in the Latent Space
results: The study finds that popular generative autoencoders are vulnerable in the latent space, where attacks can cause the models to produce erroneous samples. It also identifies a trade-off between a model's adversarial robustness and the degree of disentanglement of its latent codes.
Abstract
Generative autoencoders, such as variational autoencoders or adversarial autoencoders, have achieved great success in many real-world applications, including image generation and signal communication. However, little attention has been devoted to their robustness during practical deployment. Due to the probabilistic latent structure, variational autoencoders (VAEs) may confront problems such as a mismatch between the posterior distribution of the latent and the real data manifold, or discontinuity in the posterior distribution of the latent. This leaves a back door for malicious attackers to collapse VAEs from the latent space, especially in scenarios where the encoder and decoder are used separately, such as communication and compressed sensing. In this work, we provide the first study on the adversarial robustness of generative autoencoders in the latent space. Specifically, we empirically demonstrate the latent vulnerability of popular generative autoencoders through attacks in the latent space. We also evaluate the difference between variational autoencoders and their deterministic variants and observe that the latter perform better in latent robustness. Meanwhile, we identify a potential trade-off between adversarial robustness and the degree of disentanglement of the latent codes. Additionally, we verify the feasibility of improving the latent robustness of VAEs through adversarial training. In summary, we call attention to the adversarial latent robustness of generative autoencoders, analyze several robustness-related issues, and give some insights into a series of key challenges.
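A minimal sketch of what a latent-space attack can look like, assuming access to a pre-trained decoder: gradient ascent finds a small latent perturbation that maximally changes the reconstruction. The decoder below is a random stand-in, and the perturbation budget and step counts are arbitrary.

```python
# Latent-space attack sketch: maximize output distortion under a small
# latent perturbation. Swap in a real trained VAE decoder in practice.
import torch

decoder = torch.nn.Sequential(          # placeholder decoder z -> x
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 784)
)

z_clean = torch.randn(1, 16)
x_clean = decoder(z_clean).detach()

delta = torch.zeros_like(z_clean, requires_grad=True)
opt = torch.optim.Adam([delta], lr=0.05)
eps = 0.5                                # latent-space perturbation budget
for _ in range(100):
    loss = -((decoder(z_clean + delta) - x_clean) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                # project back onto the eps-ball
        delta.clamp_(-eps, eps)

print(f"output distortion: {(-loss).item():.4f} at ||delta||_inf <= {eps}")
```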
The FormAI Dataset: Generative AI in Software Security Through the Lens of Formal Verification
paper_authors: Norbert Tihanyi, Tamas Bisztray, Ridhi Jain, Mohamed Amine Ferrag, Lucas C. Cordeiro, Vasileios Mavroeidis
for: The paper presents the FormAI dataset, a large collection of AI-generated C programs with vulnerability classification.
methods: The dataset is generated using GPT-3.5-turbo with a dynamic zero-shot prompting technique that spawns a diverse set of programs. Vulnerabilities are labeled using a formal verification method based on ESBMC.
results: The dataset contains 112,000 programs with varying levels of complexity, each associated with relevant CWE numbers. It is suitable for evaluating the effectiveness of static and dynamic analysis tools, and can be used to train LLMs and machine learning algorithms.
Abstract
This paper presents the FormAI dataset, a large collection of 112,000 AI-generated compilable and independent C programs with vulnerability classification. We introduce a dynamic zero-shot prompting technique, constructed to spawn a diverse set of programs utilizing Large Language Models (LLMs). The dataset is generated by GPT-3.5-turbo and comprises programs with varying levels of complexity. Some programs handle complicated tasks such as network management, table games, or encryption, while others deal with simpler tasks like string manipulation. Every program is labeled with the vulnerabilities found within the source code, indicating the type, line number, and vulnerable function name. This is accomplished by employing a formal verification method using the Efficient SMT-based Bounded Model Checker (ESBMC), which performs model checking, abstract interpretation, constraint programming, and satisfiability modulo theories, to reason over safety/security properties in programs. This approach definitively detects vulnerabilities and offers a formal model known as a counterexample, thus eliminating the possibility of generating false positive reports. This property of the dataset makes it suitable for evaluating the effectiveness of various static and dynamic analysis tools. Furthermore, we have associated the identified vulnerabilities with relevant Common Weakness Enumeration (CWE) numbers. We make the source code available for the 112,000 programs, accompanied by a comprehensive list detailing the vulnerabilities detected in each individual program including location and function name, which makes the dataset ideal to train LLMs and machine learning algorithms.
Citation: A Key to Building Responsible and Accountable Large Language Models
results: The paper outlines several research problems to guide future work towards building more responsible and accountable LLMs.
Abstract
Large Language Models (LLMs) bring transformative benefits alongside unique challenges, including intellectual property (IP) and ethical concerns. This position paper explores a novel angle to mitigate these risks, drawing parallels between LLMs and established web systems. We identify "citation" as a crucial yet missing component in LLMs, which could enhance content transparency and verifiability while addressing IP and ethical dilemmas. We further propose that a comprehensive citation mechanism for LLMs should account for both non-parametric and parametric content. Despite the complexity of implementing such a citation mechanism, along with the inherent potential pitfalls, we advocate for its development. Building on this foundation, we outline several research problems in this area, aiming to guide future explorations towards building more responsible and accountable LLMs.
Diffusion Models for Computational Design at the Example of Floor Plans
For: The paper explores the capabilities of diffusion-based AI image generators for computational design in civil engineering, specifically for creating floor plans.
Methods: The paper uses diffusion models with improved semantic encoding to generate floor plans, and experiments are conducted to evaluate the validity and query performance of the generated plans.
Results: The paper shows that the proposed diffusion models can improve the validity of generated floor plans from 6% to 90%, and improve query performance for different examples. However, the paper also identifies shortcomings and future research challenges for these models.
Abstract
AI image generators based on diffusion models have been widely discussed recently for their capability to create images from simple text prompts. However, for practical use in civil engineering, they need to be able to create specific construction plans for given constraints. In this paper, we explore the capabilities of these diffusion-based AI generators for computational design using floor plans as an example and identify their current limitations. We explain how diffusion models work and propose new diffusion models with improved semantic encoding. In several experiments, we show that we can improve the validity of generated floor plans from 6% to 90% and improve query performance for different examples. We identify shortcomings of these models, derive future research challenges, and discuss the need to combine diffusion models with building information modelling. With this, we provide key insights into the current state and future directions for diffusion models in civil engineering.
results: The paper presents a new method for computing delay-resilient shields that guarantee safety under worst-case assumptions on the delays of the input signals. It also introduces novel decision heuristics that minimize the shield's interference with the system, and presents the first integration of delay-resilient shields in a realistic driving simulator.
Abstract
Agents operating in physical environments need to be able to handle delays in the input and output signals, since neither data transmission nor sensing or actuating the environment is instantaneous. Shields are correct-by-construction runtime enforcers that guarantee safe execution by correcting any action that may cause a violation of a formal safety specification. Besides providing safety guarantees, shields should interfere minimally with the agent. Therefore, shields should pick safe corrective actions in such a way that future interferences are most likely minimized. Current shielding approaches do not consider possible delays in the input signals in their safety analyses. In this paper, we address this issue. We propose synthesis algorithms to compute delay-resilient shields that guarantee safety under worst-case assumptions on the delays of the input signals. We also introduce novel heuristics for deciding between multiple corrective actions, designed to minimize future shield interferences caused by delays. As a further contribution, we present the first integration of shields in a realistic driving simulator. We implemented our delayed shields in the driving simulator CARLA. We shield potentially unsafe autonomous driving agents in different safety-critical scenarios and show the effect of delays on the safety analysis.
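The sketch below illustrates the core worst-case reasoning under delay on a toy 1-D corridor, not the paper's synthesis algorithms: the shield only sees a state from `delay` steps ago, so it checks a proposed action against every state the agent could have reached in the meantime. Dynamics, limits, and the corrective-action choice are all illustrative assumptions.

```python
# Toy delay-aware shield: block an action if ANY state reachable from the
# delayed observation could then violate the safety property (pos < LIMIT).
from itertools import product

ACTIONS = (-1, 0, +1)          # move left / stay / move right
LIMIT = 10                     # safety property: position must stay < LIMIT

def step(pos, act):
    return pos + act

def safe_under_delay(delayed_pos, delay, proposed_act):
    """Check the proposed action against all states reachable from the
    delayed observation under any sequence of unknown past actions."""
    for past in product(ACTIONS, repeat=delay):
        pos = delayed_pos
        for a in past:
            pos = step(pos, a)
        if step(pos, proposed_act) >= LIMIT:   # worst case violates safety
            return False
    return True

delayed_obs, delay, wanted = 7, 2, +1
if safe_under_delay(delayed_obs, delay, wanted):
    act = wanted
else:
    # pick the first corrective action that is safe in the worst case
    act = next(a for a in (0, -1) if safe_under_delay(delayed_obs, delay, a))
print("agent wanted:", wanted, "-> shield executed:", act)  # blocked, stays put
```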
Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning
for: This study introduces the Instruction Following Score (IFS), a metric that measures a language model's ability to follow instructions.
methods: The study benchmarks publicly available base and instruct models, showing that the ratio of well-formatted responses to partial and full sentences can distinguish between the two model classes. IFS can also be used as an early stopping criterion for instruct tuning.
results: Supervised fine-tuning (SFT) of 7B and 13B LLaMA models shows that models learn to follow instructions relatively early in training, and that further fine-tuning can change the underlying base model's semantics. As an example of a semantics change, the objectivity of model predictions, as measured by the auxiliary metric ObjecQA, shifts most steeply when the IFS tends to plateau. The authors hope that decomposing instruct tuning into IFS and semantic factors opens the door to better controllable instruct tuning and to designing minimal instruct interfaces for querying foundation models.
Abstract
In this paper, we introduce the Instruction Following Score (IFS), a metric that detects language models' ability to follow instructions. The metric has a dual purpose. First, IFS can be used to distinguish between base and instruct models. We benchmark publicly available base and instruct models, and show that the ratio of well-formatted responses to partial and full sentences can be an effective measure for separating those two model classes. Second, the metric can be used as an early stopping criterion for instruct tuning. We compute IFS for Supervised Fine-Tuning (SFT) of 7B and 13B LLaMA models, showing that models learn to follow instructions relatively early in the training process, and that further fine-tuning can result in changes in the underlying base model semantics. As an example of such a semantics change, we examine the objectivity of model predictions, as defined by the auxiliary metric ObjecQA. We show that in this particular case, semantic changes are steepest when the IFS tends to plateau. We hope that decomposing instruct tuning into IFS and semantic factors starts a new trend in better controllable instruct tuning and opens possibilities for designing minimal instruct interfaces for querying foundation models.
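Since the paper's exact formula is not reproduced here, the sketch below shows one plausible way to compute an IFS-style score as the fraction of well-formatted full-sentence responses; the formatting heuristic (capitalized start, terminal punctuation) is our assumption, not the paper's published definition.

```python
# A rough sketch of an Instruction Following Score: the fraction of model
# responses that look like well-formatted full sentences rather than bare
# continuations of the prompt.
def is_well_formatted(response: str) -> bool:
    r = response.strip()
    return bool(r) and r[0].isupper() and r[-1] in ".!?"

def instruction_following_score(responses: list[str]) -> float:
    return sum(map(is_well_formatted, responses)) / max(len(responses), 1)

responses = [
    "The capital of France is Paris.",          # full sentence -> counts
    "and then the model kept on continuing",    # partial continuation
    "Sure! Here are three options: A, B, C.",   # full sentence -> counts
]
print(f"IFS = {instruction_following_score(responses):.2f}")  # IFS = 0.67
```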
Towards Open Federated Learning Platforms: Survey and Vision from Technical and Legal Perspectives
results: Through a comprehensive review from both technical and legal perspectives, the paper surveys a series of valuable topics, such as the availability of model repositories, legal compliance analysis, and copyright issues and intellectual property protection in model reusing. In particular, it introduces a novel taxonomy for analyzing license compatibility in batch model reusing methods, providing a systematic framework for identifying potential legal implications and restrictions.
Abstract
Traditional Federated Learning (FL) follows a server-dominated cooperation paradigm, which narrows the application scenarios of FL and decreases the enthusiasm of data holders to participate. To fully unleash the potential of FL, we advocate rethinking the design of current FL frameworks and extending it to a more generalized concept: Open Federated Learning Platforms. We propose two reciprocal cooperation frameworks for FL to achieve this: query-based FL and contract-based FL. In this survey, we conduct a comprehensive review of the feasibility of constructing an open FL platform from both technical and legal perspectives. We begin by reviewing the definition of FL and summarizing its inherent limitations, including server-client coupling, low model reusability, and lack of openness. In the query-based FL platform, an open model sharing and reusing platform empowered by the community for model mining, we explore a wide range of valuable topics, including the availability of up-to-date model repositories for model querying, legal compliance analysis between different model licenses, and copyright issues and intellectual property protection in model reusing. In particular, we introduce a novel taxonomy to streamline the analysis of model license compatibility in FL studies that involve batch model reusing methods, including combination, amalgamation, distillation, and generation. This taxonomy provides a systematic framework for identifying the corresponding clauses of licenses and facilitates the identification of potential legal implications and restrictions when reusing models. Through this survey, we uncover the current dilemmas faced by FL and advocate for the development of sustainable open FL platforms. We aim to provide guidance for establishing such platforms in the future, while identifying potential problems and challenges that need to be addressed.
Beyond Known Reality: Exploiting Counterfactual Explanations for Medical Research
for: This study uses counterfactual explanations to explore "what if?" scenarios in medical research, with the aim of expanding our understanding beyond existing boundaries. Specifically, it focuses on using MRI features for diagnosing pediatric posterior fossa brain tumors as a case study.
results: The results demonstrate that counterfactual explanations can enhance the trust and acceptance of AI-driven methods in clinical settings. The approach maintains both statistical and clinical fidelity, allowing distinct tumor features to be examined through alternative realities. Additionally, the study explores the potential use of counterfactuals for data augmentation and evaluates their feasibility as an alternative approach in medical research.
Abstract
This study employs counterfactual explanations to explore "what if?" scenarios in medical research, with the aim of expanding our understanding beyond existing boundaries. Specifically, we focus on utilizing MRI features for diagnosing pediatric posterior fossa brain tumors as a case study. The field of artificial intelligence and explainability has witnessed a growing number of studies and increasing scholarly interest. However, the lack of human-friendly interpretations in explaining the outcomes of machine learning algorithms has significantly hindered the acceptance of these methods by clinicians in their clinical practice. To address this, our approach incorporates counterfactual explanations, providing a novel way to examine alternative decision-making scenarios. These explanations offer personalized and context-specific insights, enabling the validation of predictions and clarification of variations under diverse circumstances. Importantly, our approach maintains both statistical and clinical fidelity, allowing for the examination of distinct tumor features through alternative realities. Additionally, we explore the potential use of counterfactuals for data augmentation and evaluate their feasibility as an alternative approach in medical research. The results demonstrate the promising potential of counterfactual explanations to enhance trust and acceptance of AI-driven methods in clinical settings.
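To make the counterfactual idea concrete, here is a generic greedy search that nudges one feature at a time until a classifier's prediction flips; this is an illustration of the "what if?" mechanic on synthetic stand-in features, not the paper's method.

```python
# Minimal greedy counterfactual search on a tabular classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                 # stand-in for MRI features
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)
clf = LogisticRegression().fit(X, y)

def counterfactual(x, step=0.1, max_iter=200):
    x = x.copy()
    target = 1 - clf.predict(x.reshape(1, -1))[0]   # flip the prediction
    for _ in range(max_iter):
        if clf.predict(x.reshape(1, -1))[0] == target:
            return x
        # try the single-feature nudge that most increases target prob
        cands = [x + d * step * np.eye(len(x))[j]
                 for j in range(len(x)) for d in (-1, 1)]
        probs = clf.predict_proba(np.array(cands))[:, target]
        x = cands[int(np.argmax(probs))]
    return x

x0 = X[0]
xcf = counterfactual(x0)
print("original class:", clf.predict(x0.reshape(1, -1))[0],
      "-> counterfactual class:", clf.predict(xcf.reshape(1, -1))[0])
print("feature change:", np.round(xcf - x0, 2))
```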
DARE: Towards Robust Text Explanations in Biomedical and Healthcare Applications
results: The study validates the proposed methods through extensive experiments, showing that they improve the robustness of deep neural network explanations on three established biomedical benchmarks.
Abstract
Along with the successful deployment of deep neural networks in several application domains, the need to unravel the black-box nature of these networks has seen a significant increase recently. Several methods have been introduced to provide insight into the inference process of deep neural networks. However, most of these explainability methods have been shown to be brittle in the face of adversarial perturbations of their inputs in the image and generic text domains. In this work, we show that this phenomenon extends to specific and important high-stakes domains like biomedical datasets. In particular, we observe that the robustness of explanations should be characterized in terms of the accuracy of the explanation in linking a model's inputs and its decisions - faithfulness - and its relevance from the perspective of domain experts - plausibility. This is crucial to prevent explanations that are inaccurate but still look convincing in the context of the domain at hand. To this end, we show how to adapt current attribution robustness estimation methods to a given domain, so as to take into account domain-specific plausibility. This results in our DomainAdaptiveAREstimator (DARE) attribution robustness estimator, allowing us to properly characterize the domain-specific robustness of faithful explanations. Next, we provide two methods, adversarial training and FAR training, to mitigate the brittleness characterized by DARE, allowing us to train networks that display robust attributions. Finally, we empirically validate our methods with extensive experiments on three established biomedical benchmarks.
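A minimal sketch of the underlying robustness measurement: compare input-gradient attributions before and after a small input perturbation and report their similarity. The model and data are placeholders, and DARE's domain-specific plausibility weighting is not modeled here.

```python
# Attribution robustness sketch: low cosine similarity between saliency
# maps of clean and perturbed inputs signals brittle explanations.
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(torch.nn.Linear(20, 32), torch.nn.Tanh(),
                            torch.nn.Linear(32, 2))

def saliency(x):
    x = x.clone().requires_grad_(True)
    model(x).max(dim=-1).values.sum().backward()
    return x.grad.detach()

x = torch.randn(8, 20)
s_clean = saliency(x)
s_pert = saliency(x + 0.05 * torch.randn_like(x))
sim = F.cosine_similarity(s_clean.flatten(1), s_pert.flatten(1), dim=1)
print("mean attribution similarity:", sim.mean().item())
```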
Combating Confirmation Bias: A Unified Pseudo-Labeling Framework for Entity Alignment
paper_authors: Qijie Ding, Jie Yin, Daokun Zhang, Junbin Gao
for: Improve the accuracy of entity alignment by eliminating the adverse impact of pseudo-labeling errors.
methods: Proposes a Unified Pseudo-Labeling framework for Entity Alignment (UPL-EA), comprising Optimal Transport (OT)-based pseudo-labeling and cross-iteration pseudo-label calibration.
results: Experimental results show that the approach achieves competitive performance with limited prior alignment seeds, supported by a theoretical analysis of pseudo-labeling errors.
Abstract
Entity alignment (EA) aims at identifying equivalent entity pairs across different knowledge graphs (KGs) that refer to the same real-world identity. To systematically combat confirmation bias for pseudo-labeling-based entity alignment, we propose a Unified Pseudo-Labeling framework for Entity Alignment (UPL-EA) that explicitly eliminates pseudo-labeling errors to boost the accuracy of entity alignment. UPL-EA consists of two complementary components: (1) The Optimal Transport (OT)-based pseudo-labeling uses discrete OT modeling as an effective means to enable more accurate determination of entity correspondences across two KGs and to mitigate the adverse impact of erroneous matches. A simple but highly effective criterion is further devised to derive pseudo-labeled entity pairs that satisfy one-to-one correspondences at each iteration. (2) The cross-iteration pseudo-label calibration operates across multiple consecutive iterations to further improve the pseudo-labeling precision rate by reducing the local pseudo-label selection variability with a theoretical guarantee. The two components are respectively designed to eliminate Type I and Type II pseudo-labeling errors identified through our analysis. The calibrated pseudo-labels are thereafter used to augment prior alignment seeds to reinforce subsequent model training for alignment inference. The effectiveness of UPL-EA in eliminating pseudo-labeling errors is both theoretically supported and experimentally validated. The experimental results show that our approach achieves competitive performance with limited prior alignment seeds.
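The sketch below illustrates the OT-based pseudo-labeling step under simplifying assumptions: Sinkhorn iterations on a cosine-distance cost between random stand-in embeddings for the two KGs, followed by a mutual-argmax filter that enforces one-to-one correspondences. The regularization value and the filtering rule are illustrative choices, not the paper's exact criterion.

```python
# OT-based pseudo-labeling sketch: Sinkhorn plan + one-to-one filtering.
import numpy as np

def sinkhorn(cost, reg=0.05, n_iter=200):
    K = np.exp(-cost / reg)
    u = np.ones(cost.shape[0])
    for _ in range(n_iter):
        v = 1.0 / (K.T @ u)          # match uniform column marginals
        u = 1.0 / (K @ v)            # match uniform row marginals
    return u[:, None] * K * v[None, :]   # transport plan

rng = np.random.default_rng(0)
E1, E2 = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))   # KG embeddings
cost = 1 - (E1 @ E2.T) / (
    np.linalg.norm(E1, axis=1)[:, None] * np.linalg.norm(E2, axis=1)[None, :])
plan = sinkhorn(cost)

# keep only mutual argmaxes of the plan -> one-to-one pseudo-labels
rows, cols = plan.argmax(axis=1), plan.argmax(axis=0)
pairs = [(i, rows[i]) for i in range(len(rows)) if cols[rows[i]] == i]
print("one-to-one pseudo-labeled pairs:", pairs)
```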
Performance Modeling of Data Storage Systems using Generative Models
paper_authors: Abdalaziz Rashid Al-Maeeni, Aziz Temirkhanov, Artem Ryzhikov, Mikhail Hushchyn
For: The paper is written for researchers and practitioners in the field of industrial data analysis, particularly those interested in high-precision modeling of systems and digital twins.
Methods: The paper uses machine learning-based generative models to develop models of a storage system, including probabilistic models for each storage component (HDD and SSD) and RAID schemes.
Results: The paper reports experiments demonstrating the accuracy of the models' predictions, with errors of 4-10% for IOPS and 3-16% for latency, depending on the components and models used. The predictions show up to 0.99 Pearson correlation with Little's law, which can be used for unsupervised reliability checks of the models. Additionally, the paper presents novel datasets that can be used for benchmarking regression algorithms, conditional generative models, and uncertainty estimation methods in machine learning.
Abstract
High-precision modeling of systems is one of the main areas of industrial data analysis. Models of systems, their digital twins, are used to predict their behavior under various conditions. We have developed several models of a storage system using machine learning-based generative models. The system consists of several components: hard disk drive (HDD) and solid-state drive (SSD) storage pools with different RAID schemes and cache. Each storage component is represented by a probabilistic model that describes the probability distribution of the component performance in terms of IOPS and latency, depending on their configuration and external data load parameters. The experiments demonstrate errors of 4-10% for IOPS and 3-16% for latency predictions, depending on the components and models of the system. The predictions show up to 0.99 Pearson correlation with Little's law, which can be used for unsupervised reliability checks of the models. In addition, we present novel data sets that can be used for benchmarking regression algorithms, conditional generative models, and uncertainty estimation methods in machine learning.
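The Little's law check mentioned above can be made concrete as follows: for a storage system, queue depth = IOPS × latency, so a model's IOPS and latency predictions can be cross-checked against the known offered concurrency. All numbers below are synthetic stand-ins.

```python
# Unsupervised reliability check via Little's law: queue = IOPS * latency.
import numpy as np

rng = np.random.default_rng(1)
queue_depth = rng.uniform(1, 64, size=200)             # offered concurrency
iops_pred = queue_depth / rng.uniform(0.001, 0.01, size=200)   # model IOPS
lat_pred = queue_depth / iops_pred * rng.normal(1.0, 0.05, size=200)  # model latency (s)

implied_depth = iops_pred * lat_pred                   # Little's law
r = np.corrcoef(queue_depth, implied_depth)[0, 1]
print(f"Pearson correlation with Little's law: {r:.3f}")  # close to 1.0 here
```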
A Comparison of Machine Learning Methods for Data with High-Cardinality Categorical Variables
results: The study finds that machine learning models with random effects achieve higher prediction accuracy than their classical counterparts without random effects, and that tree-boosting with random effects outperforms deep neural networks with random effects on data with high-cardinality categorical variables.
Abstract
High-cardinality categorical variables are variables for which the number of different levels is large relative to the sample size of a data set, or in other words, there are few data points per level. Machine learning methods can have difficulties with high-cardinality variables. In this article, we empirically compare several versions of two of the most successful machine learning methods, tree-boosting and deep neural networks, and linear mixed effects models using multiple tabular data sets with high-cardinality categorical variables. We find that, first, machine learning models with random effects have higher prediction accuracy than their classical counterparts without random effects, and, second, tree-boosting with random effects outperforms deep neural networks with random effects.
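As a small illustration of the random-effects idea (using a linear mixed effects model rather than the paper's tree-boosting variant), the high-cardinality categorical enters as a grouping factor with one random intercept per level instead of hundreds of one-hot columns; the data below are synthetic.

```python
# Random intercepts for a high-cardinality categorical via a linear
# mixed effects model (statsmodels).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n, n_groups = 2000, 300                       # few observations per level
group = rng.integers(n_groups, size=n)
group_effect = rng.normal(0, 1, n_groups)     # true random intercepts
x = rng.normal(size=n)
y = 2.0 * x + group_effect[group] + rng.normal(0, 0.5, n)

df = pd.DataFrame({"y": y, "x": x, "g": group.astype(str)})
fit = smf.mixedlm("y ~ x", df, groups=df["g"]).fit()
print(fit.params["x"])        # slope estimate, close to the true 2.0
```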
Line Graphics Digitization: A Step Towards Full Automation
for: This paper aims to improve the digitization of mathematical graphics, specifically statistical plots, to increase accessibility and reproducibility.
methods: The authors introduce the Line Graphics (LG) dataset, which includes pixel-wise annotations of 5 coarse and 10 fine-grained categories, and explore 7 state-of-the-art models to benchmark the dataset.
results: The dataset covers 520 images of mathematical graphics from 450 documents across different disciplines, and can support two computer vision tasks: semantic segmentation and object detection. The authors plan to make the dataset, code, and models publicly available to the community.
Abstract
The digitization of documents allows for wider accessibility and reproducibility. While automatic digitization of document layout and text content has been a long-standing focus of research, this problem in regard to graphical elements, such as statistical plots, has been under-explored. In this paper, we introduce the task of fine-grained visual understanding of mathematical graphics and present the Line Graphics (LG) dataset, which includes pixel-wise annotations of 5 coarse and 10 fine-grained categories. Our dataset covers 520 images of mathematical graphics collected from 450 documents from different disciplines. Our proposed dataset can support two different computer vision tasks, i.e., semantic segmentation and object detection. To benchmark our LG dataset, we explore 7 state-of-the-art models. To foster further research on the digitization of statistical graphs, we will make the dataset, code, and models publicly available to the community.
paper_authors: Muhammad Osama Nusrat, Zeeshan Habib, Mehreen Alam, Saad Ahmed Jamal
for: Understanding the meaning of emojis in social media.
methods: Emoji prediction using BERT, a pre-trained transformer language model.
results: Experimental results show that the approach predicts emojis with an accuracy of over 75%, improving on state-of-the-art models.
Abstract
In recent years, the use of emojis in social media has increased dramatically, making them an important element in understanding online communication. However, predicting the meaning of emojis in a given text is a challenging task due to their ambiguous nature. In this study, we propose a transformer-based approach for emoji prediction using BERT, a widely-used pre-trained language model. We fine-tuned BERT on a large corpus of text containing both text and emojis to predict the most appropriate emoji for a given text. Our experimental results demonstrate that our approach outperforms several state-of-the-art models in predicting emojis with an accuracy of over 75 percent. This work has potential applications in natural language processing, sentiment analysis, and social media marketing.
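A minimal fine-tuning sketch of the described setup, treating emoji prediction as text classification with a pre-trained BERT; the label set, data, and hyperparameters are tiny placeholders rather than the paper's configuration.

```python
# Emoji prediction as sequence classification with a pre-trained BERT.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

EMOJIS = ["😂", "❤️", "🔥"]                    # toy 3-class label space
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(EMOJIS))

texts = ["that joke was hilarious", "love you so much", "this track slaps"]
labels = torch.tensor([0, 1, 2])
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")

optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
out = model(**batch, labels=labels)            # one illustrative step
out.loss.backward(); optim.step()

pred = model(**batch).logits.argmax(dim=-1)
print([EMOJIS[i] for i in pred])
```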
Flowchase: a Mobile Application for Pronunciation Training
for: Providing personalized and real-time pronunciation feedback to English learners.
methods: A mobile application connected to speech technology that segments and analyzes segmental and supra-segmental speech features.
results: A combination of machine learning models based on speech representation learning provides the information needed to design personalized feedback.
Abstract
In this paper, we present a solution for providing personalized and instant feedback to English learners through a mobile application, called Flowchase, that is connected to a speech technology able to segment and analyze segmental and supra-segmental speech features. The speech processing pipeline receives linguistic information corresponding to an utterance to analyze, along with a speech sample. After validation of the speech sample, joint forced alignment and phonetic recognition are performed by a combination of machine learning models based on speech representation learning, which provides the information necessary for designing feedback on a series of segmental and supra-segmental pronunciation aspects.
Recommender Systems in the Era of Large Language Models (LLMs)
results: The survey finds that LLMs can improve the consistency and reliability of recommender systems while enhancing the user experience. It also identifies future research directions, including how to better exploit the capabilities of LLMs to improve recommendation performance and how to apply LLMs across different recommendation scenarios.
Abstract
With the prosperity of e-commerce and web applications, Recommender Systems (RecSys) have become an important component of our daily life, providing personalized suggestions that cater to user preferences. While Deep Neural Networks (DNNs) have made significant advancements in enhancing recommender systems by modeling user-item interactions and incorporating textual side information, DNN-based methods still face limitations, such as difficulties in understanding users' interests and capturing textual side information, inabilities in generalizing to various recommendation scenarios and reasoning on their predictions, etc. Meanwhile, the emergence of Large Language Models (LLMs), such as ChatGPT and GPT4, has revolutionized the fields of Natural Language Processing (NLP) and Artificial Intelligence (AI), due to their remarkable abilities in fundamental responsibilities of language understanding and generation, as well as impressive generalization and reasoning capabilities. As a result, recent studies have attempted to harness the power of LLMs to enhance recommender systems. Given the rapid evolution of this research direction in recommender systems, there is a pressing need for a systematic overview that summarizes existing LLM-empowered recommender systems, to provide researchers in relevant fields with an in-depth understanding. Therefore, in this paper, we conduct a comprehensive review of LLM-empowered recommender systems from various aspects including Pre-training, Fine-tuning, and Prompting. More specifically, we first introduce representative methods to harness the power of LLMs (as a feature encoder) for learning representations of users and items. Then, we review recent techniques of LLMs for enhancing recommender systems from three paradigms, namely pre-training, fine-tuning, and prompting. Finally, we comprehensively discuss future directions in this emerging field.
Fisher-Weighted Merge of Contrastive Learning Models in Sequential Recommendation
results: Extensive experiments demonstrate that the proposed method achieves better recommendation performance and shows promise for applications in sequential recommendation.
Abstract
Along with the exponential growth of online platforms and services, recommendation systems have become essential for identifying relevant items based on user preferences. The domain of sequential recommendation aims to capture evolving user preferences over time. To address dynamic preferences, various contrastive learning methods have been proposed to combat data sparsity, a challenge in recommendation systems due to the limited user-item interactions. In this paper, we are the first to apply the Fisher-Merging method to sequential recommendation, addressing and resolving practical challenges associated with it. This approach ensures robust fine-tuning by merging the parameters of multiple models, resulting in improved overall performance. Through extensive experiments, we demonstrate the effectiveness of our proposed methods, highlighting their potential to advance the state-of-the-art in sequential learning and recommendation systems.
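Fisher merging itself is well defined: each parameter is averaged across models with weights given by (an estimate of) its diagonal Fisher information, so parameters a model is confident about dominate the merge. The sketch below uses random placeholder Fisher values; in practice they are estimated from squared gradients of the model's log-likelihood.

```python
# Fisher-weighted parameter merging of two models' state dicts.
import torch

def fisher_merge(state_dicts, fishers, eps=1e-8):
    merged = {}
    for name in state_dicts[0]:
        num = sum(f[name] * sd[name] for sd, f in zip(state_dicts, fishers))
        den = sum(f[name] for f in fishers) + eps
        merged[name] = num / den          # Fisher-weighted average
    return merged

models = [torch.nn.Linear(4, 2) for _ in range(2)]
sds = [m.state_dict() for m in models]
# placeholder diagonal Fisher estimates, one value per parameter entry
fishers = [{k: torch.rand_like(v) for k, v in sd.items()} for sd in sds]

merged = torch.nn.Linear(4, 2)
merged.load_state_dict(fisher_merge(sds, fishers))
print(merged.weight)
```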
VertiBench: Advancing Feature Distribution Diversity in Vertical Federated Learning Benchmarks
results: The paper comprehensively evaluates cutting-edge VFL algorithms, providing valuable research insights, including the impact of feature importance and feature correlation on VFL performance and the effectiveness of real-world VFL datasets.
Abstract
Vertical Federated Learning (VFL) is a crucial paradigm for training machine learning models on feature-partitioned, distributed data. However, due to privacy restrictions, few public real-world VFL datasets exist for algorithm evaluation, and these represent a limited array of feature distributions. Existing benchmarks often resort to synthetic datasets, derived from arbitrary feature splits from a global set, which only capture a subset of feature distributions, leading to inadequate algorithm performance assessment. This paper addresses these shortcomings by introducing two key factors affecting VFL performance - feature importance and feature correlation - and proposing associated evaluation metrics and dataset splitting methods. Additionally, we introduce a real VFL dataset to address the deficit in image-image VFL scenarios. Our comprehensive evaluation of cutting-edge VFL algorithms provides valuable insights for future research in the field.
EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models
paper_authors: Michael Wornow, Rahul Thapa, Ethan Steinberg, Jason Fries, Nigam Shah
For: The paper is written to address the challenges of applying machine learning (ML) in healthcare, specifically the lack of shared assets such as datasets, tasks, and models.
Methods: The paper makes three main contributions: (1) a new dataset called EHRSHOT, which contains de-identified structured data from the electronic health records (EHRs) of 6,712 patients at Stanford Medicine; (2) the weights of a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients; and (3) 15 few-shot clinical prediction tasks to evaluate the performance of foundation models.
Results: The paper provides an end-to-end pipeline for the community to validate and build upon the performance of the clinical foundation model, and defines 15 few-shot clinical prediction tasks to evaluate the model's ability to adapt to new tasks with limited training data.
Abstract
While the general machine learning (ML) community has benefited from public datasets, tasks, and models, the progress of ML in healthcare has been hampered by a lack of such shared assets. The success of foundation models creates new challenges for healthcare ML by requiring access to shared pretrained models to validate performance benefits. We help address these challenges through three contributions. First, we publish a new dataset, EHRSHOT, containing de-identified structured data from the electronic health records (EHRs) of 6,712 patients from Stanford Medicine. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and not restricted to ICU/ED patients. Second, we publish the weights of a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. We are one of the first to fully release such a model for coded EHR data; in contrast, most prior models released for clinical data (e.g. GatorTron, ClinicalBERT) only work with unstructured text and cannot process the rich, structured data within an EHR. We provide an end-to-end pipeline for the community to validate and build upon its performance. Third, we define 15 few-shot clinical prediction tasks, enabling evaluation of foundation models on benefits such as sample efficiency and task adaption. The code to reproduce our results, as well as the model and dataset (via a research data use agreement), are available at our Github repo here: https://github.com/som-shahlab/ehrshot-benchmark
Generative Adversarial Networks for Dental Patient Identity Protection in Orthodontic Educational Imaging
For: This research aims to develop a novel area-preserving Generative Adversarial Network (GAN) inversion technique for effectively de-identifying dental patient images, addressing privacy concerns while preserving key dental features.
Methods: The enhanced GAN inversion methodology incorporates several deep learning models to provide end-to-end development guidance and practical application for image de-identification, adapting the context from one image to another while preserving dental features essential for oral diagnosis and dental education.
Results: The approach was assessed with varied facial pictures and achieved effective de-identification, maintaining the realism of important dental features; a panel of five clinicians deemed the images useful for dental diagnostics and education. The generated images can streamline the de-identification of dental patient images, enhance efficiency in dental education, and support the creation of de-identified datasets for broader 2D image research.
Abstract
Objectives: This research introduces a novel area-preserving Generative Adversarial Networks (GAN) inversion technique for effectively de-identifying dental patient images. This innovative method addresses privacy concerns while preserving key dental features, thereby generating valuable resources for dental education and research. Methods: We enhanced the existing GAN Inversion methodology to maximize the preservation of dental characteristics within the synthesized images. A comprehensive technical framework incorporating several deep learning models was developed to provide end-to-end development guidance and practical application for image de-identification. Results: Our approach was assessed with varied facial pictures, extensively used for diagnosing skeletal asymmetry and facial anomalies. Results demonstrated our model's ability to adapt the context from one image to another, maintaining compatibility, while preserving dental features essential for oral diagnosis and dental education. A panel of five clinicians conducted an evaluation on a set of original and GAN-processed images. The generated images achieved effective de-identification, maintaining the realism of important dental features and were deemed useful for dental diagnostics and education. Clinical Significance: Our GAN model and the encompassing framework can streamline the de-identification process of dental patient images, enhancing efficiency in dental education. This method improves students' diagnostic capabilities by offering more exposure to orthodontic malocclusions. Furthermore, it facilitates the creation of de-identified datasets for broader 2D image research at major research institutions.
Comparative Analysis of GPT-4 and Human Graders in Evaluating Praise Given to Students in Synthetic Dialogues
results: GPT-4 performs moderately well at identifying specific praise components, but underperforms in identifying the tutor's sincere praise language, particularly in the zero-shot prompting scenario.
Abstract
Research suggests that providing specific and timely feedback to human tutors enhances their performance. However, it presents challenges due to the time-consuming nature of assessing tutor performance by human evaluators. Large language models, such as the AI-chatbot ChatGPT, hold potential for offering constructive feedback to tutors in practical settings. Nevertheless, the accuracy of AI-generated feedback remains uncertain, with scant research investigating the ability of models like ChatGPT to deliver effective feedback. In this work-in-progress, we evaluate 30 dialogues generated by GPT-4 in a tutor-student setting. We use two different prompting approaches, the zero-shot chain of thought and the few-shot chain of thought, to identify specific components of effective praise based on five criteria. These approaches are then compared to the results of human graders for accuracy. Our goal is to assess the extent to which GPT-4 can accurately identify each praise criterion. We found that both zero-shot and few-shot chain of thought approaches yield comparable results. GPT-4 performs moderately well in identifying instances when the tutor offers specific and immediate praise. However, GPT-4 underperforms in identifying the tutor's ability to deliver sincere praise, particularly in the zero-shot prompting scenario where examples of sincere tutor praise statements were not provided. Future work will focus on enhancing prompt engineering, developing a more general tutoring rubric, and evaluating our method using real-life tutoring dialogues.
Evaluating the Effectiveness of Large Language Models in Representing Textual Descriptions of Geometry and Spatial Relations
results: Experiments show that LLM-generated embeddings can preserve geometry types and capture some spatial relations (up to 73% accuracy), but challenges remain in estimating numeric values and retrieving spatially related objects.
Abstract
This research focuses on assessing the ability of large language models (LLMs) in representing geometries and their spatial relations. We utilize LLMs including GPT-2 and BERT to encode the well-known text (WKT) format of geometries and then feed their embeddings into classifiers and regressors to evaluate the effectiveness of the LLMs-generated embeddings for geometric attributes. The experiments demonstrate that while the LLMs-generated embeddings can preserve geometry types and capture some spatial relations (up to 73% accuracy), challenges remain in estimating numeric values and retrieving spatially related objects. This research highlights the need for improvement in terms of capturing the nuances and complexities of the underlying geospatial data and integrating domain knowledge to support various GeoAI applications using foundation models.
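A minimal sketch of the probing setup the abstract describes, shown here with a Hugging Face BERT encoder and a scikit-learn classifier; the WKT strings, the mean-pooling step, and the geometry-type probe are illustrative assumptions, not the paper's exact configuration:

```python
# Encode WKT geometry strings with a pretrained LM, then probe the embeddings.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

wkt_strings = [
    "POINT (30 10)",
    "LINESTRING (30 10, 10 30, 40 40)",
    "POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))",
    "POINT (5 5)",
    "LINESTRING (0 0, 1 1)",
    "POLYGON ((0 0, 1 0, 1 1, 0 0))",
]
labels = [0, 1, 2, 0, 1, 2]  # geometry-type labels: point / line / polygon

with torch.no_grad():
    batch = tokenizer(wkt_strings, padding=True, return_tensors="pt")
    # Mean-pool the final hidden states into one embedding per WKT string.
    hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    embeddings = (hidden * mask).sum(1) / mask.sum(1)

# Probe the embeddings with a simple classifier, as in the paper's protocol.
clf = LogisticRegression(max_iter=1000).fit(embeddings.numpy(), labels)
print("train accuracy:", clf.score(embeddings.numpy(), labels))
```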
STS-CCL: Spatial-Temporal Synchronous Contextual Contrastive Learning for Urban Traffic Forecasting
results: Extensive experiments and evaluations show that a predictor built on the STS-CCL contrastive learning model outperforms existing traffic forecasting benchmarks. Moreover, STS-CCL learns effectively from large amounts of unlabeled traffic data, making it well suited to other spatiotemporal tasks with data scarcity issues.
Abstract
Efficiently capturing the complex spatiotemporal representations from large-scale unlabeled traffic data remains a challenging task. To address this dilemma, this work employs advanced contrastive learning and proposes a novel Spatial-Temporal Synchronous Contextual Contrastive Learning (STS-CCL) model. First, we elaborate the basic and strong augmentation methods for spatiotemporal graph data, which not only perturb the data in terms of graph structure and temporal characteristics, but also employ a learning-based dynamic graph view generator for adaptive augmentation. Second, we introduce a Spatial-Temporal Synchronous Contrastive Module (STS-CM) to simultaneously capture the decent spatial-temporal dependencies and realize graph-level contrasting. To further discriminate node individuals in negative filtering, a Semantic Contextual Contrastive method is designed based on semantic features and spatial heterogeneity, achieving node-level contrastive learning along with negative filtering. Finally, we present a hard mutual-view contrastive training scheme and extend the classic contrastive loss to an integrated objective function, yielding better performance. Extensive experiments and evaluations demonstrate that building a predictor upon the STS-CCL contrastive learning model gains superior performance over existing traffic forecasting benchmarks. The proposed STS-CCL is highly suitable for large datasets with only a few labeled data and other spatiotemporal tasks with data scarcity issues.
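The graph-level contrasting step can be illustrated with a standard InfoNCE objective over two augmented views. This is a hedged sketch of the generic contrastive loss only; STS-CCL's dynamic view generator, semantic negative filtering, and integrated objective are not reproduced here:

```python
# Generic InfoNCE loss between two augmented views of the same graphs.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """z1, z2: (batch, dim) embeddings of two views of the same graphs."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau          # pairwise cosine similarities
    targets = torch.arange(z1.size(0))  # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: embeddings of a basic-augmented and a strong-augmented view.
z_basic, z_strong = torch.randn(8, 64), torch.randn(8, 64)
print(info_nce(z_basic, z_strong))
```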
The KiTS21 Challenge: Automatic segmentation of kidneys, renal tumors, and renal cysts in corticomedullary-phase CT
paper_authors: Nicholas Heller, Fabian Isensee, Dasha Trofimova, Resha Tejpaul, Zhongchen Zhao, Huai Chen, Lisheng Wang, Alex Golts, Daniel Khapun, Daniel Shats, Yoel Shoshan, Flora Gilboa-Solomon, Yasmeen George, Xi Yang, Jianpeng Zhang, Jing Zhang, Yong Xia, Mengran Wu, Zhiyang Liu, Ed Walczak, Sean McSweeney, Ranveer Vasdev, Chris Hornung, Rafat Solaiman, Jamee Schoephoerster, Bailey Abernathy, David Wu, Safa Abdulkadir, Ben Byun, Justice Spriggs, Griffin Struyk, Alexandra Austin, Ben Simpson, Michael Hagstrom, Sierra Virnig, John French, Nitin Venkatesh, Sarah Chan, Keenan Moore, Anna Jacobsen, Susan Austin, Mark Austin, Subodh Regmi, Nikolaos Papanikolopoulos, Christopher Weight
for: The paper is written for the 2021 Kidney and Kidney Tumor Segmentation Challenge (KiTS21) and the 2021 international conference on Medical Image Computing and Computer Assisted Interventions (MICCAI).
methods: The paper uses a novel annotation method that collects three separate annotations for each region of interest, and these annotations are performed in a fully transparent setting using a web-based annotation tool.
results: The top-performing teams achieved a significant improvement over the state of the art set in 2019, and this performance is shown to inch ever closer to human-level performance.
Abstract
This paper presents the challenge report for the 2021 Kidney and Kidney Tumor Segmentation Challenge (KiTS21) held in conjunction with the 2021 international conference on Medical Image Computing and Computer Assisted Interventions (MICCAI). KiTS21 is a sequel to its first edition in 2019, and it features a variety of innovations in how the challenge was designed, in addition to a larger dataset. A novel annotation method was used to collect three separate annotations for each region of interest, and these annotations were performed in a fully transparent setting using a web-based annotation tool. Further, the KiTS21 test set was collected from an outside institution, challenging participants to develop methods that generalize well to new populations. Nonetheless, the top-performing teams achieved a significant improvement over the state of the art set in 2019, and this performance is shown to inch ever closer to human-level performance. An in-depth meta-analysis is presented describing which methods were used and how they fared on the leaderboard, as well as the characteristics of which cases generally saw good performance, and which did not. Overall KiTS21 facilitated a significant advancement in the state of the art in kidney tumor segmentation, and provides useful insights that are applicable to the field of semantic segmentation as a whole.
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
paper_authors: Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach
For: This paper proposes SDXL, a text-to-image generation model based on latent diffusion. Compared with previous versions of Stable Diffusion, SDXL uses a three times larger UNet backbone, mainly due to more attention blocks and a larger cross-attention context.
Methods: The work designs multiple novel conditioning schemes and trains SDXL on multiple aspect ratios. In addition, an image refinement model is introduced to improve the visual fidelity of SDXL's samples via a post-hoc image-to-image technique.
Results: Experiments show that SDXL markedly improves over previous versions of Stable Diffusion and achieves image quality competitive with black-box state-of-the-art image generators. Code and model weights are also released for further research and application.
Abstract
We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: the increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models
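For reference, the released checkpoints can be driven through Hugging Face's diffusers library, which ships SDXL pipelines. The model IDs and two-stage base-plus-refiner call below follow the public release and are a usage sketch, not the paper's training setup:

```python
# Base SDXL generation followed by the post-hoc image-to-image refiner.
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse"
image = base(prompt=prompt).images[0]                  # base SDXL sample
image = refiner(prompt=prompt, image=image).images[0]  # refinement pass
print(image.size)
```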
A Neural Collapse Perspective on Feature Evolution in Graph Neural Networks
results: The study finds that a phenomenon similar to "Neural Collapse" (NC) also arises in node-wise classification on graphs: after training deep models, the within-class variability of the deepest features decreases and the class means align more closely with certain symmetric structures, though not to the extent observed in the instance-wise case.
Abstract
Graph neural networks (GNNs) have become increasingly popular for classification tasks on graph-structured data. Yet, the interplay between graph topology and feature evolution in GNNs is not well understood. In this paper, we focus on node-wise classification, illustrated with community detection on stochastic block model graphs, and explore the feature evolution through the lens of the "Neural Collapse" (NC) phenomenon. When training instance-wise deep classifiers (e.g. for image classification) beyond the zero training error point, NC demonstrates a reduction in the deepest features' within-class variability and an increased alignment of their class means to certain symmetric structures. We start with an empirical study that shows that a decrease in within-class variability is also prevalent in the node-wise classification setting, however, not to the extent observed in the instance-wise case. Then, we theoretically study this distinction. Specifically, we show that even an "optimistic" mathematical model requires that the graphs obey a strict structural condition in order to possess a minimizer with exact collapse. Interestingly, this condition is viable also for heterophilic graphs and relates to recent empirical studies on settings with improved GNNs' generalization. Furthermore, by studying the gradient dynamics of the theoretical model, we provide reasoning for the partial collapse observed empirically. Finally, we present a study on the evolution of within- and between-class feature variability across layers of a well-trained GNN and contrast the behavior with spectral methods.
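Within-class variability can be quantified with a simple collapse diagnostic, for example the ratio of within-class to between-class feature scatter. The sketch below uses random placeholder features and is a simplified scalar variant of the usual NC1 metric, not the paper's exact measurement:

```python
# Scalar collapse diagnostic: within-class over between-class scatter.
import numpy as np

def nc1_variability(features: np.ndarray, labels: np.ndarray) -> float:
    mu_global = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        fc = features[labels == c]
        mu_c = fc.mean(axis=0)
        within += ((fc - mu_c) ** 2).sum()
        between += len(fc) * ((mu_c - mu_global) ** 2).sum()
    return within / between  # smaller values indicate stronger collapse

rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 16))   # placeholder for a GNN's deepest features
labs = rng.integers(0, 3, size=300)  # placeholder node-class labels
print(nc1_variability(feats, labs))
```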
methods: This work proposes a novel causality-based method, named Causal Video Summarizer (CVS), to effectively capture the interactive information between the video and the query for the multi-modal video summarization task. CVS consists of a probabilistic encoder and a probabilistic decoder.
results: Evaluated on an existing multi-modal video summarization dataset, the proposed method improves accuracy by +5.4% and F1-score by +4.92% over the state-of-the-art method.
Abstract
Recently, video summarization has been proposed as a method to help video exploration. However, traditional video summarization models only generate a fixed video summary which is usually independent of user-specific needs and hence limits the effectiveness of video exploration. Multi-modal video summarization is one of the approaches utilized to address this issue. Multi-modal video summarization has a video input and a text-based query input. Hence, effective modeling of the interaction between a video input and text-based query is essential to multi-modal video summarization. In this work, a new causality-based method named Causal Video Summarizer (CVS) is proposed to effectively capture the interactive information between the video and query to tackle the task of multi-modal video summarization. The proposed method consists of a probabilistic encoder and a probabilistic decoder. Based on the evaluation of the existing multi-modal video summarization dataset, experimental results show that the proposed approach is effective, with an increase of +5.4% in accuracy and +4.92% in F1-score, compared with the state-of-the-art method.
Query-based Video Summarization with Pseudo Label Supervision
paper_authors: Jia-Hong Huang, Luka Murn, Marta Mrak, Marcel Worring
for: This paper is written for improving the performance of deep video summarization models by using self-supervised learning and pseudo labels to address the data sparsity challenge.
methods: The paper proposes using segment-level pseudo labels generated based on existing human-defined frame-level labels to pre-train a supervised deep model. It also introduces a semantics booster to generate context-aware query representations and mutual attention to capture interactive information between visual and textual modalities.
results: The proposed video summarization algorithm achieves state-of-the-art performance on three commonly-used video summarization benchmarks.
Abstract
Existing datasets for manually labelled query-based video summarization are costly and thus small, limiting the performance of supervised deep video summarization models. Self-supervision can address the data sparsity challenge by using a pretext task and defining a method to acquire extra data with pseudo labels to pre-train a supervised deep model. In this work, we introduce segment-level pseudo labels from input videos to properly model both the relationship between a pretext task and a target task, and the implicit relationship between the pseudo label and the human-defined label. The pseudo labels are generated based on existing human-defined frame-level labels. To create more accurate query-dependent video summaries, a semantics booster is proposed to generate context-aware query representations. Furthermore, we propose mutual attention to help capture the interactive information between visual and textual modalities. Three commonly-used video summarization benchmarks are used to thoroughly validate the proposed approach. Experimental results show that the proposed video summarization algorithm achieves state-of-the-art performance.
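The segment-level pseudo-label construction can be sketched as aggregating existing frame-level labels over fixed-length segments. The segment length and thresholded-mean rule below are illustrative assumptions, not the paper's exact recipe:

```python
# Derive segment-level pseudo labels from frame-level importance scores.
import numpy as np

def segment_pseudo_labels(frame_scores: np.ndarray, seg_len: int = 5,
                          threshold: float = 0.5) -> np.ndarray:
    """Mark a segment positive when its mean frame score passes a threshold."""
    n_segs = int(np.ceil(len(frame_scores) / seg_len))
    labels = np.zeros(n_segs, dtype=int)
    for i in range(n_segs):
        seg = frame_scores[i * seg_len:(i + 1) * seg_len]
        labels[i] = int(seg.mean() >= threshold)
    return labels

frame_scores = np.array([0.9, 0.8, 0.7, 0.1, 0.2, 0.0, 0.1, 0.9, 0.8, 0.9])
print(segment_pseudo_labels(frame_scores))  # -> [1 1]
```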
Concept2Box: Joint Geometric Embeddings for Learning Two-View Knowledge Graphs
results: Experiments show that Concept2Box can effectively embed both views of a knowledge graph.
Abstract
Knowledge graph embeddings (KGE) have been extensively studied to embed large-scale relational data for many real-world applications. Existing methods have long ignored the fact that many KGs contain two fundamentally different views: high-level ontology-view concepts and fine-grained instance-view entities. They usually embed all nodes as vectors in one latent space. However, a single geometric representation fails to capture the structural differences between the two views and lacks probabilistic semantics for concepts' granularity. We propose Concept2Box, a novel approach that jointly embeds the two views of a KG using dual geometric representations. We model concepts with box embeddings, which learn the hierarchy structure and complex relations such as overlap and disjointness among them. Box volumes can be interpreted as concepts' granularity. Different from concepts, we model entities as vectors. To bridge the gap between concept box embeddings and entity vector embeddings, we propose a novel vector-to-box distance metric and learn both embeddings jointly. Experiments on both the public DBpedia KG and a newly-created industrial KG showed the effectiveness of Concept2Box.
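The dual geometric representation can be sketched with axis-aligned boxes for concepts (volume as a granularity proxy) and plain vectors for entities. The distance function below is a common choice for vector-to-box geometry and stands in for, rather than reproduces, Concept2Box's learned metric:

```python
# Concepts as axis-aligned boxes, entities as vectors.
import numpy as np

class Box:
    def __init__(self, low: np.ndarray, high: np.ndarray):
        self.low, self.high = low, high

    def volume(self) -> float:
        # Broader concepts occupy larger boxes.
        return float(np.prod(np.clip(self.high - self.low, 0.0, None)))

def vector_to_box_distance(v: np.ndarray, box: Box) -> float:
    """Zero inside the box; Euclidean distance to the nearest face outside."""
    delta = np.maximum(box.low - v, 0.0) + np.maximum(v - box.high, 0.0)
    return float(np.linalg.norm(delta))

concept = Box(low=np.array([0.0, 0.0]), high=np.array([2.0, 1.0]))
entity_inside, entity_outside = np.array([1.0, 0.5]), np.array([3.0, 2.0])
print(concept.volume())                                 # 2.0
print(vector_to_box_distance(entity_inside, concept))   # 0.0
print(vector_to_box_distance(entity_outside, concept))  # sqrt(2)
```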
MDI+: A Flexible Random Forest-Based Feature Importance Framework
paper_authors: Abhineet Agarwal, Ana M. Kenney, Yan Shuo Tan, Tiffany M. Tang, Bin Yu
for: This work aims to propose a feature importance framework that extends MDI (Mean Decrease in Impurity) to better assess feature importance in random forest (RF) models.
methods: The work builds on the interpretation that the MDI of a feature $X_k$ in each tree equals the unnormalized $R^2$ value of a linear regression of the response on the decision stumps that split on $X_k$. Based on this interpretation, a more flexible feature importance framework called MDI+ is proposed, which lets the analyst choose regularized generalized linear models (GLMs) and metrics suited to the data structure, and mitigates known biases of decision trees.
results: Data-inspired simulations show that MDI+ significantly outperforms popular feature importance measures in identifying signal features. In two real-world case studies (drug response prediction and breast cancer subtype classification), MDI+ extracts well-established predictive genes with significantly greater stability than existing feature importance measures.
Abstract
Mean decrease in impurity (MDI) is a popular feature importance measure for random forests (RFs). We show that the MDI for a feature $X_k$ in each tree in an RF is equivalent to the unnormalized $R^2$ value in a linear regression of the response on the collection of decision stumps that split on $X_k$. We use this interpretation to propose a flexible feature importance framework called MDI+. Specifically, MDI+ generalizes MDI by allowing the analyst to replace the linear regression model and $R^2$ metric with regularized generalized linear models (GLMs) and metrics better suited for the given data structure. Moreover, MDI+ incorporates additional features to mitigate known biases of decision trees against additive or smooth models. We further provide guidance on how practitioners can choose an appropriate GLM and metric based upon the Predictability, Computability, Stability framework for veridical data science. Extensive data-inspired simulations show that MDI+ significantly outperforms popular feature importance measures in identifying signal features. We also apply MDI+ to two real-world case studies on drug response prediction and breast cancer subtype classification. We show that MDI+ extracts well-established predictive genes with significantly greater stability compared to existing feature importance measures. All code and models are released in a full-fledged python package on Github.
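The stump-regression reading of MDI can be sketched by collecting, per tree, the stumps that split on a feature and scoring a linear fit of the response on them. This is an illustration in the spirit of that interpretation; sklearn's normalized MDI will not numerically match the per-tree R² values here:

```python
# Compare sklearn's MDI with a stump-regression R^2 recomputation per feature.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=500, n_features=5, n_informative=2, random_state=0)
rf = RandomForestRegressor(n_estimators=50, max_depth=4, random_state=0).fit(X, y)
print("sklearn MDI:", np.round(rf.feature_importances_, 3))

def stump_r2(est, X, y, k):
    """R^2 of regressing y on one tree's decision stumps that split on X_k."""
    t = est.tree_
    thresholds = t.threshold[t.feature == k]
    if thresholds.size == 0:
        return 0.0
    stumps = (X[:, k:k + 1] <= thresholds[None, :]).astype(float)
    return LinearRegression().fit(stumps, y).score(stumps, y)

scores = [np.mean([stump_r2(est, X, y, k) for est in rf.estimators_])
          for k in range(X.shape[1])]
print("stump-R^2 per feature:", np.round(scores, 3))
```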
Learning ECG signal features without backpropagation
results: On the task of ECG signal classification, the method achieves state-of-the-art performance.
Abstract
Representation learning has become a crucial area of research in machine learning, as it aims to discover efficient ways of representing raw data with useful features to increase the effectiveness, scope and applicability of downstream tasks such as classification and prediction. In this paper, we propose a novel method to generate representations for time series-type data. This method relies on ideas from theoretical physics to construct a compact representation in a data-driven way, and it can capture both the underlying structure of the data and task-specific information while still remaining intuitive, interpretable and verifiable. This novel methodology aims to identify linear laws that can effectively capture a shared characteristic among samples belonging to a specific class. By subsequently utilizing these laws to generate a classifier-agnostic representation in a forward manner, they become applicable in a generalized setting. We demonstrate the effectiveness of our approach on the task of ECG signal classification, achieving state-of-the-art performance.
Natural Language Generation and Understanding of Big Code for AI-Assisted Programming: A Review
results: The paper surveys applications of NLP techniques in programming tasks, including GitHub Copilot and DeepMind AlphaCode, and discusses the challenges and opportunities of extending AI-assisted programming capabilities to Apple's Xcode for mobile software development.
Abstract
This paper provides a comprehensive review of the literature concerning the utilization of Natural Language Processing (NLP) techniques, with a particular focus on transformer-based large language models (LLMs) trained using Big Code, within the domain of AI-assisted programming tasks. LLMs, augmented with software naturalness, have played a crucial role in facilitating AI-assisted programming applications, including code generation, code completion, code translation, code refinement, code summarization, defect detection, and clone detection. Notable examples of such applications include the GitHub Copilot powered by OpenAI's Codex and DeepMind AlphaCode. This paper presents an overview of the major LLMs and their applications in downstream tasks related to AI-assisted programming. Furthermore, it explores the challenges and opportunities associated with incorporating NLP techniques with software naturalness in these applications, with a discussion on extending AI-assisted programming capabilities to Apple's Xcode for mobile software development. This paper also presents the challenges of and opportunities for incorporating NLP techniques with software naturalness, empowering developers with advanced coding assistance and streamlining the software development process.
Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners
results: Experiments across a variety of simulated and real robot setups show that KnowNo performs favorably, improving efficiency and autonomy while providing formal guarantees. KnowNo can be used with LLMs out of the box without model fine-tuning, and suggests a lightweight approach to modeling uncertainty that can complement foundation models.
Abstract
Large language models (LLMs) exhibit a wide range of promising capabilities -- from step-by-step planning to commonsense reasoning -- that may provide utility for robots, but remain prone to confidently hallucinated predictions. In this work, we present KnowNo, which is a framework for measuring and aligning the uncertainty of LLM-based planners such that they know when they don't know and ask for help when needed. KnowNo builds on the theory of conformal prediction to provide statistical guarantees on task completion while minimizing human help in complex multi-step planning settings. Experiments across a variety of simulated and real robot setups that involve tasks with different modes of ambiguity (e.g., from spatial to numeric uncertainties, from human preferences to Winograd schemas) show that KnowNo performs favorably over modern baselines (which may involve ensembles or extensive prompt tuning) in terms of improving efficiency and autonomy, while providing formal assurances. KnowNo can be used with LLMs out of the box without model-finetuning, and suggests a promising lightweight approach to modeling uncertainty that can complement and scale with the growing capabilities of foundation models. Website: https://robot-help.github.io
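The core calibration step can be sketched with split conformal prediction over candidate options: a threshold is calibrated on held-out examples, and help is requested whenever the resulting prediction set is not a singleton. The scores below are toy placeholders for LLM-derived option confidences, not KnowNo's exact scoring:

```python
# Split conformal calibration over candidate plan options.
import numpy as np

def conformal_threshold(true_option_probs: np.ndarray, alpha: float = 0.1) -> float:
    """Nonconformity = 1 - prob of the true option on calibration examples."""
    scores = 1.0 - true_option_probs
    n = len(scores)
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(scores, q_level, method="higher"))

def prediction_set(option_probs: np.ndarray, qhat: float) -> np.ndarray:
    return np.where(1.0 - option_probs <= qhat)[0]

rng = np.random.default_rng(0)
cal_probs = rng.beta(8, 2, size=200)  # toy confidences of the correct options
qhat = conformal_threshold(cal_probs, alpha=0.1)

test_probs = np.array([0.70, 0.65, 0.10, 0.05])  # scores over candidate plans
options = prediction_set(test_probs, qhat)
if len(options) == 1:
    print("confident - execute option", options[0])
else:
    print("not certain - ask the human; candidate set:", options)
```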
Stranding Risk for Underactuated Vessels in Complex Ocean Currents: Analysis and Controllers
paper_authors: Andreas Doering, Marius Wiggert, Hanna Krasowski, Manan Doshi, Pierre F. J. Lermusiaux, Claire J. Tomlin
For: The paper aims to improve the safety of low-propulsion vessels navigating in high-risk regions by developing a feedback control policy that takes into account the uncertainty of ocean currents.
Methods: The authors use a combination of analytical and numerical methods, including the Hamilton-Jacobi Multi-Time Reachability (HJ-MTR) framework, to synthesize a feedback policy that guarantees safe operation in the presence of forecast errors.
Results: The authors demonstrate the effectiveness of their approach through large-scale simulations, showing that their method significantly improves safety over baseline methods while still achieving timely arrival at the destination.
Abstract
Low-propulsion vessels can take advantage of powerful ocean currents to navigate towards a destination. Recent results demonstrated that vessels can reach their destination with high probability despite forecast errors. However, these results do not consider the critical aspect of safety of such vessels: because of their low propulsion which is much smaller than the magnitude of currents, they might end up in currents that inevitably push them into unsafe areas such as shallow areas, garbage patches, and shipping lanes. In this work, we first investigate the risk of stranding for free-floating vessels in the Northeast Pacific. We find that at least 5.04% would strand within 90 days. Next, we encode the unsafe sets as hard constraints into Hamilton-Jacobi Multi-Time Reachability (HJ-MTR) to synthesize a feedback policy that is equivalent to re-planning at each time step at low computational cost. While applying this policy closed-loop guarantees safe operation when the currents are known, in realistic situations only imperfect forecasts are available. We demonstrate the safety of our approach in such realistic situations empirically with large-scale simulations of a vessel navigating in high-risk regions in the Northeast Pacific. We find that applying our policy closed-loop with daily re-planning on new forecasts can ensure safety with high probability even under forecast errors that exceed the maximal propulsion. Our method significantly improves safety over the baselines and still achieves a timely arrival of the vessel at the destination.
Maximizing Seaweed Growth on Autonomous Farms: A Dynamic Programming Approach for Underactuated Systems Navigating on Uncertain Ocean Currents
results: Experiments show that the proposed controller achieves 95.8% of the best possible growth using only 5-day forecasts, indicating that low-power propulsion and optimal control can enhance seaweed growth on floating farms under real-world conditions.
Abstract
Seaweed biomass offers significant potential for climate mitigation, but large-scale, autonomous open-ocean farms are required to fully exploit it. Such farms typically have low propulsion and are heavily influenced by ocean currents. We want to design a controller that maximizes seaweed growth over months by taking advantage of the non-linear time-varying ocean currents for reaching high-growth regions. The complex dynamics and underactuation make this challenging even when the currents are known. This is even harder when only short-term imperfect forecasts with increasing uncertainty are available. We propose a dynamic programming-based method to efficiently solve for the optimal growth value function when true currents are known. We additionally present three extensions when as in reality only forecasts are known: (1) our methods resulting value function can be used as feedback policy to obtain the growth-optimal control for all states and times, allowing closed-loop control equivalent to re-planning at every time step hence mitigating forecast errors, (2) a feedback policy for long-term optimal growth beyond forecast horizons using seasonal average current data as terminal reward, and (3) a discounted finite-time Dynamic Programming (DP) formulation to account for increasing ocean current estimate uncertainty. We evaluate our approach through 30-day simulations of floating seaweed farms in realistic Pacific Ocean current scenarios. Our method demonstrates an achievement of 95.8% of the best possible growth using only 5-day forecasts. This confirms the feasibility of using low-power propulsion and optimal control for enhanced seaweed growth on floating farms under real-world conditions.
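The value-function computation can be sketched as backward dynamic programming on a discretized state grid, maximizing accumulated growth over a small propulsion set. The 1-D current and growth fields below are toy stand-ins for the paper's ocean-current and seaweed-growth models:

```python
# Backward DP for a growth value function on a 1-D state grid.
import numpy as np

nx, nt, dt = 101, 200, 0.05
xs = np.linspace(0.0, 10.0, nx)
controls = np.array([-0.1, 0.0, 0.1])   # weak propulsion options
current = 0.3 * np.sin(xs)              # known (here: static) current field
growth = np.exp(-(xs - 7.0) ** 2)       # high-growth region near x = 7

V = np.zeros(nx)                        # terminal value
for _ in range(nt):                     # sweep backward in time
    V_new = np.empty(nx)
    for i, x in enumerate(xs):
        best = -np.inf
        for u in controls:              # maximize over propulsion
            x_next = np.clip(x + (current[i] + u) * dt, xs[0], xs[-1])
            best = max(best, growth[i] * dt + np.interp(x_next, xs, V))
        V_new[i] = best
    V = V_new

print("value at x=2:", np.interp(2.0, xs, V))
```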
ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling
results: The study shows that the library supports core problems in climate science such as weather forecasting and climate downscaling. It also provides extensive documentation, contribution guides, and quickstart tutorials to expand access and promote community growth.
Abstract
Modeling weather and climate is an essential endeavor to understand the near- and long-term impacts of climate change, as well as inform technology and policymaking for adaptation and mitigation efforts. In recent years, there has been a surging interest in applying data-driven methods based on machine learning for solving core problems such as weather forecasting and climate downscaling. Despite promising results, much of this progress has been impaired due to the lack of large-scale, open-source efforts for reproducibility, resulting in the use of inconsistent or underspecified datasets, training setups, and evaluations by both domain scientists and artificial intelligence researchers. We introduce ClimateLearn, an open-source PyTorch library that vastly simplifies the training and evaluation of machine learning models for data-driven climate science. ClimateLearn consists of holistic pipelines for dataset processing (e.g., ERA5, CMIP6, PRISM), implementation of state-of-the-art deep learning models (e.g., Transformers, ResNets), and quantitative and qualitative evaluation for standard weather and climate modeling tasks. We supplement these functionalities with extensive documentation, contribution guides, and quickstart tutorials to expand access and promote community growth. We have also performed comprehensive forecasting and downscaling experiments to showcase the capabilities and key features of our library. To our knowledge, ClimateLearn is the first large-scale, open-source effort for bridging research in weather and climate modeling with modern machine learning systems. Our library is available publicly at https://github.com/aditya-grover/climate-learn.
Math Agents: Computational Infrastructure, Mathematical Embedding, and Genomics
results: The project uses Math Agents and mathematical embeddings to address the aging problem in information systems biology, and uses generative AI to analyze longitudinal health records, applying SIR Precision Health models to analyze causal relations.
Abstract
The advancement in generative AI could be boosted with more accessible mathematics. Beyond human-AI chat, large language models (LLMs) are emerging in programming, algorithm discovery, and theorem proving, yet their genomics application is limited. This project introduces Math Agents and mathematical embedding as fresh entries to the "Moore's Law of Mathematics", using a GPT-based workflow to convert equations from literature into LaTeX and Python formats. While many digital equation representations exist, there's a lack of automated large-scale evaluation tools. LLMs are pivotal as linguistic user interfaces, providing natural language access for human-AI chat and formal languages for large-scale AI-assisted computational infrastructure. Given the infinite formal possibility spaces, Math Agents, which interact with math, could potentially shift us from "big data" to "big math". Math, unlike the more flexible natural language, has properties subject to proof, enabling its use beyond traditional applications like high-validation math-certified icons for AI alignment aims. This project aims to use Math Agents and mathematical embeddings to address the ageing issue in information systems biology by applying multiscalar physics mathematics to disease models and genomic data. Generative AI with episodic memory could help analyse causal relations in longitudinal health records, using SIR Precision Health models. Genomic data is suggested for addressing the unsolved Alzheimer's disease problem.
Concept-Based Explanations to Test for False Causal Relationships Learned by Abusive Language Classifiers
paper_authors: Isar Nejadgholi, Svetlana Kiritchenko, Kathleen C. Fraser, Esma Balkır
for: The paper is written to address the issue of classifiers learning false causal relationships between concepts and labels, specifically the concept of negative emotions in abusive language.
methods: The paper uses three well-known abusive language classifiers trained on large English datasets, and assesses their accuracy on a challenge set across all decision thresholds. Additionally, the paper introduces concept-based explanation metrics to assess the influence of the concept on the labels.
results: The paper finds that the classifiers have learned unwanted dependencies on the concept of negative emotions, and that these dependencies can compromise classification accuracy. The paper also introduces concept-based explanation metrics to compare classifiers regarding the degree of false global sufficiency they have learned between a concept and a label.
Abstract
Classifiers tend to learn a false causal relationship between an over-represented concept and a label, which can result in over-reliance on the concept and compromised classification accuracy. It is imperative to have methods in place that can compare different models and identify over-reliances on specific concepts. We consider three well-known abusive language classifiers trained on large English datasets and focus on the concept of negative emotions, which is an important signal but should not be learned as a sufficient feature for the label of abuse. Motivated by the definition of global sufficiency, we first examine the unwanted dependencies learned by the classifiers by assessing their accuracy on a challenge set across all decision thresholds. Further, recognizing that a challenge set might not always be available, we introduce concept-based explanation metrics to assess the influence of the concept on the labels. These explanations allow us to compare classifiers regarding the degree of false global sufficiency they have learned between a concept and a label.
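The threshold-sweep evaluation can be sketched by scoring a classifier on a challenge set, where the concept is present but the gold label is non-abusive, across all decision thresholds. The probabilities below are toy placeholders, not outputs of the abusive-language models studied in the paper:

```python
# Accuracy across decision thresholds on a concept-present challenge set.
import numpy as np

def accuracy_across_thresholds(probs: np.ndarray, labels: np.ndarray,
                               thresholds=np.linspace(0.1, 0.9, 9)):
    return {float(t): float(((probs >= t).astype(int) == labels).mean())
            for t in thresholds}

# Challenge set: negative emotion present, gold label "not abusive" (0).
challenge_probs = np.array([0.8, 0.6, 0.4, 0.7, 0.3])  # model's P(abusive)
challenge_labels = np.zeros(5, dtype=int)
for t, acc in accuracy_across_thresholds(challenge_probs, challenge_labels).items():
    # Low accuracy at low thresholds signals over-reliance on the concept.
    print(f"threshold {t:.1f}: accuracy {acc:.2f}")
```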
KDSTM: Neural Semi-supervised Topic Modeling with Knowledge Distillation
results: Across a variety of datasets, the method outperforms existing supervised topic modeling methods in classification accuracy, robustness, and efficiency, and achieves performance comparable to weakly supervised text classification methods.
Abstract
In text classification tasks, fine tuning pretrained language models like BERT and GPT-3 yields competitive accuracy; however, both methods require pretraining on large text datasets. In contrast, general topic modeling methods possess the advantage of analyzing documents to extract meaningful patterns of words without the need of pretraining. To leverage topic modeling's unsupervised insights extraction on text classification tasks, we develop the Knowledge Distillation Semi-supervised Topic Modeling (KDSTM). KDSTM requires no pretrained embeddings, few labeled documents and is efficient to train, making it ideal under resource constrained settings. Across a variety of datasets, our method outperforms existing supervised topic modeling methods in classification accuracy, robustness and efficiency and achieves similar performance compare to state of the art weakly supervised text classification methods.
Exploring Non-Verbal Predicates in Semantic Role Labeling: Challenges and Opportunities
results: The authors find that existing benchmarks do not provide an accurate picture of the current situation in SRL, and that state-of-the-art systems cannot transfer knowledge across different predicate types. They also present a new, manually-annotated challenge set to investigate whether different linguistic resources can promote knowledge transfer.
Abstract
Although we have witnessed impressive progress in Semantic Role Labeling (SRL), most of the research in the area is carried out assuming that the majority of predicates are verbs. Conversely, predicates can also be expressed using other parts of speech, e.g., nouns and adjectives. However, non-verbal predicates appear in the benchmarks we commonly use to measure progress in SRL less frequently than in some real-world settings -- newspaper headlines, dialogues, and tweets, among others. In this paper, we put forward a new PropBank dataset which boasts wide coverage of multiple predicate types. Thanks to it, we demonstrate empirically that standard benchmarks do not provide an accurate picture of the current situation in SRL and that state-of-the-art systems are still incapable of transferring knowledge across different predicate types. Having observed these issues, we also present a novel, manually-annotated challenge set designed to give equal importance to verbal, nominal, and adjectival predicate-argument structures. We use such dataset to investigate whether we can leverage different linguistic resources to promote knowledge transfer. In conclusion, we claim that SRL is far from "solved", and its integration with other semantic tasks might enable significant improvements in the future, especially for the long tail of non-verbal predicates, thereby facilitating further research on SRL for non-verbal predicates.
paper_authors: Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, Richard G. Baraniuk
for: This paper aims to investigate the properties of autophagous loops in generative AI algorithms and their impact on the quality and diversity of future generative models.
methods: The authors use state-of-the-art generative image models of three families to conduct an analytical and empirical analysis of autophagous loops with varying levels of fresh real data availability and bias.
results: The primary conclusion is that without enough fresh real data in each generation of an autophagous loop, future generative models are likely to experience a decline in quality (precision) or diversity (recall), which the authors term “Model Autophagy Disorder” (MAD).
Abstract
Seismic advances in generative AI algorithms for imagery, text, and other data types has led to the temptation to use synthetic data to train next-generation models. Repeating this process creates an autophagous (self-consuming) loop whose properties are poorly understood. We conduct a thorough analytical and empirical analysis using state-of-the-art generative image models of three families of autophagous loops that differ in how fixed or fresh real training data is available through the generations of training and in whether the samples from previous generation models have been biased to trade off data quality versus diversity. Our primary conclusion across all scenarios is that without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease. We term this condition Model Autophagy Disorder (MAD), making analogy to mad cow disease.
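The qualitative effect can be reproduced in a few lines with a fully synthetic autophagous loop: a Gaussian "model" refit each generation on its own quality-biased (truncated) samples steadily loses diversity. This toy simulation illustrates the mechanism only and makes no claim about the paper's image-model experiments:

```python
# Self-consuming loop with a Gaussian model and quality-biased sampling.
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=5000)   # "real data" distribution
mu, sigma = real.mean(), real.std()

for gen in range(1, 6):
    synthetic = rng.normal(mu, sigma, size=5000)   # sample the current model
    # Quality-biased sampling: keep only samples near the mode, trading
    # diversity for quality as in the paper's biased-loop scenario.
    kept = synthetic[np.abs(synthetic - mu) < sigma]
    mu, sigma = kept.mean(), kept.std()            # refit on synthetic data only
    print(f"generation {gen}: std = {sigma:.3f}")  # diversity proxy shrinks
```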
results: Experimental results show that the proposed TAsk Planing Agent (TaPA) framework achieves high success rates on complex tasks in realistic environments, outperforming LLaVA and GPT-3.5.
Abstract
Equipping embodied agents with commonsense is important for robots to successfully complete complex human instructions in general environments. Recent large language models (LLM) can embed rich semantic knowledge for agents in plan generation of complex tasks, while they lack the information about the realistic world and usually yield infeasible action sequences. In this paper, we propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning with physical scene constraint, where the agent generates executable plans according to the existed objects in the scene by aligning LLMs with the visual perception models. Specifically, we first construct a multimodal dataset containing triplets of indoor scenes, instructions and action plans, where we provide the designed prompts and the list of existing objects in the scene for GPT-3.5 to generate a large number of instructions and corresponding planned actions. The generated data is leveraged for grounded plan tuning of pre-trained LLMs. During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected in different achievable locations. Experimental results show that the generated plan from our TaPA framework can achieve higher success rate than LLaVA and GPT-3.5 by a sizable margin, which indicates the practicality of embodied task planning in general and complex environments.
DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation
results: The transformer architecture supports efficient 2D-to-3D fine-tuning, so a DiT-2D checkpoint pretrained on ImageNet can significantly improve DiT-3D on ShapeNet. Experimental results show that the proposed DiT-3D achieves state-of-the-art performance in high-fidelity and diverse 3D point cloud generation, decreasing the 1-Nearest Neighbor Accuracy of the prior state-of-the-art method by 4.59 and increasing the Coverage metric by 3.51 when evaluated on Chamfer Distance.
Abstract
Recent Diffusion Transformers (e.g., DiT) have demonstrated their powerful effectiveness in generating high-quality 2D images. However, it is still being determined whether the Transformer architecture performs equally well in 3D shape generation, as previous 3D diffusion methods mostly adopted the U-Net architecture. To bridge this gap, we propose a novel Diffusion Transformer for 3D shape generation, namely DiT-3D, which can directly operate the denoising process on voxelized point clouds using plain Transformers. Compared to existing U-Net approaches, our DiT-3D is more scalable in model size and produces much higher quality generations. Specifically, the DiT-3D adopts the design philosophy of DiT but modifies it by incorporating 3D positional and patch embeddings to adaptively aggregate input from voxelized point clouds. To reduce the computational cost of self-attention in 3D shape generation, we incorporate 3D window attention into Transformer blocks, as the increased 3D token length resulting from the additional dimension of voxels can lead to high computation. Finally, linear and devoxelization layers are used to predict the denoised point clouds. In addition, our transformer architecture supports efficient fine-tuning from 2D to 3D, where the pre-trained DiT-2D checkpoint on ImageNet can significantly improve DiT-3D on ShapeNet. Experimental results on the ShapeNet dataset demonstrate that the proposed DiT-3D achieves state-of-the-art performance in high-fidelity and diverse 3D point cloud generation. In particular, our DiT-3D decreases the 1-Nearest Neighbor Accuracy of the state-of-the-art method by 4.59 and increases the Coverage metric by 3.51 when evaluated on Chamfer Distance.
Strictly Low Rank Constraint Optimization – An Asymptotically $\mathcal{O}(\frac{1}{t^2})$ Method
results: The method achieves a convergence rate of $O(\frac{1}{t^2})$, and the support set of singular values shrinks monotonically during each update, which is, to the authors' knowledge, novel among momentum-based algorithms.
Abstract
We study a class of non-convex and non-smooth problems with \textit{rank} regularization to promote sparsity in optimal solution. We propose to apply the proximal gradient descent method to solve the problem and accelerate the process with a novel support set projection operation on the singular values of the intermediate update. We show that our algorithms are able to achieve a convergence rate of $O(\frac{1}{t^2})$, which is exactly same as Nesterov's optimal convergence rate for first-order methods on smooth and convex problems. Strict sparsity can be expected and the support set of singular values during each update is monotonically shrinking, which to our best knowledge, is novel in momentum-based algorithms.
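A plain (non-accelerated) version of the iteration can be sketched as a gradient step on the smooth loss followed by a projection of the singular values onto a fixed support. The momentum schedule and the paper's exact support-projection rule are not reproduced here:

```python
# Proximal-gradient step for rank-constrained matrix estimation.
import numpy as np

def prox_rank_step(X, grad, step, r):
    """Gradient step, then project onto matrices of rank at most r via SVD."""
    Y = X - step * grad
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s[r:] = 0.0   # hard-threshold: keep only the top-r singular values
    return (U * s) @ Vt

rng = np.random.default_rng(0)
M = rng.normal(size=(30, 5)) @ rng.normal(size=(5, 30))  # rank-5 target
X = np.zeros((30, 30))
for _ in range(200):
    # Smooth loss 0.5 * ||X - M||_F^2, whose gradient is X - M.
    X = prox_rank_step(X, X - M, step=0.5, r=5)
print("recovery error:", np.linalg.norm(X - M) / np.linalg.norm(M))
```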
Human Trajectory Forecasting with Explainable Behavioral Uncertainty
results: achieves up to a 50% improvement in prediction accuracy over 11 existing methods, generalizes better across environments and crowd densities, and provides predictions with confidence to better explain potential causes of human behavior.Abstract
Human trajectory forecasting helps to understand and predict human behaviors, enabling applications from social robots to self-driving cars, and therefore has been heavily investigated. Most existing methods can be divided into model-free and model-based methods. Model-free methods offer superior prediction accuracy but lack explainability, while model-based methods provide explainability but cannot predict well. Combining both methodologies, we propose a new Bayesian Neural Stochastic Differential Equation model BNSP-SFM, where a behavior SDE model is combined with Bayesian neural networks (BNNs). While the NNs provide superior predictive power, the SDE offers strong explainability with quantifiable uncertainty in behavior and observation. We show that BNSP-SFM achieves up to a 50% improvement in prediction accuracy, compared with 11 state-of-the-art methods. BNSP-SFM also generalizes better to drastically different scenes with different environments and crowd densities (~ 20 times higher than the testing data). Finally, BNSP-SFM can provide predictions with confidence to better explain potential causes of behaviors. The code will be released upon acceptance.
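As a rough picture of how an explainable behavior SDE produces trajectories, the snippet below runs Euler-Maruyama steps of a social-force-style SDE; the drift and noise terms here are generic placeholders, whereas in BNSP-SFM the Bayesian neural networks would supply and calibrate these quantities:

```python
# Toy Euler-Maruyama integration of a social-force-like behavior SDE.
# Parameter values and the drift form are illustrative assumptions.
import numpy as np

def em_step(pos, vel, goal, dt=0.1, tau=0.5, v0=1.4, sigma=0.2, rng=np.random):
    direction = goal - pos
    v_des = v0 * direction / (np.linalg.norm(direction) + 1e-8)  # desired velocity
    drift = (v_des - vel) / tau               # relaxation toward desired velocity
    noise = sigma * np.sqrt(dt) * rng.standard_normal(vel.shape)
    vel = vel + drift * dt + noise            # quantifiable behavioral uncertainty
    return pos + vel * dt, vel

pos, vel = np.zeros(2), np.zeros(2)
for _ in range(50):
    pos, vel = em_step(pos, vel, goal=np.array([5.0, 0.0]))
print(pos)                                    # pedestrian has drifted toward the goal
```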
DeepFlorist: Rethinking Deep Neural Networks and Ensemble Learning as A Meta-Classifier For Object Classification
results: experimental results on benchmark flower datasets show that DeepFlorist outperforms state-of-the-art methods in both accuracy and robustness.Abstract
In this paper, we propose a novel learning paradigm called "DeepFlorist" for flower classification using ensemble learning as a meta-classifier. DeepFlorist combines the power of deep learning with the robustness of ensemble methods to achieve accurate and reliable flower classification results. The proposed network architecture leverages a combination of dense convolutional and convolutional neural networks (DCNNs and CNNs) to extract high-level features from flower images, followed by a fully connected layer for classification. To enhance the performance and generalization of DeepFlorist, an ensemble learning approach is employed, incorporating multiple diverse models to improve the classification accuracy. Experimental results on benchmark flower datasets demonstrate the effectiveness of DeepFlorist, outperforming state-of-the-art methods in terms of accuracy and robustness. The proposed framework holds significant potential for automated flower recognition systems in real-world applications, enabling advancements in plant taxonomy, conservation efforts, and ecological studies.
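The meta-classifier idea reduces to fusing the class-probability outputs of several diverse base networks; a minimal weighted soft-voting sketch (stand-in probabilities, not the trained DCNN/CNN members) is:

```python
# Toy soft-voting meta-classifier over the outputs of diverse base models.
import numpy as np

def meta_classify(prob_list, weights=None):
    """prob_list: list of (N, C) softmax outputs from the base models."""
    probs = np.stack(prob_list)                    # (M, N, C)
    w = np.ones(len(prob_list)) if weights is None else np.asarray(weights)
    fused = np.tensordot(w / w.sum(), probs, 1)    # weighted average -> (N, C)
    return fused.argmax(axis=1)

rng = np.random.default_rng(1)
outputs = [rng.dirichlet(np.ones(102), size=16) for _ in range(3)]  # e.g. 102 classes
print(meta_classify(outputs)[:5])
```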
results: in Catalan, the HT condition scored higher on Narrative Engagement, Enjoyment and Translation Reception; Dutch readers scored PE higher than both HT and MT, and rated the original English version highest. The findings suggest that the reception of a translated story depends not only on the translation condition and quality, but also on participants' reading patterns, reading language, and the status of those languages in their societies.Abstract
This article presents the results of a study involving the reception of a fictional story by Kurt Vonnegut translated from English into Catalan and Dutch in three conditions: machine-translated (MT), post-edited (PE) and translated from scratch (HT). 223 participants were recruited who rated the reading conditions using three scales: Narrative Engagement, Enjoyment and Translation Reception. The results show that HT presented a higher engagement, enjoyment and translation reception in Catalan if compared to PE and MT. However, the Dutch readers show higher scores in PE than in both HT and MT, and the highest engagement and enjoyment scores are reported when reading the original English version. We hypothesize that when reading a fictional story in translation, not only the condition and the quality of the translations are key to understanding its reception, but also the participants' reading patterns, reading language, and, perhaps, language status in their own societies.
MuLMS-AZ: An Argumentative Zoning Dataset for the Materials Science Domain
paper_authors: Timo Pierre Schrader, Teresa Bürkle, Sophie Henning, Sherry Tan, Matteo Finco, Stefan Grünewald, Maira Indrikova, Felix Hildebrand, Annemarie Friedrich
for: improving the processing of materials-science research literature.
methods: annotates research articles with a materials-science-focused multi-label Argumentative Zoning (AZ) scheme.
results: domain-specific pre-trained text encoders yield high classification accuracy, and AZ categories from existing datasets in other domains transfer to the materials-science domain to varying degrees.Abstract
Scientific publications follow conventionalized rhetorical structures. Classifying the Argumentative Zone (AZ), e.g., identifying whether a sentence states a Motivation, a Result or Background information, has been proposed to improve processing of scholarly documents. In this work, we adapt and extend this idea to the domain of materials science research. We present and release a new dataset of 50 manually annotated research articles. The dataset spans seven sub-topics and is annotated with a materials-science focused multi-label annotation scheme for AZ. We detail corpus statistics and demonstrate high inter-annotator agreement. Our computational experiments show that using domain-specific pre-trained transformer-based text encoders is key to high classification performance. We also find that AZ categories from existing datasets in other domains are transferable to varying degrees.
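Since a sentence can belong to several argumentative zones at once, the classification head is naturally multi-label; a minimal sketch (random features standing in for a domain-specific transformer encoder, and an assumed label-set size) is:

```python
# Multi-label AZ classification head: independent sigmoids + BCE loss.
# The encoder and the true AZ label inventory are assumed, not reproduced.
import torch
import torch.nn as nn

NUM_AZ_LABELS = 10                        # assumed size of the AZ label set
head = nn.Linear(768, NUM_AZ_LABELS)      # on top of pooled encoder outputs
criterion = nn.BCEWithLogitsLoss()        # multi-label, not softmax

pooled = torch.randn(4, 768)              # stand-in for encoder [CLS] vectors
gold = torch.randint(0, 2, (4, NUM_AZ_LABELS)).float()
logits = head(pooled)
loss = criterion(logits, gold)
loss.backward()
preds = logits.sigmoid() > 0.5            # a sentence may receive several zones
```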
Utilizing ChatGPT Generated Data to Retrieve Depression Symptoms from Social Media
results: sentence embeddings from a model designed for semantic search outperform embeddings from a model pre-trained on mental-health data; the generated synthetic data proved too specific for this task, and the approach relying on the original BDI-II sentences alone performed best.Abstract
In this work, we present the contribution of the BLUE team in the eRisk Lab task on searching for symptoms of depression. The task consists of retrieving and ranking Reddit social media sentences that convey symptoms of depression from the BDI-II questionnaire. Given that synthetic data provided by LLMs have been proven to be a reliable method for augmenting data and fine-tuning downstream models, we chose to generate synthetic data using ChatGPT for each of the symptoms of the BDI-II questionnaire. We designed a prompt such that the generated data contains more richness and semantic diversity than the BDI-II responses for each question and, at the same time, contains emotional and anecdotal experiences that are specific to the more intimate way of sharing experiences on Reddit. We perform semantic search and rank the sentences' relevance to the BDI-II symptoms by cosine similarity. We used two state-of-the-art transformer-based models (MentalRoBERTa and a variant of MPNet) for embedding the social media posts and the original and generated responses of the BDI-II. Our results show that using sentence embeddings from a model designed for semantic search outperforms the approach using embeddings from a model pre-trained on mental health data. Furthermore, the generated synthetic data proved too specific for this task, and the approach simply relying on the BDI-II responses had the best performance.
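The retrieval step itself is plain cosine-similarity ranking in embedding space; a sketch with random vectors standing in for the MentalRoBERTa/MPNet embeddings:

```python
# Rank candidate sentences against a symptom query by cosine similarity.
# Random vectors stand in for real sentence embeddings.
import numpy as np

def rank_by_cosine(query_vec, sent_vecs, top_k=5):
    q = query_vec / np.linalg.norm(query_vec)
    S = sent_vecs / np.linalg.norm(sent_vecs, axis=1, keepdims=True)
    sims = S @ q                               # cosine similarity per sentence
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

rng = np.random.default_rng(0)
symptom = rng.normal(size=384)                 # e.g. a BDI-II "sadness" embedding
posts = rng.normal(size=(1000, 384))           # embedded Reddit sentences
idx, scores = rank_by_cosine(symptom, posts)
print(idx, scores)
```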
Sumformer: Universal Approximation for Efficient Transformers
paper_authors: Silas Alberti, Niclas Dern, Laura Thesing, Gitta Kutyniok
for: studying the universal approximation of sequence-to-sequence functions.
methods: introduces the new Sumformer architecture and analyzes Linformer and Performer.
results: gives the first universal approximation results for Linformer and Performer, and derives a new proof for Transformers showing that a single attention layer suffices for universal approximation.Abstract
Natural language processing (NLP) made an impressive jump with the introduction of Transformers. ChatGPT is one of the most famous examples, changing the perception of the possibilities of AI even outside the research community. However, besides the impressive performance, the quadratic time and space complexity of Transformers with respect to sequence length pose significant limitations for handling long sequences. While efficient Transformer architectures like Linformer and Performer with linear complexity have emerged as promising solutions, their theoretical understanding remains limited. In this paper, we introduce Sumformer, a novel and simple architecture capable of universally approximating equivariant sequence-to-sequence functions. We use Sumformer to give the first universal approximation results for Linformer and Performer. Moreover, we derive a new proof for Transformers, showing that just one attention layer is sufficient for universal approximation.
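To convey the flavor of the construction, the layer below mixes tokens through a single global sum, so each token interacts with an O(n) summary instead of all n^2 pairs; this is our schematic reading of a sum-based mixer, and the paper's exact Sumformer parameterization may differ:

```python
# Schematic sum-based token mixer: linear-cost global interaction.
# The precise Sumformer definition in the paper is not reproduced here.
import torch
import torch.nn as nn

class SumLayer(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.psi = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, x):                              # x: (B, n, d)
        s = self.phi(x).sum(dim=1, keepdim=True)       # (B, 1, d): O(n), not O(n^2)
        s = s.expand(-1, x.size(1), -1)                # broadcast summary to tokens
        return self.psi(torch.cat([x, s], dim=-1))     # token-wise update

x = torch.randn(2, 128, 64)
print(SumLayer()(x).shape)                             # torch.Size([2, 128, 64])
```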
Performance Comparison of Large Language Models on VNHSGE English Dataset: OpenAI ChatGPT, Microsoft Bing Chat, and Google Bard
results: BingChat (92.4%) outperforms Bard (86%) and ChatGPT (79.2%), suggesting that BingChat and Bard can substitute for ChatGPT, which is not yet officially available in Vietnam. All three models also exceed Vietnamese high-school students in English proficiency, underscoring the potential of LLMs in English language education.Abstract
This paper presents a performance comparison of three large language models (LLMs), namely OpenAI ChatGPT, Microsoft Bing Chat (BingChat), and Google Bard, on the VNHSGE English dataset. The performance of BingChat, Bard, and ChatGPT (GPT-3.5) is 92.4\%, 86\%, and 79.2\%, respectively. The results show that BingChat is better than ChatGPT and Bard. Therefore, BingChat and Bard can replace ChatGPT, which is not yet officially available in Vietnam. The results also indicate that BingChat, Bard and ChatGPT outperform Vietnamese students in English language proficiency. The findings of this study contribute to the understanding of the potential of LLMs in English language education. The remarkable performance of ChatGPT, BingChat, and Bard demonstrates their potential as effective tools for teaching and learning English at the high school level.
SpaceNLI: Evaluating the Consistency of Predicting Inferences in Space
for: fill the gap of spatial expression and reasoning in natural language inference (NLI) datasets
methods: semi-automatically created an NLI dataset for spatial reasoning called SpaceNLI, using curated reasoning patterns and expert annotations
results: SOTA NLI systems obtain moderate results on spatial NLI problems but lack consistency per inference pattern, with non-projective spatial inferences (especially those using the “between” preposition) being the most challenging ones.Abstract
While many natural language inference (NLI) datasets target certain semantic phenomena, e.g., negation, tense & aspect, monotonicity, and presupposition, to the best of our knowledge, there is no NLI dataset that involves diverse types of spatial expressions and reasoning. We fill this gap by semi-automatically creating an NLI dataset for spatial reasoning, called SpaceNLI. The data samples are automatically generated from a curated set of reasoning patterns, where the patterns are annotated with inference labels by experts. We test several SOTA NLI systems on SpaceNLI to gauge the complexity of the dataset and the system's capacity for spatial reasoning. Moreover, we introduce a Pattern Accuracy and argue that it is a more reliable and stricter measure than the accuracy for evaluating a system's performance on pattern-based generated data samples. Based on the evaluation results we find that the systems obtain moderate results on the spatial NLI problems but lack consistency per inference pattern. The results also reveal that non-projective spatial inferences (especially due to the "between" preposition) are the most challenging ones.
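The Pattern Accuracy metric can be stated in a few lines: a reasoning pattern only counts as solved if every NLI problem instantiated from it is predicted correctly, which is why it is stricter than plain sample accuracy:

```python
# Pattern Accuracy: a pattern is correct only if all of its samples are.
from collections import defaultdict

def pattern_accuracy(samples):
    """samples: iterable of (pattern_id, gold_label, predicted_label)."""
    ok = defaultdict(lambda: True)
    for pid, gold, pred in samples:
        ok[pid] = ok[pid] and (gold == pred)
    return sum(ok.values()) / len(ok)

data = [(0, "entail", "entail"), (0, "entail", "neutral"),   # pattern 0 fails
        (1, "contra", "contra"), (1, "contra", "contra")]    # pattern 1 holds
print(pattern_accuracy(data))     # 0.5, whereas plain sample accuracy is 0.75
```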
Open-Source Large Language Models Outperform Crowd Workers and Approach ChatGPT in Text-Annotation Tasks
paper_authors: Meysam Alizadeh, Maël Kubli, Zeynab Samei, Shirin Dehghani, Juan Diego Bermeo, Maria Korobeynikova, Fabrizio Gilardi
for: This study aims to evaluate the performance of open-source Large Language Models (LLMs) in text annotation tasks and compare them with proprietary models like ChatGPT and human-based services such as MTurk.
methods: The study uses both zero-shot and few-shot approaches and different temperature parameters across a range of text annotation tasks to assess the performance of open-source LLMs.
results: The findings show that while ChatGPT achieves the best performance in most tasks, open-source LLMs not only outperform MTurk but also demonstrate competitive potential against ChatGPT in specific tasks.Abstract
This study examines the performance of open-source Large Language Models (LLMs) in text annotation tasks and compares it with proprietary models like ChatGPT and human-based services such as MTurk. While prior research demonstrated the high performance of ChatGPT across numerous NLP tasks, open-source LLMs like HuggingChat and FLAN are gaining attention for their cost-effectiveness, transparency, reproducibility, and superior data protection. We assess these models using both zero-shot and few-shot approaches and different temperature parameters across a range of text annotation tasks. Our findings show that while ChatGPT achieves the best performance in most tasks, open-source LLMs not only outperform MTurk but also demonstrate competitive potential against ChatGPT in specific tasks.
Generative Job Recommendations with Large Language Model
For: providing a personalized and comprehensive job-seeking experience by generating job descriptions with a large language model to serve both employers and prospective employees.* Methods: a Supervised Fine-Tuning (SFT) strategy trains the LLM-based generator to craft job descriptions from a job seeker's CV; a CV-JD matching model is then used as the reward model, and Proximal Policy Optimization (PPO)-based Reinforcement Learning (RL) further fine-tunes the generator.* Results: extensive experiments on a large-scale real-world dataset show higher accuracy and better personalization, and the generated content also complements existing job recommendation models, improving recommendation performance.Abstract
The rapid development of online recruitment services has encouraged the utilization of recommender systems to streamline the job seeking process. Predominantly, current job recommendations deploy either collaborative filtering or person-job matching strategies. However, these models tend to operate as "black-box" systems and lack the capacity to offer explainable guidance to job seekers. Moreover, conventional matching-based recommendation methods are limited to retrieving and ranking existing jobs in the database, restricting their potential as comprehensive career AI advisors. To this end, here we present GIRL (GeneratIve job Recommendation based on Large language models), a novel approach inspired by recent advancements in the field of Large Language Models (LLMs). We initially employ a Supervised Fine-Tuning (SFT) strategy to instruct the LLM-based generator in crafting suitable Job Descriptions (JDs) based on the Curriculum Vitae (CV) of a job seeker. Moreover, we propose to train a model which can evaluate the matching degree between CVs and JDs as a reward model, and we use Proximal Policy Optimization (PPO)-based Reinforcement Learning (RL) method to further fine-tine the generator. This aligns the generator with recruiter feedback, tailoring the output to better meet employer preferences. In particular, GIRL serves as a job seeker-centric generative model, providing job suggestions without the need of a candidate set. This capability also enhances the performance of existing job recommendation models by supplementing job seeking features with generated content. With extensive experiments on a large-scale real-world dataset, we demonstrate the substantial effectiveness of our approach. We believe that GIRL introduces a paradigm-shifting approach to job recommendation systems, fostering a more personalized and comprehensive job-seeking experience.
LOAF-M2L: Joint Learning of Wording and Formatting for Singable Melody-to-Lyric Generation
for: bridges the singability gap between generated lyrics and melodies
methods: jointly Learning wOrding And Formatting during Melody-to-Lyric training (LOAF-M2L)
results: achieves 3.75% and 21.44% absolute accuracy gains in the outputs’ number-of-line and syllable-per-line requirements, and demonstrates 63.92% and 74.18% relative improvement of music-lyric compatibility and overall quality in the subjective evaluation.Abstract
Despite previous efforts in melody-to-lyric generation research, there is still a significant compatibility gap between generated lyrics and melodies, negatively impacting the singability of the outputs. This paper bridges the singability gap with a novel approach to generating singable lyrics by jointly Learning wOrding And Formatting during Melody-to-Lyric training (LOAF-M2L). After general-domain pretraining, our proposed model acquires length awareness first from a large text-only lyric corpus. Then, we introduce a new objective informed by musicological research on the relationship between melody and lyrics during melody-to-lyric training, which enables the model to learn the fine-grained format requirements of the melody. Our model achieves 3.75% and 21.44% absolute accuracy gains in the outputs' number-of-line and syllable-per-line requirements compared to naive fine-tuning, without sacrificing text fluency. Furthermore, our model demonstrates a 63.92% and 74.18% relative improvement of music-lyric compatibility and overall quality in the subjective evaluation, compared to the state-of-the-art melody-to-lyric generation model, highlighting the significance of formatting learning.
Leveraging Denoised Abstract Meaning Representation for Grammatical Error Correction
results: experimental results show that AMR-GEC performs comparably to strong baselines trained with large amounts of synthetic data, while reducing training time by 32%.Abstract
Grammatical Error Correction (GEC) is the task of correcting errorful sentences into grammatically correct, semantically consistent, and coherent sentences. Popular GEC models either use large-scale synthetic corpora or use a large number of human-designed rules. The former is costly to train, while the latter requires quite a lot of human expertise. In recent years, AMR, a semantic representation framework, has been widely used by many natural language tasks due to its completeness and flexibility. A non-negligible concern is that AMRs of grammatically incorrect sentences may not be exactly reliable. In this paper, we propose the AMR-GEC, a seq-to-seq model that incorporates denoised AMR as additional knowledge. Specifically, We design a semantic aggregated GEC model and explore denoising methods to get AMRs more reliable. Experiments on the BEA-2019 shared task and the CoNLL-2014 shared task have shown that AMR-GEC performs comparably to a set of strong baselines with a large number of synthetic data. Compared with the T5 model with synthetic data, AMR-GEC can reduce the training time by 32\% while inference time is comparable. To the best of our knowledge, we are the first to incorporate AMR for grammatical error correction.
results: evaluation on three well-known LS datasets (LexMTurk, BenchLS and NNSEval) shows that the model outperforms previous state-of-the-art models such as LSBert and ConLS; on part of the TSAR-2022 multilingual LS shared-task dataset it is competitive with the participating systems for English and even outperforms GPT-3 on several metrics, with performance gains also for Spanish and Portuguese.Abstract
Text is by far the most ubiquitous source of knowledge and information and should be made easily accessible to as many people as possible; however, texts often contain complex words that hinder reading comprehension and accessibility. Therefore, suggesting simpler alternatives for complex words without compromising meaning would help convey the information to a broader audience. This paper proposes mTLS, a multilingual controllable Transformer-based Lexical Simplification (LS) system fined-tuned with the T5 model. The novelty of this work lies in the use of language-specific prefixes, control tokens, and candidates extracted from pre-trained masked language models to learn simpler alternatives for complex words. The evaluation results on three well-known LS datasets -- LexMTurk, BenchLS, and NNSEval -- show that our model outperforms the previous state-of-the-art models like LSBert and ConLS. Moreover, further evaluation of our approach on the part of the recent TSAR-2022 multilingual LS shared-task dataset shows that our model performs competitively when compared with the participating systems for English LS and even outperforms the GPT-3 model on several metrics. Moreover, our model obtains performance gains also for Spanish and Portuguese.
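To illustrate what "language-specific prefixes, control tokens, and candidates" might look like as model input, here is a purely hypothetical input builder for a T5-style LS model; the actual prefix and token inventory of mTLS is not shown in the abstract, and these names are our own:

```python
# Hypothetical construction of an mTLS-style input string; the prefix and
# control-token names are invented for illustration only.
def build_ls_input(lang, sentence, complex_word, candidates):
    ctrl = " ".join(f"<cand:{c}>" for c in candidates)   # from masked LMs
    return (f"simplify {lang}: {sentence} "              # language-specific prefix
            f"<complex> {complex_word} </complex> {ctrl}")

print(build_ls_input("en",
                     "The cat perambulated the garden.",
                     "perambulated",
                     ["walked", "strolled", "wandered"]))
```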
Do predictability factors towards signing avatars hold across cultures?
results: attitudes towards and acceptance of signing avatars vary with hearing status, technology experience and age; notably, MSL users rated the avatars lower than reported in comparable studies.Abstract
Avatar technology can offer accessibility possibilities and improve Deaf and Hard-of-Hearing sign language users' access to communication, education and services, such as the healthcare system. However, sign language users' acceptance of signing avatars, as well as their attitudes towards them, vary and depend on many factors. Furthermore, research on avatar technology is mostly done by researchers who are not Deaf. The study examines the extent to which intrinsic or extrinsic factors contribute to predicting attitudes towards avatars across cultures. Intrinsic factors include the characteristics of the avatar, such as appearance, movements and facial expressions. Extrinsic factors include users' technology experience, their hearing status, age and their sign language fluency. This work attempts to answer questions such as: if lower attitude ratings are related to poor technology experience for ASL users, is that also true for Moroccan Sign Language (MSL) users? For the purposes of the study, we designed a questionnaire to understand MSL users' attitudes towards avatars. Three groups of participants were surveyed: Deaf (57), Hearing (20) and Hard-of-Hearing (3). The results of our study were then compared with those reported in other relevant studies.
Different Games in Dialogue: Combining character and conversational types in strategic choice
for: investigating the interaction of conversational type and character types of interlocutors
methods: combining decision making process for selecting dialogue moves with character type and conversational type, and presenting a mathematical model to illustrate the interactions
results: presenting a quantitative approach to understanding the interactions between conversational type and character types in dialogue moves.Abstract
In this paper, we show that investigating the interaction of conversational type (often known as language game or speech genre) with the character types of the interlocutors is worthwhile. We present a method of calculating the decision-making process for selecting dialogue moves that combines character type and conversational type. We also present a mathematical model that illustrates these factors' interactions in a quantitative way.
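One way to make such a model concrete (our notation, not necessarily the authors') is an expected-utility rule: a speaker of character type $\theta$ playing conversational type (language game) $g$ selects

$$m^{*} = \arg\max_{m \in M(g)} \sum_{s} P_{\theta}(s \mid g)\, U_{\theta, g}(m, s),$$

where $M(g)$ is the set of moves licensed by $g$, $s$ ranges over possible dialogue states, and the utility $U_{\theta, g}$ depends jointly on character type and game, so the same move can be optimal under one game-character pairing and dispreferred under another.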
Leveraging multilingual transfer for unsupervised semantic acoustic word embeddings
results: the semantic AWE approach outperforms all previous semantic AWE methods on an intrinsic word-similarity task; the multilingual transfer approach can also be used for downstream semantic query-by-example search.Abstract
Acoustic word embeddings (AWEs) are fixed-dimensional vector representations of speech segments that encode phonetic content so that different realisations of the same word have similar embeddings. In this paper we explore semantic AWE modelling. These AWEs should not only capture phonetics but also the meaning of a word (similar to textual word embeddings). We consider the scenario where we only have untranscribed speech in a target language. We introduce a number of strategies leveraging a pre-trained multilingual AWE model -- a phonetic AWE model trained on labelled data from multiple languages excluding the target. Our best semantic AWE approach involves clustering word segments using the multilingual AWE model, deriving soft pseudo-word labels from the cluster centroids, and then training a Skipgram-like model on the soft vectors. In an intrinsic word similarity task measuring semantics, this multilingual transfer approach outperforms all previous semantic AWE methods. We also show -- for the first time -- that AWEs can be used for downstream semantic query-by-example search.
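A minimal sketch of the soft pseudo-word labeling step, assuming the multilingual AWE embeddings are given (the clustering method, number of clusters, and temperature below are illustrative choices):

```python
# Cluster AWE segment embeddings, then derive soft pseudo-word labels from
# distances to the centroids; all hyperparameters here are assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
awes = rng.normal(size=(5000, 128))            # embeddings of speech segments

km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(awes)
d = km.transform(awes)                         # (N, 100) distances to centroids
soft = np.exp(-(d - d.min(axis=1, keepdims=True)) / 0.5)   # closer -> heavier
soft /= soft.sum(axis=1, keepdims=True)        # soft pseudo-word labels
# A Skipgram-like model is then trained on these soft vectors.
print(soft.shape, soft[0].argmax())
```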
results: experiments on several benchmark datasets demonstrate improved topic coherence and document representation learning, outperforming existing state-of-the-art methods.Abstract
Existing NTMs with contrastive learning suffer from the sample bias problem owing to the word frequency-based sampling strategy, which may result in false negative samples with similar semantics to the prototypes. In this paper, we aim to explore the efficient sampling strategy and contrastive learning in NTMs to address the aforementioned issue. We propose a new sampling assumption that negative samples should contain words that are semantically irrelevant to the prototype. Based on it, we propose the graph contrastive topic model (GCTM), which conducts graph contrastive learning (GCL) using informative positive and negative samples that are generated by the graph-based sampling strategy leveraging in-depth correlation and irrelevance among documents and words. In GCTM, we first model the input document as the document word bipartite graph (DWBG), and construct positive and negative word co-occurrence graphs (WCGs), encoded by graph neural networks, to express in-depth semantic correlation and irrelevance among words. Based on the DWBG and WCGs, we design the document-word information propagation (DWIP) process to perform the edge perturbation of DWBG, based on multi-hop correlations/irrelevance among documents and words. This yields the desired negative and positive samples, which will be utilized for GCL together with the prototypes to improve learning document topic representations and latent topics. We further show that GCL can be interpreted as the structured variational graph auto-encoder which maximizes the mutual information of latent topic representations of different perspectives on DWBG. Experiments on several benchmark datasets demonstrate the effectiveness of our method for topic coherence and document representation learning compared with existing SOTA methods.
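At the heart of the graph contrastive learning step is an InfoNCE-style objective over explicitly chosen positives and negatives; the sketch below assumes the graph-based sampling has already produced them (random tensors stand in for document/topic representations):

```python
# InfoNCE-style contrastive loss over pre-sampled positives and negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, tau=0.2):
    """anchor, positive: (B, d); negatives: (B, K, d)."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)
    pos = (a * p).sum(-1, keepdim=True) / tau            # (B, 1)
    neg = torch.einsum("bd,bkd->bk", a, n) / tau         # (B, K)
    logits = torch.cat([pos, neg], dim=1)                # positive is class 0
    return F.cross_entropy(logits, torch.zeros(len(a), dtype=torch.long))

loss = contrastive_loss(torch.randn(8, 64), torch.randn(8, 64),
                        torch.randn(8, 16, 64))
print(loss.item())
```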
Flacuna: Unleashing the Problem Solving Power of Vicuna using FLAN Fine-Tuning
paper_authors: Deepanway Ghosal, Yew Ken Chia, Navonil Majumder, Soujanya Poria
for: investigating the impact of the instruction dataset on the performance of large language models (LLMs) that use encoder-decoder or decoder-only architectures
methods: fine-tunes VICUNA, a large language model based on LLAMA, on FLANMINI, a customized instruction dataset collection that includes a subset of the large-scale FLAN dataset together with code-related datasets and conversational datasets derived from ChatGPT/GPT-4
results: fine-tuning VICUNA on the FLAN dataset leads to significant improvements across numerous benchmark datasets in INSTRUCTEVAL; the resulting model, FLACUNA, is publicly available and demonstrates improved problem-solving abilities compared to the latest decoder-based LLMs.Abstract
Recently, the release of INSTRUCTEVAL has provided valuable insights into the performance of large language models (LLMs) that utilize encoder-decoder or decoder-only architecture. Interestingly, despite being introduced four years ago, T5-based LLMs, such as FLAN-T5, continue to outperform the latest decoder-based LLMs, such as LLAMA and VICUNA, on tasks that require general problem-solving skills. This performance discrepancy can be attributed to three key factors: (1) Pre-training data, (2) Backbone architecture, and (3) Instruction dataset. In this technical report, our main focus is on investigating the impact of the third factor by leveraging VICUNA, a large language model based on LLAMA, which has undergone fine-tuning on ChatGPT conversations. To achieve this objective, we fine-tuned VICUNA using a customized instruction dataset collection called FLANMINI. This collection includes a subset of the large-scale instruction dataset known as FLAN, as well as various code-related datasets and conversational datasets derived from ChatGPT/GPT-4. This dataset comprises a large number of tasks that demand problem-solving skills. Our experimental findings strongly indicate that the enhanced problem-solving abilities of our model, FLACUNA, are obtained through fine-tuning VICUNA on the FLAN dataset, leading to significant improvements across numerous benchmark datasets in INSTRUCTEVAL. FLACUNA is publicly available at https://huggingface.co/declare-lab/flacuna-13b-v1.0.
results: extensive experiments demonstrate the training stability and strong performance of CAME across NLP tasks such as BERT and GPT-2 training; notably, for BERT pre-training at the large batch size of 32,768, the proposed optimizer converges faster and reaches higher accuracy than Adam.Abstract
Adaptive gradient methods, such as Adam and LAMB, have demonstrated excellent performance in the training of large language models. Nevertheless, the need for adaptivity requires maintaining second-moment estimates of the per-parameter gradients, which entails a high cost of extra memory overheads. To solve this problem, several memory-efficient optimizers (e.g., Adafactor) have been proposed to obtain a drastic reduction in auxiliary memory usage, but with a performance penalty. In this paper, we first study a confidence-guided strategy to reduce the instability of existing memory efficient optimizers. Based on this strategy, we propose CAME to simultaneously achieve two goals: fast convergence as in traditional adaptive methods, and low memory usage as in memory-efficient methods. Extensive experiments demonstrate the training stability and superior performance of CAME across various NLP tasks such as BERT and GPT-2 training. Notably, for BERT pre-training on the large batch size of 32,768, our proposed optimizer attains faster convergence and higher accuracy compared with the Adam optimizer. The implementation of CAME is publicly available.
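The memory saving in this family of optimizers comes from factoring the second-moment matrix into row and column statistics, as in Adafactor; the sketch below shows only that shared trick, not CAME's confidence-guided correction:

```python
# Adafactor-style factored second moment: O(n + m) memory for an (n, m)
# parameter instead of O(n * m). CAME's confidence terms are not shown.
import numpy as np

def factored_second_moment(R, C, grad, beta=0.999, eps=1e-30):
    """R: (n,) row stats, C: (m,) column stats of the squared gradient."""
    sq = grad**2 + eps
    R = beta * R + (1 - beta) * sq.mean(axis=1)     # O(n) memory
    C = beta * C + (1 - beta) * sq.mean(axis=0)     # O(m) memory
    V = np.outer(R, C) / R.mean()                   # rank-1 reconstruction
    return R, C, grad / np.sqrt(V)                  # preconditioned update

n, m = 256, 512
R, C = np.zeros(n), np.zeros(m)
R, C, update = factored_second_moment(R, C, np.random.randn(n, m))
print(update.shape)                                 # (256, 512)
```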
Using Data Augmentations and VTLN to Reduce Bias in Dutch End-to-End Speech Recognition Systems
results: combining data augmentation and VTLN reduced the average WER and bias across diverse speaker groups by 6.9% and 3.9%, respectively; the VTLN model trained on Dutch also improved performance on Mandarin Chinese child speech, showing generalisability across languages.Abstract
Speech technology has improved greatly for norm speakers, i.e., adult native speakers of a language without speech impediments or strong accents. However, non-norm or diverse speaker groups show a distinct performance gap with norm speakers, which we refer to as bias. In this work, we aim to reduce bias against different age groups and non-native speakers of Dutch. For an end-to-end (E2E) ASR system, we use state-of-the-art speed perturbation and spectral augmentation as data augmentation techniques and explore Vocal Tract Length Normalization (VTLN) to normalise for spectral differences due to differences in anatomy. The combination of data augmentation and VTLN reduced the average WER and bias across various diverse speaker groups by 6.9% and 3.9%, respectively. The VTLN model trained on Dutch was also effective in improving performance of Mandarin Chinese child speech, thus, showing generalisability across languages
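Speed perturbation, one of the augmentations mentioned, amounts to resampling the waveform by a random factor, which changes duration and shifts the spectrum; a self-contained sketch (the usual factors are around 0.9-1.1, though the paper's settings may differ):

```python
# Speed perturbation by resampling; factor > 1 shortens and raises pitch.
import numpy as np

def speed_perturb(wav, factor):
    n_out = int(round(len(wav) / factor))
    t_out = np.arange(n_out) * factor             # source positions of outputs
    return np.interp(t_out, np.arange(len(wav)), wav)

sr = 16000
wav = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s of a 440 Hz tone
fast = speed_perturb(wav, 1.1)                      # ~0.91 s
slow = speed_perturb(wav, 0.9)                      # ~1.11 s
print(len(wav), len(fast), len(slow))
```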
PULSAR at MEDIQA-Sum 2023: Large Language Models Augmented by Synthetic Dialogue Convert Patient Dialogues to Medical Records
results: we find limited evidence for the efficacy of domain-specific pre-training and data augmentation, while scaling up the language model yields the best performance gains; the approach ranked second and third among 13 submissions on task B, and the code is available at https://github.com/yuping-wu/PULSAR.Abstract
This paper describes PULSAR, our system submission at the ImageClef 2023 MediQA-Sum task on summarising patient-doctor dialogues into clinical records. The proposed framework relies on domain-specific pre-training, to produce a specialised language model which is trained on task-specific natural data augmented by synthetic data generated by a black-box LLM. We find limited evidence towards the efficacy of domain-specific pre-training and data augmentation, while scaling up the language model yields the best performance gains. Our approach was ranked second and third among 13 submissions on task B of the challenge. Our code is available at https://github.com/yuping-wu/PULSAR.
Open-Domain Hierarchical Event Schema Induction by Incremental Prompting and Verification
results: the method generates large and complex event schemas, improving over direct linearized-graph generation with an LLM by 7.2% F1 on temporal relations and 31.0% F1 on hierarchical relations; human assessors covered ~10% more events when translating the schemas into coherent stories and rated them 1.3 points higher (on a 5-point scale) for readability than the previous state-of-the-art closed-domain model.Abstract
Event schemas are a form of world knowledge about the typical progression of events. Recent methods for event schema induction use information extraction systems to construct a large number of event graph instances from documents, and then learn to generalize the schema from such instances. In contrast, we propose to treat event schemas as a form of commonsense knowledge that can be derived from large language models (LLMs). This new paradigm greatly simplifies the schema induction process and allows us to handle both hierarchical relations and temporal relations between events in a straightforward way. Since event schemas have complex graph structures, we design an incremental prompting and verification method to break down the construction of a complex event graph into three stages: event skeleton construction, event expansion, and event-event relation verification. Compared to directly using LLMs to generate a linearized graph, our method can generate large and complex schemas with 7.2% F1 improvement in temporal relations and 31.0% F1 improvement in hierarchical relations. In addition, compared to the previous state-of-the-art closed-domain schema induction model, human assessors were able to cover $\sim$10% more events when translating the schemas into coherent stories and rated our schemas 1.3 points higher (on a 5-point scale) in terms of readability.
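The three-stage decomposition can be pictured as a prompting loop; the schematic below uses a hypothetical `llm` callable (not a real API) and omits the verification prompts' exact wording, which the abstract does not specify:

```python
# Schematic of incremental prompting and verification; `llm` is a
# hypothetical text-completion function supplied by the caller.
def induce_schema(scenario, llm):
    # Stage 1: event skeleton -- major events of the scenario.
    skeleton = llm(f"List the major events of a typical '{scenario}' scenario.")
    graph = {"events": list(skeleton), "relations": []}
    # Stage 2: event expansion -- elaborate each event into subevents.
    for event in skeleton:
        graph["events"] += llm(f"List subevents of '{event}' in '{scenario}'.")
    # Stage 3: event-event relation verification -- yes/no edge checks.
    for a in graph["events"]:
        for b in graph["events"]:
            if a != b and llm(f"Does '{a}' typically happen before '{b}'?") == "yes":
                graph["relations"].append((a, "before", b))
    return graph

stub = lambda p: ["plan", "rob"] if "major" in p else \
                 [] if "subevents" in p.lower() else "no"
print(induce_schema("bank robbery", stub))
```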
results: the model achieves a new state of the art on two datasets -- a Romance dataset of 8,000 cognates spanning 5 languages and a Chinese dataset of 800+ cognates spanning 39 varieties -- outperforming the previous model (Meloni et al., 2021) on a suite of metrics.Abstract
Protoform reconstruction is the task of inferring what morphemes or words appeared like in the ancestral languages of a set of daughter languages. Meloni et al. (2021) achieved the state-of-the-art on Latin protoform reconstruction with an RNN-based encoder-decoder with attention model. We update their model with the state-of-the-art seq2seq model: the Transformer. Our model outperforms their model on a suite of different metrics on two different datasets: their Romance data of 8,000 cognates spanning 5 languages and a Chinese dataset (Hou 2004) of 800+ cognates spanning 39 varieties. We also probe our model for potential phylogenetic signal contained in the model. Our code is publicly available at https://github.com/cmu-llab/acl-2023.
ProPILE: Probing Privacy Leakage in Large Language Models
results: ProPILE lets data subjects assess the likelihood that their PII, included in the publicly available Pile dataset, is revealed by the OPT-1.3B model; LLM service providers can also apply it with more powerful prompts tuned for their in-house models to evaluate their own levels of PII leakage.Abstract
The rapid advancement and widespread use of large language models (LLMs) have raised significant concerns regarding the potential leakage of personally identifiable information (PII). These models are often trained on vast quantities of web-collected data, which may inadvertently include sensitive personal data. This paper presents ProPILE, a novel probing tool designed to empower data subjects, or the owners of the PII, with awareness of potential PII leakage in LLM-based services. ProPILE lets data subjects formulate prompts based on their own PII to evaluate the level of privacy intrusion in LLMs. We demonstrate its application on the OPT-1.3B model trained on the publicly available Pile dataset. We show how hypothetical data subjects may assess the likelihood of their PII being included in the Pile dataset being revealed. ProPILE can also be leveraged by LLM service providers to effectively evaluate their own levels of PII leakage with more powerful prompts specifically tuned for their in-house models. This tool represents a pioneering step towards empowering the data subjects for their awareness and control over their own data on the web.
Decoding the Popularity of TV Series: A Network Analysis Perspective
results: certain network metrics of character interactions in episodes show a strong correlation with the IMDB review scores of TV series.Abstract
In this paper, we analyze the character networks extracted from three popular television series and explore the relationship between a TV show episode's character network metrics and its review from IMDB. Character networks are graphs created from the plot of a TV show that represents the interactions of characters in scenes, indicating the presence of a connection between them. We calculate various network metrics for each episode, such as node degree and graph density, and use these metrics to explore the potential relationship between network metrics and TV series reviews from IMDB. Our results show that certain network metrics of character interactions in episodes have a strong correlation with the review score of TV series. Our research aims to provide more quantitative information that can help TV producers understand how to adjust the character dynamics of future episodes to appeal to their audience. By understanding the impact of character interactions on audience engagement and enjoyment, producers can make informed decisions about the development of their shows.
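The two metrics named above are straightforward to compute once scenes are mapped to a co-occurrence graph; a sketch with networkx and invented scene data:

```python
# Build a character co-occurrence graph from scenes and compute the
# per-episode metrics; character names here are invented examples.
import networkx as nx
from itertools import combinations

scenes = [["Walter", "Jesse"], ["Walter", "Skyler"],
          ["Jesse", "Skyler", "Hank"]]              # characters per scene

G = nx.Graph()
for scene in scenes:
    for a, b in combinations(scene, 2):             # co-appearance -> edge
        w = G.get_edge_data(a, b, {"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

print(dict(G.degree()))                             # node degree per character
print(nx.density(G))                                # episode's graph density
# These per-episode metrics are then correlated with IMDB episode ratings.
```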