methods: The model replaces the EEND-EDA module with a Perceiver-based one to improve performance and accuracy.
results: On the widely studied Callhome dataset, the model diarizes speakers more accurately and runs inference on long recordings substantially faster. Compared against other methods, the model (DiaPer) reaches remarkable performance with a very lightweight design and can complete inference in less time.
Abstract
Until recently, the field of speaker diarization was dominated by cascaded systems. Due to their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to-end models have gained great popularity lately. One of the most successful models is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based one and show its advantages over EEND-EDA; namely obtaining better performance on the largely studied Callhome dataset, finding the quantity of speakers in a conversation more accurately, and running inference in almost half the time on long recordings. Furthermore, when exhaustively compared with other methods, our model, DiaPer, reaches remarkable performance with a very lightweight design. Besides, we perform comparisons with other works and a cascaded baseline across more than ten public wide-band datasets. Together with this publication, we release the code of DiaPer as well as models trained on public and free data.
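Attractor-based EEND systems such as the one DiaPer builds on score each frame embedding against a set of per-speaker attractor vectors; below is a minimal sketch of that output layer (illustrative names and toy dimensions, not DiaPer's actual code):

```python
import math

def speaker_activities(frame_embs, attractors):
    """Posterior that speaker s is active at frame t: sigmoid(<e_t, a_s>).

    frame_embs: one embedding vector per frame
    attractors: one vector per detected speaker
    """
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    return [[sigmoid(sum(e * a for e, a in zip(emb, att)))
             for att in attractors]
            for emb in frame_embs]

# Two frames, two speakers: frame 0 aligns with speaker 0, frame 1 with speaker 1.
probs = speaker_activities([[2.0, 0.0], [0.0, 2.0]],
                           [[3.0, -1.0], [-1.0, 3.0]])
```

Because each speaker is scored independently per frame, overlapped speech simply shows up as two activity posteriors above threshold at the same frame.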
Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization
results: Experiments on the MISP dataset show that the proposed method achieves superior performance and obtained third place in MISP Challenge 2022.
Abstract
The scarcity of labeled audio-visual datasets is a constraint for training superior audio-visual speaker diarization systems. To improve the performance of audio-visual speaker diarization, we leverage pre-trained supervised and self-supervised speech models for audio-visual speaker diarization. Specifically, we adopt supervised (ResNet and ECAPA-TDNN) and self-supervised pre-trained models (WavLM and HuBERT) as the speaker and audio embedding extractors in an end-to-end audio-visual speaker diarization (AVSD) system. Then we explore the effectiveness of different frameworks, including Transformer, Conformer, and cross-attention mechanism, in the audio-visual decoder. To mitigate the degradation of performance caused by separate training, we jointly train the audio encoder, speaker encoder, and audio-visual decoder in the AVSD system. Experiments on the MISP dataset demonstrate that the proposed method achieves superior performance and obtained third place in MISP Challenge 2022.
results: The study finds that scaling behavior when training on synthetic images is affected by several factors, including text prompts, the classifier-free guidance scale, and the type of text-to-image model. In CLIP training, synthetic images follow a scaling trend similar to, but slightly less effective than, real images, while they significantly underperform when training supervised image classifiers. These findings may help inform the use of synthetic images in large-scale training.
Abstract
Recent significant advances in text-to-image models unlock the possibility of training vision systems using synthetic images, potentially overcoming the difficulty of collecting curated data at scale. It is unclear, however, how these models behave at scale, as more synthetic data is added to the training set. In this paper we study the scaling laws of synthetic images generated by state of the art text-to-image models, for the training of supervised models: image classifiers with label supervision, and CLIP with language supervision. We identify several factors, including text prompts, classifier-free guidance scale, and types of text-to-image models, that significantly affect scaling behavior. After tuning these factors, we observe that synthetic images demonstrate a scaling trend similar to, but slightly less effective than, real images in CLIP training, while they significantly underperform in scaling when training supervised image classifiers. Our analysis indicates that the main reason for this underperformance is the inability of off-the-shelf text-to-image models to generate certain concepts, a limitation that significantly impairs the training of image classifiers. Our findings also suggest that scaling synthetic data can be particularly effective in scenarios such as: (1) when there is a limited supply of real images for a supervised problem (e.g., fewer than 0.5 million images in ImageNet), (2) when the evaluation dataset diverges significantly from the training data, indicating the out-of-distribution scenario, or (3) when synthetic data is used in conjunction with real images, as demonstrated in the training of CLIP models.
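Scaling-law studies of this kind typically summarize each training configuration by fitting a power law, error ≈ a·n^b, to (dataset size, error) pairs; a hedged sketch of that fit via least squares in log-log space (a generic recipe, not the paper's code):

```python
import math

def fit_power_law(sizes, errors):
    """Fit error ~= a * n**b by linear least squares in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

# Synthetic check: data generated from error = 2 * n**-0.5 recovers a=2, b=-0.5.
ns = [1e3, 1e4, 1e5]
a, b = fit_power_law(ns, [2 * n ** -0.5 for n in ns])
```

Comparing the fitted exponent b across real and synthetic data is one way to quantify the "similar but slightly less effective" scaling the abstract reports.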
results: Gen2Det achieves significant gains across settings, including improved rare-category performance in the long-tailed detection setting on LVIS, improved detection and segmentation performance in the low-data regime on COCO, and robust gains even in the most general detection setting.
Abstract
Recently diffusion models have shown improvement in synthetic image quality as well as better control in generation. We motivate and present Gen2Det, a simple modular pipeline to create synthetic training data for object detection for free by leveraging state-of-the-art grounded image generation methods. Unlike existing works which generate individual object instances, require identifying foreground followed by pasting on other images, we simplify to directly generating scene-centric images. In addition to the synthetic data, Gen2Det also proposes a suite of techniques to best utilize the generated data, including image-level filtering, instance-level filtering, and better training recipe to account for imperfections in the generation. Using Gen2Det, we show healthy improvements on object detection and segmentation tasks under various settings and agnostic to detection methods. In the long-tailed detection setting on LVIS, Gen2Det improves the performance on rare categories by a large margin while also significantly improving the performance on other categories, e.g. we see an improvement of 2.13 Box AP and 1.84 Mask AP over just training on real data on LVIS with Mask R-CNN. In the low-data regime setting on COCO, Gen2Det consistently improves both Box and Mask AP by 2.27 and 1.85 points. In the most general detection setting, Gen2Det still demonstrates robust performance gains, e.g. it improves the Box and Mask AP on COCO by 0.45 and 0.32 points.
paper_authors: Haofei Xu, Anpei Chen, Yuedong Chen, Christos Sakaridis, Yulun Zhang, Marc Pollefeys, Andreas Geiger, Fisher Yu
for: Solving sparse view synthesis under multiple different baseline settings.
methods: A general feed-forward approach that discretizes 3D space into planes parallel to the target image plane, builds a target view frustum volume, and regresses the radiance field with a convolutional network.
results: State-of-the-art performance across multiple baseline settings and diverse scenarios (DTU, RealEstate10K, and LLFF), with promising zero-shot generalization.
Abstract
We present Multi-Baseline Radiance Fields (MuRF), a general feed-forward approach to solving sparse view synthesis under multiple different baseline settings (small and large baselines, and different number of input views). To render a target novel view, we discretize the 3D space into planes parallel to the target image plane, and accordingly construct a target view frustum volume. Such a target volume representation is spatially aligned with the target view, which effectively aggregates relevant information from the input views for high-quality rendering. It also facilitates subsequent radiance field regression with a convolutional network thanks to its axis-aligned nature. The 3D context modeled by the convolutional network enables our method to synthesize sharper scene structures than prior works. Our MuRF achieves state-of-the-art performance across multiple different baseline settings and diverse scenarios ranging from simple objects (DTU) to complex indoor and outdoor scenes (RealEstate10K and LLFF). We also show promising zero-shot generalization abilities on the Mip-NeRF 360 dataset, demonstrating the general applicability of MuRF.
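Building the frustum volume above requires choosing depth hypotheses for the fronto-parallel planes; a common choice in plane-sweep pipelines (assumed here purely for illustration — MuRF's exact sampling may differ) is uniform spacing in inverse depth:

```python
def plane_depths(near, far, num_planes):
    """Depth of each fronto-parallel plane, spaced uniformly in inverse
    depth (disparity), which concentrates resolution near the camera."""
    inv_near, inv_far = 1.0 / near, 1.0 / far
    return [1.0 / (inv_near + (inv_far - inv_near) * i / (num_planes - 1))
            for i in range(num_planes)]

# 5 planes between 1 m and 100 m; endpoints land on the near and far planes.
depths = plane_depths(near=1.0, far=100.0, num_planes=5)
```

Each plane then becomes one slice of the target view frustum volume into which input-view features are warped.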
EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS
paper_authors: Sharath Girish, Kamal Gupta, Abhinav Shrivastava
for: Addressing the memory storage and training speed challenges of 3D Gaussian splatting (3D-GS) in novel-view scene synthesis.
methods: Quantized embeddings to significantly reduce memory storage requirements, and a coarse-to-fine training strategy for faster and more stable optimization of the Gaussian point clouds.
results: Scene representations with fewer Gaussians and quantized representations, yielding faster training and rendering for real-time rendering of high-resolution scenes, with a more than order-of-magnitude reduction in memory while maintaining reconstruction quality.
Abstract
Recently, 3D Gaussian splatting (3D-GS) has gained popularity in novel-view scene synthesis. It addresses the challenges of lengthy training times and slow rendering speeds associated with Neural Radiance Fields (NeRFs). Through rapid, differentiable rasterization of 3D Gaussians, 3D-GS achieves real-time rendering and accelerated training. They, however, demand substantial memory resources for both training and storage, as they require millions of Gaussians in their point cloud representation for each scene. We present a technique utilizing quantized embeddings to significantly reduce memory storage requirements and a coarse-to-fine training strategy for a faster and more stable optimization of the Gaussian point clouds. Our approach results in scene representations with fewer Gaussians and quantized representations, leading to faster training times and rendering speeds for real-time rendering of high resolution scenes. We reduce memory by more than an order of magnitude all while maintaining the reconstruction quality. We validate the effectiveness of our approach on a variety of datasets and scenes preserving the visual quality while consuming 10-20x less memory and faster training/inference speed. Project page and code is available https://efficientgaussian.github.io
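The memory saving comes from replacing per-Gaussian floating-point attributes with small integer indices into a learned codebook; a bare-bones sketch of that quantize/dequantize round trip (toy sizes and names, not the authors' implementation):

```python
def quantize(vectors, codebook):
    """Map each attribute vector to the index of its nearest codebook entry."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda i: dist2(v, codebook[i]))
            for v in vectors]

def dequantize(indices, codebook):
    """Lossy reconstruction used at render time."""
    return [codebook[i] for i in indices]

# Toy example: 4 per-Gaussian color vectors compressed to a 2-entry codebook,
# so each Gaussian stores one small integer instead of three floats.
codebook = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
colors = [(0.1, 0.0, 0.2), (0.9, 1.0, 0.8), (0.0, 0.1, 0.0), (1.0, 0.9, 1.0)]
idx = quantize(colors, codebook)
recon = dequantize(idx, codebook)
```

With millions of Gaussians per scene, shrinking each attribute from several floats to one index is where the order-of-magnitude storage reduction comes from.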
Visual Geometry Grounded Deep Structure From Motion
paper_authors: Jianyuan Wang, Nikita Karaev, Christian Rupprecht, David Novotny
for: The paper is written to address the problem of structure-from-motion (SfM) in computer vision, specifically to develop a new deep pipeline called VGGSfM that can reconstruct the camera poses and 3D structure of a scene from a set of unconstrained 2D images.
methods: The paper uses deep learning techniques to enhance specific elements of the SfM pipeline, such as keypoint matching, and introduces new mechanisms and simplifications to make the pipeline fully differentiable and trainable in an end-to-end manner.
results: The paper achieves state-of-the-art performance on three popular datasets, CO3D, IMC Phototourism, and ETH3D, demonstrating the effectiveness of the proposed VGGSfM pipeline.
Abstract
Structure-from-motion (SfM) is a long-standing problem in the computer vision community, which aims to reconstruct the camera poses and 3D structure of a scene from a set of unconstrained 2D images. Classical frameworks solve this problem in an incremental manner by detecting and matching keypoints, registering images, triangulating 3D points, and conducting bundle adjustment. Recent research efforts have predominantly revolved around harnessing the power of deep learning techniques to enhance specific elements (e.g., keypoint matching), but are still based on the original, non-differentiable pipeline. Instead, we propose a new deep pipeline VGGSfM, where each component is fully differentiable and thus can be trained in an end-to-end manner. To this end, we introduce new mechanisms and simplifications. First, we build on recent advances in deep 2D point tracking to extract reliable pixel-accurate tracks, which eliminates the need for chaining pairwise matches. Furthermore, we recover all cameras simultaneously based on the image and track features instead of gradually registering cameras. Finally, we optimise the cameras and triangulate 3D points via a differentiable bundle adjustment layer. We attain state-of-the-art performance on three popular datasets, CO3D, IMC Phototourism, and ETH3D.
GenDeF: Learning Generative Deformation Field for Video Generation
results: Qualitative and quantitative results on three common video generation benchmarks demonstrate the superiority of the GenDeF method in visual quality and generation flexibility.
Abstract
We offer a new perspective on approaching the task of video generation. Instead of directly synthesizing a sequence of frames, we propose to render a video by warping one static image with a generative deformation field (GenDeF). Such a pipeline enjoys three appealing advantages. First, we can sufficiently reuse a well-trained image generator to synthesize the static image (also called canonical image), alleviating the difficulty in producing a video and thereby resulting in better visual quality. Second, we can easily convert a deformation field to optical flows, making it possible to apply explicit structural regularizations for motion modeling, leading to temporally consistent results. Third, the disentanglement between content and motion allows users to process a synthesized video through processing its corresponding static image without any tuning, facilitating many applications like video editing, keypoint tracking, and video segmentation. Both qualitative and quantitative results on three common video generation benchmarks demonstrate the superiority of our GenDeF method.
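The core operation in GenDeF is backward-warping one canonical image with a per-pixel deformation field; a hedged sketch of that warp, simplified to nearest-neighbor sampling on a tiny grayscale grid (not the authors' implementation, which works on feature maps with differentiable sampling):

```python
def warp(image, flow):
    """Backward warp: out[y][x] = image[y + dy][x + dx], clamped to bounds.

    image: 2D list of scalars (the canonical image)
    flow:  2D list of (dx, dy) integer offsets (the deformation field)
    """
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx = min(max(x + dx, 0), w - 1)
            sy = min(max(y + dy, 0), h - 1)
            out[y][x] = image[sy][sx]
    return out

canonical = [[1, 2], [3, 4]]
shift_left = [[(1, 0), (1, 0)], [(1, 0), (1, 0)]]  # every pixel samples from x+1
frame = warp(canonical, shift_left)
```

Because every frame is produced by re-sampling the same canonical image, edits to that single image automatically propagate to the whole video, which is the disentanglement property the abstract highlights.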
PrimDiffusion: Volumetric Primitives Diffusion for 3D Human Generation
results: Outperforms state-of-the-art methods in 3D human generation, supports real-time rendering of high-quality 3D humans at $512\times512$ resolution once denoising is done, and offers a flexible and efficient framework for training-free conditional generation such as texture transfer and 3D inpainting.
Abstract
We present PrimDiffusion, the first diffusion-based framework for 3D human generation. Devising diffusion models for 3D human generation is difficult due to the intensive computational cost of 3D representations and the articulated topology of 3D humans. To tackle these challenges, our key insight is operating the denoising diffusion process directly on a set of volumetric primitives, which models the human body as a number of small volumes with radiance and kinematic information. This volumetric primitives representation marries the capacity of volumetric representations with the efficiency of primitive-based rendering. Our PrimDiffusion framework has three appealing properties: 1) compact and expressive parameter space for the diffusion model, 2) flexible 3D representation that incorporates human prior, and 3) decoder-free rendering for efficient novel-view and novel-pose synthesis. Extensive experiments validate that PrimDiffusion outperforms state-of-the-art methods in 3D human generation. Notably, compared to GAN-based methods, our PrimDiffusion supports real-time rendering of high-quality 3D humans at a resolution of $512\times512$ once the denoising process is done. We also demonstrate the flexibility of our framework on training-free conditional generation such as texture transfer and 3D inpainting.
MonoGaussianAvatar: Monocular Gaussian Point-based Head Avatar
results: Our method achieves state-of-the-art results, surpassing previous methods.
Abstract
The ability to animate photo-realistic head avatars reconstructed from monocular portrait video sequences represents a crucial step in bridging the gap between the virtual and real worlds. Recent advancements in head avatar techniques, including explicit 3D morphable meshes (3DMM), point clouds, and neural implicit representation have been exploited for this ongoing research. However, 3DMM-based methods are constrained by their fixed topologies, point-based approaches suffer from a heavy training burden due to the extensive quantity of points involved, and the last ones suffer from limitations in deformation flexibility and rendering efficiency. In response to these challenges, we propose MonoGaussianAvatar (Monocular Gaussian Point-based Head Avatar), a novel approach that harnesses 3D Gaussian point representation coupled with a Gaussian deformation field to learn explicit head avatars from monocular portrait videos. We define our head avatars with Gaussian points characterized by adaptable shapes, enabling flexible topology. These points exhibit movement with a Gaussian deformation field in alignment with the target pose and expression of a person, facilitating efficient deformation. Additionally, the Gaussian points have controllable shape, size, color, and opacity combined with Gaussian splatting, allowing for efficient training and rendering. Experiments demonstrate the superior performance of our method, which achieves state-of-the-art results among previous methods.
GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation
results: In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate) and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron also excels on T2I-CompBench, underscoring its strength in compositional generation.
Abstract
In this study, we explore Transformer-based diffusion models for image and video generation. Despite the dominance of Transformer architectures in various fields due to their flexibility and scalability, the visual generative domain primarily utilizes CNN-based U-Net architectures, particularly in diffusion-based models. We introduce GenTron, a family of Generative models employing Transformer-based diffusion, to address this gap. Our initial step was to adapt Diffusion Transformers (DiTs) from class to text conditioning, a process involving thorough empirical exploration of the conditioning mechanism. We then scale GenTron from approximately 900M to over 3B parameters, observing significant improvements in visual quality. Furthermore, we extend GenTron to text-to-video generation, incorporating novel motion-free guidance to enhance video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate), and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron also excels in the T2I-CompBench, underscoring its strengths in compositional generation. We believe this work will provide meaningful insights and serve as a valuable reference for future research.
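GenTron's text conditioning, like most diffusion models, is applied at sampling time through classifier-free guidance; the standard combination is sketched below (the paper's motion-free guidance is a variant of this idea — the code here is the generic formula, not GenTron-specific):

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Extrapolate from the unconditional toward the conditional noise
    prediction: eps = eps_u + s * (eps_c - eps_u)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# scale 0 -> purely unconditional, scale 1 -> purely conditional,
# scale > 1 amplifies the conditioning signal.
guided = classifier_free_guidance([0.0, 1.0], [1.0, 3.0], scale=2.0)
```

The guidance scale trades sample diversity against prompt adherence, which is why it also appears as a key factor in the scaling-law study summarized earlier in this digest.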
SPIDeRS: Structured Polarization for Invisible Depth and Reflectance Sensing
paper_authors: Tomoki Ichikawa, Shohei Nobuhara, Ko Nishino
for: Opening a new avenue for capturing object shape and reflectance with polarized light.
methods: Structured Polarization (SPIDeRS), which modulates the angle of linear polarization (AoLP) of projected light at each pixel to recover depth, surface normals, and reflectance.
results: Experiments show that SPIDeRS successfully reconstructs object shapes of various materials, is robust to diffuse reflection and ambient light, and enables relighting with the recovered normals and reflectance.
Abstract
Can we capture shape and reflectance in stealth? Such capability would be valuable for many application domains in vision, xR, robotics, and HCI. We introduce Structured Polarization, the first depth and reflectance sensing method using patterns of polarized light (SPIDeRS). The key idea is to modulate the angle of linear polarization (AoLP) of projected light at each pixel. The use of polarization makes it invisible and lets us recover not only depth but also directly surface normals and even reflectance. We implement SPIDeRS with a liquid crystal spatial light modulator (SLM) and a polarimetric camera. We derive a novel method for robustly extracting the projected structured polarization pattern from the polarimetric object appearance. We evaluate the effectiveness of SPIDeRS by applying it to a number of real-world objects. The results show that our method successfully reconstructs object shapes of various materials and is robust to diffuse reflection and ambient light. We also demonstrate relighting using recovered surface normals and reflectance. We believe SPIDeRS opens a new avenue of polarization use in visual sensing.
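Recovering the angle of linear polarization from a polarimetric camera is standard Stokes-parameter arithmetic; a minimal sketch from intensities measured behind polarizers at 0°, 45°, and 90° (the generic textbook formula, not SPIDeRS's full decoding pipeline):

```python
import math

def aolp(i0, i45, i90):
    """Angle of linear polarization from three polarizer measurements.

    Stokes parameters: s1 = I0 - I90, s2 = 2*I45 - I0 - I90,
    and AoLP = 0.5 * atan2(s2, s1).
    """
    s1 = i0 - i90
    s2 = 2.0 * i45 - i0 - i90
    return 0.5 * math.atan2(s2, s1)

# Simulate fully linearly polarized light at angle phi:
# measured intensity behind a polarizer at theta is
# I(theta) = 0.5 + 0.5 * cos(2 * (theta - phi)) (Malus-type response).
phi = 0.3
meas = [0.5 + 0.5 * math.cos(2 * (t - phi))
        for t in (0.0, math.pi / 4, math.pi / 2)]
recovered = aolp(*meas)
```

Because the AoLP channel is invisible to intensity-only observers, encoding the projected pattern in it is what makes the structured light "stealthy" in the abstract's sense.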
Free3D: Consistent Novel View Synthesis without 3D Representation
methods: Starting from a pre-trained 2D image generator for generalization and fine-tuning it for NVS. No explicit 3D representation (which is slow and memory-consuming) and no additional 3D network are needed; instead, a new per-pixel ray conditioning normalization (RCN) layer encodes the target camera pose by telling each pixel its viewing direction, and a lightweight multi-view attention layer with multi-view noise sharing improves multi-view consistency.
results: Trained on the Objaverse dataset, the model generalizes excellently to various new categories in several new datasets, including OmniObject3D and GSO.
Abstract
We introduce Free3D, a simple approach designed for open-set novel view synthesis (NVS) from a single image. Similar to Zero-1-to-3, we start from a pre-trained 2D image generator for generalization, and fine-tune it for NVS. Compared to recent and concurrent works, we obtain significant improvements without resorting to an explicit 3D representation, which is slow and memory-consuming, or training an additional 3D network. We do so by better encoding the target camera pose via a new per-pixel ray conditioning normalization (RCN) layer. The latter injects pose information in the underlying 2D image generator by telling each pixel its specific viewing direction. We also improve multi-view consistency via a light-weight multi-view attention layer and multi-view noise sharing. We train Free3D on the Objaverse dataset and demonstrate excellent generalization to various new categories in several new datasets, including OmniObject3D and GSO. We hope our simple and effective approach will serve as a solid baseline and help future research in NVS with more accurate poses. The project page is available at https://chuanxiaz.com/free3d/.
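The RCN layer's job is to tell each pixel its viewing direction; for a pinhole camera those directions are straightforward to compute (an illustrative sketch assuming a centered principal point and unit focal length, not Free3D's actual layer):

```python
def ray_directions(width, height, focal):
    """Unit viewing direction per pixel in camera coordinates."""
    dirs = []
    for y in range(height):
        row = []
        for x in range(width):
            # offset by 0.5 to sample pixel centers
            dx = (x + 0.5 - width / 2) / focal
            dy = (y + 0.5 - height / 2) / focal
            norm = (dx * dx + dy * dy + 1.0) ** 0.5
            row.append((dx / norm, dy / norm, 1.0 / norm))
        dirs.append(row)
    return dirs

rays = ray_directions(width=3, height=3, focal=1.0)
# the central pixel looks straight down the optical axis
```

Conditioning the generator on such a per-pixel direction map (rather than a single global pose vector) is what lets pose information reach every spatial location of the 2D network.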
Digital Life Project: Autonomous 3D Characters with Social Intelligence
paper_authors: Zhongang Cai, Jianping Jiang, Zhongfei Qing, Xinying Guo, Mingyuan Zhang, Zhengyu Lin, Haiyi Mei, Chen Wei, Ruisi Wang, Wanqi Yin, Xiangyu Fan, Han Du, Liang Pan, Peng Gao, Zhitao Yang, Yang Gao, Jiaqi Li, Tianxiang Ren, Yukun Wei, Xiaogang Wang, Chen Change Loy, Lei Yang, Ziwei Liu
for: Creating autonomous 3D characters that can engage in social interactions and express themselves with articulated body motions, simulating life in a digital environment.
methods: Two primary components: SocioMind, a digital brain that models personalities and initiates dialogue topics, and MoMat-MoGen, a text-driven motion synthesis paradigm that ensures motion quality and generates diverse movements.
results: Extensive experiments demonstrate that each module achieves state-of-the-art performance in its respective domain, enabling virtual characters to initiate and sustain dialogues autonomously while evolving their socio-psychological states and performing contextually relevant bodily movements. Additionally, the motion captioning module allows the virtual character to recognize and respond to human players' actions.
Abstract
In this work, we present Digital Life Project, a framework utilizing language as the universal medium to build autonomous 3D characters, who are capable of engaging in social interactions and expressing with articulated body motions, thereby simulating life in a digital environment. Our framework comprises two primary components: 1) SocioMind: a meticulously crafted digital brain that models personalities with systematic few-shot exemplars, incorporates a reflection process based on psychology principles, and emulates autonomy by initiating dialogue topics; 2) MoMat-MoGen: a text-driven motion synthesis paradigm for controlling the character's digital body. It integrates motion matching, a proven industry technique to ensure motion quality, with cutting-edge advancements in motion generation for diversity. Extensive experiments demonstrate that each module achieves state-of-the-art performance in its respective domain. Collectively, they enable virtual characters to initiate and sustain dialogues autonomously, while evolving their socio-psychological states. Concurrently, these characters can perform contextually relevant bodily movements. Additionally, a motion captioning module further allows the virtual character to recognize and appropriately respond to human players' actions. Homepage: https://digital-life-project.com/
HyperDreamer: Hyper-Realistic 3D Content Generation and Editing from a Single Image
results: HyperDreamer offers several key designs and appealing properties: 1) viewable 360-degree mesh modeling with high-resolution textures; 2) renderable, semantic-aware material estimation (albedo, roughness, specular) learned with fine-grained semantic segmentation and data-driven priors; and 3) editable textures that users can select and modify with a few clicks and text-based guidance. Experiments show that HyperDreamer models region-aware materials with high-resolution textures and enables user-friendly editing.
Abstract
3D content creation from a single image is a long-standing yet highly desirable task. Recent advances introduce 2D diffusion priors, yielding reasonable results. However, existing methods are not hyper-realistic enough for post-generation usage, as users cannot view, render and edit the resulting 3D content from a full range. To address these challenges, we introduce HyperDreamer with several key designs and appealing properties: 1) Viewable: 360 degree mesh modeling with high-resolution textures enables the creation of visually compelling 3D models from a full range of observation points. 2) Renderable: Fine-grained semantic segmentation and data-driven priors are incorporated as guidance to learn reasonable albedo, roughness, and specular properties of the materials, enabling semantic-aware arbitrary material estimation. 3) Editable: For a generated model or their own data, users can interactively select any region via a few clicks and efficiently edit the texture with text-based guidance. Extensive experiments demonstrate the effectiveness of HyperDreamer in modeling region-aware materials with high-resolution textures and enabling user-friendly editing. We believe that HyperDreamer holds promise for advancing 3D content creation and finding applications in various domains.
results: The method achieves state-of-the-art results for open-vocabulary segmentation without given class names on Pascal VOC, ADE20K, and CityScapes, and performs competitively with methods where class names are given.
Abstract
Vision-Language Models (VLMs) have emerged as promising tools for open-ended image understanding tasks, including open vocabulary segmentation. Yet, direct application of such VLMs to segmentation is non-trivial, since VLMs are trained with image-text pairs and naturally lack pixel-level granularity. Recent works have made advancements in bridging this gap, often by leveraging the shared image-text space in which the image and a provided text prompt are represented. In this paper, we challenge the capabilities of VLMs further and tackle open-vocabulary segmentation without the need for any textual input. To this end, we propose a novel Self-Guided Semantic Segmentation (Self-Seg) framework. Self-Seg is capable of automatically detecting relevant class names from clustered BLIP embeddings and using these for accurate semantic segmentation. In addition, we propose an LLM-based Open-Vocabulary Evaluator (LOVE) to effectively assess predicted open-vocabulary class names. We achieve state-of-the-art results on Pascal VOC, ADE20K and CityScapes for open-vocabulary segmentation without given class names, as well as competitive performance with methods where class names are given. All code and data will be released.
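Self-Seg's class discovery hinges on clustering image embeddings before naming the clusters; a bare-bones sketch of one k-means iteration (toy 2-D vectors standing in for BLIP embeddings; the paper's actual clustering and naming are more involved):

```python
def kmeans_step(embeddings, centroids):
    """One k-means iteration: assign each embedding to its nearest
    centroid (squared L2), then recompute the centroids."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    labels = [min(range(len(centroids)), key=lambda i: dist2(e, centroids[i]))
              for e in embeddings]
    new_centroids = []
    for i in range(len(centroids)):
        members = [e for e, lab in zip(embeddings, labels) if lab == i]
        if members:
            dim = len(members[0])
            new_centroids.append(tuple(sum(m[d] for m in members) / len(members)
                                       for d in range(dim)))
        else:
            new_centroids.append(centroids[i])  # keep empty clusters in place
    return labels, new_centroids

embs = [(0.1, 0.0), (0.2, 0.1), (0.9, 1.0), (1.1, 0.9)]
labels, centers = kmeans_step(embs, [(0.0, 0.0), (1.0, 1.0)])
```

Each resulting cluster is then mapped to a candidate class name, which is what removes the need for any textual input at segmentation time.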
PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns
for: This paper proposes a novel virtual try-on from unconstrained designs (ucVTON) task to enable photorealistic synthesis of personalized composite clothing. Unlike prior arts constrained by specific input types, the method allows flexible specification of style (text or image) and texture (full garment, cropped sections, or texture patches) conditions.
methods: A two-stage pipeline addresses the entanglement challenge that arises when full garment images are used as conditions. The first stage generates a parsing map reflecting the desired style conditioned on the input; the second stage composites textures onto the parsed areas. To represent complex and non-stationary textures, hierarchical and balanced CLIP features are first extracted and position encoding is applied in VTON.
results: Experiments show that the method produces high-quality personalized syntheses and gives users flexible control over style and texture mixing, bringing virtual try-on to a new level of user experience for online shopping and fashion design.
Abstract
In this paper, we propose a novel virtual try-on from unconstrained designs (ucVTON) task to enable photorealistic synthesis of personalized composite clothing on input human images. Unlike prior arts constrained by specific input types, our method allows flexible specification of style (text or image) and texture (full garment, cropped sections, or texture patches) conditions. To address the entanglement challenge when using full garment images as conditions, we develop a two-stage pipeline with explicit disentanglement of style and texture. In the first stage, we generate a human parsing map reflecting the desired style conditioned on the input. In the second stage, we composite textures onto the parsing map areas based on the texture input. To represent complex and non-stationary textures that have never been achieved in previous fashion editing works, we first propose extracting hierarchical and balanced CLIP features and applying position encoding in VTON. Experiments demonstrate superior synthesis quality and personalization enabled by our method. The flexible control over style and texture mixing brings virtual try-on to a new level of user experience for online shopping and fashion design.
Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models
results: Experiments show that the framework is robust to distractors, controllable by language, capable of understanding complex multi-object relations, and readily applicable to both tabletop and 6-DoF rearrangement tasks.
Abstract
We introduce Dream2Real, a robotics framework which integrates vision-language models (VLMs) trained on 2D data into a 3D object rearrangement pipeline. This is achieved by the robot autonomously constructing a 3D representation of the scene, where objects can be rearranged virtually and an image of the resulting arrangement rendered. These renders are evaluated by a VLM, so that the arrangement which best satisfies the user instruction is selected and recreated in the real world with pick-and-place. This enables language-conditioned rearrangement to be performed zero-shot, without needing to collect a training dataset of example arrangements. Results on a series of real-world tasks show that this framework is robust to distractors, controllable by language, capable of understanding complex multi-object relations, and readily applicable to both tabletop and 6-DoF rearrangement tasks.
Camera Height Doesn’t Change: Unsupervised Monocular Scale-Aware Road-Scene Depth Estimation
methods: The paper introduces StableCamH, a scale-aware monocular depth estimation method that requires no auxiliary sensors or supervision. It exploits prior knowledge of the heights of objects in the scene and aggregates the height cues into a single common invariant measure, the camera height, enabling robust and accurate unsupervised end-to-end training.
results: Experiments show that StableCamH achieves accurate monocular depth estimation, with state-of-the-art accuracy compared to related methods, and generalizes well across scenes.
Abstract
Monocular depth estimators either require explicit scale supervision through auxiliary sensors or suffer from scale ambiguity, which renders them difficult to deploy in downstream applications. A possible source of scale is the sizes of objects found in the scene, but inaccurate localization makes them difficult to exploit. In this paper, we introduce a novel scale-aware monocular depth estimation method called StableCamH that does not require any auxiliary sensor or supervision. The key idea is to exploit prior knowledge of object heights in the scene but aggregate the height cues into a single invariant measure common to all frames in a road video sequence, namely the camera height. By formulating monocular depth estimation as camera height optimization, we achieve robust and accurate unsupervised end-to-end training. To realize StableCamH, we devise a novel learning-based size prior that can directly convert car appearance into its dimensions. Extensive experiments on KITTI and Cityscapes show the effectiveness of StableCamH, its state-of-the-art accuracy compared with related methods, and its generalizability. The training framework of StableCamH can be used for any monocular depth estimation method and will hopefully become a fundamental building block for further work.
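The core idea — aggregating per-object height cues into a single camera height that fixes the metric scale — can be sketched numerically. Everything below is a hypothetical stand-in: the 1.5 m car-height prior, the median aggregation, and the toy per-frame numbers are our assumptions, not the paper's learned size prior:

```python
import statistics

# Assumed typical car height in metres; stands in for the learned size prior.
CAR_HEIGHT_PRIOR_M = 1.5

def frame_scale(car_heights_up_to_scale):
    """Per-frame metric scale estimated from the heights of detected cars."""
    scales = [CAR_HEIGHT_PRIOR_M / h for h in car_heights_up_to_scale]
    return statistics.median(scales)

def camera_height(frames):
    """Aggregate per-frame cues into one invariant camera height (metres).

    Each frame is (up-to-scale camera height, [up-to-scale car heights]).
    """
    estimates = [h_cam * frame_scale(cars) for h_cam, cars in frames]
    return statistics.median(estimates)

# Three frames whose up-to-scale quantities differ by arbitrary factors,
# yet all are consistent with a camera mounted 1.5 m above the road.
frames = [(1.0, [1.0, 1.02]), (2.0, [2.0, 1.96]), (0.5, [0.5])]
h = camera_height(frames)
```

Once the camera height is fixed, every up-to-scale depth map in the sequence can be multiplied by `h / up_to_scale_camera_height` to obtain metric depth; the paper optimizes this consistency end-to-end rather than computing it in closed form.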
Diffusion Reflectance Map: Single-Image Stochastic Inverse Rendering of Illumination and Reflectance
methods: The paper tackles this blind inverse problem with a novel diffusion model, the Diffusion Reflectance Map Network (DRMNet).
results: Using DRMNet, the paper recovers a complete reflectance map and, from it, efficiently inverts the full frequency spectrum of the illumination.
Abstract
Reflectance bounds the frequency spectrum of illumination in the object appearance. In this paper, we introduce the first stochastic inverse rendering method, which recovers the full frequency spectrum of an illumination jointly with the object reflectance from a single image. Our key idea is to solve this blind inverse problem in the reflectance map, an appearance representation invariant to the underlying geometry, by learning to reverse the image formation with a novel diffusion model which we refer to as the Diffusion Reflectance Map Network (DRMNet). Given an observed reflectance map converted and completed from the single input image, DRMNet generates a reflectance map corresponding to a perfect mirror sphere while jointly estimating the reflectance. The forward process can be understood as gradually filtering a natural illumination with lower and lower frequency reflectance and additive Gaussian noise. DRMNet learns to invert this process with two subnetworks, IllNet and RefNet, which work in concert towards this joint estimation. The network is trained on an extensive synthetic dataset and is demonstrated to generalize to real images, showing state-of-the-art accuracy on established datasets.
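The forward process described here — progressively filtering a natural illumination with lower- and lower-frequency reflectance plus additive Gaussian noise — can be illustrated in 1D. The moving-average kernel, step count, and noise scale below are arbitrary stand-ins, not DRMNet's actual operators:

```python
import random

def low_pass(signal, width):
    """Moving average: a crude stand-in for filtering the illumination
    with a lower-frequency reflectance."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - width), min(n, i + width + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def forward_process(illumination, steps=4, noise=0.01, seed=0):
    rng = random.Random(seed)
    x = list(illumination)
    for t in range(1, steps + 1):
        x = low_pass(x, width=t)                    # lower frequency each step
        x = [v + rng.gauss(0.0, noise) for v in x]  # additive Gaussian noise
    return x

# A spiky 1-D "illumination"; the forward process smooths the spike away.
env = [0.0] * 16
env[8] = 1.0
degraded = forward_process(env)
peak = max(degraded)
```

DRMNet learns the reverse of this kind of degradation: starting from the observed reflectance map, it recovers the sharp illumination that a perfect mirror sphere would reflect.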
Correspondences of the Third Kind: Camera Pose Estimation from Object Reflection
results: The paper shows that reflection correspondences resolve the biases of correspondence assumptions that rely on the background, and improve the accuracy of joint camera pose and object shape estimation.
Abstract
Computer vision has long relied on two kinds of correspondences: pixel correspondences in images and 3D correspondences on object surfaces. Is there another kind, and if there is, what can they do for us? In this paper, we introduce correspondences of the third kind we call reflection correspondences and show that they can help estimate camera pose by just looking at objects without relying on the background. Reflection correspondences are point correspondences in the reflected world, i.e., the scene reflected by the object surface. The object geometry and reflectance alters the scene geometrically and radiometrically, respectively, causing incorrect pixel correspondences. Geometry recovered from each image is also hampered by distortions, namely generalized bas-relief ambiguity, leading to erroneous 3D correspondences. We show that reflection correspondences can resolve the ambiguities arising from these distortions. We introduce a neural correspondence estimator and a RANSAC algorithm that fully leverages all three kinds of correspondences for robust and accurate joint camera pose and object shape estimation just from the object appearance. The method expands the horizon of numerous downstream tasks, including camera pose estimation for appearance modeling (e.g., NeRF) and motion estimation of reflective objects (e.g., cars on the road), to name a few, as it relieves the requirement of overlapping background.
RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
results: Qualitative and quantitative experiments across diverse video-editing scenarios show that RAVE generates videos faster and at higher quality than existing methods. It can handle longer videos and supports many kinds of edits, from local attribute modifications to shape transformations.
Abstract
Recent advancements in diffusion-based models have demonstrated significant success in generating images from text. However, video editing models have not yet reached the same level of visual quality and user control. To address this, we introduce RAVE, a zero-shot video editing method that leverages pre-trained text-to-image diffusion models without additional training. RAVE takes an input video and a text prompt to produce high-quality videos while preserving the original motion and semantic structure. It employs a novel noise shuffling strategy, leveraging spatio-temporal interactions between frames, to produce temporally consistent videos faster than existing methods. It is also efficient in terms of memory requirements, allowing it to handle longer videos. RAVE is capable of a wide range of edits, from local attribute modifications to shape transformations. In order to demonstrate the versatility of RAVE, we create a comprehensive video evaluation dataset ranging from object-focused scenes to complex human activities like dancing and typing, and dynamic scenes featuring swimming fish and boats. Our qualitative and quantitative experiments highlight the effectiveness of RAVE in diverse video editing scenarios compared to existing methods. Our code, dataset and videos can be found in https://rave-video.github.io.
Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping
results: Our method achieves state-of-the-art detection and segmentation performance in both the standard and few-shot settings on the MVTec 3D-AD dataset, with faster inference and a smaller memory footprint than previous multimodal AD methods. We also propose a layer-pruning technique that further improves memory and time efficiency at a marginal cost in performance.
Abstract
The paper explores the industrial multimodal Anomaly Detection (AD) task, which exploits point clouds and RGB images to localize anomalies. We introduce a novel light and fast framework that learns to map features from one modality to the other on nominal samples. At test time, anomalies are detected by pinpointing inconsistencies between observed and mapped features. Extensive experiments show that our approach achieves state-of-the-art detection and segmentation performance in both the standard and few-shot settings on the MVTec 3D-AD dataset while achieving faster inference and occupying less memory than previous multimodal AD methods. Moreover, we propose a layer-pruning technique to improve memory and time efficiency with a marginal sacrifice in performance.
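The detection principle — learn a map from one modality's features to the other on nominal samples, then score test samples by the mismatch between observed and mapped features — can be sketched with scalar features and a least-squares map. The real method learns neural mappings between point-cloud and RGB features; this sketch only shows the scoring logic:

```python
def fit_linear(xs, ys):
    """Least-squares y = a*x + b, standing in for the learned crossmodal map."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

# Nominal training pairs: (3D feature, RGB feature) from anomaly-free samples.
nominal = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # here y = 2x + 1
a, b = fit_linear([p[0] for p in nominal], [p[1] for p in nominal])

def anomaly_score(f3d, f2d_observed):
    """Inconsistency between the observed RGB feature and the mapped one."""
    return abs(f2d_observed - (a * f3d + b))

good = anomaly_score(1.5, 4.0)  # consistent pair -> near-zero score
bad = anomaly_score(1.5, 9.0)   # inconsistent pair -> large score
```

On nominal data the mapping holds and scores stay low; an anomaly breaks the learned crossmodal relation, so the residual spikes and can be thresholded per pixel for segmentation.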
Bootstrapping Autonomous Radars with Self-Supervised Learning
results: For the downstream object detection task, the proposed self-supervised framework improves the accuracy of state-of-the-art supervised baselines by 5.8% in mAP.
Abstract
The perception of autonomous vehicles using radars has attracted increased research interest due to its ability to operate in fog and bad weather. However, training radar models is hindered by the cost and difficulty of annotating large-scale radar data. To overcome this bottleneck, we propose a self-supervised learning framework to leverage the large amount of unlabeled radar data to pre-train radar-only embeddings for self-driving perception tasks. The proposed method combines radar-to-radar and radar-to-vision contrastive losses to learn a general representation from unlabeled radar heatmaps paired with their corresponding camera images. When used for downstream object detection, we demonstrate that the proposed self-supervision framework can improve the accuracy of state-of-the-art supervised baselines by 5.8% in mAP.
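Both loss terms are contrastive; a minimal InfoNCE sketch over toy embeddings shows how radar-to-radar and radar-to-vision terms combine. The temperature, the toy embeddings, and the plain sum of the two terms are our assumptions for illustration, not the paper's exact formulation:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE: pull the positive pair together, push negatives away."""
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    m = max(logits)  # stabilized log-sum-exp
    return -(logits[0] - m - math.log(sum(math.exp(x - m) for x in logits)))

radar = [1.0, 0.1]                     # radar embedding of a scene
radar_aug = [0.9, 0.15]                # augmented radar view (radar-to-radar positive)
camera = [0.95, 0.05]                  # paired camera embedding (radar-to-vision positive)
negatives = [[-1.0, 0.2], [0.0, 1.0]]  # embeddings of unrelated scenes

# Combined objective: radar-to-radar term plus radar-to-vision term.
loss = info_nce(radar, radar_aug, negatives) + info_nce(radar, camera, negatives)
# Treating an unrelated embedding as the positive should cost far more.
mismatched = info_nce(radar, negatives[0], [radar_aug, camera])
```

In practice both terms run over batches of heatmap/image encodings; the pre-trained radar encoder is then fine-tuned for detection.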
FRNet: Frustum-Range Networks for Scalable LiDAR Segmentation
results: Experiments show that FRNet achieves highly efficient and accurate LiDAR segmentation, performing strongly on four popular LiDAR segmentation benchmarks.
Abstract
LiDAR segmentation is crucial for autonomous driving systems. The recent range-view approaches are promising for real-time processing. However, they suffer inevitably from corrupted contextual information and rely heavily on post-processing techniques for prediction refinement. In this work, we propose a simple yet powerful FRNet that restores the contextual information of the range image pixels with corresponding frustum LiDAR points. Firstly, a frustum feature encoder module is used to extract per-point features within the frustum region, which preserves scene consistency and is crucial for point-level predictions. Next, a frustum-point fusion module is introduced to update per-point features hierarchically, which enables each point to extract more surrounding information via the frustum features. Finally, a head fusion module is used to fuse features at different levels for final semantic prediction. Extensive experiments on four popular LiDAR segmentation benchmarks under various task setups demonstrate our superiority. FRNet achieves competitive performance while maintaining high efficiency. The code is publicly available.
Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
methods: The paper proposes a diffusion-model-based method that improves T2V performance by decoupling the spatial and temporal factors of videos from two perspectives, the structure level and the content level. At the structure level, the T2V task is decomposed into two steps, spatial reasoning and temporal reasoning, handled by a unified denoiser. At the content level, two subtle cues that express motion and appearance changes, respectively, are extracted from the content of the input video; these cues guide the model's training so that it generates videos with semantic accuracy and motion stability.
results: Experiments show that HiGen clearly outperforms existing T2V methods, generating realistic and diverse videos.
Abstract
Despite diffusion models having shown powerful abilities to generate photorealistic images, generating videos that are realistic and diverse still remains in its infancy. One of the key reasons is that current methods intertwine spatial content and temporal dynamics together, leading to a notably increased complexity of text-to-video generation (T2V). In this work, we propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives, i.e., structure level and content level. At the structure level, we decompose the T2V task into two steps, including spatial reasoning and temporal reasoning, using a unified denoiser. Specifically, we generate spatially coherent priors using text during spatial reasoning and then generate temporally coherent motions from these priors during temporal reasoning. At the content level, we extract two subtle cues from the content of the input video that can express motion and appearance changes, respectively. These two cues then guide the model's training for generating videos, enabling flexible content variations and enhancing temporal stability. Through the decoupled paradigm, HiGen can effectively reduce the complexity of this task and generate realistic videos with semantics accuracy and motion stability. Extensive experiments demonstrate the superior performance of HiGen over the state-of-the-art T2V methods.
Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion
results: Experiments and evaluations show that AMUSE generates high-quality body gesture animations while giving control over the expressed emotion and personal style. Compared with the state of the art, the generated gestures are better synchronized with the speech content and better represent the emotion expressed by the input speech.
Abstract
Existing methods for synthesizing 3D human gestures from speech have shown promising results, but they do not explicitly model the impact of emotions on the generated gestures. Instead, these methods directly output animations from speech without control over the expressed emotion. To address this limitation, we present AMUSE, an emotional speech-driven body animation model based on latent diffusion. Our observation is that content (i.e., gestures related to speech rhythm and word utterances), emotion, and personal style are separable. To account for this, AMUSE maps the driving audio to three disentangled latent vectors: one for content, one for emotion, and one for personal style. A latent diffusion model, trained to generate gesture motion sequences, is then conditioned on these latent vectors. Once trained, AMUSE synthesizes 3D human gestures directly from speech with control over the expressed emotions and style by combining the content from the driving speech with the emotion and style of another speech sequence. Randomly sampling the noise of the diffusion model further generates variations of the gesture with the same emotional expressivity. Qualitative, quantitative, and perceptual evaluations demonstrate that AMUSE outputs realistic gesture sequences. Compared to the state of the art, the generated gestures are better synchronized with the speech content and better represent the emotion expressed by the input speech. Our project website is amuse.is.tue.mpg.de.
FitDiff: Robust monocular 3D facial shape and reflectance estimation using Diffusion Models
results: FitDiff reconstructs high-quality, relightable 3D facial avatars that can be used as-is in common rendering engines, starting only from an unconstrained facial image, and achieves state-of-the-art performance.
Abstract
The remarkable progress in 3D face reconstruction has resulted in high-detail and photorealistic facial representations. Recently, Diffusion Models have revolutionized the capabilities of generative methods by achieving far better performance than GANs. In this work, we present FitDiff, a diffusion-based 3D facial avatar generative model. This model accurately generates relightable facial avatars, utilizing an identity embedding extracted from an "in-the-wild" 2D facial image. Our multi-modal diffusion model concurrently outputs facial reflectance maps (diffuse and specular albedo and normals) and shapes, showcasing great generalization capabilities. It is solely trained on an annotated subset of a public facial dataset, paired with 3D reconstructions. We revisit the typical 3D facial fitting approach by guiding a reverse diffusion process using perceptual and face recognition losses. Being the first LDM conditioned on face recognition embeddings, FitDiff reconstructs relightable human avatars, that can be used as-is in common rendering engines, starting only from an unconstrained facial image, and achieving state-of-the-art performance.
DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
results: DreamVideo outperforms state-of-the-art methods and allows flexible customization of any subject with any motion.
Abstract
Customized generation using diffusion models has made impressive progress in image generation, but remains unsatisfactory in the challenging video generation task, as it requires the controllability of both subjects and motions. To that end, we present DreamVideo, a novel approach to generating personalized videos from a few static images of the desired subject and a few videos of target motion. DreamVideo decouples this task into two stages, subject learning and motion learning, by leveraging a pre-trained video diffusion model. The subject learning aims to accurately capture the fine appearance of the subject from provided images, which is achieved by combining textual inversion and fine-tuning of our carefully designed identity adapter. In motion learning, we architect a motion adapter and fine-tune it on the given videos to effectively model the target motion pattern. Combining these two lightweight and efficient adapters allows for flexible customization of any subject with any motion. Extensive experimental results demonstrate the superior performance of our DreamVideo over the state-of-the-art methods for customized video generation. Our project page is at https://dreamvideo-t2v.github.io.
Approximate Caching for Efficiently Serving Diffusion Models
results: Achieves % GPU compute savings, 19.8% end-to-end latency reduction, and 19% dollar savings on two real production workloads. A breakdown of each point:
for: The paper is aimed at improving the efficiency and scalability of text-to-image generation systems using diffusion models.
methods: The authors propose a technique called approximate-caching, which reuses intermediate noise states created during prior image generations for similar prompts to reduce the number of iterative denoising steps.
results: The proposed approach achieves an average of % GPU compute savings, 19.8% end-to-end latency reduction, and 19% dollar savings on two real production workloads, compared to a baseline system.
Abstract
Text-to-image generation using diffusion models has seen explosive popularity owing to their ability to produce high-quality images adhering to text prompts. However, production-grade diffusion model serving is a resource-intensive task that not only requires expensive high-end GPUs but also incurs considerable latency. In this paper, we introduce a technique called approximate-caching that can reduce the iterative denoising steps for generating an image from a prompt by reusing intermediate noise states created during prior image generations for similar prompts. Based on this idea, we present an end-to-end text-to-image system, Nirvana, that uses approximate-caching with a novel cache-management policy, Least Computationally Beneficial and Frequently Used (LCBFU), to provide % GPU compute savings, 19.8% end-to-end latency reduction and 19% dollar savings, on average, on two real production workloads. We further present an extensive characterization of real production text-to-image prompts from the perspective of caching, popularity and reuse of intermediate states in a large production environment.
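A toy version of such a cache makes the mechanics concrete. The eviction score below (benefit times hit count) only mimics the spirit of LCBFU as described; the paper's exact policy, the prompt-matching step, and the stored noise-state representation are not reproduced here:

```python
class ApproxCache:
    """Toy cache of intermediate diffusion noise states, evicting by a
    benefit-times-frequency score in the spirit of LCBFU."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # key -> {"state": ..., "benefit": int, "hits": int}

    def put(self, key, state, benefit):
        if key not in self.entries and len(self.entries) >= self.capacity:
            # Evict the least computationally beneficial, least used entry.
            victim = min(
                self.entries,
                key=lambda k: self.entries[k]["benefit"] * self.entries[k]["hits"],
            )
            del self.entries[victim]
        self.entries[key] = {"state": state, "benefit": benefit, "hits": 1}

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        entry["hits"] += 1
        return entry["state"]

cache = ApproxCache(capacity=2)
cache.put("sunset beach", "noise@step10", benefit=10)  # saves 10 denoising steps
cache.put("city skyline", "noise@step5", benefit=5)
cache.get("sunset beach")                              # score now 10*2 vs 5*1
cache.put("red sports car", "noise@step8", benefit=8)  # evicts "city skyline"
kept = sorted(cache.entries)
```

In the real system the lookup key would be a prompt embedding matched by similarity, and a hit resumes denoising from the cached intermediate state instead of from pure noise.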
Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views
results: The paper proposes a cascade framework named Cascade-Zero123 that builds higher visual consistency across views and outperforms Zero-1-to-3 in many complex and challenging scenes.
Abstract
Synthesizing multi-view 3D from one single image is a significant and challenging task. For this goal, Zero-1-to-3 methods aim to extend a 2D latent diffusion model to the 3D scope. These approaches generate the target-view image with a single-view source image and the camera pose as condition information. However, the one-to-one manner adopted in Zero-1-to-3 incurs challenges for building geometric and visual consistency across views, especially for complex objects. We propose a cascade generation framework constructed with two Zero-1-to-3 models, named Cascade-Zero123, to tackle this issue, which progressively extracts 3D information from the source image. Specifically, a self-prompting mechanism is designed to generate several nearby views at first. These views are then fed into the second-stage model along with the source image as generation conditions. With self-prompted multiple views as the supplementary information, our Cascade-Zero123 generates more highly consistent novel-view images than Zero-1-to-3. The improvement is significant for various complex and challenging scenes, involving insects, humans, transparent objects, and stacked multiple objects, etc. The project page is at https://cascadezero123.github.io/.
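The two-stage dataflow can be sketched with stub generators; `zero123_stub` is a hypothetical placeholder for a real Zero-1-to-3 diffusion call, so this only shows how the self-prompted views feed the second stage:

```python
def zero123_stub(source, pose, extra_views=()):
    """Hypothetical stand-in for one Zero-1-to-3 call:
    (source image, target camera pose, optional extra condition views) -> view."""
    return {"from": source, "pose": pose, "n_conditions": len(extra_views)}

def cascade_zero123(source, target_pose, nearby_poses):
    # Stage 1 (self-prompting): synthesize several nearby views first.
    nearby_views = [zero123_stub(source, p) for p in nearby_poses]
    # Stage 2: generate the target view conditioned on the source image
    # plus the self-prompted nearby views.
    return zero123_stub(source, target_pose, extra_views=nearby_views)

out = cascade_zero123("input.png", target_pose=90, nearby_poses=[15, 30, 45])
```

The point of the cascade is that the second stage no longer has to infer the target view from one image alone: the nearby views generated in stage 1 constrain its geometry.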
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
for: The paper proposes a new category of diffusion models, Smooth Diffusion, that are simultaneously high-performing and smooth in latent space.
methods: The paper introduces Step-wise Variation Regularization to enforce smoothness of the diffusion model's latent space, and an interpolation standard deviation (ISTD) metric to assess it.
results: Compared with other diffusion models, Smooth Diffusion excels in downstream tasks such as text-to-image (T2I) generation, image interpolation, image inversion, and image editing, while reducing visual fluctuations.
Abstract
Recently, diffusion models have made remarkable progress in text-to-image (T2I) generation, synthesizing images with high fidelity and diverse contents. Despite this advancement, latent space smoothness within diffusion models remains largely unexplored. Smooth latent spaces ensure that a perturbation on an input latent corresponds to a steady change in the output image. This property proves beneficial in downstream tasks, including image interpolation, inversion, and editing. In this work, we expose the non-smoothness of diffusion latent spaces by observing noticeable visual fluctuations resulting from minor latent variations. To tackle this issue, we propose Smooth Diffusion, a new category of diffusion models that can be simultaneously high-performing and smooth. Specifically, we introduce Step-wise Variation Regularization to enforce the proportion between the variations of an arbitrary input latent and that of the output image is a constant at any diffusion training step. In addition, we devise an interpolation standard deviation (ISTD) metric to effectively assess the latent space smoothness of a diffusion model. Extensive quantitative and qualitative experiments demonstrate that Smooth Diffusion stands out as a more desirable solution not only in T2I generation but also across various downstream tasks. Smooth Diffusion is implemented as a plug-and-play Smooth-LoRA to work with various community models. Code is available at https://github.com/SHI-Labs/Smooth-Diffusion.
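One way to read Step-wise Variation Regularization is as a penalty on the finite-difference ratio between output variation and latent variation, pushed toward a constant. The sketch below is our reading with a toy linear "decoder", not the paper's exact loss:

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def variation_ratio(f, z, dz, eps=1e-3):
    """Finite-difference ||f(z + eps*dz) - f(z)|| / ||eps*dz||:
    how far the output moves per unit of latent perturbation."""
    z2 = [a + eps * b for a, b in zip(z, dz)]
    delta = [a - b for a, b in zip(f(z2), f(z))]
    return norm(delta) / (eps * norm(dz))

def smoothness_penalty(f, z, dz, target):
    """Penalize deviation of the variation ratio from a constant target."""
    return (variation_ratio(f, z, dz) - target) ** 2

def decoder(z):  # toy linear "decoder": the ratio is exactly 2 everywhere
    return [2.0 * x for x in z]

matched = smoothness_penalty(decoder, [0.3, -0.1], [1.0, 0.0], target=2.0)
violated = smoothness_penalty(decoder, [0.3, -0.1], [1.0, 0.0], target=1.0)
```

For the linear decoder the ratio is constant, so the penalty vanishes when the target matches; a real diffusion model has a highly non-linear mapping, and the regularizer is applied at every training step to flatten exactly this kind of variation.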
OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization
results: Extensive experiments across various network architectures and datasets show that OT-Attack improves adversarial transferability in image-text matching tasks, outperforming existing state-of-the-art methods.
Abstract
Vision-language pre-training (VLP) models demonstrate impressive abilities in processing both images and text. However, they are vulnerable to multi-modal adversarial examples (AEs). Investigating the generation of high-transferability adversarial examples is crucial for uncovering VLP models' vulnerabilities in practical scenarios. Recent works have indicated that leveraging data augmentation and image-text modal interactions can enhance the transferability of adversarial examples for VLP models significantly. However, they do not consider the optimal alignment problem between data-augmented image-text pairs. This oversight leads to adversarial examples that are overly tailored to the source model, thus limiting improvements in transferability. In our research, we first explore the interplay between image sets produced through data augmentation and their corresponding text sets. We find that augmented image samples can align optimally with certain texts while exhibiting less relevance to others. Motivated by this, we propose an Optimal Transport-based Adversarial Attack, dubbed OT-Attack. The proposed method formulates the features of image and text sets as two distinct distributions and employs optimal transport theory to determine the most efficient mapping between them. This optimal mapping informs our generation of adversarial examples to effectively counteract the overfitting issues. Extensive experiments across various network architectures and datasets in image-text matching tasks reveal that our OT-Attack outperforms existing state-of-the-art methods in terms of adversarial transferability.
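The optimal-transport step can be illustrated with a plain Sinkhorn iteration between two small uniform distributions. The entropic regularization strength and the toy cost matrix are illustrative choices, not the paper's setup:

```python
import math

def sinkhorn(cost, iters=200, eps=0.05):
    """Entropy-regularized OT between two uniform distributions; returns the
    transport plan (rows: augmented image features, cols: text features)."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy cost: augmented image features 0 and 1 are cheap to match with
# texts 0 and 1, respectively; the plan should concentrate on that pairing.
cost = [[0.1, 1.0],
        [1.0, 0.1]]
plan = sinkhorn(cost)
mass = sum(sum(row) for row in plan)
```

The resulting plan tells the attack which augmented image should be aligned with which text when crafting perturbations, instead of assuming a fixed one-to-one pairing.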
PhysHOI: Physics-Based Imitation of Dynamic Human-Object Interaction
results: Using PhysHOI, the authors successfully imitate diverse HOI tasks, including whole-body grasping and basketball skills. They also introduce the BallPlay dataset, which contains eight whole-body basketball skills. Abstract
Humans interact with objects all the time. Enabling a humanoid to learn human-object interaction (HOI) is a key step for future smart animation and intelligent robotics systems. However, recent progress in physics-based HOI requires carefully designed task-specific rewards, making the system unscalable and labor-intensive. This work focuses on dynamic HOI imitation: teaching humanoid dynamic interaction skills through imitating kinematic HOI demonstrations. It is quite challenging because of the complexity of the interaction between body parts and objects and the lack of dynamic HOI data. To handle the above issues, we present PhysHOI, the first physics-based whole-body HOI imitation approach without task-specific reward designs. In addition to the kinematic HOI representations of humans and objects, we introduce the contact graph to model the contact relations between body parts and objects explicitly. A contact graph reward is also designed, which proved to be critical for precise HOI imitation. Based on the key designs, PhysHOI can imitate diverse HOI tasks simply yet effectively without prior knowledge. To make up for the lack of dynamic HOI scenarios in this area, we introduce the BallPlay dataset that contains eight whole-body basketball skills. We validate PhysHOI on diverse HOI tasks, including whole-body grasping and basketball skills.
AniRes2D: Anisotropic Residual-enhanced Diffusion for 2D MR Super-Resolution
paper_authors: Zejun Wu, Samuel W. Remedios, Blake E. Dewey, Aaron Carass, Jerry L. Prince
for: This paper aims to super-resolve anisotropic low-resolution (LR) magnetic resonance (MR) images using denoising diffusion probabilistic models (DDPMs).
methods: The proposed approach, called AniRes2D, combines DDPM with a residual prediction for 2D super-resolution (SR).
results: AniRes2D outperforms several other DDPM-based models in quantitative metrics, visual quality, and out-of-domain evaluation. Additionally, the use of noise conditioning augmentation (NCA) as an alternative augmentation technique for DDPM-based SR models was found to reduce performance. Abstract
Anisotropic low-resolution (LR) magnetic resonance (MR) images are fast to obtain but hinder automated processing. We propose to use denoising diffusion probabilistic models (DDPMs) to super-resolve these 2D-acquired LR MR slices. This paper introduces AniRes2D, a novel approach combining DDPM with a residual prediction for 2D super-resolution (SR). Results demonstrate that AniRes2D outperforms several other DDPM-based models in quantitative metrics, visual quality, and out-of-domain evaluation. We use a trained AniRes2D to super-resolve 3D volumes slice by slice, where comparative quantitative results and reduced skull aliasing are achieved compared to a recent state-of-the-art self-supervised 3D super-resolution method. Furthermore, we explored the use of noise conditioning augmentation (NCA) as an alternative augmentation technique for DDPM-based SR models, but it was found to reduce performance. Our findings contribute valuable insights to the application of DDPMs for SR of anisotropic MR images.
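The residual formulation behind AniRes2D can be sketched as follows: instead of predicting the high-resolution slice directly, the model predicts only the residual between a cheap upsampling of the LR slice and the HR target. A minimal sketch, where `predict_residual` is a hypothetical stand-in for the trained diffusion model:

```python
import numpy as np

def upsample_nn(lr, factor):
    """Nearest-neighbour upsampling along the low-resolution (through-plane) axis."""
    return np.repeat(lr, factor, axis=0)

def residual_super_resolve(lr_slice, predict_residual, factor=4):
    """Residual SR: the model only predicts the difference between a
    simple upsampling and the true HR image, which is typically an
    easier target than the full image."""
    base = upsample_nn(lr_slice, factor)
    return base + predict_residual(base)

# With a zero residual the output reduces to plain upsampling.
lr = np.arange(6, dtype=float).reshape(2, 3)
sr = residual_super_resolve(lr, lambda x: np.zeros_like(x), factor=2)
print(sr.shape)  # (2*2, 3) = (4, 3)
```

Super-resolving a 3D volume slice by slice, as described in the abstract, would amount to applying this per-slice operation along one anatomical axis at a time.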
SingingHead: A Large-scale 4D Dataset for Singing Head Animation
results: Comparative experiments against state-of-the-art 3D facial animation and 2D portrait video synthesis methods show that singing-audio-driven 3D head animation and 2D portrait video synthesis can be effectively addressed with a high-quality singing head dataset and a unified facial animation framework. Abstract
Singing, as a common facial movement second only to talking, can be regarded as a universal language across ethnicities and cultures, and plays an important role in emotional communication, art, and entertainment. However, it is often overlooked in the field of audio-driven facial animation due to the lack of singing head datasets and the domain gap between singing and talking in rhythm and amplitude. To this end, we collect a high-quality large-scale singing head dataset, SingingHead, which consists of more than 27 hours of synchronized singing video, 3D facial motion, singing audio, and background music from 76 individuals and 8 types of music. Along with the SingingHead dataset, we argue that 3D and 2D facial animation tasks can be solved together, and propose a unified singing facial animation framework named UniSinger to achieve both singing audio-driven 3D singing head animation and 2D singing portrait video synthesis. Extensive comparative experiments with both SOTA 3D facial animation and 2D portrait animation methods demonstrate the necessity of singing-specific datasets in singing head animation tasks and the promising performance of our unified facial animation framework.
DemoCaricature: Democratising Caricature Generation with a Rough Sketch
methods: The paper proposes Explicit Rank-1 Model Editing alongside single-image personalisation, selectively applying nuanced edits to merge identity and style features.
results: The goal is not to replace artists but to lower accessibility barriers so that enthusiasts can engage in the artistry. Abstract
In this paper, we democratise caricature generation, empowering individuals to effortlessly craft personalised caricatures with just a photo and a conceptual sketch. Our objective is to strike a delicate balance between abstraction and identity, while preserving the creativity and subjectivity inherent in a sketch. To achieve this, we present Explicit Rank-1 Model Editing alongside single-image personalisation, selectively applying nuanced edits to cross-attention layers for a seamless merge of identity and style. Additionally, we propose Random Mask Reconstruction to enhance robustness, directing the model to focus on distinctive identity and style features. Crucially, our aim is not to replace artists but to eliminate accessibility barriers, allowing enthusiasts to engage in the artistry.
Multi-View Unsupervised Image Generation with Cross Attention Guidance
results: Compared with prior methods, MIRAGE excels at the novel view synthesis task on real images and shows strong robustness, generating consistent views across diverse textures and geometries, including on synthetic images. Abstract
The growing interest in novel view synthesis, driven by Neural Radiance Field (NeRF) models, is hindered by scalability issues due to their reliance on precisely annotated multi-view images. Recent models address this by fine-tuning large text2image diffusion models on synthetic multi-view data. Despite robust zero-shot generalization, they may need post-processing and can face quality issues due to the synthetic-real domain gap. This paper introduces a novel pipeline for unsupervised training of a pose-conditioned diffusion model on single-category datasets. With the help of pretrained self-supervised Vision Transformers (DINOv2), we identify object poses by clustering the dataset through comparing visibility and locations of specific object parts. The pose-conditioned diffusion model, trained on pose labels, and equipped with cross-frame attention at inference time ensures cross-view consistency, that is further aided by our novel hard-attention guidance. Our model, MIRAGE, surpasses prior work in novel view synthesis on real images. Furthermore, MIRAGE is robust to diverse textures and geometries, as demonstrated with our experiments on synthetic images generated with pretrained Stable Diffusion.
Towards a Perceptual Evaluation Framework for Lighting Estimation
results: The study finds that most existing IQA metrics from the literature, taken individually, fail to represent human preference correctly, but a learned combination of existing IQA metrics represents it more accurately. This provides a new perceptual framework to help evaluate future lighting estimation algorithms. Abstract
Progress in lighting estimation is tracked by computing existing image quality assessment (IQA) metrics on images from standard datasets. While this may appear to be a reasonable approach, we demonstrate that doing so does not correlate to human preference when the estimated lighting is used to relight a virtual scene into a real photograph. To study this, we design a controlled psychophysical experiment where human observers must choose their preference amongst rendered scenes lit using a set of lighting estimation algorithms selected from the recent literature, and use it to analyse how these algorithms perform according to human perception. Then, we demonstrate that none of the most popular IQA metrics from the literature, taken individually, correctly represent human perception. Finally, we show that by learning a combination of existing IQA metrics, we can more accurately represent human preference. This provides a new perceptual framework to help evaluate future lighting estimation algorithms.
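The idea of learning a combination of existing IQA metrics to match human preference can be sketched as a simple linear fit. This is an illustrative sketch only (the paper does not specify this exact learning procedure); the metric scores and preference ratings below are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

# Scores of 3 hypothetical IQA metrics on 50 relit renders (stand-ins for
# e.g. PSNR/SSIM-style values), plus simulated human preference ratings.
metric_scores = rng.normal(size=(50, 3))
true_w = np.array([0.6, -0.3, 0.1])
human_pref = metric_scores @ true_w + 0.01 * rng.normal(size=50)

# Learn the linear combination of metrics that best explains preference.
X = np.column_stack([metric_scores, np.ones(50)])  # append a bias column
w, *_ = np.linalg.lstsq(X, human_pref, rcond=None)
pred = X @ w
corr = np.corrcoef(pred, human_pref)[0, 1]
print(corr > 0.95)  # the combined metric tracks preference far better
```

In practice the ratings would come from the psychophysical experiment described above, and the fitted weights would then score new lighting estimation algorithms.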
A Multi-scale Information Integration Framework for Infrared and Visible Image Fusion
results: Fusion results on two datasets show that our method preserves both infrared (thermal) and visible information with high quality, performing on par with other state-of-the-art methods. Qualitative, quantitative, and ablation experiments demonstrate the effectiveness of our information integration architecture and of the adaptive measurement of complementary information retention in the loss function. Abstract
Infrared and visible image fusion aims at generating a fused image containing the intensity and detail information of source images, and the key issue is effectively measuring and integrating the complementary information of multi-modality images from the same scene. Existing methods mostly adopt a simple weight in the loss function to decide the information retention of each modality rather than adaptively measuring complementary information for different image pairs. In this study, we propose a multi-scale dual attention (MDA) framework for infrared and visible image fusion, which is designed to measure and integrate complementary information in both structure and loss function at the image and patch level. In our method, the residual downsample block decomposes source images into three scales first. Then, dual attention fusion block integrates complementary information and generates a spatial and channel attention map at each scale for feature fusion. Finally, the output image is reconstructed by the residual reconstruction block. Loss function consists of image-level, feature-level and patch-level three parts, of which the calculation of the image-level and patch-level two parts are based on the weights generated by the complementary information measurement. In addition, to constrain the pixel intensity distribution between the output and infrared image, a style loss is added. Our fusion results are robust and informative across different scenarios. Qualitative and quantitative results on two datasets illustrate that our method is able to preserve both thermal radiation and detailed information from two modalities and achieve comparable results compared with the other state-of-the-art methods. Ablation experiments show the effectiveness of our information integration architecture and adaptively measure complementary information retention in the loss function.
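The key contrast with fixed-weight losses can be sketched as follows: instead of a hand-picked constant, each modality's weight in the intensity loss is derived from a measurement of its information content, per image pair. The gradient-energy measure used here is only an illustrative proxy; the paper's actual complementary-information measurement differs.

```python
import numpy as np

def info_weight(img):
    """Illustrative complementary-information proxy: mean gradient energy.
    (A hypothetical stand-in for the paper's measurement.)"""
    gy, gx = np.gradient(img.astype(float))
    return np.mean(gx ** 2 + gy ** 2)

def fusion_intensity_loss(fused, ir, vis):
    """Image-level intensity loss whose per-modality weights adapt to
    each input pair, normalized to sum to one."""
    w_ir, w_vis = info_weight(ir), info_weight(vis)
    s = w_ir + w_vis
    w_ir, w_vis = w_ir / s, w_vis / s
    return (w_ir * np.mean(np.abs(fused - ir))
            + w_vis * np.mean(np.abs(fused - vis)))

rng = np.random.default_rng(2)
ir, vis = rng.random((8, 8)), rng.random((8, 8))
# A fused image near the sources incurs a lower loss than one far from both.
print(fusion_intensity_loss(ir, ir, vis) < fusion_intensity_loss(vis + 1.0, ir, vis))
```

A fixed-weight loss would replace `info_weight` with constants, which is exactly the limitation the MDA framework sets out to remove.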
iDesigner: A High-Resolution and Complex-Prompt Following Text-to-Image Diffusion Model for Interior Design
results: Experimental results show that the method generates high-quality images accurately and outperforms strong baselines. Abstract
With the open-sourcing of text-to-image models (T2I) such as stable diffusion (SD) and stable diffusion XL (SD-XL), there is an influx of models fine-tuned in specific domains based on the open-source SD model, such as in anime, character portraits, etc. However, there are few specialized models in certain domains, such as interior design, which is attributed to the complex textual descriptions and detailed visual elements inherent in design, alongside the necessity for adaptable resolution. Therefore, text-to-image models for interior design are required to have outstanding prompt-following capabilities, as well as iterative collaboration with design professionals to achieve the desired outcome. In this paper, we collect and optimize text-image data in the design field and continue training in both English and Chinese on the basis of the open-source CLIP model. We also proposed a fine-tuning strategy with curriculum learning and reinforcement learning from CLIP feedback to enhance the prompt-following capabilities of our approach so as to improve the quality of image generation. The experimental results on the collected dataset demonstrate the effectiveness of the proposed approach, which achieves impressive results and outperforms strong baselines.
GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives
results: GPT4SGG significantly improves the performance of SGG models trained on image-caption data. Abstract
Learning scene graphs from natural language descriptions has proven to be a cheap and promising scheme for Scene Graph Generation (SGG). However, such unstructured caption data and its processing hinder the learning of an accurate and complete scene graph. This dilemma can be summarized in three points. First, traditional language parsers often fail to extract meaningful relationship triplets from caption data. Second, grounding unlocalized objects in parsed triplets will meet ambiguity in visual-language alignment. Last, caption data typically are sparse and exhibit bias to partial observations of image content. These three issues make it hard for the model to generate comprehensive and accurate scene graphs. To fill this gap, we propose a simple yet effective framework, GPT4SGG, to synthesize scene graphs from holistic and region-specific narratives. The framework discards traditional language parsers, and localizes objects before obtaining relationship triplets. To obtain relationship triplets, holistic and dense region-specific narratives are generated from the image. With such textual representation of image data and a task-specific prompt, an LLM, particularly GPT-4, directly synthesizes a scene graph as "pseudo labels". Experimental results showcase GPT4SGG significantly improves the performance of SGG models trained on image-caption data. We believe this pioneering work can motivate further research into mining the visual reasoning capabilities of LLMs.
Cross-codex Learning for Reliable Scribe Identification in Medieval Manuscripts
paper_authors: Julius Weißmann, Markus Seidl, Anya Dietrich, Martin Haltrich
for: This paper proposes a CNN-based, text-independent scribe identification method that addresses the problem of codex-dependent overfitting.
methods: The method uses preprocessing, different neural network architectures, and a reject option to improve the accuracy and stability of the classification results.
results: The study finds that preprocessing with masked grayscale images clearly increases the F1-score of the classification results, that the neural network architectures trade off time and accuracy differently, with AlexNet offering the best balance, and that implementing a reject option further stabilizes the CNN outputs. The work uses the large-scale open-source Codex Claustroneoburgensis database (CCl-DB) and performs detailed analysis and comparison across codices. Abstract
Historic scribe identification is a substantial task for obtaining information about the past. Uniform script styles, such as the Carolingian minuscule, make it a difficult task for classification to focus on meaningful features. Therefore, we demonstrate in this paper the importance of cross-codex training data for CNN based text-independent off-line scribe identification, to overcome codex dependent overfitting. We report three main findings: First, we found that preprocessing with masked grayscale images instead of RGB images clearly increased the F1-score of the classification results. Second, we trained different neural networks on our complex data, validating time and accuracy differences in order to define the most reliable network architecture. With AlexNet, the network with the best trade-off between F1-score and time, we achieved F1-scores for individual classes of up to 0.96 at line level and up to 1.0 at page level in classification. Third, we could replicate the finding that the CNN output can be further improved by implementing a reject option, giving more stable results. We present the results on our large scale open source dataset -- the Codex Claustroneoburgensis database (CCl-DB) -- containing a significant number of writings from different scribes in several codices. We demonstrate for the first time on a dataset with such a variety of codices that paleographic decisions can be reproduced automatically and precisely with CNNs. This gives manifold new and fast possibilities for paleographers to gain insights into unlabeled material, but also to develop further hypotheses.
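The reject option mentioned above is typically a confidence threshold on the classifier's softmax output: a prediction is emitted only when the top class probability is high enough, and otherwise the sample is withheld for manual inspection. A minimal sketch (thresholds and scribe names are hypothetical):

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify_with_reject(logits, scribes, threshold=0.8):
    """Return the predicted scribe, or None (reject) when the top
    softmax probability falls below the confidence threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return scribes[best] if probs[best] >= threshold else None

scribes = ["scribe_A", "scribe_B", "scribe_C"]
print(classify_with_reject([4.0, 0.5, 0.2], scribes))  # confident: scribe_A
print(classify_with_reject([1.1, 1.0, 0.9], scribes))  # ambiguous: None
```

Trading a small loss in coverage for higher precision on the retained predictions is what makes the results "more stable" in the sense reported above.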
GPT-4V with Emotion: A Zero-shot Benchmark for Multimodal Emotion Understanding
paper_authors: Zheng Lian, Licai Sun, Haiyang Sun, Kang Chen, Zhuofan Wen, Hao Gu, Shun Chen, Bin Liu, Jianhua Tao
for: The main goal of this paper is to evaluate GPT-4V's capabilities in multimodal emotion understanding, including tasks such as facial emotion recognition, visual sentiment analysis, micro-expression recognition, dynamic facial emotion recognition, and multimodal emotion recognition.
methods: The paper evaluates GPT-4V on multimodal emotion understanding tasks, comparing it against approaches such as supervised pre-trained models and zero-shot strategies.
results: Experimental results show that GPT-4V performs impressively on multimodal emotion understanding tasks, even surpassing some supervised systems. However, GPT-4V performs worse on micro-expression recognition, likely because this task requires specialized expertise. Abstract
Recently, GPT-4 with Vision (GPT-4V) has shown remarkable performance across various multimodal tasks. However, its efficacy in emotion recognition remains a question. This paper quantitatively evaluates GPT-4V's capabilities in multimodal emotion understanding, encompassing tasks such as facial emotion recognition, visual sentiment analysis, micro-expression recognition, dynamic facial emotion recognition, and multimodal emotion recognition. Our experiments show that GPT-4V exhibits impressive multimodal and temporal understanding capabilities, even surpassing supervised systems in some tasks. Despite these achievements, GPT-4V is currently tailored for general domains. It performs poorly in micro-expression recognition that requires specialized expertise. The main purpose of this paper is to present quantitative results of GPT-4V on emotion understanding and establish a zero-shot benchmark for future research. Code and evaluation results are available at: https://github.com/zeroQiaoba/gpt4v-emotion.
Activity Grammars for Temporal Action Segmentation
results: Experimental results show that our method significantly improves temporal action segmentation in both performance and interpretability on two standard benchmarks (Breakfast and 50 Salads). Abstract
Sequence prediction on temporal data requires the ability to understand compositional structures of multi-level semantics beyond individual and contextual properties. The task of temporal action segmentation, which aims at translating an untrimmed activity video into a sequence of action segments, remains challenging for this reason. This paper addresses the problem by introducing an effective activity grammar to guide neural predictions for temporal action segmentation. We propose a novel grammar induction algorithm that extracts a powerful context-free grammar from action sequence data. We also develop an efficient generalized parser that transforms frame-level probability distributions into a reliable sequence of actions according to the induced grammar with recursive rules. Our approach can be combined with any neural network for temporal action segmentation to enhance the sequence prediction and discover its compositional structure. Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of both performance and interpretability on two standard benchmarks, Breakfast and 50 Salads.
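The parser described above checks whether a predicted action sequence is derivable under the induced activity grammar. The standard membership test for a context-free grammar in Chomsky normal form is the CYK algorithm, sketched here over action symbols (the grammar rules below are hypothetical toy rules, not ones induced from Breakfast or 50 Salads):

```python
def cyk_accepts(sequence, rules, start="S"):
    """CYK membership test for a CNF grammar over action symbols.
    `rules` maps each nonterminal to a set of unary (terminal) or
    binary (nonterminal pair) bodies."""
    n = len(sequence)
    # table[i][j]: nonterminals deriving sequence[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, action in enumerate(sequence):
        for lhs, bodies in rules.items():
            if (action,) in bodies:
                table[i][0].add(lhs)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for split in range(1, span):
                for lhs, bodies in rules.items():
                    for body in bodies:
                        if (len(body) == 2
                                and body[0] in table[i][split - 1]
                                and body[1] in table[i + split][span - split - 1]):
                            table[i][span - 1].add(lhs)
    return start in table[0][n - 1]

# Toy activity grammar: pour milk, then stir (hypothetical rules).
rules = {
    "S": {("P", "T")},
    "P": {("pour_milk",)},
    "T": {("stir",)},
}
print(cyk_accepts(["pour_milk", "stir"], rules))   # valid ordering
print(cyk_accepts(["stir", "pour_milk"], rules))   # violates the grammar
```

The paper's generalized parser goes further by working on frame-level probability distributions rather than a hard symbol sequence, but the underlying question it answers per segmentation is the same: is this action sequence licensed by the grammar?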
Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation
results: Across various settings, Rein matches or surpasses state-of-the-art methods. Specifically, with only an extra 1% of trainable parameters, Rein achieves 68.1% mIoU on the Cityscapes dataset without accessing any real urban-scene datasets. Abstract
In this paper, we first assess and harness various Vision Foundation Models (VFMs) in the context of Domain Generalized Semantic Segmentation (DGSS). Driven by the motivation that Leveraging Stronger pre-trained models and Fewer trainable parameters for Superior generalizability, we introduce a robust fine-tuning approach, namely Rein, to parameter-efficiently harness VFMs for DGSS. Built upon a set of trainable tokens, each linked to distinct instances, Rein precisely refines and forwards the feature maps from each layer to the next layer within the backbone. This process produces diverse refinements for different categories within a single image. With fewer trainable parameters, Rein efficiently fine-tunes VFMs for DGSS tasks, surprisingly surpassing full parameter fine-tuning. Extensive experiments across various settings demonstrate that Rein significantly outperforms state-of-the-art methods. Remarkably, with just an extra 1% of trainable parameters within the frozen backbone, Rein achieves a mIoU of 68.1% on the Cityscapes, without accessing any real urban-scene datasets.
TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes
for: The paper addresses the task of text-driven 3D stylization of multi-object scenes.
methods: The paper proposes a novel framework called TeMO, which uses a Decoupled Graph Attention (DGA) module and a Cross-Grained Contrast (CGC) supervision system to parse multi-object 3D scenes and edit their styles under contrast supervision at multiple levels.
results: The proposed method synthesizes high-quality stylized content and outperforms existing methods over a wide range of multi-object 3D meshes. Abstract
Recent progress in the text-driven 3D stylization of a single object has been considerably promoted by CLIP-based methods. However, the stylization of multi-object 3D scenes is still impeded in that the image-text pairs used for pre-training CLIP mostly depict a single object. Meanwhile, the local details of multiple objects may be omitted because the existing supervision manner relies primarily on coarse-grained contrast of image-text pairs. To overcome these challenges, we present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles under the contrast supervision at multiple levels. We first propose a Decoupled Graph Attention (DGA) module to distinguishably reinforce the features of 3D surface points. Particularly, a cross-modal graph is constructed to accurately align the object points and the noun phrases decoupled from the 3D mesh and textual description. Then, we develop a Cross-Grained Contrast (CGC) supervision system, where a fine-grained loss between the words in the textual description and the randomly rendered images is constructed to complement the coarse-grained loss. Extensive experiments show that our method can synthesize high-quality stylized content and outperform the existing methods over a wide range of multi-object 3D meshes. Our code and results will be made publicly available.
Fine-tune vision foundation model for crack segmentation in civil infrastructures
results: Comparative experiments show that the CrackSAM model performs excellently across different test datasets, especially on previously unseen ones. These cross-scenario results demonstrate the outstanding zero-shot capability of foundation models and provide new ideas for developing vision models. Abstract
Large-scale foundation models have become the mainstream method in the field of deep learning, while in civil engineering, the scale of AI models is strictly limited. In this work, a vision foundation model is introduced for crack segmentation. Two parameter-efficient fine-tuning methods, adapter and low-rank adaptation, are adopted to fine-tune the foundation model, the Segment Anything Model (SAM), for semantic segmentation. The fine-tuned model, CrackSAM, is much larger than all existing crack segmentation models but shows excellent performance. To test the zero-shot performance of the proposed method, two unique datasets related to road and exterior wall cracks are collected, annotated and open-sourced, totalling 810 images. Comparative experiments are conducted with twelve mature semantic segmentation models. On datasets with artificial noise and previously unseen datasets, the performance of CrackSAM far exceeds that of all state-of-the-art models. CrackSAM exhibits remarkable superiority, particularly in challenging conditions such as dim lighting, shadows, road markings, construction joints, and other interference factors. Such cross-scenario results demonstrate the outstanding zero-shot capability of foundation models, and provide new ideas for the development of vision models in civil engineering.
TLCE: Transfer-Learning Based Classifier Ensembles for Few-Shot Class-Incremental Learning
results: Experiments show that the TLCE method outperforms state-of-the-art approaches on multiple datasets and performs well even under data imbalance.Abstract
Few-shot class-incremental learning (FSCIL) struggles to incrementally recognize novel classes from few examples without catastrophic forgetting of old classes or overfitting to new classes. We propose TLCE, which ensembles multiple pre-trained models to improve separation of novel and old classes. TLCE minimizes interference between old and new classes by mapping old class images to quasi-orthogonal prototypes using episodic training. It then ensembles diverse pre-trained models to better adapt to novel classes despite data imbalance. Extensive experiments on various datasets demonstrate that our transfer learning ensemble approach outperforms state-of-the-art FSCIL methods.
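The idea of mapping old classes to quasi-orthogonal prototypes, so that old and new classes interfere minimally, can be illustrated with a toy construction (a sketch only; the paper learns this mapping via episodic training rather than constructing it directly):

```python
import numpy as np

def quasi_orthogonal_prototypes(num_classes, dim, seed=0):
    """Return unit-norm prototypes with near-zero pairwise dot products.
    Requires num_classes <= dim; QR yields exactly orthogonal directions."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(dim, num_classes)))
    return Q.T  # (num_classes, dim), rows are orthonormal prototypes

P = quasi_orthogonal_prototypes(10, 64)
G = P @ P.T  # Gram matrix: identity when prototypes are orthonormal
print(np.max(np.abs(G - np.eye(10))))  # ~0: no interference between classes
```

When class prototypes are (quasi-)orthogonal, the dot product of a sample with a wrong-class prototype is close to zero, which is exactly the interference-minimization property the abstract describes.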
Guided Reconstruction with Conditioned Diffusion Models for Unsupervised Anomaly Detection in Brain MRIs
results: The study shows that the conditioning mechanism improves anomaly detection performance, reduces false-positive predictions, and demonstrates applicability across different MRI acquisitions and simulated contrasts.Abstract
Unsupervised anomaly detection in Brain MRIs aims to identify abnormalities as outliers from a healthy training distribution. Reconstruction-based approaches that use generative models to learn to reconstruct healthy brain anatomy are commonly used for this task. Diffusion models are an emerging class of deep generative models that show great potential regarding reconstruction fidelity. However, they face challenges in preserving intensity characteristics in the reconstructed images, limiting their performance in anomaly detection. To address this challenge, we propose to condition the denoising mechanism of diffusion models with additional information about the image to reconstruct coming from a latent representation of the noise-free input image. This conditioning enables high-fidelity reconstruction of healthy brain structures while aligning local intensity characteristics of input-reconstruction pairs. We evaluate our method's reconstruction quality, domain adaptation features and finally segmentation performance on publicly available data sets with various pathologies. Using our proposed conditioning mechanism we can reduce the false-positive predictions and enable a more precise delineation of anomalies which significantly enhances the anomaly detection performance compared to established state-of-the-art approaches to unsupervised anomaly detection in brain MRI. Furthermore, our approach shows promise in domain adaptation across different MRI acquisitions and simulated contrasts, a crucial property of general anomaly detection methods.
SAMBA: A Trainable Segmentation Web-App with Smart Labelling
results: Using the SAMBA tool, the paper achieves high-quality, reproducible segmentation results; the tool is accessible in the browser (https://www.sambasegment.com/) without downloading any external dependencies.Abstract
Segmentation is the assigning of a semantic class to every pixel in an image and is a prerequisite for various statistical analysis tasks in materials science, like phase quantification, physics simulations or morphological characterization. The wide range of length scales, imaging techniques and materials studied in materials science means any segmentation algorithm must generalise to unseen data and support abstract, user-defined semantic classes. Trainable segmentation is a popular interactive segmentation paradigm where a classifier is trained to map from image features to user-drawn labels. SAMBA is a trainable segmentation tool that uses Meta's Segment Anything Model (SAM) for fast, high-quality label suggestions and a random forest classifier for robust, generalizable segmentations. It is accessible in the browser (https://www.sambasegment.com/) without the need to download any external dependencies. The segmentation backend is run in the cloud, so it does not require the user to have powerful hardware.
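The trainable-segmentation loop (map per-pixel features to user-drawn labels, then classify every pixel) can be sketched as follows. For brevity this sketch substitutes a nearest-centroid classifier for SAMBA's random forest, and the per-pixel features and phases are toy stand-ins:

```python
import numpy as np

def fit_centroids(features, labels):
    """Per-class mean feature vectors: a minimal stand-in for the
    random-forest classifier mapping image features to user labels."""
    classes = np.unique(labels)
    return classes, np.stack([features[labels == c].mean(0) for c in classes])

def predict(features, classes, centroids):
    """Assign each pixel's feature vector to the nearest class centroid."""
    d = ((features[:, None, :] - centroids[None]) ** 2).sum(-1)
    return classes[np.argmin(d, axis=1)]

# Toy per-pixel features: two material phases separable by intensity.
rng = np.random.default_rng(0)
bright = rng.normal(1.0, 0.1, size=(50, 3))   # labelled phase 1 pixels
dark = rng.normal(-1.0, 0.1, size=(50, 3))    # labelled phase 0 pixels
X = np.vstack([bright, dark])
y = np.array([1] * 50 + [0] * 50)

classes, cent = fit_centroids(X, y)
pred = predict(X, classes, cent)
print((pred == y).mean())  # 1.0 on this easy toy data
```

In SAMBA the features are far richer and the classifier is a random forest, but the interactive principle is the same: a few user-drawn labels train a model that then segments every remaining pixel.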
Text as Image: Learning Transferable Adapter for Multi-Label Classification
results: Experimental results show that, compared with existing methods, the approach achieves higher performance on multi-label image classification tasks and can automatically generate multi-label text descriptions.Abstract
Pre-trained vision-language models have notably accelerated progress of open-world concept recognition. Their impressive zero-shot ability has recently been transferred to multi-label image classification via prompt tuning, enabling novel labels to be discovered in an open-vocabulary manner. However, this paradigm suffers from non-trivial training costs, and becomes computationally prohibitive for a large number of candidate labels. To address this issue, we note that vision-language pre-training aligns images and texts in a unified embedding space, making it possible for an adapter network to identify labels in the visual modality while being trained in the text modality. To enhance such cross-modal transfer ability, a simple yet effective method termed random perturbation is proposed, which enables the adapter to search for potential visual embeddings by perturbing text embeddings with noise during training, resulting in better performance in the visual modality. Furthermore, we introduce an effective approach to employ large language models for multi-label instruction-following text generation. In this way, a fully automated pipeline for visual label recognition is developed without relying on any manual data. Extensive experiments on public benchmarks show the superiority of our method in various multi-label classification tasks.
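The random-perturbation trick (noising text embeddings during training so the adapter generalizes to nearby visual embeddings at test time) can be sketched with numpy. The noise scale and embedding dimension below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def perturb(text_emb, sigma=0.01, rng=None):
    """Add Gaussian noise to text embeddings and re-normalize, so the
    adapter is trained on points scattered around each text anchor in the
    shared CLIP-style space where image embeddings are expected to lie."""
    rng = rng or np.random.default_rng(0)
    noisy = text_emb + sigma * rng.normal(size=text_emb.shape)
    return noisy / np.linalg.norm(noisy, axis=-1, keepdims=True)

rng = np.random.default_rng(1)
t = rng.normal(size=(5, 512))
t /= np.linalg.norm(t, axis=-1, keepdims=True)  # unit-norm text embeddings

samples = perturb(t, sigma=0.01, rng=rng)
# Perturbed embeddings stay unit-norm and close to their text anchors.
cos = np.sum(samples * t, axis=-1)
print(cos.min())  # high cosine similarity to the original embeddings
```

At each training step the adapter would see a fresh perturbed batch, so it never overfits to the exact text embeddings it is trained on.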
EulerMormer: Robust Eulerian Motion Magnification via Dynamic Filtering within Transformer
paper_authors: Fei Wang, Dan Guo, Kun Li, Meng Wang
for: The goal of the paper is to break the resolution limit of visual perception and reveal the subtle motion information in videos.
methods: The paper uses a novel dynamic filtering strategy that removes noise while preserving critical features.
results: Experimental results show that EulerMormer achieves more robust video motion magnification, clearly outperforming existing methods.Abstract
Video Motion Magnification (VMM) aims to break the resolution limit of human visual perception capability and reveal the imperceptible minor motion that contains valuable information in the macroscopic domain. However, challenges arise in this task due to photon noise inevitably introduced by photographic devices and spatial inconsistency in amplification, leading to flickering artifacts in static fields and motion blur and distortion in dynamic fields in the video. Existing methods focus on explicit motion modeling without emphasizing prioritized denoising during the motion magnification process. This paper proposes a novel dynamic filtering strategy to achieve static-dynamic field adaptive denoising. Specifically, based on Eulerian theory, we separate texture and shape to extract motion representation through inter-frame shape differences, expecting to leverage these subdivided features to solve this task finely. Then, we introduce a novel dynamic filter that eliminates noise cues and preserves critical features in the motion magnification and amplification generation phases. Overall, our unified framework, EulerMormer, is a pioneering effort, the first to equip learning-based VMM with a Transformer. The core of the dynamic filter lies in a global dynamic sparse cross-covariance attention mechanism that explicitly removes noise while preserving vital information, coupled with a multi-scale dual-path gating mechanism that selectively regulates the dependence on different frequency features to reduce spatial attenuation and complement motion boundaries. We demonstrate through extensive experiments that EulerMormer achieves more robust video motion magnification from the Eulerian perspective, significantly outperforming state-of-the-art methods. The source code is available at https://github.com/VUT-HFUT/EulerMormer.
Diffusing Colors: Image Colorization with Text Guided Diffusion
results: The method provides highly automated yet controllable colorization outputs and outperforms existing methods in visual quality and semantic coherence.Abstract
The colorization of grayscale images is a complex and subjective task with significant challenges. Despite recent progress in employing large-scale datasets with deep neural networks, difficulties with controllability and visual quality persist. To tackle these issues, we present a novel image colorization framework that utilizes image diffusion techniques with granular text prompts. This integration not only produces colorization outputs that are semantically appropriate but also greatly improves the level of control users have over the colorization process. Our method provides a balance between automation and control, outperforming existing techniques in terms of visual quality and semantic coherence. We leverage a pretrained generative Diffusion Model, and show that we can finetune it for the colorization task without losing its generative power or attention to text prompts. Moreover, we present a novel CLIP-based ranking model that evaluates color vividness, enabling automatic selection of the most suitable level of vividness based on the specific scene semantics. Our approach holds potential particularly for color enhancement and historical image colorization.
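The CLIP-based vividness ranking can be sketched as scoring each candidate colorization against opposing text anchors (e.g. "a vivid colorful photo" vs. "a dull faded photo"). The 2-D embeddings and anchor wording below are stand-ins for CLIP features, not the paper's trained ranker:

```python
import numpy as np

def rank_by_vividness(image_embs, vivid_anchor, dull_anchor):
    """Score = cos(img, vivid) - cos(img, dull); higher means more vivid.
    Embeddings are assumed L2-normalized, CLIP-style."""
    score = image_embs @ vivid_anchor - image_embs @ dull_anchor
    return np.argsort(-score), score

# Toy 2-D embedding space: axis 0 encodes "vividness", axis 1 content.
vivid = np.array([1.0, 0.0])
dull = np.array([-1.0, 0.0])
imgs = np.array([[0.2, 0.98], [0.9, 0.44], [-0.5, 0.87]])
imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)

order, scores = rank_by_vividness(imgs, vivid, dull)
print(order)  # most to least vivid: [1, 0, 2]
```

Selecting the candidate whose score best matches the scene semantics is then a simple argmax or thresholding step over these scores.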
results: Compared with existing approaches, the method strikes a better balance between stylized textures and temporal coherence, and generalizes to novel poses and viewpoints.Abstract
We present a first step towards 4D (3D and time) human video stylization, which addresses style transfer, novel view synthesis and human animation within a unified framework. While numerous video stylization methods have been developed, they are often restricted to rendering images in specific viewpoints of the input video, lacking the capability to generalize to novel views and novel poses in dynamic scenes. To overcome these limitations, we leverage Neural Radiance Fields (NeRFs) to represent videos, conducting stylization in the rendered feature space. Our innovative approach involves the simultaneous representation of both the human subject and the surrounding scene using two NeRFs. This dual representation facilitates the animation of human subjects across various poses and novel viewpoints. Specifically, we introduce a novel geometry-guided tri-plane representation, significantly enhancing feature representation robustness compared to direct tri-plane optimization. Following the video reconstruction, stylization is performed within the NeRFs' rendered feature space. Extensive experiments demonstrate that the proposed method strikes a superior balance between stylized textures and temporal coherence, surpassing existing approaches. Furthermore, our framework uniquely extends its capabilities to accommodate novel poses and viewpoints, making it a versatile tool for creative human video stylization.
Polarimetric Light Transport Analysis for Specular Inter-reflection
results: Experiments and analysis show that the method effectively decomposes specular inter-reflection on metal surfaces into direct and inter-reflection components. It can also be combined with other decomposition methods for further light transport analysis, and in practical applications it improves the accuracy of 3D measurement against strong specular inter-reflection.Abstract
Polarization is well known for its ability to decompose diffuse and specular reflections. However, the existing decomposition methods only focus on direct reflection and overlook multiple reflections, especially specular inter-reflection. In this paper, we propose a novel decomposition method for handling specular inter-reflection of metal objects by using a unique polarimetric feature: the rotation direction of linear polarization. This rotation direction serves as a discriminative factor between direct and inter-reflection on specular surfaces. To decompose the reflectance components, we actively rotate the linear polarization of incident light and analyze the rotation direction of the reflected light. We evaluate our method using both synthetic and real data, demonstrating its effectiveness in decomposing specular inter-reflections of metal objects. Furthermore, we demonstrate that our method can be combined with other decomposition methods for a detailed analysis of light transport. As a practical application, we show its effectiveness in improving the accuracy of 3D measurement against strong specular inter-reflection.
for: Advancing the field of post-mortem iris recognition, an emerging application of iris-based human identification in a forensic setup.
methods: A conditional StyleGAN-based iris synthesis model that generates multiple within-class and between-class post-mortem iris images, compliant with ISO/IEC 29794-6, with decomposition deformations controlled by the requested PMI (post-mortem interval).
results: The generated images can enhance the existing, very sparse, post-mortem iris datasets and improve the training effectiveness of professional forensic human examiners.Abstract
Post-mortem iris recognition is an emerging application of iris-based human identification in a forensic setup, able to correctly identify deceased subjects even three weeks post-mortem. This technique thus is considered as an important component of future forensic toolkits. The current advancements in this field are seriously slowed down by exceptionally difficult data collection, which can happen in mortuary conditions, at crime scenes, or in ``body farm'' facilities. This paper makes a novel contribution to facilitate progress in post-mortem iris recognition by offering a conditional StyleGAN-based iris synthesis model, trained on the largest-available dataset of post-mortem iris samples acquired from more than 350 subjects, generating -- through appropriate exploration of StyleGAN latent space -- multiple within-class (same identity) and between-class (different new identities) post-mortem iris images, compliant with ISO/IEC 29794-6, and with decomposition deformations controlled by the requested PMI (post mortem interval). Besides an obvious application to enhance the existing, very sparse, post-mortem iris datasets to advance -- among others -- iris presentation attack endeavors, we anticipate it may be useful to generate samples that would expose professional forensic human examiners to never-seen-before deformations for various PMIs, increasing their training effectiveness. The source codes and model weights are made available with the paper.
A Multilevel Guidance-Exploration Network and Behavior-Scene Matching Method for Human Behavior Anomaly Detection
results: The method achieves state-of-the-art performance on the ShanghaiTech and UBnormal datasets, with AUC of 86.9% and 73.5%, respectively.Abstract
Human behavior anomaly detection aims to identify unusual human actions, playing a crucial role in intelligent surveillance and other areas. The current mainstream methods still adopt reconstruction or future frame prediction techniques. However, reconstructing or predicting low-level pixel features easily enables the network to achieve overly strong generalization ability, allowing anomalies to be reconstructed or predicted as effectively as normal data. Different from their methods, inspired by the Student-Teacher Network, we propose a novel framework called the Multilevel Guidance-Exploration Network(MGENet), which detects anomalies through the difference in high-level representation between the Guidance and Exploration network. Specifically, we first utilize the pre-trained Normalizing Flow that takes skeletal keypoints as input to guide an RGB encoder, which takes unmasked RGB frames as input, to explore motion latent features. Then, the RGB encoder guides the mask encoder, which takes masked RGB frames as input, to explore the latent appearance feature. Additionally, we design a Behavior-Scene Matching Module(BSMM) to detect scene-related behavioral anomalies. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performance on ShanghaiTech and UBnormal datasets, with AUC of 86.9 % and 73.5 %, respectively. The code will be available on https://github.com/molu-ggg/GENet.
Instance Tracking in 3D Scenes from Egocentric Videos
results: The authors first re-purpose methods from relevant areas such as single object tracking (SOT), running SOT methods to track instances in 2D frames and lifting them to 3D using camera pose and depth. They also present a simple method that leverages pretrained segmentation and detection models to generate proposals from RGB frames and match them against enrolled instance images. Experiments show that this method (with no finetuning) significantly outperforms the SOT-based approaches. The authors conclude that egocentric instance tracking is made easier by leveraging camera pose and a 3D allocentric (world) coordinate representation.Abstract
Egocentric sensors such as AR/VR devices capture human-object interactions and offer the potential to provide task-assistance by recalling 3D locations of objects of interest in the surrounding environment. This capability requires instance tracking in real-world 3D scenes from egocentric videos (IT3DEgo). We explore this problem by first introducing a new benchmark dataset, consisting of RGB and depth videos, per-frame camera pose, and instance-level annotations in both 2D camera and 3D world coordinates. We present an evaluation protocol which evaluates tracking performance in 3D coordinates with two settings for enrolling instances to track: (1) single-view online enrollment where an instance is specified on-the-fly based on the human wearer's interactions, and (2) multi-view pre-enrollment where images of an instance to be tracked are stored in memory ahead of time. To address IT3DEgo, we first re-purpose methods from relevant areas, e.g., single object tracking (SOT) -- running SOT methods to track instances in 2D frames and lifting them to 3D using camera pose and depth. We also present a simple method that leverages pretrained segmentation and detection models to generate proposals from RGB frames and match proposals with enrolled instance images. Perhaps surprisingly, our extensive experiments show that our method (with no finetuning) significantly outperforms SOT-based approaches. We conclude by arguing that the problem of egocentric instance tracking is made easier by leveraging camera pose and using a 3D allocentric (world) coordinate representation.
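Lifting a 2D track to 3D world coordinates from depth and camera pose, as the re-purposed SOT baseline does, is a short computation: back-project the pixel through the intrinsics, then apply the camera-to-world transform. The intrinsics and pose below are illustrative values:

```python
import numpy as np

def lift_to_world(u, v, depth, K, cam_to_world):
    """Back-project pixel (u, v) with metric depth into camera space, then
    transform into allocentric world coordinates via a 4x4 pose matrix."""
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    p_cam = np.array([x, y, depth, 1.0])  # homogeneous camera-space point
    return (cam_to_world @ p_cam)[:3]

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])  # toy intrinsics
pose = np.eye(4)
pose[:3, 3] = [1.0, 0.0, 2.0]  # camera sitting at world position (1, 0, 2)

p = lift_to_world(320, 240, depth=3.0, K=K, cam_to_world=pose)
print(p)  # principal-point ray, 3 m ahead of the camera -> [1, 0, 5]
```

Because the output lives in a fixed world frame, an enrolled instance keeps the same 3D coordinate as the wearer moves, which is the property the authors argue makes the problem easier.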
Multi-strategy Collaborative Optimized YOLOv5s and its Application in Distance Estimation
results: Simulation experiments verify a 5.5% improvement in mAP and show that safety suggestions can be given based on the estimated distance information.Abstract
The increasing accident rate brought about by the explosive growth of automobiles has made research on active automotive safety systems increasingly important, and the importance of improving the accuracy of vehicle target detection is self-evident. To achieve vehicle detection and distance estimation and provide safety warnings, a Distance Estimation Safety Warning System (DESWS) is proposed, based on a new neural network model (YOLOv5s-SE) that replaces IoU with DIoU and embeds an SE attention module, together with a distance estimation method using the principle of similar triangles. In addition, a method that can give safety suggestions based on the estimated distance using nonparametric testing is presented in this work. Simulation experiments verified that the mAP was improved by 5.5% and that the goal of giving safety suggestions based on the estimated distance information can be achieved.
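The similar-triangles estimate behind DESWS reduces to one formula: with a known real object height H, focal length f (in pixels), and detected bounding-box height h (in pixels), the distance is approximately f·H/h. The vehicle height, focal length, and warning threshold below are illustrative, and the safety rule is a toy stand-in for the paper's nonparametric-test based advice:

```python
def estimate_distance(focal_px, real_height_m, bbox_height_px):
    """Pinhole / similar-triangles range estimate: D = f * H / h."""
    return focal_px * real_height_m / bbox_height_px

def safety_hint(distance_m, warn_at=15.0):
    """Toy threshold rule standing in for the nonparametric-test advice."""
    return "keep distance" if distance_m < warn_at else "safe"

# A 1.5 m tall vehicle imaged at 100 px with a 1000 px focal length:
d = estimate_distance(1000.0, 1.5, 100.0)
print(d, safety_hint(d))  # 15.0 -> "safe"
```

The estimate degrades when the bounding box is clipped or the object height assumption is wrong, which is why the detector accuracy improvements (DIoU, SE attention) matter for the downstream distance estimate.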
Identity-Obscured Neural Radiance Fields: Privacy-Preserving 3D Facial Reconstruction
paper_authors: Jiayi Kong, Baixin Xu, Xurui Song, Chen Qian, Jun Luo, Ying He
for: Protecting personal privacy by avoiding the exposure of sensitive facial data
methods: Uses privacy-preserving, identity-obscured images to reconstruct 3D head geometry within the NeRF framework
results: Generates plausible, high-fidelity head geometry without requiring sensitive facial data.Abstract
Neural radiance fields (NeRF) typically require a complete set of images taken from multiple camera perspectives to accurately reconstruct geometric details. However, this approach raises significant privacy concerns in the context of facial reconstruction. The critical need for privacy protection often leads individuals to be reluctant to share their facial images, due to fears of potential misuse or security risks. Addressing these concerns, we propose a method that leverages privacy-preserving images for reconstructing 3D head geometry within the NeRF framework. Our method stands apart from traditional facial reconstruction techniques as it does not depend on RGB information from images containing sensitive facial data. Instead, it effectively generates plausible facial geometry using a series of identity-obscured inputs, thereby protecting facial privacy.
Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection
results: The paper provides empirical evidence that this unlearning method produces models that behave similarly to models retrained from scratch, even when the training dataset is no longer accessible.Abstract
Recent data-privacy laws have sparked interest in machine unlearning, which involves removing the effect of specific training samples from a learnt model as if they were never present in the original training dataset. The challenge of machine unlearning is to discard information about the ``forget'' data in the learnt model without altering the knowledge about the remaining dataset, and to do so more efficiently than the naive retraining approach. To achieve this, we adopt a projected-gradient based learning method, named Projected-Gradient Unlearning (PGU), in which the model takes steps in the direction orthogonal to the gradient subspaces deemed unimportant for the retained dataset, so that its knowledge is preserved. By utilizing Stochastic Gradient Descent (SGD) to update the model weights, our method can efficiently scale to any model and dataset size. We provide empirical evidence to demonstrate that our unlearning method can produce models that behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible. Our code is available at https://github.com/hnanhtuan/projected_gradient_unlearning.
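The core projection step (removing from an update any component that would disturb directions important to the retained data) can be sketched with plain linear algebra. The "important" subspace below is a random stand-in for the gradient subspaces the paper estimates from the retained dataset:

```python
import numpy as np

def project_out(grad, basis):
    """Remove from `grad` its component inside span(basis columns), so an
    unlearning step cannot disturb directions deemed important for the
    retained data."""
    Q, _ = np.linalg.qr(basis)       # orthonormalize the retained subspace
    return grad - Q @ (Q.T @ grad)   # subtract the in-subspace component

rng = np.random.default_rng(0)
basis = rng.normal(size=(10, 3))     # stand-in "important" directions
g = rng.normal(size=10)              # raw gradient from the forget data

g_proj = project_out(g, basis)
# The projected step is orthogonal to every retained direction.
print(np.max(np.abs(basis.T @ g_proj)))  # ~0
```

An SGD loop would then apply `g_proj` instead of `g`, which is what lets the method forget the target samples while keeping the rest of the model's knowledge intact.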
Open-Vocabulary Segmentation with Semantic-Assisted Calibration
for: This paper focuses on the problem of open-vocabulary segmentation (OVS), specifically addressing the challenge of aligning visual content with the semantics of unbounded text.
methods: The proposed method, Semantic-assisted CAlibration Network (SCAN), incorporates a generalized semantic prior of CLIP into proposal embedding and applies a contextual shift strategy to mitigate the lack of global context and unnatural background noise.
results: SCAN achieves state-of-the-art performance on all popular open-vocabulary segmentation benchmarks, and the authors also propose a new evaluation metric called Semantic-Guided IoU (SG-IoU) to address the problem of semantic duplication across categories.Abstract
This paper studies open-vocabulary segmentation (OVS) through calibrating in-vocabulary and domain-biased embedding space with generalized contextual prior of CLIP. As the core of open-vocabulary understanding, alignment of visual content with the semantics of unbounded text has become the bottleneck of this field. To address this challenge, recent works propose to utilize CLIP as an additional classifier and aggregate model predictions with CLIP classification results. Despite their remarkable progress, performance of OVS methods in relevant scenarios is still unsatisfactory compared with supervised counterparts. We attribute this to the in-vocabulary embedding and domain-biased CLIP prediction. To this end, we present a Semantic-assisted CAlibration Network (SCAN). In SCAN, we incorporate generalized semantic prior of CLIP into proposal embedding to avoid collapsing on known categories. Besides, a contextual shift strategy is applied to mitigate the lack of global context and unnatural background noise. With above designs, SCAN achieves state-of-the-art performance on all popular open-vocabulary segmentation benchmarks. Furthermore, we also focus on the problem of existing evaluation system that ignores semantic duplication across categories, and propose a new metric called Semantic-Guided IoU (SG-IoU).
MTVG : Multi-text Video Generation with Text-to-Video Models
results: Extensive experiments show that the proposed method generates highly coherent and temporally smooth videos across diverse transitions of descriptions. Video examples are available on the project page: https://kuai-lab.github.io/mtvg-page.Abstract
Recently, video generation has attracted massive attention and yielded noticeable outcomes. Given the characteristics of video, multi-text conditioning that incorporates sequential events is necessary for next-step video generation. In this work, we propose novel multi-text video generation (MTVG) by directly utilizing a pre-trained diffusion-based text-to-video (T2V) generation model without additional fine-tuning. To generate consecutive video segments, visual consistency generated by distinct prompts is necessary, with diverse variations such as motion and content-related transitions. Our proposed MTVG includes Dynamic Noise and Last Frame Aware Inversion, which reinitialize the noise latent to preserve visual coherence between videos of different prompts and prevent repetitive motion or contents. Furthermore, we present Structure Guiding Sampling to maintain the global appearance across the frames in a single video clip, where we leverage iterative latent updates across the preceding frame. Additionally, our Prompt Generator allows for arbitrary formats of text conditions consisting of diverse events. As a result, our extensive experiments, including diverse transitions of descriptions, demonstrate that our proposed methods show superior generated outputs in terms of semantically coherent and temporally seamless video. Video examples are available on our project page: https://kuai-lab.github.io/mtvg-page.
Large Language Models are Good Prompt Learners for Low-Shot Image Classification
results: Compared with other state-of-the-art prompt learning methods, LLaMP performs strongly on both zero-shot generalization and few-shot image classification, achieving better performance across 11 datasets. Abstract
Low-shot image classification, where training images are limited or inaccessible, has benefited from recent progress on pre-trained vision-language (VL) models with strong generalizability, e.g. CLIP. Prompt learning methods built with VL models generate text features from the class names that only have confined class-specific information. Large Language Models (LLMs), with their vast encyclopedic knowledge, emerge as the complement. Thus, in this paper, we discuss the integration of LLMs to enhance pre-trained VL models, specifically on low-shot classification. However, the domain gap between language and vision blocks the direct application of LLMs. Thus, we propose LLaMP, Large Language Models as Prompt learners, that produces adaptive prompts for the CLIP text encoder, establishing it as the connecting bridge. Experiments show that, compared with other state-of-the-art prompt learning methods, LLaMP yields better performance on both zero-shot generalization and few-shot image classification, over a spectrum of 11 datasets.
Combining inherent knowledge of vision-language models with unsupervised domain adaptation through self-knowledge distillation
paper_authors: Thomas Westfechtel, Dexuan Zhang, Tatsuya Harada
for: This paper aims to avoid the need for labeled target data by transferring knowledge from a labeled source dataset to improve prediction performance on the target dataset.
methods: It combines the zero-shot predictions and inherent knowledge of vision-language models with domain adaptation techniques and self-knowledge distillation.
results: Experiments and ablation studies show state-of-the-art performance on three benchmark datasets (OfficeHome, VisDA, and DomainNet), with further gains from a gradual source domain expansion strategy. Abstract
Unsupervised domain adaptation (UDA) tries to overcome the tedious work of labeling data by leveraging a labeled source dataset and transferring its knowledge to a similar but different target dataset. On the other hand, current vision-language models exhibit astonishing zero-shot prediction capabilities. In this work, we combine knowledge gained through UDA with the inherent knowledge of vision-language models. In a first step, we generate the zero-shot predictions of the source and target dataset using the vision-language model. Since zero-shot predictions usually exhibit a large entropy, meaning that the class probabilities are rather evenly distributed, we first adjust the distribution to accentuate the winning probabilities. This is done using both source and target data to keep the relative confidence between source and target data. We then employ a conventional DA method, to gain the knowledge from the source dataset, in combination with self-knowledge distillation, to maintain the inherent knowledge of the vision-language model. We further combine our method with a gradual source domain expansion strategy (GSDE) and show that this strategy can also benefit by including zero-shot predictions. We conduct experiments and ablation studies on three benchmarks (OfficeHome, VisDA, and DomainNet) and outperform state-of-the-art methods. We further show in ablation studies the contributions of different parts of our algorithm.
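The step that "adjusts the distribution to accentuate the winning probabilities" can be sketched as temperature sharpening of the zero-shot class distribution, applied with the same temperature to source and target predictions so their relative confidences are preserved. The temperature value here is an assumption for illustration:

```python
import numpy as np

def sharpen(probs, T=0.5):
    """Temperature-sharpen class probabilities (T < 1 accentuates the
    winning class); applying the same T to source and target keeps their
    relative confidence comparable."""
    p = np.power(probs, 1.0 / T)
    return p / p.sum(axis=-1, keepdims=True)

# a high-entropy zero-shot prediction from a vision-language model
zero_shot = np.array([[0.40, 0.35, 0.25]])
print(sharpen(zero_shot, T=0.5))
```

The argmax is unchanged; only the entropy drops, which makes the pseudo-labels more useful as distillation targets.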
An unsupervised approach towards promptable defect segmentation in laser-based additive manufacturing by Segment Anything
results: In a case study, the framework performed real-time porosity segmentation and achieved high Dice Similarity Coefficients (DSC) without any supervised fine-tuning of the model. Abstract
Foundation models are currently driving a paradigm shift in computer vision tasks for various fields including biology, astronomy, and robotics among others, leveraging user-generated prompts to enhance their performance. In the manufacturing domain, accurate image-based defect segmentation is imperative to ensure product quality and facilitate real-time process control. However, such tasks are often characterized by multiple challenges including the absence of labels and the requirement for low latency inference among others. To address these issues, we construct a framework for image segmentation using a state-of-the-art Vision Transformer (ViT) based Foundation model (Segment Anything Model) with a novel multi-point prompt generation scheme using unsupervised clustering. We apply our framework to perform real-time porosity segmentation in a case study of laser-based powder bed fusion (L-PBF) and obtain high Dice Similarity Coefficients (DSC) without the necessity for any supervised fine-tuning in the model. Using such lightweight foundation model inference in conjunction with unsupervised prompt generation, we envision the construction of a real-time anomaly detection pipeline that has the potential to revolutionize the current laser-based additive manufacturing processes, thereby facilitating the shift towards Industry 4.0 and promoting defect-free production along with operational efficiency.
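The unsupervised multi-point prompt generation can be sketched as clustering candidate defect-pixel coordinates and handing the cluster centroids to a promptable segmenter such as SAM as point prompts. The tiny k-means below (with deterministic farthest-point initialization) is an illustrative stand-in for the paper's clustering scheme:

```python
import numpy as np

def farthest_point_init(coords, k):
    """Deterministic farthest-point initialization for k-means."""
    centers = [coords[0].astype(float)]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(coords - c, axis=1) for c in centers], axis=0)
        centers.append(coords[d.argmax()].astype(float))
    return np.array(centers)

def kmeans_prompts(coords, k=3, iters=20):
    """Cluster candidate defect-pixel coordinates (N, 2) with a tiny k-means;
    the k centroids serve as point prompts for a promptable segmenter."""
    centers = farthest_point_init(coords, k)
    for _ in range(iters):
        # assign each pixel to its nearest center, then recompute centroids
        d = np.linalg.norm(coords[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = coords[labels == j].mean(axis=0)
    return centers

# toy "porosity" mask pixels: two well-separated blobs
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal((10, 10), 1, (50, 2)),
                 rng.normal((40, 40), 1, (50, 2))])
print(kmeans_prompts(pts, k=2))
```

Each returned centroid would then be passed to the segmenter's point-prompt interface, keeping the whole pipeline label-free.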
Differentiable Registration of Images and LiDAR Point Clouds with VoxelPoint-to-Pixel Matching
paper_authors: Junsheng Zhou, Baorui Ma, Wenyuan Zhang, Yi Fang, Yu-Shen Liu, Zhizhong Han
for: Cross-modality registration, i.e., registering 2D images and 3D point clouds into a shared space.
methods: A triplet network learns VoxelPoint-to-Pixel matching to build feature representations in a cross-modality latent space, and a differentiable probabilistic PnP solver imposes supervision directly on the predicted pose during training.
results: Registration results on the KITTI and nuScenes datasets surpass previous methods. Abstract
Cross-modality registration between 2D images from cameras and 3D point clouds from LiDARs is a crucial task in computer vision and robotics. Previous methods estimate 2D-3D correspondences by matching point and pixel patterns learned by neural networks, and use Perspective-n-Points (PnP) to estimate the rigid transformation during post-processing. However, these methods struggle to map points and pixels to a shared latent space robustly since points and pixels have very different characteristics with patterns learned in different manners (MLP and CNN), and they also fail to construct supervision directly on the transformation since the PnP is non-differentiable, which leads to unstable registration results. To address these problems, we propose to learn a structured cross-modality latent space to represent pixel features and 3D features via a differentiable probabilistic PnP solver. Specifically, we design a triplet network to learn VoxelPoint-to-Pixel matching, where we represent 3D elements using both voxels and points to learn the cross-modality latent space with pixels. We design both the voxel and pixel branch based on CNNs to operate convolutions on voxels/pixels represented in grids, and integrate an additional point branch to regain the information lost during voxelization. We train our framework end-to-end by imposing supervisions directly on the predicted pose distribution with a probabilistic PnP solver. To explore distinctive patterns of cross-modality features, we design a novel loss with adaptive-weighted optimization for cross-modality feature description. The experimental results on KITTI and nuScenes datasets show significant improvements over the state-of-the-art methods. The code and models are available at https://github.com/junshengzhou/VP2P-Match.
Residual Graph Convolutional Network for Bird’s-Eye-View Semantic Segmentation
results: Experiments on the nuScenes dataset yield higher IoU and mIoU than four state-of-the-art networks and their four variants, with an mIoU 3.1% higher than the best existing network, BEVFusion. Abstract
Retrieving spatial information and understanding the semantic information of the surroundings are important for Bird's-Eye-View (BEV) semantic segmentation. In the application of autonomous driving, autonomous vehicles need to be aware of their surroundings to drive safely. However, current BEV semantic segmentation techniques, deep Convolutional Neural Networks (CNNs) and transformers, have difficulties in obtaining the global semantic relationships of the surroundings at the early layers of the network. In this paper, we propose to incorporate a novel Residual Graph Convolutional (RGC) module in deep CNNs to acquire both the global information and the region-level semantic relationship in the multi-view image domain. Specifically, the RGC module employs a non-overlapping graph space projection to efficiently project the complete BEV information into graph space. It then builds interconnected spatial and channel graphs to extract spatial information between each node and channel information within each node (i.e., extract contextual relationships of the global features). Furthermore, it uses a downsample residual process to enhance the coordinate feature reuse to maintain the global information. The segmentation data augmentation and alignment module helps to simultaneously augment and align BEV features and ground truth to geometrically preserve their alignment to achieve better segmentation results. Our experimental results on the nuScenes benchmark dataset demonstrate that the RGC network outperforms four state-of-the-art networks and its four variants in terms of IoU and mIoU. The proposed RGC network achieves a higher mIoU of 3.1% than the best state-of-the-art network, BEVFusion. Code and models will be released.
DiffusionPhase: Motion Diffusion in Frequency Domain
results: Experiments show that our method outperforms existing approaches in generating diverse, high-quality motion sequences and can plausibly synthesize long motion sequences with natural transitions. Abstract
In this study, we introduce a learning-based method for generating high-quality human motion sequences from text descriptions (e.g., ``A person walks forward"). Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences, due to limited text-to-motion datasets and the pose representations used that often lack expressiveness or compactness. To address these issues, we propose the first method for text-conditioned human motion generation in the frequency domain of motions. We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space with high-frequency details encoded, capturing the local periodicity of motions in time and space with high accuracy. We also introduce a conditional diffusion model for predicting periodic motion parameters based on text descriptions and a start pose, efficiently achieving smooth transitions between motion sequences associated with different text descriptions. Experiments demonstrate that our approach outperforms current methods in generating a broader variety of high-quality motions, and synthesizing long sequences with natural transitions.
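The frequency-domain parameterization of motion can be illustrated with a toy 1-D example: a periodic joint trajectory is summarized by the frequency, amplitude, and phase of its dominant FFT component. This is a hand-rolled stand-in, far simpler than the paper's learned phase space, but it shows the kind of compact periodic description being exploited:

```python
import numpy as np

def dominant_phase(signal, dt=1.0):
    """Return (frequency, amplitude, phase) of the dominant non-DC
    FFT component of a 1-D periodic motion signal."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=dt)
    k = 1 + np.abs(spec[1:]).argmax()        # skip the DC bin
    amp = 2.0 * np.abs(spec[k]) / len(signal)
    return freqs[k], amp, np.angle(spec[k])

# toy periodic joint-angle trajectory: 0.5 * sin(2*pi*0.0625*t + 0.3)
t = np.arange(256)
joint_angle = 0.5 * np.sin(2 * np.pi * 0.0625 * t + 0.3)
f, a, ph = dominant_phase(joint_angle)
print(f, round(a, 3), round(ph, 3))
```

Conditioning a generator on such (frequency, amplitude, phase) triples, rather than raw poses, is what makes local periodicity easy to preserve and transitions easy to align.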
ImFace++: A Sophisticated Nonlinear 3D Morphable Face Model with Implicit Neural Representations
results: Comparative evaluations show that ImFace++ delivers significant advances in both face reconstruction fidelity and correspondence accuracy, and better captures diverse expressions and shape characteristics of human faces. Abstract
Accurate representations of 3D faces are of paramount importance in various computer vision and graphics applications. However, the challenges persist due to the limitations imposed by data discretization and model linearity, which hinder the precise capture of identity and expression clues in current studies. This paper presents a novel 3D morphable face model, named ImFace++, to learn a sophisticated and continuous space with implicit neural representations. ImFace++ first constructs two explicitly disentangled deformation fields to model complex shapes associated with identities and expressions, respectively, which simultaneously facilitate the automatic learning of correspondences across diverse facial shapes. To capture more sophisticated facial details, a refinement displacement field within the template space is further incorporated, enabling a fine-grained learning of individual-specific facial details. Furthermore, a Neural Blend-Field is designed to reinforce the representation capabilities through adaptive blending of an array of local fields. In addition to ImFace++, we have devised an improved learning strategy to extend expression embeddings, allowing for a broader range of expression variations. Comprehensive qualitative and quantitative evaluations demonstrate that ImFace++ significantly advances the state-of-the-art in terms of both face reconstruction fidelity and correspondence accuracy.
PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation
results: On the widely used ShapeNetPart and PartE datasets, PartDistill surpasses existing methods by more than 15% and 12% in mIoU, respectively. Abstract
This paper proposes a cross-modal distillation framework, PartDistill, which transfers 2D knowledge from vision-language models (VLMs) to facilitate 3D shape part segmentation. PartDistill addresses three major challenges in this task: the lack of 3D segmentation in invisible or undetected regions in the 2D projections, inaccurate and inconsistent 2D predictions by VLMs, and the lack of knowledge accumulation across different 3D shapes. PartDistill consists of a teacher network that uses a VLM to make 2D predictions and a student network that learns from the 2D predictions while extracting geometrical features from multiple 3D shapes to carry out 3D part segmentation. A bi-directional distillation, including forward and backward distillations, is carried out within the framework, where the former forward distills the 2D predictions to the student network, and the latter improves the quality of the 2D predictions, which subsequently enhances the final 3D part segmentation. Moreover, PartDistill can exploit generative models that facilitate effortless 3D shape creation for generating knowledge sources to be distilled. Through extensive experiments, PartDistill boosts the existing methods with substantial margins on widely used ShapeNetPart and PartE datasets, by more than 15% and 12% higher mIoU scores, respectively.
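The forward-distillation step — the 3D student matching the teacher VLM's 2D predictions — can be sketched as a KL-divergence objective between per-point class distributions. The confidence weighting shown is an illustrative assumption, not PartDistill's exact loss:

```python
import numpy as np

def forward_distill_loss(teacher_logits, student_logits, tau=1.0):
    """Confidence-weighted KL(teacher || student) over per-point class
    distributions; points the teacher is unsure about contribute less,
    which partly absorbs inaccurate 2D predictions."""
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    t, s = softmax(teacher_logits / tau), softmax(student_logits / tau)
    kl = (t * (np.log(t) - np.log(s))).sum(axis=-1)
    w = t.max(axis=-1)                       # teacher confidence as weight
    return float((w * kl).mean())

teacher = np.array([[4.0, 0.0, 0.0], [0.1, 0.0, 0.0]])  # confident / unsure point
student = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(round(forward_distill_loss(teacher, student), 4))
```

The backward direction would, in turn, use the student's aggregated 3D evidence to reweight or refine the teacher's 2D predictions before the next round.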
Natural-language-driven Simulation Benchmark and Copilot for Efficient Production of Object Interactions in Virtual Road Scenes
paper_authors: Kairui Yang, Zihao Guo, Gengjie Lin, Haotian Dong, Die Zuo, Jibin Peng, Zhao Huang, Zhecheng Xu, Fupeng Li, Ziyun Bai, Di Lin
For: The paper is written for researchers and developers of autonomous driving systems, as well as those interested in natural-language-driven simulation for teaching and testing such systems.
* Methods: The paper proposes the use of natural-language descriptions to control object interactions in virtual road scenes, and presents a dataset of 120,000 such descriptions, along with a methodology for translating them into renderable code.
* Results: The paper evaluates the effectiveness of the proposed methodology, SimCopilot, in controlling object motions, generating complex interactions, and generalizing interactions across different road topologies using the L2I dataset. The results demonstrate the potential of NLD simulation for efficient and effective testing and teaching of autonomous driving systems. Abstract
We advocate the idea of the natural-language-driven(NLD) simulation to efficiently produce the object interactions between multiple objects in the virtual road scenes, for teaching and testing the autonomous driving systems that should take quick action to avoid collision with obstacles with unpredictable motions. The NLD simulation allows the brief natural-language description to control the object interactions, significantly reducing the human efforts for creating a large amount of interaction data. To facilitate the research of NLD simulation, we collect the Language-to-Interaction(L2I) benchmark dataset with 120,000 natural-language descriptions of object interactions in 6 common types of road topologies. Each description is associated with the programming code, which the graphic render can use to visually reconstruct the object interactions in the virtual scenes. As a methodology contribution, we design SimCopilot to translate the interaction descriptions to the renderable code. We use the L2I dataset to evaluate SimCopilot's abilities to control the object motions, generate complex interactions, and generalize interactions across road topologies. The L2I dataset and the evaluation results motivate the relevant research of the NLD simulation.
LiDAR: Sensing Linear Probing Performance in Joint Embedding SSL Architectures
results: Experiments show that the LiDAR metric predicts optimal hyperparameters for JE architectures more accurately than traditional rank-based methods, providing a more robust means of assessing JE representations and encouraging broader adoption of these techniques. Abstract
Joint embedding (JE) architectures have emerged as a promising avenue for acquiring transferable data representations. A key obstacle to using JE methods, however, is the inherent challenge of evaluating learned representations without access to a downstream task, and an annotated dataset. Without efficient and reliable evaluation, it is difficult to iterate on architectural and training choices for JE methods. In this paper, we introduce LiDAR (Linear Discriminant Analysis Rank), a metric designed to measure the quality of representations within JE architectures. Our metric addresses several shortcomings of recent approaches based on feature covariance rank by discriminating between informative and uninformative features. In essence, LiDAR quantifies the rank of the Linear Discriminant Analysis (LDA) matrix associated with the surrogate SSL task -- a measure that intuitively captures the information content as it pertains to solving the SSL task. We empirically demonstrate that LiDAR significantly surpasses naive rank based approaches in its predictive power of optimal hyperparameters. Our proposed criterion presents a more robust and intuitive means of assessing the quality of representations within JE architectures, which we hope facilitates broader adoption of these powerful techniques in various domains.
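A rough sketch of the kind of quantity LiDAR measures: form the between- and within-class scatter of embeddings under the surrogate SSL task and take the effective (entropy-based) rank of the LDA matrix Σ_w⁻¹Σ_b — informative features raise it, uninformative ones do not. The regularization and estimator below are simplifying assumptions, not the paper's exact recipe:

```python
import numpy as np

def lidar_effective_rank(embeddings, labels, eps=1e-4):
    """Effective rank of the (regularized) LDA matrix Sigma_w^{-1} Sigma_b:
    exp of the entropy of its normalized eigenvalue spectrum."""
    X, y = np.asarray(embeddings, float), np.asarray(labels)
    mu, d = X.mean(axis=0), embeddings.shape[1]
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)   # between-class scatter
        Sw += (Xc - mc).T @ (Xc - mc)                # within-class scatter
    Sw += eps * np.eye(d)                            # regularize Sigma_w
    lam = np.linalg.eigvals(np.linalg.solve(Sw, Sb)).real
    lam = np.clip(lam, 0, None)
    p = lam / lam.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

# 4 well-separated classes embedded in 16 dims: expect effective rank near 3
rng = np.random.default_rng(0)
y = np.repeat(np.arange(4), 50)
X = rng.normal(size=(200, 16)) + np.eye(16)[y] * 5
print(round(lidar_effective_rank(X, y), 2))
```

Because Σ_b for C classes has rank at most C−1, the metric saturates near C−1 for perfectly discriminative features and collapses toward 1 when the embedding carries no task-relevant information.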
Stable diffusion for Data Augmentation in COCO and Weed Datasets
paper_authors: Boyang Deng, Yuzhen Lu
for: This study aimed to explore the potential of stable diffusion models in improving the performance of object detection models, specifically for small datasets with image-sparse categories.
methods: The study utilized a recent version of stable diffusion to generate synthetic images belonging to seven categories in the COCO dataset and three weed species in Michigan. The YOLOv8 model was trained based on these synthetic images and compared to models trained on original images. Additionally, several techniques of stable diffusion were leveraged for image generation with different focuses.
results: Despite the overall results being disappointing, promising results were achieved in some classes, illustrating the potential of stable diffusion models to improve object detection performance. The study suggests that stable diffusion models may be adapted to classification and detection tasks in different fields. Abstract
Generative models have increasingly impacted tasks ranging from image revision and object detection in computer vision to interior design and idea illustration in more general fields. Stable diffusion is an outstanding model series that paves the way for producing high-resolution images with thorough details from text prompts or reference images. An interesting question is how to leverage the capability of stable diffusion to elevate the image variations of certain categories (e.g., vehicles, humans, and daily objects); in particular, it has the potential to yield improvements for small datasets with image-sparse categories. This study utilized seven categories in the popular COCO dataset and three widespread weed species in Michigan to evaluate the efficiency of a recent version of stable diffusion. In detail, stable diffusion was used to generate synthetic images belonging to these classes; then, YOLOv8 models were trained based on these synthetic images, and their performance was compared to that of models trained on original images. In addition, several techniques of stable diffusion (e.g., image-to-image translation, Dreambooth, ControlNet) were leveraged for image generation with different focuses. Although the overall results were disappointing, promising results were achieved in some classes, illustrating the potential of stable diffusion models to improve the performance of detection models by conveying more helpful information into the models through the generated images. This seminal study may expedite the adoption of stable diffusion models for classification and detection tasks in different fields.
results: Compared with relevant baselines, NeRFiller creates the most 3D-consistent and plausible scene completions. Abstract
We propose NeRFiller, an approach that completes missing portions of a 3D capture via generative 3D inpainting using off-the-shelf 2D visual generative models. Often parts of a captured 3D scene or object are missing due to mesh reconstruction failures or a lack of observations (e.g., contact regions, such as the bottom of objects, or hard-to-reach areas). We approach this challenging 3D inpainting problem by leveraging a 2D inpainting diffusion model. We identify a surprising behavior of these models, where they generate more 3D consistent inpaints when images form a 2$\times$2 grid, and show how to generalize this behavior to more than four images. We then present an iterative framework to distill these inpainted regions into a single consistent 3D scene. In contrast to related works, we focus on completing scenes rather than deleting foreground objects, and our approach does not require tight 2D object masks or text. We compare our approach to relevant baselines adapted to our setting on a variety of scenes, where NeRFiller creates the most 3D consistent and plausible scene completions. Our project page is at https://ethanweber.me/nerfiller.
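The 2×2-grid observation — that a 2D inpainting diffusion model produces more 3D-consistent fills when four views are tiled into one image — reduces, at the data-handling level, to tiling views into a grid before the inpainting pass and untiling afterwards. A minimal sketch (the diffusion inpainting call itself is omitted):

```python
import numpy as np

def tile_2x2(views):
    """Tile four (H, W, C) views into one (2H, 2W, C) image so a single
    2D inpainting pass sees all four views jointly."""
    a, b, c, d = views
    return np.concatenate([np.concatenate([a, b], axis=1),
                           np.concatenate([c, d], axis=1)], axis=0)

def untile_2x2(grid):
    """Split a (2H, 2W, C) grid back into its four constituent views."""
    H, W = grid.shape[0] // 2, grid.shape[1] // 2
    return [grid[:H, :W], grid[:H, W:], grid[H:, :W], grid[H:, W:]]

views = [np.full((8, 8, 3), i, dtype=np.uint8) for i in range(4)]
grid = tile_2x2(views)     # -> inpaint `grid` here with a 2D diffusion model
back = untile_2x2(grid)
print(grid.shape, all((v == w).all() for v, w in zip(views, back)))
```

Generalizing past four images, and iteratively distilling the per-grid inpaints into one consistent 3D scene, is where the paper's actual contribution lies.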
paper_authors: Simon Frieder, Julius Berner, Philipp Petersen, Thomas Lukasiewicz
for: This paper is written for mathematicians and discusses the potential of large language models (LLMs) to aid professional mathematicians in their work.
methods: The paper provides a mathematical description of the transformer model used in all modern language models, and outlines best practices and potential issues with using LLMs for mathematical tasks.
results: The paper reports on the mathematical abilities of language models and discusses their potential to change how mathematicians work. Abstract
Large language models (LLMs) such as ChatGPT have received immense interest for their general-purpose language understanding and, in particular, their ability to generate high-quality text or computer code. For many professions, LLMs represent an invaluable tool that can speed up and improve the quality of work. In this note, we discuss to what extent they can aid professional mathematicians. We first provide a mathematical description of the transformer model used in all modern language models. Based on recent studies, we then outline best practices and potential issues and report on the mathematical abilities of language models. Finally, we shed light on the potential of LLMs to change how mathematicians work.
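The "mathematical description of the transformer model" that the note refers to centers on scaled dot-product attention; in the standard notation of the original transformer formulation,

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V,
\qquad
\mathrm{head}_i = \mathrm{Attention}\!\left(X W_i^{Q},\, X W_i^{K},\, X W_i^{V}\right),
```

where $Q$, $K$, $V$ are query, key, and value matrices projected from the token embeddings $X$, and $d_k$ is the key dimension. This is the generic formulation rather than the note's specific presentation.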
methods: The paper combines the power of large language models (LLMs) with strong text-to-image diffusion models in a simple approach called StackedDiffusion, which generates illustrated instructions from text input.
results: The model strongly outperforms baseline methods and state-of-the-art multimodal LLMs; in 30% of cases, users even prefer it to human-generated articles. It enables applications far beyond what static web articles can provide, such as personalized instructions with intermediate steps and pictures. Abstract
We introduce the new task of generating Illustrated Instructions, i.e., visual instructions customized to a user's needs. We identify desiderata unique to this task, and formalize it through a suite of automatic and human evaluation metrics, designed to measure the validity, consistency, and efficacy of the generations. We combine the power of large language models (LLMs) together with strong text-to-image generation diffusion models to propose a simple approach called StackedDiffusion, which generates such illustrated instructions given text as input. The resulting model strongly outperforms baseline approaches and state-of-the-art multimodal LLMs; and in 30% of cases, users even prefer it to human-generated articles. Most notably, it enables various new and exciting applications far beyond what static articles on the web can provide, such as personalized instructions complete with intermediate steps and pictures in response to a user's individual situation.
results: Through extensive benchmarking on the MAVREC dataset, the paper finds that pre-training object detectors with ground-view images from the same geographical region is a superior strategy for improving aerial detection. Abstract
Despite the commercial abundance of UAVs, aerial data acquisition remains challenging, and the existing Asia and North America-centric open-source UAV datasets are small-scale or low-resolution and lack diversity in scene contextuality. Additionally, the color content of the scenes, solar-zenith angle, and population density of different geographies influence the data diversity. These two factors conjointly render suboptimal aerial-visual perception of the deep neural network (DNN) models trained primarily on the ground-view data, including the open-world foundational models. To pave the way for a transformative era of aerial detection, we present Multiview Aerial Visual RECognition or MAVREC, a video dataset where we record synchronized scenes from different perspectives -- ground camera and drone-mounted camera. MAVREC consists of around 2.5 hours of industry-standard 2.7K resolution video sequences, more than 0.5 million frames, and 1.1 million annotated bounding boxes. This makes MAVREC the largest ground and aerial-view dataset, and the fourth largest among all drone-based datasets across all modalities and tasks. Through our extensive benchmarking on MAVREC, we recognize that augmenting object detectors with ground-view images from the corresponding geographical location is a superior pre-training strategy for aerial detection. Building on this strategy, we benchmark MAVREC with a curriculum-based semi-supervised object detection approach that leverages labeled (ground and aerial) and unlabeled (only aerial) images to enhance the aerial detection. We publicly release the MAVREC dataset: https://mavrec.github.io.
摘要
尽管商用无人机已十分普及,空中数据的获取仍然充满挑战。现有以亚洲和北美为中心的开源无人机数据集规模小或分辨率低,且场景上下文缺乏多样性。此外,场景的色彩内容、太阳天顶角以及不同地区的人口密度也会影响数据多样性。这两方面因素共同导致主要在地面视角数据上训练的深度神经网络(DNN)模型(包括开放世界基础模型)在空中视觉感知上表现欠佳。为了开启空中检测的变革时代,我们提出了 Multiview Aerial Visual RECognition(MAVREC)数据集,其中以不同视角(地面摄像头和无人机载摄像头)同步录制场景。MAVREC 包含约 2.5 小时的行业标准 2.7K 分辨率视频序列、超过 50 万帧图像以及 110 万个标注边界框。这使 MAVREC 成为最大的地面与空中视角数据集,也是所有模态和任务中第四大的无人机数据集。通过在 MAVREC 上的广泛基准测试,我们发现用相同地理位置的地面图像增强目标检测器是一种更优的空中检测预训练策略。基于这一策略,我们在 MAVREC 上对一种基于课程学习的半监督目标检测方法进行了基准测试,该方法利用有标注(地面和空中)和无标注(仅空中)图像来提升空中检测。我们公开发布 MAVREC 数据集:https://mavrec.github.io。
PlayFusion: Skill Acquisition via Diffusion from Language-Annotated Play
results: 在多种环境(仿真和真实世界)中进行了广泛的实验,并在 https://play-fusion.github.io 上提供了结果可视化和演示视频。Abstract
Learning from unstructured and uncurated data has become the dominant paradigm for generative approaches in language and vision. Such unstructured and unguided behavior data, commonly known as play, is also easier to collect in robotics but much more difficult to learn from due to its inherently multimodal, noisy, and suboptimal nature. In this paper, we study this problem of learning goal-directed skill policies from unstructured play data which is labeled with language in hindsight. Specifically, we leverage advances in diffusion models to learn a multi-task diffusion model to extract robotic skills from play data. Using a conditional denoising diffusion process in the space of states and actions, we can gracefully handle the complexity and multimodality of play data and generate diverse and interesting robot behaviors. To make diffusion models more useful for skill learning, we encourage robotic agents to acquire a vocabulary of skills by introducing discrete bottlenecks into the conditional behavior generation process. In our experiments, we demonstrate the effectiveness of our approach across a wide variety of environments in both simulation and the real world. Results visualizations and videos at https://play-fusion.github.io
摘要
从无结构、未经整理的数据中学习已成为语言和视觉领域生成方法的主流范式。这类无结构、无引导的行为数据(通常称为"玩耍"数据)在机器人领域同样更易收集,但由于其固有的多模态、噪声和次优特性,学习难度也更大。在本文中,我们研究如何从事后用语言标注的无结构玩耍数据中学习目标导向的技能策略。具体而言,我们利用扩散模型的进展,学习一个多任务扩散模型,从玩耍数据中提取机器人技能。通过在状态和动作空间中使用条件去噪扩散过程,我们可以从容地处理玩耍数据的复杂性和多模态性,并生成多样且有趣的机器人行为。为了使扩散模型更适用于技能学习,我们在条件行为生成过程中引入离散瓶颈,鼓励机器人智能体习得一套技能词汇。实验表明,我们的方法在仿真和真实世界的多种环境中均有效。结果可视化与视频见 https://play-fusion.github.io。
Adversarial Learning for Feature Shift Detection and Correction
results: 研究表明,将主流监督学习模型与简单的迭代启发式相结合,可以有效地检测并修复特征偏移,且优于现有的统计和神经网络方法。Abstract
Data shift is a phenomenon present in many real-world applications, and while there are multiple methods attempting to detect shifts, the task of localizing and correcting the features originating such shifts has not been studied in depth. Feature shifts can occur in many datasets, including in multi-sensor data, where some sensors are malfunctioning, or in tabular and structured data, including biomedical, financial, and survey data, where faulty standardization and data processing pipelines can lead to erroneous features. In this work, we explore using the principles of adversarial learning, where the information from several discriminators trained to distinguish between two distributions is used to both detect the corrupted features and fix them in order to remove the distribution shift between datasets. We show that mainstream supervised classifiers, such as random forest or gradient boosting trees, combined with simple iterative heuristics, can localize and correct feature shifts, outperforming current statistical and neural network-based techniques. The code is available at https://github.com/AI-sandbox/DataFix.
摘要
数据偏移是许多真实应用中都存在的现象。尽管已有多种方法尝试检测偏移,但定位并修复引起偏移的特征这一任务尚未得到深入研究。特征偏移可能出现在多种数据集中,包括部分传感器发生故障的多传感器数据,以及表格和结构化数据(如生物医学、金融和调查数据),其中错误的标准化和数据处理流程会导致特征出错。在这项工作中,我们探索利用对抗学习的原则:利用多个经过训练以区分两个分布的判别器所提供的信息,既检测受损特征,又对其进行修复,从而消除数据集之间的分布偏移。我们表明,主流的监督分类器(如随机森林或梯度提升树)与简单的迭代启发式相结合,即可定位并修复特征偏移,优于当前基于统计和神经网络的技术。代码见 https://github.com/AI-sandbox/DataFix。
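The detect-then-correct loop described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the paper's implementation: a single-threshold "stump" discriminator stands in for the random-forest discriminators, and the correction simply resamples the flagged column from the reference data.

```python
import random

def stump_accuracy(a, b):
    """Best single-threshold classifier accuracy for telling list a from list b
    (one feature). Near 0.5 means the two samples are indistinguishable."""
    best = 0.5
    for t in sorted(set(a) | set(b)):
        acc = (sum(x < t for x in a) + sum(x >= t for x in b)) / (len(a) + len(b))
        best = max(best, acc, 1 - acc)  # also try the flipped decision rule
    return best

def detect_and_correct(ref, qry, threshold=0.7, max_iter=5, seed=0):
    """Iteratively flag the most discriminative feature and overwrite it with
    values resampled from the reference set, until no feature separates the sets."""
    rng = random.Random(seed)
    qry = [row[:] for row in qry]
    n_feat = len(ref[0])
    flagged = []
    for _ in range(max_iter):
        accs = [stump_accuracy([r[j] for r in ref], [q[j] for q in qry])
                for j in range(n_feat)]
        j = max(range(n_feat), key=lambda k: accs[k])
        if accs[j] < threshold:
            break  # no remaining feature lets the discriminator beat chance
        flagged.append(j)
        ref_col = [r[j] for r in ref]
        for q in qry:
            q[j] = rng.choice(ref_col)  # crude correction: resample from reference
    return flagged, qry

# toy data: feature 1 of the query set is shifted by +5
rng = random.Random(1)
ref = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
qry = [[rng.gauss(0, 1), rng.gauss(5, 1)] for _ in range(200)]
flagged, fixed = detect_and_correct(ref, qry)
print(flagged)  # expected: [1]
```

After the correction step, the stump discriminator can no longer separate the two sets on the fixed column, which is the stopping criterion the loop relies on.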
Sim-to-Real Causal Transfer: A Metric Learning Approach to Causally-Aware Interaction Representations
for: This paper focuses on modeling spatial-temporal interactions among neighboring agents in multi-agent problems, and investigates the causal relationships between agents.
methods: The paper introduces a metric learning approach that regularizes latent representations with causal annotations, and proposes a sim-to-real causal transfer method via cross-domain multi-task learning.
results: The paper shows that the proposed approach leads to higher degrees of causal awareness and stronger out-of-distribution robustness, and can substantially boost generalization even in the absence of real-world causal annotations.Abstract
Modeling spatial-temporal interactions among neighboring agents is at the heart of multi-agent problems such as motion forecasting and crowd navigation. Despite notable progress, it remains unclear to which extent modern representations can capture the causal relationships behind agent interactions. In this work, we take an in-depth look at the causal awareness of these representations, from computational formalism to real-world practice. First, we cast doubt on the notion of non-causal robustness studied in the recent CausalAgents benchmark. We show that recent representations are already partially resilient to perturbations of non-causal agents, and yet modeling indirect causal effects involving mediator agents remains challenging. To address this challenge, we introduce a metric learning approach that regularizes latent representations with causal annotations. Our controlled experiments show that this approach not only leads to higher degrees of causal awareness but also yields stronger out-of-distribution robustness. To further operationalize it in practice, we propose a sim-to-real causal transfer method via cross-domain multi-task learning. Experiments on pedestrian datasets show that our method can substantially boost generalization, even in the absence of real-world causal annotations. We hope our work provides a new perspective on the challenges and potential pathways towards causally-aware representations of multi-agent interactions. Our code is available at https://github.com/socialcausality.
摘要
首先,我们质疑近期 CausalAgents 基准中所研究的非因果鲁棒性概念。我们表明,现有表示已经对非因果智能体的扰动具有一定韧性,但建模涉及中介智能体的间接因果效应仍然具有挑战性。为了解决这一挑战,我们提出了一种利用因果标注对潜在表示进行正则化的度量学习方法。受控实验表明,该方法不仅能带来更高程度的因果意识,还能产生更强的分布外鲁棒性。为了进一步将其应用于实践,我们提出了一种基于跨领域多任务学习的仿真到现实因果迁移方法。在行人数据集上的实验表明,即使没有真实世界的因果标注,我们的方法也能显著提升泛化能力。我们希望这项工作能为多智能体交互的因果感知表示所面临的挑战与潜在路径提供新的视角。代码见 https://github.com/socialcausality。
Using Large Language Models for Hyperparameter Optimization
paper_authors: Michael R. Zhang, Nishkrit Desai, Juhan Bae, Jonathan Lorraine, Jimmy Ba
for: 这篇论文研究了使用基础大语言模型(LLM)在超参数优化(HPO)过程中进行决策。
methods: 论文通过实证评估表明,在搜索预算受限的情况下,LLM 在标准基准上的表现可以与随机搜索和贝叶斯优化等传统 HPO 方法相当甚至更好。此外,论文还提议将指定模型的代码本身视为一个超参数,由 LLM 直接输出,这超越了现有 HPO 方法的能力。
results: 研究结果表明,LLM 是一种有前景的工具,可以提升超参数优化这一传统决策问题的效率。Abstract
This paper studies using foundational large language models (LLMs) to make decisions during hyperparameter optimization (HPO). Empirical evaluations demonstrate that in settings with constrained search budgets, LLMs can perform comparably or better than traditional HPO methods like random search and Bayesian optimization on standard benchmarks. Furthermore, we propose to treat the code specifying our model as a hyperparameter, which the LLM outputs, going beyond the capabilities of existing HPO approaches. Our findings suggest that LLMs are a promising tool for improving efficiency in the traditional decision-making problem of hyperparameter optimization.
摘要
这篇论文研究了使用基础大语言模型(LLM)在超参数优化(HPO)中进行决策。实证评估表明,在搜索预算受限的情况下,LLM 在标准基准上的表现可以与随机搜索和贝叶斯优化等传统 HPO 方法相当甚至更好。此外,论文还提议将指定模型的代码视为超参数,由 LLM 输出,从而超越现有 HPO 方法的能力。结果表明,LLM 是提升超参数优化这一传统决策问题效率的有前景工具。
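The decision loop the paper describes — show the LLM the evaluation history, ask it for the next configuration — can be sketched as follows. The `query_llm` function is a hypothetical stand-in, mocked here with a rule-based proposer that perturbs the best configuration seen so far; a real system would send the prompt to an LLM API. The objective is a toy surrogate for validation loss.

```python
import json, math, random

def query_llm(prompt):
    """Hypothetical LLM call, mocked for the sketch: parse the history out of the
    prompt and propose a config near the best one so far."""
    history = json.loads(prompt.split("HISTORY:")[1])
    rng = random.Random(len(history))
    if not history:
        return {"lr": 0.1, "wd": 1e-4}
    best = min(history, key=lambda h: h["loss"])["config"]
    return {"lr": best["lr"] * rng.choice([0.5, 1.0, 2.0]),
            "wd": best["wd"] * rng.choice([0.5, 1.0, 2.0])}

def objective(cfg):
    # toy "validation loss", minimized near lr=0.02, wd=1e-3
    return (math.log10(cfg["lr"]) + 1.7) ** 2 + (math.log10(cfg["wd"]) + 3) ** 2

def llm_hpo(budget=10):
    """Constrained-budget HPO loop where the proposer sees the full history."""
    history = []
    for _ in range(budget):
        prompt = "Propose hyperparameters. HISTORY:" + json.dumps(history)
        cfg = query_llm(prompt)
        history.append({"config": cfg, "loss": objective(cfg)})
    return min(history, key=lambda h: h["loss"])

best = llm_hpo()
```

The "code as a hyperparameter" idea from the paper would extend this so that `query_llm` returns a code string defining the model itself, not just numeric settings.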
Coordination-free Decentralised Federated Learning on Complex Networks: Overcoming Heterogeneity
results: 结果显示,所提出的 DFL 算法训练得到的本地模型泛化能力更强,且通信效率更高。Abstract
Federated Learning (FL) is a well-known framework for successfully performing a learning task in an edge computing scenario where the devices involved have limited resources and incomplete data representation. The basic assumption of FL is that the devices communicate directly or indirectly with a parameter server that centrally coordinates the whole process, overcoming several challenges associated with it. However, in highly pervasive edge scenarios, the presence of a central controller that oversees the process cannot always be guaranteed, and the interactions (i.e., the connectivity graph) between devices might not be predetermined, resulting in a complex network structure. Moreover, the heterogeneity of data and devices further complicates the learning process. This poses new challenges from a learning standpoint that we address by proposing a communication-efficient Decentralised Federated Learning (DFL) algorithm able to cope with them. Our solution allows devices communicating only with their direct neighbours to train an accurate model, overcoming the heterogeneity induced by data and different training histories. Our results show that the resulting local models generalise better than those trained with competing approaches, and do so in a more communication-efficient way.
Graph Metanetworks for Processing Diverse Neural Architectures
results: 本文在多个元网络任务上验证了GMNs的效果,并证明了它的通用性和可扩展性。Abstract
Neural networks efficiently encode learned information within their parameters. Consequently, many tasks can be unified by treating neural networks themselves as input data. When doing so, recent studies demonstrated the importance of accounting for the symmetries and geometry of parameter spaces. However, those works developed architectures tailored to specific networks such as MLPs and CNNs without normalization layers, and generalizing such architectures to other types of networks can be challenging. In this work, we overcome these challenges by building new metanetworks - neural networks that take weights from other neural networks as input. Put simply, we carefully build graphs representing the input neural networks and process the graphs using graph neural networks. Our approach, Graph Metanetworks (GMNs), generalizes to neural architectures where competing methods struggle, such as multi-head attention layers, normalization layers, convolutional layers, ResNet blocks, and group-equivariant linear layers. We prove that GMNs are expressive and equivariant to parameter permutation symmetries that leave the input neural network functions unchanged. We validate the effectiveness of our method on several metanetwork tasks over diverse neural network architectures.
摘要
神经网络将学到的信息高效地编码在其参数中。因此,许多任务可以通过将神经网络本身视为输入数据来统一处理。在这样做时,近期研究表明必须考虑参数空间的对称性和几何结构。然而,这些工作所设计的架构只针对特定网络(如不含归一化层的 MLP 和 CNN),难以推广到其他类型的网络。在本工作中,我们通过构建新的元网络(即以其他神经网络的权重作为输入的神经网络)克服了这些挑战。简单来说,我们精心构建表示输入神经网络的图,再用图神经网络处理这些图。我们的方法 Graph Metanetworks(GMNs)可以推广到其他方法难以处理的神经架构,包括多头注意力层、归一化层、卷积层、ResNet 块和群等变线性层。我们证明 GMNs 具有表达能力,并且对保持输入神经网络函数不变的参数置换对称性是等变的。我们在多种神经网络架构上的多个元网络任务中验证了方法的有效性。
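The core representation step — turning a network's parameters into a graph that a GNN can process — can be illustrated for a plain MLP. This is a simplified sketch of the general idea, not the paper's construction: each neuron becomes a node (with its bias as a node feature), and each weight becomes an edge feature between the two neurons it connects.

```python
def mlp_to_graph(weight_matrices, biases):
    """Turn MLP parameters into (nodes, edges): one node per neuron, one edge per
    weight; biases become node features. weight_matrices[l] has shape [out][in]."""
    nodes, edges = [], []
    n_in = len(weight_matrices[0][0])
    nodes += [(("layer0", i), {"bias": 0.0}) for i in range(n_in)]  # input neurons
    for l, (W, b) in enumerate(zip(weight_matrices, biases), start=1):
        nodes += [((f"layer{l}", j), {"bias": b[j]}) for j in range(len(W))]
        for j, row in enumerate(W):
            for i, w in enumerate(row):
                edges.append(((f"layer{l - 1}", i), (f"layer{l}", j), {"weight": w}))
    return nodes, edges

# 2-3-1 MLP: 2 + 3 + 1 = 6 nodes, 2*3 + 3*1 = 9 edges
W1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # [3][2]
W2 = [[1.0, -1.0, 0.5]]                      # [1][3]
nodes, edges = mlp_to_graph([W1, W2], [[0.0, 0.0, 0.0], [0.1]])
print(len(nodes), len(edges))  # 6 9
```

Permuting hidden neurons of the MLP only relabels nodes of this graph, which is why a permutation-equivariant GNN operating on it inherits the parameter-permutation symmetry the paper proves.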
AVA: Towards Autonomous Visualization Agents through Visual Perception-Driven Decision-Making
methods: 这篇论文利用多模态基础模型(MMFM):以往仅处理文本的大语言模型(LLM)如今能够接受视觉输入,为可视化领域带来了前所未有的应用机会。
results: 这篇论文提出了设计 AVA 的框架,并提供了多个实际应用场景以展示其通用性。初步探索和概念验证表明,该方法具有广泛的适用性,可以帮助领域专家实现高层次的可视化目标。Abstract
With recent advances in multi-modal foundation models, the previously text-only large language models (LLM) have evolved to incorporate visual input, opening up unprecedented opportunities for various applications in visualization. Our work explores the utilization of the visual perception ability of multi-modal LLMs to develop Autonomous Visualization Agents (AVAs) that can interpret and accomplish user-defined visualization objectives through natural language. We propose the first framework for the design of AVAs and present several usage scenarios intended to demonstrate the general applicability of the proposed paradigm. The addition of visual perception allows AVAs to act as the virtual visualization assistant for domain experts who may lack the knowledge or expertise in fine-tuning visualization outputs. Our preliminary exploration and proof-of-concept agents suggest that this approach can be widely applicable whenever the choices of appropriate visualization parameters require the interpretation of previous visual output. Feedback from unstructured interviews with experts in AI research, medical visualization, and radiology has been incorporated, highlighting the practicality and potential of AVAs. Our study indicates that AVAs represent a general paradigm for designing intelligent visualization systems that can achieve high-level visualization goals, which pave the way for developing expert-level visualization agents in the future.
GSGFormer: Generative Social Graph Transformer for Multimodal Pedestrian Trajectory Prediction
results: 通过多个公共数据集的评估,GSGFormer不仅在充足数据情况下超越了前方法,还在数据有限情况下保持竞争力。Abstract
Pedestrian trajectory prediction, vital for selfdriving cars and socially-aware robots, is complicated due to intricate interactions between pedestrians, their environment, and other Vulnerable Road Users. This paper presents GSGFormer, an innovative generative model adept at predicting pedestrian trajectories by considering these complex interactions and offering a plethora of potential modal behaviors. We incorporate a heterogeneous graph neural network to capture interactions between pedestrians, semantic maps, and potential destinations. The Transformer module extracts temporal features, while our novel CVAE-Residual-GMM module promotes diverse behavioral modality generation. Through evaluations on multiple public datasets, GSGFormer not only outperforms leading methods with ample data but also remains competitive when data is limited.
摘要
行人轨迹预测是自动驾驶汽车和具有社交意识的机器人的关键技术,但行人之间、行人与环境以及其他弱势道路使用者之间的复杂交互使预测变得困难。本文提出了 GSGFormer,一种创新的生成模型,能够在考虑这些复杂交互的同时,给出多种可能的模态行为。我们引入异质图神经网络来捕捉行人、语义地图与潜在目的地之间的交互;Transformer 模块提取时间特征;而我们新提出的 CVAE-Residual-GMM 模块则促进多样化行为模态的生成。在多个公共数据集上的评估表明,GSGFormer 不仅在数据充足时优于领先方法,在数据有限时也保持竞争力。
Chain of Code: Reasoning with a Language Model-Augmented Code Emulator
results: CoC 在多种基准上表现优于链式思维(Chain of Thought)和其他基线,在 BIG-Bench Hard 上达到 84%,比链式思维提高 12%。CoC 适用于大型和小型模型,并能通过"用代码思考"扩展 LM 可正确回答的推理问题范围。Abstract
Code provides a general syntactic structure to build complex programs and perform precise computations when paired with a code interpreter -- we hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for linguistic ones (and in particular, those that are a mix of both). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation for "detect_sarcasm(string)" that can be executed by the interpreter (handling the edge cases would be insurmountable). However, LMs may still produce a valid solution if they are used not only to write the code, but also to selectively "emulate" the interpreter by generating the expected output of "detect_sarcasm(string)" and other lines of code (e.g., that the interpreter could not compile). In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format linguistic sub-tasks in a program as flexible pseudocode that the compiler can explicitly catch undefined behaviors and hand off to simulate with an LM (as an "LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. CoC scales well with large and small models alike, and broadens the scope of reasoning questions that LMs can correctly answer by "thinking in code". Project webpage: https://chain-of-code.github.io/.
摘要
代码提供了一种通用的语法结构,结合代码解释器即可构建复杂程序并执行精确计算。我们假设语言模型(LM)可以借助编写代码来改进链式思维推理,不仅适用于逻辑和算术任务,也适用于语言任务(尤其是两者混合的任务)。例如,让 LM 编写代码统计一篇文章中检测到讽刺的次数:LM 可能难以写出可被解释器执行的 "detect_sarcasm(string)" 实现(处理各种边缘情况几乎不可能)。然而,如果 LM 不仅用于编写代码,还被用来有选择地"模拟"解释器,生成 "detect_sarcasm(string)" 等解释器无法执行的代码行的预期输出,它仍可能给出有效的解决方案。在这项工作中,我们提出了链式代码(Chain of Code,CoC),一种简单而出奇有效的扩展,可提升 LM 基于代码的推理能力。其关键思想是鼓励 LM 将语言子任务写成灵活的伪代码,使编译器能够显式捕获未定义行为,并将其交给 LM 模拟执行(充当 "LMulator")。实验表明,链式代码在多种基准上优于链式思维和其他基线;在 BIG-Bench Hard 上,链式代码达到 84%,比链式思维提高 12%。CoC 对大型和小型模型都具有良好的扩展性,并通过"用代码思考"拓宽了 LM 能正确回答的推理问题范围。项目主页:https://chain-of-code.github.io/。
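The interpreter/LMulator interleaving can be sketched concretely: run each line of the program in a real interpreter, and whenever execution fails (an undefined helper such as `detect_sarcasm`), hand that line to an LM emulator that supplies the expected output. This is a minimal illustration of the mechanism, with the LM replaced by a hard-coded stub; it is not the paper's implementation.

```python
def run_chain_of_code(lines, lmulate):
    """Execute pseudocode line by line; when the interpreter fails (undefined
    helper, unimplementable step), fall back to the LM emulator for that line."""
    state = {}
    for line in lines:
        try:
            exec(line, {}, state)               # try the real interpreter first
        except Exception:
            state.update(lmulate(line, state))  # hand the line to the "LMulator"
    return state

def fake_lmulate(line, state):
    """Stand-in for an LM generating the expected output of a line it cannot run.
    A real system would prompt the LM with the line and the current program state."""
    if "detect_sarcasm" in line:
        var = line.split("=")[0].strip()
        return {var: 2}   # the "LM" judges the essay to contain 2 sarcastic remarks
    raise NotImplementedError(line)

program = [
    "essay = 'Oh great, another meeting. I love those.'",
    "n = detect_sarcasm(essay)",   # no such function: the LMulator fills this in
    "report = f'{n} sarcastic sentence(s)'",
]
state = run_chain_of_code(program, fake_lmulate)
print(state["report"])  # 2 sarcastic sentence(s)
```

Note how line 3 runs in the real interpreter again, consuming the value the emulator injected into the shared state — the precise computation stays exact while only the semantic judgment is simulated.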
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
results: 在 ID 保持能力方面,基于测试时微调的方法落后于我们的 PhotoMaker;同时 PhotoMaker 具有生成质量高、生成速度快、泛化能力强和应用范围广等优点。我们的项目页面:https://photo-maker.github.io/。Abstract
Recent advances in text-to-image generation have made remarkable progress in synthesizing realistic human photos conditioned on given text prompts. However, existing personalized generation methods cannot simultaneously satisfy the requirements of high efficiency, promising identity (ID) fidelity, and flexible text controllability. In this work, we introduce PhotoMaker, an efficient personalized text-to-image generation method, which mainly encodes an arbitrary number of input ID images into a stack ID embedding for preserving ID information. Such an embedding, serving as a unified ID representation, can not only encapsulate the characteristics of the same input ID comprehensively, but also accommodate the characteristics of different IDs for subsequent integration. This paves the way for more intriguing and practically valuable applications. Besides, to drive the training of our PhotoMaker, we propose an ID-oriented data construction pipeline to assemble the training data. Under the nourishment of the dataset constructed through the proposed pipeline, our PhotoMaker demonstrates better ID preservation ability than test-time fine-tuning based methods, yet provides significant speed improvements, high-quality generation results, strong generalization capabilities, and a wide range of applications. Our project page is available at https://photo-maker.github.io/
摘要
最近,文本到图像生成技术取得了显著进展,已能根据给定文本提示合成逼真的人像照片。然而,现有的个性化生成方法无法同时满足高效率、可靠的身份(ID)保真度和灵活的文本可控性这三方面要求。在这项工作中,我们介绍了 PhotoMaker,一种高效的个性化文本到图像生成方法,其核心是将任意数量的输入 ID 图像编码为一个堆叠 ID 嵌入,以保留 ID 信息。这种嵌入作为统一的 ID 表示,不仅能够全面刻画同一输入 ID 的特征,还能容纳不同 ID 的特征以便后续融合,从而为更有趣、更具实用价值的应用铺平道路。此外,为了训练 PhotoMaker,我们提出了一种面向 ID 的数据构建管道来组装训练数据。在该管道构建的数据集的支撑下,PhotoMaker 展现出比基于测试时微调的方法更强的 ID 保持能力,同时带来显著的速度提升、高质量的生成结果、强大的泛化能力和广泛的应用场景。项目页面:https://photo-maker.github.io/。
Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use
results: 在广受认可的工具使用基准上进行的大量实验表明,Attention Buckets 方法可以提升 LLM 的工具使用性能,并达到 SOTA 水平。Abstract
Recent advancements in large language models (LLMs) have significantly expanded their functionality and skills as tool agents. In this paper, we argue that a waveform pattern in the model's attention allocation has an impact on the tool use performance, which degrades when the position of essential information hits the trough zone. To address this issue, we propose a novel inference method named Attention Buckets. This approach enables LLMs to handle context by conducting parallel processes, each featuring a unique RoPE angle base that shapes the attention waveform. Attention Buckets ensures that an attention trough of a particular process can be compensated with an attention peak of another run, reducing the risk of the LLM missing essential information residing within the attention trough. Our extensive experiments on the widely recognized tool use benchmark demonstrate the efficacy of our approach, where a 7B-parameter open-source model enhanced by Attention Buckets achieves SOTA performance on par with GPT-4.
摘要
大型语言模型(LLM)的最新进展显著扩展了它们作为工具智能体的功能和技能。在这篇论文中,我们指出模型注意力分配中的波形模式会影响工具使用性能:当关键信息的位置落在注意力波形的低谷区域时,性能会下降。为解决这一问题,我们提出了一种名为 Attention Buckets 的新推理方法。该方法让 LLM 通过并行的多个过程来处理上下文,每个过程采用不同的 RoPE 角度基数,从而塑造出不同的注意力波形。Attention Buckets 确保某个过程的注意力低谷可以由另一个过程的注意力峰值来补偿,从而降低 LLM 遗漏位于注意力低谷中的关键信息的风险。我们在广受认可的工具使用基准上进行的大量实验证明了该方法的有效性:一个经 Attention Buckets 增强的 7B 参数开源模型达到了与 GPT-4 相当的 SOTA 性能。
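Why different RoPE angle bases produce complementary waveforms can be illustrated with a toy profile. As a simplifying assumption (not the paper's exact formulation), take the attention score between identical query/key vectors at relative distance `delta` to be the mean of `cos(delta * theta_i)` over RoPE frequencies `theta_i = base ** (-2i/dim)`; the base values below are illustrative, not the paper's choices.

```python
import math

def rope_waveform(delta, base, dim=64):
    """Simplified RoPE relative-position profile: score between identical
    query/key vectors at distance delta, averaged over the rotary frequencies."""
    freqs = [base ** (-2 * i / dim) for i in range(dim // 2)]
    return sum(math.cos(delta * f) for f in freqs) / len(freqs)

bases = [10000.0, 17500.0, 25000.0]   # hypothetical angle bases for parallel runs
deltas = list(range(0, 2000, 10))
per_base = {b: [rope_waveform(d, b) for d in deltas] for b in bases}

# combining the parallel runs position-wise keeps the best-covered score at every
# distance, so a trough under one base can be compensated by a peak under another
combined = [max(per_base[b][i] for b in bases) for i in range(len(deltas))]
```

The worst trough of the combined profile can never be deeper than any single base's worst trough, which is the intuition behind running several "buckets" in parallel.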
Scalable Knowledge Graph Construction and Inference on Human Genome Variants
for: This paper is written for researchers and scientists working with large-scale genomic data, particularly those interested in using knowledge graphs for analysis and inference on vaccine-naive COVID-19 patients.
methods: The paper uses variant-level information extracted from RNA-sequencing data and represents it as a unified, large knowledge graph. The data is converted to Resource Description Framework (RDF) triples, and an ontology is defined for the VCF and CADD scores files.
results: The paper presents a case study using the knowledge graph and performs a classification task with graph machine learning, comparing different Graph Neural Networks (GNNs).Abstract
Real-world knowledge can be represented as a graph consisting of entities and relationships between the entities. The need for efficient and scalable solutions arises when dealing with vast genomic data, like RNA-sequencing. Knowledge graphs offer a powerful approach for various tasks in such large-scale genomic data, such as analysis and inference. In this work, variant-level information extracted from the RNA-sequences of vaccine-naïve COVID-19 patients have been represented as a unified, large knowledge graph. Variant call format (VCF) files containing the variant-level information were annotated to include further information for each variant. The data records in the annotated files were then converted to Resource Description Framework (RDF) triples. Each VCF file obtained had an associated CADD scores file that contained the raw and Phred-scaled scores for each variant. An ontology was defined for the VCF and CADD scores files. Using this ontology and the extracted information, a large, scalable knowledge graph was created. Available graph storage was then leveraged to query and create datasets for further downstream tasks. We also present a case study using the knowledge graph and perform a classification task using graph machine learning. We also draw comparisons between different Graph Neural Networks (GNNs) for the case study.
摘要
现实世界的知识可以表示为由实体及实体间关系构成的图。在处理庞大的基因组数据(如 RNA 测序数据)时,高效且可扩展的解决方案变得尤为重要。知识图谱为这类大规模基因组数据中的各种任务(如分析和推理)提供了一种强大的方法。在这项工作中,我们将从未接种疫苗的 COVID-19 患者的 RNA 测序数据中提取的变异级信息表示为一个统一的大型知识图谱。包含变异级信息的变异调用格式(VCF)文件经过注释,为每个变异补充了更多信息;随后,注释文件中的数据记录被转换为资源描述框架(RDF)三元组。每个 VCF 文件都有一个对应的 CADD 分数文件,其中包含每个变异的原始分数和 Phred 标度分数。我们为 VCF 和 CADD 分数文件定义了一个本体,并利用该本体和提取的信息构建了一个大型、可扩展的知识图谱。随后,我们利用现有的图存储来查询并创建用于下游任务的数据集。我们还给出了一个使用该知识图谱的案例研究,用图机器学习执行分类任务,并对不同的图神经网络(GNN)进行了比较。
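The VCF-to-RDF conversion step can be sketched as follows. To keep the example self-contained, triples are plain string tuples rather than objects from an RDF library, and the namespace and predicate names are illustrative — the paper defines its own ontology for the VCF and CADD files.

```python
NS = "http://example.org/genome#"   # illustrative namespace, not the paper's ontology

def vcf_record_to_triples(line, cadd_raw=None, cadd_phred=None):
    """Convert one VCF data line into RDF-style (subject, predicate, object)
    triples, optionally attaching the variant's CADD scores."""
    # the first five fixed VCF columns: CHROM, POS, ID, REF, ALT
    chrom, pos, vid, ref, alt = line.split("\t")[:5]
    subj = f"{NS}variant/{chrom}_{pos}_{ref}_{alt}"
    triples = [
        (subj, NS + "chromosome", chrom),
        (subj, NS + "position", pos),
        (subj, NS + "referenceAllele", ref),
        (subj, NS + "alternateAllele", alt),
    ]
    if vid != ".":   # "." marks a missing ID in VCF
        triples.append((subj, NS + "rsId", vid))
    if cadd_raw is not None:
        triples.append((subj, NS + "caddRawScore", str(cadd_raw)))
    if cadd_phred is not None:
        triples.append((subj, NS + "caddPhredScore", str(cadd_phred)))
    return triples

record = "chr1\t10177\trs367896724\tA\tAC"
triples = vcf_record_to_triples(record, cadd_raw=0.118, cadd_phred=4.56)
print(len(triples))  # 7
```

In a production pipeline these tuples would be serialized (e.g. as N-Triples) and loaded into the graph store for querying.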
results: 这篇论文指出多赢家选举在时间维度上存在诸多尚未解决的问题,提出了一个研究这些问题的统一框架,并指明了多个未来研究方向,以及多赢家选举在时间设定下的发展愿景。Abstract
Multiwinner voting captures a wide variety of settings, from parliamentary elections in democratic systems to product placement in online shopping platforms. There is a large body of work dealing with axiomatic characterizations, computational complexity, and algorithmic analysis of multiwinner voting rules. Although many challenges remain, significant progress has been made in showing existence of fair and representative outcomes as well as efficient algorithmic solutions for many commonly studied settings. However, much of this work focuses on single-shot elections, even though in numerous real-world settings elections are held periodically and repeatedly. Hence, it is imperative to extend the study of multiwinner voting to temporal settings. Recently, there have been several efforts to address this challenge. However, these works are difficult to compare, as they model multi-period voting in very different ways. We propose a unified framework for studying temporal fairness in this domain, drawing connections with various existing bodies of work, and consolidating them within a general framework. We also identify gaps in existing literature, outline multiple opportunities for future work, and put forward a vision for the future of multiwinner voting in temporal settings.
摘要
多赢家投票涵盖了从民主体制中的议会选举到在线购物平台上的商品陈列等各种场景。关于多赢家投票规则的公理化刻画、计算复杂性和算法分析已有大量研究。尽管仍存在许多挑战,但在证明公平且具代表性的结果的存在性,以及为许多常见场景提供高效算法解法方面,已取得显著进展。然而,这些工作大多只关注单次选举,而在众多现实场景中,选举是周期性、重复进行的。因此,将多赢家投票的研究扩展到时间设定下势在必行。最近已有若干工作尝试应对这一挑战,但由于它们对多期投票的建模方式差异很大,彼此难以比较。我们提出了一个研究该领域时间公平性的统一框架,将其与多个现有研究体系联系起来,并在一个通用框架内加以整合。我们还指出了现有文献中的空白,勾勒了多个未来研究方向,并提出了多赢家投票在时间设定下的发展愿景。
Intelligent Anomaly Detection for Lane Rendering Using Transformer with Self-Supervised Pre-Training and Customized Fine-Tuning
paper_authors: Yongqi Dong, Xingmin Lu, Ruohan Li, Wei Song, Bart van Arem, Haneen Farah
for: The paper aims to accurately and effectively detect anomalies in lane rendering map images in digital navigation systems.
methods: The proposed pipeline consists of four phases: data pre-processing, self-supervised pre-training with the masked image modeling (MiM) method, customized fine-tuning using cross-entropy based loss with label smoothing, and post-processing. The pipeline leverages state-of-the-art deep learning techniques, especially those involving Transformer models.
results: The proposed pipeline exhibits superior performance in lane rendering image anomaly detection, with the self-supervised pre-training with MiM significantly enhancing the detection accuracy while reducing the total training time. Specifically, the Swin Transformer with Uniform Masking as self-supervised pre-training (Swin-Trans-UM) achieved an accuracy of 94.77% and an AUC score of 0.9743, outperforming the pure Swin Transformer without pre-training (Swin-Trans), which reached an accuracy of 94.01% and an AUC of 0.9498. The fine-tuning epochs were reduced from 280 to 41.Abstract
The burgeoning navigation services using digital maps provide great convenience to drivers. Nevertheless, the presence of anomalies in lane rendering map images occasionally introduces potential hazards, as such anomalies can be misleading to human drivers and consequently contribute to unsafe driving conditions. In response to this concern and to accurately and effectively detect the anomalies, this paper transforms lane rendering image anomaly detection into a classification problem and proposes a four-phase pipeline consisting of data pre-processing, self-supervised pre-training with the masked image modeling (MiM) method, customized fine-tuning using cross-entropy based loss with label smoothing, and post-processing to tackle it leveraging state-of-the-art deep learning techniques, especially those involving Transformer models. Various experiments verify the effectiveness of the proposed pipeline. Results indicate that the proposed pipeline exhibits superior performance in lane rendering image anomaly detection, and notably, the self-supervised pre-training with MiM can greatly enhance the detection accuracy while significantly reducing the total training time. For instance, employing the Swin Transformer with Uniform Masking as self-supervised pretraining (Swin-Trans-UM) yielded a heightened accuracy at 94.77% and an improved Area Under The Curve (AUC) score of 0.9743 compared with the pure Swin Transformer without pre-training (Swin-Trans) with an accuracy of 94.01% and an AUC of 0.9498. The fine-tuning epochs were dramatically reduced to 41 from the original 280. In conclusion, the proposed pipeline, with its incorporation of self-supervised pre-training using MiM and other advanced deep learning techniques, emerges as a robust solution for enhancing the accuracy and efficiency of lane rendering image anomaly detection in digital navigation systems.
摘要
随着使用数字地图的导航服务日益普及,司机获得了极大便利。然而,车道渲染地图图像中偶尔出现的异常会带来潜在危险:这些异常可能误导人类驾驶员,进而造成不安全的驾驶状况。为了应对这一问题并准确有效地检测异常,本文将车道渲染图像异常检测转化为一个分类问题,并提出了一个四阶段管道:数据预处理、采用掩码图像建模(MiM)方法的自监督预训练、基于交叉熵损失并结合标签平滑的定制微调,以及后处理。该管道利用了最先进的深度学习技术,尤其是 Transformer 模型。各种实验验证了所提管道的有效性。结果表明,该管道在车道渲染图像异常检测中表现优异;值得注意的是,采用 MiM 的自监督预训练能显著提高检测精度,同时大幅缩短总训练时间。例如,采用 Uniform Masking 自监督预训练的 Swin Transformer(Swin-Trans-UM)达到了 94.77% 的准确率和 0.9743 的 AUC 分数,优于未经预训练的纯 Swin Transformer(Swin-Trans)的 94.01% 准确率和 0.9498 的 AUC;微调轮数也从原来的 280 大幅降至 41。总之,所提管道结合了基于 MiM 的自监督预训练等先进深度学习技术,是提升数字导航系统中车道渲染图像异常检测精度与效率的可靠解决方案。
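The fine-tuning loss in phase three — cross-entropy with label smoothing — has a simple closed form: the one-hot target is replaced by `(1 - eps)` on the true class plus `eps / K` spread uniformly over all `K` classes. A minimal pure-Python sketch (framework implementations such as PyTorch's `CrossEntropyLoss(label_smoothing=...)` compute the same quantity):

```python
import math

def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy with label smoothing over raw logits.
    eps=0 recovers the standard cross-entropy loss."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    k = len(logits)
    soft_target = [eps / k + (1 - eps) * (i == target) for i in range(k)]
    return -sum(t * lp for t, lp in zip(soft_target, log_probs))

logits = [4.0, 0.5, -1.0]   # confident, correct prediction for class 0
plain = smoothed_cross_entropy(logits, target=0, eps=0.0)
smoothed = smoothed_cross_entropy(logits, target=0, eps=0.1)
print(smoothed > plain)  # smoothing penalizes over-confident predictions: True
```

Discouraging over-confident logits in this way tends to improve calibration, which is presumably why the pipeline pairs it with anomaly classification.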
Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization
results: 在线和离线强化学习实验均表明,与其他不确定性估计方法相比,QU-SAC 算法具有更优的性能。Abstract
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. In particular, we focus on characterizing the variance over values induced by a distribution over MDPs. Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation (UBE), but the over-approximation may result in inefficient exploration. We propose a new UBE whose solution converges to the true posterior variance over values and leads to lower regret in tabular exploration problems. We identify challenges to apply the UBE theory beyond tabular problems and propose a suitable approximation. Based on this approximation, we introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), that can be applied for either risk-seeking or risk-averse policy optimization with minimal changes. Experiments in both online and offline RL demonstrate improved performance compared to other uncertainty estimation methods.
摘要
我们考虑在基于模型的强化学习中量化期望累积回报不确定性的问题,特别是刻画由 MDP 分布所诱导的价值函数方差。先前的工作通过求解所谓的不确定性贝尔曼方程(UBE)来给出价值后验方差的上界,但这种过度近似可能导致低效的探索。我们提出了一个新的 UBE,其解收敛于真实的价值后验方差,并能在表格型探索问题中降低遗憾。我们指出了将 UBE 理论应用于表格型问题之外所面临的挑战,并提出了一种合适的近似。基于该近似,我们引入了一种通用的策略优化算法 Q-Uncertainty Soft Actor-Critic(QU-SAC),只需极少改动即可用于风险寻求或风险规避的策略优化。在线和离线强化学习实验均表明,与其他不确定性估计方法相比,QU-SAC 表现更优。
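In the tabular case, a UBE is a Bellman-style fixed-point equation that can be solved by simple iteration. The sketch below uses a generic UBE-shaped equation, U(s) = u(s) + γ² Σ_s' P̄(s'|s) U(s'), where u is a local uncertainty term and P̄ the mean transition model; the paper's specific local term differs, so this only illustrates the solution mechanics.

```python
def solve_ube(local_u, P_mean, gamma=0.9, iters=2000):
    """Fixed-point iteration for a UBE-style equation:
       U(s) = u(s) + gamma**2 * sum_s' P_mean[s][s'] * U(s').
    Converges because gamma**2 < 1 makes the update a contraction."""
    n = len(local_u)
    U = [0.0] * n
    for _ in range(iters):
        U = [local_u[s] + gamma ** 2 * sum(P_mean[s][t] * U[t] for t in range(n))
             for s in range(n)]
    return U

# 2-state chain: state 0 has local uncertainty, state 1 is absorbing and certain
local_u = [1.0, 0.0]
P_mean = [[0.5, 0.5],
          [0.0, 1.0]]
U = solve_ube(local_u, P_mean)
```

For this chain the fixed point is available in closed form — U(1) = 0 and U(0) = 1 / (1 - γ²·0.5) — which makes the iteration easy to sanity-check.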
Adversarial Denoising Diffusion Model for Unsupervised Anomaly Detection
results: Experimental results show that the proposed ADDM performs well on unsupervised anomaly detection in MRI images, outperforming other DDPM-based anomaly detection methods. Specifically, compared with other DDPM-based methods, ADDM achieves better performance with the same number of sampling steps and similar performance with 50% fewer sampling steps.
Abstract
In this paper, we propose the Adversarial Denoising Diffusion Model (ADDM). The ADDM is based on the Denoising Diffusion Probabilistic Model (DDPM) but complementarily trained by adversarial learning. The proposed adversarial learning is achieved by classifying model-based denoised samples and samples to which random Gaussian noise is added to a specific sampling step. With the addition of explicit adversarial learning on data samples, ADDM can learn the semantic characteristics of the data more robustly during training, which achieves a similar data sampling performance with much fewer sampling steps than DDPM. We apply ADDM to anomaly detection in unsupervised MRI images. Experimental results show that the proposed ADDM outperformed existing generative model-based unsupervised anomaly detection methods. In particular, compared to other DDPM-based anomaly detection methods, the proposed ADDM shows better performance with the same number of sampling steps and similar performance with 50% fewer sampling steps.
How much informative is your XAI? A decision-making assessment task to objectively measure the goodness of explanations
results: The results show that user-centred XAI systems provide more information during the interaction between users and systems, improving user engagement and satisfaction.
Abstract
There is an increasing consensus about the effectiveness of user-centred approaches in the explainable artificial intelligence (XAI) field. Indeed, the number and complexity of personalised and user-centred approaches to XAI have rapidly grown in recent years. Often, these works have a two-fold objective: (1) proposing novel XAI techniques able to consider the users and (2) assessing the \textit{goodness} of such techniques with respect to others. From these new works, it emerged that user-centred approaches to XAI positively affect the interaction between users and systems. However, so far, the goodness of XAI systems has been measured through indirect measures, such as performance. In this paper, we propose an assessment task to objectively and quantitatively measure the goodness of XAI systems in terms of their \textit{information power}, which we intended as the amount of information the system provides to the users during the interaction. Moreover, we plan to use our task to objectively compare two XAI techniques in a human-robot decision-making task to understand deeper whether user-centred approaches are more informative than classical ones.
Deep Dynamics: Vehicle Dynamics Modeling with a Physics-Informed Neural Network for Autonomous Racing
results: Open-loop and closed-loop performance assessments show that Deep Dynamics accurately predicts high-speed vehicle dynamics and promises better performance in real-world applications.
Abstract
Autonomous racing is a critical research area for autonomous driving, presenting significant challenges in vehicle dynamics modeling, such as balancing model precision and computational efficiency at high speeds (>280kmph), where minor errors in modeling have severe consequences. Existing physics-based models for vehicle dynamics require elaborate testing setups and tuning, which are hard to implement, time-intensive, and cost-prohibitive. Conversely, purely data-driven approaches do not generalize well and cannot adequately ensure physical constraints on predictions. This paper introduces Deep Dynamics, a physics-informed neural network (PINN) for vehicle dynamics modeling of an autonomous racecar. It combines physics coefficient estimation and dynamical equations to accurately predict vehicle states at high speeds and includes a unique Physics Guard layer to ensure internal coefficient estimates remain within their nominal physical ranges. Open-loop and closed-loop performance assessments, using a physics-based simulator and full-scale autonomous Indy racecar data, highlight Deep Dynamics as a promising approach for modeling racecar vehicle dynamics.
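The "Physics Guard" idea (keeping internally estimated coefficients inside their nominal physical ranges) can be sketched as a bounded squashing layer. This is one plausible construction, not the paper's actual layer, and the coefficient ranges below are invented:

```python
import numpy as np

def physics_guard(raw, low, high):
    """Map unconstrained network outputs into nominal physical ranges.

    A sigmoid squashes each raw value into (0, 1), which is then rescaled
    to (low, high), so a coefficient estimate can never leave its bounds.
    """
    raw = np.asarray(raw, dtype=float)
    return low + (high - low) / (1.0 + np.exp(-raw))

# Invented nominal ranges for two tire-model-style coefficients.
low = np.array([0.5, 1.0])
high = np.array([2.0, 3.0])

# Even extreme raw outputs stay strictly inside the ranges.
coeffs = physics_guard([-10.0, 10.0], low, high)
```

A layer like this keeps gradient-based training unconstrained while guaranteeing physically plausible coefficients at every step.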
LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs
paper_authors: Yunsheng Ma, Can Cui, Xu Cao, Wenqian Ye, Peiran Liu, Juanwu Lu, Amr Abdelraouf, Rohit Gupta, Kyungtae Han, Aniket Bera, James M. Rehg, Ziran Wang
results: Experimental results show that GPT-4 with human feedback achieves a 92.7% task completion rate and a 0.9% collision rate.
Abstract
We present LaMPilot, a novel framework for planning in the field of autonomous driving, rethinking the task as a code-generation process that leverages established behavioral primitives. This approach aims to address the challenge of interpreting and executing spontaneous user instructions such as "overtake the car ahead," which have typically posed difficulties for existing frameworks. We introduce the LaMPilot benchmark specifically designed to quantitatively evaluate the efficacy of Large Language Models (LLMs) in translating human directives into actionable driving policies. We then evaluate a wide range of state-of-the-art code generation language models on tasks from the LaMPilot Benchmark. The results of the experiments showed that GPT-4, with human feedback, achieved an impressive task completion rate of 92.7% and a minimal collision rate of 0.9%. To encourage further investigation in this area, our code and dataset will be made available.
PCoQA: Persian Conversational Question Answering Dataset
results: Through analysis and benchmarking of the PCoQA dataset, the paper finds that it features more open-ended non-factual answers, longer answers, and fewer lexical overlaps, posing new challenges for the question answering task.
Abstract
Humans seek information regarding a specific topic through performing a conversation containing a series of questions and answers. In the pursuit of conversational question answering research, we introduce the PCoQA, the first \textbf{P}ersian \textbf{Co}nversational \textbf{Q}uestion \textbf{A}nswering dataset, a resource comprising information-seeking dialogs encompassing a total of 9,026 contextually-driven questions. Each dialog involves a questioner, a responder, and a document from the Wikipedia; The questioner asks several inter-connected questions from the text and the responder provides a span of the document as the answer for each question. PCoQA is designed to present novel challenges compared to previous question answering datasets including having more open-ended non-factual answers, longer answers, and fewer lexical overlaps. This paper not only presents the comprehensive PCoQA dataset but also reports the performance of various benchmark models. Our models include baseline models and pre-trained models, which are leveraged to boost the performance of the model. The dataset and benchmarks are available at our Github page.
CLadder: A Benchmark to Assess Causal Reasoning Capabilities of Language Models
paper_authors: Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, Bernhard Schölkopf
for: This paper aims to evaluate the ability of large language models (LLMs) to perform causal reasoning, specifically in accordance with well-defined formal rules.
methods: The authors propose a new NLP task called causal inference in natural language, which is inspired by the “causal inference engine” postulated by Judea Pearl et al. They create a large dataset called CLadder, which includes causal graphs and queries, and evaluate multiple LLMs on this dataset using a bespoke chain-of-thought prompting strategy called CausalCoT.
results: The authors show that their task is highly challenging for LLMs and conduct an in-depth analysis to gain deeper insight into the causal reasoning abilities of LLMs. They also open-source their data and code for future research.
Abstract
The ability to perform causal reasoning is widely considered a core feature of intelligence. In this work, we investigate whether large language models (LLMs) can coherently reason about causality. Much of the existing work in natural language processing (NLP) focuses on evaluating commonsense causal reasoning in LLMs, thus failing to assess whether a model can perform causal inference in accordance with a set of well-defined formal rules. To address this, we propose a new NLP task, causal inference in natural language, inspired by the "causal inference engine" postulated by Judea Pearl et al. We compose a large dataset, CLadder, with 10K samples: based on a collection of causal graphs and queries (associational, interventional, and counterfactual), we obtain symbolic questions and ground-truth answers, through an oracle causal inference engine. These are then translated into natural language. We evaluate multiple LLMs on our dataset, and we introduce and evaluate a bespoke chain-of-thought prompting strategy, CausalCoT. We show that our task is highly challenging for LLMs, and we conduct an in-depth analysis to gain deeper insight into the causal reasoning abilities of LLMs. Our data is open-sourced at https://huggingface.co/datasets/causalNLP/cladder, and our code can be found at https://github.com/causalNLP/cladder.
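To make the query types concrete, here is a toy interventional query on a three-node graph Z -> X, Z -> Y, X -> Y, answered by the backdoor adjustment an oracle causal inference engine would apply; all probabilities are invented for illustration:

```python
# Toy model over binary variables with graph Z -> X, Z -> Y and X -> Y.
p_z = {0: 0.6, 1: 0.4}            # P(Z = z)
p_y1_given_xz = {                 # P(Y = 1 | X = x, Z = z)
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.4, (1, 1): 0.8,
}

def p_y1_do_x(x):
    """Backdoor adjustment: P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) P(z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)

# Interventional query and the implied average treatment effect.
ate = p_y1_do_x(1) - p_y1_do_x(0)
```

Associational queries would instead condition on X (requiring P(Z | X)), and counterfactual queries would additionally require the structural equations; the benchmark verbalizes all three kinds in natural language.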
Enhancing Medical Task Performance in GPT-4V: A Comprehensive Study on Prompt Engineering Strategies
results: The study finds that prompt engineering can improve GPT-4V's interpretative accuracy and relevance in medical imaging, and distills 10 prompt engineering techniques to help apply GPT-4V in medical settings.
Abstract
OpenAI's latest large vision-language model (LVLM), GPT-4V(ision), has piqued considerable interest for its potential in medical applications. Despite its promise, recent studies and internal reviews highlight its underperformance in specialized medical tasks. This paper explores the boundary of GPT-4V's capabilities in medicine, particularly in processing complex imaging data from endoscopies, CT scans, and MRIs etc. Leveraging open-source datasets, we assessed its foundational competencies, identifying substantial areas for enhancement. Our research emphasizes prompt engineering, an often-underutilized strategy for improving AI responsiveness. Through iterative testing, we refined the model's prompts, significantly improving its interpretative accuracy and relevance in medical imaging. From our comprehensive evaluations, we distilled 10 effective prompt engineering techniques, each fortifying GPT-4V's medical acumen. These methodical enhancements facilitate more reliable, precise, and clinically valuable insights from GPT-4V, advancing its operability in critical healthcare environments. Our findings are pivotal for those employing AI in medicine, providing clear, actionable guidance on harnessing GPT-4V's full diagnostic potential.
Causality and Explainability for Trustworthy Integrated Pest Management
methods: An advanced data analysis framework offering robust pest population predictions, interpretable pest presence predictions, actionable advice for in-season interventions, and field-specific evaluations of treatment options, in order to enhance the adoption of Integrated Pest Management (IPM)
results: Through this data analysis framework, IPM adoption rates can be increased, and farmers are provided with actionable intervention advice that helps them better respond to the challenges of pests and climate change
Abstract
Pesticides serve as a common tool in agricultural pest control but significantly contribute to the climate crisis. To combat this, Integrated Pest Management (IPM) stands as a climate-smart alternative. Despite its potential, IPM faces low adoption rates due to farmers' skepticism about its effectiveness. To address this challenge, we introduce an advanced data analysis framework tailored to enhance IPM adoption. Our framework provides i) robust pest population predictions across diverse environments with invariant and causal learning, ii) interpretable pest presence predictions using transparent models, iii) actionable advice through counterfactual explanations for in-season IPM interventions, iv) field-specific treatment effect estimations, and v) assessments of the effectiveness of our advice using causal inference. By incorporating these features, our framework aims to alleviate skepticism and encourage wider adoption of IPM practices among farmers.
Surrogate Modelling for Sea Ice Concentration using Lightweight Neural Ensemble
results: Experimental studies show that LANE-SI delivers long-term forecasts of sea ice distribution for a specific water area with quality comparable to resource-intensive physical models, and even superior for some periods of the year. In tests on the Kara Sea, LANE-SI achieved a 20% improvement over the state-of-the-art physics-based forecast system SEAS5.
Abstract
The modeling and forecasting of sea ice conditions in the Arctic region are important tasks for ship routing, offshore oil production, and environmental monitoring. We propose the adaptive surrogate modeling approach named LANE-SI (Lightweight Automated Neural Ensembling for Sea Ice) that uses ensemble of relatively simple deep learning models with different loss functions for forecasting of spatial distribution for sea ice concentration in the specified water area. Experimental studies confirm the quality of a long-term forecast based on a deep learning model fitted to the specific water area is comparable to resource-intensive physical modeling, and for some periods of the year, it is superior. We achieved a 20% improvement against the state-of-the-art physics-based forecast system SEAS5 for the Kara Sea.
MIMo: A Multi-Modal Infant Model for Studying Cognitive Development
results: This work presents MIMo's design and interfaces, together with examples illustrating its use. All code is available at https://github.com/trieschlab/MIMo
Abstract
Human intelligence and human consciousness emerge gradually during the process of cognitive development. Understanding this development is an essential aspect of understanding the human mind and may facilitate the construction of artificial minds with similar properties. Importantly, human cognitive development relies on embodied interactions with the physical and social environment, which is perceived via complementary sensory modalities. These interactions allow the developing mind to probe the causal structure of the world. This is in stark contrast to common machine learning approaches, e.g., for large language models, which are merely passively ``digesting'' large amounts of training data, but are not in control of their sensory inputs. However, computational modeling of the kind of self-determined embodied interactions that lead to human intelligence and consciousness is a formidable challenge. Here we present MIMo, an open-source multi-modal infant model for studying early cognitive development through computer simulations. MIMo's body is modeled after an 18-month-old child with detailed five-fingered hands. MIMo perceives its surroundings via binocular vision, a vestibular system, proprioception, and touch perception through a full-body virtual skin, while two different actuation models allow control of his body. We describe the design and interfaces of MIMo and provide examples illustrating its use. All code is available at https://github.com/trieschlab/MIMo .
results: This work systematically organizes and reviews previous research efforts on knowledge-driven autonomous driving and provides insights and guidance. The latest open-source resources and research progress will be continually shared at \url{https://github.com/PJLab-ADG/awesome-knowledge-driven-AD}.
Abstract
This paper explores the emerging knowledge-driven autonomous driving technologies. Our investigation highlights the limitations of current autonomous driving systems, in particular their sensitivity to data bias, difficulty in handling long-tail scenarios, and lack of interpretability. Conversely, knowledge-driven methods with the abilities of cognition, generalization and life-long learning emerge as a promising way to overcome these challenges. This paper delves into the essence of knowledge-driven autonomous driving and examines its core components: dataset \& benchmark, environment, and driver agent. By leveraging large language models, world models, neural rendering, and other advanced artificial intelligence techniques, these components collectively contribute to a more holistic, adaptive, and intelligent autonomous driving system. The paper systematically organizes and reviews previous research efforts in this area, and provides insights and guidance for future research and practical applications of autonomous driving. We will continually share the latest updates on cutting-edge developments in knowledge-driven autonomous driving along with the relevant valuable open-source resources at: \url{https://github.com/PJLab-ADG/awesome-knowledge-driven-AD}.
nerblackbox: A High-level Library for Named Entity Recognition in Python
methods: Transformer-based models, fully automated model training and evaluation, versatile model inference, fine-grained control, customizable features
results: Targeted at application-oriented developers as well as machine learning experts and researchers
Abstract
We present nerblackbox, a python library to facilitate the use of state-of-the-art transformer-based models for named entity recognition. It provides simple-to-use yet powerful methods to access data and models from a wide range of sources, for fully automated model training and evaluation as well as versatile model inference. While many technical challenges are solved and hidden from the user by default, nerblackbox also offers fine-grained control and a rich set of customizable features. It is thus targeted both at application-oriented developers as well as machine learning experts and researchers.
Extending Answer Set Programming with Rational Numbers
results: Provides a well-defined semantics and an implementation for the ASP-Core-2 standard extended with rational numbers, broadening the expressiveness and applicability of ASP.
Abstract
Answer Set Programming (ASP) is a widely used declarative programming paradigm that has shown great potential in solving complex computational problems. However, the inability to natively support non-integer arithmetic has been highlighted as a major drawback in real-world applications. This feature is crucial to accurately model and manage real-world data and information as emerged in various contexts, such as the smooth movement of video game characters, the 3D movement of mechanical arms, and data streamed by sensors. Nevertheless, extending ASP in this direction, without affecting its declarative nature and its well-defined semantics, poses non-trivial challenges; thus, no ASP system is able to reason natively with non-integer domains. Indeed, the widespread floating-point arithmetic is not applicable to the ASP case, as the reproducibility of results cannot be guaranteed and the semantics of an ASP program would not be uniquely and declaratively determined, regardless of the employed machine or solver. To overcome such limitations and in the realm of pure ASP, this paper proposes an extension of ASP in which non-integers are approximated to rational numbers, fully granting reproducibility and declarativity. We provide a well-defined semantics for the ASP-Core-2 standard extended with rational numbers and an implementation thereof. We hope this work could serve as a stepping stone towards a more expressive and versatile ASP language that can handle a broader range of real-world problems.
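The reproducibility argument can be seen in miniature: IEEE-754 floating point is not associative, so results can depend on evaluation order, whereas rational arithmetic is exact and machine-independent. A sketch in Python, standing in for a solver's internal arithmetic:

```python
from fractions import Fraction

# Floating point: the result depends on evaluation order, so a program's
# semantics would not be uniquely determined across solvers and machines.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
float_reproducible = (a == b)       # False with IEEE-754 doubles

# Rationals: exact arithmetic, identical regardless of grouping or machine.
fa = (Fraction(1, 10) + Fraction(2, 10)) + Fraction(3, 10)
fb = Fraction(1, 10) + (Fraction(2, 10) + Fraction(3, 10))
rational_reproducible = (fa == fb)  # True: both are exactly 3/5
```

This is precisely why the paper approximates non-integers by rationals rather than adopting floating-point arithmetic.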
Mastering Complex Coordination through Attention-based Dynamic Graph
results: Experiments show that DAGMIX significantly outperforms previous SOTA methods in large-scale scenarios, while also achieving promising results on other tasks.
Abstract
The coordination between agents in multi-agent systems has become a popular topic in many fields. To catch the inner relationship between agents, the graph structure is combined with existing methods and improves the results. But in large-scale tasks with numerous agents, an overly complex graph would lead to a boost in computational cost and a decline in performance. Here we present DAGMIX, a novel graph-based value factorization method. Instead of a complete graph, DAGMIX generates a dynamic graph at each time step during training, on which it realizes a more interpretable and effective combining process through the attention mechanism. Experiments show that DAGMIX significantly outperforms previous SOTA methods in large-scale scenarios, as well as achieving promising results on other tasks.
Detecting and Restoring Non-Standard Hands in Stable Diffusion Generated Images
results: Experimental results show that the pipeline effectively improves the accuracy and realism of hand images. An online demo is available at https://fixhand.yiqun.io
Abstract
We introduce a pipeline to address anatomical inaccuracies in Stable Diffusion generated hand images. The initial step involves constructing a specialized dataset, focusing on hand anomalies, to train our models effectively. A finetuned detection model is pivotal for precise identification of these anomalies, ensuring targeted correction. Body pose estimation aids in understanding hand orientation and positioning, crucial for accurate anomaly correction. The integration of ControlNet and InstructPix2Pix facilitates sophisticated inpainting and pixel-level transformation, respectively. This dual approach allows for high-fidelity image adjustments. This comprehensive approach ensures the generation of images with anatomically accurate hands, closely resembling real-world appearances. Our experimental results demonstrate the pipeline's efficacy in enhancing hand image realism in Stable Diffusion outputs. We provide an online demo at https://fixhand.yiqun.io
Graph Convolutions Enrich the Self-Attention in Transformers!
paper_authors: Jeongwhan Choi, Hyowon Wi, Jayoung Kim, Yehjin Shin, Kookjin Lee, Nathaniel Trask, Noseong Park
for: Improving the performance of Transformer models and addressing the oversmoothing problem in deep Transformers
methods: Redesigns the self-attention mechanism by interpreting the original self-attention from a graph signal processing (GSP) perspective and proposing graph-filter-based self-attention (GFSA) to learn a general yet effective filter
results: GFSA improves the performance of Transformer models in computer vision, natural language processing, graph pattern classification, speech recognition, and code classification, with a complexity only slightly larger than that of the original self-attention mechanism
Abstract
Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across various tasks in natural language processing, computer vision, time-series modeling, etc. However, one of the challenges with deep Transformer models is the oversmoothing problem, where representations across layers converge to indistinguishable values, leading to significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective. We propose graph-filter-based self-attention (GFSA) to learn a general yet effective one, whose complexity, however, is slightly larger than that of the original self-attention mechanism. We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph pattern classification, speech recognition, and code classification.
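The graph-filter view can be sketched numerically: a row-stochastic attention matrix acts as a graph shift operator, and GFSA-style filtering replaces it with a low-order matrix polynomial. The filter coefficients below are invented, and the usual query/key/value projections are omitted:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 4, 8                           # tokens and feature dimension (toy)
X = rng.normal(size=(n, d))           # token representations
A = softmax(rng.normal(size=(n, n)))  # row-stochastic self-attention matrix

# Ordinary self-attention mixes tokens with A alone. The graph-filter view
# generalizes this to a learned matrix polynomial h(A) = w0*I + w1*A + w2*A^2.
w0, w1, w2 = 0.2, 0.7, 0.1            # invented filter coefficients
H = w0 * np.eye(n) + w1 * A + w2 * (A @ A)
out = H @ X                           # filtered (mixed) token representations
```

The identity term preserves each token's own features, which is one intuition for why such filters can counteract oversmoothing across layers.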
Adventures of Trustworthy Vision-Language Models: A Survey
paper_authors: Mayank Vatsa, Anubhooti Jain, Richa Singh
for: The primary goal of this paper is to examine the trustworthiness and accountability of vision-language transformers, so as to better understand and manage their use in real-world applications.
methods: The paper assesses vision-language transformers against three fundamental principles of responsible AI: bias, robustness, and interpretability.
results: Through an in-depth analysis of vision-language transformers in practical use, the paper advances our understanding of how to enhance their reliability and accountability across tasks and domains.
Abstract
Recently, transformers have become incredibly popular in computer vision and vision-language tasks. This notable rise in their usage can be primarily attributed to the capabilities offered by attention mechanisms and the outstanding ability of transformers to adapt and apply themselves to a variety of tasks and domains. Their versatility and state-of-the-art performance have established them as indispensable tools for a wide array of applications. However, in the constantly changing landscape of machine learning, the assurance of the trustworthiness of transformers holds utmost importance. This paper conducts a thorough examination of vision-language transformers, employing three fundamental principles of responsible AI: Bias, Robustness, and Interpretability. The primary objective of this paper is to delve into the intricacies and complexities associated with the practical use of transformers, with the overarching goal of advancing our comprehension of how to enhance their reliability and accountability.
Dynamic Data-Driven Digital Twins for Blockchain Systems
results: The study shows that a DDDAS feedback loop combined with Reinforcement Learning agents can effectively optimize blockchain system performance while reducing the computational overhead of the decision-making process.
Abstract
In recent years, we have seen an increase in the adoption of blockchain-based systems in non-financial applications, looking to benefit from what the technology has to offer. Although many fields have managed to include blockchain in their core functionalities, the adoption of blockchain, in general, is constrained by the so-called trilemma trade-off between decentralization, scalability, and security. In our previous work, we have shown that using a digital twin for dynamically managing blockchain systems during runtime can be effective in managing the trilemma trade-off. Our Digital Twin leverages DDDAS feedback loop, which is responsible for getting the data from the system to the digital twin, conducting optimisation, and updating the physical system. This paper examines how leveraging DDDAS feedback loop can support the optimisation component of the trilemma benefiting from Reinforcement Learning agents and a simulation component to augment the quality of the learned model while reducing the computational overhead required for decision-making.
Constraint Model for the Satellite Image Mosaic Selection Problem
paper_authors: Manuel Combarro Simón, Pierre Talbot, Grégoire Danoy, Jedrzej Musial, Mohammed Alswaitti, Pascal Bouvry
for: study and monitor different regions of the Earth using satellite imagery
methods: constraint and mixed integer linear programming formulation of the satellite image mosaic selection problem, a multi-objective extension of the polygon cover problem
results: proposed a dataset of realistic and challenging instances, evaluated and compared the two proposed models, and showed their efficiency for large instances of up to 200 images.
Abstract
Satellite imagery solutions are widely used to study and monitor different regions of the Earth. However, a single satellite image can cover only a limited area. In cases where a larger area of interest is studied, several images must be stitched together to create a single larger image, called a mosaic, that can cover the area. Today, with the increasing number of satellite images available for commercial use, selecting the images to build the mosaic is challenging, especially when the user wants to optimize one or more parameters, such as the total cost and the cloud coverage percentage in the mosaic. More precisely, for this problem the input is an area of interest, several satellite images intersecting the area, a list of requirements relative to the image and the mosaic, such as cloud coverage percentage, image resolution, and a list of objectives to optimize. We contribute to the constraint and mixed integer linear programming formulation of this new problem, which we call the \textit{satellite image mosaic selection problem}, which is a multi-objective extension of the polygon cover problem. We propose a dataset of realistic and challenging instances, where the images were captured by the satellite constellations SPOT, Pl\'eiades and Pl\'eiades Neo. We evaluate and compare the two proposed models and show their efficiency for large instances, up to 200 images.
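The underlying covering structure can be illustrated with a toy greedy heuristic over invented image footprints (the paper itself solves the problem exactly with constraint and MILP models; the names, cells, and cloud percentages below are made up):

```python
# Greedy sketch: cover an area of interest (a set of grid cells) while
# trading off newly covered area against cloud coverage.
area = set(range(10))                        # cells to cover
images = {                                   # image -> (cells, cloud %)
    "img_a": ({0, 1, 2, 3, 4}, 30.0),
    "img_b": ({3, 4, 5, 6}, 5.0),
    "img_c": ({6, 7, 8, 9}, 10.0),
    "img_d": ({0, 1, 2}, 2.0),
}

def greedy_mosaic(area, images, cloud_weight=0.1):
    uncovered, chosen = set(area), []
    while uncovered:
        # Score = newly covered cells minus a cloudiness penalty.
        name = max(images, key=lambda k: len(images[k][0] & uncovered)
                                         - cloud_weight * images[k][1])
        cells, _ = images[name]
        if not cells & uncovered:
            break                            # area cannot be fully covered
        chosen.append(name)
        uncovered -= cells
    return chosen, uncovered

chosen, left = greedy_mosaic(area, images)
```

A heuristic like this gives no optimality guarantee and handles a single scalarized objective, which is why the exact multi-objective models in the paper are needed for real instances.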
Joint-Individual Fusion Structure with Fusion Attention Module for Multi-Modal Skin Cancer Classification
paper_authors: Peng Tang, Xintong Yan, Yang Nan, Xiaobin Hu, Bjoern H. Menze, Sebastian Krammer, Tobias Lasser
for: improving the accuracy of skin cancer classification by using dermatological images together with patient metadata.
methods: a new fusion approach that combines multi-modal image data with patient metadata, via a joint-individual fusion structure and a fusion attention module, to raise classification accuracy.
results: experiments show that the proposed fusion approach improves skin cancer classification accuracy and outperforms existing fusion methods on all three public datasets.
Abstract
Most convolutional neural network (CNN) based methods for skin cancer classification obtain their results using only dermatological images. Although good classification results have been shown, more accurate results can be achieved by considering the patient's metadata, which is valuable clinical information for dermatologists. Current methods only use a simple joint fusion structure (FS) and fusion modules (FMs) for multi-modal classification, so there is still room to increase accuracy by exploring more advanced FSs and FMs. Therefore, in this paper, we design a new fusion method that combines dermatological images (dermoscopy images or clinical images) and patient metadata for skin cancer classification from the perspectives of FS and FM. First, we propose a joint-individual fusion (JIF) structure that learns the shared features of multi-modality data and preserves specific features simultaneously. Second, we introduce a fusion attention (FA) module that enhances the most relevant image and metadata features based on both the self and mutual attention mechanisms to support the decision-making pipeline. We compare the proposed JIF-MMFA method with other state-of-the-art fusion methods on three different public datasets. The results show that our JIF-MMFA method improves the classification results for all tested CNN backbones and performs better than the other fusion methods on the three public datasets, demonstrating our method's effectiveness and robustness.
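As a rough illustration of the mutual-attention idea behind such a fusion module, the following dependency-free sketch cross-attends image and metadata feature vectors. It is our own toy stand-in, not the paper's FA module, which also includes self-attention and learned projections.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def mutual_attention(img_feats, meta_feats):
    """Cross-attend: reweight each image feature by its similarity to the
    metadata features, and vice versa, then concatenate both directions.
    A toy, parameter-free stand-in for a fusion attention (FA) module."""
    def attend(queries, keys):
        out = []
        for q in queries:
            scores = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
            ctx = [sum(w * k[d] for w, k in zip(scores, keys))
                   for d in range(len(q))]
            out.append([qi + ci for qi, ci in zip(q, ctx)])  # residual add
        return out
    return attend(img_feats, meta_feats) + attend(meta_feats, img_feats)
```

A real module would learn query/key/value projections; here the raw vectors play all three roles to keep the mechanism visible.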
AI and Jobs: Has the Inflection Point Arrived? Evidence from an Online Labor Platform
for: examining the performance of statistical AI in human tasks and proposing a three-phase visual framework to understand the evolving relation between AI and jobs.
methods: uses a simple economic model of competition to show the existence of an inflection point for each occupation, and studies the impact of AI performance on workers in two occupations (translation and web development) on a large online labor platform.
results: the launch of ChatGPT, which significantly improved AI performance on many tasks, negatively affected translators in terms of both the number of accepted jobs and earnings, while positively affecting web developers in the number of accepted jobs, but not earnings.
Abstract
Artificial intelligence (AI) refers to the ability of machines or software to mimic or even surpass human intelligence in a given cognitive task. While humans learn by both induction and deduction, the success of current AI is rooted in induction, relying on its ability to detect statistical regularities in task input -- an ability learnt from a vast amount of training data using enormous computation resources. We examine the performance of such a statistical AI in a human task through the lens of four factors, including task learnability, statistical resource, computation resource, and learning techniques, and then propose a three-phase visual framework to understand the evolving relation between AI and jobs. Based on this conceptual framework, we develop a simple economic model of competition to show the existence of an inflection point for each occupation. Before AI performance crosses the inflection point, human workers always benefit from an improvement in AI performance, but after the inflection point, human workers become worse off whenever such an improvement occurs. To offer empirical evidence, we first argue that AI performance has passed the inflection point for the occupation of translation but not for the occupation of web development. We then study how the launch of ChatGPT, which led to significant improvement of AI performance on many tasks, has affected workers in these two occupations on a large online labor platform. Consistent with the inflection point conjecture, we find that translators are negatively affected by the shock both in terms of the number of accepted jobs and the earnings from those jobs, while web developers are positively affected by the very same shock. Given the potentially large disruption of AI on employment, more studies on more occupations using data from different platforms are urgently needed.
Augmentation-Free Dense Contrastive Knowledge Distillation for Efficient Semantic Segmentation
results: extensive experiments on five mainstream benchmarks demonstrate the effectiveness of the method. For example, DeepLabV3-Res18|DeepLabV3-MBV2 models trained with Af-DCD reach 77.03%|76.38% mIOU on the Cityscapes dataset, setting new performance records; Af-DCD also achieves absolute mIOU improvements over individually trained counterparts across different teacher-student network pairs.
Abstract
In recent years, knowledge distillation methods based on contrastive learning have achieved promising results on image classification and object detection tasks. However, in this line of research, we note that less attention is paid to semantic segmentation. Existing methods heavily rely on data augmentation and memory buffers, which entail high computational resource demands when applied to semantic segmentation, a task that requires preserving high-resolution feature maps for making dense pixel-wise predictions. In order to address this problem, we present Augmentation-free Dense Contrastive Knowledge Distillation (Af-DCD), a new contrastive distillation learning paradigm to train compact and accurate deep neural networks for semantic segmentation applications. Af-DCD leverages a masked feature mimicking strategy, and formulates a novel contrastive learning loss via taking advantage of tactful feature partitions across both channel and spatial dimensions, allowing it to effectively transfer dense and structured local knowledge learnt by the teacher model to a target student model while maintaining training efficiency. Extensive experiments on five mainstream benchmarks with various teacher-student network pairs demonstrate the effectiveness of our approach. For instance, the DeepLabV3-Res18|DeepLabV3-MBV2 model trained by Af-DCD reaches 77.03%|76.38% mIOU on the Cityscapes dataset when choosing DeepLabV3-Res101 as the teacher, setting new performance records. Besides that, Af-DCD achieves an absolute mIOU improvement of 3.26%|3.04%|2.75%|2.30%|1.42% compared with its individually trained counterpart on Cityscapes|Pascal VOC|Camvid|ADE20K|COCO-Stuff-164K. Code is available at https://github.com/OSVAI/Af-DCD.
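The dense contrastive objective can be sketched as an InfoNCE-style loss over matched teacher/student feature partitions: each student partition should align with the teacher partition at the same position and repel the others. This is a generic contrastive-distillation sketch under our own simplifications, not the exact Af-DCD loss (which partitions features across channel and spatial dimensions and combines it with masked feature mimicking).

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def dense_contrastive_kd_loss(student_parts, teacher_parts, tau=0.1):
    """InfoNCE over feature partitions: the positive for student partition i
    is teacher partition i; all other teacher partitions act as negatives."""
    loss = 0.0
    for i, s in enumerate(student_parts):
        logits = [cosine(s, t) / tau for t in teacher_parts]
        m = max(logits)  # log-sum-exp with max-shift for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / len(student_parts)
```

When student partitions already mimic the teacher's, the loss is near zero; mismatched partitions are penalised heavily.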
TimeDRL: Disentangled Representation Learning for Multivariate Time-Series
for: addressing the challenges of multivariate time-series data in real-world applications, such as healthcare and industry, by learning rich representations without relying on labels.
results: extensive experiments on 6 time-series forecasting datasets and 5 time-series classification datasets show that TimeDRL consistently surpasses existing representation learning methods, achieving an average 57.98% reduction in forecasting MSE and a 1.25% improvement in classification accuracy. Ablation studies further confirm the contribution of each component, and semi-supervised evaluations demonstrate its effectiveness.
Abstract
Multivariate time-series data in numerous real-world applications (e.g., healthcare and industry) are informative but challenging due to the lack of labels and high dimensionality. Recent studies in self-supervised learning have shown their potential in learning rich representations without relying on labels, yet they fall short in learning disentangled embeddings and addressing issues of inductive bias (e.g., transformation-invariance). To tackle these challenges, we propose TimeDRL, a generic multivariate time-series representation learning framework with disentangled dual-level embeddings. TimeDRL is characterized by three novel features: (i) disentangled derivation of timestamp-level and instance-level embeddings from patched time-series data using a [CLS] token strategy; (ii) utilization of timestamp-predictive and instance-contrastive tasks for disentangled representation learning, with the former optimizing timestamp-level embeddings with predictive loss, and the latter optimizing instance-level embeddings with contrastive loss; and (iii) avoidance of augmentation methods to eliminate inductive biases, such as transformation-invariance from cropping and masking. Comprehensive experiments on 6 time-series forecasting datasets and 5 time-series classification datasets have shown that TimeDRL consistently surpasses existing representation learning approaches, achieving an average improvement of forecasting by 57.98% in MSE and classification by 1.25% in accuracy. Furthermore, extensive ablation studies confirmed the relative contribution of each component in TimeDRL's architecture, and semi-supervised learning evaluations demonstrated its effectiveness in real-world scenarios, even with limited labeled data.
Using a Large Language Model to generate a Design Structure Matrix
for: improving the productivity of Design Structure Matrix (DSM) generation for complex engineering systems by using a Large Language Model (LLM).
methods: proposes a workflow that leverages an LLM to support DSM generation; a prototype of the workflow was developed and applied to a diesel engine DSM.
results: the prototype reproduced 357 of the 462 previously published DSM entries (77.3%), suggesting the method's potential to aid DSM generation.
Abstract
The Design Structure Matrix (DSM) is an established method used in dependency modelling, especially in the design of complex engineering systems. The generation of DSM is traditionally carried out through manual means and can involve interviewing experts to elicit critical system elements and the relationships between them. Such manual approaches can be time-consuming and costly. This paper presents a workflow that uses a Large Language Model (LLM) to support the generation of DSM and improve productivity. A prototype of the workflow was developed in this work and applied on a diesel engine DSM published previously. It was found that the prototype could reproduce 357 out of 462 DSM entries published (i.e. 77.3%), suggesting that the work can aid DSM generation. A no-code version of the prototype is made available online to support future research.
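One minimal way to sketch the pairwise-query idea behind such a workflow is below. The `ask_llm` callable is hypothetical, standing in for a real LLM API; the paper's actual prompts, answer parsing, and model interface are not specified here, and the diesel-engine question wording is illustrative only.

```python
from itertools import permutations

def build_dsm(elements, ask_llm):
    """Assemble a binary Design Structure Matrix by asking an LLM, for every
    ordered pair of system elements, whether one depends on the other.
    `ask_llm(question) -> bool` is a hypothetical callable."""
    idx = {e: i for i, e in enumerate(elements)}
    n = len(elements)
    dsm = [[0] * n for _ in range(n)]
    for a, b in permutations(elements, 2):
        q = f"In a diesel engine, does the '{a}' depend on the '{b}'? Answer yes or no."
        if ask_llm(q):
            dsm[idx[a]][idx[b]] = 1  # row depends on column
    return dsm
```

In practice one would batch queries and validate answers; this sketch only shows how pairwise judgments populate the matrix.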
Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play
paper_authors: Timothy Schaumlöffel, Arthur Aubret, Gemma Roig, Jochen Triesch
for: studying how caregivers' language input during play shapes toddlers' visual representations
methods: a computational model of visual representation learning during dyadic play
results: caregivers' naming utterances improve toddlers' ability to recognize object categories
Abstract
Infants' ability to recognize and categorize objects develops gradually. The second year of life is marked by both the emergence of more semantic visual representations and a better understanding of word meaning. This suggests that language input may play an important role in shaping visual representations. However, even in suitable contexts for word learning like dyadic play sessions, caregivers' utterances are sparse and ambiguous, often referring to objects that are different from the one to which the child attends. Here, we systematically investigate to what extent caregivers' utterances can nevertheless enhance visual representations. For this we propose a computational model of visual representation learning during dyadic play. We introduce a synthetic dataset of ego-centric images perceived by a toddler-agent that moves and rotates toy objects in different parts of its home environment while hearing caregivers' utterances, modeled as captions. We propose to model toddlers' learning as simultaneously aligning representations for 1) close-in-time images and 2) co-occurring images and utterances. We show that utterances with statistics matching those of real caregivers give rise to representations supporting improved category recognition. Our analysis reveals that a small decrease/increase in object-relevant naming frequencies can drastically impact the learned representations. This affects the attention on object names within an utterance, which is required for efficient visuo-linguistic alignment. Overall, our results support the hypothesis that caregivers' naming utterances can improve toddlers' visual representations.
Breaking the Entanglement of Homophily and Heterophily in Semi-supervised Node Classification
paper_authors: Henan Sun, Xunkai Li, Zhengyu Wu, Daohan Su, Rong-Hua Li, Guoren Wang
for: developing a graph neural network (GNN) that can ensure performance under both homophily and heterophily, and addressing the sub-optimal graph representations of existing GNNs.
methods: the proposed AMUD quantifies the relationship between node profiles and topology from a statistical perspective, offering valuable insights for adaptively modeling natural directed graphs as undirected or directed graphs to maximize the benefits of subsequent graph learning; Adaptive Directed Pattern Aggregation (ADPA) is introduced as a new directed graph learning paradigm for AMUD.
results: empirical studies demonstrate that AMUD guides efficient graph learning, and extensive experiments on 14 benchmark datasets substantiate the impressive performance of ADPA, which outperforms baselines by significant margins of 3.96%.
Abstract
Recently, graph neural networks (GNNs) have shown prominent performance in semi-supervised node classification by leveraging knowledge from the graph database. However, most existing GNNs follow the homophily assumption, where connected nodes are more likely to exhibit similar feature distributions and the same labels, and such an assumption has proven to be vulnerable in a growing number of practical applications. As a supplement, heterophily reflects dissimilarity in connected nodes, which has gained significant attention in graph learning. To this end, data engineers aim to develop a powerful GNN model that can ensure performance under both homophily and heterophily. Despite numerous attempts, most existing GNNs struggle to achieve optimal node representations due to the constraints of undirected graphs. The neglect of directed edges results in sub-optimal graph representations, thereby hindering the capacity of GNNs. To address this issue, we introduce AMUD, which quantifies the relationship between node profiles and topology from a statistical perspective, offering valuable insights for Adaptively Modeling the natural directed graphs as the Undirected or Directed graph to maximize the benefits from subsequent graph learning. Furthermore, we propose Adaptive Directed Pattern Aggregation (ADPA) as a new directed graph learning paradigm for AMUD. Empirical studies have demonstrated that AMUD guides efficient graph learning. Meanwhile, extensive experiments on 14 benchmark datasets substantiate the impressive performance of ADPA, outperforming baselines by significant margins of 3.96%.
Enhancing the Rationale-Input Alignment for Self-explaining Rationalization
results: experiments on two widely used real-world benchmarks show that the proposed method significantly improves explanation quality (measured by the overlap between the model-selected explanation and the human-annotated rationale) compared with state-of-the-art techniques, and results in two synthetic settings further validate its effectiveness.
Abstract
Rationalization empowers deep learning models with self-explaining capabilities through a cooperative game, where a generator selects a semantically consistent subset of the input as a rationale, and a subsequent predictor makes predictions based on the selected rationale. In this paper, we discover that rationalization is prone to a problem named rationale shift, which arises from the algorithmic bias of the cooperative game. Rationale shift refers to a situation where the semantics of the selected rationale may deviate from the original input, but the predictor still produces accurate predictions based on the deviation, resulting in a compromised generator with misleading feedback. To address this issue, we first demonstrate the importance of the alignment between the rationale and the full input through both empirical observations and theoretical analysis. Subsequently, we introduce a novel approach called DAR (Discriminatively Aligned Rationalization), which utilizes an auxiliary module pretrained on the full input to discriminatively align the selected rationale and the original input. We theoretically illustrate how DAR accomplishes the desired alignment, thereby overcoming the rationale shift problem. The experiments on two widely used real-world benchmarks show that the proposed method significantly improves the explanation quality (measured by the overlap between the model-selected explanation and the human-annotated rationale) as compared to state-of-the-art techniques. Additionally, results on two synthetic settings further validate the effectiveness of DAR in addressing the rationale shift problem.
VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal Models
results: current proprietary models generally outperform open-source ones, with an average accuracy improvement of 22.70%, though there is still room for improvement; the choice of visual referring prompting strategy significantly affects LMM accuracy, with variations ranging from -17.5% to +7.3%.
Abstract
With recent advancements in Large Multimodal Models (LMMs) across various domains, a novel prompting method called visual referring prompting has emerged, showing significant potential in enhancing human-computer interaction within multimodal systems. This method offers a more natural and flexible approach to human interaction with these systems compared to traditional text descriptions or coordinates. However, the categorization of visual referring prompting remains undefined, and its impact on the performance of LMMs has yet to be formally examined. In this study, we conduct the first comprehensive analysis of LMMs using a variety of visual referring prompting strategies. We introduce a benchmark dataset called VRPTEST, comprising 3 different visual tasks and 2,275 images, spanning diverse combinations of prompt strategies. Using VRPTEST, we conduct a comprehensive evaluation of eight versions of prominent open-source and proprietary foundation models, including two early versions of GPT-4V. We develop an automated assessment framework based on software metamorphic testing techniques to evaluate the accuracy of LMMs without the need for human intervention or manual labeling. We find that the current proprietary models generally outperform the open-source ones, showing an average accuracy improvement of 22.70%; however, there is still potential for improvement. Moreover, our quantitative analysis shows that the choice of prompt strategy significantly affects the accuracy of LMMs, with variations ranging from -17.5% to +7.3%. Further case studies indicate that an appropriate visual referring prompting strategy can improve LMMs' understanding of context and location information, while an unsuitable one might lead to answer rejection. We also provide insights on minimizing the negative impact of visual referring prompting on LMMs.
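A metamorphic test in this setting can be as simple as checking that semantically equivalent prompt variants for the same image yield the same answer, flagging likely errors with no human labels. The sketch below is our own illustration of that principle; the `model(image, prompt) -> answer` interface is a hypothetical stand-in, not VRPTEST's actual API or its full set of metamorphic relations.

```python
def metamorphic_consistency(model, image, prompts):
    """Metamorphic relation: all semantically equivalent visual-referring
    prompts for the same image should elicit the same answer. Disagreement
    signals a probable model error without requiring ground-truth labels."""
    answers = [model(image, p) for p in prompts]
    consistent = len(set(answers)) == 1
    return consistent, answers
```

Running many such prompt families and counting consistent ones gives an automated accuracy proxy.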
Voice Recognition Robot with Real-Time Surveillance and Automation
results: the technique not only serves as an assistive tool for individuals with disabilities but also applies to industrial automation, enabling robots to perform specific tasks with precision.
Abstract
Voice recognition technology enables the execution of real-world operations through a single voice command. This paper introduces a voice recognition system that involves converting input voice signals into corresponding text using an Android application. The text messages are then transmitted through Bluetooth connectivity, serving as a communication platform. Simultaneously, a controller circuit, equipped with a Bluetooth module, receives the text signal and, following a coding mechanism, executes real-world operations. The paper extends the application of voice recognition to real-time surveillance and automation, incorporating obstacle detection and avoidance mechanisms, as well as control over lighting and horn functions through predefined voice commands. The proposed technique not only serves as an assistive tool for individuals with disabilities but also finds utility in industrial automation, enabling robots to perform specific tasks with precision.
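The controller-side "coding mechanism" amounts to mapping the received text to an actuator action, which can be sketched as below. The command names and action labels are illustrative; the paper does not specify the firmware, command vocabulary, or pin layout.

```python
# Minimal sketch of the controller-side coding mechanism: text received over
# Bluetooth is normalised and mapped to an actuator action.
COMMANDS = {
    "forward": "MOTORS_FORWARD",
    "stop": "MOTORS_STOP",
    "light on": "LIGHT_ON",
    "light off": "LIGHT_OFF",
    "horn": "HORN_BEEP",
}

def dispatch(text):
    """Return the action for the recognised text, or None for unrecognised
    commands (the robot then stays idle)."""
    return COMMANDS.get(text.strip().lower())
```

On a real microcontroller each action string would trigger GPIO writes; obstacle avoidance would run alongside this loop and override motion commands.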
Synergistic Signals: Exploiting Co-Engagement and Semantic Links via Graph Neural Networks
methods: the proposed SemanticGNN model leverages both semantic information and co-engagement signals to learn similarities between entities.
results: experiments show that SemanticGNN improves recommender performance and interpretability; deployed within Netflix, the model yields up to a 35% improvement on similarity judgment tasks.
Abstract
Given a set of candidate entities (e.g. movie titles), the ability to identify similar entities is a core capability of many recommender systems. Most often this is achieved by collaborative filtering approaches, i.e. if users co-engage with a pair of entities frequently enough, the embeddings should be similar. However, relying on co-engagement data alone can result in lower-quality embeddings for new and unpopular entities. We study this problem in the context recommender systems at Netflix. We observe that there is abundant semantic information such as genre, content maturity level, themes, etc. that complements co-engagement signals and provides interpretability in similarity models. To learn entity similarities from both data sources holistically, we propose a novel graph-based approach called SemanticGNN. SemanticGNN models entities, semantic concepts, collaborative edges, and semantic edges within a large-scale knowledge graph and conducts representation learning over it. Our key technical contributions are twofold: (1) we develop a novel relation-aware attention graph neural network (GNN) to handle the imbalanced distribution of relation types in our graph; (2) to handle web-scale graph data that has millions of nodes and billions of edges, we develop a novel distributed graph training paradigm. The proposed model is successfully deployed within Netflix and empirical experiments indicate it yields up to 35% improvement in performance on similarity judgment tasks.
Making Translators Privacy-aware on the User’s Side
results: experiments show that PRISM maintains translation accuracy while providing strong privacy protection.
Abstract
We propose PRISM to enable users of machine translation systems to preserve the privacy of data on their own initiative. There is a growing demand to apply machine translation systems to data that require privacy protection. While several machine translation engines claim to prioritize privacy, the extent and specifics of such protection are largely ambiguous. First, there is often a lack of clarity on how and to what degree the data is protected. Even if service providers believe they have sufficient safeguards in place, sophisticated adversaries might still extract sensitive information. Second, vulnerabilities may exist outside of these protective measures, such as within communication channels, potentially leading to data leakage. As a result, users are hesitant to utilize machine translation engines for data demanding high levels of privacy protection, thereby missing out on their benefits. PRISM resolves this problem. Instead of relying on the translation service to keep data safe, PRISM provides the means to protect data on the user's side. This approach ensures that even machine translation engines with inadequate privacy measures can be used securely. For platforms already equipped with privacy safeguards, PRISM acts as an additional protection layer, reinforcing their security furthermore. PRISM adds these privacy features without significantly compromising translation accuracy. Our experiments demonstrate the effectiveness of PRISM using real-world translators, T5 and ChatGPT (GPT-3.5-turbo), and the datasets with two languages. PRISM effectively balances privacy protection with translation accuracy.
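One way to realise user-side protection is to mask sensitive substrings before the text ever reaches the translation engine and restore them in the translated output. The sketch below is our own illustrative mechanism, assuming the user identifies the secrets and that the placeholders survive translation unchanged; it is not necessarily how PRISM works internally.

```python
def mask_sensitive(text, secrets):
    """Replace user-identified sensitive substrings with opaque placeholders
    before the text leaves the user's machine; return the masked text and the
    mapping needed to restore the secrets afterwards."""
    mapping = {}
    for i, s in enumerate(secrets):
        ph = f"ENT{i}"  # placeholder format is an assumption of this sketch
        mapping[ph] = s
        text = text.replace(s, ph)
    return text, mapping

def unmask(translated, mapping):
    """Restore the original secrets in the translated output."""
    for ph, s in mapping.items():
        translated = translated.replace(ph, s)
    return translated
```

The translation service only ever sees the masked text, so even an engine with weak privacy guarantees cannot leak the masked entities.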
A Low-Overhead Incorporation-Extrapolation based Few-Shot CSI Feedback Framework for Massive MIMO Systems
results: the proposed IEFSF reduces CSI feedback overhead by a factor of 16 compared with existing methods while maintaining higher feedback accuracy using only several hundred collected samples.
Abstract
Accurate channel state information (CSI) is essential for downlink precoding at the base station (BS), especially for frequency division duplex (FDD) wideband massive MIMO systems with OFDM. In FDD systems, CSI is attained through CSI feedback from the user equipment (UE). However, large-scale antennas and large numbers of subcarriers significantly increase CSI feedback overhead. Deep learning-based CSI feedback methods have received tremendous attention in recent years due to their great capability of compressing CSI. Nonetheless, large amounts of collected samples are required to train deep learning models, which is severely challenging in practice. Besides, with the rapidly increasing number of antennas and subcarriers, most of these deep learning methods' CSI feedback overhead also grows dramatically, owing to their focus on full-dimensional CSI feedback. To address this issue, in this paper, we propose a low-overhead Incorporation-Extrapolation based Few-Shot CSI feedback Framework (IEFSF) for massive MIMO systems. To further reduce the feedback overhead, a low-dimensional eigenvector-based CSI matrix is first formed with the incorporation process at the UE, and then recovered to the full-dimensional eigenvector-based CSI matrix at the BS via the extrapolation process. After that, to alleviate the necessity of the extensive collected samples and enable few-shot CSI feedback, we further propose a knowledge-driven data augmentation method and an artificial intelligence-generated content (AIGC) -based data augmentation method by exploiting the domain knowledge of wireless channels and by exploiting a novel generative model, respectively. Numerical results demonstrate that the proposed IEFSF can significantly reduce CSI feedback overhead by 16 times compared with existing CSI feedback methods while maintaining higher feedback accuracy using only several hundreds of collected samples.
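To convey the incorporation/extrapolation idea, here is a toy sketch that compresses CSI by keeping every k-th subcarrier's CSI vector and rebuilds the rest by linear interpolation, exploiting the channel's smoothness across frequency. The paper's actual scheme operates on eigenvector-based CSI matrices with a learned extrapolation, so treat this strictly as an analogy.

```python
def incorporate(csi, stride):
    """UE side: keep every `stride`-th subcarrier's CSI vector, shrinking
    the feedback payload by roughly a factor of `stride`."""
    return csi[::stride]

def extrapolate(compressed, stride, n_subcarriers):
    """BS side: rebuild full-dimensional CSI by linear interpolation across
    subcarriers (a crude stand-in for the learned extrapolation step)."""
    full = []
    for i in range(n_subcarriers):
        pos = i / stride
        lo = min(int(pos), len(compressed) - 1)
        hi = min(lo + 1, len(compressed) - 1)
        frac = pos - lo
        full.append([(1 - frac) * a + frac * b
                     for a, b in zip(compressed[lo], compressed[hi])])
    return full
```

For a channel that varies linearly across subcarriers, the reconstruction is exact; real channels incur an error that the learned extrapolator is meant to minimise.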
Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes
paper_authors: Hmrishav Bandyopadhyay, Subhadeep Koley, Ayan Das, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Ayan Kumar Bhunia, Yi-Zhe Song
for: This paper aims to democratise 3D content creation, enabling precise generation of 3D shapes from abstract sketches while overcoming limitations tied to drawing skills.
methods: The paper introduces a novel part-level modelling and alignment framework that enables abstraction modelling and cross-modal correspondence. The approach extends readily to sketch modelling by establishing correspondence between CLIPasso edgemaps and projected 3D part regions, eliminating the need for a dataset pairing human sketches with 3D shapes.
results: The method generates precise 3D shapes quickly and efficiently, and provides a seamless in-position editing process as a byproduct of cross-modal part-aligned modelling. Operating in a low-dimensional implicit space, it significantly reduces computational demands and processing time.
Abstract
In this paper, we democratise 3D content creation, enabling precise generation of 3D shapes from abstract sketches while overcoming limitations tied to drawing skills. We introduce a novel part-level modelling and alignment framework that facilitates abstraction modelling and cross-modal correspondence. Leveraging the same part-level decoder, our approach seamlessly extends to sketch modelling by establishing correspondence between CLIPasso edgemaps and projected 3D part regions, eliminating the need for a dataset pairing human sketches and 3D shapes. Additionally, our method introduces a seamless in-position editing process as a byproduct of cross-modal part-aligned modelling. Operating in a low-dimensional implicit space, our approach significantly reduces computational demands and processing time.
Modeling Boundedly Rational Agents with Latent Inference Budgets
results: Across three modeling tasks (inferring navigation goals from routes, inferring communicative intents from human utterances, and predicting next moves in human chess games), L-IBM matches or outperforms Boltzmann models of decision-making under uncertainty. The inferred inference budgets are themselves meaningful, efficient to compute, and correlated with measures of player skill, partner skill, and task difficulty.
Abstract
We study the problem of modeling a population of agents pursuing unknown goals subject to unknown computational constraints. In standard models of bounded rationality, sub-optimal decision-making is simulated by adding homoscedastic noise to optimal decisions rather than explicitly simulating constrained inference. In this work, we introduce a latent inference budget model (L-IBM) that models agents' computational constraints explicitly, via a latent variable (inferred jointly with a model of agents' goals) that controls the runtime of an iterative inference algorithm. L-IBMs make it possible to learn agent models using data from diverse populations of suboptimal actors. In three modeling tasks -- inferring navigation goals from routes, inferring communicative intents from human utterances, and predicting next moves in human chess games -- we show that L-IBMs match or outperform Boltzmann models of decision-making under uncertainty. Inferred inference budgets are themselves meaningful, efficient to compute, and correlated with measures of player skill, partner skill and task difficulty.
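A toy sketch of the latent-inference-budget idea, under heavy simplifying assumptions of our own: the "anytime planner" is a chain MDP where a distant reward only becomes visible once the value-iteration budget reaches its distance, and the latent budget is recovered by maximum likelihood over observed softmax actions. None of the names or constants come from the paper.

```python
import numpy as np

GAMMA, DIST, FAR_REWARD, NEAR_REWARD = 0.9, 4, 10.0, 1.0

def q_values(budget):
    # Anytime planner sketch: with `budget` sweeps of value iteration on a
    # chain MDP, a reward DIST steps away is only seen once budget >= DIST.
    q_far = (GAMMA ** DIST) * FAR_REWARD if budget >= DIST else 0.0
    return np.array([q_far, NEAR_REWARD])   # [walk-to-far-reward, take-near-reward]

def policy(budget, beta=3.0):
    # Softmax (Boltzmann) action distribution over the budget-limited Q-values.
    z = beta * q_values(budget)
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def infer_budget(actions, budgets=range(1, 9)):
    # MAP estimate of the latent budget given observed action indices.
    loglik = {b: np.log(policy(b))[actions].sum() for b in budgets}
    return max(loglik, key=loglik.get)

rng = np.random.default_rng(0)
true_budget = 6                        # enough compute to "see" the far reward
actions = rng.choice(2, size=200, p=policy(true_budget))
estimated_budget = infer_budget(actions)
```

With a budget below DIST the agent takes the near reward; once the budget is large enough the far reward dominates, so the observed action mix identifies (up to ties among sufficient budgets) how much compute the agent spent.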
Improved Face Representation via Joint Label Classification and Supervised Contrastive Clustering
results: Extensive qualitative and quantitative experiments on popular facial benchmarks demonstrate the effectiveness of the proposed method and its superiority over existing approaches.
Abstract
Face clustering tasks can learn hierarchical semantic information from large-scale data, which has the potential to help facilitate face recognition. However, there are few works on this problem. This paper explores it by proposing a joint optimization task of label classification and supervised contrastive clustering to introduce the cluster knowledge to the traditional face recognition task in two ways. We first extend ArcFace with a cluster-guided angular margin to adjust the within-class feature distribution according to the hard level of face clustering. Secondly, we propose a supervised contrastive clustering approach to pull the features to the cluster center and propose the cluster-aligning procedure to align the cluster center and the learnable class center in the classifier for joint training. Finally, extensive qualitative and quantitative experiments on popular facial benchmarks demonstrate the effectiveness of our paradigm and its superiority over the existing approaches to face recognition.
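The cluster-guided angular margin can be sketched in a few lines. This is our own toy rendition, assuming the standard ArcFace form cos(θ + m) with the margin m left as a per-class knob that the paper would set from clustering hardness; `arcface_logits` and all constants are hypothetical.

```python
import numpy as np

def arcface_logits(embedding, class_centers, label, margin, scale=16.0):
    # ArcFace-style logits: cosine similarities, with an additive angular
    # margin on the true class; a cluster-guided variant would set `margin`
    # per class according to how hard that class is to cluster.
    e = embedding / np.linalg.norm(embedding)
    w = class_centers / np.linalg.norm(class_centers, axis=1, keepdims=True)
    theta = np.arccos(np.clip(w @ e, -1.0, 1.0))
    theta[label] += margin
    return scale * np.cos(theta)

def softmax_ce(logits, label):
    z = logits - logits.max()
    return -(z[label] - np.log(np.exp(z).sum()))

rng = np.random.default_rng(0)
centers = rng.normal(size=(5, 16))
x = centers[2] + 0.1 * rng.normal(size=16)        # a sample near class 2
loss_plain = softmax_ce(arcface_logits(x, centers, 2, margin=0.0), 2)
loss_margin = softmax_ce(arcface_logits(x, centers, 2, margin=0.5), 2)
```

A larger margin makes the correct class harder to satisfy (higher loss for the same embedding), which is what pushes within-class features tighter during training.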
The sample complexity of multi-distribution learning
results: The paper gives an algorithm with sample complexity $\widetilde{O}((d+k)\epsilon^{-2}) \cdot (k/\epsilon)^{o(1)}$ for minimizing the maximum population loss over $k$ distributions. This matches the lower bound up to a sub-polynomial factor and resolves the COLT 2023 open problem.
Abstract
Multi-distribution learning generalizes the classic PAC learning to handle data coming from multiple distributions. Given a set of $k$ data distributions and a hypothesis class of VC dimension $d$, the goal is to learn a hypothesis that minimizes the maximum population loss over $k$ distributions, up to $\epsilon$ additive error. In this paper, we settle the sample complexity of multi-distribution learning by giving an algorithm of sample complexity $\widetilde{O}((d+k)\epsilon^{-2}) \cdot (k/\epsilon)^{o(1)}$. This matches the lower bound up to sub-polynomial factor and resolves the COLT 2023 open problem of Awasthi, Haghtalab and Zhao [AHZ23].
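Restating the bound next to an informal baseline helps see what is gained. The left-hand expression is from the abstract; the right-hand baseline (learning each of the $k$ distributions separately with roughly $d\epsilon^{-2}$ samples each) is our own back-of-the-envelope comparison, not a claim of the paper:

```latex
\underbrace{\widetilde{O}\!\big((d+k)\,\epsilon^{-2}\big)\cdot(k/\epsilon)^{o(1)}}_{\text{this paper (tight up to the } (k/\epsilon)^{o(1)} \text{ factor)}}
\qquad \text{vs.} \qquad
\underbrace{\widetilde{O}\!\big(k\,d\,\epsilon^{-2}\big)}_{\text{naive: learn each of the } k \text{ distributions separately}}
```

The additive $d+k$ in place of the multiplicative $kd$ is the headline saving when both $k$ and $d$ are large.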
Moirai: Towards Optimal Placement for Distributed Inference on Heterogeneous Devices
results: Experiments show that, compared with the state of the art, Moirai reduces end-to-end inference latency by up to 4.28x.
Abstract
The escalating size of Deep Neural Networks (DNNs) has spurred a growing research interest in hosting and serving DNN models across multiple devices. A number of studies have been reported to partition a DNN model across devices, providing device placement solutions. The methods appeared in the literature, however, either suffer from poor placement performance due to the exponential search space or miss an optimal placement as a consequence of the reduced search space with limited heuristics. Moreover, these methods have ignored the runtime inter-operator optimization of a computation graph when coarsening the graph, which degrades the end-to-end inference performance. This paper presents Moirai that better exploits runtime inter-operator fusion in a model to render a coarsened computation graph, reducing the search space while maintaining the inter-operator optimization provided by inference backends. Moirai also generalizes the device placement algorithm from multiple perspectives by considering inference constraints and device heterogeneity. Extensive experimental evaluation with 11 large DNNs demonstrates that Moirai outperforms the state-of-the-art counterparts, i.e., Placeto, m-SCT, and GETF, up to 4.28$\times$ in reduction of the end-to-end inference latency. Moirai code is anonymously released at \url{https://github.com/moirai-placement/moirai}.
k* Distribution: Evaluating the Latent Space of Deep Neural Networks using Local Neighborhood Analysis
for: This paper proposes a method to capture the structure of the sample distribution of each class within a neural network's learned latent space, enabling a better understanding of how samples are distributed.
methods: The paper uses local neighborhood analysis to capture the per-class sample distribution structure, and characterizes these distributions by comparing samples across classes.
results: The study finds that classes are distributed differently within the learned latent space: some classes are fractured, some overlap with others, and some form clusters. The results indicate that traditional dimensionality reduction techniques can distort these distribution structures, making classes harder to distinguish.
Abstract
Most examinations of neural networks' learned latent spaces typically employ dimensionality reduction techniques such as t-SNE or UMAP. While these methods effectively capture the overall sample distribution in the entire learned latent space, they tend to distort the structure of sample distributions within specific classes in the subset of the latent space. This distortion complicates the task of easily distinguishing classes identifiable by neural networks. In response to this challenge, we introduce the k* Distribution methodology. This approach focuses on capturing the characteristics and structure of sample distributions for individual classes within the subset of the learned latent space using local neighborhood analysis. The key concept is to facilitate easy comparison of different k* distributions, enabling analysis of how various classes are processed by the same neural network. This provides a more profound understanding of existing contemporary visualizations. Our study reveals three distinct distributions of samples within the learned latent space subset: a) Fractured, b) Overlapped, and c) Clustered. We note and demonstrate that the distribution of samples within the network's learned latent space significantly varies depending on the class. Furthermore, we illustrate that our analysis can be applied to explore the latent space of diverse neural network architectures, various layers within neural networks, transformations applied to input samples, and the distribution of training and testing data for neural networks. We anticipate that our approach will facilitate more targeted investigations into neural networks by collectively examining the distribution of different samples within the learned latent space.
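A minimal sketch of the local-neighborhood statistic the abstract describes, under our own reading: for each sample, count how many nearest neighbours share its class before the first different-class neighbour appears. The toy data below is constructed so one class is clustered and the other is fractured into two distant fragments; all names are hypothetical.

```python
import numpy as np

def k_star(features, labels):
    # For each sample: how many nearest neighbours share its class before the
    # first neighbour of a different class appears.
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    out = np.empty(len(features), dtype=int)
    for i in range(len(features)):
        order = np.argsort(d[i])
        order = order[order != i]                 # drop the sample itself
        diff = labels[order] != labels[i]
        out[i] = int(np.argmax(diff)) if diff.any() else len(diff)
    return out

rng = np.random.default_rng(0)
clustered = rng.normal(0.0, 0.1, size=(10, 2))           # class 0: one tight cluster
fractured = np.vstack([rng.normal(5.0, 0.1, size=(5, 2)),
                       rng.normal(-5.0, 0.1, size=(5, 2))])  # class 1: two fragments
feats = np.vstack([clustered, fractured])
labels = np.array([0] * 10 + [1] * 10)
ks = k_star(feats, labels)
```

On this toy data the clustered class gets k* = 9 (all same-class points come first) while the fractured class gets k* = 4 (its own fragment, then a foreign class), echoing the paper's "clustered" vs. "fractured" distinction.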
results: The study finds that the trade-off between performance and calibration worsens as model size increases, more ICL examples are incorporated, and models are fine-tuned with RLHF. Moreover, common recalibration techniques such as temperature scaling provide only limited gains in calibration error, suggesting that new methods may be required for settings where models are expected to be reliable.
Abstract
Modern auto-regressive language models are trained to minimize log loss on broad data by predicting the next token, so they are expected to give calibrated answers when framing a problem as a next-token prediction task. We study this for in-context learning (ICL), a widely used way to adapt frozen large language models (LLMs) via crafting prompts, and investigate the trade-offs between performance and calibration on a wide range of natural language understanding and reasoning tasks. We conduct extensive experiments to show that such trade-offs may get worse as we increase model size, incorporate more ICL examples, and fine-tune models using instruction, dialog, or reinforcement learning from human feedback (RLHF) on carefully curated datasets. Furthermore, we find that common recalibration techniques that are widely effective such as temperature scaling provide limited gains in calibration errors, suggesting that new methods may be required for settings where models are expected to be reliable.
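Temperature scaling, the recalibration baseline the abstract finds insufficient, is simple to sketch. The example below is our own: an "overconfident model" is simulated by scaling true posterior logits by 4x, and a single scalar T is fit by grid search on held-out NLL. Note that T rescales confidences without changing argmax predictions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(logits, labels, temperature):
    p = softmax(logits / temperature)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

def fit_temperature(logits, labels, grid=np.linspace(0.25, 8.0, 128)):
    # Pick the single scalar T minimising held-out NLL; T only rescales
    # confidences, it never changes which class is predicted.
    return float(grid[int(np.argmin([nll(logits, labels, t) for t in grid]))])

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))                  # "true" calibrated logits
labels = np.array([rng.choice(3, p=p) for p in softmax(base)])
logits = 4.0 * base                               # simulated overconfident model
temperature = fit_temperature(logits, labels)
```

Because T is a single post-hoc scalar, it can soften uniform overconfidence but cannot repair input-dependent miscalibration, which is consistent with the limited gains the abstract reports.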
Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models
paper_authors: Yijie Zhang, Zhangyang Gao, Cheng Tan, Stan Z. Li
for: Predicting how a protein's stability changes upon a single-point mutation.
methods: The authors adopt ESM models (protein large language models) to capture protein sequence and structural features for predicting stability changes upon single-point mutations.
results: The authors propose an efficient ESM-assisted approach that accurately predicts protein stability changes. They also carefully curate a dataset designed to preclude data leakage, enabling a fairer comparison of model performance.
Abstract
Predicting protein stability changes induced by single-point mutations has been a persistent challenge over the years, attracting immense interest from numerous researchers. The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry, including drug development, protein evolution analysis, and enzyme synthesis. Despite the proposition of multiple methodologies aimed at addressing this issue, few approaches have successfully achieved optimal performance coupled with high computational efficiency. Two principal hurdles contribute to the existing challenges in this domain. The first is the complexity of extracting and aggregating sufficiently representative features from proteins. The second refers to the limited availability of experimental data for protein mutation analysis, further complicating the comprehensive evaluation of model performance on unseen data samples. With the advent of Large Language Models (LLMs), such as the ESM models in protein research, profound interpretation of protein features is now accessible, aided by enormous amounts of training data. LLMs are therefore well positioned to facilitate a wide range of protein research. In our study, we introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict the thermostability changes in proteins upon single-point mutations. Furthermore, we have curated a dataset meticulously designed to preclude data leakage, corresponding to two extensively employed test datasets, to facilitate a more equitable model comparison.
KOALA: Self-Attention Matters in Knowledge Distillation of Latent Diffusion Models for Memory-Efficient and Fast Image Synthesis
results: With a purpose-built efficient U-Net and a self-attention-based knowledge distillation strategy, the authors build efficient T2I models, KOALA-1B & -700M, reducing the model size up to 54% and 69% of the original SDXL model. In particular, KOALA-700M is more than twice as fast as SDXL while still retaining decent generation quality. The authors hope that, owing to its balanced speed-performance trade-off, KOALA can serve as a cost-effective alternative to SDXL in resource-constrained environments.
Abstract
Stable diffusion is the mainstay of the text-to-image (T2I) synthesis in the community due to its generation performance and open-source nature. Recently, Stable Diffusion XL (SDXL), the successor of stable diffusion, has received a lot of attention due to its significant performance improvements with a higher resolution of 1024x1024 and a larger model. However, its increased computation cost and model size require higher-end hardware(e.g., bigger VRAM GPU) for end-users, incurring higher costs of operation. To address this problem, in this work, we propose an efficient latent diffusion model for text-to-image synthesis obtained by distilling the knowledge of SDXL. To this end, we first perform an in-depth analysis of the denoising U-Net in SDXL, which is the main bottleneck of the model, and then design a more efficient U-Net based on the analysis. Secondly, we explore how to effectively distill the generation capability of SDXL into an efficient U-Net and eventually identify four essential factors, the core of which is that self-attention is the most important part. With our efficient U-Net and self-attention-based knowledge distillation strategy, we build our efficient T2I models, called KOALA-1B & -700M, while reducing the model size up to 54% and 69% of the original SDXL model. In particular, the KOALA-700M is more than twice as fast as SDXL while still retaining a decent generation quality. We hope that due to its balanced speed-performance tradeoff, our KOALA models can serve as a cost-effective alternative to SDXL in resource-constrained environments.
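The self-attention-based distillation signal can be sketched as an MSE between teacher and student attention maps. This is our own single-head toy, not KOALA's training code; `attention_kd_loss` and all shapes are assumptions.

```python
import numpy as np

def attention_map(x, wq, wk):
    # Attention probabilities of one head: softmax(Q K^T / sqrt(d)).
    q, k = x @ wq, x @ wk
    s = q @ k.T / np.sqrt(q.shape[-1])
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def attention_kd_loss(x, teacher_qk, student_qk):
    # Self-attention distillation signal: MSE between teacher and student
    # attention maps, the component the abstract identifies as most important.
    return np.mean((attention_map(x, *teacher_qk)
                    - attention_map(x, *student_qk)) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))                       # 6 tokens, hidden width 8
teacher = (rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
student = (rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
loss_random = attention_kd_loss(x, teacher, student)
loss_matched = attention_kd_loss(x, teacher, teacher)
```

Minimising this loss pushes the smaller student U-Net to reproduce where the teacher attends, which is cheaper to match than full feature maps.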
Style Transfer to Calvin and Hobbes comics using Stable Diffusion
paper_authors: Sloke Shrestha, Sundar Sripada V. S., Asvin Venkataramanan
for: This report documents our journey to perform stable diffusion fine-tuning on a dataset containing Calvin and Hobbes comics, with the goal of converting any given input image into the comic style of Calvin and Hobbes.
methods: we used Low Rank Adaptation (LoRA) to efficiently speed up the fine-tuning process, and the diffusion itself was handled by a Variational Autoencoder (VAE), which is a U-net.
results: our results were visually appealing, considering the amount of training time and the quality of input data that went into training.
Abstract
This project report summarizes our journey to perform stable diffusion fine-tuning on a dataset containing Calvin and Hobbes comics. The purpose is to convert any given input image into the comic style of Calvin and Hobbes, essentially performing style transfer. We train stable-diffusion-v1.5 using Low Rank Adaptation (LoRA) to efficiently speed up the fine-tuning process. The diffusion itself is handled by a Variational Autoencoder (VAE), which is a U-net. Our results were visually appealing for the amount of training time and the quality of input data that went into training.
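A minimal sketch of the LoRA mechanism used for the fine-tuning, under standard assumptions (frozen base weight W, trainable low-rank factors A and B with B zero-initialized, scaling α/r). This is illustrative NumPy, not the actual diffusers training loop the report would use.

```python
import numpy as np

class LoRALinear:
    # Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A;
    # B starts at zero, so at step 0 the layer equals the frozen base model.
    def __init__(self, w, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w                                     # frozen, shape (d_out, d_in)
        self.a = rng.normal(0.0, 0.02, size=(r, w.shape[1]))
        self.b = np.zeros((w.shape[0], r))             # zero-initialised
        self.scale = alpha / r

    def __call__(self, x):
        return x @ (self.w + self.scale * self.b @ self.a).T

rng = np.random.default_rng(1)
w = rng.normal(size=(16, 32))
layer = LoRALinear(w, r=4)
x = rng.normal(size=(3, 32))
y_init = layer(x)                                  # identical to the frozen layer
layer.b = rng.normal(size=layer.b.shape)           # pretend training updated B
y_tuned = layer(x)
```

The point of LoRA is the parameter count: only A and B (192 values in this toy) are trained against 512 frozen base weights, which is what makes comic-style fine-tuning fast and cheap.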
MICRO: Model-Based Offline Reinforcement Learning with a Conservative Bellman Operator
paper_authors: Xiao-Yin Liu, Xiao-Hu Zhou, Guo-Tao Li, Hao Li, Mei-Jiang Gui, Tian-Yu Xiang, De-Xing Huang, Zeng-Guang Hou
for: Offline reinforcement learning (RL) faces a significant challenge of distribution shift, and model-based algorithms are proposed to tackle this problem.
methods: The proposed algorithm, MICRO, uses a conservative Bellman operator and introduces robustness into policy optimization.
results: Compared with previous model-based algorithms, MICRO outperforms on the offline RL benchmark and is considerably robust to adversarial perturbations, with reduced computation cost.
Abstract
Offline reinforcement learning (RL) faces a significant challenge of distribution shift. Model-free offline RL penalizes the Q value for out-of-distribution (OOD) data or constrains the policy closed to the behavior policy to tackle this problem, but this inhibits the exploration of the OOD region. Model-based offline RL, which uses the trained environment model to generate more OOD data and performs conservative policy optimization within that model, has become an effective method for this problem. However, the current model-based algorithms rarely consider agent robustness when incorporating conservatism into policy. Therefore, the new model-based offline algorithm with a conservative Bellman operator (MICRO) is proposed. This method trades off performance and robustness via introducing the robust Bellman operator into the algorithm. Compared with previous model-based algorithms with robust adversarial models, MICRO can significantly reduce the computation cost by only choosing the minimal Q value in the state uncertainty set. Extensive experiments demonstrate that MICRO outperforms prior RL algorithms in offline RL benchmark and is considerably robust to adversarial perturbations.
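One way to picture a conservative Bellman operator is a tabular sketch where the bootstrap value is the minimum over a small state-uncertainty set, echoing the abstract's "minimal Q value in the state uncertainty set". The code below is our own toy, not MICRO itself; the uncertainty set is hard-coded as the predicted next state plus its neighbours.

```python
import numpy as np

def conservative_bellman_target(q, r, next_state, neighbors, gamma=0.99):
    # Instead of bootstrapping from the model's predicted next state alone,
    # take the minimal value over a small state-uncertainty set (the predicted
    # state plus its neighbours), which biases targets pessimistically.
    uncertainty_set = [next_state] + list(neighbors)
    v_min = min(q[s].max() for s in uncertainty_set)
    return r + gamma * v_min

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 2))                       # tabular Q: 5 states, 2 actions
standard = 1.0 + 0.99 * q[2].max()                # plain Bellman target
conservative = conservative_bellman_target(q, 1.0, 2, neighbors=[1, 3])
```

Taking a minimum over the set (rather than training adversarial models) is what the abstract credits for the reduced computation cost.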
Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration
paper_authors: Meihao Fan, Xiaoyue Han, Ju Fan, Chengliang Chai, Nan Tang, Guoliang Li, Xiaoyong Du
for: This paper aims to develop an accurate yet cost-effective batch prompting approach for entity resolution (ER).
methods: The paper applies pre-trained language models (PLMs) and large language models (LLMs) to ER, and proposes a batch prompting framework comprising demonstration selection and question batching for efficient matching.
results: Extensive experiments show that batch prompting achieves highly accurate ER at much lower monetary cost than both PLM-based methods fine-tuned with extensive labeled data and LLM-based methods with manually designed prompting; the paper also provides guidance for selecting appropriate design choices.
Abstract
Entity resolution (ER) is an important data integration task with a wide spectrum of applications. The state-of-the-art solutions on ER rely on pre-trained language models (PLMs), which require fine-tuning on a lot of labeled matching/non-matching entity pairs. Recently, large languages models (LLMs), such as GPT-4, have shown the ability to perform many tasks without tuning model parameters, which is known as in-context learning (ICL) that facilitates effective learning from a few labeled input context demonstrations. However, existing ICL approaches to ER typically necessitate providing a task description and a set of demonstrations for each entity pair and thus have limitations on the monetary cost of interfacing LLMs. To address the problem, in this paper, we provide a comprehensive study to investigate how to develop a cost-effective batch prompting approach to ER. We introduce a framework BATCHER consisting of demonstration selection and question batching and explore different design choices that support batch prompting for ER. We also devise a covering-based demonstration selection strategy that achieves an effective balance between matching accuracy and monetary cost. We conduct a thorough evaluation to explore the design space and evaluate our proposed strategies. Through extensive experiments, we find that batch prompting is very cost-effective for ER, compared with not only PLM-based methods fine-tuned with extensive labeled data but also LLM-based methods with manually designed prompting. We also provide guidance for selecting appropriate design choices for batch prompting.
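The covering-based demonstration selection can be sketched as greedy set cover over question tokens. This is our own toy proxy, assuming token overlap stands in for the paper's notion of coverage; function names and the tiny ER examples are hypothetical.

```python
def covering_demo_selection(questions, demo_pool, budget):
    # Greedily pick the demonstration whose token set covers the most
    # not-yet-covered question tokens: a cheap proxy for trading matching
    # accuracy against prompt (monetary) cost.
    need = set().union(*(set(q.lower().split()) for q in questions))
    chosen, covered = [], set()
    for _ in range(budget):
        best = max(demo_pool,
                   key=lambda d: len(set(d.lower().split()) & (need - covered)))
        gain = set(best.lower().split()) & (need - covered)
        if not gain:
            break
        chosen.append(best)
        covered |= gain
    return chosen, len(covered) / len(need)

questions = ["is apple inc the same as apple computer",
             "does ibm corp match international business machines"]
demos = ["apple computer inc is the same entity as apple",
         "ibm means international business machines corp",
         "oranges are not apples"]
picked, coverage = covering_demo_selection(questions, demos, budget=2)
```

Capping `budget` bounds the prompt length (and hence the per-call cost), while the greedy cover keeps the chosen demonstrations relevant to the batched questions.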
Improving Medical Report Generation with Adapter Tuning and Knowledge Enhancement in Vision-Language Foundation Models
paper_authors: Shibin Wu, Bang Yang, Zhiyu Ye, Haoqian Wang, Hairong Zheng, Tong Zhang
for: automatic creation of coherent and precise descriptions for medical images
methods: vision-language pre-training and fine-tuning approach, BLIP-2, with adapter tuning and medical knowledge enhancement loss
results: significant improvements in accuracy and coherence, achieving the best-averaged results against several state-of-the-art methods, with improvements in ROUGE and CIDEr underscoring the method's efficacy.
Abstract
Medical report generation demands automatic creation of coherent and precise descriptions for medical images. However, the scarcity of labelled medical image-report pairs poses formidable challenges in developing large-scale neural networks capable of harnessing the potential of artificial intelligence, exemplified by large language models. This study builds upon the state-of-the-art vision-language pre-training and fine-tuning approach, BLIP-2, to customize general large-scale foundation models. Integrating adapter tuning and a medical knowledge enhancement loss, our model significantly improves accuracy and coherence. Validation on the dataset of ImageCLEFmedical 2023 demonstrates our model's prowess, achieving the best-averaged results against several state-of-the-art methods. Significant improvements in ROUGE and CIDEr underscore our method's efficacy, highlighting promising outcomes for the rapid medical-domain adaptation of the vision-language foundation models in addressing challenges posed by data scarcity.
results: Experiments show that the self-consistency fine-tuning strategy (SelfEQ) improves performance over a strong baseline, with absolute gains of 4.69% on Flickr30k, 7.68% on ReferIt, and 3.74% on average on the RefCOCO+ test sets.
Abstract
Vision-and-language models trained to match images with text can be combined with visual explanation methods to point to the locations of specific objects in an image. Our work shows that the localization --"grounding"-- abilities of these models can be further improved by finetuning for self-consistent visual explanations. We propose a strategy for augmenting existing text-image datasets with paraphrases using a large language model, and SelfEQ, a weakly-supervised strategy on visual explanation maps for paraphrases that encourages self-consistency. Specifically, for an input textual phrase, we attempt to generate a paraphrase and finetune the model so that the phrase and paraphrase map to the same region in the image. We posit that this both expands the vocabulary that the model is able to handle, and improves the quality of the object locations highlighted by gradient-based visual explanation methods (e.g. GradCAM). We demonstrate that SelfEQ improves performance on Flickr30k, ReferIt, and RefCOCO+ over a strong baseline method and several prior works. Particularly, comparing to other methods that do not use any type of box annotations, we obtain 84.07% on Flickr30k (an absolute improvement of 4.69%), 67.40% on ReferIt (an absolute improvement of 7.68%), and 75.10%, 55.49% on RefCOCO+ test sets A and B respectively (an absolute improvement of 3.74% on average).
results: Experimental results demonstrate that the proposed model attains state-of-the-art performance in simultaneous speech-to-text translation on the Spanish and English translation task.
Abstract
We introduce the Efficient Monotonic Multihead Attention (EMMA), a state-of-the-art simultaneous translation model with numerically-stable and unbiased monotonic alignment estimation. In addition, we present improved training and inference strategies, including simultaneous fine-tuning from an offline translation model and reduction of monotonic alignment variance. The experimental results demonstrate that the proposed model attains state-of-the-art performance in simultaneous speech-to-text translation on the Spanish and English translation task.
paper_authors: Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
for: LLMCompiler is designed to efficiently orchestrate multi-function calling in LLMs, addressing the limitations of current methods such as high latency and cost.
methods: LLMCompiler uses three components - LLM Planner, Task Fetching Unit, and Executor - to execute functions in parallel, streamlining parallel function calling based on classical compiler principles.
results: LLMCompiler achieves consistent latency speedup of up to 3.7x, cost savings of up to 6.7x, and accuracy improvement of up to ~9% compared to ReAct, with up to 1.35x latency gain over OpenAI's recent parallel function calling method.
Abstract
Large Language Models (LLMs) have shown remarkable results on various complex reasoning benchmarks. The reasoning capabilities of LLMs enable them to execute function calls, using user-provided functions to overcome their inherent limitations, such as knowledge cutoffs, poor arithmetic skills, or lack of access to private data. This development has expanded LLMs' scope to include multi-function calling, where LLMs are equipped with a variety of functions and select the proper functions based on the context. Multi-function calling abilities of LLMs have catalyzed LLM-based software development, allowing them to tackle more complex problems. However, current methods for multi-function calling often require sequential reasoning and acting for each function which can result in high latency, cost, and sometimes inaccurate behavior. To address this, we introduce LLMCompiler, which executes functions in parallel to efficiently orchestrate multi-function calling. Drawing from the principles of classical compilers, LLMCompiler streamlines parallel function calling with three components: (i) an LLM Planner, formulating execution strategies and dependencies; (ii) a Task Fetching Unit, dispatching function calling tasks; and (iii) an Executor, executing these tasks in parallel. LLMCompiler automatically computes an optimized orchestration for the function calls and can be used with open-source models such as LLaMA-2. We have benchmarked LLMCompiler on a range of tasks including cases with non-trivial inter-dependency between function calls, as well as cases that require dynamic replanning based on intermediate results. We observe consistent latency speedup of up to 3.7x, cost savings of up to 6.7x, and accuracy improvement of up to ~9% as compared to ReAct. Additionally, LLMCompiler achieves up to 1.35x latency gain over OpenAI's recent parallel function calling, while achieving similar accuracy.
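The Planner / Task Fetching Unit / Executor split can be sketched with a hard-coded DAG and a thread pool; in LLMCompiler the DAG would come from the LLM Planner and the tasks would be real tool calls. All names below are our own.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def run_dag(tasks, deps, max_workers=4):
    # Task Fetching Unit + Executor sketch: dispatch every task whose
    # dependencies are done, run the batch in parallel, feed results forward.
    results, done = {}, set()
    pending = dict(tasks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            ready = [n for n in pending if all(d in done for d in deps.get(n, []))]
            if not ready:
                raise ValueError("dependency cycle")
            futures = {n: pool.submit(pending.pop(n), results) for n in ready}
            for n, f in futures.items():
                results[n] = f.result()
                done.add(n)
    return results

# Two independent "tool calls" that can run concurrently, then a join step.
def search_a(_):
    time.sleep(0.05)
    return 2

def search_b(_):
    time.sleep(0.05)
    return 3

def combine(res):
    return res["a"] * res["b"]

out = run_dag({"a": search_a, "b": search_b, "c": combine}, deps={"c": ["a", "b"]})
```

The two independent searches run concurrently, so wall-clock time is roughly one sleep rather than two; this parallel dispatch is the source of the latency gains the abstract reports over sequential reason-then-act loops.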
A Block Metropolis-Hastings Sampler for Controllable Energy-based Text Generation
results: 该采样器可以更有效地采样目标分布,并可以通过采样过程来确定生成长度。在两个可控生成任务中,该采样器表现出了更高的下游性能和更准确的目标分布采样。Abstract
Recent work has shown that energy-based language modeling is an effective framework for controllable text generation because it enables flexible integration of arbitrary discriminators. However, because energy-based LMs are globally normalized, approximate techniques like Metropolis-Hastings (MH) are required for inference. Past work has largely explored simple proposal distributions that modify a single token at a time, like in Gibbs sampling. In this paper, we develop a novel MH sampler that, in contrast, proposes re-writes of the entire sequence in each step via iterative prompting of a large language model. Our new sampler (a) allows for more efficient and accurate sampling from a target distribution and (b) allows generation length to be determined through the sampling procedure rather than fixed in advance, as past work has required. We perform experiments on two controlled generation tasks, showing both downstream performance gains and more accurate target distribution sampling in comparison with single-token proposal techniques.
摘要
最近的研究表明,基于能量的语言模型是可控文本生成的有效框架,因为它允许灵活地集成任意判别器。然而,由于基于能量的语言模型是全局归一化的,推断时需要使用 Metropolis-Hastings(MH)等近似技术。过去的工作主要探索了每次只修改单个词元的简单提议分布,类似于吉布斯采样。在这篇论文中,我们开发了一种新的 MH 采样器,它在每一步通过迭代提示大语言模型来提议对整个序列的重写。我们的新采样器(a)能够更高效、更准确地从目标分布中采样,(b)允许生成长度在采样过程中确定,而不像以往工作那样需要预先固定。我们在两个可控生成任务上进行了实验,结果表明,与单词元提议技术相比,该方法在下游性能和目标分布采样准确性上均有提升。
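The accept/reject core of such a sampler can be illustrated with a toy Metropolis-Hastings loop over whole sequences. Everything here is a stand-in: the energy function is a toy target, and the whole-sequence proposal is a symmetric random rewrite (the paper's proposal comes from iteratively prompting an LLM, which would require a proposal-ratio correction).

```python
import math, random

def energy(seq):
    # toy target p(x) ∝ exp(-E(x)): favor sequences summing to ~10
    return abs(sum(seq) - 10)

def propose(rng):
    # whole-sequence rewrite; note the length can change between steps
    n = rng.randint(1, 6)
    return [rng.randint(0, 5) for _ in range(n)]

def mh_sample(steps=2000, seed=0):
    rng = random.Random(seed)
    x, best = [0], [0]
    for _ in range(steps):
        y = propose(rng)
        # symmetric proposal: accept with probability min(1, exp(E(x) - E(y)))
        if rng.random() < math.exp(min(0.0, energy(x) - energy(y))):
            x = y
        if energy(x) < energy(best):
            best = x
    return best
```

Because the proposal rewrites the entire sequence, length is determined by the sampling procedure itself, unlike single-token Gibbs-style proposals that must fix it in advance.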
On the Learnability of Watermarks for Language Models
paper_authors: Chenchen Gu, Xiang Lisa Li, Percy Liang, Tatsunori Hashimoto
For: This paper focuses on the learnability of watermarks for responsible deployment of language models.* Methods: The authors propose a method called watermark distillation, which trains a student model to mimic the behavior of a teacher model that uses decoding-based watermarking.* Results: The authors test their approach on three different decoding-based watermarking strategies and various hyperparameter settings, and find that models can learn to generate watermarked text with high detectability. However, they also identify limitations to learnability, such as the loss of watermarking capabilities under fine-tuning on normal text and high sample complexity when learning low-distortion watermarks.* For: 这篇论文关注语言模型负责任部署中水印的可学习性。* Methods: 作者提出了一种称为水印蒸馏的方法,通过训练学生模型模仿使用基于解码的水印的教师模型的行为来实现。* Results: 作者在三种不同的基于解码的水印策略和各种超参数设置上进行测试,发现模型可以学习生成高可检测性的水印文本。但是,他们也发现了可学习性的限制,包括在正常文本上微调后水印能力的丧失,以及学习低失真水印时的高样本复杂度。Abstract
Watermarking of language model outputs enables statistical detection of model-generated text, which has many applications in the responsible deployment of language models. Existing watermarking strategies operate by altering the decoder of an existing language model, and the ability for a language model to directly learn to generate the watermark would have significant implications for the real-world deployment of watermarks. First, learned watermarks could be used to build open models that naturally generate watermarked text, allowing for open models to benefit from watermarking. Second, if watermarking is used to determine the provenance of generated text, an adversary can hurt the reputation of a victim model by spoofing its watermark and generating damaging watermarked text. To investigate the learnability of watermarks, we propose watermark distillation, which trains a student model to behave like a teacher model that uses decoding-based watermarking. We test our approach on three distinct decoding-based watermarking strategies and various hyperparameter settings, finding that models can learn to generate watermarked text with high detectability. We also find limitations to learnability, including the loss of watermarking capabilities under fine-tuning on normal text and high sample complexity when learning low-distortion watermarks.
摘要
对语言模型输出加水印可以对模型生成的文本进行统计检测,这在语言模型的负责任部署中有许多应用。现有的水印策略通过改变现有语言模型的解码器来实现,而如果语言模型能够直接学习生成水印,将对水印的实际部署产生重要影响。首先,学习到的水印可以用于构建天然生成水印文本的开放模型,让开放模型也能从水印中获益。其次,如果水印被用于确定生成文本的来源,攻击者可以伪造受害模型的水印并生成有害的水印文本,从而损害该模型的声誉。为了研究水印的可学习性,我们提出了水印蒸馏,即训练学生模型,使其行为与使用基于解码的水印的教师模型一致。我们在三种不同的基于解码的水印策略和各种超参数设置下测试了我们的方法,发现模型可以学习生成高可检测性的水印文本。我们还发现了可学习性的限制,包括在正常文本上微调后失去水印能力,以及学习低失真水印时的高样本复杂度。
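A sketch of detection for a "green list" decoding-based watermark, one of the simplest schemes in this family, may help fix ideas; the keyed hash, the half-vocabulary split, and the token alphabet are illustrative assumptions, not the paper's exact strategies. A keyed hash of the previous token marks half the vocabulary green; watermarked text over-uses green tokens, which a one-sided z-test detects.

```python
import hashlib, math

def is_green(prev_token, token, key="secret"):
    # keyed pseudorandom split of the vocabulary, seeded on the previous token
    h = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return h[0] % 2 == 0  # green with probability ~1/2 under no watermark

def z_score(tokens, key="secret"):
    n = len(tokens) - 1
    greens = sum(is_green(a, b, key) for a, b in zip(tokens, tokens[1:]))
    # under H0 (no watermark), greens ~ Binomial(n, 1/2)
    return (greens - 0.5 * n) / math.sqrt(0.25 * n)

# a crude "watermarked" sequence: always pick the first green continuation
seq = [0]
for _ in range(30):
    seq.append(next(t for t in range(100) if is_green(seq[-1], t)))
```

A distilled student that has learned the watermark would produce token statistics yielding a similarly large z-score, which is what "high detectability" means in the abstract.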
OpenAsp: A Benchmark for Multi-document Open Aspect-based Summarization
results: 研究人员发现,现有的自动概要模型和大语言模型在OpenAsp中表现不佳,表明OpenAsp提供了一种真实的开放方面场景,为未来的研究提供了新的挑战。Abstract
The performance of automatic summarization models has improved dramatically in recent years. Yet, there is still a gap in meeting specific information needs of users in real-world scenarios, particularly when a targeted summary is sought, such as in the useful aspect-based summarization setting targeted in this paper. Previous datasets and studies for this setting have predominantly concentrated on a limited set of pre-defined aspects, focused solely on single document inputs, or relied on synthetic data. To advance research on more realistic scenarios, we introduce OpenAsp, a benchmark for multi-document \textit{open} aspect-based summarization. This benchmark is created using a novel and cost-effective annotation protocol, by which an open aspect dataset is derived from existing generic multi-document summarization datasets. We analyze the properties of OpenAsp showcasing its high-quality content. Further, we show that the realistic open-aspect setting realized in OpenAsp poses a challenge for current state-of-the-art summarization models, as well as for large language models.
摘要
近年来,自动摘要模型的性能有了大幅提升。然而,在实际场景中满足用户的特定信息需求仍然存在差距,特别是在需要有针对性摘要的场景中,如本文所针对的基于方面的摘要设置。先前针对该设置的数据集和研究主要集中在有限的预定义方面上,仅针对单文档输入,或依赖合成数据。为了推进更真实场景下的研究,我们介绍 OpenAsp,一个多文档开放方面摘要基准。该基准使用一种新颖且经济的标注协议创建,从现有的通用多文档摘要数据集中派生出开放方面数据集。我们分析了 OpenAsp 的特性,展示了其高质量内容。此外,我们还证明了 OpenAsp 中实现的真实开放方面设置对当前最先进的摘要模型以及大语言模型都构成挑战。
When Input Integers are Given in the Unary Numeral Representation
results: 研究发现,当输入整数使用unary表示方式时,许多NP完全(或者NP硬)问题变得非常容易解决。此外,该论文还发现了一些NP完全问题的特点,即它们在不同的数字表示方式下的计算复杂性差异很大。Abstract
Many NP-complete problems take integers as part of their input instances. These input integers are generally binarized, that is, provided in the form of the "binary" numeral representation, and the lengths of such binary forms are used as a basis unit to measure the computational complexity of the problems. In sharp contrast, the "unarization" (or the "unary" numeral representation) of numbers has been known to bring a remarkably different effect onto the computational complexity of the problems. When no computational-complexity difference is observed between binarization and unarization of instances, on the contrary, the problems are said to be strong NP-complete. This work attempts to spotlight an issue of how the unarization of instances affects the computational complexity of various combinatorial problems. We present numerous NP-complete (or even NP-hard) problems, which turn out to be easily solvable when input integers are represented in unary. We then discuss the computational complexities of such problems when taking unary-form integer inputs. We hope that a list of such problems signifies the structural differences between strong NP-completeness and non-strong NP-completeness.
摘要
许多 NP 完全问题的输入实例中包含整数。这些输入整数通常被二进制化,即以“二进制”数字表示形式提供,并以二进制形式的长度作为衡量问题计算复杂性的基本单位。与此形成鲜明对比的是,数字的“一元化”(即“一元”数字表示形式)会对问题的计算复杂性产生截然不同的影响。相反,当二进制化和一元化的实例之间不存在计算复杂性差异时,该问题被称为强 NP 完全。本工作旨在探讨实例的一元化如何影响各种组合问题的计算复杂性。我们列举了许多 NP 完全(甚至 NP 难)问题,当输入整数以一元形式表示时,它们变得容易求解。然后我们讨论了这些问题在接受一元形式整数输入时的计算复杂性。我们希望这样一份问题清单能够揭示强 NP 完全与非强 NP 完全之间的结构差异。
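SUBSET SUM is the textbook example of the distinction the abstract draws (it is not claimed to be one of the paper's specific problems). Its classic dynamic program runs in time proportional to the target value, i.e. to the *unary* length of the target, so it is polynomial under unary inputs but pseudo-polynomial (exponential in the binary encoding length) under binary inputs:

```python
def subset_sum(nums, target):
    # Reachable-sums dynamic program: O(n * target) time and O(target) space.
    # The table size scales with the magnitude of `target`, not with the
    # number of bits needed to write it down.
    reachable = {0}
    for x in nums:
        reachable |= {s + x for s in reachable if s + x <= target}
    return target in reachable

print(subset_sum([3, 34, 4, 12, 5, 2], 9))  # True (4 + 5)
```

A strongly NP-complete problem, by contrast, stays hard even when all numbers in the instance are bounded by a polynomial in the input length, so no such unary speedup exists for it (unless P = NP).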
results: 我们的合并框架 ''Matching Models in their Task Subspace'' (MaTS) 在多任务和中间任务模型合并中实现了最先进的结果。我们在 https://github.com/r-three/mats 发布了所有代码和检查点。Abstract
Model merging aims to cheaply combine individual task-specific models into a single multitask model. In this work, we view past merging methods as leveraging different notions of a ''task subspace'' in which models are matched before being merged. We connect the task subspace of a given model to its loss landscape and formalize how this approach to model merging can be seen as solving a linear system of equations. While past work has generally been limited to linear systems that have a closed-form solution, we consider using the conjugate gradient method to find a solution. We show that using the conjugate gradient method can outperform closed-form solutions, enables merging via linear systems that are otherwise intractable to solve, and flexibly allows choosing from a wide variety of initializations and estimates for the ''task subspace''. We ultimately demonstrate that our merging framework called ''Matching Models in their Task Subspace'' (MaTS) achieves state-of-the-art results in multitask and intermediate-task model merging. We release all of the code and checkpoints used in our work at https://github.com/r-three/mats.
摘要
模型合并的目标是以低成本将多个任务特定模型组合成单个多任务模型。在这项工作中,我们将过去的合并方法视为利用不同概念的“任务子空间”,模型在其中先被匹配再被合并。我们将给定模型的任务子空间与其损失景观联系起来,并形式化地说明这种模型合并方法可以被视为求解一个线性方程组。以往的工作通常局限于具有闭式解的线性系统,而我们考虑使用共轭梯度法求解。我们证明,使用共轭梯度法可以优于闭式解,能够通过原本难以求解的线性系统进行合并,并可以灵活地从多种初始化和“任务子空间”估计中进行选择。我们最终证明,我们的合并框架 "Matching Models in their Task Subspace"(MaTS)在多任务和中间任务模型合并中达到了最先进的结果。我们在 https://github.com/r-three/mats 发布了工作中使用的所有代码和检查点。
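The key computational tool named above is the conjugate gradient method for a symmetric positive-definite system Ax = b, which avoids ever forming A⁻¹. The sketch below is a generic minimal CG solver on a toy 2×2 system, not MaTS's actual merging code (there A would be built from task-subspace statistics of the models being merged):

```python
def conjugate_gradient(A, b, iters=50, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A (lists of lists)."""
    n = len(b)
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(a * c for a, c in zip(u, v))
    x = [0.0] * n
    r = [bi - ai for bi, ai in zip(b, matvec(x))]  # initial residual
    p = list(r)
    rs = dot(r, r)
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# 2x2 SPD system [[4,1],[1,3]] x = [1,2]; exact solution is [1/11, 7/11]
x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Only matrix-vector products are required, which is why CG scales to systems that are "otherwise intractable to solve" in closed form.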
Beyond Surface: Probing LLaMA Across Scales and Layers
results: 基于所设计的探测任务,研究得出了一些关键且不寻常的发现,包括:(1)在水平方向上,扩大模型规模几乎不能自动带来额外的知识或计算能力,但可以提高推理能力(尤其是数学问题求解),并在超过一定规模阈值后有助于减少幻觉;(2)在垂直方向上,LLaMA 的底层缺乏大量算术和事实知识,但展现出逻辑思维、多语言和识别能力,而顶层集中了大部分计算能力和真实世界知识。Abstract
This paper presents an in-depth analysis of Large Language Models (LLMs), focusing on LLaMA, a prominent open-source foundational model in natural language processing. Instead of assessing LLaMA through its generative output, we design multiple-choice tasks to probe its intrinsic understanding in high-order tasks such as reasoning and computation. We examine the model horizontally, comparing different sizes, and vertically, assessing different layers. We unveil several key and uncommon findings based on the designed probing tasks: (1) Horizontally, enlarging model sizes almost could not automatically impart additional knowledge or computational prowess. Instead, it can enhance reasoning abilities, especially in math problem solving, and helps reduce hallucinations, but only beyond certain size thresholds; (2) In vertical analysis, the lower layers of LLaMA lack substantial arithmetic and factual knowledge, showcasing logical thinking, multilingual and recognitive abilities, with top layers housing most computational power and real-world knowledge.
摘要
(1) Horizontally, increasing model size does not automatically improve knowledge or computational abilities. Instead, larger models excel in math problem-solving and reduce hallucinations, but only above certain size thresholds. (2) Vertically, the lower layers of LLaMA have limited arithmetic and factual knowledge, demonstrating logical thinking, multilingual, and recognitive abilities. The upper layers possess the most computational power and real-world knowledge.
Prompt Highlighter: Interactive Control for Multi-Modal LLMs
results: 我们的方法可以与当前的LLMs和VLMs兼容,并在多种任务上达到了出色的自适应生成结果。实验表明,在推理过程中,通过高亮提示 span来引导模型的注意力,可以更好地控制生成的内容,并且可以生成更加可靠的结果。无需调试LLaVA-v1.5,我们的方法在MMBench测试中获得了69.5的分数,在MME-perception测试中获得了1552.5的分数。Abstract
This study targets a critical aspect of multi-modal LLMs' (LLMs&VLMs) inference: explicit controllable text generation. Multi-modal LLMs empower multi-modality understanding with the capability of semantic generation yet bring less explainability and heavier reliance on prompt contents due to their autoregressive generative nature. While manipulating prompt formats could improve outputs, designing specific and precise prompts per task can be challenging and ineffective. To tackle this issue, we introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation. Motivated by the classifier-free diffusion guidance, we form regular and unconditional context pairs based on highlighted tokens, demonstrating that the autoregressive generation in models can be guided in a classifier-free way. Notably, we find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs. Our approach is compatible with current LLMs and VLMs, achieving impressive customized generation results without training. Experiments confirm its effectiveness in focusing on input contexts and generating reliable content. Without tuning on LLaVA-v1.5, our method secured 69.5 in the MMBench test and 1552.5 in MME-perception. The code is available at: https://github.com/dvlab-research/Prompt-Highlighter/
摘要
Inspired by classifier-free diffusion guidance, we form regular and unconditional context pairs based on highlighted tokens, demonstrating that the autoregressive generation in models can be guided in a classifier-free way. We find that guiding the models with highlighted tokens through attention weights leads to more desired outputs during inference. Our approach is compatible with current LLMs and VLMs, achieving impressive customized generation results without training. Experimental results show that our method can effectively focus on input contexts and generate reliable content. Without tuning on LLaVA-v1.5, our method achieved 69.5 on the MMBench test and 1552.5 on MME-perception. The code for Prompt Highlighter is available at: https://github.com/dvlab-research/Prompt-Highlighter/
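The classifier-free recombination of the "regular" and "unconditional" context pairs can be illustrated on next-token log-probabilities. This is a schematic sketch only: the token names and probabilities are made up, and Prompt Highlighter additionally steers attention weights, which this snippet does not model.

```python
import math

def guided_logprobs(cond, uncond, w=1.5):
    """Combine two next-token distributions, classifier-free-guidance style:
    log p_guided ∝ log p_uncond + w * (log p_cond - log p_uncond)."""
    scores = {t: uncond[t] + w * (cond[t] - uncond[t]) for t in cond}
    z = math.log(sum(math.exp(s) for s in scores.values()))  # renormalize
    return {t: s - z for t, s in scores.items()}

# toy example: the highlighted context favors "cat"; guidance (w > 1)
# amplifies that preference beyond the conditional distribution alone
cond = {"cat": math.log(0.6), "dog": math.log(0.4)}
uncond = {"cat": math.log(0.5), "dog": math.log(0.5)}
g = guided_logprobs(cond, uncond)
```

With w = 1 the guided distribution reduces to the conditional one; larger w pushes generation harder toward whatever the highlighted span supports.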
PsyChat: A Client-Centric Dialogue System for Mental Health Support
results: 经自动和人工评估,所提出的对话系统在实际心理健康支持中显示出有效性和实用性,并能在模拟的来访者与虚拟咨询师互动场景中预测来访者行为,选择合适的咨询师策略,并生成准确且恰当的回复。Abstract
Dialogue systems are increasingly integrated into mental health support to help clients facilitate exploration, gain insight, take action, and ultimately heal themselves. For a dialogue system to be practical and user-friendly, it should be client-centric, focusing on the client's behaviors. However, existing dialogue systems publicly available for mental health support often concentrate solely on the counselor's strategies rather than the behaviors expressed by clients. This can lead to the implementation of unreasonable or inappropriate counseling strategies and corresponding responses from the dialogue system. To address this issue, we propose PsyChat, a client-centric dialogue system that provides psychological support through online chat. The client-centric dialogue system comprises five modules: client behavior recognition, counselor strategy selection, input packer, response generator intentionally fine-tuned to produce responses, and response selection. Both automatic and human evaluations demonstrate the effectiveness and practicality of our proposed dialogue system for real-life mental health support. Furthermore, we employ our proposed dialogue system to simulate a real-world client-virtual-counselor interaction scenario. The system is capable of predicting the client's behaviors, selecting appropriate counselor strategies, and generating accurate and suitable responses, as demonstrated in the scenario.
摘要
对话系统越来越多地被整合到心理健康支持中,以帮助来访者进行探索、获得洞察、采取行动,并最终自我疗愈。为了使对话系统实用且易用,它应该以来访者为中心,关注来访者的行为。然而,现有公开可用的心理健康支持对话系统通常只关注咨询师的策略,而非来访者表达的行为。这可能导致对话系统实施不合理或不恰当的咨询策略及相应回复。为解决这一问题,我们提出了 PsyChat,一个以来访者为中心、通过在线聊天提供心理支持的对话系统。该系统包括五个模块:来访者行为识别、咨询师策略选择、输入打包器、经过刻意微调以生成回复的回复生成器,以及回复选择。自动和人工评估均表明我们提出的对话系统在现实心理健康支持中的有效性和实用性。此外,我们利用该对话系统模拟了一个真实的来访者与虚拟咨询师互动场景。如场景所示,系统能够预测来访者的行为,选择合适的咨询策略,并生成准确且恰当的回复。
Swap distance minimization in SOV languages. Cognitive and mathematical foundations
paper_authors: Ramon Ferrer-i-Cancho, Savithry Namboodiripad
for: investigate the principle of swap distance minimization in the context of word order
methods: use word order rotation as a cognitive underpinning to test the prediction in three flexible order SOV languages
results: evidence of swap distance minimization found in all three languages, but weaker in Sinhalese, stronger in Korean and Malayalam.Abstract
Distance minimization is a general principle of language. A special case of this principle in the domain of word order is swap distance minimization. This principle predicts that variations from a canonical order that are reached by fewer swaps of adjacent constituents are lest costly and thus more likely. Here we investigate the principle in the context of the triple formed by subject (S), object (O) and verb (V). We introduce the concept of word order rotation as a cognitive underpinning of that prediction. When the canonical order of a language is SOV, the principle predicts SOV < SVO, OSV < VSO, OVS < VOS, in order of increasing cognitive cost. We test the prediction in three flexible order SOV languages: Korean (Koreanic), Malayalam (Dravidian), and Sinhalese (Indo-European). Evidence of swap distance minimization is found in all three languages, but it is weaker in Sinhalese. Swap distance minimization is stronger than a preference for the canonical order in Korean and especially Malayalam.
摘要
“距离最小化是语言的一般原则。该原则在语序领域的一个特例是交换距离最小化。该原则预测:偏离规范语序的变体若能通过较少的相邻成分交换达到,则代价较小,因而更有可能出现。本文在主语(S)、宾语(O)和动词(V)构成的三元组语境下研究这一原则。我们引入语序旋转的概念,作为该预测的认知基础。当一种语言的规范语序为 SOV 时,该原则预测认知代价按 SOV < {SVO, OSV} < {VSO, OVS} < VOS 的顺序递增。我们在三种语序灵活的 SOV 语言中检验这一预测:韩语(韩语系)、马拉雅拉姆语(德拉威语系)和僧伽罗语(印欧语系)。在三种语言中都发现了交换距离最小化的证据,但在僧伽罗语中较弱。在韩语、尤其是马拉雅拉姆语中,交换距离最小化强于对规范语序的偏好。”
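The swap distance used above is just the minimum number of adjacent swaps between two orders of the same constituents, i.e. the number of inversions (Kendall tau distance) of the permutation relating them. A small sketch, computing it for all six orders of S, O, V against a canonical SOV:

```python
def swap_distance(order, canonical):
    """Minimum number of adjacent swaps turning `canonical` into `order`,
    computed as the inversion count of the induced permutation."""
    pos = {w: i for i, w in enumerate(canonical)}
    perm = [pos[w] for w in order]
    return sum(1 for i in range(len(perm)) for j in range(i + 1, len(perm))
               if perm[i] > perm[j])

# With canonical SOV, the distances reproduce the predicted cost ordering:
# SOV (0) < SVO, OSV (1) < VSO, OVS (2) < VOS (3)
for order in ["SOV", "SVO", "OSV", "VSO", "OVS", "VOS"]:
    print(order, swap_distance(list(order), list("SOV")))
```

The O(n²) inversion count is fine for three constituents; a merge-sort-based count would be needed for longer sequences.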
Language Model Knowledge Distillation for Efficient Question Answering in Spanish
results: 实验结果表明,使用这种压缩方法可以保持大型模型的性能,同时提高推理速度,这将为西班牙语自然语言处理任务提供更多的可能性。Abstract
Recent advances in the development of pre-trained Spanish language models has led to significant progress in many Natural Language Processing (NLP) tasks, such as question answering. However, the lack of efficient models imposes a barrier for the adoption of such models in resource-constrained environments. Therefore, smaller distilled models for the Spanish language could be proven to be highly scalable and facilitate their further adoption on a variety of tasks and scenarios. In this work, we take one step in this direction by developing SpanishTinyRoBERTa, a compressed language model based on RoBERTa for efficient question answering in Spanish. To achieve this, we employ knowledge distillation from a large model onto a lighter model that allows for a wider implementation, even in areas with limited computational resources, whilst attaining negligible performance sacrifice. Our experiments show that the dense distilled model can still preserve the performance of its larger counterpart, while significantly increasing inference speedup. This work serves as a starting point for further research and investigation of model compression efforts for Spanish language models across various NLP tasks.
摘要
最近预训练西班牙语语言模型的进展推动了许多自然语言处理(NLP)任务(如问答)的显著进步。然而,缺乏高效模型阻碍了这些模型在资源受限环境中的采用。因此,更小的西班牙语蒸馏模型有望具备高度可扩展性,并促进其在各种任务和场景中的进一步采用。在这项工作中,我们朝这一方向迈出了一步,开发了 SpanishTinyRoBERTa,一个基于 RoBERTa 的压缩语言模型,用于高效的西班牙语问答。为此,我们采用知识蒸馏,将大型模型的知识传递到更轻量的模型中,使其即使在计算资源有限的环境中也能更广泛地部署,同时几乎不损失性能。我们的实验表明,蒸馏后的紧凑模型仍能保持其大型对应模型的性能,同时显著提高推理速度。这项工作为西班牙语语言模型在各类 NLP 任务上的模型压缩研究提供了起点。
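The distillation objective at the heart of such work is usually a temperature-scaled soft-target loss: the student matches the teacher's softened output distribution. The sketch below shows that loss on toy logits; the temperature value and logits are illustrative, and this paper's exact loss formulation is not specified in the abstract.

```python
import math

def softmax(logits, T=1.0):
    exps = [math.exp(l / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy of the student against the temperature-softened teacher
    distribution, scaled by T^2 (the usual gradient-magnitude correction)."""
    p = softmax(teacher_logits, T)  # soft targets from the teacher
    q = softmax(student_logits, T)
    return -T * T * sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

The loss is minimized when the student's softened distribution equals the teacher's; higher temperatures expose more of the teacher's "dark knowledge" about near-miss classes.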
Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
results: 在三个开源人类对齐的LLMs上,我们的方法在中文和英文Malicious Instructions上达到了出色的Jailbreak攻击性能。此外,我们进行了详细的减少实验,并证明了我们的核心想法“自然语言回应倾向分析”的有效性。我们的探索还暴露了LLMs在后续对话中被引导生成更详细的危害回应的潜在漏洞。Abstract
Extensive work has been devoted to improving the safety mechanism of Large Language Models (LLMs). However, in specific scenarios, LLMs still generate harmful responses when faced with malicious instructions, a phenomenon referred to as "Jailbreak Attack". In our research, we introduce a novel jailbreak attack method (\textbf{RADIAL}), which consists of two steps: 1) Inherent Response Tendency Analysis: we analyze the inherent affirmation and rejection tendency of LLMs to react to real-world instructions. 2) Real-World Instructions-Driven Jailbreak: based on our analysis, we strategically choose several real-world instructions and embed malicious instructions into them to amplify the LLM's potential to generate harmful responses. On three open-source human-aligned LLMs, our method achieves excellent jailbreak attack performance for both Chinese and English malicious instructions. Besides, we guided detailed ablation experiments and verified the effectiveness of our core idea "Inherent Response Tendency Analysis". Our exploration also exposes the vulnerability of LLMs to being induced into generating more detailed harmful responses in subsequent rounds of dialogue.
摘要
大量工作致力于改进大语言模型(LLM)的安全机制。然而,在特定情况下,LLM 面对恶意指令时仍会生成有害回应,这种现象被称为“越狱攻击”(Jailbreak Attack)。在我们的研究中,我们介绍了一种新的越狱攻击方法(RADIAL),它包括两个步骤:1. 内在回应倾向分析:我们分析 LLM 对真实世界指令做出肯定或拒绝回应的内在倾向。2. 真实世界指令驱动的越狱:根据我们的分析,我们有策略地选择若干真实世界指令,并将恶意指令嵌入其中,以放大 LLM 生成有害回应的可能性。在三个开源的人类对齐 LLM 上,我们的方法对中文和英文恶意指令均取得了优异的越狱攻击性能。此外,我们还进行了详细的消融实验,验证了我们的核心思想“内在回应倾向分析”的有效性。我们的探索也暴露了 LLM 在后续对话轮次中可被诱导生成更详细有害回应的潜在漏洞。
Comparing Large Language Model AI and Human-Generated Coaching Messages for Behavioral Weight Loss
results: 研究发现,在第一阶段,人工智能讯息的有用性评分较低,但在第二阶段,人工智能讯息与人类撰写的讯息同样有用。受试者也肯定了人工智能讯息在同理心和个性化建议方面的价值。Abstract
Automated coaching messages for weight control can save time and costs, but their repetitive, generic nature may limit their effectiveness compared to human coaching. Large language model (LLM) based artificial intelligence (AI) chatbots, like ChatGPT, could offer more personalized and novel messages to address repetition with their data-processing abilities. While LLM AI demonstrates promise to encourage healthier lifestyles, studies have yet to examine the feasibility and acceptability of LLM-based BWL coaching. 87 adults in a weight-loss trial rated ten coaching messages' helpfulness (five human-written, five ChatGPT-generated) using a 5-point Likert scale, providing additional open-ended feedback to justify their ratings. Participants also identified which messages they believed were AI-generated. The evaluation occurred in two phases: messages in Phase 1 were perceived as impersonal and negative, prompting revisions for Phase 2 messages. In Phase 1, AI-generated messages were rated less helpful than human-written ones, with 66 percent receiving a helpfulness rating of 3 or higher. However, in Phase 2, the AI messages matched the human-written ones regarding helpfulness, with 82% scoring three or above. Additionally, 50% were misidentified as human-written, suggesting AI's sophistication in mimicking human-generated content. A thematic analysis of open-ended feedback revealed that participants appreciated AI's empathy and personalized suggestions but found them more formulaic, less authentic, and too data-focused. This study reveals the preliminary feasibility and acceptability of LLM AIs, like ChatGPT, in crafting potentially effective weight control coaching messages. Our findings also underscore areas for future enhancement.
摘要
自动化的体重控制指导消息可以节省时间和成本,但其重复、泛化的特点可能限制其与人类指导相比的有效性。基于大语言模型(LLM)的人工智能(AI)聊天机器人(如 ChatGPT)凭借其数据处理能力,可以提供更个性化、更新颖的消息以解决重复性问题。虽然 LLM AI 在鼓励更健康生活方式方面展现出潜力,但尚无研究检验基于 LLM 的行为体重减轻(BWL)指导的可行性和可接受性。87 名参与体重减轻试验的成年人使用 5 点李克特量表对 10 条指导消息(5 条人工撰写,5 条 ChatGPT 生成)的有用性进行评分,并提供开放式反馈以说明其评分理由。参与者还需指出他们认为哪些消息是 AI 生成的。评估分两个阶段进行:第一阶段的消息被认为缺乏人情味且偏消极,因此对第二阶段的消息进行了修订。在第一阶段,AI 生成的消息被评为不如人工撰写的消息有用,66% 的消息获得 3 分或以上的有用性评分。然而在第二阶段,AI 消息在有用性上与人工撰写的消息持平,82% 的消息获得 3 分或以上。此外,50% 的 AI 消息被误认为是人工撰写的,表明 AI 在模仿人类生成内容方面的成熟度。对开放式反馈的主题分析显示,参与者欣赏 AI 的同理心和个性化建议,但认为这些消息更公式化、不够真实,且过于聚焦数据。这项研究揭示了 LLM AI(如 ChatGPT)在撰写可能有效的体重控制指导消息方面的初步可行性和可接受性。我们的发现也指明了未来需要改进之处。
Multimodal Misinformation Detection in a South African Social Media Environment
methods: 本研究使用了多种信息来检测谣言,包括文本和视觉元素。它使用了 bidirectional encoder representations from transformers(BERT)作为文本编码器,并使用 residual network(ResNet)作为视觉编码器。
results: 结果表明,在南非社交媒体环境中使用南非样本进行模型训练可以提高模型的性能,并且多模态模型在不同文化环境中的知识传递性较高。Abstract
With the constant spread of misinformation on social media networks, a need has arisen to continuously assess the veracity of digital content. This need has inspired numerous research efforts on the development of misinformation detection (MD) models. However, many models do not use all information available to them and existing research contains a lack of relevant datasets to train the models, specifically within the South African social media environment. The aim of this paper is to investigate the transferability of knowledge of a MD model between different contextual environments. This research contributes a multimodal MD model capable of functioning in the South African social media environment, as well as introduces a South African misinformation dataset. The model makes use of multiple sources of information for misinformation detection, namely: textual and visual elements. It uses bidirectional encoder representations from transformers (BERT) as the textual encoder and a residual network (ResNet) as the visual encoder. The model is trained and evaluated on the Fakeddit dataset and a South African misinformation dataset. Results show that using South African samples in the training of the model increases model performance, in a South African contextual environment, and that a multimodal model retains significantly more knowledge than both the textual and visual unimodal models. Our study suggests that the performance of a misinformation detection model is influenced by the cultural nuances of its operating environment and multimodal models assist in the transferability of knowledge between different contextual environments. Therefore, local data should be incorporated into the training process of a misinformation detection model in order to optimize model performance.
摘要
随着虚假信息在社交媒体网络上不断传播,人们需要持续评估数字内容的真实性。这种需求激发了大量关于虚假信息检测(MD)模型开发的研究。然而,许多模型并未利用所有可用信息,且现有研究缺乏用于训练模型的相关数据集,尤其是在南非社交媒体环境中。本文旨在研究 MD 模型的知识在不同上下文环境之间的可迁移性。本研究贡献了一个能够在南非社交媒体环境中运行的多模态 MD 模型,并引入了一个南非虚假信息数据集。该模型利用文本和视觉元素等多种信息来源进行虚假信息检测,使用基于 Transformer 的双向编码器表示(BERT)作为文本编码器,残差网络(ResNet)作为视觉编码器。模型在 Fakeddit 数据集和南非虚假信息数据集上进行训练和评估。结果表明,在模型训练中加入南非样本可以提升模型在南非上下文环境中的性能,并且多模态模型比文本和视觉单模态模型保留了显著更多的知识。我们的研究表明,虚假信息检测模型的性能受其运行环境文化差异的影响,而多模态模型有助于知识在不同上下文环境之间的迁移。因此,应将本地数据纳入虚假信息检测模型的训练过程,以优化模型性能。
RoAST: Robustifying Language Models via Adversarial Perturbation with Selective Training
results: 在六种不同类型的语言模型上优于最先进微调方法的表现表明了 RoAST 的实用性。Abstract
Fine-tuning pre-trained language models (LMs) has become the de facto standard in many NLP tasks. Nevertheless, fine-tuned LMs are still prone to robustness issues, such as adversarial robustness and model calibration. Several perspectives of robustness for LMs have been studied independently, but lacking a unified consideration in multiple perspectives. In this paper, we propose Robustifying LMs via Adversarial perturbation with Selective Training (RoAST), a simple yet effective fine-tuning technique to enhance the multi-perspective robustness of LMs in a unified way. RoAST effectively incorporates two important sources for the model robustness, robustness on the perturbed inputs and generalizable knowledge in pre-trained LMs. To be specific, RoAST introduces adversarial perturbation during fine-tuning while the model parameters are selectively updated upon their relative importance to minimize unnecessary deviation. Under a unified evaluation of fine-tuned LMs by incorporating four representative perspectives of model robustness, we demonstrate the effectiveness of RoAST compared to state-of-the-art fine-tuning methods on six different types of LMs, which indicates its usefulness in practice.
摘要
微调预训练语言模型(LM)已成为许多 NLP 任务中的事实标准。然而,微调后的 LM 仍然存在鲁棒性问题,如对抗鲁棒性和模型校准。LM 鲁棒性的多个视角此前被独立研究,缺乏多视角下的统一考量。在这篇论文中,我们提出了“通过选择性训练的对抗扰动增强语言模型鲁棒性”(RoAST),一种简单而有效的微调技术,以统一的方式增强 LM 的多视角鲁棒性。RoAST 有效地结合了模型鲁棒性的两个重要来源:对扰动输入的鲁棒性和预训练 LM 中的可泛化知识。具体而言,RoAST 在微调过程中引入对抗扰动,并根据模型参数的相对重要性对其进行选择性更新,以尽量减少不必要的偏离。通过纳入模型鲁棒性的四个代表性视角对微调后的 LM 进行统一评估,我们在六种不同类型的 LM 上证明了 RoAST 相对于最先进微调方法的有效性,这表明其在实践中的实用价值。
results: 该论文通过对现有文献和实验数据进行分析和验证,证明了其框架的可靠性和有效性,并发现了两种新的不良交互现象。Abstract
Machine learning (ML) models cannot neglect risks to security, privacy, and fairness. Several defenses have been proposed to mitigate such risks. When a defense is effective in mitigating one risk, it may correspond to increased or decreased susceptibility to other risks. Existing research lacks an effective framework to recognize and explain these unintended interactions. We present such a framework, based on the conjecture that overfitting and memorization underlie unintended interactions. We survey existing literature on unintended interactions, accommodating them within our framework. We use our framework to conjecture on two previously unexplored interactions, and empirically validate our conjectures.
摘要
机器学习(ML)模型不能忽视安全、隐私和公平方面的风险。为减轻这些风险,已有多种防御策略被提出。当一种防御有效减轻某一风险时,可能相应地增加或降低对其他风险的敏感性。现有研究缺乏一个识别和解释这些非预期交互的有效框架。我们提出了这样一个框架,其基于过拟合和记忆是非预期交互根源的猜想。我们调研了关于非预期交互的现有文献,并将其纳入我们的框架。我们使用该框架对两种此前未被探索的交互进行猜想,并通过实验验证了我们的猜想。
Trajeglish: Learning the Language of Driving Scenarios
results: 模型在 Waymo Sim Agents Benchmark 上得分最高,在真实性和交互度量上分别超过先前工作 3.3% 和 9.9%。模型在全自动和半自动设置下进行了消融,并证明了模型学习的表示可以快速迁移以改进 nuScenes 上的性能。Abstract
A longstanding challenge for self-driving development is simulating dynamic driving scenarios seeded from recorded driving logs. In pursuit of this functionality, we apply tools from discrete sequence modeling to model how vehicles, pedestrians and cyclists interact in driving scenarios. Using a simple data-driven tokenization scheme, we discretize trajectories to centimeter-level resolution using a small vocabulary. We then model the multi-agent sequence of motion tokens with a GPT-like encoder-decoder that is autoregressive in time and takes into account intra-timestep interaction between agents. Scenarios sampled from our model exhibit state-of-the-art realism; our model tops the Waymo Sim Agents Benchmark, surpassing prior work along the realism meta metric by 3.3% and along the interaction metric by 9.9%. We ablate our modeling choices in full autonomy and partial autonomy settings, and show that the representations learned by our model can quickly be adapted to improve performance on nuScenes. We additionally evaluate the scalability of our model with respect to parameter count and dataset size, and use density estimates from our model to quantify the saliency of context length and intra-timestep interaction for the traffic modeling task.
摘要
自动驾驶开发中的一个长期挑战是基于记录的驾驶日志模拟动态驾驶场景。为实现这一功能,我们应用离散序列建模工具,对驾驶场景中车辆、行人和骑行者之间的互动进行建模。我们使用一种简单的数据驱动词元化方案,以小词汇表将轨迹离散化到厘米级分辨率。然后,我们使用一个类 GPT 的编码器-解码器模型对多代理的运动词元序列进行建模,该模型在时间上自回归,并考虑同一时间步内代理之间的互动。从我们的模型中采样的场景展现出最先进的真实性;我们的模型在 Waymo Sim Agents Benchmark 中名列榜首,在真实性元指标上超过先前工作 3.3%,在交互指标上超过 9.9%。我们在完全自主和部分自主设置下对建模选择进行了消融,并表明我们的模型学到的表示可以快速适配以提升在 nuScenes 上的性能。此外,我们还评估了模型在参数量和数据集规模方面的可扩展性,并利用模型给出的密度估计来量化上下文长度和同一时间步内互动对交通建模任务的重要性。
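A toy version of the "simple data-driven tokenization scheme" can make the idea concrete: quantize per-step displacements onto a centimeter grid and map each quantized delta to a single token id. The grid size, clipping span, and resulting vocabulary are illustrative assumptions, not the paper's actual parameters.

```python
def tokenize(traj, cell=0.01, span=1.0):
    """Map a list of (x, y) positions (meters) to motion-token ids.
    Deltas are quantized to `cell`-sized bins and clipped to [-span, span],
    giving a vocabulary of (2k+1)^2 tokens for k = span/cell."""
    k = round(span / cell)
    tokens = []
    for (x0, y0), (x1, y1) in zip(traj, traj[1:]):
        dx = max(-k, min(k, round((x1 - x0) / cell)))
        dy = max(-k, min(k, round((y1 - y0) / cell)))
        tokens.append((dx + k) * (2 * k + 1) + (dy + k))  # flatten to one id
    return tokens
```

Once trajectories are token sequences, the multi-agent motion-modeling problem reduces to next-token prediction, which is what makes a GPT-like encoder-decoder applicable.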
Horizon-Free and Instance-Dependent Regret Bounds for Reinforcement Learning with General Function Approximation
results: 实现了无时域依赖(horizon-free)、实例依赖(instance-dependent)的锐利遗憾界,并且计算高效。Here's a more detailed explanation of each point:
for: The paper aims to solve the long-planning horizon problem in reinforcement learning, which is a challenging problem that many existing algorithms cannot handle.
methods: The proposed algorithm UCRL-WVTR uses a novel approach that combines weighted value-targeted regression and high-order moment estimation to achieve horizon-free and instance-dependent performance.
results: The paper shows that UCRL-WVTR achieves a sharp regret bound, which is the best possible bound for this type of problem, and is computationally efficient. The results are corroborated by comprehensive experiments.

Abstract
To tackle long planning horizon problems in reinforcement learning with general function approximation, we propose the first algorithm, termed UCRL-WVTR, that achieves both a \emph{horizon-free} and an \emph{instance-dependent} regret bound, as it eliminates the polynomial dependency on the planning horizon. The derived regret bound is deemed \emph{sharp}, as it matches the minimax lower bound when specialized to linear mixture MDPs up to logarithmic factors. Furthermore, UCRL-WVTR is \emph{computationally efficient} with access to a regression oracle. The achievement of such a horizon-free, instance-dependent, and sharp regret bound hinges upon (i) novel algorithm designs: weighted value-targeted regression and a high-order moment estimator in the context of general function approximation; and (ii) fine-grained analyses: a novel concentration bound for weighted non-linear least squares and a refined analysis which leads to the tight instance-dependent bound. We also conduct comprehensive experiments to corroborate our theoretical findings.
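The core computational primitive named above, weighted value-targeted regression, reduces in the linear special case to a weighted least-squares solve. The toy below sketches that primitive for two features with explicit normal equations; UCRL-WVTR itself works with general function approximation through a regression oracle, so this is only an illustrative stand-in.

```python
# Toy sketch of the weighted regression primitive behind weighted
# value-targeted regression: solve min_w sum_i s_i * (w.x_i - y_i)^2.
# The 2-d linear model and explicit normal-equation solve are
# illustrative simplifications, not the paper's general oracle.

def weighted_least_squares(xs, ys, weights, lam=1e-6):
    """Solve a 2-d ridge-regularized weighted least-squares problem."""
    # Accumulate A = lam*I + sum_i s_i x_i x_i^T and b = sum_i s_i y_i x_i.
    a00 = a01 = a11 = b0 = b1 = 0.0
    for (x0, x1), y, s in zip(xs, ys, weights):
        a00 += s * x0 * x0
        a01 += s * x0 * x1
        a11 += s * x1 * x1
        b0 += s * y * x0
        b1 += s * y * x1
    a00 += lam
    a11 += lam
    det = a00 * a11 - a01 * a01
    # Closed-form 2x2 solve of A w = b.
    return ((a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det)
```

In the algorithm, the weights s_i would come from variance estimates produced by the high-order moment estimator.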
Privacy-preserving quantum federated learning via gradient hiding
results: The two protocols proposed in this paper are more efficient than prior protocols that rely on expressive variational quantum circuits or differential privacy techniques, providing strong privacy protection with low communication resources. They represent a notable advance for distributed quantum computing and lay the groundwork for secure distributed quantum machine learning.

Abstract
Distributed quantum computing, particularly distributed quantum machine learning, has gained substantial prominence for its capacity to harness the collective power of distributed quantum resources, transcending the limitations of individual quantum nodes. Meanwhile, the critical concern of privacy within distributed computing protocols remains a significant challenge, particularly in standard classical federated learning (FL) scenarios where data of participating clients is susceptible to leakage via gradient inversion attacks by the server. This paper presents innovative quantum protocols with quantum communication designed to address the FL problem, strengthen privacy measures, and optimize communication efficiency. In contrast to previous works that leverage expressive variational quantum circuits or differential privacy techniques, we consider gradient information concealment using quantum states and propose two distinct FL protocols, one based on private inner-product estimation and the other on incremental learning. These protocols offer substantial advancements in privacy preservation with low communication resources, forging a path toward efficient quantum communication-assisted FL protocols and contributing to the development of secure distributed quantum machine learning, thus addressing critical privacy concerns in the quantum computing era.
FreqFed: A Frequency Analysis-Based Approach for Mitigating Poisoning Attacks in Federated Learning
results: Effectively mitigates poisoning attacks with a negligible impact on the utility of the aggregated model.

Abstract
Federated learning (FL) is a collaborative learning paradigm allowing multiple clients to jointly train a model without sharing their training data. However, FL is susceptible to poisoning attacks, in which the adversary injects manipulated model updates into the federated model aggregation process to corrupt or destroy predictions (untargeted poisoning) or implant hidden functionalities (targeted poisoning or backdoors). Existing defenses against poisoning attacks in FL have several limitations, such as relying on specific assumptions about attack types, strategies, or data distributions, or not being sufficiently robust against advanced injection techniques and strategies while simultaneously maintaining the utility of the aggregated model. To address the deficiencies of existing defenses, we take a generic and completely different approach to detect poisoning (targeted and untargeted) attacks. We present FreqFed, a novel aggregation mechanism that transforms the model updates (i.e., weights) into the frequency domain, where we can identify the core frequency components that inherit sufficient information about weights. This allows us to effectively filter out malicious updates during local training on the clients, regardless of attack types, strategies, and clients' data distributions. We extensively evaluate the efficiency and effectiveness of FreqFed in different application domains, including image classification, word prediction, IoT intrusion detection, and speech recognition. We demonstrate that FreqFed can mitigate poisoning attacks effectively with a negligible impact on the utility of the aggregated model.
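A hedged sketch of the frequency-domain filtering idea: transform each client's flattened update with a DCT, keep the low-frequency components as a fingerprint, and drop updates whose fingerprint sits far from the others. The naive O(n^2) DCT and the median-distance filter below are illustrative stand-ins for FreqFed's actual transform and clustering.

```python
import math

# Minimal sketch of the frequency-domain idea behind FreqFed: a DCT of
# each flattened weight update, low-frequency fingerprints, and a simple
# median-distance filter. All thresholds are illustrative assumptions.

def dct_ii(x):
    """Naive DCT-II of a real sequence (O(n^2), for illustration only)."""
    n = len(x)
    return [sum(x[j] * math.cos(math.pi * k * (2 * j + 1) / (2 * n))
                for j in range(n)) for k in range(n)]

def fingerprint(update, n_low=4):
    """Core low-frequency components of an update."""
    return dct_ii(update)[:n_low]

def filter_updates(updates, factor=3.0):
    """Keep updates whose fingerprint is close to the coordinate-wise median."""
    fps = [fingerprint(u) for u in updates]
    center = [sorted(c)[len(c) // 2] for c in zip(*fps)]
    dists = [math.dist(fp, center) for fp in fps]
    med = sorted(dists)[len(dists) // 2]
    return [u for u, d in zip(updates, dists) if d <= factor * (med + 1e-12)]
```

With four similar benign updates and one scaled-up malicious update, the filter keeps the benign four and drops the outlier.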
Monitoring Sustainable Global Development Along Shared Socioeconomic Pathways
paper_authors: Michelle W. L. Wan, Jeffrey N. Clark, Edward A. Small, Elena Fillola Mayoral, Raúl Santos-Rodríguez
for: Monitoring sustainable global development
methods: Mathematically derived scoring algorithms and machine learning methods
results: An initial study shows promising results, demonstrating the feasibility of monitoring sustainable global development.

Abstract
Sustainable global development is one of the most prevalent challenges facing the world today, hinging on the equilibrium between socioeconomic growth and environmental sustainability. We propose approaches to monitor and quantify sustainable development along the Shared Socioeconomic Pathways (SSPs), including mathematically derived scoring algorithms, and machine learning methods. These integrate socioeconomic and environmental datasets, to produce an interpretable metric for SSP alignment. An initial study demonstrates promising results, laying the groundwork for the application of different methods to the monitoring of sustainable global development.
On the Impact of Multi-dimensional Local Differential Privacy on Fairness
results: The study finds that multi-dimensional LDP is an efficient approach to reduce disparity; whether LDP is applied independently or jointly across attributes matters only at low privacy guarantees; and the outcome Y distribution has an important effect on which group is more sensitive to the obfuscation.

Abstract
Automated decision systems are increasingly used to make consequential decisions in people's lives. Due to the sensitivity of the manipulated data as well as the resulting decisions, several ethical concerns need to be addressed for the appropriate use of such technologies, in particular, fairness and privacy. Unlike previous work, which focused on centralized differential privacy (DP) or local DP (LDP) for a single sensitive attribute, in this paper, we examine the impact of LDP in the presence of several sensitive attributes (i.e., multi-dimensional data) on fairness. Detailed empirical analysis on synthetic and benchmark datasets revealed very relevant observations. In particular, (1) multi-dimensional LDP is an efficient approach to reduce disparity, (2) the multi-dimensional approach of LDP (independent vs. combined) matters only at low privacy guarantees, and (3) the outcome Y distribution has an important effect on which group is more sensitive to the obfuscation. Last, we summarize our findings in the form of recommendations to guide practitioners in adopting effective privacy-preserving practices while maintaining fairness and utility in ML applications.
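To illustrate the independent-versus-combined distinction studied here, the sketch below applies k-ary randomized response, a standard LDP mechanism, either once per attribute with an evenly split privacy budget or once over the joint domain. The two-attribute record, the domain sizes, and the even budget split are illustrative assumptions.

```python
import math
import random

# Toy comparison of the two multi-dimensional LDP strategies: k-ary
# randomized response applied per attribute (budget split) vs. once over
# the combined domain. The two-attribute setup is an assumption.

def k_rr(value, domain, epsilon, rng):
    """k-ary randomized response: report truth w.p. e^eps / (e^eps + k - 1)."""
    k = len(domain)
    p_true = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_true:
        return value
    return rng.choice([v for v in domain if v != value])

def ldp_independent(record, domains, epsilon, rng):
    """Perturb each attribute independently, splitting the budget evenly."""
    eps_each = epsilon / len(record)
    return tuple(k_rr(v, dom, eps_each, rng)
                 for v, dom in zip(record, domains))

def ldp_combined(record, domains, epsilon, rng):
    """One randomized response over the cross-product of two domains."""
    joint = [(a, b) for a in domains[0] for b in domains[1]]
    return k_rr(record, joint, epsilon, rng)
```

The combined mechanism has a larger domain (so lower truth probability per call) but spends the whole budget once, which is exactly the trade-off the paper reports as mattering only at low privacy guarantees.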
Semi-Supervised Active Learning for Semantic Segmentation in Unknown Environments Using Informative Path Planning
results: Achieves segmentation performance close to fully supervised approaches with drastically reduced human labelling effort, while outperforming self-supervised approaches.

Abstract
Semantic segmentation enables robots to perceive and reason about their environments beyond geometry. Most of such systems build upon deep learning approaches. As autonomous robots are commonly deployed in initially unknown environments, pre-training on static datasets cannot always capture the variety of domains and limits the robot's perception performance during missions. Recently, self-supervised and fully supervised active learning methods emerged to improve a robot's vision. These approaches rely on large in-domain pre-training datasets or require substantial human labelling effort. We propose a planning method for semi-supervised active learning of semantic segmentation that substantially reduces human labelling requirements compared to fully supervised approaches. We leverage an adaptive map-based planner guided towards the frontiers of unexplored space with high model uncertainty collecting training data for human labelling. A key aspect of our approach is to combine the sparse high-quality human labels with pseudo labels automatically extracted from highly certain environment map areas. Experimental results show that our method reaches segmentation performance close to fully supervised approaches with drastically reduced human labelling effort while outperforming self-supervised approaches.
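The label-combination step, merging sparse human labels with pseudo labels from highly certain map regions, can be sketched as follows; the dictionary-based map, the fixed confidence threshold, and the rule that human labels always win are simplifying assumptions rather than the paper's exact procedure.

```python
# Hedged sketch of combining sparse, high-quality human labels with
# pseudo labels taken only where the model is highly certain. The flat
# dictionary map and the 0.9 threshold are illustrative assumptions.

def build_training_labels(human_labels, model_probs, confidence=0.9):
    """human_labels: {pixel: class}; model_probs: {pixel: {class: prob}}."""
    labels = dict(human_labels)            # sparse human labels always win
    for pixel, probs in model_probs.items():
        if pixel in labels:
            continue
        cls, p = max(probs.items(), key=lambda kv: kv[1])
        if p >= confidence:                # pseudo-label only certain pixels
            labels[pixel] = cls
    return labels
```

Uncertain regions stay unlabeled; in the full system they instead drive the map-based planner toward frontiers worth exploring and labelling.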
A Scalable Network-Aware Multi-Agent Reinforcement Learning Framework for Decentralized Inverter-based Voltage Control
for: This paper addresses the challenges of decentralized voltage control in power grids with an increasing number of distributed generations (DGs).
methods: The paper proposes a scalable network-aware (SNA) framework that leverages the network structure to truncate the critic's input, improving scalability and reducing communication costs during training.
results: The SNA framework is successfully demonstrated in a system with 114 DGs, providing a promising solution for decentralized voltage control in increasingly complex power grid systems.

Abstract
This paper addresses the challenges associated with decentralized voltage control in power grids due to an increase in distributed generations (DGs). Traditional model-based voltage control methods struggle with the rapid energy fluctuations and uncertainties of these DGs. While multi-agent reinforcement learning (MARL) has shown potential for decentralized secondary control, scalability issues arise when dealing with a large number of DGs. This problem lies in the dominant centralized training and decentralized execution (CTDE) framework, where the critics take global observations and actions. To overcome these challenges, we propose a scalable network-aware (SNA) framework that leverages network structure to truncate the input to the critic's Q-function, thereby improving scalability and reducing communication costs during training. Further, the SNA framework is theoretically grounded with provable approximation guarantee, and it can seamlessly integrate with multiple multi-agent actor-critic algorithms. The proposed SNA framework is successfully demonstrated in a system with 114 DGs, providing a promising solution for decentralized voltage control in increasingly complex power grid systems.
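The input-truncation idea at the heart of the SNA framework can be sketched as: rather than giving the critic global observations, restrict it to the k-hop neighborhood of each agent in the grid graph. The BFS neighborhood and flat observation dictionary below are illustrative; the real critic is a learned Q-function, and the truncation comes with a provable approximation guarantee in the paper.

```python
from collections import deque

# Toy sketch of the scalable network-aware (SNA) idea: truncate the
# critic's input to an agent's kappa-hop graph neighborhood. The BFS
# and flat observation dict are illustrative simplifications.

def khop_neighborhood(adj, node, kappa):
    """Nodes within kappa hops of `node` in adjacency dict `adj`."""
    seen, frontier = {node}, deque([(node, 0)])
    while frontier:
        u, d = frontier.popleft()
        if d == kappa:
            continue
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                frontier.append((v, d + 1))
    return seen

def truncated_critic_input(adj, observations, agent, kappa=1):
    """Keep only observations of agents in the kappa-hop neighborhood."""
    keep = khop_neighborhood(adj, agent, kappa)
    return {a: o for a, o in observations.items() if a in keep}
```

On a 114-DG feeder, a small kappa keeps the critic's input size roughly constant as the number of agents grows, which is the source of the scalability gain.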
Investigating the Design Space of Diffusion Models for Speech Enhancement
results: The paper shows that, with proper choices of preconditioning, training loss weighting, SDE, and sampler, the proposed system outperforms a popular diffusion-based speech enhancement system while using fewer sampling steps, thus reducing the computational cost.

Abstract
Diffusion models are a new class of generative models that have shown outstanding performance in the image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach in adapting diffusion models to speech enhancement consists in modelling a progressive transformation between the clean and noisy speech signals. However, one popular diffusion model framework previously established in the image generation literature did not account for such a transformation towards the system input, which prevents relating the existing diffusion-based speech enhancement systems with the aforementioned diffusion model framework. To address this, we extend this framework to account for the progressive transformation between the clean and noisy speech signals. This allows us to apply recent developments from the image generation literature, and to systematically investigate design aspects of diffusion models that remain largely unexplored for speech enhancement, such as the neural network preconditioning, the training loss weighting, the stochastic differential equation (SDE), or the amount of stochasticity injected in the reverse process. We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals. Moreover, we show that a proper choice of preconditioning, training loss weighting, SDE and sampler allows to outperform a popular diffusion-based speech enhancement system in terms of perceptual metrics while using fewer sampling steps, thus reducing the computational cost by a factor of four.
NeuJeans: Private Neural Network Inference with Joint Optimization of Convolution and Bootstrapping
results: NeuJeans performs private inference of a deep CNN at the scale of ImageNet (ResNet18) within a few seconds and accelerates conv2d by up to 5.68 times over state-of-the-art FHE-based PI work.

Abstract
Fully homomorphic encryption (FHE) is a promising cryptographic primitive for realizing private neural network inference (PI) services by allowing a client to fully offload the inference task to a cloud server while keeping the client data oblivious to the server. This work proposes NeuJeans, an FHE-based solution for the PI of deep convolutional neural networks (CNNs). NeuJeans tackles the critical problem of the enormous computational cost for the FHE evaluation of convolutional layers (conv2d), mainly due to the high cost of data reordering and bootstrapping. We first propose an encoding method introducing nested structures inside encoded vectors for FHE, which enables us to develop efficient conv2d algorithms with reduced data reordering costs. However, the new encoding method also introduces additional computations for conversion between encoding methods, which could negate its advantages. We discover that fusing conv2d with bootstrapping eliminates such computations while reducing the cost of bootstrapping. Then, we devise optimized execution flows for various types of conv2d and apply them to end-to-end implementation of CNNs. NeuJeans accelerates the performance of conv2d by up to 5.68 times compared to state-of-the-art FHE-based PI work and performs the PI of a CNN at the scale of ImageNet (ResNet18) within a mere few seconds.
Improved Efficient Two-Stage Denoising Diffusion Power System Measurement Recovery Against False Data Injection Attacks and Data Losses
results: Extensive numerical case studies show that the proposed TSDM accurately recovers power system measurements, even under strong randomness and complex cyber-physical contingencies.

Here's a brief explanation of each point:
for: The paper is written to address the issue of measurement uncertainties in power systems, specifically the problem of recovering accurate measurements despite random changes and data losses.
methods: The proposed solution is an improved two-stage denoising diffusion model (TSDM) that consists of a classifier-guided conditional anomaly detection component and a diffusion-based measurement imputation component. The model also employs precise means and optimal variances to accelerate the diffusion generation process with subsequence sampling.
results: Extensive numerical case studies demonstrate that the proposed TSDM can accurately recover power system measurements under various measurement uncertainties, including renewable energy integration and complex cyber-physical contingencies. Additionally, the proposed TSDM shows stronger robustness compared to existing reconstruction networks and lower computational complexity than general denoising diffusion models.

Abstract
Measurement uncertainties, represented by cyber-attacks and data losses, seriously degrade the quality of power system measurements. Fortunately, the powerful generation ability of the denoising diffusion models can enable more precise measurement generation for power system data recovery. However, the controllable data generation and efficient computing methods of denoising diffusion models for deterministic trajectory still need further investigation. To this end, this paper proposes an improved two-stage denoising diffusion model (TSDM) to identify and reconstruct the measurements with various measurement uncertainties. The first stage of the model comprises a classifier-guided conditional anomaly detection component, while the second stage involves diffusion-based measurement imputation component. Moreover, the proposed TSDM adopts precise means and optimal variances to accelerate the diffusion generation process with subsequence sampling. Extensive numerical case studies demonstrate that the proposed TSDM can accurately recover power system measurements despite strong randomness under renewable energy integration and highly nonlinear dynamics under complex cyber-physical contingencies. Additionally, the proposed TSDM has stronger robustness compared to existing reconstruction networks and exhibits lower computational complexity than general denoising diffusion models.
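A highly simplified sketch of the two-stage pipeline: stage one flags suspect measurements and stage two reconstructs them. Here a robust z-score test stands in for the classifier-guided anomaly detector, and linear interpolation stands in for the diffusion-based imputation; the threshold and the 1-D series are purely illustrative.

```python
# Highly simplified stand-in for the two-stage recovery described above:
# (1) flag anomalies with a MAD-based robust z-score (replacing the
# classifier-guided detector), (2) reconstruct flagged points by linear
# interpolation (replacing diffusion-based imputation). Illustrative only.

def flag_anomalies(series, threshold=4.0):
    """Return a boolean flag per point using a median/MAD robust z-score."""
    med = sorted(series)[len(series) // 2]
    dev = sorted(abs(x - med) for x in series)[len(series) // 2] or 1e-9
    return [abs(x - med) / dev > threshold for x in series]

def impute(series, flags):
    """Replace flagged points using their nearest clean neighbors."""
    out = list(series)
    good = [i for i, bad in enumerate(flags) if not bad]
    for i, bad in enumerate(flags):
        if not bad:
            continue
        left = max((j for j in good if j < i), default=None)
        right = min((j for j in good if j > i), default=None)
        if left is None or right is None:
            out[i] = out[left if right is None else right]
        else:  # linear interpolation between the nearest clean neighbors
            t = (i - left) / (right - left)
            out[i] = out[left] + t * (out[right] - out[left])
    return out
```

In TSDM itself, the second stage would instead run a conditional reverse diffusion with subsequence sampling, so this sketch only mirrors the pipeline's shape, not its statistics.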
results: The results show that lazy LBCS and stochastic LBCS significantly improve upon Gözcü et al.'s greedy learning-based CS (LBCS) approach and scale to large, clinically relevant scenarios such as multi-coil 3D MR and dynamic MRI. The thesis also shows that generative adversarial networks (GANs) can serve as a natural criterion for adaptive sampling, leveraging variance in the measurement domain to guide acquisition.

Abstract
Despite its exceptional soft tissue contrast, Magnetic Resonance Imaging (MRI) faces the challenge of long scanning times compared to other modalities like X-ray radiography. Shortening scanning times is crucial in clinical settings, as it increases patient comfort, decreases examination costs and improves throughput. Recent advances in compressed sensing (CS) and deep learning allow accelerated MRI acquisition by reconstructing high-quality images from undersampled data. While reconstruction algorithms have received most of the focus, designing acquisition trajectories to optimize reconstruction quality remains an open question. This thesis explores two approaches to address this gap in the context of Cartesian MRI. First, we propose two algorithms, lazy LBCS and stochastic LBCS, that significantly improve upon G\"ozc\"u et al.'s greedy learning-based CS (LBCS) approach. These algorithms scale to large, clinically relevant scenarios like multi-coil 3D MR and dynamic MRI, previously inaccessible to LBCS. Additionally, we demonstrate that generative adversarial networks (GANs) can serve as a natural criterion for adaptive sampling by leveraging variance in the measurement domain to guide acquisition. Second, we delve into the underlying structures or assumptions that enable mask design algorithms to perform well in practice. Our experiments reveal that state-of-the-art deep reinforcement learning (RL) approaches, while capable of adaptation and long-horizon planning, offer only marginal improvements over stochastic LBCS, which is neither adaptive nor does long-term planning. Altogether, our findings suggest that stochastic LBCS and similar methods represent promising alternatives to deep RL. They shine in particular by their scalability and computational efficiency and could be key in the deployment of optimized acquisition trajectories in Cartesian MRI.
Equivariant Scalar Fields for Molecular Docking with Fast Fourier Transforms
paper_authors: Bowen Jing, Tommi Jaakkola, Bonnie Berger
for: The paper is written for researchers and practitioners in molecular docking and virtual screening who are interested in accelerating the optimization of scoring functions using machine learning techniques.
methods: The paper proposes learning a scoring function whose functional form allows more rapid optimization: the cross-correlation of multi-channel ligand and protein scalar fields parameterized by equivariant graph neural networks, with fast Fourier transforms enabling rapid optimization over rigid-body degrees of freedom.
results: On two simplified docking-related tasks (decoy pose scoring and rigid conformer docking), the method attains similar but faster performance on crystal structures compared to the widely used Vina and Gnina scoring functions, and is more robust on computationally predicted structures.

Abstract
Molecular docking is critical to structure-based virtual screening, yet the throughput of such workflows is limited by the expensive optimization of scoring functions involved in most docking algorithms. We explore how machine learning can accelerate this process by learning a scoring function with a functional form that allows for more rapid optimization. Specifically, we define the scoring function to be the cross-correlation of multi-channel ligand and protein scalar fields parameterized by equivariant graph neural networks, enabling rapid optimization over rigid-body degrees of freedom with fast Fourier transforms. The runtime of our approach can be amortized at several levels of abstraction, and is particularly favorable for virtual screening settings with a common binding pocket. We benchmark our scoring functions on two simplified docking-related tasks: decoy pose scoring and rigid conformer docking. Our method attains similar but faster performance on crystal structures compared to the widely-used Vina and Gnina scoring functions, and is more robust on computationally predicted structures. Code is available at https://github.com/bjing2016/scalar-fields.
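The scoring idea, pose score as the cross-correlation of ligand and protein scalar fields, can be sketched in one dimension as below. The direct O(n^2) loop over shifts is an illustrative stand-in for the FFT acceleration the paper uses, and the toy fields are made up.

```python
# Minimal 1-D sketch of scoring rigid translations as a cross-correlation
# of ligand and protein scalar fields. The paper evaluates this over
# rigid-body degrees of freedom with fast Fourier transforms; the direct
# O(n^2) loop and the toy fields here are illustrative stand-ins.

def cross_correlation_scores(protein_field, ligand_field):
    """Score every circular shift of the ligand field against the protein."""
    n = len(protein_field)
    return [sum(protein_field[(i + shift) % n] * ligand_field[i]
                for i in range(n))
            for shift in range(n)]

def best_pose(protein_field, ligand_field):
    """Return the shift (toy 'pose') with the highest correlation score."""
    scores = cross_correlation_scores(protein_field, ligand_field)
    return max(range(len(scores)), key=scores.__getitem__)
```

Because cross-correlation diagonalizes under the Fourier transform, the same search runs in O(n log n) with an FFT, which is what makes exhaustive pose scoring over a common binding pocket cheap to amortize.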
Stochastic-Constrained Stochastic Optimization with Markovian Data
methods: We extend the drift-plus-penalty framework, a primal-dual stochastic gradient method developed for the i.i.d. case, to the Markov chain sampling setting. We propose two variants, one for when the mixing time of the underlying Markov chain is known and one for when it is unknown. Our algorithms apply to the more general setting of constrained online convex optimization where the sequence of constraint functions follows a Markov chain.
results: We demonstrate the effectiveness of the proposed methods through numerical experiments on classification with fairness constraints.

Abstract
This paper considers stochastic-constrained stochastic optimization where the stochastic constraint is to satisfy that the expectation of a random function is below a certain threshold. In particular, we study the setting where data samples are drawn from a Markov chain and thus are not independent and identically distributed. We generalize the drift-plus-penalty framework, a primal-dual stochastic gradient method developed for the i.i.d. case, to the Markov chain sampling setting. We propose two variants of drift-plus-penalty; one is for the case when the mixing time of the underlying Markov chain is known while the other is for the case of unknown mixing time. In fact, our algorithms apply to a more general setting of constrained online convex optimization where the sequence of constraint functions follows a Markov chain. Both algorithms are adaptive in that the first works without knowledge of the time horizon while the second uses AdaGrad-style algorithm parameters, which is of independent interest. We demonstrate the effectiveness of our proposed methods through numerical experiments on classification with fairness constraints.
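The drift-plus-penalty framework the paper generalizes can be sketched on a toy 1-D problem: maintain a virtual queue Q_t for the constraint g(x) <= 0 and descend on V*f + Q_t*g. The problem (minimize x^2 subject to 1 - x <= 0, optimum x = 1), the step size, and the penalty parameter V are illustrative choices, and the deterministic updates below omit the paper's Markov-sampled setting entirely.

```python
# Toy sketch of a drift-plus-penalty update. A virtual queue q tracks
# constraint violation of g(x) = 1 - x <= 0 while x takes gradient steps
# on V*f(x) + q*g(x) with f(x) = x^2. All constants are illustrative.

def drift_plus_penalty(steps=2000, eta=0.01, V=10.0):
    x, q = 0.0, 0.0
    for _ in range(steps):
        grad = V * 2 * x + q * (-1.0)     # d/dx [ V*x^2 + q*(1 - x) ]
        x -= eta * grad                   # primal gradient step
        q = max(0.0, q + (1.0 - x))       # queue (dual) update, g(x) = 1 - x
    return x
```

The queue grows while the constraint is violated, which steadily tilts the primal step toward feasibility; here the iterates settle near the constrained optimum x = 1. The paper's contribution is showing how to control this dynamic when the g's are revealed along a Markov chain rather than i.i.d.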
Finding Interpretable Class-Specific Patterns through Efficient Neural Search
results: On synthetic and real-world data, including three biological applications, DiffNaps consistently yields accurate, succinct, and interpretable class descriptions, and, unlike existing methods, it remains applicable in large-scale settings.

Abstract
Discovering patterns in data that best describe the differences between classes allows to hypothesize and reason about class-specific mechanisms. In molecular biology, for example, this bears promise of advancing the understanding of cellular processes differing between tissues or diseases, which could lead to novel treatments. To be useful in practice, methods that tackle the problem of finding such differential patterns have to be readily interpretable by domain experts, and scalable to the extremely high-dimensional data. In this work, we propose a novel, inherently interpretable binary neural network architecture DIFFNAPS that extracts differential patterns from data. DiffNaps is scalable to hundreds of thousands of features and robust to noise, thus overcoming the limitations of current state-of-the-art methods in large-scale applications such as in biology. We show on synthetic and real world data, including three biological applications, that, unlike its competitors, DiffNaps consistently yields accurate, succinct, and interpretable class descriptions.
A Structural-Clustering Based Active Learning for Graph Neural Networks
results: Experiments show that SPA achieves higher accuracy and macro-F1 scores than existing methods across different annotation budgets, and significantly reduces query time.

Abstract
In active learning for graph-structured data, Graph Neural Networks (GNNs) have shown effectiveness. However, a common challenge in these applications is the underutilization of crucial structural information. To address this problem, we propose the Structural-Clustering PageRank method for improved Active learning (SPA) specifically designed for graph-structured data. SPA integrates community detection using the SCAN algorithm with the PageRank scoring method for efficient and informative sample selection. SPA prioritizes nodes that are not only informative but also central in structure. Through extensive experiments, SPA demonstrates higher accuracy and macro-F1 score over existing methods across different annotation budgets and achieves significant reductions in query time. In addition, the proposed method only adds two hyperparameters, $\epsilon$ and $\mu$ in the algorithm to finely tune the balance between structural learning and node selection. This simplicity is a key advantage in active learning scenarios, where extensive hyperparameter tuning is often impractical.
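The selection rule in SPA, picking nodes that are both informative and structurally central, can be sketched by scoring nodes with PageRank and taking the top-ranked node inside each community. The power-iteration PageRank below is standard; the community assignment is supplied directly rather than computed with SCAN, and the informativeness term is omitted, so treat this as an illustrative fragment only.

```python
# Sketch of SPA-style selection: PageRank centrality plus one pick per
# community. PageRank via power iteration is standard; the community
# labels are given here instead of being produced by the SCAN algorithm.

def pagerank(adj, damping=0.85, iters=50):
    """Power-iteration PageRank on an adjacency dict (no dangling nodes)."""
    nodes = list(adj)
    rank = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iters):
        new = {u: (1 - damping) / len(nodes) for u in nodes}
        for u in nodes:
            for v in adj[u]:
                new[v] += damping * rank[u] / len(adj[u])
        rank = new
    return rank

def select_per_community(adj, communities):
    """communities: {node: community_id} -> most central node per community."""
    rank = pagerank(adj)
    best = {}
    for node, com in communities.items():
        if com not in best or rank[node] > rank[best[com]]:
            best[com] = node
    return best
```

On two triangles joined by a bridge, the bridge endpoints are the most central node of each community, matching the intuition that structurally central samples anchor each cluster.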
Simulating the Air Quality Impact of Prescribed Fires Using a Graph Neural Network-Based PM$_{2.5}$ Emissions Forecasting System
results: The experiments focus on determining the optimal time for implementing prescribed fires in California and on quantifying the potential air quality trade-offs involved in conducting more prescribed fires outside the fire season.

Abstract
The increasing size and severity of wildfires across western North America have generated dangerous levels of PM$_{2.5}$ pollution in recent years. In a warming climate, expanding the use of prescribed fires is widely considered to be the most robust fire mitigation strategy. However, reliably forecasting the potential air quality impact from these prescribed fires, a critical ingredient in determining the fires' location and time, at hourly to daily time scales remains a challenging problem. This paper proposes a novel integration of prescribed fire simulation with a spatio-temporal graph neural network-based PM$_{2.5}$ forecasting model. The experiments in this work focus on determining the optimal time for implementing prescribed fires in California as well as quantifying the potential air quality trade-offs involved in conducting more prescribed fires outside the fire season.
Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data
results: The convergence speed of FedSplit is compared with the standard federated learning method both theoretically and empirically, demonstrating FedSplit's faster convergence; the generalization bound of the FedSplit method is also studied. By introducing factor analysis, the FedSplit model is practically implemented on real data as FedFac, whose superior prediction performance is verified empirically.
Abstract
Federated learning is an emerging distributed machine learning framework aiming at protecting data privacy. Data heterogeneity is one of the core challenges in federated learning, which could severely degrade the convergence rate and prediction performance of deep neural networks. To address this issue, we develop a novel personalized federated learning framework for heterogeneous data, which we refer to as FedSplit. This modeling framework is motivated by the finding that data in different clients contain both common knowledge and personalized knowledge. The hidden elements in each neural layer can thus be split into shared and personalized groups. With this decomposition, a novel objective function is established and optimized. We demonstrate that FedSplit enjoys a faster convergence speed than the standard federated learning method both theoretically and empirically. The generalization bound of the FedSplit method is also studied. To practically implement the proposed method on real datasets, factor analysis is introduced to facilitate the decoupling of hidden elements. This leads to a practically implementable version of FedSplit, which we further refer to as FedFac. We demonstrate through simulation studies that factor analysis can well recover the underlying shared/personalized decomposition. The superior prediction performance of FedFac is further verified empirically by comparison with various state-of-the-art federated learning methods on several real datasets.
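A minimal sketch of the shared/personalized split described above, under assumed shapes (this is an illustration, not the authors' code): each client's layer is a weight matrix whose first rows form the shared block; the server averages only that block, and each client keeps its personalized rows. In FedFac the split index would come from factor analysis; here it is fixed by hand.

```python
import random

def aggregate_shared(client_weights, n_shared):
    """Average the first n_shared rows across clients; keep the rest local."""
    n_clients = len(client_weights)
    width = len(client_weights[0][0])
    shared_avg = [
        [sum(w[i][j] for w in client_weights) / n_clients for j in range(width)]
        for i in range(n_shared)
    ]
    # Each client keeps its own personalized rows below the shared block.
    return [shared_avg + w[n_shared:] for w in client_weights]

random.seed(0)
# Three clients, each with a 6x4 layer; the first 2 rows are "shared" units.
clients = [[[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]
           for _ in range(3)]
updated = aggregate_shared(clients, n_shared=2)
```

After aggregation the shared block is identical across clients while personalized rows are untouched, which is the property the objective function above exploits.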
Estimating Countries with Similar Maternal Mortality Rate using Cluster Analysis and Pairing Countries with Identical MMR
results: The study finds that MMR trends are comparatively high in some countries and comparatively low in others, and identifies factors, such as hospital facilities and the quality of medical services, that affect maternal health.
Abstract
In the evolving world, we need the younger generation to flourish and help their regions develop. Much of the world's population is unaware of the complications involved in the routines followed during pregnancy, and of how hospital facilities affect maternal health. Maternal mortality is the death of a pregnant woman due to complications correlated with pregnancy, underlying conditions exacerbated by the pregnancy, or the management of these conditions. It is crucial to consider the Maternal Mortality Rate (MMR) in diverse locations and determine which routines and hospital facilities diminish it. This research aims to identify the countries facing the greatest MMR threats, as well as countries alike in the MMR encountered. Data consisting of earlier years' observations is collected and examined for various countries. From a machine learning perspective, unsupervised learning is implemented to perform cluster analysis. The pairs of countries with similar MMR, as well as the extreme opposite pairs with respect to MMR, are thereby found.
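The pairing step described above can be illustrated with a few lines of Python. The MMR figures below are made-up placeholders (per 100,000 live births), not real data, and the exhaustive pairwise search stands in for the paper's cluster analysis.

```python
from itertools import combinations

# Hypothetical country-level MMR values; placeholders only.
mmr = {"A": 12.0, "B": 14.0, "C": 510.0, "D": 498.0, "E": 45.0}

def closest_pair(values):
    """Pair of countries with the most similar (near-identical) MMR."""
    return min(combinations(values, 2),
               key=lambda p: abs(values[p[0]] - values[p[1]]))

def extreme_pair(values):
    """Pair of countries with the most dissimilar MMR."""
    return max(combinations(values, 2),
               key=lambda p: abs(values[p[0]] - values[p[1]]))

similar = closest_pair(mmr)    # countries to be paired as "identical MMR"
opposite = extreme_pair(mmr)   # the extreme opposite pair
```

On real data one would first cluster countries (e.g. with k-means on MMR and related indicators) and then pair within and across clusters.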
Invariant Random Forest: Tree-Based Model Solution for OOD Generalization
results: Numerical tests show that IDT and IRF outperform non-OOD tree models, implying that OOD generalization is necessary for decision tree models and should be given more attention.
Abstract
Out-Of-Distribution (OOD) generalization is an essential topic in machine learning. However, recent research is only focusing on the corresponding methods for neural networks. This paper introduces a novel and effective solution for OOD generalization of decision tree models, named Invariant Decision Tree (IDT). IDT enforces a penalty term with regard to the unstable/varying behavior of a split across different environments during the growth of the tree. Its ensemble version, the Invariant Random Forest (IRF), is constructed. Our proposed method is motivated by a theoretical result under mild conditions, and validated by numerical tests with both synthetic and real datasets. The superior performance compared to non-OOD tree models implies that considering OOD generalization for tree models is absolutely necessary and should be given more attention.
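A hedged sketch of the penalized split criterion described above: a candidate split is scored by its pooled gain minus a penalty on how much that gain varies across environments. The function names and the exact variance penalty are illustrative choices, not the paper's formulation.

```python
def split_gain(y_left, y_right):
    """Variance reduction achieved by a split, for a regression target."""
    def var(y):
        if not y:
            return 0.0
        m = sum(y) / len(y)
        return sum((v - m) ** 2 for v in y) / len(y)
    y_all = y_left + y_right
    n = len(y_all)
    return var(y_all) - (len(y_left) * var(y_left)
                         + len(y_right) * var(y_right)) / n

def invariant_score(envs, lam=1.0):
    """envs: list of (y_left, y_right) per environment; penalize unstable gains."""
    gains = [split_gain(l, r) for l, r in envs]
    mean_gain = sum(gains) / len(gains)
    instability = sum((g - mean_gain) ** 2 for g in gains) / len(gains)
    return mean_gain - lam * instability

# A split that works identically in both environments vs. one that fails in env 2.
stable = [([0.0, 0.0], [1.0, 1.0]), ([0.0, 0.0], [1.0, 1.0])]
unstable = [([0.0, 0.0], [1.0, 1.0]), ([0.0, 1.0], [0.0, 1.0])]
```

The stable split scores strictly higher, which is the behavior the penalty term is meant to enforce during tree growth.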
CODEX: A Cluster-Based Method for Explainable Reinforcement Learning
paper_authors: Timothy K. Mathes, Jessica Inman, Andrés Colón, Simon Khan
for: Improving the adoption of Reinforcement Learning (RL) in real-world applications by explaining RL agent behavior and building user trust.
methods: Using semantic clustering to effectively summarize RL agent behavior in the state-action space.
results: Experiments show that the semantic clusters retain temporal and entity information, and relate key episodic events in the game environments to the semantic space.
Abstract
Despite the impressive feats demonstrated by Reinforcement Learning (RL), these algorithms have seen little adoption in high-risk, real-world applications due to current difficulties in explaining RL agent actions and building user trust. We present Counterfactual Demonstrations for Explanation (CODEX), a method that incorporates semantic clustering, which can effectively summarize RL agent behavior in the state-action space. Experimentation on the MiniGrid and StarCraft II gaming environments reveals the semantic clusters retain temporal as well as entity information, which is reflected in the constructed summary of agent behavior. Furthermore, clustering the discrete+continuous game-state latent representations identifies the most crucial episodic events, demonstrating a relationship between the latent and semantic spaces. This work contributes to the growing body of work that strives to unlock the power of RL for widespread use by leveraging and extending techniques from Natural Language Processing.
Constrained Hierarchical Clustering via Graph Coarsening and Optimal Cuts
results: The problem is solved in two steps: first, as a soft-constrained regularized least-squares that guides the result towards the horizontal feasible set; then, flat clusters are extracted from the resulting hierarchical tree by computing optimal cut heights based on the available constraints. The approach compares very well with existing algorithms and is computationally light.
Abstract
Motivated by extracting and summarizing relevant information in short sentence settings, such as satisfaction questionnaires, hotel reviews, and X/Twitter, we study the problem of clustering words in a hierarchical fashion. In particular, we focus on the problem of clustering with horizontal and vertical structural constraints. Horizontal constraints are typically cannot-link and must-link among words, while vertical constraints are precedence constraints among cluster levels. We overcome state-of-the-art bottlenecks by formulating the problem in two steps: first, as a soft-constrained regularized least-squares which guides the result of a sequential graph coarsening algorithm towards the horizontal feasible set. Then, flat clusters are extracted from the resulting hierarchical tree by computing optimal cut heights based on the available constraints. We show that the resulting approach compares very well with respect to existing algorithms and is computationally light.
Wavelength-multiplexed Delayed Inputs for Memory Enhancement of Microring-based Reservoir Computing
paper_authors: Bernard J. Giron Castro, Christophe Peucheret, Francesco Da Ros
for: solves memory-demanding tasks like time-series prediction
methods: combines parallel delayed inputs and wavelength division multiplexing
results: good performance without requiring external optical feedback
Abstract
We numerically demonstrate a silicon add-drop microring-based reservoir computing scheme that combines parallel delayed inputs and wavelength division multiplexing. The scheme solves memory-demanding tasks like time-series prediction with good performance without requiring external optical feedback.
Coherent energy and force uncertainty in deep learning force fields
results: The model yields aleatoric uncertainty for both energies and forces, and epistemic uncertainty can additionally be obtained through a Bayesian interpretation of deep ensemble models.
Abstract
In machine learning energy potentials for atomic systems, forces are commonly obtained as the negative derivative of the energy function with respect to atomic positions. To quantify aleatoric uncertainty in the predicted energies, a widely used modeling approach involves predicting both a mean and variance for each energy value. However, this model is not differentiable under the usual white noise assumption, so energy uncertainty does not naturally translate to force uncertainty. In this work we propose a machine learning potential energy model in which energy and force aleatoric uncertainty are linked through a spatially correlated noise process. We demonstrate our approach on an equivariant messages passing neural network potential trained on energies and forces on two out-of-equilibrium molecular datasets. Furthermore, we also show how to obtain epistemic uncertainties in this setting based on a Bayesian interpretation of deep ensemble models.
A novel feature selection framework for incomplete data
results: Experiments show that the proposed method significantly outperforms other approaches on both artificially generated and real incomplete datasets.
Abstract
Feature selection on incomplete datasets is an exceptionally challenging task. Existing methods address this challenge by first employing imputation methods to complete the incomplete data and then conducting feature selection based on the imputed data. Since imputation and feature selection are entirely independent steps, the importance of features cannot be considered during imputation. However, in real-world scenarios or datasets, different features have varying degrees of importance. To address this, we propose a novel incomplete data feature selection framework that considers feature importance. The framework mainly consists of two alternating iterative stages: the M-stage and the W-stage. In the M-stage, missing values are imputed based on a given feature importance vector and multiple initial imputation results. In the W-stage, an improved reliefF algorithm is employed to learn the feature importance vector based on the imputed data. Specifically, the feature importance vector obtained in the current iteration of the W-stage serves as input for the next iteration of the M-stage. Experimental results on both artificially generated and real incomplete datasets demonstrate that the proposed method outperforms other approaches significantly.
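A skeleton of the alternating scheme described above (illustrative, not the paper's implementation): the M-stage fills missing cells with an importance-weighted blend of candidate imputations, and the W-stage re-estimates feature importance from the completed data. The paper uses an improved reliefF in the W-stage; here a normalized-variance score is a stand-in, and the blending rule is an assumption.

```python
def m_stage(data, candidates, importance):
    """Fill each missing cell (None) with an importance-weighted blend of
    two candidate imputations for that feature."""
    filled = [row[:] for row in data]
    for i, row in enumerate(filled):
        for j, v in enumerate(row):
            if v is None:
                w = importance[j]  # weight candidate imputers by importance
                filled[i][j] = (w * candidates[0][i][j]
                                + (1 - w) * candidates[1][i][j])
    return filled

def w_stage(filled):
    """Stand-in for reliefF: score features by normalized column variance."""
    n_feat = len(filled[0])
    scores = []
    for j in range(n_feat):
        col = [row[j] for row in filled]
        m = sum(col) / len(col)
        scores.append(sum((v - m) ** 2 for v in col) / len(col))
    total = sum(scores) or 1.0
    return [s / total for s in scores]

data = [[1.0, None], [2.0, 3.0], [3.0, None]]
cand_mean = [[1.0, 3.0], [2.0, 3.0], [3.0, 3.0]]  # mean imputation
cand_zero = [[1.0, 0.0], [2.0, 3.0], [3.0, 0.0]]  # zero imputation
importance = [0.5, 0.5]
for _ in range(3):  # alternate M- and W-stages
    filled = m_stage(data, (cand_mean, cand_zero), importance)
    importance = w_stage(filled)
```

The importance vector produced by each W-stage feeds the next M-stage, mirroring the alternating iteration described in the abstract.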
Mixture of Dynamical Variational Autoencoders for Multi-Source Trajectory Modeling and Separation
results: The model achieves good results on a computer vision task (multi-object tracking) and an audio processing task (single-channel audio source separation), outperforming several baseline methods.
Abstract
In this paper, we propose a latent-variable generative model called mixture of dynamical variational autoencoders (MixDVAE) to model the dynamics of a system composed of multiple moving sources. A DVAE model is pre-trained on a single-source dataset to capture the source dynamics. Then, multiple instances of the pre-trained DVAE model are integrated into a multi-source mixture model with a discrete observation-to-source assignment latent variable. The posterior distributions of both the discrete observation-to-source assignment variable and the continuous DVAE variables representing the sources content/position are estimated using a variational expectation-maximization algorithm, leading to multi-source trajectories estimation. We illustrate the versatility of the proposed MixDVAE model on two tasks: a computer vision task, namely multi-object tracking, and an audio processing task, namely single-channel audio source separation. Experimental results show that the proposed method works well on these two tasks, and outperforms several baseline methods.
Improving Communication Efficiency of Federated Distillation via Accumulating Local Updates
methods: This paper proposes a new technique named ALU (Accumulated Local Update), which accumulates multiple rounds of local updates in federated distillation before transferring the knowledge to the central server.
results: Experiments show that ALU substantially improves the communication efficiency of federated distillation, leading to better training outcomes.
Abstract
As an emerging federated learning paradigm, federated distillation enables communication-efficient model training by transmitting only small-scale knowledge during the learning process. To further improve the communication efficiency of federated distillation, we propose a novel technique, ALU, which accumulates multiple rounds of local updates before transferring the knowledge to the central server. ALU drastically decreases the frequency of communication in federated distillation, thereby significantly reducing the communication overhead during the training process. Empirical experiments demonstrate the substantial effect of ALU in improving the communication efficiency of federated distillation.
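The accumulation idea can be sketched in a few lines, under stated assumptions: each client produces soft predictions ("knowledge") every local round, buffers them for k rounds, and only then transmits the averaged knowledge, cutting communication frequency by a factor of k. Averaging as the accumulation rule is an illustrative assumption.

```python
def run_client(knowledge_per_round, accumulate_k):
    """Return only the messages actually transmitted to the server."""
    messages, buffer = [], []
    for logits in knowledge_per_round:
        buffer.append(logits)
        if len(buffer) == accumulate_k:
            # Transmit one averaged message instead of accumulate_k messages.
            avg = [sum(col) / accumulate_k for col in zip(*buffer)]
            messages.append(avg)
            buffer = []
    return messages

# Four local rounds of per-class soft predictions; transmit every 2 rounds.
rounds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, 1.0]]
sent = run_client(rounds, accumulate_k=2)
```

Here four rounds of local knowledge yield only two uplink transmissions, which is the source of the communication savings.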
Multi-scale Residual Transformer for VLF Lightning Transients Classification
for: Improving the reliability and performance of navigation systems by reducing electromagnetic interference and noise
methods: Using a deep learning-based multi-scale residual transformer (MRTransformer) to accurately classify lightning signals
results: The model accurately classifies lightning signals, achieving 90% accuracy, and captures both the fine-grained features and the multi-scale aspects of the input lightning signal sequence.
Abstract
The utilization of Very Low Frequency (VLF) electromagnetic signals in navigation systems is widespread. However, the non-stationary behavior of lightning signals can affect VLF electromagnetic signal transmission. Accurately classifying lightning signals is important for reducing interference and noise in VLF, thereby improving the reliability and overall performance of navigation systems. In recent years, the evolution of deep learning, specifically Convolutional Neural Networks (CNNs), has sparked a transformation in lightning classification, surpassing traditional statistical methodologies. Existing CNN models have limitations: they overlook the diverse attributes of lightning signals across different scales and neglect the significance of temporal sequencing in sequential signals. This study introduces an innovative multi-scale residual transformer (MRTransformer) that can discern intricate fine-grained patterns while also weighing the significance of different aspects within the input lightning signal sequence. The model captures the attributes of the lightning signal across different scales and reaches 90% classification accuracy. In future work, this model can potentially be applied to a comprehensive understanding of the localization and waveform characteristics of lightning signals.
Zero-Touch Networks: Towards Next-Generation Network Automation
paper_authors: Mirna El Rajab, Li Yang, Abdallah Shami
for: This paper surveys the Zero-touch network and Service Management (ZSM) framework for 5G and Beyond (5G+) network management, whose automated self-management and self-healing capabilities address the escalating complexity and growing data volume of modern networks.
methods: It covers Zero-Touch Networks (ZTNs) within the ZSM framework, including network optimization, traffic monitoring, energy efficiency, and security aspects of next-generation networks, and explores the challenges associated with ZSM, particularly those related to Machine Learning (ML).
results: The study finds that incorporating AutoML into ZTNs can reduce network management costs and enhance performance by automating ML model selection and tuning; experimental results show that the proposed AutoML pipeline achieves higher prediction accuracy than traditional ML.
Abstract
The Zero-touch network and Service Management (ZSM) framework represents an emerging paradigm in the management of the fifth-generation (5G) and Beyond (5G+) networks, offering automated self-management and self-healing capabilities to address the escalating complexity and the growing data volume of modern networks. ZSM frameworks leverage advanced technologies such as Machine Learning (ML) to enable intelligent decision-making and reduce human intervention. This paper presents a comprehensive survey of Zero-Touch Networks (ZTNs) within the ZSM framework, covering network optimization, traffic monitoring, energy efficiency, and security aspects of next-generational networks. The paper explores the challenges associated with ZSM, particularly those related to ML, which necessitate the need to explore diverse network automation solutions. In this context, the study investigates the application of Automated ML (AutoML) in ZTNs, to reduce network management costs and enhance performance. AutoML automates the selection and tuning process of a ML model for a given task. Specifically, the focus is on AutoML's ability to predict application throughput and autonomously adapt to data drift. Experimental results demonstrate the superiority of the proposed AutoML pipeline over traditional ML in terms of prediction accuracy. Integrating AutoML and ZSM concepts significantly reduces network configuration and management efforts, allowing operators to allocate more time and resources to other important tasks. The paper also provides a high-level 5G system architecture incorporating AutoML and ZSM concepts. This research highlights the potential of ZTNs and AutoML to revolutionize the management of 5G+ networks, enabling automated decision-making and empowering network operators to achieve higher efficiency, improved performance, and enhanced user experience.
Resource Allocation for Semantic Communication under Physical-layer Security
for: This paper targets semantic communication systems for 6G wireless networks, which transmit extracted information rather than the original data that receivers then try to recover.
methods: A joint optimization algorithm for total latency and utility is proposed; to secure the system, a physical-layer security method (the secrecy rate) is incorporated into the optimization problem.
results: Experimental results show that the proposed algorithm achieves the best joint optimization performance compared to the baselines.
Abstract
Semantic communication is deemed a revolution of Shannon's paradigm in sixth-generation (6G) wireless networks. It aims at transmitting the extracted information rather than the original data, which receivers will try to recover. Intuitively, the more information is extracted, the longer the latency of semantic communication will be. At the same time, more extracted information results in more accurately reconstructed information, and thereby a higher utility of the semantic communication system. Shorter latency and higher utility are desirable objectives for the system, so there is a trade-off between utility and latency. This paper proposes a joint optimization algorithm for total latency and utility. Moreover, security is essential for the semantic communication system. We incorporate the secrecy rate, a physical-layer security method, into the optimization problem. The secrecy rate is the communication rate at which no information is disclosed to an eavesdropper. Experimental results demonstrate that the proposed algorithm obtains the best joint optimization performance compared to the baselines.
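For reference, the physical-layer secrecy rate invoked above is conventionally defined (a standard textbook form, not taken from this paper) as the gap between the legitimate channel's and the eavesdropper's achievable rates:

```latex
R_s = \Bigl[\log_2\bigl(1 + \mathrm{SNR}_B\bigr) - \log_2\bigl(1 + \mathrm{SNR}_E\bigr)\Bigr]^{+},
\qquad [x]^{+} := \max(x, 0),
```

where $\mathrm{SNR}_B$ and $\mathrm{SNR}_E$ denote the signal-to-noise ratios at the intended receiver and the eavesdropper, respectively: at rates below $R_s$, no information is disclosed to the eavesdropper.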
A Novel Federated Learning-based Intrusion Detection System for Flying Ad Hoc Networks
paper_authors: Ozlem Ceviz, Pinar Sadioglu, Sevil Sen, Vassilios G. Vassilakis
for: This paper aims to improve the security of Flying Ad-hoc Networks (FANETs) by developing a Federated Learning-based Intrusion Detection System (FL-IDS) that addresses privacy concerns while maintaining effective intrusion detection performance.
methods: The FL-IDS approach uses federated learning to enable UAVs to collaboratively train a global intrusion detection model without sharing raw data. Local models are assigned to each UAV, and only updated model weights are shared with a central server. The Bias Towards Specific Clients (BTSC) method is also used to enhance FL-IDS performance.
results: The experimental results show that FL-IDS has competitive performance with Central IDS (C-IDS) while mitigating privacy concerns, and surpasses C-IDS and traditional intrusion detection methods, including Local IDS (L-IDS), in detection performance.
Abstract
Unmanned aerial vehicles (UAVs) in flying ad-hoc networks (FANETs) face security challenges due to the dynamic and distributed nature of these networks. This paper presents the Federated Learning-based Intrusion Detection System (FL-IDS), an innovative approach designed to improve FANET security. FL-IDS leverages federated learning to address privacy concerns of centralized intrusion detection systems. FL-IDS operates in a decentralized manner, enabling UAVs to collaboratively train a global intrusion detection model without sharing raw data. Local models are assigned to each UAV, using client-specific data, and only updated model weights are shared with a central server. This preserves privacy while utilizing collective intelligence for effective intrusion detection. Experimental results show FL-IDS's competitive performance with Central IDS (C-IDS) while mitigating privacy concerns. The Bias Towards Specific Clients (BTSC) method further enhances FL-IDS performance, surpassing C-IDS even at lower attacker ratios. A comparative analysis with traditional intrusion detection methods, including Local IDS (L-IDS), provides insights into FL-IDS's strengths. This study significantly contributes to FANET security by introducing a privacy-aware, decentralized intrusion detection approach tailored to the unique challenges of UAV networks.
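A minimal sketch of the federated aggregation step used by FL-IDS-style systems: clients (here, UAVs) share only model weights, and the server averages them, optionally weighting some clients more, in the spirit of the BTSC idea. This is an illustration of weighted federated averaging, not the paper's code, and the bias factors are assumptions.

```python
def fedavg(client_weights, client_factors=None):
    """Weighted average of per-client weight vectors; raw data never leaves clients."""
    n = len(client_weights)
    factors = client_factors or [1.0] * n
    total = sum(factors)
    dim = len(client_weights[0])
    return [
        sum(f * w[j] for f, w in zip(factors, client_weights)) / total
        for j in range(dim)
    ]

# Three UAVs each send a 2-dimensional weight update to the central server.
uav_updates = [[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]]
global_w = fedavg(uav_updates)                    # plain federated average
biased_w = fedavg(uav_updates, [2.0, 1.0, 1.0])   # bias towards client 0 (BTSC-style)
```

Only the aggregated vector is redistributed, so each UAV's raw traffic data stays local, which is the privacy property the abstract emphasizes.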
Small Area Estimation of Case Growths for Timely COVID-19 Outbreak Detection
paper_authors: Zhaowei She, Zilong Wang, Jagpreet Chhatwal, Turgay Ayer
for: This paper proposes a machine-learning-based Transfer Learning Generalized Random Forest (TLGRF) algorithm for timely estimation of COVID-19 case growth rates in U.S. counties, providing accurate estimates even for counties with small sample sizes.
methods: TLGRF chooses an adaptive fitting window size based on relevant day-level and county-level features affecting disease spread, and uses transfer learning to accurately estimate growth rates for counties with few samples.
results: TLGRF accurately estimates county-level growth rates and outperforms established estimation methods; in a case study based on outbreak data from Colorado, timely detection of outbreaks could have been improved by up to 224% compared to the decisions made by Colorado's Department of Health and Environment (CDPHE).
Abstract
The COVID-19 pandemic has exerted a profound impact on the global economy and continues to exact a significant toll on human lives. The COVID-19 case growth rate stands as a key epidemiological parameter to estimate and monitor for effective detection and containment of the resurgence of outbreaks. A fundamental challenge in growth rate estimation and hence outbreak detection is balancing the accuracy-speed tradeoff, where accuracy typically degrades with shorter fitting windows. In this paper, we develop a machine learning (ML) algorithm, which we call Transfer Learning Generalized Random Forest (TLGRF), that balances this accuracy-speed tradeoff. Specifically, we estimate the instantaneous COVID-19 exponential growth rate for each U.S. county by using TLGRF that chooses an adaptive fitting window size based on relevant day-level and county-level features affecting the disease spread. Through transfer learning, TLGRF can accurately estimate case growth rates for counties with small sample sizes. Out-of-sample prediction analysis shows that TLGRF outperforms established growth rate estimation methods. Furthermore, we conducted a case study based on outbreak case data from the state of Colorado and showed that the timely detection of outbreaks could have been improved by up to 224% using TLGRF when compared to the decisions made by Colorado's Department of Health and Environment (CDPHE). To facilitate implementation, we have developed a publicly available outbreak detection tool for timely detection of COVID-19 outbreaks in each U.S. county, which received substantial attention from policymakers.
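The instantaneous exponential growth rate discussed above is typically estimated by a log-linear fit of case counts over a fitting window; TLGRF's contribution is choosing that window adaptively from covariates, which the toy sketch below does not attempt. Pure least-squares on log counts, illustrative only.

```python
import math

def growth_rate(cases, window):
    """Slope of log(cases) over the last `window` days (per-day exponential rate)."""
    y = [math.log(c) for c in cases[-window:]]
    x = list(range(window))
    xm, ym = sum(x) / window, sum(y) / window
    num = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    den = sum((xi - xm) ** 2 for xi in x)
    return num / den

# Cases doubling every day should give a growth rate of log(2) per day.
cases = [1, 2, 4, 8, 16, 32]
r = growth_rate(cases, window=4)
```

The accuracy-speed trade-off in the abstract corresponds to the choice of `window`: short windows react quickly to resurgences but produce noisy slopes.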
On the adaptation of in-context learners for system identification
paper_authors: Dario Piga, Filippo Pura, Marco Forgione
for: This paper discusses the role of meta-model adaptation in improving predictive performance for system identification.
methods: Numerical examples show how meta-model adaptation can enhance predictive performance in three realistic scenarios: tailoring the meta-model to describe a specific system rather than a class; extending the meta-model to capture the behaviour of systems beyond the initial training class; and recalibrating the model for new prediction tasks.
results: The results highlight the effectiveness of meta-model adaptation in achieving a more robust and versatile meta-learning framework for system identification.
Abstract
In-context system identification aims at constructing meta-models to describe classes of systems, differently from traditional approaches that model single systems. This paradigm facilitates the leveraging of knowledge acquired from observing the behaviour of different, yet related dynamics. This paper discusses the role of meta-model adaptation. Through numerical examples, we demonstrate how meta-model adaptation can enhance predictive performance in three realistic scenarios: tailoring the meta-model to describe a specific system rather than a class; extending the meta-model to capture the behaviour of systems beyond the initial training class; and recalibrating the model for new prediction tasks. Results highlight the effectiveness of meta-model adaptation to achieve a more robust and versatile meta-learning framework for system identification.
A Transformer Model for Symbolic Regression towards Scientific Discovery
results: The model achieves state-of-the-art results on the Symbolic Regression for Scientific Discovery (SRSD) datasets under the normalized tree-based edit distance metric, at no extra computational cost.
Abstract
Symbolic Regression (SR) searches for mathematical expressions which best describe numerical datasets. This allows to circumvent interpretation issues inherent to artificial neural networks, but SR algorithms are often computationally expensive. This work proposes a new Transformer model aiming at Symbolic Regression particularly focused on its application for Scientific Discovery. We propose three encoder architectures with increasing flexibility but at the cost of column-permutation equivariance violation. Training results indicate that the most flexible architecture is required to prevent from overfitting. Once trained, we apply our best model to the SRSD datasets (Symbolic Regression for Scientific Discovery datasets) which yields state-of-the-art results using the normalized tree-based edit distance, at no extra computational cost.
MeanCut: A Greedy-Optimized Graph Clustering via Path-based Similarity and Degree Descent Criterion
for: Proposes a nondestructive graph clustering algorithm that uses path-based similarity to enhance intra-cluster associations, together with a new objective function, MeanCut, greedily optimized in degree descending order, which suits non-spherical data distributions.
methods: Path-based similarity strengthens intra-cluster associations; optimal path search is transformed into generating the maximum spanning tree (MST), with a fast MST (FastMST) algorithm to improve time efficiency; and a density gradient factor (DGF) is defined to separate weakly connected clusters.
results: Tests on real-world benchmarks and a face recognition application validate the effectiveness and robustness of the algorithm.
Abstract
As the most typical graph clustering method, spectral clustering is popular and attractive due to its remarkable performance, easy implementation, and strong adaptability. Classical spectral clustering measures the edge weights of a graph using a pairwise Euclidean-based metric, and solves the optimal graph partition by relaxing the constraints on the indicator matrix and performing Laplacian decomposition. However, Euclidean-based similarity can cause skewed graph cuts when handling non-spherical data distributions, and the relaxation strategy introduces information loss. Meanwhile, spectral clustering requires specifying the number of clusters, which is hard to determine without sufficient prior knowledge. In this work, we leverage path-based similarity to enhance intra-cluster associations, propose MeanCut as the objective function, and greedily optimize it in degree-descending order for a nondestructive graph partition. This algorithm enables the identification of arbitrarily shaped clusters and is robust to noise. To reduce the computational complexity of the similarity calculation, we transform the optimal path search into generating the maximum spanning tree (MST), and develop a fast MST (FastMST) algorithm to further improve its time efficiency. Moreover, we define a density gradient factor (DGF) for separating weakly connected clusters. The validity of our algorithm is demonstrated on real-world benchmarks and an application to face recognition. The source code of MeanCut is available at https://github.com/ZPGuiGroupWhu/MeanCut-Clustering.
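The path-based similarity at the heart of MeanCut has a classical interpretation: the distance between two points is the smallest possible maximum edge weight over any connecting path, and this minimax distance is realized on the spanning tree (a maximum spanning tree of similarities is equivalent to a minimum spanning tree of distances). The sketch below is an illustrative brute-force version using SciPy's MST routine, not the paper's FastMST:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def minimax_distances(X):
    """Path-based (minimax) distance: the smallest possible maximum
    edge weight over all paths between two points, realized on the MST."""
    D = squareform(pdist(X))
    mst = minimum_spanning_tree(D).toarray()
    mst = np.maximum(mst, mst.T)            # symmetrize the tree adjacency
    n = len(X)
    out = np.zeros((n, n))
    for s in range(n):                      # traverse the tree from each source
        visited = {s}
        stack = [(s, 0.0)]
        while stack:
            u, cur = stack.pop()
            for v in range(n):
                if mst[u, v] > 0 and v not in visited:
                    visited.add(v)
                    out[s, v] = max(cur, mst[u, v])  # max edge along the unique path
                    stack.append((v, out[s, v]))
    return out
```

On a chain of close points, the minimax distance between the endpoints stays small (the largest single hop), which is exactly what strengthens intra-cluster association for elongated clusters.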
A Robust and Efficient Boundary Point Detection Method by Measuring Local Direction Dispersion
For: The paper aims to address the challenge of boundary point detection in machine learning tasks, particularly in non-convex structures and high-dimensional manifolds.
Methods: The proposed method, Local Direction Dispersion (LoDD), uses a density-independent K-Nearest Neighbors (KNN) method to determine neighboring points and a statistic-based metric using the eigenvalues of the covariance matrix of KNN coordinates to measure the centrality of a query point.
Results: The paper demonstrates the validity of LoDD on five synthetic datasets and ten real-world benchmarks, and shows that LoDD achieves promising and robust detection accuracy in a time-efficient manner.
Abstract
Boundary points pose a significant challenge for machine learning tasks, including classification, clustering, and dimensionality reduction. Due to the similarity of features, boundary areas can result in mixed-up classes or clusters, leading to a crowding problem in dimensionality reduction. To address this challenge, numerous boundary point detection methods have been developed, but they are insufficient to accurately and efficiently identify boundary points in non-convex structures and high-dimensional manifolds. In this work, we propose a robust and efficient method for detecting boundary points using Local Direction Dispersion (LoDD). LoDD considers that internal points are surrounded by neighboring points in all directions, while the neighboring points of a boundary point tend to be distributed only in a certain directional range. LoDD adopts a density-independent K-Nearest Neighbors (KNN) method to determine neighboring points, and defines a statistic-based metric using the eigenvalues of the covariance matrix of KNN coordinates to measure the centrality of a query point. We demonstrate the validity of LoDD on five synthetic datasets (2-D and 3-D) and ten real-world benchmarks, and test its clustering performance by coupling it with two typical clustering methods, K-means and Ncut. Our results show that LoDD achieves promising and robust detection accuracy in a time-efficient manner.
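The covariance-eigenvalue idea is easy to prototype: for interior points, unit vectors toward the K nearest neighbours spread in all directions, so the covariance of those directions is nearly isotropic; for boundary points the directions concentrate in a half-space and the covariance becomes anisotropic. The eigenvalue ratio below is an illustrative statistic in the spirit of LoDD, not necessarily the paper's exact metric:

```python
import numpy as np
from scipy.spatial import cKDTree

def direction_dispersion(X, k=10):
    """Eigenvalue-based centrality sketch: score near 1 suggests an interior
    point (isotropic neighbour directions), score near 0 suggests a boundary
    point (directions confined to a narrow range)."""
    tree = cKDTree(X)
    _, idx = tree.query(X, k=k + 1)       # first neighbour is the point itself
    scores = np.empty(len(X))
    for i, nbrs in enumerate(idx):
        dirs = X[nbrs[1:]] - X[i]          # vectors toward the K neighbours
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        eig = np.linalg.eigvalsh(np.cov(dirs.T))
        scores[i] = eig.min() / eig.max()  # ratio of covariance eigenvalues
    return scores
```

On a uniform grid, the center point scores near 1 while a corner point scores well below it, matching the intuition in the abstract.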
DiscoBAX: Discovery of Optimal Intervention Sets in Genomic Experiment Design
results: Compared with existing state-of-the-art methods for experimental design, DiscoBAX selects more effective and diverse interventions and improves the rate of significant discoveries.
Abstract
The discovery of therapeutics to treat genetically-driven pathologies relies on identifying genes involved in the underlying disease mechanisms. Existing approaches search over the billions of potential interventions to maximize the expected influence on the target phenotype. However, to reduce the risk of failure in future stages of trials, practical experiment design aims to find a set of interventions that maximally change a target phenotype via diverse mechanisms. We propose DiscoBAX, a sample-efficient method for maximizing the rate of significant discoveries per experiment while simultaneously probing for a wide range of diverse mechanisms during a genomic experiment campaign. We provide theoretical guarantees of approximate optimality under standard assumptions, and conduct a comprehensive experimental evaluation covering both synthetic as well as real-world experimental design tasks. DiscoBAX outperforms existing state-of-the-art methods for experimental design, selecting effective and diverse perturbations in biological systems.
Jointly spatial-temporal representation learning for individual trajectories
for: This paper aims to provide a method for learning general-purpose trajectory representations that can be used for various geospatial applications, such as predicting movement patterns and preserving trajectory similarity.
methods: The proposed method, called ST-GraphRL, uses a weighted directed spatial-temporal graph, a two-stage joint encoder, and a decoder to learn entangled spatial-temporal dependencies and explicit mobility regularities from trajectory data.
results: The proposed ST-GraphRL method outperformed all baseline models in predicting movement spatial-temporal distributions and preserving trajectory similarity, with high spatial-temporal correlations. Additionally, the method was found to understand spatial-temporal patterns and be transferable for general-purpose geospatial data representations.
Abstract
Individual trajectories, which contain substantial information on human-environment interactions across space and time, are a crucial input for geospatial foundation models (GeoFMs). However, existing attempts to leverage trajectory data for various applications have overlooked the implicit spatial-temporal dependency within trajectories and failed to encode and represent it in a format friendly to deep learning, posing a challenge to obtaining general-purpose trajectory representations. Therefore, this paper proposes a spatial-temporal joint representation learning method (ST-GraphRL) to formalize learnable spatial-temporal dependencies into trajectory representations. The proposed ST-GraphRL consists of three components: (i) a weighted directed spatial-temporal graph to explicitly construct mobility interactions over both space and time dimensions; (ii) a two-stage joint encoder (i.e., decoupling and fusion) that learns entangled spatial-temporal dependencies by independently decomposing and jointly aggregating space and time information; and (iii) a decoder that guides ST-GraphRL to learn explicit mobility regularities by simulating the spatial-temporal distributions of trajectories. Tested on three real-world human mobility datasets, the proposed ST-GraphRL outperformed all baseline models in predicting movement spatial-temporal distributions and preserving trajectory similarity, with high spatial-temporal correlations. We also explore how spatial-temporal features are presented in the latent space, validating that ST-GraphRL understands spatial-temporal patterns. The method is also transferable to general-purpose geospatial data representations for broad downstream tasks, advancing the development of GeoFMs.
Reconstruction of dynamical systems from data without time labels
paper_authors: Zhijun Zeng, Pipi Hu, Chenglong Bao, Yi Zhu, Zuoqiang Shi
for: reconstruction of dynamical systems from data without time labels
methods: using sliced Wasserstein distance to minimize distribution loss
results: effective reconstruction of underlying dynamical systems, demonstrated through extensive experiments
Abstract
In this paper, we study methods to reconstruct dynamical systems from data without time labels. Data without time labels appear in many applications, such as molecular dynamics and single-cell RNA sequencing. Reconstruction of dynamical systems from time-sequence data has been studied extensively; however, those methods do not apply if time labels are unknown. Without time labels, sequence data becomes distribution data. Based on this observation, we propose to treat the data as samples from a probability distribution and to reconstruct the underlying dynamical system by minimizing a distribution loss, specifically the sliced Wasserstein distance. Extensive experimental results demonstrate the effectiveness of the proposed method.
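The sliced Wasserstein distance the authors minimize admits a compact Monte-Carlo implementation: project both samples onto random directions and compare sorted projections, which solves the 1-D optimal transport exactly on each slice. A sketch under the simplifying assumption of equally sized samples (function name and projection count are ours):

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=200, seed=0):
    """Monte-Carlo sliced Wasserstein-2 distance between two point clouds
    of equal size: average squared 1-D Wasserstein distances over random
    projection directions; in 1-D, optimal transport pairs sorted values."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)       # uniform direction on the sphere
        px, py = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean((px - py) ** 2)     # exact 1-D W2^2 for equal sizes
    return np.sqrt(total / n_proj)
```

For a pure translation by a vector c, each slice contributes (theta . c)^2, so the result concentrates around sqrt(E[(theta . c)^2]) as the number of projections grows.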
Series2Vec: Similarity-based Self-supervised Representation Learning for Time Series Classification
results: On nine large real-world datasets and the UCR/UEA archive, Series2Vec outperforms current state-of-the-art self-supervised techniques. It is also highly effective on datasets with limited labeled data, and it can be fused with other representation learning models to further improve time series classification performance.
Abstract
We argue that time series analysis is fundamentally different in nature to either vision or natural language processing with respect to the forms of meaningful self-supervised learning tasks that can be defined. Motivated by this insight, we introduce a novel approach called \textit{Series2Vec} for self-supervised representation learning. Unlike other self-supervised methods in time series, which carry the risk of positive sample variants being less similar to the anchor sample than series in the negative set, Series2Vec is trained to predict the similarity between two series in both temporal and spectral domains through a self-supervised task. Series2Vec relies primarily on the consistency of the unsupervised similarity step, rather than the intrinsic quality of the similarity measurement, without the need for hand-crafted data augmentation. To further enforce the network to learn similar representations for similar time series, we propose a novel approach that applies order-invariant attention to each representation within the batch during training. Our evaluation of Series2Vec on nine large real-world datasets, along with the UCR/UEA archive, shows enhanced performance compared to current state-of-the-art self-supervised techniques for time series. Additionally, our extensive experiments show that Series2Vec performs comparably with fully supervised training and offers high efficiency in datasets with limited-labeled data. Finally, we show that the fusion of Series2Vec with other representation learning models leads to enhanced performance for time series classification. Code and models are open-source at \url{https://github.com/Navidfoumani/Series2Vec.}
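The dual-domain similarity target can be illustrated with a small sketch: one pairwise similarity matrix computed from the raw (temporal) signals and one from their magnitude spectra. Negative squared Euclidean distance stands in for the paper's similarity function, which is not specified here:

```python
import numpy as np

def dual_domain_similarity(series):
    """Pairwise similarity targets in the spirit of Series2Vec's
    self-supervised task: a temporal matrix from the raw signals and a
    spectral matrix from their rFFT magnitudes (diagonal entries are 0,
    larger = more similar)."""
    S = np.asarray(series, dtype=float)
    spec = np.abs(np.fft.rfft(S, axis=1))      # spectral (magnitude) view
    def neg_sq_dist(M):
        sq = (M ** 2).sum(axis=1)
        return -(sq[:, None] + sq[None, :] - 2 * M @ M.T)
    return neg_sq_dist(S), neg_sq_dist(spec)
```

A time-shifted sinusoid shows why both domains matter: it is far from the original in the temporal matrix but essentially identical in the spectral one, since magnitude spectra are shift-invariant.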
Rapid detection of rare events from in situ X-ray diffraction data using machine learning
paper_authors: Weijian Zheng, Jun-Sang Park, Peter Kenesei, Ahsan Ali, Zhengchun Liu, Ian T. Foster, Nicholas Schwarz, Rajkumar Kettimuthu, Antonino Miceli, Hemant Sharma
results: The results show that this new automated method is substantially faster than traditional approaches and can work with much sparser data, improving the efficiency and temporal resolution of multi-modal X-ray diffraction methods.
Abstract
High-energy X-ray diffraction methods can non-destructively map the 3D microstructure and associated attributes of metallic polycrystalline engineering materials in their bulk form. These methods are often combined with external stimuli such as thermo-mechanical loading to take snapshots over time of the evolving microstructure and attributes. However, the extreme data volumes and the high costs of traditional data acquisition and reduction approaches pose a barrier to quickly extracting actionable insights and improving the temporal resolution of these snapshots. Here we present a fully automated technique capable of rapidly detecting the onset of plasticity in high-energy X-ray microscopy data. Our technique is computationally faster by at least 50 times than the traditional approaches and works for data sets that are up to 9 times sparser than a full data set. This new technique leverages self-supervised image representation learning and clustering to transform massive data into compact, semantic-rich representations of visually salient characteristics (e.g., peak shapes). These characteristics can be a rapid indicator of anomalous events such as changes in diffraction peak shapes. We anticipate that this technique will provide just-in-time actionable information to drive smarter experiments that effectively deploy multi-modal X-ray diffraction methods that span many decades of length scales.
Node-aware Bi-smoothing: Certified Robustness against Graph Injection Attacks
results: Theoretical analysis and experiments verify that the proposed smoothing schemes provide certifiable robustness, and that they also serve as a practical defense against real-world attacks.
Abstract
Deep Graph Learning (DGL) has emerged as a crucial technique across various domains. However, recent studies have exposed vulnerabilities in DGL models, such as susceptibility to evasion and poisoning attacks. While empirical and provable robustness techniques have been developed to defend against graph modification attacks (GMAs), the problem of certified robustness against graph injection attacks (GIAs) remains largely unexplored. To bridge this gap, we introduce the node-aware bi-smoothing framework, which is the first certifiably robust approach for general node classification tasks against GIAs. Notably, the proposed node-aware bi-smoothing scheme is model-agnostic and is applicable for both evasion and poisoning attacks. Through rigorous theoretical analysis, we establish the certifiable conditions of our smoothing scheme. We also explore the practical implications of our node-aware bi-smoothing schemes in two contexts: as an empirical defense approach against real-world GIAs and in the context of recommendation systems. Furthermore, we extend two state-of-the-art certified robustness frameworks to address node injection attacks and compare our approach against them. Extensive evaluations demonstrate the effectiveness of our proposed certificates.
PerSival: Neural-network-based visualisation for pervasive continuum-mechanical simulations in musculoskeletal biomechanics
paper_authors: David Rosin, Johannes Kässinger, Xingyao Yu, Okan Avci, Christian Bleiler, Oliver Röhrle
for: The paper presents a novel neural-network-based modeling approach for a 3D continuum-mechanical model of the human upper-limb musculoskeletal system, enabling simulation on resource-constrained systems such as mobile devices.
methods: A sparse-grid surrogate model captures the deformation of the muscle surface, and a deep learning model is used for real-time visualisation.
results: Experiments show that the approach achieves real-time visualisation, reaching 101 fps on CPU and 287 fps with GPU support.
Abstract
This paper presents a novel neural network architecture for the purpose of pervasive visualisation of a 3D human upper limb musculoskeletal system model. Bringing simulation capabilities to resource-poor systems like mobile devices is of growing interest across many research fields, to widen applicability of methods and results. Until recently, this goal was thought to be out of reach for realistic continuum-mechanical simulations of musculoskeletal systems, due to prohibitive computational cost. Within this work we use a sparse grid surrogate to capture the surface deformation of the m.~biceps brachii in order to train a deep learning model, used for real-time visualisation of the same muscle. Both these surrogate models take 5 muscle activation levels as input and output Cartesian coordinate vectors for each mesh node on the muscle's surface. Thus, the neural network architecture features a significantly lower input than output dimension. 5 muscle activation levels were sufficient to achieve an average error of 0.97 +/- 0.16 mm, or 0.57 +/- 0.10 % for the 2809 mesh node positions of the biceps. The model achieved evaluation times of 9.88 ms per predicted deformation state on CPU only and 3.48 ms with GPU-support, leading to theoretical frame rates of 101 fps and 287 fps respectively. Deep learning surrogates thus provide a way to make continuum-mechanical simulations accessible for visual real-time applications.
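The input/output asymmetry described above (5 activation levels in, Cartesian coordinates for 2809 surface nodes out) can be sketched with a toy surrogate. The single hidden layer, widths, and random initialization below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def make_surrogate(n_nodes=2809, hidden=64, seed=0):
    """Toy version of the mapping the paper describes: 5 muscle-activation
    levels in, one (x, y, z) coordinate per surface mesh node out. An
    untrained random network, shown only to illustrate the shapes."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.1, (5, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, 3 * n_nodes)); b2 = np.zeros(3 * n_nodes)
    def forward(activations):
        h = np.tanh(np.asarray(activations) @ W1 + b1)   # tiny hidden state
        return (h @ W2 + b2).reshape(n_nodes, 3)         # coordinates per node
    return forward
```

The evaluation cost is dominated by the final matrix product into the high-dimensional output, which is why such surrogates can hit the millisecond-scale inference times the abstract reports.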
paper_authors: Bhaskara Rao Chintada, Sebastián Ruiz-Lopera, René Restrepo, Brett E. Bouma, Martin Villiger, Néstor Uribe-Patarroyo
for: This paper aims to develop a deep learning framework for volumetric speckle reduction in optical coherence tomography (OCT) using a conditional generative adversarial network (cGAN).
methods: The proposed method takes partial OCT volumes as input and leverages the volumetric nature of OCT data to generate artifact-free despeckled volumes with excellent speckle reduction and resolution preservation in all three dimensions.
results: The proposed method demonstrates fast, effective, and high-quality despeckling of different tissue types acquired with three different OCT systems, outperforming existing deep learning methods. Additionally, the method addresses the challenge of generating high-quality, speckle-free training data by using volumetric non-local means despeckling (TNode).
Abstract
We present a deep learning framework for volumetric speckle reduction in optical coherence tomography (OCT) based on a conditional generative adversarial network (cGAN) that leverages the volumetric nature of OCT data. In order to utilize the volumetric nature of OCT data, our network takes partial OCT volumes as input, resulting in artifact-free despeckled volumes that exhibit excellent speckle reduction and resolution preservation in all three dimensions. Furthermore, we address the ongoing challenge of generating ground truth data for supervised speckle suppression deep learning frameworks by using volumetric non-local means despeckling-TNode to generate training data. We show that, while TNode processing is computationally demanding, it serves as a convenient, accessible gold-standard source for training data; our cGAN replicates efficient suppression of speckle while preserving tissue structures with dimensions approaching the system resolution of non-local means despeckling while being two orders of magnitude faster than TNode. We demonstrate fast, effective, and high-quality despeckling of the proposed network in different tissue types acquired with three different OCT systems compared to existing deep learning methods. The open-source nature of our work facilitates re-training and deployment in any OCT system with an all-software implementation, working around the challenge of generating high-quality, speckle-free training data.
paper_authors: Chau-Wai Wong, Chang-Hong Fu, Mengting Xu, Guan-Ming Su
for: This paper concerns video coding and aims to improve coding efficiency.
methods: It uses reshaping, a point operation that directly modifies the video signal, to improve the compression ratio.
results: Through theoretical analysis and experimental verification, the paper shows that in-loop reshaping improves coding efficiency when the entropy coder is suboptimal, and derives the resulting PSNR gain in closed form.
Abstract
Reshaping, a point operation that alters the characteristics of signals, has been shown capable of improving the compression ratio in video coding practice. Out-of-loop reshaping that directly modifies the input video signal was first adopted as supplemental enhancement information (SEI) for HEVC/H.265, without the need to alter the core design of the video codec. VVC/H.266 further improves the coding efficiency by adopting in-loop reshaping that modifies the residual signal being processed in the hybrid coding loop. In this paper, we theoretically analyze the rate-distortion performance of in-loop reshaping and use experiments to verify the theoretical results. We prove that in-loop reshaping can improve coding efficiency when the entropy coder adopted in the coding pipeline is suboptimal, which is in line with the practical scenarios in which video codecs operate. We derive the PSNR gain in closed form and show that the theoretically predicted gain is consistent with that measured in experiments using standard test video sequences.
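The PSNR figures that such a rate-distortion analysis predicts follow the standard definition, which is worth stating concretely:

```python
import numpy as np

def psnr(reference, test, peak=255.0):
    """PSNR in dB between a reference frame and a reconstruction:
    10 * log10(peak^2 / MSE), the standard metric behind reported
    coding gains (peak=255 for 8-bit video)."""
    mse = np.mean((np.asarray(reference, float) - np.asarray(test, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

A closed-form "PSNR gain" is then simply the difference of two such quantities, which reduces to 10 * log10 of the ratio of the two mean squared errors.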
Ricci-Notation Tensor Framework for Model-Based Approaches to Imaging
results: Compared with its numeric tensor predecessor, the RT framework offers superior programmatic and computational efficiency, enabling better modelling of imaging problems and development of their solutions.
Abstract
Model-based approaches to imaging, like specialized image enhancements in astronomy, favour physics-based models which facilitate explanations of relationships between observed inputs and computed outputs. While this paper features a tutorial example, inspired by exoplanet imaging, that reveals embedded 2D fast Fourier transforms in an image enhancement model, the work is actually about the tensor algebra and software, or tensor frameworks, available for model-based imaging. The paper proposes a Ricci-notation tensor (RT) framework, comprising an extended Ricci notation, which aligns well with the symbolic dual-index algebra of non-Euclidean geometry, and codesigned object-oriented software, called the RTToolbox for MATLAB. Extensions offer novel representations for entrywise, pagewise, and broadcasting operations popular in extended matrix-vector (EMV) frameworks for imaging. Complementing the EMV algebra computable with MATLAB, the RTToolbox demonstrates programmatic and computational efficiency thanks to careful design of tensor and dual-index classes. Compared to a numeric tensor predecessor, the RT framework enables superior ways to model imaging problems and, thereby, to develop solutions.
paper_authors: Sueda Taner, Maxime Guillaud, Olav Tirkkonen, Christoph Studer
for: This paper is written for the purpose of extracting pseudo-position information for each user using channel state information (CSI) data at the infrastructure basestation side, without requiring any ground-truth position information.
methods: The paper proposes a novel streaming channel charting (CC) architecture that maintains a small core CSI dataset and uses a min-max-similarity criterion for curation.
results: Numerical validation with measured CSI data demonstrates that the proposed method approaches the accuracy obtained from the complete CSI dataset while using only a fraction of CSI storage and avoiding catastrophic forgetting of old CSI data.Abstract
Channel charting (CC) applies dimensionality reduction to channel state information (CSI) data at the infrastructure basestation side with the goal of extracting pseudo-position information for each user. The self-supervised nature of CC enables predictive tasks that depend on user position without requiring any ground-truth position information. In this work, we focus on the practically relevant streaming CSI data scenario, in which CSI is constantly estimated. To deal with storage limitations, we develop a novel streaming CC architecture that maintains a small core CSI dataset from which the channel charts are learned. Curation of the core CSI dataset is achieved using a min-max-similarity criterion. Numerical validation with measured CSI data demonstrates that our method approaches the accuracy obtained from the complete CSI dataset while using only a fraction of CSI storage and avoiding catastrophic forgetting of old CSI data.
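The min-max-similarity curation of the core CSI dataset can be sketched as a streaming filter: admit a new sample only if its maximum similarity to the stored set is below a threshold, evicting the most redundant stored entry once the storage budget is hit. The similarity measure (negative Euclidean distance) and eviction rule below are illustrative assumptions, not the paper's exact criterion:

```python
import numpy as np

def curate_core_set(stream, capacity, threshold):
    """Streaming core-set curation sketch: near-duplicates of stored
    samples are dropped; when the budget is full, the entry most similar
    to the rest of the core set is evicted to make room."""
    core = []
    for x in stream:
        if not core:
            core.append(x)
            continue
        sims = [-np.linalg.norm(x - c) for c in core]
        if max(sims) >= threshold:          # too similar to something stored
            continue
        if len(core) == capacity:           # evict the most redundant sample
            red = [max(-np.linalg.norm(c - d)
                       for j, d in enumerate(core) if j != i)
                   for i, c in enumerate(core)]
            core.pop(int(np.argmax(red)))
        core.append(x)
    return core
```

This keeps the stored set small and well spread out, which is the property the channel chart is then learned from.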
HARQ-IR Aided Short Packet Communications: BLER Analysis and Throughput Maximization
results: The results show that the GP-based solution attains a higher LTAT at high SNR, whereas at low SNR the DRL-based solution improves the LTAT, albeit at an increased computational overhead.
Abstract
This paper introduces hybrid automatic repeat request with incremental redundancy (HARQ-IR) to boost the reliability of short-packet communications. Finite-blocklength information theory and correlated decoding events make the average block error rate (BLER) intractable to analyze directly. Fortunately, the recursive form of the average BLER motivates us to calculate its value through trapezoidal approximation and Gauss-Laguerre quadrature. Moreover, an asymptotic analysis is performed to derive a simple expression for the average BLER at high signal-to-noise ratio (SNR). We then study the maximization of the long-term average throughput (LTAT) via power allocation while ensuring the power and BLER constraints. For tractability, the asymptotic BLER is employed to solve the problem through geometric programming (GP). However, the GP-based solution underestimates the LTAT at low SNR due to a large approximation error in this regime. As an alternative, we also develop a deep reinforcement learning (DRL)-based framework to learn the power allocation policy. In particular, the optimization problem is transformed into a constrained Markov decision process, which is solved by integrating the deep deterministic policy gradient (DDPG) with a subgradient method. Numerical results demonstrate that the DRL-based method outperforms the GP-based one at low SNR, albeit at the cost of an increased computational burden.
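Gauss-Laguerre quadrature is the natural tool for fading averages of the form E[f(gamma)] when the instantaneous SNR gamma is exponentially distributed (Rayleigh fading), because the substitution gamma = mean_snr * t produces exactly the e^{-t} weight of the quadrature. A sketch (the helper name is ours; the paper applies the idea to its recursive BLER expressions):

```python
import numpy as np
from numpy.polynomial.laguerre import laggauss

def expected_over_exponential(f, mean_snr, order=32):
    """Gauss-Laguerre evaluation of E[f(gamma)] for gamma ~ Exp(mean_snr):
    E[f] = int_0^inf f(gamma) (1/mean_snr) e^{-gamma/mean_snr} d gamma
         = int_0^inf f(mean_snr * t) e^{-t} dt
         ~ sum_i w_i f(mean_snr * x_i)  with Laguerre nodes/weights."""
    nodes, weights = laggauss(order)
    return float(np.sum(weights * f(mean_snr * nodes)))
```

Sanity checks: E[gamma] must equal the mean SNR (exact, since the integrand is polynomial), and E[e^{-gamma}] must equal 1/(1 + mean_snr).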
Secure Cell-Free Integrated Sensing and Communication in the Presence of Information and Sensing Eavesdroppers
results: Our numerical results show that the proposed design provides a higher detection probability and better security in the presence of sensing and information eavesdroppers. In addition, we present two alternative joint beamforming schemes, one maximizing the sensing signal strength over the designated sensing area and the other based on coordinated beamforming.
Abstract
This paper studies a secure cell-free integrated sensing and communication (ISAC) system, in which multiple ISAC transmitters collaboratively send confidential information to multiple communication users (CUs) and concurrently conduct target detection. Different from prior works investigating communication security against potential information eavesdropping, we consider the security of both communication and sensing in the presence of both information and sensing eavesdroppers that aim to intercept confidential communication information and extract target information, respectively. Towards this end, we optimize the joint information and sensing transmit beamforming at these ISAC transmitters for secure cell-free ISAC. Our objective is to maximize the detection probability over a designated sensing area while ensuring the minimum signal-to-interference-plus-noise-ratio (SINR) requirements at CUs. Our formulation also takes into account the maximum tolerable signal-to-noise ratio (SNR) at information eavesdroppers for ensuring the confidentiality of information transmission, and the maximum detection probability constraints at sensing eavesdroppers for preserving sensing privacy. The formulated secure joint transmit beamforming problem is highly non-convex due to the intricate interplay between the detection probabilities, beamforming vectors, and SINR constraints. Fortunately, through strategic manipulation and via applying the semidefinite relaxation (SDR) technique, we successfully obtain the globally optimal solution to the design problem by rigorously verifying the tightness of SDR. Furthermore, we present two alternative joint beamforming designs based on the sensing SNR maximization over the specific sensing area and the coordinated beamforming, respectively. Numerical results reveal the benefits of our proposed design over these alternative benchmarks.
Accelerated Real-Life (ARL) Testing and Characterization of Automotive LiDAR Sensors to facilitate the Development and Validation of Enhanced Sensor Models
paper_authors: Marcel Kettelgerdes, Tjorven Hillmann, Thomas Hirmer, Hüseyin Erdogan, Bernhard Wunderle, Gordon Elger
for: This paper aims to address the aging effects of LiDAR sensors in automated driving simulation and sensor modeling, with the goal of improving the reliability and safety of ADAS systems.
methods: The authors propose a cutting-edge Hardware-in-the-Loop (HiL) test bench for accelerated aging and characterization of Automotive LiDAR sensors, which enables the simulation of aging effects such as laser beam profile deterioration, output power reduction, and intrinsic parameter drift.
results: The proposed method is expected to provide a more accurate and comprehensive understanding of LiDAR sensor aging, and will help to identify and model degradation effects, as well as suggest quantitative model validation metrics.
Abstract
In the realm of automated driving simulation and sensor modeling, the need for highly accurate sensor models is paramount for ensuring the reliability and safety of advanced driving assistance systems (ADAS). Hence, numerous works focus on the development of high-fidelity models of ADAS sensors, such as camera, Radar as well as modern LiDAR systems to simulate the sensor behavior in different driving scenarios, even under varying environmental conditions, considering for example adverse weather effects. However, aging effects of sensors, leading to suboptimal system performance, are mostly overlooked by current simulation techniques. This paper introduces a cutting-edge Hardware-in-the-Loop (HiL) test bench designed for the automated, accelerated aging and characterization of Automotive LiDAR sensors. The primary objective of this research is to address the aging effects of LiDAR sensors over the product life cycle, specifically focusing on aspects such as laser beam profile deterioration, output power reduction and intrinsic parameter drift, which are mostly neglected in current sensor models. By that, this ongoing research is intended to pave the way, not only towards identifying and modeling respective degradation effects, but also to suggest quantitative model validation metrics.
Enhanced data Detection for Massive MIMO with 1-Bit ADCs
results: The results show that zero-forcing and minimum mean squared error receivers provide considerable gains, and the proposed joint data detection strategy brings an even larger improvement.
Abstract
We present new insightful results on the uplink data detection for massive multiple-input multiple-output systems with 1-bit analog-to-digital converters. The expected values of the soft-estimated symbols (i.e., after the linear combining and prior to the data detection) have been recently characterized for multiple user equipments (UEs) and maximum ratio combining (MRC) receiver at the base station. In this paper, we first provide a numerical evaluation of the expected value of the soft-estimated symbols with zero-forcing (ZF) and minimum mean squared error (MMSE) receivers for a multi-UE setting with correlated Rayleigh fading. Then, we propose a joint data detection (JD) strategy, which exploits the interdependence among the soft-estimated symbols of the interfering UEs, along with its low-complexity variant. These strategies are compared with a naive approach that adapts the maximum-likelihood data detection to the 1-bit quantization. Numerical results show that ZF and MMSE provide considerable gains over MRC in terms of symbol error rate. Moreover, the proposed JD and its low-complexity variant provide a significant boost in comparison with the single-UE data detection.
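To make the setting concrete, the toy Monte-Carlo sketch below estimates the expected soft-estimated symbol after 1-bit quantization and MRC combining for a fixed channel; the channel, SNR convention, and trial count are illustrative assumptions, not the paper's analytical characterization:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_bit(y):
    # 1-bit ADC: only the signs of the real and imaginary parts survive.
    return np.sign(y.real) + 1j * np.sign(y.imag)

def soft_estimate_mrc(h, x, snr_db, n_trials=2000):
    """Monte-Carlo estimate of the expected soft-estimated symbol, i.e.
    the MRC-combined 1-bit receive signal, for channel h and symbol x."""
    sigma = 10.0 ** (-snr_db / 20.0)
    m = h.shape[0]
    acc = 0.0 + 0.0j
    for _ in range(n_trials):
        noise = sigma * (rng.standard_normal(m) + 1j * rng.standard_normal(m)) / np.sqrt(2.0)
        acc += np.conj(h) @ one_bit(h * x + noise) / np.linalg.norm(h) ** 2
    return acc / n_trials

z = soft_estimate_mrc(np.ones(8, dtype=complex), 1.0 + 0.0j, snr_db=10.0)
```

In general the quantization biases the expected soft symbol away from the transmitted one, which is exactly why these expected values must be characterized before data detection.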
An Unsupervised Machine Learning Scheme for Index-Based CSI Feedback in Wi-Fi
results: A comparison of five different data representations shows that the new index-based feedback methods effectively reduce the feedback overhead and improve throughput while maintaining adequate link performance.
Abstract
With the ever-increasing demand for high-speed wireless data transmission, beamforming techniques have been proven to be crucial in improving the data rate and the signal-to-noise ratio (SNR) at the receiver. However, they require feedback mechanisms that need an overhead of information and increase the system complexity, potentially challenging the efficiency and capacity of modern wireless networks. This paper investigates novel index-based feedback mechanisms that aim at reducing the beamforming feedback overhead in Wi-Fi links. The proposed methods mitigate the overhead by generating a set of candidate beamforming vectors using an unsupervised learning-based framework. The amount of feedback information required is thus reduced by using the index of the candidate as feedback instead of transmitting the entire beamforming matrix. We explore several methods that consider different representations of the data in the candidate set. In particular, we propose five different ways to generate and represent the candidate sets that consider the covariance matrices of the channel, serialize the feedback matrix, and account for the effective distance, among others. Additionally, we also discuss the implications of using partial information in the compressed beamforming feedback on the link performance and compare it with the newly proposed index-based methods. Extensive IEEE 802.11 standard-compliant simulation results show that the proposed methods effectively minimize the feedback overhead, enhancing the throughput while maintaining an adequate link performance.
Enhanced Index-Based Feedback Overhead Reduction for WLANs
results: Approximately 54% higher throughput than the IEEE 802.11be baseline at high SNR, and approximately 4 dB gain over a previous index-based method; the impact of the chosen distance metric on link performance is also discussed.
Abstract
Compressed beamforming algorithm is used in the current Wi-Fi standard to reduce the beamforming feedback overhead (BFO). However, with each new amendment of the standard the number of supported antennas in Wi-Fi devices increases, leading to increased BFO and hampering the throughput despite using compressed beamforming. In this paper, a novel index-based method is presented to reduce the BFO in Wi-Fi links. In particular, a k-means clustering-based approach is presented to generate candidate beamforming feedback matrices, thereby reducing the BFO to only the index of the said candidate matrices. With extensive simulation results, we compare the newly proposed method with the IEEE 802.11be baseline and our previously published index-based method. We show approximately 54% gain in throughput at high signal-to-noise (SNR) against the IEEE 802.11be baseline. Our comparison also shows approximately 4 dB gain compared to our previously published method at the packet-error-rate (PER) of 0.01 using MCS index 11. Additionally, we also discuss the impact of the distance metric chosen for clustering as well as candidate selection on the link performance.
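The k-means codebook idea can be sketched end to end with a tiny numpy Lloyd's-algorithm implementation; the training vectors, dimensions, and deterministic initialization below are illustrative assumptions:

```python
import numpy as np

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's k-means over vectorized feedback matrices; the k
    resulting centers form the candidate (codebook) set shared by both ends."""
    centers = points[np.linspace(0, len(points) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers

def feedback_index(v, codebook):
    # Only this index is fed back instead of the full beamforming matrix.
    return int(np.linalg.norm(codebook - v, axis=1).argmin())

# Toy training set: feedback vectors drawn around two channel "modes".
rng = np.random.default_rng(0)
train = np.vstack([rng.normal(0.0, 0.1, (50, 4)), rng.normal(5.0, 0.1, (50, 4))])
codebook = kmeans(train, k=2)
```

Since both ends hold the same codebook, the per-report feedback shrinks from a full matrix to log2(k) bits.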
RIS-Aided Interference Cancellation for Joint Device-to-Device and Cellular Communications
results: The AO method converges quickly and can perform even better when initialized with the IC solution; moreover, for a single D2D pair, the IC method can be implemented with limited feedback.
Abstract
Joint device-to-device (D2D) and cellular communication is a promising technology for enhancing the spectral efficiency of future wireless networks. However, the interference management problem is challenging since the operating devices and the cellular users share the same spectrum. The emerging reconfigurable intelligent surfaces (RIS) technology is a potentially ideal solution for this interference problem since RISs can shape the wireless channel in desired ways. This paper considers an RIS-aided joint D2D and cellular communication system where the RIS is exploited to cancel interference to the D2D links and maximize the minimum signal-to-interference plus noise (SINR) of the device pairs and cellular users. First, we adopt a popular alternating optimization (AO) approach to solve the minimum SINR maximization problem. Then, we propose an interference cancellation (IC)-based approach whose complexity is much lower than that of the AO algorithm. We derive a representation for the RIS phase shift vector which cancels the interference to the D2D links. Based on this representation, the RIS phase shift optimization problem is transformed into an effective D2D channel optimization. We show that the AO approach can converge faster and can even give better performance when it is initialized by the proposed IC solution. We also show that for the case of a single D2D pair, the proposed IC approach can be implemented with limited feedback from the single receive device.
paper_authors: Tianchi Liu, Kong Aik Lee, Qiongqiong Wang, Haizhou Li
for: Improving the performance of ResNet models in speaker verification.
methods: Studying and optimizing the stride configurations of ResNet models to improve speaker verification performance.
results: The proposed Golden Gemini principle improves ResNet performance on the VoxCeleb, SITW, and CNCeleb datasets while reducing the number of parameters and the computational cost.
Abstract
Previous studies demonstrate the impressive performance of residual neural networks (ResNet) in speaker verification. The ResNet models treat the time and frequency dimensions equally. They follow the default stride configuration designed for image recognition, where the horizontal and vertical axes exhibit similarities. This approach ignores the fact that time and frequency are asymmetric in speech representation. In this paper, we address this issue and look for optimal stride configurations specifically tailored for speaker verification. We represent the stride space on a trellis diagram, and conduct a systematic study on the impact of temporal and frequency resolutions on the performance and further identify two optimal points, namely Golden Gemini, which serves as a guiding principle for designing 2D ResNet-based speaker verification models. By following the principle, a state-of-the-art ResNet baseline model gains a significant performance improvement on VoxCeleb, SITW, and CNCeleb datasets with 7.70%/11.76% average EER/minDCF reductions, respectively, across different network depths (ResNet18, 34, 50, and 101), while reducing the number of parameters by 16.5% and FLOPs by 4.1%. We refer to it as Gemini ResNet. Further investigation reveals the efficacy of the proposed Golden Gemini operating points across various training conditions and architectures. Furthermore, we present a new benchmark, namely the Gemini DF-ResNet, using a cutting-edge model.
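Golden Gemini is, at heart, a claim that the time and frequency axes deserve different strides. The numpy sketch below uses average pooling as a stand-in for strided convolutions; the specific temporal-resolution-preserving stride list is an illustrative assumption, not the paper's exact configuration:

```python
import numpy as np

def strided_pool(x, stride_f, stride_t):
    """Average-pool a (freq, time) map with independent strides per axis
    (kernel size equal to the stride), a stand-in for strided convs."""
    f, t = x.shape[0] // stride_f, x.shape[1] // stride_t
    x = x[:f * stride_f, :t * stride_t]
    return x.reshape(f, stride_f, t, stride_t).mean(axis=(1, 3))

feat = np.random.default_rng(0).standard_normal((80, 200))  # 80 mel bins x 200 frames
for sf, st in [(2, 1), (2, 2), (2, 1), (2, 1)]:  # downsample frequency faster than time
    feat = strided_pool(feat, sf, st)
```

Frequency here shrinks by 16x while time shrinks only by 2x, illustrating an asymmetric configuration rather than image-style equal strides.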
Detecting Voice Cloning Attacks via Timbre Watermarking
results: Our experiments show that the proposed timbre watermarking is robust against different voice cloning attacks and practical in real-world services such as PaddleSpeech, Voice-Cloning-App, and so-vits-svc. A series of ablation studies further verifies the effectiveness of our design. Some audio samples are available at https://timbrewatermarking.github.io/samples.
Abstract
Nowadays, it is common to release audio content to the public. However, with the rise of voice cloning technology, attackers have the potential to easily impersonate a specific person by utilizing his publicly released audio without any permission. Therefore, it becomes significant to detect any potential misuse of the released audio content and protect its timbre from being impersonated. To this end, we introduce a novel concept, "Timbre Watermarking", which embeds watermark information into the target individual's speech, eventually defeating the voice cloning attacks. To ensure the watermark is robust to the voice cloning model's learning process, we design an end-to-end voice cloning-resistant detection framework. The core idea of our solution is to embed and extract the watermark in the frequency domain in a temporally invariant manner. To acquire generalization across different voice cloning attacks, we modulate their shared process and integrate it into our framework as a distortion layer. Experiments demonstrate that the proposed timbre watermarking can defend against different voice cloning attacks, exhibit strong resistance against various adaptive attacks (e.g., reconstruction-based removal attacks, watermark overwriting attacks), and achieve practicality in real-world services such as PaddleSpeech, Voice-Cloning-App, and so-vits-svc. In addition, ablation studies are also conducted to verify the effectiveness of our design. Some audio samples are available at https://timbrewatermarking.github.io/samples.
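The frequency-domain, temporally invariant embedding can be illustrated with a deliberately toy scheme: scale a few FFT magnitude bins up or down per bit. Unlike the paper's learned, blind, attack-robust watermark, this sketch needs the original signal at extraction; the carrier bins and embedding strength are assumptions:

```python
import numpy as np

CARRIER_BINS = np.arange(10, 15)  # assumed low-frequency carrier bins

def embed_watermark(signal, bits, strength=0.2):
    """Scale each carrier bin's magnitude up (bit 1) or down (bit 0)."""
    spec = np.fft.rfft(signal)
    scale = np.where(np.asarray(bits) == 1, 1.0 + strength, 1.0 - strength)
    spec[CARRIER_BINS[:len(bits)]] *= scale
    return np.fft.irfft(spec, n=len(signal))

def extract_watermark(marked, original, n_bits):
    """Non-blind readout: compare carrier-bin magnitudes to the original."""
    mag_m = np.abs(np.fft.rfft(marked))
    mag_o = np.abs(np.fft.rfft(original))
    return [int(mag_m[b] > mag_o[b]) for b in CARRIER_BINS[:n_bits]]

rng = np.random.default_rng(0)
audio = rng.standard_normal(1024)  # stand-in for a speech waveform
payload = [1, 0, 1, 1, 0]
marked = embed_watermark(audio, payload)
```

Because the mark lives in per-bin magnitudes, it survives time shifts, which is the temporal-invariance property the paper exploits.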
results: The Audio Spectrogram Transformer model performs impressively on audio classification tasks, but how to efficiently adapt it to several downstream tasks had remained open. This paper presents a detailed investigation showing that adapters consistently outperform the other methods across four benchmarks, and remain superior in few-shot learning settings and as the total number of trainable parameters increases.
Abstract
The common modus operandi of fine-tuning large pre-trained Transformer models entails the adaptation of all their parameters (i.e., full fine-tuning). While achieving striking results on multiple tasks, this approach becomes unfeasible as the model size and the number of downstream tasks increase. In natural language processing and computer vision, parameter-efficient approaches like prompt-tuning and adapters have emerged as solid alternatives by fine-tuning only a small number of extra parameters, without sacrificing performance accuracy. Specifically, adapters, due to their flexibility, have recently garnered significant attention, leading to several variants. For audio classification tasks, the Audio Spectrogram Transformer model shows impressive results. However, surprisingly, how to efficiently adapt it to several downstream tasks has not been tackled before. In this paper, we bridge this gap and present a detailed investigation of common parameter-efficient methods, revealing that adapters consistently outperform the other methods across four benchmarks. This trend is also confirmed in few-shot learning settings and when the total number of trainable parameters increases, demonstrating adapters superior scalability. We finally study the best adapter configuration, as well as the role of residual connections in the learning process.
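Adapters are easy to picture because they are so small: a bottleneck down-projection, a nonlinearity, an up-projection, and a residual connection, inserted while the large pre-trained weights stay frozen. A numpy sketch with illustrative dimensions (the hidden size 768 and bottleneck 16 are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

class Adapter:
    """Bottleneck adapter; only these two small matrices would be trained,
    while the surrounding pre-trained Transformer stays frozen."""
    def __init__(self, dim, bottleneck):
        self.w_down = 0.02 * rng.standard_normal((dim, bottleneck))
        self.w_up = 0.02 * rng.standard_normal((bottleneck, dim))

    def __call__(self, x):
        h = np.maximum(x @ self.w_down, 0.0)  # ReLU bottleneck
        return x + h @ self.w_up              # residual: near-identity at init

tokens = rng.standard_normal((4, 768))  # (tokens, hidden) from a frozen AST block
adapter = Adapter(dim=768, bottleneck=16)
out = adapter(tokens)
```

Here the trainable count is 2 * 768 * 16 (about 25k parameters per adapter), versus millions for fully fine-tuning a Transformer block.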
Lightweight Speaker Verification Using Transformation Module with Feature Partition and Fusion
results: We conduct experiments on two public speech corpora (namely VoxCeleb1 and VoxCeleb2). Results show that implanting the transformation module into three models (AMCRN, ResNet34, and ECAPA-TDNN) slightly increases the model error while significantly reducing the model complexity. Our proposed method outperforms baseline methods overall in memory requirement and computational complexity, and generalizes well across truncated segments with various lengths.
Abstract
Although many efforts have been made on decreasing the model complexity for speaker verification, it is still challenging to deploy speaker verification systems with satisfactory result on low-resource terminals. We design a transformation module that performs feature partition and fusion to implement lightweight speaker verification. The transformation module consists of multiple simple but effective operations, such as convolution, pooling, mean, concatenation, normalization, and element-wise summation. It works in a plug-and-play way, and can be easily implanted into a wide variety of models to reduce the model complexity while maintaining the model error. First, the input feature is split into several low-dimensional feature subsets for decreasing the model complexity. Then, each feature subset is updated by fusing it with the inter-feature-subsets correlational information to enhance its representational capability. Finally, the updated feature subsets are independently fed into the block (one or several layers) of the model for further processing. The features that are output from current block of the model are processed according to the steps above before they are fed into the next block of the model. Experimental data are selected from two public speech corpora (namely VoxCeleb1 and VoxCeleb2). Results show that implanting the transformation module into three models (namely AMCRN, ResNet34, and ECAPA-TDNN) for speaker verification slightly increases the model error and significantly decreases the model complexity. Our proposed method outperforms baseline methods on the whole in memory requirement and computational complexity with lower equal error rate. It also generalizes well across truncated segments with various lengths.
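A minimal numpy sketch of the partition-and-fusion idea, with a plain mean standing in for the module's learned convolution/pooling fusion operations (the subset count is illustrative):

```python
import numpy as np

def transformation_module(feature, n_subsets=4):
    """Split a feature into low-dimensional subsets, enrich each with
    shared inter-subset information, and return the updated subsets
    (re-concatenated here; the real module feeds them to model blocks)."""
    subsets = np.split(feature, n_subsets, axis=-1)
    shared = np.mean(subsets, axis=0)          # inter-feature-subset information
    updated = [s + shared for s in subsets]    # element-wise summation fusion
    return np.concatenate(updated, axis=-1)

x = np.arange(8.0)
y = transformation_module(x, n_subsets=4)
```

Each subset is processed in a lower dimension, which is where the complexity saving comes from; the fusion step restores cross-subset context.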
results: Experimental results show that the proposed model is highly effective on the DIBCO and H-DIBCO benchmarks and surpasses existing CNN- and ViT-based state-of-the-art methods.
Abstract
Document image enhancement is a fundamental and important stage for attaining the best performance in any document analysis assignment because there are many degradation situations that could harm document images, making it more difficult to recognize and analyze them. In this paper, we propose T2T-BinFormer which is a novel document binarization encoder-decoder architecture based on a Tokens-to-token vision transformer. Each image is divided into a set of tokens with a defined length using the ViT model, which is then applied several times to model the global relationship between the tokens. However, the conventional tokenization of input data does not adequately reflect the crucial local structure between adjacent pixels of the input image, which results in low efficiency. Instead of using a simple ViT and hard splitting of images for the document image enhancement task, we employed a progressive tokenization technique to capture this local information from an image to achieve more effective results. Experiments on various DIBCO and H-DIBCO benchmarks demonstrate that the proposed model outperforms the existing CNN and ViT-based state-of-the-art methods. In this research, the primary area of examination is the application of the proposed architecture to the task of document binarization. The source code will be made available at https://github.com/RisabBiswas/T2T-BinFormer.
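The progressive ("soft") tokenization differs from ViT's hard split by letting neighbouring tokens overlap, so local pixel structure survives. A numpy sketch of one soft-split step (the kernel and stride values are illustrative):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def soft_split(img, kernel=3, stride=2):
    """Overlapping kernel x kernel patches flattened into tokens, so
    adjacent tokens share pixels (unlike a hard, non-overlapping split)."""
    windows = sliding_window_view(img, (kernel, kernel))[::stride, ::stride]
    h, w = windows.shape[:2]
    return windows.reshape(h * w, kernel * kernel)

img = np.arange(49.0).reshape(7, 7)  # toy grayscale document patch
tokens = soft_split(img)
```

In Tokens-to-token models this step alternates with self-attention so tokens are progressively merged; here token 0 and token 1 share a full pixel column.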
Adapting HouseDiffusion for conditional Floor Plan generation on Modified Swiss Dwellings dataset
results: The method performs better for the diffusion model after simplifying all room polygons to rectangles, indicating that future work should explore better representations of variable-length polygons in diffusion models.
Abstract
Automated floor plan generation has recently gained momentum with several methods that have been proposed. The CVAAD Floor Plan Auto-Completion workshop challenge introduced MSD, a new dataset that includes existing structural walls of the building as an additional input constraint. This technical report presents an approach for extending a recent work, HouseDiffusion (arXiv:2211.13287 [cs.CV]), to the MSD dataset. The adaption involves modifying the model's transformer layers to condition on a set of wall lines. The report introduces a pre-processing pipeline to extract wall lines from the binary mask of the building structure provided as input. Additionally, it was found that a data processing procedure that simplifies all room polygons to rectangles leads to better performance. This indicates that future work should explore better representations of variable-length polygons in diffusion models. The code will be made available at a later date.
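The report's data-processing finding is simple enough to sketch: replace each room polygon by its axis-aligned bounding rectangle before training. This is a minimal interpretation of the simplification; the actual pipeline may differ:

```python
import numpy as np

def to_rectangle(polygon):
    """Axis-aligned bounding rectangle of a room polygon, returned as a
    fixed-length four-corner polygon."""
    pts = np.asarray(polygon, dtype=float)
    (x0, y0), (x1, y1) = pts.min(axis=0), pts.max(axis=0)
    return np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]])

room = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 3), (0, 3)]  # L-shaped room
rect = to_rectangle(room)
```

This turns variable-length polygons into a fixed four-corner representation, which plausibly explains why the diffusion model handles it better.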
The Potential of Vision-Language Models for Content Moderation of Children’s Videos
results: Our experiments show that the proposed CLIP model with a projection layer outperforms previous work on the MOB benchmark for content moderation, and that including more contextual information in prompts improves moderation performance for children's animated videos.
Abstract
Natural language supervision has been shown to be effective for zero-shot learning in many computer vision tasks, such as object detection and activity recognition. However, generating informative prompts can be challenging for more subtle tasks, such as video content moderation. This can be difficult, as there are many reasons why a video might be inappropriate, beyond violence and obscenity. For example, scammers may attempt to create junk content that is similar to popular educational videos but with no meaningful information. This paper evaluates the performance of several CLIP variations for content moderation of children's cartoons in both the supervised and zero-shot setting. We show that our proposed model (Vanilla CLIP with Projection Layer) outperforms previous work conducted on the Malicious or Benign (MOB) benchmark for video content moderation. This paper presents an in depth analysis of how context-specific language prompts affect content moderation performance. Our results indicate that it is important to include more context in content moderation prompts, particularly for cartoon videos as they are not well represented in the CLIP training data.
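Zero-shot moderation with a CLIP-style model reduces to cosine similarity between an image embedding and a set of prompt embeddings. The sketch below uses random stand-in embeddings instead of real CLIP encoders, and the prompt texts are illustrative:

```python
import numpy as np

def classify(image_emb, prompt_embs):
    """Zero-shot decision: the prompt with the highest cosine similarity
    to the image embedding wins."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))

prompts = ["a benign cartoon for children",                       # illustrative
           "a cartoon containing malicious or disturbing content"]
rng = np.random.default_rng(0)
prompt_embs = rng.standard_normal((2, 512))  # stand-ins for CLIP text embeddings
image_emb = prompt_embs[1] + 0.1 * rng.standard_normal(512)  # near the 2nd prompt
label = classify(image_emb, prompt_embs)
```

The paper's point is that richer, context-specific prompt texts shift these text embeddings and thereby the decision boundary, which matters for cartoon imagery that is under-represented in CLIP's training data.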
paper_authors: Jiaman Li, Alexander Clegg, Roozbeh Mottaghi, Jiajun Wu, Xavier Puig, C. Karen Liu
for: Simulating realistic human behaviors, specifically generating synchronized human and object motion in 3D scenes whose style and intent follow a language description.
methods: A conditional diffusion model that generates human and object motion given a language description, initial human and object states, and sparse object waypoints; high-level planning methods can effectively extract these waypoints from the scene.
results: Experiments show that introducing an object geometry loss improves the match between the generated object motion and the input object waypoints, and that designed guidance terms enforce contact constraints during sampling, yielding more accurate and realistic human-object interactions.
Abstract
Synthesizing semantic-aware, long-horizon, human-object interaction is critical to simulate realistic human behaviors. In this work, we address the challenging problem of generating synchronized object motion and human motion guided by language descriptions in 3D scenes. We propose Controllable Human-Object Interaction Synthesis (CHOIS), an approach that generates object motion and human motion simultaneously using a conditional diffusion model given a language description, initial object and human states, and sparse object waypoints. While language descriptions inform style and intent, waypoints ground the motion in the scene and can be effectively extracted using high-level planning methods. Naively applying a diffusion model fails to predict object motion aligned with the input waypoints and cannot ensure the realism of interactions that require precise hand-object contact and appropriate contact grounded by the floor. To overcome these problems, we introduce an object geometry loss as additional supervision to improve the matching between generated object motion and input object waypoints. In addition, we design guidance terms to enforce contact constraints during the sampling process of the trained diffusion model.
results: The paper shows compelling, diverse visual results across various scene types and styles, forming imaginary "wonderjourneys", and uses multiple evaluation measures to verify the coherence and visual quality of the generated scenes.
Abstract
We introduce WonderJourney, a modularized framework for perpetual 3D scene generation. Unlike prior work on view generation that focuses on a single type of scenes, we start at any user-provided location (by a text description or an image) and generate a journey through a long sequence of diverse yet coherently connected 3D scenes. We leverage an LLM to generate textual descriptions of the scenes in this journey, a text-driven point cloud generation pipeline to make a compelling and coherent sequence of 3D scenes, and a large VLM to verify the generated scenes. We show compelling, diverse visual results across various scene types and styles, forming imaginary "wonderjourneys". Project website: https://kovenyu.com/WonderJourney/
Inpaint3D: 3D Scene Content Generation using 2D Inpainting Diffusion
results: Generates high-quality 3D scene content and supports operations such as 3D object removal, 3D object replacement, and 3D scene completion.
Abstract
This paper presents a novel approach to inpainting 3D regions of a scene, given masked multi-view images, by distilling a 2D diffusion model into a learned 3D scene representation (e.g. a NeRF). Unlike 3D generative methods that explicitly condition the diffusion model on camera pose or multi-view information, our diffusion model is conditioned only on a single masked 2D image. Nevertheless, we show that this 2D diffusion model can still serve as a generative prior in a 3D multi-view reconstruction problem where we optimize a NeRF using a combination of score distillation sampling and NeRF reconstruction losses. Predicted depth is used as additional supervision to encourage accurate geometry. We compare our approach to 3D inpainting methods that focus on object removal. Because our method can generate content to fill any 3D masked region, we additionally demonstrate 3D object completion, 3D object replacement, and 3D scene completion.
LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
results: We validate the proposed model on two egocentric datasets, Ego4D and Epic-Kitchens, and observe clear improvements: the model significantly outperforms prior image manipulation models in both quantitative and qualitative evaluation. We also conduct detailed ablation studies and analyses to provide deeper insight into our method.
Abstract
Generating instructional images of human daily actions from an egocentric viewpoint serves as a key step toward efficient skill transfer. In this paper, we introduce a novel problem -- egocentric action frame generation. The goal is to synthesize the action frame conditioning on the user prompt question and an input egocentric image that captures the user's environment. Notably, existing egocentric datasets lack the detailed annotations that describe the execution of actions. Additionally, the diffusion-based image manipulation models fail to control the state change of an action within the corresponding egocentric image pixel space. To this end, we finetune a visual large language model (VLLM) via visual instruction tuning for curating the enriched action descriptions to address our proposed problem. Moreover, we propose to Learn EGOcentric (LEGO) action frame generation using image and text embeddings from VLLM as additional conditioning. We validate our proposed model on two egocentric datasets -- Ego4D and Epic-Kitchens. Our experiments show prominent improvement over prior image manipulation models in both quantitative and qualitative evaluation. We also conduct detailed ablation studies and analysis to provide insights on our method.
results: Achieves high-fidelity, real-time rendering of relightable head avatars, including detailed representation of hair and skin, and supports diverse materials and lighting environments.
Abstract
The fidelity of relighting is bounded by both geometry and appearance representations. For geometry, both mesh and volumetric approaches have difficulty modeling intricate structures like 3D hair geometry. For appearance, existing relighting models are limited in fidelity and often too slow to render in real-time with high-resolution continuous environments. In this work, we present Relightable Gaussian Codec Avatars, a method to build high-fidelity relightable head avatars that can be animated to generate novel expressions. Our geometry model based on 3D Gaussians can capture 3D-consistent sub-millimeter details such as hair strands and pores on dynamic face sequences. To support diverse materials of human heads such as the eyes, skin, and hair in a unified manner, we present a novel relightable appearance model based on learnable radiance transfer. Together with global illumination-aware spherical harmonics for the diffuse components, we achieve real-time relighting with spatially all-frequency reflections using spherical Gaussians. This appearance model can be efficiently relit under both point light and continuous illumination. We further improve the fidelity of eye reflections and enable explicit gaze control by introducing relightable explicit eye models. Our method outperforms existing approaches without compromising real-time performance. We also demonstrate real-time relighting of avatars on a tethered consumer VR headset, showcasing the efficiency and fidelity of our avatars.
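The diffuse component of the appearance model above is handled with global illumination-aware spherical harmonics. As a hedged illustration of the textbook building block only (the paper's learnable radiance transfer and global-illumination awareness go well beyond this), a band-2 real-SH diffuse evaluation looks like:

```python
import numpy as np

def sh_basis(n):
    """Real spherical-harmonics basis up to band 2 for unit normal(s) n.

    Returns the 9 basis values used in standard irradiance-style diffuse
    shading; all names and the 9-coefficient setup are the textbook
    formulation, not the paper's implementation."""
    x, y, z = n[..., 0], n[..., 1], n[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),                 # l=0
        0.488603 * y, 0.488603 * z, 0.488603 * x,   # l=1
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z,
        0.546274 * (x * x - y * y),                 # l=2
    ], axis=-1)

def shade_diffuse(normals, sh_coeffs):
    """Diffuse radiance per normal: dot product of the SH basis with the
    9 lighting coefficients."""
    return sh_basis(normals) @ sh_coeffs

# A constant (l=0 only) light shades every normal identically.
normals = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
ambient = np.zeros(9)
ambient[0] = 1.0
out = shade_diffuse(normals, ambient)
```

The low-frequency SH term covers diffuse reflectance; the paper pairs it with spherical Gaussians for the all-frequency specular part.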
Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning
results: Extensive experiments show that SiC achieves state-of-the-art multi-task performance and even outperforms single-task methods on certain tasks.
Abstract
In-context learning provides a new perspective for multi-task modeling for vision and NLP. Under this setting, the model can perceive tasks from prompts and accomplish them without any extra task-specific head predictions or model fine-tuning. However, skeleton sequence modeling via in-context learning remains unexplored. Directly applying existing in-context models from other areas onto skeleton sequences fails due to the inter-frame and cross-task pose similarity that makes it exceptionally hard to perceive the task correctly from a subtle context. To address this challenge, we propose Skeleton-in-Context (SiC), an effective framework for in-context skeleton sequence modeling. Our SiC is able to handle multiple skeleton-based tasks simultaneously after a single training process and accomplish each task from context according to the given prompt. It can further generalize to new, unseen tasks according to customized prompts. To facilitate context perception, we additionally propose a task-unified prompt, which adaptively learns tasks of different natures, such as partial joint-level generation, sequence-level prediction, or 2D-to-3D motion prediction. We conduct extensive experiments to evaluate the effectiveness of our SiC on multiple tasks, including motion prediction, pose estimation, joint completion, and future pose estimation. We also evaluate its generalization capability on unseen tasks such as motion-in-between. These experiments show that our model achieves state-of-the-art multi-task performance and even outperforms single-task methods on certain tasks.
Self-conditioned Image Generation via Generating Representations
results: Evaluated on ImageNet 256$\times$256, RCG achieves a Frechet Inception Distance (FID) of 3.31 and an Inception Score (IS) of 253.4. These results not only significantly improve the state of the art in class-unconditional image generation but also rival leading class-conditional methods, thereby bridging the long-standing performance gap between these two tasks.
Abstract
This paper presents $\textbf{R}$epresentation-$\textbf{C}$onditioned image $\textbf{G}$eneration (RCG), a simple yet effective image generation framework which sets a new benchmark in class-unconditional image generation. RCG does not condition on any human annotations. Instead, it conditions on a self-supervised representation distribution which is mapped from the image distribution using a pre-trained encoder. During generation, RCG samples from such representation distribution using a representation diffusion model (RDM), and employs a pixel generator to craft image pixels conditioned on the sampled representation. Such a design provides substantial guidance during the generative process, resulting in high-quality image generation. Tested on ImageNet 256$\times$256, RCG achieves a Frechet Inception Distance (FID) of 3.31 and an Inception Score (IS) of 253.4. These results not only significantly improve the state-of-the-art of class-unconditional image generation but also rival the current leading methods in class-conditional image generation, bridging the long-standing performance gap between these two tasks. Code is available at https://github.com/LTH14/rcg.
results: Through experiments and physical fabrication, the paper verifies the effectiveness and feasibility of the method and shows that the illusions work in real-world settings.
Abstract
We explore the problem of computationally generating special `prime' images that produce optical illusions when physically arranged and viewed in a certain way. First, we propose a formal definition for this problem. Next, we introduce Diffusion Illusions, the first comprehensive pipeline designed to automatically generate a wide range of these illusions. Specifically, we both adapt the existing `score distillation loss' and propose a new `dream target loss' to optimize a group of differentially parametrized prime images, using a frozen text-to-image diffusion model. We study three types of illusions, each where the prime images are arranged in different ways and optimized using the aforementioned losses such that images derived from them align with user-chosen text prompts or images. We conduct comprehensive experiments on these illusions and verify the effectiveness of our proposed method qualitatively and quantitatively. Additionally, we showcase the successful physical fabrication of our illusions -- as they are all designed to work in the real world. Our code and examples are publicly available at our interactive project website: https://diffusionillusions.com
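The adapted "score distillation loss" mentioned above is not spelled out here; as a hedged sketch, the widely used SDS formulation backpropagates the weighted difference between a frozen diffusion model's noise prediction and the injected noise into the optimizable image. A minimal NumPy illustration (the denoiser, noise schedule, and weighting are stand-ins, not the paper's code):

```python
import numpy as np

def sds_gradient(image, denoiser, t, weight, rng):
    """Score Distillation Sampling gradient for one timestep.

    image    : (H, W, 3) array of optimizable pixel parameters
    denoiser : callable(noisy_image, t) -> predicted noise; a stand-in
               for a frozen text-to-image diffusion model
    t        : diffusion timestep (illustrative scalar in [0, 1000])
    weight   : timestep-dependent weighting w(t)
    """
    noise = rng.standard_normal(image.shape)
    alpha = 1.0 - t / 1000.0                     # toy noise schedule
    noisy = np.sqrt(alpha) * image + np.sqrt(1.0 - alpha) * noise
    eps_pred = denoiser(noisy, t)
    # SDS skips the U-Net Jacobian: grad = w(t) * (eps_pred - eps)
    return weight * (eps_pred - noise)

# Toy usage: one gradient step on a "prime image" with a dummy denoiser.
rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8, 3))
g = sds_gradient(img, lambda x, t: np.zeros_like(x), t=500, weight=1.0, rng=rng)
img_updated = img - 0.1 * g
```

In the actual pipeline this update is applied to differentially parametrized prime images so that the images derived from their arrangements match the text prompts.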
AVID: Any-Length Video Inpainting with Diffusion Model
results: Comprehensive experiments show that the method robustly handles various inpainting types across different video durations with high visual quality. More visualization results are publicly available at https://zhang-zx.github.io/AVID/ .
Abstract
Recent advances in diffusion models have successfully enabled text-guided image inpainting. While it seems straightforward to extend such editing capability into video domain, there has been fewer works regarding text-guided video inpainting. Given a video, a masked region at its initial frame, and an editing prompt, it requires a model to do infilling at each frame following the editing guidance while keeping the out-of-mask region intact. There are three main challenges in text-guided video inpainting: ($i$) temporal consistency of the edited video, ($ii$) supporting different inpainting types at different structural fidelity level, and ($iii$) dealing with variable video length. To address these challenges, we introduce Any-Length Video Inpainting with Diffusion Model, dubbed as AVID. At its core, our model is equipped with effective motion modules and adjustable structure guidance, for fixed-length video inpainting. Building on top of that, we propose a novel Temporal MultiDiffusion sampling pipeline with a middle-frame attention guidance mechanism, facilitating the generation of videos with any desired duration. Our comprehensive experiments show our model can robustly deal with various inpainting types at different video duration range, with high quality. More visualization results are made publicly available at https://zhang-zx.github.io/AVID/ .
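The Temporal MultiDiffusion idea above extends a fixed-length video model to arbitrary durations by denoising overlapping temporal windows and averaging the per-frame predictions. A hedged NumPy sketch of that averaging step (the window denoiser is a stand-in, and the paper's middle-frame attention guidance is not modeled):

```python
import numpy as np

def temporal_multidiffusion_step(latents, denoise_window, window=4, stride=2):
    """One denoising step over a video of arbitrary length T >= window.

    latents        : (T, D) per-frame latents (D stands in for C*H*W)
    denoise_window : callable on a (window, D) chunk -> denoised chunk,
                     standing in for a fixed-length video diffusion model
    Overlapping windows are denoised independently, and each frame's
    prediction is the average over every window covering it.
    """
    T, _ = latents.shape
    acc = np.zeros_like(latents)
    cnt = np.zeros((T, 1))
    starts = list(range(0, T - window + 1, stride))
    if starts[-1] != T - window:
        starts.append(T - window)          # make sure the tail is covered
    for s in starts:
        acc[s:s + window] += denoise_window(latents[s:s + window])
        cnt[s:s + window] += 1
    return acc / cnt

# Toy usage: with a linear "denoiser", averaging reproduces it exactly.
lat = np.arange(12, dtype=float).reshape(6, 2)
out = temporal_multidiffusion_step(lat, lambda x: x * 0.5, window=4, stride=2)
```

Because every window sees a consistent set of shared frames, the averaged trajectory stays temporally coherent across the whole video length.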
Memory Triggers: Unveiling Memorization in Text-To-Image Generative Models through Word-Level Duplication
results: The paper identifies two distinct, underexplored types of duplication in diffusion-based models that can lead to memorization and replication at inference time, posing privacy risks and enabling adversarial attacks. The findings aim to support the safer and more responsible use of generative models.
Abstract
Diffusion-based models, such as the Stable Diffusion model, have revolutionized text-to-image synthesis with their ability to produce high-quality, high-resolution images. These advancements have prompted significant progress in image generation and editing tasks. However, these models also raise concerns due to their tendency to memorize and potentially replicate exact training samples, posing privacy risks and enabling adversarial attacks. Duplication in training datasets is recognized as a major factor contributing to memorization, and various forms of memorization have been studied so far. This paper focuses on two distinct and underexplored types of duplication that lead to replication during inference in diffusion-based models, particularly in the Stable Diffusion model. We delve into these lesser-studied duplication phenomena and their implications through two case studies, aiming to contribute to the safer and more responsible use of generative models in various applications.
Hybrid Functional Maps for Crease-Aware Non-Isometric Shape Matching
paper_authors: Lennart Bastian, Yizheng Xie, Nassir Navab, Zorah Lähner
for: solves the problem of non-isometric shape correspondence in computer vision, which is a fundamental challenge.
methods: combines the non-orthogonal extrinsic basis of eigenfunctions of the elastic thin-shell hessian with the intrinsic ones of the LBO, creating a hybrid spectral space.
results: achieves significant improvements in non-isometric correspondence settings, with up to 15% better mean geodesic error, and up to 45% improvement in scenarios with topological noise.
Abstract
Non-isometric shape correspondence remains a fundamental challenge in computer vision. Traditional methods using Laplace-Beltrami operator (LBO) eigenmodes face limitations in characterizing high-frequency extrinsic shape changes like bending and creases. We propose a novel approach of combining the non-orthogonal extrinsic basis of eigenfunctions of the elastic thin-shell hessian with the intrinsic ones of the LBO, creating a hybrid spectral space in which we construct functional maps. To this end, we present a theoretical framework to effectively integrate non-orthogonal basis functions into descriptor- and learning-based functional map methods. Our approach can be incorporated easily into existing functional map pipelines across varying applications and is able to handle complex deformations beyond isometries. We show extensive evaluations across various supervised and unsupervised settings and demonstrate significant improvements. Notably, our approach achieves up to 15% better mean geodesic error for non-isometric correspondence settings and up to 45% improvement in scenarios with topological noise.
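The functional maps constructed in the hybrid spectral space follow the standard descriptor-based estimation: project corresponding descriptor functions into each shape's basis and solve a least-squares system for the map matrix. A hedged NumPy sketch of that core step (the paper's hybrid intrinsic/extrinsic basis construction and regularizers are not reproduced; using the pseudo-inverse also covers non-orthogonal bases):

```python
import numpy as np

def functional_map(basis_src, basis_dst, desc_src, desc_dst):
    """Least-squares functional map between two shapes.

    basis_* : (n_vertices, k) spectral basis, e.g. LBO eigenfunctions or a
              hybrid basis including elastic thin-shell hessian eigenmodes
    desc_*  : (n_vertices, d) corresponding descriptor functions, d >= k
    Returns the (k, k) matrix C such that C @ A ~= B, where A and B are
    the descriptors expressed in each basis.
    """
    A = np.linalg.pinv(basis_src) @ desc_src     # (k, d) coefficients
    B = np.linalg.pinv(basis_dst) @ desc_dst
    # C @ A = B  <=>  A.T @ C.T = B.T, solved column-wise in least squares.
    C_T, *_ = np.linalg.lstsq(A.T, B.T, rcond=None)
    return C_T.T

# Sanity check: mapping a shape to itself should give C close to identity.
rng = np.random.default_rng(0)
phi = np.linalg.qr(rng.standard_normal((50, 5)))[0]   # orthonormal toy basis
f = rng.standard_normal((50, 8))                      # toy descriptors
C = functional_map(phi, phi, f, f)
```

Once C is estimated, pointwise correspondences are typically recovered by nearest-neighbor search between the mapped and target basis coefficients.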
WarpDiffusion: Efficient Diffusion Model for High-Fidelity Virtual Try-on
results: Extensive experiments on high-resolution virtual try-on benchmarks and an in-the-wild test set demonstrate the superiority of WarpDiffusion, both qualitatively and quantitatively.
Abstract
Image-based Virtual Try-On (VITON) aims to transfer an in-shop garment image onto a target person. While existing methods focus on warping the garment to fit the body pose, they often overlook the synthesis quality around the garment-skin boundary and realistic effects like wrinkles and shadows on the warped garments. These limitations greatly reduce the realism of the generated results and hinder the practical application of VITON techniques. Leveraging the notable success of diffusion-based models in cross-modal image synthesis, some recent diffusion-based methods have ventured to tackle this issue. However, they tend to either consume a significant amount of training resources or struggle to achieve realistic try-on effects and retain garment details. For efficient and high-fidelity VITON, we propose WarpDiffusion, which bridges the warping-based and diffusion-based paradigms via a novel informative and local garment feature attention mechanism. Specifically, WarpDiffusion incorporates local texture attention to reduce resource consumption and uses a novel auto-mask module that effectively retains only the critical areas of the warped garment while disregarding unrealistic or erroneous portions. Notably, WarpDiffusion can be integrated as a plug-and-play component into existing VITON methodologies, elevating their synthesis quality. Extensive experiments on high-resolution VITON benchmarks and an in-the-wild test set demonstrate the superiority of WarpDiffusion, surpassing state-of-the-art methods both qualitatively and quantitatively.
Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
paper_authors: Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, Li Zhang
for: The paper aims to provide a benchmark dataset (Reason2Drive) for studying interpretable reasoning in complex driving environments, and to evaluate the reasoning capabilities of large vision-language models (VLMs) in autonomous driving.
methods: The proposed benchmark dataset consists of over 600K video-text pairs, collected from a diverse range of open-source outdoor driving datasets, including nuScenes, Waymo, and ONCE. A novel aggregated evaluation metric is introduced to assess chain-based reasoning performance in autonomous systems, addressing the semantic ambiguities of existing metrics such as BLEU and CIDEr.
results: The authors conduct experiments to assess various existing VLMs on the proposed benchmark, revealing insights into their reasoning capabilities. Additionally, they develop an efficient approach to empower VLMs to leverage object-level perceptual elements in both feature extraction and prediction, further enhancing their reasoning accuracy.
Abstract
Large vision-language models (VLMs) have garnered increasing interest in autonomous driving areas, due to their advanced capabilities in complex reasoning tasks essential for highly autonomous vehicle behavior. Despite their potential, research in autonomous systems is hindered by the lack of datasets with annotated reasoning chains that explain the decision-making processes in driving. To bridge this gap, we present Reason2Drive, a benchmark dataset with over 600K video-text pairs, aimed at facilitating the study of interpretable reasoning in complex driving environments. We distinctly characterize the autonomous driving process as a sequential combination of perception, prediction, and reasoning steps, and the question-answer pairs are automatically collected from a diverse range of open-source outdoor driving datasets, including nuScenes, Waymo and ONCE. Moreover, we introduce a novel aggregated evaluation metric to assess chain-based reasoning performance in autonomous systems, addressing the semantic ambiguities of existing metrics such as BLEU and CIDEr. Based on the proposed benchmark, we conduct experiments to assess various existing VLMs, revealing insights into their reasoning capabilities. Additionally, we develop an efficient approach to empower VLMs to leverage object-level perceptual elements in both feature extraction and prediction, further enhancing their reasoning accuracy. The code and dataset will be released.
Seeing the random forest through the decision trees. Supporting learning health systems from histopathology with machine learning models: Challenges and opportunities
results: The authors then present a novel opportunity to support "Learning Health Systems" by integrating hidden information, extracted by machine learning models from digitized histopathology slides, with other healthcare big data.
Abstract
This paper discusses some overlooked challenges faced when working with machine learning models for histopathology and presents a novel opportunity to support "Learning Health Systems" with them. Initially, the authors elaborate on these challenges after separating them according to their mitigation strategies: those that need innovative approaches, time, or future technological capabilities and those that require a conceptual reappraisal from a critical perspective. Then, a novel opportunity to support "Learning Health Systems" by integrating hidden information extracted by ML models from digitalized histopathology slides with other healthcare big data is presented.
Editable Stain Transformation Of Histological Images Using Unpaired GANs
paper_authors: Tibor Sloboda, Lukáš Hudec, Wanda Benešová
for: This paper is written for researchers and clinicians working in the field of histopathology, particularly those interested in metaplastic breast cancer.
methods: The paper introduces a new method called xAI-CycleGAN, which combines Mask CycleGAN with explainability features and structure-preserving capabilities to transform H&E stained breast tissue images into P63-like images.
results: The paper shows that xAI-CycleGAN is effective in maintaining structural integrity and generating high-quality images, and a survey of histopathologists indicates that the generated images are often comparable in realism to actual images.
Abstract
Double staining in histopathology, particularly for metaplastic breast cancer, typically employs H&E and P63 dyes. However, P63's tissue damage and high cost necessitate alternative methods. This study introduces xAI-CycleGAN, an advanced architecture combining Mask CycleGAN with explainability features and structure-preserving capabilities for transforming H&E stained breast tissue images into P63-like images. The architecture allows for output editing, enhancing resemblance to actual images and enabling further model refinement. We showcase xAI-CycleGAN's efficacy in maintaining structural integrity and generating high-quality images. Additionally, a histopathologist survey indicates the generated images' realism is often comparable to actual images, validating our model's high-quality output.
Training Neural Networks on RAW and HDR Images for Restoration Tasks
methods: The authors test several approaches, including converting HDR/RAW images to display-encoded images with common transfer functions (PQ, PU21, mu-law), and training in linear color spaces with loss functions that correct for perceptual non-uniformity.
results: The results show that training neural networks on display-encoded images improves image restoration performance by up to 10-15 dB.
Abstract
The vast majority of standard image and video content available online is represented in display-encoded color spaces, in which pixel values are conveniently scaled to a limited range (0-1) and the color distribution is approximately perceptually uniform. In contrast, both camera RAW and high dynamic range (HDR) images are often represented in linear color spaces, in which color values are linearly related to colorimetric quantities of light. While training on commonly available display-encoded images is a well-established practice, there is no consensus on how neural networks should be trained for tasks on RAW and HDR images in linear color spaces. In this work, we test several approaches on three popular image restoration applications: denoising, deblurring, and single-image super-resolution. We examine whether HDR/RAW images need to be display-encoded using popular transfer functions (PQ, PU21, mu-law), or whether it is better to train in linear color spaces, but use loss functions that correct for perceptual non-uniformity. Our results indicate that neural networks train significantly better on HDR and RAW images represented in display-encoded color spaces, which offer better perceptual uniformity than linear spaces. This small change to the training strategy can bring a very substantial gain in performance, up to 10-15 dB.
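Of the transfer functions compared above, mu-law has the simplest closed form: it compresses linear light logarithmically so that dark regions occupy a fair share of the encoded range. A hedged NumPy sketch of the encode/decode pair (mu=5000 is a common choice for HDR content but is only an illustrative default; the paper's PQ and PU21 variants are more involved):

```python
import numpy as np

def mu_law_encode(x, mu=5000.0):
    """Map linear HDR/RAW values, normalized to [0, 1], to a perceptually
    more uniform display-like encoding: y = log(1 + mu*x) / log(1 + mu)."""
    return np.log1p(mu * x) / np.log1p(mu)

def mu_law_decode(y, mu=5000.0):
    """Inverse transform, mapping network outputs back to linear light."""
    return np.expm1(y * np.log1p(mu)) / mu

# A restoration network would train on mu_law_encode(linear_hdr) rather
# than on raw linear values, which is the small training-strategy change
# the paper credits with up to 10-15 dB of gain.
linear = np.array([0.0, 1e-4, 1e-2, 0.5, 1.0])
encoded = mu_law_encode(linear)
roundtrip = mu_law_decode(encoded)
```

The encoding maps 0 to 0 and 1 to 1 and is strictly monotonic, so it can be applied per channel without reordering intensities.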
Boosting Segment Anything Model Towards Open-Vocabulary Learning
results: Compared with previous state-of-the-art methods, Sambor demonstrates superior zero-shot performance on several standard benchmarks, including COCO and LVIS.
Abstract
The recent Segment Anything Model (SAM) has emerged as a new paradigmatic vision foundation model, showcasing potent zero-shot generalization and flexible prompting. Despite SAM finding applications and adaptations in various domains, its primary limitation lies in the inability to grasp object semantics. In this paper, we present Sambor to seamlessly integrate SAM with the open-vocabulary object detector in an end-to-end framework. While retaining all the remarkable capabilities inherent to SAM, we enhance it with the capacity to detect arbitrary objects based on human inputs like category names or reference expressions. To accomplish this, we introduce a novel SideFormer module that extracts SAM features to facilitate zero-shot object localization and inject comprehensive semantic information for open-vocabulary recognition. In addition, we devise an open-set region proposal network (Open-set RPN), enabling the detector to acquire the open-set proposals generated by SAM. Sambor demonstrates superior zero-shot performance across benchmarks, including COCO and LVIS, proving highly competitive against previous SoTA methods. We aspire for this work to serve as a meaningful endeavor in endowing SAM to recognize diverse object categories and advancing open-vocabulary learning with the support of vision foundation models.
TokenCompose: Grounding Diffusion with Token-level Supervision
results: Improves multi-category instance composition and enhances the photorealism of generated images.
Abstract
We present TokenCompose, a Latent Diffusion Model for text-to-image generation that achieves enhanced consistency between user-specified text prompts and model-generated images. Despite its tremendous success, the standard denoising process in the Latent Diffusion Model takes text prompts as conditions only, absent explicit constraint for the consistency between the text prompts and the image contents, leading to unsatisfactory results for composing multiple object categories. TokenCompose aims to improve multi-category instance composition by introducing the token-wise consistency terms between the image content and object segmentation maps in the finetuning stage. TokenCompose can be applied directly to the existing training pipeline of text-conditioned diffusion models without extra human labeling information. By finetuning Stable Diffusion, the model exhibits significant improvements in multi-category instance composition and enhanced photorealism for its generated images.
Automated Multimodal Data Annotation via Calibration With Indoor Positioning System
paper_authors: Ryan Rubel, Andrew Dudash, Mohammad Goli, James O’Hara, Karl Wunderlich
for: Improving object detection in niche applications such as warehouse robotics and automated infrastructure
methods: An automated annotation pipeline that fuses LiDAR and camera data with an indoor positioning system (IPS) to generate multimodal object detection datasets without manual labeling
results: Annotates objects of interest 261.8 times faster than a human baseline and speeds up end-to-end dataset creation by 61.5%.
Abstract
Learned object detection methods based on fusion of LiDAR and camera data require labeled training samples, but niche applications, such as warehouse robotics or automated infrastructure, require semantic classes not available in large existing datasets. Therefore, to facilitate the rapid creation of multimodal object detection datasets and alleviate the burden of human labeling, we propose a novel automated annotation pipeline. Our method uses an indoor positioning system (IPS) to produce accurate detection labels for both point clouds and images and eliminates manual annotation entirely. In an experiment, the system annotates objects of interest 261.8 times faster than a human baseline and speeds up end-to-end dataset creation by 61.5%.
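The geometric core of such a pipeline can be made concrete with a minimal sketch: an object's 3D box, known from the IPS, projected through a pinhole camera into a 2D image label. The intrinsics and box parameterization below are illustrative assumptions, not the paper's calibration:

```python
def project_bbox(center, half_extent, fx, fy, cx, cy):
    """Project an axis-aligned 3D box in camera coordinates (Z > 0)
    into a 2D image-plane bounding box via a pinhole model."""
    X, Y, Z = center
    corners = [(X + sx * half_extent[0], Y + sy * half_extent[1], Z + sz * half_extent[2])
               for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]
    us = [fx * x / z + cx for x, _, z in corners]
    vs = [fy * y / z + cy for _, y, z in corners]
    return min(us), min(vs), max(us), max(vs)

# A 2m cube centered 2m in front of a toy camera (fx=fy=100, cx=cy=50).
bbox = project_bbox((0.0, 0.0, 2.0), (1.0, 1.0, 1.0), 100.0, 100.0, 50.0, 50.0)
```

With accurate IPS poses and calibrated extrinsics, this projection replaces the human annotator for the image modality, while the 3D box itself labels the point cloud.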
SurfaceAug: Closing the Gap in Multimodal Ground Truth Sampling
results: Experiments show that SurfaceAug improves multimodal detector performance on car detection tasks and establishes a new state of the art for multimodal ground truth sampling.
Abstract
Despite recent advances in both model architectures and data augmentation, multimodal object detectors still barely outperform their LiDAR-only counterparts. This shortcoming has been attributed to a lack of sufficiently powerful multimodal data augmentation. To address this, we present SurfaceAug, a novel ground truth sampling algorithm. SurfaceAug pastes objects by resampling both images and point clouds, enabling object-level transformations in both modalities. We evaluate our algorithm by training a multimodal detector on KITTI and compare its performance to previous works. We show experimentally that SurfaceAug outperforms existing methods on car detection tasks and establishes a new state of the art for multimodal ground truth sampling.
A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting
results: PowerPaint is evaluated across multiple inpainting benchmarks and achieves state-of-the-art versatile image inpainting; its task prompt also works as a negative prompt for object removal, and prompt interpolation enables controllable shape-guided object inpainting.
Abstract
Achieving high-quality versatile image inpainting, where user-specified regions are filled with plausible content according to user intent, presents a significant challenge. Existing methods face difficulties in simultaneously addressing context-aware image inpainting and text-guided object inpainting due to the distinct optimal training strategies required. To overcome this challenge, we introduce PowerPaint, the first high-quality and versatile inpainting model that excels in both tasks. First, we introduce learnable task prompts along with tailored fine-tuning strategies to guide the model's focus on different inpainting targets explicitly. This enables PowerPaint to accomplish various inpainting tasks by utilizing different task prompts, resulting in state-of-the-art performance. Second, we demonstrate the versatility of the task prompt in PowerPaint by showcasing its effectiveness as a negative prompt for object removal. Additionally, we leverage prompt interpolation techniques to enable controllable shape-guided object inpainting. Finally, we extensively evaluate PowerPaint on various inpainting benchmarks to demonstrate its superior performance for versatile image inpainting. We release our codes and models on our project page: https://powerpaint.github.io/.
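Prompt interpolation of the kind described above reduces, at its core, to blending two learned task-prompt embeddings. A toy sketch, with plain vector lists standing in for the real embedding tensors:

```python
def interpolate_prompts(p_context, p_object, alpha):
    """Blend a context-aware inpainting prompt embedding with a
    text-guided object-inpainting prompt embedding; alpha in [0, 1]
    controls how strongly shape guidance applies."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(p_context, p_object)]

halfway = interpolate_prompts([0.0, 0.0], [2.0, 4.0], 0.5)  # [1.0, 2.0]
```

Sliding alpha between the two endpoints gives a continuous dial between "fill plausibly from context" and "paint the prompted object here".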
results: Concept embeddings extracted along different axes can be remixed to generate images with novel compositions of visual concepts; lightweight test-time finetuning generalizes the model to concepts unseen during training.
Abstract
Our understanding of the visual world is centered around various concept axes, characterizing different aspects of visual entities. While different concept axes can be easily specified by language, e.g. color, the exact visual nuances along each axis often exceed the limitations of linguistic articulations, e.g. a particular style of painting. In this work, our goal is to learn a language-informed visual concept representation, by simply distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with an objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement of different concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along various axes from new test images, which can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, it can also generalize to novel concepts unseen at training.
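The anchoring objective can be illustrated with a cosine-similarity penalty that pulls each concept embedding toward its VQA-derived text embedding. This is a simplification for intuition, not necessarily the exact loss the authors use:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def anchor_loss(concept_emb, text_emb):
    """0 when the concept embedding points along its text anchor,
    growing toward 2 as the two directions diverge."""
    return 1.0 - cosine(concept_emb, text_emb)
```

Anchoring each concept encoder to a different text embedding gives the axes distinct targets, which is what discourages the encoders from entangling the same visual information.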
XCube ($\mathcal{X}^3$): Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies
paper_authors: Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, Francis Williams
for: Large-scale generative modeling of high-resolution sparse 3D voxel grids
methods: A hierarchical voxel latent diffusion model built on the VDB data structure
results: Generates high-resolution objects and shows clear qualitative and quantitative improvements on large outdoor scenes.
Abstract
We present $\mathcal{X}^3$ (pronounced XCube), a novel generative model for high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can generate millions of voxels with a finest effective resolution of up to $1024^3$ in a feed-forward fashion without time-consuming test-time optimization. To achieve this, we employ a hierarchical voxel latent diffusion model which generates progressively higher resolution grids in a coarse-to-fine manner using a custom framework built on the highly efficient VDB data structure. Apart from generating high-resolution objects, we demonstrate the effectiveness of XCube on large outdoor scenes at scales of 100m$\times$100m with a voxel size as small as 10cm. We observe clear qualitative and quantitative improvements over past approaches. In addition to unconditional generation, we show that our model can be used to solve a variety of tasks such as user-guided editing, scene completion from a single scan, and text-to-3D. More results and details can be found at https://research.nvidia.com/labs/toronto-ai/xcube/.
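The coarse-to-fine hierarchy rests on a simple primitive: every occupied voxel at one level spawns candidate children at the next. A sketch of that subdivision step (the real system stores these sparsely in VDB, not in a Python set):

```python
def subdivide(occupied):
    """One refinement step: each occupied voxel (x, y, z) at level L
    yields its 8 children at level L+1, doubling resolution per axis.
    A later diffusion stage would prune and attribute these candidates."""
    children = set()
    for (x, y, z) in occupied:
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    children.add((2 * x + dx, 2 * y + dy, 2 * z + dz))
    return children
```

Because only occupied voxels are refined, the candidate set grows with surface area rather than volume, which is what makes effective resolutions like $1024^3$ tractable.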
results: Experiments and a user study show that Context Diffusion excels in both in-domain and out-of-domain tasks, with higher image quality and fidelity than counterpart models.
Abstract
We propose Context Diffusion, a diffusion-based framework that enables image generation models to learn from visual examples presented in context. Recent work tackles such in-context learning for image generation, where a query image is provided alongside context examples and text prompts. However, the quality and fidelity of the generated images deteriorate when the prompt is not present, demonstrating that these models are unable to truly learn from the visual context. To address this, we propose a novel framework that separates the encoding of the visual context and preserving the structure of the query images. This results in the ability to learn from the visual context and text prompts, but also from either one of them. Furthermore, we enable our model to handle few-shot settings, to effectively address diverse in-context learning scenarios. Our experiments and user study demonstrate that Context Diffusion excels in both in-domain and out-of-domain tasks, resulting in an overall enhancement in image quality and fidelity compared to counterpart models.
DocBinFormer: A Two-Level Transformer Network for Effective Document Image Binarization
results: Extensive experiments on a variety of DIBCO and H-DIBCO benchmarks show that the model outperforms state-of-the-art techniques on four metrics.
Abstract
In real life, various degradation scenarios exist that might damage document images, making it harder to recognize and analyze them, thus binarization is a fundamental and crucial step for achieving the most optimal performance in any document analysis task. We propose DocBinFormer (Document Binarization Transformer), a novel two-level vision transformer (TL-ViT) architecture based on vision transformers for effective document image binarization. The presented architecture employs a two-level transformer encoder to effectively capture both global and local feature representation from the input images. These complimentary bi-level features are exploited for efficient document image binarization, resulting in improved results for system-generated as well as handwritten document images in a comprehensive approach. With the absence of convolutional layers, the transformer encoder uses the pixel patches and sub-patches along with their positional information to operate directly on them, while the decoder generates a clean (binarized) output image from the latent representation of the patches. Instead of using a simple vision transformer block to extract information from the image patches, the proposed architecture uses two transformer blocks for greater coverage of the extracted feature space on a global and local scale. The encoded feature representation is used by the decoder block to generate the corresponding binarized output. Extensive experiments on a variety of DIBCO and H-DIBCO benchmarks show that the proposed model outperforms state-of-the-art techniques on four metrics. The source code will be made available at https://github.com/RisabBiswas/DocBinFormer.
SYNC-CLIP: Synthetic Data Make CLIP Generalize Better in Data-Limited Scenarios
results: Performs competitively on three model generalization tasks, outperforming PromptSRC by an average of 3.0% on novel classes across 11 datasets in open-vocabulary scenarios.
Abstract
Prompt learning is a powerful technique for transferring Vision-Language Models (VLMs) such as CLIP to downstream tasks. However, the prompt-based methods that are fine-tuned solely with base classes may struggle to generalize to novel classes in open-vocabulary scenarios, especially when data are limited. To address this issue, we propose an innovative approach called SYNC-CLIP that leverages SYNthetiC data for enhancing the generalization capability of CLIP. Based on the observation of the distribution shift between the real and synthetic samples, we treat real and synthetic samples as distinct domains and propose to optimize separate domain prompts to capture domain-specific information, along with the shared visual prompts to preserve the semantic consistency between two domains. By aligning the cross-domain features, the synthetic data from novel classes can provide implicit guidance to rebalance the decision boundaries. Experimental results on three model generalization tasks demonstrate that our method performs very competitively across various benchmarks. Notably, SYNC-CLIP outperforms the state-of-the-art competitor PromptSRC by an average improvement of 3.0% on novel classes across 11 datasets in open-vocabulary scenarios.
Enhancing Kinship Verification through Multiscale Retinex and Combined Deep-Shallow features
paper_authors: El Ouanas Belabbaci, Mohammed Khammari, Ammar Chouchane, Mohcene Bessaoudi, Abdelmalik Ouamane, Yassine Himeur, Shadi Atalla, Wathiq Mansoor
for: Kinship verification from facial images, with applications in image annotation, forensic analysis, and social media research.
methods: Multiscale Retinex (MSR) preprocessing, deep and shallow texture descriptors (VGG16 and Local Phase Quantization (LPQ)), and Logistic Regression (LR) method.
results: Robust and effective method tested on three kinship datasets (Cornell Kin Face, UB Kin Face, and TS Kin Face) with improved image quality and accuracy.
Abstract
The challenge of kinship verification from facial images represents a cutting-edge and formidable frontier in the realms of pattern recognition and computer vision. This area of study holds a myriad of potential applications, spanning from image annotation and forensic analysis to social media research. Our research stands out by integrating a preprocessing method named Multiscale Retinex (MSR), which elevates image quality and amplifies contrast, ultimately bolstering the end results. Strategically, our methodology capitalizes on the harmonious blend of deep and shallow texture descriptors, merging them proficiently at the score level through the Logistic Regression (LR) method. To elucidate, we employ the Local Phase Quantization (LPQ) descriptor to extract shallow texture characteristics. For deep feature extraction, we turn to the prowess of the VGG16 model, which is pre-trained on a convolutional neural network (CNN). The robustness and efficacy of our method have been put to the test through meticulous experiments on three rigorous kinship datasets, namely: Cornell Kin Face, UB Kin Face, and TS Kin Face.
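The score-level fusion step amounts to feeding the deep (VGG16) and shallow (LPQ) match scores into a logistic model. A sketch with made-up weights — the real coefficients would be learned from training pairs:

```python
import math

def fuse_scores(deep_score, shallow_score, w_deep=1.5, w_shallow=1.0, bias=-1.2):
    """Logistic-regression fusion of two similarity scores into a
    kinship probability; the weights and bias here are illustrative
    placeholders, not values from the paper."""
    z = w_deep * deep_score + w_shallow * shallow_score + bias
    return 1.0 / (1.0 + math.exp(-z))
```

Fusing at the score level keeps the two descriptors' pipelines independent; only two scalars per face pair reach the classifier.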
When an Image is Worth 1,024 x 1,024 Words: A Case Study in Computational Pathology
results: Experiments show that LongViT effectively encodes gigapixel images and outperforms previous state-of-the-art methods on cancer subtyping and survival prediction.
Abstract
This technical report presents LongViT, a vision Transformer that can process gigapixel images in an end-to-end manner. Specifically, we split the gigapixel image into a sequence of millions of patches and project them linearly into embeddings. LongNet is then employed to model the extremely long sequence, generating representations that capture both short-range and long-range dependencies. The linear computation complexity of LongNet, along with its distributed algorithm, enables us to overcome the constraints of both computation and memory. We apply LongViT in the field of computational pathology, aiming for cancer diagnosis and prognosis within gigapixel whole-slide images. Experimental results demonstrate that LongViT effectively encodes gigapixel images and outperforms previous state-of-the-art methods on cancer subtyping and survival prediction. Code and models will be available at https://aka.ms/LongViT.
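The "millions of patches" claim is easy to verify with back-of-envelope arithmetic: a 32,768 x 32,768 slide split into 32-pixel patches already yields over a million tokens. The slide and patch sizes here are assumptions for illustration, not LongViT's exact configuration:

```python
def num_patches(height, width, patch=32):
    """Sequence length after splitting an image into non-overlapping
    square patches (edge remainders dropped for simplicity)."""
    return (height // patch) * (width // patch)

tokens = num_patches(32768, 32768)  # 1,048,576 patches to embed and hand to LongNet
```

Sequence lengths of this order are exactly why a linear-complexity backbone like LongNet is needed in place of quadratic self-attention.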
Personalized Face Inpainting with Diffusion Models by Parallel Visual Attention
paper_authors: Jianjin Xu, Saman Motamed, Praneetha Vaddamanu, Chen Henry Wu, Christian Haene, Jean-Charles Bazin, Fernando de la Torre
for: Improving face inpainting results while reducing computational complexity during inference
methods: Parallel Visual Attention (PVA) combined with diffusion models: parallel attention matrices inserted into each cross-attention module of the denoising network attend to reference-image features extracted by an identity encoder
results: Achieves unparalleled identity resemblance in face inpainting with and without language guidance, outperforming MyStyle, Paint by Example, and Custom Diffusion, while needing only 40 fine-tuning steps per new identity (over 20x faster than Custom Diffusion).
Abstract
Face inpainting is important in various applications, such as photo restoration, image editing, and virtual reality. Despite the significant advances in face generative models, ensuring that a person's unique facial identity is maintained during the inpainting process is still an elusive goal. Current state-of-the-art techniques, exemplified by MyStyle, necessitate resource-intensive fine-tuning and a substantial number of images for each new identity. Furthermore, existing methods often fall short in accommodating user-specified semantic attributes, such as beard or expression. To improve inpainting results, and reduce the computational complexity during inference, this paper proposes the use of Parallel Visual Attention (PVA) in conjunction with diffusion models. Specifically, we insert parallel attention matrices to each cross-attention module in the denoising network, which attends to features extracted from reference images by an identity encoder. We train the added attention modules and identity encoder on CelebAHQ-IDI, a dataset proposed for identity-preserving face inpainting. Experiments demonstrate that PVA attains unparalleled identity resemblance in both face inpainting and face inpainting with language guidance tasks, in comparison to various benchmarks, including MyStyle, Paint by Example, and Custom Diffusion. Our findings reveal that PVA ensures good identity preservation while offering effective language-controllability. Additionally, in contrast to Custom Diffusion, PVA requires just 40 fine-tuning steps for each new identity, which translates to a significant speed increase of over 20 times.
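A toy rendering of the parallel branch: alongside ordinary cross-attention to text features, the same query attends to identity-encoder reference features, and the two results are combined. Elementwise summation is our simplification of how PVA merges the branches:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    """Single-query dot-product attention over lists of key/value vectors."""
    weights = softmax([sum(q * k for q, k in zip(query, key)) for key in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def parallel_visual_attention(query, text_keys, text_vals, ref_keys, ref_vals):
    """Cross-attention to text features plus a parallel branch over
    reference (identity) features, summed elementwise."""
    text_out = attend(query, text_keys, text_vals)
    ref_out = attend(query, ref_keys, ref_vals)
    return [a + b for a, b in zip(text_out, ref_out)]
```

Only the added parallel matrices and the identity encoder need training, which is why adapting to a new identity is so much cheaper than full fine-tuning.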
How Low Can You Go? Surfacing Prototypical In-Distribution Samples for Unsupervised Anomaly Detection
paper_authors: Felix Meissen, Johannes Getzner, Alexander Ziller, Georgios Kaissis, Daniel Rueckert
for: An unsupervised anomaly detection approach that avoids large labeling efforts
methods: Three methods for identifying prototypical in-distribution samples from a large dataset, used to train anomaly detectors on very small subsets
results: Training with as few as ten prototypical in-distribution samples matches, and in some cases exceeds, full-dataset anomaly detection performance.
Abstract
Unsupervised anomaly detection (UAD) alleviates large labeling efforts by training exclusively on unlabeled in-distribution data and detecting outliers as anomalies. Generally, the assumption prevails that large training datasets allow the training of higher-performing UAD models. However, in this work, we show that using only very few training samples can already match - and in some cases even improve - anomaly detection compared to training with the whole training dataset. We propose three methods to identify prototypical samples from a large dataset of in-distribution samples. We demonstrate that by training with a subset of just ten such samples, we achieve an area under the receiver operating characteristics curve (AUROC) of $96.37 \%$ on CIFAR10, $92.59 \%$ on CIFAR100, $95.37 \%$ on MNIST, $95.38 \%$ on Fashion-MNIST, $96.37 \%$ on MVTec-AD, $98.81 \%$ on BraTS, and $81.95 \%$ on RSNA pneumonia detection, even exceeding the performance of full training in $25/67$ classes we tested. Additionally, we show that the prototypical in-distribution samples identified by our proposed methods translate well to different models and other datasets and that using their characteristics as guidance allows for successful manual selection of small subsets of high-performing samples. Our code is available at https://anonymous.4open.science/r/uad_prototypical_samples/
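One plausible, deliberately simple criterion for "prototypical" — not necessarily one of the paper's three methods — is proximity to the feature-space mean of the in-distribution set:

```python
def prototypical_subset(features, k):
    """Indices of the k feature vectors closest (squared Euclidean)
    to the dataset mean; outlier-ish samples are ranked last."""
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]

    def dist(f):
        return sum((a - b) ** 2 for a, b in zip(f, mean))

    return sorted(range(len(features)), key=lambda i: dist(features[i]))[:k]

feats = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.1]]
core = prototypical_subset(feats, 3)  # the far-away [5.0, 5.0] is excluded
```

Any such ranking of in-distribution typicality lets one trade dataset size for sample quality, which is the regime the paper's AUROC results probe.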
Texture-Semantic Collaboration Network for ORSI Salient Object Detection
paper_authors: Gongyang Li, Zhen Bai, Zhi Liu
for: Improving the accuracy of salient object detection in optical remote sensing images
methods: A generic encoder-decoder structure with a Texture-Semantic Collaboration Module (TSCM) that performs feature modulation and interaction on the basic features extracted by the encoder
results: Extensive experiments on three datasets show competitive performance against 14 state-of-the-art methods across diverse scenes.
Abstract
Salient object detection (SOD) in optical remote sensing images (ORSIs) has become increasingly popular recently. Due to the characteristics of ORSIs, ORSI-SOD is full of challenges, such as multiple objects, small objects, low illuminations, and irregular shapes. To address these challenges, we propose a concise yet effective Texture-Semantic Collaboration Network (TSCNet) to explore the collaboration of texture cues and semantic cues for ORSI-SOD. Specifically, TSCNet is based on the generic encoder-decoder structure. In addition to the encoder and decoder, TSCNet includes a vital Texture-Semantic Collaboration Module (TSCM), which performs valuable feature modulation and interaction on basic features extracted from the encoder. The main idea of our TSCM is to make full use of the texture features at the lowest level and the semantic features at the highest level to achieve the expression enhancement of salient regions on features. In the TSCM, we first enhance the position of potential salient regions using semantic features. Then, we render and restore the object details using the texture features. Meanwhile, we also perceive regions of various scales, and construct interactions between different regions. Thanks to the perfect combination of TSCM and generic structure, our TSCNet can take care of both the position and details of salient objects, effectively handling various scenes. Extensive experiments on three datasets demonstrate that our TSCNet achieves competitive performance compared to 14 state-of-the-art methods. The code and results of our method are available at https://github.com/MathLee/TSCNet.
FoodFusion: A Latent Diffusion Model for Realistic Food Image Generation
results: Compared to publicly available image generation models, FoodFusion generates markedly more realistic and diverse food images.
Abstract
Current state-of-the-art image generation models such as Latent Diffusion Models (LDMs) have demonstrated the capacity to produce visually striking food-related images. However, these generated images often exhibit an artistic or surreal quality that diverges from the authenticity of real-world food representations. This inadequacy renders them impractical for applications requiring realistic food imagery, such as training models for image-based dietary assessment. To address these limitations, we introduce FoodFusion, a Latent Diffusion model engineered specifically for the faithful synthesis of realistic food images from textual descriptions. The development of the FoodFusion model involves harnessing an extensive array of open-source food datasets, resulting in over 300,000 curated image-caption pairs. Additionally, we propose and employ two distinct data cleaning methodologies to ensure that the resulting image-text pairs maintain both realism and accuracy. The FoodFusion model, thus trained, demonstrates a remarkable ability to generate food images that exhibit a significant improvement in terms of both realism and diversity over the publicly available image generation models. We openly share the dataset and fine-tuned models to support advancements in this critical field of food image synthesis at https://bit.ly/genai4good.
Low-shot Object Learning with Mutual Exclusivity Bias
results: Provides a novel dataset, comprehensive baselines, and a baseline method whose low-shot accuracy exceeds state-of-the-art models.
Abstract
This paper introduces Low-shot Object Learning with Mutual Exclusivity Bias (LSME), the first computational framing of mutual exclusivity bias, a phenomenon commonly observed in infants during word learning. We provide a novel dataset, comprehensive baselines, and a state-of-the-art method to enable the ML community to tackle this challenging learning task. The goal of LSME is to analyze an RGB image of a scene containing multiple objects and correctly associate a previously-unknown object instance with a provided category label. This association is then used to perform low-shot learning to test category generalization. We provide a data generation pipeline for the LSME problem and conduct a thorough analysis of the factors that contribute to its difficulty. Additionally, we evaluate the performance of multiple baselines, including state-of-the-art foundation models. Finally, we present a baseline approach that outperforms state-of-the-art models in terms of low-shot accuracy.
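The bias itself has a crisp operational reading — a novel word names the one object we cannot already label — which the following sketch captures. The dict representation is ours; the benchmark operates on RGB scenes, not symbolic labels:

```python
def mutual_exclusivity_assign(objects, novel_label):
    """objects: {object_id: familiar category name, or None if the
    object is unrecognized}. Attach the novel label only when exactly
    one object lacks a known name, mirroring the infant heuristic."""
    unknown = [oid for oid, lbl in objects.items() if lbl is None]
    if len(unknown) == 1:
        return {unknown[0]: novel_label}
    return {}  # zero or several unnamed objects: the cue is ambiguous

scene = {"obj_a": "cup", "obj_b": None, "obj_c": "ball"}
assignment = mutual_exclusivity_assign(scene, "dax")  # {"obj_b": "dax"}
```

The resulting single association is exactly what LSME then uses as the training signal for low-shot category generalization.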
Single Image Reflection Removal with Reflection Intensity Prior Knowledge
results: Experiments on real-world benchmarks demonstrate that our method achieves state-of-the-art accuracy in SIRR.Abstract
Single Image Reflection Removal (SIRR) in real-world images is a challenging task due to diverse image degradations occurring on the glass surface during light transmission and reflection. Many existing methods rely on specific prior assumptions to resolve the problem. In this paper, we propose a general reflection intensity prior that captures the intensity of the reflection phenomenon and demonstrate its effectiveness. To learn the reflection intensity prior, we introduce the Reflection Prior Extraction Network (RPEN). By segmenting images into regional patches, RPEN learns non-uniform reflection prior in an image. We propose Prior-based Reflection Removal Network (PRRN) using a simple transformer U-Net architecture that adapts reflection prior fed from RPEN. Experimental results on real-world benchmarks demonstrate the effectiveness of our approach achieving state-of-the-art accuracy in SIRR.
paper_authors: Maria Priisalu, Ted Kronvall, Cristian Sminchisescu
for: Human pose forecasting, i.e., predicting future articulated human motion from past motion.
methods: Personalizes neural network pose predictions with a low-parametric time-series analysis model.
results: Personalized motion forecasting can be performed efficiently online, using a low-parametric time-series analysis model to personalize neural network pose predictions.Abstract
Human pose forecasting is the task of predicting articulated human motion given past human motion. There exists a number of popular benchmarks that evaluate an array of different models performing human pose forecasting. These benchmarks do not reflect that a human interacting system, such as a delivery robot, observes and plans for the motion of the same individual over an extended period of time. Every individual has unique and distinct movement patterns. This is however not reflected in existing benchmarks that evaluate a model's ability to predict an average human's motion rather than a particular individual's. We reformulate the human motion forecasting problem and present a model-agnostic personalization method. Motion forecasting personalization can be performed efficiently online by utilizing a low-parametric time-series analysis model that personalizes neural network pose predictions.
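As a rough illustration of the personalization idea, a low-parametric correction can be fitted online on top of a generic predictor's residuals. The stand-in predictor, the bias-only correction, and all numbers below are our own simplifying assumptions, not the paper's model:

```python
import numpy as np

def generic_forecast(past):
    """Stand-in for a pretrained neural pose predictor: it naively
    extrapolates the last observed velocity."""
    return past[-1] + (past[-1] - past[-2])

def fit_personal_bias(past_windows, true_next):
    """Fit a low-parametric correction (here: a single bias term) from
    one individual's history -- cheap enough to update online."""
    residuals = [t - generic_forecast(w) for w, t in zip(past_windows, true_next)]
    return float(np.mean(residuals))

def personalized_forecast(past, bias):
    return generic_forecast(past) + bias

# Toy 1D example: this individual consistently moves 0.5 further per
# step than constant-velocity extrapolation predicts.
windows = [np.array([0.0, 1.0]), np.array([1.5, 3.0])]
targets = [2.5, 5.0]
bias = fit_personal_bias(windows, targets)
```

The same residual-fitting step can be repeated as new observations of the individual arrive, which is what makes the scheme model-agnostic.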
AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation
results: Experiments show that AnimatableDreamer generates highly flexible text-guided 3D models and outperforms typical non-rigid reconstruction methods on non-rigid reconstruction tasks. Moreover, with inductive knowledge from a multi-view consistent diffusion model, CSD cyclically enhances the generation process.Abstract
Text-to-3D model adaptations have advanced static 3D model quality, but sequential 3D model generation, particularly for animatable objects with large motions, is still scarce. Our work proposes AnimatableDreamer, a text-to-4D generation framework capable of generating diverse categories of non-rigid objects while adhering to the object motions extracted from a monocular video. At its core, AnimatableDreamer is equipped with our novel optimization design dubbed Canonical Score Distillation (CSD), which simplifies the generation dimension from 4D to 3D by denoising over different frames in the time-varying camera spaces while conducting the distillation process in a unique canonical space shared per video. Concretely, CSD ensures that score gradients back-propagate to the canonical space through differentiable warping, hence guaranteeing the time-consistent generation and maintaining morphological plausibility across different poses. By lifting the 3D generator to 4D with warping functions, AnimatableDreamer offers a novel perspective on non-rigid 3D model generation and reconstruction. Besides, with inductive knowledge from a multi-view consistent diffusion model, CSD regularizes reconstruction from novel views, thus cyclically enhancing the generation process. Extensive experiments demonstrate the capability of our method in generating high-flexibility text-guided 3D models from the monocular video, while also showing improved reconstruction performance over typical non-rigid reconstruction methods. Project page https://AnimatableDreamer.github.io.
paper_authors: Vladimir Arkhipkin, Andrei Filatov, Viacheslav Vasilev, Anastasia Maltseva, Said Azizov, Igor Pavlov, Julia Agafonova, Andrey Kuznetsov, Denis Dimitrov
results: Through a large number of experiments and refinements of the training techniques, the authors achieved a marked improvement in model quality. In particular, Kandinsky 3.0 shows clear gains in text understanding and on specific domains; the paper also describes the production system for user interaction.Abstract
We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion, continuing the series of text-to-image Kandinsky models and reflecting our progress to achieve higher quality and realism of image generation. Compared to previous versions of Kandinsky 2.x, Kandinsky 3.0 leverages a two times larger U-Net backbone, a ten times larger text encoder and removes diffusion mapping. We describe the architecture of the model, the data collection procedure, the training technique, and the production system of user interaction. We focus on the key components that, as we have identified as a result of a large number of experiments, had the most significant impact on improving the quality of our model compared to the others. By our side-by-side comparisons, Kandinsky becomes better in text understanding and works better on specific domains. Project page: https://ai-forever.github.io/Kandinsky-3
Gravitational cell detection and tracking in fluorescence microscopy data
results: The method achieves good results on a Cell Tracking Challenge dataset and can potentially outperform modern machine learning models.Abstract
Automatic detection and tracking of cells in microscopy images are major applications of computer vision technologies in both biomedical research and clinical practice. Though machine learning methods are increasingly common in these fields, classical algorithms still offer significant advantages for both tasks, including better explainability, faster computation, lower hardware requirements and more consistent performance. In this paper, we present a novel approach based on gravitational force fields that can compete with, and potentially outperform modern machine learning models when applied to fluorescence microscopy images. This method includes detection, segmentation, and tracking elements, with the results demonstrated on a Cell Tracking Challenge dataset.
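As a toy illustration of the gravitational idea (our own simplified sketch, not the paper's algorithm), pixel intensities can be treated as masses whose inverse-square-weighted pull concentrates at cell centers:

```python
import numpy as np

def gravitational_field(image, eps=1.0):
    """Accumulate, at every pixel, the inverse-square-weighted pull of
    all pixel intensities treated as masses.  Brute-force O(N^2): fine
    for a toy frame, far from an optimized implementation."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    field = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            d2 = (ys - y) ** 2 + (xs - x) ** 2 + eps
            field[y, x] = np.sum(image / d2)
    return field

def detect_cell(image):
    """Report the pixel where the accumulated pull peaks."""
    field = gravitational_field(image)
    return np.unravel_index(np.argmax(field), field.shape)

# Toy fluorescence frame: one bright 3x3 "cell" centered at (5, 7).
frame = np.zeros((12, 12))
frame[4:7, 6:9] = 1.0
```

Detection of multiple cells would use all local maxima of the field rather than the single global one; that extension is omitted here.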
Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation
results: We validate the effectiveness of our method on 5 types of downstream segmentation tasks, outperforming pre-trained SAM and domain adaptation methods on most of them.Abstract
The success of large language models has inspired the computer vision community to explore image segmentation foundation model that is able to zero/few-shot generalize through prompt engineering. Segment-Anything(SAM), among others, is the state-of-the-art image segmentation foundation model demonstrating strong zero/few-shot generalization. Despite the success, recent studies reveal the weakness of SAM under strong distribution shift. In particular, SAM performs awkwardly on corrupted natural images, camouflaged images, medical images, etc. Motivated by the observations, we aim to develop a self-training based strategy to adapt SAM to target distribution. Given the unique challenges of large source dataset, high computation cost and incorrect pseudo label, we propose a weakly supervised self-training architecture with anchor regularization and low-rank finetuning to improve the robustness and computation efficiency of adaptation. We validate the effectiveness on 5 types of downstream segmentation tasks including natural clean/corrupted images, medical images, camouflaged images and robotic images. Our proposed method is task-agnostic in nature and outperforms pre-trained SAM and state-of-the-art domain adaptation methods on almost all downstream tasks with the same testing prompt inputs.
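The low-rank finetuning component resembles LoRA-style adaptation: freeze the pretrained weight W and train only a rank-r update B @ A. A minimal numpy sketch under that assumption (dimensions and initialization are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 8, 2

W = rng.normal(size=(d_out, d_in))         # frozen pretrained weight
B = rng.normal(size=(d_out, rank)) * 0.01  # low-rank factors: only these
A = np.zeros((rank, d_in))                 # and these are finetuned

def adapted_forward(x):
    """y = (W + B @ A) @ x: the frozen path plus the learned low-rank update."""
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
frozen_out = W @ x
trainable = B.size + A.size   # 32 adapted parameters vs. 64 in W
```

Because A starts at zero, adaptation begins exactly at the pretrained behavior, and only rank * (d_in + d_out) parameters ever receive gradients, which is what keeps the adaptation computation-efficient.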
AnimateZero: Video Diffusion Models are Zero-Shot Image Animators
for: The paper is written for generating high-quality video from text descriptions, and providing precise control over the appearance and motion of the generated video.
methods: The paper proposes two methods to improve the text-to-video diffusion model, AnimateDiff, by decoupling the video into appearance and motion and using positional-corrected window attention for temporal control.
results: The proposed method, AnimateZero, can successfully control the generating progress without further training, and enables multiple new applications such as interactive video generation and real image animation, as demonstrated by the detailed experiments.Abstract
Large-scale text-to-video (T2V) diffusion models have great progress in recent years in terms of visual quality, motion and temporal consistency. However, the generation process is still a black box, where all attributes (e.g., appearance, motion) are learned and generated jointly without precise control ability other than rough text descriptions. Inspired by image animation which decouples the video as one specific appearance with the corresponding motion, we propose AnimateZero to unveil the pre-trained text-to-video diffusion model, i.e., AnimateDiff, and provide more precise appearance and motion control abilities for it. For appearance control, we borrow intermediate latents and their features from the text-to-image (T2I) generation for ensuring the generated first frame is equal to the given generated image. For temporal control, we replace the global temporal attention of the original T2V model with our proposed positional-corrected window attention to ensure other frames align with the first frame well. Empowered by the proposed methods, AnimateZero can successfully control the generating progress without further training. As a zero-shot image animator for given images, AnimateZero also enables multiple new applications, including interactive video generation and real image animation. The detailed experiments demonstrate the effectiveness of the proposed method in both T2V and related applications.
PneumoLLM: Harnessing the Power of Large Language Model for Pneumoconiosis Diagnosis
results: Experimental results show that our method is more effective than conventional approaches and makes better use of LLMs. Code is available at https://github.com/CodeMonsterPHD/PneumoLLM/tree/main.Abstract
The conventional pretraining-and-finetuning paradigm, while effective for common diseases with ample data, faces challenges in diagnosing data-scarce occupational diseases like pneumoconiosis. Recently, large language models (LLMs) have exhibited unprecedented ability when conducting multiple tasks in dialogue, bringing opportunities to diagnosis. A common strategy might involve using adapter layers for vision-language alignment and diagnosis in a dialogic manner. Yet, this approach often requires optimization of extensive learnable parameters in the text branch and the dialogue head, potentially diminishing the LLMs' efficacy, especially with limited training data. In our work, we innovate by eliminating the text branch and substituting the dialogue head with a classification head. This approach presents a more effective method for harnessing LLMs in diagnosis with fewer learnable parameters. Furthermore, to balance the retention of detailed image information with progression towards accurate diagnosis, we introduce the contextual multi-token engine. This engine is specialized in adaptively generating diagnostic tokens. Additionally, we propose the information emitter module, which unidirectionally emits information from image tokens to diagnosis tokens. Comprehensive experiments validate the superiority of our methods and the effectiveness of proposed modules. Our codes can be found at https://github.com/CodeMonsterPHD/PneumoLLM/tree/main.
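Schematically, substituting the dialogue head with a classification head means pooling the LLM's token states and projecting them straight to class logits. The sketch below uses a stand-in linear "backbone" and made-up dimensions purely to illustrate the idea; it is not the PneumoLLM architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, n_classes = 16, 2        # illustrative sizes, not the paper's

mix = rng.normal(size=(hidden, hidden)) / np.sqrt(hidden)

def llm_backbone(image_tokens):
    """Stand-in for the frozen LLM: a fixed linear mixing of token states."""
    return image_tokens @ mix

W_cls = rng.normal(size=(hidden, n_classes)) * 0.1  # the new classification head

def diagnose(image_tokens):
    """Pool the token states and project to diagnosis logits, instead of
    decoding a dialogue response token by token."""
    pooled = llm_backbone(image_tokens).mean(axis=0)
    return pooled @ W_cls

tokens = rng.normal(size=(10, hidden))   # ten "image tokens"
logits = diagnose(tokens)
```

Only W_cls would be new relative to the backbone, which is the parameter saving the abstract attributes to dropping the text branch and dialogue head.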
From Detection to Action Recognition: An Edge-Based Pipeline for Robot Human Perception
results: By comparing against state-of-the-art solutions and testing on a purpose-built dataset, the researchers show that their approach performs well in real-world applications, accurately recognizing human actions and responding with the corresponding behaviour.Abstract
Mobile service robots are proving to be increasingly effective in a range of applications, such as healthcare, monitoring Activities of Daily Living (ADL), and facilitating Ambient Assisted Living (AAL). These robots heavily rely on Human Action Recognition (HAR) to interpret human actions and intentions. However, for HAR to function effectively on service robots, it requires prior knowledge of human presence (human detection) and identification of individuals to monitor (human tracking). In this work, we propose an end-to-end pipeline that encompasses the entire process, starting from human detection and tracking, leading to action recognition. The pipeline is designed to operate in near real-time while ensuring all stages of processing are performed on the edge, reducing the need for centralised computation. To identify the most suitable models for our mobile robot, we conducted a series of experiments comparing state-of-the-art solutions based on both their detection performance and efficiency. To evaluate the effectiveness of our proposed pipeline, we proposed a dataset comprising daily household activities. By presenting our findings and analysing the results, we demonstrate the efficacy of our approach in enabling mobile robots to understand and respond to human behaviour in real-world scenarios relying mainly on the data from their RGB cameras.
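The detection, tracking, and recognition flow can be sketched as a chain of small stages; every stage below is a trivial placeholder for the real edge models, just to show how the pipeline composes:

```python
from dataclasses import dataclass
from itertools import count

_ids = count()  # fresh track ids

@dataclass
class Detection:
    box: tuple           # (x, y, w, h) bounding box
    track_id: int = -1

def detect_humans(frame):
    """Placeholder detector: a real system runs a person detector on RGB."""
    return [Detection(box=b) for b in frame["boxes"]]

def track(detections):
    """Placeholder tracker: a real system associates boxes across frames;
    here every detection simply receives a fresh id."""
    for d in detections:
        d.track_id = next(_ids)
    return detections

def recognize_action(frame, detection):
    """Placeholder HAR stage: a real system classifies a pose/clip crop."""
    return "walking"

def pipeline(frame):
    return [(d.track_id, recognize_action(frame, d))
            for d in track(detect_humans(frame))]

result = pipeline({"boxes": [(10, 20, 50, 120)]})
```

Keeping each stage a separate callable is what lets every step run on the edge device and be swapped independently when comparing candidate models.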
Memory-Efficient Optical Flow via Radius-Distribution Orthogonal Cost Volume
results: Achieves competitive performance on the Sintel and KITTI benchmarks while maintaining the highest memory efficiency on high-resolution inputs.Abstract
The full 4D cost volume in Recurrent All-Pairs Field Transforms (RAFT) or global matching by Transformer achieves impressive performance for optical flow estimation. However, their memory consumption increases quadratically with input resolution, rendering them impractical for high-resolution images. In this paper, we present MeFlow, a novel memory-efficient method for high-resolution optical flow estimation. The key of MeFlow is a recurrent local orthogonal cost volume representation, which decomposes the 2D search space dynamically into two 1D orthogonal spaces, enabling our method to scale effectively to very high-resolution inputs. To preserve essential information in the orthogonal space, we utilize self attention to propagate feature information from the 2D space to the orthogonal space. We further propose a radius-distribution multi-scale lookup strategy to model the correspondences of large displacements at a negligible cost. We verify the efficiency and effectiveness of our method on the challenging Sintel and KITTI benchmarks, and real-world 4K ($2160\!\times\!3840$) images. Our method achieves competitive performance on both Sintel and KITTI benchmarks, while maintaining the highest memory efficiency on high-resolution inputs.
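The memory saving from the orthogonal decomposition can be illustrated by replacing each pixel's full 2D search window with its two 1D projections. Summarizing by max-pooling, and the sizes below, are our own simplifications of the paper's recurrent representation:

```python
import numpy as np

H, W, R = 32, 48, 8          # feature-map size and search radius (illustrative)
rng = np.random.default_rng(2)

# Full cost volume: a (2R+1) x (2R+1) 2D search window per pixel.
full_cost = rng.normal(size=(H, W, 2 * R + 1, 2 * R + 1))

# Orthogonal decomposition: keep one horizontal and one vertical 1D slice
# per pixel (here summarized by max-pooling over the other axis).
cost_u = full_cost.max(axis=2)   # (H, W, 2R+1) horizontal candidates
cost_v = full_cost.max(axis=3)   # (H, W, 2R+1) vertical candidates

full_floats = full_cost.size                 # grows with (2R+1)^2
orth_floats = cost_u.size + cost_v.size      # grows with 2 * (2R+1)
```

With R = 8 the two 1D volumes hold 8.5x fewer values than the full 2D volume, and the gap widens quadratically as the search radius grows with input resolution.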
HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting
results: Compared with existing methods, our approach shows clear advantages in optimization speed, rendering quality, and storage overhead.Abstract
We have recently seen tremendous progress in photo-real human modeling and rendering. Yet, efficiently rendering realistic human performance and integrating it into the rasterization pipeline remains challenging. In this paper, we present HiFi4G, an explicit and compact Gaussian-based approach for high-fidelity human performance rendering from dense footage. Our core intuition is to marry the 3D Gaussian representation with non-rigid tracking, achieving a compact and compression-friendly representation. We first propose a dual-graph mechanism to obtain motion priors, with a coarse deformation graph for effective initialization and a fine-grained Gaussian graph to enforce subsequent constraints. Then, we utilize a 4D Gaussian optimization scheme with adaptive spatial-temporal regularizers to effectively balance the non-rigid prior and Gaussian updating. We also present a companion compression scheme with residual compensation for immersive experiences on various platforms. It achieves a substantial compression rate of approximately 25 times, with less than 2MB of storage per frame. Extensive experiments demonstrate the effectiveness of our approach, which significantly outperforms existing approaches in terms of optimization speed, rendering quality, and storage overhead.
F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis
paper_authors: Sitong Su, Jianzhi Liu, Lianli Gao, Jingkuan Song
for: Accelerating inference in text-to-video synthesis without retraining the model.
methods: By examining the inference process of two mainstream text-to-video models (transformer-based and diffusion-based), the authors find that both share redundancy in the temporal attention modules used to establish relations among frames. Based on this finding, they propose a training-free, generalized pruning strategy called F3-Pruning.
results: Extensive experiments on three datasets with the classic transformer-based model CogVideo and the typical diffusion-based model Tune-A-Video verify that F3-Pruning accelerates inference while preserving quality and is broadly applicable.Abstract
Recently Text-to-Video (T2V) synthesis has undergone a breakthrough by training transformers or diffusion models on large-scale datasets. Nevertheless, inferring such large models incurs huge costs. Previous inference acceleration works either require costly retraining or are model-specific. To address this issue, instead of retraining we explore the inference process of two mainstream T2V models using transformers and diffusion models. The exploration reveals the redundancy in temporal attention modules of both models, which are commonly utilized to establish temporal relations among frames. Consequently, we propose a training-free and generalized pruning strategy called F3-Pruning to prune redundant temporal attention weights. Specifically, when aggregate temporal attention values are ranked below a certain ratio, corresponding weights will be pruned. Extensive experiments on three datasets using a classic transformer-based model CogVideo and a typical diffusion-based model Tune-A-Video verify the effectiveness of F3-Pruning in inference acceleration, quality assurance and broad applicability.
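The pruning rule can be sketched as follows: aggregate each temporal-attention weight over queries and zero those that rank below a keep ratio. This is an illustrative reconstruction from the abstract, not the released implementation:

```python
import numpy as np

def f3_prune(temporal_attention, keep_ratio=0.5):
    """Zero temporal-attention weights whose aggregate value (summed over
    queries) ranks below the keep_ratio threshold; no retraining needed.
    temporal_attention: (frames, frames) query-by-key weight matrix."""
    aggregate = temporal_attention.sum(axis=0)              # per-key aggregate
    threshold = np.quantile(aggregate, 1.0 - keep_ratio)
    mask = aggregate >= threshold                           # keys to keep
    return temporal_attention * mask, mask

# Four frames; attention toward frames 2 and 3 is consistently weak.
attn = np.array([[0.5, 0.3, 0.1, 0.1],
                 [0.6, 0.2, 0.1, 0.1],
                 [0.5, 0.4, 0.05, 0.05],
                 [0.4, 0.4, 0.1, 0.1]])
pruned, kept = f3_prune(attn, keep_ratio=0.5)
```

Because the rule depends only on attention statistics observed at inference time, the same function applies unchanged to transformer-based and diffusion-based T2V models.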
Data-driven Crop Growth Simulation on Time-varying Generated Images using Multi-conditional Generative Adversarial Networks
results: The results show that the two-stage framework provides realistic, sharp time-varying image predictions, with only a slight loss of quality from short-term to long-term predictions, and yields useful scenario-dependent predictions of plant traits. Moreover, adding process-based simulated biomass as a condition increases the accuracy of the traits derived from the predicted images, indicating that the framework can serve as an interface between image-based and process-based crop growth models.Abstract
Image-based crop growth modeling can substantially contribute to precision agriculture by revealing spatial crop development over time, which allows an early and location-specific estimation of relevant future plant traits, such as leaf area or biomass. A prerequisite for realistic and sharp crop image generation is the integration of multiple growth-influencing conditions in a model, such as an image of an initial growth stage, the associated growth time, and further information about the field treatment. We present a two-stage framework consisting first of an image prediction model and second of a growth estimation model, which both are independently trained. The image prediction model is a conditional Wasserstein generative adversarial network (CWGAN). In the generator of this model, conditional batch normalization (CBN) is used to integrate different conditions along with the input image. This allows the model to generate time-varying artificial images dependent on multiple influencing factors of different kinds. These images are used by the second part of the framework for plant phenotyping by deriving plant-specific traits and comparing them with those of non-artificial (real) reference images. For various crop datasets, the framework allows realistic, sharp image predictions with a slight loss of quality from short-term to long-term predictions. Simulations of varying growth-influencing conditions performed with the trained framework provide valuable insights into how such factors relate to crop appearances, which is particularly useful in complex, less explored crop mixture systems. Further results show that adding process-based simulated biomass as a condition increases the accuracy of the derived phenotypic traits from the predicted images. This demonstrates the potential of our framework to serve as an interface between an image- and process-based crop growth model.
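Conditional batch normalization (CBN), used in the generator to inject conditions such as growth time and field treatment, replaces BN's single learned scale/shift with values predicted from the condition vector. A minimal numpy sketch with made-up dimensions (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(3)
n_channels, cond_dim = 4, 3      # illustrative sizes

# Linear maps predicting per-channel scale and shift from the condition
# vector (e.g. encoded growth time and treatment).
W_gamma = rng.normal(size=(cond_dim, n_channels)) * 0.1
W_beta = rng.normal(size=(cond_dim, n_channels)) * 0.1

def conditional_batch_norm(x, cond, eps=1e-5):
    """x: (batch, channels, height, width); cond: (batch, cond_dim).
    Normalize per channel, then apply a condition-dependent gamma/beta
    instead of BN's single learned pair."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    gamma = 1.0 + cond @ W_gamma          # (batch, channels)
    beta = cond @ W_beta
    return gamma[:, :, None, None] * x_hat + beta[:, :, None, None]

x = rng.normal(size=(2, n_channels, 5, 5))
cond = rng.normal(size=(2, cond_dim))
out = conditional_batch_norm(x, cond)
```

Because gamma and beta are functions of the condition, the same generator weights produce different feature statistics for different growth times, which is how the model generates time-varying images.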
High-Quality Facial Geometry and Appearance Capture at Home
results: Experimental results show that the method obtains high-quality 3D capture results.Abstract
Facial geometry and appearance capture have demonstrated tremendous success in 3D scanning real humans in studios. Recent works propose to democratize this technique while keeping the results high quality. However, they are still inconvenient for daily usage. In addition, they focus on an easier problem of only capturing facial skin. This paper proposes a novel method for high-quality face capture, featuring an easy-to-use system and the capability to model the complete face with skin, mouth interior, hair, and eyes. We reconstruct facial geometry and appearance from a single co-located smartphone flashlight sequence captured in a dim room where the flashlight is the dominant light source (e.g. rooms with curtains or at night). To model the complete face, we propose a novel hybrid representation to effectively model both eyes and other facial regions, along with novel techniques to learn it from images. We apply a combined lighting model to compactly represent real illuminations and exploit a morphable face albedo model as a reflectance prior to disentangle diffuse and specular. Experiments show that our method can capture high-quality 3D relightable scans.
UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity
results: Under standard in-domain evaluation, CFAM achieves competitive performance across various datasets, especially on the ultra fine-grained UFine6926. Moreover, evaluation on UFine3C shows that models trained on UFine6926 generalize to real scenarios better than those trained on other datasets.Abstract
Existing text-based person retrieval datasets often have relatively coarse-grained text annotations. This hinders the model from comprehending the fine-grained semantics of query texts in real scenarios. To address this problem, we contribute a new benchmark named \textbf{UFineBench} for text-based person retrieval with ultra-fine granularity. Firstly, we construct a new \textbf{dataset} named UFine6926. We collect a large number of person images and manually annotate each image with two detailed textual descriptions, averaging 80.8 words each. The average word count is three to four times that of the previous datasets. In addition to standard in-domain evaluation, we also propose a special \textbf{evaluation paradigm} more representative of real scenarios. It contains a new evaluation set with cross domains, cross textual granularity and cross textual styles, named UFine3C, and a new evaluation metric for accurately measuring retrieval ability, named mean Similarity Distribution (mSD). Moreover, we propose CFAM, a more efficient \textbf{algorithm} especially designed for text-based person retrieval with ultra fine-grained texts. It achieves fine granularity mining by adopting a shared cross-modal granularity decoder and hard negative match mechanism. With standard in-domain evaluation, CFAM establishes competitive performance across various datasets, especially on our ultra fine-grained UFine6926. Furthermore, by evaluating on UFine3C, we demonstrate that training on our UFine6926 significantly improves generalization to real scenarios compared with other coarse-grained datasets. The dataset and code will be made publicly available at \url{https://github.com/Zplusdragon/UFineBench}.
paper_authors: Ribana Roscher, Lukas Roth, Cyrill Stachniss, Achim Walter
for: This paper aims to address the challenges of digital agriculture by adopting a data-centric approach to machine learning, with the goal of improving the accuracy and sustainability of agricultural tasks such as yield prediction, weed detection, and early disease identification.
methods: The paper proposes the use of data-centric machine learning strategies that utilize the intrinsic value of data to develop accurate, generalizable, and adaptable methods for digital agriculture. These strategies include acquiring and curating valuable data, as well as implementing effective learning and evaluation methods.
results: The paper has the potential to create accurate, generalizable, and adaptable machine learning methods that effectively and sustainably address agricultural tasks, leading to improved yields, reduced waste, and more efficient use of resources. By adopting a data-centric approach, the paper aims to overcome the limitations of traditional model-centric methods and fully realize the potential of digital agriculture.
Abstract
In response to the increasing global demand for food, feed, fiber, and fuel, digital agriculture is rapidly evolving to meet these demands while reducing environmental impact. This evolution involves incorporating data science, machine learning, sensor technologies, robotics, and new management strategies to establish a more sustainable agricultural framework. So far, machine learning research in digital agriculture has predominantly taken model-centric approaches, focusing on model design and evaluation. These efforts aim to optimize model accuracy and efficiency, often treating data as a static benchmark. Despite the availability of agricultural data and methodological advancements, a saturation point has been reached, with many established machine learning methods achieving comparable levels of accuracy and facing similar limitations. To fully realize the potential of digital agriculture, it is crucial to have a comprehensive understanding of the role of data in the field and to adopt data-centric machine learning. This involves developing strategies to acquire and curate valuable data and implementing effective learning and evaluation strategies that utilize the intrinsic value of data. This approach has the potential to create accurate, generalizable, and adaptable machine learning methods that effectively and sustainably address agricultural tasks such as yield prediction, weed detection, and early disease identification.
Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle
paper_authors: Youtian Lin, Zuozhuo Dai, Siyu Zhu, Yao Yao
for: fast dynamic scene reconstruction and real-time rendering from multi-view and monocular videos
methods: Uses point-based 3D Gaussian Splatting (3DGS) and a Dual-Domain Deformation Model (DDDM) to model attribute deformations
results: Higher novel-view rendering quality than previous leading methods and a 5x faster training speed
Abstract
We introduce Gaussian-Flow, a novel point-based approach for fast dynamic scene reconstruction and real-time rendering from both multi-view and monocular videos. In contrast to the prevalent NeRF-based approaches hampered by slow training and rendering speeds, our approach harnesses recent advancements in point-based 3D Gaussian Splatting (3DGS). Specifically, a novel Dual-Domain Deformation Model (DDDM) is proposed to explicitly model attribute deformations of each Gaussian point, where the time-dependent residual of each attribute is captured by a polynomial fitting in the time domain, and a Fourier series fitting in the frequency domain. The proposed DDDM is capable of modeling complex scene deformations across long video footage, eliminating the need for training separate 3DGS for each frame or introducing an additional implicit neural field to model 3D dynamics. Moreover, the explicit deformation modeling for discretized Gaussian points ensures ultra-fast training and rendering of a 4D scene, which is comparable to the original 3DGS designed for static 3D reconstruction. Our proposed approach showcases a substantial efficiency improvement, achieving a $5\times$ faster training speed compared to the per-frame 3DGS modeling. In addition, quantitative results demonstrate that the proposed Gaussian-Flow significantly outperforms previous leading methods in novel view rendering quality. Project page: https://nju-3dv.github.io/projects/Gaussian-Flow
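The DDDM described above fits the time-dependent residual of each Gaussian attribute with a polynomial in the time domain plus a Fourier series in the frequency domain. As a rough illustration of that combination, here is a minimal sketch; the function name, coefficient layout, and base frequency are assumptions for exposition, not the paper's implementation.

```python
import math

def dddm_residual(t, poly_coeffs, fourier_coeffs):
    """Time-dependent residual of one Gaussian attribute (illustrative sketch).

    t              -- normalized time in [0, 1]
    poly_coeffs    -- [a_0, a_1, ...] for the time-domain polynomial sum_i a_i * t**i
    fourier_coeffs -- [(b_k, c_k), ...] for b_k*cos(2*pi*k*t) + c_k*sin(2*pi*k*t)
    """
    # polynomial fitting in the time domain
    r = sum(a * t ** i for i, a in enumerate(poly_coeffs))
    # Fourier series fitting in the frequency domain
    for k, (b, c) in enumerate(fourier_coeffs, start=1):
        r += b * math.cos(2 * math.pi * k * t) + c * math.sin(2 * math.pi * k * t)
    return r
```

The deformed attribute at time t would then be the static 3DGS attribute plus this residual, which is what lets a single set of Gaussians cover a whole video without per-frame retraining.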
results: Extensive experiments on three RGB-Polarization benchmarks, the UPLight, ZJU, and MCubeS datasets, show that ShareCMP achieves state-of-the-art mIoU performance while using about 26-33% fewer parameters than previous dual-branch models.
Abstract
Multimodal semantic segmentation is developing rapidly, but the modality of RGB-Polarization remains underexplored. To delve into this problem, we construct a UPLight RGB-P segmentation benchmark with 12 typical underwater semantic classes which provides data support for Autonomous Underwater Vehicles (AUVs) to perform special perception tasks. In this work, we design the ShareCMP, an RGB-P semantic segmentation framework with a shared dual-branch architecture, which reduces the number of parameters by about 26-33% compared to previous dual-branch models. It encompasses a Polarization Generate Attention (PGA) module designed to generate polarization modal images with richer polarization properties for the encoder. In addition, we introduce the Class Polarization-Aware Loss (CPALoss) to improve the learning and understanding of the encoder for polarization modal information and to optimize the PGA module. With extensive experiments on a total of three RGB-P benchmarks, our ShareCMP achieves state-of-the-art performance in mIoU with fewer parameters on the UPLight (92.45%), ZJU (92.7%), and MCubeS (50.99%) datasets. The code is available at https://github.com/LEFTeyex/ShareCMP.
Artist-Friendly Relightable and Animatable Neural Heads
results: The method can relight and animate dynamic neural avatars performing unseen expressions in any environment, even with near-field illumination and viewpoints.
Abstract
An increasingly common approach for creating photo-realistic digital avatars is through the use of volumetric neural fields. The original neural radiance field (NeRF) allowed for impressive novel view synthesis of static heads when trained on a set of multi-view images, and follow-up methods showed that these neural representations can be extended to dynamic avatars. Recently, new variants also surpassed the usual drawback of baked-in illumination in neural representations, showing that static neural avatars can be relit in any environment. In this work we simultaneously tackle both the motion and illumination problem, proposing a new method for relightable and animatable neural heads. Our method builds on a proven dynamic avatar approach based on a mixture of volumetric primitives, combined with a recently-proposed lightweight hardware setup for relightable neural fields, and includes a novel architecture that allows relighting dynamic neural avatars performing unseen expressions in any environment, even with nearfield illumination and viewpoints.
DeepPyramid+: Medical Image Segmentation using Pyramid View Fusion and Deformable Pyramid Reception
results: Across diverse datasets, including endometriosis videos, MRI images, OCT scans, and cataract and laparoscopy videos, DeepPyramid+ handles challenges such as shape and scale variation, reflection, and blur degradation. It achieves significant segmentation gains, improving the Dice coefficient by up to 3.65% for intra-domain segmentation and up to 17% for cross-domain segmentation, and it outperforms state-of-the-art networks across modalities and backbone networks, demonstrating its versatility.
Abstract
Semantic Segmentation plays a pivotal role in many applications related to medical image and video analysis. However, designing a neural network architecture for medical image and surgical video segmentation is challenging due to the diverse features of relevant classes, including heterogeneity, deformability, transparency, blunt boundaries, and various distortions. We propose a network architecture, DeepPyramid+, which addresses diverse challenges encountered in medical image and surgical video segmentation. The proposed DeepPyramid+ incorporates two major modules, namely "Pyramid View Fusion" (PVF) and "Deformable Pyramid Reception," (DPR), to address the outlined challenges. PVF replicates a deduction process within the neural network, aligning with the human visual system, thereby enhancing the representation of relative information at each pixel position. Complementarily, DPR introduces shape- and scale-adaptive feature extraction techniques using dilated deformable convolutions, enhancing accuracy and robustness in handling heterogeneous classes and deformable shapes. Extensive experiments conducted on diverse datasets, including endometriosis videos, MRI images, OCT scans, and cataract and laparoscopy videos, demonstrate the effectiveness of DeepPyramid+ in handling various challenges such as shape and scale variation, reflection, and blur degradation. DeepPyramid+ demonstrates significant improvements in segmentation performance, achieving up to a 3.65% increase in Dice coefficient for intra-domain segmentation and up to a 17% increase in Dice coefficient for cross-domain segmentation. DeepPyramid+ consistently outperforms state-of-the-art networks across diverse modalities considering different backbone networks, showcasing its versatility.
Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future
for: This paper aims to provide a comprehensive review of open-source autonomous driving datasets, including their principles, data scales, and future challenges.
methods: The paper uses a systematic approach to assess over 70 open-source autonomous driving datasets from domestic and international sources, and offers insights into the creation of high-quality datasets, data engine systems, and generative foundation models.
results: The paper provides an exhaustive analysis and discussion of the characteristics and data scales of future third-generation autonomous driving datasets, and highlights the scientific and technical challenges that need to be addressed to advance autonomous innovation and foster technological enhancement in critical domains.
Abstract
With the continuous maturation and application of autonomous driving technology, a systematic examination of open-source autonomous driving datasets becomes instrumental in fostering the robust evolution of the industry ecosystem. Current autonomous driving datasets can broadly be categorized into two generations. The first-generation autonomous driving datasets are characterized by relatively simpler sensor modalities, smaller data scale, and is limited to perception-level tasks. KITTI, introduced in 2012, serves as a prominent representative of this initial wave. In contrast, the second-generation datasets exhibit heightened complexity in sensor modalities, greater data scale and diversity, and an expansion of tasks from perception to encompass prediction and control. Leading examples of the second generation include nuScenes and Waymo, introduced around 2019. This comprehensive review, conducted in collaboration with esteemed colleagues from both academia and industry, systematically assesses over seventy open-source autonomous driving datasets from domestic and international sources. It offers insights into various aspects, such as the principles underlying the creation of high-quality datasets, the pivotal role of data engine systems, and the utilization of generative foundation models to facilitate scalable data generation. Furthermore, this review undertakes an exhaustive analysis and discourse regarding the characteristics and data scales that future third-generation autonomous driving datasets should possess. It also delves into the scientific and technical challenges that warrant resolution. These endeavors are pivotal in advancing autonomous innovation and fostering technological enhancement in critical domains. For further details, please refer to https://github.com/OpenDriveLab/DriveAGI.
SVQ: Sparse Vector Quantization for Spatiotemporal Forecasting
results: Through experiments on diverse datasets across multiple fields, the authors show that SVQ reduces computation without losing detail, consistently enhances base models, and achieves state-of-the-art results across all benchmarks.
Abstract
Spatiotemporal forecasting tasks, such as weather forecasting and traffic prediction, offer significant societal benefits. These tasks can be effectively approached as image forecasting problems using computer vision models. Vector quantization (VQ) is a well-known method for discrete representation that improves the latent space, leading to enhanced generalization and transfer learning capabilities. One of the main challenges in using VQ for spatiotemporal forecasting is how to balance between keeping enough details and removing noise from the original patterns for better generalization. We address this challenge by developing sparse vector quantization, or {\bf SVQ} for short, which leverages sparse regression to make a better trade-off between the two objectives. The main innovation of this work is to approximate sparse regression by a two-layer MLP and a randomly fixed or learnable matrix, dramatically improving its computational efficiency. Through experiments conducted on diverse datasets in multiple fields including weather forecasting, traffic flow prediction, and video forecasting, we unequivocally demonstrate that our proposed method consistently enhances the performance of base models and achieves state-of-the-art results across all benchmarks.
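The core idea of approximating sparse regression with a two-layer MLP over a codebook can be sketched in a few lines. This toy version is an assumption-laden illustration (the layer shapes, ReLU choice, and function name are ours, not the paper's module): the MLP outputs non-negative combination weights, many of which are zeroed by the ReLU, so the result is a sparse mixture of codebook vectors rather than a single nearest-code assignment.

```python
import numpy as np

def svq_sketch(x, W1, W2, codebook):
    """Toy sparse-vector-quantization step (illustrative only).

    x        -- input feature vector
    W1, W2   -- two-layer MLP weights producing code weights
    codebook -- rows are code vectors; may be randomly fixed rather than learned
    """
    h = np.maximum(W1 @ x, 0.0)   # hidden layer (ReLU)
    w = np.maximum(W2 @ h, 0.0)   # non-negative, mostly-zero combination weights
    return codebook.T @ w         # sparse weighted combination of codes
```

Compared with classic VQ's hard nearest-neighbor lookup, this soft sparse combination is differentiable end-to-end, which is presumably what makes the approximation cheap to train.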
Predicting Postoperative Intraocular Lens Dislocation in Cataract Surgery via Deep Learning
methods: The study uses three types of convolutional neural networks (CNNs), namely recurrent, region-based, and pixel-based CNNs, to compute lens unfolding delay, rotation, and instability during surgery.
results: The results show that the proposed framework reliably evaluates the lenses' statistics during surgery, in agreement with expert surgeons' hypotheses and observations. The results also show significant correlations between lens unfolding delay and lens rotation, and significant differences in the intra-operative rotation stability of four groups of lenses.
Abstract
A critical yet unpredictable complication following cataract surgery is intraocular lens dislocation. Postoperative stability is imperative, as even a tiny decentration of multifocal lenses or inadequate alignment of the torus in toric lenses due to postoperative rotation can lead to a significant drop in visual acuity. Investigating possible intraoperative indicators that can predict post-surgical instabilities of intraocular lenses can help prevent this complication. In this paper, we develop and evaluate the first fully-automatic framework for the computation of lens unfolding delay, rotation, and instability during surgery. Adopting a combination of three types of CNNs, namely recurrent, region-based, and pixel-based, the proposed framework is employed to assess the possibility of predicting post-operative lens dislocation during cataract surgery. This is achieved via performing a large-scale study on the statistical differences between the behavior of different brands of intraocular lenses and aligning the results with expert surgeons' hypotheses and observations about the lenses. We exploit a large-scale dataset of cataract surgery videos featuring four intraocular lens brands. Experimental results confirm the reliability of the proposed framework in evaluating the lens' statistics during the surgery. The Pearson correlation and t-test results reveal significant correlations between lens unfolding delay and lens rotation and significant differences between the intra-operative rotations stability of four groups of lenses. These results suggest that the proposed framework can help surgeons select the lenses based on the patient's eye conditions and predict post-surgical lens dislocation.
Action Scene Graphs for Long-Form Understanding of Egocentric Videos
paper_authors: Ivan Rodin, Antonino Furnari, Kyle Min, Subarna Tripathi, Giovanni Maria Farinella
for: Egocentric Action Scene Graphs (EASGs) are a new representation for long-form understanding of egocentric videos.
methods: EASGs use a novel annotation procedure to describe the camera wearer's actions, interacted objects, their relationships, and how actions unfold in time, providing a temporally evolving graph-based description of egocentric videos.
results: Experiments on two downstream tasks (egocentric action anticipation and egocentric activity summarization) demonstrate that EASGs effectively support long-form egocentric video understanding.
We present Egocentric Action Scene Graphs (EASGs), a new representation for long-form understanding of egocentric videos. EASGs extend standard manually-annotated representations of egocentric videos, such as verb-noun action labels, by providing a temporally evolving graph-based description of the actions performed by the camera wearer, including interacted objects, their relationships, and how actions unfold in time. Through a novel annotation procedure, we extend the Ego4D dataset by adding manually labeled Egocentric Action Scene Graphs, offering a rich set of annotations designed for long-form egocentric video understanding. We hence define the EASG generation task and provide a baseline approach, establishing preliminary benchmarks. Experiments on two downstream tasks, egocentric action anticipation and egocentric activity summarization, highlight the effectiveness of EASGs for long-form egocentric video understanding. We will release the dataset and the code to replicate experiments and annotations.
Novel class discovery meets foundation models for 3D semantic segmentation
results: Through extensive evaluations on the SemanticKITTI, SemanticPOSS, and S3DIS datasets, the paper demonstrates the substantial superiority of the proposed NCD method for point cloud semantic segmentation.
Abstract
The task of Novel Class Discovery (NCD) in semantic segmentation entails training a model able to accurately segment unlabelled (novel) classes, relying on the available supervision from annotated (base) classes. Although extensively investigated in 2D image data, the extension of the NCD task to the domain of 3D point clouds represents a pioneering effort, characterized by assumptions and challenges that are not present in the 2D case. This paper represents an advancement in the analysis of point cloud data in four directions. Firstly, it introduces the novel task of NCD for point cloud semantic segmentation. Secondly, it demonstrates that directly transposing the only existing NCD method for 2D image semantic segmentation to 3D data yields suboptimal results. Thirdly, a new NCD approach based on online clustering, uncertainty estimation, and semantic distillation is presented. Lastly, a novel evaluation protocol is proposed to rigorously assess the performance of NCD in point cloud semantic segmentation. Through comprehensive evaluations on the SemanticKITTI, SemanticPOSS, and S3DIS datasets, the paper demonstrates substantial superiority of the proposed method over the considered baselines.
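The abstract names online clustering as one ingredient of the proposed NCD approach but does not spell out the procedure. As background only, the basic streaming-centroid update that online clustering methods build on (used here to pseudo-label points of novel classes) looks roughly like this; the function name and learning rate are illustrative assumptions.

```python
def online_cluster_assign(feature, centroids, lr=0.1):
    """One streaming k-means step: assign a feature vector to its nearest
    centroid, then move that centroid slightly toward the feature, so
    clusters adapt as unlabelled (novel-class) points stream in."""
    dists = [sum((f - c) ** 2 for f, c in zip(feature, cen)) for cen in centroids]
    k = dists.index(min(dists))  # index of nearest centroid = pseudo-label
    centroids[k] = [c + lr * (f - c) for f, c in zip(feature, centroids[k])]
    return k
```

In an NCD setting the resulting assignments would serve as pseudo-labels for the novel classes, typically weighted by the uncertainty estimates the paper also mentions.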
Riemannian Complex Matrix Convolution Network for PolSAR Image Classification
paper_authors: Junfei Shi, Wei Wang, Haiyan Jin, Mengmeng Nie, Shanshan Ji
for: This work aims to improve deep learning methods for PolSAR image classification by better exploiting the characteristics of PolSAR data.
methods: The method uses a Riemannian complex matrix convolution network that takes the complex matrix directly as the network input and defines Riemannian operations to learn its features; a fast kernel learning method is also developed to learn class-specific features and reduce computation time.
results: Experimental results show that the proposed method outperforms state-of-the-art methods on three different PolSAR datasets.
Recently, deep learning methods have achieved superior performance for Polarimetric Synthetic Aperture Radar (PolSAR) image classification. Existing deep learning methods learn PolSAR data by converting the covariance matrix into a feature vector or complex-valued vector as the input. However, none of these methods can learn the structure of the complex matrix directly, and they destroy the channel correlation. To learn the geometric structure of the complex matrix, we propose a Riemannian complex matrix convolution network for PolSAR image classification in Riemannian space for the first time, which directly utilizes the complex matrix as the network input and defines Riemannian operations to learn the complex matrix's features. The proposed Riemannian complex matrix convolution network considers the PolSAR complex matrix endowed in a Riemannian manifold, and defines a series of new Riemannian convolution, ReLU and LogEig operations in Riemannian space, which break through the Euclidean constraint of conventional networks. Then, a CNN module is appended to enhance contextual Riemannian features. Besides, a fast kernel learning method is developed for the proposed method to learn class-specific features and reduce the computation time effectively. Experiments are conducted on three sets of real PolSAR data with different bands and sensors. Experimental results demonstrate that the proposed method obtains superior performance over the state-of-the-art methods.
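Of the operations named above, LogEig is the standard one for Hermitian positive-definite matrices such as PolSAR covariance/coherency matrices: take the matrix logarithm through an eigendecomposition, which locally flattens the Riemannian manifold so Euclidean layers can follow. A minimal sketch (the eps clamp is our assumption for numerical safety; the paper's exact Riemannian convolution is not reproduced here):

```python
import numpy as np

def log_eig(C, eps=1e-6):
    """Matrix logarithm of a Hermitian positive (semi-)definite matrix:
    C = V diag(w) V^H  maps to  V diag(log w) V^H."""
    w, V = np.linalg.eigh(C)  # eigh handles complex Hermitian input
    w = np.maximum(w, eps)    # clamp to keep log finite for near-singular C
    return (V * np.log(w)) @ V.conj().T
```

The result lives in a vector space, so downstream CNN modules (as in the paper's appended CNN) can process it with ordinary Euclidean operations.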
Evaluating the point cloud of individual trees generated from images based on Neural Radiance fields (NeRF) method
results: The NeRF method performs well in individual tree 3D reconstruction, with a high successful reconstruction rate and better reconstruction in the canopy area, but the generated point cloud models suffer from noise and low resolution. Compared with the photogrammetric reconstruction method, NeRF has significant advantages in reconstruction efficiency and is adaptable to complex scenes, but its generated point cloud models are less accurate.
Abstract
Three-dimensional (3D) reconstruction of trees has always been a key task in precision forestry management and research. Due to the complex branch morphological structure of trees themselves and the occlusions from tree stems, branches and foliage, it is difficult to recreate a complete three-dimensional tree model from a two-dimensional image by conventional photogrammetric methods. In this study, based on tree images collected by various cameras in different ways, the Neural Radiance Fields (NeRF) method was used for individual tree reconstruction and the exported point cloud models are compared with point clouds derived from photogrammetric reconstruction and laser scanning methods. The results show that the NeRF method performs well in individual tree 3D reconstruction, as it has a higher successful reconstruction rate, better reconstruction in the canopy area, and requires fewer images as input. Compared with the photogrammetric reconstruction method, NeRF has significant advantages in reconstruction efficiency and is adaptable to complex scenes, but the generated point cloud tends to be noisy and low resolution. The accuracy of tree structural parameters (tree height and diameter at breast height) extracted from the photogrammetric point cloud is still higher than that of parameters derived from the NeRF point cloud. The results of this study illustrate the great potential of the NeRF method for individual tree reconstruction, and it provides new ideas and research directions for 3D reconstruction and visualization of complex forest scenes.
Bottom-Up Instance Segmentation of Catheters for Chest X-Rays
results: The paper proposes a deep learning method that effectively handles central lines and tubes, which can reduce reporting delays and improve reading efficiency.
Abstract
Chest X-ray (CXR) is frequently employed in emergency departments and intensive care units to verify the proper placement of central lines and tubes and to rule out related complications. The automation of the X-ray reading process can be a valuable support tool for non-specialist technicians and minimize reporting delays due to non-availability of experts. While existing solutions for automated catheter segmentation and malposition detection show promising results, the disentanglement of individual catheters remains an open challenge, especially in complex cases where multiple devices appear superimposed in the X-ray projection. Moreover, conventional top-down instance segmentation methods are ineffective on such thin and long devices, that often extend through the entire image. In this paper, we propose a deep learning approach based on associative embeddings for catheter instance segmentation, able to overcome those limitations and effectively handle device intersections.
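Associative embeddings, which the method above builds on, group pixels by a learned per-pixel tag: a "pull" term draws tags of the same catheter instance together while a "push" term separates the mean tags of different instances, which is what lets intersecting devices be disentangled bottom-up. The paper's exact loss is not given in the abstract; this 1-D toy version of the classic grouping loss (our names and Gaussian push term) just illustrates the mechanism.

```python
import math

def ae_grouping_loss(groups, sigma=1.0):
    """Toy 1-D associative-embedding grouping loss.

    groups -- list of lists of scalar tags, one inner list per instance
    pull   -- mean squared distance of each tag to its instance mean
    push   -- Gaussian penalty when two instance means are close together
    """
    refs = [sum(g) / len(g) for g in groups]          # per-instance mean tag
    n = sum(len(g) for g in groups)
    pull = sum((e - r) ** 2 for g, r in zip(groups, refs) for e in g) / n
    push, pairs = 0.0, 0
    for i in range(len(refs)):
        for j in range(len(refs)):
            if i != j:
                push += math.exp(-((refs[i] - refs[j]) ** 2) / (2 * sigma ** 2))
                pairs += 1
    return pull + (push / pairs if pairs else 0.0)
```

At inference, pixels are grouped by clustering their predicted tags, which avoids the top-down bounding-box proposals that fail on thin, image-spanning catheters.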
RING-NeRF: A Versatile Architecture based on Residual Implicit Neural Grids
results: The paper achieves high-quality anti-aliased rendering, high-quality reconstruction from few supervised viewpoints, and robustness in the absence of scene-specific initialization for SDF-based NeRFs. The architecture can also dynamically add grids to increase reconstruction detail.

Abstract
Since their introduction, Neural Fields have become very popular for 3D reconstruction and novel view synthesis. Recent research has focused on accelerating the process, as well as improving robustness to variations in observation distance and to limited numbers of supervised viewpoints. However, those approaches often led to dedicated solutions that cannot be easily combined. To tackle this issue, we introduce a new simple but efficient architecture named RING-NeRF, based on Residual Implicit Neural Grids, that provides control over the level of detail of the mapping function between the scene and the latent spaces. Associated with a distance-aware forward mapping mechanism and a continuous coarse-to-fine reconstruction process, our versatile architecture demonstrates both fast training and state-of-the-art performance in terms of: (1) anti-aliased rendering, (2) reconstruction quality from few supervised viewpoints, and (3) robustness in the absence of appropriate scene-specific initialization for SDF-based NeRFs. We also demonstrate that our architecture can dynamically add grids to increase the details of the reconstruction, opening the way to adaptive reconstruction.
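The residual-grid decoding can be sketched in one dimension: each level stores a uniform feature grid, finer levels add corrections on top of coarser ones, and the continuous coarse-to-fine process corresponds to activating more levels. A hypothetical toy (the paper's grids are multi-dimensional and learned, and feed a mapping network):

```python
def interp(grid, x):
    # linear interpolation on a uniform grid over [0, 1]
    n = len(grid) - 1
    t = min(max(x, 0.0), 1.0) * n
    i = min(int(t), n - 1)
    f = t - i
    return grid[i] * (1 - f) + grid[i + 1] * f

def ring_query(levels, x, active):
    # residual decoding: each finer grid adds a correction to the
    # coarser sum; `active` controls the level of detail
    return sum(interp(g, x) for g in levels[:active])
```

Dynamically adding a grid to refine the reconstruction amounts to appending a new residual level and increasing `active`.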
PointMoment: Mixed-Moment-based Self-Supervised Representation Learning for 3D Point Clouds
results: Experimental results show that, compared with previous unsupervised learning methods, the method achieves better performance on the downstream tasks of point cloud classification and segmentation.

Abstract
Large and rich data is a prerequisite for effective training of deep neural networks. However, the irregularity of point cloud data makes manual annotation time-consuming and laborious. Self-supervised representation learning, which leverages the intrinsic structure of large-scale unlabelled data to learn meaningful feature representations, has attracted increasing attention in the field of point cloud research. However, self-supervised representation learning often suffers from model collapse, resulting in reduced information and diversity of the learned representation, and consequently degrading the performance of downstream tasks. To address this problem, we propose PointMoment, a novel framework for point cloud self-supervised representation learning that utilizes a high-order mixed moment loss function rather than the conventional contrastive loss function. Moreover, our framework does not require any special techniques such as asymmetric network architectures, gradient stopping, etc. Specifically, we calculate the high-order mixed moments of the feature variables and force them to decompose into products of their individual moments, thereby making multiple variables more independent and minimizing the feature redundancy. We also incorporate a contrastive learning approach to maximize the feature invariance under different data augmentations of the same point cloud. Experimental results show that our approach outperforms previous unsupervised learning methods on the downstream task of 3D point cloud classification and segmentation.
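The mixed-moment idea — forcing E[x^p y^q] to factor into E[x^p]·E[y^q], as it would for independent variables — can be written as a toy loss over two scalar feature channels. A hypothetical sketch, not the paper's exact loss:

```python
def mixed_moment_gap(xs, ys, p, q):
    # |E[x^p * y^q] - E[x^p] * E[y^q]|: zero when the moments factorize
    n = len(xs)
    exy = sum((x ** p) * (y ** q) for x, y in zip(xs, ys)) / n
    ex = sum(x ** p for x in xs) / n
    ey = sum(y ** q for y in ys) / n
    return abs(exy - ex * ey)

def pointmoment_loss(xs, ys, orders=((1, 1), (1, 2), (2, 1), (2, 2))):
    # sum of factorization gaps over several moment orders; penalizing it
    # pushes the two feature variables toward independence, reducing redundancy
    return sum(mixed_moment_gap(xs, ys, p, q) for p, q in orders)
```

For feature channels that already behave independently the loss vanishes, while a channel paired with itself (maximal redundancy) is penalized.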
GraNet: A Multi-Level Graph Network for 6-DoF Grasp Pose Generation in Cluttered Scenes
results: Achieves state-of-the-art performance on the large-scale GraspNet-1Billion benchmark, especially in grasping unseen objects (+11.62 AP); real-robot experiments show a high success rate in grasping scattered objects.

Abstract
6-DoF object-agnostic grasping in unstructured environments is a critical yet challenging task in robotics. Most current works use non-optimized approaches to sample grasp locations and learn spatial features without considering the grasping task. This paper proposes GraNet, a graph-based grasp pose generation framework that translates a point cloud scene into multi-level graphs and propagates features through graph neural networks. By building graphs at the scene level, object level, and grasp point level, GraNet enhances feature embedding at multiple scales while progressively converging to the ideal grasping locations by learning. Our pipeline can thus characterize the spatial distribution of grasps in cluttered scenes, leading to a higher rate of effective grasping. Furthermore, we enhance the representation ability of scalable graph networks by a structure-aware attention mechanism to exploit local relations in graphs. Our method achieves state-of-the-art performance on the large-scale GraspNet-1Billion benchmark, especially in grasping unseen objects (+11.62 AP). The real robot experiment shows a high success rate in grasping scattered objects, verifying the effectiveness of the proposed approach in unstructured environments.
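The front end that graph-based pipelines like this one start from — turning a point set into a graph so that features can be propagated along edges — can be sketched as a k-nearest-neighbor edge list. A hypothetical 1-D toy (the paper builds graphs at the scene, object, and grasp-point levels over 3-D points):

```python
def knn_graph(points, k):
    # build the directed edge list of a k-nearest-neighbor graph
    # over scalar "points" (3-D distance in the real setting)
    edges = []
    for i, p in enumerate(points):
        nbrs = sorted((abs(p - q), j)
                      for j, q in enumerate(points) if j != i)[:k]
        edges += [(i, j) for _, j in nbrs]
    return edges
```

A graph neural network would then aggregate features along these edges, level by level.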
PointJEM: Self-supervised Point Cloud Understanding for Reducing Feature Redundancy via Joint Entropy Maximization
results: Experiments validate that PointJEM reduces correlation between features and achieves competitive performance on downstream tasks such as classification and segmentation.

Abstract
Most deep learning-based point cloud processing methods are supervised and require large amounts of labeled data. However, manual labeling of point cloud data is laborious and time-consuming. Self-supervised representation learning can address the aforementioned issue by learning robust and generalized representations from unlabeled datasets. Nevertheless, the embedded features obtained by representation learning usually contain redundant information, and most current methods reduce feature redundancy by linear correlation constraints. In this paper, we propose PointJEM, a self-supervised representation learning method applied to the point cloud field. PointJEM comprises an embedding scheme and a loss function based on joint entropy. The embedding scheme divides the embedding vector into different parts, each part can learn a distinctive feature. To reduce redundant information in the features, PointJEM maximizes the joint entropy between the different parts, thereby rendering the learned feature variables pairwise independent. To validate the effectiveness of our method, we conducted experiments on multiple datasets. The results demonstrate that our method can significantly reduce feature redundancy beyond linear correlation. Furthermore, PointJEM achieves competitive performance in downstream tasks such as classification and segmentation.
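Why maximizing joint entropy enforces independence: for fixed marginals, H(A, B) ≤ H(A) + H(B), with equality exactly when A and B are independent, so pushing H(A, B) up drives the mutual information H(A) + H(B) − H(A, B) to zero. A histogram-based sketch over two discretized embedding parts (hypothetical; the method itself works with continuous features):

```python
import math

def entropy(counts):
    # Shannon entropy (nats) of an empirical distribution given as counts
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def joint_entropy_gap(a, b):
    """H(A) + H(B) - H(A, B): the mutual information between two
    discretized feature parts; zero iff they are independent."""
    ca, cb, cab = {}, {}, {}
    for x, y in zip(a, b):
        ca[x] = ca.get(x, 0) + 1
        cb[y] = cb.get(y, 0) + 1
        cab[(x, y)] = cab.get((x, y), 0) + 1
    return entropy(ca) + entropy(cb) - entropy(cab)
```

Two parts that always agree carry fully redundant information (gap = H of either part); independent parts give a gap of zero.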
Building Category Graphs Representation with Spatial and Temporal Attention for Visual Navigation
for: This paper is written for visual navigation in partially observable environments, with the goal of reaching a target object based on a sequence of partial observations.
methods: The proposed method uses a Category Relation Graph (CRG) to learn the knowledge of object category layout relations, and a Temporal-Spatial-Region (TSR) attention architecture to perceive the long-term spatial-temporal dependencies of objects.
results: The proposed CRG-TSR method significantly outperforms existing methods regarding both effectiveness and efficiency, as demonstrated by experiments on AI2-THOR.

Abstract
Given an object of interest, visual navigation aims to reach the object's location based on a sequence of partial observations. To this end, an agent needs to 1) learn certain knowledge about the relations of object categories in the world during training and 2) look for the target object based on the pre-learned object category relations and its moving trajectory in the current unseen environment. In this paper, we propose a Category Relation Graph (CRG) to learn the knowledge of object category layout relations and a Temporal-Spatial-Region (TSR) attention architecture to perceive the long-term spatial-temporal dependencies of objects that help the navigation. We learn prior knowledge of object layout, establishing a category relationship graph to deduce the positions of specific objects. Subsequently, we introduce TSR to capture the relationships of objects across time, space, and regions within the observation trajectories. Specifically, we propose a Temporal attention module (T) to model the temporal structure of the observation sequence, which implicitly encodes the historical moving or trajectory information. Then, a Spatial attention module (S) is used to uncover the spatial context of the current observation objects based on the category relation graph and past observations. Finally, a Region attention module (R) shifts the attention to the target-relevant region. Based on the visual representation extracted by our method, the agent can better perceive the environment and easily learn a superior navigation policy. Experiments on AI2-THOR demonstrate that our CRG-TSR method significantly outperforms existing methods regarding both effectiveness and efficiency. The code has been included in the supplementary material and will be publicly available.
GCFA: Geodesic Curve Feature Augmentation via Shape Space Theory
paper_authors: Yuexing Han, Guanxin Wan, Bing Wang
for: Improving the performance of data preprocessing models in small-sample settings
methods: A Geodesic curve feature augmentation method (GCFA) based on shape space theory
results: Reduces information loss and improves the performance of data preprocessing models in small-sample settings

Abstract
Deep learning has yielded remarkable outcomes in various domains. However, the challenge of requiring large-scale labeled samples still persists in deep learning. Thus, data augmentation has been introduced as a critical strategy to train deep learning models. However, data augmentation suffers from information loss and poor performance in small sample environments. To overcome these drawbacks, we propose a feature augmentation method based on shape space theory, i.e., Geodesic curve feature augmentation, called GCFA for brevity. First, we extract features from the image with the neural network model. Then, multiple image features are projected into a pre-shape space. In the pre-shape space, a Geodesic curve is built to fit the features. Finally, the many features generated along the Geodesic curve are used to train the various machine learning models. The GCFA module can be seamlessly integrated with most machine learning methods. The proposed method is simple, effective, and insensitive to small sample sizes. Several examples demonstrate that the GCFA method can greatly improve the performance of the data preprocessing model in a small sample environment.
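The pipeline described — project features into a pre-shape space, fit a Geodesic curve, sample new features along it — can be sketched for the two-feature case, where the Geodesic between two points on the pre-shape sphere is spherical linear interpolation. A hypothetical sketch (the paper fits a curve through many features, not just two):

```python
import math

def to_preshape(v):
    # pre-shape projection: remove location (center) and scale (unit norm),
    # leaving a point on the unit sphere
    m = sum(v) / len(v)
    c = [x - m for x in v]
    norm = math.sqrt(sum(x * x for x in c))
    return [x / norm for x in c]

def geodesic_point(u, v, t):
    # spherical linear interpolation: the geodesic on the pre-shape sphere
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)
    if omega < 1e-12:
        return list(u)
    s = math.sin(omega)
    return [(math.sin((1 - t) * omega) * a + math.sin(t * omega) * b) / s
            for a, b in zip(u, v)]

def gcfa_augment(f1, f2, k):
    # generate k augmented features evenly spaced along the geodesic
    u, v = to_preshape(f1), to_preshape(f2)
    return [geodesic_point(u, v, i / (k - 1)) for i in range(k)]
```

Every generated feature stays on the pre-shape sphere, so the augmentations remain valid shape-space points rather than arbitrary interpolations.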
Background Clustering Pre-training for Few-shot Segmentation
results: Experiments on PASCAL-5i and COCO-20i confirm the effectiveness of BCPT.

Abstract
Recent few-shot segmentation (FSS) methods introduce an extra pre-training stage before meta-training to obtain a stronger backbone, which has become a standard step in few-shot learning. Despite its effectiveness, the current pre-training scheme suffers from the merged background problem: only base classes are labelled as foreground, making it hard to distinguish between novel classes and the actual background. In this paper, we propose a new pre-training scheme for FSS via decoupling the novel classes from background, called Background Clustering Pre-Training (BCPT). Specifically, we adopt online clustering on the pixel embeddings of merged background to explore the underlying semantic structures, bridging the gap between pre-training and adaptation to novel classes. Given the clustering results, we further propose the background mining loss and leverage base classes to guide the clustering process, improving the quality and stability of clustering results. Experiments on PASCAL-5i and COCO-20i show that BCPT yields advanced performance. Code will be available.
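The online clustering of merged-background pixel embeddings can be sketched as sequential k-means updates with per-cluster learning rates: each embedding is assigned to its nearest centroid, which then moves toward it. A hypothetical 1-D toy (the actual method clusters high-dimensional embeddings and adds a background mining loss on top):

```python
def online_kmeans_step(centroids, counts, batch):
    """One pass of online k-means over a batch of scalar embeddings.
    counts[j] tracks how many points cluster j has absorbed, giving the
    standard 1/count learning rate."""
    for x in batch:
        # assign to the nearest centroid
        j = min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
        counts[j] += 1
        lr = 1.0 / counts[j]
        # move the winning centroid toward the new point
        centroids[j] += lr * (x - centroids[j])
    return centroids, counts
```

Run over streaming background pixels, the centroids come to represent latent background sub-classes that pre-training can then treat as pseudo-labels.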
Complementary Benefits of Contrastive Learning and Self-Training Under Distribution Shift
results: In domain adaptation settings, combining self-training and contrastive learning yields 3-8% higher accuracy than either method alone; in semi-supervised learning settings, surprisingly, the benefits of the two methods are not synergistic.

Abstract
Self-training and contrastive learning have emerged as leading techniques for incorporating unlabeled data, both under distribution shift (unsupervised domain adaptation) and when it is absent (semi-supervised learning). However, despite the popularity and compatibility of these techniques, their efficacy in combination remains unexplored. In this paper, we undertake a systematic empirical investigation of this combination, finding that (i) in domain adaptation settings, self-training and contrastive learning offer significant complementary gains; and (ii) in semi-supervised learning settings, surprisingly, the benefits are not synergistic. Across eight distribution shift datasets (e.g., BREEDs, WILDS), we demonstrate that the combined method obtains 3--8% higher accuracy than either approach independently. We then theoretically analyze these techniques in a simplified model of distribution shift, demonstrating scenarios under which the features produced by contrastive learning can yield a good initialization for self-training to further amplify gains and achieve optimal performance, even when either method alone would fail.
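The self-training half of the combination can be sketched as iterative pseudo-labeling with a nearest-class-mean classifier; in the paper's analysis, contrastive learning supplies the feature space (the "good initialization") this loop runs in. A hypothetical 1-D toy, not the paper's training procedure:

```python
def self_train(labeled, unlabeled, rounds=3, threshold=0.8):
    """Minimal self-training sketch: fit class means on labeled data,
    pseudo-label confident unlabeled points, and refit each round."""
    data = list(labeled)      # (feature, label) pairs
    pool = list(unlabeled)    # unlabeled features
    for _ in range(rounds):
        # fit a nearest-class-mean classifier on current (pseudo-)labels
        by_class = {}
        for x, y in data:
            by_class.setdefault(y, []).append(x)
        means = {y: sum(v) / len(v) for y, v in by_class.items()}
        keep = []
        for x in pool:
            # confidence: margin between the two nearest class means
            d = sorted((abs(x - m), y) for y, m in means.items())
            conf = 1.0 if len(d) == 1 else d[1][0] / (d[0][0] + d[1][0] + 1e-12)
            if conf >= threshold:
                data.append((x, d[0][1]))   # adopt the pseudo-label
            else:
                keep.append(x)              # too ambiguous; defer
        pool = keep
    return data
```

Points far from the decision boundary get absorbed early, while ambiguous points stay unlabeled, which is why the quality of the underlying features dominates the outcome.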
DiffPMAE: Diffusion Masked Autoencoders for Point Cloud Reconstruction
For: Point cloud reconstruction and its applications in interactive service delivery and the future Metaverse.
Methods: An effective point cloud reconstruction architecture that combines Masked Auto-Encoding and a Diffusion Model mechanism to remotely reconstruct point cloud data.
Results: DiffPMAE exceeds many state-of-the-art methods in terms of auto-encoding and the downstream tasks considered, validated on over 60,000 objects from the ShapeNet-55 and ModelNet datasets.

Abstract
Point cloud streaming is increasingly popular, evolving into the norm for interactive service delivery and the future Metaverse. However, the substantial volume of data associated with point clouds presents numerous challenges, particularly in terms of high bandwidth consumption and large storage capacity. Despite various solutions proposed thus far, with a focus on point cloud compression, upsampling, and completion, these reconstruction-related methods continue to fall short in delivering high fidelity point cloud output. As a solution, in DiffPMAE, we propose an effective point cloud reconstruction architecture. Inspired by self-supervised learning concepts, we combine Masked Auto-Encoding and a Diffusion Model mechanism to remotely reconstruct point cloud data. By the nature of this reconstruction process, DiffPMAE can be extended to many related downstream tasks including point cloud compression, upsampling, and completion. Leveraging the ShapeNet-55 and ModelNet datasets with over 60,000 objects, we validate the performance of DiffPMAE, which exceeds many state-of-the-art methods in terms of auto-encoding and the downstream tasks considered.
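The masked auto-encoding front end can be sketched as grouping points into patches and hiding a fixed ratio of them; a diffusion decoder would then reconstruct the masked patches conditioned on the visible ones. A hypothetical sketch with scalar stand-ins for points (real point clouds would be grouped by spatial proximity, e.g. farthest-point sampling plus k-NN):

```python
import random

def mask_point_patches(points, group_size=4, mask_ratio=0.75, seed=0):
    """Split a point set into patches and return (visible, masked) patches.
    Hypothetical toy: grouping here is random rather than geometric."""
    rng = random.Random(seed)
    pts = list(points)
    rng.shuffle(pts)
    # form fixed-size patches
    patches = [pts[i:i + group_size] for i in range(0, len(pts), group_size)]
    rng.shuffle(patches)
    n_masked = int(len(patches) * mask_ratio)
    # the encoder sees only the visible patches; the decoder must
    # reconstruct the masked ones
    return patches[n_masked:], patches[:n_masked]
```

Because only the visible subset needs to be transmitted, the same split doubles as a compression scheme, which is how the architecture extends to compression and completion.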
Cooperative Probabilistic Trajectory Forecasting under Occlusion
for: This paper aims to improve the navigation of safety-critical tasks in situations with occlusion.
methods: The paper proposes an end-to-end network that estimates the current state of occluded pedestrians in the reference frame of the ego agent and predicts the trajectory with safety guarantees.
results: Experimental results show that the predicted trajectories of occluded pedestrians are close to the ground-truth trajectories obtained assuming no occlusion, demonstrating the effectiveness of the method for uncertainty-aware navigation among multiple connected agents under occlusion.

Abstract
Perception and planning under occlusion is essential for safety-critical tasks. Occlusion-aware planning often requires communicating the information of the occluded object to the ego agent for safe navigation. However, communicating rich sensor information under adverse conditions during communication loss and limited bandwidth may not always be feasible. Further, in GPS-denied environments and indoor navigation, localizing and sharing of occluded objects can be challenging. To overcome this, relative pose estimation between connected agents sharing a common field of view can be a computationally effective way of communicating information about surrounding objects. In this paper, we design an end-to-end network that cooperatively estimates the current states of occluded pedestrians in the reference frame of the ego agent and then predicts the trajectory with safety guarantees. Experimentally, we show that the uncertainty-aware trajectory prediction of occluded pedestrians by the ego agent is almost similar to the ground truth trajectory assuming no occlusion. The current research holds promise for uncertainty-aware navigation among multiple connected agents under occlusion.
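Why relative pose estimation suffices to share occluded-object information: given the relative pose of a connected agent in the ego frame, any object state that agent observes can be re-expressed in the ego frame with a rigid transform, with no need to transmit rich sensor data. A 2-D sketch of that transform:

```python
import math

def to_ego_frame(p_other, rel_pose):
    """Express a point observed by a connected agent in the ego frame.
    rel_pose = (tx, ty, theta): the other agent's pose in ego coordinates
    (2-D rigid transform; the real system works in full state space)."""
    tx, ty, th = rel_pose
    x, y = p_other
    # rotate into the ego orientation, then translate by the agent's position
    return (tx + math.cos(th) * x - math.sin(th) * y,
            ty + math.sin(th) * x + math.cos(th) * y)
```

Transmitting only the pedestrian's state plus the relative pose is a few numbers per update, which is what makes the approach viable under limited bandwidth.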
On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
results: We find that LMMs are generally not robust to visual adversarial inputs; however, on the ScienceQA task they show only an 8.10% performance drop, while their visual counterparts drop by 99.73%. We also propose a new approach to real-world image classification, termed query decomposition: adding existence queries to the input prompt reduces attack effectiveness and improves image classification accuracy.

Abstract
Recent advances in instruction tuning have led to the development of State-of-the-Art Large Multimodal Models (LMMs). Given the novelty of these models, the impact of visual adversarial attacks on LMMs has not been thoroughly examined. We conduct a comprehensive study of the robustness of various LMMs against different adversarial attacks, evaluated across tasks including image classification, image captioning, and Visual Question Answer (VQA). We find that in general LMMs are not robust to visual adversarial inputs. However, our findings suggest that context provided to the model via prompts, such as questions in a QA pair helps to mitigate the effects of visual adversarial inputs. Notably, the LMMs evaluated demonstrated remarkable resilience to such attacks on the ScienceQA task with only an 8.10% drop in performance compared to their visual counterparts which dropped 99.73%. We also propose a new approach to real-world image classification which we term query decomposition. By incorporating existence queries into our input prompt we observe diminished attack effectiveness and improvements in image classification accuracy. This research highlights a previously under-explored facet of LMM robustness and sets the stage for future work aimed at strengthening the resilience of multimodal systems in adversarial environments.
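The query-decomposition idea — preceding the classification question with existence queries so the model grounds each candidate before answering — can be sketched as simple prompt construction. The wording below is hypothetical; the paper does not specify its exact templates:

```python
def decompose_query(label_set, image_token="<image>"):
    """Build a query-decomposition prompt sequence (hypothetical wording):
    one existence query per candidate class, then the final classification
    question."""
    existence = [f"Is there a {c} in {image_token}? Answer yes or no."
                 for c in label_set]
    final = f"Which of {', '.join(label_set)} best describes {image_token}?"
    return existence + [final]
```

The extra context from the existence answers is what, per the abstract, mitigates the effect of adversarial perturbations on the final classification.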
Class Incremental Learning for Adversarial Robustness
paper_authors: Seungju Cho, Hongsin Lee, Changick Kim
for: Exploring Adversarially Robust Class Incremental Learning (ARCIL), which combines adversarial robustness with incremental learning to strengthen model robustness as data accumulates.
methods: Combining incremental learning with naive adversarial training is observed to degrade robustness, which is attributed to the disappearance of the flatness of the loss function; to address this, the authors propose a Flatness Preserving Distillation (FPD) loss and a Logit Adjustment Distillation (LAD) loss that adapts the model's knowledge to new tasks.
results: Experiments show the method outperforms approaches that apply adversarial training to existing incremental learning methods, achieving AutoAttack accuracy 5.99%p, 5.27%p, and 3.90%p higher on average than the baseline on split CIFAR-10, CIFAR-100, and Tiny ImageNet, respectively.

Abstract
Adversarial training integrates adversarial examples during model training to enhance robustness. However, its application in fixed dataset settings differs from real-world dynamics, where data accumulates incrementally. In this study, we investigate Adversarially Robust Class Incremental Learning (ARCIL), a method that combines adversarial robustness with incremental learning. We observe that combining incremental learning with naive adversarial training easily leads to a loss of robustness. We discover that this is attributed to the disappearance of the flatness of the loss function, a characteristic of adversarial training. To address this issue, we propose the Flatness Preserving Distillation (FPD) loss that leverages the output difference between adversarial and clean examples. Additionally, we introduce the Logit Adjustment Distillation (LAD) loss, which adapts the model's knowledge to perform well on new tasks. Experimental results demonstrate the superiority of our method over approaches that apply adversarial training to existing incremental learning methods, which provides a strong baseline for incremental learning on adversarial robustness in the future. Our method achieves AutoAttack accuracy that is 5.99\%p, 5.27\%p, and 3.90\%p higher on average than the baseline on split CIFAR-10, CIFAR-100, and Tiny ImageNet, respectively. The code will be made available.
Indirect Gradient Matching for Adversarial Robust Distillation
paper_authors: Hongsin Lee, Seungju Cho, Changick Kim
For: Improving the adversarial robustness of models.
Methods: An Indirect Gradient Distillation Module (IGDM) that matches the student's input gradient to the teacher's via a Taylor approximation, rather than computing the gradients directly.
Results: Experimental results show that IGDM enhances the performance of all AD methods, especially when integrated into the SOTA method without additional data augmentation: AutoAttack accuracy improves from 28.06% to 30.32% for ResNet-18 and from 26.18% to 29.52% for MobileNetV2.

Abstract
Adversarial training significantly improves adversarial robustness, but superior performance is primarily attained with large models. This substantial performance gap for smaller models has spurred active research into adversarial distillation (AD) to mitigate the difference. Existing AD methods leverage the teacher's logits as a guide. In contrast to these approaches, we aim to transfer another piece of knowledge from the teacher, the input gradient. In this paper, we propose a distillation module termed Indirect Gradient Distillation Module (IGDM) that indirectly matches the student's input gradient with that of the teacher. We hypothesize that students can better acquire the teacher's knowledge by matching the input gradient. Leveraging the observation that adversarial training renders the model locally linear on the input space, we employ Taylor approximation to effectively align gradients without directly calculating them. Experimental results show that IGDM seamlessly integrates with existing AD methods, significantly enhancing the performance of all AD methods. Particularly, utilizing IGDM on the CIFAR-100 dataset improves the AutoAttack accuracy from 28.06% to 30.32% with the ResNet-18 model and from 26.18% to 29.52% with the MobileNetV2 model when integrated into the SOTA method without additional data augmentation. The code will be made available.
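The Taylor-approximation trick above can be made concrete: for a locally linear model, f(x_adv) - f(x) is approximately (x_adv - x) dot grad f(x), so aligning teacher and student output differences over the same perturbation indirectly aligns their input gradients. A minimal numpy sketch; the L1 reduction and the function names are assumptions, not the paper's exact loss:

```python
import numpy as np

def igdm_loss(f_student, f_teacher, x, x_adv):
    """Indirect gradient matching, illustrative sketch.

    Because adversarial training makes the model locally linear,
    f(x_adv) - f(x) ~= (x_adv - x) . grad_x f(x), so aligning the two
    models' output differences over the same perturbation aligns their
    input gradients without computing any gradient explicitly.
    """
    d_student = f_student(x_adv) - f_student(x)
    d_teacher = f_teacher(x_adv) - f_teacher(x)
    return float(np.mean(np.abs(d_student - d_teacher)))
```

For truly linear toy models the loss vanishes exactly when the two input gradients coincide.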
SO-NeRF: Active View Planning for NeRF using Surrogate Objectives
methods: Proposes Surrogate Objectives for Active Radiance Fields (SOAR), a set of interpretable functions that evaluate the goodness of views using geometric and photometric visual cues: surface coverage, geometric complexity, textural complexity, and ray diversity. By learning to infer SOAR scores with a deep network, good views can be selected in seconds rather than hours, without prior visits to all candidate views or training any radiance field during planning.
results: SOARNet achieves a ~80x speed-up over the baselines while maintaining comparable or better reconstruction quality. Moreover, SOAR is model-agnostic, so it applies to fully neural-implicit as well as fully explicit approaches.
Abstract
Despite the great success of Neural Radiance Fields (NeRF), its data-gathering process remains vague with only a general rule of thumb of sampling as densely as possible. The lack of understanding of what actually constitutes good views for NeRF makes it difficult to actively plan a sequence of views that yield the maximal reconstruction quality. We propose Surrogate Objectives for Active Radiance Fields (SOAR), which is a set of interpretable functions that evaluates the goodness of views using geometric and photometric visual cues - surface coverage, geometric complexity, textural complexity, and ray diversity. Moreover, by learning to infer the SOAR scores from a deep network, SOARNet, we are able to effectively select views in mere seconds instead of hours, without the need for prior visits to all the candidate views or training any radiance field during such planning. Our experiments show SOARNet outperforms the baselines with $\sim$80x speed-up while achieving better or comparable reconstruction qualities. We finally show that SOAR is model-agnostic, thus it generalizes across fully neural-implicit to fully explicit approaches.
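As a toy illustration of how the four SOAR cues could be aggregated into a per-view score for greedy view selection; the linear weighting is a hypothetical stand-in, since in the paper the scores are inferred by SOARNet rather than hand-combined:

```python
def soar_score(coverage, geom_complexity, tex_complexity, ray_diversity,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Aggregate the four SOAR cues into one view score.

    The linear weighting is a hypothetical stand-in for the learned
    scoring in SOARNet.
    """
    cues = (coverage, geom_complexity, tex_complexity, ray_diversity)
    return sum(w * c for w, c in zip(weights, cues))

def select_best_view(candidate_cues):
    """Greedy next-best-view selection over candidate cue tuples."""
    return max(range(len(candidate_cues)),
               key=lambda i: soar_score(*candidate_cues[i]))
```

Because scoring needs only per-view cue values, planning never has to visit all candidates or fit a radiance field first.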
FAAC: Facial Animation Generation with Anchor Frame and Conditional Control for Superior Fidelity and Editability
results: The method is validated on multiple DreamBooth and LoRA models, achieving high-quality, highly editable facial animation with improved video motion. In addition, conditional control with a 3D parametric face model captures accurate facial movements and expressions.
Abstract
Over recent years, diffusion models have facilitated significant advancements in video generation. Yet, the creation of face-related videos still confronts issues such as low facial fidelity, lack of frame consistency, limited editability and uncontrollable human poses. To address these challenges, we introduce a facial animation generation method that enhances both face identity fidelity and editing capabilities while ensuring frame consistency. This approach incorporates the concept of an anchor frame to counteract the degradation of generative ability in original text-to-image models when incorporating a motion module. We propose two strategies towards this objective: training-free and training-based anchor frame methods. Our method's efficacy has been validated on multiple representative DreamBooth and LoRA models, delivering substantial improvements over the original outcomes in terms of facial fidelity, text-to-image editability, and video motion. Moreover, we introduce conditional control using a 3D parametric face model to capture accurate facial movements and expressions. This solution augments the creative possibilities for facial animation generation through the integration of multiple control signals. For additional samples, please visit https://anonymous.4open.science/r/FAAC.
OctreeOcc: Efficient and Multi-Granularity Occupancy Prediction Using Octree Queries
results: In extensive evaluations across diverse scenes and objects, OctreeOcc not only surpasses current state-of-the-art methods but also achieves a 15%-24% reduction in computational overhead relative to dense-grid-based methods.
Abstract
Occupancy prediction has increasingly garnered attention in recent years for its fine-grained understanding of 3D scenes. Traditional approaches typically rely on dense, regular grid representations, which often leads to excessive computational demands and a loss of spatial details for small objects. This paper introduces OctreeOcc, an innovative 3D occupancy prediction framework that leverages the octree representation to adaptively capture valuable information in 3D, offering variable granularity to accommodate object shapes and semantic regions of varying sizes and complexities. In particular, we incorporate image semantic information to improve the accuracy of initial octree structures and design an effective rectification mechanism to refine the octree structure iteratively. Our extensive evaluations show that OctreeOcc not only surpasses state-of-the-art methods in occupancy prediction, but also achieves a 15%-24% reduction in computational overhead compared to dense-grid-based methods.
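The adaptive-granularity idea can be sketched as prior-driven octree refinement: a cell is split into eight children only where a semantic prior suggests occupied space, so fine cells appear only around likely-occupied regions. The `occupancy_prior` callback and the stopping rule are assumptions for illustration, standing in for the image-semantics initialization and iterative rectification described above:

```python
def refine_octree(cells, occupancy_prior, threshold=0.5, max_depth=3):
    """Prior-driven octree refinement, illustrative sketch.

    A cell (center, size) is split into 8 children while the semantic
    prior rates it above `threshold` and the depth budget allows, so
    the representation stays coarse in empty space.
    """
    leaves = []
    stack = [(center, size, 0) for center, size in cells]
    while stack:
        center, size, depth = stack.pop()
        if depth < max_depth and occupancy_prior(center, size) > threshold:
            half = size / 2.0
            for dx in (-half / 2, half / 2):
                for dy in (-half / 2, half / 2):
                    for dz in (-half / 2, half / 2):
                        child = (center[0] + dx, center[1] + dy, center[2] + dz)
                        stack.append((child, half, depth + 1))
        else:
            leaves.append((center, size))
    return leaves
```

A prior that fires only on large cells splits the root once and then stops, leaving eight half-size leaves.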
Human Body Model based ID using Shape and Pose Parameters
results: When the HMID network is trained with additional shape and pose losses, it shows a significant improvement in biometric identification performance compared to an identical model trained without them.
Abstract
We present a Human Body model based IDentification system (HMID) system that is jointly trained for shape, pose and biometric identification. HMID is based on the Human Mesh Recovery (HMR) network and we propose additional losses to improve and stabilize shape estimation and biometric identification while maintaining the pose and shape output. We show that when our HMID network is trained using additional shape and pose losses, it shows a significant improvement in biometric identification performance when compared to an identical model that does not use such losses. The HMID model uses raw images instead of silhouettes and is able to perform robust recognition on images collected at range and altitude as many anthropometric properties are reasonably invariant to clothing, view and range. We show results on the USF dataset as well as the BRIAR dataset which includes probes with both clothing and view changes. Our approach (using body model losses) shows a significant improvement in Rank20 accuracy and True Accuracy Rate on the BRIAR evaluation dataset.
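The joint training described above can be illustrated as a weighted sum of pose, shape, and identification terms. The concrete terms and the weights are hypothetical; the paper only states that shape, pose, and biometric losses are optimized jointly:

```python
import math

def hmid_loss(pose_err, shape_err, id_logits, id_target, w=(1.0, 1.0, 1.0)):
    """Joint HMID objective, illustrative sketch.

    Weighted sum of a pose term, a shape term, and a softmax
    cross-entropy identification term over identity logits.
    """
    m = max(id_logits)  # stabilized log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in id_logits))
    ce = log_z - id_logits[id_target]
    return w[0] * pose_err + w[1] * shape_err + w[2] * ce
```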
Rethinking Object Saliency Ranking: A Novel Whole-flow Processing Paradigm
results: Outperforms existing state-of-the-art methods in extensive experiments on the widely used SALICON set.
Abstract
Existing salient object detection methods are capable of predicting binary maps that highlight visually salient regions. However, these methods are limited in their ability to differentiate the relative importance of multiple objects and the relationships among them, which can lead to errors and reduced accuracy in downstream tasks that depend on that relative importance. To overcome this, this paper proposes a new paradigm for saliency ranking, which aims to focus entirely on ranking salient objects by their "importance order". While previous works have shown promising performance, they still face ill-posed problems. First, the saliency ranking ground truth (GT) order generation methods are unreasonable, since determining the correct ranking order is not well-defined, resulting in false alarms. Second, training a ranking model remains challenging because most saliency ranking methods follow the multi-task paradigm, leading to conflicts and trade-offs among different tasks. Third, existing regression-based saliency ranking methods are complex due to their reliance on instance-mask-based saliency ranking orders; they require a significant amount of data to perform accurately and can be challenging to implement effectively. To solve these problems, this paper conducts an in-depth analysis of the causes and proposes a whole-flow processing paradigm for the saliency ranking task, spanning "GT data generation", "network structure design" and "training protocol". The proposed approach outperforms existing state-of-the-art methods on the widely-used SALICON set, as demonstrated by extensive experiments with fair and reasonable comparisons. The saliency ranking task is still in its infancy, and our proposed unified framework can serve as a fundamental strategy to guide future work.
Predicting Scores of Various Aesthetic Attribute Sets by Learning from Overall Score Labels
paper_authors: Heng Huang, Xin Jin, Yaqi Liu, Hao Lou, Chaoen Xiao, Shuai Cui, Xinning Li, Dongqing Zou
For: This paper aims to develop a novel aesthetic attribute evaluation framework to predict attribute scores and overall scores for images.
Methods: The proposed framework, called F2S (attribute features to attribute scores), uses networks from different tasks to provide attribute features and leverages an aesthetic attribute contribution to describe the role of aesthetic attributes in an image.
Results: The proposed F2S model achieves performance comparable to models trained on fully-annotated aesthetic attribute score labels, making it feasible to learn meaningful attribute scores for various aesthetic attribute sets across different types of images with only overall aesthetic scores.
Abstract
Many mobile phones now embed deep-learning models for photography evaluation or guidance. These models cannot provide detailed results such as human pose scores or scene color scores because of the scarcity of corresponding aesthetic attribute data. However, annotating image aesthetic attribute scores requires experienced artists and professional photographers, which hinders the collection of large-scale fully-annotated datasets. In this paper, we propose to replace image attribute labels with feature extractors. First, a novel aesthetic attribute evaluation framework based on attribute features is proposed to predict attribute scores and overall scores. We call it the F2S (attribute features to attribute scores) model. We use networks from different tasks to provide attribute features to our F2S models. Then, we define an aesthetic attribute contribution to describe the role of aesthetic attributes throughout an image and use it with the attribute scores and the overall scores to train our F2S model. Extensive experiments on publicly available datasets demonstrate that our F2S model achieves performance comparable to models trained on datasets with fully-annotated aesthetic attribute score labels. Our method makes it feasible to learn meaningful attribute scores for various aesthetic attribute sets in different types of images with only overall aesthetic scores.
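One way to picture the "aesthetic attribute contribution" is as a weight on each attribute score when composing the overall score; F2S learns this mapping end-to-end from overall labels alone. The composition below is a hypothetical reading, not the paper's formula:

```python
def overall_from_attributes(attr_scores, contributions):
    """Compose an overall aesthetic score from attribute scores.

    Each attribute's contribution acts as a normalized weight.  This
    composition is a hypothetical illustration of the role of the
    aesthetic attribute contribution described above.
    """
    total = sum(contributions)
    return sum(s * c for s, c in zip(attr_scores, contributions)) / total
```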
Cache Me if You Can: Accelerating Diffusion Models through Block Caching
paper_authors: Felix Wimbauer, Bichen Wu, Edgar Schoenfeld, Xiaoliang Dai, Ji Hou, Zijian He, Artsiom Sanakoyeu, Peizhao Zhang, Sam Tsai, Jonas Kohler, Christian Rupprecht, Daniel Cremers, Peter Vajda, Jialiang Wang
results: Through FID, human evaluation, and qualitative analysis, we show that block caching generates images of higher visual quality at the same computational cost. We experiment with different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM).
Abstract
Diffusion models have recently revolutionized the field of image synthesis due to their ability to generate photorealistic images. However, one of the major drawbacks of diffusion models is that the image generation process is costly. A large image-to-image network has to be applied many times to iteratively refine an image from random noise. While many recent works propose techniques to reduce the number of required steps, they generally treat the underlying denoising network as a black box. In this work, we investigate the behavior of the layers within the network and find that 1) the layers' output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small. We hypothesize that many layer computations in the denoising network are redundant. Leveraging this, we introduce block caching, in which we reuse outputs from layer blocks of previous steps to speed up inference. Furthermore, we propose a technique to automatically determine caching schedules based on each block's changes over timesteps. In our experiments, we show through FID, human evaluation and qualitative analysis that Block Caching allows to generate images with higher visual quality at the same computational cost. We demonstrate this for different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM).
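The caching mechanism can be sketched as follows: a precomputed schedule says, per block and timestep, whether to recompute or to reuse the cached output from an earlier step, exploiting the observation that block outputs change very little between steps. The interfaces are simplified placeholders, not the paper's implementation:

```python
def run_denoiser(blocks, x, step, cache, schedule):
    """One denoising pass with block caching, illustrative sketch.

    `schedule[(name, step)]` says whether a block must be recomputed at
    this timestep; when it is False and a cached output exists, the
    block is skipped and its previous output is reused.
    """
    h = x
    for name, block in blocks:
        if schedule.get((name, step), True) or name not in cache:
            cache[name] = block(h)
        h = cache[name]
    return h
```

Running three steps with a schedule that recomputes only at step 0 invokes the block exactly once.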
Satellite Imagery and AI: A New Era in Ocean Conservation, from Research to Deployment and Impact
paper_authors: Patrick Beukema, Favyen Bastani, Piper Wolters, Henry Herzog, Joe Ferdinando
For: The paper is written for those interested in using satellite data to monitor and prevent illegal, unreported, and unregulated (IUU) fishing.
Methods: The paper introduces three specialized computer vision models for synthetic aperture radar (Sentinel-1), optical imagery (Sentinel-2), and nighttime lights (Suomi-NPP/NOAA-20) to monitor maritime activities.
Results: The paper presents real-time computer vision services for conservation, which have been deployed in Skylight, a free online platform for maritime monitoring.
Abstract
Illegal, unreported, and unregulated (IUU) fishing poses a global threat to ocean habitats. Publicly available satellite data offered by NASA and the European Space Agency (ESA) provide an opportunity to actively monitor this activity. Effectively leveraging satellite data for maritime conservation requires highly reliable machine learning models operating globally with minimal latency. This paper introduces three specialized computer vision models designed for synthetic aperture radar (Sentinel-1), optical imagery (Sentinel-2), and nighttime lights (Suomi-NPP/NOAA-20). It also presents best practices for developing and delivering real-time computer vision services for conservation. These models have been deployed in Skylight, a real time maritime monitoring platform, which is provided at no cost to users worldwide.
Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields
results: Our method provides comparable or better results while being significantly faster to both train and render. In addition, to our knowledge we are the first to enable point and bounding-box prompting for radiance field manipulation, by leveraging the SAM model.
Abstract
3D scene representations have gained immense popularity in recent years. Methods that use Neural Radiance fields are versatile for traditional tasks such as novel view synthesis. In recent times, some work has emerged that aims to extend the functionality of NeRF beyond view synthesis, for semantically aware tasks such as editing and segmentation using 3D feature field distillation from 2D foundation models. However, these methods have two major limitations: (a) they are limited by the rendering speed of NeRF pipelines, and (b) implicitly represented feature fields suffer from continuity artifacts reducing feature quality. Recently, 3D Gaussian Splatting has shown state-of-the-art performance on real-time radiance field rendering. In this work, we go one step further: in addition to radiance field rendering, we enable 3D Gaussian splatting on arbitrary-dimension semantic features via 2D foundation model distillation. This translation is not straightforward: naively incorporating feature fields in the 3DGS framework leads to warp-level divergence. We propose architectural and training changes to efficiently avert this problem. Our proposed method is general, and our experiments showcase novel view semantic segmentation, language-guided editing and segment anything through learning feature fields from state-of-the-art 2D foundation models such as SAM and CLIP-LSeg. Across experiments, our distillation method is able to provide comparable or better results, while being significantly faster to both train and render. Additionally, to the best of our knowledge, we are the first method to enable point and bounding-box prompting for radiance field manipulation, by leveraging the SAM model. Project website at: https://feature-3dgs.github.io/
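The 2D-to-3D distillation objective above can be sketched as pushing the feature map rendered from the 3D Gaussians toward the feature map of a frozen 2D foundation model (e.g. SAM or CLIP-LSeg) on the same view. The L1 form is an assumed common choice, not necessarily the paper's exact loss:

```python
import numpy as np

def feature_distill_loss(rendered_feat, teacher_feat):
    """Feature-field distillation objective, illustrative sketch.

    Compares the per-pixel feature map rendered from the 3D Gaussians
    against a frozen 2D foundation model's feature map on the same
    view; an L1 objective is assumed here for illustration.
    """
    return float(np.mean(np.abs(rendered_feat - teacher_feat)))
```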