2023-08-08

cs.CV

cs.CV - 2023-08-08

3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment

paper_url: http://arxiv.org/abs/2308.04352
repo_url: https://github.com/3d-vista/3D-VisTA
paper_authors: Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, Qing Li
for: 3D vision-language grounding (3D-VL) tasks, such as visual grounding, dense captioning, question answering, and situated reasoning.
methods: Uses a pre-trained Transformer for 3D vision and text alignment, with self-attention layers for single-modal modeling and multi-modal fusion.
results: Achieves state-of-the-art results on various 3D-VL tasks, with superior data efficiency and strong performance even with limited annotations during fine-tuning.

Abstract
3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence. Current 3D-VL models rely heavily on sophisticated modules, auxiliary losses, and optimization tricks, which calls for a simple and unified model. In this paper, we propose 3D-VisTA, a pre-trained Transformer for 3D Vision and Text Alignment that can be easily adapted to various downstream tasks. 3D-VisTA simply utilizes self-attention layers for both single-modal modeling and multi-modal fusion without any sophisticated task-specific design. To further enhance its performance on 3D-VL tasks, we construct ScanScribe, the first large-scale 3D scene-text pairs dataset for 3D-VL pre-training. ScanScribe contains 2,995 RGB-D scans for 1,185 unique indoor scenes originating from ScanNet and 3R-Scan datasets, along with paired 278K scene descriptions generated from existing 3D-VL tasks, templates, and GPT-3. 3D-VisTA is pre-trained on ScanScribe via masked language/object modeling and scene-text matching. It achieves state-of-the-art results on various 3D-VL tasks, ranging from visual grounding and dense captioning to question answering and situated reasoning. Moreover, 3D-VisTA demonstrates superior data efficiency, obtaining strong performance even with limited annotations during downstream task fine-tuning.

paper_url: http://arxiv.org/abs/2308.04343
repo_url: https://github.com/luminosityx/hat
paper_authors: Yi Bin, Haoxuan Li, Yahui Xu, Xing Xu, Yang Yang, Heng Tao Shen
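for: Improving cross-modal retrieval, specifically the matching and alignment between images and texts.
methods: Uses two-stream Transformers as the image and text encoders, and adds a hierarchical alignment module to explore multi-level correspondences across different layers.
results: In extensive experiments on two benchmark datasets, MSCOCO and Flickr30K, HAT outperforms SOTA baselines by a large margin. On image-to-text and text-to-image retrieval, the relative Recall@1 improvement is 7.6% and 16.7% on MSCOCO, and 4.4% and 11.6% on Flickr30K.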

Abstract
Most existing cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts, e.g., CNN for images and RNN/Transformer for texts. Such discrepancy in architectures may induce different semantic distribution spaces and limit the interactions between images and texts, and further result in inferior alignment between images and texts. To fill this research gap, inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. Specifically, we design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module. With such identical architectures, the encoders could produce representations with more similar characteristics for images and texts, and make the interactions and alignments between them much easier. Besides, to leverage the rich semantics, we devise a hierarchical alignment scheme to explore multi-level correspondences of different layers between images and texts. To evaluate the effectiveness of the proposed HAT, we conduct extensive experiments on two benchmark datasets, MSCOCO and Flickr30K. Experimental results demonstrate that HAT outperforms SOTA baselines by a large margin. Specifically, on two key tasks, i.e., image-to-text and text-to-image retrieval, HAT achieves 7.6% and 16.7% relative score improvement of Recall@1 on MSCOCO, and 4.4% and 11.6% on Flickr30K respectively. The code is available at https://github.com/LuminosityX/HAT.
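
The Recall@1 figures quoted in this abstract measure how often the correct match is ranked first for a query. The following is a minimal sketch of Recall@K in Python, not code from the HAT repository. It assumes a precomputed similarity matrix in which the ground-truth item for query i is item i, which is a simplification since MSCOCO and Flickr30K pair each image with several captions. The function name and the toy scores are hypothetical.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose ground-truth item (index i for query i) is in the top-k."""
    hits = 0
    for i, row in enumerate(sim):
        # Rank candidate indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


def relative_improvement(new_score, old_score):
    """Relative gain of new_score over old_score, as a fraction of old_score."""
    return (new_score - old_score) / old_score


# Toy 3x3 image-to-text similarity matrix (hypothetical values).
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct item ranked first
    [0.8, 0.5, 0.2],  # query 1: correct item ranked second
    [0.1, 0.2, 0.7],  # query 2: correct item ranked first
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
```

Here two of three queries rank their match first, so Recall@1 is 2/3, while Recall@2 is 1.0. A "relative improvement" is a ratio against the baseline score rather than a difference in percentage points, which is what `relative_improvement` computes.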

TranSTYLer: Multimodal Behavioral Style Transfer for Facial and Body Gestures Generation

paper_url: http://arxiv.org/abs/2308.10843
repo_url: None
paper_authors: Mireille Fares, Catherine Pelachaud, Nicolas Obin
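for: Transferring the behavior expressivity style of one virtual agent to another while preserving the shape of the behaviors, since they carry communicative meaning.
methods: Proposes TranSTYLer, a multimodal transformer-based model that synthesizes the multimodal behaviors of a source speaker with the style of a target speaker. It assumes that behavior expressivity style is encoded across modalities of communication, including text, speech, body gestures, and facial expressions, and uses style and content disentanglement so that the transferred style does not interfere with the meaning of the source behaviors.
results: The model is trained on the PATS corpus, extended to include dialog acts and 2D facial landmarks. Objective and subjective evaluations show that it outperforms state-of-the-art models for styles both seen and unseen during training. To address possible style and content leakage, the authors propose a methodology to assess whether behaviors and gestures adopt the target style while the source content is preserved.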

Abstract
This paper addresses the challenge of transferring the behavior expressivity style of a virtual agent to another one while preserving the shape of behaviors, as they carry communicative meaning. Behavior expressivity style is viewed here as the qualitative properties of behaviors. We propose TranSTYLer, a multimodal transformer-based model that synthesizes the multimodal behaviors of a source speaker with the style of a target speaker. We assume that behavior expressivity style is encoded across various modalities of communication, including text, speech, body gestures, and facial expressions. The model employs a style and content disentanglement schema to ensure that the transferred style does not interfere with the meaning conveyed by the source behaviors. Our approach eliminates the need for style labels and allows the generalization to styles that have not been seen during the training phase. We train our model on the PATS corpus, which we extended to include dialog acts and 2D facial landmarks. Objective and subjective evaluations show that our model outperforms state-of-the-art models in style transfer for both seen and unseen styles during training. To tackle the issues of style and content leakage that may arise, we propose a methodology to assess the degree to which behavior and gestures associated with the target style are successfully transferred, while ensuring the preservation of the ones related to the source content.

Domain Adaptive Person Search via GAN-based Scene Synthesis for Cross-scene Videos

paper_url: http://arxiv.org/abs/2308.04322
repo_url: https://github.com/crsm424/da-gss
paper_authors: Huibing Wang, Tianxiang Cui, Mingze Yao, Huijuan Pang, Yushan Du
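for: Improving the accuracy and efficiency of person search by using Generative Adversarial Networks (GAN) to synthesize high-quality person images.
methods: Builds on the Fast R-CNN model, adds an Assisted-Identity Query Module (AIDQ) to provide positive images, and uses a GAN-based scene synthesis model to generate high-quality person images. An online learning strategy learns from the synthesized and original images together to strengthen feature learning.
results: Extensive experiments on two standard person search benchmarks, CUHK-SYSU and PRW, show strong performance. A detailed ablation study indicates that the GAN-synthesized data increases the variability and realism of the data.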

Abstract
Person search has recently been a challenging task in the computer vision domain, which aims to search for specific pedestrians from real cameras. Nevertheless, most surveillance videos comprise only a handful of images of each pedestrian, which often feature identical backgrounds and clothing. Hence, it is difficult to learn more discriminative features for person search in real scenes. To tackle this challenge, we draw on Generative Adversarial Networks (GAN) to synthesize data from surveillance videos. GAN has thrived in computer vision problems because it produces high-quality images efficiently. We merely alter the popular Fast R-CNN model, which is capable of processing videos and yielding accurate detection outcomes. In order to appropriately relieve the pressure brought by the two-stage model, we design an Assisted-Identity Query Module (AIDQ) to provide positive images for the behind part. Besides, we propose a novel GAN-based Scene Synthesis model that can synthesize high-quality cross-id person images for person search tasks. In order to facilitate the feature learning of the GAN-based Scene Synthesis model, we adopt an online learning strategy that collaboratively learns the synthesized images and original images. Extensive experiments on two widely used person search benchmarks, CUHK-SYSU and PRW, have shown that our method has achieved great performance, and the extensive ablation study further justifies that our GAN-synthetic data can effectively increase the variability of the datasets and be more realistic.

All-pairs Consistency Learning for Weakly Supervised Semantic Segmentation

paper_url: http://arxiv.org/abs/2308.04321
repo_url: None
paper_authors: Weixuan Sun, Yanhao Zhang, Zhen Qin, Zheyuan Liu, Lin Cheng, Fanyi Wang, Yiran Zhong, Nick Barnes
for: Better object localization for weakly supervised semantic segmentation (WSSS).
methods: Uses a transformer-based regularization: in addition to consistency regularization, a new all-pairs consistency regularization (ACR).
results: Produces better class localization maps on the PASCAL VOC and MS COCO datasets (67.3% mIoU on PASCAL VOC train), which improves WSSS performance.

Abstract
In this work, we propose a new transformer-based regularization to better localize objects for Weakly supervised semantic segmentation (WSSS). In image-level WSSS, Class Activation Map (CAM) is adopted to generate object localization as pseudo segmentation labels. To address the partial activation issue of the CAMs, consistency regularization is employed to maintain activation intensity invariance across various image augmentations. However, such methods ignore pair-wise relations among regions within each CAM, which capture context and should also be invariant across image views. To this end, we propose a new all-pairs consistency regularization (ACR). Given a pair of augmented views, our approach regularizes the activation intensities between a pair of augmented views, while also ensuring that the affinity across regions within each view remains consistent. We adopt vision transformers as the self-attention mechanism naturally embeds pair-wise affinity. This enables us to simply regularize the distance between the attention matrices of augmented image pairs. Additionally, we introduce a novel class-wise localization method that leverages the gradients of the class token. Our method can be seamlessly integrated into existing WSSS methods using transformers without modifying the architectures. We evaluate our method on PASCAL VOC and MS COCO datasets. Our method produces noticeably better class localization maps (67.3% mIoU on PASCAL VOC train), resulting in superior WSSS performances.
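
The idea of regularizing the distance between the attention matrices of two augmented views can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: it assumes the two views are already spatially aligned token by token, it uses a plain scaled dot-product affinity, and it picks a mean absolute difference as the distance, since the abstract does not state which distance is used. All function names and the toy token features are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_matrix(tokens):
    """Row-softmaxed scaled dot-product affinity between all pairs of tokens."""
    d = len(tokens[0])
    scores = [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        for q in tokens
    ]
    return [softmax(r) for r in scores]


def all_pairs_consistency(attn_a, attn_b):
    """Mean absolute difference between two N x N attention matrices (assumed distance)."""
    n = len(attn_a)
    total = sum(abs(attn_a[i][j] - attn_b[i][j]) for i in range(n) for j in range(n))
    return total / (n * n)


# Toy token features for two augmented views of the same image (hypothetical values).
view_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]]

loss_same = all_pairs_consistency(attention_matrix(view_a), attention_matrix(view_a))
loss_diff = all_pairs_consistency(attention_matrix(view_a), attention_matrix(view_b))
```

Identical views give a loss of zero, and any change in the pairwise affinities between regions gives a positive loss. That penalty on changed pairwise structure is what distinguishes an all-pairs term from one that only compares per-region activation intensities.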

Cloth2Tex: A Customized Cloth Texture Generation Pipeline for 3D Virtual Try-On

paper_url: http://arxiv.org/abs/2308.04288
repo_url: None
paper_authors: Daiheng Gao, Xu Chen, Xindi Zhang, Qi Wang, Ke Sun, Bang Zhang, Liefeng Bo, Qixing Huang
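for: Providing a pipeline that automatically generates high-quality 3D garment textures for applications such as 3D virtual try-on, digitalization of 2D clothes into 3D apparel, and cloth animation.
methods: Cloth2Tex is a self-supervised method that generates high-quality texture maps from 2D reference images; combined with a latent diffusion model, it also supports high-fidelity texture inpainting.
results: Qualitative and quantitative evaluations show that Cloth2Tex generates high-quality texture maps and achieves better visual effects than other methods.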

Abstract
Fabricating and designing 3D garments has become extremely demanding with the increasing need for synthesizing realistic dressed persons for a variety of applications, e.g. 3D virtual try-on, digitalization of 2D clothes into 3D apparel, and cloth animation. It thus necessitates a simple and straightforward pipeline to obtain high-quality texture from simple input, such as 2D reference images. Traditional warping-based texture generation methods require a significant number of control points to be manually selected for each type of garment, which can be a time-consuming and tedious process. We propose a novel method, called Cloth2Tex, which eliminates the human burden in this process. Cloth2Tex is a self-supervised method that generates texture maps with reasonable layout and structural consistency. Another key feature of Cloth2Tex is that it can be used to support high-fidelity texture inpainting. This is done by combining Cloth2Tex with a prevailing latent diffusion model. We evaluate our approach both qualitatively and quantitatively and demonstrate that Cloth2Tex can generate high-quality texture maps and achieve the best visual effects in comparison to other methods. Project page: tomguluson92.github.io/projects/cloth2tex/

paper_url: http://arxiv.org/abs/2308.04283
repo_url: https://github.com/muhayyuddin/visual-servoing
paper_authors: Muhayyuddin Ahmed, Ahsan Baidar Bakht, Taimur Hassan, Waseem Akram, Ahmed Humais, Lakmal Seneviratne, Shaoming He, Defu Lin, Irfan Hussain
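for: Improving vision-based navigation for unmanned surface vessels (USV), particularly for autonomous inspection and tracking tasks.
methods: Proposes an autonomous vision-based navigation framework built on a generative adversarial network (GAN) for tracking target objects in extreme marine conditions. An integrated perception pipeline uses the GAN to remove noise and highlight object features, then passes the result to a YOLOv5 object detector.
results: Compared with state-of-the-art de-hazing methods on the MBZIRC simulation dataset, the proposed scheme performs better across various metrics.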

Abstract
Visual perception is an important component for autonomous navigation of unmanned surface vessels (USV), particularly for the tasks related to autonomous inspection and tracking. These tasks involve vision-based navigation techniques to identify the target for navigation. Reduced visibility under extreme weather conditions in marine environments makes it difficult for vision-based approaches to work properly. To overcome these issues, this paper presents an autonomous vision-based navigation framework for tracking target objects in extreme marine conditions. The proposed framework consists of an integrated perception pipeline that uses a generative adversarial network (GAN) to remove noise and highlight the object features before passing them to the object detector (i.e., YOLOv5). The detected visual features are then used by the USV to track the target. The proposed framework has been thoroughly tested in simulation under extremely reduced visibility due to sandstorms and fog. The results are compared with state-of-the-art de-hazing methods across the benchmarked MBZIRC simulation dataset, on which the proposed scheme has outperformed the existing methods across various metrics.

SDLFormer: A Sparse and Dense Locality-enhanced Transformer for Accelerated MR Image Reconstruction

paper_url: http://arxiv.org/abs/2308.04262
repo_url: https://github.com/rahul-gs-16/sdlformer
paper_authors: Rahul G. S., Sriprabha Ramnarayanan, Mohammad Al Fahim, Keerthi Ram, Preejith S. P, Mohanasankar Sivaprakasam
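for: Proposing a window-based transformer method for accelerated MRI image reconstruction, to improve the efficiency and quality of reconstruction.
methods: Uses a window-based transformer network that integrates a dilated attention mechanism and convolution operations to strengthen non-local pixel relationships and to learn low-level translation-invariant features.
results: Experiments on multi-coil MRI acceleration show higher PSNR and SSIM than other reconstruction architectures. Code is available at https://github.com/rahul-gs-16/sdlformer.git.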

Abstract
Transformers have emerged as viable alternatives to convolutional neural networks owing to their ability to learn non-local region relationships in the spatial domain. The self-attention mechanism of the transformer enables transformers to capture long-range dependencies in the images, which might be desirable for accelerated MRI image reconstruction as the effect of undersampling is non-local in the image domain. Despite its computational efficiency, the window-based transformers suffer from restricted receptive fields as the dependencies are limited to within the scope of the image windows. We propose a window-based transformer network that integrates dilated attention mechanism and convolution for accelerated MRI image reconstruction. The proposed network consists of dilated and dense neighborhood attention transformers to enhance the distant neighborhood pixel relationship and introduce depth-wise convolutions within the transformer module to learn low-level translation invariant features for accelerated MRI image reconstruction. The proposed model is trained in a self-supervised manner. We perform extensive experiments for multi-coil MRI acceleration for coronal PD, coronal PDFS and axial T2 contrasts with 4x and 5x under-sampling in self-supervised learning based on k-space splitting. We compare our method against other reconstruction architectures and the parallel domain self-supervised learning baseline. Results show that the proposed model exhibits improvement margins of (i) around 1.40 dB in PSNR and around 0.028 in SSIM on average over other architectures (ii) around 1.44 dB in PSNR and around 0.029 in SSIM over parallel domain self-supervised learning. The code is available at https://github.com/rahul-gs-16/sdlformer.git
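
To put the reported PSNR margins in context, the sketch below computes PSNR from its standard definition, 10 * log10(MAX^2 / MSE), in plain Python. It is not taken from the SDLFormer repository. The tiny 2x2 "images", the peak value of 1.0, and the function name are assumptions chosen for illustration.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    flat_ref = [v for row in reference for v in row]
    flat_rec = [v for row in reconstruction for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_rec)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical intensities in [0, 1].
reference = [[0.0, 0.5], [0.5, 1.0]]
baseline = [[0.1, 0.4], [0.6, 0.9]]      # every pixel off by 0.1
improved = [[0.05, 0.45], [0.55, 0.95]]  # every pixel off by 0.05

baseline_db = psnr(reference, baseline)
gain_db = psnr(reference, improved) - baseline_db
```

Halving the per-pixel error raises PSNR by about 6.02 dB in this toy case. By the same formula, a gain of around 1.40 dB corresponds to a root-mean-square error scaled by 10^(-1.40/20), roughly 0.85, at a fixed peak value.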

Blur aware metric depth estimation with multi-focus plenoptic cameras

paper_url: http://arxiv.org/abs/2308.04252
repo_url: https://github.com/comsee-research/blade
paper_authors: Mathieu Labussière, Céline Teulière, Omar Ait-Aider
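for: Proposing a metric depth estimation algorithm that uses only raw images from a multi-focus plenoptic camera, improving disparity estimation by combining correspondence and defocus cues.
methods: The method exploits the different focal lengths of the multi-focus configuration and uses blur information alongside correspondence cues. Specifically, blur information is computed for the image and then used to improve matching.
results: Experiments show that introducing defocus cues improves the accuracy and precision of depth estimation. The method is validated on real-world 3D complex scenes and compared against ground truth acquired with a 3D lidar scanner.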

Abstract
While a traditional camera only captures one point of view of a scene, a plenoptic or light-field camera is able to capture spatial and angular information in a single snapshot, enabling depth estimation from a single acquisition. In this paper, we present a new metric depth estimation algorithm using only raw images from a multi-focus plenoptic camera. The proposed approach is especially suited for the multi-focus configuration where several micro-lenses with different focal lengths are used. The main goal of our blur aware depth estimation (BLADE) approach is to improve disparity estimation for defocus stereo images by integrating both correspondence and defocus cues. We thus leverage blur information where it was previously considered a drawback. We explicitly derive an inverse projection model including the defocus blur providing depth estimates up to a scale factor. A method to calibrate the inverse model is then proposed. We thus take into account depth scaling to achieve precise and accurate metric depth estimates. Our results show that introducing defocus cues improves the depth estimation. We demonstrate the effectiveness of our framework and depth scaling calibration on relative depth estimation setups and on real-world 3D complex scenes with ground truth acquired with a 3D lidar scanner.

AICSD: Adaptive Inter-Class Similarity Distillation for Semantic Segmentation

paper_url: http://arxiv.org/abs/2308.04243
repo_url: https://github.com/amirmansurian/aicsd
paper_authors: Amir M. Mansourian, Rozhan Ahmadi, Shohreh Kasaei
for: Improving the accuracy of lightweight student networks for semantic segmentation tasks using knowledge distillation.
methods: The proposed method, called Inter-Class Similarity Distillation (ICSD), transfers high-order relations from the teacher network to the student network by computing intra-class distributions and inter-class similarity matrices using KL divergence. An Adaptive Loss Weighting (ALW) training strategy is also proposed to gradually reduce the influence of the teacher network towards the end of training.
results: The proposed method outperforms most existing knowledge distillation methods in terms of mIoU and pixel accuracy on two well-known datasets for semantic segmentation, Cityscapes and Pascal VOC 2012.

Abstract
In recent years, deep neural networks have achieved remarkable accuracy in computer vision tasks. With inference time being a crucial factor, particularly in dense prediction tasks such as semantic segmentation, knowledge distillation has emerged as a successful technique for improving the accuracy of lightweight student networks. The existing methods often neglect the information in channels and among different classes. To overcome these limitations, this paper proposes a novel method called Inter-Class Similarity Distillation (ICSD) for the purpose of knowledge distillation. The proposed method transfers high-order relations from the teacher network to the student network by independently computing intra-class distributions for each class from network outputs. This is followed by calculating inter-class similarity matrices for distillation using KL divergence between distributions of each pair of classes. To further improve the effectiveness of the proposed method, an Adaptive Loss Weighting (ALW) training strategy is proposed. Unlike existing methods, the ALW strategy gradually reduces the influence of the teacher network towards the end of training process to account for errors in teacher's predictions. Extensive experiments conducted on two well-known datasets for semantic segmentation, Cityscapes and Pascal VOC 2012, validate the effectiveness of the proposed method in terms of mIoU and pixel accuracy. The proposed method outperforms most of existing knowledge distillation methods as demonstrated by both quantitative and qualitative evaluations. Code is available at: https://github.com/AmirMansurian/AICSD
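
The two steps described in this abstract, building an inter-class similarity matrix from per-class distributions with KL divergence and then down-weighting the teacher over time, can be sketched as follows. This is a hedged illustration, not the code from the AICSD repository. It assumes the intra-class distribution is a softmax over the spatial positions of each class channel, that the student and teacher matrices are compared with a mean squared error, and that the teacher weight decays linearly; none of these choices is confirmed by the abstract. All names and toy logits are hypothetical.

```python
import math


def spatial_softmax(channel):
    """Turn one class channel (flat list of per-pixel logits) into a distribution."""
    m = max(channel)
    exps = [math.exp(v - m) for v in channel]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def inter_class_matrix(logits):
    """C x C matrix of KL divergences between every pair of class distributions."""
    dists = [spatial_softmax(c) for c in logits]
    return [[kl_divergence(p, q) for q in dists] for p in dists]


def icsd_loss(student_logits, teacher_logits):
    """Mean squared difference between student and teacher matrices (assumed loss)."""
    s = inter_class_matrix(student_logits)
    t = inter_class_matrix(teacher_logits)
    c = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(c) for j in range(c)) / (c * c)


def teacher_weight(epoch, total_epochs):
    """Hypothetical linear schedule: full teacher influence at the start, none at the end."""
    return 1.0 - epoch / total_epochs


# Two classes, four pixels each (hypothetical logits).
teacher = [[2.0, 0.0, 0.0, 1.0], [0.0, 2.0, 1.0, 0.0]]
student = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]

loss = icsd_loss(student, teacher)
weighted = [teacher_weight(e, 10) * loss for e in (0, 5, 10)]
```

The diagonal of each matrix is zero because every class distribution is compared with itself. The weighted loss shrinks to zero by the final epoch, which mirrors the stated aim of reducing the teacher's influence late in training to account for errors in its predictions.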

A Comparative Study of Image-to-Image Translation Using GANs for Synthetic Child Race Data

paper_url: http://arxiv.org/abs/2308.04232
repo_url: None
paper_authors: Wang Yao, Muhammad Ali Farooq, Joseph Lemley, Peter Corcoran
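for: Improving the ethnic diversity of data for face recognition techniques.
methods: Uses image-to-image translation to adjust the ethnicity of children's face data.
results: Experiments show that image-to-image translation methods can generate synthetic child face samples of various ethnicities, broadening the ethnic diversity available to face recognition techniques.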

Abstract
The lack of ethnic diversity in data has been a limiting factor of face recognition techniques in the literature. This is particularly the case for children where data samples are scarce and presents a challenge when seeking to adapt machine vision algorithms that are trained on adult data to work on children. This work proposes the utilization of image-to-image transformation to synthesize data of different races and thus adjust the ethnicity of children's face data. We consider ethnicity as a style and compare three different Image-to-Image neural network based methods, specifically pix2pix, CycleGAN, and CUT networks to implement Caucasian child data and Asian child data conversion. Experimental validation results on synthetic data demonstrate the feasibility of using image-to-image transformation methods to generate various synthetic child data samples with broader ethnic diversity.

Will your Doorbell Camera still recognize you as you grow old

paper_url: http://arxiv.org/abs/2308.04224
repo_url: None
paper_authors: Wang Yao, Muhammad Ali Farooq, Joseph Lemley, Peter Corcoran
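for: Studying robust authentication on low-power consumer devices such as doorbell cameras, in particular the effect of age and aging.
methods: Uses two public age datasets, AgeDB and Morph-II, as baselines, and applies a photo-realistic age transformation method to augment a set of high-quality facial images with various age effects.
results: Experimental results show that long-term age effects remain a significant challenge for state-of-the-art facial authentication methods.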

Abstract
Robust authentication for low-power consumer devices such as doorbell cameras poses a valuable and unique challenge. This work explores the effect of age and aging on the performance of facial authentication methods. Two public age datasets, AgeDB and Morph-II have been used as baselines in this work. A photo-realistic age transformation method has been employed to augment a set of high-quality facial images with various age effects. Then the effect of these synthetic aging data on the high-performance deep-learning-based face recognition model is quantified by using various metrics including Receiver Operating Characteristic (ROC) curves and match score distributions. Experimental results demonstrate that long-term age effects are still a significant challenge for the state-of-the-art facial authentication method.
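
The evaluation described here relies on ROC curves and match score distributions. As a rough sketch of how ROC points are derived from match scores, the Python below sweeps a decision threshold over genuine and impostor similarity scores. The scores are invented for illustration and do not come from the paper, and the function name is hypothetical.

```python
def roc_points(genuine, impostor, thresholds):
    """Return (threshold, false positive rate, true positive rate) for each threshold."""
    points = []
    for t in thresholds:
        # A pair is accepted when its similarity score reaches the threshold.
        tpr = sum(s >= t for s in genuine) / len(genuine)
        fpr = sum(s >= t for s in impostor) / len(impostor)
        points.append((t, fpr, tpr))
    return points


# Hypothetical similarity scores between an enrolled face and a probe image.
genuine_same_age = [0.92, 0.88, 0.85, 0.80]  # same person, similar age
genuine_aged = [0.75, 0.62, 0.55, 0.48]      # same person, synthetically aged probe
impostor = [0.40, 0.35, 0.52, 0.30]          # different people

thresholds = [0.5, 0.6, 0.7]
same_age_curve = roc_points(genuine_same_age, impostor, thresholds)
aged_curve = roc_points(genuine_aged, impostor, thresholds)
```

In this toy data, a threshold of 0.6 accepts every same-age genuine pair with no false accepts, but only half of the aged genuine pairs. A drop in true positive rate at a fixed false positive rate is the kind of effect an ROC comparison would expose when genuine scores shift downward with age.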

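Summary
Robust authentication on low-power consumer devices poses a unique and valuable challenge. This work studies how age affects facial authentication methods. The public age datasets AgeDB and Morph-II serve as baselines, and a photo-realistic age transformation method augments a set of high-quality facial images with various age effects. The effect of these synthetically aged images on a high-performance deep-learning-based face recognition model is then quantified. Experimental results show that long-term age effects remain a major challenge for state-of-the-art facial authentication methods.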

AquaSAM: Underwater Image Foreground Segmentation

paper_url: http://arxiv.org/abs/2308.04218
repo_url: None
paper_authors: Muduo Xu, Jianhao Su, Yutao Liu
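for: Extends the success of the Segment Anything Model (SAM) in natural image segmentation to underwater image segmentation.
methods: Automatically classifies and extracts labels from the SUIM dataset, then adapts SAM to general underwater foreground segmentation with a straightforward fine-tuning method.
results: Across eight segmentation tasks (e.g., human divers), AquaSAM outperforms the default SAM, especially on hard tasks such as coral reefs. It improves the average Dice Similarity Coefficient (DSC) by 7.13% and the average mIoU by 8.27%.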

Abstract
The Segment Anything Model (SAM) has revolutionized natural image segmentation; nevertheless, its performance on underwater images is still restricted. This work presents AquaSAM, the first attempt to extend the success of SAM to underwater images, with the purpose of creating a versatile method for the segmentation of various underwater targets. To achieve this, we begin by classifying and extracting various labels automatically from the SUIM dataset. Subsequently, we develop a straightforward fine-tuning method to adapt SAM to general foreground underwater image segmentation. Through extensive experiments involving eight segmentation tasks, such as human divers, we demonstrate that AquaSAM outperforms the default SAM model, especially on hard tasks like coral reefs. AquaSAM achieves an average improvement of 7.13% in Dice Similarity Coefficient (DSC) and 8.27% in mIoU on underwater segmentation tasks.
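
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how DSC and IoU are commonly computed for a pair of binary masks. The function name `dice_and_iou` and the toy masks are illustrative assumptions, not code or data from the paper; mIoU is then the mean of per-class (or per-task) IoU values.

```python
def dice_and_iou(pred, target):
    """Compute Dice (DSC) and IoU for two binary masks (nested lists of 0/1)."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    intersection = sum(a * b for a, b in zip(p, t))
    p_sum, t_sum = sum(p), sum(t)
    union = p_sum + t_sum - intersection
    # Assumed convention: two empty masks count as a perfect match.
    dice = 2 * intersection / (p_sum + t_sum) if (p_sum + t_sum) else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou


# Toy 2x3 masks for illustration only.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
print(f"DSC={dice:.3f} IoU={iou:.3f}")
```

Dice is never lower than IoU for the same pair of masks, so improvements reported in the two metrics are not directly comparable in magnitude.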

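Summary
The Segment Anything Model (SAM) has revolutionized natural image segmentation, but its performance on underwater images remains limited. This work extends SAM to underwater images to create a versatile method for segmenting various underwater targets. Labels in the SUIM dataset are first classified and extracted automatically; a simple fine-tuning method then adapts SAM to general underwater foreground segmentation. Extensive experiments on eight segmentation tasks, such as human divers, show that AquaSAM outperforms the default SAM, especially on hard tasks such as coral reefs. AquaSAM improves the average Dice Similarity Coefficient (DSC) by 7.13% and the average mIoU by 8.27% on underwater segmentation tasks.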

Robust retrieval of material chemical states in X-ray microspectroscopy

paper_url: http://arxiv.org/abs/2308.04207
repo_url: None
paper_authors: Ting Wang, Xiaotong Wu, Jizhou Li, Chao Wang
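for: Studying morphological and chemical changes in materials with X-ray microspectroscopy, which provides high-resolution structural and spectroscopic information.
methods: Proposes a novel data formulation model and a dedicated unmixing framework that retrieves the chemical states of materials efficiently and reliably and is not limited to two-state material chemistry.
results: Experiments on simulated and real datasets demonstrate the effectiveness and reliability of the method, which identifies chemical states accurately even under low signal-to-noise ratios and overlapping spectral features.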

Abstract
X-ray microspectroscopic techniques are essential for studying morphological and chemical changes in materials, providing high-resolution structural and spectroscopic information. However, the practical data analysis needed to reliably retrieve chemical states remains a major obstacle to accelerating the fundamental understanding of materials in many research fields. In this work, we propose a novel data formulation model for X-ray microspectroscopy and develop a dedicated unmixing framework to solve this problem, which is robust to noise and spectral variability. Moreover, this framework is not limited to the analysis of two-state material chemistry, making it an effective alternative to conventional and widely-used methods. In addition, an alternating direction method of multipliers (ADMM) with provable convergence is applied to obtain the solution efficiently. Our framework can accurately identify and characterize chemical states in complex and heterogeneous samples, even under challenging conditions such as low signal-to-noise ratios and overlapping spectral features. Extensive experimental results on simulated and real datasets demonstrate its effectiveness and reliability.

Exploring Transformers for Open-world Instance Segmentation

paper_url: http://arxiv.org/abs/2308.04206
repo_url: None
paper_authors: Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo
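for: Proposes a Transformer-based method for open-world instance segmentation, a setting in which existing DETR-like instance segmentation models remain unexplored.
methods: Uses a Transformer with two new techniques. First, a stop-gradient operation is attached before the classification head to prevent novel objects from being suppressed as background, and IoU heads are added to discover novel objects. Second, a novel contrastive learning framework maintains a universal object queue to obtain the object center and dynamically selects positive and negative samples from the object queries.
results: Achieves state-of-the-art performance in various open-world cross-category and cross-dataset generalizations. In the VOC to non-VOC setup, the method reaches 40.0% on ARb100 and 34.9% on ARm100. For COCO to UVO generalization, SWORD outperforms the previous best open-world model by 5.9% on APm and 8.1% on ARm100.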

Abstract
Open-world instance segmentation is a rising task, which aims to segment all objects in the image by learning from a limited number of base-category objects. This task is challenging, as the number of unseen categories could be hundreds of times larger than that of seen categories. Recently, DETR-like models have been extensively studied in the closed world but remain unexplored in the open world. In this paper, we utilize the Transformer for open-world instance segmentation and present SWORD. Firstly, we attach a stop-gradient operation before the classification head and further add IoU heads for discovering novel objects. We demonstrate that a simple stop-gradient operation not only prevents the novel objects from being suppressed as background, but also allows the network to enjoy the merit of heuristic label assignment. Secondly, we propose a novel contrastive learning framework to enlarge the distance between the representations of objects and background. Specifically, we maintain a universal object queue to obtain the object center, and dynamically select positive and negative samples from the object queries for contrastive learning. While previous works only focus on pursuing average recall and neglect average precision, we show the prominence of SWORD by giving consideration to both criteria. Our models achieve state-of-the-art performance in various open-world cross-category and cross-dataset generalizations. Particularly, in the VOC to non-VOC setup, our method sets new state-of-the-art results of 40.0% on ARb100 and 34.9% on ARm100. For COCO to UVO generalization, SWORD significantly outperforms the previous best open-world model by 5.9% on APm and 8.1% on ARm100.

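Summary
Open-world instance segmentation is a rising task that aims to segment all objects in an image by learning from a limited number of base-category objects. The task is hard because unseen categories may outnumber seen categories by a factor of hundreds. DETR-like models have been studied extensively in the closed world but not in the open world. This paper applies the Transformer to open-world instance segmentation and presents SWORD. First, a stop-gradient operation is attached before the classification head and IoU heads are added to discover novel objects. The simple stop-gradient operation both prevents novel objects from being suppressed as background and lets the network benefit from heuristic label assignment. Second, a novel contrastive learning framework enlarges the distance between object and background representations: a universal object queue provides the object center, and positive and negative samples are selected dynamically from the object queries. Whereas previous work pursued only average recall and neglected average precision, SWORD considers both. It achieves state-of-the-art performance in various open-world cross-category and cross-dataset generalizations. In the VOC to non-VOC setup, it sets new state-of-the-art results of 40.0% on ARb100 and 34.9% on ARm100. For COCO to UVO generalization, SWORD outperforms the previous best open-world model by 5.9% on APm and 8.1% on ARm100.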

D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation

paper_url: http://arxiv.org/abs/2308.04197
repo_url: None
paper_authors: Hanjun Li, Xiujun Shu, Sunan He, Ruizhi Qiao, Wei Wen, Taian Guo, Bei Gan, Xing Sun
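for: Reduces annotation cost for the temporal sentence grounding (TSG) task while keeping performance competitive with fully supervised methods.
methods: Proposes a Dynamic Gaussian prior based Grounding framework with Glance annotation (D3G), consisting of a Semantic Alignment Group Contrastive Learning module (SA-GCL) and a Dynamic Gaussian prior Adjustment module (DGA).
results: Extensive experiments on three challenging benchmarks verify the effectiveness of D3G. It outperforms state-of-the-art weakly supervised methods by a large margin and narrows the performance gap to fully supervised methods.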

Abstract
Temporal sentence grounding (TSG) aims to locate a specific moment from an untrimmed video with a given natural language query. Weakly supervised methods still have a large performance gap compared to fully supervised ones, while the latter require laborious timestamp annotations. In this study, we aim to reduce the annotation cost yet keep competitive performance on the TSG task compared to fully supervised methods. To achieve this goal, we investigate a recently proposed glance-supervised temporal sentence grounding task, which requires only single frame annotation (referred to as glance annotation) for each query. Under this setup, we propose a Dynamic Gaussian prior based Grounding framework with Glance annotation (D3G), which consists of a Semantic Alignment Group Contrastive Learning module (SA-GCL) and a Dynamic Gaussian prior Adjustment module (DGA). Specifically, SA-GCL samples reliable positive moments from a 2D temporal map via jointly leveraging Gaussian prior and semantic consistency, which contributes to aligning the positive sentence-moment pairs in the joint embedding space. Moreover, to alleviate the annotation bias resulting from glance annotation and model complex queries consisting of multiple events, we propose the DGA module, which adjusts the distribution dynamically to approximate the ground truth of target moments. Extensive experiments on three challenging benchmarks verify the effectiveness of the proposed D3G. It outperforms the state-of-the-art weakly supervised methods by a large margin and narrows the performance gap compared to fully supervised methods. Code is available at https://github.com/solicucu/D3G.
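
The idea of a Gaussian prior around a glance annotation can be sketched as follows. This is only an illustration of the general concept, assuming a video split into equal clips and a fixed standard deviation; the function names, the mean-weight scoring rule, and the parameter values are hypothetical and do not reproduce the D3G implementation (which, per the abstract, also adjusts the distribution dynamically).

```python
import math


def gaussian_prior(num_clips, glance_index, sigma):
    """Weight each clip by a Gaussian centered on the glance-annotated clip."""
    return [math.exp(-((i - glance_index) ** 2) / (2 * sigma ** 2))
            for i in range(num_clips)]


def moment_prior(weights, start, end):
    """Score a candidate moment [start, end] by the mean prior weight of its clips."""
    span = weights[start:end + 1]
    return sum(span) / len(span)


# Hypothetical setup: 8 clips, glance annotation on clip 3.
weights = gaussian_prior(num_clips=8, glance_index=3, sigma=1.5)

# 2D temporal map: entry (s, e) scores the candidate moment from clip s to clip e.
temporal_map = {(s, e): moment_prior(weights, s, e)
                for s in range(8) for e in range(s, 8)}
best = max(temporal_map, key=temporal_map.get)
print("highest-prior candidate:", best)
```

Candidates that cover the glance frame receive higher prior scores than distant ones, which is what makes them plausible positive samples before any semantic consistency check.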

Summary
Temporal sentence grounding (TSG) aims to locate a specific moment in an untrimmed video given a natural language query. Weakly supervised methods still show a large performance gap to fully supervised ones, while the latter require laborious timestamp annotations. This study aims to reduce annotation cost while keeping performance competitive with fully supervised methods. It investigates the recently proposed glance-supervised TSG task, which requires only a single-frame annotation (a glance annotation) per query. Under this setup, it proposes D3G, a Dynamic Gaussian prior based Grounding framework with Glance annotation, consisting of a Semantic Alignment Group Contrastive Learning module (SA-GCL) and a Dynamic Gaussian prior Adjustment module (DGA). SA-GCL samples reliable positive moments from a 2D temporal map by jointly leveraging the Gaussian prior and semantic consistency, which helps align positive sentence-moment pairs in the joint embedding space. DGA alleviates the annotation bias of glance annotation and models complex multi-event queries by dynamically adjusting the distribution to approximate the ground truth of target moments. Extensive experiments on three challenging benchmarks verify the effectiveness of D3G: it outperforms state-of-the-art weakly supervised methods by a large margin and narrows the gap to fully supervised methods. Code is available at https://github.com/solicucu/D3G.

Image Copy-Move Forgery Detection via Deep Cross-Scale PatchMatch

paper_url: http://arxiv.org/abs/2308.04188
repo_url: None
paper_authors: Yingjie He, Yuanman Li, Changsheng Chen, Xia Li
for: Improves the detection accuracy and generalizability of image copy-move forgery detection (CMFD).
methods: Proposes a novel end-to-end CMFD framework that combines conventional and deep learning methods. Specifically, it designs a deep cross-scale patchmatch method tailored for CMFD to localize copy-move regions, and develops a manipulation region location branch for source/target separation.
results: The method shows high generalizability across different copy-move contents and significantly better performance than existing approaches.

Abstract
The recently developed deep algorithms achieve promising progress in the field of image copy-move forgery detection (CMFD). However, they have limited generalizability in some practical scenarios, where the copy-move objects may not appear in the training images or cloned regions are from the background. To address the above issues, in this work, we propose a novel end-to-end CMFD framework by integrating merits from both conventional and deep methods. Specifically, we design a deep cross-scale patchmatch method tailored for CMFD to localize copy-move regions. In contrast to existing deep models, our scheme aims to seek explicit and reliable point-to-point matching between source and target regions using features extracted from high-resolution scales. Further, we develop a manipulation region location branch for source/target separation. The proposed CMFD framework is completely differentiable and can be trained in an end-to-end manner. Extensive experimental results demonstrate the high generalizability of our method to different copy-move contents, and the proposed scheme achieves significantly better performance than existing approaches.

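Summary
Recently developed deep algorithms have made promising progress in image copy-move forgery detection (CMFD). However, they generalize poorly in some practical scenarios, for example when the copy-move objects do not appear in the training images or the cloned regions come from the background. To address these issues, this work proposes a novel end-to-end CMFD framework that combines the merits of conventional and deep methods. It designs a deep cross-scale patchmatch method tailored for CMFD to localize copy-move regions. Unlike existing deep models, the scheme seeks explicit and reliable point-to-point matching between source and target regions using features extracted from high-resolution scales. A manipulation region location branch is also developed for source/target separation. The framework is fully differentiable and can be trained end to end. Extensive experiments show that the method generalizes well to different copy-move contents and performs significantly better than existing approaches.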

How Generalizable are Deepfake Detectors? An Empirical Study

paper_url: http://arxiv.org/abs/2308.04177
repo_url: https://github.com/boutiquelee/deepfakeempiricalstudy
paper_authors: Boquan Li, Jun Sun, Christopher M. Poskitt
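for: Examines the generalizability of deepfake detectors, an essential goal if detectors are to stay one step ahead of attackers.
methods: Uses six deepfake datasets, five deepfake detection methods, and two model augmentation approaches.
results: Detectors do not generalize in zero-shot settings. They learn unwanted properties specific to synthesis methods and struggle to extract discriminative features, which limits their ability to generalize. However, some neurons contribute universally to detection across seen and unseen datasets, suggesting a possible path toward zero-shot generalizability.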

Abstract
Deepfake videos and images are becoming increasingly credible, posing a significant threat given their potential to facilitate fraud or bypass access control systems. This has motivated the development of deepfake detection methods, in which deep learning models are trained to distinguish between real and synthesized footage. Unfortunately, existing detection models struggle to generalize to deepfakes from datasets they were not trained on, but little work has been done to examine why or how this limitation can be addressed. In this paper, we present the first empirical study on the generalizability of deepfake detectors, an essential goal for detectors to stay one step ahead of attackers. Our study utilizes six deepfake datasets, five deepfake detection methods, and two model augmentation approaches, confirming that detectors do not generalize in zero-shot settings. Additionally, we find that detectors are learning unwanted properties specific to synthesis methods and struggling to extract discriminative features, limiting their ability to generalize. Finally, we find that there are neurons universally contributing to detection across seen and unseen datasets, illuminating a possible path forward to zero-shot generalizability.

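Summary
Deepfake videos and images are becoming increasingly credible, posing a significant threat because they can facilitate fraud or bypass access control systems. This has motivated deepfake detection methods, in which deep learning models are trained to distinguish real from synthesized footage. Existing detection models struggle to generalize to datasets they were not trained on, yet little work has examined why, or how the limitation can be addressed. This paper presents the first empirical study on the generalizability of deepfake detectors, an essential goal if detectors are to stay one step ahead of attackers. The study uses six deepfake datasets, five deepfake detection methods, and two model augmentation approaches, and confirms that detectors do not generalize in zero-shot settings. It also finds that detectors learn unwanted properties specific to synthesis methods and struggle to extract discriminative features, which limits generalization. Finally, it finds neurons that contribute universally to detection across seen and unseen datasets, pointing to a possible path toward zero-shot generalizability.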

EFaR 2023: Efficient Face Recognition Competition

paper_url: http://arxiv.org/abs/2308.04168
repo_url: https://github.com/ahasanpour/EFaR-2023
paper_authors: Jan Niklas Kolf, Fadi Boutros, Jurek Elliesen, Markus Theuerkauf, Naser Damer, Mohamad Alansari, Oussama Abdul Hay, Sara Alansari, Sajid Javed, Naoufel Werghi, Klemen Grm, Vitomir Štruc, Fernando Alonso-Fernandez, Kevin Hernandez Diaz, Josef Bigun, Anjith George, Christophe Ecabert, Hatef Otroshi Shahreza, Ketan Kotwal, Sébastien Marcel, Iurii Medvedev, Bo Jin, Diogo Nunes, Ahmad Hassanpour, Pankaj Khatiwada, Aafan Ahmad Toor, Bian Yang
for: Introduces the Efficient Face Recognition Competition (EFaR) held at the 2023 International Joint Conference on Biometrics (IJCB 2023), which received 17 submissions from 6 teams.
methods: The submitted solutions use small, efficient network architectures to reduce computational cost, and some apply model quantization.
results: Evaluates the submitted solutions against a set of baselines on a diverse set of benchmarks, including bias, cross-quality, and large-scale recognition.

Abstract
This paper presents the summary of the Efficient Face Recognition Competition (EFaR) held at the 2023 International Joint Conference on Biometrics (IJCB 2023). The competition received 17 submissions from 6 different teams. To drive further development of efficient face recognition models, the submitted solutions are ranked based on a weighted score of the achieved verification accuracies on a diverse set of benchmarks, as well as the deployability given by the number of floating-point operations and model size. The evaluation of submissions is extended to bias, cross-quality, and large-scale recognition benchmarks. Overall, the paper gives an overview of the achieved performance values of the submitted solutions as well as a diverse set of baselines. The submitted solutions use small, efficient network architectures to reduce the computational cost; some solutions also apply model quantization. An outlook on possible techniques that are underrepresented in current solutions is given as well.

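Summary
This paper summarizes the Efficient Face Recognition Competition (EFaR) held at the 2023 International Joint Conference on Biometrics (IJCB 2023). The competition received 17 submissions from 6 teams. To drive further development of efficient face recognition models, submissions are ranked by a weighted score of their verification accuracies on multiple benchmarks, together with deployability measured by model size and number of floating-point operations. The evaluation also covers bias, cross-quality, and large-scale recognition benchmarks. Overall, the paper gives an overview of the performance achieved by the submitted solutions and by a diverse set of baselines, and an outlook on techniques that are underrepresented in current solutions.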

Under-Display Camera Image Restoration with Scattering Effect

paper_url: http://arxiv.org/abs/2308.04163
repo_url: https://github.com/namecantbenull/srudc
paper_authors: Binbin Song, Xiangyu Chen, Shuning Xu, Jiantao Zhou
for: Addresses the under-display camera (UDC) image restoration problem, with a specific focus on the scattering effect caused by the display.
methods: Uses a two-branch restoration network: a scattering branch that uses channel-wise self-attention to estimate the scattering effect parameters, and an image branch that leverages the local representation advantages of CNNs to recover clear scenes.
results: Demonstrates superior performance over state-of-the-art UDC restoration techniques through extensive experiments on both real-world and synthesized data.

Abstract
The under-display camera (UDC) provides consumers with a full-screen visual experience without any obstruction due to notches or punched holes. However, the semi-transparent nature of the display inevitably introduces severe degradation into UDC images. In this work, we address the UDC image restoration problem with specific consideration of the scattering effect caused by the display. We explicitly model the scattering effect by treating the display as a piece of homogeneous scattering medium. With the physical model of the scattering effect, we improve the image formation pipeline for image synthesis to construct a realistic UDC dataset with ground truths. To suppress the scattering effect for the eventual UDC image recovery, a two-branch restoration network is designed. More specifically, the scattering branch leverages the global modeling capabilities of channel-wise self-attention to estimate parameters of the scattering effect from degraded images, while the image branch exploits the local representation advantage of CNNs to recover clear scenes, implicitly guided by the scattering branch. Extensive experiments are conducted on both real-world and synthesized data, demonstrating the superiority of the proposed method over the state-of-the-art UDC restoration techniques. The source code and dataset are available at https://github.com/NamecantbeNULL/SRUDC.

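Summary
The under-display camera (UDC) gives users an unobstructed full-screen visual experience, but the semi-transparent display inevitably causes severe degradation in UDC images. This work addresses UDC image restoration with specific consideration of the scattering effect caused by the display, which is modeled explicitly with a physical model. The physical model is used to improve the image formation pipeline and build a realistic UDC dataset with ground truths. To suppress the scattering effect, a two-branch network is designed: a scattering branch uses channel-wise self-attention to estimate the scattering parameters, and an image branch uses the local representation advantage of CNNs to recover clear scenes. Extensive experiments on real-world and synthesized data demonstrate the superiority of the method for UDC image restoration. The source code and dataset are available at https://github.com/NamecantbeNULL/SRUDC.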

EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation

paper_url: http://arxiv.org/abs/2308.04162
repo_url: https://github.com/lab206/epcformer
paper_authors: Jiajun Chen, Jiacheng Lin, Zhiqiang Xiao, Haolong Fu, Ke Nai, Kailun Yang, Zhiyong Li
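for: Targets two highly related tasks, Audio-guided Video Object Segmentation (A-VOS) and Referring Video Object Segmentation (R-VOS), which both segment specific objects from video sequences according to user-provided expression prompts.
methods: Proposes two mechanisms. An Expression Alignment (EA) mechanism aligns audio and text expressions so that the model captures their semantic equivalence. An Expression-Visual Attention (EVA) mechanism enables deep interaction among audio, text, and video features.
results: Experiments show that the universal EPCFormer attains state-of-the-art results on both tasks and transfers knowledge well between them.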

Abstract
Audio-guided Video Object Segmentation (A-VOS) and Referring Video Object Segmentation (R-VOS) are two highly-related tasks, which both aim to segment specific objects from video sequences according to user-provided expression prompts. However, due to the challenges in modeling representations for different modalities, contemporary methods struggle to strike a balance between interaction flexibility and high-precision localization and segmentation. In this paper, we address this problem from two perspectives: the alignment representation of audio and text and the deep interaction among audio, text, and visual features. First, we propose a universal architecture, the Expression Prompt Collaboration Transformer, herein EPCFormer. Next, we propose an Expression Alignment (EA) mechanism for audio and text expressions. By introducing contrastive learning for audio and text expressions, the proposed EPCFormer realizes comprehension of the semantic equivalence between audio and text expressions denoting the same objects. Then, to facilitate deep interactions among audio, text, and video features, we introduce an Expression-Visual Attention (EVA) mechanism. The knowledge of video object segmentation in terms of the expression prompts can seamlessly transfer between the two tasks by deeply exploring complementary cues between text and audio. Experiments on well-recognized benchmarks demonstrate that our universal EPCFormer attains state-of-the-art results on both tasks. The source code of EPCFormer will be made publicly available at https://github.com/lab206/EPCFormer.
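
To make the contrastive alignment of audio and text expressions concrete, here is a minimal sketch of a standard InfoNCE-style loss over paired embeddings. This is a generic formulation chosen for illustration, not the loss defined in the paper; the function names, the temperature value, and the toy embeddings are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(audio, text, temperature=0.1):
    """Mean cross-entropy of matching each audio embedding to its paired text.

    audio[i] and text[i] are assumed to describe the same object; every other
    text embedding in the batch acts as a negative for audio[i].
    """
    loss = 0.0
    for i, a in enumerate(audio):
        logits = [cosine(a, t) / temperature for t in text]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_denominator - logits[i]
    return loss / len(audio)


# Toy 2-D embeddings for illustration only.
audio_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
print("aligned loss:", info_nce(audio_emb, aligned_text))
print("swapped loss:", info_nce(audio_emb, swapped_text))
```

The loss is small when each audio embedding is closest to its own text embedding and large when the pairing is wrong, which is the behavior a contrastive alignment objective relies on.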

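Summary
Audio-guided Video Object Segmentation (A-VOS) and Referring Video Object Segmentation (R-VOS) are two highly related tasks that both segment specific objects from video sequences according to user-provided expression prompts. Because modeling representations for different modalities is difficult, current methods struggle to balance interaction flexibility with high-precision localization and segmentation. This paper addresses the problem from two perspectives: the alignment of audio and text representations, and deep interaction among audio, text, and visual features. First, it proposes a universal architecture, the Expression Prompt Collaboration Transformer (EPCFormer). Next, it proposes an Expression Alignment (EA) mechanism for audio and text expressions; by introducing contrastive learning between them, EPCFormer learns the semantic equivalence of audio and text expressions that denote the same objects. Then, to facilitate deep interaction among audio, text, and video features, it introduces an Expression-Visual Attention (EVA) mechanism. By deeply exploring complementary cues between text and audio, knowledge of video object segmentation in terms of the expression prompts transfers between the two tasks. Experiments show that the universal EPCFormer attains state-of-the-art results on both tasks. The source code will be made publicly available at https://github.com/lab206/EPCFormer.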

Towards Top-Down Stereoscopic Image Quality Assessment via Stereo Attention

paper_url: http://arxiv.org/abs/2308.04156
repo_url: https://github.com/fanning-zhang/satnet
paper_authors: Huilin Zhang, Sumei Li, Yongli Chang
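for: Proposes a top-down network for stereoscopic image quality assessment (SIQA), to better evaluate and improve the visual experience of 3D content.
methods: Proposes a novel network that uses stereo attention to implement a top-down assessment process. Guidance flows from high-level binocular signals down to low-level monocular signals, and binocular and monocular information is calibrated progressively throughout the processing pipeline.
results: Experiments show that the method better simulates properties of the human visual system (HVS) and outperforms existing bottom-up methods.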

Abstract
Stereoscopic image quality assessment (SIQA) plays a crucial role in evaluating and improving the visual experience of 3D content. Existing binocular properties and attention-based methods for SIQA have achieved promising performance. However, these bottom-up approaches are inadequate in exploiting the inherent characteristics of the human visual system (HVS). This paper presents a novel network for SIQA via stereo attention, employing a top-down perspective to guide the quality assessment process. Our proposed method realizes the guidance from high-level binocular signals down to low-level monocular signals, while the binocular and monocular information can be calibrated progressively throughout the processing pipeline. We design a generalized Stereo AttenTion (SAT) block to implement the top-down philosophy in stereo perception. This block utilizes the fusion-generated attention map as a high-level binocular modulator, influencing the representation of two low-level monocular features. Additionally, we introduce an Energy Coefficient (EC) to account for recent findings indicating that binocular responses in the primate primary visual cortex are less than the sum of monocular responses. The adaptive EC can tune the magnitude of binocular response flexibly, thus enhancing the formation of robust binocular features within our framework. To extract the most discriminative quality information from the summation and subtraction of the two branches of monocular features, we utilize a dual-pooling strategy that applies min-pooling and max-pooling operations to the respective branches. Experimental results highlight the superiority of our top-down method in simulating the property of visual perception and advancing the state-of-the-art in the SIQA field. The code of this work is available at https://github.com/Fanning-Zhang/SATNet.

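Summary
Stereoscopic image quality assessment (SIQA) plays a crucial role in evaluating and improving the visual experience of 3D content. Existing methods based on binocular properties and attention have achieved promising performance, but these bottom-up approaches cannot fully exploit the inherent characteristics of the human visual system (HVS). This paper presents a novel network for SIQA that uses stereo attention to guide the assessment process from a top-down perspective. Guidance flows from high-level binocular signals down to low-level monocular signals, while binocular and monocular information is calibrated progressively throughout the processing pipeline. A generalized Stereo AttenTion (SAT) block implements this philosophy: it uses the fusion-generated attention map as a high-level binocular modulator that influences the representation of two low-level monocular features. An Energy Coefficient (EC) is introduced to account for findings that binocular responses in the primate primary visual cortex are less than the sum of monocular responses. The adaptive EC flexibly tunes the magnitude of the binocular response, which helps form robust binocular features. To extract the most discriminative quality information from the summation and subtraction of the two monocular feature branches, a dual-pooling strategy applies min-pooling and max-pooling to the respective branches. Experimental results show that the top-down method better simulates visual perception and advances the state of the art in SIQA. The code is available at https://github.com/Fanning-Zhang/SATNet.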

Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions

paper_url: http://arxiv.org/abs/2308.04152
repo_url: https://github.com/dcdmllm/cheetah
paper_authors: Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Hanwang Zhang, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Yueting Zhuang
for: Evaluates the instruction-following ability of multimodal large language models (MLLMs) on complicated interleaved vision-language instructions, and introduces a generic, lightweight controllable knowledge re-injection module to address a common defect of existing methods.
methods: Uses a controllable knowledge re-injection module that leverages the reasoning ability of LLMs to conditionally extract instruction-specific visual information and re-inject it into the LLM. The module is learned with an annotation-free, cross-attention guided counterfactual image training strategy that coordinates a cascade of foundation models.
results: Achieves state-of-the-art zero-shot performance across all tasks of I4 without high-quality multimodal instruction tuning data. Cheetor also performs competitively with state-of-the-art instruction-tuned models on the MME benchmark.

Abstract
Multimodal Large Language Models (MLLMs) have recently sparked significant interest, which demonstrates emergent capabilities to serve as a general-purpose model for various vision-language tasks. However, existing methods mainly focus on limited types of instructions with a single image as visual context, which hinders the widespread availability of MLLMs. In this paper, we introduce the I4 benchmark to comprehensively evaluate the instruction following ability on complicated interleaved vision-language instructions, which involve intricate image-text sequential context, covering a diverse range of scenarios (e.g., visually-rich webpages/textbooks, lecture slides, embodied dialogue). Systematic evaluation on our I4 benchmark reveals a common defect of existing methods: the Visual Prompt Generator (VPG) trained on image-captioning alignment objective tends to attend to common foreground information for captioning but struggles to extract specific information required by particular tasks. To address this issue, we propose a generic and lightweight controllable knowledge re-injection module, which utilizes the sophisticated reasoning ability of LLMs to control the VPG to conditionally extract instruction-specific visual information and re-inject it into the LLM. Further, we introduce an annotation-free cross-attention guided counterfactual image training strategy to methodically learn the proposed module by collaborating a cascade of foundation models. Enhanced by the proposed module and training strategy, we present Cheetor, a Transformer-based MLLM that can effectively handle a wide variety of interleaved vision-language instructions and achieves state-of-the-art zero-shot performance across all tasks of I4, without high-quality multimodal instruction tuning data. Cheetor also exhibits competitive performance compared with state-of-the-art instruction tuned models on MME benchmark.

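Summary
Multimodal Large Language Models (MLLMs) have recently attracted wide interest, showing emergent capabilities as general-purpose models for various vision-language tasks. However, existing methods mainly focus on limited types of instructions with a single image as visual context, which limits the wider applicability of MLLMs. This paper introduces the I4 benchmark to comprehensively evaluate instruction-following ability on complicated interleaved vision-language instructions. Systematic evaluation reveals a common defect of existing methods: a Visual Prompt Generator (VPG) trained on an image-captioning alignment objective tends to attend to common foreground information but struggles to extract the specific information that particular tasks require. To address this, the paper proposes a generic and lightweight controllable knowledge re-injection module, which uses the reasoning ability of LLMs to control the VPG so that it conditionally extracts instruction-specific visual information and re-injects it into the LLM. It also introduces an annotation-free, cross-attention guided counterfactual image training strategy to learn the module systematically. With the module and training strategy, the paper presents Cheetor, a Transformer-based MLLM that handles a wide variety of interleaved vision-language instructions and achieves state-of-the-art zero-shot performance across all tasks of I4 without high-quality multimodal instruction tuning data. Cheetor also performs competitively with state-of-the-art instruction-tuned models on the MME benchmark.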

Application for White Spot Syndrome Virus (WSSV) Monitoring using Edge Machine Learning

paper_url: http://arxiv.org/abs/2308.04151
repo_url: None
paper_authors: Lorenzo S. Querol, Macario O. Cordel II, Dan Jeric A. Rustia, Mary Nia M. Santos
for: This paper aims to improve disease surveillance in the aquaculture industry, specifically for the White Spot Syndrome Virus (WSSV), by developing a mobile application and training a WSSV recognition model using computer vision techniques.
methods: The authors developed a mobile application to collect and monitor data, and trained two models (MobileNetV3-Small and EfficientNetV2-B0) using an imbalanced dataset to improve WSSV recognition. They also analyzed the saliency heatmaps of both models to understand the features that are most important in making a prediction.
results: The models achieved an F1-Score of 0.72 and 0.99, respectively, and the saliency heatmaps revealed the features that are most important in the images for making a correct prediction. The results demonstrate the effectiveness of using computer vision techniques for WSSV recognition, but also highlight the limitations of using resource-constrained devices and the need for further improvement.

Abstract
The aquaculture industry, strongly reliant on shrimp exports, faces challenges due to viral infections like the White Spot Syndrome Virus (WSSV) that severely impact output yields. In this context, computer vision can play a significant role in identifying features not immediately evident to skilled or untrained eyes, potentially reducing the time required to report WSSV infections. In this study, the challenge of limited data for WSSV recognition was addressed. A mobile application dedicated to data collection and monitoring was developed to facilitate the creation of an image dataset to train a WSSV recognition model and improve country-wide disease surveillance. The study also includes a thorough analysis of WSSV recognition to address the challenge of imbalanced learning and on-device inference. The models explored, MobileNetV3-Small and EfficientNetV2-B0, gained an F1-Score of 0.72 and 0.99 respectively. The saliency heatmaps of both models were also observed to uncover the "black-box" nature of these models and to gain insight as to what features in the images are most important in making a prediction. These results highlight the effectiveness and limitations of using models designed for resource-constrained devices and balancing their performance in accurately recognizing WSSV, providing valuable information and direction in the use of computer vision in this domain.
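
Because the abstract reports F1-scores on an imbalanced dataset, a short sketch of the metric may help. The function below follows the standard definitions of precision, recall, and F1; the function name and the toy labels are illustrative assumptions and are unrelated to the paper's data.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy imbalanced labels: 2 infected samples (1) out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_healthy = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_healthy)) / len(y_true)
print("accuracy:", accuracy)
print("P, R, F1:", precision_recall_f1(y_true, always_healthy))
```

A classifier that always predicts the majority class scores high accuracy here but an F1 of zero, which is why F1 is a more informative summary than accuracy when infected samples are rare.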

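Summary
The aquaculture industry, which relies heavily on shrimp exports, faces challenges from viral infections such as the White Spot Syndrome Virus (WSSV), which severely reduce output yields. Computer vision can help identify features that are not immediately evident to skilled or untrained eyes, potentially reducing the time needed to report WSSV infections. This study addresses the challenge of limited data for WSSV recognition. A mobile application for data collection and monitoring was developed to build an image dataset for training a WSSV recognition model. The study also includes a thorough analysis of WSSV recognition that addresses imbalanced learning and on-device inference. The two models explored, MobileNetV3-Small and EfficientNetV2-B0, achieved F1-scores of 0.72 and 0.99 respectively. Saliency heatmaps of both models were examined to understand which image features matter most for a prediction. The results highlight the effectiveness and limitations of models designed for resource-constrained devices in accurately recognizing WSSV, and provide valuable information and direction for computer vision in this domain.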

Class-level Structural Relation Modelling and Smoothing for Visual Representation Learning

paper_url: http://arxiv.org/abs/2308.04142
repo_url: https://github.com/czt117/csrms
paper_authors: Zitan Chen, Zhuang Qi, Xiao Cao, Xiangxian Li, Xiangxu Meng, Lei Meng
for: This paper targets visual representation learning, particularly for classes with diverse visual patterns.
methods: The paper proposes a framework named CSRMS with three modules: Class-level Relation Modelling, Class-aware Graph Sampling, and Relational Graph-Guided Representation Learning. Together they model a relational graph of the entire dataset and apply class-aware smoothing and regularization to alleviate intra-class visual diversity and inter-class similarity.
results: Experiments show that CSRMS brings structured knowledge into image representation learning and improves model performance. It can also be combined with existing state-of-the-art representation learning models for further gains.

Abstract
Representation learning for images has been advanced by recent progress in more complex neural models such as the Vision Transformers and new learning theories such as the structural causal models. However, these models mainly rely on the classification loss to implicitly regularize the class-level data distributions, and they may face difficulties when handling classes with diverse visual patterns. We argue that the incorporation of the structural information between data samples may improve this situation. To achieve this goal, this paper presents a framework termed Class-level Structural Relation Modeling and Smoothing for Visual Representation Learning (CSRMS), which includes the Class-level Relation Modelling, Class-aware Graph Sampling, and Relational Graph-Guided Representation Learning modules to model a relational graph of the entire dataset and perform class-aware smoothing and regularization operations to alleviate the issue of intra-class visual diversity and inter-class similarity. Specifically, the Class-level Relation Modelling module uses a clustering algorithm to learn the data distributions in the feature space and identify three types of class-level sample relations for the training set; Class-aware Graph Sampling module extends typical training batch construction process with three strategies to sample dataset-level sub-graphs; and Relational Graph-Guided Representation Learning module employs a graph convolution network with knowledge-guided smoothing operations to ease the projection from different visual patterns to the same class. Experiments demonstrate the effectiveness of structured knowledge modelling for enhanced representation learning and show that CSRMS can be incorporated with any state-of-the-art visual representation learning models for performance gains. The source codes and demos have been released at https://github.com/czt117/CSRMS.

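Summary
Image representation learning has advanced through more complex neural models such as Vision Transformers and new learning theories such as structural causal models. These models rely mainly on the classification loss to implicitly regularize class-level data distributions and can struggle with classes that have diverse visual patterns. The paper argues that incorporating structural information between data samples can help, and presents Class-level Structural Relation Modeling and Smoothing for Visual Representation Learning (CSRMS), which comprises three modules. Class-level Relation Modelling uses a clustering algorithm to learn the data distributions in feature space and identify three types of class-level sample relations for the training set. Class-aware Graph Sampling extends the usual training batch construction with three strategies for sampling dataset-level sub-graphs. Relational Graph-Guided Representation Learning applies a graph convolution network with knowledge-guided smoothing to ease the projection from different visual patterns to the same class. Experiments show that structured knowledge modelling improves representation learning and that CSRMS can be combined with state-of-the-art visual representation learning models for performance gains. Source code and demos are available at https://github.com/czt117/CSRMS.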

Comprehensive Assessment of the Performance of Deep Learning Classifiers Reveals a Surprising Lack of Robustness

paper_url: http://arxiv.org/abs/2308.04137
repo_url: None
paper_authors: Michael W. Spratling
for: The paper aims to evaluate the robustness of machine learning models, specifically deep neural networks, and to develop a benchmark for comprehensive evaluation of performance.
methods: The paper proposes using a wide range of different types of data to benchmark performance and using a single metric to produce a consistent evaluation of performance.
results: The paper finds that current deep neural networks are extremely vulnerable to making mistakes on certain types of data, and that they are insecure and unreliable in real-world scenarios where they may encounter data from many different domains.

Abstract
Reliable and robust evaluation methods are a necessary first step towards developing machine learning models that are themselves robust and reliable. Unfortunately, current evaluation protocols typically used to assess classifiers fail to comprehensively evaluate performance as they tend to rely on limited types of test data, and ignore others. For example, using the standard test data fails to evaluate the predictions made by the classifier to samples from classes it was not trained on. On the other hand, testing with data containing samples from unknown classes fails to evaluate how well the classifier can predict the labels for known classes. This article advocates bench-marking performance using a wide range of different types of data and using a single metric that can be applied to all such data types to produce a consistent evaluation of performance. Using such a benchmark it is found that current deep neural networks, including those trained with methods that are believed to produce state-of-the-art robustness, are extremely vulnerable to making mistakes on certain types of data. This means that such models will be unreliable in real-world scenarios where they may encounter data from many different domains, and that they are insecure as they can easily be fooled into making the wrong decisions. It is hoped that these results will motivate the wider adoption of more comprehensive testing methods that will, in turn, lead to the development of more robust machine learning methods in the future. Code is available at: https://codeberg.org/mwspratling/RobustnessEvaluation

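Summary
Reliable and robust evaluation methods are a necessary first step towards developing machine learning models that are themselves robust and reliable. Current evaluation protocols rely on limited types of test data: standard test data does not evaluate predictions on samples from classes the classifier was not trained on, while test data containing unknown classes does not evaluate how well it predicts labels for known classes. The article advocates benchmarking with a wide range of data types and a single metric that applies to all of them. Under such a benchmark, current deep neural networks, including those trained with methods believed to produce state-of-the-art robustness, are extremely vulnerable to mistakes on certain types of data. Such models will be unreliable in real-world scenarios and insecure, since they can easily be fooled into making wrong decisions. The author hopes these results will motivate wider adoption of more comprehensive testing methods. Code is available at https://codeberg.org/mwspratling/RobustnessEvaluation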

OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation

paper_url: http://arxiv.org/abs/2308.04126
repo_url: None
paper_authors: Dongyang Yu, Shihao Wang, Yuan Fang, Wangpeng An
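for: This paper addresses multimodal data fusion and unlimited data generation, aiming to improve AI's understanding and generation of complex real-world data.
methods: The paper combines several operations: video/image caption extraction, dense caption extraction, Automatic Speech Recognition (ASR), Optical Character Recognition (OCR), the Recognize Anything Model (RAM), and object tracking.
results: The final output turns each video input into a detailed sequential document, effectively converting videos into thorough narratives that are easier for large language models to process.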

Abstract
This paper presents OmniDataComposer, an innovative approach for multimodal data fusion and unlimited data generation with an intent to refine and uncomplicate interplay among diverse data modalities. Coming to the core breakthrough, it introduces a cohesive data structure proficient in processing and merging multimodal data inputs, which include video, audio, and text. Our crafted algorithm leverages advancements across multiple operations such as video/image caption extraction, dense caption extraction, Automatic Speech Recognition (ASR), Optical Character Recognition (OCR), Recognize Anything Model (RAM), and object tracking. OmniDataComposer is capable of identifying over 6400 categories of objects, substantially broadening the spectrum of visual information. It amalgamates these diverse modalities, promoting reciprocal enhancement among modalities and facilitating cross-modal data correction. The final output metamorphoses each video input into an elaborate sequential document, virtually transmuting videos into thorough narratives, making them easier to be processed by large language models. Future prospects include optimizing datasets for each modality to encourage unlimited data generation. This robust base will offer priceless insights to models like ChatGPT, enabling them to create higher quality datasets for video captioning and easing question-answering tasks based on video content. OmniDataComposer inaugurates a new stage in multimodal learning, imparting enormous potential for augmenting AI's understanding and generation of complex, real-world data.
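The idea of turning per-modality outputs into one sequential document can be sketched as a timestamp-ordered merge. The abstract does not describe the actual data structure, so the `Event` record, its fields, and the output line format below are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    start: float   # seconds from the start of the video (hypothetical field)
    source: str    # which operation produced the text, e.g. "caption", "asr", "ocr"
    text: str


def compose_document(events: list[Event]) -> str:
    """Merge events from all modalities into one time-ordered text document."""
    ordered = sorted(events, key=lambda e: (e.start, e.source))
    return "\n".join(f"[{e.start:06.1f}s] {e.source}: {e.text}" for e in ordered)


events = [
    Event(3.0, "asr", "Welcome to the demo."),
    Event(0.0, "caption", "A person stands next to a whiteboard."),
    Event(3.0, "ocr", "Agenda"),
]
document = compose_document(events)
print(document)
```

A flat, time-sorted text like this is one simple way to hand video content to a language model; the cross-modal correction mentioned in the abstract is not modeled here.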

Summary
This paper introduces OmniDataComposer, an approach to multimodal data fusion and unlimited data generation. Its core contribution is a cohesive data structure that processes and merges multimodal inputs, including video, audio, and text. The algorithm draws on video/image caption extraction, dense caption extraction, Automatic Speech Recognition (ASR), Optical Character Recognition (OCR), the Recognize Anything Model (RAM), and object tracking. The final output transforms each video input into an elaborate sequential document, turning videos into thorough narratives that are easier for large language models to process. Future work includes optimizing datasets for each modality to encourage unlimited data generation, which would help models like ChatGPT create higher quality video captioning datasets and ease question-answering tasks based on video content.

Multimodal Color Recommendation in Vector Graphic Documents

paper_url: http://arxiv.org/abs/2308.04118
repo_url: None
paper_authors: Qianru Qiu, Xueting Wang, Mayu Otani
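for: This study provides text-aware color recommendation to help designers choose suitable colors.
methods: The model uses self-attention networks to capture relationships between colors in multiple palettes, and cross-attention networks to integrate color and text representations in one model.
results: Experiments show the method surpasses previous color palette completion methods on accuracy, color distribution, and user experience, and surpasses full palette generation methods on color diversity and similarity to the ground truth palettes.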

Abstract
Color selection plays a critical role in graphic document design and requires sufficient consideration of various contexts. However, recommending appropriate colors which harmonize with the other colors and textual contexts in documents is a challenging task, even for experienced designers. In this study, we propose a multimodal masked color model that integrates both color and textual contexts to provide text-aware color recommendation for graphic documents. Our proposed model comprises self-attention networks to capture the relationships between colors in multiple palettes, and cross-attention networks that incorporate both color and CLIP-based text representations. Our proposed method primarily focuses on color palette completion, which recommends colors based on the given colors and text. Additionally, it is applicable for another color recommendation task, full palette generation, which generates a complete color palette corresponding to the given text. Experimental results demonstrate that our proposed approach surpasses previous color palette completion methods on accuracy, color distribution, and user experience, as well as full palette generation methods concerning color diversity and similarity to the ground truth palettes.

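Summary
Color selection plays a critical role in graphic document design and requires consideration of various contexts. Recommending colors that harmonize with the other colors and the textual context of a document is challenging even for experienced designers. The study proposes a multimodal masked color model that integrates color and textual contexts to provide text-aware color recommendation. The model comprises self-attention networks that capture relationships between colors in multiple palettes and cross-attention networks that incorporate both color and CLIP-based text representations. The method focuses on color palette completion, which recommends colors based on the given colors and text, and also applies to full palette generation, which produces a complete palette for a given text. Experiments show it surpasses previous palette completion methods on accuracy, color distribution, and user experience, and full palette generation methods on color diversity and similarity to the ground truth palettes.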

From Unimodal to Multimodal: improving the sEMG-Based Pattern Recognition via deep generative models

paper_url: http://arxiv.org/abs/2308.04091
repo_url: None
paper_authors: Wentao Wei, Linyan Ren
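for: Improving hand gesture recognition accuracy.
methods: A deep generative model produces virtual IMU signals, which are fed together with the real sEMG signals into a multimodal Convolutional Neural Network (CNN) for gesture recognition.
results: Tested on 6 databases (5 public databases and a self-collected one in which 28 subjects performed 38 gestures, with both sEMG and IMU data), the approach outperforms the sEMG-based unimodal HGR method by 2.15%-13.10%, indicating that virtual IMU signals generated by deep generative models can clearly improve sEMG-based gesture recognition accuracy.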

Abstract
Multimodal hand gesture recognition (HGR) systems can achieve higher recognition accuracy. However, acquiring multimodal gesture recognition data typically requires users to wear additional sensors, thereby increasing hardware costs. This paper proposes a novel generative approach to improve Surface Electromyography (sEMG)-based HGR accuracy via virtual Inertial Measurement Unit (IMU) signals. Specifically, we trained a deep generative model based on the intrinsic correlation between forearm sEMG signals and forearm IMU signals to generate virtual forearm IMU signals from the input forearm sEMG signals at first. Subsequently, the sEMG signals and virtual IMU signals were fed into a multimodal Convolutional Neural Network (CNN) model for gesture recognition. To evaluate the performance of the proposed approach, we conducted experiments on 6 databases, including 5 publicly available databases and our collected database comprising 28 subjects performing 38 gestures, containing both sEMG and IMU data. The results show that our proposed approach outperforms the sEMG-based unimodal HGR method (with increases of 2.15%-13.10%). It demonstrates that incorporating virtual IMU signals, generated by deep generative models, can significantly enhance the accuracy of sEMG-based HGR. The proposed approach represents a successful attempt to transition from unimodal HGR to multimodal HGR without additional sensor hardware.

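Summary
Multimodal hand gesture recognition (HGR) systems can achieve higher recognition accuracy, but acquiring multimodal data usually requires users to wear additional sensors, which increases hardware costs. The paper proposes a generative approach that improves Surface Electromyography (sEMG)-based HGR accuracy through virtual Inertial Measurement Unit (IMU) signals. A deep generative model, trained on the intrinsic correlation between forearm sEMG and forearm IMU signals, first generates virtual forearm IMU signals from the input sEMG signals. The sEMG and virtual IMU signals are then fed into a multimodal Convolutional Neural Network (CNN) for gesture recognition. Experiments cover 6 databases: 5 publicly available ones and a collected database of 28 subjects performing 38 gestures, with both sEMG and IMU data. The approach outperforms the sEMG-based unimodal HGR method by 2.15%-13.10%, showing that virtual IMU signals can significantly enhance sEMG-based HGR and that a move from unimodal to multimodal HGR is possible without additional sensor hardware.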

3D Gaussian Splatting for Real-Time Radiance Field Rendering

paper_url: http://arxiv.org/abs/2308.04079
repo_url: https://github.com/graphdeco-inria/gaussian-splatting
paper_authors: Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis
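for: High-quality novel-view synthesis for unbounded, complete scenes at 1080p resolution.
methods: Represents the scene with 3D Gaussians and performs interleaved optimization/density control to obtain an accurate scene representation.
results: Achieves state-of-the-art visual quality and real-time rendering on several established datasets.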

Abstract
Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. However, achieving high visual quality still requires neural networks that are costly to train and render, while recent faster methods inevitably trade off speed for quality. For unbounded and complete scenes (rather than isolated objects) and 1080p resolution rendering, no current method can achieve real-time display rates. We introduce three key elements that allow us to achieve state-of-the-art visual quality while maintaining competitive training times and importantly allow high-quality real-time (>= 30 fps) novel-view synthesis at 1080p resolution. First, starting from sparse points produced during camera calibration, we represent the scene with 3D Gaussians that preserve desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space; Second, we perform interleaved optimization/density control of the 3D Gaussians, notably optimizing anisotropic covariance to achieve an accurate representation of the scene; Third, we develop a fast visibility-aware rendering algorithm that supports anisotropic splatting and both accelerates training and allows realtime rendering. We demonstrate state-of-the-art visual quality and real-time rendering on several established datasets.
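Two ingredients named in the abstract, anisotropic covariance and splat rendering, can be illustrated in a reduced setting. The sketch below is an assumption-laden simplification, not the authors' renderer: it builds a 2D covariance as rotation times squared scales times rotation transpose (the paper works with 3D Gaussians), and it composites depth-sorted splats front to back with standard alpha blending.

```python
import math


def covariance_2d(scale_x: float, scale_y: float, theta: float):
    """Sigma = R diag(scale_x^2, scale_y^2) R^T for a rotation by theta."""
    c, s = math.cos(theta), math.sin(theta)
    sx2, sy2 = scale_x ** 2, scale_y ** 2
    a = c * c * sx2 + s * s * sy2
    b = c * s * (sx2 - sy2)
    d = s * s * sx2 + c * c * sy2
    return ((a, b), (b, d))  # symmetric and positive semi-definite by construction


def composite_front_to_back(samples):
    """samples: list of ((r, g, b), alpha), sorted nearest to the camera first."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet blocked
    for rgb, alpha in samples:
        weight = transmittance * alpha
        color = [acc + weight * ch for acc, ch in zip(color, rgb)]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # stop once the pixel is nearly opaque
            break
    return tuple(color), transmittance


cov = covariance_2d(2.0, 1.0, 0.0)
rotated = covariance_2d(2.0, 1.0, math.pi / 2)
pixel, remaining = composite_front_to_back([
    ((1.0, 0.0, 0.0), 0.5),  # red splat closest to the camera
    ((0.0, 0.0, 1.0), 0.5),  # blue splat behind it
])
```

Parameterizing the covariance through scales and a rotation keeps it valid during gradient-based optimization, which is harder to guarantee when the matrix entries are optimized directly.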

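Summary
Radiance field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. Achieving high visual quality still requires neural networks that are costly to train and render, while recent faster methods give up quality for speed. For unbounded, complete scenes (rather than isolated objects) at 1080p resolution, no current method achieves real-time display rates. The paper introduces three elements that deliver state-of-the-art visual quality with competitive training times and high-quality real-time (>= 30 fps) novel-view synthesis at 1080p. First, starting from the sparse points produced during camera calibration, the scene is represented with 3D Gaussians that keep the desirable properties of continuous volumetric radiance fields while avoiding unnecessary computation in empty space. Second, interleaved optimization and density control of the 3D Gaussians, notably optimizing anisotropic covariance, yields an accurate scene representation. Third, a fast visibility-aware rendering algorithm supports anisotropic splatting, accelerates training, and allows real-time rendering. Results are demonstrated on several established datasets.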

Exploiting Spatial-Temporal Context for Interacting Hand Reconstruction on Monocular RGB Video

paper_url: http://arxiv.org/abs/2308.04074
repo_url: None
paper_authors: Weichao Zhao, Hezhen Hu, Wengang Zhou, Li li, Houqiang Li
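for: This paper aims to improve the accuracy of interacting hand reconstruction from monocular RGB video by exploiting spatial-temporal information.
methods: It proposes a temporal framework that uses temporal context to complement the insufficient information in a single frame, and an interpenetration detection module to produce physically plausible interacting hands.
results: Experiments show the method achieves new state-of-the-art performance on public benchmarks.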

Abstract
Reconstructing interacting hands from monocular RGB data is a challenging task, as it involves many interfering factors, e.g. self- and mutual occlusion and similar textures. Previous works only leverage information from a single RGB image without modeling their physically plausible relation, which leads to inferior reconstruction results. In this work, we are dedicated to explicitly exploiting spatial-temporal information to achieve better interacting hand reconstruction. On one hand, we leverage temporal context to complement insufficient information provided by the single frame, and design a novel temporal framework with a temporal constraint for interacting hand motion smoothness. On the other hand, we further propose an interpenetration detection module to produce kinetically plausible interacting hands without physical collisions. Extensive experiments are performed to validate the effectiveness of our proposed framework, which achieves new state-of-the-art performance on public benchmarks.
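The abstract mentions a temporal constraint for motion smoothness without giving its form. One common choice, used here purely as an assumption, is to penalize the squared change in joint coordinates between consecutive frames. Each frame below is a flat list of coordinates.

```python
def smoothness_loss(frames):
    """Mean squared change between consecutive frames of joint coordinates."""
    total, count = 0.0, 0
    for prev, cur in zip(frames, frames[1:]):
        for p, c in zip(prev, cur):
            total += (c - p) ** 2
            count += 1
    return total / count if count else 0.0


smooth = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]    # steady motion
jittery = [[0.0, 0.0], [3.0, 3.0], [0.0, 0.0]]   # same endpoints region, abrupt jumps
print(smoothness_loss(smooth), smoothness_loss(jittery))
```

A term like this would be added to the reconstruction loss with a weighting factor, so that jittery predictions cost more than steady ones.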

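Summary
Reconstructing interacting hands from monocular RGB data is challenging because of interfering factors such as self- and mutual occlusion and similar textures. Previous work uses information from a single RGB image only and does not model physically plausible relations between the hands, which leads to inferior reconstructions. This work explicitly exploits spatial-temporal information. It uses temporal context to complement the insufficient information in a single frame and designs a temporal framework with a constraint on the smoothness of interacting hand motion. It also proposes an interpenetration detection module that produces plausible interacting hands without physical collisions. Extensive experiments validate the framework, which achieves new state-of-the-art performance on public benchmarks.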

3D Scene Diffusion Guidance using Scene Graphs

paper_url: http://arxiv.org/abs/2308.04468
repo_url: https://github.com/hamnaanaa/3D-Scene-Diffusion-Guidance-using-Scene-Graphs
paper_authors: Mohammad Naanaa, Katharina Schmid, Yinyu Nie
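for: Guided synthesis of high-quality 3D scenes is a challenging task. Diffusion models have shown promise, but existing methods control generation directly through text embeddings, which limits how well complex spatial relationships between objects are incorporated.
methods: The authors propose guiding 3D scene diffusion with scene graphs, using relational graph convolutional blocks in the denoising network to exploit the relative spatial information the scene graphs provide.
results: The approach significantly improves the alignment between scene description and generated scene.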

Abstract
Guided synthesis of high-quality 3D scenes is a challenging task. Diffusion models have shown promise in generating diverse data, including 3D scenes. However, current methods rely directly on text embeddings for controlling the generation, limiting the incorporation of complex spatial relationships between objects. We propose a novel approach for 3D scene diffusion guidance using scene graphs. To leverage the relative spatial information the scene graphs provide, we make use of relational graph convolutional blocks within our denoising network. We show that our approach significantly improves the alignment between scene description and generated scene.
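A relational graph convolution aggregates neighbour features separately per relation type, which is what lets a scene graph's "left of" and "on" edges influence a node differently. The sketch below uses scalar features and scalar per-relation weights to stay short; the node names, relation names, and weights are hypothetical, and the paper's actual block design is not given in the abstract.

```python
def rgcn_layer(features, edges, rel_weights, self_weight):
    """One relational graph convolution step on scalar node features.

    features: {node: float}; edges: list of (src, relation, dst).
    """
    incoming = {}
    for src, rel, dst in edges:
        incoming.setdefault((dst, rel), []).append(src)
    out = {}
    for node, h in features.items():
        value = self_weight * h  # self-connection
        for rel, w in rel_weights.items():
            neighbours = incoming.get((node, rel), [])
            if neighbours:
                # Mean over neighbours connected through this relation type.
                value += w * sum(features[n] for n in neighbours) / len(neighbours)
        out[node] = max(0.0, value)  # ReLU
    return out


features = {"table": 1.0, "chair": 2.0, "lamp": 4.0}
edges = [("chair", "left_of", "table"), ("lamp", "on", "table")]
updated = rgcn_layer(features, edges, {"left_of": 0.5, "on": -0.25}, self_weight=1.0)
print(updated)
```

In a real network the scalars become feature vectors and weight matrices, one matrix per relation type.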

Summary
Guided synthesis of high-quality 3D scenes is a challenging task. Diffusion models have shown promise in generating diverse data, including 3D scenes, but current methods rely directly on text embeddings to control generation, which limits the incorporation of complex spatial relationships between objects. The paper proposes guiding 3D scene diffusion with scene graphs. To exploit the relative spatial information the scene graphs provide, it uses relational graph convolutional blocks within the denoising network. The approach significantly improves the alignment between scene description and generated scene.

ConDistFL: Conditional Distillation for Federated Learning from Partially Annotated Data

paper_url: http://arxiv.org/abs/2308.04070
repo_url: https://github.com/nvidia/nvflare
paper_authors: Pochuan Wang, Chen Shen, Weichung Wang, Masahiro Oda, Chiou-Shann Fuh, Kensaku Mori, Holger R. Roth
for: simultaneously delineating multiple organs and diseases
methods: federated learning (FL) with knowledge distillation
results: outperforms FedAvg and FedOpt baselines, superior generalizability on external test dataset, can perform well without frequent aggregation

Abstract
Developing a generalized segmentation model capable of simultaneously delineating multiple organs and diseases is highly desirable. Federated learning (FL) is a key technology enabling the collaborative development of a model without exchanging training data. However, the limited access to fully annotated training data poses a major challenge to training generalizable models. We propose "ConDistFL", a framework to solve this problem by combining FL with knowledge distillation. Local models can extract the knowledge of unlabeled organs and tumors from partially annotated data from the global model with an adequately designed conditional probability representation. We validate our framework on four distinct partially annotated abdominal CT datasets from the MSD and KiTS19 challenges. The experimental results show that the proposed framework significantly outperforms FedAvg and FedOpt baselines. Moreover, the performance on an external test dataset demonstrates superior generalizability compared to models trained on each dataset separately. Our ablation study suggests that ConDistFL can perform well without frequent aggregation, reducing the communication cost of FL. Our implementation will be available at https://github.com/NVIDIA/NVFlare/tree/dev/research/condist-fl.
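For context, the FedAvg baseline named above aggregates client models by a weighted average of their parameters, with weights proportional to local dataset size. The sketch below shows only that aggregation step on flat parameter lists; the conditional distillation loss that ConDistFL adds is not reproduced here, and the numbers are made up.

```python
def fed_avg(client_weights, client_sizes):
    """Average client parameters, weighting each client by its sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical clients holding 100 and 300 training samples.
global_weights = fed_avg([[1.0, 2.0], [3.0, 6.0]], [100, 300])
print(global_weights)
```

Each call to `fed_avg` corresponds to one communication round, which is why performing well with less frequent aggregation reduces communication cost.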

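Summary
A generalized segmentation model that can delineate multiple organs and diseases at once is highly desirable. Federated learning (FL) is a key technology for developing a model collaboratively without exchanging training data, but limited access to fully annotated training data makes it hard to train generalizable models. The paper proposes "ConDistFL", a framework that combines FL with knowledge distillation. Using a suitably designed conditional probability representation, local models extract knowledge about unlabeled organs and tumors from the global model while training on partially annotated data. The framework is validated on four partially annotated abdominal CT datasets from the MSD and KiTS19 challenges, where it significantly outperforms FedAvg and FedOpt baselines. Performance on an external test dataset shows better generalizability than models trained on each dataset separately. An ablation study suggests that ConDistFL performs well without frequent aggregation, which reduces the communication cost of FL. The implementation will be available at https://github.com/NVIDIA/NVFlare/tree/dev/research/condist-fl.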

Backdoor Federated Learning by Poisoning Backdoor-Critical Layers

paper_url: http://arxiv.org/abs/2308.04466
repo_url: None
paper_authors: Haomin Zhuang, Mingxian Yu, Hao Wang, Yang Hua, Jian Li, Xu Yuan
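for: This work studies backdoor attacks in Federated Learning (FL) and how such attacks can adapt to a range of defense strategies.
methods: It proposes an in-situ approach, from the attacker's perspective, that identifies and verifies backdoor-critical (BC) layers in FL models, and a new BC layer-aware backdoor attack that adaptively balances attack effect and stealthiness under different defenses.
results: Extensive experiments show that the BC layer-aware attacks successfully backdoor FL under seven state-of-the-art defenses and outperform the latest backdoor attack methods.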

Abstract
Federated learning (FL) has been widely deployed to enable machine learning training on sensitive data across distributed devices. However, the decentralized learning paradigm and heterogeneity of FL further extend the attack surface for backdoor attacks. Existing FL attack and defense methodologies typically focus on the whole model. None of them recognizes the existence of backdoor-critical (BC) layers, a small subset of layers that dominate the model vulnerabilities. Attacking the BC layers achieves equivalent effects as attacking the whole model but at a far smaller chance of being detected by state-of-the-art (SOTA) defenses. This paper proposes a general in-situ approach that identifies and verifies BC layers from the perspective of attackers. Based on the identified BC layers, we carefully craft a new backdoor attack methodology that adaptively seeks a fundamental balance between attacking effects and stealthiness under various defense strategies. Extensive experiments show that our BC layer-aware backdoor attacks can successfully backdoor FL under seven SOTA defenses with only 10% malicious clients and outperform the latest backdoor attack methods.

Summary
Backdoor-critical (BC) layers are a small subset of layers in a machine learning model that dominate the model's vulnerabilities. The proposed approach identifies and verifies BC layers from the perspective of attackers. The new backdoor attack methodology adaptively seeks a balance between attacking effects and stealthiness under various defense strategies. It can successfully backdoor FL under seven state-of-the-art defenses with only 10% malicious clients and outperforms the latest backdoor attack methods.

An Empirical Analysis of Range for 3D Object Detection

paper_url: http://arxiv.org/abs/2308.04054
repo_url: None
paper_authors: Neehar Peri, Mengtian Li, Benjamin Wilson, Yu-Xiong Wang, James Hays, Deva Ramanan
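for: This paper studies far-field 3D detection for safe autonomous vehicle navigation.
methods: Using the Argoverse 2.0 dataset, it analyzes long-range 3D detection and finds that near-field LiDAR measurements are dense and suit small voxels, while far-field measurements are sparse and suit large voxels. It builds a set of range experts tuned for near-vs-far field detection and proposes simple techniques to ensemble them efficiently.
results: The range experts and ensembling techniques improve long-range detection efficiency by 33% and accuracy by 3.2% CDS.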

Abstract
LiDAR-based 3D detection plays a vital role in autonomous navigation. Surprisingly, although autonomous vehicles (AVs) must detect both near-field objects (for collision avoidance) and far-field objects (for longer-term planning), contemporary benchmarks focus only on near-field 3D detection. However, AVs must detect far-field objects for safe navigation. In this paper, we present an empirical analysis of far-field 3D detection using the long-range detection dataset Argoverse 2.0 to better understand the problem, and share the following insight: near-field LiDAR measurements are dense and optimally encoded by small voxels, while far-field measurements are sparse and are better encoded with large voxels. We exploit this observation to build a collection of range experts tuned for near-vs-far field detection, and propose simple techniques to efficiently ensemble models for long-range detection that improve efficiency by 33% and boost accuracy by 3.2% CDS.
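The observation about voxel size can be made concrete with a range-dependent voxelizer: points close to the sensor are bucketed into small voxels and distant points into large ones. The range cut-off and both voxel sizes below are hypothetical values chosen for the example, not settings from the paper.

```python
import math


def voxel_index(point, voxel_size):
    """Integer grid cell that contains the point for a given voxel edge length."""
    return tuple(math.floor(c / voxel_size) for c in point)


def voxelize_by_range(points, near_limit=50.0, near_voxel=0.1, far_voxel=0.4):
    """Split points by horizontal distance and voxelize each group separately."""
    near, far = {}, {}
    for p in points:
        dist = math.hypot(p[0], p[1])  # distance from the sensor in the ground plane
        if dist <= near_limit:
            near.setdefault(voxel_index(p, near_voxel), []).append(p)
        else:
            far.setdefault(voxel_index(p, far_voxel), []).append(p)
    return near, far


points = [
    (1.05, 1.05, 0.0), (1.07, 1.08, 0.0),    # dense returns near the sensor
    (100.1, 0.1, 0.0), (100.3, 0.2, 0.0),    # sparse returns far away
]
near, far = voxelize_by_range(points)
print(len(near), len(far))
```

With the larger far-field voxel, the two distant points still share a cell, so a far-field expert sees fewer empty voxels; each group would then be passed to its own detector.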

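Summary
LiDAR-based 3D detection plays a vital role in autonomous navigation. Although autonomous vehicles (AVs) must detect both near-field objects (for collision avoidance) and far-field objects (for longer-term planning), contemporary benchmarks focus only on near-field 3D detection. The paper presents an empirical analysis of far-field 3D detection on the long-range Argoverse 2.0 dataset and shares one insight: near-field LiDAR measurements are dense and best encoded with small voxels, while far-field measurements are sparse and better encoded with large voxels. Building on this, the authors construct a collection of range experts tuned for near-vs-far field detection and propose simple techniques to ensemble them efficiently, improving efficiency by 33% and accuracy by 3.2% CDS.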

Implicit neural representations for joint decomposition and registration of gene expression images in the marmoset brain

paper_url: http://arxiv.org/abs/2308.04039
repo_url: None
paper_authors: Michal Byra, Charissa Poon, Tomomi Shimogori, Henrik Skibbe
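for: This work proposes an image registration method based on implicit neural representations for registering a pair of brain images where one contains additional features or artifacts not present in the other.
methods: The method combines implicit networks with an image exclusion loss to jointly register and decompose the image: the support image aligns well with the template, while the residual image captures the individual image characteristics that diverge from it.
results: In experiments, the method produced excellent results and outperformed other registration techniques.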

Abstract
We propose a novel image registration method based on implicit neural representations that addresses the challenging problem of registering a pair of brain images with similar anatomical structures, but where one image contains additional features or artifacts that are not present in the other image. To demonstrate its effectiveness, we use 2D microscopy in situ hybridization gene expression images of the marmoset brain. Accurately quantifying gene expression requires image registration to a brain template, which is difficult due to the diversity of patterns causing variations in visible anatomical brain structures. Our approach uses implicit networks in combination with an image exclusion loss to jointly perform the registration and decompose the image into a support and residual image. The support image aligns well with the template, while the residual image captures individual image characteristics that diverge from the template. In experiments, our method provided excellent results and outperformed other registration techniques.

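Summary
The paper proposes an image registration method based on implicit neural representations for pairs of brain images that share similar anatomical structures but where one image contains features or artifacts absent from the other. It is demonstrated on 2D microscopy in situ hybridization gene expression images of the marmoset brain. Accurately quantifying gene expression requires registration to a brain template, which is difficult because the diversity of expression patterns changes the visible anatomical structures. The method combines implicit networks with an image exclusion loss to jointly perform the registration and decompose the image into a support image and a residual image. The support image aligns well with the template, while the residual image captures the individual characteristics that diverge from it. In experiments, the method gave excellent results and outperformed other registration techniques.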

Synthetic Augmentation with Large-scale Unconditional Pre-training

paper_url: http://arxiv.org/abs/2308.04020
repo_url: https://github.com/karenyyy/histodiffaug
paper_authors: Jiarong Ye, Haomiao Ni, Peng Jin, Sharon X. Huang, Yuan Xue
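for: Reducing the amount of expert-annotated training data needed by medical image recognition systems, and with it the cost and time of annotation.
methods: Proposes HistoDiffusion, a synthetic augmentation technique that is pre-trained on large-scale unlabeled datasets and then applied to a small-scale labeled dataset for augmented training.
results: After pre-training on three histopathology datasets and testing on a colorectal cancer (CRC) dataset, augmentation improves classification accuracy by 6.4% using a small set of the original labels.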

Abstract
Deep learning based medical image recognition systems often require a substantial amount of training data with expert annotations, which can be expensive and time-consuming to obtain. Recently, synthetic augmentation techniques have been proposed to mitigate the issue by generating realistic images conditioned on class labels. However, the effectiveness of these methods heavily depends on the representation capability of the trained generative model, which cannot be guaranteed without sufficient labeled training data. To further reduce the dependency on annotated data, we propose a synthetic augmentation method called HistoDiffusion, which can be pre-trained on large-scale unlabeled datasets and later applied to a small-scale labeled dataset for augmented training. In particular, we train a latent diffusion model (LDM) on diverse unlabeled datasets to learn common features and generate realistic images without conditional inputs. Then, we fine-tune the model with classifier guidance in latent space on an unseen labeled dataset so that the model can synthesize images of specific categories. Additionally, we adopt a selective mechanism to only add synthetic samples with high confidence of matching to target labels. We evaluate our proposed method by pre-training on three histopathology datasets and testing on a histopathology dataset of colorectal cancer (CRC) excluded from the pre-training datasets. With HistoDiffusion augmentation, the classification accuracy of a backbone classifier is remarkably improved by 6.4% using a small set of the original labels. Our code is available at https://github.com/karenyyy/HistoDiffAug.
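The selective mechanism can be sketched as a confidence filter: a synthetic image is added to the training set only if a classifier assigns high probability to the label it was generated for. The threshold, sample identifiers, and class names below are hypothetical, and the abstract does not say how confidence is computed in the paper.

```python
def select_confident(samples, threshold=0.9):
    """Keep synthetic samples whose target label receives high classifier probability.

    samples: list of (sample_id, target_label, {label: probability}).
    """
    kept = []
    for sample_id, target, probs in samples:
        if probs.get(target, 0.0) >= threshold:
            kept.append(sample_id)
    return kept


synthetic = [
    ("syn_1", "tumor", {"tumor": 0.97, "stroma": 0.03}),
    ("syn_2", "tumor", {"tumor": 0.55, "stroma": 0.45}),   # ambiguous, dropped
    ("syn_3", "stroma", {"tumor": 0.08, "stroma": 0.92}),
]
kept = select_confident(synthetic)
print(kept)
```

Filtering trades quantity for label reliability: fewer synthetic samples survive, but those that do are less likely to carry the wrong label.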

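Summary
Deep learning based medical image recognition systems often need large amounts of training data with expert annotations, which are expensive and time-consuming to obtain. Synthetic augmentation techniques generate realistic images conditioned on class labels, but their effectiveness depends on the representation capability of the trained generative model, which cannot be guaranteed without sufficient labeled data. To further reduce the dependency on annotated data, the paper proposes HistoDiffusion. A latent diffusion model (LDM) is first trained on diverse unlabeled datasets to learn common features and generate realistic images without conditional inputs. The model is then fine-tuned with classifier guidance in latent space on an unseen labeled dataset so that it can synthesize images of specific categories. A selective mechanism adds only synthetic samples that match their target labels with high confidence. The method is pre-trained on three histopathology datasets and tested on a colorectal cancer (CRC) histopathology dataset excluded from pre-training. With HistoDiffusion augmentation, the classification accuracy of a backbone classifier improves by 6.4% using a small set of the original labels. Code is available at https://github.com/karenyyy/HistoDiffAug.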

Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning

paper_url: http://arxiv.org/abs/2308.04016
repo_url: https://github.com/hanjaekim98/cot
paper_authors: Hanjae Kim, Jiyoung Lee, Seongheon Park, Kwanghoon Sohn
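for: This work proposes a simple and scalable framework for compositional zero-shot learning (CZSL) that addresses the contextuality between attribute and object, the discriminability of visual features, and the long-tailed distribution of real-world compositional data.
methods: The framework, Composition Transformer (CoT), uses object and attribute experts that draw on the visual network hierarchically to generate representative embeddings. The object expert extracts object embeddings from the final layer in a bottom-up manner, while the attribute expert produces attribute embeddings through an object-guided attention module that models contextuality explicitly.
results: The method achieves state-of-the-art performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL. CoT also improves visual discrimination and reduces model bias. Code is available at https://github.com/HanjaeKim98/CoT.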

Abstract
Compositional zero-shot learning (CZSL) aims to recognize unseen compositions with prior knowledge of known primitives (attribute and object). Previous works for CZSL often suffer from grasping the contextuality between attribute and object, as well as the discriminability of visual features, and the long-tailed distribution of real-world compositional data. We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues. CoT employs object and attribute experts in distinctive manners to generate representative embeddings, using the visual network hierarchically. The object expert extracts representative object embeddings from the final layer in a bottom-up manner, while the attribute expert makes attribute embeddings in a top-down manner with a proposed object-guided attention module that models contextuality explicitly. To remedy biased prediction caused by imbalanced data distribution, we develop a simple minority attribute augmentation (MAA) that synthesizes virtual samples by mixing two images and oversampling minority attribute classes. Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL. We also demonstrate the effectiveness of CoT in improving visual discrimination and addressing the model bias from the imbalanced data distribution. The code is available at https://github.com/HanjaeKim98/CoT.
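Minority attribute augmentation is described as mixing two images while oversampling minority attribute classes. The sketch below pairs the two ideas in the simplest form: inverse-frequency sampling weights and a pixel-wise convex blend. Both choices are assumptions; the abstract does not specify the mixing rule or the sampling distribution, and the attribute names are invented.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Sampling weight per example, higher for rarer attribute classes."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def mix_images(img_a, img_b, lam):
    """Pixel-wise blend: lam * img_a + (1 - lam) * img_b on flat pixel lists."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(img_a, img_b)]


labels = ["wet", "wet", "wet", "rusty"]          # "rusty" is the minority attribute
weights = inverse_frequency_weights(labels)
mixed = mix_images([0.0, 1.0], [1.0, 1.0], lam=0.25)
print(weights, mixed)
```

The weights could be passed to `random.choices(range(len(labels)), weights=weights)` to draw minority examples more often before mixing them into virtual samples.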


Coarse-to-Fine: Learning Compact Discriminative Representation for Single-Stage Image Retrieval

paper_url: http://arxiv.org/abs/2308.04008
repo_url: https://github.com/bassyess/cfcd
paper_authors: Yunquan Zhu, Xinkai Gao, Bo Ke, Ruizhi Qiao, Xing Sun
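for: Efficient and accurate single-stage image retrieval.
methods: Proposes a Coarse-to-Fine framework that learns a Compact Discriminative representation (CFCD) and needs only image-level labels for training. First, an adaptive softmax-based loss dynamically tunes its scale and margin within each mini-batch to strengthen supervision during training and intra-class compactness. Second, a mechanism selects prominent local descriptors and, through a hard negative sampling strategy, infuses fine-grained semantic relations into the global representation to optimize inter-class distinctiveness at a global scale.
results: Experiments demonstrate the method's effectiveness: it achieves state-of-the-art single-stage image retrieval performance on benchmarks such as Revisited Oxford and Revisited Paris.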

Abstract
Image retrieval aims to find images in a database that are visually similar to the query image. Two-stage methods following the retrieve-and-rerank paradigm have achieved excellent performance, but their separate local and global modules are inefficient for real-world applications. To better trade off retrieval efficiency and accuracy, some approaches fuse global and local features into a joint representation to perform single-stage image retrieval. However, they remain challenged by the variety of conditions to handle, e.g., background, occlusion, and viewpoint. In this work, we design a Coarse-to-Fine framework to learn Compact Discriminative representation (CFCD) for end-to-end single-stage image retrieval, requiring only image-level labels. Specifically, we first design a novel adaptive softmax-based loss which dynamically tunes its scale and margin within each mini-batch and increases them progressively to strengthen supervision during training and intra-class compactness. Furthermore, we propose a mechanism which attentively selects prominent local descriptors and infuses fine-grained semantic relations into the global representation by a hard negative sampling strategy to optimize inter-class distinctiveness at a global scale. Extensive experimental results have demonstrated the effectiveness of our method, which achieves state-of-the-art single-stage image retrieval performance on benchmarks such as Revisited Oxford and Revisited Paris. Code is available at https://github.com/bassyess/CFCD.
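To make the role of scale and margin in a softmax-based loss concrete, here is a minimal sketch. It uses an additive cosine margin on the target class and a linear ramp for the progressive increase; both are assumptions chosen for illustration, since the abstract does not give the exact form of the adaptive loss or its per-batch tuning rule.

```python
import math

def margin_softmax_loss(cosines, target, scale, margin):
    """Cross-entropy over scaled cosine logits, with an additive margin on the target class."""
    logits = [scale * (c - margin if i == target else c) for i, c in enumerate(cosines)]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]

def progressive(start, end, progress):
    """Linear ramp from start to end as training progress goes from 0 to 1 (assumed schedule)."""
    return start + (end - start) * min(max(progress, 0.0), 1.0)
```

A larger margin makes the target class harder to satisfy, which pushes same-class embeddings closer together, while a larger scale sharpens the softmax. Ramping both up over training, e.g. `progressive(16.0, 32.0, step / total_steps)`, strengthens the supervision gradually.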

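Summary
Image retrieval aims to find database images that are visually similar to a query image. Two-stage methods following the retrieve-and-rerank paradigm perform very well, but their separate local and global modules are inefficient in real-world applications. To better balance retrieval efficiency and accuracy, some approaches fuse global and local features into a joint representation for single-stage retrieval, yet they still face challenges such as background, occlusion, and viewpoint. This work designs a Coarse-to-Fine framework that learns a Compact Discriminative representation (CFCD) for end-to-end single-stage image retrieval using only image-level labels. First, a novel adaptive softmax-based loss dynamically tunes its scale and margin within each mini-batch to strengthen supervision during training and intra-class compactness. Second, a mechanism selects prominent local descriptors and, through a hard negative sampling strategy, infuses them into the global representation to improve inter-class distinctiveness at a global scale. Experiments show state-of-the-art single-stage retrieval performance on benchmarks such as Revisited Oxford and Revisited Paris. Code is available at https://github.com/bassyess/CFCD.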

Few-shot medical image classification with simple shape and texture text descriptors using vision-language models

paper_url: http://arxiv.org/abs/2308.04005
repo_url: None
paper_authors: Michal Byra, Muhammad Febrian Rachmadi, Henrik Skibbe
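for: This study investigates the usefulness of vision-language models (VLMs) and large language models for binary few-shot classification of medical images.
methods: GPT-4 is used to generate text descriptors of the shape and texture characteristics of objects in medical images. These GPT-4-generated descriptors, together with VLMs pre-trained on natural images, are then used to classify chest X-rays and breast ultrasound images.
results: The results indicate that few-shot classification of medical images with VLMs and GPT-4-generated descriptors is a viable approach. Accurate classification, however, requires excluding certain descriptors from the calculation of the classification scores. The authors also assess how well VLMs evaluate shape features in breast mass ultrasound images and examine the variability among the descriptor sets produced by GPT-4.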

Abstract
In this work, we investigate the usefulness of vision-language models (VLMs) and large language models for binary few-shot classification of medical images. We utilize the GPT-4 model to generate text descriptors that encapsulate the shape and texture characteristics of objects in medical images. Subsequently, these GPT-4 generated descriptors, alongside VLMs pre-trained on natural images, are employed to classify chest X-rays and breast ultrasound images. Our results indicate that few-shot classification of medical images using VLMs and GPT-4 generated descriptors is a viable approach. However, accurate classification requires to exclude certain descriptors from the calculations of the classification scores. Moreover, we assess the ability of VLMs to evaluate shape features in breast mass ultrasound images. We further investigate the degree of variability among the sets of text descriptors produced by GPT-4. Our work provides several important insights about the application of VLMs for medical image analysis.
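The descriptor-based scoring described above can be sketched as follows. The scoring rule (mean cosine similarity between an image embedding and each class's descriptor embeddings) is an assumption for illustration, as are the function names and the toy descriptor names in the usage note; the abstract only states that certain descriptors must be excluded from the score calculation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def class_score(image_emb, descriptor_embs, excluded=()):
    """Mean similarity over the class descriptors that are not excluded."""
    kept = [emb for name, emb in descriptor_embs.items() if name not in excluded]
    return sum(cosine(image_emb, emb) for emb in kept) / len(kept)

def classify(image_emb, classes, excluded=()):
    """Return the class whose descriptors best match the image embedding."""
    return max(classes, key=lambda c: class_score(image_emb, classes[c], excluded))
```

In a toy setup where the class "benign" has one well-matched descriptor and one misleading descriptor, the misleading one can drag the mean below the score of the other class; passing its name in `excluded` restores the expected prediction, which mirrors the abstract's observation about excluding descriptors.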

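Summary
This study investigates whether vision-language models (VLMs) and large language models can support binary few-shot classification of medical images. GPT-4 generates text descriptors of the shape and texture characteristics of objects in medical images. These descriptors, together with VLMs pre-trained on natural images, are used to classify chest X-rays and breast ultrasound images. The results indicate that this is a viable approach, although accurate classification requires excluding certain descriptors from the classification score calculation. The study also assesses how well VLMs evaluate shape features in breast mass ultrasound images and examines the variability among the descriptor sets generated by GPT-4, yielding several insights into the use of VLMs for medical image analysis.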

Real-time Strawberry Detection Based on Improved YOLOv5s Architecture for Robotic Harvesting in open-field environment

paper_url: http://arxiv.org/abs/2308.03998
repo_url: None
paper_authors: Zixuan He, Salik Ram Khana, Xin Zhang, Manoj Karkee, Qin Zhang
for: A custom object detection model based on YOLOv5 for strawberry detection in open-field environments.
methods: The model modifies the original YOLOv5s architecture by replacing the C3 module with C2f and combining Spatial Pyramid Pooling Fast with Cross Stage Partial Net. It is trained on an RGB image dataset of strawberry canopies with three maturity classes.
results: The model achieves the highest mean average precision, 80.3%, among five compared models, with an inference speed of 18ms per image. It exceeds the latest YOLOv8s in per-class average precision (the abstract reports margins of 2.3% and 3.7%) while being faster and having fewer parameters.

Abstract
This study proposed a YOLOv5-based custom object detection model to detect strawberries in an outdoor environment. First, the original YOLOv5s architecture was modified by replacing the C3 module with the C2f module in the backbone network, which provided a better feature gradient flow. Second, the Spatial Pyramid Pooling Fast in the final layer of the YOLOv5s backbone network was combined with Cross Stage Partial Net to improve generalization over the strawberry dataset used in this study. The proposed architecture was named YOLOv5s-Straw. An RGB image dataset of strawberry canopies with three maturity classes (immature, nearly mature, and mature) was collected in an open-field environment and augmented through a series of operations including brightness reduction, brightness increase, and noise addition. To verify the superiority of the proposed method for strawberry detection in an open-field environment, four competitive detection models (YOLOv3-tiny, YOLOv5s, YOLOv5s-C2f, and YOLOv8s) were trained and tested under the same computational environment and compared with YOLOv5s-Straw. The proposed architecture achieved the highest mean average precision, 80.3%, whereas YOLOv3-tiny, YOLOv5s, YOLOv5s-C2f, and YOLOv8s achieved 73.4%, 77.8%, 79.8%, and 79.3%, respectively. Specifically, the average precision of YOLOv5s-Straw was 82.1% in the immature class, 73.5% in the nearly mature class, and 86.6% in the mature class, which were 2.3% and 3.7%, respectively, higher than that of the latest YOLOv8s. The model included 8.6*10^6 network parameters with an inference speed of 18ms per image, while YOLOv8s had a slower inference speed of 21.0ms and heavier parameters of 11.1*10^6, which indicates that the proposed model is fast enough for real-time strawberry detection and localization for robotic picking.
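The comparison in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the helper names are hypothetical, and the throughput estimate assumes the quoted latency is the only per-image cost.

```python
# mAP values (%) as reported in the abstract above.
reported_map = {
    "YOLOv5s-Straw": 80.3,
    "YOLOv3-tiny": 73.4,
    "YOLOv5s": 77.8,
    "YOLOv5s-C2f": 79.8,
    "YOLOv8s": 79.3,
}

def map_gaps(scores, reference="YOLOv5s-Straw"):
    """Gap in percentage points between the reference model and every other model."""
    return {name: round(scores[reference] - value, 1)
            for name, value in scores.items() if name != reference}

def frames_per_second(latency_ms):
    """Throughput implied by a per-image latency, ignoring pre- and post-processing."""
    return 1000.0 / latency_ms

gaps = map_gaps(reported_map)
print(gaps)
print(round(frames_per_second(18.0), 1), round(frames_per_second(21.0), 1))
```

The gaps range from 0.5 percentage points (against YOLOv5s-C2f) to 6.9 (against YOLOv3-tiny), and the quoted latencies correspond to roughly 55.6 versus 47.6 images per second.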

Summary
The dataset used in this study consisted of RGB images of strawberry canopies with three maturity classes (immature, nearly mature, and mature) collected in an open-field environment. The images were augmented using brightness reduction, brightness increase, and noise addition. To evaluate the proposed model, four competitive detection models (YOLOv3-tiny, YOLOv5s, YOLOv5s-C2f, and YOLOv8s) were trained and tested under the same computational environment. The proposed model achieved the highest mean average precision, 80.3%, outperforming the other models by 0.5 to 6.9 percentage points. Specifically, the average precision of YOLOv5s-Straw was 82.1% in the immature class, 73.5% in the nearly mature class, and 86.6% in the mature class. The proposed model has 8.6 million network parameters and an inference speed of 18ms per image, which is fast enough for real-time strawberry detection and localization for robotic picking. In comparison, YOLOv8s has more parameters (11.1 million) and a slower inference speed (21.0ms). Overall, the proposed YOLOv5s-Straw model outperformed the other compared models for strawberry detection in open-field environments and is a promising solution for robotic strawberry picking.

PARTNER: Level up the Polar Representation for LiDAR 3D Object Detection

paper_url: http://arxiv.org/abs/2308.03982
repo_url: None
paper_authors: Ming Nie, Yujing Xue, Chunwei Wang, Chaoqiang Ye, Hang Xu, Xinge Zhu, Qingqiu Huang, Michael Bi Mi, Xinchao Wang, Li Zhang
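for: Improving the accuracy and robustness of polar-representation LiDAR 3D object detectors, particularly for streaming-based detection and under different resolutions.
methods: Represents point clouds in polar coordinates, re-aligns the global representation to alleviate the feature distortion caused by the non-uniform division of polar grids, and introduces instance-level geometric information into the detection head.
results: Outperforms previous polar-based methods by 3.68% and 9.15% on the Waymo and ONCE validation sets, respectively, and achieves competitive results in streaming-based detection and at different resolutions.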

Abstract
Recently, polar-based representation has shown promising properties in perceptual tasks. In addition to Cartesian-based approaches, which separate point clouds unevenly, representing point clouds as polar grids has been recognized as an alternative due to (1) its advantage in robust performance under different resolutions and (2) its superiority in streaming-based approaches. However, state-of-the-art polar-based detection methods inevitably suffer from the feature distortion problem because of the non-uniform division of polar representation, resulting in a non-negligible performance gap compared to Cartesian-based approaches. To tackle this issue, we present PARTNER, a novel 3D object detector in the polar coordinate. PARTNER alleviates the dilemma of feature distortion with global representation re-alignment and facilitates the regression by introducing instance-level geometric information into the detection head. Extensive experiments show overwhelming advantages in streaming-based detection and different resolutions. Furthermore, our method outperforms the previous polar-based works with remarkable margins of 3.68% and 9.15% on Waymo and ONCE validation set, thus achieving competitive results over the state-of-the-art methods.
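The non-uniform division mentioned in the abstract follows directly from polar geometry: with a fixed number of angular sectors, the arc covered by one cell grows linearly with range. The sketch below shows this with a simple bird's-eye-view polar binning; the function names and grid parameters are hypothetical and are not taken from PARTNER.

```python
import math

def polar_cell(x, y, radial_step, num_sectors):
    """Map a bird's-eye-view point to (radial index, angular sector index)."""
    r = math.hypot(x, y)
    theta = math.atan2(y, x) % (2.0 * math.pi)  # wrap the angle into [0, 2*pi)
    sector = int(theta / (2.0 * math.pi) * num_sectors) % num_sectors
    return int(r // radial_step), sector

def cell_arc_length(r, num_sectors):
    """Arc length spanned by one angular sector at range r."""
    return r * 2.0 * math.pi / num_sectors
```

With 360 sectors, a cell at 50 m spans ten times the arc of a cell at 5 m, so an object of fixed physical size occupies far fewer angular cells when it is far away. This is the kind of range-dependent distortion a polar detector has to compensate for.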

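Summary
Polar-based representations have recently shown promising properties in perception tasks. Alongside Cartesian-based approaches, which partition point clouds unevenly, polar grids are recognized as an alternative because of their robust performance at different resolutions and their advantage in streaming-based processing. State-of-the-art polar-based detectors, however, cannot avoid feature distortion, which stems from the non-uniform division of the polar grid. To address this, the authors propose PARTNER, a new 3D object detector in polar coordinates. PARTNER alleviates feature distortion through global representation re-alignment and eases regression by introducing instance-level geometric information into the detection head. The method shows large advantages in streaming-based detection and at different resolutions, and it outperforms previous polar-based methods by 3.68% and 9.15% on the Waymo and ONCE validation sets, achieving results competitive with state-of-the-art methods.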

PAIF: Perception-Aware Infrared-Visible Image Fusion for Attack-Tolerant Semantic Segmentation

paper_url: http://arxiv.org/abs/2308.03979
repo_url: https://github.com/liuzhu-cv/paif
paper_authors: Zhu Liu, Jinyuan Liu, Benzhuang Zhang, Long Ma, Xin Fan, Risheng Liu
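for: This work aims to improve the segmentation robustness of infrared-visible image fusion under adversarial attacks.
methods: Proposes a perception-aware fusion framework. The authors systematically analyze how the components of image fusion correlate with segmentation robustness under adversarial perturbations, then propose a harmonized architecture search with a decomposition-based structure to balance standard accuracy and robustness. An adaptive learning strategy further improves the parameter robustness of image fusion so that effective feature extraction is learned under diverse adversarial perturbations.
results: Experiments show that the scheme substantially enhances robustness, gaining 15.3% mIOU for segmentation in adversarial scenes compared with advanced competitors.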

Abstract
Infrared and visible image fusion is a powerful technique that combines complementary information from different modalities for downstream semantic perception tasks. Existing learning-based methods show remarkable performance, but are suffering from the inherent vulnerability of adversarial attacks, causing a significant decrease in accuracy. In this work, a perception-aware fusion framework is proposed to promote segmentation robustness in adversarial scenes. We first conduct systematic analyses about the components of image fusion, investigating the correlation with segmentation robustness under adversarial perturbations. Based on these analyses, we propose a harmonized architecture search with a decomposition-based structure to balance standard accuracy and robustness. We also propose an adaptive learning strategy to improve the parameter robustness of image fusion, which can learn effective feature extraction under diverse adversarial perturbations. Thus, the goals of image fusion (i.e., extracting complementary features from source modalities and defending attack) can be realized from the perspectives of architectural and learning strategies. Extensive experimental results demonstrate that our scheme substantially enhances the robustness, with gains of 15.3% mIOU of segmentation in the adversarial scene, compared with advanced competitors. The source codes are available at https://github.com/LiuZhu-CV/PAIF.

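Summary
Infrared and visible image fusion is a powerful technique that combines complementary information from different modalities for downstream semantic perception tasks. Existing learning-based methods perform remarkably well but are inherently vulnerable to adversarial attacks, which reduces accuracy. This work proposes a perception-aware fusion framework to improve segmentation robustness in adversarial scenes. The authors first systematically analyze how the components of image fusion correlate with segmentation robustness. Based on these analyses, they propose a harmonized architecture to balance standard accuracy and robustness, together with an adaptive learning strategy that improves the parameter robustness of image fusion so that effective feature extraction is learned under diverse adversarial attacks. The two goals of image fusion, extracting complementary features from the source modalities and defending against attacks, are thus addressed from both the architectural and the learning-strategy perspective. Experiments show that the scheme substantially improves robustness, gaining 15.3% segmentation mIOU over advanced competitors. Source code is available at https://github.com/LiuZhu-CV/PAIF.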

PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning

paper_url: http://arxiv.org/abs/2308.03977
repo_url: https://github.com/facebookresearch/pug
paper_authors: Florian Bordes, Shashank Shekhar, Mark Ibrahim, Diane Bouchacourt, Pascal Vincent, Ari S. Morcos
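for: This paper addresses the drawbacks of relying on real image data by using a game engine to generate photorealistic synthetic image data for training and evaluating deep neural networks.
methods: Uses the Unreal Engine to produce photorealistic synthetic image environments and datasets (PUG), giving representation learning research a setting that is both controllable and realistic.
results: Through evaluations of vision models, the paper demonstrates the potential of the PUG environments and datasets to enable more rigorous evaluation.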

Abstract
Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to (i) render as many data samples as needed, (ii) precisely control each scene and yield granular ground truth labels (and captions), (iii) precisely control distribution shifts between training and testing to isolate variables of interest for sound experimentation. Despite such promise, the use of synthetic image data is still limited -- and often played down -- mainly due to their lack of realism. Most works therefore rely on datasets of real images, which have often been scraped from public images on the internet, and may have issues with regards to privacy, bias, and copyright, while offering little control over how objects precisely appear. In this work, we present a path to democratize the use of photorealistic synthetic data: we develop a new generation of interactive environments for representation learning research, that offer both controllability and realism. We use the Unreal Engine, a powerful game engine well known in the entertainment industry, to produce PUG (Photorealistic Unreal Graphics) environments and datasets for representation learning. In this paper, we demonstrate the potential of PUG to enable more rigorous evaluations of vision models.

Summary
Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to (i) render as many data samples as needed, (ii) precisely control each scene and obtain fine-grained labels and captions, and (iii) control distribution shifts between training and testing to isolate variables of interest. Despite this promise, synthetic image data is still used only to a limited extent, mainly because it lacks realism. Most works therefore rely on real image datasets, which are often scraped from the internet, may raise privacy, bias, and copyright issues, and offer little control over how objects appear. This work presents a path toward wider use of photorealistic synthetic data: the authors use the Unreal Engine, a game engine well known in the entertainment industry, to produce PUG (Photorealistic Unreal Graphics) environments and datasets for representation learning research, and they demonstrate PUG's potential to enable more rigorous evaluation of vision models.

Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D Action Representation Learning

paper_url: http://arxiv.org/abs/2308.03975
repo_url: None
paper_authors: Jiahang Zhang, Lilang Lin, Jiaying Liu
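for: This work proposes a new self-supervised learning method for modeling skeleton relations in skeleton-based human action understanding, an important yet challenging topic.
methods: Prompted Contrast with Masked Motion Modeling (PCM$^{\rm 3}$) integrates contrastive learning and masked prediction tasks in a mutually beneficial manner, which improves generalization to multiple downstream tasks.
results: Experiments on five downstream tasks across three large-scale datasets show that PCM$^{\rm 3}$ generalizes better than state-of-the-art methods.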

Abstract
Self-supervised learning has proved effective for skeleton-based human action understanding, which is an important yet challenging topic. Previous works mainly rely on contrastive learning or masked motion modeling paradigm to model the skeleton relations. However, the sequence-level and joint-level representation learning cannot be effectively and simultaneously handled by these methods. As a result, the learned representations fail to generalize to different downstream tasks. Moreover, combining these two paradigms in a naive manner leaves the synergy between them untapped and can lead to interference in training. To address these problems, we propose Prompted Contrast with Masked Motion Modeling, PCM$^{\rm 3}$, for versatile 3D action representation learning. Our method integrates the contrastive learning and masked prediction tasks in a mutually beneficial manner, which substantially boosts the generalization capacity for various downstream tasks. Specifically, masked prediction provides novel training views for contrastive learning, which in turn guides the masked prediction training with high-level semantic information. Moreover, we propose a dual-prompted multi-task pretraining strategy, which further improves model representations by reducing the interference caused by learning the two different pretext tasks. Extensive experiments on five downstream tasks under three large-scale datasets are conducted, demonstrating the superior generalization capacity of PCM$^{\rm 3}$ compared to the state-of-the-art works. Our project is publicly available at: https://jhang2020.github.io/Projects/PCM3/PCM3.html .

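Summary
Self-supervised learning has proved effective for skeleton-based human action understanding, an important yet challenging topic. Previous works mainly rely on contrastive learning or masked motion modeling to model skeleton relations, but these methods cannot handle sequence-level and joint-level representation learning effectively at the same time, so the learned representations fail to generalize to different downstream tasks. Naively combining the two paradigms can also cause interference during training. To address these problems, the authors propose Prompted Contrast with Masked Motion Modeling (PCM$^{\rm 3}$) for versatile 3D action representation learning. The method integrates contrastive learning and masked prediction so that each benefits the other: masked prediction provides novel training views for contrastive learning, which in turn guides masked prediction with high-level semantic information. A dual-prompted multi-task pretraining strategy further reduces the interference caused by learning the two pretext tasks. Extensive experiments on five downstream tasks across three large-scale datasets show that PCM$^{\rm 3}$ generalizes better than prior work. The project is publicly available at https://jhang2020.github.io/Projects/PCM3/PCM3.html.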

Zero-shot Skeleton-based Action Recognition via Mutual Information Estimation and Maximization

paper_url: http://arxiv.org/abs/2308.03950
repo_url: https://github.com/yujieouo/smie
paper_authors: Yujie Zhou, Wenwen Qiang, Anyi Rao, Ning Lin, Bing Su, Jiaqi Wang
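for: Zero-shot skeleton-based action recognition, that is, recognizing actions of unseen categories after training on data from seen categories.
methods: Proposes a new zero-shot skeleton-based action recognition method based on mutual information (MI) estimation and maximization. Specifically, it maximizes the MI between the visual and semantic spaces for distribution alignment and leverages temporal information by encouraging the MI to increase as more frames are observed.
results: Extensive experiments on three large-scale skeleton action datasets confirm the method's effectiveness.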

Abstract
Zero-shot skeleton-based action recognition aims to recognize actions of unseen categories after training on data of seen categories. The key is to build the connection between visual and semantic space from seen to unseen classes. Previous studies have primarily focused on encoding sequences into a singular feature vector, with subsequent mapping the features to an identical anchor point within the embedded space. Their performance is hindered by 1) the ignorance of the global visual/semantic distribution alignment, which results in a limitation to capture the true interdependence between the two spaces. 2) the negligence of temporal information since the frame-wise features with rich action clues are directly pooled into a single feature vector. We propose a new zero-shot skeleton-based action recognition method via mutual information (MI) estimation and maximization. Specifically, 1) we maximize the MI between visual and semantic space for distribution alignment; 2) we leverage the temporal information for estimating the MI by encouraging MI to increase as more frames are observed. Extensive experiments on three large-scale skeleton action datasets confirm the effectiveness of our method. Code: https://github.com/YujieOuO/SMIE.
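The two ingredients named in the abstract, an MI estimate between paired visual and semantic samples and a constraint that the estimate should grow as more frames are observed, can be sketched as below. The InfoNCE lower bound and the hinge penalty are stand-ins chosen for illustration; the abstract does not say which estimator or constraint the authors use, and both function names are hypothetical.

```python
import math

def infonce_mi_lower_bound(scores):
    """InfoNCE bound on MI; scores[i][j] is a critic score between visual sample i
    and semantic sample j, with matched pairs on the diagonal."""
    n = len(scores)
    total = 0.0
    for i, row in enumerate(scores):
        peak = max(row)  # subtract the maximum for numerical stability
        log_sum = peak + math.log(sum(math.exp(s - peak) for s in row))
        total += row[i] - log_sum
    return math.log(n) + total / n

def temporal_penalty(mi_fewer_frames, mi_more_frames):
    """Hinge penalty that is zero when the MI estimate does not drop as frames are added."""
    return max(0.0, mi_fewer_frames - mi_more_frames)
```

When the critic cannot tell matched from mismatched pairs (all scores equal), the bound is 0; when matched pairs score far higher than mismatched ones, it approaches its ceiling of log N for a batch of N pairs.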


Deterministic Neural Illumination Mapping for Efficient Auto-White Balance Correction

paper_url: http://arxiv.org/abs/2308.03939
repo_url: https://github.com/birdortyedi/denim
paper_authors: Furkan Kınlı, Doğa Yılmaz, Barış Özcan, Furkan Kıraç
for: Provide a fast, high-quality auto-white balance (AWB) correction solution.
methods: A deterministic illumination color mapping strategy inspired by deterministic color style transfer; it is resolution-agnostic and can integrate any pre-trained AWB network as the backbone.
results: Experiments show at least 35 times faster processing on high-resolution images, with performance equivalent or superior to existing methods.

Abstract
Auto-white balance (AWB) correction is a critical operation in image signal processors for accurate and consistent color correction across various illumination scenarios. This paper presents a novel and efficient AWB correction method that achieves at least 35 times faster processing with equivalent or superior performance on high-resolution images for the current state-of-the-art methods. Inspired by deterministic color style transfer, our approach introduces deterministic illumination color mapping, leveraging learnable projection matrices for both canonical illumination form and AWB-corrected output. It involves feeding high-resolution images and corresponding latent representations into a mapping module to derive a canonical form, followed by another mapping module that maps the pixel values to those for the corrected version. This strategy is designed as resolution-agnostic and also enables seamless integration of any pre-trained AWB network as the backbone. Experimental results confirm the effectiveness of our approach, revealing significant performance improvements and reduced time complexity compared to state-of-the-art methods. Our method provides an efficient deep learning-based AWB correction solution, promising real-time, high-quality color correction for digital imaging applications. Source code is available at https://github.com/birdortyedi/DeNIM/
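The two-stage mapping described above (pixels to a canonical form, then to the corrected output) is resolution-agnostic because each step acts on one pixel at a time. The sketch below illustrates that property with fixed 3x3 matrices; in the paper the projection matrices are learned, so the placeholder matrices and function names here are assumptions for illustration only.

```python
def apply_matrix(pixel, matrix):
    """Multiply one RGB pixel by a 3x3 colour matrix."""
    return tuple(sum(m * c for m, c in zip(row, pixel)) for row in matrix)

def correct_image(pixels, to_canonical, to_corrected):
    """Map every pixel to the canonical form, then to the corrected output.
    The cost is linear in the number of pixels and the matrices do not depend on resolution."""
    return [apply_matrix(apply_matrix(p, to_canonical), to_corrected) for p in pixels]

# Placeholder matrices: a per-channel gain followed by an identity mapping.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
GAINS = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.5]]
```

Because the matrices could be estimated once from a small input and then applied to an image of any size, the expensive network would not need to run at full resolution. That is one plausible reading of how such a design saves time, not a description of the authors' implementation.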

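Summary
Auto-white balance (AWB) correction is a critical operation in image signal processors, ensuring accurate and consistent color correction across illumination scenarios. This paper describes a new, efficient AWB correction method that processes high-resolution images at least 35 times faster than existing methods with equivalent or better performance. The approach uses learnable projection matrices for both the canonical illumination form and the AWB-corrected output. High-resolution images and their latent representations are fed into a mapping module to derive a canonical form, and a second mapping module then maps the pixel values to those of the corrected image. The strategy is resolution-agnostic and allows any pre-trained AWB network to serve as the backbone. Experiments confirm significant performance gains and reduced processing time compared with existing methods. The method offers an efficient deep learning-based AWB correction solution that promises real-time, high-quality color correction for digital imaging applications. Source code is available at the repository linked above.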

TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models

paper_url: http://arxiv.org/abs/2308.03906
repo_url: None
paper_authors: Indranil Sur, Karan Sikka, Matthew Walmer, Kaushik Koneripalli, Anirban Roy, Xiao Lin, Ajay Divakaran, Susmit Jha
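for: Defending multimodal models against dual-key backdoor attacks.
methods: Reverse-engineers the trigger through a joint optimization carried out across the image and text modalities.
results: On the TrojVQA benchmark, TIJO improves on state-of-the-art unimodal methods, raising AUC from 0.6 to 0.92 on multimodal dual-key backdoors, and it also improves on the unimodal baselines for unimodal backdoors.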

Abstract
We present a Multimodal Backdoor Defense technique TIJO (Trigger Inversion using Joint Optimization). Recent work arXiv:2112.07668 has demonstrated successful backdoor attacks on multimodal models for the Visual Question Answering task. Their dual-key backdoor trigger is split across two modalities (image and text), such that the backdoor is activated if and only if the trigger is present in both modalities. We propose TIJO that defends against dual-key attacks through a joint optimization that reverse-engineers the trigger in both the image and text modalities. This joint optimization is challenging in multimodal models due to the disconnected nature of the visual pipeline which consists of an offline feature extractor, whose output is then fused with the text using a fusion module. The key insight enabling the joint optimization in TIJO is that the trigger inversion needs to be carried out in the object detection box feature space as opposed to the pixel space. We demonstrate the effectiveness of our method on the TrojVQA benchmark, where TIJO improves upon the state-of-the-art unimodal methods from an AUC of 0.6 to 0.92 on multimodal dual-key backdoors. Furthermore, our method also improves upon the unimodal baselines on unimodal backdoors. We present ablation studies and qualitative results to provide insights into our algorithm such as the critical importance of overlaying the inverted feature triggers on all visual features during trigger inversion. The prototype implementation of TIJO is available at https://github.com/SRI-CSL/TIJO.

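Summary
The authors present TIJO (Trigger Inversion using Joint Optimization), a multimodal backdoor defense technique. Recent work (arXiv:2112.07668) demonstrated successful backdoor attacks on multimodal models. The dual-key backdoor trigger is split across two modalities (image and text), and the backdoor activates only when the trigger is present in both. TIJO defends against dual-key attacks through a joint optimization that reverse-engineers the trigger in both the image and text modalities. This joint optimization is challenging in multimodal models because the visual pipeline is disconnected: it consists of an offline feature extractor whose output is then fused with the text. The key insight is that trigger inversion should be carried out in the object detection box feature space rather than in pixel space. On the TrojVQA benchmark, TIJO raises AUC from 0.6 to 0.92 on multimodal dual-key backdoors and also improves on unimodal baselines for unimodal backdoors. Ablation studies and qualitative results give further insight into the algorithm, such as the critical importance of overlaying the inverted feature triggers on all visual features during trigger inversion. The prototype implementation is available at https://github.com/SRI-CSL/TIJO.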

Developability Approximation for Neural Implicits through Rank Minimization

paper_url: http://arxiv.org/abs/2308.03900
repo_url: None
paper_authors: Pratheba Selvaraju
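for: This paper presents a method for approximating developable surfaces, that is, 3D surfaces that can be formed from a two-dimensional plane without tearing or shearing.
methods: The method builds on neural implicit surfaces and adds a regularization term on their second-order derivatives that promotes zero Gaussian curvature, using rank minimization techniques.
results: Experiments on developable and non-developable surfaces, including surfaces affected by noise, validate the method's generalizability.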

Abstract
Developability refers to the process of creating a surface without any tearing or shearing from a two-dimensional plane. It finds practical applications in the fabrication industry. An essential characteristic of a developable 3D surface is its zero Gaussian curvature, which means that either one or both of the principal curvatures are zero. This paper introduces a method for reconstructing an approximate developable surface from a neural implicit surface. The central idea of our method involves incorporating a regularization term that operates on the second-order derivatives of the neural implicits, effectively promoting zero Gaussian curvature. Implicit surfaces offer the advantage of smoother deformation with infinite resolution, overcoming the high polygonal constraints of state-of-the-art methods using discrete representations. We draw inspiration from the properties of surface curvature and employ rank minimization techniques derived from compressed sensing. Experimental results on both developable and non-developable surfaces, including those affected by noise, validate the generalizability of our method.
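The link between second-order derivatives and Gaussian curvature can be made concrete with the standard closed-form expression for an implicit surface f = 0: K = -det([[H, g], [g^T, 0]]) / |g|^4, where g is the gradient and H the Hessian of f. The sketch below evaluates it for two analytic surfaces. It illustrates the curvature property only and is not the regularizer used in the paper; the function names are hypothetical.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total

def gaussian_curvature(grad, hess):
    """Gaussian curvature of the level set f = 0 from the gradient and Hessian of f."""
    bordered = [list(hess[i]) + [grad[i]] for i in range(3)] + [list(grad) + [0.0]]
    norm_sq = sum(g * g for g in grad)
    return -det(bordered) / norm_sq ** 2

# Sphere x^2 + y^2 + z^2 = 4 at (2, 0, 0): gradient (4, 0, 0), Hessian 2I.
sphere_k = gaussian_curvature([4.0, 0.0, 0.0],
                              [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]])
# Cylinder x^2 + y^2 = 1 at (1, 0, 0): gradient (2, 0, 0), Hessian diag(2, 2, 0).
cylinder_k = gaussian_curvature([2.0, 0.0, 0.0],
                                [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]])
```

The sphere of radius 2 gives K = 1/4, while the cylinder, a developable surface, gives K = 0 because its Hessian loses rank along the axis direction. This rank deficiency is the property that a rank-minimizing regularizer on second derivatives can encourage.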

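Summary
Developability refers to creating a surface from a two-dimensional plane without any tearing or shearing, and it has practical applications in the fabrication industry. An essential characteristic of a developable 3D surface is zero Gaussian curvature, meaning that one or both principal curvatures are zero. This paper introduces a method for reconstructing an approximate developable surface from a neural implicit surface. The central idea is a regularization term on the second-order derivatives of the neural implicit that promotes zero Gaussian curvature. Implicit surfaces offer smoother deformation with infinite resolution, overcoming the high polygon-count constraints of existing methods that use discrete representations. The authors draw on the properties of surface curvature and employ rank minimization techniques derived from compressed sensing. Experiments on developable and non-developable surfaces, including noisy ones, show that the method generalizes.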

From Sky to the Ground: A Large-scale Benchmark and Simple Baseline Towards Real Rain Removal

paper_url: http://arxiv.org/abs/2308.03867
repo_url: https://github.com/yunguo224/lhp-rain
paper_authors: Yun Guo, Xueyao Xiao, Yi Chang, Shumin Deng, Luxin Yan
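for: Advancing real image deraining (RID) by addressing the lack of large-scale, high-quality paired training samples.
methods: Constructs a large-scale, high-quality paired real rain dataset (LHP-Rain) with 3000 video sequences and 1 million high-resolution (1920*1080) frame pairs. Proposes a robust low-rank tensor recovery model that generates ground truth by better separating the static background from the dynamic rain. Designs a simple transformer-based single-image deraining baseline that uses both self-attention and cross-layer attention to obtain discriminative feature representations.
results: Extensive experiments indicate that the proposed dataset and deraining method outperform existing methods.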

Abstract
Learning-based image deraining methods have made great progress. However, the lack of large-scale high-quality paired training samples is the main bottleneck hampering real image deraining (RID). To address this dilemma and advance RID, we construct a Large-scale High-quality Paired real rain benchmark (LHP-Rain), including 3000 video sequences with 1 million high-resolution (1920*1080) frame pairs. The advantages of the proposed dataset over the existing ones are three-fold: rain with higher diversity and larger scale, images with higher resolution, and higher-quality ground truth. Specifically, the real rains in LHP-Rain not only contain the classical rain streak/veiling/occlusion in the sky, but also the splashing on the ground overlooked by the deraining community. Moreover, we propose a novel robust low-rank tensor recovery model to generate the GT by better separating the static background from the dynamic rain. In addition, we design a simple transformer-based single image deraining baseline, which simultaneously utilizes the self-attention and cross-layer attention within the image and rain layer with discriminative feature representation. Extensive experiments verify the superiority of the proposed dataset and deraining method over the state of the art.

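Summary
Learning-based image deraining methods have made great progress. However, the lack of large-scale, high-quality paired training samples is the main bottleneck for real image deraining (RID). To address this and advance RID, we construct a large-scale, high-quality paired real rain benchmark (LHP-Rain) with 3000 video sequences and 1 million high-resolution (1920*1080) frame pairs. Compared with existing datasets, the rain in LHP-Rain is more diverse and larger in scale, the images have higher resolution, and the ground truth is of higher quality. Specifically, the rain in LHP-Rain includes not only the classical rain streaks, veiling, and occlusion in the sky, but also splashing on the ground, which the deraining community has largely overlooked. We also propose a robust low-rank tensor recovery model that generates the ground truth (GT) by better separating the static background from the dynamic rain. In addition, we design a simple transformer-based single-image deraining baseline that uses both self-attention and cross-layer attention within the image and rain layers to obtain discriminative feature representations. Extensive experiments verify the superiority of the proposed dataset and deraining method.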

DefCor-Net: Physics-Aware Ultrasound Deformation Correction

paper_url: http://arxiv.org/abs/2308.03865
repo_url: https://github.com/karolinezhy/defcornet
paper_authors: Zhongliang Jiang, Yue Zhou, Dongliang Cao, Nassir Navab
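for: This paper aims to correct deformation in ultrasound (US) image acquisition in order to improve the accuracy and consistency of diagnosis.
methods: It proposes an anatomy-aware deformation correction approach based on a coarse-to-fine, multi-scale deep neural network (DefCor-Net). A U-shaped feature extractor estimates pixel-wise stiffness online, and the deformation field is then computed by polynomial regression that integrates the measured probe force.
results: Experiments show that DefCor-Net corrects deformation in US images with high accuracy and recovers the original geometry (Dice Coefficient: from $14.3\pm20.9$ to $82.6\pm12.1$ when the force is $6N$).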

Abstract
The recovery of morphologically accurate anatomical images from deformed ones is challenging in ultrasound (US) image acquisition, but crucial to accurate and consistent diagnosis, particularly in the emerging field of computer-assisted diagnosis. This article presents a novel anatomy-aware deformation correction approach based on a coarse-to-fine, multi-scale deep neural network (DefCor-Net). To achieve pixel-wise performance, DefCor-Net incorporates biomedical knowledge by estimating pixel-wise stiffness online using a U-shaped feature extractor. The deformation field is then computed using polynomial regression by integrating the measured force applied by the US probe. Based on real-time estimation of pixel-by-pixel tissue properties, the learning-based approach enables the potential for anatomy-aware deformation correction. To demonstrate the effectiveness of the proposed DefCor-Net, images recorded at multiple locations on forearms and upper arms of six volunteers are used to train and validate DefCor-Net. The results demonstrate that DefCor-Net can significantly improve the accuracy of deformation correction to recover the original geometry (Dice Coefficient: from $14.3\pm20.9$ to $82.6\pm12.1$ when the force is $6N$).
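The Dice coefficient quoted in the abstract measures the overlap between a predicted mask and a reference mask. The sketch below shows the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), on toy binary masks; the function name and the masks are assumptions for illustration, and the paper reports the value as a percentage.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two binary masks given as nested lists of 0/1."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    size = sum(1 for p in flat_pred if p) + sum(1 for t in flat_target if t)
    # Two empty masks are treated as a perfect match.
    return 1.0 if size == 0 else 2.0 * intersection / size


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

score = dice_coefficient(pred, target)  # 2 shared pixels, 3 + 3 foreground pixels
print(f"Dice: {100 * score:.1f}%")
```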

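Summary
Recovering morphologically accurate images from deformed ones is challenging in ultrasound (US) image acquisition, yet it is crucial for accurate and consistent diagnosis, particularly in computer-assisted diagnosis. This article presents an anatomy-aware deformation correction approach based on a multi-scale deep neural network (DefCor-Net). To achieve pixel-wise performance, DefCor-Net incorporates biomedical knowledge by estimating the stiffness of each pixel online. The deformation field is computed by polynomial regression that integrates the measured force applied by the US probe. Because it relies on real-time estimation of per-pixel tissue properties, this learning-based approach has the potential for anatomy-aware deformation correction. To demonstrate the effectiveness of DefCor-Net, images recorded at multiple locations on the forearms and upper arms of six volunteers are used for training and validation. The results show that DefCor-Net significantly improves the accuracy of deformation correction (Dice Coefficient: from 14.3±20.9 to 82.6±12.1 when the force is 6 N).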

High-Throughput and Accurate 3D Scanning of Cattle Using Time-of-Flight Sensors and Deep Learning

paper_url: http://arxiv.org/abs/2308.03861
repo_url: None
paper_authors: Gbenga Omotara, Seyed Mohamad Ali Tousi, Jared Decker, Derek Brake, Guilherme N. DeSouza
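for: This paper develops a high-throughput 3D scanning solution for precisely measuring cattle phenotypes.
methods: The system uses an array of depth sensors, namely time-of-flight (ToF) sensors, each controlled by a dedicated embedded device. It generates high-fidelity 3D point clouds, from which an accurate mesh of the cattle geometry is reconstructed.
results: According to the experiments, the proposed system can produce high-quality cattle meshes. The scanner was also validated on volume and surface-area measurements of known objects.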

Abstract
We introduce a high throughput 3D scanning solution specifically designed to precisely measure cattle phenotypes. This scanner leverages an array of depth sensors, i.e. time-of-flight (Tof) sensors, each governed by dedicated embedded devices. The system excels at generating high-fidelity 3D point clouds, thus facilitating an accurate mesh that faithfully reconstructs the cattle geometry on the fly. In order to evaluate the performance of our system, we have implemented a two-fold validation process. Initially, we test the scanner's competency in determining volume and surface area measurements within a controlled environment featuring known objects. Secondly, we explore the impact and necessity of multi-device synchronization when operating a series of time-of-flight sensors. Based on the experimental results, the proposed system is capable of producing high-quality meshes of untamed cattle for livestock studies.
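The first validation step measures volume and surface area from a reconstructed mesh. A minimal sketch of how these quantities can be computed for a closed, consistently oriented triangle mesh follows; it uses the standard signed-tetrahedron sum and is not taken from the paper. The function name and the test tetrahedron are assumptions for illustration.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def mesh_volume_and_area(vertices, faces):
    """Volume and surface area of a closed, consistently oriented triangle mesh."""
    volume = 0.0
    area = 0.0
    for i, j, k in faces:
        a, b, c = vertices[i], vertices[j], vertices[k]
        # Signed volume of the tetrahedron spanned by the origin and the face.
        volume += dot(a, cross(b, c)) / 6.0
        # Triangle area is half the norm of the edge cross product.
        normal = cross(sub(b, a), sub(c, a))
        area += 0.5 * math.sqrt(dot(normal, normal))
    return abs(volume), area


# Unit right tetrahedron with outward-facing triangles.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
faces = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]

volume, area = mesh_volume_and_area(vertices, faces)
print(volume, area)
```

The volume sum is only meaningful for a watertight mesh, so a scan with holes would need to be repaired before this kind of measurement.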

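Summary
We introduce a high-throughput 3D scanning solution designed to precisely measure cattle phenotypes. The scanner uses an array of depth sensors, namely time-of-flight (ToF) sensors, each controlled by a dedicated embedded device. The system generates high-fidelity 3D point clouds, enabling an accurate mesh that faithfully reconstructs the cattle geometry. To evaluate the system, we carried out a two-fold validation. First, we tested the scanner's ability to measure the volume and surface area of known objects in a controlled environment. Second, we explored the impact and necessity of multi-device synchronization when operating several time-of-flight sensors. Based on the experimental results, the system can produce high-quality cattle meshes for livestock studies.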

3D Motion Magnification: Visualizing Subtle Motions with Time Varying Radiance Fields

paper_url: http://arxiv.org/abs/2308.03757
repo_url: None
paper_authors: Brandon Y. Feng, Hadi Alzayer, Michael Rubinstein, William T. Freeman, Jia-Bin Huang
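for: This paper aims to help visualize subtle, imperceptible motion, in particular in scenes captured by a moving camera.
methods: It proposes a 3D motion magnification method based on time-varying radiance fields that can magnify subtle motions in scenes captured by a moving camera. The method follows the Eulerian principle, extracting and amplifying the variation over time of the embedding of a fixed point.
results: The method is studied and validated on a range of scenes and camera setups, which demonstrate its effectiveness.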

Abstract
Motion magnification helps us visualize subtle, imperceptible motion. However, prior methods only work for 2D videos captured with a fixed camera. We present a 3D motion magnification method that can magnify subtle motions from scenes captured by a moving camera, while supporting novel view rendering. We represent the scene with time-varying radiance fields and leverage the Eulerian principle for motion magnification to extract and amplify the variation of the embedding of a fixed point over time. We study and validate our proposed principle for 3D motion magnification using both implicit and tri-plane-based radiance fields as our underlying 3D scene representation. We evaluate the effectiveness of our method on both synthetic and real-world scenes captured under various camera setups.
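The Eulerian principle mentioned in the abstract amplifies how a quantity observed at a fixed location changes over time, rather than tracking moving points. The toy sketch below applies it to a scalar time series. In the paper the quantity is the embedding of a fixed point in the radiance field; the scalar signal, the mean-subtraction filter, and the function name `magnify` are simplifying assumptions made here.

```python
def magnify(series, alpha):
    """Amplify the temporal variation of a value observed at one fixed point."""
    mean = sum(series) / len(series)
    # Keep the static part and scale the deviation from it by (1 + alpha).
    return [v + alpha * (v - mean) for v in series]


# A barely visible oscillation of +/- 0.01 around 1.0.
signal = [1.00, 1.01, 1.00, 0.99]
magnified = magnify(signal, alpha=9.0)  # deviations become +/- 0.1
print(magnified)
```

A practical system would typically replace the mean subtraction with a temporal band-pass filter so that only motion in a chosen frequency range is amplified.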

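Summary
Motion magnification helps us visualize subtle, imperceptible motion. However, prior methods only work for 2D videos captured with a fixed camera. We present a 3D motion magnification method that supports novel view rendering and can magnify subtle motions in scenes captured by a moving camera. We represent the scene with time-varying radiance fields and use the Eulerian principle to extract and amplify the variation of a point over time. We study and validate the approach using both implicit and tri-plane-based radiance fields as the scene representation, and we evaluate it on both synthetic and real-world scenes captured under various camera setups.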

FSD V2: Improving Fully Sparse 3D Object Detection with Virtual Voxels

paper_url: http://arxiv.org/abs/2308.03755
repo_url: https://github.com/tusen-ai/sst
paper_authors: Lue Fan, Feng Wang, Naiyan Wang, Zhaoxiang Zhang
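for: This paper proposes a simplification of FSDv1 that improves its general applicability and performance.
methods: It introduces the concept of virtual voxels, which replaces the clustering-based instance segmentation in FSDv1. Virtual voxels not only address the Center Feature Missing problem in fully sparse detectors but also make the framework simpler and more streamlined.
results: Experiments are conducted on three large-scale datasets: Waymo Open Dataset, Argoverse 2, and nuScenes. The results show that FSDv2 performs strongly in long-range scenarios and is competitive across diverse scenarios. The paper also provides detailed experimental analysis to support reproducibility and further research.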

Abstract
LiDAR-based fully sparse architecture has garnered increasing attention. FSDv1 stands out as a representative work, achieving impressive efficacy and efficiency, albeit with intricate structures and handcrafted designs. In this paper, we present FSDv2, an evolution that aims to simplify the previous FSDv1 while eliminating the inductive bias introduced by its handcrafted instance-level representation, thus promoting better general applicability. To this end, we introduce the concept of virtual voxels, which takes over the clustering-based instance segmentation in FSDv1. Virtual voxels not only address the notorious issue of the Center Feature Missing problem in fully sparse detectors but also endow the framework with a more elegant and streamlined approach. Consequently, we develop a suite of components to complement the virtual voxel concept, including a virtual voxel encoder, a virtual voxel mixer, and a virtual voxel assignment strategy. Through empirical validation, we demonstrate that the virtual voxel mechanism is functionally similar to the handcrafted clustering in FSDv1 while being more general. We conduct experiments on three large-scale datasets: Waymo Open Dataset, Argoverse 2 dataset, and nuScenes dataset. Our results showcase state-of-the-art performance on all three datasets, highlighting the superiority of FSDv2 in long-range scenarios and its general applicability to achieve competitive performance across diverse scenarios. Moreover, we provide comprehensive experimental analysis to elucidate the workings of FSDv2. To foster reproducibility and further research, we have open-sourced FSDv2 at https://github.com/tusen-ai/SST.

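Summary
LiDAR-based fully sparse architectures have attracted increasing attention. FSDv1 is a representative work that achieves impressive efficacy and efficiency, but with intricate structures and handcrafted designs. In this paper, we present FSDv2, an evolution of FSDv1 that aims to simplify the earlier design and to eliminate the inductive bias introduced by its instance-level representation, thereby improving general applicability. To this end, we introduce the concept of virtual voxels, which replaces the clustering-based instance segmentation in FSDv1. Virtual voxels not only address the Center Feature Missing problem but also give the framework a simpler and more streamlined form. We develop a set of components to complement virtual voxels, including a virtual voxel encoder, a virtual voxel mixer, and a virtual voxel assignment strategy. Through empirical validation, we show that the virtual voxel mechanism is functionally similar to the handcrafted clustering in FSDv1 while being more general. We conduct experiments on the Waymo Open Dataset, the Argoverse 2 dataset, and the nuScenes dataset. The results show state-of-the-art performance in long-range scenarios and competitive performance across diverse scenarios. We also provide comprehensive experimental analysis to explain how FSDv2 works. To foster reproducibility and further research, FSDv2 is open-sourced at https://github.com/tusen-ai/SST.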

Mask Frozen-DETR: High Quality Instance Segmentation with One GPU

paper_url: http://arxiv.org/abs/2308.03747
repo_url: None
paper_authors: Zhanhao Liang, Yuhui Yuan
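for: Studying how to build a strong instance segmenter with minimal training time and GPU resources, in contrast to most current approaches, which pursue higher accuracy by building more advanced frameworks.
methods: We propose a simple and general framework, Mask Frozen-DETR, which converts any existing DETR-based object detection model into a strong instance segmentation model. The method only requires training an additional lightweight mask network that predicts instance masks within the bounding boxes given by a frozen DETR-based object detector.
results: On the COCO test-dev split, the method outperforms the state-of-the-art instance segmentation method Mask DINO (55.3% vs. 54.7%) while being over 10X faster to train. All experiments can be trained on a single Tesla V100 GPU with 16 GB of memory, which demonstrates the efficiency of the framework.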

Abstract
In this paper, we aim to study how to build a strong instance segmenter with minimal training time and GPUs, as opposed to the majority of current approaches that pursue more accurate instance segmenters by building more advanced frameworks at the cost of longer training time and higher GPU requirements. To achieve this, we introduce a simple and general framework, termed Mask Frozen-DETR, which can convert any existing DETR-based object detection model into a powerful instance segmentation model. Our method only requires training an additional lightweight mask network that predicts instance masks within the bounding boxes given by a frozen DETR-based object detector. Remarkably, our method outperforms the state-of-the-art instance segmentation method Mask DINO in terms of performance on the COCO test-dev split (55.3% vs. 54.7%) while being over 10X faster to train. Furthermore, all of our experiments can be trained using only one Tesla V100 GPU with 16 GB of memory, demonstrating the significant efficiency of our proposed framework.

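Summary
In this paper, we study how to build a strong instance segmenter with minimal training time and GPUs, in contrast to most existing approaches, which improve accuracy by building more advanced frameworks at the cost of longer training time and higher GPU requirements. To this end, we propose a simple and general framework, Mask Frozen-DETR, which converts any existing DETR-based object detection model into a strong instance segmentation model. The method only requires training a lightweight mask network that predicts instance masks within the bounding boxes given by a frozen DETR-based object detector. Notably, the method outperforms the state-of-the-art instance segmentation method Mask DINO by 0.6 percentage points on the COCO test-dev split (55.3% vs. 54.7%) while being over 10 times faster to train. All experiments can be trained on a single Tesla V100 GPU with 16 GB of memory, which demonstrates the efficiency of the method.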

AdaptiveSAM: Towards Efficient Tuning of SAM for Surgical Scene Segmentation

paper_url: http://arxiv.org/abs/2308.03726
repo_url: https://github.com/jayparanjape/biastuning
paper_authors: Jay N. Paranjape, Nithin Gopalakrishnan Nair, Shameema Sikder, S. Swaroop Vedula, Vishal M. Patel
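for: This paper addresses a fundamental problem in AI-based surgical scene analysis, namely segmentation under data scarcity.
methods: It proposes AdaptiveSAM, an adaptation of the Segment-Anything (SAM) model that adjusts to new datasets quickly and enables text-prompted segmentation.
results: Experiments show that AdaptiveSAM outperforms current state-of-the-art methods on various medical imaging datasets, including surgery, ultrasound, and X-ray.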

Abstract
Segmentation is a fundamental problem in surgical scene analysis using artificial intelligence. However, the inherent data scarcity in this domain makes it challenging to adapt traditional segmentation techniques for this task. To tackle this issue, current research employs pretrained models and finetunes them on the given data. Even so, these require training deep networks with millions of parameters every time new data becomes available. A recently published foundation model, Segment-Anything (SAM), generalizes well to a large variety of natural images, hence tackling this challenge to a reasonable extent. However, SAM does not generalize well to the medical domain as is without utilizing a large amount of compute resources for fine-tuning and using task-specific prompts. Moreover, these prompts are in the form of bounding-boxes or foreground/background points that need to be annotated explicitly for every image, making this solution increasingly tedious with higher data size. In this work, we propose AdaptiveSAM - an adaptive modification of SAM that can adjust to new datasets quickly and efficiently, while enabling text-prompted segmentation. For finetuning AdaptiveSAM, we propose an approach called bias-tuning that requires a significantly smaller number of trainable parameters than SAM (less than 2%). At the same time, AdaptiveSAM requires negligible expert intervention since it uses free-form text as prompt and can segment the object of interest with just the label name as prompt. Our experiments show that AdaptiveSAM outperforms current state-of-the-art methods on various medical imaging datasets including surgery, ultrasound and X-ray. Code is available at https://github.com/JayParanjape/biastuning
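To see why tuning only bias terms keeps the trainable parameter count small, it helps to compare bias and weight counts for ordinary dense layers. The sketch below is a back-of-the-envelope calculation, not AdaptiveSAM's actual parameter budget: the layer shapes are hypothetical transformer-style MLP sizes, and the abstract does not say exactly which parameters bias-tuning trains.

```python
def bias_fraction(layer_shapes):
    """Fraction of parameters that are biases in a stack of dense layers."""
    weights = sum(d_in * d_out for d_in, d_out in layer_shapes)
    biases = sum(d_out for _, d_out in layer_shapes)
    return biases / (weights + biases)


# Hypothetical sizes: 12 blocks, each with a 768 -> 3072 -> 768 MLP.
layers = [(768, 3072), (3072, 768)] * 12

fraction = bias_fraction(layers)
print(f"bias parameters: {100 * fraction:.3f}% of the total")
```

Under these assumed shapes the biases are well under 1% of the parameters, which is consistent in spirit with the "less than 2%" figure in the abstract.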

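Summary
Segmentation is a fundamental problem in surgical scene analysis, but the scarcity of data in this domain makes it hard to adapt traditional segmentation techniques to the task. To address this, current research typically uses pretrained models and fine-tunes them. Even so, this requires training deep networks with millions of parameters every time new data becomes available. A recently published foundation model, Segment-Anything (SAM), generalizes to a wide variety of natural images and so alleviates the problem to some extent. However, SAM does not generalize well to the medical domain without large amounts of compute for fine-tuning and task-specific prompts. These prompts are usually bounding boxes or foreground/background points that must be annotated explicitly for every image, which makes the solution hard to scale. In this work, we propose AdaptiveSAM, an adaptive modification of SAM that adjusts quickly to new datasets and supports segmentation prompted by free-form text. For fine-tuning AdaptiveSAM we propose bias-tuning, which requires far fewer trainable parameters than SAM (less than 2%). AdaptiveSAM also needs very little expert intervention, since it uses free-form text as the prompt and can segment the object of interest given only its label name. Our experiments show that AdaptiveSAM performs strongly on various medical imaging datasets, including surgery, ultrasound, and X-ray. Code is available at https://github.com/JayParanjape/biastuning.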

Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation

paper_url: http://arxiv.org/abs/2308.03725
repo_url: None
paper_authors: Renjie Liang, Yiming Yang, Hui Lu, Li Li
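for: This work studies how to detect the timestamps of events described by a natural language query in untrimmed videos, and proposes an efficient method for doing so.
methods: It proposes an efficient multi-teacher model (EMTM) based on knowledge distillation, which transfers diverse knowledge from both heterogeneous and isomorphic networks to improve computational efficiency without sacrificing performance.
results: Experiments on three popular TSGV benchmarks show that the method is both effective and efficient without complex architectures or loss functions.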

Abstract
Temporal Sentence Grounding in Videos (TSGV) aims to detect the event timestamps described by the natural language query from untrimmed videos. This paper discusses the challenge of achieving efficient computation in TSGV models while maintaining high performance. Most existing approaches exquisitely design complex architectures to improve accuracy with extra layers and loss, suffering from inefficiency and heaviness. Although some works have noticed that, they only make an issue of feature fusion layers, which can hardly enjoy the high-speed merit in the whole clunky network. To tackle this problem, we propose a novel efficient multi-teacher model (EMTM) based on knowledge distillation to transfer diverse knowledge from both heterogeneous and isomorphic networks. Specifically, we first unify different outputs of the heterogeneous models into one single form. Next, a Knowledge Aggregation Unit (KAU) is built to acquire high-quality integrated soft labels from multiple teachers. After that, the KAU module leverages the multi-scale video and global query information to adaptively determine the weights of different teachers. A Shared Encoder strategy is then proposed to solve the problem that the student shallow layers hardly benefit from teachers, in which an isomorphic teacher is collaboratively trained with the student to align their hidden states. Extensive experimental results on three popular TSGV benchmarks demonstrate that our method is both effective and efficient without bells and whistles.
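The idea of combining several teachers into one soft label and distilling it into a student can be sketched in a few lines. This is a generic illustration, not the paper's Knowledge Aggregation Unit: here the teacher weights are fixed by hand, whereas the abstract says they are determined adaptively from video and query information. All names and numbers below are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_soft_labels(teacher_logits, weights, temperature=1.0):
    """Weighted average of the teachers' softened distributions."""
    total = sum(weights)
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    size = len(probs[0])
    return [sum(w * p[i] for w, p in zip(weights, probs)) / total
            for i in range(size)]


def kl_divergence(target, pred):
    """KL(target || pred), used here as the distillation loss."""
    return sum(t * math.log(t / p) for t, p in zip(target, pred) if t > 0)


# Two teachers scoring three candidate positions (hypothetical logits).
teachers = [[2.0, 0.5, 0.1], [1.5, 1.0, 0.2]]
soft_label = aggregate_soft_labels(teachers, weights=[0.7, 0.3])

student = softmax([1.0, 1.0, 1.0])  # an untrained, uniform student
loss = kl_divergence(soft_label, student)
print(soft_label, loss)
```

Minimizing this loss pulls the student's distribution toward the aggregated teacher distribution, and the loss is zero when the two match.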

Automated Real Time Delineation of Supraclavicular Brachial Plexus in Neck Ultrasonography Videos: A Deep Learning Approach

paper_url: http://arxiv.org/abs/2308.03717
repo_url: None
paper_authors: Abhay Tyagi, Abhishek Tyagi, Manpreet Kaur, Jayanthi Sivaswami, Richa Aggarwal, Kapil Dev Soni, Anjan Trikha
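for: This study explores deep learning models for real-time delineation of the supraclavicular brachial plexus in neck ultrasound videos, to support operators in interpreting these images.
methods: 227 subjects were systematically scanned for the supraclavicular and interscalene brachial plexus in various settings using three different ultrasound machines, producing 227 unique videos. The video frames were annotated by experienced anaesthesiologists using partial automation with object tracking and active contour algorithms.
results: The results show that deep learning models can achieve real-time segmentation with high accuracy and reliability and can differentiate between the supraclavicular and the adjoining interscalene brachial plexus. Generalizability was also tested on datasets from separate ultrasound scanners, with and without fine-tuning.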

Abstract
Peripheral nerve blocks are crucial to treatment of post-surgical pain and are associated with reduction in perioperative opioid use and hospital stay. Accurate interpretation of sono-anatomy is critical for the success of ultrasound (US) guided peripheral nerve blocks and can be challenging to the new operators. This prospective study enrolled 227 subjects who were systematically scanned for supraclavicular and interscalene brachial plexus in various settings using three different US machines to create a dataset of 227 unique videos. In total, 41,000 video frames were annotated by experienced anaesthesiologists using partial automation with object tracking and active contour algorithms. Four baseline neural network models were trained on the dataset and their performance was evaluated for object detection and segmentation tasks. Generalizability of the best suited model was then tested on the datasets constructed from separate US scanners with and without fine-tuning. The results demonstrate that deep learning models can be leveraged for real time segmentation of supraclavicular brachial plexus in neck ultrasonography videos with high accuracy and reliability. Model was also tested for its ability to differentiate between supraclavicular and adjoining interscalene brachial plexus. The entire dataset has been released publicly for further study by the research community.

Summary
Peripheral nerve blocks are essential for treating post-surgical pain and are associated with reduced perioperative opioid use and shorter hospital stays. Accurate interpretation of sono-anatomy is critical to the success of ultrasound (US) guided nerve blocks and can be challenging for new operators. This prospective study enrolled 227 subjects who were systematically scanned for the supraclavicular and interscalene brachial plexus using three different US machines to create a dataset of 227 unique videos. In total, 41,000 video frames were annotated by experienced anaesthesiologists using partial automation with object tracking and active contour algorithms. Four baseline neural network models were trained on the dataset and evaluated on object detection and segmentation tasks. The generalizability of the best-suited model was tested on data from different US scanners, with and without fine-tuning. The results show that deep learning models can be used for real-time segmentation of the supraclavicular brachial plexus in neck ultrasonography videos with high accuracy and reliability. The model was also tested for its ability to distinguish between the supraclavicular and the adjoining interscalene brachial plexus. The entire dataset has been released publicly for further study by the research community.

Scaling may be all you need for achieving human-level object recognition capacity with human-like visual experience

paper_url: http://arxiv.org/abs/2308.03712
repo_url: https://github.com/eminorhan/humanlike-vits
paper_authors: A. Emin Orhan
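for: This study asks whether current self-supervised learning methods, if sufficiently scaled up, can reach human-level visual object recognition from human-like visual experience.
methods: It uses masked autoencoders (MAEs) as the self-supervised learning algorithm and runs a scaling experiment over data size, model size, and image resolution.
results: Human-level object recognition capacity appears reachable if data size, model size, and image resolution are scaled up simultaneously. For example, a 2.5B-parameter ViT model trained on 20K hours (2.3 years) of human-like video data at a spatial resolution of 952x952 pixels is estimated to reach roughly human-level accuracy on ImageNet.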

Abstract
This paper asks whether current self-supervised learning methods, if sufficiently scaled up, would be able to reach human-level visual object recognition capabilities with the same type and amount of visual experience humans learn from. Previous work on this question only considered the scaling of data size. Here, we consider the simultaneous scaling of data size, model size, and image resolution. We perform a scaling experiment with vision transformers up to 633M parameters in size (ViT-H/14) trained with up to 5K hours of human-like video data (long, continuous, mostly egocentric videos) with image resolutions of up to 476x476 pixels. The efficiency of masked autoencoders (MAEs) as a self-supervised learning algorithm makes it possible to run this scaling experiment on an unassuming academic budget. We find that it is feasible to reach human-level object recognition capacity at sub-human scales of model size, data size, and image size, if these factors are scaled up simultaneously. To give a concrete example, we estimate that a 2.5B parameter ViT model trained with 20K hours (2.3 years) of human-like video data with a spatial resolution of 952x952 pixels should be able to reach roughly human-level accuracy on ImageNet. Human-level competence is thus achievable for a fundamental perceptual capability from human-like perceptual experience (human-like in both amount and type) with extremely generic learning algorithms and architectures and without any substantive inductive biases.
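The figures in the abstract's concrete example can be put side by side with the largest configuration that was actually trained. The short calculation below only restates numbers from the abstract; reading "2.3 years" as continuous, around-the-clock time is an assumption.

```python
# Extrapolated configuration vs. the largest configuration actually trained.
hours_extrapolated = 20_000
hours_trained = 5_000

years = hours_extrapolated / (24 * 365)         # continuous time
data_ratio = hours_extrapolated / hours_trained
pixel_ratio = (952 * 952) / (476 * 476)         # pixels per image
param_ratio = 2.5e9 / 633e6                     # model parameters

print(f"{years:.1f} years of video")
print(f"data x{data_ratio:.0f}, pixels x{pixel_ratio:.0f}, parameters x{param_ratio:.1f}")
```

Each of the three factors is scaled by roughly 4x relative to the largest experiment, which matches the abstract's emphasis on scaling them together.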

Prototype Learning for Out-of-Distribution Polyp Segmentation

paper_url: http://arxiv.org/abs/2308.03709
repo_url: None
paper_authors: Nikhil Kumar Tomar, Debesh Jha, Ulas Bagci
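for: The goal of this study is to build a robust, well-generalized polyp segmentation model that gives reliable segmentation results on datasets from different centers.
methods: The model incorporates several lighting modes, namely white light imaging (WLI), blue light imaging (BLI), linked color imaging (LCI), and flexible spectral imaging color enhancement (FICE), and learns prototypes that represent the characteristic features of each object class, such as shape, texture, and color.
results: The model reaches a Dice coefficient of ≥ 90% and an mIoU of ≥ 85% with near real-time processing speed. It outperforms 16 state-of-the-art image segmentation architectures on out-of-distribution datasets, which could improve clinical outcomes.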

Abstract
Existing polyp segmentation models from colonoscopy images often fail to provide reliable segmentation results on datasets from different centers, limiting their applicability. Our objective in this study is to create a robust and well-generalized segmentation model named PrototypeLab that can assist in polyp segmentation. To achieve this, we incorporate various lighting modes such as White light imaging (WLI), Blue light imaging (BLI), Linked color imaging (LCI), and Flexible spectral imaging color enhancement (FICE) into our new segmentation model, which learns to create prototypes for each class of object present in the images. These prototypes represent the characteristic features of the objects, such as their shape, texture, and color. Our model is designed to perform effectively on out-of-distribution (OOD) datasets from multiple centers. We first generate a coarse mask that is used to learn prototypes for the main object class, which are then employed to generate the final segmentation mask. By using prototypes to represent the main class, our approach handles the variability present in the medical images and generalizes well to new data, since prototypes capture the underlying distribution of the data. PrototypeLab offers a promising solution with a dice coefficient of $\geq$ 90% and mIoU $\geq$ 85% with a near real-time processing speed for polyp segmentation. It achieved superior performance on OOD datasets compared to 16 state-of-the-art image segmentation architectures, potentially improving clinical outcomes. Codes are available at https://github.com/xxxxx/PrototypeLab.
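The coarse-mask-then-prototype flow described in the abstract can be illustrated with a toy example. The sketch below assumes masked average pooling to form prototypes and cosine similarity to assign pixels; both are common choices in prototype-based segmentation but are assumptions here, not details confirmed by the abstract. The feature vectors are made up.

```python
import math


def masked_average(features, mask):
    """Prototype = mean feature vector over the pixels selected by the mask."""
    selected = [f for f, m in zip(features, mask) if m]
    dim = len(features[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def segment(features, fg_proto, bg_proto):
    """Label each pixel by whichever prototype it is more similar to."""
    return [1 if cosine(f, fg_proto) > cosine(f, bg_proto) else 0
            for f in features]


# Four pixels with 2-D features (hypothetical values).
features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
coarse_mask = [1, 1, 0, 0]

fg_proto = masked_average(features, coarse_mask)
bg_proto = masked_average(features, [1 - m for m in coarse_mask])
final_mask = segment(features, fg_proto, bg_proto)
print(fg_proto, bg_proto, final_mask)
```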

Summary
Existing polyp segmentation models for colonoscopy images often fail to give reliable segmentation results on datasets from different centers, which limits their applicability. Our goal is to create a robust, well-generalized segmentation model named PrototypeLab that can assist in polyp segmentation. To achieve this, we incorporate several lighting modes into the new segmentation model: white light imaging (WLI), blue light imaging (BLI), linked color imaging (LCI), and flexible spectral imaging color enhancement (FICE). The model learns a prototype for each class, and these prototypes represent characteristic features of the objects such as shape, texture, and color. The model is designed to perform well on datasets from multiple centers and to process data quickly. We first generate a coarse mask and use it to learn prototypes for the main class, then use these prototypes to generate the final segmentation mask. By representing the main class with prototypes, the approach handles the variability in medical images and adapts well to new data, because the prototypes capture the underlying distribution of the data. PrototypeLab offers a promising solution, with a Dice coefficient ≥ 90% and mIoU ≥ 85% at near real-time processing speed. It performs well on datasets from multiple centers and outperforms 16 state-of-the-art image segmentation models, potentially improving clinical outcomes. Code is available at https://github.com/xxxxx/PrototypeLab.

Video-based Person Re-identification with Long Short-Term Representation Learning

paper_url: http://arxiv.org/abs/2308.03703
repo_url: None
paper_authors: Xuehu Liu, Pingping Zhang, Huchuan Lu
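for: Video-based person re-identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapping cameras, a fundamental task for many multimedia and computer vision applications. Variations in persons and scenes still pose many obstacles to high performance.
methods: Observing that both long-term and short-term information about persons matters for robust video representations, we propose a deep learning framework named Long Short-Term Representation Learning (LSTRL). A Multi-granularity Appearance Extractor (MAE) captures appearance at four granularities across multiple frames, and a Bi-direction Motion Estimator (BME) efficiently extracts reciprocal motion information from consecutive frames. Both modules can be combined with existing networks to improve feature learning.
results: Extensive experiments on three widely used benchmarks show that the approach delivers better performance than most state-of-the-art methods.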

Abstract
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapped cameras. As a fundamental task, it spreads many multimedia and computer vision applications. However, due to the variations of persons and scenes, there are still many obstacles that must be overcome for high performance. In this work, we notice that both the long-term and short-term information of persons are important for robust video representations. Thus, we propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID. More specifically, to extract long-term representations, we propose a Multi-granularity Appearance Extractor (MAE), in which four granularity appearances are effectively captured across multiple frames. Meanwhile, to extract short-term representations, we propose a Bi-direction Motion Estimator (BME), in which reciprocal motion information is efficiently extracted from consecutive frames. The MAE and BME are plug-and-play and can be easily inserted into existing networks for efficient feature learning. As a result, they significantly improve the feature representation ability for V-ReID. Extensive experiments on three widely used benchmarks show that our proposed approach can deliver better performances than most state-of-the-arts.

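Summary
Video-based person re-identification (V-ReID) aims to retrieve specific persons from videos captured by non-overlapping cameras. As a fundamental task, it is widely used in multimedia and computer vision. However, because of variations in persons and scenes, V-ReID still faces many obstacles. In this work, we observe that both the long-term and short-term information of persons are important for robust video representations. We therefore propose a deep learning framework, Long Short-Term Representation Learning (LSTRL), to improve V-ReID performance. Specifically, we propose a Multi-granularity Appearance Extractor (MAE) that captures appearance at four granularities across multiple frames, and a Bi-direction Motion Estimator (BME) that efficiently extracts reciprocal motion information from consecutive frames. Both MAE and BME can be combined with existing networks to improve feature learning. Experiments on three widely used benchmarks show that the proposed method achieves strong performance.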

Screen-based 3D Subjective Experiment Software

paper_url: http://arxiv.org/abs/2308.03698
repo_url: None
paper_authors: Songlin Fan, Wei Gao
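for: This study develops a platform for 3D subjective quality assessment with which users can flexibly design 3D subjective methodologies and build high-quality subjective quality datasets, to advance 3D graphics subjective quality research.
methods: The software renders the source stimulus and the impaired stimulus simultaneously and lets both respond synchronously to viewer interactions, so that perceptual quality differences between 3D stimuli are shown accurately.
results: Experimental analysis shows that subjective tests run with the software produce reasonable subjective quality scores for 3D models.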

Abstract
Recently, widespread 3D graphics (e.g., point clouds and meshes) have drawn considerable efforts from academia and industry to assess their perceptual quality by conducting subjective experiments. However, lacking a handy software for 3D subjective experiments complicates the construction of 3D graphics quality assessment datasets, thus hindering the prosperity of relevant fields. In this paper, we develop a powerful platform with which users can flexibly design their 3D subjective methodologies and build high-quality datasets, easing a broad spectrum of 3D graphics subjective quality study. To accurately illustrate the perceptual quality differences of 3D stimuli, our software can simultaneously render the source stimulus and impaired stimulus and allows both stimuli to respond synchronously to viewer interactions. Compared with amateur 3D visualization tool-based or image/video rendering-based schemes, our approach embodies typical 3D applications while minimizing cognitive overload during subjective experiments. We organized a subjective experiment involving 40 participants to verify the validity of the proposed software. Experimental analyses demonstrate that subjective tests on our software can produce reasonable subjective quality scores of 3D models. All resources in this paper can be found at https://openi.pcl.ac.cn/OpenDatasets/3DQA.

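Summary
Recently, widespread 3D graphics (such as point clouds and meshes) have attracted considerable effort from academia and industry to assess their perceptual quality through subjective experiments. However, the lack of convenient software for 3D subjective experiments makes it harder to build 3D graphics quality assessment datasets, which hinders progress in related fields. In this paper, we develop a powerful platform with which users can flexibly design their 3D subjective methodologies and build high-quality datasets, facilitating a broad range of 3D graphics subjective quality studies. To accurately illustrate perceptual quality differences between 3D stimuli, the software renders the source stimulus and the impaired stimulus simultaneously and lets both respond synchronously to viewer interactions. Compared with schemes based on amateur 3D visualization tools or on image/video rendering, our approach reflects typical 3D applications while reducing cognitive load during subjective experiments. We organized a subjective experiment with 40 participants to verify the validity of the software. Experimental analysis shows that the software can produce reasonable subjective quality scores for 3D models. All resources are available at https://openi.pcl.ac.cn/OpenDatasets/3DQA.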

Learning Concise and Descriptive Attributes for Visual Recognition

paper_url: http://arxiv.org/abs/2308.03685
repo_url: None
paper_authors: An Yan, Yu Wang, Yiwu Zhong, Chengyu Dong, Zexue He, Yujie Lu, William Wang, Jingbo Shang, Julian McAuley
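for: This study explores recent advances in foundation models and how they can improve interpretable visual recognition.
methods: It uses large language models (LLMs) to generate sets of attributes and then applies vision-language models to classify images via these attributes.
results: Large attribute sets can reach performance comparable to image features, but further investigation on 8 datasets shows that LLM-generated attributes contain a lot of noise. We propose a learning-to-search method that finds smaller yet effective attribute sets. On the CUB dataset, the method uses only 32 attributes to distinguish 200 bird species while approaching the performance of massive LLM-generated attribute sets (e.g., 10k attributes). The approach also offers higher interpretability and interactivity, and the ability to summarize knowledge.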

Abstract
Recent advances in foundation models present new opportunities for interpretable visual recognition -- one can first query Large Language Models (LLMs) to obtain a set of attributes that describe each class, then apply vision-language models to classify images via these attributes. Pioneering work shows that querying thousands of attributes can achieve performance competitive with image features. However, our further investigation on 8 datasets reveals that LLM-generated attributes in a large quantity perform almost the same as random words. This surprising finding suggests that significant noise may be present in these attributes. We hypothesize that there exist subsets of attributes that can maintain the classification performance with much smaller sizes, and propose a novel learning-to-search method to discover those concise sets of attributes. As a result, on the CUB dataset, our method achieves performance close to that of massive LLM-generated attributes (e.g., 10k attributes for CUB), yet using only 32 attributes in total to distinguish 200 bird species. Furthermore, our new paradigm demonstrates several additional benefits: higher interpretability and interactivity for humans, and the ability to summarize knowledge for a recognition task.
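Classifying an image "via attributes" can be pictured as scoring each class by how well the image matches that class's attribute descriptions in a shared embedding space. The sketch below is a generic illustration with made-up 3-D embeddings and mean-similarity scoring; the class names, vectors, and scoring rule are assumptions, and a real system would obtain embeddings from a vision-language model.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def classify_by_attributes(image_emb, class_attributes):
    """Score each class by the mean similarity between image and attributes."""
    scores = {}
    for name, attr_embs in class_attributes.items():
        scores[name] = sum(dot(image_emb, a) for a in attr_embs) / len(attr_embs)
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical attribute embeddings, two attributes per class.
class_attributes = {
    "cardinal": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "blue jay": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
}
image_emb = [0.9, 0.1, 0.0]

label, scores = classify_by_attributes(image_emb, class_attributes)
print(label, scores)
```

Because the decision is a sum over named attributes, each attribute's contribution to the score can be inspected directly, which is the interpretability benefit the abstract points to.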
