2023-09-09

cs.CV

cs.CV - 2023-09-09

Semi-supervised Instance Segmentation with a Learned Shape Prior

paper_url: http://arxiv.org/abs/2309.04888
repo_url: None
paper_authors: Long Chen, Weiwen Zhang, Yuli Wu, Martin Strauch, Dorit Merhof
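for: This paper addresses instance segmentation without requiring large amounts of annotated object contours.
methods: The method searches for target objects using a shape prior model that is learned with a variational autoencoder and needs only a very limited amount of training data.
results: In the experiments, a few dozen object shape patches from the target dataset, as well as purely synthetic shapes, were enough to match supervised methods with full access to training data on two of three cell segmentation datasets. With a synthetic shape prior, the method outperformed pre-trained supervised models with limited domain-specific training data on all three datasets.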

Abstract
To date, most instance segmentation approaches are based on supervised learning that requires a considerable amount of annotated object contours as training ground truth. Here, we propose a framework that searches for the target object based on a shape prior. The shape prior model is learned with a variational autoencoder that requires only a very limited amount of training data: in our experiments, a few dozen object shape patches from the target dataset, as well as purely synthetic shapes, were sufficient to achieve results on par with supervised methods with full access to training data on two out of three cell segmentation datasets. Our method with a synthetic shape prior was superior to pre-trained supervised models with access to limited domain-specific training data on all three datasets. Since the learning of prior models requires shape patches, whether real or synthetic data, we call this framework semi-supervised learning.

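Summary
To date, most instance segmentation methods rely on supervised learning and need a considerable amount of annotated object contours as ground truth. We propose a framework that searches for the target object based on a shape prior. The shape prior model is learned with a variational autoencoder and needs only very limited training data: in our experiments, a few dozen object shape patches from the target dataset, as well as purely synthetic shapes, were enough to match supervised methods with full access to training data on two of three cell segmentation datasets. With a synthetic shape prior, our method outperformed pre-trained supervised models with limited domain-specific training data on all three datasets. Because learning the prior requires shape patches, whether real or synthetic, we call the framework semi-supervised learning.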
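
The abstract states that the shape prior is learned with a variational autoencoder but does not give the loss. As a minimal sketch, assuming the standard VAE objective (binary cross-entropy reconstruction plus a KL term toward a unit Gaussian), the loss for one flattened shape patch could look as follows. The function names and the `beta` weight are illustrative assumptions, not taken from the paper.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def binary_cross_entropy(target, prob, eps=1e-7):
    """Summed per-pixel BCE between a binary shape mask and predicted probabilities."""
    total = 0.0
    for t, p in zip(target, prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total

def vae_loss(target_mask, predicted_mask, mu, log_var, beta=1.0):
    """Standard VAE objective (assumed here): reconstruction term plus beta * KL."""
    reconstruction = binary_cross_entropy(target_mask, predicted_mask)
    return reconstruction + beta * kl_to_standard_normal(mu, log_var)
```

The KL term is what turns the latent space into a usable prior: shapes decoded from samples near the origin should resemble the training patches.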

SortedAP: Rethinking evaluation metrics for instance segmentation

paper_url: http://arxiv.org/abs/2309.04887
repo_url: None
paper_authors: Long Chen, Yuli Wu, Johannes Stegmaier, Dorit Merhof
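for: Evaluation metrics for instance segmentation, which must jointly account for object detection and segmentation accuracy.
methods: The paper proposes a new metric called sortedAP, which strictly decreases with both object-level and pixel-level imperfections.
results: sortedAP has an uninterrupted penalization scale over the entire domain, avoiding the limited resolution and merely conditional sensitivity that the paper identifies in most existing metrics.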

Abstract
Designing metrics for evaluating instance segmentation revolves around comprehensively considering object detection and segmentation accuracy. However, other important properties, such as sensitivity, continuity, and equality, are overlooked in the current study. In this paper, we reveal that most existing metrics have a limited resolution of segmentation quality. They are only conditionally sensitive to the change of masks or false predictions. For certain metrics, the score can change drastically in a narrow range which could provide a misleading indication of the quality gap between results. Therefore, we propose a new metric called sortedAP, which strictly decreases with both object- and pixel-level imperfections and has an uninterrupted penalization scale over the entire domain. We provide the evaluation toolkit and experiment code at https://www.github.com/looooongChen/sortedAP.

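Summary
Designing metrics for evaluating instance segmentation revolves around comprehensively considering object detection and segmentation accuracy. Existing work, however, largely overlooks other important properties such as sensitivity, continuity, and equality. In this paper, we show that most existing metrics have a limited resolution of segmentation quality: they are only conditionally sensitive to changes in masks or to false predictions, and for some metrics the score can change drastically within a narrow range, which can give a misleading indication of the quality gap between results. We therefore propose a new metric called sortedAP, which strictly decreases with both object-level and pixel-level imperfections and has an uninterrupted penalization scale over the entire domain. The evaluation toolkit and experiment code are available at https://www.github.com/looooongChen/sortedAP.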
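
To illustrate what "conditionally sensitive" means, the sketch below contrasts AP at a single IoU threshold with the area under the whole AP-versus-threshold curve. It is a simplified illustration under stated assumptions, not the sortedAP toolkit: matches are assumed to be one-to-one and already computed, AP is taken as TP / (TP + FP + FN), and the function names are hypothetical.

```python
def ap_at_threshold(match_ious, n_pred, n_gt, threshold):
    """AP = TP / (TP + FP + FN); a match counts as TP when its IoU exceeds the threshold."""
    tp = sum(1 for iou in match_ious if iou > threshold)
    fp = n_pred - tp
    fn = n_gt - tp
    denom = tp + fp + fn
    return tp / denom if denom else 1.0

def area_under_ap_curve(match_ious, n_pred, n_gt):
    """Integrate the step-shaped AP(threshold) curve exactly over [0, 1]."""
    # AP can only change where a match drops out, i.e. at each matched IoU value.
    breakpoints = [0.0] + sorted(match_ious) + [1.0]
    area = 0.0
    for lo, hi in zip(breakpoints, breakpoints[1:]):
        area += (hi - lo) * ap_at_threshold(match_ious, n_pred, n_gt, lo)
    return area
```

With two predictions matched at IoU 0.9 and 0.6, degrading the first mask to IoU 0.85 leaves AP at threshold 0.5 unchanged, while the integrated score drops. A fixed-threshold metric cannot register that kind of change.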

AnyPose: Anytime 3D Human Pose Forecasting via Neural Ordinary Differential Equations

paper_url: http://arxiv.org/abs/2309.04840
repo_url: None
paper_authors: Zixing Wang, Ahmed H. Qureshi
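for: 3D human pose forecasting at any real-valued time step, for synchronous real-world human-machine interaction.
methods: Models human behavior dynamics with neural ordinary differential equations (Neural ODEs) in a lightweight continuous-time architecture.
results: On Human3.6M, AMASS, and 3DPW, AnyPose predicts future poses with high accuracy and takes significantly less computation time than traditional methods on anytime prediction tasks.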

Abstract
Anytime 3D human pose forecasting is crucial to synchronous real-world human-machine interaction, where the term "anytime" corresponds to predicting human pose at any real-valued time step. However, to the best of our knowledge, all the existing methods in human pose forecasting perform predictions at preset, discrete time intervals. Therefore, we introduce AnyPose, a lightweight continuous-time neural architecture that models human behavior dynamics with neural ordinary differential equations. We validate our framework on the Human3.6M, AMASS, and 3DPW datasets and conduct a series of comprehensive analyses towards comparison with existing methods and the intersection of human pose and neural ordinary differential equations. Our results demonstrate that AnyPose exhibits high-performance accuracy in predicting future poses and takes significantly lower computational time than traditional methods in solving anytime prediction tasks.

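Summary
Anytime 3D human pose forecasting is crucial to synchronous real-world human-machine interaction, where "anytime" means predicting the human pose at any real-valued time step. To the best of our knowledge, however, all existing methods predict at preset, discrete time intervals. We therefore introduce AnyPose, a lightweight continuous-time neural architecture that models human behavior dynamics with neural ordinary differential equations. We validate the framework on the Human3.6M, AMASS, and 3DPW datasets and carry out a series of comprehensive analyses, comparing it with existing methods and examining the intersection of human pose and neural ordinary differential equations. The results show that AnyPose predicts future poses with high accuracy and takes significantly less computation time than traditional methods on anytime prediction tasks.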
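
The point of a continuous-time model is that the pose can be queried at any real-valued time, not only at preset frames. The sketch below shows that idea with a fixed-step Euler integrator and a toy hand-written dynamics function. Both are assumptions for illustration: AnyPose learns its dynamics with a neural network, and the abstract does not say which ODE solver it uses.

```python
import math

def integrate(state, dynamics, t_end, max_step=0.01):
    """Integrate d(state)/dt = dynamics(state) from time 0 to t_end with fixed-step Euler."""
    if t_end < 0:
        raise ValueError("t_end must be non-negative")
    n_steps = max(1, math.ceil(t_end / max_step))
    dt = t_end / n_steps
    for _ in range(n_steps):
        derivative = dynamics(state)
        state = [s + dt * d for s, d in zip(state, derivative)]
    return state

def decay_toward(rest_pose, rate=1.0):
    """Toy stand-in for a learned dynamics network: joints relax toward a rest pose."""
    def dynamics(state):
        return [-rate * (s - r) for s, r in zip(state, rest_pose)]
    return dynamics
```

Calling `integrate(pose, decay_toward(rest), 0.37)` returns the state at t = 0.37, a time that need not coincide with any frame of a fixed-rate predictor.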

Neural Semantic Surface Maps

paper_url: http://arxiv.org/abs/2309.04836
repo_url: None
paper_authors: Luca Morreale, Noam Aigerman, Vladimir G. Kim, Niloy J. Mitra
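for: Computing a semantic surface-to-surface map between two genus-zero shapes that matches semantically corresponding regions to one another.
methods: Renders both shapes from multiple viewpoints, feeds the renders to an off-the-shelf image-matching method that uses a pre-trained vision model to produce feature points, projects the matches back onto the 3D shapes, and refines them into an inter-surface map with a dedicated optimization that promotes bijectivity and continuity.
results: Produces semantic surface-to-surface maps without manual annotation or 3D training data, and is effective both for semantically complex, non-isometrically related objects and for nearly isometric ones.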

Abstract
We present an automated technique for computing a map between two genus-zero shapes, which matches semantically corresponding regions to one another. Lack of annotated data prohibits direct inference of 3D semantic priors; instead, current State-of-the-art methods predominantly optimize geometric properties or require varying amounts of manual annotation. To overcome the lack of annotated training data, we distill semantic matches from pre-trained vision models: our method renders the pair of 3D shapes from multiple viewpoints; the resulting renders are then fed into an off-the-shelf image-matching method which leverages a pretrained visual model to produce feature points. This yields semantic correspondences, which can be projected back to the 3D shapes, producing a raw matching that is inaccurate and inconsistent between different viewpoints. These correspondences are refined and distilled into an inter-surface map by a dedicated optimization scheme, which promotes bijectivity and continuity of the output map. We illustrate that our approach can generate semantic surface-to-surface maps, eliminating manual annotations or any 3D training data requirement. Furthermore, it proves effective in scenarios with high semantic complexity, where objects are non-isometrically related, as well as in situations where they are nearly isometric.

Notes
* "genus-zero shapes" are closed surfaces without holes or handles, that is, surfaces topologically equivalent to a sphere.
* "semantic priors" refer to prior knowledge of the semantic meaning of the objects or regions in the scene.
* "manual annotation" refers to the process of labeling the objects or regions in the scene with semantic information.
* "pre-trained vision models" refer to deep learning models that have been trained on large image datasets to learn features and patterns.
* "image-matching method" refers to a technique that compares two images and finds corresponding points between them.
* "feature points" refer to the points in an image that the matching method identifies and uses to establish correspondences.
* "bijectivity" means the map is a one-to-one correspondence between the two surfaces, so every point on one surface maps to exactly one point on the other.
* "continuity" means the map has no gaps or jumps: nearby points on one surface map to nearby points on the other.

Few-Shot Medical Image Segmentation via a Region-enhanced Prototypical Transformer

paper_url: http://arxiv.org/abs/2309.04825
repo_url: https://github.com/yazhouzhu19/rpt
paper_authors: Yazhou Zhu, Shidong Wang, Tong Xin, Haofeng Zhang
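for: Few-shot segmentation of medical images, where automated segmentation of large image volumes is hampered by limited fully annotated data and large intra-class diversity.
methods: Proposes the Region-enhanced Prototypical Transformer (RPT), a few-shot learning method that subdivides the support foreground into regional prototypes and stacks Bias-alleviated Transformer (BaT) blocks with a self-selection mechanism to suppress interference.
results: Extensive experiments on three public medical image datasets show consistent improvements over state-of-the-art Few-Shot Medical Image Segmentation (FSMS) methods.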

Abstract
Automated segmentation of large volumes of medical images is often plagued by the limited availability of fully annotated data and the diversity of organ surface properties resulting from the use of different acquisition protocols for different patients. In this paper, we introduce a more promising few-shot learning-based method named Region-enhanced Prototypical Transformer (RPT) to mitigate the effects of large intra-class diversity/bias. First, a subdivision strategy is introduced to produce a collection of regional prototypes from the foreground of the support prototype. Second, a self-selection mechanism is proposed to incorporate into the Bias-alleviated Transformer (BaT) block to suppress or remove interferences present in the query prototype and regional support prototypes. By stacking BaT blocks, the proposed RPT can iteratively optimize the generated regional prototypes and finally produce rectified and more accurate global prototypes for Few-Shot Medical Image Segmentation (FSMS). Extensive experiments are conducted on three publicly available medical image datasets, and the obtained results show consistent improvements compared to state-of-the-art FSMS methods. The source code is available at: https://github.com/YazhouZhu19/RPT.

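Summary
Automated segmentation of large volumes of medical images is often hampered by the limited availability of fully annotated data and by the diversity of organ surface properties that results from different acquisition protocols for different patients. In this paper, we introduce a few-shot learning-based method named Region-enhanced Prototypical Transformer (RPT) to mitigate the effects of large intra-class diversity and bias. First, a subdivision strategy produces a collection of regional prototypes from the foreground of the support prototype. Second, a self-selection mechanism is incorporated into the Bias-alleviated Transformer (BaT) block to suppress or remove interference present in the query prototype and the regional support prototypes. By stacking BaT blocks, RPT iteratively optimizes the generated regional prototypes and finally produces rectified, more accurate global prototypes for Few-Shot Medical Image Segmentation (FSMS). Extensive experiments on three publicly available medical image datasets show consistent improvements over state-of-the-art FSMS methods. The source code is available at https://github.com/YazhouZhu19/RPT.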
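
The prototype idea can be sketched in a few lines: average the support features inside each foreground region to get one prototype per region, then label a query pixel as foreground when it is similar enough to any prototype. This is only the generic prototypical-segmentation step under assumed inputs (flat lists of per-pixel feature vectors and integer region labels with 0 as background). It omits the BaT blocks and self-selection mechanism, and the names and threshold are hypothetical.

```python
import math

def regional_prototypes(features, region_ids):
    """Average the feature vectors of each foreground region (region id 0 = background)."""
    groups = {}
    for feat, region in zip(features, region_ids):
        if region != 0:
            groups.setdefault(region, []).append(feat)
    return {r: [sum(col) / len(fs) for col in zip(*fs)] for r, fs in groups.items()}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def segment_query(query_features, prototypes, threshold=0.5):
    """Mark a query pixel as foreground if it is close to any regional prototype."""
    mask = []
    for feat in query_features:
        best = max(cosine_similarity(feat, p) for p in prototypes.values())
        mask.append(1 if best > threshold else 0)
    return mask
```

Keeping several regional prototypes instead of one global average preserves appearance differences inside the foreground, which a single mean vector would blur.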

ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-agnostic Counting

paper_url: http://arxiv.org/abs/2309.04820
repo_url: None
paper_authors: Michael A. Hobley, Victor A. Prisacariu
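for: Multi-class, class-agnostic counting, addressing the limitations of existing counting methods when more than one kind of object is present.
methods: Introduces a paradigm in which no exemplars are needed to guide the counting; instead, examples are found after the counting stage to help the user understand the generated outputs.
results: ABC123 outperforms contemporary methods on the MCAC dataset without human-in-the-loop annotations, and this performance transfers to FSC-147.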

Abstract
Class-agnostic counting methods enumerate objects of an arbitrary class, providing tremendous utility in many fields. Prior works have limited usefulness as they require either a set of examples of the type to be counted or that the image contains only a single type of object. A significant factor in these shortcomings is the lack of a dataset to properly address counting in settings with more than one kind of object present. To address these issues, we propose the first Multi-class, Class-Agnostic Counting dataset (MCAC) and A Blind Counter (ABC123), a method that can count multiple types of objects simultaneously without using examples of type during training or inference. ABC123 introduces a new paradigm where instead of requiring exemplars to guide the enumeration, examples are found after the counting stage to help a user understand the generated outputs. We show that ABC123 outperforms contemporary methods on MCAC without the requirement of human in-the-loop annotations. We also show that this performance transfers to FSC-147, the standard class-agnostic counting dataset.

Summary
Class-agnostic counting methods enumerate objects of an arbitrary class, which is highly useful in many fields. Prior work has limited usefulness because it requires either a set of examples of the type to be counted or an image that contains only a single type of object. A significant factor in these shortcomings is the lack of a dataset that properly addresses counting when more than one kind of object is present. To address these issues, we propose the first Multi-class, Class-Agnostic Counting dataset (MCAC) and A Blind Counter (ABC123), a method that counts multiple types of objects simultaneously without using examples of the type during training or inference. ABC123 introduces a new paradigm: instead of requiring exemplars to guide the enumeration, examples are found after the counting stage to help a user understand the generated outputs. We show that ABC123 outperforms contemporary methods on MCAC without requiring human-in-the-loop annotations, and that this performance transfers to FSC-147, the standard class-agnostic counting dataset.

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

paper_url: http://arxiv.org/abs/2309.04814
repo_url: None
paper_authors: Xiuzhe Wu, Pengfei Hu, Yang Wu, Xiaoyang Lyu, Yan-Pei Cao, Ying Shan, Wenming Yang, Zhongqian Sun, Xiaojuan Qi
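for: Generating natural-looking talking videos from speech, addressing earlier problems such as inaccurate lip shape generation and poor image quality.
methods: Proposes a decomposition-synthesis-composition framework (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion and appearance, so that natural-looking videos can be learned from limited training data.
results: The model can be trained on a video only a few minutes long and achieves state-of-the-art visual quality and speech-visual synchronization.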

Abstract
Synthesizing realistic videos according to a given speech is still an open challenge. Previous works have been plagued by issues such as inaccurate lip shape generation and poor image quality. The key reason is that only motions and appearances on limited facial areas (e.g., lip area) are mainly driven by the input speech. Therefore, directly learning a mapping function from speech to the entire head image is prone to ambiguity, particularly when using a short video for training. We thus propose a decomposition-synthesis-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance to facilitate effective learning from limited training data, resulting in the generation of natural-looking videos. First, given a fixed head pose (i.e., canonical space), we present a speech-driven implicit model for lip image generation which concentrates on learning speech-sensitive motion and appearance. Next, to model the major speech-insensitive motion (i.e., head movement), we introduce a geometry-aware mutual explicit mapping (GAMEM) module that establishes geometric mappings between different head poses. This allows us to paste generated lip images at the canonical space onto head images with arbitrary poses and synthesize talking videos with natural head movements. In addition, a Blend-Net and a contrastive sync loss are introduced to enhance the overall synthesis performance. Quantitative and qualitative results on three benchmarks demonstrate that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization. Code: https://github.com/CVMI-Lab/Speech2Lip.

VeRi3D: Generative Vertex-based Radiance Fields for 3D Controllable Human Image Synthesis

paper_url: http://arxiv.org/abs/2309.04800
repo_url: None
paper_authors: Xinya Chen, Jiaxin Huang, Yanrui Bin, Lu Yu, Yiyi Liao
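for: Generating photorealistic human images that generalize to novel poses and shapes and remain controllable.
methods: Learns a generative vertex-based radiance field parameterized by the vertices of the parametric human template SMPL.
results: Generates photorealistic human images with free control over camera pose, human pose, and shape, and supports part-level editing.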

Abstract
Unsupervised learning of 3D-aware generative adversarial networks has lately made much progress. Some recent work demonstrates promising results of learning human generative models using neural articulated radiance fields, yet their generalization ability and controllability lag behind parametric human models, i.e., they do not perform well when generalizing to novel pose/shape and are not part controllable. To solve these problems, we propose VeRi3D, a generative human vertex-based radiance field parameterized by vertices of the parametric human template, SMPL. We map each 3D point to the local coordinate system defined on its neighboring vertices, and use the corresponding vertex feature and local coordinates for mapping it to color and density values. We demonstrate that our simple approach allows for generating photorealistic human images with free control over camera pose, human pose, shape, as well as enabling part-level editing.

Summary
Recently, there has been significant progress in unsupervised learning of 3D-aware generative adversarial networks. Some recent work has shown promising results in learning human generative models using neural articulated radiance fields, but their generalization ability and controllability are still limited, such as difficulty in generalizing to novel pose/shape and lack of part controllability. To address these issues, we propose VeRi3D, a generative human vertex-based radiance field parameterized by the vertices of the parametric human template, SMPL. We map each 3D point to the local coordinate system defined on its neighboring vertices, and use the corresponding vertex feature and local coordinates to map it to color and density values. Our simple approach enables the generation of photorealistic human images with free control over camera pose, human pose, shape, as well as part-level editing.
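
The mapping of a 3D point into the local coordinate systems of its neighboring vertices can be sketched as below. This is a geometric illustration only, under assumptions the abstract does not spell out: neighbors are chosen as the k nearest vertices by Euclidean distance, and each vertex carries a given orthonormal frame (`frames` is a hypothetical input). The learned vertex features and the color and density network are left out.

```python
import math

def k_nearest_vertices(point, vertices, k):
    """Indices of the k vertices closest to `point`, nearest first."""
    order = sorted(range(len(vertices)), key=lambda i: math.dist(point, vertices[i]))
    return order[:k]

def local_coordinates(point, vertices, frames, k=2):
    """Express `point` in the local frame of each of its k nearest vertices.

    frames[i] holds three orthonormal axes (as rows) attached to vertices[i].
    Returns a list of (vertex_index, [a, b, c]) pairs.
    """
    result = []
    for i in k_nearest_vertices(point, vertices, k):
        offset = [p - v for p, v in zip(point, vertices[i])]
        coords = [sum(a * o for a, o in zip(axis, offset)) for axis in frames[i]]
        result.append((i, coords))
    return result
```

Because the coordinates are relative to vertices that move with the body template, the same local description stays valid when the pose or shape changes, which is what makes vertex-anchored fields controllable.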

Self-Supervised Transformer with Domain Adaptive Reconstruction for General Face Forgery Video Detection

paper_url: http://arxiv.org/abs/2309.04795
repo_url: None
paper_authors: Daichi Zhang, Zihao Xiao, Jianmin Li, Shiming Ge
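for: Improving face forgery video detection, especially generalization to videos from different forgery methods or real source videos.
methods: Proposes CoReST, a self-supervised Transformer pre-trained only on real face videos with two auxiliary tasks, contrastive learning and reconstruction learning, after which a linear head is fine-tuned on forgery datasets. A Domain Adaptive Reconstruction (DAR) module reconstructs unlabeled target videos during fine-tuning to bridge different forgery domains.
results: Experiments on public datasets show that the method performs better than state-of-the-art supervised competitors and generalizes well.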

Abstract
Face forgery videos have caused severe social public concern, and various detectors have been proposed recently. However, most of them are trained in a supervised manner with limited generalization when detecting videos from different forgery methods or real source videos. To tackle this issue, we explore to take full advantage of the difference between real and forgery videos by only exploring the common representation of real face videos. In this paper, a Self-supervised Transformer cooperating with Contrastive and Reconstruction learning (CoReST) is proposed, which is first pre-trained only on real face videos in a self-supervised manner, and then fine-tuned a linear head on specific face forgery video datasets. Two specific auxiliary tasks incorporated contrastive and reconstruction learning are designed to enhance the representation learning. Furthermore, a Domain Adaptive Reconstruction (DAR) module is introduced to bridge the gap between different forgery domains by reconstructing on unlabeled target videos when fine-tuning. Extensive experiments on public datasets demonstrate that our proposed method performs even better than the state-of-the-art supervised competitors with impressive generalization.

Summary
Face forgery videos have caused serious public concern, and many detectors have been proposed recently. Most of them, however, are trained in a supervised manner and generalize poorly to videos from different forgery methods or real source videos. To tackle this, we exploit the difference between real and forged videos by learning only the common representation of real face videos. We propose a Self-supervised Transformer cooperating with Contrastive and Reconstruction learning (CoReST), which is first pre-trained only on real face videos in a self-supervised manner, after which a linear head is fine-tuned on specific face forgery video datasets. Two auxiliary tasks, contrastive learning and reconstruction learning, are designed to enhance representation learning. In addition, a Domain Adaptive Reconstruction (DAR) module bridges the gap between forgery domains by reconstructing unlabeled target videos during fine-tuning. Extensive experiments on public datasets show that the method performs even better than state-of-the-art supervised competitors and generalizes well.
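
The abstract names contrastive and reconstruction learning as the two auxiliary tasks but does not give their formulas. As a sketch under assumptions, the block below uses two common choices: an InfoNCE-style contrastive loss over cosine similarities and a mean-squared reconstruction error. These are stand-ins chosen for illustration, not the losses defined in the paper.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax score of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]

def reconstruction_loss(original, reconstructed):
    """Mean squared error between an input and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
```

The contrastive term is near zero when the anchor matches its positive and is far from the negatives, and grows when a negative scores higher than the positive.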

Latent Degradation Representation Constraint for Single Image Deraining

paper_url: http://arxiv.org/abs/2309.04780
repo_url: None
paper_authors: Yuhong He, Long Peng, Lu Wang, Jun Cheng
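for: A new single image deraining model that addresses the difficulty existing methods have in learning the rain degradation representation.
methods: The model consists of a Direction-Aware Encoder (DAEncoder), a UNet Deraining Network, and a Multi-Scale Interaction Block (MSIBlock). The DAEncoder uses deformable convolutions to exploit the direction consistency of rain streaks and adaptively extract the latent degradation representation. A constraint loss explicitly constrains degradation representation learning during training. The MSIBlock fuses the learned degradation representation with the decoder features of the deraining network for adaptive information interaction, so that the network can remove various complicated rain patterns and reconstruct image details.
results: Experiments show that the method achieves new state-of-the-art performance on synthetic and real datasets.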

Abstract
Since rain streaks show a variety of shapes and directions, learning the degradation representation is extremely challenging for single image deraining. Existing methods are mainly targeted at designing complicated modules to implicitly learn latent degradation representation from coupled rainy images. This way, it is hard to decouple the content-independent degradation representation due to the lack of explicit constraint, resulting in over- or under-enhancement problems. To tackle this issue, we propose a novel Latent Degradation Representation Constraint Network (LDRCNet) that consists of Direction-Aware Encoder (DAEncoder), UNet Deraining Network, and Multi-Scale Interaction Block (MSIBlock). Specifically, the DAEncoder is proposed to adaptively extract latent degradation representation by using the deformable convolutions to exploit the direction consistency of rain streaks. Next, a constraint loss is introduced to explicitly constraint the degradation representation learning during training. Last, we propose an MSIBlock to fuse with the learned degradation representation and decoder features of the deraining network for adaptive information interaction, which enables the deraining network to remove various complicated rainy patterns and reconstruct image details. Experimental results on synthetic and real datasets demonstrate that our method achieves new state-of-the-art performance.

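Summary
Because rain streaks vary widely in shape and direction, learning the degradation representation is extremely challenging for single image deraining. Existing methods mainly design complicated modules to implicitly learn a latent degradation representation from coupled rainy images. Without an explicit constraint, the content-independent degradation representation is hard to decouple, which leads to over- or under-enhancement. To tackle this, we propose the Latent Degradation Representation Constraint Network (LDRCNet), which consists of a Direction-Aware Encoder (DAEncoder), a UNet Deraining Network, and a Multi-Scale Interaction Block (MSIBlock). The DAEncoder adaptively extracts the latent degradation representation, using deformable convolutions to exploit the direction consistency of rain streaks. A constraint loss then explicitly constrains degradation representation learning during training. Finally, the MSIBlock fuses the learned degradation representation with the decoder features of the deraining network for adaptive information interaction, which enables the network to remove various complicated rainy patterns and reconstruct image details. Experimental results on synthetic and real datasets show that the method achieves new state-of-the-art performance.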

Visual Material Characteristics Learning for Circular Healthcare

paper_url: http://arxiv.org/abs/2309.04763
repo_url: https://github.com/fedezocco/matvisiongluinh-pytorch_tensorflow
paper_authors: Federico Zocco, Shahin Rahimifard
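for: Strengthening the recovery chain for healthcare waste and increasing the circularity of material flows.
methods: Develops several vision systems targeting three main circular economy tasks: resource mapping and quantification, waste sorting, and disassembly.
results: Representation-learning vision can improve the recovery chain, where autonomous systems are key enablers because of contamination risks. Two fully annotated datasets are also published, for image segmentation and for key-point tracking in disassembly operations of inhalers and glucose meters.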

Abstract
The linear take-make-dispose paradigm at the foundations of our traditional economy is proving to be unsustainable due to waste pollution and material supply uncertainties. Hence, increasing the circularity of material flows is necessary. In this paper, we make a step towards circular healthcare by developing several vision systems targeting three main circular economy tasks: resources mapping and quantification, waste sorting, and disassembly. The performance of our systems demonstrates that representation-learning vision can improve the recovery chain, where autonomous systems are key enablers due to the contamination risks. We also published two fully-annotated datasets for image segmentation and for key-point tracking in disassembly operations of inhalers and glucose meters. The datasets and source code are publicly available.

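Summary
The linear take-make-dispose paradigm underlying the traditional economy is proving unsustainable because of waste pollution and material supply uncertainties, so increasing the circularity of material flows is necessary. In this paper, we take a step towards circular healthcare by developing several vision systems targeting three main circular economy tasks: resource mapping and quantification, waste sorting, and disassembly. The performance of our systems shows that representation-learning vision can improve the recovery chain, where autonomous systems are key enablers because of contamination risks. We also publish two fully annotated datasets, one for image segmentation and one for key-point tracking in disassembly operations of inhalers and glucose meters. The datasets and source code are publicly available.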

Probabilistic Triangulation for Uncalibrated Multi-View 3D Human Pose Estimation

paper_url: http://arxiv.org/abs/2309.04756
repo_url: https://github.com/bymaths/probabilistic_triangulation
paper_authors: Boyuan Jiang, Lei Hu, Shihong Xia
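for: Multi-view 3D human pose estimation in uncalibrated scenes, removing the reliance on fixed camera poses in order to improve generalization.
methods: A Probabilistic Triangulation module maintains a camera pose distribution and iteratively updates it by computing the posterior probability of the camera pose from 2D features through Monte Carlo sampling, so that gradients can be back-propagated directly from the 3D pose estimate to the 2D heatmaps for end-to-end training.
results: Extensive experiments on Human3.6M and CMU Panoptic show that the method outperforms other uncalibrated methods and is comparable to state-of-the-art calibrated methods, achieving a trade-off between estimation accuracy and generalizability.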

Abstract
3D human pose estimation has been a long-standing challenge in computer vision and graphics, where multi-view methods have significantly progressed but are limited by the tedious calibration processes. Existing multi-view methods are restricted to fixed camera pose and therefore lack generalization ability. This paper presents a novel Probabilistic Triangulation module that can be embedded in a calibrated 3D human pose estimation method, generalizing it to uncalibrated scenes. The key idea is to use a probability distribution to model the camera pose and iteratively update the distribution from 2D features instead of using camera pose. Specifically, we maintain a camera pose distribution and then iteratively update this distribution by computing the posterior probability of the camera pose through Monte Carlo sampling. This way, the gradients can be directly back-propagated from the 3D pose estimation to the 2D heatmap, enabling end-to-end training. Extensive experiments on Human3.6M and CMU Panoptic demonstrate that our method outperforms other uncalibrated methods and achieves comparable results with state-of-the-art calibrated methods. Thus, our method achieves a trade-off between estimation accuracy and generalizability. Our code is available at https://github.com/bymaths/probabilistic_triangulation

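Summary
3D human pose estimation is a long-standing challenge in computer vision and graphics. Multi-view methods have progressed significantly but are limited by tedious calibration, and existing multi-view methods are restricted to fixed camera poses and therefore generalize poorly. This paper presents a Probabilistic Triangulation module that can be embedded in a calibrated 3D human pose estimation method to generalize it to uncalibrated scenes. The key idea is to model the camera pose with a probability distribution and to update that distribution iteratively from 2D features instead of using a given camera pose. Specifically, we maintain a camera pose distribution and iteratively update it by computing the posterior probability of the camera pose through Monte Carlo sampling. Gradients can then be back-propagated directly from the 3D pose estimate to the 2D heatmap, enabling end-to-end training. Extensive experiments on Human3.6M and CMU Panoptic show that the method outperforms other uncalibrated methods and is comparable to state-of-the-art calibrated methods, achieving a trade-off between estimation accuracy and generalizability. The code is available at https://github.com/bymaths/probabilistic_triangulation.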
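
The iterative update of a camera pose distribution by Monte Carlo sampling can be illustrated with a deliberately small toy problem. The sketch assumes a one-parameter camera (an unknown horizontal offset), a pinhole projection, known 3D points, a Gaussian pose distribution, and Gaussian observation noise. All of these are simplifications made for illustration, and the function names are hypothetical; the paper works with full camera poses and learned 2D features.

```python
import math
import random

def project(point_x, point_z, cam_offset, focal=1.0):
    """Pinhole projection of a point onto a camera shifted horizontally by cam_offset."""
    return focal * (point_x - cam_offset) / point_z

def update_pose_distribution(mean, std, observations, rng, n_samples=500, noise=0.05):
    """One Monte Carlo update: sample poses, weight by 2D likelihood, refit a Gaussian."""
    samples = [rng.gauss(mean, std) for _ in range(n_samples)]
    log_weights = []
    for cam in samples:
        log_lik = 0.0
        for x, z, u_observed in observations:
            residual = u_observed - project(x, z, cam)
            log_lik += -0.5 * (residual / noise) ** 2
        log_weights.append(log_lik)
    peak = max(log_weights)  # subtract the max before exponentiating
    weights = [math.exp(lw - peak) for lw in log_weights]
    total = sum(weights)
    new_mean = sum(w * c for w, c in zip(weights, samples)) / total
    new_var = sum(w * (c - new_mean) ** 2 for w, c in zip(weights, samples)) / total
    return new_mean, max(math.sqrt(new_var), 1e-3)  # floor keeps the distribution from collapsing

def refine_pose(observations, n_iters=5, seed=0):
    """Start from a broad distribution and apply the update several times."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0
    for _ in range(n_iters):
        mean, std = update_pose_distribution(mean, std, observations, rng)
    return mean, std
```

Note that reusing the same observations in every iteration sharpens the distribution toward the most likely pose rather than producing a strict Bayesian posterior, which is acceptable for a refinement loop like this toy one.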

Deep Video Restoration for Under-Display Camera

paper_url: http://arxiv.org/abs/2309.04752
repo_url: None
paper_authors: Xuanxi Chen, Tao Wang, Ziqian Shao, Kaihao Zhang, Wenhan Luo, Tong Lu, Zikun Liu, Tae-Kyun Kim, Hongdong Li
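for: Under-Display Camera (UDC) video restoration (UDC-VR); existing UDC restoration methods focus only on images.
methods: The paper first proposes a GAN-based generation pipeline to simulate the realistic UDC degradation process, then builds a large-scale UDC video restoration dataset named PexelsUDC, with two subsets, PexelsUDC-T and PexelsUDC-P, corresponding to different displays. It also proposes a transformer-based baseline that exploits the spatial and temporal information in videos.
results: Extensive benchmark studies on the proposed dataset show that existing video restoration methods have limitations on the UDC-VR task, and the proposed baseline achieves state-of-the-art performance on PexelsUDC.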

Abstract
Images or videos captured by the Under-Display Camera (UDC) suffer from severe degradation, such as saturation degeneration and color shift. While restoration for UDC has been a critical task, existing works of UDC restoration focus only on images. UDC video restoration (UDC-VR) has not been explored in the community. In this work, we first propose a GAN-based generation pipeline to simulate the realistic UDC degradation process. With the pipeline, we build the first large-scale UDC video restoration dataset called PexelsUDC, which includes two subsets named PexelsUDC-T and PexelsUDC-P corresponding to different displays for UDC. Using the proposed dataset, we conduct extensive benchmark studies on existing video restoration methods and observe their limitations on the UDC-VR task. To this end, we propose a novel transformer-based baseline method that adaptively enhances degraded videos. The key components of the method are a spatial branch with local-aware transformers, a temporal branch embedded temporal transformers, and a spatial-temporal fusion module. These components drive the model to fully exploit spatial and temporal information for UDC-VR. Extensive experiments show that our method achieves state-of-the-art performance on PexelsUDC. The benchmark and the baseline method are expected to promote the progress of UDC-VR in the community, which will be made public.

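Summary
Images and videos captured by an Under-Display Camera (UDC) suffer from severe degradation, such as saturation degeneration and color shift. Existing UDC restoration work focuses only on images, and UDC video restoration (UDC-VR) has not been explored in the community. In this work, we first propose a GAN-based generation pipeline to simulate the realistic UDC degradation process. With the pipeline, we build the first large-scale UDC video restoration dataset, PexelsUDC, which includes two subsets, PexelsUDC-T and PexelsUDC-P, corresponding to different UDC displays. Using this dataset, we conduct extensive benchmark studies of existing video restoration methods and observe their limitations on the UDC-VR task. We therefore propose a transformer-based baseline that adaptively enhances degraded videos. Its key components are a spatial branch with local-aware transformers, a temporal branch with embedded temporal transformers, and a spatial-temporal fusion module, which together let the model fully exploit spatial and temporal information. Extensive experiments show that the method achieves state-of-the-art performance on PexelsUDC. The benchmark and the baseline method will be made public to promote progress on UDC-VR.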

Mirror-Aware Neural Humans

paper_url: http://arxiv.org/abs/2309.04750
repo_url: None
paper_authors: Daniel Ajisafe, James Tang, Shih-Yang Su, Bastian Wandt, Helge Rhodin
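for: A consumer-level 3D human motion capture system built on a single camera and a mirror, addressing the drawbacks of both multi-camera and single-view systems.
methods: Uses a mirror to record two views with a single camera and learns a complete body model, including shape and dense appearance, by extending articulated neural radiance fields with a notion of a mirror.
results: Starting from off-the-shelf 2D poses, the system automatically calibrates the camera, estimates the mirror orientation, and lifts 2D keypoints to a 3D skeleton pose. Experiments show the benefit of learning a body model and accounting for occlusion in challenging mirror scenes.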

Abstract
Human motion capture either requires multi-camera systems or is unreliable using single-view input due to depth ambiguities. Meanwhile, mirrors are readily available in urban environments and form an affordable alternative by recording two views with only a single camera. However, the mirror setting poses the additional challenge of handling occlusions of real and mirror image. Going beyond existing mirror approaches for 3D human pose estimation, we utilize mirrors for learning a complete body model, including shape and dense appearance. Our main contributions are extending articulated neural radiance fields to include a notion of a mirror, making it sample-efficient over potential occlusion regions. Together, our contributions realize a consumer-level 3D motion capture system that starts from off-the-shelf 2D poses by automatically calibrating the camera, estimating mirror orientation, and subsequently lifting 2D keypoint detections to 3D skeleton pose that is used to condition the mirror-aware NeRF. We empirically demonstrate the benefit of learning a body model and accounting for occlusion in challenging mirror scenes.

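Summary
Human motion capture either requires multi-camera systems or is unreliable with single-view input because of depth ambiguities. Mirrors, however, are readily available in urban environments and offer an affordable alternative, recording two views with a single camera. The mirror setting adds the challenge of handling occlusions between the real and mirrored images. Going beyond existing mirror approaches to 3D human pose estimation, we use mirrors to learn a complete body model, including shape and dense appearance. Our main contribution is extending articulated neural radiance fields with a notion of a mirror, making them sample-efficient over potential occlusion regions. Together, our contributions realize a consumer-level 3D motion capture system that starts from off-the-shelf 2D poses, automatically calibrates the camera, estimates the mirror orientation, and lifts 2D keypoint detections to a 3D skeleton pose that conditions the mirror-aware NeRF. We empirically demonstrate the benefit of learning a body model and accounting for occlusion in challenging mirror scenes.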
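
The mirror geometry behind this setup can be illustrated with a small sketch. It assumes the mirror is an ideal plane {x : n · x = d} and shows how a 3D keypoint maps to its virtual counterpart behind the mirror. The function name and the plane parameterization are illustrative assumptions, not the authors' code.

```python
import math


def reflect_point(point, normal, offset):
    """Reflect a 3D point across the plane {x : normal . x = offset}."""
    norm = math.sqrt(sum(c * c for c in normal))
    unit = [c / norm for c in normal]
    # Signed distance from the point to the plane along the unit normal.
    distance = sum(p * u for p, u in zip(point, unit)) - offset / norm
    # Move the point twice that distance against the normal direction.
    return [p - 2.0 * distance * u for p, u in zip(point, unit)]


if __name__ == "__main__":
    # A keypoint 3 units in front of the mirror plane z = 0.
    print(reflect_point([1.0, 2.0, 3.0], [0.0, 0.0, 1.0], 0.0))
```

Reflecting a point twice returns the original point, which gives a quick sanity check for an estimated mirror plane.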

When to Learn What: Model-Adaptive Data Augmentation Curriculum

paper_url: http://arxiv.org/abs/2309.04747
repo_url: None
paper_authors: Chengkai Hou, Jieyu Zhang, Tianyi Zhou
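for: Improve the generalization of neural networks through data augmentation, which enforces invariance to a set of pre-defined transformations of the input data.
methods: Proposes Model Adaptive Data Augmentation (MADAug), which selects augmentation operators for each input image according to the training stage, producing a data augmentation curriculum optimized for generalization.
results: Extensive evaluation on multiple image classification tasks and network architectures, with comparisons against existing data augmentation methods, shows that MADAug improves all classes and brings larger gains on the difficult ones. The learned policy also performs better when transferred to fine-grained datasets and naturally forms an easy-to-hard curriculum.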

Abstract
Data augmentation (DA) is widely used to improve the generalization of neural networks by enforcing the invariances and symmetries to pre-defined transformations applied to input data. However, a fixed augmentation policy may have different effects on each sample in different training stages but existing approaches cannot adjust the policy to be adaptive to each sample and the training model. In this paper, we propose Model Adaptive Data Augmentation (MADAug) that jointly trains an augmentation policy network to teach the model when to learn what. Unlike previous work, MADAug selects augmentation operators for each input image by a model-adaptive policy varying between training stages, producing a data augmentation curriculum optimized for better generalization. In MADAug, we train the policy through a bi-level optimization scheme, which aims to minimize a validation-set loss of a model trained using the policy-produced data augmentations. We conduct an extensive evaluation of MADAug on multiple image classification tasks and network architectures with thorough comparisons to existing DA approaches. MADAug outperforms or is on par with other baselines and exhibits better fairness: it brings improvement to all classes and more to the difficult ones. Moreover, MADAug learned policy shows better performance when transferred to fine-grained datasets. In addition, the auto-optimized policy in MADAug gradually introduces increasing perturbations and naturally forms an easy-to-hard curriculum.

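Summary
Data augmentation (DA) is widely used to improve the generalization of neural networks by enforcing invariances and symmetries in the data. However, a fixed augmentation policy affects each sample differently at different training stages, and existing methods cannot adapt the policy to each sample and to the model being trained. In this paper, we propose Model Adaptive Data Augmentation (MADAug), which jointly trains an augmentation policy network to teach the model when to learn what. Unlike previous methods, the augmentation operators MADAug selects for each input image vary with the training stage, producing a data augmentation curriculum optimized for better generalization. The policy network is trained through bi-level optimization that minimizes a validation-set loss. We evaluate MADAug extensively on multiple image classification tasks and network architectures and compare it with existing DA methods. MADAug outperforms or matches the other baselines and shows better fairness: it improves all classes, and the difficult ones more. The learned policy also performs better when transferred to fine-grained datasets, and the auto-optimized policy gradually introduces increasing perturbations, naturally forming an easy-to-hard curriculum.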

Frequency-Aware Self-Supervised Long-Tailed Learning

paper_url: http://arxiv.org/abs/2309.04723
repo_url: None
paper_authors: Ci-Siang Lin, Min-Hung Chen, Yu-Chiang Frank Wang
for: Address the challenges of long-tailed data distributions in real-world scenarios where label annotation may not be available.
methods: The paper proposes Frequency-Aware Self-Supervised Learning (FASSL), which learns discriminative feature representations from unlabeled data with inherent long-tailed distributions. The approach learns frequency-aware prototypes and exploits the relationships between image data and the derived prototypes using a self-supervised learning scheme.
results: Experiments on long-tailed image datasets show that FASSL learns effectively from unlabeled data and produces high-quality feature representations.

Abstract
Data collected from the real world typically exhibit long-tailed distributions, where frequent classes contain abundant data while rare ones have only a limited number of samples. While existing supervised learning approaches have been proposed to tackle such data imbalance, the requirement of label supervision would limit their applicability to real-world scenarios in which label annotation might not be available. Without the access to class labels nor the associated class frequencies, we propose Frequency-Aware Self-Supervised Learning (FASSL) in this paper. Targeting at learning from unlabeled data with inherent long-tailed distributions, the goal of FASSL is to produce discriminative feature representations for downstream classification tasks. In FASSL, we first learn frequency-aware prototypes, reflecting the associated long-tailed distribution. Particularly focusing on rare-class samples, the relationships between image data and the derived prototypes are further exploited with the introduced self-supervised learning scheme. Experiments on long-tailed image datasets quantitatively and qualitatively verify the effectiveness of our learning scheme.

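Summary
Real-world data typically exhibit long-tailed distributions, where frequent classes have abundant data and rare classes have only a limited number of samples. Existing supervised learning methods can address this imbalance, but they require labels, which limits their applicability in real-world settings. In this paper, we propose Frequency-Aware Self-Supervised Learning (FASSL), which needs neither class labels nor class frequencies. Our goal is to learn discriminative feature representations from unlabeled data for downstream classification tasks. In FASSL, we first learn frequency-aware prototypes that reflect the underlying long-tailed distribution. Focusing in particular on rare-class samples, we then exploit the relationships between image data and the derived prototypes through the introduced self-supervised learning scheme. Experiments on long-tailed image datasets verify the effectiveness of our learning scheme.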
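
To make the notion of a long-tailed distribution concrete, the sketch below builds per-class sample counts that decay exponentially from the most frequent class to the rarest one. The exponential profile and the `imbalance_ratio` parameter are a common convention for constructing long-tailed benchmarks and are assumed here for illustration; the entry does not state which datasets or ratios the authors use.

```python
def long_tailed_counts(num_classes, max_count, imbalance_ratio):
    """Return per-class sample counts that decay exponentially.

    The first class keeps max_count samples and the last class keeps
    roughly max_count / imbalance_ratio samples.
    """
    if num_classes == 1:
        return [max_count]
    counts = []
    for i in range(num_classes):
        # fraction runs from 0 (head class) to 1 (tail class).
        fraction = i / (num_classes - 1)
        counts.append(round(max_count * imbalance_ratio ** (-fraction)))
    return counts


if __name__ == "__main__":
    print(long_tailed_counts(num_classes=10, max_count=5000, imbalance_ratio=100))
```

A self-supervised method in this setting sees only the images, not these counts, which is why the class frequencies have to be inferred rather than read from labels.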

UnitModule: A Lightweight Joint Image Enhancement Module for Underwater Object Detection

paper_url: http://arxiv.org/abs/2309.04708
repo_url: None
paper_authors: Zhuoyan Liu, Bo Wang, Ye Li, Jiaxian He, Yunfeng Li
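for: Improve the quality of the input images given to underwater object detection models in order to improve detection.
methods: Proposes a plug-and-play Underwater joint image enhancement Module (UnitModule), trained jointly with the detector under an unsupervised learning loss to improve the interaction between the two. Also proposes a color cast predictor and a data augmentation called Underwater Color Random Transfer (UCRT).
results: Extensive experiments on the DUO dataset show a highest improvement of 2.6 AP, and a gain of 3.3 AP on the new test set (URPCtest). UnitModule improves every tested model, especially those with few parameters. It has only 31K parameters and no noticeable effect on the inference speed of the original detector. Quantitative and visual analysis also shows that UnitModule improves input image quality and the detector's ability to recognize object features.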

Abstract
Underwater object detection faces the problem of underwater image degradation, which affects the performance of the detector. Underwater object detection methods based on noise reduction and image enhancement usually do not provide images preferred by the detector or require additional datasets. In this paper, we propose a plug-and-play Underwater joint image enhancement Module (UnitModule) that provides the input image preferred by the detector. We design an unsupervised learning loss for the joint training of UnitModule with the detector without additional datasets to improve the interaction between UnitModule and the detector. Furthermore, a color cast predictor with the assisting color cast loss and a data augmentation called Underwater Color Random Transfer (UCRT) are designed to improve the performance of UnitModule on underwater images with different color casts. Extensive experiments are conducted on DUO for different object detection models, where UnitModule achieves the highest performance improvement of 2.6 AP for YOLOv5-S and gains the improvement of 3.3 AP on the brand-new test set (URPCtest). And UnitModule significantly improves the performance of all object detection models we test, especially for models with a small number of parameters. In addition, UnitModule with a small number of parameters of 31K has little effect on the inference speed of the original object detection model. Our quantitative and visual analysis also demonstrates the effectiveness of UnitModule in enhancing the input image and improving the perception ability of the detector for object features.

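Summary
Underwater object detection suffers from underwater image degradation, which hurts detector performance. Detection methods based on noise reduction and image enhancement usually either fail to provide the images the detector prefers or require additional datasets. In this paper, we propose a plug-and-play Underwater joint image enhancement Module (UnitModule) that provides the input image the detector prefers. We design an unsupervised learning loss for jointly training UnitModule with the detector, without additional datasets, to improve the interaction between the two. We also design a color cast predictor with an assisting color cast loss and a data augmentation called Underwater Color Random Transfer (UCRT) to improve UnitModule on underwater images with different color casts. In extensive experiments on DUO with different object detection models, UnitModule achieves its highest improvement of 2.6 AP and gains 3.3 AP on the new test set (URPCtest). UnitModule significantly improves every object detection model tested, especially those with few parameters. With only 31K parameters, it has little effect on the inference speed of the original detector. Quantitative and visual analysis also shows that UnitModule effectively enhances the input image and improves the detector's perception of object features.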

A Spatial-Temporal Deformable Attention based Framework for Breast Lesion Detection in Videos

paper_url: http://arxiv.org/abs/2309.04702
repo_url: https://github.com/alfredqin/stnet
paper_authors: Chao Qin, Jiale Cao, Huazhu Fu, Rao Muhammad Anwer, Fahad Shahbaz Khan
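for: Detecting breast lesions in videos is a key task in computer-aided diagnosis. Existing video-based breast lesion detection methods typically aggregate temporal features with self-attention. The authors argue that this strategy struggles to aggregate deep features effectively and ignores useful local information.
methods: Proposes a spatial-temporal deformable attention based framework named STNet. STNet introduces a spatial-temporal deformable attention module for local spatial-temporal feature fusion, which enables deep feature aggregation at every stage of the encoder and decoder. To speed up detection further, an encoder feature shuffle strategy shares the backbone and encoder features and shuffles the encoder features for the decoder to generate multi-frame predictions.
results: Experiments on the public breast lesion ultrasound video dataset show that STNet achieves state-of-the-art detection performance while running inference twice as fast. Code and models are available at https://github.com/AlfredQin/STNet.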

Abstract
Detecting breast lesion in videos is crucial for computer-aided diagnosis. Existing video-based breast lesion detection approaches typically perform temporal feature aggregation of deep backbone features based on the self-attention operation. We argue that such a strategy struggles to effectively perform deep feature aggregation and ignores the useful local information. To tackle these issues, we propose a spatial-temporal deformable attention based framework, named STNet. Our STNet introduces a spatial-temporal deformable attention module to perform local spatial-temporal feature fusion. The spatial-temporal deformable attention module enables deep feature aggregation in each stage of both encoder and decoder. To further accelerate the detection speed, we introduce an encoder feature shuffle strategy for multi-frame prediction during inference. In our encoder feature shuffle strategy, we share the backbone and encoder features, and shuffle encoder features for decoder to generate the predictions of multiple frames. The experiments on the public breast lesion ultrasound video dataset show that our STNet obtains a state-of-the-art detection performance, while operating twice as fast inference speed. The code and model are available at https://github.com/AlfredQin/STNet.

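Summary
Detecting breast lesions in videos is a key task in computer-aided diagnosis. Existing video-based breast lesion detection methods typically aggregate deep backbone features over time using self-attention. We argue that this strategy struggles to aggregate deep features effectively and ignores useful local information. To address these issues, we propose a spatial-temporal deformable attention based framework named STNet. STNet introduces a spatial-temporal deformable attention module for local spatial-temporal feature fusion, which enables deep feature aggregation at every stage of both the encoder and the decoder. To further speed up detection, we propose an encoder feature shuffle strategy for multi-frame prediction during inference: the backbone and encoder features are shared, and the encoder features are shuffled for the decoder to generate predictions for multiple frames. Experiments on the public breast lesion ultrasound video dataset show that STNet achieves state-of-the-art detection performance while running inference twice as fast. The code and model are available at https://github.com/AlfredQin/STNet.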

DeNoising-MOT: Towards Multiple Object Tracking with Severe Occlusions

paper_url: http://arxiv.org/abs/2309.04682
repo_url: None
paper_authors: Teng Fu, Xiaocong Wang, Haiyang Yu, Ke Niu, Bin Li, Xiangyang Xue
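for: Improve multiple object tracking (MOT) under severe occlusions.
methods: An end-to-end trainable denoising Transformer (DNMOT) that adds noise to trajectories during training and learns to denoise them in an encoder-decoder architecture, together with a Cascaded Mask strategy that coordinates the queries in the decoder.
results: Extensive experiments on the MOT17, MOT20, and DanceTrack datasets show a clear improvement over previous state-of-the-art methods.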

Abstract
Multiple object tracking (MOT) tends to become more challenging when severe occlusions occur. In this paper, we analyze the limitations of traditional Convolutional Neural Network-based methods and Transformer-based methods in handling occlusions and propose DNMOT, an end-to-end trainable DeNoising Transformer for MOT. To address the challenge of occlusions, we explicitly simulate the scenarios when occlusions occur. Specifically, we augment the trajectory with noises during training and make our model learn the denoising process in an encoder-decoder architecture, so that our model can exhibit strong robustness and perform well under crowded scenes. Additionally, we propose a Cascaded Mask strategy to better coordinate the interaction between different types of queries in the decoder to prevent the mutual suppression between neighboring trajectories under crowded scenes. Notably, the proposed method requires no additional modules like matching strategy and motion state estimation in inference. We conduct extensive experiments on the MOT17, MOT20, and DanceTrack datasets, and the experimental results show that our method outperforms previous state-of-the-art methods by a clear margin.

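Summary
Multiple object tracking (MOT) becomes more challenging under severe occlusions. In this paper, we analyze the limitations of traditional convolutional neural network-based methods and Transformer-based methods in handling occlusions, and propose DNMOT, an end-to-end trainable denoising Transformer for MOT. To address occlusions, we add noise to the trajectories during training so that the model learns the denoising process in an encoder-decoder architecture, which makes it strongly robust in crowded scenes. We also propose a Cascaded Mask strategy that better coordinates the interaction between different types of queries in the decoder, preventing mutual suppression between neighboring trajectories in crowded scenes. Notably, the method needs no additional modules at inference, such as a matching strategy or motion state estimation. Extensive experiments on the MOT17, MOT20, and DanceTrack datasets show that our method outperforms previous state-of-the-art methods by a clear margin.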
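
The idea of augmenting a trajectory with noise during training can be sketched as follows. This is a hypothetical illustration only: the box format (cx, cy, w, h), the uniform noise model, and the `scale` parameter are assumptions, since the entry does not specify how DNMOT samples its noise.

```python
import random


def add_box_noise(box, scale, rng):
    """Jitter a (cx, cy, w, h) box: shift the centre, then rescale the size."""
    cx, cy, w, h = box
    # The centre shift is proportional to the box size.
    cx += rng.uniform(-scale, scale) * w
    cy += rng.uniform(-scale, scale) * h
    # Width and height get independent factors in [1 - scale, 1 + scale].
    w *= 1.0 + rng.uniform(-scale, scale)
    h *= 1.0 + rng.uniform(-scale, scale)
    return (cx, cy, w, h)


def noisy_trajectory(boxes, scale, seed=0):
    """Return a noised copy of a trajectory; the clean boxes stay the target."""
    rng = random.Random(seed)
    return [add_box_noise(box, scale, rng) for box in boxes]


if __name__ == "__main__":
    clean = [(10.0, 10.0, 4.0, 2.0), (11.0, 10.5, 4.0, 2.0)]
    print(noisy_trajectory(clean, scale=0.2))
```

In a denoising setup of this kind, the model receives the noised boxes and is trained to recover the clean ones, mimicking the unreliable positions seen under occlusion.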

BiLMa: Bidirectional Local-Matching for Text-based Person Re-identification

paper_url: http://arxiv.org/abs/2309.04675
repo_url: None
paper_authors: Takuro Fujii, Shuhei Tarashima
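for: This paper studies text-based person re-identification (TBPReID), which retrieves person images that match a textual description.
methods: The method aligns image and text parts by jointly training Masked Language Modeling (MLM) and Masked Image Modeling (MIM). It adds the opposite-direction (text to image) local matching to the existing (image to text) direction to improve TBPReID performance.
results: In experiments, the method achieves state-of-the-art Rank@1 and mAP scores on three benchmarks.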

Abstract
Text-based person re-identification (TBPReID) aims to retrieve person images represented by a given textual query. In this task, how to effectively align images and texts globally and locally is a crucial challenge. Recent works have obtained high performances by solving Masked Language Modeling (MLM) to align image/text parts. However, they only performed uni-directional (i.e., from image to text) local-matching, leaving room for improvement by introducing opposite-directional (i.e., from text to image) local-matching. In this work, we introduce Bidirectional Local-Matching (BiLMa) framework that jointly optimize MLM and Masked Image Modeling (MIM) in TBPReID model training. With this framework, our model is trained so as the labels of randomly masked both image and text tokens are predicted by unmasked tokens. In addition, to narrow the semantic gap between image and text in MIM, we propose Semantic MIM (SemMIM), in which the labels of masked image tokens are automatically given by a state-of-the-art human parser. Experimental results demonstrate that our BiLMa framework with SemMIM achieves state-of-the-art Rank@1 and mAP scores on three benchmarks.
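
The random masking that underlies both MLM and MIM can be sketched in a few lines. This is a generic illustration, not the BiLMa implementation: the `[MASK]` placeholder, the masking probability, and the use of `None` for positions that carry no prediction target are assumptions made for the example.

```python
import random


def mask_tokens(tokens, mask_prob, seed=0, mask_token="[MASK]"):
    """Randomly mask tokens and return (masked_tokens, labels).

    labels holds the original token at masked positions and None elsewhere,
    so a loss would only be computed where a token was hidden.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    text = ["a", "woman", "in", "a", "red", "coat"]
    print(mask_tokens(text, mask_prob=0.3))
```

The same routine applies to a sequence of image-patch labels, which is what makes the two masking objectives symmetric across modalities.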

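Summary
BiLMa jointly optimizes Masked Language Modeling (MLM) and Masked Image Modeling (MIM) so that local matching between images and text runs in both directions. Semantic MIM (SemMIM) takes the labels of masked image tokens from a human parser to narrow the semantic gap between image and text. The framework achieves state-of-the-art Rank@1 and mAP scores on three benchmarks.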

SSHNN: Semi-Supervised Hybrid NAS Network for Echocardiographic Image Segmentation

paper_url: http://arxiv.org/abs/2309.04672
repo_url: None
paper_authors: Renqi Chen, Jingjing Luo, Fan Nian, Yuhui Cen, Yiheng Peng, Zekuan Yu
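for: Accurate medical image segmentation, especially for echocardiographic images whose noise cannot be ignored, requires elaborate network design.
methods: Proposes a novel semi-supervised hybrid NAS network (SSHNN) that uses convolution operations for layer-wise feature fusion, introduces Transformers to compensate for global context, and uses a U-shaped decoder to connect global context with local features efficiently.
results: Extensive experiments on the CAMUS echocardiography dataset show that SSHNN outperforms state-of-the-art methods and achieves accurate segmentation.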

Abstract
Accurate medical image segmentation especially for echocardiographic images with unmissable noise requires elaborate network design. Compared with manual design, Neural Architecture Search (NAS) realizes better segmentation results due to larger search space and automatic optimization, but most of the existing methods are weak in layer-wise feature aggregation and adopt a "strong encoder, weak decoder" structure, insufficient to handle global relationships and local details. To resolve these issues, we propose a novel semi-supervised hybrid NAS network for accurate medical image segmentation termed SSHNN. In SSHNN, we creatively use convolution operation in layer-wise feature fusion instead of normalized scalars to avoid losing details, making NAS a stronger encoder. Moreover, Transformers are introduced for the compensation of global context and U-shaped decoder is designed to efficiently connect global context with local features. Specifically, we implement a semi-supervised algorithm Mean-Teacher to overcome the limited volume problem of labeled medical image dataset. Extensive experiments on CAMUS echocardiography dataset demonstrate that SSHNN outperforms state-of-the-art approaches and realizes accurate segmentation. Code will be made publicly available.

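Summary
Accurate medical image segmentation, especially for noisy echocardiographic images, requires elaborate network design. Compared with manual design, Neural Architecture Search (NAS) achieves better segmentation thanks to its larger search space and automatic optimization. However, most existing methods are weak in layer-wise feature aggregation and adopt a "strong encoder, weak decoder" structure that cannot handle global relationships and local details. To resolve these issues, we propose a novel semi-supervised hybrid NAS network called SSHNN. In SSHNN, we use convolution operations for layer-wise feature fusion instead of normalized scalars to avoid losing details. We also introduce Transformers to compensate for global context and design a U-shaped decoder to connect global context with local features efficiently. Specifically, we implement the semi-supervised Mean-Teacher algorithm to overcome the limited volume of labeled medical image data. Extensive experiments show that SSHNN outperforms existing approaches and achieves accurate segmentation. Code will be made publicly available.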
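
The core of the Mean-Teacher algorithm named in this entry is that the teacher's weights are an exponential moving average (EMA) of the student's weights. The sketch below shows that update on flat lists of floats; the `decay` value and the flat-list weight representation are simplifying assumptions, and how SSHNN wires the update into its NAS training is not described here.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights as an EMA of the student weights.

    teacher and student are equal-length lists of floats; a decay close
    to 1 makes the teacher change slowly.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


if __name__ == "__main__":
    teacher = [1.0, 1.0]
    student = [0.0, 2.0]
    for _ in range(3):
        teacher = ema_update(teacher, student, decay=0.5)
    print(teacher)  # the teacher drifts toward the student
```

Because the teacher averages over many student states, its predictions on unlabeled images tend to be more stable, which is what makes them usable as consistency targets.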

Unified Language-Vision Pretraining with Dynamic Discrete Visual Tokenization

paper_url: http://arxiv.org/abs/2309.04669
repo_url: https://github.com/jy0205/LaVIT
paper_authors: Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Bin Chen, Chenyi Lei, An Liu, Chengru Song, Xiaoqiang Lei, Yadong Mu, Di Zhang, Wenwu Ou, Kun Gai
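for: Break through the limitation of existing approaches that treat visual input only as a prompt, by representing both vision and language in a unified representation to improve multi-modal comprehension.
methods: The authors propose a foundation model called LaVIT (Language-VIsion Transformer), whose visual tokenizer translates a non-linguistic image into a sequence of discrete tokens so that the model can handle images and text alike.
results: On downstream tasks, LaVIT outperforms existing models by a large margin and shows strong multi-modal comprehension.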

Abstract
Recently, the remarkable advance of the Large Language Model (LLM) has inspired researchers to transfer its extraordinary reasoning capability to data across several modalities. The prevailing approaches primarily regard visual input as the prompt and focus exclusively on optimizing the text generation process conditioned upon vision content by a frozen LLM. Such an inequitable treatment of vision and language heavily constrains the model's potential. In this paper, we break through this limitation by representing both vision and language in a unified representation. To this end, we craft a visual tokenizer that translates the non-linguistic image into a sequence of discrete tokens like a foreign language that LLM can read. The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image content. Coped with this visual tokenizer, the presented foundation model called LaVIT (Language-VIsion Transformer) can handle both image and text indiscriminately under a unified generative learning paradigm. Pre-trained on the web-scale image-text corpus, LaVIT is empowered with impressive multi-modal comprehension capability. The extensive experiments showcase that it outperforms existing models by a large margin on downstream tasks. Our code and models will be available at https://github.com/jy0205/LaVIT.

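Summary
Recently, the remarkable progress of large language models (LLMs) has inspired researchers to transfer their reasoning capability to data across multiple modalities. Prevailing approaches mainly treat visual input as a prompt and focus exclusively on optimizing text generation conditioned on vision content by a frozen LLM. This unequal treatment of vision and language heavily constrains the model's potential. In this paper, we break through this limitation by representing both vision and language in a unified form. To this end, we design a visual tokenizer that translates a non-linguistic image into a sequence of discrete tokens, like a foreign language that the LLM can read. The resulting visual tokens carry high-level semantics and support a dynamic sequence length that varies with the image content. Combined with this tokenizer, the presented foundation model, LaVIT (Language-VIsion Transformer), handles images and text alike under a unified generative learning paradigm. Pre-trained on a web-scale image-text corpus, LaVIT has impressive multi-modal comprehension capability. Extensive experiments show that it outperforms existing models by a large margin on downstream tasks. Our code and models will be available at https://github.com/jy0205/LaVIT.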
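
One common way to turn continuous image features into discrete tokens is a nearest-neighbour lookup in a learned codebook (vector quantization). The sketch below assumes that mechanism purely for illustration; the entry only states that LaVIT's tokenizer outputs discrete tokens with a dynamic sequence length, and the codebook, feature vectors, and function name here are hypothetical.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for feature in features:
        # Squared Euclidean distance to every codebook vector.
        distances = [
            sum((a - b) ** 2 for a, b in zip(feature, code))
            for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    patch_features = [[0.1, 0.0], [0.9, 1.2], [0.2, 0.8]]
    print(quantize(patch_features, codebook))
```

The resulting integer sequence can be placed in the same vocabulary space as text tokens, which is what allows a single generative model to read both.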

ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation

paper_url: http://arxiv.org/abs/2309.05674
repo_url: https://github.com/xianlin7/convformer
paper_authors: Xian Lin, Zengqiang Yan, Xianbo Deng, Chuansheng Zheng, Li Yu
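for: Improve the segmentation performance of transformer-based frameworks on medical images.
methods: Proposes CNN-style Transformers (ConvFormer), which promote better attention convergence. ConvFormer consists of pooling, CNN-style self-attention (CSA), and a convolutional feed-forward network (CFFN), corresponding to tokenization, self-attention, and the feed-forward network in vanilla vision transformers.
results: Experiments on multiple datasets show that ConvFormer, used as a plug-and-play module, consistently improves the segmentation performance of transformer-based frameworks.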

Abstract
Transformers have been extensively studied in medical image segmentation to build pairwise long-range dependence. Yet, relatively limited well-annotated medical image data makes transformers struggle to extract diverse global features, resulting in attention collapse where attention maps become similar or even identical. Comparatively, convolutional neural networks (CNNs) have better convergence properties on small-scale training data but suffer from limited receptive fields. Existing works are dedicated to exploring the combinations of CNN and transformers while ignoring attention collapse, leaving the potential of transformers under-explored. In this paper, we propose to build CNN-style Transformers (ConvFormer) to promote better attention convergence and thus better segmentation performance. Specifically, ConvFormer consists of pooling, CNN-style self-attention (CSA), and convolutional feed-forward network (CFFN) corresponding to tokenization, self-attention, and feed-forward network in vanilla vision transformers. In contrast to positional embedding and tokenization, ConvFormer adopts 2D convolution and max-pooling for both position information preservation and feature size reduction. In this way, CSA takes 2D feature maps as inputs and establishes long-range dependency by constructing self-attention matrices as convolution kernels with adaptive sizes. Following CSA, 2D convolution is utilized for feature refinement through CFFN. Experimental results on multiple datasets demonstrate the effectiveness of ConvFormer working as a plug-and-play module for consistent performance improvement of transformer-based frameworks. Code is available at https://github.com/xianlin7/ConvFormer.

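Summary
Transformers have been extensively studied in medical image segmentation for building pairwise long-range dependence. However, the relatively limited amount of well-annotated medical image data makes it hard for transformers to extract diverse global features, resulting in attention collapse, where attention maps become similar or even identical. By comparison, convolutional neural networks (CNNs) converge better on small-scale training data but suffer from limited receptive fields. Existing works focus on combining CNNs and transformers while ignoring attention collapse, leaving the potential of transformers under-explored. In this paper, we propose CNN-style Transformers (ConvFormer) to promote better attention convergence and thus better segmentation performance. Specifically, ConvFormer consists of pooling, CNN-style self-attention (CSA), and a convolutional feed-forward network (CFFN), corresponding to tokenization, self-attention, and the feed-forward network in vanilla vision transformers. In contrast to positional embedding and tokenization, ConvFormer uses 2D convolution and max-pooling to preserve position information and reduce feature size. CSA thus takes 2D feature maps as input and establishes long-range dependency by constructing self-attention matrices as convolution kernels with adaptive sizes. After CSA, CFFN refines the features with 2D convolution. Experiments on multiple datasets show that ConvFormer works as a plug-and-play module that consistently improves transformer-based frameworks. Code is available at https://github.com/xianlin7/ConvFormer.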
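
Attention collapse, where attention maps become similar or even identical, can be quantified with a simple diagnostic: the mean pairwise cosine similarity between flattened attention maps. This metric is an assumption chosen for illustration and is not taken from the paper; values near 1 indicate that the maps have collapsed onto one another.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(attention_maps):
    """Average cosine similarity over all pairs of flattened attention maps.

    Expects at least two maps; softmax outputs are never all-zero, so the
    norms in cosine() stay positive.
    """
    pairs = list(combinations(attention_maps, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    collapsed = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
    diverse = [[1.0, 0.0], [0.0, 1.0]]
    print(mean_pairwise_similarity(collapsed), mean_pairwise_similarity(diverse))
```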

Progressive Feature Adjustment for Semi-supervised Learning from Pretrained Models

paper_url: http://arxiv.org/abs/2309.04659
repo_url: None
paper_authors: Hai-Ming Xu, Lingqiao Liu, Hao Chen, Ehsan Abbasnejad, Rafael Felix
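for: Improve the performance of semi-supervised learning (SSL) and ease the burden of data annotation.
methods: Use pseudo-labels to update the feature extractor so that the feature distribution maintains good class separability, and allow the classifier to be trained only from labeled data.
results: The proposed method outperforms existing solutions.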

Abstract
As an effective way to alleviate the burden of data annotation, semi-supervised learning (SSL) provides an attractive solution due to its ability to leverage both labeled and unlabeled data to build a predictive model. While significant progress has been made recently, SSL algorithms are often evaluated and developed under the assumption that the network is randomly initialized. This is in sharp contrast to most vision recognition systems that are built from fine-tuning a pretrained network for better performance. While the marriage of SSL and a pretrained model seems to be straightforward, recent literature suggests that naively applying state-of-the-art SSL with a pretrained model fails to unleash the full potential of training data. In this paper, we postulate the underlying reason is that the pretrained feature representation could bring a bias inherited from the source data, and the bias tends to be magnified through the self-training process in a typical SSL algorithm. To overcome this issue, we propose to use pseudo-labels from the unlabelled data to update the feature extractor that is less sensitive to incorrect labels and only allow the classifier to be trained from the labeled data. More specifically, we progressively adjust the feature extractor to ensure its induced feature distribution maintains a good class separability even under strong input perturbation. Through extensive experimental studies, we show that the proposed approach achieves superior performance over existing solutions.

Summary
Semi-supervised learning (SSL) offers an attractive way to reduce the burden of data annotation because it can use both labeled and unlabeled data to build a predictive model. SSL algorithms are usually developed and evaluated assuming a randomly initialized network, whereas most vision recognition systems are built by fine-tuning a pretrained network. Recent literature suggests that naively applying state-of-the-art SSL to a pretrained model fails to unleash the full potential of the training data. We postulate that the pretrained feature representation carries a bias inherited from the source data, and that self-training in a typical SSL algorithm magnifies this bias. To overcome this, we use pseudo-labels from the unlabeled data to update the feature extractor, which is less sensitive to incorrect labels, and allow the classifier to be trained only from the labeled data. More specifically, we progressively adjust the feature extractor so that its induced feature distribution maintains good class separability even under strong input perturbation. Extensive experiments show that the proposed approach outperforms existing solutions.
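
Pseudo-labelling, which this entry relies on, assigns each unlabeled sample the class its current model predicts. The sketch below keeps only confident predictions; the confidence threshold is a widely used convention assumed here for illustration, and the entry does not say whether the authors filter pseudo-labels this way.

```python
def select_pseudo_labels(probabilities, threshold=0.95):
    """Return (sample_index, predicted_class) for confident predictions.

    probabilities is a list of per-sample class-probability lists.
    """
    selected = []
    for index, probs in enumerate(probabilities):
        confidence = max(probs)
        if confidence >= threshold:
            selected.append((index, probs.index(confidence)))
    return selected


if __name__ == "__main__":
    predictions = [[0.97, 0.03], [0.60, 0.40], [0.02, 0.98]]
    print(select_pseudo_labels(predictions))
```

In the scheme described in this entry, such pseudo-labels would drive updates to the feature extractor, while the classifier sees labeled data only.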

Generation and Recombination for Multifocus Image Fusion with Free Number of Inputs

paper_url: http://arxiv.org/abs/2309.04657
repo_url: None
paper_authors: Huafeng Li, Dan Wang, Yuxin Huang, Yafei Zhang, Zhengtao Yu
for: overcome the limitation of optical lenses and achieve simultaneous fusion of multiple images
methods: combining generation and recombination model (GRFusion), hard-pixel-guided recombination mechanism, and multi-directional gradient embedding method
results: effective and superior fusion performance, free from the number of inputs and with improved visual quality

Abstract
Multifocus image fusion is an effective way to overcome the limitation of optical lenses. Many existing methods obtain fused results by generating decision maps. However, such methods often assume that the focused areas of the two source images are complementary, making it impossible to achieve simultaneous fusion of multiple images. Additionally, the existing methods ignore the impact of hard pixels on fusion performance, limiting the visual quality improvement of fusion image. To address these issues, a combining generation and recombination model, termed as GRFusion, is proposed. In GRFusion, focus property detection of each source image can be implemented independently, enabling simultaneous fusion of multiple source images and avoiding information loss caused by alternating fusion. This makes GRFusion free from the number of inputs. To distinguish the hard pixels from the source images, we achieve the determination of hard pixels by considering the inconsistency among the detection results of focus areas in source images. Furthermore, a multi-directional gradient embedding method for generating full focus images is proposed. Subsequently, a hard-pixel-guided recombination mechanism for constructing fused result is devised, effectively integrating the complementary advantages of feature reconstruction-based method and focused pixel recombination-based method. Extensive experimental results demonstrate the effectiveness and the superiority of the proposed method. The source code will be released on https://github.com/xxx/xxx.

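Summary
Multifocus image fusion is an effective way to overcome the limitations of optical lenses. Many existing methods obtain the fused result by generating decision maps, but they often assume that the focused areas of the two source images are complementary, which makes simultaneous fusion of multiple images impossible. Existing methods also ignore the impact of hard pixels on fusion, limiting the visual quality of the fused image. To address these issues, we propose the GRFusion model. In GRFusion, the focus property of each source image can be detected independently, so multiple source images can be fused simultaneously, avoiding the information loss caused by alternating fusion. This frees GRFusion from any limit on the number of inputs. To identify hard pixels, we determine them from the inconsistency among the focus-area detection results of the source images. We further propose a multi-directional gradient embedding method for generating full-focus images. We then design a hard-pixel-guided recombination mechanism for constructing the fused result, which effectively combines the advantages of feature reconstruction-based and focused pixel recombination-based methods. Extensive experiments demonstrate the effectiveness and superiority of the proposed method. The source code will be released on GitHub.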
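
One possible reading of "inconsistency among the detection results of focus areas" is sketched below: with one binary focus map per source image, a pixel is treated as hard when the maps do not agree on exactly one in-focus source. Both the binary-map representation and the "exactly one vote" rule are assumptions made for this illustration, not the paper's definition.

```python
def hard_pixel_mask(focus_maps):
    """Flag pixels where the per-source focus maps are inconsistent.

    focus_maps is a list of equally sized 2D lists holding 0 or 1, one map
    per source image. A pixel is flagged when zero sources or more than one
    source claim it is in focus.
    """
    rows, cols = len(focus_maps[0]), len(focus_maps[0][0])
    mask = []
    for r in range(rows):
        row = []
        for c in range(cols):
            votes = sum(focus_map[r][c] for focus_map in focus_maps)
            row.append(votes != 1)
        mask.append(row)
    return mask


if __name__ == "__main__":
    map_a = [[1, 0], [1, 0]]
    map_b = [[0, 1], [1, 0]]
    print(hard_pixel_mask([map_a, map_b]))
```

Because each map is computed independently, the same check works for any number of source images, in line with the claim that the method is free from the number of inputs.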

Exploring Robust Features for Improving Adversarial Robustness

paper_url: http://arxiv.org/abs/2309.04650
repo_url: None
paper_authors: Hong Wang, Yuefan Deng, Shinjae Yoo, Yuewei Lin
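for: Enable the use of deep neural networks (DNNs) in safety-critical applications, where their fragility to carefully designed adversarial attacks is an obstacle.
methods: Proposes a feature disentanglement model that separates robust features from non-robust features and domain-specific features.
results: Extensive experiments on four widely used datasets show that the model improves robustness against adversarial attacks and accurately identifies domain-specific features.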

Abstract
While deep neural networks (DNNs) have revolutionized many fields, their fragility to carefully designed adversarial attacks impedes the usage of DNNs in safety-critical applications. In this paper, we strive to explore the robust features which are not affected by the adversarial perturbations, i.e., invariant to the clean image and its adversarial examples, to improve the model's adversarial robustness. Specifically, we propose a feature disentanglement model to segregate the robust features from non-robust features and domain specific features. The extensive experiments on four widely used datasets with different attacks demonstrate that robust features obtained from our model improve the model's adversarial robustness compared to the state-of-the-art approaches. Moreover, the trained domain discriminator is able to identify the domain specific features from the clean images and adversarial examples almost perfectly. This enables adversarial example detection without incurring additional computational costs. With that, we can also specify different classifiers for clean images and adversarial examples, thereby avoiding any drop in clean image accuracy.

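Summary
Deep neural networks (DNNs) have revolutionized many fields, but their fragility to carefully designed adversarial attacks limits their use in safety-critical applications. In this paper, we explore robust features that are unaffected by adversarial perturbations, that is, invariant between a clean image and its adversarial examples, to improve the model's adversarial robustness. Specifically, we propose a feature disentanglement model that separates robust features from non-robust features and domain-specific features. Extensive experiments on four widely used datasets under different attacks show that the robust features improve adversarial robustness, and that the trained domain discriminator accurately identifies domain-specific features from clean images and adversarial examples. This lets us use different classifiers for clean images and adversarial examples, avoiding any drop in clean-image accuracy.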
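
The idea of using different classifiers for clean images and adversarial examples amounts to a routing step driven by the domain discriminator. The sketch below shows only that control flow; the three callables are hypothetical stand-ins for trained models and carry no detail from the paper.

```python
def route_and_classify(x, is_adversarial, clean_classifier, adversarial_classifier):
    """Send x to the classifier that matches the detected input domain."""
    if is_adversarial(x):
        return adversarial_classifier(x)
    return clean_classifier(x)


if __name__ == "__main__":
    # Toy stand-ins: inputs are numbers, and large values count as adversarial.
    detector = lambda x: x > 10
    clean = lambda x: "clean-path"
    robust = lambda x: "robust-path"
    print(route_and_classify(3, detector, clean, robust))
    print(route_and_classify(42, detector, clean, robust))
```

Clean inputs never pass through the robust classifier in this arrangement, which is how clean-image accuracy can be preserved.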
