cs.CV - 2023-10-24

On the Foundations of Shortcut Learning

  • paper_url: http://arxiv.org/abs/2310.16228
  • repo_url: None
  • paper_authors: Katherine L. Hermann, Hossein Mobahi, Thomas Fel, Michael C. Mozer
  • for: This paper studies how deep-learning models choose which features to use, and how feature predictivity and availability interact to shape models' feature use.
  • methods: The authors construct a minimal, explicit generative framework for synthesizing classification datasets with two latent features, and test how input properties hypothesized to relate to availability, together with predictivity, affect a model's feature choice.
  • results: Linear models are found to be relatively unbiased, but introducing a single hidden layer with ReLU or Tanh units yields a shortcut bias; in naturalistic datasets, availability manipulations are identified that increase models' degree of shortcut bias.
    Abstract Deep-learning models can extract a rich assortment of features from data. Which features a model uses depends not only on predictivity-how reliably a feature indicates train-set labels-but also on availability-how easily the feature can be extracted, or leveraged, from inputs. The literature on shortcut learning has noted examples in which models privilege one feature over another, for example texture over shape and image backgrounds over foreground objects. Here, we test hypotheses about which input properties are more available to a model, and systematically study how predictivity and availability interact to shape models' feature use. We construct a minimal, explicit generative framework for synthesizing classification datasets with two latent features that vary in predictivity and in factors we hypothesize to relate to availability, and quantify a model's shortcut bias-its over-reliance on the shortcut (more available, less predictive) feature at the expense of the core (less available, more predictive) feature. We find that linear models are relatively unbiased, but introducing a single hidden layer with ReLU or Tanh units yields a bias. Our empirical findings are consistent with a theoretical account based on Neural Tangent Kernels. Finally, we study how models used in practice trade off predictivity and availability in naturalistic datasets, discovering availability manipulations which increase models' degree of shortcut bias. Taken together, these findings suggest that the propensity to learn shortcut features is a fundamental characteristic of deep nonlinear architectures warranting systematic study given its role in shaping how models solve tasks.
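
To make the predictivity/availability interplay concrete, here is a minimal sketch (plain NumPy; the parameter names, the use of feature scale as an availability proxy, and all numeric values are illustrative assumptions rather than the paper's exact generative framework) of a two-feature classification dataset in which the shortcut feature is less predictive but more available than the core feature:

```python
import numpy as np

def make_two_feature_dataset(n=10_000, p_core=0.95, p_shortcut=0.85,
                             availability_scale=4.0, seed=0):
    """Binary labels plus two latent features: a 'core' feature that agrees with
    the label more often (higher predictivity) and a 'shortcut' feature that is
    less predictive but amplified so it is easier to extract (a crude
    availability proxy)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, n) * 2 - 1                           # labels in {-1, +1}
    core = y * np.where(rng.random(n) < p_core, 1, -1)          # flips with prob 1 - p_core
    shortcut = y * np.where(rng.random(n) < p_shortcut, 1, -1)  # flips with prob 1 - p_shortcut
    X = np.stack([core, availability_scale * shortcut], axis=1).astype(float)
    X += 0.1 * rng.standard_normal(X.shape)                     # small observation noise
    return X, y

X, y = make_two_feature_dataset()
print(X.shape, y[:5])
```

The paper's comparison of linear models against one-hidden-layer ReLU/Tanh networks then amounts to measuring how strongly each model comes to rely on the scaled-up shortcut dimension.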

ShadowSense: Unsupervised Domain Adaptation and Feature Fusion for Shadow-Agnostic Tree Crown Detection from RGB-Thermal Drone Imagery

  • paper_url: http://arxiv.org/abs/2310.16212
  • repo_url: https://github.com/rudrakshkapil/shadowsense
  • paper_authors: Rudraksh Kapil, Seyed Mojtaba Marvasti-Zadeh, Nadir Erbilgin, Nilanjan Ray
  • for: This paper is written for detecting individual tree crowns from remote sensing data, specifically in the context of dense forests with diverse environmental variations.
  • methods: The proposed method, called ShadowSense, is entirely self-supervised and leverages domain adversarial training and feature pyramid networks to adapt domain-invariant representations and improve the accuracy of tree crown detection.
  • results: The proposed method outperforms both the baseline RGB-trained detector and state-of-the-art techniques that rely on unsupervised domain adaptation or early image fusion, as demonstrated through extensive experiments.
    Abstract Accurate detection of individual tree crowns from remote sensing data poses a significant challenge due to the dense nature of forest canopy and the presence of diverse environmental variations, e.g., overlapping canopies, occlusions, and varying lighting conditions. Additionally, the lack of data for training robust models adds another limitation in effectively studying complex forest conditions. This paper presents a novel method for detecting shadowed tree crowns and provides a challenging dataset comprising roughly 50k paired RGB-thermal images to facilitate future research for illumination-invariant detection. The proposed method (ShadowSense) is entirely self-supervised, leveraging domain adversarial training without source domain annotations for feature extraction and foreground feature alignment for feature pyramid networks to adapt domain-invariant representations by focusing on visible foreground regions, respectively. It then fuses complementary information of both modalities to effectively improve upon the predictions of an RGB-trained detector and boost the overall accuracy. Extensive experiments demonstrate the superiority of the proposed method over both the baseline RGB-trained detector and state-of-the-art techniques that rely on unsupervised domain adaptation or early image fusion. Our code and data are available: https://github.com/rudrakshkapil/ShadowSense
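
Domain adversarial training is one of the two self-supervised ingredients named above. The sketch below shows a generic gradient-reversal-based domain discriminator in PyTorch; the module names, feature dimension, and loss wiring are assumptions for illustration, not ShadowSense's actual implementation:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DomainDiscriminator(nn.Module):
    """Predicts which modality (e.g., RGB vs. thermal) a pooled feature came from."""
    def __init__(self, dim=256, lam=1.0):
        super().__init__()
        self.lam = lam
        self.head = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, feat):                     # feat: (batch, dim)
        return self.head(GradReverse.apply(feat, self.lam))

disc = DomainDiscriminator()
feats = torch.randn(8, 256, requires_grad=True)  # stand-in for pooled backbone features
domain_labels = torch.randint(0, 2, (8, 1)).float()
loss = nn.functional.binary_cross_entropy_with_logits(disc(feats), domain_labels)
loss.backward()  # gradient reaching `feats` is reversed, pushing features toward domain invariance
```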

Sea-Land-Cloud Segmentation in Satellite Hyperspectral Imagery by Deep Learning

  • paper_url: http://arxiv.org/abs/2310.16210
  • repo_url: https://github.com/jonalvjusto/s_l_c_segm_hyp_img
  • paper_authors: Jon Alvarez Justo, Joseph Landon Garrett, Mariana-Iuliana Georgescu, Jesus Gonzalez-Llorente, Radu Tudor Ionescu, Tor Arne Johansen
  • For: The paper is written for on-board Artificial Intelligence (AI) techniques that enhance the autonomy of satellite platforms through edge inference, specifically focusing on multi-class segmentation of hyperspectral (HS) satellite imagery into sea, land, and cloud formations.
  • Methods: The paper employs 16 different deep learning (DL) models for segmenting HS imagery, including both shallow and deep models, and proposes four new DL models. The models are trained and evaluated for in-orbit deployment, considering performance, parameter count, and inference time.
  • Results: The paper shows that the proposed 1D-Justo-LiuNet model consistently outperforms state-of-the-art models for sea-land-cloud segmentation in terms of performance (0.93 accuracy) and parameter count (4,563), but presents a longer inference time (15 s) on the tested processing architecture. Additionally, reducing the spectral channels down to 3 lowers the models' parameters and inference time, but at the cost of weaker segmentation performance.
    Abstract Satellites are increasingly adopting on-board Artificial Intelligence (AI) techniques to enhance platforms' autonomy through edge inference. In this context, the utilization of deep learning (DL) techniques for segmentation in HS satellite imagery offers advantages for remote sensing applications, and therefore, we train 16 different models, whose codes are made available through our study, which we consider to be relevant for on-board multi-class segmentation of HS imagery, focusing on classifying oceanic (sea), terrestrial (land), and cloud formations. We employ the HYPSO-1 mission as an illustrative case for sea-land-cloud segmentation, and to demonstrate the utility of the segments, we introduce a novel sea-land-cloud ranking application scenario. Our system prioritizes HS image downlink based on sea, land, and cloud coverage levels from the segmented images. We comparatively evaluate the models for in-orbit deployment, considering performance, parameter count, and inference time. The models include both shallow and deep models, and after we propose four new DL models, we demonstrate that segmenting single spectral signatures (1D) outperforms 3D data processing comprising both spectral (1D) and spatial (2D) contexts. We conclude that our lightweight DL model, called 1D-Justo-LiuNet, consistently surpasses state-of-the-art models for sea-land-cloud segmentation, such as U-Net and its variations, in terms of performance (0.93 accuracy) and parameter count (4,563). However, the 1D models present longer inference time (15s) in the tested processing architecture, which is clearly suboptimal. Finally, after demonstrating that in-orbit image segmentation should occur post L1b radiance calibration rather than on raw data, we additionally show that reducing spectral channels down to 3 lowers models' parameters and inference time, at the cost of weaker segmentation performance.
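
Since the headline finding is that segmenting single spectral signatures (1D) outperforms joint spectral-spatial (3D) processing, the sketch below shows what a per-pixel 1D spectral classifier looks like in PyTorch. It is a generic stand-in rather than the published 1D-Justo-LiuNet: the band count, layer widths, and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class Spectral1DNet(nn.Module):
    """Classify each hyperspectral pixel as sea, land, or cloud from its 1D spectrum alone."""
    def __init__(self, n_bands=120, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.head = nn.Linear(32 * (n_bands // 4), n_classes)

    def forward(self, x):                     # x: (n_pixels, n_bands)
        z = self.features(x.unsqueeze(1))     # add a channel dimension for Conv1d
        return self.head(z.flatten(1))

pixels = torch.rand(8, 120)                   # 8 pixels, 120 spectral bands (arbitrary count)
print(Spectral1DNet()(pixels).shape)          # torch.Size([8, 3])
```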

Learning Low-Rank Latent Spaces with Simple Deterministic Autoencoder: Theoretical and Empirical Insights

  • paper_url: http://arxiv.org/abs/2310.16194
  • repo_url: None
  • paper_authors: Alokendu Mazumder, Tirthajit Baruah, Bhartendu Kumar, Rishab Sharma, Vishwajeet Pattanaik, Punit Rathore
  • for: To improve how autoencoders represent data, producing more compact, lower-dimensional latent representations.
  • methods: A low-rank regularizer is incorporated so that the autoencoder adaptively reconstructs a low-dimensional latent space while preserving the basic autoencoder objective.
  • results: Empirically, the model outperforms traditional autoencoders on tasks such as image generation and downstream classification, and the theoretical analysis establishes a tighter error bound for the model.
    Abstract The autoencoder is an unsupervised learning paradigm that aims to create a compact latent representation of data by minimizing the reconstruction loss. However, it tends to overlook the fact that most data (images) are embedded in a lower-dimensional space, which is crucial for effective data representation. To address this limitation, we propose a novel approach called Low-Rank Autoencoder (LoRAE). In LoRAE, we incorporated a low-rank regularizer to adaptively reconstruct a low-dimensional latent space while preserving the basic objective of an autoencoder. This helps embed the data in a lower-dimensional space while preserving important information. It is a simple autoencoder extension that learns low-rank latent space. Theoretically, we establish a tighter error bound for our model. Empirically, our model's superiority shines through various tasks such as image generation and downstream classification. Both theoretical and practical outcomes highlight the importance of acquiring low-dimensional embeddings.
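
A natural reading of the "low-rank regularizer" is a rank penalty on the batch of latent codes added to the usual reconstruction objective. The sketch below uses the nuclear norm as a convex rank surrogate; this specific surrogate and the weight lam are assumptions, not necessarily LoRAE's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def low_rank_ae_loss(x, encoder, decoder, lam=1e-3):
    """Reconstruction loss plus a nuclear-norm penalty on the batch of latent codes,
    encouraging the learned latent space to be (approximately) low-rank."""
    z = encoder(x)                                      # (batch, latent_dim)
    rec_loss = F.mse_loss(decoder(z), x)
    low_rank = torch.linalg.matrix_norm(z, ord="nuc")   # sum of singular values of the code matrix
    return rec_loss + lam * low_rank

enc, dec = nn.Linear(784, 32), nn.Linear(32, 784)       # toy encoder/decoder
loss = low_rank_ae_loss(torch.rand(16, 784), enc, dec)
loss.backward()
```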

G-CASCADE: Efficient Cascaded Graph Convolutional Decoding for 2D Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.16175
  • repo_url: https://github.com/SLDGroup/G-CASCADE
  • paper_authors: Md Mostafijur Rahman, Radu Marculescu
  • for: Medical image segmentation, an important application in computer-aided diagnosis; the paper proposes a new graph convolution-based decoder, the Cascaded Graph Convolutional Attention Decoder (G-CASCADE), for 2D medical image segmentation.
  • methods: G-CASCADE progressively refines the multi-stage feature maps generated by hierarchical transformer encoders with an efficient graph convolution block, preserving long-range information thanks to the block's global receptive field.
  • results: Across five medical image segmentation tasks (abdominal organs, cardiac organs, polyp lesions, skin lesions, and retinal vessels), the model outperforms other state-of-the-art (SOTA) methods, achieving better DICE scores than the SOTA CASCADE decoder with 80.8% fewer parameters and 82.3% fewer FLOPs; the decoder can also be paired with other hierarchical encoders for general-purpose semantic and medical image segmentation tasks.
    Abstract In recent years, medical image segmentation has become an important application in the field of computer-aided diagnosis. In this paper, we are the first to propose a new graph convolution-based decoder namely, Cascaded Graph Convolutional Attention Decoder (G-CASCADE), for 2D medical image segmentation. G-CASCADE progressively refines multi-stage feature maps generated by hierarchical transformer encoders with an efficient graph convolution block. The encoder utilizes the self-attention mechanism to capture long-range dependencies, while the decoder refines the feature maps preserving long-range information due to the global receptive fields of the graph convolution block. Rigorous evaluations of our decoder with multiple transformer encoders on five medical image segmentation tasks (i.e., Abdomen organs, Cardiac organs, Polyp lesions, Skin lesions, and Retinal vessels) show that our model outperforms other state-of-the-art (SOTA) methods. We also demonstrate that our decoder achieves better DICE scores than the SOTA CASCADE decoder with 80.8% fewer parameters and 82.3% fewer FLOPs. Our decoder can easily be used with other hierarchical encoders for general-purpose semantic and medical image segmentation tasks.

iNVS: Repurposing Diffusion Inpainters for Novel View Synthesis

  • paper_url: http://arxiv.org/abs/2310.16167
  • repo_url: None
  • paper_authors: Yash Kant, Aliaksandr Siarohin, Michael Vasilkovsky, Riza Alp Guler, Jian Ren, Sergey Tulyakov, Igor Gilitschenski
  • for: The paper is written for generating consistent novel views from a single source image, with a focus on maximizing the reuse of visible pixels from the source image.
  • methods: The paper uses a monocular depth estimator to transfer visible pixels from the source view to the target view, and trains the method on the large-scale Objaverse dataset to learn 3D object priors. The paper also introduces a novel masking mechanism based on epipolar lines to further improve the quality of the approach.
  • results: The paper demonstrates the zero-shot abilities of the framework on three challenging datasets: Google Scanned Objects, Ray Traced Multiview, and Common Objects in 3D, and shows that the approach can generate high-quality novel views without requiring any additional training data.
    Abstract We present a method for generating consistent novel views from a single source image. Our approach focuses on maximizing the reuse of visible pixels from the source image. To achieve this, we use a monocular depth estimator that transfers visible pixels from the source view to the target view. Starting from a pre-trained 2D inpainting diffusion model, we train our method on the large-scale Objaverse dataset to learn 3D object priors. While training we use a novel masking mechanism based on epipolar lines to further improve the quality of our approach. This allows our framework to perform zero-shot novel view synthesis on a variety of objects. We evaluate the zero-shot abilities of our framework on three challenging datasets: Google Scanned Objects, Ray Traced Multiview, and Common Objects in 3D. See our webpage for more details: https://yashkant.github.io/invs/

MyriadAL: Active Few Shot Learning for Histopathology

  • paper_url: http://arxiv.org/abs/2310.16161
  • repo_url: None
  • paper_authors: Nico Schiavone, Jingyi Wang, Shuangzhi Li, Roger Zemp, Xingyu Li
  • for: This paper addresses the issue of a limited annotation budget in active learning (AL) and few-shot learning (FSL) scenarios, particularly in the context of histopathology, where labelling is expensive.
  • methods: The proposed Myriad Active Learning (MAL) framework includes a contrastive-learning encoder, pseudo-label generation, and novel query sample selection in the loop. Unlabelled data is massaged in a self-supervised manner to obtain data representations and clustering knowledge, which are used to activate the AL loop. The pseudo-labels of unlabelled data are refined with feedback from an oracle in each AL cycle, and the updated pseudo-labels are used to improve active learning query selection.
  • results: Extensive experiments on two public histopathology datasets show that MAL has superior test accuracy, macro F1-score, and label efficiency compared to prior works, and can achieve a comparable test accuracy to a fully supervised algorithm while labelling only 5% of the dataset.
    Abstract Active Learning (AL) and Few Shot Learning (FSL) are two label-efficient methods which have achieved excellent results recently. However, most prior arts in both learning paradigms fail to explore the wealth of the vast unlabelled data. In this study, we address this issue in the scenario where the annotation budget is very limited, yet a large amount of unlabelled data for the target task is available. We frame this work in the context of histopathology where labelling is prohibitively expensive. To this end, we introduce an active few shot learning framework, Myriad Active Learning (MAL), including a contrastive-learning encoder, pseudo-label generation, and novel query sample selection in the loop. Specifically, we propose to massage unlabelled data in a self-supervised manner, where the obtained data representations and clustering knowledge form the basis to activate the AL loop. With feedback from the oracle in each AL cycle, the pseudo-labels of the unlabelled data are refined by optimizing a shallow task-specific net on top of the encoder. These updated pseudo-labels serve to inform and improve the active learning query selection process. Furthermore, we introduce a novel recipe to combine existing uncertainty measures and utilize the entire uncertainty list to reduce sample redundancy in AL. Extensive experiments on two public histopathology datasets show that MAL has superior test accuracy, macro F1-score, and label efficiency compared to prior works, and can achieve a comparable test accuracy to a fully supervised algorithm while labelling only 5% of the dataset.
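
The interaction of pseudo-labels, oracle feedback, and uncertainty-driven queries can be summarized in a short schematic loop. Everything below is a stand-in (the nearest-labelled-neighbour refinement, the uncertainty callback, the budget), not MAL's actual components:

```python
import numpy as np

def refine_pseudo_labels(features, labelled):
    """Give every sample the label of its nearest labelled sample -- a crude stand-in
    for the shallow task-specific net trained on top of the frozen encoder."""
    idx = np.array(list(labelled.keys()))
    lab = np.array(list(labelled.values()))
    dists = np.linalg.norm(features[:, None, :] - features[None, idx, :], axis=-1)
    return lab[dists.argmin(axis=1)]

def active_fewshot_loop(features, uncertainty_fn, oracle, n_cycles=3, budget=10):
    """Skeleton AL loop: query the most uncertain unlabelled samples, ask the oracle,
    then refine pseudo-labels for all samples before the next cycle."""
    labelled, pseudo = {}, np.zeros(len(features), dtype=int)
    for _ in range(n_cycles):
        scores = uncertainty_fn(features, pseudo).astype(float)  # higher = more uncertain
        if labelled:
            scores[list(labelled)] = -np.inf                     # never re-query labelled samples
        for i in np.argsort(scores)[-budget:]:
            labelled[int(i)] = oracle(int(i))
        pseudo = refine_pseudo_labels(features, labelled)
    return labelled, pseudo

feats = np.random.rand(200, 16)
labelled, pseudo = active_fewshot_loop(feats, lambda f, p: np.random.rand(len(f)),
                                       oracle=lambda i: i % 3)
print(len(labelled), np.bincount(pseudo))
```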

Pix2HDR – A pixel-wise acquisition and deep learning-based synthesis approach for high-speed HDR videos

  • paper_url: http://arxiv.org/abs/2310.16139
  • repo_url: None
  • paper_authors: Caixin Wang, Jie Zhang, Matthew A. Wilson, Ralph Etienne-Cummings
  • for: High-speed capture of high dynamic range (HDR) video, particularly of fast motion under low light and against bright backgrounds, is crucial for many vision applications.
  • methods: A pixel-wise programmable image sensor samples the video with individual pixels at varying exposures and phase offsets, simultaneously capturing fast motion and a high dynamic range. The pixel-wise outputs are then transformed into an HDR video using end-to-end learned weights from deep neural networks, achieving high spatiotemporal resolution with minimized motion blurring.
  • results: The method achieves aliasing-free HDR video acquisition at 1000 FPS, resolving fast motion under low-light conditions and against bright backgrounds, and enhances the vision system's adaptability and performance in dynamic conditions.
    Abstract Accurately capturing dynamic scenes with wide-ranging motion and light intensity is crucial for many vision applications. However, acquiring high-speed high dynamic range (HDR) video is challenging because the camera's frame rate restricts its dynamic range. Existing methods sacrifice speed to acquire multi-exposure frames. Yet, misaligned motion in these frames can still pose complications for HDR fusion algorithms, resulting in artifacts. Instead of frame-based exposures, we sample the videos using individual pixels at varying exposures and phase offsets. Implemented on a pixel-wise programmable image sensor, our sampling pattern simultaneously captures fast motion at a high dynamic range. We then transform pixel-wise outputs into an HDR video using end-to-end learned weights from deep neural networks, achieving high spatiotemporal resolution with minimized motion blurring. We demonstrate aliasing-free HDR video acquisition at 1000 FPS, resolving fast motion under low-light conditions and against bright backgrounds - both challenging conditions for conventional cameras. By combining the versatility of pixel-wise sampling patterns with the strength of deep neural networks at decoding complex scenes, our method greatly enhances the vision system's adaptability and performance in dynamic conditions.
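
The key departure from frame-based capture is that exposure varies per pixel rather than per frame. A toy generator for such a spatially interleaved exposure/phase pattern is sketched below; the exposure values, tiling rule, and phase scheme are invented for illustration and are not the sensor program used in the paper:

```python
import numpy as np

def pixel_exposure_pattern(h, w, exposures=(1, 4, 16, 64), n_phases=4):
    """Assign each pixel an exposure time and a phase offset so that neighbouring
    pixels sample the scene at different exposures and different moments in time."""
    exp_idx = np.add.outer(np.arange(h), np.arange(w)) % len(exposures)          # diagonal tiling
    exposure_map = np.asarray(exposures)[exp_idx]                                # per-pixel exposure
    phase_map = (np.arange(h)[:, None] + 2 * np.arange(w)[None, :]) % n_phases   # per-pixel phase
    return exposure_map, phase_map

exp_map, phase_map = pixel_exposure_pattern(4, 6)
print(exp_map)
print(phase_map)
```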

Subtle Signals: Video-based Detection of Infant Non-nutritive Sucking as a Neurodevelopmental Cue

  • paper_url: http://arxiv.org/abs/2310.16138
  • repo_url: https://github.com/ostadabbas/nns-detection-and-segmentation
  • paper_authors: Shaotong Zhu, Michael Wan, Sai Kumar Reddy Manne, Emily Zimmerman, Sarah Ostadabbas
  • for: This paper aims to develop a vision-based algorithm for non-contact detection of non-nutritive sucking (NNS) activity in infants using baby monitor footage.
  • methods: The proposed algorithm utilizes optical flow and temporal convolutional networks to detect and amplify subtle infant-sucking signals from baby monitor videos.
  • results: The authors successfully classify short video clips of uniform length into NNS and non-NNS periods, and investigate manual and learning-based techniques to piece together local classification results for segmenting longer mixed-activity videos into NNS and non-NNS segments of varying duration. Two novel datasets of annotated infant videos are introduced, including one sourced from a clinical study featuring 19 infant subjects and 183 hours of overnight baby monitor footage.
    Abstract Non-nutritive sucking (NNS), which refers to the act of sucking on a pacifier, finger, or similar object without nutrient intake, plays a crucial role in assessing healthy early development. In the case of preterm infants, NNS behavior is a key component in determining their readiness for feeding. In older infants, the characteristics of NNS behavior offer valuable insights into neural and motor development. Additionally, NNS activity has been proposed as a potential safeguard against sudden infant death syndrome (SIDS). However, the clinical application of NNS assessment is currently hindered by labor-intensive and subjective finger-in-mouth evaluations. Consequently, researchers often resort to expensive pressure transducers for objective NNS signal measurement. To enhance the accessibility and reliability of NNS signal monitoring for both clinicians and researchers, we introduce a vision-based algorithm designed for non-contact detection of NNS activity using baby monitor footage in natural settings. Our approach involves a comprehensive exploration of optical flow and temporal convolutional networks, enabling the detection and amplification of subtle infant-sucking signals. We successfully classify short video clips of uniform length into NNS and non-NNS periods. Furthermore, we investigate manual and learning-based techniques to piece together local classification results, facilitating the segmentation of longer mixed-activity videos into NNS and non-NNS segments of varying duration. Our research introduces two novel datasets of annotated infant videos, including one sourced from our clinical study featuring 19 infant subjects and 183 hours of overnight baby monitor footage.

Stereoscopic Depth Perception Through Foliage

  • paper_url: http://arxiv.org/abs/2310.16120
  • repo_url: None
  • paper_authors: Robert Kerschner, Rakesh John Amala Arokia Nathan, Rafal Mantiuk, Oliver Bimber
  • for: This paper investigates the possibilities and limits of human and computational approaches to discriminating the depth of objects hidden beneath foliage.
  • methods: Computational optical synthetic aperture sensing is combined with the human ability to fuse stereoscopic images; users' depth discrimination is tested on video captured by a drone above dense woodland.
  • results: Depth discrimination fails with monoscopic video (motion parallax) and with plain stereoscopic video because of occlusions caused by foliage, but when synthetic aperture sensing reduces occlusions and disparity-scaled stereoscopic video is presented, human observers successfully discriminate depth where computational stereoscopic matching fails.
    Abstract Both humans and computational methods struggle to discriminate the depths of objects hidden beneath foliage. However, such discrimination becomes feasible when we combine computational optical synthetic aperture sensing with the human ability to fuse stereoscopic images. For object identification tasks, as required in search and rescue, wildlife observation, surveillance, and early wildfire detection, depth assists in differentiating true from false findings, such as people, animals, or vehicles vs. sun-heated patches at the ground level or in the tree crowns, or ground fires vs. tree trunks. We used video captured by a drone above dense woodland to test users' ability to discriminate depth. We found that this is impossible when viewing monoscopic video and relying on motion parallax. The same was true with stereoscopic video because of the occlusions caused by foliage. However, when synthetic aperture sensing was used to reduce occlusions and disparity-scaled stereoscopic video was presented, whereas computational (stereoscopic matching) methods were unsuccessful, human observers successfully discriminated depth. This shows the potential of systems which exploit the synergy between computational methods and human vision to perform tasks that neither can perform alone.

Wakening Past Concepts without Past Data: Class-Incremental Learning from Online Placebos

  • paper_url: http://arxiv.org/abs/2310.16115
  • repo_url: None
  • paper_authors: Yaoyao Liu, Yingying Li, Bernt Schiele, Qianru Sun
  • for: This work takes a deep dive into preserving old-class knowledge in class-incremental learning (CIL), where a model continuously adapts to new classes under strict memory limitations.
  • methods: The study analyzes knowledge distillation (KD) losses and finds that using new-class data for KD both hinders adaptation to new classes and is inefficient at preserving old-class knowledge. It therefore proposes using "placebos" of old classes for KD, chosen automatically and economically from a free image stream such as Google Images by an online placebo-selection policy, whose training is formulated as an online Markov Decision Process (MDP).
  • results: The method 1) is surprisingly effective even when there is no class overlap between the placebos and the original old-class data, 2) requires no additional supervision or memory budget, and 3) significantly outperforms a number of top-performing CIL methods, in particular under lower memory budgets for old-class exemplars (e.g., five exemplars per class).
    Abstract Not forgetting old class knowledge is a key challenge for class-incremental learning (CIL) when the model continuously adapts to new classes. A common technique to address this is knowledge distillation (KD), which penalizes prediction inconsistencies between old and new models. Such prediction is made with almost new class data, as old class data is extremely scarce due to the strict memory limitation in CIL. In this paper, we take a deep dive into KD losses and find that "using new class data for KD" not only hinders the model adaption (for learning new classes) but also results in low efficiency for preserving old class knowledge. We address this by "using the placebos of old classes for KD", where the placebos are chosen from a free image stream, such as Google Images, in an automatical and economical fashion. To this end, we train an online placebo selection policy to quickly evaluate the quality of streaming images (good or bad placebos) and use only good ones for one-time feed-forward computation of KD. We formulate the policy training process as an online Markov Decision Process (MDP), and introduce an online learning algorithm to solve this MDP problem without causing much computation costs. In experiments, we show that our method 1) is surprisingly effective even when there is no class overlap between placebos and original old class data, 2) does not require any additional supervision or memory budget, and 3) significantly outperforms a number of top-performing CIL methods, in particular when using lower memory budgets for old class exemplars, e.g., five exemplars per class.
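
The central mechanism, distilling old-class behaviour through unlabeled "placebo" images instead of new-class data or stored exemplars, can be written as a temperature-scaled KD loss evaluated on a placebo batch. The sketch below is a generic formulation; the temperature and KL form are common KD defaults rather than the paper's exact loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def placebo_kd_loss(new_model, old_model, placebo_images, T=2.0):
    """Penalize disagreement between the frozen old model and the adapting new model
    on placebo images, preserving old-class knowledge without old-class exemplars."""
    with torch.no_grad():
        old_logits = old_model(placebo_images)
    new_logits = new_model(placebo_images)
    return F.kl_div(F.log_softmax(new_logits / T, dim=1),
                    F.softmax(old_logits / T, dim=1),
                    reduction="batchmean") * (T * T)

old_net, new_net = nn.Linear(32, 10), nn.Linear(32, 10)   # stand-ins for old/new classifiers
print(placebo_kd_loss(new_net, old_net, torch.randn(8, 32)))
```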

Towards long-tailed, multi-label disease classification from chest X-ray: Overview of the CXR-LT challenge

  • paper_url: http://arxiv.org/abs/2310.16112
  • repo_url: None
  • paper_authors: Gregory Holste, Yiliang Zhou, Song Wang, Ajay Jaiswal, Mingquan Lin, Sherry Zhuge, Yuzhe Yang, Dongkyun Kim, Trong-Hieu Nguyen-Mau, Minh-Triet Tran, Jaehyup Jeong, Wongi Park, Jongbin Ryu, Feng Hong, Arsh Verma, Yosuke Yamagishi, Changhyun Kim, Hyeryeong Seo, Myungjoo Kang, Leo Anthony Celi, Zhiyong Lu, Ronald M. Summers, George Shih, Zhangyang Wang, Yifan Peng
  • for: This paper overviews the CXR-LT challenge on long-tailed, multi-label thorax disease classification from chest X-rays, studying how label imbalance and label co-occurrence interact in medical image recognition.
  • methods: The organizers publicly release a large-scale benchmark of over 350,000 chest X-rays, each labeled with at least one of 26 clinical findings following a long-tailed distribution, and synthesize the common themes of the top-performing challenge solutions.
  • results: The analysis yields practical recommendations for long-tailed, multi-label medical image classification, and the authors propose a path forward involving vision-language foundation models for few- and zero-shot disease classification.
    Abstract Many real-world image recognition problems, such as diagnostic medical imaging exams, are "long-tailed": there are a few common findings followed by many more relatively rare conditions. In chest radiography, diagnosis is both a long-tailed and multi-label problem, as patients often present with multiple findings simultaneously. While researchers have begun to study the problem of long-tailed learning in medical image recognition, few have studied the interaction of label imbalance and label co-occurrence posed by long-tailed, multi-label disease classification. To engage with the research community on this emerging topic, we conducted an open challenge, CXR-LT, on long-tailed, multi-label thorax disease classification from chest X-rays (CXRs). We publicly release a large-scale benchmark dataset of over 350,000 CXRs, each labeled with at least one of 26 clinical findings following a long-tailed distribution. We synthesize common themes of top-performing solutions, providing practical recommendations for long-tailed, multi-label medical image classification. Finally, we use these insights to propose a path forward involving vision-language foundation models for few- and zero-shot disease classification.
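
The challenge couples severe label imbalance with label co-occurrence. A common baseline for that setting (a generic recipe, not a method proposed by this overview) is a per-class re-weighted multi-label BCE; the effective-number weighting and the values below are illustrative:

```python
import torch
import torch.nn.functional as F

def class_balanced_multilabel_bce(logits, targets, class_counts, beta=0.999):
    """Multi-label BCE where each of the C classes is re-weighted by the inverse
    'effective number' of its training samples, so rare findings are not drowned
    out by the head classes. logits/targets: (B, C); class_counts: length-C."""
    counts = torch.as_tensor(class_counts, dtype=torch.float32, device=logits.device)
    weights = (1.0 - beta) / (1.0 - beta ** counts)       # effective-number class weighting
    weights = weights / weights.mean()                    # keep the overall loss scale stable
    return F.binary_cross_entropy_with_logits(logits, targets, weight=weights)

logits = torch.randn(4, 26)                               # 26 findings, as in CXR-LT
targets = torch.randint(0, 2, (4, 26)).float()
print(class_balanced_multilabel_bce(logits, targets, class_counts=list(range(100, 2700, 100))))
```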

Complex Image Generation SwinTransformer Network for Audio Denoising

  • paper_url: http://arxiv.org/abs/2310.16109
  • repo_url: https://github.com/YoushanZhang/CoxImgSwinTransformer
  • paper_authors: Youshan Zhang, Jialu Li
  • for: This paper aims to achieve high-performance audio denoising for real-world applications.
  • methods: The audio denoising problem is converted into an image generation task. A complex image generation SwinTransformer network captures more information from the complex Fourier domain; structural-similarity and detail loss functions are imposed to generate high-quality images, and an SDR loss minimizes the difference between the denoised and clean audio.
  • results: Extensive experiments on two benchmark datasets demonstrate that the proposed model outperforms state-of-the-art methods.
    Abstract Achieving high-performance audio denoising is still a challenging task in real-world applications. Existing time-frequency methods often ignore the quality of generated frequency domain images. This paper converts the audio denoising problem into an image generation task. We first develop a complex image generation SwinTransformer network to capture more information from the complex Fourier domain. We then impose structure similarity and detailed loss functions to generate high-quality images and develop an SDR loss to minimize the difference between denoised and clean audios. Extensive experiments on two benchmark datasets demonstrate that our proposed model is better than state-of-the-art methods.
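
The SDR loss mentioned above is a standard waveform-domain objective; one common formulation (which may differ in detail from the paper's variant) is sketched below:

```python
import torch

def sdr_loss(denoised, clean, eps=1e-8):
    """Negative signal-to-distortion ratio (in dB) between denoised and clean
    waveforms, averaged over the batch. denoised, clean: (batch, samples)."""
    noise = denoised - clean
    sdr = 10 * torch.log10((clean.pow(2).sum(dim=-1) + eps)
                           / (noise.pow(2).sum(dim=-1) + eps))
    return -sdr.mean()

clean = torch.randn(2, 16000)
denoised = clean + 0.05 * torch.randn(2, 16000)
print(sdr_loss(denoised, clean))   # strongly negative (i.e., high SDR) for a good estimate
```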

LaksNet: an end-to-end deep learning model for self-driving cars in Udacity simulator

  • paper_url: http://arxiv.org/abs/2310.16103
  • repo_url: None
  • paper_authors: Lakshmikar R. Polamreddy, Youshan Zhang
  • for: To reduce the risk of road accidents caused by human error by advancing self-driving technology.
  • methods: A new and efficient deep learning model called `LaksNet' is proposed, consisting of four convolutional layers and two fully connected layers.
  • results: Trained on data generated from the Udacity simulator, the LaksNet model outperforms many existing pre-trained ImageNet and NVIDIA models in terms of how long the car drives without going off the track in the simulator.
    Abstract The majority of road accidents occur because of human errors, including distraction, recklessness, and drunken driving. One of the effective ways to overcome this dangerous situation is by implementing self-driving technologies in vehicles. In this paper, we focus on building an efficient deep-learning model for self-driving cars. We propose a new and effective convolutional neural network model called `LaksNet' consisting of four convolutional layers and two fully connected layers. We conduct extensive experiments using our LaksNet model with the training data generated from the Udacity simulator. Our model outperforms many existing pre-trained ImageNet and NVIDIA models in terms of the duration of the car for which it drives without going off the track on the simulator.
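
The architecture is small enough to sketch directly: four convolutional layers followed by two fully connected layers regressing a steering command from a camera frame. The filter counts, kernel sizes, strides, and the 66x200 input resolution below are assumptions, not the published LaksNet configuration:

```python
import torch
import torch.nn as nn

class LaksNetSketch(nn.Module):
    """Four conv layers + two fully connected layers mapping a front-camera frame
    to a single steering value (an illustrative configuration, not the paper's)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 48, 3, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, 3, stride=2), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.LazyLinear(100), nn.ReLU(), nn.Linear(100, 1))

    def forward(self, x):          # x: (B, 3, 66, 200) RGB frame
        return self.fc(self.conv(x))

print(LaksNetSketch()(torch.rand(1, 3, 66, 200)).shape)   # torch.Size([1, 1])
```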

Learned, Uncertainty-driven Adaptive Acquisition for Photon-Efficient Multiphoton Microscopy

  • paper_url: http://arxiv.org/abs/2310.16102
  • repo_url: None
  • paper_authors: Cassandra Tong Ye, Jiashu Han, Kunzan Liu, Anastasios Angelopoulos, Linda Griffith, Kristina Monakhova, Sixian You
  • for: To improve the trustworthiness, speed, and photon efficiency of multiphoton microscopy (MPM) imaging.
  • methods: A deep learning method simultaneously denoises multiphoton microscopy measurements and predicts pixel-wise uncertainty, providing statistical guarantees for the predictions; the learned uncertainty then drives an adaptive acquisition technique that rescans only the most uncertain regions of a sample.
  • results: On experimental noisy MPM measurements of human endometrium tissue, the method maintains fine features and outperforms other denoising methods while predicting uncertainty at each pixel; the adaptive acquisition technique achieves a 120X reduction in acquisition time and total light dose while successfully recovering fine features in the sample.
    Abstract Multiphoton microscopy (MPM) is a powerful imaging tool that has been a critical enabler for live tissue imaging. However, since most multiphoton microscopy platforms rely on point scanning, there is an inherent trade-off between acquisition time, field of view (FOV), phototoxicity, and image quality, often resulting in noisy measurements when fast, large FOV, and/or gentle imaging is needed. Deep learning could be used to denoise multiphoton microscopy measurements, but these algorithms can be prone to hallucination, which can be disastrous for medical and scientific applications. We propose a method to simultaneously denoise and predict pixel-wise uncertainty for multiphoton imaging measurements, improving algorithm trustworthiness and providing statistical guarantees for the deep learning predictions. Furthermore, we propose to leverage this learned, pixel-wise uncertainty to drive an adaptive acquisition technique that rescans only the most uncertain regions of a sample. We demonstrate our method on experimental noisy MPM measurements of human endometrium tissues, showing that we can maintain fine features and outperform other denoising methods while predicting uncertainty at each pixel. Finally, with our adaptive acquisition technique, we demonstrate a 120X reduction in acquisition time and total light dose while successfully recovering fine features in the sample. We are the first to demonstrate distribution-free uncertainty quantification for a denoising task with real experimental data and the first to propose adaptive acquisition based on reconstruction uncertainty
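
The adaptive-acquisition idea, rescan only where the network is least certain, reduces to ranking regions of a per-pixel uncertainty map. A tile-based version is sketched below; the tile size, budget, and scoring rule are illustrative choices rather than the paper's procedure:

```python
import numpy as np

def select_rescan_regions(uncertainty_map, tile=32, budget=0.1):
    """Pick the fraction `budget` of image tiles with the highest mean predicted
    uncertainty for rescanning; returns (row, col) tile coordinates."""
    h, w = uncertainty_map.shape
    th, tw = h // tile, w // tile
    tiles = uncertainty_map[:th * tile, :tw * tile].reshape(th, tile, tw, tile)
    tile_scores = tiles.mean(axis=(1, 3))                       # (th, tw) mean uncertainty per tile
    k = max(1, int(budget * tile_scores.size))
    flat = np.argsort(tile_scores.ravel())[-k:]                 # most uncertain tiles
    return np.stack(np.unravel_index(flat, tile_scores.shape), axis=1)

coords = select_rescan_regions(np.random.rand(512, 512))
print(coords[:3])
```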

Deep Feature Registration for Unsupervised Domain Adaptation

  • paper_url: http://arxiv.org/abs/2310.16100
  • repo_url: None
  • paper_authors: Youshan Zhang, Brian D. Davison
  • for: This work investigates how to better align source and target features in unsupervised domain adaptation to improve prediction performance.
  • methods: A deep feature registration (DFR) model generates registered features that maintain domain-invariant features while minimizing the domain dissimilarity between registered and target features via histogram matching. A pseudo-label refinement process, combining probabilistic soft selection and center-based hard selection, further improves the quality of pseudo-labels in the target domain.
  • results: Extensive experiments on multiple UDA benchmarks demonstrate the effectiveness of the DFR model, which achieves new state-of-the-art performance.
    Abstract While unsupervised domain adaptation has been explored to leverage the knowledge from a labeled source domain to an unlabeled target domain, existing methods focus on the distribution alignment between two domains. However, how to better align source and target features is not well addressed. In this paper, we propose a deep feature registration (DFR) model to generate registered features that maintain domain invariant features and simultaneously minimize the domain-dissimilarity of registered features and target features via histogram matching. We further employ a pseudo label refinement process, which considers both probabilistic soft selection and center-based hard selection to improve the quality of pseudo labels in the target domain. Extensive experiments on multiple UDA benchmarks demonstrate the effectiveness of our DFR model, resulting in new state-of-the-art performance.
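
Histogram matching of deep features, the alignment mechanism named above, can be approximated by per-dimension quantile (rank) mapping. The sketch below is a generic formulation and not necessarily DFR's exact procedure:

```python
import numpy as np

def match_feature_histograms(source_feats, target_feats):
    """Monotonically remap each source feature dimension so its empirical distribution
    matches the target domain's, via rank-to-quantile mapping."""
    matched = np.empty_like(source_feats)
    for d in range(source_feats.shape[1]):
        ranks = source_feats[:, d].argsort().argsort()           # rank of each source value
        quantiles = (ranks + 0.5) / len(ranks)
        matched[:, d] = np.quantile(target_feats[:, d], quantiles)
    return matched

src = np.random.randn(500, 64) + 2.0     # shifted source-domain features
tgt = np.random.randn(500, 64)           # target-domain features
print(match_feature_histograms(src, tgt).mean())   # close to the target mean (~0)
```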

From Posterior Sampling to Meaningful Diversity in Image Restoration

  • paper_url: http://arxiv.org/abs/2310.16047
  • repo_url: None
  • paper_authors: Noa Cohen, Hila Manor, Yuval Bahat, Tomer Michaeli
  • for: This paper targets meaningful diversity in image restoration, where each degraded image admits infinitely many valid restorations.
  • methods: The authors explore several post-processing techniques that can be combined with any diverse image restoration method to yield semantically meaningful diversity, and propose a practical approach that allows diffusion-based image restoration methods to generate meaningfully diverse outputs with only negligible computational overhead.
  • results: Extensive user studies show that the strategy of reducing similarity between outputs is significantly favorable over posterior sampling. Code and examples are available at https://noa-cohen.github.io/MeaningfulDiversityInIR.
    Abstract Image restoration problems are typically ill-posed in the sense that each degraded image can be restored in infinitely many valid ways. To accommodate this, many works generate a diverse set of outputs by attempting to randomly sample from the posterior distribution of natural images given the degraded input. Here we argue that this strategy is commonly of limited practical value because of the heavy tail of the posterior distribution. Consider for example inpainting a missing region of the sky in an image. Since there is a high probability that the missing region contains no object but clouds, any set of samples from the posterior would be entirely dominated by (practically identical) completions of sky. However, arguably, presenting users with only one clear sky completion, along with several alternative solutions such as airships, birds, and balloons, would better outline the set of possibilities. In this paper, we initiate the study of meaningfully diverse image restoration. We explore several post-processing approaches that can be combined with any diverse image restoration method to yield semantically meaningful diversity. Moreover, we propose a practical approach for allowing diffusion based image restoration methods to generate meaningfully diverse outputs, while incurring only negligent computational overhead. We conduct extensive user studies to analyze the proposed techniques, and find the strategy of reducing similarity between outputs to be significantly favorable over posterior sampling. Code and examples are available in https://noa-cohen.github.io/MeaningfulDiversityInIR

Stanford-ORB: A Real-World 3D Object Inverse Rendering Benchmark

  • paper_url: http://arxiv.org/abs/2310.16044
  • repo_url: None
  • paper_authors: Zhengfei Kuang, Yunzhi Zhang, Hong-Xing Yu, Samir Agarwala, Shangzhe Wu, Jiajun Wu
  • for: To provide a real-world 3D object inverse rendering benchmark for quantitatively assessing and comparing the performance of different inverse rendering methods.
  • methods: The paper introduces a new dataset of real-world objects captured under a variety of natural scenes, with ground-truth 3D scans, multi-view images, and environment lighting.
  • results: The dataset establishes the first comprehensive real-world evaluation benchmark for object inverse rendering from in-the-wild scenes, and the performance of various existing methods is compared on it.
    Abstract We introduce Stanford-ORB, a new real-world 3D Object inverse Rendering Benchmark. Recent advances in inverse rendering have enabled a wide range of real-world applications in 3D content generation, moving rapidly from research and commercial use cases to consumer devices. While the results continue to improve, there is no real-world benchmark that can quantitatively assess and compare the performance of various inverse rendering methods. Existing real-world datasets typically only consist of the shape and multi-view images of objects, which are not sufficient for evaluating the quality of material recovery and object relighting. Methods capable of recovering material and lighting often resort to synthetic data for quantitative evaluation, which on the other hand does not guarantee generalization to complex real-world environments. We introduce a new dataset of real-world objects captured under a variety of natural scenes with ground-truth 3D scans, multi-view images, and environment lighting. Using this dataset, we establish the first comprehensive real-world evaluation benchmark for object inverse rendering tasks from in-the-wild scenes, and compare the performance of various existing methods.

ConvBKI: Real-Time Probabilistic Semantic Mapping Network with Quantifiable Uncertainty

  • paper_url: http://arxiv.org/abs/2310.16020
  • repo_url: None
  • paper_authors: Joey Wilson, Yuewei Fu, Joshua Friesen, Parker Ewen, Andrew Capodieci, Paramsothy Jayakumar, Kira Barton, Maani Ghaffari
  • for: To develop a modular neural network for real-time semantic mapping in uncertain environments.
  • methods: Per-voxel probabilistic distributions are explicitly updated within a neural network layer, a Convolutional Bayesian Kernel Inference (ConvBKI) layer, which incorporates online semantic segmentation predictions into a 3D map through a depthwise convolution with conjugate priors, combining the reliability of classical probabilistic algorithms with the performance and efficiency of modern neural networks.
  • results: ConvBKI is compared against state-of-the-art deep learning approaches and probabilistic mapping algorithms to evaluate reliability and performance, and a Robot Operating System (ROS) package of ConvBKI is tested on real-world, perceptually challenging off-road driving data.
    Abstract In this paper, we develop a modular neural network for real-time semantic mapping in uncertain environments, which explicitly updates per-voxel probabilistic distributions within a neural network layer. Our approach combines the reliability of classical probabilistic algorithms with the performance and efficiency of modern neural networks. Although robotic perception is often divided between modern differentiable methods and classical explicit methods, a union of both is necessary for real-time and trustworthy performance. We introduce a novel Convolutional Bayesian Kernel Inference (ConvBKI) layer which incorporates semantic segmentation predictions online into a 3D map through a depthwise convolution layer by leveraging conjugate priors. We compare ConvBKI against state-of-the-art deep learning approaches and probabilistic algorithms for mapping to evaluate reliability and performance. We also create a Robot Operating System (ROS) package of ConvBKI and test it on real-world perceptually challenging off-road driving data.
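
The ConvBKI update can be pictured as adding spatially smoothed class evidence to per-voxel Dirichlet concentration parameters through a depthwise 3D convolution (one spatial kernel per semantic class). The sketch below captures only that structure; the kernel size, the softplus positivity trick, and the shapes are assumptions, not the paper's exact layer:

```python
import torch
import torch.nn as nn

class ConvBKISketch(nn.Module):
    """Update per-voxel Dirichlet concentrations from new semantic measurements
    with a depthwise 3D convolution, mimicking a conjugate (additive) Bayesian update."""
    def __init__(self, n_classes=10, kernel_size=3):
        super().__init__()
        self.kernel = nn.Conv3d(n_classes, n_classes, kernel_size,
                                padding=kernel_size // 2, groups=n_classes, bias=False)

    def forward(self, alpha, measurements):
        # alpha: current concentrations, measurements: per-voxel class evidence,
        # both shaped (B, n_classes, D, H, W)
        spread = nn.functional.softplus(self.kernel(measurements))  # keep the update non-negative
        return alpha + spread                                        # conjugate additive update

alpha = torch.ones(1, 10, 16, 16, 16)
meas = torch.rand(1, 10, 16, 16, 16)
print(ConvBKISketch()(alpha, meas).shape)   # torch.Size([1, 10, 16, 16, 16])
```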

CVPR 2023 Text Guided Video Editing Competition

  • paper_url: http://arxiv.org/abs/2310.16003
  • repo_url: https://github.com/showlab/loveu-tgve-2023
  • paper_authors: Jay Zhangjie Wu, Xiuyu Li, Difei Gao, Zhen Dong, Jinbin Bai, Aishani Singh, Xiaoyu Xiang, Youzeng Li, Zuwei Huang, Yuanxi Sun, Rui He, Feng Hu, Junhua Hu, Hai Huang, Hanyu Zhu, Xu Cheng, Jie Tang, Mike Zheng Shou, Kurt Keutzer, Forrest Iandola
  • for: To propose a new dataset for text-guided video editing (TGVE) and run a competition at CVPR to evaluate models on it.
  • methods: Building on generative AI based on text-to-image models such as Stable Diffusion and Imagen, the paper introduces a new competition dataset for evaluating text-guided video editing models.
  • results: The paper presents a retrospective on the competition and describes the winning method.
    Abstract Humans watch more than a billion hours of video per day. Most of this video was edited manually, which is a tedious process. However, AI-enabled video-generation and video-editing is on the rise. Building on text-to-image models like Stable Diffusion and Imagen, generative AI has improved dramatically on video tasks. But it's hard to evaluate progress in these video tasks because there is no standard benchmark. So, we propose a new dataset for text-guided video editing (TGVE), and we run a competition at CVPR to evaluate models on our TGVE dataset. In this paper we present a retrospective on the competition and describe the winning method. The competition dataset is available at https://sites.google.com/view/loveucvpr23/track4.

Integrating View Conditions for Image Synthesis

  • paper_url: http://arxiv.org/abs/2310.16002
  • repo_url: https://github.com/viiika/viewcontrol
  • paper_authors: Jinbin Bai, Zhen Dong, Aosong Feng, Xiao Zhang, Tian Ye, Kaicheng Zhou, Mike Zheng Shou
  • for: The challenge of applying intricate semantic modifications to existing images in the field of image processing.
  • methods: The paper introduces a framework that integrates viewpoint information to enhance the control of image editing tasks, distilling three essential criteria, consistency, controllability, and harmony, from a survey of existing object editing methods.
  • results: Through quantitative assessments and qualitative comparisons with contemporary state-of-the-art methods, the framework is shown to perform better across multiple dimensions.
    Abstract In the field of image processing, applying intricate semantic modifications within existing images remains an enduring challenge. This paper introduces a pioneering framework that integrates viewpoint information to enhance the control of image editing tasks. By surveying existing object editing methodologies, we distill three essential criteria, consistency, controllability, and harmony, that should be met for an image editing method. In contrast to previous approaches, our method takes the lead in satisfying all three requirements for addressing the challenge of image synthesis. Through comprehensive experiments, encompassing both quantitative assessments and qualitative comparisons with contemporary state-of-the-art methods, we present compelling evidence of our framework's superior performance across multiple dimensions. This work establishes a promising avenue for advancing image synthesis techniques and empowering precise object modifications while preserving the visual coherence of the entire composition.

Transitivity Recovering Decompositions: Interpretable and Robust Fine-Grained Relationships

  • paper_url: http://arxiv.org/abs/2310.15999
  • repo_url: https://github.com/abhrac/trd
  • paper_authors: Abhra Chaudhuri, Massimiliano Mancini, Zeynep Akata, Anjan Dutta
  • for: This work aims to deconstruct the abstract relational representations used in fine-grained representation learning, which leverage local-to-global (emergent) relationships to achieve state-of-the-art results.
  • methods: Abstract relational representations are expressed as interpretable graphs over image views; Transitivity Recovering Decompositions (TRD), a graph-space search algorithm, identifies interpretable equivalents of abstract emergent relationships at both instance and class levels without post-hoc computations.
  • results: TRD is provably robust to noisy views, with supporting empirical evidence, and performs on par with or better than the state of the art while being fully interpretable.
    Abstract Recent advances in fine-grained representation learning leverage local-to-global (emergent) relationships for achieving state-of-the-art results. The relational representations relied upon by such methods, however, are abstract. We aim to deconstruct this abstraction by expressing them as interpretable graphs over image views. We begin by theoretically showing that abstract relational representations are nothing but a way of recovering transitive relationships among local views. Based on this, we design Transitivity Recovering Decompositions (TRD), a graph-space search algorithm that identifies interpretable equivalents of abstract emergent relationships at both instance and class levels, and with no post-hoc computations. We additionally show that TRD is provably robust to noisy views, with empirical evidence also supporting this finding. The latter allows TRD to perform at par or even better than the state-of-the-art, while being fully interpretable. Implementation is available at https://github.com/abhrac/trd.

Vision-Language Pseudo-Labels for Single-Positive Multi-Label Learning

  • paper_url: http://arxiv.org/abs/2310.15985
  • repo_url: https://github.com/mvrl/vlpl
  • paper_authors: Xin Xing, Zhexiao Xiong, Abby Stylianou, Srikumar Sastry, Liyu Gong, Nathan Jacobs
  • For: To solve the problem of a single image belonging to multiple categories simultaneously when only a single positive annotation per image is available in the training data (Single-Positive Multi-label Learning, SPML).
  • Methods: A novel Vision-Language Pseudo-Labeling (VLPL) approach uses a vision-language model to suggest strong positive and negative pseudo-labels.
  • Results: VLPL outperforms the current state-of-the-art methods by 5.5% on Pascal VOC, 18.4% on MS-COCO, 15.2% on NUS-WIDE, and 8.4% on CUB-Birds.
    Abstract This paper presents a novel approach to Single-Positive Multi-label Learning. In general multi-label learning, a model learns to predict multiple labels or categories for a single input image. This is in contrast with standard multi-class image classification, where the task is predicting a single label from many possible labels for an image. Single-Positive Multi-label Learning (SPML) specifically considers learning to predict multiple labels when there is only a single annotation per image in the training data. Multi-label learning is in many ways a more realistic task than single-label learning as real-world data often involves instances belonging to multiple categories simultaneously; however, most common computer vision datasets predominantly contain single labels due to the inherent complexity and cost of collecting multiple high quality annotations for each instance. We propose a novel approach called Vision-Language Pseudo-Labeling (VLPL), which uses a vision-language model to suggest strong positive and negative pseudo-labels, and outperforms the current SOTA methods by 5.5% on Pascal VOC, 18.4% on MS-COCO, 15.2% on NUS-WIDE, and 8.4% on CUB-Birds. Our code and data are available at https://github.com/mvrl/VLPL.
    摘要
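
The abstract does not spell out how VLPL turns vision-language scores into training targets, so the following is only a minimal sketch of the general idea under assumed thresholds: similarity scores from a vision-language model (random placeholders here) are thresholded into positive and negative pseudo-labels, the single annotated positive is always kept, and a binary cross-entropy loss is applied to the confident entries. The name `vlpl_pseudo_label_loss` and the thresholds `pos_thr`/`neg_thr` are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def vlpl_pseudo_label_loss(sim, single_pos_idx, logits, pos_thr=0.7, neg_thr=0.2):
    """Toy SPML loss with vision-language pseudo-labels.

    sim:            (B, C) image-to-label-prompt similarities in [0, 1]
                    (e.g. normalized CLIP-style scores; random placeholders here).
    single_pos_idx: (B,) index of the one annotated positive label per image.
    logits:         (B, C) raw predictions of the classifier being trained.
    """
    B, C = sim.shape
    targets = torch.full((B, C), -1.0)               # -1 = ignore (uncertain)
    targets[sim >= pos_thr] = 1.0                    # confident positive pseudo-labels
    targets[sim <= neg_thr] = 0.0                    # confident negative pseudo-labels
    targets[torch.arange(B), single_pos_idx] = 1.0   # always trust the given positive

    mask = targets >= 0                              # train only on confident entries
    return F.binary_cross_entropy_with_logits(logits[mask], targets[mask])

# usage with random placeholders standing in for a real vision-language model
sim = torch.rand(4, 20)
logits = torch.randn(4, 20, requires_grad=True)
loss = vlpl_pseudo_label_loss(sim, torch.randint(0, 20, (4,)), logits)
loss.backward()
```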

Geometry-Aware Video Quality Assessment for Dynamic Digital Human

  • paper_url: http://arxiv.org/abs/2310.15984
  • repo_url: None
  • paper_authors: Zicheng Zhang, Yingjie Zhou, Wei Sun, Xiongkuo Min, Guangtao Zhai
  • for: This work proposes a no-reference video quality assessment method for Dynamic Digital Humans (DDHs), which suffer from noise/shift during generation and compression distortion during transmission.
  • methods: Geometry characteristics are described by statistical parameters estimated from the DDHs' geometry attribute distributions; spatial and temporal features are extracted from the rendered videos, and all features are integrated and regressed into quality values.
  • results: Experiments show that the proposed method achieves state-of-the-art performance on the DDH-QA database.
    Abstract Dynamic Digital Humans (DDHs) are 3D digital models that are animated using predefined motions and are inevitably bothered by noise/shift during the generation process and compression distortion during the transmission process, which needs to be perceptually evaluated. Usually, DDHs are displayed as 2D rendered animation videos and it is natural to adapt video quality assessment (VQA) methods to DDH quality assessment (DDH-QA) tasks. However, the VQA methods are highly dependent on viewpoints and less sensitive to geometry-based distortions. Therefore, in this paper, we propose a novel no-reference (NR) geometry-aware video quality assessment method for DDH-QA challenge. Geometry characteristics are described by the statistical parameters estimated from the DDHs' geometry attribute distributions. Spatial and temporal features are acquired from the rendered videos. Finally, all kinds of features are integrated and regressed into quality values. Experimental results show that the proposed method achieves state-of-the-art performance on the DDH-QA database.
    摘要 “几何动态人体”(DDH)是一种三维数字模型,通过预先定义的动作来动态显示,但是在生成和传输过程中受到噪音/移动的干扰,需要进行感知评估。通常,DDH会被显示为2Drendered动画影片,因此可以将影片质量评估(VQA)方法应用到DDH质量评估(DDH-QA)任务中。但是,VQA方法对于视点有高度依赖,较少关注几何基于的扭曲。因此,在本文中,我们提出一种新的无参考(NR)几何意识的影片质量评估方法,用于DDH-QA挑战。几何特征被描述为DDH几何特征分布的统计参数。从rendered影片中获取的空间和时间特征。最后,所有类型的特征被统合并对质量值进行回推。实验结果显示,提议的方法在DDH-QA数据库中实现了国际级的表现。

Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection

  • paper_url: http://arxiv.org/abs/2310.15955
  • repo_url: None
  • paper_authors: Manyuan Zhang, Guanglu Song, Yu Liu, Hongsheng Li
  • for: This work aims to improve DETR's object detection performance by resolving the spatial misalignment between classification and box localization.
  • methods: Building on the DETR architecture, classification and localization are spatially decoupled (SD-DETR) through a task-aware query generation module, a split cross-attention block, and an alignment loss (a minimal decoupled-query sketch follows this entry).
  • results: Extensive experiments on MS-COCO show consistent gains, for instance improving Conditional DETR by 4.5 AP.
    Abstract The introduction of DETR represents a new paradigm for object detection. However, its decoder conducts classification and box localization using shared queries and cross-attention layers, leading to suboptimal results. We observe that different regions of interest in the visual feature map are suitable for performing query classification and box localization tasks, even for the same object. Salient regions provide vital information for classification, while the boundaries around them are more favorable for box regression. Unfortunately, such spatial misalignment between these two tasks greatly hinders DETR's training. Therefore, in this work, we focus on decoupling localization and classification tasks in DETR. To achieve this, we introduce a new design scheme called spatially decoupled DETR (SD-DETR), which includes a task-aware query generation module and a disentangled feature learning process. We elaborately design the task-aware query initialization process and divide the cross-attention block in the decoder to allow the task-aware queries to match different visual regions. Meanwhile, we also observe that the prediction misalignment problem for high classification confidence and precise localization exists, so we propose an alignment loss to further guide the spatially decoupled DETR training. Through extensive experiments, we demonstrate that our approach achieves a significant improvement in MSCOCO datasets compared to previous work. For instance, we improve the performance of Conditional DETR by 4.5 AP. By spatially disentangling the two tasks, our method overcomes the misalignment problem and greatly improves the performance of DETR for object detection.
    摘要 DETR的引入标志着对象检测领域的新 paradigm。然而,DETR的解码器使用共享查询和交叉注意层进行类别和框定位任务,导致不佳的结果。我们发现不同的视觉特征图像区域适合进行查询类别和框定位任务,即使是同一个对象。焦点区域提供了关键的信息 для类别,而周围的边缘区域更适合框定位。然而,这种视觉空间不一致性使得DETR的训练受到很大的阻碍。因此,在这项工作中,我们关注DETR中的地方分解。为达到这一目标,我们提出了一种名为空间地解决DETR(SD-DETR)的新设计方案。我们仔细设计了任务意识查询初始化过程,并将权重分配块分解成多个视觉区域。同时,我们也发现了高分类信息和精准定位预测不一致的问题,因此我们提出了一种对齐损失来进一步引导空间地解DETR的训练。通过广泛的实验,我们证明了我们的方法在COCO dataset上比前一次的工作提高了4.5个AP。通过空间地解DETR,我们解决了视觉空间不一致性问题,并大幅提高了DETR对象检测的性能。
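
The abstract describes task-aware queries and a split cross-attention block but not their exact wiring, so the following is only a minimal sketch of the decoupling idea: two separate learnable query sets cross-attend to the same image memory, one feeding a classification head and one a box head. The class `DecoupledDecoderLayer`, its dimensions, and the omission of self-attention, task-aware query initialization, and the alignment loss are simplifications, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DecoupledDecoderLayer(nn.Module):
    """Toy decoder layer with separate classification and localization queries.

    Each task-aware query set cross-attends to the image memory independently,
    so classification can focus on salient regions while localization can focus
    on object boundaries. Dimensions and head counts are arbitrary choices.
    """
    def __init__(self, d_model=256, n_heads=8, n_queries=100, n_classes=80):
        super().__init__()
        self.cls_queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.loc_queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.cls_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.loc_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cls_head = nn.Linear(d_model, n_classes)
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, memory):                      # memory: (B, HW, d_model)
        B = memory.size(0)
        cq = self.cls_queries.unsqueeze(0).expand(B, -1, -1)
        lq = self.loc_queries.unsqueeze(0).expand(B, -1, -1)
        cls_feat, _ = self.cls_attn(cq, memory, memory)
        loc_feat, _ = self.loc_attn(lq, memory, memory)
        return self.cls_head(cls_feat), self.box_head(loc_feat).sigmoid()

logits, boxes = DecoupledDecoderLayer()(torch.randn(2, 600, 256))
```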

Improving Robustness and Reliability in Medical Image Classification with Latent-Guided Diffusion and Nested-Ensembles

  • paper_url: http://arxiv.org/abs/2310.15952
  • repo_url: https://github.com/xingbpshen/nested-diffusion
  • paper_authors: Xing Shen, Hengguan Huang, Brennan Nichyporuk, Tal Arbel
  • for: Improve the robustness of deep learning models on medical image analysis tasks without relying on pre-defined data augmentation strategies.
  • methods: A three-stage approach couples transformer image encoders with conditional diffusion models: hierarchical feature representations build discriminative latent spaces, a reverse diffusion process guided by the latent code proposes prediction candidates generatively, and the candidates are aggregated via a bi-level aggregation protocol.
  • results: Extensive experiments on medical imaging benchmarks show improved robustness and confidence calibration over state-of-the-art methods, together with an instance-level strategy for quantifying prediction uncertainty.
    Abstract While deep learning models have achieved remarkable success across a range of medical image analysis tasks, deployment of these models in real clinical contexts requires that they be robust to variability in the acquired images. While many methods apply predefined transformations to augment the training data to enhance test-time robustness, these transformations may not ensure the model's robustness to the diverse variability seen in patient images. In this paper, we introduce a novel three-stage approach based on transformers coupled with conditional diffusion models, with the goal of improving model robustness to the kinds of imaging variability commonly encountered in practice without the need for pre-determined data augmentation strategies. To this end, multiple image encoders first learn hierarchical feature representations to build discriminative latent spaces. Next, a reverse diffusion process, guided by the latent code, acts on an informative prior and proposes prediction candidates in a generative manner. Finally, several prediction candidates are aggregated in a bi-level aggregation protocol to produce the final output. Through extensive experiments on medical imaging benchmark datasets, we show that our method improves upon state-of-the-art methods in terms of robustness and confidence calibration. Additionally, we introduce a strategy to quantify the prediction uncertainty at the instance level, increasing their trustworthiness to clinicians using them in clinical practice.
    摘要 深度学习模型在医疗影像分析任务中已经取得了惊人的成功,但是在真实的临床上应用时,这些模型需要能够抗抗各种影像获取过程中的变化。许多方法通过预先定义的变换来增强训练数据,以提高测试时的Robustness,但这些变换可能并不能 garantía模型对实际医疗影像中的多样性的Robustness。在这篇论文中,我们提出了一种新的三个阶段方法,基于 transformers 和 condition diffusion 模型,以提高模型在实际医疗影像中的Robustness,无需预先定义数据增强策略。在这个方法中,多个图像Encoder 先学习层次特征表示,以建立特征空间的掌握。接下来,根据嵌入码,一种逆 diffusion 过程,带有有用的先验,提出了生成性的预测候选者。最后,多个预测候选者通过二级聚合协议来组合,生成最终输出。通过对医疗影像标准 benchmark 数据集进行广泛的实验,我们证明了我们的方法可以比对当前的方法提高Robustness和信任度Calibration。此外,我们还介绍了一种实例水平的预测uncertainty量化策略,使其在临床实践中增加了信任性。

Language-driven Scene Synthesis using Multi-conditional Diffusion Model

  • paper_url: http://arxiv.org/abs/2310.15948
  • repo_url: https://github.com/andvg3/LSDM
  • paper_authors: An Vuong, Minh Nhat Vu, Toan Tien Nguyen, Baoru Huang, Dzung Nguyen, Thieu Vo, Anh Nguyen
  • for: This work proposes language-driven scene synthesis, a new task that integrates text prompts, human motion, and existing objects to generate natural scenes.
  • methods: A multi-conditional diffusion model encodes the multiple conditions into a unified space and, unlike the implicit unification used in other diffusion work, explicitly predicts guiding points for the original data distribution.
  • results: The method outperforms state-of-the-art benchmarks and enables natural scene-editing applications.
    Abstract Scene synthesis is a challenging problem with several industrial applications. Recently, substantial efforts have been directed to synthesize the scene using human motions, room layouts, or spatial graphs as the input. However, few studies have addressed this problem from multiple modalities, especially combining text prompts. In this paper, we propose a language-driven scene synthesis task, which is a new task that integrates text prompts, human motion, and existing objects for scene synthesis. Unlike other single-condition synthesis tasks, our problem involves multiple conditions and requires a strategy for processing and encoding them into a unified space. To address the challenge, we present a multi-conditional diffusion model, which differs from the implicit unification approach of other diffusion literature by explicitly predicting the guiding points for the original data distribution. We demonstrate that our approach is theoretically supportive. The intensive experiment results illustrate that our method outperforms state-of-the-art benchmarks and enables natural scene editing applications. The source code and dataset can be accessed at https://lang-scene-synth.github.io/.
    摘要

ShARc: Shape and Appearance Recognition for Person Identification In-the-wild

  • paper_url: http://arxiv.org/abs/2310.15946
  • repo_url: None
  • paper_authors: Haidong Zhu, Wanrong Zheng, Zhaoheng Zheng, Ram Nevatia
  • for: This paper proposes a multimodal approach for video-based person identification in uncontrolled (in-the-wild) environments.
  • methods: Two encoders are used: a Pose and Shape Encoder (PSE) encoding binarized silhouettes, skeleton motions, and 3-D body shape, and an Aggregated Appearance Encoder (AAE) providing attention-based and averaging temporal feature aggregation (a toy temporal-aggregation sketch follows this entry); centroid feature averaging is used for gallery registration.
  • results: Comparative experiments on public datasets (CCVID, MEVID, and BRIAR) show significant improvements over existing state-of-the-art methods.
    Abstract Identifying individuals in unconstrained video settings is a valuable yet challenging task in biometric analysis due to variations in appearances, environments, degradations, and occlusions. In this paper, we present ShARc, a multimodal approach for video-based person identification in uncontrolled environments that emphasizes 3-D body shape, pose, and appearance. We introduce two encoders: a Pose and Shape Encoder (PSE) and an Aggregated Appearance Encoder (AAE). PSE encodes the body shape via binarized silhouettes, skeleton motions, and 3-D body shape, while AAE provides two levels of temporal appearance feature aggregation: attention-based feature aggregation and averaging aggregation. For attention-based feature aggregation, we employ spatial and temporal attention to focus on key areas for person distinction. For averaging aggregation, we introduce a novel flattening layer after averaging to extract more distinguishable information and reduce overfitting of attention. We utilize centroid feature averaging for gallery registration. We demonstrate significant improvements over existing state-of-the-art methods on public datasets, including CCVID, MEVID, and BRIAR.
    摘要 在未控制的视频设置下,识别人员是生物ometric分析中的一项有价值 yet challenging task,因为人体外观、环境、损害和遮挡会导致人脸识别的困难。在这篇论文中,我们提出了 ShARc,一种多模态方法,用于在未控制环境下进行视频基于人脸识别。我们提出了两个编码器:一个 pose和形体编码器(PSE),以及一个综合应用 aparearance编码器(AAE)。PSE编码器通过矩阵化的轮廓、骨骼运动和三维体形来编码人体形状,而 AAE 提供了两种时间上的特征聚合方法:注意力基于的特征聚合和平均聚合。为了注意力基于的特征聚合,我们使用空间和时间的注意力来强调关键区域的人脸分辨率。为了平均聚合,我们引入了一种新的扁平层来抽取更多的分辨率信息,并避免注意力的过拟合。我们使用中心点特征平均进行 галерее注册。我们在公共数据集上达到了与现有状态的显著提升。
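
As a rough illustration of the AAE's two aggregation modes, the sketch below pools per-frame features with a learned attention branch and with an averaging branch followed by an extra linear layer standing in for the "flattening layer". The class name `TemporalAggregator` and the single-score attention are assumptions; the actual AAE uses spatial and temporal attention and operates alongside the PSE.

```python
import torch
import torch.nn as nn

class TemporalAggregator(nn.Module):
    """Toy two-branch temporal aggregation over per-frame features (B, T, D).

    Branch 1 weights frames with learned attention scores; branch 2 averages
    frames and passes the result through an extra linear ("flattening") layer.
    """
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)      # per-frame attention score
        self.flatten = nn.Linear(dim, dim)  # post-averaging projection

    def forward(self, frames):                          # frames: (B, T, D)
        w = torch.softmax(self.score(frames), dim=1)    # (B, T, 1)
        attn_feat = (w * frames).sum(dim=1)             # attention-based aggregation
        avg_feat = self.flatten(frames.mean(dim=1))     # averaging aggregation
        return attn_feat, avg_feat

attn_feat, avg_feat = TemporalAggregator()(torch.randn(2, 16, 512))
```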

RePoseDM: Recurrent Pose Alignment and Gradient Guidance for Pose Guided Image Synthesis

  • paper_url: http://arxiv.org/abs/2310.16074
  • repo_url: None
  • paper_authors: Anant Khandelwal
  • for: Pose-guided person image synthesis, which requires re-rendering a reference image with photorealistic appearance and accurate pose transfer; person images are highly structured, and existing multi-level warping and masking in latent space still cannot achieve perfect pose alignment.
  • methods: Recurrent pose alignment provides pose-aligned texture features as conditional guidance for a diffusion model, and gradient guidance from pose interaction fields (which output the distance to the valid pose manifold for a target pose) steers sampling toward plausible pose-transfer trajectories.
  • results: Photorealistic pose transfer with undistorted texture details on two large-scale benchmarks and a user study, plus efficient gradient guidance demonstrated on the HumanArt dataset with fine-tuned Stable Diffusion.
    Abstract Pose-guided person image synthesis task requires re-rendering a reference image, which should have a photorealistic appearance and flawless pose transfer. Since person images are highly structured, existing approaches require dense connections for complex deformations and occlusions because these are generally handled through multi-level warping and masking in latent space. But the feature maps generated by convolutional neural networks do not have equivariance, and hence even the multi-level warping does not have a perfect pose alignment. Inspired by the ability of the diffusion model to generate photorealistic images from the given conditional guidance, we propose recurrent pose alignment to provide pose-aligned texture features as conditional guidance. Moreover, we propose gradient guidance from pose interaction fields, which output the distance from the valid pose manifold given a target pose as input. This helps in learning plausible pose transfer trajectories that result in photorealism and undistorted texture details. Extensive results on two large-scale benchmarks and a user study demonstrate the ability of our proposed approach to generate photorealistic pose transfer under challenging scenarios. Additionally, we prove the efficiency of gradient guidance in pose-guided image generation on the HumanArt dataset with fine-tuned stable diffusion.
    摘要 pose-guided人像图像合成任务需要重新渲染引用图像,该图像应该具有摄影真实的外观和完美的pose倾斜。由于人像图像具有高度结构,现有方法通常需要密集的连接来处理复杂的变形和遮挡。但是,由 convolutional neural networks 生成的特征图不具有对称性,因此,即使使用多级扭曲和masking在 latent space中也不会得到完美的pose对齐。为了解决这个问题,我们提出了 recurrent pose alignment,它可以提供pose-aligned的текстура特征作为条件引用。此外,我们还提出了来自pose交互场的梯度导航,它输出了目标pose manifold 中的距离。这有助于学习plausible的pose转移轨迹,以实现摄影真实和不受扭曲的текстура细节。我们在两个大规模的benchmark上和用户研究中展示了我们提出的方法能够在复杂的场景下生成摄影真实的pose转移。此外,我们还证明了梯度导航在pose-guided图像生成中的稳定性。

Mitigate Domain Shift by Primary-Auxiliary Objectives Association for Generalizing Person ReID

  • paper_url: http://arxiv.org/abs/2310.15913
  • repo_url: None
  • paper_authors: Qilei Li, Shaogang Gong
  • for: Improve the generalization of person re-identification (ReID) models to unseen domains under unpredictable domain shift.
  • methods: The primary ReID instance-classification objective is learned jointly with an auxiliary weakly labeled pedestrian saliency detection objective; a Primary-Auxiliary Objectives Association (PAOA) mechanism calibrates the auxiliary loss gradients toward the primary task gradients (a toy gradient-calibration sketch follows this entry), and a test-time extension (PAOA+) optimizes against the auxiliary objective on the fly.
  • results: Experiments demonstrate the superiority of the proposed PAOA model over single-task and multi-task baselines.
    Abstract While deep learning has significantly improved ReID model accuracy under the independent and identical distribution (IID) assumption, it has also become clear that such models degrade notably when applied to an unseen novel domain due to unpredictable/unknown domain shift. Contemporary domain generalization (DG) ReID models struggle in learning domain-invariant representation solely through training on an instance classification objective. We consider that a deep learning model is heavily influenced and therefore biased towards domain-specific characteristics, e.g., background clutter, scale and viewpoint variations, limiting the generalizability of the learned model, and hypothesize that the pedestrians are domain invariant owning they share the same structural characteristics. To enable the ReID model to be less domain-specific from these pure pedestrians, we introduce a method that guides model learning of the primary ReID instance classification objective by a concurrent auxiliary learning objective on weakly labeled pedestrian saliency detection. To solve the problem of conflicting optimization criteria in the model parameter space between the two learning objectives, we introduce a Primary-Auxiliary Objectives Association (PAOA) mechanism to calibrate the loss gradients of the auxiliary task towards the primary learning task gradients. Benefiting from the harmonious multitask learning design, our model can be extended with the recent test-time diagram to form the PAOA+, which performs on-the-fly optimization against the auxiliary objective in order to maximize the model's generative capacity in the test target domain. Experiments demonstrate the superiority of the proposed PAOA model.
    摘要 深度学习已经在独立和同样分布(IID)假设下明显提高了人识别模型的准确率,但同时也变得明显在未见的新领域中运行时失效,这是因为不可预测的领域转移。当前的领域总结(DG)人识别模型在学习领域不可变的表示方法时遇到了困难,我们认为深度学习模型受到领域特有的特征,如背景噪音、缩放和视角变化,这限制了学习的模型通用性,我们假设人识别对象在不同领域中具有相同的结构特征。为了让人识别模型免受这些纯净的人识别对象的影响,我们提出了一种方法,即通过同时学习主要人识别实例分类目标和 auxiliary 任务来导向模型学习。为了解决模型参数空间中主要和auxiliary任务之间的冲突问题,我们提出了主要-auxiliary任务关联(PAOA)机制,该机制可以在模型参数空间中均衡两个学习任务的损失导数。通过和谐多任务学习设计,我们的模型可以通过添加最近的测试时 diagram 来形成 PAOA+,该模型在测试目标领域中进行在线优化,以最大化模型的生成能力。实验结果表明我们的 PAOA 模型具有优势。
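
The abstract says PAOA "calibrates the loss gradients of the auxiliary task towards the primary learning task gradients" without giving the rule, so the sketch below shows one plausible reading borrowed from gradient-projection methods: drop the component of the auxiliary gradient that conflicts with the primary gradient. Treat `calibrate_aux_grad` purely as an illustration of the idea, not the paper's mechanism.

```python
import torch

def calibrate_aux_grad(primary_grad, aux_grad):
    """Toy gradient calibration: if the auxiliary gradient conflicts with the
    primary gradient (negative dot product), remove its conflicting component
    so the auxiliary task cannot pull parameters against the primary ReID task.
    """
    dot = torch.dot(primary_grad, aux_grad)
    if dot < 0:
        aux_grad = aux_grad - dot / primary_grad.pow(2).sum() * primary_grad
    return primary_grad + aux_grad      # combined update direction

g_primary = torch.randn(1000)           # flattened gradients as placeholders
g_aux = torch.randn(1000)
update = calibrate_aux_grad(g_primary, g_aux)
```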

YOLO-Angio: An Algorithm for Coronary Anatomy Segmentation

  • paper_url: http://arxiv.org/abs/2310.15898
  • repo_url: None
  • paper_authors: Tom Liu, Hui Lin, Aggelos K. Katsaggelos, Adrienne Kline
  • for: Provide fast and accurate automated segmentation of coronary anatomy from X-ray angiography to support diagnosis of coronary artery disease.
  • methods: A three-stage approach: classical preprocessing and feature selection to enhance vessel contrast, a YOLOv8 ensemble that proposes vessel candidates as a vessel map, and a logic-based, graph-sorting reconstruction of the coronary tree.
  • results: The entry placed 3rd overall in the MICCAI 2023 ARCADE challenge, with F1 scores of 0.422 and 0.4289 on the validation and hold-out sets respectively.
    Abstract Coronary angiography remains the gold standard for diagnosis of coronary artery disease, the most common cause of death worldwide. While this procedure is performed more than 2 million times annually, there remain few methods for fast and accurate automated measurement of disease and localization of coronary anatomy. Here, we present our solution to the Automatic Region-based Coronary Artery Disease diagnostics using X-ray angiography images (ARCADE) challenge held at MICCAI 2023. For the artery segmentation task, our three-stage approach combines preprocessing and feature selection by classical computer vision to enhance vessel contrast, followed by an ensemble model based on YOLOv8 to propose possible vessel candidates by generating a vessel map. A final segmentation is based on a logic-based approach to reconstruct the coronary tree in a graph-based sorting method. Our entry to the ARCADE challenge placed 3rd overall. Using the official metric for evaluation, we achieved an F1 score of 0.422 and 0.4289 on the validation and hold-out sets respectively.
    摘要 coronary angiography 仍然是 coronary artery disease 诊断的标准方法,全球最常见的死亡原因之一。尽管这个过程每年超过 2 百万次执行,但是尚未有快速和准确的自动测量疾病和 coronary anatomy 的方法。在这里,我们介绍了我们在 MICCAI 2023 上参加的 Automatic Region-based Coronary Artery Disease diagnostics using X-ray angiography images (ARCADE) 挑战的解决方案。 для artery segmentation 任务,我们采用了三个阶段的方法,首先使用类传统计算机视觉技术进行预处理和特征选择,以增强血管对比度,然后使用 YOLOv8 ensemble model 提出可能的血管候选者,生成血管地图。最后,我们使用逻辑基于方法来重建 coronary tree,并使用图表分类法来排序。我们在 ARCADE 挑战中的参赛作品位列全国第三。使用官方的评价指标,我们在验证集和保留集上的 F1 分数分别为 0.422 和 0.4289。

Correlation Debiasing for Unbiased Scene Graph Generation in Videos

  • paper_url: http://arxiv.org/abs/2310.16073
  • repo_url: None
  • paper_authors: Anant Khandelwal
  • for: Dynamic scene graph generation (SGG) from videos requires comprehensive understanding of objects under temporal fluctuations as well as modeling of temporal motions and interactions, while the long-tailed distribution of visual relationships biases most dynamic SGG methods.
  • methods: FloCoDe uses flow-based feature warping to detect temporally consistent objects across frames (a toy flow-warping sketch follows this entry), correlation debiasing to learn unbiased relation representations for long-tailed classes, and a mixture of sigmoidal cross-entropy and contrastive losses to attenuate predictive uncertainty.
  • results: The method mitigates the long-tailed bias of generated scene graphs, with performance gains of up to 4.1% over prior work.
    Abstract Dynamic scene graph generation (SGG) from videos requires not only comprehensive understanding of objects across the scenes that are prone to temporal fluctuations but also a model the temporal motions and interactions with different objects. Moreover, the long-tailed distribution of visual relationships is the crucial bottleneck of most dynamic SGG methods, since most of them focus on capturing spatio-temporal context using complex architectures, which leads to the generation of biased scene graphs. To address these challenges, we propose FloCoDe: Flow-aware temporal consistency and Correlation Debiasing with uncertainty attenuation for unbiased dynamic scene graphs. FloCoDe employs feature warping using flow to detect temporally consistent objects across the frames. In addition, it uses correlation debiasing to learn the unbiased relation representation for long-tailed classes. Moreover, to attenuate the predictive uncertainties, it uses a mixture of sigmoidal cross-entropy loss and contrastive loss to incorporate label correlations to identify the commonly co-occurring relations and help debias the long-tailed ones. Extensive experimental evaluation shows a performance gain as high as 4.1% showing the superiority of generating more unbiased scene graphs.
    摘要 干净场景图生成(SGG)从视频中需要不仅具备场景中对象的全面理解,而且还需要模型这些对象的时间运动和不同对象之间的交互。此外,视觉关系的长尾分布是大多数动态SGG方法的关键瓶颈,因为大多数方法强调通过复杂的建筑来捕捉空间时间上下文,从而导致生成偏向的场景图。为解决这些挑战,我们提出了FloCoDe:流程承诺和对称偏移以降低不当的动态场景图生成。FloCoDe使用流程映射来检测帧中的时间一致对象。此外,它使用对称偏移来学习不偏向的关系表示,并使用权重混合的sigmoid混合 entropy损失和对比损失来吸收标签相关性,以帮助减少预测不确定性。广泛的实验评估表明,FloCoDe可以提高生成不偏向场景图的性能,最高提高4.1%。
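
Flow-based feature warping is a standard operation, and the sketch below shows how per-frame features can be warped with an optical-flow field via bilinear sampling so that object features become comparable across frames. The assumed flow channel order (x, y) and the name `warp_features` are illustrative; FloCoDe's correlation debiasing and losses are not shown.

```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Toy flow-based feature warping: sample features from the previous frame
    at locations displaced by the optical flow, so per-object features can be
    compared across frames for temporal consistency.

    feat: (B, C, H, W) features of frame t-1
    flow: (B, 2, H, W) flow from frame t to frame t-1, in pixels, channels (x, y)
    """
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow   # (B, 2, H, W)
    gx = 2.0 * grid[:, 0] / (W - 1) - 1.0        # normalize to [-1, 1] for grid_sample
    gy = 2.0 * grid[:, 1] / (H - 1) - 1.0
    norm_grid = torch.stack((gx, gy), dim=-1)    # (B, H, W, 2)
    return F.grid_sample(feat, norm_grid, align_corners=True)

warped = warp_features(torch.randn(1, 64, 32, 32), torch.zeros(1, 2, 32, 32))
```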

On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms

  • paper_url: http://arxiv.org/abs/2310.15848
  • repo_url: None
  • paper_authors: Surbhi Mittal, Kartik Thakral, Richa Singh, Mayank Vatsa, Tamar Glaser, Cristian Canton Ferrer, Tal Hassner
  • for: Propose a framework for Responsible Machine Learning Datasets and a rubric to evaluate dataset trustworthiness.
  • methods: Datasets are examined through the lenses of fairness, privacy, and regulatory compliance, with recommendations for constructing future datasets and proposed additions to the "datasheets for datasets" documentation.
  • results: After surveying over 100 datasets and analyzing 60 in depth, the study finds that none of them is immune to issues of fairness, privacy preservation, and regulatory compliance.
    Abstract Artificial Intelligence (AI) has made its way into various scientific fields, providing astonishing improvements over existing algorithms for a wide variety of tasks. In recent years, there have been severe concerns over the trustworthiness of AI technologies. The scientific community has focused on the development of trustworthy AI algorithms. However, machine and deep learning algorithms, popular in the AI community today, depend heavily on the data used during their development. These learning algorithms identify patterns in the data, learning the behavioral objective. Any flaws in the data have the potential to translate directly into algorithms. In this study, we discuss the importance of Responsible Machine Learning Datasets and propose a framework to evaluate the datasets through a responsible rubric. While existing work focuses on the post-hoc evaluation of algorithms for their trustworthiness, we provide a framework that considers the data component separately to understand its role in the algorithm. We discuss responsible datasets through the lens of fairness, privacy, and regulatory compliance and provide recommendations for constructing future datasets. After surveying over 100 datasets, we use 60 datasets for analysis and demonstrate that none of these datasets is immune to issues of fairness, privacy preservation, and regulatory compliance. We provide modifications to the ``datasheets for datasets" with important additions for improved dataset documentation. With governments around the world regularizing data protection laws, the method for the creation of datasets in the scientific community requires revision. We believe this study is timely and relevant in today's era of AI.
    摘要 人工智能(AI)已经渗透到了不同的科学领域,提供了惊人的改进,对各种任务的表现有了很大的提高。然而,在最近几年,关于AI技术的可靠性产生了严重的担忧。科学社区在开发可靠AI算法方面做出了巨大的努力。然而,机器学习和深度学习算法,今天科学社区中流行的算法,具有很强的数据依赖性。这些学习算法通过数据中的模式识别,学习行为目标。数据中的任何漏洞都可能直接影响算法。在这种情况下,我们认为负责任机器学习数据集的重要性是不可或缺的。我们提出了一种评估数据集的负责任框架,通过评估数据集的可责任性来理解它在算法中的角色。我们通过公平、隐私保护和法规遵守来评估数据集的负责任性。我们对100余个数据集进行了调查,并选择60个数据集进行分析,发现这些数据集中有很多问题,包括公平、隐私保护和法规遵守。我们对数据集的“数据Sheet”进行了重要的修改,以便更好地记录数据集的信息。随着政府在全球范围内规范数据保护法律,科学社区在创建数据集的方法需要进行修订。我们认为这种研究在当今AI时代非常时间和有 relevance。

CPSeg: Finer-grained Image Semantic Segmentation via Chain-of-Thought Language Prompting

  • paper_url: http://arxiv.org/abs/2310.16069
  • repo_url: None
  • paper_authors: Lei Li
  • for: Improve image semantic segmentation by integrating textual information with images for language-guided, context-aware data utilization.
  • methods: The CPSeg framework encodes prompt texts derived from multiple sentences into a coherent "Chain-of-Thought" and couples pixel and text matching maps, using a new vision-language dataset, FloodPrompt, that contains images, semantic masks, and corresponding text.
  • results: Qualitative and quantitative analyses validate the effectiveness of CPSeg on a flood disaster scenario.
    Abstract Natural scene analysis and remote sensing imagery offer immense potential for advancements in large-scale language-guided context-aware data utilization. This potential is particularly significant for enhancing performance in downstream tasks such as object detection and segmentation with designed language prompting. In light of this, we introduce the CPSeg, Chain-of-Thought Language Prompting for Finer-grained Semantic Segmentation), an innovative framework designed to augment image segmentation performance by integrating a novel "Chain-of-Thought" process that harnesses textual information associated with images. This groundbreaking approach has been applied to a flood disaster scenario. CPSeg encodes prompt texts derived from various sentences to formulate a coherent chain-of-thought. We propose a new vision-language dataset, FloodPrompt, which includes images, semantic masks, and corresponding text information. This not only strengthens the semantic understanding of the scenario but also aids in the key task of semantic segmentation through an interplay of pixel and text matching maps. Our qualitative and quantitative analyses validate the effectiveness of CPSeg.
    摘要 自然场景分析和远程感知影像具有巨大的潜力,可以提高大规模语言引导的上下文意识数据利用的性能。特别是在对象检测和分割任务中,采用设计语言提示可以提高性能。为此,我们介绍了CPSeg(语义粒度分割)框架,它通过 integrate 一种新的 "链条思维" 过程,将图像与文本信息相结合,以提高图像分割性能。我们在洪水灾害场景中应用了这种创新性的方法。CPSeg 将文本信息转化为谱文本,并将它们组织成一个 coherent 的链条思维。我们提出了一个新的视力语言数据集,FloodPrompt,该数据集包括图像、semantic 面积和相应的文本信息。这不仅强化了洪水场景的 semantic 理解,还帮助实现semantic segmentation 任务中的像素和文本匹配图。我们的质量和量统计分析证明了 CPSeg 的有效性。

Unpaired MRI Super Resolution with Self-Supervised Contrastive Learning

  • paper_url: http://arxiv.org/abs/2310.15767
  • repo_url: None
  • paper_authors: Hao Li, Quanwei Liu, Jianan Liu, Xiling Liu, Yanni Dong, Tao Huang, Zhihan Lv
  • for: Improve the diagnostic utility of clinical MRI by enhancing image resolution when only limited high-resolution training data are available.
  • methods: An unpaired super-resolution approach that uses self-supervised contrastive learning, constructing positive and negative sample pairs from authentic high-resolution images and synthetically generated SR images to learn discriminative features (a toy contrastive-loss sketch follows this entry).
  • results: Significant improvements in peak signal-to-noise ratio and structural similarity index even with scarce HR training images.
    Abstract High-resolution (HR) magnetic resonance imaging (MRI) is crucial for enhancing diagnostic accuracy in clinical settings. Nonetheless, the inherent limitation of MRI resolution restricts its widespread applicability. Deep learning-based image super-resolution (SR) methods exhibit promise in improving MRI resolution without additional cost. However, these methods frequently require a substantial number of HR MRI images for training, which can be challenging to acquire. In this paper, we propose an unpaired MRI SR approach that employs self-supervised contrastive learning to enhance SR performance with limited training data. Our approach leverages both authentic HR images and synthetically generated SR images to construct positive and negative sample pairs, thus facilitating the learning of discriminative features. Empirical results presented in this study underscore significant enhancements in the peak signal-to-noise ratio and structural similarity index, even when a paucity of HR images is available. These findings accentuate the potential of our approach in addressing the challenge of limited training data, thereby contributing to the advancement of high-resolution MRI in clinical applications.
    摘要 高分辨率(HR)磁共振成像(MRI)是诊断精度的关键因素,但MRI的自然限制使其广泛应用受到限制。深度学习基于图像超分辨(SR)方法表现出提高MRI分辨率的承诺,但这些方法frequently需要大量的HR MRI图像进行训练,这可能困难以获得。本文提出了一种没有对应HR图像的MRI SR方法,使用自我超级vised学习来提高SR性能,并利用高分辨率图像和生成的SR图像来构建正例对和负例对,从而促进学习抽象特征。实验结果表明,即使只有少量HR图像可用,SR性能也能得到显著提高,包括峰值信号噪声比和结构相似度指数。这些结果强调了我们的方法在有限训练数据情况下的潜在价值,并为高分辨率MRI在临床应用中的进一步发展提供了贡献。
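
The abstract states that positive and negative pairs are built from authentic HR images and generated SR images, but not the exact contrastive objective, so the sketch below uses a standard InfoNCE-style loss as a stand-in. Feature extraction, patch sampling, and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Toy InfoNCE-style contrastive loss.

    anchor:    (B, D) features of generated SR patches
    positive:  (B, D) features of matching authentic HR patches
    negatives: (B, N, D) features of non-matching patches
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (anchor * positive).sum(-1, keepdim=True)        # (B, 1)
    neg = torch.einsum("bd,bnd->bn", anchor, negatives)    # (B, N)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long) # positive sits at index 0
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 16, 128))
```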

Deep Learning Models for Classification of COVID-19 Cases by Medical Images

  • paper_url: http://arxiv.org/abs/2310.16851
  • repo_url: https://github.com/aditya-saxena-7/Detection-of-COVID-19-from-Chest-X-Ray-images-using-CNNs
  • paper_authors: Amir Ali
  • for: Improve the accuracy and speed of COVID-19 diagnosis by classifying patients from chest CT and other medical images with deep learning.
  • methods: Deep transfer-learning classifiers (DenseNet201, GoogleNet, and AlexNet) are compared against carefully chosen supervised learning models, with hyperparameter tuning (a toy transfer-learning sketch follows this entry).
  • results: The models handle a range of medical image types, identify characteristic COVID-19 patterns, and improve diagnostic accuracy and speed.
    Abstract In recent times, the use of chest Computed Tomography (CT) images for detecting coronavirus infections has gained significant attention, owing to their ability to reveal bilateral changes in affected individuals. However, classifying patients from medical images presents a formidable challenge, particularly in identifying such bilateral changes. To tackle this challenge, our study harnesses the power of deep learning models for the precise classification of infected patients. Our research involves a comparative analysis of deep transfer learning-based classification models, including DenseNet201, GoogleNet, and AlexNet, against carefully chosen supervised learning models. Additionally, our work encompasses Covid-19 classification, which involves the identification and differentiation of medical images, such as X-rays and electrocardiograms, that exhibit telltale signs of Covid-19 infection. This comprehensive approach ensures that our models can handle a wide range of medical image types and effectively identify characteristic patterns indicative of Covid-19. By conducting meticulous research and employing advanced deep learning techniques, we have made significant strides in enhancing the accuracy and speed of Covid-19 diagnosis. Our results demonstrate the effectiveness of these models and their potential to make substantial contributions to the global effort to combat COVID-19.
    摘要
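
As a generic illustration of the deep transfer-learning setup named in the paper (DenseNet201 among others), the sketch below swaps the ImageNet classifier head for a two-class head and fine-tunes it. The freezing strategy, learning rate, class count, and the random placeholder batch are illustrative choices, not the paper's training recipe.

```python
import torch
import torch.nn as nn
from torchvision import models

# Replace the classifier head of an ImageNet-pretrained DenseNet201 for
# binary COVID / non-COVID classification and fine-tune only the new head.
model = models.densenet201(weights=models.DenseNet201_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False                       # freeze the backbone
model.classifier = nn.Linear(model.classifier.in_features, 2)

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 224, 224)              # placeholder batch of CT slices
labels = torch.randint(0, 2, (4,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```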

Large Language Models are Temporal and Causal Reasoners for Video Question Answering

  • paper_url: http://arxiv.org/abs/2310.15747
  • repo_url: https://github.com/mlvlab/Flipped-VQA
  • paper_authors: Dohwan Ko, Ji Soo Lee, Wooyoung Kang, Byungseok Roh, Hyunwoo J. Kim
  • for: Address the language bias problem in Video Question Answering (VideoQA): large language models (LLMs) provide effective priors via linguistic shortcuts, but often over-rely on questions while ignoring visual content.
  • methods: The Flipped-VQA framework trains the model to predict all combinations of the <V, Q, A> triplet by flipping the source pair and the target label, i.e. predicting A, Q, and V given VQ, VA, and QA pairs respectively.
  • results: LLaMA-VQA, built on Flipped-VQA, outperforms both LLM-based and non-LLM-based models on five challenging VideoQA benchmarks; the framework also consistently improves OPT and GPT-J, and empirically it both strengthens the use of linguistic shortcuts and mitigates the linguistic bias that causes question-over-reliant wrong answers.
    Abstract Large Language Models (LLMs) have shown remarkable performances on a wide range of natural language understanding and generation tasks. We observe that the LLMs provide effective priors in exploiting $\textit{linguistic shortcuts}$ for temporal and causal reasoning in Video Question Answering (VideoQA). However, such priors often cause suboptimal results on VideoQA by leading the model to over-rely on questions, $\textit{i.e.}$, $\textit{linguistic bias}$, while ignoring visual content. This is also known as `ungrounded guesses' or `hallucinations'. To address this problem while leveraging LLMs' prior on VideoQA, we propose a novel framework, Flipped-VQA, encouraging the model to predict all the combinations of $\langle$V, Q, A$\rangle$ triplet by flipping the source pair and the target label to understand their complex relationships, $\textit{i.e.}$, predict A, Q, and V given a VQ, VA, and QA pairs, respectively. In this paper, we develop LLaMA-VQA by applying Flipped-VQA to LLaMA, and it outperforms both LLMs-based and non-LLMs-based models on five challenging VideoQA benchmarks. Furthermore, our Flipped-VQA is a general framework that is applicable to various LLMs (OPT and GPT-J) and consistently improves their performances. We empirically demonstrate that Flipped-VQA not only enhances the exploitation of linguistic shortcuts but also mitigates the linguistic bias, which causes incorrect answers over-relying on the question. Code is available at https://github.com/mlvlab/Flipped-VQA.
    摘要 大型语言模型(LLMs)在各种自然语言理解和生成任务上表现出色。我们发现LLMs在视频问答(VideoQA)任务上提供了有效的先验知识,但这些先验知识经常导致视频内容被忽略,从而导致 incorrect answers。这也被称为“不扎实的猜测”或“幻觉”。为解决这个问题并利用LLMs的先验知识,我们提出了一种新的框架——翻转VQA(Flipped-VQA),强制模型预测所有的 $\langle$V, Q, A$\rangle$ 三元组。在这篇论文中,我们将应用Flipped-VQA于 LLMA 模型,并在五个复杂的 VideoQA benchmark 上取得了比 LLMS 和非 LLMS 模型更高的表现。此外,我们的 Flipped-VQA 是一个通用的框架,可以应用于多种 LLMS(OPT 和 GPT-J),并在不同的 LLMS 上提高其表现。我们经验表明,Flipped-VQA 不仅增强了语言短cut的利用,还减轻语言偏见,这种偏见导致 incorrect answers 依赖于问题。代码可以在 https://github.com/mlvlab/Flipped-VQA 上获取。

Interpretable Medical Image Classification using Prototype Learning and Privileged Information

  • paper_url: http://arxiv.org/abs/2310.15741
  • repo_url: https://github.com/xrad-ulm/proto-caps
  • paper_authors: Luisa Gallee, Meinrad Beer, Michael Goetz
  • for: Improve both the interpretability and the performance of medical image classification.
  • methods: The Proto-Caps approach combines capsule networks, prototype learning, and privileged information available during training (a toy prototype-distance sketch follows this entry).
  • results: On the LIDC-IDRI dataset it achieves more than 6% higher accuracy than an explainable baseline in predicting malignancy (93.0%) and mean characteristic features of lung nodules, while providing case-based reasoning through prototype representations that allow visual validation of radiologist-defined attributes.
    Abstract Interpretability is often an essential requirement in medical imaging. Advanced deep learning methods are required to address this need for explainability and high performance. In this work, we investigate whether additional information available during the training process can be used to create an understandable and powerful model. We propose an innovative solution called Proto-Caps that leverages the benefits of capsule networks, prototype learning and the use of privileged information. Evaluating the proposed solution on the LIDC-IDRI dataset shows that it combines increased interpretability with above state-of-the-art prediction performance. Compared to the explainable baseline model, our method achieves more than 6 % higher accuracy in predicting both malignancy (93.0 %) and mean characteristic features of lung nodules. Simultaneously, the model provides case-based reasoning with prototype representations that allow visual validation of radiologist-defined attributes.
    摘要 可解释性是医学影像领域中的一项重要需求。高级深度学习方法可以满足这一需求,兼顾解释性和高性能。在这项工作中,我们研究了是否可以利用训练过程中可用的额外信息来创建一个可理解且强大的模型。我们提出了一种创新解决方案,即Proto-Caps,该方案结合了胶囊网络、原型学习和特权信息的使用。在LIDC-IDRI数据集上的评估表明,该方法在提高可解释性的同时达到了超越最先进水平的预测性能。相比于可解释的基准模型,我们的方法在预测恶性程度(93.0%)和肺结节平均特征方面的准确率提高了6%以上。此外,模型还提供了基于案例的推理,其原型表示可以对放射科医生定义的特征进行视觉验证。
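
The sketch below only illustrates the prototype-learning component: each class owns learnable prototypes, samples are scored by distance to the nearest prototype, and the index of that prototype doubles as a case-based explanation. The capsule network backbone and the use of privileged radiologist attributes are omitted, and all names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class PrototypeClassifier(nn.Module):
    """Toy prototype-based classifier: each class owns a set of learnable
    prototype vectors, and a sample is scored by its (negative) distance to
    the closest prototype of each class, which also yields a case-based
    explanation ("this nodule looks like prototype k of class c").
    """
    def __init__(self, dim=64, n_classes=2, protos_per_class=5):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_classes, protos_per_class, dim))

    def forward(self, feat):                              # feat: (B, dim)
        diff = feat[:, None, None, :] - self.prototypes   # (B, C, P, dim)
        dist = diff.pow(2).sum(-1)                        # squared distances
        min_dist, nearest = dist.min(dim=-1)              # closest prototype per class
        return -min_dist, nearest                         # logits, explanation indices

logits, nearest_proto = PrototypeClassifier()(torch.randn(8, 64))
```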

Query-adaptive DETR for Crowded Pedestrian Detection

  • paper_url: http://arxiv.org/abs/2310.15725
  • repo_url: None
  • paper_authors: Feng Gao, Jiaxu Leng, Ji Gan, Xinbo Gao
  • for: Improve the performance of DETR and its variants on crowded pedestrian detection by automatically adapting the number of queries to scenes with different crowd densities.
  • methods: Two existing query generation methods are analyzed and four guidelines for adaptive query generation are summarized; Rank-based Adaptive Query Generation (RAQG) adds a rank prediction head that predicts the rank of the lowest-confidence positive training sample produced by the encoder, an adaptive selection method that uses the predicted rank to select coarse encoder detections as queries, and a Soft Gradient L1 Loss for training the rank prediction head.
  • results: The method can be plugged into any DETR variant, achieves competitive results on the CrowdHuman and CityPersons datasets, and reaches a state-of-the-art 39.4% MR on CrowdHuman.
    Abstract DEtection TRansformer (DETR) and its variants (DETRs) have been successfully applied to crowded pedestrian detection, which achieved promising performance. However, we find that, in different degrees of crowded scenes, the number of DETRs' queries must be adjusted manually, otherwise, the performance would degrade to varying degrees. In this paper, we first analyze the two current query generation methods and summarize four guidelines for designing the adaptive query generation method. Then, we propose Rank-based Adaptive Query Generation (RAQG) to alleviate the problem. Specifically, we design a rank prediction head that can predict the rank of the lowest confidence positive training sample produced by the encoder. Based on the predicted rank, we design an adaptive selection method that can adaptively select coarse detection results produced by the encoder to generate queries. Moreover, to train the rank prediction head better, we propose Soft Gradient L1 Loss. The gradient of Soft Gradient L1 Loss is continuous, which can describe the relationship between the loss value and the updated value of model parameters granularly. Our method is simple and effective, which can be plugged into any DETRs to make it query-adaptive in theory. The experimental results on Crowdhuman dataset and Citypersons dataset show that our method can adaptively generate queries for DETRs and achieve competitive results. Especially, our method achieves state-of-the-art 39.4% MR on Crowdhuman dataset.
    摘要 DEtection TRansformer(DETR)和其变体(DETRs)已经成功应用于人群检测,实现了出色的表现。然而,我们发现在不同的人群场景中,DETRs的查询数量需要手动调整,否则表现会随之下降。在这篇论文中,我们首先分析了两种当前的查询生成方法,并总结出四个指导方针用于设计适应查询生成方法。然后,我们提议 Rank-based Adaptive Query Generation(RAQG)来解决这个问题。具体来说,我们设计了一个排名预测头,可以预测Encoder输出最低信心正例的排名。根据预测的排名,我们设计了一种适应选择方法,可以适应地选择Encoder输出的粗略检测结果来生成查询。此外,为了训练排名预测头更好,我们提议Soft Gradient L1 Loss。Soft Gradient L1 Loss的梯度是连续的,可以细致地描述损失值与模型参数更新值之间的关系。我们的方法简单而有效,可以将其插入到任何DETRs中,让它成为query-adaptive的。实验结果表明,我们的方法可以适应地为DETRs生成查询,并实现了竞争性的结果。特别是,我们的方法在Crowdhuman dataset上达到了39.4%的MR。

GNeSF: Generalizable Neural Semantic Fields

  • paper_url: http://arxiv.org/abs/2310.15712
  • repo_url: None
  • paper_authors: Hanlin Chen, Chen Li, Mengqi Guo, Zhiwen Yan, Gim Hee Lee
  • for: Propose a generalizable neural-implicit 3D scene segmentation method that is trained with only 2D supervision and adapts to novel scenes at inference time without per-scene optimization.
  • methods: Multi-view image features and semantic maps are taken as input; a novel soft voting mechanism fuses the 2D semantic information from different views for each 3D point (a toy soft-voting sketch follows this entry), view-difference information weights the votes so that nearby views contribute more, and a visibility module filters out detrimental information from occluded views.
  • results: The approach matches scene-specific methods on 3D semantic segmentation and can even outperform existing strong supervision-based approaches while using only 2D annotations.
    Abstract 3D scene segmentation based on neural implicit representation has emerged recently with the advantage of training only on 2D supervision. However, existing approaches still requires expensive per-scene optimization that prohibits generalization to novel scenes during inference. To circumvent this problem, we introduce a generalizable 3D segmentation framework based on implicit representation. Specifically, our framework takes in multi-view image features and semantic maps as the inputs instead of only spatial information to avoid overfitting to scene-specific geometric and semantic information. We propose a novel soft voting mechanism to aggregate the 2D semantic information from different views for each 3D point. In addition to the image features, view difference information is also encoded in our framework to predict the voting scores. Intuitively, this allows the semantic information from nearby views to contribute more compared to distant ones. Furthermore, a visibility module is also designed to detect and filter out detrimental information from occluded views. Due to the generalizability of our proposed method, we can synthesize semantic maps or conduct 3D semantic segmentation for novel scenes with solely 2D semantic supervision. Experimental results show that our approach achieves comparable performance with scene-specific approaches. More importantly, our approach can even outperform existing strong supervision-based approaches with only 2D annotations. Our source code is available at: https://github.com/HLinChen/GNeSF.
    摘要 三维场景分割基于神经隐式表示最近几年发展起来,具有训练只需2D监督的优势。然而,现有方法仍然需要贵重的每个场景优化,这限制了扩展到新场景的推理。为了解决这个问题,我们介绍了一个普适的三维分割框架基于隐式表示。Specifically,我们的框架接受多视图图像特征和 semantic maps作为输入,而不是只是空间信息,以避免过拟合特定场景的 Géometric 和 Semantic 信息。我们提出了一种新的软投票机制,用于对每个3D点的2D semantic信息进行集成。此外,我们还编码了视图差信息,以预测投票分数。直观地说,这使得邻近视图的semantic信息能够更大地贡献。此外,我们还设计了一个隐藏信息检测和过滤模块,以避免由 occluded 视图引入的损害信息。由于我们的提出的方法具有普适性,我们可以在具有 solely 2D semantic supervision 的情况下生成 semantic maps or 进行3D semantic segmentation for novel scenes。实验结果表明,我们的方法可以与场景特定方法匹配性能,而且甚至可以超越基于强监督的现有方法。我们的源代码可以在 GitHub 上找到:https://github.com/HLinChen/GNeSF。
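
A minimal sketch of the soft-voting idea for a single 3D point: per-view 2D semantic logits are fused with weights derived from predicted voting scores, which in the paper depend on view difference and visibility. The function `soft_vote` and the softmax weighting are assumptions about the exact form.

```python
import torch

def soft_vote(view_logits, view_scores):
    """Toy soft voting for one 3D point: per-view 2D semantic logits are
    combined with weights derived from predicted voting scores, so closer,
    unoccluded views count more toward the fused prediction.

    view_logits: (V, C) semantic logits of the point's projection in V views
    view_scores: (V,)   unnormalized voting scores for the V views
    """
    weights = torch.softmax(view_scores, dim=0)          # (V,)
    return (weights[:, None] * view_logits).sum(dim=0)   # fused (C,) logits

fused = soft_vote(torch.randn(6, 21), torch.randn(6))
label = fused.argmax()
```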

Physics-Informed with Power-Enhanced Residual Network for Interpolation and Inverse Problems

  • paper_url: http://arxiv.org/abs/2310.15690
  • repo_url: https://github.com/cmmai/resnet_for_pinn
  • paper_authors: Amir Noorizadegan, D. L. Young, Y. C. Hon, C. S. Chen
  • for: Improve neural-network interpolation of both smooth and non-smooth functions in 2D and 3D settings.
  • methods: A Power-Enhancing residual network adds power terms to the residual elements to boost expressive power (a toy residual-block sketch follows this entry); the study also examines network depth, width, and optimization methods.
  • results: The architecture shows exceptional accuracy, especially for non-smooth functions, outperforms plain networks in accuracy, convergence, and efficiency, and also performs well when applied to the inverse Burgers' equation.
    Abstract This paper introduces a novel neural network structure called the Power-Enhancing residual network, designed to improve interpolation capabilities for both smooth and non-smooth functions in 2D and 3D settings. By adding power terms to residual elements, the architecture boosts the network's expressive power. The study explores network depth, width, and optimization methods, showing the architecture's adaptability and performance advantages. Consistently, the results emphasize the exceptional accuracy of the proposed Power-Enhancing residual network, particularly for non-smooth functions. Real-world examples also confirm its superiority over plain neural network in terms of accuracy, convergence, and efficiency. The study also looks at the impact of deeper network. Moreover, the proposed architecture is also applied to solving the inverse Burgers' equation, demonstrating superior performance. In conclusion, the Power-Enhancing residual network offers a versatile solution that significantly enhances neural network capabilities. The codes implemented are available at: \url{https://github.com/CMMAi/ResNet_for_PINN}.
    摘要 这篇论文介绍了一种新的神经网络结构,称为能量增强径 residual network,用于提高 interpolation 能力,包括平滑和非平滑函数的情况。通过添加能量项到径元中,这种建筑增强了网络的表达能力。研究探讨了网络的深度、宽度和优化方法,并证明了该建筑的适应性和性能优势。结果表明提案的 Power-Enhancing residual network 具有 Exceptional accuracy,特别是对非平滑函数。实际例子也证明了它在精度、速度和稳定性方面的优越性。此外,该建筑还应用于解决 inverse Burgers' equation,并达到了优秀的性能。 conclude,Power-Enhancing residual network 提供了一种多样化的解决方案,可以显著提高神经网络的能力。实现的代码可以在:\url{https://github.com/CMMAi/ResNet_for_PINN} 中找到。
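
The abstract says power terms are added to residual elements but does not give the exact formulation, so the block below is one plausible reading, y = x + f(x) + f(x)^p, with an assumed exponent p = 2; depth, width, and activation are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

class PowerResidualBlock(nn.Module):
    """Toy residual block with an added power term: y = x + f(x) + f(x)**p.
    The exponent p and the placement of the power term are assumptions; the
    abstract only states that power terms are added to residual elements.
    """
    def __init__(self, dim=50, power=2):
        super().__init__()
        self.power = power
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, x):
        f = self.body(x)
        return x + f + f.pow(self.power)

# e.g. interpolate a scalar function of two variables
net = nn.Sequential(nn.Linear(2, 50), PowerResidualBlock(), nn.Linear(50, 1))
y = net(torch.rand(128, 2))
```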

Nighttime Thermal Infrared Image Colorization with Feedback-based Object Appearance Learning

  • paper_url: http://arxiv.org/abs/2310.15688
  • repo_url: https://github.com/fuyaluo/foalgan
  • paper_authors: Fu-Ya Luo, Shu-Lin Liu, Yi-Jun Cao, Kai-Fu Yang, Chang-Yong Xie, Yong Liu, Yong-Jie Li
  • for: Improve the interpretability and usability of nighttime thermal infrared (TIR) images by colorizing them into corresponding daytime color images (NTIR2DC), especially for small object classes.
  • methods: A generative adversarial network with feedback-based object appearance learning (FoalGAN), including an occlusion-aware mixup module with an appearance consistency loss to reduce context dependence, a traffic light appearance loss for small objects, and a dual feedback learning strategy that selectively adjusts the learning frequency of different samples.
  • results: Experiments show that FoalGAN improves appearance learning of small objects and outperforms other image translation methods in semantic preservation and edge consistency for the NTIR2DC task; pixel-level annotations for a subset of the Brno dataset are also provided.
    Abstract Stable imaging in adverse environments (e.g., total darkness) makes thermal infrared (TIR) cameras a prevalent option for night scene perception. However, the low contrast and lack of chromaticity of TIR images are detrimental to human interpretation and subsequent deployment of RGB-based vision algorithms. Therefore, it makes sense to colorize the nighttime TIR images by translating them into the corresponding daytime color images (NTIR2DC). Despite the impressive progress made in the NTIR2DC task, how to improve the translation performance of small object classes is under-explored. To address this problem, we propose a generative adversarial network incorporating feedback-based object appearance learning (FoalGAN). Specifically, an occlusion-aware mixup module and corresponding appearance consistency loss are proposed to reduce the context dependence of object translation. As a representative example of small objects in nighttime street scenes, we illustrate how to enhance the realism of traffic light by designing a traffic light appearance loss. To further improve the appearance learning of small objects, we devise a dual feedback learning strategy to selectively adjust the learning frequency of different samples. In addition, we provide pixel-level annotation for a subset of the Brno dataset, which can facilitate the research of NTIR image understanding under multiple weather conditions. Extensive experiments illustrate that the proposed FoalGAN is not only effective for appearance learning of small objects, but also outperforms other image translation methods in terms of semantic preservation and edge consistency for the NTIR2DC task.
    摘要 这个文章讨论了在不良环境(例如总黑暗)下稳定图像摄取的问题。因为热色干扰(TIR)相机在夜间场景认知中是一种普遍的选择,但是TIR图像的低 контра斯特和无彩色性使得人类阅读和后续的RGB基于视觉算法的应用受到阻碍。因此,将夜间TIR图像转换为相应的日间彩色图像(NTIR2DC)是一个有必要的步骤。尽管在NTIR2DC任务上已经做出了卓越的进步,但是如何提高小物类的译像性仍然是尚未探讨的问题。为了解决这个问题,我们提出了一个基于对应学习的生成 adversarial network(FoalGAN)。specifically,我们提出了一个遮瑕节module和相应的出现整合损失,以减少物类译像的上下文依赖。作为夜间街道场景中小物件的示例,我们详细说明了如何增强交通信号灯的现实感。此外,我们还提出了一个双重反馈学习策略,以选择性地调整不同的样本学习频率。此外,我们还提供了Brno dataset中一 subset的像素级注释,以便帮助夜间多种天气下NTIR图像理解的研究。实验结果显示,我们提出的FoalGAN不仅有效地进行小物类的出现学习,而且也在NTIR2DC任务中比其他图像转换方法具有更高的semantic preservation和edge consistency。

Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2310.15670
  • repo_url: https://github.com/opendrivelab/birds-eye-view-perception
  • paper_authors: Linyan Huang, Zhiqi Li, Chonghao Sima, Wenhai Wang, Jingdong Wang, Yu Qiao, Hongyang Li
  • for: Improve the accuracy of camera-only 3D object detectors (apprentices) through knowledge transferred from LiDAR- or multi-modal-based experts.
  • methods: Motivated by the success of uni-modal distillation, the VCD framework uses an apprentice-friendly multi-modal expert (VCD-E) with the same structure as the camera-only apprentice to reduce feature disparity, leveraging LiDAR input only as a depth prior to reconstruct the 3D scene, plus a fine-grained trajectory-based distillation module that rectifies motion misalignment for each object.
  • results: The camera-only apprentice VCD-A sets a new state of the art on nuScenes with 63.1% NDS.
    Abstract Current research is primarily dedicated to advancing the accuracy of camera-only 3D object detectors (apprentice) through the knowledge transferred from LiDAR- or multi-modal-based counterparts (expert). However, the presence of the domain gap between LiDAR and camera features, coupled with the inherent incompatibility in temporal fusion, significantly hinders the effectiveness of distillation-based enhancements for apprentices. Motivated by the success of uni-modal distillation, an apprentice-friendly expert model would predominantly rely on camera features, while still achieving comparable performance to multi-modal models. To this end, we introduce VCD, a framework to improve the camera-only apprentice model, including an apprentice-friendly multi-modal expert and temporal-fusion-friendly distillation supervision. The multi-modal expert VCD-E adopts an identical structure as that of the camera-only apprentice in order to alleviate the feature disparity, and leverages LiDAR input as a depth prior to reconstruct the 3D scene, achieving the performance on par with other heterogeneous multi-modal experts. Additionally, a fine-grained trajectory-based distillation module is introduced with the purpose of individually rectifying the motion misalignment for each object in the scene. With those improvements, our camera-only apprentice VCD-A sets new state-of-the-art on nuScenes with a score of 63.1% NDS.
    摘要 当前研究主要目标是提高Camera-only 3D对象检测器(学生)的准确率,通过从LiDAR-或多模态基础的对手(专家)传递知识。然而,域之间差距和时间融合不兼容性使得采用热静融合方法的改进效果受到限制。为了解决这个问题,我们提出了VCD框架,包括学生友好的多模态专家和时间融合友好的热静融合监督。多模态专家VCD-E采用与Camera-only学生相同的结构,以减轻特征差异,并利用LiDAR输入为深度估计来重建3D场景,实现与其他多模态专家相同的性能。此外,我们还引入了细腻的轨迹基于热静融合模块,以减轻每个对象在场景中的运动误差。通过这些改进,我们的Camera-only学生VCD-A在nuScenes上设置新的状态标准,分数为63.1% NDS。

Region-controlled Style Transfer

  • paper_url: http://arxiv.org/abs/2310.15658
  • repo_url: https://github.com/kakinglow/Selective-Style-Transfer
  • paper_authors: Junjie Kang, Jinsong Wu, Shiqi Jiang
  • for: Improve control over image style transfer by regulating the strength of transferred textures in different regions of the content image.
  • methods: A training loss constrains the style intensity per region, guiding the transfer strength of style features based on the gradient relationship between style and content images, and a novel feature fusion method linearly transforms content features to resemble style features while preserving their semantic relationships.
  • results: Extensive experiments demonstrate the effectiveness of the proposed approach.
    Abstract Image style transfer is a challenging task in computational vision. Existing algorithms transfer the color and texture of style images by controlling the neural network's feature layers. However, they fail to control the strength of textures in different regions of the content image. To address this issue, we propose a training method that uses a loss function to constrain the style intensity in different regions. This method guides the transfer strength of style features in different regions based on the gradient relationship between style and content images. Additionally, we introduce a novel feature fusion method that linearly transforms content features to resemble style features while preserving their semantic relationships. Extensive experiments have demonstrated the effectiveness of our proposed approach.
    摘要 Computational vision 中的图像风格传递是一项复杂的任务。现有的算法可以通过控制神经网络的特征层来传递风格图像的颜色和文字感,但它们无法控制不同区域的文字强度。为解决这个问题,我们提出了一种使用损失函数来约束不同区域的风格强度的训练方法。这种方法基于风格和内容图像的梯度关系来导引风格特征的传递强度在不同区域。此外,我们还介绍了一种新的特征融合方法,该方法将内容特征线性变换为风格特征,以保持它们的 semantic 关系。我们进行了广泛的实验,并证明了我们的提出的方法的有效性。

Breaking of brightness consistency in optical flow with a lightweight CNN network

  • paper_url: http://arxiv.org/abs/2310.15655
  • repo_url: https://github.com/linyicheng1/LET-NET
  • paper_authors: Yicheng Lin, Shuo Wang, Yunlong Jiang, Bin Han
  • for: Improve the performance of optical flow in High Dynamic Range (HDR) environments, where the usual brightness-consistency assumption breaks down.
  • methods: A lightweight four-layer convolutional network extracts illumination-robust convolutional features and strongly invariant corners (a toy network sketch follows this entry); the brightness-consistency term of classical optical flow is replaced by convolutional-feature consistency, and a deeper network computes a reliability map to help train the shallow one in an end-to-end unsupervised mode.
  • results: The network runs at 190 FPS on a commercial CPU; replacing the optical flow in VINS-Mono with the proposed method reduces translation errors by 93% on a public HDR dataset.
    Abstract Sparse optical flow is widely used in various computer vision tasks, however assuming brightness consistency limits its performance in High Dynamic Range (HDR) environments. In this work, a lightweight network is used to extract illumination robust convolutional features and corners with strong invariance. Modifying the typical brightness consistency of the optical flow method to the convolutional feature consistency yields the light-robust hybrid optical flow method. The proposed network runs at 190 FPS on a commercial CPU because it uses only four convolutional layers to extract feature maps and score maps simultaneously. Since the shallow network is difficult to train directly, a deep network is designed to compute the reliability map that helps it. An end-to-end unsupervised training mode is used for both networks. To validate the proposed method, we compare corner repeatability and matching performance with origin optical flow under dynamic illumination. In addition, a more accurate visual inertial system is constructed by replacing the optical flow method in VINS-Mono. In a public HDR dataset, it reduces translation errors by 93\%. The code is publicly available at https://github.com/linyicheng1/LET-NET.
    摘要 广泛应用在计算机视觉任务中的稀肥光流尚未在高动态范围(HDR)环境中表现出色,因为它假设照明一致性。在这种工作中,我们使用轻量级网络提取抗照明强健的 convolutional 特征和角度特征。对于传统的光流方法的照明一致性来改变光流方法,得到了光敏感混合光流方法。提案的网络在商业CPU上运行于190帧/秒,因为它只使用了四个 convolutional 层来提取特征图和分数图同时。由于这个浅网络困难直接训练,我们设计了深度网络来计算可靠度地图。我们使用无监督的整体训练模式来训练两个网络。为验证我们的方法,我们比较了角 repeatability 和匹配性与原始光流方法在动态照明下。此外,我们将光流方法在 VINS-Mono 中换为我们的方法,从而构建了更加准确的视ер普系统。在一个公共 HDR 数据集上,它降低了翻译错误率93%。代码可以在 上获取。
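
A toy version of the lightweight front end: a few convolutions produce a dense descriptor map (used in place of raw brightness for feature-consistency optical flow) and a corner score map. Channel widths, the shared trunk, and the head layout are assumptions; the paper's deeper reliability network and unsupervised training scheme are not shown.

```python
import torch
import torch.nn as nn

class LightweightFrontEnd(nn.Module):
    """Toy four-convolution front end mapping a grayscale image to a dense
    feature map (for feature-consistency optical flow) and a corner score map.
    """
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        self.feat_head = nn.Conv2d(16, 8, 3, padding=1)     # descriptor channels
        self.score_head = nn.Conv2d(16, 1, 3, padding=1)    # corner score

    def forward(self, img):                                  # img: (B, 1, H, W)
        x = self.trunk(img)
        return self.feat_head(x), torch.sigmoid(self.score_head(x))

feat, score = LightweightFrontEnd()(torch.rand(1, 1, 240, 320))
```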

Mean Teacher DETR with Masked Feature Alignment: A Robust Domain Adaptive Detection Transformer Framework

  • paper_url: http://arxiv.org/abs/2310.15646
  • repo_url: None
  • paper_authors: Weixi Weng, Chun Yuan
  • for: Propose a two-stage unsupervised domain adaptive object detection method (MTM) based on Mean Teacher and DETR that avoids the performance fluctuation and training stagnation of existing feature-alignment approaches.
  • methods: In the pretraining stage, labeled target-like images produced by image style transfer stabilize training; in the self-training stage, unlabeled target images are exploited via mean-teacher pseudo-labels (a toy EMA-update sketch follows this entry) together with an Object Queries Knowledge Transfer (OQKT) module; two masked feature alignment methods, Masked Domain Query-based Feature Alignment (MDQFA) and Masked Token-wise Feature Alignment (MTWFA), alleviate domain shift more robustly.
  • results: Experiments on three challenging scenarios and a theoretical analysis verify the effectiveness of MTM.
    Abstract Unsupervised domain adaptation object detection(UDAOD) research on Detection Transformer(DETR) mainly focuses on feature alignment and existing methods can be divided into two kinds, each of which has its unresolved issues. One-stage feature alignment methods can easily lead to performance fluctuation and training stagnation. Two-stage feature alignment method based on mean teacher comprises a pretraining stage followed by a self-training stage, each facing problems in obtaining reliable pretrained model and achieving consistent performance gains. Methods mentioned above have not yet explore how to utilize the third related domain such as target-like domain to assist adaptation. To address these issues, we propose a two-stage framework named MTM, i.e. Mean Teacher-DETR with Masked Feature Alignment. In the pretraining stage, we utilize labeled target-like images produced by image style transfer to avoid performance fluctuation. In the self-training stage, we leverage unlabeled target images by pseudo labels based on mean teacher and propose a module called Object Queries Knowledge Transfer(OQKT) to ensure consistent performance gains of the student model. Most importantly, we propose masked feature alignment methods including Masked Domain Query-based Feature Alignment(MDQFA) and Masked Token-wise Feature Alignment(MTWFA) to alleviate domain shift in a more robust way, which not only prevent training stagnation and lead to a robust pretrained model in the pretraining stage, but also enhance the model's target performance in the self-training stage. Experiments on three challenging scenarios and a theoretical analysis verify the effectiveness of MTM.
    摘要 <>本文主要研究无监督领域适应对检测变换器(DETR)的问题,具体来说是在特征对齐方面。现有的方法可以分为两类,各自带有不解决的问题。一类是一stage特征对齐方法,容易导致性能波动和训练停滞。另一类是基于mean teacher的两stage特征对齐方法,但每个阶段都面临着获得可靠预训练模型和实现一致性提升的问题。以上方法尚未考虑利用第三个相关领域,如目标类似领域,来支持适应。为了解决这些问题,我们提出了一个名为MTM(Mean Teacher-DETR with Masked Feature Alignment)的两stage框架。在预训练阶段,我们利用了标注目标类似图像生成的image style transfer来避免性能波动。在自我训练阶段,我们利用了无标签目标图像和pseudo标签基于mean teacher,并提出了一个名为Object Queries Knowledge Transfer(OQKT)的模块,以确保学生模型的一致性提升。最重要的是,我们提出了一些masked feature alignment方法,包括Masked Domain Query-based Feature Alignment(MDQFA)和Masked Token-wise Feature Alignment(MTWFA),以减轻领域偏移,不仅在预训练阶段避免训练停滞,还在自我训练阶段提高模型的目标性能。实验结果和理论分析证明了MTM的有效性。
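
The mean-teacher part of the self-training stage typically maintains the teacher as an exponential moving average of the student, which is what the sketch below shows with a placeholder linear module standing in for the DETR detector; the momentum value is an assumption, and OQKT and the masked feature alignment modules are not illustrated.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Toy mean-teacher update: teacher weights are an exponential moving
    average of student weights, so the teacher produces stable pseudo-labels
    for unlabeled target images.
    """
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

student = nn.Linear(256, 91)             # placeholder for the student detector
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

# after each student optimization step:
ema_update(teacher, student)
```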

GUPNet++: Geometry Uncertainty Propagation Network for Monocular 3D Object Detection

  • paper_url: http://arxiv.org/abs/2310.15624
  • repo_url: https://github.com/supermhp/gupnet
  • paper_authors: Yan Lu, Xinzhu Ma, Lei Yang, Tianzhu Zhang, Yating Liu, Qi Chu, Tong He, Yonghui Li, Wanli Ouyang
  • for: image-based monocular 3D object detection
  • methods: estimating object depth via the perspective projection between an object's physical size and its 2D projection; modeling the geometry projection in a probabilistic manner
  • results: state-of-the-art (SOTA) performance in image-based monocular 3D detection; superior efficacy with a simplified framework
    Abstract Geometry plays a significant role in monocular 3D object detection. It can be used to estimate object depth by using the perspective projection between object's physical size and 2D projection in the image plane, which can introduce mathematical priors into deep models. However, this projection process also introduces error amplification, where the error of the estimated height is amplified and reflected into the projected depth. It leads to unreliable depth inferences and also impairs training stability. To tackle this problem, we propose a novel Geometry Uncertainty Propagation Network (GUPNet++) by modeling geometry projection in a probabilistic manner. This ensures depth predictions are well-bounded and associated with a reasonable uncertainty. The significance of introducing such geometric uncertainty is two-fold: (1). It models the uncertainty propagation relationship of the geometry projection during training, improving the stability and efficiency of the end-to-end model learning. (2). It can be derived to a highly reliable confidence to indicate the quality of the 3D detection result, enabling more reliable detection inference. Experiments show that the proposed approach not only obtains (state-of-the-art) SOTA performance in image-based monocular 3D detection but also demonstrates superiority in efficacy with a simplified framework.
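The geometry prior and the error amplification that GUPNet++ targets can be seen from the pinhole relation d = f · H3D / h2D. A toy Python sketch with first-order uncertainty propagation (an illustration, not the paper's exact probabilistic formulation) follows:

```python
def project_depth_with_uncertainty(f_pix, h3d_mu, h3d_sigma, h2d):
    """d = f * H3d / h2d; first-order propagation of the 3D-height error."""
    depth_mu = f_pix * h3d_mu / h2d
    depth_sigma = f_pix * h3d_sigma / h2d   # |d(d)/dH| * sigma_H
    return depth_mu, depth_sigma

# toy numbers: a 1.5 m tall object with +/-0.10 m height error, focal length 700 px
for h2d in (100.0, 20.0):
    mu, sigma = project_depth_with_uncertainty(700.0, 1.5, 0.10, h2d)
    print(f"h2d={h2d:5.1f}px  depth={mu:5.1f} m  +/-{sigma:.2f} m")

# the same height error gives +/-0.70 m at h2d=100 px but +/-3.50 m at h2d=20 px,
# which is the error amplification GUPNet++ models probabilistically
```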

Grasp Multiple Objects with One Hand

  • paper_url: http://arxiv.org/abs/2310.15599
  • repo_url: None
  • paper_authors: Yuyang Li, Bo Liu, Yiran Geng, Puhao Li, Yaodong Yang, Yixin Zhu, Tengyu Liu, Siyuan Huang
  • for: 这篇论文旨在解决机器人多物体抓取问题,即同时抓取和操作多个物体,例如物体转移和手内操作。
  • methods: 提出了一种名为 MultiGrasp 的两阶段方法,用于在桌面上使用多指灵巧手抓取多个物体,包括(i)生成预抓取(pre-grasp)候选和(ii)执行抓取并提起物体。
  • results: 实验主要关注 dual-object grasping,成功率为 44.13%,表明方法能够适应未看到的物体配置和不精准的抓取。方法还能够抓取更多的物体,尽管推理速度减少。
    Abstract The human hand's complex kinematics allow for simultaneous grasping and manipulation of multiple objects, essential for tasks like object transfer and in-hand manipulation. Despite its importance, robotic multi-object grasping remains underexplored and presents challenges in kinematics, dynamics, and object configurations. This paper introduces MultiGrasp, a two-stage method for multi-object grasping on a tabletop with a multi-finger dexterous hand. It involves (i) generating pre-grasp proposals and (ii) executing the grasp and lifting the objects. Experimental results primarily focus on dual-object grasping and report a 44.13% success rate, showcasing adaptability to unseen object configurations and imprecise grasps. The framework also demonstrates the capability to grasp more than two objects, albeit at a reduced inference speed.
    摘要 人类手部复杂的运动学允许同时抓取和操作多个物体,这对物体转移和手内操作等任务至关重要。尽管其重要性,机器人多物体抓取仍缺乏充分研究,并在运动学、动力学和物体构型等方面面临挑战。本文介绍 MultiGrasp,一种使用多指灵巧手在桌面上抓取多个物体的两阶段方法,包括(i)生成预抓取候选和(ii)执行抓取并提起物体。实验主要关注双物体抓取,成功率为 44.13%,展示了对未见过的物体构型和不精确抓取的适应能力。该框架也能抓取两个以上的物体,但推理速度有所下降。

Facial Data Minimization: Shallow Model as Your Privacy Filter

  • paper_url: http://arxiv.org/abs/2310.15590
  • repo_url: None
  • paper_authors: Yuwen Pu, Jiahao Chen, Jiayu Pan, Hao Li, Diqun Yan, Xuhong Zhang, Shouling Ji
  • for: 保护用户面部数据隐私
  • methods: 提出了一种数据隐私最小化变换方法(PMT),可以基于授权服务器的浅型模型进行处理,以获得干扰数据。此外,还提出了一种增强干扰方法来提高PMT的Robustness。
  • results: 经过广泛的实验测试,PMT方法能够有效地防止面部数据泄露和滥用,同时保持面recognition精度。
    Abstract Face recognition service has been used in many fields and brings much convenience to people. However, once the user's facial data is transmitted to a service provider, the user will lose control of his/her private data. In recent years, there exist various security and privacy issues due to the leakage of facial data. Although many privacy-preserving methods have been proposed, they usually fail when they are not accessible to adversaries' strategies or auxiliary data. Hence, in this paper, by fully considering two cases of uploading facial images and facial features, which are very typical in face recognition service systems, we proposed a data privacy minimization transformation (PMT) method. This method can process the original facial data based on the shallow model of authorized services to obtain the obfuscated data. The obfuscated data can not only maintain satisfactory performance on authorized models and restrict the performance on other unauthorized models but also prevent original privacy data from leaking by AI methods and human visual theft. Additionally, since a service provider may execute preprocessing operations on the received data, we also propose an enhanced perturbation method to improve the robustness of PMT. Besides, to authorize one facial image to multiple service models simultaneously, a multiple restriction mechanism is proposed to improve the scalability of PMT. Finally, we conduct extensive experiments and evaluate the effectiveness of the proposed PMT in defending against face reconstruction, data abuse, and face attribute estimation attacks. These experimental results demonstrate that PMT performs well in preventing facial data abuse and privacy leakage while maintaining face recognition accuracy.
    摘要 面部识别服务在各个领域得到广泛应用,为人们带来了很大的便利。然而,一旦用户的面部数据被传输给服务提供者,用户就会失去对这些隐私数据的控制权。近年来,面部数据泄露引发了各类安全与隐私问题。虽然已有许多隐私保护方法被提出,但当无法获知攻击者的策略或辅助数据时,它们往往会失效。因此,本文充分考虑面部识别服务系统中两种典型的上传方式(上传面部图像和上传面部特征),提出了一种数据隐私最小化变换(PMT)方法。该方法基于授权服务的浅层模型处理原始面部数据,得到混淆后的数据。混淆后的数据既能在授权模型上保持令人满意的性能并限制其他未授权模型的性能,又能防止原始隐私数据被 AI 方法或人眼窃取。此外,由于服务提供者可能会对接收到的数据执行预处理操作,我们还提出了一种增强扰动方法以提高 PMT 的鲁棒性;同时,为了让一张面部图像可同时授权给多个服务模型,我们提出了一种多重限制机制以提高 PMT 的可扩展性。最后,我们进行了广泛的实验,评估了 PMT 在抵御人脸重建、数据滥用和人脸属性估计攻击方面的有效性。实验结果表明,PMT 能在保持人脸识别精度的同时有效防止面部数据滥用和隐私泄露。
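For intuition only, the following hypothetical sketch optimizes an additive perturbation that keeps an authorized shallow model's embedding stable while pushing the pixels away from the original face. It illustrates the general "shallow model as privacy filter" idea rather than the paper's actual PMT derivation; the shallow head, loss weights, and step count are all assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def privacy_minimize(x, authorized_head, steps=30, lr=0.01):
    """Optimize a perturbation so the authorized shallow model keeps (roughly)
    the same embedding while the pixels drift away from the original image."""
    with torch.no_grad():
        target_feat = authorized_head(x)
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_obf = (x + delta).clamp(0, 1)
        keep = F.mse_loss(authorized_head(x_obf), target_feat)  # stay useful for the authorized model
        hide = -F.mse_loss(x_obf, x)                            # move away from the raw face
        (keep + 0.1 * hide).backward()
        opt.step()
        opt.zero_grad()
    return (x + delta).clamp(0, 1).detach()

# hypothetical shallow "authorized" head
authorized_head = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                nn.AdaptiveAvgPool2d(1), nn.Flatten())
x_obf = privacy_minimize(torch.rand(1, 3, 112, 112), authorized_head)
```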

VMAF Re-implementation on PyTorch: Some Experimental Results

  • paper_url: http://arxiv.org/abs/2310.15578
  • repo_url: None
  • paper_authors: Kirill Aistov, Maxim Koroteev
  • for: 本研究提出了一种基于PyTorch框架的VMAF实现,并与标准(libvmaf)实现进行比较,显示两者之间的差异在VMAF单位下小于10^-2。
  • methods: 本研究使用了VMAF作为目标函数,并 investigate了在计算梯度时的问题。结果表明,通过使用VMAF作为目标函数进行训练,不会导致梯度计算出现问题。
  • results: 本研究的实验结果表明,使用PyTorch框架实现的VMAF和标准(libvmaf)实现之间的差异在VMAF单位下小于10^-2。
    Abstract Based on the standard VMAF implementation we propose an implementation of VMAF using PyTorch framework. For this implementation comparisons with the standard (libvmaf) show the discrepancy $\lesssim 10^{-2}$ in VMAF units. We investigate gradients computation when using VMAF as an objective function and demonstrate that training using this function does not result in ill-behaving gradients.
    摘要 基于标准的 VMAF 实现,我们提出了一个使用 PyTorch 框架的 VMAF 实现。与标准实现(libvmaf)的比较显示,两者的差异在 VMAF 单位下不超过 $10^{-2}$。我们还研究了将 VMAF 用作目标函数时的梯度计算,并证明使用该目标函数进行训练不会产生病态梯度。
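The gradient check described above follows a generic pattern: treat the metric as a differentiable function of the distorted frame and inspect the gradients PyTorch returns. A sketch with a toy stand-in metric is shown below; a real experiment would plug in the paper's PyTorch VMAF implementation instead of `quality_metric`:

```python
import torch

# Stand-in for a differentiable quality metric (distorted, reference) -> score.
def quality_metric(dist, ref):
    return -((dist - ref) ** 2).mean()   # toy proxy: higher is better

ref = torch.rand(1, 1, 64, 64)
dist = (ref + 0.1 * torch.randn_like(ref)).clamp(0, 1).detach().requires_grad_(True)

score = quality_metric(dist, ref)
loss = -score                            # maximize quality == minimize negative score
grad, = torch.autograd.grad(loss, dist)

# sanity checks of the kind used when training against a perceptual metric
print("score      :", float(score))
print("grad finite:", bool(torch.isfinite(grad).all()))
print("grad norm  :", float(grad.norm()))
```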

I$^2$MD: 3D Action Representation Learning with Inter- and Intra-modal Mutual Distillation

  • paper_url: http://arxiv.org/abs/2310.15568
  • repo_url: None
  • paper_authors: Yunyao Mao, Jiajun Deng, Wengang Zhou, Zhenbo Lu, Wanli Ouyang, Houqiang Li
  • for: 这篇论文提出一种基于对比学习的自监督 3D 人体动作表示学习方法,以解决现有对比学习框架未能充分利用不同骨架模态之间互补性的问题。
  • methods: 该方法提出一个通用的模态间与模态内互蒸馏(I$^2$MD)框架:首先将跨模态交互重新表述为跨模态互蒸馏(CMD)过程;并进一步设计模态内互蒸馏(IMD)策略,以缓解相似样本之间的干扰并利用其潜在上下文。
  • results: 对三个数据集进行了广泛的实验,并取得了一系列新纪录。
    Abstract Recent progresses on self-supervised 3D human action representation learning are largely attributed to contrastive learning. However, in conventional contrastive frameworks, the rich complementarity between different skeleton modalities remains under-explored. Moreover, optimized with distinguishing self-augmented samples, models struggle with numerous similar positive instances in the case of limited action categories. In this work, we tackle the aforementioned problems by introducing a general Inter- and Intra-modal Mutual Distillation (I$^2$MD) framework. In I$^2$MD, we first re-formulate the cross-modal interaction as a Cross-modal Mutual Distillation (CMD) process. Different from existing distillation solutions that transfer the knowledge of a pre-trained and fixed teacher to the student, in CMD, the knowledge is continuously updated and bidirectionally distilled between modalities during pre-training. To alleviate the interference of similar samples and exploit their underlying contexts, we further design the Intra-modal Mutual Distillation (IMD) strategy, In IMD, the Dynamic Neighbors Aggregation (DNA) mechanism is first introduced, where an additional cluster-level discrimination branch is instantiated in each modality. It adaptively aggregates highly-correlated neighboring features, forming local cluster-level contrasting. Mutual distillation is then performed between the two branches for cross-level knowledge exchange. Extensive experiments on three datasets show that our approach sets a series of new records.
    摘要 近期自监督 3D 人体动作表示学习的进展在很大程度上归功于对比学习。然而,在传统的对比框架中,不同骨架模态之间丰富的互补性尚未得到充分利用;同时,以区分自增强样本为优化目标的模型,在动作类别有限时难以处理大量相似的正样本。在这项工作中,我们通过引入一种通用的 Inter- and Intra-modal Mutual Distillation(I$^2$MD)框架来解决上述问题。在 I$^2$MD 中,我们首先将跨模态交互重新表述为跨模态互蒸馏(CMD)过程。与现有将预训练且固定的教师知识传递给学生的蒸馏方案不同,CMD 中的知识在预训练期间持续更新,并在模态之间双向蒸馏。为了缓解相似样本之间的干扰并利用其潜在上下文,我们进一步设计了模态内互蒸馏(IMD)策略:其中首先引入动态邻居聚合(DNA)机制,在每个模态中实例化一个额外的簇级判别分支,自适应地聚合高度相关的邻近特征,形成局部簇级对比;随后在两个分支之间进行互蒸馏,实现跨层级的知识交换。在三个数据集上的大量实验表明,我们的方法创造了一系列新的纪录。
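A minimal sketch of bidirectional (mutual) distillation between two modality encoders is given below. Using pairwise-similarity distributions and asymmetric temperatures mirrors the general CMD idea, but the exact formulation in the paper may differ:

```python
import torch
import torch.nn.functional as F

def mutual_distillation(z_a, z_b, t_teacher=0.05, t_student=0.1):
    """Bidirectional distillation between two skeleton modalities (e.g. joints
    and bones): each modality's pairwise-similarity distribution supervises
    the other, so knowledge flows both ways during pre-training."""
    b = z_a.size(0)
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    mask = ~torch.eye(b, dtype=torch.bool, device=z_a.device)
    sim_a = (z_a @ z_a.t())[mask].view(b, -1)   # (B, B-1), self-similarity removed
    sim_b = (z_b @ z_b.t())[mask].view(b, -1)

    def kd(student, teacher):
        return F.kl_div(F.log_softmax(student / t_student, dim=1),
                        F.softmax(teacher.detach() / t_teacher, dim=1),
                        reduction="batchmean")

    return kd(sim_a, sim_b) + kd(sim_b, sim_a)

# shape/flow check with random embeddings standing in for two modality encoders
z_a = torch.randn(8, 128, requires_grad=True)
z_b = torch.randn(8, 128, requires_grad=True)
mutual_distillation(z_a, z_b).backward()
```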

PET Synthesis via Self-supervised Adaptive Residual Estimation Generative Adversarial Network

  • paper_url: http://arxiv.org/abs/2310.15550
  • repo_url: None
  • paper_authors: Yuxin Xue, Lei Bi, Yige Peng, Michael Fulham, David Dagan Feng, Jinman Kim
  • for: 降低对 positron emission tomography (PET) 的辐射暴露,同时维持高品质的 molecular imaging 图像。
  • methods: 提出自监督自适应残差估计生成对抗网络(SS-AEGAN),用卷积神经网络(CNN)从低剂量 PET 图像合成高质量 PET 图像,以缓解合成图像与真实图像在纹理和结构上的差异,并处理低剂量 PET 与标准 PET 之间的分布偏移。
  • results: SS-AEGAN 在一个公共 benchmark 资料集上的实验中,与 state-of-the-art 合成方法相比,具有较高的效果,并且可以降低辐射暴露。
    Abstract Positron emission tomography (PET) is a widely used, highly sensitive molecular imaging in clinical diagnosis. There is interest in reducing the radiation exposure from PET but also maintaining adequate image quality. Recent methods using convolutional neural networks (CNNs) to generate synthesized high-quality PET images from low-dose counterparts have been reported to be state-of-the-art for low-to-high image recovery methods. However, these methods are prone to exhibiting discrepancies in texture and structure between synthesized and real images. Furthermore, the distribution shift between low-dose PET and standard PET has not been fully investigated. To address these issues, we developed a self-supervised adaptive residual estimation generative adversarial network (SS-AEGAN). We introduce (1) An adaptive residual estimation mapping mechanism, AE-Net, designed to dynamically rectify the preliminary synthesized PET images by taking the residual map between the low-dose PET and synthesized output as the input, and (2) A self-supervised pre-training strategy to enhance the feature representation of the coarse generator. Our experiments with a public benchmark dataset of total-body PET images show that SS-AEGAN consistently outperformed the state-of-the-art synthesis methods with various dose reduction factors.
    摘要 正电子发射断层成像(PET)是一种在临床诊断中广泛使用、高度灵敏的分子成像技术。人们希望在降低 PET 辐射剂量的同时保持足够的图像质量。基于卷积神经网络(CNN)从低剂量 PET 合成高质量 PET 图像的方法已是低剂量图像恢复的前沿,但这类方法容易在合成图像与真实图像之间出现纹理和结构上的差异,且低剂量 PET 与标准 PET 之间的分布偏移也尚未被充分研究。为此,我们开发了自监督自适应残差估计生成对抗网络(SS-AEGAN),其包括两点:(1)自适应残差估计映射机制 AE-Net,以低剂量 PET 与合成输出之间的残差图为输入,动态修正初步合成的 PET 图像;(2)一种自监督预训练策略,用于增强粗生成器的特征表示。在一个公开的全身 PET 图像基准数据集上的实验表明,SS-AEGAN 在不同剂量降低因子下均优于当前最先进的合成方法。
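The residual-estimation step can be sketched as a small refiner that consumes the residual between the low-dose input and a coarse synthesis and predicts a correction. The channel counts and exact inputs below are assumptions, not the paper's AE-Net:

```python
import torch
import torch.nn as nn

class ResidualRefiner(nn.Module):
    """Given the low-dose input and a coarse synthesized standard-dose PET,
    predict a correction from their residual map and add it to the coarse output."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, low_dose, coarse_pred):
        residual_map = low_dose - coarse_pred                        # what the coarse generator missed
        correction = self.net(torch.cat([residual_map, coarse_pred], dim=1))
        return coarse_pred + correction                              # refined synthesis

refiner = ResidualRefiner()
low, coarse = torch.rand(1, 1, 128, 128), torch.rand(1, 1, 128, 128)
refined = refiner(low, coarse)
print(refined.shape)   # (1, 1, 128, 128)
```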

Segue: Side-information Guided Generative Unlearnable Examples for Facial Privacy Protection in Real World

  • paper_url: http://arxiv.org/abs/2310.16061
  • repo_url: None
  • paper_authors: Zhiling Zhang, Jie Zhang, Kui Zhang, Wenbo Zhou, Weiming Zhang, Nenghai Yu
  • for: 实现人脸隐私保护,防止人脸识别模型从数据中学习到可判别的面部特征。
  • methods: 借助“不可学习样本”的思想,在模型训练阶段向数据中加入不可察觉的扰动,使模型无法学习目标人脸的判别特征。
  • results: 提出了一个名为Segue的新方法,可以快速生成可转移的不可学习例子,并且具有对抗JPEG压缩、敌方训练和一些标准的数据增强等特性。
    Abstract The widespread use of face recognition technology has given rise to privacy concerns, as many individuals are worried about the collection and utilization of their facial data. To address these concerns, researchers are actively exploring the concept of ``unlearnable examples", by adding imperceptible perturbation to data in the model training stage, which aims to prevent the model from learning discriminate features of the target face. However, current methods are inefficient and cannot guarantee transferability and robustness at the same time, causing impracticality in the real world. To remedy it, we propose a novel method called Segue: Side-information guided generative unlearnable examples. Specifically, we leverage a once-trained multiple-used model to generate the desired perturbation rather than the time-consuming gradient-based method. To improve transferability, we introduce side information such as true labels and pseudo labels, which are inherently consistent across different scenarios. For robustness enhancement, a distortion layer is integrated into the training pipeline. Extensive experiments demonstrate that the proposed Segue is much faster than previous methods (1000$\times$) and achieves transferable effectiveness across different datasets and model architectures. Furthermore, it can resist JPEG compression, adversarial training, and some standard data augmentations.
    摘要 广泛使用人脸识别技术已引起隐私问题,许多人担心模型会收集和利用他们的面部数据。为解决这些问题,研究人员正在积极探索“不可学习示例”的概念,通过在模型训练阶段添加不可见的干扰,以防止模型学习权倾向的面部特征。然而,当前的方法存在效率和可靠性问题,导致实际应用中存在不可避免的问题。为此,我们提出了一种新的方法:Segue:侧情况引导生成不可学习示例。具体来说,我们利用一个已经训练过的多用模型来生成所需的干扰,而不是时间消耗的梯度基本方法。为了提高传输性,我们引入了侧情况,如真实标签和假标签,这些情况是不同场景中的自然一致性。为了增强Robustness,我们在训练管道中添加了扭曲层。广泛的实验表明,我们提出的Segue比前一代方法(1000倍)更快速,并在不同的数据集和模型结构上实现了传输性和可靠性。此外,它还能抵抗JPEG压缩、反击训练和一些标准的数据扩展。

Learning with Noisy Labels Using Collaborative Sample Selection and Contrastive Semi-Supervised Learning

  • paper_url: http://arxiv.org/abs/2310.15533
  • repo_url: None
  • paper_authors: Qing Miao, Xiaohe Wu, Chao Xu, Yanli Ji, Wangmeng Zuo, Yiwen Guo, Zhaopeng Meng
  • for: 提高 Learning with Noisy Labels (LNL) 中的泛化性能。
  • methods: 提出了一种名为协同样本选择(CSS)的方法,利用大规模预训练模型 CLIP,从所选出的干净样本集中剔除混入的噪声样本:将 CLIP 给出的概率与 DNN 分类器的预测相结合,训练一个二维高斯混合模型(2D-GMM)进行选择;同时通过在半监督学习中以对比损失联合微调 CLIP 的提示(prompt)与 DNN 分类器,增强特征表示并提升分类性能。
  • results: 在多个 benchmark 数据集上实验结果表明,提出的方法在比 estado-of-the-art 方法更高的泛化性能。
    Abstract Learning with noisy labels (LNL) has been extensively studied, with existing approaches typically following a framework that alternates between clean sample selection and semi-supervised learning (SSL). However, this approach has a limitation: the clean set selected by the Deep Neural Network (DNN) classifier, trained through self-training, inevitably contains noisy samples. This mixture of clean and noisy samples leads to misguidance in DNN training during SSL, resulting in impaired generalization performance due to confirmation bias caused by error accumulation in sample selection. To address this issue, we propose a method called Collaborative Sample Selection (CSS), which leverages the large-scale pre-trained model CLIP. CSS aims to remove the mixed noisy samples from the identified clean set. We achieve this by training a 2-Dimensional Gaussian Mixture Model (2D-GMM) that combines the probabilities from CLIP with the predictions from the DNN classifier. To further enhance the adaptation of CLIP to LNL, we introduce a co-training mechanism with a contrastive loss in semi-supervised learning. This allows us to jointly train the prompt of CLIP and the DNN classifier, resulting in improved feature representation, boosted classification performance of DNNs, and reciprocal benefits to our Collaborative Sample Selection. By incorporating auxiliary information from CLIP and utilizing prompt fine-tuning, we effectively eliminate noisy samples from the clean set and mitigate confirmation bias during training. Experimental results on multiple benchmark datasets demonstrate the effectiveness of our proposed method in comparison with the state-of-the-art approaches.
    摘要 带噪声标签学习(LNL)已得到广泛研究,现有方法通常采用在干净样本选择与半监督学习(SSL)之间交替的框架。然而,这类方法存在一个局限:通过自训练得到的深度神经网络(DNN)分类器所选出的干净集不可避免地仍包含噪声样本。干净样本与噪声样本的混杂会在 SSL 阶段误导 DNN 的训练,由于样本选择误差的累积造成确认偏差,进而损害泛化性能。为解决这一问题,我们提出了协同样本选择(CSS)方法,利用大规模预训练模型 CLIP,将混入干净集中的噪声样本剔除:通过训练一个二维高斯混合模型(2D-GMM),将 CLIP 给出的概率与 DNN 分类器的预测相结合来实现。为进一步提升 CLIP 对 LNL 的适应性,我们在半监督学习中引入带对比损失的协同训练机制,同时训练 CLIP 的提示(prompt)与 DNN 分类器,从而改进特征表示、提升 DNN 的分类性能,并反过来增强协同样本选择。通过利用 CLIP 的辅助信息和提示微调,我们能有效地从干净集中剔除噪声样本并缓解训练中的确认偏差。在多个基准数据集上的实验结果表明,所提方法优于当前最先进的方法。
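The clean-sample selection step can be approximated with an off-the-shelf two-component Gaussian mixture over per-sample (DNN loss, CLIP probability) pairs; the exact features fed to the paper's 2D-GMM may differ from this sketch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_clean(dnn_loss, clip_prob, threshold=0.7):
    """Fit a two-component 2-D GMM on per-sample (DNN loss, CLIP probability of
    the given label) pairs and keep samples assigned with high posterior to the
    'clean' component (low loss / high CLIP agreement)."""
    feats = np.stack([dnn_loss, clip_prob], axis=1)          # (N, 2)
    gmm = GaussianMixture(n_components=2, covariance_type="full", reg_covar=1e-4)
    gmm.fit(feats)
    clean_comp = int(np.argmin(gmm.means_[:, 0]))            # component with lower mean loss
    posterior = gmm.predict_proba(feats)[:, clean_comp]
    return posterior > threshold                             # boolean mask of kept samples

# toy usage: 1000 samples, ~30% of them noisy
rng = np.random.default_rng(0)
loss = np.concatenate([rng.normal(0.2, 0.1, 700), rng.normal(1.5, 0.3, 300)])
prob = np.concatenate([rng.uniform(0.6, 1.0, 700), rng.uniform(0.0, 0.4, 300)])
mask = select_clean(loss, prob)
print(mask.sum(), "samples kept as clean")
```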

NetDistiller: Empowering Tiny Deep Learning via In-Situ Distillation

  • paper_url: http://arxiv.org/abs/2310.19820
  • repo_url: None
  • paper_authors: Shunyao Zhang, Yonggan Fu, Shang Wu, Jyotikrishna Dass, Haoran You, Yingyan Lin
  • for: 提高 tiny neural network (TNN) 的任务准确率,以便在具有限制的 Edge 设备上部署 TNN。
  • methods: 提出了一种名为 NetDistiller 的框架,该框架通过将 TNN 视为一个权重共享的老师模型的子网络,并通过(1)梯度手术和(2)uncertainty-aware distillation来处理梯度冲突和老师模型过度适应。
  • results: 对多种任务进行了广泛的实验,证明 NetDistiller 可以有效地提高 TNN 的可达准确率,并且超过了现有方法的性能。代码可以在 https://github.com/GATECH-EIC/NetDistiller 上下载。
    Abstract Boosting the task accuracy of tiny neural networks (TNNs) has become a fundamental challenge for enabling the deployments of TNNs on edge devices which are constrained by strict limitations in terms of memory, computation, bandwidth, and power supply. To this end, we propose a framework called NetDistiller to boost the achievable accuracy of TNNs by treating them as sub-networks of a weight-sharing teacher constructed by expanding the number of channels of the TNN. Specifically, the target TNN model is jointly trained with the weight-sharing teacher model via (1) gradient surgery to tackle the gradient conflicts between them and (2) uncertainty-aware distillation to mitigate the overfitting of the teacher model. Extensive experiments across diverse tasks validate NetDistiller's effectiveness in boosting TNNs' achievable accuracy over state-of-the-art methods. Our code is available at https://github.com/GATECH-EIC/NetDistiller.
    摘要 提升小型神经网络(TNN)的任务准确率,已成为在内存、计算、带宽和供电均受严格限制的边缘设备上部署 TNN 的基本挑战。为此,我们提出了名为 NetDistiller 的框架:将目标 TNN 视为一个通过扩展其通道数构建的权重共享教师模型的子网络,并与该教师模型联合训练。训练中通过(1)梯度手术来化解两者之间的梯度冲突,以及(2)不确定性感知蒸馏来缓解教师模型的过拟合。在多种任务上的大量实验验证了 NetDistiller 在提升 TNN 可达准确率方面优于现有方法。我们的代码可在 https://github.com/GATECH-EIC/NetDistiller 获取。
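One common form of gradient surgery is a PCGrad-style projection that removes the conflicting component of one gradient before combining. The sketch below applies that rule to the task and distillation gradients; the paper's exact procedure may differ:

```python
import torch

def gradient_surgery(g_task, g_distill):
    """If the two gradient lists conflict (negative dot product), project the
    conflicting component out of the distillation gradient, then sum."""
    flat_t = torch.cat([g.flatten() for g in g_task])
    flat_d = torch.cat([g.flatten() for g in g_distill])
    dot = torch.dot(flat_d, flat_t)
    if dot < 0:
        flat_d = flat_d - dot / (flat_t.norm() ** 2 + 1e-12) * flat_t
    combined = flat_t + flat_d

    # unflatten back to the parameter shapes
    out, i = [], 0
    for g in g_task:
        n = g.numel()
        out.append(combined[i:i + n].view_as(g))
        i += n
    return out

# usage: g_task / g_distill would come from torch.autograd.grad of the task loss
# and the distillation loss w.r.t. the student's weight-shared parameters,
# and the returned list would be written into each parameter's .grad.
g_task = [torch.randn(3, 3), torch.randn(3)]
g_distill = [torch.randn(3, 3), torch.randn(3)]
merged = gradient_surgery(g_task, g_distill)
```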

Cross-view Self-localization from Synthesized Scene-graphs

  • paper_url: http://arxiv.org/abs/2310.15504
  • repo_url: None
  • paper_authors: Ryogo Yamamoto, Kanji Tanaka
  • for: 本研究旨在解决跨视角自地标记问题,该问题中提供的数据库图像来自稀疏的视角。
  • methods: 本研究使用NeRF技术生成数据库图像,并使用图像纹理场学习抽象出视点不变的外观特征和视点相关的空间 semantic特征。然后,这两种特征 fusion到Scene Graph中,通过图 neural network压缩学习和识别。
  • results: 研究人员通过一个新的混合场景模型,将视点不变的外观特征和视点相关的空间 semantic特征 fusion到Scene Graph中,并使用图 neural network压缩学习和识别。实验结果表明,提案的方法可以在跨视角自地标记问题中提供高性能。
    Abstract Cross-view self-localization is a challenging scenario of visual place recognition in which database images are provided from sparse viewpoints. Recently, an approach for synthesizing database images from unseen viewpoints using NeRF (Neural Radiance Fields) technology has emerged with impressive performance. However, synthesized images provided by these techniques are often of lower quality than the original images, and furthermore they significantly increase the storage cost of the database. In this study, we explore a new hybrid scene model that combines the advantages of view-invariant appearance features computed from raw images and view-dependent spatial-semantic features computed from synthesized images. These two types of features are then fused into scene graphs, and compressively learned and recognized by a graph neural network. The effectiveness of the proposed method was verified using a novel cross-view self-localization dataset with many unseen views generated using a photorealistic Habitat simulator.
    摘要 cross-view自本地化是视觉地标识中的一种具有挑战性的场景,其中数据库图像来自稀见的视角。近些年,使用NeRF(神经辐射场)技术生成数据库图像从未看过的视角的方法出现了,表现出色。然而,这些技术生成的图像与原始图像质量相比较低,同时也增加了数据库存储成本。在这项研究中,我们探索了一种新的混合场景模型,它结合了raw图像中的视变不变特征和synthesized图像中的视依存空间semantic特征。这两种特征然后被融合到场景图中,并通过图 neural网络压缩学习和识别。研究的效果得到了一个新的cross-view自本地化数据集,使用了真实的Habitat模拟器生成的多个未看过的视角。

Salient Object Detection in RGB-D Videos

  • paper_url: http://arxiv.org/abs/2310.15482
  • repo_url: https://github.com/kerenfu/rdvs
  • paper_authors: Ao Mou, Yukang Lu, Jiahao He, Dingyao Min, Keren Fu, Qijun Zhao
  • for: 这篇论文主要用于研究RGB-D视频中的鲜明 объек检测(SOD)技术,以提高视频中对象的检测精度。
  • methods: 该论文提出了一个新的三流网络模型(DCTNet+),其中RGB模式作为主要输入模式,并使用了depth和运动流作为auxiliary modalities。具有多modal注意模块(MAM)和修复融合模块(RFM)等两个模块,以便强化特征提升、修复和融合,以实现高精度的最终预测。
  • results: 对于17个视频SOD模型和14个RGB-D SOD模型,DCTNet+在pseudo RGB-D视频 dataset上的实验表明其在鲜明 объек检测任务中表现出优于其他模型。此外,对于真实的RGB-D视频 dataset,DCTNet+也表现出了优于其他模型。
    Abstract Given the widespread adoption of depth-sensing acquisition devices, RGB-D videos and related data/media have gained considerable traction in various aspects of daily life. Consequently, conducting salient object detection (SOD) in RGB-D videos presents a highly promising and evolving avenue. Despite the potential of this area, SOD in RGB-D videos remains somewhat under-explored, with RGB-D SOD and video SOD (VSOD) traditionally studied in isolation. To explore this emerging field, this paper makes two primary contributions: the dataset and the model. On one front, we construct the RDVS dataset, a new RGB-D VSOD dataset with realistic depth and characterized by its diversity of scenes and rigorous frame-by-frame annotations. We validate the dataset through comprehensive attribute and object-oriented analyses, and provide training and testing splits. Moreover, we introduce DCTNet+, a three-stream network tailored for RGB-D VSOD, with an emphasis on RGB modality and treats depth and optical flow as auxiliary modalities. In pursuit of effective feature enhancement, refinement, and fusion for precise final prediction, we propose two modules: the multi-modal attention module (MAM) and the refinement fusion module (RFM). To enhance interaction and fusion within RFM, we design a universal interaction module (UIM) and then integrate holistic multi-modal attentive paths (HMAPs) for refining multi-modal low-level features before reaching RFMs. Comprehensive experiments, conducted on pseudo RGB-D video datasets alongside our RDVS, highlight the superiority of DCTNet+ over 17 VSOD models and 14 RGB-D SOD models. Ablation experiments were performed on both pseudo and realistic RGB-D video datasets to demonstrate the advantages of individual modules as well as the necessity of introducing realistic depth. Our code together with RDVS dataset will be available at https://github.com/kerenfu/RDVS/.
    摘要 由于深度感知设备的广泛应用,RGB-D视频和相关数据/媒体在不同方面得到了广泛的应用。因此,在RGB-D视频中进行突出物体检测(SOD)是一个非常有前途的和发展中的领域。尽管这个领域的潜力很大,但RGB-D SOD和视频 SOD(VSOD)通常是隔离的,这个领域的探索仍然很有前途。为了探索这个新兴领域,本文做出了两个主要贡献: datasets 和模型。一方面,我们构建了 RDVS dataset,一个新的 RGB-D VSOD dataset,其特点是多样化的场景和精心注解每帧数据。我们通过了全面的特征和物体分析,并提供了训练和测试分割。此外,我们引入 DCTNet+,一种三流网络,其中RGB模式具有主导性,并将深度和光流视为助手模式。为了提高特征优化、融合和混合,我们提出了两个模块:多模式注意力模块(MAM)和修复融合模块(RFM)。为了增强 RFM 之间的互动和融合,我们设计了一个通用互动模块(UIM),并将整体多模式注意力路径(HMAPs)integrated into RFMs。对 pseudo RGB-D 视频 dataset 和我们的 RDVS 进行了广泛的实验,显示 DCTNet+ 在 17 个 VSOD 模型和 14 个 RGB-D SOD 模型中表现出色。我们还进行了对 pseudo 和真实 RGB-D 视频 dataset 的ablation 实验,以示各个模块的优势和引入真实深度的必要性。我们的代码以及 RDVS dataset 将在 GitHub 上提供,链接为:https://github.com/kerenfu/RDVS/.

DeepIron: Predicting Unwarped Garment Texture from a Single Image

  • paper_url: http://arxiv.org/abs/2310.15447
  • repo_url: None
  • paper_authors: Hyun-Song Kwon, Sung-Hee Lee
  • for: 创建虚拟人物和虚拟试穿
  • methods: 提出一个纹理图重建框架,其核心模块 Texture Unwarper 从带有形变和遮挡的输入服装图像中推断出未扭曲的原始纹理图,用于 3D 服装模型
  • results: 生成高质量的纹理图像,使 3D 服装模型能在新姿势下呈现真实形变的纹理
    Abstract Realistic reconstruction of 3D clothing from an image has wide applications, such as avatar creation and virtual try-on. This paper presents a novel framework that reconstructs the texture map for 3D garments from a single image with pose. Assuming that 3D garments are modeled by stitching 2D garment sewing patterns, our specific goal is to generate a texture image for the sewing patterns. A key component of our framework, the Texture Unwarper, infers the original texture image from the input clothing image, which exhibits warping and occlusion of texture due to the user's body shape and pose. The Texture Unwarper effectively transforms between the input and output images by mapping the latent spaces of the two images. By inferring the unwarped original texture of the input garment, our method helps reconstruct 3D garment models that can show high-quality texture images realistically deformed for new poses. We validate the effectiveness of our approach through a comparison with other methods and ablation studies.
    摘要 从单张图像真实地重建三维服装在创建虚拟化身和虚拟试穿等方面有广泛应用。本文提出了一种新框架,可从带姿势的单张图像中重建三维服装的纹理图。假设三维服装由二维裁片缝制而成,我们的具体目标是为这些裁片生成纹理图像。框架的核心组件纹理展平器(Texture Unwarper)从输入服装图像中推断原始纹理图像,而输入图像中的纹理因用户的体型和姿势产生扭曲和遮挡。Texture Unwarper 通过在两类图像的潜在空间之间建立映射来完成输入与输出之间的变换。借助推断出的未扭曲原始纹理,我们的方法有助于重建能够在新姿势下呈现高质量、真实形变纹理的三维服装模型。我们通过与其他方法的比较和消融实验验证了该方法的有效性。

Fast Propagation is Better: Accelerating Single-Step Adversarial Training via Sampling Subnetworks

  • paper_url: http://arxiv.org/abs/2310.15444
  • repo_url: None
  • paper_authors: Xiaojun Jia, Jianshu Li, Jindong Gu, Yang Bai, Xiaochun Cao
  • for: 提高模型的对抗鲁棒性和对抗训练效率
  • methods: 在单步对抗训练中动态采样模型内部构建块形成轻量级子网络作为代理模型,并提供理论分析证明其可提升模型鲁棒性
  • results: 与先前方法相比,在降低训练成本的同时取得更高的模型鲁棒性
    Abstract Adversarial training has shown promise in building robust models against adversarial examples. A major drawback of adversarial training is the computational overhead introduced by the generation of adversarial examples. To overcome this limitation, adversarial training based on single-step attacks has been explored. Previous work improves the single-step adversarial training from different perspectives, e.g., sample initialization, loss regularization, and training strategy. Almost all of them treat the underlying model as a black box. In this work, we propose to exploit the interior building blocks of the model to improve efficiency. Specifically, we propose to dynamically sample lightweight subnetworks as a surrogate model during training. By doing this, both the forward and backward passes can be accelerated for efficient adversarial training. Besides, we provide theoretical analysis to show the model robustness can be improved by the single-step adversarial training with sampled subnetworks. Furthermore, we propose a novel sampling strategy where the sampling varies from layer to layer and from iteration to iteration. Compared with previous methods, our method not only reduces the training cost but also achieves better model robustness. Evaluations on a series of popular datasets demonstrate the effectiveness of the proposed FB-Better. Our code has been released at https://github.com/jiaxiaojunQAQ/FP-Better.
    摘要 对抗训练在构建能抵御对抗样本的鲁棒模型方面展现出潜力,但其主要缺点是生成对抗样本带来的计算开销。为克服这一限制,基于单步攻击的对抗训练得到了探索。先前的工作从样本初始化、损失正则化和训练策略等不同角度改进单步对抗训练,但几乎都把底层模型当作黑盒。在本工作中,我们提议利用模型的内部构建块来提高效率:在训练过程中动态采样轻量级子网络作为代理模型,从而同时加速前向和反向传播,实现高效的对抗训练。我们还提供了理论分析,表明使用采样子网络的单步对抗训练可以提升模型鲁棒性。此外,我们提出了一种新的采样策略,使采样在不同层和不同迭代之间变化。与先前方法相比,我们的方法不仅降低了训练成本,还取得了更好的模型鲁棒性。在一系列常用数据集上的评估验证了所提方法的有效性。我们的代码已发布在 https://github.com/jiaxiaojunQAQ/FP-Better。
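The subnetwork-sampling idea can be illustrated with stochastic-depth-style block dropping during the single-step (FGSM) attack, so the adversarial example is generated with a cheaper surrogate while the full model is trained on it. The toy model and sampling rule below are assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkippableBlock(nn.Module):
    """Residual block that can be bypassed, so a random subnetwork can be
    sampled at each iteration (a stand-in for the paper's block sampling)."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x, active=True):
        return x + self.body(x) if active else x

class ToyNet(nn.Module):
    def __init__(self, dim=64, depth=6, classes=10):
        super().__init__()
        self.stem = nn.Linear(784, dim)
        self.blocks = nn.ModuleList(SkippableBlock(dim) for _ in range(depth))
        self.head = nn.Linear(dim, classes)

    def forward(self, x, keep=None):
        h = torch.relu(self.stem(x))
        keep = keep if keep is not None else [True] * len(self.blocks)
        for blk, k in zip(self.blocks, keep):
            h = blk(h, active=k)
        return self.head(h)

def fgsm_subnet_step(model, x, y, eps=0.1, keep_prob=0.5):
    keep = [torch.rand(1).item() < keep_prob for _ in model.blocks]  # sampled subnetwork
    x_adv = x.clone().requires_grad_(True)
    attack_loss = F.cross_entropy(model(x_adv, keep), y)             # cheap fwd/bwd through the subnetwork
    grad, = torch.autograd.grad(attack_loss, x_adv)
    x_adv = (x + eps * grad.sign()).detach()
    return F.cross_entropy(model(x_adv), y)                          # full model trains on the adversarial batch

model = ToyNet()
x, y = torch.rand(16, 784), torch.randint(0, 10, (16,))
fgsm_subnet_step(model, x, y).backward()
```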

G2-MonoDepth: A General Framework of Generalized Depth Inference from Monocular RGB+X Data

  • paper_url: http://arxiv.org/abs/2310.15422
  • repo_url: https://github.com/wang-xjtu/g2-monodepth
  • paper_authors: Haotian Wang, Meng Yang, Nanning Zheng
  • for: This paper aims to solve the problem of monocular depth inference for robots, which is a fundamental problem for scene perception.
  • methods: The paper proposes a unified task of monocular depth inference, which uses a unified data representation, a novel unified loss, an improved network, and a data augmentation pipeline to well propagate diverse scene scales from input to output.
  • results: The paper demonstrates the effectiveness of its approach on three sub-tasks, including depth estimation, depth completion with different sparsity, and depth enhancement in unseen scenes, and outperforms state-of-the-art baselines on both real-world data and synthetic data.
    Abstract Monocular depth inference is a fundamental problem for scene perception of robots. Specific robots may be equipped with a camera plus an optional depth sensor of any type and located in various scenes of different scales, whereas recent advances derived multiple individual sub-tasks. It leads to additional burdens to fine-tune models for specific robots and thereby high-cost customization in large-scale industrialization. This paper investigates a unified task of monocular depth inference, which infers high-quality depth maps from all kinds of input raw data from various robots in unseen scenes. A basic benchmark G2-MonoDepth is developed for this task, which comprises four components: (a) a unified data representation RGB+X to accommodate RGB plus raw depth with diverse scene scale/semantics, depth sparsity ([0%, 100%]) and errors (holes/noises/blurs), (b) a novel unified loss to adapt to diverse depth sparsity/errors of input raw data and diverse scales of output scenes, (c) an improved network to well propagate diverse scene scales from input to output, and (d) a data augmentation pipeline to simulate all types of real artifacts in raw depth maps for training. G2-MonoDepth is applied in three sub-tasks including depth estimation, depth completion with different sparsity, and depth enhancement in unseen scenes, and it always outperforms SOTA baselines on both real-world data and synthetic data.
    摘要 单目深度推断是机器人场景感知的基本问题。不同的机器人可能配备一个摄像头以及任意类型的可选深度传感器,并工作在尺度各异的多种场景中,而近期的进展却衍生出多个相互独立的子任务,导致需要为特定机器人微调模型,在大规模产业化中带来高昂的定制成本。本文研究一个统一的单目深度推断任务:在未见过的场景中,从各类机器人的各种输入原始数据中推断高质量的深度图。为此,我们构建了该任务的基础基准 G2-MonoDepth,它包含四个组成部分:(a)统一的数据表示 RGB+X,可容纳 RGB 以及具有不同场景尺度/语义、不同深度稀疏度([0%, 100%])和不同误差(空洞/噪声/模糊)的原始深度;(b)一种新的统一损失,以适应输入原始数据中多样的深度稀疏度/误差和输出场景的多样尺度;(c)一个改进的网络,能将输入中多样的场景尺度良好地传播到输出;(d)一个数据增强管线,用于在训练中模拟原始深度图中各类真实伪影。G2-MonoDepth 被应用于深度估计、不同稀疏度的深度补全以及未见场景中的深度增强三个子任务,在真实数据和合成数据上均优于最先进的基线方法。
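A minimal sketch of a unified RGB+X input that covers depth estimation (no depth), completion (sparse depth), and enhancement (dense but noisy depth) with one tensor layout; the channel arrangement here is an assumption for illustration, not necessarily the paper's exact representation:

```python
import torch

def build_rgbx(rgb, raw_depth=None):
    """Unified RGB+X input: RGB concatenated with a (possibly empty) raw depth
    channel and a validity mask, so the three sub-tasks share one input format."""
    b, _, h, w = rgb.shape
    if raw_depth is None:
        raw_depth = torch.zeros(b, 1, h, w, device=rgb.device)   # 0% depth: pure estimation
    valid = (raw_depth > 0).float()                              # marks where raw depth exists
    return torch.cat([rgb, raw_depth, valid], dim=1)             # (B, 5, H, W)

rgb = torch.rand(2, 3, 240, 320)
print(build_rgbx(rgb).shape)                        # depth estimation: no depth at all
sparse = torch.rand(2, 1, 240, 320) * (torch.rand(2, 1, 240, 320) > 0.95)
print(build_rgbx(rgb, sparse).shape)                # depth completion: ~5% valid pixels
```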