cs.CV - 2023-10-30

Facial asymmetry: A Computer Vision based behaviometric index for assessment during a face-to-face interview

  • paper_url: http://arxiv.org/abs/2310.20083
  • repo_url: None
  • paper_authors: Shuvam Keshari, Tanusree Dutta, Raju Mullick, Ashish Rathor, Priyadarshi Patnaik
  • for: This study aims to support personnel selection by using behaviometry to assess an interviewee's behavioral characteristics, making the interview process more objective and accurate.
  • methods: The study uses computer vision techniques and open-source libraries (python-opencv and dlib), with facial asymmetry and micro-expression analysis as the main assessment methods.
  • results: Analyzing facial asymmetry and micro-expressions yields information free from interviewer bias and social desirability effects, helping to make the selection process more accurate and fair.
    Abstract Choosing the right person for the right job makes the personnel interview process a cognitively demanding task. Psychometric tests, followed by an interview, have often been used to aid the process although such mechanisms have their limitations. While psychometric tests suffer from faking or social desirability of responses, the interview process depends on the way the responses are analyzed by the interviewers. We propose the use of behaviometry as an assistive tool to facilitate an objective assessment of the interviewee without increasing the cognitive load of the interviewer. Behaviometry is a relatively little explored field of study in the selection process that utilizes inimitable behavioral characteristics like facial expressions, vocalization patterns, pupillary reactions, proximal behavior, body language, etc. The method analyzes thin slices of behavior and provides unbiased information about the interviewee. The current study proposes the methodology behind this tool to capture facial expressions, in terms of facial asymmetry and micro-expressions. Hemi-facial composites using a structural similarity index were used to develop a progressive time graph of facial asymmetry, as a test case. A frame-by-frame analysis was performed on three YouTube video samples, where Structural similarity index (SSID) scores of 75% and more showed behavioral congruence. The research utilizes open-source computer vision algorithms and libraries (python-opencv and dlib) to formulate the procedure for analysis of the facial asymmetry.
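Since the pipeline above rests on open-source tooling, a brief illustration may help. The sketch below is a minimal reading of the asymmetry measure: it compares the left hemi-face against the mirrored right hemi-face with the structural similarity index from scikit-image, frame by frame. It assumes a pre-cropped, roughly frontal and vertically aligned face; in the paper the midline would come from dlib facial landmarks, and this is not the authors' implementation.

```python
# Minimal sketch: frame-by-frame hemi-facial asymmetry via SSIM.
# Assumes the face crop is roughly frontal and centered; in practice the
# midline would be estimated from dlib facial landmarks as in the paper.
import cv2
import numpy as np
from skimage.metrics import structural_similarity as ssim

def hemifacial_ssim(face_bgr: np.ndarray) -> float:
    """Return SSIM between the left hemi-face and the mirrored right hemi-face."""
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape
    mid = w // 2
    left = gray[:, :mid]
    right_mirrored = cv2.flip(gray[:, w - mid:], 1)  # mirror about the vertical midline
    return ssim(left, right_mirrored)

def asymmetry_timeline(video_path: str) -> list[float]:
    """Progressive time graph of facial (a)symmetry: one SSIM score per frame."""
    cap = cv2.VideoCapture(video_path)
    scores = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        scores.append(hemifacial_ssim(frame))  # scores >= 0.75 read as behavioral congruence
    cap.release()
    return scores
```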

LinFlo-Net: A two-stage deep learning method to generate simulation ready meshes of the heart

  • paper_url: http://arxiv.org/abs/2310.20065
  • repo_url: None
  • paper_authors: Arjun Narayanan, Fanwei Kong, Shawn Shadden
  • for: automatic generation of computer models of the human heart from patient imaging data
  • methods: two-stage diffeomorphic deformation process with a novel loss function to minimize mesh self-penetration
  • results: meshes free of self-intersections, comparable accuracy with state-of-the-art methods, and ready for use in physics-based simulation without post-processing.
    Abstract We present a deep learning model to automatically generate computer models of the human heart from patient imaging data with an emphasis on its capability to generate thin-walled cardiac structures. Our method works by deforming a template mesh to fit the cardiac structures to the given image. Compared with prior deep learning methods that adopted this approach, our framework is designed to minimize mesh self-penetration, which typically arises when deforming surface meshes separated by small distances. We achieve this by using a two-stage diffeomorphic deformation process along with a novel loss function derived from the kinematics of motion that penalizes surface contact and interpenetration. Our model demonstrates comparable accuracy with state-of-the-art methods while additionally producing meshes free of self-intersections. The resultant meshes are readily usable in physics based simulation, minimizing the need for post-processing and cleanup.
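For illustration, the sketch below shows one generic way to penalize surface contact between two deformed meshes: a hinge loss on the distance from each vertex of one surface to the nearest vertex of the other. The actual LinFlo-Net loss is derived from the kinematics of motion and differs in form; the margin value and the `lambda_contact` weight are placeholders.

```python
# Illustrative sketch only: a generic hinge penalty that discourages two
# deformed surfaces from coming closer than a margin. The actual LinFlo-Net
# loss is derived from the kinematics of motion and differs in form.
import torch

def contact_penalty(verts_a: torch.Tensor, verts_b: torch.Tensor,
                    margin: float = 0.5) -> torch.Tensor:
    """verts_a: (Na, 3), verts_b: (Nb, 3) vertex positions of two surfaces.
    Penalizes pairs closer than `margin` (same units as the mesh)."""
    d = torch.cdist(verts_a, verts_b)           # (Na, Nb) pairwise distances
    nearest = d.min(dim=1).values               # distance from each a-vertex to surface b
    return torch.relu(margin - nearest).mean()  # zero once the surfaces are separated

# Usage (hypothetical names): add to the deformation objective, e.g.
#   loss = image_fitting_loss + lambda_contact * contact_penalty(lv_wall, rv_wall)
```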

A Scalable Training Strategy for Blind Multi-Distribution Noise Removal

  • paper_url: http://arxiv.org/abs/2310.20064
  • repo_url: None
  • paper_authors: Kevin Zhang, Sakshum Kulshrestha, Christopher Metzler
  • for: Proposes an adaptive-sampling/active-learning strategy for training universal denoising networks, improving the efficiency and performance of existing denoiser training.
  • methods: Uses a polynomial approximation of the true specification-loss landscape to guide sampling over the noise-specification space, reducing training time by almost two orders of magnitude.
  • results: With the proposed adaptive-sampling training strategy, a single blind, generalist denoiser network achieves peak signal-to-noise ratios within a uniform bound of specialized denoiser networks across a large range of operating conditions.
    Abstract Despite recent advances, developing general-purpose universal denoising and artifact-removal networks remains largely an open problem: Given fixed network weights, one inherently trades-off specialization at one task (e.g.,~removing Poisson noise) for performance at another (e.g.,~removing speckle noise). In addition, training such a network is challenging due to the curse of dimensionality: As one increases the dimensions of the specification-space (i.e.,~the number of parameters needed to describe the noise distribution) the number of unique specifications one needs to train for grows exponentially. Uniformly sampling this space will result in a network that does well at very challenging problem specifications but poorly at easy problem specifications, where even large errors will have a small effect on the overall mean squared error. In this work we propose training denoising networks using an adaptive-sampling/active-learning strategy. Our work improves upon a recently proposed universal denoiser training strategy by extending these results to higher dimensions and by incorporating a polynomial approximation of the true specification-loss landscape. This approximation allows us to reduce training times by almost two orders of magnitude. We test our method on simulated joint Poisson-Gaussian-Speckle noise and demonstrate that with our proposed training strategy, a single blind, generalist denoiser network can achieve peak signal-to-noise ratios within a uniform bound of specialized denoiser networks across a large range of operating conditions. We also capture a small dataset of images with varying amounts of joint Poisson-Gaussian-Speckle noise and demonstrate that a universal denoiser trained using our adaptive-sampling strategy outperforms uniformly trained baselines.
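A toy sketch of the adaptive-sampling idea follows: fit a polynomial surrogate to the observed (specification, loss) pairs and sample new training specifications where the predicted loss is high. It is shown in one dimension for clarity; the paper operates in higher-dimensional noise-specification spaces, and the sampling rule, polynomial degree, and temperature here are assumptions.

```python
# Toy sketch of adaptive sampling guided by a polynomial surrogate of the
# specification-loss landscape (1-D for clarity; the paper works in
# higher-dimensional noise-specification spaces and differs in detail).
import numpy as np

rng = np.random.default_rng(0)
specs = rng.uniform(0.0, 1.0, size=32)    # e.g. noise levels already evaluated
losses = rng.uniform(0.0, 1.0, size=32)   # measured denoiser loss at each spec (placeholder)

def propose_specs(specs, losses, n_new=8, degree=3, temperature=0.1):
    """Fit a polynomial to (spec, loss) pairs and sample new specs where the
    predicted loss is high, so hard specifications are trained more often."""
    coeffs = np.polyfit(specs, losses, deg=degree)
    grid = np.linspace(0.0, 1.0, 256)
    predicted = np.polyval(coeffs, grid)
    weights = np.exp(predicted / temperature)
    weights /= weights.sum()
    return rng.choice(grid, size=n_new, p=weights)

new_specs = propose_specs(specs, losses)
```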

SolarFormer: Multi-scale Transformer for Solar PV Profiling

  • paper_url: http://arxiv.org/abs/2310.20057
  • repo_url: None
  • paper_authors: Adrian de Luis, Minh Tran, Taisei Hanyu, Anh Tran, Liao Haitao, Roy McCann, Alan Mantooth, Ying Huang, Ngan Le
  • for: Addresses the mapping and segmentation of solar PV installations from aerial imagery to better understand their adoption and inform energy policy.
  • methods: Introduces the SolarFormer model, which combines a multi-scale Transformer encoder with a masked-attention Transformer decoder and an instance query mechanism to improve the localization of solar PV installations.
  • results: On multiple datasets, including GGE (France), IGN (France), and USGS (California, USA), SolarFormer matches or surpasses state-of-the-art models, demonstrating improved solar panel segmentation.
    Abstract As climate change intensifies, the global imperative to shift towards sustainable energy sources becomes more pronounced. Photovoltaic (PV) energy is a favored choice due to its reliability and ease of installation. Accurate mapping of PV installations is crucial for understanding their adoption and informing energy policy. To meet this need, we introduce the SolarFormer, designed to segment solar panels from aerial imagery, offering insights into their location and size. However, solar panel identification in Computer Vision is intricate due to various factors like weather conditions, roof conditions, and Ground Sampling Distance (GSD) variations. To tackle these complexities, we present the SolarFormer, featuring a multi-scale Transformer encoder and a masked-attention Transformer decoder. Our model leverages low-level features and incorporates an instance query mechanism to enhance the localization of solar PV installations. We rigorously evaluated our SolarFormer using diverse datasets, including GGE (France), IGN (France), and USGS (California, USA), across different GSDs. Our extensive experiments consistently demonstrate that our model either matches or surpasses state-of-the-art models, promising enhanced solar panel segmentation for global sustainable energy initiatives.
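The sketch below illustrates the instance-query idea in a masked-attention decoder: a set of learnable queries cross-attends to flattened encoder features, optionally restricted by an attention mask. Dimensions, the number of queries, and the mask construction are placeholders, not SolarFormer's actual configuration.

```python
# Minimal sketch of instance queries cross-attending to encoder features,
# in the spirit of a masked-attention Transformer decoder. Layer counts,
# dimensions, and mask construction are placeholders, not SolarFormer's.
import torch
import torch.nn as nn

class InstanceQueryDecoderLayer(nn.Module):
    def __init__(self, dim=256, heads=8, num_queries=100):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)   # one query per potential panel instance
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats, attn_mask=None):
        """feats: (B, HW, dim) flattened encoder features at one scale.
        attn_mask (optional): (B*heads, num_queries, HW) boolean mask that
        restricts each query to its current foreground region."""
        q = self.queries.weight.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, feats, feats, attn_mask=attn_mask)
        return self.norm(out + q)

# feats = torch.randn(2, 64 * 64, 256); queries = InstanceQueryDecoderLayer()(feats)
```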

Radiomics as a measure superior to the Dice similarity coefficient for tumor segmentation performance evaluation

  • paper_url: http://arxiv.org/abs/2310.20039
  • repo_url: None
  • paper_authors: Yoichi Watanabe, Rukhsora Akramova
  • for: This study aims to evaluate the segmentation ability of physicians and auto-segmentation tools in high-quality radiotherapy delivery by using Radiomics features as a superior measure compared to the widely used Dice Similarity Coefficient (DSC).
  • methods: The research involves selecting reproducible radiomics features for evaluating segmentation accuracy by analyzing radiomics data from 2 CT scans of 10 lung tumors, and using CT images from 10 patients, each segmented by different physicians or auto-segmentation tools, to assess segmentation performance.
  • results: The study reveals 206 radiomics features with a Concordance Correlation Coefficient (CCC) greater than 0.93 between the two CT images, indicating robust reproducibility. Seven features exhibit low Intraclass Correlation Coefficients (ICC), signifying increased sensitivity to segmentation differences. The findings suggest that Radiomics features, particularly those related to shape and energy, can capture subtle variations in tumor segmentation characteristics, unlike DSC.
    Abstract In high-quality radiotherapy delivery, precise segmentation of targets and healthy structures is essential. This study proposes Radiomics features as a superior measure for assessing the segmentation ability of physicians and auto-segmentation tools, in comparison to the widely used Dice Similarity Coefficient (DSC). The research involves selecting reproducible radiomics features for evaluating segmentation accuracy by analyzing radiomics data from 2 CT scans of 10 lung tumors, available in the RIDER Data Library. Radiomics features were extracted using PyRadiomics, with selection based on the Concordance Correlation Coefficient (CCC). Subsequently, CT images from 10 patients, each segmented by different physicians or auto-segmentation tools, were used to assess segmentation performance. The study reveals 206 radiomics features with a CCC greater than 0.93 between the two CT images, indicating robust reproducibility. Among these features, seven exhibit low Intraclass Correlation Coefficients (ICC), signifying increased sensitivity to segmentation differences. Notably, ICCs of original shape features, including sphericity, elongation, and flatness, ranged from 0.1177 to 0.995. In contrast, all DSC values exceeded 0.778. This research demonstrates that radiomics features, particularly those related to shape and energy, can capture subtle variations in tumor segmentation characteristics, unlike DSC. As a result, Radiomics features with ICC prove superior for evaluating a physician's tumor segmentation ability and the performance of auto-segmentation tools. The findings suggest that these new metrics can be employed to assess novel auto-segmentation methods and enhance the training of individuals in medical segmentation, thus contributing to improved radiotherapy practices.
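The reproducibility screen above is based on Lin's concordance correlation coefficient; a minimal implementation is sketched below, with the >0.93 retention threshold mentioned in the abstract noted in a comment.

```python
# Lin's concordance correlation coefficient (CCC), the reproducibility screen
# used above: features with CCC > 0.93 between the two CT scans are retained.
import numpy as np

def concordance_ccc(x: np.ndarray, y: np.ndarray) -> float:
    """CCC between paired measurements x and y (e.g. one radiomics feature
    extracted from the two repeat CT scans of the same tumors)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()             # population variances
    cov = ((x - mx) * (y - my)).mean()
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

# Example with a perfectly reproducible feature (CCC == 1):
x = np.array([1.0, 2.0, 3.0, 4.0])
print(concordance_ccc(x, x.copy()))
```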

Adaptive Anchor Label Propagation for Transductive Few-Shot Learning

  • paper_url: http://arxiv.org/abs/2310.19996
  • repo_url: https://github.com/michalislazarou/a2lp
  • paper_authors: Michalis Lazarou, Yannis Avrithis, Guangyu Ren, Tania Stathaki
  • for: Improve few-shot learning performance when only a limited amount of labeled data is available.
  • methods: Exploits unlabeled data through transductive inference methods such as label propagation, which infers pseudo-labels over a graph capturing the manifold structure of the data.
  • results: The proposed Adaptive Anchor Label Propagation algorithm outperforms standard label propagation by as much as 7% and 2% in the 1-shot and 5-shot settings, respectively. Experiments on four widely used few-shot benchmark datasets (miniImageNet, tieredImageNet, CUB, and CIFAR-FS) and two commonly used backbones (ResNet12 and WideResNet-28-10) demonstrate the merits of the algorithm; the code is available on GitHub.
    Abstract Few-shot learning addresses the issue of classifying images using limited labeled data. Exploiting unlabeled data through the use of transductive inference methods such as label propagation has been shown to improve the performance of few-shot learning significantly. Label propagation infers pseudo-labels for unlabeled data by utilizing a constructed graph that exploits the underlying manifold structure of the data. However, a limitation of the existing label propagation approaches is that the positions of all data points are fixed and might be sub-optimal so that the algorithm is not as effective as possible. In this work, we propose a novel algorithm that adapts the feature embeddings of the labeled data by minimizing a differentiable loss function optimizing their positions in the manifold in the process. Our novel algorithm, Adaptive Anchor Label Propagation}, outperforms the standard label propagation algorithm by as much as 7% and 2% in the 1-shot and 5-shot settings respectively. We provide experimental results highlighting the merits of our algorithm on four widely used few-shot benchmark datasets, namely miniImageNet, tieredImageNet, CUB and CIFAR-FS and two commonly used backbones, ResNet12 and WideResNet-28-10. The source code can be found at https://github.com/MichalisLazarou/A2LP.
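For context, the sketch below shows the standard graph-based label propagation step that A2LP builds on: build a kNN affinity graph over the features and solve the closed-form propagation. The adaptive-anchor part, which optimizes the positions of the labeled features with a differentiable loss, is omitted, and the hyperparameters (k, alpha, gamma) are placeholders.

```python
# Standard label propagation on a kNN affinity graph, the transductive step
# that A2LP builds on. The adaptive-anchor part (optimizing labeled feature
# positions with a differentiable loss) is omitted from this sketch.
import numpy as np

def label_propagation(features, labels, num_classes, k=20, alpha=0.8, gamma=3.0):
    """features: (N, d) array; labels: (N,) int array with -1 for unlabeled queries.
    Returns soft pseudo-labels of shape (N, num_classes)."""
    sims = features @ features.T
    W = np.power(np.clip(sims, 0, None), gamma)
    np.fill_diagonal(W, 0.0)
    # keep only the k strongest edges per node, then symmetrize
    drop = np.argsort(-W, axis=1)[:, k:]
    np.put_along_axis(W, drop, 0.0, axis=1)
    W = (W + W.T) / 2.0
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1) + 1e-8))
    S = D_inv_sqrt @ W @ D_inv_sqrt
    Y = np.zeros((len(labels), num_classes))
    Y[labels >= 0, labels[labels >= 0]] = 1.0   # one-hot for the labeled support set
    return np.linalg.solve(np.eye(len(labels)) - alpha * S, Y)
```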

Emotional Theory of Mind: Bridging Fast Visual Processing with Slow Linguistic Reasoning

  • paper_url: http://arxiv.org/abs/2310.19995
  • repo_url: None
  • paper_authors: Yasaman Etesam, Ozge Nilay Yalcin, Chuxuan Zhang, Angelica Lim
  • for: This work evaluates the emotional commonsense knowledge embedded in recent large vision-language models (CLIP, LLaVA) and large language models (GPT-3.5) on the Emotions in Context (EMOTIC) dataset.
  • methods: The authors construct "narrative captions" relevant to emotion perception, using 872 physical social signal descriptions related to 26 emotional categories and 224 labels for emotionally salient environmental contexts, and evaluate them in an image-to-language-to-emotion task.
  • results: Combining "fast" and "slow" reasoning is a promising way to improve emotion recognition systems; nevertheless, a gap remains on the zero-shot emotional theory of mind task compared to prior work trained on the EMOTIC dataset.
    Abstract The emotional theory of mind problem in images is an emotion recognition task, specifically asking "How does the person in the bounding box feel?" Facial expressions, body pose, contextual information and implicit commonsense knowledge all contribute to the difficulty of the task, making this task currently one of the hardest problems in affective computing. The goal of this work is to evaluate the emotional commonsense knowledge embedded in recent large vision language models (CLIP, LLaVA) and large language models (GPT-3.5) on the Emotions in Context (EMOTIC) dataset. In order to evaluate a purely text-based language model on images, we construct "narrative captions" relevant to emotion perception, using a set of 872 physical social signal descriptions related to 26 emotional categories, along with 224 labels for emotionally salient environmental contexts, sourced from writer's guides for character expressions and settings. We evaluate the use of the resulting captions in an image-to-language-to-emotion task. Experiments using zero-shot vision-language models on EMOTIC show that combining "fast" and "slow" reasoning is a promising way forward to improve emotion recognition systems. Nevertheless, a gap remains in the zero-shot emotional theory of mind task compared to prior work trained on the EMOTIC dataset.
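A minimal sketch of the "fast" visual pathway follows: zero-shot scoring of emotion-related narrative captions with CLIP cosine similarity. The Hugging Face checkpoint name, the two example captions, and the image path are placeholders; the paper maps 872 social-signal descriptions to 26 emotion categories.

```python
# Sketch of the "fast" visual pathway: zero-shot scoring of emotion categories
# with CLIP. The checkpoint, the two example captions, and the image path are
# placeholders; the paper builds 872 descriptions for 26 emotion categories.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = [
    "a person smiling with relaxed shoulders",        # e.g. mapped to 'happiness'
    "a person with furrowed brows and a clenched jaw", # e.g. mapped to 'anger'
]
image = Image.open("person_in_bounding_box.jpg")       # placeholder crop of the person

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```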

Addressing Weak Decision Boundaries in Image Classification by Leveraging Web Search and Generative Models

  • paper_url: http://arxiv.org/abs/2310.19986
  • repo_url: None
  • paper_authors: Preetam Prabhu Srikar Dammu, Yunhe Feng, Chirag Shah
  • for: This paper aims to address the issue of bias and discrimination in machine learning (ML) models, specifically in the context of image classification.
  • methods: The proposed approach leverages the power of web search and generative models to enhance robustness and mitigate bias in ML models. The method involves identifying weak decision boundaries for vulnerable populations, constructing search queries for Google, and generating new training samples through DALL-E 2 and Stable Diffusion.
  • results: The proposed method achieved a significant reduction (77.30%) in the model's gender accuracy disparity, and improved the classifier's decision boundary with fewer weakspots and increased separation between classes.
    Abstract Machine learning (ML) technologies are known to be riddled with ethical and operational problems, however, we are witnessing an increasing thrust by businesses to deploy them in sensitive applications. One major issue among many is that ML models do not perform equally well for underrepresented groups. This puts vulnerable populations in an even disadvantaged and unfavorable position. We propose an approach that leverages the power of web search and generative models to alleviate some of the shortcomings of discriminative models. We demonstrate our method on an image classification problem using ImageNet's People Subtree subset, and show that it is effective in enhancing robustness and mitigating bias in certain classes that represent vulnerable populations (e.g., female doctor of color). Our new method is able to (1) identify weak decision boundaries for such classes; (2) construct search queries for Google as well as text for generating images through DALL-E 2 and Stable Diffusion; and (3) show how these newly captured training samples could alleviate population bias issue. While still improving the model's overall performance considerably, we achieve a significant reduction (77.30\%) in the model's gender accuracy disparity. In addition to these improvements, we observed a notable enhancement in the classifier's decision boundary, as it is characterized by fewer weakspots and an increased separation between classes. Although we showcase our method on vulnerable populations in this study, the proposed technique is extendable to a wide range of problems and domains.
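The sketch below outlines the pipeline at a schematic level: flag classes with weak decision boundaries from validation accuracy, turn them into prompts, and synthesize additional training images. The accuracy threshold, prompt template, and the diffusers checkpoint are illustrative assumptions rather than the paper's exact choices.

```python
# Schematic sketch of the pipeline: (1) flag classes with weak decision
# boundaries from validation accuracy, (2) turn them into targeted prompts,
# (3) synthesize extra training images. Threshold, prompt template, and the
# diffusers checkpoint are illustrative choices, not the paper's.

def weak_classes(per_class_accuracy: dict[str, float], threshold: float = 0.7):
    return [c for c, acc in per_class_accuracy.items() if acc < threshold]

per_class_accuracy = {"female doctor of color": 0.55, "male doctor": 0.92}  # placeholder numbers
prompts = [f"a photo of a {c}" for c in weak_classes(per_class_accuracy)]

# Generation step (requires a GPU and the `diffusers` package):
# from diffusers import StableDiffusionPipeline
# pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# extra_images = [pipe(p).images[0] for p in prompts for _ in range(8)]
```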

‘Person’ == Light-skinned, Western Man, and Sexualization of Women of Color: Stereotypes in Stable Diffusion

  • paper_url: http://arxiv.org/abs/2310.19981
  • repo_url: None
  • paper_authors: Sourojit Ghosh, Aylin Caliskan
  • for: The paper examines the stereotypes embedded in Stable Diffusion, a popular text-to-image generator, and how it assigns gender and nationality/continental identity to individuals based on the absence of information.
  • methods: The paper uses vision-language model CLIP’s cosine similarity to compare images generated by CLIP-based Stable Diffusion v2.1 and manual examination to chronicle the results of front-facing images of persons from different continents, nationalities, and genders.
  • results: The paper finds that Stable Diffusion outputs of “a person” without any additional gender/nationality information correspond closest to images of men and least with persons of nonbinary gender, and that the output images are more likely to be of European/North American men rather than women or nonbinary individuals. The paper also observes continental stereotypes and the harmful erasure of Indigenous Oceanic peoples, as well as the oversexualization of women, specifically Latin American, Mexican, Indian, and Egyptian women.
    Abstract We study stereotypes embedded within one of the most popular text-to-image generators: Stable Diffusion. We examine what stereotypes of gender and nationality/continental identity does Stable Diffusion display in the absence of such information i.e. what gender and nationality/continental identity is assigned to `a person', or to `a person from Asia'. Using vision-language model CLIP's cosine similarity to compare images generated by CLIP-based Stable Diffusion v2.1 verified by manual examination, we chronicle results from 136 prompts (50 results/prompt) of front-facing images of persons from 6 different continents, 27 nationalities and 3 genders. We observe how Stable Diffusion outputs of `a person' without any additional gender/nationality information correspond closest to images of men and least with persons of nonbinary gender, and to persons from Europe/North America over Africa/Asia, pointing towards Stable Diffusion having a concerning representation of personhood to be a European/North American man. We also show continental stereotypes and resultant harms e.g. a person from Oceania is deemed to be Australian/New Zealander over Papua New Guinean, pointing to the erasure of Indigenous Oceanic peoples, who form a majority over descendants of colonizers both in Papua New Guinea and in Oceania overall. Finally, we unexpectedly observe a pattern of oversexualization of women, specifically Latin American, Mexican, Indian and Egyptian women relative to other nationalities, measured through an NSFW detector. This demonstrates how Stable Diffusion perpetuates Western fetishization of women of color through objectification in media, which if left unchecked will amplify this stereotypical representation. Image datasets are made publicly available.

Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks

  • paper_url: http://arxiv.org/abs/2310.19909
  • repo_url: https://github.com/hsouri/battle-of-the-backbones
  • paper_authors: Micah Goldblum, Hossein Souri, Renkun Ni, Manli Shu, Viraj Prabhu, Gowthami Somepalli, Prithvijit Chattopadhyay, Mark Ibrahim, Adrien Bardes, Judy Hoffman, Rama Chellappa, Andrew Gordon Wilson, Tom Goldstein
  • for: Provides a large-scale comparison of pretrained backbones for computer vision systems to help practitioners make informed choices about which backbone to use.
  • methods: Benchmarks a diverse suite of pretrained models, including vision transformers (ViTs) and self-supervised learning (SSL) models, across a wide range of tasks, with analysis of more than 1500 training runs.
  • results: Convolutional neural networks pretrained in a supervised fashion on large datasets still perform best on most tasks, including classification and object detection; moreover, in apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, SSL backbones are highly competitive.
    Abstract Neural network based computer vision systems are typically built on a backbone, a pretrained or randomly initialized feature extractor. Several years ago, the default option was an ImageNet-trained convolutional neural network. However, the recent past has seen the emergence of countless backbones pretrained using various algorithms and datasets. While this abundance of choice has led to performance increases for a range of systems, it is difficult for practitioners to make informed decisions about which backbone to choose. Battle of the Backbones (BoB) makes this choice easier by benchmarking a diverse suite of pretrained models, including vision-language models, those trained via self-supervised learning, and the Stable Diffusion backbone, across a diverse set of computer vision tasks ranging from classification to object detection to OOD generalization and more. Furthermore, BoB sheds light on promising directions for the research community to advance computer vision by illuminating strengths and weakness of existing approaches through a comprehensive analysis conducted on more than 1500 training runs. While vision transformers (ViTs) and self-supervised learning (SSL) are increasingly popular, we find that convolutional neural networks pretrained in a supervised fashion on large training sets still perform best on most tasks among the models we consider. Moreover, in apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, we find that SSL backbones are highly competitive, indicating that future works should perform SSL pretraining with advanced architectures and larger pretraining datasets. We release the raw results of our experiments along with code that allows researchers to put their own backbones through the gauntlet here: https://github.com/hsouri/Battle-of-the-Backbones
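As a minimal illustration of the benchmarking setup, the sketch below swaps different pretrained backbones behind the same frozen-feature linear head using timm; the two model names are only examples of the supervised CNN vs. ViT comparison discussed above.

```python
# Minimal illustration of swapping pretrained backbones behind a fixed
# downstream head, the setup BoB benchmarks at scale. Model names are just
# examples of the kinds of backbones compared (supervised CNN vs. ViT).
import timm
import torch
import torch.nn as nn

def linear_probe(backbone_name: str, num_classes: int = 10) -> nn.Module:
    backbone = timm.create_model(backbone_name, pretrained=True, num_classes=0)
    for p in backbone.parameters():           # freeze the pretrained features
        p.requires_grad = False
    head = nn.Linear(backbone.num_features, num_classes)
    return nn.Sequential(backbone, head)

for name in ["resnet50", "vit_base_patch16_224"]:
    model = linear_probe(name)
    logits = model(torch.randn(2, 3, 224, 224))
    print(name, logits.shape)
```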

MIST: Medical Image Segmentation Transformer with Convolutional Attention Mixing (CAM) Decoder

  • paper_url: http://arxiv.org/abs/2310.19898
  • repo_url: https://github.com/rahman-motiur/mist
  • paper_authors: Md Motiur Rahman, Shiva Shokouhmand, Smriti Bhatt, Miad Faezipour
  • for: Proposes a transformer model suited to medical image segmentation, addressing the limitation of conventional transformers in capturing local pixel context across multiple modal dimensions.
  • methods: Proposes the Medical Image Segmentation Transformer (MIST), consisting of a pretrained multi-axis vision transformer (MaxViT) encoder and a novel Convolutional Attention Mixing (CAM) decoder. The CAM decoder combines multi-head self-attention, spatial attention, and squeeze-and-excitation attention modules to capture long-range dependencies in all spatial dimensions.
  • results: Experiments show that MIST with the CAM decoder outperforms state-of-the-art segmentation models on the ACDC and Synapse datasets; fusing low-level and high-level features from different network stages further improves segmentation performance.
    Abstract One of the common and promising deep learning approaches used for medical image segmentation is transformers, as they can capture long-range dependencies among the pixels by utilizing self-attention. Despite being successful in medical image segmentation, transformers face limitations in capturing local contexts of pixels in multimodal dimensions. We propose a Medical Image Segmentation Transformer (MIST) incorporating a novel Convolutional Attention Mixing (CAM) decoder to address this issue. MIST has two parts: a pre-trained multi-axis vision transformer (MaxViT) is used as an encoder, and the encoded feature representation is passed through the CAM decoder for segmenting the images. In the CAM decoder, an attention-mixer combining multi-head self-attention, spatial attention, and squeeze and excitation attention modules is introduced to capture long-range dependencies in all spatial dimensions. Moreover, to enhance spatial information gain, deep and shallow convolutions are used for feature extraction and receptive field expansion, respectively. The integration of low-level and high-level features from different network stages is enabled by skip connections, allowing MIST to suppress unnecessary information. The experiments show that our MIST transformer with CAM decoder outperforms the state-of-the-art models specifically designed for medical image segmentation on the ACDC and Synapse datasets. Our results also demonstrate that adding the CAM decoder with a hierarchical transformer improves segmentation performance significantly. Our model with data and code is publicly available on GitHub.
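For reference, the sketch below implements two of the attention modules mixed in a CAM-style decoder, squeeze-and-excitation (channel) attention and spatial attention, in PyTorch. The multi-head self-attention branch and the exact mixing and wiring used by MIST are omitted.

```python
# Compact sketch of two of the attention modules mixed in the CAM decoder:
# squeeze-and-excitation (channel) attention and spatial attention. The
# multi-head self-attention branch and the exact mixing/wiring are omitted.
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                          # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))            # squeeze -> (B, C)
        return x * w[:, :, None, None]             # excite: reweight channels

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

x = torch.randn(1, 64, 32, 32)
y = SpatialAttention()(SqueezeExcitation(64)(x))
```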

DiffEnc: Variational Diffusion with a Learned Encoder

  • paper_url: http://arxiv.org/abs/2310.19789
  • repo_url: None
  • paper_authors: Beatrix M. G. Nielsen, Anders Christensen, Andrea Dittadi, Ole Winther
  • for: Views diffusion models as hierarchical variational autoencoders (VAEs) with two advantages: parameter sharing for the conditional distributions in the generative process and efficient computation of the loss as independent terms over the hierarchy; the goal is to add flexibility while retaining these advantages.
  • methods: Proposes two changes to the diffusion model; first, a data- and depth-dependent mean function in the diffusion process, which leads to a modified diffusion loss.
  • results: The proposed DiffEnc framework achieves state-of-the-art likelihood on CIFAR-10. In addition, letting the noise-variance ratio between the reverse encoder process and the generative process be a free weight parameter yields a weighted diffusion loss that can be used to optimize the noise schedule for inference.
    Abstract Diffusion models may be viewed as hierarchical variational autoencoders (VAEs) with two improvements: parameter sharing for the conditional distributions in the generative process and efficient computation of the loss as independent terms over the hierarchy. We consider two changes to the diffusion model that retain these advantages while adding flexibility to the model. Firstly, we introduce a data- and depth-dependent mean function in the diffusion process, which leads to a modified diffusion loss. Our proposed framework, DiffEnc, achieves state-of-the-art likelihood on CIFAR-10. Secondly, we let the ratio of the noise variance of the reverse encoder process and the generative process be a free weight parameter rather than being fixed to 1. This leads to theoretical insights: For a finite depth hierarchy, the evidence lower bound (ELBO) can be used as an objective for a weighted diffusion loss approach and for optimizing the noise schedule specifically for inference. For the infinite-depth hierarchy, on the other hand, the weight parameter has to be 1 to have a well-defined ELBO.
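As a generic illustration of a weighted diffusion loss, the sketch below applies a per-timestep weight w(t) to a standard DDPM-style noise-prediction objective. DiffEnc's actual objective involves a learned data- and depth-dependent encoder mean and a free noise-variance ratio, so this is only the surrounding idea, not the paper's loss.

```python
# Generic sketch of a per-timestep-weighted diffusion (noise-prediction) loss
# in a standard DDPM-style parameterization. DiffEnc's actual objective uses a
# learned data- and depth-dependent encoder mean and differs from this form.
import torch

def weighted_diffusion_loss(model, x0, alphas_cumprod, weights):
    """x0: (B, ...) clean data; alphas_cumprod, weights: (T,) per-timestep tables."""
    B = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=x0.device)
    a_bar = alphas_cumprod[t].view(B, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward diffusion
    eps_hat = model(x_t, t)                                # predict the injected noise
    per_sample = ((eps_hat - noise) ** 2).flatten(1).mean(dim=1)
    return (weights[t] * per_sample).mean()                # w(t) re-weights timesteps
```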

MM-VID: Advancing Video Understanding with GPT-4V(ision)

  • paper_url: http://arxiv.org/abs/2310.19773
  • repo_url: None
  • paper_authors: Kevin Lin, Faisal Ahmed, Linjie Li, Chung-Ching Lin, Ehsan Azarnasab, Zhengyuan Yang, Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, Ce Liu, Lijuan Wang
  • for: MM-VID is designed to facilitate advanced video understanding, particularly for long-form videos and intricate tasks such as reasoning within hour-long content and grasping storylines spanning multiple episodes.
  • methods: MM-VID uses video-to-script generation with GPT-4V to transcribe multimodal elements into a long textual script, enabling advanced capabilities such as audio description, character identification, and multimodal high-level comprehension.
  • results: Experimental results demonstrate the effectiveness of MM-VID in handling distinct video genres with various video lengths, and its potential when applied to interactive environments such as video games and graphic user interfaces.
    Abstract We present MM-VID, an integrated system that harnesses the capabilities of GPT-4V, combined with specialized tools in vision, audio, and speech, to facilitate advanced video understanding. MM-VID is designed to address the challenges posed by long-form videos and intricate tasks such as reasoning within hour-long content and grasping storylines spanning multiple episodes. MM-VID uses a video-to-script generation with GPT-4V to transcribe multimodal elements into a long textual script. The generated script details character movements, actions, expressions, and dialogues, paving the way for large language models (LLMs) to achieve video understanding. This enables advanced capabilities, including audio description, character identification, and multimodal high-level comprehension. Experimental results demonstrate the effectiveness of MM-VID in handling distinct video genres with various video lengths. Additionally, we showcase its potential when applied to interactive environments, such as video games and graphic user interfaces.
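A sketch of the first step of such a video-to-script pipeline is shown below: sample frames with OpenCV and encode them for a vision-capable chat model. The sampling interval is a placeholder, and the GPT-4V call itself is only indicated in a comment.

```python
# Sketch of the first step of a video-to-script pipeline: sample frames with
# OpenCV so they can be transcribed by a vision-language model. The sampling
# interval is a placeholder, and the GPT-4V call itself is only indicated.
import base64
import cv2

def sample_frames(video_path: str, every_n_seconds: float = 2.0) -> list[str]:
    """Return base64-encoded JPEG frames sampled at a fixed interval."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(round(fps * every_n_seconds)))
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buf.tobytes()).decode("ascii"))
        i += 1
    cap.release()
    return frames

# The sampled frames (plus ASR transcripts, etc.) would then be sent to a
# vision-capable chat model to produce the long textual script MM-VID reasons over.
```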

Intra-Modal Proxy Learning for Zero-Shot Visual Categorization with CLIP

  • paper_url: http://arxiv.org/abs/2310.19752
  • repo_url: https://github.com/idstcv/inmap
  • paper_authors: Qi Qian, Yuanhong Xu, Juhua Hu
  • for: Proposes a method that learns a vision proxy with the help of the text proxy, enabling zero-shot transfer without any labeled target vision data.
  • methods: Builds on CLIP's vision-language pre-training and learns the vision proxy directly from unlabeled target data with guidance from the text proxy; a theoretical analysis shows that the modality gap between the vision and text spaces cannot be sufficiently reduced by minimizing the contrastive loss.
  • results: The vision proxy can be learned within one minute on a single GPU, improving zero-shot accuracy on ImageNet from 77.02% to 80.21% with CLIP-pretrained ViT-L/14@336.
    Abstract Vision-language pre-training methods, e.g., CLIP, demonstrate an impressive zero-shot performance on visual categorizations with the class proxy from the text embedding of the class name. However, the modality gap between the text and vision space can result in a sub-optimal performance. We theoretically show that the gap cannot be reduced sufficiently by minimizing the contrastive loss in CLIP and the optimal proxy for vision tasks may reside only in the vision space. Therefore, given unlabeled target vision data, we propose to learn the vision proxy directly with the help from the text proxy for zero-shot transfer. Moreover, according to our theoretical analysis, strategies are developed to further refine the pseudo label obtained by the text proxy to facilitate the intra-modal proxy learning (InMaP) for vision. Experiments on extensive downstream tasks confirm the effectiveness and efficiency of our proposal. Concretely, InMaP can obtain the vision proxy within one minute on a single GPU while improving the zero-shot accuracy from $77.02\%$ to $80.21\%$ on ImageNet with ViT-L/14@336 pre-trained by CLIP. Code is available at \url{https://github.com/idstcv/InMaP}.
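The sketch below is a schematic of intra-modal proxy learning: initialize the class proxies from the text encoder, pseudo-label unlabeled image features by cosine similarity, and re-estimate the proxies inside the vision space. It illustrates the idea only; the exact InMaP update and its pseudo-label refinement strategy differ.

```python
# Schematic of intra-modal proxy learning: start from the text proxies,
# pseudo-label unlabeled image features by cosine similarity, then refine the
# proxies inside the vision space. This illustrates the idea, not the exact
# InMaP update rule or its pseudo-label refinement strategy.
import torch
import torch.nn.functional as F

def learn_vision_proxies(image_feats, text_proxies, iters=10, temperature=0.01):
    """image_feats: (N, d) L2-normalized CLIP image features (unlabeled target data).
    text_proxies: (C, d) L2-normalized class proxies from the text encoder."""
    proxies = text_proxies.clone()
    for _ in range(iters):
        logits = image_feats @ proxies.t() / temperature
        soft_labels = logits.softmax(dim=-1)       # (N, C) pseudo-labels
        proxies = soft_labels.t() @ image_feats    # re-estimate proxies in the vision space
        proxies = F.normalize(proxies, dim=-1)
    return proxies

# Zero-shot prediction then uses the learned vision proxies instead of the text ones:
# preds = (image_feats @ proxies.t()).argmax(dim=-1)
```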

Tell Me What Is Good About This Property: Leveraging Reviews For Segment-Personalized Image Collection Summarization

  • paper_url: http://arxiv.org/abs/2310.19743
  • repo_url: None
  • paper_authors: Monika Wysoczanska, Moran Beladev, Karen Lastmann Assaraf, Fengjun Wang, Ofri Kleinfeld, Gil Amsalem, Hadas Harush Boker
  • for: This work aims to improve image collection summarization at Booking.com so that property visual summaries better match users' needs and preferences.
  • methods: The method analyzes property reviews to extract the most significant aspects mentioned by users and incorporates these insights into the visual summaries, improving their relevance without the need for costly annotations.
  • results: Experiments, including human perceptual studies, show that review-driven summaries are more relevant and useful; users prefer the proposed cross-modal approach (CrossSummarizer) over the no-personalization and image-based clustering baselines.
    Abstract Image collection summarization techniques aim to present a compact representation of an image gallery through a carefully selected subset of images that captures its semantic content. When it comes to web content, however, the ideal selection can vary based on the user's specific intentions and preferences. This is particularly relevant at Booking.com, where presenting properties and their visual summaries that align with users' expectations is crucial. To address this challenge, we consider user intentions in the summarization of property visuals by analyzing property reviews and extracting the most significant aspects mentioned by users. By incorporating the insights from reviews in our visual summaries, we enhance the summaries by presenting the relevant content to a user. Moreover, we achieve it without the need for costly annotations. Our experiments, including human perceptual studies, demonstrate the superiority of our cross-modal approach, which we coin as CrossSummarizer over the no-personalization and image-based clustering baselines.
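As a simplified stand-in for review-personalized summarization, the sketch below greedily selects gallery images that best cover review aspects, given precomputed (e.g. vision-language) embeddings for both. The coverage objective and greedy rule are assumptions, not the CrossSummarizer model.

```python
# Simplified stand-in for review-personalized summarization: given precomputed
# embeddings of review aspects and of gallery images (e.g. from a vision-language
# model), greedily select images that best cover the aspects mentioned in reviews.
import numpy as np

def summarize(image_embs: np.ndarray, aspect_embs: np.ndarray, k: int = 5):
    """image_embs: (N, d), aspect_embs: (A, d), both L2-normalized. Returns image indices."""
    sims = image_embs @ aspect_embs.T             # (N, A) image-to-aspect similarity
    chosen, covered = [], np.zeros(aspect_embs.shape[0])
    for _ in range(k):
        gain = np.maximum(sims, covered).sum(axis=1) - covered.sum()  # marginal coverage
        gain[chosen] = -np.inf                    # do not pick the same image twice
        best = int(gain.argmax())
        chosen.append(best)
        covered = np.maximum(covered, sims[best])
    return chosen
```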

Promise: Prompt-driven 3D Medical Image Segmentation Using Pretrained Image Foundation Models

  • paper_url: http://arxiv.org/abs/2310.19721
  • repo_url: None
  • paper_authors: Hao Li, Han Liu, Dewei Hu, Jiacheng Wang, Ipek Oguz
  • for: Addresses data acquisition and label availability challenges in medical image segmentation
  • methods: Transfers knowledge from the natural image domain by using a single point prompt to drive a pretrained 2D image foundation model (the vision transformer from the Segment Anything Model), with lightweight adapters that extract depth-related (3D) spatial context without updating the pretrained weights
  • results: On two public datasets for colon and pancreas tumor segmentation, the proposed method outperforms state-of-the-art segmentation methods
    Abstract To address prevalent issues in medical imaging, such as data acquisition challenges and label availability, transfer learning from natural to medical image domains serves as a viable strategy to produce reliable segmentation results. However, several existing barriers between domains need to be broken down, including addressing contrast discrepancies, managing anatomical variability, and adapting 2D pretrained models for 3D segmentation tasks. In this paper, we propose ProMISe,a prompt-driven 3D medical image segmentation model using only a single point prompt to leverage knowledge from a pretrained 2D image foundation model. In particular, we use the pretrained vision transformer from the Segment Anything Model (SAM) and integrate lightweight adapters to extract depth-related (3D) spatial context without updating the pretrained weights. For robust results, a hybrid network with complementary encoders is designed, and a boundary-aware loss is proposed to achieve precise boundaries. We evaluate our model on two public datasets for colon and pancreas tumor segmentations, respectively. Compared to the state-of-the-art segmentation methods with and without prompt engineering, our proposed method achieves superior performance. The code is publicly available at https://github.com/MedICL-VU/ProMISe.
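The "lightweight adapter" idea can be sketched as a small bottleneck module added around a frozen pretrained block, as below; only the adapter is trained. ProMISe's adapters additionally inject depth-related (3D) context into the SAM encoder, so this is a generic illustration rather than the paper's design.

```python
# Generic bottleneck adapter around a frozen transformer block, illustrating
# "lightweight adapters without updating the pretrained weights". ProMISe's
# adapters additionally inject depth-related (3D) context and differ in design.
import torch
import torch.nn as nn

class AdaptedBlock(nn.Module):
    def __init__(self, frozen_block: nn.Module, dim: int, bottleneck: int = 64):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False            # pretrained weights stay fixed
        self.adapter = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))

    def forward(self, x):                      # x: (B, tokens, dim)
        y = self.block(x)
        return y + self.adapter(y)             # only the adapter is trained

# Example with a stand-in "pretrained" block:
block = AdaptedBlock(nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), dim=256)
out = block(torch.randn(2, 100, 256))
```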

Deep-learning-based decomposition of overlapping-sparse images: application at the vertex of neutrino interactions

  • paper_url: http://arxiv.org/abs/2310.19695
  • repo_url: None
  • paper_authors: Saúl Alonso-Monsalve, Davide Sgalaberna, Xingyu Zhao, Adrien Molines, Clark McGrew, André Rubbia
  • for: Addresses a complex problem in high-energy physics: decomposing the overlapping, sparse detector images at the vertex of neutrino interactions in order to recover the parameters of low-momentum particles.
  • methods: Uses deep learning to decompose multi-dimensional overlapping-sparse images obtained from imaging detectors, extracting the kinematic parameters of the individual low-momentum particles.
  • results: The method accurately recovers the kinematic parameters of low-momentum particles, enhancing the reconstructed energy resolution of neutrino events; combining it with a fully-differentiable generative model further improves the image decomposition, achieving unprecedented results.
    Abstract Image decomposition plays a crucial role in various computer vision tasks, enabling the analysis and manipulation of visual content at a fundamental level. Overlapping images, which occur when multiple objects or scenes partially occlude each other, pose unique challenges for decomposition algorithms. The task intensifies when working with sparse images, where the scarcity of meaningful information complicates the precise extraction of components. This paper presents a solution that leverages the power of deep learning to accurately extract individual objects within multi-dimensional overlapping-sparse images, with a direct application in high-energy physics with decomposition of overlaid elementary particles obtained from imaging detectors. In particular, the proposed approach tackles a highly complex yet unsolved problem: identifying and measuring independent particles at the vertex of neutrino interactions, where one expects to observe detector images with multiple indiscernible overlapping charged particles. By decomposing the image of the detector activity at the vertex through deep learning, it is possible to infer the kinematic parameters of the identified low-momentum particles - which otherwise would remain neglected - and enhance the reconstructed energy resolution of the neutrino event. We also present an additional step - that can be tuned directly on detector data - combining the above method with a fully-differentiable generative model to improve the image decomposition further and, consequently, the resolution of the measured parameters, achieving unprecedented results. This improvement is crucial for precisely measuring the parameters that govern neutrino flavour oscillations and searching for asymmetries between matter and antimatter.

A Principled Hierarchical Deep Learning Approach to Joint Image Compression and Classification

  • paper_url: http://arxiv.org/abs/2310.19675
  • repo_url: None
  • paper_authors: Siyu Qi, Achintha Wijesinghe, Lahiru D. Chamain, Zhi Ding
  • for: Optimizes deep learning models for remote image classification over physical channels with limited rate/capacity.
  • methods: A three-step joint learning strategy guides the encoder to extract features that are compact, discriminative, and amenable to common augmentations/transformations; the latent dimension is optimized in an initial screening phase before end-to-end training, and an adjustable bit rate is obtained via entropy-based quantization and/or manual truncation of the latent representations.
  • results: The proposed method achieves accuracy improvements of up to 1.5% on CIFAR-10 and 3% on CIFAR-100 over conventional end-to-end cross-entropy training.
    Abstract Among applications of deep learning (DL) involving low cost sensors, remote image classification involves a physical channel that separates edge sensors and cloud classifiers. Traditional DL models must be divided between an encoder for the sensor and the decoder + classifier at the edge server. An important challenge is to effectively train such distributed models when the connecting channels have limited rate/capacity. Our goal is to optimize DL models such that the encoder latent requires low channel bandwidth while still delivers feature information for high classification accuracy. This work proposes a three-step joint learning strategy to guide encoders to extract features that are compact, discriminative, and amenable to common augmentations/transformations. We optimize latent dimension through an initial screening phase before end-to-end (E2E) training. To obtain an adjustable bit rate via a single pre-deployed encoder, we apply entropy-based quantization and/or manual truncation on the latent representations. Tests show that our proposed method achieves accuracy improvement of up to 1.5% on CIFAR-10 and 3% on CIFAR-100 over conventional E2E cross-entropy training.
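A sketch of how a single deployed encoder can serve several bit rates follows: truncate the latent to its leading dimensions and quantize those uniformly. The level count and truncation policy here are placeholders; the paper selects them with an entropy-based criterion and an initial screening phase.

```python
# Sketch of how a single deployed encoder can serve multiple bit rates:
# quantize the latent and/or truncate its trailing dimensions before
# transmission. The level count and truncation policy here are placeholders;
# the paper selects them via an entropy-based criterion.
import torch

def compress_latent(z: torch.Tensor, keep_dims: int, n_levels: int = 16):
    """z: (B, D) encoder latent. Keep the first `keep_dims` dimensions and
    quantize them to `n_levels` uniform levels per dimension."""
    z = z[:, :keep_dims]
    lo = z.min(dim=0, keepdim=True).values
    hi = z.max(dim=0, keepdim=True).values
    scale = (hi - lo).clamp_min(1e-8) / (n_levels - 1)
    q = torch.round((z - lo) / scale)                  # integer symbols to transmit
    bits_per_sample = keep_dims * torch.log2(torch.tensor(float(n_levels)))
    return q * scale + lo, bits_per_sample             # dequantized latent for the classifier

z_hat, bits = compress_latent(torch.randn(8, 128), keep_dims=32, n_levels=8)
```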

DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization

  • paper_url: http://arxiv.org/abs/2310.19668
  • repo_url: None
  • paper_authors: Guowei Xu, Ruijie Zheng, Yongyuan Liang, Xiyao Wang, Zhecheng Yuan, Tianying Ji, Yu Luo, Xiaoyu Liu, Jiaxin Yuan, Pu Hua, Shuzhen Li, Yanjie Ze, Hal Daumé III, Furong Huang, Huazhe Xu
  • for: Improve the sample efficiency and robustness of visual reinforcement learning for continuous control tasks.
  • methods: Uses three core mechanisms to guide the agent's exploration-exploitation trade-off by actively minimizing the dormant ratio, which measures the level of inactivity in the agent's policy network.
  • results: Experiments show that DrM achieves significant improvements in sample efficiency and asymptotic performance with no broken seeds (76 seeds in total) across three continuous control benchmark environments (DeepMind Control Suite, MetaWorld, and Adroit), and it solves the Dog and Manipulator tasks from the DeepMind Control Suite as well as three dexterous hand manipulation tasks in Adroit without demonstrations.
    Abstract Visual reinforcement learning (RL) has shown promise in continuous control tasks. Despite its progress, current algorithms are still unsatisfactory in virtually every aspect of the performance such as sample efficiency, asymptotic performance, and their robustness to the choice of random seeds. In this paper, we identify a major shortcoming in existing visual RL methods that is the agents often exhibit sustained inactivity during early training, thereby limiting their ability to explore effectively. Expanding upon this crucial observation, we additionally unveil a significant correlation between the agents' inclination towards motorically inactive exploration and the absence of neuronal activity within their policy networks. To quantify this inactivity, we adopt dormant ratio as a metric to measure inactivity in the RL agent's network. Empirically, we also recognize that the dormant ratio can act as a standalone indicator of an agent's activity level, regardless of the received reward signals. Leveraging the aforementioned insights, we introduce DrM, a method that uses three core mechanisms to guide agents' exploration-exploitation trade-offs by actively minimizing the dormant ratio. Experiments demonstrate that DrM achieves significant improvements in sample efficiency and asymptotic performance with no broken seeds (76 seeds in total) across three continuous control benchmark environments, including DeepMind Control Suite, MetaWorld, and Adroit. Most importantly, DrM is the first model-free algorithm that consistently solves tasks in both the Dog and Manipulator domains from the DeepMind Control Suite as well as three dexterous hand manipulation tasks without demonstrations in Adroit, all based on pixel observations.
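The dormant ratio that DrM minimizes can be computed per layer as below: the fraction of units whose normalized average activation falls under a threshold. The threshold value is a common choice from the dormant-neuron literature and is an assumption here, not necessarily DrM's setting.

```python
# Dormant ratio in the sense DrM minimizes: the fraction of units in a layer
# whose normalized average activation falls below a threshold (tau is a
# typical choice from the dormant-neuron literature, not necessarily DrM's).
import torch

def dormant_ratio(activations: torch.Tensor, tau: float = 0.025) -> float:
    """activations: (batch, units) post-activation outputs of one layer."""
    per_unit = activations.abs().mean(dim=0)            # average |activation| per unit
    score = per_unit / (per_unit.mean() + 1e-8)         # normalize by the layer mean
    return (score <= tau).float().mean().item()         # fraction of dormant units

# e.g. track it over training for every layer of the policy network and use it
# both as an activity indicator and as a signal to drive exploration.
```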

Domain Generalization in Computational Pathology: Survey and Guidelines

  • paper_url: http://arxiv.org/abs/2310.19656
  • repo_url: None
  • paper_authors: Mostafa Jahanifar, Manahil Raza, Kesi Xu, Trinh Vuong, Rob Jewsbury, Adam Shephard, Neda Zamanitajeddin, Jin Tae Kwak, Shan E Ahmed Raza, Fayyaz Minhas, Nasir Rajpoot
  • for: This survey examines domain shift (DS) and domain generalization (DG) in computational pathology (CPath) and the solutions available for achieving DG.
  • methods: Systematically reviews and categorizes the types of domain shift and the existing DG approaches and resources in CPath, discussing their advantages, limitations, and applicability; the authors also benchmark 28 cutting-edge DG algorithms on a complex DG problem.
  • results: Careful experiment design and a CPath-specific stain augmentation technique can be very effective; however, there is no one-size-fits-all solution, so the paper establishes clear guidelines for detecting and managing DS in different scenarios. Most of the concepts, guidelines, and recommendations also apply to other medical image analysis tasks.
    Abstract Deep learning models have exhibited exceptional effectiveness in Computational Pathology (CPath) by tackling intricate tasks across an array of histology image analysis applications. Nevertheless, the presence of out-of-distribution data (stemming from a multitude of sources such as disparate imaging devices and diverse tissue preparation methods) can cause \emph{domain shift} (DS). DS decreases the generalization of trained models to unseen datasets with slightly different data distributions, prompting the need for innovative \emph{domain generalization} (DG) solutions. Recognizing the potential of DG methods to significantly influence diagnostic and prognostic models in cancer studies and clinical practice, we present this survey along with guidelines on achieving DG in CPath. We rigorously define various DS types, systematically review and categorize existing DG approaches and resources in CPath, and provide insights into their advantages, limitations, and applicability. We also conduct thorough benchmarking experiments with 28 cutting-edge DG algorithms to address a complex DG problem. Our findings suggest that careful experiment design and CPath-specific Stain Augmentation technique can be very effective. However, there is no one-size-fits-all solution for DG in CPath. Therefore, we establish clear guidelines for detecting and managing DS depending on different scenarios. While most of the concepts, guidelines, and recommendations are given for applications in CPath, we believe that they are applicable to most medical image analysis tasks as well.
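As an example of a CPath-specific stain augmentation, the sketch below jitters the H&E stain channels after color deconvolution with scikit-image. This is one common form of stain augmentation and is shown for illustration; it is not necessarily the exact variant benchmarked in the survey.

```python
# Minimal H&E stain augmentation: deconvolve to HED space with scikit-image,
# jitter the stain channels, and convert back. One common form of stain
# augmentation, not necessarily the exact variant benchmarked in the survey.
import numpy as np
from skimage.color import rgb2hed, hed2rgb

def stain_jitter(rgb: np.ndarray, sigma: float = 0.05, rng=np.random.default_rng()):
    """rgb: float image in [0, 1], shape (H, W, 3)."""
    hed = rgb2hed(rgb)
    alpha = rng.uniform(1 - sigma, 1 + sigma, size=3)   # per-stain multiplicative jitter
    beta = rng.uniform(-sigma, sigma, size=3)           # per-stain additive shift
    hed_aug = hed * alpha + beta
    return np.clip(hed2rgb(hed_aug), 0.0, 1.0)
```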

Upgrading VAE Training With Unlimited Data Plans Provided by Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.19653
  • repo_url: None
  • paper_authors: Tim Z. Xiao, Johannes Zenn, Robert Bamler
  • for: This paper aims to mitigate overfitting in variational autoencoders (VAEs) by training on samples from a pre-trained diffusion model.
  • methods: The paper proposes training VAEs on samples from a pre-trained diffusion model to improve their representation learning and mitigate overfitting.
  • results: The paper finds that training VAEs on samples from a pre-trained diffusion model leads to improvements in generative performance, amortization gap, and robustness compared to normal training and conventional data augmentation methods.
    Abstract Variational autoencoders (VAEs) are popular models for representation learning but their encoders are susceptible to overfitting (Cremer et al., 2018) because they are trained on a finite training set instead of the true (continuous) data distribution $p_{\mathrm{data}}(\mathbf{x})$. Diffusion models, on the other hand, avoid this issue by keeping the encoder fixed. This makes their representations less interpretable, but it simplifies training, enabling accurate and continuous approximations of $p_{\mathrm{data}}(\mathbf{x})$. In this paper, we show that overfitting encoders in VAEs can be effectively mitigated by training on samples from a pre-trained diffusion model. These results are somewhat unexpected as recent findings (Alemohammad et al., 2023; Shumailov et al., 2023) observe a decay in generative performance when models are trained on data generated by another generative model. We analyze generalization performance, amortization gap, and robustness of VAEs trained with our proposed method on three different data sets. We find improvements in all metrics compared to both normal training and conventional data augmentation methods, and we show that a modest amount of samples from the diffusion model suffices to obtain these gains.
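The training recipe above can be sketched as a VAE loop whose data loader is replaced by samples drawn from a pre-trained diffusion model. `diffusion.sample` is a placeholder for whatever sampler the generator exposes, and `vae` is assumed to return (reconstruction, mu, logvar).

```python
# Sketch of the training recipe above: replace the finite training set with
# (effectively unlimited) samples drawn from a pre-trained diffusion model.
# `diffusion.sample` is a placeholder for whatever sampler the generator
# exposes; `vae` is assumed to return (reconstruction, mu, logvar).
import torch
import torch.nn.functional as F

def train_vae_on_diffusion_samples(vae, diffusion, optimizer, steps=1000, batch_size=64):
    for _ in range(steps):
        with torch.no_grad():
            x = diffusion.sample(batch_size)            # fresh synthetic batch each step
        recon, mu, logvar = vae(x)
        rec = F.mse_loss(recon, x, reduction="mean")
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        loss = rec + kl                                 # standard (negative) VAE ELBO
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```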

DistNet2D: Leveraging long-range temporal information for efficient segmentation and tracking

  • paper_url: http://arxiv.org/abs/2310.19641
  • repo_url: None
  • paper_authors: Jean Ollion, Martin Maliet, Caroline Giuglaris, Elise Vacher, Maxime Deforet
  • for: proposes a new deep neural network architecture to improve the accuracy of cell segmentation and tracking in videomicroscopy.
  • methods: introduces DistNet2D, an architecture that leverages both mid- and long-term temporal context to improve segmentation and tracking accuracy, together with a post-processing procedure that exploits information from the entire movie to correct segmentation errors.
  • results: DistNet2D outperforms two recent methods on two experimental datasets and enables correlating cell size and shape with transport properties over large statistics.
    Abstract Extracting long tracks and lineages from videomicroscopy requires an extremely low error rate, which is challenging on complex datasets of dense or deforming cells. Leveraging temporal context is key to overcome this challenge. We propose DistNet2D, a new deep neural network (DNN) architecture for 2D cell segmentation and tracking that leverages both mid- and long-term temporal context. DistNet2D considers seven frames at the input and uses a post-processing procedure that exploits information from the entire movie to correct segmentation errors. DistNet2D outperforms two recent methods on two experimental datasets, one containing densely packed bacterial cells and the other containing eukaryotic cells. It has been integrated into an ImageJ-based graphical user interface for 2D data visualization, curation, and training. Finally, we demonstrate the performance of DistNet2D on correlating the size and shape of cells with their transport properties over large statistics, for both bacterial and eukaryotic cells.

Leave No Stone Unturned: Mine Extra Knowledge for Imbalanced Facial Expression Recognition

  • paper_url: http://arxiv.org/abs/2310.19636
  • repo_url: https://github.com/zyh-uaiaaaa/Mine-Extra-Knowledge
  • paper_authors: Yuhang Zhang, Yaqi Li, Lixiong Qin, Xuannan Liu, Weihong Deng
  • for: addresses the imbalanced facial expression recognition (FER) problem by proposing a novel approach to extract extra knowledge related to minor classes from both major and minor class samples.
  • methods: leverages re-balanced attention maps to regularize the model and extract transformation invariant information about the minor classes from all training samples, and introduces re-balanced smooth labels to regulate the cross-entropy loss and guide the model to pay more attention to the minor classes.
  • results: achieves state-of-the-art performance under the imbalanced FER task through extensive experiments on different datasets and backbones.
    Abstract Facial expression data is characterized by a significant imbalance, with most collected data showing happy or neutral expressions and fewer instances of fear or disgust. This imbalance poses challenges to facial expression recognition (FER) models, hindering their ability to fully understand various human emotional states. Existing FER methods typically report overall accuracy on highly imbalanced test sets but exhibit low performance in terms of the mean accuracy across all expression classes. In this paper, our aim is to address the imbalanced FER problem. Existing methods primarily focus on learning knowledge of minor classes solely from minor-class samples. However, we propose a novel approach to extract extra knowledge related to the minor classes from both major and minor class samples. Our motivation stems from the belief that FER resembles a distribution learning task, wherein a sample may contain information about multiple classes. For instance, a sample from the major class surprise might also contain useful features of the minor class fear. Inspired by that, we propose a novel method that leverages re-balanced attention maps to regularize the model, enabling it to extract transformation invariant information about the minor classes from all training samples. Additionally, we introduce re-balanced smooth labels to regulate the cross-entropy loss, guiding the model to pay more attention to the minor classes by utilizing the extra information regarding the label distribution of the imbalanced training data. Extensive experiments on different datasets and backbones show that the two proposed modules work together to regularize the model and achieve state-of-the-art performance under the imbalanced FER task. Code is available at https://github.com/zyh-uaiaaaa.
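One plausible reading of the "re-balanced smooth labels" idea (hedged, not the paper's exact formulation) is ordinary label smoothing whose smoothing mass is distributed in inverse proportion to class frequency, so rare expression classes receive more attention in the cross-entropy loss. The class counts below are toy numbers.

```python
# Illustrative re-balanced smooth labels: the smoothing mass favors minor
# classes by weighting each class with the inverse of its training frequency.
import torch
import torch.nn.functional as F

def rebalanced_smooth_labels(targets, class_counts, eps=0.1):
    inv = 1.0 / class_counts.float()
    weights = inv / inv.sum()                        # sums to 1, favors minor classes
    onehot = F.one_hot(targets, num_classes=len(class_counts)).float()
    return (1 - eps) * onehot + eps * weights        # broadcast over the batch

class_counts = torch.tensor([5000, 4000, 300, 150, 2000, 100, 1200])  # toy imbalance
targets = torch.tensor([0, 2, 5])
logits = torch.randn(3, 7)
soft = rebalanced_smooth_labels(targets, class_counts)
loss = torch.mean(torch.sum(-soft * F.log_softmax(logits, dim=-1), dim=-1))
print(loss.item())
```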

Bidirectional Captioning for Clinically Accurate and Interpretable Models

  • paper_url: http://arxiv.org/abs/2310.19635
  • repo_url: None
  • paper_authors: Keegan Quigley, Miriam Cha, Josh Barua, Geeticka Chauhan, Seth Berkowitz, Steven Horng, Polina Golland
  • for: investigates vision-language pretraining for medical image analysis and compares contrastive pretraining with captioning-based language modeling for producing useful visual encoders and report generation.
  • methods: uses bidirectional captioning of radiology reports as pretraining, with a CNN encoder and transformer decoder architecture named RadTex optimized for the radiology domain.
  • results: captioning pretraining yields visual encoders competitive with contrastive pretraining (CheXpert competition multi-label AUC of 89.4%), generates clinically relevant reports (captioning macro-F1 score of 0.349 with the CheXpert labeler), and responds to prompts with targeted, interactive outputs.
    Abstract Vision-language pretraining has been shown to produce high-quality visual encoders which transfer efficiently to downstream computer vision tasks. While generative language models have gained widespread attention, image captioning has thus far been mostly overlooked as a form of cross-modal pretraining in favor of contrastive learning, especially in medical image analysis. In this paper, we experiment with bidirectional captioning of radiology reports as a form of pretraining and compare the quality and utility of learned embeddings with those from contrastive pretraining methods. We optimize a CNN encoder, transformer decoder architecture named RadTex for the radiology domain. Results show that not only does captioning pretraining yield visual encoders that are competitive with contrastive pretraining (CheXpert competition multi-label AUC of 89.4%), but also that our transformer decoder is capable of generating clinically relevant reports (captioning macro-F1 score of 0.349 using CheXpert labeler) and responding to prompts with targeted, interactive outputs.

Convolutional Neural Networks for Automatic Detection of Intact Adenovirus from TEM Imaging with Debris, Broken and Artefacts Particles

  • paper_url: http://arxiv.org/abs/2310.19630
  • repo_url: None
  • paper_authors: Olivier Rukundo, Andrea Behanova, Riccardo De Feo, Seppo Ronkko, Joni Oja, Jussi Tohka
  • for: helps manufacturers routinely monitor primary particles and purity profiles during drug development and manufacturing to avoid product variability and contamination.
  • methods: uses transmission electron microscopy (TEM) imaging to predict how process changes affect particle characteristics and purity of virus-based gene therapy vector products and intermediates.
  • results: software tools for semi-automatic annotation and automatic detection and segmentation of intact adenoviruses in TEM imaging systems reduce manual effort and improve detection efficiency and accuracy.
    Abstract Regular monitoring of the primary particles and purity profiles of a drug product during development and manufacturing processes is essential for manufacturers to avoid product variability and contamination. Transmission electron microscopy (TEM) imaging helps manufacturers predict how changes affect particle characteristics and purity for virus-based gene therapy vector products and intermediates. Since intact particles can characterize efficacious products, it is beneficial to automate the detection of intact adenovirus against a non-intact-viral background mixed with debris, broken, and artefact particles. In the presence of such particles, detecting intact adenoviruses becomes more challenging. To overcome the challenge, due to such a presence, we developed a software tool for semi-automatic annotation and segmentation of adenoviruses and a software tool for automatic segmentation and detection of intact adenoviruses in TEM imaging systems. The developed semi-automatic tool exploited conventional image analysis techniques while the automatic tool was built based on convolutional neural networks and image analysis techniques. Our quantitative and qualitative evaluations showed outstanding true positive detection rates compared to false positive and negative rates where adenoviruses were nicely detected without mistaking them for real debris, broken adenoviruses, and/or staining artefacts.

GC-MVSNet: Multi-View, Multi-Scale, Geometrically-Consistent Multi-View Stereo

  • paper_url: http://arxiv.org/abs/2310.19583
  • repo_url: https://github.com/vkvats/GC-MVSNet
  • paper_authors: Vibhas K. Vats, Sripad Joshi, David J. Crandall, Md. Alimoor Reza, Soon-heung Jung
  • for: proposes a new multi-view stereo network (GC-MVSNet) to address the multi-view stereo reconstruction problem.
  • methods: a learning-based multi-view stereo approach with a novel geometric consistency loss that enforces multi-view, multi-scale consistency of reference-view depth maps across source views during training.
  • results: GC-MVSNet learns high-quality reconstructions with nearly half the training iterations of other MVS methods, sets a new state of the art on the DTU and BlendedMVS datasets, and achieves competitive results on the Tanks and Temples benchmark.
    Abstract Traditional multi-view stereo (MVS) methods rely heavily on photometric and geometric consistency constraints, but newer machine learning-based MVS methods check geometric consistency across multiple source views only as a post-processing step. In this paper, we present a novel approach that explicitly encourages geometric consistency of reference view depth maps across multiple source views at different scales during learning (see Fig. 1). We find that adding this geometric consistency loss significantly accelerates learning by explicitly penalizing geometrically inconsistent pixels, reducing the training iteration requirements to nearly half that of other MVS methods. Our extensive experiments show that our approach achieves a new state-of-the-art on the DTU and BlendedMVS datasets, and competitive results on the Tanks and Temples benchmark. To the best of our knowledge, GC-MVSNet is the first attempt to enforce multi-view, multi-scale geometric consistency during learning.
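The geometric consistency check behind such a loss can be sketched with the standard MVS reprojection test: backproject reference pixels with their estimated depth, transform them into a source view, and compare the projected depth with the source depth map. This NumPy sketch is a generic, single-scale version under stated assumptions, not GC-MVSNet's exact multi-view, multi-scale loss.

```python
# Hedged sketch of a cross-view geometric consistency check between a reference
# depth map and one source depth map (pinhole cameras, nearest-neighbour sampling).
import numpy as np

def depth_consistency_mask(d_ref, d_src, K, R, t, rel_thresh=0.01):
    h, w = d_ref.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)
    rays = np.linalg.inv(K) @ pix                     # unit-depth rays in the ref frame
    X_ref = rays * d_ref.reshape(1, -1)               # 3D points in the ref camera
    X_src = R @ X_ref + t.reshape(3, 1)               # same points in the src camera
    proj = K @ X_src
    z_src = proj[2]
    us = np.round(proj[0] / z_src).astype(int)
    vs = np.round(proj[1] / z_src).astype(int)
    valid = (us >= 0) & (us < w) & (vs >= 0) & (vs < h) & (z_src > 0)
    rel_err = np.full(h * w, np.inf)
    sampled = d_src[vs[valid], us[valid]]             # nearest-neighbour source depth
    rel_err[valid] = np.abs(z_src[valid] - sampled) / np.maximum(sampled, 1e-6)
    return (rel_err < rel_thresh).reshape(h, w)       # geometrically consistent pixels

K = np.array([[500.0, 0, 64], [0, 500.0, 64], [0, 0, 1]])
d = np.full((128, 128), 2.0)
mask = depth_consistency_mask(d, d, K, np.eye(3), np.zeros(3))
print(mask.mean())   # identity pose: everything should be consistent
```

A training loss can then penalize pixels whose mask is False (or weight their depth error more heavily), which is the intuition the paper builds on.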

Human-interpretable and deep features for image privacy classification

  • paper_url: http://arxiv.org/abs/2310.19582
  • repo_url: None
  • paper_authors: Darya Baranouskaya, Andrea Cavallaro
  • for: analyses privacy classification datasets and the properties of controversial images that receive contrasting privacy labels from different assessors.
  • methods: proposes eight privacy-specific, human-interpretable features suitable for image privacy classification.
  • results: the proposed features increase the performance of deep learning models and, on their own, improve the image representation for privacy classification compared with much higher-dimensional deep features.
    Abstract Privacy is a complex, subjective and contextual concept that is difficult to define. Therefore, the annotation of images to train privacy classifiers is a challenging task. In this paper, we analyse privacy classification datasets and the properties of controversial images that are annotated with contrasting privacy labels by different assessors. We discuss suitable features for image privacy classification and propose eight privacy-specific and human-interpretable features. These features increase the performance of deep learning models and, on their own, improve the image representation for privacy classification compared with much higher dimensional deep features.

Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model

  • paper_url: http://arxiv.org/abs/2310.19581
  • repo_url: None
  • paper_authors: Suyeon Lee, Chaeyoung Jung, Youngjoon Jang, Jaehun Kim, Joon Son Chung
  • for: extracts a target speaker's voice from a mixture of voices using visual cues; existing audio-visual speech separation methods achieve promising intelligibility, but maintaining naturalness remains a challenge.
  • methods: proposes AVDiffuSS, an audio-visual speech separation model based on a diffusion mechanism, together with a cross-attention-based feature fusion mechanism tailored to the speech domain that integrates phonetic information from audio-visual correspondence while keeping high temporal resolution without excessive computation.
  • results: achieves state-of-the-art results on the VoxCeleb2 and LRS3 benchmarks, producing speech with notably better naturalness than previous methods.
    Abstract The objective of this work is to extract target speaker's voice from a mixture of voices using visual cues. Existing works on audio-visual speech separation have demonstrated their performance with promising intelligibility, but maintaining naturalness remains a challenge. To address this issue, we propose AVDiffuSS, an audio-visual speech separation model based on a diffusion mechanism known for its capability in generating natural samples. For an effective fusion of the two modalities for diffusion, we also propose a cross-attention-based feature fusion mechanism. This mechanism is specifically tailored for the speech domain to integrate the phonetic information from audio-visual correspondence in speech generation. In this way, the fusion process maintains the high temporal resolution of the features, without excessive computational requirements. We demonstrate that the proposed framework achieves state-of-the-art results on two benchmarks, including VoxCeleb2 and LRS3, producing speech with notably better naturalness.
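The cross-attention fusion described above can be illustrated with a generic block in which audio frames act as queries attending over lip-region visual features; this is a minimal sketch using `torch.nn.MultiheadAttention`, not the exact AVDiffuSS fusion module, and the dimensions are illustrative.

```python
# Minimal sketch of cross-attention feature fusion between modalities: audio
# tokens query visual tokens, and a residual connection keeps the audio
# stream's temporal resolution intact.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # audio: (B, T_a, dim) queries; visual: (B, T_v, dim) keys/values
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)

fusion = CrossModalFusion()
audio = torch.randn(2, 200, 256)                     # e.g. 200 audio frames
visual = torch.randn(2, 50, 256)                     # e.g. 50 video frames
print(fusion(audio, visual).shape)                   # torch.Size([2, 200, 256])
```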

A Perceptual Shape Loss for Monocular 3D Face Reconstruction

  • paper_url: http://arxiv.org/abs/2310.19580
  • repo_url: None
  • paper_authors: Christopher Otto, Prashanth Chandran, Gaspard Zoss, Markus Gross, Paulo Gotardo, Derek Bradley
  • for: This paper aims to improve the quality of monocular 3D face reconstruction by proposing a new perceptual shape loss function that uses shading cues to evaluate the quality of the 3D face estimate.
  • methods: The proposed method uses a discriminator-style neural network to evaluate the quality of the shaded render of the geometry estimate, without requiring an estimate of the albedo or illumination in the scene. The loss operates entirely in image space and is agnostic to mesh topology.
  • results: The authors show that their new perceptual shape loss can be combined with traditional energy terms for monocular 3D face optimization and deep neural network regression, improving upon current state-of-the-art results.
    Abstract Monocular 3D face reconstruction is a wide-spread topic, and existing approaches tackle the problem either through fast neural network inference or offline iterative reconstruction of face geometry. In either case carefully-designed energy functions are minimized, commonly including loss terms like a photometric loss, a landmark reprojection loss, and others. In this work we propose a new loss function for monocular face capture, inspired by how humans would perceive the quality of a 3D face reconstruction given a particular image. It is widely known that shading provides a strong indicator for 3D shape in the human visual system. As such, our new 'perceptual' shape loss aims to judge the quality of a 3D face estimate using only shading cues. Our loss is implemented as a discriminator-style neural network that takes an input face image and a shaded render of the geometry estimate, and then predicts a score that perceptually evaluates how well the shaded render matches the given image. This 'critic' network operates on the RGB image and geometry render alone, without requiring an estimate of the albedo or illumination in the scene. Furthermore, our loss operates entirely in image space and is thus agnostic to mesh topology. We show how our new perceptual shape loss can be combined with traditional energy terms for monocular 3D face optimization and deep neural network regression, improving upon current state-of-the-art results.
    摘要 单眼3D脸重建是一个广泛的研究领域,现有的方法可以通过快速的神经网络推断或者组合不同的构成来解决这个问题。不 matter the approach, 都需要仔细设计能量函数,通常包括像素损失、标点重映射损失等损失函数。在这个工作中,我们提出了一个新的损失函数 для单眼3D脸重建,受人类视觉系统中的视觉评估影响。我们发现,阴影提供了3D形状中强大的视觉指标。因此,我们的新的“感知”形状损失将评估3D脸估计中的阴影匹配度,以便更好地评估3D脸重建的质量。我们的损失函数通过一个批评器网络来实现,这个批评器网络将从输入的脸像和geometry估计中获得的阴影匹配度进行评估,并且预测一个视觉评估分数。这个批评器网络仅从RGB图像和geometry render alone,没有需要场景照明估计。此外,我们的损失函数完全在图像空间中运作,因此不受体统的限制。在这个研究中,我们显示了我们的新的感知形状损失可以与传统的能量函数和神经网络回推 combinated,以提高目前的州Of-The-Art结果。

Skip-WaveNet: A Wavelet based Multi-scale Architecture to Trace Firn Layers in Radar Echograms

  • paper_url: http://arxiv.org/abs/2310.19574
  • repo_url: None
  • paper_authors: Debvrat Varshney, Masoud Yari, Oluwanisola Ibikunle, Jilu Li, John Paden, Maryam Rahnemoonfar
  • for: improves firn layer detection in airborne radar echograms, which is needed to calculate snow accumulation rates and the contribution of polar ice cap melt to sea level rise.
  • methods: wavelet-based multi-scale deep learning architectures that automatically process radar echograms to detect the underlying firn layers.
  • results: the proposed Skip-WaveNet architecture improves the optimal dataset scale (ODS) and optimal image scale (OIS) F-scores over non-wavelet architectures, generalizes better than state-of-the-art firn layer detection networks, and estimates layer depths with a mean absolute error of 3.31 pixels and 94.3% average precision.
    Abstract Echograms created from airborne radar sensors capture the profile of firn layers present on top of an ice sheet. Accurate tracking of these layers is essential to calculate the snow accumulation rates, which are required to investigate the contribution of polar ice cap melt to sea level rise. However, automatically processing the radar echograms to detect the underlying firn layers is a challenging problem. In our work, we develop wavelet-based multi-scale deep learning architectures for these radar echograms to improve firn layer detection. We show that wavelet based architectures improve the optimal dataset scale (ODS) and optimal image scale (OIS) F-scores by 3.99% and 3.7%, respectively, over the non-wavelet architecture. Further, our proposed Skip-WaveNet architecture generates new wavelets in each iteration, achieves higher generalizability as compared to state-of-the-art firn layer detection networks, and estimates layer depths with a mean absolute error of 3.31 pixels and 94.3% average precision. Such a network can be used by scientists to trace firn layers, calculate the annual snow accumulation rates, estimate the resulting surface mass balance of the ice sheet, and help project global sea level rise.
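The wavelet front end of such a multi-scale architecture can be illustrated with PyWavelets: a 2-D discrete wavelet transform splits an echogram into approximation and detail sub-bands at several scales, which a network like Skip-WaveNet could consume as multi-scale inputs. This is a generic sketch, not the paper's implementation, and the wavelet choice and level count are assumptions.

```python
# Hedged sketch: multi-level 2-D wavelet decomposition of a radar echogram.
import numpy as np
import pywt

echogram = np.random.rand(256, 512)                  # stand-in for a radar echogram
coeffs = pywt.wavedec2(echogram, wavelet="haar", level=3)
approx = coeffs[0]                                   # coarsest approximation band
print("approximation:", approx.shape)
for level, (ch, cv, cd) in enumerate(coeffs[1:], start=1):
    # horizontal / vertical / diagonal detail bands, finest level last
    print(f"detail level {level}:", ch.shape, cv.shape, cd.shape)
```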

Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning

  • paper_url: http://arxiv.org/abs/2310.19559
  • repo_url: https://github.com/Andy20178/DCL
  • paper_authors: Changsheng Lv, Shuai Zhang, Yapeng Tian, Mengshi Qi, Huadong Ma
  • for: proposes a Disentangled Counterfactual Learning (DCL) approach for physical audiovisual commonsense reasoning.
  • methods: decouples videos into static (time-invariant) and dynamic (time-varying) factors in the latent space with a disentangled sequential encoder, adopts a variational autoencoder (VAE) to maximize mutual information across modalities with a contrastive loss, and adds a counterfactual learning module that models physical knowledge relationships among objects under counterfactual intervention.
  • results: outperforms baseline methods and achieves state-of-the-art performance; source code is available at https://github.com/Andy20178/DCL.
    Abstract In this paper, we propose a Disentangled Counterfactual Learning~(DCL) approach for physical audiovisual commonsense reasoning. The task aims to infer objects' physics commonsense based on both video and audio input, with the main challenge is how to imitate the reasoning ability of humans. Most of the current methods fail to take full advantage of different characteristics in multi-modal data, and lacking causal reasoning ability in models impedes the progress of implicit physical knowledge inferring. To address these issues, our proposed DCL method decouples videos into static (time-invariant) and dynamic (time-varying) factors in the latent space by the disentangled sequential encoder, which adopts a variational autoencoder (VAE) to maximize the mutual information with a contrastive loss function. Furthermore, we introduce a counterfactual learning module to augment the model's reasoning ability by modeling physical knowledge relationships among different objects under counterfactual intervention. Our proposed method is a plug-and-play module that can be incorporated into any baseline. In experiments, we show that our proposed method improves baseline methods and achieves state-of-the-art performance. Our source code is available at https://github.com/Andy20178/DCL.

Harvest Video Foundation Models via Efficient Post-Pretraining

  • paper_url: http://arxiv.org/abs/2310.19554
  • repo_url: https://github.com/opengvlab/internvideo
  • paper_authors: Yizhuo Li, Kunchang Li, Yinan He, Yi Wang, Yali Wang, Limin Wang, Yu Qiao, Ping Luo
  • for: efficiently harvests video-language foundation models from image ones, avoiding the high cost of pretraining on redundant video data from scratch.
  • methods: randomly drops input video patches and masks out input text during the post-pretraining procedure to promote cross-modal fusion learning.
  • results: extensive experiments on a wide range of video-language downstream tasks, including zero-shot tasks, video question answering, and video-text retrieval, validate the effectiveness of the method, which achieves state-of-the-art performance comparable to some heavily pretrained video foundation models.
    Abstract Building video-language foundation models is costly and difficult due to the redundant nature of video data and the lack of high-quality video-language datasets. In this paper, we propose an efficient framework to harvest video foundation models from image ones. Our method is intuitively simple by randomly dropping input video patches and masking out input text during the post-pretraining procedure. The patch dropping boosts the training efficiency significantly and text masking enforces the learning of cross-modal fusion. We conduct extensive experiments to validate the effectiveness of our method on a wide range of video-language downstream tasks including various zero-shot tasks, video question answering, and video-text retrieval. Despite its simplicity, our method achieves state-of-the-art performances, which are comparable to some heavily pretrained video foundation models. Our method is extremely efficient and can be trained in less than one day on 8 GPUs, requiring only WebVid-10M as pretraining data. We hope our method can serve as a simple yet strong counterpart for prevalent video foundation models, provide useful insights when building them, and make large pretrained models more accessible and sustainable. This is part of the InternVideo project \url{https://github.com/OpenGVLab/InternVideo}.
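The two operations described above, patch dropping and text masking, are simple to sketch. The token shapes and mask id below are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of the two post-pretraining tricks: randomly dropping video
# patch tokens (for efficiency) and masking out text tokens (to force
# cross-modal fusion).
import torch

def drop_video_patches(patch_tokens, keep_ratio=0.5):
    """patch_tokens: (B, N, D) -> (B, int(N*keep_ratio), D), random subset per sample."""
    b, n, d = patch_tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    idx = torch.rand(b, n).argsort(dim=1)[:, :n_keep]          # random permutation
    return torch.gather(patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))

def mask_text_tokens(token_ids, mask_id, mask_prob=0.15):
    mask = torch.rand_like(token_ids, dtype=torch.float) < mask_prob
    return torch.where(mask, torch.full_like(token_ids, mask_id), token_ids), mask

video_tokens = torch.randn(2, 8 * 196, 768)                    # 8 frames x 196 patches
text_ids = torch.randint(0, 30000, (2, 32))
print(drop_video_patches(video_tokens).shape)                  # roughly half the tokens
masked_ids, mask = mask_text_tokens(text_ids, mask_id=103)
print(mask.float().mean().item())                              # ~0.15
```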

MENTOR: Human Perception-Guided Pretraining for Iris Presentation Detection

  • paper_url: http://arxiv.org/abs/2310.19545
  • repo_url: None
  • paper_authors: Colton R. Crum, Adam Czajka
  • for: improves performance on difficult biometric tasks, such as iris presentation attack detection (PAD), by incorporating human salience into CNN training.
  • methods: two training rounds: first, an autoencoder is trained to predict human saliency maps from input iris images (real and fake examples); the trained encoder is then used both as a pre-trained backbone for an iris PAD detector and as a human-inspired annotator of salient features on unseen data.
  • results: threefold benefits: (a) a significant boost in iris PAD performance when using the human-perception-trained encoder weights compared to general-purpose (ImageNet-sourced or random) weights, (b) the ability to generate an unlimited number of human-like saliency maps for unseen iris PAD samples for use in any saliency-guided training paradigm, and (c) increased efficiency of iris PAD model training.
    Abstract Incorporating human salience into the training of CNNs has boosted performance in difficult tasks such as biometric presentation attack detection. However, collecting human annotations is a laborious task, not to mention the questions of how and where (in the model architecture) to efficiently incorporate this information into model's training once annotations are obtained. In this paper, we introduce MENTOR (huMan pErceptioN-guided preTraining fOr iris pResentation attack detection), which addresses both of these issues through two unique rounds of training. First, we train an autoencoder to learn human saliency maps given an input iris image (both real and fake examples). Once this representation is learned, we utilize the trained autoencoder in two different ways: (a) as a pre-trained backbone for an iris presentation attack detector, and (b) as a human-inspired annotator of salient features on unknown data. We show that MENTOR's benefits are threefold: (a) significant boost in iris PAD performance when using the human perception-trained encoder's weights compared to general-purpose weights (e.g. ImageNet-sourced, or random), (b) capability of generating infinite number of human-like saliency maps for unseen iris PAD samples to be used in any human saliency-guided training paradigm, and (c) increase in efficiency of iris PAD model training. Sources codes and weights are offered along with the paper.

  • paper_url: http://arxiv.org/abs/2310.19542
  • repo_url: https://github.com/Tchuanm/AViTMP
  • paper_authors: Chuanming Tang, Kai Wang, Joost van de Weijer, Jianlin Zhang, Yongmei Huang
  • for: AViTMP is proposed to tackle the inferior effectiveness of the vanilla ViT in visual tracking tasks, by bridging the gap between single-branch networks and discriminative models.
  • methods: The proposed AViTMP model uses an adaptor module and joint target state embedding in the encoder to enrich the dense embedding paradigm based on ViT, and combines it with a dense-fusion decoder and a discriminative target model to predict accurate locations. Additionally, a novel inference pipeline called CycleTrack is presented to mitigate the limitations of conventional inference practice, and a dual-frame update inference strategy is proposed to handle significant challenges in long-term scenarios.
  • results: The proposed AViTMP model achieves state-of-the-art performance in visual tracking, especially on long-term tracking and robustness, as demonstrated by experiments on ten tracking benchmarks, including LaSOT, LaSOTExtSub, AVisT, etc.
    Abstract Despite achieving state-of-the-art performance in visual tracking, recent single-branch trackers tend to overlook the weak prior assumptions associated with the Vision Transformer (ViT) encoder and inference pipeline. Moreover, the effectiveness of discriminative trackers remains constrained due to the adoption of the dual-branch pipeline. To tackle the inferior effectiveness of the vanilla ViT, we propose an Adaptive ViT Model Prediction tracker (AViTMP) to bridge the gap between single-branch network and discriminative models. Specifically, in the proposed encoder AViT-Enc, we introduce an adaptor module and joint target state embedding to enrich the dense embedding paradigm based on ViT. Then, we combine AViT-Enc with a dense-fusion decoder and a discriminative target model to predict accurate location. Further, to mitigate the limitations of conventional inference practice, we present a novel inference pipeline called CycleTrack, which bolsters the tracking robustness in the presence of distractors via bidirectional cycle tracking verification. Lastly, we propose a dual-frame update inference strategy that adeptively handles significant challenges in long-term scenarios. In the experiments, we evaluate AViTMP on ten tracking benchmarks for a comprehensive assessment, including LaSOT, LaSOTExtSub, AVisT, etc. The experimental results unequivocally establish that AViTMP attains state-of-the-art performance, especially on long-time tracking and robustness.

IterInv: Iterative Inversion for Pixel-Level T2I Models

  • paper_url: http://arxiv.org/abs/2310.19540
  • repo_url: https://github.com/Tchuanm/IterInv
  • paper_authors: Chuanming Tang, Kai Wang, Joost van de Weijer
  • for: improves image editing for text-to-image generation, allowing users to control the generated image by modifying the input text prompt.
  • methods: targets image generation pipelines beyond Latent Diffusion Models (LDM), namely pixel-level cascaded models such as Imagen and DeepFloyd-IF, whose super-resolution stages are incompatible with standard DDIM inversion.
  • results: proposes IterInv, an iterative inversion technique based on iterative concatenation, and shows that combining IterInv with a popular image editing method improves image generation and editing capabilities.
    Abstract Large-scale text-to-image diffusion models have been a ground-breaking development in generating convincing images following an input text prompt. The goal of image editing research is to give users control over the generated images by modifying the text prompt. Current image editing techniques are relying on DDIM inversion as a common practice based on the Latent Diffusion Models (LDM). However, the large pretrained T2I models working on the latent space as LDM suffer from losing details due to the first compression stage with an autoencoder mechanism. Instead, another mainstream T2I pipeline working on the pixel level, such as Imagen and DeepFloyd-IF, avoids this problem. They are commonly composed of several stages, normally with a text-to-image stage followed by several super-resolution stages. In this case, the DDIM inversion is unable to find the initial noise to generate the original image given that the super-resolution diffusion models are not compatible with the DDIM technique. According to our experimental findings, iteratively concatenating the noisy image as the condition is the root of this problem. Based on this observation, we develop an iterative inversion (IterInv) technique for this stream of T2I models and verify IterInv with the open-source DeepFloyd-IF model. By combining our method IterInv with a popular image editing method, we prove the application prospects of IterInv. The code will be released at \url{https://github.com/Tchuanm/IterInv.git}.

Revitalizing Legacy Video Content: Deinterlacing with Bidirectional Information Propagation

  • paper_url: http://arxiv.org/abs/2310.19535
  • repo_url: None
  • paper_authors: Zhaowei Gao, Mingyang Song, Christopher Schroers, Yang Zhang
  • for: restores the information missing from interlaced legacy video content so it displays properly on modern full-frame displays.
  • methods: a deep-learning approach with a Flow-guided Refinement Block (FRB) that performs feature refinement, including alignment, fusion, and rectification, with bidirectional spatio-temporal information propagation across multiple scales.
  • results: experimental results show that the method outperforms existing approaches.
    Abstract Due to old CRT display technology and limited transmission bandwidth, early film and TV broadcasts commonly used interlaced scanning. This meant each field contained only half of the information. Since modern displays require full frames, this has spurred research into deinterlacing, i.e. restoring the missing information in legacy video content. In this paper, we present a deep-learning-based method for deinterlacing animated and live-action content. Our proposed method supports bidirectional spatio-temporal information propagation across multiple scales to leverage information in both space and time. More specifically, we design a Flow-guided Refinement Block (FRB) which performs feature refinement including alignment, fusion, and rectification. Additionally, our method can process multiple fields simultaneously, reducing per-frame processing time, and potentially enabling real-time processing. Our experimental results demonstrate that our proposed method achieves superior performance compared to existing methods.

Are Natural Domain Foundation Models Useful for Medical Image Classification?

  • paper_url: http://arxiv.org/abs/2310.19522
  • repo_url: None
  • paper_authors: Joana Palés Huix, Adithya Raju Ganeshan, Johan Fredin Haslum, Magnus Söderberg, Christos Matsoukas, Kevin Smith
  • for: investigates the transferability of various state-of-the-art foundation models to medical image classification tasks.
  • methods: evaluates five foundation models, namely SAM, SEEM, DINOv2, BLIP, and OpenCLIP, on four well-established medical imaging datasets, exploring different training settings to fully harness the potential of these models.
  • results: DINOv2 consistently outperforms the standard practice of ImageNet pretraining, while the other foundation models fail to consistently beat this established baseline, indicating limited transferability to medical image classification tasks.
    Abstract The deep learning field is converging towards the use of general foundation models that can be easily adapted for diverse tasks. While this paradigm shift has become common practice within the field of natural language processing, progress has been slower in computer vision. In this paper we attempt to address this issue by investigating the transferability of various state-of-the-art foundation models to medical image classification tasks. Specifically, we evaluate the performance of five foundation models, namely SAM, SEEM, DINOv2, BLIP, and OpenCLIP across four well-established medical imaging datasets. We explore different training settings to fully harness the potential of these models. Our study shows mixed results. DINOv2 in particular, consistently outperforms the standard practice of ImageNet pretraining. However, other foundation models failed to consistently beat this established baseline indicating limitations in their transferability to medical image classification tasks.
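A common way to run this kind of transfer evaluation is a linear probe on a frozen backbone. The sketch below assumes the torch.hub entry point published in the DINOv2 repository (which requires an internet connection to download weights) and a ViT-B/14 embedding size of 768; the dataset, class count, and single optimization step are placeholders.

```python
# Hedged sketch of a linear probe on a frozen DINOv2 backbone for a medical
# classification task.
import torch
import torch.nn as nn

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)                          # freeze the foundation model

head = nn.Linear(768, 5)                             # e.g. 5 diagnostic classes
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

x = torch.rand(4, 3, 224, 224)                       # stand-in for preprocessed images
y = torch.randint(0, 5, (4,))
with torch.no_grad():
    feats = backbone(x)                              # (4, 768) CLS embeddings
loss = nn.functional.cross_entropy(head(feats), y)
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```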

Generating Context-Aware Natural Answers for Questions in 3D Scenes

  • paper_url: http://arxiv.org/abs/2310.19516
  • repo_url: None
  • paper_authors: Mohammed Munzer Dwedari, Matthias Niessner, Dave Zhenyu Chen
  • for: Answering questions in 3D scenes naturally and freely, without being limited to pre-defined answers.
  • methods: Converted the question answering task into a sequence generation task, and optimized the model directly on language rewards to ensure global sentence semantics. Additionally, a pragmatic language understanding reward was adapted to improve sentence quality.
  • results: Set a new SOTA (State of the Art) on the ScanQA benchmark with a CIDEr score of 72.22/66.57 on the test sets.
    Abstract 3D question answering is a young field in 3D vision-language that is yet to be explored. Previous methods are limited to a pre-defined answer space and cannot generate answers naturally. In this work, we pivot the question answering task to a sequence generation task to generate free-form natural answers for questions in 3D scenes (Gen3DQA). To this end, we optimize our model directly on the language rewards to secure the global sentence semantics. Here, we also adapt a pragmatic language understanding reward to further improve the sentence quality. Our method sets a new SOTA on the ScanQA benchmark (CIDEr score 72.22/66.57 on the test sets).

Transformer-based nowcasting of radar composites from satellite images for severe weather

  • paper_url: http://arxiv.org/abs/2310.19515
  • repo_url: https://github.com/caglarkucuk/earthformer-satellite-to-radar
  • paper_authors: Çağlar Küçük, Apostolos Giannakos, Stefan Schneider, Alexander Jann
  • for: improves nowcasting by bridging satellite observations and ground-based radar data to deliver high-accuracy forecasts.
  • methods: a Transformer-based model that nowcasts ground-based radar composites from geostationary satellite data with up to two hours of lead time.
  • results: the model predicts radar fields under different weather phenomena with high accuracy and is robust to rapidly growing/decaying fields and complex field structures.
    Abstract Weather radar data are critical for nowcasting and an integral component of numerical weather prediction models. While weather radar data provide valuable information at high resolution, their ground-based nature limits their availability, which impedes large-scale applications. In contrast, meteorological satellites cover larger domains but with coarser resolution. However, with the rapid advancements in data-driven methodologies and modern sensors aboard geostationary satellites, new opportunities are emerging to bridge the gap between ground- and space-based observations, ultimately leading to more skillful weather prediction with high accuracy. Here, we present a Transformer-based model for nowcasting ground-based radar image sequences using satellite data up to two hours lead time. Trained on a dataset reflecting severe weather conditions, the model predicts radar fields occurring under different weather phenomena and shows robustness against rapidly growing/decaying fields and complex field structures. Model interpretation reveals that the infrared channel centered at 10.3 $\mu m$ (C13) contains skillful information for all weather conditions, while lightning data have the highest relative feature importance in severe weather conditions, particularly in shorter lead times. The model can support precipitation nowcasting across large domains without an explicit need for radar towers, enhance numerical weather prediction and hydrological models, and provide radar proxy for data-scarce regions. Moreover, the open-source framework facilitates progress towards operational data-driven nowcasting.

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

  • paper_url: http://arxiv.org/abs/2310.19512
  • repo_url: https://github.com/ailab-cvc/videocrafter
  • paper_authors: Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, Ying Shan
  • for: The paper is written for researchers and engineers who are interested in video generation, specifically those looking for open-source models.
  • methods: The paper introduces two diffusion models for high-quality video generation: text-to-video (T2V) and image-to-video (I2V) models. The T2V model synthesizes a video based on a given text input, while the I2V model incorporates an additional image input to produce videos that strictly adhere to the content of the provided reference image.
  • results: The proposed T2V model can generate realistic and cinematic-quality videos with a resolution of $1024 \times 576$, outperforming other open-source T2V models in terms of quality. The I2V model is the first open-source I2V foundation model capable of transforming a given image into a video clip while maintaining content preservation constraints.
    Abstract Video generation has increasingly gained interest in both academia and industry. Although commercial tools can generate plausible videos, there is a limited number of open-source models available for researchers and engineers. In this work, we introduce two diffusion models for high-quality video generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V models synthesize a video based on a given text input, while I2V models incorporate an additional image input. Our proposed T2V model can generate realistic and cinematic-quality videos with a resolution of $1024 \times 576$, outperforming other open-source T2V models in terms of quality. The I2V model is designed to produce videos that strictly adhere to the content of the provided reference image, preserving its content, structure, and style. This model is the first open-source I2V foundation model capable of transforming a given image into a video clip while maintaining content preservation constraints. We believe that these open-source video generation models will contribute significantly to the technological advancements within the community.

Deep Learning for Visual Navigation of Underwater Robots

  • paper_url: http://arxiv.org/abs/2310.19495
  • repo_url: None
  • paper_authors: M. Sunbeam
  • for: briefly surveys deep learning methods for visual navigation of underwater robots.
  • methods: covers deep learning approaches to visual perception for underwater robotics, available underwater visual datasets, and imitation learning and reinforcement learning methods for navigation; relevant works are categorized under the imitation learning or deep learning paradigm to clarify the training methodologies in the current landscape.
  • results: provides an overview and assessment of the literature on visual navigation for underwater robots; works that use deep learning on non-visual data are not considered, except as contrasting examples.
    Abstract This paper aims to briefly survey deep learning methods for visual navigation of underwater robotics. The scope of this paper includes the visual perception of underwater robotics with deep learning methods, the available visual underwater datasets, imitation learning, and reinforcement learning methods for navigation. Additionally, relevant works will be categorized under the imitation learning or deep learning paradigm for underwater robots for clarity of the training methodologies in the current landscape. Literature that uses deep learning algorithms to process non-visual data for underwater navigation will not be considered, except as contrasting examples.

VDIP-TGV: Blind Image Deconvolution via Variational Deep Image Prior Empowered by Total Generalized Variation

  • paper_url: http://arxiv.org/abs/2310.19477
  • repo_url: None
  • paper_authors: Tingting Wu, Zhiyan Du, Zhi Li, Feng-Lei Fan, Tieyong Zeng
  • for: improves the sharpness and detail of blind image deblurring, i.e. deconvolution with an unknown blur kernel.
  • methods: combines variational deep image prior (VDIP) with total generalized variation (TGV) regularization and solves the resulting model with the alternating direction method of multipliers (ADMM).
  • results: the proposed VDIP-TGV recovers image edges and details better than various state-of-the-art models, both quantitatively and qualitatively, and copes better with large blur kernels.
    Abstract Recovering clear images from blurry ones with an unknown blur kernel is a challenging problem. Deep image prior (DIP) proposes to use the deep network as a regularizer for a single image rather than as a supervised model, which achieves encouraging results in the nonblind deblurring problem. However, since the relationship between images and the network architectures is unclear, it is hard to find a suitable architecture to provide sufficient constraints on the estimated blur kernels and clean images. Also, DIP uses the sparse maximum a posteriori (MAP), which is insufficient to enforce the selection of the recovery image. Recently, variational deep image prior (VDIP) was proposed to impose constraints on both blur kernels and recovery images and take the standard deviation of the image into account during the optimization process by the variational principle. However, we empirically find that VDIP struggles with processing image details and tends to generate suboptimal results when the blur kernel is large. Therefore, we combine total generalized variational (TGV) regularization with VDIP in this paper to overcome these shortcomings of VDIP. TGV is a flexible regularization that utilizes the characteristics of partial derivatives of varying orders to regularize images at different scales, reducing oil painting artifacts while maintaining sharp edges. The proposed VDIP-TGV effectively recovers image edges and details by supplementing extra gradient information through TGV. Additionally, this model is solved by the alternating direction method of multipliers (ADMM), which effectively combines traditional algorithms and deep learning methods. Experiments show that our proposed VDIP-TGV surpasses various state-of-the-art models quantitatively and qualitatively.

Generative Neural Fields by Mixtures of Neural Implicit Functions

  • paper_url: http://arxiv.org/abs/2310.19464
  • repo_url: https://github.com/tackgeun/mNIF
  • paper_authors: Tackgeun You, Mijeong Kim, Jungtaek Kim, Bohyung Han
  • for: proposes a new approach to generative neural fields represented by linear combinations of implicit basis networks.
  • methods: learns the basis networks as implicit neural representations and their coefficients in a latent space via meta-learning or auto-decoding, and uses weighted model averaging to keep inference latency and memory footprint small.
  • results: achieves competitive generation performance on diverse benchmarks for images, voxel data, and NeRF scenes without sophisticated designs for specific modalities and domains.
    Abstract We propose a novel approach to learning the generative neural fields represented by linear combinations of implicit basis networks. Our algorithm learns basis networks in the form of implicit neural representations and their coefficients in a latent space by either conducting meta-learning or adopting auto-decoding paradigms. The proposed method easily enlarges the capacity of generative neural fields by increasing the number of basis networks while maintaining the size of a network for inference to be small through their weighted model averaging. Consequently, sampling instances using the model is efficient in terms of latency and memory footprint. Moreover, we customize denoising diffusion probabilistic model for a target task to sample latent mixture coefficients, which allows our final model to generate unseen data effectively. Experiments show that our approach achieves competitive generation performance on diverse benchmarks for images, voxel data, and NeRF scenes without sophisticated designs for specific modalities and domains.
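The weighted model averaging idea can be sketched as blending the parameters of several shared basis MLPs with instance-specific coefficients, so inference still runs one small network. The architecture sizes and ReLU MLP below are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of a mixture of neural implicit functions via weight-space
# averaging of basis MLPs.
import torch
import torch.nn as nn

def make_basis(in_dim=2, hidden=64, out_dim=3):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

num_basis = 4
bases = [make_basis() for _ in range(num_basis)]

def mix_weights(bases, coeffs):
    """Weighted model averaging: blend the basis networks' parameters."""
    mixed = make_basis()
    with torch.no_grad():
        for name, param in mixed.named_parameters():
            stacked = torch.stack([dict(b.named_parameters())[name] for b in bases])
            param.copy_((coeffs.view(-1, *([1] * (stacked.dim() - 1))) * stacked).sum(0))
    return mixed

coeffs = torch.softmax(torch.randn(num_basis), dim=0)   # instance-specific coefficients
field = mix_weights(bases, coeffs)                      # one small MLP at inference time
coords = torch.rand(1024, 2) * 2 - 1                    # query coordinates in [-1, 1]^2
print(field(coords).shape)                              # e.g. RGB values at each coord
```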

Towards Grouping in Large Scenes with Occlusion-aware Spatio-temporal Transformers

  • paper_url: http://arxiv.org/abs/2310.19447
  • repo_url: None
  • paper_authors: Jinsong Zhang, Lingfeng Gu, Yu-Kun Lai, Xueyang Wang, Kun Li
  • for: proposes an end-to-end framework, GroupTransformer, for group detection in large-scale scenes, with applications in public safety and smart cities.
  • methods: an occlusion encoder detects and suppresses severely occluded person crops, and spatio-temporal transformers simultaneously extract trajectory information and fuse inter-person features in a hierarchical manner.
  • results: improves performance on both large-scale and small-scale scenes, boosting precision and F1 score by more than 10% on large-scale scenes and F1 score by more than 5% on small-scale scenes.
    Abstract Group detection, especially for large-scale scenes, has many potential applications for public safety and smart cities. Existing methods fail to cope with frequent occlusions in large-scale scenes with multiple people, and are difficult to effectively utilize spatio-temporal information. In this paper, we propose an end-to-end framework,GroupTransformer, for group detection in large-scale scenes. To deal with the frequent occlusions caused by multiple people, we design an occlusion encoder to detect and suppress severely occluded person crops. To explore the potential spatio-temporal relationship, we propose spatio-temporal transformers to simultaneously extract trajectory information and fuse inter-person features in a hierarchical manner. Experimental results on both large-scale and small-scale scenes demonstrate that our method achieves better performance compared with state-of-the-art methods. On large-scale scenes, our method significantly boosts the performance in terms of precision and F1 score by more than 10%. On small-scale scenes, our method still improves the performance of F1 score by more than 5%. The project page with code can be found at http://cic.tju.edu.cn/faculty/likun/projects/GroupTrans.

One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2310.19444
  • repo_url: https://github.com/hao840/ofakd
  • paper_authors: Zhiwei Hao, Jianyuan Guo, Kai Han, Yehui Tang, Han Hu, Yunhe Wang, Chang Xu
  • for: improves model performance through a teacher-student training scheme, particularly for knowledge distillation between heterogeneous architectures.
  • methods: uses centered kernel alignment (CKA) to compare the features learned by heterogeneous teacher and student models, and proposes a simple yet effective one-for-all KD framework, OFA-KD, that projects intermediate features into an aligned latent space (the logits space) and adds an adaptive target enhancement scheme.
  • results: enables effective distillation between models with different architectures and yields notable performance gains, with a maximum improvement of 8.0% on CIFAR-100 and 0.7% on ImageNet-1K.
    Abstract Knowledge distillation (KD) has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme. However, most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family, particularly the hint-based approaches. By using centered kernel alignment (CKA) to compare the learned features between heterogeneous teacher and student models, we observe significant feature divergence. This divergence illustrates the ineffectiveness of previous hint-based methods in cross-architecture distillation. To tackle the challenge in distilling heterogeneous models, we propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures. Specifically, we project intermediate features into an aligned latent space such as the logits space, where architecture-specific information is discarded. Additionally, we introduce an adaptive target enhancement scheme to prevent the student from being disturbed by irrelevant information. Extensive experiments with various architectures, including CNN, Transformer, and MLP, demonstrate the superiority of our OFA-KD framework in enabling distillation between heterogeneous architectures. Specifically, when equipped with our OFA-KD, the student models achieve notable performance improvements, with a maximum gain of 8.0% on the CIFAR-100 dataset and 0.7% on the ImageNet-1K dataset. PyTorch code and checkpoints can be found at https://github.com/Hao840/OFAKD.
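The cross-architecture comparison above relies on centered kernel alignment. A minimal NumPy sketch of linear CKA between two feature matrices, assuming features have already been extracted from a teacher and a student on the same batch (names and shapes are illustrative, not the authors' code):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear centered kernel alignment between feature matrices X (n, d1) and Y (n, d2)."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return float(cross / (np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")))

rng = np.random.default_rng(0)
feats_teacher = rng.normal(size=(256, 512))   # hypothetical CNN features for a batch
feats_student = rng.normal(size=(256, 384))   # hypothetical ViT features for the same batch
print(linear_cka(feats_teacher, feats_student))

# Sanity check: linear CKA is invariant to orthogonal transformations of the features.
Q, _ = np.linalg.qr(rng.normal(size=(512, 512)))
print(linear_cka(feats_teacher, feats_teacher @ Q))  # ~1.0
```

Low CKA between, for example, CNN and Transformer features is the divergence the abstract cites as the failure mode of hint-based distillation.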

Dynamic Gaussian Splatting from Markerless Motion Capture can Reconstruct Infants Movements

  • paper_url: http://arxiv.org/abs/2310.19441
  • repo_url: None
  • paper_authors: R. James Cotton, Colleen Peyton
  • for: To develop more advanced movement analysis tools that can be applied to diverse clinical populations, including infants and neonates
  • methods: Applies dynamic Gaussian splatting to sparse markerless motion capture data, using semantic segmentation to focus on the infant and thereby improving scene initialization
  • results: Results demonstrate the potential of the method for rendering novel views of scenes and tracking infant movements, pointing to its practical applicability
    Abstract Easy access to precise 3D tracking of movement could benefit many aspects of rehabilitation. A challenge to achieving this goal is that while there are many datasets and pretrained algorithms for able-bodied adults, algorithms trained on these datasets often fail to generalize to clinical populations including people with disabilities, infants, and neonates. Reliable movement analysis of infants and neonates is important as spontaneous movement behavior is an important indicator of neurological function and neurodevelopmental disability, which can help guide early interventions. We explored the application of dynamic Gaussian splatting to sparse markerless motion capture (MMC) data. Our approach leverages semantic segmentation masks to focus on the infant, significantly improving the initialization of the scene. Our results demonstrate the potential of this method in rendering novel views of scenes and tracking infant movements. This work paves the way for advanced movement analysis tools that can be applied to diverse clinical populations, with a particular emphasis on early detection in infants.

GaitFormer: Learning Gait Representations with Noisy Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2310.19418
  • repo_url: https://github.com/cosmaadrian/gaitformer
  • paper_authors: Adrian Cosma, Emilian Radoi
  • for: Proposes a person identification method based on movement patterns that does not require subject cooperation
  • methods: Introduces DenseGait, a large-scale dataset for pretraining, and pretrains a transformer-based model on it
  • results: Achieves 92.5% accuracy on CASIA-B and 85.33% on FVG, improvements of +14.2% and +9.67% over previous methods; the model also accurately identifies gender and a multitude of appearance attributes from movement alone
    Abstract Gait analysis is proven to be a reliable way to perform person identification without relying on subject cooperation. Walking is a biometric that does not significantly change in short periods of time and can be regarded as unique to each person. So far, the study of gait analysis focused mostly on identification and demographics estimation, without considering many of the pedestrian attributes that appearance-based methods rely on. In this work, alongside gait-based person identification, we explore pedestrian attribute identification solely from movement patterns. We propose DenseGait, the largest dataset for pretraining gait analysis systems containing 217K anonymized tracklets, annotated automatically with 42 appearance attributes. DenseGait is constructed by automatically processing video streams and offers the full array of gait covariates present in the real world. We make the dataset available to the research community. Additionally, we propose GaitFormer, a transformer-based model that after pretraining in a multi-task fashion on DenseGait, achieves 92.5% accuracy on CASIA-B and 85.33% on FVG, without utilizing any manually annotated data. This corresponds to a +14.2% and +9.67% accuracy increase compared to similar methods. Moreover, GaitFormer is able to accurately identify gender information and a multitude of appearance attributes utilizing only movement patterns. The code to reproduce the experiments is made publicly available.

CARPE-ID: Continuously Adaptable Re-identification for Personalized Robot Assistance

  • paper_url: http://arxiv.org/abs/2310.19413
  • repo_url: None
  • paper_authors: Federico Rollo, Andrea Zunino, Nikolaos Tsagarakis, Enrico Mingo Hoffman, Arash Ajoudani
  • for: Provides a person re-identification module based on continual visual adaptation so that a robot can recognize a personalized target in crowded environments
  • methods: Uses continual visual adaptation to keep the robot cooperating seamlessly with the correct person despite appearance changes and occlusions
  • results: Compared with a state-of-the-art tracker, the CARPE-ID module accurately tracks each selected target in all cases except two limit cases
    Abstract In today's Human-Robot Interaction (HRI) scenarios, a prevailing tendency exists to assume that the robot shall cooperate with the closest individual or that the scene involves merely a singular human actor. However, in realistic scenarios, such as shop floor operations, such an assumption may not hold and personalized target recognition by the robot in crowded environments is required. To fulfil this requirement, in this work, we propose a person re-identification module based on continual visual adaptation techniques that ensure the robot's seamless cooperation with the appropriate individual even subject to varying visual appearances or partial or complete occlusions. We test the framework singularly using recorded videos in a laboratory environment and an HRI scenario, i.e., a person-following task by a mobile robot. The targets are asked to change their appearance during tracking and to disappear from the camera field of view to test the challenging cases of occlusion and outfit variations. We compare our framework with one of the state-of-the-art Multi-Object Tracking (MOT) methods and the results show that the CARPE-ID can accurately track each selected target throughout the experiments in all the cases (except two limit cases). At the same time, the s-o-t-a MOT has a mean of 4 tracking errors for each video.

Intelligent Breast Cancer Diagnosis with Heuristic-assisted Trans-Res-U-Net and Multiscale DenseNet using Mammogram Images

  • paper_url: http://arxiv.org/abs/2310.19411
  • repo_url: None
  • paper_authors: Muhammad Yaqub, Feng Jinchao
  • for: Aims to improve the accuracy of early breast cancer detection in order to improve patient outcomes
  • methods: A deep learning pipeline with three stages: data collection, image segmentation with an Atrous Convolution-based Attentive and Adaptive Trans-Res-UNet (ACA-ATRUNet), and breast cancer identification with an Atrous Convolution-based Attentive and Adaptive Multi-scale DenseNet (ACA-AMDN); hyperparameters are optimised with the Modified Mussel Length-based Eurasian Oystercatcher Optimization (MML-EOO) algorithm
  • results: Experiments show that the proposed framework attains higher precision in early cancer detection than conventional methods
    Abstract Breast cancer (BC) significantly contributes to cancer-related mortality in women, underscoring the criticality of early detection for optimal patient outcomes. A mammography is a key tool for identifying and diagnosing breast abnormalities; however, accurately distinguishing malignant mass lesions remains challenging. To address this issue, we propose a novel deep learning approach for BC screening utilizing mammography images. Our proposed model comprises three distinct stages: data collection from established benchmark sources, image segmentation employing an Atrous Convolution-based Attentive and Adaptive Trans-Res-UNet (ACA-ATRUNet) architecture, and BC identification via an Atrous Convolution-based Attentive and Adaptive Multi-scale DenseNet (ACA-AMDN) model. The hyperparameters within the ACA-ATRUNet and ACA-AMDN models are optimised using the Modified Mussel Length-based Eurasian Oystercatcher Optimization (MML-EOO) algorithm. Performance evaluation, leveraging multiple metrics, is conducted, and a comparative analysis against conventional methods is presented. Our experimental findings reveal that the proposed BC detection framework attains superior precision rates in early disease detection, demonstrating its potential to enhance mammography-based screening methodologies.

Generated Distributions Are All You Need for Membership Inference Attacks Against Generative Models

  • paper_url: http://arxiv.org/abs/2310.19410
  • repo_url: https://github.com/minxingzhang/miagm
  • paper_authors: Minxing Zhang, Ning Yu, Rui Wen, Michael Backes, Yang Zhang
  • for: Reveals the risk of private-information leakage in generative models and how general this risk is across many types of generative models
  • methods: Proposes a generalized membership inference attack that only uses distributions generated by the target generator plus an auxiliary non-member dataset; it requires neither shadow models nor white-box access and covers generative adversarial networks, variational autoencoders, implicit functions, and diffusion models
  • results: Experiments show that all tested generative models are vulnerable: attacks against DDPM, DDIM, and FastDPM on CIFAR-10 and CelebA reach AUC above 0.99, and attacks against VQGAN, LDM (for text-conditional generation), and LIIF reach AUC above 0.90, so the community is urged to account for this leakage when designing and releasing generative models
    Abstract Generative models have demonstrated revolutionary success in various visual creation tasks, but in the meantime, they have been exposed to the threat of leaking private information of their training data. Several membership inference attacks (MIAs) have been proposed to exhibit the privacy vulnerability of generative models by classifying a query image as a training dataset member or nonmember. However, these attacks suffer from major limitations, such as requiring shadow models and white-box access, and either ignoring or only focusing on the unique property of diffusion models, which block their generalization to multiple generative models. In contrast, we propose the first generalized membership inference attack against a variety of generative models such as generative adversarial networks, [variational] autoencoders, implicit functions, and the emerging diffusion models. We leverage only generated distributions from target generators and auxiliary non-member datasets, therefore regarding target generators as black boxes and agnostic to their architectures or application scenarios. Experiments validate that all the generative models are vulnerable to our attack. For instance, our work achieves attack AUC $>0.99$ against DDPM, DDIM, and FastDPM trained on CIFAR-10 and CelebA. And the attack against VQGAN, LDM (for the text-conditional generation), and LIIF achieves AUC $>0.90.$ As a result, we appeal to our community to be aware of such privacy leakage risks when designing and publishing generative models.

Radar-Lidar Fusion for Object Detection by Designing Effective Convolution Networks

  • paper_url: http://arxiv.org/abs/2310.19405
  • repo_url: None
  • paper_authors: Farzeen Munir, Shoaib Azam, Tomasz Kucner, Ville Kyrki, Moongu Jeon
  • for: To improve the accuracy and reliability of perception systems, particularly in adverse weather conditions
  • methods: Fuses radar and Lidar data, combining the two modalities with additive attention and then processing the integrated features with a novel Parallel Forked Structure to handle scale variations
  • results: Evaluated on the Radiate dataset with COCO metrics, the method surpasses state-of-the-art approaches by 1.89% and 2.61% in favorable and adverse weather conditions, respectively, showing that radar-Lidar fusion enables accurate object detection and localization in challenging weather
    Abstract Object detection is a core component of perception systems, providing the ego vehicle with information about its surroundings to ensure safe route planning. While cameras and Lidar have significantly advanced perception systems, their performance can be limited in adverse weather conditions. In contrast, millimeter-wave technology enables radars to function effectively in such conditions. However, relying solely on radar for building a perception system doesn't fully capture the environment due to the data's sparse nature. To address this, sensor fusion strategies have been introduced. We propose a dual-branch framework to integrate radar and Lidar data for enhanced object detection. The primary branch focuses on extracting radar features, while the auxiliary branch extracts Lidar features. These are then combined using additive attention. Subsequently, the integrated features are processed through a novel Parallel Forked Structure (PFS) to manage scale variations. A region proposal head is then utilized for object detection. We evaluated the effectiveness of our proposed method on the Radiate dataset using COCO metrics. The results show that it surpasses state-of-the-art methods by $1.89\%$ and $2.61\%$ in favorable and adverse weather conditions, respectively. This underscores the value of radar-Lidar fusion in achieving precise object detection and localization, especially in challenging weather conditions.
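The abstract names additive attention as the fusion mechanism but does not specify the module, so the following is a hedged PyTorch sketch of one plausible additive-attention fusion of two same-shaped branch feature maps (module and tensor names are assumptions, not the paper's design):

```python
import torch
import torch.nn as nn

class AdditiveAttentionFusion(nn.Module):
    """Toy fusion of two same-shaped feature maps via an additive attention gate."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Sequential(                      # score each spatial location
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1), nn.Sigmoid(),
        )

    def forward(self, radar_feat, lidar_feat):
        gate = self.score(radar_feat + lidar_feat)       # (B, 1, H, W), values in [0, 1]
        return gate * radar_feat + (1.0 - gate) * lidar_feat

fusion = AdditiveAttentionFusion(64)
radar = torch.randn(2, 64, 32, 32)                       # hypothetical radar features
lidar = torch.randn(2, 64, 32, 32)                       # hypothetical Lidar features
print(fusion(radar, lidar).shape)                        # torch.Size([2, 64, 32, 32])
```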

A Clinical Guideline Driven Automated Linear Feature Extraction for Vestibular Schwannoma

  • paper_url: http://arxiv.org/abs/2310.19392
  • repo_url: None
  • paper_authors: Navodini Wijethilake, Steve Connor, Anna Oviedova, Rebecca Burger, Tom Vercauteren, Jonathan Shapey
  • for: This paper aims to automate and improve the clinical decision-making process for patients with Vestibular Schwannoma by using deep learning-based segmentation to extract relevant clinical features from T1 and T2 weighted MRI scans.
  • methods: The authors use a deep learning-based segmentation approach to extract the maximum linear measurement from the segmented regions, and propose a novel algorithm to choose and extract the most appropriate measurement based on the size of the extrameatal portion of the tumour.
  • results: The authors achieved Dice-scores of 0.8124 +- 0.2343 and 0.8969 +- 0.0521 for extrameatal and whole tumour regions respectively for T2 weighted MRI, and 0.8222 +- 0.2108 and 0.9049 +- 0.0646 for T1 weighted MRI. The automated measurements were found to be significantly correlated with the manual measurements obtained by an expert neuroradiologist (p < 0.0001).
    Abstract Vestibular Schwannoma is a benign brain tumour that grows from one of the balance nerves. Patients may be treated by surgery, radiosurgery or with a conservative "wait-and-scan" strategy. Clinicians typically use manually extracted linear measurements to aid clinical decision making. This work aims to automate and improve this process by using deep learning based segmentation to extract relevant clinical features through computational algorithms. To the best of our knowledge, our study is the first to propose an automated approach to replicate local clinical guidelines. Our deep learning based segmentation provided Dice-scores of 0.8124 +- 0.2343 and 0.8969 +- 0.0521 for extrameatal and whole tumour regions respectively for T2 weighted MRI, whereas 0.8222 +- 0.2108 and 0.9049 +- 0.0646 were obtained for T1 weighted MRI. We propose a novel algorithm to choose and extract the most appropriate maximum linear measurement from the segmented regions based on the size of the extrameatal portion of the tumour. Using this tool, clinicians will be provided with a visual guide and related metrics relating to tumour progression that will function as a clinical decision aid. In this study, we utilize 187 scans obtained from 50 patients referred to a tertiary specialist neurosurgical service in the United Kingdom. The measurements extracted manually by an expert neuroradiologist indicated a significant correlation with the automated measurements (p < 0.0001).
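The pipeline reduces a segmentation to a guideline-driven maximum linear measurement. Leaving aside the guideline-specific choice of region and imaging plane, here is a minimal sketch of one ingredient, the largest in-plane diameter of a binary mask (function name and spacing handling are assumptions):

```python
import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import pdist

def max_linear_measurement(mask: np.ndarray, spacing=(1.0, 1.0)) -> float:
    """Largest in-plane distance (mm) between any two foreground pixels of a 2D mask."""
    coords = np.argwhere(mask > 0).astype(float) * np.asarray(spacing)  # pixels -> mm
    if len(coords) < 2:
        return 0.0
    if len(coords) > 3:                      # the maximum distance lies on the convex hull
        coords = coords[ConvexHull(coords).vertices]
    return float(pdist(coords).max())

# Toy example: a 10 x 4 pixel block at 0.5 mm spacing; its diagonal is about 4.74 mm.
mask = np.zeros((32, 32), dtype=np.uint8)
mask[5:15, 10:14] = 1
print(round(max_linear_measurement(mask, spacing=(0.5, 0.5)), 2))
```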

TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition

  • paper_url: http://arxiv.org/abs/2310.19380
  • repo_url: https://github.com/lmmmeng/transxnet
  • paper_authors: Meng Lou, Hong-Yu Zhou, Sibei Yang, Yizhou Yu
  • for: To improve the generalization ability and computational efficiency of visual recognition models
  • methods: Proposes a lightweight Dual Dynamic Token Mixer (D-Mixer) that applies an efficient global attention module and an input-dependent depthwise convolution, giving strong inductive bias and an enlarged effective receptive field; D-Mixer serves as the basic building block of a new hybrid CNN-Transformer vision backbone, TransXNet
  • results: On ImageNet-1K classification, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy with less than half the computational cost, while TransXNet-S and TransXNet-B reach 83.8% and 84.6% top-1 accuracy at reasonable cost; the architecture also generalizes well to various dense prediction tasks, outperforming other state-of-the-art networks at lower computational cost
    Abstract Recent studies have integrated convolution into transformers to introduce inductive bias and improve generalization performance. However, the static nature of conventional convolution prevents it from dynamically adapting to input variations, resulting in a representation discrepancy between convolution and self-attention as self-attention calculates attention matrices dynamically. Furthermore, when stacking token mixers that consist of convolution and self-attention to form a deep network, the static nature of convolution hinders the fusion of features previously generated by self-attention into convolution kernels. These two limitations result in a sub-optimal representation capacity of the constructed networks. To find a solution, we propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that aggregates global information and local details in an input-dependent way. D-Mixer works by applying an efficient global attention module and an input-dependent depthwise convolution separately on evenly split feature segments, endowing the network with strong inductive bias and an enlarged effective receptive field. We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network that delivers compelling performance. In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3\% in top-1 accuracy while requiring less than half of the computational cost. Furthermore, TransXNet-S and TransXNet-B exhibit excellent model scalability, achieving top-1 accuracy of 83.8\% and 84.6\% respectively, with reasonable computational costs. Additionally, our proposed network architecture demonstrates strong generalization capabilities in various dense prediction tasks, outperforming other state-of-the-art networks while having lower computational costs.
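As a rough illustration of the dual-branch token-mixing idea, the toy PyTorch module below splits channels between a global self-attention branch and a depthwise-convolution branch; it deliberately omits the input-dependent kernel generation and the paper's exact attention design, so it sketches the concept rather than D-Mixer itself:

```python
import torch
import torch.nn as nn

class ToyDualTokenMixer(nn.Module):
    """Channels are split: one half gets global self-attention, the other a depthwise conv."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        half = dim // 2
        self.attn = nn.MultiheadAttention(half, heads, batch_first=True)
        self.dwconv = nn.Conv2d(half, half, 7, padding=3, groups=half)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                               # x: (B, C, H, W)
        b, c, h, w = x.shape
        xa, xl = x.chunk(2, dim=1)                      # global half / local half
        tokens = xa.flatten(2).transpose(1, 2)          # (B, H*W, C/2)
        tokens, _ = self.attn(tokens, tokens, tokens)
        xa = tokens.transpose(1, 2).reshape(b, c // 2, h, w)
        return self.proj(torch.cat([xa, self.dwconv(xl)], dim=1))

mixer = ToyDualTokenMixer(dim=64)
print(mixer(torch.randn(2, 64, 14, 14)).shape)          # torch.Size([2, 64, 14, 14])
```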

Color Equivariant Convolutional Networks

  • paper_url: http://arxiv.org/abs/2310.19368
  • repo_url: https://github.com/attila94/ceconv
  • paper_authors: Attila Lengyel, Ombretta Strafforello, Robert-Jan Bruintjes, Alexander Gielisse, Jan van Gemert
  • for: To improve the robustness of CNNs to color variations
  • methods: Proposes Color Equivariant Convolutions (CEConvs), a deep learning building block that shares shape features across the color spectrum while retaining important color information
  • results: Improves downstream performance on various tasks and robustness to color changes, including train-test distribution shifts
    Abstract Color is a crucial visual cue readily exploited by Convolutional Neural Networks (CNNs) for object recognition. However, CNNs struggle if there is data imbalance between color variations introduced by accidental recording conditions. Color invariance addresses this issue but does so at the cost of removing all color information, which sacrifices discriminative power. In this paper, we propose Color Equivariant Convolutions (CEConvs), a novel deep learning building block that enables shape feature sharing across the color spectrum while retaining important color information. We extend the notion of equivariance from geometric to photometric transformations by incorporating parameter sharing over hue-shifts in a neural network. We demonstrate the benefits of CEConvs in terms of downstream performance to various tasks and improved robustness to color changes, including train-test distribution shifts. Our approach can be seamlessly integrated into existing architectures, such as ResNets, and offers a promising solution for addressing color-based domain shifts in CNNs.
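CEConvs share convolution parameters over hue shifts (the linked repo has the real implementation, which uses continuous hue rotations in RGB space). As a coarse illustration of the equivariance idea only, the sketch below uses cyclic RGB channel permutations as a discrete stand-in for hue rotation; all names are illustrative:

```python
import torch
import torch.nn as nn

class CyclicHueLiftConv(nn.Module):
    """Applies the same filters to the three cyclic RGB orderings (a discrete 'hue shift')."""
    def __init__(self, out_channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(3, out_channels, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                          # x: (B, 3, H, W)
        shifts = [x.roll(k, dims=1) for k in range(3)]             # R/G/B cyclic permutations
        return torch.stack([self.conv(s) for s in shifts], dim=1)  # (B, 3, C_out, H, W)

layer = CyclicHueLiftConv(out_channels=16)
img = torch.randn(2, 3, 32, 32)
out = layer(img)
# Equivariance check: permuting input channels only permutes the group axis of the output.
print(torch.allclose(layer(img.roll(1, dims=1)), out.roll(-1, dims=1)))  # True
```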

Semi- and Weakly-Supervised Domain Generalization for Object Detection

  • paper_url: http://arxiv.org/abs/2310.19351
  • repo_url: None
  • paper_authors: Ryosuke Furuta, Yoichi Sato
  • for: Addresses the performance drop of object detectors caused by domain gaps by proposing two new problem settings: semi-supervised domain generalizable object detection (SS-DGOD) and weakly-supervised domain generalizable object detection (WS-DGOD)
  • methods: A student-teacher learning framework in which a student network is trained with pseudo labels produced by a teacher network on unlabeled or weakly-labeled data
  • results: Detectors trained under the proposed settings significantly outperform baseline detectors trained on one labeled domain and perform comparably to or better than those trained under unsupervised domain adaptation (UDA) settings, while not using target-domain data for training
    Abstract Object detectors do not work well when domains largely differ between training and testing data. To solve this problem, domain generalization approaches, which require training data with ground-truth labels from multiple domains, have been proposed. However, it is time-consuming and labor-intensive to collect those data for object detection because not only class labels but also bounding boxes must be annotated. To overcome the problem of domain gap in object detection without requiring expensive annotations, we propose to consider two new problem settings: semi-supervised domain generalizable object detection (SS-DGOD) and weakly-supervised DGOD (WS-DGOD). In contrast to the conventional domain generalization for object detection that requires labeled data from multiple domains, SS-DGOD and WS-DGOD require labeled data only from one domain and unlabeled or weakly-labeled data from multiple domains for training. We show that object detectors can be effectively trained on the proposed settings with the same student-teacher learning framework, where a student network is trained with pseudo labels output from a teacher on the unlabeled or weakly-labeled data. The experimental results demonstrate that the object detectors trained on the proposed settings significantly outperform baseline detectors trained on one labeled domain data and perform comparably to or better than those trained on unsupervised domain adaptation (UDA) settings, while ours do not use target domain data for training in contrast to UDA.
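The student-teacher framework trains the student on teacher-generated pseudo labels. A generic mean-teacher sketch with confidence-filtered pseudo labels, using a tiny classification model as a stand-in for the detector (the EMA momentum, threshold, and model below are assumptions):

```python
from copy import deepcopy
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum: float = 0.999):
    """Teacher weights follow an exponential moving average of the student."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

def pseudo_label_step(student, teacher, optimizer, batch, threshold: float = 0.8):
    with torch.no_grad():
        conf, pseudo = F.softmax(teacher(batch), dim=1).max(dim=1)
        keep = conf > threshold                   # keep only confident teacher predictions
    if keep.any():
        loss = F.cross_entropy(student(batch[keep]), pseudo[keep])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    ema_update(teacher, student)

student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
teacher = deepcopy(student).requires_grad_(False)
opt = torch.optim.SGD(student.parameters(), lr=0.01)
pseudo_label_step(student, teacher, opt, torch.randn(8, 3, 32, 32))
```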

Label-Only Model Inversion Attacks via Knowledge Transfer

  • paper_url: http://arxiv.org/abs/2310.19342
  • repo_url: None
  • paper_authors: Ngoc-Bao Nguyen, Keshigeyan Chandrasegaran, Milad Abdollahzadeh, Ngai-Man Cheung
  • for: Addresses the privacy threat of model inversion (MI) attacks in the label-only setup, where the adversary only has access to the model's predicted labels
  • methods: The proposed approach, LOKT, transfers knowledge from the opaque target model to surrogate models, enabling the use of advanced white-box attacks; the key technique is a novel model, Target model-assisted ACGAN (T-ACGAN), which facilitates effective knowledge transfer
  • results: Significantly outperforms existing state-of-the-art label-only MI attacks by more than 15% across all MI benchmarks and compares favorably in terms of query budget
    Abstract In a model inversion (MI) attack, an adversary abuses access to a machine learning (ML) model to infer and reconstruct private training data. Remarkable progress has been made in the white-box and black-box setups, where the adversary has access to the complete model or the model's soft output respectively. However, there is very limited study in the most challenging but practically important setup: Label-only MI attacks, where the adversary only has access to the model's predicted label (hard label) without confidence scores nor any other model information. In this work, we propose LOKT, a novel approach for label-only MI attacks. Our idea is based on transfer of knowledge from the opaque target model to surrogate models. Subsequently, using these surrogate models, our approach can harness advanced white-box attacks. We propose knowledge transfer based on generative modelling, and introduce a new model, Target model-assisted ACGAN (T-ACGAN), for effective knowledge transfer. Our method casts the challenging label-only MI into the more tractable white-box setup. We provide analysis to support that surrogate models based on our approach serve as effective proxies for the target model for MI. Our experiments show that our method significantly outperforms existing SOTA Label-only MI attack by more than 15% across all MI benchmarks. Furthermore, our method compares favorably in terms of query budget. Our study highlights rising privacy threats for ML models even when minimal information (i.e., hard labels) is exposed. Our code, demo, models and reconstructed data are available at our project page: https://ngoc-nguyen-0.github.io/lokt/

On Measuring Fairness in Generative Models

  • paper_url: http://arxiv.org/abs/2310.19297
  • repo_url: None
  • paper_authors: Christopher T. H. Teo, Milad Abdollahzadeh, Ngai-Man Cheung
  • for: To conduct an in-depth study of fairness measurement, a critical component in gauging progress on fair generative models
  • methods: Makes three contributions: a study revealing that the existing fairness measurement framework has considerable measurement errors even with highly accurate sensitive attribute (SA) classifiers; CLassifier Error-Aware Measurement (CLEAM), a new framework that uses a statistical model to account for SA classifier inaccuracies (e.g., reducing error from 4.98% to 0.62% for gender on StyleGAN2) with minimal overhead; and an application of CLEAM to measure fairness in important text-to-image generators and GANs
  • results: Experiments show that the existing evaluation framework is error-prone, casting doubt on previously reported fairness improvements, that CLEAM substantially reduces these errors, and that the evaluated text-to-image generators and GANs exhibit considerable biases that raise concerns about their applications
    Abstract Recently, there has been increased interest in fair generative models. In this work, we conduct, for the first time, an in-depth study on fairness measurement, a critical component in gauging progress on fair generative models. We make three contributions. First, we conduct a study that reveals that the existing fairness measurement framework has considerable measurement errors, even when highly accurate sensitive attribute (SA) classifiers are used. These findings cast doubts on previously reported fairness improvements. Second, to address this issue, we propose CLassifier Error-Aware Measurement (CLEAM), a new framework which uses a statistical model to account for inaccuracies in SA classifiers. Our proposed CLEAM reduces measurement errors significantly, e.g., 4.98% $\rightarrow$ 0.62% for StyleGAN2 w.r.t. Gender. Additionally, CLEAM achieves this with minimal additional overhead. Third, we utilize CLEAM to measure fairness in important text-to-image generator and GANs, revealing considerable biases in these models that raise concerns about their applications. Code and more resources: https://sutd-visual-computing-group.github.io/CLEAM/.
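CLEAM's statistical model is not reproduced here, but the core idea, correcting a measured attribute proportion for the sensitive-attribute classifier's error rates, can be illustrated with a simple binary mixture inversion (the formula below is a toy stand-in under that assumption, not the paper's estimator):

```python
def corrected_proportion(p_hat: float, tpr: float, fpr: float) -> float:
    """Invert E[p_hat] = p * tpr + (1 - p) * fpr to recover the true proportion p.

    p_hat: fraction of generated samples the sensitive-attribute classifier flags positive.
    tpr, fpr: the classifier's true/false positive rates from a labeled validation set.
    """
    p = (p_hat - fpr) / (tpr - fpr)
    return min(max(p, 0.0), 1.0)                 # clip to a valid proportion

# Toy example: a 96% TPR / 5% FPR classifier reports 52% positives on generated images.
print(round(corrected_proportion(0.52, tpr=0.96, fpr=0.05), 3))   # -> 0.516
```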

FetusMapV2: Enhanced Fetal Pose Estimation in 3D Ultrasound

  • paper_url: http://arxiv.org/abs/2310.19293
  • repo_url: None
  • paper_authors: Chaoyu Chen, Xin Yang, Yuhao Huang, Wenlong Shi, Yan Cao, Mingyuan Luo, Xindi Hu, Lei Zhue, Lequan Yu, Kejuan Yue, Yuanji Zhang, Yi Xiong, Dong Ni, Weijun Huang
  • for: To provide accurate 3D fetal pose information for applications such as biometric measurements, plane localization, and fetal movement monitoring
  • methods: Proposes a novel 3D fetal pose estimation framework (FetusMapV2) that addresses limited GPU memory, poor image quality, symmetrical or ambiguous anatomical structures, and large variations in fetal pose
  • results: On a large-scale fetal ultrasound dataset, the method shows superior accuracy and robustness compared with other strong competitors
    Abstract Fetal pose estimation in 3D ultrasound (US) involves identifying a set of associated fetal anatomical landmarks. Its primary objective is to provide comprehensive information about the fetus through landmark connections, thus benefiting various critical applications, such as biometric measurements, plane localization, and fetal movement monitoring. However, accurately estimating the 3D fetal pose in US volume has several challenges, including poor image quality, limited GPU memory for tackling high dimensional data, symmetrical or ambiguous anatomical structures, and considerable variations in fetal poses. In this study, we propose a novel 3D fetal pose estimation framework (called FetusMapV2) to overcome the above challenges. Our contribution is three-fold. First, we propose a heuristic scheme that explores the complementary network structure-unconstrained and activation-unreserved GPU memory management approaches, which can enlarge the input image resolution for better results under limited GPU memory. Second, we design a novel Pair Loss to mitigate confusion caused by symmetrical and similar anatomical structures. It separates the hidden classification task from the landmark localization task and thus progressively eases model learning. Last, we propose a shape priors-based self-supervised learning by selecting the relatively stable landmarks to refine the pose online. Extensive experiments and diverse applications on a large-scale fetal US dataset including 1000 volumes with 22 landmarks per volume demonstrate that our method outperforms other strong competitors.

EDiffSR: An Efficient Diffusion Probabilistic Model for Remote Sensing Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2310.19288
  • repo_url: https://github.com/xy-boy/ediffsr
  • paper_authors: Yi Xiao, Qiangqiang Yuan, Kui Jiang, Jiang He, Xianyu Jin, Liangpei Zhang
  • for: To propose an efficient diffusion-based model for remote sensing image super-resolution (SR)
  • methods: Builds on the diffusion probabilistic model (DPM) and develops an Efficient Activation Network (EANet) for noise prediction together with a practical Conditional Prior Enhancement Module (CPEM) to improve SR performance
  • results: Extensive experiments on four remote sensing datasets show that EDiffSR delivers high-quality SR results while being computationally efficient and easy to train
    Abstract Recently, convolutional networks have achieved remarkable development in remote sensing image Super-Resoltuion (SR) by minimizing the regression objectives, e.g., MSE loss. However, despite achieving impressive performance, these methods often suffer from poor visual quality with over-smooth issues. Generative adversarial networks have the potential to infer intricate details, but they are easy to collapse, resulting in undesirable artifacts. To mitigate these issues, in this paper, we first introduce Diffusion Probabilistic Model (DPM) for efficient remote sensing image SR, dubbed EDiffSR. EDiffSR is easy to train and maintains the merits of DPM in generating perceptual-pleasant images. Specifically, different from previous works using heavy UNet for noise prediction, we develop an Efficient Activation Network (EANet) to achieve favorable noise prediction performance by simplified channel attention and simple gate operation, which dramatically reduces the computational budget. Moreover, to introduce more valuable prior knowledge into the proposed EDiffSR, a practical Conditional Prior Enhancement Module (CPEM) is developed to help extract an enriched condition. Unlike most DPM-based SR models that directly generate conditions by amplifying LR images, the proposed CPEM helps to retain more informative cues for accurate SR. Extensive experiments on four remote sensing datasets demonstrate that EDiffSR can restore visual-pleasant images on simulated and real-world remote sensing images, both quantitatively and qualitatively. The code of EDiffSR will be available at https://github.com/XY-boy/EDiffSR
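The abstract attributes EANet's efficiency to a simplified channel attention and a simple gate. The sketch below shows commonly used forms of those two operations (as popularized in NAFNet-style restoration blocks); the paper's exact block layout may differ:

```python
import torch
import torch.nn as nn

class SimpleGate(nn.Module):
    """Parameter-free gate: split channels in half and multiply the halves."""
    def forward(self, x):
        a, b = x.chunk(2, dim=1)
        return a * b

class SimplifiedChannelAttention(nn.Module):
    """Global average pooling followed by a single 1x1 conv used as channel weights."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return x * self.proj(self.pool(x))

block = nn.Sequential(
    nn.Conv2d(32, 64, 3, padding=1),
    SimpleGate(),                          # 64 -> 32 channels
    SimplifiedChannelAttention(32),
)
print(block(torch.randn(1, 32, 48, 48)).shape)   # torch.Size([1, 32, 48, 48])
```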

Improving Online Source-free Domain Adaptation for Object Detection by Unsupervised Data Acquisition

  • paper_url: http://arxiv.org/abs/2310.19258
  • repo_url: None
  • paper_authors: Xiangyu Shi, Yanyuan Qiao, Qi Wu, Lingqiao Liu, Feras Dayoub
  • for: To improve adaptive object detection in mobile robots, which is challenging when they are deployed in diverse and unfamiliar environments
  • methods: Uses online source-free domain adaptation (O-SFDA) for real-time model adaptation from a stream of unlabeled target-domain data, prioritizing the most informative unlabeled samples for inclusion in online training
  • results: On a real-world dataset, the method outperforms existing state-of-the-art O-SFDA techniques, showing that unsupervised data acquisition improves adaptive object detection in mobile robots
    Abstract Effective object detection in mobile robots is challenged by deployment in diverse and unfamiliar environments. Online Source-Free Domain Adaptation (O-SFDA) offers real-time model adaptation using a stream of unlabeled data from a target domain. However, not all captured frames in mobile robotics contain information that is beneficial for adaptation, particularly when there is a strong domain shift. This paper introduces a novel approach to enhance O-SFDA for adaptive object detection in mobile robots via unsupervised data acquisition. Our methodology prioritizes the most informative unlabeled samples for inclusion in the online training process. Empirical evaluation on a real-world dataset reveals that our method outperforms existing state-of-the-art O-SFDA techniques, demonstrating the viability of unsupervised data acquisition for improving adaptive object detection in mobile robots.

A High-Resolution Dataset for Instance Detection with Multi-View Instance Capture

  • paper_url: http://arxiv.org/abs/2310.19257
  • repo_url: None
  • paper_authors: Qianqian Shen, Yunhan Zhao, Nahyun Kwon, Jeeeun Kim, Yanan Li, Shu Kong
  • for: To introduce a new instance detection (InsDet) dataset and protocol that advances research on instance detection
  • methods: Uses multi-view instance captures together with diverse scene images, and synthesizes training images by pasting instance images onto scenes with free box annotations
  • results: Finds that combining the off-the-shelf class-agnostic Segment Anything Model (SAM) with the self-supervised feature representation DINOv2 performs best, achieving more than 10 AP better than end-to-end trained instance detection models that repurpose object detectors such as FasterRCNN and RetinaNet
    Abstract Instance detection (InsDet) is a long-lasting problem in robotics and computer vision, aiming to detect object instances (predefined by some visual examples) in a cluttered scene. Despite its practical significance, its advancement is overshadowed by Object Detection, which aims to detect objects belonging to some predefined classes. One major reason is that current InsDet datasets are too small in scale by today's standards. For example, the popular InsDet dataset GMU (published in 2016) has only 23 instances, far less than COCO (80 classes), a well-known object detection dataset published in 2014. We are motivated to introduce a new InsDet dataset and protocol. First, we define a realistic setup for InsDet: training data consists of multi-view instance captures, along with diverse scene images allowing synthesizing training images by pasting instance images on them with free box annotations. Second, we release a real-world database, which contains multi-view capture of 100 object instances, and high-resolution (6k x 8k) testing images. Third, we extensively study baseline methods for InsDet on our dataset, analyze their performance and suggest future work. Somewhat surprisingly, using the off-the-shelf class-agnostic segmentation model (Segment Anything Model, SAM) and the self-supervised feature representation DINOv2 performs the best, achieving >10 AP better than end-to-end trained InsDet models that repurpose object detectors (e.g., FasterRCNN and RetinaNet).

There Are No Data Like More Data - Datasets for Deep Learning in Earth Observation

  • paper_url: http://arxiv.org/abs/2310.19231
  • repo_url: None
  • paper_authors: Michael Schmitt, Seyed Ali Ahmadi, Yonghao Xu, Gulsen Taskin, Ujjwal Verma, Francescopaolo Sica, Ronny Hansch
  • for: To emphasize the importance of Earth observation datasets and put machine learning datasets dedicated to Earth observation data and applications into the spotlight
  • methods: Reviews the historical development of machine learning datasets for Earth observation, describes currently available resources, and forms a perspective on future developments
  • results: Aims to foster the understanding that the nature of Earth observation data is what distinguishes the community, and that a detailed understanding of EO data peculiarities is a core competency of the discipline
    Abstract Carefully curated and annotated datasets are the foundation of machine learning, with particularly data-hungry deep neural networks forming the core of what is often called Artificial Intelligence (AI). Due to the massive success of deep learning applied to Earth Observation (EO) problems, the focus of the community has been largely on the development of ever-more sophisticated deep neural network architectures and training strategies largely ignoring the overall importance of datasets. For that purpose, numerous task-specific datasets have been created that were largely ignored by previously published review articles on AI for Earth observation. With this article, we want to change the perspective and put machine learning datasets dedicated to Earth observation data and applications into the spotlight. Based on a review of the historical developments, currently available resources are described and a perspective for future developments is formed. We hope to contribute to an understanding that the nature of our data is what distinguishes the Earth observation community from many other communities that apply deep learning techniques to image data, and that a detailed understanding of EO data peculiarities is among the core competencies of our discipline.

CHAMMI: A benchmark for channel-adaptive models in microscopy imaging

  • paper_url: http://arxiv.org/abs/2310.19224
  • repo_url: https://github.com/chaudatascience/channel_adaptive_models
  • paper_authors: Zitong Chen, Chau Pham, Siqi Wang, Michael Doron, Nikita Moshkov, Bryan A. Plummer, Juan C. Caicedo
  • for: To study channel-adaptive neural network models that can handle single-cell microscopy images with varying numbers of channels
  • methods: Presents a benchmark consisting of a dataset of varied-channel single-cell images and a biologically relevant evaluation framework, and adapts several existing techniques into channel-adaptive models that are compared against fixed-channel baselines
  • results: Channel-adaptive models generalize better to out-of-domain tasks and can be computationally efficient; a curated dataset and an evaluation API are released to facilitate objective comparisons in future research and applications
    Abstract Most neural networks assume that input images have a fixed number of channels (three for RGB images). However, there are many settings where the number of channels may vary, such as microscopy images where the number of channels changes depending on instruments and experimental goals. Yet, there has not been a systemic attempt to create and evaluate neural networks that are invariant to the number and type of channels. As a result, trained models remain specific to individual studies and are hardly reusable for other microscopy settings. In this paper, we present a benchmark for investigating channel-adaptive models in microscopy imaging, which consists of 1) a dataset of varied-channel single-cell images, and 2) a biologically relevant evaluation framework. In addition, we adapted several existing techniques to create channel-adaptive models and compared their performance on this benchmark to fixed-channel, baseline models. We find that channel-adaptive models can generalize better to out-of-domain tasks and can be computationally efficient. We contribute a curated dataset (https://doi.org/10.5281/zenodo.7988357) and an evaluation API (https://github.com/broadinstitute/MorphEm.git) to facilitate objective comparisons in future research and applications.
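One simple way to make a backbone channel-adaptive, in the spirit of this benchmark, is to share single-channel filters across all input channels and aggregate their responses. The stem below is a hedged sketch of that strategy, not necessarily one of the benchmark's baseline adaptations:

```python
import torch
import torch.nn as nn

class ChannelAgnosticStem(nn.Module):
    """First layer that accepts any number of input channels by sharing 1-channel filters."""
    def __init__(self, out_channels: int = 64, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(1, out_channels, kernel_size, stride=2,
                              padding=kernel_size // 2)

    def forward(self, x):                                   # x: (B, C_in, H, W), any C_in
        b, c, h, w = x.shape
        y = self.conv(x.reshape(b * c, 1, h, w))            # per-channel responses
        return y.reshape(b, c, *y.shape[1:]).mean(dim=1)    # aggregate over channels

stem = ChannelAgnosticStem()
print(stem(torch.randn(2, 3, 224, 224)).shape)   # torch.Size([2, 64, 112, 112])
print(stem(torch.randn(2, 5, 224, 224)).shape)   # the same weights handle 5-channel images
```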

Modular Anti-noise Deep Learning Network for Robotic Grasp Detection Based on RGB Images

  • paper_url: http://arxiv.org/abs/2310.19223
  • repo_url: None
  • paper_authors: Zhaocong Li
  • for: To propose a grasp-pose detection method based on a single RGB image that improves grasping accuracy in practical applications
  • methods: A modular learning network that combines grasp detection with semantic segmentation, tailored to robots equipped with parallel-plate grippers; the network identifies graspable objects and fuses prior grasp analyses with semantic segmentation to boost grasp detection precision
  • results: Experiments and evaluations show that the proposed method detects grasp poses accurately and remains robust to blurred and noisy visuals, enabling reliable grasp detection in practice
    Abstract While traditional methods relies on depth sensors, the current trend leans towards utilizing cost-effective RGB images, despite their absence of depth cues. This paper introduces an interesting approach to detect grasping pose from a single RGB image. To this end, we propose a modular learning network augmented with grasp detection and semantic segmentation, tailored for robots equipped with parallel-plate grippers. Our network not only identifies graspable objects but also fuses prior grasp analyses with semantic segmentation, thereby boosting grasp detection precision. Significantly, our design exhibits resilience, adeptly handling blurred and noisy visuals. Key contributions encompass a trainable network for grasp detection from RGB images, a modular design facilitating feasible grasp implementation, and an architecture robust against common image distortions. We demonstrate the feasibility and accuracy of our proposed approach through practical experiments and evaluations.

Generalized Category Discovery with Clustering Assignment Consistency

  • paper_url: http://arxiv.org/abs/2310.19210
  • repo_url: None
  • paper_authors: Xiangli Yang, Xinglin Pan, Irwin King, Zenglin Xu
  • for: Addresses generalized category discovery (GCD), an open-world task whose goal is to automatically cluster unlabeled samples, which include both known and novel classes, by transferring information from a labeled dataset
  • methods: A co-training-based framework that encourages clustering consistency: weak and strong augmentations generate two sufficiently different views of each sample, a consistency representation learning strategy aligns feature-prototype similarity with clustering assignment, and the learned discriminative embeddings are used to build a sparse network on which community detection yields both the clusters and the number of categories
  • results: Achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets; on ImageNet-100, the method exceeds the best baseline by 15.5% and 7.0% on the Novel and All classes, respectively
    Abstract Generalized category discovery (GCD) is a recently proposed open-world task. Given a set of images consisting of labeled and unlabeled instances, the goal of GCD is to automatically cluster the unlabeled samples using information transferred from the labeled dataset. The unlabeled dataset comprises both known and novel classes. The main challenge is that unlabeled novel class samples and unlabeled known class samples are mixed together in the unlabeled dataset. To address the GCD without knowing the class number of unlabeled dataset, we propose a co-training-based framework that encourages clustering consistency. Specifically, we first introduce weak and strong augmentation transformations to generate two sufficiently different views for the same sample. Then, based on the co-training assumption, we propose a consistency representation learning strategy, which encourages consistency between feature-prototype similarity and clustering assignment. Finally, we use the discriminative embeddings learned from the semi-supervised representation learning process to construct an original sparse network and use a community detection method to obtain the clustering results and the number of categories simultaneously. Extensive experiments show that our method achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets. Especially in the ImageNet-100 data set, our method significantly exceeds the best baseline by 15.5\% and 7.0\% on the \texttt{Novel} and \texttt{All} classes, respectively.
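The clustering-assignment consistency between weak and strong views can be illustrated with a small feature-prototype loss. The detach and temperature choices and all names below are assumptions; the full method additionally builds a sparse graph and runs community detection to estimate the number of classes:

```python
import torch
import torch.nn.functional as F

def clustering_consistency_loss(z_weak, z_strong, prototypes, temperature: float = 0.1):
    """Cross-entropy between the strong view's assignment and the weak view's (detached) one.

    z_weak, z_strong: L2-normalized embeddings of two augmented views, shape (B, D).
    prototypes: L2-normalized cluster prototypes, shape (K, D).
    """
    p_weak = F.softmax(z_weak @ prototypes.t() / temperature, dim=1).detach()
    logp_strong = F.log_softmax(z_strong @ prototypes.t() / temperature, dim=1)
    return -(p_weak * logp_strong).sum(dim=1).mean()

B, D, K = 16, 128, 10
z_w = F.normalize(torch.randn(B, D), dim=1)
z_s = F.normalize(torch.randn(B, D), dim=1)
protos = F.normalize(torch.randn(K, D), dim=1)
print(clustering_consistency_loss(z_w, z_s, protos))
```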